Chapter 4: Dueling Bandits


In the dueling bandit setting we compare two bandits \(b_i\) and \(b_j\) and observe only which one wins. The comparison is governed by the model \[ P(b_i \succ b_j) = \frac{c_{ij} + 1}{2}. \] Let us denote \(\epsilon(b_i, b_j) = c_{ij} / 2\). We can then rewrite the Bradley-Terry model as follows:
\[ P(b_i \succ b_j) = \epsilon(b_i, b_j) + \frac{1}{2} \]
By this model, we have the following properties: \[ \epsilon(b_i, b_i) = 0, \qquad \epsilon(b_i, b_j) = -\epsilon(b_j, b_i),\text{ and} \] \[ b_i \succ b_j \text{ if and only if } \epsilon(b_i, b_j) > 0. \] We also assume there is a total order over the bandits, i.e., there exists a permutation \(\varphi\) such that \(b_{\varphi(1)} \succ b_{\varphi(2)} \succ \dots \succ b_{\varphi(K)}\). Equivalently, \(b_i \succ b_j\) if and only if \(\varphi^{-1}(i) < \varphi^{-1}(j)\). Thus, the best bandit is \(b_{\varphi(1)}\).
To quantify the quality of the decision at each turn \(t\), we construct a regret measure. Since the best bandit \(b_{\varphi(1)}\) wins any pairwise comparison with probability at least \(0.5\), the instantaneous regret \[ r_t = P(b_{\varphi(1)} \succ b_{1,t}) + P(b_{\varphi(1)} \succ b_{2,t}) - 1 \] is nonnegative, where \(b_{1,t}\) and \(b_{2,t}\) are the two bandits compared at turn \(t\).
Then, the total cumulative regret is defined as: \[ \begin{aligned} R_T = \sum_{t=1}^{T} r_t &= \sum_{t=1}^{T} \left[ P(b_{\varphi(1)} \succ b_{1,t}) + P(b_{\varphi(1)} \succ b_{2,t}) - 1 \right]\\ &= \sum_{t=1}^{T} \left[\left(\epsilon(b_{\varphi(1)}, b_{1,t}) + \frac{1}{2}\right) + \left(\epsilon(b_{\varphi(1)}, b_{2,t}) + \frac{1}{2}\right) - 1\right]\\ &= \sum_{t=1}^{T} \left[\epsilon(b_{\varphi(1)}, b_{1,t}) + \epsilon(b_{\varphi(1)}, b_{2,t})\right] \end{aligned} \] where \(R_T\) is the total cumulative regret after \(T\) turns.
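To make the regret definition concrete, here is a minimal sketch (all numbers and names are illustrative) that accumulates the dueling regret of a sequence of comparisons, assuming the preference probabilities \(P(b_i \succ b_j)\) are known:

```python
import numpy as np

# Minimal sketch: cumulative dueling-bandit regret, assuming the full matrix
# of preference probabilities P[i, j] = P(b_i > b_j) is known (illustrative only).
P = np.array([[0.5, 0.7, 0.9],
              [0.3, 0.5, 0.6],
              [0.1, 0.4, 0.5]])   # arm 0 is the best arm in this toy example

best = 0  # index of b_{phi(1)}

def instantaneous_regret(i, j):
    """r_t = P(b* > b_i) + P(b* > b_j) - 1, i.e. eps(b*, b_i) + eps(b*, b_j)."""
    return P[best, i] + P[best, j] - 1.0

# Regret of a fixed sequence of duels (b_{1,t}, b_{2,t}).
duels = [(1, 2), (0, 2), (0, 1), (0, 0)]
R_T = sum(instantaneous_regret(i, j) for i, j in duels)
print(R_T)  # comparing the best arm with itself contributes zero regret
```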
By construction, the preference gap is bounded: \[ \epsilon(b_i, b_j \mid x_i, x_j) \in \left[-\frac{1}{2}, \frac{1}{2}\right] \]
Traditional acquisition functions for the dueling bandit problem include:
- Interleaved Filter (Yue et al., 2012)
- Thompson Sampling (Agrawal & Goyal, 2013)
- Dueling Bandit Gradient Descent (DBGD) (Dudík et al., 2015)
Dueling Posterior Sampling (DPS; Novoseller et al., 2020) extends posterior sampling to preference-based reinforcement learning: it maintains Bayesian posteriors over the environment dynamics and the user's utility function, samples two policies \(\pi_{i1}, \pi_{i2}\) at each iteration \(i\), rolls them out to obtain trajectories \(\tau_{i1}, \tau_{i2}\), and elicits a preference \(y_i\) between the two trajectories. The DPS algorithm is summarized as follows:

where \(\mathbb{I}_{[\tau_{i2} \succ \tau_{i1}]}\) is the indicator function that returns \(1\) if \(\tau_{i2} \succ \tau_{i1}\) and \(0\) otherwise, so that \(\mathbb{E}\big[\mathbb{I}_{[\tau_{i2} \succ \tau_{i1}]}\big] = P(\tau_{i2} \succ \tau_{i1})\).
ADVANTAGE: Sample policy from dynamics and utility models
Input: \(f_p\): state transition posterior, \(f_r\): utility posterior
1. Sample \(\tilde{p} \sim f_p(\cdot)\)
2. Sample \(\tilde{r} \sim f_r(\cdot)\)
3. Solve \(\pi^* = \arg \sup_{\pi} \sum_{s\in\mathcal{S}} p_0(s)V_{\pi, 1}(s \mid \tilde{p}, \tilde{r})\)
4. Return \(\pi^*\)
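To make this concrete, here is a minimal Python sketch for the tabular setting, assuming a Dirichlet posterior over transitions and a Gaussian posterior over a tabular reward vector; the helper names and the finite-horizon value iteration below are illustrative, not the authors' implementation.

```python
import numpy as np

# Minimal sketch of the ADVANTAGE subroutine in a tabular MDP, assuming a
# Dirichlet posterior over transitions and a Gaussian posterior over the
# S*A reward entries.  Names and modelling choices are illustrative.

def advantage(dirichlet_alpha, r_mean, r_cov, horizon, rng):
    """dirichlet_alpha: (S, A, S) pseudo-counts; (r_mean, r_cov): Gaussian
    posterior over the S*A rewards.  Returns a policy of shape (horizon, S)."""
    S, A, _ = dirichlet_alpha.shape
    # 1.-2. Sample dynamics p~ and utilities r~ from their posteriors.
    p = np.array([[rng.dirichlet(dirichlet_alpha[s, a]) for a in range(A)]
                  for s in range(S)])                       # (S, A, S)
    r = rng.multivariate_normal(r_mean, r_cov).reshape(S, A)
    # 3. Solve for the optimal policy of the sampled MDP by finite-horizon
    #    backward value iteration (optimal from every state, hence for any p_0).
    V = np.zeros(S)
    policy = np.zeros((horizon, S), dtype=int)
    for h in reversed(range(horizon)):
        Q = r + p @ V                                       # (S, A)
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    # 4. Return the sampled-MDP optimal policy.
    return policy
```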
FEEDBACK: Update dynamics and utility models based on new user feedback
Input: \(\mathcal{H} = \{ \tau_{i1}, \tau_{i2}, y_i \}\), \(f_p\): state transition posterior, \(f_r\): utility posterior
1. Apply a Bayesian update to \(f_p\) using \(\mathcal{H}\)
2. Apply a Bayesian update to \(f_r\) using \(\mathcal{H}\)
3. Return \(f_p\), \(f_r\)
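A matching sketch of the feedback step, under the Bayesian linear regression credit-assignment model used later in the analysis: the dynamics pseudo-counts are incremented with the observed transitions, and the utility posterior is updated from the trajectory feature difference \(x_{i2} - x_{i1}\) and the label \(y_i\). The conjugate Gaussian update shown here is one standard choice, not necessarily the paper's exact implementation.

```python
import numpy as np

# Minimal sketch of the FEEDBACK subroutine, assuming (i) a Dirichlet posterior
# over transitions, updated from the observed (s, a, s') triples, and (ii) a
# Gaussian (Bayesian linear regression) posterior over the utility parameters,
# updated from the trajectory feature difference x2 - x1 and the preference
# label y.  All names and modelling choices here are illustrative.

def feedback(history, dirichlet_alpha, r_mean, r_cov, noise_var=1.0):
    """history: iterable of (tau1, tau2, x1, x2, y); tau* are lists of
    (s, a, s') transitions, x* are trajectory feature vectors, y is 0 or 1."""
    for tau1, tau2, x1, x2, y in history:
        # Dynamics posterior: add the observed transition counts.
        for s, a, s_next in tau1 + tau2:
            dirichlet_alpha[s, a, s_next] += 1
        # Utility posterior: one conjugate Bayesian linear regression step on
        # the feature difference, with the centered label y - 1/2 as target.
        x = x2 - x1
        prec_old = np.linalg.inv(r_cov)
        prec_new = prec_old + np.outer(x, x) / noise_var
        r_cov = np.linalg.inv(prec_new)
        r_mean = r_cov @ (prec_old @ r_mean + x * (y - 0.5) / noise_var)
    return dirichlet_alpha, r_mean, r_cov
```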
In DPS, updating the dynamics posterior is straightforward. We assume the transitions are fully observed and model each \(P(\cdot \mid s_t, a_t)\) with a Dirichlet distribution. The likelihood of the observed transitions is then multinomial, and the posterior update is: \[ f_p(\cdot \mid s_t, a_t) = \text{Dirichlet}\big(\alpha + \text{count}(\cdot \mid s_t, a_t)\big) \]
This update can be interpreted as adding the observed next-state counts to the prior pseudo-counts.
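As a concrete (made-up) numerical illustration for a single state-action pair:

```python
import numpy as np

# Dirichlet-multinomial update for one state-action pair (s, a) with 3
# possible next states.  Numbers are made up for illustration.
alpha_prior = np.ones(3)                 # uniform Dirichlet(1, 1, 1) prior
counts = np.array([5, 2, 0])             # observed next-state counts from (s, a)
alpha_post = alpha_prior + counts        # Dirichlet(6, 3, 1) posterior

# Posterior mean transition probabilities P(s' | s, a):
print(alpha_post / alpha_post.sum())     # [0.6, 0.3, 0.1]
```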
We will now analyze the asymptotic Bayesian regret of DPS under a Bayesian linear regression model. The analysis proceeds in three steps:
1. Proving that DPS is asymptotically consistent (i.e., the probability of selecting the optimal policy converges to 1).
2. Bounding the one-sided Bayesian regret of \(\pi_{i2}\), i.e., the regret when DPS only selects \(\pi_{i2}\) while \(\pi_{i1}\) is sampled from a fixed distribution.
3. Allowing the distribution of \(\pi_{i1}\) to drift while converging, and bounding the Bayesian regret of \(\pi_{i2}\) in that case.
Proposition 1: The sampled dynamics converge in distribution to their true values as the number of DPS iterations increases. Let the posterior distribution of the dynamics for a state-action pair \((s_t, a_t)\) be \(P(s_{t+1} \mid s_t, a_t)\) and the true distribution be \(P^*(s_{t+1} \mid s_t, a_t)\). Then, for any \(\epsilon > 0\) and \(\delta > 0\), eventually \[ P\big(|P(s_{t+1} \mid s_t, a_t) - P^*(s_{t+1} \mid s_t, a_t)| > \epsilon\big) < \delta \]
Let \(N(s, a)\) represent the number of times the state-action pair \((s, a)\) has been observed. As \(N(s, a) \to \infty\), the posterior distribution concentrates around the true distribution \(P^*(s_{t+1} \mid s_t, a_t)\).
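A quick, illustrative simulation of this concentration for a single state-action pair:

```python
import numpy as np

# Illustrative simulation: the Dirichlet posterior for one state-action pair
# concentrates around the true transition distribution as N(s, a) grows.
rng = np.random.default_rng(0)
p_true = np.array([0.7, 0.2, 0.1])                   # true P(. | s, a)

for n in [10, 100, 10_000]:
    counts = rng.multinomial(n, p_true)              # N(s, a) = n observed transitions
    samples = rng.dirichlet(1 + counts, size=2000)   # draws from the posterior
    err = np.abs(samples - p_true).max(axis=1)
    print(n, round(err.mean(), 4))                   # average deviation shrinks with n
```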
The remaining problem is to prove that DPS will visit all state-action pairs infinitely often.
Lemma 3 (Novoseller et al., 2020): Under DPS, every state-action pair is visited infinitely often.
Proof sketch:
- Assume, for contradiction, that some state-action pair is visited only finitely many times.
- Once this state-action pair is no longer visited, the reward posterior is no longer updated with respect to it. DPS is then guaranteed to eventually sample a high enough reward for this state-action pair that the resulting policy prioritizes visiting it, contradicting the assumption.
Proposition 2: With probability \(1 - \delta\), where \(\delta\) is a parameter of the Bayesian linear regression model, the sampled rewards converge in distribution to the true reward parameters \(\bar{r}\) as the number of DPS iterations increases.
According to Theorem 2 of Abbasi-Yadkori et al. (2011), under certain regularity conditions we can bound the error between the estimated reward parameters \(\hat{r}_i\) and the true reward parameters \(r_i\) with high probability. With probability at least \(1 - \delta\): \[ \| \hat{r}_i - r_i \|_{\mathbf{M}_i} \leq \beta_i(\delta) \]
- \(\mathbf{M}_i\): design (covariance) matrix
- \(\beta_i(\delta)\): confidence bound (depends on \(\delta\))
- This defines an ellipsoid around the true reward parameters.
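The following illustrative sketch (with made-up data and a simple regularized least-squares estimate) computes the design matrix \(\mathbf{M}_i\) and the weighted norm \(\|\hat{r}_i - r_i\|_{\mathbf{M}_i}\) that the confidence bound controls:

```python
import numpy as np

# Illustrative check of the weighted-norm error controlled by the confidence
# ellipsoid: ||r_hat - r||_M = sqrt((r_hat - r)^T M (r_hat - r)).  The data,
# the ridge estimate, and all constants below are made up for illustration.
rng = np.random.default_rng(1)
d, n, lam, noise = 5, 200, 1.0, 0.1
r_true = rng.normal(size=d)

X = rng.normal(size=(n, d))                      # observed feature (difference) vectors
y = X @ r_true + noise * rng.normal(size=n)      # noisy utility observations

M = lam * np.eye(d) + X.T @ X                    # design (covariance) matrix M_i
r_hat = np.linalg.solve(M, X.T @ y)              # regularized least-squares estimate

err = r_hat - r_true
print(np.sqrt(err @ M @ err))                    # ||r_hat - r||_{M_i}; with high
                                                 # probability this stays below beta_i(delta)
```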
Theorem 1: With probability \(1 - \delta\), the sampled policies \(\pi_{i1}, \pi_{i2}\) converge in distribution to the optimal policy \(\pi^*\) as \(i \to \infty\).
The key quantity in the one-sided analysis is the information ratio: \[ \Gamma_i = \frac{ \mathbb{E}_i\big[ (y^*_i - y_i)^2 \big] }{ \mathbb{I}_i\big[\pi^*; (\pi_{i2}, \tau_{i1}, \tau_{i2}, x_{i2} - x_{i1}, y_i )\big] } \]
- Numerator: squared instantaneous one-sided regret of policy \(\pi_{i2}\) (exploitation).
- Denominator: information gained about the optimal policy \(\pi^*\) (exploration).
where \(y_i = \bar{r}(\tau_i) = \bar{r}^{\top}x_{\tau_i}\) is the expected utility of the trajectory \(\tau_i\) (and \(y^*_i\) is the analogous quantity under the optimal policy \(\pi^*\)).
When policy \(\pi_{i1}\) is drawn from a fixed distribution:
- Apply an information-theoretic regret analysis similar to Russo and Van Roy (2016).
- Lemma 12 (Novoseller et al., 2020): If \(\Gamma_i \leq \bar{\Gamma}\) for all iterations \(i\), then \[ \mathbb{E}[ \text{Reg}_2(T)] \leq \sqrt{ \bar{\Gamma}\, \mathbb{H}[\pi^*]\, N } \] where \(\mathbb{H}[\pi^*]\) is the entropy of the optimal policy \(\pi^*\) and \(N\) is the number of DPS iterations.
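For intuition, the bound is easy to evaluate numerically. Assuming a uniform prior over the \(A^S\) deterministic policies (so \(\mathbb{H}[\pi^*] = S \log A\)) and plugging in \(\bar{\Gamma} = SA/2\) from Lemma 17 (stated later in this chapter), the bound grows only as \(\sqrt{N}\); the sizes below are made up:

```python
import numpy as np

# Numerical illustration of the Lemma 12 bound
#   E[Reg_2(T)] <= sqrt(Gamma_bar * H[pi*] * N),
# with made-up sizes: S states, A actions, a uniform prior over the A^S
# deterministic policies, and Gamma_bar = S * A / 2 as in Lemma 17.
S, A = 10, 4
H_pi_star = S * np.log(A)          # entropy of a uniform prior over A^S policies
Gamma_bar = S * A / 2

for N in [100, 1_000, 10_000]:
    bound = np.sqrt(Gamma_bar * H_pi_star * N)
    print(N, round(bound, 1), round(bound / N, 3))   # the bound grows only like sqrt(N)
```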
Proof of Lemma 12: The expression we are working with (before applying Cauchy-Schwarz) is:
\[ \mathbb{E} \left[ \text{Regret}(T, \pi^{TS}) \right] \leq \mathbb{E} \left[ \sum_{t=1}^{T} \sqrt{\Gamma_t\, \mathbb{I}_t[A^*; (A_t, Y_{t,A_t})]} \right] \]
where \(\Gamma_t\) is the information ratio at time \(t\), \(\mathbb{I}_t[A^*; (A_t, Y_{t,A_t})]\) is the information gain at time \(t\), and the summation is taken over the entire time horizon \(T\).
The Cauchy-Schwarz inequality for sums states that for any sequences \(\{a_t\}\) and \(\{b_t\}\), we have: \[ \left( \sum_{t=1}^{T} a_t b_t \right)^2 \leq \left( \sum_{t=1}^{T} a_t^2 \right) \left( \sum_{t=1}^{T} b_t^2 \right) \] Taking square roots on both sides, we get: \[ \sum_{t=1}^{T} a_t b_t \leq \sqrt{ \left( \sum_{t=1}^{T} a_t^2 \right) \left( \sum_{t=1}^{T} b_t^2 \right)} \]
Now, applying this inequality to the expression for regret, we associate \(a_t\) with \(\sqrt{\Gamma_t}\) and \(b_t\) with \(\sqrt{\mathbb{I}_t[A^*; (A_t, Y_{t,A_t})]}\). Specifically, we write:
\[ \mathbb{E} \left[ \sum_{t=1}^{T} \sqrt{\Gamma_t\, \mathbb{I}_t[A^*; (A_t, Y_{t,A_t})]} \right] \]
as a sum of products of terms from two sequences: \[ \sum_{t=1}^{T} \sqrt{\Gamma_t} \cdot \sqrt{\mathbb{I}_t[A^*; (A_t, Y_{t,A_t})]}. \]
Applying Cauchy-Schwarz to this sum gives:
\[ \begin{aligned} \mathbb{E} & \left[ \sum_{t=1}^{T} \sqrt{\Gamma_t\, \mathbb{I}_t[A^*; (A_t, Y_{t,A_t})]} \right] \\ &\leq \sqrt{ \mathbb{E} \left[ \sum_{t=1}^{T} \Gamma_t \right] \cdot \mathbb{E} \left[ \sum_{t=1}^{T} \mathbb{I}_t[A^*; (A_t, Y_{t,A_t})] \right]}. \end{aligned} \]
Assuming \(\Gamma_t\) is bounded, say \(\Gamma_t \leq \bar{\Gamma}\) for all \(t\), the bound becomes:
\[ \begin{aligned} \mathbb{E} \left[ \text{Regret}(T, \pi^{TS}) \right] &\leq \sqrt{ T \cdot \bar{\Gamma} \cdot \mathbb{E} \left[ \sum_{t=1}^{T} \mathbb{I}_t[A^*; (A_t, Y_{t,A_t})] \right]} \end{aligned} \]
This step is where the bound on the regret is simplified using the total information gain across the horizon \(T\). The bound scales with \(\sqrt{T}\), which reflects the growth of regret with time, but it is also modulated by the total information gathered during the process.
Let \(Z_t = (A_t, Y_{t,A_t})\). We can write the expected information gain at step \(t\) as:
\[ \mathbb{E} \left[ \mathbb{I}_t \left[ A^* ; Z_t \right] \right] = \mathbb{I} \left[ A^* ; Z_t | Z_1, \dots, Z_{t-1} \right], \] and the total information gain across all \(T\) steps is:
\[ \begin{aligned} \mathbb{E} \sum_{t=1}^T \mathbb{I}_t \left[ A^* ; Z_t \right] &= \sum_{t=1}^T \mathbb{I} \left( A^* ; Z_t | Z_1, \dots, Z_{t-1} \right)\\ &\stackrel{(c)}{=} \mathbb{I} \left[ A^* ; \left( Z_1, \dots, Z_T \right) \right]\\ &= \mathbb{H} \left[ A^* \right] - \mathbb{H} \left[ A^* | Z_1, \dots, Z_T \right] \stackrel{(d)}{\leq} \mathbb{H} \left[ A^* \right] \end{aligned} \]
where \((c)\) follows from the chain rule for mutual information, and \((d)\) follows from the non-negativity of entropy.
Gathering all the pieces together, we have:
\[ \begin{aligned} \mathbb{E} \left[ \text{Regret}(T, \pi^{TS}) \right] &\leq \sqrt{ T \cdot \bar{\Gamma} \cdot \mathbb{E} \left[ \sum_{t=1}^{T} \mathbb{I}_t [A^*; (A_t, Y_{t,A_t})] \right]}\\ &\leq \sqrt{ T \cdot \bar{\Gamma} \cdot \mathbb{H}[A^*]} \end{aligned} \]
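As a quick, illustrative sanity check of this chain of inequalities, the following verifies on random instances that \(\sum_t \sqrt{\Gamma_t \mathbb{I}_t} \leq \sqrt{T \bar{\Gamma} \mathbb{H}[A^*]}\) whenever \(\Gamma_t \leq \bar{\Gamma}\) and the per-step information gains sum to at most \(\mathbb{H}[A^*]\) (all constants below are arbitrary):

```python
import numpy as np

# Sanity check of the final bound on random instances: whenever Gamma_t <= Gamma_bar
# and the per-step information gains I_t sum to at most H = H[A*], the per-step sum
# sum_t sqrt(Gamma_t * I_t) never exceeds sqrt(T * Gamma_bar * H).
rng = np.random.default_rng(2)
T, Gamma_bar, H = 50, 3.0, 2.0

for _ in range(1000):
    Gamma = rng.uniform(0, Gamma_bar, size=T)           # per-step information ratios
    I = rng.dirichlet(np.ones(T)) * rng.uniform(0, H)   # gains with sum <= H
    lhs = np.sqrt(Gamma * I).sum()
    rhs = np.sqrt(T * Gamma_bar * H)
    assert lhs <= rhs + 1e-9
print("bound holds on all sampled instances")
```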
Proof Outline
Recall Theorem 1, Theorem 2, and Lemma 17:
- Theorem 1: With probability \(1-\delta\), the sampled policies \(\pi_{i1}, \pi_{i2}\) converge in distribution to the optimal policy \(\pi^*\) as \(i \to \infty\).
- Theorem 2: If the policy \(\pi_{i1}\) is drawn from a fixed distribution, the one-sided Bayesian regret rate for \(\pi_{i2}\) is bounded by \(S \sqrt{ \frac{AT \log A}{2} }\).
- Lemma 17: If the sampling distribution of \(\pi_{i1}\) converges to a fixed distribution, then the information ratio \(\Gamma_i\) for \(\pi_{i2}\)'s one-sided regret is bounded by \(\frac{SA}{2}\).
Theorem 3: With probability \(1 - \delta\), the expected Bayesian regret \(\mathbb{E}[\text{Reg}(T)]\) of DPS achieves the asymptotic rate: \[ \begin{aligned} \mathbb{E}[\text{Reg}(T)] &= \mathbb{E} \left\{ \sum_{i=1}^{\left\lceil \frac{T}{2h} \right\rceil} \sum_{s \in \mathcal{S}} p_0(s) \left[ 2V_{\pi^*,1}(s) - V_{\pi_{i1},1}(s) - V_{\pi_{i2},1}(s) \right] \right\}\\ &= \mathbb{E}[\text{Reg}_1(T)] + \mathbb{E}[\text{Reg}_2(T)] \\ &\leq S \sqrt{ \frac{AT \log A}{2} } + S \sqrt{ \frac{AT \log A}{2} } = S \sqrt{2AT \log A} \end{aligned} \] where \(h\) is the episode horizon, so each DPS iteration consumes \(2h\) steps.
- We can now conclude that DPS achieves an asymptotic Bayesian regret rate of \(S \sqrt{2AT \log A}\), which is sublinear in \(T\).
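As a final illustration with made-up values of \(S\) and \(A\), the rate \(S\sqrt{2AT\log A}\) grows only as \(\sqrt{T}\), so the average per-step regret vanishes:

```python
import numpy as np

# Evaluate the Theorem 3 rate S * sqrt(2 * A * T * log A) for made-up S and A.
S, A = 10, 4
for T in [1_000, 100_000, 10_000_000]:
    rate = S * np.sqrt(2 * A * T * np.log(A))
    print(T, round(rate), round(rate / T, 4))   # rate / T -> 0: average regret vanishes
```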

Empirical performance of DPS: Each simulated environment is shown under the two least-noisy user preference models that were evaluated. The plots show DPS with three models: Gaussian process regression (GPR), Bayesian linear regression, and a Gaussian process preference model.

Plots display the mean \(\pm\) one standard deviation over 100 runs of each algorithm tested. Overall, we see that DPS performs well and is robust to the choice of credit-assignment model.
