4  Decisions

Intended Learning Outcomes

By the end of this chapter you will be able to:

  • Distinguish the measurement perspective (reducing uncertainty) from the reward maximization perspective (managing uncertainty) in preference learning.
  • Derive Thompson Sampling for linear objectives using Laplace approximation to handle intractable posteriors.
  • Extend Thompson Sampling to nonlinear objectives using Gaussian Processes and GP classification.
  • Define regret (strong, weak, and Bayesian) in dueling bandit settings and explain when each notion is appropriate.
  • Compare winner concepts—Condorcet, Borda, and Von Neumann winners—and identify when they may diverge.
  • Apply acquisition functions (Uniform, Epsilon-Greedy, UCB, Pure Exploration, qEUBO) for sequential decision-making with preference feedback.
  • Analyze regret bounds for Preferential Bayesian Optimization algorithms and understand their convergence guarantees.
  • Implement Thompson Sampling and dueling bandit algorithms in practical settings.
  • (Optional) Derive the policy gradient theorem and explain how REINFORCE and GRPO use it for preference-based optimization.

This chapter can be covered in four 50-minute lectures:

Lecture 1 (Sections 4.1–4.3): Thompson Sampling

  • Motivation: From measurement to decision-making (10 min)
  • Exploration-exploitation tradeoff (10 min)
  • Thompson Sampling with Laplace approximation (linear case) (15 min)
  • Code demo: Linear Thompson Sampling simulation (15 min)

Lecture 2 (Sections 4.4–4.5): GP Thompson Sampling and Dueling Bandits

  • Gaussian Process Thompson Sampling (nonlinear case) (20 min)
  • From multi-armed bandits to dueling bandits (15 min)
  • Regret definitions and winner concepts (15 min)

Lecture 3 (Section 4.5): Dueling Bandit Algorithms and Applications

  • Acquisition functions: Uniform, ε-Greedy, UCB (10 min)
  • Interleaved Filter algorithm (15 min)
  • DBGD, Sparring EXP4, Feel-good Thompson Sampling (10 min)
  • Case studies: LLM response selection and clinical trials (15 min)

Lecture 4 (Sections 4.6 and 4.8): Preferential Bayesian Optimization and CIRL

  • PBO problem formulation and Copeland scores (15 min)
  • Acquisition functions: qEUBO, Pure Exploration, POP-BO (15 min)
  • CIRL: From passive oracles to active human agents (15 min)
  • Discussion: Choice architecture and responsible deployment (5 min)

Lecture 5 (Optional) (Section 4.7): RL Foundations for Preference Optimization

  • MDP formulation and the LLM generation MDP (10 min)
  • Policy gradient theorem derivation (15 min)
  • REINFORCE, baselines, and variance reduction (10 min)
  • GRPO: From RLHF to group-relative optimization (15 min)

4.1 Chapter Overview

Chapter 1 developed models for how preferences arise from latent utilities. Chapter 2 showed how to estimate these utilities from observed data. Chapter 3 addressed how to collect data efficiently through active query selection. This chapter tackles the complementary question: given uncertainty about user preferences, how should we choose actions to maximize reward?

The shift is subtle but important. Chapter 3’s measurement perspective treats uncertainty as something to be reduced—the goal is to learn the preference function as accurately as possible. This chapter’s reward maximization perspective treats uncertainty as something to be managed—the goal is to make good decisions now, even with incomplete knowledge. Consider an LLM serving system: it must decide which response to show each user, balancing exploitation (showing responses it believes are good) against exploration (trying novel responses to learn what users prefer).

Chapter Structure. We build from simple to complex settings:

  1. Thompson Sampling (Sections 4.3–4.4): Linear and nonlinear (GP) approaches to sequential decision-making with preference feedback
  2. Dueling Bandits (Section 4.5): When actions can only be compared pairwise, introducing regret notions and winner concepts (Condorcet, Borda, von Neumann)
  3. Applications: Case studies connecting these methods to LLM alignment and clinical trials
  4. Preferential Bayesian Optimization: Advanced acquisition functions (qEUBO, POP-BO) with convergence guarantees
  5. RL Foundations (Optional) (Section 4.7): MDP formulation, the policy gradient theorem, and GRPO for preference optimization
  6. Human-Agent Cooperation (CIRL): Moving beyond passive oracles to model humans as strategic agents

Connections to Prior Chapters:

| Concept | Chapters 1–3 | This Chapter |
|---|---|---|
| Bradley-Terry model | Likelihood, MLE, Fisher info | Dueling bandit preference model, LLM selection |
| Laplace approximation | GP posterior inference | Thompson Sampling posterior |
| Acquisition functions | D-optimal, info gain (Ch. 3) | UCB, qEUBO, Pure Exploration |
| DPO / reward learning | Objective function, ADPO (Ch. 3) | Reward maximization under uncertainty |
| Policy gradient / GRPO | RLHF pipeline (Ch. 1) | Full derivation, GRPO algorithm (Optional) |
| Winner concepts | — | Condorcet, Borda, von Neumann (reappear in Ch. 5) |

4.2 The Reward Maximization under Uncertainty Problem

Historical Note: Thompson Sampling

Thompson Sampling dates back to 1933, when William Thompson proposed it for clinical trials. For decades it was largely forgotten, overshadowed by frequentist methods like UCB (Upper Confidence Bound). Its rediscovery in the 2010s, particularly for online advertising and recommendation systems, revealed that this simple Bayesian algorithm often outperforms more complex alternatives. The key insight: sample a plausible world from your beliefs, then act optimally within it.

The fundamental tension in sequential decision-making is the exploration-exploitation tradeoff:

  • Exploitation: Choose actions believed to be good based on current knowledge
  • Exploration: Try uncertain actions to gain information for better future decisions

A purely exploitative strategy may miss superior options it never tried. A purely explorative strategy wastes resources on suboptimal choices. Effective algorithms balance both.

Assumption: Stationary Utilities

The bandit framework assumes item utilities are fixed over time: the reward distribution for each arm does not change across rounds. All regret bounds in this chapter rely on this stationarity assumption. In practice, preferences drift—users’ tastes evolve, LLM responses improve through training, and the item set itself may change (new products, new model versions). When utilities are non-stationary, standard regret bounds no longer apply, and algorithms must explicitly account for drift, for instance by discounting old observations (as the Elo K-factor does implicitly in Chapter 2) or by using non-stationary bandit algorithms with sliding windows or change-point detection.

Each user \(i\) interacts with a set of available items indexed by \(j\), each characterized by a known parameter vector \(V_j \in \mathbb{R}^K\). The user’s utility function is \(H_{ij} = h_i(V_j),\) where \(h_i: \mathbb{R}^K \rightarrow \mathbb{R}\) maps item features to latent utility. We only observe noisy or partial feedback, such as item-wise or pairwise responses. The agent’s objective is to select items sequentially to maximize reward.

In the linear case, we assume \(h_i(V_j) = U_i^\top V_j\), with \(U_i \in \mathbb{R}^K\) representing the user’s latent vector. This recovers the logistic or Rasch model when observations are binary: \(p(Y_{ij}=1) = \sigma(U_i^\top V_j).\) In this setting, the uncertainty about \(U_i\) defines a posterior distribution \(p(U_i \mid \text{data})\), and the system’s exploration–exploitation tradeoff can be quantified by the expected utility and posterior variance. Algorithms such as Thompson Sampling operate directly on this posterior, balancing exploration (reducing uncertainty about \(U_i\)) and exploitation (recommending high-utility \(V_j\)).

In the nonlinear case, a common choice is to place a Gaussian process prior over \(h_i\): \(h_i \sim \mathcal{GP}(m(\cdot), k(\cdot,\cdot)),\) where the kernel \(k(V, V')\) encodes similarity between items. This yields the Bayesian optimization formulation. The system uses acquisition functions (e.g., Thompson Sampling or expected improvement) to trade off exploitation and exploration.

Code: Exploration vs Exploitation Visualization
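As a minimal sketch of this tradeoff (a toy three-armed Bernoulli bandit with hypothetical win probabilities; printed rather than plotted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = np.array([0.3, 0.5, 0.7])     # hypothetical Bernoulli win probabilities
T = 2000

def run(epsilon):
    """Epsilon-greedy on a 3-armed bandit; epsilon=0 is purely exploitative."""
    wins, pulls, total = np.zeros(3), np.zeros(3), 0.0
    for _ in range(T):
        if rng.random() < epsilon:
            a = int(rng.integers(3))                         # explore: random arm
        else:
            a = int(np.argmax(wins / np.maximum(pulls, 1)))  # exploit: best empirical arm
        r = float(rng.random() < p_true[a])
        wins[a] += r; pulls[a] += 1; total += r
    return total / T

greedy, eps_greedy = run(0.0), run(0.1)
print(f"purely greedy mean reward:  {greedy:.3f}")
print(f"10%-exploring mean reward: {eps_greedy:.3f}")
```

A purely greedy policy can lock onto whichever arm happens to win early, illustrating why some exploration is needed.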

4.3 Thompson Sampling under Linear Objective

Thompson Sampling (TS) is among the most conceptually transparent and empirically successful methods for decision-making under uncertainty. It converts Bayesian inference into a simple randomized decision rule: sample a plausible world, then act optimally within it. Formally, suppose we maintain a posterior distribution over the latent user embedding \(p(U_i \mid \mathcal{D}_t) \propto p(\mathcal{D}_t \mid U_i) p(U_i),\) where \(\mathcal{D}_t = \{(V_j, Y_{ij})\}_{j=1}^t\) denotes the user’s observed responses. At each round \(t+1\), Thompson Sampling draws a single sample \(\tilde{U}_i^{(t)} \sim p(U_i \mid \mathcal{D}_t),\) and then chooses the next item that maximizes the sampled expected utility: \(j^* = \arg\max_j \tilde{U}_i^{(t)\top} V_j.\) This randomized strategy naturally balances exploration and exploitation: if posterior uncertainty is large, different draws \(\tilde{U}_i^{(t)}\) lead to diverse item choices; as the posterior concentrates, the policy converges to greedy exploitation.

Common Pitfall: Thompson Sampling Requires a Correct Posterior

Thompson Sampling’s theoretical guarantees depend on maintaining an accurate posterior over the latent parameters. A common mistake is to use a poor approximation (e.g., a point estimate, or a Gaussian approximation when the true posterior is multimodal) and still expect TS to explore effectively. If the posterior is overconfident, TS degenerates to greedy exploitation too early; if underconfident, it wastes rounds exploring suboptimal items. The Laplace approximation works well for logistic likelihoods (Bradley-Terry), but always check that the posterior is unimodal. Also, do not add explicit exploration bonuses (as in UCB) on top of Thompson Sampling — the randomization is the exploration mechanism, and adding bonuses breaks the balance.

Worked Example: Thompson Sampling Intuition

Suppose we have 3 items and a 1D user embedding. After 5 observations, the posterior over \(U\) is approximately \(\mathcal{N}(1.2, 0.5^2)\).

Round 1: Draw \(\tilde{U} = 1.6\) (high sample). With item features \(V_1 = 0.5, V_2 = 1.0, V_3 = -0.3\):

  • Expected utilities: \(\tilde{U} \cdot V_1 = 0.80\), \(\tilde{U} \cdot V_2 = 1.60\), \(\tilde{U} \cdot V_3 = -0.48\)
  • Select \(j^* = 2\) (highest utility under this sample)

Round 2: Draw \(\tilde{U} = 0.7\) (low sample). Now:

  • Expected utilities: \(0.35\), \(0.70\), \(-0.21\)
  • Still select \(j^* = 2\), but the margin is smaller

Round 3: Draw \(\tilde{U} = -0.1\) (rare tail sample). Now:

  • Expected utilities: \(-0.05\), \(-0.10\), \(0.03\)
  • Select \(j^* = 3\): exploration! The unlikely sample led us to try a different item.

Key insight: Thompson Sampling explores automatically because posterior uncertainty generates diverse samples. Early on (wide posterior), many different items get selected. As the posterior narrows around the true \(U\), exploration decreases naturally — no explicit exploration schedule needed.
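The three rounds above can be checked numerically; the item features and posterior draws are taken directly from the example:

```python
import numpy as np

V = np.array([0.5, 1.0, -0.3])     # item features V_1, V_2, V_3
draws = [1.6, 0.7, -0.1]           # posterior samples of U, one per round

selections = []
for t, u in enumerate(draws, start=1):
    utilities = u * V                        # sampled expected utilities
    j_star = int(np.argmax(utilities)) + 1   # 1-indexed item
    selections.append(j_star)
    print(f"Round {t}: utilities = {np.round(utilities, 2)}, select item {j_star}")
```

The selections come out as items 2, 2, 3, matching the three rounds above.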

Under binary or preference feedback modeled by a logistic link, \[ p(Y_{ij}=1 \mid U_i, V_j) = \sigma(U_i^\top V_j), \tag{4.1}\] the posterior becomes non-Gaussian and analytically intractable. We therefore approximate it with a Laplace approximation, replacing the true posterior with a local Gaussian centered at the mode (the MAP estimate).

Let the log-posterior be: \[ \ell(U_i) = \log p(U_i) + \sum_{j \in \mathcal{D}_t} \log p(Y_{ij} \mid U_i). \tag{4.2}\] We expand \(\ell(U_i)\) around its maximum a posteriori estimate \(\hat{U}_i\): \[ \ell(U_i) \approx \ell(\hat{U}_i) - \tfrac{1}{2}(U_i - \hat{U}_i)^\top G (U_i - \hat{U}_i), \tag{4.3}\] where \(G = -\nabla^2_{U_i} \ell(\hat{U}_i)\) is the negative Hessian of the log-posterior. This yields the Gaussian approximation: \[ p(U_i \mid \mathcal{D}_t) \approx \mathcal{N}(\hat{U}_i, G^{-1}). \tag{4.4}\]

In the logistic model, \[ G = I + \sum_{j} p_j(1-p_j) V_j V_j^\top, \tag{4.5}\] where \(p_j = \sigma(\hat{U}_i^\top V_j)\) and \(I\) arises from the Gaussian prior \(p(U_i) = \mathcal{N}(0, I)\).

At each time step \(t\):

  1. Update posterior approximation. Compute \(\hat{U}_i\) by maximizing the log-posterior: \[ \hat{U}_i = \arg\max_{U_i} \Big[\sum_{j\in\mathcal{D}_t} Y_{ij}\log\sigma(U_i^\top V_j) + (1-Y_{ij})\log(1-\sigma(U_i^\top V_j)) - \tfrac{1}{2}\|U_i\|^2 \Big]. \tag{4.6}\]

  2. Compute Hessian at the mode \[ G = I + \sum_{j} p_j(1-p_j) V_j V_j^\top. \tag{4.7}\]

  3. Sample a posterior draw \[ \tilde{U}_i^{(t)} \sim \mathcal{N}(\hat{U}_i, G^{-1}). \tag{4.8}\]

  4. Select next item \[ j^* = \arg\max_j \tilde{U}_i^{(t)\top} V_j. \tag{4.9}\]

This algorithm naturally unifies measurement and decision: the posterior (a measurement construct) drives the sampling rule (a decision construct).

Code: Thompson Sampling with Laplace Approximation
Code: Linear Thompson Sampling Simulation

This simulation runs Thompson Sampling with linear utility over multiple rounds, tracking how the posterior concentrates and regret accumulates.
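One possible self-contained sketch of such a simulation (NumPy only; the true user vector, item set, and horizon are illustrative choices, and the MAP estimate is found by Newton iterations):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

K, J, T = 2, 10, 200                      # embedding dim, items, rounds (illustrative)
U_true = np.array([1.0, -0.5])            # hypothetical latent user vector
V = rng.normal(size=(J, K))               # known item features

def laplace_posterior(Vs, ys, n_newton=25):
    """MAP estimate and negative Hessian G for the logistic model with N(0, I) prior."""
    U = np.zeros(K)
    for _ in range(n_newton):
        p = sigmoid(Vs @ U)
        w = p * (1 - p)
        G = np.eye(K) + Vs.T @ (w[:, None] * Vs)   # Eq. (4.7)
        grad = Vs.T @ (ys - p) - U                 # gradient of the log posterior
        U = U + np.linalg.solve(G, grad)           # Newton step
    p = sigmoid(Vs @ U)
    G = np.eye(K) + Vs.T @ ((p * (1 - p))[:, None] * Vs)
    return U, G

Vs_hist, ys_hist, choices = [], [], []
for t in range(T):
    if Vs_hist:
        U_hat, G = laplace_posterior(np.array(Vs_hist), np.array(ys_hist))
    else:
        U_hat, G = np.zeros(K), np.eye(K)          # prior N(0, I) before any data
    U_tilde = rng.multivariate_normal(U_hat, np.linalg.inv(G))  # Eq. (4.8)
    j = int(np.argmax(V @ U_tilde))                             # Eq. (4.9)
    y = float(rng.random() < sigmoid(U_true @ V[j]))            # simulated feedback
    Vs_hist.append(V[j]); ys_hist.append(y); choices.append(j)

print("posterior mean:", np.round(U_hat, 2))
print("most-chosen item in last 50 rounds:",
      int(np.bincount(choices[-50:], minlength=J).argmax()))
```

As the posterior concentrates, the sampled vectors \(\tilde U\) vary less, and the late-round choices cluster on a few high-utility items.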

4.4 Thompson Sampling under Nonlinear Objective

The linear formulation assumes that a user’s latent utility varies linearly with item features. In practice, preference landscapes are often highly nonlinear: users may favor intermediate difficulty, exhibit saturation, or respond differently to combinations of attributes. To capture such complex relations, we move from a parametric vector representation \((U_i)\) to a nonparametric function representation \((h_i)\).

A Gaussian Process (GP) defines a distribution over functions: \[ h_i \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot)), \tag{4.10}\] where \(m(V)\) is the mean function (often assumed to be zero) and \(k(V, V')\) is the kernel, encoding similarity between items. Intuitively, a GP asserts that similar items yield similar utilities, with the kernel controlling smoothness and generalization.

For any finite collection of inputs \(V_{1:M}\), the corresponding function values \[ H_{i, 1:M} = [h_i(V_1), \dots, h_i(V_M)] \tag{4.11}\] follow a multivariate normal distribution: \[ H_{i, 1:M} \sim \mathcal{N}(m_{1:M}, K_{1:M,1:M}), \tag{4.12}\] where \((K)_{jj'} = k(V_j, V_{j'})\).

The probability density function (PDF) of an \(M\)-dimensional multivariate normal random variable \[ H \sim \mathcal{N}(m, K) \tag{4.13}\] is given by \[ p(H) = \frac{1}{(2\pi)^{M/2} |K|^{1/2}} \exp \Big(-\tfrac{1}{2}(H-m)^\top K^{-1} (H-m) \Big). \tag{4.14}\] Taking the logarithm yields \[ \log p(H) = -\tfrac{1}{2}(H-m)^\top K^{-1} (H-m) - \tfrac{1}{2}\log|K| - \tfrac{M}{2}\log(2\pi). \tag{4.15}\]

In most inference contexts, constant terms (those not depending on \(H\)) can be dropped, since they cancel when normalizing the posterior.

This nonparametric formulation is advantageous because:

  • It automatically adapts model complexity to the data through the kernel.
  • It provides closed-form posterior updates under Gaussian noise.
  • It generalizes across items by exploiting smoothness in feature space.

The RBF (Radial Basis Function) kernel is particularly common: \[ k(V, V') = \sigma^2 \exp\left(-\frac{\|V - V'\|^2}{2\ell^2}\right), \tag{4.16}\] where \(\ell\) is the lengthscale (controlling smoothness) and \(\sigma^2\) is the variance (controlling amplitude). Small lengthscales produce wiggly functions; large lengthscales produce smooth functions.

Code: Kernel Comparison and GP Prior Samples
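A minimal sketch of Equation (4.16) and the effect of the lengthscale (the grid, lengthscales, and the wiggliness summary are illustrative choices):

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Eq. (4.16) for 1-D inputs: k(v, v') = sigma^2 exp(-(v - v')^2 / (2 l^2))."""
    sq = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-sq / (2.0 * lengthscale ** 2))

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)

for ell in (0.3, 2.0):                         # small vs large lengthscale
    K = rbf_kernel(x, x, lengthscale=ell)
    jitter = 1e-8 * np.eye(len(x))             # numerical stabilizer for sampling
    sample = rng.multivariate_normal(np.zeros(len(x)), K + jitter)
    # crude wiggliness measure: mean absolute change between neighboring grid points
    print(f"lengthscale={ell}: mean |delta f| = {np.abs(np.diff(sample)).mean():.3f}")
```

The short-lengthscale draw changes much more between neighboring points, matching the "wiggly vs smooth" intuition.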

For binary observations, \[ p(Y_{ij}=1 \mid h_i(V_j)) = \sigma(h_i(V_j)), \tag{4.17}\] the likelihood is non-Gaussian, and the posterior \[ p(h_i \mid \mathcal{D}_t) \propto p(Y_{1:t} \mid h_i) p(h_i) \tag{4.18}\] becomes intractable. To approximate it, we again use the Laplace approximation.

Let \(H_i = [h_i(V_1), \dots, h_i(V_t)]\). The log-posterior is \[ \ell(H_i) = \log p(Y_{1:t} \mid H_i) - \tfrac{1}{2}H_i^\top K_t^{-1}H_i - \tfrac{1}{2}\log|K_t|. \tag{4.19}\]

Let \[ \hat H_i = \arg\max_{H_i} \ell(H_i) \tag{4.20}\] be the mode (MAP estimate).

The multivariate second-order Taylor expansion of a smooth scalar function \(\ell(\cdot)\) around \(\hat H_i\) is \[ \ell(H_i) \approx \ell(\hat H_i) + \nabla\ell(\hat H_i)^\top(H_i-\hat H_i) + \tfrac12(H_i-\hat H_i)^\top \nabla^2\ell(\hat H_i)(H_i-\hat H_i). \tag{4.21}\]

  • Because \(\hat H_i\) is a maximizer, the gradient vanishes: \[ \nabla\ell(\hat H_i)=0. \tag{4.22}\]
  • The Hessian at a maximum is negative semidefinite: \[ \nabla^2\ell(\hat H_i) \preceq 0. \tag{4.23}\]

So the expansion simplifies to \[ \ell(H_i) \approx \ell(\hat H_i) + \tfrac12 (H_i-\hat H_i)^\top \nabla^2\ell(\hat H_i)(H_i-\hat H_i). \tag{4.24}\]

Define the negative Hessian (a positive semidefinite matrix) \[ W := -\nabla^2\ell(\hat H_i) \succeq 0. \tag{4.25}\]

Then \[ \ell(H_i) \approx \ell(\hat H_i) - \tfrac12 (H_i-\hat H_i)^\top W (H_i-\hat H_i). \tag{4.26}\]

Laplace’s idea is: if the log posterior is approximately quadratic, then the posterior is approximately Gaussian. Exponentiate the quadratic approximation:

\[ p(H_i\mid\mathcal D_t) \propto \exp\big(\ell(H_i)\big) \approx \exp \Big(\ell(\hat H_i)-\tfrac12(H_i-\hat H_i)^\top W (H_i-\hat H_i)\Big). \tag{4.27}\]

Since \(\exp(\ell(\hat H_i))\) is just a constant in \(H_i\), this is proportional to the kernel of a multivariate normal with mean \(\hat H_i\) and precision \(W\) (i.e., covariance \(W^{-1}\)):

\[ p(H_i\mid\mathcal D_t) \approx \mathcal N\big(\hat H_i, W^{-1}\big). \tag{4.28}\]

The resulting approximate posterior can be projected to any new input \(V_*\) via standard GP conditioning: \[ p(h(V_*) \mid \mathcal{D}_t) \approx \mathcal{N}(\mu_t(V_*), \sigma_t^2(V_*)), \tag{4.29}\] where \[ \mu_t(V_*) = k_*^\top K_t^{-1}\hat{H}_i, \quad \sigma_t^2(V_*) = k(V_*,V_*) - k_*^\top (K_t + W^{-1})^{-1} k_*, \tag{4.30}\] with \(k_*\) the vector of covariances between the new input \(V_*\) and the training inputs.

Once the posterior is approximated, Thompson Sampling proceeds as before: sample one plausible function from the posterior, then act optimally with respect to it.

Algorithmically:

  1. Approximate the posterior \(p(h_i \mid \mathcal{D}_t) \approx \mathcal{N}(\hat{H}_i, W^{-1})\) via Laplace.
  2. Sample a function realization \[ \tilde{h}_i \sim \mathcal{GP}(m_t(\cdot), k_t(\cdot, \cdot)), \tag{4.31}\] where \(m_t\) and \(k_t\) are derived from the approximate posterior.
  3. Select the next item \[ j^* = \arg\max_j \tilde{h}_i(V_j). \tag{4.32}\]

This random sampling induces exploration: uncertain regions of the function space produce high variance in \(\tilde{h}_i(V_j)\), encouraging the agent to test them. As more observations are collected, the posterior variance shrinks and the strategy converges toward greedy exploitation.

Below is a minimal illustration using GP classification with Laplace approximation and Thompson sampling. (For practical scalability, packages like gpytorch or scikit-learn can be used.)

Code: GP Thompson Sampling
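One way such an illustration might look (NumPy only; the item grid, the true utility \(h(v)=\sin 2v\), kernel hyperparameters, and query counts are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Candidate items on a 1-D feature grid; hypothetical true utility h(v) = sin(2v)
items = np.linspace(0, 3, 25)
h_true = np.sin(2 * items)

def rbf(a, b, ell=0.5):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

K_mat = rbf(items, items) + 1e-4 * np.eye(len(items))   # jitter for stability
K_inv = np.linalg.inv(K_mat)

# Simulated binary feedback at randomly queried items
idx = rng.integers(0, len(items), size=60)
y = (rng.random(60) < sigmoid(h_true[idx])).astype(float)

# Laplace approximation: Newton ascent on the log posterior over H, cf. Eq. (4.19)
H = np.zeros(len(items))
for _ in range(50):
    p = sigmoid(H[idx])
    r = np.bincount(idx, weights=y - p, minlength=len(items))      # sum of (y - sigma) per item
    w = np.bincount(idx, weights=p * (1 - p), minlength=len(items))
    H = H + np.linalg.solve(K_inv + np.diag(w), r - K_inv @ H)     # Newton step

p = sigmoid(H[idx])
W = np.diag(np.bincount(idx, weights=p * (1 - p), minlength=len(items)))
cov = np.linalg.inv(K_inv + W)          # Laplace covariance, cf. Eq. (4.28)

# Thompson Sampling: draw one plausible utility function, act greedily on it
h_sample = rng.multivariate_normal(H, cov)
j_star = int(np.argmax(h_sample))
print("selected item feature:", round(float(items[j_star]), 2))
```

Each fresh draw of `h_sample` can favor a different region of the grid, which is exactly the exploration mechanism described above.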
Code: GP Posterior Visualization

The following code visualizes how the GP posterior updates after observing binary responses. The shaded region shows the 95% credible interval.

Code: Thompson Sampling Simulation

This simulation shows Thompson Sampling in action, demonstrating how the algorithm balances exploration and exploitation over time.

Key Takeaway: Thompson Sampling

Thompson Sampling provides a remarkably simple and effective solution to the exploration-exploitation tradeoff: sample a plausible world from your posterior beliefs, then act optimally within it. This naturally concentrates exploration where uncertainty is highest, without requiring explicit exploration bonuses. Combined with Laplace approximation (for tractable posteriors) and Gaussian Processes (for flexible reward functions), Thompson Sampling provides a complete framework for sequential decision-making under preference uncertainty—connecting directly to the Bayesian inference tools developed in Chapter 2.

4.5 Dueling Bandit

4.5.1 From Multi-Armed Bandits to Dueling Bandits

The multi-armed bandit (MAB) problem involves a gambler deciding which of several slot machines to play to maximize winnings, without knowing which machine is the most rewarding. This scenario highlights the need to balance exploration (trying new machines to discover potentially higher rewards) and exploitation (using current knowledge to maximize gains). MAB algorithms address this dilemma by making decisions under uncertainty to achieve the best possible outcomes based on gathered data. At the core of the MAB problem is a set of actions, or ‘arms,’ denoted by \(\mathcal{A} = \{1, 2, \ldots, K\}\), where \(K\) is the total number of arms. For each round \(t\), the agent selects an arm \(a_t \in \mathcal{A}\) and receives a reward \(r_t\), sampled from an arm-specific, unknown probability distribution. The expected reward of pulling arm \(a\) is \(\mu_a = \mathbb{E}[r_t \mid a]\).

The multi-armed bandit framework can be extended in various ways to model more complex scenarios. In the infinite-armed bandit problem, the set of possible arms \(\mathcal{A}\) is either very large or infinite. This introduces significant challenges in exploration, as the agent cannot afford to explore each arm even once. Algorithms for infinite-armed bandits typically assume some regularity or structure of the reward function across arms to make the problem tractable. The contextual bandit problem extends the bandit framework by incorporating observable external states or contexts that influence the reward distributions of arms. The agent’s task is to learn policies that map contexts to arms to maximize reward. This model is particularly powerful for personalized recommendations, where the context can include user features or historical interactions. In dueling bandit problems, the agent chooses two arms to pull simultaneously and receives feedback only on which of the two is better, not the actual reward values. This pairwise comparison model is especially useful in scenarios where absolute evaluations are difficult, but relative preferences are easier to determine, such as in ranking systems.

Contextual bandits extend the multi-armed bandits by making decisions conditional on the state of the environment and previous observations. The benefit of such a model is that observing the environment can provide additional information, potentially leading to better rewards and outcomes. In each iteration, the agent is presented with the context of the environment, then decides on an action based on the context and previous observations. Finally, the agent observes the action’s outcome and reward. Throughout this process, the agent aims to maximize the expected reward.

In many real-world contexts, one may not have a real-valued reward (or at least a reliable one) associated with a decision. Instead, we may only have observations indicating which of a set of bandits was optimal in a given scenario. The assumption is that within these observations of preferred choices among a set of options, there is an implicit reward or payoff encapsulated in that decision. Consider the following examples:

  1. Dietary preferences: When providing food recommendations to humans, it is often not possible to quantify an explicit reward from recommending a specific food item. Instead, we can offer meal options and observe which one the person selects.

  2. Video recommendation: Websites like YouTube and TikTok recommend specific videos to users. It is typically not feasible to measure the reward a person gains from watching a video. However, we can infer that a user preferred one video over another. From these relative preference observations, we can develop a strategy to recommend videos they are likely to enjoy.

  3. Exoskeleton gait optimization: Tucker et al. (2020) created a framework that uses human-evaluated preferences for an exoskeleton gait algorithm to develop an optimal strategy for the exoskeleton to assist a human in walking. A human cannot reliably produce a numerical value for how well the exoskeleton helped them walk but can reliably indicate which option performed best according to their preferences.

Generally, we assume access to a set of actions. A noteworthy assumption is that any observations we make are unbiased estimates of the payoff. This means that if we observe a human preferred one option over another (or several others), the preferred option had a higher implicit reward or payoff than the alternatives. In the case of dietary preferences, this may mean that a human liked the preferred option; in the case of video recommendations, a user was more entertained, satisfied, or educated by the video they selected than the other options.

The overarching context is that we do not have direct or reliable access to rewards. We may not have a reward at all (for some decisions, it may be impossible to assign a real value to the outcome), or it may be noisy (for example, if we ask a human to rate their satisfaction on a scale of 1 to 10). We use relative comparisons to evaluate the best of multiple options in this case. Our goal is to minimize total regret in the face of noisy comparisons. Humans may not always provide consistent observations (since human decision-making is not guaranteed to be consistent). However, we can still determine an optimal strategy with the observed comparisons. We aim to minimize the frequency of sub-optimal decisions according to human preferences.

In practice, many formulations of bandits can allow for infinitely many bandits (for example, in continuous-valued and high-dimensional spaces). However, this situation can be intractable when determining an optimal decision strategy: with infinite options, how can we ensure we have chosen the best? We will constrain our bandits to a discrete space to enable efficient exploration. We will assume that we have \(k\) bandits, \(b_i,\ i \in \{1, \dots, k\}\), and our task is to choose the one that minimizes regret.

With the framework outlined, we now define our approach more formally. This method was introduced by Yue et al. (2012), and proofs for the guarantees and derivations of parameters can be found in their work.

To determine the optimal action, we will compare pairwise to ascertain the probability that an action \(b_i\) is preferred over another \(b_j\), where \(i \ne j\). Concretely, we assume access to a function \(\epsilon\) that determines this probability; in practice, it can be evaluated with an oracle, such as asking a human which of two options they prefer: \[P(b_i > b_j) = \epsilon(b_i, b_j) + \frac{1}{2}. \tag{4.33}\] Three basic properties govern the values provided by \(\epsilon\): \[\epsilon(b_i, b_j) = -\epsilon(b_j, b_i), \quad \epsilon(b_i, b_i) = 0, \quad \epsilon(b_i, b_j) \in \left(-\tfrac{1}{2}, \tfrac{1}{2} \right). \tag{4.34}\]

We assume there is a total ordering of bandits, such that \(b_i \succ b_j\) implies \(\epsilon(b_i, b_j) > 0\). We impose two constraints to properly model comparisons:

  • Strong Stochastic Transitivity: We must maintain our total ordering of bandits, and as such, the comparison model also respects this ordering: \[b_i \succ b_j \succ b_k \Rightarrow \epsilon(b_i, b_k) \ge \max\{\epsilon(b_i, b_j), \epsilon(b_j, b_k)\}. \tag{4.35}\]

  • Stochastic Triangle Inequality: We also impose a triangle inequality, which captures the condition that the probability of a bandit winning (or losing) a comparison will exhibit diminishing returns as it becomes increasingly superior (or inferior) to the competing bandit: \[b_i \succ b_j \succ b_k \Rightarrow \epsilon(b_i, b_k) \le \epsilon(b_i, b_j) + \epsilon(b_j, b_k). \tag{4.36}\]

These assumptions may initially seem limiting; however, common models for comparisons satisfy these constraints. For example, the Bradley-Terry Model follows \(P(b_i > b_j) = \frac{\mu_i}{\mu_i + \mu_j}\). The Gaussian model with unit variance also satisfies these constraints: \(P(b_i > b_j) = P(X_i - X_j > 0)\), where \(X_i - X_j \sim N(\mu_i - \mu_j, 2)\).

Code: Preference Matrix Visualization
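For instance, the \(\epsilon\) values implied by a Bradley-Terry model can be tabulated from hypothetical scores \(\mu_i\) and checked against the properties above:

```python
import numpy as np

mu = np.array([1.0, 2.0, 4.0])                  # hypothetical Bradley-Terry scores
P = mu[:, None] / (mu[:, None] + mu[None, :])   # P[i, j] = P(b_i > b_j)
eps = P - 0.5                                   # epsilon(b_i, b_j), cf. Eq. (4.33)

print("preference matrix:\n", np.round(P, 3))
```

Anti-symmetry, the zero diagonal, the \((-\tfrac12, \tfrac12)\) bounds, strong stochastic transitivity, and the stochastic triangle inequality can all be verified directly on `eps`.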

To accurately model the preferences between bandits in our framework of pairwise bandit comparisons and regret, we must track certain parameters in our algorithm. First, we will maintain a running empirical estimate of the probability of bandit preferences based on our observations. It is important to note that we do not have direct access to an \(\epsilon\) function. Instead, we must present two bandits to a human, who selects a winner. To do this, we define: \[\hat{P}_{i, j} = \frac{\#\ \text{wins of } b_i}{\#\ \text{comparisons between } b_i \text{ and } b_j}. \tag{4.37}\]

We will also compute confidence intervals at each timestep for each of the entries in \(\hat{P}\) as \[\hat{C}_t = \left( \hat{P}_t - c_t, \hat{P}_t + c_t \right), \tag{4.38}\] where \(c_t = \sqrt{\frac{4\log(\frac{1}{\delta})}{t}}\). Note that \(\delta = \frac{1}{TK^2}\), where \(T\) is the time horizon and \(K\) is the number of bandits.
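A short sketch of these estimates (the pair's true preference probability and the comparison count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T_horizon, K_arms = 1000, 5
delta = 1.0 / (T_horizon * K_arms ** 2)          # delta = 1 / (T K^2)
p_true = 0.7                                     # hypothetical P(b_i > b_j)

wins = 0
for t in range(1, 201):                          # 200 duels of the pair (i, j)
    wins += int(rng.random() < p_true)
    p_hat = wins / t                             # Eq. (4.37)
    c_t = np.sqrt(4 * np.log(1 / delta) / t)     # confidence radius

print(f"p_hat = {p_hat:.3f}, interval = ({p_hat - c_t:.3f}, {p_hat + c_t:.3f})")
```

The radius \(c_t\) shrinks as \(1/\sqrt{t}\), so the interval tightens as the pair is compared more often.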

Previously, we discussed approaches for finding the best action in a specific context. Now, we consider changing contexts, which means there is no longer a static hidden preference matrix \(P\). Instead, at every time step, there is a preference matrix \(P_C\) depending on context \(C\). We consider a context \(C\) and a preference matrix \(P_C\) to be chosen by nature as a result of the given environment (Yue et al., 2012). The goal of a contextual dueling bandit algorithm is to find a policy \(\pi\) that maps contexts to a Von Neumann winner distribution over our bandits. That is, our policy \(\pi\) should map any context to some distribution over our bandits such that sampling from that distribution is weakly preferred to every fixed action for that context.

4.5.2 Regret

The agent aims to pick a sequence of arms \((a_1, a_2, \ldots, a_T)\) across a succession of time steps \(t = 1\) to \(t = T\) to maximize the total accumulated reward. Formally, the strategy seeks to maximize the sum of the expected rewards: \(\max_{a_1, \ldots, a_T} \mathbb{E} \left[\sum_{t=1}^{T} r_t\right]\). Regret is defined as the difference between the cumulative reward that could have been obtained by always pulling the best arm (in hindsight, after knowing the reward distributions) and the cumulative reward actually obtained by the algorithm. Formally, if \(\mu^*\) is the expected reward of the best arm and \(\mu_{a_t}\) is the expected reward of the arm chosen at time \(t\), the regret after \(T\) time steps is given by \(R(T) = T \cdot \mu^* - \sum_{t=1}^{T} \mu_{a_t}\).

The objective of a bandit algorithm is to minimize this regret over time, effectively learning to make decisions that are as close as possible to the decisions of an oracle that knows the reward distributions beforehand. Low regret indicates an algorithm that consistently chooses well-performing arms, balancing the exploration of unknown arms with the exploitation of arms that are already known to perform well.

Thus, an efficient bandit algorithm exhibits sub-linear regret growth, meaning that the average regret per round tends to zero as the number of rounds \(T\) goes to infinity: \(\lim_{T \to \infty} \frac{R(T)}{T} = 0\). Minimizing regret is a cornerstone in the design of bandit algorithms, and its analysis helps in understanding the long-term efficiency and effectiveness of different bandit strategies.
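The definition \(R(T) = T\mu^* - \sum_t \mu_{a_t}\) can be evaluated directly for a toy arm sequence (the expected rewards are illustrative):

```python
mu = [0.2, 0.5, 0.8]               # expected reward of each arm (illustrative)
mu_star = max(mu)                  # best arm's expected reward, in hindsight

arms = [0, 1, 2, 2, 1, 2]          # a hypothetical sequence of chosen arms
T = len(arms)
regret = T * mu_star - sum(mu[a] for a in arms)
print(f"R(T) = {regret:.2f}")      # 6*0.8 - 3.6 = 1.2
```

Only the first two rounds incur regret here; once the algorithm mostly plays arm 2, the per-round regret is zero.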

As previously discussed, our goal is to select the bandit that minimizes a quantity that reflects regret or the cost of not selecting the optimal bandit at all times. We can leverage our comparison model to define a quantity for regret over some time horizon \(T\), which is the number of decisions we make (selecting what we think is the best bandit at each iteration). Assuming we know the best bandit \(b^*\) (and we know that there is a best bandit, since there is a total ordering of our discrete bandits), we can define two notions of regret:

  • Strong regret: aims to capture the fraction of users who would prefer the optimal bandit \(b^*\) over the worse of the two options \(b_1^{(t)}, b_2^{(t)}\) we provide at step \(t\): \[R_T = \sum_{t = 1}^T \max \left\{ \epsilon(b^*, b_1^{(t)}), \epsilon(b^*, b_2^{(t)}) \right\}\]

  • Weak regret: aims to capture the fraction of users who would prefer the optimal bandit \(b^*\) over the better of the two options \(b_1^{(t)}, b_2^{(t)}\) we provide at step \(t\): \[\tilde{R}_T = \sum_{t = 1}^T \min \left\{ \epsilon(b^*, b_1^{(t)}), \epsilon(b^*, b_2^{(t)}) \right\}\]
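To make the two notions concrete, here is a minimal sketch that accumulates both regrets over a sequence of duels. The preference matrix, the convention \(\epsilon(b^*, b) = p(b^* \succ b) - 0.5\), and the choice of arm 0 as \(b^*\) are illustrative assumptions.

```python
import numpy as np

# Hypothetical preference probabilities: p[i, j] = P(arm i beats arm j),
# with arm 0 playing the role of the best bandit b*.
p = np.array([
    [0.50, 0.60, 0.70],
    [0.40, 0.50, 0.55],
    [0.30, 0.45, 0.50],
])

def eps(b):
    """Gap eps(b*, b) = P(b* > b) - 0.5, with b* = arm 0."""
    return p[0, b] - 0.5

def strong_and_weak_regret(duels):
    """Accumulate strong (max) and weak (min) regret over duels (b1, b2)."""
    strong = sum(max(eps(b1), eps(b2)) for b1, b2 in duels)
    weak = sum(min(eps(b1), eps(b2)) for b1, b2 in duels)
    return strong, weak

duels = [(0, 1), (1, 2), (0, 0)]   # dueling b* against itself costs nothing
strong, weak = strong_and_weak_regret(duels)
print(strong, weak)
```

Note that a duel involving \(b^*\) itself contributes zero to weak regret, while strong regret still charges for the other, suboptimal arm.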

4.5.3 Winner Concepts

The best bandit described in our regret definition is called a Condorcet winner. This is the strongest form of winner: it is the action \(A_{i}\) that is preferred to every other action \(A_j\) with \(p > 0.5\) in a head-to-head election. While the notions of regret introduced above assume that an overall best bandit exists, there are settings where no bandit wins more than half of its head-to-head duels. A set of actions without a Condorcet winner is described by the preference matrix in Figure 4.1, where each entry \(\Delta_{jk} = p(j \succ k) - 0.5\) is the probability that action \(j\) is preferred over action \(k\), minus 0.5. There is no Condorcet winner because no action is preferred with \(p > 0.5\) over all other actions. Imagine you want to find the best pizza to eat (= action): there may not be a pizza that wins more than half of its head-to-head duels against every other pizza.

However, we might still have an intuition of what the best pizza is. Sui et al. (2018) therefore introduce several concepts of \(\textit{winners}\) in dueling bandit problems. In this example, we might define the best pizza as the most popular one. We call the pizza receiving the most votes in a public vote the Borda winner, or formally, the Borda winner is \(\arg\max_{j \in A} \sum_{i \neq j} p(j \succ i)\). In contrast to the Condorcet setting, a set of actions is always guaranteed to have one or more (in the case of a tie) Borda winners. However, if a Condorcet winner exists, it need not coincide with a Borda winner: in our pizza example, a pepperoni pizza might win more than half of its head-to-head duels, while the cheese pizza is still the most popular in a public poll.

A more general winner concept is the Von Neumann winner, which describes a probability distribution rather than a single winning bandit. A Von Neumann winner prescribes a probability distribution \(W\) such that an action sampled from \(W\) beats an action sampled uniformly at random with \(p > 0.5\). In our pizza example, this corresponds to trusting a friend to order whichever pizza they like, because this may still be preferred to ordering randomly. Formally, \(W\) is a Von Neumann winner if \(\Pr_{j \sim W,\, k \sim R}(j \succ k) > 0.5\), where \(R\) denotes the uniform probability distribution over our actions. The concept of a Von Neumann winner is useful in contextual bandits, which will be introduced later. In these settings, the preference matrix depends on the context, and different contexts may have different Borda winners, just as different parties may vote for different pizzas.

No Universal Winner Concept

The three winner concepts—Condorcet, Borda, and von Neumann—can identify different items as the “best,” and a Condorcet winner may not even exist. The choice of winner concept is therefore a normative design decision, not a purely technical one: it determines what “regret” means and what the algorithm optimizes for. This parallels a deeper tension formalized by Arrow’s impossibility theorem (Section 5.2): there is no aggregation rule that satisfies all desirable properties simultaneously. When deploying dueling bandit algorithms, practitioners should explicitly choose a winner concept that matches their application’s values—majority dominance (Condorcet), average popularity (Borda), or worst-case robustness (von Neumann).

|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| A | 0 | 0.03 | -0.02 | 0.06 | 0.10 | 0.11 |
| B | -0.03 | 0 | 0.03 | 0.05 | 0.08 | 0.11 |
| C | 0.02 | -0.03 | 0 | 0.04 | 0.07 | 0.09 |
| D | -0.06 | -0.05 | -0.04 | 0 | 0.05 | 0.07 |
| E | -0.10 | -0.08 | -0.07 | -0.05 | 0 | 0.03 |
| F | -0.11 | -0.11 | -0.09 | -0.07 | -0.03 | 0 |

Figure 4.1: Violation of Condorcet Winner. Highlighted entries are different from Table 1. No Condorcet winner exists, as no arm beats every other arm.
Code: Pizza Voting Example - When Winners Diverge

This example demonstrates a scenario where three groups have different pizza preferences, resulting in no Condorcet winner. The Borda winner and Von Neumann winner provide alternative solutions.

Code: Computing Different Winner Types
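As a sketch of what such code might look like, the following computes all three winner types for a made-up three-pizza preference matrix containing a cycle (Pepperoni beats Cheese, Cheese beats Veggie, Veggie beats Pepperoni), so no Condorcet winner exists.

```python
import numpy as np

# Made-up preference matrix: P[i, j] = P(pizza i beats pizza j).
# The cycle Pepperoni > Cheese > Veggie > Pepperoni rules out a Condorcet winner.
P = np.array([
    [0.5, 0.7, 0.2],
    [0.3, 0.5, 0.9],
    [0.8, 0.1, 0.5],
])
names = ["Pepperoni", "Cheese", "Veggie"]
K = len(names)

# Condorcet winner: beats every other action with p > 0.5 (may not exist).
condorcet = [i for i in range(K) if all(P[i, j] > 0.5 for j in range(K) if j != i)]

# Borda winner: largest total win probability against the other actions.
borda = int(np.argmax(P.sum(axis=1) - 0.5))   # subtract the self-duel entry

# Von Neumann winner: mixed strategy w with min_k sum_i w_i (P[i, k] - 0.5) >= 0,
# approximated here by brute force over a coarse grid on the simplex.
best_w, best_val = None, -np.inf
grid = np.linspace(0, 1, 51)
for a in grid:
    for c in grid:
        if a + c <= 1:
            w = np.array([a, c, 1 - a - c])
            val = min(w @ (P[:, k] - 0.5) for k in range(K))
            if val > best_val:
                best_val, best_w = val, w

print("Condorcet:", condorcet)               # empty list: the cycle has no winner
print("Borda:", names[borda])
print("Von Neumann mix:", np.round(best_w, 2))
```

The grid search is a placeholder for solving the zero-sum game exactly (e.g. by linear programming); since the matrix \(P - 0.5\) is antisymmetric, the game value is zero and a mixed strategy achieving approximately zero always exists.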

Next, we introduce two performance measures for the planner. The asymptotic ex-post regret is defined as \[\text{Regret}(\mu_1, \ldots, \mu_K) = T \cdot \max_i \mu_i - \sum_{t=1}^T \mathbb{E}[\mu_{I_t}]. \tag{4.39}\]

Intuitively, this represents the difference between the reward achieved by always taking the action with the highest possible reward and the expected welfare of the recommendation algorithm (based on the actions it recommends at each timestep).

We also define a weaker performance measure, the Bayesian regret: \[\text{Bayesian regret} = \mathbb{E}_{\mu_1, \ldots, \mu_K \sim \text{Prior}}\left[\operatorname{Regret}\left(\mu_1, \ldots, \mu_K\right)\right] \tag{4.40}\]

With a Bayesian optimal policy, we would like either definition of regret to vanish as \(T\to \infty\); we are considering “large-market optimal” settings where there are many short-lived, rather than a few long-term, users. Note that because ex-post regret is prior-free, it is robust to inaccuracies in the prior.
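The distinction can be sketched with a small Monte Carlo simulation: ex-post regret is computed for a fixed vector of means, and Bayesian regret averages it over means drawn from the prior. The greedy policy and the Gaussian prior below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy_policy(mu, T):
    """Pull each arm once, then always pull the empirically best arm
    under noisy Gaussian rewards (an illustrative, deliberately naive policy)."""
    K = len(mu)
    counts, sums = np.ones(K), mu + rng.normal(0, 1, K)
    choices = list(range(K))
    for _ in range(T - K):
        i = int(np.argmax(sums / counts))
        sums[i] += mu[i] + rng.normal()
        counts[i] += 1
        choices.append(i)
    return choices

def ex_post_regret(mu, policy, T=200):
    """Regret(mu) = T * max_i mu_i - sum_t mu_{I_t} for one run of the policy."""
    chosen = policy(mu, T)
    return T * mu.max() - sum(mu[i] for i in chosen)

# Bayesian regret: average ex-post regret over means drawn from the prior N(0, 1).
runs = [ex_post_regret(rng.normal(0, 1, 5), greedy_policy) for _ in range(100)]
print("estimated Bayesian regret:", round(float(np.mean(runs)), 1))
```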

4.5.4 Acquisition Functions

Various strategies have been developed to balance the exploration-exploitation trade-off. These strategies differ in how they select arms based on past observations and rewards.

4.5.4.1 Classical Acquisition Functions

The Uniform acquisition function is the most straightforward approach: each arm is selected uniformly at random over time. This strategy does not consider past rewards and treats every arm as equally promising regardless of the observed outcomes. It is a purely explorative strategy that ensures each arm is sampled enough to estimate its expected reward, but it does not exploit that information to optimize rewards. In mathematical terms, if \(N_t(a)\) denotes the number of times arm \(a\) has been selected up to time \(t\), the Uniform strategy ensures that \(N_t(a) \approx \frac{t}{K}\) for all arms \(a\) as \(t\) grows large: \(P(a_t = a) = \frac{1}{K}\)

Epsilon-Greedy is a popular method that balances exploration and exploitation. With a small probability \(\epsilon\), it explores by choosing an arm at random, and with probability \(1 - \epsilon\), it exploits by selecting the arm with the highest estimated reward so far. This strategy incrementally favors actions that have historically yielded higher rewards, but still allows occasional exploration to potentially discover better options. The parameter \(\epsilon\) is chosen based on the desired exploration level, often set between 0.01 and 0.1. \[P(a_t = a) = \begin{cases} \frac{\epsilon}{K} + 1 - \epsilon & \text{if } a = \arg\max_{a'} \hat{\mu}_{a'} \\ \frac{\epsilon}{K} & \text{otherwise} \end{cases} \tag{4.41}\]

The Upper Confidence Bound (UCB) acquisition function takes a more sophisticated approach to the exploration-exploitation dilemma. It selects arms based on both the estimated rewards and the uncertainty associated with those estimates. Specifically, it favors arms with high upper confidence bounds on the estimated rewards, computed as the sum of the estimated mean and a confidence term that shrinks with the number of times the arm has been played. This ensures that arms with less certainty (those played less often) are considered more often, naturally balancing exploration with exploitation as the uncertainty is reduced over time.

\[P(a_t = a) = \begin{cases} 1 & \text{if } a = \arg\max_{a'} \left( \hat{\mu}_{a'} + \sqrt{\frac{2 \ln t}{N_t(a')}} \right) \\ 0 & \text{otherwise} \end{cases} \tag{4.42}\]
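The three classical acquisition functions can be compared on a toy Bernoulli bandit; the arm means, horizon, and \(\epsilon\) below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 5, 2000
mu = np.array([0.1, 0.3, 0.5, 0.7, 0.9])   # hypothetical Bernoulli arm means

def run(select):
    """Simulate T rounds of a selection rule; return cumulative (pseudo-)regret."""
    counts, sums, regret = np.zeros(K), np.zeros(K), 0.0
    for t in range(1, T + 1):
        a = select(t, counts, sums)
        sums[a] += rng.random() < mu[a]      # Bernoulli reward
        counts[a] += 1
        regret += mu.max() - mu[a]
    return regret

def uniform(t, counts, sums):
    return int(rng.integers(K))

def eps_greedy(t, counts, sums, eps=0.1):
    if t <= K:                               # pull each arm once first
        return t - 1
    if rng.random() < eps:
        return int(rng.integers(K))
    return int(np.argmax(sums / counts))

def ucb(t, counts, sums):
    if t <= K:
        return t - 1
    return int(np.argmax(sums / counts + np.sqrt(2 * np.log(t) / counts)))

results = {}
for name, sel in [("uniform", uniform), ("eps-greedy", eps_greedy), ("UCB", ucb)]:
    results[name] = run(sel)
    print(name, round(results[name], 1))
```

On a run like this, uniform sampling pays the average gap on every round, while Epsilon-Greedy and UCB concentrate pulls on the best arm and accumulate far less regret.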

4.5.4.2 Interleaved Filter

This algorithm tries to find the best bandit (the Condorcet winner) in a discrete, finite bandit space via pairwise comparisons. We now introduce the Interleaved Filter algorithm as provided in (Yue et al. 2012) to solve a dueling bandit setup. It starts with a randomly chosen best-bandit candidate \(\hat{b}\) and iteratively compares it to the set \(W\) of remaining bandits \(b\), maintaining winning probabilities \(\hat{P}_{\hat{b},b}\) and confidence intervals \(\hat{C}_{\hat{b},b}\). If a bandit \(b\) is confidently worse than \(\hat{b}\), it is removed from \(W\). If a bandit \(b'\) is confidently better than \(\hat{b}\), it becomes the new best candidate \(\hat{b}\), and the old \(\hat{b}\) as well as every other bandit \(b\) worse than it are removed from \(W\). This continues until \(W\) is empty, leaving the final \(\hat{b}\) as the predicted best bandit.

1:  input: \(T\), \(B=\{b_1, \dots, b_K\}\)
2:  \(\delta \gets 1/(TK^2)\)
3:  choose \(\hat{b} \in B\) randomly
4:  \(W \gets B \setminus \{\hat{b}\}\)
5:  \(\forall b \in W\), maintain estimate \(\hat{P}_{\hat{b},b}\) of \(P(\hat{b} > b)\)
6:  \(\forall b \in W\), maintain \(1 - \delta\) confidence interval \(\hat{C}_{\hat{b},b}\) of \(\hat{P}_{\hat{b},b}\)
7:  while \(W \neq \varnothing\) do
8:    for all \(b \in W\) do
9:      compare \(\hat{b}\) and \(b\)
10:     update \(\hat{P}_{\hat{b},b}\), \(\hat{C}_{\hat{b},b}\)
11:   end for
12:   while \(\exists b \in W\) with \(\hat{P}_{\hat{b},b} > 1/2\) and \(1/2 \notin \hat{C}_{\hat{b},b}\) do
13:     \(W \gets W \setminus \{b\}\)
14:   end while
15:   if \(\exists b' \in W\) with \(\hat{P}_{\hat{b},b'} < 1/2\) and \(1/2 \notin \hat{C}_{\hat{b},b'}\) then
16:     while \(\exists b \in W\) with \(\hat{P}_{\hat{b},b} > 1/2\) do
17:       \(W \gets W \setminus \{b\}\)
18:     end while
19:     \(\hat{b} \gets b'\), \(W \gets W \setminus \{b'\}\)
20:     \(\forall b \in W\), reset \(\hat{P}_{\hat{b},b}\) and \(\hat{C}_{\hat{b},b}\)
21:   end if
22: end while
23: \(\hat{T} \gets\) total comparisons made
24: return \((\hat{b}, \hat{T})\)

Parameter Initialization

In lines 1-6 of the algorithm, we take the inputs and first compute the value \(\delta\) which is used to compute our confidence intervals. We select an initial guess of an optimal bandit \(\hat{b}\) by uniformly sampling from all bandits \(\mathcal{B}\). We also keep a running set of bandit candidates \(W\), which is initialized to be \(\mathcal{B} \setminus \{\hat{b}\}\). At this point, we also initialize our empirical estimates for \(\hat{P}, \hat{C}\).

Next, we will repeat several steps until our working set of bandit candidates \(W\) is empty.

Update Estimates Based on Comparisons

The first step at each iteration (lines 8-11) is to look at all candidates in \(W\), and compare them to our current guess \(\hat{b}\) using an oracle (e.g. by asking a human which of \(\hat{b}\) or \(b \in W\) is preferred). With this new set of wins and comparisons, we update our estimates of \(\hat{P}, \hat{C}\).

Prune Suboptimal Bandits

In lines 12-13, with updated comparison win probabilities and corresponding confidence intervals, we can remove bandit candidates from \(W\) that we are confident \(\hat{b}\) is better than. The intuition here is that we are mostly sure that our current best guess is better than some of the candidates, and we don’t need to consider those candidates in future iterations.

Check for Better Bandits from Candidate Set

Now that our candidate set of bandits may be smaller, in lines 15-21 we check whether there is any bandit \(b'\) that we are confident is better than our current best guess. If we find such a candidate, we remove every bandit \(b\) for which \(\hat{P}\) indicates that \(b\) is likely worse than \(\hat{b}\). Note that in this step we do not require the probability to lie outside the confidence interval, since we have already found a candidate we believe to be significantly closer to optimal than our current best guess.

Once we remove the candidates likely worse than \(\hat{b}\), we crown \(b'\) as the new best guess, i.e. \(\hat{b} := b'\). Consequently, we remove \(b'\) from \(W\) and reset our empirical win counters \(\hat{P}, \hat{C}\).

With this algorithm defined, let us look at some provisions of the method with respect to identifying the optimal strategy. Note that the proofs and derivations for these quantities are provided in (Yue et al. 2012).

First, the method guarantees that for the provided time horizon \(T\), the algorithm returns the correct bandit with probability \(P \ge 1 - \frac{1}{T}\). It is useful to note that if one has a strict requirement for the probability of identifying the correct bandit, one can compute the time horizon \(T\) that guarantees this outcome at that probability. Furthermore, a time horizon of 1 leaves no probabilistic guarantee of a successful outcome, and increasing \(T\) has diminishing returns. Second, in the event that the algorithm returns an incorrect bandit, the maximal regret incurred is linear in \(T\), i.e. \(\mathcal{O}(T)\). This is also a useful provision as it allows us to estimate the overall cost in the worst-case outcome. Based on these two provisions, we can compute the expected cumulative regret from running the Interleaved Filter algorithm, which is: \[\mathbb{E}\left[R_T\right] \le \left(1 - \frac{1}{T}\right) \mathbb{E}\left[ R_T^{IF} \right] + \frac{1}{T}\mathcal{O}(T) = \mathcal{O}\left(\mathbb{E}\left[ R_T^{IF} \right] + 1\right) \tag{4.43}\]

Interestingly, the original work shows that these bounds hold for both strong and weak regret. As demonstrated, the Interleaved Filter algorithm provides a robust method to ascertain the optimal bandit or strategy given a set of options and only noisy comparisons. In most real-world scenarios for modeling human preferences, it is not possible to observe a real-world reward value (or at least a reliable one), and as such this method is a useful way to properly model human preferences.

Furthermore, the algorithm provides strong guarantees for the probability of selecting the correct bandit, maximal regret, and the number of comparisons required. It is even more impressive that the method can do so without severely limiting constraints; as demonstrated, the most commonly used models satisfy the imposed constraints.

As we look to model human preferences, we can certainly leverage this method for k-armed dueling bandits to identify the best strategy to solve human-centric challenges, from video recommendation to meal selection and exoskeleton-assisted walking.

Code: Interleaved Filter Implementation
Code: Comparing Dueling Bandit Algorithms
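As a sketch of what an Interleaved Filter implementation might look like: the Hoeffding-style confidence radius and the synthetic Bradley-Terry comparison model below are illustrative assumptions, not the exact constants of Yue et al. (2012).

```python
import numpy as np

rng = np.random.default_rng(2)

def interleaved_filter(P, T):
    """Sketch of the Interleaved Filter. P[i, j] = P(bandit i beats bandit j);
    T is the time horizon used to set delta. The confidence radius is an
    illustrative Hoeffding-style choice."""
    K = P.shape[0]
    delta = 1.0 / (T * K * K)
    b_hat = int(rng.integers(K))
    W = [b for b in range(K) if b != b_hat]
    wins = {b: 0 for b in W}   # wins of b_hat against each candidate b
    n = {b: 0 for b in W}      # comparisons of b_hat against each candidate
    total = 0
    while W:
        for b in list(W):      # compare b_hat once against every candidate
            wins[b] += rng.random() < P[b_hat, b]
            n[b] += 1
            total += 1
        def ci(b):             # empirical estimate and confidence radius
            return wins[b] / n[b], np.sqrt(np.log(1 / delta) / n[b])
        # Prune candidates that b_hat confidently beats.
        W = [b for b in W if not (ci(b)[0] - ci(b)[1] > 0.5)]
        # Promote a candidate that confidently beats b_hat.
        better = [b for b in W if ci(b)[0] + ci(b)[1] < 0.5]
        if better:
            b_new = better[0]
            # Drop candidates empirically worse than the old b_hat (no CI check).
            W = [b for b in W if wins[b] / n[b] <= 0.5 and b != b_new]
            b_hat = b_new
            wins = {b: 0 for b in W}
            n = {b: 0 for b in W}
    return b_hat, total

# Synthetic comparison model: Bradley-Terry over latent utilities,
# so bandit 0 is the Condorcet winner by construction.
u = np.array([0.9, 0.6, 0.4, 0.2])
P = 1 / (1 + np.exp(-(u[:, None] - u[None, :])))
best, comparisons = interleaved_filter(P, T=10_000)
print(best, comparisons)
```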

4.5.4.3 Dueling Bandit Gradient Descent

The Interleaved Filter works well when the action space is finite, but many practical settings involve continuous parameter spaces. For example, a search engine may be parameterized by a weight vector \(w \in \mathbb{R}^d\) controlling how different ranking signals are combined, and the goal is to find the \(w\) that produces the most preferred results. Dueling Bandit Gradient Descent (DBGD) (Yue and Joachims 2009) extends the dueling bandit framework to this continuous setting by performing gradient-free optimization using only preference feedback.

The idea is intuitive: at each step, compare the current parameters \(w_t\) against a random perturbation \(w_t' = w_t + \delta u_t\). If users prefer the perturbed system, take a step in that direction. This is analogous to stochastic gradient descent, but using preference comparisons instead of gradient evaluations.

input: \(\gamma\), \(\delta\), \(w_1\)
for \(t = 1, \dots, T\) do
  sample unit vector \(u_t\) uniformly
  \(w_t' \gets P_W(w_t + \delta u_t)\)
  compare \(w_t\) and \(w_t'\)
  if \(w_t'\) wins then
    \(w_{t+1} \gets P_W(w_t + \gamma u_t)\)
  else
    \(w_{t+1} \gets w_t\)
  end if
end for

We first choose the exploration step length \(\delta\), the exploitation step length \(\gamma\), and a starting point \(w_1\) in the unit sphere. At each iteration, we receive a query and sample a random unit vector \(u_t\). We then duel \(w_t\) and \(w_t'\), where \(w_t\) is our current point in the sphere and \(w_t'\) is our exploratory comparison, generated by taking a random step of length \(\delta\): \(w_t' = w_t + \delta u_t\). The objective of this duel is to ascertain the binary preference of users with respect to the results yielded by the IR systems parameterized by \(w_t\) and \(w_t'\), respectively, taking the query \(q_t\) as input. The parameterization that receives the majority of votes wins the head-to-head duel. If \(w_t\) wins, we keep the parameters for the next iteration. If \(w_t'\) wins, we update our parameters in the direction of \(u_t\) by taking a step of length \(\gamma\). Note that the algorithm uses a projection operation \(P_W(\overrightarrow{v})\): since \(u_t\) is chosen randomly, \(w_t + \delta u_t\) or \(w_t + \gamma u_t\) could lie outside the unit sphere in which all possible parameter configurations lie, in which case we simply project the point back using \(P_W(\overrightarrow{v})\).

Yue and Joachims show that this algorithm has sublinear regret in \(T\), the number of iterations. We note that the algorithm assumes that there exists a hidden reward function \(R(w)\) that maps system parameters \(w_t\) to a reward value which is smooth and strictly concave over the input space \(W\).

Lastly, we would also like to give motivation behind \(\delta\) and \(\gamma\) being different values. We need a \(\delta\) that is sufficiently large that the comparison between a system parameterized by \(w_t\) and \(w_t'\) is meaningful. On the other hand, we may wish to take a smaller step in the direction of \(w_t'\) during our update step, as during a duel, we only score \(w_t\) against \(w_t'\) over the results on one query \(q_t\). Having \(\delta > \gamma\) allows us to get reward signal from meaningfully different points while also updating our belief of the best point \(w_{\text{best}}\) gradually.
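A minimal sketch of DBGD under an assumed hidden concave utility; the duel oracle, noise level, optimum \(w_\star\), and step sizes are made up for the demo, and projection onto the unit ball stands in for the paper's feasible-set projection.

```python
import numpy as np

rng = np.random.default_rng(3)

def project_unit_ball(w):
    """Project w back into the unit ball if a step left it."""
    norm = np.linalg.norm(w)
    return w / norm if norm > 1 else w

def dbgd(prefer, d, T=3000, delta=0.5, gamma=0.05):
    """DBGD sketch: prefer(w, w2) returns True if the system parameterized
    by the exploratory point w2 wins the duel against w."""
    w = np.zeros(d)
    for _ in range(T):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                    # random unit direction u_t
        w2 = project_unit_ball(w + delta * u)     # exploratory candidate w_t'
        if prefer(w, w2):
            w = project_unit_ball(w + gamma * u)  # small step toward the winner
    return w

# Hidden concave utility with optimum w_star (an assumption for the demo);
# the noisy duel oracle prefers whichever parameters score higher.
w_star = np.array([0.6, -0.3, 0.5])
def utility(w):
    return -np.sum((w - w_star) ** 2)
def prefer(w, w2):
    return utility(w2) + 0.1 * rng.normal() > utility(w) + 0.1 * rng.normal()

w_final = dbgd(prefer, d=3)
print(np.round(w_final, 2))
```

Note how the large exploration step \(\delta\) makes each duel informative despite the noise, while the small \(\gamma\) keeps the parameter updates conservative.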

Sparring EXP4

While DBGD handles continuous action spaces, many real-world dueling problems also involve context—the best action depends on the situation. For example, the best search ranking for a query about “python” differs depending on whether the user is a programmer or a pet owner. Sparring EXP4 (Zoghi et al. 2015) addresses this contextual dueling bandit setting by adapting the EXP4 algorithm (originally designed for contextual bandits with scalar rewards) to preference feedback.

The key idea is to run two instances of the EXP4 algorithm against each other. Each instance maintains weights over a set of expert policies and produces a probability distribution over actions given the current context. When a context arrives, each instance samples an action, the two actions are dueled, and the winner receives a positive reward while the loser receives a negative reward proportional to the preference strength. This “sparring” approach elegantly converts pairwise preference feedback into the scalar reward signals that EXP4 requires.

4.5.4.4 Feel-good Thompson Sampling

Feel-good Thompson Sampling (Zhang 2021) offers an alternative approach to the contextual dueling bandit setting, connecting back to the Thompson Sampling framework developed earlier in this chapter. Like Sparring EXP4, it handles context-dependent preferences, but it uses posterior sampling rather than adversarial learning. The algorithm aims to find a Von Neumann winner by minimizing cumulative average regret: \[\text{Regret}(T) := \sum_{t=1}^{T} \left[ r_{*}(x_t, a_{t}^{*}) - \frac{r_{*}(x_t, a_{t}^{1}) + r_{*}(x_t, a_{t}^{2})}{2} \right], \tag{4.44}\] where \(r_{*}(x_t, a_{t})\) is the true, hidden reward function of context \(x_t\) and action \(a_t\). Thompson sampling here is an iterative process: at each round we receive a preference between two actions, each of which maximizes a different approximation of the reward function based on past data, and we add this new information to the data.

Good approximations of the reward function at time \(t\) are obtained by sampling two reward-function parameters \(\theta_t^{j=1}\) and \(\theta_t^{j=2}\) from a posterior distribution \(p_j(\cdot \mid S_{t-1})\) based on all previous data. This posterior is proportional to the product of the prior and the likelihood function, which is a Gaussian in standard Thompson sampling. In Feel-good Thompson sampling, an additional term called “feel-good exploration” encourages parameters \(\theta\) with a large maximum reward in previous rounds. This change to the likelihood function can increase probabilities in uncertain areas, thus exploring those regions. All that is left is to select an action maximizing each reward-function approximation and receive a preference \(y_t\) between them, adding the new information to the dataset (Zhang 2021).

Initialize \(S_0 = \varnothing\).
for \(t = 1, 2, \dots\) do
  Receive prompt \(x_t\) and action space \(\mathcal{A}_t\).
  for \(j = 1, 2\) do
    Sample model parameter \(\theta_t^j\) from the posterior distribution \(p^j(\cdot \mid S_{t-1})\).
    Select response \(a_t^j = \arg\max_{a \in \mathcal{A}_t} \langle \theta_t^j, \phi(x_t, a) \rangle\).
  end for
  Receive preference \(y_t\).
  Update dataset \(S_t \gets S_{t-1} \cup \{(x_t, a_t^1, a_t^2, y_t)\}\).
end for
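The loop above can be sketched for a linear preference model. The posterior update below is a crude conjugate-Gaussian stand-in: it omits the feel-good exploration bonus and the exact logistic likelihood of Zhang (2021), and the hidden parameters and action features are made up, so it only illustrates the structure of the algorithm.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
theta_true = np.array([1.0, -0.5, 0.8, 0.2])   # hidden preference parameters

A = np.eye(d)          # precision of the Gaussian posterior surrogate
b = np.zeros(d)
S = []                 # dataset S_t of duel records

for t in range(500):
    actions = rng.normal(size=(10, d))         # feature vectors phi(x_t, a)
    cov = np.linalg.inv(A)
    mean = cov @ b
    # Sample two independent models from the posterior; each picks its argmax action.
    picks = [int(np.argmax(actions @ rng.multivariate_normal(mean, cov)))
             for _ in range(2)]
    a1, a2 = picks
    z = actions[a1] - actions[a2]
    # Bernoulli preference from the hidden Bradley-Terry model.
    y = rng.random() < 1 / (1 + np.exp(-theta_true @ z))
    # Crude conjugate update treating (y - 0.5) as a noisy observation of theta.z;
    # the real algorithm uses the logistic likelihood plus the feel-good term.
    A += np.outer(z, z)
    b += (float(y) - 0.5) * z
    S.append((z, y))

mean = np.linalg.inv(A) @ b
print(np.round(mean, 2))
```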

Key Takeaway: Choosing an Algorithm

The dueling bandit algorithms covered above address different problem settings:

| Algorithm | Action Space | Context? | Winner Concept | Key Strength |
|---|---|---|---|---|
| Interleaved Filter | Finite | No | Condorcet | Simple, provable |
| DBGD | Continuous | No | Reward maximizer | Gradient-free optimization |
| Sparring EXP4 | Finite | Yes | Copeland | Adversarial robustness |
| Feel-good TS | Finite | Yes | Von Neumann | Bayesian, posterior sampling |

When choosing an algorithm, consider: (1) Is the action space discrete or continuous? (2) Does the optimal action depend on context? (3) What computational resources are available? Thompson Sampling-based methods are generally preferred when a good posterior approximation is available, while adversarial methods like Sparring EXP4 are more robust when the preference model is misspecified.

4.5.5 Applications

The dueling bandit and Thompson Sampling frameworks developed above have broad applicability whenever systems must learn from preference feedback rather than scalar rewards. We highlight two case studies that illustrate how these methods connect to the book’s central themes.

4.5.5.1 Case Study 1: LLM Response Selection as a Dueling Bandit

Consider an LLM serving system that generates \(K\) candidate responses to each user prompt and must select which to display. This is exactly a contextual dueling bandit problem:

  • Context: The user prompt \(x_t\) and its embedding
  • Actions: The \(K\) candidate responses \(\{y_1, \ldots, y_K\}\)
  • Preference feedback: The user indicates which response they prefer (via explicit rating, continued engagement, or regeneration requests)
  • Objective: Maximize user satisfaction while learning which response characteristics users value

The preference model is Bradley-Terry over response embeddings: \(p(y_j \succ y_k \mid x) = \sigma(\beta \log \frac{\pi_\theta(y_j \mid x)}{\pi_{\text{ref}}(y_j \mid x)} - \beta \log \frac{\pi_\theta(y_k \mid x)}{\pi_{\text{ref}}(y_k \mid x)})\), exactly the DPO formulation from Chapter 2. Thompson Sampling provides a principled way to select which pair of responses to show: sample a reward function from the posterior and present the pair with the highest expected information gain.

This connects the bandit framework to the RLHF pipeline: online DPO is essentially Thompson Sampling on the Bradley-Terry preference model. The exploration-exploitation tradeoff determines whether the system shows responses it is confident about (exploitation) or tries novel response styles to learn user preferences (exploration).
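The Bradley-Terry/DPO preference probability in this formulation is easy to compute from log-probabilities; the function name and the numbers below are made up for illustration, with \(\beta\) playing the role of the DPO temperature.

```python
import math

def dpo_preference(logp_j, logp_ref_j, logp_k, logp_ref_k, beta=0.1):
    """p(y_j > y_k | x) under Bradley-Terry with the implicit DPO reward
    r = beta * (log pi_theta - log pi_ref)."""
    r_j = beta * (logp_j - logp_ref_j)
    r_k = beta * (logp_k - logp_ref_k)
    return 1 / (1 + math.exp(-(r_j - r_k)))

# Response j is upweighted relative to the reference; response k is downweighted.
p = dpo_preference(logp_j=-10.0, logp_ref_j=-12.0,
                   logp_k=-11.0, logp_ref_k=-9.0, beta=0.1)
print(round(p, 3))   # sigma(0.4) ~ 0.599
```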

4.5.5.2 Case Study 2: Clinical Trial Dosing

In personalized medicine, the dueling bandit framework addresses a critical challenge: how to assign drug dosages when the optimal dose depends on individual patient characteristics and can only be evaluated through patient preferences or outcomes.

Consider Warfarin dosing (Bastani and Bayati 2020), where the standard practice is to start with a fixed dose and adjust based on symptoms. This is suboptimal because patients vary significantly in their metabolism. The contextual bandit formulation models this as:

  • Context: Patient features \(x_t\) (age, weight, genetic markers, medication history)
  • Actions: Dosage levels \(a_t \in \{\)low, medium, high\(\}\)
  • Reward: Whether the patient’s INR (blood clotting measure) falls in the therapeutic range

Using the linear Thompson Sampling framework from earlier in this chapter, we model the reward as \(r(x_t, a_t) = x_t^\top \theta_{a_t}\), where \(\theta_{a_t}\) captures the relationship between patient features and outcomes for each dosage level. The key insight is that while each patient’s outcome reveals information about only one dosage level (we cannot simultaneously try all doses), the contextual structure allows information to transfer across similar patients.

This application highlights a tension central to this book: the system must balance exploration (trying dosages it is uncertain about to learn their effects) against exploitation (prescribing the dosage believed to be best for the current patient). In clinical settings, the cost of exploration is high—a suboptimal dose can cause serious harm—making efficient algorithms like Thompson Sampling especially valuable, since they concentrate exploration where uncertainty is highest rather than exploring uniformly.
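A sketch of linear Thompson Sampling for this dosing setup, with one Gaussian posterior per dosage level. The hidden parameters, noise level, and feature dimension are illustrative assumptions, and the linear reward is a stand-in for the therapeutic-range indicator.

```python
import numpy as np

rng = np.random.default_rng(5)
d, K, T = 3, 3, 2000       # patient features, dosage levels, patients

# Hidden per-dose parameters (made up): reward x @ theta_true[a] stands in
# for the chance of landing in the therapeutic range at dose a.
theta_true = rng.normal(size=(K, d))

A = [np.eye(d) for _ in range(K)]      # per-arm Gaussian posterior precision
b = [np.zeros(d) for _ in range(K)]

correct = 0
for t in range(T):
    x = rng.normal(size=d)             # features of the arriving patient
    # Thompson sampling: draw one parameter sample per dose, pick the best.
    sampled = []
    for a_ in range(K):
        cov = np.linalg.inv(A[a_])
        sampled.append(x @ rng.multivariate_normal(cov @ b[a_], cov))
    a = int(np.argmax(sampled))
    r = x @ theta_true[a] + 0.1 * rng.normal()   # outcome for chosen dose only
    A[a] += np.outer(x, x)                       # Bayesian linear regression update
    b[a] += r * x
    correct += a == int(np.argmax(theta_true @ x))

print("fraction of optimal doses:", correct / T)
```

Because each outcome updates only the chosen arm's posterior, the contextual structure is what lets information transfer across similar patients.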

4.5.5.3 Broader Applications

Beyond these case studies, bandit methods with preference feedback have been applied to recommendation systems (Zhou et al. 2017; Bouneffouf, Bouzeghoub, and Gançarski 2012), where the cold-start problem for new users is naturally handled by Thompson Sampling’s uncertainty-driven exploration; to dialogue systems (Liu et al. 2018), where response selection is modeled as a contextual bandit; and to search engine optimization (Yue and Joachims 2009), where DBGD tunes ranking parameters from user click preferences. For a comprehensive survey, see Bouneffouf, Rish, and Aggarwal (2020).

4.6 Preferential Bayesian Optimization

The traditional Bayesian optimization (BO) problem is described as follows. There is a black-box objective function \(g: \mathcal{X} \rightarrow \Re\) defined on a bounded subset \(\mathcal{X} \subseteq \Re^q\) such that direct queries to the function are expensive or not possible. However, we would like to solve the global optimization problem of finding \(\mathbf{x}_{\min }=\arg \min _{\mathbf{x} \in \mathcal{X}} g(\mathbf{x})\). This is highly analogous to modeling human preferences, since direct access to a human’s latent preference function is not possible but we would still like to find its optimum, such as in A/B tests or recommender systems.

We approach this problem for human preferences with Preferential Bayesian Optimization (PBO); the key difference is that we can query the preference function through pairwise comparisons of data points, i.e. duels. This is a form of indirect observation of the objective function, which models real-world scenarios closely: we commonly need to optimize a function via data about preferences. It has been demonstrated that humans are better at evaluating differences than absolute magnitudes (Kahneman and Tversky 1979), and therefore PBO models can be applied in various contexts.

When Is Pairwise Feedback the Right Choice?

PBO assumes the oracle can only provide pairwise comparisons, not scalar ratings. This is well-justified when absolute evaluations are cognitively difficult or poorly calibrated (e.g., “How good is this robot trajectory on a scale of 1-10?”), but in some domains scalar feedback is readily available and strictly more informative—each scalar observation constrains the reward function more than a binary comparison. When both modalities are feasible, standard Bayesian optimization with scalar feedback typically requires fewer queries to find the optimum. PBO is most valuable when: (1) comparisons are genuinely easier for humans than absolute ratings, (2) the rating scale is arbitrary or poorly anchored across evaluators, or (3) the goal is to find the best option rather than estimate the full reward function.

4.6.1 Problem statement

The problem of finding the optimum of a latent preference function defined on \(\mathcal{X}\) can be reduced to determining a sequence of duels on \(\mathcal{X} \times \mathcal{X}\). From each duel \(\left[\mathbf{x}, \mathbf{x}^{\prime}\right] \in \mathcal{X} \times \mathcal{X}\) we obtain binary feedback in \(\{0,1\}\) indicating whether or not \(\mathbf{x}\) is preferred over \(\mathbf{x}^{\prime}\) (i.e. \(g(\mathbf{x}) < g(\mathbf{x}^{\prime})\)). We consider \(\mathbf{x}\) the winner of the duel if the output is 1 and \(\mathbf{x}^{\prime}\) the winner if the output is 0. The aim is to find \(\mathbf{x}_{\min }\) using as few queried duels as possible.

The key idea in PBO is to learn a preference function in the space of duels using a Gaussian process. We define a joint reward \(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\) on each duel which is never directly observed. Instead, the feedback we obtain after each pair is a binary output \(y \in \{0,1\}\) indicating which of the two inputs is preferred. One definition of \(f\) we will use (though others are possible) is \(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=g\left(\mathbf{x}^{\prime}\right)-g(\mathbf{x})\). The more \(\mathbf{x}\) is preferred over \(\mathbf{x}^{\prime}\), the bigger the reward.

We define the model of preference using a Bernoulli likelihood, where \(p\left(y=1 \mid\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\) and \(p\left(y=0 \mid\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=\pi_f\left(\left[\mathbf{x}^{\prime}, \mathbf{x}\right]\right)\) for some inverse link function \(\pi: \Re \times \Re \rightarrow[0,1]\). \(\pi_f\) has the property that \(\pi_f\left(\left[\mathbf{x}^{\prime}, \mathbf{x}\right]\right)=1-\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\). A natural choice for \(\pi_f\) is the logistic function \[\label{eq:bernoulli_pref} \pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=\sigma\left(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\right)=\frac{1}{1+e^{-f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)}},\] but others are possible. Therefore we have that for any duel \(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\) in which \(g(\mathbf{x}) \leq g\left(\mathbf{x}^{\prime}\right)\) it holds that \(\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) \geq 0.5\). \(\pi_f\) is a preference function that maps each query \(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\) to the probability of having a preference on the left input \(\mathbf{x}\) over the right input \(\mathbf{x}^{\prime}\).

When we marginalize \(f\) over the right input \(\mathbf{x}^{\prime}\), its maximizer in \(\mathcal{X}\) coincides with \(\mathbf{x}_{\min }\). We also introduce the definition of the Copeland score function for a point \(\mathbf{x}\) as \[S(\mathbf{x})=\operatorname{Vol}(\mathcal{X})^{-1} \int_{\mathcal{X}} \mathbb{I}_{\left\{\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) \geq 0.5\right\}} d \mathbf{x}^{\prime} \tag{4.45}\] where \(\operatorname{Vol}(\mathcal{X})=\int_{\mathcal{X}} d \mathbf{x}^{\prime}\) is a normalizing constant that bounds \(S(\mathbf{x})\) in the interval \([0,1]\). If \(\mathcal{X}\) is a finite set, the Copeland score is simply the proportion of duels that a certain element \(\mathbf{x}\) will win with probability larger than 0.5. A soft variant we will use instead of the Copeland score is the soft-Copeland score, defined as \[\label{eq:soft-copeland} C(\mathbf{x})=\operatorname{Vol}(\mathcal{X})^{-1} \int_{\mathcal{X}} \pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) d \mathbf{x}^{\prime}\] where the probability function \(\pi_f\) is integrated over \(\mathcal{X}\). This score aims to capture the average probability of \(\mathbf{x}\) being the winner of a duel.

We define the Condorcet winner \(\mathbf{x}_c\) as the point with maximal soft-Copeland score. Note that this corresponds to the global minimum of \(g\): the defining integral is maximized at points \(\mathbf{x} \in \mathcal{X}\) for which \(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=g\left(\mathbf{x}^{\prime}\right)-g(\mathbf{x}) \geq 0\) for all \(\mathbf{x}^{\prime}\), which occurs only if \(\mathbf{x}_c\) is a minimizer of \(g\). Therefore, if the preference function \(\pi_f\) can be learned by observing the outcomes of duels, the problem of minimizing \(g\) reduces to finding the Condorcet winner of the soft-Copeland score.

Code: Copeland Score Visualization
Code: GP Preference Model for PBO

4.6.2 Acquisition Functions

We describe several acquisition functions for sequentially learning the Condorcet winner. Our dataset \(\mathcal{D}=\left\{\left[\mathbf{x}_i, \mathbf{x}_i^{\prime}\right], y_i\right\}_{i=1}^N\) collects the \(N\) duels performed so far. We aim to define a sequential policy \(\alpha\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right] ; \mathcal{D}_j, \theta\right)\) for querying duels, where \(\theta\) is a vector of model hyper-parameters, in order to find the minimum of the latent function \(g\) as quickly as possible. Gaussian process (GP) classification on \(\mathcal{D}\) allows us to perform inference over \(f\), and hence over \(\pi_f\).

Pure Exploration

The output variable \(y_{\star}\) of a prediction follows a Bernoulli distribution with probability given by the preference function \(\pi_f\). One way to explore is to query the duel whose outcome the model is most uncertain about, i.e. the duel with the highest variance of \(\sigma\left(f_{\star}\right)\), obtained by pushing our epistemic uncertainty about \(f\), modeled by the GP, through the logistic function. The first moment of this distribution coincides with the expectation of \(y_{\star}\), and its variance is \[\begin{aligned} \mathbb{V}\left[\sigma\left(f_{\star}\right)\right] & =\int\left(\sigma\left(f_{\star}\right)-\mathbb{E}\left[\sigma\left(f_{\star}\right)\right]\right)^2 p\left(f_{\star} \mid \mathcal{D},\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) d f_{\star} \\ & =\int \sigma\left(f_{\star}\right)^2 p\left(f_{\star} \mid \mathcal{D},\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) d f_{\star}-\mathbb{E}\left[\sigma\left(f_{\star}\right)\right]^2 \end{aligned} \tag{4.46}\] which explicitly takes into account the uncertainty over \(f\). Hence, pure exploration of the duel space can be carried out by maximizing \[\alpha_{\mathrm{PE}}\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right] \mid \mathcal{D}_j\right)=\mathbb{V}\left[\sigma\left(f_{\star}\right) \mid \left[\mathbf{x}, \mathbf{x}^{\prime}\right], \mathcal{D}_j\right] . \tag{4.47}\]

Note that in this case, duels that have already been visited have a lower chance of being selected again, even when the objective takes similar values for both players. In practice, this acquisition function requires computing an intractable integral, which we approximate by Monte Carlo sampling.
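The Monte Carlo approximation just described can be sketched as follows (the posterior moments below are hypothetical stand-ins for actual GP predictive means and variances at a candidate duel):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def pure_exploration_acq(mu, var, n_samples=100_000):
    """Monte Carlo estimate of Var[sigmoid(f_*)] for f_* ~ N(mu, var)."""
    f = rng.normal(mu, np.sqrt(var), size=n_samples)
    return sigmoid(f).var()

# A confidently decided duel (mean far from zero, small variance) should score
# lower than an undecided one (mean near zero, large variance).
a_confident = pure_exploration_acq(mu=3.0, var=0.1)
a_uncertain = pure_exploration_acq(mu=0.0, var=4.0)
print(a_confident, a_uncertain)
```

Maximizing this quantity over candidate duels then implements \(\alpha_{\mathrm{PE}}\).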

Principled Optimistic Preferential Bayesian Optimization (POP-BO)

In a slightly modified problem setup (Xu et al. 2024), the algorithm maintains an MLE \(\hat{g}\) of the ground-truth black-box function \(g\) together with a confidence set of plausible functions. It is assumed that \(g\) is a member of a reproducing kernel Hilbert space (RKHS) \(\mathcal{H}_k\) for some kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\), with \(\|g\|_k \leq B\), so that \(\mathcal{B}_g = \left\{\tilde{g} \in \mathcal{H}_k \mid\|\tilde{g}\|_k \leq B\right\}\). Defining \(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=g(\mathbf{x})-g\left(\mathbf{x}^{\prime}\right)\) (the sign convention is flipped relative to the preceding section, since POP-BO maximizes \(g\)), we model the preference function with a Bernoulli distribution as in Equation [eq:bernoulli_pref] and additionally assume that the probabilities follow the Bradley-Terry model, i.e. \[\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=\sigma\left(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\right)=\frac{e^{g(\mathbf{x})}}{e^{g(\mathbf{x})}+e^{g\left(\mathbf{x^{\prime}}\right)}}. \tag{4.48}\]

The MLE \(\hat{g}\) is updated as (Xu et al. 2024, Eqs. 5, 6, and 8) \[\begin{aligned} \hat{g}_t^{\text {MLE }}&:= \underset{\tilde{g} \in \mathcal{B}_g}{\arg\max}\ \ell_t(\tilde{g}) \\ \ell_t(\tilde{g}) &:= \log \prod_{\tau=1}^t \left[ y_\tau \pi_{\tilde{f}}([\mathbf{x_\tau}, \mathbf{x^{\prime}_\tau}])+\left(1-y_\tau\right)\left(1-\pi_{\tilde{f}}([\mathbf{x_\tau}, \mathbf{x^{\prime}_\tau}])\right) \right] \\ &=\sum_{\tau=1}^t \log \left(\frac{e^{\tilde{g}(\mathbf{x_\tau})} y_\tau+e^{\tilde{g}(\mathbf{x_\tau^\prime})}\left(1-y_\tau\right)}{e^{\tilde{g}(\mathbf{x_\tau})}+e^{\tilde{g}(\mathbf{x_\tau^\prime})}}\right) \\ &=\sum_{\tau=1}^t\left(\tilde{g}(\mathbf{x_\tau}) y_\tau+\tilde{g}(\mathbf{x_\tau^\prime})\left(1-y_\tau\right)\right)-\sum_{\tau=1}^t \log \left(e^{\tilde{g}(\mathbf{x_\tau})}+e^{\tilde{g}(\mathbf{x_\tau^\prime})}\right) \end{aligned} \tag{4.49}\]

Xu et al. (2024, Eq. 22) show how to represent this as a convex optimization problem so that it can be solved efficiently.
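To see why this likelihood identifies \(g\) (up to an additive shift), here is a plain gradient-ascent sketch on a finite domain; the setup is hypothetical, and POP-BO itself uses the convex reformulation rather than this naive ascent:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n = 20                                   # finite domain: g is just a vector of n utilities
g_true = rng.normal(size=n)

# Simulate duels with outcomes drawn from the Bradley-Terry model (Eq. 4.48).
left, right = rng.integers(0, n, size=(2, 5000))
keep = left != right
left, right = left[keep], right[keep]
y = (rng.random(left.size) < sigmoid(g_true[left] - g_true[right])).astype(float)

# Gradient ascent on the log-likelihood ell_t of Eq. 4.49:
# d ell / d g(x_tau) = y_tau - p_tau, and d ell / d g(x'_tau) = p_tau - y_tau.
g_hat = np.zeros(n)
for _ in range(500):
    p = sigmoid(g_hat[left] - g_hat[right])   # P(left wins) under current estimate
    grad = np.zeros(n)
    np.add.at(grad, left, y - p)
    np.add.at(grad, right, p - y)
    g_hat += 0.005 * grad
g_hat -= g_hat.mean()                          # g is identifiable only up to a shift

corr = np.corrcoef(g_hat, g_true)[0, 1]
print(f"correlation with ground truth: {corr:.3f}")
```

With enough duels, the recovered utilities rank the domain points essentially as \(g\) does, which is all the optimistic acquisition step below needs.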

The confidence set \(\mathcal{B}_g^{t+1}\) is updated, for any \(\epsilon, \delta > 0\), as (Xu et al. 2024, Eqs. 9–10)

\[\mathcal{B}_g^{t+1}:=\left\{\tilde{g} \in \mathcal{B}_g \mid \ell_t(\tilde{g}) \geq \ell_t\left(\hat{g}_t^{\mathrm{MLE}}\right)-\beta_1(\epsilon, \delta, t)\right\} \tag{4.50}\] where \[\beta_1(\epsilon, \delta, t):=\sqrt{32 t B^2 \log \frac{\pi^2 t^2 \mathcal{N}\left(\mathcal{B}_f, \epsilon,\|\cdot\|_{\infty}\right)}{6 \delta}}+ C_L \epsilon t=\mathcal{O}\left(\sqrt{t \log \frac{t \mathcal{N}\left(\mathcal{B}_f, \epsilon,\|\cdot\|_{\infty}\right)}{\delta}}+\epsilon t\right), \tag{4.51}\] with \(C_L\) a constant independent of \(\delta\), \(t\), and \(\epsilon\). Typically \(\epsilon\) is chosen to be \(1/T\), where \(T\) is the time horizon of the algorithm. With this choice, the following guarantee holds: \[\mathbb{P}\left(g \in \mathcal{B}_g^{t+1}, \forall t \geq 1\right) \geq 1-\delta . \tag{4.52}\]

Intuitively, the confidence set \(\mathcal{B}_g^{t+1}\) contains those functions whose log-likelihood is only ‘a little worse’ than that of the maximum likelihood estimator, and the theorem states that \(\mathcal{B}_g^{t+1}\) contains the ground-truth function \(g\) with high probability.

The inner maximization over \(\tilde{g}\) in the acquisition step of the algorithm can also be cast as a convex optimization problem (Xu et al. 2024, Eqs. 24–25). The outer maximization over \(\mathbf{x}\) can be solved by grid search, or via Eq. 26 of the same paper for medium-sized problems.

Algorithm: POP-BO (Xu et al. 2024)

Input: Initial point \(\mathbf{x_0} \in \mathcal{X}\); confidence set \(\mathcal{B}_g^1 = \mathcal{B}_g\)

For \(t = 1, 2, \ldots\):

  1. Set the reference point \(\mathbf{x_t^{\prime}} = \mathbf{x_{t-1}}\)
  2. Compute \(\mathbf{x_t} \in \arg\max_{\mathbf{x} \in \mathcal{X}} \max_{\tilde{g} \in \mathcal{B}_g^t} (\tilde{g}(\mathbf{x}) - \tilde{g}(\mathbf{x_t^{\prime}}))\), denoting the inner optimal function by \(\tilde{g}_t\)
  3. Obtain the outcome \(y_t\) of the duel and append the new data point to \(\mathcal{D}_t\)
  4. Update the maximum likelihood estimator \(\hat{g}_t^{\mathrm{MLE}}\) and the confidence set \(\mathcal{B}_g^{t+1}\)
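A toy sketch of this loop on a discretized domain follows, replacing the RKHS ball with a finite pool of random candidate functions and the threshold \(\beta_1\) with a fixed constant (both hypothetical simplifications for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

xs = np.linspace(0.0, 1.0, 51)                     # finite domain X
g_true = np.sin(3.0 * xs)                          # ground-truth utility (maximized)

# Stand-in for the RKHS ball B_g: a finite pool of random smooth candidates.
params = rng.uniform([1.0, 0.0], [5.0, 6.0], size=(300, 2))
pool = np.array([np.sin(a * xs + b) for a, b in params])

def log_lik(G, left, right, ys):
    """Bernoulli log-likelihood of duel outcomes under each candidate g (Eq. 4.49)."""
    p = sigmoid(G[:, left] - G[:, right])          # Bradley-Terry P(left wins)
    return (ys * np.log(p) + (1.0 - ys) * np.log(1.0 - p)).sum(axis=1)

left, right, ys = [], [], []
x_idx = 0                                          # initial point x_0
for t in range(1, 60):
    ll = log_lik(pool, np.array(left, int), np.array(right, int), np.array(ys))
    conf = pool[ll >= ll.max() - 2.0]              # likelihood-threshold confidence set
    ref = x_idx                                    # reference point x'_t = x_{t-1}
    x_idx = int(np.argmax((conf - conf[:, [ref]]).max(axis=0)))  # optimistic choice
    y = float(rng.random() < sigmoid(g_true[x_idx] - g_true[ref]))
    left.append(x_idx); right.append(ref); ys.append(y)

print("reported x:", xs[x_idx], "| true argmax:", xs[np.argmax(g_true)])
```

Candidates that rank points inconsistently with the observed duels fall out of the confidence set, so the optimistic choice concentrates near the maximizer of \(g\).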

qEUBO: Decision-Theoretic EUBO

qEUBO (Astudillo et al. 2023) derives an acquisition function that extends duels to queries of \(q \geq 2\) options. Let \(X=\left(\mathbf{x_1}, \ldots, \mathbf{x_q}\right) \in \mathcal{X}^q\) denote a query, and let \(g: \mathcal{X} \rightarrow \Re\) be the latent preference function. Then, after \(n\) user queries, we define the expected utility of the best option (qEUBO) as \[\mathrm{qEUBO}_n(X)=\mathbb{E}_n\left[\max \left\{g\left(x_1\right), \ldots, g\left(x_q\right)\right\}\right]. \tag{4.53}\]
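qEUBO is straightforward to estimate by Monte Carlo, given a Gaussian posterior over the utilities \(g(X)\) at the query points. The sketch below uses hypothetical query posteriors rather than a fitted model:

```python
import numpy as np

rng = np.random.default_rng(0)

def qeubo(mu, cov, n_samples=50_000):
    """MC estimate of qEUBO_n(X) = E_n[max_i g(x_i)] for jointly Gaussian g(X)."""
    samples = rng.multivariate_normal(mu, cov, size=n_samples)
    return samples.max(axis=1).mean()

# Two hypothetical q = 2 queries with identical means: the one with more posterior
# uncertainty offers a larger expected utility of its best option.
mu = np.zeros(2)
tight = qeubo(mu, 0.01 * np.eye(2))
loose = qeubo(mu, 1.00 * np.eye(2))
print(tight, loose)   # E[max of two iid N(0, s^2)] = s / sqrt(pi)
```

This also makes qEUBO's preference for informative queries concrete: all else equal, higher posterior uncertainty over the options raises the acquisition value.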

We now show that qEUBO is one-step Bayes optimal, meaning that each step chooses the query that maximizes the expected utility received by the human. For a query \(X \in \mathcal{X}^q\), let \[V_n(X)=\mathbb{E}_n\left[\max _{x \in \mathbb{X}} \mathbb{E}_{n+1}[g(x)] \mid X_{n+1}=X\right] . \tag{4.54}\] Then \(V_n(X)\) is the expected utility received if an additional query \(X_{n+1}=X\) is performed, and maximizing \(V_n\) is one-step Bayes optimal. Since \(\max _{x \in \mathbb{X}} \mathbb{E}_n[g(x)]\) does not depend on \(X_{n+1}\), we can equivalently maximize \[\mathbb{E}_n\left[\max _{x \in \mathbb{X}} \mathbb{E}_{n+1}[g(x)]-\max _{x \in \mathbb{X}} \mathbb{E}_n[g(x)] \mid X_{n+1}=X\right], \tag{4.55}\] which takes the same form as the knowledge gradient acquisition function (Wu and Frazier 2018) in standard Bayesian optimization.

\(V_n\) involves a nested stochastic optimization task, while qEUBO is a much simpler policy. When human responses are noise-free, we are able to use qEUBO as a sufficient policy due to the following theorem:

\[\underset{X \in \mathbb{X}^q}{\operatorname{argmax}} \mathrm{qEUBO}_n(X) \subseteq \underset{X \in \mathbb{X}^q}{\operatorname{argmax}} V_n(X) . \tag{4.56}\]

Proof sketch. The proof establishes two key inequalities. First, the one-step value \(V_n(X)\) is bounded above by \(\text{qEUBO}_n(X^+(X))\), where \(X^+(X)\) contains the best post-observation recommendations—because selecting the best option from a query can only be as good as already knowing the best option in the pool. Second, \(\text{qEUBO}_n(X) \leq V_n(X)\) because the user-selected option from the query may not be the global optimum. Combining these via a proof by contradiction shows that any maximizer of qEUBO is also a maximizer of the one-step Bayes-optimal value function \(V_n\). See Astudillo et al. (2023) for the full proof. \(\square\)

Therefore, maximizing \(\text{qEUBO}_n\) is sufficient for one-step Bayes optimality.

In experiments comparing qEUBO to other state-of-the-art acquisition functions, qEUBO consistently outperformed them on most problems, followed most closely by qEI and qTS. These results extend to queries with more than two options; in fact, regret converges faster when human queries contain more options. The regret analysis is presented in detail below.

qEI: Batch Expected Improvement

\[\begin{aligned} \mathrm{qEI}= & \mathbb{E}_{\mathbf{y}}\left[\left(\max _{i \in[1, \ldots, q]}\left(\mu_{\min }-y_i\right)\right)_{+}\right] \\ = & \sum_{i=1}^q \mathbb{E}_{\mathbf{y}}\left(\mu_{\min }-y_i \mid y_i \leq \mu_{\min }, y_i \leq y_j \forall j \neq i\right) \\ & p\left(y_i \leq \mu_{\min }, y_i \leq y_j \forall j \neq i\right) . \end{aligned} \tag{4.57}\]
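Here \(\mu_{\min}\) is the best (smallest) posterior mean so far and \(y_i\) the outcomes at the \(q\) batch points; note that \(\left(\max_i(\mu_{\min}-y_i)\right)_+ = \left(\mu_{\min}-\min_i y_i\right)_+\). A Monte Carlo sketch (with hypothetical posteriors), checked against the classic closed-form EI for \(q=1\):

```python
import numpy as np

rng = np.random.default_rng(0)

def qei(mu_min, mu, cov, n_samples=200_000):
    """MC estimate of qEI = E[(mu_min - min_i y_i)_+] for jointly Gaussian y."""
    y = rng.multivariate_normal(mu, cov, size=n_samples)
    return np.maximum(mu_min - y.min(axis=1), 0.0).mean()

# For q = 1 this reduces to classic EI; with mu_min = mu = 0 and unit variance
# the closed form is phi(0) = 1 / sqrt(2 * pi) ~= 0.399.
q1 = qei(0.0, np.zeros(1), np.eye(1))
q2 = qei(0.0, np.zeros(2), np.eye(2))     # a second candidate can only help
print(q1, q2)
```

The per-point decomposition in Eq. 4.57 conditions on which \(y_i\) is the batch minimum; the MC estimate above sidesteps those conditional expectations entirely.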

qTS: Batch Thompson Sampling

Algorithm: Thompson Sampling (sequential)

Input: Initial data \(\mathcal{D}_{\mathcal{I}(1)}=\{(\mathbf{x}_i, y_i)\}_{i \in \mathcal{I}(1)}\)

Repeat for \(t = 1, 2, \ldots\):

  1. Compute the current posterior \(p(\boldsymbol{\theta} \mid \mathcal{D}_{\mathcal{I}(t)})\)
  2. Sample \(\boldsymbol{\theta}\) from \(p(\boldsymbol{\theta} \mid \mathcal{D}_{\mathcal{I}(t)})\)
  3. Select \(k \leftarrow \arg \max_{j \notin \mathcal{I}(t)} \mathbb{E}[y_j \mid \mathbf{x}_j, \boldsymbol{\theta}]\)
  4. Collect \(y_k\) by evaluating \(f\) at \(\mathbf{x}_k\)
  5. Update \(\mathcal{D}_{\mathcal{I}(t+1)} \leftarrow \mathcal{D}_{\mathcal{I}(t)} \cup \{(\mathbf{x}_k, y_k)\}\)

Algorithm: qTS (batch)

Input: Initial data \(\mathcal{D}_{\mathcal{I}(1)}=\{(\mathbf{x}_i, y_i)\}_{i \in \mathcal{I}(1)}\); batch size \(S\)

Repeat for \(t = 1, 2, \ldots\):

  1. Compute the current posterior \(p(\boldsymbol{\theta} \mid \mathcal{D}_{\mathcal{I}(t)})\)
  2. For \(s = 1, \ldots, S\): sample \(\boldsymbol{\theta}\) from \(p(\boldsymbol{\theta} \mid \mathcal{D}_{\mathcal{I}(t)})\) and select \(k(s) \leftarrow \arg \max_{j \notin \mathcal{I}(t)} \mathbb{E}[y_j \mid \mathbf{x}_j, \boldsymbol{\theta}]\)
  3. Update \(\mathcal{D}_{\mathcal{I}(t+1)} = \mathcal{D}_{\mathcal{I}(t)} \cup \{(\mathbf{x}_{k(s)}, y_{k(s)})\}_{s=1}^S\)
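A minimal sketch of batch Thompson Sampling on Bernoulli arms with conjugate Beta posteriors (the arm means and batch size are hypothetical; the sampling-then-argmax step mirrors step 2 above):

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.2, 0.5, 0.8])          # hypothetical Bernoulli arm means
alpha, beta = np.ones(3), np.ones(3)        # Beta(1, 1) priors on each arm
S = 5                                       # batch size

for t in range(200):
    thetas = rng.beta(alpha, beta, size=(S, 3))   # one posterior sample per batch slot
    ks = thetas.argmax(axis=1)                    # S Thompson choices
    ys = (rng.random(S) < true_p[ks]).astype(float)
    np.add.at(alpha, ks, ys)                      # conjugate Beta update
    np.add.at(beta, ks, 1.0 - ys)

best = int(np.argmax(alpha / (alpha + beta)))
print("posterior means:", alpha / (alpha + beta), "| best arm:", best)
```

Drawing a fresh \(\boldsymbol{\theta}\) for each batch slot keeps the batch diverse; a single shared sample would make all \(S\) choices identical.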

4.6.3 Regret Analysis

qEUBO Regret

With the definition of Bayesian simple regret, the simple regret of qEUBO converges to zero at rate \(o(1/n)\), i.e.

\[\label{th:quebo_regret} \mathbb{E}\left[f\left(x^*\right)-f\left(\widehat{x}_n^*\right)\right]=o(1 / n)\]

where \(x^*=\operatorname{argmax}_{x \in \mathrm{X}} f(x)\) and \(\widehat{x}_n^* \in \operatorname{argmax}_{x \in \mathrm{X}} \mathbb{E}_n[f(x)]\).

This theorem holds under the following assumptions:

  1. \(f\) is injective: \(\mathbf{P}(f(x)=f(y))=0\) for any \(x, y \in \mathbb{X}\) with \(x \neq y\).

  2. The response \(r\) tends to select the preferred option: there exists \(a>1 / 2\) such that \(\mathbf{P}\left(r(X) \in \operatorname{argmax}_{i=1,2} f\left(x_i\right) \mid f(X)\right) \geq a\) for all \(X=\left(x_1, x_2\right) \in \mathbb{X}^2\) with \(x_1 \neq x_2\), almost surely under the prior on \(f\).

  3. Expected difference in utility is proportional to the probability of greater utility: there exist \(\Delta \geq \delta>0\) such that for all \(\mathcal{D}^{(n)}\) and all \(x, y \in \mathbb{X}\) (potentially depending on \(\mathcal{D}^{(n)}\)), \[\delta \mathbf{P}^{(n)}(f(x)>f(y)) \leq \mathbb{E}^{(n)}\left[\{f(x)-f(y)\}^{+}\right] \leq \Delta \mathbf{P}^{(n)}(f(x)>f(y)) \tag{4.58}\] almost surely under the prior on \(f\).

Further lemmas leading to a proof of Theorem [th:quebo_regret] are given in (Astudillo et al. 2023), Section B.

qEI Regret

The following theorem shows that, under the same assumptions used for the qEUBO regret bound, the simple regret of qEI can fail to converge to 0.

There exists a problem instance (i.e., \(\mathbb{X}\) and Bayesian prior distribution over \(f\)) satisfying the assumptions described in Theorem [th:quebo_regret] such that if the sequence of queries is chosen by maximizing qEI, then \(\mathbb{E}\left[f\left(x^*\right)-f\left(\widehat{x}_n^*\right)\right] \geq R\) for all \(n\), for a constant \(R>0\).

Proof sketch. The counterexample constructs four functions on \(\mathcal{X} = \{1,2,3,4\}\) with a carefully chosen prior. The key insight is that qEI optimizes improvement over the current best estimate, which in this construction is always \(x = 2\) (with \(f(2) = 0\)). Since comparing \(x = 3\) vs. \(x = 4\) always yields the highest expected improvement over \(f(2) = 0\), qEI perpetually selects the query \((3, 4)\)—never querying the pairs that would reveal whether the true optimum lies at \(x = 3\) or \(x = 4\). One can show by induction that this trapping behavior persists for all \(n\), yielding constant Bayesian simple regret \(R = p > 0\). In contrast, qEUBO avoids this trap because it maximizes the expected utility of the best option shown, which naturally encourages diverse queries. See Astudillo et al. (2023) Section B for the full induction argument. \(\square\)

POP-BO Regret

POP-BO achieves provable regret guarantees that depend on the choice of kernel. The cumulative regret bound involves the maximum information gain \(\gamma_T^{ff'}\), which measures the complexity of the function class defined by the kernel.

With probability at least \(1-\delta\), the cumulative regret of POP-BO satisfies \[R_T=\mathcal{O}\left(\sqrt{\beta_T \gamma_T^{f f^{\prime}} T}\right), \tag{4.59}\] and the simple regret of the reported solution converges as \[g\left(x^{\star}\right)-g\left(x_{t^{\star}}\right) \leq \mathcal{O}\left(\frac{\sqrt{\beta_T \gamma_T^{f f^{\prime}}}}{\sqrt{T}}\right), \tag{4.60}\] where \(x^{\star}\) is the maximizer of \(g\).

The kernel-specific rates reveal how the smoothness of the preference function affects convergence:

For POP-BO with \(\epsilon=1/T\):

  1. Linear kernel \(k(x, y)=\langle x, y\rangle\): \(R_T=\mathcal{O}\left(T^{3 / 4}(\log T)^{3 / 4}\right)\)
  2. Squared exponential kernel: \(R_T=\mathcal{O}\left(T^{3 / 4}(\log T)^{\frac{3}{4}(d+1)}\right)\)
  3. Matérn kernel (smoothness \(\nu\)): \(R_T=\mathcal{O}\left(T^{3 / 4}(\log T)^{3 / 4} T^{\frac{d}{\nu}\left(\frac{1}{4}+\frac{d+1}{4+2(d+1)^d / \nu}\right)}\right)\)

The rates share a \(T^{3/4}\) base rate, with kernel-dependent logarithmic or polynomial corrections. Smoother kernels (larger \(\nu\) in Matérn, or SE which corresponds to \(\nu \to \infty\)) yield tighter bounds, reflecting the intuition that smoother preference functions are easier to optimize from pairwise comparisons. Compared to standard BO (which achieves \(T^{1/2}\) rates), the extra \(T^{1/4}\) factor is the price of learning from preference feedback rather than direct function evaluations.

Code: Full PBO Simulation

This simulation demonstrates Preferential Bayesian Optimization finding the optimum of a latent function through pairwise comparisons.

4.7 Reinforcement Learning Foundations for Preference Optimization (Optional)

Note

This section is optional and covers reinforcement learning (RL) foundations that underlie modern preference optimization methods like RLHF, PPO, and GRPO. It provides the formal framework for CIRL (next section) and connects the bandit methods from earlier sections to the full sequential setting used in language model alignment. Readers familiar with RL may wish to skim Sections 4.7.1–4.7.2 and focus on Sections 4.7.3–4.7.5, which develop the policy gradient theorem and GRPO in the preference optimization context.

The bandit and Bayesian optimization methods developed so far operate in single-step or myopic settings: at each round, the learner selects an arm or a query pair, observes feedback, and updates its model. But the RLHF pipeline introduced in Chapter 1 and the CIRL framework in the next section both require reasoning over sequences of decisions, where early actions influence future states.

This section develops the RL machinery needed for these settings, culminating in GRPO (Group Relative Policy Optimization)—a modern method that directly connects policy gradient optimization to the Bradley-Terry preference model from Chapter 1. The progression from bandits to full RL mirrors this chapter’s arc from simple to complex decision-making under uncertainty.

4.7.1 Markov Decision Processes

The Markov Decision Process (MDP) generalizes the bandit setting by introducing states that evolve over time based on the learner’s actions.

Definition: Markov Decision Process (MDP)

An MDP is a tuple \((\mathcal{S}, \mathcal{A}, P, R, \gamma, \rho_0)\) where:

  • \(\mathcal{S}\): state space
  • \(\mathcal{A}\): action space
  • \(P(s' \mid s, a)\): transition probability from state \(s\) to \(s'\) under action \(a\)
  • \(R(s, a)\): reward function, \(R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}\)
  • \(\gamma \in [0, 1]\): discount factor
  • \(\rho_0\): initial state distribution

A policy \(\pi_\theta(a \mid s)\) maps states to distributions over actions, parameterized by \(\theta\). An agent following policy \(\pi\) generates a trajectory: \[ \tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T), \quad s_0 \sim \rho_0, \; a_t \sim \pi(\cdot \mid s_t), \; s_{t+1} \sim P(\cdot \mid s_t, a_t) \tag{4.61}\]

The RL objective is to find the policy that maximizes expected cumulative reward: \[ J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t)\right] \tag{4.62}\]

When \(T = 0\) (a single step with no state transitions), the MDP reduces to the bandit setting from earlier sections. The RL objective then reduces to maximizing expected immediate reward \(\mathbb{E}_{a \sim \pi_\theta}[R(a)]\), recovering the reward maximization problem from Section 4.3.

Example: LLM Generation as an MDP

Autoregressive text generation fits naturally into the MDP framework:

  • State \(s_t\): the prompt \(x\) concatenated with tokens generated so far, \((x, y_1, \ldots, y_t)\)
  • Action \(a_t\): the next token \(y_{t+1}\) from the vocabulary \(\mathcal{V}\)
  • Transition: deterministic—the new state appends the chosen token
  • Reward: \(R(s_t, a_t) = 0\) for all intermediate steps; at the terminal step \(T\), the reward is \(r_\phi(x, y)\) from a learned reward model (e.g., trained on Bradley-Terry preferences as in Chapter 1)
  • Discount: \(\gamma = 1\) (finite horizon, all tokens matter equally)

The policy \(\pi_\theta(a_t \mid s_t)\) is the language model itself, producing a distribution over the next token given the context.

The next section introduces CIRL, which extends this single-agent MDP to a two-player cooperative game where a human and robot jointly optimize a shared reward.

4.7.2 Value Functions and the Advantage

To understand how much better one action is compared to others, we define three related quantities.

The state-value function measures the expected return starting from state \(s\) and following policy \(\pi\): \[ V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t) \;\middle|\; s_0 = s\right] \tag{4.63}\]

The action-value function additionally conditions on taking a specific first action: \[ Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t) \;\middle|\; s_0 = s, a_0 = a\right] \tag{4.64}\]

The advantage function is the difference: \[ A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s) \tag{4.65}\]

The advantage tells us how much better action \(a\) is compared to the average action under the current policy at state \(s\). By definition, \(\mathbb{E}_{a \sim \pi(\cdot|s)}[A^\pi(s,a)] = 0\)—some actions have positive advantage (better than average) and others negative. This quantity will reappear as the key ingredient in GRPO’s group-relative baseline.

These quantities satisfy the Bellman equation, which recursively relates the value of a state to the values of its successors: \[ V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[R(s,a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot|s,a)}[V^\pi(s')]\right] \tag{4.66}\]
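These definitions can be checked numerically by iterating the Bellman operator to its fixed point on a toy MDP (all transition probabilities and rewards below are hypothetical):

```python
import numpy as np

# A tiny 2-state, 2-action MDP (all numbers hypothetical).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                    # R[s, a]
              [0.0, 2.0]])
gamma = 0.9
pi = np.array([[0.5, 0.5], [0.5, 0.5]])      # a fixed uniform policy

# Iterate the Bellman operator for V^pi until convergence (Eq. 4.66).
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * (P @ V)                  # Q[s, a] = R + gamma * E[V(s')]
    V = (pi * Q).sum(axis=1)                 # average over the policy's actions

A = Q - V[:, None]                           # advantage A^pi(s, a) = Q - V
print(V, A)
```

At the fixed point, \(V\) satisfies the Bellman equation exactly, and the advantages average to zero under the policy, as claimed above.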

4.7.3 The Policy Gradient Theorem

We now derive the central result that enables gradient-based policy optimization. The challenge is that the RL objective Equation 4.62 depends on \(\theta\) both through the policy \(\pi_\theta\) (which determines which actions are taken) and through the resulting trajectory distribution (which determines which states are visited). The policy gradient theorem shows that we can compute gradients using only the policy’s log-probabilities, without differentiating through the environment dynamics.

The score function trick. For any parameterized distribution \(\pi_\theta(a \mid s)\), we can write: \[ \nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s) \tag{4.67}\]

This identity converts a gradient of a probability into an expectation-friendly form. The term \(\nabla_\theta \log \pi_\theta(a \mid s)\) is called the score function—the same quantity that appears in the Fisher information matrix (Chapter 3), though here it serves a different purpose: instead of measuring information content, it tells us how to adjust the policy to increase reward.

Trajectory probability. The probability of a trajectory under policy \(\pi_\theta\) factors as: \[ p(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t) \, P(s_{t+1} \mid s_t, a_t) \tag{4.68}\]

Taking the log-gradient, the transition terms \(P(s_{t+1} \mid s_t, a_t)\) vanish because they do not depend on \(\theta\): \[ \nabla_\theta \log p(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \tag{4.69}\]

This is the key insight: the environment dynamics drop out of the gradient, so we can estimate policy gradients without knowing or differentiating through the transition function \(P\).
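This cancellation is easy to verify numerically: for a softmax policy and a fixed trajectory, the finite-difference gradient of the full trajectory log-probability (transition terms included) matches the sum of policy scores. The policy, dynamics, and trajectory below are all hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([[0.3, -0.2], [0.1, 0.5]])   # per-state action logits (2 states, 2 actions)
P_env = np.array([[[0.7, 0.3], [0.4, 0.6]],   # fixed dynamics P[s, a, s'], theta-independent
                  [[0.5, 0.5], [0.2, 0.8]]])
states, actions = [0, 1, 0], [1, 0, 0]        # a fixed trajectory of (s_t, a_t) pairs

def log_p_tau(th):
    lp = sum(np.log(softmax(th[s])[a]) for s, a in zip(states, actions))
    # transition terms for s0=0 -> s1=1 -> s2=0: constants in theta
    return lp + np.log(P_env[0, 1, 1]) + np.log(P_env[1, 0, 0])

# Analytic score: grad_theta log pi(a|s) = one_hot(a) - softmax(theta[s])  (Eq. 4.69)
grad = np.zeros_like(theta)
for s, a in zip(states, actions):
    grad[s] -= softmax(theta[s])
    grad[s, a] += 1.0

# Finite-difference gradient of the full trajectory log-probability.
eps, fd = 1e-6, np.zeros_like(theta)
for idx in np.ndindex(*theta.shape):
    d = np.zeros_like(theta)
    d[idx] = eps
    fd[idx] = (log_p_tau(theta + d) - log_p_tau(theta - d)) / (2 * eps)

print("max |analytic - finite difference|:", np.abs(grad - fd).max())
```

The transition terms contribute only an additive constant to \(\log p(\tau \mid \theta)\), so the two gradients agree to numerical precision.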

Theorem: Policy Gradient

For a parameterized policy \(\pi_\theta\) with RL objective \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\sum_{t=0}^{T} \gamma^t R(s_t, a_t)]\): \[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t\right] \tag{4.70}\] where \(G_t = \sum_{t'=t}^{T} \gamma^{t'-t} R(s_{t'}, a_{t'})\) is the return (cumulative discounted reward) from time step \(t\) onward.

Proof sketch. Write \(J(\theta) = \mathbb{E}_{\tau \sim p(\cdot|\theta)}[R(\tau)]\) where \(R(\tau) = \sum_t \gamma^t R(s_t, a_t)\). Apply the score function trick to the trajectory distribution: \(\nabla_\theta J = \mathbb{E}_\tau[R(\tau) \nabla_\theta \log p(\tau|\theta)]\). Substitute Equation 4.69. Finally, note that actions at time \(t\) cannot influence rewards at earlier times, so we can replace \(R(\tau)\) with the future return \(G_t\) for each term in the sum. \(\square\)

4.7.4 REINFORCE and Variance Reduction

The policy gradient theorem immediately yields the simplest policy gradient algorithm.

Algorithm: REINFORCE (Williams 1992)

Input: Parameterized policy \(\pi_\theta\), learning rate \(\alpha\)

Repeat:

  1. Sample trajectory \(\tau = (s_0, a_0, \ldots, s_T, a_T) \sim \pi_\theta\)
  2. For each time step \(t = 0, \ldots, T\):
    • Compute return \(G_t = \sum_{t'=t}^{T} \gamma^{t'-t} R(s_{t'}, a_{t'})\)
  3. Update: \(\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t\)

REINFORCE is an unbiased estimator of the policy gradient, but it suffers from high variance: a single trajectory can produce wildly different gradient estimates depending on the stochastic outcomes. This motivates variance reduction through baselines.

Baseline subtraction. We can subtract any function \(b(s_t)\) that depends only on the state (not the action) from the return without introducing bias: \[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, (G_t - b(s_t))\right] \tag{4.71}\]

Why is this unbiased? For any state-dependent baseline \(b(s)\): \[ \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot b(s)\right] = b(s) \cdot \nabla_\theta \sum_{a} \pi_\theta(a \mid s) = b(s) \cdot \nabla_\theta \, 1 = 0 \tag{4.72}\]

The probabilities sum to 1 regardless of \(\theta\), so the gradient of that sum is zero.

The variance-minimizing baseline turns out to be \(b^*(s_t) = V^\pi(s_t)\), which reduces the gradient to use the advantage \(A^\pi(s_t, a_t) = G_t - V^\pi(s_t)\). Intuitively, we only reinforce actions that are better than average—actions with positive advantage get upweighted, while those with negative advantage get downweighted.
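Both claims, unbiasedness and variance reduction, can be checked numerically on a one-state (bandit) example with a softmax-style score; the action rewards are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.full(3, 1.0 / 3.0)                     # current (uniform) policy over 3 actions
r_mean = np.array([1.0, 2.0, 3.0])            # hypothetical per-action mean rewards

n = 200_000
a = rng.choice(3, size=n, p=p)                # sampled actions
r = r_mean[a] + rng.normal(0.0, 0.1, size=n)  # noisy observed rewards

score = -np.tile(p, (n, 1))                   # grad_logits log pi(a) = one_hot(a) - p
score[np.arange(n), a] += 1.0

est_plain = score * r[:, None]                # REINFORCE terms, no baseline
est_base = score * (r - r.mean())[:, None]    # baseline b = average reward

g_plain, g_base = est_plain.mean(axis=0), est_base.mean(axis=0)
v_plain, v_base = est_plain.var(axis=0).sum(), est_base.var(axis=0).sum()
print(g_plain, g_base)                        # both ~ [-1/3, 0, 1/3]
print(v_plain, v_base)                        # baseline shrinks the variance
```

Both estimators target the true gradient \(p_i(r_i - \bar{r})\), but centering the rewards shrinks the per-sample variance substantially.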

Caution

In practice, \(V^\pi(s)\) is unknown and must itself be estimated, typically by training a separate neural network (the critic or value network) alongside the policy. This adds computational cost and can introduce its own approximation errors. GRPO, discussed next, avoids this entirely.

4.7.5 From RLHF to GRPO

We now connect the policy gradient framework to the RLHF pipeline introduced in Chapter 1 and develop GRPO as a modern, critic-free alternative to PPO.

The RLHF objective. Recall from Chapter 1 that RLHF trains a reward model \(r_\phi(x, y)\) on human preference data using the Bradley-Terry likelihood, then optimizes the policy to maximize reward while staying close to a reference policy \(\pi_{\text{ref}}\): \[ J_{\text{RLHF}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\left[r_\phi(x, y)\right] - \beta \, \text{KL}\left[\pi_\theta(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\right] \tag{4.73}\]

The KL penalty prevents reward hacking: without it, the policy could exploit imperfections in the learned reward model \(r_\phi\) to achieve high reward scores that do not correspond to genuinely preferred outputs (see Chapter 7 for empirical discussion).

PPO. Proximal Policy Optimization (Schulman et al. 2017) applies the policy gradient theorem to the RLHF objective using a clipped surrogate that prevents excessively large policy updates. PPO requires a learned value function \(V_\psi(s)\) (the critic) to estimate advantages. Training this critic adds computational cost—roughly doubling the model parameters in the LLM setting, since the critic is typically the same size as the policy. Trust Region Policy Optimization [TRPO; Schulman et al. (2015)] preceded PPO with a similar motivation but used a hard KL constraint instead of clipping.

GRPO: Group Relative Policy Optimization. GRPO (Shao et al. 2024) eliminates the critic by replacing the learned value baseline with a group-relative baseline computed from multiple sampled outputs. The key idea is conceptually parallel to the dueling bandit paradigm from earlier in this chapter: rather than estimating absolute values (which requires a critic), we compare outputs within a group to determine which are relatively better.

Algorithm: Group Relative Policy Optimization (GRPO)

Input: Policy \(\pi_\theta\), reference policy \(\pi_{\text{ref}}\), reward model \(r_\phi\), group size \(G\), clip range \(\epsilon\), KL weight \(\beta\)

For each prompt \(x \sim \mathcal{D}\):

  1. Sample a group of outputs: \(\{y_1, \ldots, y_G\} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)\)
  2. Score each output: \(r_i = r_\phi(x, y_i)\) for \(i = 1, \ldots, G\)
  3. Normalize to get group-relative advantages: \[ \hat{A}_i = \frac{r_i - \text{mean}(\{r_j\}_{j=1}^G)}{\text{std}(\{r_j\}_{j=1}^G)} \tag{4.74}\]
  4. Update the policy by maximizing: \[ \mathcal{L}_{\text{GRPO}}(\theta) = \frac{1}{G}\sum_{i=1}^{G} \min\!\left(\rho_i \hat{A}_i, \; \text{clip}(\rho_i, 1{-}\epsilon, 1{+}\epsilon) \, \hat{A}_i\right) - \beta \, \text{KL}[\pi_\theta \| \pi_{\text{ref}}] \tag{4.75}\] where \(\rho_i = \pi_\theta(y_i \mid x) \,/\, \pi_{\theta_{\text{old}}}(y_i \mid x)\) is the importance ratio.

The z-score normalization in Equation 4.74 plays the same role as pairwise comparisons in dueling bandits: it tells us which outputs are relatively better within the group, without requiring an absolute value estimate. When the group mean is high (all outputs are good), only the best outputs receive positive advantage; when the group mean is low, even mediocre outputs can be positively reinforced.
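Steps 3–4 of the algorithm are easy to state in code. Below is a sketch of the group-relative advantage and the per-output clipped term (the reward scores are hypothetical, and the KL penalty is omitted for brevity):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages via z-score normalization (Eq. 4.74)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / r.std()

def grpo_term(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate for a single output (Eq. 4.75, KL term omitted)."""
    rho = np.exp(logp_new - logp_old)          # importance ratio
    return np.minimum(rho * adv, np.clip(rho, 1 - eps, 1 + eps) * adv)

rewards = [1.0, 3.0, 2.0, 6.0]                 # hypothetical reward-model scores
A = grpo_advantages(rewards)
print(A)                                       # zero mean, unit std within the group
```

The clip keeps positive-advantage updates from growing without bound, while the outer `min` leaves the pessimistic (unclipped) branch in place for negative advantages.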

Key Insight: GRPO and Dueling Bandits

GRPO’s group-relative baseline is the RL analog of the dueling bandit paradigm. In dueling bandits, we never observe absolute arm rewards—only pairwise comparisons. Similarly, GRPO never estimates absolute state values—only relative quality within a sampled group. This connection runs deeper: the Bradley-Terry reward model \(r_\phi\) that scores GRPO’s outputs is the same preference model that defines the dueling bandit’s comparison probabilities. The circle from preference models (Chapter 1) through sequential decision-making (this chapter) back to policy optimization closes here.

Connection to DPO. When \(G = 2\) and the reward comes from a Bradley-Terry model, GRPO’s advantage reduces to \(\hat{A}_1 = -\hat{A}_2 = \operatorname{sign}\left(r_\phi(x, y_1) - r_\phi(x, y_2)\right)\): the z-score of a two-element group retains only the sign of the reward difference. This reward difference is exactly the quantity that determines preference probabilities in the Bradley-Terry model: \(p(y_1 \succ y_2 \mid x) = \sigma(r_\phi(x, y_1) - r_\phi(x, y_2))\). DPO optimizes this signal directly on a fixed offline dataset, while GRPO generates fresh samples during training and generalizes to groups larger than 2. The online sampling allows GRPO to explore the policy’s current output distribution, while DPO is limited to the distribution that generated the training data.

Code: REINFORCE vs. GRPO on a Preference Bandit

4.7.6 Connections

The progression from bandits (single-step) to RL (sequential) to preference-based RL (RLHF, GRPO) mirrors this chapter’s arc from simple to complex decision-making. The policy gradient theorem (Equation 4.70) provides the mathematical bridge: it tells us how to optimize any parameterized policy using only sampled trajectories and rewards. When those rewards come from a learned preference model (Bradley-Terry), we recover the modern alignment pipeline.

  • Forward to CIRL (next section): The MDP framework developed here provides the formal setting for CIRL, where the single-agent MDP becomes a two-player cooperative game with shared rewards.
  • Forward to Chapter 7: Chapter 7 discusses the practical lessons from deploying RLHF and GRPO at scale, including reward hacking and the gap between learned and true rewards.
  • Back to Chapter 3: The log-linear policy assumption used in ADPO—\(\pi(y \mid x; \theta) \propto \exp(\phi(x,y)^\top \theta)\)—is a special case of the parameterized policy \(\pi_\theta\), specifically the softmax policy class common in RL.
Key Takeaway: From Bandits to RL to Preferences

GRPO further simplifies the RL pipeline by replacing the value function with group-relative comparisons, connecting RL optimization back to the pairwise comparison paradigm that runs throughout this book. The key equations—Bradley-Terry preference probabilities (Chapter 1), Fisher information and score functions (Chapter 3), and the policy gradient theorem (this section)—all share the same mathematical core: the log-derivative of a parameterized probability model.

4.8 Human-Agent Cooperation and CIRL

The algorithms discussed so far—Thompson Sampling, dueling bandits, Preferential Bayesian Optimization, and the RL methods in the preceding section—all share a common assumption: the human is a passive oracle who responds to queries but does not actively shape the learning process. In practice, this assumption often breaks down. A user choosing between LLM responses may deliberately select the more challenging example to teach the system. A patient in a clinical trial may modify their behavior based on what they believe the system is learning. Humans are active agents whose feedback depends on their beliefs about how it will be used.

This section introduces Cooperative Inverse Reinforcement Learning (CIRL), a framework that models human and agent as jointly optimizing a shared objective. Where Thompson Sampling optimizes reward under uncertainty about the reward function, CIRL optimizes under uncertainty about the human’s preferences—and crucially, the human knows this and can act to reduce that uncertainty.

4.8.1 The CIRL Framework

In standard Inverse Reinforcement Learning (IRL), the agent learns a reward function from observing an expert’s behavior, then optimizes it. The expert demonstrates; the agent imitates. CIRL extends this to a two-player cooperative game:

  • Human H: Knows the true reward parameters \(\theta \in \Theta\)
  • Robot R: Does not know \(\theta\), but must act to maximize reward
  • Shared payoff: Both players receive \(R(s, a_H, a_R; \theta)\)—their interests are aligned

The key insight is that the human’s optimal policy is not merely to demonstrate good behavior but to teach the robot about \(\theta\). Sometimes instructive actions (that reveal information) dominate expert demonstrations (that maximize immediate reward).

CIRL Formal Definition

A CIRL game extends the single-agent MDP (Equation 4.62) to a two-player cooperative setting:

  • State space \(\mathcal{S}\)
  • Action spaces \(\mathcal{A}_H\) (human) and \(\mathcal{A}_R\) (robot)
  • Transition function \(T(s' | s, a_H, a_R)\)
  • Reward function \(R(s, a_H, a_R; \theta)\) parameterized by \(\theta\)
  • Prior \(P_0(\theta)\) over reward parameters
  • Discount factor \(\gamma \in (0, 1)\)

At each time \(t\), both observe \((s_t, a_H^{t-1}, a_R^{t-1})\). Only H observes \(\theta\). The robot maintains a belief \(b_R^t(\theta)\) updated via Bayes’ rule. The key difference from the standard MDP is that the reward parameter \(\theta\) is observed only by the human, making this a game of asymmetric information.
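The robot’s belief update can be sketched concretely. The setup below is a hypothetical one-state game with a discrete \(\Theta\) of two reward parameters and a Boltzmann-rational human model for the likelihood \(P(a_H \mid \theta)\); none of these modeling choices are fixed by the CIRL definition.

```python
import numpy as np

# Hypothetical discrete setup: two candidate reward parameters, three human actions.
thetas = [0, 1]                    # Θ = {θ0, θ1}
# R[θ][a_H]: shared reward for each human action under each parameter.
R = np.array([[1.0, 0.0, 0.2],     # rewards if θ = θ0
              [0.0, 1.0, 0.2]])    # rewards if θ = θ1

def human_policy(theta, beta=3.0):
    """Boltzmann-rational human: P(a_H | θ) ∝ exp(β R(a_H; θ))."""
    logits = beta * R[theta]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def update_belief(belief, a_H, beta=3.0):
    """Robot's Bayes update: b'(θ) ∝ P(a_H | θ) b(θ)."""
    lik = np.array([human_policy(th, beta)[a_H] for th in thetas])
    post = lik * belief
    return post / post.sum()

b = np.array([0.5, 0.5])           # prior P0(θ)
for a_H in [1, 1, 2]:              # observed human actions
    b = update_belief(b, a_H)
print("posterior over θ:", b)
```

Observing action 1 twice concentrates the belief on \(\theta_1\), while action 2 (equally rewarding under both hypotheses) leaves the belief unchanged, a first glimpse of why some actions teach and others do not.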

4.8.2 Teaching vs. Demonstrating

A central result in CIRL is that expert demonstrations can be suboptimal for joint value. Consider:

  • Demonstration-by-expert: Human maximizes task reward at each step
  • Instructive teaching: Human sometimes sacrifices immediate reward to shape robot’s belief

When the robot is uncertain, instructive actions that reveal \(\theta\) can lead to higher cumulative reward than always acting optimally given \(\theta\).

Code: Teaching vs Demonstrating Simulation
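A minimal simulation of the contrast, under assumed toy settings: two reward hypotheses that agree on the myopically optimal action (so expert demonstrations are uninformative) but disagree on the second-best action (so a small reward sacrifice reveals \(\theta\)). The Boltzmann-rational human model and all constants are illustrative.

```python
import numpy as np

# Hypothetical two-hypothesis CIRL stage game. Rows: θ; columns: human actions.
R = np.array([[1.0, 0.8, 0.0],    # θ0: action 0 best, action 1 second-best
              [1.0, 0.0, 0.8]])   # θ1: action 0 best, action 2 second-best

def likelihood(a_H, beta=5.0):
    """P(a_H | θ) for a Boltzmann-rational human, evaluated for each θ."""
    logits = beta * R
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return p[:, a_H]

def posterior(actions, prior=np.array([0.5, 0.5])):
    """Robot's belief after a sequence of observed human actions."""
    b = prior.copy()
    for a in actions:
        b *= likelihood(a)
        b /= b.sum()
    return b

# Demonstration-by-expert: always the myopically optimal action (0 under both θ).
b_demo = posterior([0, 0, 0])
# Instructive teaching: sacrifice reward once to play the θ0-revealing action 1.
b_teach = posterior([1, 0, 0])

print("belief after demonstrations:", b_demo)   # stays near the prior
print("belief after teaching:      ", b_teach)  # concentrates on θ0
```

Here the demonstrator’s action is optimal under both hypotheses, so the robot’s belief never moves; a single instructive action costing 0.2 in immediate reward pins down \(\theta\) almost completely.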

4.8.3 Choice Architecture

As designers of preference learning systems, we are choice architects: the options we present and how we present them influence human behavior. Research in behavioral economics documents several effects:

  • Order effects: Items presented first or last receive disproportionate attention
  • Framing effects: How options are described changes perceived value
  • Default effects: Pre-selected options are chosen more often

These effects are typically not captured in standard choice models (which assume IIA or similar properties). Responsible deployment of preference learning systems requires awareness of these biases.

Key Takeaway: Beyond Passive Oracles

The transition from measurement to decision-making revealed that uncertainty should be managed, not just reduced. The transition from passive oracles to active humans reveals that humans themselves are strategic agents whose behavior depends on their beliefs about the system. CIRL provides a principled framework for this cooperative setting.

Assumption: Approximately Rational Humans

CIRL’s cooperative framework assumes the human acts approximately optimally given their beliefs about the robot’s learning process. If the human is systematically biased—exhibiting anchoring effects, framing dependence, or status quo bias—the robot may learn incorrect preferences from the human’s demonstrations. For example, a human who consistently overweights recent experiences (recency bias) will provide teaching signals that distort the learned reward function. More fundamentally, CIRL assumes the human wants to teach the robot, but in practice humans may be inattentive, adversarial, or simply unaware that their behavior is being interpreted as preference data. The choice architecture effects noted above compound this issue: the system’s interface can systematically bias the human’s responses in ways that CIRL does not account for.

Looking Ahead: From Individual Decisions to Collective Preferences

This chapter developed methods for sequential decision-making with preference feedback from a single user. But what happens when multiple users have different preferences? Chapter 5 (Aggregation) addresses this challenge using social choice theory and mechanism design. The winner concepts introduced here—Condorcet, Borda, von Neumann—will reappear in that context. In particular, a key result in Chapter 5 shows that DPO implicitly optimizes the Borda score when aggregating across annotators, connecting the dueling bandit framework developed here to the social choice theory of preference aggregation. Chapter 6 then asks the normative question: whose preferences should we optimize for, and what fairness constraints should apply?

4.9 Summary

  • This chapter shifts perspective from measuring preferences (Chapter 3) to making decisions under preference uncertainty—managing uncertainty to maximize cumulative reward rather than merely reducing it.
  • Thompson Sampling (“sample a plausible world, act optimally within it”) naturally balances exploration and exploitation by sampling from the posterior. Combined with the Laplace approximation, it handles both linear and nonlinear (GP-based) preference models.
  • Dueling bandits formalize decision-making when only pairwise comparisons are observable. Three winner concepts—Condorcet (beats all others), Borda (highest average win rate), and von Neumann (minimax optimal mixed strategy)—can diverge, and the choice of winner concept shapes what “regret” means.
  • Preferential Bayesian Optimization (PBO) tackles the continuous-armed case where the goal is to find the best option in a large or infinite space using only preference feedback. The qEUBO acquisition function provably converges while the seemingly natural qEI can fail.
  • Regret bounds for PBO scale as \(R_T = O(\sqrt{\beta_T \gamma_T T})\), where \(\gamma_T\) is the maximum information gain—providing convergence guarantees that depend on the kernel’s complexity.
  • Cooperative Inverse Reinforcement Learning (CIRL) models humans as strategic agents who actively teach rather than passively demonstrate, enabling more efficient preference communication through cooperative interaction.
  • (Optional) The policy gradient theorem provides the mathematical bridge from bandit methods to full RL-based preference optimization. GRPO eliminates the need for a learned value function by using group-relative advantages, connecting RL optimization back to the pairwise comparison paradigm.
  • As system designers, we are choice architects: order effects, framing, and defaults influence human choices in ways that standard models (IIA) do not capture, requiring awareness of these biases in deployment.

4.10 Quick Check

Test your understanding before proceeding to the exercises.

  1. Thompson Sampling draws from the posterior and acts greedily under the sample. Why does this produce exploration, even though each individual decision is “greedy”?

  2. Construct a 3-arm preference matrix where the Condorcet winner and Borda winner differ. (Hint: the Condorcet winner beats all others head-to-head but may have low average win rates.)

  3. Why does the qEI acquisition function fail to converge in some settings while qEUBO provably converges? What is the key difference in their objectives?

  4. In CIRL, what changes if the human acts as a teacher (choosing informative demonstrations) rather than a demonstrator (acting optimally for themselves)?

  5. (Optional) The policy gradient theorem uses \(\nabla_\theta \log \pi_\theta(a \mid s)\) to estimate the gradient of the RL objective. Why do the environment’s transition dynamics \(P(s' \mid s, a)\) not appear in the gradient, even though they affect the trajectory distribution?

If you read only a handful of papers from this chapter’s references, make them these seven:

  1. Thompson (1933) — Thompson Sampling: the original paper on probability matching for sequential decisions
  2. Russo et al. (2018) — A modern tutorial on Thompson Sampling with theory and applications
  3. Sui et al. (2017) — Multi-dueling bandits: extending pairwise comparisons to multi-way preference feedback
  4. Brochu, Cora, and de Freitas (2010) — A tutorial on Bayesian optimization, the foundation for preferential BO
  5. Hadfield-Menell et al. (2016) — CIRL: cooperative inverse reinforcement learning for value alignment
  6. Sutton and Barto (2018) — Reinforcement Learning: An Introduction, the standard reference for MDP formulations and policy gradient methods
  7. Shao et al. (2024) — DeepSeekMath: the paper introducing GRPO for LLM alignment

4.11 Discussion Questions

  1. Measurement vs. Optimization: How does the reward maximization perspective differ from the measurement perspective on preference learning? When might one approach be more appropriate than the other?

  2. Thompson Sampling Intuition: Explain in your own words how Thompson Sampling achieves the exploration-exploitation tradeoff. Why does sampling from the posterior distribution lead to exploration in uncertain regions?

  3. Laplace Approximation: Why is the Laplace approximation necessary for Thompson Sampling with binary feedback? What are the potential limitations of this approximation?

  4. Regret Definitions: Compare and contrast strong regret, weak regret, and Bayesian regret in the dueling bandit setting. Give an example scenario where each would be the most appropriate metric.

  5. Winner Concepts: When might the Condorcet winner, Borda winner, and von Neumann winner diverge? Construct a simple example with 3-4 alternatives where these concepts give different answers.

  6. Acquisition Functions: What are the key tradeoffs between the UCB and Thompson Sampling acquisition functions? How does Pure Exploration differ in its objectives?

  7. qEUBO vs. qEI: Why does qEUBO provably converge while qEI can fail to converge under the same assumptions? What does this suggest about designing acquisition functions for preferential feedback?

  8. Copeland Score: How does the soft-Copeland score relate to finding the Condorcet winner? Why might we prefer the soft version over the hard indicator-based version?

  9. Contextual Bandits: How do contextual bandits extend the basic multi-armed bandit framework? Give three real-world examples where context is crucial for making good decisions.

  10. Exploration in Practice: In recommendation systems, how might information asymmetry be used ethically to encourage exploration? What are the potential risks of this approach?

  11. Bayesian Optimization Connection: Explain the connection between Preferential Bayesian Optimization and traditional Bayesian optimization. How does the preference feedback model change the inference problem?

  12. Scalability Challenges: What are the main computational challenges in scaling dueling bandit and PBO algorithms to large action spaces? How might these be addressed?

  13. (Optional) GRPO vs. DPO: Both GRPO and DPO optimize a policy using preference data. DPO is “offline” (optimizes directly on a fixed dataset), while GRPO is “online” (generates new samples during training). What are the tradeoffs? When might each be preferred?

  14. (Optional) Variance Reduction: GRPO uses a group-relative baseline while classical REINFORCE uses a learned value function baseline. Compare the computational and statistical tradeoffs. How does the group size \(G\) affect the bias-variance tradeoff?

4.12 Further Reading

For readers who want to go deeper into the topics introduced in this chapter, we recommend the following:

  • Russo et al. (2018), “A Tutorial on Thompson Sampling” — A modern tutorial covering the theory, applications, and extensions of Thompson Sampling, including connections to information-directed sampling and contextual bandits.
  • Bengs et al. (2021), “Preference-Based Online Learning with Dueling Bandits: A Survey” — A comprehensive survey of dueling bandit algorithms and their regret bounds, covering Condorcet, Borda, and von Neumann winner concepts and the algorithms designed for each.
  • Lattimore and Szepesvári (2020), Bandit Algorithms — The definitive modern textbook on bandit theory, providing rigorous treatments of regret bounds, exploration strategies, and the information-theoretic limits of sequential decision-making.
  • Shahriari et al. (2016), “Taking the Human Out of the Loop: A Review of Bayesian Optimization” — A survey connecting Bayesian optimization to preference learning, covering acquisition functions, GP surrogate models, and applications to hyperparameter tuning and experimental design.
  • Hadfield-Menell et al. (2016), “Cooperative Inverse Reinforcement Learning” — The foundational CIRL paper, formalizing the cooperative game between a human and a robot that learns the human’s reward function through interaction.
  • Sutton and Barto (2018), Reinforcement Learning: An Introduction — The standard RL textbook, essential for readers who want to understand the MDP foundations and policy gradient methods discussed in the optional RL section of this chapter.
  • Thaler and Sunstein (2008), Nudge: Improving Decisions About Health, Wealth, and Happiness — The foundational work on choice architecture, documenting how presentation order, framing, and defaults influence human choices in ways that standard preference models do not capture.
  • Kahneman and Tversky (1979), “Prospect Theory: An Analysis of Decision Under Risk” — The seminal paper on how humans systematically deviate from expected utility maximization, providing the behavioral foundations for why pairwise comparisons (PBO) are often more reliable than absolute ratings.

4.13 Exercises

4.13.1 Basic (*)

  1. Posterior Sampling: Given a logistic regression model with prior \(U \sim \mathcal{N}(0, I)\) and two observations \((V_1, Y_1=1)\) and \((V_2, Y_2=0)\) with \(V_1 = [1, 0]^\top\) and \(V_2 = [0, 1]^\top\), write down the log-posterior and identify the form of the Laplace approximation.

  2. Regret Calculation: Consider a dueling bandit with 3 arms where the preference matrix is: \[P = \begin{pmatrix} 0.5 & 0.6 & 0.7 \\ 0.4 & 0.5 & 0.6 \\ 0.3 & 0.4 & 0.5 \end{pmatrix}\] Identify the Condorcet winner and compute the expected regret of always choosing arms \((2, 3)\).

  3. Copeland Score: For the preference matrix in Exercise 2, compute the Copeland score for each arm.

  4. (Optional) MDP for LLM Generation: Formally define the MDP for autoregressive text generation. What is the state space? Action space? When is reward received? What is the discount factor? How does the horizon \(T\) relate to the generated sequence length?

4.13.2 Intermediate (**)

  1. GP Classification: Implement GP classification with Laplace approximation for a 1D toy problem. Generate binary labels from a latent sinusoidal function passed through a sigmoid, then fit a GP and visualize the posterior mean and uncertainty.

  2. Thompson Sampling Simulation: Implement Thompson Sampling for a linear bandit with Gaussian prior. Run simulations comparing Thompson Sampling to UCB and \(\varepsilon\)-greedy on a synthetic problem. Plot cumulative regret curves.

  3. Dueling Bandit Simulation: Implement the Interleaved Filter algorithm and test it on a dueling bandit problem with 5 arms. Compare its sample complexity to uniform exploration for finding the Condorcet winner.

  4. (Optional) Policy Gradient Derivation: Starting from the trajectory probability \(p(\tau \mid \theta) = \rho_0(s_0) \prod_t \pi_\theta(a_t \mid s_t) P(s_{t+1} \mid s_t, a_t)\), derive the policy gradient theorem. Show that subtracting a state-dependent baseline \(b(s_t)\) from the return does not bias the gradient estimate.

4.13.3 Advanced (***)

  1. Laplace Approximation Derivation: Starting from the GP binary classification model, derive the full Laplace approximation including the Hessian computation for the posterior covariance. Show that the resulting approximation is Gaussian with the correct mean and covariance.

  2. qEI Failure Case: Reproduce the proof that qEI can fail to converge by implementing the counterexample from the chapter. Verify empirically that qEI gets stuck while qEUBO converges.

  3. POP-BO Regret Bound: Following the proof sketch in the chapter, derive the regret bound for POP-BO with a linear kernel. Explain how the covering number \(\mathcal{N}(\mathcal{B}_f, \epsilon, \|\cdot\|_\infty)\) appears in the bound.

  4. Custom Acquisition Function: Design and implement a novel acquisition function for preferential Bayesian optimization that balances exploration and exploitation in a different way than qEUBO. Provide theoretical justification or empirical evidence for its effectiveness.

  5. (Optional) GRPO Implementation: Implement GRPO for a contextual bandit with Bradley-Terry rewards. Compare convergence and gradient variance against REINFORCE with a learned baseline for group sizes \(G \in \{2, 4, 8, 16\}\). Investigate: for what group size does the marginal benefit of additional samples diminish?

  6. (Optional) KL-Constrained Optimization: The RLHF objective includes a KL penalty \(\beta \, \text{KL}[\pi_\theta \| \pi_{\text{ref}}]\). Derive the optimal policy in closed form (it has the form \(\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x) \exp(r(x,y)/\beta)\)). Show that this is exactly the DPO implicit reward from Chapter 1.

References

Astudillo, Raul, Zhiyuan Jerry Lin, Eytan Bakshy, and Peter Frazier. 2023. “qEUBO: A Decision-Theoretic Acquisition Function for Preferential Bayesian Optimization.” In International Conference on Artificial Intelligence and Statistics, 1093–1114. PMLR.
Bastani, Hamsa, and Mohsen Bayati. 2020. “Online Decision Making with High-Dimensional Covariates.” Operations Research 68 (1): 276–94. https://doi.org/10.1287/opre.2019.1902.
Bengs, Viktor, Róbert Busa-Fekete, Adil El Mesaoudi-Paul, and Eyke Hüllermeier. 2021. “Preference-Based Online Learning with Dueling Bandits: A Survey.” Journal of Machine Learning Research 22 (7): 1–108.
Bouneffouf, Djallel, Amel Bouzeghoub, and Alda Lopes Gançarski. 2012. “A Contextual-Bandit Algorithm for Mobile Context-Aware Recommender System.” In Neural Information Processing, edited by Tingwen Huang, Zhigang Zeng, Chuandong Li, and Chi Sing Leung, 324–31. Berlin, Heidelberg: Springer Berlin Heidelberg.
Bouneffouf, Djallel, Irina Rish, and Charu Aggarwal. 2020. “Survey on Applications of Multi-Armed and Contextual Bandits.” In 2020 IEEE Congress on Evolutionary Computation (CEC), 1–8. Glasgow, United Kingdom: IEEE Press. https://doi.org/10.1109/CEC48606.2020.9185782.
Brochu, Eric, Vlad M. Cora, and Nando de Freitas. 2010. “A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning.” arXiv Preprint arXiv:1012.2599.
Hadfield-Menell, Dylan, Stuart J Russell, Pieter Abbeel, and Anca Dragan. 2016. “Cooperative Inverse Reinforcement Learning.” Advances in Neural Information Processing Systems 29.
Kahneman, Daniel, and Amos Tversky. 1979. “Prospect Theory: An Analysis of Decision Under Risk.” Econometrica 47 (2): 263–91. https://doi.org/10.2307/1914185.
Lattimore, Tor, and Csaba Szepesvári. 2020. Bandit Algorithms. Cambridge: Cambridge University Press.
Liu, Bing, Tong Yu, Ian Lane, and Ole J. Mengshoel. 2018. “Customized Nonlinear Bandits for Online Response Selection in Neural Conversation Models.” In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence. AAAI’18/IAAI’18/EAAI’18. New Orleans, Louisiana, USA: AAAI Press.
Russo, Daniel J., Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. 2018. “A Tutorial on Thompson Sampling.” Foundations and Trends in Machine Learning 11 (1): 1–96.
Schulman, John, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. “Trust Region Policy Optimization.” In International Conference on Machine Learning, 1889–97.
Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. “Proximal Policy Optimization Algorithms.” arXiv Preprint arXiv:1707.06347.
Shahriari, Bobak, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. 2016. “Taking the Human Out of the Loop: A Review of Bayesian Optimization.” Proceedings of the IEEE 104 (1): 148–75.
Shao, Zhihong, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv Preprint arXiv:2402.03300.
Sui, Yanan, Vincent Zhuang, Joel W. Burdick, and Yisong Yue. 2017. “Multi-Dueling Bandits with Dependent Arms.” In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI).
Sui, Yanan, Masrour Zoghi, Katja Hofmann, and Yisong Yue. 2018. “Advancements in Dueling Bandits.” Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. https://doi.org/10.24963/ijcai.2018/776.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book.
Thaler, Richard H., and Cass R. Sunstein. 2008. Nudge: Improving Decisions about Health, Wealth, and Happiness. New Haven, CT: Yale University Press.
Thompson, William R. 1933. “On the Likelihood That One Unknown Probability Exceeds Another in View of the Evidence of Two Samples.” Biometrika 25 (3–4): 285–94.
Williams, Ronald J. 1992. “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning.” Machine Learning 8 (3): 229–56.
Wu, Jian, and Peter I. Frazier. 2018. “The Parallel Knowledge Gradient Method for Batch Bayesian Optimization.” arXiv Preprint arXiv:1606.04414.
Xu, Wenjie, Wenbin Wang, Yuning Jiang, Bratislav Svetozarevic, and Colin N. Jones. 2024. “Principled Preferential Bayesian Optimization.” arXiv Preprint arXiv:2402.05367.
Yue, Yisong, Josef Broder, Robert Kleinberg, and Thorsten Joachims. 2012. “The k-Armed Dueling Bandits Problem.” Journal of Computer and System Sciences 78 (5): 1538–56. https://doi.org/10.1016/j.jcss.2011.12.028.
Yue, Yisong, and Thorsten Joachims. 2009. “Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem.” Proceedings of the 26th Annual International Conference on Machine Learning. https://doi.org/10.1145/1553374.1553527.
Zhang, Tong. 2021. “Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement Learning.” CoRR abs/2110.00871. https://arxiv.org/abs/2110.00871.
Zhou, Qian, XiaoFang Zhang, Jin Xu, and Bin Liang. 2017. “Large-Scale Bandit Approaches for Recommender Systems.” In Neural Information Processing, edited by Derong Liu, Shengli Xie, Yuanqing Li, Dongbin Zhao, and El-Sayed M. El-Alfy, 811–21. Cham: Springer International Publishing.
Zoghi, Masrour, Zohar S. Karnin, Shimon Whiteson, and Maarten De Rijke. 2015. “Copeland Dueling Bandits.” In Advances in Neural Information Processing Systems. Vol. 28.