4  Decisions

Intended Learning Outcomes

By the end of this chapter you will be able to:

  • Distinguish the measurement perspective (reducing uncertainty) from the reward maximization perspective (managing uncertainty) in preference learning.
  • Derive Thompson Sampling for linear objectives using Laplace approximation to handle intractable posteriors.
  • Extend Thompson Sampling to nonlinear objectives using Gaussian Processes and GP classification.
  • Define regret (strong, weak, and Bayesian) in dueling bandit settings and explain when each notion is appropriate.
  • Compare winner concepts—Condorcet, Borda, and Von Neumann winners—and identify when they may diverge.
  • Apply acquisition functions (Uniform, Epsilon-Greedy, UCB, Pure Exploration, qEUBO) for sequential decision-making with preference feedback.
  • Analyze regret bounds for Preferential Bayesian Optimization algorithms and understand their convergence guarantees.
  • Implement Thompson Sampling and dueling bandit algorithms in practical settings.
  • (Optional) Derive the policy gradient theorem and explain how REINFORCE and GRPO use it for preference-based optimization.

This chapter can be covered in four 50-minute lectures:

Lecture 1 (Sections 4.1–4.3): Thompson Sampling

  • Motivation: From measurement to decision-making (10 min)
  • Exploration-exploitation tradeoff (10 min)
  • Thompson Sampling with Laplace approximation (linear case) (15 min)
  • Code demo: Linear Thompson Sampling simulation (15 min)

Lecture 2 (Sections 4.4–4.5): GP Thompson Sampling and Dueling Bandits

  • Gaussian Process Thompson Sampling (nonlinear case) (20 min)
  • From multi-armed bandits to dueling bandits (15 min)
  • Regret definitions and winner concepts (15 min)

Lecture 3 (Section 4.5): Dueling Bandit Algorithms and Applications

  • Acquisition functions: Uniform, ε-Greedy, UCB (10 min)
  • Interleaved Filter algorithm (15 min)
  • DBGD, Sparring EXP4, Feel-good Thompson Sampling (10 min)
  • Case studies: LLM response selection and clinical trials (15 min)

Lecture 4 (Sections 4.6 and 4.8): Preferential Bayesian Optimization and CIRL

  • PBO problem formulation and Copeland scores (15 min)
  • Acquisition functions: qEUBO, Pure Exploration, POP-BO (15 min)
  • CIRL: From passive oracles to active human agents (15 min)
  • Discussion: Choice architecture and responsible deployment (5 min)

Lecture 5 (Optional) (Section 4.7): RL Foundations for Preference Optimization

  • MDP formulation and the LLM generation MDP (10 min)
  • Policy gradient theorem derivation (15 min)
  • REINFORCE, baselines, and variance reduction (10 min)
  • GRPO: From RLHF to group-relative optimization (15 min)

4.1 Chapter Overview

Chapter 1 developed models for how preferences arise from latent utilities. Chapter 2 showed how to estimate these utilities from observed data. Chapter 3 addressed how to collect data efficiently through active query selection. This chapter tackles the complementary question: given uncertainty about user preferences, how should we choose actions to maximize reward?

The shift is subtle but important. Chapter 3’s measurement perspective treats uncertainty as something to be reduced—the goal is to learn the preference function as accurately as possible. This chapter’s reward maximization perspective treats uncertainty as something to be managed—the goal is to make good decisions now, even with incomplete knowledge. Consider an LLM serving system: it must decide which response to show each user, balancing exploitation (showing responses it believes are good) against exploration (trying novel responses to learn what users prefer).

Chapter Structure. We build from simple to complex settings:

  1. Thompson Sampling (Sections 4.3–4.4): Linear and nonlinear (GP) approaches to sequential decision-making with preference feedback
  2. Dueling Bandits (Section 4.5): When actions can only be compared pairwise, introducing regret notions and winner concepts (Condorcet, Borda, von Neumann)
  3. Applications: Case studies connecting these methods to LLM alignment and clinical trials
  4. Preferential Bayesian Optimization: Advanced acquisition functions (qEUBO, POP-BO) with convergence guarantees
  5. RL Foundations (Optional) (Section 4.7): MDP formulation, the policy gradient theorem, and GRPO for preference optimization
  6. Human-Agent Cooperation (CIRL): Moving beyond passive oracles to model humans as strategic agents

Connections to Prior Chapters:

| Concept | Chapters 1–3 | This Chapter |
|---|---|---|
| Bradley-Terry model | Likelihood, MLE, Fisher info | Dueling bandit preference model, LLM selection |
| Laplace approximation | GP posterior inference | Thompson Sampling posterior |
| Acquisition functions | D-optimal, info gain (Ch. 3) | UCB, qEUBO, Pure Exploration |
| DPO / reward learning | Objective function, ADPO (Ch. 3) | Reward maximization under uncertainty |
| Policy gradient / GRPO | RLHF pipeline (Ch. 1) | Full derivation, GRPO algorithm (Optional) |
| Winner concepts | — | Condorcet, Borda, von Neumann (reappear in Ch. 5) |

4.2 The Reward Maximization under Uncertainty Problem

Historical Note: Thompson Sampling

Thompson Sampling dates back to 1933, when William Thompson proposed it for clinical trials. For decades it was largely forgotten, overshadowed by frequentist methods like UCB (Upper Confidence Bound). Its rediscovery in the 2010s, particularly for online advertising and recommendation systems, revealed that this simple Bayesian algorithm often outperforms more complex alternatives. The key insight: sample a plausible world from your beliefs, then act optimally within it.

The fundamental tension in sequential decision-making is the exploration-exploitation tradeoff:

  • Exploitation: Choose actions believed to be good based on current knowledge
  • Exploration: Try uncertain actions to gain information for better future decisions

A purely exploitative strategy may miss superior options it never tried. A purely explorative strategy wastes resources on suboptimal choices. Effective algorithms balance both.

Assumption: Stationary Utilities

The bandit framework assumes item utilities are fixed over time: the reward distribution for each arm does not change across rounds. All regret bounds in this chapter rely on this stationarity assumption. In practice, preferences drift—users’ tastes evolve, LLM responses improve through training, and the item set itself may change (new products, new model versions). When utilities are non-stationary, standard regret bounds no longer apply, and algorithms must explicitly account for drift, for instance by discounting old observations (as the Elo K-factor does implicitly in Chapter 2) or by using non-stationary bandit algorithms with sliding windows or change-point detection.

Each user \(i\) interacts with a set of available items indexed by \(j\), each characterized by a known parameter vector \(V_j \in \mathbb{R}^K\). The user’s utility function is \(H_{ij} = h_i(V_j),\) where \(h_i: \mathbb{R}^K \rightarrow \mathbb{R}\) maps item features to latent utility. We only observe noisy or partial feedback, such as item-wise or pairwise responses. The agent’s objective is to select items sequentially to maximize reward.

In the linear case, we assume \(h_i(V_j) = U_i^\top V_j\), with \(U_i \in \mathbb{R}^K\) representing the user’s latent vector. This recovers the logistic or Rasch model when observations are binary: \(p(Y_{ij}=1) = \sigma(U_i^\top V_j).\) In this setting, the uncertainty about \(U_i\) defines a posterior distribution \(p(U_i \mid \text{data})\), and the system’s exploration–exploitation tradeoff can be quantified by the expected utility and posterior variance. Algorithms such as Thompson Sampling operate directly on this posterior, balancing exploration (reducing uncertainty about \(U_i\)) and exploitation (recommending high-utility \(V_j\)).

In the nonlinear case, a common choice is to place a Gaussian process prior over \(h_i\): \(h_i \sim \mathcal{GP}(m(\cdot), k(\cdot,\cdot)),\) where the kernel \(k(V, V')\) encodes similarity between items. This yields the Bayesian optimization formulation. The system uses acquisition functions (e.g., Thompson Sampling or expected improvement) to trade off exploitation and exploration.

Code: Exploration vs Exploitation Visualization
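As a minimal sketch of this tradeoff (a toy three-armed Bernoulli bandit with hypothetical win probabilities; printed rather than plotted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = np.array([0.3, 0.5, 0.7])     # hypothetical Bernoulli win probabilities
T = 2000

def run(epsilon):
    """Epsilon-greedy on a 3-armed bandit; epsilon=0 is purely exploitative."""
    wins, pulls, total = np.zeros(3), np.zeros(3), 0.0
    for _ in range(T):
        if rng.random() < epsilon:
            a = int(rng.integers(3))                         # explore: random arm
        else:
            a = int(np.argmax(wins / np.maximum(pulls, 1)))  # exploit: best empirical arm
        r = float(rng.random() < p_true[a])
        wins[a] += r; pulls[a] += 1; total += r
    return total / T

greedy, eps_greedy = run(0.0), run(0.1)
print(f"purely greedy mean reward:  {greedy:.3f}")
print(f"10%-exploring mean reward: {eps_greedy:.3f}")
```

A purely greedy policy can lock onto whichever arm happens to win early, illustrating why some exploration is needed.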

4.3 Thompson Sampling under Linear Objective

Thompson Sampling (TS) is among the most conceptually transparent and empirically successful methods for decision-making under uncertainty. It converts Bayesian inference into a simple randomized decision rule: sample a plausible world, then act optimally within it. Formally, suppose we maintain a posterior distribution over the latent user embedding \(p(U_i \mid \mathcal{D}_t) \propto p(\mathcal{D}_t \mid U_i) p(U_i),\) where \(\mathcal{D}_t = \{(V_j, Y_{ij})\}_{j=1}^t\) denotes the user’s observed responses. At each round \(t+1\), Thompson Sampling draws a single sample \(\tilde{U}_i^{(t)} \sim p(U_i \mid \mathcal{D}_t),\) and then chooses the next item that maximizes the sampled expected utility: \(j^* = \arg\max_j \tilde{U}_i^{(t)\top} V_j.\) This randomized strategy naturally balances exploration and exploitation: if posterior uncertainty is large, different draws \(\tilde{U}_i^{(t)}\) lead to diverse item choices; as the posterior concentrates, the policy converges to greedy exploitation.

Common Pitfall: Thompson Sampling Requires a Correct Posterior

Thompson Sampling’s theoretical guarantees depend on maintaining an accurate posterior over the latent parameters. A common mistake is to use a poor approximation (e.g., a point estimate, or a Gaussian approximation when the true posterior is multimodal) and still expect TS to explore effectively. If the posterior is overconfident, TS degenerates to greedy exploitation too early; if underconfident, it wastes rounds exploring suboptimal items. The Laplace approximation works well for logistic likelihoods (Bradley-Terry), but always check that the posterior is unimodal. Also, do not add explicit exploration bonuses (as in UCB) on top of Thompson Sampling — the randomization is the exploration mechanism, and adding bonuses breaks the balance.

Worked Example: Thompson Sampling Intuition

Suppose we have 3 items and a 1D user embedding. After 5 observations, the posterior over \(U\) is approximately \(\mathcal{N}(1.2, 0.5^2)\).

Round 1: Draw \(\tilde{U} = 1.6\) (high sample). With item features \(V_1 = 0.5, V_2 = 1.0, V_3 = -0.3\):

  • Expected utilities: \(\tilde{U} \cdot V_1 = 0.80\), \(\tilde{U} \cdot V_2 = 1.60\), \(\tilde{U} \cdot V_3 = -0.48\)
  • Select \(j^* = 2\) (highest utility under this sample)

Round 2: Draw \(\tilde{U} = 0.7\) (low sample). Now:

  • Expected utilities: \(0.35\), \(0.70\), \(-0.21\)
  • Still select \(j^* = 2\), but the margin is smaller

Round 3: Draw \(\tilde{U} = -0.1\) (rare tail sample). Now:

  • Expected utilities: \(-0.05\), \(-0.10\), \(0.03\)
  • Select \(j^* = 3\): exploration! The unlikely sample led us to try a different item.

Key insight: Thompson Sampling explores automatically because posterior uncertainty generates diverse samples. Early on (wide posterior), many different items get selected. As the posterior narrows around the true \(U\), exploration decreases naturally — no explicit exploration schedule needed.
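The three rounds above can be checked numerically; the item features and posterior draws are taken directly from the example:

```python
import numpy as np

V = np.array([0.5, 1.0, -0.3])     # item features V_1, V_2, V_3
draws = [1.6, 0.7, -0.1]           # posterior samples of U, one per round

selections = []
for t, u in enumerate(draws, start=1):
    utilities = u * V                        # sampled expected utilities
    j_star = int(np.argmax(utilities)) + 1   # 1-indexed item
    selections.append(j_star)
    print(f"Round {t}: utilities = {np.round(utilities, 2)}, select item {j_star}")
```

The selections come out as items 2, 2, 3, matching the three rounds above.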

Under binary or preference feedback modeled by a logistic link, \[ p(Y_{ij}=1 \mid U_i, V_j) = \sigma(U_i^\top V_j), \tag{4.1}\] the posterior becomes non-Gaussian and analytically intractable. We therefore approximate it with a Laplace approximation, replacing the true posterior with a local Gaussian centered at the mode (the MAP estimate).

Let the log-posterior be: \[ \ell(U_i) = \log p(U_i) + \sum_{j \in \mathcal{D}_t} \log p(Y_{ij} \mid U_i). \tag{4.2}\] We expand \(\ell(U_i)\) around its maximum a posteriori estimate \(\hat{U}_i\): \[ \ell(U_i) \approx \ell(\hat{U}_i) - \tfrac{1}{2}(U_i - \hat{U}_i)^\top G (U_i - \hat{U}_i), \tag{4.3}\] where \(G = -\nabla^2_{U_i} \ell(\hat{U}_i)\) is the negative Hessian of the log-posterior. This yields the Gaussian approximation: \[ p(U_i \mid \mathcal{D}_t) \approx \mathcal{N}(\hat{U}_i, G^{-1}). \tag{4.4}\]

In the logistic model, \[ G = I + \sum_{j} p_j(1-p_j) V_j V_j^\top, \tag{4.5}\] where \(p_j = \sigma(\hat{U}_i^\top V_j)\) and \(I\) arises from the Gaussian prior \(p(U_i) = \mathcal{N}(0, I)\).

At each time step \(t\):

  1. Update posterior approximation. Compute \(\hat{U}_i\) by maximizing the log-posterior: \[ \hat{U}_i = \arg\max_{U_i} \Big[\sum_{j\in\mathcal{D}_t} Y_{ij}\log\sigma(U_i^\top V_j) + (1-Y_{ij})\log(1-\sigma(U_i^\top V_j)) - \tfrac{1}{2}\|U_i\|^2 \Big]. \tag{4.6}\]

  2. Compute Hessian at the mode \[ G = I + \sum_{j} p_j(1-p_j) V_j V_j^\top. \tag{4.7}\]

  3. Sample a posterior draw \[ \tilde{U}_i^{(t)} \sim \mathcal{N}(\hat{U}_i, G^{-1}). \tag{4.8}\]

  4. Select next item \[ j^* = \arg\max_j \tilde{U}_i^{(t)\top} V_j. \tag{4.9}\]

This algorithm naturally unifies measurement and decision: the posterior (a measurement construct) drives the sampling rule (a decision construct).

Code: Thompson Sampling with Laplace Approximation
Code: Linear Thompson Sampling Simulation

This simulation runs Thompson Sampling with linear utility over multiple rounds, tracking how the posterior concentrates and regret accumulates.
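One possible self-contained sketch of such a simulation (NumPy only; the true user vector, item set, and horizon are illustrative choices, and the MAP estimate is found by Newton iterations):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

K, J, T = 2, 10, 200                      # embedding dim, items, rounds (illustrative)
U_true = np.array([1.0, -0.5])            # hypothetical latent user vector
V = rng.normal(size=(J, K))               # known item features

def laplace_posterior(Vs, ys, n_newton=25):
    """MAP estimate and negative Hessian G for the logistic model with N(0, I) prior."""
    U = np.zeros(K)
    for _ in range(n_newton):
        p = sigmoid(Vs @ U)
        w = p * (1 - p)
        G = np.eye(K) + Vs.T @ (w[:, None] * Vs)   # Eq. (4.7)
        grad = Vs.T @ (ys - p) - U                 # gradient of the log posterior
        U = U + np.linalg.solve(G, grad)           # Newton step
    p = sigmoid(Vs @ U)
    G = np.eye(K) + Vs.T @ ((p * (1 - p))[:, None] * Vs)
    return U, G

Vs_hist, ys_hist, choices = [], [], []
for t in range(T):
    if Vs_hist:
        U_hat, G = laplace_posterior(np.array(Vs_hist), np.array(ys_hist))
    else:
        U_hat, G = np.zeros(K), np.eye(K)          # prior N(0, I) before any data
    U_tilde = rng.multivariate_normal(U_hat, np.linalg.inv(G))  # Eq. (4.8)
    j = int(np.argmax(V @ U_tilde))                             # Eq. (4.9)
    y = float(rng.random() < sigmoid(U_true @ V[j]))            # simulated feedback
    Vs_hist.append(V[j]); ys_hist.append(y); choices.append(j)

print("posterior mean:", np.round(U_hat, 2))
print("most-chosen item in last 50 rounds:",
      int(np.bincount(choices[-50:], minlength=J).argmax()))
```

As the posterior concentrates, the sampled vectors \(\tilde U\) vary less, and the late-round choices cluster on a few high-utility items.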

4.4 Thompson Sampling under Nonlinear Objective

The linear formulation assumes that a user’s latent utility varies linearly with item features. In practice, preference landscapes are often highly nonlinear: users may favor intermediate difficulty, exhibit saturation, or respond differently to combinations of attributes. To capture such complex relations, we move from a parametric vector representation \((U_i)\) to a nonparametric function representation \((h_i)\).

A Gaussian Process (GP) defines a distribution over functions: \[ h_i \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot)), \tag{4.10}\] where \(m(V)\) is the mean function (often assumed to be zero) and \(k(V, V')\) is the kernel, encoding similarity between items. Intuitively, a GP asserts that similar items yield similar utilities, with the kernel controlling smoothness and generalization.

For any finite collection of inputs \(V_{1:M}\), the corresponding function values \[ H_{i, 1:M} = [h_i(V_1), \dots, h_i(V_M)] \tag{4.11}\] follow a multivariate normal distribution: \[ H_{i, 1:M} \sim \mathcal{N}(m_{1:M}, K_{1:M,1:M}), \tag{4.12}\] where \((K)_{jj'} = k(V_j, V_{j'})\).

The probability density function (PDF) of an \(M\)-dimensional multivariate normal random variable \[ H \sim \mathcal{N}(m, K) \tag{4.13}\] is given by \[ p(H) = \frac{1}{(2\pi)^{M/2} |K|^{1/2}} \exp \Big(-\tfrac{1}{2}(H-m)^\top K^{-1} (H-m) \Big). \tag{4.14}\] Taking the logarithm yields \[ \log p(H) = -\tfrac{1}{2}(H-m)^\top K^{-1} (H-m) - \tfrac{1}{2}\log|K| - \tfrac{M}{2}\log(2\pi). \tag{4.15}\]

In most inference contexts, constant terms (those not depending on \(H\)) can be dropped, since they cancel when normalizing the posterior.

This nonparametric formulation is advantageous because:

  • It automatically adapts model complexity to the data through the kernel.
  • It provides closed-form posterior updates under Gaussian noise.
  • It generalizes across items by exploiting smoothness in feature space.

The RBF (Radial Basis Function) kernel is particularly common: \[ k(V, V') = \sigma^2 \exp\left(-\frac{\|V - V'\|^2}{2\ell^2}\right), \tag{4.16}\] where \(\ell\) is the lengthscale (controlling smoothness) and \(\sigma^2\) is the variance (controlling amplitude). Small lengthscales produce wiggly functions; large lengthscales produce smooth functions.

Code: Kernel Comparison and GP Prior Samples
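A minimal sketch of Equation (4.16) and the effect of the lengthscale (the grid, lengthscales, and the wiggliness summary are illustrative choices):

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Eq. (4.16) for 1-D inputs: k(v, v') = sigma^2 exp(-(v - v')^2 / (2 l^2))."""
    sq = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-sq / (2.0 * lengthscale ** 2))

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)

for ell in (0.3, 2.0):                         # small vs large lengthscale
    K = rbf_kernel(x, x, lengthscale=ell)
    jitter = 1e-8 * np.eye(len(x))             # numerical stabilizer for sampling
    sample = rng.multivariate_normal(np.zeros(len(x)), K + jitter)
    # crude wiggliness measure: mean absolute change between neighboring grid points
    print(f"lengthscale={ell}: mean |delta f| = {np.abs(np.diff(sample)).mean():.3f}")
```

The short-lengthscale draw changes much more between neighboring points, matching the "wiggly vs smooth" intuition.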

For binary observations, \[ p(Y_{ij}=1 \mid h_i(V_j)) = \sigma(h_i(V_j)), \tag{4.17}\] the likelihood is non-Gaussian, and the posterior \[ p(h_i \mid \mathcal{D}_t) \propto p(Y_{1:t} \mid h_i) p(h_i) \tag{4.18}\] becomes intractable. To approximate it, we again use the Laplace approximation.

Let \(H_i = [h_i(V_1), \dots, h_i(V_t)]\). The log-posterior is \[ \ell(H_i) = \log p(Y_{1:t} \mid H_i) - \tfrac{1}{2}H_i^\top K_t^{-1}H_i - \tfrac{1}{2}\log|K_t|. \tag{4.19}\]

Let \[ \hat H_i = \arg\max_{H_i} \ell(H_i) \tag{4.20}\] be the mode (MAP estimate).

The multivariate second-order Taylor expansion of a smooth scalar function \(\ell(\cdot)\) around \(\hat H_i\) is \[ \ell(H_i) \approx \ell(\hat H_i) + \nabla\ell(\hat H_i)^\top(H_i-\hat H_i) + \tfrac12(H_i-\hat H_i)^\top \nabla^2\ell(\hat H_i)(H_i-\hat H_i). \tag{4.21}\]

  • Because \(\hat H_i\) is a maximizer, the gradient vanishes: \[ \nabla\ell(\hat H_i)=0. \tag{4.22}\]
  • The Hessian at a maximum is negative semidefinite: \[ \nabla^2\ell(\hat H_i) \preceq 0. \tag{4.23}\]

So the expansion simplifies to \[ \ell(H_i) \approx \ell(\hat H_i) + \tfrac12 (H_i-\hat H_i)^\top \nabla^2\ell(\hat H_i)(H_i-\hat H_i). \tag{4.24}\]

Define the negative Hessian (a positive semidefinite matrix) \[ W := -\nabla^2\ell(\hat H_i) \succeq 0. \tag{4.25}\]

Then \[ \ell(H_i) \approx \ell(\hat H_i) - \tfrac12 (H_i-\hat H_i)^\top W (H_i-\hat H_i). \tag{4.26}\]

Laplace’s idea is: if the log posterior is approximately quadratic, then the posterior is approximately Gaussian. Exponentiate the quadratic approximation:

\[ p(H_i\mid\mathcal D_t) \propto \exp\big(\ell(H_i)\big) \approx \exp \Big(\ell(\hat H_i)-\tfrac12(H_i-\hat H_i)^\top W (H_i-\hat H_i)\Big). \tag{4.27}\]

Since \(\exp(\ell(\hat H_i))\) is just a constant in \(H_i\), this is proportional to the kernel of a multivariate normal with mean \(\hat H_i\) and precision \(W\) (i.e., covariance \(W^{-1}\)):

\[ p(H_i\mid\mathcal D_t) \approx \mathcal N\big(\hat H_i, W^{-1}\big). \tag{4.28}\]

The resulting approximate posterior can be projected to any new input \(V_*\) via standard GP conditioning: \[ p(h(V_*) \mid \mathcal{D}_t) \approx \mathcal{N}(\mu_t(V_*), \sigma_t^2(V_*)), \tag{4.29}\] where \[ \mu_t(V_*) = k_*^\top K_t^{-1}\hat{H}_i, \quad \sigma_t^2(V_*) = k(V_*,V_*) - k_*^\top (K_t + W^{-1})^{-1} k_*, \tag{4.30}\] with \(k_*\) the vector of covariances between the new input \(V_*\) and the training inputs.

Once the posterior is approximated, Thompson Sampling proceeds as before: sample one plausible function from the posterior, then act optimally with respect to it.

Algorithmically:

  1. Approximate the posterior \(p(h_i \mid \mathcal{D}_t) \approx \mathcal{N}(\hat{H}_i, W^{-1})\) via Laplace.
  2. Sample a function realization \[ \tilde{h}_i \sim \mathcal{GP}(m_t(\cdot), k_t(\cdot, \cdot)), \tag{4.31}\] where \(m_t\) and \(k_t\) are derived from the approximate posterior.
  3. Select the next item \[ j^* = \arg\max_j \tilde{h}_i(V_j). \tag{4.32}\]

This random sampling induces exploration: uncertain regions of the function space produce high variance in \(\tilde{h}_i(V_j)\), encouraging the agent to test them. As more observations are collected, the posterior variance shrinks and the strategy converges toward greedy exploitation.

Below is a minimal illustration using GP classification with Laplace approximation and Thompson sampling. (For practical scalability, packages like gpytorch or scikit-learn can be used.)

Code: GP Thompson Sampling
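One way such an illustration might look (NumPy only; the item grid, the true utility \(h(v)=\sin 2v\), kernel hyperparameters, and query counts are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Candidate items on a 1-D feature grid; hypothetical true utility h(v) = sin(2v)
items = np.linspace(0, 3, 25)
h_true = np.sin(2 * items)

def rbf(a, b, ell=0.5):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

K_mat = rbf(items, items) + 1e-4 * np.eye(len(items))   # jitter for stability
K_inv = np.linalg.inv(K_mat)

# Simulated binary feedback at randomly queried items
idx = rng.integers(0, len(items), size=60)
y = (rng.random(60) < sigmoid(h_true[idx])).astype(float)

# Laplace approximation: Newton ascent on the log posterior over H, cf. Eq. (4.19)
H = np.zeros(len(items))
for _ in range(50):
    p = sigmoid(H[idx])
    r = np.bincount(idx, weights=y - p, minlength=len(items))      # sum of (y - sigma) per item
    w = np.bincount(idx, weights=p * (1 - p), minlength=len(items))
    H = H + np.linalg.solve(K_inv + np.diag(w), r - K_inv @ H)     # Newton step

p = sigmoid(H[idx])
W = np.diag(np.bincount(idx, weights=p * (1 - p), minlength=len(items)))
cov = np.linalg.inv(K_inv + W)          # Laplace covariance, cf. Eq. (4.28)

# Thompson Sampling: draw one plausible utility function, act greedily on it
h_sample = rng.multivariate_normal(H, cov)
j_star = int(np.argmax(h_sample))
print("selected item feature:", round(float(items[j_star]), 2))
```

Each fresh draw of `h_sample` can favor a different region of the grid, which is exactly the exploration mechanism described above.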
Code: GP Posterior Visualization

The following code visualizes how the GP posterior updates after observing binary responses. The shaded region shows the 95% credible interval.

Code: Thompson Sampling Simulation

This simulation shows Thompson Sampling in action, demonstrating how the algorithm balances exploration and exploitation over time.

Key Takeaway: Thompson Sampling

Thompson Sampling provides a remarkably simple and effective solution to the exploration-exploitation tradeoff: sample a plausible world from your posterior beliefs, then act optimally within it. This naturally concentrates exploration where uncertainty is highest, without requiring explicit exploration bonuses. Combined with Laplace approximation (for tractable posteriors) and Gaussian Processes (for flexible reward functions), Thompson Sampling provides a complete framework for sequential decision-making under preference uncertainty—connecting directly to the Bayesian inference tools developed in Chapter 2.

4.5 Dueling Bandit

4.5.1 From Multi-Armed Bandits to Dueling Bandits

The multi-armed bandit (MAB) problem involves a gambler deciding which of several slot machines to play to maximize winnings, without knowing which machine is the most rewarding. This scenario highlights the need to balance exploration (trying new machines to discover potentially higher rewards) and exploitation (using current knowledge to maximize gains). MAB algorithms address this dilemma by making decisions under uncertainty to achieve the best possible outcomes based on gathered data. At the core of the MAB problem is a set of actions, or ‘arms,’ denoted by \(\mathcal{A} = \{1, 2, \ldots, K\}\), where \(K\) is the total number of arms. For each round \(t\), the agent selects an arm \(a_t \in \mathcal{A}\) and receives a reward \(r_t\), sampled from an arm-specific, unknown probability distribution. The expected reward of pulling arm \(a\) is \(\mu_a = \mathbb{E}[r_t \mid a]\).

The multi-armed bandit framework can be extended in various ways to model more complex scenarios. In the infinite-armed bandit problem, the set of possible arms \(\mathcal{A}\) is either very large or infinite. This introduces significant challenges in exploration, as the agent cannot afford to explore each arm even once. Algorithms for infinite-armed bandits typically assume some regularity or structure of the reward function across arms to make the problem tractable. The contextual bandit problem extends the bandit framework by incorporating observable external states or contexts that influence the reward distributions of arms. The agent’s task is to learn policies that map contexts to arms to maximize reward. This model is particularly powerful for personalized recommendations, where the context can include user features or historical interactions. In dueling bandit problems, the agent chooses two arms to pull simultaneously and receives feedback only on which of the two is better, not the actual reward values. This pairwise comparison model is especially useful in scenarios where absolute evaluations are difficult, but relative preferences are easier to determine, such as in ranking systems.

Contextual bandits extend the multi-armed bandits by making decisions conditional on the state of the environment and previous observations. The benefit of such a model is that observing the environment can provide additional information, potentially leading to better rewards and outcomes. In each iteration, the agent is presented with the context of the environment, then decides on an action based on the context and previous observations. Finally, the agent observes the action’s outcome and reward. Throughout this process, the agent aims to maximize the expected reward.

In many real-world contexts, one may not have a real-valued reward (or at least a reliable one) associated with a decision. Instead, we may only have observations indicating which of a set of bandits was optimal in a given scenario. The assumption is that within these observations of preferred choices among a set of options, there is an implicit reward or payoff encapsulated in that decision. Consider the following examples:

  1. Dietary preferences: When providing food recommendations to humans, it is often not possible to quantify an explicit reward from recommending a specific food item. Instead, we can offer meal options and observe which one the person selects.

  2. Video recommendation: Websites like YouTube and TikTok recommend specific videos to users. It is typically not feasible to measure the reward a person gains from watching a video. However, we can infer that a user preferred one video over another. From these relative preference observations, we can develop a strategy to recommend videos they are likely to enjoy.

  3. Exoskeleton gait optimization: Tucker et al. (2020) created a framework that uses human-evaluated preferences for an exoskeleton gait algorithm to develop an optimal strategy for the exoskeleton to assist a human in walking. A human cannot reliably produce a numerical value for how well the exoskeleton helped them walk but can reliably indicate which option performed best according to their preferences.

Generally, we assume access to a set of actions. A noteworthy assumption is that any observations we make are unbiased estimates of the payoff. This means that if we observe a human preferred one option over another (or several others), the preferred option had a higher implicit reward or payoff than the alternatives. In the case of dietary preferences, this may mean that a human liked the preferred option; in the case of video recommendations, a user was more entertained, satisfied, or educated by the video they selected than the other options.

The overarching context is that we do not have direct or reliable access to rewards. We may not have a reward at all (for some decisions, it may be impossible to assign a real value to the outcome), or it may be noisy (for example, if we ask a human to rate their satisfaction on a scale of 1 to 10). We use relative comparisons to evaluate the best of multiple options in this case. Our goal is to minimize total regret in the face of noisy comparisons. Humans may not always provide consistent observations (since human decision-making is not guaranteed to be consistent). However, we can still determine an optimal strategy with the observed comparisons. We aim to minimize the frequency of sub-optimal decisions according to human preferences.

In practice, many formulations of bandits can allow for infinitely many bandits (for example, in continuous-valued and high-dimensional spaces). However, this situation can be intractable when determining an optimal decision strategy: with infinite options, how can we ensure we have chosen the best? We will constrain our bandits to a discrete space to enable efficient exploration. We will assume that we have \(k\) bandits, \(b_i,\ i \in \{1, \dots, k\}\), and our task is to choose the one that minimizes regret.

With the framework outlined, we now define our approach more formally. This method was introduced by Yue et al. (2012), and proofs for the guarantees and derivations of parameters can be found in their work.

To determine the optimal action, we will compare pairwise to ascertain the probability that an action \(b_i\) is preferred over another \(b_j\), where \(i \ne j\). Concretely, we assume access to a function \(\epsilon\) that determines this probability; in practice, it can be evaluated with an oracle, such as asking a human which of two options they prefer: \[P(b_i > b_j) = \epsilon(b_i, b_j) + \frac{1}{2}. \tag{4.33}\] Three basic properties govern the values provided by \(\epsilon\): \[\epsilon(b_i, b_j) = -\epsilon(b_j, b_i), \quad \epsilon(b_i, b_i) = 0, \quad \epsilon(b_i, b_j) \in \left(-\tfrac{1}{2}, \tfrac{1}{2} \right). \tag{4.34}\]

We assume there is a total ordering of bandits, such that \(b_i \succ b_j\) implies \(\epsilon(b_i, b_j) > 0\). We impose two constraints to properly model comparisons:

  • Strong Stochastic Transitivity: We must maintain our total ordering of bandits, and as such, the comparison model also respects this ordering: \[b_i \succ b_j \succ b_k \Rightarrow \epsilon(b_i, b_k) \ge \max\{\epsilon(b_i, b_j), \epsilon(b_j, b_k)\}. \tag{4.35}\]

  • Stochastic Triangle Inequality: We also impose a triangle inequality, which captures the condition that the probability of a bandit winning (or losing) a comparison will exhibit diminishing returns as it becomes increasingly superior (or inferior) to the competing bandit: \[b_i \succ b_j \succ b_k \Rightarrow \epsilon(b_i, b_k) \le \epsilon(b_i, b_j) + \epsilon(b_j, b_k). \tag{4.36}\]

These assumptions may initially seem limiting; however, common models for comparisons satisfy these constraints. For example, the Bradley-Terry Model follows \(P(b_i > b_j) = \frac{\mu_i}{\mu_i + \mu_j}\). The Gaussian model with unit variance also satisfies these constraints: \(P(b_i > b_j) = P(X_i - X_j > 0)\), where \(X_i - X_j \sim N(\mu_i - \mu_j, 2)\).

Code: Preference Matrix Visualization
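For instance, the \(\epsilon\) values implied by a Bradley-Terry model can be tabulated from hypothetical scores \(\mu_i\) and checked against the properties above:

```python
import numpy as np

mu = np.array([1.0, 2.0, 4.0])                  # hypothetical Bradley-Terry scores
P = mu[:, None] / (mu[:, None] + mu[None, :])   # P[i, j] = P(b_i > b_j)
eps = P - 0.5                                   # epsilon(b_i, b_j), cf. Eq. (4.33)

print("preference matrix:\n", np.round(P, 3))
```

Anti-symmetry, the zero diagonal, the \((-\tfrac12, \tfrac12)\) bounds, strong stochastic transitivity, and the stochastic triangle inequality can all be verified directly on `eps`.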

To accurately model the preferences between bandits in our framework of pairwise bandit comparisons and regret, we must track certain parameters in our algorithm. First, we will maintain a running empirical estimate of the probability of bandit preferences based on our observations. It is important to note that we do not have direct access to an \(\epsilon\) function. Instead, we must present two bandits to a human, who selects a winner. To do this, we define: \[\hat{P}_{i, j} = \frac{\#\ \text{wins of } b_i}{\#\ \text{comparisons between } b_i \text{ and } b_j}. \tag{4.37}\]

We will also compute confidence intervals at each timestep for each of the entries in \(\hat{P}\) as \[\hat{C}_t = \left( \hat{P}_t - c_t, \hat{P}_t + c_t \right), \tag{4.38}\] where \(c_t = \sqrt{\frac{4\log(\frac{1}{\delta})}{t}}\). Note that \(\delta = \frac{1}{TK^2}\), where \(T\) is the time horizon and \(K\) is the number of bandits.
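A short sketch of these estimates (the pair's true preference probability and the comparison count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T_horizon, K_arms = 1000, 5
delta = 1.0 / (T_horizon * K_arms ** 2)          # delta = 1 / (T K^2)
p_true = 0.7                                     # hypothetical P(b_i > b_j)

wins = 0
for t in range(1, 201):                          # 200 duels of the pair (i, j)
    wins += int(rng.random() < p_true)
    p_hat = wins / t                             # Eq. (4.37)
    c_t = np.sqrt(4 * np.log(1 / delta) / t)     # confidence radius

print(f"p_hat = {p_hat:.3f}, interval = ({p_hat - c_t:.3f}, {p_hat + c_t:.3f})")
```

The radius \(c_t\) shrinks as \(1/\sqrt{t}\), so the interval tightens as the pair is compared more often.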

Previously, we discussed approaches for finding the best action in a specific context. Now, we consider changing contexts, which means there is no longer a static hidden preference matrix \(P\). Instead, at every time step, there is a preference matrix \(P_C\) depending on context \(C\). We consider a context \(C\) and a preference matrix \(P_C\) to be chosen by nature as a result of the given environment (Yue et al., 2012). The goal of a contextual dueling bandit algorithm is to find a policy \(\pi\) that maps contexts to a Von Neumann winner distribution over our bandits. That is, our policy \(\pi\) should map any context to some distribution over our bandits such that sampling from that distribution is weakly preferred to every fixed action for that context.

4.5.2 Regret

The agent aims to pick a sequence of arms \((a_1, a_2, \ldots, a_T)\) across a succession of time steps \(t = 1\) to \(t = T\) to maximize the total accumulated reward. Formally, the strategy seeks to maximize the sum of the expected rewards: \(\max_{a_1, \ldots, a_T} \mathbb{E} \left[\sum_{t=1}^{T} r_t\right]\). Regret is defined as the difference between the cumulative reward that could have been obtained by always pulling the best arm (in hindsight, after knowing the reward distributions) and the cumulative reward actually obtained by the algorithm. Formally, if \(\mu^*\) is the expected reward of the best arm and \(\mu_{a_t}\) is the expected reward of the arm chosen at time \(t\), the regret after \(T\) time steps is given by \(R(T) = T \cdot \mu^* - \sum_{t=1}^{T} \mu_{a_t}\).

The objective of a bandit algorithm is to minimize this regret over time, effectively learning to make decisions that are as close as possible to the decisions of an oracle that knows the reward distributions beforehand. Low regret indicates an algorithm that consistently chooses well-performing arms, balancing the exploration of unknown arms with the exploitation of arms that are already known to perform well.

Thus, an efficient bandit algorithm exhibits sub-linear regret growth, meaning that the average regret per round tends to zero as the number of rounds \(T\) goes to infinity: \(\lim_{T \to \infty} \frac{R(T)}{T} = 0\). Minimizing regret is a cornerstone in the design of bandit algorithms, and its analysis helps in understanding the long-term efficiency and effectiveness of different bandit strategies.
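The definition \(R(T) = T\mu^* - \sum_t \mu_{a_t}\) can be evaluated directly for a toy arm sequence (the expected rewards are illustrative):

```python
mu = [0.2, 0.5, 0.8]               # expected reward of each arm (illustrative)
mu_star = max(mu)                  # best arm's expected reward, in hindsight

arms = [0, 1, 2, 2, 1, 2]          # a hypothetical sequence of chosen arms
T = len(arms)
regret = T * mu_star - sum(mu[a] for a in arms)
print(f"R(T) = {regret:.2f}")      # 6*0.8 - 3.6 = 1.2
```

Only the first two rounds incur regret here; once the algorithm mostly plays arm 2, the per-round regret is zero.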

As previously discussed, our goal is to select the bandit that minimizes a quantity that reflects regret or the cost of not selecting the optimal bandit at all times. We can leverage our comparison model to define a quantity for regret over some time horizon \(T\), which is the number of decisions we make (selecting what we think is the best bandit at each iteration). Assuming we know the best bandit \(b^*\) (and we know that there is a best bandit, since there is a total ordering of our discrete bandits), we can define two notions of regret:

  • Strong regret: aims to capture the fraction of users who would prefer the optimal bandit \(b^*\) over the worse of the two options \(b_1^{(t)}, b_2^{(t)}\) we provide at step \(t\): \[R_T = \sum_{t = 1}^T \max \left\{ \epsilon(b^*, b_1^{(t)}), \epsilon(b^*, b_2^{(t)}) \right\}\]

  • Weak regret: aims to capture the fraction of users who would prefer the optimal bandit \(b^*\) over the better of the two options \(b_1^{(t)}, b_2^{(t)}\) we provide at step \(t\): \[\tilde{R}_T = \sum_{t = 1}^T \min \left\{ \epsilon(b^*, b_1^{(t)}), \epsilon(b^*, b_2^{(t)}) \right\}\]
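To make the two notions concrete, here is a minimal sketch that accumulates both regrets over a sequence of duels. The preference matrix, the convention \(\epsilon(b^*, b) = p(b^* \succ b) - 0.5\), and the choice of arm 0 as \(b^*\) are illustrative assumptions.

```python
import numpy as np

# Hypothetical preference probabilities: p[i, j] = P(arm i beats arm j),
# with arm 0 playing the role of the best bandit b*.
p = np.array([
    [0.50, 0.60, 0.70],
    [0.40, 0.50, 0.55],
    [0.30, 0.45, 0.50],
])

def eps(b):
    """Gap eps(b*, b) = P(b* > b) - 0.5, with b* = arm 0."""
    return p[0, b] - 0.5

def strong_and_weak_regret(duels):
    """Accumulate strong (max) and weak (min) regret over duels (b1, b2)."""
    strong = sum(max(eps(b1), eps(b2)) for b1, b2 in duels)
    weak = sum(min(eps(b1), eps(b2)) for b1, b2 in duels)
    return strong, weak

duels = [(0, 1), (1, 2), (0, 0)]   # dueling b* against itself costs nothing
strong, weak = strong_and_weak_regret(duels)
print(strong, weak)
```

Note that a duel involving \(b^*\) itself contributes zero to weak regret, while strong regret still charges for the other, suboptimal arm.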

4.5.3 Winner Concepts

The best bandit described in our regret definition is called a Condorcet winner. This is the strongest form of winner: it is the action \(A_{i}\) that is preferred to every other action \(A_j\) with \(p > 0.5\) in a head-to-head election. While the notions of regret introduced above assume that an overall best bandit exists, there are settings where no bandit wins more than half of its head-to-head duels. A set of actions without a Condorcet winner is described by the preference matrix in Figure 4.1, where each entry \(\Delta_{jk} = p(j \succ k) - 0.5\) is the probability that action \(j\) is preferred over action \(k\), minus 0.5. There is no Condorcet winner because no action is preferred with \(p > 0.5\) over all other actions. Imagine you want to find the best pizza to eat (= action): there may not be a pizza that wins more than half of its head-to-head duels against every other pizza.

However, we might still have an intuition of what the best pizza is. Sui et al. (2018) therefore introduce several concepts of \(\textit{winners}\) in dueling bandit problems. In this example, we might define the best pizza as the most popular one. We call the pizza receiving the most votes in a public vote the Borda winner, or formally, the Borda winner is \(\arg\max_{j \in A} \sum_{i \neq j} p(j \succ i)\). In contrast to the Condorcet setting, a set of actions is always guaranteed to have one or more (in the case of a tie) Borda winners. However, if a Condorcet winner exists, it need not coincide with a Borda winner: in our pizza example, a pepperoni pizza might win more than half of its head-to-head duels, while the cheese pizza is still the most popular in a public poll.

A more general winner concept is the Von Neumann winner, which describes a probability distribution rather than a single winning bandit. A Von Neumann winner prescribes a probability distribution \(W\) such that an action sampled from \(W\) beats an action sampled uniformly at random with \(p > 0.5\). In our pizza example, this corresponds to trusting a friend to order whichever pizza they like, because this may still be preferred to ordering randomly. Formally, \(W\) is a Von Neumann winner if \(\Pr_{j \sim W,\, k \sim R}(j \succ k) > 0.5\), where \(R\) denotes the uniform probability distribution over our actions. The concept of a Von Neumann winner is useful in contextual bandits, which will be introduced later. In these settings, the preference matrix depends on the context, and different contexts may have different Borda winners, just as different parties may vote for different pizzas.

No Universal Winner Concept

The three winner concepts—Condorcet, Borda, and von Neumann—can identify different items as the “best,” and a Condorcet winner may not even exist. The choice of winner concept is therefore a normative design decision, not a purely technical one: it determines what “regret” means and what the algorithm optimizes for. This parallels a deeper tension formalized by Arrow’s impossibility theorem (Section 5.2): there is no aggregation rule that satisfies all desirable properties simultaneously. When deploying dueling bandit algorithms, practitioners should explicitly choose a winner concept that matches their application’s values—majority dominance (Condorcet), average popularity (Borda), or worst-case robustness (von Neumann).

|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| A | 0 | 0.03 | -0.02 | 0.06 | 0.10 | 0.11 |
| B | -0.03 | 0 | 0.03 | 0.05 | 0.08 | 0.11 |
| C | 0.02 | -0.03 | 0 | 0.04 | 0.07 | 0.09 |
| D | -0.06 | -0.05 | -0.04 | 0 | 0.05 | 0.07 |
| E | -0.10 | -0.08 | -0.07 | -0.05 | 0 | 0.03 |
| F | -0.11 | -0.11 | -0.09 | -0.07 | -0.03 | 0 |

Figure 4.1: Violation of Condorcet Winner. Highlighted entries are different from Table 1. No Condorcet winner exists, as no arm beats every other arm.
Code: Pizza Voting Example - When Winners Diverge

This example demonstrates a scenario where three groups have different pizza preferences, resulting in no Condorcet winner. The Borda winner and Von Neumann winner provide alternative solutions.

Code: Computing Different Winner Types
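As a sketch of what such code might look like, the following computes all three winner types for a made-up three-pizza preference matrix containing a cycle (Pepperoni beats Cheese, Cheese beats Veggie, Veggie beats Pepperoni), so no Condorcet winner exists.

```python
import numpy as np

# Made-up preference matrix: P[i, j] = P(pizza i beats pizza j).
# The cycle Pepperoni > Cheese > Veggie > Pepperoni rules out a Condorcet winner.
P = np.array([
    [0.5, 0.7, 0.2],
    [0.3, 0.5, 0.9],
    [0.8, 0.1, 0.5],
])
names = ["Pepperoni", "Cheese", "Veggie"]
K = len(names)

# Condorcet winner: beats every other action with p > 0.5 (may not exist).
condorcet = [i for i in range(K) if all(P[i, j] > 0.5 for j in range(K) if j != i)]

# Borda winner: largest total win probability against the other actions.
borda = int(np.argmax(P.sum(axis=1) - 0.5))   # subtract the self-duel entry

# Von Neumann winner: mixed strategy w with min_k sum_i w_i (P[i, k] - 0.5) >= 0,
# approximated here by brute force over a coarse grid on the simplex.
best_w, best_val = None, -np.inf
grid = np.linspace(0, 1, 51)
for a in grid:
    for c in grid:
        if a + c <= 1:
            w = np.array([a, c, 1 - a - c])
            val = min(w @ (P[:, k] - 0.5) for k in range(K))
            if val > best_val:
                best_val, best_w = val, w

print("Condorcet:", condorcet)               # empty list: the cycle has no winner
print("Borda:", names[borda])
print("Von Neumann mix:", np.round(best_w, 2))
```

The grid search is a placeholder for solving the zero-sum game exactly (e.g. by linear programming); since the matrix \(P - 0.5\) is antisymmetric, the game value is zero and a mixed strategy achieving approximately zero always exists.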

Next, we introduce two performance measures for the planner. The asymptotic ex-post regret is defined as \[\text{Regret}(\mu_1, \ldots, \mu_K) = T \cdot \max_i \mu_i - \sum_{t=1}^T \mathbb{E}[\mu_{I_t}]. \tag{4.39}\]

Intuitively, this represents the difference between the reward achieved by always taking the action with the highest possible reward and the expected welfare of the recommendation algorithm (based on the actions it recommends at each timestep).

We also define a weaker performance measure, the Bayesian regret: \[\text{Bayesian regret} = \mathbb{E}_{\mu_1, \ldots, \mu_K \sim \text{Prior}}\left[\operatorname{Regret}\left(\mu_1, \ldots, \mu_K\right)\right] \tag{4.40}\]

With a Bayesian optimal policy, we would like either definition of regret to vanish as \(T\to \infty\); we are considering “large-market optimal” settings where there are many short-lived, rather than a few long-term, users. Note that because ex-post regret is prior-free, it is robust to inaccuracies in the prior.
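The distinction can be sketched with a small Monte Carlo simulation: ex-post regret is computed for a fixed vector of means, and Bayesian regret averages it over means drawn from the prior. The greedy policy and the Gaussian prior below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy_policy(mu, T):
    """Pull each arm once, then always pull the empirically best arm
    under noisy Gaussian rewards (an illustrative, deliberately naive policy)."""
    K = len(mu)
    counts, sums = np.ones(K), mu + rng.normal(0, 1, K)
    choices = list(range(K))
    for _ in range(T - K):
        i = int(np.argmax(sums / counts))
        sums[i] += mu[i] + rng.normal()
        counts[i] += 1
        choices.append(i)
    return choices

def ex_post_regret(mu, policy, T=200):
    """Regret(mu) = T * max_i mu_i - sum_t mu_{I_t} for one run of the policy."""
    chosen = policy(mu, T)
    return T * mu.max() - sum(mu[i] for i in chosen)

# Bayesian regret: average ex-post regret over means drawn from the prior N(0, 1).
runs = [ex_post_regret(rng.normal(0, 1, 5), greedy_policy) for _ in range(100)]
print("estimated Bayesian regret:", round(float(np.mean(runs)), 1))
```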

4.5.4 Acquisition Functions

Various strategies have been developed to balance the exploration-exploitation trade-off. These strategies differ in how they select arms based on past observations and rewards.

4.5.4.1 Classical Acquisition Functions

The Uniform acquisition function is the most straightforward approach: each arm is selected uniformly at random over time. This strategy does not consider past rewards and treats every arm as equally promising regardless of the observed outcomes. It is a purely explorative strategy that ensures each arm is sampled enough to estimate its expected reward, but it does not exploit that information to optimize rewards. In mathematical terms, if \(N_t(a)\) denotes the number of times arm \(a\) has been selected up to time \(t\), the Uniform strategy ensures that \(N_t(a) \approx \frac{t}{K}\) for all arms \(a\) as \(t\) grows large: \(P(a_t = a) = \frac{1}{K}\)

Epsilon-Greedy is a popular method that balances exploration and exploitation. With a small probability \(\epsilon\), it explores by choosing an arm at random, and with probability \(1 - \epsilon\), it exploits by selecting the arm with the highest estimated reward so far. This strategy incrementally favors actions that have historically yielded higher rewards, but still allows occasional exploration to potentially discover better options. The parameter \(\epsilon\) is chosen based on the desired exploration level, often set between 0.01 and 0.1. \[P(a_t = a) = \begin{cases} \frac{\epsilon}{K} + 1 - \epsilon & \text{if } a = \arg\max_{a'} \hat{\mu}_{a'} \\ \frac{\epsilon}{K} & \text{otherwise} \end{cases} \tag{4.41}\]

The Upper Confidence Bound (UCB) acquisition function takes a more sophisticated approach to the exploration-exploitation dilemma. It selects arms based on both the estimated rewards and the uncertainty associated with those estimates. Specifically, it favors arms with high upper confidence bounds on the estimated rewards, computed as the sum of the estimated mean and a confidence term that shrinks with the number of times the arm has been played. This ensures that arms with less certainty (those played less often) are considered more often, naturally balancing exploration with exploitation as the uncertainty is reduced over time.

\[P(a_t = a) = \begin{cases} 1 & \text{if } a = \arg\max_{a'} \left( \hat{\mu}_{a'} + \sqrt{\frac{2 \ln t}{N_t(a')}} \right) \\ 0 & \text{otherwise} \end{cases} \tag{4.42}\]
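The three classical acquisition functions can be compared on a toy Bernoulli bandit; the arm means, horizon, and \(\epsilon\) below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 5, 2000
mu = np.array([0.1, 0.3, 0.5, 0.7, 0.9])   # hypothetical Bernoulli arm means

def run(select):
    """Simulate T rounds of a selection rule; return cumulative (pseudo-)regret."""
    counts, sums, regret = np.zeros(K), np.zeros(K), 0.0
    for t in range(1, T + 1):
        a = select(t, counts, sums)
        sums[a] += rng.random() < mu[a]      # Bernoulli reward
        counts[a] += 1
        regret += mu.max() - mu[a]
    return regret

def uniform(t, counts, sums):
    return int(rng.integers(K))

def eps_greedy(t, counts, sums, eps=0.1):
    if t <= K:                               # pull each arm once first
        return t - 1
    if rng.random() < eps:
        return int(rng.integers(K))
    return int(np.argmax(sums / counts))

def ucb(t, counts, sums):
    if t <= K:
        return t - 1
    return int(np.argmax(sums / counts + np.sqrt(2 * np.log(t) / counts)))

results = {}
for name, sel in [("uniform", uniform), ("eps-greedy", eps_greedy), ("UCB", ucb)]:
    results[name] = run(sel)
    print(name, round(results[name], 1))
```

On a run like this, uniform sampling pays the average gap on every round, while Epsilon-Greedy and UCB concentrate pulls on the best arm and accumulate far less regret.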

4.5.4.2 Interleaved Filter

This algorithm tries to find the best bandit (the Condorcet winner) in a discrete, finite bandit space via pairwise comparisons. We now introduce the Interleaved Filter algorithm as provided in (Yue et al. 2012) to solve a dueling bandit setup. It starts with a randomly chosen best-bandit candidate \(\hat{b}\) and iteratively compares it to the set \(W\) of remaining bandits \(b\), maintaining winning probabilities \(\hat{P}_{\hat{b},b}\) and confidence intervals \(\hat{C}_{\hat{b},b}\). If a bandit \(b\) is confidently worse than \(\hat{b}\), it is removed from \(W\). If a bandit \(b'\) is confidently better than \(\hat{b}\), it becomes the new best candidate \(\hat{b}\), and the old \(\hat{b}\) as well as every other bandit \(b\) worse than it are removed from \(W\). This continues until \(W\) is empty, leaving the final \(\hat{b}\) as the predicted best bandit.

1:  input: \(T\), \(B=\{b_1, \dots, b_K\}\)
2:  \(\delta \gets 1/(TK^2)\)
3:  choose \(\hat{b} \in B\) randomly
4:  \(W \gets B \setminus \{\hat{b}\}\)
5:  \(\forall b \in W\), maintain estimate \(\hat{P}_{\hat{b},b}\) of \(P(\hat{b} > b)\)
6:  \(\forall b \in W\), maintain \(1 - \delta\) confidence interval \(\hat{C}_{\hat{b},b}\) of \(\hat{P}_{\hat{b},b}\)
7:  while \(W \neq \varnothing\) do
8:    for all \(b \in W\) do
9:      compare \(\hat{b}\) and \(b\)
10:     update \(\hat{P}_{\hat{b},b}\), \(\hat{C}_{\hat{b},b}\)
11:   end for
12:   while \(\exists b \in W\) with \(\hat{P}_{\hat{b},b} > 1/2\) and \(1/2 \notin \hat{C}_{\hat{b},b}\) do
13:     \(W \gets W \setminus \{b\}\)
14:   end while
15:   if \(\exists b' \in W\) with \(\hat{P}_{\hat{b},b'} < 1/2\) and \(1/2 \notin \hat{C}_{\hat{b},b'}\) then
16:     while \(\exists b \in W\) with \(\hat{P}_{\hat{b},b} > 1/2\) do
17:       \(W \gets W \setminus \{b\}\)
18:     end while
19:     \(\hat{b} \gets b'\), \(W \gets W \setminus \{b'\}\)
20:     \(\forall b \in W\), reset \(\hat{P}_{\hat{b},b}\) and \(\hat{C}_{\hat{b},b}\)
21:   end if
22: end while
23: \(\hat{T} \gets\) total comparisons made
24: return \((\hat{b}, \hat{T})\)

Parameter Initialization

In lines 1-6 of the algorithm, we take the inputs and first compute the value \(\delta\) which is used to compute our confidence intervals. We select an initial guess of an optimal bandit \(\hat{b}\) by uniformly sampling from all bandits \(\mathcal{B}\). We also keep a running set of bandit candidates \(W\), which is initialized to be \(\mathcal{B} \setminus \{\hat{b}\}\). At this point, we also initialize our empirical estimates for \(\hat{P}, \hat{C}\).

Next, we will repeat several steps until our working set of bandit candidates \(W\) is empty.

Update Estimates Based on Comparisons

The first step at each iteration (lines 8-11) is to look at all candidates in \(W\), and compare them to our current guess \(\hat{b}\) using an oracle (e.g. by asking a human which of \(\hat{b}\) or \(b \in W\) is preferred). With this new set of wins and comparisons, we update our estimates of \(\hat{P}, \hat{C}\).

Prune Suboptimal Bandits

In lines 12-13, with updated comparison win probabilities and corresponding confidence intervals, we can remove bandit candidates from \(W\) that we are confident \(\hat{b}\) is better than. The intuition here is that we are mostly sure that our current best guess is better than some of the candidates, and we don’t need to consider those candidates in future iterations.

Check for Better Bandits from Candidate Set

Now that our candidate set of bandits may be smaller, in lines 15-21 we check whether there is any bandit \(b'\) that we are confident is better than our current best guess. If we find such a candidate, we remove every bandit \(b\) for which \(\hat{P}\) indicates that \(b\) is likely worse than \(\hat{b}\). Note that in this step we do not require the probability to lie outside the confidence interval, since we have already found a candidate we believe to be significantly closer to optimal than our current best guess.

Once we remove the candidates likely worse than \(\hat{b}\), we crown \(b'\) as the new best guess, i.e. \(\hat{b} := b'\). Consequently, we remove \(b'\) from \(W\) and reset our empirical win counters \(\hat{P}, \hat{C}\).

With this algorithm defined, let us look at some provisions of the method with respect to identifying the optimal strategy. Note that the proofs and derivations for these quantities are provided in (Yue et al. 2012).

First, the method guarantees that for the provided time horizon \(T\), the algorithm returns the correct bandit with probability \(P \ge 1 - \frac{1}{T}\). It is useful to note that if one has a strict requirement for the probability of identifying the correct bandit, one can compute the time horizon \(T\) that guarantees this outcome at that probability. Furthermore, a time horizon of 1 leaves no probabilistic guarantee of a successful outcome, and increasing \(T\) has diminishing returns. Second, in the event that the algorithm returns an incorrect bandit, the maximal regret incurred is linear in \(T\), i.e. \(\mathcal{O}(T)\). This is also a useful provision as it allows us to estimate the overall cost in the worst-case outcome. Based on these two provisions, we can compute the expected cumulative regret from running the Interleaved Filter algorithm, which is: \[\mathbb{E}\left[R_T\right] \le \left(1 - \frac{1}{T}\right) \mathbb{E}\left[ R_T^{IF} \right] + \frac{1}{T}\mathcal{O}(T) = \mathcal{O}\left(\mathbb{E}\left[ R_T^{IF} \right] + 1\right) \tag{4.43}\]

Interestingly, the original work shows that these bounds hold for both strong and weak regret. As demonstrated, the Interleaved Filter algorithm provides a robust method to ascertain the optimal bandit or strategy given a set of options and only noisy comparisons. In most real-world scenarios for modeling human preferences, it is not possible to observe a real-world reward value (or at least a reliable one), and as such this method is a useful way to properly model human preferences.

Furthermore, the algorithm provides strong guarantees for the probability of selecting the correct bandit, maximal regret, and the number of comparisons required. It is even more impressive that the method can do so without severely limiting constraints; as demonstrated, the most commonly used models satisfy the imposed constraints.

As we look to model human preferences, we can certainly leverage this method for k-armed dueling bandits to identify the best strategy to solve human-centric challenges, from video recommendation to meal selection and exoskeleton-assisted walking.

Code: Interleaved Filter Implementation
Code: Comparing Dueling Bandit Algorithms
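As a sketch of what an Interleaved Filter implementation might look like: the Hoeffding-style confidence radius and the synthetic Bradley-Terry comparison model below are illustrative assumptions, not the exact constants of Yue et al. (2012).

```python
import numpy as np

rng = np.random.default_rng(2)

def interleaved_filter(P, T):
    """Sketch of the Interleaved Filter. P[i, j] = P(bandit i beats bandit j);
    T is the time horizon used to set delta. The confidence radius is an
    illustrative Hoeffding-style choice."""
    K = P.shape[0]
    delta = 1.0 / (T * K * K)
    b_hat = int(rng.integers(K))
    W = [b for b in range(K) if b != b_hat]
    wins = {b: 0 for b in W}   # wins of b_hat against each candidate b
    n = {b: 0 for b in W}      # comparisons of b_hat against each candidate
    total = 0
    while W:
        for b in list(W):      # compare b_hat once against every candidate
            wins[b] += rng.random() < P[b_hat, b]
            n[b] += 1
            total += 1
        def ci(b):             # empirical estimate and confidence radius
            return wins[b] / n[b], np.sqrt(np.log(1 / delta) / n[b])
        # Prune candidates that b_hat confidently beats.
        W = [b for b in W if not (ci(b)[0] - ci(b)[1] > 0.5)]
        # Promote a candidate that confidently beats b_hat.
        better = [b for b in W if ci(b)[0] + ci(b)[1] < 0.5]
        if better:
            b_new = better[0]
            # Drop candidates empirically worse than the old b_hat (no CI check).
            W = [b for b in W if wins[b] / n[b] <= 0.5 and b != b_new]
            b_hat = b_new
            wins = {b: 0 for b in W}
            n = {b: 0 for b in W}
    return b_hat, total

# Synthetic comparison model: Bradley-Terry over latent utilities,
# so bandit 0 is the Condorcet winner by construction.
u = np.array([0.9, 0.6, 0.4, 0.2])
P = 1 / (1 + np.exp(-(u[:, None] - u[None, :])))
best, comparisons = interleaved_filter(P, T=10_000)
print(best, comparisons)
```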

4.5.4.3 Dueling Bandit Gradient Descent

The Interleaved Filter works well when the action space is finite, but many practical settings involve continuous parameter spaces. For example, a search engine may be parameterized by a weight vector \(w \in \mathbb{R}^d\) controlling how different ranking signals are combined, and the goal is to find the \(w\) that produces the most preferred results. Dueling Bandit Gradient Descent (DBGD) (Yue and Joachims 2009) extends the dueling bandit framework to this continuous setting by performing gradient-free optimization using only preference feedback.

The idea is intuitive: at each step, compare the current parameters \(w_t\) against a random perturbation \(w_t' = w_t + \delta u_t\). If users prefer the perturbed system, take a step in that direction. This is analogous to stochastic gradient descent, but using preference comparisons instead of gradient evaluations.

input: \(\gamma\), \(\delta\), \(w_1\)
for \(t = 1, \dots, T\) do
  sample unit vector \(u_t\) uniformly
  \(w_t' \gets P_W(w_t + \delta u_t)\)
  compare \(w_t\) and \(w_t'\)
  if \(w_t'\) wins then
    \(w_{t+1} \gets P_W(w_t + \gamma u_t)\)
  else
    \(w_{t+1} \gets w_t\)
  end if
end for

We first choose the exploration step length \(\delta\), the exploitation step length \(\gamma\), and a starting point \(w_1\) in the unit sphere. At each iteration, we receive a query and sample a random unit vector \(u_t\). We then duel \(w_t\) and \(w_t'\), where \(w_t\) is our current point in the sphere and \(w_t'\) is our exploratory comparison, generated by taking a random step of length \(\delta\): \(w_t' = w_t + \delta u_t\). The objective of this duel is to ascertain the binary preference of users with respect to the results yielded by the IR systems parameterized by \(w_t\) and \(w_t'\), respectively, taking the query \(q_t\) as input. The parameterization that receives the majority of votes wins the head-to-head duel. If \(w_t\) wins, we keep the parameters for the next iteration. If \(w_t'\) wins, we update our parameters in the direction of \(u_t\) by taking a step of length \(\gamma\). Note that the algorithm uses a projection operation \(P_W(\overrightarrow{v})\): since \(u_t\) is chosen randomly, \(w_t + \delta u_t\) or \(w_t + \gamma u_t\) could lie outside the unit sphere in which all possible parameter configurations lie, in which case we simply project the point back using \(P_W(\overrightarrow{v})\).

Yue and Joachims show that this algorithm has sublinear regret in \(T\), the number of iterations. We note that the algorithm assumes that there exists a hidden reward function \(R(w)\) that maps system parameters \(w_t\) to a reward value which is smooth and strictly concave over the input space \(W\).

Lastly, we would also like to give motivation behind \(\delta\) and \(\gamma\) being different values. We need a \(\delta\) that is sufficiently large that the comparison between a system parameterized by \(w_t\) and \(w_t'\) is meaningful. On the other hand, we may wish to take a smaller step in the direction of \(w_t'\) during our update step, as during a duel, we only score \(w_t\) against \(w_t'\) over the results on one query \(q_t\). Having \(\delta > \gamma\) allows us to get reward signal from meaningfully different points while also updating our belief of the best point \(w_{\text{best}}\) gradually.
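A minimal sketch of DBGD under an assumed hidden concave utility; the duel oracle, noise level, optimum \(w_\star\), and step sizes are made up for the demo, and projection onto the unit ball stands in for the paper's feasible-set projection.

```python
import numpy as np

rng = np.random.default_rng(3)

def project_unit_ball(w):
    """Project w back into the unit ball if a step left it."""
    norm = np.linalg.norm(w)
    return w / norm if norm > 1 else w

def dbgd(prefer, d, T=3000, delta=0.5, gamma=0.05):
    """DBGD sketch: prefer(w, w2) returns True if the system parameterized
    by the exploratory point w2 wins the duel against w."""
    w = np.zeros(d)
    for _ in range(T):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                    # random unit direction u_t
        w2 = project_unit_ball(w + delta * u)     # exploratory candidate w_t'
        if prefer(w, w2):
            w = project_unit_ball(w + gamma * u)  # small step toward the winner
    return w

# Hidden concave utility with optimum w_star (an assumption for the demo);
# the noisy duel oracle prefers whichever parameters score higher.
w_star = np.array([0.6, -0.3, 0.5])
def utility(w):
    return -np.sum((w - w_star) ** 2)
def prefer(w, w2):
    return utility(w2) + 0.1 * rng.normal() > utility(w) + 0.1 * rng.normal()

w_final = dbgd(prefer, d=3)
print(np.round(w_final, 2))
```

Note how the large exploration step \(\delta\) makes each duel informative despite the noise, while the small \(\gamma\) keeps the parameter updates conservative.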

Sparring EXP4

While DBGD handles continuous action spaces, many real-world dueling problems also involve context—the best action depends on the situation. For example, the best search ranking for a query about “python” differs depending on whether the user is a programmer or a pet owner. Sparring EXP4 (Zoghi et al. 2015) addresses this contextual dueling bandit setting by adapting the EXP4 algorithm (originally designed for contextual bandits with scalar rewards) to preference feedback.

The key idea is to run two instances of the EXP4 algorithm against each other. Each instance maintains weights over a set of expert policies and produces a probability distribution over actions given the current context. When a context arrives, each instance samples an action, the two actions are dueled, and the winner receives a positive reward while the loser receives a negative reward proportional to the preference strength. This “sparring” approach elegantly converts pairwise preference feedback into the scalar reward signals that EXP4 requires.

4.5.4.4 Feel-good Thompson Sampling

Feel-good Thompson Sampling (Zhang 2021) offers an alternative approach to the contextual dueling bandit setting, connecting back to the Thompson Sampling framework developed earlier in this chapter. Like Sparring EXP4, it handles context-dependent preferences, but it uses posterior sampling rather than adversarial learning. The algorithm aims to find a Von Neumann winner by minimizing cumulative average regret: \[\text{Regret}(T) := \sum_{t=1}^{T} \left[ r_{*}(x_t, a_{t}^{*}) - \frac{r_{*}(x_t, a_{t}^{1}) + r_{*}(x_t, a_{t}^{2})}{2} \right], \tag{4.44}\] where \(r_{*}(x_t, a_{t})\) is the true, hidden reward function of context \(x_t\) and action \(a_t\). Thompson sampling here is an iterative process: at each round we receive a preference between two actions, each of which maximizes a different approximation of the reward function based on past data, and we add this new information to the data.

Good approximations of the reward function at time \(t\) are obtained by sampling two reward-function parameters \(\theta_t^{j=1}\) and \(\theta_t^{j=2}\) from a posterior distribution \(p_j(\cdot \mid S_{t-1})\) based on all previous data. This posterior is proportional to the product of the prior and the likelihood function, which is a Gaussian in standard Thompson sampling. In Feel-good Thompson sampling, an additional term called “feel-good exploration” encourages parameters \(\theta\) with a large maximum reward in previous rounds. This change to the likelihood function can increase probabilities in uncertain areas, thus exploring those regions. All that is left is to select an action maximizing each reward-function approximation and receive a preference \(y_t\) between them, adding the new information to the dataset (Zhang 2021).

Initialize \(S_0 = \varnothing\).
for \(t = 1, 2, \dots\) do
  Receive prompt \(x_t\) and action space \(\mathcal{A}_t\).
  for \(j = 1, 2\) do
    Sample model parameter \(\theta_t^j\) from the posterior distribution \(p^j(\cdot \mid S_{t-1})\).
    Select response \(a_t^j = \arg\max_{a \in \mathcal{A}_t} \langle \theta_t^j, \phi(x_t, a) \rangle\).
  end for
  Receive preference \(y_t\).
  Update dataset \(S_t \gets S_{t-1} \cup \{(x_t, a_t^1, a_t^2, y_t)\}\).
end for
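The loop above can be sketched for a linear preference model. The posterior update below is a crude conjugate-Gaussian stand-in: it omits the feel-good exploration bonus and the exact logistic likelihood of Zhang (2021), and the hidden parameters and action features are made up, so it only illustrates the structure of the algorithm.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
theta_true = np.array([1.0, -0.5, 0.8, 0.2])   # hidden preference parameters

A = np.eye(d)          # precision of the Gaussian posterior surrogate
b = np.zeros(d)
S = []                 # dataset S_t of duel records

for t in range(500):
    actions = rng.normal(size=(10, d))         # feature vectors phi(x_t, a)
    cov = np.linalg.inv(A)
    mean = cov @ b
    # Sample two independent models from the posterior; each picks its argmax action.
    picks = [int(np.argmax(actions @ rng.multivariate_normal(mean, cov)))
             for _ in range(2)]
    a1, a2 = picks
    z = actions[a1] - actions[a2]
    # Bernoulli preference from the hidden Bradley-Terry model.
    y = rng.random() < 1 / (1 + np.exp(-theta_true @ z))
    # Crude conjugate update treating (y - 0.5) as a noisy observation of theta.z;
    # the real algorithm uses the logistic likelihood plus the feel-good term.
    A += np.outer(z, z)
    b += (float(y) - 0.5) * z
    S.append((z, y))

mean = np.linalg.inv(A) @ b
print(np.round(mean, 2))
```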

Key Takeaway: Choosing an Algorithm

The dueling bandit algorithms covered above address different problem settings:

| Algorithm | Action Space | Context? | Winner Concept | Key Strength |
|---|---|---|---|---|
| Interleaved Filter | Finite | No | Condorcet | Simple, provable |
| DBGD | Continuous | No | Reward maximizer | Gradient-free optimization |
| Sparring EXP4 | Finite | Yes | Copeland | Adversarial robustness |
| Feel-good TS | Finite | Yes | Von Neumann | Bayesian, posterior sampling |

When choosing an algorithm, consider: (1) Is the action space discrete or continuous? (2) Does the optimal action depend on context? (3) What computational resources are available? Thompson Sampling-based methods are generally preferred when a good posterior approximation is available, while adversarial methods like Sparring EXP4 are more robust when the preference model is misspecified.

4.5.5 Applications

The dueling bandit and Thompson Sampling frameworks developed above have broad applicability whenever systems must learn from preference feedback rather than scalar rewards. We highlight two case studies that illustrate how these methods connect to the book’s central themes.

4.5.5.1 Case Study 1: LLM Response Selection as a Dueling Bandit

Consider an LLM serving system that generates \(K\) candidate responses to each user prompt and must select which to display. This is exactly a contextual dueling bandit problem:

  • Context: The user prompt \(x_t\) and its embedding
  • Actions: The \(K\) candidate responses \(\{y_1, \ldots, y_K\}\)
  • Preference feedback: The user indicates which response they prefer (via explicit rating, continued engagement, or regeneration requests)
  • Objective: Maximize user satisfaction while learning which response characteristics users value

The preference model is Bradley-Terry over response embeddings: \(p(y_j \succ y_k \mid x) = \sigma(\beta \log \frac{\pi_\theta(y_j \mid x)}{\pi_{\text{ref}}(y_j \mid x)} - \beta \log \frac{\pi_\theta(y_k \mid x)}{\pi_{\text{ref}}(y_k \mid x)})\), exactly the DPO formulation from Chapter 2. Thompson Sampling provides a principled way to select which pair of responses to show: sample a reward function from the posterior and present the pair with the highest expected information gain.

This connects the bandit framework to the RLHF pipeline: online DPO is essentially Thompson Sampling on the Bradley-Terry preference model. The exploration-exploitation tradeoff determines whether the system shows responses it is confident about (exploitation) or tries novel response styles to learn user preferences (exploration).
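The Bradley-Terry/DPO preference probability in this formulation is easy to compute from log-probabilities; the function name and the numbers below are made up for illustration, with \(\beta\) playing the role of the DPO temperature.

```python
import math

def dpo_preference(logp_j, logp_ref_j, logp_k, logp_ref_k, beta=0.1):
    """p(y_j > y_k | x) under Bradley-Terry with the implicit DPO reward
    r = beta * (log pi_theta - log pi_ref)."""
    r_j = beta * (logp_j - logp_ref_j)
    r_k = beta * (logp_k - logp_ref_k)
    return 1 / (1 + math.exp(-(r_j - r_k)))

# Response j is upweighted relative to the reference; response k is downweighted.
p = dpo_preference(logp_j=-10.0, logp_ref_j=-12.0,
                   logp_k=-11.0, logp_ref_k=-9.0, beta=0.1)
print(round(p, 3))   # sigma(0.4) ~ 0.599
```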

4.5.5.2 Case Study 2: Clinical Trial Dosing

In personalized medicine, the dueling bandit framework addresses a critical challenge: how to assign drug dosages when the optimal dose depends on individual patient characteristics and can only be evaluated through patient preferences or outcomes.

Consider Warfarin dosing (Bastani and Bayati 2020), where the standard practice is to start with a fixed dose and adjust based on symptoms. This is suboptimal because patients vary significantly in their metabolism. The contextual bandit formulation models this as:

  • Context: Patient features \(x_t\) (age, weight, genetic markers, medication history)
  • Actions: Dosage levels \(a_t \in \{\)low, medium, high\(\}\)
  • Reward: Whether the patient’s INR (blood clotting measure) falls in the therapeutic range

Using the linear Thompson Sampling framework from earlier in this chapter, we model the reward as \(r(x_t, a_t) = x_t^\top \theta_{a_t}\), where \(\theta_{a_t}\) captures the relationship between patient features and outcomes for each dosage level. The key insight is that while each patient’s outcome reveals information about only one dosage level (we cannot simultaneously try all doses), the contextual structure allows information to transfer across similar patients.

This application highlights a tension central to this book: the system must balance exploration (trying dosages it is uncertain about to learn their effects) against exploitation (prescribing the dosage believed to be best for the current patient). In clinical settings, the cost of exploration is high—a suboptimal dose can cause serious harm—making efficient algorithms like Thompson Sampling especially valuable, since they concentrate exploration where uncertainty is highest rather than exploring uniformly.
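A sketch of linear Thompson Sampling for this dosing setup, with one Gaussian posterior per dosage level. The hidden parameters, noise level, and feature dimension are illustrative assumptions, and the linear reward is a stand-in for the therapeutic-range indicator.

```python
import numpy as np

rng = np.random.default_rng(5)
d, K, T = 3, 3, 2000       # patient features, dosage levels, patients

# Hidden per-dose parameters (made up): reward x @ theta_true[a] stands in
# for the chance of landing in the therapeutic range at dose a.
theta_true = rng.normal(size=(K, d))

A = [np.eye(d) for _ in range(K)]      # per-arm Gaussian posterior precision
b = [np.zeros(d) for _ in range(K)]

correct = 0
for t in range(T):
    x = rng.normal(size=d)             # features of the arriving patient
    # Thompson sampling: draw one parameter sample per dose, pick the best.
    sampled = []
    for a_ in range(K):
        cov = np.linalg.inv(A[a_])
        sampled.append(x @ rng.multivariate_normal(cov @ b[a_], cov))
    a = int(np.argmax(sampled))
    r = x @ theta_true[a] + 0.1 * rng.normal()   # outcome for chosen dose only
    A[a] += np.outer(x, x)                       # Bayesian linear regression update
    b[a] += r * x
    correct += a == int(np.argmax(theta_true @ x))

print("fraction of optimal doses:", correct / T)
```

Because each outcome updates only the chosen arm's posterior, the contextual structure is what lets information transfer across similar patients.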

4.5.5.3 Broader Applications

Beyond these case studies, bandit methods with preference feedback have been applied to recommendation systems (Zhou et al. 2017; Bouneffouf, Bouzeghoub, and Gançarski 2012), where the cold-start problem for new users is naturally handled by Thompson Sampling’s uncertainty-driven exploration; to dialogue systems (Liu et al. 2018), where response selection is modeled as a contextual bandit; and to search engine optimization (Yue and Joachims 2009), where DBGD tunes ranking parameters from user click preferences. For a comprehensive survey, see Bouneffouf, Rish, and Aggarwal (2020).

4.6 Preferential Bayesian Optimization

The traditional Bayesian optimization (BO) problem is described as follows. There is a black-box objective function \(g: \mathcal{X} \rightarrow \Re\) defined on a bounded subset \(\mathcal{X} \subseteq \Re^q\) such that direct queries to the function are expensive or not possible. However, we would like to solve the global optimization problem of finding \(\mathbf{x}_{\min }=\arg \min _{\mathbf{x} \in \mathcal{X}} g(\mathbf{x})\). This is highly analogous to modeling human preferences, since direct access to a human’s latent preference function is not possible but we would still like to find its optimum, such as in A/B tests or recommender systems.

We approach this problem for human preferences with Preferential Bayesian Optimization (PBO); the key difference is that we can query the preference function through pairwise comparisons of data points, i.e. duels. This is a form of indirect observation of the objective function, which models real-world scenarios closely: we commonly need to optimize a function via data about preferences. It has been demonstrated that humans are better at evaluating differences than absolute magnitudes (Kahneman and Tversky 1979), and therefore PBO models can be applied in various contexts.

When Is Pairwise Feedback the Right Choice?

PBO assumes the oracle can only provide pairwise comparisons, not scalar ratings. This is well-justified when absolute evaluations are cognitively difficult or poorly calibrated (e.g., “How good is this robot trajectory on a scale of 1-10?”), but in some domains scalar feedback is readily available and strictly more informative—each scalar observation constrains the reward function more than a binary comparison. When both modalities are feasible, standard Bayesian optimization with scalar feedback typically requires fewer queries to find the optimum. PBO is most valuable when: (1) comparisons are genuinely easier for humans than absolute ratings, (2) the rating scale is arbitrary or poorly anchored across evaluators, or (3) the goal is to find the best option rather than estimate the full reward function.

4.6.1 Problem statement

The problem of finding the optimum of a latent preference function defined on \(\mathcal{X}\) can be reduced to determining a sequence of duels on \(\mathcal{X} \times \mathcal{X}\). From each duel \(\left[\mathbf{x}, \mathbf{x}^{\prime}\right] \in \mathcal{X} \times \mathcal{X}\) we obtain binary feedback in \(\{0,1\}\) indicating whether or not \(\mathbf{x}\) is preferred over \(\mathbf{x}^{\prime}\) (i.e. \(g(\mathbf{x}) < g(\mathbf{x}^{\prime})\)). We consider \(\mathbf{x}\) the winner of the duel if the output is 1 and \(\mathbf{x}^{\prime}\) the winner if the output is 0. The aim is to find \(\mathbf{x}_{\min }\) using as few queried duels as possible.

The key idea in PBO is to learn a preference function in the space of duels using a Gaussian process. We define a joint reward \(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\) on each duel which is never directly observed. Instead, the feedback we obtain after each pair is a binary output \(y \in \{0,1\}\) indicating which of the two inputs is preferred. One definition of \(f\) we will use (though others are possible) is \(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=g\left(\mathbf{x}^{\prime}\right)-g(\mathbf{x})\). The more \(\mathbf{x}\) is preferred over \(\mathbf{x}^{\prime}\), the bigger the reward.

We define the model of preference using a Bernoulli likelihood, where \(p\left(y=1 \mid\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\) and \(p\left(y=0 \mid\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=\pi_f\left(\left[\mathbf{x}^{\prime}, \mathbf{x}\right]\right)\) for some inverse link function \(\pi: \Re \times \Re \rightarrow[0,1]\). \(\pi_f\) has the property that \(\pi_f\left(\left[\mathbf{x}^{\prime}, \mathbf{x}\right]\right)=1-\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\). A natural choice for \(\pi_f\) is the logistic function \[\label{eq:bernoulli_pref} \pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=\sigma\left(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\right)=\frac{1}{1+e^{-f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)}},\] but others are possible. Therefore we have that for any duel \(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\) in which \(g(\mathbf{x}) \leq g\left(\mathbf{x}^{\prime}\right)\) it holds that \(\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) \geq 0.5\). \(\pi_f\) is a preference function that maps each query \(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\) to the probability of having a preference on the left input \(\mathbf{x}\) over the right input \(\mathbf{x}^{\prime}\).

When we marginalize \(f\) over the right input \(\mathbf{x}^{\prime}\), its maximizer in \(\mathcal{X}\) coincides with \(\mathbf{x}_{\min }\). We also introduce the definition of the Copeland score function for a point \(\mathbf{x}\) as \[S(\mathbf{x})=\operatorname{Vol}(\mathcal{X})^{-1} \int_{\mathcal{X}} \mathbb{I}_{\left\{\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) \geq 0.5\right\}} d \mathbf{x}^{\prime} \tag{4.45}\] where \(\operatorname{Vol}(\mathcal{X})=\int_{\mathcal{X}} d \mathbf{x}^{\prime}\) is a normalizing constant that bounds \(S(\mathbf{x})\) in the interval \([0,1]\). If \(\mathcal{X}\) is a finite set, the Copeland score is simply the proportion of duels that a certain element \(\mathbf{x}\) will win with probability larger than 0.5. A soft variant we will use instead of the Copeland score is the soft-Copeland score, defined as \[\label{eq:soft-copeland} C(\mathbf{x})=\operatorname{Vol}(\mathcal{X})^{-1} \int_{\mathcal{X}} \pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) d \mathbf{x}^{\prime}\] where the probability function \(\pi_f\) is integrated over \(\mathcal{X}\). This score aims to capture the average probability of \(\mathbf{x}\) being the winner of a duel.

We define the Condorcet winner \(\mathbf{x}_c\) as the point with maximal soft-Copeland score. Note that this corresponds to the global minimum of \(g\): the defining integral is maximized at points \(\mathbf{x} \in \mathcal{X}\) for which \(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=g\left(\mathbf{x}^{\prime}\right)-g(\mathbf{x}) \geq 0\) for all \(\mathbf{x}^{\prime}\), which occurs only if \(\mathbf{x}_c\) is a minimizer of \(g\). Therefore, if the preference function \(\pi_f\) can be learned by observing the outcomes of duels, the problem of minimizing \(g\) reduces to finding the Condorcet winner of the soft-Copeland score.

Code: Copeland Score Visualization
Code: GP Preference Model for PBO

4.6.2 Acquisition Functions

We describe several acquisition functions for sequentially learning the Condorcet winner. Our dataset \(\mathcal{D}=\left\{\left[\mathbf{x}_i, \mathbf{x}_i^{\prime}\right], y_i\right\}_{i=1}^N\) collects the \(N\) duels performed so far. We aim to define a sequential policy \(\alpha\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right] ; \mathcal{D}_j, \theta\right)\) for querying duels, where \(\theta\) is a vector of model hyper-parameters, in order to find the minimum of the latent function \(g\) as quickly as possible. Gaussian process (GP) classification on \(\mathcal{D}\) allows us to perform inference over \(f\), and hence over \(\pi_f\).

Pure Exploration

The output variable \(y_{\star}\) of a prediction follows a Bernoulli distribution with probability given by the preference function \(\pi_f\). One way to explore is to query the duel whose outcome the model is most uncertain about, i.e. the duel with the highest variance of \(\sigma\left(f_{\star}\right)\), obtained by pushing our epistemic uncertainty about \(f\), modeled by the GP, through the logistic function. The first moment of this distribution coincides with the expectation of \(y_{\star}\), and its variance is \[\begin{aligned} \mathbb{V}\left[\sigma\left(f_{\star}\right)\right] & =\int\left(\sigma\left(f_{\star}\right)-\mathbb{E}\left[\sigma\left(f_{\star}\right)\right]\right)^2 p\left(f_{\star} \mid \mathcal{D},\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) d f_{\star} \\ & =\int \sigma\left(f_{\star}\right)^2 p\left(f_{\star} \mid \mathcal{D},\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) d f_{\star}-\mathbb{E}\left[\sigma\left(f_{\star}\right)\right]^2 \end{aligned} \tag{4.46}\] which explicitly takes into account the uncertainty over \(f\). Hence, pure exploration of the duel space can be carried out by maximizing \[\alpha_{\mathrm{PE}}\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right] \mid \mathcal{D}_j\right)=\mathbb{V}\left[\sigma\left(f_{\star}\right) \mid \left[\mathbf{x}, \mathbf{x}^{\prime}\right], \mathcal{D}_j\right] . \tag{4.47}\]

Note that in this case, duels that have already been visited have a lower chance of being selected again, even when the objective takes similar values for both players. In practice, this acquisition function requires computing an intractable integral, which we approximate by Monte Carlo sampling.
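The Monte Carlo approximation just described can be sketched as follows (the posterior moments below are hypothetical stand-ins for actual GP predictive means and variances at a candidate duel):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def pure_exploration_acq(mu, var, n_samples=100_000):
    """Monte Carlo estimate of Var[sigmoid(f_*)] for f_* ~ N(mu, var)."""
    f = rng.normal(mu, np.sqrt(var), size=n_samples)
    return sigmoid(f).var()

# A confidently decided duel (mean far from zero, small variance) should score
# lower than an undecided one (mean near zero, large variance).
a_confident = pure_exploration_acq(mu=3.0, var=0.1)
a_uncertain = pure_exploration_acq(mu=0.0, var=4.0)
print(a_confident, a_uncertain)
```

Maximizing this quantity over candidate duels then implements \(\alpha_{\mathrm{PE}}\).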

Principled Optimistic Preferential Bayesian Optimization (POP-BO)

In a slightly modified problem setup (Xu et al. 2024), the algorithm maintains an MLE \(\hat{g}\) of the ground-truth black-box function \(g\) together with a confidence set of plausible functions. It is assumed that \(g\) is a member of a reproducing kernel Hilbert space (RKHS) \(\mathcal{H}_k\) for some kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\), with \(\|g\|_k \leq B\), so that \(\mathcal{B}_g = \left\{\tilde{g} \in \mathcal{H}_k \mid\|\tilde{g}\|_k \leq B\right\}\). Defining \(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=g(\mathbf{x})-g\left(\mathbf{x}^{\prime}\right)\) (the sign convention is flipped relative to the preceding section, since POP-BO maximizes \(g\)), we model the preference function with a Bernoulli distribution as in Equation [eq:bernoulli_pref] and additionally assume that the probabilities follow the Bradley-Terry model, i.e. \[\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=\sigma\left(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\right)=\frac{e^{g(\mathbf{x})}}{e^{g(\mathbf{x})}+e^{g\left(\mathbf{x^{\prime}}\right)}}. \tag{4.48}\]

The MLE \(\hat{g}\) is updated as (Xu et al. 2024, Eqs. 5, 6, and 8) \[\begin{aligned} \hat{g}_t^{\text {MLE }}&:= \underset{\tilde{g} \in \mathcal{B}_g}{\arg\max}\ \ell_t(\tilde{g}) \\ \ell_t(\tilde{g}) &:= \log \prod_{\tau=1}^t \left[ y_\tau \pi_{\tilde{f}}([\mathbf{x_\tau}, \mathbf{x^{\prime}_\tau}])+\left(1-y_\tau\right)\left(1-\pi_{\tilde{f}}([\mathbf{x_\tau}, \mathbf{x^{\prime}_\tau}])\right) \right] \\ &=\sum_{\tau=1}^t \log \left(\frac{e^{\tilde{g}(\mathbf{x_\tau})} y_\tau+e^{\tilde{g}(\mathbf{x_\tau^\prime})}\left(1-y_\tau\right)}{e^{\tilde{g}(\mathbf{x_\tau})}+e^{\tilde{g}(\mathbf{x_\tau^\prime})}}\right) \\ &=\sum_{\tau=1}^t\left(\tilde{g}(\mathbf{x_\tau}) y_\tau+\tilde{g}(\mathbf{x_\tau^\prime})\left(1-y_\tau\right)\right)-\sum_{\tau=1}^t \log \left(e^{\tilde{g}(\mathbf{x_\tau})}+e^{\tilde{g}(\mathbf{x_\tau^\prime})}\right) \end{aligned} \tag{4.49}\]

Xu et al. (2024, Eq. 22) show how to represent this as a convex optimization problem so that it can be solved efficiently.
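To see why this likelihood identifies \(g\) (up to an additive shift), here is a plain gradient-ascent sketch on a finite domain; the setup is hypothetical, and POP-BO itself uses the convex reformulation rather than this naive ascent:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n = 20                                   # finite domain: g is just a vector of n utilities
g_true = rng.normal(size=n)

# Simulate duels with outcomes drawn from the Bradley-Terry model (Eq. 4.48).
left, right = rng.integers(0, n, size=(2, 5000))
keep = left != right
left, right = left[keep], right[keep]
y = (rng.random(left.size) < sigmoid(g_true[left] - g_true[right])).astype(float)

# Gradient ascent on the log-likelihood ell_t of Eq. 4.49:
# d ell / d g(x_tau) = y_tau - p_tau, and d ell / d g(x'_tau) = p_tau - y_tau.
g_hat = np.zeros(n)
for _ in range(500):
    p = sigmoid(g_hat[left] - g_hat[right])   # P(left wins) under current estimate
    grad = np.zeros(n)
    np.add.at(grad, left, y - p)
    np.add.at(grad, right, p - y)
    g_hat += 0.005 * grad
g_hat -= g_hat.mean()                          # g is identifiable only up to a shift

corr = np.corrcoef(g_hat, g_true)[0, 1]
print(f"correlation with ground truth: {corr:.3f}")
```

With enough duels, the recovered utilities rank the domain points essentially as \(g\) does, which is all the optimistic acquisition step below needs.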

The confidence set \(\mathcal{B}_g^{t+1}\) is updated, for any \(\epsilon, \delta > 0\), as (Xu et al. 2024, Eqs. 9–10)

\[\mathcal{B}_g^{t+1}:=\left\{\tilde{g} \in \mathcal{B}_g \mid \ell_t(\tilde{g}) \geq \ell_t\left(\hat{g}_t^{\mathrm{MLE}}\right)-\beta_1(\epsilon, \delta, t)\right\} \tag{4.50}\] where \[\beta_1(\epsilon, \delta, t):=\sqrt{32 t B^2 \log \frac{\pi^2 t^2 \mathcal{N}\left(\mathcal{B}_f, \epsilon,\|\cdot\|_{\infty}\right)}{6 \delta}}+ C_L \epsilon t=\mathcal{O}\left(\sqrt{t \log \frac{t \mathcal{N}\left(\mathcal{B}_f, \epsilon,\|\cdot\|_{\infty}\right)}{\delta}}+\epsilon t\right), \tag{4.51}\] with \(C_L\) a constant independent of \(\delta\), \(t\), and \(\epsilon\). Typically \(\epsilon\) is chosen to be \(1/T\), where \(T\) is the time horizon of the algorithm. With this choice, the following guarantee holds: \[\mathbb{P}\left(g \in \mathcal{B}_g^{t+1}, \forall t \geq 1\right) \geq 1-\delta . \tag{4.52}\]

Intuitively, the confidence set \(\mathcal{B}_g^{t+1}\) contains those functions whose log-likelihood is only ‘a little worse’ than that of the maximum likelihood estimator, and the theorem states that \(\mathcal{B}_g^{t+1}\) contains the ground-truth function \(g\) with high probability.

The inner maximization over \(\tilde{g}\) in the acquisition step of the algorithm can also be cast as a convex optimization problem (Xu et al. 2024, Eqs. 24–25). The outer maximization over \(\mathbf{x}\) can be solved by grid search, or via Eq. 26 of the same paper for medium-sized problems.

Algorithm: POP-BO (Xu et al. 2024)

Input: Initial point \(\mathbf{x_0} \in \mathcal{X}\); confidence set \(\mathcal{B}_g^1 = \mathcal{B}_g\)

For \(t = 1, 2, \ldots\):

  1. Set the reference point \(\mathbf{x_t^{\prime}} = \mathbf{x_{t-1}}\)
  2. Compute \(\mathbf{x_t} \in \arg\max_{\mathbf{x} \in \mathcal{X}} \max_{\tilde{g} \in \mathcal{B}_g^t} (\tilde{g}(\mathbf{x}) - \tilde{g}(\mathbf{x_t^{\prime}}))\), denoting the inner optimal function by \(\tilde{g}_t\)
  3. Obtain the outcome \(y_t\) of the duel and append the new data point to \(\mathcal{D}_t\)
  4. Update the maximum likelihood estimator \(\hat{g}_t^{\mathrm{MLE}}\) and the confidence set \(\mathcal{B}_g^{t+1}\)
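A toy sketch of this loop on a discretized domain follows, replacing the RKHS ball with a finite pool of random candidate functions and the threshold \(\beta_1\) with a fixed constant (both hypothetical simplifications for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

xs = np.linspace(0.0, 1.0, 51)                     # finite domain X
g_true = np.sin(3.0 * xs)                          # ground-truth utility (maximized)

# Stand-in for the RKHS ball B_g: a finite pool of random smooth candidates.
params = rng.uniform([1.0, 0.0], [5.0, 6.0], size=(300, 2))
pool = np.array([np.sin(a * xs + b) for a, b in params])

def log_lik(G, left, right, ys):
    """Bernoulli log-likelihood of duel outcomes under each candidate g (Eq. 4.49)."""
    p = sigmoid(G[:, left] - G[:, right])          # Bradley-Terry P(left wins)
    return (ys * np.log(p) + (1.0 - ys) * np.log(1.0 - p)).sum(axis=1)

left, right, ys = [], [], []
x_idx = 0                                          # initial point x_0
for t in range(1, 60):
    ll = log_lik(pool, np.array(left, int), np.array(right, int), np.array(ys))
    conf = pool[ll >= ll.max() - 2.0]              # likelihood-threshold confidence set
    ref = x_idx                                    # reference point x'_t = x_{t-1}
    x_idx = int(np.argmax((conf - conf[:, [ref]]).max(axis=0)))  # optimistic choice
    y = float(rng.random() < sigmoid(g_true[x_idx] - g_true[ref]))
    left.append(x_idx); right.append(ref); ys.append(y)

print("reported x:", xs[x_idx], "| true argmax:", xs[np.argmax(g_true)])
```

Candidates that rank points inconsistently with the observed duels fall out of the confidence set, so the optimistic choice concentrates near the maximizer of \(g\).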

qEUBO: Decision-Theoretic EUBO

qEUBO (Astudillo et al. 2023) derives an acquisition function that extends duels to queries of \(q \geq 2\) options. Let \(X=\left(\mathbf{x_1}, \ldots, \mathbf{x_q}\right) \in \mathcal{X}^q\) denote a query, and let \(g: \mathcal{X} \rightarrow \Re\) be the latent preference function. Then, after \(n\) user queries, we define the expected utility of the best option (qEUBO) as \[\mathrm{qEUBO}_n(X)=\mathbb{E}_n\left[\max \left\{g\left(x_1\right), \ldots, g\left(x_q\right)\right\}\right]. \tag{4.53}\]
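qEUBO is straightforward to estimate by Monte Carlo, given a Gaussian posterior over the utilities \(g(X)\) at the query points. The sketch below uses hypothetical query posteriors rather than a fitted model:

```python
import numpy as np

rng = np.random.default_rng(0)

def qeubo(mu, cov, n_samples=50_000):
    """MC estimate of qEUBO_n(X) = E_n[max_i g(x_i)] for jointly Gaussian g(X)."""
    samples = rng.multivariate_normal(mu, cov, size=n_samples)
    return samples.max(axis=1).mean()

# Two hypothetical q = 2 queries with identical means: the one with more posterior
# uncertainty offers a larger expected utility of its best option.
mu = np.zeros(2)
tight = qeubo(mu, 0.01 * np.eye(2))
loose = qeubo(mu, 1.00 * np.eye(2))
print(tight, loose)   # E[max of two iid N(0, s^2)] = s / sqrt(pi)
```

This also makes qEUBO's preference for informative queries concrete: all else equal, higher posterior uncertainty over the options raises the acquisition value.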

We now show that qEUBO is one-step Bayes optimal, meaning that each step chooses the query that maximizes the expected utility received by the human. For a query \(X \in \mathcal{X}^q\), let \[V_n(X)=\mathbb{E}_n\left[\max _{x \in \mathbb{X}} \mathbb{E}_{n+1}[g(x)] \mid X_{n+1}=X\right] . \tag{4.54}\] Then \(V_n(X)\) is the expected utility received if an additional query \(X_{n+1}=X\) is performed, and maximizing \(V_n\) is one-step Bayes optimal. Since \(\max _{x \in \mathbb{X}} \mathbb{E}_n[g(x)]\) does not depend on \(X_{n+1}\), we can equivalently maximize \[\mathbb{E}_n\left[\max _{x \in \mathbb{X}} \mathbb{E}_{n+1}[g(x)]-\max _{x \in \mathbb{X}} \mathbb{E}_n[g(x)] \mid X_{n+1}=X\right], \tag{4.55}\] which takes the same form as the knowledge gradient acquisition function (Wu and Frazier 2018) in standard Bayesian optimization.

\(V_n\) involves a nested stochastic optimization task, while qEUBO is a much simpler policy. When human responses are noise-free, we are able to use qEUBO as a sufficient policy due to the following theorem:

\[\underset{X \in \mathbb{X}^q}{\operatorname{argmax}} \mathrm{qEUBO}_n(X) \subseteq \underset{X \in \mathbb{X}^q}{\operatorname{argmax}} V_n(X) . \tag{4.56}\]

Proof sketch. The proof establishes two key inequalities. First, the one-step value \(V_n(X)\) is bounded above by \(\text{qEUBO}_n(X^+(X))\), where \(X^+(X)\) contains the best post-observation recommendations—because selecting the best option from a query can only be as good as already knowing the best option in the pool. Second, \(\text{qEUBO}_n(X) \leq V_n(X)\) because the user-selected option from the query may not be the global optimum. Combining these via a proof by contradiction shows that any maximizer of qEUBO is also a maximizer of the one-step Bayes-optimal value function \(V_n\). See Astudillo et al. (2023) for the full proof. \(\square\)

Therefore, maximizing \(\text{qEUBO}_n\) is sufficient for one-step Bayes optimality.

In experiments comparing qEUBO to other state-of-the-art acquisition functions, qEUBO consistently outperformed them on most problems, followed most closely by qEI and qTS. These results extend to queries with more than two options; in fact, regret converges faster when human queries contain more options. The regret analysis is presented in detail below.

qEI: Batch Expected Improvement

\[\begin{aligned} \mathrm{qEI}= & \mathbb{E}_{\mathbf{y}}\left[\left(\max _{i \in[1, \ldots, q]}\left(\mu_{\min }-y_i\right)\right)_{+}\right] \\ = & \sum_{i=1}^q \mathbb{E}_{\mathbf{y}}\left(\mu_{\min }-y_i \mid y_i \leq \mu_{\min }, y_i \leq y_j \forall j \neq i\right) \\ & p\left(y_i \leq \mu_{\min }, y_i \leq y_j \forall j \neq i\right) . \end{aligned} \tag{4.57}\]
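Here \(\mu_{\min}\) is the best (smallest) posterior mean so far and \(y_i\) the outcomes at the \(q\) batch points; note that \(\left(\max_i(\mu_{\min}-y_i)\right)_+ = \left(\mu_{\min}-\min_i y_i\right)_+\). A Monte Carlo sketch (with hypothetical posteriors), checked against the classic closed-form EI for \(q=1\):

```python
import numpy as np

rng = np.random.default_rng(0)

def qei(mu_min, mu, cov, n_samples=200_000):
    """MC estimate of qEI = E[(mu_min - min_i y_i)_+] for jointly Gaussian y."""
    y = rng.multivariate_normal(mu, cov, size=n_samples)
    return np.maximum(mu_min - y.min(axis=1), 0.0).mean()

# For q = 1 this reduces to classic EI; with mu_min = mu = 0 and unit variance
# the closed form is phi(0) = 1 / sqrt(2 * pi) ~= 0.399.
q1 = qei(0.0, np.zeros(1), np.eye(1))
q2 = qei(0.0, np.zeros(2), np.eye(2))     # a second candidate can only help
print(q1, q2)
```

The per-point decomposition in Eq. 4.57 conditions on which \(y_i\) is the batch minimum; the MC estimate above sidesteps those conditional expectations entirely.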

qTS: Batch Thompson Sampling

Algorithm: Thompson Sampling (sequential)

Input: Initial data \(\mathcal{D}_{\mathcal{I}(1)}=\{(\mathbf{x}_i, y_i)\}_{i \in \mathcal{I}(1)}\)

Repeat for \(t = 1, 2, \ldots\):

  1. Compute the current posterior \(p(\boldsymbol{\theta} \mid \mathcal{D}_{\mathcal{I}(t)})\)
  2. Sample \(\boldsymbol{\theta}\) from \(p(\boldsymbol{\theta} \mid \mathcal{D}_{\mathcal{I}(t)})\)
  3. Select \(k \leftarrow \arg \max_{j \notin \mathcal{I}(t)} \mathbb{E}[y_j \mid \mathbf{x}_j, \boldsymbol{\theta}]\)
  4. Collect \(y_k\) by evaluating \(f\) at \(\mathbf{x}_k\)
  5. Update \(\mathcal{D}_{\mathcal{I}(t+1)} \leftarrow \mathcal{D}_{\mathcal{I}(t)} \cup \{(\mathbf{x}_k, y_k)\}\)

Algorithm: qTS (batch)

Input: Initial data \(\mathcal{D}_{\mathcal{I}(1)}=\{(\mathbf{x}_i, y_i)\}_{i \in \mathcal{I}(1)}\); batch size \(S\)

Repeat for \(t = 1, 2, \ldots\):

  1. Compute the current posterior \(p(\boldsymbol{\theta} \mid \mathcal{D}_{\mathcal{I}(t)})\)
  2. For \(s = 1, \ldots, S\): sample \(\boldsymbol{\theta}\) from \(p(\boldsymbol{\theta} \mid \mathcal{D}_{\mathcal{I}(t)})\) and select \(k(s) \leftarrow \arg \max_{j \notin \mathcal{I}(t)} \mathbb{E}[y_j \mid \mathbf{x}_j, \boldsymbol{\theta}]\)
  3. Update \(\mathcal{D}_{\mathcal{I}(t+1)} = \mathcal{D}_{\mathcal{I}(t)} \cup \{(\mathbf{x}_{k(s)}, y_{k(s)})\}_{s=1}^S\)
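A minimal sketch of batch Thompson Sampling on Bernoulli arms with conjugate Beta posteriors (the arm means and batch size are hypothetical; the sampling-then-argmax step mirrors step 2 above):

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.2, 0.5, 0.8])          # hypothetical Bernoulli arm means
alpha, beta = np.ones(3), np.ones(3)        # Beta(1, 1) priors on each arm
S = 5                                       # batch size

for t in range(200):
    thetas = rng.beta(alpha, beta, size=(S, 3))   # one posterior sample per batch slot
    ks = thetas.argmax(axis=1)                    # S Thompson choices
    ys = (rng.random(S) < true_p[ks]).astype(float)
    np.add.at(alpha, ks, ys)                      # conjugate Beta update
    np.add.at(beta, ks, 1.0 - ys)

best = int(np.argmax(alpha / (alpha + beta)))
print("posterior means:", alpha / (alpha + beta), "| best arm:", best)
```

Drawing a fresh \(\boldsymbol{\theta}\) for each batch slot keeps the batch diverse; a single shared sample would make all \(S\) choices identical.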

4.6.3 Regret Analysis

qEUBO Regret

With the definition of Bayesian simple regret, the simple regret of qEUBO converges to zero at rate \(o(1/n)\), i.e.

\[\label{th:quebo_regret} \mathbb{E}\left[f\left(x^*\right)-f\left(\widehat{x}_n^*\right)\right]=o(1 / n)\]

where \(x^*=\operatorname{argmax}_{x \in \mathrm{X}} f(x)\) and \(\widehat{x}_n^* \in \operatorname{argmax}_{x \in \mathrm{X}} \mathbb{E}_n[f(x)]\).

This theorem holds under the following assumptions:

  1. \(f\) is injective: \(\mathbf{P}(f(x)=f(y))=0\) for any \(x, y \in \mathbb{X}\) with \(x \neq y\).

  2. The response \(r\) tends to select the preferred option: there exists \(a>1 / 2\) such that \(\mathbf{P}\left(r(X) \in \operatorname{argmax}_{i=1,2} f\left(x_i\right) \mid f(X)\right) \geq a\) for all \(X=\left(x_1, x_2\right) \in \mathbb{X}^2\) with \(x_1 \neq x_2\), almost surely under the prior on \(f\).

  3. Expected difference in utility is proportional to the probability of greater utility: there exist \(\Delta \geq \delta>0\) such that for all \(\mathcal{D}^{(n)}\) and all \(x, y \in \mathbb{X}\) (potentially depending on \(\mathcal{D}^{(n)}\)), \[\delta \mathbf{P}^{(n)}(f(x)>f(y)) \leq \mathbb{E}^{(n)}\left[\{f(x)-f(y)\}^{+}\right] \leq \Delta \mathbf{P}^{(n)}(f(x)>f(y)) \tag{4.58}\] almost surely under the prior on \(f\).

Further lemmas leading to a proof of Theorem [th:quebo_regret] are given in (Astudillo et al. 2023), Section B.

qEI Regret

The following theorem shows that, under the same assumptions used for the qEUBO regret bound, the simple regret of qEI can fail to converge to 0.

There exists a problem instance (i.e., \(\mathbb{X}\) and Bayesian prior distribution over \(f\)) satisfying the assumptions described in Theorem [th:quebo_regret] such that if the sequence of queries is chosen by maximizing qEI, then \(\mathbb{E}\left[f\left(x^*\right)-f\left(\widehat{x}_n^*\right)\right] \geq R\) for all \(n\), for a constant \(R>0\).

Proof sketch. The counterexample constructs four functions on \(\mathcal{X} = \{1,2,3,4\}\) with a carefully chosen prior. The key insight is that qEI optimizes improvement over the current best estimate, which in this construction is always \(x = 2\) (with \(f(2) = 0\)). Since comparing \(x = 3\) vs. \(x = 4\) always yields the highest expected improvement over \(f(2) = 0\), qEI perpetually selects the query \((3, 4)\)—never querying the pairs that would reveal whether the true optimum lies at \(x = 3\) or \(x = 4\). One can show by induction that this trapping behavior persists for all \(n\), yielding constant Bayesian simple regret \(R = p > 0\). In contrast, qEUBO avoids this trap because it maximizes the expected utility of the best option shown, which naturally encourages diverse queries. See Astudillo et al. (2023) Section B for the full induction argument. \(\square\)

POP-BO Regret

POP-BO achieves provable regret guarantees that depend on the choice of kernel. The cumulative regret bound involves the maximum information gain \(\gamma_T^{ff'}\), which measures the complexity of the function class defined by the kernel.

With probability at least \(1-\delta\), the cumulative regret of POP-BO satisfies \[R_T=\mathcal{O}\left(\sqrt{\beta_T \gamma_T^{f f^{\prime}} T}\right), \tag{4.59}\] and the simple regret of the reported solution converges as \[g\left(x^{\star}\right)-g\left(x_{t^{\star}}\right) \leq \mathcal{O}\left(\frac{\sqrt{\beta_T \gamma_T^{f f^{\prime}}}}{\sqrt{T}}\right), \tag{4.60}\] where \(x^{\star}\) is the maximizer of \(g\).

The kernel-specific rates reveal how the smoothness of the preference function affects convergence:

For POP-BO with \(\epsilon=1/T\):

  1. Linear kernel \(k(x, y)=\langle x, y\rangle\): \(R_T=\mathcal{O}\left(T^{3 / 4}(\log T)^{3 / 4}\right)\)
  2. Squared exponential kernel: \(R_T=\mathcal{O}\left(T^{3 / 4}(\log T)^{\frac{3}{4}(d+1)}\right)\)
  3. Matérn kernel (smoothness \(\nu\)): \(R_T=\mathcal{O}\left(T^{3 / 4}(\log T)^{3 / 4} T^{\frac{d}{\nu}\left(\frac{1}{4}+\frac{d+1}{4+2(d+1)^d / \nu}\right)}\right)\)

The rates share a \(T^{3/4}\) base rate, with kernel-dependent logarithmic or polynomial corrections. Smoother kernels (larger \(\nu\) in Matérn, or SE which corresponds to \(\nu \to \infty\)) yield tighter bounds, reflecting the intuition that smoother preference functions are easier to optimize from pairwise comparisons. Compared to standard BO (which achieves \(T^{1/2}\) rates), the extra \(T^{1/4}\) factor is the price of learning from preference feedback rather than direct function evaluations.

Code: Full PBO Simulation

This simulation demonstrates Preferential Bayesian Optimization finding the optimum of a latent function through pairwise comparisons.

4.7 Reinforcement Learning Foundations for Preference Optimization (Optional)

Note

This section is optional and covers reinforcement learning (RL) foundations that underlie modern preference optimization methods like RLHF, PPO, and GRPO. It provides the formal framework for CIRL (next section) and connects the bandit methods from earlier sections to the full sequential setting used in language model alignment. Readers familiar with RL may wish to skim Sections 4.7.1–4.7.2 and focus on Sections 4.7.3–4.7.5, which develop the policy gradient theorem and GRPO in the preference optimization context.

The bandit and Bayesian optimization methods developed so far operate in single-step or myopic settings: at each round, the learner selects an arm or a query pair, observes feedback, and updates its model. But the RLHF pipeline introduced in Chapter 1 and the CIRL framework in the next section both require reasoning over sequences of decisions, where early actions influence future states.

This section develops the RL machinery needed for these settings, culminating in GRPO (Group Relative Policy Optimization)—a modern method that directly connects policy gradient optimization to the Bradley-Terry preference model from Chapter 1. The progression from bandits to full RL mirrors this chapter’s arc from simple to complex decision-making under uncertainty.

4.7.1 Markov Decision Processes

The Markov Decision Process (MDP) generalizes the bandit setting by introducing states that evolve over time based on the learner’s actions.

Definition: Markov Decision Process (MDP)

An MDP is a tuple \((\mathcal{S}, \mathcal{A}, P, R, \gamma, \rho_0)\) where:

  • \(\mathcal{S}\): state space
  • \(\mathcal{A}\): action space
  • \(P(s' \mid s, a)\): transition probability from state \(s\) to \(s'\) under action \(a\)
  • \(R(s, a)\): reward function, \(R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}\)
  • \(\gamma \in [0, 1]\): discount factor
  • \(\rho_0\): initial state distribution

A policy \(\pi_\theta(a \mid s)\) maps states to distributions over actions, parameterized by \(\theta\). An agent following policy \(\pi\) generates a trajectory: \[ \tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T), \quad s_0 \sim \rho_0, \; a_t \sim \pi(\cdot \mid s_t), \; s_{t+1} \sim P(\cdot \mid s_t, a_t) \tag{4.61}\]

The RL objective is to find the policy that maximizes expected cumulative reward: \[ J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t)\right] \tag{4.62}\]

When \(T = 0\) (a single step with no state transitions), the MDP reduces to the bandit setting from earlier sections. The RL objective then reduces to maximizing expected immediate reward \(\mathbb{E}_{a \sim \pi_\theta}[R(a)]\), recovering the reward maximization problem from Section 4.3.

Example: LLM Generation as an MDP

Autoregressive text generation fits naturally into the MDP framework:

  • State \(s_t\): the prompt \(x\) concatenated with tokens generated so far, \((x, y_1, \ldots, y_t)\)
  • Action \(a_t\): the next token \(y_{t+1}\) from the vocabulary \(\mathcal{V}\)
  • Transition: deterministic—the new state appends the chosen token
  • Reward: \(R(s_t, a_t) = 0\) for all intermediate steps; at the terminal step \(T\), the reward is \(r_\phi(x, y)\) from a learned reward model (e.g., trained on Bradley-Terry preferences as in Chapter 1)
  • Discount: \(\gamma = 1\) (finite horizon, all tokens matter equally)

The policy \(\pi_\theta(a_t \mid s_t)\) is the language model itself, producing a distribution over the next token given the context.

The next section introduces CIRL, which extends this single-agent MDP to a two-player cooperative game where a human and robot jointly optimize a shared reward.

4.7.2 Value Functions and the Advantage

To understand how much better one action is compared to others, we define three related quantities.

The state-value function measures the expected return starting from state \(s\) and following policy \(\pi\): \[ V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t) \;\middle|\; s_0 = s\right] \tag{4.63}\]

The action-value function additionally conditions on taking a specific first action: \[ Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t) \;\middle|\; s_0 = s, a_0 = a\right] \tag{4.64}\]

The advantage function is the difference: \[ A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s) \tag{4.65}\]

The advantage tells us how much better action \(a\) is compared to the average action under the current policy at state \(s\). By definition, \(\mathbb{E}_{a \sim \pi(\cdot|s)}[A^\pi(s,a)] = 0\)—some actions have positive advantage (better than average) and others negative. This quantity will reappear as the key ingredient in GRPO’s group-relative baseline.

These quantities satisfy the Bellman equation, which recursively relates the value of a state to the values of its successors: \[ V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[R(s,a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot|s,a)}[V^\pi(s')]\right] \tag{4.66}\]
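These definitions can be checked numerically by iterating the Bellman operator to its fixed point on a toy MDP (all transition probabilities and rewards below are hypothetical):

```python
import numpy as np

# A tiny 2-state, 2-action MDP (all numbers hypothetical).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                    # R[s, a]
              [0.0, 2.0]])
gamma = 0.9
pi = np.array([[0.5, 0.5], [0.5, 0.5]])      # a fixed uniform policy

# Iterate the Bellman operator for V^pi until convergence (Eq. 4.66).
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * (P @ V)                  # Q[s, a] = R + gamma * E[V(s')]
    V = (pi * Q).sum(axis=1)                 # average over the policy's actions

A = Q - V[:, None]                           # advantage A^pi(s, a) = Q - V
print(V, A)
```

At the fixed point, \(V\) satisfies the Bellman equation exactly, and the advantages average to zero under the policy, as claimed above.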

4.7.3 The Policy Gradient Theorem

We now derive the central result that enables gradient-based policy optimization. The challenge is that the RL objective Equation 4.62 depends on \(\theta\) both through the policy \(\pi_\theta\) (which determines which actions are taken) and through the resulting trajectory distribution (which determines which states are visited). The policy gradient theorem shows that we can compute gradients using only the policy’s log-probabilities, without differentiating through the environment dynamics.

The score function trick. For any parameterized distribution \(\pi_\theta(a \mid s)\), we can write: \[ \nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s) \tag{4.67}\]

This identity converts a gradient of a probability into an expectation-friendly form. The term \(\nabla_\theta \log \pi_\theta(a \mid s)\) is called the score function—the same quantity that appears in the Fisher information matrix (Chapter 3), though here it serves a different purpose: instead of measuring information content, it tells us how to adjust the policy to increase reward.

Trajectory probability. The probability of a trajectory under policy \(\pi_\theta\) factors as: \[ p(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t) \, P(s_{t+1} \mid s_t, a_t) \tag{4.68}\]

Taking the log-gradient, the transition terms \(P(s_{t+1} \mid s_t, a_t)\) vanish because they do not depend on \(\theta\): \[ \nabla_\theta \log p(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \tag{4.69}\]

This is the key insight: the environment dynamics drop out of the gradient, so we can estimate policy gradients without knowing or differentiating through the transition function \(P\).
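This cancellation is easy to verify numerically: for a softmax policy and a fixed trajectory, the finite-difference gradient of the full trajectory log-probability (transition terms included) matches the sum of policy scores. The policy, dynamics, and trajectory below are all hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([[0.3, -0.2], [0.1, 0.5]])   # per-state action logits (2 states, 2 actions)
P_env = np.array([[[0.7, 0.3], [0.4, 0.6]],   # fixed dynamics P[s, a, s'], theta-independent
                  [[0.5, 0.5], [0.2, 0.8]]])
states, actions = [0, 1, 0], [1, 0, 0]        # a fixed trajectory of (s_t, a_t) pairs

def log_p_tau(th):
    lp = sum(np.log(softmax(th[s])[a]) for s, a in zip(states, actions))
    # transition terms for s0=0 -> s1=1 -> s2=0: constants in theta
    return lp + np.log(P_env[0, 1, 1]) + np.log(P_env[1, 0, 0])

# Analytic score: grad_theta log pi(a|s) = one_hot(a) - softmax(theta[s])  (Eq. 4.69)
grad = np.zeros_like(theta)
for s, a in zip(states, actions):
    grad[s] -= softmax(theta[s])
    grad[s, a] += 1.0

# Finite-difference gradient of the full trajectory log-probability.
eps, fd = 1e-6, np.zeros_like(theta)
for idx in np.ndindex(*theta.shape):
    d = np.zeros_like(theta)
    d[idx] = eps
    fd[idx] = (log_p_tau(theta + d) - log_p_tau(theta - d)) / (2 * eps)

print("max |analytic - finite difference|:", np.abs(grad - fd).max())
```

The transition terms contribute only an additive constant to \(\log p(\tau \mid \theta)\), so the two gradients agree to numerical precision.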

Theorem: Policy Gradient

For a parameterized policy \(\pi_\theta\) with RL objective \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\sum_{t=0}^{T} \gamma^t R(s_t, a_t)]\): \[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t\right] \tag{4.70}\] where \(G_t = \sum_{t'=t}^{T} \gamma^{t'-t} R(s_{t'}, a_{t'})\) is the return (cumulative discounted reward) from time step \(t\) onward.

Proof sketch. Write \(J(\theta) = \mathbb{E}_{\tau \sim p(\cdot|\theta)}[R(\tau)]\) where \(R(\tau) = \sum_t \gamma^t R(s_t, a_t)\). Apply the score function trick to the trajectory distribution: \(\nabla_\theta J = \mathbb{E}_\tau[R(\tau) \nabla_\theta \log p(\tau|\theta)]\). Substitute Equation 4.69. Finally, note that actions at time \(t\) cannot influence rewards at earlier times, so we can replace \(R(\tau)\) with the future return \(G_t\) for each term in the sum. \(\square\)

4.7.4 REINFORCE and Variance Reduction

The policy gradient theorem immediately yields the simplest policy gradient algorithm.

Algorithm: REINFORCE (Williams 1992)

Input: Parameterized policy \(\pi_\theta\), learning rate \(\alpha\)

Repeat:

  1. Sample trajectory \(\tau = (s_0, a_0, \ldots, s_T, a_T) \sim \pi_\theta\)
  2. For each time step \(t = 0, \ldots, T\):
    • Compute return \(G_t = \sum_{t'=t}^{T} \gamma^{t'-t} R(s_{t'}, a_{t'})\)
  3. Update: \(\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t\)

REINFORCE is an unbiased estimator of the policy gradient, but it suffers from high variance: a single trajectory can produce wildly different gradient estimates depending on the stochastic outcomes. This motivates variance reduction through baselines.

Baseline subtraction. We can subtract any function \(b(s_t)\) that depends only on the state (not the action) from the return without introducing bias: \[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, (G_t - b(s_t))\right] \tag{4.71}\]

Why is this unbiased? For any state-dependent baseline \(b(s)\): \[ \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot b(s)\right] = b(s) \cdot \nabla_\theta \sum_{a} \pi_\theta(a \mid s) = b(s) \cdot \nabla_\theta \, 1 = 0 \tag{4.72}\]

The probabilities sum to 1 regardless of \(\theta\), so the gradient of that sum is zero.

The variance-minimizing baseline turns out to be \(b^*(s_t) = V^\pi(s_t)\), which reduces the gradient to use the advantage \(A^\pi(s_t, a_t) = G_t - V^\pi(s_t)\). Intuitively, we only reinforce actions that are better than average—actions with positive advantage get upweighted, while those with negative advantage get downweighted.
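Both claims, unbiasedness and variance reduction, can be checked numerically on a one-state (bandit) example with a softmax-style score; the action rewards are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.full(3, 1.0 / 3.0)                     # current (uniform) policy over 3 actions
r_mean = np.array([1.0, 2.0, 3.0])            # hypothetical per-action mean rewards

n = 200_000
a = rng.choice(3, size=n, p=p)                # sampled actions
r = r_mean[a] + rng.normal(0.0, 0.1, size=n)  # noisy observed rewards

score = -np.tile(p, (n, 1))                   # grad_logits log pi(a) = one_hot(a) - p
score[np.arange(n), a] += 1.0

est_plain = score * r[:, None]                # REINFORCE terms, no baseline
est_base = score * (r - r.mean())[:, None]    # baseline b = average reward

g_plain, g_base = est_plain.mean(axis=0), est_base.mean(axis=0)
v_plain, v_base = est_plain.var(axis=0).sum(), est_base.var(axis=0).sum()
print(g_plain, g_base)                        # both ~ [-1/3, 0, 1/3]
print(v_plain, v_base)                        # baseline shrinks the variance
```

Both estimators target the true gradient \(p_i(r_i - \bar{r})\), but centering the rewards shrinks the per-sample variance substantially.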

Caution

In practice, \(V^\pi(s)\) is unknown and must itself be estimated, typically by training a separate neural network (the critic or value network) alongside the policy. This adds computational cost and can introduce its own approximation errors. GRPO, discussed next, avoids this entirely.

4.7.5 From RLHF to GRPO

We now connect the policy gradient framework to the RLHF pipeline introduced in Chapter 1 and develop GRPO as a modern, critic-free alternative to PPO.

The RLHF objective. Recall from Chapter 1 that RLHF trains a reward model \(r_\phi(x, y)\) on human preference data using the Bradley-Terry likelihood, then optimizes the policy to maximize reward while staying close to a reference policy \(\pi_{\text{ref}}\): \[ J_{\text{RLHF}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\left[r_\phi(x, y)\right] - \beta \, \text{KL}\left[\pi_\theta(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\right] \tag{4.73}\]

The KL penalty prevents reward hacking: without it, the policy could exploit imperfections in the learned reward model \(r_\phi\) to achieve high reward scores that do not correspond to genuinely preferred outputs (see Chapter 7 for empirical discussion).

PPO. Proximal Policy Optimization (Schulman et al. 2017) applies the policy gradient theorem to the RLHF objective using a clipped surrogate that prevents excessively large policy updates. PPO requires a learned value function \(V_\psi(s)\) (the critic) to estimate advantages. Training this critic adds computational cost—roughly doubling the model parameters in the LLM setting, since the critic is typically the same size as the policy. Trust Region Policy Optimization [TRPO; Schulman et al. (2015)] preceded PPO with a similar motivation but used a hard KL constraint instead of clipping.

GRPO: Group Relative Policy Optimization. GRPO (Shao et al. 2024) eliminates the critic by replacing the learned value baseline with a group-relative baseline computed from multiple sampled outputs. The key idea is conceptually parallel to the dueling bandit paradigm from earlier in this chapter: rather than estimating absolute values (which requires a critic), we compare outputs within a group to determine which are relatively better.

Algorithm: Group Relative Policy Optimization (GRPO)

Input: Policy \(\pi_\theta\), reference policy \(\pi_{\text{ref}}\), reward model \(r_\phi\), group size \(G\), clip range \(\epsilon\), KL weight \(\beta\)

For each prompt \(x \sim \mathcal{D}\):

  1. Sample a group of outputs: \(\{y_1, \ldots, y_G\} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)\)
  2. Score each output: \(r_i = r_\phi(x, y_i)\) for \(i = 1, \ldots, G\)
  3. Normalize to get group-relative advantages: \[ \hat{A}_i = \frac{r_i - \text{mean}(\{r_j\}_{j=1}^G)}{\text{std}(\{r_j\}_{j=1}^G)} \tag{4.74}\]
  4. Update the policy by maximizing: \[ \mathcal{L}_{\text{GRPO}}(\theta) = \frac{1}{G}\sum_{i=1}^{G} \min\!\left(\rho_i \hat{A}_i, \; \text{clip}(\rho_i, 1{-}\epsilon, 1{+}\epsilon) \, \hat{A}_i\right) - \beta \, \text{KL}[\pi_\theta \| \pi_{\text{ref}}] \tag{4.75}\] where \(\rho_i = \pi_\theta(y_i \mid x) \,/\, \pi_{\theta_{\text{old}}}(y_i \mid x)\) is the importance ratio.

The z-score normalization in Equation 4.74 plays the same role as pairwise comparisons in dueling bandits: it tells us which outputs are relatively better within the group, without requiring an absolute value estimate. When the group mean is high (all outputs are good), only the best outputs receive positive advantage; when the group mean is low, even mediocre outputs can be positively reinforced.
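Steps 3–4 of the algorithm are easy to state in code. Below is a sketch of the group-relative advantage and the per-output clipped term (the reward scores are hypothetical, and the KL penalty is omitted for brevity):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages via z-score normalization (Eq. 4.74)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / r.std()

def grpo_term(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate for a single output (Eq. 4.75, KL term omitted)."""
    rho = np.exp(logp_new - logp_old)          # importance ratio
    return np.minimum(rho * adv, np.clip(rho, 1 - eps, 1 + eps) * adv)

rewards = [1.0, 3.0, 2.0, 6.0]                 # hypothetical reward-model scores
A = grpo_advantages(rewards)
print(A)                                       # zero mean, unit std within the group
```

The clip keeps positive-advantage updates from growing without bound, while the outer `min` leaves the pessimistic (unclipped) branch in place for negative advantages.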

Key Insight: GRPO and Dueling Bandits

GRPO’s group-relative baseline is the RL analog of the dueling bandit paradigm. In dueling bandits, we never observe absolute arm rewards—only pairwise comparisons. Similarly, GRPO never estimates absolute state values—only relative quality within a sampled group. This connection runs deeper: the Bradley-Terry reward model \(r_\phi\) that scores GRPO’s outputs is the same preference model that defines the dueling bandit’s comparison probabilities. The circle from preference models (Chapter 1) through sequential decision-making (this chapter) back to policy optimization closes here.

Connection to DPO. When \(G = 2\) and the reward comes from a Bradley-Terry model, GRPO’s advantage reduces to \(\hat{A}_1 = -\hat{A}_2 = \operatorname{sign}\left(r_\phi(x, y_1) - r_\phi(x, y_2)\right)\): the z-score of a two-element group retains only the sign of the reward difference. This reward difference is exactly the quantity that determines preference probabilities in the Bradley-Terry model: \(p(y_1 \succ y_2 \mid x) = \sigma(r_\phi(x, y_1) - r_\phi(x, y_2))\). DPO optimizes this signal directly on a fixed offline dataset, while GRPO generates fresh samples during training and generalizes to groups larger than 2. The online sampling allows GRPO to explore the policy’s current output distribution, while DPO is limited to the distribution that generated the training data.

Code: REINFORCE vs. GRPO on a Preference Bandit

4.7.6 Connections

The progression from bandits (single-step) to RL (sequential) to preference-based RL (RLHF, GRPO) mirrors this chapter’s arc from simple to complex decision-making. The policy gradient theorem (Equation 4.70) provides the mathematical bridge: it tells us how to optimize any parameterized policy using only sampled trajectories and rewards. When those rewards come from a learned preference model (Bradley-Terry), we recover the modern alignment pipeline.

  • Forward to CIRL (next section): The MDP framework developed here provides the formal setting for CIRL, where the single-agent MDP becomes a two-player cooperative game with shared rewards.
  • Forward to Chapter 7: Chapter 7 discusses the practical lessons from deploying RLHF and GRPO at scale, including reward hacking and the gap between learned and true rewards.
  • Back to Chapter 3: The log-linear policy assumption used in ADPO—\(\pi(y \mid x; \theta) \propto \exp(\phi(x,y)^\top \theta)\)—is a special case of the parameterized policy \(\pi_\theta\), specifically the softmax policy class common in RL.
Key Takeaway: From Bandits to RL to Preferences

GRPO further simplifies the RL pipeline by replacing the value function with group-relative comparisons, connecting RL optimization back to the pairwise comparison paradigm that runs throughout this book. The key equations—Bradley-Terry preference probabilities (Chapter 1), Fisher information and score functions (Chapter 3), and the policy gradient theorem (this section)—all share the same mathematical core: the log-derivative of a parameterized probability model.

4.8 Human-Agent Cooperation and CIRL

The algorithms discussed so far—Thompson Sampling, dueling bandits, Preferential Bayesian Optimization, and the RL methods in the preceding section—all share a common assumption: the human is a passive oracle who responds to queries but does not actively shape the learning process. In practice, this assumption often breaks down. A user choosing between LLM responses may deliberately select the more challenging example to teach the system. A patient in a clinical trial may modify their behavior based on what they believe the system is learning. Humans are active agents whose feedback depends on their beliefs about how it will be used.

This section introduces Cooperative Inverse Reinforcement Learning (CIRL), a framework that models human and agent as jointly optimizing a shared objective. Where Thompson Sampling optimizes reward under uncertainty about the reward function, CIRL optimizes under uncertainty about the human’s preferences—and crucially, the human knows this and can act to reduce that uncertainty.

4.8.1 The CIRL Framework

In standard Inverse Reinforcement Learning (IRL), the agent learns a reward function from observing an expert’s behavior, then optimizes it. The expert demonstrates; the agent imitates. CIRL extends this to a two-player cooperative game:

  • Human H: Knows the true reward parameters \(\theta \in \Theta\)
  • Robot R: Does not know \(\theta\), but must act to maximize reward
  • Shared payoff: Both players receive \(R(s, a_H, a_R; \theta)\)—their interests are aligned

The key insight is that the human’s optimal policy is not merely to demonstrate good behavior but to teach the robot about \(\theta\). Sometimes instructive actions (that reveal information) dominate expert demonstrations (that maximize immediate reward).

CIRL Formal Definition

A CIRL game extends the single-agent MDP (Equation 4.62) to a two-player cooperative setting:

  • State space \(\mathcal{S}\)
  • Action spaces \(\mathcal{A}_H\) (human) and \(\mathcal{A}_R\) (robot)
  • Transition function \(T(s' | s, a_H, a_R)\)
  • Reward function \(R(s, a_H, a_R; \theta)\) parameterized by \(\theta\)
  • Prior \(P_0(\theta)\) over reward parameters
  • Discount factor \(\gamma \in (0, 1)\)

At each time \(t\), both observe \((s_t, a_H^{t-1}, a_R^{t-1})\). Only H observes \(\theta\). The robot maintains a belief \(b_R^t(\theta)\) updated via Bayes’ rule. The key difference from the standard MDP is that the reward parameter \(\theta\) is observed only by the human, making this a game of asymmetric information.
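The robot’s belief update can be sketched concretely. The setup below is a hypothetical one-state game with a discrete \(\Theta\) of two reward parameters and a Boltzmann-rational human model for the likelihood \(P(a_H \mid \theta)\); none of these modeling choices are fixed by the CIRL definition.

```python
import numpy as np

# Hypothetical discrete setup: two candidate reward parameters, three human actions.
thetas = [0, 1]                    # Θ = {θ0, θ1}
# R[θ][a_H]: shared reward for each human action under each parameter.
R = np.array([[1.0, 0.0, 0.2],     # rewards if θ = θ0
              [0.0, 1.0, 0.2]])    # rewards if θ = θ1

def human_policy(theta, beta=3.0):
    """Boltzmann-rational human: P(a_H | θ) ∝ exp(β R(a_H; θ))."""
    logits = beta * R[theta]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def update_belief(belief, a_H, beta=3.0):
    """Robot's Bayes update: b'(θ) ∝ P(a_H | θ) b(θ)."""
    lik = np.array([human_policy(th, beta)[a_H] for th in thetas])
    post = lik * belief
    return post / post.sum()

b = np.array([0.5, 0.5])           # prior P0(θ)
for a_H in [1, 1, 2]:              # observed human actions
    b = update_belief(b, a_H)
print("posterior over θ:", b)
```

Observing action 1 twice concentrates the belief on \(\theta_1\), while action 2 (equally rewarding under both hypotheses) leaves the belief unchanged, a first glimpse of why some actions teach and others do not.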

4.8.2 Teaching vs. Demonstrating

A central result in CIRL is that expert demonstrations can be suboptimal for joint value. Consider:

  • Demonstration-by-expert: Human maximizes task reward at each step
  • Instructive teaching: Human sometimes sacrifices immediate reward to shape robot’s belief

When the robot is uncertain, instructive actions that reveal \(\theta\) can lead to higher cumulative reward than always acting optimally given \(\theta\).

Code: Teaching vs Demonstrating Simulation
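A minimal simulation of the contrast, under assumed toy settings: two reward hypotheses that agree on the myopically optimal action (so expert demonstrations are uninformative) but disagree on the second-best action (so a small reward sacrifice reveals \(\theta\)). The Boltzmann-rational human model and all constants are illustrative.

```python
import numpy as np

# Hypothetical two-hypothesis CIRL stage game. Rows: θ; columns: human actions.
R = np.array([[1.0, 0.8, 0.0],    # θ0: action 0 best, action 1 second-best
              [1.0, 0.0, 0.8]])   # θ1: action 0 best, action 2 second-best

def likelihood(a_H, beta=5.0):
    """P(a_H | θ) for a Boltzmann-rational human, evaluated for each θ."""
    logits = beta * R
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return p[:, a_H]

def posterior(actions, prior=np.array([0.5, 0.5])):
    """Robot's belief after a sequence of observed human actions."""
    b = prior.copy()
    for a in actions:
        b *= likelihood(a)
        b /= b.sum()
    return b

# Demonstration-by-expert: always the myopically optimal action (0 under both θ).
b_demo = posterior([0, 0, 0])
# Instructive teaching: sacrifice reward once to play the θ0-revealing action 1.
b_teach = posterior([1, 0, 0])

print("belief after demonstrations:", b_demo)   # stays near the prior
print("belief after teaching:      ", b_teach)  # concentrates on θ0
```

Here the demonstrator’s action is optimal under both hypotheses, so the robot’s belief never moves; a single instructive action costing 0.2 in immediate reward pins down \(\theta\) almost completely.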

4.8.3 Choice Architecture

As designers of preference learning systems, we are choice architects: the options we present and how we present them influence human behavior. Research in behavioral economics documents several effects:

  • Order effects: Items presented first or last receive disproportionate attention
  • Framing effects: How options are described changes perceived value
  • Default effects: Pre-selected options are chosen more often

These effects are typically not captured in standard choice models (which assume IIA or similar properties). Responsible deployment of preference learning systems requires awareness of these biases.

Key Takeaway: Beyond Passive Oracles

The transition from measurement to decision-making revealed that uncertainty should be managed, not just reduced. The transition from passive oracles to active humans reveals that humans themselves are strategic agents whose behavior depends on their beliefs about the system. CIRL provides a principled framework for this cooperative setting.

Assumption: Approximately Rational Humans

CIRL’s cooperative framework assumes the human acts approximately optimally given their beliefs about the robot’s learning process. If the human is systematically biased—exhibiting anchoring effects, framing dependence, or status quo bias—the robot may learn incorrect preferences from the human’s demonstrations. For example, a human who consistently overweights recent experiences (recency bias) will provide teaching signals that distort the learned reward function. More fundamentally, CIRL assumes the human wants to teach the robot, but in practice humans may be inattentive, adversarial, or simply unaware that their behavior is being interpreted as preference data. The choice architecture effects noted above compound this issue: the system’s interface can systematically bias the human’s responses in ways that CIRL does not account for.

Looking Ahead: From Individual Decisions to Collective Preferences

This chapter developed methods for sequential decision-making with preference feedback from a single user. But what happens when multiple users have different preferences? Chapter 5 (Aggregation) addresses this challenge using social choice theory and mechanism design. The winner concepts introduced here—Condorcet, Borda, von Neumann—will reappear in that context. In particular, a key result in Chapter 5 shows that DPO implicitly optimizes the Borda score when aggregating across annotators, connecting the dueling bandit framework developed here to the social choice theory of preference aggregation. Chapter 6 then asks the normative question: whose preferences should we optimize for, and what fairness constraints should apply?

4.9 Summary

  • This chapter shifts perspective from measuring preferences (Chapter 3) to making decisions under preference uncertainty—managing uncertainty to maximize cumulative reward rather than merely reducing it.
  • Thompson Sampling (“sample a plausible world, act optimally within it”) naturally balances exploration and exploitation by sampling from the posterior. Combined with the Laplace approximation, it handles both linear and nonlinear (GP-based) preference models.
  • Dueling bandits formalize decision-making when only pairwise comparisons are observable. Three winner concepts—Condorcet (beats all others), Borda (highest average win rate), and von Neumann (minimax optimal mixed strategy)—can diverge, and the choice of winner concept shapes what “regret” means.
  • Preferential Bayesian Optimization (PBO) tackles the continuous-armed case where the goal is to find the best option in a large or infinite space using only preference feedback. The qEUBO acquisition function provably converges while the seemingly natural qEI can fail.
  • Regret bounds for PBO scale as \(R_T = O(\sqrt{\beta_T \gamma_T T})\), where \(\gamma_T\) is the maximum information gain—providing convergence guarantees that depend on the kernel’s complexity.
  • Cooperative Inverse Reinforcement Learning (CIRL) models humans as strategic agents who actively teach rather than passively demonstrate, enabling more efficient preference communication through cooperative interaction.
  • (Optional) The policy gradient theorem provides the mathematical bridge from bandit methods to full RL-based preference optimization. GRPO eliminates the need for a learned value function by using group-relative advantages, connecting RL optimization back to the pairwise comparison paradigm.
  • As system designers, we are choice architects: order effects, framing, and defaults influence human choices in ways that standard models (IIA) do not capture, requiring awareness of these biases in deployment.

4.10 Quick Check

Test your understanding before proceeding to the exercises.

  1. Thompson Sampling draws from the posterior and acts greedily under the sample. Why does this produce exploration, even though each individual decision is “greedy”?

  2. Construct a 3-arm preference matrix where the Condorcet winner and Borda winner differ. (Hint: the Condorcet winner beats all others head-to-head but may have low average win rates.)

  3. Why does the qEI acquisition function fail to converge in some settings while qEUBO provably converges? What is the key difference in their objectives?

  4. In CIRL, what changes if the human acts as a teacher (choosing informative demonstrations) rather than a demonstrator (acting optimally for themselves)?

  5. (Optional) The policy gradient theorem uses \(\nabla_\theta \log \pi_\theta(a \mid s)\) to estimate the gradient of the RL objective. Why do the environment’s transition dynamics \(P(s' \mid s, a)\) not appear in the gradient, even though they affect the trajectory distribution?

If you read only a handful of papers from this chapter’s references, make them these seven:

  1. Thompson (1933) — Thompson Sampling: the original paper on probability matching for sequential decisions
  2. Russo et al. (2018) — A modern tutorial on Thompson Sampling with theory and applications
  3. Sui et al. (2017) — Multi-dueling bandits: extending pairwise comparisons to multi-way preference feedback
  4. Brochu, Cora, and de Freitas (2010) — A tutorial on Bayesian optimization, the foundation for preferential BO
  5. Hadfield-Menell et al. (2016) — CIRL: cooperative inverse reinforcement learning for value alignment
  6. Sutton and Barto (2018) — Reinforcement Learning: An Introduction, the standard reference for MDP formulations and policy gradient methods
  7. Shao et al. (2024) — DeepSeekMath: the paper introducing GRPO for LLM alignment

4.11 Discussion Questions

  1. Measurement vs. Optimization: How does the reward maximization perspective differ from the measurement perspective on preference learning? When might one approach be more appropriate than the other?

  2. Thompson Sampling Intuition: Explain in your own words how Thompson Sampling achieves the exploration-exploitation tradeoff. Why does sampling from the posterior distribution lead to exploration in uncertain regions?

  3. Laplace Approximation: Why is the Laplace approximation necessary for Thompson Sampling with binary feedback? What are the potential limitations of this approximation?

  4. Regret Definitions: Compare and contrast strong regret, weak regret, and Bayesian regret in the dueling bandit setting. Give an example scenario where each would be the most appropriate metric.

  5. Winner Concepts: When might the Condorcet winner, Borda winner, and von Neumann winner diverge? Construct a simple example with 3-4 alternatives where these concepts give different answers.

  6. Acquisition Functions: What are the key tradeoffs between the UCB and Thompson Sampling acquisition functions? How does Pure Exploration differ in its objectives?

  7. qEUBO vs. qEI: Why does qEUBO provably converge while qEI can fail to converge under the same assumptions? What does this suggest about designing acquisition functions for preferential feedback?

  8. Copeland Score: How does the soft-Copeland score relate to finding the Condorcet winner? Why might we prefer the soft version over the hard indicator-based version?

  9. Contextual Bandits: How do contextual bandits extend the basic multi-armed bandit framework? Give three real-world examples where context is crucial for making good decisions.

  10. Exploration in Practice: In recommendation systems, how might information asymmetry be used ethically to encourage exploration? What are the potential risks of this approach?

  11. Bayesian Optimization Connection: Explain the connection between Preferential Bayesian Optimization and traditional Bayesian optimization. How does the preference feedback model change the inference problem?

  12. Scalability Challenges: What are the main computational challenges in scaling dueling bandit and PBO algorithms to large action spaces? How might these be addressed?

  13. (Optional) GRPO vs. DPO: Both GRPO and DPO optimize a policy using preference data. DPO is “offline” (optimizes directly on a fixed dataset), while GRPO is “online” (generates new samples during training). What are the tradeoffs? When might each be preferred?

  14. (Optional) Variance Reduction: GRPO uses a group-relative baseline while classical REINFORCE uses a learned value function baseline. Compare the computational and statistical tradeoffs. How does the group size \(G\) affect the bias-variance tradeoff?

4.12 Further Reading

For readers who want to go deeper into the topics introduced in this chapter, we recommend the following:

  • Russo et al. (2018), “A Tutorial on Thompson Sampling” — A modern tutorial covering the theory, applications, and extensions of Thompson Sampling, including connections to information-directed sampling and contextual bandits.
  • Bengs et al. (2021), “Preference-Based Online Learning with Dueling Bandits: A Survey” — A comprehensive survey of dueling bandit algorithms and their regret bounds, covering Condorcet, Borda, and von Neumann winner concepts and the algorithms designed for each.
  • Lattimore and Szepesvári (2020), Bandit Algorithms — The definitive modern textbook on bandit theory, providing rigorous treatments of regret bounds, exploration strategies, and the information-theoretic limits of sequential decision-making.
  • Shahriari et al. (2016), “Taking the Human Out of the Loop: A Review of Bayesian Optimization” — A survey connecting Bayesian optimization to preference learning, covering acquisition functions, GP surrogate models, and applications to hyperparameter tuning and experimental design.
  • Hadfield-Menell et al. (2016), “Cooperative Inverse Reinforcement Learning” — The foundational CIRL paper, formalizing the cooperative game between a human and a robot that learns the human’s reward function through interaction.
  • Sutton and Barto (2018), Reinforcement Learning: An Introduction — The standard RL textbook, essential for readers who want to understand the MDP foundations and policy gradient methods discussed in the optional RL section of this chapter.
  • Thaler and Sunstein (2008), Nudge: Improving Decisions About Health, Wealth, and Happiness — The foundational work on choice architecture, documenting how presentation order, framing, and defaults influence human choices in ways that standard preference models do not capture.
  • Kahneman and Tversky (1979), “Prospect Theory: An Analysis of Decision Under Risk” — The seminal paper on how humans systematically deviate from expected utility maximization, providing the behavioral foundations for why pairwise comparisons (PBO) are often more reliable than absolute ratings.

4.13 Exercises

4.13.1 Basic (*)

  1. Posterior Sampling: Given a logistic regression model with prior \(U \sim \mathcal{N}(0, I)\) and two observations \((V_1, Y_1=1)\) and \((V_2, Y_2=0)\) with \(V_1 = [1, 0]^\top\) and \(V_2 = [0, 1]^\top\), write down the log-posterior and identify the form of the Laplace approximation.

  2. Regret Calculation: Consider a dueling bandit with 3 arms where the preference matrix is: \[P = \begin{pmatrix} 0.5 & 0.6 & 0.7 \\ 0.4 & 0.5 & 0.6 \\ 0.3 & 0.4 & 0.5 \end{pmatrix}\] Identify the Condorcet winner and compute the expected regret of always choosing arms \((2, 3)\).

  3. Copeland Score: For the preference matrix in Exercise 2, compute the Copeland score for each arm.

  4. (Optional) MDP for LLM Generation: Formally define the MDP for autoregressive text generation. What is the state space? Action space? When is reward received? What is the discount factor? How does the horizon \(T\) relate to the generated sequence length?

4.13.2 Intermediate (**)

  1. GP Classification: Implement GP classification with Laplace approximation for a 1D toy problem. Generate binary labels from a latent sinusoidal function passed through a sigmoid, then fit a GP and visualize the posterior mean and uncertainty.

  2. Thompson Sampling Simulation: Implement Thompson Sampling for a linear bandit with Gaussian prior. Run simulations comparing Thompson Sampling to UCB and \(\varepsilon\)-greedy on a synthetic problem. Plot cumulative regret curves.

  3. Dueling Bandit Simulation: Implement the Interleaved Filter algorithm and test it on a dueling bandit problem with 5 arms. Compare its sample complexity to uniform exploration for finding the Condorcet winner.

  4. (Optional) Policy Gradient Derivation: Starting from the trajectory probability \(p(\tau \mid \theta) = \rho_0(s_0) \prod_t \pi_\theta(a_t \mid s_t) P(s_{t+1} \mid s_t, a_t)\), derive the policy gradient theorem. Show that subtracting a state-dependent baseline \(b(s_t)\) from the return does not bias the gradient estimate.

4.13.3 Advanced (***)

  1. Laplace Approximation Derivation: Starting from the GP binary classification model, derive the full Laplace approximation including the Hessian computation for the posterior covariance. Show that the resulting approximation is Gaussian with the correct mean and covariance.

  2. qEI Failure Case: Reproduce the proof that qEI can fail to converge by implementing the counterexample from the chapter. Verify empirically that qEI gets stuck while qEUBO converges.

  3. POP-BO Regret Bound: Following the proof sketch in the chapter, derive the regret bound for POP-BO with a linear kernel. Explain how the covering number \(\mathcal{N}(\mathcal{B}_f, \epsilon, \|\cdot\|_\infty)\) appears in the bound.

  4. Custom Acquisition Function: Design and implement a novel acquisition function for preferential Bayesian optimization that balances exploration and exploitation in a different way than qEUBO. Provide theoretical justification or empirical evidence for its effectiveness.

  5. (Optional) GRPO Implementation: Implement GRPO for a contextual bandit with Bradley-Terry rewards. Compare convergence and gradient variance against REINFORCE with a learned baseline for group sizes \(G \in \{2, 4, 8, 16\}\). Investigate: for what group size does the marginal benefit of additional samples diminish?

  6. (Optional) KL-Constrained Optimization: The RLHF objective includes a KL penalty \(\beta \, \text{KL}[\pi_\theta \| \pi_{\text{ref}}]\). Derive the optimal policy in closed form (it has the form \(\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x) \exp(r(x,y)/\beta)\)). Show that this is exactly the DPO implicit reward from Chapter 1.

References

Astudillo, Raul, Zhiyuan Jerry Lin, Eytan Bakshy, and Peter Frazier. 2023. “qEUBO: A Decision-Theoretic Acquisition Function for Preferential Bayesian Optimization.” In International Conference on Artificial Intelligence and Statistics, 1093–1114. PMLR.
Bastani, Hamsa, and Mohsen Bayati. 2020. “Online Decision Making with High-Dimensional Covariates.” Operations Research 68 (1): 276–94. https://doi.org/10.1287/opre.2019.1902.
Bengs, Viktor, Róbert Busa-Fekete, Adil El Mesaoudi-Paul, and Eyke Hüllermeier. 2021. “Preference-Based Online Learning with Dueling Bandits: A Survey.” Journal of Machine Learning Research 22 (7): 1–108.
Bouneffouf, Djallel, Amel Bouzeghoub, and Alda Lopes Gançarski. 2012. “A Contextual-Bandit Algorithm for Mobile Context-Aware Recommender System.” In Neural Information Processing, edited by Tingwen Huang, Zhigang Zeng, Chuandong Li, and Chi Sing Leung, 324–31. Berlin, Heidelberg: Springer Berlin Heidelberg.
Bouneffouf, Djallel, Irina Rish, and Charu Aggarwal. 2020. “Survey on Applications of Multi-Armed and Contextual Bandits.” In 2020 IEEE Congress on Evolutionary Computation (CEC), 1–8. Glasgow, United Kingdom: IEEE Press. https://doi.org/10.1109/CEC48606.2020.9185782.
Brochu, Eric, Vlad M. Cora, and Nando de Freitas. 2010. “A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning.” arXiv Preprint arXiv:1012.2599.
Hadfield-Menell, Dylan, Stuart J Russell, Pieter Abbeel, and Anca Dragan. 2016. “Cooperative Inverse Reinforcement Learning.” Advances in Neural Information Processing Systems 29.
Kahneman, Daniel, and Amos Tversky. 1979. “Prospect Theory: An Analysis of Decision Under Risk.” Econometrica 47 (2): 263–91. https://doi.org/10.2307/1914185.
Lattimore, Tor, and Csaba Szepesvári. 2020. Bandit Algorithms. Cambridge: Cambridge University Press.
Liu, Bing, Tong Yu, Ian Lane, and Ole J. Mengshoel. 2018. “Customized Nonlinear Bandits for Online Response Selection in Neural Conversation Models.” In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence. AAAI’18/IAAI’18/EAAI’18. New Orleans, Louisiana, USA: AAAI Press.
Russo, Daniel J., Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. 2018. “A Tutorial on Thompson Sampling.” Foundations and Trends in Machine Learning 11 (1): 1–96.
Schulman, John, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. “Trust Region Policy Optimization.” In International Conference on Machine Learning, 1889–97.
Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. “Proximal Policy Optimization Algorithms.” arXiv Preprint arXiv:1707.06347.
Shahriari, Bobak, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. 2016. “Taking the Human Out of the Loop: A Review of Bayesian Optimization.” Proceedings of the IEEE 104 (1): 148–75.
Shao, Zhihong, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv Preprint arXiv:2402.03300.
Sui, Yanan, Vincent Zhuang, Joel W. Burdick, and Yisong Yue. 2017. “Multi-Dueling Bandits with Dependent Arms.” In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI).
Sui, Yanan, Masrour Zoghi, Katja Hofmann, and Yisong Yue. 2018. “Advancements in Dueling Bandits.” Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. https://doi.org/10.24963/ijcai.2018/776.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book.
Thaler, Richard H., and Cass R. Sunstein. 2008. Nudge: Improving Decisions about Health, Wealth, and Happiness. New Haven, CT: Yale University Press.
Thompson, William R. 1933. “On the Likelihood That One Unknown Probability Exceeds Another in View of the Evidence of Two Samples.” Biometrika 25 (3–4): 285–94.
Williams, Ronald J. 1992. “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning.” Machine Learning 8 (3): 229–56.
Wu, Jian, and Peter I. Frazier. 2018. “The Parallel Knowledge Gradient Method for Batch Bayesian Optimization.” arXiv Preprint arXiv:1606.04414.
Xu, Wenjie, Wenbin Wang, Yuning Jiang, Bratislav Svetozarevic, and Colin N. Jones. 2024. “Principled Preferential Bayesian Optimization.” arXiv Preprint arXiv:2402.05367.
Yue, Yisong, Josef Broder, Robert Kleinberg, and Thorsten Joachims. 2012. “The k-Armed Dueling Bandits Problem.” Journal of Computer and System Sciences 78 (5): 1538–56. https://doi.org/10.1016/j.jcss.2011.12.028.
Yue, Yisong, and Thorsten Joachims. 2009. “Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem.” Proceedings of the 26th Annual International Conference on Machine Learning. https://doi.org/10.1145/1553374.1553527.
Zhang, Tong. 2021. “Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement Learning.” CoRR abs/2110.00871. https://arxiv.org/abs/2110.00871.
Zhou, Qian, XiaoFang Zhang, Jin Xu, and Bin Liang. 2017. “Large-Scale Bandit Approaches for Recommender Systems.” In Neural Information Processing, edited by Derong Liu, Shengli Xie, Yuanqing Li, Dongbin Zhao, and El-Sayed M. El-Alfy, 811–21. Cham: Springer International Publishing.
Zoghi, Masrour, Zohar S. Karnin, Shimon Whiteson, and Maarten De Rijke. 2015. “Copeland Dueling Bandits.” In Advances in Neural Information Processing Systems. Vol. 28.