4  Decisions

4.1 Dueling Bandit

The multi-armed bandit (MAB) problem involves a gambler deciding which of several slot machines (arms) to play in order to maximize winnings, despite not knowing which machine is the most rewarding. This scenario highlights the need to balance exploration (trying new machines to discover potentially higher rewards) and exploitation (using current knowledge to maximize gains). MAB algorithms address this dilemma by making decisions under uncertainty to achieve the best possible outcomes based on gathered data. At the core of the MAB problem is a set of actions, or ‘arms,’ denoted by \(\mathcal{A} = \{1, 2, \ldots, K\}\), where \(K\) signifies the total number of arms. For each round \(t\), the agent selects an arm \(a_t \in \mathcal{A}\) and receives a reward \(r_t\), sampled from an arm-specific, unknown probability distribution. The expected reward of pulling arm \(a\) is represented as \(\mu_a = \mathbb{E}[r_t | a]\).
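As a concrete illustration of this setup, the following minimal Python sketch simulates a Bernoulli \(K\)-armed bandit; the arm means and the uniform play loop are illustrative assumptions, not part of any particular algorithm discussed here.

```python
import numpy as np

class BernoulliBandit:
    """K-armed bandit: arm a pays reward 1 with (hidden) probability mu_a, else 0."""

    def __init__(self, true_means, seed=0):
        self.true_means = np.asarray(true_means)   # mu_a for each arm, unknown to the agent
        self.rng = np.random.default_rng(seed)

    @property
    def n_arms(self):
        return len(self.true_means)

    def pull(self, a):
        # Sample r_t ~ Bernoulli(mu_a); only this noisy reward is revealed.
        return float(self.rng.random() < self.true_means[a])

# Example: play 1000 rounds by picking arms uniformly at random.
env = BernoulliBandit(true_means=[0.2, 0.5, 0.7])
rng = np.random.default_rng(1)
rewards = [env.pull(rng.integers(env.n_arms)) for _ in range(1000)]
print(f"average reward under uniform play: {np.mean(rewards):.3f}")
```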

The multi-armed bandit framework can be extended in various ways to model more complex scenarios. In the infinite-armed bandit problem, the set of possible arms \(\mathcal{A}\) is either very large or infinite. This introduces significant challenges in exploration, as the agent cannot afford to explore each arm even once. Algorithms for infinite-armed bandits typically assume some regularity or structure of the reward function across arms to make the problem tractable. The contextual bandit problem extends the bandit framework by incorporating observable external states or contexts that influence the reward distributions of arms. The agent’s task is to learn policies that map contexts to arms to maximize reward. This model is particularly powerful for personalized recommendations, where the context can include user features or historical interactions. In dueling bandit problems, the agent chooses two arms to pull simultaneously and receives feedback only on which of the two is better, not the actual reward values. This pairwise comparison model is especially useful in scenarios where absolute evaluations are difficult, but relative preferences are easier to determine, such as in ranking systems.

Contextual bandits extend the multi-armed bandits by making decisions conditional on the state of the environment and previous observations. The benefit of such a model is that observing the environment can provide additional information, potentially leading to better rewards and outcomes. In each iteration, the agent is presented with the context of the environment, then decides on an action based on the context and previous observations. Finally, the agent observes the action’s outcome and reward. Throughout this process, the agent aims to maximize the expected reward.

In many real-world contexts, one may not have a real-valued reward (or at least a reliable one) associated with a decision. Instead, we may only have observations indicating which of a set of bandits was optimal in a given scenario. The assumption is that within these observations of preferred choices among a set of options, there is an implicit reward or payoff encapsulated in that decision. Consider the following examples:

  1. Dietary preferences: When providing food recommendations to humans, it is often not possible to quantify an explicit reward from recommending a specific food item. Instead, we can offer meal options and observe which one the person selects.

  2. Video recommendation: Websites like YouTube and TikTok recommend specific videos to users. It is typically not feasible to measure the reward a person gains from watching a video. However, we can infer that a user preferred one video over another. From these relative preference observations, we can develop a strategy to recommend videos they are likely to enjoy.

  3. Exoskeleton gait optimization: Tucker et al. (2020) created a framework that uses human-evaluated preferences for an exoskeleton gait algorithm to develop an optimal strategy for the exoskeleton to assist a human in walking. A human cannot reliably produce a numerical value for how well the exoskeleton helped them walk but can reliably indicate which option performed best according to their preferences.

Generally, we assume access to a set of actions. A noteworthy assumption is that any observations we make are unbiased estimates of the payoff. This means that if we observe a human preferred one option over another (or several others), the preferred option had a higher implicit reward or payoff than the alternatives. In the case of dietary preferences, this may mean that a human liked the preferred option; in the case of video recommendations, a user was more entertained, satisfied, or educated by the video they selected than the other options.

The overarching context is that we do not have direct or reliable access to rewards. We may not have a reward at all (for some decisions, it may be impossible to define a real value to the outcome), or it may be noisy (for example, if we ask a human to rate their satisfaction on a scale of 1 to 10). We use relative comparisons to evaluate the best of multiple options in this case. Our goal is to minimize total regret in the face of noisy comparisons. Humans may not always provide consistent observations (since human decision-making is not guaranteed to be consistent). However, we can still determine an optimal strategy with the observed comparisons. We aim to minimize the frequency of sub-optimal decisions according to human preferences. In practice, many formulations of bandits can allow for infinitely many bandits (for example, in continuous-value and high-dimensional spaces). However, this situation can be intractable when determining an optimal decision strategy. With infinite options, how can we always ensure we have chosen the best? We will constrain our bandits to a discrete space to enable efficient exploration. We will assume that we have \(k\) bandits, \(b_i, i \in [1, k]\), and our task is to choose the one that will minimize regret.

With the framework outlined, we now define our approach more formally. This method was introduced by (Yue et al. 2012), and proofs for the guarantees and derivations of parameters can be found in their work.

To determine the optimal action, we will compare pairwise to ascertain the probability that an action \(b_i\) is preferred over another \(b_j\), where \(i \ne j\). Concretely, we assume access to a function \(\epsilon\) that helps determine this probability; in practice, this can be done with an oracle, such as asking a human which of two options they prefer: \[P(b_i > b_j) = \epsilon(b_i, b_j) + \frac{1}{2}.\] With this model, three basic properties govern the values provided by \(\epsilon\): \[\epsilon(b_i, b_j) = -\epsilon(b_j, b_i), \quad \epsilon(b_i, b_i) = 0, \quad \epsilon(b_i, b_j) \in \left(-\frac{1}{2}, \frac{1}{2} \right).\]

We assume there is a total ordering of bandits, such that \(b_i \succ b_j\) implies \(\epsilon(b_i, b_j) > 0\). We impose two constraints to properly model comparisons:

  • Strong Stochastic Transitivity: We must maintain our total ordering of bandits, and as such, the comparison model also respects this ordering: \[b_i \succ b_j \succ b_k \Rightarrow \epsilon(b_i, b_k) \ge \text{max}\{\epsilon(b_i, b_j), \epsilon(b_j, b_k)\}. \tag{4.1}\]

  • Stochastic Triangle Inequality: We also impose a triangle inequality, which captures the condition that the probability of a bandit winning (or losing) a comparison will exhibit diminishing returns as it becomes increasingly superior (or inferior) to the competing bandit: \[b_i \succ b_j \succ b_k \Rightarrow \epsilon(b_i, b_k) \le \epsilon(b_i, b_j) + \epsilon(b_j, b_k). \tag{4.2}\]

These assumptions may initially seem limiting; however, common models for comparisons satisfy these constraints. For example, the Bradley-Terry Model follows \(P(b_i > b_j) = \frac{\mu_i}{\mu_i + \mu_j}\). The Gaussian model with unit variance also satisfies these constraints: \(P(b_i > b_j) = P(X_i - X_j > 0)\), where \(X_i - X_j \sim N(\mu_i - \mu_j, 2)\).
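To make this concrete, here is a small sketch (with assumed Bradley-Terry utilities `mu`) that computes \(\epsilon(b_i, b_j)\) under the Bradley-Terry model and numerically checks conditions (4.1) and (4.2) on one instance.

```python
import itertools

mu = [5.0, 3.0, 2.0, 1.0]   # assumed Bradley-Terry utilities, so b_0 > b_1 > b_2 > b_3

def eps(i, j):
    # epsilon(b_i, b_j) = P(b_i > b_j) - 1/2 under the Bradley-Terry model
    return mu[i] / (mu[i] + mu[j]) - 0.5

# Check (4.1) and (4.2) for every triple b_i > b_j > b_k.
for i, j, k in itertools.combinations(range(len(mu)), 3):
    assert eps(i, k) >= max(eps(i, j), eps(j, k))      # strong stochastic transitivity
    assert eps(i, k) <= eps(i, j) + eps(j, k)          # stochastic triangle inequality
print("Bradley-Terry satisfies both constraints on this instance.")
```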

To accurately model the preferences between bandits in our framework of pairwise bandit comparisons and regret, we must track certain parameters in our algorithm. First, we will maintain a running empirical estimate of the probability of bandit preferences based on our observations. It is important to note that we do not have direct access to an \(\epsilon\) function. Instead, we must present two bandits to a human, who selects a winner. To do this, we define: \[\hat{P}_{i, j} = \frac{\#\ b_i\ \text{wins}}{\#\ \text{comparisons between}\ i\ \text{and}\ j}.\]

We will also compute confidence intervals at each timestep for each of the entries in \(\hat{P}\) as \[\hat{C}_t = \left( \hat{P}_t - c_t, \hat{P}_t + c_t \right),\] where \(c_t = \sqrt{\frac{4\log(\frac{1}{\delta})}{t}}\). Note that \(\delta = \frac{1}{TK^2}\), where \(T\) is the time horizon and \(K\) is the number of bandits.
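The bookkeeping for one pair of bandits can be sketched as follows; the recorded outcomes below are a made-up stand-in for answers from the comparison oracle.

```python
import math

class DuelStats:
    """Running estimate P_hat and confidence radius c_t for one pair (b_i, b_j)."""

    def __init__(self, delta):
        self.delta = delta    # delta = 1 / (T K^2) from the text
        self.wins = 0         # number of comparisons b_i has won
        self.t = 0            # number of comparisons between b_i and b_j

    def record(self, i_won):
        self.t += 1
        self.wins += int(i_won)

    def p_hat(self):
        return self.wins / self.t

    def c_t(self):
        # c_t = sqrt(4 log(1/delta) / t)
        return math.sqrt(4.0 * math.log(1.0 / self.delta) / self.t)

    def interval(self):
        return (self.p_hat() - self.c_t(), self.p_hat() + self.c_t())

# Example with horizon T = 1000 and K = 5 bandits; the outcomes are made up.
stats = DuelStats(delta=1.0 / (1000 * 5 ** 2))
for b_i_won in [True, True, False, True]:
    stats.record(b_i_won)
print(stats.p_hat(), stats.interval())
```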

Previously, we discussed approaches for finding the best action in a specific context. Now, we consider changing contexts, which means there is no longer a static hidden preference matrix \(P\). Instead, at every time step, there is a preference matrix \(P_C\) depending on context \(C\). We consider a context \(C\) and a preference matrix \(P_C\) to be chosen by nature as a result of the given environment (Yue et al., 2012). The goal of a contextual bandits algorithm is to find a policy \(\pi\) that maps contexts to a Von Neumann winner distribution over our bandits. That is, our policy \(\pi\) should map any context to some distribution over our bandits such that sampling from that distribution is preferred to a random action for that context.

4.1.1 Regret

The agent aims to pick a sequence of arms \((a_1, a_2, \ldots, a_T)\) across a succession of time steps \(t = 1\) to \(t = T\) to maximize the total accumulated reward. Formally, the strategy seeks to maximize the sum of the expected rewards: \(\max_{a_1, \ldots, a_T} \mathbb{E} \left[\sum_{t=1}^{T} r_t\right]\). Regret is defined as the difference between the cumulative reward that could have been obtained by always pulling the best arm (in hindsight, after knowing the reward distributions) and the cumulative reward actually obtained by the algorithm. Formally, if \(\mu^*\) is the expected reward of the best arm and \(\mu_{a_t}\) is the expected reward of the arm chosen at time \(t\), the regret after \(T\) time steps is given by \(R(T) = T \cdot \mu^* - \sum_{t=1}^{T} \mu_{a_t}\). The objective of a bandit algorithm is to minimize this regret over time, effectively learning to make decisions that are as close as possible to the decisions of an oracle that knows the reward distributions beforehand. Low regret indicates an algorithm that has learned to choose well-performing arms, balancing the exploration of unknown arms with the exploitation of arms that are already known to perform well. Thus, an efficient bandit algorithm exhibits sub-linear regret growth, meaning that the average regret per round tends to zero as the number of rounds \(T\) goes to infinity: \(\lim_{T \to \infty} \frac{R(T)}{T} = 0\). Minimizing regret is a cornerstone in the design of bandit algorithms, and its analysis helps in understanding the long-term efficiency and effectiveness of different bandit strategies.
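As a small worked example of the regret formula (with assumed expected rewards and an assumed sequence of pulls):

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.7])            # expected rewards of the arms (hidden)
mu_star = mu.max()                        # expected reward of the best arm

pulled = np.array([0, 1, 2, 2, 2, 1])     # arms the algorithm chose over T = 6 rounds
T = len(pulled)

regret = T * mu_star - mu[pulled].sum()   # R(T) = T * mu_star - sum_t mu_{a_t}
print(f"R({T}) = {regret:.2f}")           # (0.5 + 0.2 + 0 + 0 + 0 + 0.2) = 0.90
```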

As previously discussed, our goal is to select the bandit that minimizes a quantity that reflects regret or the cost of not selecting the optimal bandit at all times. We can leverage our comparison model to define a quantity for regret over some time horizon \(T\), which is the number of decisions we make (selecting what we think is the best bandit at each iteration). Assuming we know the best bandit \(b^*\) (and we know that there is a best bandit, since there is a total ordering of our discrete bandits), we can define two notions of regret:

  • Strong regret: aims to capture the fraction of users who would prefer the optimal bandit \(b^*\) over the worse of the two options \(b_1^{(t)}, b_2^{(t)}\) we present at a given step: \(R_T = \sum_{t = 1}^T \max \left\{ \epsilon(b^*, b_1^{(t)}), \epsilon(b^*, b_2^{(t)}) \right\}\)

  • Weak regret: aims to capture the fraction of users who would prefer the optimal bandit \(b^*\) over the better of the two options \(b_1^{(t)}, b_2^{(t)}\) we present at a given step: \(\tilde{R}_T = \sum_{t = 1}^T \min \left\{ \epsilon(b^*, b_1^{(t)}), \epsilon(b^*, b_2^{(t)}) \right\}\). Both quantities are illustrated in the short sketch after this list.
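A minimal numerical sketch of both quantities, assuming a Bradley-Terry comparison model and an arbitrary sequence of presented pairs:

```python
import numpy as np

mu = np.array([5.0, 3.0, 2.0, 1.0])       # assumed Bradley-Terry utilities; b_0 is b*

def eps(i, j):
    return mu[i] / (mu[i] + mu[j]) - 0.5  # epsilon(b_i, b_j)

best = 0
duels = [(1, 2), (0, 3), (2, 3)]          # pairs (b_1^(t), b_2^(t)) presented at each step

strong = sum(max(eps(best, i), eps(best, j)) for i, j in duels)
weak = sum(min(eps(best, i), eps(best, j)) for i, j in duels)
print(f"strong regret: {strong:.3f}, weak regret: {weak:.3f}")
```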

The best bandit described in our regret definition is called a Condorcet winner. This is the strongest form of winner: it is the bandit \(b_i\) that is preferred to every other bandit \(b_j\) with \(p > 0.5\) in a head-to-head comparison. While the notions of regret introduced above assume that an overall best bandit exists, there are settings where no bandit wins more than half of its head-to-head duels. A set of actions without a Condorcet winner is described by the preference matrix in Figure 4.1, where each entry \(\Delta_{jk} = p(j \succ k) - 0.5\) is the probability that action \(j\) is preferred over action \(k\) minus 0.5. There is no Condorcet winner because no action is preferred with \(p > 0.5\) over all other actions. Imagine you want to find the best pizza to eat (the action): there may not be a pizza that wins more than half of its head-to-head duels against every other pizza.

However, we might still have an intuition of what the best pizza is. Sui et al. (2018) therefore introduce several weaker notions of \(\textit{winners}\) for dueling bandit problems. In this example, we might define the best pizza as the most popular one. We call the pizza receiving the most votes in a public vote the Borda winner, or formally, the Borda winner is \(\arg\max_{j \in \mathcal{A}} \sum_{i \neq j} p(j \succ i)\). In contrast to the Condorcet winner setting, there is always guaranteed to be one or more Borda winners (more than one in the case of a tie) for a set of actions. However, if a Condorcet winner exists, it is not necessarily the same as a Borda winner: in our pizza example, a pepperoni pizza might win more than half of its head-to-head duels, while the cheese pizza is still the most popular in a public poll.

A more generic concept of winner is the Von Neumann winner, which describes a probability distribution rather than a single winning bandit. A Von Neumann winner prescribes a probability distribution \(W\) such that sampling from it ‘beats’ an action drawn uniformly at random with \(p > 0.5\). In our pizza example, this corresponds to trusting a friend to order whichever pizza they like, because this may still be preferred to ordering randomly. Formally, \(W\) is a Von Neumann winner if \(\mathbb{E}_{j \sim W,\, k \sim R}\left[p(j \succ k)\right] > 0.5\), where \(R\) denotes the uniform distribution over our actions. The concept of a Von Neumann winner is useful in contextual dueling bandit settings: there, the preference matrix depends on the context, and different contexts may have different Borda winners, just as different parties may vote for different pizzas.

       A      B      C      D      E      F
A    0.00   0.03  -0.02   0.06   0.10   0.11
B   -0.03   0.00   0.03   0.05   0.08   0.11
C    0.02  -0.03   0.00   0.04   0.07   0.09
D   -0.06  -0.05  -0.04   0.00   0.05   0.07
E   -0.10  -0.08  -0.07  -0.05   0.00   0.03
F   -0.11  -0.11  -0.09  -0.07  -0.03   0.00

Figure 4.1: Violation of the Condorcet winner condition. Highlighted entries are different from Table 1. No Condorcet winner exists because no arm beats every other arm.
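Using the preference matrix of Figure 4.1 (with the C-versus-A entry obtained by antisymmetry), a short sketch can confirm that no Condorcet winner exists and compute Borda-style scores:

```python
import numpy as np

arms = list("ABCDEF")
# Delta[j, k] = p(j > k) - 0.5, from Figure 4.1 (C vs. A filled in by antisymmetry).
Delta = np.array([
    [ 0.00,  0.03, -0.02,  0.06,  0.10,  0.11],
    [-0.03,  0.00,  0.03,  0.05,  0.08,  0.11],
    [ 0.02, -0.03,  0.00,  0.04,  0.07,  0.09],
    [-0.06, -0.05, -0.04,  0.00,  0.05,  0.07],
    [-0.10, -0.08, -0.07, -0.05,  0.00,  0.03],
    [-0.11, -0.11, -0.09, -0.07, -0.03,  0.00],
])

# Condorcet winner: an arm whose Delta entry is positive against every other arm.
condorcet = [a for i, a in enumerate(arms)
             if all(Delta[i, k] > 0 for k in range(len(arms)) if k != i)]
print("Condorcet winner:", condorcet or "none")          # none: A loses to C, C to B, B to A

# Borda-style score: sum of p(j > k) over all opponents k.
borda = {a: sum(0.5 + Delta[i, k] for k in range(len(arms)) if k != i)
         for i, a in enumerate(arms)}
print("Borda winner:", max(borda, key=borda.get))        # A on this matrix
```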

Next, we introduce two performance measures for the planner (the recommendation algorithm). The asymptotic ex-post regret is defined as \[\text{Regret}(\mu_1, \ldots, \mu_K) = T\cdot \max_i \mu_i - \sum_{t=1}^T \mathbb{E}[\mu_{I_t}],\] where \(I_t\) denotes the arm recommended at time \(t\).

Intuitively, this represents the difference between the reward achieved by always taking the action with the highest possible reward and the expected welfare of the recommendation algorithm (based on the actions it recommends at each timestep).

We also define a weaker performance measure, the Bayesian regret, which is defined as \[\text{Bayesian regret} = \mathbb{E}_{\mu_1, \ldots, \mu_K \sim \text{Prior}}\left[\operatorname{Regret}\left(\mu_1, \ldots, \mu_K\right)\right].\]

With a Bayesian optimal policy, we would like either definition of regret to vanish as \(T\to \infty\); we are considering “large-market optimal” settings where there are many short-lived, rather than a few long-term, users. Note that the fact that ex-post regret is prior-free makes it robust to inaccuracies in the prior.

4.1.2 Acquisition Functions

Various strategies have been developed to balance the exploration-exploitation trade-off. These strategies differ in selecting arms based on past experiences and rewards.

4.1.2.1 Classical Acquisition Functions

The uniform acquisition function is the most straightforward approach: each arm is selected uniformly at random over time. This strategy does not consider past rewards and treats each arm as equally promising regardless of the observed outcomes. It is a purely explorative strategy that ensures each arm is sampled enough to estimate its expected reward, but it does not exploit the gathered information to optimize rewards. In mathematical terms, if \(N_t(a)\) denotes the number of times arm \(a\) has been selected up to time \(t\), the uniform strategy ensures that \(N_t(a) \approx \frac{t}{K}\) for all arms \(a\) as \(t\) grows large: \(P(a_t = a) = \frac{1}{K}\)

Epsilon-greedy is a popular method that introduces a balance between exploration and exploitation. With a small probability \(\epsilon\), it explores by choosing an arm at random, and with probability \(1 - \epsilon\), it exploits by selecting the arm with the highest estimated reward so far. This strategy incrementally favors actions that have historically yielded higher rewards, but still allows for occasional exploration to potentially discover better options. The parameter \(\epsilon\) is chosen based on the desired exploration level, often set between 0.01 and 0.1. \[P(a_t = a) = \begin{cases} \frac{\epsilon}{K} + 1 - \epsilon & \text{if } a = \arg\max_{a'} \hat{\mu}_{a'} \\ \frac{\epsilon}{K} & \text{otherwise} \end{cases}\]

Upper Confidence Bound (UCB) acquisition function takes a more sophisticated approach to the exploration-exploitation dilemma. It selects arms based on both the estimated rewards and the uncertainty or variance associated with those estimates. Specifically, it favors arms with high upper confidence bounds on the estimated rewards, which is a sum of the estimated mean and a confidence interval that decreases with the number of times the arm has been played. This ensures that arms with less certainty (those played less often) are considered more often, naturally balancing exploration with exploitation as the uncertainty is reduced over time.

\[P(a_t = a) = \begin{cases} 1 & \text{if } a = \arg\max_{a'} \left( \hat{\mu}_{a'} + \sqrt{\frac{2 \ln t}{N_t(a')}} \right) \\ 0 & \text{otherwise} \end{cases}\]
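A minimal sketch of the three selection rules above; the reward estimates `mu_hat` and pull counts `counts` are assumed to be maintained elsewhere by the learner.

```python
import numpy as np

def select_uniform(rng, K):
    # Uniform acquisition: every arm has probability 1/K.
    return int(rng.integers(K))

def select_epsilon_greedy(rng, mu_hat, epsilon=0.05):
    # Explore uniformly with probability epsilon, otherwise exploit the best estimate.
    if rng.random() < epsilon:
        return int(rng.integers(len(mu_hat)))
    return int(np.argmax(mu_hat))

def select_ucb(mu_hat, counts, t):
    # Highest upper confidence bound: mu_hat(a) + sqrt(2 ln t / N_t(a)).
    counts = np.asarray(counts, dtype=float)
    bonus = np.where(counts > 0,
                     np.sqrt(2.0 * np.log(max(t, 1)) / np.maximum(counts, 1.0)),
                     np.inf)                  # untried arms get an infinite bonus
    return int(np.argmax(np.asarray(mu_hat) + bonus))

rng = np.random.default_rng(0)
mu_hat, counts = [0.10, 0.40, 0.35], [10, 3, 5]
print(select_uniform(rng, 3),
      select_epsilon_greedy(rng, mu_hat),
      select_ucb(mu_hat, counts, t=18))
```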

4.1.2.2 Interleaved Filter

This algorithm tries to find the best bandit (the Condorcet winner) in a discrete, finite bandit space via pairwise comparisons of the bandits. We now introduce the Interleaved Filter algorithm as provided in (Yue et al. 2012) to solve the dueling bandit setup. It starts with a randomly chosen incumbent \(\hat{b}\) and iteratively compares it to the set \(W\) of remaining bandits, maintaining winning probabilities \(\hat{P}_{\hat{b},b}\) and confidence intervals \(\hat{C}_{\hat{b},b}\). If a bandit \(b\) is confidently worse than \(\hat{b}\), it is removed from \(W\). If a bandit \(b'\) is confidently better than \(\hat{b}\), then every bandit \(b\) that appears worse than \(\hat{b}\) is removed from \(W\) and \(b'\) becomes the new incumbent \(\hat{b}\). This is repeated until \(W\) is empty, leaving the final \(\hat{b}\) as the predicted best bandit.

 1: input: \(T\), \(B=\{b_1, \dots, b_K\}\)
 2: \(\delta \gets 1/(TK^2)\)
 3: choose \(\hat{b} \in B\) randomly
 4: \(W \gets B \setminus \{\hat{b}\}\)
 5: \(\forall b \in W\), maintain estimate \(\hat{P}_{\hat{b},b}\) of \(P(\hat{b} > b)\) as defined above
 6: \(\forall b \in W\), maintain the \(1 - \delta\) confidence interval \(\hat{C}_{\hat{b},b}\) of \(\hat{P}_{\hat{b},b}\) as defined above
 7: while \(W \neq \varnothing\) do
 8:   for \(b \in W\) do
 9:     compare \(\hat{b}\) and \(b\)
10:     update \(\hat{P}_{\hat{b},b}\), \(\hat{C}_{\hat{b},b}\)
11:   end for
12:   while \(\exists b \in W\) such that \(\hat{b}\) confidently beats \(b\) (\(\hat{P}_{\hat{b},b} > 1/2\) and \(1/2 \notin \hat{C}_{\hat{b},b}\)) do
13:     \(W \gets W \setminus \{b\}\)
14:   end while
15:   if \(\exists b' \in W\) that confidently beats \(\hat{b}\) (\(\hat{P}_{\hat{b},b'} < 1/2\) and \(1/2 \notin \hat{C}_{\hat{b},b'}\)) then
16:     while \(\exists b \in W\) with \(\hat{P}_{\hat{b},b} > 1/2\) do
17:       \(W \gets W \setminus \{b\}\)
18:     end while
19:     \(\hat{b} \gets b'\), \(W \gets W \setminus \{b'\}\)
20:     \(\forall b \in W\), reset \(\hat{P}_{\hat{b},b}\) and \(\hat{C}_{\hat{b},b}\)
21:   end if
22: end while
23: \(\hat{T} \gets\) total comparisons made
24: return \((\hat{b}, \hat{T})\)

Parameter Initialization

In lines 1-6 of the algorithm, we take the inputs and first compute the value \(\delta\) which is used to compute our confidence intervals. We select an initial guess of an optimal bandit \(\hat{b}\) by uniformly sampling from all bandits \(\mathcal{B}\). We also keep a running set of bandit candidates \(W\), which is initialized to be \(\mathcal{B} \setminus \{\hat{b}\}\). At this point, we also initialize our empirical estimates for \(\hat{P}, \hat{C}\).

Next, we will repeat several steps until our working set of bandit candidates \(W\) is empty.

Update Estimates Based on Comparisons

The first step at each iteration (lines 8-11) is to look at all candidates in \(W\), and compare them to our current guess \(\hat{b}\) using an oracle (e.g. by asking a human which of \(\hat{b}\) or \(b \in W\) is preferred). With this new set of wins and comparisons, we update our estimates of \(\hat{P}, \hat{C}\).

Prune Suboptimal Bandits

In lines 12-13, with updated comparison win probabilities and corresponding confidence intervals, we can remove bandit candidates from \(W\) that we are confident \(\hat{b}\) is better than. The intuition here is that we are mostly sure that our current best guess is better than some of the candidates, and we don’t need to consider those candidates in future iterations.

Check for Better Bandits from Candidate Set

Now that our candidate set of bandits may be smaller, in lines 15-21 we check if there is any bandit \(b'\) that we are confident is better than our current best guess. If we do find such a candidate, we remove every bandit \(b\) that \(\hat{P}\) indicates is likely worse than \(\hat{b}\). Note that in this step, we do not require the estimate to be outside the confidence interval, since we have already found a candidate we believe to be significantly closer to optimal than our current best guess.

Once we remove the candidates likely worse than \(\hat{b}\), we crown \(b'\) as the new best guess, i.e. \(\hat{b} := b'\). Consequently, we remove \(b'\) from \(W\) and reset our empirical win counters \(\hat{P}, \hat{C}\).
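Putting these steps together, the following is a compact Python sketch of the Interleaved Filter loop (not the authors' reference implementation). The preference oracle `duel(i, j)`, which returns True when bandit \(i\) wins the comparison, is an assumed stand-in for querying a human, and the confidence radius follows the \(c_t\) formula given earlier.

```python
import math
import random

def interleaved_filter(K, T, duel, rng=None):
    """Sketch of Interleaved Filter; duel(i, j) returns True if bandit i beats bandit j."""
    rng = rng or random.Random(0)
    delta = 1.0 / (T * K * K)
    radius = lambda n: math.sqrt(4.0 * math.log(1.0 / delta) / n)

    b_hat = rng.randrange(K)
    W = set(range(K)) - {b_hat}
    wins = {b: 0 for b in W}          # wins of b_hat against b
    n = {b: 0 for b in W}             # comparisons of b_hat against b
    total = 0

    while W:
        for b in list(W):             # one comparison per remaining candidate
            wins[b] += duel(b_hat, b)
            n[b] += 1
            total += 1
        p = {b: wins[b] / n[b] for b in W}
        c = {b: radius(n[b]) for b in W}
        # Prune candidates that b_hat confidently beats (1/2 lies below the interval).
        W -= {b for b in W if p[b] - c[b] > 0.5}
        # Does some candidate confidently beat b_hat?
        better = [b for b in W if p[b] + c[b] < 0.5]
        if better:
            b_prime = better[0]
            # Also prune candidates b_hat empirically beats (no confidence required).
            W = {b for b in W if p[b] <= 0.5}
            W.discard(b_prime)
            b_hat = b_prime           # crown the new incumbent and reset the counters
            wins = {b: 0 for b in W}
            n = {b: 0 for b in W}
    return b_hat, total

# Toy Bradley-Terry oracle over 5 bandits with utilities 5, 4, 3, 2, 1.
utils = [5, 4, 3, 2, 1]
orng = random.Random(1)
oracle = lambda i, j: orng.random() < utils[i] / (utils[i] + utils[j])
print(interleaved_filter(K=5, T=20000, duel=oracle))      # typically returns bandit 0
```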

With this algorithm defined, let us look at some provisions of the method with respect to identifying the optimal strategy. Note that the proofs and derivations for these quantities are provided in (Yue et al. 2012).

First, the method guarantees that for the provided time horizon \(T\), the algorithm returns the correct bandit with probability \(P \ge 1 - \frac{1}{T}\). It is interesting and useful to note that if one has a strict requirement for the probability of identifying the correct bandit, one can compute the time horizon \(T\) that guarantees this outcome at that probability. Furthermore, a time horizon of 1 leaves no probabilistic guarantee of a successful outcome, and increasing \(T\) has diminishing returns. Second, in the event that the algorithm returns an incorrect bandit, the maximal regret incurred is linear with respect to \(T\), i.e. \(\mathcal{O}(T)\). This is also a useful provision as it allows us to estimate the overall cost in the worst-case outcome. Based on these two provisions, we can compute the expected cumulative regret from running the Interleaved Filter algorithm, which is: \[\mathbb{E}\left[R_T\right] \le \left(1 - \frac{1}{T}\right) \mathbb{E}\left[ R_T^{IF} \right] + \frac{1}{T}\mathcal{O}(T) = \mathcal{O}\left(\mathbb{E}\left[ R_T^{IF} \right] + 1\right)\]

Interestingly, the original work shows that these bounds hold for both strong and weak regret. As demonstrated, the Interleaved Filter algorithm provides a robust method to ascertain the optimal bandit or strategy given a set of options and only noisy comparisons. In most real-world scenarios for modeling human preferences, it is not possible to observe a real-valued reward, or at least a reliable one, and as such this method is a useful way to properly model human preferences.

Furthermore, the algorithm provides strong guarantees for the probability of selecting the correct bandit, maximal regret, and the number of comparisons required. It is even more impressive that the method can do so without severely limiting constraints; as demonstrated, the most commonly used models satisfy the imposed constraints.

As we look to model human preferences, we can certainly leverage this method for k-armed dueling bandits to identify the best strategy to solve human-centric challenges, from video recommendation to meal selection and exoskeleton-assisted walking.

4.1.2.3 Dueling Bandit Gradient Descent

This algorithm tries to find the best bandit in a continuous bandit space. Here, each bandit is a parameter configuration of an Information Retrieval (IR) system, uniquely defined by a vector \(w\), so there are infinitely many bandits. We will cover the Dueling Bandit Gradient Descent (DBGD) algorithm from (Yue and Joachims 2009), who use the dueling bandits formulation for online IR optimization. They propose a retrieval system parameterized by a set of continuous variables lying in \(W\), a \(d\)-dimensional unit sphere. The DBGD algorithm adapts the current parameters \(w_t\) of the IR system by comparing them with slightly altered parameters \(w_t'\), both answering the same query \(q_t\). Only if the IR results produced with \(w_t'\) are preferred are the parameters moved in that direction. We will now discuss the algorithm in more detail.

input: \(\gamma\), \(\delta\), \(w_1\)
for \(t = 1, \ldots, T\) do
  sample unit vector \(u_t\) uniformly
  \(w_t' \gets P_W(w_t + \delta u_t)\)
  compare \(w_t\) and \(w_t'\)
  if \(w_t'\) wins the duel then
    \(w_{t+1} \gets P_W(w_t + \gamma u_t)\)
  else
    \(w_{t+1} \gets w_t\)
  end if
end for

We first choose the exploration step length \(\delta\), the exploitation step length \(\gamma\), and a starting point \(w_1\) in the unit sphere. At each iteration, we choose a query \(q_t\) and sample a random unit vector \(u_t\). We then duel \(w_t\) and \(w_t'\), where \(w_t\) is our current point in the sphere and \(w_t'\) is our exploratory comparison, generated by taking a random step of length \(\delta\), such that \(w_t' = w_t + \delta u_t\). The objective of this duel is to ascertain the binary preference of users with respect to the results yielded by the IR systems parameterized by \(w_t\) and \(w_t'\) respectively, taking query \(q_t\) as an input. The parameters that get the majority of the votes win the head-to-head comparison. If \(w_t\) wins, then we keep the parameters for the next iteration. If \(w_t'\) wins the duel, we update our parameters in the direction of \(u_t\) by taking a step of length \(\gamma\). Note that the algorithm uses a projection operation \(P_W(\overrightarrow{v})\). Since \(u_t\) is chosen randomly, \(w_t + \delta u_t\) or \(w_t + \gamma u_t\) could lie outside of the unit sphere where all possible parameter configurations live. In this case, we simply project the point back onto the sphere using \(P_W(\overrightarrow{v})\).

Yue and Joachims show that this algorithm has sublinear regret in \(T\), the number of iterations. We note that the algorithm assumes that there exists a hidden reward function \(R(w)\), mapping system parameters \(w\) to reward values, that is smooth and strictly concave over the input space \(W\).

Lastly, we would also like to give motivation behind \(\delta\) and \(\gamma\) being different values. We need a \(\delta\) that is sufficiently large that the comparison between a system parameterized by \(w_t\) and \(w_t'\) is meaningful. On the other hand, we may wish to take a smaller step in the direction of \(w_t'\) during our update step, as during a duel, we only score \(w_t\) against \(w_t'\) over the results on one query \(q_t\). Having \(\delta > \gamma\) allows us to get reward signal from meaningfully different points while also updating our belief of the best point \(w_{\text{best}}\) gradually.
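A compact sketch of the DBGD loop for a toy problem; the preference oracle `noisy_pref`, the hidden concave reward, and all parameter values are illustrative assumptions.

```python
import numpy as np

def project(w):
    # P_W(v): if the step leaves the unit sphere W, project back onto it.
    norm = np.linalg.norm(w)
    return w / norm if norm > 1.0 else w

def dbgd(d, T, prefers_candidate, gamma=0.01, delta=0.5, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(d)                                  # starting point w_1
    for _ in range(T):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                       # random unit direction u_t
        w_prime = project(w + delta * u)             # exploratory candidate w_t'
        if prefers_candidate(w, w_prime):            # duel on a query; True if w_t' wins
            w = project(w + gamma * u)               # exploit: small step toward u_t
    return w

# Toy oracle: hidden reward R(w) = -||w - w_star||^2, compared through a logistic model.
w_star = np.array([0.3, -0.4, 0.2])
orng = np.random.default_rng(1)
def noisy_pref(w, w_prime):
    reward = lambda v: -np.sum((v - w_star) ** 2)
    return orng.random() < 1.0 / (1.0 + np.exp(-(reward(w_prime) - reward(w))))

print(np.round(dbgd(d=3, T=5000, prefers_candidate=noisy_pref), 2))   # drifts toward w_star
```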

Sparring EXP4

Zoghi et al. (2015) propose one algorithm for the contextual dueling bandit problem: Sparring EXP4, which duels two instances of the traditional EXP4 algorithm. The traditional EXP4 algorithm solves the standard contextual bandit problem, the case where we can directly observe a reward for a choice of bandit given a context. The EXP4 algorithm embeds each bandit as a vector. When the algorithm sees the context (called ‘advice’ in this formulation), it produces a probability distribution over the choices based on an adjusted softmax function on the inner product between the context and the bandit vectors. The probability function differs from a plain softmax in that we assign some minimum probability to every action to enforce exploration. A reward is then observed for the choice and propagated back through the embedding of the chosen bandit.

Sparring EXP4 runs two instances of the EXP4 algorithm against each other. Each EXP4 instance samples an action given a context, and then these choices are ‘dueled’ against each other. Instead of directly observing a reward, as in traditional EXP4, we observe two complementary rewards: a positive reward for the choice that won the duel and a negative reward for the choice that lost. The reward is proportional to the degree to which the bandit wins the duel, i.e. how likely the bandit is to be preferred over the other when users are queried for binary preferences. As in traditional EXP4, the reward (or negative reward) is then propagated back through the representations of the bandits.
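The construction can be sketched as follows, with a minimal EXP4 learner and a simplified 1/0 duel reward in place of the proportional reward described above; the toy experts and the preference oracle are assumptions for illustration.

```python
import numpy as np

class EXP4:
    """Minimal EXP4 learner: experts provide per-context distributions over K arms."""

    def __init__(self, n_experts, n_arms, gamma=0.1, seed=0):
        self.w = np.ones(n_experts)                  # expert weights
        self.K = n_arms
        self.gamma = gamma
        self.rng = np.random.default_rng(seed)

    def act(self, advice):
        # advice: (n_experts, K) matrix of expert probability vectors for this context.
        probs = (1 - self.gamma) * (self.w @ advice) / self.w.sum() + self.gamma / self.K
        probs = probs / probs.sum()
        arm = int(self.rng.choice(self.K, p=probs))
        return arm, probs

    def update(self, advice, arm, probs, reward):
        # Importance-weighted reward estimate, credited back to the experts.
        x_hat = np.zeros(self.K)
        x_hat[arm] = reward / probs[arm]
        expert_gain = advice @ x_hat
        self.w *= np.exp(self.gamma * expert_gain / self.K)

def sparring_step(learner_a, learner_b, advice, duel):
    arm_a, p_a = learner_a.act(advice)
    arm_b, p_b = learner_b.act(advice)
    a_wins = duel(arm_a, arm_b)                      # preference oracle on the two proposals
    learner_a.update(advice, arm_a, p_a, 1.0 if a_wins else 0.0)
    learner_b.update(advice, arm_b, p_b, 0.0 if a_wins else 1.0)

# Toy run: two experts over three arms; lower-indexed arms are assumed to be preferred.
advice = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.1, 0.8]])
A, B = EXP4(2, 3, seed=1), EXP4(2, 3, seed=2)
for _ in range(200):
    sparring_step(A, B, advice, duel=lambda i, j: i <= j)
print(np.round(A.w / A.w.sum(), 2))                  # weight shifts toward the first expert
```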

4.1.2.4 Feel-Good Thompson Sampling

This algorithm is a solution for the contextual dueling bandit setting and tries to minimize the cumulative average regret: \[\text{Regret}(T) := \sum_{t=1}^{T} \left[ r_{*}(x_t, a_{t}^{*}) - \frac{r_{*}(x_t, a_{t}^{1}) + r_{*}(x_t, a_{t}^{2})}{2} \right],\] where \(r_{*}(x_t, a_{t})\) is the true, hidden reward function of a context \(x_t\) and action \(a_t\), and \(a_t^*\) is the best action for context \(x_t\). Thompson sampling is an iterative process: at each round, two actions are proposed, each maximizing a different approximation of the reward function based on past data; a preference between them is received; and this new information is added to the data.

Finding good approximations of the reward function at time \(t\) is done by sampling two reward function parameters \(\theta_t^{j=1}\) and \(\theta_t^{j=2}\) from a posterior distribution \(p_j(\cdot \mid S_{t-1})\) based on all previous data. This posterior distribution is proportional to the product of the prior and the likelihood function, which is a Gaussian in standard Thompson sampling. In Feel-Good Thompson sampling, an additional term called "Feel-good exploration" encourages parameters \(\theta\) with a large maximum reward in previous rounds. This change to the likelihood function may increase probabilities in uncertain areas, thus exploring those regions. All that is left is to select an action maximizing each reward function approximation, receive a preference \(y_t\) between the two, and add the new information to the dataset (Zhang 2021).

initialize \(S_0 = \varnothing\)
for \(t = 1, 2, \ldots\) do
  receive prompt \(x_t\) and action space \(\mathcal{A}_t\)
  for \(j = 1, 2\) do
    sample model parameter \(\theta_t^j\) from the posterior distribution \(p^j(\cdot \mid S_{t-1})\)
    select response \(a_t^j = \arg\max_{a \in \mathcal{A}_t} \langle \theta_t^j, \phi(x_t, a) \rangle\)
  end for
  receive preference \(y_t\)
  update dataset \(S_t \leftarrow S_{t-1} \cup \{(x_t, a_t^1, a_t^2, y_t)\}\)
end for

4.1.3 Applications

There are many applications where contextual bandits are used, and many of these applications can utilize human preferences. One particular application illustrates the benefits a contextual bandit has over a multi-armed bandit: a website deciding which ad to show someone visiting the website. A multi-armed bandit might decide to show someone an ad for a swimsuit because the swimsuit ads have gotten the most user clicks (which indicates human preference). A contextual bandit might choose differently, however. A contextual bandit also takes into account the context, which in this case might mean information about the user (location, previously visited pages, and device information). If it discovers the user lives in a cold climate, for example, it might suggest a sweater ad instead and get a better chance of a click. There are many more examples of where contextual bandits can be applied: other web applications, such as optimizing search results; medical applications, such as deciding how much of a medication to prescribe based on a patient’s history; and gaming applications, such as basing moves on the state of a chess board to try to win. In each of the above examples, human feedback could have been introduced during training and leveraged to learn a reward function.

We explored different versions of bandits that address the exploration-exploitation trade-off in various real-world scenarios. These models have been employed across various fields, including but not limited to healthcare, finance, dynamic pricing, and anomaly detection. This section provides a deep dive into some real-world applications, emphasizing the value and advancements achieved by incorporating bandit methodologies. The content of this section draws upon the findings from the survey cited in reference (Bouneffouf, Rish, and Aggarwal 2020).

In healthcare, researchers have been applying bandits to address challenges in clinical trials and behavioral modeling (Bouneffouf, Rish, and Cecchi 2017; Bastani and Bayati 2020). One of the examples is drug dosing. Warfarin, an oral anticoagulant, has traditionally been administered using fixed dosing protocols. Physicians would then make subsequent adjustments based on the patient’s emerging symptoms. Nonetheless, inaccuracies in the initial dosage, whether too low or too high, can lead to serious complications like strokes and internal bleeding. In a pivotal study, researchers in (Bastani and Bayati 2020) modeled the Warfarin initial dosing as a contextual bandit problem to assign dosages to individual patients appropriately based on their medication history. Their contributions include the adaptation of the LASSO estimator to the bandit setting, achieving a theoretical regret bound of \(O(s_0^2 \log^2(dT))\), where \(d\) represents the number of covariates, \(s_0 \ll d\) signifies the number of pertinent covariates, and \(T\) indicates the total number of users. Additionally, they conducted empirical experiments to validate the robustness of their methodology.

Within the finance sector, bandits have been instrumental in reshaping the landscape of portfolio optimization. Portfolio optimization is an approach to designing a portfolio based on the investor’s return and risk criteria, which fits the exploration-exploitation nature of bandit problems. (Shen et al. 2015) utilized multi-armed bandits to exploit correlations between the instruments. They constructed orthogonal portfolios and integrated them with the UCB policy to achieve a cumulative regret bound of \(\frac{8n}{\Delta^*} \ln(m) + 5n\), where \(n\), \(m\), and \(\Delta^*\) denote the number of available assets, the total number of time steps, and the gap between the best expected reward and the expected reward, respectively. On the other hand, (Huo and Fu 2017) focused on risk-aware online portfolio optimization by incorporating a computation of the minimum spanning tree of a bipartite graph that encodes a combination of financial institutions and assets, which helps diversify and reduce exposure to systematic risk during a financial crisis.

Dynamic pricing, also known as demand-based pricing, refers to the strategy of setting flexible prices for products or services based on current market demands. The application of bandits in dynamic pricing offers a systematic approach to making real-time pricing decisions while balancing the trade-off between exploring new price points and exploiting known optimal prices. (Misra, Schwartz, and Abernethy 2019) proposed a policy where the company has only incomplete demand information. They derived an algorithm that balances immediate and future profits by combining multi-armed bandits with partial identification of consumer demand from economic theory.

Recommendation systems are essential components of numerous online platforms, guiding users through vast content landscapes to deliver tailored suggestions. These systems are instrumental in platforms like e-commerce sites, streaming platforms, and social media networks. However, the challenge of effectively recommending items to users is non-trivial, given the dynamic nature of user preferences and the vast amount of content available.

One of the most significant challenges in recommendation systems is the "cold start" problem. This issue arises when a new user joins a platform and the system has limited or no information about the user’s preferences. Traditional recommendation algorithms struggle in such scenarios since they rely on historical user-item interactions. As discussed in (Zhou et al. 2017), the bandit setting is particularly suitable for large-scale recommender systems with a vast number of items. By continuously exploring user preferences and exploiting known interactions, bandit-based recommender systems can quickly adapt to new users, ensuring relevant recommendations within a few interactions. The continuous exploration inherent in bandit approaches also means that as a user’s preferences evolve, the system can adapt, ensuring that recommendations remain relevant. Recommending content that is up to date is another important aspect of a recommendation system. In (Bouneffouf, Bouzeghoub, and Gançarski 2012), the concept of "freshness" in content is explored through the lens of the bandit problem. The Freshness-Aware Thompson Sampling algorithm introduced in this study aims to manage the recommendation of fresh documents according to the risk of the user’s situation.

Dialogue systems, often termed conversational agents or chatbots, aim to simulate human-like conversations with users. These systems are deployed across various platforms, including customer support, virtual assistants, and entertainment applications, and they are crucial for enhancing user experience and engagement. Response selection is fundamental to creating a natural and coherent dialogue flow. Traditional dialogue systems rely on a predefined set of responses or rules, which can make interactions feel scripted and inauthentic. In (Liu et al. 2018), the authors proposed a contextual multi-armed bandit model for online learning of response selection. Specifically, they utilized bidirectional LSTM to produce the distributed representations of a dialogue context and responses and customized the Thompson sampling method.

To create a more engaging and dynamic interaction, there is growing interest in developing pro-active dialogue systems that can initiate conversations without user prompting. (Perez and Silander 2018) proposed a novel approach to this challenge with contextual bandits. By introducing memory models into the bandit framework, the system can recall past interactions, making its proactive responses more contextually relevant. Their contributions include the Contextual Attentive Memory Network, which implements a differentiable attention mechanism over past interactions.

(Upadhyay et al. 2019) addressed the challenge of orchestrating multiple independently trained dialogue agents or skills in a unified system. They attempted online posterior dialogue orchestration, defining it as selecting the most suitable subset of skills in response to a user’s input, and studied a context-attentive bandit model that operates under a skill execution budget, ensuring efficient and accurate response selection.

Anomaly detection refers to the task of identifying samples that behave differently from the majority. In (Ding, Li, and Liu 2019), the authors delve into anomaly detection in an interactive setting, allowing the system to actively engage with human experts through a limited number of queries about genuine anomalies. The goal is to present as many true anomalies to the human expert as possible after a fixed query budget is used up. They applied the multi-armed contextual bandit framework to address this issue. This algorithm adeptly integrates both nodal attributes and node dependencies into a unified model, efficiently managing the exploration-exploitation trade-off during anomaly queries.

There are many challenges associated with contextual bandits. The first challenge is that each action only reveals the reward for that particular action. Therefore, the algorithm has to work with incomplete information. This leads to the dilemma of exploitation versus exploration: when should the algorithm choose the best-known option versus trying new options for potentially better outcomes? Another significant challenge for contextual bandits is using context effectively. The context the environment gives needs to be explored to figure out which action is best for each context.

The overarching goal in systems designed for recommending options of high value to users is to achieve an optimal balance between exploration and exploitation. This dual approach is crucial in environments where user preferences and needs are dynamic and diverse. Exploration refers to the process of seeking out new options, learning about untried possibilities, and gathering fresh information that could lead to high-value recommendations. In contrast, exploitation involves utilizing existing knowledge and past experiences to recommend the best options currently known. This balance is key to maintaining a system that continuously adapts to changing user preferences while ensuring the reliability of its recommendations.

A key observation in such systems is the dual role of users as both producers and consumers of information. Each user’s experience contributes valuable data that informs future recommendations for others. For instance, platforms like Waze, Netflix, and Trip Advisor rely heavily on user input and feedback. Waze uses real-time traffic data from drivers to recommend optimal routes; Netflix suggests movies and shows based on viewing histories and ratings; Trip Advisor relies on traveler reviews to guide future tourists. In these examples, the balance between gathering new information (exploration) and recommending the best-known options (exploitation) is dynamically managed to enhance user experience and satisfaction. This approach underscores the importance of user engagement in systems where monetary incentives are not (or can not be) the primary driver.

Recommendation systems often face the challenge of overcoming user biases that can lead to a narrow exploration of options. Users come with preconceived notions and preferences, which can cause them to overlook potentially valuable options that initially appear inferior or unaligned with their interests. This predisposition can significantly limit the effectiveness of recommendation systems, as users might miss out on high-value choices simply due to their existing biases.

To counteract this, it is crucial for recommendation systems to actively incentivize exploration among users. One innovative approach to achieve this is through the strategic use of information asymmetry. By controlling and selectively presenting information, these systems can guide users to explore options they might not typically consider. This method aims to reveal the true potential of various options by nudging users out of their comfort zones and encouraging a broader exploration of available choices. An important note here is that the system is not lying to users; it only selectively reveals information it has.

The concept of incentivizing exploration becomes even more complex when considering different types of users. For instance, systems often encounter short-lived users who have little to gain from contributing to the system’s learning process, as their interactions are infrequent or based on immediate needs. Similarly, some users may operate under a ‘greedy’ principle, primarily seeking immediate gratification rather than contributing to the long-term accuracy and effectiveness of the system. In such scenarios, managing information asymmetry can be a powerful tool. By selectively revealing information, recommendation systems can create a sense of novelty and interest, prompting even the most transient or self-interested users to engage in exploration, thereby enhancing the system’s overall knowledge base and recommendation quality.

4.2 Preferential Bayesian Optimization

The traditional Bayesian optimization (BO) problem is described as follows. There is a black-box objective function \(g: \mathcal{X} \rightarrow \Re\) defined on a bounded subset \(\mathcal{X} \subseteq \Re^q\) such that direct queries to the function are expensive or not possible. However, we would like to solve the global optimization problem of finding \(\mathbf{x}_{\min }=\arg \min _{\mathbf{x} \in \mathcal{X}} g(\mathbf{x})\). This is highly analogous to modeling human preferences, since it is the case that direct access to a human’s latent preference function is not possible but we would still like to find its optimum, such as in A/B tests or recommender systems.

We approach this problem for human preferences with Preferential Bayesian Optimization (PBO), where the key difference is that we are able to query the preference function through pairwise comparisons of data points, i.e. duels. This is a form of indirect observation of the objective function, which models real-world scenarios closely: we commonly need to optimize a function using data about preferences. With humans, it has been demonstrated that we are better at evaluating differences rather than absolute magnitudes (Kahneman and Tversky 1979), and therefore PBO models can be applied in various contexts.

4.2.1 Problem statement

The problem of finding the optimum of a latent preference function defined on \(\mathcal{X}\) can be reduced to determining a sequence of duels on \(\mathcal{X} \times \mathcal{X}\). From each duel \(\left[\mathbf{x}, \mathbf{x}^{\prime}\right] \in\) \(\mathcal{X} \times \mathcal{X}\) we obtain binary feedback \(\{0,1\}\) indicating whether or not \(\mathbf{x}\) is preferred over \(\mathbf{x}^{\prime}\) (\(g(\mathbf{x}) < g(\mathbf{x}^{\prime})\)). We consider that \(\mathbf{x}\) is the winner of the duel if the output is \(\{1\}\) and that \(\mathbf{x}^{\prime}\) wins the duel if the output is \(\{0\}\). The aim is to find \(\mathbf{x}_{\min }\) by reducing as much as possible the number of queried duels.

The key idea in PBO is to learn a preference function in the space of duels using a Gaussian process. We define a joint reward \(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\) on each duel which is never directly observed. Instead, the feedback we obtain after each pair is a binary output \(y \in \{0,1\}\) indicating which of the two inputs is preferred. One definition of \(f\) we will use (though others are possible) is \(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=g\left(\mathbf{x}^{\prime}\right)-g(\mathbf{x})\). The more strongly \(\mathbf{x}\) is preferred over \(\mathbf{x}^{\prime}\), i.e. the smaller \(g(\mathbf{x})\) is relative to \(g(\mathbf{x}^{\prime})\), the bigger the reward.

We define the model of preference using a Bernoulli likelihood, where \(p\left(y=1 \mid\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\) and \(p\left(y=0 \mid\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=\pi_f\left(\left[\mathbf{x}^{\prime}, \mathbf{x}\right]\right)\) for some inverse link function \(\pi: \Re \times \Re \rightarrow[0,1]\). \(\pi_f\) has the property that \(\pi_f\left(\left[\mathbf{x}^{\prime}, \mathbf{x}\right]\right)=1-\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\). A natural choice for \(\pi_f\) is the logistic function \[\label{eq:bernoulli_pref} \pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=\sigma\left(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\right)=\frac{1}{1+e^{-f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)}},\] but others are possible. Therefore we have that for any duel \(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\) in which \(g(\mathbf{x}) \leq g\left(\mathbf{x}^{\prime}\right)\) it holds that \(\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) \geq 0.5\). \(\pi_f\) is a preference function that maps each query \(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\) to the probability of having a preference on the left input \(\mathbf{x}\) over the right input \(\mathbf{x}^{\prime}\).

When we marginalize \(f\) over the right input \(\mathbf{x}^{\prime}\), the resulting function of \(\mathbf{x}\) attains its maximum at \(\mathbf{x}_{\min}\), the global minimum of \(g\). We also introduce the definition of the Copeland score function for a point \(\mathbf{x}\) as \[S(\mathbf{x})=\operatorname{Vol}(\mathcal{X})^{-1} \int_{\mathcal{X}} \mathbb{I}_{\left\{\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) \geq 0.5\right\}} d \mathbf{x}^{\prime}\] where \(\operatorname{Vol}(\mathcal{X})=\int_{\mathcal{X}} d \mathbf{x}^{\prime}\) is a normalizing constant that bounds \(S(\mathbf{x})\) in the interval \([0,1]\). If \(\mathcal{X}\) is a finite set, the Copeland score is simply the proportion of duels that a certain element \(\mathbf{x}\) will win with probability larger than 0.5. A soft variant we will use instead of the Copeland score is the soft-Copeland score, defined as \[\label{eq:soft-copeland} C(\mathbf{x})=\operatorname{Vol}(\mathcal{X})^{-1} \int_{\mathcal{X}} \pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) d \mathbf{x}^{\prime}\] where the probability function \(\pi_f\) is integrated over \(\mathcal{X}\). This score aims to capture the average probability of \(\mathbf{x}\) being the winner of a duel.

We define the Condorcet winner \(\mathbf{x}_c\) as the point with maximal soft-Copeland score. Note that this corresponds to the global minimum of \(g\): the defining integral takes its maximum value for points \(\mathbf{x} \in \mathcal{X}\) where \(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=g\left(\mathbf{x}^{\prime}\right)-g(\mathbf{x})>0\) for all \(\mathbf{x}^{\prime}\), which occurs only if \(\mathbf{x}_c\) is a minimum of \(g\). Therefore, if the preference function \(\pi_f\) can be learned by observing the results of duels, then our optimization problem of finding the minimum of \(g\) can be solved by finding the Condorcet winner, i.e. the maximizer of the soft-Copeland score.
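For a one-dimensional toy objective, the soft-Copeland score and its maximizer can be approximated on a grid; the sketch below assumes the logistic link and \(f([\mathbf{x}, \mathbf{x}']) = g(\mathbf{x}') - g(\mathbf{x})\), with a made-up \(g\).

```python
import numpy as np

g = lambda x: (x - 0.3) ** 2                 # toy latent objective on [0, 1]; minimum at 0.3
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

grid = np.linspace(0.0, 1.0, 201)
# pi_f([x, x']) = sigma(g(x') - g(x)); rows index x, columns index x'.
pi = sigmoid(g(grid)[None, :] - g(grid)[:, None])
copeland = pi.mean(axis=1)                   # soft-Copeland score: average win probability
x_c = grid[np.argmax(copeland)]
print(f"estimated Condorcet winner: x_c = {x_c:.2f}")   # close to the minimizer 0.3
```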

4.2.2 Acquisition Functions

We describe several acquisition functions for sequential learning of the Condorcet winner. Our dataset \(\mathcal{D}=\left\{\left[\mathbf{x}_i, \mathbf{x}_i^{\prime}\right], y_i\right\}_{i=1}^N\) represents the \(N\) duels that have been performed so far. We aim to define a sequential policy \(\alpha\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right] ; \mathcal{D}_j, \theta\right)\) for querying duels, where \(\theta\) is a vector of model hyper-parameters, in order to find the minimum of the latent function \(g\) as quickly as possible. Using Gaussian processes (GP) for classification with our dataset \(\mathcal{D}\) allows us to perform inference over \(f\) and \(\pi_f\).

Pure Exploration

The output variable \(y_{\star}\) of a prediction follows a Bernoulli distribution with probability given by the preference function \(\pi_f\). To carry out exploration as a policy, one method is to search for the duel where the GP is most uncertain about the probability of the outcome, i.e. where the variance of \(\sigma\left(f_{\star}\right)\) is highest; this is the result of pushing our epistemic uncertainty about \(f\), modeled by the GP, through the logistic function. The first-order moment of this distribution coincides with the expectation of \(y_{\star}\), but its variance is \[\begin{aligned} \mathbb{V}\left[\sigma\left(f_{\star}\right)\right] & =\int\left(\sigma\left(f_{\star}\right)-\mathbb{E}\left[\sigma\left(f_{\star}\right)\right]\right)^2 p\left(f_{\star} \mid \mathcal{D},\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) d f_{\star} \\ & =\int \sigma\left(f_{\star}\right)^2 p\left(f_{\star} \mid \mathcal{D},\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) d f_{\star}-\mathbb{E}\left[\sigma\left(f_{\star}\right)\right]^2 \end{aligned}\] which explicitly takes into account the uncertainty over \(f\). Hence, pure exploration of the duel space can be carried out by maximizing \[\alpha_{\mathrm{PE}}\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right] \mid \mathcal{D}_j\right)=\mathbb{V}\left[\sigma\left(f_{\star}\right) \mid \left[\mathbf{x}, \mathbf{x}^{\prime}\right], \mathcal{D}_j\right].\]

Note that in this case, duels that have already been visited will have a lower chance of being visited again, even in cases where the objective takes similar values for both players. In practice, this acquisition function requires the computation of an intractable integral, which we approximate using Monte Carlo sampling.
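Concretely, given the GP's predictive mean and variance of \(f_\star\) at a candidate duel, \(\mathbb{V}[\sigma(f_\star)]\) can be approximated by simple Monte Carlo; `mean_f` and `var_f` below are assumed outputs of a GP classification model, not part of any specific library.

```python
import numpy as np

def pure_exploration_score(mean_f, var_f, n_samples=10_000, seed=0):
    """Monte Carlo estimate of Var[sigma(f_*)] for one candidate duel [x, x']."""
    rng = np.random.default_rng(seed)
    f_samples = rng.normal(mean_f, np.sqrt(var_f), size=n_samples)   # approximate GP posterior
    s = 1.0 / (1.0 + np.exp(-f_samples))                             # logistic link
    return s.var()

# The next duel is the candidate with the largest score (illustrative posterior moments).
candidates = [(0.0, 4.0), (1.5, 0.2), (-2.0, 1.0)]
scores = [pure_exploration_score(m, v) for m, v in candidates]
print(scores, "-> query candidate", int(np.argmax(scores)))
```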

Principled Optimistic Preferential Bayesian Optimization (POP-BO)

In a slightly modified problem setup (Xu et al. 2024), the algorithm maintains a maximum likelihood estimate \(\hat{g}\) together with a confidence set for the ground-truth black-box function \(g\), whose maximum we now seek. Assumptions include that \(g\) is a member of a reproducing kernel Hilbert space (RKHS) \(\mathcal{H}_k\) for some kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\), and that \(\|g\|_k \leq B\), so that \(g \in \mathcal{B}_g = \left\{\tilde{g} \in \mathcal{H}_k \mid\|\tilde{g}\|_k \leq B\right\}\). Defining \(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=g(\mathbf{x})-g\left(\mathbf{x}^{\prime}\right)\) (the sign convention is flipped relative to the previous subsection because here the preferred point is the one with the larger value of \(g\)), we model the preference function with a Bernoulli distribution as in Equation [eq:bernoulli_pref] and also assume that the probabilities follow the Bradley-Terry model, i.e. \[\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=\sigma\left(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\right)=\frac{e^{g(\mathbf{x})}}{e^{g(\mathbf{x})}+e^{g\left(\mathbf{x^{\prime}}\right)}}\]

The maximum likelihood estimate \(\hat{g}^{\text{MLE}}_t\) is obtained from the preference log-likelihood \(\ell_t\): \[\begin{aligned} \hat{g}_t^{\text {MLE }}&:= \arg \underset{\tilde{g} \in \mathcal{B}^t_g}{\max}\ \ell_t(\tilde{g}) \\ \ell_t(\tilde{g}) &:= \log \prod_{\tau=1}^t \left[ y_\tau \pi_{\tilde{f}}([\mathbf{x_\tau}, \mathbf{x^{\prime}_\tau}])+\left(1-y_\tau\right)\left(1-\pi_{\tilde{f}}([\mathbf{x_\tau}, \mathbf{x^{\prime}_\tau}])\right) \right] \\ &=\sum_{\tau=1}^t \log \left(\frac{e^{\tilde{g}(\mathbf{x_\tau})} y_\tau+e^{\tilde{g}(\mathbf{x_\tau^\prime})}\left(1-y_\tau\right)}{e^{\tilde{g}(\mathbf{x_\tau})}+e^{\tilde{g}(\mathbf{x_\tau^\prime})}}\right) \\ &=\sum_{\tau=1}^t\left(\tilde{g}(\mathbf{x_\tau}) y_\tau+\tilde{g}(\mathbf{x_\tau^\prime})\left(1-y_\tau\right)\right)-\sum_{\tau=1}^t \log \left(e^{\tilde{g}(\mathbf{x_\tau})}+e^{\tilde{g}(\mathbf{x_\tau^\prime})}\right) \end{aligned}\]

Xu et al. (2024) further show how to represent this maximization as a convex optimization problem so that it can be solved efficiently.

The update rule for the confidence set \(\mathcal{B}_g^{t+1}\) is as follows.

For any \(\epsilon, \delta > 0\), \[\mathcal{B}_g^{t+1}:=\left\{\tilde{g} \in \mathcal{B}_g \mid \ell_t(\tilde{g}) \geq \ell_t\left(\hat{g}_t^{\mathrm{MLE}}\right)-\beta_1(\epsilon, \delta, t)\right\},\] where \[\beta_1(\epsilon, \delta, t):=\sqrt{32 t B^2 \log \frac{\pi^2 t^2 \mathcal{N}\left(\mathcal{B}_f, \epsilon,\|\cdot\|_{\infty}\right)}{6 \delta}}+ C_L \epsilon t=\mathcal{O}\left(\sqrt{t \log \frac{t \mathcal{N}\left(\mathcal{B}_f, \epsilon,\|\cdot\|_{\infty}\right)}{\delta}}+\epsilon t\right),\] with \(C_L\) a constant independent of \(\delta\), \(t\), and \(\epsilon\). \(\epsilon\) is typically chosen to be \(1/T\), where \(T\) is the running horizon of the algorithm. This construction satisfies the following guarantee: \[\mathbb{P}\left(g \in \mathcal{B}_g^{t+1}, \forall t \geq 1\right) \geq 1-\delta .\]

Intuitively, the confidence set \(\mathcal{B}_g^{t+1}\) includes those functions whose log-likelihood is only ‘a little worse’ than that of the maximum likelihood estimator, and the theorem states that \(\mathcal{B}_g^{t+1}\) contains the ground-truth function \(g\) with high probability.
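The log-likelihood \(\ell_t\) and the confidence-set test can be written down directly from the formulas above; in the sketch below, a candidate \(\tilde{g}\) is represented only by its values at the dueled points, and \(\beta\) is treated as a given constant.

```python
import numpy as np

def log_likelihood(g_x, g_xp, y):
    """ell_t(g~) from the values of g~ at the duel points; y_tau = 1 means x_tau won."""
    g_x, g_xp, y = map(np.asarray, (g_x, g_xp, y))
    winner = g_x * y + g_xp * (1 - y)                       # g~ evaluated at the winning point
    return float(np.sum(winner - np.logaddexp(g_x, g_xp)))  # sum_t [g~(win) - log(e^g~(x)+e^g~(x'))]

def in_confidence_set(candidate_ll, mle_ll, beta):
    # g~ stays in B_g^{t+1} if its log-likelihood is only 'a little worse' than the MLE's.
    return candidate_ll >= mle_ll - beta

# Toy check with three duels; the values of g~ at the duel points are made up.
g_x, g_xp, y = [1.0, 0.2, -0.5], [0.4, 0.9, -0.1], [1, 0, 1]
ll = log_likelihood(g_x, g_xp, y)
print(ll, in_confidence_set(ll, mle_ll=ll + 0.3, beta=1.0))
```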

The inner-level optimization in line 4 of the algorithm below can likewise be represented as a convex optimization problem, and the outer optimization can be solved using grid search for medium-sized problems (Xu et al. 2024).

The full POP-BO loop is as follows (a minimal code sketch is given after the list). Given the initial point \(\mathbf{x_0} \in \mathcal{X}\), set \(\mathcal{B}_g^1 = \mathcal{B}_g\); then, for \(t = 1, 2, \ldots\):

  1. Set the reference point \(\mathbf{x_t^{\prime}} = \mathbf{x_{t-1}}\).

  2. Compute \(\mathbf{x_t} \in \arg\max_{\mathbf{x} \in \mathcal{X}} \max_{\tilde{g} \in \mathcal{B}_g^t} (\tilde{g}(\mathbf{x}) - \tilde{g}(\mathbf{x_t^{\prime}}))\), with the inner optimal function denoted as \(\tilde{g}_t\).

  3. Obtain the output of the duel \(y_t\) and append the new data point to \(\mathcal{D}_t\).

  4. Update the maximum likelihood estimator \(\hat{g}_t^{\mathrm{MLE}}\) and the posterior confidence set \(\mathcal{B}_g^{t+1}\).
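A minimal sketch of this loop on a finite design grid is given below; the simulated ground-truth utility, the Bernoulli response model, and the ensemble approximation of the confidence set are all simplifying assumptions standing in for the exact RKHS machinery of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

n_grid, T, beta = 30, 40, 2.0
x_grid = np.linspace(0.0, 1.0, n_grid)
g_true = np.sin(3.0 * x_grid)          # hidden utility, used only to simulate the user

# An ensemble of smooth random functions standing in for the ball B_g.
freqs = rng.uniform(0.5, 4.0, size=(400, 1))
phases = rng.uniform(0.0, 2.0 * np.pi, size=(400, 1))
candidates = np.sin(np.pi * freqs * x_grid[None, :] + phases)

duels, ys = [], []
x_idx = 0                              # initial point x_0

def log_lik(g):
    """Bradley-Terry log-likelihood of the duels collected so far."""
    if not duels:
        return 0.0
    d, yv = np.array(duels), np.array(ys)
    gi, gj = g[d[:, 0]], g[d[:, 1]]
    return np.sum(gi * yv + gj * (1 - yv)) - np.sum(np.logaddexp(gi, gj))

for t in range(T):
    ref_idx = x_idx                    # reference point x'_t = x_{t-1}
    lls = np.array([log_lik(g) for g in candidates])
    conf = candidates[lls >= lls.max() - beta]        # approximate confidence set B_g^t
    # Optimistic acquisition: max over x of max over g~ in B_g^t of g~(x) - g~(x'_t).
    x_idx = int(np.argmax(np.max(conf - conf[:, [ref_idx]], axis=0)))
    # Simulate the duel outcome with the Bradley-Terry model (y = 1: x_t preferred).
    p_win = 1.0 / (1.0 + np.exp(-(g_true[x_idx] - g_true[ref_idx])))
    y_t = int(rng.random() < p_win)
    duels.append((x_idx, ref_idx))
    ys.append(y_t)

print("last queried point:", x_grid[x_idx], "| argmax of g_true:", x_grid[np.argmax(g_true)])
```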

qEUBO: Decision-Theoretic EUBO

qEUBO (Astudillo et al. 2023) derives an acquisition function that extends duels to \(q>2\) options, which we call queries. Let \(X=\left(\mathbf{x_1}, \ldots, \mathbf{x_q}\right) \in \mathcal{X}^q\) denote a query containing two or more points, and let \(g: \mathcal{X} \rightarrow \mathbb{R}\) be the latent preference function. Then, after \(n\) user queries, we define the expected utility of the best option (qEUBO) as \[\mathrm{qEUBO}_n(X)=\mathbb{E}_n\left[\max \left\{g\left(x_1\right), \ldots, g\left(x_q\right)\right\}\right].\]
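Under a Gaussian posterior over the latent utility, \(\mathrm{qEUBO}_n(X)\) can be estimated by Monte Carlo: draw joint posterior samples of \(g\) at the \(q\) query points and average the per-sample maximum. The sketch below does exactly this for an assumed multivariate normal posterior; the mean vector and covariance are placeholder values, not quantities from the paper.

```python
import numpy as np

def qeubo_mc(mean, cov, n_samples=10_000, rng=None):
    """Monte Carlo estimate of qEUBO_n(X) = E_n[ max_i g(x_i) ].

    mean, cov: posterior mean vector and covariance of (g(x_1), ..., g(x_q)).
    """
    rng = np.random.default_rng(rng)
    samples = rng.multivariate_normal(mean, cov, size=n_samples)  # (n_samples, q)
    return samples.max(axis=1).mean()

# Toy posterior over a query with q = 3 options.
mean = np.array([0.2, 0.0, -0.1])
cov = 0.3 * np.eye(3) + 0.05
print(qeubo_mc(mean, cov, rng=0))
```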

We now show that qEUBO is one-step Bayes optimal, meaning that each step chooses the query that maximises the expected utility received by the human. For a query \(X \in \mathcal{X}^q\), let \[V_n(X)=\mathbb{E}_n\left[\max _{x \in \mathbb{X}} \mathbb{E}_{n+1}[g(x)] \mid X_{n+1}=X\right] .\] Then \(V_n\) defines the expected utility received if an additional query \(X_{n+1}=X\) is performed, and maximizing \(V_n\) is one-step Bayes optimal. Since \(\max _{x \in \mathbb{X}} \mathbb{E}_n[g(x)]\) does not depend on \(X_{n+1}\), we can also equivalently maximize \[\mathbb{E}_n\left[\max _{x \in \mathbb{X}} \mathbb{E}_{n+1}[g(x)]-\max _{x \in \mathbb{X}} \mathbb{E}_n[g(x)] \mid X_{n+1}=X\right],\] which takes the same form as the knowledge gradient acquisition function (Wu and Frazier 2018) in standard Bayesian optimization.

\(V_n\) involves a nested stochastic optimization task, while qEUBO is a much simpler policy. When human responses are noise-free, we are able to use qEUBO as a sufficient policy due to the following theorem:

\[\underset{X \in \mathbb{X}^q}{\operatorname{argmax}} \mathrm{qEUBO}_n(X) \subseteq \underset{X \in \mathbb{X}^q}{\operatorname{argmax}} V_n(X) .\]

Proof. For a query \(X \in \mathcal{X}^q\), let \(x^{+}(X, i) \in \operatorname{argmax}_{x \in \mathbb{X}} \mathbb{E}_n[g(x) \mid(X, i)]\) and define \(X^{+}(X)=\) \(\left(x^{+}(X, 1), \ldots, x^{+}(X, q)\right)\).

Claim 1: \(V_n(X) \leq \mathrm{qEUBO}_n\left(X^{+}(X)\right) .\) We see that \[\begin{aligned} V_n(X) & =\sum_{i=1}^q \mathbf{P}_n(r(X)=i)\, \mathbb{E}_n\left[g\left(x^{+}(X, i)\right) \mid (X, i)\right] \\ & \leq \sum_{i=1}^q \mathbf{P}_n(r(X)=i)\, \mathbb{E}_n\left[\max _{j=1, \ldots, q} g\left(x^{+}(X, j)\right) \mid (X, i)\right] \\ & =\mathbb{E}_n\left[\max _{j=1, \ldots, q} g\left(x^{+}(X, j)\right)\right] \\ & =\mathrm{qEUBO}_n\left(X^{+}(X)\right), \end{aligned}\] as claimed, where the third line follows from the tower property of conditional expectation.

Claim 2: \(\mathrm{qEUBO}_n(X) \leq V_n(X) .\) For any given \(X \in \mathbb{X}^q\) we have \[\mathbb{E}_n\left[g\left(x_{r(X)}\right) \mid(X, r(X))\right] \leq \max _{x \in \mathbb{X}} \mathbb{E}_n[g(x) \mid(X, r(X))] .\] Since \(g\left(x_{r(X)}\right)=\max _{i=1, \ldots, q} g\left(x_i\right)\) when responses are noise-free, taking expectations over \(r(X)\) on both sides yields the required result.

Now building on the arguments above, let \(X^* \in \operatorname{argmax}_{X \in \mathbb{X}^q} \mathrm{qEUBO}_n(X)\) and suppose for contradiction that \(X^* \notin \operatorname{argmax}_{X \in \mathbb{X}^q} V_n(X)\). Then, there exists \(\widetilde{X} \in \mathbb{X}^q\) such that \(V_n(\widetilde{X})>V_n\left(X^*\right)\). We have \[\begin{aligned} \operatorname{qEUBO}_n\left(X^{+}(\tilde{X})\right) & \geq V_n(\tilde{X}) \\ & >V_n\left(X^*\right) \\ & \geq \operatorname{qEUBO}_n\left(X^*\right) \\ & \geq \operatorname{qEUBO}_n\left(X^{+}(\tilde{X})\right) . \end{aligned}\]

The first inequality follows from Claim 1. The second inequality is due to our supposition for contradiction. The third inequality is due to Claim 2. Finally, the fourth inequality holds since \(X^* \in \operatorname{argmax}_{X \in \mathbb{X}^q} \mathrm{qEUBO}_n(X)\). This contradiction concludes the proof. ◻

Therefore, when responses are noise-free, maximizing \(\text{qEUBO}_n\) is sufficient for one-step Bayes optimality.

In experiments comparing qEUBO to other state-of-the-art acquisition functions, qEUBO consistently outperformed the alternatives on most problems and was closely followed by qEI and qTS. These results also extended to experiments with more than two options per query (\(q>2\)); in fact, regret converges faster when human queries contain more options (see the regret analysis in Section 4.2.3).

qEI: Batch Expected Improvement

The batch expected improvement acquisition (written here for minimization, where \(\mu_{\min}\) denotes the current best (smallest) value and \(y_1, \ldots, y_q\) the outcomes of the batch) is \[\begin{aligned} \mathrm{qEI}= & \mathbb{E}_{\mathbf{y}}\left[\left(\max _{i \in[1, \ldots, q]}\left(\mu_{\min }-y_i\right)\right)_{+}\right] \\ = & \sum_{i=1}^q \mathbb{E}_{\mathbf{y}}\left(\mu_{\min }-y_i \mid y_i \leq \mu_{\min }, y_i \leq y_j \forall j \neq i\right) \\ & p\left(y_i \leq \mu_{\min }, y_i \leq y_j \forall j \neq i\right) . \end{aligned}\]
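Since \(\max_i(\mu_{\min}-y_i) = \mu_{\min}-\min_i y_i\), qEI is usually estimated by Monte Carlo rather than through the nested conditional expectations above. A minimal sketch under an assumed multivariate normal posterior over the batch outputs follows; the posterior parameters and the incumbent value are placeholders.

```python
import numpy as np

def qei_mc(mu_min, mean, cov, n_samples=10_000, rng=None):
    """Monte Carlo estimate of qEI = E_y[ (mu_min - min_i y_i)_+ ] (minimization form).

    mu_min    : incumbent best (smallest) value
    mean, cov : posterior mean vector and covariance of the batch outputs y_1, ..., y_q
    """
    rng = np.random.default_rng(rng)
    y = rng.multivariate_normal(mean, cov, size=n_samples)   # (n_samples, q)
    improvement = np.maximum(mu_min - y.min(axis=1), 0.0)    # batch improvement
    return improvement.mean()

# Toy batch of q = 4 candidate points.
print(qei_mc(mu_min=0.0,
             mean=np.array([0.1, -0.05, 0.2, 0.0]),
             cov=0.2 * np.eye(4) + 0.02,
             rng=0))
```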

qTS: Batch Thompson Sampling

The sequential Thompson sampling loop is: given initial data \(\mathcal{D}_{\mathcal{I}(1)}=\{(\mathbf{x}_i, y_i)\}_{i \in \mathcal{I}(1)}\), at each round \(t\):

  1. Compute the current posterior \(p(\boldsymbol{\theta} \mid \mathcal{D}_{\mathcal{I}(t)})\).

  2. Sample \(\boldsymbol{\theta}\) from \(p(\boldsymbol{\theta} \mid \mathcal{D}_{\mathcal{I}(t)})\).

  3. Select \(k \leftarrow \arg \max_{j \notin \mathcal{I}(t)} \mathbb{E}[y_j \mid \mathbf{x}_j, \boldsymbol{\theta}]\).

  4. Collect \(y_k\) by evaluating \(f\) at \(\mathbf{x}_k\).

  5. Update \(\mathcal{D}_{\mathcal{I}(t+1)} \leftarrow \mathcal{D}_{\mathcal{I}(t)} \cup \{(\mathbf{x}_k, y_k)\}\).

The batch variant with batch size \(S\) is analogous (a code sketch follows the list): given initial data \(\mathcal{D}_{\mathcal{I}(1)}=\{(\mathbf{x}_i, y_i)\}_{i \in \mathcal{I}(1)}\), at each round \(t\):

  1. Compute the current posterior \(p(\boldsymbol{\theta} \mid \mathcal{D}_{\mathcal{I}(t)})\).

  2. For each batch slot \(s = 1, \ldots, S\): sample \(\boldsymbol{\theta}\) from \(p(\boldsymbol{\theta} \mid \mathcal{D}_{\mathcal{I}(t)})\) and select \(k(s) \leftarrow \arg \max_{j \notin \mathcal{I}(t)} \mathbb{E}[y_j \mid \mathbf{x}_j, \boldsymbol{\theta}]\).

  3. Evaluate the batch and update \(\mathcal{D}_{\mathcal{I}(t+1)} = \mathcal{D}_{\mathcal{I}(t)} \cup \{(\mathbf{x}_{k(s)}, y_{k(s)})\}_{s=1}^S\).
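A minimal sketch of the batch variant on a discrete candidate pool follows, using a conjugate Gaussian model with known noise so that the posterior update stays in closed form; the independent per-point prior, the pool size, and the noise level are illustrative assumptions rather than part of the qTS setup used in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

n_pool, S, rounds = 30, 4, 5
sigma2, tau2 = 0.25, 1.0                       # known noise and prior variances
theta_true = rng.normal(0.0, np.sqrt(tau2), n_pool)   # latent value of each pool point

selected = np.zeros(n_pool, dtype=bool)        # the index set I(t)
counts = np.zeros(n_pool)
sums = np.zeros(n_pool)

for t in range(rounds):
    # Conjugate Gaussian posterior over each point's latent value given D_I(t).
    post_var = 1.0 / (1.0 / tau2 + counts / sigma2)
    post_mean = post_var * sums / sigma2

    batch = []
    for s in range(S):                         # pick S points per round
        theta = rng.normal(post_mean, np.sqrt(post_var))   # Thompson sample of theta
        theta[selected] = -np.inf              # restrict to j not in I(t)
        theta[batch] = -np.inf                 # and not already chosen in this batch
        batch.append(int(np.argmax(theta)))

    for k in batch:                            # evaluate f at the chosen points
        y_k = theta_true[k] + rng.normal(0.0, np.sqrt(sigma2))
        counts[k] += 1
        sums[k] += y_k
        selected[k] = True                     # D_I(t+1) = D_I(t) U {(x_k, y_k)}

print("best pool point:", int(np.argmax(theta_true)),
      "| points selected so far:", int(selected.sum()))
```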

4.2.3 Regret Analysis

qEUBO Regret

With the definition of Bayesian simple regret, we have that the simple regret of qEUBO converges to zero at a rate of \(o(1/n)\), i.e.

\[\label{th:quebo_regret} \mathbb{E}\left[f\left(x^*\right)-f\left(\widehat{x}_n^*\right)\right]=o(1 / n)\]

where \(x^*=\operatorname{argmax}_{x \in \mathrm{X}} f(x)\) and \(\widehat{x}_n^* \in \operatorname{argmax}_{x \in \mathrm{X}} \mathbb{E}_n[f(x)]\).

This theorem holds under the following assumptions:

  1. \(f\) is injective: \(\mathbf{P}(f(x)=f(y))=0\) for any \(x, y \in \mathbb{X}\) with \(x \neq y\).

  2. \(f\) represents the preferred option: \(\exists a>1 / 2\) s.t. \(\mathbf{P}\left(r(X) \in \operatorname{argmax}_{i=1, 2} f\left(x_i\right) \mid f(X)\right) \geq a\) for all \(X=\left(x_1, x_2\right) \in \mathbb{X}^2\) with \(x_1 \neq x_2\), almost surely under the prior on \(f\).

  3. Expected difference in utility is proportional to probability of greater utility: \(\exists \Delta \geq \delta>0\) s.t. for all \(\mathcal{D}^{(n)}\) and all \(x, y \in \mathbb{X}\) (potentially depending on \(\mathcal{D}^{(n)}\)), \[\delta \mathbf{P}^{(n)}(f(x)>f(y)) \leq \mathbb{E}^{(n)}\left[\{f(x)-f(y)\}^{+}\right] \leq \Delta \mathbf{P}^{(n)}(f(x)>f(y))\] almost surely under the prior on \(f\).

Further lemmas leading to a proof of Theorem [th:quebo_regret] are given in Section B of (Astudillo et al. 2023).

qEI Regret

The following theorem shows that, under the same assumptions used for the qEUBO regret bound, the simple regret of qEI can fail to converge to zero.

There exists a problem instance (i.e., \(\mathbb{X}\) and Bayesian prior distribution over \(f\)) satisfying the assumptions described in Theorem [th:quebo_regret] such that, if the sequence of queries is chosen by maximizing qEI, then \(\mathbb{E}\left[f\left(x^*\right)-f\left(\widehat{x}_n^*\right)\right] \geq R\) for all \(n\), for a constant \(R>0\).

Proof. Let \(\mathbb{X} = \{1, 2, 3, 4\}\) and consider the functions \(f_i:\mathbb{X} \rightarrow \mathbb{R}\), for \(i=1,2,3,4\), given by \(f_i(1) = -1\) and \(f_i(2) = 0\) for all \(i\), and \[\begin{aligned} f_1(x) = \begin{cases} 1, &\ x=3\\ \frac{1}{2}, &\ x=4 \end{cases}, \hspace{0.5cm} f_2(x) = \begin{cases} \frac{1}{2}, &\ x=3\\ 1, &\ x=4 \end{cases}, \hspace{0.5cm} f_3(x) = \begin{cases} -\frac{1}{2}, &\ x=3\\ -1, &\ x=4 \end{cases}, \hspace{0.5cm} f_4(x) = \begin{cases} -1, &\ x=3\\ -\frac{1}{2}, &\ x=4 \end{cases}. \end{aligned}\]

Let \(p\) be a number with \(0 < p < 1/3\) and set \(q=1-p\). We consider a prior distribution on \(f\) with support \(\{f_i\}_{i=1}^4\) such that \[\begin{aligned} p_i = Pr(f=f_i) = \begin{cases} p/2, & i =1,2,\\ q/2, & i=3,4. \end{cases} \end{aligned}\] We also assume the user’s response likelihood is given by \(Pr(r(X)=1\mid f(x_1) > f(x_2)) = a\) for some \(a\) such that \(1/2 < a < 1\).

Let \(\mathcal{D}^{(n)}\) denote the set of observations up to time \(n\) and let \(p_i^{(n)} = Pr(f=f_i \mid \mathcal{D}^{(n)})\) for \(i=1,2,3,4\). We let the initial data set be \(\mathcal{D}^{(0)} = \{(X^{(0)}, r^{(0)})\}\), where \(X^{(0)}= (1,2)\). We will prove that the following statements are true for all \(n\geq 0\).

  1. \(p_i^{(n)} > 0\) for \(i=1,2,3,4\).

  2. \(p_1^{(n)} < \frac{1}{2}p_3^{(n)}\) and \(p_2^{(n)} < \frac{1}{2}p_4^{(n)}\).

  3. \(\arg \max_{x\in\mathcal{X}}\mathbb{E}^{(n)}[f(x)]=\{2\}\).

  4. \(\arg \max_{X\in\mathcal{X}^2}\text{qEI}^{(n)}(X) = \{(3, 4)\}\).

We prove this by induction over \(n\). We begin by proving this for \(n=0\). Since \(f_i(1) < f_i(2)\) for all \(i\), the posterior distribution on \(f\) given \(\mathcal{D}^{(0)}\) remains the same as the prior; i.e., \(p_i^{(0)} = p_i\) for \(i=1,2,3,4\). Using this, statements 1 and 2 can be easily verified. Now note that \(\mathbb{E}^{(0)}[f(1)]=-1\), \(\mathbb{E}^{(0)}[f(2)]=0\), and \(\mathbb{E}^{(0)}[f(3)] = \mathbb{E}^{(0)}[f(4)] = \frac{3}{4}(p - q)\). Since \(p < q\), it follows that \(\arg \max_{x\in\mathcal{X}}\mathbb{E}^{(0)}[f(x)]=\{2\}\); i.e., statement 3 holds. Finally, since \(\max_{x\in\mathcal{X}}\mathbb{E}^{(0)}[f(x)] = 0\), the qEI acquisition function at time \(n=0\) is given by \(\text{qEI}^{(0)}(X) = \mathbb{E}^{(0)}[\{\max\{f(x_1), f(x_2)\}\}^+]\). A direct calculation can now be performed to verify that statement 4 holds. This completes the base case.

Now suppose statements 1-4 hold for some \(n\geq 0\). Since \(X^{(n+1)} = (3, 4)\), the posterior distribution on \(f\) given \(D^{(n+1)}\) is given by \[\begin{aligned} p_i^{(n+1)} \propto \begin{cases} p_i^{(n)}\ell, \ i=1,3,\\ p_i^{(n)} (1 - \ell), \ i=2,4, \end{cases} \end{aligned}\] where \[\ell = a I\{r^{(n+1)} = 1\} + (1-a)I\{r^{(n+1)} = 2\}.\] Observe that \(0< \ell < 1\) since \(0 < a < 1\). Thus, \(\ell > 0\) and \(1-\ell > 0\). Since \(p_i^{(n)} > 0\) by the induction hypothesis, it follows from this that \(p_i^{(n+1)} > 0\) for \(i=1,2,3,4\). Moreover, since \(p_i^{(n+1)} \propto p_i^{(n)}\ell\) for \(i=1,3\) and \(p_1^{(n)} < \frac{1}{2}p_3^{(n)}\) by the induction hypothesis, it follows that \(p_1^{(n+1)} < \frac{1}{2}p_3^{(n+1)}\). Similarly, \(p_2^{(n+1)} < \frac{1}{2}p_4^{(n+1)}\). Thus, statements 1 and 2 hold at time \(n+1\).

Now observe that \[\begin{aligned} \mathbb{E}^{(n+1)}[f(3)] &= p_1^{(n+1)} + \frac{1}{2}p_2^{(n+1)} - \frac{1}{2}p_3^{(n+1)} - p_4^{(n+1)}\\ &= \left(p_1^{(n+1)} - \frac{1}{2}p_3^{(n+1)}\right) + \left(\frac{1}{2}p_2^{(n+1)} - p_4^{(n+1)}\right)\\ &\leq \left(p_1^{(n+1)} - \frac{1}{2}p_3^{(n+1)}\right) + \left(p_2^{(n+1)} - \frac{1}{2}p_4^{(n+1)}\right)\\ &\leq 0, \end{aligned}\] where the last inequality holds since \(p_1^{(n+1)} < \frac{1}{2}p_3^{(n+1)}\) and \(p_2^{(n+1)} < \frac{1}{2}p_4^{(n+1)}\). Similarly, we see that \(\mathbb{E}^{(n+1)}[f(4)] \leq 0\). Since \(\mathbb{E}^{(n+1)}[f(1)]=-1\) and \(\mathbb{E}^{(n+1)}[f(2)]=0\), it follows that \(\arg \max_{x\in\mathcal{X}}\mathbb{E}^{(n+1)}[f(x)]=\{2\}\); i.e., statement 3 holds at time \(n+1\).

Since \(\max_{x\in\mathcal{X}}\mathbb{E}^{(n+1)}[f(x)] = 0\), the qEI acquisition function at time \(n+1\) is given by \(\text{qEI}^{(n+1)}(X) = \mathbb{E}^{(n+1)}[\{\max\{f(x_1), f(x_2)\}\}^+]\). Since \(f(1) \leq f(x)\) almost surely under the prior for all \(x\in\mathcal{X}\), there is always a maximizer of qEI that does not contain \(1\). Thus, to find the maximizer of qEI, it suffices to analyse its value at the pairs \((2, 3)\), \((3,4)\) and \((4,2)\). We have \[\text{qEI}^{(n+1)}(2, 3) = p_1^{(n+1)} + 1/2 p_2^{(n+1)},\] \[\operatorname{qEI}^{(n+1)}(3, 4) = p_1^{(n+1)} + p_2^{(n+1)}\] and \[\operatorname{qEI}^{(n+1)}(4, 2) = 1/2p_1^{(n+1)} + p_2^{(n+1)}.\] Since \(p_1^{(n+1)} > 0\) and \(p_2^{(n+1)} > 0\), it follows that \(\arg \max_{X \in \mathcal{X}^2}\text{qEI}^{(n+1)}(X) = \{(3, 4)\}\), which concludes the proof by induction.

Finally, since \(\arg \max_{x\in \mathcal{X}}\mathbb{E}^{(n)}[f(x)]=\{2\}\) for all \(n\), the Bayesian simple regret of qEI is given by \[\begin{aligned} \mathbb{E}\left[f(x^*) - f(2)\right] &= \sum_{i=1}^{4}p_i\left(\max_{x\in \mathcal{X}}f_i(x) - f_i(2)\right)\\ &= p \end{aligned}\] for all \(n\). ◻
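Because the counterexample is fully discrete, it is easy to check numerically. The sketch below simulates the posterior updates and the qEI values for the four-function prior and confirms that qEI keeps querying \((3, 4)\) while point \(2\) remains the posterior-mean maximizer, so the simple regret stays at \(p\); the particular values of \(p\), \(a\), and the number of rounds are arbitrary choices satisfying the stated constraints.

```python
import numpy as np

rng = np.random.default_rng(4)

# The four candidate utility functions on X = {1, 2, 3, 4} (columns are points 1..4).
F = np.array([[-1.0, 0.0,  1.0,  0.5],
              [-1.0, 0.0,  0.5,  1.0],
              [-1.0, 0.0, -0.5, -1.0],
              [-1.0, 0.0, -1.0, -0.5]])
p, a, rounds = 0.2, 0.8, 50                    # p < 1/3 and 1/2 < a < 1
prior = np.array([p / 2, p / 2, (1 - p) / 2, (1 - p) / 2])

f_star = rng.choice(4, p=prior)                # nature's draw of the true function
post = prior.copy()
for n in range(rounds):
    # qEI over the pairs (2,3), (3,4), (4,2); indices below are 0-based.
    pairs = [(1, 2), (2, 3), (3, 1)]
    qei = [np.sum(post * np.maximum(F[:, [i, j]].max(axis=1), 0.0)) for i, j in pairs]
    i, j = pairs[int(np.argmax(qei))]
    assert (i, j) == (2, 3)                    # qEI always queries the pair (3, 4)

    # Simulate the user's noisy response and perform the Bayes update.
    first_better = F[f_star, i] > F[f_star, j]
    r_is_first = rng.random() < (a if first_better else 1.0 - a)
    lik = np.where(F[:, i] > F[:, j], a, 1.0 - a)
    post = post * (lik if r_is_first else 1.0 - lik)
    post /= post.sum()

    # Point 2 (0-based index 1) remains the unique posterior-mean maximizer.
    assert int(np.argmax(post @ F)) == 1

print("posterior after", rounds, "rounds:", np.round(post, 3))
print("simple regret of reporting point 2:", np.sum(prior * (F.max(axis=1) - F[:, 1])))
```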

POP-BO Regret

Commonly used kernel functions within the RKHS are the following (a short implementation sketch is given after the list):

  1. Linear: \[k(x, \bar{x})=x^{\top} \bar{x} .\]

  2. Squared Exponential (SE): \[k(x, \bar{x})=\sigma_{\mathrm{SE}}^2 \exp \left\{-\frac{\|x-\bar{x}\|^2}{l^2}\right\},\] where \(\sigma_{\mathrm{SE}}^2\) is the variance parameter and \(l\) is the lengthscale parameter.

  3. Matérn: \[k(x, \bar{x})=\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2 \nu} \frac{\|x-\bar{x}\|}{\rho}\right)^\nu K_\nu\left(\sqrt{2 \nu} \frac{\|x-\bar{x}\|}{\rho}\right),\] where \(\rho\) and \(\nu\) are the two positive parameters of the kernel function, \(\Gamma\) is the gamma function, and \(K_\nu\) is the modified Bessel function of the second kind. \(\nu\) captures the smoothness of the kernel function.
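For concreteness, the three kernels above can be written in a few lines of Python (using SciPy for the gamma and modified Bessel functions); the variance, lengthscale, and Matérn parameters below are arbitrary example values.

```python
import numpy as np
from scipy.special import gamma, kv   # gamma function and modified Bessel function K_nu

def linear_kernel(x, x_bar):
    return float(np.dot(x, x_bar))

def se_kernel(x, x_bar, sigma2_se=1.0, lengthscale=1.0):
    r2 = np.sum((np.asarray(x) - np.asarray(x_bar)) ** 2)
    return sigma2_se * np.exp(-r2 / lengthscale**2)

def matern_kernel(x, x_bar, rho=1.0, nu=2.5):
    r = np.linalg.norm(np.asarray(x) - np.asarray(x_bar))
    if r == 0.0:
        return 1.0                     # limiting value as the distance goes to zero
    z = np.sqrt(2.0 * nu) * r / rho
    return (2.0 ** (1.0 - nu) / gamma(nu)) * z**nu * kv(nu, z)

x, x_bar = np.array([0.2, 0.5]), np.array([0.4, 0.1])
print(linear_kernel(x, x_bar), se_kernel(x, x_bar), matern_kernel(x, x_bar))
```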

The following theorem bounds the cumulative regret of POP-BO:

With probability at least \(1-\delta\), the cumulative regret of POP-BO satisfies, \[R_T=\mathcal{O}\left(\sqrt{\beta_T \gamma_T^{f f^{\prime}} T}\right),\] where \[\beta_T=\beta(1 / T, \delta, T)=\mathcal{O}\left(\sqrt{T \log \frac{T \mathcal{N}\left(\mathcal{B}_f, 1 / T,\|\cdot\|_{\infty}\right)}{\delta}}\right).\]

The guaranteed convergence rate is characterised as:

Let \(t^{\star}\) be defined as in Eq. (19) of Xu et al. (2024). With probability at least \(1-\delta\), \[f\left(x^{\star}\right)-f\left(x_{t^{\star}}\right) \leq \mathcal{O}\left(\frac{\sqrt{\beta_T \gamma_T^{f f^{\prime}}}}{\sqrt{T}}\right)\]

The theorem above highlights that by minimizing the known term \(2\left(2 B+\lambda^{-1 / 2} \sqrt{\beta\left(\epsilon, \frac{\delta}{2}, t\right)}\right) \sigma_t^{f f^{\prime}}\left(\left(x_t, x_t^{\prime}\right)\right)\), the reported final solution \(x_{t^{\star}}\) has a guaranteed convergence rate.

Further kernel-specific regret bounds for POP-BO are calculated as follows:

Setting \(\epsilon=1 / T\) and running the POP-BO algorithm (Algorithm 1 in Xu et al. 2024):

  1. If \(k(x, y)=\langle x, y\rangle\), we have, \[R_T=\mathcal{O}\left(T^{3 / 4}(\log T)^{3 / 4}\right) .\]

  2. If \(k(x, y)\) is a squared exponential kernel, we have, \[R_T=\mathcal{O}\left(T^{3 / 4}(\log T)^{3 / 4(d+1)}\right) .\]

  3. If \(k(x, y)\) is a Matérn kernel, we have, \[R_T=\mathcal{O}\left(T^{3 / 4}(\log T)^{3 / 4}\, T^{\frac{d}{\nu}\left(\frac{1}{4}+\frac{d+1}{4+2(d+1)^d / \nu}\right)}\right).\]

4.3 Case Study 1: Foundation Models for Robotics

Modern foundation models have been ubiquitous in discussions of powerful, general-purpose AI systems that can accomplish myriad tasks across many disciplines such as programming, medicine, law, and open question-answering, with rapidly increasing capabilities (Bommasani et al. 2022). However, despite successes from large labs in controlled environments (Brohan et al. 2023), foundation models have not seen widespread use in robotics due to shifting robot morphology, lack of data, and the sim-to-real gap (Walke et al. 2023). In this subsection we explore two promising approaches, R3M and Voltron, which are among the first to leverage pre-training on vast amounts of data to improve performance on downstream robotic tasks despite these issues (Nair et al. 2022; Karamcheti et al. 2023).

R3M represents a significant advancement in robotic manipulation and learning. The model diverges from traditional approaches that train from scratch on in-domain data from the same robot; instead, it leverages pretraining on large datasets, akin to the practice in computer vision and natural language processing (NLP) of training on diverse, large-scale datasets to create reusable, general-purpose representations. The core principle behind R3M is its training methodology: it is pre-trained on Ego4D (Grauman et al. 2022), a wide array of human videos encompassing various activities and interactions. This diverse dataset enables the model to capture a broad spectrum of physical interactions and dynamics that are crucial for effective robotic manipulation, and whereas prior approaches struggled to fit this dataset well, R3M is built to leverage it. The training uses an objective that combines time-contrastive learning, video-language alignment, and a sparsity penalty. This objective ensures that R3M not only understands the temporal dynamics of scenes (i.e., how states transition over time) but also focuses on semantically relevant features, such as objects and their interrelations, while maintaining a compact and efficient representation. What sets R3M apart in robotics is its efficiency and effectiveness in learning from a limited amount of data: the model demonstrates remarkable performance when learning real-world tasks with minimal human supervision, typically less than 10 minutes. This is a stark contrast to traditional models that require extensive and often prohibitively large datasets for training. Furthermore, R3M's pre-trained nature allows it to be applied across a variety of tasks and environments without retraining from scratch, making it a versatile tool for robotic manipulation. The empirical results are compelling, leading to a 10% improvement over training from an ImageNet-pretrained model, self-supervised approaches such as MoCo, or even CLIP (Deng et al. 2009; He et al. 2020; Radford et al. 2021). Note, however, that R3M makes only limited use of language data, which leaves quite a bit of supervision to be desired.

Building on the success of R3M, Voltron further extends this line of work by combining self-supervision, advances in foundation models, and multi-modality. Voltron adopts an intuitive dual objective in which the trained model alternates between generating natural language describing the task in an image and conditioning its visual representation learning on natural-language descriptions, forcing a nuanced understanding of both modalities (Radford et al. 2021). Voltron's approach is distinguished by its versatility and depth of learning. It is adept at handling a wide range of robotic tasks, from low-level spatial feature recognition to the high-level semantic understanding required for language-conditioned imitation and intent scoring. This flexibility makes it suitable for various applications in robotic manipulation, from grasping objects based on descriptive language to performing complex sequences of actions in response to verbal instructions. The authors rigorously test Voltron on scenarios such as dense segmentation for grasp-affordance prediction, object detection in cluttered scenes, and learning multi-task language-conditioned policies for real-world manipulation, with up to 15% improvement over baselines. In each of these domains, Voltron has shown a remarkable ability to outperform existing models like MVP and R3M, showcasing its adaptability and learning capabilities (Xiao et al. 2022). Moreover, Voltron's framework allows for a balance between encoding low-level and high-level features, which is critical in robotics: this balance enables the model to excel both in control tasks and in those requiring deeper semantic understanding, offering a comprehensive solution for robotic vision and manipulation.

Voltron stands as a groundbreaking approach in the field of robotics, offering a language-driven, versatile, and efficient approach to learning and manipulation. Its ability to seamlessly integrate visual and linguistic data makes it a potent tool in the ever-evolving landscape of robotic technology, with potential applications that extend far beyond current capabilities. Interestingly, the authors show that Voltron does not beat R3M off the shelf, but only when trained on similar amounts of data. Nevertheless, Voltron's success in diverse tasks and environments heralds a new era in robotic manipulation, where language and vision coalesce to create more intelligent, adaptable, and capable robotic systems.

On the note of applying AL to RL settings, many recent papers have attempted to extend these ideas to modern RL environments. For example, the paper “When to Ask for Help” (Xie et al. 2022) examines the intersection of autonomous RL and AL. Instead of expecting an RL agent to solve a task fully autonomously, the paper's key insight is to assume that an agent can get stuck and needs human input to get “unstuck”. More generally, recent robotics literature has emphasized not just blindly using demonstration data as a form of human input, but actively querying a human and using the responses to better synthesize correct actions.

AL holds promise for enhancing AI models in real-world scenarios, yet several challenges persist. This discussion aims to provide an overview of these challenges.

Task-Specific Considerations: For certain tasks, the input space of a model may contain rare yet extremely important pockets that may never be discovered by AL, causing severe blind spots in the model. In medical imaging, for instance, there can be rare yet critical diseases. AL strategies for medical image analysis must therefore prioritize rare classes, such as various forms of cancer. Oftentimes, collecting data around those rare classes is not recommended by the AL acquisition criterion, because such examples constitute heavy distribution drift from the input distribution the model has seen.

Complex Task Adaptation: AL has predominantly been adopted for simple classification tasks, leaving other types of tasks (generative ones, for instance) less explored. In natural language processing, tasks like natural language inference and question answering pose additional complexities that affect the direct application of AL. While machine translation has seen AL applications, generation tasks in NLP require more thorough exploration. Challenges also arise in obtaining unlabeled data, particularly for tasks with intricate inputs.

Unsupervised and Semi-Supervised Approaches: In the presence of large datasets without sufficient labels, unsupervised and semi-supervised approaches become crucial. These methods offer a means to extract information without relying on labeled data for every data point, potentially revolutionizing fields like medical image analysis. There is an ongoing need for methods that combine self/semi-supervised learning with AL.

Algorithm Scalability: Scalability is a critical concern for online AL algorithms, particularly when dealing with large datasets and high-velocity data streams. The computational demands of AL can become prohibitive as data volume increases, posing challenges for practical deployment. Issues of catastrophic forgetting and model plasticity further complicate scalability, requiring careful consideration in algorithm design.

Labeling Quality Assurance: The effectiveness of most online AL strategies hinges on the quality of labeled data. Ensuring labeling accuracy in real-world scenarios is challenging, with human annotators prone to errors, biases, and diverse interpretations. Addressing imperfections in labeling through considerations of oracle imperfections becomes essential in real-life AL applications. Solutions for cleaning up data and verifying its quality need to be more aggressively pursued.

Data Drift Challenges: Real-world settings introduce data drift, where distributions shift over time, challenging models to adapt for accurate predictions. These shifts can impact the quality of labeled data acquired in the AL process. For example, the criterion or proxy used for selecting informative instances may be thrown off when the distribution a model is trained on, and the distribution we want it to perform well on, are too far away from one another.

Evaluation in Real-Life Scenarios: While AL methods are often evaluated assuming access to ground-truth labels, the real motivation for AL lies in label scarcity. Assessing the effectiveness of AL strategies becomes challenging in real-life scenarios where ground-truth labels may be limited. In other words, one may verify the goodness of an AL algorithm within the lab, but once the algorithm is deployed for improving all sorts of models on all sorts of data distributions, verifying whether AL is actually improving a model is tricky, especially when collecting and labeling data from the target distribution is expensive and defeats the purpose of using AL in the first place.

By systematically addressing these challenges, the field of AL in AI can progress towards more effective and practical applications. In summary, AL is a promising modern approach to model training that presents clear potential benefits. As was mentioned at the start, there are numerous strategies that AL can employ, ranging from reducing the error of a model's predictions and reducing its variance to conformal prediction. The flavor of AL heavily depends on the application, which may include robotics, LLMs, autonomous vehicles, and more. We discussed in more detail how to perform AL for variance reduction when predicting the kinematics of robotic arms, which showed a decrease in MSE as well as a more stable reduction over time. Next, we discussed using AL to reduce the number of comparisons required to create a ranking of objects; the examples discussed achieved this, but with some loss in prediction accuracy. Finally, we discussed how AL can be used to model reward functions within a dynamical system, which demonstrated improvements in both performance and the time required to achieve it. For a more hands-on experience with AL and a worked example, we encourage readers to explore the blog post by Max Halford (Halford 2023).

References

Astudillo, Raul, Zhiyuan Jerry Lin, Eytan Bakshy, and Peter I. Frazier. 2023. “qEUBO: A Decision-Theoretic Acquisition Function for Preferential Bayesian Optimization.” https://arxiv.org/abs/2303.15746.
Bastani, Hamsa, and Mohsen Bayati. 2020. “Online Decision Making with High-Dimensional Covariates.” Operations Research 68 (1): 276–94. https://doi.org/10.1287/opre.2019.1902.
Bommasani, Rishi, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, et al. 2022. “On the Opportunities and Risks of Foundation Models.” https://arxiv.org/abs/2108.07258.
Bouneffouf, Djallel, Amel Bouzeghoub, and Alda Lopes Gançarski. 2012. “A Contextual-Bandit Algorithm for Mobile Context-Aware Recommender System.” In Neural Information Processing, edited by Tingwen Huang, Zhigang Zeng, Chuandong Li, and Chi Sing Leung, 324–31. Berlin, Heidelberg: Springer Berlin Heidelberg.
Bouneffouf, Djallel, Irina Rish, and Charu Aggarwal. 2020. “Survey on Applications of Multi-Armed and Contextual Bandits.” In 2020 IEEE Congress on Evolutionary Computation (CEC), 1–8. Glasgow, United Kingdom: IEEE Press. https://doi.org/10.1109/CEC48606.2020.9185782.
Bouneffouf, Djallel, Irina Rish, and Guillermo A. Cecchi. 2017. “Bandit Models of Human Behavior: Reward Processing in Mental Disorders.” In Artificial General Intelligence, edited by Tom Everitt, Ben Goertzel, and Alexey Potapov, 237–48. Cham: Springer International Publishing.
Brohan, Anthony, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, et al. 2023. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” https://arxiv.org/abs/2307.15818.
Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. “ImageNet: A Large-Scale Hierarchical Image Database.” In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–55. IEEE.
Ding, Kaize, Jundong Li, and Huan Liu. 2019. “Interactive Anomaly Detection on Attributed Networks.” In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 357–65. WSDM ’19. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3289600.3290964.
Grauman, Kristen, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, et al. 2022. “Ego4D: Around the World in 3,000 Hours of Egocentric Video.” https://arxiv.org/abs/2110.07058.
Halford, Max. 2023. “Online Active Learning in 80 Lines of Python.”
He, Kaiming, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. “Momentum Contrast for Unsupervised Visual Representation Learning.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–38. IEEE.
Huo, Xiaoguang, and Feng Fu. 2017. “Risk-Aware Multi-Armed Bandit Problem with Application to Portfolio Selection.” Royal Society Open Science 4 (November). https://doi.org/10.1098/rsos.171377.
Kahneman, Daniel, and Amos Tversky. 1979. “Prospect Theory: An Analysis of Decision Under Risk.” Econometrica 47 (2). https://doi.org/10.2307/1914185.
Karamcheti, Siddharth, Suraj Nair, Annie S. Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. 2023. “Language-Driven Representation Learning for Robotics.” https://arxiv.org/abs/2302.12766.
Liu, Bing, Tong Yu, Ian Lane, and Ole J. Mengshoel. 2018. “Customized Nonlinear Bandits for Online Response Selection in Neural Conversation Models.” In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence. AAAI’18/IAAI’18/EAAI’18. New Orleans, Louisiana, USA: AAAI Press.
Misra, Kanishka, Eric M. Schwartz, and Jacob Abernethy. 2019. “Dynamic Online Pricing with Incomplete Information Using Multiarmed Bandit Experiments.” Marketing Science 38 (2): 226–52. https://doi.org/10.1287/mksc.2018.1129.
Nair, Suraj, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. 2022. “R3M: A Universal Visual Representation for Robot Manipulation.” https://arxiv.org/abs/2203.12601.
Perez, Julien, and Tomi Silander. 2018. “Contextual Memory Bandit for Pro-Active Dialog Engagement.” https://openreview.net/forum?id=SJiHOSeR-.
Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. 2021. “Learning Transferable Visual Models from Natural Language Supervision.” arXiv Preprint arXiv:2103.00020.
Shen, Weiwei, Jun Wang, Yu-Gang Jiang, and Hongyuan Zha. 2015. “Portfolio Choices with Orthogonal Bandit Learning.” In Proceedings of the 24th International Conference on Artificial Intelligence, 974–80. IJCAI’15. Buenos Aires, Argentina: AAAI Press.
Sui, Yanan, Masrour Zoghi, Katja Hofmann, and Yisong Yue. 2018. “Advancements in Dueling Bandits.” Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. https://doi.org/10.24963/ijcai.2018/776.
Upadhyay, Sohini, Mayank Agarwal, Djallel Bouneffouf, and Yasaman Khazaeni. 2019. “A Bandit Approach to Posterior Dialog Orchestration Under a Budget.”
Walke, Homer, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, et al. 2023. “BridgeData V2: A Dataset for Robot Learning at Scale.” https://arxiv.org/abs/2308.12952.
Wu, Jian, and Peter I. Frazier. 2018. “The Parallel Knowledge Gradient Method for Batch Bayesian Optimization.” https://arxiv.org/abs/1606.04414.
Xiao, Tete, Ilija Radosavovic, Trevor Darrell, and Jitendra Malik. 2022. “Masked Visual Pre-Training for Motor Control.” https://arxiv.org/abs/2203.06173.
Xie, Annie, Fahim Tajwar, Archit Sharma, and Chelsea Finn. 2022. “When to Ask for Help: Proactive Interventions in Autonomous Reinforcement Learning.” https://arxiv.org/abs/2210.10765.
Xu, Wenjie, Wenbin Wang, Yuning Jiang, Bratislav Svetozarevic, and Colin N. Jones. 2024. “Principled Preferential Bayesian Optimization.” https://arxiv.org/abs/2402.05367.
Yue, Yisong, Josef Broder, Robert Kleinberg, and Thorsten Joachims. 2012. “The k-Armed Dueling Bandits Problem.” Journal of Computer and System Sciences 78 (5): 1538–56. https://doi.org/10.1016/j.jcss.2011.12.028.
Yue, Yisong, and Thorsten Joachims. 2009. “Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem.” Proceedings of the 26th Annual International Conference on Machine Learning. https://doi.org/10.1145/1553374.1553527.
Zhang, Tong. 2021. “Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement Learning.” CoRR abs/2110.00871. https://arxiv.org/abs/2110.00871.
Zhou, Qian, XiaoFang Zhang, Jin Xu, and Bin Liang. 2017. “Large-Scale Bandit Approaches for Recommender Systems.” In Neural Information Processing, edited by Derong Liu, Shengli Xie, Yuanqing Li, Dongbin Zhao, and El-Sayed M. El-Alfy, 811–21. Cham: Springer International Publishing.