3  Elicitation


3.1 The Active Learning Problem

Acquiring labeled data is expensive. Active learning (AL) is a learning paradigm that aims to reduce the amount of labeled data required to train a model to high accuracy. AL algorithms iteratively select an input datapoint for an oracle (e.g., a human annotator) to label such that, once the label is observed, the model improves the most. The two primary setups in AL are pool-based and stream-based. In pool-based AL, the model selects samples from a large unlabeled pool of data; for example, a text classification model selects the most uncertain texts from a large pool and asks a human annotator to label them. In stream-based AL, the model receives samples sequentially (one sample at a time) and decides whether to label each one; a sample that is not labeled is discarded. In either setup, a model is trained on the current dataset, and a set of candidate points is evaluated for potential inclusion. AL selects one of these points to add to the dataset based on an “acquisition function” defined with respect to the current model, which estimates the value of each candidate point for improving model performance. The dataset is updated with the newly queried point, and the cycle repeats until the budget is exhausted or a predefined reliability criterion is met.

AL has successfully enhanced various real-world systems. For example, AL can improve the computer vision models used in autonomous vehicles (Jarl et al. 2021), where probing a model to understand what type of data it would benefit from is more practical than labeling everything. In robotics, autonomous agents may query humans when unsure how to act in new situations (Taylor, Berrueta, and Murphey 2021). Here, collecting data often incurs significant financial and time costs because a physical robot arm wears out over time. In meteorology, AL can help decide where to place additional sensors for weather prediction (Singh, Nowak, and Ramanathan 2006). Sensor placement involves deploying teams to remote locations and expensive construction for a single extra data point, so choosing these locations and allocating resources wisely is of interest to governments and businesses. AL can also be employed to select data for fine-tuning large language models (LLMs) for specific downstream tasks (Margatina et al. 2023). Here, it might be difficult to fully describe a targeted NLP task; instead of defining the task via a dataset of examples, it may be easier for a human to interact with the LLM for a specific use case, identify gaps in the model, and address those gaps using AL.

Typically, in robotics, robots learn by observing human demonstrations. However, expert demonstrations are often limited, and training a supervised learning model would require vast amounts of demonstration data, which is difficult to obtain at scale. Demonstrations also tend to be variable, reflecting the actions of individual humans, making the data collection process inconsistent. To address these limitations, alternative approaches have been proposed, such as pairwise comparisons, where humans evaluate two action trajectories and determine the superior one, or physical corrections, where reward functions are learned through human-robot interaction, with humans guiding the robot’s actions during the task. AL algorithms can be employed in such preference learning tasks, where the objective is to develop a model that aligns with human preferences while minimizing the need for extensive labeled data and reducing the high cost of annotation.

Motivated by the pairwise preference setting, we consider a binary classification problem. The model is trained on a small labeled dataset \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N\), where \(x_i\) represents the input data and \(y_i\) is the corresponding label. The model is uncertain about the class labels of some data points and can query an oracle to obtain their true labels. The goal is to minimize the number of queries to the oracle while maximizing the model’s performance. Here, the value of a datapoint lies in how much it helps identify the underlying model, and this notion of informativeness is often quantified with uncertainty. Two primary types of uncertainty are considered: epistemic and aleatoric. Epistemic uncertainty, or model uncertainty, arises from a lack of knowledge and can be reduced by acquiring more data. This type of uncertainty is especially significant when the model lacks confidence due to insufficient or incomplete information in its training set. Aleatoric uncertainty, or data uncertainty, stems from inherent randomness in the data itself. Unlike epistemic uncertainty, aleatoric uncertainty cannot be reduced even with additional data, as it reflects noise or unpredictability in the real data-generating process. AL often focuses on selecting data that reduce epistemic uncertainty.

There are several methods for quantifying model uncertainty. Bayesian methods, such as Bayesian neural networks and Gaussian processes, offer a principled way of estimating uncertainty via the posterior distribution over parameters, obtained by updating a prior distribution over models as data arrive. Exact posterior computation can be computationally prohibitive, especially for complex likelihood functions, and approximate Bayesian methods have been proposed to address this. Ensemble methods, for example, train multiple models and combine their predictions to provide an estimate of uncertainty; they are relatively easy to implement but noisy and still somewhat expensive. Conformal prediction methods also provide a framework for estimating uncertainty, offering a measure of confidence in predictions based on the conformity of a given instance with the training data.

3.2 Estimating the Value of Additional Data with Acquisition Functions

Uncertainty quantification plays a vital role in acquisition functions, which are central to AL strategies. These functions determine which samples are most valuable to label by evaluating their utility based on the model’s current uncertainty estimates. Common acquisition functions include uncertainty sampling (Zhu et al. 2010), which selects samples the model is least confident about, query-by-committee (Beluch et al. 2018), which utilizes a set of models to choose the most uncertain samples, and Bayesian AL by Disagreement (BALD) (Houlsby et al. 2011), which selects samples that maximize information gain by reducing model uncertainty. Through careful uncertainty quantification, acquisition functions guide the AL process, improving the model’s efficiency in learning from limited data. Other acquisition functions that can be employed include:

  • Expected model change (Cai, Zhang, and Zhou 2013): This approach focuses on labeling points that would have the most impact on changing the current model parameters.

  • Expected error reduction (Mussmann et al. 2022): Points that would most effectively reduce the model’s generalization error are labeled using this strategy.

  • Variance reduction (Cohn, Ghahramani, and Jordan 1996): This approach labels points that would minimize output variance, which is one component of error. By selecting points that reduce variability in the model’s predictions, it aims to improve overall performance.

Uncertainty sampling (Zhu et al. 2010) selects data points for which the model exhibits the greatest uncertainty, focusing labeling effort on ambiguous samples where additional information is likely to yield the greatest benefit. Several acquisition strategies fall under uncertainty sampling, including entropy sampling, margin sampling, and least confidence sampling. Entropy sampling measures the value of additional data by the entropy of the predicted probability distribution: \(\alpha(x) = - \sum_{y} p(y|x) \log p(y|x)\). Margin sampling considers the difference between the two highest predicted probabilities for a sample: \(\alpha(x) = p(y_1|x) - p(y_2|x)\), where \(y_1\) and \(y_2\) are the two most likely classes and a smaller margin indicates greater uncertainty. Least confidence sampling measures the value of additional data by the complement of the predicted probability of the most likely class: \(\alpha(x) = 1 - p(y_{\text{max}}|x)\), where \(y_{\text{max}}\) is the class with the highest probability. Consider a binary classification problem with three candidates \(x_1, x_2, x_3\). The code below demonstrates that all three uncertainty sampling methods reach the same conclusion of selecting \(x_1\).

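The sketch below uses hypothetical predicted probabilities for the three candidates; the values are illustrative assumptions rather than outputs of any particular model.

import numpy as np

# Hypothetical predicted class probabilities p(y|x) for three candidates (binary problem)
probs = {
    "x1": np.array([0.50, 0.50]),  # most ambiguous
    "x2": np.array([0.70, 0.30]),
    "x3": np.array([0.90, 0.10]),  # most confident
}

def entropy(p):
    return -np.sum(p * np.log(p))

def margin(p):
    top2 = np.sort(p)[::-1][:2]
    return top2[0] - top2[1]          # smaller margin = more uncertain

def least_confidence(p):
    return 1.0 - np.max(p)            # larger value = more uncertain

# Entropy and least confidence: pick the maximizer; margin: pick the minimizer
print(max(probs, key=lambda k: entropy(probs[k])))           # x1
print(min(probs, key=lambda k: margin(probs[k])))            # x1
print(max(probs, key=lambda k: least_confidence(probs[k])))  # x1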

Query-by-committee (Beluch et al. 2018) selects samples for labeling based on the level of disagreement among members of a committee of models. Several acquisition functions can be used to quantify this disagreement. Vote entropy measures uncertainty based on how often the committee members vote for each class: \(\alpha(x) = \mathbb{H}\left[V(y)/C\right]\), where \(V(y)\) is the number of votes for class \(y\) and \(C\) is the number of committee members. Consensus entropy measures the entropy of the average probability distribution across committee members: \(\alpha(x) = \mathbb{H}[p_C(y|x)]\), where \(p_C(y|x)\) is the average probability distribution for sample \(x\) across all committee members. The KL divergence quantifies disagreement by comparing the probability distribution of each committee member to the consensus distribution: \(\alpha(x) = \frac{1}{C} \sum_{c=1}^{C} D_{KL}[p_c(y|x) \,||\, p_C(y|x)]\), where \(p_c(y|x)\) is the probability distribution of committee member \(c\) and \(p_C(y|x)\) is the average distribution across the committee. As an example, consider a binary classification problem with three candidates \(x_1\), \(x_2\), and \(x_3\) and three committee members. The numerical results below show that all three acquisition functions select \(x_1\).

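A minimal sketch with hypothetical committee predictions (three members, two classes); the probability values are assumptions chosen so that the committee disagrees most on \(x_1\).

import numpy as np

# Hypothetical committee predictions: rows = committee members, columns = class probabilities
committee = {
    "x1": np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]),   # strong disagreement
    "x2": np.array([[0.8, 0.2], [0.9, 0.1], [0.7, 0.3]]),
    "x3": np.array([[0.6, 0.4], [0.7, 0.3], [0.65, 0.35]]),
}

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def vote_entropy(preds):
    votes = np.bincount(np.argmax(preds, axis=1), minlength=preds.shape[1])
    return entropy(votes / preds.shape[0])

def consensus_entropy(preds):
    return entropy(preds.mean(axis=0))

def mean_kl_to_consensus(preds):
    consensus = preds.mean(axis=0)
    return np.mean([np.sum(p * np.log(p / consensus)) for p in preds])

for name, score in [("vote entropy", vote_entropy),
                    ("consensus entropy", consensus_entropy),
                    ("mean KL to consensus", mean_kl_to_consensus)]:
    best = max(committee, key=lambda k: score(committee[k]))
    print(name, "->", best)   # all three print x1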

Bayesian AL by Disagreement (BALD) (Houlsby et al. 2011) selects the samples for which the model expects to gain the most Shannon information when corresponding labels are observed:

\[ \begin{aligned} &\mathbb{I}(\theta; y|x, \mathcal{D}) = \mathbb{H}[p(y|x, \mathcal{D})] - \mathbb{E}_{p(\theta | \mathcal{D})} [\mathbb{H}[p(y|x, \theta, \mathcal{D})]] \\ &\mathbb{H}[p(y|x, \mathcal{D})] = \mathbb{H}\left[\int_{\theta} p(y|x, \theta, \mathcal{D}) p(\theta | \mathcal{D}) d\theta\right] \approx \mathbb{H}\left[\frac{1}{N}\sum_{i=1}^{N} p(y|x, \theta_i, \mathcal{D})\right] = \mathbb{H}\left[\overline{p}(y|x, \mathcal{D})\right] \\ &\mathbb{E}_{p(\theta|\mathcal{D})} [\mathbb{H}[p(y|x, \theta, \mathcal{D})]] = \mathbb{E}_{p(\theta|\mathcal{D})} \left[ - \sum_{y} p(y|x, \theta, \mathcal{D}) \log p(y|x, \theta, \mathcal{D}) \right] \approx - \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{y} p(y|x, \theta_i, \mathcal{D}) \log p(y|x, \theta_i, \mathcal{D}) \right) \end{aligned} \]

When there is significant disagreement among models, the predictive entropy (the first term) will be large, while the expected entropy (the second term) will be smaller. This difference represents the degree to which the models disagree. BALD selects points where this disagreement is maximized. As an example, consider a binary classification problem with two classes, \(y_1\) and \(y_2\). We have two samples, \(x_1\) and \(x_2\). BALD selects \(x_1\) for labeling.

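A toy sketch of BALD using two hypothetical posterior samples of the model parameters; the per-sample predictive probabilities are assumed for illustration. For \(x_1\) the samples disagree sharply, so the mutual information is large, while for \(x_2\) both samples are maximally but identically uncertain, so the mutual information is zero.

import numpy as np

# Hypothetical predictions p(y | x, theta_i) from two posterior samples theta_1, theta_2
preds = {
    "x1": np.array([[0.9, 0.1], [0.1, 0.9]]),  # models disagree
    "x2": np.array([[0.5, 0.5], [0.5, 0.5]]),  # models agree but are individually uncertain
}

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def bald(p_samples):
    predictive_entropy = entropy(p_samples.mean(axis=0))          # H[ E_theta p(y|x,theta) ]
    expected_entropy = np.mean([entropy(p) for p in p_samples])   # E_theta H[ p(y|x,theta) ]
    return predictive_entropy - expected_entropy

scores = {k: bald(v) for k, v in preds.items()}
print(scores)                          # x1 ~ 0.368, x2 = 0.0
print(max(scores, key=scores.get))     # x1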

AL by Variance Reduction (Cohn, Ghahramani, and Jordan 1996) is an algorithm designed to select the next data point for labeling based on the anticipated reduction in the model’s variance. The objective is to identify the point \(x \sim p(x)\) that, when labeled \(y_x\), will most effectively decrease the model’s variance. The expected error at a given input \(x\) is \(\mathbb{E}_{\hat{y} \sim p(\hat{y} | \mathcal{D}; x), y \sim p(y|x)} (\hat{y} - y)^2\), where \(\hat{y}\) represents the model’s prediction and \(y\) denotes the true label at \(x\). Using the bias-variance decomposition (Geman, Bienenstock, and Doursat 1992), the expected error is decomposed as \[\begin{aligned} \mathbb{E} (\hat{y} - y)^2 = \mathbb{E}[(\hat{y} - \mathbb{E}[y|x]) + (\mathbb{E}[y|x] - y)]^2 = \mathbb{E} [(y - \mathbb{E}[y|x])^2] + 2\mathbb{E} [(\hat{y} - \mathbb{E}[y|x])(\mathbb{E}[y|x] - y)] + \mathbb{E}(\hat{y} - \mathbb{E}[y|x])^2 \end{aligned}\] where the expectation is taken over \(\hat{y} \sim p(\hat{y} | \mathcal{D}; x), y \sim p(y|x)\). The first term represents the variance of the true label \(y\); the second (cross) term evaluates to zero because \(\hat{y}\) and \(y\) are independent given \(x\) and \(\mathbb{E}_{y}[\mathbb{E}[y|x] - y] = 0\); the third term accounts for the variance of the model’s prediction \(\hat{y}\): \[\mathbb{E}(\hat{y} - \mathbb{E}[y|x])^2 = \mathbb{E}[(\hat{y} - \mathbb{E}[\hat{y}] + \mathbb{E}[\hat{y}] - \mathbb{E}[y|x])^2] = \mathbb{E}[(\hat{y} - \mathbb{E}[\hat{y}])^2] + (\mathbb{E}[\hat{y}] - \mathbb{E}[y|x])^2\]

Hence, \[\mathbb{E} (\hat{y} - y)^2 = \mathbb{E}_{y} [(y - \mathbb{E}[y|x])^2] + (\mathbb{E}_{\hat{y}} [\hat{y} - \mathbb{E}[y|x]] )^2 + \mathbb{E}_{\hat{y}} [(\hat{y} - \mathbb{E}_{\hat{y}}[\hat{y}])^2]\]

Here, the first term is the variance of the true label, which remains constant for a given \(x\). The second term captures how much the average model prediction deviates from the expected true label. The third term quantifies the model’s uncertainty at \(x\). Cohn, Ghahramani, and Jordan (1996) denote the uncertainty term as \(\sigma^2_{\hat{y}} (x | \mathcal{D}) = \mathbb{E}_{\hat{y}} [(\hat{y} - \mathbb{E}_{\hat{y}}[\hat{y}])^2]\). The acquisition function is \(\mathbb{E}_{p(x)} [\sigma^2_{\hat{y}} (x | \tilde{\mathcal{D}})]\), the expected variance after adding a candidate point to the training set. One can rely on an empirical measure, such as the loss on held-out labeled data, to gauge model improvement and decide when to terminate data acquisition; the relationship between dataset size and loss is tied to model complexity. To evaluate the performance of the variance reduction strategy, Cohn, Ghahramani, and Jordan (1996) study the Arm2D problem, a kinematics problem in which the learner has to predict the tip position of a robotic arm given the joint angles \(\theta_1, \theta_2\). The two models compared are a Gaussian mixture model and locally weighted regression (LOESS). The results show that the variance of the learner decreases, as expected, because points are selected to minimize expected variance; the mean squared error (MSE) of both models also decreases as the dataset grows. This is a notable outcome because the expected learner variance for these models can be computed accurately and efficiently for a new candidate point, so integrating it into the general AL loop significantly enhances model performance. In the case of the locally weighted regression model, the MSE under random point selection is highly unstable, with sharp fluctuations, whereas under AL by variance reduction, using expected learner variance as a proxy, the MSE decreases almost smoothly aside from some initial instability.
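As a minimal sketch of the variance reduction criterion, the code below uses a Bayesian linear regression model rather than the mixture or LOESS learners studied in the paper; for this model the learner variance \(\sigma^2_{\hat{y}}(x \mid \tilde{\mathcal{D}})\) depends only on the input locations, so the expected variance after hypothetically adding a candidate point can be computed in closed form. The prior precision, noise precision, and the training, candidate, and reference points are all illustrative assumptions.

import numpy as np

# Bayesian linear regression: y = w . phi(x) + noise, w ~ N(0, alpha^{-1} I), noise ~ N(0, beta^{-1})
alpha, beta = 1.0, 25.0                      # assumed prior precision and noise precision

def phi(x):
    return np.array([1.0, x])                # simple linear features

X_train = np.array([0.1, 0.2, 0.3])          # current (small) labeled inputs
X_pool = np.array([0.35, 0.9, 1.5])          # candidate points to query
X_ref = np.linspace(0.0, 2.0, 50)            # reference points approximating p(x)

def posterior_cov(X):
    Phi = np.stack([phi(x) for x in X])
    return np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)

def expected_variance(X):
    # E_{p(x)}[ sigma^2_yhat(x | D) ], approximated by an average over the reference points
    S = posterior_cov(X)
    return np.mean([phi(x) @ S @ phi(x) for x in X_ref])

# Choose the candidate whose hypothetical addition minimizes the expected model variance
scores = {x: expected_variance(np.append(X_train, x)) for x in X_pool}
print(scores)
print("query:", min(scores, key=scores.get))  # the far-away candidate reduces variance most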

3.3 Active Preference Learning with Ideal Point Model

For any \(n\) elements to be ranked, there are \(n!\) possible orderings. Given that comparison-based sorting requires on the order of \(n\log n\) comparisons, obtaining a guaranteed correct ranking over \(n\) items requires roughly \(n\log n\) pairwise comparisons if those comparisons are chosen at random. This number can be quite high and costly in many applications, especially since most ranking information comes from humans: the more comparisons they have to make, the more money and time is spent. The process can also be inefficient, as some comparisons provide more value to the learning process than others, making some comparisons a waste. This inefficiency can be detrimental in fields like psychology and market research, where comparisons are heavily used and a faster process could offer significant benefits. The \(n\log n\) bound on the number of comparisons assumes no prior information about the underlying space, so comparisons are chosen at random. However, leveraging the structure of the comparison space can reveal which comparisons are most valuable. For example, G. and Nowak (2011) discuss how eye doctors have a wide range of options when assigning prescriptions for glasses, yet patients do not see them making many comparisons before deciding on the best option; eye doctors incorporate domain knowledge into the process and only ask patients for comparisons when necessary. Applying similar knowledge to ranking leads to an AL approach that selects data based on the relevance of a comparison query toward finding the final ranking \(\sigma(\Theta)\).

G. and Nowak (2011) explores AL within data that can be embedded in a \(d\)-dimensional embedding space, where comparisons between two different items divide the space into halves, with one object being superior in each half. By leveraging such geometry, the paper develops a geometric AL approach. Let \(\theta\) be the item representation in the embedding space. For each ranking \(\sigma\), there is a reference point \(r_{\sigma} \in \mathbb{R}^d\), such that if \(\theta_{i} \succ \theta_{j}\), \(||\theta_i - r_{\sigma}|| < ||\theta_j - r_{\sigma}||\). In other words, object \(i\) is closer to the reference point \(r_{\sigma}\) than object \(j\). \(\Sigma_{n,d}\) is the set of all possible rankings of the \(n\) items that satisfy the above embedding distances condition. Not all rankings will satisfy the embedding conditions, but multiple rankings might satisfy all those conditions. For every ranking \(\sigma\), there is \(M_n(\sigma)\), the number of pairwise comparisons needed to identify the ranking. When comparisons are done at random, \(\mathbb{E}[M_n(\sigma)] = n\log n\), and it can be reduced by incorporating geometry. \(q_{i,j}\) is the query of comparison between items \(i\) and \(j\).

As an example, G. and Nowak (2011) study a 2D space with three items: \(\theta_1\), \(\theta_2\), and \(\theta_3\). There are pairwise queries \(q_{1,3}\), \(q_{2,3}\), and \(q_{1,2}\) between them, denoted by solid lines equidistant from the two items they compare. These lines split the \(\mathbb{R}^2\) space into halves, with each half closer to one of the two items. The paper colors the side of the worse object for each query in dark grey and takes the intersection of these halves, resulting in the dark grey region in the paper’s figure. This region indicates \(\Sigma_{n,2}\), since all of its points satisfy the embedding conditions. Specifically, for every point \(r\) in the dark grey area, \(||\theta_3 - r|| < ||\theta_2 - r|| < ||\theta_1 - r||\), meaning \(\theta_3 \prec \theta_2 \prec \theta_1\). Thus, every point \(r\) is one of the \(r_\sigma\) representing their respective rankings \(\sigma \in \Sigma_{n,2}\). In other words, the paper aims to have the reference points and dark grey region closest to the worst object and furthest from the best object.

The authors also denote the label for each query \(q_{i,j}\) as \(y_{i,j} = \mathbb{1}\{q_{i,j}\}\) (for example, \(y_{1,2} = 0, y_{3,2} = 1\)). This allows for deciding how to label new queries represented by dashed and dotted lines, depending on which items each query compares. Focusing on the dotted line, called \(q_{i,4}\) with \(i \in \{1,2,3\}\), and considering potential locations of \(\theta_4\), the line must be equidistant from one of the three items in the picture and \(\theta_4\), meaning \(\theta_4\) can be placed in three different locations. If the query performed is \(q_{2,4}\), then \(\theta_4\) will be closer to the dark grey area than \(\theta_2\), thus \(y_{2,4} = 0\). However, if \(q_{1,4}\) or \(q_{3,4}\) are performed, \(\theta_4\) will be further from the dark grey area than \(\theta_1\) or \(\theta_3\), meaning \(y_{1,4} = y_{3,4} = 1\). In this case, the labels are contradictory and depend on which object they are compared with, making such a query \(q_{i,4}\) ambiguous.

In contrast, the authors analyze the dashed line, called \(q_{i,5}\) with \(i \in \{1,2,3\}\), and consider potential locations of \(\theta_5\). Since the line must be equidistant from one of the three items in the picture and \(\theta_5\), it can be placed in three different locations. If one of the three potential queries is performed, \(\theta_5\) will be closer to the dark grey area than \(\theta_1\), \(\theta_2\), and \(\theta_3\), meaning \(y_{1,5} = y_{2,5} = y_{3,5} = 0\). In this case, all labels are the same regardless of which object is used, meaning such a query will not be contradictory, as all agree on the label. The goal is to perform as many ambiguous queries as possible and skip non-ambiguous queries to decrease the total \(M_n(\sigma)\). Intuitively, if there is contradictory information about a query, it needs to be performed so that a human can clarify its direction. Conversely, if all sources of information from the domain space agree on the query’s label, that information can be used without asking a human, incorporating the knowledge of the embedding distances. Lastly, to consider the general case of the \(\mathbb{R}^d\) space, rather than discussing halves of the image, it is essential to discuss half-spaces. Similarly, consider the half-space that assigns a label of \(1\) to the query and the half-space assigning a label of \(0\). If both half-spaces exist, they have conflicting information on the query, making the query ambiguous. However, if one of the half-spaces does not exist, it means the other is the full space, representing consistency in the label assignment and a non-ambiguous query.
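Below is a rough sketch of the ambiguity test, using sampled reference points rather than exact geometric reasoning: given previously answered comparisons, a new query \(q_{i,j}\) is ambiguous if reference points consistent with the existing answers disagree about which of the two items is closer. The 2D item embeddings and the set of answered comparisons are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D item embeddings
items = {1: np.array([0.0, 0.0]), 2: np.array([1.0, 0.0]),
         3: np.array([2.0, 0.5]), 4: np.array([1.2, 1.5])}

# Previously answered comparisons: (i, j) means item i is closer to the reference point than item j
answered = [(3, 1), (3, 2), (2, 1)]

# Sample candidate reference points and keep those consistent with all previous answers
candidates = rng.uniform(-1, 3, size=(20000, 2))
def consistent(r):
    return all(np.linalg.norm(items[i] - r) < np.linalg.norm(items[j] - r)
               for i, j in answered)
feasible = np.array([r for r in candidates if consistent(r)])

def ambiguous(i, j):
    # closer_i[k] is True if item i beats item j for feasible reference point k
    closer_i = (np.linalg.norm(items[i] - feasible, axis=1)
                < np.linalg.norm(items[j] - feasible, axis=1))
    return closer_i.any() and (~closer_i).any()   # both outcomes still possible

for j in [1, 2, 3]:
    print(f"q_({j},4) ambiguous:", ambiguous(j, 4))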

It is important to demonstrate that the number of comparisons decreases. Specifically, G. and Nowak (2011) show that this algorithm achieves \(\mathbb{E}[M_n(\sigma)] = O(d\log n)\), where \(d\) is the dimension of the space and \(d < n\), which improves on the \(O(n\log n)\) baseline. The proof can be studied in detail in the paper itself, but at a high level, it starts by reasoning about the probability of a query being ambiguous and a comparison being requested from a human, writing \(M_n = \sum_{k=1}^{n-1}\sum_{i=1}^k \mathbb{1}\{\text{Request } q_{i,k+1}\}\). For that, the authors define \(Q(i,j)\), the number of different rankings that exist for \(i\) elements in \(j\)-dimensional space (e.g., \(Q(1,d) = 1\), \(Q(n,0) = 1\), \(Q(n,1) = n!\)), so that \(|\Sigma_{n,d}| = Q(n,d)\). Using recurrence relations for \(Q(i,j)\), the authors derive that \(|\Sigma_{n,d}| = Q(n,d) = O(n^{2d})\); the derivation is omitted here. Analogously, the authors define \(P(i,j)\), the number of rankings in \(\Sigma_{n,d}\) that will still be possible after a new element \(i+1\) is added to the ranked items. \(P(i,j)\) estimates how much of the dark grey area will still exist after making a query for \(i+1\): the dotted-line ambiguous query did not change the dark grey area at all (\(P(n,d) = Q(n,d)\)), whereas the dashed non-ambiguous query would cut a piece from it (\(P(n,d) < Q(n,d)\)). Thus, \(\mathbb{P}(\text{Request } q_{i,k+1}) = P(k,d) / Q(k,d)\), so a higher value indicates more possible rankings and an ambiguous query that needs to be requested to obtain more useful information. With this in mind, the authors derive that \(\mathbb{E}[M_n(\sigma)] = O(d\log n)\), showing that fewer queries are needed for effective ranking.

The issue with this algorithm is that only one human provides the answers to the requested queries, which means it does not account for their biases. An alternative approach is a Robust Query Selection Algorithm (RQSA) (G. and Nowak 2011), which uses majority voting for every query to indicate the ground truth of the query’s label. However, the authors consider that a group of people can still give incorrect or divided responses. If the votes for each answer are almost equal in number, the authors push that query to the end of the algorithm to see if it can become a non-ambiguous query with more information learned. If it does not, an odd number of voters is used to determine the final ranking.

Table 3.1: Statistics for the Robust Query Selection Algorithm (RQSA) (G. and Nowak 2011) and the baseline of conducting all comparisons. \(y\) serves as a noisy ground truth, \(\tilde{y}\) is the result of all comparisons, and \(\hat{y}\) is the output of the RQSA.
| | Dimension 2 | Dimension 3 |
|---|---|---|
| % of queries requested (mean) | 14.5 | 18.5 |
| % of queries requested (std) | 5.3 | 6 |
| Average error \(d(\tilde{y}, y)\) | 0.23 | 0.21 |
| Average error \(d(\hat{y}, y)\) | 0.31 | 0.29 |

With regard to the accuracy and performance of the method, the authors ran a ranking experiment on 100 different audio signals, the results of which can be seen in Table 3.1. The ground truth labels came from humans, indicated by \(y\) in the table, so the ground truth itself contains noise and potential errors, which can influence the performance of both the baseline algorithm that performs all comparisons (\(\tilde{y}\)) and the Robust Query Selection Algorithm (\(\hat{y}\)). In both 2- and 3-dimensional spaces, RQSA performed worse by about \(8\%\) compared to the baseline, which indicates that AL that uses domain information can still be erroneous because some inferred comparisons may not be entirely correct. However, as the upper part of Table 3.1 shows, significantly fewer queries were requested compared to the baseline, meaning the approach can offer a significant benefit at the cost of a slight loss in accuracy.

User Information as Domain Knowledge for Active Learning

An alternative source of domain knowledge is the users themselves, who can indicate their uncertainty when comparing two items. Prior studies have shown (Amershi et al. 2014) that when users are presented with only two options for which object is better but cannot properly decide, they become frustrated and tend to respond more erratically, creating noise and incorrect responses in the data. Through feedback and other studies (Guillory and Bilmes 2011), it was determined that offering users an explicit option of indifference between the two items can remove these problems. Moreover, in connection to AL, the authors show that such an option helps select more informative queries, since it provides more domain knowledge that can be used, resulting in a decrease in the number of queries required. For this problem, the following terms are defined:

  1. \(c\) - a cost function that represents user preferences, and the result the model has to determine at the end of training. The preferred items will have lower costs, and less preferred ones will have higher costs. The goal is to determine this function with the fewest possible number of queries using AL.

  2. \(H\) - a set of hypotheses over the possible cost functions, where for each \(h \in H\) there is a cost function \(c_h\) associated with it.

  3. \(h^*\) - a true hypothesis that the model needs to determine, which has cost \(c_{h^*}\) associated with it

  4. \(t(x,y)\) - a test performed to compare items \(x\) and \(y\) (the user is being asked to provide a response to which item is better). Those tests result in changes and adjustments to \(H\) as more information is learned.

  5. \(o(x,y)\) - observation or result of \(t(x,y)\), where \(o(x,y) \in \{x<y, x>y\}\)

  6. \(S = \{(t_1, o_1), (t_2, o_2),...,(t_m, o_m)\}\) - a sequence of \(m\) pairs of tests and observations

  7. \(w(H|S)\) - probability mass of all hypotheses that are still consistent with the observations (similar to the dark grey area and \(Q(i,j)\)). This means that if \(h \in H\) is inconsistent with user responses received, it is removed from \(H\).

With the key terms defined, consider the noiseless base setting where users have only two response options. These components will later be translated to the setting with a third option so that the true cost function can be determined there. \(w(H|S)\) is the sum of the weights of all hypotheses that are still consistent with the evidence: \(w(H|S) = \sum_{h \in H} w(h | S)\). Each \(w(h|S)\) is the probability of the observed evidence given that hypothesis: \(w(h|S) = p(S|h)\). This probability comes from the test-observation pairs, since they compose the set \(S\), and each test is independent of the others, which gives \(p(S|h) = \prod_{(t,o) \in S} p((t,o) | h)\). In the noiseless setting, users select the option that minimizes their cost function (selecting the more preferred item), defined mathematically as: \[\begin{aligned} p((t, o = x) | h) = \begin{cases} 1 & c_h(x) < c_h(y)\\ 0 & else \end{cases} \end{aligned}\]
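The sketch below illustrates the noiseless hypothesis-weighting computation on a small discrete hypothesis set; the item costs and the observed test-observation pairs are made up for illustration.

# Hypothetical hypothesis set: each hypothesis assigns a cost to items "a", "b", "c"
hypotheses = {
    "h1": {"a": 0.1, "b": 0.5, "c": 0.9},
    "h2": {"a": 0.9, "b": 0.5, "c": 0.1},
    "h3": {"a": 0.5, "b": 0.1, "c": 0.9},
}

# Observed test-observation pairs: (x, y, winner) means t(x, y) returned winner
S = [("a", "b", "a"), ("a", "c", "a")]

def p_obs_given_h(costs, x, y, winner):
    # noiseless model: probability 1 if the winner has strictly lower cost, else 0
    loser = y if winner == x else x
    return 1.0 if costs[winner] < costs[loser] else 0.0

def w_h_given_S(costs):
    prob = 1.0
    for x, y, winner in S:            # tests are independent given h
        prob *= p_obs_given_h(costs, x, y, winner)
    return prob

weights = {name: w_h_given_S(costs) for name, costs in hypotheses.items()}
print(weights)                        # only h1 remains consistent with the evidence
print("w(H|S) =", sum(weights.values()))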

Users are not perfect evaluators. Prior work (Amershi et al. 2014) has shown that treating users as perfect can lead to poor performance. This gave rise to accounting for noise in users’ responses, but most such work applies the same noise to all queries and all responses. While these approaches achieve strong performance (Guillory and Bilmes 2011), they do not accurately reflect the real world, which motivated the idea of query-dependent noise. Effectively, for some queries it is important to model the fact that the user is unsure and noisy, but for others, where the user is confident, noise in the response is not needed at all. For comparison-based learning, this means that the noise is related to the costs of the two items compared. Specifically, for items \(x\) and \(y\), if \(c_{h^*}(x) \simeq c_{h^*}(y)\), the items are hard to distinguish for the user, so it is preferable to model user uncertainty and noise; but if \(c_{h^*}(x) \gg c_{h^*}(y)\), the user will certainly select \(y\) (and vice versa), so noise is not needed. Query-dependent noise is also supported by the psychology literature, which suggests that such an approach better matches the real world. In particular, psychologists appeal to the Luce-Shepard choice rule (Shepard 1957) when discussing comparisons. This rule previously gave rise to a logistic noise model (Viappiani and Boutilier 2010) in which the probability of an observation for a given test is \(p((t, o = x) | h) \propto \exp(-\gamma \cdot c_h(x))\).

Figure 3.1: User response model in the noiseless setting
Figure 3.2: User response with the Luce-Shepard noise model

Figure 3.1 and Figure 3.2 demonstrate the difference between the noiseless setting and the Luce-Shepard choice rule. GBS is the baseline model with only two response options, and CLAUS is the model with the uncertainty option added. The figures show how incorporating such noise smooths the probability distribution of the user’s response.

We now discuss the functionality of CLAUS, an algorithm designed by Holladay et al. (2016) that allows users to give an uncertain response about the two options they need to rank. The authors model this uncertainty as \(\epsilon\), associated with each \(c_h\), so every hypothesis \(h\) is now defined over a pair \((c_h, \epsilon_h)\). It is important to note that the goal is still to learn \(c\); \(\epsilon\) is only needed to model the users’ responses. The uncertainty relates to the cost function as \(|c_h(x) - c_h(y)| < \epsilon_h\), meaning the user is uncertain between items \(x\) and \(y\) because their cost difference is negligible, so the user is not able to select which item is better. This in turn gives more information about the real values of the two items, since forcing a binary response would indicate a preference toward one item that is not real and would skew the cost functions. This leads to the following modifications of the problem setup:

  1. For test \(t(x,y)\) the observation will be \(o(x,y) \in \{x<y, x>y, \tilde{xy}\}\), where \(\tilde{xy}\) is the uncertain response.

  2. The probability distribution over the user’s response (the noiseless model above) will now be defined as:

\[\begin{aligned} p((t, o = x) | h) = \begin{cases} 1 & c_h(x) < c_h(y) - \epsilon_h\\ 0 & else \end{cases}, \quad p((t, o = \tilde{xy}) | h) = \begin{cases} 1 & |c_h(x) - c_h(y)|^2 < \epsilon_h^2\\ 0 & else \end{cases} \end{aligned}\]

This means the user confidently selects \(x\) when it is better than \(y\) by more than \(\epsilon_h\), but if the squared difference of the two items’ costs is smaller than \(\epsilon_h^2\), the user chooses the indifferent option.

  3. Finally, this also updates the noise model (a small numerical sketch follows this list): \[\begin{aligned} &p((t, o = x) | h) \propto \exp(-\gamma \cdot [c_h(x) - c_h(y)]) \\ &p((t, o = \tilde{xy}) | h) \propto \exp(-[c_h(x) - c_h(y)]^2 / \epsilon_h^2) \end{aligned}\]
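As a rough numerical sketch of this three-way response model, the code below computes response probabilities for a single test under one hypothesis. The cost values, \(\gamma\), and \(\epsilon_h\) are arbitrary, and jointly normalizing the three proportional scores into a distribution is an added assumption, since the model above only specifies proportionality.

import math

def claus_response_probs(cost_x, cost_y, gamma=5.0, eps=0.2):
    # Unnormalized Luce-Shepard-style scores for "x wins", "y wins", and "indifferent"
    score_x = math.exp(-gamma * (cost_x - cost_y))
    score_y = math.exp(-gamma * (cost_y - cost_x))
    score_tie = math.exp(-((cost_x - cost_y) ** 2) / eps ** 2)
    z = score_x + score_y + score_tie
    return {"x": score_x / z, "y": score_y / z, "indifferent": score_tie / z}

print(claus_response_probs(0.10, 0.12))  # nearly equal costs -> all three responses comparably likely
print(claus_response_probs(0.10, 0.90))  # clear winner -> x chosen with high probability, indifference negligible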

Rather than predicting a specific pair \((c_h, \epsilon_h)\), the algorithm focuses on predicting a group of similar pairs, called an equivalence class, which contains hypotheses for the cost function and uncertainty that are not essentially different. This information is learned through each new test, as the algorithm updates what is known about \(c\) and \(\epsilon\), distinguishing between distinct \(h\) and finding the equivalence groups among them. Moreover, the authors tuned the parameter responsible for the size of the equivalence class (how many hypotheses can be grouped together at a time).

Table 3.2: Performance of GBS and CLAUS with different labels for the uncertainty
| Category | Accuracy | Query Count |
|---|---|---|
| GBS - About Equal | \(94.15 \pm 0.52\) | \(36.02 \pm 0.03\) |
| GBS - Not Sure | \(\textbf{94.66} \pm \textbf{0.55}\) | \(35.95 \pm 0.04\) |
| CLAUS - About Equal | \(91.56 \pm 0.84\) | \(\textbf{25.93} \pm \textbf{0.41}\) |
| CLAUS - Not Sure | \(90.86 \pm 0.74\) | \(26.98 \pm 0.47\) |

The first performance evaluation concerns the number of queries and confirms that it decreases. The GBS model serves as the baseline, since it conducts all comparison queries using binary response options, while CLAUS is evaluated over different values of \(\epsilon\) and different sizes of the equivalence sets. All variants of CLAUS use approximately 10 fewer queries on average than GBS, and using larger equivalence classes can further decrease the number of queries needed. The optimal value is \(\epsilon \simeq 0.07\); beyond it, a higher \(\epsilon\) provides no additional benefit.

Lastly, the authors considered the performance difference, shown in Table 3.2. For this, they used two different labels for the uncertainty button in CLAUS, either “About Equal” or “Not Sure”, as the two can provoke different responses and feelings in users. Moreover, GBS- and CLAUS-type questions were mixed in the same set of questions to the user, which is why the metrics for both are split in two in Table 3.2. The accuracy of CLAUS is lower by about \(3\%\) on average, showing that asking fewer queries can still cost some performance. However, the second column of Table 3.2 supports the earlier finding, showing that roughly 10 fewer queries were conducted on average.

AL can be essential for learning within dynamic systems and environments. Say we have an agent in an environment, and we want it to conform to a certain behavior specified by a human. How exactly do we go about doing this? In a traditional RL setting, this is solved by the class of algorithms known as inverse reinforcement learning. Techniques such as VICE and GAIL attempt to learn a reward function that can distinguish between states visited by the agent and states a human would like it to visit. In effect, a human demonstrates what they would like the agent to do in the environment, and learning proceeds from those demonstrations. However, what if humans do not precisely know how an agent should optimally behave in an environment but still have some opinion on which trajectories are better than others? This is where a paper like Active Preference-Based Learning of Reward Functions comes into the picture. The paper uses human preferences to aid an agent’s learning within a dynamic system.

A dynamic system contains human input, robot input, and an environment state. The transition between states is defined by \(f_{HR}\), so that \(x^{t+1} = f_{HR}(x^t, u_R^t, u_H^t)\). At a given time step \(t\), we have the state \(x^t\) and the inputs \(u_R^t\) and \(u_H^t\). These can be encapsulated into a single \(d\)-dimensional feature vector that the authors denote \(\phi\). The paper then assumes that the underlying reward model we are trying to learn is linear in these features: if the human reward preference function is \(r_H\), then \(r_H(x^t, u_R^t, u_H^t) = w^{\intercal}\phi(x^t, u_R^t, u_H^t)\). Because the reward function is linear, we can pull the weight vector out of the sum when computing the reward over an entire trajectory:

\[R_{H}(x^0, u_R, u_H) = \sum_{t=0}^{N} r_{H}(x^t, u_R^t, u_H^t), \quad \Phi = \sum_{t=0}^{N} \phi(x^t, u_R^t, u_H^t), \quad R_H(\text{traj}) = w\cdot\Phi(\text{traj})\]

First, the scale of \(w\) does not matter because we only care about the relative rewards produced with \(w\) (given two different trajectories, we want to answer which trajectory a human would prefer, i.e., which one has a higher preference reward). This means we can constrain \(\|w\| \leq 1\), so the initial prior is uniform over the unit ball. From here, we can determine a probabilistic expression to assess whether trajectory A or B should be preferred (since human input can be noisy). Let \(I_t = +1\) if the human prefers trajectory \(A\). According to the Bradley-Terry model, \(p(A \succ B|w) = \sigma(R_H(traj_A) - R_H(traj_B))\). Let \(\psi = \Phi(traj_A) - \Phi(traj_B)\); then \(f_{\psi}(w) = p(I_t|w) = \sigma(I_t\, w^{\intercal}\psi)\). We can update \(p(w)\) every time we get a result from a human preference query using Bayes’ rule, \(p(w|I_t) \propto p(w) \cdot p(I_t|w)\), with samples drawn via a Markov chain Monte Carlo method. The paper synthetically generates queries through an optimization process and then presents them to a human to pick between. The idea is to generate a query that maximizes the conditional entropy \(\mathbb{H}(I|w)\), i.e., a query that we are most uncertain about given our current weights: \[\max_{x^0, u_R, u_H^A, u_H^B} \min\{\mathbb{E}[1-f_{\psi}(w)], \mathbb{E}[1 - f_{-\psi}(w)]\}\]

To do so, we sample \(w_1, \dots, w_M\) from \(p(w)\), approximating the distribution as \(p(w) \approx \frac{1}{M} \sum_i \delta(w_i)\). We can then approximate the expectation as \(\mathbb{E}[1 - f_{\psi}(w)] \approx \frac{1}{M} \sum_i (1 - f_{\psi}(w_i))\) and optimize the expression to generate a synthetic query. The algorithm itself works well; however, there is a bottleneck in that each query needs to be synthesized before being sent to the human, one at a time. There is no room for parallelization, so the authors proposed a second algorithm in a separate paper that allows for batching of queries:

\[\max_{\xi_{ib+1_A}, \xi_{ib+1_B}, \ldots, \xi_{ib+b_A}, \xi_{ib+b_B}} \mathbb{H}(I_{ib+1}, I_{ib+2}, \ldots, I_{ib+b} \mid w)\]

We could consider optimizing this greedily, which would mean synthetically generating \(b\) independent queries. The drawback is that the queries would likely be very similar to each other. The authors propose several heuristics to steer the algorithm away from generating near-duplicate queries, such as medoid selection, where \(B\) greedily generated vectors are clustered into \(b < B\) groups and one vector (the medoid) is picked from each group. The authors also propose two other methods aimed at providing diverse queries: boundary medoid selection and successive elimination. The authors test both the non-batched and the various batched learning algorithms on multiple environments. When performance is plotted against the number of queries \(N\), the non-batched AL approach is in the same ballpark as the batched approaches; however, measured over time, learning is a much slower process when not batched.
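The following sketch illustrates the per-query informativeness score from the non-batched approach, \(\min\{\mathbb{E}[1-f_{\psi}(w)], \mathbb{E}[1-f_{-\psi}(w)]\}\), estimated with sampled weights. The posterior samples and candidate feature-difference vectors \(\psi\) are synthetic stand-ins rather than outputs of the papers' trajectory optimization.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Samples w_1..w_M approximating the current posterior p(w); here a synthetic cloud
# concentrated around a direction that hypothetical previous queries already favor
M = 300
mu = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
W = mu + 0.4 * rng.normal(size=(M, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Hypothetical candidate queries, each summarized by psi = Phi(traj_A) - Phi(traj_B)
candidates = {
    "aligned with posterior": np.array([2.0, 2.0, 0.0]),    # answer already predictable
    "orthogonal to posterior": np.array([1.5, -1.5, 0.0]),  # posterior samples split
}

def score(psi):
    f_pos = sigmoid(W @ psi)     # f_psi(w_i) = p(I_t = +1 | w_i)
    # E[1 - f_psi(w)] and E[1 - f_{-psi}(w)] = E[f_psi(w)]
    return min(np.mean(1.0 - f_pos), np.mean(f_pos))

for name, psi in candidates.items():
    print(name, round(score(psi), 3))   # the orthogonal (most uncertain) query scores higher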

3.4 Case Study 2: Performance Metric Elicitation

In binary classification problems, selecting an appropriate performance metric that aligns with the real-world task is crucial. The problem of metric elicitation aims to characterize and discover the performance metric of a practitioner, reflecting the rewards or costs associated with correct or incorrect classification. For instance, in medical contexts such as diagnosing a disease or determining the appropriateness of a treatment, trade-offs are made for incorrect decisions. Not administering a treatment could lead to the worsening of a disease (a false negative), whereas delivering the wrong treatment could cause adverse side effects worse than not treating the condition (a false positive). Rather than choosing from a limited set of default choices like the F1-score or weighted accuracy, metric elicitation considers the process of devising a metric that best matches the preferences of practitioners or users. This is achieved by querying an “oracle” who provides feedback on proposed potential metrics through pairwise comparisons. Since queries to humans are often expensive, the goal is to minimize the number of comparisons needed.

The motivation for the pairwise comparison aspect of metric elicitation (Hiranandani et al. 2019a) stems from a rich history of literature in psychology, economics, and computer science (Samuelson 1938; Mas-Colell 1977; Varian 2006; Braziunas and Boutilier 2012; Tamburrelli and Margara 2014), demonstrating that humans are often ineffective at providing absolute feedback on aspects such as potential prices, user interfaces, or even ML model outputs (hence the comparison-based structure of RLHF, for instance). Additionally, confusion matrices accurately capture binary metrics such as accuracy, \(F_\beta\), and Jaccard similarity by recording the number of false positives, true positives, false negatives, and true negatives obtained by a classifier. The main goal of this chapter is to introduce two binary-search procedures that can approximate the oracle’s performance metric for two types of metrics (linear and linear-fractional performance metrics) by presenting the oracle with confusion matrices generated by various classifiers. Essentially, we are learning an optimal threshold for classification given a decision boundary for a binary classification problem.

First, we introduce some relevant notation that will later be used to formalize notions of oracle queries, classifiers, and metrics. In this context, \(X \in \mathcal{X}\) represents an input random variable, while \(Y \in \{0, 1\}\) denotes the output random variable. We learn from a dataset of size \(n\), denoted by \(\{(x, y)_i\}^n_{i=1}\), which is generated independently and identically distributed (i.i.d.) from some distribution \(\mathbb{P}(X, Y)\). The conditional probability of the positive class, given some sample \(x\), is denoted by \(\eta(\vec{x}) = \mathbb{P}(Y=1 | X=x)\). The marginal probability of the positive class is represented by \(\zeta = \mathbb{P}(Y=1)\). The set of all potential classifiers is \(\mathcal{H} = \{h : \mathcal{X} \rightarrow \{0,1\}\}\). The confusion matrix for a classifier \(h\) is \(C(h, \mathbb{P}) \in \mathbb{R}^{2 \times 2}\), where \(C_{ij}(h, \mathbb{P}) = \mathbb{P}(Y=i, h=j)\) for \(i, j \in \{0,1\}\). These entries represent the false positives, true positives, false negatives, and true negatives, ensuring that \(\sum_{i,j}C_{ij}=1\). The set of all confusion matrices is denoted by \(\mathcal{C}\). Since \(FN(h, \mathbb{P}) = \zeta - TP(h, \mathbb{P})\) and \(FP(h, \mathbb{P}) = 1 - \zeta - TN(h, \mathbb{P})\), \(\mathcal{C}\) is actually a 2-dimensional space, not a 4-dimensional space.

Any hyperplane in the \((tp, tn)\) space is given by \(\ell := a \cdot tp + b \cdot tn = c\), where \(a, b, c \in \mathbb{R}\). Given a classifier \(h\), we define a performance metric \(\phi : [0, 1]^{2 \times 2} \rightarrow \mathbb{R}\). The value \(\phi(C(h))\), which represents the performance of a classifier with respect to a certain metric, is referred to as the utility of the classifier \(h\). We assume, without loss of generality, that a higher value of \(\phi\) indicates a better performance metric for \(h\). Our focus is to recover some metric \(\phi\) using comparisons between confusion matrices \(C(h)\), determined by classifiers \(h\), which approximates the oracle’s “ground-truth” metric \(\phi^*\). Next, we introduce two classes of performance metrics—Linear Performance Metrics (LPM) and Linear-Fractional Performance Metrics (LFPM)—for which we will present two elicitation algorithms.

An LPM, given constants \(\{a_{11}, a_{01}, a_{10}, a_{00}\} \in \mathbb{R}^{4}\), is defined as \(\phi(C) = a_{11} TP + a_{01} FP + a_{10} FN + a_{00} TN = m_{11} TP + m_{00} TN + m_{0}\), where \(m_{11} = (a_{11} - a_{10})\), \(m_{00} = (a_{00} - a_{01})\), and \(m_{0} = a_{10} \zeta + a_{01} (1 - \zeta)\). This reparametrization simplifies the metric by reducing dimensionality, making it more tractable for elicitation. One example of an LPM is weighted accuracy, defined as \(WA = w_1TP + w_2TN\), where adjusting \(w_1\) and \(w_2\) controls the relative importance of different types of misclassification. An LFPM, defined by constants \(\{a_{11}, a_{01}, a_{10}, a_{00}, b_{11}, b_{01}, b_{10}, b_{00}\} \in \mathbb{R}^{8}\), is given by: \[\phi(C) = \frac{a_{11} TP + a_{01} FP + a_{10} FN + a_{00} TN}{b_{11} TP + b_{01} FP + b_{10} FN + b_{00} TN} = \frac{p_{11} TP + p_{00} TN + p_{0}}{q_{11} TP + q_{00} TN + q_{0}},\] where \(p_{11} = (a_{11} - a_{10})\), \(p_{00} = (a_{00} - a_{01})\), \(q_{11} = (b_{11} - b_{10})\), \(q_{00} = (b_{00} - b_{01})\), \(p_{0} = a_{10} \zeta + a_{01} (1 - \zeta)\), and \(q_{0} = b_{10} \zeta + b_{01} (1 - \zeta)\). This parametrization also simplifies the elicitation process by reducing the number of variables. Common LFPMs include the \(F_\beta\) score and Jaccard similarity, defined as:

\[F_{\beta} = \frac{TP}{\frac{TP}{1+\beta^{2}} - \frac{TN}{1+\beta^{2}} + \frac{\beta^{2} \zeta + 1 - \zeta}{1+\beta^{2}}}, \quad JAC = \frac{TP}{1 - TN}. \tag{3.1}\]

Setting \(\beta = 1\) gives the F1 score, which is widely used as a classification metric. Since we are considering all possible metrics in the LPM and LFPM families, we need to make certain assumptions about \(\mathcal{C}\). In particular, we assume that \(g(t) = \mathbb{P}[\eta(X) \geq t]\) is continuous and strictly decreasing for \(t \in [0, 1]\); essentially, \(\eta(X)\) has a positive density, so any particular value of \(\eta(X)\) occurs with zero probability.
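As a quick sanity check of Equation 3.1, the sketch below compares the \((TP, TN, \zeta)\) forms of \(F_\beta\) and Jaccard similarity against their standard confusion-matrix definitions on an arbitrary, made-up confusion matrix.

# Made-up confusion matrix entries (proportions summing to 1)
TP, FN, FP, TN = 0.30, 0.10, 0.15, 0.45
zeta = TP + FN            # marginal probability of the positive class
beta = 2.0

# Standard definitions
f_beta_std = (1 + beta**2) * TP / ((1 + beta**2) * TP + beta**2 * FN + FP)
jaccard_std = TP / (TP + FP + FN)

# Forms from Equation 3.1, written in terms of TP, TN, and zeta only
f_beta_eq = TP / (TP / (1 + beta**2) - TN / (1 + beta**2)
                  + (beta**2 * zeta + 1 - zeta) / (1 + beta**2))
jaccard_eq = TP / (1 - TN)

print(abs(f_beta_std - f_beta_eq) < 1e-9)    # True
print(abs(jaccard_std - jaccard_eq) < 1e-9)  # True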

Additionally, \(\mathcal{C}\) is convex, closed, and contained within the rectangle \([0, \zeta] \times [0, 1-\zeta]\), and is rotationally symmetric around its center, \((\frac{\zeta}{2}, \frac{1-\zeta}{2})\), where the axes represent the proportion of true positives and negatives. The only vertices of \(\mathcal{C}\) are \((0, 1-\zeta)\) and \((\zeta, 0)\), corresponding to predicting all \(0\)’s or all \(1\)’s on a given dataset. Therefore, \(\mathcal{C}\) is strictly convex, and any line tangent to it is tangent at exactly one point, corresponding to one particular confusion matrix. Next, recall that an LPM is represented in terms of three parameters (\(\phi = m_{11}TP + m_{00}TN + m_0\)). We have just seen that this LPM and its corresponding confusion matrix correspond to a certain point on the boundary of \(\mathcal{C}\). We first note that this point is independent of \(m_0\). Additionally, we only care about the relative weightings of \(m_{11}\) and \(m_{00}\), not their actual values—they are scale invariant. Therefore, we can parametrize the space of LPMs as \(\varphi_{LPM} = \{\mathbf{m} = (\cos \theta, \sin \theta) : \theta \in [0, 2\pi]\}\), where \(\cos \theta\) corresponds to \(m_{00}\) and \(\sin \theta\) corresponds to \(m_{11}\). As we already know, we can recover the Bayes classifier given \(\mathbf{m}\), and it is unique, corresponding to one point on the boundary of \(\mathcal{C}\) due to its convexity. The supporting hyperplane at this point is defined as \(\bar{\ell}_{\mathbf{m}} := m_{11} \cdot tp + m_{00} \cdot tn = m_{11} \overline{TP}_{\mathbf{m}} + m_{00} \overline{TN}_{\mathbf{m}}\). We note that if \(m_{00}\) and \(m_{11}\) have opposite signs, then \(\bar{h}_m\) is the trivial classifier predicting all 1’s or all 0’s, since either predicting true positives or true negatives results in negative reward. This corresponds to a supporting hyperplane with a positive slope, so it can only be tangent at the vertices. Additionally, the boundary \(\partial \mathcal{C}\) can be split into upper and lower boundaries (\(\partial \mathcal{C}_{+}, \partial \mathcal{C}_{-}\)), corresponding to \(\theta \in (0, \pi/2)\) and \(\theta \in (\pi, 3\pi/2)\) respectively (and whether \(m_{00}, m_{11}\) are positive or negative). We also define the notions of Bayes optimal and inverse-optimal classifiers. Given a performance metric \(\phi\), we define:

  • The Bayes utility as \(\bar{\tau} := \sup_{h \in \mathcal{H}} \phi(C(h)) = \sup_{C \in \mathcal{C}} \phi(C)\); this is the highest achievable utility (using the metric \(\phi\)) over all classifiers \(h \in \mathcal{H}\) for a given problem.
  • The Bayes classifier as \(\bar{h} := \arg \max_{h \in \mathcal{H}} \phi(C(h))\); this is the classifier \(h\) corresponding to the Bayes utility.
  • The Bayes confusion matrix as \(\bar{C} := \arg \max_{C \in \mathcal{C}} \phi(C)\); this is the confusion matrix corresponding to the Bayes utility and classifier.

Similarly, the inverse Bayes utility, classifier, and confusion matrix can be defined by replacing “\(\sup\)” with “\(\inf\)”; they represent the classifier and confusion matrix corresponding to the lower bound on utility for a given problem. We also have the following useful proposition:


Proposition 3.1 Let \(\phi \in \varphi_{LPM}\). Then

\[\bar{h}(x) = \left\{\begin{array}{lr} \unicode{x1D7D9} \left[\eta(x) \geq \frac{m_{00}}{m_{11} + m_{00}}\right], & m_{11} + m_{00} \geq 0 \\ \unicode{x1D7D9} \left[\frac{m_{00}}{m_{11} + m_{00}} \geq \eta(x)\right], & \text { o.w. } \end{array}\right\} \tag{3.2}\]

is a Bayes optimal classifier with respect to \(\phi\). The inverse Bayes classifier is given by \(\underline{h} = 1 - \bar{h}\).

This is a simple derivation based on the fact that we only get rewards from true positives and true negatives. Essentially, if we recover an LPM, we can use it to determine the best-performing classifier, obtained by placing a threshold on the conditional probability of a given sample, that corresponds to a confusion matrix. Therefore, the three notions of Bayes utility, classifier, and confusion matrix are functionally equivalent in our setting.
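A small sketch of Proposition 3.1 in code: given LPM weights \((m_{11}, m_{00})\) and the conditional probability \(\eta(x)\), the Bayes optimal classifier is a threshold rule on \(\eta\). The example values of \(\eta\) and the weights are arbitrary, and the handling of the degenerate case \(m_{11} + m_{00} = 0\) is an added assumption not covered by the proposition.

def bayes_classifier(eta, m11, m00):
    """Bayes optimal classifier for the LPM phi = m11*TP + m00*TN + m0 (Proposition 3.1)."""
    if m11 + m00 == 0:
        # degenerate case: the utility depends only on the sign of m00, so predict all 0's or all 1's
        return 0 if m00 > 0 else 1
    threshold = m00 / (m11 + m00)
    if m11 + m00 >= 0:
        return int(eta >= threshold)
    return int(threshold >= eta)

def inverse_bayes_classifier(eta, m11, m00):
    return 1 - bayes_classifier(eta, m11, m00)

# Example: equal positive weights recover the usual 0.5 threshold on eta(x)
for eta in [0.2, 0.5, 0.8]:
    print(eta, bayes_classifier(eta, m11=0.7, m00=0.7))   # prints 0, 1, 1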

We will now formalize the problem of metric elicitation. Given two classifiers \(h\) and \(h'\) (or equivalently, two confusion matrices \(C\) and \(C'\)), we define an oracle query as the function:

\[\Gamma\left(h, h^{\prime}\right)=\Omega\left(C, C^{\prime}\right)=\unicode{x1D7D9}\left[\phi(C)>\phi\left(C^{\prime}\right)\right]=: \unicode{x1D7D9} \left[C \succ C^{\prime}\right], \tag{3.3}\]

which represents the classifier preferred by the practitioner. We can then define the metric elicitation problem for populations:


Definition 3.1 Suppose the true (oracle) performance metric is \(\phi\). The goal is to recover a metric \(\hat{\phi}\) by querying the oracle with as few pairwise comparisons of the form \(\Omega\left(C, C^{\prime}\right)\) as possible, such that \(\|\phi - \hat{\phi}\| < \kappa\) for a sufficiently small \(\kappa > 0\) and a suitable norm \(\|\cdot\|\).

In practice, we do not have access to the true probability distribution or the population, which would provide the true values of \(C\) and \(C'\). However, we can subtly alter this problem description to use \(\hat{C}\) and \(\hat{C}^{\prime}\), which are derived from our dataset of \(n\) samples:


Definition 3.2 Suppose the true (oracle) performance metric is \(\phi\). The aim is to recover a metric \(\hat{\phi}\) by querying the oracle with as few pairwise comparisons of the form \(\Omega\left(\hat{C}, \hat{C}^{\prime}\right)\) as possible, such that \(\|\phi - \hat{\phi}\| < \kappa\) for a sufficiently small \(\kappa > 0\) and a suitable norm \(\|\cdot\|\).

As is common in theoretical ML research, we solve the population problem and then consider ways to extend this to practical settings where we only have limited datasets of samples. In our case, this corresponds to calculating the confusion matrices from a portion of the dataset we have access to.

3.4.1 Linear Performance Metric Elicitation

For LPM elicitation, we need one more proposition.


Proposition 3.2 Let \(\psi\) be a quasiconvex metric that is monotone increasing in \(TP\) and \(TN\), and let \(\phi\) be a quasiconcave metric that is monotone increasing in \(TP\) and \(TN\). If \(\rho^+\) and \(\rho^-\) parametrize the upper and lower boundaries of \(\mathcal{C}\), respectively, then the composition \(\psi \circ \rho^-\) is quasiconvex and unimodal on \([0, 1]\), and \(\phi \circ \rho^+\) is quasiconcave and unimodal on \([0, 1]\).

Quasiconcavity and quasiconvexity are slightly more general variations on concavity and convexity. Their main useful property in our setting is unimodality (a single extremum), which lets us devise a binary-search-style algorithm for eliciting the Bayes optimal and inverse-optimal confusion matrices for a given setting, along with the corresponding metrics. To maximize a quasiconcave metric \(\phi\) that is monotonically increasing in \(TP\) and \(TN\), the maximizer (and its supporting hyperplane) will lie on the upper boundary of \(\mathcal{C}\). We therefore set the initial search range to \([0, \pi/2]\) and repeatedly divide it into four regions. In each iteration, we compute the confusion matrices at the five boundary angles of these regions, query the oracle four times, and shrink the search interval accordingly, repeating until a maximizer is found.

\begin{algorithm} \caption{Quasiconcave Metric Maximization} \begin{algorithmic} \State \textbf{input:} $\epsilon > 0$ and oracle $\Omega$ \State \textbf{initialize:} $\theta_a = 0, \theta_b = \frac{\pi}{2}$ \While{$|\theta_b - \theta_a| > \epsilon$} \State set $\theta_c = \frac{3\theta_a+\theta_b}{4}$, $\theta_d = \frac{\theta_a+\theta_b}{2}$, and $\theta_e = \frac{\theta_a+3\theta_b}{4}$ \State obtain $h_{\theta_a}, h_{\theta_c}, h_{\theta_d}, h_{\theta_e}, h_{\theta_b}$ using Proposition 3.1 \State compute the corresponding confusion matrices $C_{\theta_a}, C_{\theta_c}, C_{\theta_d}, C_{\theta_e}, C_{\theta_b}$ \State query $\Omega(C_{\theta_c}, C_{\theta_a})$, $\Omega(C_{\theta_d}, C_{\theta_c})$, $\Omega(C_{\theta_e}, C_{\theta_d})$, and $\Omega(C_{\theta_b}, C_{\theta_e})$ \If{$C_{\theta_a} \succ C_{\theta_c}$} \State set $\theta_b = \theta_d$ \ElsIf{$C_{\theta_a} \prec C_{\theta_c} \succ C_{\theta_d}$} \State set $\theta_b = \theta_d$ \ElsIf{$C_{\theta_c} \prec C_{\theta_d} \succ C_{\theta_e}$} \State set $\theta_a = \theta_c$ and $\theta_b = \theta_e$ \ElsIf{$C_{\theta_d} \prec C_{\theta_e} \succ C_{\theta_b}$} \State set $\theta_a = \theta_d$ \Else \State set $\theta_a = \theta_d$ \EndIf \EndWhile \State \textbf{output:} $\vec{m}, C$, and $\vec{\ell}$, where $\vec{m} = (\cos\theta_d, \sin\theta_d)$, $C = C_{\theta_d}$, and $\vec{\ell} := (\vec{m}, (tp, tn)) = (\vec{m}, C)$ \end{algorithmic} \end{algorithm}

To elicit LPMs, we run Algorithm 1, querying the oracle in each iteration, and set the elicited metric \(\hat{m}\) (which is the maximizer on \(\mathcal{C}\)) to be the slope of the resulting hyperplane, since the metric is linear.


Remark 3.2. To find the minimum of a quasiconvex metric, we flip all instances of \(\prec\) and \(\succ\), and use an initial search range of \([\pi, 3\pi/2]\); we use this algorithm, which we refer to as Algorithm 2, in our elicitation of LFPMs.

Next, we provide a Python implementation of Algorithm 1.

import math

def get_m(theta):
    """
    Inputs:
    - theta: the value that parametrizes m
    Outputs:
    - m_0 and m_1 for the LPM
    """

    return (math.cos(theta), math.sin(theta))

def lpm_elicitation(epsilon, oracle):
    """
    Inputs:
    - epsilon: some epsilon > 0 representing threshold of error
    - oracle: some function that accepts 2 confusion matrices and
        returns true if the first is preferred and false otherwise
    Outputs:
    - estimate for m, which is used to compute the LPM as described above,
      and the corresponding confusion matrix
    Assumes a helper get_c(m) that builds the Bayes-optimal classifier for
    the LPM with weights m and returns its confusion matrix.
    """

    a = 0
    b = math.pi / 2
    d = (a + b) / 2
    while (b - a > epsilon):
        c = (3 * a + b) / 4
        d = (a + b) / 2
        e = (a + 3 * b) / 4

        m_a, m_b, m_c, m_d, m_e = (get_m(x) for x in [a, b, c, d, e])  # using definition of m
        c_a, c_b, c_c, c_d, c_e = (get_c(x) for x in [m_a, m_b, m_c, m_d, m_e])  # compute classifiers from the m's, then calculate confusion matrices

        response_ac = oracle(c_a, c_c)
        response_cd = oracle(c_c, c_d)
        response_de = oracle(c_d, c_e)
        response_eb = oracle(c_e, c_b)

        # update ranges to keep the peak
        if response_ac:
            b = d
        elif response_cd:
            b = d
        elif response_de:
            a = c
            b = e
        elif response_eb:
            a = d
        else:
            a = d
    return get_m(d), get_c(get_m(d))
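
As a complement, below is a minimal sketch of the quasiconvex minimization variant from Remark 3.2 (Algorithm 2). It reuses get_m from above and, like lpm_elicitation, assumes a hypothetical helper get_c that maps LPM weights to the confusion matrix of the corresponding classifier; the name lpm_minimization is ours.

code
import math

def lpm_minimization(epsilon, oracle, get_c):
    """
    Sketch of Algorithm 2 (Remark 3.2): quadrisection search for the
    minimizer of a quasiconvex metric, run over the lower boundary of C.
    The comparisons are flipped relative to lpm_elicitation and the
    search range is [pi, 3*pi/2].
    """
    a = math.pi
    b = 3 * math.pi / 2
    d = (a + b) / 2
    while (b - a > epsilon):
        c = (3 * a + b) / 4
        d = (a + b) / 2
        e = (a + 3 * b) / 4

        m_a, m_c, m_d, m_e, m_b = (get_m(x) for x in [a, c, d, e, b])
        c_a, c_c, c_d, c_e, c_b = (get_c(m) for m in [m_a, m_c, m_d, m_e, m_b])

        # flipped comparisons: True when the second argument has lower utility
        response_ac = oracle(c_c, c_a)
        response_cd = oracle(c_d, c_c)
        response_de = oracle(c_e, c_d)
        response_eb = oracle(c_b, c_e)

        # update ranges to keep the valley
        if response_ac:
            b = d
        elif response_cd:
            b = d
        elif response_de:
            a = c
            b = e
        elif response_eb:
            a = d
        else:
            a = d
    return get_m(d), get_c(get_m(d))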

3.4.2 Linear-Fractional Performance Metric Elicitation

Now, we present the next main result, which is an algorithm to elicit linear-fractional performance metrics. For this task, we will need the following assumption: let \(\phi \in \varphi_{LFPM}\); we assume \(p_{11}, p_{00} \geq 0\), \(p_{11} \geq q_{11}\), \(p_{00} \geq q_{00}\), \(p_{0}=0\), \(q_{0}=\left(p_{11}-q_{11}\right) \zeta+\left(p_{00}-q_{00}\right)(1-\zeta)\), and \(p_{11}+p_{00}=1\).

These assumptions guarantee that the LFPM \(\phi\) which we are trying to elicit is monotonically increasing in \(TP\) and \(TN\), just as in the LPM elicitation case. We first provide motivation and an overview of the approach for LFPM elicitation and then present pseudocode for the algorithm.

The general idea of the algorithm is to use Algorithm 1 to obtain a maximizer and a minimizer for the given dataset; these result in two systems of equations involving the true LFPM \(\phi^*\) with 1 degree of freedom. Then, we run a grid search that is independent of oracle queries to find the point where solutions to the systems match pointwise on the resulting confusion matrices; this occurs close to where the true metric lies.

More formally, suppose that the true metric is \[\phi^{*}(C)=\frac{p_{11}^{*} T P+p_{00}^{*} T N}{q_{11}^{*} T P+q_{00}^{*} T N+q_{0}^{*}}. \tag{3.4}\] Let \(\bar{\tau}^*\) and \(\underline{\tau}^*\) denote the maximum and minimum values of \(\phi^*\) over \(\mathcal{C}\), respectively. There exists a hyperplane \[\begin{aligned} \bar{\ell}_{f}^{*}:=\left(p_{11}^{*}-\bar{\tau}^{*} q_{11}^{*}\right) t p+\left(p_{00}^{*}-\bar{\tau}^{*} q_{00}^{*}\right) t n=\bar{\tau}^{*} q_{0}^{*}, \end{aligned}\] which touches \(\mathcal{C}\) at \(\left(\overline{T P}^{*}, \overline{T N}^{*}\right)\) on \(\partial \mathcal{C}_{+}\). Correspondingly, there also exists a hyperplane \(\underline{\ell}_{f}^{*}:=\left(p_{11}^{*}-\underline{\tau}^{*} q_{11}^{*}\right) t p+\left(p_{00}^{*}-\underline{\tau}^{*} q_{00}^{*}\right) t n=\underline{\tau}^{*} q_{0}^{*}\), which touches \(\mathcal{C}\) at \(\left(\underline{TP}^{*}, \underline{TN}^{*}\right)\) on \(\partial \mathcal{C}_{-}\). While we cannot obtain \(\bar{\ell}_{f}^{*}\) and \(\underline{\ell}_{f}^{*}\) directly, we can use Algorithm 1 to get a hyperplane \[\bar{\ell}:=\bar{m}_{11} t p+\bar{m}_{00} t n= \bar{m}_{11} \overline{T P}^{*}+\bar{m}_{00} \overline{T N}^{*} = \bar{C}_{0}, \tag{3.5}\] which is equivalent to \(\bar{\ell}_{f}^{*}\) up to a constant multiple. From here, we can obtain the system of equations

\[p_{11}^{*}-\bar{\tau}^{*} q_{11}^{*}=\alpha \bar{m}_{11}, \quad p_{00}^{*}-\bar{\tau}^{*} q_{00}^{*}=\alpha \bar{m}_{00}, \quad \bar{\tau}^{*} q_{0}^{*}=\alpha \bar{C}_{0}, \tag{3.6}\] where \(\alpha > 0\): we know \(\alpha \geq 0\) from our earlier assumptions and because \(\bar{m}\) is positive, and if \(\alpha = 0\) then \(\phi^*\) would be constant, which we exclude. Dividing through by \(\alpha\) and writing the rescaled parameters with primes, our resulting system of equations is \[\begin{aligned} p_{11}^{\prime}-\bar{\tau}^{*} q_{11}^{\prime}=\bar{m}_{11}, \quad p_{00}^{\prime}-\bar{\tau}^{*} q_{00}^{\prime}=\bar{m}_{00}, \quad \bar{\tau}^{*} q_{0}^{\prime}=\bar{C}_{0}. \end{aligned} \tag{3.7}\]

Similarly, we can approximate \(\underline{\ell}_{f}^{*}\) using the algorithm we defined for quasiconvex metrics (Algorithm 2), in which the search range and comparisons are altered. After finding the minimizer, we obtain the hyperplane \[\underline{\ell}:=\underline{m}_{11} t p+\underline{m}_{00} t n=\underline{m}_{11} \underline{TP}^{*}+\underline{m}_{00} \underline{TN}^{*} = \underline{C}_{0}, \tag{3.8}\] which is equivalent to \(\underline{\ell}_{f}^{*}\) up to a constant multiple. The corresponding system of equations is \[p_{11}^{*}-\underline{\tau}^{*} q_{11}^{*}=\gamma \underline{m}_{11}, \quad p_{00}^{*}-\underline{\tau}^{*} q_{00}^{*}=\gamma \underline{m}_{00}, \quad \underline{\tau}^{*} q_{0}^{*}=\gamma \underline{C}_{0}, \tag{3.9}\] where \(\gamma <0\) (for a reason analogous to why \(\alpha >0\)); dividing through by \(\gamma\) and writing the rescaled parameters with double primes gives \[\begin{aligned} p_{11}^{\prime \prime}-\underline{\tau}^{*} q_{11}^{\prime \prime}=\underline{m}_{11}, \quad p_{00}^{\prime \prime}-\underline{\tau}^{*} q_{00}^{\prime \prime}=\underline{m}_{00}, \quad \underline{\tau}^{*} q_{0}^{\prime \prime}=\underline{C}_{0}. \end{aligned} \tag{3.10}\]

Equation 3.7 and Equation 3.10 form the two systems of equations mentioned in our overview of the algorithm. Next, we show that they have only one degree of freedom. Note that if we knew \(p_{11}'\), we could solve the first system as follows: \[\begin{aligned} p_{00}^{\prime} &=1-p_{11}^{\prime}, \quad q_{0}^{\prime}=\bar{C}_{0} \frac{P^{\prime}}{Q^{\prime}},\\ q_{11}^{\prime} &=\left(p_{11}^{\prime}-\bar{m}_{11}\right) \frac{P^{\prime}}{Q^{\prime}}, \\ q_{00}^{\prime}&=\left(p_{00}^{\prime}-\bar{m}_{00}\right) \frac{P^{\prime}}{Q^{\prime}}, \end{aligned} \tag{3.11}\] where \(P^{\prime}=p_{11}^{\prime} \zeta+p_{00}^{\prime}(1-\zeta)\) and \(Q^{\prime}=P^{\prime}+\bar{C}_{0}-\bar{m}_{11} \zeta-\bar{m}_{00}(1-\zeta)\). The double-primed system from the minimizer is solved analogously, with \(\underline{m}\) and \(\underline{C}_{0}\) in place of \(\bar{m}\) and \(\bar{C}_{0}\).
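
To make the algebra concrete, the following sketch shows one way Equation 3.11 could be implemented; the names solve_upper_system and make_lfpm are ours, m_bar and C0_bar denote the slope and offset of the hyperplane returned by Algorithm 1, and zeta is the quantity appearing in Equation 3.11. Passing the hyperplane from Algorithm 2 instead yields the double-primed solution, and the metric returned by make_lfpm can play the role of compute_upper_metric (or compute_lower_metric) in the implementation further below.

code
def solve_upper_system(p11, m_bar, C0_bar, zeta):
    """
    Solve the maximizer system via Equation 3.11 for a candidate p11'.
    m_bar = (m11_bar, m00_bar) and C0_bar are the slope and offset of the
    supporting hyperplane found by Algorithm 1; zeta is the quantity from
    Equation 3.11. Returns (p11', p00', q11', q00', q0').
    """
    m11_bar, m00_bar = m_bar
    p00 = 1 - p11
    P = p11 * zeta + p00 * (1 - zeta)
    Q = P + C0_bar - m11_bar * zeta - m00_bar * (1 - zeta)
    q0 = C0_bar * P / Q
    q11 = (p11 - m11_bar) * P / Q
    q00 = (p00 - m00_bar) * P / Q
    return p11, p00, q11, q00, q0

def make_lfpm(params):
    """Build phi(tp, tn) = (p11*tp + p00*tn) / (q11*tp + q00*tn + q0) from solved parameters."""
    p11, p00, q11, q00, q0 = params

    def phi(tp, tn):
        return (p11 * tp + p00 * tn) / (q11 * tp + q00 * tn + q0)

    return phi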

Now, suppose we know \(p_{11}'\). We could use this value to solve both systems, Equation 3.7 and Equation 3.10, yielding two metrics, \(\phi'\) and \(\phi''\), from the maximizer and minimizer, respectively. Importantly, when \(p_{11}^{*} / p_{00}^{*}=p_{11}^{\prime} / p_{00}^{\prime}=p_{11}^{\prime \prime} / p_{00}^{\prime \prime}\), then \(\phi^{*}(C)=\phi^{\prime}(C) / \alpha=-\phi^{\prime \prime}(C) / \gamma\). Essentially, once we find a value of \(p_{11}'\) for which \(\phi'\) and \(\phi''\) have a constant ratio at all points on the boundary of \(\mathcal{C}\), we can obtain \(\phi^*\), as it is derivable from \(\phi'\) and \(\alpha\) (or, alternatively, from \(\phi''\) and \(\gamma\)).

We perform a grid search for \(p_{11}'\) on \([0,1]\). For each point in the search, we compute \(\phi'\) and \(\phi''\), generate several confusion matrices on the boundaries, and calculate the ratio \(\phi'' / \phi'\) for each. We then select the value of \(p_{11}'\) for which the ratio \(\phi'' / \phi'\) is closest to constant and use it to compute the elicited metric \(\hat{\phi}\). The pseudocode for this grid search is given below.

\begin{algorithm} \caption{Grid Search for Best Ratio} \begin{algorithmic}
\State \textbf{Input:} $k, \Delta$.
\State \textbf{Initialize:} $\sigma_{\text{opt}} = \infty, p'_{11,\text{opt}} = 0$.
\State Generate $C_1, \dots, C_k$ on $\partial \mathcal{C}_+$ and $\partial \mathcal{C}_-$ (Section 3).
\For{$p'_{11} = 0; \; p'_{11} \leq 1; \; p'_{11} = p'_{11} + \Delta$}
\State Compute $\phi'$, $\phi''$ using Proposition 4.
\State Compute array $r = \left[ \frac{\phi'(C_1)}{\phi''(C_1)}, \dots, \frac{\phi'(C_k)}{\phi''(C_k)} \right]$.
\State Set $\sigma = \text{std}(r)$.
\If{$\sigma < \sigma_{\text{opt}}$}
\State Set $\sigma_{\text{opt}} = \sigma$ and $p'_{11,\text{opt}} = p'_{11}$.
\EndIf
\EndFor
\State \textbf{Output:} $p'_{11,\text{opt}}$.
\end{algorithmic} \end{algorithm}

We provide a Python implementation below.

code
import numpy as np

def lfpm_elicitation(k, delta):
    """
    Inputs:
    - k: the number of confusion matrices to evaluate on
    - delta: the spacing for the grid search
    Outputs:
    - p_11', which will allow us to compute the elicited LFPM
    Assumes helpers compute_confusion_matrices (generates boundary confusion
    matrices), compute_upper_metric, and compute_lower_metric (which solve
    the two systems of equations for a given p_11' and return the resulting
    metrics as callables).
    """

    sigma_opt = np.inf
    p11_opt = 0
    C = compute_confusion_matrices(k)  # generates k confusion matrices to evaluate on

    for i in range(int(1 / delta) + 1):
        p11 = i * delta
        phi1 = compute_upper_metric(p11)  # solves the first system of equations with p11
        phi2 = compute_lower_metric(p11)  # solves the second system of equations with p11
        utility_1 = [phi1(c) for c in C]  # calculate phi for both systems of equations
        utility_2 = [phi2(c) for c in C]

        # ratio of the two metrics at each boundary confusion matrix
        r = [u1 / u2 for u1, u2 in zip(utility_1, utility_2)]
        sigma = np.std(r)

        if sigma < sigma_opt:
            sigma_opt = sigma
            p11_opt = p11
    return p11_opt

In summary, to elicit LFPMs, we exploit a special property of the metric's maximizer and minimizer on \(\mathcal{C}\): the corresponding supporting hyperplanes yield systems of equations that approximate \(\phi^*\) once a single parameter (\(p_{11}'\)) is found, and this parameter can be found using an oracle-independent grid search. Importantly, these algorithms satisfy meaningful theoretical guarantees. We provide formal statements and intuitive interpretations of these guarantees here, with their proofs available in the appendix of the original paper. First, we define the oracle noise \(\epsilon_{\Omega}\), which arises from the oracle potentially flipping the comparison output on two confusion matrices that are close enough in utility.

Given \(\epsilon, \epsilon_{\Omega} \geq 0\) and a metric \(\phi\) satisfying our assumptions, Algorithm 1 or Algorithm 2 finds an approximate maximizer/minimizer and supporting hyperplane. Additionally, the value of \(\phi\) at that point is within \(O\left(\sqrt{\epsilon_{\Omega}} + \epsilon\right)\) of the optimum, and the number of queries is \(O\left(\log \frac{1}{\epsilon}\right)\). Let \(\mathbf{m}^{*}\) be the true performance metric. Given \(\epsilon > 0\), LPM elicitation outputs a performance metric \(\hat{\mathbf{m}}\) such that \(\left\|\mathbf{m}^{*} - \hat{\mathbf{m}}\right\|_{\infty} \leq \sqrt{2} \epsilon + \frac{2}{k_{0}} \sqrt{2 k_{1} \epsilon_{\Omega}}\). These results ensure that Algorithm 1 and Algorithm 2 find an appropriate maximizer and minimizer in the search space, to an accuracy determined by the oracle noise \(\epsilon_\Omega\) and the tolerance \(\epsilon\), and within a bounded number of queries. Both of these statements follow from the binary-search approach.

Let \(h_{\theta}\) and \(\hat{h}_{\theta}\) be two classifiers estimated using \(\eta\) and \(\hat{\eta}\), respectively. Further, let \(\bar{\theta}\) be such that \(h_{\bar{\theta}} = \arg \max _{\theta} \phi\left(h_{\theta}\right)\). Then \(\|C(\hat{h}_{\bar{\theta}}) - C\left(h_{\bar{\theta}}\right)\|_{\infty} = O\left(\left\|\hat{\eta} - \eta\right\|_{\infty}\right)\). This result indicates that the drop in elicited metric quality caused by using a finite sample rather than population confusion matrices is bounded by the estimation error of the class-probability model \(\hat{\eta}\). These three guarantees together ensure that oracle noise and sample noise do not amplify drops in performance when using metric elicitation; rather, these drops are bounded by those that would typically occur in the standard machine learning paradigm of training a classifier and evaluating it with a pre-established metric. For further exploration of the types of problems that can be solved within the metric elicitation framework, we refer the reader to (Hiranandani, Narasimhan, and Koyejo 2020), which performs metric elicitation to determine the oracle's ideal tradeoff between the classifier's overall performance and the discrepancy between its performance on certain protected groups.

3.4.3 Multiclass Performance Metric Elicitation

Although the previous sections only described metric elicitation for binary classification problems, the general framework can also be applied to multiclass classification problems (Hiranandani et al. 2019b). Consider the case of classifying subtypes of leukemia (Yang and Naiman 2014). We can train a neural network to predict the conditional probability of a certain leukemia subtype given certain gene expressions. However, it may not be appropriate to classify the subtype purely based on whichever one has the highest confidence. For instance, a treatment for leukemia subtype C1 may be perfect for cases of C1, but it may be ineffective or harmful for certain other subtypes. Therefore, the final response from the classifier may not be as simple as choosing the class with the highest conditional probability, just as the threshold for binary classification may not always be 50%. With multiclass metric elicitation, we can show confusion matrices to an oracle (like the doctor in the leukemia example) to determine which classifier has the best tradeoffs. In (Hiranandani et al. 2019b), the authors focus on eliciting linear performance metrics, which is what we describe here. Most of the notation from binary metric elicitation carries over, modified to accommodate categorical responses. \(X \in \mathcal{X}\) is the input random variable. \(Y \in [k]\) is the output random variable, where \([k]\) is the index set \(\{1, 2, \dots, k\}\).

The dataset of size \(n\) is denoted by \(\{(\vec{x}_i, y_i)\}_{i=1}^n\), generated independently and identically from \(\mathbb{P}(X, Y)\). \(\eta_i(\vec{x}) = \mathbb{P}(Y=i \mid X=\vec{x})\) gives the conditional probability of class \(i \in [k]\) given an observation. \(\xi_i = \mathbb{P}(Y=i)\) is the marginal probability of class \(i \in [k]\). The set of all classifiers is \(\mathcal{H} = \{h : \mathcal{X} \rightarrow \Delta_k\}\), where \(\Delta_k\) is the \((k-1)\)-dimensional simplex. In this case, the outputs of classifiers are one-hot vectors of size \(k\), where the index of the predicted class has value 1 and all other entries are 0. The confusion matrix for a classifier \(h\) is \(C(h, \mathbb{P}) \in \mathbb{R}^{k \times k}\), where:

\[C_{ij}(h, \mathbb{P}) = \mathbb{P}(Y=i, h=j) \text{\qquad for } i, j \in [k] \tag{3.12}\]

Note that the confusion matrices are \(k\times k\) and store the joint probabilities of each type of classification for each possible class. This means that the sum of row \(i\) in the confusion matrix equals \(\xi_i\), because this is equivalent to adding over all possible classifications. Since we know the sums of each row, all diagonal elements can be reconstructed from just the off-diagonal elements, so a confusion matrix \(C(h, \mathbb{P})\) can be expressed as a vector of off-diagonal elements, \(\vec{c}(h, \mathbb{P}) = \textit{off-diag}(C(h, \mathbb{P}))\), and \(\vec{c} \in \mathbb{R}^q\) where \(q := k^2 - k\). The vector \(\vec{c}\) is called the vector of ‘off-diagonal confusions.’ The space of off-diagonal confusions is \(\mathcal{C} = \{\vec{c}(h, \mathbb{P}) : h \in \mathcal{H}\}\).

In cases where the oracle cares about the exact type of misclassification (e.g., misclassifying an object from class 1 as class 2), this vector of off-diagonal confusions is necessary. However, there are many cases where the performance of a classifier is determined just by the probability of correct prediction for each class, which requires only the diagonal elements. In these cases, we can define the vector of ‘diagonal confusions’ as \(\vec{d}(h, \mathbb{P}) = \textit{diag}(C(h, \mathbb{P})) \in \mathbb{R}^k\). The space of diagonal confusions is \(\mathcal{D} = \{\vec{d}(h, \mathbb{P}) : h \in \mathcal{H}\}\).
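
To make the bookkeeping concrete, the following short sketch (with helper names of our own choosing) shows how the empirical confusion matrix of Equation 3.12 and the off-diagonal and diagonal confusion vectors could be computed from true and predicted labels.

code
import numpy as np

def confusion_matrix(y_true, y_pred, k):
    """Empirical estimate of C_ij = P(Y = i, h = j) from true labels and predictions (Equation 3.12)."""
    C = np.zeros((k, k))
    for y, h in zip(y_true, y_pred):
        C[y, h] += 1
    return C / len(y_true)

def off_diag_confusions(C):
    """Vector of off-diagonal confusions, of length k^2 - k."""
    k = C.shape[0]
    return C[~np.eye(k, dtype=bool)]

def diag_confusions(C):
    """Vector of diagonal confusions, of length k."""
    return np.diag(C)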

Finally, the setup for metric elicitation is identical to the one examined in the binary case. We still assume access to an oracle that can choose between two classifiers or confusion matrices, using the notation \(\Gamma\) for comparing two classifiers and \(\Omega\) for comparing confusion matrices; the oracle returns 1 if the first option is better and 0 otherwise. We still assume that the oracle behaves according to some unknown performance metric, and we wish to recover this metric up to some small error tolerance (based on a suitable norm). The two different types of confusion vectors result in different algorithms for metric elicitation, which we explore in later sections.

A Diagonal Linear Performance Metric (DLPM) is a performance metric that only considers the diagonal elements in the confusion matrix. The metric is defined as \(\psi(\vec{d}) = \langle \vec{a}, \vec{d} \rangle\), where \(\vec{a} \in \mathbb{R}^k\) such that \(||\vec{a}||_1 = 1\). It is also called weighted accuracy (Narasimhan et al. 2015). The family of DLPMs is denoted as \(\varphi_{DLPM}\). Since these only consider the diagonal elements, which we want to maximize, we can focus on only eliciting monotonically increasing DLPMs, meaning that all elements in \(\vec{a}\) are non-negative.

Consider the trivial classifiers that always predict a single class. The diagonal confusions when only predicting class \(i\) are \(\vec{v}_i \in \mathbb{R}^k\), with \(\xi_i\) at index \(i\) and zero elsewhere. Note that \(\xi_i\) is the maximum possible value at index \(i\), because it corresponds to perfectly classifying all points whose true class is \(i\). The space of diagonal confusions is visualized in Figure 3.4 (taken from (Hiranandani et al. 2019b)). The space \(\mathcal{D}\) is strictly convex, closed, and contained in the box \([0, \xi_1] \times \dots \times [0, \xi_k]\). We also know that its only vertices are \(\vec{v}_i\) for each \(i \in [k]\).

Figure 3.4: (a) Geometry of the space of diagonal confusions for \(k=3\). This is a convex region with three flat areas representing confusions when restricted to only two classes. (b) Geometry of diagonal confusions when restricted to classes \(k_1\) and \(k_2\). Notice how this is identical to the space of confusion matrices examined in the binary case.

We know that this is strictly convex under the assumption that an object from any class can be misclassified as any other class. Mathematically, the assumption is that \(g_{ij}(r) = \mathbb{P} \left[\frac{\eta_i(X)}{\eta_j(X)} \geq r \right]\) \(\forall i, j \in [k]\) are continuous and strictly decreasing for \(r \in [0, \infty)\).

We can also define the space of binary classification confusion matrices confined to classes \(k_1\) and \(k_2\), which is the 2-D \((k_1, k_2)\) axis-aligned face of \(\mathcal{D}\), denoted as \(\mathcal{D}_{k_1, k_2}\). Note that this is strictly convex, since \(\mathcal{D}\) itself is strictly convex, and it has the same geometry as the space of binary confusion matrices examined earlier. Therefore, we can construct an RBO classifier for \(\psi \in \varphi_{DLPM}\), parameterized by \(\vec{a}\), as follows: \[\begin{aligned} \bar{h}_{k_1, k_2}(\vec{x})= \left\{ \begin{array}{ll} k_1, \text{ if } a_{k_1} \eta_{k_1}(\vec{x}) \geq a_{k_2} \eta_{k_2}(\vec{x})\\ k_2, \text{ o.w.} \end{array} \right\}. \end{aligned} \tag{3.13}\]

We can parameterize the upper boundary of \(\mathcal{D}_{k_1, k_2}\), denoted as \(\partial \mathcal{D}^{+}_{k_1, k_2}\), using a single parameter \(m \in [0, 1]\). Specifically, we can construct a DLPM by setting \(a_{k_1} = m\), \(a_{k_2} = 1 - m\), and all others to 0. Using Equation 3.13, we can get the diagonal confusions, so varying \(m\) parameterizes \(\partial \mathcal{D}^{+}_{k_1, k_2}\). The parameterization is denoted as \(\nu(m; k_1, k_2)\).

3.4.3.1 Diagonal Linear Performance Metric Elicitation

Suppose the oracle follows a true metric, \(\psi\), that is linear and monotone increasing across all axes. If we consider the composition \(\psi \circ \nu(m; k_1, k_2): [0, 1] \rightarrow \mathbb{R}\), we know it must be concave and unimodal, because \(\mathcal{D}_{k_1, k_2}\) is a convex set. Therefore, we can find the value of \(m\) that maximizes \(\psi \circ \nu(m; k_1, k_2)\) for any given \(k_1\) and \(k_2\) using a binary search procedure.

Since the RBO classifier for classes \(k_1\) and \(k_2\) relies only on the relative weights of the classes in the DLPM (see Equation 3.13), finding the value of \(m\) that maximizes \(\psi \circ \nu(m; k_1, k_2)\) gives us the true relative ratio between \(a_{k_1}\) and \(a_{k_2}\). Specifically, from the definition of \(\nu\), we know that \(\frac{a_{k_2}}{a_{k_1}} = \frac{1-m}{m}\). We can therefore simply calculate the ratio between \(a_1\) and all other weights to reconstruct an estimate of the true metric. A Python implementation of this algorithm is provided below.

code
import numpy as np

def rbo_dlpm(m, k1, k2, k):
    """
    This constructs DLPM weights for the upper boundary of the
    restricted diagonal confusions, given a parameter m.
    This is equivalent to \nu(m; k1, k2)
    
    Inputs:
    - m: parameter (between 0 and 1) for the upper boundary
    - k1: first axis for this  face
    - k2: second axis for this face
    - k: number of classes
    Outputs:
    - DLPM weights for this point on the upper boundary
    """
    new_a = np.zeros(k)
    new_a[k1] = m
    new_a[k2] = 1 - m
    return new_a

def dlpm_elicitation(epsilon, oracle, get_d, k):
    """
    Inputs:
    - epsilon: some epsilon > 0 representing threshold of error
    - oracle: some function that accepts 2 confusion matrices and
        returns true if the first is preferred and false otherwise
    - get_d: some function that accepts dlpm weights and returns 
        diagonal confusions
    - k: number of classes
    Outputs:
    - estimate for true DLPM weights
    """
    a_hat = np.zeros(k)
    a_hat[0] = 1
    for i in range(1, k):
        # iterate over each axis to find appropriate ratio
        a = 0  # lower bound of binary search
        b = 1  # upper bound of binary search

        while (b - a > epsilon):
            c = (3 * a + b) / 4
            d = (a + b) / 2
            e = (a + 3 * b) / 4

            # get diagonal confusions for each point
            d_a, d_c, d_d, d_e, d_b = (get_d(rbo_dlpm(x, 0, i, k)) 
                for x in [a, c, d, e, b])

            # query oracle for each pair
            response_ac = oracle(d_a, d_c)
            response_cd = oracle(d_c, d_d)
            response_de = oracle(d_d, d_e)
            response_eb = oracle(d_e, d_b)

            # update ranges to keep the peak
            if response_ac:
                b = d
            elif response_cd:
                b = d
            elif response_de:
                a = c
                b = e
            elif response_eb:
                a = d
            else:
                a = d

        midpt = (a + b) / 2
        a_hat[i] = (1 - midpt) / midpt
    return a_hat / np.sum(a_hat)

To use this algorithm for metric elicitation on a real dataset, we need to supply the “oracle” and “get_d” functions. The oracle function is an interface to an expert who judges which of two confusion matrices (here, vectors of diagonal confusions) is better. The get_d function needs to construct a classifier from the DLPM weights, following the RBO classifier of Equation 3.13, and compute its diagonal confusions on a validation set.
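
For concreteness, one possible way to build get_d is sketched below. It assumes access to a class-probability estimator eta(x) that returns a length-k vector of estimates of \(\mathbb{P}(Y = i \mid X = x)\) and a validation set; make_get_d is our own name, and the weighted argmax follows the form of the RBO classifier in Equation 3.13. The returned closure can be passed directly as the get_d argument of dlpm_elicitation.

code
import numpy as np

def make_get_d(eta, X_val, y_val, k):
    """
    Returns a get_d function for dlpm_elicitation. Given DLPM weights a,
    the induced classifier predicts argmax_i a_i * eta_i(x) (Equation 3.13,
    restricted to the classes with nonzero weight), and the diagonal
    confusions are estimated on the validation set.
    """
    n = len(y_val)

    def get_d(a):
        d = np.zeros(k)
        for x, y in zip(X_val, y_val):
            pred = int(np.argmax(a * eta(x)))  # weighted argmax over classes
            if pred == y:
                d[y] += 1.0 / n  # empirical P(Y = y, h = y)
        return d

    return get_d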

Using the same oracle feedback noise model from the binary metric elicitation, we can make the following guarantees:

proposition

Given \(\epsilon, \epsilon_\Omega \geq 0\) and a 1-Lipschitz DLPM \(\varphi^*\) parameterized by \(\vec{a}^*\), the output \(\hat{a}\) of the DLPM elicitation algorithm after \(O((k-1)\log\frac{1}{\epsilon})\) queries to the oracle satisfies \(||\vec{a}^* - \hat{a}||_\infty \leq O(\epsilon + \sqrt{\epsilon_\Omega})\), which implies \(||\vec{a}^* - \hat{a}||_2 \leq O(\sqrt{k}(\epsilon + \sqrt{\epsilon_\Omega}))\).

In other words, the maximum difference between the estimated and true weights along any component (the \(L_\infty\) norm) is linearly bounded by the sum of the tolerance \(\epsilon\) specified by the algorithm and the square root of the oracle noise \(\epsilon_\Omega\).

3.5 Case Study 3: Active Preference Learning in Robotics

How exactly do robots learn human preferences from pairwise comparisons alone, when they ultimately need to act in the environment themselves? The comparisons help the robot learn the human's reward function, which then guides its actions in real settings. Suppose there are two trajectories \(\xi_A\) and \(\xi_B\) that might be taken as the next course of action in some context, like choosing the next turn or choosing the next ChatGPT response. The robot offers both to a human for comparison. To answer which is better, the human asks themselves whether \(R(\xi_A)\) or \(R(\xi_B)\) is bigger, where \(R(\xi) = w \cdot \phi(\xi)\) is the reward function. Here, \(w\) and \(\phi(\xi)\) are vectors of weights and of trajectory features, so we can equivalently write:

\[R(\xi) = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix} \cdot \begin{bmatrix} \phi_1(\xi) \\ \phi_2(\xi) \\ \vdots \\ \phi_N(\xi) \end{bmatrix} \tag{3.14}\]

If a person says they prefer \(\xi_1\) over \(\xi_2\), this means \(R(\xi_2) < R(\xi_1) \implies w \cdot \phi(\xi_2) < w \cdot \phi(\xi_1) \implies 0 < w \cdot (\phi(\xi_1) - \phi(\xi_2)) \implies 0 < w \cdot \Phi\), where \(\Phi := \phi(\xi_1) - \phi(\xi_2)\). Conversely, if they prefer \(\xi_2\) over \(\xi_1\), the signs flip, yielding \(0 > w \cdot \Phi\). The two outcomes can be represented in the \(N\)-dimensional weight space: the boundary \(w \cdot \Phi = 0\) splits it into two half-spaces, each indicating a preference for one of the two options. In two dimensions, a single query splits the plane into two halves; in higher dimensions, the dividing line becomes a separating hyperplane. If we fully trusted each answer, we would discard every weight vector on the side of the hyperplane that disagrees with the stated preference. But since humans are noisy, that approach is not optimal; most applications instead up-weight the indicated side of the hyperplane, to emphasize that points on that side are more plausible, and down-weight the other side, which disagrees with the provided comparison.

How should one choose which queries to conduct; in other words, what is the most informative sequence of queries? After one query, the next query should be roughly orthogonal to the previous one so that the space of weights consistent with the preferences is cut in half. The intuition is that this consistent space contains all reward functions that agree with the answers provided so far, so shrinking it narrows down the possible options for a specific human's reward function. In Figure 3.5, the original query carves out the blue region and a new one carves out the red region; their purple intersection is the part still consistent with both answers, and the purple portion is exactly half of the blue portion.

Figure 3.5: Creating further comparisons limits the space that agrees with answers to all of them. The blue area demonstrates a preference for object 1 over object 2. The red area demonstrates a preference for object 3 over object 4. Combination (purple area) shows the space that is consistent with both of those preferences.

Mathematically, following (Biyik and Sadigh 2018), this can be expressed via the set \(F\) of potential queries \(\phi\), where \(F = \{\phi: \phi = \Phi(\xi_A) - \Phi(\xi_B),\ \xi_A, \xi_B \in \Xi\}\) (a query is the difference between the features of two trajectories). Using this, the authors define a human update function \(f_{\phi}(w) = \min(1, \exp(w^{\top}\phi))\) that captures how much of the space remains consistent with the preference. For a specific query, they define the minimum volume removed as \(\min\{\mathbb{E}[1 - f_{\phi}(w)], \mathbb{E}[1 - f_{-\phi}(w)]\}\) (the smaller of the expected volumes removed under the two possible answers; compare the purple area in Figure 3.5), and the goal is to maximize this quantity over all possible queries, since removing as much space as possible narrows down the options for the reward function: \(\max_{\phi} \min\{ \mathbb{E}[1 - f_{\phi}(w)], \mathbb{E}[1 - f_{-\phi}(w)]\}\). Effectively, this finds the \(\phi\) that maximizes the information gained from the next comparison query. While this approach uses minimum volume removed, other objectives can be placed inside the \(\max\). Some applications, like movie recommendations, need no extra constraints, whereas in robotics one might add constraints so that the resulting query respects the dynamics of the physical world.
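
To make the selection rule concrete, here is a minimal sketch under a few assumptions of our own: candidate queries are given as feature-difference vectors, and the current belief over \(w\) is represented by a matrix W of sampled weight vectors (one per row) together with importance weights; all helper names are ours.

code
import numpy as np

def min_volume_removed(phi, W, weights):
    """
    Estimate min{E[1 - f_phi(w)], E[1 - f_{-phi}(w)]} over weighted samples
    of w, with f_phi(w) = min(1, exp(w . phi)).
    """
    s = W @ phi
    f_pos = np.minimum(1.0, np.exp(s))
    f_neg = np.minimum(1.0, np.exp(-s))
    weights = weights / np.sum(weights)
    return min(np.sum(weights * (1 - f_pos)), np.sum(weights * (1 - f_neg)))

def select_query(candidate_phis, W, weights):
    """Pick the feature difference phi that maximizes the minimum volume removed."""
    scores = [min_volume_removed(phi, W, weights) for phi in candidate_phis]
    return candidate_phis[int(np.argmax(scores))]

def update_after_answer(weights, W, phi, preferred_first):
    """
    Soft update after the human answers: up-weight samples of w consistent
    with the stated preference and down-weight the rest, using f.
    """
    signed = phi if preferred_first else -phi
    return weights * np.minimum(1.0, np.exp(W @ signed))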

An early concrete example of learning reward functions from pairwise comparisons is a 2D driving simulator from (Biyik and Sadigh 2018). The setting is a three-lane road with the orange car controlled by the computer. Each query presents two different trajectories to the human, who is asked which one is better. When choosing the features that contribute to the reward function, it is important to remember that the robot may not find the same information as informative for learning as a human would. In this example, the features included the distance to lane boundaries, the distance to other cars, and the heading and speed of the controlled car. According to the authors, the last feature received the highest weight, since it takes considerable effort for the car to change or correct its direction.

At the start of the learning process, the car had not yet learned which direction to go and wandered all over the road. In the middle of learning, after 30 queries, it had learned to follow the direction of the road and drive straight, but it still experienced collisions. After 70 queries, it had learned to avoid collisions and to keep within its lane without swerving.

3.5.0.1 Active Learning for Pairwise Comparisons

We have discussed that pairwise comparisons should be selected to maximize the minimum volume of remaining options removed. A natural question arising from the driving example is whether it really matters to follow that objective, or whether randomly chosen queries perform just as well. It turns out that most AL algorithms (which select queries purposefully) eventually converge to the performance of random query selection, so in the long run the performance is similar. What differs is that AL achieves better performance earlier, which can be a critical factor in time-sensitive tasks. One example of such a setting is exoskeletons used for rehabilitation after surgery (Li et al. 2021). Different people have significantly different walking patterns and rehabilitation requirements, so the exoskeleton needs to adapt to the human as quickly as possible for the rehabilitation to succeed. Figure 3.6 demonstrates the difference in the time needed between the two approaches. In robotics generally, time differences that seem small to a human can be detrimental to the final performance.

Figure 3.6: Performance of AL and random query selection algorithms in the task of exoskeleton learning with human preferences. (Li et al. 2021)

In conclusion, pairwise comparisons are an effective way of learning linear reward functions, but they have limitations that can be mitigated by incorporating approaches such as AL. Doing so reduces the time needed to reach good performance in applications such as exoskeleton adjustment, and helps reach a middle ground between polar behaviors in applications such as negotiations.

3.5.1 Application: Guiding Human Demonstrations in Robotics

A strong approach to learning policies for robotic manipulation is imitation learning, the technique of learning behaviors from human demonstrations. In particular, interactive imitation learning lets a group of humans contribute their own demonstrations for a task, enabling scalable data collection. However, not all groups of demonstrators are equally helpful for interactive imitation learning.

The ideal set of demonstrations for imitation learning would follow a single, optimal method for performing the task, which a robot could learn to mimic. Conversely, multimodality, the presence of multiple optimal methods in the demonstration set, is challenging for imitation learning, since the learner must reconcile contradictory information about how to accomplish the task. A common cause of multimodality is that different people often subconsciously choose different paths for execution, as illustrated in Figure 3.7.

Figure 3.7: Examples of two different ways to insert a nut onto a round peg. The orange demonstration picks up the nut from the hole while the blue demonstration picks up the nut from the side (Gandhi et al. 2022)

Gandhi et al. (Gandhi et al. 2022) identify whether demonstrations are compatible with one another and offer an active elicitation interface that guides humans to provide better demonstrations in interactive imitation learning. Their key motivation is to allow multiple users to contribute demonstrations over the course of data collection by steering them towards compatible demonstrations. To determine whether a demonstration is “compatible” with a base policy trained on prior demonstrations, the researchers measure the likelihood of the demonstrated actions under the base policy and the novelty of the visited states. Intuitively, low-likelihood, low-novelty demonstrations should be excluded, since they represent conflicting modes of behavior on states that the robot can already handle, and are therefore incompatible. This notion of compatibility is used both for filtering a new set of demonstrations and for actively eliciting compatible demonstrations. In the following subsections, we describe the process of estimating compatibility and the active elicitation procedure in more detail.

3.5.1.1 Estimating Compatibility

We want to define a compatibility measure \(\mathcal{M}\) that estimates the performance of the policy obtained by retraining \(\pi_{base}\) on the union of \(\mathcal{D}_{base}\), the known base dataset, and \(\mathcal{D}_{new}\), the newly collected dataset. To make this compatibility measure easy to compute, we use two interpretable quantities: likelihood and novelty. The likelihood of an action \(a_{new}\) in \(\mathcal{D}_{new}\) is measured as the negative mean squared error between the actions predicted by the base policy and the proposed action:

\[likelihood(s_{new}, a_{new}) = -\mathbb{E}[|| \pi_{base}(s_{new}) - a_{new} ||^2_2]. \tag{3.15}\]

The novelty of a state \(s_{new}\) in \(\mathcal{D}_{new}\) is the variance of the actions predicted by the base policy:

\[novelty(s_{new}) = \mathrm{Var}[\pi_{base}(s_{new})]. \tag{3.16}\]

We can plot likelihood and novelty on a 2D plane, as shown in Figure 3.8, and identify thresholds on likelihood and novelty, denoted \(\lambda\) and \(\eta\) respectively. Intuitively, demonstrations with low likelihood in low-novelty states should be excluded, because this indicates a conflict between the base behavior and the new demonstration due to multimodality. Note that in high-novelty states the likelihood should be disregarded, because the base policy does not yet have a concrete idea of how to handle these states and more data is needed.

Figure 3.8: Examples of plots of likelihood and novelty for compatible and incompatible operators (Gandhi et al. 2022)

The final compatibility metric, parameterized by the likelihood and novelty thresholds \(\lambda\) and \(\eta\), is \(\mathcal{M}(\mathcal{D}_{base}, (s_{new}, a_{new})) \in [0, 1]\), defined as:

\[\begin{aligned} \mathcal{M} = \begin{cases} 1 - \min(\frac{\mathbb{E}[|| \pi_{base}(s_{new}) - a_{new} ||^2_2]}{\lambda}, 1) & \text{ if } \text{novelty}(s_{new}) < \eta \\ 1 & \text{ otherwise } \end{cases}. \end{aligned} \tag{3.17}\]

Note that \(\lambda\) and \(\eta\) must be specified by hand. This is done by assuming access to a small set of demonstrations known a priori to be incompatible, and choosing thresholds that remove as many datapoints as possible from the incompatible demonstrations while retaining as many as possible from the compatible ones.
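
The sketch below shows one way Equations 3.15, 3.16, and 3.17 could be computed, under the assumption that the base policy is represented as an ensemble of policies, so that the expectation and variance are taken over ensemble members (with per-dimension variances averaged into a scalar); the function name and the ensemble representation are our choices rather than the paper's exact implementation.

code
import numpy as np

def compatibility(base_policies, s_new, a_new, lam, eta_thresh):
    """
    Compatibility score M for one state-action pair (Equation 3.17).
    base_policies: list of policies, each mapping a state to an action vector.
    lam, eta_thresh: the likelihood and novelty thresholds (lambda and eta).
    """
    preds = np.array([pi(s_new) for pi in base_policies])

    # novelty (Equation 3.16): spread of the predicted actions
    novelty = float(np.mean(np.var(preds, axis=0)))
    if novelty >= eta_thresh:
        return 1.0  # high-novelty state: keep regardless of likelihood

    # negative likelihood (Equation 3.15): mean squared error to the proposed action
    sq_err = float(np.mean(np.sum((preds - np.asarray(a_new)) ** 2, axis=1)))
    return 1.0 - min(sq_err / lam, 1.0)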

3.5.1.2 Case Studies with Fixed Sets

The researchers evaluate the utility of the compatibility metric on three tasks: placing a square nut on a square peg, placing a round nut on a round peg, and opening a drawer and placing a hammer inside. For each task, they train a base policy on a “proficient” operator's demonstrations while sampling trajectories from other operators for the new set. The naive baseline uses all datapoints, while the \(\mathcal{M}\)-Filtered condition uses the compatibility metric to filter out incompatible demonstrations. The results are presented in Table 3.3. \(\mathcal{M}\)-filtering yields equal or greater performance despite using less data than the naive baseline, demonstrating the effectiveness of compatibility-based filtering.

Table 3.3: Success rates (mean/std across 3 training runs) for policies trained on \(\mathcal{D}_{new}\) by using all the data (Naive) or filtering by compatibility (\(\mathcal{M}\)-Filtered) (Gandhi et al. 2022)
| Operator | Square Nut, Naive | Square Nut, \(\mathcal{M}\)-Filtered | Round Nut, Naive | Round Nut, \(\mathcal{M}\)-Filtered | Hammer Placement, Naive | Hammer Placement, \(\mathcal{M}\)-Filtered |
|---|---|---|---|---|---|---|
| Base Operator | 38.7 (2.1) | - | 13.3 (2.3) | - | 24.7 (6.1) | - |
| Operator 1 | 54.3 (1.5) | 61.0 (4.4) | 26.7 (11.7) | 32.0 (12.2) | 38.0 (2.0) | 39.7 (4.6) |
| Operator 2 | 40.3 (5.1) | 42.0 (2.0) | 22.0 (7.2) | 26.7 (5.0) | 33.3 (3.1) | 32.7 (6.4) |
| Operator 3 | 37.3 (2.1) | 42.7 (0.6) | 17.3 (4.6) | 18.0 (13.9) | 8.0 (0.0) | 12.0 (0.0) |
| Operator 4 | 27.3 (3.5) | 37.3 (2.1) | 7.3 (4.6) | 13.3 (1.2) | 4.0 (0.0) | 4.0 (0.0) |
Figure 3.9: The phases of the active elicitation interface: (a) initial prompting, (b) demonstrations with live feedback, and (c) corrective feedback (Gandhi et al. 2022)

3.5.1.3 Actively Eliciting Compatible Demonstrations

In the previous section, we assumed access to a dataset that had already been collected and saw how filtering out incompatible demonstrations improves performance. However, when collecting a new dataset, it would be better to ensure that operators provide compatible demonstrations from the start, allowing us to retain as much data as possible for training.

To actively elicit compatible demonstrations, the researchers set up a pipeline for live feedback and examples. At the start, operators are given a task specification and some episodes to practice using the robot. Then, the active elicitation process begins, as shown in Figure 3.9. Each operator is shown some rollouts of the base policy to understand the style of the base operator. Next, the operator provides a demonstration similar to the ones they were shown. As they record their demonstrations, the interface provides online feedback, with green indicating compatible actions and red indicating incompatible actions. If the number of incompatible state-action pairs (ones where \(\mathcal{M}\) is zero) exceeds 5% of the demonstration length, the demonstration is rejected. However, to provide corrective feedback, the interface shows the areas of the demonstration with the highest average incompatibility and also provides an expert demo that shows what should actually be done. Demonstrators can use this feedback to provide more compatible demonstrations moving forward.
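
As an illustration of this acceptance rule, the following sketch (review_demo and the window size are our own assumptions) rejects a demonstration when more than 5% of its state-action pairs are fully incompatible and reports the window with the lowest average compatibility for corrective feedback.

code
def review_demo(scores, reject_frac=0.05, window=10):
    """
    scores: per-step compatibility values M for one recorded demonstration.
    Returns (accept, worst_start): whether to accept the demonstration, and
    the start index of the lowest-compatibility window for corrective feedback.
    """
    n = len(scores)
    if n == 0:
        return True, 0

    incompatible = sum(1 for m in scores if m == 0)
    accept = incompatible <= reject_frac * n

    worst_start, worst_avg = 0, float("inf")
    for start in range(max(1, n - window + 1)):
        chunk = scores[start:start + window]
        avg = sum(chunk) / len(chunk)
        if avg < worst_avg:
            worst_start, worst_avg = start, avg
    return accept, worst_start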

This process helps improve demonstration quality in both simulation and real experiments, as shown in Table 3.4. In particular, on the real-robot task, active elicitation outperformed the base policy by 25% and naive data collection by 55%. Overall, active elicitation is a powerful tool for ensuring that the data collected for imitation learning improves the quality of the learned policy.

Table 3.4: Success rates (mean/std across users) for policies trained on \(\mathcal{D}_{new}\) by using all the data (Naive), filtering by compatibility (\(\mathcal{M}\)-Filtered), or using informed demonstration collection (Gandhi et al. 2022)
| Task | Base | Naive | Naive + Filtered | Informed |
|---|---|---|---|---|
| Round Nut | 13.3 (2.3) | 9.6 (4.6) | 9.7 (4.2) | 15.7 (6.0) |
| Hammer Placement | 24.7 (6.1) | 20.8 (15.7) | 22.0 (15.5) | 31.8 (16.3) |
| \([\textup{Real}]\) Food Plating | 60.0 | 30.0 (17.3) | - | 85.0 (9.6) |

A fundamental limitation of eliciting compatible demonstrations is the fact that the “base” demonstrator is considered the ground truth. When the base demonstrator specifies a preference, all other demonstrators must abide by it, even if they have strong preferences against it. For instance, when pouring milk and cereal into a bowl, different people have different preferences for what is the correct order, but active elicitation forces all demonstrators to follow the initial preference of the base operator. The researchers hope that future work can enable users to override the default demonstration set and follow a base behavior that better aligns with their preferences. This could enable multiple modes of behavior to be collected in data while only following a user’s specified preference instead of attempting to collapse all modes into a single policy.

Looking forward, active elicitation provides a foundation for allowing robots to query humans about the type of data needed, enabling more efficient data collection through transparency.

In summary, this chapter has explored the complexities and innovations in interactive active learning as applied to large models within robotics. It begins by investigating pairwise comparisons and their role in efficiently learning linear reward functions from large datasets, overcoming limitations of purely supervised learning. When combined with active learning techniques, these comparisons supply timely, targeted, and context-appropriate feedback, enhancing performance in time-critical applications like exoskeleton adjustments during rehabilitation.

We then shift to imitation learning, or inverse reward learning from demonstrations, emphasizing the difficulties introduced by multimodal demonstration sets. Active elicitation approaches that compile compatible demonstrations, streamlining the learning process by guiding users to provide more valuable and consistent examples, are a promising way to tackle this issue. This method shows promise in refining the interactive imitation learning data collection pipeline, enabling more capable and effective robotic training.

Additionally, the chapter examines the integration of foundation models into robotics, highlighting the transformative innovations of R3M and Voltron. R3M’s pre-training on diverse human activities dramatically improves robotic manipulation with minimal supervision. Meanwhile, Voltron builds on these capabilities by incorporating language-driven representation learning for remarkably adaptable and nuanced robotic task performance. These models represent significant leaps in robotics while opening new frontiers for future research and applications.

References

Amershi, Saleema, Maya Cakmak, W. Bradley Knox, and Todd Kulesza. 2014. “Power to the People: The Role of Humans in Interactive Machine Learning.” AI Magazine.
Beluch, William H., Tim Genewein, A. Nürnberger, and Jan M. Köhler. 2018. “The Power of Ensembles for Active Learning in Image Classification.” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9368–77. https://api.semanticscholar.org/CorpusID:52838058.
Biyik, Erdem, and Dorsa Sadigh. 2018. “Batch Active Preference-Based Learning of Reward Functions.” In Proceedings of the 2nd Conference on Robot Learning, edited by Aude Billard, Anca Dragan, Jan Peters, and Jun Morimoto, 87:519–28. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v87/biyik18a.html.
Braziunas, Darius, and Craig Boutilier. 2012. “Minimax Regret Based Elicitation of Generalized Additive Utilities.” https://arxiv.org/abs/1206.5255.
Cai, Wenbin, Ya Zhang, and Jun Zhou. 2013. “Maximizing Expected Model Change for Active Learning in Regression.” In 2013 IEEE 13th International Conference on Data Mining, 51–60. https://doi.org/10.1109/ICDM.2013.104.
Cohn, David A., Zoubin Ghahramani, and Michael I. Jordan. 1996. “Active Learning with Statistical Models.” CoRR cs.AI/9603104. https://arxiv.org/abs/cs/9603104.
Jamieson, Kevin G., and Robert Nowak. 2011. “Active Ranking Using Pairwise Comparisons.” Advances in Neural Information Processing Systems 24.
Gandhi, Kanishk, Siddharth Karamcheti, Madeline Liao, and Dorsa Sadigh. 2022. “Eliciting Compatible Demonstrations for Multi-Human Imitation Learning.” In Proceedings of the 6th Conference on Robot Learning (CoRL).
Geman, Stuart, Elie Bienenstock, and René Doursat. 1992. “Neural Networks and the Bias/Variance Dilemma.” Neural Computation 4: 1–58. https://api.semanticscholar.org/CorpusID:14215320.
Guillory, Andrew, and Jeff Bilmes. 2011. “Simultaneous Learning and Covering with Adversarial Noise.” ICML.
Hiranandani, Gaurush, Shant Boodaghians, Ruta Mehta, and Oluwasanmi Koyejo. 2019a. “Performance Metric Elicitation from Pairwise Classifier Comparisons.” In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, edited by Kamalika Chaudhuri and Masashi Sugiyama, 89:371–79. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v89/hiranandani19a.html.
Hiranandani, Gaurush, Shant Boodaghians, Ruta Mehta, and Oluwasanmi O Koyejo. 2019b. “Multiclass Performance Metric Elicitation.” In Advances in Neural Information Processing Systems, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett. Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/1fd09c5f59a8ff35d499c0ee25a1d47e-Paper.pdf.
Hiranandani, Gaurush, Harikrishna Narasimhan, and Sanmi Koyejo. 2020. “Fair Performance Metric Elicitation.” In Advances in Neural Information Processing Systems, edited by H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, 33:11083–95. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2020/file/7ec2442aa04c157590b2fa1a7d093a33-Paper.pdf.
Holladay, Rachel, Shervin Javdani, Anca Dragan, and Siddhartha Srinivasa. 2016. “Active Comparison Based Learning Incorporating User Uncertainty and Noise.” Proceedings of RSS ’16 Workshop on Model Learning for Human-Robot Communication.
Houlsby, Neil, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. 2011. “Bayesian Active Learning for Classification and Preference Learning.” arXiv Preprint arXiv:1112.5745.
Jarl, Sanna, Linus Aronsson, Sadegh Rahrovani, and Morteza Haghir Chehreghani. 2021. “Active Learning of Driving Scenario Trajectories.” Eng. Appl. Artif. Intell. 113: 104972. https://api.semanticscholar.org/CorpusID:249113683.
Li, Kejun, Maegan Tucker, Erdem Biyik, Ellen Novoseller, Joel W. Burdick, Yanan Sui, Dorsa Sadigh, Yisong Yue, and Aaron D. Ames. 2021. “ROIAL: Region of Interest Active Learning for Characterizing Exoskeleton Gait Preference Landscapes.” In 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE. https://doi.org/10.1109/icra48506.2021.9560840.
Margatina, Katerina, Timo Schick, Nikolaos Aletras, and Jane Dwivedi-Yu. 2023. “Active Learning Principles for in-Context Learning with Large Language Models.” ArXiv abs/2305.14264. https://api.semanticscholar.org/CorpusID:258841313.
Mas-Colell, Andreu. 1977. “The Recoverability of Consumers’ Preferences from Market Demand Behavior.” Econometrica 45 (6): 1409–30. http://www.jstor.org/stable/1912308.
Mussmann, Stephen, Julia Reisler, Daniel Tsai, Ehsan Mousavi, Shayne O’Brien, and Moises Goldszmidt. 2022. “Active Learning with Expected Error Reduction.” https://arxiv.org/abs/2211.09283.
Narasimhan, Harikrishna, Harish Ramaswamy, Aadirupa Saha, and Shivani Agarwal. 2015. “Consistent Multiclass Algorithms for Complex Performance Measures.” In Proceedings of the 32nd International Conference on Machine Learning, edited by Francis Bach and David Blei, 37:2398–2407. Proceedings of Machine Learning Research. Lille, France: PMLR. https://proceedings.mlr.press/v37/narasimhanb15.html.
Samuelson, P. A. 1938. “A Note on the Pure Theory of Consumer’s Behaviour.” Economica 5 (17): 61–71. http://www.jstor.org/stable/2548836.
Shepard, Roger N. 1957. “Stimulus and Response Generalization: A Stochastic Model Relating Generalization to Distance in Psychological Space.” Psychometrika 22(4):325–345.
Singh, Aarti, Robert D. Nowak, and Parameswaran Ramanathan. 2006. “Active Learning for Adaptive Mobile Sensing Networks.” 2006 5th International Conference on Information Processing in Sensor Networks, 60–68. https://api.semanticscholar.org/CorpusID:17590956.
Tamburrelli, Giordano, and Alessandro Margara. 2014. “Towards Automated a/b Testing.” In Search-Based Software Engineering. https://doi.org/10.1007/978-3-319-09940-8_13.
Taylor, Annalisa T., Thomas A. Berrueta, and Todd D. Murphey. 2021. “Active Learning in Robotics: A Review of Control Principles.” ArXiv abs/2106.13697. https://api.semanticscholar.org/CorpusID:235652039.
Varian, Hal R. 2006. “Revealed Preference.” In The SAGE Encyclopedia of Business Ethics and Society. https://api.semanticscholar.org/CorpusID:1632873.
Viappiani, Paolo, and Craig Boutilier. 2010. “Optimal Bayesian Recommendation Sets and Myopically Optimal Choice Query Sets.” NIPS, 2352–60.
Yang, Sitan, and Daniel Q. Naiman. 2014. “Multiclass Cancer Classification Based on Gene Expression Comparison.” Statistical Applications in Genetics and Molecular Biology 13 (4): 477–96. https://doi.org/10.1515/sagmb-2013-0053.
Zhu, Jingbo, Huizhen Wang, Benjamin Ka-Yin T’sou, and Matthew Y. Ma. 2010. “Active Learning with Sampling by Uncertainty and Density for Data Annotations.” IEEE Transactions on Audio, Speech, and Language Processing 18: 1323–31. https://api.semanticscholar.org/CorpusID:5777911.