3 Elicitation
By the end of this chapter you will be able to:
- Define the measurement problem for user preference elicitation and distinguish it from the learning problem in Chapter 2.
- Derive the Fisher information for Rasch/Bradley-Terry models and explain why items with \(p \approx 0.5\) are most informative.
- Compare A-optimal, D-optimal, and E-optimal design criteria for selecting informative queries.
- Apply Laplace approximation to update posterior beliefs about user preferences in real-time.
- Implement active selection policies that outperform random sampling in sample efficiency.
- Extend Fisher information analysis to non-linear scoring functions via local linearization.
- Apply information-theoretic acquisition functions to actively select queries for Gaussian Process preference models, building on the GP foundations from Chapter 2.
- Explain how D-optimal design applies to Direct Preference Optimization (DPO) for LLM alignment and derive the ADPO algorithm.
- Analyze active preference learning in robotics, including trajectory comparison and reward function inference.
3.1 Chapter Overview
Chapter 1 developed models for how preferences arise from latent utilities. Chapter 2 showed how to estimate these utilities from observed comparisons. This chapter addresses the complementary question: given that human feedback is expensive, how should we choose which comparisons to request?
This is the measurement problem in preference elicitation. Unlike the learning problem (which assumes data is given), the measurement problem actively designs experiments to maximize information gained per query. Consider training an LLM via RLHF: collecting 100,000 human preference labels is expensive and time-consuming. If we could achieve the same alignment quality with 20,000 strategically selected comparisons, we would save 80% of annotation cost.
The central insight is that not all comparisons are equally informative. Asking whether humans prefer “Hello!” to “Hi there!” reveals little about model quality. Comparing two substantively different responses to a complex prompt provides more signal. The mathematical framework of Fisher information and optimal experimental design formalizes this intuition.
Chapter Structure. We build from simple to complex settings:
- Scalar preferences (Section 3.2): The Rasch model and Fisher information
- Multi-dimensional preferences (Section 3.2.1): Factor models and optimal design criteria (A/D/E-optimal)
- Preference functions (Section 3.3): Linear and non-linear scoring functions
- Nonparametric models (Section 3.4): GP-based active learning with information-theoretic acquisition
- LLM alignment (Section 3.5): Active DPO for large language models
- Applications: Metric elicitation (Section 3.2.3) and robotics (Section 3.3.1)
Connections to Prior Chapters:
| Concept | Chapter 1-2 | This Chapter |
|---|---|---|
| Bradley-Terry model | Likelihood, MLE (Section 1.6.2) | Fisher information, query selection |
| Factor models \(U^\top V\) | Parameter estimation (Section 1.6) | D-optimal pair selection |
| GP preference models | Laplace inference (Section 2.4.2) | Information-gain acquisition |
| DPO loss | Objective function (Section 1.2) | ADPO active selection |
Historical Notes. Active learning for preferences has roots in multiple fields:
- Psychometrics (1960s): Computer-adaptive testing used Fisher information to select test items matching student ability
- Optimal experimental design (1970s): Fedorov and Kiefer developed A/D/E-optimality for linear models
- Bayesian experimental design (1990s): Chaloner & Verdinelli unified information-theoretic approaches
- RLHF/Active DPO (2020s): Applying these ideas to LLM alignment at scale
“The purpose of optimal design is to achieve the desired precision with minimum cost.” — V.V. Fedorov (1972)
3.2 Measurement of User Preference Vector
When items’ appeals \(V_j\) are known or pre-calibrated, we can learn a new user’s appetite \(U\) quickly by choosing informative items online. Asking only easy or only hard items gives near-deterministic responses that teach little. The key tool for quantifying “informativeness” is Fisher information.
Definition 3.1 For a parametric model \(p(Y \mid \theta)\), the Fisher Information about parameter \(\theta\) is: \[ \mathcal{I}(\theta) = \mathbb{E}\left[-\frac{\partial^2}{\partial \theta^2} \log p(Y \mid \theta)\right] = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log p(Y \mid \theta)\right)^2\right] \] Fisher Information quantifies how much a single observation reveals about \(\theta\). Higher Fisher Information means more precise estimation from each observation. The Cramér-Rao bound states that the variance of any unbiased estimator is at least \(\mathcal{I}(\theta)^{-1}\).
The Rasch model makes this precise. Recall from Section 1.6 that for a candidate item \(j\), the Bernoulli success probability is \(p_j(U) = \sigma(U + V_j).\) The Fisher information about \(U\) from a single response is \[\mathcal I_j(U) = \mathbb{E} \left[-\frac{\partial^2}{\partial U^2}\log p(Y \mid U, V_j)\right] = p_j(U) (1 - p_j(U)). \tag{3.1}\]
Proposition 3.1 For the Rasch model with \(p_j(U) = \sigma(U + V_j)\), the Fisher Information \(\mathcal{I}_j(U) = p_j(1-p_j)\) is maximized when \(p_j(U) = 0.5\), which occurs when item difficulty \(-V_j\) equals user ability \(U\).
The function \(f(p) = p(1-p)\) is a downward parabola with roots at \(p=0\) and \(p=1\). Taking the derivative: \(f'(p) = 1 - 2p = 0 \Rightarrow p = 0.5\). At this point, \(f(0.5) = 0.25\), which is the maximum. Since \(p_j(U) = \sigma(U + V_j) = 0.5\) when \(U + V_j = 0\), the optimal item has \(V_j = -U\).
A user has estimated ability \(\hat{U} = 1.0\). We have three candidate items with appeals \(V_1 = -2.0\) (hard), \(V_2 = -1.0\) (matched), and \(V_3 = 1.0\) (easy). Which item should we ask next?
Step 1. Compute predicted success probabilities: \[p_1 = \sigma(1.0 + (-2.0)) = \sigma(-1.0) \approx 0.269\] \[p_2 = \sigma(1.0 + (-1.0)) = \sigma(0.0) = 0.500\] \[p_3 = \sigma(1.0 + 1.0) = \sigma(2.0) \approx 0.881\]
Step 2. Compute Fisher information for each item: \[\mathcal{I}_1 = p_1(1-p_1) = 0.269 \times 0.731 \approx 0.197\] \[\mathcal{I}_2 = p_2(1-p_2) = 0.500 \times 0.500 = 0.250\] \[\mathcal{I}_3 = p_3(1-p_3) = 0.881 \times 0.119 \approx 0.105\]
Step 3. Select the most informative item: \(j^* = \arg\max_j \mathcal{I}_j = 2\) (the matched item).
Interpretation: Item 2, whose difficulty \(-V_2 = 1.0\) matches the user’s ability \(\hat{U} = 1.0\), is most informative because it produces responses closest to 50-50. The easy item (\(V_3\)) is least informative — the user almost certainly succeeds, so the response tells us little about their ability. This is the principle behind computerized adaptive testing: the test “follows” the student, always selecting questions near their ability level.
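The arithmetic above is easy to check in code (a short numpy sketch; `sigmoid` and the item values simply mirror the worked example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

U_hat = 1.0
V = np.array([-2.0, -1.0, 1.0])   # item appeals: hard, matched, easy

p = sigmoid(U_hat + V)            # predicted success probabilities (Step 1)
info = p * (1.0 - p)              # Fisher information per item (Step 2)
best = int(np.argmax(info))       # most informative item (Step 3)

print(p.round(3), info.round(3), best)
```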
This result has a natural interpretation: items that are too easy (\(p \approx 1\)) or too hard (\(p \approx 0\)) yield near-deterministic responses that teach little about the user. Items matched to the user’s ability (\(p \approx 0.5\)) are maximally informative. So the best next item is one whose difficulty \(-V_j\) is close to the current estimate of \(U\). Active elicitation policies therefore “hover” around the user’s current location, rapidly shrinking uncertainty.

Place a Gaussian prior \(U \sim \mathcal N(0,\sigma_0^2)\). After observing independent items \(\mathcal J\), a Laplace (or IRLS) update for \(U\) uses the score and observed information \[S(U) = \sum_{j\in\mathcal J} (y_j - p_j(U)), \qquad \mathcal I(U) = \sum_{j\in\mathcal J}p_j(U) (1 - p_j(U)) + \tau_0, \quad \tau_0 = \sigma_0^{-2}. \tag{3.2}\] A Newton step is \(\hat U \leftarrow \hat U + S(\hat U)\mathcal I(\hat U)^{-1}\), and the approximate posterior variance is \(\widehat{\mathrm{Var}}(U) \approx \mathcal I(\hat U)^{-1}\). An item-selection rule that maximizes expected information picks \(j_t = \arg\max_j \; p_j(\hat U_{t-1}) (1 - p_j(\hat U_{t-1}))\).

For evaluation, we can use reliability. With population variance \(\mathrm{Var}(U^\star)=\sigma_U^2\), reliability is the fraction of total variance explained by the test rather than estimation noise: \[ \text{Rel} \approx 1 - \frac{1}{N}\sum_{i=1}^N \frac{\widehat{\mathrm{Var}}(U_i)}{\sigma_U^2}. \tag{3.3}\] Below, the code simulates many users and a bank of calibrated items, comparing a Fisher-active policy to a random policy. We track reliability after each query count \(t\).
First, we set up the simulation with \(N\) users, \(M\) calibrated items, and define helper functions for Bayesian updating:
Next, we define the Bayesian update rule and the two item selection policies—Fisher-active (maximizing information) and random:
Now we run the simulation, comparing both policies across all users:
Finally, we visualize the results comparing Fisher-active vs random item selection:
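The four steps just outlined can be condensed into one self-contained sketch (numpy only; the population sizes, seed, and item bank are illustrative choices, a printed summary stands in for the figures, and the Gaussian-prior gradient is added to the score of Eq. 3.2, where it is left implicit):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

N, M, T = 200, 300, 15          # users, calibrated items, queries per user
sigma_U = 1.0                   # population s.d. of true abilities
tau0 = 1.0                      # prior precision sigma_0^{-2}
U_true = rng.normal(0.0, sigma_U, N)
V = rng.uniform(-3.0, 3.0, M)   # known (pre-calibrated) item appeals

def laplace_step(U_hat, asked, ys):
    # One Newton step (Eq. 3.2), with the Gaussian-prior gradient added
    p = sigmoid(U_hat + V[asked])
    score = np.sum(ys - p) - tau0 * U_hat
    info = np.sum(p * (1.0 - p)) + tau0
    return U_hat + score / info, 1.0 / info   # new mode, posterior variance

def run(policy):
    post_var = np.zeros((N, T))
    for i in range(N):
        U_hat, asked, ys = 0.0, [], []
        for t in range(T):
            avail = np.setdiff1d(np.arange(M), asked)
            if policy == "fisher":  # maximize p(1-p) at the current estimate
                p = sigmoid(U_hat + V[avail])
                j = int(avail[np.argmax(p * (1.0 - p))])
            else:                   # random baseline
                j = int(rng.choice(avail))
            asked.append(j)
            ys.append(float(rng.random() < sigmoid(U_true[i] + V[j])))
            U_hat, var = laplace_step(U_hat, np.array(asked), np.array(ys))
            post_var[i, t] = var
    return 1.0 - post_var.mean(axis=0) / sigma_U**2   # reliability (Eq. 3.3)

rel_fisher, rel_random = run("fisher"), run("random")
print(f"reliability after {T} queries: fisher={rel_fisher[-1]:.3f}, "
      f"random={rel_random[-1]:.3f}")
```

With these settings the Fisher-active policy reaches a given reliability in noticeably fewer queries than random selection, which is the qualitative pattern the plots in this section are meant to show.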
Fisher-active item selection achieves higher reliability with fewer queries than random selection. The gain comes from targeting items near \(p=0.5\), where each response is maximally informative. This principle—measure where uncertainty is highest—underlies all active learning methods in this chapter.
The Fisher information framework above assumes item parameters (\(V_j\), \(Z_j\)) are known or pre-calibrated before elicitation begins. In practice, item parameters must often be estimated jointly with user preferences—this is the cold-start problem in recommender systems. When item parameters are uncertain, errors in item calibration propagate into suboptimal query selection: the system may ask about items it thinks are informative but are not. Joint estimation of user and item parameters requires different approaches, such as alternating optimization or fully Bayesian models that maintain uncertainty over both.
3.2.1 Factor Models: Multi-Dimensional Preferences
In practice, a single “ability” parameter may not capture the diversity of user preferences. Factor models generalize Rasch by giving each user a \(K\)-dimensional embedding \(U_i \in \mathbb R^K\) and each item a \(K\)-dimensional embedding \(V_j \in \mathbb R^K\), plus optional bias \(Z_j\). The linear predictor is \(H_{ij} = U_i^\top V_j + Z_j\) where \(p(Y_{ij}=1 \mid U_i, V_j, Z_j) = \sigma(H_{ij}).\)
A concrete instance is performance metric elicitation (Hiranandani et al. 2019). In binary classification, practitioners often have implicit preferences over classifiers that don’t match standard metrics like accuracy or F1—for instance, in medical diagnosis, the cost of a false negative (missed disease) may differ greatly from a false positive (unnecessary treatment). This is a \(K=2\) factor model where “items” are classifiers with known confusion matrices \(V_\theta = (\text{TP}_\theta, \text{TN}_\theta)\), and the “user preference” is the unknown metric weights \(\mathbf{m} = (m_{11}, m_{00})\). The utility \(\mathbf{m}^\top V_\theta = m_{11} \cdot \text{TP}_\theta + m_{00} \cdot \text{TN}_\theta\) is exactly the factor model form with known item parameters.
For a given user \(U\) and an item \(j\) with known parameters, the Bernoulli log-likelihood is \[\ell(U; y, j) = y \log \sigma(H) + (1-y)\log (1-\sigma(H)) \tag{3.4}\] The gradient with respect to \(U\) is \(\nabla_U \ell = (y - \sigma(H)) V_j.\) The expected Fisher Information per item is the negative expected Hessian: \[ \mathcal I_j(U) = \mathbb E[\nabla_U \ell \nabla_U \ell^\top ] = \sigma(H)(1-\sigma(H)) V_j V_j^\top. \tag{3.5}\] So each queried item contributes a rank-1 matrix in the direction of \(V_j\), scaled by the variance term \(\sigma(H)(1-\sigma(H))\). For a sequence of items \(\{j_t\}\), the cumulative information is \[\mathcal I(U) = \sum_t \sigma(H_{j_t})(1-\sigma(H_{j_t})) V_{j_t} V_{j_t}^\top. \tag{3.6}\]
Fisher information quantifies the informativeness of a query under the assumed model. If the model is misspecified—for example, if preferences are not logistic, or if the factor structure has the wrong dimensionality \(K\)—then queries optimized for the Fisher criterion may not be optimal for learning the true preference structure. Under misspecification, information-theoretic criteria such as mutual information or Bayesian approaches that integrate over model uncertainty can be more robust. In practice, it is worth checking sensitivity to the assumed model by comparing active learning performance across different model specifications.
In multiple dimensions, information is a matrix. Classical optimal design offers various criteria for selecting informative queries:
Definition 3.2 Given a Fisher Information matrix \(\mathcal{I}(U)\), the classical optimal experimental design criteria are:
- A-Optimal (Average): Minimize \(\text{tr}(\mathcal{I}^{-1})\)—the average variance of parameter estimates
- D-Optimal (Determinant): Maximize \(\det(\mathcal{I})\)—the volume shrinkage of the posterior ellipsoid
- E-Optimal (Eigenvalue): Maximize \(\lambda_{\min}(\mathcal{I})\)—the precision in the worst-case direction
All three reduce uncertainty, but emphasize different aspects: A-optimal targets average precision, D-optimal targets overall information volume, and E-optimal targets the least-informed direction.
These three criteria optimize different objectives and can select different queries. A-optimal minimizes average variance (good when all parameters matter equally), D-optimal maximizes the determinant of the information matrix (shrinks the posterior ellipsoid volume — good for overall uncertainty), and E-optimal maximizes the minimum eigenvalue (guards against worst-case directions). A common mistake is to treat them as interchangeable. In practice: D-optimal naturally encourages diversity in selected items (redundant items give diminishing information), while E-optimal is conservative (focuses all effort on the least-informed direction). Choose the criterion based on your downstream task — if you need accurate ranking of all items, A-optimal is natural; if you need a single best decision, E-optimal may be preferable.
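A small numeric illustration makes the disagreement concrete (the two information matrices below are hypothetical, chosen so the criteria rank them differently):

```python
import numpy as np

# Two hypothetical information matrices for a 2-parameter problem:
# I1 is extremely informative about one direction only, I2 is balanced.
I1 = np.diag([100.0, 1.0])
I2 = np.diag([9.0, 9.0])

a_crit = lambda I: np.trace(np.linalg.inv(I))    # A-optimal: smaller is better
d_crit = lambda I: np.linalg.det(I)              # D-optimal: larger is better
e_crit = lambda I: np.linalg.eigvalsh(I).min()   # E-optimal: larger is better

print("A:", a_crit(I1), a_crit(I2))   # I2 wins: lower average variance
print("D:", d_crit(I1), d_crit(I2))   # I1 wins: larger information volume
print("E:", e_crit(I1), e_crit(I2))   # I2 wins: better worst direction
```

Here D-optimal prefers the lopsided design while A- and E-optimal prefer the balanced one, so the criteria are genuinely not interchangeable.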
Adaptive measurement picks the next item \(j\) to maximize an acquisition function such as \(\det(\mathcal I+\mathcal I_j)\). We also use reliability as the evaluation metric here. If \(\Sigma_U\) is the population covariance and \(\widehat{\Sigma}_{\text{err}}\) the average posterior covariance, then \(\text{Reliability} = 1 - \mathrm{tr}(\widehat{\Sigma}_{\text{err}}) \mathrm{tr}(\Sigma_U)^{-1}.\) This generalizes the Rasch reliability to multi-dimensional latent traits. Below is a toy simulation with 2-dimensional users and items, comparing random item selection vs D-optimal Fisher active selection:
First, set up the factor model simulation with K-dimensional user and item embeddings:
Define the Bayesian update and D-optimal selection:
Run simulation comparing D-optimal vs random selection:
Visualize results:
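A compact version of the whole loop might look as follows (numpy only; the dimension \(K=2\), the counts, seed, and number of Newton steps are illustrative choices, and a printed summary stands in for the plots):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

K, N, M, T = 2, 100, 200, 20
Sigma_U = np.eye(K)                        # population covariance of users
U_true = rng.multivariate_normal(np.zeros(K), Sigma_U, N)
V = rng.normal(0.0, 1.0, (M, K))           # known item embeddings

def laplace_update(mu, asked, ys):
    # A few Newton (IRLS) steps on the log posterior with a N(0, I) prior
    for _ in range(5):
        p = sigmoid(V[asked] @ mu)
        g = V[asked].T @ (ys - p) - mu
        W = p * (1.0 - p)
        Lam = np.eye(K) + (V[asked].T * W) @ V[asked]
        mu = mu + np.linalg.solve(Lam, g)
    return mu, np.linalg.inv(Lam)

def run(policy):
    err_tr = np.zeros((N, T))
    for i in range(N):
        mu, Sigma, asked, ys = np.zeros(K), np.eye(K), [], []
        for t in range(T):
            avail = [j for j in range(M) if j not in asked]
            if policy == "dopt":   # log-det gain log(1 + w v^T Sigma v)
                p = sigmoid(V[avail] @ mu)
                w = p * (1.0 - p)
                quad = np.einsum("jk,kl,jl->j", V[avail], Sigma, V[avail])
                j = avail[int(np.argmax(np.log1p(w * quad)))]
            else:
                j = int(rng.choice(avail))
            asked.append(j)
            ys.append(float(rng.random() < sigmoid(U_true[i] @ V[j])))
            mu, Sigma = laplace_update(mu, np.array(asked), np.array(ys))
            err_tr[i, t] = np.trace(Sigma)
    # Multi-dimensional reliability: 1 - tr(avg posterior cov) / tr(Sigma_U)
    return 1.0 - err_tr.mean(axis=0) / np.trace(Sigma_U)

rel_d, rel_r = run("dopt"), run("random")
print(f"reliability after {T} queries: d-opt={rel_d[-1]:.3f}, "
      f"random={rel_r[-1]:.3f}")
```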
3.2.2 Pairwise Preferences
We next discuss the measurement problem with pairwise data. Recall from Section 1.6.2 that the Bradley-Terry model specifies the probability that user \(i\) prefers item \(j\) over item \(k\) as \(p_{i}(j \succ k)=\sigma \big(U_i^\top(V_j-V_k) + Z_j-Z_k\big)\) with feature \(x_{jk}= V_j - V_k \in \mathbb R^K\) and offset \(b_{jk}=Z_j-Z_k\). Since we measure one user at a time, we drop the user index below and write \(U\).
In metric elicitation, each oracle query asks: “Which classifier is better, \(h_\theta\) or \(h_{\theta'}\)?” The oracle’s answer is deterministic: \(\Omega(C_\theta, C_{\theta'}) = \mathbf{1}[\mathbf{m}^\top (V_\theta - V_{\theta'}) > 0]\), which is exactly a pairwise comparison with feature difference \(x = V_\theta - V_{\theta'}\). The deterministic oracle corresponds to the \(\sigma \to \text{step function}\) limit of Bradley-Terry.
For a single observation \(y \in \{0,1\}\) of \(j \succ k\) with \(p = p(j \succ k)\), the per-pair log-likelihood is \[\ell(U; y, j, k)= y \log p + (1-y) \log (1-p). \tag{3.7}\]
Score and expected Fisher information for \(U\): \[\nabla_U \ell=(y-p)\,x_{jk}, \qquad \mathcal I_{jk}(U)=\mathbb E[-\nabla^2_U \ell] = w\, x_{jk} x_{jk}^\top, \tag{3.8}\] where \(w = p(1-p)\). For a sequence of queried pairs \(\{(j_t,k_t)\}\) the cumulative information is \(\mathcal I(U)=\sum_t w_t\,x_t x_t^\top.\) With a Gaussian prior \(U \sim \mathcal N(0, \Sigma_0)\) and a Laplace approximation, the posterior precision updates additively: \[ \Lambda_{t}=\Sigma_t^{-1}=\Sigma_{0}^{-1}+\sum_{s\le t} w_s\,x_s x_s^\top. \tag{3.9}\] A single Newton step for the posterior mean after observing \((x,y)\) is \[ \hat U \leftarrow \hat U + \Sigma_{t}\,(y-p)\,x,\qquad p=\sigma(\hat U^\top x+b),\quad \Sigma_t = \left(\Lambda_{t-1}+w\,xx^\top\right)^{-1}. \tag{3.10}\] This is the logistic-regression one-step IRLS update specialized to one new example. We evaluate candidates at the current \((\hat U,\Sigma)\). Write \(x\) for \(x_{jk}\) and \(w=\sigma(\hat U^\top x+b)\big(1-\sigma(\hat U^\top x+b)\big)\), and let \(\Lambda=\Sigma^{-1}\). Using the Sherman–Morrison formula, we obtain closed forms for all three design criteria:
Proposition 3.2 For a current precision matrix \(\Lambda = \Sigma^{-1}\) and a candidate pair \((j,k)\) with feature difference \(x = V_j - V_k\) and variance weight \(w = p(1-p)\), the D-optimal acquisition function has closed form: \[ \Delta_D(j,k) = \log\det(\Lambda + w xx^\top) - \log\det(\Lambda) = \log(1 + w x^\top \Sigma x) \] This follows from the matrix determinant lemma: \(\det(A + uv^\top) = (1 + v^\top A^{-1} u)\det(A)\).
Pick the pair that maximizes \(\log(1 + w x^\top\Sigma x)\). This requires only a vector-matrix-vector product, making it efficient for online selection.
The D-optimal (and A/E-optimal) criteria above are greedy: they select each query to maximize the immediate information gain, without considering how the current query affects future query opportunities. The optimal policy would look ahead over a budget of \(T\) queries and select the sequence that maximizes total information. In practice, the greedy approach works well when queries are roughly exchangeable, but it can perform poorly in batch selection settings (selecting multiple queries simultaneously) because it ignores redundancy—two individually informative queries may provide overlapping information. Batch extensions typically add diversity mechanisms, such as penalizing queries whose feature vectors \(V_j - V_k\) are correlated with already-selected pairs.
A-optimal (trace drop of covariance): \(\Delta_A(j,k) = \mathrm{tr}(\Sigma)-\mathrm{tr}\big((\Lambda+w\,xx^\top)^{-1}\big) = \dfrac{w\,x^\top\Sigma^2 x}{1+w\,x^\top\Sigma x}.\)
E-optimal (raise the worst-direction information): exact update of \(\lambda_{\min}(\mathcal I)\) needs an eigen update. A practical proxy is to maximize \(\Delta_E^{\text{proxy}}(j,k)= w\,x^\top\Sigma x,\) which targets directions where current uncertainty \(x^\top\Sigma x\) is large.
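The D- and A-optimal closed forms are easy to verify numerically against direct recomputation (a standalone check; the precision matrix, weight \(w\), and feature \(x\) below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
K = 3
A = rng.normal(size=(K, K))
Lam = A @ A.T + np.eye(K)            # a generic positive-definite precision
Sigma = np.linalg.inv(Lam)
x = rng.normal(size=K)
w = 0.21                             # p(1-p) for some predicted p

# Direct computation: form the rank-1-updated precision and recompute
Lam_new = Lam + w * np.outer(x, x)
dD_direct = np.log(np.linalg.det(Lam_new)) - np.log(np.linalg.det(Lam))
dA_direct = np.trace(Sigma) - np.trace(np.linalg.inv(Lam_new))

# Closed forms via the matrix determinant lemma / Sherman-Morrison
q = x @ Sigma @ x
dD = np.log1p(w * q)
dA = (w * x @ Sigma @ Sigma @ x) / (1.0 + w * q)

print(dD_direct, dD, dA_direct, dA)
```

The closed forms are what make online selection cheap: scoring a candidate costs one quadratic form instead of a determinant or matrix inverse.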
End-to-end toy pipeline (rank 2, known items, active D-opt vs random). First, define helper functions for D-optimal gain and Bayesian updates:
Set up simulation parameters and precompute candidate pairs:
Define selection policies and run the simulation:
Visualize the results:
For multi-dimensional preference vectors, D-optimal selection maximizes \(\log\det(\mathcal{I})\), shrinking the posterior ellipsoid in all directions. The Sherman-Morrison formula enables efficient computation: each candidate pair requires only a vector-matrix-vector product \(x^\top \Sigma x\). D-optimal design naturally encourages diversity—selecting pairs with redundant information directions provides diminishing log-det gain.
3.2.3 Metric Elicitation: Exploiting Quasiconcavity
The metric elicitation problem has special structure that enables even more efficient elicitation than general D-optimal design. A Linear Performance Metric (LPM) has the form: \[ \phi(C) = m_{11} \cdot \text{TP} + m_{00} \cdot \text{TN} + m_0 \tag{3.11}\] where the weights \((m_{11}, m_{00})\) encode tradeoffs between true positives and true negatives. Since only the ratio matters, we can parametrize as \(\mathbf{m} = (\cos\theta, \sin\theta)\) for \(\theta \in [0, 2\pi]\).
For medical diagnosis, \(m_{11} > m_{00}\) weights sensitivity (catching disease) over specificity (avoiding false alarms). For spam filtering, the reverse may hold. The key is that the practitioner’s implicit metric \(\mathbf{m}^*\) is unknown—we must learn it from pairwise comparisons.
Proposition 3.3 For a quasiconcave LPM \(\phi\) that is monotonically increasing in TP and TN, and a parametrization \(\rho^+(\theta)\) of the upper boundary of the confusion matrix space \(\mathcal{C}\), the composition \(\phi \circ \rho^+\) is quasiconcave and unimodal on \([0, \pi/2]\).
Quasiconcavity means the function has a single peak—no local maxima to trap us. This enables binary search rather than gradient-based optimization:
Input: Tolerance \(\epsilon > 0\), oracle \(\Omega\)
Initialize: \(\theta_a = 0\), \(\theta_b = \pi/2\)
While \(|\theta_b - \theta_a| > \epsilon\):
- Divide \([\theta_a, \theta_b]\) into four equal intervals with boundaries \(A, B, C, D, E\)
- Compute confusion matrices \(C_A, C_B, C_C, C_D, C_E\) for classifiers at each \(\theta\)
- Query oracle: compare all adjacent pairs to find which interval contains the maximum
- Update \([\theta_a, \theta_b]\) to the interval containing the maximum
Return: \(\hat{\mathbf{m}} = (\cos\hat\theta, \sin\hat\theta)\) where \(\hat\theta = (\theta_a + \theta_b)/2\)
Each iteration uses 4 queries and shrinks the search interval by a factor of 2, giving \(O(\log(1/\epsilon))\) query complexity—exponentially better than the \(O(1/\epsilon^2)\) rate for general 2D estimation.
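A runnable sketch of the search (the oracle below is a stand-in: any utility that is unimodal along the boundary works, and we use the hypothetical \(g(\theta)=\cos(\theta-\theta^\star)\) with \(\theta^\star = 0.9\)):

```python
import numpy as np

theta_star = 0.9                   # hypothetical practitioner's metric angle

def oracle(th1, th2):
    # Returns True if the classifier at th1 is preferred; the implicit
    # utility along the boundary is unimodal with its peak at theta_star.
    g = lambda th: np.cos(th - theta_star)
    return g(th1) > g(th2)

def elicit(eps=1e-4):
    a, b, n_queries = 0.0, np.pi / 2, 0
    while b - a > eps:
        pts = np.linspace(a, b, 5)         # boundaries A..E of 4 intervals
        # Find the peak among the 5 points via 4 adjacent oracle queries
        best = 0
        for i in range(4):
            n_queries += 1
            if oracle(pts[i + 1], pts[best]):
                best = i + 1
        # For a unimodal function, the maximum lies in the two intervals
        # adjacent to the best sampled point
        a, b = pts[max(best - 1, 0)], pts[min(best + 1, 4)]
    return (a + b) / 2, n_queries

theta_hat, n = elicit()
print(theta_hat, n)
```

Roughly \(\log_2((\pi/2)/\epsilon) \approx 14\) iterations, hence a few dozen queries, suffice for \(\epsilon = 10^{-4}\).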
Proposition 3.4 Given learned metric weights \(\mathbf{m} = (m_{11}, m_{00})\) with \(m_{11} + m_{00} > 0\), the Bayes optimal classifier is: \[ \bar{h}(x) = \mathbf{1}\left[\eta(x) \geq \frac{m_{00}}{m_{11} + m_{00}}\right] \] where \(\eta(x) = P(Y=1 \mid X=x)\) is the conditional class probability.
This connects metric elicitation to decision theory: learning the metric \(\mathbf{m}\) is equivalent to learning the optimal classification threshold.
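The threshold form is one line of code (a numpy sketch; the weights below are hypothetical):

```python
import numpy as np

def bayes_classifier(eta, m11, m00):
    # Proposition 3.4: predict positive when eta(x) >= m00 / (m11 + m00)
    return (eta >= m00 / (m11 + m00)).astype(int)

# If true positives are weighted 3x true negatives (m11 = 3, m00 = 1),
# the optimal threshold drops from 0.5 (accuracy) to 0.25.
eta = np.array([0.1, 0.3, 0.6])
print(bayes_classifier(eta, 3.0, 1.0))
```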
Extension to Linear-Fractional Metrics. The \(F_\beta\) score and Jaccard similarity are not linear but linear-fractional: \[ \phi(C) = \frac{p_{11} \cdot \text{TP} + p_{00} \cdot \text{TN} + p_0}{q_{11} \cdot \text{TP} + q_{00} \cdot \text{TN} + q_0} \] These can be elicited by first finding both the maximizer and minimizer of the implicit linear numerator (two binary searches), then solving for the fractional parameters. The query complexity remains \(O(\log(1/\epsilon))\).
Multiclass Extension. For \(K\)-class problems, diagonal linear performance metrics (DLPMs) weight correct predictions: \(\psi(\mathbf{d}) = \sum_k a_k d_k\) where \(d_k\) is the rate of correctly predicting class \(k\). Binary search over each pair of classes \((k_1, k_2)\) recovers the relative weights \(a_{k_2}/a_{k_1}\), with total complexity \(O(K^2 \log(1/\epsilon))\).
3.3 Measurement of User Preference Function
Suppose items \(j\) have feature vectors \(X_j \in \mathbb R^d\). Instead of learning a free scalar parameter \(V_j\) for each item, we assume \(V_j = W^\top X_j,\) where \(W \in \mathbb R^d\) is shared across items. The probability that item \(j\) beats item \(k\) is \(p(j \succ k) = \sigma(V_j - V_k) = \sigma (W^\top(X_j - X_k)).\) This is simply logistic regression on pairwise feature differences. Given one comparison outcome \(y \in \{0,1\}\) for pair \((j,k)\) with difference \(x_{jk} = X_j - X_k\), \[\ell(W) = y\log \sigma(W^\top x_{jk}) + (1-y)\log(1-\sigma(W^\top x_{jk})). \tag{3.12}\] Gradient: \(\nabla_W \ell = (y - p) x_{jk}, \quad p = \sigma( W^\top x_{jk} ).\) Fisher Information (expected negative Hessian): \(\mathcal I(W) = \mathbb E\big[(y-p)^2 x_{jk}x_{jk}^\top\big] = p(1-p)\,x_{jk}x_{jk}^\top.\)
With Gaussian prior \(W \sim \mathcal N(0,\sigma_0^2 I)\), a Laplace approximation yields posterior covariance \(\Sigma \approx (\sigma_0^{-2}I + \sum_t p_t(1-p_t) x_t x_t^\top)^{-1}.\) So, as in the Rasch case, Fisher information accumulates to form posterior precision. At each step, pick a candidate pair \((j,k)\) and compute \[ \Delta_D(j,k) = \log \det(\Lambda + w x_{jk} x_{jk}^\top) - \log\det(\Lambda), \quad w = p(1-p), \tag{3.13}\] where \(\Lambda = \Sigma^{-1}\). This is the D-optimal criterion, maximizing the expected log-determinant gain in information. A-optimal or E-optimal criteria are similarly defined. We’ll simulate item features, a ground-truth \(W\), and comparisons. Then we compare active D-optimal elicitation vs random pair selection for estimating \(W\). Performance metrics are MSE of \(\hat W\) vs true \(W\), and reliability.
First, set up the simulation with item features and ground-truth weights:
Define the Fisher gain and Bayesian update functions:
Run the simulation comparing D-optimal vs random selection:
Visualize the results:
3.3.1 Robotic Trajectory Learning
The linear preference framework applies directly to robotic trajectory learning (Biyik and Sadigh 2018). Consider a robot choosing between trajectories \(\xi_A\) and \(\xi_B\). The human evaluates them via a reward function: \[ R(\xi) = w^\top \phi(\xi) \tag{3.14}\] where \(w \in \mathbb{R}^N\) are unknown weights and \(\phi(\xi) \in \mathbb{R}^N\) extracts features (e.g., distance to lane boundaries, proximity to obstacles, heading, speed). This is exactly our linear model with “items” being trajectories and “features” being \(\phi(\xi)\).
Geometric Interpretation. Each pairwise comparison “prefer \(\xi_A\) over \(\xi_B\)” yields a constraint: \[ w^\top (\phi(\xi_A) - \phi(\xi_B)) > 0 \] This defines a half-space in the weight space. The feasible region for \(w\) is the intersection of all half-spaces consistent with observed preferences. Each query splits this region—orthogonal queries reduce volume by approximately half.
The optimal query selection criterion maximizes the minimum volume removed regardless of the oracle’s answer. For query feature difference \(\Phi = \phi(\xi_A) - \phi(\xi_B)\), this is \(\max_\Phi \min\{\mathbb{E}[1 - f_\Phi(w)], \mathbb{E}[1 - f_{-\Phi}(w)]\}\) where \(f_\Phi(w) = \min(1, \exp(w^\top \Phi))\) soft-weights consistency with the preference. This selects balanced queries where either response is informative.
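This criterion can be approximated with posterior samples (a particle sketch; the particle count, weight dimension, and candidate set are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)

# Particle approximation of the current belief over reward weights w
n_part, d = 2000, 3
W = rng.normal(0.0, 1.0, (n_part, d))

def min_volume_removed(Phi, W):
    # f_Phi(w) = min(1, exp(w^T Phi)) soft-weights consistency; the score
    # is the minimum expected volume removed over the two possible answers.
    fA = np.minimum(1.0, np.exp(W @ Phi))
    fB = np.minimum(1.0, np.exp(-W @ Phi))
    return min(np.mean(1.0 - fA), np.mean(1.0 - fB))

# Candidate feature differences Phi = phi(xi_A) - phi(xi_B)
cands = rng.normal(0.0, 1.0, (50, d))
scores = np.array([min_volume_removed(c, W) for c in cands])
best = cands[int(np.argmax(scores))]
print(scores.max(), best)
```

Note the trivial query \(\Phi = 0\) scores exactly zero (either answer removes nothing), and no query can remove more than half the volume in the worst case, which is why the criterion favors balanced comparisons.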
Driving Simulator Example. In a 2D driving simulator, trajectory features include:
- Distance to lane boundaries
- Distance to other vehicles
- Heading angle
- Speed
With no learned preferences, the simulated car moves erratically. After 30 comparison queries, it learns to follow the lane direction. After 70 queries, it additionally learns to avoid collisions while staying in lane. The features with highest learned weights correspond to collision avoidance and lane-keeping.
Why Active Selection Matters. Both active and random query selection eventually converge to the true reward weights. However, active selection achieves target performance faster—critical in time-sensitive applications. For exoskeleton rehabilitation (Li et al. 2021), the robot must adapt to individual patients quickly; waiting for asymptotic convergence is not acceptable. Active learning provides crucial early-stage advantages.
Imitation Learning with Compatibility. When learning from multiple human demonstrators, multimodality arises: different people use different strategies for the same task (e.g., picking up an object from the left vs. right). Naive imitation learning fails on contradictory demonstrations.
The solution is to measure compatibility of new demonstrations with the base policy:
- Likelihood: How close is the demonstrated action to the base policy’s prediction?
- Novelty: How uncertain is the base policy at this state?
Demonstrations with low likelihood in low-novelty states indicate conflicting modes and should be filtered. This compatibility-based filtering improves success rates while using less data than naive aggregation (Gandhi et al. 2022).
Foundation Models for Representations. Recent work leverages foundation models pretrained on human video (R3M (Nair et al. 2022), Voltron (Karamcheti et al. 2023)) to provide better trajectory features \(\phi(\xi)\). These models learn that human hands are similar to robotic end-effectors, enabling 2-3x higher success rates with 5-10x fewer demonstrations. For preference learning, better representations mean each comparison is more informative.
3.3.2 Non-Linear Scoring Functions
So far we set \(V_j=W^\top X_j\). Suppose instead that items are scored by a non-linear function \[V_j = f_\theta(X_j),\qquad \theta\in\mathbb{R}^p, \qquad p(j \succ k) = \sigma (f_\theta(X_j) - f_\theta(X_k)) \tag{3.15}\] We want an online procedure that chooses the next comparison \((j,k)\) to learn \(\theta\) quickly. Exact Bayesian design is intractable for general \(f_\theta\). There is a simple and effective approximation that keeps all the math and code lightweight: local Gaussian posterior over \(\theta\) via Laplace + first-order model. Work at the current estimate \(\hat\theta\). Linearize the non-linear scorer around \(\hat\theta\): \[ f_\theta(X_j) \approx f_{\hat\theta}(X_j) + J_j\Delta\theta, \qquad J_j \equiv \left.\frac{\partial f_\theta(X_j)}{\partial \theta}\right| _{\theta = \hat\theta} \in \mathbb{R}^{1\times p}. \tag{3.16}\] The pairwise logit for \((j,k)\) becomes \[\underbrace{f_\theta(X_j)-f_\theta(X_k)}_{\text{logit}} \approx \underbrace{f_{\hat\theta}(X_j)-f_{\hat\theta}(X_k)}_{\text{offset}} + \underbrace{(J_j-J_k)}_{\phi_{jk}}\Delta\theta. \tag{3.17}\]
Define the local “design vector” \(\phi_{jk}\in\mathbb{R}^{1\times p}\). With a Gaussian prior \(\theta\sim\mathcal N(0,\sigma_0^2 I)\) and Bernoulli likelihood, the Laplace posterior around \(\hat\theta\) is \[ \theta \mid \mathcal D \approx \mathcal N(\hat\theta,\Sigma),\qquad \Sigma^{-1} = \sigma_0^{-2} I + \sum_{t} w_t \phi_t^\top \phi_t \tag{3.18}\] where \(w_t = p_t(1-p_t)\) with \(p_t = \sigma\big(f_{\hat\theta}(X_{j_t})-f_{\hat\theta}(X_{k_t})\big)\). This is exactly the same structure as the linear case, with \(x_{jk}\) replaced by \(\phi_{jk}=J_j-J_k\); you only need the Jacobian rows \(J_j\). At the current \((\hat\theta, \Sigma)\) and for a candidate \((j,k)\), compute \(p=\sigma\big(f_{\hat\theta}(X_j)-f_{\hat\theta}(X_k)\big)\) and \(w = p(1-p)\), and compute \(\phi = J_j-J_k\) at \(\hat \theta\). Then use any standard optimal design score with Sherman–Morrison:
- D-optimal (log-det gain): \(\Delta_D(j,k) = \log ( 1 + w\phi \Sigma \phi^\top )\).
- A-optimal (trace drop): \(\Delta_A(j,k) = \dfrac{w \phi \Sigma^2 \phi^\top }{ 1+w \phi \Sigma \phi^\top}\).
- E-optimal proxy: \(\Delta_E^{\text{proxy}}(j,k) = w \phi \Sigma \phi^\top\).
Pick the pair with the largest gain. This is tractable because it only needs a vector–matrix–vector product with \(\Sigma\) and a Jacobian row. Mutual-information acquisition functions such as BALD reduce to the same formulas under the Laplace Gaussian approximation. Maximizing \(\Delta_D\) is a good default.
We use a tiny two-layer network \(f_\theta(x)=a^\top \tanh(Bx)\) with \(\theta=(a,B)\). We derive \(J_j\) analytically. First, define the network and its Jacobian:
Set up Laplace approximation state and update functions:
Run the active learning loop comparing D-optimal vs random selection:
Visualize results:
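Putting the pieces together (numpy only; the network sizes, seed, and Newton-pass counts are illustrative; because the scores \(f_\theta\) enter only through differences, they are identified only up to an additive constant, so we track centred score MSE):

```python
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d, h, M, T = 2, 3, 40, 40        # input dim, hidden units, items, queries
p_dim = h + h * d                # theta = (a, vec(B))
pairs = [(j, k) for j in range(M) for k in range(j + 1, M)]

def unpack(theta):
    return theta[:h], theta[h:].reshape(h, d)

def f(theta, X):
    a, B = unpack(theta)
    return np.tanh(X @ B.T) @ a  # scores f_theta(X_j) for every item row

def jacobians(theta, X):
    # Analytic row Jacobians J_j = d f_theta(X_j) / d theta (Eq. 3.16)
    a, B = unpack(theta)
    Th = np.tanh(X @ B.T)                                 # (M, h)
    dB = (a * (1.0 - Th**2))[:, :, None] * X[:, None, :]  # (M, h, d)
    return np.concatenate([Th, dB.reshape(len(X), -1)], axis=1)

def refit(theta, theta0, X, data, n_newton=3):
    # Laplace refit: a few Newton passes over all data, re-linearizing
    # around the current theta each pass; prior is theta ~ N(theta0, I)
    Lam = np.eye(p_dim)
    for _ in range(n_newton):
        s, J = f(theta, X), jacobians(theta, X)
        g, Lam = theta0 - theta, np.eye(p_dim)
        for j, k, y in data:
            p = sigmoid(s[j] - s[k])
            phi = J[j] - J[k]
            g += (y - p) * phi
            Lam += p * (1.0 - p) * np.outer(phi, phi)
        theta = theta + np.linalg.solve(Lam, g)
    return theta, Lam

def run(policy, X, theta_true):
    theta0 = rng.normal(0.0, 0.5, p_dim)  # nonzero prior mean: at theta = 0
    theta = theta0.copy()                 # every Jacobian row vanishes
    data, err, s_true = [], np.zeros(T), f(theta_true, X)
    for t in range(T):
        theta, Lam = refit(theta, theta0, X, data)
        Sigma = np.linalg.inv(Lam)
        s, J = f(theta, X), jacobians(theta, X)
        if policy == "dopt":
            best, best_gain = 0, -np.inf
            for c, (j, k) in enumerate(pairs):
                p = sigmoid(s[j] - s[k])
                phi = J[j] - J[k]
                gain = np.log1p(p * (1.0 - p) * (phi @ Sigma @ phi))
                if gain > best_gain:
                    best, best_gain = c, gain
            j, k = pairs[best]
        else:
            j, k = pairs[int(rng.integers(len(pairs)))]
        y = float(rng.random() < sigmoid(s_true[j] - s_true[k]))
        data.append((j, k, y))
        err[t] = np.mean(((s - s.mean()) - (s_true - s_true.mean())) ** 2)
    return err

reps, err_d, err_r = 3, np.zeros(T), np.zeros(T)
for _ in range(reps):
    X = rng.normal(0.0, 1.0, (M, d))
    theta_true = rng.normal(0.0, 1.0, p_dim)
    err_d += run("dopt", X, theta_true) / reps
    err_r += run("random", X, theta_true) / reps
print(f"centred score MSE: d-opt={err_d[-1]:.3f}, random={err_r[-1]:.3f}")
```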
3.4 Active Query Selection for Gaussian Process Models
The previous sections developed Fisher information for optimal query selection in parametric preference models—the Rasch model (Section 3.2) and linear scoring functions (Section 3.3). However, Chapter 2 introduced Gaussian Processes (GPs) as a flexible nonparametric alternative for reward modeling and covered GP inference via Laplace approximation (Section 2.4.2). Here we address the natural question: how do we actively select queries when using GPs?
The key insight is that GP uncertainty—quantified by the posterior variance—naturally guides query selection. Unlike parametric models where Fisher information depends only on the current parameter estimate, GP uncertainty varies across the input space, being higher far from observed data and lower near it.
3.4.1 Information-Theoretic Acquisition Function
For GP preference learning, we seek queries that maximize information gain about the reward function. Given a dataset \(\mathcal{D}\) and a candidate query \(Q = (x_A, x_B)\), the acquisition function measures the expected reduction in uncertainty about \(r\): \[ a(Q) = I(r; y \mid \mathcal{D}, Q) = H(y \mid \mathcal{D}, Q) - \mathbb{E}_{r \sim p(r \mid \mathcal{D})}[H(y \mid r, Q)] \tag{3.19}\]
Interpretation of the two terms:
\(H(y \mid \mathcal{D}, Q)\): Predictive entropy—how uncertain is the query outcome under our current model? High when \(p(A \succ B) \approx 0.5\).
\(\mathbb{E}_r[H(y \mid r, Q)]\): Expected conditional entropy—how uncertain would the query be if we knew the true reward function? Low for “easy” comparisons, high for inherently noisy comparisons.
The acquisition function trades off these two terms: we want queries where the model is uncertain (high predictive entropy) but a human could provide a clear answer (low expected conditional entropy).
3.4.2 Practical Computation
Using the Laplace approximation from Chapter 3 (Section 2.4.2), the acquisition function can be computed as: \[ a(x_A, x_B) = h\left(\Phi\left(\frac{\mu_A - \mu_B}{\sqrt{2\sigma_{\text{noise}}^2 + g(x_A, x_B)}}\right)\right) - m(x_A, x_B) \tag{3.20}\]
where:
- \(\mu_A, \mu_B\) are the GP posterior means at \(x_A, x_B\)
- \(g(x_A, x_B) = \sigma_A^2 + \sigma_B^2 - 2\,\text{Cov}(r(x_A), r(x_B))\) captures uncertainty from both points
- \(h(p) = -p\log_2 p - (1-p)\log_2(1-p)\) is binary entropy
- \(\Phi\) is the standard normal CDF
- \(m(x_A, x_B)\) is a correction term for human noise
Key property: The trivial query \(Q = (x, x)\) (comparing an item to itself) is a global minimizer, not maximizer, of this acquisition function. This contrasts with some other methods that can get stuck on uninformative queries.
First, we define the RBF kernel and helper functions for the simulation:
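For instance, the kernel and helpers might look like this (an illustrative sketch for scalar inputs; the lengthscale and variance defaults are assumptions):

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel k(x, x') = v * exp(-(x - x')^2 / (2 l^2))."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_entropy(p, eps=1e-12):
    """h(p) in bits, clipped away from 0 and 1 for numerical safety."""
    p = np.clip(p, eps, 1.0 - eps)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)
```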
Next, we implement a GP model for active preference learning. The fit method uses Laplace approximation (Newton’s method) to find the posterior mode, and acquisition computes the information gain for a candidate query:
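A compact sketch of such a model (our own illustration: latent values live on a fixed grid, the fit is a Newton iteration to the posterior mode, and the acquisition uses a Monte Carlo BALD estimate rather than the closed form of Eq. 3.20):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_entropy(p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

class GPPreferenceModel:
    def __init__(self, grid, lengthscale=0.2, variance=1.0, jitter=1e-6):
        self.grid = grid
        d2 = (grid[:, None] - grid[None, :]) ** 2
        self.K = variance * np.exp(-0.5 * d2 / lengthscale**2) + jitter * np.eye(len(grid))
        self.f = np.zeros(len(grid))
        self.Sigma = self.K.copy()

    def fit(self, comps, n_iter=50):
        """comps: list of (a, b, y) with y = 1 if grid[a] is preferred over grid[b]."""
        Kinv = np.linalg.inv(self.K)
        n = len(self.grid)
        A = np.zeros((len(comps), n))           # comparison design matrix: rows e_a - e_b
        y = np.array([c[2] for c in comps], dtype=float)
        for i, (a, b, _) in enumerate(comps):
            A[i, a], A[i, b] = 1.0, -1.0
        f = np.zeros(n)
        for _ in range(n_iter):                 # Newton iteration to the mode
            p = sigmoid(A @ f)
            W = np.diag(p * (1 - p))
            Hess = Kinv + A.T @ W @ A           # negative log-posterior Hessian
            g = A.T @ (y - p) - Kinv @ f        # log-posterior gradient
            step = np.linalg.solve(Hess, g)
            f = f + step
            if np.linalg.norm(step) < 1e-8:
                break
        p = sigmoid(A @ f)
        self.f = f
        self.Sigma = np.linalg.inv(Kinv + A.T @ np.diag(p * (1 - p)) @ A)

    def acquisition(self, a, b, n_samples=512, rng=None):
        """Monte Carlo BALD: h(mean of p) - mean of h(p) over posterior samples."""
        rng = rng if rng is not None else np.random.default_rng(0)
        mu = self.f[[a, b]]
        cov = self.Sigma[np.ix_([a, b], [a, b])]
        fs = rng.multivariate_normal(mu, cov, size=n_samples)
        ps = sigmoid(fs[:, 0] - fs[:, 1])
        return binary_entropy(ps.mean()) - binary_entropy(ps).mean()
```

By Jensen's inequality the BALD score is nonnegative, and it vanishes for the trivial query \((x, x)\), matching the key property noted above.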
Now we run the simulation, comparing active (information-theoretic) query selection against random selection:
Finally, we visualize the results. The left panel shows how estimation error decreases with more comparisons; the right panel shows the learned reward functions:
3.4.3 When to Use GP Active Learning
| Scenario | Recommendation |
|---|---|
| Nonlinear reward structure | GP strongly preferred |
| Low-to-moderate dimensions (\(d \lesssim 10\)) | GP works well |
| Expensive human feedback | Active learning essential |
| Need uncertainty quantification | GP provides this naturally |
| Large datasets (\(n > 1000\)) | Consider sparse GP approximations |
| Very high dimensions | Linear models may be more practical |
3.5 Active Learning for Direct Preference Optimization
We now turn to a specific, important application: active learning for DPO (Direct Preference Optimization) in large language model alignment. Chapter 2 introduced DPO and its connection to Bradley-Terry; here we show how to apply active learning principles to reduce the number of human comparisons needed.
3.5.1 The DPO Setting
Recall from Section 1.2 that DPO directly optimizes a policy \(\pi_\theta\) on preference data without learning an explicit reward model. The key insight is that DPO implicitly assumes Bradley-Terry preferences: \[ p(y_w \succ y_l \mid x) = \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right) \tag{3.21}\]
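As a quick numerical sketch of Eq. 3.21 (an illustrative helper of our own; the scalar log-probabilities are assumed to be summed token log-likelihoods of each response):

```python
import math

def dpo_pref_prob(logp_w, logp_l, logp_ref_w, logp_ref_l, beta=0.1):
    """Bradley-Terry preference probability implied by DPO (Eq. 3.21).

    Each argument is a sequence log-probability log pi(y | x); the implicit
    reward of a response is beta * (log pi_theta - log pi_ref).
    """
    r_w = beta * (logp_w - logp_ref_w)
    r_l = beta * (logp_l - logp_ref_l)
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))
```

When the policy matches the reference, both implicit rewards are zero and the predicted preference is exactly 0.5.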
Challenge: Large language models have billions of parameters, making exact Fisher information computation intractable. How can we apply the D-optimal design principles from Sections 1-2?
3.5.2 The Log-Linear Policy Assumption
The key insight is to approximate the policy as log-linear in the last-layer features: \[ \pi(y \mid x; \theta) \propto \exp(\phi(x, y)^\top \theta) \tag{3.22}\]
where \(\phi(x, y)\) is the feature representation from the model’s last layer. This assumption is justified by the neural tangent kernel perspective: for sufficiently large networks, the dynamics during training are approximately linear in the parameters.
Under this assumption, the DPO loss Hessian takes the form: \[ H(\theta) = \sum_{i=1}^n p_i(1-p_i) \cdot \Delta\phi_i \Delta\phi_i^\top \tag{3.23}\]
where \(\Delta\phi_i = \phi(x_i, y_{w,i}) - \phi(x_i, y_{l,i})\) is the feature difference between winning and losing responses, and \(p_i\) is the predicted preference probability.
Key observation: This Hessian is exactly the Fisher information matrix for the DPO model! The D-optimal criterion becomes: \[ \max_S \log\det\left(\sum_{i \in S} p_i(1-p_i) \cdot \Delta\phi_i \Delta\phi_i^\top\right) \tag{3.24}\]
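In code, the Fisher information of Eq. 3.23 is a weighted sum of outer products, computable in one matrix multiply (illustrative sketch):

```python
import numpy as np

def dpo_fisher(delta_phi, p):
    """Fisher information H = sum_i p_i (1 - p_i) dphi_i dphi_i^T  (Eq. 3.23).

    delta_phi: (n, d) array of feature differences phi(x, y_w) - phi(x, y_l)
    p:         (n,) predicted preference probabilities
    """
    w = p * (1.0 - p)
    return (delta_phi * w[:, None]).T @ delta_phi
```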
3.5.3 The ADPO Algorithm
The Active DPO (ADPO) algorithm applies D-optimal design to select preference comparisons:
Input: Reference policy \(\pi_{\text{ref}}\), prompt set \(\{x_i\}\), response pairs \(\{(y_{i,1}, y_{i,2})\}\), budget \(T\)
Initialize: \(\theta_0 = \theta_{\text{ref}}\), \(S_0 = \emptyset\)
For \(t = 1, \ldots, T\):
Compute features: For each candidate pair \((x_i, y_{i,1}, y_{i,2})\):
- \(\Delta\phi_i = \phi(x_i, y_{i,1}) - \phi(x_i, y_{i,2})\)
- \(p_i = \sigma(\beta(\log\pi_{\theta_{t-1}}(y_{i,1}|x_i) - \log\pi_{\theta_{t-1}}(y_{i,2}|x_i)))\)
D-optimal selection: Select query maximizing: \[i^* = \arg\max_i \log\det\left(I + p_i(1-p_i) H_t^{-1} \Delta\phi_i \Delta\phi_i^\top\right) \tag{3.25}\] where \(H_t\) is the current Fisher information matrix
Query oracle: Obtain human preference \(y_t\) for pair \((x_{i^*}, y_{i^*,1}, y_{i^*,2})\)
Update: \(S_t = S_{t-1} \cup \{(i^*, y_t)\}\), retrain \(\theta_t\) on \(S_t\) using DPO loss
Return: Final policy \(\pi_{\theta_T}\)
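The selection step (Eq. 3.25) reduces, by the matrix determinant lemma, to one quadratic form per candidate. A sketch (the ridge term regularizing \(H_t\) is our assumption):

```python
import numpy as np

def adpo_select(H, delta_phi, p, ridge=1e-3):
    """Return the index maximizing log det(I + w_i H^{-1} dphi_i dphi_i^T).

    By the matrix determinant lemma this equals log(1 + w_i dphi_i^T H^{-1} dphi_i),
    so each candidate costs a quadratic form rather than a determinant.
    """
    d = H.shape[0]
    Hinv = np.linalg.inv(H + ridge * np.eye(d))
    w = p * (1.0 - p)
    quad = np.einsum('id,de,ie->i', delta_phi, Hinv, delta_phi)
    gains = np.log1p(w * quad)
    return int(np.argmax(gains)), gains
```

In practice \(H\) would be accumulated from previously selected queries via `dpo_fisher`-style updates, with the ridge ensuring invertibility when \(S_t\) is small.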
ADPO’s theoretical guarantees rely on the log-linear policy assumption: that \(\log \pi_\theta(y \mid x)\) is approximately linear in \(\theta\), which holds in the neural tangent kernel (NTK) regime near the reference policy. This approximation degrades when: (1) the policy is fine-tuned far from the reference (large \(\|\theta - \theta_{\text{ref}}\|\)), (2) the model is small relative to the data complexity so the NTK regime does not hold, or (3) the loss landscape is highly non-convex with multiple basins. In these settings, the Fisher information computed under the log-linear assumption may poorly reflect the actual curvature of the loss, leading to suboptimal query selection. Practitioners should monitor whether the D-optimal gains predicted by the linear model correlate with actual improvements in policy quality.
3.5.4 Theoretical Guarantees
Under standard regularity conditions, ADPO achieves the following convergence rate:
Theorem 3.1 Let \(\theta^*\) be the optimal policy parameters. Under the log-linear policy assumption and standard regularity conditions, ADPO with \(n\) actively selected comparisons achieves: \[ \|\hat\theta_n - \theta^*\| = O\left(\frac{d}{\sqrt{n}}\right) \tag{3.26}\] where \(d\) is the effective dimension (rank of the Fisher information matrix). This matches the minimax optimal rate for \(d\)-dimensional estimation.
The key insight is that D-optimal design ensures the Fisher information grows proportionally to \(n\), leading to \(O(1/\sqrt{n})\) estimation error.
3.5.5 ADPO+ for Offline Settings
In many practical settings, we have access to a fixed pool of unlabeled preference pairs (e.g., model outputs) but cannot generate new queries. ADPO+ adapts the algorithm for this offline setting:
- Pool-based selection: Instead of synthesizing queries, select from the existing pool
- Batch selection: Select \(b\) queries at once to enable parallel annotation
- Diversity regularization: Add penalty for selecting similar queries
The D-optimal criterion naturally encourages diversity: selecting redundant queries provides diminishing information gain.
3.5.6 Empirical Guidance: When Does Active Learning Help?
| Factor | Active Learning Beneficial? |
|---|---|
| Annotation cost | High cost → Active learning essential |
| Data heterogeneity | Diverse prompts → More room for strategic selection |
| Budget constraints | Limited budget → Active learning shows larger gains |
| Model capacity | Larger models → Active learning helps reduce overfitting |
| Query pool size | Large pool → More room for intelligent selection |
Practical recommendation: Active learning for DPO provides the largest gains when:
- Annotation budget is limited (< 10K comparisons)
- Prompts cover diverse topics or difficulty levels
- The goal is sample efficiency over raw performance
The two active learning approaches developed in this chapter compare as follows:

| Aspect | GP Active Learning | ADPO |
|---|---|---|
| Model | Nonparametric (GP) | Parametric (neural network) |
| Scalability | \(O(n^3)\) | \(O(d^2 n)\) with approximations |
| Use case | Small datasets, need uncertainty | LLM alignment, large models |
| Acquisition | Information gain | D-optimal design |
| Theoretical basis | Mutual information | Fisher information |
Both approaches share the core insight: use model uncertainty to guide query selection.
3.6 Summary
- When human feedback is expensive, which comparisons to request matters as much as how to learn from them. This chapter develops the theory of optimal query selection for preference learning.
- Fisher information \(\mathcal{I}_j(U) = p(1-p)\) quantifies how much a single comparison reveals about user preferences. Items with predicted win probability \(p \approx 0.5\) are most informative—close competitions reveal more than blowouts.
- Three optimal design criteria guide query selection: A-optimal (minimize average variance), D-optimal (maximize the determinant of the information matrix, shrinking the posterior ellipsoid’s volume), and E-optimal (minimize the worst-case variance).
- For multi-dimensional preferences, the Sherman-Morrison update enables efficient \(O(K^2)\) covariance updates after each query, making real-time active selection practical.
- Gaussian Process active learning extends these ideas to nonparametric models using information-theoretic acquisition functions—selecting queries that maximize mutual information between the response and the latent preference function.
- Active DPO (ADPO) applies D-optimal design to language model alignment, using a log-linear approximation to make Fisher information computation tractable for billion-parameter models. ADPO achieves the minimax-optimal convergence rate \(O(d/\sqrt{n})\).
- Across all settings—scalar preferences, factor models, GPs, and LLMs—the core insight is the same: use model uncertainty to guide query selection, focusing annotation effort where it reduces uncertainty most.
3.7 Quick Check
Test your understanding before proceeding to the exercises.
An item has predicted success probability \(p = 0.95\). What is its Fisher information \(\mathcal{I} = p(1-p)\)? Would you select this item for active learning? Why or why not?
A-optimal design minimizes the average posterior variance; D-optimal minimizes the volume of the confidence ellipsoid. In what situation would these select different items?
In ADPO, why is the log-linear policy assumption (\(\pi(y|x;\theta) \propto \exp(\phi(x,y)^\top \theta)\)) important? What would happen without it?
If you read only five papers from this chapter’s references, make them these:
- Rasch (1960) — The Rasch model: foundational psychometric model connecting item difficulty to person ability
- Fedorov (1972) — The theory of optimal experimental design (A/D/E-optimality)
- Chu and Ghahramani (2005) — Gaussian process preference learning: nonparametric Bayesian approach to utility functions
- Houlsby et al. (2011) — Bayesian active learning by disagreement (BALD), connecting information gain to query selection
- Christiano et al. (2017) — Deep RL from human preferences: the original active elicitation pipeline for reward learning
3.8 Discussion Questions
How does the reliability metric defined in this chapter relate to classical test theory in psychometrics? What assumptions underlie this correspondence?
Why does D-optimal design maximize the log-determinant of the information matrix rather than the trace? In what situations might A-optimal (trace-based) selection be preferable?
The local linearization approach for non-linear scoring functions trades off accuracy for tractability. What are the conditions under which this approximation is most accurate? When might it fail?
Active preference learning assumes that items’ appeals are known or pre-calibrated. How might estimation errors in item parameters affect the efficiency of active selection strategies?
The metric elicitation framework assumes an oracle with a consistent underlying preference. How would you modify the approach to handle an oracle whose preferences evolve over time or exhibit inconsistencies?
In the robotics case studies, pairwise comparisons are used to learn reward functions. What are the advantages and disadvantages of pairwise comparisons versus direct reward labeling for robotic learning?
Foundation models like R3M and Voltron leverage pretraining on human video data. What types of robotic tasks might benefit most from this approach? What tasks might still require substantial robot-specific data?
The active elicitation framework for imitation learning identifies “compatible” demonstrations. How might this concept of compatibility be extended to multi-agent settings where different agents have legitimately different optimal behaviors?
3.9 Further Reading
For readers who want to go deeper into the topics introduced in this chapter, we recommend the following:
- Fedorov (1972), Theory of Optimal Experiments — The classical reference on A/D/E-optimal experimental design, providing the mathematical foundations for the Fisher information-based query selection methods used throughout this chapter.
- Chaloner and Verdinelli (1995), “Bayesian Experimental Design: A Review” — A unified information-theoretic approach to experimental design that extends the Fisher information framework to Bayesian settings, including mutual information-based criteria that are robust to model misspecification.
- Settles (2012), Active Learning — A comprehensive survey of active learning, covering pool-based, stream-based, and query-synthesis approaches. Places the preference elicitation methods in this chapter within the broader active learning literature.
- Linden and Glas (2010), Elements of Adaptive Testing — How Fisher information drives modern standardized testing (GRE, GMAT), providing the practical context for the Rasch model-based elicitation methods discussed in this chapter.
- Houlsby et al. (2011), “Bayesian Active Learning for Classification and Preference Learning” — Introduces the BALD (Bayesian Active Learning by Disagreement) acquisition function, an information-theoretic alternative to D-optimal design that works well with GP models.
- Sadigh et al. (2017), “Active Preference-Based Learning of Reward Functions” — Active query selection for robot trajectory preferences, demonstrating how the Fisher information framework applies to continuous, high-dimensional preference learning in robotics.
- Hiranandani et al. (2019), “Multiclass Performance Metric Elicitation” — The paper behind the metric elicitation case study, showing how active preference queries can recover a practitioner’s implicit loss function for classifier evaluation.
3.10 Exercises
Exercises are marked (*), (**), or (***) according to whether we deem them relatively easy, medium, or hard.
3.10.1 Fisher Information Maximum (*)
Show that for the Rasch model with \(p = \sigma(U + V_j)\), the Fisher information \(\mathcal{I}_j(U) = p(1-p)\) is maximized when \(p = 0.5\). What does this imply about the optimal difficulty of items for measuring a user’s ability?
3.10.2 Sherman-Morrison Update (**)
Derive the Sherman-Morrison formula for efficiently updating the posterior covariance when adding a single rank-1 observation. Specifically, show that if \(\Sigma^{-1} \gets \Sigma^{-1} + w \cdot vv^\top\), then \(\Sigma \gets \Sigma - \frac{w \cdot \Sigma vv^\top \Sigma}{1 + w \cdot v^\top \Sigma v}\).
3.10.3 D-Optimal vs A-Optimal (**)
Construct a 2D example where D-optimal and A-optimal criteria select different items for the next query. Interpret the difference geometrically in terms of the posterior ellipsoid.
3.10.4 Pairwise Fisher Information (**)
For the Bradley-Terry model \(p(j \succ k | U) = \sigma(U^\top(V_j - V_k))\), derive the Fisher information matrix for \(U\) given a single pairwise comparison. How does this relate to the single-item case?
3.10.5 Active Selection Implementation (***)
Extend the D-optimal pairwise selection simulation to include all three optimality criteria (A, D, E). Compare their sample efficiency on a synthetic dataset of 50 items with 3-dimensional user preferences. Which criterion performs best under different prior uncertainty structures?
3.10.6 Noisy Oracle Elicitation (***)
Extend the LPM elicitation algorithm to handle noisy oracle responses where the oracle flips its answer with probability \(\epsilon < 0.5\). How does the query complexity scale with \(\epsilon\)?
3.10.7 Local Linearization Analysis (***)
Consider a scoring function \(f_\theta(x) = a^\top \tanh(Bx)\) with \(\theta = (a, B)\). Analytically derive the Jacobian \(J_j = \partial f_\theta(X_j) / \partial \theta\). Under what conditions on the hidden activations \(\tanh(BX_j)\) is the local Gaussian approximation most accurate?
3.10.8 Compatibility Metric Design (**)
In the active elicitation framework for imitation learning, the compatibility metric depends on thresholds \(\lambda\) and \(\eta\). Design an algorithm that automatically selects these thresholds from a held-out validation set of demonstrations. What objective should the algorithm optimize?