3 Elicitation
By the end of this chapter you will be able to:
- Define the measurement problem for user preference elicitation and distinguish it from the learning problem addressed in Chapter 2.
- Derive the Fisher information for Rasch/Bradley-Terry models and explain why items with \(p \approx 0.5\) are most informative.
- Compare A-optimal, D-optimal, and E-optimal design criteria for selecting informative queries.
- Apply Laplace approximation to update posterior beliefs about user preferences in real-time.
- Implement active selection policies that outperform random sampling in sample efficiency.
- Extend Fisher information analysis to non-linear scoring functions via local linearization.
- Apply information-theoretic acquisition functions to actively select queries for Gaussian Process preference models, building on the GP foundations from Chapter 2.
- Explain how D-optimal design applies to Direct Preference Optimization (DPO) for LLM alignment and derive the ADPO algorithm.
- Analyze active preference learning in robotics, including trajectory comparison and reward function inference.
Throughout this chapter, we use the following conventions (extending Chapter 2):
| Symbol | Meaning |
|---|---|
| \(U \in \mathbb{R}^K\) | User embedding/preference vector |
| \(V_j \in \mathbb{R}^K\) | Item embedding/appeal vector |
| \(Z_j \in \mathbb{R}\) | Item offset (difficulty in Rasch model: \(-V_j\)) |
| \(H_{ij} = U_i^\top V_j + Z_j\) | Latent utility for user \(i\) on item \(j\) |
| \(\mathcal{I}(U)\) | Fisher information (scalar or matrix) |
| \(p = \sigma(H)\) | Response probability under logistic model |
| \(\Lambda = \Sigma^{-1}\) | Posterior precision matrix |
| \(\text{Rel}\) | Reliability: \(1 - \text{tr}(\hat{\Sigma}_{\text{err}}) / \text{tr}(\Sigma_U)\) |
| \(\Delta_D, \Delta_A, \Delta_E\) | D/A/E-optimal acquisition functions |
| \(a(Q)\) | Acquisition function value for query \(Q\) |
| \(H(y)\) | Entropy of random variable \(y\) |
| \(I(r; y)\) | Mutual information between \(r\) and \(y\) |
3.1 Chapter Overview
Chapter 1 developed models for how preferences arise from latent utilities. Chapter 2 showed how to estimate these utilities from observed comparisons. This chapter addresses the complementary question: given that human feedback is expensive, how should we choose which comparisons to request?
This is the measurement problem in preference elicitation. Unlike the learning problem (which assumes data is given), the measurement problem actively designs experiments to maximize information gained per query. Consider training an LLM via RLHF: collecting 100,000 human preference labels is expensive and time-consuming. If we could achieve the same alignment quality with 20,000 strategically selected comparisons, we would save 80% of annotation cost.
The central insight is that not all comparisons are equally informative. Asking whether humans prefer “Hello!” to “Hi there!” reveals little about model quality. Comparing two substantively different responses to a complex prompt provides more signal. The mathematical framework of Fisher information and optimal experimental design formalizes this intuition.
Chapter Structure. We build from simple to complex settings:
- Scalar preferences (Section 3.2): The Rasch model and Fisher information
- Multi-dimensional preferences (Section 3.2.1): Factor models and optimal design criteria (A/D/E-optimal)
- Preference functions (Section 3.3): Linear and non-linear scoring functions
- Nonparametric models (Section 3.4): GP-based active learning with information-theoretic acquisition
- LLM alignment (Section 3.5): Active DPO for large language models
- Applications: Metric elicitation (Section 3.2.3) and robotics (Section 3.3.1)
Connections to Prior Chapters:
| Concept | Chapter 1-2 | This Chapter |
|---|---|---|
| Bradley-Terry model | Likelihood, MLE (Section 1.6.2) | Fisher information, query selection |
| Factor models \(U^\top V\) | Parameter estimation (Section 1.6) | D-optimal pair selection |
| GP preference models | Laplace inference (Section 2.4.2) | Information-gain acquisition |
| DPO loss | Objective function (Section 1.2) | ADPO active selection |
Historical Notes. Active learning for preferences has roots in multiple fields:
- Psychometrics (1960s): Computer-adaptive testing used Fisher information to select test items matching student ability
- Optimal experimental design (1970s): Fedorov and Kiefer developed A/D/E-optimality for linear models
- Bayesian experimental design (1990s): Chaloner & Verdinelli unified information-theoretic approaches
- RLHF/Active DPO (2020s): Applying these ideas to LLM alignment at scale
“The purpose of optimal design is to achieve the desired precision with minimum cost.” — V.V. Fedorov (1972)
3.2 Measurement of User Preference Vector
When items’ appeals \(V_j\) are known or pre-calibrated, we can learn a new user’s appetite \(U\) quickly by choosing informative items online. Asking only easy or only hard items gives near-deterministic responses that teach little. The key tool for quantifying “informativeness” is Fisher information.
Definition 3.1 For a parametric model \(p(Y \mid \theta)\), the Fisher Information about parameter \(\theta\) is: \[ \mathcal{I}(\theta) = \mathbb{E}\left[-\frac{\partial^2}{\partial \theta^2} \log p(Y \mid \theta)\right] = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log p(Y \mid \theta)\right)^2\right] \] Fisher Information quantifies how much a single observation reveals about \(\theta\). Higher Fisher Information means more precise estimation from each observation. The Cramér-Rao bound states that the variance of any unbiased estimator is at least \(\mathcal{I}(\theta)^{-1}\).
The Rasch model makes this precise. Recall from Section 1.6 that for a candidate item \(j\), the Bernoulli success probability is \(p_j(U) = \sigma(U + V_j).\) The Fisher information about \(U\) from a single response is \[\mathcal I_j(U) = \mathbb{E} \left[-\frac{\partial^2}{\partial U^2}\log p(Y \mid U, V_j)\right] = p_j(U) (1 - p_j(U)). \tag{3.1}\]
Proposition 3.1 For the Rasch model with \(p_j(U) = \sigma(U + V_j)\), the Fisher Information \(\mathcal{I}_j(U) = p_j(1-p_j)\) is maximized when \(p_j(U) = 0.5\), which occurs when item difficulty \(-V_j\) equals user ability \(U\).
The function \(f(p) = p(1-p)\) is a downward parabola with roots at \(p=0\) and \(p=1\). Taking the derivative: \(f'(p) = 1 - 2p = 0 \Rightarrow p = 0.5\). At this point, \(f(0.5) = 0.25\), which is the maximum. Since \(p_j(U) = \sigma(U + V_j) = 0.5\) when \(U + V_j = 0\), the optimal item has \(V_j = -U\).
This result has a natural interpretation: items that are too easy (\(p \approx 1\)) or too hard (\(p \approx 0\)) yield near-deterministic responses that teach little about the user. Items matched to the user’s ability (\(p \approx 0.5\)) are maximally informative. The best next item is therefore one whose difficulty \(-V_j\) is close to the current estimate of \(U\); active elicitation policies “hover” around the user’s current location, rapidly shrinking uncertainty.

Place a Gaussian prior \(U \sim \mathcal N(0,\sigma_0^2)\). After observing independent items \(\mathcal J\), a Laplace (or IRLS) update for \(U\) uses the score and observed information \[S(U) = \sum_{j\in\mathcal J} (y_j - p_j(U)), \qquad \mathcal I(U) = \sum_{j\in\mathcal J}p_j(U) (1 - p_j(U)) + \tau_0, \quad \tau_0 = \sigma_0^{-2}. \tag{3.2}\] A Newton step is \(\hat U \leftarrow \hat U + S(\hat U)\,\mathcal I(\hat U)^{-1}\), and the approximate posterior variance is \(\widehat{\mathrm{Var}}(U) \approx \mathcal I(\hat U)^{-1}\). An item-selection rule that maximizes expected information picks \(j_t = \arg\max_j \; p_j(\hat U_{t-1}) (1 - p_j(\hat U_{t-1}))\).

For evaluation, we use reliability. With population variance \(\mathrm{Var}(U^\star)=\sigma_U^2\), reliability is the fraction of total variance explained by the test rather than estimation noise: \[ \text{Rel} \approx 1 - \frac{1}{N}\sum_{i=1}^N \frac{\widehat{\mathrm{Var}}(U_i)}{\sigma_U^2}. \tag{3.3}\] The code below simulates many users and a bank of calibrated items, compares a Fisher-information policy to a random policy, and tracks reliability after each query count \(t\).
The simulation has four steps: set up \(N\) users and \(M\) calibrated items with helper functions for Bayesian updating; define the Bayesian update rule and the two item-selection policies, Fisher-active (information-maximizing) and random; run both policies across all users; and compare the resulting reliability curves.
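Below is a minimal NumPy sketch of this pipeline. The sizes (\(N = 200\) users, \(M = 300\) items, \(T = 30\) queries) and helper names such as `laplace_update` and `run_policy` are illustrative choices, and the plot is replaced by a printed summary.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda h: 1.0 / (1.0 + np.exp(-h))

N, M, T = 200, 300, 30            # users, calibrated items, queries per user
sigma_U = 1.0                      # population SD of the true abilities
U_true = rng.normal(0.0, sigma_U, size=N)
V = rng.normal(0.0, 1.0, size=M)   # known item parameters (difficulty = -V_j)

def laplace_update(ys, items, tau0=1.0, n_iter=25):
    """Newton/IRLS for the posterior mode and variance of U given responses to `items`."""
    u = 0.0
    for _ in range(n_iter):
        p = sigmoid(u + V[items])
        score = np.sum(ys - p) - tau0 * u          # Eq. (3.2) score, with Gaussian prior
        info = np.sum(p * (1 - p)) + tau0          # Eq. (3.2) information
        u += score / info
    return u, 1.0 / info

def run_policy(policy):
    """Reliability (Eq. 3.3) after each query count, averaged over users."""
    post_var = np.zeros((N, T))
    for i in range(N):
        asked, ys, u_hat = [], [], 0.0
        for t in range(T):
            avail = np.setdiff1d(np.arange(M), asked)
            if policy == "fisher":
                p = sigmoid(u_hat + V[avail])
                j = avail[np.argmax(p * (1 - p))]  # most informative item
            else:
                j = rng.choice(avail)
            y = float(rng.random() < sigmoid(U_true[i] + V[j]))
            asked.append(j); ys.append(y)
            u_hat, var = laplace_update(np.array(ys), np.array(asked))
            post_var[i, t] = var
    return 1.0 - post_var.mean(axis=0) / sigma_U**2

rel_fisher, rel_random = run_policy("fisher"), run_policy("random")
for t in (4, 9, 19, 29):
    print(f"t={t+1:2d}  Rel(Fisher)={rel_fisher[t]:.3f}  Rel(random)={rel_random[t]:.3f}")
```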
Fisher-active item selection achieves higher reliability with fewer queries than random selection. The gain comes from targeting items near \(p=0.5\), where each response is maximally informative. This principle—measure where uncertainty is highest—underlies all active learning methods in this chapter.
3.2.1 Factor Models: Multi-Dimensional Preferences
In practice, a single “ability” parameter may not capture the diversity of user preferences. Factor models generalize Rasch by giving each user a \(K\)-dimensional embedding \(U_i \in \mathbb R^K\) and each item a \(K\)-dimensional embedding \(V_j \in \mathbb R^K\), plus optional bias \(Z_j\). The linear predictor is \(H_{ij} = U_i^\top V_j + Z_j\) where \(p(Y_{ij}=1 \mid U_i, V_j, Z_j) = \sigma(H_{ij}).\)
A concrete instance is performance metric elicitation (Hiranandani et al. 2019). In binary classification, practitioners often have implicit preferences over classifiers that don’t match standard metrics like accuracy or F1—for instance, in medical diagnosis, the cost of a false negative (missed disease) may differ greatly from a false positive (unnecessary treatment). This is a \(K=2\) factor model where “items” are classifiers with known confusion matrices \(V_\theta = (\text{TP}_\theta, \text{TN}_\theta)\), and the “user preference” is the unknown metric weights \(\mathbf{m} = (m_{11}, m_{00})\). The utility \(\mathbf{m}^\top V_\theta = m_{11} \cdot \text{TP}_\theta + m_{00} \cdot \text{TN}_\theta\) is exactly the factor model form with known item parameters.
For a given user \(U\) and an item \(j\) with known parameters, the Bernoulli log-likelihood is \[\ell(U; y, j) = y \log \sigma(H) + (1-y)\log (1-\sigma(H)) \tag{3.4}\] The gradient with respect to \(U\) is \(\nabla_U \ell = (y - \sigma(H)) V_j.\) The expected Fisher Information per item is the negative expected Hessian: \[ \mathcal I_j(U) = \mathbb E[\nabla_U \ell \nabla_U \ell^\top ] = \sigma(H)(1-\sigma(H)) V_j V_j^\top. \tag{3.5}\] So each queried item contributes a rank-1 matrix in the direction of \(V_j\), scaled by the variance term \(\sigma(H)(1-\sigma(H))\). For a sequence of items \({j_t}\), the cumulative information is \[\mathcal I(U) = \sum_t \sigma(H_{j_t})(1-\sigma(H_{j_t})) V_{j_t} V_{j_t}^\top. \tag{3.6}\]
In multiple dimensions, information is a matrix. Classical optimal design offers various criteria for selecting informative queries:
Definition 3.2 Given a Fisher Information matrix \(\mathcal{I}(U)\), the classical optimal experimental design criteria are:
- A-Optimal (Average): Minimize \(\text{tr}(\mathcal{I}^{-1})\)—the average variance of parameter estimates
- D-Optimal (Determinant): Maximize \(\det(\mathcal{I})\)—the volume shrinkage of the posterior ellipsoid
- E-Optimal (Eigenvalue): Maximize \(\lambda_{\min}(\mathcal{I})\)—the precision in the worst-case direction
All three reduce uncertainty, but emphasize different aspects: A-optimal targets average precision, D-optimal targets overall information volume, and E-optimal targets the least-informed direction.
Adaptive measurement picks the next item \(j\) to maximize an acquisition function such as \(\det(\mathcal I+\mathcal I_j)\). We also use reliability as the evaluation metric here. If \(\Sigma_U\) is the population covariance and \(\widehat{\Sigma}_{\text{err}}\) the average posterior covariance, then \(\text{Reliability} = 1 - \mathrm{tr}(\widehat{\Sigma}_{\text{err}}) \mathrm{tr}(\Sigma_U)^{-1}.\) This generalizes the Rasch reliability to multi-dimensional latent traits. Below is a toy simulation with 2-dimensional users and items, comparing random item selection vs D-optimal Fisher active selection:
The simulation mirrors the Rasch pipeline: set up the factor model with \(K\)-dimensional user and item embeddings, define the Bayesian update and the D-optimal selection rule, run the comparison of D-optimal versus random selection across all users, and visualize the results.
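A minimal NumPy sketch of this factor-model pipeline is shown below; the sizes (\(K = 2\), \(N = 150\) users, \(M = 200\) items) are illustrative, and plotting is again replaced by a printed summary.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda h: 1.0 / (1.0 + np.exp(-h))

K, N, M, T = 2, 150, 200, 25
Sigma_U = np.eye(K)                                  # population covariance of users
prior_prec = np.linalg.inv(Sigma_U)
U_true = rng.multivariate_normal(np.zeros(K), Sigma_U, size=N)
V = rng.normal(0.0, 1.0, size=(M, K))                # known item embeddings
Z = rng.normal(0.0, 0.5, size=M)                     # known item offsets

def laplace_update(ys, items, n_iter=25):
    """Posterior mode and covariance of a K-dimensional user vector (Newton's method)."""
    u = np.zeros(K)
    for _ in range(n_iter):
        p = sigmoid(V[items] @ u + Z[items])
        g = V[items].T @ (ys - p) - prior_prec @ u
        H = prior_prec + (V[items] * (p * (1 - p))[:, None]).T @ V[items]
        u += np.linalg.solve(H, g)
    return u, np.linalg.inv(H)

def run_policy(policy):
    tr_err = np.zeros((N, T))
    for i in range(N):
        asked, ys = [], []
        u_hat, Sigma = np.zeros(K), Sigma_U.copy()
        for t in range(T):
            avail = np.setdiff1d(np.arange(M), asked)
            if policy == "dopt":
                p = sigmoid(V[avail] @ u_hat + Z[avail])
                w = p * (1 - p)
                gain = np.log1p(w * np.einsum("ik,kl,il->i", V[avail], Sigma, V[avail]))
                j = avail[np.argmax(gain)]           # D-optimal: largest log-det gain
            else:
                j = rng.choice(avail)
            y = float(rng.random() < sigmoid(U_true[i] @ V[j] + Z[j]))
            asked.append(j); ys.append(y)
            u_hat, Sigma = laplace_update(np.array(ys), np.array(asked))
            tr_err[i, t] = np.trace(Sigma)
    return 1.0 - tr_err.mean(axis=0) / np.trace(Sigma_U)   # multi-dimensional reliability

rel_d, rel_r = run_policy("dopt"), run_policy("random")
print(f"Reliability after {T} queries: D-optimal {rel_d[-1]:.3f}, random {rel_r[-1]:.3f}")
```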
3.2.2 Pairwise Preferences
We next discuss the measurement problem with pairwise data. Recall from Section 1.6.2 that the Bradley-Terry model specifies the probability that user \(i\) prefers item \(j\) over item \(k\) as \(p_{i}(j \succ k)=\sigma (U^\top(V_j-V_k) + Z_j-Z_k)\) with feature \(x_{jk}= V_j - V_k \in \mathbb R^K\) and offset \(b_{jk}=Z_j-Z_k\).
In metric elicitation, each oracle query asks: “Which classifier is better, \(h_\theta\) or \(h_{\theta'}\)?” The oracle’s answer is deterministic: \(\Omega(C_\theta, C_{\theta'}) = \mathbf{1}[\mathbf{m}^\top (V_\theta - V_{\theta'}) > 0]\), which is exactly a pairwise comparison with feature difference \(x = V_\theta - V_{\theta'}\). The deterministic oracle corresponds to the \(\sigma \to \text{step function}\) limit of Bradley-Terry.
For a single observation \(y \in \{0,1\}\) of \(j \succ k\), the per-pair log-likelihood is \[\ell(U; y, i, j, k)= y \log p + (1-y) \log (1-p), \qquad p = p_i(j \succ k). \tag{3.7}\]
Score and expected Fisher information for \(U\): \[\nabla_U \ell=(y-p)\,x_{jk}, \qquad \mathcal I_{jk}(U)=\mathbb E[-\nabla^2_U \ell] = w\, x_{jk} x_{jk}^\top, \tag{3.8}\] where \(w = p(1-p)\). For a sequence of queried pairs \(\{(j_t,k_t)\}\) the cumulative information is \(\mathcal I(U)=\sum_t w_t\,x_t x_t^\top\). With a Gaussian prior \(U \sim \mathcal N(0, \Sigma_0)\) and a Laplace approximation, the posterior precision updates additively: \[ \Lambda_{t}=\Sigma_t^{-1}=\Sigma_{0}^{-1}+\sum_{s\le t} w_s\,x_s x_s^\top. \tag{3.9}\] A single Newton step for the posterior mean after observing \((x,y)\) is \[ \hat U \leftarrow \hat U + \Sigma_{t}\,(y-p)\,x,\qquad p=\sigma(\hat U^\top x+b),\quad \Sigma_t = \left(\Lambda_{t-1}+w\,xx^\top\right)^{-1}. \tag{3.10}\] This is the logistic-regression one-step IRLS update specialized to one new example. To score candidates, we evaluate at the current \((\hat U,\Sigma)\). Write \(x\) for \(x_{jk}\), \(w=\sigma(\hat U^\top x+b)\big(1-\sigma(\hat U^\top x+b)\big)\), and \(\Lambda=\Sigma^{-1}\). The Sherman–Morrison formula then gives closed forms for all three design criteria:
Proposition 3.2 For a current precision matrix \(\Lambda = \Sigma^{-1}\) and a candidate pair \((j,k)\) with feature difference \(x = V_j - V_k\) and variance weight \(w = p(1-p)\), the D-optimal acquisition function has closed form: \[ \Delta_D(j,k) = \log\det(\Lambda + w xx^\top) - \log\det(\Lambda) = \log(1 + w x^\top \Sigma x) \] This follows from the matrix determinant lemma: \(\det(A + uv^\top) = (1 + v^\top A^{-1} u)\det(A)\).
Pick the pair that maximizes \(\log(1 + w x^\top\Sigma x)\). This requires only a vector-matrix-vector product, making it efficient for online selection.
A-optimal (trace drop of covariance): \(\Delta_A(j,k) = \mathrm{tr}(\Sigma)-\mathrm{tr}\big((\Lambda+w\,xx^\top)^{-1}\big) = \dfrac{w\,x^\top\Sigma^2 x}{1+w\,x^\top\Sigma x}.\)
E-optimal (raise the worst-direction information): an exact update of \(\lambda_{\min}(\mathcal I)\) requires an eigen-update. A practical proxy is to maximize \(\Delta_E^{\text{proxy}}(j,k)= w\,x^\top\Sigma x\), which targets directions where current uncertainty \(x^\top\Sigma x\) is large.
End-to-end toy pipeline (rank 2, known items, active D-optimal vs. random): define helper functions for the D-optimal gain and the Bayesian update, precompute the candidate pairs, run both selection policies, and visualize the results.
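A minimal NumPy sketch follows, using the precision and one-step Newton updates of Eqs. (3.9)–(3.10) and the closed-form gain of Proposition 3.2; the item counts and prior are illustrative choices.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
sigmoid = lambda h: 1.0 / (1.0 + np.exp(-h))

K, M, N, T = 2, 40, 100, 30
V = rng.normal(0.0, 1.0, size=(M, K))            # known item embeddings
Z = rng.normal(0.0, 0.3, size=M)                 # known item offsets
U_true = rng.normal(0.0, 1.0, size=(N, K))       # users drawn from N(0, I)

pairs = np.array(list(combinations(range(M), 2)))
X = V[pairs[:, 0]] - V[pairs[:, 1]]              # feature differences x_jk
B = Z[pairs[:, 0]] - Z[pairs[:, 1]]              # offset differences

def d_opt_gain(u_hat, Sigma):
    """Closed-form log-det gain of Proposition 3.2 for every candidate pair."""
    p = sigmoid(X @ u_hat + B)
    w = p * (1 - p)
    return np.log1p(w * np.einsum("ik,kl,il->i", X, Sigma, X))

def run_policy(policy):
    err = np.zeros((N, T))
    for i in range(N):
        u_hat, Lam = np.zeros(K), np.eye(K)      # prior N(0, I)
        Sigma = np.linalg.inv(Lam)
        for t in range(T):
            q = (int(np.argmax(d_opt_gain(u_hat, Sigma))) if policy == "dopt"
                 else int(rng.integers(len(pairs))))
            x, b = X[q], B[q]
            y = float(rng.random() < sigmoid(U_true[i] @ x + b))
            p = sigmoid(u_hat @ x + b)
            Lam += p * (1 - p) * np.outer(x, x)                 # Eq. (3.9)
            Sigma = np.linalg.inv(Lam)
            u_hat = u_hat + Sigma @ ((y - p) * x)               # Eq. (3.10)
            err[i, t] = np.sum((u_hat - U_true[i]) ** 2)
    return err.mean(axis=0)

mse_d, mse_r = run_policy("dopt"), run_policy("random")
print(f"MSE(U) after {T} queries: D-optimal {mse_d[-1]:.3f}, random {mse_r[-1]:.3f}")
```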
For multi-dimensional preference vectors, D-optimal selection maximizes \(\log\det(\mathcal{I})\), shrinking the posterior ellipsoid in all directions. The Sherman-Morrison formula enables efficient computation: each candidate pair requires only a vector-matrix-vector product \(x^\top \Sigma x\). D-optimal design naturally encourages diversity—selecting pairs with redundant information directions provides diminishing log-det gain.
3.2.3 Metric Elicitation: Exploiting Quasiconcavity
The metric elicitation problem has special structure that enables even more efficient elicitation than general D-optimal design. A Linear Performance Metric (LPM) has the form: \[ \phi(C) = m_{11} \cdot \text{TP} + m_{00} \cdot \text{TN} + m_0 \tag{3.11}\] where the weights \((m_{11}, m_{00})\) encode tradeoffs between true positives and true negatives. Since only the ratio matters, we can parametrize as \(\mathbf{m} = (\cos\theta, \sin\theta)\) for \(\theta \in [0, 2\pi]\).
For medical diagnosis, \(m_{11} > m_{00}\) weights sensitivity (catching disease) over specificity (avoiding false alarms). For spam filtering, the reverse may hold. The key is that the practitioner’s implicit metric \(\mathbf{m}^*\) is unknown—we must learn it from pairwise comparisons.
Proposition 3.3 For a quasiconcave LPM \(\phi\) that is monotonically increasing in TP and TN, and a parametrization \(\rho^+(\theta)\) of the upper boundary of the confusion matrix space \(\mathcal{C}\), the composition \(\phi \circ \rho^+\) is quasiconcave and unimodal on \([0, \pi/2]\).
Quasiconcavity means the function has a single peak—no local maxima to trap us. This enables binary search rather than gradient-based optimization:
Input: Tolerance \(\epsilon > 0\), oracle \(\Omega\)
Initialize: \(\theta_a = 0\), \(\theta_b = \pi/2\)
While \(|\theta_b - \theta_a| > \epsilon\):
- Divide \([\theta_a, \theta_b]\) into four equal intervals with boundaries \(A, B, C, D, E\)
- Compute confusion matrices \(C_A, C_B, C_C, C_D, C_E\) for classifiers at each \(\theta\)
- Query oracle: compare all adjacent pairs to find which interval contains the maximum
- Update \([\theta_a, \theta_b]\) to the interval containing the maximum
Return: \(\hat{\mathbf{m}} = (\cos\hat\theta, \sin\hat\theta)\) where \(\hat\theta = (\theta_a + \theta_b)/2\)
Each iteration uses 4 queries and shrinks the search interval by a factor of 2, giving \(O(\log(1/\epsilon))\) query complexity—exponentially better than the \(O(1/\epsilon^2)\) rate for general 2D estimation.
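A sketch of this search under an assumed toy geometry is given below: the upper boundary of the confusion space is taken to be the quarter circle \(\rho^+(\theta) = (\sin\theta, \cos\theta)\) and the oracle compares classifiers under a hidden linear metric \(\mathbf m^*\). Recovering \(\hat{\mathbf m}\) from \(\hat\theta\) depends on the boundary parametrization and is omitted here.

```python
import numpy as np

# Assumed toy boundary of the confusion space: rho_plus(theta) = (TP, TN) = (sin t, cos t)
rho_plus = lambda theta: np.array([np.sin(theta), np.cos(theta)])
m_star = np.array([np.cos(0.7), np.sin(0.7)])        # hidden metric weights (m11, m00)

def oracle(theta1, theta2):
    """True if the classifier at theta1 is preferred to the one at theta2."""
    return m_star @ rho_plus(theta1) > m_star @ rho_plus(theta2)

def elicit(eps=1e-4):
    a, b, queries = 0.0, np.pi / 2, 0
    while b - a > eps:
        grid = np.linspace(a, b, 5)                   # boundaries A, B, C, D, E
        # For a unimodal phi, the adjacent-pair preferences flip exactly once;
        # the number of "increasing" answers is the index of the best grid point.
        prefs = [oracle(grid[i + 1], grid[i]) for i in range(4)]
        queries += 4
        k = int(sum(prefs))
        a, b = grid[max(k - 1, 0)], grid[min(k + 1, 4)]
    return 0.5 * (a + b), queries

theta_hat, n_queries = elicit()
theta_opt = np.arctan2(m_star[0], m_star[1])          # true maximizer for this toy boundary
print(f"theta_hat={theta_hat:.4f}  theta_opt={theta_opt:.4f}  queries={n_queries}")
```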
Proposition 3.4 Given learned metric weights \(\mathbf{m} = (m_{11}, m_{00})\) with \(m_{11} + m_{00} > 0\), the Bayes optimal classifier is: \[ \bar{h}(x) = \mathbf{1}\left[\eta(x) \geq \frac{m_{00}}{m_{11} + m_{00}}\right] \] where \(\eta(x) = P(Y=1 \mid X=x)\) is the conditional class probability.
This connects metric elicitation to decision theory: learning the metric \(\mathbf{m}\) is equivalent to learning the optimal classification threshold.
Extension to Linear-Fractional Metrics. The \(F_\beta\) score and Jaccard similarity are not linear but linear-fractional: \[ \phi(C) = \frac{p_{11} \cdot \text{TP} + p_{00} \cdot \text{TN} + p_0}{q_{11} \cdot \text{TP} + q_{00} \cdot \text{TN} + q_0} \] These can be elicited by first finding both the maximizer and minimizer of the implicit linear numerator (two binary searches), then solving for the fractional parameters. The query complexity remains \(O(\log(1/\epsilon))\).
Multiclass Extension. For \(K\)-class problems, diagonal linear performance metrics (DLPMs) weight correct predictions: \(\psi(\mathbf{d}) = \sum_k a_k d_k\) where \(d_k\) is the rate of correctly predicting class \(k\). Binary search over each pair of classes \((k_1, k_2)\) recovers the relative weights \(a_{k_2}/a_{k_1}\), with total complexity \(O(K^2 \log(1/\epsilon))\).
3.3 Measurement of User Preference Function
Suppose items \(j\) have feature vectors \(X_j \in \mathbb R^d\). Instead of learning a free scalar parameter \(V_j\) for each item, we assume \(V_j = W^\top X_j,\) where \(W \in \mathbb R^d\) is shared across items. The probability that item \(j\) beats item \(k\) is \(p(j \succ k) = \sigma(V_j - V_k) = \sigma (W^\top(X_j - X_k)).\) This is simply logistic regression on pairwise feature differences. Given one comparison outcome \(y \in \{0,1\}\) for pair \((j,k)\) with difference \(x_{jk} = X_j - X_k\), \[\ell(W) = y\log \sigma(W^\top x_{jk}) + (1-y)\log(1-\sigma(W^\top x_{jk})). \tag{3.12}\] Gradient: \(\nabla_W \ell = (y - p) x_{jk}, \quad p = \sigma( W^\top x_{jk} ).\) Fisher Information (expected negative Hessian): \(\mathcal I(W) = \mathbb E\big[(y-p)^2 x_{jk}x_{jk}^\top\big] = p(1-p)\,x_{jk}x_{jk}^\top.\)
With Gaussian prior \(W \sim \mathcal N(0,\sigma_0^2 I)\), a Laplace approximation yields posterior covariance \(\Sigma \approx (\sigma_0^{-2}I + \sum_t p_t(1-p_t) x_t x_t^\top)^{-1}.\) So, as in the Rasch case, Fisher information accumulates to form posterior precision. At each step, pick a candidate pair \((j,k)\) and compute \[ \Delta_D(j,k) = \log \det(\Lambda + w x_{jk} x_{jk}^\top) - \log\det(\Lambda), \quad w = p(1-p), \tag{3.13}\] where \(\Lambda = \Sigma^{-1}\). This is the D-optimal criterion, maximizing the expected log-determinant gain in information. A-optimal or E-optimal criteria are similarly defined. We’ll simulate item features, a ground-truth \(W\), and comparisons. Then we compare active D-optimal elicitation vs random pair selection for estimating \(W\). Performance metrics are MSE of \(\hat W\) vs true \(W\), and reliability.
The simulation sets up item features and a ground-truth \(W\), defines the Fisher-gain and Bayesian-update functions, runs D-optimal versus random pair selection, and visualizes the resulting estimation error.
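A minimal NumPy sketch, averaged over several random problem instances, is shown below; the dimensions (\(d = 5\), \(M = 60\) items) and the use of the one-step Newton update are illustrative choices.

```python
import numpy as np

sigmoid = lambda h: 1.0 / (1.0 + np.exp(-h))
d, M, T, n_runs = 5, 60, 60, 50

def run(policy, seed):
    rng = np.random.default_rng(seed)
    X_items = rng.normal(0.0, 1.0, size=(M, d))
    W_true = rng.normal(0.0, 1.0, size=d)
    J, Kk = np.triu_indices(M, k=1)
    Xd = X_items[J] - X_items[Kk]                  # candidate pair differences x_jk
    W_hat, Lam = np.zeros(d), np.eye(d)            # prior W ~ N(0, I)
    Sigma, mse = np.linalg.inv(Lam), np.zeros(T)
    for t in range(T):
        p = sigmoid(Xd @ W_hat)
        w = p * (1 - p)
        if policy == "dopt":
            gain = np.log1p(w * np.einsum("id,de,ie->i", Xd, Sigma, Xd))   # Eq. (3.13)
            q = int(np.argmax(gain))
        else:
            q = int(rng.integers(len(Xd)))
        x = Xd[q]
        y = float(rng.random() < sigmoid(W_true @ x))
        pq = sigmoid(W_hat @ x)
        Lam += pq * (1 - pq) * np.outer(x, x)
        Sigma = np.linalg.inv(Lam)
        W_hat = W_hat + Sigma @ ((y - pq) * x)     # one-step Newton update
        mse[t] = np.mean((W_hat - W_true) ** 2)
    return mse

mse_d = np.mean([run("dopt", s) for s in range(n_runs)], axis=0)
mse_r = np.mean([run("random", s) for s in range(n_runs)], axis=0)
print(f"MSE(W) after {T} queries: D-optimal {mse_d[-1]:.3f}, random {mse_r[-1]:.3f}")
```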
3.3.1 Robotic Trajectory Learning
The linear preference framework applies directly to robotic trajectory learning (Biyik and Sadigh 2018). Consider a robot choosing between trajectories \(\xi_A\) and \(\xi_B\). The human evaluates them via a reward function: \[ R(\xi) = w^\top \phi(\xi) \tag{3.14}\] where \(w \in \mathbb{R}^N\) are unknown weights and \(\phi(\xi) \in \mathbb{R}^N\) extracts features (e.g., distance to lane boundaries, proximity to obstacles, heading, speed). This is exactly our linear model with “items” being trajectories and “features” being \(\phi(\xi)\).
Geometric Interpretation. Each pairwise comparison “prefer \(\xi_A\) over \(\xi_B\)” yields a constraint: \[ w^\top (\phi(\xi_A) - \phi(\xi_B)) > 0 \] This defines a half-space in the weight space. The feasible region for \(w\) is the intersection of all half-spaces consistent with observed preferences. Each query splits this region—orthogonal queries reduce volume by approximately half.
The optimal query selection criterion maximizes the minimum volume removed regardless of the oracle’s answer. For query feature difference \(\Phi = \phi(\xi_A) - \phi(\xi_B)\), this is \(\max_\Phi \min\{\mathbb{E}[1 - f_\Phi(w)], \mathbb{E}[1 - f_{-\Phi}(w)]\}\) where \(f_\Phi(w) = \min(1, \exp(w^\top \Phi))\) soft-weights consistency with the preference. This selects balanced queries where either response is informative.
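A sketch of this max-min volume-removal criterion is given below, assuming we already hold samples of \(w\) from the current belief (for example from a Laplace posterior) and using a random candidate pool as a stand-in for the trajectory feature differences \(\phi(\xi_A) - \phi(\xi_B)\).

```python
import numpy as np

rng = np.random.default_rng(4)
n_feat, n_samples, n_candidates = 4, 5000, 200

w_samples = rng.normal(0.0, 1.0, size=(n_samples, n_feat))      # samples from the current belief over w
Phi_cands = rng.normal(0.0, 1.0, size=(n_candidates, n_feat))   # stand-in query feature differences

def soft_consistency(Phi):
    """f_Phi(w) = min(1, exp(w^T Phi)); written as exp(min(., 0)) to avoid overflow."""
    return np.exp(np.minimum(w_samples @ Phi, 0.0))

def minmax_removed(Phi):
    """Expected volume removed under the worse of the two possible oracle answers."""
    return min(np.mean(1.0 - soft_consistency(Phi)),
               np.mean(1.0 - soft_consistency(-Phi)))

scores = np.array([minmax_removed(Phi) for Phi in Phi_cands])
best = int(np.argmax(scores))
print(f"best candidate: {best}, worst-case volume removed: {scores[best]:.3f}")
```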
Driving Simulator Example. In a 2D driving simulator, trajectory features include:
- Distance to lane boundaries
- Distance to other vehicles
- Heading angle
- Speed
With no learned preferences, the simulated car moves erratically. After 30 comparison queries, it learns to follow the lane direction. After 70 queries, it additionally learns to avoid collisions while staying in lane. The features with highest learned weights correspond to collision avoidance and lane-keeping.
Why Active Selection Matters. Both active and random query selection eventually converge to the true reward weights. However, active selection achieves target performance faster—critical in time-sensitive applications. For exoskeleton rehabilitation (Li et al. 2021), the robot must adapt to individual patients quickly; waiting for asymptotic convergence is not acceptable. Active learning provides crucial early-stage advantages.
Imitation Learning with Compatibility. When learning from multiple human demonstrators, multimodality arises: different people use different strategies for the same task (e.g., picking up an object from the left vs. right). Naive imitation learning fails on contradictory demonstrations.
The solution is to measure compatibility of new demonstrations with the base policy:
- Likelihood: How close is the demonstrated action to the base policy’s prediction?
- Novelty: How uncertain is the base policy at this state?
Demonstrations with low likelihood in low-novelty states indicate conflicting modes and should be filtered. This compatibility-based filtering improves success rates while using less data than naive aggregation (Gandhi et al. 2022).
Foundation Models for Representations. Recent work leverages foundation models pretrained on human video (R3M (Nair et al. 2022), Voltron (Karamcheti et al. 2023)) to provide better trajectory features \(\phi(\xi)\). These models learn that human hands are similar to robotic end-effectors, enabling 2-3x higher success rates with 5-10x fewer demonstrations. For preference learning, better representations mean each comparison is more informative.
3.3.2 Non-Linear Scoring Functions
So far we set \(V_j=W^\top X_j\). Suppose instead that items are scored by a non-linear function \[V_j = f_\theta(X_j),\qquad \theta\in\mathbb{R}^p, \qquad p(j \succ k) = \sigma (f_\theta(X_j) - f_\theta(X_k)) \tag{3.15}\] We want an online procedure that chooses the next comparison \((j,k)\) to learn \(\theta\) quickly. Exact Bayesian design is intractable for general \(f_\theta\). There is a simple and effective approximation that keeps all the math and code lightweight: local Gaussian posterior over \(\theta\) via Laplace + first-order model. Work at the current estimate \(\hat\theta\). Linearize the non-linear scorer around \(\hat\theta\): \[ f_\theta(X_j) \approx f_{\hat\theta}(X_j) + J_j\Delta\theta, \qquad J_j \equiv \left.\frac{\partial f_\theta(X_j)}{\partial \theta}\right| _{\theta = \hat\theta} \in \mathbb{R}^{1\times p}. \tag{3.16}\] The pairwise logit for \((j,k)\) becomes \[\underbrace{f_\theta(X_j)-f_\theta(X_k)}_{\text{logit}} \approx \underbrace{f_{\hat\theta}(X_j)-f_{\hat\theta}(X_k)}_{\text{offset}} + \underbrace{(J_j-J_k)}_{\phi_{jk}}\Delta\theta. \tag{3.17}\]
Define the local “design vector” \(\phi_{jk}\in\mathbb{R}^{1\times p}\). With a Gaussian prior \(\theta\sim\mathcal N(0,\sigma_0^2 I)\) and Bernoulli likelihood, the Laplace posterior around \(\hat\theta\) is \[ \theta \mid \mathcal D \approx \mathcal N(\hat\theta, \Sigma),\qquad \Sigma^{-1} = \sigma_0^{-2} I + \sum_{t} w_t \phi_t^\top \phi_t \tag{3.18}\] where \(w_t = p_t(1-p_t)\) with \(p_t = \sigma\big(f_{\hat\theta}(X_{j_t})-f_{\hat\theta}(X_{k_t})\big)\). This is exactly the same structure as the linear case, with \(x_{jk}\) replaced by \(\phi_{jk}=J_j-J_k\); only the Jacobian rows \(J_j\) are needed. At the current \((\hat\theta, \Sigma)\) and for a candidate \((j,k)\), compute \(p=\sigma\big(f_{\hat\theta}(X_j)-f_{\hat\theta}(X_k)\big)\), \(w = p(1-p)\), and \(\phi = J_j-J_k\) at \(\hat\theta\). Then use any standard optimal design score with Sherman–Morrison:
- D-optimal (log-det gain): \(\Delta_D(j,k) = \log ( 1 + w\phi \Sigma \phi^\top )\).
- A-optimal (trace drop): \(\Delta_A(j,k) = \dfrac{w \phi \Sigma^2 \phi^\top }{ 1+w \phi \Sigma \phi^\top}\).
- E-optimal proxy: \(\Delta_E^{\text{proxy}}(j,k) = w \phi \Sigma \phi^\top\).
Pick the pair with the largest gain. This is tractable because it only needs a vector–matrix–vector product with \(\Sigma\) and a Jacobian row. Mutual-information acquisition such as BALD reduces to the same formulas under the Laplace Gaussian approximation, and maximizing \(\Delta_D\) is a good default.
We use a tiny two-layer network \(f_\theta(x)=a^\top \tanh(Bx)\) with \(\theta=(a,B)\) and derive \(J_j\) analytically. The pipeline defines the network and its Jacobian, sets up the Laplace-approximation state and update, runs the active-learning loop comparing D-optimal and random selection, and visualizes the results.
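A minimal NumPy sketch is shown below with an illustrative architecture (three input features, four hidden units); scores are compared after centering, since pairwise data identify \(f_\theta\) only up to an additive constant.

```python
import numpy as np

sigmoid = lambda h: 1.0 / (1.0 + np.exp(-h))

d, Hdim, M, T = 3, 4, 80, 80
P = Hdim + Hdim * d                                  # p = dim(a) + dim(B)

def unpack(theta):
    return theta[:Hdim], theta[Hdim:].reshape(Hdim, d)

def f(theta, X):
    a, B = unpack(theta)
    return np.tanh(X @ B.T) @ a                      # f_theta(x) = a^T tanh(Bx)

def jacobian(theta, X):
    """Rows J_j = d f_theta(X_j) / d theta, computed analytically."""
    a, B = unpack(theta)
    Z = np.tanh(X @ B.T)                                       # (M, H): hidden activations
    dB = (a * (1 - Z**2))[:, :, None] * X[:, None, :]          # (M, H, d): d f / d B
    return np.hstack([Z, dB.reshape(len(X), -1)])              # d f / d a is Z itself

rng0 = np.random.default_rng(5)
theta_true = rng0.normal(0.0, 1.0, size=P)
X_items = rng0.normal(0.0, 1.0, size=(M, d))
pairs = np.array([(j, k) for j in range(M) for k in range(j + 1, M)])
s_true = f(theta_true, X_items)

def run(policy, seed):
    rng = np.random.default_rng(seed)
    theta_hat = 0.1 * rng.normal(size=P)             # small nonzero start so Jacobians are informative
    Lam = np.eye(P)                                  # prior precision (sigma_0 = 1)
    Sigma, err = np.linalg.inv(Lam), np.zeros(T)
    for t in range(T):
        scores, J = f(theta_hat, X_items), jacobian(theta_hat, X_items)
        phi = J[pairs[:, 0]] - J[pairs[:, 1]]                        # local design vectors
        p = sigmoid(scores[pairs[:, 0]] - scores[pairs[:, 1]])
        w = p * (1 - p)
        if policy == "dopt":
            gain = np.log1p(w * np.einsum("ip,pq,iq->i", phi, Sigma, phi))
            q = int(np.argmax(gain))
        else:
            q = int(rng.integers(len(pairs)))
        y = float(rng.random() < sigmoid(s_true[pairs[q, 0]] - s_true[pairs[q, 1]]))
        x = phi[q]
        Lam += w[q] * np.outer(x, x)
        Sigma = np.linalg.inv(Lam)
        theta_hat = theta_hat + Sigma @ ((y - p[q]) * x)             # one-step Newton in the linearized model
        s_hat = f(theta_hat, X_items)
        err[t] = np.mean(((s_hat - s_hat.mean()) - (s_true - s_true.mean())) ** 2)
    return err

print(f"centered score MSE after {T} queries: D-optimal {run('dopt', 0)[-1]:.3f}, random {run('random', 0)[-1]:.3f}")
```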
3.4 Active Query Selection for Gaussian Process Models
The previous sections developed Fisher information for optimal query selection in parametric preference models—the Rasch model (Section 3.2) and linear scoring functions (Section 3.3). Chapter 2, however, introduced Gaussian Processes (GPs) as a flexible nonparametric alternative for reward modeling and covered GP inference via the Laplace approximation (Section 2.4.2). Here we address the natural question: how do we actively select queries when using GPs?
The key insight is that GP uncertainty—quantified by the posterior variance—naturally guides query selection. Unlike parametric models where Fisher information depends only on the current parameter estimate, GP uncertainty varies across the input space, being higher far from observed data and lower near it.
3.4.1 Information-Theoretic Acquisition Function
For GP preference learning, we seek queries that maximize information gain about the reward function. Given a dataset \(\mathcal{D}\) and a candidate query \(Q = (x_A, x_B)\), the acquisition function measures the expected reduction in uncertainty about \(r\): \[ a(Q) = I(r; y \mid \mathcal{D}, Q) = H(y \mid \mathcal{D}, Q) - \mathbb{E}_{r \sim p(r \mid \mathcal{D})}[H(y \mid r, Q)] \tag{3.19}\]
Interpretation of the two terms:
\(H(y \mid \mathcal{D}, Q)\): Predictive entropy—how uncertain is the query outcome under our current model? High when \(p(A \succ B) \approx 0.5\).
\(\mathbb{E}_r[H(y \mid r, Q)]\): Expected conditional entropy—how uncertain would the query be if we knew the true reward function? Low for “easy” comparisons, high for inherently noisy comparisons.
The acquisition function trades off these two terms: we want queries where the model is uncertain (high predictive entropy) but a human could provide a clear answer (low expected conditional entropy).
3.4.2 Practical Computation
Using the Laplace approximation from Chapter 2 (Section 2.4.2), the acquisition function can be computed as: \[ a(x_A, x_B) = h\left(\Phi\left(\frac{\mu_A - \mu_B}{\sqrt{2\sigma_{\text{noise}}^2 + g(x_A, x_B)}}\right)\right) - m(x_A, x_B) \tag{3.20}\]
where:
- \(\mu_A, \mu_B\) are the GP posterior means at \(x_A, x_B\)
- \(g(x_A, x_B) = \sigma_A^2 + \sigma_B^2 - 2\,\text{Cov}(r(x_A), r(x_B))\) captures uncertainty from both points
- \(h(p) = -p\log_2 p - (1-p)\log_2(1-p)\) is the binary entropy
- \(\Phi\) is the standard normal CDF
- \(m(x_A, x_B)\) is a correction term for human noise
Key property: The trivial query \(Q = (x, x)\) (comparing an item to itself) is a global minimizer, not maximizer, of this acquisition function. This contrasts with some other methods that can get stuck on uninformative queries.
The pipeline defines the RBF kernel and helper functions, implements a GP preference model whose fit step uses the Laplace approximation (Newton’s method) to find the posterior mode and whose acquisition step computes the information gain for a candidate query, runs the comparison of active (information-theoretic) against random query selection, and visualizes both the estimation error as a function of the number of comparisons and the learned reward functions.
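A self-contained NumPy sketch is given below. It restricts the GP to a fixed one-dimensional grid of candidate items, uses a logistic link, and estimates the expected conditional entropy in Eq. (3.19) by Monte Carlo rather than the probit closed form of Eq. (3.20); all sizes and the ground-truth reward are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
sigmoid = lambda h: 1.0 / (1.0 + np.exp(-h))
bin_ent = lambda p: -(p * np.log2(np.clip(p, 1e-12, 1.0)) +
                      (1 - p) * np.log2(np.clip(1 - p, 1e-12, 1.0)))

n, T = 25, 40                                       # grid points (candidate items), queries
xs = np.linspace(0.0, 1.0, n)
r_true = np.sin(3 * np.pi * xs) + 0.5 * xs          # illustrative ground-truth reward
K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / 0.15**2) + 1e-5 * np.eye(n)
K_inv = np.linalg.inv(K)
pairs = np.array([(a, b) for a in range(n) for b in range(a + 1, n)])

def laplace_fit(queried, ys, n_iter=30):
    """Posterior mode and covariance of r on the grid (logistic pairwise likelihood)."""
    D = np.zeros((len(queried), n))
    for t, (a, b) in enumerate(queried):
        D[t, a], D[t, b] = 1.0, -1.0
    f = np.zeros(n)
    for _ in range(n_iter):
        p = sigmoid(D @ f)
        H = K_inv + D.T @ ((p * (1 - p))[:, None] * D)
        f = f + np.linalg.solve(H, D.T @ (ys - p) - K_inv @ f)
    return f, np.linalg.inv(H)

def acquisition(mu, Sig, a, b, n_mc=256):
    """Eq. (3.19): predictive entropy minus Monte Carlo expected conditional entropy."""
    m = mu[a] - mu[b]
    s2 = max(Sig[a, a] + Sig[b, b] - 2 * Sig[a, b], 0.0)
    p_hat = sigmoid(m / np.sqrt(1 + np.pi * s2 / 8))         # marginal preference probability
    deltas = m + np.sqrt(s2) * rng.standard_normal(n_mc)
    return bin_ent(p_hat) - np.mean(bin_ent(sigmoid(deltas)))

def run(policy):
    queried, ys, err = [], [], np.zeros(T)
    mu, Sig = np.zeros(n), K.copy()
    for t in range(T):
        if policy == "active":
            scores = [acquisition(mu, Sig, a, b) for a, b in pairs]
            a, b = pairs[int(np.argmax(scores))]
        else:
            a, b = pairs[int(rng.integers(len(pairs)))]
        y = float(rng.random() < sigmoid(r_true[a] - r_true[b]))
        queried.append((a, b)); ys.append(y)
        mu, Sig = laplace_fit(queried, np.array(ys))
        err[t] = np.mean(((mu - mu.mean()) - (r_true - r_true.mean())) ** 2)
    return err

print(f"reward MSE after {T} comparisons: active {run('active')[-1]:.3f}, random {run('random')[-1]:.3f}")
```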
3.4.3 When to Use GP Active Learning
| Scenario | Recommendation |
|---|---|
| Nonlinear reward structure | GP strongly preferred |
| Low-to-moderate dimensions (\(d \lesssim 10\)) | GP works well |
| Expensive human feedback | Active learning essential |
| Need uncertainty quantification | GP provides this naturally |
| Large datasets (\(n > 1000\)) | Consider sparse GP approximations |
| Very high dimensions | Linear models may be more practical |
3.5 Active Learning for Direct Preference Optimization
We now turn to a specific, important application: active learning for DPO (Direct Preference Optimization) in large language model alignment. Chapter 1 introduced DPO and its connection to Bradley-Terry; here we show how to apply active learning principles to reduce the number of human comparisons needed.
3.5.1 The DPO Setting
Recall from Section 1.2 that DPO directly optimizes a policy \(\pi_\theta\) on preference data without learning an explicit reward model. The key insight is that DPO implicitly assumes Bradley-Terry preferences: \[ p(y_w \succ y_l \mid x) = \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right) \tag{3.21}\]
Challenge: Large language models have billions of parameters, making exact Fisher information computation intractable. How can we apply the D-optimal design principles from Sections 3.2 and 3.3?
3.5.2 The Log-Linear Policy Assumption
The key insight is to approximate the policy as log-linear in the last-layer features: \[ \pi(y \mid x; \theta) \propto \exp(\phi(x, y)^\top \theta) \tag{3.22}\]
where \(\phi(x, y)\) is the feature representation from the model’s last layer. This assumption is justified by the neural tangent kernel perspective: for sufficiently large networks, the dynamics during training are approximately linear in the parameters.
Under this assumption, the DPO loss Hessian takes the form: \[ H(\theta) = \sum_{i=1}^n p_i(1-p_i) \cdot \Delta\phi_i \Delta\phi_i^\top \tag{3.23}\]
where \(\Delta\phi_i = \phi(x_i, y_{w,i}) - \phi(x_i, y_{l,i})\) is the feature difference between winning and losing responses, and \(p_i\) is the predicted preference probability.
Key observation: This Hessian is exactly the Fisher information matrix for the DPO model! The D-optimal criterion becomes: \[ \max_S \log\det\left(\sum_{i \in S} p_i(1-p_i) \cdot \Delta\phi_i \Delta\phi_i^\top\right) \tag{3.24}\]
3.5.3 The ADPO Algorithm
The Active DPO (ADPO) algorithm applies D-optimal design to select preference comparisons:
Input: Reference policy \(\pi_{\text{ref}}\), prompt set \(\{x_i\}\), response pairs \(\{(y_{i,1}, y_{i,2})\}\), budget \(T\)
Initialize: \(\theta_0 = \theta_{\text{ref}}\), \(S_0 = \emptyset\)
For \(t = 1, \ldots, T\):
Compute features: For each candidate pair \((x_i, y_{i,1}, y_{i,2})\):
- \(\Delta\phi_i = \phi(x_i, y_{i,1}) - \phi(x_i, y_{i,2})\)
- \(p_i = \sigma(\beta(\log\pi_{\theta_{t-1}}(y_{i,1}|x_i) - \log\pi_{\theta_{t-1}}(y_{i,2}|x_i)))\)
D-optimal selection: Select query maximizing: \[i^* = \arg\max_i \log\det\left(I + p_i(1-p_i) H_t^{-1} \Delta\phi_i \Delta\phi_i^\top\right) \tag{3.25}\] where \(H_t\) is the current Fisher information matrix
Query oracle: Obtain human preference \(y_t\) for pair \((x_{i^*}, y_{i^*,1}, y_{i^*,2})\)
Update: \(S_t = S_{t-1} \cup \{(i^*, y_t)\}\), retrain \(\theta_t\) on \(S_t\) using DPO loss
Return: Final policy \(\pi_{\theta_T}\)
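A sketch of the D-optimal selection step is given below, with random vectors standing in for the last-layer feature differences \(\Delta\phi_i\) and a fixed stand-in for the policy’s logit margins; in the full algorithm both would be recomputed from \(\pi_{\theta_{t-1}}\), and each selected pair would be sent for annotation followed by periodic DPO retraining.

```python
import numpy as np

rng = np.random.default_rng(7)
sigmoid = lambda h: 1.0 / (1.0 + np.exp(-h))

d, n_pool, T, beta = 16, 500, 50, 0.1

# Stand-ins for last-layer features phi(x, y_1), phi(x, y_2) of each candidate pair
phi_1 = rng.normal(0.0, 1.0, size=(n_pool, d))
phi_2 = rng.normal(0.0, 1.0, size=(n_pool, d))
delta_phi = phi_1 - phi_2

# Stand-in for the policy's log-probability margins; in ADPO these come from pi_{theta_{t-1}}
logit_margin = beta * (delta_phi @ rng.normal(0.0, 0.3, size=d))

H = np.eye(d)                        # regularized running Fisher information
selected = []
for t in range(T):
    p = sigmoid(logit_margin)        # in the full algorithm, recomputed after each retraining
    w = p * (1 - p)
    H_inv = np.linalg.inv(H)
    # log det(I + w H^{-1} dphi dphi^T) = log(1 + w dphi^T H^{-1} dphi), as in Eq. (3.25)
    gains = np.log1p(w * np.einsum("id,de,ie->i", delta_phi, H_inv, delta_phi))
    gains[selected] = -np.inf        # select without replacement
    i_star = int(np.argmax(gains))
    selected.append(i_star)
    H += w[i_star] * np.outer(delta_phi[i_star], delta_phi[i_star])
    # ...here the human label for pair i_star would be collected and theta periodically retrained...

print(f"selected {len(selected)} pairs; log det(H) = {np.linalg.slogdet(H)[1]:.2f}")
```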
3.5.4 Theoretical Guarantees
Under standard regularity conditions, ADPO achieves the following convergence rate:
Theorem 3.1 Let \(\theta^*\) be the optimal policy parameters. Under the log-linear policy assumption and standard regularity conditions, ADPO with \(n\) actively selected comparisons achieves: \[ \|\hat\theta_n - \theta^*\| = O\left(\frac{d}{\sqrt{n}}\right) \tag{3.26}\] where \(d\) is the effective dimension (rank of the Fisher information matrix). This matches the minimax optimal rate for \(d\)-dimensional estimation.
The key insight is that D-optimal design ensures the Fisher information grows proportionally to \(n\), leading to \(O(1/\sqrt{n})\) estimation error.
3.5.5 ADPO+ for Offline Settings
In many practical settings, we have access to a fixed pool of unlabeled preference pairs (e.g., model outputs) but cannot generate new queries. ADPO+ adapts the algorithm for this offline setting:
- Pool-based selection: Instead of synthesizing queries, select from the existing pool
- Batch selection: Select \(b\) queries at once to enable parallel annotation
- Diversity regularization: Add penalty for selecting similar queries
The D-optimal criterion naturally encourages diversity: selecting redundant queries provides diminishing information gain.
3.5.6 Empirical Guidance: When Does Active Learning Help?
| Factor | Active Learning Beneficial? |
|---|---|
| Annotation cost | High cost → Active learning essential |
| Data heterogeneity | Diverse prompts → More room for strategic selection |
| Budget constraints | Limited budget → Active learning shows larger gains |
| Model capacity | Larger models → Active learning helps reduce overfitting |
| Query pool size | Large pool → More room for intelligent selection |
Practical recommendation: Active learning for DPO provides the largest gains when:
- Annotation budget is limited (< 10K comparisons)
- Prompts cover diverse topics or difficulty levels
- The goal is sample efficiency over raw performance

The table below contrasts GP active learning with ADPO.
| Aspect | GP Active Learning | ADPO |
|---|---|---|
| Model | Nonparametric (GP) | Parametric (neural network) |
| Scalability | \(O(n^3)\) | \(O(d^2 n)\) with approximations |
| Use case | Small datasets, need uncertainty | LLM alignment, large models |
| Acquisition | Information gain | D-optimal design |
| Theoretical basis | Mutual information | Fisher information |
Both approaches share the core insight: use model uncertainty to guide query selection.
3.6 Discussion Questions
How does the reliability metric defined in this chapter relate to classical test theory in psychometrics? What assumptions underlie this correspondence?
Why does D-optimal design maximize the log-determinant of the information matrix rather than the trace? In what situations might A-optimal (trace-based) selection be preferable?
The local linearization approach for non-linear scoring functions trades off accuracy for tractability. What are the conditions under which this approximation is most accurate? When might it fail?
Active preference learning assumes that items’ appeals are known or pre-calibrated. How might estimation errors in item parameters affect the efficiency of active selection strategies?
The metric elicitation framework assumes an oracle with a consistent underlying preference. How would you modify the approach to handle an oracle whose preferences evolve over time or exhibit inconsistencies?
In the robotics case studies, pairwise comparisons are used to learn reward functions. What are the advantages and disadvantages of pairwise comparisons versus direct reward labeling for robotic learning?
Foundation models like R3M and Voltron leverage pretraining on human video data. What types of robotic tasks might benefit most from this approach? What tasks might still require substantial robot-specific data?
The active elicitation framework for imitation learning identifies “compatible” demonstrations. How might this concept of compatibility be extended to multi-agent settings where different agents have legitimately different optimal behaviors?
3.7 Exercises
We mark exercises with (*), (**), and (***) to indicate those we consider relatively easy, medium, and hard, respectively.
3.7.1 Fisher Information Maximum (*)
Show that for the Rasch model with \(p = \sigma(U + V_j)\), the Fisher information \(\mathcal{I}_j(U) = p(1-p)\) is maximized when \(p = 0.5\). What does this imply about the optimal difficulty of items for measuring a user’s ability?
3.7.2 Sherman-Morrison Update (**)
Derive the Sherman-Morrison formula for efficiently updating the posterior covariance when adding a single rank-1 observation. Specifically, show that if \(\Sigma^{-1} \gets \Sigma^{-1} + w \cdot vv^\top\), then \(\Sigma \gets \Sigma - \frac{w \cdot \Sigma vv^\top \Sigma}{1 + w \cdot v^\top \Sigma v}\).
3.7.3 D-Optimal vs A-Optimal (**)
Construct a 2D example where D-optimal and A-optimal criteria select different items for the next query. Interpret the difference geometrically in terms of the posterior ellipsoid.
3.7.4 Pairwise Fisher Information (**)
For the Bradley-Terry model \(p(j \succ k | U) = \sigma(U^\top(V_j - V_k))\), derive the Fisher information matrix for \(U\) given a single pairwise comparison. How does this relate to the single-item case?
3.7.5 Active Selection Implementation (***)
Extend the D-optimal pairwise selection simulation to include all three optimality criteria (A, D, E). Compare their sample efficiency on a synthetic dataset of 50 items with 3-dimensional user preferences. Which criterion performs best under different prior uncertainty structures?
3.7.6 Noisy Oracle Elicitation (***)
Extend the LPM elicitation algorithm to handle noisy oracle responses where the oracle flips its answer with probability \(\epsilon < 0.5\). How does the query complexity scale with \(\epsilon\)?
3.7.7 Local Linearization Analysis (***)
Consider a scoring function \(f_\theta(x) = a^\top \tanh(Bx)\) with \(\theta = (a, B)\). Analytically derive the Jacobian \(J_j = \partial f_\theta(X_j) / \partial \theta\). Under what conditions on the hidden activations \(\tanh(BX_j)\) is the local Gaussian approximation most accurate?
3.7.8 Compatibility Metric Design (**)
In the active elicitation framework for imitation learning, the compatibility metric depends on thresholds \(\lambda\) and \(\eta\). Design an algorithm that automatically selects these thresholds from a held-out validation set of demonstrations. What objective should the algorithm optimize?