Designing a good utility function (or reward function) by hand for a complex AI or robotics task is notoriously difficult and error-prone. Instead of manually specifying what is “good” behavior, we can learn a utility function from human preferences. In this chapter, we explore how an agent can infer a human’s underlying utility function (their preferences or reward criteria) from various forms of feedback. We discuss both supervised learning and Bayesian approaches to utility learning, and examine techniques motivated by robotics—learning from demonstrations, physical corrections, trajectory evaluations, and pairwise comparisons. Throughout, we include mathematical formulations and code examples to illustrate the learning process.
2.1 The Supervised Learning Problem
Supervised learning approaches treat human feedback as labeled data to directly fit a utility function. The core idea is to assume there exists a true utility function \(u^*(x)\) (over states, outcomes, or trajectories \(x\)) that explains a human’s choices. We then choose a parameterized model \(u_\theta(x)\) and adjust \(\theta\) so that \(u_\theta\) agrees with the human-provided preferences.
A common feedback format is pairwise comparisons: the human is shown two options (outcomes or trajectories) \(A\) and \(B\) and indicates which is preferred. We can model the probability that the human prefers \(A\) over \(B\) using a logistic or Bradley–Terry model: \[
P(A \succ B) = \sigma\big(u_\theta(A) - u_\theta(B)\big),
\]
where \(\sigma(z)=\frac{1}{1+e^{-z}}\) is the sigmoid function. This implies the human is more likely to prefer \(A\) if \(u_\theta(A)\) is much larger than \(u_\theta(B)\).
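As a quick numerical sketch (using a hypothetical linear utility \(u_\theta(x) = \theta^\top x\) over two-dimensional features; all numbers here are illustrative), the preference probability under this model can be computed as follows:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear utility model u_theta(x) = theta . x over 2-D feature vectors
theta = np.array([1.0, -0.5])
A = np.array([2.0, 1.0])   # features of option A
B = np.array([1.0, 3.0])   # features of option B

u_A = theta @ A            # model's utility for A
u_B = theta @ B            # model's utility for B

# Bradley-Terry / logistic model: P(A preferred over B) = sigma(u(A) - u(B))
p_A_over_B = sigmoid(u_A - u_B)
print(f"P(A preferred over B) = {p_A_over_B:.3f}")
```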
At the heart of learning from human preferences lies a latent utility function — a function that assigns numerical value to states, trajectories, or outcomes according to a human’s (possibly unspoken) preferences. The goal of a learning algorithm is to infer this function from observed feedback, which may come in the form of demonstrations, ratings, rankings, or pairwise comparisons. But how exactly do we represent and update our belief about this hidden utility function?
Two major paradigms in statistical learning provide different answers: point estimation and posterior estimation.
In point estimation, we seek a single “best guess” for the utility function — typically a function \(u_\theta(x)\) from a parameterized family (e.g. linear models, neural nets), with parameters \(\theta \in \mathbb{R}^d\). Given data \(\mathcal{D}\) from human feedback (e.g. preferences), we choose the parameter \(\hat{\theta}\) that best explains the observed behavior. Formally: \[
\hat{\theta} = \arg\max_{\theta} \; p(\mathcal{D} \mid \theta) = \arg\max_{\theta} \; \log p(\mathcal{D} \mid \theta).
\]
This is maximum likelihood estimation (MLE): we pick the parameters that make the observed data most probable under our model. Once \(\hat{\theta}\) is selected, we treat \(u_{\hat{\theta}}\) as the agent’s utility function, and optimize or sample behavior accordingly. This approach is straightforward and computationally efficient. It is the foundation of most supervised learning methods (like logistic regression or deep learning), and it provides a natural interpretation: we’re directly finding the utility function that agrees with the human feedback. However, it discards uncertainty: it assumes the data is sufficient to pin down a single utility function, which may not be true in practice.
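As a minimal sketch of this recipe (assuming a linear utility model and the Bradley–Terry likelihood above; the data and names here are synthetic and illustrative), the point estimate can be obtained with an off-the-shelf optimizer. A small L2 penalty is included because noiseless preference labels make the pure MLE unbounded in scale:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic comparisons generated from a hidden linear utility u*(x) = theta_true . x
theta_true = np.array([1.0, -2.0])
A_feats = rng.normal(size=(50, 2))            # features of option A in each pair
B_feats = rng.normal(size=(50, 2))            # features of option B in each pair
y = (A_feats @ theta_true > B_feats @ theta_true).astype(float)

def objective(theta):
    # Negative Bradley-Terry log-likelihood of the observed choices,
    # plus a small L2 penalty to keep the problem well-posed
    diff = (A_feats - B_feats) @ theta
    log_p = -np.logaddexp(0.0, -diff)         # log sigma(diff)
    log_1mp = -np.logaddexp(0.0, diff)        # log(1 - sigma(diff))
    return -(y * log_p + (1 - y) * log_1mp).sum() + 0.01 * theta @ theta

theta_hat = minimize(objective, x0=np.zeros(2)).x
print("Estimated utility weights:", theta_hat)
```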
In contrast, posterior estimation takes a fully Bayesian view. Instead of committing to one estimate, we maintain a distribution over utility functions. That is, we place a prior \(p(\theta)\) over parameters (or over functions \(u\) more generally), and update this to a posterior after observing data \(\mathcal{D}\): \[
p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} \;\propto\; p(\mathcal{D} \mid \theta)\, p(\theta).
\]
This posterior expresses our uncertainty over which utility functions are compatible with the human feedback. From this distribution, we can make predictions (e.g., using the posterior mean utility), quantify confidence, or even actively select new queries to reduce uncertainty (active learning). For instance, if we model utilities with a Gaussian Process (GP), then the posterior over \(u(x)\) is also a GP after observing comparisons or evaluations. If we use a neural network for \(u_\theta(x)\), we can approximate the posterior with ensembles, variational inference, or MCMC. Posterior estimation is especially valuable when human feedback is sparse, noisy, or ambiguous — as is often the case in real-world preference learning. It allows the agent to reason about what it doesn’t know and to take cautious or exploratory actions accordingly.
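As a minimal illustration of posterior estimation (assuming a one-parameter utility model \(u_w(x) = w \cdot x\) and the Bradley–Terry likelihood; the prior, data, and numbers are all illustrative), the posterior over the utility parameter can be computed exactly on a grid:

```python
import numpy as np

rng = np.random.default_rng(1)

# One-parameter utility model u_w(x) = w * x; hidden "true" parameter generates the data
w_true = 2.0
A = rng.uniform(-1, 1, size=15)
B = rng.uniform(-1, 1, size=15)
p_prefer_A = 1 / (1 + np.exp(-(w_true * A - w_true * B)))
y = (rng.random(15) < p_prefer_A).astype(float)   # noisy preference labels

# Grid over parameter values with a Gaussian prior
w_grid = np.linspace(-5, 5, 501)
log_prior = -0.5 * (w_grid / 2.0) ** 2

# Bradley-Terry log-likelihood of all comparisons at each grid point
diff = np.outer(w_grid, A - B)                    # shape (grid points, comparisons)
log_lik = (y * -np.logaddexp(0, -diff) + (1 - y) * -np.logaddexp(0, diff)).sum(axis=1)

# Normalize to a posterior over the grid
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()

post_mean = (w_grid * post).sum()
post_std = np.sqrt(((w_grid - post_mean) ** 2 * post).sum())
print(f"Posterior mean of w: {post_mean:.2f}, posterior std: {post_std:.2f}")
```

With few or noisy comparisons the posterior stays wide, which is exactly the uncertainty that a point estimate would discard.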
The next two sections instantiate these two perspectives. In Section 2.2, we explore point estimation via supervised learning — treating preference data as labeled examples and fitting a utility model. In the section that follows, we shift to posterior estimation with Bayesian methods like Gaussian processes and Bayesian neural networks, which model both our current estimate and the uncertainty around it.
2.2 Point Estimation via Maximum Likelihood
Given a dataset of comparisons \(\mathcal{D}=\{(A_i, B_i, y_i)\}\) (with \(y_i=1\) if \(A_i\) was preferred and \(0\) if \(B_i\) was preferred), we can fit \(\theta\) by maximizing the likelihood of the human’s choices. Equivalently, we minimize a binary cross-entropy loss: \[
\mathcal{L}(\theta) = -\sum_{i} \Big[ y_i \log \sigma\big(u_\theta(A_i) - u_\theta(B_i)\big) + (1 - y_i) \log \big(1 - \sigma(u_\theta(A_i) - u_\theta(B_i))\big) \Big],
\]
often with a regularization term to prevent overfitting. This is a straightforward supervised learning problem – essentially logistic regression – on pairwise difference features.
Example: Suppose a human’s utility for an outcome can be described by a quadratic function (unknown to the learning algorithm). We collect some pairwise preferences and then train a utility model \(u_\theta(x)\) to predict those preferences. The code below simulates this scenario:
```python
import numpy as np

# True utility function (unknown to learner), e.g. u*(x) = -(x-5)^2 + constant
def true_utility(x):
    return -(x - 5)**2   # (peak at x=5)

# Generate synthetic pairwise preference data
np.random.seed(42)
n_pairs = 20
X1 = np.random.uniform(0, 10, size=n_pairs)   # 20 random x-values
X2 = np.random.uniform(0, 10, size=n_pairs)   # 20 more random x-values

# Determine preferences according to true utility
prefs = (true_utility(X1) > true_utility(X2)).astype(int)   # 1 if X1 preferred, else 0

# Parametric model for utility: u_theta(x) = w0 + w1*x + w2*x^2 (quadratic form)
# Initialize weights
w = np.zeros(3)
lr = 0.01     # learning rate
reg = 1e-3    # L2 regularization strength

for epoch in range(1000):
    # Compute predictions via logistic model
    util_diff = (w[0] + w[1]*X1 + w[2]*X1**2) - (w[0] + w[1]*X2 + w[2]*X2**2)
    pred = 1 / (1 + np.exp(-util_diff))   # sigma(w . (phi(X1) - phi(X2)))

    # Gradient of cross-entropy loss
    error = pred - prefs                  # (sigma - y)

    # Features for X1 and X2
    phi1 = np.vstack([np.ones(n_pairs), X1, X1**2]).T
    phi2 = np.vstack([np.ones(n_pairs), X2, X2**2]).T
    phi_diff = phi1 - phi2

    # Gradient: derivative of loss w.r.t. w = (sigma - y) * phi_diff (averaged) + regularization
    grad = phi_diff.T.dot(error) / n_pairs + reg * w

    # Update weights
    w -= lr * grad

print("Learned weights:", w)
```
Learned weights: [ 0. 2.74417195 -0.22129969]
After training, we can compare the learned utility function \(u_\theta(x)\) to the true utility \(u^*(x)\). Below we plot the two functions:
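A minimal plotting sketch (assuming the `w` and `true_utility` defined in the training code above) is:

```python
import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(0, 10, 200)
u_learned = w[0] + w[1] * xs + w[2] * xs**2   # learned quadratic utility
u_true = true_utility(xs)                     # true utility (for comparison only)

plt.plot(xs, u_true, 'k--', label=r'True utility $u^*(x)$')
plt.plot(xs, u_learned, 'b-', label=r'Learned utility $u_\theta(x)$')
plt.xlabel('x')
plt.ylabel('utility')
plt.legend()
plt.show()
```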
The learned curve closely matches the true utility up to an arbitrary scaling factor (a utility inferred from comparisons is only identified up to an affine transformation). The algorithm successfully recovered a utility function that orders states almost the same way as the true utility \(u^*(x)\). In general, learning from comparisons can recover the relative utility of options (which item is preferred), although the absolute scale of \(u_\theta\) is unidentifiable without further assumptions. Supervised learning on preferences has been widely used for ranking problems and preference-based reward learning.
In standard preference learning, we often learn a utility function and then use it to define a policy. However, in some settings—especially those involving large models like language models—it is more effective to directly learn a policy that aligns with human preferences, bypassing the intermediate reward model. One such method is Direct Preference Optimization (DPO), which offers a simple, stable way to align a policy to preference data through supervised learning.
To understand DPO, consider the following setting:
We are given a reference policy \(\pi_{\text{ref}}\), such as a pre-trained language model.
We want to learn a new policy \(\pi_\theta\) that improves upon \(\pi_{\text{ref}}\) by better reflecting human preferences.
Our data consists of pairwise comparisons: for each prompt \(x\), a human expresses a preference between two outputs \(y_+ \succ y_-\), where \(y_+\) is the preferred response.
Rather than learning an explicit reward function \(R(x, y)\) and using it to optimize the policy via reinforcement learning, DPO treats this as a classification problem: we want to encourage the policy to assign higher likelihood to the preferred response.
To formalize this, we define a preference score: \[
s_\theta(x, y_+, y_-) = \log \pi_\theta(y_+ \mid x) - \log \pi_\theta(y_- \mid x)
\] This is the difference in log-likelihood between the preferred and dispreferred outputs. We can then define the DPO loss as a logistic regression objective: \[
\mathcal{L}_{\text{DPO}}(\theta) = -\log \sigma\left(s_\theta(x, y_+, y_-)\right)
\] where \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is the sigmoid function.
This loss encourages \(\pi_\theta\) to assign greater probability mass to \(y_+\) than \(y_-\), pushing the policy toward outputs that align with human preferences. Because this is a differentiable, supervised loss, it can be optimized with standard gradient-based techniques, without needing to sample from the environment or estimate advantages, as in traditional RL.
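In code, the per-pair objective in the simplified form above reduces to a few lines. The sketch below assumes the policy exposes log-probabilities for the two responses; the function name and numbers are illustrative:

```python
import numpy as np

def dpo_loss(logp_pos, logp_neg):
    """Per-pair DPO loss in the simplified form above.

    logp_pos: log pi_theta(y_+ | x) for the preferred response
    logp_neg: log pi_theta(y_- | x) for the dispreferred response
    """
    s = logp_pos - logp_neg           # preference score s_theta(x, y_+, y_-)
    return np.logaddexp(0.0, -s)      # -log sigma(s), computed stably

# Example: the policy currently assigns slightly higher log-prob to y_-
print(dpo_loss(logp_pos=-2.3, logp_neg=-1.9))   # loss > log(2), pushing mass toward y_+
```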
Although DPO does not explicitly define or optimize a reward function, we can interpret it as doing so implicitly. Suppose we define a reward function: \[
R_\theta(y \mid x) = \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x)
\] This reward encourages \(\pi_\theta\) to move away from \(\pi_{\text{ref}}\) in directions that increase the probability of preferred outputs. Under this formulation, the DPO objective can be interpreted as optimizing this reward difference directly from preferences.
To understand why this implicit reward leads to a stable and interpretable policy, we can connect DPO to the principle of maximum entropy. This principle says that, among all distributions that satisfy certain constraints (e.g., achieving a particular expected reward), we should prefer the one with maximum entropy—that is, the most uncertain or uncommitted distribution consistent with our knowledge.
Formally, consider the space \(\mathcal{P}\) of distributions over responses \(y\), and a reward function \(R(y)\). The maximum entropy distribution that satisfies a reward constraint is the solution to: \[
\max_{p \in \mathcal{P}} \; -\sum_{y} p(y) \log p(y) \quad \text{subject to} \quad \mathbb{E}_{y \sim p}[R(y)] = c, \qquad \sum_{y} p(y) = 1,
\] for some target expected reward \(c\).
The solution to this constrained optimization problem is a Boltzmann distribution: \[
p^*(y) \propto \exp\left(\frac{R(y)}{\tau}\right)
\] for some temperature \(\tau > 0\), where \(\tau\) controls how deterministic the distribution is. As \(\tau \to 0\), the distribution concentrates on the highest-reward outputs; as \(\tau \to \infty\), it becomes uniform.
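A quick numerical sketch (with an arbitrary toy reward vector) illustrates the effect of the temperature:

```python
import numpy as np

def boltzmann(R, tau):
    # p(y) proportional to exp(R(y) / tau), computed stably
    z = R / tau
    p = np.exp(z - z.max())
    return p / p.sum()

R = np.array([1.0, 2.0, 4.0])
for tau in [0.1, 1.0, 100.0]:
    print(f"tau = {tau:6.1f} ->", np.round(boltzmann(R, tau), 3))
# Small tau concentrates on the highest-reward option; large tau approaches uniform.
```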
Now suppose our reference policy \(\pi_{\text{ref}}(y \mid x)\) already represents a reasonable starting point. Then the optimal policy \(\pi_\theta\) can be viewed as a reward-weighted version of this reference policy: \[
\pi_\theta(y \mid x) \;\propto\; \pi_{\text{ref}}(y \mid x) \, \exp\!\left(\frac{R(y \mid x)}{\tau}\right),
\] which is exactly the solution to maximizing expected reward subject to a KL-divergence penalty \(\tau\,\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)\).
This form ensures that \(\pi_\theta\) remains close to \(\pi_{\text{ref}}\) (via the KL term), while still assigning more mass to high-reward (preferred) outputs. Importantly, this form arises naturally from maximum entropy inference when the reference distribution is used as a baseline.
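The reweighting itself is a one-line computation. The sketch below uses a toy discrete response set with made-up probabilities and rewards:

```python
import numpy as np

# Toy discrete response set with a reference policy and an (implicit) reward estimate
pi_ref = np.array([0.5, 0.3, 0.2])   # reference policy over three candidate responses
R = np.array([0.0, 1.0, 2.0])        # reward assigned to each response
tau = 1.0                            # temperature / strength of the KL anchor

# Reward-weighted reference policy: pi(y) proportional to pi_ref(y) * exp(R(y) / tau)
unnormalized = pi_ref * np.exp(R / tau)
pi_new = unnormalized / unnormalized.sum()
print("Reweighted policy:", np.round(pi_new, 3))
# High-reward responses gain probability mass, but the reference still anchors the result.
```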
DPO thus combines reward maximization with entropy regularization, encouraging the learned policy to prefer outcomes favored by human feedback while preserving diversity and stability. It sidesteps the challenges of explicitly learning a reward model or tuning complex RL pipelines, offering a direct and scalable method for preference-based alignment.
In practice, DPO has been shown to achieve similar or better alignment performance compared to reinforcement learning from human feedback (RLHF) while being more stable and easier to implement. It avoids the need to sample from the model during training or tune delicate hyperparameters of RL. Conceptually, DPO demonstrates that if we structure our utility model cleverly (here, as the log-ratio of policy and reference), we can extract an optimal policy in closed-form and learn utilities via supervised learning.
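The toy example below puts these pieces together: it fixes a single prompt \(x\), defines a discrete set of candidate responses \(y\), samples pairwise preferences from a hidden reward function, and runs DPO-style gradient updates on a policy parameterized as the reference policy reweighted by learned logits. The animation shows the learned policy shifting probability mass toward the preferred region while staying anchored to the reference.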
```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from scipy.special import logsumexp

# --- Setup: 1D input x, discrete actions y ---
np.random.seed(0)
x = 5.0                          # fixed input
Y = np.linspace(-4, 4, 100)      # discrete action space
n_actions = len(Y)

# --- True reward function (unknown to learner) ---
def true_reward(x, y):
    return -((y - np.sin(x))**2)   # reward peak near y = sin(x)

R_true = true_reward(x, Y)

# --- Reference policy: fixed Gaussian-like distribution ---
def ref_policy(y):
    logits = -0.5 * (y / 2.0)**2   # log probs of N(0, 2^2)
    return np.exp(logits - logsumexp(logits))

pi_ref = ref_policy(Y)

# --- Preference data from reward samples ---
def sample_preference(x, Y, R_fn, temperature=1.0):
    logits = R_fn(x, Y) / temperature
    probs = np.exp(logits - logsumexp(logits))
    sampled = np.random.choice(len(Y), size=2, replace=False, p=probs)
    y_plus, y_minus = sampled if R_fn(x, Y[sampled[0]]) > R_fn(x, Y[sampled[1]]) else sampled[::-1]
    return y_plus, y_minus

n_pairs = 100
pair_indices = [sample_preference(x, Y, true_reward) for _ in range(n_pairs)]

# --- DPO loss and gradient ---
def dpo_loss_and_grad(theta, y_pos_idx, y_neg_idx, pi_ref):
    # Policy is the reference policy reweighted by learned logits theta
    logits = theta + np.log(pi_ref + 1e-8)
    logp_pos = logits[y_pos_idx] - logsumexp(logits)
    logp_neg = logits[y_neg_idx] - logsumexp(logits)
    s = logp_pos - logp_neg
    sigma = 1 / (1 + np.exp(-s))
    loss = -np.log(sigma + 1e-8)
    softmax = np.exp(logits - logsumexp(logits))
    # Gradient w.r.t. theta: raise the logit of y_+ and lower that of y_-,
    # scaled by how wrong the current preference probability is,
    # plus a softmax-weighted term that damps growth of the learned offsets
    grad = -(1 - sigma) * (np.eye(n_actions)[y_pos_idx] - np.eye(n_actions)[y_neg_idx]) + sigma * softmax
    return loss, grad

# --- Training loop with history tracking ---
theta = np.zeros(n_actions)
lr = 0.05
n_steps = 100
history = []

for step in range(n_steps):
    total_grad = np.zeros_like(theta)
    for y_pos_idx, y_neg_idx in pair_indices:
        _, grad = dpo_loss_and_grad(theta, y_pos_idx, y_neg_idx, pi_ref)
        total_grad += grad
    theta -= lr * total_grad / n_pairs
    logits_snapshot = theta + np.log(pi_ref + 1e-8)
    pi_snapshot = np.exp(logits_snapshot - logsumexp(logits_snapshot))
    history.append(pi_snapshot)

# --- Animation setup ---
fig, ax = plt.subplots(figsize=(7, 4))
line_true, = ax.plot(Y, R_true, 'k--', label='True Reward')
line_ref, = ax.plot(Y, pi_ref, 'g-', label='Reference Policy')
line_learned, = ax.plot([], [], 'b-', label='Learned Policy')

# Add preference pair indicators
pref_lines = [ax.axvline(Y[idx], color='blue', linestyle=':', alpha=0.3) for idx, _ in pair_indices]
pref_lines += [ax.axvline(Y[idx], color='red', linestyle=':', alpha=0.3) for _, idx in pair_indices]

ax.set_ylim(-0.025, 0.025)
ax.set_title("DPO Policy Evolution")
ax.set_ylabel("Probability")
ax.set_xlabel("y")
ax.legend()

def update(frame):
    pi_snapshot = history[frame]
    line_learned.set_data(Y, pi_snapshot)
    ax.set_title(f"DPO Policy Evolution (Step {frame + 1})")
    return [line_learned]

from IPython.display import HTML
from matplotlib import rc
rc('animation', html='jshtml')

ani = animation.FuncAnimation(fig, update, frames=n_steps, interval=100, blit=True)
HTML(ani.to_jshtml())
```