Machine Learning from Human Preferences

Chapter 3: Elicitation

Overview

  • Chapters 2–3 established models and learned their parameters
  • Now: which queries should we ask?
  • Human feedback is expensive — strategically chosen comparisons can save 80% of annotation cost
  • Central tool: Fisher information + optimal experimental design
  • Key insight: not all comparisons are equally informative

Chapter Roadmap

Lecture 1 — Preference Vectors & Metric Elicitation

  • Fisher information & Rasch model (15 min)
  • Factor models & D/A/E-optimal design (15 min)
  • Pairwise preferences & Sherman-Morrison (10 min)
  • Metric elicitation via binary search (10 min)

Lecture 2 — Preference Functions & Active DPO

  • Linear & nonlinear preference functions (15 min)
  • GP active learning (15 min)
  • Active DPO for LLM alignment (15 min)
  • Summary (5 min)

Connections to Prior Chapters

Concept (Ch 2–3) | This Chapter
Bradley-Terry model | Fisher information, query selection
Factor models \(U^\top V\) | D-optimal pair selection
GP preference models | Information-gain acquisition
DPO loss | ADPO active selection

“The purpose of optimal design is to achieve the desired precision with minimum cost.” — V.V. Fedorov (1972)

1. Fisher Information

Definition. For a parametric model \(p(Y \mid \theta)\), the Fisher Information is:

\[ \mathcal{I}(\theta) = \mathbb{E}\left[-\frac{\partial^2}{\partial \theta^2} \log p(Y \mid \theta)\right] \]

  • Quantifies how much a single observation reveals about \(\theta\)
  • Cramér-Rao bound: variance of any unbiased estimator \(\geq \mathcal{I}(\theta)^{-1}\)
  • Higher Fisher information \(\Rightarrow\) more precise estimation per observation

1. Fisher Info for the Rasch Model

For item \(j\) with \(p_j(U) = \sigma(U + V_j)\):

\[ \mathcal{I}_j(U) = p_j(U)(1 - p_j(U)) \]
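
As a worked check (a short derivation from the definitions above, writing \(p = p_j(U)\) and using \(\sigma' = \sigma(1 - \sigma)\)), the Bernoulli log-likelihood of a response \(Y \in \{0, 1\}\) gives:

\[ \begin{aligned} \log p(Y \mid U) &= Y \log p + (1 - Y)\log(1 - p) \\ \frac{\partial}{\partial U}\log p(Y \mid U) &= Y(1 - p) - (1 - Y)\,p = Y - p \\ -\frac{\partial^2}{\partial U^2}\log p(Y \mid U) &= \frac{\partial p}{\partial U} = p(1 - p) \end{aligned} \]

The second derivative does not depend on \(Y\), so the expectation in the definition is trivial and \(\mathcal{I}_j(U) = p_j(U)(1 - p_j(U))\).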

1. Optimal Item Difficulty

Proposition. \(\mathcal{I}_j(U) = p_j(1 - p_j)\) is maximized when \(p_j = 0.5\), i.e., item difficulty \(-V_j\) equals user ability \(U\).

  • Items that are too easy (\(p \approx 1\)): near-deterministic, teach little
  • Items that are too hard (\(p \approx 0\)): also uninformative
  • Items matched to ability (\(p \approx 0.5\)): maximally informative
  • Best next item: difficulty \(-V_j\) close to current estimate of \(U\)

1. Active Item Selection

Score and cumulative information with prior \(U \sim \mathcal{N}(0, \sigma_0^2)\) and prior precision \(\tau_0 = 1/\sigma_0^2\):

\[ \begin{aligned} S(U) &= \sum_{j \in \mathcal{J}} (y_j - p_j(U)) - \tau_0 U \\ \mathcal{I}(U) &= \sum_{j \in \mathcal{J}} p_j(U)(1 - p_j(U)) + \tau_0 \end{aligned} \]

  • Newton update: \(\hat{U} \leftarrow \hat{U} + S(\hat{U}) / \mathcal{I}(\hat{U})\)
  • Selection rule: \(j_t = \arg\max_j \; p_j(\hat{U})(1 - p_j(\hat{U}))\)
  • Posterior variance: \(\widehat{\text{Var}}(U) \approx \mathcal{I}(\hat{U})^{-1}\)
  • Reliability: \(\text{Rel} \approx 1 - \tfrac{1}{N}\sum_i \widehat{\text{Var}}(U_i) / \sigma_U^2\)
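
The selection rule and Newton update above can be sketched as a small adaptive test. This is a minimal illustration, not a reference implementation: the function names, the simulated respondent, and the step clipping are assumptions introduced here, and the score includes the prior gradient \(-\tau_0 U\) with \(\tau_0 = 1/\sigma_0^2\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_estimate(V_asked, y, tau0, n_iter=25):
    """Newton iterations for the MAP ability under the prior N(0, 1/tau0)."""
    u = 0.0
    for _ in range(n_iter):
        p = sigmoid(u + V_asked)
        score = np.sum(y - p) - tau0 * u        # gradient of the log posterior
        info = np.sum(p * (1.0 - p)) + tau0     # cumulative information
        # Clipping the step is a safeguard added for this sketch only.
        u += float(np.clip(score / info, -1.0, 1.0))
    p = sigmoid(u + V_asked)
    return u, float(np.sum(p * (1.0 - p)) + tau0)

def run_adaptive_test(V, true_u, n_queries, tau0=1.0, seed=0):
    """Ask n_queries items, each chosen to maximize p(1 - p) at the current estimate."""
    rng = np.random.default_rng(seed)
    available = np.ones(len(V), dtype=bool)
    asked, answers = [], []
    u_hat, info = 0.0, tau0
    for _ in range(n_queries):
        p_hat = sigmoid(u_hat + V)
        gain = np.where(available, p_hat * (1.0 - p_hat), -np.inf)
        j = int(np.argmax(gain))
        available[j] = False                    # do not repeat items
        # Simulated respondent with (hypothetical) true ability true_u.
        y = float(rng.random() < sigmoid(true_u + V[j]))
        asked.append(j)
        answers.append(y)
        u_hat, info = map_estimate(V[asked], np.array(answers), tau0)
    return u_hat, 1.0 / info, asked             # estimate, approx. posterior variance
```

With the estimate initialized at \(\hat{U} = 0\), the first item chosen is the one whose \(V_j\) is closest to zero, and later choices track the running estimate.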

1. Fisher-Active vs Random

Fisher-active selection achieves higher reliability with fewer queries by targeting items near \(p = 0.5\).

1. Key Insight

Measure where uncertainty is highest.

  • Items near \(p = 0.5\) are maximally informative
  • Active selection “hovers” around the user’s current location
  • This principle underlies all active learning methods in this chapter:
    • Rasch \(\rightarrow\) Factor models \(\rightarrow\) Pairwise \(\rightarrow\) GPs \(\rightarrow\) DPO

2. Factor Models: Multi-Dimensional Preferences

Generalize Rasch to \(K\) dimensions: \(H_{ij} = U_i^\top V_j + Z_j\)

  • User embedding \(U_i \in \mathbb{R}^K\), item embedding \(V_j \in \mathbb{R}^K\)
  • Fisher information per item is a rank-1 matrix:

\[ \mathcal{I}_j(U) = \sigma(H)(1 - \sigma(H)) \; V_j V_j^\top \]

  • Cumulative: \(\mathcal{I}(U) = \sum_t \sigma(H_{j_t})(1 - \sigma(H_{j_t})) \; V_{j_t} V_{j_t}^\top\)
  • Each item contributes information in the direction of \(V_j\)

2. Optimal Design Criteria (D/A/E)

Given Fisher information matrix \(\mathcal{I}(U)\):

Criterion | Objective | Intuition
A-Optimal | min tr(\(\mathcal{I}^{-1}\)) | Average variance
D-Optimal | max det(\(\mathcal{I}\)) | Volume shrinkage of posterior ellipsoid
E-Optimal | max \(\lambda_{\min}(\mathcal{I})\) | Worst-case precision

  • All three reduce uncertainty, but emphasize different aspects
  • D-optimal naturally encourages diversity — redundant directions give diminishing gain
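
To make the three criteria concrete, the sketch below evaluates them from the eigenvalues of an information matrix. The two example matrices are hypothetical designs with the same trace, chosen only to show that a balanced design wins on all three criteria.

```python
import numpy as np

def design_criteria(info):
    """A-, D-, and E-optimality values for a symmetric positive-definite matrix."""
    eigvals = np.linalg.eigvalsh(info)          # ascending order
    return {
        "A": float(np.sum(1.0 / eigvals)),      # tr(I^{-1}), smaller is better
        "D": float(np.sum(np.log(eigvals))),    # log det(I), larger is better
        "E": float(eigvals[0]),                 # lambda_min(I), larger is better
    }

# Two hypothetical designs with equal total information (trace = 4).
balanced = np.diag([2.0, 2.0])
lopsided = np.diag([3.9, 0.1])

for name, info in [("balanced", balanced), ("lopsided", lopsided)]:
    print(name, design_criteria(info))
```

The lopsided design measures one direction almost exclusively, so its weak direction dominates the A- and E-criteria and shrinks the determinant.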

2. D-Optimal Selection for Factor Models

  • Pick next item maximizing \(\det(\mathcal{I} + \mathcal{I}_j)\)
  • Reliability generalized: \(\text{Rel} = 1 - \text{tr}(\hat{\Sigma}_{\text{err}}) / \text{tr}(\Sigma_U)\)

3. Pairwise Preferences

Bradley-Terry for pairs: \(p(j \succ k) = \sigma(U^\top(V_j - V_k) + Z_j - Z_k)\)

  • Feature difference: \(x_{jk} = V_j - V_k\)
  • Fisher information per pair:

\[ \mathcal{I}_{jk}(U) = w \cdot x_{jk} x_{jk}^\top, \quad w = p(1 - p) \]

  • Posterior precision accumulates additively:

\[ \Lambda_t = \Sigma_0^{-1} + \sum_{s \leq t} w_s \, x_s x_s^\top \]

3. D-Optimal via Sherman-Morrison

Proposition. With posterior covariance \(\Sigma = \Lambda^{-1}\), the D-optimal acquisition has closed form:

\[ \Delta_D(j,k) = \log\det(\Lambda + w\,xx^\top) - \log\det(\Lambda) = \log(1 + w \, x^\top \Sigma \, x) \]

  • Follows from the matrix determinant lemma: \(\det(A + uv^\top) = (1 + v^\top A^{-1} u)\det(A)\)
  • Only requires a vector-matrix-vector product — efficient for online selection
  • A-optimal: \(\Delta_A = \tfrac{w \, x^\top \Sigma^2 x}{1 + w \, x^\top \Sigma \, x}\)
  • E-optimal proxy: \(\Delta_E^{\text{proxy}} = w \, x^\top \Sigma \, x\)
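
The closed forms above can be checked numerically against direct recomputation. In this sketch \(\Sigma = \Lambda^{-1}\), and the random precision matrix, the vector `x`, and the weight `w` are arbitrary test values, not quantities from the chapter.

```python
import numpy as np

def d_gain(Sigma, x, w):
    """log det gain from adding w * x x^T to the precision matrix."""
    return float(np.log1p(w * x @ Sigma @ x))

def a_gain(Sigma, x, w):
    """Reduction in tr(Sigma) from the same rank-1 update."""
    Sx = Sigma @ x
    return float(w * Sx @ Sx / (1.0 + w * x @ Sx))

def sherman_morrison_update(Sigma, x, w):
    """Covariance after the rank-1 precision update, without a fresh inverse."""
    Sx = Sigma @ x
    return Sigma - w * np.outer(Sx, Sx) / (1.0 + w * x @ Sx)

rng = np.random.default_rng(0)
K = 4
A = rng.normal(size=(K, K))
Lambda = A @ A.T + np.eye(K)                    # arbitrary positive-definite precision
Sigma = np.linalg.inv(Lambda)
x, w = rng.normal(size=K), 0.21

Lambda_new = Lambda + w * np.outer(x, x)
direct_d = np.linalg.slogdet(Lambda_new)[1] - np.linalg.slogdet(Lambda)[1]
direct_a = np.trace(Sigma) - np.trace(np.linalg.inv(Lambda_new))

print(np.isclose(direct_d, d_gain(Sigma, x, w)))
print(np.isclose(direct_a, a_gain(Sigma, x, w)))
```

Each candidate pair costs one matrix-vector product, and the covariance is refreshed with the rank-1 update rather than a new inversion.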

3. Pairwise D-Optimal: Results

D-optimal selection shrinks the posterior ellipsoid efficiently in all directions.

4. Metric Elicitation: Motivation

  • Different classifiers trade off error types differently
  • Which is “best” depends on the implicit metric
  • Medical diagnosis: false negatives (missed disease) are costly
  • Spam filtering: false positives (lost email) are costly
  • The practitioner’s metric \(\mathbf{m}^*\) is unknown

4. Metric Elicitation as Factor Model

Metric elicitation is a \(K=2\) factor model with known item parameters:

  • “Items” = classifiers with known confusion matrices \(V_\theta = (\text{TP}_\theta, \text{TN}_\theta)\)
  • “User preference” = unknown metric weights \(\mathbf{m} = (m_{11}, m_{00})\)
  • Linear Performance Metric: \(\phi(C) = m_{11} \cdot \text{TP} + m_{00} \cdot \text{TN} + m_0\)
  • Parametrize: \(\mathbf{m} = (\cos\theta, \sin\theta)\) for \(\theta \in [0, 2\pi]\)

4. Binary Search Algorithm

  • Query complexity: \(O(\log(1/\epsilon))\) — exponentially better than general 2D estimation at \(O(1/\epsilon^2)\)
  • Works even under noise with probabilistic oracle responses
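
The logarithmic query count can be illustrated with a comparison-only interval search. This is a simplified ternary-style sketch, not the exact procedure of Hiranandani et al. (2019): it assumes a noise-free oracle, a toy boundary \((\text{TP}, \text{TN}) = (0.5\cos t,\, 0.5\sin t)\) for \(t \in [0, \pi/2]\), and a metric whose utility is unimodal along that boundary. All function names are hypothetical.

```python
import math

def make_oracle(theta_star):
    """Noise-free oracle for the hidden metric m = (cos theta*, sin theta*)."""
    m = (math.cos(theta_star), math.sin(theta_star))

    def utility(t):
        tp, tn = 0.5 * math.cos(t), 0.5 * math.sin(t)   # toy boundary classifier
        return m[0] * tp + m[1] * tn

    def prefers_first(t1, t2):
        return utility(t1) > utility(t2)

    return prefers_first

def elicit_angle(prefers_first, eps=1e-4, lo=0.0, hi=math.pi / 2):
    """Shrink [lo, hi] by a factor of 2/3 per pairwise comparison."""
    n_queries = 0
    while hi - lo > eps:
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if prefers_first(a, b):
            hi = b          # the maximizer cannot lie to the right of b
        else:
            lo = a          # the maximizer cannot lie to the left of a
        n_queries += 1
    return 0.5 * (lo + hi), n_queries

theta_hat, n_queries = elicit_angle(make_oracle(0.7))
print(theta_hat, n_queries)
```

Because the interval shrinks geometrically, the number of comparisons grows like \(\log(1/\epsilon)\). Handling noisy answers would need repeated queries or a probabilistic variant, which this sketch does not attempt.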

4. Metric Elicitation: Extensions

  • Bayes optimal classifier: \(\bar{h}(x) = \mathbf{1}[\eta(x) \geq m_{00}/(m_{11} + m_{00})]\)
    • Learning the metric = learning the optimal classification threshold
  • Linear-fractional metrics (\(F_\beta\), Jaccard): two binary searches, still \(O(\log(1/\epsilon))\)
  • Multiclass: diagonal LPMs with \(K\) classes \(\Rightarrow\) \(O(K^2 \log(1/\epsilon))\) queries

5. Linear Preference Functions

Items scored by shared weight vector: \(V_j = W^\top X_j\)

\[ p(j \succ k) = \sigma(W^\top(X_j - X_k)) \]

  • This is logistic regression on pairwise feature differences \(x_{jk} = X_j - X_k\)
  • Fisher information: \(\mathcal{I}(W) = p(1-p) \, x_{jk} x_{jk}^\top\)
  • Laplace posterior: \(\Sigma \approx (\sigma_0^{-2}I + \sum_t p_t(1-p_t) \, x_t x_t^\top)^{-1}\)
  • Same D-optimal criterion: \(\Delta_D(j,k) = \log(1 + w \, x^\top \Sigma \, x)\)
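
One way to turn the Laplace posterior and the D-optimal criterion into code is sketched below. The helper names and the tiny three-item example are assumptions for illustration, and `W_hat` is taken as given (fitting it by logistic regression is omitted).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def select_pair(X, W_hat, Sigma, candidates):
    """Return the candidate (j, k) with the largest gain log(1 + w x^T Sigma x)."""
    best, best_gain = None, -np.inf
    for j, k in candidates:
        x = X[j] - X[k]
        p = sigmoid(W_hat @ x)
        gain = np.log1p(p * (1.0 - p) * x @ Sigma @ x)
        if gain > best_gain:
            best, best_gain = (j, k), float(gain)
    return best, best_gain

def laplace_covariance(diffs, W_hat, sigma0):
    """Laplace covariance from a non-empty list of queried feature differences."""
    D = np.asarray(diffs, dtype=float)
    p = sigmoid(D @ W_hat)
    w = p * (1.0 - p)
    precision = np.eye(D.shape[1]) / sigma0**2 + (D * w[:, None]).T @ D
    return np.linalg.inv(precision)

# Hypothetical items on a line; with W_hat = 0 every pair has p = 0.5,
# so the pair with the largest feature difference is the most informative.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
pairs = [(0, 1), (0, 2), (1, 2)]
best, gain = select_pair(X, np.zeros(2), np.eye(2), pairs)
Sigma = laplace_covariance([X[best[0]] - X[best[1]]], np.zeros(2), sigma0=1.0)
print(best, gain, Sigma)
```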

5. Linear Preference: Results

D-optimal selection converges faster with the same mathematical framework as before.

5. Robotic Trajectory Learning

Trajectory reward: \(R(\xi) = w^\top \phi(\xi)\)

  • Unknown weights \(w\), known features \(\phi(\xi)\)
  • Each comparison “prefer \(\xi_A\) over \(\xi_B\)” gives a half-space constraint: \(w^\top(\phi(\xi_A) - \phi(\xi_B)) > 0\)
  • Acquisition: maximize minimum volume removed regardless of answer

Biyik and Sadigh (2018)
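
The max-min volume acquisition can be sketched with a sample-based approximation. This is a noise-free simplification in which each answer removes a hard half-space; the sample set and feature differences below are hypothetical, and the function names are introduced here for illustration.

```python
import numpy as np

def min_volume_removed(W_samples, psi):
    """Worst-case fraction of weight samples removed by one comparison.

    psi is the feature difference phi(xi_A) - phi(xi_B).
    """
    frac_A = float(np.mean(W_samples @ psi > 0))   # samples consistent with "A preferred"
    # Answer "A" removes the other (1 - frac_A); answer "B" removes frac_A.
    return min(frac_A, 1.0 - frac_A)

def select_query(W_samples, feature_diffs):
    """Pick the query whose worst-case removal is largest."""
    scores = [min_volume_removed(W_samples, psi) for psi in feature_diffs]
    return int(np.argmax(scores)), scores

# Hypothetical samples from the current belief over w.
W_samples = np.array([[1.0, 0.1], [1.0, -0.1], [0.1, 1.0], [-0.1, 1.0]])
feature_diffs = [np.array([1.0, 1.0]),    # every sample agrees: nothing learned
                 np.array([1.0, -1.0])]   # samples split evenly: half removed either way
best, scores = select_query(W_samples, feature_diffs)
print(best, scores)
```

A query on which all samples agree scores zero, while one that splits the samples evenly removes half of them whichever answer comes back.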

5. Trajectory Learning: Results & Extensions

  • Driving simulator: 0 queries \(\rightarrow\) erratic; 30 queries \(\rightarrow\) lane following; 70 queries \(\rightarrow\) collision avoidance
  • Active selection achieves target performance faster — critical for time-sensitive applications (e.g., exoskeleton rehabilitation)
  • Foundation models (R3M, Voltron): pretrained representations give 2–3x higher success with 5–10x fewer demos

5. Non-Linear Scoring Functions

For non-linear scorer \(V_j = f_\theta(X_j)\), linearize around \(\hat\theta\):

\[ f_\theta(X_j) \approx f_{\hat\theta}(X_j) + J_j \,\Delta\theta, \quad J_j = \left.\frac{\partial f_\theta(X_j)}{\partial \theta}\right|_{\hat\theta} \]

Pairwise logit:

\[ \underbrace{f_\theta(X_j) - f_\theta(X_k)}_{\text{logit}} \approx \underbrace{f_{\hat\theta}(X_j) - f_{\hat\theta}(X_k)}_{\text{offset}} + \underbrace{(J_j - J_k)}_{\phi_{jk}} \Delta\theta \]

5. D-Optimal for Nonlinear Models

Replace \(x_{jk}\) with Jacobian difference \(\phi_{jk} = J_j - J_k\):

\[ \Sigma^{-1} = \sigma_0^{-2}I + \sum_t w_t \, \phi_t^\top \phi_t \]

  • D-optimal: \(\Delta_D(j,k) = \log(1 + w \, \phi \, \Sigma \, \phi^\top)\)
  • A-optimal: \(\Delta_A(j,k) = \tfrac{w \, \phi \, \Sigma^2 \, \phi^\top}{1 + w \, \phi \, \Sigma \, \phi^\top}\)
  • Only needs Jacobian rows + vector-matrix-vector product
  • Same structure as linear case — local linearization is the key trick
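
The linearized criteria can be sketched for a toy scorer. The two-unit `tanh` model and the finite-difference Jacobian are assumptions made for this example; a real model would typically use automatic differentiation instead.

```python
import numpy as np

def scorer(theta, x):
    """Hypothetical scorer f_theta(x) = sum_h a_h * tanh(b_h * x), theta = (a, b)."""
    a, b = theta[:2], theta[2:]
    return float(np.sum(a * np.tanh(b * x)))

def jacobian_row(theta, x, eps=1e-6):
    """Central finite-difference gradient of the scorer with respect to theta."""
    J = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        J[i] = (scorer(theta + step, x) - scorer(theta - step, x)) / (2.0 * eps)
    return J

def gains(theta_hat, Sigma, x_j, x_k):
    """D- and A-optimal gains using the Jacobian difference phi = J_j - J_k."""
    phi = jacobian_row(theta_hat, x_j) - jacobian_row(theta_hat, x_k)
    logit = scorer(theta_hat, x_j) - scorer(theta_hat, x_k)
    p = 1.0 / (1.0 + np.exp(-logit))
    w = p * (1.0 - p)
    q = phi @ Sigma @ phi
    d = float(np.log1p(w * q))
    a = float(w * (phi @ Sigma @ Sigma @ phi) / (1.0 + w * q))
    return d, a

theta_hat = np.array([1.0, -0.5, 0.8, 1.5])
print(gains(theta_hat, np.eye(4), 0.3, -1.2))
```

Comparing an item with itself gives \(\phi_{jk} = 0\) and therefore zero gain under both criteria, mirroring the linear case.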

6. GP Active Learning: Motivation

  • Previous sections: parametric models (Rasch, linear, neural net)
  • Gaussian Processes: nonparametric, flexible, natural uncertainty
  • GP posterior variance varies across input space
  • Key question: how to actively select queries for GP preference models?
  • Insight: GP uncertainty naturally guides query selection

6. Information-Theoretic Acquisition

Mutual information between reward \(r\) and observation \(y\):

\[ a(Q) = I(r; y \mid \mathcal{D}, Q) = H(y \mid \mathcal{D}, Q) - \mathbb{E}_{r}[H(y \mid r, Q)] \]

Two terms:

  1. \(H(y \mid \mathcal{D}, Q)\): predictive entropy — high when \(p(A \succ B) \approx 0.5\)
  2. \(\mathbb{E}_r[H(y \mid r, Q)]\): expected conditional entropy — low for “easy” comparisons

Want: model is uncertain, but human can give a clear answer.

6. Practical Computation

Using Laplace approximation:

\[ a(x_A, x_B) = h\!\left(\Phi\!\left(\frac{\mu_A - \mu_B}{\sqrt{2\sigma_{\text{noise}}^2 + g(x_A, x_B)}}\right)\right) - m(x_A, x_B) \]

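where \(g(x_A, x_B) = \sigma_A^2 + \sigma_B^2 - 2\text{Cov}(r(x_A), r(x_B))\) is the posterior variance of the reward difference, and \(m(x_A, x_B)\) denotes the approximation to the expected conditional entropy \(\mathbb{E}_r[H(y \mid r, Q)]\)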

  • \(h(p)\): binary entropy; \(\Phi\): normal CDF
  • Key property: trivial query \((x, x)\) is a global minimizer — avoids degenerate queries
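
A numerical sketch of this acquisition follows. It assumes a probit likelihood \(p(A \succ B \mid r) = \Phi\big((r(x_A) - r(x_B)) / \sqrt{2\sigma_{\text{noise}}^2}\big)\), which is consistent with the predictive term above, and it evaluates the expected conditional entropy by Gauss-Hermite quadrature rather than a closed-form \(m\). The function names and the node count are choices made here for illustration.

```python
import math
import numpy as np

def norm_cdf(z):
    """Standard normal CDF, vectorized over NumPy arrays."""
    z = np.asarray(z, dtype=float)
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def acquisition(mu_A, mu_B, var_A, var_B, cov_AB, noise_var, n_nodes=40):
    """Mutual information between the reward and the answer to query (A, B)."""
    g = max(var_A + var_B - 2.0 * cov_AB, 0.0)   # variance of r(x_A) - r(x_B)
    mu_d = mu_A - mu_B
    predictive = binary_entropy(norm_cdf(mu_d / math.sqrt(2.0 * noise_var + g)))
    # Expected conditional entropy: average over d ~ N(mu_d, g) by quadrature.
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    d = mu_d + math.sqrt(g) * nodes
    cond = np.sum(weights * binary_entropy(norm_cdf(d / math.sqrt(2.0 * noise_var))))
    cond /= math.sqrt(2.0 * math.pi)             # hermegauss weights sum to sqrt(2*pi)
    return float(predictive - cond)

print(acquisition(0.0, 0.0, 1.0, 1.0, 0.0, noise_var=0.1))   # uncertain, informative pair
print(acquisition(0.3, 0.3, 1.0, 1.0, 1.0, noise_var=0.1))   # trivial query (x, x)
```

For the trivial query the two rewards are perfectly correlated, so \(g = 0\), both entropy terms coincide, and the acquisition is zero.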

6. GP Active Learning Algorithm

  1. Initialize GP prior over reward function
  2. For each round \(t = 1, \ldots, T\):
    1. Compute acquisition \(a(x_A, x_B)\) for all candidate pairs
    2. Select pair \((x_A^*, x_B^*)\) maximizing acquisition
    3. Query oracle for preference
    4. Update GP posterior via Laplace approximation
  3. Return learned reward function with uncertainty

6. GP Active Learning: Results

Active selection targets uncertain regions, achieving lower error with fewer comparisons.

6. When to Use GP Active Learning

Scenario | Recommendation
Nonlinear reward structure | GP strongly preferred
Low-to-moderate dimensions (\(d \lesssim 10\)) | GP works well
Expensive human feedback | Active learning essential
Need uncertainty quantification | GP provides naturally
Large datasets (\(n > 1000\)) | Consider sparse GP
Very high dimensions | Linear models more practical

7. Active DPO: Motivation

Recall DPO assumes Bradley-Terry preferences:

\[ p(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right) \]

  • LLMs have billions of parameters — exact Fisher information is intractable
  • How can we apply D-optimal design principles from Sections 1–5?

7. Log-Linear Policy Assumption

Approximate policy as log-linear in last-layer features:

\[ \pi(y \mid x; \theta) \propto \exp(\phi(x, y)^\top \theta) \]

  • Justified by neural tangent kernel perspective
  • DPO loss Hessian takes the form:

\[ H(\theta) = \sum_{i=1}^n p_i(1 - p_i) \cdot \Delta\phi_i \, \Delta\phi_i^\top \]

where \(\Delta\phi_i = \phi(x_i, y_{w,i}) - \phi(x_i, y_{l,i})\)

This IS the Fisher information matrix!

7. DPO D-Optimal Criterion

D-optimal selection for DPO:

\[ \max_S \; \log\det\!\left(\sum_{i \in S} p_i(1 - p_i) \cdot \Delta\phi_i \, \Delta\phi_i^\top\right) \]

  • Same mathematical structure as all previous sections
  • Feature difference: \(\Delta\phi_i = \phi(x_i, y_{w,i}) - \phi(x_i, y_{l,i})\)
  • Sherman-Morrison still applies for sequential selection

7. ADPO Algorithm

Active DPO (ADPO):

  1. Compute features: \(\Delta\phi_i = \phi(x_i, y_{i,1}) - \phi(x_i, y_{i,2})\)
  2. D-optimal selection: \[ i^* = \arg\max_i \; \log\det\!\left(I + p_i(1 - p_i) \, H_t^{-1} \Delta\phi_i \, \Delta\phi_i^\top\right) \]
  3. Query oracle: obtain human preference for pair \(i^*\)
  4. Update: add to training set, retrain with DPO loss
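
The selection step can be sketched as a greedy loop over a fixed pool. Several simplifications are assumptions of this sketch rather than part of ADPO as stated: the probabilities `p` are held fixed instead of being refreshed by retraining, a ridge term `reg * I` is added so that the initial Hessian is invertible, and the feature differences are toy values.

```python
import numpy as np

def adpo_select(delta_phi, p, budget, reg=1.0):
    """Greedily choose `budget` pairs by D-optimal gain.

    delta_phi: (n, d) feature differences; p: (n,) model preference probabilities.
    """
    n, d = delta_phi.shape
    w = p * (1.0 - p)
    H_inv = np.eye(d) / reg                      # inverse of the regularized Hessian
    available = np.ones(n, dtype=bool)
    chosen, gains = [], []
    for _ in range(budget):
        # log det(I + w H^{-1} dphi dphi^T) = log(1 + w dphi^T H^{-1} dphi)
        quad = np.einsum("ij,jk,ik->i", delta_phi, H_inv, delta_phi)
        gain = np.where(available, np.log1p(w * quad), -np.inf)
        i = int(np.argmax(gain))
        chosen.append(i)
        gains.append(float(gain[i]))
        available[i] = False
        # Rank-1 (Sherman-Morrison) update of the inverse Hessian.
        v = H_inv @ delta_phi[i]
        H_inv -= w[i] * np.outer(v, v) / (1.0 + w[i] * delta_phi[i] @ v)
    return chosen, gains

# Toy pool: pairs 0 and 1 are duplicates, pair 2 probes a new direction.
delta_phi = np.array([[2.0, 0.0], [2.0, 0.0], [0.0, 1.8]])
p = np.full(3, 0.5)
chosen, gains = adpo_select(delta_phi, p, budget=2)
print(chosen, gains)
```

After the first pair is selected, its duplicate yields a smaller gain than the pair in the unexplored direction, so the second pick is pair 2.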

7. ADPO Convergence

Theorem. Under log-linear policy + regularity conditions:

\[ \|\hat\theta_n - \theta^*\| = O\!\left(\frac{d}{\sqrt{n}}\right) \]

  • Matches the minimax optimal rate for \(d\)-dimensional estimation
  • D-optimal ensures Fisher information grows proportionally to \(n\)
  • Key: effective dimension \(d\) is the rank of the Fisher information matrix (last-layer features), not the full parameter count

7. ADPO+ for Offline Settings

  • Pool-based: select from existing unlabeled pairs (fixed response pool)
  • Batch selection: select \(b\) queries at once for parallel annotation
  • Diversity: D-optimal naturally encourages diversity — redundant queries give diminishing log-det gain
  • Practical for production RLHF pipelines with fixed candidate pairs

7. When Does Active Learning Help for DPO?

Factor | Active Learning Beneficial?
Annotation cost | High cost \(\rightarrow\) essential
Data heterogeneity | Diverse prompts \(\rightarrow\) more room for selection
Budget constraints | Limited budget (\(\lt\) 10K) \(\rightarrow\) largest gains
Model capacity | Larger models \(\rightarrow\) helps reduce overfitting
Query pool size | Large pool \(\rightarrow\) more room for intelligent selection

7. GP Active Learning vs ADPO

Aspect | GP Active Learning | ADPO
Model | Nonparametric (GP) | Parametric (neural)
Scalability | \(O(n^3)\) | \(O(d^2 n)\)
Use case | Small datasets, uncertainty | LLM alignment, large models
Acquisition | Information gain | D-optimal design
Theory | Mutual information | Fisher information

Both share the core insight: use model uncertainty to guide query selection.

Summary: Fisher Information Everywhere

The unifying theme — Fisher information quantifies “informativeness”:

Model | Fisher Information
Rasch (scalar) | \(p(1-p)\)
Factor (\(K\)-dim) | \(p(1-p) \, V V^\top\)
Pairwise | \(p(1-p) \, x \, x^\top\)
GP preferences | Predictive entropy \(-\) conditional entropy
DPO (LLM) | \(p(1-p) \, \Delta\phi \, \Delta\phi^\top\)

Summary: Key Equations

  • D-optimal (Sherman-Morrison): \(\Delta_D = \log(1 + w \, x^\top \Sigma \, x)\)
  • Metric elicitation: \(O(\log(1/\epsilon))\) queries via binary search on unimodal boundary
  • ADPO convergence: \(\|\hat\theta_n - \theta^*\| = O(d / \sqrt{n})\)

All use the same principle: measure where uncertainty is highest, whether that’s item difficulty, posterior covariance, or GP predictive variance.

Practical Guidance

Setting | Recommended Method
Scalar ability (testing, psychometrics) | Rasch + Fisher active
Multi-dimensional preferences | Factor model + D-optimal
Known item features | Linear preference + D-optimal
Unknown nonlinear reward | GP active learning
LLM alignment | ADPO
Metric design | Binary search on LPM

References

  • Hiranandani et al. (2019)
  • Biyik and Sadigh (2018)
  • Sadigh et al. (2017)
  • Houlsby et al. (2011)
  • Rafailov et al. (2023)
  • Additional:
    • Cohn, Ghahramani, and Jordan (1996)
    • Nair et al. (2022)
    • Karamcheti et al. (2023)
    • Gandhi et al. (2022)

Biyik, Erdem, and Dorsa Sadigh. 2018. “Batch Active Preference-Based Learning of Reward Functions.” In Proceedings of the 2nd Conference on Robot Learning, edited by Aude Billard, Anca Dragan, Jan Peters, and Jun Morimoto, 87:519–28. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v87/biyik18a.html.
Cohn, David A, Zoubin Ghahramani, and Michael I Jordan. 1996. “Active Learning with Statistical Models.” Journal of Artificial Intelligence Research 4: 129–45.
Gandhi, Kanishk, Siddharth Karamcheti, Madeline Liao, and Dorsa Sadigh. 2022. “Eliciting Compatible Demonstrations for Multi-Human Imitation Learning.” In Proceedings of the 6th Conference on Robot Learning (CoRL).
Hiranandani, Gaurush, Shant Boodaghians, Ruta Mehta, and Oluwasanmi Koyejo. 2019. “Performance Metric Elicitation from Pairwise Classifier Comparisons.” In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, edited by Kamalika Chaudhuri and Masashi Sugiyama, 89:371–79. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v89/hiranandani19a.html.
Houlsby, Neil, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. 2011. “Bayesian Active Learning for Classification and Preference Learning.” arXiv Preprint arXiv:1112.5745.
Karamcheti, Siddharth et al. 2023. “Language-Driven Representation Learning for Robotics.” arXiv Preprint arXiv:2302.12766.
Nair, Suraj, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. 2022. “R3M: A Universal Visual Representation for Robot Manipulation.” https://arxiv.org/abs/2203.12601.
Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” https://arxiv.org/abs/2305.18290.
Sadigh, Dorsa, Anca D Dragan, Shankar Sastry, and Sanjit A Seshia. 2017. “Active Preference-Based Learning of Reward Functions.” In Robotics: Science and Systems.