2  Learning

Intended Learning Outcomes

By the end of this chapter you will be able to:

  • Identify appropriate train/test splitting strategies for preference data and understand their implications for model evaluation.
  • Implement maximum likelihood estimation via gradient descent for the Bradley-Terry model, deriving and computing gradients efficiently.
  • Apply Bayesian inference using Markov Chain Monte Carlo (MCMC) to quantify parameter uncertainty and obtain posterior distributions.
  • Apply Laplace approximation to perform inference for Gaussian Process preference models, understanding why non-Gaussian likelihoods require approximation.
  • Derive online learning algorithms (Elo ratings) as stochastic gradient ascent and explain when online methods are preferred over batch learning.
  • Compare batch maximum likelihood, Bayesian inference, and online learning approaches in terms of computational cost, statistical efficiency, and practical applicability.
  • Evaluate learned preference models using multiple metrics including AUC, log-likelihood, and calibration error.
  • Apply regularization techniques (L2 penalty, early stopping) to prevent overfitting and improve generalization.
  • Conduct k-fold cross-validation for model selection and hyperparameter tuning in preference learning settings.
  • Implement modern optimization methods (Adam, learning rate schedules) and diagnose convergence through loss curves and gradient norms.
  • Analyze real-world preference data from language model alignment, addressing practical challenges such as noise, sparsity, and cold-start problems.

This chapter can be covered in two 50-minute lectures plus a hands-on lab session:

Lecture 1 (Sections 2.1–2.4): Core parameter estimation methods

  • Train/test splitting and evaluation metrics (10 min)
  • Maximum likelihood estimation for Bradley-Terry: derivation, implementation, evaluation (20 min)
  • Bayesian inference with MCMC: posterior distributions, Metropolis-Hastings (15 min)
  • Comparison of MLE vs Bayesian approaches (5 min)

Lecture 2 (Sections 2.5–2.8): Advanced learning topics

  • Online learning with Elo ratings: derivation as SGD, applications (15 min)
  • Regularization and overfitting: L2 penalty, validation curves (15 min)
  • Cross-validation and model selection (10 min)
  • Optimization methods: GD vs Adam, learning rate schedules (10 min)

Lab session: Real-world application and exercises

  • Apply methods to LLM preference data (30 min)
  • Work on selected exercises (20 min)

Designing a good reward signal by hand for a complex AI system is difficult and error-prone. Instead of manually specifying desirable behavior, we can learn a utility signal from preference data. In this chapter, we explore how to infer an underlying utility from various forms of feedback. Throughout, we include mathematical formulations and code examples to illustrate the learning process.

2.1 Chapter Overview

Section 1.6 in Chapter 1 established the foundational models for preference data: the Rasch model for item-wise responses, the Bradley-Terry model for pairwise comparisons, and K-dimensional factor models for richer representations. We saw how these models formalize the relationship between latent parameters and observed choices.

This chapter addresses the central question of learning: given observed preference data, how do we estimate the parameters of these models? We explore three complementary perspectives on parameter learning, each with distinct advantages:

  • Maximum Likelihood Estimation (Section 2.3): Find parameters that maximize the probability of observed data. Fast and scalable, but provides only point estimates.
  • Bayesian Inference (Sections 2.4 and 2.4.2): Treat parameters as random variables with prior distributions. We cover both parametric models (using MCMC) and nonparametric Gaussian Process models (using Laplace approximation). Quantifies uncertainty but requires more computation.
  • Online Learning (Section 2.5): Update estimates incrementally as new data arrives. Essential for dynamic systems with continuously arriving feedback.

Beyond these core methods, we cover essential machine learning techniques for practical applications:

  • Regularization (Section 2.6): Prevent overfitting through L2 penalties and early stopping.
  • Model Selection (Section 2.7): Use cross-validation to tune hyperparameters and compare approaches.
  • Optimization (Section 2.8): Improve convergence with modern methods like Adam and adaptive learning rates.
  • Real-World Application (Section 2.9): Apply all methods to language model preference data, confronting practical challenges.

Chapters 4-6 will build on these learning foundations to address active data collection (Chapter 4), decision-making under learned preferences (Chapter 5), and heterogeneous populations (Chapter 6).

2.2 Learning from Preference Data

Section 1.6 introduced the latent variable models for preference data: the Rasch model for item-wise responses, K-dimensional factor models (including the logistic factor model and the ideal point model), and the Bradley-Terry model for pairwise comparisons. We saw how these models relate user parameters \(U_i\) and item parameters \(V_j\) to observed responses, and how pairwise models can be derived from item-wise models.

This chapter focuses on learning the parameters of these models from observed data. We cover three complementary approaches:

  1. Maximum Likelihood Estimation for Bradley-Terry and Rasch models — finding parameters that maximize the probability of the observed data
  2. Bayesian Inference for uncertainty quantification — treating parameters as random variables with prior distributions
  3. Online Learning with the Elo rating system — updating parameter estimates incrementally as new data arrives

We begin by discussing how to split preference data into training and test sets, then develop estimation procedures for each approach.

2.3 Maximum Likelihood Estimation

Given responses from a particular generating process, there are various standard statistical inference procedures, such as maximum likelihood, maximum marginal likelihood, or Bayesian inference. These procedures aim to infer parameters that generalize to new data, so we first discuss how to split the data into training and test sets.

The simplest train/test split is random: we select a random subset (e.g., 80%) of the response matrix for the training set, and the rest for the test set. We can demonstrate this with Bradley–Terry responses, using 80% of the comparisons for training and 20% for testing:
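
As a minimal sketch of such a split (the simulation settings — 20 items, standard-normal appeals, one comparison per unordered pair — and the names `V_true`, `sigmoid`, `D_train`, `D_test` are our own illustrative choices, not fixed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation: M items with standard-normal true appeals.
M = 20
V_true = rng.normal(size=M)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One Bernoulli outcome per unordered pair (j, k), j < k.
pairs = [(j, k) for j in range(M) for k in range(j + 1, M)]
Y = np.array([rng.binomial(1, sigmoid(V_true[j] - V_true[k])) for j, k in pairs])

# Random 80/20 split over comparisons (not items).
idx = rng.permutation(len(pairs))
n_train = int(0.8 * len(pairs))
D_train = [(pairs[i], Y[i]) for i in idx[:n_train]]
D_test = [(pairs[i], Y[i]) for i in idx[n_train:]]
print(len(D_train), "training comparisons,", len(D_test), "test comparisons")
```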

Given the train and test datasets of Bradley–Terry responses, denoted \(\mathcal{D}_{\text{train}}\) and \(\mathcal{D}_{\text{test}}\), we first demonstrate parameter learning with full-information maximum likelihood estimation for a single user:

\[ \hat{V} = \arg\max_{V} \sum_{(j,j') \in \mathcal{D}_{\text{train}}} \log p(Y_{0,jj'} | \sigma(H_{0,jj'})), \quad H_{0} = V 1_M^\top - 1_M V^\top, \tag{2.1}\] where \(1_M\) is a column vector of ones of length \(M\) and \(H_{0,jj'} = V_j - V_{j'}\) for a fixed user \(0\). The optimization can be carried out with standard optimizers, such as gradient descent.

Let \(\mathcal{N}_m^+ = \{(m,k) \in \mathcal{D}_{\text{train}}\}\) be pairs recorded in the order \((m,k)\). Let \(\mathcal{N}_m^- = \{(k,m) \in \mathcal{D}_{\text{train}}\}\) be pairs recorded in the order \((k,m)\). Define residuals \(r_{mk} = y_{mk} - \sigma(V_m - V_k)\) and \(r_{km} = y_{km} - \sigma(V_k-V_m)\). Then

\[ \frac{\partial \ell}{\partial V_m} = \sum_{(m,k)\in \mathcal{N}^+_m} r_{mk} - \sum_{(k,m)\in \mathcal{N}^-_m} r_{km}. \tag{2.2}\]

This shows the contribution of data about \(m\) only. Each comparison with opponent \(k\) adds a term equal to the observed-minus-predicted win indicator, with a plus sign if \(m\) is listed first and a minus sign if \(m\) is listed second. Intuitively, if \(m\) beats \(k\) more often than the model predicts, the residual is positive and the gradient pushes \(V_m\) up. If \(m\) loses to \(k\) more than predicted, the residual is negative and the gradient pushes \(V_m\) down.
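
A gradient-ascent sketch of this estimator, reusing `sigmoid`, `D_train`, and `M` from the splitting example above (the step size and iteration count are illustrative guesses):

```python
def bt_log_likelihood(V, data):
    """Log-likelihood of Bradley-Terry comparisons; data is [((j, k), y), ...]."""
    ll = 0.0
    for (j, k), y in data:
        p = sigmoid(V[j] - V[k])
        ll += y * np.log(p) + (1 - y) * np.log(1 - p)
    return ll

def bt_gradient(V, data, M):
    """Gradient from Eq. 2.2: each comparison adds +/- (y - p) to the two items."""
    g = np.zeros(M)
    for (j, k), y in data:
        r = y - sigmoid(V[j] - V[k])   # residual
        g[j] += r                      # item listed first: plus sign
        g[k] -= r                      # item listed second: minus sign
    return g

# Plain gradient ascent with a fixed step size.
V_hat = np.zeros(M)
eta = 0.05
for t in range(2000):
    V_hat += eta * bt_gradient(V_hat, D_train, M)

V_hat -= V_hat.mean()   # center: utilities are identified only up to a shift
```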

In this synthetic setting, one way to interpret the result is to compare with the Bayes-optimal AUC, which quantifies an upper bound on achievable test AUC under the assumed data-generating process. For each test pair \((j,k)\) we know the ground-truth win probability \(P_{jk}=\sigma(V_j-V_k)\) from the simulator. The Bayes-optimal score for that pair is exactly this probability, \(s^*_{jk}=P_{jk}\). Any classifier that ranks pairs by \(s^*\) maximizes AUC in expectation because it orders pairs by their true success probabilities. To estimate the corresponding Bayes AUC on our finite test set, we keep the same index set of pairs and repeatedly resample binary labels \(Y_{jk} \sim \mathrm{Bern}(P_{jk})\), then compute \(\mathrm{AUC}(s^*, Y)\) on each resample using a tie-aware definition. The mean of these Monte Carlo AUCs is the Bayes-optimal test AUC, and the empirical quantiles give a sampling range induced purely by label noise. For comparison, we compute the model’s AUC by replacing \(s^*_{jk}\) with the learned scores \(s^{\text{model}}_{jk}=\sigma(\hat V_j-\hat V_k)\). The gap between the two summarizes how far the fitted model is from the oracle ranking implied by the true \(P_{jk}\) on the exact same test pairs.
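
A sketch of this Monte Carlo comparison, using scikit-learn's tie-aware `roc_auc_score` and the simulated quantities from the sketches above (the number of resamples is an arbitrary choice):

```python
from sklearn.metrics import roc_auc_score

# True win probabilities and learned scores on the same test pairs.
test_pairs = [jk for jk, _ in D_test]
p_true = np.array([sigmoid(V_true[j] - V_true[k]) for j, k in test_pairs])
s_model = np.array([sigmoid(V_hat[j] - V_hat[k]) for j, k in test_pairs])

# Monte Carlo estimate of the Bayes-optimal AUC: resample labels from the
# true probabilities and score them with the oracle s* = p_true.
bayes_aucs = []
for _ in range(1000):
    y_sim = rng.binomial(1, p_true)
    if 0 < y_sim.sum() < len(y_sim):        # AUC needs both classes present
        bayes_aucs.append(roc_auc_score(y_sim, p_true))
bayes_aucs = np.array(bayes_aucs)

y_test = np.array([y for _, y in D_test])
print("Bayes-optimal AUC ~", bayes_aucs.mean().round(3),
      "(5-95% range", np.quantile(bayes_aucs, [0.05, 0.95]).round(3), ")")
print("Model AUC on observed labels:", roc_auc_score(y_test, s_model).round(3))
```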

The result shows that the inference has found a good solution compared to the Bayes-optimal estimator! An additional, natural test is to see whether the estimated parameters match the true ones. Since the likelihood depends only on differences of item appeals within each pair, the appeal parameters are identifiable only up to an additive shift. The standard practice is to center the solution (and, when comparing against the ground truth, to standardize its scale as well).

Looking Ahead: Regularization

MLE can overfit when parameters outnumber observations. Section 2.6 introduces L2 regularization, which adds a penalty \(\frac{\lambda}{2}\|V\|_2^2\) to the objective. This is equivalent to MAP estimation with a Gaussian prior—connecting MLE to the Bayesian framework we develop next.

2.4 Bayesian Inference

A Bayesian approach provides a natural alternative to maximum likelihood for parameter estimation. Instead of finding a single point estimate, we place a prior distribution on parameters and update it using the likelihood from observed comparisons. This yields a posterior distribution that captures both central estimates and the uncertainty around them. We first cover parametric models using MCMC, then extend to nonparametric Gaussian Process models using Laplace approximation.

2.4.1 Parametric Models

For the Bradley–Terry model with finite-dimensional item parameters \(V\), we place i.i.d. standard normal priors and update them using the Bernoulli likelihood from observed comparisons. In practice, this posterior cannot be computed in closed form, so we turn to Markov chain Monte Carlo (MCMC), which constructs a sequence of samples that, in the limit, follow the posterior distribution.

The Metropolis–Hastings (MH) algorithm is straightforward to apply: at each step, we propose a new value for one or more coordinates of \(V\) (e.g., from a Gaussian centered at the current state), compute the acceptance ratio as the ratio of posterior densities between the proposed and current states, and accept the proposal with that probability. Repeating this process produces a chain of samples that can be used to approximate posterior means, variances, or other functionals of interest. This approach not only yields point predictions but also quantifies uncertainty about the relative strengths of items under the Bradley–Terry model.

Let the prior be i.i.d. standard normal for each item parameter: \(p(V) = \prod_{j=1}^M \mathcal N(V_j \mid 0, 1)\). Then the posterior distribution is \(p(V \mid \mathcal D) = p(\mathcal D \mid V)p(V)/p(\mathcal D),\) where the likelihood is \[ p(\mathcal D \mid V) = \prod_{(j,j')\in\mathcal I} \sigma(V_j - V_{j'})^{Y_{jj'}} \bigl(1-\sigma(V_j - V_{j'})\bigr)^{1-Y_{jj'}}. \tag{2.3}\]

The denominator is the marginal likelihood (evidence), which is intractable in closed form, so we resort to MCMC (e.g., MH) to sample from the posterior. Suppose we are at current (parameter) state \(V^{(t)}\). We propose a new state \(V'\) from a proposal distribution, such as Gaussian: \(q(V' \mid V^{(t)}) = \mathcal N (V'; V^{(t)}, \tau^2 I),\) where \(\tau^2\) is a step-size variance. This proposal is symmetric, meaning \(q(V' \mid V^{(t)}) = q(V^{(t)} \mid V').\) For any proposal distribution \(q(V' \mid V^{(t)})\), the Metropolis–Hastings acceptance probability is

\[ \alpha = \min \left\{1, \frac{p(V' \mid \mathcal D) \cdot q(V^{(t)} \mid V')}{p(V^{(t)} \mid \mathcal D) \cdot q(V' \mid V^{(t)})}\right\}. \tag{2.4}\]

This says: accept the proposal with probability proportional to how much more plausible it is under the posterior, adjusted by how easy it is to propose back. If we choose a Gaussian proposal, the proposal terms cancel out in the ratio due to symmetry. So the acceptance rule simplifies to \[ \alpha = \min \left\{1, \frac{p(V' \mid \mathcal D)}{p(V^{(t)} \mid \mathcal D)}\right\} = \min \left\{1, \frac{p(\mathcal D \mid V') \cdot p(V')}{p(\mathcal D \mid V^{(t)}) \cdot p(V^{(t)})}\right\}. \tag{2.5}\]
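
A minimal random-walk Metropolis–Hastings sketch for this posterior, reusing `bt_log_likelihood`, `D_train`, and `M` from the earlier sketches (the chain length, step size \(\tau\), and burn-in are illustrative):

```python
def log_posterior(V, data):
    """Unnormalized log-posterior: Bernoulli likelihood plus standard normal prior."""
    return bt_log_likelihood(V, data) - 0.5 * np.sum(V ** 2)

def metropolis_hastings(data, M, n_samples=5000, tau=0.1, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(M)
    logp = log_posterior(V, data)
    samples = []
    for _ in range(n_samples):
        V_prop = V + tau * rng.normal(size=M)          # symmetric Gaussian proposal
        logp_prop = log_posterior(V_prop, data)
        if np.log(rng.uniform()) < logp_prop - logp:   # accept with prob min(1, ratio)
            V, logp = V_prop, logp_prop
        samples.append(V.copy())
    return np.array(samples)

samples = metropolis_hastings(D_train, M)
burn_in = 1000
post_mean = samples[burn_in:].mean(axis=0)   # posterior mean estimate of V
post_std = samples[burn_in:].std(axis=0)     # per-item posterior uncertainty
```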

2.4.2 Gaussian Processes

MCMC works well for finite-dimensional parameter vectors, but what if we want to learn a nonparametric reward function? Gaussian Processes (GPs) extend Bayesian inference to function spaces, placing a prior distribution over reward functions \(r \sim \mathcal{GP}(m, k)\) and combining it with the Bradley-Terry likelihood for preferences. The challenge is that this likelihood is non-Gaussian, breaking the standard GP posterior formulas and requiring approximations like Laplace.

The inference problem. Given pairwise comparisons \(\mathcal{D} = \{(x_A^{(i)}, x_B^{(i)}, y_i)\}_{i=1}^n\) where \(y_i = 1\) if item \(A\) was preferred, we want to compute the posterior: \[ p(r \mid \mathcal{D}) \propto p(\mathcal{D} \mid r) \cdot p(r) \tag{2.6}\]

The likelihood follows the Bradley-Terry model: \[ p(y_i = 1 \mid r) = \sigma\left(r(x_A^{(i)}) - r(x_B^{(i)})\right) \tag{2.7}\]

Because this sigmoid likelihood is not Gaussian, the posterior \(p(r \mid \mathcal{D})\) is no longer a Gaussian Process.

2.4.2.1 Laplace Approximation

The Laplace approximation provides a tractable Gaussian approximation to the true posterior:

  1. Find the posterior mode \(\mathbf{r}^* = \arg\max_{\mathbf{r}} \log p(\mathcal{D} \mid \mathbf{r}) + \log p(\mathbf{r})\)
  2. Approximate with a Gaussian centered at the mode, using the Hessian of the log-posterior as precision

For preference data with a GP prior (zero mean), and with preferences encoded here as \(y_i \in \{-1, +1\}\) (where \(+1\) means item \(A\) was preferred), the mode satisfies: \[ \mathbf{r}^* = \arg\max_{\mathbf{r}} \sum_{i=1}^n \log \sigma\left(y_i (r(x_A^{(i)}) - r(x_B^{(i)}))\right) - \frac{1}{2}\mathbf{r}^\top K^{-1} \mathbf{r} \tag{2.8}\]

where \(K\) is the kernel matrix and \(\mathbf{r} = [r(x_1), \ldots, r(x_m)]^\top\) collects function values at all unique points appearing in comparisons.

Newton’s method for finding the mode. The gradient and Hessian of the log-posterior are: \[ \nabla_{\mathbf{r}} \log p(\mathbf{r} \mid \mathcal{D}) = \mathbf{g} - K^{-1}\mathbf{r} \tag{2.9}\] \[ \nabla^2_{\mathbf{r}} \log p(\mathbf{r} \mid \mathcal{D}) = -W - K^{-1} \tag{2.10}\]

where \(\mathbf{g}\) is the gradient of the log-likelihood with respect to \(\mathbf{r}\) and \(W\) is the negative Hessian of the log-likelihood. Writing \(p_i = \sigma(r(x_A^{(i)}) - r(x_B^{(i)}))\) and letting \(A\) denote the \(n \times m\) matrix whose \(i\)-th row has \(+1\) in the column of \(x_A^{(i)}\) and \(-1\) in the column of \(x_B^{(i)}\), we have \(W = A^\top \operatorname{diag}\bigl(p_1(1-p_1), \ldots, p_n(1-p_n)\bigr) A\); in the space of the latent pairwise differences, this matrix is simply diagonal with entries \(p_i(1-p_i)\).

The approximate posterior. After finding \(\mathbf{r}^*\), the Laplace approximation gives: \[ p(\mathbf{r} \mid \mathcal{D}) \approx \mathcal{N}\left(\mathbf{r}^*, (K^{-1} + W)^{-1}\right) \tag{2.11}\]

This is structurally similar to standard GP regression, but with the data-dependent precision matrix \(W\) arising from the Bradley-Terry likelihood rather than observation noise.
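
A sketch of the Newton iteration and the Laplace covariance on a small simulated problem; the kernel choice, the data-generating function, and all problem sizes here are our own assumptions, and `sigmoid` and `rng` are reused from the earlier sketches:

```python
def rbf_kernel(X, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential kernel matrix for inputs X of shape (m, d)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * sq / lengthscale ** 2)

def laplace_mode(K, A, y, n_newton=20, jitter=1e-6):
    """Newton iterations for the mode of the GP preference posterior (Eq. 2.8)."""
    m = K.shape[0]
    K_inv = np.linalg.inv(K + jitter * np.eye(m))
    r = np.zeros(m)
    for _ in range(n_newton):
        p = sigmoid(A @ r)                          # predicted win probabilities
        g = A.T @ (y - p)                           # gradient of the log-likelihood
        W = A.T @ np.diag(p * (1 - p)) @ A          # negative Hessian of log-likelihood
        r = np.linalg.solve(K_inv + W, W @ r + g)   # Newton step
    cov = np.linalg.inv(K_inv + W)                  # Laplace covariance (Eq. 2.11)
    return r, cov

# Hypothetical setup: m unique 1-D inputs and n comparisons indexing into them.
m, n = 15, 60
X = rng.uniform(-2, 2, size=(m, 1))
r_true = np.sin(2 * X[:, 0])
a, b = rng.integers(0, m, size=(2, n))
keep = a != b
a, b = a[keep], b[keep]
y = rng.binomial(1, sigmoid(r_true[a] - r_true[b]))
A = np.zeros((len(a), m))
A[np.arange(len(a)), a] = 1.0
A[np.arange(len(a)), b] = -1.0

r_map, r_cov = laplace_mode(rbf_kernel(X), A, y)
```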

2.4.2.2 Connection to Fisher Information

The Laplace approximation has a natural connection to the Fisher information framework we developed for linear models. The negative Hessian of the log-likelihood at the mode is the observed Fisher information; expressed in terms of the latent pairwise differences, it is diagonal: \[ I_{\text{obs}} = \text{diag}(p_1(1-p_1), \ldots, p_n(1-p_n)) \tag{2.12}\]

Just as with linear preference models, comparisons with \(p \approx 0.5\) contribute the most information (highest \(p(1-p)\)), while “easy” comparisons where one option clearly dominates contribute little.

2.4.2.3 Computational Considerations

Complexity. Each Newton iteration requires \(O(m^3)\) computation for the matrix solve, where \(m\) is the number of unique points. For \(n\) comparisons involving \(m \leq 2n\) unique points, the total cost is \(O(n^3)\) per iteration.

Scalability. For large datasets, several approximations are available:

  • Inducing points: Approximate the GP using \(k \ll m\) pseudo-inputs, reducing complexity to \(O(k^2 n)\)
  • Variational inference: Optimize a lower bound on the marginal likelihood
  • Conjugate gradient methods: Avoid explicit matrix inversion

Hyperparameter learning. The kernel hyperparameters (length-scale \(\ell\), signal variance \(\sigma_f^2\)) can be learned by maximizing the Laplace-approximated marginal likelihood: \[ \log p(\mathcal{D} \mid \theta) \approx \log p(\mathcal{D} \mid \mathbf{r}^*) + \log p(\mathbf{r}^* \mid \theta) + \frac{1}{2}\log |K^{-1} + W|^{-1} \tag{2.13}\]

In Chapter 4, we will see how the GP posterior uncertainty enables active query selection—choosing which comparisons to ask to learn most efficiently.

2.5 Online Learning

In many applications, comparisons between items arrive sequentially over time rather than being observed all at once. For example, players in online games are continuously matched, or recommendation systems log one user preference at a time. In such settings, it is often computationally infeasible to refit the full Bradley–Terry model by maximum likelihood or MCMC after each new observation. Instead, we want an online update rule that adjusts item strengths incrementally as new outcomes arrive.

This is precisely the motivation behind the Elo rating system, originally introduced for ranking chess players and later widely adopted in competitive games, online platforms, and even information retrieval. The key idea is to maintain a current estimate of each item’s (or player’s) latent strength, and update only the two items involved in a match when a new result comes in.

The Elo rule can be derived as a stochastic gradient ascent method on the Bradley–Terry log-likelihood. Suppose item \(j\) plays against item \(j'\), the outcome is \(y \in \{0,1\}\) (with \(y = 1\) if \(j\) wins), and the model's predicted win probability is \(p = \sigma(V_j - V_{j'})\). Given the current parameters, the log-likelihood gradient with respect to \(V_j\) and \(V_{j'}\) is \[ \frac{\partial \ell}{\partial V_j} = (y - p), \qquad \frac{\partial \ell}{\partial V_{j'}} = -(y - p). \tag{2.14}\]

A stochastic gradient ascent step with learning rate \(\eta\) gives the update: \[ V_j \leftarrow V_j + \eta (y - p), \qquad V_{j'} \leftarrow V_{j'} - \eta (y - p). \tag{2.15}\]

This is exactly the Elo update rule. The learning rate \(\eta\) is often called the K-factor in Elo literature. If \(y=1\) (item \(j\) wins) but the model predicted a low \(p\), then \((y - p)\) is positive and \(V_j\) increases, \(V_{j'}\) decreases — the system learns that \(j\) is stronger than previously believed. If \(y=0\) (item \(j'\) wins), the opposite adjustment happens. The magnitude of the update is larger when the outcome is surprising (large prediction error), and smaller when the outcome is expected (small prediction error). Thus, Elo is an online learning algorithm for the Bradley–Terry model, interpretable as stochastic gradient ascent with a fixed step size.
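
A sketch of the Elo/SGD update applied to the same training comparisons used earlier (the K-factor \(\eta = 0.1\) and the random processing order are illustrative choices); it reuses `sigmoid`, `D_train`, `M`, and `rng` from the sketches above:

```python
def elo_update(V, j, k, y, eta=0.1):
    """One online (Elo) update after item j plays item k; y = 1 if j wins."""
    p = sigmoid(V[j] - V[k])    # predicted probability that j beats k
    V[j] += eta * (y - p)       # Eq. 2.15: winner moves up by the prediction error
    V[k] -= eta * (y - p)       # loser moves down by the same amount
    return V

# Stream the training comparisons one at a time, in random order.
V_elo = np.zeros(M)
for i in rng.permutation(len(D_train)):
    (j, k), y = D_train[i]
    V_elo = elo_update(V_elo, j, k, y)
```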

2.6 Regularization and Overfitting

The maximum likelihood estimator maximizes fit to observed training data but may overfit, especially when the number of parameters is large relative to the amount of data. In preference learning, overfitting manifests as learned item parameters that capture noise in the training comparisons rather than true relative strengths. Regularization techniques penalize model complexity to improve generalization to held-out test data.

2.6.1 L2 Regularization

The most common regularization approach adds an L2 penalty term to the log-likelihood. For the Bradley-Terry model, the regularized objective becomes: \[ \mathcal{L}_{\text{reg}}(V) = \sum_{(j,j') \in \mathcal{D}_{\text{train}}} \log p(Y_{jj'} \mid V_j - V_{j'}) - \frac{\lambda}{2} \|V\|_2^2 \tag{2.16}\] where \(\lambda \geq 0\) is the regularization strength. The penalty \(\frac{\lambda}{2} \|V\|_2^2 = \frac{\lambda}{2} \sum_{j=1}^M V_j^2\) discourages large parameter values. When \(\lambda = 0\), we recover standard MLE. As \(\lambda\) increases, parameters shrink toward zero.

The regularized gradient adds a simple term to the MLE gradient: \[ \frac{\partial \mathcal{L}_{\text{reg}}}{\partial V_m} = \sum_{(m,k)\in \mathcal{N}^+_m} r_{mk} - \sum_{(k,m)\in \mathcal{N}^-_m} r_{km} - \lambda V_m \tag{2.17}\]

Intuitively, regularization creates a bias-variance tradeoff: small \(\lambda\) permits complex fits (low bias, high variance), while large \(\lambda\) enforces simpler models (high bias, low variance). The optimal \(\lambda\) balances these to minimize test error.

Connection to Bayesian Inference

L2 regularization corresponds exactly to maximum a posteriori (MAP) estimation with a Gaussian prior \(p(V_j) = \mathcal{N}(0, 1/\lambda)\). The regularization strength \(\lambda\) is the inverse prior variance: strong regularization (large \(\lambda\)) means a tight prior belief that parameters are near zero.

Sweeping \(\lambda\) and plotting training and test AUC (a validation curve) demonstrates the regularization trade-off: at \(\lambda = 0\) (no regularization), the model overfits to training data, achieving high training AUC but lower test AUC. As \(\lambda\) increases, test performance improves until an optimal point, after which excessive regularization underfits. The gap between training and test AUC narrows with proper regularization, indicating better generalization.
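
A sketch of such a sweep, reusing `bt_gradient`, `sigmoid`, `roc_auc_score`, and the train/test split from the earlier sketches (the grid of \(\lambda\) values is arbitrary):

```python
def fit_bt_regularized(data, M, lam, eta=0.05, n_iters=2000):
    """Gradient ascent on the L2-regularized log-likelihood (Eq. 2.16)."""
    V = np.zeros(M)
    for _ in range(n_iters):
        V += eta * (bt_gradient(V, data, M) - lam * V)   # Eq. 2.17
    return V

def pairwise_auc(V, data):
    """AUC of the learned scores sigma(V_j - V_k) against observed outcomes."""
    y = np.array([y for _, y in data])
    s = np.array([sigmoid(V[j] - V[k]) for (j, k), _ in data])
    return roc_auc_score(y, s)

for lam in [0.0, 0.01, 0.1, 1.0, 10.0]:
    V_lam = fit_bt_regularized(D_train, M, lam)
    print(f"lambda={lam:<5}  train AUC={pairwise_auc(V_lam, D_train):.3f}"
          f"  test AUC={pairwise_auc(V_lam, D_test):.3f}")
```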

2.6.2 Early Stopping

An alternative to explicit regularization is early stopping: monitor validation performance during training and stop when it begins to degrade, even if training performance continues improving. This exploits the empirical observation that gradient descent follows a path from simple to complex models.

Early stopping automatically selects model complexity through the training trajectory. Unlike L2 regularization, it requires no tuning of \(\lambda\), but does require held-out validation data to monitor.
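
A sketch of early stopping against a held-out validation set, reusing `bt_gradient` and `pairwise_auc` from the sketches above (the patience window and split are illustrative choices):

```python
def fit_with_early_stopping(train, val, M, eta=0.05, max_iters=5000, patience=50):
    """Unregularized gradient ascent, stopped when validation AUC stops improving."""
    V = np.zeros(M)
    best_auc, best_V, since_best = -np.inf, V.copy(), 0
    for t in range(max_iters):
        V += eta * bt_gradient(V, train, M)
        auc = pairwise_auc(V, val)
        if auc > best_auc:
            best_auc, best_V, since_best = auc, V.copy(), 0
        else:
            since_best += 1
            if since_best >= patience:   # no improvement for `patience` steps
                break
    return best_V, best_auc

# Hold out 20% of the training comparisons as a validation set.
split = int(0.8 * len(D_train))
V_es, val_auc = fit_with_early_stopping(D_train[:split], D_train[split:], M)
```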

Key Takeaway: When to Regularize

Regularization is most critical when:

  • Few observations relative to parameters (e.g., 50 comparisons, 20 items)
  • Imbalanced data: Some items have many comparisons, others have few
  • High noise: Label noise or measurement error in comparisons

For large datasets with many comparisons per item, overfitting is less of a concern and regularization may have minimal impact. Always use validation data or cross-validation to tune regularization strength.

2.7 Model Selection and Cross-Validation

A single train/test split provides one estimate of generalization performance, but this estimate has high variance: a different random split may yield different results. Cross-validation (CV) systematically evaluates multiple train/test splits to obtain more reliable performance estimates and enable principled model selection.

2.7.1 K-Fold Cross-Validation

In k-fold cross-validation, we partition the data into \(k\) equally-sized folds. For each fold \(i \in \{1, \ldots, k\}\):

  1. Train the model on all folds except fold \(i\)
  2. Evaluate on fold \(i\) (held-out validation)
  3. Record the validation metric (e.g., AUC, log-likelihood)

The final CV score is the average across all \(k\) folds: \(\text{CV}_k = \frac{1}{k} \sum_{i=1}^k \text{metric}_i\). The standard error quantifies uncertainty in the estimate.

For preference data, we partition the set of observed pairwise comparisons (not items) into folds, ensuring each fold contains a representative sample of comparisons.
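
A sketch of k-fold CV over comparisons, reusing `fit_bt_regularized` and `pairwise_auc` from the regularization sketch (the fold count, \(\lambda\), and seed are illustrative):

```python
def k_fold_cv(data, M, k=5, lam=0.1, seed=0):
    """k-fold CV over comparisons: train on k-1 folds, score AUC on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(data)), k)
    scores = []
    for i in range(k):
        val = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        V = fit_bt_regularized(train, M, lam)
        scores.append(pairwise_auc(V, val))
    return np.array(scores)

cv_scores = k_fold_cv(D_train, M, k=5, lam=0.1)
print(f"CV AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```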

The resulting CV estimate (mean ± standard deviation across folds) is more reliable than a single train/test split; the standard deviation quantifies variability across folds.

2.7.2 Hyperparameter Tuning with Cross-Validation

Cross-validation enables principled hyperparameter selection: evaluate each candidate hyperparameter on CV performance and select the best. For Bradley-Terry, key hyperparameters include regularization strength \(\lambda\) and learning rate.

Plotting mean CV performance with error bars (the standard deviation across folds) shows the uncertainty in the CV estimate for each hyperparameter value. Select the hyperparameter with the highest mean CV performance.

2.7.3 Beyond AUC: Multiple Evaluation Metrics

AUC measures ranking quality but does not capture all aspects of model performance. Additional metrics provide complementary insights:

Log-Likelihood: Measures the probability the model assigns to observed outcomes. Higher is better. \[ \text{LL} = \sum_{(j,j') \in \mathcal{D}_{\text{test}}} \left[ Y_{jj'} \log \sigma(V_j - V_{j'}) + (1 - Y_{jj'}) \log (1 - \sigma(V_j - V_{j'})) \right] \tag{2.18}\]

Calibration Error: Measures whether predicted probabilities match empirical frequencies. Bin predictions into intervals (e.g., [0.0, 0.1), [0.1, 0.2), …) and compare average predicted probability to observed frequency in each bin.
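
A sketch computing both metrics for a fitted parameter vector, reusing `sigmoid`, `V_hat`, and the test split from the earlier sketches (the ten-bin choice mirrors the binning described above):

```python
def test_metrics(V, data, n_bins=10):
    """Test log-likelihood (Eq. 2.18) and expected calibration error for a fitted model."""
    y = np.array([y for _, y in data])
    p = np.array([sigmoid(V[j] - V[k]) for (j, k), _ in data])
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Bin predictions and compare mean predicted probability to empirical win rate.
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return log_lik, ece

ll, ece = test_metrics(V_hat, D_test)
print(f"test log-likelihood: {ll:.1f}   expected calibration error: {ece:.3f}")
```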

Different metrics may favor different models. AUC focuses on ranking, log-likelihood on probability estimates, and calibration on frequency matching. Use multiple metrics to gain a comprehensive view of model quality.

Key Takeaway: Cross-Validation Best Practices
  • Use CV for hyperparameter tuning, not test set evaluation (avoid overfitting to the test set)
  • Report mean ± std across folds to quantify uncertainty
  • Stratified splitting: For imbalanced data, ensure each fold has representative class proportions
  • Temporal splitting: For sequential data (e.g., chess games over time), use time-based splits instead of random CV to avoid leaking future information into past predictions
  • Nested CV: For unbiased performance estimates, use outer CV loop for evaluation and inner CV loop for hyperparameter selection

2.8 Optimization Methods

Gradient descent is the foundation for learning Bradley-Terry parameters, but modern optimization methods can accelerate convergence and improve final performance. We compare three widely-used optimizers and discuss when each is appropriate.

2.8.1 Beyond Vanilla Gradient Descent

Standard gradient descent (here applied as ascent on the log-likelihood) updates parameters with a fixed learning rate: \(V \leftarrow V + \eta \nabla \mathcal{L}(V)\). This has two limitations:

  1. Fixed step size: Large \(\eta\) causes instability; small \(\eta\) slows convergence
  2. No momentum: Each step ignores the optimization history

Modern optimizers address these issues through adaptive learning rates and momentum.

2.8.2 Adam Optimizer

Adam (Adaptive Moment Estimation) maintains running averages of both gradients and squared gradients, adapting the learning rate per parameter. The update rule is:

\[ \begin{aligned} m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad &\text{(momentum)} \\ v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad &\text{(variance)} \\ \hat{m}_t &\leftarrow m_t / (1 - \beta_1^t), \quad \hat{v}_t \leftarrow v_t / (1 - \beta_2^t) \quad &\text{(bias correction)} \\ V &\leftarrow V + \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \quad &\text{(parameter update)} \end{aligned} \tag{2.19}\]

where \(g_t\) is the gradient at step \(t\), \(\beta_1 = 0.9\) and \(\beta_2 = 0.999\) are typical, and \(\epsilon = 10^{-8}\) prevents division by zero.

Adam typically converges faster than vanilla gradient descent and is less sensitive to learning rate tuning. The adaptive per-parameter learning rates help in problems with varying gradient magnitudes across parameters.
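
A sketch of Adam applied to the Bradley-Terry log-likelihood, reusing `bt_gradient` and the training split from the earlier sketches (the step size \(\alpha\) and iteration count are illustrative):

```python
def fit_bt_adam(data, M, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8, n_iters=1000):
    """Adam (Eq. 2.19) applied to the Bradley-Terry log-likelihood (ascent direction)."""
    V = np.zeros(M)
    m_t = np.zeros(M)   # first-moment (momentum) estimate
    v_t = np.zeros(M)   # second-moment (variance) estimate
    for t in range(1, n_iters + 1):
        g = bt_gradient(V, data, M)
        m_t = beta1 * m_t + (1 - beta1) * g
        v_t = beta2 * v_t + (1 - beta2) * g ** 2
        m_hat = m_t / (1 - beta1 ** t)          # bias correction
        v_hat = v_t / (1 - beta2 ** t)
        V += alpha * m_hat / (np.sqrt(v_hat) + eps)
    return V

V_adam = fit_bt_adam(D_train, M)
```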

2.8.3 Learning Rate Schedules

Instead of a fixed learning rate, schedules decay \(\eta\) over time to enable large initial steps (fast progress) followed by small refinements (precision):

  • Step decay: \(\eta_t = \eta_0 \cdot \gamma^{\lfloor t / k \rfloor}\) (reduce by factor \(\gamma\) every \(k\) epochs)
  • Exponential decay: \(\eta_t = \eta_0 e^{-\lambda t}\)
  • Cosine annealing: \(\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})(1 + \cos(\pi t / T))\)

Learning rate schedules are especially useful for training to convergence without extensive hyperparameter tuning.
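
A sketch of these schedules as a single helper, with the decay constants chosen arbitrarily for illustration; it reuses `bt_gradient` and the training data from the earlier sketches:

```python
def lr_schedule(t, eta0=0.1, kind="cosine", T=1000, gamma=0.5, k=200,
                decay=0.003, eta_min=1e-4):
    """The three schedules above; t is the current iteration."""
    if kind == "step":
        return eta0 * gamma ** (t // k)
    if kind == "exponential":
        return eta0 * np.exp(-decay * t)
    if kind == "cosine":
        return eta_min + 0.5 * (eta0 - eta_min) * (1 + np.cos(np.pi * t / T))
    raise ValueError(kind)

# Gradient ascent with a decaying step size.
V_sched = np.zeros(M)
for t in range(1000):
    V_sched += lr_schedule(t) * bt_gradient(V_sched, D_train, M)
```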

Key Takeaway: Choosing an Optimizer

When to use each optimizer:

  • Vanilla GD: Simple problems, well-tuned learning rate available, or for pedagogical clarity
  • Adam: Default choice for most problems; robust to learning rate, fast convergence, handles sparse gradients well
  • SGD with momentum: When you need the theoretical guarantees of SGD (e.g., convergence proofs) but want faster practical convergence

Hyperparameter guidelines:

  • Adam: \(\alpha = 0.001\) to \(0.1\), default \(\beta_1 = 0.9\), \(\beta_2 = 0.999\)
  • GD: \(\eta = 0.01\) to \(0.1\), may need careful tuning
  • Learning rate schedules: Helpful when training for many epochs

2.9 Real-World Application: LLM Preference Learning

We now apply all three parameter learning methods to a realistic preference learning scenario inspired by language model alignment. While production LLM alignment uses datasets like Stanford Human Preferences (SHP) or Anthropic’s HH-RLHF, we construct a synthetic dataset that captures key properties of real preference data: noisy labels, varying item quality, and response embeddings.

2.9.1 Dataset: Simulated LLM Response Preferences

We simulate a setting where a language model generates multiple responses to prompts, and human annotators provide pairwise preferences. Each response is represented by a learned embedding (e.g., from a pretrained model), and the true utility is a linear function of this embedding plus noise.

This synthetic dataset mimics real LLM preference data: responses have embedding representations, utilities are learned functions of embeddings, and annotations contain noise.
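
A sketch of such a simulator; the embedding dimension, the number of responses and comparisons, and the 10% flip rate are illustrative assumptions rather than properties of any real dataset (`sigmoid` is reused from the earlier sketches):

```python
# Hypothetical simulation of LLM response preferences: each "response" gets a
# d-dimensional embedding, true utility is linear in the embedding, and a
# fraction of pairwise labels is flipped to mimic annotator noise.
d, n_responses, n_comparisons, noise_rate = 16, 100, 2000, 0.10
rng_llm = np.random.default_rng(1)

embeddings = rng_llm.normal(size=(n_responses, d))   # e.g., from a pretrained encoder
w_true = rng_llm.normal(size=d)
utility = embeddings @ w_true                         # true (latent) response quality

llm_data = []
for _ in range(n_comparisons):
    j, k = rng_llm.choice(n_responses, size=2, replace=False)
    y = rng_llm.binomial(1, sigmoid(utility[j] - utility[k]))
    if rng_llm.uniform() < noise_rate:               # annotator label noise
        y = 1 - y
    llm_data.append(((j, k), y))
```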

2.9.2 Applying All Three Learning Methods

We apply MLE, Bayesian inference, and online learning (Elo) to the same data and compare results.

2.9.3 Key Observations

The three methods yield similar performance on this synthetic LLM preference task:

  1. MLE is fastest and most scalable, suitable for large-scale applications
  2. Bayesian inference via MCMC provides posterior distributions over utilities, enabling uncertainty quantification—the marginal plot shows how posterior samples vary, and error bars show 95% credible intervals
  3. Online (Elo) learns incrementally, ideal for streaming data but may be less data-efficient than batch methods

All methods successfully recover utilities correlated with the ground truth despite 10% label noise, demonstrating robustness. In production LLM alignment:

  • Data scale: Real datasets have 10K-100K+ comparisons
  • Cold start: New responses have no comparison history; requires initialization strategies
  • Computational cost: MCMC becomes expensive; MLE or online methods preferred
  • Temporal drift: User preferences may evolve; online methods naturally adapt

Connecting to DPO

Recall from Section 1.2 that Direct Preference Optimization (DPO) for language models is equivalent to Bradley-Terry MLE where the “utility” is \(\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\). The MLE techniques developed in this chapter directly apply to DPO training: regularization prevents overfitting to preference data, and optimization methods (Adam, learning rate schedules) accelerate convergence.

2.10 Discussion Questions

  • In the synthetic experiments, why is the Bayes optimal AUC less than 1.0 even though we know the true parameters? What does this reveal about the fundamental limits of prediction from noisy preference data?

  • How does L2 regularization affect the bias-variance tradeoff in Bradley-Terry estimation? Under what data conditions (sample size, noise level, number of items) would you expect regularization to have the largest impact on test performance?

  • When would you prefer online learning (Elo) over batch learning (MLE) for preference model estimation? Consider computational cost, data arrival patterns, and statistical efficiency. Can you design a hybrid approach that combines benefits of both?

  • The Bayesian approach via MCMC provides posterior distributions over parameters, while MLE gives point estimates. Beyond uncertainty quantification, what practical advantages might the full posterior provide? How would you use posterior samples to make decisions (e.g., which items to present to users)?

  • Cross-validation assumes that folds are exchangeable—that the data distribution is the same across all folds. For sequential preference data (e.g., chess games ordered in time, or LLM preferences from a changing user population), this assumption may be violated. How should cross-validation be adapted for temporal data? What are the risks of using standard k-fold CV on sequential data?

  • In the LLM preference learning example, all three methods (MLE, Bayesian, Elo) achieved similar test AUC despite different training procedures. Under what circumstances would you expect the methods to diverge in performance? Consider data scale, noise level, model mis-specification, and computational budget.

2.11 Bibliographic Notes

Maximum likelihood estimation for preference models dates back to Bradley and Terry (1952)’s original work on paired comparisons. The connection to modern machine learning was established through applications in ranking (Herbrich, Minka, and Graepel 2006) and recommender systems (Koren, Bell, and Volinsky 2009). Hunter (2004) provided efficient MM algorithms for Bradley-Terry MLE that scale to large problems.

Bayesian inference for preference data has a long history in psychometrics and experimental design. Davidson (1970) developed Bayesian approaches for paired comparisons. Modern MCMC methods for Bradley-Terry models are surveyed in Caron and Doucet (2012). The connection between L2 regularization and Gaussian priors (MAP estimation) is standard in Bayesian machine learning (Murphy 2012).

The Elo rating system was introduced by Elo (1978) for chess rankings and has since been applied broadly to competitive games (Glickman 1999 introduced the Glicko system with uncertainty estimates), online platforms (Herbrich, Minka, and Graepel 2006 for Xbox matchmaking), and information retrieval. The interpretation of Elo as stochastic gradient descent on the Bradley-Terry log-likelihood clarifies its connection to machine learning (Weng and Lin 2011).

Regularization in statistical learning has roots in ridge regression (Hoerl and Kennard 1970) and has become central to modern machine learning (Hastie, Tibshirani, and Friedman 2009). Early stopping as implicit regularization was analyzed by Yao, Rosasco, and Caponnetto (2007). The bias-variance tradeoff provides the theoretical foundation for why regularization improves generalization (Geman, Bienenstock, and Doursat 1992).

Cross-validation was formalized by Stone (1974) and Geisser (1975) for model selection. Kohavi (1995) provided practical guidance for k-fold CV. Nested cross-validation to avoid selection bias was emphasized by Cawley and Talbot (2010). Temporal validation strategies for time-series data are discussed in Bergmeir, Hyndman, and Koo (2018).

Optimization methods for machine learning are surveyed in Ruder (2016). Adam was introduced by Kingma and Ba (2014) and has become the default optimizer for deep learning. Convergence analysis for gradient descent on convex problems (including logistic regression, which includes Bradley-Terry) is classical (Boyd and Vandenberghe 2004). Learning rate schedules and their impact on generalization are discussed in Loshchilov and Hutter (2016).

LLM alignment via preference learning gained prominence with Christiano et al. (2017)’s introduction of RLHF. The connection to Bradley-Terry and reward modeling was made explicit in Ouyang et al. (2022) (InstructGPT). Direct Preference Optimization (DPO) (Rafailov et al. 2023) showed that RLHF can be reformulated as Bradley-Terry MLE, eliminating the need for a separate reward model. Recent work explores calibration (Stiennon et al. 2020), robustness to noise (Bai et al. 2022), and scaling laws (Gao, Schulman, and Hilton 2022) for preference-based LLM training.

For further reading: Agresti (2002) provides comprehensive coverage of categorical data analysis including paired comparisons. Marden (1995) covers ranking models from a statistical perspective. Liu (2011) surveys learning-to-rank methods in information retrieval, many of which build on preference models.

2.12 Exercises

Exercises are marked with difficulty levels: (*) for introductory, (**) for intermediate, and (***) for challenging.

2.12.1 Gradient Derivation for Plackett-Luce (*)

The Plackett-Luce model extends Bradley-Terry to full rankings. Given a ranking \((j_1 \succ j_2 \succ \cdots \succ j_M)\), the likelihood is: \[ p(\text{ranking} \mid V) = \prod_{k=1}^{M-1} \frac{\exp(V_{j_k})}{\sum_{\ell=k}^M \exp(V_{j_\ell})} \tag{2.20}\]

  1. Write the log-likelihood for a single ranking as a function of utilities \(V\).

  2. Derive the gradient \(\frac{\partial \log p}{\partial V_m}\) for an arbitrary item \(m\). Hint: Consider three cases: \(m\) appears at position \(k\), \(m\) appears after position \(k\), and \(m\) does not appear in the ranking.

  3. Implement gradient ascent for Plackett-Luce on simulated ranking data. Compare convergence to Bradley-Terry MLE—which converges faster and why?

2.12.2 L2 Regularized Bradley-Terry (**)

  1. Implement Bradley-Terry MLE with L2 regularization for a range of \(\lambda\) values. Generate synthetic data with \(M = 20\) items and vary the number of comparisons from 50 to 500. Plot test AUC vs \(\lambda\) for each dataset size.

  2. Explain why the optimal \(\lambda\) decreases as the number of comparisons increases. At what rate does it decay?

  3. Derive the Hessian matrix of the regularized log-likelihood. Under what conditions is it positive definite (ensuring a unique global maximum)?

2.12.3 K-Fold Cross-Validation Implementation (**)

  1. Implement 5-fold cross-validation for Bradley-Terry estimation from scratch (without using sklearn or similar libraries). Your implementation should:
  • Randomly partition comparison data into 5 folds
  • Train on 4 folds, validate on 1 fold
  • Repeat for all 5 folds
  • Return mean and standard deviation of validation AUC
  2. Compare your CV implementation to a single 80/20 train/test split over 20 random seeds. Which provides a more stable performance estimate?

  3. Extend your implementation to stratified CV: ensure each fold has approximately equal proportions of wins for each item. Why might stratification improve CV estimates for imbalanced data?

2.12.4 Learning Rate Sensitivity Analysis (*)

  1. For Bradley-Terry MLE on synthetic data (\(M = 15\) items, 100 comparisons), experiment with learning rates \(\eta \in \{0.001, 0.01, 0.05, 0.1, 0.5, 1.0\}\). Plot the training loss curve for each learning rate.

  2. Identify the learning rate that achieves the lowest final loss. What happens with learning rates that are too large? Too small?

  3. Implement a learning rate schedule (e.g., exponential decay \(\eta_t = \eta_0 \cdot 0.95^t\)) and compare to fixed learning rate. Does the schedule improve final performance?

2.12.5 MCMC Diagnostics (**)

The Metropolis-Hastings implementation in Section 2.4 produces a chain of samples. Assess convergence quality:

  1. Implement the effective sample size (ESS) diagnostic: ESS estimates how many independent samples the chain is equivalent to, accounting for autocorrelation. For a chain \(\{V^{(t)}\}_{t=1}^T\), the ESS for parameter \(V_j\) is approximately: \[ \text{ESS}_j = \frac{T}{1 + 2\sum_{k=1}^K \rho_k} \tag{2.21}\] where \(\rho_k\) is the autocorrelation at lag \(k\).

  2. Run the MH algorithm from Section 2.4 with different proposal step sizes \(\tau \in \{0.01, 0.05, 0.1, 0.5\}\). Plot ESS vs \(\tau\). What happens when \(\tau\) is too small? Too large?

  3. Implement trace plots for multiple chains (run MH 3 times with different initializations). Do all chains converge to the same stationary distribution?

2.12.6 Optimization Method Comparison (**)

  1. Implement gradient descent with momentum: \(m_t \leftarrow \beta m_{t-1} + \nabla \mathcal{L}(V_t)\), \(V_{t+1} \leftarrow V_t + \eta m_t\) where \(\beta \in [0, 1)\) controls momentum.

  2. On the same synthetic Bradley-Terry problem, compare four optimizers: vanilla GD, GD with momentum (\(\beta = 0.9\)), Adam, and RMSprop. Plot convergence curves (loss vs iteration).

  3. For each optimizer, find the best learning rate via grid search. Which optimizer is most sensitive to learning rate tuning?

2.12.7 Convergence Proof (***)

Prove that gradient ascent on the Bradley-Terry log-likelihood converges to the global maximum (assuming the maximum exists).

  1. Show that the Bradley-Terry log-likelihood is strictly concave when the comparison graph is strongly connected (every pair of items is connected by a directed path of comparisons).

  2. Prove that gradient ascent with sufficiently small step size \(\eta\) converges to the unique global maximum. Hint: Use the fact that the gradient vanishes only at the maximum for strictly concave functions.

  3. What happens when the comparison graph is not strongly connected? Construct an example with multiple disconnected components and characterize the set of maximum likelihood estimators.

2.12.8 Calibration Metrics (*)

  1. Implement a calibration plot: bin predicted probabilities into 10 intervals \([0, 0.1), [0.1, 0.2), \ldots, [0.9, 1.0]\), and for each bin, compute the average predicted probability and the empirical frequency of wins.

  2. Generate synthetic Bradley-Terry data and fit two models: one correctly specified (Bradley-Terry) and one mis-specified (assume all items have equal utility). Plot calibration curves for both. Which is better calibrated?

  3. Define the Expected Calibration Error (ECE): \(\text{ECE} = \sum_{b=1}^B \frac{|B_b|}{N} |\bar{p}_b - \bar{y}_b|\) where \(\bar{p}_b\) is the average predicted probability in bin \(b\), \(\bar{y}_b\) is the empirical frequency, and \(|B_b|\) is the number of predictions in bin \(b\). Compute ECE for your models.

2.12.9 Online to Batch Convergence (*)

  1. Implement Elo with a decreasing K-factor: \(K_t = K_0 / \sqrt{t}\) where \(t\) is the iteration number. This is known as the Robbins-Monro condition for stochastic approximation.

  2. On synthetic Bradley-Terry data, run Elo with decreasing \(K_t\) and compare the final learned parameters to batch MLE. How close do they converge?

  3. Prove (or argue informally) that Elo with \(K_t = \eta / \sqrt{t}\) converges to the MLE in the limit as \(t \to \infty\). Under what conditions does this hold?

2.12.10 Hyperparameter Tuning with Nested Cross-Validation (***)

To obtain an unbiased estimate of generalization performance when hyperparameters are selected via CV, we use nested (double) CV:

  • Outer loop: K-fold CV to estimate generalization
  • Inner loop: For each outer fold, use K-fold CV on the training data to select hyperparameters
  1. Implement nested 5-fold CV for Bradley-Terry with L2 regularization. The inner loop tunes \(\lambda\), the outer loop estimates test AUC.

  2. Compare the nested CV test AUC estimate to: (i) a single train/test split with CV on training data for \(\lambda\) selection, and (ii) CV test AUC when \(\lambda\) is selected using the outer CV test set (data leakage). Which is more optimistic? Pessimistic?

  3. Why is nested CV important? Give an example where selecting hyperparameters on the same data used for final evaluation leads to overly optimistic performance estimates.

2.12.11 Real-World Dataset Application (***)

Apply the methods from this chapter to a real preference dataset:

  1. Obtain a dataset such as the Jester jokes dataset (user ratings), a subset of MovieLens, or chess game outcomes from Lichess. Describe the data: number of items, number of comparisons, sparsity.

  2. Apply all three methods (MLE with regularization, Bayesian inference, online Elo) and compare their performance using multiple metrics (AUC, log-likelihood, calibration error). Use cross-validation to tune hyperparameters.

  3. Visualize the learned item parameters. Do they agree with intuition (e.g., highly-rated movies have high utility)? Identify the top-5 and bottom-5 items according to each method—do the methods agree?

  4. Discuss practical challenges encountered: computational cost, cold-start items, missing data, temporal effects. How would you address these in a production system?

2.12.12 Bias-Variance Tradeoff Demonstration (**)

  1. Generate synthetic Bradley-Terry data with \(M = 10\) items and vary the training set size \(n \in \{20, 50, 100, 200, 500\}\) comparisons. For each \(n\) and several regularization strengths \(\lambda\), fit models and compute:
  • Bias: Average difference between learned and true parameters
  • Variance: Standard deviation of learned parameters across 50 random datasets
  2. Plot bias and variance vs \(\lambda\) for each training set size. Verify that as \(\lambda\) increases, bias increases and variance decreases.

  3. Plot test AUC vs \(\lambda\) and identify the optimal \(\lambda\) for each \(n\). How does the optimal regularization strength change with dataset size?

References

Agresti, Alan. 2002. Categorical Data Analysis. 2nd ed. New York: John Wiley & Sons.
Bai, Yuntao, et al. 2022. “Constitutional AI: Harmlessness from AI Feedback.” https://arxiv.org/abs/2212.08073.
Bergmeir, Christoph, Rob J. Hyndman, and Bonsoo Koo. 2018. “A Note on the Validity of Cross-Validation for Evaluating Autoregressive Time Series Prediction.” Computational Statistics & Data Analysis 120: 70–83.
Boyd, Stephen, and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press.
Bradley, Ralph Allan, and Milton E Terry. 1952. “Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons.” Biometrika 39 (3/4): 324–45.
Caron, François, and Arnaud Doucet. 2012. “Efficient Bayesian Inference for Generalized Bradley-Terry Models.” Journal of Computational and Graphical Statistics 21 (1): 174–96.
Cawley, Gavin C., and Nicola L. C. Talbot. 2010. “On over-Fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation.” Journal of Machine Learning Research 11: 2079–2107.
Christiano, Paul F, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. “Deep Reinforcement Learning from Human Preferences.” Advances in Neural Information Processing Systems 30.
Davidson, Roger R. 1970. “On Extending the Bradley-Terry Model to Accommodate Ties in Paired Comparison Experiments.” Journal of the American Statistical Association 65 (329): 317–28.
Elo, Arpad E. 1978. The Rating of Chessplayers, Past and Present. New York: Arco Publishing.
Gao, Leo, John Schulman, and Jacob Hilton. 2022. “Scaling Laws for Reward Model Overoptimization.” arXiv Preprint arXiv:2210.10760.
Geisser, Seymour. 1975. “The Predictive Sample Reuse Method with Applications.” Journal of the American Statistical Association 70 (350): 320–28.
Geman, Stuart, Elie Bienenstock, and Rene Doursat. 1992. “Neural Networks and the Bias/Variance Dilemma.” Neural Computation 4 (1): 1–58.
Glickman, Mark E. 1999. “Parameter Estimation in Large Dynamic Paired Comparison Experiments.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 48 (3): 377–94. https://doi.org/10.1111/1467-9876.00159.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. New York: Springer.
Herbrich, Ralf, Tom Minka, and Thore Graepel. 2006. “TrueSkill: A Bayesian Skill Rating System.” In Advances in Neural Information Processing Systems, 19:569–76.
Hoerl, Arthur E., and Robert W. Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12 (1): 55–67.
Hunter, David R. 2004. “MM Algorithms for Generalized Bradley-Terry Models.” The Annals of Statistics 32 (1): 384–406. https://doi.org/10.1214/aos/1079120141.
Kingma, Diederik P., and Jimmy Ba. 2014. “Adam: A Method for Stochastic Optimization.” arXiv Preprint arXiv:1412.6980.
Kohavi, Ron. 1995. “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.” In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), 1137–45.
Koren, Yehuda, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer 42 (8): 30–37.
Liu, Tie-Yan. 2011. Learning to Rank for Information Retrieval. Springer.
Loshchilov, Ilya, and Frank Hutter. 2016. “SGDR: Stochastic Gradient Descent with Warm Restarts.” arXiv Preprint arXiv:1608.03983.
Marden, John I. 1995. Analyzing and Modeling Rank Data. Chapman & Hall.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press.
Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems 35: 27730–44.
Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” https://arxiv.org/abs/2305.18290.
Ruder, Sebastian. 2016. “An Overview of Gradient Descent Optimization Algorithms.” arXiv Preprint arXiv:1609.04747.
Stiennon, Nisan, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2020. “Learning to Summarize from Human Feedback.” Advances in Neural Information Processing Systems 33.
Stone, M. 1974. “Cross-Validatory Choice and Assessment of Statistical Predictions.” Journal of the Royal Statistical Society: Series B (Methodological) 36 (2): 111–33. https://doi.org/10.1111/J.2517-6161.1974.TB00994.X.
Weng, Ruby C., and Chih-Jen Lin. 2011. “A Bayesian Approximation Method for Online Ranking.” Journal of Machine Learning Research 12: 267–300.
Yao, Yuan, Lorenzo Rosasco, and Andrea Caponnetto. 2007. “On Early Stopping in Gradient Descent Learning.” Constructive Approximation 26 (2): 289–315.