2  Learning

Intended Learning Outcomes

By the end of this chapter you will be able to:

  • Identify appropriate train/test splitting strategies for preference data and understand their implications for model evaluation.
  • Implement maximum likelihood estimation via gradient descent for the Bradley-Terry model, deriving and computing gradients efficiently.
  • Apply Bayesian inference using Markov Chain Monte Carlo (MCMC) to quantify parameter uncertainty and obtain posterior distributions.
  • Apply Laplace approximation to perform inference for Gaussian Process preference models, understanding why non-Gaussian likelihoods require approximation.
  • Derive online learning algorithms (Elo ratings) as stochastic gradient ascent and explain when online methods are preferred over batch learning.
  • Compare batch maximum likelihood, Bayesian inference, and online learning approaches in terms of computational cost, statistical efficiency, and practical applicability.
  • Evaluate learned preference models using multiple metrics including AUC, log-likelihood, and calibration error.
  • Apply regularization techniques (L2 penalty, early stopping) to prevent overfitting and improve generalization.
  • Conduct k-fold cross-validation for model selection and hyperparameter tuning in preference learning settings.
  • Implement modern optimization methods (Adam, learning rate schedules) and diagnose convergence through loss curves and gradient norms.
  • Analyze real-world preference data from language model alignment, addressing practical challenges such as noise, sparsity, and cold-start problems.

This chapter can be covered in two 50-minute lectures plus a hands-on lab session:

Lecture 1 (Sections 2.1–2.4): Core parameter estimation methods

  • Train/test splitting and evaluation metrics (10 min)
  • Maximum likelihood estimation for Bradley-Terry: derivation, implementation, evaluation (20 min)
  • Bayesian inference with MCMC: posterior distributions, Metropolis-Hastings (15 min)
  • Comparison of MLE vs Bayesian approaches (5 min)

Lecture 2 (Sections 2.5–2.8): Advanced learning topics

  • Online learning with Elo ratings: derivation as SGD, applications (15 min)
  • Regularization and overfitting: L2 penalty, validation curves (15 min)
  • Cross-validation and model selection (10 min)
  • Optimization methods: GD vs Adam, learning rate schedules (10 min)

Lab session: Real-world application and exercises

  • Apply methods to LLM preference data (30 min)
  • Work on selected exercises (20 min)

Designing a good reward signal by hand for a complex AI system is difficult and error-prone. Instead of manually specifying desirable behavior, we can learn a utility signal from preference data. In this chapter, we explore how to infer an underlying utility from various forms of feedback. Throughout, we include mathematical formulations and code examples to illustrate the learning process.

2.1 Chapter Overview

Section 1.6 in Chapter 1 established the foundational models for preference data: the Rasch model for item-wise responses, the Bradley-Terry model for pairwise comparisons, and K-dimensional factor models for richer representations. We saw how these models formalize the relationship between latent parameters and observed choices.

This chapter addresses the central question of learning: given observed preference data, how do we estimate the parameters of these models? We explore three complementary perspectives on parameter learning, each with distinct advantages:

  • Maximum Likelihood Estimation (Section 2.3): Find parameters that maximize the probability of observed data. Fast and scalable, but provides only point estimates.
  • Bayesian Inference (Sections 2.4 and 2.4.2): Treat parameters as random variables with prior distributions. We cover both parametric models (using MCMC) and nonparametric Gaussian Process models (using Laplace approximation). Quantifies uncertainty but requires more computation.
  • Online Learning (Section 2.5): Update estimates incrementally as new data arrives. Essential for dynamic systems with continuously arriving feedback.

Beyond these core methods, we cover essential machine learning techniques for practical applications:

  • Regularization (Section 2.6): Prevent overfitting through L2 penalties and early stopping.
  • Model Selection (Section 2.7): Use cross-validation to tune hyperparameters and compare approaches.
  • Optimization (Section 2.8): Improve convergence with modern methods like Adam and adaptive learning rates.
  • Real-World Application (Section 2.9): Apply all methods to language model preference data, confronting practical challenges.

Chapters 4-6 will build on these learning foundations to address active data collection (Chapter 4), decision-making under learned preferences (Chapter 5), and heterogeneous populations (Chapter 6).

2.2 Learning from Preference Data

Section 1.6 introduced the latent variable models for preference data: the Rasch model for item-wise responses, K-dimensional factor models (including the logistic factor model and the ideal point model), and the Bradley-Terry model for pairwise comparisons. We saw how these models relate user parameters \(U_i\) and item parameters \(V_j\) to observed responses, and how pairwise models can be derived from item-wise models.

This chapter focuses on learning the parameters of these models from observed data. We cover three complementary approaches:

  1. Maximum Likelihood Estimation for Bradley-Terry and Rasch models — finding parameters that maximize the probability of the observed data
  2. Bayesian Inference for uncertainty quantification — treating parameters as random variables with prior distributions
  3. Online Learning with the Elo rating system — updating parameter estimates incrementally as new data arrives

We begin by discussing how to split preference data into training and test sets, then develop estimation procedures for each approach.

2.3 Maximum Likelihood Estimation

Given responses from a particular generating process, there are various standard statistical inference procedures, such as maximum likelihood, maximum marginal likelihood, or Bayesian inference. These procedures aim to infer parameters that generalize to new data, so we first discuss how to split the data into training and test sets.

The simplest train/test split is random: we select a random subset (e.g., 80%) of the response matrix for the training set, and the rest for the test set. We can demonstrate this with Bradley–Terry responses, using 80% of the comparisons for training and 20% for testing:
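
As a minimal sketch of such a split (the simulation settings — 20 items, standard-normal appeals, one comparison per unordered pair — and the names `V_true`, `sigmoid`, `D_train`, `D_test` are our own illustrative choices, not fixed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation: M items with standard-normal true appeals.
M = 20
V_true = rng.normal(size=M)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One Bernoulli outcome per unordered pair (j, k), j < k.
pairs = [(j, k) for j in range(M) for k in range(j + 1, M)]
Y = np.array([rng.binomial(1, sigmoid(V_true[j] - V_true[k])) for j, k in pairs])

# Random 80/20 split over comparisons (not items).
idx = rng.permutation(len(pairs))
n_train = int(0.8 * len(pairs))
D_train = [(pairs[i], Y[i]) for i in idx[:n_train]]
D_test = [(pairs[i], Y[i]) for i in idx[n_train:]]
print(len(D_train), "training comparisons,", len(D_test), "test comparisons")
```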

Given the train and test datasets of Bradley–Terry responses, denoted \(\mathcal{D}_{\text{train}}\) and \(\mathcal{D}_{\text{test}}\), we first demonstrate parameter learning with full-information maximum likelihood estimation for a single user:

\[ \hat{V} = \arg\max_{V} \sum_{(j,j') \in \mathcal{D}_{\text{train}}} \log p(Y_{0,jj'} | \sigma(H_{0,jj'})), \quad H_{0} = V 1_M^\top - 1_M V^\top, \tag{2.1}\] where \(1_M\) is a column vector of ones of length \(M\) and \(H_{0,jj'} = V_j - V_{j'}\) for a fixed user \(0\). The optimization can be carried out with standard optimizers, such as gradient descent.

Let \(\mathcal{N}_m^+ = \{(m,k) \in \mathcal{D}_{\text{train}}\}\) be pairs recorded in the order \((m,k)\). Let \(\mathcal{N}_m^- = \{(k,m) \in \mathcal{D}_{\text{train}}\}\) be pairs recorded in the order \((k,m)\). Define residuals \(r_{mk} = y_{mk} - \sigma(V_m - V_k)\) and \(r_{km} = y_{km} - \sigma(V_k-V_m)\). Then

\[ \frac{\partial \ell}{\partial V_m} = \sum_{(m,k)\in \mathcal{N}^+_m} r_{mk} - \sum_{(k,m)\in \mathcal{N}^-_m} r_{km}. \tag{2.2}\]

This shows the contribution of data about \(m\) only. Each comparison with opponent \(k\) adds a term equal to the observed-minus-predicted win indicator, with a plus sign if \(m\) is listed first and a minus sign if \(m\) is listed second. Intuitively, if \(m\) beats \(k\) more often than the model predicts, the residual is positive and the gradient pushes \(V_m\) up. If \(m\) loses to \(k\) more than predicted, the residual is negative and the gradient pushes \(V_m\) down.
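
A gradient-ascent sketch of this estimator, reusing `sigmoid`, `D_train`, and `M` from the splitting example above (the step size and iteration count are illustrative guesses):

```python
def bt_log_likelihood(V, data):
    """Log-likelihood of Bradley-Terry comparisons; data is [((j, k), y), ...]."""
    ll = 0.0
    for (j, k), y in data:
        p = sigmoid(V[j] - V[k])
        ll += y * np.log(p) + (1 - y) * np.log(1 - p)
    return ll

def bt_gradient(V, data, M):
    """Gradient from Eq. 2.2: each comparison adds +/- (y - p) to the two items."""
    g = np.zeros(M)
    for (j, k), y in data:
        r = y - sigmoid(V[j] - V[k])   # residual
        g[j] += r                      # item listed first: plus sign
        g[k] -= r                      # item listed second: minus sign
    return g

# Plain gradient ascent with a fixed step size.
V_hat = np.zeros(M)
eta = 0.05
for t in range(2000):
    V_hat += eta * bt_gradient(V_hat, D_train, M)

V_hat -= V_hat.mean()   # center: utilities are identified only up to a shift
```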

In this synthetic setting, one way to interpret the result is to compare with the Bayes-optimal AUC, which quantifies an upper bound on achievable test AUC under the assumed data-generating process. For each test pair \((j,k)\) we know the ground-truth win probability \(P_{jk}=\sigma(V_j-V_k)\) from the simulator. The Bayes-optimal score for that pair is exactly this probability, \(s^*_{jk}=P_{jk}\). Any classifier that ranks pairs by \(s^*\) maximizes AUC in expectation because it orders pairs by their true success probabilities. To estimate the corresponding Bayes AUC on our finite test set, we keep the same index set of pairs and repeatedly resample binary labels \(Y_{jk} \sim \mathrm{Bern}(P_{jk})\), then compute \(\mathrm{AUC}(s^*, Y)\) on each resample using a tie-aware definition. The mean of these Monte Carlo AUCs is the Bayes-optimal test AUC, and the empirical quantiles give a sampling range induced purely by label noise. For comparison, we compute the model’s AUC by replacing \(s^*_{jk}\) with the learned scores \(s^{\text{model}}_{jk}=\sigma(\hat V_j-\hat V_k)\). The gap between the two summarizes how far the fitted model is from the oracle ranking implied by the true \(P_{jk}\) on the exact same test pairs.
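
A sketch of this Monte Carlo comparison, using scikit-learn's tie-aware `roc_auc_score` and the simulated quantities from the sketches above (the number of resamples is an arbitrary choice):

```python
from sklearn.metrics import roc_auc_score

# True win probabilities and learned scores on the same test pairs.
test_pairs = [jk for jk, _ in D_test]
p_true = np.array([sigmoid(V_true[j] - V_true[k]) for j, k in test_pairs])
s_model = np.array([sigmoid(V_hat[j] - V_hat[k]) for j, k in test_pairs])

# Monte Carlo estimate of the Bayes-optimal AUC: resample labels from the
# true probabilities and score them with the oracle s* = p_true.
bayes_aucs = []
for _ in range(1000):
    y_sim = rng.binomial(1, p_true)
    if 0 < y_sim.sum() < len(y_sim):        # AUC needs both classes present
        bayes_aucs.append(roc_auc_score(y_sim, p_true))
bayes_aucs = np.array(bayes_aucs)

y_test = np.array([y for _, y in D_test])
print("Bayes-optimal AUC ~", bayes_aucs.mean().round(3),
      "(5-95% range", np.quantile(bayes_aucs, [0.05, 0.95]).round(3), ")")
print("Model AUC on observed labels:", roc_auc_score(y_test, s_model).round(3))
```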

The result shows that the inference has found a good solution compared to the Bayes-optimal estimator! An additional, natural test is to see whether the estimated parameters match the true ones. Since the likelihood depends only on differences of item appeals within each pair, the appeal parameters are identifiable only up to an additive shift. The standard practice is to center the solution (and, when comparing against the ground truth, to standardize its scale as well).

Looking Ahead: Regularization

MLE can overfit when parameters outnumber observations. Section 2.6 introduces L2 regularization, which adds a penalty \(\frac{\lambda}{2}\|V\|_2^2\) to the objective. This is equivalent to MAP estimation with a Gaussian prior—connecting MLE to the Bayesian framework we develop next.

2.4 Bayesian Inference

A Bayesian approach provides a natural alternative to maximum likelihood for parameter estimation. Instead of finding a single point estimate, we place a prior distribution on parameters and update it using the likelihood from observed comparisons. This yields a posterior distribution that captures both central estimates and the uncertainty around them. We first cover parametric models using MCMC, then extend to nonparametric Gaussian Process models using Laplace approximation.

2.4.1 Parametric Models

For the Bradley–Terry model with finite-dimensional item parameters \(V\), we place i.i.d. standard normal priors and update them using the Bernoulli likelihood from observed comparisons. In practice, this posterior cannot be computed in closed form, so we turn to Markov chain Monte Carlo (MCMC), which constructs a sequence of samples that, in the limit, follow the posterior distribution.

The Metropolis–Hastings (MH) algorithm is straightforward to apply: at each step, we propose a new value for one or more coordinates of \(V\) (e.g., from a Gaussian centered at the current state), compute the acceptance ratio as the ratio of posterior densities between the proposed and current states, and accept the proposal with that probability. Repeating this process produces a chain of samples that can be used to approximate posterior means, variances, or other functionals of interest. This approach not only yields point predictions but also quantifies uncertainty about the relative strengths of items under the Bradley–Terry model.

Let the prior be i.i.d. standard normal for each item parameter: \(p(V) = \prod_{j=1}^M \mathcal N(V_j \mid 0, 1)\). Then the posterior distribution is \(p(V \mid \mathcal D) = p(\mathcal D \mid V)p(V)/p(\mathcal D),\) where the likelihood is \[ p(\mathcal D \mid V) = \prod_{(j,j')\in\mathcal I} \sigma(V_j - V_{j'})^{Y_{jj'}} \bigl(1-\sigma(V_j - V_{j'})\bigr)^{1-Y_{jj'}}. \tag{2.3}\]

The denominator is the marginal likelihood (evidence), which is intractable in closed form, so we resort to MCMC (e.g., MH) to sample from the posterior. Suppose we are at current (parameter) state \(V^{(t)}\). We propose a new state \(V'\) from a proposal distribution, such as Gaussian: \(q(V' \mid V^{(t)}) = \mathcal N (V'; V^{(t)}, \tau^2 I),\) where \(\tau^2\) is a step-size variance. This proposal is symmetric, meaning \(q(V' \mid V^{(t)}) = q(V^{(t)} \mid V').\) For any proposal distribution \(q(V' \mid V^{(t)})\), the Metropolis–Hastings acceptance probability is

\[ \alpha = \min \left\{1, \frac{p(V' \mid \mathcal D) \cdot q(V^{(t)} \mid V')}{p(V^{(t)} \mid \mathcal D) \cdot q(V' \mid V^{(t)})}\right\}. \tag{2.4}\]

This says: accept the proposal with probability proportional to how much more plausible it is under the posterior, adjusted by how easy it is to propose back. If we choose a Gaussian proposal, the proposal terms cancel out in the ratio due to symmetry. So the acceptance rule simplifies to \[ \alpha = \min \left\{1, \frac{p(V' \mid \mathcal D)}{p(V^{(t)} \mid \mathcal D)}\right\} = \min \left\{1, \frac{p(\mathcal D \mid V') \cdot p(V')}{p(\mathcal D \mid V^{(t)}) \cdot p(V^{(t)})}\right\}. \tag{2.5}\]
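
A minimal random-walk Metropolis–Hastings sketch for this posterior, reusing `bt_log_likelihood`, `D_train`, and `M` from the earlier sketches (the chain length, step size \(\tau\), and burn-in are illustrative):

```python
def log_posterior(V, data):
    """Unnormalized log-posterior: Bernoulli likelihood plus standard normal prior."""
    return bt_log_likelihood(V, data) - 0.5 * np.sum(V ** 2)

def metropolis_hastings(data, M, n_samples=5000, tau=0.1, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(M)
    logp = log_posterior(V, data)
    samples = []
    for _ in range(n_samples):
        V_prop = V + tau * rng.normal(size=M)          # symmetric Gaussian proposal
        logp_prop = log_posterior(V_prop, data)
        if np.log(rng.uniform()) < logp_prop - logp:   # accept with prob min(1, ratio)
            V, logp = V_prop, logp_prop
        samples.append(V.copy())
    return np.array(samples)

samples = metropolis_hastings(D_train, M)
burn_in = 1000
post_mean = samples[burn_in:].mean(axis=0)   # posterior mean estimate of V
post_std = samples[burn_in:].std(axis=0)     # per-item posterior uncertainty
```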

2.4.2 Gaussian Processes

MCMC works well for finite-dimensional parameter vectors, but what if we want to learn a nonparametric reward function? Gaussian Processes (GPs) extend Bayesian inference to function spaces, placing a prior distribution over reward functions \(r \sim \mathcal{GP}(m, k)\) and combining it with the Bradley-Terry likelihood for preferences. The challenge is that this likelihood is non-Gaussian, breaking the standard GP posterior formulas and requiring approximations like Laplace.

The inference problem. Given pairwise comparisons \(\mathcal{D} = \{(x_A^{(i)}, x_B^{(i)}, y_i)\}_{i=1}^n\) where \(y_i = 1\) if item \(A\) was preferred, we want to compute the posterior: \[ p(r \mid \mathcal{D}) \propto p(\mathcal{D} \mid r) \cdot p(r) \tag{2.6}\]

The likelihood follows the Bradley-Terry model: \[ p(y_i = 1 \mid r) = \sigma\left(r(x_A^{(i)}) - r(x_B^{(i)})\right) \tag{2.7}\]

Because this sigmoid likelihood is not Gaussian, the posterior \(p(r \mid \mathcal{D})\) is no longer a Gaussian Process.

2.4.2.1 Laplace Approximation

The Laplace approximation provides a tractable Gaussian approximation to the true posterior:

  1. Find the posterior mode \(\mathbf{r}^* = \arg\max_{\mathbf{r}} \log p(\mathcal{D} \mid \mathbf{r}) + \log p(\mathbf{r})\)
  2. Approximate with a Gaussian centered at the mode, using the Hessian of the log-posterior as precision

For preference data with a GP prior (zero mean), and with preferences encoded here as \(y_i \in \{-1, +1\}\) (where \(+1\) means item \(A\) was preferred), the mode satisfies: \[ \mathbf{r}^* = \arg\max_{\mathbf{r}} \sum_{i=1}^n \log \sigma\left(y_i (r(x_A^{(i)}) - r(x_B^{(i)}))\right) - \frac{1}{2}\mathbf{r}^\top K^{-1} \mathbf{r} \tag{2.8}\]

where \(K\) is the kernel matrix and \(\mathbf{r} = [r(x_1), \ldots, r(x_m)]^\top\) collects function values at all unique points appearing in comparisons.

Newton’s method for finding the mode. The gradient and Hessian of the log-posterior are: \[ \nabla_{\mathbf{r}} \log p(\mathbf{r} \mid \mathcal{D}) = \mathbf{g} - K^{-1}\mathbf{r} \tag{2.9}\] \[ \nabla^2_{\mathbf{r}} \log p(\mathbf{r} \mid \mathcal{D}) = -W - K^{-1} \tag{2.10}\]

where \(\mathbf{g}\) is the gradient of the log-likelihood with respect to \(\mathbf{r}\) and \(W\) is the negative Hessian of the log-likelihood. Writing \(p_i = \sigma(r(x_A^{(i)}) - r(x_B^{(i)}))\) and letting \(A\) denote the \(n \times m\) matrix whose \(i\)-th row has \(+1\) in the column of \(x_A^{(i)}\) and \(-1\) in the column of \(x_B^{(i)}\), we have \(W = A^\top \operatorname{diag}\bigl(p_1(1-p_1), \ldots, p_n(1-p_n)\bigr) A\); in the space of the latent pairwise differences, this matrix is simply diagonal with entries \(p_i(1-p_i)\).

The approximate posterior. After finding \(\mathbf{r}^*\), the Laplace approximation gives: \[ p(\mathbf{r} \mid \mathcal{D}) \approx \mathcal{N}\left(\mathbf{r}^*, (K^{-1} + W)^{-1}\right) \tag{2.11}\]

This is structurally similar to standard GP regression, but with the data-dependent precision matrix \(W\) arising from the Bradley-Terry likelihood rather than observation noise.
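
A sketch of the Newton iteration and the Laplace covariance on a small simulated problem; the kernel choice, the data-generating function, and all problem sizes here are our own assumptions, and `sigmoid` and `rng` are reused from the earlier sketches:

```python
def rbf_kernel(X, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential kernel matrix for inputs X of shape (m, d)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * sq / lengthscale ** 2)

def laplace_mode(K, A, y, n_newton=20, jitter=1e-6):
    """Newton iterations for the mode of the GP preference posterior (Eq. 2.8)."""
    m = K.shape[0]
    K_inv = np.linalg.inv(K + jitter * np.eye(m))
    r = np.zeros(m)
    for _ in range(n_newton):
        p = sigmoid(A @ r)                          # predicted win probabilities
        g = A.T @ (y - p)                           # gradient of the log-likelihood
        W = A.T @ np.diag(p * (1 - p)) @ A          # negative Hessian of log-likelihood
        r = np.linalg.solve(K_inv + W, W @ r + g)   # Newton step
    cov = np.linalg.inv(K_inv + W)                  # Laplace covariance (Eq. 2.11)
    return r, cov

# Hypothetical setup: m unique 1-D inputs and n comparisons indexing into them.
m, n = 15, 60
X = rng.uniform(-2, 2, size=(m, 1))
r_true = np.sin(2 * X[:, 0])
a, b = rng.integers(0, m, size=(2, n))
keep = a != b
a, b = a[keep], b[keep]
y = rng.binomial(1, sigmoid(r_true[a] - r_true[b]))
A = np.zeros((len(a), m))
A[np.arange(len(a)), a] = 1.0
A[np.arange(len(a)), b] = -1.0

r_map, r_cov = laplace_mode(rbf_kernel(X), A, y)
```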

2.4.2.2 Connection to Fisher Information

The Laplace approximation has a natural connection to the Fisher information framework we developed for linear models. The negative Hessian of the log-likelihood at the mode is the observed Fisher information; expressed in terms of the latent pairwise differences, it is diagonal: \[ I_{\text{obs}} = \text{diag}(p_1(1-p_1), \ldots, p_n(1-p_n)) \tag{2.12}\]

Just as with linear preference models, comparisons with \(p \approx 0.5\) contribute the most information (highest \(p(1-p)\)), while “easy” comparisons where one option clearly dominates contribute little.

2.4.2.3 Computational Considerations

Complexity. Each Newton iteration requires \(O(m^3)\) computation for the matrix solve, where \(m\) is the number of unique points. For \(n\) comparisons involving \(m \leq 2n\) unique points, the total cost is \(O(n^3)\) per iteration.

Scalability. For large datasets, several approximations are available:

  • Inducing points: Approximate the GP using \(k \ll m\) pseudo-inputs, reducing complexity to \(O(k^2 n)\)
  • Variational inference: Optimize a lower bound on the marginal likelihood
  • Conjugate gradient methods: Avoid explicit matrix inversion

Hyperparameter learning. The kernel hyperparameters (length-scale \(\ell\), signal variance \(\sigma_f^2\)) can be learned by maximizing the Laplace-approximated marginal likelihood: \[ \log p(\mathcal{D} \mid \theta) \approx \log p(\mathcal{D} \mid \mathbf{r}^*) + \log p(\mathbf{r}^* \mid \theta) + \frac{1}{2}\log |K^{-1} + W|^{-1} \tag{2.13}\]

In Chapter 4, we will see how the GP posterior uncertainty enables active query selection—choosing which comparisons to ask to learn most efficiently.

2.5 Online Learning

In many applications, comparisons between items arrive sequentially over time rather than being observed all at once. For example, players in online games are continuously matched, or recommendation systems log one user preference at a time. In such settings, it is often computationally infeasible to refit the full Bradley–Terry model by maximum likelihood or MCMC after each new observation. Instead, we want an online update rule that adjusts item strengths incrementally as new outcomes arrive.

This is precisely the motivation behind the Elo rating system, originally introduced for ranking chess players and later widely adopted in competitive games, online platforms, and even information retrieval. The key idea is to maintain a current estimate of each item’s (or player’s) latent strength, and update only the two items involved in a match when a new result comes in.

The Elo rule can be derived as a stochastic gradient ascent method on the Bradley–Terry log-likelihood. Suppose item \(j\) plays against item \(j'\), the outcome is \(y \in \{0,1\}\) (with \(y = 1\) if \(j\) wins), and the model's predicted win probability is \(p = \sigma(V_j - V_{j'})\). Given the current parameters, the log-likelihood gradient with respect to \(V_j\) and \(V_{j'}\) is \[ \frac{\partial \ell}{\partial V_j} = (y - p), \qquad \frac{\partial \ell}{\partial V_{j'}} = -(y - p). \tag{2.14}\]

A stochastic gradient ascent step with learning rate \(\eta\) gives the update: \[ V_j \leftarrow V_j + \eta (y - p), \qquad V_{j'} \leftarrow V_{j'} - \eta (y - p). \tag{2.15}\]

This is exactly the Elo update rule. The learning rate \(\eta\) is often called the K-factor in Elo literature. If \(y=1\) (item \(j\) wins) but the model predicted a low \(p\), then \((y - p)\) is positive and \(V_j\) increases, \(V_{j'}\) decreases — the system learns that \(j\) is stronger than previously believed. If \(y=0\) (item \(j'\) wins), the opposite adjustment happens. The magnitude of the update is larger when the outcome is surprising (large prediction error), and smaller when the outcome is expected (small prediction error). Thus, Elo is an online learning algorithm for the Bradley–Terry model, interpretable as stochastic gradient ascent with a fixed step size.
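
A sketch of the Elo/SGD update applied to the same training comparisons used earlier (the K-factor \(\eta = 0.1\) and the random processing order are illustrative choices); it reuses `sigmoid`, `D_train`, `M`, and `rng` from the sketches above:

```python
def elo_update(V, j, k, y, eta=0.1):
    """One online (Elo) update after item j plays item k; y = 1 if j wins."""
    p = sigmoid(V[j] - V[k])    # predicted probability that j beats k
    V[j] += eta * (y - p)       # Eq. 2.15: winner moves up by the prediction error
    V[k] -= eta * (y - p)       # loser moves down by the same amount
    return V

# Stream the training comparisons one at a time, in random order.
V_elo = np.zeros(M)
for i in rng.permutation(len(D_train)):
    (j, k), y = D_train[i]
    V_elo = elo_update(V_elo, j, k, y)
```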

2.6 Regularization and Overfitting

The maximum likelihood estimator maximizes fit to observed training data but may overfit, especially when the number of parameters is large relative to the amount of data. In preference learning, overfitting manifests as learned item parameters that capture noise in the training comparisons rather than true relative strengths. Regularization techniques penalize model complexity to improve generalization to held-out test data.

2.6.1 L2 Regularization

The most common regularization approach adds an L2 penalty term to the log-likelihood. For the Bradley-Terry model, the regularized objective becomes: \[ \mathcal{L}_{\text{reg}}(V) = \sum_{(j,j') \in \mathcal{D}_{\text{train}}} \log p(Y_{jj'} \mid V_j - V_{j'}) - \frac{\lambda}{2} \|V\|_2^2 \tag{2.16}\] where \(\lambda \geq 0\) is the regularization strength. The penalty \(\frac{\lambda}{2} \|V\|_2^2 = \frac{\lambda}{2} \sum_{j=1}^M V_j^2\) discourages large parameter values. When \(\lambda = 0\), we recover standard MLE. As \(\lambda\) increases, parameters shrink toward zero.

The regularized gradient adds a simple term to the MLE gradient: \[ \frac{\partial \mathcal{L}_{\text{reg}}}{\partial V_m} = \sum_{(m,k)\in \mathcal{N}^+_m} r_{mk} - \sum_{(k,m)\in \mathcal{N}^-_m} r_{km} - \lambda V_m \tag{2.17}\]

Intuitively, regularization creates a bias-variance tradeoff: small \(\lambda\) permits complex fits (low bias, high variance), while large \(\lambda\) enforces simpler models (high bias, low variance). The optimal \(\lambda\) balances these to minimize test error.

Connection to Bayesian Inference

L2 regularization corresponds exactly to maximum a posteriori (MAP) estimation with a Gaussian prior \(p(V_j) = \mathcal{N}(0, 1/\lambda)\). The regularization strength \(\lambda\) is the inverse prior variance: strong regularization (large \(\lambda\)) means a tight prior belief that parameters are near zero.

Sweeping \(\lambda\) and plotting training and test AUC (a validation curve) demonstrates the regularization trade-off: at \(\lambda = 0\) (no regularization), the model overfits to training data, achieving high training AUC but lower test AUC. As \(\lambda\) increases, test performance improves until an optimal point, after which excessive regularization underfits. The gap between training and test AUC narrows with proper regularization, indicating better generalization.
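
A sketch of such a sweep, reusing `bt_gradient`, `sigmoid`, `roc_auc_score`, and the train/test split from the earlier sketches (the grid of \(\lambda\) values is arbitrary):

```python
def fit_bt_regularized(data, M, lam, eta=0.05, n_iters=2000):
    """Gradient ascent on the L2-regularized log-likelihood (Eq. 2.16)."""
    V = np.zeros(M)
    for _ in range(n_iters):
        V += eta * (bt_gradient(V, data, M) - lam * V)   # Eq. 2.17
    return V

def pairwise_auc(V, data):
    """AUC of the learned scores sigma(V_j - V_k) against observed outcomes."""
    y = np.array([y for _, y in data])
    s = np.array([sigmoid(V[j] - V[k]) for (j, k), _ in data])
    return roc_auc_score(y, s)

for lam in [0.0, 0.01, 0.1, 1.0, 10.0]:
    V_lam = fit_bt_regularized(D_train, M, lam)
    print(f"lambda={lam:<5}  train AUC={pairwise_auc(V_lam, D_train):.3f}"
          f"  test AUC={pairwise_auc(V_lam, D_test):.3f}")
```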

2.6.2 Early Stopping

An alternative to explicit regularization is early stopping: monitor validation performance during training and stop when it begins to degrade, even if training performance continues improving. This exploits the empirical observation that gradient descent follows a path from simple to complex models.

Early stopping automatically selects model complexity through the training trajectory. Unlike L2 regularization, it requires no tuning of \(\lambda\), but does require held-out validation data to monitor.
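
A sketch of early stopping against a held-out validation set, reusing `bt_gradient` and `pairwise_auc` from the sketches above (the patience window and split are illustrative choices):

```python
def fit_with_early_stopping(train, val, M, eta=0.05, max_iters=5000, patience=50):
    """Unregularized gradient ascent, stopped when validation AUC stops improving."""
    V = np.zeros(M)
    best_auc, best_V, since_best = -np.inf, V.copy(), 0
    for t in range(max_iters):
        V += eta * bt_gradient(V, train, M)
        auc = pairwise_auc(V, val)
        if auc > best_auc:
            best_auc, best_V, since_best = auc, V.copy(), 0
        else:
            since_best += 1
            if since_best >= patience:   # no improvement for `patience` steps
                break
    return best_V, best_auc

# Hold out 20% of the training comparisons as a validation set.
split = int(0.8 * len(D_train))
V_es, val_auc = fit_with_early_stopping(D_train[:split], D_train[split:], M)
```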

Key Takeaway: When to Regularize

Regularization is most critical when:

  • Few observations relative to parameters (e.g., 50 comparisons, 20 items)
  • Imbalanced data: Some items have many comparisons, others have few
  • High noise: Label noise or measurement error in comparisons

For large datasets with many comparisons per item, overfitting is less of a concern and regularization may have minimal impact. Always use validation data or cross-validation to tune regularization strength.

2.7 Model Selection and Cross-Validation

A single train/test split provides one estimate of generalization performance, but this estimate has high variance: a different random split may yield different results. Cross-validation (CV) systematically evaluates multiple train/test splits to obtain more reliable performance estimates and enable principled model selection.

2.7.1 K-Fold Cross-Validation

In k-fold cross-validation, we partition the data into \(k\) equally-sized folds. For each fold \(i \in \{1, \ldots, k\}\):

  1. Train the model on all folds except fold \(i\)
  2. Evaluate on fold \(i\) (held-out validation)
  3. Record the validation metric (e.g., AUC, log-likelihood)

The final CV score is the average across all \(k\) folds: \(\text{CV}_k = \frac{1}{k} \sum_{i=1}^k \text{metric}_i\). The standard error quantifies uncertainty in the estimate.

For preference data, we partition the set of observed pairwise comparisons (not items) into folds, ensuring each fold contains a representative sample of comparisons.
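
A sketch of k-fold CV over comparisons, reusing `fit_bt_regularized` and `pairwise_auc` from the regularization sketch (the fold count, \(\lambda\), and seed are illustrative):

```python
def k_fold_cv(data, M, k=5, lam=0.1, seed=0):
    """k-fold CV over comparisons: train on k-1 folds, score AUC on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(data)), k)
    scores = []
    for i in range(k):
        val = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        V = fit_bt_regularized(train, M, lam)
        scores.append(pairwise_auc(V, val))
    return np.array(scores)

cv_scores = k_fold_cv(D_train, M, k=5, lam=0.1)
print(f"CV AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```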

The resulting CV estimate (mean ± standard deviation across folds) is more reliable than a single train/test split; the standard deviation quantifies variability across folds.

2.7.2 Hyperparameter Tuning with Cross-Validation

Cross-validation enables principled hyperparameter selection: evaluate each candidate hyperparameter on CV performance and select the best. For Bradley-Terry, key hyperparameters include regularization strength \(\lambda\) and learning rate.

Plotting mean CV performance with error bars (the standard deviation across folds) shows the uncertainty in the CV estimate for each hyperparameter value. Select the hyperparameter with the highest mean CV performance.

2.7.3 Beyond AUC: Multiple Evaluation Metrics

AUC measures ranking quality but does not capture all aspects of model performance. Additional metrics provide complementary insights:

Log-Likelihood: Measures the probability the model assigns to observed outcomes. Higher is better. \[ \text{LL} = \sum_{(j,j') \in \mathcal{D}_{\text{test}}} \left[ Y_{jj'} \log \sigma(V_j - V_{j'}) + (1 - Y_{jj'}) \log (1 - \sigma(V_j - V_{j'})) \right] \tag{2.18}\]

Calibration Error: Measures whether predicted probabilities match empirical frequencies. Bin predictions into intervals (e.g., [0.0, 0.1), [0.1, 0.2), …) and compare average predicted probability to observed frequency in each bin.
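
A sketch computing both metrics for a fitted parameter vector, reusing `sigmoid`, `V_hat`, and the test split from the earlier sketches (the ten-bin choice mirrors the binning described above):

```python
def test_metrics(V, data, n_bins=10):
    """Test log-likelihood (Eq. 2.18) and expected calibration error for a fitted model."""
    y = np.array([y for _, y in data])
    p = np.array([sigmoid(V[j] - V[k]) for (j, k), _ in data])
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Bin predictions and compare mean predicted probability to empirical win rate.
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return log_lik, ece

ll, ece = test_metrics(V_hat, D_test)
print(f"test log-likelihood: {ll:.1f}   expected calibration error: {ece:.3f}")
```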

Different metrics may favor different models. AUC focuses on ranking, log-likelihood on probability estimates, and calibration on frequency matching. Use multiple metrics to gain a comprehensive view of model quality.

Key Takeaway: Cross-Validation Best Practices
  • Use CV for hyperparameter tuning, not test set evaluation (avoid overfitting to the test set)
  • Report mean ± std across folds to quantify uncertainty
  • Stratified splitting: For imbalanced data, ensure each fold has representative class proportions
  • Temporal splitting: For sequential data (e.g., chess games over time), use time-based splits instead of random CV to avoid leaking future information into past predictions
  • Nested CV: For unbiased performance estimates, use outer CV loop for evaluation and inner CV loop for hyperparameter selection

2.8 Optimization Methods

Gradient descent is the foundation for learning Bradley-Terry parameters, but modern optimization methods can accelerate convergence and improve final performance. We compare three widely-used optimizers and discuss when each is appropriate.

2.8.1 Beyond Vanilla Gradient Descent

Standard gradient descent (here applied as ascent on the log-likelihood) updates parameters with a fixed learning rate: \(V \leftarrow V + \eta \nabla \mathcal{L}(V)\). This has two limitations:

  1. Fixed step size: Large \(\eta\) causes instability; small \(\eta\) slows convergence
  2. No momentum: Each step ignores the optimization history

Modern optimizers address these issues through adaptive learning rates and momentum.

2.8.2 Adam Optimizer

Adam (Adaptive Moment Estimation) maintains running averages of both gradients and squared gradients, adapting the learning rate per parameter. The update rule is:

\[ \begin{aligned} m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad &\text{(momentum)} \\ v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad &\text{(variance)} \\ \hat{m}_t &\leftarrow m_t / (1 - \beta_1^t), \quad \hat{v}_t \leftarrow v_t / (1 - \beta_2^t) \quad &\text{(bias correction)} \\ V &\leftarrow V + \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \quad &\text{(parameter update)} \end{aligned} \tag{2.19}\]

where \(g_t\) is the gradient at step \(t\), \(\beta_1 = 0.9\) and \(\beta_2 = 0.999\) are typical, and \(\epsilon = 10^{-8}\) prevents division by zero.

Adam typically converges faster than vanilla gradient descent and is less sensitive to learning rate tuning. The adaptive per-parameter learning rates help in problems with varying gradient magnitudes across parameters.
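
A sketch of Adam applied to the Bradley-Terry log-likelihood, reusing `bt_gradient` and the training split from the earlier sketches (the step size \(\alpha\) and iteration count are illustrative):

```python
def fit_bt_adam(data, M, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8, n_iters=1000):
    """Adam (Eq. 2.19) applied to the Bradley-Terry log-likelihood (ascent direction)."""
    V = np.zeros(M)
    m_t = np.zeros(M)   # first-moment (momentum) estimate
    v_t = np.zeros(M)   # second-moment (variance) estimate
    for t in range(1, n_iters + 1):
        g = bt_gradient(V, data, M)
        m_t = beta1 * m_t + (1 - beta1) * g
        v_t = beta2 * v_t + (1 - beta2) * g ** 2
        m_hat = m_t / (1 - beta1 ** t)          # bias correction
        v_hat = v_t / (1 - beta2 ** t)
        V += alpha * m_hat / (np.sqrt(v_hat) + eps)
    return V

V_adam = fit_bt_adam(D_train, M)
```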

2.8.3 Learning Rate Schedules

Instead of a fixed learning rate, schedules decay \(\eta\) over time to enable large initial steps (fast progress) followed by small refinements (precision):

  • Step decay: \(\eta_t = \eta_0 \cdot \gamma^{\lfloor t / k \rfloor}\) (reduce by factor \(\gamma\) every \(k\) epochs)
  • Exponential decay: \(\eta_t = \eta_0 e^{-\lambda t}\)
  • Cosine annealing: \(\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})(1 + \cos(\pi t / T))\)

Learning rate schedules are especially useful for training to convergence without extensive hyperparameter tuning.
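
A sketch of these schedules as a single helper, with the decay constants chosen arbitrarily for illustration; it reuses `bt_gradient` and the training data from the earlier sketches:

```python
def lr_schedule(t, eta0=0.1, kind="cosine", T=1000, gamma=0.5, k=200,
                decay=0.003, eta_min=1e-4):
    """The three schedules above; t is the current iteration."""
    if kind == "step":
        return eta0 * gamma ** (t // k)
    if kind == "exponential":
        return eta0 * np.exp(-decay * t)
    if kind == "cosine":
        return eta_min + 0.5 * (eta0 - eta_min) * (1 + np.cos(np.pi * t / T))
    raise ValueError(kind)

# Gradient ascent with a decaying step size.
V_sched = np.zeros(M)
for t in range(1000):
    V_sched += lr_schedule(t) * bt_gradient(V_sched, D_train, M)
```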

Key Takeaway: Choosing an Optimizer

When to use each optimizer:

  • Vanilla GD: Simple problems, well-tuned learning rate available, or for pedagogical clarity
  • Adam: Default choice for most problems; robust to learning rate, fast convergence, handles sparse gradients well
  • SGD with momentum: When you need the theoretical guarantees of SGD (e.g., convergence proofs) but want faster practical convergence

Hyperparameter guidelines:

  • Adam: \(\alpha = 0.001\) to \(0.1\), default \(\beta_1 = 0.9\), \(\beta_2 = 0.999\)
  • GD: \(\eta = 0.01\) to \(0.1\), may need careful tuning
  • Learning rate schedules: Helpful when training for many epochs

2.9 Real-World Application: LLM Preference Learning

We now apply all three parameter learning methods to a realistic preference learning scenario inspired by language model alignment. While production LLM alignment uses datasets like Stanford Human Preferences (SHP) or Anthropic’s HH-RLHF, we construct a synthetic dataset that captures key properties of real preference data: noisy labels, varying item quality, and response embeddings.

2.9.1 Dataset: Simulated LLM Response Preferences

We simulate a setting where a language model generates multiple responses to prompts, and human annotators provide pairwise preferences. Each response is represented by a learned embedding (e.g., from a pretrained model), and the true utility is a linear function of this embedding plus noise.

This synthetic dataset mimics real LLM preference data: responses have embedding representations, utilities are learned functions of embeddings, and annotations contain noise.
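
A sketch of such a simulator; the embedding dimension, the number of responses and comparisons, and the 10% flip rate are illustrative assumptions rather than properties of any real dataset (`sigmoid` is reused from the earlier sketches):

```python
# Hypothetical simulation of LLM response preferences: each "response" gets a
# d-dimensional embedding, true utility is linear in the embedding, and a
# fraction of pairwise labels is flipped to mimic annotator noise.
d, n_responses, n_comparisons, noise_rate = 16, 100, 2000, 0.10
rng_llm = np.random.default_rng(1)

embeddings = rng_llm.normal(size=(n_responses, d))   # e.g., from a pretrained encoder
w_true = rng_llm.normal(size=d)
utility = embeddings @ w_true                         # true (latent) response quality

llm_data = []
for _ in range(n_comparisons):
    j, k = rng_llm.choice(n_responses, size=2, replace=False)
    y = rng_llm.binomial(1, sigmoid(utility[j] - utility[k]))
    if rng_llm.uniform() < noise_rate:               # annotator label noise
        y = 1 - y
    llm_data.append(((j, k), y))
```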

2.9.2 Applying All Three Learning Methods

We apply MLE, Bayesian inference, and online learning (Elo) to the same data and compare results.

2.9.3 Key Observations

The three methods yield similar performance on this synthetic LLM preference task:

  1. MLE is fastest and most scalable, suitable for large-scale applications
  2. Bayesian inference via MCMC provides posterior distributions over utilities, enabling uncertainty quantification—the marginal plot shows how posterior samples vary, and error bars show 95% credible intervals
  3. Online (Elo) learns incrementally, ideal for streaming data but may be less data-efficient than batch methods

All methods successfully recover utilities correlated with the ground truth despite 10% label noise, demonstrating robustness. In production LLM alignment:

  • Data scale: Real datasets have 10K-100K+ comparisons
  • Cold start: New responses have no comparison history; requires initialization strategies
  • Computational cost: MCMC becomes expensive; MLE or online methods preferred
  • Temporal drift: User preferences may evolve; online methods naturally adapt

Connecting to DPO

Recall from Section 1.2 that Direct Preference Optimization (DPO) for language models is equivalent to Bradley-Terry MLE where the “utility” is \(\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\). The MLE techniques developed in this chapter directly apply to DPO training: regularization prevents overfitting to preference data, and optimization methods (Adam, learning rate schedules) accelerate convergence.

2.10 Discussion Questions

  • In the synthetic experiments, why is the Bayes optimal AUC less than 1.0 even though we know the true parameters? What does this reveal about the fundamental limits of prediction from noisy preference data?

  • How does L2 regularization affect the bias-variance tradeoff in Bradley-Terry estimation? Under what data conditions (sample size, noise level, number of items) would you expect regularization to have the largest impact on test performance?

  • When would you prefer online learning (Elo) over batch learning (MLE) for preference model estimation? Consider computational cost, data arrival patterns, and statistical efficiency. Can you design a hybrid approach that combines benefits of both?

  • The Bayesian approach via MCMC provides posterior distributions over parameters, while MLE gives point estimates. Beyond uncertainty quantification, what practical advantages might the full posterior provide? How would you use posterior samples to make decisions (e.g., which items to present to users)?

  • Cross-validation assumes that folds are exchangeable—that the data distribution is the same across all folds. For sequential preference data (e.g., chess games ordered in time, or LLM preferences from a changing user population), this assumption may be violated. How should cross-validation be adapted for temporal data? What are the risks of using standard k-fold CV on sequential data?

  • In the LLM preference learning example, all three methods (MLE, Bayesian, Elo) achieved similar test AUC despite different training procedures. Under what circumstances would you expect the methods to diverge in performance? Consider data scale, noise level, model mis-specification, and computational budget.

2.11 Bibliographic Notes

Maximum likelihood estimation for preference models dates back to Bradley and Terry (1952)’s original work on paired comparisons. The connection to modern machine learning was established through applications in ranking (Herbrich, Minka, and Graepel 2006) and recommender systems (Koren, Bell, and Volinsky 2009). Hunter (2004) provided efficient MM algorithms for Bradley-Terry MLE that scale to large problems.

Bayesian inference for preference data has a long history in psychometrics and experimental design. Davidson (1970) developed Bayesian approaches for paired comparisons. Modern MCMC methods for Bradley-Terry models are surveyed in Caron and Doucet (2012). The connection between L2 regularization and Gaussian priors (MAP estimation) is standard in Bayesian machine learning (Murphy 2012).

The Elo rating system was introduced by Elo (1978) for chess rankings and has since been applied broadly to competitive games (Glickman 1999 introduced the Glicko system with uncertainty estimates), online platforms (Herbrich, Minka, and Graepel 2006 for Xbox matchmaking), and information retrieval. The interpretation of Elo as stochastic gradient descent on the Bradley-Terry log-likelihood clarifies its connection to machine learning (Weng and Lin 2011).

Regularization in statistical learning has roots in ridge regression (Hoerl and Kennard 1970) and has become central to modern machine learning (Hastie, Tibshirani, and Friedman 2009). Early stopping as implicit regularization was analyzed by Yao, Rosasco, and Caponnetto (2007). The bias-variance tradeoff provides the theoretical foundation for why regularization improves generalization (Geman, Bienenstock, and Doursat 1992).

Cross-validation was formalized by Stone (1974) and Geisser (1975) for model selection. Kohavi (1995) provided practical guidance for k-fold CV. Nested cross-validation to avoid selection bias was emphasized by Cawley and Talbot (2010). Temporal validation strategies for time-series data are discussed in Bergmeir, Hyndman, and Koo (2018).

Optimization methods for machine learning are surveyed in Ruder (2016). Adam was introduced by Kingma and Ba (2014) and has become the default optimizer for deep learning. Convergence analysis for gradient descent on convex problems (including logistic regression, which includes Bradley-Terry) is classical (Boyd and Vandenberghe 2004). Learning rate schedules and their impact on generalization are discussed in Loshchilov and Hutter (2016).

LLM alignment via preference learning gained prominence with Christiano et al. (2017)’s introduction of RLHF. The connection to Bradley-Terry and reward modeling was made explicit in Ouyang et al. (2022) (InstructGPT). Direct Preference Optimization (DPO) (Rafailov et al. 2023) showed that RLHF can be reformulated as Bradley-Terry MLE, eliminating the need for a separate reward model. Recent work explores calibration (Stiennon et al. 2020), robustness to noise (Bai et al. 2022), and scaling laws (Gao, Schulman, and Hilton 2022) for preference-based LLM training.

For further reading: Agresti (2002) provides comprehensive coverage of categorical data analysis including paired comparisons. Marden (1995) covers ranking models from a statistical perspective. Liu (2011) surveys learning-to-rank methods in information retrieval, many of which build on preference models.

2.12 Exercises

Exercises are marked with difficulty levels: (*) for introductory, (**) for intermediate, and (***) for challenging.

2.12.1 Gradient Derivation for Plackett-Luce (*)

The Plackett-Luce model extends Bradley-Terry to full rankings. Given a ranking \((j_1 \succ j_2 \succ \cdots \succ j_M)\), the likelihood is: \[ p(\text{ranking} \mid V) = \prod_{k=1}^{M-1} \frac{\exp(V_{j_k})}{\sum_{\ell=k}^M \exp(V_{j_\ell})} \tag{2.20}\]

  1. Write the log-likelihood for a single ranking as a function of utilities \(V\).

  2. Derive the gradient \(\frac{\partial \log p}{\partial V_m}\) for an arbitrary item \(m\). Hint: Consider three cases: \(m\) appears at position \(k\), \(m\) appears after position \(k\), and \(m\) does not appear in the ranking.

  3. Implement gradient ascent for Plackett-Luce on simulated ranking data. Compare convergence to Bradley-Terry MLE—which converges faster and why?

2.12.2 L2 Regularized Bradley-Terry (**)

  1. Implement Bradley-Terry MLE with L2 regularization for a range of \(\lambda\) values. Generate synthetic data with \(M = 20\) items and vary the number of comparisons from 50 to 500. Plot test AUC vs \(\lambda\) for each dataset size.

  2. Explain why the optimal \(\lambda\) decreases as the number of comparisons increases. At what rate does it decay?

  3. Derive the Hessian matrix of the regularized log-likelihood. Under what conditions is it positive definite (ensuring a unique global maximum)?

2.12.3 K-Fold Cross-Validation Implementation (**)

  1. Implement 5-fold cross-validation for Bradley-Terry estimation from scratch (without using sklearn or similar libraries). Your implementation should:
  • Randomly partition comparison data into 5 folds
  • Train on 4 folds, validate on 1 fold
  • Repeat for all 5 folds
  • Return mean and standard deviation of validation AUC
  2. Compare your CV implementation to a single 80/20 train/test split over 20 random seeds. Which provides a more stable performance estimate?

  3. Extend your implementation to stratified CV: ensure each fold has approximately equal proportions of wins for each item. Why might stratification improve CV estimates for imbalanced data?

2.12.4 Learning Rate Sensitivity Analysis (*)

  1. For Bradley-Terry MLE on synthetic data (\(M = 15\) items, 100 comparisons), experiment with learning rates \(\eta \in \{0.001, 0.01, 0.05, 0.1, 0.5, 1.0\}\). Plot the training loss curve for each learning rate.

  2. Identify the learning rate that achieves the lowest final loss. What happens with learning rates that are too large? Too small?

  3. Implement a learning rate schedule (e.g., exponential decay \(\eta_t = \eta_0 \cdot 0.95^t\)) and compare to fixed learning rate. Does the schedule improve final performance?

2.12.5 MCMC Diagnostics (**)

The Metropolis-Hastings implementation in Section 2.4 produces a chain of samples. Assess convergence quality:

  1. Implement the effective sample size (ESS) diagnostic: ESS estimates how many independent samples the chain is equivalent to, accounting for autocorrelation. For a chain \(\{V^{(t)}\}_{t=1}^T\), the ESS for parameter \(V_j\) is approximately: \[ \text{ESS}_j = \frac{T}{1 + 2\sum_{k=1}^K \rho_k} \tag{2.21}\] where \(\rho_k\) is the autocorrelation at lag \(k\).

  2. Run the MH algorithm from Section 2.4 with different proposal step sizes \(\tau \in \{0.01, 0.05, 0.1, 0.5\}\). Plot ESS vs \(\tau\). What happens when \(\tau\) is too small? Too large?

  3. Implement trace plots for multiple chains (run MH 3 times with different initializations). Do all chains converge to the same stationary distribution?

2.12.6 Optimization Method Comparison (**)

  1. Implement gradient descent with momentum: \(m_t \leftarrow \beta m_{t-1} + \nabla \mathcal{L}(V_t)\), \(V_{t+1} \leftarrow V_t + \eta m_t\) where \(\beta \in [0, 1)\) controls momentum.

  2. On the same synthetic Bradley-Terry problem, compare four optimizers: vanilla GD, GD with momentum (\(\beta = 0.9\)), Adam, and RMSprop. Plot convergence curves (loss vs iteration).

  3. For each optimizer, find the best learning rate via grid search. Which optimizer is most sensitive to learning rate tuning?

2.12.7 Convergence Proof (***)

Prove that gradient ascent on the Bradley-Terry log-likelihood converges to the global maximum (assuming the maximum exists).

  1. Show that the Bradley-Terry log-likelihood is strictly concave when the comparison graph is strongly connected (every pair of items is connected by a directed path of comparisons).

  2. Prove that gradient ascent with sufficiently small step size \(\eta\) converges to the unique global maximum. Hint: Use the fact that the gradient vanishes only at the maximum for strictly concave functions.

  3. What happens when the comparison graph is not strongly connected? Construct an example with multiple disconnected components and characterize the set of maximum likelihood estimators.

2.12.8 Calibration Metrics (*)

  1. Implement a calibration plot: bin predicted probabilities into 10 intervals \([0, 0.1), [0.1, 0.2), \ldots, [0.9, 1.0]\), and for each bin, compute the average predicted probability and the empirical frequency of wins.

  2. Generate synthetic Bradley-Terry data and fit two models: one correctly specified (Bradley-Terry) and one mis-specified (assume all items have equal utility). Plot calibration curves for both. Which is better calibrated?

  3. Define the Expected Calibration Error (ECE): \(\text{ECE} = \sum_{b=1}^B \frac{|B_b|}{N} |\bar{p}_b - \bar{y}_b|\) where \(\bar{p}_b\) is the average predicted probability in bin \(b\), \(\bar{y}_b\) is the empirical frequency, and \(|B_b|\) is the number of predictions in bin \(b\). Compute ECE for your models.

2.12.9 Online to Batch Convergence (*)

  1. Implement Elo with a decreasing K-factor: \(K_t = K_0 / \sqrt{t}\) where \(t\) is the iteration number. This is known as the Robbins-Monro condition for stochastic approximation.

  2. On synthetic Bradley-Terry data, run Elo with decreasing \(K_t\) and compare the final learned parameters to batch MLE. How close do they converge?

  3. Prove (or argue informally) that Elo with \(K_t = \eta / \sqrt{t}\) converges to the MLE in the limit as \(t \to \infty\). Under what conditions does this hold?

2.12.10 Hyperparameter Tuning with Nested Cross-Validation (***)

To obtain an unbiased estimate of generalization performance when hyperparameters are selected via CV, we use nested (double) CV:

  • Outer loop: K-fold CV to estimate generalization
  • Inner loop: For each outer fold, use K-fold CV on the training data to select hyperparameters
  1. Implement nested 5-fold CV for Bradley-Terry with L2 regularization. The inner loop tunes \(\lambda\), the outer loop estimates test AUC.

  2. Compare the nested CV test AUC estimate to: (i) a single train/test split with CV on training data for \(\lambda\) selection, and (ii) CV test AUC when \(\lambda\) is selected using the outer CV test set (data leakage). Which is more optimistic? Pessimistic?

  3. Why is nested CV important? Give an example where selecting hyperparameters on the same data used for final evaluation leads to overly optimistic performance estimates.

2.12.11 Real-World Dataset Application (***)

Apply the methods from this chapter to a real preference dataset:

  1. Obtain a dataset such as the Jester jokes dataset (user ratings), a subset of MovieLens, or chess game outcomes from Lichess. Describe the data: number of items, number of comparisons, sparsity.

  2. Apply all three methods (MLE with regularization, Bayesian inference, online Elo) and compare their performance using multiple metrics (AUC, log-likelihood, calibration error). Use cross-validation to tune hyperparameters.

  3. Visualize the learned item parameters. Do they agree with intuition (e.g., highly-rated movies have high utility)? Identify the top-5 and bottom-5 items according to each method—do the methods agree?

  4. Discuss practical challenges encountered: computational cost, cold-start items, missing data, temporal effects. How would you address these in a production system?

2.12.12 Bias-Variance Tradeoff Demonstration (**)

  1. Generate synthetic Bradley-Terry data with \(M = 10\) items and vary the training set size \(n \in \{20, 50, 100, 200, 500\}\) comparisons. For each \(n\) and several regularization strengths \(\lambda\), fit models and compute:
  • Bias: Average difference between learned and true parameters
  • Variance: Standard deviation of learned parameters across 50 random datasets
  2. Plot bias and variance vs \(\lambda\) for each training set size. Verify that as \(\lambda\) increases, bias increases and variance decreases.

  3. Plot test AUC vs \(\lambda\) and identify the optimal \(\lambda\) for each \(n\). How does the optimal regularization strength change with dataset size?

References

Agresti, Alan. 2002. Categorical Data Analysis. 2nd ed. New York: John Wiley & Sons.
Bai, Yuntao, et al. 2022. “Constitutional AI: Harmlessness from AI Feedback.” https://arxiv.org/abs/2212.08073.
Bergmeir, Christoph, Rob J. Hyndman, and Bonsoo Koo. 2018. “A Note on the Validity of Cross-Validation for Evaluating Autoregressive Time Series Prediction.” Computational Statistics & Data Analysis 120: 70–83.
Boyd, Stephen, and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press.
Bradley, Ralph Allan, and Milton E Terry. 1952. “Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons.” Biometrika 39 (3/4): 324–45.
Caron, François, and Arnaud Doucet. 2012. “Efficient Bayesian Inference for Generalized Bradley-Terry Models.” Journal of Computational and Graphical Statistics 21 (1): 174–96.
Cawley, Gavin C., and Nicola L. C. Talbot. 2010. “On over-Fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation.” Journal of Machine Learning Research 11: 2079–2107.
Christiano, Paul F, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. “Deep Reinforcement Learning from Human Preferences.” Advances in Neural Information Processing Systems 30.
Davidson, Roger R. 1970. “On Extending the Bradley-Terry Model to Accommodate Ties in Paired Comparison Experiments.” Journal of the American Statistical Association 65 (329): 317–28.
Elo, Arpad E. 1978. The Rating of Chessplayers, Past and Present. New York: Arco Publishing.
Gao, Leo, John Schulman, and Jacob Hilton. 2022. “Scaling Laws for Reward Model Overoptimization.” arXiv Preprint arXiv:2210.10760.
Geisser, Seymour. 1975. “The Predictive Sample Reuse Method with Applications.” Journal of the American Statistical Association 70 (350): 320–28.
Geman, Stuart, Elie Bienenstock, and Rene Doursat. 1992. “Neural Networks and the Bias/Variance Dilemma.” Neural Computation 4 (1): 1–58.
Glickman, Mark E. 1999. “Parameter Estimation in Large Dynamic Paired Comparison Experiments.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 48 (3): 377–94. https://doi.org/10.1111/1467-9876.00159.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. New York: Springer.
Herbrich, Ralf, Tom Minka, and Thore Graepel. 2006. “TrueSkill: A Bayesian Skill Rating System.” In Advances in Neural Information Processing Systems, 19:569–76.
Hoerl, Arthur E., and Robert W. Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12 (1): 55–67.
Hunter, David R. 2004. “MM Algorithms for Generalized Bradley-Terry Models.” The Annals of Statistics 32 (1): 384–406. https://doi.org/10.1214/aos/1079120141.
Kingma, Diederik P., and Jimmy Ba. 2014. “Adam: A Method for Stochastic Optimization.” arXiv Preprint arXiv:1412.6980.
Kohavi, Ron. 1995. “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.” In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), 1137–45.
Koren, Yehuda, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer 42 (8): 30–37.
Liu, Tie-Yan. 2011. Learning to Rank for Information Retrieval. Springer.
Loshchilov, Ilya, and Frank Hutter. 2016. “SGDR: Stochastic Gradient Descent with Warm Restarts.” arXiv Preprint arXiv:1608.03983.
Marden, John I. 1995. Analyzing and Modeling Rank Data. Chapman & Hall.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press.
Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems 35: 27730–44.
Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” https://arxiv.org/abs/2305.18290.
Ruder, Sebastian. 2016. “An Overview of Gradient Descent Optimization Algorithms.” arXiv Preprint arXiv:1609.04747.
Stiennon, Nisan, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2020. “Learning to Summarize from Human Feedback.” Advances in Neural Information Processing Systems 33.
Stone, M. 1974. “Cross-Validatory Choice and Assessment of Statistical Predictions.” Journal of the Royal Statistical Society: Series B (Methodological) 36 (2): 111–33. https://doi.org/10.1111/J.2517-6161.1974.TB00994.X.
Weng, Ruby C., and Chih-Jen Lin. 2011. “A Bayesian Approximation Method for Online Ranking.” Journal of Machine Learning Research 12: 267–300.
Yao, Yuan, Lorenzo Rosasco, and Andrea Caponnetto. 2007. “On Early Stopping in Gradient Descent Learning.” Constructive Approximation 26 (2): 289–315.