Chapter 2: Learning
Lecture 1: Core parameter estimation
Lecture 2: Advanced learning topics
For Bradley-Terry with items \(j \in \{1, \ldots, M\}\):
\[ \hat{V} = \arg\max_{V} \sum_{(j,j') \in \mathcal{D}_{\text{train}}} \log p(Y_{jj'} \mid \sigma(V_j - V_{j'})) \]
Define residuals: \(r_{mk} = y_{mk} - \sigma(V_m - V_k)\) (observed \(-\) predicted)
\[ \frac{\partial \ell}{\partial V_m} = \sum_{(m,k)\in \mathcal{N}^+_m} r_{mk} - \sum_{(k,m)\in \mathcal{N}^-_m} r_{km} \]
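The residual-based gradient above can be sketched as plain gradient ascent on the log-likelihood. This is a minimal illustration (the function names and the toy data are ours, not from the lecture): each comparison `(j, k, y)` records `y = 1` if item `j` beat item `k`, and the residual is added to the winner-side entry and subtracted from the loser-side entry, exactly as in the formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bt_gradient(V, comparisons):
    """Gradient of the Bradley-Terry log-likelihood.

    comparisons: list of (j, k, y) with y = 1 if item j beat item k.
    The residual r = y - sigma(V_j - V_k) contributes +r to item j's
    gradient entry and -r to item k's.
    """
    grad = np.zeros_like(V)
    for j, k, y in comparisons:
        r = y - sigmoid(V[j] - V[k])
        grad[j] += r
        grad[k] -= r
    return grad

# Gradient ascent with an illustrative fixed step size.
V = np.zeros(3)
data = [(0, 1, 1), (0, 1, 1), (1, 2, 1), (0, 2, 1)]
for _ in range(200):
    V += 0.1 * bt_gradient(V, data)
```

With this data item 0 wins every comparison, so the fitted strengths recover the ordering \(V_0 > V_1 > V_2\).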
Prior: \(p(V) = \prod_{j=1}^M \mathcal{N}(V_j \mid 0, 1)\)
Likelihood: \(p(\mathcal{D} \mid V) = \prod_{(j,j')} \sigma(V_j - V_{j'})^{Y_{jj'}} (1 - \sigma(V_j - V_{j'}))^{1 - Y_{jj'}}\)
Posterior: \(p(V \mid \mathcal{D}) \propto p(\mathcal{D} \mid V) \, p(V)\)
The evidence (denominator) is intractable \(\Rightarrow\) sample from the posterior with MCMC
General MH acceptance probability:
\[ \alpha = \min\left\{1, \frac{p(V' \mid \mathcal{D}) \cdot q(V^{(t)} \mid V')}{p(V^{(t)} \mid \mathcal{D}) \cdot q(V' \mid V^{(t)})}\right\} \]
With a symmetric Gaussian proposal \(q(V' \mid V^{(t)}) = \mathcal{N}(V^{(t)}, \tau^2 I)\), we have \(q(V' \mid V^{(t)}) = q(V^{(t)} \mid V')\), so the proposal terms cancel:
\[ \alpha = \min\left\{1, \frac{p(\mathcal{D} \mid V') \cdot p(V')}{p(\mathcal{D} \mid V^{(t)}) \cdot p(V^{(t)})}\right\} \]
\[ \mathbf{r}^* = \arg\max_{\mathbf{r}} \sum_{i=1}^n \log \sigma\bigl(y_i(r(x_A^{(i)}) - r(x_B^{(i)}))\bigr) - \tfrac{1}{2}\mathbf{r}^\top K^{-1}\mathbf{r} \]
Gradient: \(\nabla \log p(\mathbf{r} \mid \mathcal{D}) = \mathbf{g} - K^{-1}\mathbf{r}\)
Hessian: \(\nabla^2 \log p(\mathbf{r} \mid \mathcal{D}) = -W - K^{-1}\)
where \(W = \text{diag}(p_i(1-p_i))\) captures data-dependent precision
After finding \(\mathbf{r}^*\):
\[ p(\mathbf{r} \mid \mathcal{D}) \approx \mathcal{N}\left(\mathbf{r}^*, (K^{-1} + W)^{-1}\right) \]
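The Newton iteration for the mode \(\mathbf{r}^*\) and the resulting Laplace covariance can be sketched as follows. One detail is made explicit here as an assumption: we introduce a \(\pm 1\) design matrix \(A\) mapping the latent utilities \(\mathbf{r}\) to the compared differences, so the precision term becomes \(A^\top D A\) with \(D = \mathrm{diag}(p_i(1-p_i))\); the function names and toy kernel are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_mode(K, pairs, y, n_iter=20):
    """Newton iterations for the mode r* of the GP preference posterior.

    K     : (n, n) kernel matrix over the latent utilities r.
    pairs : list of (a, b) index pairs; comparison i contrasts r[a] - r[b].
    y     : labels in {-1, +1}.
    """
    n = K.shape[0]
    A = np.zeros((len(pairs), n))
    for i, (a, b) in enumerate(pairs):
        A[i, a], A[i, b] = 1.0, -1.0       # design matrix: utility differences
    K_inv = np.linalg.inv(K)
    r = np.zeros(n)
    for _ in range(n_iter):
        z = y * (A @ r)
        p = sigmoid(z)
        g = A.T @ (y * (1.0 - p))          # gradient of the log-likelihood
        D = np.diag(p * (1.0 - p))
        W = A.T @ D @ A                    # data-dependent precision
        H = W + K_inv                      # negative Hessian
        r = r + np.linalg.solve(H, g - K_inv @ r)  # Newton step
    cov = np.linalg.inv(K_inv + W)         # Laplace covariance at the mode
    return r, cov

K = np.eye(3) + 0.5 * np.ones((3, 3))      # toy positive-definite kernel
r_star, cov = laplace_mode(K, [(0, 1), (1, 2)], np.array([1.0, 1.0]))
```

The returned `cov` is the Laplace approximation \((K^{-1} + W)^{-1}\) of the posterior covariance around \(\mathbf{r}^*\).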
SGD gradient of BT log-likelihood for a single comparison \((j, j')\):
\[ \frac{\partial \ell}{\partial V_j} = (y - p), \qquad \frac{\partial \ell}{\partial V_{j'}} = -(y - p) \]
Update with learning rate \(\eta\) (the K-factor):
\[ V_j \leftarrow V_j + \eta(y - p), \qquad V_{j'} \leftarrow V_{j'} - \eta(y - p) \]
Elo = online learning algorithm for Bradley-Terry, interpretable as SGD with fixed step size
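The Elo-as-SGD update is one line per item. A minimal sketch (our function names; logistic units rather than the classic 400-point Elo scale, so \(\eta\) plays the role of the K-factor):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def elo_update(V, j, k, y, eta=0.1):
    """One Elo/SGD step for a single comparison: y = 1 if j beat k."""
    p = sigmoid(V[j] - V[k])     # predicted probability that j wins
    V[j] += eta * (y - p)        # winner side moves up by the surprise
    V[k] -= eta * (y - p)        # loser side moves down symmetrically
    return V

ratings = [0.0, 0.0]
for outcome in [1, 1, 0, 1]:     # sequential stream of results, 0 vs 1
    elo_update(ratings, 0, 1, outcome)
```

Because the two updates are equal and opposite, the total rating mass is conserved, a familiar property of Elo systems.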
| | MLE | Bayesian (MCMC) | Online (Elo) |
|---|---|---|---|
| Output | Point estimate | Posterior distribution | Point estimate |
| Uncertainty | No | Yes | No |
| Data | Batch | Batch | Sequential |
| Compute | Moderate | Expensive | Cheap per update |
| Best for | Large static datasets | Small data, uncertainty | Streaming data |
All three methods estimate the same underlying Bradley-Terry parameters
Add a penalty to the log-likelihood:
\[ \mathcal{L}_{\text{reg}}(V) = \sum_{(j,j')} \log p(Y_{jj'} \mid V_j - V_{j'}) - \frac{\lambda}{2}\|V\|_2^2 \]
\[ \frac{\partial \mathcal{L}_{\text{reg}}}{\partial V_m} = \sum_{(m,k)\in \mathcal{N}^+_m} r_{mk} - \sum_{(k,m)\in \mathcal{N}^-_m} r_{km} - \lambda V_m \]
Connection to Bayesian inference:
L2 regularization \(=\) MAP estimation with Gaussian prior \(\mathcal{N}(0, 1/\lambda)\)
\(\lambda\) is the inverse prior variance: large \(\lambda\) \(\Rightarrow\) tight prior near zero
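MAP estimation is the MLE loop with one extra term, \(-\lambda V_m\), in the gradient. A brief sketch (our function names and toy data), with \(\lambda = 1\) matching the unit-variance Gaussian prior:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_fit(comparisons, M, lam=1.0, eta=0.1, n_iter=500):
    """Gradient ascent on the L2-regularized BT objective.

    lam is the inverse prior variance: the prior is N(0, 1/lam).
    """
    V = np.zeros(M)
    for _ in range(n_iter):
        grad = -lam * V                    # penalty term: -lambda * V_m
        for j, k, y in comparisons:
            r = y - sigmoid(V[j] - V[k])
            grad[j] += r
            grad[k] -= r
        V += eta * grad
    return V

V_map = map_fit([(0, 1, 1), (0, 1, 1), (1, 2, 1), (1, 2, 1)], M=3)
```

Unlike the unregularized MLE, the estimates stay finite even when an item wins (or loses) every comparison, since the prior pulls \(V\) toward zero.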
For preference data: partition comparisons (not items) into folds
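Partitioning comparisons rather than items can be sketched as follows (our function name); every item may then appear on both sides of the split, so held-out folds test generalization to unseen *comparisons*, not unseen items.

```python
import numpy as np

def comparison_folds(comparisons, n_folds=5, seed=0):
    """Split the list of comparisons (not the items) into CV folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(comparisons))
    # Strided slices give folds of near-equal size.
    return [idx[f::n_folds] for f in range(n_folds)]

data = [(j, k, 1) for j in range(5) for k in range(j + 1, 5)]  # 10 comparisons
folds = comparison_folds(data)
```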
Area Under ROC Curve (AUC)
\[ \text{AUC} = \frac{\sum_{i: y_i=1} \sum_{j: y_j=0} \mathbf{1}[s_i > s_j]}{\sum_{i: y_i=1} \sum_{j: y_j=0} 1} \]
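The pairwise definition translates directly into code. A small sketch (our function name), which additionally counts tied scores as half-correct, a common convention the formula above omits:

```python
import numpy as np

def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly by score."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels)
    pos, neg = s[y == 1], s[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum()   # correctly ordered pairs
    ties = (pos[:, None] == neg[None, :]).sum()  # tied pairs count half
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

AUC = 1 means every positive outscores every negative; AUC = 0.5 is chance-level ranking.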
Log-Likelihood
\[ \begin{aligned} \text{LL} = \sum &Y_{jj'}\log\sigma(V_j - V_{j'}) \\ +\, &(1\!-\!Y_{jj'})\log(1\!-\!\sigma(V_j\!-\!V_{j'})) \end{aligned} \]
Measures probability assigned to observed outcomes; higher is better
Calibration Error
Measures how closely predicted win probabilities match empirical win frequencies (e.g. binned average of \(|\hat{p} - \text{observed rate}|\)); lower is better
Different metrics may favor different models — use multiple for comprehensive evaluation
Standard GD: \(V \leftarrow V + \eta \nabla \mathcal{L}(V)\)
Two limitations: a single step size \(\eta\) is shared by all parameters, and it stays fixed even as gradient scales change over training
Modern optimizers address these through adaptive learning rates and momentum
Adam (Adaptive Moment Estimation):
\[ \begin{aligned} m_t &\leftarrow \beta_1 m_{t-1} + (1-\beta_1)g_t \\ v_t &\leftarrow \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\ \hat{m}_t &\leftarrow m_t/(1-\beta_1^t), \quad \hat{v}_t \leftarrow v_t/(1-\beta_2^t) \\ V &\leftarrow V + \alpha\,\hat{m}_t / (\sqrt{\hat{v}_t}+\epsilon) \end{aligned} \]
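The Adam recursion above applied to the Bradley-Terry objective (ascent form, hence the `+` update). This sketch is ours, not from the lecture; a small L2 penalty (`lam`) is included so the unregularized MLE does not drift unboundedly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adam_fit(comparisons, M, alpha=0.05, beta1=0.9, beta2=0.999,
             eps=1e-8, lam=0.1, n_iter=500):
    """Adam ascent on the (lightly regularized) BT log-likelihood."""
    V = np.zeros(M)
    m = np.zeros(M)                              # first-moment estimate
    v = np.zeros(M)                              # second-moment estimate
    for t in range(1, n_iter + 1):
        g = -lam * V
        for j, k, y in comparisons:
            r = y - sigmoid(V[j] - V[k])
            g[j] += r
            g[k] -= r
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)               # bias corrections
        v_hat = v / (1 - beta2**t)
        V += alpha * m_hat / (np.sqrt(v_hat) + eps)
    return V

V = adam_fit([(0, 1, 1), (0, 1, 1), (1, 2, 1), (1, 2, 1)], M=3)
```

The per-parameter denominator \(\sqrt{\hat{v}_t}\) is what makes the effective step size adaptive: items with consistently large gradients take smaller relative steps.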
Instead of fixed \(\eta\), decay over time for fast initial progress + precise final convergence:
Especially useful for long training runs with Adam or SGD
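One common decay choice is a cosine schedule; a minimal sketch (our function name and default rates), which starts at a maximum rate and decays smoothly to a floor:

```python
import math

def cosine_decay(step, total_steps, eta_max=0.1, eta_min=0.001):
    """Cosine learning-rate schedule: eta_max at step 0, eta_min at the end."""
    frac = min(step / total_steps, 1.0)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * frac))
```

Exponential or stepwise decay work similarly; the shared idea is fast initial progress followed by small, precise final steps.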
| Optimizer | When to Use | Learning Rate |
|---|---|---|
| Vanilla GD | Simple problems, pedagogical | 0.01 – 0.1, needs tuning |
| Adam | Default for most problems | 0.001 – 0.1, robust |
| SGD + momentum | Theoretical guarantees needed | 0.01 – 0.1, with schedule |
Apply all three methods to a realistic preference setting:
DPO for language models is Bradley-Terry MLE where the “utility” is:
\[ r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \]
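Given summed token log-probabilities for a preferred and a dispreferred response, the DPO objective is exactly a Bradley-Terry negative log-likelihood over the implicit rewards. A toy sketch (our function names; inputs stand in for log-probs that a real language model would produce):

```python
import numpy as np

def dpo_implicit_reward(logp_theta, logp_ref, beta=0.1):
    """Implicit DPO reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (logp_theta - logp_ref)

def dpo_loss(lp_w_theta, lp_w_ref, lp_l_theta, lp_l_ref, beta=0.1):
    """BT negative log-likelihood that the preferred response (w) beats
    the dispreferred one (l) under the implicit rewards."""
    margin = (dpo_implicit_reward(lp_w_theta, lp_w_ref, beta)
              - dpo_implicit_reward(lp_l_theta, lp_l_ref, beta))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))
```

The loss falls as \(\pi_\theta\) shifts probability toward the preferred response relative to the reference policy, which is the same residual-driven behavior as every other BT fit in this chapter.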
All techniques from this chapter directly apply:
Rafailov et al. (2023)
These connections unify the chapter: all methods estimate the same model, differing only in computation, data access, and uncertainty quantification.
