Chapter 1: Foundations
Preference data appears throughout machine learning:
Despite the diversity, all share a common mathematical structure: pairwise comparisons or choices from sets that reveal underlying preferences.


https://openai.com/research/learning-to-summarize-with-human-feedback
Modern LLMs are trained in two phases:
Choice models in RL
Three steps:
| \(M\) | Parameters (\(M!-1\)) |
|---|---|
| 3 | 5 |
| 4 | 23 |
| 10 | 3,628,799 |
Goal: Reduce this complexity \(\Rightarrow\) IIA collapses it to \(M\) parameters
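The table's growth can be reproduced directly (a small illustrative script):

```python
import math

# Free parameters of a full distribution over rankings of M items
# (M! rankings minus 1 for the sum-to-one constraint), versus the
# M utility parameters that suffice under IIA.
for M in [3, 4, 10]:
    full = math.factorial(M) - 1
    print(f"M={M}: full model {full} params, IIA model {M} params")
```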
Three interpretations of randomness:
Full preference lists: \(L = (j_1, j_2, \ldots, j_M)\) where \(j_1 \succ j_2 \succ \cdots \succ j_M\)
Choices from subsets: \((j, \mathcal{S})\) — \(j\) is the best from subset \(\mathcal{S}\)
\[ p(j \mid \mathcal{S}) = \sum_{L : \, j \succ k \; \forall k \in \mathcal{S} \setminus \{j\}} p(L) \]
Binary comparisons: \(\mathcal{S} = \{j, j'\}\), write \(Y_{jj'} = 1\) if \(j \succ j'\) and \(Y_{jj'} = 0\) otherwise
Item-wise responses: \(Y_{ij} \in \{0,1\}\) — user \(i\)’s response to item \(j\)
Examples: e-commerce purchases, streaming play/skip, dating swipes, content moderation
Outside option: Accept/reject framing with item \(0\)
\(N\) users \(\times\) \(M\) items yields an \(N \times M\) response matrix \(Y\) with entries \(Y_{ij} \in \{0, 1\}\):
\[ Y = \begin{bmatrix} Y_{11} & \cdots & Y_{1M} \\ \vdots & \ddots & \vdots \\ Y_{N1} & \cdots & Y_{NM} \end{bmatrix} \]
| Domain | Data Type | Example |
|---|---|---|
| Recommender systems | Choice from set | Sees \(\{A,B,C\}\), clicks \(B\) |
| Information retrieval | Implicit pairwise | Click result 3 \(\Rightarrow\) pref. over 1, 2 |
| LLM alignment | Binary comparison | Annotator: \(A \succ B\) |
| Sports/Chess | Pairwise | Player \(j\) beats \(k\) |
| Streaming | Item-wise | Play (1) or skip (0) |
Goal: learn the underlying utility function from observed choices.
When different users have different preferences, we need both user and item parameters:
\[ p(Y_{ij} = 1) = \sigma(H_{ij}), \quad H_{ij} = f(U_i, V_j) \]
The simplest factor model — \(f\) is additive:
\[ p(Y_{ij} = 1 \mid U_i, V_j) = \sigma(U_i + V_j) \]
For a single user, the Rasch model implies Bradley-Terry for pairwise comparisons:
\[ p(j \succ k \mid i) = \sigma((U_i + V_j) - (U_i + V_k)) = \sigma(V_j - V_k) \]
The user-specific parameter \(U_i\) cancels out!
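The cancellation is easy to verify numerically; the parameter values below are arbitrary illustrations:

```python
import math

def sigma(x):
    """Logistic function sigma(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

V_j, V_k = 1.5, 0.3  # illustrative item parameters

# p(j > k | i) via the Rasch utilities (U_i + V_j) vs (U_i + V_k) is
# the same for every user parameter U_i: the U_i term cancels.
probs = [sigma((U_i + V_j) - (U_i + V_k)) for U_i in (-2.0, 0.0, 5.0)]
assert all(abs(p - sigma(V_j - V_k)) < 1e-12 for p in probs)
```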
| Data Type | What It Reveals | Parameters Identified |
|---|---|---|
| Pairwise \((j \succ k)\) | Item differences only | \(V_j - V_k\) (up to constant) |
| Item-wise \((Y_{ij})\) | User appetites + item appeals | \(U_i\) and \(V_j\) (up to constant) |
This explains why recommender systems use item-wise data (clicks, purchases) while ranking systems (chess, LLM eval) can use pairwise comparisons.
Rasch assumes 1D — users might love action but dislike romance.
Logistic factor model: \(H_{ij} = U_i^\top V_j + Z_j\)
Foundation of Netflix Prize, collaborative filtering, two-tower models.
Ideal point model: users prefer items close to their ideal point:
\[ H_{ij} = -\|U_i - V_j\|_2 + Z_j \]
Natural for: political preferences (voters vs. candidates), music taste, product specs
Jamieson and Nowak (2011); Tatli, Nowak, and Vinayak (2022)
Deterministic utility (Latent variable):
Stochastic utility (Random utility):
Both yield \(p(j \succ k \mid i) = \sigma(V_j - V_k)\) for additive \(f\) — observationally equivalent for pairwise data.
Random utility: \(\tilde{H}_j = V_j + \varepsilon_j\)
Three interpretations of noise:
The key simplifying assumption: when \(\varepsilon_j\) are i.i.d. \(\Rightarrow\) IIA
Binary choice with individual attributes
\[ \begin{cases} U_n = \beta s_n + \epsilon_n \\ y_n = \begin{cases} 1 & U_n \gt 0 \\ 0 & U_n \leq 0 \end{cases} \end{cases} \]
Utility depends on alternative attributes with extreme value noise:
\[ \begin{cases} U_{n1} = \beta z_{n1} + \epsilon_{n1} \\ U_{n2} = \beta z_{n2} + \epsilon_{n2} \\ \epsilon_{n1}, \epsilon_{n2} \sim \text{iid extreme value} \end{cases} \] \[ \Rightarrow \quad P_{n1} = \frac{\exp(\beta z_{n1})}{\exp(\beta z_{n1}) + \exp(\beta z_{n2})} = \frac{1}{1 + \exp(-\beta (z_{n1} - z_{n2}))} \]
With \(J\) alternatives and extreme value noise:
\[ \begin{cases} U_{ni} = \beta z_{ni} + \epsilon_{ni} \\ \epsilon_{ni} \sim \text{iid extreme value} \end{cases} \quad \Rightarrow \quad P_{ni} = \frac{\exp(\beta z_{ni})}{\sum_{j=1}^{J} \exp(\beta z_{nj})} \]
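A minimal sketch of these choice probabilities; the coefficient `beta` and attribute values `z` are made up:

```python
import math

def logit_probs(beta, z):
    """Multinomial logit: P_i = exp(beta * z_i) / sum_j exp(beta * z_j)."""
    m = max(beta * zi for zi in z)            # subtract max for stability
    w = [math.exp(beta * zi - m) for zi in z]
    s = sum(w)
    return [wi / s for wi in w]

probs = logit_probs(beta=0.5, z=[1.0, 2.0, 4.0])
assert abs(sum(probs) - 1.0) < 1e-12  # a valid probability vector
```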
The Plackett-Luce model extends to full rankings as a sequence of choices:
\[ \Pr(\text{ranking } 1, 2, \dots, J) = \prod_{m=1}^{J-1} \frac{\exp(\beta z_{nm})}{\sum_{j=m}^{J} \exp(\beta z_{nj})} \]
DPO vs PPO
Rafailov et al. (2023)
\[ p^*(y_w \succ y_l \mid x) = \frac{\exp(r^*(x, y_w))}{\exp(r^*(x, y_w)) + \exp(r^*(x, y_l))} \]
where \(r^*(x, y)\) is a latent reward function that we do not have access to (it encodes the human preferences)
Luckily, we can parameterize the reward model with a neural network with parameters \(\phi\):
Let us start with the Reward Maximization Objective in RL: \[ \max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(y|x)} [r_\phi(x, y) - \beta D_{KL}(\pi_\theta(y|x) \| \pi_{\text{ref}}(y|x))] \]
Recall the definition of KL divergence: \[ D_{KL}(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_{x \sim p} \left[ \log \frac{p(x)}{q(x)} \right] \]
Substituting the KL divergence, we can rewrite the objective as: \[ \begin{aligned} &\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi_\theta(y|x)} [r_\phi(x, y)] - \beta \mathbb{E}_{y \sim \pi_\theta(y|x)} \left[\log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} \right] \right]\\ &=\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{y \sim \pi_\theta(y|x)} \left[ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} \right] \end{aligned} \]
Then, we can continue to derive the objective as: \[ \begin{aligned} &\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{y \sim \pi_\theta(y|x)} \left[ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} \right] \\ &\propto \min_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{y \sim \pi_\theta(y|x)} \left[ \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} - \frac{1}{\beta} r_\phi(x, y) \right]\\ &= \min_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{y \sim \pi_\theta(y|x)} \left[ \log \frac{\pi_\theta(y|x)}{\frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r_\phi(x, y)\right)} - \log Z(x) \right] \end{aligned} \] where \(Z(x) = \sum_{y} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r_\phi(x, y)\right)\)
Because \(Z(x)\) is a constant with respect to \(\pi_\theta\), we can define: \[ \pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r_\phi(x, y)\right) \]
Then, we can rewrite the optimization problem as: \[ \begin{aligned} &\min_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{y \sim \pi_\theta(y|x)} \left[ \log \frac{\pi_\theta(y|x)}{\pi^*(y|x)} - \log Z(x) \right]\\ &\quad = \min_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{D}_{KL}\!\left(\pi_\theta(y|x) \,\|\, \pi^*(y|x)\right) - \log Z(x) \right] \end{aligned} \]
Thus, the optimal solution (i.e., the optimal language model) is: \[ \pi_\theta(y|x) = \pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r_\phi(x, y)\right) \]
With some algebra, we can express the reward in terms of the optimal policy: \[ \begin{aligned} \pi_\theta(y|x) &= \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r_\phi(x, y)\right)\\ \log \pi_\theta(y|x) &= \log \pi_{\text{ref}}(y|x) + \frac{1}{\beta} r_\phi(x, y) - \log Z(x) \quad \text{(taking logs)}\\ r_\phi(x, y) &= \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x) \end{aligned} \]
Recall the Bradley-Terry model: \[ p_\phi(y_w \succ y_l \mid x) = \frac{\exp(r_\phi(x, y_w))}{\exp(r_\phi(x, y_w)) + \exp(r_\phi(x, y_l))} \]
And the optimal reward model: \[ r_\phi(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x) \]
Substituting, we can rewrite the choice model as: \[ \begin{aligned} p_\phi(y_w \succ y_l \mid x) &= \frac{1}{1 + \exp\left( \beta \log \frac{\pi_\theta(y_l | x)}{\pi_{\text{ref}}(y_l | x)} - \beta \log \frac{\pi_\theta(y_w | x)}{\pi_{\text{ref}}(y_w | x)} \right)}\\ &= \sigma\left( \beta \log \frac{\pi_\theta(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi_\theta(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \right) \end{aligned} \]
The reward model loss maximizes the likelihood of the choice model: \[ \mathcal{L} (r_\phi, \mathcal{D}) = - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log p_\phi(y_w \succ y_l \mid x) \right] \]
Substituting the optimal reward, we obtain the DPO loss:
\[ \mathcal{L}_{DPO}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi_\theta(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \right) \right] \]
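A minimal NumPy sketch of this loss computed from per-sequence log-probabilities; all array values are made up, standing in for \(\log \pi_\theta\) and \(\log \pi_{\text{ref}}\) on chosen (`w`) and rejected (`l`) responses:

```python
import numpy as np

def dpo_loss(logp_theta_w, logp_theta_l, logp_ref_w, logp_ref_l, beta=0.1):
    """-E[log sigma(beta * (winner log-ratio - loser log-ratio))]."""
    margin = beta * ((logp_theta_w - logp_ref_w) - (logp_theta_l - logp_ref_l))
    log_sigmoid = -np.logaddexp(0.0, -margin)  # stable log sigma(margin)
    return -log_sigmoid.mean()

# Toy batch where the policy already prefers the chosen responses
logp_tw = np.array([-10.0, -8.0]); logp_tl = np.array([-12.0, -11.0])
logp_rw = np.array([-11.0, -9.0]); logp_rl = np.array([-11.0, -10.0])
loss = dpo_loss(logp_tw, logp_tl, logp_rw, logp_rl)
assert 0 < loss < np.log(2)  # positive margin, so loss is below log 2
```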
Rafailov et al. (2023)
RLHF Comparison
IIA assumes the relative likelihood of choosing \(j\) vs \(k\) is unchanged by a third alternative \(\ell\):
\[ \frac{p(j \mid \mathcal{S})}{p(k \mid \mathcal{S})} = \frac{p(j \mid \mathcal{S} \cup \{\ell\})}{p(k \mid \mathcal{S} \cup \{\ell\})} \]
Theorem 1: A random utility model \(H_j\) satisfies IIA if and only if \(H_j = V_j + \varepsilon_j\) where \(\varepsilon_j\) are i.i.d. Gumbel distributed.
(⇐) Gumbel \(\Rightarrow\) IIA: \[ \frac{p(j \mid \mathcal{S})}{p(k \mid \mathcal{S})} = \frac{e^{V_j}/\sum_{\ell} e^{V_\ell}}{e^{V_k}/\sum_{\ell} e^{V_\ell}} = e^{V_j - V_k} \] Independent of \(\mathcal{S}\) — IIA holds.
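The (⇐) direction can also be checked by simulation: with i.i.d. Gumbel noise, the argmax utility is chosen with softmax probability (a rough Monte Carlo sketch; the utilities are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V = np.array([0.0, 1.0, 2.0])
n = 200_000

# H_j = V_j + eps_j with eps_j ~ Gumbel(0, 1); choose the argmax
eps = rng.gumbel(size=(n, 3))
choices = np.argmax(V + eps, axis=1)
empirical = np.bincount(choices, minlength=3) / n

softmax = np.exp(V) / np.exp(V).sum()
assert np.allclose(empirical, softmax, atol=0.01)
```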
(⇒) IIA \(\Rightarrow\) Gumbel: IIA forces multiplicative structure; only Gumbel is compatible (Yellott, 1977).
Theorem 2: Under IIA (\(H_j = V_j + \varepsilon_j\), i.i.d. Gumbel), the probabilities are:
Choices from sets (softmax): \[ p(j \mid \mathcal{S}) = \frac{e^{V_j}}{\sum_{k \in \mathcal{S}} e^{V_k}} = \operatorname{softmax}_j ((V_k)_{k \in \mathcal{S}}) \]
Binary comparisons (Bradley-Terry): \[ p(Y_{jj'} = 1) = \sigma(V_j - V_{j'}) = \frac{1}{1 + e^{-(V_j - V_{j'})}} \]
Full rankings (Plackett-Luce): \[ p(j_1 \succ \cdots \succ j_M) = \prod_{m=1}^{M-1} \frac{e^{V_{j_m}}}{\sum_{k=m}^{M} e^{V_{j_k}}} \]
Example: \(V = (0, 1, 2)\)
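The three formulas can be evaluated for this \(V\) (a quick numerical sketch):

```python
import numpy as np

V = np.array([0.0, 1.0, 2.0])

# Softmax choice from the full set
p_set = np.exp(V) / np.exp(V).sum()

# Bradley-Terry for item 3 vs item 1: sigma(V_3 - V_1) = sigma(2)
p_31 = 1.0 / (1.0 + np.exp(-(V[2] - V[0])))

# Plackett-Luce for the ranking 3 > 2 > 1:
# pick 3 from {1, 2, 3}, then 2 from {1, 2}
p_rank = p_set[2] * np.exp(V[1]) / (np.exp(V[0]) + np.exp(V[1]))

print(np.round(p_set, 3), round(float(p_31), 3), round(float(p_rank), 3))
```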
All are special cases of random utility with i.i.d. Gumbel noise under IIA:
| Feedback Type | Model Name |
|---|---|
| Binary comparisons | Bradley-Terry |
| Full rankings | Plackett-Luce |
| Accept/reject | Logistic regression |
| Choices from subsets | Logit model |
| Multi-class | Multinomial logit |
DPO assumes Bradley-Terry: \(p(y \succ y' \mid x) = \sigma(r(x,y) - r(x,y'))\)
Justified by IIA: humans compare implicit rewards with i.i.d. Gumbel noise.
When BT fails for DPO:
Different utility vectors can generate identical choice probabilities:
\[ \frac{e^{V_j + c}}{\sum_{k \in \mathcal{S}} e^{V_k + c}} = \frac{e^c \cdot e^{V_j}}{e^c \cdot \sum_{k} e^{V_k}} = \frac{e^{V_j}}{\sum_{k} e^{V_k}} \]
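A quick numerical confirmation of this shift invariance (illustrative values):

```python
import numpy as np

def softmax(v):
    w = np.exp(v - v.max())  # max-subtraction relies on the same invariance
    return w / w.sum()

V = np.array([0.2, 1.7, -0.5])
c = 100.0
# Adding a constant to every utility leaves choice probabilities unchanged
assert np.allclose(softmax(V), softmax(V + c))
```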
Implications:
Even after identification, many structurally different models can fit the data equally well: the Rashomon effect, named after Kurosawa's 1950 film.
Example: with 100 pairwise comparisons among 5 items, many different utility vectors can all achieve 90% accuracy.
For alignment: Many reward functions explain human feedback equally well — which one should we optimize?
Rasch \(\rightarrow\) Bradley-Terry: User params cancel (\(U_i\) disappears)
General factor model: \(p(j \succ k \mid i) = \sigma\left(U_i^\top (V_j - V_k) + (Z_j - Z_k)\right)\)
User-specific parameters \(U_i\) cancel when they enter \(H_{ij}\) additively (as in Rasch); in the bilinear term \(U_i^\top (V_j - V_k)\) above they interact with the items and do not cancel.
Use BT for ranking items globally; factor models for personalization.
Sub-populations satisfying IIA \(\not\Rightarrow\) full population satisfies IIA
Mixture model: \[ p(Y_{jj'} = 1) = \sum_{i=1}^N \alpha_i \, \sigma(V_j^{(i)} - V_{j'}^{(i)}) \]
Intuition: A mixture of Gumbels is not Gumbel (like a mixture of Gaussians is not Gaussian)
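The failure can be demonstrated with two equal-sized sub-populations, each an IIA (softmax) chooser on its own; the utility vectors are illustrative:

```python
import numpy as np

def softmax(v):
    w = np.exp(v - np.max(v))
    return w / w.sum()

# Two user types, each individually satisfying IIA
V_a = np.array([3.0, 0.0, 2.0])
V_b = np.array([0.0, 1.0, 2.0])

def mixture_probs(idx):
    """Choice probabilities over subset idx for a 50/50 mixture of types."""
    return 0.5 * softmax(V_a[idx]) + 0.5 * softmax(V_b[idx])

pair = mixture_probs(np.array([0, 1]))     # offer items 1, 2
trio = mixture_probs(np.array([0, 1, 2]))  # add item 3

ratio_pair = pair[0] / pair[1]
ratio_trio = trio[0] / trio[1]
assert abs(ratio_pair - ratio_trio) > 0.1  # IIA would force equality
```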
Solution: Random coefficients logit \(V_j = \beta^\top x_j + \varepsilon_j\), \(\beta \sim N(\mu, \Sigma)\)
Items 1, 2 are nearly identical (red bus, blue bus); item 3 is different (train).
Under IIA with \(V_1 = V_2\): the duplicate bus takes probability from the train, so \(p(3 \mid \{1,2,3\}) < p(3 \mid \{1,3\})\), even though intuitively the second bus should only split the bus riders' share.
The fix: Allow correlated noise between similar alternatives
\[ \begin{aligned} &\bigl(p(1 \mid \{1,2,3\}),\; p(2 \mid \{1,2,3\}),\; p(3 \mid \{1,2,3\})\bigr)\\ &\quad = \left(\tfrac{p(1 \mid \{1,3\})}{2},\; \tfrac{p(2 \mid \{2,3\})}{2},\; p(3 \mid \{1,3\})\right) \end{aligned} \]
when errors for items 1, 2 are perfectly correlated.
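A rough Monte Carlo sketch of this perfectly-correlated case (utilities all equal; the even tie-break between the identical buses is an assumption of the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Red and blue bus share one Gumbel draw; the train gets its own:
# perfect correlation between the two bus alternatives.
eps_bus = rng.gumbel(size=n)
eps_train = rng.gumbel(size=n)

bus_wins = eps_bus > eps_train        # bus mode beats train ~half the time
coin = rng.random(n) < 0.5            # split the bus share between the buses
p1 = np.mean(bus_wins & coin)
p2 = np.mean(bus_wins & ~coin)
p3 = np.mean(~bus_wins)

# Roughly (1/4, 1/4, 1/2), not IIA's (1/3, 1/3, 1/3)
assert abs(p1 - 0.25) < 0.01 and abs(p2 - 0.25) < 0.01 and abs(p3 - 0.5) < 0.01
```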
