1 Foundations
By the end of this chapter you will be able to:
- Identify where preference learning appears across machine learning: recommender systems, information retrieval, robotics, language model alignment, and ranking systems.
- Distinguish between item-wise preference data (accept/reject, ratings) and pairwise comparison data, understanding when each is appropriate.
- Differentiate deterministic preferences from stochastic (random) preferences and justify why randomness is essential for modeling noisy human choice.
- Formulate the Rasch model as the simplest factor model for item-wise data, identifying user appetite (\(U_i\)) and item appeal (\(V_j\)) as latent parameters.
- Extend 1D item utilities to K-dimensional factor models, explaining why multi-dimensional representations are essential for recommender systems.
- Derive how Bradley-Terry emerges as a special case of the Rasch model when conditioning on a single user.
- Define and apply the Independence of Irrelevant Alternatives (IIA) axiom, explaining how it collapses the full preference distribution to an \(M\)-parameter logit model.
- Derive choice probabilities for binary comparisons (Bradley–Terry), accept–reject decisions (logistic regression), and full or partial rankings (Plackett–Luce) from a random-utility model with i.i.d. Gumbel shocks.
- Connect the Bradley–Terry model to modern methods like Direct Preference Optimization (DPO) and Elo ratings.
- Simulate preference data from both item-wise (Rasch) and pairwise (Bradley-Terry) models, and visualize response matrices.
- Diagnose two key limitations of IIA—population heterogeneity (mixture models) and the red-bus/blue-bus cloning problem—and motivate richer models with correlated utility shocks.
- Explain why Gaussian Processes (GPs) provide a flexible nonparametric alternative to linear reward models, and identify when GPs are preferred over parametric approaches.
Throughout this book, we use the following conventions:
- Items are indexed by \(j, j', k \in \{1, \ldots, M\}\)
- Users are indexed by \(i \in \{1, \ldots, N\}\)
- Latent utility for user \(i\) on item \(j\) is \(H_{ij} \in \mathbb{R}\)
- Item appeal (item-specific parameter) is \(V_j \in \mathbb{R}\)
- Binary preference outcomes are random variables \(Y_{jj'} \in \{0, 1\}\), where \(Y_{jj'} = 1\) means item \(j\) is preferred over \(j'\)
- Probabilities are written as \(p(\cdot)\) with conditioning after \(\mid\)
- The sigmoid function is \(\sigma(x) = 1/(1 + e^{-x})\)
This is a book about machine learning from human preferences. Preference data arises throughout machine learning: recommender systems learn which products users prefer, search engines learn which results are most relevant, robots learn which trajectories humans find natural, and language models learn which responses are helpful. This chapter lays the theoretical foundation for understanding how preference data can be modeled and used for learning across all these domains.
1.1 Preference Learning Across Machine Learning
Preference data appears in many forms across machine learning applications:
Recommender systems: Users implicitly express preferences through clicks, purchases, ratings, and engagement time. A user clicking on item A rather than item B reveals \(A \succ B\). Netflix’s recommendation engine, Amazon’s product suggestions, and Spotify’s playlist generation all learn from such preference signals.
Information retrieval: Search engines learn from user clicks which documents are relevant to queries. If a user searches for “best restaurants nearby” and clicks on the third result, this suggests the third result was preferred over the first two—at least for that query.
Robotics and control: When teaching robots through demonstration or feedback, humans express preferences over trajectories. A human might indicate that a robot arm’s smooth motion is preferred over a jerky one, or that one grasping strategy is better than another.
Language model alignment: Large language models are fine-tuned using human preferences over responses. Given a prompt, annotators compare candidate outputs and indicate which is more helpful, harmless, or honest.
Game playing and sports: The Elo rating system, used in chess and many video games, is fundamentally a preference model—each game outcome is a pairwise comparison revealing which player was “preferred” (stronger) in that match.
Despite the diversity of applications, these all share a common mathematical structure: pairwise comparisons or choices from sets that reveal underlying preferences. The Bradley-Terry model and its extensions provide a unified framework for all these settings.
1.2 A Running Example: Language Model Alignment
To ground our discussion, we use language model alignment as a detailed running example. The ideas apply broadly, but LLMs provide concrete notation and have driven recent interest in preference learning.
Modern large language models like GPT-4 and Claude are trained in two distinct phases. Understanding this pipeline illustrates why preference models are essential to building AI systems that behave as humans intend.
Pretraining: Learning to Predict. In the first phase, called pretraining, a language model learns to predict the next token in a sequence. Given a corpus of text, the model is trained to minimize the cross-entropy loss: \[ \mathcal{L}_{\text{pretrain}} = -\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{t=1}^{T} \log \pi_\theta(x_t \mid x_{<t})\right] \tag{1.1}\] where \(\pi_\theta(x_t \mid x_{<t})\) is the model’s predicted probability of token \(x_t\) given the preceding tokens. This objective has a remarkable property: it is a proper scoring rule. A scoring rule is proper if a forecaster maximizes their expected score by reporting their true beliefs. For cross-entropy, this means that a model minimizing this loss will produce calibrated probability estimates—its predicted probabilities will match empirical frequencies in the training data.
A scoring rule \(S(p, y)\) assigns a score to a probability forecast \(p\) when outcome \(y\) is observed. The rule is proper if the expected score \(\mathbb{E}_{y \sim q}[S(p, y)]\) is maximized when \(p = q\), i.e., when the forecaster reports their true belief. The logarithmic scoring rule \(S(p, y) = \log p(y)\) (negative cross-entropy) is strictly proper: any deviation from true probabilities strictly decreases expected score. This is why cross-entropy is the standard training objective for classification and language modeling—it incentivizes honest probability reporting.
Pretraining produces models that are remarkably capable at predicting text, but prediction alone does not ensure the model will behave helpfully, harmlessly, or honestly. A model trained only to predict text will happily continue any prompt, including harmful ones.
Posttraining: Aligning with Human Preferences. The second phase, called posttraining or alignment, teaches the model to produce outputs that humans prefer. This requires a fundamentally different kind of data: rather than predicting what text comes next, we need to learn which outputs are better than others.
The dominant approach for posttraining is Reinforcement Learning from Human Feedback (RLHF) Christiano et al. (2017) Ouyang et al. (2022). At a high level, RLHF works as follows:
- Collect preference data: Given prompts \(x\), sample pairs of responses \((y_1, y_2)\) from the model and ask human annotators which response is better.
- Train a reward model: Fit a reward function \(r(x, y)\) to the preference data, predicting which response humans will prefer.
- Optimize the policy: Fine-tune the language model to maximize the learned reward while staying close to the original model (to prevent reward hacking).
A recent and influential simplification is Direct Preference Optimization (DPO) Rafailov et al. (2023), which skips the explicit reward model. DPO directly optimizes the policy on preference data using the objective: \[ \mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right] \tag{1.2}\] where \(y_w\) is the preferred (“winning”) response, \(y_l\) is the dispreferred (“losing”) response, \(\pi_{\text{ref}}\) is a reference policy (typically the pretrained model), and \(\beta\) controls how much the optimized policy can deviate from the reference.
The Bradley-Terry Connection. The key insight behind DPO is that it assumes human preferences follow a specific probabilistic model. If we define an implicit reward \(r^*(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\), then the DPO objective is equivalent to maximum likelihood estimation under the assumption: \[ p(y_w \succ y_l \mid x) = \sigma(r^*(x, y_w) - r^*(x, y_l)) = \frac{1}{1 + e^{-(r^*(x, y_w) - r^*(x, y_l))}} \tag{1.3}\] This is precisely the Bradley-Terry model Bradley and Terry (1952), one of the oldest and most widely-used models for pairwise comparisons. The Bradley-Terry model says that the probability of preferring item \(j\) over item \(j'\) is a sigmoid function of the difference in their “strengths” or utilities.
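To make the connection concrete, here is a minimal NumPy sketch of the DPO loss (Equation 1.2), assuming the per-sequence log-probabilities under the current and reference policies have already been computed; the function and variable names are illustrative rather than taken from any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Average DPO loss (Eq. 1.2), given sequence log-probs for winning/losing responses.

    logp_w, logp_l         : log pi_theta(y_w | x), log pi_theta(y_l | x), one entry per pair
    ref_logp_w, ref_logp_l : the same quantities under the frozen reference policy
    beta                   : strength of the penalty toward the reference policy
    """
    # Implicit rewards r*(x, y) = beta * log(pi_theta / pi_ref)
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    # Negative Bradley-Terry log-likelihood of preferring y_w over y_l
    return -np.mean(np.log(sigmoid(reward_w - reward_l)))

# Toy check with the numbers used in the worked example later in this chapter (beta = 1):
print(dpo_loss(np.array([0.5]), np.array([-0.3]),
               np.array([0.0]), np.array([0.0]), beta=1.0))  # ~0.37
```

The final line reproduces the numerical example given later in this chapter: an implicit reward difference of 0.8 yields a per-pair loss of roughly 0.37.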
This raises natural questions: Why is the Bradley-Terry model a reasonable assumption? Where does it come from? When does it fail? Answering these questions is the goal of the rest of this chapter. We will see that Bradley-Terry arises naturally from a random utility model with specific noise assumptions, and that its key property—the Independence of Irrelevant Alternatives—is both powerful and limiting.
Whether aligning language models, building recommender systems, or ranking chess players, many preference learning methods assume the Bradley-Terry model: \(p(j \succ k) = \sigma(V_j - V_k)\). Understanding why this model is reasonable—and when it fails—requires the theory of random utility models and the Independence of Irrelevant Alternatives. The same mathematical framework applies across all these applications.
This chapter can be covered in two 50-minute lectures plus a discussion section:
Lecture 1 (Sections 1–3): Motivation and foundations
- Preference learning across ML: recommenders, search, robotics, LLMs (10 min)
- LLM alignment example: RLHF, DPO, and Bradley-Terry (20 min)
- Random preferences, types of comparison data (20 min)
Lecture 2 (Sections 4–6): Theory and limitations
- Random utility models, the Ackley function examples (20 min)
- IIA, Gumbel equivalence, derivation of choice probabilities (20 min)
- Limitations: heterogeneity and red-bus/blue-bus (10 min)
Discussion section: Exercises 1–3, code walkthroughs for Ackley simulations
1.3 Chapter Overview
Having motivated preference models through their role across machine learning—from recommender systems to language model alignment—we now develop the theoretical foundations. The rest of this chapter covers:
- Random preferences (Section 1.4): Why we model preferences as stochastic, and the mathematical framework for doing so.
- Types of comparison data: Full rankings, choices from sets, and binary comparisons—all as observations from the same underlying preference distribution.
- Random utility models: Representing preferences through utility functions with noise, and deriving choice probabilities.
- Independence of Irrelevant Alternatives (IIA): The key assumption that makes Bradley-Terry and related models tractable, plus its limitations.
Chapters 3-6 will then cover how to learn preference models from data, actively collect informative comparisons, make decisions based on learned preferences, and handle heterogeneous populations with diverse preferences.
Historical Notes. The ideas in this chapter draw from nearly a century of research across psychology, economics, and statistics:
- Thurstone (1927) introduced the law of comparative judgment in psychophysics, proposing that perceived stimuli have random “discriminal processes”—the first random utility model.
- Luce (1959) provided an axiomatic foundation through his choice axiom, which is equivalent to IIA. This work established the mathematical foundations for probabilistic choice theory.
- Bradley & Terry (1952) developed the paired comparison model that bears their name, originally for ranking chess players and other competitive settings.
- McFadden (1974) connected random utility theory to econometrics, developing the conditional logit model for analyzing discrete choices. This work, along with his development of the nested logit and mixed logit models, earned him the Nobel Prize in Economics in 2000.
- Modern ML: The application of preference models to machine learning accelerated with Christiano et al. (2017)’s introduction of RLHF for training AI systems. Rafailov et al. (2023)’s DPO provided a direct connection between preference optimization and the classical Bradley-Terry model.
“The theory of discrete choice…has become a cornerstone of modern microeconometrics, with applications ranging from transportation to marketing to environmental economics.” — Daniel McFadden, Nobel Lecture (2000)
1.4 Random Preferences as a Model of Comparisons
We start with a set of items \(j \in \{1, \ldots, M\}\)—be they products, robot trajectories, or language model responses. We will model comparisons as generated from preference orders over these items. Both for realism and for mathematical simplicity, we assume throughout this book that the set of items is discrete and contains \(M\) items.
Comparisons may be random: they are generated by random draws of total orders. A total order has two properties.
- First, for any two items \(j, j'\), either \(j \prec j'\) or \(j \succ j'\) must hold (possibly both, in the case of indifference), an assumption called totality: either \(j\) is weakly preferred to \(j'\) or \(j'\) is weakly preferred to \(j\).
- The second assumption is transitivity: if \(j \succ j'\) and \(j' \succ k\), then also \(j \succ k\).
In the following, we consider randomness as generated from a decision-maker who has an order, or preference relation, \(\prec\) on a set of items. We refer to the random object \(\prec\) as the oracle preference. Each preference \(\prec\) has an associated probability mass \(p(\prec)\), leading to an \((M!-1)\)-dimensional vector encoding the full random preference distribution. This grows extremely fast: even with just \(M=4\) items, we need \(4! - 1 = 23\) parameters; with \(M=10\), we need over 3.6 million. Reducing this complexity is a central goal of this chapter—we will see that the Independence of Irrelevant Alternatives collapses this to just \(M\) parameters.
One might wonder why we need random preferences at all. Deterministic preferences are conceptually useful constructs and are used broadly in consumer theory (e.g., Mas-Colell et al. (1995)). However, they fare poorly when brought to data, because observed choices are inherently noisy.
Even when allowing for randomness, the assumptions we impose in this chapter—transitivity and the Independence of Irrelevant Alternatives—are strong yet practical. In many situations they will fail, for good reasons: humans cannot always clearly rank alternatives, their choices may reflect cultural norms, or they might exhibit self-control problems. These limitations of classical preference models are discussed in later chapters. Until then, we make full use of the stochastic preference framework.
1.5 Types of Comparison Data
There are different types of comparison data we may observe. We can relate them back to the population preferences \(\prec\).
Full Preference Lists. The conceptually simplest, and most informative, form of preference data is the full preference ranking, i.e., \(L = (j_1, j_2, \dots, j_M)\), where \(j_1 \succ j_2 \succ \cdots \succ j_M\). In this case, we know not only that \(j_1\) is preferred to \(j_2\), but also, by transitivity, that it is preferred to all other options. Similarly, we know that \(j_2\) is preferred to all options but \(j_1\), and so on. In practice, we rarely observe full preference lists because the cognitive load on human annotators is too high.
Choices from Subsets. Another type of sample is \((j, \mathcal{S})\) where \(j\) is the most preferred alternative from subset \(\mathcal{S}\) for a sampled preference. Formally, \(j \succ k\) for all \(k \in \mathcal{S} \setminus \{j\}\)—\(j\) is preferred to all elements of \(\mathcal{S}\) other than itself.
Formally, the probability that we observe \((j, \mathcal{S})\) is \[ p(j \mid \mathcal{S}) = \sum_{\prec: j \succ k \; \forall k \in \mathcal{S} \setminus \{j\}} p(\prec). \tag{1.4}\] That is, the probability of observing \((j, \mathcal{S})\) is the total probability mass of all orders \(\prec\) under which \(j\) is preferred to every other element of \(\mathcal{S}\).
If the choice is binary, \(\mathcal{S} = \{j, j'\}\), we write this as \(Y_{jj'} = 1\) (item \(j\) preferred over \(j'\)). We highlight that these outcomes are random, and depend on the sample of \(\prec\). Binary data is convenient and quick to elicit and has been prominently applied in language model finetuning and evaluation.
Sometimes, particularly when a decision-maker is offered an item or “nothing”, we will implicitly assume that there is an “outside option” indexed by \(0\), allowing us to interpret \(Y_{j0} = 1\) as “accepting” item \(j\), and \(Y_{j0} = 0\) as rejecting it. Outside options can be thought of as fundamental limits on what a system designer can obtain. Consider a recommendation system: a user might engage with the recommended content or not, and instead of engaging they could in principle do something else entirely. We do not model that alternative explicitly, treating the outside option as an abstraction. All models are wrong, but some are useful.
Item-wise Responses. In many applications, users respond to individual items rather than comparing pairs. We denote \(Y_{ij} \in \{0, 1\}\) as the binary response of user \(i\) to item \(j\), where \(Y_{ij} = 1\) indicates acceptance (like, purchase, engagement) and \(Y_{ij} = 0\) indicates rejection. This yields an \(N \times M\) response matrix: \[ Y = \begin{bmatrix} Y_{11} & Y_{12} & \cdots & Y_{1M} \\ Y_{21} & Y_{22} & \cdots & Y_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ Y_{N1} & Y_{N2} & \cdots & Y_{NM} \end{bmatrix} \tag{1.5}\] Examples of item-wise data include: e-commerce purchases (did user \(i\) buy product \(j\)?), streaming behavior (did user \(i\) play or skip track \(j\)?), dating apps (did user \(i\) swipe right on profile \(j\)?), and content moderation (did annotator \(i\) flag content \(j\)?). Item-wise data is often more abundant than pairwise data—every user-item interaction generates one data point—but requires modeling both user and item parameters to capture heterogeneity across users.
Context. Choices are often conditional, and data is given by \((i, L)\) (for list-based data), \((i, j, \mathcal{S})\) (for general choice-based data), \((i, j, j')\) for pairwise data, or \((i, j)\) for item-wise data. Here \(i\) indexes the user or context: the environment of a purchase, the goal of a robot, or a user prompt for a large language model. It can also be a prompt to the decision-maker, e.g., to human raters on whether they should pick preferences based on helpfulness or harmlessness Ganguli et al. (2022). The inclusion of context in learning allows for the generalization of preferences, as we will see in subsequent chapters.
Recommender Systems. A user browsing an e-commerce site sees products \(\{A, B, C\}\) and clicks on \(B\). This reveals the choice \((B, \{A, B, C\})\)—item \(B\) was preferred from the displayed set. If we only track whether the user purchased, we observe accept-reject data: \(Y_{B0} = 1\) (accepted \(B\) over the outside option of not buying).
Information Retrieval. A user searches “best pizza near me” and clicks on the third result. This provides implicit feedback that result 3 was preferred over results 1 and 2 for this query context.
Language Model Alignment. Given a prompt \(x\) = “Explain quantum entanglement to a 10-year-old,” annotators compare two responses and indicate \(A \succ B\). This triple \((x, y_w = A, y_l = B)\) constitutes one training example for RLHF or DPO. Public datasets in this format include Anthropic’s HH-RLHF, OpenAI’s summarization preferences, and the Stanford Human Preferences (SHP) dataset.
Sports Rankings. Each chess match between players \(j\) and \(k\) yields a pairwise comparison. If player \(j\) wins, we observe \(Y_{jk} = 1\). The Elo rating system models these outcomes using the Bradley-Terry model.
In all cases, the mathematical structure is the same: observations reveal preferences over items, and our goal is to learn the underlying utility function.
1.6 Deterministic Utility Models
The preference models discussed so far treat items as having fixed utilities \(V_j\). But in many applications—especially recommender systems—different users have different preferences over the same items. A horror movie fan and a romantic comedy fan will have very different utility for the same film. This section introduces deterministic utility models (also called latent variable models) that accommodate both user-specific and item-specific parameters.
From Items to Users and Items. When we observe binary responses \(Y_{ij} \in \{0,1\}\) from \(N\) users to \(M\) items, we model the response probability using a latent utility that depends on both user and item: \[ p(Y_{ij} = 1) = \sigma(H_{ij}), \quad H_{ij} = f(U_i, V_j) \tag{1.6}\] where \(U_i\) captures user \(i\)’s characteristics and \(V_j\) captures item \(j\)’s characteristics. In this deterministic utility framework, stochasticity enters through the Bernoulli sampling \(Y_{ij} \sim \text{Bernoulli}(\sigma(H_{ij}))\), not through noise in \(H_{ij}\) itself. The function \(f\) and the dimensionality of \(U_i, V_j\) define different model families.
1.6.1 The Rasch Model
The Rasch model (also called the 1-parameter logistic model in psychometrics) is the simplest factor model, where \(f\) is additive: \[ p(Y_{ij} = 1 \mid U_i, V_j) = \sigma(U_i + V_j) \tag{1.7}\] Here:
- \(U_i \in \mathbb{R}\) is the user appetite: user \(i\)’s general tendency to accept items. High \(U_i\) means an enthusiastic user who likes many things; low \(U_i\) means a selective user.
- \(V_j \in \mathbb{R}\) is the item appeal: how universally appealing item \(j\) is. High \(V_j\) means a crowd-pleaser; low \(V_j\) means a niche item.
The probability of acceptance depends only on the sum \(U_i + V_j\): enthusiastic users accept more items, and popular items are accepted by more users.
The sorted response matrix reveals the structure: high-appetite users (top) accept most items, popular items (right) are accepted by most users, and the transition from rejection to acceptance follows a diagonal pattern determined by the sum \(U_i + V_j\).
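The following short simulation sketches this structure, assuming illustrative parameter values: we draw user appetites and item appeals, sample the response matrix from Equation 1.7, and sort rows and columns to expose the diagonal accept/reject pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 40, 30                       # users, items
U = rng.normal(0.0, 1.5, size=N)    # user appetites U_i
V = rng.normal(0.0, 1.5, size=M)    # item appeals   V_j

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

P = sigmoid(U[:, None] + V[None, :])   # p(Y_ij = 1) = sigma(U_i + V_j)
Y = rng.binomial(1, P)                 # N x M binary response matrix

# Sort users by appetite and items by appeal: acceptances concentrate
# toward high-appetite rows and high-appeal columns, with a diagonal boundary.
Y_sorted = Y[np.argsort(U)][:, np.argsort(V)]
print(Y_sorted)
```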
1.6.2 Connecting Rasch to Bradley-Terry
A remarkable fact connects item-wise and pairwise models: for a single user, the Rasch model implies Bradley-Terry for pairwise comparisons.
Derivation: Suppose user \(i\) makes a pairwise comparison between items \(j\) and \(k\). In the Rasch framework, the user has deterministic utilities \(H_{ij} = U_i + V_j\) and \(H_{ik} = U_i + V_k\). When comparing, the user’s binary responses \(Y_{ij}, Y_{ik} \sim \text{Bernoulli}(\sigma(H_{ij})), \text{Bernoulli}(\sigma(H_{ik}))\) are noisy observations of these utilities.
One natural comparison mechanism is to prefer \(j\) over \(k\) if the user would accept \(j\) in isolation more readily than \(k\). Concretely, the pairwise preference arises from comparing the latent utilities perturbed by noise: \[ p(j \succ k \mid i) = p(H_{ij} + \eta_{ij} > H_{ik} + \eta_{ik}) \tag{1.8}\] where \(\eta_{ij}, \eta_{ik}\) are independent Gumbel noise terms (the same shocks that generate the Bernoulli observations when each item is compared against an outside option, as made precise below). Since the difference of two independent Gumbel random variables follows a logistic distribution: \[ p(j \succ k \mid i) = \sigma(H_{ij} - H_{ik}) = \sigma((U_i + V_j) - (U_i + V_k)) = \sigma(V_j - V_k) \tag{1.9}\]
The user-specific parameter \(U_i\) cancels out! This has important implications:
| Data Type | What It Reveals | Parameters Identified |
|---|---|---|
| Pairwise \((j \succ k)\) | Item differences only | \(V_j - V_k\) (up to constant) |
| Item-wise \((Y_{ij})\) | User appetites + item appeals | \(U_i\) and \(V_j\) (up to constant) |
Pairwise data cannot distinguish between a world where all items are excellent and users are selective, versus all items are mediocre and users are enthusiastic. Item-wise data can make this distinction.
Goal: Show rigorously that the deterministic utility Rasch model with Bernoulli observations yields Bradley-Terry pairwise probabilities.
Setup: User \(i\) has deterministic utilities \(H_{ij} = U_i + V_j\) for each item. Binary responses are: \[ Y_{ij} \sim \text{Bernoulli}(p_{ij}), \quad p_{ij} = \sigma(H_{ij}) = \frac{1}{1 + e^{-(U_i + V_j)}} \tag{1.10}\]
Pairwise comparison mechanism: When comparing items \(j\) and \(k\), one natural model is to sample independent Gumbel noise \(\eta_{ij}, \eta_{ik}\) and prefer \(j\) if: \[ H_{ij} + \eta_{ij} > H_{ik} + \eta_{ik} \tag{1.11}\]
This is consistent with the Bernoulli observation model: if each acceptance decision compares item \(j\) against an outside option with utility zero and its own independent Gumbel shock \(\eta_{i0}\), then \[ Y_{ij} = \mathbb{1}[H_{ij} + \eta_{ij} > \eta_{i0}] \tag{1.12}\] which occurs with probability exactly \(\sigma(H_{ij})\), since the difference of two independent Gumbel variables is logistic.
Derivation: \[\begin{align} p(j \succ k \mid i) &= p(H_{ij} + \eta_{ij} > H_{ik} + \eta_{ik}) \\ &= p(\eta_{ij} - \eta_{ik} > H_{ik} - H_{ij}) \\ &= p(\eta_{ij} - \eta_{ik} > (U_i + V_k) - (U_i + V_j)) \\ &= p(\eta_{ij} - \eta_{ik} > V_k - V_j) \end{align}\]
Since \(\eta_{ij}\) and \(\eta_{ik}\) are independent Gumbel random variables, their difference \(\eta_{ij} - \eta_{ik}\) follows a logistic distribution. Therefore: \[ p(j \succ k \mid i) = \sigma(V_j - V_k) \tag{1.13}\]
This is exactly the Bradley-Terry formula, and notice that \(U_i\) cancels out—pairwise probabilities depend only on item parameters \(V_j, V_k\).
Interpretation: For the purpose of pairwise comparisons, the Bernoulli observation model with deterministic utilities is equivalent to a model of noisy utilities with independent Gumbel shocks.
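A quick Monte Carlo check of this equivalence, assuming (as in the derivation above) independent Gumbel shocks and illustrative parameter values: the simulated preference frequency matches \(\sigma(V_j - V_k)\) and is unaffected by the user appetite \(U_i\).

```python
import numpy as np

rng = np.random.default_rng(1)
U_i, V_j, V_k = 2.0, 0.7, -0.4      # illustrative parameter values
n = 200_000

# Realized utilities with independent Gumbel shocks
H_j = (U_i + V_j) + rng.gumbel(size=n)
H_k = (U_i + V_k) + rng.gumbel(size=n)

freq = np.mean(H_j > H_k)                    # simulated p(j > k | i)
pred = 1.0 / (1.0 + np.exp(-(V_j - V_k)))    # sigma(V_j - V_k); U_i has cancelled
print(freq, pred)                            # both ~0.75
```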
This explains why recommender systems, which need to personalize, rely heavily on item-wise data (clicks, purchases, ratings), while ranking systems that only need to order items (chess ratings, LLM evaluation) can use pairwise comparisons.
1.6.3 K-Dimensional Factor Models
The Rasch model assumes users and items can each be characterized by a single number. This is limiting: a user might love action movies but dislike romance, while another has the opposite preference. To capture such heterogeneity, we extend to K-dimensional representations.
The Logistic Factor Model: \[ H_{ij} = U_i^\top V_j + Z_j \tag{1.14}\] where:
- \(U_i \in \mathbb{R}^K\) is the user embedding: user \(i\)’s preferences along \(K\) latent dimensions
- \(V_j \in \mathbb{R}^K\) is the item embedding: item \(j\)’s characteristics along the same \(K\) dimensions
- \(Z_j \in \mathbb{R}\) is the item offset: baseline popularity independent of user preferences
The dot product \(U_i^\top V_j\) measures alignment between user preferences and item characteristics. When \(K = 1\) and \(Z_j = 0\), this reduces to a reparameterized Rasch model.
The Ideal Point Model: \[ H_{ij} = -\|U_i - V_j\|_2 + Z_j \tag{1.15}\] Here, users and items live in the same \(K\)-dimensional space, and users prefer items close to their ideal point \(U_i\). This model is natural for:
- Political preferences: voters (users) and candidates (items) on an ideological spectrum
- Music taste: listeners prefer songs similar to their preferred style
- Product recommendations: users prefer products near their ideal specifications
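A small sketch of both scoring rules, with randomly drawn embeddings as illustrative stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 5, 8, 3                      # users, items, latent dimensions
U = rng.normal(size=(N, K))            # user embeddings U_i
V = rng.normal(size=(M, K))            # item embeddings V_j
Z = rng.normal(size=M)                 # item offsets Z_j

# Logistic factor model:  H_ij = U_i^T V_j + Z_j
H_dot = U @ V.T + Z

# Ideal point model:      H_ij = -||U_i - V_j||_2 + Z_j
H_ideal = -np.linalg.norm(U[:, None, :] - V[None, :, :], axis=-1) + Z

print(H_dot.shape, H_ideal.shape)      # both (N, M) utility matrices
```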
Notice that pairwise data grows as \(O(M^2)\) per user, while item-wise data grows as \(O(M)\). This combinatorial explosion makes exhaustive pairwise comparison impractical for large item sets.
1.6.4 Why Factor Models Matter
Factor models are the foundation of modern recommender systems:
Matrix Factorization (Netflix Prize): The winning approaches learned low-rank factorizations \(Y \approx UV^\top\) of the user-item rating matrix.
Collaborative Filtering: Similar users have similar embeddings \(U_i \approx U_{i'}\); similar items have similar embeddings \(V_j \approx V_{j'}\). This enables predicting preferences for unobserved user-item pairs.
Two-Tower Models: Modern industrial systems use neural networks to compute \(U_i = f_\theta(\text{user features})\) and \(V_j = g_\phi(\text{item features})\), then score via dot product.
Low-Rank Structure: The \(N \times M\) response matrix can be approximately reconstructed from \(N \times K\) user embeddings and \(M \times K\) item embeddings, where typically \(K \ll \min(N, M)\). This compression enables generalization.
We defer learning factor model parameters to Chapter 3, but understanding their structure is essential for modeling preference data.
The latent variable view extends preference models to handle user heterogeneity:
- Rasch model: \(p(Y_{ij} = 1) = \sigma(U_i + V_j)\) — 1D user appetite + item appeal
- Factor model: \(H_{ij} = U_i^\top V_j + Z_j\) — K-dimensional embeddings
- Ideal point: \(H_{ij} = -\|U_i - V_j\|_2 + Z_j\) — proximity in latent space
For a single user, Rasch implies Bradley-Terry for pairwise comparisons. But item-wise data reveals richer structure (both \(U_i\) and \(V_j\)) than pairwise data (only \(V_j - V_k\)).
1.6.5 Two Equivalent Views of Stochastic Choice
Before introducing random utility models formally, we clarify an important conceptual point. The latent variable models above and the random utility models we introduce next represent two equivalent ways to model the same stochastic choice behavior.
1.6.5.1 Deterministic Utility View (Latent Variable Framework)
In latent variable models (such as the Rasch and factor models), utilities are deterministic: \[ H_{ij} = f(U_i, V_j) \quad \text{(no randomness)} \tag{1.16}\] Stochasticity enters through the observation process: \[ Y_{ij} \sim \text{Bernoulli}(\sigma(H_{ij})) \tag{1.17}\] For pairwise comparisons, when \(f\) is additive (e.g., \(f(U_i, V_j) = U_i + V_j\)), this yields \(p(j \succ k \mid i) = \sigma(V_j - V_k)\) when \(U_i\) cancels.
Conceptual interpretation: The user has fixed, deterministic preferences \(H_{ij}\), but their binary responses (clicks, likes) are noisy observations of these preferences. The randomness is in the measurement or response mechanism.
1.6.5.2 Stochastic Utility View (Random Utility Framework)
In random utility models, utilities themselves are random variables: \[ \tilde{H}_{ij} = f(U_i, V_j) + \varepsilon_{ij} \quad \text{(where } \varepsilon_{ij} \sim \text{Gumbel)} \tag{1.18}\] Choices are determined by which item achieves the highest realized utility: \[ j \succ k \iff \tilde{H}_{ij} > \tilde{H}_{ik} \tag{1.19}\] With Gumbel-distributed noise and additive \(f\), this also yields \(p(j \succ k \mid i) = \sigma(V_j - V_k)\) when \(U_i\) cancels.
Conceptual interpretation: The user’s utility evaluation for item \(j\) fluctuates randomly across decision instances due to context, mood, unmeasured factors, or cognitive noise. The randomness is in the utility evaluation itself, not the response mechanism.
1.6.5.3 Why Both Views are Equivalent
For pairwise choice probabilities, both frameworks predict: \[ p(j \succ k \mid i) = \sigma(V_j - V_k) \tag{1.20}\]
They are observationally equivalent—given only pairwise comparison data, we cannot distinguish between them. The difference is purely conceptual:
- Latent variable view: Suited when we have explicit binary responses \(Y_{ij}\) (clicks, ratings) and want to model user heterogeneity explicitly
- Random utility view: Suited when we observe direct choices/rankings and want to model utility fluctuations
Neither view is “better”—they are complementary perspectives from different research traditions:
- Latent variable / psychometrics tradition: measurement theory, test reliability, explicit binary responses
- Random utility / economics tradition: discrete choice theory, revealed preferences, direct choices
The random utility framework we develop next provides powerful theoretical tools (IIA, Gumbel maximum properties) that explain why the softmax/Bradley-Terry form emerges from first principles.
1.7 Stochastic Utility Models
We now turn to a powerful mathematical framework for understanding stochastic preferences: stochastic utility models (also called random utility models). This framework provides the theoretical foundation for preference-based methods across machine learning—from collaborative filtering in recommender systems to RLHF in language models. As we will see, the Bradley-Terry model emerges naturally from stochastic utility models with specific noise assumptions.
An equivalent way to represent random preferences is to identify a sample \(\prec\) with a vector of utilities \(H = (H_j)_{j=1}^M \in \mathbb{R}^M\) where \(j \succ j'\) if and only if \(H_j > H_{j'}\). (For the concerned reader: We assume that \(H_j = H_{j'}\) happens with zero probability; and for discrete item sets such a vector always exists.)
To get a sense for different random utility models, we consider a particular utility landscape with the kind of complexity found in many modern machine learning models: the Ackley function. In this model, each alternative is represented by a \(d\)-dimensional feature vector \((x_1, \ldots, x_d) \in \mathbb{R}^d\), and the Ackley function is given by \[ \text{Ackley}(x_1, x_2, \dots, x_d) = -a e^{-b \sqrt{\frac{1}{d} \sum_{j=1}^d x_j^2}} - e^{\frac{1}{d} \sum_{j=1}^d \cos(c x_j)} + a + e. \tag{1.21}\] for some constants \(a, b, c \in \mathbb{R}\). By stacking the feature vectors of \(k\) sampled alternatives into a matrix, we can evaluate the function for all of them in a vectorized way.
We can think of the rows of the feature matrix \(X\) as the features of different alternatives. For two alternatives (\(n=2\)) with a single feature dimension (\(d=1\)), we can visualize the full landscape of the utility function.
In this model, we can sample choice data by assuming a generating model for the features, e.g., np.random.randn(n, d)*0.5 + np.ones((n, d))*0.5.
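A minimal vectorized sketch, treating each row of \(X\) as one alternative's features and its Ackley value as that alternative's utility (the sign convention is a modeling choice, and the constants \(a=20\), \(b=0.2\), \(c=2\pi\) are common defaults assumed here):

```python
import numpy as np

def ackley(X, a=20.0, b=0.2, c=2 * np.pi):
    """Vectorized Ackley function (Eq. 1.21): one value per row of X."""
    X = np.atleast_2d(X)
    term1 = -a * np.exp(-b * np.sqrt(np.mean(X**2, axis=1)))
    term2 = -np.exp(np.mean(np.cos(c * X), axis=1))
    return term1 + term2 + a + np.e

rng = np.random.default_rng(3)
n, d = 2, 1                                   # two alternatives, one feature dimension
X = rng.standard_normal((n, d)) * 0.5 + 0.5   # the feature-generating model quoted above
print(X.ravel(), ackley(X))                   # features and their utilities

# For d = 1 the landscape can be traced on a grid of feature values:
grid = np.linspace(-2, 2, 9)[:, None]
print(ackley(grid))
```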
Item-wise (Accept-Reject) Sampling. One method for data collection is accept-reject sampling, where the decision-maker considers one item at a time and decides if they like it compared to an outside option. This is common in applications like recommendation systems, where accepting refers to a consumption signal.
We will use a simulation to familiarize ourselves with accept-reject sampling. In a plot of the utility surface, blue and red points correspond to accepted and rejected items, respectively.
Pairwise Comparisons. Pairwise comparisons, now widely used to fine-tune large language models, can similarly be generated in this model.
Similarly, we can sample data for \((j, \mathcal{S})\) for \(j \in \mathcal{S} \subseteq \{1, \ldots, M\}\).
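Putting the three feedback types together, here is a self-contained sketch that samples features, computes Ackley utilities, and reads off item-wise, pairwise, and subset-choice data; the acceptance threshold and the chosen subset are illustrative assumptions.

```python
import numpy as np

def ackley(X, a=20.0, b=0.2, c=2 * np.pi):
    """Vectorized Ackley function (Eq. 1.21), as in the previous sketch."""
    X = np.atleast_2d(X)
    term1 = -a * np.exp(-b * np.sqrt(np.mean(X**2, axis=1)))
    term2 = -np.exp(np.mean(np.cos(c * X), axis=1))
    return term1 + term2 + a + np.e

rng = np.random.default_rng(4)
M, d = 6, 1
X = rng.standard_normal((M, d)) * 0.5 + 0.5   # features of the M alternatives
H = ackley(X)                                  # utilities (sign convention is a modeling choice)

# Item-wise (accept-reject) data: accept item j if its utility beats an
# outside-option threshold; the median is used here purely for illustration.
threshold = np.median(H)
Y_item = (H > threshold).astype(int)

# Pairwise comparison: Y_{jj'} = 1 if item j beats item j'.
j, jp = 0, 1
Y_pair = int(H[j] > H[jp])

# Choice from a subset S: the highest-utility element of S is chosen.
S = np.array([0, 2, 4])
chosen = S[np.argmax(H[S])]

print(Y_item, Y_pair, chosen)
```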
1.7.1 Mean Utilities
In the stochastic utility framework, we model utilities as stochastic random variables. Let \(\tilde{H}_{ij}\) denote a random utility that can be decomposed as: \[ \tilde{H}_{ij} = f(U_i, V_j) + \varepsilon_{ij} \tag{1.22}\] where \(f(U_i, V_j)\) is the mean utility (deterministic component) and \(\varepsilon_{ij}\) is the noise (stochastic component), typically assumed independent across items.
For the single-user or population-aggregated case (dropping the user index \(i\)), this simplifies to: \[ \tilde{H}_j = V_j + \varepsilon_j \tag{1.23}\]
Parallel with latent variable models: Recall that in the deterministic utility framework (Section Section 1.6), we wrote \(H_{ij} = f(U_i, V_j)\) as a deterministic quantity, with stochasticity entering through the observation process \(Y_{ij} \sim \text{Bernoulli}(\sigma(H_{ij}))\). The key difference:
- Deterministic utility view: \(H_{ij} = f(U_i, V_j)\) is fixed; randomness is in Bernoulli sampling of \(Y_{ij}\)
- Stochastic utility view: \(\tilde{H}_{ij} = f(U_i, V_j) + \varepsilon_{ij}\) is random; randomness is in \(\varepsilon_{ij}\) within the utility itself
Both frameworks yield the same pairwise probabilities (when \(f\) is additive and the user parameter cancels), demonstrating their observational equivalence.
When users are modeled explicitly (as in Chapter 3), we return to the latent variable approach with \(H_{ij}\) depending on both user and item parameters. For the remainder of this chapter, we focus on the single-user or population-aggregated case using random utilities. There are different forms of noise possible. We will focus on a particular one (called Type-1 Extreme value), but others are also popular, for example Gaussian noise.
There are (at least) three different ways to view this noise \(\varepsilon_{ij}\) in the random utility framework:
- It may capture heterogeneity across decision-makers, the view taken in the Economics field of Industrial Organization. Under this view, observing \((j, \mathcal{S})\) more frequently than \((j', \mathcal{S})\) signals that more decision-makers prefer \(j\) over \(j'\) than the other way around.
- It may represent errors in a decision-maker’s optimization of the utilities \(V_j\), the view endorsed in the literature on Bounded Rationality. Under this view, frequent observation of \((j, \mathcal{S})\) relative to \((j', \mathcal{S})\) does not directly imply that \(j\) is preferred to \(j'\); it might have been chosen in error.
- It may encode the designer’s belief about the preferences \((V_j)_{j=1}^M\). In this view, the posterior after observing data can be used to make claims about relative preferences.
These interpretations explain why one might model utilities as stochastic (\(\tilde{H}_{ij} = f(U_i, V_j) + \varepsilon_{ij}\)) rather than placing all randomness in the observation process (\(Y_{ij} \sim \text{Bernoulli}(\sigma(H_{ij}))\) where \(H_{ij} = f(U_i, V_j)\)). The choice of framework depends on which conceptual interpretation best matches the application domain and available data.
The interpretation will guide our decision-making predictions in Chapters 4 and 5.
We next introduce a main way to simplify learning utility functions: The axiom of Independence of Irrelevant Alternatives.
1.8 Independence of Irrelevant Alternatives
In later chapters, we will consider cases where we sample the most preferred elements from the full set of items, which we call the generation task. A simple assumption will allow us to recover the probabilities of choosing any item, and in fact the full distribution of \(\prec\), from binary comparisons: the so-called Independence of Irrelevant Alternatives, introduced by Luce (1959). This assumption not only makes the preference model easy to identify; it also massively reduces what needs to be estimated from data: instead of the full \((M!-1)\)-dimensional object, it suffices to learn \(M\) values.
IIA assumes that the relative likelihood of choosing \(j\) compared to \(k\) does not change whether a third alternative \(\ell\) is in the choice set or not. Formally, for every \(\mathcal{S} \subseteq \{1, \ldots, M\}\), \(j, k \in \mathcal{S}\), and \(\ell \notin \mathcal{S}\), \[ \frac{p(j \mid \mathcal{S})}{p(k \mid \mathcal{S})} = \frac{p(j \mid \mathcal{S} \cup \{\ell\})}{p(k \mid \mathcal{S} \cup \{\ell\})}. \tag{1.24}\] (In particular, it must be that \(p(k \mid \mathcal{S}) \neq 0\) and \(p(k \mid \mathcal{S} \cup \{\ell\}) \neq 0\).) That is, the relative probability of choosing \(j\) over \(k\) should be independent of whether \(\ell\) is present in the choice set. We will show that this single assumption is sufficient to make the choice model \(M\)-dimensional, making learning feasible.
First, back to our primary example: which random utility models satisfy IIA? The answer is very specific. Among random utility models with additive noise, IIA is equivalent to independent and identically distributed Gumbel noise, as the following result shows. (We ask the reader to convince themselves that the Ackley model does not satisfy IIA.)
A random utility model \(H_j\) satisfies IIA if and only if we can write it as \(H_j = V_j + \varepsilon_j\), where \(V_j\) is deterministic and \(\varepsilon_j\) is sampled independently and identically from the Gumbel distribution. The Gumbel distribution has cumulative distribution function \(F(x) = e^{-e^{-x}}\).
(⇐) Gumbel implies IIA. If \(\varepsilon_j\) are i.i.d. Gumbel, then for any choice set \(\mathcal{S}\) and any \(j, k \in \mathcal{S}\): \[ \frac{p(j \mid \mathcal{S})}{p(k \mid \mathcal{S})} = \frac{e^{V_j}/\sum_{\ell \in \mathcal{S}} e^{V_\ell}}{e^{V_k}/\sum_{\ell \in \mathcal{S}} e^{V_\ell}} = \frac{e^{V_j}}{e^{V_k}} = e^{V_j - V_k} \tag{1.25}\] This ratio depends only on \(V_j - V_k\) and is independent of other alternatives in \(\mathcal{S}\), so IIA holds.
(⇒) IIA implies Gumbel. The converse is more subtle: one must show that IIA together with regularity (adding alternatives cannot increase choice probability) uniquely determines the Gumbel distribution. The key insight is that IIA forces a multiplicative structure on choice probabilities, and the only continuous distributions compatible with this structure are the extreme value distributions. See Yellott Jr (1977) for the complete proof.
How does IIA relate to the deterministic utility (Bernoulli sampling) view?
Recall that we showed two equivalent frameworks for pairwise choice: the deterministic utility view (\(H_{ij} = f(U_i, V_j)\) with Bernoulli observations) and the stochastic utility view (\(\tilde{H}_{ij} = f(U_i, V_j) + \varepsilon_{ij}\) with Gumbel noise).
The connection to IIA is subtle:
Latent variable models (\(H_{ij} = f(U_i, V_j)\), \(Y_{ij} \sim \text{Bernoulli}(\sigma(H_{ij}))\)) do not inherently assume IIA. Whether IIA holds depends on the functional form \(f\) and how choices are generated from the binary responses \(Y_{ij}\).
Random utility models with i.i.d. Gumbel noise (\(\tilde{H}_{ij} = f(U_i, V_j) + \varepsilon_{ij}\)) automatically satisfy IIA because the noise is i.i.d. across items.
For pairwise comparisons from a single user with additive \(f(U_i, V_j) = U_i + V_j\), both frameworks yield the same Bradley-Terry probabilities \(p(j \succ k \mid i) = \sigma(V_j - V_k)\), which trivially satisfies IIA (the ratio \(p(j \succ k)/p(k \succ j) = e^{2(V_j - V_k)}\) is constant).
However, for multi-way choices (choosing from sets \(\mathcal{S}\)), the frameworks can differ:
- The Gumbel view yields softmax: \(p(j \mid \mathcal{S}) = e^{V_j}/\sum_{k \in \mathcal{S}} e^{V_k}\) (IIA holds by construction)
- The Bernoulli view requires additional assumptions about how binary responses aggregate into multi-way choices
Thus, IIA is most naturally associated with the random utility framework, though both frameworks are equivalent for pairwise data.
This is a strong result, and it is an equivalence. If we are willing to assume IIA, it suffices to learn \(M\) parameters to characterize the full preference distribution, an exponential reduction in what must be estimated. The Gumbel formulation may look unfamiliar, particularly to readers with a machine learning background; a more familiar form emerges once we write out the choice probabilities.
Assume a random preference model satisfies IIA, hence \(H_j = V_j + \varepsilon_j\) with i.i.d. Gumbel noise. Then, the probabilities of lists are: \[ p(j_1 \succ j_2 \succ \cdots \succ j_M) = \frac{e^{V_{j_1}}}{\sum_{m=1}^M e^{V_{j_m}}} \cdot \frac{e^{V_{j_2}}}{\sum_{m=2}^M e^{V_{j_m}}} \cdots \frac{e^{V_{j_{M-1}}}}{e^{V_{j_{M-1}}} + e^{V_{j_M}}}. \tag{1.26}\] For choices from sets, \[ p(j \mid \mathcal{S}) = \frac{e^{V_j}}{\sum_{k \in \mathcal{S}} e^{V_k}} = \operatorname{softmax}_j ((V_k)_{k \in \mathcal{S}}). \tag{1.27}\] In particular, for binary comparisons \[ p(Y_{jj'} = 1) = \frac{e^{V_j}}{e^{V_j} + e^{V_{j'}}} = \frac{1}{1 + e^{-(V_j - V_{j'})}} = \sigma(V_j - V_{j'}). \tag{1.28}\] where \(\sigma(x) = 1/(1 + e^{-x})\) is the sigmoid function.
We derive (Equation 1.27). Item \(j\) is chosen from \(\mathcal{S}\) if \(H_j > H_k\) for all \(k \in \mathcal{S} \setminus \{j\}\), i.e., if \(V_j + \varepsilon_j > V_k + \varepsilon_k\) for all \(k \neq j\).
Conditioning on \(\varepsilon_j = t\), we need \(\varepsilon_k < V_j - V_k + t\) for all \(k \neq j\). Since the \(\varepsilon_k\) are independent Gumbel with CDF \(F(x) = e^{-e^{-x}}\): \[ p(j \text{ wins} \mid \varepsilon_j = t) = \prod_{k \in \mathcal{S}, k \neq j} F(V_j - V_k + t) = \prod_{k \neq j} e^{-e^{-(V_j - V_k + t)}} \tag{1.29}\] The Gumbel density is \(f(t) = e^{-t} e^{-e^{-t}}\). Integrating over \(t\): \[ p(j \mid \mathcal{S}) = \int_{-\infty}^{\infty} \prod_{k \neq j} e^{-e^{-(V_j - V_k + t)}} \cdot e^{-t} e^{-e^{-t}} \, dt \tag{1.30}\] Let \(u = e^{-t}\), so \(du = -e^{-t} dt\). After substitution and simplification (Exercise 2 asks you to complete this), we obtain: \[ p(j \mid \mathcal{S}) = \frac{e^{V_j}}{\sum_{k \in \mathcal{S}} e^{V_k}} \tag{1.31}\]
In particular, the choice probabilities \(p(j \mid \mathcal{S})\) coincide with the multi-class logistic regression model (also called the multinomial logit model), and generation from this model, that is, sampling \(j\) with probability \(p(j \mid \{1, \ldots, M\})\), amounts to softmax sampling from \(\operatorname{softmax}((V_j)_{j=1}^M)\).
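A Monte Carlo sketch of this equivalence: adding i.i.d. Gumbel shocks to illustrative mean utilities and recording the argmax reproduces the softmax probabilities of Equation 1.27.

```python
import numpy as np

rng = np.random.default_rng(5)
V = np.array([0.0, 1.0, 2.0])           # mean utilities of the items in S
n = 200_000

# Gumbel-max sampling: add i.i.d. Gumbel noise and pick the argmax.
noise = rng.gumbel(size=(n, len(V)))
choices = np.argmax(V + noise, axis=1)
empirical = np.bincount(choices, minlength=len(V)) / n

softmax = np.exp(V) / np.exp(V).sum()
print(empirical)   # ~ (0.09, 0.24, 0.67)
print(softmax)     # matches the empirical frequencies
```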
This model has many names, depending on the feedback type we consider. For binary comparison data, it is the Bradley-Terry model Bradley and Terry (1952). If data is in forms of list, it is called the Plackett-Luce model Plackett (1975). For accept-reject sampling it is also called logistic regression. For choices from subsets, it is called the (discrete choice) logit model. We will usually call the model the logit model and specify the feedback type.
Consider three items with utilities \(V = (0, 1, 2)\). The Bradley-Terry pairwise probabilities are:
| Comparison | Probability | Calculation |
|---|---|---|
| \(p(1 \succ 2)\) | 0.27 | \(\sigma(0 - 1) = 1/(1 + e^1) \approx 0.27\) |
| \(p(2 \succ 1)\) | 0.73 | \(\sigma(1 - 0) = 1/(1 + e^{-1}) \approx 0.73\) |
| \(p(1 \succ 3)\) | 0.12 | \(\sigma(0 - 2) = 1/(1 + e^2) \approx 0.12\) |
| \(p(2 \succ 3)\) | 0.27 | \(\sigma(1 - 2) = 1/(1 + e^1) \approx 0.27\) |
| \(p(3 \succ 1)\) | 0.88 | \(\sigma(2 - 0) = 1/(1 + e^{-2}) \approx 0.88\) |
| \(p(3 \succ 2)\) | 0.73 | \(\sigma(2 - 1) = 1/(1 + e^{-1}) \approx 0.73\) |
For choice from the full set \(\{1, 2, 3\}\), softmax gives: \[ p(j \mid \{1,2,3\}) = \frac{e^{V_j}}{e^0 + e^1 + e^2} = \frac{e^{V_j}}{1 + 2.72 + 7.39} \approx (0.09, 0.24, 0.67) \tag{1.32}\] Item 3 (highest utility) is most likely to be chosen. Note that these utilities are only identified up to a constant: \((0, 1, 2)\) and \((5, 6, 7)\) yield identical choice probabilities.
We can now see why DPO assumes the Bradley-Terry model (Equation 1.3). When we posit that human preferences arise from comparing implicit rewards with i.i.d. Gumbel noise—that is, a human prefers response \(y\) over \(y'\) when \(r(x, y) + \varepsilon > r(x, y') + \varepsilon'\)—the resulting preference probability is exactly Bradley-Terry: \(p(y \succ y' \mid x) = \sigma(r(x, y) - r(x, y'))\). DPO’s loss function is then the negative log-likelihood under this model.
Numerical example. Suppose for a prompt \(x\), the model assigns:
- \(\log \pi_\theta(y_w \mid x) - \log \pi_{\text{ref}}(y_w \mid x) = 0.5\) (preferred response upweighted)
- \(\log \pi_\theta(y_l \mid x) - \log \pi_{\text{ref}}(y_l \mid x) = -0.3\) (dispreferred response downweighted)
With \(\beta = 1\), the implicit reward difference is \(0.5 - (-0.3) = 0.8\), giving preference probability \(\sigma(0.8) \approx 0.69\). Under Bradley-Terry, this datapoint contributes \(-\log(0.69) \approx 0.37\) to the loss.
What if Bradley-Terry fails? If human preferences exhibit:
- Context effects: preferences depend on what other options are shown → IIA violated
- Intransitive preferences: \(A \succ B \succ C \succ A\) → no consistent utility exists
- Annotator disagreement: different people have different preferences → mixture model needed
DPO will learn a “compromise” policy that may not match any individual’s true preferences. See Chapter 6 for approaches to handling heterogeneity.
The IIA assumption has many desirable properties, such as stochastic transitivity and relativity. The reader is asked to prove them in the exercises to this chapter. The learning of, and optimization based on, the mean utilities \((V_j)_{j=1}^M\) is one of the central goals of this book. Supervised learning based on it will be covered in the next chapter.
IIA is surprisingly strong, but it does not allow for choice probabilities that arise, fundamentally, from multiple decision-makers making choices together. A first limitation concerns heterogeneity.
Heterogeneity. A crucial shortcoming of IIA is that even if every sub-population satisfies IIA, the full population need not. Assume that our population consists of sub-populations \(i = 1, \dots, N\) with masses \(\alpha_1, \alpha_2, \dots, \alpha_N\), respectively, and that each group's preferences satisfy IIA. Because of IIA, we can represent each sub-group's stochastic preferences with an average utility vector \((V_j^{(i)})_{j=1}^M\) for group \(i\). The distribution of the full population is then a mixture of the sub-population preferences. For example, for binary comparisons
\[ p(Y_{jj'} = 1) = \sum_{i=1}^N \alpha_i \, p(Y_{jj'} = 1 \mid \text{group } i) = \sum_{i=1}^N \alpha_i \, \sigma(V_j^{(i)} - V_{j'}^{(i)}). \tag{1.33}\]
Sadly, such mixtures generally violate IIA, as the following coding example shows.
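A small sketch with two equally sized groups holding opposite (illustrative) utilities: each group's choices follow the softmax form, but the mixture's relative choice probabilities change when a third item is added, violating Equation 1.24.

```python
import numpy as np

def softmax_choice(V, S):
    """IIA (logit) choice probabilities over the index set S, given mean utilities V."""
    w = np.exp(V[S])
    return w / w.sum()

# Two sub-populations, each satisfying IIA with its own utility vector.
V1 = np.array([2.0, 0.0, -2.0])
V2 = np.array([-2.0, 0.0, 2.0])
alpha = np.array([0.5, 0.5])           # mixing weights

def mixture_choice(S):
    return alpha[0] * softmax_choice(V1, S) + alpha[1] * softmax_choice(V2, S)

p_12 = mixture_choice([0, 1])          # choice over {item 1, item 2}
p_123 = mixture_choice([0, 1, 2])      # choice over {item 1, item 2, item 3}

# Under IIA these two ratios would be identical; for the mixture they are not.
print(p_12[0] / p_12[1])               # p(1|{1,2}) / p(2|{1,2})   = 1.0
print(p_123[0] / p_123[1])             # p(1|{1,2,3}) / p(2|{1,2,3}) ~ 3.8
```

Under IIA both printed ratios would coincide; for the mixture the first equals 1 while the second is roughly 3.8.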
An intuition for IIA’s failure comes from the random utility view: a mixture of utility vectors, each with independent Gumbel entries, does not in general have independent entries. Mixing induces correlation across items, because a choice that reveals your group also reveals information about your other utilities.
A classic way to address this concern is a model with an explicit representation of heterogeneity, where preferences depend on a latent type \(\alpha \sim F\) for some distribution \(F\), and the utilities given \(\alpha\) follow a random utility model with independent error terms. For example, consider a logit random utility model \[ H_j = \beta^\top x_j + \varepsilon_j \tag{1.34}\] and assume \(\beta \sim N(\mu, \Sigma)\) is a normally distributed vector, a model called the random coefficients logit model. Equivalently, we can view this as a model with correlated utility shocks.
A second limitation is only relevant if we move beyond binary choices, or observe preference lists. Let \(\{1, 2, 3\}\) be three items, where items \(1\) and \(2\) are (almost) identical and different from item \(3\). (In the classical example, items \(1, 2\) are red and blue buses, respectively, and item \(3\) is a train). Assume an IIA model given by average utilities \((V_j)\). As items \(1\) and \(2\) are almost identical, assume \(V_1 = V_2\). We have \(p(3 \mid \{1, 3\}) = p(3 \mid \{2, 3\})\). How do these values compare to \(p(3 \mid \{1, 2, 3\})\)? It would be intuitive to think that item \(3\) is chosen with the same frequency, as there should not be more “demand” for item \(3\) only because item \(1\) is cloned. This is not the case.
The choice probability of the train is reduced. Why does our intuition tell us that items \(1\) and \(2\) should split their choice probability instead? One reason is that we implicitly assume some correlation: if you like item \(1\) over item \(3\), then you should also like item \(2\) over item \(3\), and vice versa. Hence, we would like to allow correlation between utilities in random utility models. For example, if in a logit random utility model we let the error terms for items \(1\) and \(2\) be perfectly correlated (and break ties between them uniformly at random), then \(p(3 \mid \{1, 2, 3\}) = p(3 \mid \{1, 3\})\), confirming our intuition.
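A simulation sketch of the red-bus/blue-bus effect, with illustrative utilities of zero for all three items: i.i.d. Gumbel shocks give the train only a one-third share after the bus is cloned, while perfectly correlated bus shocks preserve its one-half share.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
V_bus, V_train = 0.0, 0.0             # identical mean utilities (illustrative)

# (a) i.i.d. Gumbel shocks for red bus, blue bus, train -> IIA holds.
eps = rng.gumbel(size=(n, 3))
U_iid = np.array([V_bus, V_bus, V_train]) + eps
p_train_iid = np.mean(np.argmax(U_iid, axis=1) == 2)

# (b) Perfectly correlated shocks for the two (identical) buses.
eps_bus = rng.gumbel(size=n)
eps_train = rng.gumbel(size=n)
p_train_corr = np.mean(V_train + eps_train > V_bus + eps_bus)

print(p_train_iid)    # ~1/3: cloning the bus cuts the train's share
print(p_train_corr)   # ~1/2: the clones split the bus share, as intuition suggests
```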
IIA has limitations, which might require more flexible specifications of noise and heterogeneity.
The Independence of Irrelevant Alternatives (IIA) is equivalent to i.i.d. Gumbel noise in random utility models. Under IIA, choice probabilities take the softmax form, and pairwise comparisons follow the Bradley-Terry model: \(p(j \succ k) = \sigma(V_j - V_k)\). This reduces parameter complexity from \(M! - 1\) to just \(M\), making learning tractable. However, IIA can fail when alternatives are similar (red-bus/blue-bus) or when the population is heterogeneous.
This concludes our discussion of the Independence of Irrelevant Alternatives. Additional extensions can be found in (Train 2009; Ben-Akiva and Lerman 1985; McFadden 1981) and in the original paper on logit analysis, McFadden (1972).
1.9 Identification and the Rashomon Effect
When learning utility parameters from choice data, two fundamental challenges arise: the identification problem and the Rashomon effect. Both concern the multiplicity of models that can explain the same observations, but they operate at different levels. These challenges apply to both deterministic and stochastic utility models.
The Identification Problem. A fundamental issue in utility-based models is identification: different utility vectors can generate identical choice probabilities or response patterns.
For stochastic utility models (IIA/softmax): For any constant \(c \in \mathbb{R}\), the mean utilities \((V_1, \ldots, V_M)\) and \((V_1 + c, \ldots, V_M + c)\) generate the same choice probabilities from \(p(j \mid \mathcal{S}) = e^{V_j}/\sum_{k \in \mathcal{S}} e^{V_k}\).
For deterministic utility models: For any constant \(c \in \mathbb{R}\), when utilities depend only on differences (e.g., pairwise comparisons where \(p(j \succ k \mid i) = \sigma(H_{ij} - H_{ik})\)), the utilities \((H_{i1}, \ldots, H_{iM})\) and \((H_{i1} + c, \ldots, H_{iM} + c)\) generate the same probabilities for user \(i\).
Proof.
Stochastic case: For any \(\mathcal{S}\) and \(j \in \mathcal{S}\): \[ \frac{e^{V_j + c}}{\sum_{k \in \mathcal{S}} e^{V_k + c}} = \frac{e^c \cdot e^{V_j}}{e^c \cdot \sum_{k \in \mathcal{S}} e^{V_k}} = \frac{e^{V_j}}{\sum_{k \in \mathcal{S}} e^{V_k}} \tag{1.36}\]
Deterministic case: For pairwise comparisons: \[ p(j \succ k \mid i) = \sigma((H_{ij} + c) - (H_{ik} + c)) = \sigma(H_{ij} - H_{ik}) \tag{1.37}\]
This shows that choice/comparison data only identifies utility differences, not absolute levels. The identification problem is a structural property of both frameworks—it doesn’t matter how much data we collect; we can never distinguish between \((V_1, V_2, V_3)\) and \((V_1 + 10, V_2 + 10, V_3 + 10)\) from choice behavior alone, nor between \((H_{i1}, H_{i2}, H_{i3})\) and \((H_{i1} + 10, H_{i2} + 10, H_{i3} + 10)\) from pairwise comparisons alone.
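A two-line numerical check of this invariance, reusing the utilities from the earlier worked example:

```python
import numpy as np

def softmax(V):
    w = np.exp(V - V.max())            # subtract the max for numerical stability
    return w / w.sum()

V = np.array([0.0, 1.0, 2.0])
print(softmax(V))                      # identical choice probabilities ...
print(softmax(V + 10.0))               # ... after shifting every utility by a constant
```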
Implications:
Normalization required: When learning utilities, we must fix one value. The standard convention is \(V_0 = 0\) for an outside option, making all other utilities interpretable as “value relative to opting out.”
Only differences matter: From choice data, we can only identify utility differences \(V_j - V_k\), not absolute levels. This is why Bradley-Terry uses \(\sigma(V_j - V_{j'})\).
Connection to DPO: In DPO, the reference policy \(\pi_{\text{ref}}\) provides the normalization. The implicit reward \(r^*(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}\) measures deviation from the reference, avoiding the identification problem.
The Rashomon Effect. Even after imposing identification constraints, there may exist many structurally different models that fit the data equally well. This phenomenon, named after Akira Kurosawa’s 1950 film Rashomon (where multiple witnesses give contradictory but internally consistent accounts of the same event), is called the Rashomon effect (Breiman 2001; Rudin et al. 2022).
Consider these examples:
Model class multiplicity: A mixture of two Bradley-Terry models with group-specific utilities \((V_j^{(1)}, V_j^{(2)})\) might predict equally well as a single Bradley-Terry model with different utilities \(V_j\), or even a non-IIA model with correlated noise.
Representation multiplicity: In factor models with \(K\) dimensions, many different \((K \times M)\) embedding matrices can generate similar predicted preferences through different geometric arrangements of items.
Feature multiplicity: In contextual preference models (Chapter 5), many different feature weightings \(\theta\) might achieve similar prediction accuracy but assign different importance to features.
The Rashomon ratio quantifies this multiplicity: if the best model achieves loss \(L^*\), the Rashomon set \(\mathcal{R}_\epsilon\) contains all models with loss at most \((1 + \epsilon) L^*\). When \(|\mathcal{R}_\epsilon|\) is large, many explanations coexist (Semenova, Rudin, and Parr 2022).
Why This Matters:
Interpretation uncertainty: If many models fit equally well, interpreting any single model’s parameters as “the truth” is suspect. Which utility values are the “real” ones?
Scientific understanding: The Rashomon effect suggests we should report sets of good models rather than a single “best” model, especially for scientific inference (Breiman 2001).
Alignment implications: In RLHF, if many reward functions explain human feedback equally well, which one should we optimize? Different reward functions in the Rashomon set may induce different policies, even if they agree on the training comparisons (Skalse et al. 2022).
Relationship Between Identification and Rashomon:
- Identification is about the model’s mathematical structure: parameters that are algebraically equivalent within a single model class.
- Rashomon effect is about empirical indistinguishability: different models or parameterizations that happen to fit the observed data equally well.
Think of identification as “these parameters mean the same thing,” while Rashomon says “these different models perform the same.”
Suppose we collect 100 pairwise comparisons between 5 items and fit:
- A Bradley-Terry model with utilities \((0, V_2, V_3, V_4, V_5)\) achieving 90% accuracy
- A 2-group mixture model with group utilities and mixing weights, also 90% accuracy
- A nested logit model with correlation parameters, also 90% accuracy
All three models satisfy different identification constraints (so parameters within each model are well-defined), yet all three fit equally well. This is the Rashomon effect. Without additional data or prior knowledge, we cannot determine which model is “correct.”
In later chapters, we’ll see how regularization (Chapter 3), Bayesian approaches, and structural assumptions can help navigate both identification and Rashomon challenges. For now, remember: be cautious when interpreting learned utilities as ground truth—there may be many equally valid explanations.
1.10 Connecting Item-wise and Pairwise Models
We have now seen two families of preference models:
- Pairwise models (Bradley-Terry, Plackett-Luce): Model comparisons between items using item parameters \(V_j\) only
- Item-wise models (Rasch, factor models): Model individual responses using both user parameters \(U_i\) and item parameters \(V_j\)
How do these relate? This section clarifies the connections and provides guidance on when to use each.
From Item-wise to Pairwise. Given a factor model for item-wise data, we can derive pairwise preferences by comparing latent utilities:
Rasch → Bradley-Terry (single user): As shown in Section 1.6, if \(H_{ij} = U_i + V_j\), then \(p(j \succ k \mid i) = \sigma(V_j - V_k)\), which is exactly Bradley-Terry. The user parameter cancels.
General Factor Model → User-specific Preferences: If \(H_{ij} = U_i^\top V_j + Z_j\), then: \[ p(j \succ k \mid i) = \sigma\left(U_i^\top (V_j - V_k) + (Z_j - Z_k)\right) \tag{1.38}\] This is not standard Bradley-Terry—the preference probability depends on the user \(U_i\)! Different users may rank the same items differently. This captures the reality that a horror fan and a comedy fan will have opposite preferences between The Shining and The Hangover.
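As a concrete illustration of Equation 1.38, the following sketch (with made-up two-dimensional embeddings) shows how two users with different embeddings \(U_i\) can reverse the preference between the same pair of items.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pref_prob(U_i, V_j, V_k, Z_j=0.0, Z_k=0.0):
    """p(j > k | i) = sigma(U_i^T (V_j - V_k) + (Z_j - Z_k))  -- Equation 1.38."""
    return sigmoid(U_i @ (V_j - V_k) + (Z_j - Z_k))

# Illustrative 2-D embeddings: axis 0 = "horror", axis 1 = "comedy".
V_shining  = np.array([ 2.0, -1.0])   # The Shining
V_hangover = np.array([-1.0,  2.0])   # The Hangover

U_horror_fan = np.array([1.0, 0.0])
U_comedy_fan = np.array([0.0, 1.0])

print(pref_prob(U_horror_fan, V_shining, V_hangover))  # > 0.5: prefers The Shining
print(pref_prob(U_comedy_fan, V_shining, V_hangover))  # < 0.5: prefers The Hangover
```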
When User Parameters Cancel. User parameters cancel out in pairwise comparisons when:
- Rasch model: The additive structure \(H_{ij} = U_i + V_j\) means user appetite scales all items equally
- Single user: With \(N = 1\), there’s only one \(U_i\), which becomes part of the constant
- Homogeneous population: When all users are identical (\(U_i = U\) for all \(i\))
In these cases, Bradley-Terry suffices for ranking items. But for personalized recommendations, we need factor models that preserve user information.
A Unified Generative View. Both model families arise from the same latent utility framework:
Latent utility: \(H_{ij} = f(U_i, V_j) + \varepsilon_{ij}\), which can be observed through either of two response types:
- Item-wise response: \(Y_{ij} = \mathbb{1}[H_{ij} > 0]\)
- Pairwise comparison: \(Y_{i,jk} = \mathbb{1}[H_{ij} > H_{ik}]\)
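A minimal simulation sketch (illustrative sizes, Rasch-style additive utilities with i.i.d. Gumbel shocks; treating the item-wise threshold as a comparison against an outside option with utility zero is an assumption here) generates both response types from the same latent utilities.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 4, 5                                   # illustrative numbers of users and items
U = rng.normal(size=N)                        # user appetites U_i
V = rng.normal(size=M)                        # item appeals V_j

def simulate(U, V, rng):
    """One draw of item-wise and pairwise responses from shared latent utilities."""
    n, m = len(U), len(V)
    # Latent utilities H_ij = U_i + V_j + eps_ij with i.i.d. Gumbel shocks;
    # the outside option has deterministic utility 0 plus its own shock.
    eps = rng.gumbel(size=(n, m))
    eps0 = rng.gumbel(size=(n, 1))
    H = U[:, None] + V[None, :] + eps
    H0 = 0.0 + eps0
    Y_item = (H > H0).astype(int)             # accept item j over the outside option
    # Pairwise comparisons between items j and k for each user.
    Y_pair = (H[:, :, None] > H[:, None, :]).astype(int)   # Y_pair[i, j, k] = 1[H_ij > H_ik]
    return Y_item, Y_pair

Y_item, Y_pair = simulate(U, V, rng)
print(Y_item)            # N x M accept/reject matrix
print(Y_pair[0])         # user 0's M x M comparison outcomes
```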
The choice between data types depends on the application:
| Consideration | Item-wise Favored | Pairwise Favored |
|---|---|---|
| Data abundance | Every interaction is one datapoint | Comparisons require explicit elicitation |
| User identity | Users are logged in / identifiable | Users are anonymous or pooled |
| Goal | Personalized recommendations | Global ranking of items |
| Scale | \(O(NM)\) observations | \(O(NM^2)\) possible comparisons |
| Cold start | Can use user/item features | Relies only on comparison outcomes |
- Use Bradley-Terry / pairwise models when: ranking items for a population (LLM evaluation, sports rankings), users are anonymous, or comparisons are the natural data format (A/B testing, tournament brackets)
- Use Factor models / item-wise models when: building personalized recommendations, users have persistent identities, or you need to predict responses to new items
- Use both when: you have rich interaction data and want both global quality rankings and personalized recommendations (e.g., a streaming service ranking shows globally while personalizing the home page)
1.11 Beyond Bradley-Terry: Random Choice Models (Optional)
This section is optional and covers advanced extensions of the Bradley-Terry framework for readers interested in the broader literature on discrete choice models.
The Bradley-Terry model underlying DPO is just one member of a rich family of random choice models (also called discrete choice models) developed primarily in economics and psychology. These models provide principled ways to relax IIA when it fails to capture observed preference patterns.
Probit models. Instead of Gumbel-distributed noise (which yields Bradley-Terry/logit), we can assume Gaussian noise: \(H_j = V_j + \varepsilon_j\) with \(\varepsilon \sim \mathcal{N}(0, \Sigma)\). When \(\Sigma = I\) (independent noise), this yields the multinomial probit model. The key advantage is that probit allows for arbitrary correlation structures in \(\Sigma\), naturally handling the red-bus/blue-bus problem. If the red and blue buses have correlated noise (\(\Sigma_{12} > 0\)), adding the blue bus takes share mainly from the red bus rather than from the train. The disadvantage is computational: unlike logit, probit choice probabilities lack closed-form expressions and require numerical integration.
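Because probit probabilities require integration, a simple Monte Carlo sketch (illustrative equal utilities and an assumed correlation \(\rho = 0.95\) between the two bus shocks) shows how correlated noise preserves most of the train's share after the blue bus is added, whereas logit would assign each of the three alternatives one third.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 200_000

V = np.array([0.0, 0.0, 0.0])        # train, red bus, blue bus (equal deterministic utility)
rho = 0.95                            # illustrative correlation between the two bus shocks
Sigma = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, rho],
                  [0.0, rho, 1.0]])

eps = rng.multivariate_normal(np.zeros(3), Sigma, size=n_draws)
choices = np.argmax(V + eps, axis=1)  # probit choice = alternative with highest utility
shares = np.bincount(choices, minlength=3) / n_draws
print(shares)   # train keeps close to half the share; under IIA/logit each would get 1/3
```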
Nested logit. A tractable middle ground is the nested logit model, which groups alternatives into “nests” with correlated noise within each nest. For the transportation example, we might have Bus nest = {red bus, blue bus} and Train nest = {train}. The model first chooses a nest, then an alternative within the nest. This preserves IIA within nests while allowing substitution patterns across nests that violate IIA.
Random coefficients (mixed) logit.
The most flexible extension is the random coefficients logit (also called mixed logit), which we already encountered in the heterogeneity discussion. Here, the utility coefficients \(\beta\) are random: \[ V_j = \beta^\top x_j, \quad \beta \sim F \tag{1.39}\] for some distribution \(F\) (often Gaussian). This model can approximate any random utility model arbitrarily well (McFadden and Train 2000), making it extremely flexible. Modern implementations use simulation-based estimation; see Train (2009) for details.
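As a minimal sketch (illustrative item features and an assumed Gaussian distribution for \(\beta\)), mixed-logit choice probabilities can be approximated by averaging standard logit probabilities over simulated draws of \(\beta\).

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[1.0, 0.0],            # illustrative feature vectors x_j for 3 items
              [0.0, 1.0],
              [0.5, 0.5]])
beta_mean = np.array([1.0, -0.5])    # assumed mean of beta
beta_cov = 0.5 * np.eye(2)           # assumed covariance of beta

def softmax(v):
    z = np.exp(v - v.max())
    return z / z.sum()

# Mixed-logit choice probabilities: E_beta[softmax(X @ beta)], estimated by simulation.
betas = rng.multivariate_normal(beta_mean, beta_cov, size=10_000)
probs = np.mean([softmax(X @ b) for b in betas], axis=0)
print(probs)   # averaged (generally non-IIA) choice probabilities for the 3 items
```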
Gaussian Process models for nonlinear rewards.
The random coefficients logit still assumes utility is linear in features \(x_j\). This assumption is limiting: linear reward functions can only achieve their optima at extreme points (corners of the feasible region). Consider a robot learning preferences over mini-golf shots parameterized by speed and angle—a linear reward \(r = w_1 \cdot \text{speed} + w_2 \cdot \text{angle}\) cannot express “I prefer a moderate shot to the middle target.”
Gaussian Processes (GPs) provide a nonparametric alternative that can model arbitrary smooth functions: \[ r(x) \sim \mathcal{GP}(m, k) \tag{1.40}\] where \(m: \mathcal{X} \to \mathbb{R}\) is the mean function and \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is the covariance function (kernel). A GP is defined by the property that any finite collection of function values follows a multivariate Gaussian: \[ [r(x_1), \ldots, r(x_n)]^\top \sim \mathcal{N}(\boldsymbol{\mu}, K), \quad \text{where } \mu_i = m(x_i), \; K_{ij} = k(x_i, x_j) \tag{1.41}\]
The most common kernel is the Radial Basis Function (RBF): \[ k(x, x') = \sigma_f^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right) \tag{1.42}\] which encodes the assumption that nearby points have correlated function values, with length-scale \(\ell\) controlling smoothness.
When combined with the Bradley-Terry likelihood, we model preferences as \(p(A \succ B \mid r) = \sigma(r(x_A) - r(x_B))\) where \(r \sim \mathcal{GP}(m, k)\). Unlike standard GP regression where we observe function values directly, preference data provides only ordinal information—which item is preferred—making inference more challenging.
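Here is a small sketch (illustrative 1-D inputs and RBF hyperparameters \(\sigma_f = 1\), \(\ell = 0.5\)) that draws a reward function from the GP prior at a handful of points and converts one pair into a Bradley-Terry preference probability.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(X1, X2, sigma_f=1.0, length_scale=0.5):
    """RBF kernel k(x, x') = sigma_f^2 * exp(-||x - x'||^2 / (2 l^2))  -- Equation 1.42."""
    sq_dists = (X1[:, None] - X2[None, :]) ** 2
    return sigma_f**2 * np.exp(-sq_dists / (2 * length_scale**2))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A few illustrative 1-D inputs (e.g., shot speeds in the mini-golf example).
X = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

# Draw one reward function r ~ GP(0, k) at these inputs: a multivariate Gaussian sample.
K = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))   # small jitter for numerical stability
r = rng.multivariate_normal(np.zeros(len(X)), K)

# Bradley-Terry preference probability between inputs X[1] and X[3] under this sample.
print(sigmoid(r[1] - r[3]))    # p(X[1] preferred over X[3] | r)
```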
Why GPs for preferences? (1) Expressiveness: Can represent complex nonlinear functions without specifying a parametric form. (2) Uncertainty quantification: GP provides predictive means and confidence intervals. (3) Active learning: Uncertainty guides which comparisons to query (Chapter 4). (4) Bayesian framework: Principled posterior updates as new comparisons arrive.
Trade-offs: GPs are well-suited when the reward function has unknown nonlinear structure and the feature dimension is low-to-moderate (\(d \lesssim 10\)). For high-dimensional problems or large datasets, linear models or neural networks may be more practical due to the \(O(n^3)\) computational cost of GP inference. Chapter 3 covers inference techniques (including Laplace approximation for non-Gaussian likelihoods), and Chapter 4 shows how GP uncertainty enables sample-efficient active query selection.
Implications for preference learning. For practitioners using DPO or RLHF, these extensions suggest directions for improvement:
- Heterogeneous populations: If annotators have diverse preferences, a single Bradley-Terry model may be misspecified. Random coefficients models (or mixture models) can capture this heterogeneity.
- Correlated alternatives: When comparing responses that differ only slightly (e.g., same content with different formatting), IIA violations may occur. Nested or probit models can help.
- Choice set effects: If preferences depend on what alternatives are present, IIA fails. This is particularly relevant for k-wise comparisons with \(k > 2\).
These extensions are active areas of research in machine learning from human preferences; see Siththaranjan, Laidlaw, and Hadfield-Menell (2023) for recent work on distributional preference learning that relaxes Bradley-Terry assumptions.
The next chapter is the first to study learning of average utility functions from preference data: it assumes a given dataset of preferences and shows how to estimate the utility vector \((V_j)_{j=1}^M\) under different types of sampling and different notions of “inference”.
| Notation | Meaning | Domain / Type |
|---|---|---|
| \(M\) | Number of items | \(\mathbb{N}\) |
| \(j, j', k\) | Item indices | \(\{1, \ldots, M\}\) |
| \(0\) | Outside (“no-choice”) option index | Reference item |
| \(\prec\) / \(\succ\) | Weak preference relation / its strict part | Binary relation on items |
| \(L=(j_1,\dots ,j_M)\) | Full ranking (preference list) | Permutation of items |
| \((j, \mathcal{S})\) | Observation that \(j\) is chosen from subset \(\mathcal{S}\) | \(j \in \mathcal{S} \subseteq \{1,\ldots,M\}\) |
| \(Y_{jj'}\) | Binary preference outcome (\(j\) vs \(j'\)) | \(\{0, 1\}\) |
| \(i\) | User/context index | \(\{1, \ldots, N\}\) |
| \(N\) | Number of users | \(\mathbb{N}\) |
| \(x\) | Context/prompt (in LLM setting) | Input space |
| \(y, y_w, y_l\) | Responses (winning/losing in preference data) | Output space |
| \(\pi_\theta\) | Policy (language model) parameterized by \(\theta\) | Distribution over outputs |
| \(\pi_{\text{ref}}\) | Reference policy | Distribution over outputs |
| \(r(x, y)\) | Reward function | \(\mathbb{R}\) |
| \(U_i\) | User embedding/appetite | \(\mathbb{R}^K\) |
| \(V_j\) | Item loading/appeal | \(\mathbb{R}^K\) (or \(\mathbb{R}\) when \(K=1\)) |
| \(Z_j\) | Item offset | \(\mathbb{R}\) |
| \(H_{ij}\) | Latent utility for user \(i\), item \(j\) | \(\mathbb{R}\) |
| \(H_j\) | Latent utility (single-user case) | \(\mathbb{R}\) |
| \(\varepsilon_j\) | Stochastic utility shock for item \(j\) | \(\mathbb{R}\) (i.i.d.) |
| \(p(\cdot)\) | Probability | \([0,1]\) |
| \(p(j \mid \mathcal{S})\) | Logit/Plackett-Luce choice probability of \(j\) from \(\mathcal{S}\) | \([0,1]\) |
| \(\sigma(x)\) | Sigmoid \(1/(1+e^{-x})\) | \((0,1)\) |
| \(d\) | Dimensionality of feature vectors | \(\mathbb{N}\) |
| \(\boldsymbol{x}_j \in\mathbb{R}^{d}\) | Feature vector of item \(j\) | \(\mathbb{R}^{d}\) |
| \(\text{Ackley}(\boldsymbol{x})\) | Ackley test-function value | \(\mathbb{R}\) |
| \(a,b,c\) | Ackley parameters | Scalars |
| \(\alpha_i\) | Population weight of subgroup \(i\) | \((0,1)\) with \(\sum_i\alpha_i=1\) |
| \(\beta\) | Random-coefficients vector in linear RUM / KL penalty in DPO | \(\mathbb{R}^{d}\) or \(\mathbb{R}^+\) |
| \(\Sigma\) | Covariance matrix of \(\beta\) | \(\mathbb{R}^{d\times d}\) |
| \(\mathcal{GP}(m, k)\) | Gaussian Process with mean \(m\) and kernel \(k\) | Distribution over functions |
| \(k(x, x')\) | Kernel (covariance) function | \(\mathbb{R}\) |
| \(\ell\) | Length-scale parameter in RBF kernel | \(\mathbb{R}^+\) |
| \(\sigma_f^2\) | Signal variance in GP kernel | \(\mathbb{R}^+\) |
1.12 Discussion Questions
- How does modeling preferences as random (rather than deterministic) help us capture real-world choice behavior?
- How does the Rasch model differ from standard logistic regression? What additional structure does it impose by having separate user and item parameters?
- Why does user appetite \(U_i\) cancel out in pairwise comparisons under the Rasch model? When is this a feature (simplification) versus a limitation (loss of information)?
- What is the Independence of Irrelevant Alternatives (IIA) axiom, and why does it simplify the estimation of choice models?
- Why do i.i.d. Gumbel shocks in a random utility model lead to the Plackett–Luce (list) and softmax/logit (choice) formulas?
- In what ways can binary comparisons, choice-from-a-set, and full rankings each be seen as observations of the same underlying stochastic preference distribution?
- When would you prefer to collect item-wise data (accept/reject, ratings) versus pairwise comparisons? Consider cost, information content, and downstream tasks.
- How does introducing an “outside option” (item \(0\)) allow us to interpret accept–reject data (e.g., clicks in a recommender system) within the same logit framework?
- Why does mixing multiple IIA-satisfying sub-populations generally violate IIA at the aggregate level? What does this imply for preference learning from diverse user populations?
- Explain the “red bus–blue bus” problem: why does splitting a single alternative into two identical ones distort logit choice probabilities? Give an example of when this might occur in recommender systems or search ranking.
- Modern recommender systems use neural networks to compute user and item embeddings. How does this relate to the K-dimensional factor model \(H_{ij} = U_i^\top V_j + Z_j\)?
- The ideal point model (\(H_{ij} = -\|U_i - V_j\|_2 + Z_j\)) is popular in political science. Why might Euclidean distance be more appropriate than dot product for modeling political preferences?
- Compare implicit preference signals (clicks, watch time, purchases) with explicit ratings. What are the tradeoffs for preference learning?
- The Elo rating system in chess uses the Bradley-Terry model. What assumptions does this make about player skill, and when might these assumptions fail?
1.13 Bibliographic Notes
Random utility models originate in Thurstone’s (1927) work on psychophysical scaling, where he proposed that perceived stimuli have random “discriminal processes.” The axiomatic approach via the choice axiom (equivalent to IIA) was developed by Luce et al. (1959), providing the mathematical foundations for probabilistic choice theory. Yellott Jr (1977) proved the remarkable equivalence between IIA and Gumbel-distributed noise.
The Rasch model was introduced by Rasch (1960) for psychometric testing, where \(U_i\) represents student ability and \(V_j\) represents item difficulty. The model has become foundational in educational testing and is used in standardized tests like the GRE. For comprehensive treatment of item response theory, see Baker (2001). The connection between Rasch models and Bradley-Terry was explored in the paired comparison literature; see Andrich (1978).
Factor models for recommendations gained prominence after the Netflix Prize competition, where matrix factorization methods proved highly effective (Koren, Bell, and Volinsky 2009). The probabilistic interpretation connecting matrix factorization to latent factor models is developed in Mnih and Salakhutdinov (2007). Modern neural approaches to collaborative filtering build on these foundations; see He et al. (2017) for neural collaborative filtering.
Ideal point models originate in the unfolding theory of Coombs (1950) and were developed for political science applications by Poole and Rosenthal (1985) for analyzing roll-call voting. The connection to preference learning is explored in Carroll and Chang (1972). For modern computational approaches, see Armstrong et al. (2014).
Discrete choice econometrics was pioneered by McFadden (1974), who developed the conditional logit model and its applications to transportation choice. His subsequent work on nested logit and mixed logit models, along with his contributions to the econometric theory of discrete choice, earned him the Nobel Prize in Economics in 2000 (shared with James Heckman). The definitive graduate textbook treatment is Train (2009).
The Bradley-Terry model was introduced by Bradley and Terry (1952) for analyzing paired comparisons in ranking problems, though similar models were developed independently by Zermelo (1929) for chess ratings. The extension to partial rankings is due to Plackett (1975) and Luce et al. (1959). For modern statistical treatments, see Cattelan (2012).
Proper scoring rules have a long history in probability forecasting; see Gneiting and Raftery (2007) for a comprehensive treatment. The connection between proper scoring rules and calibration is fundamental to understanding why cross-entropy is the standard loss for classification and language modeling.
Preference learning in machine learning has grown rapidly since Christiano et al. (2017) introduced RLHF for learning from human preferences without hand-specified reward functions. Ouyang et al. (2022) scaled this approach to large language models. Rafailov et al. (2023) introduced DPO, making explicit the Bradley-Terry assumption underlying reward modeling. For connections to social choice theory, see Arrow (1951) and Sen (1986).
For further reading: On the economics of discrete choice, Ben-Akiva and Lerman (1985) provides accessible coverage. On the psychology of choice, Kahneman (2011) discusses bounded rationality and systematic deviations from utility maximization. On recent advances in preference learning for AI, see the surveys by Kaufmann et al. (2023).
1.14 Exercises
Exercises are marked with difficulty levels: (*) for introductory, (**) for intermediate, and (***) for challenging.
1.14.1 Properties of IIA Models (*)
Prove that if a preference model satisfies IIA, then it also satisfies: (a) regularity: \(p(j \mid \mathcal{S}) \le p(j \mid \mathcal{S}')\) for any \(j\) and \(\mathcal{S} \supseteq \mathcal{S}'\); and (b) weak stochastic transitivity: for all \((j, k, \ell)\), if \(p(j \mid \{j, k\}) \ge 0.5\) and \(p(k \mid \{k, \ell\}) \ge 0.5\), then necessarily \(p(j \mid \{j, \ell\}) \ge 0.5\).
1.14.2 Discrete Choice Models (**)
Consider a linear random utility model \(H_j = \beta^\top x_j + \varepsilon_j\) for \(j = 1, 2, \ldots, M\), where the \(\varepsilon_j\) are i.i.d. Gumbel distributed with CDF \(F\) and density \(f\). We would like to compute \(p(j \mid \{1, \ldots, M\})\) and connect it to multi-class logistic regression.
(a) First compute \(p(H_j < t)\) for any \(j\) in terms of \(F\). Use this probability to express \(p(j \mid \{1, \ldots, M\})\) as an integral over \(t\) in terms of \(f\) and \(F\).
(b) Compute the integral derived in part (a) with the appropriate \(u\)-substitution. You should arrive at the multi-class logistic regression (softmax) model.
1.14.3 Mixtures and correlations (*)
Prove that the class of random preferences induced by the following two are identical: (a) mixtures of IIA random utility models (that is, those with i.i.d. noise) (b) random utility models with correlated noise.
1.14.4 Sufficient Statistics (***)
(a) Show that choice data completely specifies the preference model. That is, express \(p(\prec)\) for any \(\prec\) in terms of \(p(j \mid \mathcal{S})\), \(j \in \mathcal{S} \subseteq \{1, \ldots, M\}\).
(b) Show that this is not the case for binary comparisons. That is, give an example of two different preference models that induce the same probabilities \(p(Y_{jk} = 1)\).
Hint: For (a), consider how you would reconstruct the probability of a ranking \(j_1 \succ j_2 \succ j_3\) from choice probabilities. For (b), consider M=3 and try to find two distributions over the 6 possible rankings that agree on all pairwise margins.
1.14.5 Non-Random Utility Models (***)
Not all probability assignments for binary comparisons \(p_{jk} = p(Y_{jk} = 1)\) can be realized with a random preference model. Give an example of binary comparisons \((p_{12}, p_{23}, p_{31})\) that cannot be a result of a random preference model.
Hint: Think about what constraints transitivity imposes. If \(p_{12} > 0.5\) and \(p_{23} > 0.5\), what can you say about \(p_{13}\)?
1.14.6 Bradley-Terry Maximum Likelihood (**)
You observe the following pairwise comparison outcomes between 4 chess players over a tournament:
| Comparison | Player A wins | Player B wins |
|---|---|---|
| 1 vs 2 | 65 | 35 |
| 1 vs 3 | 72 | 28 |
| 1 vs 4 | 80 | 20 |
| 2 vs 3 | 58 | 42 |
| 2 vs 4 | 70 | 30 |
| 3 vs 4 | 55 | 45 |
Write the log-likelihood for the Bradley-Terry model as a function of utilities \(V = (V_1, V_2, V_3, V_4)\) with the identification constraint \(V_4 = 0\).
Derive the gradient of the log-likelihood. Implement gradient ascent to find the MLE.
Interpret the learned utilities. Which player is strongest? What is the estimated probability that Player 1 beats Player 2?
1.14.7 From Preferences to Policy (**)
This exercise connects Bradley-Terry estimation to DPO.
Show that minimizing the DPO loss (Equation 1.2) on a dataset of preferences is equivalent to maximizing the Bradley-Terry log-likelihood, where the “utility” of response \(y\) given prompt \(x\) is \(\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\).
Suppose 20% of preference labels are “noisy”—i.e., the annotator flipped a coin instead of expressing their true preference. How does this affect the learned policy compared to noise-free data?
If you had access to the noise rate, how would you modify the DPO objective to account for it?
1.14.8 Rasch Model Properties (*)
Consider the Rasch model \(p(Y_{ij} = 1) = \sigma(U_i + V_j)\) where \(U_i\) is user appetite and \(V_j\) is item appeal.
Show that the log-odds of acceptance is linear in the parameters: \[ \log \frac{p(Y_{ij} = 1)}{p(Y_{ij} = 0)} = U_i + V_j \tag{1.43}\]
Prove that for a single user \(i\) comparing items \(j\) and \(k\), the Rasch model implies Bradley-Terry: \[ p(j \succ k \mid i) = \sigma(V_j - V_k) \tag{1.44}\] What happens to the user parameter \(U_i\)?
Explain why item-wise data from \(N\) users can identify both \(\{U_i\}_{i=1}^N\) and \(\{V_j\}_{j=1}^M\) (up to a single additive constant), while pairwise comparison data cannot identify the \(U_i\) parameters at all.
1.14.9 Factor Model Comparison (**)
Consider two factor models with \(K\)-dimensional embeddings:
- Dot-product model: \(H_{ij} = U_i^\top V_j\)
- Ideal-point model: \(H_{ij} = -\|U_i - V_j\|_2^2\)
Show that with \(K = 1\), the two models can represent the same set of preference probabilities. Hint: Find a transformation between the parameters.
Give an intuitive example of a preference pattern that the dot-product model can represent but the ideal-point model cannot. Hint: Consider what happens when a user’s embedding points in the opposite direction from an item’s embedding.
In what application domains (beyond those mentioned in the chapter) would you prefer the ideal-point model over the dot-product model? Justify your answer.
1.14.10 From Rasch to Recommendations (*)
A music streaming service has 1000 users and 500 songs. Each user has listened to some songs (listen = \(Y_{ij} = 1\)) and skipped others when recommended (skip = \(Y_{ij} = 0\)). You model this with the Rasch model.
Write down the log-likelihood for the Rasch model given an observed response matrix \(Y\).
After fitting the model, you estimate user appetite \(U_{\text{Alice}} = 1.5\) for user Alice and item appeal \(V_{\text{song}} = -0.5\) for the song “Bohemian Rhapsody”. What is the probability that Alice will listen to this song if recommended?
User Bob has estimated appetite \(U_{\text{Bob}} = -0.5\). Without knowing the item parameters, compute the odds ratio comparing Alice’s probability of listening to any song versus Bob’s probability.
1.14.11 Posterior Inference for Mixture Preferences (**)
(This exercise previews some aspects of learning utility functions from the next chapter but is self-contained.) You are part of the ML team at the movie streaming site “Preferential”. You receive full preference orderings of the form \(j_1 \succ j_2 \succ \cdots \succ j_M\), where \(j_1\) is the most and \(j_M\) the least preferred option. The preferences come from \(600\) distinct users with \(50\) examples per user. Each movie \(j\) has a \(10\)-dimensional feature vector \(\mathbf{x}_j\), and each user \(i\) has a \(10\)-dimensional weight vector \(\mathbf{w}_i\). The preferences for user \(i\) follow the random utility model \(H_j = \mathbf{w}_i^\top \mathbf{x}_j + \varepsilon_j\), where the \(\varepsilon_j\) are i.i.d. Gumbel distributed.
Sadly, you lost all user identifiers. Unashamedly, you assume a model where a proportion \(p\) of the users have weights \(\mathbf{w}_1\), and a proportion \(1-p\) has weights \(\mathbf{w}_2\). Each user belongs to one of two groups: users with weights \(\mathbf{w}_1\) are part of Group 1, and users with weights \(\mathbf{w}_2\) are part of Group 2.
For a datapoint \((j_1 \succ j_2)\) and conditional on \(p\), \(\mathbf{w}_1\) and \(\mathbf{w}_2\), compute the likelihood \(p(Y_{j_1 j_2} = 1 \mid p, \mathbf{w}_1, \mathbf{w}_2)\).
Use the likelihood to simplify the posterior distribution of \((p, \mathbf{w}_1, \mathbf{w}_2)\) after updating on the observation \((j_1 \succ j_2)\) with features \((\mathbf{x}_{j_1}, \mathbf{x}_{j_2})\), leaving the prior terms unevaluated.
Assume priors \(p \sim B(1, 1)\), \(\mathbf{w}_1 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\), and \(\mathbf{w}_2 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\), where \(B\) denotes the Beta distribution and \(\mathcal{N}\) the normal distribution (all three sampled independently). You will notice that the posterior from part (b) has no simple closed form, so numerical methods are required. One such method, which allows approximate sampling from the posterior \(\pi\), is called Metropolis-Hastings. (The reason why one might want to sample from the posterior will be discussed later.) Broadly, the idea of Metropolis-Hastings and similar Markov Chain Monte Carlo methods is the following: construct a Markov chain \(\{x_t\}_{t=1}^\infty\) whose ergodic (stationary) distribution is the desired distribution. By properties of Markov chains, for \(t \gg 0\), \(x_t\) is almost as good as a sample from that distribution. In Metropolis-Hastings, a proposal \(\tilde{x}\) for \(x_{t+1}\) is drawn from a chosen probability kernel \(Q(\cdot \mid x_t)\) (e.g., by adding Gaussian noise), and it is accepted with a probability \(A\):
\[ x_{t+1}=\begin{cases} \tilde{x} & \text{with probability } A, \\ x_t & \text{with probability } 1 - A, \end{cases} \tag{1.45}\] where \[ A = \min \left( 1, \frac{\pi(\tilde{x})\, Q(x_t \mid \tilde{x})}{\pi(x_t)\, Q(\tilde{x} \mid x_t)} \right). \tag{1.46}\] We will extract samples from the Markov chain after a “burn-in period”: \((x_{T+1}, x_{T+2}, \ldots, x_{N})\).
To build some intuition, suppose we have a biased coin that turns up heads with probability \(p_{\text{H}}\). We observe \(12\) coin flips with \(9\) heads (H) and \(3\) tails (T). If our prior for \(p_{\text{H}}\) is \(B(1, 1)\), then, by properties of the Beta distribution, our posterior is \(B(1 + 9, 1 + 3) = B(10, 4)\). The Bayesian update is given by
\[ p(p_{\text{H}}|9\text{H}, 3\text{T}) = \frac{p(9\text{H}, 3\text{T} | p_{\text{H}})B(1, 1)(p_{\text{H}})}{\int_0^1 P(9\text{H}, 3\text{T} | p_{\text{H}})B(1, 1)(p_{\text{H}}) \, \mathrm dp_{\text{H}}} =\frac{p(9\text{H}, 3\text{T} | p_{\text{H}})}{\int_0^1 p(9\text{H}, 3\text{T} | p_{\text{H}}) \, \mathrm dp_{\text{H}}}. \tag{1.47}\]
Find the acceptance probability \(A\) in the setting of the biased coin assuming the proposal distribution \(Q(\cdot|x_t)=x_t+N(0,\sigma)\) for given \(\sigma\). Notice that this choice of \(Q\) is symmetric, i.e., \(Q(x_t|p)=Q(p|x_t)\) for all \(p \in \mathbb R\). Note that it is unnecessary to compute the normalizing constant of the Bayesian update (i.e., the integral in the denominator). This simplification is one of the main practical advantages of Metropolis-Hastings.
- Implement Metropolis-Hastings to sample from the posterior distribution of the biased coin in `multimodal_preferences/biased_coin.py`. Attach a histogram of your MCMC samples overlaid on the true posterior \(B(10, 4)\), produced by running `python biased_coin.py`.
- Implement Metropolis-Hastings in the movie setting inside `multimodal_preferences/movie_metropolis.py`. You should be able to achieve a \(90\%\) success rate, with most `fraction_accepted` values above \(0.1\). Success is measured by thresholded closeness of predicted parameters to true parameters. You may notice occasional failures due to lack of convergence, which we will account for in grading.