Machine Learning from Human Preferences

Chapter 2: Choice Models (Part 2)

The ideal point model

  • An embedding approach: assumes a user’s preference for an item depends on their distance in a latent space
    • Let \(x_n\) denote a latent vector representing an individual \(n\)
    • Let \(v_i\) denote a latent vector representing choice (or item) \(i\), with utility \(U_{ni} = -\,\mathrm{dist}(x_n, v_i) + \epsilon_{ni}\)
    • The model is equivalent to choosing the “closest” item (a simulation sketch follows below)

\[ y_{ni} = \begin{cases} 1, & \text{if } U_{ni} > U_{nj} \ \forall j \neq i \\ 0, & \text{otherwise} \end{cases} \]
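
To make this concrete, here is a minimal NumPy sketch (not from the slides; the function and variable names are illustrative): utilities are negative Euclidean distances plus Gumbel choice noise, and the individual picks the item with the highest utility, i.e., the closest item up to noise.

```python
import numpy as np

def simulate_ideal_point_choice(x_n, V, rng, noise_scale=1.0):
    """Simulate one choice under the ideal point model.

    x_n : (d,) latent vector for individual n
    V   : (k, d) latent vectors for the k items
    Utility: U_ni = -dist(x_n, v_i) + eps_ni with Gumbel noise eps_ni;
    the individual chooses the item with the highest utility.
    """
    distances = np.linalg.norm(V - x_n, axis=1)        # dist(x_n, v_i) for all items
    eps = rng.gumbel(scale=noise_scale, size=len(V))   # choice noise
    utilities = -distances + eps                       # closer item => higher utility
    return int(np.argmax(utilities))                   # index i with y_ni = 1

rng = np.random.default_rng(0)
x_n = rng.normal(size=2)          # individual's ideal point in a 2-D embedding
V = rng.normal(size=(5, 2))       # five items embedded in the same space
print(simulate_ideal_point_choice(x_n, V, rng))
```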

Ideal point model: the why

  • Pros: Can sometimes learn preferences faster than attribute-based preference models by exploiting geometry (see refs)
  • Cons:
    • The embedding assumption may be strong (it can be made more flexible via the choice of distance function)
    • However, one then has to select a distance function (usually Euclidean distance in the embedding space)

Jamieson and Nowak (2011); Tatli, Nowak, and Vinayak (2022)

Choice models in RL (and RLHF)

Choice models in RL

Application: RL and Language

(Bradley-Terry model)

RLHF

https://openai.com/research/learning-to-summarize-with-human-feedback

Choice models in ML (recommender systems, bandits, Direct Preference Optimization)

Model in ML

Why DPO?

DPO vs PPO

  • The RLHF pipeline is complex and can be unstable, largely because of the separate reward-model fitting and RL optimization steps.
  • DPO is simpler and more stable: it optimizes the policy directly on preference data, without training an explicit reward model or running an RL loop.

Rafailov et al. (2023)

DPO: Bradley-Terry model

  • Given a prompt \(x\) and completions \(y_w\) (preferred) and \(y_l\) (dispreferred), the choice model gives the preference probability

\[ p^*(y_w > y_l | x) = \frac{\exp(r^*(x, y_w))}{\exp(r^*(x, y_w)) + \exp(r^*(x, y_l))} \]

where \(r^*(x, y)\) is a latent reward function that we do not have access to (it encodes the underlying human preference)
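
As a quick numeric illustration (with made-up reward values, since \(r^*\) is latent), the Bradley-Terry probability is a softmax over the two rewards, or equivalently a sigmoid of the reward difference:

```python
import numpy as np

def bradley_terry_prob(r_w, r_l):
    """P(y_w preferred over y_l | x) under the Bradley-Terry model,
    given scalar rewards r_w = r(x, y_w) and r_l = r(x, y_l)."""
    # softmax over two rewards == sigmoid of the reward difference
    return 1.0 / (1.0 + np.exp(r_l - r_w))

print(bradley_terry_prob(r_w=2.0, r_l=1.0))  # ~0.73: y_w usually wins
print(bradley_terry_prob(r_w=1.0, r_l=1.0))  # 0.5: indifferent
```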

DPO: Bradley-Terry model

Luckily, we can parameterize the reward model with a neural network with parameters \(\phi\):

Let us start with the Reward Maximization Objective in RL: \[ \max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(y|x)} [r_\phi(x, y) - \beta D_{KL}(\pi_\theta(y|x) \| \pi_{\text{ref}}(y|x))] \]

  • Where \(\pi_\theta(y|x)\) is the language model being optimized, and \(\pi_{\text{ref}}(y|x)\) is the reference model (e.g., the language model before RLHF fine-tuning)

\[ \max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(y|x)} [r_\phi(x, y) - \beta D_{KL}(\pi_\theta(y|x) \| \pi_{\text{ref}}(y|x))] \]

Recall the definition of KL divergence: \[ D_{KL}(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_{x \sim p} \left[ \log \frac{p(x)}{q(x)} \right] \]
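
A quick numeric check of this definition on a toy discrete distribution (the probabilities below are arbitrary, for illustration only):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
kl = np.sum(p * np.log(p / q))  # D_KL(p || q) = E_{x~p}[log p(x)/q(x)]
print(kl)                       # ~0.085; equals zero iff p == q
```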

Then we can rewrite the objective as: \[ \begin{aligned} &\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi_\theta(y|x)} \left[ r_\phi(x, y) \right] - \beta\, \mathbb{E}_{y \sim \pi_\theta(y|x)} \left[\log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} \right] \right]\\ &=\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi_\theta(y|x)} \left[ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} \right] \end{aligned} \]

Then, we can continue to derive the objective as: \[ \begin{aligned} &\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi_\theta(y|x)} \left[ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} \right] \\ &\equiv \min_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi_\theta(y|x)} \left[ \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} - \frac{1}{\beta} r_\phi(x, y) \right] \quad \text{// negate and divide by } \beta\\ &= \min_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi_\theta(y|x)} \left[ \log \frac{\pi_\theta(y|x)}{\frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r_\phi(x, y)\right)} - \log Z(x) \right] \end{aligned} \]

\[ \text{with} \quad Z(x) = \sum_{y} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r_\phi(x, y)\right) \]

Because \(Z(x)\) is a constant with respect to \(\pi_\theta\), we can define: \[ \pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r_\phi(x, y)\right) \]

Then, we can rewrite the optimization problem as: \[ \min_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{y \sim \pi_\theta(y|x)} \left[ \log \frac{\pi_\theta(y|x)}{\pi^*(y|x)} - \log Z(x) \right] \]

\[ = \min_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}} \left[ D_{KL}\left(\pi_\theta(y|x) \,\|\, \pi^*(y|x)\right) - \log Z(x) \right] \]

Since \(\log Z(x)\) does not depend on \(\pi_\theta\) and the KL divergence is minimized (at zero) when the two distributions coincide, the optimal solution (i.e., the optimal language model) is: \[ \pi_\theta(y|x) = \pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r_\phi(x, y)\right) \]
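
On a toy problem with a finite set of completions (the reference probabilities and rewards below are made up for illustration), this closed form is just the reference model reweighted by \(\exp(r/\beta)\) and renormalized by \(Z(x)\):

```python
import numpy as np

def optimal_policy(pi_ref, rewards, beta):
    """Closed-form pi*(y|x) = pi_ref(y|x) * exp(r(x,y)/beta) / Z(x),
    for a fixed prompt x over a finite set of completions y."""
    unnormalized = pi_ref * np.exp(rewards / beta)
    Z = unnormalized.sum()   # Z(x) = sum_y pi_ref(y|x) exp(r(x,y)/beta)
    return unnormalized / Z

pi_ref = np.array([0.5, 0.3, 0.2])   # reference model over 3 completions
rewards = np.array([1.0, 2.0, 0.0])  # hypothetical rewards r_phi(x, y)
print(optimal_policy(pi_ref, rewards, beta=1.0))    # mass shifts toward high-reward y
print(optimal_policy(pi_ref, rewards, beta=100.0))  # large beta: stays close to pi_ref
```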

Rearranging, we can express the reward in terms of the optimal policy: \[ \begin{aligned} \pi_\theta(y|x) &= \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r_\phi(x, y)\right)\\ \log \pi_\theta(y|x) &= \log \pi_{\text{ref}}(y|x) + \frac{1}{\beta} r_\phi(x, y) - \log Z(x) \quad \text{// take } \log \text{ of both sides}\\ r_\phi(x, y) &= \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)\\ \end{aligned} \]

Recall the Bradley-Terry model with parameterized reward model: \[ p_\phi(y_w > y_l | x) = \frac{\exp(r_\phi(x, y_w))}{\exp(r_\phi(x, y_w)) + \exp(r_\phi(x, y_l))} \]

We also have the optimal reward model: \[ r_\phi(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x) \]

Thus, substituting this reward into the choice model (the \(\beta \log Z(x)\) terms cancel), we can rewrite it as: \[ \begin{aligned} p_\phi(y_w \succ y_l | x) &= \frac{1}{1 + \exp\left( \beta \log \frac{\pi_\theta(y_l | x)}{\pi_{\text{ref}}(y_l | x)} - \beta \log \frac{\pi_\theta(y_w | x)}{\pi_{\text{ref}}(y_w | x)} \right)}\\ &= \sigma\left( \beta \log \frac{\pi_\theta(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi_\theta(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \right) \end{aligned} \]

DPO: Bradley-Terry model

Recall that fitting the reward model amounts to maximizing the likelihood of the choice model, i.e., minimizing the negative log-likelihood: \[ \mathcal{L} (r_\phi, \mathcal{D}) = - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log p_\phi(y_w \succ y_l | x) \right] \]

Finally, we can rewrite the objective as:

\[ \begin{aligned} \mathcal{L}_{DPO}(\pi_\theta; \pi_{\text{ref}}) &= - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log p_\phi(y_w \succ y_l | x) \right]\\ &= -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi_\theta(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \right) \right] \end{aligned} \]

Rafailov et al. (2023)
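
A minimal NumPy sketch of this loss (not the reference implementation; names are illustrative). It assumes we are given sequence-level log-probabilities of each completion under the policy and the reference model, which in practice are obtained by summing per-token log-probabilities from the language model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss over a batch of (prompt, chosen, rejected) triples.

    logp_w, logp_l         : log pi_theta(y_w|x), log pi_theta(y_l|x)
    ref_logp_w, ref_logp_l : log pi_ref(y_w|x),   log pi_ref(y_l|x)
    """
    # implicit rewards are beta * log(pi_theta / pi_ref); Z(x) cancels in the difference
    chosen_logratio = logp_w - ref_logp_w
    rejected_logratio = logp_l - ref_logp_l
    logits = beta * (chosen_logratio - rejected_logratio)
    return -np.mean(np.log(sigmoid(logits)))  # negative log-likelihood of preferences

# toy batch of two preference pairs (log-probabilities are made up)
logp_w = np.array([-10.0, -12.0]); ref_logp_w = np.array([-11.0, -12.5])
logp_l = np.array([-13.0, -11.0]); ref_logp_l = np.array([-12.0, -11.5])
print(dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l))
```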

RLHF Comparison

Should your ML application use an explicit utility/reward model?

  • Pros:
    • Reward models can be re-used (in principle)
    • Reward model can be examined to infer properties of human(s), and measure the quality of the preference model(s)
    • Reward model(s) add useful inductive biases to the training pipeline
  • Cons:
    • The extra step of reward modeling can introduce (unnecessary?) errors
    • Reward model optimization can be unstable (e.g., in RLHF, as argued by DPO)

Some criticisms of choice modeling more broadly

  • Real-world choices often appear to be highly situational or context-dependent, e.g., affected by how the choice is posed, emotional states, and other factors that are not well modeled.
    • Arguably this is what marketing exploits. Related to framing effects (more later).
    • A partial rebuttal: In principle, one can always add more context to the model.
  • Many choices are intuitive rather than rational, so utility-optimization models do not apply
    • People have limited attention and cognitive capacity, especially for less salient choices
    • Default choices are powerful, e.g., default enrollment in 401(k) plans, or opt-in vs. opt-out organ donation

Q & A

  • What are some key assumptions in (discrete) choice models?
    • Rationality (existence of a utility function that determines choices)
    • Parametric model for utility and choice noise
    • Finite set of choices, and explicit alternatives
  • How does one apply discrete choice models to ML/RL applications with changing context (input)?
    • Model utility via generic models (e.g., deep neural networks)
  • What are some criticisms of discrete choice models?
    • Humans display context-dependent choices
    • Humans often make intuitive (or irrational) choices

What is not covered

  • Details of estimation, analysis
    • Maximum likelihood is generally equivalent to standard classification/ranking
    • Existing analysis (though often interesting) is mostly for linear (or simpler) utilities
    • Many of the interesting theoretical questions are for active querying settings
  • Beyond discrete choice models
    • With equivalent alternatives (\(U_1 > U_2, U_1 \approx U_3\))
    • Continuous “choices” e.g., pricing, demand/supply
    • Dynamic discrete choice (for time varying choices) \(\approx\) RL
  • Experimental design for “stated preferences”
    • How to design a survey to measure alternatives, conjoint analysis
  • Active querying (future discussion)

Summary

  • Today: Overview of discrete choice models
    • Basics of discrete choice and rationality assumptions
    • Benefits and criticisms of discrete choice
    • Some special cases and applications of discrete choice models to ML
  • Next Lecture: Student discussion on Human Decision Making and Choice Models

References

  • Train (1986)
  • McFadden and Train (2000)
  • Luce et al. (1959)
  • Additional:
    • Ben-Akiva and Lerman (1985)
    • Park, Simar, and Zelenyuk (2017)
    • Rafailov et al. (2023)

Ben-Akiva, Moshe E., and Steven R. Lerman. 1985. Discrete Choice Analysis: Theory and Application to Travel Demand. Transportation Studies. Cambridge, MA: MIT Press.
Jamieson, Kevin G., and Robert Nowak. 2011. “Active Ranking Using Pairwise Comparisons.” In Advances in Neural Information Processing Systems.
Luce, R. Duncan, et al. 1959. Individual Choice Behavior. Vol. 4. New York: Wiley.
McFadden, Daniel, and Kenneth Train. 2000. “Mixed MNL Models for Discrete Response.” Journal of Applied Econometrics 15 (5): 447–70.
Park, Byeong U, Leopold Simar, and Valentin Zelenyuk. 2017. “Nonparametric Estimation of Dynamic Discrete Choice Models for Time Series Data.” Computational Statistics & Data Analysis 108: 97–120. https://doi.org/10.1016/j.csda.2016.10.024.
Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” https://arxiv.org/abs/2305.18290.
Tatli, Gokcan, Rob Nowak, and Ramya Korlakai Vinayak. 2022. “Learning Preference Distributions from Distance Measurements.” In Proceedings of the Conference.
Train, Kenneth. 1986. Qualitative Choice Analysis: Theory, Econometrics, and an Application to Automobile Demand. MIT Press.