7  Conclusion

Machine learning systems are increasingly shaping consequential decisions in our lives—from the content we consume to the opportunities we receive. A central challenge in developing trustworthy AI is ensuring these systems align with human preferences rather than optimizing for proxies that diverge from what people actually value. This book has presented a comprehensive framework for learning from human feedback, drawing on nearly a century of research across economics, psychology, statistics, and computer science.

7.1 Our Approach

This book has taken an interdisciplinary approach to machine learning from human preferences, weaving together insights from diverse fields into a unified technical and conceptual framework.

7.1.1 Foundations from Multiple Disciplines

Our treatment began with foundational models that originated outside of machine learning. The Bradley-Terry model (Section 1.6.2), introduced in 1952 for ranking chess players, provides the mathematical backbone for modern preference learning—including state-of-the-art methods like Direct Preference Optimization for language model alignment. Thurstone’s law of comparative judgment from psychophysics (1927), Luce’s choice axiom (1959), and McFadden’s discrete choice econometrics (1974) all inform how we model the relationship between latent utilities and observed choices.

This interdisciplinary foundation is not merely historical context. Understanding why the Bradley-Terry model arises from random utility assumptions, and when its key assumption—Independence of Irrelevant Alternatives—fails, equips practitioners to recognize the limitations of their methods. The red-bus/blue-bus problem and preference heterogeneity are not abstract concerns but practical challenges that arise in real applications.

7.1.2 From Axioms to Algorithms

The book progressed from axiomatic foundations to practical algorithms:

Chapter 1 established the mathematical language of preferences. We showed how random utility models with Gumbel-distributed noise yield the tractable logit form, and how the IIA axiom collapses an exponentially large preference space to a linear number of parameters. These are not arbitrary modeling choices—they encode specific assumptions about human cognition that may or may not hold in practice.
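The Gumbel-to-logit result can be checked numerically: perturbing two latent utilities with independent standard Gumbel draws and counting how often the first wins recovers the logistic choice probability. A minimal sketch, with all names and values illustrative:

```python
import math
import random

def gumbel_choice_prob(v1, v2, n_samples=200_000, seed=0):
    """Monte Carlo estimate of P(item 1 chosen) under a random
    utility model with i.i.d. standard Gumbel noise."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        # Inverse-CDF draws of standard Gumbel noise.
        g1 = -math.log(-math.log(rng.random()))
        g2 = -math.log(-math.log(rng.random()))
        if v1 + g1 > v2 + g2:
            wins += 1
    return wins / n_samples

v1, v2 = 1.0, 0.0
empirical = gumbel_choice_prob(v1, v2)
closed_form = 1.0 / (1.0 + math.exp(-(v1 - v2)))  # the logit form
```

The Monte Carlo estimate matches the closed-form sigmoid to within sampling error, which is exactly the content of the random utility derivation.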

Chapter 2 developed estimation methods for preference models. Maximum likelihood estimation provides point estimates efficiently, while Bayesian inference quantifies uncertainty—essential for downstream decision-making. The Elo rating system, used from chess to video games, emerges naturally as stochastic gradient descent on the Bradley-Terry likelihood. Gaussian Processes offer flexible nonparametric alternatives when linear reward functions are too restrictive.
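The Elo-as-SGD observation can be made concrete: the classical update is a single gradient ascent step on the Bradley-Terry log-likelihood of the observed result. A minimal sketch on the natural logit scale (the function name is illustrative; chess's 400-point scale is a linear rescaling of the same update):

```python
import math

def elo_update(r_winner, r_loser, k=0.1):
    """One Elo update, written as a stochastic gradient ascent step
    on the Bradley-Terry log-likelihood log sigma(r_winner - r_loser)."""
    p_win = 1.0 / (1.0 + math.exp(-(r_winner - r_loser)))
    # Gradient of the log-likelihood w.r.t. r_winner is (1 - p_win).
    delta = k * (1.0 - p_win)
    return r_winner + delta, r_loser - delta

# Evenly matched players: the winner gains half the full step.
r_a, r_b = elo_update(0.0, 0.0)
```

An upset (low predicted win probability) produces a large update, while an expected result barely moves the ratings, just as in the standard Elo formula.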

Chapter 3 addressed active data collection. Human feedback is expensive; we cannot afford to collect data naively. Fisher information quantifies how much each comparison teaches us about preferences, enabling intelligent query selection. The connection to optimal experimental design—A-optimal, D-optimal, E-optimal criteria—grounds active learning in classical statistical theory.
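For a single Bradley-Terry comparison, the Fisher information about the utility difference is p(1 - p), so the most uncertain matchups are the most informative. A simplified greedy selector along these lines, under the assumption of known current utility estimates (real designs optimize A-, D-, or E-criteria over the full information matrix; names are illustrative):

```python
import math

def comparison_information(v_i, v_j):
    """Fisher information a single Bradley-Terry comparison carries
    about the utility difference v_i - v_j: p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-(v_i - v_j)))
    return p * (1.0 - p)

def most_informative_pair(utilities):
    """Greedy query selection: pick the pair whose outcome is most
    uncertain, i.e. maximizes p(1 - p)."""
    n = len(utilities)
    return max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: comparison_information(utilities[ij[0]], utilities[ij[1]]),
    )

# The two closely matched items are selected over the clear favorite.
pair = most_informative_pair([0.0, 0.1, 2.5])
```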

Chapter 4 turned to decision-making under learned preferences. Thompson Sampling elegantly balances exploration and exploitation by sampling from the posterior and acting optimally under the sample. Dueling bandits extend multi-armed bandits to the comparison setting, with distinct winner concepts (Condorcet, Borda, von Neumann) and regret notions appropriate for different applications. Preferential Bayesian Optimization unified Gaussian Process preference models from earlier chapters with sequential decision-making, enabling optimization of complex objectives from comparison feedback alone. The chapter culminated in the Cooperative Inverse Reinforcement Learning (CIRL) framework, which reconceived preference learning as a cooperative game between human and agent—shifting from passive data collection to interactive collaboration where the agent’s queries themselves shape the human’s ability to communicate preferences.
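The Thompson Sampling loop described above is short enough to sketch directly: draw one utility per arm from its posterior, then act greedily on the sampled values. A toy version with illustrative Gaussian posteriors:

```python
import random

def thompson_step(samplers, rng):
    """One Thompson Sampling round: sample each arm's utility from
    its posterior, then pick the arm with the highest sample."""
    sampled = [draw(rng) for draw in samplers]
    return max(range(len(sampled)), key=lambda a: sampled[a])

rng = random.Random(0)
# Illustrative Gaussian posteriors (mean, std) over each arm's utility.
posteriors = [(0.0, 1.0), (0.5, 1.0), (2.0, 0.1)]
samplers = [lambda r, m=m, s=s: r.gauss(m, s) for (m, s) in posteriors]
counts = [0, 0, 0]
for _ in range(2000):
    counts[thompson_step(samplers, rng)] += 1
# The confidently-best arm dominates, yet uncertain arms still get pulls.
```

Exploration falls out of the posterior itself: arms with wide posteriors occasionally produce the highest sample, so they keep being tried until the data rules them out.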

Chapter 5 examined aggregation when multiple stakeholders have heterogeneous preferences. Arrow’s Impossibility Theorem and the Gibbard-Satterthwaite Theorem establish fundamental limits on what any aggregation mechanism can achieve—yet domain restrictions like single-peaked preferences offer practical escape routes. A key result connected the Borda count to Direct Preference Optimization, showing that DPO finds the policy that wins the most head-to-head matchups (Section 5.3)—giving modern alignment methods a precise social choice interpretation. Sen’s liberal paradox (Section 6.3.4.1) revealed the tension between respecting individual rights and achieving Pareto efficiency when preferences are “nosy,” while the inversion problem (Section 5.7) cautioned that observed behavior may not reflect true preferences. The Community Notes case study (Section 5.6) illustrated how these ideas play out in a deployed system. Mechanism design, including VCG auctions, showed how to align individual incentives with collective objectives—but at a cost.
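The Borda idea underlying the DPO connection can be illustrated with a head-to-head win tally over pairwise votes. This toy version (names illustrative) scores each item by the matchups it wins across annotators:

```python
def pairwise_win_scores(comparisons, items):
    """Borda-style tally from pairwise votes: each item's score is
    the number of head-to-head wins it collects. The item with the
    most wins is the (pairwise) Borda winner."""
    scores = {item: 0 for item in items}
    for winner, _loser in comparisons:
        scores[winner] += 1
    return scores

# Each tuple is one annotator's vote: (preferred, rejected).
votes = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
scores = pairwise_win_scores(votes, ["a", "b", "c"])
winner = max(scores, key=scores.get)
```

Even in this tiny example the winner is decided by aggregate win counts, not unanimity: "a" loses one matchup yet still wins the tally, which is the behavior the DPO-Borda result attributes to preference-trained policies.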

7.1.3 The Critical Turn

A distinctive feature of this book is Chapter 6, which stepped back from technical methods to ask a normative question: Whose preferences matter?

Every technical choice in the preference learning pipeline—from who gets queried to how preferences are aggregated—embeds value judgments about whose interests count. The Fisher information machinery from Chapter 3 maximizes informativeness per query, but information-maximizing strategies can systematically underquery minority populations whose preferences are “easier” to model. Assuming IIA treats context-dependent preferences as irrational, potentially disadvantaging overburdened users whose choices are shaped by external constraints rather than stable preferences. Using existing data to define a reference policy encodes historical inequities into the learning objective.

This critical lens does not reject the technical methods developed in earlier chapters. Rather, it calls for conscious recognition that preference learning is not value-neutral. The designer’s choices shape who benefits and who is harmed. Fairness constraints, stratified evaluation, and participatory design are not add-ons but essential components of responsible system development.

7.2 Lessons from the Alignment Era

The rapid adoption of RLHF and DPO for language model alignment—beginning with InstructGPT in 2022 and accelerating through ChatGPT, Claude, and their successors—has stress-tested preference learning at a scale that the field’s founders could not have anticipated. Several lessons have emerged from this experience that inform how we should think about the methods developed in this book.

7.2.1 What Worked

Bradley-Terry is remarkably effective in practice. Despite its strong assumptions (IIA, homogeneous annotators, context-independence), the Bradley-Terry model has proven to be a workhorse for reward modeling and DPO. The simplicity that makes it analytically tractable—a single parameter per item, sigmoid link function, logistic likelihood—also makes it computationally efficient at scale. The DPO reformulation (Chapter 1, Section 1.2) eliminates the separate reward model entirely, showing that the Bradley-Terry likelihood can be directly optimized as a policy objective. This is a striking vindication of the classical models that form the backbone of this book.
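The DPO objective itself is compact enough to sketch: it is the Bradley-Terry negative log-likelihood applied to the implicit reward beta * (log pi - log pi_ref). A single-pair version with illustrative log-probabilities (not real model outputs):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: Bradley-Terry negative
    log-likelihood under the implicit reward
    r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigma(margin)

# Policy already prefers the chosen response relative to the reference:
# positive margin, so the loss falls below the chance level log 2.
loss = dpo_loss(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
```

The reference terms implement the implicit KL anchoring: the loss rewards moving probability mass toward the chosen response only relative to the reference policy, not in absolute terms.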

Active learning reduces annotation cost. The Fisher information framework from Chapter 3 has proven practically valuable for selecting which comparisons to annotate. ADPO (Section 3.5) and related methods demonstrate that intelligent query selection can reduce annotation budgets by 30-50% compared to random sampling, a substantial saving when human preference labels cost $1-10 each and training runs require tens of thousands of comparisons.

Uncertainty quantification matters for deployment. The Bayesian methods from Chapter 2 and the posterior-based decision-making from Chapter 4 (Thompson Sampling, PBO) are not merely academic exercises. In production systems, knowing how confident the model is in its preference predictions enables calibrated decisions about when to deploy new model versions, when to collect more data, and when to fall back to conservative defaults.

7.2.2 What Surprised Us

The gap between reward models and human judgment is smaller than expected—but the remaining gap is consequential. Modern reward models trained via Bradley-Terry achieve high agreement with held-out human preferences (often 70-80% accuracy on binary comparisons). But the errors are not random: they cluster around responses that are subtly harmful, sycophantic, or superficially impressive but factually wrong. These failure modes are precisely the cases where getting preferences right matters most, and they suggest that the IIA assumption—which treats all comparisons as equally well-modeled—may need to be relaxed for safety-critical applications.

Goodhart’s Law is real and pernicious. When models are optimized against a learned reward function, they inevitably find ways to achieve high reward without satisfying the underlying human preferences. This “reward hacking” is a direct consequence of the gap between the proxy (learned reward) and the target (true human preferences). The phenomenon manifests as models that produce longer responses (because annotators weakly prefer length), use confident-sounding language regardless of accuracy, or agree with the user’s stated views rather than providing honest assessments. Recognizing reward hacking as an instance of Goodhart’s Law—“when a measure becomes a target, it ceases to be a good measure”—connects it to a rich literature in economics and public policy.

Annotator disagreement is signal, not noise. Early RLHF work treated disagreement among annotators as measurement error to be averaged away. But Chapter 5’s analysis of preference aggregation suggests a different interpretation: annotators disagree because they genuinely hold different preferences, not (only) because they make mistakes. The DPO-Borda connection (Section 5.3) shows that aggregating via majority vote implicitly selects the Borda winner—which may not be the socially optimal choice when minority preferences carry important information. This has led to growing interest in pluralistic alignment approaches that maintain multiple preference models rather than collapsing to a single aggregate.

7.2.3 What Broke

The inversion problem at scale. Chapter 5’s discussion of the inversion problem (Section 5.7)—that observed behavior does not equal underlying preferences—takes on new urgency at the scale of commercial annotation. Annotators working under time pressure, fatigue, and piece-rate compensation produce preference labels that reflect their working conditions as much as their genuine judgments. Studies have documented systematic biases: annotators favor shorter responses when fatigued, prefer responses that match their cultural background, and exhibit position bias (favoring the first response presented). These are precisely the distortions that Chapter 6’s fairness analysis predicts—and they propagate through the entire pipeline.

IIA violations in practice. The Independence of Irrelevant Alternatives assumption, whose limitations we analyzed theoretically in Chapter 1 (the red-bus/blue-bus problem), manifests concretely in LLM alignment. Annotators’ preference between two responses depends on what other responses they have recently evaluated (contrast effects), how the comparison is framed (whether they are asked “which is better” versus “which is less harmful”), and even the order of presentation. These violations suggest that richer preference models—context-dependent Bradley-Terry, mixed logit, or the random-coefficients models from Chapter 1—may be needed for high-stakes applications.

Single-turn preferences do not capture long-horizon value. Most preference learning for language models collects feedback on individual responses in isolation. But users ultimately care about the quality of extended interactions—whether the model is consistently helpful over a conversation, whether it remembers context appropriately, and whether it knows when to ask clarifying questions versus provide direct answers. Extending preference models from single comparisons to trajectory-level feedback remains an open challenge that connects to the sequential decision-making framework of Chapter 4.

7.3 Advancing the Field

Machine learning from human preferences is a vibrant area of research with many open challenges. We highlight several directions that we believe are particularly important for the field’s advancement.

7.3.1 Beyond Pairwise Comparisons

Much of this book focused on pairwise comparisons—the simplest and most well-understood form of preference data. But humans express preferences in many ways:

  • Natural language feedback: “This response is helpful but could be more concise” conveys richer information than a binary comparison. Recent work on learning from critiques and natural language rewards suggests that language feedback can be formalized as soft constraints on the reward function, though the mapping from language to reward remains noisy and context-dependent.
  • Partial rankings: A user might confidently rank their top three choices but be uncertain about the rest. Plackett-Luce models (Equation 2.20) handle full rankings, but principled treatment of partial and uncertain rankings requires extensions that allow “I don’t know” or “these are roughly equivalent” responses.
  • Ratings and Likert scales: Absolute judgments on a scale provide cardinal (not just ordinal) information, though they introduce calibration challenges across annotators—one person’s 4 out of 5 may be another’s 3. Anchoring and cross-annotator calibration methods are active areas of research.
  • Implicit signals: Engagement time, scroll patterns, and click behavior reveal preferences without explicit queries. But these signals are heavily confounded by interface design, attention patterns, and the inversion problem (Chapter 5)—longer engagement may reflect confusion rather than interest.
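Of these modalities, partial rankings are the most direct extension of the models in this book: the Plackett-Luce likelihood applies to a top-k prefix by simply stopping the sequential choice process early, with unranked items appearing only in the normalizing sums. A minimal sketch (utilities and names are illustrative):

```python
import math

def plackett_luce_top_k_loglik(ranking, utilities):
    """Log-likelihood of a partial (top-k) ranking under Plackett-Luce:
    each ranked item is chosen from the remaining pool in turn; items
    the user never ranked contribute only to the normalizers."""
    remaining = list(utilities.keys())
    loglik = 0.0
    for item in ranking:  # the ranked prefix only
        log_z = math.log(sum(math.exp(utilities[j]) for j in remaining))
        loglik += utilities[item] - log_z
        remaining.remove(item)
    return loglik

# User confidently ranks their top two of four items; "c" and "d"
# are left unranked rather than forced into a full ordering.
utils = {"a": 1.0, "b": 0.0, "c": 0.0, "d": -1.0}
ll = plackett_luce_top_k_loglik(["a", "b"], utils)
```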

Integrating these diverse modalities into unified preference models is an important open problem. Each modality has strengths: pairwise comparisons are reliable but expensive; implicit signals are cheap but noisy; natural language is rich but difficult to formalize. Hybrid approaches that combine multiple feedback types, weighting each appropriately, hold promise for more efficient and accurate preference learning.

7.3.2 Scalable Oversight

As AI systems become more capable, the tasks they perform become harder for humans to evaluate. A language model writing code may produce solutions that work but are subtly flawed in ways a non-expert cannot detect. A scientific assistant might propose experiments whose validity requires deep domain expertise to assess.

Scalable oversight addresses this challenge through techniques like:

  • Recursive reward modeling: Decomposing complex evaluation tasks into simpler subtasks that humans can reliably judge. The key insight is that while evaluating a complete solution may be hard, verifying individual steps is often feasible. This connects to the preference elicitation framework of Chapter 3: by decomposing evaluation, we can apply Fisher information analysis to each subtask and select the most informative verification queries.
  • AI-assisted evaluation: Using AI systems to help humans provide more accurate feedback, while remaining vigilant about circular dependencies. RLAIF (reinforcement learning from AI feedback) replaces human annotators with AI evaluators, raising questions about whether the preference signal is genuinely informative or merely reflects the evaluator model’s own biases.
  • Debate and amplification: Adversarial procedures where AI systems critique each other’s outputs, surfacing flaws that a single evaluator might miss. This can be formalized as a zero-sum game, connecting to the von Neumann winner concept from Chapter 4’s dueling bandits framework.

The CIRL framework from Chapter 4 models the cooperative structure of oversight, where the human’s ability to evaluate depends on how the agent presents its outputs. But scaling these ideas to superhuman AI systems—where the agent’s capabilities exceed the evaluator’s—remains one of the most important open frontiers in AI safety.

7.3.3 Temporal Dynamics and Preference Change

This book has largely treated preferences as static: a user has fixed utilities \(V_j\) that we aim to estimate. In reality, preferences evolve over time. A user’s taste in music changes as they discover new genres; a society’s norms around acceptable content shift across decades; an annotator’s judgment drifts as they gain experience with the task.

Modeling temporal preference dynamics introduces several challenges:

  • Distinguishing drift from noise: Is a changed preference a genuine evolution (the user’s taste matured) or a noisy observation (the user was tired that day)? The Bayesian framework from Chapter 2 can be extended with time-varying priors—for example, a random walk on utilities—but choosing the right rate of change requires domain knowledge.
  • Non-stationarity in online learning: The Elo rating system (Chapter 2, Section 2.5) implicitly handles non-stationarity through its constant step size \(K\), which weights recent observations more heavily. But the optimal \(K\) depends on the rate of change, creating a tradeoff between responsiveness and stability.
  • Retroactive fairness: If a system learns preferences from historical data that reflects outdated norms, deploying those preferences perpetuates past values. Chapter 6’s pipeline framework (\(\mathcal{E} \to \mathcal{L} \to \mathcal{A} \to \mathcal{D}\)) provides a lens for auditing where temporal biases enter.
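The responsiveness/stability tradeoff created by a constant step size can be seen in a few lines: after a jump in the latent utility, a large step size tracks quickly while a small one lags. A toy sketch, assuming a simple step change rather than a true random walk (all values illustrative):

```python
def track_utility(observations, k):
    """Online estimate of a latent utility with constant step size k
    (the Elo-style knob): est <- est + k * (obs - est), which
    exponentially downweights older observations."""
    est = 0.0
    for y in observations:
        est += k * (y - est)
    return est

# The latent utility jumps from 0 to 1 halfway through the stream.
obs = [0.0] * 50 + [1.0] * 50
slow = track_utility(obs, k=0.02)  # stable, but lags the change
fast = track_utility(obs, k=0.5)   # responsive, but noise-sensitive
```

With noisy observations the comparison reverses in stationary periods: the large step size chases noise while the small one averages it away, which is why the optimal k depends on the (usually unknown) rate of drift.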

7.3.4 Heterogeneity and Personalization

This book has established that preference heterogeneity is not a nuisance but a fundamental feature of human populations. Chapter 1 introduced mixture models and K-dimensional factor models to capture latent preference structure. Chapter 5’s analysis of the inversion problem (Section 5.7) showed that observed behavior can systematically misrepresent preferences for overburdened groups, and the liberal paradox (Section 6.3.4.1) revealed when respecting diverse preferences conflicts with collective efficiency. Chapter 6 demonstrated how these issues compound through feedback loops across the preference learning pipeline.

Yet significant open problems remain:

  • Scalable mixture models that identify latent preference clusters in high-dimensional settings without requiring users to self-identify. Variational inference and amortized methods show promise for scaling mixture-of-experts preference models to millions of users.
  • Contextual preference models that account for how the same user’s preferences vary with cognitive load, time pressure, and framing—building on the contextual Bradley-Terry models introduced in Chapter 6. The challenge is distinguishing genuine context-dependence from IIA violations.
  • Personalization with fairness constraints that provide tailored assistance while ensuring no group is systematically underserved. Multi-objective optimization frameworks that Pareto-balance personalization and equity are an active area of research.
  • Cold-start and few-shot preference learning: How should a system behave for a new user with no preference history? Transfer learning from population-level models, combined with the active elicitation strategies from Chapter 3, can accelerate personalization while managing uncertainty.

The tension between personalization (giving each user what they want) and fairness (ensuring equitable outcomes across groups) is not fully resolved by the technical tools we have today. Progress requires both better algorithms and clearer normative frameworks for navigating these tradeoffs.

7.3.5 Fairness and Value Alignment

Chapter 6 introduced fairness considerations; making them operational requires ongoing research:

  • Fairness metrics for preference learning: Standard fairness definitions (demographic parity, equalized odds) were developed for classification. What are the appropriate analogs for preference models? When is it fair for a system to learn different preference models for different groups? The pipeline framework from Chapter 6 suggests that fairness must be assessed at each stage (\(\mathcal{E}, \mathcal{L}, \mathcal{A}, \mathcal{D}\)), not just at the final output.
  • Value alignment under disagreement: When stakeholders have genuinely conflicting preferences, what should a system optimize for? Majority preferences? Pareto improvements? Maximin welfare? The social choice results of Chapter 5—particularly Arrow’s theorem—tell us that no aggregation rule is universally satisfactory, but they also point toward domain restrictions and mechanism design as paths forward.
  • Transparency and contestability: Users should understand why a system makes the recommendations it does, and have meaningful recourse when they disagree. This requires preference models that are not only accurate but interpretable—a challenge for the high-dimensional factor models and GP-based approaches developed in earlier chapters.
  • Pluralistic alignment: Rather than learning a single aggregate preference model, maintain multiple models reflecting different value systems and allow users or institutions to choose among them. This connects to the mixture models of Chapter 1 and the multi-issue voting framework of Chapter 5.

These are not purely technical questions—they require engagement with philosophy, law, and democratic theory. But technical researchers must be part of the conversation, ensuring that fairness frameworks are implementable and that their limitations are well understood.

7.3.6 Foundation Model Alignment

Large language models trained via RLHF and DPO represent the most prominent current application of preference learning. Yet significant challenges remain:

  • Reward hacking: Models may find ways to achieve high reward without actually satisfying human preferences—exploiting ambiguities in the reward signal rather than genuinely helpful behavior. This is a manifestation of Goodhart’s Law: when the learned reward becomes the optimization target, it ceases to be a faithful proxy for human preferences. Mitigations include reward model ensembles, KL-constrained optimization (as in DPO’s implicit KL penalty), and regular recalibration against fresh human judgments.
  • Distributional shift: Preferences collected in one context may not generalize to deployment settings, especially as models are used for novel tasks. The Bradley-Terry model assumes a fixed set of items with stable utilities, but in practice the “items” (model responses) change as the model improves, creating a moving target that requires online adaptation.
  • Constitutional AI and rule-based approaches: Can we move beyond learning from examples to learning from principles? Constitutional AI proposes specifying desirable behavior through written rules rather than example comparisons, raising questions about how to reconcile principle-based and preference-based approaches. The connection to single-peaked preferences (Chapter 5) is suggestive: if principles define a partial order over responses, they may restrict the preference domain enough to escape aggregation impossibilities.
  • Multi-objective alignment: Real-world alignment requires balancing multiple objectives simultaneously—helpfulness, harmlessness, honesty, and more. This is fundamentally a multi-objective optimization problem where different objectives can conflict, connecting to the social choice framework of Chapter 5.

The alignment of foundation models with human values is arguably the most consequential application of preference learning today. Progress in this area requires both fundamental research on preference modeling and careful empirical work on what actually makes language model outputs helpful, harmless, and honest.

7.4 Capstone Projects

The chapters of this book develop a rich toolkit—from preference models and estimation to active elicitation, sequential decisions, social aggregation, and fairness—but the deepest understanding comes from applying these ideas to open-ended problems. The following capstone projects are designed as extended investigations suitable for individual or small-team work. Each project is inspired by the format of the CS329H course project: students produce a pre-analysis plan, a final manuscript in NeurIPS format (up to 8 pages), and reproducible code. Projects span the book’s six main chapters and range from focused empirical studies to more ambitious research contributions. Difficulty is indicated by stars: * (one person, roughly four weeks), ** (one to two people, roughly six weeks), and *** (two to three people, eight or more weeks).

7.4.1 Project 1: When Does IIA Fail? Detecting and Modeling Context Effects in Preference Data (*)

Description. The Independence of Irrelevant Alternatives assumption (Section 1.8) is the backbone of Bradley-Terry and Plackett-Luce models, yet it is routinely violated in practice—the “red-bus/blue-bus” problem being the canonical example. This project investigates when and how much IIA violations matter by designing controlled experiments (or analyzing existing preference datasets) to detect context effects, then comparing Bradley-Terry against richer models that relax IIA.

Suggested methods.

  • Fit a standard Bradley-Terry model (Section 1.6.2) and a mixed logit or nested logit model (Section 1.11) to the same preference data.
  • Use likelihood ratio tests or AIC/BIC (Section 2.7) to quantify whether the richer model provides a statistically significant improvement.
  • Construct synthetic “red-bus/blue-bus” scenarios by introducing near-duplicate items into a choice set and measuring the distortion in estimated utilities.
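The model-comparison step in this project reduces to standard nested-model machinery. A sketch with hypothetical fitted log-likelihoods (all numbers are placeholders, not results; real values come from the fits):

```python
def aic(loglik, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * loglik

def lr_statistic(loglik_restricted, loglik_full):
    """Likelihood-ratio statistic for nested models; under the null it
    is approximately chi-squared with df = number of extra parameters."""
    return 2.0 * (loglik_full - loglik_restricted)

# Placeholder fits: Bradley-Terry (8 params) vs. a mixed-logit
# extension (12 params) on the same preference data.
stat = lr_statistic(loglik_restricted=-512.4, loglik_full=-498.1)
bt_aic = aic(-512.4, 8)
mixed_aic = aic(-498.1, 12)
```

In this hypothetical case the richer model wins on AIC despite its extra parameters; the project's empirical question is whether real preference data shows improvements of this magnitude.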

Expected deliverables. A manuscript reporting (1) a dataset with documented IIA violations, (2) a quantitative comparison of Bradley-Terry vs. at least one IIA-relaxing model, and (3) practical guidelines for when practitioners should move beyond Bradley-Terry. Code for all experiments with reproducible results.

7.4.2 Project 2: Bayesian vs. Frequentist Preference Learning on Real Annotation Data (*)

Description. Chapter 2 presents three estimation paradigms—maximum likelihood (Section 2.3), Bayesian inference via MCMC (Section 2.4), and online learning via the Elo system (Section 2.5). How do these approaches compare on real-world human preference data, especially in low-data regimes where uncertainty quantification matters most? This project conducts a systematic comparison using publicly available LLM preference datasets.

Suggested methods.

  • Implement MLE with L2 regularization (Section 2.6), Bayesian inference with the Laplace approximation (Section 2.4.2), and Elo updates (Section 2.5) for the same Bradley-Terry model.
  • Evaluate predictive accuracy (AUC), calibration, and posterior coverage as a function of dataset size by subsampling.
  • Assess computational cost and convergence diagnostics (Section 2.8).

Expected deliverables. A manuscript with learning curves showing accuracy and calibration vs. sample size for each method, a discussion of when Bayesian uncertainty is worth the computational overhead, and fully documented code including data preprocessing scripts.

7.4.3 Project 3: Active Preference Elicitation for Personalized Recommendation (**)

Description. When human feedback is expensive, active query selection can dramatically reduce annotation cost. This project builds an end-to-end active preference elicitation system for a concrete domain—such as food preferences, music taste, or movie recommendations—and measures how much the Fisher information framework improves over random querying.

Suggested methods.

  • Implement a Rasch or factor model (Section 1.6.3) with Fisher-information-based query selection (Section 3.2, Section 2.4.2.2).
  • Compare A-optimal, D-optimal, and E-optimal criteria (Section 3.2.1) for selecting the next comparison to present.
  • Collect real preference data from at least 10 participants (or use a realistic simulator) and measure the reduction in queries needed to reach a target prediction accuracy.
  • Optionally extend to the pairwise setting using the active pair selection framework from Section 3.2.2.

Expected deliverables. A working interactive system (command-line or web-based) that adaptively selects queries, an empirical evaluation showing query savings relative to random selection, and a manuscript discussing the practical benefits and limitations of active elicitation.

7.4.4 Project 4: Dueling Bandits for A/B Testing with Preference Feedback (**)

Description. Traditional A/B testing compares treatments using scalar metrics (click-through rate, revenue), but in many settings the outcome of interest is a human preference judgment—e.g., “Which UI layout do you prefer?” This project formulates A/B testing as a dueling bandit problem and evaluates whether dueling bandit algorithms can identify the best variant faster than naive round-robin comparisons.

Suggested methods.

  • Implement at least two dueling bandit algorithms from Chapter 4, such as RUCB or Double Thompson Sampling, targeting different winner concepts (Condorcet vs. Borda, Section 4.5.3).
  • Define appropriate regret metrics for the A/B testing setting and track cumulative regret over time.
  • Simulate a realistic A/B testing scenario with 5–10 variants and heterogeneous user preferences (using mixture models from Section 1.6.3 to generate synthetic users).
  • Compare sample efficiency against uniform random pairing and standard Thompson Sampling (Section 4.3) on scalar proxy metrics.

Expected deliverables. A manuscript presenting regret curves and sample complexity comparisons, a discussion of which winner concept is most appropriate for A/B testing, and modular code that could be adapted to other dueling bandit applications.

7.4.5 Project 5: Preferential Bayesian Optimization for Human-in-the-Loop Design (**)

Description. Preferential Bayesian Optimization (PBO) enables optimization of complex objectives—such as the taste of a recipe, the aesthetics of a generated image, or the comfort of a robot trajectory—from pairwise human feedback alone. This project applies PBO to a concrete design problem where scalar evaluation is difficult but pairwise comparison is natural.

Suggested methods.

  • Implement the GP-based PBO framework from Chapter 4, including a Gaussian Process preference model with Laplace approximation (Section 2.4.2) and an acquisition function (Expected Improvement of the Copeland score or Thompson Sampling).
  • Choose a design domain with a continuous parameter space (e.g., color palettes, font configurations, audio equalization settings, or procedurally generated environments).
  • Collect preference feedback from at least 5 participants and track convergence to their preferred design.
  • Compare PBO against random search and, if a scalar proxy is available, against standard Bayesian Optimization.

Expected deliverables. A manuscript reporting convergence rates and user satisfaction, a working PBO system with a human-facing interface, and an analysis of how posterior uncertainty evolves with the number of comparisons.

7.4.6 Project 6: The DPO-Borda Connection: Empirical Validation on Multi-Annotator Data (**)

Description. Chapter 5 establishes a theoretical connection between Direct Preference Optimization and the Borda count (Section 5.3): DPO finds the policy that maximizes pairwise win rate, i.e., the Borda winner. But does this hold empirically when annotators are heterogeneous and preferences are noisy? This project tests the DPO-Borda connection on real multi-annotator preference data and investigates when the Borda winner diverges from other social choice solutions.

Suggested methods.

  • Use a publicly available multi-annotator preference dataset (e.g., from Chatbot Arena or similar).
  • Compute the Borda ranking, Condorcet ranking (if it exists), and plurality ranking from raw annotations.
  • Train a DPO model and compare its implicit ranking to each social choice solution.
  • Analyze cases where the rankings diverge: are these cases where annotators have genuinely heterogeneous preferences, or where the data is simply noisy? Use the mixture model framework from Section 1.6.3 to test for latent preference clusters.

Expected deliverables. A manuscript quantifying the agreement between DPO-implied rankings and classical social choice rankings, an analysis of when and why they diverge, and reproducible code for all experiments.
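Computing the classical rankings from raw annotations is straightforward; a possible starting point, on a made-up three-item toy dataset, might look like this (item names and counts are illustrative):

```python
# Sketch: Borda and Condorcet rankings from raw pairwise annotations.
from collections import defaultdict

# each record is (winner, loser), possibly from different annotators
annotations = [
    ("A", "B"), ("A", "B"), ("B", "A"),
    ("A", "C"), ("C", "A"), ("A", "C"),
    ("B", "C"), ("B", "C"), ("C", "B"),
]

wins = defaultdict(lambda: defaultdict(int))
items = set()
for w, l in annotations:
    wins[w][l] += 1
    items.update((w, l))

def p_beats(a, b):
    n = wins[a][b] + wins[b][a]
    return wins[a][b] / n if n else 0.5

# Borda score: average win rate against all other items
borda = {a: sum(p_beats(a, b) for b in items if b != a) / (len(items) - 1)
         for a in items}
borda_ranking = sorted(items, key=borda.get, reverse=True)

# Condorcet winner: beats every other item head-to-head (may not exist)
condorcet = next((a for a in items
                  if all(p_beats(a, b) > 0.5 for b in items if b != a)),
                 None)

print(borda_ranking, condorcet)   # → ['A', 'B', 'C'] A
```

On real Chatbot Arena-style data the interesting cases are exactly the ones where `condorcet` is `None` or differs from the head of `borda_ranking`; those are the divergence cases the project should analyze.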

7.4.7 Project 7: Auditing a Preference Learning Pipeline for Compounding Bias (**)

Description. Chapter 6 demonstrates that small biases at each stage of the preference learning pipeline—elicitation, learning, aggregation, decision—can compound into large disparities (Section 6.3.5). This project operationalizes that insight by constructing a complete pipeline simulation, injecting realistic biases at each stage, and measuring how they interact.

Suggested methods.

  • Build a simulated preference learning pipeline following the four-stage framework (\(\mathcal{E} \to \mathcal{L} \to \mathcal{A} \to \mathcal{D}\)) from Section 6.2.1.
  • Model two or more user groups with different base rates of participation, different noise levels, and different preference distributions.
  • At each pipeline stage, introduce a specific bias mechanism: (1) undersampling a group in elicitation (Section 6.3.1), (2) IIA misspecification for context-dependent preferences (Section 6.3.2), (3) volume-weighted aggregation (Section 6.3.3), and (4) liberal vs. illiberal decision-making (Section 6.3.4).
  • Measure group-level disparities in recommendation quality at the pipeline output, and decompose total disparity into contributions from each stage.

Expected deliverables. A manuscript with a bias decomposition showing how much each pipeline stage contributes to overall unfairness, concrete recommendations for which stage to prioritize when mitigating bias, and a reusable simulation framework.
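A stripped-down version of the simulation, showing just the elicitation and aggregation stages interacting, can fit in a page. Everything below (group sizes, participation rates, utilities) is made up for illustration; the project would extend this to all four stages and decompose the disparity.

```python
# Toy sketch of bias compounding: undersampling at elicitation plus
# volume-weighted aggregation. All parameters here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Two equally sized groups with opposite preferences over items {0, 1}
true_util = {"g0": np.array([1.0, 0.0]), "g1": np.array([0.0, 1.0])}
participation = {"g0": 0.9, "g1": 0.3}   # stage E: g1 is undersampled

comparisons = []
for g, rate in participation.items():
    for _ in range(1000):
        if rng.random() > rate:
            continue                      # member never reaches elicitation
        u = true_util[g]
        p0 = 1 / (1 + np.exp(-(u[0] - u[1])))   # Bradley-Terry choice
        comparisons.append((0, 1) if rng.random() < p0 else (1, 0))

# Stage A: volume-weighted aggregation (every comparison counts equally)
w01 = sum(1 for c in comparisons if c == (0, 1))
share_item0 = w01 / len(comparisons)

# Stage D: serve the majority item; measure per-group outcome quality
winner = 0 if share_item0 > 0.5 else 1
quality = {g: float(true_util[g][winner]) for g in true_util}
print(winner, quality)
```

Even though the two groups are the same size, the undersampled group's preferred item loses under volume weighting; re-running with person-weighted aggregation (one vote per participant per group) isolates the aggregation stage's contribution.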

7.4.8 Project 8: Mechanism Design for Incentive-Compatible Preference Elicitation (***)

Description. When users have strategic incentives—e.g., inflating ratings to influence recommendations, or misreporting preferences in a voting system—naive preference elicitation produces biased data. This project designs and evaluates an incentive-compatible mechanism for a specific preference learning application, drawing on the mechanism design framework from Chapter 5 (Section 5.10).

Suggested methods.

  • Choose a concrete setting where strategic misreporting is plausible (e.g., peer review, course evaluation, or collaborative filtering).
  • Design a mechanism that incentivizes truthful reporting, building on VCG mechanisms or peer prediction from Section 5.10.
  • Prove or empirically demonstrate the mechanism’s incentive compatibility properties.
  • Simulate strategic agents using a mixture of truthful and strategic types, and compare preference estimation quality under your mechanism vs. naive elicitation.
  • Analyze the mechanism’s efficiency cost (the “price of incentive compatibility”) relative to a setting with truthful agents.

Expected deliverables. A manuscript with a formal mechanism description, theoretical analysis of incentive properties, simulation results comparing truthful vs. strategic settings, and implementation code. If applicable, include a discussion of how the mechanism interacts with the fairness considerations from Chapter 6 (Section 6.5).
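To make the peer prediction option concrete: under an output-agreement payment (pay an annotator when their report matches a random peer's), truthful reporting beats misreporting whenever signals are positively correlated through the true state. The simulation below is a simplified illustration of that effect, not the Section 5.10 mechanism itself; the signal accuracy and payment rule are assumptions.

```python
# Sketch: output-agreement peer prediction payments for a truthful vs. a
# misreporting annotator. Signal model and parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
ACC = 0.8     # probability an annotator's signal matches the true state

def mean_payment(deviator_flips, rounds=20000):
    pay = 0
    for _ in range(rounds):
        state = rng.integers(2)                   # binary ground truth
        sig_dev = state if rng.random() < ACC else 1 - state
        sig_peer = state if rng.random() < ACC else 1 - state
        report_dev = 1 - sig_dev if deviator_flips else sig_dev
        report_peer = sig_peer                    # peers report truthfully
        pay += int(report_dev == report_peer)     # paid on agreement
    return pay / rounds

truthful, flipping = mean_payment(False), mean_payment(True)
print(truthful, flipping)   # truthful agreement rate should be higher
```

Analytically, truthful agreement occurs with probability ACC² + (1 − ACC)² = 0.68 and flipping with 0.32, so truth-telling is a best response here; the project's task is to establish (or stress-test) the analogous property in the chosen real setting.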

7.4.9 Project 9: Cooperative Inverse Reinforcement Learning for Interactive Preference Learning (***)

Description. The CIRL framework from Chapter 4 models preference learning as a cooperative game between a human and an agent, where the agent’s actions serve both to accomplish tasks and to elicit information about the human’s reward function. This project implements a CIRL-style agent in a simplified domain and studies how cooperative interaction compares to passive preference observation.

Suggested methods.

  • Define a gridworld or simple continuous environment where a human has a latent reward function over outcomes.
  • Implement a CIRL agent that maintains a belief over reward functions and selects actions to jointly maximize expected reward and information gain, following the framework from Chapter 4.
  • Compare against (1) a passive agent that observes demonstrations and infers preferences via inverse reinforcement learning, and (2) an active agent that can ask explicit comparison queries (Section 3.4).
  • Measure convergence rate to the true reward function, cumulative regret, and the human’s perceived interaction quality (if using human participants) or simulated interaction cost.

Expected deliverables. A manuscript presenting the CIRL formulation for your domain, regret comparisons across the three approaches, an analysis of when cooperative interaction provides the most benefit (e.g., ambiguous rewards, high-dimensional preference spaces), and well-documented code.
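The Bayesian core of such an agent, stripped of the environment dynamics, is a belief over reward hypotheses updated from a Boltzmann-rational human's responses, with queries chosen by expected information gain. The sketch below is a heavy simplification of the Chapter 4 framework (five point-reward hypotheses, pairwise state queries instead of joint task actions); the hypothesis set and rationality parameter are assumptions.

```python
# Minimal sketch of the belief core of a CIRL-style agent: discrete reward
# hypotheses, Boltzmann-rational human, info-gain query selection.
import numpy as np

rng = np.random.default_rng(0)
states = np.arange(5)
# hypotheses: reward is 1 at one of the five states, 0 elsewhere
hypotheses = [np.where(states == k, 1.0, 0.0) for k in states]
true_h = 3
belief = np.ones(len(hypotheses)) / len(hypotheses)

def human_prefers(a, b, r, beta=8.0):
    """Boltzmann-rational human: P(choose state a over state b)."""
    return 1 / (1 + np.exp(-beta * (r[a] - r[b])))

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

for _ in range(12):
    # choose the query (a, b) with highest expected information gain
    best, best_gain = None, -1.0
    for a in states:
        for b in states:
            if a >= b:
                continue
            p_a = np.array([human_prefers(a, b, r) for r in hypotheses])
            m = belief @ p_a                     # predictive P(a chosen)
            post_a = belief * p_a / m
            post_b = belief * (1 - p_a) / (1 - m)
            gain = (entropy(belief)
                    - m * entropy(post_a) - (1 - m) * entropy(post_b))
            if gain > best_gain:
                best, best_gain = (a, b), gain
    a, b = best
    # simulated human answers according to the true reward hypothesis
    ans_a = rng.random() < human_prefers(a, b, hypotheses[true_h])
    like = np.array([human_prefers(a, b, r) for r in hypotheses])
    belief *= like if ans_a else (1 - like)
    belief /= belief.sum()

print(int(np.argmax(belief)))   # belief should concentrate on hypothesis 3
```

Swapping the query step for environment actions whose outcomes are informative about the reward recovers the CIRL flavor; the passive-IRL baseline simply omits the information-gain term.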

7.4.10 Project 10: Pluralistic Alignment—Learning and Serving Multiple Preference Models (***)

Description. Most preference learning systems collapse diverse human preferences into a single aggregate model. Drawing on Arrow’s Impossibility Theorem (Section 5.2), the DPO-Borda connection (Section 5.3), and the fairness framework of Chapter 6, this project explores an alternative: learning multiple preference models that represent distinct value systems, and allowing users or institutions to choose among them.

Suggested methods.

  • Start with a multi-annotator preference dataset and fit a mixture model to discover latent preference clusters (Section 1.6.3, Section 2.4).
  • Train separate DPO or reward models for each cluster and evaluate whether cluster-specific models outperform a single aggregate model on within-cluster prediction accuracy.
  • Investigate the social choice properties of the cluster-level approach: does serving cluster-specific models avoid Arrow-style impossibilities? Under what conditions does it introduce new fairness concerns (e.g., stereotype reinforcement)?
  • Design a user-facing mechanism for selecting among preference models, and evaluate whether users can meaningfully choose the model that best represents their values.

Expected deliverables. A manuscript with (1) evidence that latent preference clusters exist in real data, (2) a comparison of pluralistic vs. aggregate alignment, (3) a discussion of the fairness trade-offs involved, and (4) code for cluster discovery, model training, and evaluation. This project connects nearly every chapter of the book and is suitable for an ambitious team.
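For the cluster-discovery step, the two-item special case of the Section 1.6.3 mixture framework reduces to EM over Bernoulli choice rates, which makes a compact warm-up. The simulated annotator population and cluster parameters below are illustrative; on real data each cluster would carry a full Bradley-Terry utility vector rather than a single probability.

```python
# Sketch: EM for a two-cluster mixture over pairwise choices (the
# two-item special case of a mixture of Bradley-Terry models).
import numpy as np

rng = np.random.default_rng(0)

# Simulated annotators: 30 prefer item 0 (p=0.9), 30 prefer item 1 (p=0.1)
n_votes = 20
wins0 = np.concatenate([rng.binomial(n_votes, 0.9, 30),
                        rng.binomial(n_votes, 0.1, 30)])

pi = np.array([0.5, 0.5])       # mixing weights
p = np.array([0.6, 0.4])        # per-cluster P(choose item 0)

for _ in range(50):
    # E-step: responsibility of each cluster for each annotator
    log_lik = (wins0[:, None] * np.log(p)
               + (n_votes - wins0)[:, None] * np.log(1 - p) + np.log(pi))
    log_lik -= log_lik.max(axis=1, keepdims=True)
    resp = np.exp(log_lik)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights and per-cluster choice probabilities
    pi = resp.mean(axis=0)
    p = (resp * wins0[:, None]).sum(axis=0) / (resp.sum(axis=0) * n_votes)

print(np.round(np.sort(p), 2))   # clusters should recover about 0.1 and 0.9
```

If the fitted cluster parameters collapse to a single value on real data, that is itself evidence against latent preference clusters, which feeds directly into deliverable (1).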

7.5 A Practitioner’s Checklist

For readers building preference learning systems in practice, we distill the book’s lessons into a checklist of questions to address at each stage of system development.

Data Collection (Chapters 1, 3)

  • What type of preference data will you collect? Pairwise comparisons, rankings, ratings, or natural language? Each carries different assumptions and failure modes.
  • Who are your annotators, and how are they incentivized? Piece-rate compensation can introduce the fatigue and speed-accuracy tradeoffs that create the inversion problem (Chapter 5, Section 5.7).
  • Are you using active query selection? If human feedback is expensive, the Fisher information framework (Chapter 3) can substantially reduce the number of comparisons needed.
  • Have you considered position bias, framing effects, and other presentation artifacts that violate IIA?

Modeling (Chapters 1, 2)

  • Is Bradley-Terry appropriate for your setting, or do you need richer models? If your population is heterogeneous, consider mixture models or factor models from Chapter 1.
  • Are you quantifying uncertainty? Point estimates (MLE) are fast but miss the posterior information needed for downstream decision-making. The Laplace approximation provides a practical middle ground.
  • How are you evaluating your model? Use multiple metrics—AUC for ranking, calibration error for probabilistic predictions, and subgroup analysis for fairness (Chapter 6).
  • Is your model regularized appropriately? Too little regularization overfits to noisy preferences; too much shrinks all utilities toward zero.
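
The interaction between fit and regularization in the last two items can be seen in a few lines. The sketch below fits an L2-regularized Bradley-Terry model by gradient ascent on a made-up dataset; the comparison data, penalty strength, and learning rate are illustrative.

```python
# Sketch: L2-regularized Bradley-Terry MLE by gradient ascent.
# Dataset and hyperparameters below are illustrative.
import numpy as np

comparisons = [(0, 1), (0, 1), (0, 2), (1, 2), (1, 2), (0, 1)]  # (winner, loser)
n_items, lam, lr = 3, 0.1, 0.5
u = np.zeros(n_items)

for _ in range(500):
    grad = -lam * u                   # L2 penalty shrinks utilities to 0
    for w, l in comparisons:
        p = 1 / (1 + np.exp(-(u[w] - u[l])))
        grad[w] += 1 - p              # d/du_w of log sigmoid(u_w - u_l)
        grad[l] -= 1 - p
    u += lr * grad

print(list(np.argsort(-u)))   # ranking by fitted utility: [0, 1, 2]
```

Increasing `lam` shrinks all utilities toward zero (and predicted win probabilities toward 1/2), which is exactly the over-regularization failure mode flagged above; the Hessian of this objective also yields a Laplace posterior for the uncertainty quantification item.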

Decision-Making (Chapter 4)

  • Are you in a measurement setting (reduce uncertainty) or a reward maximization setting (manage uncertainty)? The choice determines whether elicitation (Chapter 3) or bandit/optimization methods (Chapter 4) are appropriate.
  • If optimizing sequentially, how are you balancing exploration and exploitation? Thompson Sampling provides a principled default.
  • What is your winner concept? Condorcet, Borda, and von Neumann winners can diverge—the choice encodes assumptions about what “best” means.

Aggregation and Fairness (Chapters 5, 6)

  • Whose preferences are included, and whose are excluded? Convenience sampling overrepresents easily reached populations.
  • How are you aggregating across annotators? Majority vote selects the Borda winner (Chapter 5, Section 5.3)—is that what you want?
  • Have you audited for compounding unfairness? Small biases at each pipeline stage can multiply through feedback loops (Chapter 6).
  • Are your value tradeoffs explicit and documented? If not, they are implicit and unaccountable.

Deployment and Monitoring

  • Are you monitoring subgroup performance over time, not just aggregate metrics?
  • Do you have a plan for preference drift? User preferences and societal norms change; static models decay.
  • Is there a mechanism for users to contest or correct the system’s preference model?
  • Have you considered the system’s role as a choice architect—how the options you present shape the preferences you observe?

7.6 Scope and Limitations

This book has deliberately focused on the mathematical and algorithmic foundations of preference learning; several important topics therefore lie outside its scope.

Cognitive science of preference formation. We model preferences as given—either as fixed utilities or as draws from a distribution—without deeply examining how humans form preferences in the first place. The rich literature on bounded rationality, heuristics and biases, and constructed preferences suggests that human choice is more complex than any random utility model captures. Readers interested in these foundations should consult the work of Kahneman, Tversky, Slovic, and their successors.

Large-scale systems engineering. The algorithms in this book are presented at a scale suitable for understanding and experimentation. Production preference learning systems at major technology companies handle billions of data points, thousands of annotators, and models with billions of parameters. The engineering challenges of distributed training, data pipeline management, annotation quality control, and real-time serving are substantial but largely orthogonal to the algorithmic questions we address.

Legal and regulatory frameworks. The deployment of preference learning systems increasingly intersects with privacy regulation (GDPR’s right to explanation, data minimization), anti-discrimination law, and sector-specific oversight (healthcare, finance, education). These legal dimensions shape what data can be collected, how it can be used, and what accountability mechanisms must be in place. We have touched on related ethical considerations in Chapter 6, but a thorough treatment of the regulatory landscape is beyond our scope.

Multi-modal and embodied preferences. This book focuses primarily on preferences over discrete items or text responses. Preferences in robotics (over trajectories), in design (over visual layouts), and in multi-modal settings (over combinations of text, images, and audio) raise additional challenges that we have only briefly touched upon.

7.7 Final Thought

At its core, this book has been about a simple idea: rather than specifying objectives by hand, we can learn them from human feedback. This idea is powerful because it promises AI systems that serve human values rather than proxy metrics—systems that improve as we better understand what we want.

But the idea is also subtle. Human preferences are noisy, inconsistent, context-dependent, and heterogeneous. They can be manipulated, and they evolve over time. Learning from preferences does not automatically solve the alignment problem; it transforms it into questions about whose preferences to learn from, how to aggregate them, and when to defer to them versus override them.

The technical methods in this book—Bradley-Terry models, Bayesian inference, Fisher information, Thompson Sampling, preferential Bayesian optimization, the DPO-Borda connection, CIRL, mechanism design—are tools, and they are more deeply connected than a chapter-by-chapter reading might suggest. The Elo system is stochastic gradient descent on Bradley-Terry; DPO finds the Borda winner; Fisher information drives both active elicitation and fairness auditing; CIRL reframes passive data collection as cooperative interaction. These connections form a coherent framework, not merely a collection of techniques. But like all tools, their value depends on how they are used. The same algorithms that personalize recommendations to delight users can personalize manipulation to exploit them.
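
The first of those connections is small enough to write out. The sketch below shows the standard Elo update as one stochastic gradient step on the Bradley-Terry log-likelihood (the usual base-10, 400-point scaling is absorbed into K); the ratings and K-factor are the conventional illustrative values.

```python
# Sketch: the Elo update as one SGD step on the Bradley-Terry
# log-likelihood, using the conventional base-10 / 400-point scaling.

def elo_update(r_w, r_l, K=32.0):
    """One game: winner rating r_w, loser rating r_l."""
    expected = 1 / (1 + 10 ** ((r_l - r_w) / 400))   # BT win probability
    # gradient of the observed win's log-likelihood w.r.t. r_w is
    # (1 - expected), with the 400 / ln(10) scaling folded into K
    delta = K * (1 - expected)
    return r_w + delta, r_l - delta

r_a, r_b = elo_update(1500.0, 1500.0)
print(r_a, r_b)   # → 1516.0 1484.0
```

The update moves each rating in the direction that makes the observed outcome more likely, with step size proportional to how surprising the outcome was, which is precisely stochastic gradient ascent on the Bradley-Terry model.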

We hope this book has equipped you not only with technical skills but also with the conceptual frameworks to use them wisely. The field of machine learning from human preferences is young and evolving rapidly. The researchers and practitioners who engage with it today will shape how AI systems learn from and serve humanity in the decades to come.

We invite you to contribute—to develop new methods, to apply existing ones thoughtfully, to critique approaches that fall short, and to engage with the broader societal implications of this work. The challenge of building AI systems that truly align with human values is too important to be left to any single discipline or community. It requires the collective effort of computer scientists, economists, psychologists, philosophers, policymakers, and the public.

The journey from this book’s mathematical foundations to AI systems that reliably serve human flourishing is long. But every improvement in how we learn from human preferences brings us closer to that goal. We hope you will join us on this path.