Machine Learning from Human Preferences
Introduction
Machine learning is increasingly shaping various aspects of our lives, from education and healthcare to scientific discovery. A key challenge in developing trustworthy intelligent systems is ensuring they align with human preferences. Learning from human feedback offers a promising solution to this challenge. This book introduces the foundations and practical applications of machine learning from human preferences. Instead of manually predefining the learning goal, the book presents preference-based learning that incorporates human feedback to guide the learning process, drawing insights from related fields such as economics, psychology, and human-computer interaction. By the end of this book, readers will be equipped with the key concepts and tools needed to design systems that effectively align with human preferences.
The book is intended for researchers, practitioners, and students who are interested in integrating machine learning with human-centered applications. We assume some basic knowledge of probability and statistics, but provide sufficient background and references for readers to follow the main ideas. The book also provides illustrative program examples and datasets. The field of machine learning from human preferences is a vibrant area of research and practice with many open challenges and opportunities, and we hope that this book will inspire readers to further explore and advance this exciting field.
With this book, we hope both to enable greater use of human preferences in machine learning and to open the door to new data modalities as artificial intelligence systems become increasingly important.
Stanford, May 2025, THK
Structure of this book
The book has three parts, which introduce fundamental models, present learning paradigms, and discuss underlying assumptions.
Part 1: Background
Chapter 1 provides background on the axioms underlying comparisons and uncovers key modeling assumptions. It covers random preference models, the Independence of Irrelevant Alternatives (IIA), and types of comparison data (binary rankings, accept-reject, and lists). The chapter also discusses the main limitations of IIA arising from heterogeneity.
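To give a first taste of these models, the following minimal Python sketch (with hypothetical items and utility values) implements the Luce choice rule, a random preference model that satisfies IIA: the odds between two items do not change when other items are added to or removed from the menu.

    import math

    def choice_probabilities(utilities):
        """Luce choice rule: P(item) is proportional to exp(utility)."""
        total = sum(math.exp(u) for u in utilities.values())
        return {item: math.exp(u) / total for item, u in utilities.items()}

    # Hypothetical utilities for three items.
    menu = {"a": 1.0, "b": 0.5, "c": -0.2}
    p_full = choice_probabilities(menu)
    p_pair = choice_probabilities({k: menu[k] for k in ("a", "b")})

    # IIA: the odds of "a" over "b" are the same with or without "c".
    print(p_full["a"] / p_full["b"])  # exp(1.0 - 0.5), about 1.65
    print(p_pair["a"] / p_pair["b"])  # same ratio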
Part 2: Learning
The second part introduces several approaches to learning from comparisons.
Chapter 2 considers a setting where comparison data is given and studies both maximum likelihood and posterior-based learning of comparison models, with case studies from language modeling and robotics. We discuss the challenges of learning multimodal and heterogeneous rewards that fail to satisfy IIA.
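As a preview, here is a minimal sketch of maximum likelihood estimation for a Bradley-Terry comparison model, fit by gradient ascent on hypothetical (winner, loser) data; the case studies in the chapter use richer models and real datasets.

    import numpy as np

    # Hypothetical comparisons among three items: (winner, loser) pairs.
    comparisons = [(0, 1), (0, 2), (1, 2), (0, 1)]
    scores = np.zeros(3)  # log-utilities to be estimated

    for _ in range(500):
        grad = np.zeros(3)
        for w, l in comparisons:
            p = 1.0 / (1.0 + np.exp(scores[l] - scores[w]))  # P(w beats l)
            grad[w] += 1.0 - p  # gradient of the log-likelihood
            grad[l] -= 1.0 - p
        scores += 0.1 * grad
        scores -= scores.mean()  # scores are identified only up to a shift

    print(scores)  # higher score = more preferred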
Chapter 3 considers active data collection of comparisons with the goal of optimal inference on comparison models. Various strategies are explored, including reducing the learner's variance and exploiting ambiguity and domain knowledge in ranking, with a case study from robotics.
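One simple such strategy, sketched below with hypothetical score estimates (the chapter develops more principled criteria), queries the pair about whose outcome the current model is most uncertain:

    import itertools
    import math

    def most_uncertain_pair(scores):
        """Pick the pair whose predicted comparison is closest to a coin flip."""
        def win_prob(i, j):
            return 1.0 / (1.0 + math.exp(scores[j] - scores[i]))
        return min(itertools.combinations(range(len(scores)), 2),
                   key=lambda pair: abs(win_prob(*pair) - 0.5))

    # Hypothetical current estimates: items 1 and 2 are hardest to tell apart.
    print(most_uncertain_pair([2.0, 0.4, 0.5]))  # -> (1, 2)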
Chapter 4 studies processes where comparisons are used to guide decisions. We first set up the bandit approach to recommending maximal objects with respect to comparisons and discuss dueling bandits. We then consider reinforcement learning from human feedback (RLHF) to align language models that decide which text to generate. We highlight the role of uncertainty quantification and exploration in decision-making.
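For a flavor of the dueling-bandit setting, here is a naive explore-then-commit sketch; it is an illustration under simplified, hypothetical assumptions (the duel environment, arm count, and horizon are invented), not one of the algorithms analyzed in the chapter.

    import random

    def explore_then_commit(duel, n_arms, horizon):
        """Duel random pairs of arms, then return the arm with the
        highest empirical win rate (a Borda-style estimate)."""
        wins = [0] * n_arms
        plays = [0] * n_arms
        for _ in range(horizon):
            i, j = random.sample(range(n_arms), 2)
            winner = duel(i, j)  # environment reveals the preferred arm
            wins[winner] += 1
            plays[i] += 1
            plays[j] += 1
        return max(range(n_arms), key=lambda a: wins[a] / max(plays[a], 1))

    # Hypothetical environment: arm 2 beats any other arm 80% of the time.
    def duel(i, j):
        if 2 in (i, j):
            other = j if i == 2 else i
            return 2 if random.random() < 0.8 else other
        return random.choice((i, j))

    print(explore_then_commit(duel, n_arms=4, horizon=1000))  # likely 2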
Chapter 5 considers decision-making in the presence of heterogeneity. We first focus on handling heterogeneity to maximize average utility through personalization. We then discuss voting-based aggregation mechanisms and decisions that are independent of certain features of the outcome.
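As a small illustration of voting-based aggregation, the sketch below computes a Borda count over complete rankings; the rankings are hypothetical, and the chapter treats a broader set of mechanisms and their properties.

    def borda_count(rankings):
        """Each voter awards an item (n - 1 - position) points; items are
        returned in order of total points."""
        n = len(rankings[0])
        scores = {}
        for ranking in rankings:
            for position, item in enumerate(ranking):
                scores[item] = scores.get(item, 0) + (n - 1 - position)
        return sorted(scores, key=scores.get, reverse=True)

    # Three hypothetical voters ranking items "a", "b", "c".
    print(borda_count([["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]))
    # -> ['a', 'b', 'c']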
Part 3: Reflection
The final part of the book discusses the limitations of comparison data and the opportunities offered by stated preference data.
Chapter 6 critiques machine learning from comparisons. It adopts different disciplinary lenses, from social psychology, philosophy, and critical studies, to highlight where comparisons fall short in expressing human preferences and what the alternatives are.
Chapter 7 considers feedback that is broader than the comparisons in our model, much of which we can think of as stated preferences. These are models in which value judgments are given in terms of Likert scales or textual descriptions. We propose ways in which such feedback can be merged with comparison data to better express preferences.
How to engage with this book
There are four pathways for reading, and teaching with, this book. Chapter 1 underlies all of the book, so it is part of every pathway.
For practitioners and those teaching applied AI content, we recommend reading Chapters 1, 2, 4, and 7, which can be used as a sequence in an early graduate course on machine learning. This sequence highlights human data sources in an introductory machine learning course.
For readers with a background in discrete choice, we propose skimming Chapter 1 and studying Chapters 2 and 4. This pathway allows readers to integrate machine learning into their work on discrete choice, demand models, and Industrial Organization.
For those with a deep background in machine learning, we propose studying Chapters 2-4 and 7. These chapters maximize the amount of machine learning covered and are suitable for a deep learning-based course on machine learning.
For those interested in the methodological and theoretical foundations of machine learning from comparisons, we recommend reading Chapters 1, 5, 6, and 7. Chapters 1 and 5 study the underpinnings of revealed preferences and aggregation, Chapter 6 critiques these assumptions, and Chapter 7 looks at broader ways of eliciting preferences. This pathway is suitable for critical study in a course on Computation and Society.
Prior knowledge
The book assumes knowledge of the fundamentals of statistics, linear algebra, and machine learning. Many example code excerpts are written in Python, so experience with the Python programming language is valuable for readers.
Additional Materials
Every chapter has problems for readers and slides for teaching the material, both available on the book's website.
Acknowledgments
Initial versions of this book were compiled as lecture notes for the course CS329H: Machine Learning from Human Preferences, taught at Stanford University in Fall 2023 and Fall 2024. We thank Rehaan Ahmad, Ahmed Ahmed, Jirayu Burapacheep, Michael Byun, Akash Chaurasia, Andrew Conkey, Tanvi Deshpande, Eric Han, Laya Iyer, Adarsh Jeewajee, Shreyas Kar, Arjun Karanam, Jared Moore, Aashiq Muhamed, Bidipta Sarkar, William Shabecoff, Stephan Sharkov, Max Sobol Mark, Kushal Thaman, Joe Vincent, Yibo Zhang, Duc Nguyen, Grace Sodunke, Ky Nguyen, and Mykel Kochenderfer for their early contributions and feedback.
Citation
Thanks for reading our book! We hope you find it useful in your research and teaching.
@book{mlhp,
  author    = {Truong, Sang and Haupt, Andreas and Koyejo, Sanmi},
  title     = {{Machine Learning from Human Preferences}},
  year      = {2025},
  publisher = {Stanford University},
  doi       = {},
  note      = {}
}
For attribution, please cite this work as: