4  Decisions

4.1 Dueling Bandit

The multi-armed bandit (MAB) problem involves a gambler deciding which of several slot machines (arms) to play in order to maximize winnings, despite not knowing which machine is the most rewarding. This scenario highlights the need to balance exploration (trying new machines to discover potentially higher rewards) and exploitation (using current knowledge to maximize gains). MAB algorithms address this dilemma by making decisions under uncertainty to achieve the best possible outcomes based on gathered data. At the core of the MAB problem is a set of actions, or ‘arms,’ denoted by \(\mathcal{A} = \{1, 2, \ldots, K\}\), where \(K\) signifies the total number of arms. For each round \(t\), the agent selects an arm \(a_t \in \mathcal{A}\) and receives a reward \(r_t\), sampled from an arm-specific, unknown probability distribution. The expected reward of pulling arm \(a\) is represented as \(\mu_a = \mathbb{E}[r_t | a]\).
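As a concrete illustration of this setup, the following minimal Python sketch simulates a Bernoulli \(K\)-armed bandit; the arm means and the uniform play loop are illustrative assumptions, not part of any particular algorithm discussed here.

```python
import numpy as np

class BernoulliBandit:
    """K-armed bandit: arm a pays reward 1 with (hidden) probability mu_a, else 0."""

    def __init__(self, true_means, seed=0):
        self.true_means = np.asarray(true_means)   # mu_a for each arm, unknown to the agent
        self.rng = np.random.default_rng(seed)

    @property
    def n_arms(self):
        return len(self.true_means)

    def pull(self, a):
        # Sample r_t ~ Bernoulli(mu_a); only this noisy reward is revealed.
        return float(self.rng.random() < self.true_means[a])

# Example: play 1000 rounds by picking arms uniformly at random.
env = BernoulliBandit(true_means=[0.2, 0.5, 0.7])
rng = np.random.default_rng(1)
rewards = [env.pull(rng.integers(env.n_arms)) for _ in range(1000)]
print(f"average reward under uniform play: {np.mean(rewards):.3f}")
```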

The multi-armed bandit framework can be extended in various ways to model more complex scenarios. In the infinite-armed bandit problem, the set of possible arms \(\mathcal{A}\) is either very large or infinite. This introduces significant challenges in exploration, as the agent cannot afford to explore each arm even once. Algorithms for infinite-armed bandits typically assume some regularity or structure of the reward function across arms to make the problem tractable. The contextual bandit problem extends the bandit framework by incorporating observable external states or contexts that influence the reward distributions of arms. The agent’s task is to learn policies that map contexts to arms to maximize reward. This model is particularly powerful for personalized recommendations, where the context can include user features or historical interactions. In dueling bandit problems, the agent chooses two arms to pull simultaneously and receives feedback only on which of the two is better, not the actual reward values. This pairwise comparison model is especially useful in scenarios where absolute evaluations are difficult, but relative preferences are easier to determine, such as in ranking systems.

Contextual bandits extend the multi-armed bandits by making decisions conditional on the state of the environment and previous observations. The benefit of such a model is that observing the environment can provide additional information, potentially leading to better rewards and outcomes. In each iteration, the agent is presented with the context of the environment, then decides on an action based on the context and previous observations. Finally, the agent observes the action’s outcome and reward. Throughout this process, the agent aims to maximize the expected reward.

In many real-world contexts, one may not have a real-valued reward (or at least a reliable one) associated with a decision. Instead, we may only have observations indicating which of a set of bandits was optimal in a given scenario. The assumption is that within these observations of preferred choices among a set of options, there is an implicit reward or payoff encapsulated in that decision. Consider the following examples:

  1. Dietary preferences: When providing food recommendations to humans, it is often not possible to quantify an explicit reward from recommending a specific food item. Instead, we can offer meal options and observe which one the person selects.

  2. Video recommendation: Websites like YouTube and TikTok recommend specific videos to users. It is typically not feasible to measure the reward a person gains from watching a video. However, we can infer that a user preferred one video over another. From these relative preference observations, we can develop a strategy to recommend videos they are likely to enjoy.

  3. Exoskeleton gait optimization: Tucker et al. (2020) created a framework that uses human-evaluated preferences for an exoskeleton gait algorithm to develop an optimal strategy for the exoskeleton to assist a human in walking. A human cannot reliably produce a numerical value for how well the exoskeleton helped them walk but can reliably indicate which option performed best according to their preferences.

Generally, we assume access to a set of actions. A noteworthy assumption is that any observations we make are unbiased estimates of the payoff. This means that if we observe a human preferred one option over another (or several others), the preferred option had a higher implicit reward or payoff than the alternatives. In the case of dietary preferences, this may mean that a human liked the preferred option; in the case of video recommendations, a user was more entertained, satisfied, or educated by the video they selected than the other options.

The overarching context is that we do not have direct or reliable access to rewards. We may not have a reward at all (for some decisions, it may be impossible to define a real value to the outcome), or it may be noisy (for example, if we ask a human to rate their satisfaction on a scale of 1 to 10). We use relative comparisons to evaluate the best of multiple options in this case. Our goal is to minimize total regret in the face of noisy comparisons. Humans may not always provide consistent observations (since human decision-making is not guaranteed to be consistent). However, we can still determine an optimal strategy with the observed comparisons. We aim to minimize the frequency of sub-optimal decisions according to human preferences. In practice, many formulations of bandits can allow for infinitely many bandits (for example, in continuous-value and high-dimensional spaces). However, this situation can be intractable when determining an optimal decision strategy. With infinite options, how can we always ensure we have chosen the best? We will constrain our bandits to a discrete space to enable efficient exploration. We will assume that we have \(k\) bandits, \(b_i, i \in [1, k]\), and our task is to choose the one that will minimize regret.

With the framework outlined, we now define our approach more formally. This method was introduced by (Yue et al. 2012), and proofs for the guarantees and derivations of parameters can be found in their work.

To determine the optimal action, we will compare pairwise to ascertain the probability that an action \(b_i\) is preferred over another \(b_j\), where \(i \ne j\). Concretely, we assume access to a function \(\epsilon\) that helps determine this probability; in practice, this can be done with an oracle, such as asking a human which of two options they prefer: \[P(b_i > b_j) = \epsilon(b_i, b_j) + \frac{1}{2}.\] With this model, three basic properties govern the values provided by \(\epsilon\): \[\epsilon(b_i, b_j) = -\epsilon(b_j, b_i), \quad \epsilon(b_i, b_i) = 0, \quad \epsilon(b_i, b_j) \in \left(-\frac{1}{2}, \frac{1}{2} \right).\]

We assume there is a total ordering of bandits, such that \(b_i \succ b_j\) implies \(\epsilon(b_i, b_j) > 0\). We impose two constraints to properly model comparisons:

  • Strong Stochastic Transitivity: We must maintain our total ordering of bandits, and as such, the comparison model also respects this ordering: \[b_i \succ b_j \succ b_k \Rightarrow \epsilon(b_i, b_k) \ge \text{max}\{\epsilon(b_i, b_j), \epsilon(b_j, b_k)\}. \tag{4.1}\]

  • Stochastic Triangle Inequality: We also impose a triangle inequality, which captures the condition that the probability of a bandit winning (or losing) a comparison will exhibit diminishing returns as it becomes increasingly superior (or inferior) to the competing bandit: \[b_i \succ b_j \succ b_k \Rightarrow \epsilon(b_i, b_k) \le \epsilon(b_i, b_j) + \epsilon(b_j, b_k). \tag{4.2}\]

These assumptions may initially seem limiting; however, common models for comparisons satisfy these constraints. For example, the Bradley-Terry Model follows \(P(b_i > b_j) = \frac{\mu_i}{\mu_i + \mu_j}\). The Gaussian model with unit variance also satisfies these constraints: \(P(b_i > b_j) = P(X_i - X_j > 0)\), where \(X_i - X_j \sim N(\mu_i - \mu_j, 2)\).
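To make this concrete, here is a small sketch (with assumed Bradley-Terry utilities `mu`) that computes \(\epsilon(b_i, b_j)\) under the Bradley-Terry model and numerically checks conditions (4.1) and (4.2) on one instance.

```python
import itertools

mu = [5.0, 3.0, 2.0, 1.0]   # assumed Bradley-Terry utilities, so b_0 > b_1 > b_2 > b_3

def eps(i, j):
    # epsilon(b_i, b_j) = P(b_i > b_j) - 1/2 under the Bradley-Terry model
    return mu[i] / (mu[i] + mu[j]) - 0.5

# Check (4.1) and (4.2) for every triple b_i > b_j > b_k.
for i, j, k in itertools.combinations(range(len(mu)), 3):
    assert eps(i, k) >= max(eps(i, j), eps(j, k))      # strong stochastic transitivity
    assert eps(i, k) <= eps(i, j) + eps(j, k)          # stochastic triangle inequality
print("Bradley-Terry satisfies both constraints on this instance.")
```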

To accurately model the preferences between bandits in our framework of pairwise bandit comparisons and regret, we must track certain parameters in our algorithm. First, we will maintain a running empirical estimate of the probability of bandit preferences based on our observations. It is important to note that we do not have direct access to an \(\epsilon\) function. Instead, we must present two bandits to a human, who selects a winner. To do this, we define: \[\hat{P}_{i, j} = \frac{\#\ b_i\ \text{wins}}{\#\ \text{comparisons between}\ i\ \text{and}\ j}.\]

We will also compute confidence intervals at each timestep for each of the entries in \(\hat{P}\) as \[\hat{C}_t = \left( \hat{P}_t - c_t, \hat{P}_t + c_t \right),\] where \(c_t = \sqrt{\frac{4\log(\frac{1}{\delta})}{t}}\). Note that \(\delta = \frac{1}{TK^2}\), where \(T\) is the time horizon and \(K\) is the number of bandits.
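The bookkeeping for one pair of bandits can be sketched as follows; the recorded outcomes below are a made-up stand-in for answers from the comparison oracle.

```python
import math

class DuelStats:
    """Running estimate P_hat and confidence radius c_t for one pair (b_i, b_j)."""

    def __init__(self, delta):
        self.delta = delta    # delta = 1 / (T K^2) from the text
        self.wins = 0         # number of comparisons b_i has won
        self.t = 0            # number of comparisons between b_i and b_j

    def record(self, i_won):
        self.t += 1
        self.wins += int(i_won)

    def p_hat(self):
        return self.wins / self.t

    def c_t(self):
        # c_t = sqrt(4 log(1/delta) / t)
        return math.sqrt(4.0 * math.log(1.0 / self.delta) / self.t)

    def interval(self):
        return (self.p_hat() - self.c_t(), self.p_hat() + self.c_t())

# Example with horizon T = 1000 and K = 5 bandits; the outcomes are made up.
stats = DuelStats(delta=1.0 / (1000 * 5 ** 2))
for b_i_won in [True, True, False, True]:
    stats.record(b_i_won)
print(stats.p_hat(), stats.interval())
```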

Previously, we discussed approaches for finding the best action in a specific context. Now, we consider changing contexts, which means there is no longer a static hidden preference matrix \(P\). Instead, at every time step, there is a preference matrix \(P_C\) depending on context \(C\). We consider a context \(C\) and a preference matrix \(P_C\) to be chosen by nature as a result of the given environment (Yue et al., 2012). The goal of a contextual bandits algorithm is to find a policy \(\pi\) that maps contexts to a Von Neumann winner distribution over our bandits. That is, our policy \(\pi\) should map any context to some distribution over our bandits such that sampling from that distribution is preferred to a random action for that context.

4.1.1 Regret

The agent aims to pick a sequence of arms \((a_1, a_2, \ldots, a_T)\) across a succession of time steps \(t = 1\) to \(t = T\) to maximize the total accumulated reward. Formally, the strategy seeks to maximize the sum of the expected rewards: \(\max_{a_1, \ldots, a_T} \mathbb{E} \left[\sum_{t=1}^{T} r_t\right]\). Regret is defined as the difference between the cumulative reward that could have been obtained by always pulling the best arm (in hindsight, after knowing the reward distributions) and the cumulative reward actually obtained by the algorithm. Formally, if \(\mu^*\) is the expected reward of the best arm and \(\mu_{a_t}\) is the expected reward of the arm chosen at time \(t\), the regret after \(T\) time steps is given by \(R(T) = T \cdot \mu^* - \sum_{t=1}^{T} \mu_{a_t}\). The objective of a bandit algorithm is to minimize this regret over time, effectively learning to make decisions that are as close as possible to the decisions of an oracle that knows the reward distributions beforehand. Low regret indicates an algorithm that has learned to choose well-performing arms, balancing the exploration of unknown arms with the exploitation of arms that are already known to perform well. Thus, an efficient bandit algorithm exhibits sub-linear regret growth, meaning that the average regret per round tends to zero as the number of rounds \(T\) goes to infinity: \(\lim_{T \to \infty} \frac{R(T)}{T} = 0\). Minimizing regret is a cornerstone in the design of bandit algorithms, and its analysis helps in understanding the long-term efficiency and effectiveness of different bandit strategies.
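As a small worked example of the regret formula (with assumed expected rewards and an assumed sequence of pulls):

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.7])            # expected rewards of the arms (hidden)
mu_star = mu.max()                        # expected reward of the best arm

pulled = np.array([0, 1, 2, 2, 2, 1])     # arms the algorithm chose over T = 6 rounds
T = len(pulled)

regret = T * mu_star - mu[pulled].sum()   # R(T) = T * mu_star - sum_t mu_{a_t}
print(f"R({T}) = {regret:.2f}")           # (0.5 + 0.2 + 0 + 0 + 0 + 0.2) = 0.90
```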

As previously discussed, our goal is to select the bandit that minimizes a quantity that reflects regret or the cost of not selecting the optimal bandit at all times. We can leverage our comparison model to define a quantity for regret over some time horizon \(T\), which is the number of decisions we make (selecting what we think is the best bandit at each iteration). Assuming we know the best bandit \(b^*\) (and we know that there is a best bandit, since there is a total ordering of our discrete bandits), we can define two notions of regret:

  • Strong regret: aims to capture the fraction of users who would prefer the optimal bandit \(b^*\) over the worse of the two options \(b_1^{(t)}, b_2^{(t)}\) we present at a given step: \(R_T = \sum_{t = 1}^T \max \left\{ \epsilon(b^*, b_1^{(t)}), \epsilon(b^*, b_2^{(t)}) \right\}\)

  • Weak regret: aims to capture the fraction of users who would prefer the optimal bandit \(b^*\) over the better of the two options \(b_1^{(t)}, b_2^{(t)}\) we present at a given step: \(\tilde{R}_T = \sum_{t = 1}^T \min \left\{ \epsilon(b^*, b_1^{(t)}), \epsilon(b^*, b_2^{(t)}) \right\}\). Both quantities are illustrated in the short sketch after this list.
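A minimal numerical sketch of both quantities, assuming a Bradley-Terry comparison model and an arbitrary sequence of presented pairs:

```python
import numpy as np

mu = np.array([5.0, 3.0, 2.0, 1.0])       # assumed Bradley-Terry utilities; b_0 is b*

def eps(i, j):
    return mu[i] / (mu[i] + mu[j]) - 0.5  # epsilon(b_i, b_j)

best = 0
duels = [(1, 2), (0, 3), (2, 3)]          # pairs (b_1^(t), b_2^(t)) presented at each step

strong = sum(max(eps(best, i), eps(best, j)) for i, j in duels)
weak = sum(min(eps(best, i), eps(best, j)) for i, j in duels)
print(f"strong regret: {strong:.3f}, weak regret: {weak:.3f}")
```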

The best bandit described in our regret definition is called a Condorcet winner. This is the strongest form of winner: it is the bandit \(b_i\) that is preferred to every other bandit \(b_j\) with \(p > 0.5\) in a head-to-head comparison. While the notions of regret introduced above assume that an overall best bandit exists, there are settings where no bandit wins more than half of its head-to-head duels. A set of actions without a Condorcet winner is described by the preference matrix in Figure 4.1, where each entry \(\Delta_{jk} = p(j \succ k) - 0.5\) is the probability that action \(j\) is preferred over action \(k\) minus 0.5. There is no Condorcet winner because no action is preferred with \(p > 0.5\) over all other actions. Imagine you want to find the best pizza to eat (the action): there may not be a pizza that wins more than half of its head-to-head duels against every other pizza.

However, we might still have an intuition of what the best pizza is. Sui et al. (2018) therefore introduce several weaker notions of \(\textit{winners}\) for dueling bandit problems. In this example, we might define the best pizza as the most popular one. We call the pizza receiving the most votes in a public vote the Borda winner, or formally, the Borda winner is \(\arg\max_{j \in \mathcal{A}} \sum_{i \neq j} p(j \succ i)\). In contrast to the Condorcet winner setting, there is always guaranteed to be one or more Borda winners (more than one in the case of a tie) for a set of actions. However, if a Condorcet winner exists, it is not necessarily the same as a Borda winner: in our pizza example, a pepperoni pizza might win more than half of its head-to-head duels, while the cheese pizza is still the most popular in a public poll.

A more generic concept of winner is the Von Neumann winner, which describes a probability distribution rather than a single winning bandit. A Von Neumann winner prescribes a probability distribution \(W\) such that sampling from it ‘beats’ an action drawn uniformly at random with \(p > 0.5\). In our pizza example, this corresponds to trusting a friend to order whichever pizza they like, because this may still be preferred to ordering randomly. Formally, \(W\) is a Von Neumann winner if \(\mathbb{E}_{j \sim W,\, k \sim R}\left[p(j \succ k)\right] > 0.5\), where \(R\) denotes the uniform distribution over our actions. The concept of a Von Neumann winner is useful in contextual dueling bandit settings: there, the preference matrix depends on the context, and different contexts may have different Borda winners, just as different parties may vote for different pizzas.

       A      B      C      D      E      F
A    0.00   0.03  -0.02   0.06   0.10   0.11
B   -0.03   0.00   0.03   0.05   0.08   0.11
C    0.02  -0.03   0.00   0.04   0.07   0.09
D   -0.06  -0.05  -0.04   0.00   0.05   0.07
E   -0.10  -0.08  -0.07  -0.05   0.00   0.03
F   -0.11  -0.11  -0.09  -0.07  -0.03   0.00

Figure 4.1: Violation of the Condorcet winner condition. Highlighted entries are different from Table 1. No Condorcet winner exists because no arm beats every other arm.
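Using the preference matrix of Figure 4.1 (with the C-versus-A entry obtained by antisymmetry), a short sketch can confirm that no Condorcet winner exists and compute Borda-style scores:

```python
import numpy as np

arms = list("ABCDEF")
# Delta[j, k] = p(j > k) - 0.5, from Figure 4.1 (C vs. A filled in by antisymmetry).
Delta = np.array([
    [ 0.00,  0.03, -0.02,  0.06,  0.10,  0.11],
    [-0.03,  0.00,  0.03,  0.05,  0.08,  0.11],
    [ 0.02, -0.03,  0.00,  0.04,  0.07,  0.09],
    [-0.06, -0.05, -0.04,  0.00,  0.05,  0.07],
    [-0.10, -0.08, -0.07, -0.05,  0.00,  0.03],
    [-0.11, -0.11, -0.09, -0.07, -0.03,  0.00],
])

# Condorcet winner: an arm whose Delta entry is positive against every other arm.
condorcet = [a for i, a in enumerate(arms)
             if all(Delta[i, k] > 0 for k in range(len(arms)) if k != i)]
print("Condorcet winner:", condorcet or "none")          # none: A loses to C, C to B, B to A

# Borda-style score: sum of p(j > k) over all opponents k.
borda = {a: sum(0.5 + Delta[i, k] for k in range(len(arms)) if k != i)
         for i, a in enumerate(arms)}
print("Borda winner:", max(borda, key=borda.get))        # A on this matrix
```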

Next, we introduce two performance measures for the planner (the recommendation algorithm). The asymptotic ex-post regret is defined as \[\text{Regret}(\mu_1, \ldots, \mu_K) = T\cdot \max_i \mu_i - \sum_{t=1}^T \mathbb{E}[\mu_{I_t}],\] where \(I_t\) denotes the arm recommended at time \(t\).

Intuitively, this represents the difference between the reward achieved by always taking the action with the highest possible reward and the expected welfare of the recommendation algorithm (based on the actions it recommends at each timestep).

We also define a weaker performance measure, the Bayesian regret, which is defined as \[\text{Bayesian regret} = \mathbb{E}_{\mu_1, \ldots, \mu_K \sim \text{Prior}}\left[\operatorname{Regret}\left(\mu_1, \ldots, \mu_K\right)\right].\]

With a Bayesian optimal policy, we would like either definition of regret to vanish as \(T\to \infty\); we are considering “large-market optimal” settings where there are many short-lived, rather than a few long-term, users. Note that the fact that ex-post regret is prior-free makes it robust to inaccuracies in the prior.

4.1.2 Acquisition Functions

Various strategies have been developed to balance the exploration-exploitation trade-off. These strategies differ in selecting arms based on past experiences and rewards.

4.1.2.1 Classical Acquisition Functions

The uniform acquisition function is the most straightforward approach: each arm is selected uniformly at random over time. This strategy does not consider past rewards and treats each arm as equally promising regardless of the observed outcomes. It is a purely explorative strategy that ensures each arm is sampled enough to estimate its expected reward, but it does not exploit the gathered information to optimize rewards. In mathematical terms, if \(N_t(a)\) denotes the number of times arm \(a\) has been selected up to time \(t\), the uniform strategy ensures that \(N_t(a) \approx \frac{t}{K}\) for all arms \(a\) as \(t\) grows large: \(P(a_t = a) = \frac{1}{K}\)

Epsilon-greedy is a popular method that introduces a balance between exploration and exploitation. With a small probability \(\epsilon\), it explores by choosing an arm at random, and with probability \(1 - \epsilon\), it exploits by selecting the arm with the highest estimated reward so far. This strategy incrementally favors actions that have historically yielded higher rewards, but still allows for occasional exploration to potentially discover better options. The parameter \(\epsilon\) is chosen based on the desired exploration level, often set between 0.01 and 0.1. \[P(a_t = a) = \begin{cases} \frac{\epsilon}{K} + 1 - \epsilon & \text{if } a = \arg\max_{a'} \hat{\mu}_{a'} \\ \frac{\epsilon}{K} & \text{otherwise} \end{cases}\]

Upper Confidence Bound (UCB) acquisition function takes a more sophisticated approach to the exploration-exploitation dilemma. It selects arms based on both the estimated rewards and the uncertainty or variance associated with those estimates. Specifically, it favors arms with high upper confidence bounds on the estimated rewards, which is a sum of the estimated mean and a confidence interval that decreases with the number of times the arm has been played. This ensures that arms with less certainty (those played less often) are considered more often, naturally balancing exploration with exploitation as the uncertainty is reduced over time.

\[P(a_t = a) = \begin{cases} 1 & \text{if } a = \arg\max_{a'} \left( \hat{\mu}_{a'} + \sqrt{\frac{2 \ln t}{N_t(a')}} \right) \\ 0 & \text{otherwise} \end{cases}\]
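A minimal sketch of the three selection rules above; the reward estimates `mu_hat` and pull counts `counts` are assumed to be maintained elsewhere by the learner.

```python
import numpy as np

def select_uniform(rng, K):
    # Uniform acquisition: every arm has probability 1/K.
    return int(rng.integers(K))

def select_epsilon_greedy(rng, mu_hat, epsilon=0.05):
    # Explore uniformly with probability epsilon, otherwise exploit the best estimate.
    if rng.random() < epsilon:
        return int(rng.integers(len(mu_hat)))
    return int(np.argmax(mu_hat))

def select_ucb(mu_hat, counts, t):
    # Highest upper confidence bound: mu_hat(a) + sqrt(2 ln t / N_t(a)).
    counts = np.asarray(counts, dtype=float)
    bonus = np.where(counts > 0,
                     np.sqrt(2.0 * np.log(max(t, 1)) / np.maximum(counts, 1.0)),
                     np.inf)                  # untried arms get an infinite bonus
    return int(np.argmax(np.asarray(mu_hat) + bonus))

rng = np.random.default_rng(0)
mu_hat, counts = [0.10, 0.40, 0.35], [10, 3, 5]
print(select_uniform(rng, 3),
      select_epsilon_greedy(rng, mu_hat),
      select_ucb(mu_hat, counts, t=18))
```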

4.1.2.2 Interleaved Filter

This algorithm tries to find the best bandit (the Condorcet winner) in a discrete, finite bandit space via pairwise comparisons of the bandits. We now introduce the Interleaved Filter algorithm as provided in (Yue et al. 2012) to solve the dueling bandit setup. It starts with a randomly chosen incumbent \(\hat{b}\) and iteratively compares it to the set \(W\) of remaining bandits, maintaining winning probabilities \(\hat{P}_{\hat{b},b}\) and confidence intervals \(\hat{C}_{\hat{b},b}\). If a bandit \(b\) is confidently worse than \(\hat{b}\), it is removed from \(W\). If a bandit \(b'\) is confidently better than \(\hat{b}\), then every bandit \(b\) that appears worse than \(\hat{b}\) is removed from \(W\) and \(b'\) becomes the new incumbent \(\hat{b}\). This is repeated until \(W\) is empty, leaving the final \(\hat{b}\) as the predicted best bandit.

 1: input: \(T\), \(B=\{b_1, \dots, b_K\}\)
 2: \(\delta \gets 1/(TK^2)\)
 3: choose \(\hat{b} \in B\) randomly
 4: \(W \gets B \setminus \{\hat{b}\}\)
 5: \(\forall b \in W\), maintain estimate \(\hat{P}_{\hat{b},b}\) of \(P(\hat{b} > b)\) as defined above
 6: \(\forall b \in W\), maintain the \(1 - \delta\) confidence interval \(\hat{C}_{\hat{b},b}\) of \(\hat{P}_{\hat{b},b}\) as defined above
 7: while \(W \neq \varnothing\) do
 8:   for \(b \in W\) do
 9:     compare \(\hat{b}\) and \(b\)
10:     update \(\hat{P}_{\hat{b},b}\), \(\hat{C}_{\hat{b},b}\)
11:   end for
12:   while \(\exists b \in W\) such that \(\hat{b}\) confidently beats \(b\) (\(\hat{P}_{\hat{b},b} > 1/2\) and \(1/2 \notin \hat{C}_{\hat{b},b}\)) do
13:     \(W \gets W \setminus \{b\}\)
14:   end while
15:   if \(\exists b' \in W\) that confidently beats \(\hat{b}\) (\(\hat{P}_{\hat{b},b'} < 1/2\) and \(1/2 \notin \hat{C}_{\hat{b},b'}\)) then
16:     while \(\exists b \in W\) with \(\hat{P}_{\hat{b},b} > 1/2\) do
17:       \(W \gets W \setminus \{b\}\)
18:     end while
19:     \(\hat{b} \gets b'\), \(W \gets W \setminus \{b'\}\)
20:     \(\forall b \in W\), reset \(\hat{P}_{\hat{b},b}\) and \(\hat{C}_{\hat{b},b}\)
21:   end if
22: end while
23: \(\hat{T} \gets\) total comparisons made
24: return \((\hat{b}, \hat{T})\)

Parameter Initialization

In lines 1-6 of the algorithm, we take the inputs and first compute the value \(\delta\) which is used to compute our confidence intervals. We select an initial guess of an optimal bandit \(\hat{b}\) by uniformly sampling from all bandits \(\mathcal{B}\). We also keep a running set of bandit candidates \(W\), which is initialized to be \(\mathcal{B} \setminus \{\hat{b}\}\). At this point, we also initialize our empirical estimates for \(\hat{P}, \hat{C}\).

Next, we will repeat several steps until our working set of bandit candidates \(W\) is empty.

Update Estimates Based on Comparisons

The first step at each iteration (lines 8-11) is to look at all candidates in \(W\), and compare them to our current guess \(\hat{b}\) using an oracle (e.g. by asking a human which of \(\hat{b}\) or \(b \in W\) is preferred). With this new set of wins and comparisons, we update our estimates of \(\hat{P}, \hat{C}\).

Prune Suboptimal Bandits

In lines 12-13, with updated comparison win probabilities and corresponding confidence intervals, we can remove bandit candidates from \(W\) that we are confident \(\hat{b}\) is better than. The intuition here is that we are mostly sure that our current best guess is better than some of the candidates, and we don’t need to consider those candidates in future iterations.

Check for Better Bandits from Candidate Set

Now that our candidate set of bandits may be smaller, in lines 15-21 we check if there is any bandit \(b'\) that we are confident is better than our current best guess. If we do find such a candidate, we remove every bandit \(b\) that \(\hat{P}\) indicates is likely worse than \(\hat{b}\). Note that in this step, we do not require the estimate to be outside the confidence interval, since we have already found a candidate we believe to be significantly closer to optimal than our current best guess.

Once we remove the candidates likely worse than \(\hat{b}\), we crown \(b'\) as the new best guess, i.e. \(\hat{b} := b'\). Consequently, we remove \(b'\) from \(W\) and reset our empirical win counters \(\hat{P}, \hat{C}\).
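Putting these steps together, the following is a compact Python sketch of the Interleaved Filter loop (not the authors' reference implementation). The preference oracle `duel(i, j)`, which returns True when bandit \(i\) wins the comparison, is an assumed stand-in for querying a human, and the confidence radius follows the \(c_t\) formula given earlier.

```python
import math
import random

def interleaved_filter(K, T, duel, rng=None):
    """Sketch of Interleaved Filter; duel(i, j) returns True if bandit i beats bandit j."""
    rng = rng or random.Random(0)
    delta = 1.0 / (T * K * K)
    radius = lambda n: math.sqrt(4.0 * math.log(1.0 / delta) / n)

    b_hat = rng.randrange(K)
    W = set(range(K)) - {b_hat}
    wins = {b: 0 for b in W}          # wins of b_hat against b
    n = {b: 0 for b in W}             # comparisons of b_hat against b
    total = 0

    while W:
        for b in list(W):             # one comparison per remaining candidate
            wins[b] += duel(b_hat, b)
            n[b] += 1
            total += 1
        p = {b: wins[b] / n[b] for b in W}
        c = {b: radius(n[b]) for b in W}
        # Prune candidates that b_hat confidently beats (1/2 lies below the interval).
        W -= {b for b in W if p[b] - c[b] > 0.5}
        # Does some candidate confidently beat b_hat?
        better = [b for b in W if p[b] + c[b] < 0.5]
        if better:
            b_prime = better[0]
            # Also prune candidates b_hat empirically beats (no confidence required).
            W = {b for b in W if p[b] <= 0.5}
            W.discard(b_prime)
            b_hat = b_prime           # crown the new incumbent and reset the counters
            wins = {b: 0 for b in W}
            n = {b: 0 for b in W}
    return b_hat, total

# Toy Bradley-Terry oracle over 5 bandits with utilities 5, 4, 3, 2, 1.
utils = [5, 4, 3, 2, 1]
orng = random.Random(1)
oracle = lambda i, j: orng.random() < utils[i] / (utils[i] + utils[j])
print(interleaved_filter(K=5, T=20000, duel=oracle))      # typically returns bandit 0
```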

With this algorithm defined, let us look at some provisions of the method with respect to identifying the optimal strategy. Note that the proofs and derivations for these quantities are provided in (Yue et al. 2012).

First, the method guarantees that for the provided time horizon \(T\), the algorithm returns the correct bandit with probability \(P \ge 1 - \frac{1}{T}\). It is interesting and useful to note that if one has a strict requirement for the probability of identifying the correct bandit, one can compute the time horizon \(T\) that guarantees this outcome at that probability. Furthermore, a time horizon of 1 leaves no probabilistic guarantee of a successful outcome, and increasing \(T\) has diminishing returns. Second, in the event that the algorithm returns an incorrect bandit, the maximal regret incurred is linear with respect to \(T\), i.e. \(\mathcal{O}(T)\). This is also a useful provision as it allows us to estimate the overall cost in the worst-case outcome. Based on these two provisions, we can compute the expected cumulative regret from running the Interleaved Filter algorithm, which is: \[\mathbb{E}\left[R_T\right] \le \left(1 - \frac{1}{T}\right) \mathbb{E}\left[ R_T^{IF} \right] + \frac{1}{T}\mathcal{O}(T) = \mathcal{O}\left(\mathbb{E}\left[ R_T^{IF} \right] + 1\right)\]

Interestingly, the original work shows that these bounds hold for both strong and weak regret. As demonstrated, the Interleaved Filter algorithm provides a robust method to ascertain the optimal bandit or strategy given a set of options and only noisy comparisons. In most real-world scenarios for modeling human preferences, it is not possible to observe a real-valued reward, or at least a reliable one, and as such this method is a useful way to properly model human preferences.

Furthermore, the algorithm provides strong guarantees for the probability of selecting the correct bandit, maximal regret, and the number of comparisons required. It is even more impressive that the method can do so without severely limiting constraints; as demonstrated, the most commonly used models satisfy the imposed constraints.

As we look to model human preferences, we can certainly leverage this method for k-armed dueling bandits to identify the best strategy to solve human-centric challenges, from video recommendation to meal selection and exoskeleton-assisted walking.

4.1.2.3 Dueling Bandit Gradient Descent

This algorithm tries to find the best bandit in a continuous bandit space. Here, each bandit is a parameter configuration of an Information Retrieval (IR) system, uniquely defined by a vector \(w\), so there are infinitely many bandits. We will cover the Dueling Bandit Gradient Descent (DBGD) algorithm from (Yue and Joachims 2009), who use the dueling bandits formulation for online IR optimization. They propose a retrieval system parameterized by a set of continuous variables lying in \(W\), a \(d\)-dimensional unit sphere. The DBGD algorithm adapts the current parameters \(w_t\) of the IR system by comparing them with slightly altered parameters \(w_t'\), both answering the same query \(q_t\). Only if the IR results produced with \(w_t'\) are preferred are the parameters moved in that direction. We will now discuss the algorithm in more detail.

input: \(\gamma\), \(\delta\), \(w_1\)
for \(t = 1, \ldots, T\) do
  sample unit vector \(u_t\) uniformly
  \(w_t' \gets P_W(w_t + \delta u_t)\)
  compare \(w_t\) and \(w_t'\)
  if \(w_t'\) wins the duel then
    \(w_{t+1} \gets P_W(w_t + \gamma u_t)\)
  else
    \(w_{t+1} \gets w_t\)
  end if
end for

We first choose the exploration step length \(\delta\), the exploitation step length \(\gamma\), and a starting point \(w_1\) in the unit sphere. At each iteration, we choose a query \(q_t\) and sample a random unit vector \(u_t\). We then duel \(w_t\) and \(w_t'\), where \(w_t\) is our current point in the sphere and \(w_t'\) is our exploratory comparison, generated by taking a random step of length \(\delta\), such that \(w_t' = w_t + \delta u_t\). The objective of this duel is to ascertain the binary preference of users with respect to the results yielded by the IR systems parameterized by \(w_t\) and \(w_t'\) respectively, taking query \(q_t\) as an input. The parameters that get the majority of the votes win the head-to-head comparison. If \(w_t\) wins, then we keep the parameters for the next iteration. If \(w_t'\) wins the duel, we update our parameters in the direction of \(u_t\) by taking a step of length \(\gamma\). Note that the algorithm uses a projection operation \(P_W(\overrightarrow{v})\). Since \(u_t\) is chosen randomly, \(w_t + \delta u_t\) or \(w_t + \gamma u_t\) could lie outside of the unit sphere where all possible parameter configurations live. In this case, we simply project the point back onto the sphere using \(P_W(\overrightarrow{v})\).

Yue and Joachims show that this algorithm has sublinear regret in \(T\), the number of iterations. We note that the algorithm assumes that there exists a hidden reward function \(R(w)\), mapping system parameters \(w\) to reward values, that is smooth and strictly concave over the input space \(W\).

Lastly, we would also like to give motivation behind \(\delta\) and \(\gamma\) being different values. We need a \(\delta\) that is sufficiently large that the comparison between a system parameterized by \(w_t\) and \(w_t'\) is meaningful. On the other hand, we may wish to take a smaller step in the direction of \(w_t'\) during our update step, as during a duel, we only score \(w_t\) against \(w_t'\) over the results on one query \(q_t\). Having \(\delta > \gamma\) allows us to get reward signal from meaningfully different points while also updating our belief of the best point \(w_{\text{best}}\) gradually.
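A compact sketch of the DBGD loop for a toy problem; the preference oracle `noisy_pref`, the hidden concave reward, and all parameter values are illustrative assumptions.

```python
import numpy as np

def project(w):
    # P_W(v): if the step leaves the unit sphere W, project back onto it.
    norm = np.linalg.norm(w)
    return w / norm if norm > 1.0 else w

def dbgd(d, T, prefers_candidate, gamma=0.01, delta=0.5, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(d)                                  # starting point w_1
    for _ in range(T):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                       # random unit direction u_t
        w_prime = project(w + delta * u)             # exploratory candidate w_t'
        if prefers_candidate(w, w_prime):            # duel on a query; True if w_t' wins
            w = project(w + gamma * u)               # exploit: small step toward u_t
    return w

# Toy oracle: hidden reward R(w) = -||w - w_star||^2, compared through a logistic model.
w_star = np.array([0.3, -0.4, 0.2])
orng = np.random.default_rng(1)
def noisy_pref(w, w_prime):
    reward = lambda v: -np.sum((v - w_star) ** 2)
    return orng.random() < 1.0 / (1.0 + np.exp(-(reward(w_prime) - reward(w))))

print(np.round(dbgd(d=3, T=5000, prefers_candidate=noisy_pref), 2))   # drifts toward w_star
```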

Sparring EXP4

Zoghi et al. (2015) propose one algorithm for the contextual dueling bandit problem: Sparring EXP4, which duels two instances of the traditional EXP4 algorithm. The traditional EXP4 algorithm solves the standard contextual bandit problem, the case where we can directly observe a reward for a choice of bandit given a context. The EXP4 algorithm embeds each bandit as a vector. When the algorithm sees the context (called ‘advice’ in this formulation), it produces a probability distribution over the choices based on an adjusted softmax function on the inner product between the context and the bandit vectors. The probability function differs from a plain softmax in that we assign some minimum probability to every action to enforce exploration. A reward is then observed for the choice and propagated back through the embedding of the chosen bandit.

Sparring EXP4 runs two instances of the EXP4 algorithm against each other. Each EXP4 instance samples an action given a context, and then these choices are ‘dueled’ against each other. Instead of directly observing a reward, as in traditional EXP4, we observe two complementary rewards: a positive reward for the choice that won the duel and a negative reward for the choice that lost. The reward is proportional to the degree to which the bandit wins the duel, i.e. how likely the bandit is to be preferred over the other when users are queried for binary preferences. As in traditional EXP4, the reward (or negative reward) is then propagated back through the representations of the bandits.
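The construction can be sketched as follows, with a minimal EXP4 learner and a simplified 1/0 duel reward in place of the proportional reward described above; the toy experts and the preference oracle are assumptions for illustration.

```python
import numpy as np

class EXP4:
    """Minimal EXP4 learner: experts provide per-context distributions over K arms."""

    def __init__(self, n_experts, n_arms, gamma=0.1, seed=0):
        self.w = np.ones(n_experts)                  # expert weights
        self.K = n_arms
        self.gamma = gamma
        self.rng = np.random.default_rng(seed)

    def act(self, advice):
        # advice: (n_experts, K) matrix of expert probability vectors for this context.
        probs = (1 - self.gamma) * (self.w @ advice) / self.w.sum() + self.gamma / self.K
        probs = probs / probs.sum()
        arm = int(self.rng.choice(self.K, p=probs))
        return arm, probs

    def update(self, advice, arm, probs, reward):
        # Importance-weighted reward estimate, credited back to the experts.
        x_hat = np.zeros(self.K)
        x_hat[arm] = reward / probs[arm]
        expert_gain = advice @ x_hat
        self.w *= np.exp(self.gamma * expert_gain / self.K)

def sparring_step(learner_a, learner_b, advice, duel):
    arm_a, p_a = learner_a.act(advice)
    arm_b, p_b = learner_b.act(advice)
    a_wins = duel(arm_a, arm_b)                      # preference oracle on the two proposals
    learner_a.update(advice, arm_a, p_a, 1.0 if a_wins else 0.0)
    learner_b.update(advice, arm_b, p_b, 0.0 if a_wins else 1.0)

# Toy run: two experts over three arms; lower-indexed arms are assumed to be preferred.
advice = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.1, 0.8]])
A, B = EXP4(2, 3, seed=1), EXP4(2, 3, seed=2)
for _ in range(200):
    sparring_step(A, B, advice, duel=lambda i, j: i <= j)
print(np.round(A.w / A.w.sum(), 2))                  # weight shifts toward the first expert
```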

4.1.2.4 Feel-Good Thompson Sampling

This algorithm is a solution for the contextual dueling bandit setting and tries to minimize the cumulative average regret: \[\text{Regret}(T) := \sum_{t=1}^{T} \left[ r_{*}(x_t, a_{t}^{*}) - \frac{r_{*}(x_t, a_{t}^{1}) + r_{*}(x_t, a_{t}^{2})}{2} \right],\] where \(r_{*}(x_t, a_{t})\) is the true, hidden reward function of a context \(x_t\) and action \(a_t\), and \(a_t^*\) is the best action for context \(x_t\). Thompson sampling is an iterative process: at each round, two actions are proposed, each maximizing a different approximation of the reward function based on past data; a preference between them is received; and this new information is added to the data.

Finding good approximations of the reward function at time \(t\) is done by sampling two reward function parameters \(\theta_t^{j=1}\) and \(\theta_t^{j=2}\) from a posterior distribution \(p_j(\cdot \mid S_{t-1})\) based on all previous data. This posterior distribution is proportional to the product of the prior and the likelihood function, which is a Gaussian in standard Thompson sampling. In Feel-Good Thompson sampling, an additional term called "Feel-good exploration" encourages parameters \(\theta\) with a large maximum reward in previous rounds. This change to the likelihood function may increase probabilities in uncertain areas, thus exploring those regions. All that is left is to select an action maximizing each reward function approximation, receive a preference \(y_t\) between the two, and add the new information to the dataset (Zhang 2021).

initialize \(S_0 = \varnothing\)
for \(t = 1, 2, \ldots\) do
  receive prompt \(x_t\) and action space \(\mathcal{A}_t\)
  for \(j = 1, 2\) do
    sample model parameter \(\theta_t^j\) from the posterior distribution \(p^j(\cdot \mid S_{t-1})\)
    select response \(a_t^j = \arg\max_{a \in \mathcal{A}_t} \langle \theta_t^j, \phi(x_t, a) \rangle\)
  end for
  receive preference \(y_t\)
  update dataset \(S_t \leftarrow S_{t-1} \cup \{(x_t, a_t^1, a_t^2, y_t)\}\)
end for

4.1.3 Applications

There are many applications where contextual bandits are used, and many of these applications can utilize human preferences. One particular application illustrates the benefits a contextual bandit has over a multi-armed bandit: a website deciding which ad to show someone visiting the website. A multi-armed bandit might decide to show someone an ad for a swimsuit because the swimsuit ads have gotten the most user clicks (which indicates human preference). A contextual bandit might choose differently, however. A contextual bandit also takes into account the context, which in this case might mean information about the user (location, previously visited pages, and device information). If it discovers the user lives in a cold climate, for example, it might suggest a sweater ad instead and get a better chance of a click. There are many more examples of where contextual bandits can be applied: other web applications, such as optimizing search results; medical applications, such as deciding how much of a medication to prescribe based on a patient’s history; and gaming applications, such as basing moves on the state of a chess board to try to win. In each of the above examples, human feedback could have been introduced during training and leveraged to learn a reward function.

We explored different versions of bandits that address the exploration-exploitation trade-off in various real-world scenarios. These models have been employed across various fields, including but not limited to healthcare, finance, dynamic pricing, and anomaly detection. This section provides a deep dive into some real-world applications, emphasizing the value and advancements achieved by incorporating bandit methodologies. The content of this section draws upon the findings from the survey cited in reference (Bouneffouf, Rish, and Aggarwal 2020).

In healthcare, researchers have been applying bandits to address challenges in clinical trials and behavioral modeling (Bouneffouf, Rish, and Cecchi 2017; Bastani and Bayati 2020). One of the examples is drug dosing. Warfarin, an oral anticoagulant, has traditionally been administered using fixed dosing protocols. Physicians would then make subsequent adjustments based on the patient’s emerging symptoms. Nonetheless, inaccuracies in the initial dosage, whether too low or too high, can lead to serious complications like strokes and internal bleeding. In a pivotal study, researchers in (Bastani and Bayati 2020) modeled the Warfarin initial dosing as a contextual bandit problem to assign dosages to individual patients appropriately based on their medication history. Their contributions include the adaptation of the LASSO estimator to the bandit setting, achieving a theoretical regret bound of \(O(s_0^2 \log^2(dT))\), where \(d\) represents the number of covariates, \(s_0 \ll d\) signifies the number of pertinent covariates, and \(T\) indicates the total number of users. Additionally, they conducted empirical experiments to validate the robustness of their methodology.

Within the finance sector, bandits have been instrumental in reshaping the landscape of portfolio optimization. Portfolio optimization is an approach to designing a portfolio based on the investor’s return and risk criteria, which fits the exploration-exploitation nature of bandit problems. (Shen et al. 2015) utilized multi-armed bandits to exploit correlations between the instruments. They constructed orthogonal portfolios and integrated them with the UCB policy to achieve a cumulative regret bound of \(\frac{8n}{\Delta^*} \ln(m) + 5n\), where \(n\), \(m\), and \(\Delta^*\) denote the number of available assets, the total number of time steps, and the gap between the best expected reward and the expected reward, respectively. On the other hand, (Huo and Fu 2017) focused on risk-aware online portfolio optimization by incorporating a computation of the minimum spanning tree of a bipartite graph that encodes a combination of financial institutions and assets, which helps diversify and reduce exposure to systematic risk during a financial crisis.

Dynamic pricing, also known as demand-based pricing, refers to the strategy of setting flexible prices for products or services based on current market demands. The application of bandits in dynamic pricing offers a systematic approach to making real-time pricing decisions while balancing the trade-off between exploring new price points and exploiting known optimal prices. (Misra, Schwartz, and Abernethy 2019) proposed a policy where the company has only incomplete demand information. They derived an algorithm that balances immediate and future profits by combining multi-armed bandits with partial identification of consumer demand from economic theory.

Recommendation systems are essential components of numerous online platforms, guiding users through vast content landscapes to deliver tailored suggestions. These systems are instrumental in platforms like e-commerce sites, streaming platforms, and social media networks. However, the challenge of effectively recommending items to users is non-trivial, given the dynamic nature of user preferences and the vast amount of content available.

One of the most significant challenges in recommendation systems is the "cold start" problem. This issue arises when a new user joins a platform and the system has limited or no information about the user’s preferences. Traditional recommendation algorithms struggle in such scenarios since they rely on historical user-item interactions. As discussed in (Zhou et al. 2017), the bandit setting is particularly suitable for large-scale recommender systems with a vast number of items. By continuously exploring user preferences and exploiting known interactions, bandit-based recommender systems can quickly adapt to new users, ensuring relevant recommendations within a few interactions. The continuous exploration inherent in bandit approaches also means that as a user’s preferences evolve, the system can adapt, ensuring that recommendations remain relevant. Recommending content that is up to date is another important aspect of a recommendation system. In (Bouneffouf, Bouzeghoub, and Gançarski 2012), the concept of "freshness" in content is explored through the lens of the bandit problem. The Freshness-Aware Thompson Sampling algorithm introduced in this study aims to manage the recommendation of fresh documents according to the risk of the user’s situation.

Dialogue systems, often termed conversational agents or chatbots, aim to simulate human-like conversations with users. These systems are deployed across various platforms, including customer support, virtual assistants, and entertainment applications, and they are crucial for enhancing user experience and engagement. Response selection is fundamental to creating a natural and coherent dialogue flow. Traditional dialogue systems rely on a predefined set of responses or rules, which can make interactions feel scripted and inauthentic. In (Liu et al. 2018), the authors proposed a contextual multi-armed bandit model for online learning of response selection. Specifically, they utilized bidirectional LSTM to produce the distributed representations of a dialogue context and responses and customized the Thompson sampling method.

To create a more engaging and dynamic interaction, there is growing interest in developing pro-active dialogue systems that can initiate conversations without user prompting. (Perez and Silander 2018) proposed a novel approach to this challenge with contextual bandits. By introducing memory models into the bandit framework, the system can recall past interactions, making its proactive responses more contextually relevant. Their contributions include the Contextual Attentive Memory Network, which implements a differentiable attention mechanism over past interactions.

(Upadhyay et al. 2019) addressed the challenge of orchestrating multiple independently trained dialogue agents or skills in a unified system. They attempted online posterior dialogue orchestration, defining it as selecting the most suitable subset of skills in response to a user’s input, and studied a context-attentive bandit model that operates under a skill execution budget, ensuring efficient and accurate response selection.

Anomaly detection refers to the task of identifying samples that behave differently from the majority. In (Ding, Li, and Liu 2019), the authors delve into anomaly detection in an interactive setting, allowing the system to actively engage with human experts through a limited number of queries about genuine anomalies. The goal is to present as many true anomalies to the human expert as possible after a fixed query budget is used up. They applied the multi-armed contextual bandit framework to address this issue. This algorithm adeptly integrates both nodal attributes and node dependencies into a unified model, efficiently managing the exploration-exploitation trade-off during anomaly queries.

There are many challenges associated with contextual bandits. The first challenge is that each action only reveals the reward for that particular action. Therefore, the algorithm has to work with incomplete information. This leads to the dilemma of exploitation versus exploration: when should the algorithm choose the best-known option versus trying new options for potentially better outcomes? Another significant challenge for contextual bandits is using context effectively. The context the environment gives needs to be explored to figure out which action is best for each context.

The overarching goal in systems designed for recommending options of high value to users is to achieve an optimal balance between exploration and exploitation. This dual approach is crucial in environments where user preferences and needs are dynamic and diverse. Exploration refers to the process of seeking out new options, learning about untried possibilities, and gathering fresh information that could lead to high-value recommendations. In contrast, exploitation involves utilizing existing knowledge and past experiences to recommend the best options currently known. This balance is key to maintaining a system that continuously adapts to changing user preferences while ensuring the reliability of its recommendations.

A key observation in such systems is the dual role of users as both producers and consumers of information. Each user’s experience contributes valuable data that informs future recommendations for others. For instance, platforms like Waze, Netflix, and Trip Advisor rely heavily on user input and feedback. Waze uses real-time traffic data from drivers to recommend optimal routes; Netflix suggests movies and shows based on viewing histories and ratings; Trip Advisor relies on traveler reviews to guide future tourists. In these examples, the balance between gathering new information (exploration) and recommending the best-known options (exploitation) is dynamically managed to enhance user experience and satisfaction. This approach underscores the importance of user engagement in systems where monetary incentives are not (or can not be) the primary driver.

Recommendation systems often face the challenge of overcoming user biases that can lead to a narrow exploration of options. Users come with preconceived notions and preferences, which can cause them to overlook potentially valuable options that initially appear inferior or unaligned with their interests. This predisposition can significantly limit the effectiveness of recommendation systems, as users might miss out on high-value choices simply due to their existing biases.

To counteract this, it is crucial for recommendation systems to actively incentivize exploration among users. One innovative approach to achieve this is through the strategic use of information asymmetry. By controlling and selectively presenting information, these systems can guide users to explore options they might not typically consider. This method aims to reveal the true potential of various options by nudging users out of their comfort zones and encouraging a broader exploration of available choices. An important note here is that the system is not lying to users; it only selectively reveals information it has.

The concept of incentivizing exploration becomes even more complex when considering different types of users. For instance, systems often encounter short-lived users who have little to gain from contributing to the system’s learning process, as their interactions are infrequent or based on immediate needs. Similarly, some users may operate under a ‘greedy’ principle, primarily seeking immediate gratification rather than contributing to the long-term accuracy and effectiveness of the system. In such scenarios, managing information asymmetry can be a powerful tool. By selectively revealing information, recommendation systems can create a sense of novelty and interest, prompting even the most transient or self-interested users to engage in exploration, thereby enhancing the system’s overall knowledge base and recommendation quality.

4.2 Preferential Bayesian Optimization

The traditional Bayesian optimization (BO) problem is described as follows. There is a black-box objective function \(g: \mathcal{X} \rightarrow \Re\) defined on a bounded subset \(\mathcal{X} \subseteq \Re^q\) such that direct queries to the function are expensive or not possible. However, we would like to solve the global optimization problem of finding \(\mathbf{x}_{\min }=\arg \min _{\mathbf{x} \in \mathcal{X}} g(\mathbf{x})\). This is highly analogous to modeling human preferences, since it is the case that direct access to a human’s latent preference function is not possible but we would still like to find its optimum, such as in A/B tests or recommender systems.

We approach this problem for human preferences with Preferential Bayesian Optimization (PBO), where the key difference is that we are able to query the preference function through pairwise comparisons of data points, i.e. duels. This is a form of indirect observation of the objective function, which models real-world scenarios closely: we commonly need to optimize a function using data about preferences. With humans, it has been demonstrated that we are better at evaluating differences rather than absolute magnitudes (Kahneman and Tversky 1979), and therefore PBO models can be applied in various contexts.

4.2.1 Problem statement

The problem of finding the optimum of a latent preference function defined on \(\mathcal{X}\) can be reduced to determining a sequence of duels on \(\mathcal{X} \times \mathcal{X}\). From each duel \(\left[\mathbf{x}, \mathbf{x}^{\prime}\right] \in\) \(\mathcal{X} \times \mathcal{X}\) we obtain binary feedback \(\{0,1\}\) indicating whether or not \(\mathbf{x}\) is preferred over \(\mathbf{x}^{\prime}\) (\(g(\mathbf{x}) < g(\mathbf{x}^{\prime})\)). We consider that \(\mathbf{x}\) is the winner of the duel if the output is \(\{1\}\) and that \(\mathbf{x}^{\prime}\) wins the duel if the output is \(\{0\}\). The aim is to find \(\mathbf{x}_{\min }\) by reducing as much as possible the number of queried duels.

The key idea in PBO is to learn a preference function in the space of duels using a Gaussian process. We define a joint reward \(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\) on each duel which is never directly observed. Instead, the feedback we obtain after each pair is a binary output \(y \in \{0,1\}\) indicating which of the two inputs is preferred. One definition of \(f\) we will use (though others are possible) is \(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=g\left(\mathbf{x}^{\prime}\right)-g(\mathbf{x})\). The more strongly \(\mathbf{x}\) is preferred over \(\mathbf{x}^{\prime}\), i.e. the smaller \(g(\mathbf{x})\) is relative to \(g(\mathbf{x}^{\prime})\), the bigger the reward.

We define the model of preference using a Bernoulli likelihood, where \(p\left(y=1 \mid\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\) and \(p\left(y=0 \mid\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=\pi_f\left(\left[\mathbf{x}^{\prime}, \mathbf{x}\right]\right)\) for some inverse link function \(\pi: \Re \times \Re \rightarrow[0,1]\). \(\pi_f\) has the property that \(\pi_f\left(\left[\mathbf{x}^{\prime}, \mathbf{x}\right]\right)=1-\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\). A natural choice for \(\pi_f\) is the logistic function \[\label{eq:bernoulli_pref} \pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=\sigma\left(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\right)=\frac{1}{1+e^{-f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)}},\] but others are possible. Therefore we have that for any duel \(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\) in which \(g(\mathbf{x}) \leq g\left(\mathbf{x}^{\prime}\right)\) it holds that \(\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) \geq 0.5\). \(\pi_f\) is a preference function that maps each query \(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\) to the probability of having a preference on the left input \(\mathbf{x}\) over the right input \(\mathbf{x}^{\prime}\).

When we marginalize \(f\) over the right input \(\mathbf{x}^{\prime}\), the resulting function of \(\mathbf{x}\) attains its maximum at \(\mathbf{x}_{\min}\), the global minimum of \(g\). We also introduce the definition of the Copeland score function for a point \(\mathbf{x}\) as \[S(\mathbf{x})=\operatorname{Vol}(\mathcal{X})^{-1} \int_{\mathcal{X}} \mathbb{I}_{\left\{\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) \geq 0.5\right\}} d \mathbf{x}^{\prime}\] where \(\operatorname{Vol}(\mathcal{X})=\int_{\mathcal{X}} d \mathbf{x}^{\prime}\) is a normalizing constant that bounds \(S(\mathbf{x})\) in the interval \([0,1]\). If \(\mathcal{X}\) is a finite set, the Copeland score is simply the proportion of duels that a certain element \(\mathbf{x}\) will win with probability larger than 0.5. A soft variant we will use instead of the Copeland score is the soft-Copeland score, defined as \[\label{eq:soft-copeland} C(\mathbf{x})=\operatorname{Vol}(\mathcal{X})^{-1} \int_{\mathcal{X}} \pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) d \mathbf{x}^{\prime}\] where the probability function \(\pi_f\) is integrated over \(\mathcal{X}\). This score aims to capture the average probability of \(\mathbf{x}\) being the winner of a duel.

We define the Condorcet winner \(\mathbf{x}_c\) as the point with maximal soft-Copeland score. Note that this corresponds to the global minimum of \(g\): the defining integral takes its maximum value for points \(\mathbf{x} \in \mathcal{X}\) where \(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=g\left(\mathbf{x}^{\prime}\right)-g(\mathbf{x})>0\) for all \(\mathbf{x}^{\prime}\), which occurs only if \(\mathbf{x}_c\) is a minimum of \(g\). Therefore, if the preference function \(\pi_f\) can be learned by observing the results of duels, then our optimization problem of finding the minimum of \(g\) can be solved by finding the Condorcet winner, i.e. the maximizer of the soft-Copeland score.
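For a one-dimensional toy objective, the soft-Copeland score and its maximizer can be approximated on a grid; the sketch below assumes the logistic link and \(f([\mathbf{x}, \mathbf{x}']) = g(\mathbf{x}') - g(\mathbf{x})\), with a made-up \(g\).

```python
import numpy as np

g = lambda x: (x - 0.3) ** 2                 # toy latent objective on [0, 1]; minimum at 0.3
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

grid = np.linspace(0.0, 1.0, 201)
# pi_f([x, x']) = sigma(g(x') - g(x)); rows index x, columns index x'.
pi = sigmoid(g(grid)[None, :] - g(grid)[:, None])
copeland = pi.mean(axis=1)                   # soft-Copeland score: average win probability
x_c = grid[np.argmax(copeland)]
print(f"estimated Condorcet winner: x_c = {x_c:.2f}")   # close to the minimizer 0.3
```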

4.2.2 Acquisition Functions

We describe several acquisition functions for sequential learning of the Condorcet winner. Our dataset \(\mathcal{D}=\left\{\left[\mathbf{x}_i, \mathbf{x}_i^{\prime}\right], y_i\right\}_{i=1}^N\) represents the \(N\) duels that have been performed so far. We aim to define a sequential policy \(\alpha\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right] ; \mathcal{D}_j, \theta\right)\) for querying duels, where \(\theta\) is a vector of model hyper-parameters, in order to find the minimum of the latent function \(g\) as quickly as possible. Using Gaussian processes (GP) for classification with our dataset \(\mathcal{D}\) allows us to perform inference over \(f\) and \(\pi_f\).

Pure Exploration

The output variable \(y_{\star}\) of a prediction follows a Bernoulli distribution with probability given by the preference function \(\pi_f\). To carry out exploration as a policy, one method is to search for the duel where the GP is most uncertain about the probability of the outcome, i.e. where the variance of \(\sigma\left(f_{\star}\right)\) is highest; this is the result of pushing our epistemic uncertainty about \(f\), modeled by the GP, through the logistic function. The first-order moment of this distribution coincides with the expectation of \(y_{\star}\), but its variance is \[\begin{aligned} \mathbb{V}\left[\sigma\left(f_{\star}\right)\right] & =\int\left(\sigma\left(f_{\star}\right)-\mathbb{E}\left[\sigma\left(f_{\star}\right)\right]\right)^2 p\left(f_{\star} \mid \mathcal{D},\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) d f_{\star} \\ & =\int \sigma\left(f_{\star}\right)^2 p\left(f_{\star} \mid \mathcal{D},\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right) d f_{\star}-\mathbb{E}\left[\sigma\left(f_{\star}\right)\right]^2 \end{aligned}\] which explicitly takes into account the uncertainty over \(f\). Hence, pure exploration of the duel space can be carried out by maximizing \[\alpha_{\mathrm{PE}}\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right] \mid \mathcal{D}_j\right)=\mathbb{V}\left[\sigma\left(f_{\star}\right) \mid \left[\mathbf{x}, \mathbf{x}^{\prime}\right], \mathcal{D}_j\right].\]

Note that in this case, duels that have already been visited will have a lower chance of being visited again, even in cases where the objective takes similar values for both players. In practice, this acquisition function requires the computation of an intractable integral, which we approximate using Monte Carlo sampling.
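Concretely, given the GP's predictive mean and variance of \(f_\star\) at a candidate duel, \(\mathbb{V}[\sigma(f_\star)]\) can be approximated by simple Monte Carlo; `mean_f` and `var_f` below are assumed outputs of a GP classification model, not part of any specific library.

```python
import numpy as np

def pure_exploration_score(mean_f, var_f, n_samples=10_000, seed=0):
    """Monte Carlo estimate of Var[sigma(f_*)] for one candidate duel [x, x']."""
    rng = np.random.default_rng(seed)
    f_samples = rng.normal(mean_f, np.sqrt(var_f), size=n_samples)   # approximate GP posterior
    s = 1.0 / (1.0 + np.exp(-f_samples))                             # logistic link
    return s.var()

# The next duel is the candidate with the largest score (illustrative posterior moments).
candidates = [(0.0, 4.0), (1.5, 0.2), (-2.0, 1.0)]
scores = [pure_exploration_score(m, v) for m, v in candidates]
print(scores, "-> query candidate", int(np.argmax(scores)))
```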

Principled Optimistic Preferential Bayesian Optimization (POP-BO)

In a slightly modified problem setup (Xu et al. 2024), the algorithm maintains a maximum likelihood estimate \(\hat{g}\) together with a confidence set for the ground-truth black-box function \(g\), whose maximum we now seek. Assumptions include that \(g\) is a member of a reproducing kernel Hilbert space (RKHS) \(\mathcal{H}_k\) for some kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\), and that \(\|g\|_k \leq B\), so that \(g \in \mathcal{B}_g = \left\{\tilde{g} \in \mathcal{H}_k \mid\|\tilde{g}\|_k \leq B\right\}\). Defining \(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=g(\mathbf{x})-g\left(\mathbf{x}^{\prime}\right)\) (the sign convention is flipped relative to the previous subsection because here the preferred point is the one with the larger value of \(g\)), we model the preference function with a Bernoulli distribution as in Equation [eq:bernoulli_pref] and also assume that the probabilities follow the Bradley-Terry model, i.e. \[\pi_f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)=\sigma\left(f\left(\left[\mathbf{x}, \mathbf{x}^{\prime}\right]\right)\right)=\frac{e^{g(\mathbf{x})}}{e^{g(\mathbf{x})}+e^{g\left(\mathbf{x^{\prime}}\right)}}\]

The maximum likelihood estimate \(\hat{g}^{\text{MLE}}_t\) is obtained from the preference log-likelihood \(\ell_t\): \[\begin{aligned} \hat{g}_t^{\text {MLE }}&:= \arg \underset{\tilde{g} \in \mathcal{B}^t_g}{\max}\ \ell_t(\tilde{g}) \\ \ell_t(\tilde{g}) &:= \log \prod_{\tau=1}^t \left[ y_\tau \pi_{\tilde{f}}([\mathbf{x_\tau}, \mathbf{x^{\prime}_\tau}])+\left(1-y_\tau\right)\left(1-\pi_{\tilde{f}}([\mathbf{x_\tau}, \mathbf{x^{\prime}_\tau}])\right) \right] \\ &=\sum_{\tau=1}^t \log \left(\frac{e^{\tilde{g}(\mathbf{x_\tau})} y_\tau+e^{\tilde{g}(\mathbf{x_\tau^\prime})}\left(1-y_\tau\right)}{e^{\tilde{g}(\mathbf{x_\tau})}+e^{\tilde{g}(\mathbf{x_\tau^\prime})}}\right) \\ &=\sum_{\tau=1}^t\left(\tilde{g}(\mathbf{x_\tau}) y_\tau+\tilde{g}(\mathbf{x_\tau^\prime})\left(1-y_\tau\right)\right)-\sum_{\tau=1}^t \log \left(e^{\tilde{g}(\mathbf{x_\tau})}+e^{\tilde{g}(\mathbf{x_\tau^\prime})}\right) \end{aligned}\]

Xu et al. (2024) further show how to represent this maximization as a convex optimization problem so that it can be solved efficiently.

The update rule for the confidence set \(\mathcal{B}_g^{t+1}\) is as follows.

For any \(\epsilon, \delta > 0\), \[\mathcal{B}_g^{t+1}:=\left\{\tilde{g} \in \mathcal{B}_g \mid \ell_t(\tilde{g}) \geq \ell_t\left(\hat{g}_t^{\mathrm{MLE}}\right)-\beta_1(\epsilon, \delta, t)\right\},\] where \[\beta_1(\epsilon, \delta, t):=\sqrt{32 t B^2 \log \frac{\pi^2 t^2 \mathcal{N}\left(\mathcal{B}_f, \epsilon,\|\cdot\|_{\infty}\right)}{6 \delta}}+ C_L \epsilon t=\mathcal{O}\left(\sqrt{t \log \frac{t \mathcal{N}\left(\mathcal{B}_f, \epsilon,\|\cdot\|_{\infty}\right)}{\delta}}+\epsilon t\right),\] with \(C_L\) a constant independent of \(\delta\), \(t\), and \(\epsilon\). \(\epsilon\) is typically chosen to be \(1/T\), where \(T\) is the running horizon of the algorithm. This construction satisfies the following guarantee: \[\mathbb{P}\left(g \in \mathcal{B}_g^{t+1}, \forall t \geq 1\right) \geq 1-\delta .\]

Intuitively, the confidence set \(\mathcal{B}_g^{t+1}\) includes those functions whose log-likelihood is only ‘a little worse’ than that of the maximum likelihood estimator, and the theorem states that \(\mathcal{B}_g^{t+1}\) contains the ground-truth function \(g\) with high probability.
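The log-likelihood \(\ell_t\) and the confidence-set test can be written down directly from the formulas above; in the sketch below, a candidate \(\tilde{g}\) is represented only by its values at the dueled points, and \(\beta\) is treated as a given constant.

```python
import numpy as np

def log_likelihood(g_x, g_xp, y):
    """ell_t(g~) from the values of g~ at the duel points; y_tau = 1 means x_tau won."""
    g_x, g_xp, y = map(np.asarray, (g_x, g_xp, y))
    winner = g_x * y + g_xp * (1 - y)                       # g~ evaluated at the winning point
    return float(np.sum(winner - np.logaddexp(g_x, g_xp)))  # sum_t [g~(win) - log(e^g~(x)+e^g~(x'))]

def in_confidence_set(candidate_ll, mle_ll, beta):
    # g~ stays in B_g^{t+1} if its log-likelihood is only 'a little worse' than the MLE's.
    return candidate_ll >= mle_ll - beta

# Toy check with three duels; the values of g~ at the duel points are made up.
g_x, g_xp, y = [1.0, 0.2, -0.5], [0.4, 0.9, -0.1], [1, 0, 1]
ll = log_likelihood(g_x, g_xp, y)
print(ll, in_confidence_set(ll, mle_ll=ll + 0.3, beta=1.0))
```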

The inner-level optimization in line 4 of the algorithm below can likewise be represented as a convex optimization problem, and the outer optimization can be solved using grid search for medium-sized problems (Xu et al. 2024).

The full POP-BO loop is as follows (a minimal code sketch is given after the list). Given the initial point \(\mathbf{x_0} \in \mathcal{X}\), set \(\mathcal{B}_g^1 = \mathcal{B}_g\); then, for \(t = 1, 2, \ldots\):

  1. Set the reference point \(\mathbf{x_t^{\prime}} = \mathbf{x_{t-1}}\).

  2. Compute \(\mathbf{x_t} \in \arg\max_{\mathbf{x} \in \mathcal{X}} \max_{\tilde{g} \in \mathcal{B}_g^t} (\tilde{g}(\mathbf{x}) - \tilde{g}(\mathbf{x_t^{\prime}}))\), with the inner optimal function denoted as \(\tilde{g}_t\).

  3. Obtain the output of the duel \(y_t\) and append the new data point to \(\mathcal{D}_t\).

  4. Update the maximum likelihood estimator \(\hat{g}_t^{\mathrm{MLE}}\) and the posterior confidence set \(\mathcal{B}_g^{t+1}\).
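A minimal sketch of this loop on a finite design grid is given below; the simulated ground-truth utility, the Bernoulli response model, and the ensemble approximation of the confidence set are all simplifying assumptions standing in for the exact RKHS machinery of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

n_grid, T, beta = 30, 40, 2.0
x_grid = np.linspace(0.0, 1.0, n_grid)
g_true = np.sin(3.0 * x_grid)          # hidden utility, used only to simulate the user

# An ensemble of smooth random functions standing in for the ball B_g.
freqs = rng.uniform(0.5, 4.0, size=(400, 1))
phases = rng.uniform(0.0, 2.0 * np.pi, size=(400, 1))
candidates = np.sin(np.pi * freqs * x_grid[None, :] + phases)

duels, ys = [], []
x_idx = 0                              # initial point x_0

def log_lik(g):
    """Bradley-Terry log-likelihood of the duels collected so far."""
    if not duels:
        return 0.0
    d, yv = np.array(duels), np.array(ys)
    gi, gj = g[d[:, 0]], g[d[:, 1]]
    return np.sum(gi * yv + gj * (1 - yv)) - np.sum(np.logaddexp(gi, gj))

for t in range(T):
    ref_idx = x_idx                    # reference point x'_t = x_{t-1}
    lls = np.array([log_lik(g) for g in candidates])
    conf = candidates[lls >= lls.max() - beta]        # approximate confidence set B_g^t
    # Optimistic acquisition: max over x of max over g~ in B_g^t of g~(x) - g~(x'_t).
    x_idx = int(np.argmax(np.max(conf - conf[:, [ref_idx]], axis=0)))
    # Simulate the duel outcome with the Bradley-Terry model (y = 1: x_t preferred).
    p_win = 1.0 / (1.0 + np.exp(-(g_true[x_idx] - g_true[ref_idx])))
    y_t = int(rng.random() < p_win)
    duels.append((x_idx, ref_idx))
    ys.append(y_t)

print("last queried point:", x_grid[x_idx], "| argmax of g_true:", x_grid[np.argmax(g_true)])
```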

qEUBO: Decision-Theoretic EUBO

qEUBO (Astudillo et al. 2023) derives an acquisition function that extends duels to \(q>2\) options, which we call queries. Let \(X=\left(\mathbf{x_1}, \ldots, \mathbf{x_q}\right) \in \mathcal{X}^q\) denote a query containing two or more points, and let \(g: \mathcal{X} \rightarrow \mathbb{R}\) be the latent preference function. Then, after \(n\) user queries, we define the expected utility of the best option (qEUBO) as \[\mathrm{qEUBO}_n(X)=\mathbb{E}_n\left[\max \left\{g\left(x_1\right), \ldots, g\left(x_q\right)\right\}\right].\]
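Under a Gaussian posterior over the latent utility, \(\mathrm{qEUBO}_n(X)\) can be estimated by Monte Carlo: draw joint posterior samples of \(g\) at the \(q\) query points and average the per-sample maximum. The sketch below does exactly this for an assumed multivariate normal posterior; the mean vector and covariance are placeholder values, not quantities from the paper.

```python
import numpy as np

def qeubo_mc(mean, cov, n_samples=10_000, rng=None):
    """Monte Carlo estimate of qEUBO_n(X) = E_n[ max_i g(x_i) ].

    mean, cov: posterior mean vector and covariance of (g(x_1), ..., g(x_q)).
    """
    rng = np.random.default_rng(rng)
    samples = rng.multivariate_normal(mean, cov, size=n_samples)  # (n_samples, q)
    return samples.max(axis=1).mean()

# Toy posterior over a query with q = 3 options.
mean = np.array([0.2, 0.0, -0.1])
cov = 0.3 * np.eye(3) + 0.05
print(qeubo_mc(mean, cov, rng=0))
```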

We now show that qEUBO is one-step Bayes optimal, meaning that each step chooses the query that maximises the expected utility received by the human. For a query \(X \in \mathcal{X}^q\), let \[V_n(X)=\mathbb{E}_n\left[\max _{x \in \mathbb{X}} \mathbb{E}_{n+1}[g(x)] \mid X_{n+1}=X\right] .\] Then \(V_n\) defines the expected utility received if an additional query \(X_{n+1}=X\) is performed, and maximizing \(V_n\) is one-step Bayes optimal. Since \(\max _{x \in \mathbb{X}} \mathbb{E}_n[g(x)]\) does not depend on \(X_{n+1}\), we can also equivalently maximize \[\mathbb{E}_n\left[\max _{x \in \mathbb{X}} \mathbb{E}_{n+1}[g(x)]-\max _{x \in \mathbb{X}} \mathbb{E}_n[g(x)] \mid X_{n+1}=X\right],\] which takes the same form as the knowledge gradient acquisition function (Wu and Frazier 2018) in standard Bayesian optimization.

\(V_n\) involves a nested stochastic optimization task, while qEUBO is a much simpler policy. When human responses are noise-free, we are able to use qEUBO as a sufficient policy due to the following theorem:

\[\underset{X \in \mathbb{X}^q}{\operatorname{argmax}} \mathrm{qEUBO}_n(X) \subseteq \underset{X \in \mathbb{X}^q}{\operatorname{argmax}} V_n(X) .\]

Proof. For a query \(X \in \mathcal{X}^q\), let \(x^{+}(X, i) \in \operatorname{argmax}_{x \in \mathbb{X}} \mathbb{E}_n[g(x) \mid(X, i)]\) and define \(X^{+}(X)=\) \(\left(x^{+}(X, 1), \ldots, x^{+}(X, q)\right)\).

Claim 1: \(V_n(X) \leq \mathrm{qEUBO}_n\left(X^{+}(X)\right) .\) We see that \[\begin{aligned} V_n(X) & =\sum_{i=1}^q \mathbf{P}_n(r(X)=i)\, \mathbb{E}_n\left[g\left(x^{+}(X, i)\right) \mid (X, i)\right] \\ & \leq \sum_{i=1}^q \mathbf{P}_n(r(X)=i)\, \mathbb{E}_n\left[\max _{j=1, \ldots, q} g\left(x^{+}(X, j)\right) \mid (X, i)\right] \\ & =\mathbb{E}_n\left[\max _{j=1, \ldots, q} g\left(x^{+}(X, j)\right)\right] \\ & =\mathrm{qEUBO}_n\left(X^{+}(X)\right), \end{aligned}\] as claimed, where the third line follows from the tower property of conditional expectation.

Claim 2: \(\mathrm{qEUBO}_n(X) \leq V_n(X) .\) For any given \(X \in \mathbb{X}^q\) we have \[\mathbb{E}_n\left[g\left(x_{r(X)}\right) \mid(X, r(X))\right] \leq \max _{x \in \mathbb{X}} \mathbb{E}_n[g(x) \mid(X, r(X))] .\] Since \(g\left(x_{r(X)}\right)=\max _{i=1, \ldots, q} g\left(x_i\right)\) when responses are noise-free, taking expectations over \(r(X)\) on both sides yields the required result.

Now building on the arguments above, let \(X^* \in \operatorname{argmax}_{X \in \mathbb{X}^q} \mathrm{qEUBO}_n(X)\) and suppose for contradiction that \(X^* \notin \operatorname{argmax}_{X \in \mathbb{X}^q} V_n(X)\). Then, there exists \(\widetilde{X} \in \mathbb{X}^q\) such that \(V_n(\widetilde{X})>V_n\left(X^*\right)\). We have \[\begin{aligned} \operatorname{qEUBO}_n\left(X^{+}(\tilde{X})\right) & \geq V_n(\tilde{X}) \\ & >V_n\left(X^*\right) \\ & \geq \operatorname{qEUBO}_n\left(X^*\right) \\ & \geq \operatorname{qEUBO}_n\left(X^{+}(\tilde{X})\right) . \end{aligned}\]

The first inequality follows from Claim 1. The second inequality is due to our supposition for contradiction. The third inequality is due to Claim 2. Finally, the fourth inequality holds since \(X^* \in \operatorname{argmax}_{X \in \mathbb{X}^q} \mathrm{qEUBO}_n(X)\). This contradiction concludes the proof. ◻

Therefore, when responses are noise-free, maximizing \(\text{qEUBO}_n\) is sufficient for one-step Bayes optimality.

In experiments comparing qEUBO to other state-of-the-art acquisition functions, qEUBO consistently outperformed the alternatives on most problems and was closely followed by qEI and qTS. These results also extended to experiments with more than two options per query (\(q>2\)); in fact, regret converges faster when human queries contain more options (see the regret analysis in Section 4.2.3).

qEI: Batch Expected Improvement

The batch expected improvement acquisition (written here for minimization, where \(\mu_{\min}\) denotes the current best (smallest) value and \(y_1, \ldots, y_q\) the outcomes of the batch) is \[\begin{aligned} \mathrm{qEI}= & \mathbb{E}_{\mathbf{y}}\left[\left(\max _{i \in[1, \ldots, q]}\left(\mu_{\min }-y_i\right)\right)_{+}\right] \\ = & \sum_{i=1}^q \mathbb{E}_{\mathbf{y}}\left(\mu_{\min }-y_i \mid y_i \leq \mu_{\min }, y_i \leq y_j \forall j \neq i\right) \\ & p\left(y_i \leq \mu_{\min }, y_i \leq y_j \forall j \neq i\right) . \end{aligned}\]
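Since \(\max_i(\mu_{\min}-y_i) = \mu_{\min}-\min_i y_i\), qEI is usually estimated by Monte Carlo rather than through the nested conditional expectations above. A minimal sketch under an assumed multivariate normal posterior over the batch outputs follows; the posterior parameters and the incumbent value are placeholders.

```python
import numpy as np

def qei_mc(mu_min, mean, cov, n_samples=10_000, rng=None):
    """Monte Carlo estimate of qEI = E_y[ (mu_min - min_i y_i)_+ ] (minimization form).

    mu_min    : incumbent best (smallest) value
    mean, cov : posterior mean vector and covariance of the batch outputs y_1, ..., y_q
    """
    rng = np.random.default_rng(rng)
    y = rng.multivariate_normal(mean, cov, size=n_samples)   # (n_samples, q)
    improvement = np.maximum(mu_min - y.min(axis=1), 0.0)    # batch improvement
    return improvement.mean()

# Toy batch of q = 4 candidate points.
print(qei_mc(mu_min=0.0,
             mean=np.array([0.1, -0.05, 0.2, 0.0]),
             cov=0.2 * np.eye(4) + 0.02,
             rng=0))
```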

qTS: Batch Thompson Sampling

The sequential Thompson sampling loop is: given initial data \(\mathcal{D}_{\mathcal{I}(1)}=\{(\mathbf{x}_i, y_i)\}_{i \in \mathcal{I}(1)}\), at each round \(t\):

  1. Compute the current posterior \(p(\boldsymbol{\theta} \mid \mathcal{D}_{\mathcal{I}(t)})\).

  2. Sample \(\boldsymbol{\theta}\) from \(p(\boldsymbol{\theta} \mid \mathcal{D}_{\mathcal{I}(t)})\).

  3. Select \(k \leftarrow \arg \max_{j \notin \mathcal{I}(t)} \mathbb{E}[y_j \mid \mathbf{x}_j, \boldsymbol{\theta}]\).

  4. Collect \(y_k\) by evaluating \(f\) at \(\mathbf{x}_k\).

  5. Update \(\mathcal{D}_{\mathcal{I}(t+1)} \leftarrow \mathcal{D}_{\mathcal{I}(t)} \cup \{(\mathbf{x}_k, y_k)\}\).

The batch variant with batch size \(S\) is analogous (a code sketch follows the list): given initial data \(\mathcal{D}_{\mathcal{I}(1)}=\{(\mathbf{x}_i, y_i)\}_{i \in \mathcal{I}(1)}\), at each round \(t\):

  1. Compute the current posterior \(p(\boldsymbol{\theta} \mid \mathcal{D}_{\mathcal{I}(t)})\).

  2. For each batch slot \(s = 1, \ldots, S\): sample \(\boldsymbol{\theta}\) from \(p(\boldsymbol{\theta} \mid \mathcal{D}_{\mathcal{I}(t)})\) and select \(k(s) \leftarrow \arg \max_{j \notin \mathcal{I}(t)} \mathbb{E}[y_j \mid \mathbf{x}_j, \boldsymbol{\theta}]\).

  3. Evaluate the batch and update \(\mathcal{D}_{\mathcal{I}(t+1)} = \mathcal{D}_{\mathcal{I}(t)} \cup \{(\mathbf{x}_{k(s)}, y_{k(s)})\}_{s=1}^S\).
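A minimal sketch of the batch variant on a discrete candidate pool follows, using a conjugate Gaussian model with known noise so that the posterior update stays in closed form; the independent per-point prior, the pool size, and the noise level are illustrative assumptions rather than part of the qTS setup used in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

n_pool, S, rounds = 30, 4, 5
sigma2, tau2 = 0.25, 1.0                       # known noise and prior variances
theta_true = rng.normal(0.0, np.sqrt(tau2), n_pool)   # latent value of each pool point

selected = np.zeros(n_pool, dtype=bool)        # the index set I(t)
counts = np.zeros(n_pool)
sums = np.zeros(n_pool)

for t in range(rounds):
    # Conjugate Gaussian posterior over each point's latent value given D_I(t).
    post_var = 1.0 / (1.0 / tau2 + counts / sigma2)
    post_mean = post_var * sums / sigma2

    batch = []
    for s in range(S):                         # pick S points per round
        theta = rng.normal(post_mean, np.sqrt(post_var))   # Thompson sample of theta
        theta[selected] = -np.inf              # restrict to j not in I(t)
        theta[batch] = -np.inf                 # and not already chosen in this batch
        batch.append(int(np.argmax(theta)))

    for k in batch:                            # evaluate f at the chosen points
        y_k = theta_true[k] + rng.normal(0.0, np.sqrt(sigma2))
        counts[k] += 1
        sums[k] += y_k
        selected[k] = True                     # D_I(t+1) = D_I(t) U {(x_k, y_k)}

print("best pool point:", int(np.argmax(theta_true)),
      "| points selected so far:", int(selected.sum()))
```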

4.2.3 Regret Analysis

qEUBO Regret

With the definition of Bayesian simple regret, we have that the simple regret of qEUBO converges to zero at a rate of \(o(1/n)\), i.e.

\[\label{th:quebo_regret} \mathbb{E}\left[f\left(x^*\right)-f\left(\widehat{x}_n^*\right)\right]=o(1 / n)\]

where \(x^*=\operatorname{argmax}_{x \in \mathrm{X}} f(x)\) and \(\widehat{x}_n^* \in \operatorname{argmax}_{x \in \mathrm{X}} \mathbb{E}_n[f(x)]\).

This theorem holds under the following assumptions:

  1. \(f\) is injective: \(\mathbf{P}(f(x)=f(y))=0\) for any \(x, y \in \mathbb{X}\) with \(x \neq y\).

  2. \(f\) represents the preferred option: \(\exists a>1 / 2\) s.t. \(\mathbf{P}\left(r(X) \in \operatorname{argmax}_{i=1, 2} f\left(x_i\right) \mid f(X)\right) \geq a\) for all \(X=\left(x_1, x_2\right) \in \mathbb{X}^2\) with \(x_1 \neq x_2\), almost surely under the prior on \(f\).

  3. Expected difference in utility is proportional to probability of greater utility: \(\exists \Delta \geq \delta>0\) s.t. for all \(\mathcal{D}^{(n)}\) and all \(x, y \in \mathbb{X}\) (potentially depending on \(\mathcal{D}^{(n)}\)), \[\delta \mathbf{P}^{(n)}(f(x)>f(y)) \leq \mathbb{E}^{(n)}\left[\{f(x)-f(y)\}^{+}\right] \leq \Delta \mathbf{P}^{(n)}(f(x)>f(y))\] almost surely under the prior on \(f\).

Further lemmas leading to a proof of Theorem [th:quebo_regret] are given in Section B of (Astudillo et al. 2023).

qEI Regret

The following theorem shows that, under the same assumptions used for the qEUBO regret bound, the simple regret of qEI can fail to converge to zero.

There exists a problem instance (i.e., \(\mathbb{X}\) and Bayesian prior distribution over \(f\)) satisfying the assumptions described in Theorem [th:quebo_regret] such that, if the sequence of queries is chosen by maximizing qEI, then \(\mathbb{E}\left[f\left(x^*\right)-f\left(\widehat{x}_n^*\right)\right] \geq R\) for all \(n\), for a constant \(R>0\).

Proof. Let \(\mathbb{X} = \{1, 2, 3, 4\}\) and consider the functions \(f_i:\mathbb{X} \rightarrow \mathbb{R}\), for \(i=1,2,3,4\), given by \(f_i(1) = -1\) and \(f_i(2) = 0\) for all \(i\), and \[\begin{aligned} f_1(x) = \begin{cases} 1, &\ x=3\\ \frac{1}{2}, &\ x=4 \end{cases}, \hspace{0.5cm} f_2(x) = \begin{cases} \frac{1}{2}, &\ x=3\\ 1, &\ x=4 \end{cases}, \hspace{0.5cm} f_3(x) = \begin{cases} -\frac{1}{2}, &\ x=3\\ -1, &\ x=4 \end{cases}, \hspace{0.5cm} f_4(x) = \begin{cases} -1, &\ x=3\\ -\frac{1}{2}, &\ x=4 \end{cases}. \end{aligned}\]

Let \(p\) be a number with \(0 < p < 1/3\) and set \(q=1-p\). We consider a prior distribution on \(f\) with support \(\{f_i\}_{i=1}^4\) such that \[\begin{aligned} p_i = Pr(f=f_i) = \begin{cases} p/2, & i =1,2,\\ q/2, & i=3,4. \end{cases} \end{aligned}\] We also assume the user’s response likelihood is given by \(Pr(r(X)=1\mid f(x_1) > f(x_2)) = a\) for some \(a\) such that \(1/2 < a < 1\).

Let \(\mathcal{D}^{(n)}\) denote the set of observations up to time \(n\) and let \(p_i^{(n)} = Pr(f=f_i \mid \mathcal{D}^{(n)})\) for \(i=1,2,3,4\). We let the initial data set be \(\mathcal{D}^{(0)} = \{(X^{(0)}, r^{(0)})\}\), where \(X^{(0)}= (1,2)\). We will prove that the following statements are true for all \(n\geq 0\).

  1. \(p_i^{(n)} > 0\) for \(i=1,2,3,4\).

  2. \(p_1^{(n)} < \frac{1}{2}p_3^{(n)}\) and \(p_2^{(n)} < \frac{1}{2}p_4^{(n)}\).

  3. \(\arg \max_{x\in\mathcal{X}}\mathbb{E}^{(n)}[f(x)]=\{2\}\).

  4. \(\arg \max_{X\in\mathcal{X}^2}\text{qEI}^{(n)}(X) = \{(3, 4)\}\).

We prove this by induction over \(n\). We begin by proving this for \(n=0\). Since \(f_i(1) < f_i(2)\) for all \(i\), the posterior distribution on \(f\) given \(\mathcal{D}^{(0)}\) remains the same as the prior; i.e., \(p_i^{(0)} = p_i\) for \(i=1,2,3,4\). Using this, statements 1 and 2 can be easily verified. Now note that \(\mathbb{E}^{(0)}[f(1)]=-1\), \(\mathbb{E}^{(0)}[f(2)]=0\), and \(\mathbb{E}^{(0)}[f(3)] = \mathbb{E}^{(0)}[f(4)] = \frac{3}{4}(p - q)\). Since \(p < q\), it follows that \(\arg \max_{x\in\mathcal{X}}\mathbb{E}^{(0)}[f(x)]=\{2\}\); i.e., statement 3 holds. Finally, since \(\max_{x\in\mathcal{X}}\mathbb{E}^{(0)}[f(x)] = 0\), the qEI acquisition function at time \(n=0\) is given by \(\text{qEI}^{(0)}(X) = \mathbb{E}^{(0)}[\{\max\{f(x_1), f(x_2)\}\}^+]\). A direct calculation can now be performed to verify that statement 4 holds. This completes the base case.

Now suppose statements 1-4 hold for some \(n\geq 0\). Since \(X^{(n+1)} = (3, 4)\), the posterior distribution on \(f\) given \(D^{(n+1)}\) is given by \[\begin{aligned} p_i^{(n+1)} \propto \begin{cases} p_i^{(n)}\ell, \ i=1,3,\\ p_i^{(n)} (1 - \ell), \ i=2,4, \end{cases} \end{aligned}\] where \[\ell = a I\{r^{(n+1)} = 1\} + (1-a)I\{r^{(n+1)} = 2\}.\] Observe that \(0< \ell < 1\) since \(0 < a < 1\). Thus, \(\ell > 0\) and \(1-\ell > 0\). Since \(p_i^{(n)} > 0\) by the induction hypothesis, it follows from this that \(p_i^{(n+1)} > 0\) for \(i=1,2,3,4\). Moreover, since \(p_i^{(n+1)} \propto p_i^{(n)}\ell\) for \(i=1,3\) and \(p_1^{(n)} < \frac{1}{2}p_3^{(n)}\) by the induction hypothesis, it follows that \(p_1^{(n+1)} < \frac{1}{2}p_3^{(n+1)}\). Similarly, \(p_2^{(n+1)} < \frac{1}{2}p_4^{(n+1)}\). Thus, statements 1 and 2 hold at time \(n+1\).

Now observe that \[\begin{aligned} \mathbb{E}^{(n+1)}[f(3)] &= p_1^{(n+1)} + \frac{1}{2}p_2^{(n+1)} - \frac{1}{2}p_3^{(n+1)} - p_4^{(n+1)}\\ &= \left(p_1^{(n+1)} - \frac{1}{2}p_3^{(n+1)}\right) + \left(\frac{1}{2}p_2^{(n+1)} - p_4^{(n+1)}\right)\\ &\leq \left(p_1^{(n+1)} - \frac{1}{2}p_3^{(n+1)}\right) + \left(p_2^{(n+1)} - \frac{1}{2}p_4^{(n+1)}\right)\\ &\leq 0, \end{aligned}\] where the last inequality holds since \(p_1^{(n+1)} < \frac{1}{2}p_3^{(n+1)}\) and \(p_2^{(n+1)} < \frac{1}{2}p_4^{(n+1)}\). Similarly, we see that \(\mathbb{E}^{(n+1)}[f(4)] \leq 0\). Since \(\mathbb{E}^{(n+1)}[f(1)]=-1\) and \(\mathbb{E}^{(n+1)}[f(2)]=0\), it follows that \(\arg \max_{x\in\mathcal{X}}\mathbb{E}^{(n+1)}[f(x)]=\{2\}\); i.e., statement 3 holds at time \(n+1\).

Since \(\max_{x\in\mathcal{X}}\mathbb{E}^{(n+1)}[f(x)] = 0\), the qEI acquisition function at time \(n+1\) is given by \(\text{qEI}^{(n+1)}(X) = \mathbb{E}^{(n+1)}[\{\max\{f(x_1), f(x_2)\}\}^+]\). Since \(f(1) \leq f(x)\) almost surely under the prior for all \(x\in\mathcal{X}\), there is always a maximizer of qEI that does not contain \(1\). Thus, to find the maximizer of qEI, it suffices to analyse its value at the pairs \((2, 3)\), \((3,4)\) and \((4,2)\). We have \[\text{qEI}^{(n+1)}(2, 3) = p_1^{(n+1)} + 1/2 p_2^{(n+1)},\] \[\operatorname{qEI}^{(n+1)}(3, 4) = p_1^{(n+1)} + p_2^{(n+1)}\] and \[\operatorname{qEI}^{(n+1)}(4, 2) = 1/2p_1^{(n+1)} + p_2^{(n+1)}.\] Since \(p_1^{(n+1)} > 0\) and \(p_2^{(n+1)} > 0\), it follows that \(\arg \max_{X \in \mathcal{X}^2}\text{qEI}^{(n+1)}(X) = \{(3, 4)\}\), which concludes the proof by induction.

Finally, since \(\arg \max_{x\in \mathcal{X}}\mathbb{E}^{(n)}[f(x)]=\{2\}\) for all \(n\), the Bayesian simple regret of qEI is given by \[\begin{aligned} \mathbb{E}\left[f(x^*) - f(2)\right] &= \sum_{i=1}^{4}p_i\left(\max_{x\in \mathcal{X}}f_i(x) - f_i(2)\right)\\ &= p \end{aligned}\] for all \(n\). ◻
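Because the counterexample is fully discrete, it is easy to check numerically. The sketch below simulates the posterior updates and the qEI values for the four-function prior and confirms that qEI keeps querying \((3, 4)\) while point \(2\) remains the posterior-mean maximizer, so the simple regret stays at \(p\); the particular values of \(p\), \(a\), and the number of rounds are arbitrary choices satisfying the stated constraints.

```python
import numpy as np

rng = np.random.default_rng(4)

# The four candidate utility functions on X = {1, 2, 3, 4} (columns are points 1..4).
F = np.array([[-1.0, 0.0,  1.0,  0.5],
              [-1.0, 0.0,  0.5,  1.0],
              [-1.0, 0.0, -0.5, -1.0],
              [-1.0, 0.0, -1.0, -0.5]])
p, a, rounds = 0.2, 0.8, 50                    # p < 1/3 and 1/2 < a < 1
prior = np.array([p / 2, p / 2, (1 - p) / 2, (1 - p) / 2])

f_star = rng.choice(4, p=prior)                # nature's draw of the true function
post = prior.copy()
for n in range(rounds):
    # qEI over the pairs (2,3), (3,4), (4,2); indices below are 0-based.
    pairs = [(1, 2), (2, 3), (3, 1)]
    qei = [np.sum(post * np.maximum(F[:, [i, j]].max(axis=1), 0.0)) for i, j in pairs]
    i, j = pairs[int(np.argmax(qei))]
    assert (i, j) == (2, 3)                    # qEI always queries the pair (3, 4)

    # Simulate the user's noisy response and perform the Bayes update.
    first_better = F[f_star, i] > F[f_star, j]
    r_is_first = rng.random() < (a if first_better else 1.0 - a)
    lik = np.where(F[:, i] > F[:, j], a, 1.0 - a)
    post = post * (lik if r_is_first else 1.0 - lik)
    post /= post.sum()

    # Point 2 (0-based index 1) remains the unique posterior-mean maximizer.
    assert int(np.argmax(post @ F)) == 1

print("posterior after", rounds, "rounds:", np.round(post, 3))
print("simple regret of reporting point 2:", np.sum(prior * (F.max(axis=1) - F[:, 1])))
```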

POP-BO Regret

Commonly used kernel functions within the RKHS are the following (a short implementation sketch is given after the list):

  1. Linear: \[k(x, \bar{x})=x^{\top} \bar{x} .\]

  2. Squared Exponential (SE): \[k(x, \bar{x})=\sigma_{\mathrm{SE}}^2 \exp \left\{-\frac{\|x-\bar{x}\|^2}{l^2}\right\},\] where \(\sigma_{\mathrm{SE}}^2\) is the variance parameter and \(l\) is the lengthscale parameter.

  3. Matérn: \[k(x, \bar{x})=\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2 \nu} \frac{\|x-\bar{x}\|}{\rho}\right)^\nu K_\nu\left(\sqrt{2 \nu} \frac{\|x-\bar{x}\|}{\rho}\right),\] where \(\rho\) and \(\nu\) are the two positive parameters of the kernel function, \(\Gamma\) is the gamma function, and \(K_\nu\) is the modified Bessel function of the second kind. \(\nu\) captures the smoothness of the kernel function.
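For concreteness, the three kernels above can be written in a few lines of Python (using SciPy for the gamma and modified Bessel functions); the variance, lengthscale, and Matérn parameters below are arbitrary example values.

```python
import numpy as np
from scipy.special import gamma, kv   # gamma function and modified Bessel function K_nu

def linear_kernel(x, x_bar):
    return float(np.dot(x, x_bar))

def se_kernel(x, x_bar, sigma2_se=1.0, lengthscale=1.0):
    r2 = np.sum((np.asarray(x) - np.asarray(x_bar)) ** 2)
    return sigma2_se * np.exp(-r2 / lengthscale**2)

def matern_kernel(x, x_bar, rho=1.0, nu=2.5):
    r = np.linalg.norm(np.asarray(x) - np.asarray(x_bar))
    if r == 0.0:
        return 1.0                     # limiting value as the distance goes to zero
    z = np.sqrt(2.0 * nu) * r / rho
    return (2.0 ** (1.0 - nu) / gamma(nu)) * z**nu * kv(nu, z)

x, x_bar = np.array([0.2, 0.5]), np.array([0.4, 0.1])
print(linear_kernel(x, x_bar), se_kernel(x, x_bar), matern_kernel(x, x_bar))
```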

The following theorem bounds the cumulative regret of POP-BO:

With probability at least \(1-\delta\), the cumulative regret of POP-BO satisfies, \[R_T=\mathcal{O}\left(\sqrt{\beta_T \gamma_T^{f f^{\prime}} T}\right),\] where \[\beta_T=\beta(1 / T, \delta, T)=\mathcal{O}\left(\sqrt{T \log \frac{T \mathcal{N}\left(\mathcal{B}_f, 1 / T,\|\cdot\|_{\infty}\right)}{\delta}}\right).\]

The guaranteed convergence rate is characterised as:

Let \(t^{\star}\) be defined as in Eq. (19) of Xu et al. (2024). With probability at least \(1-\delta\), \[f\left(x^{\star}\right)-f\left(x_{t^{\star}}\right) \leq \mathcal{O}\left(\frac{\sqrt{\beta_T \gamma_T^{f f^{\prime}}}}{\sqrt{T}}\right)\]

The theorem above highlights that by minimizing the known term \(2\left(2 B+\lambda^{-1 / 2} \sqrt{\beta\left(\epsilon, \frac{\delta}{2}, t\right)}\right) \sigma_t^{f f^{\prime}}\left(\left(x_t, x_t^{\prime}\right)\right)\), the reported final solution \(x_{t^{\star}}\) has a guaranteed convergence rate.

Further kernel-specific regret bounds for POP-BO are calculated as follows:

Setting \(\epsilon=1 / T\) and running the POP-BO algorithm (Algorithm 1 in Xu et al. 2024):

  1. If \(k(x, y)=\langle x, y\rangle\), we have, \[R_T=\mathcal{O}\left(T^{3 / 4}(\log T)^{3 / 4}\right) .\]

  2. If \(k(x, y)\) is a squared exponential kernel, we have, \[R_T=\mathcal{O}\left(T^{3 / 4}(\log T)^{3 / 4(d+1)}\right) .\]

  3. If \(k(x, y)\) is a Matérn kernel, we have, \[R_T=\mathcal{O}\left(T^{3 / 4}(\log T)^{3 / 4}\, T^{\frac{d}{\nu}\left(\frac{1}{4}+\frac{d+1}{4+2(d+1)^d / \nu}\right)}\right).\]

4.3 Case Study 1: Foundation Models for Robotics

Modern foundation models have been ubiquitous in discussions of powerful, general-purpose AI systems that can accomplish myriad tasks across many disciplines such as programming, medicine, law, and open question-answering, with rapidly increasing capabilities (Bommasani et al. 2022). However, despite successes from large labs in controlled environments (Brohan et al. 2023), foundation models have not seen widespread use in robotics due to shifting robot morphology, lack of data, and the sim-to-real gap (Walke et al. 2023). In this subsection we explore two promising approaches, R3M and Voltron, which are among the first to leverage pre-training on vast amounts of data to improve performance on downstream robotic tasks despite these issues (Nair et al. 2022; Karamcheti et al. 2023).

R3M represents a significant advancement in robotic manipulation and learning. The model diverges from traditional approaches that train from scratch on in-domain data from the same robot; instead, it leverages pretraining on large datasets, akin to the practice in computer vision and natural language processing (NLP) of training on diverse, large-scale datasets to create reusable, general-purpose representations. The core principle behind R3M is its training methodology: it is pre-trained on Ego4D (Grauman et al. 2022), a wide array of human videos encompassing various activities and interactions. This diverse dataset enables the model to capture a broad spectrum of physical interactions and dynamics that are crucial for effective robotic manipulation, and whereas prior approaches struggled to fit this dataset well, R3M is built to leverage it. The training uses an objective that combines time-contrastive learning, video-language alignment, and a sparsity penalty. This objective ensures that R3M not only understands the temporal dynamics of scenes (i.e., how states transition over time) but also focuses on semantically relevant features, such as objects and their interrelations, while maintaining a compact and efficient representation. What sets R3M apart in robotics is its efficiency and effectiveness in learning from a limited amount of data: the model demonstrates remarkable performance when learning real-world tasks with minimal human supervision, typically less than 10 minutes. This is a stark contrast to traditional models that require extensive and often prohibitively large datasets for training. Furthermore, R3M's pre-trained nature allows it to be applied across a variety of tasks and environments without retraining from scratch, making it a versatile tool for robotic manipulation. The empirical results are compelling, leading to a 10% improvement over training from an ImageNet-pretrained model, self-supervised approaches such as MoCo, or even CLIP (Deng et al. 2009; He et al. 2020; Radford et al. 2021). Note, however, that R3M makes only limited use of language data, which leaves quite a bit of supervision to be desired.

Building on the success of R3M, Voltron further extends this line of work by combining self-supervision, advances in foundation models, and multi-modality. Voltron adopts an intuitive dual objective in which the trained model alternates between generating natural language describing the task in an image and conditioning its visual representation learning on natural-language descriptions, forcing a nuanced understanding of both modalities (Radford et al. 2021). Voltron's approach is distinguished by its versatility and depth of learning. It is adept at handling a wide range of robotic tasks, from low-level spatial feature recognition to the high-level semantic understanding required for language-conditioned imitation and intent scoring. This flexibility makes it suitable for various applications in robotic manipulation, from grasping objects based on descriptive language to performing complex sequences of actions in response to verbal instructions. The authors rigorously test Voltron on scenarios such as dense segmentation for grasp-affordance prediction, object detection in cluttered scenes, and learning multi-task language-conditioned policies for real-world manipulation, with up to 15% improvement over baselines. In each of these domains, Voltron has shown a remarkable ability to outperform existing models like MVP and R3M, showcasing its adaptability and learning capabilities (Xiao et al. 2022). Moreover, Voltron's framework allows for a balance between encoding low-level and high-level features, which is critical in robotics: this balance enables the model to excel both in control tasks and in those requiring deeper semantic understanding, offering a comprehensive solution for robotic vision and manipulation.

Voltron stands as a groundbreaking approach in the field of robotics, offering a language-driven, versatile, and efficient approach to learning and manipulation. Its ability to seamlessly integrate visual and linguistic data makes it a potent tool in the ever-evolving landscape of robotic technology, with potential applications that extend far beyond current capabilities. Interestingly, the authors show that Voltron does not beat R3M off the shelf, but only when trained on similar amounts of data. Nevertheless, Voltron's success in diverse tasks and environments heralds a new era in robotic manipulation, where language and vision coalesce to create more intelligent, adaptable, and capable robotic systems.

On the note of applying AL to RL settings, many recent papers have attempted to extend these ideas to modern RL environments. For example, the paper “When to Ask for Help” (Xie et al. 2022) examines the intersection of autonomous RL and AL. Instead of expecting an RL agent to solve a task fully autonomously, the paper's key insight is to assume that an agent can get stuck and needs human input to get “unstuck”. More generally, recent robotics literature has emphasized not just blindly using demonstration data as a form of human input, but actively querying a human and using the responses to better synthesize correct actions.

AL holds promise for enhancing AI models in real-world scenarios, yet several challenges persist. This discussion aims to provide an overview of these challenges.

Task-Specific Considerations: For certain tasks, the input space of a model may contain rare yet extremely important pockets that may never be discovered by AL, causing severe blind spots in the model. In medical imaging, for instance, there can be rare yet critical diseases. AL strategies for medical image analysis must therefore prioritize rare classes, such as various forms of cancer. Oftentimes, collecting data around those rare classes is not recommended by the AL acquisition criterion, because such examples constitute heavy distribution drift from the input distribution the model has seen.

Complex Task Adaptation: AL has predominantly been adopted for simple classification tasks, leaving other types of tasks (generative ones, for instance) less explored. In natural language processing, tasks like natural language inference and question answering pose additional complexities that affect the direct application of AL. While machine translation has seen AL applications, generation tasks in NLP require more thorough exploration. Challenges also arise in obtaining unlabeled data, particularly for tasks with intricate inputs.

Unsupervised and Semi-Supervised Approaches: In the presence of large datasets without sufficient labels, unsupervised and semi-supervised approaches become crucial. These methods offer a means to extract information without relying on labeled data for every data point, potentially revolutionizing fields like medical image analysis. There is an ongoing need for methods that combine self/semi-supervised learning with AL.

Algorithm Scalability: Scalability is a critical concern for online AL algorithms, particularly when dealing with large datasets and high-velocity data streams. The computational demands of AL can become prohibitive as data volume increases, posing challenges for practical deployment. Issues of catastrophic forgetting and model plasticity further complicate scalability, requiring careful consideration in algorithm design.

Labeling Quality Assurance: The effectiveness of most online AL strategies hinges on the quality of labeled data. Ensuring labeling accuracy in real-world scenarios is challenging, with human annotators prone to errors, biases, and diverse interpretations. Addressing imperfections in labeling through considerations of oracle imperfections becomes essential in real-life AL applications. Solutions for cleaning up data and verifying its quality need to be more aggressively pursued.

Data Drift Challenges: Real-world settings introduce data drift, where distributions shift over time, challenging models to adapt for accurate predictions. These shifts can impact the quality of labeled data acquired in the AL process. For example, the criterion or proxy used for selecting informative instances may be thrown off when the distribution a model is trained on, and the distribution we want it to perform well on, are too far away from one another.

Evaluation in Real-Life Scenarios: While AL methods are often evaluated assuming access to ground-truth labels, the real motivation for AL lies in label scarcity. Assessing the effectiveness of AL strategies becomes challenging in real-life scenarios where ground-truth labels may be limited. In other words, one may verify the goodness of an AL algorithm within the lab, but once the algorithm is deployed for improving all sorts of models on all sorts of data distributions, verifying whether AL is actually improving a model is tricky, especially when collecting and labeling data from the target distribution is expensive and defeats the purpose of using AL in the first place.

By systematically addressing these challenges, the field of AL in AI can progress towards more effective and practical applications. In summary, AL is a promising modern approach to model training that presents clear potential benefits. As was mentioned at the start, there are numerous strategies that AL can employ, ranging from reducing the error of a model's predictions and reducing its variance to conformal prediction. The flavor of AL heavily depends on the application, which may include robotics, LLMs, autonomous vehicles, and more. We discussed in more detail how to perform AL for variance reduction when predicting the kinematics of robotic arms, which showed a decrease in MSE as well as a more stable reduction over time. Next, we discussed using AL to reduce the number of comparisons required to create a ranking of objects; the examples discussed achieved this, but with some loss in prediction accuracy. Finally, we discussed how AL can be used to model reward functions within a dynamical system, which demonstrated improvements in both performance and the time required to achieve it. For a more hands-on experience with AL and a worked example, we encourage readers to explore the blog post by Max Halford (Halford 2023).

References

Astudillo, Raul, Zhiyuan Jerry Lin, Eytan Bakshy, and Peter I. Frazier. 2023. “qEUBO: A Decision-Theoretic Acquisition Function for Preferential Bayesian Optimization.” https://arxiv.org/abs/2303.15746.
Bastani, Hamsa, and Mohsen Bayati. 2020. “Online Decision Making with High-Dimensional Covariates.” Operations Research 68 (1): 276–94. https://doi.org/10.1287/opre.2019.1902.
Bommasani, Rishi, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, et al. 2022. “On the Opportunities and Risks of Foundation Models.” https://arxiv.org/abs/2108.07258.
Bouneffouf, Djallel, Amel Bouzeghoub, and Alda Lopes Gançarski. 2012. “A Contextual-Bandit Algorithm for Mobile Context-Aware Recommender System.” In Neural Information Processing, edited by Tingwen Huang, Zhigang Zeng, Chuandong Li, and Chi Sing Leung, 324–31. Berlin, Heidelberg: Springer Berlin Heidelberg.
Bouneffouf, Djallel, Irina Rish, and Charu Aggarwal. 2020. “Survey on Applications of Multi-Armed and Contextual Bandits.” In 2020 IEEE Congress on Evolutionary Computation (CEC), 1–8. Glasgow, United Kingdom: IEEE Press. https://doi.org/10.1109/CEC48606.2020.9185782.
Bouneffouf, Djallel, Irina Rish, and Guillermo A. Cecchi. 2017. “Bandit Models of Human Behavior: Reward Processing in Mental Disorders.” In Artificial General Intelligence, edited by Tom Everitt, Ben Goertzel, and Alexey Potapov, 237–48. Cham: Springer International Publishing.
Brohan, Anthony, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, et al. 2023. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” https://arxiv.org/abs/2307.15818.
Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. “ImageNet: A Large-Scale Hierarchical Image Database.” In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–55. IEEE.
Ding, Kaize, Jundong Li, and Huan Liu. 2019. “Interactive Anomaly Detection on Attributed Networks.” In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 357–65. WSDM ’19. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3289600.3290964.
Grauman, Kristen, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, et al. 2022. “Ego4D: Around the World in 3,000 Hours of Egocentric Video.” https://arxiv.org/abs/2110.07058.
Halford, Max. 2023. “Online Active Learning in 80 Lines of Python.”
He, Kaiming, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. “Momentum Contrast for Unsupervised Visual Representation Learning.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–38. IEEE.
Huo, Xiaoguang, and Feng Fu. 2017. “Risk-Aware Multi-Armed Bandit Problem with Application to Portfolio Selection.” Royal Society Open Science 4 (November). https://doi.org/10.1098/rsos.171377.
Kahneman, Daniel, and Amos Tversky. 1979. “Prospect Theory: An Analysis of Decision Under Risk.” Econometrica 47 (2). https://doi.org/10.2307/1914185.
Karamcheti, Siddharth, Suraj Nair, Annie S. Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. 2023. “Language-Driven Representation Learning for Robotics.” https://arxiv.org/abs/2302.12766.
Liu, Bing, Tong Yu, Ian Lane, and Ole J. Mengshoel. 2018. “Customized Nonlinear Bandits for Online Response Selection in Neural Conversation Models.” In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence. AAAI’18/IAAI’18/EAAI’18. New Orleans, Louisiana, USA: AAAI Press.
Misra, Kanishka, Eric M. Schwartz, and Jacob Abernethy. 2019. “Dynamic Online Pricing with Incomplete Information Using Multiarmed Bandit Experiments.” Marketing Science 38 (2): 226–52. https://doi.org/10.1287/mksc.2018.1129.
Nair, Suraj, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. 2022. “R3M: A Universal Visual Representation for Robot Manipulation.” https://arxiv.org/abs/2203.12601.
Perez, Julien, and Tomi Silander. 2018. “Contextual Memory Bandit for Pro-Active Dialog Engagement.” https://openreview.net/forum?id=SJiHOSeR-.
Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. 2021. “Learning Transferable Visual Models from Natural Language Supervision.” arXiv Preprint arXiv:2103.00020.
Shen, Weiwei, Jun Wang, Yu-Gang Jiang, and Hongyuan Zha. 2015. “Portfolio Choices with Orthogonal Bandit Learning.” In Proceedings of the 24th International Conference on Artificial Intelligence, 974–80. IJCAI’15. Buenos Aires, Argentina: AAAI Press.
Sui, Yanan, Masrour Zoghi, Katja Hofmann, and Yisong Yue. 2018. “Advancements in Dueling Bandits.” Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. https://doi.org/10.24963/ijcai.2018/776.
Upadhyay, Sohini, Mayank Agarwal, Djallel Bouneffouf, and Yasaman Khazaeni. 2019. “A Bandit Approach to Posterior Dialog Orchestration Under a Budget.”
Walke, Homer, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, et al. 2023. “BridgeData V2: A Dataset for Robot Learning at Scale.” https://arxiv.org/abs/2308.12952.
Wu, Jian, and Peter I. Frazier. 2018. “The Parallel Knowledge Gradient Method for Batch Bayesian Optimization.” https://arxiv.org/abs/1606.04414.
Xiao, Tete, Ilija Radosavovic, Trevor Darrell, and Jitendra Malik. 2022. “Masked Visual Pre-Training for Motor Control.” https://arxiv.org/abs/2203.06173.
Xie, Annie, Fahim Tajwar, Archit Sharma, and Chelsea Finn. 2022. “When to Ask for Help: Proactive Interventions in Autonomous Reinforcement Learning.” https://arxiv.org/abs/2210.10765.
Xu, Wenjie, Wenbin Wang, Yuning Jiang, Bratislav Svetozarevic, and Colin N. Jones. 2024. “Principled Preferential Bayesian Optimization.” https://arxiv.org/abs/2402.05367.
Yue, Yisong, Josef Broder, Robert Kleinberg, and Thorsten Joachims. 2012. “The k-Armed Dueling Bandits Problem.” Journal of Computer and System Sciences 78 (5): 1538–56. https://doi.org/10.1016/j.jcss.2011.12.028.
Yue, Yisong, and Thorsten Joachims. 2009. “Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem.” Proceedings of the 26th Annual International Conference on Machine Learning. https://doi.org/10.1145/1553374.1553527.
Zhang, Tong. 2021. “Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement Learning.” CoRR abs/2110.00871. https://arxiv.org/abs/2110.00871.
Zhou, Qian, XiaoFang Zhang, Jin Xu, and Bin Liang. 2017. “Large-Scale Bandit Approaches for Recommender Systems.” In Neural Information Processing, edited by Derong Liu, Shengli Xie, Yuanqing Li, Dongbin Zhao, and El-Sayed M. El-Alfy, 811–21. Cham: Springer International Publishing.