Machine Learning from Human Preferences

Chapter 5: Aggregation

Chapter Overview

Chapters 1–4 focused on preferences of a single decision-maker. This chapter asks: how do we aggregate preferences across multiple individuals?

Motivating examples:

  • RLHF: Annotators disagree about which response is better
  • Recommender systems: Must balance diverse tastes across millions of users
  • Content moderation: Whose preferences should govern what is shown?
  • AI alignment: How to combine human values into a single objective?

Chapter Structure

  1. Social Choice Theory: Arrow’s and Gibbard–Satterthwaite’s impossibility theorems
  2. Escaping Impossibility: Single-peaked preferences, Borda count, DPO connection
  3. Beyond Classical Voting: Multi-issue voting, nosy preferences, Community Notes
  4. Challenges in Practice: Inversion problem, privacy, paternalism
  5. Mechanism Design: Auctions, VCG, peer prediction, incentive-compatible learning

Notation

  • \(N = \{1, \ldots, n\}\): set of \(n\) voters (agents)
  • \(A = \{a_1, \ldots, a_m\}\): set of \(m\) alternatives
  • \(\succ_i\): voter \(i\)’s strict preference ordering over \(A\)
  • \(\mathcal{L}(A)\): set of all strict linear orders over \(A\)
  • \(f: \mathcal{L}(A)^n \to A\): social choice function (SCF), mapping a profile to a winner
  • \(F: \mathcal{L}(A)^n \to \mathcal{L}(A)\): social welfare function (SWF), mapping a profile to a ranking
  • \(\text{SP}(Y)\): single-peaked preferences on a totally ordered set \(Y\)
  • \(p(\succ_i)\): peak (ideal point) of voter \(i\)’s preferences

Social Choice Theory

The central question: Can we design an aggregation rule that faithfully represents individual preferences while satisfying fairness axioms?

Common voting rules:

  • Plurality: Each voter names their top choice; most votes wins
  • Borda Count: Points based on ranking position (\(m-1\) for top, \(m-2\) for second, etc.)
  • STV: Iterative elimination of lowest-vote alternative, transferring votes
  • Condorcet methods: Winner must beat all others in pairwise majority contests

The Condorcet Paradox

Majority preferences can be cyclic — even when individual preferences are transitive:

  • Voter 1: \(A \succ B \succ C\)
  • Voter 2: \(B \succ C \succ A\)
  • Voter 3: \(C \succ A \succ B\)

  • Majority prefers \(A\) to \(B\) (voters 1, 3)
  • Majority prefers \(B\) to \(C\) (voters 1, 2)
  • Majority prefers \(C\) to \(A\) (voters 2, 3)

No Condorcet winner exists! — a rock-paper-scissors cycle
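The cycle is easy to verify mechanically. A minimal sketch in Python (helper names are ours, not the chapter’s):

```python
# The three-voter profile from the table above.
profile = [
    ["A", "B", "C"],  # Voter 1
    ["B", "C", "A"],  # Voter 2
    ["C", "A", "B"],  # Voter 3
]

def majority_prefers(profile, x, y):
    """True if a strict majority of voters ranks x above y."""
    wins = sum(1 for ranking in profile if ranking.index(x) < ranking.index(y))
    return wins > len(profile) / 2

alts = ["A", "B", "C"]
beats = {x: {y for y in alts if y != x and majority_prefers(profile, x, y)}
         for x in alts}
condorcet_winners = [x for x in alts if len(beats[x]) == len(alts) - 1]

print(beats)              # each alternative beats exactly one other: a cycle
print(condorcet_winners)  # [] -- no alternative beats all the others
```

Each alternative wins one pairwise contest and loses one, so no Condorcet winner exists.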

Classical Fairness Axioms

Three desirable properties for any social welfare function:

  1. Unanimity (Pareto efficiency): If every voter prefers \(x\) to \(y\), then society ranks \(x\) above \(y\)
  2. Independence of Irrelevant Alternatives (IIA): The social ranking of \(x\) vs. \(y\) depends only on individual rankings of \(x\) vs. \(y\) — not on other alternatives
  3. Non-dictatorship: No single voter always determines the social ranking

Additionally, we assume unrestricted domain: any transitive preference ordering is admissible.

Arrow’s Impossibility Theorem

Important

Theorem (Arrow, 1951): For \(m \geq 3\) alternatives, no social welfare function can simultaneously satisfy:

  1. Unanimity
  2. Independence of Irrelevant Alternatives
  3. Non-dictatorship

under unrestricted domain.

Every practical voting system must sacrifice at least one fairness criterion.

Arrow (1951)

Arrow’s Theorem: Proof Sketch

The proof proceeds by contradiction:

  1. Assume a SWF satisfies Unanimity, IIA, and Non-dictatorship
  2. Show that the social ranking between any pair \(x, y\) must agree with some pivotal voter
  3. By IIA, the pivotal voter must be the same for all pairs of alternatives
  4. This single pivotal voter dictates the entire social order — contradiction with Non-dictatorship

Key driver: Condorcet cycles force the aggregation to “break ties” by deferring to one voter.

Arrow’s Theorem: Which Axiom Does Each Rule Violate?

  • Dictatorship violates Non-dictatorship: one voter decides everything
  • Plurality violates IIA: adding a “spoiler” candidate changes the winner
  • Borda Count violates IIA: removing an alternative changes the point totals
  • Pairwise Majority violates transitivity of the social ranking: Condorcet cycles

Takeaway: There is no free lunch — every voting rule makes trade-offs.
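Borda’s IIA violation can be reproduced directly: dropping an alternative changes the winner even though no voter’s ranking of the remaining pair changes. A small sketch (the profile and helper names are invented for illustration):

```python
def borda_scores(profile):
    """Positional Borda scores: m-1 points for top, ..., 0 for last."""
    m = len(profile[0])
    scores = {}
    for ranking in profile:
        for pos, alt in enumerate(ranking):
            scores[alt] = scores.get(alt, 0) + (m - 1 - pos)
    return scores

def winner(scores):
    return max(scores, key=scores.get)

# 3 voters rank A > B > C, 2 voters rank B > C > A.
full = [["A", "B", "C"]] * 3 + [["B", "C", "A"]] * 2
print(borda_scores(full))        # {'A': 6, 'B': 7, 'C': 2} -- B wins

# Remove the "irrelevant" alternative C and re-run the same rule:
restricted = [[a for a in r if a != "C"] for r in full]
print(borda_scores(restricted))  # {'A': 3, 'B': 2} -- now A wins
```

No voter’s ranking of A vs. B changed, yet the winner flipped from B to A.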

Gibbard–Satterthwaite Theorem

Important

Theorem (Gibbard, 1973; Satterthwaite, 1975): For \(m \geq 3\) alternatives, any social choice function \(f\) that is:

  • Strategy-proof (no voter benefits from misreporting preferences), and
  • Onto (every alternative can possibly win)

must be dictatorial.

Every non-dictatorial voting rule is manipulable: some voter can gain by voting insincerely.

Gibbard (1973); Satterthwaite (1975)

Strategic Voting Examples

Plurality — “Lesser of two evils”:

  • True preference: \(C \succ A \succ B\), but \(C\) has no chance
  • Strategic vote: \(A\) (to prevent \(B\) from winning)

Borda Count — Strategic ranking:

  • Artificially rank a strong competitor last to reduce their Borda score

Practical deterrence: While STV can always be manipulated in theory, finding a beneficial strategic vote can be NP-hard in the worst case (Bartholdi, Tovey, and Trick 1989).

Implications for AI Alignment

Arrow’s and Gibbard–Satterthwaite’s theorems apply to any preference aggregation:

  • Aggregating RLHF annotator feedback faces the same impossibilities
  • A simple majority vote may yield unstable outcomes if annotators are diverse
  • Weighting votes by expertise risks creating dictator-like influence

Modern approaches:

  • Jury learning: Panel of models/subgroups whose aggregated judgment guides learning
  • Pluralistic alignment: Preserve diversity of values rather than collapsing to a single objective
  • DPO: Implicitly aggregates pairwise preferences (more on this soon)

Gordon et al. (2022)

Escaping Impossibility: Domain Restrictions

Arrow’s and Gibbard–Satterthwaite assume unrestricted domain: any transitive ordering is admissible.

Key idea: If we restrict which preferences can occur, we can escape impossibility!

In many real-world settings, preferences have natural structure we can exploit.

Single-Peaked Preferences

Note

Definition: A preference ordering \(\succ\) over a totally ordered set \(Y\) is single-peaked if there exists a peak \(p(\succ) \in Y\) such that:

  • If \(y \lt y' \leq p(\succ)\), then \(y' \succ y\)
  • If \(p(\succ) \leq y' \lt y\), then \(y' \succ y\)

Intuition: Each voter has an “ideal point” (peak), and utility decreases as alternatives move away from the peak in either direction.

This rules out “I prefer the extremes to the middle” — which creates cycles.
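One simple way to test this property: walking down a ranking from the peak, the alternatives accepted so far must always form a contiguous interval of the ordered axis. A sketch (function name ours):

```python
def is_single_peaked(ranking, axis):
    """Check whether `ranking` (best first) is single-peaked on the ordered `axis`.

    Each next-best alternative must sit immediately to the left or right of
    the interval of alternatives already ranked; otherwise the voter prefers
    something on the far side of an unranked alternative, which breaks
    single-peakedness.
    """
    pos = {alt: i for i, alt in enumerate(axis)}
    lo = hi = pos[ranking[0]]          # start at the peak
    for alt in ranking[1:]:
        if pos[alt] == lo - 1:
            lo -= 1
        elif pos[alt] == hi + 1:
            hi += 1
        else:
            return False               # jumped over an unranked alternative
    return True

axis = [65, 68, 70, 72, 75]
print(is_single_peaked([70, 72, 68, 65, 75], axis))  # True: peak at 70
print(is_single_peaked([65, 75, 70, 68, 72], axis))  # False: prefers both extremes
```

The second ranking prefers both extremes to the middle — exactly the pattern the definition rules out.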

Single-Peaked: Temperature Example

Three colleagues choosing the office thermostat (65°F to 75°F):

  • Alice: peak at 68°F — utility decreases away from 68
  • Bob: peak at 72°F — utility decreases away from 72
  • Carol: peak at 70°F — utility decreases away from 70

All three have single-peaked preferences on the temperature line.

Median voter outcome: 70°F (Carol’s peak) — and no one can profitably manipulate!

Generalized Median Voter Scheme

Note

Definition: Fix phantom votes \(a_1, \ldots, a_{n-1} \in \mathbb{R} \cup \{\pm\infty\}\). The generalized median voter scheme is:

\[ f(\succ_1, \ldots, \succ_n) = \text{median}\big(p(\succ_1), \ldots, p(\succ_n), a_1, \ldots, a_{n-1}\big) \]

The \(n-1\) phantom votes act as anchor points that shift the median:

  • With \(n\) voter peaks + \(n-1\) phantoms = \(2n-1\) total values \(\Rightarrow\) median is well-defined

Phantom Vote Examples

Different phantom choices yield different rules:

  • Pure median: all phantoms at \(\pm\infty\) (split between \(-\infty\) and \(+\infty\)); the outcome is the median of voter peaks
  • Dictatorial: all phantoms equal to voter \(i\)’s peak; voter \(i\)’s peak always wins
  • Status quo: all phantoms equal to a status quo \(q\); moving away from \(q\) requires consensus

The phantom votes allow tuning the rule between fully responsive and highly conservative.
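A minimal implementation of the scheme, run on the thermostat peaks from the earlier example with different phantom choices:

```python
import statistics

def generalized_median(peaks, phantoms):
    """Generalized median voter scheme: median of n peaks plus n-1 phantoms."""
    assert len(phantoms) == len(peaks) - 1
    return statistics.median(list(peaks) + list(phantoms))

peaks = [68, 72, 70]          # Alice, Bob, Carol
inf = float("inf")

print(generalized_median(peaks, [-inf, inf]))  # 70: pure median of peaks
print(generalized_median(peaks, [72, 72]))     # 72: dictatorial toward Bob
print(generalized_median(peaks, [69, 69]))     # 69: status quo q = 69 holds
```

In the last call the peaks straddle \(q = 69\), so the status quo survives; it would move only if every peak were on the same side of \(q\).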

Moulin’s Characterization Theorem

Important

Theorem (Moulin, 1980): On the domain of single-peaked preferences \(\text{SP}(Y)\), a social choice function \(f\) satisfies:

  1. Strategy-proofness: No voter benefits by misreporting their peak
  2. Pareto efficiency: Outcome is never unanimously dispreferred
  3. Peaks-only: Outcome depends only on the set of reported peaks

if and only if \(f\) is a generalized median voter scheme.

By restricting to single-peaked preferences, we escape Arrow’s impossibility and achieve both strategy-proofness and efficiency!

Moulin (1980)

Strategy-Proofness: Intuition

Why can’t voters manipulate the median?

  • Suppose voter \(i\) has true peak \(p_i = 5\) and outcome is median \(= 6\)
  • Voter \(i\) wants to pull outcome left toward 5
  • They can misreport \(p_i' = 0\) (exaggerate leftward preference)
  • But the median of \(\{0, p_2, \ldots, p_n, a_1, \ldots, a_{n-1}\}\) is the same as with \(p_i = 5\)
  • The voter’s peak is already on the left side of the median — moving it further left doesn’t change the median!

Key insight: You can only move the median if your peak crosses it — but then the outcome moves away from your true peak.
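This argument can be checked by brute force on a small grid (an illustrative three-voter example, not from the text):

```python
import statistics

true_peaks = [2, 5, 9]
truthful_outcome = statistics.median(true_peaks)   # 5
truthful_utility = -abs(truthful_outcome - 2)      # voter 0 has peak 2

# Try every misreport in {0, ..., 10} for voter 0; utility is -|outcome - peak|.
best_utility = max(
    -abs(statistics.median([report, 5, 9]) - 2) for report in range(0, 11)
)
print(best_utility == truthful_utility)  # True: no misreport strictly helps
```

Reports left of the median leave the outcome at 5; crossing the median only drags the outcome further from the voter’s true peak.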

Scoring Rules and the Borda Count

Another escape from Arrow: relax IIA instead of restricting the domain.

Note

Definition (Borda Count): The Borda score of alternative \(y\) counts pairwise wins:

\[ \text{Borda}(y) = \sum_{i=1}^{n} |\{y' \neq y : y \succ_i y'\}| \]

The Borda winner is the alternative with the maximum Borda score.

Equivalently: with \(m\) alternatives, each voter gives \(m-1\) points to their top choice, \(m-2\) to their second, …, and 0 to their last.
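The two definitions — total pairwise wins and positional points — can be checked against each other, here on the Condorcet-cycle profile from earlier (helper names ours):

```python
def borda_pairwise(profile, y):
    """Borda score as total pairwise wins across voters (the definition above)."""
    return sum(
        sum(1 for z in ranking if z != y and ranking.index(y) < ranking.index(z))
        for ranking in profile
    )

def borda_positional(profile, y):
    """Equivalent positional form: m-1 points for top, m-2 for second, ..., 0 for last."""
    m = len(profile[0])
    return sum(m - 1 - ranking.index(y) for ranking in profile)

profile = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
for y in "ABC":
    assert borda_pairwise(profile, y) == borda_positional(profile, y)
print([borda_pairwise(profile, y) for y in "ABC"])  # [3, 3, 3]
```

On the cyclic profile every alternative scores 3: Borda resolves the Condorcet cycle into a tie rather than a cycle.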

Modified IIA (IIA’)

Borda violates IIA but satisfies a weaker version:

Note

Definition (IIA’): If two profiles have, for every voter:

  1. The same pairwise ordering of \(y\) vs. \(y'\), AND
  2. The same number of alternatives strictly between \(y\) and \(y'\)

then the social choice should not flip between \(y\) and \(y'\).

Important

Theorem: Borda satisfies Unrestricted Domain, Pareto Efficiency, Non-dictatorship, and IIA’. By relaxing IIA to IIA’, we escape Arrow’s impossibility.

Connection to DPO

A remarkable result connects Borda to modern RLHF:

Important

Theorem (DPO-Borda Equivalence): Assume responses \(y, y'\) are drawn from \(\pi_{\text{ref}}(\cdot \mid x)\). The DPO-optimal policy satisfies:

\[ \frac{\pi_{\text{DPO}}(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \propto \text{(weighted Borda score of } y \text{)} \]

DPO upweights responses proportionally to their Borda scores — it finds the response that wins the most head-to-head matchups.

Rafailov et al. (2023)

DPO-Borda: Proof Sketch

The DPO loss is: \[ \mathcal{L}_{\text{DPO}}(\pi) = -\mathbb{E}_{x,y,y'}\Big[\bar{\sigma}(\Delta r^*) \cdot \log \sigma\big(\beta \log \tfrac{\pi(y' \mid x)}{\pi(y \mid x)}\big) + \cdots\Big] \]

Taking the gradient and setting it to zero: \[ \mathbb{E}_{y' \sim \pi}\Big[\sigma\big(\beta \log \tfrac{\pi(y \mid x)}{\pi(y' \mid x)}\big)\Big] = \underbrace{\mathbb{E}_{y' \sim \mathcal{D}}\big[\bar{\sigma}(\Delta r^*(x, y', y))\big]}_{\text{Borda score of } y} \]

The RHS is the expected win rate of \(y\) against a random alternative — exactly its Borda score.

DPO-Borda: What It Means

Social choice interpretation of DPO:

  • DPO aggregates pairwise human preferences using the Borda count
  • It finds the policy that upweights responses by how often they would win head-to-head

Implications:

  • DPO inherits Borda’s strengths: Pareto efficient, non-dictatorial, satisfies IIA’
  • DPO inherits Borda’s weakness: violates IIA — adding a new response candidate can change rankings
  • The reference policy \(\pi_{\text{ref}}\) determines the weighting of comparisons
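A toy illustration of the right-hand side of the theorem: given pairwise preference probabilities and a reference distribution, a response’s weighted Borda score is its expected win rate against a reference-sampled opponent. All names and numbers below are invented:

```python
# pref[y][z] = probability annotators prefer response y over response z.
pref = {
    "concise": {"verbose": 0.7, "rude": 0.9},
    "verbose": {"concise": 0.3, "rude": 0.8},
    "rude":    {"concise": 0.1, "verbose": 0.2},
}
pi_ref = {"concise": 0.5, "verbose": 0.3, "rude": 0.2}  # reference policy

def weighted_borda(y):
    """Expected win rate of y against an opponent drawn from pi_ref."""
    return sum(pi_ref[z] * pref[y][z] for z in pi_ref if z != y)

for y in pi_ref:
    print(y, round(weighted_borda(y), 3))  # "concise" scores highest
```

DPO would upweight "concise" the most: it wins the most head-to-head matchups against reference-sampled alternatives.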

Multi-Issue Voting

Real-world decisions often involve multiple independent issues.

Example in RLHF: Optimize for helpfulness, harmlessness, and honesty simultaneously.

Question: Can we aggregate each criterion independently?

Voting by Committees

Note

Definition: A voting scheme is voting by committees if for each object \(x \in K\), there exists a committee \(C_x\) with winning coalitions \(W_x\) such that:

The outcome includes \(x\) \(\iff\) \(\{i : x \in B(\succ_i)\} \in W_x\)

where \(B(\succ_i)\) is voter \(i\)’s top-ranked subset.

Each issue is decided independently by its own committee — a natural decomposition.
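A sketch of the simplest instance — every committee is a simple majority — with illustrative issue labels:

```python
def majority_committees(profile):
    """Voting by committees where each object's committee is a simple majority.

    Each voter reports the set of objects they consider good (their top subset
    under a separable preference); object x is included iff a strict majority
    approves x. Each issue is decided independently of the others.
    """
    objects = set().union(*profile)
    n = len(profile)
    return {x for x in objects if sum(x in good for good in profile) > n / 2}

# Three voters over the issues {helpful, harmless, honest} (labels illustrative):
profile = [
    {"helpful", "honest"},
    {"helpful", "harmless"},
    {"harmless", "honest"},
]
print(sorted(majority_committees(profile)))  # every issue has 2/3 approval
```

Here each issue is approved by two of three voters, so the outcome includes all three — even though no single voter wanted that bundle, which hints at why Pareto efficiency can fail.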

Separable Preferences

Note

Definition: A preference \(\succ\) on \(2^K\) is separable if for all \(A \subseteq K\) and \(x \notin A\):

\[ A \cup \{x\} \succ A \quad \Longleftrightarrow \quad x \in G(\succ) \]

where \(G(\succ) = \{x \in K : \{x\} \succ \emptyset\}\) is the set of “good” objects.

Separability means: whether you want to add \(x\) to a bundle doesn’t depend on what’s already there.

Characterization and Limits

Important

Theorem: A voting scheme satisfies surjectivity, strategy-proofness, and separability if and only if it is voting by committees.

Caveat: Voting by committees generally does not satisfy Pareto efficiency.

Application to RLHF: Preferences over “helpful” and “harmless” are often not separable — a highly helpful response may necessarily involve some risk of harm, creating dependencies.

Nosy Preferences

Note

Definition: A preference is nosy if the individual cares about outcomes affecting others, not just themselves. A preference is private if the individual only cares about their own allocation.

Examples of nosy preferences:

  • Safety: Not wanting others to see dangerous instructions
  • Privacy: Wanting to prevent disclosure of others’ data
  • Content moderation: Preferring certain content not be shown to anyone
  • Fairness: Caring that others receive equitable treatment

Sen’s Liberal Paradox

Important

Theorem (Sen, 1970): The following three properties are inconsistent:

  1. Minimal Liberalism: Each individual is decisive over at least one pair in their personal sphere
  2. Pareto Efficiency: If everyone prefers \(x\) to \(y\), society chooses \(x\)
  3. Unrestricted Domain: Any preference profile is admissible

When preferences are nosy, even weak requirements conflict!

Sen (1970)

The Prude and the Book

Two individuals and a controversial book. Alternatives: \(a\) (Prude reads), \(b\) (Lewd reads), \(c\) (no one reads).

Prude: \(c \succ_P a \succ_P b\)

Prefers no one reads it, but would rather read it themselves than let Lewd read it (nosy!)

Lewd: \(a \succ_L b \succ_L c\)

Wants Prude to read it most of all (also nosy!)

The Prude and the Book: The Cycle

  • Prude’s liberty (personal reading choice): \(c\) beats \(a\)
  • Lewd’s liberty (personal reading choice): \(b\) beats \(c\)
  • Pareto (both prefer \(a\) to \(b\)): \(a\) beats \(b\)

\[ c \succ a \succ b \succ c \quad \text{— a cycle! No consistent social choice.} \]

Implication for AI: Content moderation involves exactly this tension — one user’s preference for free expression conflicts with another’s preference for a safe environment.

Case Study: Community Notes

Community Notes (formerly Birdwatch) aggregates ratings about content helpfulness across ideological divides.

Problem with majority voting: The largest ideological group would dominate.

Solution: Find bridging notes — rated positively by users who disagree ideologically.

Community Notes: Factor Model

\[ u_{ij} = \mu + \alpha_i + \beta_j + p_i^\top q_j + \varepsilon_{ij} \]

where \(i\) indexes raters and \(j\) indexes notes:

  • \(\mu\): global intercept
  • \(\alpha_i\): rater intercept (some raters rate everything more positively)
  • \(\beta_j\): note intercept (intrinsic note quality)
  • \(p_i^\top q_j\): ideological alignment between rater \(i\) and note \(j\)
  • \(\varepsilon_{ij}\): residual noise

Community Notes: Bridging Mechanism

Key insight: \(\beta_j\) captures note quality after controlling for ideology.

  • A note is selected if \(\beta_j \geq c\) for some threshold \(c\)
  • This means it must be rated positively by users who disagree ideologically

Connections:

  • Collaborative filtering: The \(p^\top q_j\) term \(\approx\) matrix factorization
  • Item response theory: Resembles the Rasch model (Ch. 2) extended with latent factors
  • Jury theorems: Diverse juries aggregate to correct answers better than homogeneous majorities
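A toy numeric sketch of the bridging selection (all parameter values are invented; in practice they are fit to rating data):

```python
# Hypothetical fitted parameters for two raters and two notes.
mu = 0.0
alpha = {"left_rater": 0.1, "right_rater": -0.1}       # rater intercepts
beta = {"partisan_note": 0.1, "bridging_note": 0.6}    # note intercepts (quality)
p = {"left_rater": [1.0], "right_rater": [-1.0]}       # rater ideology factors
q = {"partisan_note": [0.8], "bridging_note": [0.05]}  # note ideology factors

def predicted_rating(i, j):
    """Model prediction mu + alpha_i + beta_j + p_i . q_j (noise-free)."""
    dot = sum(a * b for a, b in zip(p[i], q[j]))
    return mu + alpha[i] + beta[j] + dot

# The partisan note is rated highly only by the aligned rater; the bridging
# note is rated positively by both sides, which the model credits to beta_j.
for j in beta:
    print(j, [round(predicted_rating(i, j), 2) for i in alpha])

c = 0.4
print([j for j in beta if beta[j] >= c])  # ['bridging_note']
```

Only the bridging note clears the \(\beta_j \geq c\) threshold: its cross-ideology approval cannot be explained away by the alignment term.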

The Inversion Problem

Important

Core insight: Observed behavior \(\neq\) underlying preferences or utility.

Standard revealed preference assumes choices reveal preferences. This can fail due to:

  1. Habit formation: Repeated behavior persists even when preferences change
  2. Cognitive limitations: Fatigue, distraction, bounded rationality
  3. Context effects: Same preference \(\to\) different behaviors in different contexts
  4. Strategic behavior: People choose strategically, not according to true preferences

The Doritos Problem

A smart pantry observes eating behavior:

  • User consistently chooses Doritos when offered
  • System infers: “User prefers Doritos”

But: The user might prefer healthier options — they just succumb to availability and habit.

Lesson: Optimizing for observed “preferences” (engagement) may not optimize for true welfare.

This is the engagement vs. satisfaction problem in recommender systems.

Implications for RLHF

The inversion problem directly affects AI training from human feedback:

  1. Annotator fatigue: Label quality degrades over long sessions
  2. Engagement \(\neq\) satisfaction: Clicks and watch time \(\neq\) user welfare
  3. Context-dependent feedback: Same annotator gives different feedback based on mood, prior examples
  4. Strategic annotation: Annotators may label strategically if they believe it affects outcomes

Potential solutions:

  • Weight annotations by estimated quality/consistency
  • Use deliberation before labeling to reduce noise
  • Model annotator state (fatigue, expertise) as latent variables
  • Collect meta-feedback about label confidence

Privacy and Personalization

Preference learning inherently involves collecting personal data.

Tension: Better personalization requires more data, but privacy demands less.

Contextual Integrity Framework

Note

Contextual Integrity (Nissenbaum): Privacy is preserved when information flows match context-specific norms. Five parameters:

  1. Sender: Who is sharing the information
  2. Subject: Whose information is being shared
  3. Recipient: Who receives the information
  4. Data Type: What kind of information
  5. Transmission Principle: What rules govern further use

A privacy violation occurs when information flows against contextual norms, even with consent.

Nissenbaum (2009)

Example: Heart Rate Data

A fitness tracker collects heart rate data:

  • To a running coach, under the transmission principle “training optimization”: appropriate
  • To an ad network, under the transmission principle “targeted advertising”: not appropriate
Same data, same consent — but different transmission principles violate expectations about the fitness context.

Differential Privacy: Limitations

Differential privacy (DP) provides formal guarantees, but has fundamental limits for preference learning:

  1. Personalization requires individual data — by definition, DP prevents this
  2. Trade-off is inherent: Stronger privacy \(\Rightarrow\) less accurate models
  3. “Persuasive” DP: Some systems claim protection with parameters so weak they provide little actual privacy

Contextual Integrity as middle ground: Allow data use that matches expectations (personalization within a service) while preventing unexpected flows (selling to third parties).

Paternalism in AI Systems

When should an AI system override a user’s stated preferences?

Key distinction:

  • Nosy preferences: Caring about others’ choices for your own sake
  • Paternalism: Overriding others’ choices for their sake

When is Paternalism Justified?

Four conditions that might justify intervention:

  1. Information asymmetry: The system has information the user lacks (e.g., long-term health effects)
  2. Cognitive limitations: The user is impaired (fatigue, addiction, cognitive decline)
  3. Protection of future self: Current choice harms their future self (e.g., saving for retirement)
  4. Irreversible harm: Consequences are severe and irreversible

Design Principles for Paternalistic AI

AI systems that exercise paternalism should:

  1. Be transparent: Users know when preferences are overridden
  2. Allow override: Users can insist on their original choice
  3. Minimize interference: Use the lightest intervention that achieves the goal
  4. Justify interventions: Provide clear rationale for each override
  5. Update based on feedback: Learn when interventions are welcomed vs. resented

Example: When an AI assistant refuses a request — is it paternalism (protecting the user) or nosy (protecting others)? Often both.

Mechanism Design: Overview

While voting aggregates ordinal preferences, mechanism design aggregates cardinal valuations (with money).

Central concept: Incentive compatibility — design rules so that rational agents reveal true preferences.

Key question: Can we align individual self-interest with social welfare?

Single-Item Auction Setup

  • One item for sale, \(n\) bidders
  • Bidder \(i\) has private valuation \(v_i\) (how much the item is worth to them)
  • Utility: \(v_i - p_i\) if they win and pay \(p_i\); otherwise \(0\)

Two objectives:

  • Social welfare: Allocate to the highest valuer
  • Revenue: Maximize the seller’s expected payment

Vickrey Second-Price Auction

Mechanism:

  1. All bidders submit sealed bids \(b_1, \ldots, b_n\)
  2. Highest bidder wins
  3. Winner pays the second-highest bid

Example: Bids = \((2, 6, 4, 1)\)

  • Bidder 2 wins (bid = 6)
  • Pays 4 (second-highest bid)
  • Utility = \(v_2 - 4\)

Vickrey (1961)

Why Truth-Telling is Dominant

Bidding \(b_i = v_i\) is a dominant strategy (DSIC):

  • Bid too low (\(b_i \lt v_i\)): Risk losing when \(v_i \gt\) second-highest bid — missed positive utility
  • Bid too high (\(b_i \gt v_i\)): Win even when second-highest bid \(\gt v_i\) — negative utility!
  • Bid truthfully (\(b_i = v_i\)): Win \(\iff\) \(v_i\) is highest; pay \(\leq v_i\) — guaranteed non-negative utility

Result: Allocates to highest valuer \(\Rightarrow\) welfare-maximizing.
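The dominance argument can be verified by brute force on the example bids (grid of misreports; ties broken toward the lower index):

```python
def second_price(bids):
    """Return (winner index, price) for a sealed-bid second-price auction."""
    winner = max(range(len(bids)), key=lambda i: bids[i])
    price = max(b for i, b in enumerate(bids) if i != winner)
    return winner, price

def utility(values, bids, i):
    winner, price = second_price(bids)
    return values[i] - price if winner == i else 0.0

values = [2, 6, 4, 1]
grid = [x / 2 for x in range(0, 21)]  # candidate misreports 0, 0.5, ..., 10

# For every bidder, no misreport on the grid beats bidding the true value.
for i in range(len(values)):
    truthful = utility(values, list(values), i)
    for b in grid:
        bids = list(values)
        bids[i] = b
        assert utility(values, bids, i) <= truthful
print("truthful bidding is optimal for every bidder")
```

Bidder 2 wins at price 4 whenever they report anything above 4; over- or under-bidding either changes nothing or hurts them, exactly as the bullet points argue.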

First-Price vs. Second-Price

First-Price Auction

  • Winner pays own bid
  • Incentive to shade bids below \(v_i\)
  • Nash equilibrium involves strategic behavior
  • Not DSIC

Second-Price (Vickrey)

  • Winner pays second-highest bid
  • Truth-telling is dominant
  • DSIC
  • Same efficiency in equilibrium

By decoupling the price from the winner’s bid, Vickrey removes the incentive to shade.

Myerson’s Optimal Auction

Goal: Maximize seller’s expected revenue (not welfare).

Setup: Bidders’ values \(v_i \sim F\) i.i.d. Define the virtual valuation:

\[ \varphi(v) = v - \frac{1 - F(v)}{f(v)} \]

Myerson’s theorem: Allocate to the bidder with the highest non-negative virtual value. If all virtual values are negative, don’t sell.

For i.i.d. regular distributions: this is a second-price auction with an optimal reserve price \(r\).

Myerson (1981)

Example: Uniform[0,1] Bidders

For \(v \sim \text{Uniform}[0,1]\): \(\varphi(v) = 2v - 1\)

Setting \(\varphi(v) \geq 0\): optimal reserve price \(r = 0.5\)

Revenue comparison (two bidders):

  • Both values below 0.5 (probability \(1/4\)): revenue \(0\) (no sale)
  • Both values above 0.5 (probability \(1/4\)): expected revenue \(2/3\) (the expected second-highest value)
  • One value above, one below (probability \(1/2\)): revenue \(0.5\) (the reserve price)

Expected revenue: \(\tfrac{1}{4} \cdot 0 + \tfrac{1}{4} \cdot \tfrac{2}{3} + \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{5}{12} \approx 0.417\), vs. \(\tfrac{1}{3} \approx 0.333\) without a reserve.
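These numbers can be checked by Monte Carlo simulation (assuming the two-bidder uniform setup above):

```python
import random

random.seed(0)
T = 200_000

def revenue(v1, v2, reserve):
    """Second-price auction with a reserve: winner pays max(second bid, reserve)."""
    hi, lo = max(v1, v2), min(v1, v2)
    if hi < reserve:
        return 0.0            # no sale
    return max(lo, reserve)

with_reserve = without = 0.0
for _ in range(T):
    v1, v2 = random.random(), random.random()
    with_reserve += revenue(v1, v2, 0.5)
    without += revenue(v1, v2, 0.0)

print(round(with_reserve / T, 3))  # close to 5/12 = 0.417
print(round(without / T, 3))       # close to 1/3  = 0.333
```

The reserve sacrifices some sales (both values below 0.5) but extracts more from the one-above-one-below cases, raising expected revenue by about 25%.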

Bulow–Klemperer Theorem

Important

Theorem (Bulow & Klemperer, 1996): For i.i.d. regular \(F\):

\[ \mathbb{E}[\text{Rev}^{\text{(second-price)}}(n+1)] \geq \mathbb{E}[\text{Rev}^{\text{(optimal)}}(n)] \]

A simple second-price auction with one extra bidder outperforms the optimal auction with fewer bidders!

Practical takeaway: Use simple, transparent mechanisms and focus on attracting more participants rather than complex optimal designs.

Bulow and Klemperer (1996)

The VCG Mechanism

Generalization of Vickrey’s auction to multiple items and complex outcomes.

Setting: Outcomes \(\omega \in \Omega\); agent \(i\) has valuation \(v_i(\omega)\); quasilinear utility.

Allocation rule — maximize total reported value:

\[ \omega^* = \arg\max_{\omega \in \Omega} \sum_{i=1}^n b_i(\omega) \]

VCG: Payment Rule

Each agent pays the externality they impose on others:

\[ p_i(b) = \underbrace{\max_{\omega \in \Omega} \sum_{j \neq i} b_j(\omega)}_{\text{Others' welfare without } i} - \underbrace{\sum_{j \neq i} b_j(\omega^*)}_{\text{Others' welfare with } i} \]

Intuition: You pay the “damage” your presence causes to everyone else.

  • Reduces to second-price logic for single items
  • Truth-telling is a dominant strategy (DSIC)
  • Outcome maximizes social welfare \(\sum_i v_i(\omega)\)
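A compact sketch of the general mechanism, checked on the single-item case, where it should reduce to the second-price auction (function names ours):

```python
def vcg(valuations, outcomes):
    """VCG mechanism: valuations[i][w] is agent i's reported value for outcome w."""
    def welfare(agents, w):
        return sum(valuations[i][w] for i in agents)

    n = len(valuations)
    best = max(outcomes, key=lambda w: welfare(range(n), w))
    payments = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        without_i = max(welfare(others, w) for w in outcomes)   # others' welfare if i absent
        payments.append(without_i - welfare(others, best))      # externality i imposes
    return best, payments

# Single item: outcome w means "give the item to agent w".
bids = [2, 6, 4, 1]
valuations = [{w: (b if w == i else 0) for w in range(4)}
              for i, b in enumerate(bids)]
best, pay = vcg(valuations, range(4))
print(best, pay)  # 1 [0, 4, 0, 0]
```

Agent 1 wins and pays 4 — the second-highest bid — because their presence displaces agent 2’s value of 4; everyone else imposes no externality and pays nothing.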

VCG: Challenges in Practice

Despite theoretical elegance, VCG faces practical hurdles:

  1. Computational: Finding \(\arg\max_\omega \sum_i b_i(\omega)\) can be NP-hard (combinatorial auctions)
  2. Budget balance: VCG payments may require subsidies in some settings
  3. Collusion and sybil attacks: If one bidder splits into two identities, they can game the outcome

Application: Spectrum auctions — billions of dollars at stake; multi-round simultaneous auctions used in practice.

Case Study: Peer Grading

Setting: Students grade each other’s work. Design a mechanism that incentivizes careful grading.

The lazy grader problem: Always giving 80% can yield 96% accuracy under naive scoring rules — the grader “cheats” by predicting the class average.

Solution: Optimize the scoring rule to maximize the gap between diligent grading and lazy strategies.

Result: Incentive compatibility aligns grader incentives with accurate assessment — “payments” are grade points.

Hartline et al. (2020)

Incentive-Compatible Online Learning

Setting: A planner (system) interacts with strategic agents (users) who arrive sequentially.

  • \(K\) possible actions, each with mean reward \(\mu_i \in [0,1]\)
  • Agents want to maximize their own reward
  • Planner wants to learn the best alternative and maximize overall welfare

Challenge: Without monetary transfers, how can the planner induce exploration?

Key tool: Information asymmetry — users only see their own recommendations.

The Guinea Pig Strategy

Idea: Hide exploration in a pool of exploitation.

  1. Deterministically recommend the best-known action (\(A_1\)) to most users
  2. Pick one guinea pig uniformly at random from the next \(L\) users
  3. Recommend the exploratory action (\(A_2\)) to the guinea pig

Users don’t know if they’re the guinea pig, so following the recommendation is optimal!
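The recommendation scheme itself is tiny. A sketch (arm labels and batch size illustrative):

```python
import random

def recommend_batch(L, best_arm, explore_arm, rng=random):
    """One round of the guinea-pig scheme for the next L users.

    Everyone is recommended the best-known arm except one user, chosen
    uniformly at random, who is silently given the exploratory arm.
    """
    recs = [best_arm] * L
    recs[rng.randrange(L)] = explore_arm   # the hidden guinea pig
    return recs

random.seed(1)
batch = recommend_batch(12, "A1", "A2")
print(batch.count("A1"), batch.count("A2"))  # 11 1
```

From any single user’s point of view the recommendation is identical either way, so the probability of being the guinea pig is exactly \(1/L\).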

Guinea Pig: Why It Works

The expected gain from deviating (ignoring the recommendation):

\[ \mathbb{E}[\mu_1 - \mu_2 \mid I_t = 2] \leq \tfrac{1}{L}(\mu_1 - \mu_2) + (1 - \tfrac{1}{L})\mathbb{E}[\mu_1 - \mu_2 \mid \mu_1 \lt \mu_2] \cdot P[\mu_1 \lt \mu_2] \]

This is \(\leq 0\) when \(L \geq 12\).

Interpretation: The small chance of being the guinea pig is outweighed by the chance that the exploration action is actually better.

Black-Box Reduction

General algorithm: Turn any bandit algorithm into an incentive-compatible one.

Recipe: Wrap every decision that the bandit algorithm \(A\) makes with \(L-1\) recommendations of the best-known arm.

Result: Simulates \(T\) steps of \(A\) in \(cT\) steps, achieving \(O(\sqrt{T})\) regret — the same rate as non-strategic settings!

Incentive compatibility comes “for free” (up to a constant factor).

Mansour, Slivkins, and Syrgkanis (2019)

Mutual Information Paradigm

Problem: How to incentivize truthful reporting when there’s no verifiable ground truth?

MIP (Kong & Schoenebeck, 2019): Reward agents based on mutual information between their report and a reference agent’s report:

\[ \text{Payment}_i = MI(\hat{\Psi}_i;\; \hat{\Psi}_j) \]

where \(j \neq i\) is randomly selected.

Kong and Schoenebeck (2019)

MIP: Properties of Information-Monotone MI

An information-monotone MI measure satisfies:

  1. Symmetry: \(MI(X; Y) = MI(Y; X)\)
  2. Non-negativity: \(MI(X; Y) \geq 0\), with equality iff \(X \perp Y\)
  3. Data processing inequality: For any channel \(M\), \(MI(M(X); Y) \leq MI(X; Y)\)

Two important families:

  • \(f\)-mutual information: Based on \(f\)-divergence between joint and product of marginals
  • Bregman mutual information: Based on proper scoring rules

MIP: Key Result

Important

Theorem: When the MI measure is strictly information-monotone, the resulting mechanism is:

  • Dominantly truthful: Truth-telling is a dominant strategy
  • Strongly truthful: Truth-telling equilibrium yields strictly higher payoffs than any non-permutation strategy

Why it works: Any manipulation (noise, partial reporting) can only decrease mutual information with the reference agent — so truthful reporting maximizes payment.
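The “why it works” claim can be illustrated with Shannon mutual information: garbling one agent’s binary report through a noise channel strictly lowers MI with the reference agent, by the data processing inequality. A sketch with invented numbers:

```python
from math import log2

def mutual_information(joint):
    """Shannon MI of a joint pmf given as {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def add_noise(joint, eps):
    """Flip the first agent's binary report with probability eps."""
    noisy = {}
    for (x, y), p in joint.items():
        noisy[(x, y)] = noisy.get((x, y), 0) + (1 - eps) * p
        noisy[(1 - x, y)] = noisy.get((1 - x, y), 0) + eps * p
    return noisy

# Two correlated binary reports that agree with probability 0.9.
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
honest = mutual_information(joint)
garbled = mutual_information(add_noise(joint, 0.2))
print(honest > garbled > 0)  # True: noise strictly lowers the MI-based payment
```

Any strategy an agent applies to their signal acts as such a channel, so under an information-monotone MI measure the truthful report maximizes the payment.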

Summary (1)

Social Choice Theory:

  • Arrow’s theorem: No SWF satisfies Unanimity + IIA + Non-dictatorship for \(m \geq 3\)
  • Gibbard–Satterthwaite: Every non-dictatorial SCF is manipulable
  • Single-peaked preferences + Moulin’s theorem \(\Rightarrow\) strategy-proof median voter schemes
  • Borda count relaxes IIA to IIA’; DPO-Borda: DPO aggregates as weighted Borda count

Summary (2)

Beyond Classical Voting:

  • Multi-issue voting with separable preferences \(\Rightarrow\) voting by committees
  • Nosy preferences create tensions (Sen’s Liberal Paradox)
  • Community Notes: Factor model identifies bridging content across ideological divides

Challenges in Practice:

  • Inversion problem: Behavior \(\neq\) preferences (habits, fatigue, strategy)
  • Privacy: Contextual Integrity \(\gt\) simple consent models
  • Paternalism: Justified under info asymmetry, cognitive limits, irreversible harm

Summary (3)

Mechanism Design:

  • Vickrey auction: Second-price \(\Rightarrow\) truth-telling is dominant; welfare-maximizing
  • Myerson: Virtual valuations + reserve price \(\Rightarrow\) optimal revenue
  • Bulow–Klemperer: One more bidder \(\gt\) optimal mechanism
  • VCG: General externality-based payments for multi-item settings

Incentives Without Money:

  • Guinea pig strategy: Hide exploration in exploitation; \(O(\sqrt{T})\) regret
  • MIP: Mutual information rewards \(\Rightarrow\) dominant truthfulness in peer prediction

References

Arrow, Kenneth J. 1951. Social Choice and Individual Values. John Wiley & Sons.
Bartholdi, John J., Craig A. Tovey, and Michael A. Trick. 1989. “The Computational Difficulty of Manipulating an Election.” Social Choice and Welfare 6 (3): 227–41.
Black, Duncan. 1948. “On the Rationale of Group Decision-Making.” Journal of Political Economy 56 (1): 23–34.
Bulow, Jeremy, and Paul Klemperer. 1996. “Auctions Versus Negotiations.” The American Economic Review 86 (1): 180–94. http://www.jstor.org/stable/2118262.
Gibbard, Allan. 1973. “Manipulation of Voting Schemes: A General Result.” Econometrica 41 (4): 587–601.
Gordon, Mitchell L., Michelle S. Lam, Joon Sung Park, Kayur Patel, Jeffrey T. Hancock, Tarleton Gillespie, and Michael S. Bernstein. 2022. “Jury Learning: Integrating Dissenting Voices into Machine Learning Models.” In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. ACM.
Hartline, Jason D., Yingkai Li, Liren Shan, and Yifan Wu. 2020. “Optimization of Scoring Rules.” CoRR abs/2007.02905. https://arxiv.org/abs/2007.02905.
Kong, Yuqing, and Grant Schoenebeck. 2019. “An Information Theoretic Framework for Designing Information Elicitation Mechanisms That Reward Truth-Telling.” ACM Trans. Econ. Comput. 7 (1). https://doi.org/10.1145/3296670.
Mansour, Yishay, Aleksandrs Slivkins, and Vasilis Syrgkanis. 2019. “Bayesian Incentive-Compatible Bandit Exploration.” https://arxiv.org/abs/1502.04147.
Moulin, Hervé. 1980. “On Strategy-Proofness and Single Peakedness.” Public Choice 35 (4): 437–55.
Myerson, Roger B. 1981. “Optimal Auction Design.” Mathematics of Operations Research 6 (1): 58–73.
Nissenbaum, Helen. 2009. Privacy in Context: Technology, Policy, and the Integrity of Social Life. Stanford University Press.
Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” https://arxiv.org/abs/2305.18290.
Satterthwaite, Mark Allen. 1975. “Strategy-Proofness and Arrow’s Conditions: Existence and Correspondence Theorems for Voting Procedures and Social Welfare Functions.” Journal of Economic Theory 10 (2): 187–217.
Sen, Amartya. 1970. “The Impossibility of a Paretian Liberal.” Journal of Political Economy 78 (1): 152–57. https://doi.org/10.1086/259614.
Vickrey, William. 1961. “Counterspeculation, Auctions, and Competitive Sealed Tenders.” Journal of Finance 16 (1): 8–37.