6 Whose?
By the end of this chapter you will be able to:
- Identify how technical design choices at each stage of the preference learning pipeline (elicitation, learning, aggregation, decision) embed value judgments about whose preferences matter and how they should be weighted.
- Explain the inversion problem—why behavior ≠ mental state—and how optimizing for revealed preferences (clicks, bids, engagement) can systematically disadvantage groups whose behavior diverges from preferences due to context (fatigue, time constraints, cultural norms).
- Distinguish between individual fairness (similar entities treated similarly) and group fairness (protected groups have equal outcomes), recognizing their fundamental incompatibility as demonstrated by Dwork et al.
- Apply Sen’s framework of nosy vs. non-nosy preferences to determine when liberal assistance (respecting stated preferences) is insufficient and illiberal assistance (overriding preferences) is justified.
- Analyze how unfairness compounds across the elicitation → learning → aggregation → decision pipeline through feedback loops, showing how small biases at each stage multiply over time.
- Evaluate governance mechanisms (designer discretion, user control, participatory design, regulatory oversight, professional standards) for deciding whose preferences should be weighted and how.
- Design preference learning systems that make value tradeoffs explicit, document who benefits and who is harmed, and include fairness constraints at each pipeline stage.
- Implement stratified sampling, subgroup performance monitoring, and fairness-constrained optimization to detect and mitigate compounding unfairness.
- Audit systems for red flags: feedback loops where average performance improves but subgroup performance worsens, optimization-fairness divergence, invisibility of harm, post-hoc rationalization, ignoring context, and revealed preference fallacies.
- Connect abstract fairness concepts to concrete technical decisions in Bradley-Terry models (Chapter 2), DPO weighting (Chapter 3), active learning via Fisher information (Chapter 4), exploration strategies (Chapter 5), and aggregation rules (Chapter 6).
This chapter can be covered in two 50-minute lectures plus a discussion/activity session:
Lecture 1 (Sections 6.2–6.3): The Pipeline and Design Choices
- Introduction: Whose preferences matter? The central question (10 min)
- The 4-stage pipeline framework with AI Review Assistant running example (10 min)
- Elicitation: Who gets queried? The inversion problem (10 min)
- Learning: What structures are valid? IIA violations and context-dependence (10 min)
- Aggregation and decision: Weighing preferences, liberal vs. illiberal assistance (10 min)
Lecture 2 (Sections 6.4–6.5): Fairness and Design Principles
- Individual vs. group fairness: formal definitions, conflicts, examples (15 min)
- Process vs. outcome fairness: tensions and connections to Sen’s liberalism (10 min)
- How unfairness compounds: feedback loops and end-to-end analysis (10 min)
- Eight design principles for practitioners (10 min)
- Red flags, governance mechanisms (5 min)
Discussion/Activity Session: Case Study Analysis (50 min)
- Paper-reviewer matching design exercise: make choices at each pipeline stage (30 min)
- Discussion of case studies: LaMPost, Multi-Value, DaDa (20 min)
6.1 Chapter Overview
Chapters 2–6 provided powerful technical methods for learning from human preferences:
- Chapter 2 showed how to model preferences through Bradley-Terry, Rasch, and factor models, assuming preferences exist and can be represented mathematically.
- Chapter 3 developed estimation methods (MLE, Bayesian inference, online learning), assuming we have preference data to train on.
- Chapter 4 introduced active elicitation via Fisher information and optimal design, assuming we know which comparisons to request.
- Chapter 5 addressed sequential decisions under preference uncertainty, assuming we know what outcomes to pursue.
- Chapter 6 covered aggregation mechanisms, providing technical tools but not normative guidance on whose preferences to aggregate.
This chapter provides the missing normative framework: whose preferences should we learn from, and how should we weigh them?
Every technical choice embeds value judgments. Choosing an active learning strategy determines who gets asked (efficiency vs. representation). Assuming IIA imposes structure on what preferences are valid (rational preferences vs. psychological reality). Selecting DPO’s reference policy \(\pi_{\text{ref}}\) determines whose preferences count more (high-volume users vs. equal weighting). Deciding when to override stated preferences determines what preferences are respected (liberal vs. illiberal assistance).
The Central Insight: These are not merely technical decisions—they are fairness decisions with real consequences. A system meant to “help all reviewers” may primarily help already-privileged users if we optimize for efficiency at each stage. Small biases compound: elicitation undersamples some groups → learning fits them poorly → aggregation downweights them → decisions provide poor assistance → they disengage → even less data. The gap widens.
The Roadmap: This chapter traces the preference learning pipeline from elicitation through decision-making, analyzing how each stage embeds values and how unfairness accumulates. We formalize key fairness concepts (individual vs. group, process vs. outcome), provide concrete design principles for building fairer systems, examine governance mechanisms for value decisions, and ground the analysis in real-world case studies.
Connection to Earlier Chapters:
- IIA from Bradley-Terry (Chapter 2, Section 1.8) has fairness implications: it privileges “rational” preferences and may mismodel overburdened users.
- DPO’s reference policy (Chapter 3) implicitly weights by participation—high-volume users dominate if we don’t correct for unequal engagement.
- Active learning via Fisher information (Chapter 4) maximizes information gain but may undersample minority groups, creating representativeness issues.
- Exploration strategies (Chapter 5) raise liberal vs. illiberal questions: explore what users want or what’s best for them?
- Aggregation impossibilities (Chapter 6, Arrow’s theorem, Sen’s Paretian Liberal) show that no mechanism satisfies all properties—fairness constraints further limit options.
This chapter equips you to recognize value tradeoffs, make conscious fairness choices, and build systems that respect human dignity while acknowledging impossibility results and practical constraints.
6.2 Introduction: Whose Preferences Matter?
This book has taught you how to learn from human preferences. You can model preferences through Bradley-Terry (Section 1.6.2), estimate parameters via maximum likelihood (Section 2.3), actively collect informative comparisons (Section 3.4), make sequential decisions (Section 6.3.4), and aggregate heterogeneous preferences (Section 5.2). These are powerful technical tools.
But now we ask a deeper question: Whose preferences should we learn from, and how should we weigh them?
This is not a technical question with an algorithmic answer—it is a normative question about values and fairness. Yet every technical decision you make embeds an answer to this question, whether you recognize it or not:
When you choose an active learning strategy (Chapter 4), you determine who gets asked. Maximizing Fisher information is efficient but may undersample minority users. This is a fairness decision disguised as an optimization problem.
When you assume IIA for tractability (Chapter 2), you impose structure on what preferences are valid. Context-dependent preferences (fatigue, framing, load) are treated as “irrational” and ignored. But these violations disproportionately affect overburdened users.
When you use DPO with a reference policy \(\pi_{\text{ref}}\) (Chapter 3), you determine whose preferences count more. High-volume users dominate the reference policy. Is this fair? Or should all users be weighted equally?
When you decide whether to defer to stated preferences or override them (Chapter 5), you determine what preferences are respected. Should an AI review assistant help a reviewer be consistently harsh (liberal assistance) or nudge toward constructive feedback (illiberal assistance)?
These choices have consequences. They determine who benefits from your system and who is harmed.
6.2.1 The Four-Stage Pipeline
To analyze these fairness questions systematically, we introduce a four-stage pipeline for preference learning systems:
Elicitation (\(\mathcal{E}\)): How do we collect preference data? Who do we query? What questions do we ask?
Learning (\(\mathcal{L}\)): What model structure do we assume? What preference patterns are considered valid?
Aggregation (\(\mathcal{A}\)): How do we combine multiple users’ preferences? Whose preferences are weighted more?
Decision (\(\mathcal{D}\)): When do we defer to user preferences vs. override them? What constitutes legitimate paternalism?
At each stage, we make choices that embed values. Often these choices compound: a bias introduced at elicitation amplifies through learning, aggregation, and decision-making, creating feedback loops that widen disparities over time.
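The four stages can be pictured as a composition of functions. The sketch below (the class name, types, and toy functions are illustrative, not from the text) makes the compounding point explicit: whatever `elicit` produces is the only input every later stage ever sees, so an elicitation bias is baked into everything downstream.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class PreferencePipeline:
    """Schematic four-stage preference learning pipeline (illustrative types)."""
    elicit: Callable[[Any], Any]     # E: choose whom to query, collect comparisons
    learn: Callable[[Any], Any]      # L: fit a preference model to the collected data
    aggregate: Callable[[Any], Any]  # A: combine users' preferences into one policy
    decide: Callable[[Any], Any]     # D: act on the policy (defer vs. override)

    def run(self, population: Any) -> Any:
        # Each stage's output is the next stage's input: a bias introduced
        # by elicit() propagates through learn, aggregate, and decide.
        return self.decide(self.aggregate(self.learn(self.elicit(population))))
```

For instance, an `elicit` that silently drops half the population changes what every downstream stage can possibly learn, no matter how fair `learn`, `aggregate`, and `decide` are on the data they receive.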
Before analyzing fairness at each pipeline stage, one must first identify who is affected. A useful framework distinguishes four categories: (1) direct users who interact with the system and provide preference data (e.g., reviewers using the AI assistant); (2) affected parties who are impacted by decisions but may not be users (e.g., paper authors whose work is reviewed); (3) system designers who set objectives and constraints (e.g., conference organizers); and (4) society broadly, which bears indirect consequences (e.g., the research community’s trust in peer review). Different stakeholder groups may have conflicting interests, and a fairness analysis that considers only direct users can overlook harms to affected parties—a common failure mode in deployed preference learning systems.
Every technical choice in the preference learning pipeline is a value choice about whose preferences matter and how they should be weighted.
You cannot avoid making these choices. You can only choose whether to make them consciously and transparently, or to hide them behind claims of “technical neutrality.”
This chapter provides frameworks for recognizing these choices, analyzing their fairness implications, and making principled tradeoffs.
A natural instinct is to treat fairness as an optimization problem — “find the system that is most fair.” But the impossibility results throughout this book (Arrow’s theorem, Gibbard-Satterthwaite, the incompatibility of individual and group fairness) mean there is no single solution that satisfies all fairness desiderata simultaneously. A system that is fair by one metric (e.g., calibration across groups) may be unfair by another (e.g., equalized odds). The pitfall is claiming technical neutrality: every design choice — what data to collect, which loss to minimize, how to aggregate, when to override — embeds a value judgment about whose preferences matter. The goal is not to find the fair solution but to make these tradeoffs consciously and transparently.
6.2.2 Running Example: AI Review Assistant for Peer Review
To ground these abstract questions, we use a concrete running example throughout this chapter: an AI system that helps peer reviewers write paper reviews. The system learns from each reviewer’s past reviews to assist them in evaluating new papers.
The Setting: Consider a large ML conference with 10,000 submitted papers and 5,000 reviewers. Each paper needs at least 3 reviews. Reviewers bid on papers (prefer / willing / not-willing), and a matching algorithm assigns papers to reviewers. An AI assistant learns from past reviews to help reviewers write better, more efficient reviews.
Why This Example? Peer review combines all the challenges of preference learning:
Elicitation: Which reviewers should we query for more feedback to train the assistant? Senior reviewers write more detailed reviews (more training data), but this may leave junior reviewers under-served.
Learning: Reviewers have context-dependent preferences (fatigue, load, framing). Should we assume IIA? Or model context explicitly?
Aggregation: Should the shared model weight all reviewers equally? Or weight by review volume (more data from seniors)?
Decision: If a reviewer tends to write harsh reviews, should the assistant help them be consistently harsh? Or nudge toward constructive feedback?
Moreover, peer review has clear stakeholders (reviewers, authors, editors, scientific community), measurable outcomes (review quality, author satisfaction, acceptance rates), and real fairness concerns (junior vs. senior reviewers, underrepresented subfields, institutional prestige).
We will trace how design choices at each pipeline stage create fairness implications, and how these compound into systematic disadvantages for some groups.
6.2.3 Connection to Earlier Chapters
This chapter builds directly on the technical foundations from Chapters 2–6:
From Chapter 2 (Foundations): The Bradley-Terry model assumes IIA—if reviewer prefers Paper A over Paper B, adding Paper C shouldn’t change this. But in Section 6.3.2, we’ll see why this assumption has fairness consequences: it treats context-dependent preferences as invalid, disadvantaging overburdened reviewers.
From Chapter 3 (Learning): DPO maximizes a weighted Borda rule where \(\pi_{\text{ref}}\) is trained on existing data (Equation 1.2). But existing data reflects existing inequities. In Section 6.3.3, we’ll analyze how status quo bias in \(\pi_{\text{ref}}\) weights senior reviewers more heavily.
From Chapter 4 (Elicitation): Active learning via Fisher information (Section 2.4.2.2) queries where the model is most uncertain. But uncertainty may be highest for well-represented groups with complex preferences. In Section 6.3.1, we’ll show how information-maximizing strategies can undersample minorities.
From Chapter 5 (Decisions): Thompson Sampling explores based on posterior uncertainty (Section 4.3). But should we explore over what users want (liberal) or what’s best for them (illiberal)? In Section 6.3.4, we’ll formalize this distinction using Sen’s framework of nosy preferences.
From Chapter 6 (Aggregation): Arrow’s theorem and Sen’s Impossibility of a Paretian Liberal (Section 6.4) show that no aggregation mechanism satisfies all desirable properties. In Section 6.4.1.1, we’ll see how adding fairness constraints further limits our options.
The technical methods are powerful. But applying them responsibly requires understanding whose preferences you’re optimizing for and whose interests you’re serving.
6.2.4 Roadmap for This Chapter
The rest of this chapter proceeds as follows:
Section 6.3 analyzes the four-stage pipeline in detail, showing how each stage embeds value judgments and how unfairness compounds across stages.
Section 6.4 formalizes key fairness distinctions: individual vs. group fairness, process vs. outcome fairness, and the fundamental incompatibilities between them.
Section 6.5 distills the chapter’s insights into eight concrete design principles for building fairer preference learning systems, along with red flags to watch for.
Governance mechanisms for deciding whose preferences matter (designer discretion, user control, participatory design, regulatory oversight, and professional standards) are examined throughout the chapter rather than in a single dedicated section.
Let’s begin by tracing the AI review assistant through the four-stage pipeline, seeing how fairness issues emerge and compound at each step.
6.3 Design Decisions as Value Choices
Consider building an AI assistant for peer review. At each pipeline stage, a seemingly neutral technical choice embeds a value judgment.
| Stage | Technical Choice | Value Embedded | Who Benefits | Who Is Harmed |
|---|---|---|---|---|
| Elicitation \(\mathcal{E}\) | Query reviewers with highest Fisher information | Efficiency over representation | Well-represented subfields | Niche/emerging areas |
| Learning \(\mathcal{L}\) | Fit Bradley-Terry (assumes IIA) | Assumes context-independent preferences | Average-context reviewers | Overburdened reviewers whose preferences are context-dependent |
| Aggregation \(\mathcal{A}\) | Simple majority across annotators | Equal weight per person | Majority subfield | Interdisciplinary or minority perspectives |
| Decision \(\mathcal{D}\) | Always follow the model’s recommendation | Trust the system | Users well-served by the model | Edge cases, novel paper types |
Compounding effect: If elicitation undersamples postdocs \(\to\) the model learns senior-faculty preferences \(\to\) aggregation weights those preferences equally (but they’re overrepresented in data) \(\to\) the assistant serves senior faculty well, who use it more, generating more data. A 10% sampling bias at stage \(\mathcal{E}\) can become a 40% performance gap after one feedback cycle.
Intervention: Stratified sampling at \(\mathcal{E}\) (ensure equal representation by career stage), fairness-constrained learning at \(\mathcal{L}\) (equalize subgroup AUC), and regular auditing at \(\mathcal{D}\) (compare subgroup satisfaction) break the feedback loop.
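The compounding claim can be illustrated with a deliberately simple toy model. Everything here is invented for illustration: the 55/45 initial split, the per-cycle query budget, and especially the assumption that engagement responds superlinearly (quadratically) to how much data a group already has.

```python
import numpy as np

# Toy feedback loop: query share follows engagement, and engagement responds
# superlinearly to how well-served a group already is (proxied by its data).
# All dynamics and constants are hypothetical illustrations.
data = np.array([55.0, 45.0])        # groups A, B: a 10-point initial sampling bias
shares = [data[0] / data.sum()]      # track Group A's share of training data

for cycle in range(5):
    engagement = data ** 2           # superlinear: better-served users engage more
    queries = 100 * engagement / engagement.sum()
    data = data + queries
    shares.append(data[0] / data.sum())

print("Group A data share per cycle:", [round(s, 3) for s in shares])
```

Under these (assumed) dynamics the initial 55% share grows every cycle; with a sublinear engagement response it would instead stabilize, which is exactly why the stratified-sampling intervention aims to break the engagement-to-allocation link.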
We now trace the AI review assistant example through all four pipeline stages, analyzing the value tradeoffs and fairness implications at each step.
6.3.1 Elicitation: Who Gets Queried?
Technical Question: How should we collect data to train the review assistant? Which reviewers should we observe more carefully?
Consider three design options for elicitation policy \(\mathcal{E}\):
- Universal querying: Observe all reviewers equally (every review provides training data)
- Productivity-based: Query “productive” reviewers more—those who write detailed, timely reviews
- Active learning: Use Fisher information (Section 2.4.2.2) to query where the model is most uncertain
Value Embedded: Options 2 and 3 prioritize efficiency (maximize information gain per query) over representation (ensure all reviewer types contribute training data).
This seems like a purely technical optimization decision. But it has profound fairness consequences.
6.3.1.1 The Inversion Problem
The core issue is what we call the inversion problem: observable behavior does not equal underlying mental state.
Let \(B\) denote observed behavior (e.g., writing a detailed review), \(M\) denote mental state (true expertise and preferences), and \(C\) denote context (time availability, fatigue, language proficiency). The relationship is: \[ P(B \mid M, C) \neq P(B \mid M) \tag{6.1}\] Different groups have different mappings from mental state to behavior due to context.
In peer review, a detailed review could reflect:
- True expertise (\(M\)): Reviewer knows the area deeply and can provide thorough feedback
- Available time (\(C\)): Senior researchers have more flexible schedules, can spend 3 hours on a review
- Language proficiency (\(C\)): Native English speakers write longer reviews more quickly
- Absence of fatigue (\(C\)): conversely, reviews written at 2am are shorter, but not necessarily less expert
If we train on behavior (detailed reviews = good), we conflate expertise with privilege. The system learns to assist reviewers who already have advantages.
Suppose we observe that Reviewer A writes reviews averaging 800 words, while Reviewer B writes 400-word reviews.
Naive interpretation: A is twice as thorough as B → query A twice as much → learn A’s style better → assistant works better for A.
Reality: A is a senior professor with a light teaching load. B is a postdoc with 60-hour weeks, writing reviews at midnight. B’s shorter reviews may be equally expert but reflect time constraints, not preferences.
Result: Active learning queries A more → model learns A’s style → assists A well → A uses it more → provides more data. Meanwhile, B gets poor assistance → disengages → provides less data. The gap widens.
This is the inversion problem: optimizing for revealed behavior systematically misunderstands groups whose context differs.
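A minimal numeric illustration of Equation 6.1: two reviewers with identical expertise \(M\) whose observed behavior differs only because their context \(C\) differs. The functional form and all constants below are hypothetical.

```python
# Two reviewers, identical mental state M, different context C (hours per review).
M_a, M_b = 1.0, 1.0        # true expertise: identical
C_a, C_b = 3.0, 1.0        # available hours per review: very different

def observed_length(M: float, C: float) -> float:
    # Behavior depends on BOTH mental state and context: P(B | M, C) != P(B | M).
    return 300 * M * C      # assumed: words written scale with available time

len_a = observed_length(M_a, C_a)   # 900 words
len_b = observed_length(M_b, C_b)   # 300 words

behavior_ratio = len_a / len_b      # a behavior-based policy sees a 3:1 gap...
expertise_ratio = M_a / M_b         # ...though the true expertise ratio is 1:1
print(behavior_ratio, expertise_ratio)
```

A policy that allocates queries proportionally to `behavior_ratio` would query Reviewer A three times as often, despite there being no expertise difference to learn from.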
Fairness Implications: If we query productive reviewers more, we systematically undersample:
- Junior reviewers (less time, heavier teaching loads)
- Non-native English speakers (writing reviews takes longer)
- Reviewers with caregiving responsibilities (less flexible time)
- Reviewers in disadvantageous time zones (synchronous discussions happen while they sleep)
The AI assistant becomes good at helping senior, privileged reviewers and poor at helping those already disadvantaged. This is disparate impact—unequal quality of assistance across demographics.
Connection to Chapter 4: Active learning via Fisher information (Section 2.4.2.2) queries where \(\det(\mathcal{I}(\theta))\) is maximized—where the model is most uncertain. But uncertainty may be highest for well-represented groups with complex preferences. Minority groups may have simpler preference patterns (less uncertainty) or may be learned poorly initially (undersampled, so model is uncertain but queries go to majority anyway). Either way, information-maximizing strategies can undersample minorities.
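To see how an information-maximizing rule can skew sampling, consider a small simulation (the group sizes, utility scales, and query budget are all invented). For a single Bradley-Terry comparison with win probability \(p\), the Fisher information is \(p(1-p)\), maximized at \(p = 0.5\), so "close calls" look most informative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical candidate queries: one pairwise comparison per reviewer.
# Majority reviewers have many close calls (small utility gaps -> p near 0.5);
# minority reviewers have clear-cut preferences (large gaps -> p near 0 or 1).
maj_gaps = rng.normal(0.0, 0.4, size=80)   # utility differences, majority group
min_gaps = rng.normal(0.0, 2.8, size=20)   # utility differences, minority group

def fisher_info(gap):
    p = 1.0 / (1.0 + np.exp(-gap))         # Bradley-Terry win probability
    return p * (1.0 - p)                   # information of one comparison

info = np.concatenate([fisher_info(maj_gaps), fisher_info(min_gaps)])
group = np.array(["maj"] * 80 + ["min"] * 20)

budget = 50                                # query the 50 most informative comparisons
chosen = group[np.argsort(info)[-budget:]]

maj_rate = (chosen == "maj").sum() / 80    # per-capita query rate, majority
min_rate = (chosen == "min").sum() / 20    # per-capita query rate, minority
print(f"per-capita query rate: majority {maj_rate:.2f}, minority {min_rate:.2f}")
```

With these assumptions the budget flows almost entirely to the majority group, even though the selection rule never looks at group membership.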
6.3.1.2 Alternative: Stratified Sampling
Stratified sampling ensures we learn from diverse reviewer populations, even if less “efficient”:
- Divide reviewers into strata (junior/senior, institution type, language background)
- Ensure minimum sampling from each stratum
- Within each stratum, can still use active learning
This sacrifices some statistical efficiency (we might query less informative comparisons) but ensures representation—all groups contribute training data proportionally.
The Tradeoff: Efficiency vs. fairness. Pure active learning maximizes information gain but may exacerbate disparities. Stratified sampling ensures fairness but uses more queries to achieve the same overall model quality.
There is no “right” answer. But you must choose consciously and transparently.
6.3.1.3 Code Example: Stratified Sampling vs. Active Learning
Let’s simulate this tradeoff concretely. We create a population of reviewers with varying expertise and context, then implement three elicitation strategies (uniform, productivity-based, and stratified) and compare how well each learns the different reviewer groups.
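A minimal sketch of this simulation (the population sizes, the query budget, and the stylized \(1/\sqrt{n}\) error model are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 70 senior, 30 junior reviewers with identically
# distributed true expertise; seniors simply have more hours per review.
n = 100
senior = np.arange(n) < 70
expertise = rng.normal(0.0, 1.0, n)               # mental state M
hours = np.where(senior, 3.0, 1.0)                # context C
review_len = 300 * hours * (1 + 0.1 * expertise)  # behavior B depends on M and C

def allocate(weights, budget=2000):
    """Split a fixed query budget proportionally to per-reviewer weights."""
    w = np.asarray(weights, dtype=float)
    return budget * w / w.sum()

def group_errors(queries):
    """Stylized learning error per reviewer: shrinks as 1/sqrt(#queries)."""
    err = 1.0 / np.sqrt(np.maximum(queries, 1e-9))
    return err[senior].mean(), err[~senior].mean()

strategies = {
    "uniform": allocate(np.ones(n)),              # every reviewer queried equally
    "productive": allocate(review_len),           # query long-review writers more
    "stratified": allocate(np.where(senior, 0.5 / senior.sum(),
                                    0.5 / (~senior).sum())),  # 50/50 budget split
}

results = {}
for name, q in strategies.items():
    e_sen, e_jun = group_errors(q)
    results[name] = (e_sen, e_jun)
    print(f"{name:>10}: senior error {e_sen:.3f}, junior error {e_jun:.3f}")
```

With these numbers, productivity-based querying learns juniors markedly worse than seniors (the inversion problem), while the stratified 50/50 split guarantees the smaller junior group representation; here it even overcorrects slightly in their favor.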
Interpretation:
Uniform sampling: Queries all reviewers equally. Learning error is similar across groups (fairest).
Productive sampling: Queries based on observed behavior (review length). Because seniors write longer reviews due to more time (not expertise), they get queried much more. Result: learns seniors well, juniors poorly. This is the inversion problem in action.
Stratified sampling: Ensures proportional representation. Learning quality is balanced across groups—sacrifices some efficiency for fairness.
Key Insight: Optimizing for revealed preferences (review length) creates disparate impact. The system becomes good at serving already-privileged groups.
Connection to Real Systems: This pattern appears in:
- RLHF for LLMs: high-volume annotators dominate training data
- Recommender systems: power users get better recommendations (more historical data)
- Active learning for robotics: demonstrations from experienced users oversample certain interaction styles
Design Principle: When elicitation strategy affects representation, stratify by relevant demographics to ensure fairness—even if it reduces statistical efficiency.
Next, we’ll see how the learning stage compounds this bias by imposing structure (IIA) that works poorly for context-dependent preferences.
6.3.2 Learning: What Preference Structures Are Valid?
Technical Question: What model structure should we assume for reviewers’ preferences over papers?
Recall from Chapter 2 that the Bradley-Terry model (Section 1.6.2) assumes Independence of Irrelevant Alternatives (IIA): if a reviewer prefers Paper A over Paper B, adding Paper C to the choice set shouldn’t change this preference. Mathematically: \[ \frac{P(\text{choose } j \mid \{j, k\})}{P(\text{choose } k \mid \{j, k\})} = \frac{P(\text{choose } j \mid \{j, k, \ell\})}{P(\text{choose } k \mid \{j, k, \ell\})} \tag{6.2}\] This simplifying assumption reduces the model from \(M!\) parameters to just \(M\) item utilities \(V_j\), making learning tractable.
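A quick numeric check of Equation 6.2 under the Luce/Bradley-Terry choice rule, where \(P(\text{choose } j \mid S) = e^{V_j} / \sum_{k \in S} e^{V_k}\). The utility values are arbitrary illustrative numbers.

```python
import numpy as np

V = {"j": 1.0, "k": 0.2, "l": 2.5}   # illustrative item utilities

def choice_prob(item, choice_set):
    # Luce choice rule: probability proportional to exp(utility)
    z = sum(np.exp(V[c]) for c in choice_set)
    return np.exp(V[item]) / z

odds_pair = choice_prob("j", ["j", "k"]) / choice_prob("k", ["j", "k"])
odds_triple = choice_prob("j", ["j", "k", "l"]) / choice_prob("k", ["j", "k", "l"])

# The odds of j over k equal exp(V_j - V_k) in ANY choice set containing both: IIA.
print(round(odds_pair, 4), round(odds_triple, 4))
```

The normalizer \(z\) cancels in the ratio, which is exactly why IIA holds by construction in this model, and why it cannot represent the context effects discussed next.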
But IIA is violated in peer review—and these violations have fairness consequences.
6.3.2.1 When IIA Fails in Peer Review
Consider realistic violations of IIA in paper-reviewer matching:
Violation 1: Complementarity
- Reviewer already reviewed 2 similar papers in the same subfield
- Adding a third similar paper (Paper C) makes them less interested in Paper A (fatigue with the subfield)
- Preference A ≻ B changes to B ≻ A when C is added
- Context-dependence: preferences depend on the full choice set, not just pairwise comparisons

Violation 2: Framing Effects
- Paper C is exceptionally high-quality
- By comparison, Paper A (which seemed good) now looks mediocre
- Preference A ≻ B changes to B ≻ A due to the reference point shift
- The “red bus / blue bus” problem from discrete choice theory (Section 1.8)

Violation 3: Reviewer Load
- Reviewer’s current assignment load is heavy
- Adding Paper C (another assignment) changes their tolerance for marginal papers
- Preference for Paper A decreases because accepting it means even more work
- Context-dependence: preferences reflect current state (load), not just intrinsic paper quality
By assuming IIA, we’re effectively saying: these violations are “irrational” and we’ll ignore them.
Value Embedded: We privilege “rational economic preferences” (stable, context-free utilities) over the psychological reality of reviewer fatigue, comparison effects, and load-dependence.
6.3.2.2 Fairness Implications
Who is disadvantaged by assuming IIA?
If we ignore context-dependent preferences, we mismodel reviewers who:
- Are overburdened (heavy loads → preferences change with additional assignments)
- Experience fatigue (late-night reviewing → different preferences than fresh morning reviews)
- Have complementarity effects (reviewing similar papers → diminishing interest)

Crucially, these groups overlap with disadvantaged demographics:
- Junior reviewers: Asked to review more (training for tenure), heavier teaching loads, less flexibility to decline
- Non-Western reviewers: Time zone disadvantages for synchronous discussions, different academic calendar pressures
- Caregivers: Less flexible time, reviewing happens in fragmented windows
An IIA-based model works well for senior reviewers with light loads and flexible schedules (their preferences are closer to context-free). It works poorly for overburdened reviewers whose preferences reflect their current state.
This is individual unfairness: Similar reviewers (by expertise) are treated differently (by assistance quality) due to structural factors.
Connection to Chapter 2: The Bradley-Terry model’s tractability comes from IIA (Section 1.8). But Section 1.8 warned that IIA violations occur when utilities are correlated (the red bus / blue bus problem). In peer review, utilities for similar papers are correlated—reviewing one affects willingness to review another. We’re sacrificing model correctness for computational convenience, and the cost falls disproportionately on certain groups.
6.3.2.3 Alternative: Contextual Bradley-Terry
Instead of assuming context-free utilities \(V_j\), we can model context-dependent preferences: \[ H_{ij} = U_i^\top V_j + f(C_i) \tag{6.3}\] where:
- \(U_i\) and \(V_j\) are latent factors (as in Chapter 2, Section 1.6.3)
- \(C_i\) is reviewer \(i\)’s current context (load, fatigue, recent reviews)
- \(f(C_i)\) is a function capturing how context shifts preferences
For example, if Reviewer \(i\) has already reviewed \(n_i\) papers in the current round: \[ H_{ij} = U_i^\top V_j - \lambda n_i \tag{6.4}\] The \(-\lambda n_i\) term captures diminishing willingness as load increases.
This model doesn’t satisfy IIA (preferences change with context), but it more accurately captures real reviewer behavior.
The Tradeoff: Contextual models require more data to fit (additional parameters \(\lambda\), features \(C_i\)). Standard Bradley-Terry is simpler and works well for low-load reviewers. But it systematically misfits high-load reviewers.
Should we use the simpler model (works for some) or the complex model (works for all)? This is a fairness decision, not just a statistical one.
6.3.2.4 Code Example: Bradley-Terry with Context Features
Let’s simulate reviewers with context-dependent preferences, fit both a standard and a contextual Bradley-Terry model, and evaluate how well each predicts preferences for the different reviewer groups.
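One sketch of this experiment follows. Because the \(-\lambda n_i\) term in Equation 6.4 shifts all papers equally for a given reviewer, it cancels in paper-vs-paper comparisons, so the simulation models each reviewer's accept/decline decision, where the load term is observable. The 75/25 group split, \(\lambda = 0.3\), and all other constants are invented.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Synthetic accept/decline decisions: true willingness is sigmoid(V - lam * load).
# 75% low-load reviewers (load 2), 25% high-load (load 8); constants are invented.
n_obs = 4000
load = np.where(rng.random(n_obs) < 0.75, 2.0, 8.0)
V = rng.normal(1.5, 1.0, n_obs)              # paper attractiveness to that reviewer
accept = (rng.random(n_obs) < sigmoid(V - 0.3 * load)).astype(float)

def fit_logistic(X, y, steps=4000, lr=0.05):
    """Plain gradient-descent logistic regression; X includes a bias column."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

X_std = np.column_stack([V, np.ones(n_obs)])         # standard BT: no context
X_ctx = np.column_stack([V, load, np.ones(n_obs)])   # contextual BT: + load feature

w_std, w_ctx = fit_logistic(X_std, accept), fit_logistic(X_ctx, accept)

def group_logloss(X, w, mask):
    """Average log-loss on one reviewer group."""
    p = np.clip(sigmoid(X @ w), 1e-9, 1 - 1e-9)[mask]
    y = accept[mask]
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

low, high = load == 2.0, load == 8.0
for name, X, w in [("standard", X_std, w_std), ("contextual", X_ctx, w_ctx)]:
    print(f"{name:>10}: low-load loss {group_logloss(X, w, low):.3f}, "
          f"high-load loss {group_logloss(X, w, high):.3f}")
```

With these assumptions the standard model's single intercept is pulled toward the majority low-load group, so its loss is visibly worse on high-load reviewers, while the contextual model's per-group losses are roughly balanced.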
Interpretation:
Standard Bradley-Terry: Assumes preferences are context-free. Works well for low-load reviewers (seniors) whose preferences are relatively stable. Works poorly for high-load reviewers (juniors) whose preferences shift with fatigue. Individual unfairness: similar reviewers treated differently.
Contextual Bradley-Terry: Explicitly models how load affects preferences. Performance is balanced across both groups—the model correctly captures that high-load reviewers become more selective.
Key Insight: The IIA assumption (standard BT) is not neutral—it privileges groups whose preferences happen to be context-free and systematically misfits groups with context-dependent preferences.
Design Principle: When modeling preferences, test whether IIA violations correlate with demographic groups. If they do, use richer models (contextual BT, mixed logit, nested logit) to avoid individual unfairness—even if they’re more complex.
Connection to Real Systems:
- RLHF for LLMs: User preferences may depend on context (time of day, previous queries, mood). Standard preference models assume context-free utilities.
- Recommender systems: User preferences shift with context (weekend vs. weekday, home vs. commute). Contextual bandits address this.
- Healthcare decision support: Patient preferences depend on current health state, not just intrinsic item utilities.
Next, we’ll see how the aggregation stage further compounds bias by weighting preferences based on volume.
6.3.3 Aggregation: How Should We Weigh Preferences?
Technical Question: If we’re building a shared review assistant (not personalized per reviewer), how do we aggregate preferences from multiple reviewers?
From Chapter 3 (Equation 1.2), recall that DPO (Direct Preference Optimization) for language models maximizes: \[ \mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right] \tag{6.5}\] where \(\pi_{\text{ref}}\) is a reference policy trained on existing data.
The key question: Who contributes to \(\pi_{\text{ref}}\)?
If we train the reference policy on all past reviews, then reviewers who write more reviews get more weight. The reference policy implicitly performs volume-weighted aggregation.
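The volume-weighting effect is visible directly in the loss. The sketch below collapses Equation 6.5's bracket into a single per-example margin (the log-ratio of winner minus loser) and uses an invented dataset in which two groups have identically distributed margins and differ only in sample counts:

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss_terms(margins, beta=0.1):
    """Per-example DPO losses, writing Eq. (6.5)'s bracket as one margin:
    margin = log[pi(y_w|x)/pi_ref(y_w|x)] - log[pi(y_l|x)/pi_ref(y_l|x)]."""
    return -np.log(sigmoid(beta * margins))

# Hypothetical dataset D: 900 preference pairs from Group A, 100 from Group B,
# identically distributed -> the groups differ only in volume.
m_A = rng.normal(2.0, 1.0, 900)
m_B = rng.normal(2.0, 1.0, 100)

loss_A, loss_B = dpo_loss_terms(m_A), dpo_loss_terms(m_B)
share_A = loss_A.sum() / (loss_A.sum() + loss_B.sum())
print(f"Group A contributes {share_A:.0%} of the total loss (and of its gradient)")
```

Because the expectation in Equation 6.5 averages over \(\mathcal{D}\), each group's contribution to the loss, and hence to the gradient that shapes \(\pi_\theta\), is proportional to its sample count: roughly 90% here, with no explicit weighting anywhere in the code.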
6.3.3.1 Who Dominates the Reference Policy?
In peer review, high-volume reviewers tend to be:
- Senior researchers: Invited to review more, higher accept rates for review invitations, more established in the community
- Researchers at top institutions: Editors preferentially invite reviewers from prestigious institutions
- Native English speakers: Writing reviews is faster, lower burden per review
These are also the groups that already have advantages. By weighting the reference policy by volume, we encode status quo bias—the system learns to reproduce existing patterns, including existing inequities.
Value Embedded: Past data reflects “true” preferences worth replicating. But past data also reflects historical inequalities in who participates, who has time, and who gets invited.
6.3.3.2 Fairness Implications
If the shared AI review assistant is trained via volume-weighted aggregation (DPO’s \(\pi_{\text{ref}}\)), then:
- The system learns “good review” = reviews by senior/privileged researchers
- When assisting other reviewers, it nudges them toward this style
- Diverse perspectives are downweighted or erased (e.g., reviews emphasizing different values, writing styles from non-Western academic cultures)
- Result: Homogenization—reviews become more similar, reflecting the dominant culture
This is group unfairness: Underrepresented groups’ preferences are systematically underweighted in aggregation.
Connection to Chapter 6: Borda count (Section 4.5.3) weights all voters equally. But if participation is unequal, Borda effectively becomes volume-weighted, privileging high-participation groups. The “Community Notes” system on X (Twitter) has exactly this issue: volunteer raters are not representative, so “bridging” rewards notes that appeal across rater viewpoint factors, but some viewpoints may be excluded from the factor space entirely.
Connection to Chapter 3: The reference policy \(\pi_{\text{ref}}\) in DPO (Equation 1.2) is trained on historical data \(\mathcal{D}\). If \(\mathcal{D}\) has \(n_A\) samples from Group A and \(n_B\) from Group B, the gradient implicitly weights Group A by \(n_A / (n_A + n_B)\). High-volume groups dominate.
6.3.3.3 Alternative: Equal-Weight Aggregation
Instead of volume-weighted aggregation, we could use equal-weight aggregation:
- Compute per-reviewer reference policies: \(\pi_{\text{ref}}^{(i)}\) for each reviewer \(i\)
- Aggregate with equal weights: \(\pi_{\text{ref}} = \frac{1}{N} \sum_{i=1}^N \pi_{\text{ref}}^{(i)}\)
- Or: stratify by group and ensure proportional representation
This ensures all reviewers contribute equally, regardless of review volume.
The Tradeoff: Equal-weight aggregation may have higher variance (low-volume reviewers contribute as much as high-volume reviewers, despite less data). Volume-weighting has lower variance but systematic bias toward high-participation groups.
Should we optimize for statistical efficiency (volume-weighting) or fairness (equal-weighting)? This is a value choice.
6.3.3.4 Code Example: Weighted vs. Unweighted Aggregation
Let’s simulate a population with unequal participation, compute the three aggregation strategies, and then see how an assistant trained on the aggregated style serves each group.
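A minimal version of this simulation might look as follows. The group sizes, style values, review volumes, and the quality metric are all illustrative assumptions: the training pool is split 50/50 between seniors and juniors, while the population is 30% senior / 70% junior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: 50 senior and 50 junior reviewers in the training pool.
# Each reviewer has a latent review "style" (1-D summary) and a review volume.
n_pool = 50
styles = np.concatenate([rng.normal(2.0, 0.3, n_pool),    # senior style ~ +2.0
                         rng.normal(-1.0, 0.3, n_pool)])  # junior style ~ -1.0
volumes = np.concatenate([rng.poisson(20, n_pool),        # seniors: ~20 reviews
                          rng.poisson(3, n_pool)])        # juniors: ~3 reviews
is_senior = np.arange(2 * n_pool) < n_pool

# Three aggregation strategies for the learned reference style:
volume_weighted = np.average(styles, weights=volumes)     # DPO-style default
equal_weighted = styles.mean()                            # one vote per reviewer
stratified = (0.3 * styles[is_senior].mean()              # population-proportional
              + 0.7 * styles[~is_senior].mean())          # (30% senior, 70% junior)

# Assistance quality for a group: closeness of the learned style to its own.
def quality(learned, senior):
    group_style = styles[is_senior if senior else ~is_senior].mean()
    return -abs(learned - group_style)

for name, learned in [("volume-weighted", volume_weighted),
                      ("equal-weighted", equal_weighted),
                      ("stratified", stratified)]:
    print(f"{name:>16}: learned style {learned:+.2f}, "
          f"senior quality {quality(learned, True):+.2f}, "
          f"junior quality {quality(learned, False):+.2f}")
```

With these assumed numbers, the volume-weighted style lands near the senior mean, while the stratified style lands near the junior-majority population mean.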
Interpretation:
Volume-weighted aggregation (DPO default): Learns senior reviewers’ style well (they dominate training data). Provides excellent assistance to seniors, poor assistance to juniors. Group unfairness: systematic disparity.
Equal-weighted aggregation: Balances both groups. Quality is more equitable—neither group is perfectly served, but both get reasonable assistance.
Group-stratified aggregation: Ensures proportional representation (30% senior, 70% junior). Since juniors are the majority, learned style is closer to theirs. This corrects for the participation imbalance.
Key Insight: Volume-weighted aggregation (the default in DPO and many systems) creates group unfairness by privileging high-participation groups. Equal or stratified weighting ensures fairer outcomes but may reduce statistical efficiency.
Design Principle: When aggregating preferences, examine participation patterns. If participation correlates with privilege, use equal or stratified weighting to ensure fairness—don’t default to volume-weighting without justification.
Connection to Real Systems:

- RLHF for LLMs: High-volume annotators dominate \(\pi_{\text{ref}}\). If annotator demographics are skewed, the model learns skewed preferences.
- Recommender systems: Power users dominate collaborative filtering. Their preferences are overweighted in recommendations for everyone.
- Voting systems: Turnout differs by demographics. Volume-weighted aggregation (like electoral systems without turnout adjustment) underweights low-turnout groups.
Next, we’ll see how the decision stage adds another layer: when should we defer to preferences vs. override them?
6.3.4 Decision: Liberal vs. Illiberal Assistance
Technical Question: Should the AI review assistant help reviewers follow their stated preferences, or should it sometimes override those preferences?
This isn’t a technical question—it’s a normative question about when paternalism is justified. We draw on Amartya Sen’s framework from Chapter 6 (Section 6.3.4.1).
6.3.4.1 Sen’s Framework: Nosy vs. Non-Nosy Preferences
From Chapter 6, recall Sen’s “Impossibility of a Paretian Liberal.” Sen distinguishes between:
- Non-nosy preferences: Preferences about your own outcomes (“I want to review ML papers”)
- Nosy preferences: Preferences about others’ choices (“Junior reviewers should handle tedious papers”)
Liberal assistance respects non-nosy preferences—each reviewer’s assistant optimizes for their own stated preferences.
Illiberal assistance allows (or enforces) nosy preferences—the system implements community standards or overrides individual preferences to prevent harm.
In peer review:
Liberal approach: Each reviewer’s assistant learns from their past reviews, helping them be consistent with their own style. If Reviewer A writes harsh reviews, the assistant helps them write consistently harsh reviews (respecting their preferences).
Illiberal approach: The assistant is trained on community standards (what constitutes a “good review” per AC/community norms). It nudges reviewers toward constructive feedback, even if they prefer harsh criticism.
6.3.4.2 When is Illiberal Assistance Justified?
Sen’s framework suggests illiberal assistance may be justified when:
Preferences harm third parties: Harsh reviews harm authors. Is nudging toward constructive feedback a justified “nosy preference” about how others experience reviews?
Preferences are formed under poor conditions: Reviewer writes harsh review when tired—this is the inversion problem (Section 6.3.1.1) again. Behavior (harsh review) ≠ deliberate judgment (mental state).
Individual preferences aggregate to collective harm: If all reviews are harsh, authors leave the field. Tragedy of the commons—individual rationality leads to collective harm.
Connection to Chapter 5: Thompson Sampling (Section 4.3) explores based on posterior uncertainty. But should we explore over what users want (liberal: respect stated preferences) or what’s best for them (illiberal: optimize for long-term well-being)? This is the same liberal/illiberal distinction.
6.3.4.3 Fairness Implications
Both liberal and illiberal assistance have fairness problems:
Liberal assistance might be unfair if:

- Some reviewers have biased preferences (e.g., harsher on papers from certain institutions)
- The system amplifies these biases by helping reviewers be “consistently” biased
- Authors from disadvantaged groups systematically receive worse reviews
- Result: Individual autonomy respected, but group fairness violated

Illiberal assistance might be unfair if:

- “Community standards” reflect the dominant group’s norms
- The system corrects diverse reviewing styles toward homogeneity
- Junior reviewers or reviewers from underrepresented backgrounds have their voice “corrected away”
- Result: Group fairness pursued, but individual autonomy violated
No easy answer: This is the core tension in fairness. Do we respect individual autonomy (liberal, may entrench bias) or enforce equity (illiberal, may erase diversity)?
Connection to Chapter 5: Arrow’s theorem (Section 5.2) and Sen’s Paretian Liberal (Section 6.3.4.1) show impossibility results—no mechanism satisfies all desirable properties. Adding fairness constraints makes this worse. We must choose which properties to prioritize.
6.3.4.4 Code Example: Liberal vs. Illiberal Assistance
Let’s simulate how liberal vs. illiberal assistants affect reviewer tone over time.
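One way to sketch this dynamic is below. The update rules and constants (tone scale, amplification and nudge rates, community standard) are assumptions chosen for illustration:

```python
# Tone < 0 is harsh, tone > 0 is constructive (assumed scale).
HARSH_INIT, CONSTRUCTIVE_INIT = -0.5, 0.5
COMMUNITY_STANDARD = 1.0        # assumed norm: constructive feedback
ROUNDS, AMPLIFY, NUDGE = 15, 0.15, 0.3

def liberal_step(tone):
    """Assistant reinforces the reviewer's own style, amplifying it."""
    return tone * (1 + AMPLIFY)

def illiberal_step(tone):
    """Assistant pulls every reviewer toward the community standard."""
    return tone + NUDGE * (COMMUNITY_STANDARD - tone)

def evolve(step, tone, rounds=ROUNDS):
    for _ in range(rounds):
        tone = step(tone)
    return tone

lib_harsh = evolve(liberal_step, HARSH_INIT)          # drifts harsher
lib_constr = evolve(liberal_step, CONSTRUCTIVE_INIT)  # drifts more constructive
ill_harsh = evolve(illiberal_step, HARSH_INIT)        # corrected toward standard
ill_constr = evolve(illiberal_step, CONSTRUCTIVE_INIT)

print(f"liberal:   harsh {lib_harsh:+.2f}, constructive {lib_constr:+.2f}")
print(f"illiberal: harsh {ill_harsh:+.2f}, constructive {ill_constr:+.2f}")
```

Under the liberal rule the two reviewers diverge (amplified bias); under the illiberal rule both converge to the standard (homogenization).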
Interpretation:
Liberal assistance: Helps each reviewer be consistent with their own style. Biased reviewers become more harsh (the assistant helps them efficiently write harsh reviews). Unbiased reviewers become more constructive. Result: Individual autonomy respected, but bias amplified. Authors from disadvantaged groups receive systematically harsher reviews.
Illiberal assistance: Nudges all reviewers toward the community standard. Biased reviewers are corrected toward constructive tone. But unbiased reviewers are also pulled toward the standard—diversity is reduced. Result: Homogenization. If the “community standard” itself reflects dominant norms, minority voices are erased.
Key Insight: Neither liberal nor illiberal assistance is fair in all contexts. Liberal respects autonomy but may amplify harm. Illiberal promotes standards but may erase diversity.
Design Principle: Use structured paternalism:

- Default to liberal assistance (respect stated preferences)
- Override when justified: harms to third parties, poor conditions (fatigue, manipulation), collective action problems
- Be transparent about overrides (explain to users why)
- Provide recourse (an appeal mechanism)

When to choose illiberal:

- High-stakes domains (healthcare, criminal justice): outcomes matter more than process
- Third-party harm is severe (hate speech, harassment)
- User explicitly consents to “help me be better” (coaching mode)

When to choose liberal:

- Low-stakes domains (entertainment, shopping): personal preference is paramount
- Diversity of perspectives is valuable (creative work, academic review)
- Users have established expertise (senior researchers need less intervention)

Connection to Real Systems:

- Content moderation: Liberal = “free speech” (respect stated preferences). Illiberal = community standards (remove harmful content even if users prefer it).
- Health apps: Liberal = track what the user reports. Illiberal = nudge toward healthier behaviors (paternalistic).
- LLM alignment: Should the model always satisfy user requests (liberal) or refuse harmful requests (illiberal)?
Finally, we’ll see how all four pipeline stages compound unfairness through feedback loops.
6.3.5 How Unfairness Compounds: The Full Pipeline
We’ve seen how each pipeline stage introduces bias:

- Elicitation (Section 6.3.1): Productivity-based sampling undersamples juniors
- Learning (Section 6.3.2): IIA mismodels overburdened reviewers
- Aggregation (Section 6.3.3): Volume-weighting privileges high-participation groups
- Decision (Section 6.3.4): Liberal assistance amplifies existing biases
Now we trace how these biases compound across stages through feedback loops, creating widening disparities over time.
6.3.5.1 The Feedback Loop
Consider the full AI review assistant pipeline:
Stage 1—Elicitation: System queries productive reviewers more (senior researchers write longer reviews due to more time). Junior reviewers, non-native speakers provide less training data.
Stage 2—Learning: Model assumes IIA (ignores context-dependence). Works poorly for overburdened reviewers (disproportionately junior, caregivers). Learning error is higher for juniors.
Stage 3—Aggregation: DPO weights by review volume. Senior reviewers’ style dominates the reference policy \(\pi_{\text{ref}}\).
Stage 4—Decision: Liberal assistance maximizes each reviewer’s productivity. But junior reviewers using the assistant get suggestions that don’t fit their style (trained on senior reviewers), so the assistant is less helpful.
Feedback: Junior reviewers find the assistant unhelpful, use it less, provide even less training data → cycle repeats. Senior reviewers find it very helpful, use it more, dominate training data further. The gap widens exponentially.
Key Insight: Each stage’s unfairness amplifies the next. A small bias at Stage 1 (20% less data from juniors) becomes a large bias at Stage 4 (assistant is 50% less helpful to juniors), which feeds back to Stage 1 (juniors provide 40% less data in the next round).
This is a positive feedback loop (in the technical sense): deviations from fairness grow over time rather than self-correcting.
6.3.5.2 Mathematical Model of Compounding Bias
Let’s formalize this feedback loop. Let \(q_t^{(g)}\) denote the number of queries to group \(g\) at time \(t\), and \(a_t^{(g)}\) denote the assistant quality for group \(g\) at time \(t\).
Stage 1 (Elicitation): Queries proportional to productivity (measured by past assistant usage): \[ q_{t+1}^{(g)} \propto a_t^{(g)} \tag{6.6}\] Groups with better assistance get queried more.
Stages 2-3 (Learning + Aggregation): Assistant quality depends on training data: \[ a_{t+1}^{(g)} = f(q_{t+1}^{(g)}) \tag{6.7}\] where \(f\) is increasing (more queries → more data → better learning).
Feedback: Combining these: \[ a_{t+1}^{(g)} \propto f(a_t^{(g)}) \tag{6.8}\]
If \(f\) is superlinear (e.g., \(f(x) = x^{1.2}\)), then small initial advantages grow exponentially. If Group A starts with slightly better assistance (\(a_0^{(A)} = 1.1 \cdot a_0^{(B)}\)), after \(T\) rounds: \[ \frac{a_T^{(A)}}{a_T^{(B)}} = \left(\frac{a_0^{(A)}}{a_0^{(B)}}\right)^{1.2^T} \to \infty \text{ as } T \to \infty \tag{6.9}\]
Result: Small initial biases explode into permanent disparities.
6.3.5.3 Code Example: Simulating Compounding Bias
Let’s simulate the full four-stage pipeline and watch disparities widen over time.
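A compact sketch of the loop from Equations (6.6)–(6.9), with \(f(x) = x^{1.2}\) and an assumed 10% initial advantage for seniors:

```python
# Feedback loop: queries follow quality (Eq. 6.6), quality follows data
# superlinearly (Eq. 6.7, f(x) = x**1.2), so the quality *ratio* compounds.
a_senior, a_junior = 1.1, 1.0    # 10% initial advantage for seniors (assumed)
ratios = []
for t in range(16):
    ratios.append(a_senior / a_junior)
    # Stages 1-3 combined: next-round quality is a superlinear function of
    # current quality (better quality -> more usage -> more data -> better quality).
    a_senior, a_junior = a_senior ** 1.2, a_junior ** 1.2
    # Renormalize total capacity; only the ratio matters (Eq. 6.9).
    total = a_senior + a_junior
    a_senior, a_junior = 2 * a_senior / total, 2 * a_junior / total

print("quality ratio by round:", [round(r, 2) for r in ratios])
```

The ratio follows Equation (6.9) exactly: \((1.1)^{1.2^t}\), so the 10% gap grows slowly at first and then accelerates.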
Interpretation:
Round 0: Seniors have slightly better assistance (10% advantage) due to initial data imbalance.
Rounds 1-5: Seniors find the assistant more helpful → use it more → provide more queries → system learns their preferences better → assistance quality improves. Juniors find it less helpful → use it less → provide fewer queries → system learns their preferences worse → assistance quality degrades.
Rounds 6-15: The gap widens exponentially. What started as a 10% difference becomes a 30-40% difference. Seniors get excellent assistance; juniors get poor assistance.
Endstate: Without intervention, the system converges to serving seniors well and juniors poorly—even though initial expertise was similar.
Key Insight: Small biases at each stage multiply through the feedback loop. This is why end-to-end fairness analysis is critical—optimizing each stage independently is insufficient.
Design Principle: Audit for compounding unfairness:

1. Map the full pipeline: Trace how data flows from elicitation → learning → aggregation → decision → back to elicitation
2. Test feedback loops: Simulate the system over multiple rounds. Does disparity grow or shrink?
3. Monitor per-group metrics over time: If average performance improves but subgroup performance worsens, you have compounding bias
4. Intervene early: Small biases are easier to correct than large ones. Don’t wait for disparities to explode
Connection to Real Systems:

- RLHF for LLMs: If early training data is demographically imbalanced, the model learns certain users better → they provide more feedback → the model improves for them → the gap widens.
- Recommender systems: “Rich get richer” dynamics—popular items get recommended more → viewed more → become more popular.
- Search engines: High-ranked results get more clicks → clicks are used as a relevance signal → high-ranked results rank even higher.
- Social media: Viral content gets more engagement → platforms show it to more users → it gets even more engagement.
All these systems have the same mathematical structure: positive feedback loops amplifying initial biases.
6.3.5.4 Breaking the Feedback Loop
How do we prevent compounding unfairness?
Option 1: Stratified elicitation (Section 6.3.1.2)

- Ensure minimum queries from each group regardless of current assistance quality
- Breaks the Stage 1 → Stage 2 link in the feedback loop

Option 2: Fairness-constrained learning

- Don’t just minimize overall error—minimize the maximum per-group error
- Ensures all groups are learned well, even if some provide less data

Option 3: Equal-weight aggregation (Section 6.3.3.3)

- Don’t weight by volume—weight by population proportion
- Breaks the Stage 3 bias that privileges high-participation groups

Option 4: Illiberal assistance with equity goals (Section 6.3.4)

- Override individual preferences when they create group disparities
- E.g., cap assistance quality for advantaged groups until disadvantaged groups catch up

Option 5: Regular re-initialization

- Periodically reset the system to equal assistance for all groups
- Prevents long-run divergence (though fairness issues persist in the short run)
The Tradeoff: All these interventions sacrifice some efficiency (overall system quality) for fairness (reducing disparities). You cannot optimize both simultaneously.
Recommended Approach: Combine multiple interventions across stages. No single-stage fix prevents compounding—you need fairness constraints at multiple pipeline stages.
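To see how a single-stage intervention changes the dynamics, we can extend the feedback-loop model from Section 6.3.5.2 with Option 1, a stratified elicitation floor. The floor value and update rule are illustrative assumptions:

```python
def simulate(rounds=15, floor=0.0):
    """Quality-ratio dynamics with an optional minimum query share per group."""
    a = {"A": 1.1, "B": 1.0}                 # 10% initial advantage for Group A
    for _ in range(rounds):
        total = a["A"] + a["B"]
        # Stage 1: query share follows quality (Eq. 6.6), clamped to a floor.
        share = {g: max(a[g] / total, floor) for g in a}
        # Stages 2-3: quality follows data superlinearly (f(x) = x**1.2).
        a = {g: (2 * share[g]) ** 1.2 for g in a}
    return a["A"] / a["B"]

unmitigated = simulate(floor=0.0)    # pure feedback loop: the gap explodes
mitigated = simulate(floor=0.45)     # each group guaranteed 45% of queries

print(f"final quality ratio: {unmitigated:.2f} (no floor) "
      f"vs {mitigated:.2f} (45% floor)")
```

In this sketch the floor does not eliminate the gap, but it bounds it: the disparity settles near a fixed point instead of diverging, which is why combining such interventions across stages works better than any one alone.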
This completes our analysis of the four-stage pipeline. Next, we formalize key fairness concepts that appeared throughout this section.
6.4 Fairness Concepts and Impossibilities
Throughout Section 6.3, we encountered fairness tensions: efficiency vs. representation (Section 6.3.1), IIA vs. context-dependence (Section 6.3.2), volume-weighting vs. equal-weighting (Section 6.3.3), and liberal vs. illiberal assistance (Section 6.3.4). These aren’t implementation details—they reflect deep conflicts between incompatible fairness criteria.
This section formalizes two fundamental tensions:

1. Individual vs. group fairness (incompatible by theorem)
2. Process vs. outcome fairness (incompatible in practice)
Understanding these impossibilities helps us make conscious tradeoffs rather than searching for nonexistent perfect solutions.
6.4.1 Individual vs. Group Fairness
Individual fairness (Dwork et al., 2012): Similar individuals should be treated similarly.
Formal definition: For distance metric \(d\) on individuals and outcomes, a decision function \(f\) is individually fair if: \[ d(x_i, x_j) \leq \epsilon \implies d(f(x_i), f(x_j)) \leq \delta(\epsilon) \tag{6.10}\] where \(\delta\) is non-decreasing. Intuitively: if two individuals are close in relevant features, their outcomes should be close.
In peer review: Papers with similar topics and quality should get similar-quality reviewers. If Papers A and B both study transformers on low-resource languages, they should get reviewers with comparable expertise.
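Equation (6.10) with a linear \(\delta(\epsilon) = L\epsilon\) is a Lipschitz condition, which is easy to check exhaustively on a small set of individuals. The two decision functions below are hypothetical, chosen to show a pass and a fail:

```python
def individually_fair(f, xs, L=1.0, tol=1e-12):
    """Check |f(a) - f(b)| <= L * |a - b| for all pairs (Eq. 6.10, linear delta)."""
    return all(abs(f(a) - f(b)) <= L * abs(a - b) + tol
               for a in xs for b in xs)

smooth = lambda x: 0.5 * x                    # similar inputs get similar scores
cutoff = lambda x: 1.0 if x > 0.5 else 0.0    # hard threshold decision

candidates = [0.0, 0.3, 0.49, 0.51, 0.9]      # assumed 1-D features
print(individually_fair(smooth, candidates))  # smooth scoring satisfies Eq. 6.10
print(individually_fair(cutoff, candidates))  # 0.49 vs 0.51 outcomes jump by 1.0
```

Hard thresholds violate individual fairness precisely at the decision boundary, which is where contested cases live.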
Group fairness (demographic parity): Protected demographic groups should have equal average outcomes.
Formal definition: For groups \(g_1, g_2 \in \mathcal{G}\), a decision function \(f\) satisfies group fairness if: \[ \mathbb{E}[f(x) \mid G = g_1] = \mathbb{E}[f(x) \mid G = g_2] \tag{6.11}\] where \(G\) is the group membership variable.
In peer review: Papers from mainstream ML subfields vs. small subfields should receive equally qualified reviewers on average.
6.4.1.1 The Fundamental Incompatibility
Dwork et al. (2012) proved these criteria are often mutually incompatible—no decision function satisfies both.
Intuition: Suppose we have two groups with different distributions over features \(x\):

- Group A: Papers mostly in mainstream topics (many expert reviewers available)
- Group B: Papers in small subfields (few expert reviewers available)
Individual fairness says: Papers with similar topics get similar reviewers. Since Group B papers are in small subfields, they get less-expert reviewers (few experts exist).
Group fairness says: On average, Group B papers should get equally expert reviewers as Group A papers.
These conflict: To satisfy group fairness, we’d need to assign Group B papers to reviewers less matched by topic (drawing from the broader pool). But this violates individual fairness (similar topics → different review quality based on which group the paper belongs to).
The group fairness framework above treats groups as monolithic categories (Group A vs. Group B). In reality, individuals belong to multiple overlapping groups simultaneously—a junior researcher working on a small subfield from an under-resourced institution faces compounding disadvantages that analyzing any single group membership would miss. Intersectional fairness requires monitoring outcomes for all combinations of protected attributes, but this quickly creates a sample size problem: with \(k\) binary attributes, there are \(2^k\) subgroups, many of which may have too few members for reliable statistical analysis. Practical approaches include hierarchical models that share information across subgroups, or focusing on subgroups identified by preliminary analysis as having the largest outcome disparities.
Connection to Earlier Sections:

- Elicitation (Section 6.3.1): Stratified sampling ensures group fairness (equal queries per group) but may violate individual fairness (equally productive reviewers get different query rates based on group).
- Aggregation (Section 6.3.3): Equal-weight aggregation promotes group fairness but may violate individual fairness (reviewers with equal participation get different weights based on group size).
6.4.1.2 Code Example: Individual vs. Group Fairness Tradeoff
Let’s simulate paper-reviewer matching under individual-fairness and group-fairness constraints and compare the resulting matching strategies.
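A deterministic sketch of the tradeoff follows. The reviewer pool and expertise scores are assumed values: two rare dual experts, six mainstream-only reviewers, and four weak reviewers, with six papers per group:

```python
# Assumed reviewer pool: reviewers 0-1 are rare dual experts, 2-7 are strong
# only on mainstream (Group A) topics, 8-11 are weak on both topic areas.
expertise_A = [1.0, 0.95, 0.9, 0.9, 0.85, 0.85, 0.8, 0.8, 0.75, 0.75, 0.7, 0.7]
expertise_B = [0.9, 0.9, 0.4, 0.4, 0.4, 0.35, 0.35, 0.3, 0.3, 0.25, 0.25, 0.2]

def match(paper_order):
    """Greedily give each paper (in the given order) its best remaining reviewer."""
    free = set(range(len(expertise_A)))
    assigned = {"A": [], "B": []}
    for group in paper_order:
        exp = expertise_A if group == "A" else expertise_B
        best = max(free, key=lambda r: exp[r])
        free.remove(best)
        assigned[group].append(exp[best])
    return {g: sum(v) / len(v) for g, v in assigned.items()}

# Individually fair / greedy: mainstream papers grab the best matches first.
greedy = match(["A"] * 6 + ["B"] * 6)
# Group-fair: reserve the scarce niche experts for Group B papers first.
group_fair = match(["B"] * 6 + ["A"] * 6)

print("greedy     :", {g: round(q, 2) for g, q in greedy.items()})
print("group-fair :", {g: round(q, 2) for g, q in group_fair.items()})
```

Under these assumed numbers, greedy matching produces a large between-group gap; the group-fair order narrows the gap, but only by giving some Group A papers worse matches than similar papers would otherwise receive.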
Interpretation:
Greedy matching (individual fairness): Each paper gets its best-available reviewer. Group A papers get high-expertise reviewers (many available). Group B papers get lower-expertise reviewers (few available). Satisfies individual fairness (similar papers within a group get similar treatment) but violates group fairness (systematic disparity between groups).
Group-fair matching: Forces both groups to have similar average reviewer quality. For Group B, this means finding better-matched reviewers even if sub-optimal for individual papers. Satisfies group fairness (equal outcomes on average) but violates individual fairness (some Group B papers get worse matches than similar Group A papers would).
Key Insight: This is not a failure of implementation—it’s a mathematical impossibility. Dwork et al. proved you cannot satisfy both criteria simultaneously in settings like this.
Design Principle: You must choose which fairness criterion to prioritize:

- Individual fairness: When “merit” / topic-match is well-defined and important (e.g., expertise matching)
- Group fairness: When systemic disparities need correction and outcomes matter more than process (e.g., ensuring underrepresented subfields get quality reviews)
Make this choice consciously and transparently, documenting why.
6.4.2 Process vs. Outcome Fairness
Another fundamental tension: fair process vs. fair outcomes.
Process fairness (procedural justice): Same rules and opportunities for all. Equal treatment.
Outcome fairness (distributive justice): Equitable results. Equal outcomes, adjusting for disadvantages.
In peer review:
Process fairness: All reviewers bid with equal weight. Matching algorithm treats all bids equally. No adjustment for demographics or privilege.
Outcome fairness: Adjust bid weights to ensure junior reviewers, underrepresented subfields, and under-resourced institutions get quality assignments proportional to their population.
6.4.2.1 The Tension
Process fairness says: Don’t adjust for demographics—that’s discrimination. Treat everyone equally.
Outcome fairness says: Equal treatment of unequal groups perpetuates inequality. Must adjust to correct historical disadvantages.
These conflict when groups have different baseline resources or positions. A process-fair system produces outcome-unfair results if groups start from unequal positions.
Connection to Sen’s Liberalism (Section 6.3.4):

- Liberal assistance is process-fair: respect each user’s stated preferences; don’t override
- Illiberal assistance is outcome-fair: override preferences to achieve equitable outcomes
The same philosophical tension appears across scales: individual assistance and system-wide fairness.
6.4.2.2 Code Example: Process vs. Outcome Fairness
Let’s simulate a bidding system for paper-reviewer matching and compare process-fair vs. outcome-fair allocation.
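A deterministic sketch, where the paper qualities, bidding behavior (seniors bid strategically, juniors bid flat), and the boost factor are all assumptions:

```python
# 10 papers with known quality; 5 senior and 5 junior reviewers, one paper each.
qualities = [0.95, 0.9, 0.85, 0.8, 0.75, 0.5, 0.45, 0.4, 0.35, 0.3]

def allocate(junior_boost=1.0):
    """Assign papers, best first, to the higher effective bidder with capacity."""
    capacity = {"senior": 5, "junior": 5}
    won = {"senior": [], "junior": []}
    for q in sorted(qualities, reverse=True):
        # Seniors bid strategically (bid = quality); juniors bid a flat 0.6.
        senior_bid = q if capacity["senior"] else -1.0
        junior_bid = 0.6 * junior_boost if capacity["junior"] else -1.0
        winner = "senior" if senior_bid > junior_bid else "junior"
        capacity[winner] -= 1
        won[winner].append(q)
    return {g: sum(v) / len(v) for g, v in won.items()}

process_fair = allocate(junior_boost=1.0)    # same rules for every bid
outcome_fair = allocate(junior_boost=1.35)   # assumed compensatory boost

print("process-fair:", {g: round(q, 2) for g, q in process_fair.items()})
print("outcome-fair:", {g: round(q, 2) for g, q in outcome_fair.items()})
```

With equal rules, seniors win all of the strong papers; with the boost, juniors win some strong papers and the gap in average assigned quality shrinks sharply, at the cost of treating the two groups’ bids differently.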
Interpretation:
Process-fair allocation: Equal bid weights. Seniors, who bid strategically on high-quality papers, get better assignments. Juniors, who bid less strategically, get worse papers. Process is fair (same rules) but outcome is unfair (systematic disparity).
Outcome-fair allocation: Boost junior bids to compensate for less strategic bidding. Outcomes are equalized. But process is “unfair” (different groups treated differently) even though outcome is fair (equal results).
Key Insight: Process and outcome fairness are fundamentally in tension when groups have unequal starting positions or differential access to strategic information.
Design Principle: Choose based on domain and stakes:

- Process fairness in low-stakes domains where autonomy matters more than outcomes (entertainment, shopping)
- Outcome fairness in high-stakes domains where systemic disparities must be corrected (hiring, credit, healthcare, education)
Connection to Liberal vs. Illiberal (Section 6.3.4):

- Liberal assistance = process fairness (respect preferences)
- Illiberal assistance = outcome fairness (adjust for better outcomes)
Same tension, different scale.
Summary of Section 6.4: We formalized two fundamental fairness conflicts:

- Individual vs. group fairness: Mathematically incompatible (Dwork et al.)
- Process vs. outcome fairness: Philosophically incompatible in practice
These aren’t failures of engineering—they’re impossibility results. No system satisfies all fairness criteria. You must choose which to prioritize, make that choice consciously, and be transparent about tradeoffs.
Next, we provide eight concrete design principles for navigating these impossibilities in practice.
6.5 Design Principles for Fair Preference Learning
The pipeline analysis (Section 6.3) and fairness impossibilities (Section 6.4) reveal fundamental tensions. No system satisfies all desirable properties. But we’re not helpless—conscious design choices can mitigate unfairness even if we can’t eliminate it entirely.
This section provides eight actionable principles for building fairer preference learning systems, distilled from the chapter’s analysis and real-world experience. Each principle addresses specific failure modes from earlier sections.
6.5.1 Principle 1: Make Value Tradeoffs Explicit
Principle: Document what values are prioritized, what’s sacrificed, who benefits, and who is harmed. Create values documents alongside technical design docs.
Rationale: Every technical choice embeds value judgments (Section 6.2). Claiming “technical neutrality” hides these choices, preventing scrutiny and accountability. Explicit documentation enables informed debate.
In practice:

- For the AI review assistant: Document that productivity-based sampling prioritizes efficiency over representation, benefiting senior reviewers at the expense of juniors.
- Create a “fairness impact statement” for each pipeline stage: Who gets more/less? Why is this acceptable?
- Include this in design reviews, not just code reviews.
Example template:
Design Decision: Use DPO with volume-weighted reference policy
Value Prioritized: Statistical efficiency (lower variance)
Value Sacrificed: Group fairness (equal influence)
Who Benefits: High-volume annotators (seniors, native speakers)
Who Is Harmed: Low-volume annotators (juniors, non-native speakers)
Justification: [Your reasoning here]
Mitigation: [e.g., Minimum sampling quotas for underrepresented groups]
Connection: Addresses the hidden value judgments in elicitation (Section 6.3.1), learning (Section 6.3.2.2), aggregation (Section 6.3.3), and decisions (Section 6.3.4.3).
6.5.2 Principle 2: Distinguish Behavior from Mental State
Principle: Don’t optimize for clicks, bids, or engagement without asking if they reflect true preferences. Model context explicitly. Weight less-automatic signals more heavily.
Rationale: The inversion problem (Section 6.3.1.1): \(P(B \mid M, C) \neq P(B \mid M)\). Behavior conflates mental state with context. Groups with worse context (fatigue, time pressure, language barriers) appear less expert when measured by behavior alone.
In practice:

- Don’t query reviewers proportional to review length—use stratified sampling to ensure representation.
- If using engagement metrics (time spent, clicks), adjust for accessibility (slow internet, screen readers, non-native language comprehension).
- Collect explicit preference signals (ratings, comparisons) in addition to implicit behavior.
Red flag: If you’re using revealed preferences (purchase history, clicks, review volume) as ground truth without modeling context, you’re committing the inversion fallacy.
Connection: Core issue in elicitation (Section 6.3.1). Productivity-based sampling systematically undersamples groups with poor context.
6.5.3 Principle 3: Audit for Compounding Unfairness
Principle: Map the full pipeline (elicitation → learning → aggregation → decision). Simulate over time. Test: Does disparity grow or shrink? If it grows, intervene urgently.
Rationale: Section 6.3.5 showed how small biases multiply through feedback loops. A 10% initial advantage becomes 40% after 15 rounds. Endstate: the system serves privileged groups excellently and disadvantaged groups poorly.
In practice:

- Before deployment: Run simulations with different initial conditions (varied group sizes, participation rates).
- After deployment: Track per-group metrics over time. Plot disparity trends. If widening, pause and redesign.
- Set circuit breakers: If disparity exceeds a threshold (e.g., a 2x gap in quality), trigger automatic review.
Audit checklist:

- [ ] Have you mapped all feedback paths from decision → elicitation?
- [ ] Do you track per-group metrics, not just averages?
- [ ] Have you simulated 10+ rounds to check for compounding?
- [ ] Is there a process for emergency intervention if disparities explode?
Connection: End-to-end analysis from Section 6.3.5. Addresses how elicitation bias (Section 6.3.1) amplifies through learning (Section 6.3.2) and aggregation (Section 6.3.3).
6.5.4 Principle 4: Choose Fairness Definition Consciously
Principle: Accept that individual/group and process/outcome fairness are often incompatible (Section 6.4.1.1). Choose based on domain and stakes. Be transparent about choice and justify.
Rationale: Impossibility theorems (Dwork et al. for individual/group, Sen for process/outcome) prove you cannot satisfy all criteria. Attempting to satisfy all leads to incoherent systems. Better to prioritize explicitly.
Decision framework:
| Context | Prioritize | Rationale |
|---|---|---|
| High-stakes domains (hiring, credit, healthcare) | Outcome fairness | Systemic disparities must be corrected; outcomes matter more than process |
| Domains with well-defined merit (paper-topic matching) | Individual fairness | Similar entities should get similar treatment; topic expertise is measurable |
| Domains requiring diversity (creative work, research) | Group fairness | Ensure all perspectives represented; avoid homogenization |
| Low-stakes personal domains (entertainment, shopping) | Process fairness | Autonomy paramount; respect stated preferences |
In practice:

- For peer review: Prioritize individual fairness (expertise-topic match) but constrain for group fairness (minimum quality per subfield).
- Document the choice in a values statement (Principle 1).
- Measure violations of the non-prioritized criteria and report them (transparency).
Connection: Formalizes the tradeoffs from Section 6.4.1 and Section 6.4.2.
6.5.5 Principle 5: Use Stratification, Not Just Optimization
Principle: Report performance per subgroup, not just average. Set minimum thresholds per subgroup. Use constrained optimization: maximize utility subject to fairness constraints.
Rationale: Optimization on average hides disparities. A system can improve for 80% of users while degrading for 20%. Averages increase, but unfairness grows. Stratified reporting reveals this.
In practice:
- Don’t report “95% accuracy”—report “98% for Group A, 87% for Group B” (stratified).
- Set constraints: “No group below 90% accuracy” rather than “average 95% accuracy.”
- Optimization formulation: \[ \max_{\theta} \text{Utility}(\theta) \quad \text{subject to} \quad \min_{g \in \mathcal{G}} \text{Quality}_g(\theta) \geq \tau \tag{6.12}\] where \(\mathcal{G}\) is the set of protected groups and \(\tau\) is the fairness threshold.
Example: In the review assistant, don’t just measure “average assistant quality”—measure quality for seniors, juniors, native speakers, non-native speakers separately. Ensure all exceed a minimum bar.
Statistical note: Small subgroups have high variance. Use confidence intervals. Require statistically significant disparities before intervening (avoid overreacting to noise).
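A minimal sketch of Principles 5's stratified reporting, with the Wald confidence intervals suggested in the statistical note and a check of the constraint in Equation 6.12. All names and the toy data are illustrative, not part of any particular system:

```python
import math

def stratified_report(outcomes, tau=0.90, z=1.96):
    """Per-group accuracy with a normal-approximation confidence interval,
    plus a check of the constraint min_g Quality_g >= tau (Eq. 6.12)."""
    report = {}
    for group, results in outcomes.items():
        n = len(results)
        acc = sum(results) / n
        half = z * math.sqrt(acc * (1 - acc) / n)  # Wald interval half-width
        report[group] = {"n": n, "acc": acc, "ci": (acc - half, acc + half)}
    constraint_ok = all(r["acc"] >= tau for r in report.values())
    return report, constraint_ok

# Toy outcomes: 1 = correct suggestion, 0 = error
outcomes = {
    "group_A": [1] * 98 + [0] * 2,   # 98% accuracy
    "group_B": [1] * 87 + [0] * 13,  # 87% accuracy
}
report, constraint_ok = stratified_report(outcomes, tau=0.90)
```

Here the pooled accuracy would be 92.5%, which looks fine; the stratified report exposes that Group B violates the 90% floor. For small subgroups, wide intervals from this calculation signal when an apparent disparity may be noise.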
Connection: Addresses aggregation bias (Section 6.3.3) and compounding unfairness (Section 6.3.5).
6.5.6 Principle 6: When in Doubt, Ask
Principle: Involve affected communities early and throughout. Ongoing feedback loops, not one-time consultation. Be willing to redesign based on stakeholder input.
Rationale: Designers often lack lived experience of disadvantaged groups. Assumptions about “what’s fair” reflect designer privilege. Participatory design surfaces issues invisible to designers.
In practice:
- Form advisory boards with junior reviewers, underrepresented subfields, international scholars.
- Show them stratified metrics (Principle 5). Ask: “Is this disparity acceptable? What would make it better?”
- Iterate: Design → Deploy to subset → Gather feedback → Redesign → Repeat.
- Provide recourse mechanisms: Users can report unfairness, flag bad suggestions, opt out of assistance.
Caution: Participatory design is costly and doesn’t scale perfectly. But it’s essential for high-stakes systems affecting marginalized groups. Low-stakes systems can use lighter-weight feedback (surveys, usage analytics).
Connection: Governance considerations are woven throughout this chapter. Complements Principle 1 (make tradeoffs explicit) by involving stakeholders in the tradeoff decisions.
6.5.7 Principle 7: Build for Observability and Accountability
Principle: Log decisions and context for auditing. Enable external verification. Create feedback channels for reporting harms. Plan for rapid iteration when problems are found.
Rationale: Fairness violations are often invisible to designers. Users experience harm but can’t prove it (disparities are statistical, not individual). Observability makes the invisible visible.
In practice:
- Logging: Record all queries, model predictions, and assistant suggestions with user demographics and context. Enables post-hoc auditing.
- Dashboards: Real-time stratified metrics (Principle 5) visible to stakeholders, not just developers.
- Feedback channels: “Report unfairness” button. User submissions trigger review.
- Rapid response: When disparities are detected, freeze rollout, investigate, fix, redeploy. Build this into the development process, not as an afterthought.
Privacy consideration: Logging demographics raises privacy concerns. Use differential privacy, aggregate before sharing, or allow users to opt in to demographic tracking with clear benefits explained.
Example: The review assistant logs which reviewers use the assistant, which suggestions they accept/reject, stratified by group. Monthly fairness audits check for growing disparities. If detected, system pauses and team investigates.
Connection: Enables enforcement of all other principles. Without observability, fairness claims are unverifiable.
6.5.8 Principle 8: Recognize Limits of Liberal Assistance
Principle: Default to respecting preferences (liberal). Override when: (1) harms third parties, (2) preferences formed under poor conditions, (3) collective action problems. Use structured paternalism: transparency, justification, recourse.
Rationale: Section 6.3.4 showed the liberal/illiberal tension. Liberal assistance respects autonomy but can amplify bias. Illiberal assistance promotes standards but can erase diversity. Neither is always right.
Decision framework for when to override:
Override (illiberal) when:
- Third-party harm is severe (hate speech, harassment, bias that harms authors)
- Preferences formed under poor conditions (fatigue, manipulation, addiction)
- Collective action problem (all harsh reviews harm scientific culture)
- User explicitly consents to coaching (“help me be better”)

Respect (liberal) when:
- No third-party harm (personal preferences about own outcomes)
- User has expertise and autonomy (senior researchers, established preferences)
- Diversity of perspectives is valuable (creative work, research pluralism)

Structured paternalism:
- Transparency: Explain to users when and why you’re overriding their preferences
- Justification: Document the harm being prevented
- Recourse: Allow users to appeal, opt out, or request human review
Example: The review assistant nudges reviewers toward constructive feedback (illiberal) when reviews are harshly critical without justification (third-party harm to authors). But it respects reviewers’ substantive judgments about paper quality (liberal—expertise and no harm).
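The decision framework above can be condensed into a small rule. This is purely an illustrative sketch: the predicate names are hypothetical, and a real system would attach the transparency, justification, and recourse obligations to every override:

```python
def assistance_mode(third_party_harm=False, poor_conditions=False,
                    collective_action=False, consents_to_coaching=False):
    """Default to liberal assistance; switch to illiberal only when one
    of the override conditions from the decision framework holds."""
    if any([third_party_harm, poor_conditions,
            collective_action, consents_to_coaching]):
        return "illiberal"  # must come with transparency, justification, recourse
    return "liberal"        # respect stated preferences

# Harshly critical review without justification -> third-party harm to authors
mode_harsh = assistance_mode(third_party_harm=True)
# Senior reviewer's substantive judgment about paper quality -> no override
mode_judgment = assistance_mode()
```

The hard work, of course, lives in deciding when the predicates are true; the rule only makes the default (liberal) and the burden of proof (on overriding) explicit.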
Connection: Resolves the liberal/illiberal tension from Section 6.3.4 by providing criteria for when each is appropriate.
6.5.9 Red Flags and Warning Signs
Watch for these patterns—they indicate fairness problems requiring intervention:
- Feedback loops: Average performance improves, but some subgroup’s performance worsens. Gap widens over time.
  - Action: Audit for compounding (Principle 3). Implement circuit breakers.
- Optimization-fairness divergence: Metrics you’re optimizing ≠ what stakeholders care about. High engagement but low satisfaction for some groups.
  - Action: Ask stakeholders (Principle 6). Redefine success metrics to include fairness.
- Invisibility of harm: Affected parties can’t see how they’re harmed. Disparities are statistical, not individually observable.
  - Action: Build observability (Principle 7). Make stratified metrics public.
- Post-hoc rationalization: Justifying disparate impact as “technically correct” or “meritocratic” after the fact.
  - Action: Make value tradeoffs explicit upfront (Principle 1). No hiding behind “neutral” algorithms.
- Ignoring context: Assuming preferences are context-free when they’re not (IIA violations).
  - Action: Distinguish behavior from mental state (Principle 2). Model context explicitly.
- Revealed preference fallacy: Assuming behavior = true preferences without accounting for manipulation, poor conditions, or limited options.
  - Action: Collect explicit preferences. Weight less-automatic signals more heavily (Principle 2).
- Homogenization: System converges to serving one group’s preferences/style. Diversity decreases over time.
  - Action: Use stratification (Principle 5). Ensure all groups represented in training data and outcomes.
- Privilege escalation: Initial advantages (more data, better context) compound into permanent disparities.
  - Action: Audit for compounding (Principle 3). Intervene early before gaps explode.
Monitoring cadence: Run fairness audits before deployment (simulation), at launch (baseline metrics), and monthly ongoing (trend analysis). Disparities often emerge slowly—quarterly reviews are too infrequent.
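A minimal sketch of the feedback-loop red flag as a monitoring check: fit a least-squares slope to the subgroup quality gap over successive audits. The function and data are hypothetical; a production audit would add a significance test before triggering intervention:

```python
def gap_trend(history_a, history_b):
    """Least-squares slope of the per-period quality gap between two groups.
    A positive slope means the gap is widening over time (feedback-loop flag)."""
    gaps = [a - b for a, b in zip(history_a, history_b)]
    n = len(gaps)
    t_mean = (n - 1) / 2
    g_mean = sum(gaps) / n
    num = sum((t - t_mean) * (g - g_mean) for t, g in enumerate(gaps))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

# Average quality rises every month, yet the senior-junior gap widens
seniors = [0.90, 0.92, 0.94, 0.96]
juniors = [0.88, 0.87, 0.86, 0.85]
slope = gap_trend(seniors, juniors)  # positive slope: gap is growing
```

Note that the pooled average in this toy series improves each month, which is exactly why per-group trend monitoring (rather than average tracking) is needed to catch the flag.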
Summary: These eight principles provide concrete guidance for building fairer systems despite impossibility results. No system is perfectly fair, but conscious application of these principles dramatically improves outcomes for disadvantaged groups.
6.6 Quick Check
Test your understanding before proceeding to the exercises.
1. A company collects RLHF annotations from crowdworkers paid per comparison. Explain how the inversion problem applies here: what aspects of annotator behavior might not reflect their true preferences?
2. Individual fairness says similar entities should be treated similarly; group fairness says protected groups should have equal outcomes. Construct a concrete example where satisfying one violates the other.
3. In the four-stage pipeline (\(\mathcal{E} \to \mathcal{L} \to \mathcal{A} \to \mathcal{D}\)), explain how a 10% bias at the elicitation stage could compound into a larger gap after one feedback cycle.
4. When is illiberal assistance (overriding user preferences) justified according to Sen’s framework? Give an example involving an AI assistant.
6.7 Summary
- Every technical choice in a preference learning system—who gets queried, what model structure is assumed, how preferences are aggregated, when to defer vs. override—embeds a value judgment about whose preferences matter and how they should be weighted.
- The four-stage pipeline (\(\mathcal{E} \to \mathcal{L} \to \mathcal{A} \to \mathcal{D}\)) provides a systematic framework for auditing where value choices enter: elicitation determines who is heard, learning determines what structures are imposed, aggregation determines how voices are combined, and decision determines when to act.
- The inversion problem is fundamental: observed behavior (clicks, ratings, comparison choices) does not equal underlying preferences. Fatigue, time constraints, cultural norms, and interface design systematically distort behavior, and these distortions disproportionately affect overburdened groups.
- Individual fairness (similar entities treated similarly) and group fairness (protected groups have equal outcomes) are fundamentally incompatible—satisfying one can violate the other, forcing explicit tradeoffs rather than seeking a single “fair” solution.
- Unfairness compounds across the pipeline: small biases at each stage multiply through feedback loops, where improved service for well-represented groups generates more data, further improving their service while neglecting others.
- Sen’s distinction between nosy and non-nosy preferences determines when systems should respect stated preferences (liberal assistance) versus override them (illiberal assistance), providing a principled framework for paternalism in AI.
- The chapter’s eight design principles—from making value tradeoffs explicit to auditing for feedback loops—provide concrete, actionable guidance for building fairer preference learning systems despite impossibility results.
6.8 Exercises
These exercises deepen understanding of the fairness concepts, tradeoffs, and impossibilities introduced in this chapter. Difficulty levels: * (easy), ** (medium), *** (hard).
6.8.1 Exercise 1: Individual vs. Group Fairness Conflict (*)
Consider a paper-reviewer matching scenario:
- 80 papers from Subfield A (mainstream topic with 60 expert reviewers available)
- 20 papers from Subfield B (small topic with 10 expert reviewers available)
- Each paper needs 3 reviewers
Individual fairness constraint: Papers on similar topics should get reviewers with similar expertise. Formally, if topics \(d(p_i, p_j) \leq \epsilon\), then expertise quality \(d(r(p_i), r(p_j)) \leq \delta\).
Group fairness constraint: Papers from Subfield A and B should receive equally expert reviewers on average. Formally, \(\mathbb{E}[\text{expertise} \mid \text{Subfield A}] = \mathbb{E}[\text{expertise} \mid \text{Subfield B}]\).
(a) Prove that these constraints are incompatible in this setting. Show that satisfying individual fairness (similar papers get similar reviewers) necessarily violates group fairness (Subfield B papers get less-expert reviewers on average due to limited expert pool).
(b) Propose a modified fairness criterion that is achievable. For example, could you satisfy “individual fairness within each subfield” while allowing between-group disparities? Formalize your proposal.
6.8.2 Exercise 2: Inversion Problem Formalization (**)
Let \(B\) denote observed behavior (e.g., review length in words), \(M\) denote mental state (true expertise), and \(C\) denote context (available time). Consider two reviewer groups:
Group 1 (juniors): Often overburdened
- \(P(B \geq 500 \mid M = \text{high}, C = \text{low time}) = 0.3\) (high expertise but limited time → short review)
- \(P(B \geq 500 \mid M = \text{low}, C = \text{low time}) = 0.1\)
- \(P(C = \text{low time}) = 0.7\) (juniors have less time 70% of the time)

Group 2 (seniors): More flexible time
- \(P(B \geq 500 \mid M = \text{high}, C = \text{good time}) = 0.9\) (high expertise + time → long review)
- \(P(B \geq 500 \mid M = \text{low}, C = \text{good time}) = 0.2\)
- \(P(C = \text{good time}) = 0.9\) (seniors have good time 90% of the time)
Assume \(P(M = \text{high}) = 0.6\) for both groups (similar expertise distributions).
(a) Compute \(P(M = \text{high} \mid B \geq 500)\) for each group using Bayes’ rule. Show that even with equal expertise, seniors appear more expert when measured by behavior.
(b) If we use productivity-based sampling (query probability proportional to \(P(B \geq 500)\)), compute the query ratio between seniors and juniors. How much more do we query seniors?
(c) Derive a correction factor to infer \(M\) from \(B\) that accounts for differential context \(C\) across groups. Under what conditions is this correction identifiable from data?
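A starter sketch for parts (a) and (b). The exercise states each group's behavior conditionals only for its typical time budget, so this sketch adds the assumption that both groups share a common table \(P(B \mid M, C)\) built from the stated values; the variable names are illustrative:

```python
# Shared behavior table P(B >= 500 words | M, C), assembled from the
# values in the exercise. Treating it as common to both groups is an
# added assumption (the cross cases are not specified).
p_long = {("high", "low"): 0.3, ("low", "low"): 0.1,
          ("high", "good"): 0.9, ("low", "good"): 0.2}
P_HIGH = 0.6  # P(M = high), same for both groups

def posterior_high(p_low_time):
    """Return (P(M=high | B>=500), P(B>=500)) for a group with the
    given probability of being short on time."""
    p_c = {"low": p_low_time, "good": 1 - p_low_time}
    joint_high = sum(p_c[c] * P_HIGH * p_long[("high", c)] for c in p_c)
    joint_low = sum(p_c[c] * (1 - P_HIGH) * p_long[("low", c)] for c in p_c)
    p_b = joint_high + joint_low
    return joint_high / p_b, p_b

post_jr, pb_jr = posterior_high(p_low_time=0.7)  # juniors
post_sr, pb_sr = posterior_high(p_low_time=0.1)  # seniors
query_ratio = pb_sr / pb_jr  # part (b): productivity-proportional sampling
```

Under this assumption, seniors produce long reviews far more often than juniors despite identical expertise distributions, so productivity-based sampling queries them substantially more; work through part (c) to see when the context correction is identifiable.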
6.8.3 Exercise 3: Compounding Bias Dynamics (**)
Model a feedback loop in the review assistant system with discrete time steps \(t = 0, 1, 2, \ldots\):
Stage 1 (Elicitation): Queries at time \(t+1\) proportional to current assistant quality: \[ q_{t+1}^{(g)} = q_0^{(g)} \cdot \left(\frac{a_t^{(g)}}{a_0^{(g)}}\right)^\alpha \tag{6.13}\] where \(q_t^{(g)}\) = queries to group \(g\) at time \(t\), \(a_t^{(g)}\) = assistant quality for group \(g\), \(\alpha > 0\) controls feedback strength.
Stages 2-3 (Learning + Aggregation): Quality improves with queries: \[ a_{t+1}^{(g)} = a_{\max} \cdot \left(1 - \exp\left(-\lambda q_{t+1}^{(g)}\right)\right) \tag{6.14}\] where \(\lambda > 0\) and \(a_{\max}\) is maximum achievable quality.
Initial conditions: \(a_0^{(\text{senior})} = 1.0\), \(a_0^{(\text{junior})} = 0.9\), \(q_0^{(\text{senior})} = q_0^{(\text{junior})} = 100\).
(a) Simulate this system for \(T = 10\) rounds with \(\alpha = 1.2\), \(\lambda = 0.01\), \(a_{\max} = 1.0\). Plot \(a_t^{(\text{senior})}\) and \(a_t^{(\text{junior})}\) over time. Does the gap grow or shrink?
(b) Derive conditions on \(\alpha\) and \(\lambda\) under which the quality gap \(a_t^{(\text{senior})} - a_t^{(\text{junior})}\) grows exponentially vs. converges to a stable disparity.
(c) Propose an intervention to prevent gap widening. For example, enforce minimum queries per group: \(q_{t+1}^{(g)} \geq q_{\min}\). Show mathematically how this caps disparity growth.
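A starting point for part (a): the two update equations iterated directly (plotting omitted). The implementation follows Equations 6.13-6.14 with the stated parameters; interpret the resulting trajectories yourself before attempting parts (b) and (c):

```python
import math

def simulate(alpha=1.2, lam=0.01, a_max=1.0, T=10):
    """Iterate Eq. 6.13 (queries follow quality) and Eq. 6.14 (quality
    follows queries) for both groups; returns the full trajectories."""
    a0 = {"senior": 1.0, "junior": 0.9}
    q0 = 100.0
    a = dict(a0)
    traj = {g: [a[g]] for g in a}
    for _ in range(T):
        for g in a:
            q_next = q0 * (a[g] / a0[g]) ** alpha          # Eq. 6.13
            a[g] = a_max * (1 - math.exp(-lam * q_next))   # Eq. 6.14
            traj[g].append(a[g])
    return traj

traj = simulate()
gap = [s - j for s, j in zip(traj["senior"], traj["junior"])]
```

Note that Eq. 6.13 depends on the ratio \(a_t^{(g)} / a_0^{(g)}\), so the first update treats both groups identically; the dynamics only diverge afterward. Vary \(\alpha\) and \(\lambda\) to build intuition for part (b).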
6.8.4 Exercise 4: Fairness in DPO (***)
Recall from Chapter 3 that DPO optimizes: \[ \mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right] \tag{6.15}\]
Suppose dataset \(\mathcal{D}\) has \(n_A\) samples from Group A and \(n_B\) samples from Group B, where \(n_A > n_B\) (unequal participation).
(a) Show that the gradient \(\nabla_\theta \mathcal{L}_{\text{DPO}}\) implicitly weights Group A’s preferences by \(\frac{n_A}{n_A + n_B}\) and Group B’s by \(\frac{n_B}{n_A + n_B}\). (Hint: Expand the expectation over the empirical distribution.)
(b) Derive a fair-DPO objective that gives equal weight to both groups regardless of sample size: \[ \mathcal{L}_{\text{fair-DPO}} = -\frac{1}{2}\left[\mathbb{E}_{A} [\cdots] + \mathbb{E}_{B} [\cdots]\right] \tag{6.16}\] Show how this requires reweighting samples by \(w_A = \frac{1}{n_A}\), \(w_B = \frac{1}{n_B}\).
(c) Suppose Group A has better inter-annotator agreement (lower label noise: \(\sigma_A^2 < \sigma_B^2\)). Analyze the bias-variance tradeoff: Does equal-weighting improve fairness at the cost of higher variance in learned policy? Derive the MSE for each group under standard DPO vs. fair-DPO.
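A numerical illustration of the reweighting in parts (a) and (b), using hypothetical per-sample margins \(m = \beta(\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)})\) rather than a trained model; the margin values are made up for the sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(margins):
    """-E[log sigma(m)] over per-sample margins m (Eq. 6.15 with the
    log-ratio difference precomputed)."""
    return -sum(math.log(sigmoid(m)) for m in margins) / len(margins)

# Hypothetical margins for two groups with unequal participation
margins_A = [2.0, 1.5, 1.8, 2.2, 1.9, 2.1]  # n_A = 6 (well-fit group)
margins_B = [0.5, 0.7]                       # n_B = 2 (poorly-fit group)

# Standard DPO: pooled mean -> Group A implicitly weighted n_A/(n_A + n_B)
standard = dpo_loss(margins_A + margins_B)

# Fair-DPO (Eq. 6.16): average per-group means, i.e. weight each group 1/2,
# equivalent to reweighting samples by w_A = 1/n_A and w_B = 1/n_B
fair = 0.5 * (dpo_loss(margins_A) + dpo_loss(margins_B))
```

Because Group B is both smaller and worse fit, the fair objective is larger than the pooled one: it forces the optimizer to pay attention to B's loss instead of diluting it, which is exactly the reweighting part (b) asks you to derive. Part (c)'s variance cost comes from estimating B's mean from only \(n_B\) samples.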
6.8.5 Exercise 5: Liberal vs. Illiberal Assistance (**)
Model two reviewer types with preferences over review tone (harsh vs. constructive):
Type 1 (Biased reviewers): Prefer harsh tone. Utility function \(u_1(\text{tone}) = -|\text{tone} - (-0.8)|\).
Type 2 (Constructive reviewers): Prefer constructive tone. Utility function \(u_2(\text{tone}) = -|\text{tone} - 0.6|\).
The AI assistant can follow two policies:
Liberal assistance: Maximize each reviewer’s stated utility. Suggests tone matching their preference.
Illiberal assistance: Nudge toward community standard \(\text{tone}_{\text{standard}} = 0.3\). Suggests \(\text{tone}_t = (1-\gamma) \cdot \text{preference}_i + \gamma \cdot \text{tone}_{\text{standard}}\) where \(\gamma \in [0, 1]\) is nudge strength.
(a) Under liberal assistance (\(\gamma = 0\)), reviewers follow their preferences. Papers receive reviews with tone distribution matching reviewer distribution. If 30% of reviewers are Type 1 (harsh), what is the average tone authors experience?
(b) Under illiberal assistance (\(\gamma = 0.5\)), all reviewers are nudged halfway toward the standard. Compute the new average tone and variance across papers.
(c) Suppose harsh reviews harm authors (especially from disadvantaged groups who receive more harsh reviews). Formalize the externality: \(e(\text{tone}) = \max(0, -\text{tone})\) (negative tone creates harm). Compute total harm under liberal vs. illiberal assistance. When is illiberal assistance justified by harm reduction?
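A small sketch to check your hand calculations for parts (a)-(c); the values in the comments follow directly from the setup (30% Type 1 at tone \(-0.8\), 70% Type 2 at tone \(0.6\), standard \(0.3\)):

```python
PREFS = [(-0.8, 0.3), (0.6, 0.7)]  # (preferred tone, population share)
STANDARD = 0.3

def suggested(tone, gamma):
    """Tone the assistant suggests under nudge strength gamma."""
    return (1 - gamma) * tone + gamma * STANDARD

def average_tone(gamma):
    return sum(share * suggested(t, gamma) for t, share in PREFS)

def total_harm(gamma):
    """Expected externality e(tone) = max(0, -tone) over the population."""
    return sum(share * max(0.0, -suggested(t, gamma)) for t, share in PREFS)

liberal_tone = average_tone(0.0)    # 0.3*(-0.8) + 0.7*0.6 = 0.18
illiberal_tone = average_tone(0.5)  # tones become -0.25 and 0.45 -> 0.24
harm_liberal = total_harm(0.0)      # only Type 1 harms: 0.3 * 0.8 = 0.24
harm_illiberal = total_harm(0.5)    # 0.3 * 0.25 = 0.075
```

The nudge cuts total harm by more than two thirds while shifting the average tone only slightly; part (c) asks you to weigh that harm reduction against the utility loss each reviewer type incurs from being moved off their preferred tone.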
6.8.6 Exercise 6: Strategy-Proofness and Nosy Preferences (**)
In the paper-reviewer matching from Section 6.2.2, reviewers report preferences \(\succ_i\) over papers \(P\). The matching algorithm maximizes total utility: \[ \max_{\text{assignment}} \sum_{i} u_i(\text{assignment}(i)) \tag{6.17}\]
Assume preferences are non-nosy (each reviewer cares only about their own assignment, not others’).
(a) Show this mechanism is not strategy-proof: Reviewers can improve their outcome by misreporting preferences. Construct an example where Reviewer 1 gets a better paper by lying about \(\succ_1\).
(b) Now suppose senior reviewers have nosy preferences: they want certain papers assigned to junior reviewers (e.g., training papers). Formalize as: \[ u_i(\text{assignment}) = u_i^{\text{self}}(\text{assignment}(i)) + \sum_{j \in \text{Juniors}} \beta_{ij} \cdot u_i^{\text{nosy}}(\text{assignment}(j)) \tag{6.18}\] where \(\beta_{ij}\) is how much senior \(i\) cares about junior \(j\)’s assignment.
Show that coordinated misreporting by seniors can systematically harm juniors by steering “training papers” to juniors even if juniors prefer research-relevant papers.
(c) Design a mechanism that prevents this manipulation while still allowing some nosy preferences. For example, could you use VCG payments (Section 5.10.0.1) to incentivize truth-telling, or constrain the matching to ignore nosy components? Prove or disprove: Is any truthful mechanism possible when preferences are nosy?
End of Exercises
These exercises develop deep understanding of fairness tradeoffs and impossibilities. Work through them to internalize the chapter’s key insights: fairness conflicts are mathematical necessities, not engineering failures, requiring conscious prioritization.
6.9 Discussion Questions
These open-ended questions promote critical thinking about fairness, values, and governance in preference learning systems. Use them for in-class discussions, essay prompts, or group projects.
Inversion Problem Severity: When is the inversion problem (Section 6.3.1.1) most severe? Consider domains where behavior strongly diverges from mental state (addiction, manipulation, fatigue, time pressure). How should preference learning systems handle cases where clicks/engagement ≠ satisfaction? Should they attempt to infer mental state, or is that itself a form of paternalism?
Sen’s Nosy Preferences: Sen’s framework (Section 6.3.4) distinguishes nosy from non-nosy preferences. But many “nosy” preferences seem justified—parents’ preferences over children’s choices, experts’ preferences over safety regulations, community standards for harmful content. What formal criteria could distinguish justified from unjustified nosy preferences? Where do you draw the line?
Fairness Prioritization in Your Domain: Individual and group fairness are often incompatible (Section 6.4.1). In a preference learning system for your research domain or project, which would you prioritize? How would you justify this choice to stakeholders who care about the other criterion? What constraints or compensating measures could partially satisfy the non-prioritized criterion?
Institutional Structures for Value Choices: This chapter argues that technical choices are never value-neutral—engineers embed values whether they acknowledge it or not. But engineers often lack training in ethics, political philosophy, and affected communities’ lived experiences. What institutional structures should guide these value choices? Professional standards (like medical ethics)? Regulatory oversight (like FDA for drugs)? Participatory design (community advisory boards)? Some combination? Defend your answer.
Fairness-Constrained Active Learning: Chapter 4’s active learning (Section 2.4.2.2) maximizes information gain by querying where the model is most uncertain. But this may undersample minority groups (if their preferences are simpler or they’re initially underrepresented). How would you modify Fisher information or D-optimal design to satisfy fairness constraints? Specifically, design an objective that trades off information gain against ensuring minimum representation per group. What’s the cost in sample efficiency?
Equal Weighting in DPO: DPO (Equation 1.2) from Chapter 3 implicitly weights preferences by data volume—high-volume users dominate the reference policy \(\pi_{\text{ref}}\). If you wanted equal weighting across demographic groups (not individuals), how would you modify the DPO objective? Would this require collecting demographic information at training time? What privacy/fairness tradeoffs arise?
Stakes and Fairness: The chapter argues that fairness considerations differ for high-stakes (loan approvals, medical decisions, hiring) vs. low-stakes (music recommendations, entertainment) applications. Do you agree with this distinction? Or should all systems be held to the same fairness standards regardless of stakes? If you agree with the distinction, where exactly is the boundary? Classify these cases: college admissions, dating apps, social media content ranking, job recommendations, bail decisions, credit cards.
Detecting Compounding Bias: Feedback loops (Section 6.3.5) can amplify initial unfairness exponentially. What monitoring systems would you build to detect compounding bias before it causes significant harm? How would you distinguish “acceptable disparity due to preference differences” from “unacceptable disparity due to system bias”? What statistical tests would you use, and what significance thresholds?
Stratified Reporting with Small Samples: Principle 5 (Section 6.5.5) advocates stratified reporting—measuring performance per subgroup, not just overall. But for small subgroups (e.g., 10 users), per-group metrics have high variance and wide confidence intervals. How do you balance fairness transparency (reporting disparities) with statistical validity (avoiding false alarms)? Would you report metrics for groups below size \(n\)? Set a minimum group size? Use Bayesian shrinkage?
Aggregation Impossibilities Compounded: Chapter 6 proved impossibility results for aggregation—Arrow’s theorem, Gibbard-Satterthwaite, Sen’s Paretian Liberal. This chapter adds fairness constraints (individual vs. group, process vs. outcome). Does adding fairness make the impossibilities worse (even fewer mechanisms satisfy expanded criteria), or can fairness constraints help break ties between incompatible criteria from Chapter 6? Analyze specific cases.
6.10 Bibliographic Notes
This chapter draws on diverse literatures in fairness, social choice theory, human-computer interaction, and preference learning. Below we trace the intellectual history and point to key references.
Fairness in Machine Learning: The modern fairness in ML literature began with Hardt, Price, and Srebro (2016) on equalized odds and Dwork et al. (2012) on individual fairness. Barocas, Hardt, and Narayanan (2019) provide a comprehensive textbook covering group fairness, individual fairness, and impossibility results. Friedler, Scheidegger, and Venkatasubramanian (2021) formalize when different fairness criteria are compatible vs. impossible to satisfy simultaneously—the mathematical foundations for Section 6.4.1.1. For fairness in preference learning and RLHF specifically, see Kirk et al. (2023) on diverse perspectives in LLM alignment and Ganguli et al. (2023) on capacity for fairness vs. harmlessness tradeoffs.
Sen’s Social Choice Theory: Amartya Sen’s work provides the normative foundations for this chapter. Sen (1970) introduced the “Impossibility of a Paretian Liberal,” showing that individual liberty (respecting non-nosy preferences) and Pareto efficiency are incompatible—the liberal/illiberal tension in Section 6.3.4. Sen (1977) introduced the distinction between revealed preferences and mental states, anticipating the inversion problem (Section 6.3.1.1). Sen (2017) (extended edition) synthesizes Sen’s social choice work. The connection to preference learning was formalized by Sah et al. (2024).
Inversion Problem and Revealed Preferences: The gap between behavior and mental state has been studied in behavioral economics and HCI. Ariely (2008) documents systematic deviations between stated and revealed preferences under poor conditions (fatigue, cognitive load, framing). Kleinberg, Lakkaraju, et al. (2018) argue that algorithms should infer mental state not just optimize for behavior, formalizing the inversion problem. Mullainathan et al. (2022) show how machine learning on biased behavior data amplifies existing inequalities.
Contextual Integrity and Privacy: Nissenbaum (2009) introduced contextual integrity—the idea that privacy violations occur when information flows inappropriately for the context, not when information is collected per se. This connects to the inversion problem: behavior in one context (fatigue, time pressure) may not reflect preferences in another context (well-rested, deliberative). Tschantz (2020) extends contextual integrity to algorithmic fairness.
Compounding Unfairness and Feedback Loops: The dynamics of how small biases amplify through feedback were formalized by Liu et al. (2018) for delayed impact in classification and Hashimoto et al. (2018) for distributional robustness. Ensign et al. (2018) analyzed feedback loops in predictive policing. For preference learning specifically, Chen et al. (2024) show how RLHF feedback loops can amplify annotator biases over successive fine-tuning rounds. The general phenomenon is sometimes called “algorithmic monoculture” (Kleinberg, Ludwig, et al. (2018)).
Participatory Design and Involving Stakeholders: Schuler and Namioka (1993) is the classic introduction to participatory design in HCI. Modern applications to AI fairness include Sloane et al. (2020) on power dynamics in participatory ML and Rakova et al. (2021) on stakeholder engagement for algorithmic accountability. Green and Chen (2019) provide principles for participatory algorithm design focused on affected communities.
Strategy-Proofness and Mechanism Design: The connection between preference elicitation and mechanism design comes from Mas-Colell et al. (1995) and Nisan et al. (2007). Sen’s work on nosy preferences’ implications for strategy-proofness is formalized in Sen (1983). Pattanaik (1978) extends this to social welfare functions. The application to peer review matching is studied by Kobren, Saha, and McCallum (2019) and Stelmakh, Shah, and Singh (2021).
Discrete Choice and IIA: The independence of irrelevant alternatives (IIA) assumption and its violations were studied extensively in economics. McFadden (1974) introduced the conditional logit model based on IIA and received the Nobel Prize for this work. McFadden (2000) shows when IIA fails (the “red bus / blue bus problem”). Train (2009) provides a comprehensive treatment of discrete choice models relaxing IIA (nested logit, mixed logit). The connection to preference learning was made by Maystre and Grossglauser (2015) for Plackett-Luce models.
Red Bus/Blue Bus Problem: Debreu (1960) first formalized the red bus/blue bus problem: adding a duplicate alternative (a blue bus identical to the red bus) shouldn’t change the ratio of probabilities between red bus and train, but IIA implies it does. This motivated the development of nested logit models (McFadden (1978)), which allow correlation among alternatives within a nest (the buses) while relaxing IIA across nests.
DPO and Volume Weighting: Direct Preference Optimization (Rafailov et al. (2023)) simplified RLHF by eliminating the reward model. Rafailov et al. (2024) analyze scaling properties and show the reference policy implicitly weights preferences by volume. Kirk et al. (2024) study how DPO’s weighting affects fairness across demographic groups. Zhao et al. (2024) propose fairness-constrained variants of DPO.
AI Review Assistants and Peer Review Fairness: The paper-reviewer matching problem and its fairness implications are studied by Kobren, Saha, and McCallum (2019) (maximizing paper-reviewer compatibility) and Stelmakh, Shah, and Singh (2021) (fairness constraints in assignment). Shah (2022) analyzes bias in peer review systems. AI-assisted peer review is explored by Liang et al. (2023), though fairness issues remain underexplored.
Historical Context: The questions in this chapter are not new. Arrow (1950) showed aggregation impossibilities in 1950. Thurstone (1927) introduced random utility models in 1927. Luce (1959) axiomatized IIA in 1959. What’s new is the application to machine learning from human preferences at scale—where small biases compound through automated feedback loops affecting millions of users. Classical fairness concerns now manifest in algorithmic systems with unprecedented speed and scope.
6.11 Further Reading
For readers who want to go deeper into the topics introduced in this chapter, we recommend the following:
- Barocas, Hardt, and Narayanan (2019), Fairness and Machine Learning: Limitations and Opportunities — The comprehensive textbook on fairness in ML, covering group fairness definitions, individual fairness, impossibility results, and the sociotechnical context of algorithmic decision-making.
- Sen (2017), Collective Choice and Social Welfare (expanded edition) — Sen’s definitive treatment of social choice theory, the liberal paradox, and the distinction between revealed preferences and welfare, providing the normative foundations for this chapter.
- Dwork et al. (2012), “Fairness Through Awareness” — The foundational paper on individual fairness, formalizing the Lipschitz condition that similar individuals should receive similar outcomes and introducing the metric learning challenge for fairness.
- Fürnkranz and Hüllermeier (2010), Preference Learning — A textbook covering preference learning from a machine learning perspective, including label ranking, instance ranking, and object ranking, complementing this book’s focus on human preference data.
- Casper et al. (2023), “Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback” — A systematic survey of RLHF failure modes, including reward hacking, distributional shift, and the challenges of aligning AI systems with diverse human values.
- Kleinberg, Lakkaraju, et al. (2018), “Human Decisions and Machine Predictions” — Formalizes the gap between behavior and mental state (the inversion problem) in the context of judicial bail decisions, showing how observed choices can systematically diverge from underlying preferences.
- Hashimoto et al. (2018), “Fairness Without Demographics in Repeated Loss Minimization” — Introduces distributionally robust optimization for fairness, showing how to protect minority groups without requiring explicit demographic labels, relevant to the compounding unfairness analysis in this chapter.
This chapter synthesizes insights across these literatures to provide integrated guidance for building fairer preference learning systems.