Designing a good utility function (or reward function) by hand for a complex AI or robotics task is notoriously difficult and error-prone. Instead of manually specifying what is “good” behavior, we can learn a utility function from human preferences. In this chapter, we explore how an agent can infer a human’s underlying utility function (their preferences or reward criteria) from various forms of feedback. We discuss both supervised learning and Bayesian approaches to utility learning, and examine techniques motivated by robotics—learning from demonstrations, physical corrections, trajectory evaluations, and pairwise comparisons. Throughout, we include mathematical formulations and code examples to illustrate the learning process.
2.1 The Supervised Learning Problem
Supervised learning approaches treat human feedback as labeled data to directly fit a utility function. The core idea is to assume there exists a true utility function \(u^*(x)\) (over states, outcomes, or trajectories \(x\)) that explains a human’s choices. We then choose a parameterized model \(u_\theta(x)\) and adjust \(\theta\) so that \(u_\theta\) agrees with the human-provided preferences.
A common feedback format is pairwise comparisons: the human is shown two options (outcomes or trajectories) \(A\) and \(B\) and indicates which is preferred. We can model the probability that the human prefers \(A\) over \(B\) using a logistic or Bradley–Terry model: \[
P(A \succ B) = \sigma\big(u_\theta(A) - u_\theta(B)\big),
\]
where \(\sigma(z)=\frac{1}{1+e^{-z}}\) is the sigmoid function. This implies the human is more likely to prefer \(A\) if \(u_\theta(A)\) is much larger than \(u_\theta(B)\).
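As a quick numerical sketch (using a hypothetical linear utility \(u_\theta(x) = \theta^\top x\) over two-dimensional features; all numbers here are illustrative), the preference probability under this model can be computed as follows:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear utility model u_theta(x) = theta . x over 2-D feature vectors
theta = np.array([1.0, -0.5])
A = np.array([2.0, 1.0])   # features of option A
B = np.array([1.0, 3.0])   # features of option B

u_A = theta @ A            # model's utility for A
u_B = theta @ B            # model's utility for B

# Bradley-Terry / logistic model: P(A preferred over B) = sigma(u(A) - u(B))
p_A_over_B = sigmoid(u_A - u_B)
print(f"P(A preferred over B) = {p_A_over_B:.3f}")
```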
At the heart of learning from human preferences lies a latent utility function — a function that assigns numerical value to states, trajectories, or outcomes according to a human’s (possibly unspoken) preferences. The goal of a learning algorithm is to infer this function from observed feedback, which may come in the form of demonstrations, ratings, rankings, or pairwise comparisons. But how exactly do we represent and update our belief about this hidden utility function?
Two major paradigms in statistical learning provide different answers: point estimation and posterior estimation.
In point estimation, we seek a single “best guess” for the utility function — typically a function \(u_\theta(x)\) from a parameterized family (e.g. linear models, neural nets), with parameters \(\theta \in \mathbb{R}^d\). Given data \(\mathcal{D}\) from human feedback (e.g. preferences), we choose the parameter \(\hat{\theta}\) that best explains the observed behavior. Formally: \[
\hat{\theta} = \arg\max_{\theta} \; p(\mathcal{D} \mid \theta) = \arg\max_{\theta} \; \log p(\mathcal{D} \mid \theta).
\]
This is maximum likelihood estimation (MLE): we pick the parameters that make the observed data most probable under our model. Once \(\hat{\theta}\) is selected, we treat \(u_{\hat{\theta}}\) as the agent’s utility function, and optimize or sample behavior accordingly. This approach is straightforward and computationally efficient. It is the foundation of most supervised learning methods (like logistic regression or deep learning), and it provides a natural interpretation: we’re directly finding the utility function that agrees with the human feedback. However, it discards uncertainty: it assumes the data is sufficient to pin down a single utility function, which may not be true in practice.
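As a minimal sketch of this recipe (assuming a linear utility model and the Bradley–Terry likelihood above; the data and names here are synthetic and illustrative), the point estimate can be obtained with an off-the-shelf optimizer. A small L2 penalty is included because noiseless preference labels make the pure MLE unbounded in scale:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic comparisons generated from a hidden linear utility u*(x) = theta_true . x
theta_true = np.array([1.0, -2.0])
A_feats = rng.normal(size=(50, 2))            # features of option A in each pair
B_feats = rng.normal(size=(50, 2))            # features of option B in each pair
y = (A_feats @ theta_true > B_feats @ theta_true).astype(float)

def objective(theta):
    # Negative Bradley-Terry log-likelihood of the observed choices,
    # plus a small L2 penalty to keep the problem well-posed
    diff = (A_feats - B_feats) @ theta
    log_p = -np.logaddexp(0.0, -diff)         # log sigma(diff)
    log_1mp = -np.logaddexp(0.0, diff)        # log(1 - sigma(diff))
    return -(y * log_p + (1 - y) * log_1mp).sum() + 0.01 * theta @ theta

theta_hat = minimize(objective, x0=np.zeros(2)).x
print("Estimated utility weights:", theta_hat)
```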
In contrast, posterior estimation takes a fully Bayesian view. Instead of committing to one estimate, we maintain a distribution over utility functions. That is, we place a prior \(p(\theta)\) over parameters (or over functions \(u\) more generally), and update this to a posterior after observing data \(\mathcal{D}\): \[
p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} \;\propto\; p(\mathcal{D} \mid \theta)\, p(\theta).
\]
This posterior expresses our uncertainty over which utility functions are compatible with the human feedback. From this distribution, we can make predictions (e.g., using the posterior mean utility), quantify confidence, or even actively select new queries to reduce uncertainty (active learning). For instance, if we model utilities with a Gaussian Process (GP), then the posterior over \(u(x)\) is also a GP after observing comparisons or evaluations. If we use a neural network for \(u_\theta(x)\), we can approximate the posterior with ensembles, variational inference, or MCMC. Posterior estimation is especially valuable when human feedback is sparse, noisy, or ambiguous — as is often the case in real-world preference learning. It allows the agent to reason about what it doesn’t know and to take cautious or exploratory actions accordingly.
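As a minimal illustration of posterior estimation (assuming a one-parameter utility model \(u_w(x) = w \cdot x\) and the Bradley–Terry likelihood; the prior, data, and numbers are all illustrative), the posterior over the utility parameter can be computed exactly on a grid:

```python
import numpy as np

rng = np.random.default_rng(1)

# One-parameter utility model u_w(x) = w * x; hidden "true" parameter generates the data
w_true = 2.0
A = rng.uniform(-1, 1, size=15)
B = rng.uniform(-1, 1, size=15)
p_prefer_A = 1 / (1 + np.exp(-(w_true * A - w_true * B)))
y = (rng.random(15) < p_prefer_A).astype(float)   # noisy preference labels

# Grid over parameter values with a Gaussian prior
w_grid = np.linspace(-5, 5, 501)
log_prior = -0.5 * (w_grid / 2.0) ** 2

# Bradley-Terry log-likelihood of all comparisons at each grid point
diff = np.outer(w_grid, A - B)                    # shape (grid points, comparisons)
log_lik = (y * -np.logaddexp(0, -diff) + (1 - y) * -np.logaddexp(0, diff)).sum(axis=1)

# Normalize to a posterior over the grid
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()

post_mean = (w_grid * post).sum()
post_std = np.sqrt(((w_grid - post_mean) ** 2 * post).sum())
print(f"Posterior mean of w: {post_mean:.2f}, posterior std: {post_std:.2f}")
```

With few or noisy comparisons the posterior stays wide, which is exactly the uncertainty that a point estimate would discard.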
The next two sections instantiate these two perspectives. In Section 2.2, we explore point estimation via supervised learning — treating preference data as labeled examples and fitting a utility model. In the section that follows, we shift to posterior estimation with Bayesian methods like Gaussian processes and Bayesian neural networks, which model both our current estimate and the uncertainty around it.
2.2 Point Estimation via Maximum Likelihood
Given a dataset of comparisons \(\mathcal{D}=\{(A_i, B_i, y_i)\}\) (with \(y_i=1\) if \(A_i\) was preferred and \(0\) if \(B_i\) was preferred), we can fit \(\theta\) by maximizing the likelihood of the human’s choices. Equivalently, we minimize a binary cross-entropy loss: \[
\mathcal{L}(\theta) = -\sum_{i} \Big[ y_i \log \sigma\big(u_\theta(A_i) - u_\theta(B_i)\big) + (1 - y_i) \log \big(1 - \sigma(u_\theta(A_i) - u_\theta(B_i))\big) \Big],
\]
often with a regularization term to prevent overfitting. This is a straightforward supervised learning problem – essentially logistic regression – on pairwise difference features.
Example: Suppose a human’s utility for an outcome can be described by a quadratic function (unknown to the learning algorithm). We collect some pairwise preferences and then train a utility model \(u_\theta(x)\) to predict those preferences. The code below simulates this scenario:
```python
import numpy as np

# True utility function (unknown to learner), e.g. u*(x) = -(x-5)^2 + constant
def true_utility(x):
    return -(x - 5)**2   # (peak at x=5)

# Generate synthetic pairwise preference data
np.random.seed(42)
n_pairs = 20
X1 = np.random.uniform(0, 10, size=n_pairs)   # 20 random x-values
X2 = np.random.uniform(0, 10, size=n_pairs)   # 20 more random x-values

# Determine preferences according to true utility
prefs = (true_utility(X1) > true_utility(X2)).astype(int)   # 1 if X1 preferred, else 0

# Parametric model for utility: u_theta(x) = w0 + w1*x + w2*x^2 (quadratic form)
# Initialize weights
w = np.zeros(3)
lr = 0.01     # learning rate
reg = 1e-3    # L2 regularization strength

for epoch in range(1000):
    # Compute predictions via logistic model
    util_diff = (w[0] + w[1]*X1 + w[2]*X1**2) - (w[0] + w[1]*X2 + w[2]*X2**2)
    pred = 1 / (1 + np.exp(-util_diff))   # sigma(w . (phi(X1) - phi(X2)))

    # Gradient of cross-entropy loss
    error = pred - prefs                  # (sigma - y)

    # Features for X1 and X2
    phi1 = np.vstack([np.ones(n_pairs), X1, X1**2]).T
    phi2 = np.vstack([np.ones(n_pairs), X2, X2**2]).T
    phi_diff = phi1 - phi2

    # Gradient: derivative of loss w.r.t. w = (sigma - y) * phi_diff (averaged) + regularization
    grad = phi_diff.T.dot(error) / n_pairs + reg * w

    # Update weights
    w -= lr * grad

print("Learned weights:", w)
```
Learned weights: [ 0. 2.74417195 -0.22129969]
After training, we can compare the learned utility function \(u_\theta(x)\) to the true utility \(u^*(x)\). Below we plot the two functions:
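A minimal plotting sketch (assuming the `w` and `true_utility` defined in the training code above) is:

```python
import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(0, 10, 200)
u_learned = w[0] + w[1] * xs + w[2] * xs**2   # learned quadratic utility
u_true = true_utility(xs)                     # true utility (for comparison only)

plt.plot(xs, u_true, 'k--', label=r'True utility $u^*(x)$')
plt.plot(xs, u_learned, 'b-', label=r'Learned utility $u_\theta(x)$')
plt.xlabel('x')
plt.ylabel('utility')
plt.legend()
plt.show()
```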
The learned curve closely matches the true utility up to an arbitrary scaling factor (a utility inferred from comparisons is only identified up to an affine transformation). The algorithm successfully recovered a utility function that orders states almost the same way as the true utility \(u^*(x)\). In general, learning from comparisons can recover the relative utility of options (which item is preferred), although the absolute scale of \(u_\theta\) is unidentifiable without further assumptions. Supervised learning on preferences has been widely used for ranking problems and preference-based reward learning.
In standard preference learning, we often learn a utility function and then use it to define a policy. However, in some settings—especially those involving large models like language models—it is more effective to directly learn a policy that aligns with human preferences, bypassing the intermediate reward model. One such method is Direct Preference Optimization (DPO), which offers a simple, stable way to align a policy to preference data through supervised learning.
To understand DPO, consider the following setting:
We are given a reference policy \(\pi_{\text{ref}}\), such as a pre-trained language model.
We want to learn a new policy \(\pi_\theta\) that improves upon \(\pi_{\text{ref}}\) by better reflecting human preferences.
Our data consists of pairwise comparisons: for each prompt \(x\), a human expresses a preference between two outputs \(y_+ \succ y_-\), where \(y_+\) is the preferred response.
Rather than learning an explicit reward function \(R(x, y)\) and using it to optimize the policy via reinforcement learning, DPO treats this as a classification problem: we want to encourage the policy to assign higher likelihood to the preferred response.
To formalize this, we define a preference score: \[
s_\theta(x, y_+, y_-) = \log \pi_\theta(y_+ \mid x) - \log \pi_\theta(y_- \mid x)
\] This is the difference in log-likelihood between the preferred and dispreferred outputs. We can then define the DPO loss as a logistic regression objective: \[
\mathcal{L}_{\text{DPO}}(\theta) = -\log \sigma\left(s_\theta(x, y_+, y_-)\right)
\] where \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is the sigmoid function.
This loss encourages \(\pi_\theta\) to assign greater probability mass to \(y_+\) than \(y_-\), pushing the policy toward outputs that align with human preferences. Because this is a differentiable, supervised loss, it can be optimized with standard gradient-based techniques, without needing to sample from the environment or estimate advantages, as in traditional RL.
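In code, the per-pair objective in the simplified form above reduces to a few lines. The sketch below assumes the policy exposes log-probabilities for the two responses; the function name and numbers are illustrative:

```python
import numpy as np

def dpo_loss(logp_pos, logp_neg):
    """Per-pair DPO loss in the simplified form above.

    logp_pos: log pi_theta(y_+ | x) for the preferred response
    logp_neg: log pi_theta(y_- | x) for the dispreferred response
    """
    s = logp_pos - logp_neg           # preference score s_theta(x, y_+, y_-)
    return np.logaddexp(0.0, -s)      # -log sigma(s), computed stably

# Example: the policy currently assigns slightly higher log-prob to y_-
print(dpo_loss(logp_pos=-2.3, logp_neg=-1.9))   # loss > log(2), pushing mass toward y_+
```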
Although DPO does not explicitly define or optimize a reward function, we can interpret it as doing so implicitly. Suppose we define a reward function: \[
R_\theta(y \mid x) = \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x)
\] This reward encourages \(\pi_\theta\) to move away from \(\pi_{\text{ref}}\) in directions that increase the probability of preferred outputs. Under this formulation, the DPO objective can be interpreted as optimizing this reward difference directly from preferences.
To understand why this implicit reward leads to a stable and interpretable policy, we can connect DPO to the principle of maximum entropy. This principle says that, among all distributions that satisfy certain constraints (e.g., achieving a particular expected reward), we should prefer the one with maximum entropy—that is, the most uncertain or uncommitted distribution consistent with our knowledge.
Formally, consider the space \(\mathcal{P}\) of distributions over responses \(y\), and a reward function \(R(y)\). The maximum entropy distribution that satisfies a reward constraint is the solution to: \[
\max_{p \in \mathcal{P}} \; -\sum_{y} p(y) \log p(y) \quad \text{subject to} \quad \mathbb{E}_{y \sim p}[R(y)] = c, \qquad \sum_{y} p(y) = 1,
\] for some target expected reward \(c\).
The solution to this constrained optimization problem is a Boltzmann distribution: \[
p^*(y) \propto \exp\left(\frac{R(y)}{\tau}\right)
\] for some temperature \(\tau > 0\), where \(\tau\) controls how deterministic the distribution is. As \(\tau \to 0\), the distribution concentrates on the highest-reward outputs; as \(\tau \to \infty\), it becomes uniform.
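A quick numerical sketch (with an arbitrary toy reward vector) illustrates the effect of the temperature:

```python
import numpy as np

def boltzmann(R, tau):
    # p(y) proportional to exp(R(y) / tau), computed stably
    z = R / tau
    p = np.exp(z - z.max())
    return p / p.sum()

R = np.array([1.0, 2.0, 4.0])
for tau in [0.1, 1.0, 100.0]:
    print(f"tau = {tau:6.1f} ->", np.round(boltzmann(R, tau), 3))
# Small tau concentrates on the highest-reward option; large tau approaches uniform.
```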
Now suppose our reference policy \(\pi_{\text{ref}}(y \mid x)\) already represents a reasonable starting point. Then the optimal policy \(\pi_\theta\) can be viewed as a reward-weighted version of this reference policy: \[
\pi_\theta(y \mid x) \;\propto\; \pi_{\text{ref}}(y \mid x) \, \exp\!\left(\frac{R(y \mid x)}{\tau}\right),
\] which is exactly the solution to maximizing expected reward subject to a KL-divergence penalty \(\tau\,\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)\).
This form ensures that \(\pi_\theta\) remains close to \(\pi_{\text{ref}}\) (via the KL term), while still assigning more mass to high-reward (preferred) outputs. Importantly, this form arises naturally from maximum entropy inference when the reference distribution is used as a baseline.
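The reweighting itself is a one-line computation. The sketch below uses a toy discrete response set with made-up probabilities and rewards:

```python
import numpy as np

# Toy discrete response set with a reference policy and an (implicit) reward estimate
pi_ref = np.array([0.5, 0.3, 0.2])   # reference policy over three candidate responses
R = np.array([0.0, 1.0, 2.0])        # reward assigned to each response
tau = 1.0                            # temperature / strength of the KL anchor

# Reward-weighted reference policy: pi(y) proportional to pi_ref(y) * exp(R(y) / tau)
unnormalized = pi_ref * np.exp(R / tau)
pi_new = unnormalized / unnormalized.sum()
print("Reweighted policy:", np.round(pi_new, 3))
# High-reward responses gain probability mass, but the reference still anchors the result.
```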
DPO thus combines reward maximization with entropy regularization, encouraging the learned policy to prefer outcomes favored by human feedback while preserving diversity and stability. It sidesteps the challenges of explicitly learning a reward model or tuning complex RL pipelines, offering a direct and scalable method for preference-based alignment.
In practice, DPO has been shown to achieve similar or better alignment performance compared to reinforcement learning from human feedback (RLHF) while being more stable and easier to implement. It avoids the need to sample from the model during training or tune delicate hyperparameters of RL. Conceptually, DPO demonstrates that if we structure our utility model cleverly (here, as the log-ratio of policy and reference), we can extract an optimal policy in closed-form and learn utilities via supervised learning.
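The toy example below puts these pieces together: it fixes a single prompt \(x\), defines a discrete set of candidate responses \(y\), samples pairwise preferences from a hidden reward function, and runs DPO-style gradient updates on a policy parameterized as the reference policy reweighted by learned logits. The animation shows the learned policy shifting probability mass toward the preferred region while staying anchored to the reference.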
```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from scipy.special import logsumexp

# --- Setup: 1D input x, discrete actions y ---
np.random.seed(0)
x = 5.0                          # fixed input
Y = np.linspace(-4, 4, 100)      # discrete action space
n_actions = len(Y)

# --- True reward function (unknown to learner) ---
def true_reward(x, y):
    return -((y - np.sin(x))**2)   # reward peak near y = sin(x)

R_true = true_reward(x, Y)

# --- Reference policy: fixed Gaussian-like distribution ---
def ref_policy(y):
    logits = -0.5 * (y / 2.0)**2   # log probs of N(0, 2^2)
    return np.exp(logits - logsumexp(logits))

pi_ref = ref_policy(Y)

# --- Preference data from reward samples ---
def sample_preference(x, Y, R_fn, temperature=1.0):
    logits = R_fn(x, Y) / temperature
    probs = np.exp(logits - logsumexp(logits))
    sampled = np.random.choice(len(Y), size=2, replace=False, p=probs)
    y_plus, y_minus = sampled if R_fn(x, Y[sampled[0]]) > R_fn(x, Y[sampled[1]]) else sampled[::-1]
    return y_plus, y_minus

n_pairs = 100
pair_indices = [sample_preference(x, Y, true_reward) for _ in range(n_pairs)]

# --- DPO loss and gradient ---
def dpo_loss_and_grad(theta, y_pos_idx, y_neg_idx, pi_ref):
    # Policy is the reference policy reweighted by learned logits theta
    logits = theta + np.log(pi_ref + 1e-8)
    logp_pos = logits[y_pos_idx] - logsumexp(logits)
    logp_neg = logits[y_neg_idx] - logsumexp(logits)
    s = logp_pos - logp_neg
    sigma = 1 / (1 + np.exp(-s))
    loss = -np.log(sigma + 1e-8)
    softmax = np.exp(logits - logsumexp(logits))
    # Gradient w.r.t. theta: raise the logit of y_+ and lower that of y_-,
    # scaled by how wrong the current preference probability is,
    # plus a softmax-weighted term that damps growth of the learned offsets
    grad = -(1 - sigma) * (np.eye(n_actions)[y_pos_idx] - np.eye(n_actions)[y_neg_idx]) + sigma * softmax
    return loss, grad

# --- Training loop with history tracking ---
theta = np.zeros(n_actions)
lr = 0.05
n_steps = 100
history = []

for step in range(n_steps):
    total_grad = np.zeros_like(theta)
    for y_pos_idx, y_neg_idx in pair_indices:
        _, grad = dpo_loss_and_grad(theta, y_pos_idx, y_neg_idx, pi_ref)
        total_grad += grad
    theta -= lr * total_grad / n_pairs
    logits_snapshot = theta + np.log(pi_ref + 1e-8)
    pi_snapshot = np.exp(logits_snapshot - logsumexp(logits_snapshot))
    history.append(pi_snapshot)

# --- Animation setup ---
fig, ax = plt.subplots(figsize=(7, 4))
line_true, = ax.plot(Y, R_true, 'k--', label='True Reward')
line_ref, = ax.plot(Y, pi_ref, 'g-', label='Reference Policy')
line_learned, = ax.plot([], [], 'b-', label='Learned Policy')

# Add preference pair indicators
pref_lines = [ax.axvline(Y[idx], color='blue', linestyle=':', alpha=0.3) for idx, _ in pair_indices]
pref_lines += [ax.axvline(Y[idx], color='red', linestyle=':', alpha=0.3) for _, idx in pair_indices]

ax.set_ylim(-0.025, 0.025)
ax.set_title("DPO Policy Evolution")
ax.set_ylabel("Probability")
ax.set_xlabel("y")
ax.legend()

def update(frame):
    pi_snapshot = history[frame]
    line_learned.set_data(Y, pi_snapshot)
    ax.set_title(f"DPO Policy Evolution (Step {frame + 1})")
    return [line_learned]

from IPython.display import HTML
from matplotlib import rc
rc('animation', html='jshtml')

ani = animation.FuncAnimation(fig, update, frames=n_steps, interval=100, blit=True)
HTML(ani.to_jshtml())
```