Diverging Preferences: When do Annotators Disagree and do Models Know?

Conference: ICML 2025
arXiv: 2410.14632
Code: None
Area: LLM Alignment/RLHF
Keywords: Preference Divergence, Reward Modeling, Pluralistic Alignment, LLM-as-Judge, Distributional Reward Models

TL;DR

This paper systematically analyzes the reasons behind annotator disagreement in RLHF preference datasets by taxonomizing them into 10 categories. It reveals that over 75% of disagreements stem from personal preference rather than annotation noise. To address this, the paper proposes a Mean-Var Reward Model to effectively differentiate between diverging and high-consensus preferences, and uncovers systematic biases in LLM-as-Judge evaluation methodologies when facing disagreement.

Background & Motivation

Background: RLHF has become the standard approach for aligning LLMs, yet annotator disagreement remains widespread, affecting 39% of samples in MultiPref and 24% in HelpSteer2.

Limitations of Prior Work: Existing reward modeling workflows (e.g., Bradley-Terry) treat annotator disagreement as simple noise, aggregating labels via majority voting. This practice ignores that disagreements often reflect legitimate differences in user perspectives.

Key Challenge: Standard reward models predict similar reward margins for both diverging and high-consensus preferences, failing to distinguish between them. Consequently, LLMs trained via RLHF only learn to cater to a single user perspective, violating the objective of pluralistic alignment.

Goal: (1) Understand the root causes of annotator disagreement; (2) design reward models capable of identifying diverging preferences; and (3) discover and mitigate biases in LLM-as-Judge evaluation.

Key Insight: Begin with data analysis of the individual, unaggregated annotations in MultiPref and HelpSteer2 to establish a taxonomy of disagreement causes, then use these findings to drive improvements in reward modeling and evaluation.

Core Idea: Model rewards as distributions (rather than single scalars), thereby both predicting preference direction and capturing the degree of disagreement among annotators.

Method

Overall Architecture

The method comprises three components: (1) Disagreement cause analysis and taxonomy construction; (2) distributional reward model design and training; and (3) LLM-as-Judge bias analysis and mitigation.

Key Designs

Disagreement Taxonomy:
- Function: Categorizes preference divergences into 4 major categories and 10 subcategories.
- Mechanism: Summarizes the sources of disagreement by manually analyzing 200 diverging samples.
- Four major categories:
  - Task Underspecification (20-22%): Prompt is underspecified, leading to multiple reasonable interpretations.
  - Response Style (Verbosity 38-44%, Format 20-32%, Complexity 10%, Aesthetic Taste 14-22%): Differences in individual preferences.
  - Refusal (Comply vs. Refuse 5%, Refuse vs. Refuse 20%): Divergence in safety judgments.
  - Errors & Hallucinations (14-24%): Disagreement on factual errors.
- Design Motivation: Over 75% of disagreements arise from legitimate stylistic differences or task underspecifications rather than annotator errors.
- Annotation Agreement: Cohen's κ = 0.58-0.59, Krippendorff's α = 0.62-0.68.
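
The agreement figures above are chance-corrected statistics. As a minimal sketch of how Cohen's κ is computed for two annotators, the function below assumes each annotator gives one binary tag per sample (for example, "is Verbosity a cause of this disagreement?"). The function name and the toy labels are illustrative and not taken from the paper, which may compute agreement per category differently.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy example (hypothetical tags): 1 = category applies, 0 = it does not.
annotator_1 = [1, 1, 0, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
```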

Mean-Var Distributional Reward Model (KL):
- Function: Models the reward of each response as a normal distribution \(r_A \sim \mathcal{N}(\mu_A, \sigma_A^2)\), simultaneously predicting both mean and variance.
- Mechanism: The mean reflects the overall preference direction, while the variance captures the degree of annotator disagreement.
- Key Formulas: \(r_A - r_B \sim \mathcal{N}(\mu_A - \mu_B, \sigma_A^2 + \sigma_B^2 - 2\rho\sigma_A\sigma_B)\)
- Where \(\rho\) models the correlation between the two responses (based on tie frequency).
- Training Loss: Uses KL divergence loss to map the value of \(r_A - r_B\) to the probability distribution of annotated preference labels.
- Preference Mapping Intervals: Ties correspond to \((-0.5, 0.5)\), slight preference to \([0.5, 1.5)\), and significant preference to \([1.5, \infty)\).
- Disagreement Detection: Each pair is scored with \(|\mu_A - \mu_B| - \lambda(\sigma_A + \sigma_B)\); a low score (small mean difference, large standard deviations) flags the pair as diverging.
- Difference from Bradley-Terry: The latter only outputs scalar rewards and cannot capture uncertainty.
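
The pieces listed above can be sketched in a few lines: the distribution of \(r_A - r_B\), its mapping onto preference intervals, and the disagreement score. This is an illustrative sketch, not the authors' implementation. The function names, the mirrored negative intervals for "prefer B", the default \(\lambda = 1\), and the numeric inputs are all assumptions; in the actual model \(\mu\) and \(\sigma\) would come from the reward model's heads.

```python
import math


def normal_cdf(x, mean, std):
    """CDF of N(mean, std^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def difference_distribution(mu_a, sigma_a, mu_b, sigma_b, rho):
    """Mean and standard deviation of r_A - r_B for correlated normal rewards."""
    mean = mu_a - mu_b
    var = sigma_a ** 2 + sigma_b ** 2 - 2.0 * rho * sigma_a * sigma_b
    return mean, math.sqrt(max(var, 1e-12))  # clamp to avoid a zero std


def preference_probabilities(mean, std):
    """Probability mass of r_A - r_B in each preference interval.

    Positive intervals follow the text: tie (-0.5, 0.5), slight [0.5, 1.5),
    significant [1.5, inf). The negative side is mirrored (an assumption).
    """
    edges = [-1.5, -0.5, 0.5, 1.5]
    cdf = [0.0] + [normal_cdf(e, mean, std) for e in edges] + [1.0]
    names = ["strong_B", "slight_B", "tie", "slight_A", "strong_A"]
    return {name: cdf[i + 1] - cdf[i] for i, name in enumerate(names)}


def divergence_score(mu_a, sigma_a, mu_b, sigma_b, lam=1.0):
    """|mu_A - mu_B| - lam * (sigma_A + sigma_B); lower means more diverging."""
    return abs(mu_a - mu_b) - lam * (sigma_a + sigma_b)


# Hypothetical model outputs for one response pair.
mean, std = difference_distribution(1.0, 0.5, 0.0, 0.5, rho=0.0)
probs = preference_probabilities(mean, std)
consensus_like = divergence_score(1.0, 0.2, 0.0, 0.2)   # large gap, low spread
diverging_like = divergence_score(0.1, 1.0, 0.0, 1.0)   # small gap, high spread
```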

LLM-as-Judge Bias Analysis and Mitigation:
- Function: Analyzes LLM-as-Judge behavior on diverging preferences and proposes a filtering methodology.
- Core Finding: On MultiPref, the LLM judge picks a "winner" on 73.8% of diverging-preference samples, almost identical to the 73.1% on high-consensus samples.
- Bias Types: Systematic preferences for verbose and highly formatted outputs, and favoring compliance over refusal.
- Mitigation Strategy: Uses the distributional reward model to identify diverging samples and remove them from evaluation benchmarks.
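
The filtering step might look like the sketch below. It assumes each benchmark item can be scored by a disagreement function where lower values mean "more diverging" (as with \(|\mu_A - \mu_B| - \lambda(\sigma_A + \sigma_B)\)). The name `filter_diverging`, the `score_fn` callback, the item fields, and the threshold of 0.0 are hypothetical choices for illustration, not values reported in the paper.

```python
def filter_diverging(items, score_fn, threshold=0.0):
    """Split benchmark items into kept (consensus-like) and removed (diverging).

    score_fn(item) returns a disagreement score; items scoring below
    `threshold` are treated as diverging and dropped from evaluation.
    """
    kept, removed = [], []
    for item in items:
        if score_fn(item) >= threshold:
            kept.append(item)
        else:
            removed.append(item)
    return kept, removed


# Hypothetical benchmark items with precomputed scores.
benchmark = [
    {"id": "q1", "score": 0.6},
    {"id": "q2", "score": -1.9},
    {"id": "q3", "score": 0.1},
]
kept, removed = filter_diverging(benchmark, lambda item: item["score"])
```

Keeping the removed items alongside the kept ones makes it easy to inspect what the filter discards before trusting the trimmed benchmark.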

Loss & Training

Mean-Var (KL): Trained using KL divergence loss, mapping \(r_A - r_B\) into 5 preference categories (Strongly prefer A / Marginally prefer A / Tie / Marginally prefer B / Strongly prefer B).
Baseline Mean-Var (NLL, Independent): Uses a negative log-likelihood loss under the assumption that the two responses' rewards are independent, yielding sub-optimal performance.
Classification (KL): A 5-way classifier based on Likert-5 scores, predicting the distribution of scores.
All models are fine-tuned from Llama-3-8B-Instruct.
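
A minimal sketch of the KL objective for one response pair follows. It assumes the target is the empirical distribution of annotator labels over the five categories and that the loss is KL(target ‖ predicted); the direction of the divergence, the category keys, and the example numbers are assumptions rather than details confirmed by the source.

```python
import math

CATEGORIES = ["strong_B", "slight_B", "tie", "slight_A", "strong_A"]


def label_distribution(annotations, categories=CATEGORIES):
    """Empirical distribution of annotator labels over the preference categories."""
    counts = {c: 0 for c in categories}
    for label in annotations:
        counts[label] += 1
    total = len(annotations)
    return {c: counts[c] / total for c in categories}


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); categories with zero target mass contribute 0."""
    return sum(
        p * math.log(p / max(predicted[c], eps))
        for c, p in target.items()
        if p > 0.0
    )


# Four hypothetical annotators who disagree on one pair.
target = label_distribution(["slight_A", "slight_A", "tie", "strong_B"])
# Hypothetical model prediction for the same pair.
predicted = {"strong_B": 0.1, "slight_B": 0.1, "tie": 0.3,
             "slight_A": 0.4, "strong_A": 0.1}
loss = kl_divergence(target, predicted)
```

Because the target keeps every annotator's label instead of a majority vote, a pair with split labels penalizes a model that predicts a confident, narrow distribution.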

Key Experimental Results

Distributional vs. Scalar Reward Models

Reward Model	MultiPref Pref Acc	MultiPref Div AUROC	HelpSteer2 Pref Acc	HelpSteer2 Div AUROC
Skywork (27B)	0.651	0.494	—	—
Nemotron (70B)	0.638	0.400	—	—
Bradley-Terry (Agg)	0.663	0.458	0.683	0.482
Bradley-Terry (All)	0.648	0.438	0.678	0.489
Mean-Var (KL, ours)	0.664	0.615	0.684	0.582
Classification (KL)	—	—	0.659	0.648
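
For readers unfamiliar with the Div AUROC column, the sketch below computes AUROC as the probability that a randomly chosen diverging pair receives a higher "diverging" score than a randomly chosen non-diverging pair, with ties counted as half. The function name and toy data are illustrative. The convention that higher scores mean "more diverging" is an assumption, so a score like \(|\mu_A - \mu_B| - \lambda(\sigma_A + \sigma_B)\) would be negated first.

```python
def auroc(scores, labels):
    """Pairwise AUROC: P(score of a positive > score of a negative), ties = 0.5.

    labels: 1 for diverging pairs, 0 for non-diverging pairs.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: one diverging pair is ranked below a non-diverging one.
toy_auroc = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
# A constant score carries no ranking information and lands at chance.
chance_auroc = auroc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

A value of 0.5 corresponds to random ranking, which is the reference point for reading the Div AUROC column.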

LLM-as-Judge Performance across Different Preference Types

Preference Type	MultiPref Winner Chosen %	HelpSteer2 Winner Chosen %
High-Consensus Preference	73.1%	64.6%
High-Consensus Tie	42.6%	51.9%
Diverging Preference (All)	73.8%	57.3%
Diverging Preference (Significant)	76.0%	65.0%

Key Findings

The Div AUROC of standard Bradley-Terry reward models sits at or slightly below chance (0.44-0.49 against a random baseline of 0.5), so their reward margins do not separate diverging from high-consensus preferences.
Mean-Var (KL) raises MultiPref Div AUROC from 0.458 (Bradley-Terry, Agg) to 0.615 (+0.157) while matching preference accuracy (0.664 vs. 0.663); on HelpSteer2 it rises from 0.482 to 0.582.
LLM-as-Judge is nearly as decisive on diverging preferences as on high-consensus ones (73.8% vs. 73.1% winner chosen on MultiPref), and its choices show a systematic bias towards specific styles.
Verbosity is the single largest source of disagreement (38-44% of analyzed samples), which points to differing preferences rather than annotation noise as the main driver.

Highlights & Insights

Deep Data Insights: This work represents the first systematic analysis of disagreement causes in real-world preference datasets, establishing an empirically grounded taxonomy of divergence.
It uncovers a widely overlooked issue: current reward models treat diverging and consistent samples identically, hindering the progress of pluralistic alignment.
The distributional reward model offers an elegant solution by simultaneously performing preference prediction and divergence identification within a single model.
The findings regarding LLM-as-Judge bias serve as an important cautionary tale for popular evaluation benchmarks like Arena-Hard.

Limitations & Future Work

Validation is limited to two English datasets; divergence patterns may vary across different languages and cultures.
The distributional reward model is not yet directly integrated into RLHF training pipelines; incorporating divergence knowledge into policy optimization remains an open question.
The granularity of the disagreement taxonomy could be further refined, as categories like Verbosity and Format can overlap.
The mitigation strategy for LLM-as-Judge bias relies on the reward model, introducing a potential risk of circular dependency.

This work closely relates to, and provides empirical support for, the line of research on Pluralistic Alignment.
The concept of distributional reward modeling can be generalized to other scenarios featuring annotation uncertainty, such as safety evaluation.
The disagreement taxonomy provides direct guidance for building and quality-controlling preference datasets.
Offers a new analytical perspective on reward hacking: some instances of "alignment failure" might simply indicate that the model has learned the preferences of a specific subgroup of annotators.

Rating

Novelty: ⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐⭐
Value: ⭐⭐⭐⭐⭐