
When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF

Conference: AAAI 2026 · arXiv: 2512.00709 · Code: None · Area: LLM Alignment · Keywords: RLHF, DPO, preference flipping, robust alignment, noisy annotation

TL;DR

To address the pervasive "preference flipping" problem in human preference annotation, this paper proposes FA-DPO (Flipping-Aware DPO), which models the annotation process as a two-stage procedure consisting of "true human intent + instance-dependent flipping probability." By correcting the BT model loss and iteratively optimizing a flipping estimation module, FA-DPO substantially improves alignment robustness under various noise conditions, achieving up to a 16.7% gain over DPO when instance-dependent flipping rates are high.

Background & Motivation

Background: RLHF/DPO are the dominant paradigms for LLM alignment, yet they implicitly assume noise-free preference annotations. In practice, studies show that a flipping rate as low as 10% can degrade alignment performance by 30%.

Limitations of Prior Work: (a) Human preference annotations inevitably contain noise — environmental interference, distraction, or adversarial attacks can all induce label flips; (b) existing robust methods (cDPO, rDPO) assume a fixed global flipping rate independent of sample content, which is unrealistic, since ambiguous preference pairs are more susceptible to flipping than clear ones.

Key Challenge: The fixed flipping rate assumption applies uniform correction to all samples, failing to distinguish between "inherently ambiguous and flip-prone samples" and "clear samples that have been adversarially flipped."

Key Insight: The annotation process is decomposed into two stages — Stage 1 annotates according to true human intent (BT model), and Stage 2 applies instance-dependent label corruption (flipping probability \(\varepsilon_{\tilde{x}}\) correlated with sample content).
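
Written out, the two-stage model reads as follows. The latent reward \(r^*\) in the first line is the standard Bradley–Terry parameterization (notation added here for clarity); the second line is the corrupted-preference identity restated in the Method section.

```latex
% Stage 1: true human intent follows the Bradley-Terry model
p = \mathbb{P}\{y_w \succ y_l \mid x\} = \sigma\!\big(r^*(x, y_w) - r^*(x, y_l)\big)

% Stage 2: the observed label is flipped with instance-dependent probability eps
\tilde{\mathbb{P}}\{\tilde{y}_w \succ \tilde{y}_l \mid x\}
  = (1-\varepsilon_{\tilde{x}})\,p + \varepsilon_{\tilde{x}}\,(1-p)
```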

Core Idea: The likelihood function in the BT model loss is corrected using instance-dependent flipping probabilities, such that samples with high flipping probabilities are down-weighted or even subject to gradient reversal. A learnable flipping probability estimation module is jointly optimized with the LLM.

Method

Overall Architecture

An instance-dependent flipping probability estimation module is added on top of standard DPO:

  1. A classifier estimates the per-sample flipping probability \(\varepsilon_{\tilde{x}}\) from preference-pair features.
  2. The LLM policy model is trained using the corrected FA-DPO loss.
  3. Both components are optimized via alternating iteration.

Key Designs

  1. Instance-Dependent Flipping Probability Modeling:

    • Core proposition: the relationship between the corrupted preference probability and the true probability is \(\tilde{\mathbb{P}}\{\tilde{y}_w \succ \tilde{y}_l | x\} = (1-\varepsilon_{\tilde{x}})p + \varepsilon_{\tilde{x}}(1-p)\)
    • \(\varepsilon_{\tilde{x}}\) is instance-dependent — correlated with the content and ambiguity of the preference pair.
    • Design Motivation: cDPO/rDPO with fixed \(\varepsilon\) cannot distinguish noise levels across different samples.
  2. FA-DPO Loss Function:

    • Corrected loss: \(\mathcal{L}_{\text{FA-DPO}} = -\mathbb{E}_{\tilde{x}}[\log((1-\varepsilon_{\tilde{x}})p_\theta + \varepsilon_{\tilde{x}}(1-p_\theta))]\)
    • Gradient weight analysis (key distinction from cDPO/rDPO):
      • \(\varepsilon = 0\) (no flipping) → reduces to standard DPO
      • \(\varepsilon < 0.5\) (low flipping rate) → weight increases with model confidence, enhancing convergence stability
      • \(\varepsilon = 0.5\) (pure ambiguity) → weight is zero, automatically filtering signal-free samples
      • \(\varepsilon > 0.5\) (high flipping rate) → gradient direction is reversed, correcting flipped labels back to their true values — a self-correction capability absent in cDPO/rDPO
    • Design Motivation: Rather than a simple additive correction, the method uses a multiplicative reparameterization that depends jointly on the flipping probability and the model's confidence; a minimal loss sketch follows this list.
  3. Flipping Probability Estimation Module:

    • Takes hand-crafted features of the preference pair (e.g., response length difference, perplexity difference, semantic similarity) as input.
    • A lightweight classifier is trained to estimate \(\varepsilon_{\tilde{x}}\).
    • Alternately optimized with the LLM policy model.
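
A minimal sketch of the corrected loss, assuming a PyTorch-style DPO setup; the function and tensor names (`fa_dpo_loss`, `policy_chosen_logps`, `eps`) and the default \(\beta\) are illustrative assumptions, not the authors' implementation.

```python
import torch

def fa_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                eps, beta=0.1):
    """Flipping-aware DPO loss (sketch): -log((1-eps)*p + eps*(1-p))."""
    # Standard DPO preference probability p_theta for the annotated ordering
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    p = torch.sigmoid(logits)

    # eps comes from the flipping estimation module; frozen during the policy step
    eps = eps.detach()
    q = (1.0 - eps) * p + eps * (1.0 - p)

    # eps = 0 recovers standard DPO, eps = 0.5 zeroes the gradient,
    # eps > 0.5 reverses it (the factor (1 - 2*eps) in the gradient changes sign).
    return -torch.log(q.clamp_min(1e-8)).mean()
```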

Loss & Training

An alternating two-step procedure: (1) fix the flipping model and update the policy model using the FA-DPO loss; (2) fix the policy model and update the flipping probability estimation module. Compatible with both standard RLHF and DPO pipelines.
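
A minimal sketch of the alternating loop under the same assumptions; `FlipEstimator`, `pair_features`, and `policy.preference_prob` are hypothetical stand-ins for the paper's estimation module, hand-crafted pair features, and DPO preference probability, and the update schedule shown is illustrative only.

```python
import torch
import torch.nn as nn

class FlipEstimator(nn.Module):
    """Lightweight classifier mapping preference-pair features to eps in (0, 1)."""
    def __init__(self, num_features):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_features, 32), nn.ReLU(),
                                 nn.Linear(32, 1))

    def forward(self, feats):                      # feats: [batch, num_features]
        return torch.sigmoid(self.net(feats)).squeeze(-1)

def corrupted_nll(p, eps):
    """Negative log of the corrupted preference likelihood."""
    q = (1.0 - eps) * p + eps * (1.0 - p)
    return -torch.log(q.clamp_min(1e-8)).mean()

def train_alternating(policy, estimator, loader, policy_opt, est_opt):
    for batch in loader:
        feats = batch["pair_features"]             # e.g. length diff, perplexity diff, similarity

        # Step 1: freeze the flipping estimator, update the policy with the FA-DPO loss.
        with torch.no_grad():
            eps = estimator(feats)
        p = policy.preference_prob(batch)          # p_theta for the annotated ordering
        policy_loss = corrupted_nll(p, eps)
        policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

        # Step 2: freeze the policy, refit the estimator under the same corrupted likelihood.
        with torch.no_grad():
            p = policy.preference_prob(batch)
        est_loss = corrupted_nll(p, estimator(feats))
        est_opt.zero_grad(); est_loss.backward(); est_opt.step()
```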

Key Experimental Results

Main Results: Win Rate under Different Noise Conditions

| Method | Anthropic-HH (Low Noise) | Anthropic-HH (High Noise) | HH_Golden (Low Noise) | HH_Golden (High Noise) |
| --- | --- | --- | --- | --- |
| DPO | 67.2 | 55.8 | 83.5 | 58.6 |
| cDPO | 67.2 | 67.1 | 83.5 | 66.6 |
| rDPO | 70.1 | 57.8 | 83.5 | 47.7 |
| ROPO | 70.8 | 67.3 | 83.5 | 64.4 |
| FA-DPO | 73.1 | 69.8 | 83.5 | 70.8 |
| Gain | +2.3 | +2.5 | — | +16.7 |

Key Findings

  • Most pronounced advantage under high noise: In the HH_Golden high-noise setting (high instance-dependent flipping rate), FA-DPO outperforms the best baseline by 16.7 percentage points, owing to the gradient reversal mechanism that actively corrects flipped samples.
  • Gains persist under low noise: A 2.3 pp improvement is observed on Anthropic-HH under low noise, demonstrating the value of instance-dependent modeling even when noise is limited.
  • rDPO degrades under high noise: The global flipping rate assumption fails in instance-dependent noise scenarios, illustrating that uniform correction is insufficient.
  • Gradient reversal is the core advantage: Upon detecting samples with high flipping probability, FA-DPO automatically reverses the preference direction — effectively recovering the true preference signal from noisy labels.

Ablation Study

| Configuration | Win Rate | Notes |
| --- | --- | --- |
| FA-DPO (full) | Best | Complete model |
| Fixed global \(\varepsilon\) | Below full | Degenerates to a cDPO-style method |
| No iterative updates | Slightly lower | Flipping estimation is less accurate |
| Random features | Significant drop | Preference features are critical |

Highlights & Insights

  • The gradient weight analysis is the most compelling theoretical contribution of this paper. Through systematic comparison with cDPO/rDPO, it clearly characterizes four behavioral modes of FA-DPO (no flipping → standard; low flipping → enhanced stability; high ambiguity → filtering; high flipping → reversal correction); the corresponding gradient expression is written out after this list. The gradient reversal mechanism when \(\varepsilon > 0.5\) is particularly noteworthy: it enables FA-DPO to recover correct learning signals from adversarially flipped samples, a capability entirely absent in prior methods.
  • Decomposing the annotation process into "human intent + external corruption" as a two-stage model exhibits statistical elegance and is generalizable to any human-preference-based learning scenario beyond RLHF (e.g., recommender systems, crowdsourced annotation).
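
For reference, the per-sample gradient behind the four regimes follows directly from differentiating the FA-DPO loss as stated in the Method section; this short derivation is added here and is not reproduced from the paper.

```latex
% Per-sample FA-DPO objective and its gradient with respect to theta
\mathcal{L}(\theta)
  = -\log\!\big((1-\varepsilon_{\tilde{x}})\,p_\theta + \varepsilon_{\tilde{x}}(1-p_\theta)\big)
  = -\log\!\big(\varepsilon_{\tilde{x}} + (1-2\varepsilon_{\tilde{x}})\,p_\theta\big)

\nabla_\theta \mathcal{L}(\theta)
  = -\,\frac{1-2\varepsilon_{\tilde{x}}}
            {(1-\varepsilon_{\tilde{x}})\,p_\theta + \varepsilon_{\tilde{x}}(1-p_\theta)}
    \;\nabla_\theta\, p_\theta
```

The scalar factor \(1-2\varepsilon_{\tilde{x}}\) makes the four regimes explicit: it equals 1 at \(\varepsilon_{\tilde{x}}=0\) (the gradient collapses to the standard DPO gradient \(-\nabla_\theta p_\theta / p_\theta\)), shrinks toward 0 as \(\varepsilon_{\tilde{x}} \to 0.5\), and becomes negative for \(\varepsilon_{\tilde{x}} > 0.5\), which is exactly the gradient reversal.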

Limitations & Future Work

  • The flipping probability estimation relies on hand-crafted preference features (length difference, perplexity difference, etc.); the quality of feature selection directly affects performance.
  • The convergence of alternating optimization lacks theoretical guarantees — oscillation between the two models may occur.
  • Experiments are conducted on relatively small-scale datasets (Anthropic-HH, HH_Golden); effectiveness on larger-scale alignment data remains unverified.
  • The flipping probability estimation module introduces additional training complexity, requiring extra classifier training and feature extraction compared to vanilla DPO.
  • Validation on current large-scale LLMs (e.g., Llama-3-70B) is absent.

Comparison with Related Work

  • vs. cDPO (Mitchell et al.): Applies label smoothing with a fixed global \(\varepsilon\), imposing identical correction on all samples. FA-DPO's instance-dependent flipping probability is more precise and additionally supports gradient reversal.
  • vs. rDPO (Chowdhury et al.): Extends cDPO with a debiasing correction but still assumes a fixed flipping rate, leading to degradation under instance-dependent noise (Win Rate decreases in experiments).
  • vs. ROPO: Accounts for noise but does not differentiate across instances; FA-DPO outperforms ROPO in all high-noise settings.
  • vs. Instance-Dependent Noisy Label Learning: Draws on instance-dependent noise theory from computer vision, representing the first systematic application of this framework to RLHF.
  • vs. RIME (Cheng et al.): A sample selection approach based on training loss values that leverages the tendency of DNNs to learn clean samples first. FA-DPO retains all samples rather than discarding them, correcting each via its estimated flipping probability.
  • Implication: The instance-dependent noise modeling paradigm is broadly applicable to any human-preference-based learning scenario, including recommender systems and crowdsourced annotation.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Instance-dependent flipping modeling and gradient reversal mechanism constitute a significant theoretical contribution; the analysis of four behavioral modes is particularly insightful.
  • Experimental Thoroughness: ⭐⭐⭐ Theoretical analysis is rigorous, but experimental scale is limited (restricted dataset and model scale).
  • Writing Quality: ⭐⭐⭐⭐⭐ Mathematical derivations are rigorous; the gradient comparison analysis against cDPO/rDPO is exceptionally clear.
  • Value: ⭐⭐⭐⭐ Represents an important theoretical advance for robust RLHF, though validation at large-scale deployment remains insufficient.