ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning¶
Conference: NeurIPS 2025
arXiv: 2512.25023
Code: None
Area: LLM Alignment
Keywords: reward model, preference strength, response ranking, RLHF, sample efficiency
TL;DR¶
This paper proposes ResponseRank, a method that robustly learns utility differences by exploiting local relative differences in proxy signals of preference strength (e.g., response time and annotator agreement), significantly improving the sample efficiency of reward models.
Background & Motivation¶
Limitations of Prior Work¶
Binary preference labels in RLHF (A is preferred over B) convey only the direction of preference, not its intensity. However, preference strength is critical for decision-making under uncertainty and for the generalization of preference models:
- Information Loss: "Strongly preferring apple" and "slightly preferring apple" are indistinguishable under binary annotation.
- Unreliable Proxy Signals: Proxy signals such as response time and annotator agreement can reflect preference strength but are noisy and subject to confounding factors.
- Incomparable Absolute Values: Absolute values of response time are not comparable across different annotators or different questions.
- Low Sample Efficiency: Failing to utilize preference strength information necessitates larger amounts of annotated data.
Method¶
Overall Architecture¶
ResponseRank converts preference strength information into ranking constraints by comparing the relative differences of proxy signals within carefully constructed strata, and uses these constraints to train more accurate reward models.
Key Designs¶
1. Local Stratification Strategy
- Samples are stratified by feature (e.g., same prompt, same annotator).
- Proxy signals are compared only within the same stratum, controlling for systematic bias.
- Cross-stratum comparisons are avoided to eliminate confounding effects (e.g., differences in prompt difficulty).
2. Relative Strength Ranking
- Within each stratum, samples are sorted by proxy signal value in descending order.
- Proxy signals are oriented so that larger values indicate stronger preference (e.g., shorter response time or higher annotator agreement map to higher values).
- Ranking constraints are generated: if the preference of \((x_i, y_i^w, y_i^l)\) is stronger than that of \((x_j, y_j^w, y_j^l)\), then \(|r(y_i^w) - r(y_i^l)| > |r(y_j^w) - r(y_j^l)|\).
3. Ranking Constraint Training
A ranking loss is added on top of the standard Bradley-Terry (BT) loss: \(\mathcal{L}_{\text{rank}} = \sum_{(i,j) \in \text{rank-pairs}} \max(0, \Delta_j - \Delta_i + \text{margin})\)
where \(\Delta_i = r(y_i^w) - r(y_i^l)\) is the utility difference.
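A minimal sketch of how the first two designs can produce the rank-pairs consumed by this loss. The grouping keys, field names, and the choice to use all within-stratum pairs (rather than only adjacent ones) are illustrative assumptions, not the paper's exact interface:

```python
from collections import defaultdict
from itertools import combinations

def build_rank_pairs(samples, keys=("prompt_id", "annotator_id")):
    """Build ranking constraints (i, j): preference sample i is treated as
    stronger than sample j, based only on within-stratum proxy comparisons.

    `samples` is a list of dicts, each carrying the stratification keys plus a
    scalar "proxy" value oriented so that larger means a stronger preference
    (e.g., annotator agreement, or negated response time).
    """
    # 1. Local stratification: group samples sharing the controlled features,
    #    so systematic differences across prompts/annotators cannot confound.
    strata = defaultdict(list)
    for idx, sample in enumerate(samples):
        strata[tuple(sample[k] for k in keys)].append(idx)

    # 2. Relative strength ranking: compare proxy values only inside a stratum.
    rank_pairs = []
    for indices in strata.values():
        ordered = sorted(indices, key=lambda i: samples[i]["proxy"], reverse=True)
        for a, b in combinations(ordered, 2):
            if samples[a]["proxy"] > samples[b]["proxy"]:  # skip ties
                rank_pairs.append((a, b))
    return rank_pairs
```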
Loss & Training¶
- \(\mathcal{L}_{\text{BT}}\): Standard Bradley-Terry preference loss.
- \(\mathcal{L}_{\text{rank}}\): Ranking constraint based on preference strength.
- \(\lambda\): Trade-off coefficient balancing the two terms, giving the total objective \(\mathcal{L} = \mathcal{L}_{\text{BT}} + \lambda \mathcal{L}_{\text{rank}}\) (a combined-loss sketch follows below).
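A minimal PyTorch sketch of this combined objective, assuming the hinge form of \(\mathcal{L}_{\text{rank}}\) above; the argument names and the default margin/\(\lambda\) values are placeholders, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def total_loss(chosen_rewards, rejected_rewards, rank_pairs, margin=0.1, lam=1.0):
    """Bradley-Terry preference loss plus the hinge-style preference-strength
    ranking loss, weighted by a trade-off coefficient.

    chosen_rewards / rejected_rewards: 1-D tensors of r(y^w) and r(y^l), one
    entry per preference sample; rank_pairs: (i, j) pairs where sample i's
    preference is judged stronger than sample j's.
    """
    delta = chosen_rewards - rejected_rewards          # utility gaps Delta_i
    bt_loss = -F.logsigmoid(delta).mean()              # standard BT loss

    if rank_pairs:
        i = torch.tensor([p[0] for p in rank_pairs])
        j = torch.tensor([p[1] for p in rank_pairs])
        # A stronger preference should come with a larger utility gap.
        rank_loss = F.relu(delta[j] - delta[i] + margin).mean()
    else:
        rank_loss = delta.new_zeros(())

    return bt_loss + lam * rank_loss
```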
Key Experimental Results¶
Main Results¶
Synthetic preference learning (with simulated response time):
| Method | 20% Data | 50% Data | 100% Data |
|---|---|---|---|
| BT (Standard) | 0.62 | 0.71 | 0.78 |
| Weighted BT | 0.65 | 0.73 | 0.79 |
| ResponseRank | 0.72 | 0.78 | 0.82 |
Language model reward learning (annotator agreement as proxy, RewardBench Accuracy):
| Method | Chat | Safety | Reasoning | Avg. |
|---|---|---|---|---|
| BT (Standard) | 72.5 | 80.3 | 68.2 | 73.7 |
| Margin-BT | 73.8 | 81.2 | 69.5 | 74.8 |
| ResponseRank | 76.2 | 83.5 | 72.1 | 77.3 |
Ablation Study¶
Robustness under varying proxy signal quality:
| Noise Level | BT | Weighted BT | ResponseRank |
|---|---|---|---|
| No noise | 0.78 | 0.85 | 0.86 |
| Low noise | 0.78 | 0.82 | 0.84 |
| Medium noise | 0.78 | 0.78 | 0.82 |
| High noise | 0.78 | 0.72 | 0.80 |
Key Findings¶
- With only 20% of the data, ResponseRank reaches the performance level of standard BT trained on 100% of the data, a 5× improvement in sample efficiency.
- Local stratification is the key to robustness: under high noise, Weighted BT actually performs worse than standard BT.
- The Pearson Distance Correlation (PDC) metric effectively distinguishes ordinal accuracy from cardinal utility learning (a hedged sketch follows after this list).
- Using simulated returns as proxy signals in RL control tasks is equally effective.
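The exact formula for PDC is not given in this summary; purely as a hypothetical illustration of the ordinal-vs-cardinal distinction, the sketch below takes PDC to be the Pearson correlation between predicted and ground-truth pairwise utility differences (the function name and definition are assumptions, not the paper's):

```python
import numpy as np

def pdc(pred_utilities, true_utilities):
    """Hypothetical PDC-style metric: Pearson correlation between predicted and
    ground-truth pairwise utility differences. A model that ranks responses
    correctly (ordinal) can still score low here if the magnitudes of its
    utility gaps carry no information (cardinal)."""
    pred = np.asarray(pred_utilities, dtype=float)
    true = np.asarray(true_utilities, dtype=float)
    iu = np.triu_indices(len(pred), k=1)               # all unordered pairs i < j
    pred_diff = (pred[:, None] - pred[None, :])[iu]
    true_diff = (true[:, None] - true[None, :])[iu]
    return float(np.corrcoef(pred_diff, true_diff)[0, 1])
```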
Highlights & Insights¶
- Minimal Assumptions: Only assumes that proxy signals are locally valid, without making assumptions about their global distribution.
- Robust Design: The advantage over Weighted BT becomes more pronounced as noise increases.
- New Metric PDC: Pearson Distance Correlation disentangles ordinal and cardinal learning.
Limitations & Future Work¶
- The stratification strategy requires a sufficient number of within-stratum samples and may be unsuitable for small-scale datasets.
- A systematic comparison of proxy signal choices (response time vs. annotator agreement vs. others) is lacking.
- Ranking constraints only consider pairwise relationships and do not exploit total-order information across multiple samples.
- Integration with direct preference optimization methods such as DPO remains unexplored.
Related Work & Insights¶
- RLHF: Standard reinforcement learning from human feedback framework.
- Preference Learning with Response Time (RT): Related work on response-time-assisted preference learning.
- Learning to Rank: Ranking learning methods from information retrieval.
Rating¶
- ⭐ Novelty: 8/10 — The design combining local stratification and relative ranking is concise and effective.
- ⭐ Value: 8/10 — A 5× improvement in sample efficiency is highly valuable for practical annotation workflows.
- ⭐ Writing Quality: 8/10 — Experiments cover three settings: synthetic, language, and RL.