Inference-Time Reward Hacking in Large Language Models¶
Conference: NeurIPS 2025 arXiv: 2506.19248 Code: None Area: Alignment Keywords: reward hacking, inference-time alignment, Best-of-N, winner's curse, hedging
TL;DR¶
This paper mathematically proves that inference-time alignment methods (e.g., BoN) inevitably exhibit reward hacking (true reward first increases then decreases) when optimizing a proxy reward. It proposes Best-of-Poisson (BoP) sampling to approximate the optimal KL-reward trade-off distribution, and designs the HedgeTune algorithm to locate the optimal inference-time parameter via one-dimensional root-finding, effectively mitigating reward hacking in both mathematical reasoning and human preference settings.
Background & Motivation¶
Background: The core paradigm of current LLM alignment methods (RLHF, DPO, BoN, etc.) is to maximize a reward function while minimizing KL divergence from a reference model. Among these, Best-of-N (BoN) is widely adopted for its simplicity and efficiency—generating N candidate responses and selecting the one with the highest reward.
Limitations of Prior Work: All proxy reward models are imperfect; they cannot precisely capture complex objectives such as correctness, helpfulness, and safety. Optimizing a biased proxy reward leads to reward hacking, where true performance first improves and then degrades.
Key Challenge: Methods such as BoN are inherently susceptible to the winner's curse—as the number of candidates N increases, the selected response tends to be one whose true quality is overestimated by the proxy reward, resulting in over-optimization.
Goal: To characterize the inevitability of inference-time reward hacking and provide practical mitigation mechanisms.
Key Insight: Drawing on the winner's curse from information theory and auction theory, the paper reformulates the parameter tuning problem for inference-time alignment as a one-dimensional root-finding problem.
Core Idea: Poisson-randomized sampling approximates the optimal tilted distribution via a single parameter \(\mu\); HedgeTune then identifies the hacking threshold to achieve optimal "hedging" against the proxy reward.
Method¶
Overall Architecture¶
The central optimization objective studied in this paper is standard KL-regularized reward maximization: \(\pi^{\star} = \arg\max_{\pi} \mathbb{E}_{\pi}[r_p(X)] - \frac{1}{\lambda} D_{\text{KL}}(\pi \| \pi_{\text{ref}})\)
The theoretically optimal solution is an exponential tilt of the reference distribution, but direct sampling is intractable in practice (requiring enumeration of all possible continuations). Inference-time approximation methods are therefore necessary.
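For reference, the maximizer of this objective is the exponential tilt of the reference distribution, a standard result for KL-regularized objectives (reconstructed here from the definitions above): \(\pi^{\star}(x) \propto \pi_{\text{ref}}(x) \exp\big(\lambda\, r_p(x)\big)\), with normalizer \(Z(\lambda) = \mathbb{E}_{\pi_{\text{ref}}}[\exp(\lambda\, r_p(X))]\). Computing \(Z(\lambda)\) requires marginalizing over all continuations, which is exactly what makes exact sampling intractable.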
Pipeline:
1. Sample N candidate responses from the reference model \(\pi_{\text{ref}}\)
2. Score candidates with a proxy reward model
3. Select one output via a selection mechanism (BoN / SBoN / BoP)
4. Calibrate the selection mechanism's parameter with HedgeTune to avoid over-optimization
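Steps 1-3 of the pipeline reduce to a black-box selection loop. A minimal sketch for the BoN case is below; `generate_candidate` and `proxy_reward` are hypothetical stand-ins for the reference model and the proxy reward model, not names from the paper.

```python
import random

random.seed(0)

def generate_candidate(prompt):
    # Hypothetical stand-in for sampling one response from pi_ref.
    return f"{prompt} :: response {random.random():.6f}"

def proxy_reward(response):
    # Hypothetical stand-in for the proxy reward model's scalar score.
    return (hash(response) % 10_000) / 10_000.0

def best_of_n(prompt, n):
    # Sample n candidates, score each, return the proxy-reward argmax.
    candidates = [generate_candidate(prompt) for _ in range(n)]
    scores = [proxy_reward(c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], candidates, scores

winner, candidates, scores = best_of_n("Explain KL divergence", n=8)
```

Step 4 (HedgeTune) then chooses `n` itself, since as the paper shows, taking `n` as large as compute allows is exactly what triggers the winner's curse.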
Key Designs¶
- Formal Definition of Reward Hacking (Definition 1): Defines the hacking threshold \(\theta^{\dagger}\), beyond which the true reward begins to decline. Theorem 1 proves that under TP2 (total positivity of order 2) and the monotone likelihood ratio condition, the true reward as a function of the inference-time parameter is either monotone or unimodal (having exactly one interior maximum), thereby establishing the inevitability of reward hacking.
- Best-of-Poisson (BoP) Sampling (Algorithm 3): The core innovation: replace the fixed sample count N in BoN with a Poisson-distributed random variable \(n' \sim \text{Poisson}(\mu)\), setting \(n = n' + 1\) to guarantee at least one sample. The BoP density is \(q_{\mu}(x) = (\mu x + 1) e^{\mu(x-1)}, \; x \in [0,1]\). Key advantage: BoP approximates the optimal tilted distribution with a single parameter \(\mu\), achieving a KL gap of only \(O(10^{-4})\) (under the uniform proxy reward assumption). This means BoP can serve as an inference-time approximation to the RLHF optimal policy without requiring model fine-tuning for each \(\lambda\).
- HedgeTune Algorithm (Algorithm 4): Identifies the hacking threshold \(\theta^{\dagger}\) at which the marginal gain in true reward is zero.
- For each prompt, map proxy reward scores to empirical quantiles \(u \in [0,1]\)
- Construct the residual function \(R(\theta) = \mathbb{E}_{u \sim p_\theta}[r_t(u) \cdot \psi(u, \theta)]\)
- Solve \(\bar{R}(\theta^{\dagger}) = 0\) (the residual averaged over prompts) via bisection or Newton's method
- For BoN: find the optimal N; for SBoN: find the optimal \(\lambda\); for BoP: find the optimal \(\mu\)
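The BoP mechanism on the quantile scale can be checked with a short simulation: drawing the max of \(n = 1 + \text{Poisson}(\mu)\) i.i.d. uniforms should reproduce the stated density \(q_{\mu}\). The sketch below is a minimal verification under that uniform-quantile assumption, with a stdlib-only inversion Poisson sampler; the function names are mine.

```python
import math
import random

random.seed(0)

def sample_poisson(mu):
    # Poisson draw by CDF inversion; fine for the moderate mu used here.
    u, k = random.random(), 0
    p = math.exp(-mu)
    cdf = p
    while u > cdf:
        k += 1
        p *= mu / k
        cdf += p
    return k

def bop_sample(mu):
    # Best-of-Poisson on the quantile scale:
    # max of n = 1 + Poisson(mu) i.i.d. Uniform(0, 1) proxy quantiles.
    n = 1 + sample_poisson(mu)
    return max(random.random() for _ in range(n))

mu = 5.0
draws = [bop_sample(mu) for _ in range(100_000)]
empirical_mean = sum(draws) / len(draws)

# Mean implied by the stated density q_mu(x) = (mu*x + 1) * exp(mu*(x - 1)),
# obtained by integrating 1 - F(x) with F(x) = x * exp(mu*(x - 1)):
analytic_mean = 1 - 1 / mu + 1 / mu**2 - math.exp(-mu) / mu**2
```

The matching CDF \(F(x) = x\,e^{\mu(x-1)}\) follows from the Poisson probability generating function, since \(\Pr(\max \le x) = x\,\mathbb{E}[x^{n'}]\).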
Loss & Training¶
- HedgeTune does not require access to the LLM's internal distribution; only proxy reward and true reward scoring data are needed
- Requires one-time calibration, applicable to verifiable reward settings (mathematical reasoning, program synthesis) or LLM-as-a-judge scenarios
- The proxy reward model is trained with standard binary cross-entropy loss on preference pairs
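Since HedgeTune reduces to one-dimensional root-finding, the calibration step can be illustrated with a toy example. The sketch below is not the paper's residual construction: it assumes a hypothetical unimodal true-reward curve \(r(\theta) = \theta e^{-\theta/\theta_0}\) and locates its peak by bisection on a finite-difference derivative, which is the same class of computation HedgeTune performs.

```python
import math

def true_reward(theta, theta0=10.0):
    # Toy unimodal "rise-then-fall" curve with its peak at theta0.
    return theta * math.exp(-theta / theta0)

def d_reward(theta, h=1e-6):
    # Central finite-difference derivative; the hacking threshold is its root.
    return (true_reward(theta + h) - true_reward(theta - h)) / (2 * h)

def bisect_root(f, lo, hi, tol=1e-8):
    # Plain bisection on an interval where f changes sign.
    flo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if fmid * flo > 0:
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# The derivative is positive at theta=1 and negative at theta=100,
# so bisection brackets the hacking threshold theta-dagger.
theta_dagger = bisect_root(d_reward, 1.0, 100.0)
```

In the paper's setting the curve is replaced by the empirical residual \(\bar{R}(\theta)\) estimated from proxy/true reward scores, but the root-finding machinery is identical, which is why the overhead is negligible.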
Key Experimental Results¶
Main Results I: Verifiable Reward Setting¶
Using the PPE dataset (responses generated by GPT-4o-mini / Claude Haiku 3), scored by three reward models:
| Dataset | Reward Model | BoN Optimal N | BoP Optimal μ | HedgeTune Peak Recovery |
|---|---|---|---|---|
| MMLU Pro | InternLM-2 1.8B | ~8 | ~7 | ✓ Successful |
| MATH | Llama-3-Offset-Bias 8B | ~16 | ~14 | ✓ Successful |
| GPQA | Skywork-Llama-3.1 8B | ~32 | ~30 | ✓ Successful |
Key finding: Even with the RewardBench rank-12 Skywork 8B reward model, BoN still exhibits hacking on GPQA (accuracy declines when N is too large). HedgeTune successfully recovers the optimal operating point across all settings.
Main Results II: Human Preference Setting¶
Using Pythia 1.4B reference model + AlpacaFarm + AlpacaRM gold-standard reward:
| Proxy RM Training Data Size | Label Noise | BoN Hacking Threshold N† | SBoN Optimal λ† | BoP Hacking Threshold μ† |
|---|---|---|---|---|
| 10k | 0% | ~16 | ~2.5 | ~14 |
| 20k | 0% | ~64 | ~4.0 | ~60 |
| 46k | 25% | ~8 | ~1.5 | ~7 |
| 80k | 25% | ~32 | ~3.0 | ~28 |
Key finding: Smaller proxy RM training sets or higher label noise lead to a lower hacking threshold (earlier degradation). SBoN can reach peak true reward without hacking by tuning the temperature \(\lambda\).
Ablation Study¶
| Method | # Parameters | Approximates Optimal Distribution | KL Gap | Hacking Mitigation |
|---|---|---|---|---|
| BoN | 1 (N) | No | N/A | Requires HedgeTune |
| SBoN | 2 (N, λ) | No (but more flexible) | N/A | λ=0 falls back to reference |
| BoP | 1 (μ) | Yes (gap < 8×10⁻⁴) | O(10⁻⁴) | Requires HedgeTune |
| Optimal tilted distribution | 1 (λ) | Yes (theoretically optimal) | 0 | Not directly sampleable |
Key Findings¶
- BoP achieves nearly identical KL-reward trade-off to the optimal tilted distribution with a single parameter; the KL gap is consistently < 8×10⁻⁴
- The "rise-then-fall" pattern of reward hacking is an intrinsic property of the MLR density family (including BoN and BoP)
- HedgeTune incurs minimal computational overhead (only one-dimensional root-finding) and can directly reuse existing sampling data
- SBoN can fully avoid hacking in certain settings with a fixed \(\lambda\) (returning the best achievable reward when the threshold is unreachable)
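The small-KL-gap finding can be probed numerically on the quantile scale: compare the BoP density with the exponential tilt of Uniform[0,1] and minimize the divergence over the tilt parameter. The sketch below is my own check under the uniform proxy reward assumption; the grid resolution and \(\lambda\) search range are arbitrary choices, not values from the paper.

```python
import math

def kl_bop_vs_tilt(mu, lam, grid=2000):
    # Trapezoidal estimate of KL(q_mu || p_lam) on [0, 1], where
    #   q_mu(x)  = (mu*x + 1) * exp(mu*(x - 1))        (BoP density)
    #   p_lam(x) = lam * exp(lam*x) / (exp(lam) - 1)   (tilted uniform)
    norm = math.exp(lam) - 1.0
    total = 0.0
    for i in range(grid + 1):
        x = i / grid
        q = (mu * x + 1.0) * math.exp(mu * (x - 1.0))
        p = lam * math.exp(lam * x) / norm
        w = 0.5 if i in (0, grid) else 1.0
        total += w * q * math.log(q / p)
    return total / grid

# Sweep the tilt parameter lambda in [1, 15] and keep the smallest
# divergence from the BoP density at mu = 5.
gap = min(kl_bop_vs_tilt(5.0, l / 20) for l in range(20, 301))
```

A single scalar \(\mu\) thus lands within a tiny KL distance of some exponential tilt, which is the sense in which BoP tracks the optimal trade-off curve.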
Highlights & Insights¶
- Elegantly connects the winner's curse from auction theory to LLM alignment, offering both theoretical novelty and practical guidance
- The design of BoP is highly elegant: Poisson randomization introduces exponential structure that naturally approximates the optimal tilted distribution
- Theorem 1 is broadly applicable—it holds for any inference-time method satisfying TP2, not limited to BoN
- HedgeTune is practically convenient: it requires no access to LLM internal parameters, only black-box scoring data
Limitations & Future Work¶
- HedgeTune requires access to true rewards (or a strong judge), limiting applicability in open-ended tasks where verification is infeasible
- The theoretical analysis relies on the uniform proxy reward assumption (the CDF/quantile transformation makes this mild for continuous rewards, but discrete rewards require additional treatment in the appendix)
- Reward hacking in multi-turn dialogue or sequential decision-making settings is not discussed
- Poisson randomization in BoP introduces variance in sample count, which may affect latency predictability
Related Work & Insights¶
- vs. Gao et al. (2023) scaling law: Gao et al. empirically observed reward hacking in BoN; this paper provides the first rigorous mathematical proof of its inevitability (Theorem 1)
- vs. SBoN (Mayrink Verdun et al. 2025): SBoN introduces a temperature parameter for flexibility but requires tuning two parameters; BoP achieves comparable performance with a single parameter
- vs. RLHF/DPO: The proposed method operates entirely at inference time without model fine-tuning, and can be viewed as a training-free alternative to RLHF
- vs. Huang et al. (2025) coverage analysis: They prove that BoN inevitably hacks for large N; this paper provides concrete mitigation strategies
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The combination of the winner's curse perspective, BoP, and HedgeTune demonstrates high innovation with solid theoretical contributions
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers both verifiable reward and human preference settings with multiple reward models and datasets, though experiments with larger-scale LLMs are absent
- Writing Quality: ⭐⭐⭐⭐⭐ Well-structured, with rigorous theoretical derivations and informative, aesthetically pleasing figures
- Value: ⭐⭐⭐⭐⭐ Directly relevant to the safe deployment of inference-time alignment methods; BoP and HedgeTune are plug-and-play