
Inference-Time Reward Hacking in Large Language Models

Conference: NeurIPS 2025 arXiv: 2506.19248 Code: None Area: Recommender Systems Keywords: reward hacking, inference-time alignment, Best-of-N, winner's curse, hedging

TL;DR

This paper proves mathematically that inference-time alignment methods (e.g., BoN) optimizing a proxy reward inevitably exhibit reward hacking: the true reward first increases, then decreases. It proposes Best-of-Poisson (BoP) sampling to approximate the optimal KL-reward trade-off distribution, and designs the HedgeTune algorithm to locate the optimal inference-time parameter via one-dimensional root-finding, effectively mitigating reward hacking in both mathematical reasoning and human preference settings.

Background & Motivation

Background: The core paradigm of current LLM alignment methods (RLHF, DPO, BoN, etc.) is to maximize a reward function while minimizing KL divergence from a reference model. Among these, Best-of-N (BoN) is widely adopted for its simplicity and efficiency—generating N candidate responses and selecting the one with the highest reward.

Limitations of Prior Work: All proxy reward models are imperfect; they cannot precisely capture complex objectives such as correctness, helpfulness, and safety. Optimizing a biased proxy reward leads to reward hacking, where true performance first improves and then degrades.

Key Challenge: Methods such as BoN are inherently susceptible to the winner's curse—as the number of candidates N increases, the selected response tends to be one whose true quality is overestimated by the proxy reward, resulting in over-optimization.

Goal: To characterize the inevitability of inference-time reward hacking and provide practical mitigation mechanisms.

Key Insight: Drawing on the winner's curse from information theory and auction theory, the paper reformulates the parameter tuning problem for inference-time alignment as a one-dimensional root-finding problem.

Core Idea: Poisson-randomized sampling approximates the optimal tilted distribution via a single parameter \(\mu\); HedgeTune then identifies the hacking threshold to achieve optimal "hedging" against the proxy reward.

Method

Overall Architecture

The central optimization objective studied in this paper is the standard KL-constrained reward maximization:

\[
\pi^{\star} = \arg\max_{\pi} \; \mathbb{E}_{\pi}[r_p(X)] - \frac{1}{\lambda} D_{\text{KL}}(\pi \,\|\, \pi_{\text{ref}})
\]

The theoretically optimal solution is an exponential tilt of the reference distribution, but direct sampling is intractable in practice (requiring enumeration of all possible continuations). Inference-time approximation methods are therefore necessary.
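For reference, the exponential tilt has the closed form

\[
\pi^{\star}(x) = \frac{\pi_{\text{ref}}(x)\, e^{\lambda r_p(x)}}{\mathbb{E}_{\pi_{\text{ref}}}\!\left[ e^{\lambda r_p(X)} \right]},
\]

where the denominator, the normalizing constant over all possible continuations, is the intractable part.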

Pipeline (sketched in code below):

  1. Sample N candidate responses from the reference model \(\pi_{\text{ref}}\)
  2. Score the candidates with a proxy reward model
  3. Select one output via a selection mechanism (BoN / SBoN / BoP)
  4. Calibrate the selection mechanism's parameter with HedgeTune to avoid over-optimization
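A minimal sketch of this pipeline, assuming hypothetical `generate` and `proxy_reward` hooks in place of real model calls; SBoN is rendered here as softmax sampling over proxy scores at temperature \(\lambda\), consistent with the \(\lambda = 0\) fallback noted in the ablation table:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_response(prompt, generate, proxy_reward,
                    mechanism="bon", N=16, lam=1.0, mu=15.0):
    """Return one response chosen by BoN / SBoN / BoP.

    generate(prompt) samples from the reference model pi_ref;
    proxy_reward(prompt, response) returns a scalar proxy score.
    Both are hypothetical stand-ins, not APIs from the paper.
    """
    if mechanism == "bop":            # Best-of-Poisson: randomized candidate count
        n = 1 + rng.poisson(mu)       # n' ~ Poisson(mu), n = n' + 1 >= 1
    else:                             # BoN / SBoN: fixed candidate count
        n = N
    candidates = [generate(prompt) for _ in range(n)]
    scores = np.array([proxy_reward(prompt, c) for c in candidates])
    if mechanism == "sbon":           # soft selection at temperature lam
        p = np.exp(lam * (scores - scores.max()))
        return candidates[rng.choice(n, p=p / p.sum())]
    return candidates[int(scores.argmax())]   # BoN and BoP take the argmax
```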

Key Designs

  1. Formal Definition of Reward Hacking (Definition 1): Defines the hacking threshold \(\theta^{\dagger}\)—beyond which the true reward begins to decline. Theorem 1 proves that under TP2 (total positivity of order 2) and the monotone likelihood ratio condition, the true reward function with respect to the inference-time parameter is either monotone or unimodal (having exactly one maximum), thereby establishing the inevitability of reward hacking.

  2. Best-of-Poisson (BoP) Sampling (Algorithm 3): The core innovation replaces the fixed sample count N in BoN with a Poisson-randomized count: draw \(n' \sim \text{Poisson}(\mu)\) and set \(n = n' + 1\) to guarantee at least one sample. On the quantile scale, the selected sample then has density \(q_{\mu}(x) = (\mu x + 1)\, e^{\mu(x-1)}, \quad x \in [0,1]\) (a Monte Carlo check of this density appears after this list). Key advantage: BoP approximates the optimal tilted distribution with a single parameter \(\mu\), achieving a KL gap of only \(O(10^{-4})\) (under the uniform proxy reward assumption). This means BoP can serve as an inference-time approximation to the RLHF optimal policy without requiring model fine-tuning for each \(\lambda\).

  3. HedgeTune Algorithm (Algorithm 4): Aims to identify the hacking threshold \(\theta^{\dagger}\) at which the marginal gain in true reward is zero.

    • For each prompt, map proxy reward scores to empirical quantiles \(u \in [0,1]\)
    • Construct the residual function \(R(\theta) = \mathbb{E}_{u \sim p_\theta}[r_t(u) \cdot \psi(u, \theta)]\)
    • Solve \(\bar{R}(\theta^{\star}) = 0\), the residual averaged over prompts, via bisection or Newton's method (sketched in code after this list)
    • For BoN: find the optimal N; for SBoN: find the optimal \(\lambda\); for BoP: find the optimal \(\mu\)
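As referenced in item 2, a quick Monte Carlo sanity check of the BoP density. The closed form follows from the Poisson generating function \(\mathbb{E}[s^{n'}] = e^{\mu(s-1)}\): the selected quantile has CDF \(F_{\mu}(x) = x\, e^{\mu(x-1)}\), whose derivative is exactly \(q_{\mu}(x)\).

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 10.0

# BoP on the quantile scale: draw n = 1 + Poisson(mu) uniform quantiles, keep the max.
n = 1 + rng.poisson(mu, size=200_000)
winners = np.array([rng.random(k).max() for k in n])

# Compare the empirical CDF against F_mu(x) = x * exp(mu * (x - 1)),
# whose derivative is the stated density q_mu(x) = (mu * x + 1) * exp(mu * (x - 1)).
x = np.linspace(0.05, 1.0, 20)
empirical = np.searchsorted(np.sort(winners), x) / winners.size
analytic = x * np.exp(mu * (x - 1.0))
print(np.max(np.abs(empirical - analytic)))  # Monte Carlo error, on the order of 1e-3
```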
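And the root-finding step referenced in item 3, sketched for the BoN case under the uniform-quantile assumption. Treating N as continuous, the selected quantile has density \(N u^{N-1}\); differentiating the expected true reward in \(N\) gives the weight \(\psi(u, N) = 1/N + \log u\). The true-reward curve below is a toy assumption for illustration, not data from the paper.

```python
import numpy as np
from scipy.optimize import brentq

def residual_bon(N, u, r_t):
    """Monte Carlo estimate of d/dN E[r_t] under continuous-relaxed BoN.

    u:   proxy-reward quantiles of reference samples, uniform on [0, 1]
    r_t: true rewards of the same samples
    """
    w = N * u ** (N - 1)         # importance weight from uniform to the BoN winner
    psi = 1.0 / N + np.log(u)    # d/dN of log(N * u**(N - 1))
    return np.mean(r_t * w * psi)

# Toy calibration data: a true reward that rises with the proxy quantile, then falls.
rng = np.random.default_rng(2)
u = rng.random(200_000)
r_t = u - 0.6 * u ** 4

# HedgeTune's core step: one-dimensional root-finding on the averaged residual.
N_dagger = brentq(lambda N: residual_bon(N, u, r_t), 2.0, 50.0)
print(N_dagger)  # ~4.5 for this toy curve; larger N starts to hurt the true reward
```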

Loss & Training

  • HedgeTune does not require access to the LLM's internal distribution; only proxy reward and true reward scoring data are needed
  • Requires one-time calibration, applicable to verifiable reward settings (mathematical reasoning, program synthesis) or LLM-as-a-judge scenarios
  • The proxy reward model is trained with a standard binary cross-entropy loss on preference pairs (a minimal sketch of this loss follows the list)
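A minimal PyTorch sketch of that pairwise loss (the standard Bradley-Terry form; `reward_model` is a hypothetical scalar scorer, not code from the paper):

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    """Binary cross-entropy on the score gap, i.e. the Bradley-Terry
    objective -log sigmoid(r(chosen) - r(rejected))."""
    r_chosen = reward_model(chosen)        # shape: (batch,)
    r_rejected = reward_model(rejected)    # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```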

Key Experimental Results

Main Results I: Verifiable Reward Setting

Using the PPE dataset (responses generated by GPT-4o-mini / Claude Haiku 3), scored by three reward models:

| Dataset | Reward Model | BoN Optimal N | BoP Optimal μ | HedgeTune Peak Recovery |
|---|---|---|---|---|
| MMLU Pro | InternLM-2 1.8B | ~8 | ~7 | ✓ Successful |
| MATH | Llama-3-Offset-Bias 8B | ~16 | ~14 | ✓ Successful |
| GPQA | Skywork-Llama-3.1 8B | ~32 | ~30 | ✓ Successful |

Key finding: Even with the RewardBench rank-12 Skywork 8B reward model, BoN still exhibits hacking on GPQA (accuracy declines when N is too large). HedgeTune successfully recovers the optimal operating point across all settings.

Main Results II: Human Preference Setting

Using Pythia 1.4B reference model + AlpacaFarm + AlpacaRM gold-standard reward:

| Proxy RM Training Data Size | Label Noise | BoN Hacking Threshold N† | SBoN Optimal λ† | BoP Hacking Threshold μ† |
|---|---|---|---|---|
| 10k | 0% | ~16 | ~2.5 | ~14 |
| 20k | 0% | ~64 | ~4.0 | ~60 |
| 46k | 25% | ~8 | ~1.5 | ~7 |
| 80k | 25% | ~32 | ~3.0 | ~28 |

Key finding: Smaller proxy RM training sets or higher label noise lead to a lower hacking threshold (earlier degradation). By tuning the temperature \(\lambda\), SBoN can reach peak true reward without hacking.

Ablation Study

| Method | # Parameters | Approximates Optimal Distribution | KL Gap | Hacking Mitigation |
|---|---|---|---|---|
| BoN | 1 (N) | No | N/A | Requires HedgeTune |
| SBoN | 2 (N, λ) | No (but more flexible) | N/A | λ = 0 falls back to reference |
| BoP | 1 (μ) | Yes (gap < 8×10⁻⁴) | O(10⁻⁴) | Requires HedgeTune |
| Optimal tilted distribution | 1 (λ) | Yes (theoretically optimal) | 0 | Not directly sampleable |

Key Findings

  • BoP achieves nearly identical KL-reward trade-off to the optimal tilted distribution with a single parameter; the KL gap is consistently < 8×10⁻⁴
  • The "rise-then-fall" pattern of reward hacking is an intrinsic property of the MLR density family (including BoN and BoP)
  • HedgeTune incurs minimal computational overhead (only one-dimensional root-finding) and can directly reuse existing sampling data
  • SBoN can fully avoid hacking in certain settings with a fixed \(\lambda\) (returning the best achievable reward when the threshold is unreachable)

Highlights & Insights

  • Elegantly connects the winner's curse from auction theory to LLM alignment, offering both theoretical novelty and practical guidance
  • The design of BoP is highly elegant: Poisson randomization introduces exponential structure that naturally approximates the optimal tilted distribution
  • Theorem 1 is broadly applicable—it holds for any inference-time method satisfying TP2, not limited to BoN
  • HedgeTune is practically convenient: it requires no access to LLM internal parameters, only black-box scoring data

Limitations & Future Work

  • HedgeTune requires access to true rewards (or a strong judge), limiting applicability in open-ended tasks where verification is infeasible
  • Theoretical analysis relies on the uniform proxy reward assumption (though the loss from CDF transformation is small, discrete cases require additional treatment in the appendix)
  • Reward hacking in multi-turn dialogue or sequential decision-making settings is not discussed
  • Poisson randomization in BoP introduces variance in sample count, which may affect latency predictability

Comparison with Related Work

  • vs. Gao et al. (2023) scaling laws: Gao et al. empirically observed reward hacking in BoN; this paper provides a rigorous mathematical proof of its inevitability (Theorem 1)
  • vs. SBoN (Mayrink Verdun et al. 2025): SBoN introduces a temperature parameter for flexibility but requires tuning two parameters; BoP achieves comparable performance with a single parameter
  • vs. RLHF/DPO: The proposed method operates entirely at inference time without model fine-tuning, and can be viewed as a training-free alternative to RLHF
  • vs. Huang et al. (2025) coverage analysis: They prove that BoN inevitably hacks for large N; this paper additionally provides concrete mitigation strategies

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The combination of the winner's curse perspective, BoP, and HedgeTune demonstrates high innovation with solid theoretical contributions
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers both verifiable reward and human preference settings with multiple reward models and datasets, though experiments with larger-scale LLMs are absent
  • Writing Quality: ⭐⭐⭐⭐⭐ Well-structured, with rigorous theoretical derivations and informative, aesthetically pleasing figures
  • Value: ⭐⭐⭐⭐⭐ Directly relevant to the safe deployment of inference-time alignment methods; BoP and HedgeTune are plug-and-play