# Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
Conference: ACL 2026 | arXiv: 2604.21327 | Code: https://github.com/yuyongcan/DDRL | Area: Math Reasoning | Keywords: Test-Time Reinforcement Learning, Pseudo-Label Noise, GRPO Bias, Denoising Debiasing, Math Reasoning
## TL;DR
This paper systematically analyzes the sources and amplification mechanisms of spurious signals in test-time reinforcement learning (TTRL): mid-frequency answers form an ambiguous zone that is the primary noise source, and GRPO's within-group normalization further amplifies these spurious signals. The proposed DDRL counters both with balanced sampling, fixed advantage values, and consensus offline refinement, achieving a 15.3% relative improvement on Qwen2.5-Math-1.5B.
## Background & Motivation
Key Challenge: (1) At the source level, the relationship between answer frequency and reliability is nonlinear: high-frequency answers are mostly correct, low-frequency answers mostly wrong, and mid-frequency answers highly ambiguous. (2) At the amplification level, GRPO's within-group normalization assigns extreme advantage values whenever positive samples are scarce. In supervised RL this is reasonable, since rare correct solutions to a hard problem deserve large weight; but in TTRL the reward is agreement with a majority-vote pseudo-label, so scarce positives signal low consensus and a likely-wrong pseudo-label, and normalization hands exactly these least reliable samples the largest advantages.
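A quick worked example makes the amplification concrete (assuming binary rewards \(r_i \in \{0, 1\}\) and the standard GRPO group normalization \(A_i = (r_i - \bar{r})/\sigma_r\)): with \(G = 8\) rollouts, if 6 match the pseudo-label, each positive gets \(A \approx +0.58\); if only 2 match, each positive gets \(A \approx +1.73\). The normalized advantage on positives is largest precisely when consensus is lowest.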
## Method

### Key Designs
- **Balanced Confidence Sampling:** positives are the top-\(K^+\) responses matching the highest-frequency pseudo-label, capped at \(\lfloor K/2 \rfloor\); negatives are the \(K^-\) lowest-frequency responses; the mid-frequency ambiguous zone is discarded entirely (see the sketch after this list).
- **Debiased Advantage Estimation:** advantages are fixed at \(A_i = +1\) (positive) or \(A_i = -1\) (negative), breaking the "fewer positives → larger advantage → most unreliable samples get maximum weight" vicious cycle.
- **Consensus Offline Refinement:** a post-RL SFT stage on a rejection-sampled dataset restricted to high-consensus answers (see the second sketch below).
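To make the first two designs concrete, here is a minimal sketch, assuming binary match-the-pseudo-label rewards over one group of rollouts; all names (`select_training_samples`, `k_pos`, `k_neg`) are illustrative and not taken from the paper's code.

```python
from collections import Counter

def select_training_samples(answers, k_pos, k_neg):
    """Balanced confidence sampling + debiased advantages (illustrative sketch).

    answers: final answers extracted from one group of K rollouts.
    Returns (indices, advantages): high-frequency pseudo-label matches get
    A = +1, the lowest-frequency answers get A = -1, and the mid-frequency
    ambiguous zone is discarded entirely.
    """
    k = len(answers)
    freq = Counter(answers)
    pseudo_label = freq.most_common(1)[0][0]  # majority-vote pseudo-label

    # Rank rollouts by how frequent their answer is (descending).
    by_freq = sorted(range(k), key=lambda i: freq[answers[i]], reverse=True)

    # Positives: top-K+ rollouts matching the pseudo-label, capped at floor(K/2).
    matches = [i for i in by_freq if answers[i] == pseudo_label]
    positives = matches[: min(k_pos, k // 2)]

    # Negatives: the K- lowest-frequency rollouts (most likely genuinely wrong).
    non_positives = [i for i in by_freq if i not in positives]
    negatives = non_positives[-k_neg:]

    # Debiased advantage estimation: fixed +/-1 instead of group normalization,
    # so scarce positives no longer receive inflated weight.
    indices = positives + negatives
    advantages = [+1.0] * len(positives) + [-1.0] * len(negatives)
    return indices, advantages
```

For example, with `answers = ["4", "4", "4", "9", "9", "7", "2", "5"]` and `k_pos = k_neg = 2`, two of the `"4"` rollouts train with \(A = +1\), the singleton answers `"2"` and `"5"` train with \(A = -1\), and the mid-frequency `"9"` rollouts are dropped.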
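Consensus offline refinement is essentially rejection-sampled SFT. A rough sketch under assumed details: the `0.5` consensus threshold, the `extract_answer` helper, and the `sample_fn` interface are illustrative, since the paper's concrete recipe is not reproduced here.

```python
from collections import Counter

def extract_answer(response: str) -> str:
    """Assumed helper: pull the final answer (here, the last line) from a response."""
    return response.strip().splitlines()[-1]

def build_refinement_set(questions, sample_fn, n_samples=16, min_consensus=0.5):
    """Build a rejection-sampled SFT set from high-consensus questions only."""
    sft_data = []
    for q in questions:
        responses = [sample_fn(q) for _ in range(n_samples)]
        votes = Counter(extract_answer(r) for r in responses)
        consensus_answer, count = votes.most_common(1)[0]
        if count / n_samples >= min_consensus:  # keep high-consensus questions only
            # Rejection sampling: retain only responses agreeing with consensus.
            sft_data.extend((q, r) for r in responses
                            if extract_answer(r) == consensus_answer)
    return sft_data
```

The post-RL model is then fine-tuned on the resulting `sft_data` with standard SFT.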
## Key Experimental Results
| Model / Method | AIME2024 (%) | MATH-500 (%) | Relative Gain |
|---|---|---|---|
| Qwen2.5-Math-1.5B + TTRL | 15.8 | 73.0 | - |
| Qwen2.5-Math-1.5B + DDRL | 18.2 | 84.2 | +15.3% |
## Key Findings
- Removing GRPO normalization alone raises AIME2024 from 15.8% to 20.6%
- All three DDRL components contribute independent gains and are stackable
- Consistent improvement across three different LLM scales
## Highlights & Insights
- The "frequency-reliability" analysis is thorough, clearly locating spurious signal sources in the mid-frequency zone
- Reveals that "reasonable assumptions in supervised RL are violated in unsupervised TTRL" — insightful for all GRPO-based unsupervised methods
- Fixed \(+1/-1\) advantage values outperform the more complex normalization, supporting the principle that simpler estimators are more robust in noisy environments
## Rating
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐