Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

Conference: ACL 2026 arXiv: 2604.21327 Code: https://github.com/yuyongcan/DDRL Area: Math Reasoning Keywords: Test-Time Reinforcement Learning, Pseudo-Label Noise, GRPO Bias, Denoising Debiasing, Math Reasoning

TL;DR

This paper systematically analyzes the sources and amplification mechanisms of spurious signals in test-time RL (TTRL): mid-frequency answers form an ambiguous zone that is the primary noise source, and GRPO's within-group normalization then amplifies these spurious signals. It proposes DDRL, which combines balanced sampling, fixed advantage values, and consensus offline refinement, achieving a 15.3% relative improvement on Qwen2.5-Math-1.5B.

Background & Motivation

Key Challenge: (1) Source level — the relationship between answer frequency and reliability is nonlinear: high-frequency answers are mostly correct, low-frequency answers mostly wrong, and mid-frequency answers highly ambiguous; (2) Amplification level — GRPO's within-group normalization assigns extreme advantage values when positive samples are scarce. In supervised RL this is reasonable, but in TTRL few positive samples signal low consensus and high pseudo-label uncertainty, so precisely the least reliable samples receive the largest gradient weight.
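The amplification mechanism can be seen in a minimal sketch of GRPO-style within-group normalization (the function name and reward setup are illustrative, not from the paper's code): with one positive among eight rollouts, the lone positive receives a far larger advantage than any sample in a balanced group.

```python
import statistics

def grpo_advantages(rewards):
    """Within-group normalization: A_i = (r_i - mean) / std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

rare_positive = [1, 0, 0, 0, 0, 0, 0, 0]  # low consensus: 1/8 match the pseudo-label
balanced      = [1, 1, 1, 1, 0, 0, 0, 0]  # high-consensus split

print(max(grpo_advantages(rare_positive)))  # ~2.65: the least reliable case gets the largest weight
print(max(grpo_advantages(balanced)))       # 1.0
```

Under a verified reward, a rare positive deserves that extra weight; under a pseudo-label, rarity means uncertainty, which is exactly the assumption violation the paper identifies.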

Method

Key Designs

  1. Balanced Confidence Sampling: Positive samples select top-\(K^+\) highest frequency pseudo-label matches (capped at \(\lfloor K/2 \rfloor\)); negative samples select \(K^-\) lowest frequency samples; mid-frequency ambiguous zone is entirely discarded.

  2. Debiased Advantage Estimation: Fixes advantage values at \(A_i = +1\) (positive) or \(A_i = -1\) (negative), eliminating the "fewer positives → larger advantage → most unreliable samples get maximum weight" vicious cycle.

  3. Consensus Offline Refinement: Post-RL SFT refinement using rejection-sampled datasets with high-consensus answers.
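Designs 1 and 2 can be sketched together as follows; the function name, data shapes, and tie-breaking are assumptions for illustration, not the paper's implementation:

```python
from collections import Counter

def balanced_sample(answers, pseudo_label, K):
    """Balanced confidence sampling with debiased advantages.
    Positives: up to K//2 highest-frequency answers matching the pseudo-label.
    Negatives: the lowest-frequency answers. The mid-frequency ambiguous
    zone is discarded, and advantages are fixed at +1 / -1."""
    freq = Counter(answers)
    ranked = sorted(range(len(answers)),
                    key=lambda i: freq[answers[i]], reverse=True)
    pos = [i for i in ranked if answers[i] == pseudo_label][: K // 2]
    neg = [i for i in reversed(ranked) if i not in pos][: K - len(pos)]
    # Fixed advantage values instead of within-group normalization
    return {i: +1.0 for i in pos} | {i: -1.0 for i in neg}

answers = ["42", "42", "42", "17", "17", "9", "5", "3"]
adv = balanced_sample(answers, pseudo_label="42", K=4)
# Two high-frequency "42" rollouts get +1, two rare answers get -1,
# and the mid-frequency "17" rollouts are dropped entirely.
```

The fixed +1/-1 values break the "fewer positives → larger advantage" cycle by construction: every selected sample carries the same magnitude of credit regardless of group composition.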
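Design 3, the post-RL refinement stage, amounts to rejection sampling by consensus. A minimal sketch, assuming a per-problem list of (solution, answer) rollouts and a hypothetical consensus threshold (neither taken from the paper's code):

```python
from collections import Counter

def consensus_filter(rollouts, threshold=0.75):
    """rollouts: list of (solution_text, final_answer) for one problem.
    Keep solutions whose answer is the majority answer and whose consensus
    ratio clears the threshold; these become SFT targets for refinement."""
    counts = Counter(ans for _, ans in rollouts)
    answer, n = counts.most_common(1)[0]
    if n / len(rollouts) < threshold:
        return []  # low consensus: this problem contributes nothing
    return [sol for sol, ans in rollouts if ans == answer]

rollouts = [("steps A", "42")] * 7 + [("steps B", "17")]
sft_data = consensus_filter(rollouts)  # 7/8 consensus clears 0.75, so kept
```

Problems without a dominant answer are simply excluded, so the SFT stage only ever reinforces high-consensus behavior.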

Key Experimental Results

| Model/Method | AIME2024 | MATH-500 | Relative Gain |
| --- | --- | --- | --- |
| Qwen2.5-Math-1.5B + TTRL | 15.8 | 73.0 | - |
| Qwen2.5-Math-1.5B + DDRL | 18.2 | 84.2 | +15.3% |

Key Findings

  • Removing GRPO normalization alone raises AIME2024 from 15.8% to 20.6%
  • All three DDRL components contribute independent gains and are stackable
  • Consistent improvement across three different LLM scales

Highlights & Insights

  • The "frequency-reliability" analysis is thorough, clearly locating spurious signal sources in the mid-frequency zone
  • Reveals that "reasonable assumptions in supervised RL are violated in unsupervised TTRL" — insightful for all GRPO-based unsupervised methods
  • Fixed \(+1/-1\) advantage values outperform complex normalization, embodying "simplicity is more robust in noisy environments"

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐