Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

Conference: ACL 2026 arXiv: 2604.21327 Code: https://github.com/yuyongcan/DDRL Area: Math Reasoning Keywords: Test-Time Reinforcement Learning, Pseudo-Label Noise, GRPO Bias, Denoising Debiasing, Math Reasoning

TL;DR

This paper systematically analyzes the sources and amplification mechanisms of spurious signals in test-time RL (TTRL): mid-frequency answers form an ambiguous zone that is the primary noise source, and GRPO's within-group normalization then amplifies these spurious signals. It proposes DDRL, which combines balanced sampling, fixed advantage values, and consensus offline refinement, achieving a 15.3% relative improvement on Qwen2.5-Math-1.5B.

Background & Motivation

Key Challenge: (1) Source level — the relationship between answer frequency and reliability is nonlinear: high-frequency answers are mostly correct, low-frequency answers mostly wrong, and mid-frequency answers highly ambiguous; (2) Amplification level — GRPO's within-group normalization assigns extreme advantage values when positive samples are scarce. In supervised RL this is reasonable, but in TTRL few positive samples signal low consensus and high pseudo-label uncertainty, so precisely the least reliable samples receive the largest gradient weight.
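The amplification mechanism can be seen in a minimal sketch of GRPO-style within-group normalization (the function name and reward setup are illustrative, not from the paper's code): with one positive among eight rollouts, the lone positive receives a far larger advantage than any sample in a balanced group.

```python
import statistics

def grpo_advantages(rewards):
    """Within-group normalization: A_i = (r_i - mean) / std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

rare_positive = [1, 0, 0, 0, 0, 0, 0, 0]  # low consensus: 1/8 match the pseudo-label
balanced      = [1, 1, 1, 1, 0, 0, 0, 0]  # high-consensus split

print(max(grpo_advantages(rare_positive)))  # ~2.65: the least reliable case gets the largest weight
print(max(grpo_advantages(balanced)))       # 1.0
```

Under a verified reward, a rare positive deserves that extra weight; under a pseudo-label, rarity means uncertainty, which is exactly the assumption violation the paper identifies.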

Method

Key Designs

  1. Balanced Confidence Sampling: Positive samples select top-\(K^+\) highest frequency pseudo-label matches (capped at \(\lfloor K/2 \rfloor\)); negative samples select \(K^-\) lowest frequency samples; mid-frequency ambiguous zone is entirely discarded.

  2. Debiased Advantage Estimation: Fixes advantage values at \(A_i = +1\) (positive) or \(A_i = -1\) (negative), eliminating the "fewer positives → larger advantage → most unreliable samples get maximum weight" vicious cycle.

  3. Consensus Offline Refinement: Post-RL SFT refinement using rejection-sampled datasets with high-consensus answers.
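Designs 1 and 2 can be sketched together as follows; the function name, data shapes, and tie-breaking are assumptions for illustration, not the paper's implementation:

```python
from collections import Counter

def balanced_sample(answers, pseudo_label, K):
    """Balanced confidence sampling with debiased advantages.
    Positives: up to K//2 highest-frequency answers matching the pseudo-label.
    Negatives: the lowest-frequency answers. The mid-frequency ambiguous
    zone is discarded, and advantages are fixed at +1 / -1."""
    freq = Counter(answers)
    ranked = sorted(range(len(answers)),
                    key=lambda i: freq[answers[i]], reverse=True)
    pos = [i for i in ranked if answers[i] == pseudo_label][: K // 2]
    neg = [i for i in reversed(ranked) if i not in pos][: K - len(pos)]
    # Fixed advantage values instead of within-group normalization
    return {i: +1.0 for i in pos} | {i: -1.0 for i in neg}

answers = ["42", "42", "42", "17", "17", "9", "5", "3"]
adv = balanced_sample(answers, pseudo_label="42", K=4)
# Two high-frequency "42" rollouts get +1, two rare answers get -1,
# and the mid-frequency "17" rollouts are dropped entirely.
```

The fixed +1/-1 values break the "fewer positives → larger advantage" cycle by construction: every selected sample carries the same magnitude of credit regardless of group composition.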
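Design 3, the post-RL refinement stage, amounts to rejection sampling by consensus. A minimal sketch, assuming a per-problem list of (solution, answer) rollouts and a hypothetical consensus threshold (neither taken from the paper's code):

```python
from collections import Counter

def consensus_filter(rollouts, threshold=0.75):
    """rollouts: list of (solution_text, final_answer) for one problem.
    Keep solutions whose answer is the majority answer and whose consensus
    ratio clears the threshold; these become SFT targets for refinement."""
    counts = Counter(ans for _, ans in rollouts)
    answer, n = counts.most_common(1)[0]
    if n / len(rollouts) < threshold:
        return []  # low consensus: this problem contributes nothing
    return [sol for sol, ans in rollouts if ans == answer]

rollouts = [("steps A", "42")] * 7 + [("steps B", "17")]
sft_data = consensus_filter(rollouts)  # 7/8 consensus clears 0.75, so kept
```

Problems without a dominant answer are simply excluded, so the SFT stage only ever reinforces high-consensus behavior.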

Key Experimental Results

| Model/Method | AIME2024 | MATH-500 | Relative Gain |
| --- | --- | --- | --- |
| Qwen2.5-Math-1.5B + TTRL | 15.8 | 73.0 | - |
| Qwen2.5-Math-1.5B + DDRL | 18.2 | 84.2 | +15.3% |

Key Findings

  • Removing GRPO normalization alone raises AIME2024 from 15.8% to 20.6%
  • All three DDRL components contribute independent gains and are stackable
  • Consistent improvement across three different LLM scales

Highlights & Insights

  • The "frequency-reliability" analysis is thorough, clearly locating spurious signal sources in the mid-frequency zone
  • Reveals that "reasonable assumptions in supervised RL are violated in unsupervised TTRL" — insightful for all GRPO-based unsupervised methods
  • Fixed \(+1/-1\) advantage values outperform complex normalization, embodying "simplicity is more robust in noisy environments"

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐