REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning¶
Conference: ICLR 2026 arXiv: 2505.19862 Code: GitHub Area: Reinforcement Learning Keywords: overthinking in reasoning, reflection-awareness, online RL, GRPO, reasoning efficiency
TL;DR¶
REA-RL is a framework that employs a distilled small reflection model to detect and truncate overthinking tokens online, generating revised reasoning paths, and adds a reflection reward to keep the model from degrading into non-reflective vanilla CoT during RL training. On DeepSeek-R1-Distill-Qwen-7B, it reduces reasoning token consumption by 36% with no loss in accuracy.
Background & Motivation¶
Background: Large reasoning models (LRMs) such as DeepSeek-R1 and QwQ have achieved remarkable progress on complex tasks like mathematical reasoning through a paradigm of "deep thinking and self-reflection." These models repeatedly verify, reflect, and correct their outputs after generating answers, analogous to human iterative problem-solving. However, this capability introduces severe reasoning efficiency issues—models may reflect more than eight times on elementary-level math problems, consuming thousands of tokens.
Limitations of Prior Work: Existing approaches to overthinking fall into two categories, each with critical flaws. Offline data methods (e.g., SFT for short responses, Best-of-N sampling, or compact reasoning paths generated by stronger models) suffer from distribution shift: the static dataset increasingly diverges from the evolving model policy during training, and the data generation and filtering overhead alone exceeds twice that of standard sampling, making them unsuitable for online settings. Online RL with length rewards (e.g., Kimi K1.5's length-normalized reward) resolves distribution shift but introduces a more dangerous consequence—in pursuing shorter outputs, models completely abandon reflection and degrade to vanilla chain-of-thought (CoT), appearing efficient on simple problems while suffering sharply elevated error rates on complex ones.
Key Challenge: There is a fundamental conflict between length rewards and reflection quality. Reflection inherently requires additional tokens (e.g., "wait," "but," "let me check"), whereas length rewards penalize all long outputs indiscriminately, failing to distinguish "valuable reflection" from "meaningless repetition." With purely parallel sampling, the absence of short-yet-correct responses as positive examples leaves the model with no guidance other than the crude heuristic "shorter is better."
Key Insight: The authors identify two key observations: (1) detecting overthinking is not inherently complex; it suffices to locate the segment of the reasoning trace that first contains a correct answer, a task achievable by a fine-tuned weak model; (2) Snell et al. have demonstrated that parallel sampling combined with sequential revision is the compute-optimal test-time scaling strategy. This suggests that a lightweight model can supply truncated revision paths online during training, while a dedicated reward signal protects reflective behavior.
Core Idea: A distilled 7B reflection model truncates overthinking tokens online to generate revised paths (addressing the data problem), paired with a reflection keyword density reward that prevents reflection capability from degrading during RL training (addressing the reward problem). The two components are complementary, striking a balance between efficiency and performance.
Method¶
Overall Architecture¶
REA-RL extends the standard GRPO training pipeline along two dimensions. Data dimension: For each problem, \(G\) reasoning paths are sampled in parallel; the reflection model then identifies online the position in each path where a correct answer first appears, truncates the subsequent overthinking tokens, and lets the policy model complete the final answer, yielding \(G\) revised paths. The original and revised paths (\(2G\) in total) jointly participate in optimization. Reward dimension: In addition to accuracy and length rewards, a reflection reward penalizes responses whose reflection density is too low, preventing the model from abandoning reflection in pursuit of shorter outputs.
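To make the data flow concrete, here is a minimal Python sketch of one collection step. The `policy` and `reflection_model` objects and their methods (`generate`, `first_answer_position`, `complete`) are hypothetical stand-ins, not the authors' actual code, and the filtering of failed revisions described under Key Designs is omitted.

```python
def rea_rl_collect(policy, reflection_model, problem, G=4):
    """One REA-RL data-collection step (sketch): G sampled paths -> up to 2G paths.

    `policy` and `reflection_model` are hypothetical stand-ins for the policy LLM
    and the distilled 7B reflection model; their methods are assumed interfaces.
    """
    # Parallel sampling: G full reasoning paths for the same problem.
    originals = [policy.generate(problem) for _ in range(G)]

    revised = []
    for path in originals:
        # The reflection model marks where a correct answer first appears;
        # everything after that point is treated as overthinking.
        cut = reflection_model.first_answer_position(path)
        if cut is None:
            continue  # no answer found inside the think section; nothing to truncate
        truncated = path[:cut]
        # Force-close the think section (exact template is an assumption) and
        # let the policy model produce the final answer under the 16K budget.
        revised.append(policy.complete(problem,
                                       truncated + "\n</think>\nFinal Answer:",
                                       max_new_tokens=16_384))

    # Original and revised paths jointly enter GRPO; rewards are defined below.
    return originals, revised
```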
Key Designs¶
- Automatic Overthinking Detection and Response Revision:
- Function: Automatically locates the position in the reasoning trace where a correct answer first appears, marks all subsequent tokens as overthinking, and truncates them.
- Mechanism: The think section is segmented into multiple chunks by paragraph, and Qwen2.5-32B-Instruct evaluates each chunk to determine whether it contains a correct answer. A two-step verification reduces false positives: a first pass filters candidate chunks, and a second pass re-checks each one individually (see the detection sketch after this list). All content following the first confirmed chunk is designated as overthinking. After truncation, the think section is forcibly terminated, and the model generates the final answer with a "Final Answer:" prefix under a 16K token budget. Regular-expression matching is avoided because inconsistent answer formats and incidental mentions of answers mid-reasoning cause frequent false detections.
- Design Motivation: This provides high-quality overthinking annotations for both training the reflection model and defining the revision procedure. Experiments show the method automatically removes 24% of tokens without accuracy loss when no gold answer is provided, and 34% when a gold answer is available.
- Distilled Reflection Model (Online Sequential Revision):
- Function: During online RL training, rapidly generates truncated revised versions for each sampled path, extending parallel sampling into a hybrid "parallel + sequential" scaling strategy.
- Mechanism: DeepSeek-R1-7B first generates four reasoning paths per problem on the training set; the 32B model's two-step detection method annotates whether each chunk contains a correct answer, constructing SFT data. The 32B model's two-step judgment capability is then distilled into Qwen2.5-7B-Instruct—after training, the 7B model predicts each chunk's category (whether it contains an answer) in a single step. During online training, the reflection model identifies and truncates overthinking positions across \(G\) parallel sampled paths, retaining effective reasoning content while the policy model completes the final answer, yielding \(G\) revised paths \(S^r\). Both original and revised paths participate in GRPO optimization. This approach is equivalent to applying a partial penalty to overthinking tokens: when revision succeeds, the length reward ensures \(a_i^r > a_i\), and overthinking tokens receive negative advantage. Revised paths are discarded when revision changes a correct answer to incorrect, or when both original and revised are incorrect, to prevent adverse incentives.
- Design Motivation: Addresses both "offline data distribution shift" and "insufficient short correct samples in online parallel-only sampling." The reflection model's revised paths are generated online (distribution-consistent) and shorter than original paths (positive sample guidance), realizing the compute-optimal scaling strategy proven by Snell et al. Compared to simply expanding parallel sampling from 4 to 8 (Gen8), revised paths yield substantially greater gains under equivalent data volume.
- Reflection Reward (Preventing Non-Reflective Degradation):
- Function: Incorporates a reflection density signal into the reward function, penalizing responses with excessively low reflection frequency to prevent models from completely losing reflection capability under length pressure.
- Mechanism: The density of reflection keywords ("wait," "alternatively," "check," "but") in each response is computed as \(D_i = N_{\text{Reflect}} / N_{\text{Token}}\), with closely clustered keywords counted only once to avoid double-counting. Using the 0.2 quantile \(D_{0.2}\) of the training data as the threshold, the reflection reward is defined as \(R_{\text{Reflect}}(s_i) = \min(0, D_i / D_{0.2} - 1)\): a penalty is applied only when density falls below the bottom 20%; responses with normal or higher density are unaffected (reward = 0). This ensures the reflection reward does not inadvertently encourage overthinking (see the reward sketch after this list). Additionally, the length reward from Kimi K1.5 is improved by setting the length reward for incorrect responses to 0 (the original formulation still grants partial rewards for incorrect responses, which misleads the model into pursuing short outputs even on hard problems).
- Design Motivation: When length rewards alone are used, reflection density on simple problems plummets to one-eighth of baseline (reflection interval on GSM8K increases from 105 to 814 tokens), with corresponding performance degradation. The reflection reward implements a "floor protection" strategy—not requiring more reflection, but prohibiting its complete absence—achieving a balance between efficiency and quality.
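A minimal sketch of the two-step detection from the first design item above. `judge_filter` and `judge_verify` are hypothetical wrappers around LLM calls (the paper uses Qwen2.5-32B-Instruct for both passes), and splitting chunks on blank lines is an illustrative assumption.

```python
def detect_overthinking(think_text: str, judge_filter, judge_verify) -> str:
    """Two-step overthinking detection (sketch): keep text up to and including
    the first chunk confirmed to contain a correct answer.

    `judge_filter` and `judge_verify` are assumed callables (LLM wrappers) that
    return True when a chunk appears to contain a correct answer.
    """
    # Segment the think section into paragraph-level chunks.
    chunks = [c for c in think_text.split("\n\n") if c.strip()]
    # Pass 1: flag candidate chunks that look like they contain an answer.
    candidates = [i for i, c in enumerate(chunks) if judge_filter(c)]
    # Pass 2: re-check each candidate individually to reduce false positives.
    for i in candidates:
        if judge_verify(chunks[i]):
            # Everything after the first confirmed chunk counts as overthinking.
            return "\n\n".join(chunks[: i + 1])
    return think_text  # no confirmed answer; nothing to truncate
```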
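A minimal Python sketch of the reflection reward from the third design item. The keyword list and the \(\min(0, D_i / D_{0.2} - 1)\) form follow the paper; whitespace tokenization and the 20-word clustering window are illustrative assumptions.

```python
import re

REFLECT_KEYWORDS = {"wait", "alternatively", "check", "but"}  # keywords from the paper

def reflection_density(text: str, cluster_window: int = 20) -> float:
    """D_i = N_Reflect / N_Token, counting a cluster of nearby keywords once.

    Whitespace tokenization and the 20-word cluster window are assumptions.
    """
    words = text.lower().split()
    hits = [i for i, w in enumerate(words) if re.sub(r"\W+", "", w) in REFLECT_KEYWORDS]
    n_reflect, last = 0, -cluster_window
    for i in hits:
        if i - last >= cluster_window:  # keywords closer than the window count once
            n_reflect += 1
            last = i
    return n_reflect / max(len(words), 1)

def reflection_reward(d_i: float, d_02: float) -> float:
    """R_Reflect = min(0, D_i / D_0.2 - 1): penalize only unusually low density."""
    return min(0.0, d_i / d_02 - 1.0)
```

Here `d_02` would be precomputed as the 0.2 quantile of \(D_i\) over the training data, so responses with typical or higher reflection density incur no penalty.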
Loss & Training¶
The total reward is the sum of three components: \(R = R_{\text{Acc}} + R_{\text{RLen}} + R_{\text{Reflect}}\). \(R_{\text{Acc}}\) is the rule-verified accuracy reward (1 if correct, 0 if incorrect); \(R_{\text{RLen}}\) is the improved length reward, \(\lambda = 1 - (\mathrm{len} - \mathrm{min\_len})/(\mathrm{max\_len} - \mathrm{min\_len})\) for correct responses and 0 for incorrect ones; \(R_{\text{Reflect}}\) is the reflection density reward. Under the GRPO framework, advantages are computed via within-group reward normalization across the \(2G\) paths to optimize the policy model. Training runs on three NVIDIA A800 80G GPUs for approximately 120 hours, roughly a 50% overhead over standard GRPO (mainly from sequential reflection-model inference and revision generation), partially offset by overlapping training with data generation via deferred updates to the vLLM inference model.
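A compact sketch of the reward composition and the within-group advantage normalization, assuming NumPy, a boolean correctness vector, the \(2G\)-response group as the span for the length min/max, and a small epsilon in the normalization; these framing details are assumptions rather than the paper's exact implementation.

```python
import numpy as np

def group_rewards_and_advantages(correct, lengths, densities, d_02):
    """R = R_Acc + R_RLen + R_Reflect for one group of 2G responses (sketch).

    `correct`: booleans, `lengths`: token counts, `densities`: reflection
    densities D_i, `d_02`: the 0.2-quantile density threshold.
    """
    correct = np.asarray(correct, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    densities = np.asarray(densities, dtype=float)

    r_acc = correct                                   # 1 if correct, else 0
    span = max(lengths.max() - lengths.min(), 1.0)    # guard against equal lengths
    lam = 1.0 - (lengths - lengths.min()) / span
    r_len = np.where(correct > 0, lam, 0.0)           # incorrect responses get 0
    r_reflect = np.minimum(0.0, densities / d_02 - 1.0)

    rewards = r_acc + r_len + r_reflect
    # GRPO-style advantage: normalize rewards within the group of 2G paths.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return rewards, advantages
```

Per the paper, revised paths that turn a correct answer into an incorrect one, or where both the original and revised answers are incorrect, are discarded before this step to avoid rewarding failed revisions.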
Key Experimental Results¶
Main Results (16K budget, comparison with baselines; TR = token ratio relative to the original R1-7B)¶
| Method | GSM8K↑ | Math500↑ | Gaokao23↑ | Amc23↑ | AIME24↑ | Avg. Acc | Avg. TR↓ |
|---|---|---|---|---|---|---|---|
| R1-7B (original) | 91.66 | 92.00 | 81.82 | 88.12 | 48.33 | 80.39 | 100% |
| GRPO (accuracy reward only) | 92.87 | 93.40 | 82.34 | 87.19 | 50.42 | 81.24 | 102% |
| GRPO + Kimi length reward | 85.97 | 88.40 | 72.21 | 87.81 | 50.00 | 76.88 | 57% |
| NoThink | 85.06 | 84.20 | 66.23 | 66.25 | 23.33 | 65.01 | 27% |
| ShorterBetter | 85.37 | 83.40 | 66.75 | 76.88 | 50.00 | 72.48 | 31% |
| DAST | 87.79 | 91.40 | 83.12 | 88.44 | 51.67 | 80.48 | 76% |
| Arora & Zanette | 90.45 | 93.20 | 77.14 | 86.88 | 48.33 | 79.20 | 71% |
| REA-RL Reflection Reward | 92.72 | 92.80 | 81.82 | 88.75 | 54.58 | 82.13 | 72% |
| REA-RL Reflection Model | 89.23 | 92.40 | 79.74 | 88.12 | 47.92 | 79.48 | 52% |
| REA-RL Combined | 89.99 | 91.00 | 82.08 | 89.38 | 51.25 | 80.74 | 64% |
Ablation Study (Revision strategy comparison, 16K budget)¶
| Revision Strategy | Avg. Acc↑ | Avg. TR↓ | Notes |
|---|---|---|---|
| Original R1-7B | 80.39 | 100% | No revision baseline |
| 7B Revise (untrained) | 75.77 | 69.65% | Qwen-7B direct two-step detection; 17% output format errors |
| 32B Revise (no gold) | 80.61 | 75.65% | Qwen-32B two-step detection; effective but high inference cost |
| Reflection Model-Weak | 80.95 | 88.03% | Truncates at second correct answer; conservative strategy |
| Reflection Model-Normal | 80.62 | 83.50% | Truncates at first correct answer; used during training |
| Reflection Model-Strong | 80.14 | 78.52% | Truncates once a chunk's predicted answer probability exceeds 0.25; aggressive strategy |
| Fixed Trunc (same ratio) | 76.86 | 79.68% | Fixed-ratio truncation at equivalent compression; significant performance drop |
| GRPO Gen8 (parallel scaling only) | 81.35 | 97.55% | Expands to 8-way parallel sampling; no efficiency improvement |
Key Findings¶
- Length reward is a double-edged sword: Applying the Kimi K1.5 length reward alone reduces average accuracy from 80.39 to 76.88 under a 16K budget. While AIME24 sees a slight improvement (50.00 vs. 48.33), suggesting hard problems benefit from compression, simple problem performance collapses (GSM8K −5.69), indicating the model loses necessary reasoning steps.
- Reflection density is a direct indicator of degradation: After length-reward training, the reflection interval on GSM8K increases from 105.48 to 813.96 tokens, meaning the model reflects only about once every 800+ tokens and has effectively abandoned reflection. REA-RL Combined keeps the interval at 150.87 tokens, only moderately above that of the original model.
- Reflection model and reflection reward are complementary but not synergistic: The reflection model excels at compression (TR 52%), while the reflection reward excels at preserving accuracy (highest Acc at 82.13). Their combination yields intermediate values (TR 64%, Acc 80.74) rather than optimizing both directions simultaneously—owing to the inherent tension between the reflection model's objective (penalizing overthinking) and the reflection reward's objective (protecting reflection).
- Difficulty-adaptive behavior: REA-RL reduces reflection frequency by an average of 22% and improves efficiency by 45% on simple problems (GSM8K/Math500/Gaokao23), but reduces reflection by only 4% and improves efficiency by 27% on hard problems (Amc23/AIME24). The model learns a "think less on easy problems, think appropriately on hard ones" strategy.
- Online training substantially outperforms offline: RPO, which uses offline training on 32B-generated revision data, achieves only 82.19 on Amc23 and 42.92 on AIME24 under a 16K budget, compared to 89.38 and 51.25 for REA-RL combined, confirming distribution shift as the bottleneck of offline methods.
Highlights & Insights¶
- Distillation-based overthinking detection: The complex two-step detection capability of the 32B model is distilled into a lightweight one-step detection by a 7B model, enabling real-time online revision. This "large model annotates, small model executes" paradigm substantially outperforms direct prompting of small models (17% format errors without training; post-training performance approaches the 32B model), and is generalizable to other online scenarios requiring large-model judgment.
- "Floor protection" design of the reflection reward: Using the 0.2 quantile as the threshold rather than encouraging more reflection precisely addresses the problem of "length rewards killing reflection" without introducing over-reflection as a side effect. Experiments show that any quantile ≤ 0.2 yields comparable results, with smaller quantiles favoring efficiency and larger quantiles favoring accuracy, providing practitioners with flexible tuning flexibility.
- Theoretical analysis: revision as partial advantage assignment: The paper demonstrates that online revision is not merely data augmentation, but is equivalent to assigning negative partial advantage to overthinking tokens and positive partial advantage to revised tokens. This provides a deeper optimization perspective beyond "removing redundancy" and explains why failed revisions must be discarded (otherwise the reversed advantage direction would incentivize overthinking).
- Temporal dynamics revealed by training curves: Figure 3 shows that the reflection model causes severe performance degradation in the first 1,000 steps (due to overly aggressive truncation), but subsequently the model recovers and ultimately surpasses the baseline in both performance and efficiency, since the penalty targets only overthinking tokens rather than effective reflection tokens. This indicates the method requires sufficiently long training to converge, and short-term evaluation may underestimate its effectiveness.
Limitations & Future Work¶
- Validation limited to distilled 7B models: The method has not been tested on natively pretrained LRMs (e.g., the full DeepSeek-R1), leaving its effectiveness on large-scale native reasoning models unconfirmed. The authors acknowledge this limitation stems from the prohibitive training time of larger models.
- Overthinking detection relies on LLM judgment: The detection method relies on LLM assessment of whether chunks contain correct answers, and cannot guarantee complete elimination of all overthinking. For open-ended tasks where answers cannot be rule-verified (e.g., code generation, text summarization), the detection criteria would need to be redesigned.
- Language limitation of the reflection keyword list: The reflection reward depends on English keywords ("wait," "but," "alternatively," "check"), which may fail for multilingual reasoning models or reasoning styles that do not employ these keywords. A more robust approach may require replacing keyword matching with a semantic classifier.
- No synergistic gain from the combined method: The objectives of the reflection model (truncating overthinking) and the reflection reward (preserving reflection capability) are inherently in tension, yielding a compromise rather than a synergistic amplification when combined. Future work may explore dynamically adjusting the weight of each component—emphasizing reflection protection in early training and gradually increasing efficiency pressure in later stages.
Related Work & Insights¶
- vs. NoThink: NoThink bypasses all reasoning via prompting, achieving extreme efficiency (TR 27%) at the cost of accuracy collapsing to 65.01, demonstrating that completely removing reasoning is infeasible. REA-RL's value lies in "selectively preserving effective reasoning."
- vs. DAST: DAST pre-estimates a length budget based on problem difficulty, a concept similar to REA-RL's difficulty-adaptive behavior, but requires an additional difficulty estimator. REA-RL's difficulty awareness emerges naturally through training—an implicit benefit of RL optimization.
- vs. ShorterBetter: ShorterBetter uses the shortest correct response as the length anchor, achieving TR as low as 31% but at the cost of accuracy dropping to 72.48. The issue lies in using the most extreme short samples as anchors, resulting in over-compression. REA-RL's revision strategy is more conservative—it truncates only content appearing after a correct answer has already been reached, preserving the complete reasoning chain leading to the answer.
- vs. Kimi K1.5 length reward: REA-RL introduces a critical improvement over K1.5's length reward: the length reward for incorrect responses is set to 0 (the original formulation still grants partial rewards of min(0.5, λ) for incorrect responses), avoiding erroneous encouragement of short outputs on hard problems.
Rating¶
- Novelty: ⭐⭐⭐⭐ The dual-component solution combining the reflection model and reflection reward is systematic, though individual components (distillation-based detection, keyword reward) are not entirely novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five datasets spanning varying difficulty levels, comprehensive ablation, reflection density analysis, training dynamics, multi-baseline comparison, and revision strategy comparison.
- Writing Quality: ⭐⭐⭐⭐⭐ A coherent logical arc from motivating examples in Figure 1, through systematic analysis of the failure modes of two prior approaches, to the incremental construction of the proposed solution.
- Value: ⭐⭐⭐⭐⭐ The practical value of a 36% efficiency improvement with zero performance loss is high; the reflection reward and partial advantage analysis provide transferable theoretical insights.