Efficient Reasoning with Balanced Thinking¶
Conference: ICLR 2026 arXiv: 2603.12372 Code: GitHub Area: Model Compression / Efficient Inference Keywords: Large language model reasoning, overthinking, underthinking, hidden-state steering, training-free acceleration
TL;DR¶
This paper proposes ReBalance, a training-free framework that simultaneously mitigates overthinking and underthinking in large reasoning models (LRMs) via confidence-guided dynamic hidden-state steering vectors, achieving joint improvements in both reasoning efficiency and accuracy.
Background & Motivation¶
- Background: Large reasoning models (e.g., DeepSeek-R1, QwQ) have acquired powerful reasoning capabilities through SFT and RL training, yet face significant computational efficiency challenges in practical deployment.
- Limitations of Prior Work: LRMs exhibit two opposing failure modes — overthinking: expending redundant reasoning steps on simple problems; and underthinking: prematurely converging on complex problems without sufficiently exploring the reasoning path.
- Key Challenge: Existing methods for mitigating overthinking (e.g., suppressing reflective keywords, adjusting reasoning length) tend to induce underthinking, forming an inherent trade-off between the two. As shown in Fig. 2(a), prior methods shorten the reasoning of correctly and incorrectly solved samples alike, a sign that underthinking has been introduced.
- Goal: To alleviate overthinking without inducing underthinking, thereby achieving balanced reasoning.
- Key Insight: The paper observes that stepwise confidence and confidence variance can serve as continuous indicators of reasoning state — high variance reflects hesitation and path-switching (overthinking), while persistently high confidence reflects premature commitment (underthinking).
- Core Idea: Confidence signals identify the current reasoning state; a hidden-state steering vector pointing from the underthinking prototype toward the overthinking prototype is constructed; a dynamic control function then modulates the steering direction and magnitude based on real-time confidence.
Method¶
Overall Architecture¶
ReBalance operates in two phases, offline and online (a hook-based sketch of the online injection follows this list):

1. Offline Phase: A single forward pass is performed on a small-scale dataset to identify overthinking/underthinking steps, extract hidden-state prototypes, compute the steering vector, and fit the dynamic control function.
2. Online Phase: During inference, the dynamic control function computes a steering weight from real-time confidence, and the weighted steering vector is injected into the hidden states.
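As a rough illustration of the online phase, the minimal PyTorch sketch below injects a precomputed steering vector \(\mathbf{v}\) through a forward hook on a deep decoder layer. This is an assumption-laden sketch, not the paper's implementation: the layer attribute path and the `get_alpha` callable (returning the current confidence-dependent weight, with 0 meaning "do not steer at this position") are hypothetical.

```python
import torch

def make_steering_hook(v: torch.Tensor, get_alpha):
    """Forward hook that adds alpha_s * v to the hidden state of the token
    currently being generated; `get_alpha` is a hypothetical callable that
    returns the confidence-dependent weight (0 = leave the state untouched)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        alpha = get_alpha()
        if alpha != 0.0:
            hidden = hidden.clone()
            # Steer the newest position: during incremental decoding this is
            # the first token of the step currently being generated.
            hidden[:, -1, :] += alpha * v.to(hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# Hypothetical attachment to the second-to-last decoder layer of a
# Hugging Face-style causal LM (deep layers work best per the paper):
# handle = model.model.layers[-2].register_forward_hook(make_steering_hook(v, get_alpha))
# ... run generation ...
# handle.remove()
```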
Key Designs¶
1. Explicit Modeling of Overthinking and Underthinking¶
- Function: Classifies reasoning steps into an overthinking set \(\mathcal{O}\) and an underthinking set \(\mathcal{U}\) based on confidence-derived metrics.
- Mechanism: The step-level confidence is defined as \(c_s = \exp\left(\frac{1}{|\mathcal{T}_s|}\sum_{t \in \mathcal{T}_s} \ln p_t^{\max}\right)\), and the confidence variance within a sliding window is \(v_s = \operatorname{Var}(c_s; \mathcal{W}_s)\). Empirical quantile thresholds \(\tau_c^L, \tau_c^H, \tau_v^L, \tau_v^H\) are used to categorize steps (a classification sketch follows this list) as:
- Overthinking \(\mathcal{O} = \{s: c_s \leq \tau_c^L \wedge v_s \geq \tau_v^H\}\) (low confidence, high variance)
- Underthinking \(\mathcal{U} = \{s: c_s \geq \tau_c^H \wedge v_s \leq \tau_v^L\}\) (high confidence, low variance)
- Design Motivation: The correspondence between confidence patterns and reasoning states is empirically validated in Fig. 2(b).
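A minimal NumPy sketch of this classification, assuming a trailing sliding window for \(v_s\) and illustrative quantile levels (the paper sets the thresholds empirically from calibration data):

```python
import numpy as np

def step_confidence(token_max_probs):
    """c_s = exp(mean_t ln p_t^max): geometric mean of the per-token
    maximum probabilities within one reasoning step."""
    return float(np.exp(np.mean(np.log(token_max_probs))))

def classify_steps(confidences, window=5, q_low=0.2, q_high=0.8):
    """Label each step 'O' (overthinking), 'U' (underthinking), or 'N'.
    The trailing window for v_s and the quantile levels are assumptions
    made for illustration."""
    c = np.asarray(confidences, dtype=float)
    # v_s: variance of step confidences inside a sliding window
    v = np.array([c[max(0, i - window + 1): i + 1].var() for i in range(len(c))])

    tau_c_lo, tau_c_hi = np.quantile(c, [q_low, q_high])
    tau_v_lo, tau_v_hi = np.quantile(v, [q_low, q_high])

    labels = []
    for c_s, v_s in zip(c, v):
        if c_s <= tau_c_lo and v_s >= tau_v_hi:
            labels.append("O")   # low confidence, high variance
        elif c_s >= tau_c_hi and v_s <= tau_v_lo:
            labels.append("U")   # high confidence, low variance
        else:
            labels.append("N")   # normal reasoning
    return labels
```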
2. Confidence-Based Steering Vector Extraction¶
- Function: Extracts prototype representations of overthinking and underthinking from deep hidden states to construct the steering vector.
- Mechanism: Prototypes \(\bm{\mu}^O\) and \(\bm{\mu}^U\) are obtained by averaging the first-token hidden states of steps in \(\mathcal{O}\) and \(\mathcal{U}\), respectively. The steering vector is defined as \(\mathbf{v} = \frac{\bm{\mu}^O - \bm{\mu}^U}{\|\bm{\mu}^O - \bm{\mu}^U\|_2}\). Hidden-state adjustment follows \(\tilde{\mathbf{h}}_{t_s^{(1)}} = \mathbf{h}_{t_s^{(1)}} + \alpha_s \mathbf{v}\), where \(\alpha_s = \lambda_s \delta_s\), with \(\delta_s = +1\) to mitigate underthinking and \(\delta_s = -1\) to mitigate overthinking (see the sketch after this list).
- Design Motivation: Deep hidden states exhibit stronger discriminative power for reasoning patterns (see appendix), and the first token conditions subsequent generation via causal attention.
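A short sketch of the extraction and injection, assuming the first-token hidden states of the identified steps have already been collected from a deep layer:

```python
import numpy as np

def steering_vector(hidden_O, hidden_U):
    """v = (mu^O - mu^U) / ||mu^O - mu^U||_2, where the prototypes are the
    mean first-token hidden states of the O and U steps.
    hidden_O, hidden_U: arrays of shape (n_steps, d)."""
    mu_O = hidden_O.mean(axis=0)
    mu_U = hidden_U.mean(axis=0)
    diff = mu_O - mu_U
    return diff / np.linalg.norm(diff)

def steer(h_first_token, v, lam, delta):
    """h~ = h + alpha * v with alpha = lam * delta; delta = +1 pushes toward
    the overthinking prototype (mitigates underthinking), -1 the reverse."""
    return h_first_token + lam * delta * v
```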
3. Model-Behavior-Driven Dynamic Control Function¶
- Function: Adaptively adjusts the steering direction and magnitude based on real-time confidence.
- Mechanism: the steering weight is \(\alpha_s = g(c_s, v_s) = \text{sign}(c_s - \tau_c^H) \cdot B(c_s, v_s) \cdot \tanh(|c_s - \tau_c^H|)\), combining a direction term and a magnitude term (a sketch follows this list):
- Direction \(\delta_s\): negative when \(c_s < \tau_c^H\) (mitigating overthinking); positive when \(c_s > \tau_c^H\) (mitigating underthinking).
- Magnitude \(\lambda_s\): smoothly saturated via \(\tanh\); \(B(c_s, v_s)\) is a variance-aware amplitude function that adaptively switches among \(B_m\), \(B_o\), and \(B_u\) according to the current reasoning state.
- Design Motivation: Avoids hard switching and ensures numerical stability; parameters \(B_m\), \(B_o\), etc. are derived adaptively from model behavior, requiring no manual tuning.
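A sketch of the control function under an assumed variance-based switching rule for \(B\) (the amplitude values and the switching rule are illustrative; the paper derives \(B_m\), \(B_o\), \(B_u\) adaptively from model behavior):

```python
import numpy as np

def control_weight(c_s, v_s, tau_c_hi, tau_v_lo, tau_v_hi,
                   B_o=1.0, B_u=1.0, B_m=0.5):
    """alpha_s = g(c_s, v_s) = sign(c_s - tau_c^H) * B(c_s, v_s)
    * tanh(|c_s - tau_c^H|). Amplitude values are placeholders."""
    if v_s >= tau_v_hi:      # hesitation / path-switching regime
        B = B_o
    elif v_s <= tau_v_lo:    # flat, over-confident regime
        B = B_u
    else:                    # normal reasoning: mild steering
        B = B_m
    return float(np.sign(c_s - tau_c_hi) * B * np.tanh(abs(c_s - tau_c_hi)))
```

For instance, with \(\tau_c^H = 0.9\), a confidently flat step with \(c_s = 0.95\) yields a small positive weight (\(\approx 0.05 \cdot B_u\)), gently steering toward the overthinking prototype to counter premature commitment.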
Loss & Training¶
ReBalance is a training-free method and involves no loss function design. All components are obtained from a single offline forward pass.
Key Experimental Results¶
Main Results¶
| Model / Method | MATH-500 Acc↑ | MATH-500 Tokens↓ | AIME24 Acc↑ | GSM8K Acc↑ |
|---|---|---|---|---|
| R1-Distill-1.5B Baseline | 79.6 | 4516 | 23.3 | 76.0 |
| R1-Distill-1.5B ReBalance | 83.0 | 3474 (−23%) | 33.3 | 78.3 |
| R1-Distill-7B Baseline | 89.8 | 3699 | 46.7 | 89.2 |
| R1-Distill-7B ReBalance | 92.6 | 2903 (−22%) | 53.3 | 91.6 |
| Qwen3-14B Baseline | 93.8 | 4470 | 66.7 | 95.1 |
| Qwen3-14B ReBalance | 94.0 | 3641 (−19%) | 73.3 | 96.3 |
| QwQ-32B Baseline | 94.8 | 4535 | 66.7 | 96.3 |
| QwQ-32B ReBalance | 95.4 | 3551 (−22%) | 73.3 | 96.7 |
Ablation Study¶
- Using the steering vector without the dynamic control function yields limited gains and even degrades performance on some datasets.
- Removing the variance-aware amplitude \(B(c_s, v_s)\) prevents the model from distinguishing normal from abnormal reasoning states.
- Shallow vs. deep hidden states: deep layers (e.g., the second-to-last and third-to-last) perform best; shallow layers lack sufficient discriminative power.
Key Findings¶
- ReBalance consistently outperforms all baselines across 4 models (1.5B–32B) and 9 benchmarks.
- It simultaneously reduces reasoning length (15–30%) and improves accuracy (typically +2–10 percentage points), a combination rarely achieved by prior methods.
- Steering vectors extracted from small-scale seen datasets generalize robustly to unseen datasets.
- Unlike token-level suppression methods such as SEAL and DEER, ReBalance does not sacrifice valuable intermediate reasoning steps.
Highlights & Insights¶
- Core Insight: High confidence variance corresponds to overthinking (hesitation); persistently high confidence corresponds to underthinking (premature commitment). This observation is both intuitive and empirically validated.
- Methodological Value: Overthinking and underthinking are unified and addressed within a single framework rather than handled separately.
- Strong Practicality: Training-free and plug-and-play, requiring only a one-time offline computation on a small dataset, making deployment cost extremely low.
- Counter-Intuitive Finding: The steered outputs achieve higher accuracy despite shorter reasoning chains, evidence that redundant reasoning genuinely introduces errors and hallucinations.
Limitations & Future Work¶
- The steering vector is extracted from a fixed dataset and may not generalize to all task distributions; online update mechanisms warrant further exploration.
- Confidence computation relies on token probabilities; robustness to different sampling strategies (e.g., top-k, nucleus sampling) has not been fully validated.
- While the thresholds are set adaptively from empirical quantiles, the optimal quantile levels \(q_L, q_H\) may vary across models and tasks.
- Validation is currently limited to mathematical and code reasoning; effectiveness on natural language reasoning and multi-step planning remains to be verified.
Related Work & Insights¶
- Overthinking Mitigation: SEAL (Chen et al., 2025b) reduces reasoning length by suppressing reflective keywords but may induce underthinking; NoThinking (Ma et al., 2025b) takes a more aggressive approach by skipping the thinking phase entirely.
- Reasoning Efficiency: Unlike RL-based reasoning length control methods, ReBalance is entirely training-free.
- Hidden-State Manipulation: The approach draws on ideas from representation engineering, but targets reasoning patterns rather than behavioral control.
- Insight: Using confidence as a probe for reasoning quality can be generalized to extract more fine-grained reasoning dynamics from LRMs.
Rating¶
⭐⭐⭐⭐ (4/5)
- Novelty: ⭐⭐⭐⭐ Unifying overthinking and underthinking under a single model and addressing both via hidden-state steering is a novel and elegant approach.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across 4 models × 9 benchmarks; highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated and mathematical derivations are complete.
- Value: ⭐⭐⭐⭐⭐ Training-free and plug-and-play; highly deployment-friendly.