Efficient Reasoning with Balanced Thinking

Conference: ICLR 2026
arXiv: 2603.12372
Code: GitHub
Area: Model Compression / Efficient Inference
Keywords: Large language model reasoning, overthinking, underthinking, hidden-state steering, training-free acceleration

TL;DR

This paper proposes ReBalance, a training-free framework that simultaneously mitigates overthinking and underthinking in large reasoning models (LRMs) via confidence-guided dynamic hidden-state steering vectors, achieving joint improvements in both reasoning efficiency and accuracy.

Background & Motivation

Background: Large reasoning models (e.g., DeepSeek-R1, QwQ) have acquired powerful reasoning capabilities through SFT and RL training, yet face significant computational efficiency challenges in practical deployment.
Limitations of Prior Work: LRMs exhibit two opposing failure modes — overthinking: expending redundant reasoning steps on simple problems; and underthinking: prematurely converging on complex problems without sufficiently exploring the reasoning path.
Key Challenge: Existing methods for mitigating overthinking (e.g., suppressing reflective keywords, adjusting reasoning length) tend to induce underthinking, forming an inherent trade-off between the two. As shown in Fig. 2(a), prior methods reduce the reasoning length of both correctly and incorrectly solved samples simultaneously, indicating the introduction of underthinking.
Goal: To alleviate overthinking without inducing underthinking, thereby achieving balanced reasoning.
Key Insight: The paper observes that stepwise confidence and confidence variance can serve as continuous indicators of reasoning state — high variance reflects hesitation and path-switching (overthinking), while persistently high confidence reflects premature commitment (underthinking).
Core Idea: Confidence signals are used to identify reasoning states; a hidden-state steering vector is constructed from the difference between the overthinking and underthinking prototype representations; a dynamic control function then modulates the steering direction and magnitude based on real-time confidence.

Method

Overall Architecture

ReBalance operates in two phases, offline and online:
1. Offline Phase: A single forward pass is performed on a small-scale dataset to identify overthinking/underthinking steps, extract hidden-state prototypes, compute the steering vector, and fit the dynamic control function.
2. Online Phase: During inference, the dynamic control function computes a steering weight based on real-time confidence, which is used to inject the steering vector into the hidden states.

Key Designs

1. Explicit Modeling of Overthinking and Underthinking

Function: Classifies reasoning steps into an overthinking set \(\mathcal{O}\) and an underthinking set \(\mathcal{U}\) based on confidence-derived metrics.
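Mechanism: The step-level confidence is the geometric mean of the maximum token probabilities within step \(s\), \(c_s = \exp\left(\frac{1}{|\mathcal{T}_s|}\sum_{t \in \mathcal{T}_s} \ln p_t^{\max}\right)\), where \(\mathcal{T}_s\) is the set of token positions in step \(s\) and \(p_t^{\max}\) is the highest next-token probability at position \(t\). The confidence variance within a sliding window \(\mathcal{W}_s\) is \(v_s = \operatorname{Var}(c_s; \mathcal{W}_s)\). Empirical quantile thresholds \(\tau_c^L, \tau_c^H, \tau_v^L, \tau_v^H\) are used to categorize steps as: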
- Overthinking \(\mathcal{O} = \{s: c_s \leq \tau_c^L \wedge v_s \geq \tau_v^H\}\) (low confidence, high variance)
- Underthinking \(\mathcal{U} = \{s: c_s \geq \tau_c^H \wedge v_s \leq \tau_v^L\}\) (high confidence, low variance)
Design Motivation: The correspondence between confidence patterns and reasoning states is empirically validated in Fig. 2(b).
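The classification above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the trailing-window size, the nearest-rank quantile rule, the quantile levels 0.25 and 0.75, and all function names are hypothetical choices made for readability.

```python
import math
from statistics import pvariance


def step_confidence(max_probs):
    """c_s = exp(mean(ln p_t^max)): geometric mean of per-token max probabilities."""
    return math.exp(sum(math.log(p) for p in max_probs) / len(max_probs))


def windowed_variance(confidences, window=3):
    """v_s: population variance over a trailing window ending at step s.

    The window shape and size are assumptions for this sketch.
    """
    out = []
    for s in range(len(confidences)):
        win = confidences[max(0, s - window + 1): s + 1]
        out.append(pvariance(win))
    return out


def quantile(values, q):
    """Nearest-rank empirical quantile (one of several common conventions)."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[idx]


def classify_steps(confidences, variances, q_low=0.25, q_high=0.75):
    """Return (overthinking step indices, underthinking step indices)."""
    tau_c_lo, tau_c_hi = quantile(confidences, q_low), quantile(confidences, q_high)
    tau_v_lo, tau_v_hi = quantile(variances, q_low), quantile(variances, q_high)
    pairs = list(enumerate(zip(confidences, variances)))
    # Overthinking: low confidence and high variance.
    over = [s for s, (c, v) in pairs if c <= tau_c_lo and v >= tau_v_hi]
    # Underthinking: high confidence and low variance.
    under = [s for s, (c, v) in pairs if c >= tau_c_hi and v <= tau_v_lo]
    return over, under
```

Steps that fall in neither set are treated as normal reasoning and do not contribute to the prototypes.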

2. Confidence-Based Steering Vector Extraction

Function: Extracts prototype representations of overthinking and underthinking from deep hidden states to construct the steering vector.
Mechanism: Prototypes \(\bm{\mu}^O\) and \(\bm{\mu}^U\) are obtained by averaging the first-token hidden states of steps in \(\mathcal{O}\) and \(\mathcal{U}\), respectively. The steering vector is defined as \(\mathbf{v} = \frac{\bm{\mu}^O - \bm{\mu}^U}{\|\bm{\mu}^O - \bm{\mu}^U\|_2}\). Hidden-state adjustment follows \(\tilde{\mathbf{h}}_{t_s^{(1)}} = \mathbf{h}_{t_s^{(1)}} + \alpha_s \mathbf{v}\), where \(\alpha_s = \lambda_s \delta_s\), with \(\delta_s = +1\) to mitigate underthinking and \(\delta_s = -1\) to mitigate overthinking.
Design Motivation: Deep hidden states exhibit stronger discriminative power for reasoning patterns (see appendix), and the first token conditions subsequent generation via causal attention.
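As a rough illustration of the prototype difference and the additive update, the sketch below works on plain Python lists standing in for hidden-state vectors. The function names are hypothetical, and how the hidden states are read from or written back into a real model (for example, through a framework hook at a chosen layer) is deliberately left out.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (a prototype)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def steering_vector(over_states, under_states):
    """v = (mu_O - mu_U) / ||mu_O - mu_U||_2, built from first-token hidden states."""
    mu_o = mean_vector(over_states)
    mu_u = mean_vector(under_states)
    diff = [a - b for a, b in zip(mu_o, mu_u)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("prototypes coincide; steering direction is undefined")
    return [d / norm for d in diff]


def steer(hidden, v, alpha):
    """h~ = h + alpha * v, applied to the first token of a step."""
    return [h + alpha * x for h, x in zip(hidden, v)]
```

Because \(\mathbf{v}\) points from the underthinking prototype toward the overthinking prototype, a positive `alpha` moves a state away from underthinking and a negative `alpha` moves it away from overthinking, matching the sign convention for \(\delta_s\).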

3. Model-Behavior-Driven Dynamic Control Function

Function: Adaptively adjusts the steering direction and magnitude based on real-time confidence.
Mechanism: The steering weight is \(\alpha_s = g(c_s, v_s) = \text{sign}(c_s - \tau_c^H) \cdot B(c_s, v_s) \cdot \tanh(|c_s - \tau_c^H|)\), which factors into a direction \(\delta_s = \text{sign}(c_s - \tau_c^H)\) and a magnitude \(\lambda_s = B(c_s, v_s) \cdot \tanh(|c_s - \tau_c^H|)\):
- Direction \(\delta_s\): negative when \(c_s < \tau_c^H\) (mitigating overthinking); positive when \(c_s > \tau_c^H\) (mitigating underthinking).
- Magnitude \(\lambda_s\): smoothly saturated via \(\tanh\); \(B(c_s, v_s)\) is a variance-aware amplitude function that adaptively switches among \(B_m\), \(B_o\), and \(B_u\) according to the current reasoning state.
Design Motivation: Avoids hard switching and ensures numerical stability; the amplitudes \(B_m\), \(B_o\), and \(B_u\) are derived adaptively from model behavior, requiring no manual tuning.
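The control function can be sketched as follows. The summary does not spell out how \(B(c_s, v_s)\) selects among its three values, so the switching rule here (use \(B_o\) in the overthinking region, \(B_u\) in the underthinking region, \(B_m\) otherwise) is an assumption, and the numeric amplitudes are placeholders rather than fitted values. `Thresholds`, `amplitude`, and `control` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Thresholds:
    c_lo: float  # tau_c^L
    c_hi: float  # tau_c^H
    v_lo: float  # tau_v^L
    v_hi: float  # tau_v^H


def amplitude(c, v, th, b_m=1.0, b_o=4.0, b_u=4.0):
    """Variance-aware amplitude B(c, v); switching rule and values are assumed."""
    if c <= th.c_lo and v >= th.v_hi:   # overthinking region
        return b_o
    if c >= th.c_hi and v <= th.v_lo:   # underthinking region
        return b_u
    return b_m                          # everything else


def control(c, v, th, **amps):
    """g(c, v) = sign(c - tau_c^H) * B(c, v) * tanh(|c - tau_c^H|)."""
    gap = c - th.c_hi
    sign = (gap > 0) - (gap < 0)        # -1, 0, or +1
    return sign * amplitude(c, v, th, **amps) * math.tanh(abs(gap))
```

Since \(\tanh\) is bounded by 1, the weight never exceeds the selected amplitude in absolute value, and it shrinks to zero as the confidence approaches \(\tau_c^H\), so the intervention changes sign smoothly rather than jumping.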

Loss & Training

ReBalance is a training-free method and involves no loss function design. All components are obtained from a single offline forward pass.

Key Experimental Results

Main Results

Model / Method	MATH-500 Acc (%)↑	MATH-500 Tokens↓	AIME24 Acc (%)↑	GSM8K Acc (%)↑
R1-Distill-1.5B Baseline	79.6	4516	23.3	76.0
R1-Distill-1.5B ReBalance	83.0	3474 (−23%)	33.3	78.3
R1-Distill-7B Baseline	89.8	3699	46.7	89.2
R1-Distill-7B ReBalance	92.6	2903 (−22%)	53.3	91.6
Qwen3-14B Baseline	93.8	4470	66.7	95.1
Qwen3-14B ReBalance	94.0	3641 (−19%)	73.3	96.3
QwQ-32B Baseline	94.8	4535	66.7	96.3
QwQ-32B ReBalance	95.4	3551 (−22%)	73.3	96.7
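The percentages in the token column are relative reductions against each model's baseline. The short check below recomputes them from the MATH-500 token counts in the table; the dictionary layout and function name are illustrative only.

```python
# (baseline tokens, ReBalance tokens) on MATH-500, copied from the table above.
MATH500_TOKENS = {
    "R1-Distill-1.5B": (4516, 3474),
    "R1-Distill-7B": (3699, 2903),
    "Qwen3-14B": (4470, 3641),
    "QwQ-32B": (4535, 3551),
}


def token_reduction_pct(baseline, steered):
    """Relative reduction in generated tokens, in percent."""
    return 100.0 * (baseline - steered) / baseline


reductions = {
    model: round(token_reduction_pct(base, steered))
    for model, (base, steered) in MATH500_TOKENS.items()
}
```

Rounded to whole percent, the recomputed values agree with the reductions reported in the table.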

Ablation Study

Using the steering vector without the dynamic control function yields limited performance gains and even degradation on some datasets.
Removing the variance-aware amplitude \(B(c_s, v_s)\) prevents the model from distinguishing normal from abnormal reasoning states.
Shallow vs. deep hidden states: deep layers (e.g., the second-to-last and third-to-last) perform best; shallow layers lack sufficient discriminative power.

Key Findings

ReBalance consistently outperforms all baselines across 4 models (0.5B–32B) and 9 benchmarks.
It simultaneously reduces reasoning length (15–30%) and improves accuracy (typically +2–10%), a combination rarely achieved by prior methods.
Steering vectors extracted from small-scale seen datasets generalize robustly to unseen datasets.
Unlike token-level suppression methods such as SEAL and DEER, ReBalance does not sacrifice valuable intermediate reasoning steps.

Highlights & Insights

Core Insight: High confidence variance corresponds to overthinking (hesitation); persistently high confidence corresponds to underthinking (premature commitment). This observation is both intuitive and empirically validated.
Methodological Value: Overthinking and underthinking are unified and addressed within a single framework rather than handled separately.
Strong Practicality: Training-free and plug-and-play, requiring only a one-time offline computation on a small dataset, making deployment cost extremely low.
Counter-Intuitive Finding: Outputs with shorter reasoning chains achieve higher accuracy, suggesting that redundant reasoning is not merely wasteful but can itself introduce errors.

Limitations & Future Work

The steering vector is extracted from a fixed dataset and may not generalize to all task distributions; online update mechanisms warrant further exploration.
Confidence computation relies on token probabilities; robustness to different sampling strategies (e.g., top-k, nucleus sampling) has not been fully validated.
While the thresholds \(\tau_c^L, \tau_c^H, \tau_v^L, \tau_v^H\) adapt to the data through the quantile levels \(q_L, q_H\), the optimal quantile levels may vary across models and tasks.
Validation is currently limited to mathematical and code reasoning; effectiveness on natural language reasoning and multi-step planning remains to be verified.

Related Work & Insights

Overthinking Mitigation: SEAL (Chen et al., 2025b) reduces reasoning length by suppressing reflective keywords but may induce underthinking; NoThinking (Ma et al., 2025b) takes a more aggressive approach by skipping the thinking phase entirely.
Reasoning Efficiency: Unlike RL-based reasoning length control methods, ReBalance is entirely training-free.
Hidden-State Manipulation: The approach draws on ideas from representation engineering, but targets reasoning patterns rather than behavioral control.
Insight: Using confidence as a probe for reasoning quality can be generalized to extract more fine-grained reasoning dynamics from LRMs.

Rating

⭐⭐⭐⭐ (4/5)

Novelty: ⭐⭐⭐⭐ Unifying overthinking and underthinking under a single model and addressing both via hidden-state steering is a novel and elegant approach.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across 4 models × 9 benchmarks; highly comprehensive.
Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated and mathematical derivations are complete.
Value: ⭐⭐⭐⭐⭐ Training-free and plug-and-play; highly deployment-friendly.