Efficient Test-Time Scaling via Temporal Reasoning Aggregation¶
Conference: ACL 2026 arXiv: 2604.17304 Code: https://github.com/qianfantianyuzhouzhou/TRACE Area: LLM Reasoning Efficiency Keywords: Test-time scaling, early exit, reasoning convergence, multi-step aggregation, overthinking
TL;DR¶
This paper proposes TRACE, a framework that determines reasoning convergence by aggregating two complementary signals within a sliding window — answer consistency across steps and confidence trajectory over time — enabling training-free dynamic early exit that reduces token usage by 25–30% with only a 1–2% accuracy drop.
Background & Motivation¶
State of the Field: Test-time scaling improves LLM reasoning performance by increasing inference-time computation (extending chain-of-thought or searching multiple paths). However, this leads to substantial unnecessary token generation — models frequently continue reasoning after having already arrived at the correct answer (the overthinking phenomenon).
Limitations of Prior Work: Existing dynamic early-exit methods primarily rely on single-step confidence signals to decide when to terminate reasoning. Research has shown that single-step confidence is unreliable in multi-step reasoning — it reflects the certainty of a single step rather than stability across steps. For instance, a model may assign high confidence to an incorrect intermediate step, triggering premature termination.
Root Cause: Terminating too early yields incorrect outputs, while terminating too late wastes resources. Single-step confidence cannot distinguish between genuine reasoning convergence and transiently high-confidence erroneous steps. Reasoning convergence is inherently a temporal phenomenon that requires stability signals spanning multiple steps.
Paper Goals: To design an early-exit strategy based on multi-step evidence aggregation that provides more reliable convergence judgments than single-step confidence.
Starting Point: Inspired by self-consistency — if multiple reasoning paths yield the same answer, that answer is more likely correct. This intuition is extended from multi-path sampling to multiple steps within a single reasoning trajectory.
Core Idea: Two complementary signals are tracked simultaneously within a sliding window: (1) answer consistency — whether the predicted answer remains stable across multiple steps; and (2) confidence trajectory — whether confidence evolves stably over time. Both signals are jointly used to determine whether reasoning has genuinely converged.
Method¶
Overall Architecture¶
TRACE maintains a sliding window of size \(k\) covering the most recent \(k\) reasoning steps during autoregressive inference. At each step, TRACE computes an Answer Consistency Score (ACS) and a Confidence Trajectory Score (CTS) within the window, combines them with a weighted sum into a unified stability score, and terminates reasoning when this score exceeds threshold \(\tau\). No additional training is required; TRACE can be directly applied to off-the-shelf LLMs.
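The control flow just described can be sketched as a minimal, training-free loop. This is an illustrative sketch, not the released implementation: `step_fn` and `elicit_fn` are hypothetical callables standing in for the model's next reasoning step and the auxiliary answer-elicitation prompt.

```python
from collections import deque

def trace_generate(step_fn, elicit_fn, k=4, alpha=0.5, tau=0.8, max_steps=64):
    """Sketch of a TRACE-style sliding-window early exit.

    step_fn(): advances the model by one reasoning step (side effects only).
    elicit_fn(): returns (candidate_answer, confidence) for the current context.
    """
    window = deque(maxlen=k)   # most recent k (answer, confidence) pairs
    best = None
    for _ in range(max_steps):
        step_fn()
        window.append(elicit_fn())
        if len(window) < k:
            continue           # wait until the window is full before judging
        # S(a) = alpha * ACS(a) + (1 - alpha) * CTS(a) for each candidate a
        best, best_s = None, float("-inf")
        for a in {ans for ans, _ in window}:
            confs = [c for ans, c in window if ans == a]
            s = alpha * (len(confs) / k) + (1 - alpha) * (sum(confs) / len(confs))
            if s > best_s:
                best, best_s = a, s
        if best_s > tau:
            return best        # stability score exceeded tau: exit early
    return best                # budget exhausted; fall back to current best
```

In this sketch the model keeps generating until either the stability score of some candidate clears \(\tau\) or the step budget runs out, matching the paper's description of a plug-in inference loop.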
Key Designs¶
- Answer Consistency Score (ACS):
  - Function: Measures the persistence of the predicted answer across recent reasoning steps.
  - Mechanism: At each reasoning step, a lightweight auxiliary prompt elicits a candidate final answer from the current reasoning context. ACS is defined as the frequency of candidate answer \(a\) within the sliding window: \(\text{ACS}(a) = \text{count}(a) / k\). When reasoning converges, the correct answer recurs across consecutive steps, producing a high ACS.
  - Design Motivation: Inspired by self-consistency, cross-step persistence of an answer is a strong signal of reasoning convergence.
- Confidence Trajectory Score (CTS):
  - Function: Tracks the temporal evolution of model confidence.
  - Mechanism: At each step, the confidence of the candidate answer is computed using normalized entropy \(\tilde{H}\) as \(c = 1 - \frac{1}{n}\sum_j \tilde{H}(p_j)\). CTS is then defined as the average confidence of candidate answer \(a\) over the steps at which it appears: \(\text{CTS}(a) = \frac{1}{\text{count}(a)}\sum_{t \in \mathcal{T}(a)} c_t\). CTS distinguishes persistently high confidence (a convergence signal) from sporadically high confidence (noise).
  - Design Motivation: Single-step confidence is unreliable, but an answer that receives high confidence across multiple steps is more likely to reflect genuine convergence.
- Joint Early-Exit Decision:
  - Function: Integrates both signals into a reliable termination judgment.
  - Mechanism: The unified stability score is \(S(a) = \alpha \cdot \text{ACS}(a) + (1-\alpha) \cdot \text{CTS}(a)\). The candidate answer with the highest score is selected as \(a^\star\). Reasoning terminates and \(a^\star\) is output when \(S(a^\star) > \tau\). The parameter \(\alpha\) controls the relative contribution of each signal.
  - Design Motivation: ACS and CTS provide complementary perspectives — ACS captures whether answers are consistent, while CTS captures whether the model is confident. Their joint judgment is more robust than either signal alone.
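Under the definitions above, the scores are straightforward to compute from a window of (answer, confidence) pairs. A minimal sketch (the per-step confidences and the toy window are invented for illustration, not taken from the paper):

```python
from collections import Counter

def stability_scores(window, alpha=0.5):
    """Per-candidate stability scores over a sliding window.

    window: list of (answer, confidence) pairs for the last k steps.
    Returns {answer: S(a)} with S(a) = alpha * ACS(a) + (1 - alpha) * CTS(a),
    where ACS(a) = count(a) / k and CTS(a) is the mean confidence of the
    steps at which a appears.
    """
    k = len(window)
    counts = Counter(a for a, _ in window)
    scores = {}
    for a, n in counts.items():
        acs = n / k
        cts = sum(c for ans, c in window if ans == a) / n
        scores[a] = alpha * acs + (1 - alpha) * cts
    return scores

def should_exit(window, alpha=0.5, tau=0.8):
    """Joint early-exit rule: return (exit?, best candidate)."""
    scores = stability_scores(window, alpha)
    best = max(scores, key=scores.get)
    return scores[best] > tau, best

# Toy window of k=4 steps: "42" recurs with high confidence, so
# ACS("42") = 3/4 and CTS("42") = mean(0.9, 0.8, 0.85) = 0.85.
window = [("42", 0.9), ("42", 0.8), ("17", 0.4), ("42", 0.85)]
stop, ans = should_exit(window, alpha=0.5, tau=0.7)  # S("42") = 0.8 > 0.7
```

Note how the sporadic low-confidence "17" step barely affects the decision: it lowers ACS of "42" but cannot raise its own combined score, which is exactly the robustness the joint rule is after.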
Loss & Training¶
TRACE is a training-free inference-time method applied directly to off-the-shelf models. Evaluation is conducted on Qwen3-8B and DeepSeek-R1-Distill-Llama-8B across the benchmarks OlympiadBench, MATH500, AIME24, AMC23, and AIME25. Hyperparameters \(\alpha\) and threshold \(\tau\) are tuned on a validation set.
Key Experimental Results¶
Main Results¶
| Method | Avg. Accuracy | Avg. Token Usage | Notes |
|---|---|---|---|
| Vanilla (full inference) | baseline | 100% | Complete reasoning |
| Single-step confidence early exit | −11% | ~60% | Severe degradation from premature termination |
| TRACE | −1 to −2% | 70–75% | Best accuracy–efficiency trade-off |
Ablation Study¶
| Signal Combination | Performance | Notes |
|---|---|---|
| ACS only | Moderate | Answers consistent but confidence may be low |
| CTS only | Moderate | Confidence high but answers may be inconsistent |
| ACS + CTS | Best | Complementary signals jointly decide |
Key Findings¶
- Single-step confidence early exit achieves an overall accuracy of only 0.44 on hard benchmarks (vs. 0.55 for full inference), confirming the unreliability of single-step signals.
- TRACE reduces token usage by 25–30% with only a 1–2% accuracy drop, significantly outperforming existing dynamic reasoning methods.
- Compared to the strongest early-exit baseline, TRACE improves average accuracy by 2–4 points under comparable or lower token budgets.
- The choice of sliding window size \(k\) affects sensitivity — too small yields unstable signals, while too large introduces detection latency.
Highlights & Insights¶
- The paradigm shift from single-step judgment to multi-step aggregation is compelling: Figure 1 clearly illustrates how single-step high confidence can mislead early exit, while TRACE avoids such errors by observing cross-step consistency.
- The answer elicitation design is elegant: A lightweight prompt elicits candidate answers at each step without interrupting the reasoning process, enabling real-time monitoring of intermediate reasoning states.
- The training-free, plug-and-play nature is highly practical: No model modification or additional training is required, directly reducing inference cost.
Limitations & Future Work¶
- Answer elicitation at each step requires additional forward passes, introducing non-trivial overhead.
- Applicability to non-mathematical reasoning tasks (e.g., code generation, natural language inference) has not been thoroughly validated.
- Threshold \(\tau\) and window size \(k\) require tuning on a validation set and may need different configurations across datasets.
- When reasoning genuinely requires long chains (rather than exhibiting overthinking), TRACE may incorrectly judge the process as having converged.
Related Work & Insights¶
- vs. single-step confidence early exit: Single-step signals are systematically unreliable in multi-step reasoning; TRACE provides more stable convergence signals through temporal aggregation.
- vs. RL-based methods (e.g., length-penalty training): RL methods require additional training and are sensitive to reward design, whereas TRACE is training-free and directly applicable, offering greater flexibility.
Rating¶
- Novelty: ⭐⭐⭐⭐ The multi-step aggregation idea is natural and well-motivated; the ACS+CTS design is elegant, though not technically complex.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five mathematical benchmarks, two models, multiple baselines, and detailed ablations — very comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly established through experiments; the method description is concise.