Entropy-based Exploration Conduction for Multi-step Reasoning¶

Conference: ACL 2025
arXiv: 2503.15848
Code: None
Area: LLM Reasoning
Keywords: Multi-step Reasoning, Entropy, Exploration Depth, Dynamic Adjustment, \epsilon-greedy

TL;DR¶

This paper proposes the Entro-duction method, which dynamically adjusts exploration depth by monitoring changes in output entropy and variance entropy during the LLM reasoning process. It adopts an \(\epsilon\)-greedy strategy to select among three exploration behaviors (deepening, expanding, or stopping), improving reasoning accuracy while avoiding redundant computation.

Background & Motivation¶

1. Background¶

The multi-step reasoning capabilities of LLMs are crucial for complex tasks. Structured reasoning methods like Chain-of-Thought (CoT) and Tree-of-Thought (ToT) have demonstrated significant effectiveness. However, the reasoning depth and width of these methods heavily depend on predefined settings, which vary drastically across different tasks.

2. Limitations of Prior Work¶

Over-reasoning: Wasting significant computational steps on simple questions.
Under-reasoning: Terminating prematurely on complex questions, missing crucial reasoning paths.
Rigidity of predefined depth: CoT fixes 5-8 steps, and ToT uses fixed layers and branching factors, failing to adapt dynamically.
Outcome-oriented optimization: RL-based post-training methods are computationally expensive and task-specific.

3. Key Challenge¶

The mismatch between reasoning exploration depth and task complexity—a single static reasoning structure cannot simultaneously fit both simple and complex tasks, yet LLMs struggle to accurately determine the required exploration depth solely based on parameterized knowledge.

4. Goal¶

How to enable LLMs to automatically and transparently assess whether current exploration is sufficient during reasoning and dynamically adjust subsequent exploration strategies accordingly?

5. Key Insight¶

Leveraging the uncertainty information embedded in the output logits of LLMs—quantifying the model's reasoning state by calculating the entropy and variance entropy at each reasoning step, and automatically switching exploration behaviors based on entropy dynamics.

6. Core Idea¶

Using the output entropy of reasoning steps to characterize reasoning uncertainty, and using variance entropy to represent reasoning stability. Based on the dynamics of both metrics, an \(\epsilon\)-greedy strategy is applied to dynamically choose to deepen, expand, or stop the exploration.

Method¶

Overall Architecture¶

In each reasoning step, Entro-duction: 1. Computes the normalized entropy and normalized variance entropy of the current step. 2. Compares them with the previous step to obtain the deltas \((\Delta H, \Delta \sigma_H^2)\). 3. Determines the optimal actions based on a state mapping system. 4. Adopts an \(\epsilon\)-greedy strategy to sample the actual execution behavior.

Key Designs¶

Reasoning State Evaluation — Entropy Metric¶

Entropy measures the uncertainty of the current reasoning step. For a reasoning step \(\mathcal{T}_{i,j}\) consisting of \(n\) tokens, each token \(t_{ijk}\) corresponds to a logit \(l_{ijk}\):

\[p_{ijk} = \frac{\exp(l_{ijk})}{\sum_{r=1}^{n}\exp(l_{ijr})}\]

\[H(\mathcal{T}_{i,j}) = -\sum_{k=1}^{n} p_{ijk}\log_2(p_{ijk})\]

Normalized Entropy: \(\tilde{H}(\mathcal{T}_{i,j}) = \frac{H(\mathcal{T}_{i,j})}{\log_2(n)}\), bounded within \([0, 1]\).

Variance Entropy for Reasoning Stability¶

\[\sigma_H^2(\mathcal{T}_{i,j}) = \frac{1}{n}\sum_{k=1}^{n}\left[H(t_{ijk}) - \bar{H}(\mathcal{T}_{i,j})\right]^2\]

Variance entropy captures the level of fluctuation in token-level uncertainty within a single reasoning step.

Four Reasoning States and Action Mapping¶

Entropy Change	Variance Entropy Change	Meaning	Mapping Behavior
\(\Delta H < 0\)	\(\Delta \sigma^2 < 0\)	Reasoning certainty increases, stable process	Deepen
\(\Delta H > 0\)	\(\Delta \sigma^2 < 0\)	More possibilities but the direction has not diverged	Deepen
\(\Delta H < 0\)	\(\Delta \sigma^2 > 0\)	Uncertainty decreases but local divergence arises	Expand
\(\Delta H > 0\)	\(\Delta \sigma^2 > 0\)	Complex and unstable reasoning	Stop

Three Exploration Behaviors¶

Deepen: Adds a new node \(\mathcal{T}_{i,j+1}\) to the current reasoning chain to continue traversing deeper.
Expand: Splits the current reasoning chain into two, generating new nodes \(\mathcal{T}_{i,j+1}\) and \(\mathcal{T}'_{i,j+1}\) respectively.
Stop: Terminates the expansion of the current chain.

\(\epsilon\)-greedy Action Selection¶

\[\pi_j(a|\mathbf{s}_j) = \begin{cases} 1 - \epsilon, & a = a_j^* \\ \frac{\epsilon}{|\mathcal{A}| - 1}, & a \neq a_j^* \end{cases}\]

Executes the optimal action \(a_j^*\) recommended by the mapping function with probability \(1-\epsilon\), and randomly explores other actions with probability \(\frac{\epsilon}{2}\). The final reasoning conclusion is determined through a voting mechanism \(L = V(\mathcal{S}, \mathcal{R})\).

Soft Stop Mechanism¶

Instead of immediate termination upon receiving a Stop signal, the model continues reasoning for an additional 1-2 steps before stopping, preventing premature termination in complex tasks.

Loss & Training¶

Training-free, pure test-time inference method. Implemented based on LLaMA-3.1-8B-Instruct with temperature 0.7 and a maximum of 128 tokens.

Key Experimental Results¶

Main Results¶

Method	GSM8K	SVAMP	StrategyQA	CSQA	Average
CoT	75.2 (8 steps)	83.4 (8 steps)	57.7 (5 steps)	75.6 (5 steps)	—
CoT-SC@maj8	78.1 (24 steps)	87.5 (24 steps)	68.3 (15 steps)	78.2 (15 steps)	—
CoT-SC@maj64	80.2 (24 steps)	89.6 (24 steps)	67.1 (15 steps)	78.7 (15 steps)	—
ToT	72.6 (121 steps)	83.3 (121 steps)	65.8 (121 steps)	73.5 (121 steps)	—
Complex CoT	81.4 (8 steps)	86.2 (8 steps)	65.7 (5 steps)	73.9 (5 steps)	—
DRR	83.0	90.2	67.7	82.1	—
Entro-duction	85.4 (9.5 steps)	92.0 (11.2 steps)	70.3 (9.6 steps)	79.6 (7.1 steps)	—

Ablation Study¶

Configuration	GSM8K	SVAMP	StrategyQA	CSQA
Base (No adjustment)	~72	~82	~55	~72
Entropy Only	~77	~85	~60	~74
Variance Only	~78	~86	~62	~75
Both (Full)	~85	~92	~70	~80

Key Findings¶

Win-win in accuracy and step count: Entro-duction achieves 85.4-92.0% accuracy within only 9.5-11.2 steps, which is significantly fewer than ToT's 121 steps while obtaining higher accuracy.
GSM8K 85.4% > DRR 83.0%: Outperforms the DRR method which requires additional training.
SVAMP 92.0%: Particularly effective on simple mathematical reasoning tasks.
Both entropy and variance entropy are indispensable: Using either metric alone yields sub-optimal performance compared to combining them.
Expand behavior is crucial: Removing the Expand action leads to a significant performance drop in tasks requiring multi-path exploration.
Soft Stop > Hard Stop: Stop with 1-2 extra steps (Stop@2/3) outperforms immediate termination (Hard Stop) by a large margin.
\(\epsilon = 0.25\) is optimal: Mathematical tasks require more exploration, whereas commonsense tasks demand more precise decision-making.

Highlights & Insights¶

Transparent and interpretable reasoning control: Action selection based on entropy dynamics is intuitive and transparent, eliminating the need for black-box self-reflection.
Zero training overhead: Operates entirely at test time without requiring any additional training or fine-tuning.
Complementarity of two-dimensional metrics: Entropy captures "uncertainty" while variance entropy captures "stability". Combined, they provide a comprehensive description of the reasoning state.
Adaptive reasoning depth: Automatically allocates different average reasoning steps to different tasks (9.5-11.2 steps for math, 7.1-9.6 steps for commonsense reasoning).
Practicality of the soft stop mechanism: A simple delayed termination strategy effectively mitigates the issue of premature stopping.

Limitations & Future Work¶

Experiments are restricted only to LLaMA-3.1-8B; generalization to LLMs of different scales and architectures remains to be verified.
The mapping from four reasoning states to three actions relies on manually designed rules without exploring more fine-grained state spaces.
The \(\epsilon\) parameter requires manual tuning, and different tasks might need different optimal values.
Evaluation is limited to four benchmark datasets, failing to cover more complex real-world reasoning scenarios.
The entropy metric only leverages token-level probabilities, failing to capture semantic-level uncertainty.
The final decision-making strategy for multi-chain voting is basic (majority vote), which might not be the optimal way to aggregate answers.

CoT / CoT-SC / ToT: Classical reasoning structures. Entro-duction equips them with adaptive depth adjustment capabilities.
Semantic Entropy (Farquhar et al., 2024): Employs semantic entropy to detect LLM hallucinations, inspiring this work to utilize entropy as a metric for monitoring reasoning states.
DRR (Yang et al., 2024): Optimizes reasoning depth via distillation and RL, but requires extra training.
Self-talk (Shwartz et al., 2020): Enhances reasoning through generating exploratory questions, though lack of adaptability.
Insights: The output logits of LLMs convey rich information about the reasoning states, which can be leveraged for training-free reasoning process control.

Rating¶

Novelty: ⭐⭐⭐⭐ — The idea of using entropy to guide reasoning depth is novel, and the dual-dimension metric design is insightful.
Experimental Thoroughness: ⭐⭐⭐ — Bounded to four datasets with detailed ablation studies, yet restricted to only one model size and a limited variety of tasks.
Writing Quality: ⭐⭐⭐⭐ — Highly rigorous mathematical formulation and lucid analysis of the four states.
Value: ⭐⭐⭐⭐ — Offers a lightweight, transparent, and training-free depth control scheme for multi-step reasoning, possessing high practical potential.