Cautious Next Token Prediction¶

Conference: ACL 2025
arXiv: 2507.03038
Code: Available
Area: Others
Keywords: Decoding Strategy, Entropy-Adaptive Sampling, Perplexity Ranking, Training-free, LLM Inference

TL;DR¶

Proposes Cautious Next Token Prediction (CNTP), a training-free adaptive decoding strategy. When the model's prediction entropy is high (showing uncertainty), it samples multiple candidate paths up to a punctuation mark, then selects the path with the lowest perplexity as the final generation. This significantly improves accuracy without sacrificing diversity.

Background & Motivation¶

Current mainstream LLM decoding strategies (top-p/nucleus sampling + temperature scaling) are the "industry default", but tend to generate incoherent or incorrect content when the model is uncertain.
Greedy Decoding: Strong determinism but lacks diversity, prone to falling into local optima.
Stochastic Sampling: Maintains diversity but does not guarantee coherence; error-prone at highly uncertain steps.
Beam Search: Large computational overhead (\(O(L \times B)\)) and lacks adaptability, treating certain and uncertain steps equally.
Self-Consistency: Requires running the entire decoding process \(N_{sc}\) times, with a cost of \(O(N_{sc} \times L)\), offering no adaptability at intermediate steps.
The core insight stems from human behavior: when facing uncertain steps during problem-solving, humans tend to "think more," explore multiple paths, and eventually choose the one they are most confident in. Models can also mimic this "cautious" strategy.
Existing inference-time methods like CoT and Self-Refinement largely rely on external feedback or extensive sampling, leaving a gap for a lightweight, adaptive, and training-free decoding scheme.

Method¶

Overall Architecture¶

The core loop of CNTP: Before generating each token, compute the predictive entropy at the current position \(\rightarrow\) If entropy is low, proceed with normal single-token sampling \(\rightarrow\) If entropy is high, independently sample \(N\) candidate paths up to a punctuation mark \(\rightarrow\) Compute the perplexity of each path \(\rightarrow\) Select the path with the lowest perplexity to append to the sequence \(\rightarrow\) Continue generation.

Key Designs¶

1. Entropy-Based Uncertainty Detection¶

Given the current sequence \(s\), compute the entropy of the probability distribution for the next token:

\[H(s) = -\sum_{w} p(w|s) \log p(w|s)\]

Set two thresholds \(H_{min}\) and \(H_{max}\), and a maximum number of trials \(N_{max}\). The entropy is linearly mapped to the number of trials:

\[N = \max\left(1, \min\left(N_{max}, \left\lfloor\frac{H - H_{min}}{H_{max} - H_{min}} \times N_{max}\right\rfloor\right)\right)\]

\(H(s) < H_{min}\): The model is highly confident; \(N=1\), degrading to standard single-step sampling.
\(H(s) > H_{max}\): The model is highly uncertain; \(N=N_{max}\), encouraging full exploration.

2. Multi-Path Sampling and Punctuation-Based Termination¶

When \(N > 1\), independently sample \(N\) candidate paths, each starting from the current position until meeting a punctuation mark (., ?, !, :, ;, ), ], \n) or reaching stop criteria. The design of punctuation-based termination enables CNTP to make locally optimal choices at the sentence level rather than the token level or full-answer level.

3. Perplexity Ranking Selection¶

For each candidate path \(s_i\), compute the sentence-level perplexity:

\[\text{PPL}(s_i) = \exp(\mathcal{L}(s_i)/|s_i|)\]

where \(\mathcal{L}(s_i) = -\sum_{t} \log p(w_t | s_{<t})\).

Select the path with the lowest perplexity as the optimal continuation. This strategy leverages the model's own likelihood function as a "judge," without requiring external feedback.

4. Integration with Self-Consistency¶

CNTP can serve as an "inner-loop" optimization for Self-Consistency (SC): employing CNTP within each independent reasoning chain of SC improves the quality of individual chains before applying majority voting. Thus, CNTP and SC are orthogonally complementary.

Complexity Analysis¶

Method	Complexity	Adaptability
Greedy Decoding	\(O(L)\)	None
Beam Search (B)	\(O(L \times B)\)	None
Self-Consistency (\(N_{sc}\))	\(O(N_{sc} \times L)\)	None
CNTP	\(O(L \times (1 + p(N_{max}-1)))\)	Yes

Here, \(p\) represents the proportion of high-entropy steps. Since \(p \ll 1\) in practice, the computational cost of CNTP is significantly lower than that of Beam Search and SC.

Theoretical Guarantees¶

Theorem 1: Under two mild assumptions (correct paths have the lowest perplexity; high entropy indicates low probability of the correct token): 1. The probability of CNTP generating a correct complete sequence is \(\ge\) single-sample decoding. 2. The average computational cost is strictly lower than \(L \times N_{max}\).

Key Experimental Results¶

Main Results: Llama-3.1-8B-Instruct¶

Method	GSM8K	MATH	StrategyQA
Greedy Decoding	79.8	41.5	72.9
Stochastic Decoding	79.4±0.8	41.5±1.2	72.0±0.7
CNTP (Ours)	81.6±0.6	47.1±1.7	73.2±0.2
Beam Search (beam=5)	82.3	48.0	72.9
SC (40 paths)	84.8	56.0	76.2
CNTP + SC (40 paths)	85.2	57.5	76.3

DeepSeek-R1-Distill-Qwen-1.5B¶

Method	GSM8K	MATH	StrategyQA
Greedy Decoding	64.6	32.5	53.6
Stochastic Decoding	61.6±1.1	27.9±3.7	51.7±1.2
CNTP (Ours)	65.7±0.7	37.7±1.7	53.0±1.3
SC (40 paths)	78.3	29.5	47.7
CNTP + SC (40 paths)	71.7	41.0	54.1

TruthfulQA (Llama-2-7B-Chat)¶

Method	Info Acc.	Truth Acc.	Truth-info Acc.
Stochastic Decoding	88.0±0.6	78.0±0.5	66.0±0.3
Greedy Decoding	78.5	79.1	57.6
CNTP (Ours)	89.2±1.2	84.8±0.5	74.0±1.1

Multimodal Experiments (MMVet / MathVista)¶

Method	Llama-3.2-11B MMVet	Llama-3.2-11B MathVista
Greedy	48.0	53.5
Stochastic	47.7	53.0
CNTP	53.5 (+5.5)	58.5 (+5.0)

Ablation Study¶

Comparison of Uncertainty Metrics:

Metric	GSM8K	StrategyQA	MATH	TruthfulQA
Max token prob	Suboptimal	Suboptimal	Suboptimal	Suboptimal
Max-2nd prob	Suboptimal	Suboptimal	Suboptimal	Suboptimal
Entropy (Ours)	Optimal	Optimal	Optimal	Optimal

Relationship Strategy between Number of Trials and Entropy:

Strategy	GSM8K	StrategyQA	TruthfulQA
Fixed Trials (N=6)	81.1	72.7	3.80
Negative Correlation (High Entropy, Fewer Trials)	81.2	72.7	3.80
Positive Correlation (High Entropy, More Trials)	81.6	73.2	74.0

Best-of-N vs CNTP (GSM8K):

N	2	5	10	20	40
Best-of-N (Full-Answer PPL)	79.2	79.5	78.2	77.3	76.1
CNTP	—	—	81.6	—	—

Full-answer level perplexity selection performs worse than CNTP's sentence-level local selection.

Key Findings¶

CNTP comprehensively outperforms greedy and stochastic decoding in single-chain settings: +5.6% over greedy on MATH, and +5.7% truthfulness on TruthfulQA.
Orthogonal and complementary to SC: CNTP + SC outperforms pure SC on most tasks.
Entropy is the best uncertainty metric: By considering the full vocabulary distribution, it surpasses heuristic methods based on top-1/top-2 probabilities.
Sentence-level rather than full-answer-level PPL selection is crucial: Using full-answer PPL in Best-of-N actually leads to performance degradation.
Equally effective in multimodal domains: Yields improvements on both LLaVA-CoT and Llama-3.2-Vision.
There exists an optimal range for \(N_{max}\): Around \([10, 30]\), beyond which an exploration-exploitation imbalance occurs.

Highlights & Insights¶

The human analogy is intuitive and effective: "when uncertain, explore more paths and select the one with highest confidence"—this is the core philosophy of CNTP.
Sentence-level local optimality is a key innovation: truncating at punctuation marks is more effective than doing so at the token level or full-answer level.
As a training-free method, CNTP can be applied plug-and-play to any autoregressive model, ensuring extremely low deployment cost.
Similar in spirit to Entropix (concurrent work) but with key differences: CNTP features punctuation-based termination and positive-correlation sampling strategies.

Limitations & Future Work¶

Introduces additional token computation. Although much lower than Beam Search or SC, it still increases inference latency.
The hyperparameters \(H_{min}=0.01\), \(H_{max}=1.5\), and \(N_{max}=10\) are fixed across all experiments and have not been tuned for different tasks or models.
Experimented only on medium-scale models (\(\le\) 11B), with no evaluations on LLMs larger than 70B.
The choice of punctuation set may depend on the language or task, and its cross-lingual generalization remains to be verified.
Can be combined with speculative decoding or vLLM to further accelerate inference.

Complementary to Self-Consistency: SC votes at the full-chain level, while CNTP optimizes at the sub-sentence level; the combination of the two yields the best performance.
Parallel to Entropix: Both leverage entropy to guide sampling strategies, but CNTP uniquely introduces the punctuation-based termination mechanism and positive-correlation sampling.
Distinction from CoT / ToT: Does not rely on prompt engineering or search tree structures, operating purely at the decoding level.
Insights: CNTP could be extended to non-text domains, such as autoregressive image generation, in the future.

Rating¶

Dimension	Score
Novelty	⭐⭐⭐⭐ Novel concept of sentence-level entropy-adaptive sampling
Experimental Thoroughness	⭐⭐⭐⭐ 6 datasets + multiple models + ablation study
Value	⭐⭐⭐⭐ Training-free and plug-and-play
Writing Quality	⭐⭐⭐⭐ Clear motivation, solid combination of theory and experiments
Overall Recommendation	⭐⭐⭐⭐