Cautious Next Token Prediction¶
Conference: ACL 2025
arXiv: 2507.03038
Code: Available
Area: Others
Keywords: Decoding Strategy, Entropy-Adaptive Sampling, Perplexity Ranking, Training-free, LLM Inference
TL;DR¶
Proposes Cautious Next Token Prediction (CNTP), a training-free adaptive decoding strategy. When the model's prediction entropy is high (showing uncertainty), it samples multiple candidate paths up to a punctuation mark, then selects the path with the lowest perplexity as the final generation. This significantly improves accuracy without sacrificing diversity.
Background & Motivation¶
- Current mainstream LLM decoding strategies (top-p/nucleus sampling + temperature scaling) are the "industry default", but tend to generate incoherent or incorrect content when the model is uncertain.
- Greedy Decoding: Strong determinism but lacks diversity, prone to falling into local optima.
- Stochastic Sampling: Maintains diversity but does not guarantee coherence; error-prone at highly uncertain steps.
- Beam Search: Large computational overhead (\(O(L \times B)\)) and lacks adaptability, treating certain and uncertain steps equally.
- Self-Consistency: Requires running the entire decoding process \(N_{sc}\) times, with a cost of \(O(N_{sc} \times L)\), offering no adaptability at intermediate steps.
- The core insight stems from human behavior: when facing uncertain steps during problem-solving, humans tend to "think more," explore multiple paths, and eventually choose the one they are most confident in. Models can also mimic this "cautious" strategy.
- Existing inference-time methods like CoT and Self-Refinement largely rely on external feedback or extensive sampling, leaving a gap for a lightweight, adaptive, and training-free decoding scheme.
Method¶
Overall Architecture¶
The core loop of CNTP: Before generating each token, compute the predictive entropy at the current position \(\rightarrow\) If entropy is low, proceed with normal single-token sampling \(\rightarrow\) If entropy is high, independently sample \(N\) candidate paths up to a punctuation mark \(\rightarrow\) Compute the perplexity of each path \(\rightarrow\) Select the path with the lowest perplexity to append to the sequence \(\rightarrow\) Continue generation.
Key Designs¶
1. Entropy-Based Uncertainty Detection¶
Given the current sequence \(s\), compute the entropy of the probability distribution for the next token:
Set two thresholds \(H_{min}\) and \(H_{max}\), and a maximum number of trials \(N_{max}\). The entropy is linearly mapped to the number of trials:
- \(H(s) < H_{min}\): The model is highly confident; \(N=1\), degrading to standard single-step sampling.
- \(H(s) > H_{max}\): The model is highly uncertain; \(N=N_{max}\), encouraging full exploration.
2. Multi-Path Sampling and Punctuation-Based Termination¶
When \(N > 1\), independently sample \(N\) candidate paths, each starting from the current position until meeting a punctuation mark (., ?, !, :, ;, ), ], \n) or reaching stop criteria. The design of punctuation-based termination enables CNTP to make locally optimal choices at the sentence level rather than the token level or full-answer level.
3. Perplexity Ranking Selection¶
For each candidate path \(s_i\), compute the sentence-level perplexity:
where \(\mathcal{L}(s_i) = -\sum_{t} \log p(w_t | s_{<t})\).
Select the path with the lowest perplexity as the optimal continuation. This strategy leverages the model's own likelihood function as a "judge," without requiring external feedback.
4. Integration with Self-Consistency¶
CNTP can serve as an "inner-loop" optimization for Self-Consistency (SC): employing CNTP within each independent reasoning chain of SC improves the quality of individual chains before applying majority voting. Thus, CNTP and SC are orthogonally complementary.
Complexity Analysis¶
| Method | Complexity | Adaptability |
|---|---|---|
| Greedy Decoding | \(O(L)\) | None |
| Beam Search (B) | \(O(L \times B)\) | None |
| Self-Consistency (\(N_{sc}\)) | \(O(N_{sc} \times L)\) | None |
| CNTP | \(O(L \times (1 + p(N_{max}-1)))\) | Yes |
Here, \(p\) represents the proportion of high-entropy steps. Since \(p \ll 1\) in practice, the computational cost of CNTP is significantly lower than that of Beam Search and SC.
Theoretical Guarantees¶
Theorem 1: Under two mild assumptions (correct paths have the lowest perplexity; high entropy indicates low probability of the correct token): 1. The probability of CNTP generating a correct complete sequence is \(\ge\) single-sample decoding. 2. The average computational cost is strictly lower than \(L \times N_{max}\).
Key Experimental Results¶
Main Results: Llama-3.1-8B-Instruct¶
| Method | GSM8K | MATH | StrategyQA |
|---|---|---|---|
| Greedy Decoding | 79.8 | 41.5 | 72.9 |
| Stochastic Decoding | 79.4±0.8 | 41.5±1.2 | 72.0±0.7 |
| CNTP (Ours) | 81.6±0.6 | 47.1±1.7 | 73.2±0.2 |
| Beam Search (beam=5) | 82.3 | 48.0 | 72.9 |
| SC (40 paths) | 84.8 | 56.0 | 76.2 |
| CNTP + SC (40 paths) | 85.2 | 57.5 | 76.3 |
DeepSeek-R1-Distill-Qwen-1.5B¶
| Method | GSM8K | MATH | StrategyQA |
|---|---|---|---|
| Greedy Decoding | 64.6 | 32.5 | 53.6 |
| Stochastic Decoding | 61.6±1.1 | 27.9±3.7 | 51.7±1.2 |
| CNTP (Ours) | 65.7±0.7 | 37.7±1.7 | 53.0±1.3 |
| SC (40 paths) | 78.3 | 29.5 | 47.7 |
| CNTP + SC (40 paths) | 71.7 | 41.0 | 54.1 |
TruthfulQA (Llama-2-7B-Chat)¶
| Method | Info Acc. | Truth Acc. | Truth-info Acc. |
|---|---|---|---|
| Stochastic Decoding | 88.0±0.6 | 78.0±0.5 | 66.0±0.3 |
| Greedy Decoding | 78.5 | 79.1 | 57.6 |
| CNTP (Ours) | 89.2±1.2 | 84.8±0.5 | 74.0±1.1 |
Multimodal Experiments (MMVet / MathVista)¶
| Method | Llama-3.2-11B MMVet | Llama-3.2-11B MathVista |
|---|---|---|
| Greedy | 48.0 | 53.5 |
| Stochastic | 47.7 | 53.0 |
| CNTP | 53.5 (+5.5) | 58.5 (+5.0) |
Ablation Study¶
Comparison of Uncertainty Metrics:
| Metric | GSM8K | StrategyQA | MATH | TruthfulQA |
|---|---|---|---|---|
| Max token prob | Suboptimal | Suboptimal | Suboptimal | Suboptimal |
| Max-2nd prob | Suboptimal | Suboptimal | Suboptimal | Suboptimal |
| Entropy (Ours) | Optimal | Optimal | Optimal | Optimal |
Relationship Strategy between Number of Trials and Entropy:
| Strategy | GSM8K | StrategyQA | TruthfulQA |
|---|---|---|---|
| Fixed Trials (N=6) | 81.1 | 72.7 | 3.80 |
| Negative Correlation (High Entropy, Fewer Trials) | 81.2 | 72.7 | 3.80 |
| Positive Correlation (High Entropy, More Trials) | 81.6 | 73.2 | 74.0 |
Best-of-N vs CNTP (GSM8K):
| N | 2 | 5 | 10 | 20 | 40 |
|---|---|---|---|---|---|
| Best-of-N (Full-Answer PPL) | 79.2 | 79.5 | 78.2 | 77.3 | 76.1 |
| CNTP | — | — | 81.6 | — | — |
Full-answer level perplexity selection performs worse than CNTP's sentence-level local selection.
Key Findings¶
- CNTP comprehensively outperforms greedy and stochastic decoding in single-chain settings: +5.6% over greedy on MATH, and +5.7% truthfulness on TruthfulQA.
- Orthogonal and complementary to SC: CNTP + SC outperforms pure SC on most tasks.
- Entropy is the best uncertainty metric: By considering the full vocabulary distribution, it surpasses heuristic methods based on top-1/top-2 probabilities.
- Sentence-level rather than full-answer-level PPL selection is crucial: Using full-answer PPL in Best-of-N actually leads to performance degradation.
- Equally effective in multimodal domains: Yields improvements on both LLaVA-CoT and Llama-3.2-Vision.
- There exists an optimal range for \(N_{max}\): Around \([10, 30]\), beyond which an exploration-exploitation imbalance occurs.
Highlights & Insights¶
- The human analogy is intuitive and effective: "when uncertain, explore more paths and select the one with highest confidence"—this is the core philosophy of CNTP.
- Sentence-level local optimality is a key innovation: truncating at punctuation marks is more effective than doing so at the token level or full-answer level.
- As a training-free method, CNTP can be applied plug-and-play to any autoregressive model, ensuring extremely low deployment cost.
- Similar in spirit to Entropix (concurrent work) but with key differences: CNTP features punctuation-based termination and positive-correlation sampling strategies.
Limitations & Future Work¶
- Introduces additional token computation. Although much lower than Beam Search or SC, it still increases inference latency.
- The hyperparameters \(H_{min}=0.01\), \(H_{max}=1.5\), and \(N_{max}=10\) are fixed across all experiments and have not been tuned for different tasks or models.
- Experimented only on medium-scale models (\(\le\) 11B), with no evaluations on LLMs larger than 70B.
- The choice of punctuation set may depend on the language or task, and its cross-lingual generalization remains to be verified.
- Can be combined with speculative decoding or vLLM to further accelerate inference.
Related Work & Insights¶
- Complementary to Self-Consistency: SC votes at the full-chain level, while CNTP optimizes at the sub-sentence level; the combination of the two yields the best performance.
- Parallel to Entropix: Both leverage entropy to guide sampling strategies, but CNTP uniquely introduces the punctuation-based termination mechanism and positive-correlation sampling.
- Distinction from CoT / ToT: Does not rely on prompt engineering or search tree structures, operating purely at the decoding level.
- Insights: CNTP could be extended to non-text domains, such as autoregressive image generation, in the future.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ Novel concept of sentence-level entropy-adaptive sampling |
| Experimental Thoroughness | ⭐⭐⭐⭐ 6 datasets + multiple models + ablation study |
| Value | ⭐⭐⭐⭐ Training-free and plug-and-play |
| Writing Quality | ⭐⭐⭐⭐ Clear motivation, solid combination of theory and experiments |
| Overall Recommendation | ⭐⭐⭐⭐ |