Skip to content

Cautious Next Token Prediction

Conference: ACL 2025
arXiv: 2507.03038
Code: Available
Area: Others
Keywords: Decoding Strategy, Entropy-Adaptive Sampling, Perplexity Ranking, Training-free, LLM Inference

TL;DR

Proposes Cautious Next Token Prediction (CNTP), a training-free adaptive decoding strategy. When the model's prediction entropy is high (showing uncertainty), it samples multiple candidate paths up to a punctuation mark, then selects the path with the lowest perplexity as the final generation. This significantly improves accuracy without sacrificing diversity.

Background & Motivation

  • Current mainstream LLM decoding strategies (top-p/nucleus sampling + temperature scaling) are the "industry default", but tend to generate incoherent or incorrect content when the model is uncertain.
  • Greedy Decoding: Strong determinism but lacks diversity, prone to falling into local optima.
  • Stochastic Sampling: Maintains diversity but does not guarantee coherence; error-prone at highly uncertain steps.
  • Beam Search: Large computational overhead (\(O(L \times B)\)) and lacks adaptability, treating certain and uncertain steps equally.
  • Self-Consistency: Requires running the entire decoding process \(N_{sc}\) times, with a cost of \(O(N_{sc} \times L)\), offering no adaptability at intermediate steps.
  • The core insight stems from human behavior: when facing uncertain steps during problem-solving, humans tend to "think more," explore multiple paths, and eventually choose the one they are most confident in. Models can also mimic this "cautious" strategy.
  • Existing inference-time methods like CoT and Self-Refinement largely rely on external feedback or extensive sampling, leaving a gap for a lightweight, adaptive, and training-free decoding scheme.

Method

Overall Architecture

The core loop of CNTP: Before generating each token, compute the predictive entropy at the current position \(\rightarrow\) If entropy is low, proceed with normal single-token sampling \(\rightarrow\) If entropy is high, independently sample \(N\) candidate paths up to a punctuation mark \(\rightarrow\) Compute the perplexity of each path \(\rightarrow\) Select the path with the lowest perplexity to append to the sequence \(\rightarrow\) Continue generation.

Key Designs

1. Entropy-Based Uncertainty Detection

Given the current sequence \(s\), compute the entropy of the probability distribution for the next token:

\[H(s) = -\sum_{w} p(w|s) \log p(w|s)\]

Set two thresholds \(H_{min}\) and \(H_{max}\), and a maximum number of trials \(N_{max}\). The entropy is linearly mapped to the number of trials:

\[N = \max\left(1, \min\left(N_{max}, \left\lfloor\frac{H - H_{min}}{H_{max} - H_{min}} \times N_{max}\right\rfloor\right)\right)\]
  • \(H(s) < H_{min}\): The model is highly confident; \(N=1\), degrading to standard single-step sampling.
  • \(H(s) > H_{max}\): The model is highly uncertain; \(N=N_{max}\), encouraging full exploration.

2. Multi-Path Sampling and Punctuation-Based Termination

When \(N > 1\), independently sample \(N\) candidate paths, each starting from the current position until meeting a punctuation mark (., ?, !, :, ;, ), ], \n) or reaching stop criteria. The design of punctuation-based termination enables CNTP to make locally optimal choices at the sentence level rather than the token level or full-answer level.

3. Perplexity Ranking Selection

For each candidate path \(s_i\), compute the sentence-level perplexity:

\[\text{PPL}(s_i) = \exp(\mathcal{L}(s_i)/|s_i|)\]

where \(\mathcal{L}(s_i) = -\sum_{t} \log p(w_t | s_{<t})\).

Select the path with the lowest perplexity as the optimal continuation. This strategy leverages the model's own likelihood function as a "judge," without requiring external feedback.

4. Integration with Self-Consistency

CNTP can serve as an "inner-loop" optimization for Self-Consistency (SC): employing CNTP within each independent reasoning chain of SC improves the quality of individual chains before applying majority voting. Thus, CNTP and SC are orthogonally complementary.

Complexity Analysis

Method Complexity Adaptability
Greedy Decoding \(O(L)\) None
Beam Search (B) \(O(L \times B)\) None
Self-Consistency (\(N_{sc}\)) \(O(N_{sc} \times L)\) None
CNTP \(O(L \times (1 + p(N_{max}-1)))\) Yes

Here, \(p\) represents the proportion of high-entropy steps. Since \(p \ll 1\) in practice, the computational cost of CNTP is significantly lower than that of Beam Search and SC.

Theoretical Guarantees

Theorem 1: Under two mild assumptions (correct paths have the lowest perplexity; high entropy indicates low probability of the correct token): 1. The probability of CNTP generating a correct complete sequence is \(\ge\) single-sample decoding. 2. The average computational cost is strictly lower than \(L \times N_{max}\).

Key Experimental Results

Main Results: Llama-3.1-8B-Instruct

Method GSM8K MATH StrategyQA
Greedy Decoding 79.8 41.5 72.9
Stochastic Decoding 79.4±0.8 41.5±1.2 72.0±0.7
CNTP (Ours) 81.6±0.6 47.1±1.7 73.2±0.2
Beam Search (beam=5) 82.3 48.0 72.9
SC (40 paths) 84.8 56.0 76.2
CNTP + SC (40 paths) 85.2 57.5 76.3

DeepSeek-R1-Distill-Qwen-1.5B

Method GSM8K MATH StrategyQA
Greedy Decoding 64.6 32.5 53.6
Stochastic Decoding 61.6±1.1 27.9±3.7 51.7±1.2
CNTP (Ours) 65.7±0.7 37.7±1.7 53.0±1.3
SC (40 paths) 78.3 29.5 47.7
CNTP + SC (40 paths) 71.7 41.0 54.1

TruthfulQA (Llama-2-7B-Chat)

Method Info Acc. Truth Acc. Truth-info Acc.
Stochastic Decoding 88.0±0.6 78.0±0.5 66.0±0.3
Greedy Decoding 78.5 79.1 57.6
CNTP (Ours) 89.2±1.2 84.8±0.5 74.0±1.1

Multimodal Experiments (MMVet / MathVista)

Method Llama-3.2-11B MMVet Llama-3.2-11B MathVista
Greedy 48.0 53.5
Stochastic 47.7 53.0
CNTP 53.5 (+5.5) 58.5 (+5.0)

Ablation Study

Comparison of Uncertainty Metrics:

Metric GSM8K StrategyQA MATH TruthfulQA
Max token prob Suboptimal Suboptimal Suboptimal Suboptimal
Max-2nd prob Suboptimal Suboptimal Suboptimal Suboptimal
Entropy (Ours) Optimal Optimal Optimal Optimal

Relationship Strategy between Number of Trials and Entropy:

Strategy GSM8K StrategyQA TruthfulQA
Fixed Trials (N=6) 81.1 72.7 3.80
Negative Correlation (High Entropy, Fewer Trials) 81.2 72.7 3.80
Positive Correlation (High Entropy, More Trials) 81.6 73.2 74.0

Best-of-N vs CNTP (GSM8K):

N 2 5 10 20 40
Best-of-N (Full-Answer PPL) 79.2 79.5 78.2 77.3 76.1
CNTP 81.6

Full-answer level perplexity selection performs worse than CNTP's sentence-level local selection.

Key Findings

  1. CNTP comprehensively outperforms greedy and stochastic decoding in single-chain settings: +5.6% over greedy on MATH, and +5.7% truthfulness on TruthfulQA.
  2. Orthogonal and complementary to SC: CNTP + SC outperforms pure SC on most tasks.
  3. Entropy is the best uncertainty metric: By considering the full vocabulary distribution, it surpasses heuristic methods based on top-1/top-2 probabilities.
  4. Sentence-level rather than full-answer-level PPL selection is crucial: Using full-answer PPL in Best-of-N actually leads to performance degradation.
  5. Equally effective in multimodal domains: Yields improvements on both LLaVA-CoT and Llama-3.2-Vision.
  6. There exists an optimal range for \(N_{max}\): Around \([10, 30]\), beyond which an exploration-exploitation imbalance occurs.

Highlights & Insights

  • The human analogy is intuitive and effective: "when uncertain, explore more paths and select the one with highest confidence"—this is the core philosophy of CNTP.
  • Sentence-level local optimality is a key innovation: truncating at punctuation marks is more effective than doing so at the token level or full-answer level.
  • As a training-free method, CNTP can be applied plug-and-play to any autoregressive model, ensuring extremely low deployment cost.
  • Similar in spirit to Entropix (concurrent work) but with key differences: CNTP features punctuation-based termination and positive-correlation sampling strategies.

Limitations & Future Work

  • Introduces additional token computation. Although much lower than Beam Search or SC, it still increases inference latency.
  • The hyperparameters \(H_{min}=0.01\), \(H_{max}=1.5\), and \(N_{max}=10\) are fixed across all experiments and have not been tuned for different tasks or models.
  • Experimented only on medium-scale models (\(\le\) 11B), with no evaluations on LLMs larger than 70B.
  • The choice of punctuation set may depend on the language or task, and its cross-lingual generalization remains to be verified.
  • Can be combined with speculative decoding or vLLM to further accelerate inference.
  • Complementary to Self-Consistency: SC votes at the full-chain level, while CNTP optimizes at the sub-sentence level; the combination of the two yields the best performance.
  • Parallel to Entropix: Both leverage entropy to guide sampling strategies, but CNTP uniquely introduces the punctuation-based termination mechanism and positive-correlation sampling.
  • Distinction from CoT / ToT: Does not rely on prompt engineering or search tree structures, operating purely at the decoding level.
  • Insights: CNTP could be extended to non-text domains, such as autoregressive image generation, in the future.

Rating

Dimension Score
Novelty ⭐⭐⭐⭐ Novel concept of sentence-level entropy-adaptive sampling
Experimental Thoroughness ⭐⭐⭐⭐ 6 datasets + multiple models + ablation study
Value ⭐⭐⭐⭐ Training-free and plug-and-play
Writing Quality ⭐⭐⭐⭐ Clear motivation, solid combination of theory and experiments
Overall Recommendation ⭐⭐⭐⭐