Inference-Time Chain-of-Thought Pruning with Latent Informativeness Signals¶
Conference: NeurIPS 2025 arXiv: 2511.00699 Code: Not open-sourced Area: LLM Reasoning Keywords: inference-time scaling, chain-of-thought pruning, KL divergence, Best-of-N, reasoning efficiency
TL;DR¶
This paper proposes KAPPA (KL-Adjusted Pruned Path Algorithm), which progressively prunes reasoning branches in Best-of-N sampling using three training-free signals — KL divergence, confidence, and entropy — achieving up to 60% peak memory reduction and 90% token generation reduction while maintaining accuracy.
Background & Motivation¶
Background: LLMs improve reasoning accuracy via Chain-of-Thought (CoT) and Best-of-N (BoN) sampling — generating N reasoning paths and selecting the best. This is a core paradigm for inference-time scaling.
Limitations of Prior Work: Standard BoN requires fully generating all N paths, with compute and memory costs scaling linearly with N. Existing methods such as ST-BoN truncate poor branches via consistency heuristics, but consistency criteria do not directly assess branch quality. DeepConf uses confidence-weighted voting but still requires completing multiple paths.
Key Challenge: The effectiveness of inference-time scaling relies on sampling more paths (larger N is better), yet the cost of fully generating all paths limits the practical range of N — the central challenge is how to substantially reduce redundant computation without sacrificing accuracy.
Goal: Design a training-free, information-theoretic branch pruning algorithm that progressively eliminates low-quality reasoning branches during inference.
Key Insight: KL divergence is used as a self-supervised signal for branch "informativeness" — branches that deviate more from the unconditional distribution carry more information — combined with confidence and entropy for a composite score.
Core Idea: Without any external reward model, the logit distribution characteristics of the model itself suffice to determine which reasoning paths are worth continuing.
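As a toy illustration of the informativeness signal (a 3-token vocabulary with a uniform reference distribution; my own example, not one from the paper): a decisive branch with \(p = (0.8, 0.1, 0.1)\) against \(q = (\tfrac13, \tfrac13, \tfrac13)\) gives \(D_{KL}(p \| q) \approx 0.46\) nats, whereas an indecisive branch with \(p = (0.4, 0.3, 0.3)\) gives only \(D_{KL}(p \| q) \approx 0.01\) nats — the branch whose distribution has moved further from the prior scores as more informative.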
Method¶
Overall Architecture¶
KAPPA consists of three phases: Draft (exploration) → Scoring & Gating (scoring and gate-based pruning) → Continuation (exploitation).
Key Designs¶
- Draft Phase
  - Function: Generates the prefixes of N reasoning branches in parallel until the truncation point \(c\) (the earliest step at which all branches are pairwise inconsistent).
  - Mechanism: Sufficiently explores the space to ensure adequate diversity among branches before evaluation begins.
  - Design Motivation: Pruning too early risks eliminating potentially promising branches; branches must first "unfold."
- Scoring & Gating Phase
  - Function: Within the window \([c, c+\tau)\), computes a composite score for each surviving branch at every step and progressively prunes the branches (a code sketch follows this list).
  - Mechanism: Three-signal fusion scoring:
    - KL Divergence: \(D_{KL}(p_t^i \| q)\), the KL divergence between the current branch's next-token distribution and the unconditional (BOS-token) distribution, quantifying information gain.
    - Confidence: \(C_t^i = \max_v p_t^i(v)\), the top-1 token probability, reflecting the model's certainty about the current prediction.
    - Entropy: \(H_t^i\), the uncertainty of the predictive distribution.
  - Score Computation: \(s_t^i = w_{KL} \cdot \widehat{\mathrm{EMA}}_t^i + w_C \cdot \hat{C}_t^i + w_H \cdot \hat{H}_t^i\), with weights \((0.7, 0.2, 0.1)\).
  - Stabilization: Median-of-Means (MoM, 4 buckets) combined with an exponential moving average (EMA, \(\alpha=0.5\)) smooths the signals; scores are z-score normalized and clipped to \([-3, 3]\).
  - Pruning Schedule: A linear schedule eliminates the lowest-scoring branch at each step, leaving exactly 1 branch after \(\tau\) steps.
  - Design Motivation: KL divergence serves as a training-free branch quality signal, avoiding the overhead of an external reward model.
- Continuation Phase
  - Function: Continues autoregressive decoding on the sole surviving branch until EOS.
  - Mechanism: All remaining computation is concentrated on the highest-quality branch.
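To make the scoring-and-gating loop concrete, here is a minimal, self-contained sketch. The official code is not released, so all names (`step_signals`, `kappa_prune`, `get_logits`) are illustrative assumptions, random logits stand in for a real model, and the Median-of-Means smoothing is omitted; only the KL/confidence/entropy scoring, EMA, z-score clipping, and a linear prune schedule are shown.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def step_signals(branch_logits, uncond_logits):
    """Per-branch signals from next-token logits.
    branch_logits: (n_alive, vocab); uncond_logits: (vocab,)."""
    p = softmax(branch_logits)                                   # conditional distributions
    q = softmax(uncond_logits)                                   # unconditional (BOS) reference
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(-1)   # informativeness signal
    conf = p.max(-1)                                             # top-1 probability
    ent = -(p * np.log(p + 1e-12)).sum(-1)                       # predictive entropy
    return kl, conf, ent

def zscore(x):
    return np.clip((x - x.mean()) / (x.std() + 1e-8), -3.0, 3.0)

def kappa_prune(get_logits, uncond_logits, n_branches=8, tau=8,
                weights=(0.7, 0.2, 0.1), ema_alpha=0.5):
    """Gating window: score the surviving branches each step and prune on a
    linear schedule so exactly one branch remains after tau steps.
    `get_logits(alive)` stands in for one decoding step over the surviving
    branches and must return (len(alive), vocab) next-token logits."""
    alive = list(range(n_branches))
    ema_kl = np.zeros(n_branches)
    for t in range(1, tau + 1):
        logits = get_logits(alive)
        kl, conf, ent = step_signals(logits, uncond_logits)
        idx = np.array(alive)
        ema_kl[idx] = ema_alpha * kl + (1 - ema_alpha) * ema_kl[idx]   # EMA of KL signal
        score = (weights[0] * zscore(ema_kl[idx])
                 + weights[1] * zscore(conf)
                 + weights[2] * zscore(ent))
        # Linear schedule: target branch count decays from N to 1 over tau steps.
        target = max(1, round(n_branches - (n_branches - 1) * t / tau))
        while len(alive) > target:
            worst = int(np.argmin(score))
            alive.pop(worst)
            score = np.delete(score, worst)
    return alive[0]  # sole survivor, handed to the continuation phase

# Toy usage: random logits stand in for a real model's decoding step.
rng = np.random.default_rng(0)
vocab = 32
survivor = kappa_prune(lambda alive: rng.normal(size=(len(alive), vocab)),
                       uncond_logits=rng.normal(size=vocab))
print("surviving branch index:", survivor)
```

In a real integration, `get_logits` would be one batched forward step of the LLM over the surviving branches, and the KL values would be MoM-smoothed over a 16-token window before entering the EMA.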
Key Experimental Results¶
Main Results (Accuracy vs. Efficiency)¶
Evaluated on GSM8K and MATH500 with N = 5/10/20:
| Model | Dataset | Method | N=20 Accuracy | N=20 Peak Memory | Token Reduction |
|---|---|---|---|---|---|
| DeepSeek-R1-1.5B | MATH500 | BoN | ~70% | 16240 MB | - |
| DeepSeek-R1-1.5B | MATH500 | KAPPA | 72.2% (+1–2%) | 6495 MB | ~90% |
| Qwen2.5-7B | GSM8K | BoN | baseline | baseline | - |
| Qwen2.5-7B | GSM8K | KAPPA | ≈baseline | ~40%↓ | ~65% |
Core Efficiency Metrics¶
- Peak memory reduction: 4%–60% (depending on model and N).
- Token generation reduction: 65%–90% relative to BoN.
- Largest observed difference: DeepSeek-R1-1.5B on MATH500, N=20 → KAPPA uses only 2,113 tokens vs. BoN's 20,053 tokens (89.5% reduction).
Ablation Study / Key Findings¶
- Smaller models benefit more: KAPPA consistently improves accuracy by 1–2% on DeepSeek-R1-1.5B, as smaller models produce more low-quality branches, making pruning more effective.
- Over-pruning risk for larger models: Accuracy gains are inconsistent on Qwen2.5-7B, because the overall branch quality of larger models is higher and linear pruning may eliminate promising branches prematurely.
- Hyperparameters: KL weight of 0.7 is the most critical; EMA rate 0.5; MoM window 16; 4 buckets.
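For quick reference, the reported hyperparameters can be gathered into a single configuration object; the field names below are my own shorthand (the paper ships no config file), not the authors' API.

```python
# Reported KAPPA hyperparameters (field names are illustrative shorthand).
KAPPA_CONFIG = {
    "weights": {"kl": 0.7, "confidence": 0.2, "entropy": 0.1},  # KL weight dominates
    "ema_alpha": 0.5,            # exponential moving average rate
    "mom_window": 16,            # Median-of-Means window length (tokens)
    "mom_buckets": 4,            # buckets per MoM estimate
    "zscore_clip": (-3.0, 3.0),  # score normalization / clipping range
}
```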
Highlights & Insights¶
- Training-free branch quality signals: KL divergence, confidence, and entropy are computed directly from model logits without any external reward model or additional training, yielding a lightweight and general-purpose solution.
- Progressive pruning outperforms one-shot truncation: Finer-grained than ST-BoN's one-shot truncation; gradually eliminating low-scoring branches allows scoring signals more time to accumulate.
- Information-theoretic perspective on path evaluation: Using KL divergence to measure how far a branch deviates from the unconditional distribution as a proxy for informativeness is intuitively compelling — conditional reasoning should deviate more from the prior to carry more information.
- Practical solution under the "more thinking is not always better" paradigm: Directly addresses the N-scaling bottleneck of BoN.
Limitations & Future Work¶
- Only two models and two datasets tested: Generalizability remains to be verified across tasks such as code reasoning and commonsense reasoning.
- max_new_tokens capped at 1024: DeepSeek-R1 on MATH500 frequently requires more than 1,024 tokens, and this truncation affects results.
- Linear pruning schedule may be suboptimal: Overly aggressive for larger models; a cosine schedule or adaptive schedule may be preferable.
- Accuracy may degrade as N increases: More branches amplify scoring signal noise and exacerbate over-pruning.
- Choice of reference distribution for KL divergence: Whether the unconditional logits of the BOS token constitute the optimal reference is an open question; alternative reference distributions warrant exploration.
Related Work & Insights¶
- ST-BoN (Wang et al., 2025): The direct predecessor of this work, using consistency-based one-shot truncation → KAPPA replaces this with information-theoretic progressive pruning.
- DeepConf (Fu et al., 2025): Confidence-weighted voting still requires complete generation of multiple paths → KAPPA prunes early.
- INFORM (Zhou et al., 2024): Adaptively determines the number of sampled paths without intervening within individual paths → KAPPA applies fine-grained control inside each path.
- ThinkPrune (Hou et al., 2025): Uses RL to train models to produce shorter reasoning chains → KAPPA provides a training-free alternative.
- Insight: KL divergence signals may also be applicable to draft quality assessment in speculative decoding.
Rating¶
- Novelty: ⭐⭐⭐ The idea of combining KL divergence, confidence, and entropy for branch pruning is original, though each individual component is not novel in isolation.
- Experimental Thoroughness: ⭐⭐⭐ Only two models and two datasets, with token length constraints and insufficiently comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ The algorithm is described clearly; the three-phase structure is easy to follow; figures and tables are presented intuitively.
- Value: ⭐⭐⭐⭐ Inference-time efficiency optimization is a high-demand research direction, and the training-free nature of the approach offers strong practical utility.