title: >-
  [Paper Note] Making Slow Thinking Faster: Compressing LLM Chain-of-Thought via Step Entropy
description: >-
  [ICLR 2026][LLM Reasoning][GRPO] This paper proposes using "step entropy" to quantify the information contribution of each reasoning step in CoT. It discovers that pruning 80% of the lowest-entropy steps results in almost no accuracy loss. A two-stage SFT+GRPO training pipeline is designed to enable models to autonomously insert [SKIP] tokens during i
tags:
  - ICLR 2026
  - LLM Reasoning
  - GRPO
  - [SKIP] token
date: 2026-05-08
content_hash: aebe20c805b2652f


Making Slow Thinking Faster: Compressing LLM Chain-of-Thought via Step Entropy

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=cGLqQfS5wH
Code: https://github.com/staymylove/COT_Compresstion_via_Step_entropy
Area: LLM Reasoning / Efficient Inference
Keywords: Chain-of-Thought Compression, Step Entropy, Overthinking, GRPO, [SKIP] token

TL;DR

This paper proposes "step entropy" as a measure of how much information each reasoning step in a CoT contributes. The authors find that pruning the 80% of steps with the lowest entropy causes almost no accuracy loss. A two-stage SFT+GRPO training pipeline then teaches models to insert [SKIP] tokens autonomously during inference, reducing token counts by 16–57% while maintaining or even improving accuracy.

Background & Motivation

  • Background: Large Reasoning Models (LRMs) like DeepSeek-R1 and Qwen3 significantly enhance performance in math, code, and symbolic logic through "slow-thinking" long CoT, but the generated reasoning chains are often verbose.
  • Limitations of Prior Work: Long CoT leads to "overthinking": high inference latency, large computational cost, and low efficiency, all of which are critical bottlenecks for large-scale deployment. Existing compression work either makes reasoning implicit/latent (iCoT, COCONUT) at the cost of interpretability and verifiability, or prunes at the token/chunk level (TokenSkip, R1-Compress, CoT-Valve) without a principled way to identify which entire steps are semantically redundant.
  • Key Challenge: When solving problems, humans record only key milestones and omit obvious thoughts. Current methods, however, lack a systematic, information-theoretic signal for judging which steps in a reasoning chain are critical and which are redundant.
  • Goal: Provide a theoretically grounded measure to quantify the importance of each step and use it for both static compression of existing CoT and training models for autonomous compression during inference.
  • Core Idea: [Entropy as Redundancy] If a model generates a step with high confidence (low uncertainty), that step is likely predictable, low-information redundant content. By aggregating token-level entropy into "step entropy," low-entropy steps can be safely removed.

Method

Overall Architecture

The method follows two paths: first, using step entropy as a metric for static pruning (Generate full CoT → Calculate entropy per step → Replace \(\kappa\) proportion of lowest-entropy steps with [SKIP] → Concatenate back to prompt for final answer), validating the "low entropy = redundancy" hypothesis and constructing training data; second, using SFT + GRPO two-stage training to internalize "when to skip" as an autonomous behavior during inference.

flowchart LR
    A[Full CoT Generation] --> B[Step-wise Segmentation S1..SN<br/>\n\n delimiter]
    B --> C[Compute Length-Normalized Step Entropy]
    C --> D[Prune 80% Lowest Entropy Steps<br/>Replace with SKIP]
    D --> E[Compressed CoT Dataset 130k]
    E --> F[Stage1 SFT<br/>Imitate Compression Trajectory]
    F --> G[Stage2 GRPO<br/>Composite Reward Optimization]
    G --> H[Autonomous SKIP Insertion<br/>During Inference]

Key Designs

1. Step Entropy: Aggregating token-level uncertainty into a step-level information measure. The CoT is first segmented at \n\n into a sequence of steps \(C=(S_1,\dots,S_N)\), where step \(S_i\) contains \(M_i\) tokens. When the model generates the \(j\)-th token \(t_{i,j}\) of step \(S_i\) autoregressively from its preceding context \(c_{i,j}\), it produces a distribution over the vocabulary \(V\), and the Shannon entropy \(H(t_{i,j}|c_{i,j})=-\sum_{w\in V}p(w|c_{i,j})\log_2 p(w|c_{i,j})\) characterizes the instantaneous uncertainty. Summing the entropy of all tokens within a step yields the step entropy \(H(S_i|S_{<i})=\sum_{j=1}^{M_i}H(t_{i,j}|c_{i,j})\). The intuition is that high entropy indicates the model is hesitant and the information content is large, while low entropy indicates nearly deterministic output and predictable content. To eliminate step-length bias, the length-normalized step entropy \(\frac{1}{M_i}H(S_i|S_{<i})=\frac{1}{M_i}\sum_{j=1}^{M_i}H(t_{i,j}|c_{i,j})\), i.e., the mean token entropy of the step, is adopted.
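
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `token_entropy` and `step_entropies` are hypothetical, and each next-token distribution is assumed to be available as a plain list of probabilities (in practice it would come from a softmax over the model's logits).

```python
import math

def token_entropy(probs):
    """Shannon entropy (in bits) of a single next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def step_entropies(step_token_dists, normalize=True):
    """Compute one entropy score per reasoning step.

    step_token_dists: list of steps; each step is a list of per-token
    probability distributions (one distribution per generated token).
    normalize=True returns the length-normalized (mean) token entropy,
    normalize=False returns the summed step entropy.
    """
    scores = []
    for dists in step_token_dists:
        total = sum(token_entropy(d) for d in dists)
        scores.append(total / len(dists) if normalize and dists else total)
    return scores

# Toy example over a 4-word vocabulary (made-up numbers).
confident_step = [[0.97, 0.01, 0.01, 0.01]] * 3   # near-deterministic tokens
uncertain_step = [[0.25, 0.25, 0.25, 0.25]] * 2   # uniform, maximally uncertain
scores = step_entropies([confident_step, uncertain_step])
```

In this toy case the confident step scores well below the uncertain one, so it would be ranked first for pruning. Note that without normalization a long confident step can still accumulate a large summed entropy, which is the length bias the normalized form removes.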

2. Theoretical Basis: Step entropy is an upper bound on the mutual information with the answer. Lemma 1 proves that the conditional mutual information between a single step \(S_j\) and the final answer \(A\), given all other steps, is bounded by its step entropy: \(I(S_j;A|\bar{S}_j)\le H(S_j|S_{<j})\). Theorem 1 further generalizes this to any subset of \(K{+}1\) steps \(\tilde S\), such that \(I(\tilde S;A|C\setminus\tilde S)\le\sum_{i=0}^{K}H(S_{k_i}|S_{<k_i})\). This implies that low-entropy steps contribute very little information to the answer, providing information-theoretic rather than purely heuristic support for "pruning low-entropy steps."
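
A short sketch of why the single-step bound is plausible, using only standard identities. It assumes \(\bar{S}_j\) denotes all steps other than \(S_j\) and that \(H(S_j|S_{<j})\) is the summed (unnormalized) step entropy treated as a conditional entropy; it is an illustration and not necessarily the paper's own proof.

\[
\begin{aligned}
I(S_j;A\mid\bar{S}_j) &= H(S_j\mid\bar{S}_j)-H(S_j\mid A,\bar{S}_j) \\
&\le H(S_j\mid\bar{S}_j) \\
&\le H(S_j\mid S_{<j}).
\end{aligned}
\]

The first inequality holds because the conditional entropy of a discrete variable is non-negative. The second holds because \(\bar{S}_j\) contains \(S_{<j}\), and conditioning on additional variables cannot increase entropy.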

3. Low-Entropy Step Pruning + [SKIP] Placeholder Inference. Steps are sorted by entropy in ascending order. The lowest \(\kappa\times N\) steps are replaced by a special [SKIP] token, while high-entropy steps are kept intact to form a compressed chain \(C'\). During inference, \(C'\) is concatenated with the user prompt and </think> delimiter, prompting the model to generate the final answer directly. A key ablation found that using explicit [SKIP] placeholders is more robust than direct deletion at high compression ratios (preserving the structure of remaining steps). Controlled experiments determined a threshold \(\kappa=0.8\)—accuracy remains stable even when 80% of low-entropy steps are pruned, only declining beyond this point and eventually converging to "no-thinking" mode accuracy.
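
The pruning rule described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, rounding \(\kappa\times N\) down is an assumption (the note does not say how it is rounded), and the prompt layout with a `<think>` opener is a guess at the template; only the `</think>` delimiter and the [SKIP] token come from the text.

```python
import math

SKIP = "[SKIP]"

def prune_low_entropy_steps(steps, entropies, kappa=0.8):
    """Replace the kappa fraction of lowest-entropy steps with [SKIP].

    Step order is preserved; only the content of pruned steps changes.
    Rounding kappa * N down is an assumption of this sketch.
    """
    if len(steps) != len(entropies):
        raise ValueError("steps and entropies must have the same length")
    n_prune = math.floor(kappa * len(steps))
    # Indices sorted by ascending entropy; the first n_prune are pruned.
    order = sorted(range(len(steps)), key=lambda i: entropies[i])
    pruned = set(order[:n_prune])
    return [SKIP if i in pruned else step for i, step in enumerate(steps)]

def build_answer_prompt(question, compressed_steps):
    """Concatenate the question and compressed chain, closing the thinking
    block so the model is prompted to produce the final answer directly.
    The exact template is hypothetical."""
    chain = "\n\n".join(compressed_steps)
    return f"{question}\n<think>\n{chain}\n</think>\n"

steps = ["s0", "s1", "s2", "s3", "s4"]
entropies = [0.10, 0.90, 0.20, 0.05, 0.30]
compressed = prune_low_entropy_steps(steps, entropies, kappa=0.8)
prompt = build_answer_prompt("What is 2 + 2?", compressed)
```

With five steps and \(\kappa=0.8\), four steps are replaced and only the highest-entropy step (`s1`) survives. Keeping one placeholder per pruned step, rather than deleting it, mirrors the ablation result that explicit [SKIP] tokens preserve the structure of the remaining steps.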

4. SFT+GRPO Two-Stage Autonomous Compression Training. Static pruning only compresses existing chains; training is required for the model to compress autonomously. Stage 1 (SFT): Fine-tune on (Question, Compressed CoT, Answer) triplets to teach the model to predict compression paths and generate [SKIP], serving as a robust initialization for RL. Stage 2 (GRPO): SFT only mimics static patterns and does not explicitly optimize the accuracy-efficiency tradeoff, so \(K\) completions are sampled for each prompt and scored with a composite reward \(R(C)=[R_{correctness},R_{skip\,ratio},R_{skip\,num},R_{response\,length}]\): +2.0 for a correct answer; +1.0 for a skip ratio \(\ge\kappa_{high}\) or +0.5 for a ratio in \([\kappa_{low},\kappa_{high})\); and a -1.0 penalty if the number of [SKIP] tokens exceeds \(\tau_{skip}\) or the response length exceeds \(\tau_{length}\), to prevent degenerate outputs. The model thus learns a context-aware adaptive strategy: reason in detail when necessary, and skip when appropriate.
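
A rough sketch of how such a composite reward and the group-relative advantages used by GRPO might be computed. Several things here are assumptions rather than details from the paper: the components are simply summed, the two penalties are applied independently, the skip ratio is taken as skipped steps over total steps, the threshold defaults are placeholders, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def composite_reward(correct, n_skip, n_steps, n_tokens,
                     kappa_low=0.4, kappa_high=0.8,
                     tau_skip=50, tau_length=4096):
    """Sum the four reward components for one sampled completion.

    All default thresholds are placeholder values for illustration.
    """
    reward = 2.0 if correct else 0.0            # correctness
    ratio = n_skip / n_steps if n_steps else 0.0
    if ratio >= kappa_high:                     # skip-ratio bonus
        reward += 1.0
    elif ratio >= kappa_low:
        reward += 0.5
    if n_skip > tau_skip:                       # too many [SKIP] tokens
        reward -= 1.0
    if n_tokens > tau_length:                   # response too long
        reward -= 1.0
    return reward

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for the same prompt (made-up statistics).
rewards = [
    composite_reward(True, 8, 10, 1000),    # correct, high skip ratio
    composite_reward(True, 5, 10, 1000),    # correct, medium skip ratio
    composite_reward(False, 9, 10, 5000),   # wrong, high ratio, too long
    composite_reward(True, 60, 70, 1000),   # correct, but too many skips
]
advantages = group_advantages(rewards)
```

Because advantages are computed relative to the other completions for the same prompt, a correct but verbose sample is pushed down only when a correct and more compact sample exists in the group, which is one way to read the "skip when appropriate" behavior.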

Key Experimental Results

Main Results Table (80% Low-Entropy Static Pruning, Pass@1 ACC% / Avg. Thinking Tokens)

| Model | GSM8k | Math500 | AIME 2024 | AIME 2025 |
|---|---|---|---|---|
| DeepSeek-R1-7B | 78.54 / 298 | 88.17 / 3704 | 63.33 / 15843 | 35.71 / 18203 |
| R1-7B (Ours) | 80.82 / 294 (↓1.3%) | 88.17 / 2604 (↓29.7%) | 56.67 / 10093 (↓36.3%) | 35.71 / 11471 (↓37.0%) |
| DeepSeek-R1-14B | 82.64 / 284 | 84.37 / 2854 | 65.52 / 15415 | 58.62 / 18000 |
| R1-14B (Ours) | 84.00 / 278 (↓1.9%) | 82.16 / 1981 (↓30.6%) | 58.62 / 8706 (↓43.5%) | 51.72 / 10842 (↓39.8%) |
| Qwen3-8B | 94.46 / 3054 | 91.37 / 7138 | 79.31 / 20937 | 76.92 / 19902 |
| Qwen3-8B (Ours) | 94.39 / 2557 (↓16.2%) | 91.13 / 5209 (↓27.0%) | 81.48 / 11534 (↓44.9%) | 76.00 / 11717 (↓41.1%) |

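Static pruning is effective across both the DeepSeek-R1 and Qwen3 model families: thinking tokens fall by 27–45% on Math500 and AIME, and by 1–16% on GSM8k. Accuracy rises slightly on GSM8k for both R1 models and stays within about one point of baseline for Qwen3-8B wherever it drops, while the R1 models lose roughly 7 points on AIME 2024.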

Two-Stage Training + Comparison with SOTA

| Training Stage (R1-7B) | GSM8k | Math500 | AIME 2024 | AIME 2025 |
|---|---|---|---|---|
| Baseline | 78.54 | 88.17 | 63.33 | 35.71 |
| SFT | 78.47 (↓43% tok) | 85.92 (↓25%) | 56.67 (↓42%) | 30.00 (↓35%) |
| SFT+GRPO | 79.15 (↓44%) | 85.00 (↓35%) | 57.14 (↓57%) | 33.33 (↓41%) |

| Method (vs. Full-CoT) | Math500 ACC/Tok | AIME2024 ACC/Tok |
|---|---|---|
| CoT-Valve | ↓10.6% / ↓48.4% | ↓15.0% / ↓34.6% |
| TokenSkip | ↓5.2% / ↓11.1% | ↓12.3% / ↓27.5% |
| R1-Compress | ↓3.2% / ↓20.3% | ↓6.2% / ↓12.9% |
| Ours (SFT+RL) | ↓3.2% / ↓35.0% | ↓6.2% / ↓57.0% |

Key Findings

  • 80% is a Safe Threshold: Accuracy remains stable as up to 80% of the lowest-entropy steps are pruned and drops only beyond that point. In contrast, pruning high-entropy steps causes an immediate drop and performs worse than "no-thinking" mode once more than 40% is pruned. Random pruning lies in between. These curves support the "low entropy = redundancy" hypothesis.
  • Step-level > Token-level: Step-level pruning maintains baseline accuracy even when 40% of thinking tokens are removed, whereas token-level entropy-based removal shows sharp declines at 20%, indicating that "reasoning steps" are the correct units for semantic compression.
  • Training > Static: On the challenging AIME 2024 with R1-7B, the SFT+GRPO model cuts tokens by 57.0% (vs. 36.3% for static pruning) at slightly higher accuracy than static pruning (57.14 vs. 56.67), although both remain below the 63.33 baseline. This suggests that training learns a more context-aware strategy than a fixed rule. Scalability was also validated by maintaining accuracy on large datasets (130k/40k/90k).

Highlights & Insights

  • Redundant steps can be identified using a signal (token entropy) already generated for free during the inference process, requiring no extra scoring models or external judges, making it extremely lightweight for engineering.
  • Moving the "pruning unit" from tokens to "reasoning steps" aligns with human cognitive intuition of "skipping entire thoughts rather than omitting words," supported by mutual information theory.
  • The [SKIP] placeholder + SFT internalizes step skipping, and GRPO fine-tunes the tradeoff with composite rewards, representing a clean "discover pattern → teach model" two-stage paradigm.

Limitations & Future Work

  • Validation is limited to mathematical reasoning benchmarks (GSM8k/Math500/AIME) and MMLU; more evidence is needed for transferability to code, long-chain agents, and open-domain reasoning.
  • Step entropy requires access to token-level probability distributions, which is not directly applicable to black-box closed-source models.
  • The 80% threshold is an empirically determined fixed hyperparameter; optimal \(\kappa\) may vary by task/difficulty. GRPO reward weights and thresholds (\(\tau_{skip},\tau_{length}\)) also require tuning.
  • On some difficult problems (e.g., AIME2024 with R1-7B), static pruning sacrifices a few accuracy points, suggesting "low entropy = redundancy" is not absolute for high-difficulty long chains.

Related Work

  • Explicit CoT Compression: TokenSkip / LC-Prompt (token-level controlled skipping), R1-Compress (chunk-level compression search), CoT-Valve (variable length architecture), and length-constrained RL rewards—this paper advances further with "step-level + information-theoretic signals."
  • Implicit/Latent Space Reasoning: iCoT, COCONUT, knowledge distillation for internalization, dynamic latent space compression—these offer high efficiency but lose interpretability. This work chooses to maintain explicit chains.
  • Overthinking and Efficient Inference: Echoes research on the overthinking phenomenon, providing a quantifiable and trainable path towards adaptive reasoning that is "as long as necessary, as short as possible."

Rating

  • Novelty: ⭐⭐⭐⭐ The step entropy measure is simple and supported by a theoretical upper bound. The empirical finding that 80% of low-entropy steps can be pruned is impactful—a clear conceptual upgrade over token-level methods.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers three models across two architectures, four benchmarks, both static and training routes, and comparisons with 5 SOTAs. Token vs. step ablation is comprehensive, though limited primarily to the mathematical domain.
  • Writing Quality: ⭐⭐⭐⭐ Logical flow from motivation to theory, empirical results, and training. Diagrams are clear, though some notation overlap between \(\kappa\) and \(\tau\) is slightly confusing.
  • Value: ⭐⭐⭐⭐ Directly addresses the efficiency pain point in LRM deployment. Lightweight signals + plug-and-play pruning + trainable strategies offer high practical value.