LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

Conference: NeurIPS 2025 arXiv: 2505.19187 Code: LIMOPro (mentioned in the paper) Area: LLM Reasoning / Efficient Inference Keywords: reasoning refinement, chain-of-thought, perplexity-based pruning, test-time scaling, efficient reasoning, PIR

TL;DR

This paper proposes PIR (Perplexity-based Importance Refinement), a framework that separates reasoning chains distilled from LRMs into "progressive reasoning" and "functional steps" (verification / multi-method validation / error correction), and prunes only the functional steps with low PIR scores, leaving the progressive reasoning backbone intact. Fine-tuning on the refined data improves accuracy by 0.9%–6.6% on AIME/AMC/GPQA while reducing token usage by 3%–41%, yielding up to a 71% efficiency gain.

Background & Motivation

  • Background: Large reasoning models such as DeepSeek-R1 and QwQ generate CoT chains containing extensive functional steps — verification, error correction, and multi-method validation — that simulate human problem-solving but substantially increase inference overhead.
  • Limitations of Prior Work: Using such verbose reasoning chains for SFT transfers the same redundant reasoning behavior to student models, significantly raising inference time and compute cost. Existing approaches (e.g., SPIRIT) apply uniform perplexity-based pruning without distinguishing step types, risking the removal of critical progressive reasoning steps and degrading accuracy.
  • Key Challenge: The heterogeneous importance of functional steps is ignored: different instances of the same step type (e.g., verification) contribute very differently to the final answer, requiring quantitative assessment rather than heuristic deletion.
  • Goal: Achieve a favorable efficiency–quality trade-off for practical test-time scaling deployment, and provide a generalizable framework across diverse distillation sources (Gemini: 71.4% progressive reasoning vs. DeepSeek-R1: 59.7%).

Method

The PIR framework consists of a four-stage pipeline:

1. Reasoning Chain Segmentation and Classification

  • Claude 3.7 Sonnet segments reasoning chains into logical steps, each comprising multiple coherent sentences.
  • A two-stage classification is applied: rule-based matching of linguistic markers ("Let me check" → verification; "I made a mistake" → error correction), followed by Claude-based contextual analysis for steps lacking explicit markers.
  • Four reasoning patterns are defined: progressive reasoning (forward chained inference, always retained), verification (checking existing computations), multi-method validation (re-solving via an alternative approach), and error correction (fixing identified mistakes).
  • Human validation: 5% of steps are randomly sampled and independently evaluated by four graduate students, yielding 93.4% classification accuracy.
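The rule-based first pass of the classification can be sketched as follows. The marker lists here are illustrative stand-ins (only the two quoted markers come from the paper), and the paper routes unmatched steps to Claude 3.7 Sonnet for contextual analysis rather than defaulting them as this sketch does:

```python
import re

# Illustrative marker lists; the paper's pipeline falls back to an LLM
# classifier (Claude 3.7 Sonnet) for steps without explicit markers.
MARKERS = {
    "verification": [r"\blet me (check|verify)\b", r"\bdouble-check\b"],
    "error_correction": [r"\bi made a mistake\b", r"\bthat's wrong\b"],
    "multi_method_validation": [r"\balternatively\b", r"\banother (way|approach)\b"],
}

def classify_step(step: str) -> str:
    """Rule-based first pass; anything unmatched defaults to progressive
    reasoning here (the paper instead sends it to the LLM classifier)."""
    text = step.lower()
    for label, patterns in MARKERS.items():
        if any(re.search(p, text) for p in patterns):
            return label
    return "progressive_reasoning"

print(classify_step("Let me check this computation."))   # verification
print(classify_step("Substituting x = 2 gives y = 7."))  # progressive_reasoning
```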

2. PIR Score Computation

  • Core Idea: The greater the increase in answer perplexity upon removing a step, the more important that step is.
  • \(\text{PIR}(x_i) = \log\!\left(\frac{\text{PPL}(R \setminus \{x_i\})}{\text{PPL}(R)}\right)\)
  • Perplexity is computed using Qwen2.5-32B-Instruct to measure the change in model confidence over the correct answer after removing step \(i\).
  • A higher PIR value indicates greater importance (removal causes a sharp drop in answer confidence).
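A minimal sketch of the PIR computation, assuming the scoring model (Qwen2.5-32B-Instruct in the paper) exposes per-token log-probabilities of the correct answer conditioned on the reasoning chain; the helper names are hypothetical:

```python
import math

def answer_perplexity(token_logprobs):
    """Perplexity of the answer tokens given a reasoning chain, computed
    from per-token log-probabilities returned by a scoring model."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def pir_score(logprobs_full, logprobs_without_step):
    """PIR(x_i) = log( PPL(R \\ {x_i}) / PPL(R) ); higher = more important."""
    ppl_full = answer_perplexity(logprobs_full)
    ppl_pruned = answer_perplexity(logprobs_without_step)
    return math.log(ppl_pruned / ppl_full)

# Removing an important step makes the answer less likely (lower logprobs),
# so its perplexity rises and the PIR score is positive.
full = [-0.1, -0.2, -0.15]          # answer logprobs with the step present
without_step = [-0.9, -1.1, -0.8]   # answer logprobs with the step removed
print(pir_score(full, without_step) > 0)  # True
```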

3. Selective Pruning

  • Core Principle: All progressive reasoning steps are retained in full; only functional steps are ranked by PIR score and pruned.
  • Functional steps with the lowest PIR values are removed according to a preset ratio threshold.
  • The optimal pruning ratio lies in the range 0.2–0.3; excessively high ratios (e.g., 0.8) achieve the greatest length reduction but incur accuracy degradation.
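The selective-pruning step can be sketched as below; the `steps`/`pir` data layout and the two-kind labeling are illustrative, not the paper's exact interfaces:

```python
def selective_prune(steps, pir, ratio=0.25):
    """steps: list of (text, kind) with kind in {"progressive", "functional"}.
    pir: dict mapping a functional step's index to its PIR score.
    All progressive steps are kept; the `ratio` fraction of functional steps
    with the lowest PIR is dropped (0.2-0.3 worked best in the paper)."""
    functional = [i for i, (_, kind) in enumerate(steps) if kind == "functional"]
    n_drop = int(len(functional) * ratio)
    drop = set(sorted(functional, key=lambda i: pir[i])[:n_drop])
    return [text for i, (text, _) in enumerate(steps) if i not in drop]

steps = [("a", "progressive"), ("b", "functional"), ("c", "functional"),
         ("d", "functional"), ("e", "functional")]
pir = {1: 0.5, 2: -0.2, 3: 0.9, 4: 0.1}
# Drops only "c", the functional step with the lowest PIR score.
print(selective_prune(steps, pir, ratio=0.25))  # ['a', 'b', 'd', 'e']
```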

4. Dataset Construction and Fine-tuning

  • PIR is applied to three datasets: LIMO (distilled from DeepSeek-R1), S1K (distilled from Gemini), and LIMO-V2 (distilled from QwQ).
  • Refined variants LIMO-P, S1K-P, and LIMO-V2-P are constructed for SFT.

Key Experimental Results

| Model | AIME ACC↑ | AIME TOK↓ | AIME EFF↑ | AMC ACC↑ | GPQA ACC↑ | GPQA TOK↓ |
|---|---|---|---|---|---|---|
| S1-32B | 37.9 | 6646 | 5.71E-5 | 80.9 | 60.7 | 4172 |
| S1-32B-P | 42.1 (+4.2) | 4716 (-29%) | 8.92E-5 (+56%) | 83.1 (+2.2) | 61.6 (+0.9) | 2472 (-41%) |
| LIMO | 56.7 | 12497 | 4.53E-5 | 91.9 | 67.2 | 7173 |
| LIMO-P | 63.3 (+6.6) | 10588 (-15%) | 5.98E-5 (+32%) | 93.8 (+1.9) | 71.2 (+4.0) | 6969 (-3%) |
| LIMO-V2 | 66.3 | 13896 | – | 94.4 | 70.2 | 8035 |
| LIMO-V2-P | 71.2 (+4.9) | 12163 (-12%) | – | 96.6 (+2.2) | 74.2 (+3.0) | 6968 (-13%) |
| Method Comparison (S1K) | AIME ACC | AIME TOK | GPQA ACC | GPQA EFF |
|---|---|---|---|---|
| S1-32B (baseline) | 37.9 | 6646 | 60.7 | 1.46E-4 |
| S1-PROMPT | 36.7 | 8013 | 58.0 | 2.03E-4 |
| S1-SPIRIT (all-step pruning) | 37.1 | 4906 | 60.1 | 2.13E-4 |
| S1-RULE (random functional pruning) | 36.7 | 4807 | 58.1 | 1.51E-4 |
| S1-32B-P (PIR) | 42.1 | 4716 | 61.6 | 2.49E-4 |
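The EFF columns appear to be accuracy (as a fraction) divided by average token count; that is an inference from the table values, not a formula stated in this summary. Under that assumption, the 71% GPQA efficiency gain can be reproduced from the numbers above:

```python
def efficiency(acc_percent, tokens):
    # Assumed definition: accuracy (as a fraction) per output token,
    # consistent with the EFF values reported in the tables.
    return acc_percent / 100 / tokens

base = efficiency(60.7, 4172)    # S1-32B on GPQA -> ~1.46E-4
pir = efficiency(61.6, 2472)     # S1-32B-P on GPQA -> ~2.49E-4
print(round(pir / base - 1, 2))  # ~0.71, i.e. the reported 71% gain
```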

Highlights & Insights

  • Counterintuitive "less is more" finding: Removing low-value reasoning steps actually improves accuracy, suggesting that redundant functional steps may interfere with model learning.
  • Step-type distinction is critical: By explicitly separating progressive reasoning from functional steps, PIR outperforms SPIRIT's undifferentiated pruning by 5 points on AIME.
  • Remarkable efficiency gains: S1-32B-P achieves a 71% efficiency improvement on GPQA, with accuracy increasing while token count is nearly halved.
  • Cross-source generalization: PIR proves effective across data distilled from Gemini, DeepSeek-R1, and QwQ, indicating that it captures universal properties of reasoning.
  • Cross-scale generalization: Models ranging from 3B to 32B parameters all benefit, with the most pronounced gain on AIME (+11.8% accuracy) observed at the 32B scale.

Limitations & Future Work

  • Validation is limited to mathematical and scientific reasoning tasks; broader domains such as logical and commonsense reasoning remain unexplored.
  • Perplexity may not fully capture the semantic contribution of certain steps, introducing the risk of information loss.
  • The optimal pruning ratio varies across tasks and models; an adaptive strategy is lacking.
  • The method relies on model perplexity outputs and is therefore not applicable to closed-source models.
  • The classification stage depends on Claude 3.7 Sonnet, introducing additional cost and potential classifier bias.

Rating

  • Novelty: ⭐⭐⭐⭐ The PIR metric and the design principle of retaining progressive reasoning while pruning functional steps are clear and novel, though the overall framework builds incrementally on prior work (SPIRIT).
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three distillation sources × three benchmarks × multiple model scales × ablation studies — highly comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ The problem is clearly defined, experimental results are presented systematically, and tables are highly informative.
  • Value: ⭐⭐⭐⭐ Offers direct practical value for LLM inference efficiency optimization; the method is simple and reproducible.