LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

Conference: NeurIPS 2025 arXiv: 2505.19187 Code: LIMOPro (mentioned in the paper) Area: LLM Reasoning / Efficient Inference Keywords: reasoning refinement, chain-of-thought, perplexity-based pruning, test-time scaling, efficient reasoning, PIR

TL;DR

This paper proposes PIR (Perplexity-based Importance Refinement), a framework that separates reasoning chains distilled from LRMs into "progressive reasoning" and "functional steps" (verification / multi-method validation / error correction), and prunes only the functional steps with low PIR scores, leaving the progressive reasoning backbone intact. Fine-tuning on the refined data improves accuracy by 0.9%–6.6% on AIME/AMC/GPQA while reducing token usage by 3%–41%, yielding up to a 71% efficiency gain.

Background & Motivation

  • Background: Large reasoning models such as DeepSeek-R1 and QwQ generate CoT chains containing extensive functional steps — verification, error correction, and multi-method validation — that simulate human problem-solving but substantially increase inference overhead.
  • Limitations of Prior Work: Using such verbose reasoning chains for SFT transfers the same redundant reasoning behavior to student models, significantly raising inference time and compute cost. Existing approaches (e.g., SPIRIT) apply uniform perplexity-based pruning without distinguishing step types, risking the removal of critical progressive reasoning steps and degrading accuracy.
  • Key Challenge: The heterogeneous importance of functional steps is ignored: different instances of the same step type (e.g., verification) contribute very differently to the final answer, requiring quantitative assessment rather than heuristic deletion.
  • Goal: Achieve a favorable efficiency–quality trade-off for practical test-time scaling deployment, and provide a generalizable framework across diverse distillation sources (Gemini: 71.4% progressive reasoning vs. DeepSeek-R1: 59.7%).

Method

The PIR framework consists of a four-stage pipeline:

1. Reasoning Chain Segmentation and Classification

  • Claude 3.7 Sonnet segments reasoning chains into logical steps, each comprising multiple coherent sentences.
  • A two-stage classification is applied: rule-based matching of linguistic markers ("Let me check" → verification; "I made a mistake" → error correction), followed by Claude-based contextual analysis for steps lacking explicit markers.
  • Four reasoning patterns are defined: progressive reasoning (forward chained inference, always retained), verification (checking existing computations), multi-method validation (re-solving via an alternative approach), and error correction (fixing identified mistakes).
  • Human validation: 5% of steps are randomly sampled and independently evaluated by four graduate students, yielding 93.4% classification accuracy.
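The rule-based first pass of the classification can be sketched as follows. The marker lists here are illustrative stand-ins (only the two quoted markers come from the paper), and the paper routes unmatched steps to Claude 3.7 Sonnet for contextual analysis rather than defaulting them as this sketch does:

```python
import re

# Illustrative marker lists; the paper's pipeline falls back to an LLM
# classifier (Claude 3.7 Sonnet) for steps without explicit markers.
MARKERS = {
    "verification": [r"\blet me (check|verify)\b", r"\bdouble-check\b"],
    "error_correction": [r"\bi made a mistake\b", r"\bthat's wrong\b"],
    "multi_method_validation": [r"\balternatively\b", r"\banother (way|approach)\b"],
}

def classify_step(step: str) -> str:
    """Rule-based first pass; anything unmatched defaults to progressive
    reasoning here (the paper instead sends it to the LLM classifier)."""
    text = step.lower()
    for label, patterns in MARKERS.items():
        if any(re.search(p, text) for p in patterns):
            return label
    return "progressive_reasoning"

print(classify_step("Let me check this computation."))   # verification
print(classify_step("Substituting x = 2 gives y = 7."))  # progressive_reasoning
```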

2. PIR Score Computation

  • Core Idea: The greater the increase in answer perplexity upon removing a step, the more important that step is.
  • \(\text{PIR}(x_i) = \log\!\left(\frac{\text{PPL}(R \setminus \{x_i\})}{\text{PPL}(R)}\right)\)
  • Perplexity is computed using Qwen2.5-32B-Instruct to measure the change in model confidence over the correct answer after removing step \(i\).
  • A higher PIR value indicates greater importance (removal causes a sharp drop in answer confidence).
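A minimal sketch of the PIR computation, assuming the scoring model (Qwen2.5-32B-Instruct in the paper) exposes per-token log-probabilities of the correct answer conditioned on the reasoning chain; the helper names are hypothetical:

```python
import math

def answer_perplexity(token_logprobs):
    """Perplexity of the answer tokens given a reasoning chain, computed
    from per-token log-probabilities returned by a scoring model."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def pir_score(logprobs_full, logprobs_without_step):
    """PIR(x_i) = log( PPL(R \\ {x_i}) / PPL(R) ); higher = more important."""
    ppl_full = answer_perplexity(logprobs_full)
    ppl_pruned = answer_perplexity(logprobs_without_step)
    return math.log(ppl_pruned / ppl_full)

# Removing an important step makes the answer less likely (lower logprobs),
# so its perplexity rises and the PIR score is positive.
full = [-0.1, -0.2, -0.15]          # answer logprobs with the step present
without_step = [-0.9, -1.1, -0.8]   # answer logprobs with the step removed
print(pir_score(full, without_step) > 0)  # True
```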

3. Selective Pruning

  • Core Principle: All progressive reasoning steps are retained in full; only functional steps are ranked by PIR score and pruned.
  • Functional steps with the lowest PIR values are removed according to a preset ratio threshold.
  • The optimal pruning ratio lies in the range 0.2–0.3; excessively high ratios (e.g., 0.8) achieve the greatest length reduction but incur accuracy degradation.
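The selective-pruning step can be sketched as below; the `steps`/`pir` data layout and the two-kind labeling are illustrative, not the paper's exact interfaces:

```python
def selective_prune(steps, pir, ratio=0.25):
    """steps: list of (text, kind) with kind in {"progressive", "functional"}.
    pir: dict mapping a functional step's index to its PIR score.
    All progressive steps are kept; the `ratio` fraction of functional steps
    with the lowest PIR is dropped (0.2-0.3 worked best in the paper)."""
    functional = [i for i, (_, kind) in enumerate(steps) if kind == "functional"]
    n_drop = int(len(functional) * ratio)
    drop = set(sorted(functional, key=lambda i: pir[i])[:n_drop])
    return [text for i, (text, _) in enumerate(steps) if i not in drop]

steps = [("a", "progressive"), ("b", "functional"), ("c", "functional"),
         ("d", "functional"), ("e", "functional")]
pir = {1: 0.5, 2: -0.2, 3: 0.9, 4: 0.1}
# Drops only "c", the functional step with the lowest PIR score.
print(selective_prune(steps, pir, ratio=0.25))  # ['a', 'b', 'd', 'e']
```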

4. Dataset Construction and Fine-tuning

  • PIR is applied to three datasets: LIMO (distilled from DeepSeek-R1), S1K (distilled from Gemini), and LIMO-V2 (distilled from QwQ).
  • Refined variants LIMO-P, S1K-P, and LIMO-V2-P are constructed for SFT.

Key Experimental Results

| Model | AIME ACC↑ | AIME TOK↓ | AIME EFF↑ | AMC ACC↑ | GPQA ACC↑ | GPQA TOK↓ |
|---|---|---|---|---|---|---|
| S1-32B | 37.9 | 6646 | 5.71E-5 | 80.9 | 60.7 | 4172 |
| S1-32B-P | 42.1 (+4.2) | 4716 (-29%) | 8.92E-5 (+56%) | 83.1 (+2.2) | 61.6 (+0.9) | 2472 (-41%) |
| LIMO | 56.7 | 12497 | 4.53E-5 | 91.9 | 67.2 | 7173 |
| LIMO-P | 63.3 (+6.6) | 10588 (-15%) | 5.98E-5 (+32%) | 93.8 (+1.9) | 71.2 (+4.0) | 6969 (-3%) |
| LIMO-V2 | 66.3 | 13896 | – | 94.4 | 70.2 | 8035 |
| LIMO-V2-P | 71.2 (+4.9) | 12163 (-12%) | – | 96.6 (+2.2) | 74.2 (+3.0) | 6968 (-13%) |
| Method Comparison (S1K) | AIME ACC | AIME TOK | GPQA ACC | GPQA EFF |
|---|---|---|---|---|
| S1-32B (baseline) | 37.9 | 6646 | 60.7 | 1.46E-4 |
| S1-PROMPT | 36.7 | 8013 | 58.0 | 2.03E-4 |
| S1-SPIRIT (all-step pruning) | 37.1 | 4906 | 60.1 | 2.13E-4 |
| S1-RULE (random functional pruning) | 36.7 | 4807 | 58.1 | 1.51E-4 |
| S1-32B-P (PIR) | 42.1 | 4716 | 61.6 | 2.49E-4 |
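The EFF columns appear to be accuracy (as a fraction) divided by average token count; that is an inference from the table values, not a formula stated in this summary. Under that assumption, the 71% GPQA efficiency gain can be reproduced from the numbers above:

```python
def efficiency(acc_percent, tokens):
    # Assumed definition: accuracy (as a fraction) per output token,
    # consistent with the EFF values reported in the tables.
    return acc_percent / 100 / tokens

base = efficiency(60.7, 4172)    # S1-32B on GPQA -> ~1.46E-4
pir = efficiency(61.6, 2472)     # S1-32B-P on GPQA -> ~2.49E-4
print(round(pir / base - 1, 2))  # ~0.71, i.e. the reported 71% gain
```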

Highlights & Insights

  • Counterintuitive "less is more" finding: Removing low-value reasoning steps actually improves accuracy, suggesting that redundant functional steps may interfere with model learning.
  • Step-type distinction is critical: By explicitly separating progressive reasoning from functional steps, PIR outperforms SPIRIT's undifferentiated pruning by 5 points on AIME.
  • Remarkable efficiency gains: S1-32B-P achieves a 71% efficiency improvement on GPQA, with accuracy increasing while token count is nearly halved.
  • Cross-source generalization: PIR proves effective across data distilled from Gemini, DeepSeek-R1, and QwQ, indicating that it captures universal properties of reasoning.
  • Cross-scale generalization: Models ranging from 3B to 32B parameters all benefit, with the most pronounced gain on AIME (+11.8% accuracy) observed at the 32B scale.

Limitations & Future Work

  • Validation is limited to mathematical and scientific reasoning tasks; broader domains such as logical and commonsense reasoning remain unexplored.
  • Perplexity may not fully capture the semantic contribution of certain steps, introducing the risk of information loss.
  • The optimal pruning ratio varies across tasks and models; an adaptive strategy is lacking.
  • The method relies on model perplexity outputs and is therefore not applicable to closed-source models.
  • The classification stage depends on Claude 3.7 Sonnet, introducing additional cost and potential classifier bias.

Rating

  • Novelty: ⭐⭐⭐⭐ The PIR metric and the design principle of retaining progressive reasoning while pruning functional steps are clear and novel, though the overall framework builds incrementally on prior work (SPIRIT).
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three distillation sources × three benchmarks × multiple model scales × ablation studies — highly comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ The problem is clearly defined, experimental results are presented systematically, and tables are highly informative.
  • Value: ⭐⭐⭐⭐ Offers direct practical value for LLM inference efficiency optimization; the method is simple and reproducible.