DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models
Conference: ACL 2026
arXiv: 2505.13975
Code: https://github.com/YuxuanJiang1/DRP (Available)
Area: LLM Reasoning / Distillation / Efficient Reasoning
Keywords: Long-CoT Pruning, Skill-aware Stepping, Distillation, Overthinking, Short-CoT Teacher
TL;DR
DRP has a Short-CoT teacher (GPT-4o) decompose, prune, and rewrite, at the skill level, the reasoning trajectories generated by a Long-CoT student (R1-Distill-Qwen). Distilling these redundancy-free but style-preserved trajectories back into the student cuts the 7B model's GSM8K output from 917 to 328 tokens (−64%) while raising Pass@1 from 91.7% to 94.1%. On the OOD benchmarks MATH500, AIME24, and AMC, it also reduces tokens while matching or improving accuracy.
Background & Motivation
Background: Large Reasoning Models (o1, DeepSeek-R1, and their distilled variants) have advanced the SOTA in mathematical and logical tasks via explicit Long-CoT. However, the cost is extreme verbosity: a student model may output nearly 1,000 tokens for a GSM8K problem and over 8,000 for an AIME problem, which raises inference cost and latency.
Limitations of Prior Work: Existing methods for token reduction have drawbacks: (i) prompt-based methods (e.g., TALE) use token budget constraints, which cause slight performance drops on GSM8K and complete collapse on AIME; (ii) SFT-based methods (e.g., CoT-Valve) distill with progressively shorter CoTs, leading to significant accuracy regression on MATH500/AMC; (iii) RL-based methods (e.g., ThinkPrune) use length penalties, which work for 1.5B models but require complex reward design and lack OOD stability.
Key Challenge: There is a "learnability gap" between the student's Long-CoT style and the teacher's Short-CoT style. Directly fine-tuning the student on short answers (e.g., 186 tokens) distilled from GPT-4o prevents the student from learning how to compress its own "reflection + backtracking" structure. This results in performance drops on OOD tasks (e.g., MATH500 falling to 88.6%). Both "pruning" and "distillation" alone tend to sacrifice accuracy.
Goal: Reduce tokens while increasing accuracy by having the teacher prune redundancy from the student's long-form reasoning while leaving its structure and style intact.
Key Insight: The authors observe that student Long-CoTs are composed of "functional skills" (e.g., reading, equation formulation, arithmetic, comparison, verification). Thus, the teacher can first segment the student's trajectory into skill-based steps and then perform local surgical edits (keep, delete, rewrite, or merge) rather than rewriting the entire response.
Core Idea: Replace "teacher distillation" with "teacher-led skill-level pruning." The teacher edits the student's own trajectory, retaining the student's speaking style and structural skeleton while removing redundant branches, and then provides this pruned trajectory for student SFT.
Method
Overall Architecture
DRP embeds a "teacher/student heterogeneous combination" into a three-stage pipeline:
- Student Generates Long-CoT: R1-Distill-Qwen-1.5B/7B generates raw trajectories \(R = (T, A)\) on the GSM8K and PRM12K training sets, where \(T\) is the reasoning chain within `<think>` and \(A\) is the answer.
- Teacher Performs Skill-aware Decomposition + Pruning + Stylistic Rewriting: The teacher (default: GPT-4o) processes \(T\) through (a) skill-aware decomposition, (b) per-step editing (4 actions), and (c) reassembling steps into a fluent text, updating \(A\) if necessary, to produce \(\hat{R} = (\hat{T}, \hat{A})\).
- Student SFT: Using \((x, \hat{R})\) pairs, the student undergoes 3 epochs of teacher-forced training via LLaMA-Factory + LoRA, minimizing the standard NLL loss \(\mathcal{L}_{\text{SFT}} = -\sum_{i=1}^{n} \log P_\theta(y_i \mid x, y_{<i})\).
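The glue between the first and third stages can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the student's output has the form `<think>…</think>` followed by the answer, and the names `split_trajectory`, `build_sft_pair`, `identity_teacher`, and the `prompt`/`response` keys are all hypothetical. The teacher is passed in as a callable so that the GPT-4o call can be replaced by a stand-in.

```python
import re
from typing import Callable, Dict, Tuple

# Assumption: reasoning sits inside <think>...</think>, the answer follows it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_trajectory(raw: str) -> Tuple[str, str]:
    """Split a raw student output R into (T, A)."""
    match = THINK_RE.search(raw)
    if match is None:
        # No reasoning block found: treat the whole output as the answer.
        return "", raw.strip()
    return match.group(1).strip(), raw[match.end():].strip()


def build_sft_pair(
    question: str,
    raw: str,
    teacher_prune: Callable[[str, str], Tuple[str, str]],
) -> Dict[str, str]:
    """Turn one student trajectory into an (x, R_hat) training record."""
    reasoning, answer = split_trajectory(raw)
    pruned_reasoning, pruned_answer = teacher_prune(reasoning, answer)
    response = f"<think>\n{pruned_reasoning}\n</think>\n\n{pruned_answer}"
    # Hypothetical record layout; adapt to whatever the SFT framework expects.
    return {"prompt": question, "response": response}


def identity_teacher(reasoning: str, answer: str) -> Tuple[str, str]:
    """Stand-in for the teacher call; it returns the trace unchanged."""
    return reasoning, answer


raw_output = (
    "<think>\n3 * 4 = 12. Wait, check: 4 + 4 + 4 = 12.\n</think>\n\n"
    "The answer is 12."
)
pair = build_sft_pair(
    "How many apples are in 3 boxes of 4?", raw_output, identity_teacher
)
print(pair["response"])
```

Keeping the teacher behind a plain callable means the same record-building code works whether the pruning comes from an API call, a cached file, or a test stub.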
Key Designs
- Skill-Based Step Decomposition:
    - Function: Segments the student's Long-CoT \(T\) into \(\{(s_1, k_1), \ldots, (s_m, k_m)\}\), where each \(s_i\) is a token span and \(k_i\) is a "functional skill" label (e.g., "Reading given quantity," "Arithmetic," "Logical inference").
    - Mechanism: Utilizes teacher prompts for segmentation rather than simple punctuation or newline splits. Explicit skill labels stabilize step boundaries and provide a consistent granularity for the teacher to judge redundancy.
    - Design Motivation: Skill-based segmentation yields an average of 12.6 steps per GSM8K problem (vs. 8.3 for default splits). This finer granularity leads to higher accuracy on AMC and significantly fewer tokens in OOD tasks.
- Step-Level Pruning with Four Atomic Actions:
    - Function: For each \((s_i, k_i)\), the teacher selects an action to produce \(\hat{s}_i\). Pruned steps \(\{\hat{s}_1, \ldots, \hat{s}_{m'}\}\) are reassembled into a fluent trace \(\hat{T}\).
    - Mechanism: Actions include Keep (necessary and concise), Delete (redundant, verbose self-correction), Rewrite (essential logic in shorter phrasing), and Merge (combining adjacent atomic operations).
    - Design Motivation: Simply being "short" is not "good." Retaining the student's "long-path + reflection" skeleton while removing loops ensures a low learnability gap and better OOD transferability.
- Reasoning Style Preservation by Teacher-Rewriting:
    - Function: When reassembling steps, the teacher is instructed to use the student's tone and reflection habits instead of the teacher's own concise style.
    - Mechanism: The prompt explicitly requires "preserving the tone and speaking style of the student model."
    - Design Motivation: This is the central insight of the paper: effective training CoTs must be structurally consistent with the student's reasoning process. DRP (~330 tokens) outperforms direct GPT-4o distillation (~186 tokens) on MATH500 and AMC, which suggests that structural integrity matters more than absolute brevity.
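To make the step representation and the four actions concrete, here is a small sketch of how labelled steps \((s_i, k_i)\) and per-step edits could be modelled. Everything in it is an assumption for illustration: the `Step`, `Edit`, and `apply_edits` names are hypothetical, the edits are hard-coded rather than chosen by a teacher model, and Merge is interpreted here as folding a step into the previous surviving step. The final style-preserving reassembly done by the teacher is not reproduced.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Action(Enum):
    KEEP = "keep"
    DELETE = "delete"
    REWRITE = "rewrite"
    MERGE = "merge"


@dataclass
class Step:
    text: str   # token span s_i
    skill: str  # functional skill label k_i


@dataclass
class Edit:
    action: Action
    new_text: Optional[str] = None  # required for REWRITE and MERGE


def apply_edits(steps: List[Step], edits: List[Edit]) -> List[str]:
    """Apply one edit per step and return the surviving step texts."""
    if len(steps) != len(edits):
        raise ValueError("one edit per step is required")
    pruned: List[str] = []
    for step, edit in zip(steps, edits):
        if edit.action is Action.KEEP:
            pruned.append(step.text)
        elif edit.action is Action.DELETE:
            continue
        elif edit.new_text is None:
            raise ValueError(f"{edit.action.value} needs new_text")
        elif edit.action is Action.REWRITE:
            pruned.append(edit.new_text)
        elif pruned:
            # MERGE (assumed semantics): replace the previous surviving step
            # with text covering both it and the current step.
            pruned[-1] = edit.new_text
        else:
            pruned.append(edit.new_text)
    return pruned


steps = [
    Step("There are 3 boxes with 4 apples each.", "Reading given quantity"),
    Step("So 3 * 4 = 12.", "Arithmetic"),
    Step("Wait, let me double check: 4 + 4 + 4 = 12. Yes.", "Verification"),
    Step("Then 2 are eaten.", "Reading given quantity"),
    Step("12 - 2 = 10.", "Arithmetic"),
]
edits = [
    Edit(Action.KEEP),
    Edit(Action.KEEP),
    Edit(Action.DELETE),
    Edit(Action.KEEP),
    Edit(Action.MERGE, "Then 2 are eaten, so 12 - 2 = 10."),
]
pruned_trace = apply_edits(steps, edits)
print(" ".join(pruned_trace))
```

In this toy trace the redundant verification step is dropped and the last two steps collapse into one, so five steps become three while the order of the remaining reasoning is unchanged.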
Loss & Training
- Loss: Standard NLL, \(\mathcal{L}_{\text{SFT}} = -\sum_{i=1}^{n} \log P_\theta(y_i \mid x, y_{<i})\).
- Student Model: R1-Distill-Qwen-1.5B / 7B.
- Teacher Model: GPT-4o (default); Gemini 2.0 Flash, ChatGPT, and DeepSeek-V3 used for sensitivity analysis.
- Evaluation: zero-shot Pass@1 averaged over 5 runs; token counts measured via Qwen tokenizer with a 12k token cutoff to handle degenerate loops.
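The loss and the evaluation bookkeeping above can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions: the per-token probabilities and run results are made-up inputs, "12k" is read as 12,000 tokens (the exact cutoff value is not given here), lengths are taken as precomputed integers rather than produced by the Qwen tokenizer, and the function names are hypothetical.

```python
import math
from statistics import mean
from typing import List

# Assumption: "12k" is interpreted as 12,000 tokens.
TOKEN_CUTOFF = 12_000


def sft_nll(token_probs: List[float]) -> float:
    """NLL of one target sequence: -sum_i log P(y_i | x, y_<i)."""
    return -sum(math.log(p) for p in token_probs)


def mean_pass_at_1(runs: List[List[bool]]) -> float:
    """Average Pass@1 over runs; each run holds one correctness flag per problem."""
    return mean([sum(run) / len(run) for run in runs])


def mean_capped_tokens(lengths: List[int], cutoff: int = TOKEN_CUTOFF) -> float:
    """Mean output length after truncating each generation at the cutoff."""
    return mean([min(n, cutoff) for n in lengths])


# Made-up probabilities the model assigns to four target tokens.
loss = sft_nll([0.9, 0.8, 0.95, 0.5])

# Two made-up runs over two problems each.
accuracy = mean_pass_at_1([[True, False], [True, True]])

# One normal generation and one degenerate loop that hits the cutoff.
avg_tokens = mean_capped_tokens([300, 20_000])

print(f"loss={loss:.4f} accuracy={accuracy:.2f} avg_tokens={avg_tokens}")
```

The cutoff matters for the token averages: without it, a single runaway generation would dominate the mean length of a benchmark.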
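Key Experimental Results
Main Results
Tab.1: Comparison of DRP against TALE (prompting), CoT-Valve (SFT), and ThinkPrune (RL) on R1-Distill-Qwen 7B/1.5B. Each cell reports accuracy (or problems solved) / average output tokens; percentages in parentheses are token reductions relative to the corresponding base model:
| Model | Method | GSM8K (Acc / Tokens) | MATH500, OOD (Acc / Tokens) | AIME24, OOD (Solved / Tokens) | AMC, OOD (Solved / Tokens) |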
|---|---|---|---|---|---|
| 7B Base | — | 91.7% / 917 | 92.4% / 2486 | 15/30 / 8674 | 31/40 / 4845 |
| 7B | +TALE | 91.0% / 522 | 91.6% / 2530 | 10/30 / 8602 | 31/40 / 3998 |
| 7B | +CoT-Valve | 90.8% / 364 | 89.4% / 1975 | 13/30 / 6315 | 30/40 / 3157 |
| 7B | +DRP | 94.1% / 328 (−64%) | 93.0% / 1781 (−28%) | 15/30 / 4966 (−43%) | 33/40 / 3258 (−33%) |
| 1.5B Base | — | 70.7% / 1443 | 80.4% / 3276 | 6/30 / 10484 | 23/40 / 6516 |
| 1.5B | +ThinkPrune | 80.0% / 712 | 79.2% / 2006 | 9/30 / 5745 | 25/40 / 3291 |
| 1.5B | +DRP | 83.4% / 721 (−50%) | 82.0% / 2122 (−35%) | 10/30 / 6135 (−42%) | 27/40 / 3657 (−44%) |
DRP is the only method in the table that reduces tokens on all four benchmarks without losing accuracy on any of them: accuracy rises everywhere except AIME24 for the 7B model, where it stays at 15/30.
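The percentages in parentheses can be re-derived from the token counts in the table as relative reductions against the base model. The short check below does this for the 7B rows; the function name `token_reduction` is a hypothetical helper, and the numbers are copied from the table above.

```python
def token_reduction(base_tokens: int, pruned_tokens: int) -> float:
    """Relative token reduction in percent, measured against the base model."""
    return 100.0 * (base_tokens - pruned_tokens) / base_tokens


# (base tokens, DRP tokens) for the 7B student, taken from Tab.1.
rows_7b = {
    "GSM8K": (917, 328),
    "MATH500": (2486, 1781),
    "AIME24": (8674, 4966),
    "AMC": (4845, 3258),
}

for benchmark, (base, pruned) in rows_7b.items():
    print(f"{benchmark}: -{token_reduction(base, pruned):.1f}%")
```

For example, GSM8K gives \((917 - 328) / 917 \approx 0.642\), which rounds to the reported −64%.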
Ablation Study
RQ1 (Tab.2) Skill-based vs. Default vs. No Decomposition (7B Student): Skill-based decomposition yields the highest accuracy and the lowest token usage of the three variants; its advantage over default splits is clearest on OOD tasks such as AMC.
RQ2 (Tab.4) DRP vs. Direct GPT-4o Distillation: Directly distilling GPT-4o's short answers leads to OOD performance drops (e.g., MATH500 falls to 88.6%), while DRP's style-preserving approach increases it to 93.0%.
RQ3 (Tab.3) Teacher model sensitivity: GSM8K accuracy ranks GPT-4o (94.1%) > Gemini 2.0 Flash (93.2%) > DeepSeek-V3 (92.7%) > ChatGPT (91.2%). By these figures, every teacher except ChatGPT exceeds the 7B base model's 91.7%.
Key Findings
- Structure > Length: Preserving the student's native structural skeleton is more critical for OOD generalization than achieving the shortest possible output.
- Smaller Models Benefit More: The 1.5B model gains 12.7 points on GSM8K (70.7% to 83.4%) versus 2.4 points for the 7B model (91.7% to 94.1%), suggesting DRP effectively addresses overthinking caused by insufficient capacity.
- Skill Stepping is Essential: Removing decomposition entirely causes significant drops on MATH500 and AMC.
- Elimination of Long-Tail Tokens: DRP removes the secondary peaks in token distribution caused by degenerate loops, significantly improving inference stability.
Highlights & Insights
- "Teacher doesn't teach, teacher prunes": Bypassing the learnability gap by having the teacher perform local edits rather than global rewriting is a highly effective setup.
- Skill labels for Traceability: Moving from punctuation-based splits to skill-based edits makes the pruning process structured and reproducible.
- Empirical evidence against "Brevity is King": DRP's ~330-token traces on GSM8K transfer better than the ~186-token GPT-4o traces, which suggests that some structural complexity is necessary for the reasoning to carry over.
Limitations & Future Work
- Limited Student Scope: Only validated on R1-Distill-Qwen 1.5B and 7B.
- Proprietary Teacher Dependency: High API costs for GPT-4o.
- Task Specificity: Evaluated only on mathematics; applicability to code or science reasoning is unproven.
- Lack of RL Integration: DRP is purely SFT-based; combining it with RL methods for further refinement is a promising direction.
Related Work & Insights
Compared to CoT-Valve, DRP avoids accuracy drops by using skill-aware editing instead of a length sweep. Compared to ThinkPrune, DRP provides a lightweight SFT alternative that achieves higher accuracy on 1.5B models with similar token reduction.
Rating
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐