DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models

Conference: ACL 2026
arXiv: 2505.13975
Code: https://github.com/YuxuanJiang1/DRP (Available)
Area: LLM Reasoning / Distillation / Efficient Reasoning
Keywords: Long-CoT Pruning, Skill-aware Stepping, Distillation, Overthinking, Short-CoT Teacher

TL;DR

DRP has a Short-CoT teacher (GPT-4o) perform skill-level decomposition, pruning, and rewriting on the reasoning trajectories generated by a Long-CoT student (R1-Distill-Qwen). Distilling these redundancy-free but style-preserving trajectories back into the student cuts the 7B model's GSM8K output from 917 to 328 tokens (−64%) while raising Pass@1 from 91.7% to 94.1%. On the OOD benchmarks MATH500, AIME24, and AMC, it also reduces tokens while matching or improving accuracy.

Background & Motivation

Background: Large Reasoning Models (o1, DeepSeek-R1, and their distilled variants) have advanced the state of the art in mathematical and logical tasks via explicit Long-CoT. The cost is extreme verbosity: a student model may output nearly 1,000 tokens for a GSM8K problem and over 8,000 for AIME, which raises inference cost and latency.

Limitations of Prior Work: Existing methods for token reduction have drawbacks: (i) prompt-based methods (e.g., TALE) impose token budget constraints, which cause a slight accuracy drop on GSM8K and a sharp drop on AIME24 (15/30 to 10/30 for the 7B model); (ii) SFT-based methods (e.g., CoT-Valve) distill with progressively shorter CoTs and lose accuracy on MATH500 (92.4% to 89.4% for the 7B model) and AMC; (iii) RL-based methods (e.g., ThinkPrune) use length penalties, which work for 1.5B models but require complex reward design and lack OOD stability.

Key Challenge: There is a "learnability gap" between the student's Long-CoT style and the teacher's Short-CoT style. Directly fine-tuning the student on short answers (e.g., 186 tokens) distilled from GPT-4o prevents the student from learning how to compress its own "reflection + backtracking" structure. This results in performance drops on OOD tasks (e.g., MATH500 falling to 88.6%). Both "pruning" and "distillation" alone tend to sacrifice accuracy.

Goal: Reduce tokens while increasing accuracy. The teacher prunes redundancy from the student's trajectories but leaves the student's original long-form reasoning structure and style intact.

Key Insight: The authors observe that student Long-CoTs are composed of "functional skills" (e.g., reading, equation formulation, arithmetic, comparison, verification). The teacher can therefore first segment the student's trajectory into skill-based steps and then perform "local surgical edits" (keep, delete, rewrite, or merge) rather than rewriting the entire response.

Core Idea: Replace "teacher distillation" with "teacher-led skill-level pruning." The teacher edits the student's own trajectory, retaining the student's speaking style and structural skeleton while removing redundant branches, and then provides this pruned trajectory for student SFT.

Method

Overall Architecture

DRP embeds a "teacher/student heterogeneous combination" into a three-stage pipeline:

  1. Student Generates Long-CoT: R1-Distill-Qwen-1.5B/7B generates raw trajectories \(R = (T, A)\) on GSM8K and PRM12K training sets, where \(T\) is the reasoning chain within <think> and \(A\) is the answer.
  2. Teacher Performs Skill-aware Decomposition + Pruning + Stylistic Rewriting: The teacher (default: GPT-4o) processes \(T\) through (a) skill-aware decomposition, (b) per-step editing (4 actions), and (c) reassembling steps into a fluent text, updating \(A\) if necessary, to produce \(\hat{R} = (\hat{T}, \hat{A})\).
  3. Student SFT: Using \((x, \hat{R})\) pairs, the student undergoes 3 epochs of teacher-forced training via LLaMA-Factory + LoRA, targeting the standard NLL loss \(\mathcal{L}_{\text{SFT}} = -\sum_{i=1}^{n} \log P_\theta(y_i \mid x, y_{<i})\).
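
The data flow through the three stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `generate_long_cot` and `teacher_prune` are hypothetical callables standing in for the student model and the teacher API, and the `instruction`/`output` record layout is an assumed Alpaca-style SFT format rather than one confirmed by the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    """A response R = (T, A): reasoning chain T and final answer A."""
    thinking: str
    answer: str


def to_sft_record(question: str, pruned: Trajectory) -> Dict[str, str]:
    """Format one (x, R_hat) pair; the key names are an assumed layout."""
    output = f"<think>\n{pruned.thinking}\n</think>\n{pruned.answer}"
    return {"instruction": question, "output": output}


def build_sft_dataset(
    questions: List[str],
    generate_long_cot: Callable[[str], Trajectory],          # hypothetical, stage 1
    teacher_prune: Callable[[str, Trajectory], Trajectory],  # hypothetical, stage 2
) -> List[Dict[str, str]]:
    """Run stages 1 and 2 per question and collect records for stage 3 (SFT)."""
    records = []
    for question in questions:
        raw = generate_long_cot(question)     # student's raw R = (T, A)
        pruned = teacher_prune(question, raw)  # teacher's pruned R_hat
        records.append(to_sft_record(question, pruned))
    return records
```

Keeping the two model calls behind plain callables makes it easy to swap the teacher, which is what the teacher-sensitivity experiments require.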

Key Designs

  1. Skill-Based Step Decomposition:

    • Function: Segments the student's Long-CoT \(T\) into \(\{(s_1, k_1), \ldots, (s_m, k_m)\}\), where each \(s_i\) is a token span and \(k_i\) is a "functional skill" label (e.g., "Reading given quantity," "Arithmetic," "Logical inference").
    • Mechanism: Utilizes teacher prompts for segmentation rather than simple punctuation or newline splits. Explicit skill labels stabilize step boundaries and provide a consistent granularity for the teacher to judge redundancy.
    • Design Motivation: Skill-based segmentation yields an average of 12.6 steps per GSM8K problem (vs. 8.3 for default splits). This finer granularity leads to higher accuracy on AMC and significantly fewer tokens in OOD tasks.
  2. Step-Level Pruning with Four Atomic Actions:

    • Function: For each \((s_i, k_i)\), the teacher selects an action to produce \(\hat{s}_i\). Pruned steps \(\{\hat{s}_1, \ldots, \hat{s}_{m'}\}\) are reassembled into a fluent trace \(\hat{T}\).
    • Mechanism: Actions include Keep (necessary and concise), Delete (redundant, verbose self-correction), Rewrite (essential logic in shorter phrasing), and Merge (combining adjacent atomic operations).
    • Design Motivation: Simply being "short" is not "good." Retaining the student's "long-path + reflection" skeleton while removing loops ensures a low learnability gap and better OOD transferability.
  3. Reasoning Style Preservation by Teacher-Rewriting:

    • Function: When reassembling steps, the teacher is instructed to use the student's tone and reflection habits instead of the teacher's own concise style.
    • Mechanism: The prompt explicitly requires "preserving the tone and speaking style of the student model."
    • Design Motivation: This is the central insight of the paper: effective training CoTs must be structurally consistent with the student's reasoning process. DRP (~330 tokens) outperforms direct GPT-4o distillation (~186 tokens) on MATH500 and AMC, which suggests that structural consistency matters more than absolute brevity.
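
The decomposition and the four atomic actions can be pictured as a small edit script over labeled steps. The sketch below is illustrative only: in DRP the teacher chooses the actions and writes the final fluent trace, whereas here the edits are supplied by hand. The `Step`, `Edit`, and `apply_edits` names, and the choice to implement Merge as "fold into the previous retained step", are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Action(Enum):
    KEEP = "keep"
    DELETE = "delete"
    REWRITE = "rewrite"
    MERGE = "merge"  # assumed semantics: fold into the previous retained step


@dataclass
class Step:
    text: str   # the span s_i
    skill: str  # the skill label k_i


@dataclass
class Edit:
    action: Action
    new_text: Optional[str] = None  # shortened text for REWRITE or MERGE


def apply_edits(steps: List[Step], edits: List[Edit]) -> List[str]:
    """Apply one edit per step and return the pruned spans in order."""
    if len(steps) != len(edits):
        raise ValueError("need exactly one edit per step")
    pruned: List[str] = []
    for step, edit in zip(steps, edits):
        text = edit.new_text or step.text
        if edit.action is Action.KEEP:
            pruned.append(step.text)
        elif edit.action is Action.REWRITE:
            pruned.append(text)
        elif edit.action is Action.MERGE and pruned:
            pruned[-1] = f"{pruned[-1]} {text}"
        elif edit.action is Action.MERGE:
            pruned.append(text)  # nothing earlier to merge into
        # Action.DELETE contributes nothing
    return pruned


steps = [
    Step("Tom has 3 apples and buys 4 more.", "Reading given quantity"),
    Step("So 3 + 4 = 7.", "Arithmetic"),
    Step("Wait, let me double check: 3 + 4... yes, 7.", "Verification"),
    Step("So the total is 7 apples.", "Logical inference"),
]
edits = [
    Edit(Action.KEEP),
    Edit(Action.KEEP),
    Edit(Action.DELETE),
    Edit(Action.MERGE, "Total: 7 apples."),
]
pruned_steps = apply_edits(steps, edits)
```

Here four labeled steps shrink to two retained spans. The final reassembly into a fluent, student-styled trace is left to the teacher and is not modeled.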

Loss & Training

  • Loss: Standard NLL, \(\mathcal{L}_{\text{SFT}} = -\sum_{i=1}^{n} \log P_\theta(y_i \mid x, y_{<i})\).
  • Student Model: R1-Distill-Qwen-1.5B / 7B.
  • Teacher Model: GPT-4o (default); Gemini 2.0 Flash, ChatGPT, and DeepSeek-V3 used for sensitivity analysis.
  • Evaluation: Zero-shot Pass@1 averaged over 5 runs; token counts measured with the Qwen tokenizer, with a 12k-token cutoff to handle degenerate loops.
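
The loss and the evaluation metrics above reduce to a few lines of arithmetic. The following sketch uses plain Python with hand-supplied numbers in place of real model outputs. The function names are hypothetical, and clipping each response length at the cutoff is an assumed reading of how the 12k limit enters the token average.

```python
import math
from statistics import mean
from typing import List, Sequence

TOKEN_CUTOFF = 12_000  # cutoff stated in the evaluation setup


def sft_nll(token_probs: Sequence[float]) -> float:
    """Sum of -log P(y_i | x, y_<i) over the target tokens."""
    return -sum(math.log(p) for p in token_probs)


def pass_at_1(correct_per_run: List[List[bool]]) -> float:
    """Single-sample accuracy, averaged over independent runs."""
    return mean(sum(run) / len(run) for run in correct_per_run)


def capped_mean_tokens(token_counts: Sequence[int]) -> float:
    """Mean response length with each count clipped at the cutoff (assumed)."""
    return mean(min(n, TOKEN_CUTOFF) for n in token_counts)
```

For example, `sft_nll([0.5, 0.25])` equals `3 * math.log(2)`, since the two token probabilities multiply to 1/8.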

Key Experimental Results

Main Results

Tab.1: Comparison of DRP against TALE (prompting), CoT-Valve (SFT), and ThinkPrune (RL) on R1-Distill-Qwen 7B/1.5B:

| Model | Method | GSM8K (Acc / Tok) | MATH500, OOD (Acc / Tok) | AIME24, OOD (Acc / Tok) | AMC, OOD (Acc / Tok) |
|---|---|---|---|---|---|
| 7B | Base | 91.7% / 917 | 92.4% / 2486 | 15/30 / 8674 | 31/40 / 4845 |
| 7B | +TALE | 91.0% / 522 | 91.6% / 2530 | 10/30 / 8602 | 31/40 / 3998 |
| 7B | +CoT-Valve | 90.8% / 364 | 89.4% / 1975 | 13/30 / 6315 | 30/40 / 3157 |
| 7B | +DRP | 94.1% / 328 (−64%) | 93.0% / 1781 (−28%) | 15/30 / 4966 (−43%) | 33/40 / 3258 (−33%) |
| 1.5B | Base | 70.7% / 1443 | 80.4% / 3276 | 6/30 / 10484 | 23/40 / 6516 |
| 1.5B | +ThinkPrune | 80.0% / 712 | 79.2% / 2006 | 9/30 / 5745 | 25/40 / 3291 |
| 1.5B | +DRP | 83.4% / 721 (−50%) | 82.0% / 2122 (−35%) | 10/30 / 6135 (−42%) | 27/40 / 3657 (−44%) |

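DRP is the only method in the table that reduces tokens on all four benchmarks without lowering accuracy on any of them. For the 7B student it improves accuracy on GSM8K, MATH500, and AMC and matches the base model on AIME24 (15/30); for the 1.5B student it improves accuracy on all four.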
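
The percentages in parentheses are relative token reductions against the base model. As a quick check, the sketch below recomputes them for the 7B student from the token counts in the table; the names `TOKENS_7B` and `reduction_pct` are introduced only for this example.

```python
from typing import Dict, Tuple

# (base tokens, DRP tokens) for the 7B student, copied from the table above.
TOKENS_7B: Dict[str, Tuple[int, int]] = {
    "GSM8K": (917, 328),
    "MATH500": (2486, 1781),
    "AIME24": (8674, 4966),
    "AMC": (4845, 3258),
}


def reduction_pct(base: int, pruned: int) -> float:
    """Relative token reduction in percent."""
    return 100.0 * (base - pruned) / base


for name, (base, drp) in TOKENS_7B.items():
    print(f"{name}: {reduction_pct(base, drp):.1f}% fewer tokens")
```

Rounded to whole percentages, the four values come out to 64, 28, 43, and 33, matching the 7B +DRP row.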

Ablation Study

RQ1 (Tab.2) Skill-based vs. Default vs. No Decomposition (7B Student): Skill-based decomposition yields the highest accuracy and the lowest token usage of the three settings. Its advantage over default splits is most visible on OOD tasks such as AMC.

RQ2 (Tab.4) DRP vs. Direct GPT-4o Distillation: Directly distilling GPT-4o's short answers leads to OOD performance drops (e.g., MATH500 falls to 88.6%), while DRP's style-preserving approach increases it to 93.0%.

RQ3 (Tab.3) Teacher model sensitivity: GPT-4o (94.1% GSM8K) > Gemini 2.0 Flash (93.2%) > DeepSeek-V3 (92.7%) > ChatGPT (91.2%). All teachers except ChatGPT exceed the 7B base model's 91.7% on GSM8K.

Key Findings

  • Structure > Length: Preserving the student's native structural skeleton is more critical for OOD generalization than achieving the shortest possible output.
  • Smaller Models Benefit More: The 1.5B model gained +12.7 points on GSM8K (70.7% to 83.4%), versus +2.4 for the 7B model, suggesting DRP effectively addresses overthinking caused by insufficient capacity.
  • Skill Stepping is Essential: Removing decomposition entirely causes significant drops on MATH500 and AMC.
  • Elimination of Long-Tail Tokens: DRP removes the secondary peaks in token distribution caused by degenerate loops, significantly improving inference stability.

Highlights & Insights

  • "Teacher doesn't teach, teacher prunes": Bypassing the learnability gap by having the teacher perform local edits rather than global rewriting is a highly effective setup.
  • Skill labels for Traceability: Moving from punctuation-based splits to skill-based edits makes the pruning process structured and reproducible.
  • Empirical evidence against "Brevity is King": DRP's ~330-token traces on GSM8K beat the ~186-token GPT-4o distillation on OOD accuracy, which suggests that some structural complexity is necessary for sound reasoning.

Limitations & Future Work

  • Limited Student Scope: Only validated on R1-Distill-Qwen 1.5B and 7B.
  • Proprietary Teacher Dependency: High API costs for GPT-4o.
  • Task Specificity: Evaluated only on mathematics; applicability to code or science reasoning is unproven.
  • Lack of RL Integration: DRP is purely SFT-based; combining it with RL methods for further refinement is a promising direction.

Compared to CoT-Valve, DRP avoids accuracy drops by using skill-aware editing instead of a length sweep. Compared to ThinkPrune, DRP provides a lightweight SFT alternative that achieves higher accuracy on 1.5B models with similar token reduction.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐