Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning¶
Conference: ACL 2025
arXiv: 2502.18001
Code: https://github.com/EIT-NLP/Distilling-CoT-Reasoning
Area: LLM Reasoning / CoT Distillation
Keywords: Chain-of-Thought, knowledge distillation, Reasoning Granularity, Small Language Models, Teacher-Student
TL;DR¶
A systematic study of the three key factors influencing CoT distillation (granularity, format, and teacher models) reveals a non-monotonic relationship between SLM performance and granularity, demonstrates that format has minimal impact, and shows that stronger teachers do not always yield better students.
Background & Motivation¶
Background: CoT prompting significantly enhances the reasoning capabilities of LLMs, but incurs heavy computational overhead, necessitating the distillation of CoT capabilities into small language models (SLMs).
Limitations of Prior Work: The choices of teacher models and generation methods in CoT distillation are often arbitrary and lack systematic guidance.
Key Challenge: Do CoT strategies that are effective for LLMs (such as finer granularity and specific formats) apply equally well to SLMs?
Goal: Identify what constitutes the most effective CoT supervision for training student models to acquire reasoning capabilities.
Key Insight: Drawing an analogy to human teaching, this study conducts systematic experiments across three dimensions: teacher selection, teaching granularity, and teaching format.
Core Idea: CoT distillation requires "tailoring education to individual needs": both the granularity and the teacher should be chosen to match the capability of the student model.
Method¶
Overall Architecture¶
A cross-factor empirical study spanning 4 teachers (three LLMs plus human annotation) × 7 student models × 6 granularity levels × multiple formats × 7 datasets.
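To make the scale of this grid concrete, the sketch below enumerates it in Python. The teacher, student, dataset, and format names come from the Supplementary Details of this note; the `DistillConfig` dataclass and `enumerate_grid` function are hypothetical names introduced for illustration. Treating the study as a full Cartesian product is an assumption: the experiments described under Key Designs hold the other factors fixed while varying one, so the count printed here is only an upper bound on the number of distinct configurations.

```python
from dataclasses import dataclass
from itertools import product

# Factor values as listed in this note (names only; no results are implied).
TEACHERS = ["GPT-4o", "Gemini-1.5-Flash", "LLaMA-3-70B", "Human"]
STUDENTS = [
    "BLOOM-560M", "BLOOM-1.1B", "BLOOM-1.7B", "BLOOM-3B",
    "Gemma-2B", "LLaMA-3.2-1B", "LLaMA-3.2-3B",
]
DATASETS = [
    "SVAMP", "GSM8K", "AQuA-RAT", "MATH",
    "CommonsenseQA", "OpenBookQA", "StrategyQA",
]
GRANULARITY_LEVELS = [1, 2, 3, 4, 5, 6]
FORMATS = ["Original CoT", "Least-to-Most", "RaR"]


@dataclass(frozen=True)
class DistillConfig:
    """One hypothetical distillation run (illustrative, not the authors' code)."""
    teacher: str
    student: str
    dataset: str
    granularity: int
    cot_format: str


def enumerate_grid():
    """Return every combination of the five factors (assumed full product)."""
    return [
        DistillConfig(*combo)
        for combo in product(TEACHERS, STUDENTS, DATASETS, GRANULARITY_LEVELS, FORMATS)
    ]


grid = enumerate_grid()
print(len(grid))  # 4 * 7 * 7 * 6 * 3 = 3528
```

Filtering this list (for example, keeping only entries whose teacher is GPT-4o and whose format is the original CoT) would give a one-factor-at-a-time sweep over granularity.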
Key Designs¶
- Granularity Experiments:
  - Function: Define six granularity levels, ranging from minimal to highly detailed reasoning steps.
  - Mechanism: Use 1-shot prompting to have GPT-4o generate CoT annotations at each granularity level.
  - Design Motivation: Test whether SLMs benefit from finer granularity in the same way LLMs do.
- Format Experiments:
  - Function: Compare different reasoning formats, such as natural-language CoT, Least-to-Most, and RaR (Rephrase and Respond).
  - Mechanism: Hold the reasoning content constant and change only how the reasoning chain is presented.
  - Design Motivation: Measure how sensitive SLMs are to the reasoning format.
- Teacher Selection Experiments:
  - Function: Compare CoTs from GPT-4o, Gemini-1.5-Flash, LLaMA-3-70B, and human annotators.
  - Mechanism: Hold granularity and format fixed and vary only the source of the CoT supervision.
  - Design Motivation: Test whether the hypothesis "the strongest teacher yields the best student" actually holds.
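The 1-shot mechanism used to control granularity can be sketched as a small prompt builder. This is a minimal illustration, not the authors' prompt: the function name `build_one_shot_prompt`, the instruction wording, and the demonstration texts are all hypothetical, and the paper's actual per-level demonstrations are not reproduced in this note. The only idea carried over from the text is that a single demonstration at the desired level of detail steers how detailed the teacher's CoT is.

```python
from typing import Dict, Tuple


def build_one_shot_prompt(
    question: str,
    level: int,
    demos: Dict[int, Tuple[str, str]],
) -> str:
    """Build a 1-shot prompt whose single demonstration sets the granularity.

    `demos` maps a granularity level to a (question, reasoning) pair written
    at that level of detail. Both the mapping and the wording below are
    illustrative assumptions.
    """
    if level not in demos:
        raise ValueError(f"no demonstration available for level {level}")
    demo_question, demo_reasoning = demos[level]
    return (
        "Solve the problem step by step, matching the level of detail "
        "shown in the example.\n\n"
        f"Example question: {demo_question}\n"
        f"Example reasoning: {demo_reasoning}\n\n"
        f"Question: {question}\n"
        "Reasoning:"
    )


# Made-up demonstrations at a coarse and a fine level of detail.
demo_q = "Tom has 3 apples and buys 2 more. How many apples does he have?"
demos = {
    1: (demo_q, "3 + 2 = 5. The answer is 5."),
    5: (
        demo_q,
        "Tom starts with 3 apples. He buys 2 more apples. "
        "Adding them gives 3 + 2 = 5. The answer is 5.",
    ),
}

prompt = build_one_shot_prompt(
    "A pen costs 4 dollars. How much do 3 pens cost?", level=5, demos=demos
)
print(prompt)
```

Swapping only the demonstration while keeping the instruction and the question fixed mirrors the controlled-variable design of the three experiments: one factor changes, everything else stays constant.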
Loss & Training¶
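Standard SFT: \(\mathcal{L}_{\text{distill}} = \sum_i \mathcal{L}\big(S(x_i),\ \mathcal{C}_{T,g,f}(x_i) \oplus y_i\big)\), where \(S\) is the student model, \(\mathcal{C}_{T,g,f}(x_i)\) is the CoT for input \(x_i\) produced by teacher \(T\) at granularity \(g\) in format \(f\), \(\oplus\) denotes concatenation, and \(y_i\) is the final answer.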
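As a rough illustration of this objective, the sketch below builds one training target and computes a negative log-likelihood over the target tokens only. It assumes that \(\oplus\) is plain string concatenation and that the inner loss is next-token cross-entropy with prompt tokens masked out; this is common SFT practice but is not confirmed by this note. The names `make_target` and `masked_nll`, the separator string, and the probabilities are hypothetical.

```python
import math
from typing import Sequence


def make_target(cot: str, answer: str, sep: str = "\nThe answer is ") -> str:
    """Concatenate the teacher CoT and the final answer (the assumed ⊕)."""
    return f"{cot}{sep}{answer}"


def masked_nll(token_probs: Sequence[float], is_target: Sequence[bool]) -> float:
    """Sum of -log p(token) over positions that belong to the target.

    `token_probs[k]` is the probability the student assigns to the k-th
    reference token; prompt positions (is_target[k] == False) are skipped.
    """
    if len(token_probs) != len(is_target):
        raise ValueError("token_probs and is_target must have the same length")
    return -sum(math.log(p) for p, keep in zip(token_probs, is_target) if keep)


target = make_target("3 + 2 = 5.", "5")

# Toy example: two prompt tokens (masked) followed by two target tokens.
probs = [0.9, 0.8, 0.5, 0.25]
mask = [False, False, True, True]
loss = masked_nll(probs, mask)  # -(ln 0.5 + ln 0.25) = ln 8
print(target)
print(round(loss, 4))
```

Summing this per-example loss over all training examples corresponds to the outer \(\sum_i\) in the formula.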
Key Experimental Results¶
Main Results (Impact of Granularity, GPT-4o as Teacher)¶
| Model | GSM8K L1 | GSM8K L3 | GSM8K L5 | GSM8K Best |
|---|---|---|---|---|
| Gemma 2B | 49.66 | 53.37 | 53.42 | L5 (53.45) |
| LLaMA 3.2 3B | 59.59 | 62.57 | 62.29 | L4 (63.48) |
| BLOOM 3B | 18.20 | 23.81 | 22.47 | L3 (23.81) |
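The non-monotonic pattern can be read directly off the three levels shown in the table. The following sketch uses only the L1, L3, and L5 accuracies above; the helper names `peak_level` and `is_monotonic_increasing` are illustrative. Because only three of the six levels are listed, the peak found here is the peak among the listed levels, which can differ from the "GSM8K Best" column (for example, L4 for LLaMA 3.2 3B).

```python
# GSM8K accuracy by granularity level, copied from the table above.
GSM8K_ACC = {
    "Gemma 2B": {1: 49.66, 3: 53.37, 5: 53.42},
    "LLaMA 3.2 3B": {1: 59.59, 3: 62.57, 5: 62.29},
    "BLOOM 3B": {1: 18.20, 3: 23.81, 5: 22.47},
}


def peak_level(acc_by_level):
    """Return the granularity level with the highest accuracy."""
    return max(acc_by_level, key=acc_by_level.get)


def is_monotonic_increasing(acc_by_level):
    """True if accuracy never drops as the granularity level increases."""
    values = [acc_by_level[level] for level in sorted(acc_by_level)]
    return all(a <= b for a, b in zip(values, values[1:]))


for model, acc in GSM8K_ACC.items():
    print(model, "peak: L%d" % peak_level(acc), "monotonic:", is_monotonic_increasing(acc))
```

Among the listed levels, only Gemma 2B keeps improving up to L5; LLaMA 3.2 3B and BLOOM 3B both drop from L3 to L5.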
Ablation Study (Granularity vs. Length)¶
| Setup | GSM8K Acc | Sequence Length |
|---|---|---|
| Level 1 | 47.61 | 100.93 |
| Level 1 + Padding | 46.62 | 143.43 |
| Level 5 | 52.92 | 138.16 |
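The ablation's argument is simple arithmetic on these three rows, worked out below. The numbers are copied from the table; the helper name `delta_vs_baseline` is illustrative.

```python
# (accuracy, mean sequence length) per setup, copied from the table above.
ABLATION = {
    "Level 1": (47.61, 100.93),
    "Level 1 + Padding": (46.62, 143.43),
    "Level 5": (52.92, 138.16),
}


def delta_vs_baseline(setup, baseline="Level 1"):
    """Return (accuracy change, length change) relative to the baseline row."""
    acc, length = ABLATION[setup]
    base_acc, base_len = ABLATION[baseline]
    return round(acc - base_acc, 2), round(length - base_len, 2)


padding_delta = delta_vs_baseline("Level 1 + Padding")
level5_delta = delta_vs_baseline("Level 5")
print("Padding:", padding_delta)  # (-0.99, 42.5)
print("Level 5:", level5_delta)   # (5.31, 37.23)
```

Padding adds about 42.5 tokens yet loses 0.99 points of accuracy, while Level 5 adds fewer tokens (about 37.2) and gains 5.31 points, so the improvement tracks the added reasoning content rather than the added length.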
Key Findings¶
- Finding 1: The relationship between SLM performance and granularity is non-monotonic; stronger student models benefit from fine granularity, whereas weaker student models do best at medium granularity.
- Finding 2: CoT format strongly affects LLMs but has minimal effect on SLMs, which the study attributes to the student adapting to the training format during SFT.
- Finding 3: Stronger teachers do not always produce better students. Human annotations are near-perfectly accurate, yet students trained on them are often outperformed by students trained on LLM-generated CoTs, suggesting that the diversity and complexity of the rationales matter more than accuracy alone.
Highlights & Insights¶
- The teaching analogy is natural and intuitive, with a clear breakdown across three dimensions: who teaches, what to teach, and how to teach.
- The padding ablation cleanly decouples sequence length from granularity, showing that the gains do not come from longer outputs alone.
- The finding that "human annotations are not always the best" is counter-intuitive but reasonable, as the diversity of LLM annotations compensates for their lower accuracy.
Limitations & Future Work¶
- Annotations were generated solely using 1-shot prompting; other generation strategies might yield different conclusions.
- The study does not explore hybrid granularity strategies (e.g., low granularity for simple questions and high granularity for difficult ones).
- Student models are capped at 3B parameters; larger models (7B and above) might behave differently.
Related Work & Insights¶
- vs Magister et al. (2023): Early CoT distillation works did not investigate the influence of granularity.
- vs Zong et al. (2023): They argued that stronger teachers are always better, a claim that the results of this work contradict.
Supplementary Details¶
- Teacher models: GPT-4o, Gemini-1.5-Flash, LLaMA-3-70B, and human annotations.
- Student models: BLOOM (560M/1.1B/1.7B/3B), Gemma 2B, LLaMA 3.2 (1B/3B).
- Math datasets: SVAMP, GSM8K, AQuA-RAT, MATH.
- Commonsense reasoning datasets: CommonsenseQA, OpenBookQA, StrategyQA.
- Granularity levels from Level 1 to Level 6, controlled via prompt.
- CoT formats include: original CoT, Least-to-Most, RaR.
- Padding experiments demonstrate that a pure increase in sequence length is ineffective.
- Core conclusion: SLMs require customized CoT distillation strategies that align with their capabilities.
- While human annotations present high accuracy, they lack the diversity of LLM-generated ones.
- The "Only Answer" baseline can reveal the implicit pre-trained knowledge of student models.
- Weaker student models perform close to random guessing on complex datasets.
- 1-shot prompting is used to ensure consistency in granularity control.
Rating¶
- Novelty: ⭐⭐⭐⭐ The first work to systematically study the three key factors of CoT distillation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ An extensive empirical study involving 4 teachers × 7 students × 7 datasets × 6 granularities.
- Writing Quality: ⭐⭐⭐⭐ The educational analogy framework is clear and the conclusions are actionable.
- Value: ⭐⭐⭐⭐ Provides a practical configuration guide for CoT distillation.