Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models¶

Conference: ACL 2025
arXiv: 2503.01461
Code: AIDC-AI/Marco-o1
Area: LLM Reasoning / Knowledge Distillation
Keywords: Reasoning Distillation, MCTS, Chain-of-Thought, DPO, Formalistic Thinking

TL;DR¶

Identifies "formalistic long-time thinking" as a bottleneck that appears when long CoT data from large reasoning models (e.g., DeepSeek-R1) is distilled directly into smaller models. To alleviate it, the paper reconstructs tree-structured CoT data from scratch using MCTS and combines it with thoughts length balance, fine-grained DPO, and a joint training objective.

Background & Motivation¶

Background¶

Large reasoning models (LRMs) such as OpenAI o1 and DeepSeek-R1 exhibit powerful reasoning capabilities by scaling test-time compute and generating long Chain-of-Thought (CoT). Distilling these reasoning capabilities into smaller models (e.g., directly fine-tuning Qwen2.5-7B on data generated by LRMs) is an efficient strategy. For instance, DeepSeek-R1 distilled models reach an AIME score of 55.5%, compared with 9.3% for GPT-4o.

Core Problem¶

The authors discovered that smaller distilled models often exhibit Formalistic Long-time Thinking—mechanically mimicking the reasoning patterns of larger models without genuinely internalizing the underlying reasoning logic. This is specifically manifested in three types of errors:

Content Repetition: The model repeatedly generates identical text segments, failing to advance the reasoning (e.g., an infinite loop like "positions are considered up to consider that the positions are...")

Over-Reflection: The model constantly questions itself using patterns such as "Wait, perhaps...", "Alternatively,...", but fails to converge to an answer.

Instruction Failure: The model falls into unnecessarily long reasoning on simple tasks such as translation and ultimately fails to output the final answer.

Root Cause¶

Long CoT in distilled data presents a learning difficulty for smaller models.
SFT and RL methods lead to bias inheritance (e.g., overthinking patterns).
DPO training is highly sensitive to response length, exacerbating formalistic thinking.

Research Questions¶

How can long CoT reasoning be effectively transferred to smaller models through data construction, SFT, and RL methods?

Method¶

Overall Architecture¶

The approach consists of two parts:
1. Data side: Constructing tree-structured CoT data from scratch based on MCTS (rather than distilling from LRMs).
2. Method side: CoT-aware post-training techniques (Thoughts Length Balance + Fine-grained DPO + Joint Objective).

Key Designs¶

1. MCTS-based CoT Data Construction¶

Thought Node Definition:

Node Type	Function	Prefix Prompt
Thinking	Open-ended reasoning continuation	(None, direct continuation)
Sub-Task	Task breakdown	"Firstly, I need to break down this task."
Reflection	Check & error correction	"Let's check the result. Wait! something is wrong..."
Hypothesis	Proposing hypothesis	"I propose the following hypothesis:"
Double-Check	Verification	"Now, I need to check whether all requirements are met."
Reclarify	Re-clarification	"To ensure clarity, let me restate..."
Answer	Providing answer	"The answer is:"

MCTS Search Process:
1. Node Selection: Utilizing the UCB formula to balance exploration and exploitation: $UCB(n_i) = \frac{v(n_i)}{n_{\text{visits}}(n_i)} + C\sqrt{\frac{\ln(n_{\text{visits}}(n_{\text{parent}}))}{n_{\text{visits}}(n_i)}}$
2. Expansion: Prompting the LLM to generate the content for the corresponding node type based on a predefined node transition matrix.
3. Rollout: Calculating rule-based correctness rewards upon reaching the Answer node.
4. Backpropagation: Propagating rewards back up the tree.
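The selection and backpropagation steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, the default exploration constant `c = 1.4`, and the "unvisited nodes first" rule are assumptions made for the example.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One thought node in the search tree (hypothetical structure)."""
    kind: str                          # e.g. "Thinking", "Reflection", "Answer"
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0                    # n_visits(n)
    value: float = 0.0                 # accumulated reward v(n)


def ucb(node: Node, c: float = 1.4) -> float:
    """UCB score from the formula above; unvisited nodes are tried first."""
    if node.visits == 0:
        return math.inf
    exploit = node.value / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore


def select_child(node: Node, c: float = 1.4) -> Node:
    """Pick the child with the highest UCB score."""
    return max(node.children, key=lambda child: ucb(child, c))


def backpropagate(leaf: Node, reward: float) -> None:
    """Add the rollout reward to every node on the path back to the root."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent


# Tiny example: one rewarded branch, one unrewarded branch.
root = Node("Sub-Task")
good = Node("Thinking", parent=root)
bad = Node("Reflection", parent=root)
root.children = [good, bad]
backpropagate(good, 1.0)
backpropagate(bad, 0.0)
backpropagate(good, 1.0)
```

With a small `c` the search keeps exploiting the rewarded branch; a large `c` pushes it toward the less-visited one.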

Multi-Model Collaboration:
- Qwen2.5-72B-Instruct is used for Thinking nodes.
- Llama3.1-70B-Instruct is used for Reflection nodes.
- Self-correction within the same model tends to reuse the same error distribution; switching models helps reduce repetitive errors.

Diversity of Reasoning Patterns: Four different node transition patterns are designed (e.g., Sub-Task→Thinking→Answer, Sub-Task→Hypothesis→Thinking→Answer, etc.) and randomly sampled to produce diverse reasoning paths.
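Random pattern sampling can be illustrated with a short sketch. Only the two patterns named above are included, since the remaining ones are not listed here; the function name and the use of a seeded `random.Random` are assumptions for the example.

```python
import random
from typing import List

# Only the two transition patterns named in the text. The other patterns
# are not listed in this summary, so they are deliberately left out.
PATTERNS: List[List[str]] = [
    ["Sub-Task", "Thinking", "Answer"],
    ["Sub-Task", "Hypothesis", "Thinking", "Answer"],
]


def sample_pattern(rng: random.Random) -> List[str]:
    """Draw one node-type sequence uniformly at random."""
    return list(rng.choice(PATTERNS))


rng = random.Random(0)          # seeded so runs are repeatable
sampled = [sample_pattern(rng) for _ in range(5)]
```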

Data Extraction:
- SFT Data: Selected from successful paths reaching the correct answer (highest-reward paths or paths of specific lengths).
- DPO Data: Positive examples are correct paths, while negative ones are incorrect paths sharing the shortest common prefix with the positive example.
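A rough sketch of how DPO pairs might be pulled from a finished tree, assuming each root-to-leaf path is stored as a tuple of step strings. The helper names and the `choose` parameter are hypothetical; `choose=min` follows the "shortest common prefix" wording above.

```python
from typing import Callable, List, Sequence, Tuple

Path = Tuple[str, ...]


def common_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading steps the two paths share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_dpo_pairs(
    correct: List[Path],
    incorrect: List[Path],
    choose: Callable = min,      # assumption: rule for picking the negative
) -> List[Tuple[Path, Path, int]]:
    """Pair each correct path with one incorrect path, plus their shared prefix length."""
    pairs = []
    for pos in correct:
        if not incorrect:
            break
        neg = choose(incorrect, key=lambda p: common_prefix_len(pos, p))
        pairs.append((pos, neg, common_prefix_len(pos, neg)))
    return pairs


correct_paths = [("sub-task", "think-1", "answer: 42")]
incorrect_paths = [
    ("sub-task", "think-1", "answer: 41"),
    ("sub-task", "think-2", "answer: 7"),
]
pairs = build_dpo_pairs(correct_paths, incorrect_paths)
```

The stored prefix length is the same quantity that masking-based DPO needs later.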

2. Thoughts Length Balance¶

It was observed that CoT length significantly impacts the DPO stage but has negligible effect on SFT.
Strategy: Use the longest CoT during the SFT stage and the shortest CoT during the DPO stage.
Extract paths from the CoT tree based on relative length (short/medium/long) instead of setting fixed token thresholds.
Shorter reasoning paths reduce redundant outputs, mitigating formalistic long-time thinking.
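Selecting by relative length rather than a fixed threshold can be sketched as below. Length is measured in characters purely for illustration (token counts would be the natural choice in practice), and the function name is hypothetical.

```python
from typing import List


def pick_by_relative_length(paths: List[str], which: str) -> str:
    """Return the shortest, median, or longest path among one tree's correct paths."""
    if not paths:
        raise ValueError("need at least one path")
    ranked = sorted(paths, key=len)       # character length as a stand-in for tokens
    if which == "short":
        return ranked[0]
    if which == "medium":
        return ranked[len(ranked) // 2]
    if which == "long":
        return ranked[-1]
    raise ValueError(f"unknown bucket: {which!r}")


tree_paths = ["a" * 50, "a" * 5, "a" * 20]
sft_example = pick_by_relative_length(tree_paths, "long")    # longest CoT for SFT
dpo_example = pick_by_relative_length(tree_paths, "short")   # shortest CoT for DPO
```

Because the buckets are relative to each tree, an easy problem and a hard problem each contribute their own "short" and "long" paths.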

3. Fine-grained DPO¶

Conservative DPO (cDPO):
- To handle noisy preference labels, the preference probability is set to $p(y_w \succ y_l) = 1 - \epsilon$.
- The modified loss function is:
$$\mathcal{L}_{\text{DPO}}^{\epsilon}(\theta, y_w, y_l) = -(1-\epsilon)\log\hat{p}_\theta(y_w \succ y_l) - \epsilon\log(1-\hat{p}_\theta(y_w \succ y_l))$$
- Reduces the impact of noisy labels by softening gradient updates.
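The loss above can be evaluated for a single preference pair with a few lines of plain Python. The sketch assumes the standard DPO form $\hat{p}_\theta(y_w \succ y_l) = \sigma(\beta[(\log\pi_\theta(y_w) - \log\pi_{\text{ref}}(y_w)) - (\log\pi_\theta(y_l) - \log\pi_{\text{ref}}(y_l))])$; the default values of `beta` and `eps` are illustrative, not taken from the paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def cdpo_loss(
    policy_w: float, policy_l: float,   # sequence log-probs under the policy
    ref_w: float, ref_l: float,         # sequence log-probs under the reference
    beta: float = 0.1,                  # assumed value
    eps: float = 0.1,                   # assumed label-noise rate
) -> float:
    """Conservative DPO loss for one (chosen, rejected) pair."""
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    # log(1 - sigmoid(m)) equals log_sigmoid(-m)
    return -(1.0 - eps) * log_sigmoid(margin) - eps * log_sigmoid(-margin)


at_zero = cdpo_loss(0.0, 0.0, 0.0, 0.0, eps=0.0)              # plain DPO at zero margin
at_target = cdpo_loss(math.log(9), 0.0, 0.0, 0.0, beta=1.0)   # sigmoid(margin) = 0.9
overshoot = cdpo_loss(100.0, 0.0, 0.0, 0.0, beta=1.0)         # very large margin
```

With `eps > 0` the loss is minimised when the predicted preference equals `1 - eps`, so pushing the margin far beyond that point is penalised instead of rewarded.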

Masking-based DPO:
- Identifies the number of common prefix tokens between positive and negative samples.
- Masks the loss of the common prefix tokens to zero (similar to how padding tokens are handled).
- Ensures the model focuses on discriminative segments rather than the shared prefix.
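The prefix masking step can be illustrated on toy token ids and per-token log-probabilities. All numbers and helper names below are made up for the example; a real trainer would apply the same 0/1 mask to tensors.

```python
from typing import List, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the leading token ids shared by both responses."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_mask(length: int, prefix_len: int) -> List[int]:
    """0 for shared-prefix positions, 1 for the tokens that still contribute."""
    kept = max(length - prefix_len, 0)
    return [0] * (length - kept) + [1] * kept


def masked_logprob(token_logps: Sequence[float], mask: Sequence[int]) -> float:
    """Sum per-token log-probs, skipping masked positions."""
    return sum(lp * m for lp, m in zip(token_logps, mask))


chosen_ids = [5, 8, 13, 21, 34]        # toy token ids
rejected_ids = [5, 8, 13, 99]
chosen_logps = [-0.1, -0.2, -0.3, -0.4, -0.5]
rejected_logps = [-0.1, -0.2, -0.3, -2.0]

k = common_prefix_len(chosen_ids, rejected_ids)
chosen_mask = prefix_mask(len(chosen_ids), k)
rejected_mask = prefix_mask(len(rejected_ids), k)
chosen_score = masked_logprob(chosen_logps, chosen_mask)
rejected_score = masked_logprob(rejected_logps, rejected_mask)
```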

4. Joint Post-training Objective¶

Training purely with DPO leads to catastrophic forgetting and distribution shift.
Incorporates SFT loss into the DPO loss: $\mathcal{L} = \mathcal{L}_{\text{DPO}} + \alpha \mathcal{L}_{\text{SFT}}$.
$\alpha = 1$ is identified as the optimal trade-off point.
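As a small worked example of the joint objective, the sketch below assumes the SFT term is the token-averaged negative log-likelihood of the chosen response; that normalisation and the function names are assumptions, not details confirmed by the source.

```python
from typing import Sequence


def sft_nll(chosen_token_logps: Sequence[float]) -> float:
    """Token-averaged negative log-likelihood of the chosen response (assumed form)."""
    return -sum(chosen_token_logps) / len(chosen_token_logps)


def joint_loss(dpo_loss: float, sft_loss: float, alpha: float = 1.0) -> float:
    """L = L_DPO + alpha * L_SFT, as in the formula above."""
    return dpo_loss + alpha * sft_loss


sft_term = sft_nll([-0.5, -1.5])                 # 1.0
total = joint_loss(0.3, sft_term, alpha=1.0)     # DPO term plus SFT term
dpo_only = joint_loss(0.3, sft_term, alpha=0.0)  # alpha = 0 recovers pure DPO
```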

Experiments¶

Experimental Setup¶

Base Models: Llama-3.1-8B-Instruct, Llama-3.2-1B, Qwen2.5-7B/1.5B-Instruct
Benchmarks: GSM8K (elementary mathematics), MATH500 (advanced mathematics), AIME (competition mathematics), Blocksworld (planning), Multi-IF (instruction following in 8 languages)
Baselines: Sky-T1 dataset (distilled from QwQ 32B)

SFT Data Comparison¶

Model	Data	GSM8K	MATH	AIME	Blocksworld	IF(Zh)	IF(En)	IF(Other)
Llama-3.1-8B	Baseline	85.5	47.0	11.7	10.0	61.5	76.2	67.1
	+Sky-T1	84.8	44.0	6.7	2.0	25.4	31.6	29.7
	+Our Data	87.4	51.4	15.0	12.4	69.2	76.6	79.1
Qwen2.5-7B	Baseline	90.4	62.0	15.0	10.6	69.6	72.8	74.4
	+Sky-T1	89.6	61.6	9.4	0.4	26.2	24.5	30.6
	+Our Data	90.7	64.0	15.0	12.0	73.1	73.4	78.8

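Key Observations:
- Sky-T1 data degrades performance on both base models across every benchmark, with drops of roughly 36 to 48 points on the IF tasks and Blocksworld falling to 2.0 and 0.4, which supports the distillation bottleneck claim.
- The data constructed in this work matches or improves on the baseline for every task in the table (Qwen2.5-7B AIME is unchanged at 15.0). Improvements are reported to be even larger for smaller models (1B), which are not shown in this table.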
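The size of the Sky-T1 drop on the IF tasks can be checked directly against the table. The numbers below are copied from the Baseline and +Sky-T1 rows; the variable names are only for illustration.

```python
# (baseline, +Sky-T1) IF scores copied from the SFT data comparison table.
IF_SCORES = {
    ("Llama-3.1-8B", "Zh"): (61.5, 25.4),
    ("Llama-3.1-8B", "En"): (76.2, 31.6),
    ("Llama-3.1-8B", "Other"): (67.1, 29.7),
    ("Qwen2.5-7B", "Zh"): (69.6, 26.2),
    ("Qwen2.5-7B", "En"): (72.8, 24.5),
    ("Qwen2.5-7B", "Other"): (74.4, 30.6),
}

# Absolute drop in points for each model and language group.
drops = {key: round(base - sky, 1) for key, (base, sky) in IF_SCORES.items()}
smallest_drop = min(drops.values())
largest_drop = max(drops.values())

for key, drop in drops.items():
    print(key, drop)
```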

Step-by-Step Integration of Post-training Methods (Llama-3.1-8B)¶

Method	GSM8K	MATH	AIME	Plan.	IF(Zh)	IF(En)	IF(Other)
SFT Baseline	87.4 (0.23%)	51.4 (5.4%)	15.0 (30%)	12.4 (1.8%)	69.2 (0.77%)	76.6 (1.69%)	79.1 (1.08%)
+ DPO	86.2 (6.37%)	41.8 (31.8%)	8.3 (55%)	2.0 (93.6%)	5.7 (91.5%)	6.3 (90.9%)	6.7 (92.2%)
+ Data Balance	86.8 (5.08%)	28.0 (46.4%)	6.6 (65%)	6.8 (44.6%)	43.4 (30.8%)	44.7 (44.7%)	42.4 (45.3%)
+ cDPO	87.5 (3.71%)	48.6 (15%)	15.0 (45%)	4.4 (47.4%)	61.9 (11.2%)	66.4 (15.6%)	67.7 (15.4%)
+ Joint Loss	86.8 (0.38%)	48.6 (8.6%)	10.0 (31.7%)	8.6 (9%)	72.3 (1.15%)	78.9 (1.9%)	78.1 (2.22%)
+ Masking	87.2 (0.15%)	51.0 (5.8%)	8.0 (38.3%)	12.6 (10.2%)	72.0 (1.15%)	77.2 (1.9%)	79.1 (1.36%)

(Percentages in parentheses indicate the ratio of no-answer outputs)
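One plausible way to compute such a ratio is to count outputs that never reach a final answer. The detection rule below (checking for the Answer-node prefix "The answer is:") is an assumption made for this sketch; the exact rule used in the paper is not stated here.

```python
from typing import List


def no_answer_ratio(outputs: List[str], marker: str = "The answer is:") -> float:
    """Fraction of outputs that never emit the answer marker (assumed rule)."""
    if not outputs:
        return 0.0
    missing = sum(1 for text in outputs if marker not in text)
    return missing / len(outputs)


samples = [
    "Firstly, I need to break down this task. ... The answer is: 12",
    "Wait, perhaps... Alternatively, ... Wait, perhaps...",   # never converges
    "Let's check the result. The answer is: 7",
    "The answer is: 3",
]
ratio = no_answer_ratio(samples)
```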

Key Findings:
1. Pure DPO is Catastrophic: The no-answer ratio exceeds 90% on Planning and IF tasks, and performance collapses.
2. Step-by-step Remedies are Effective: The techniques are complementary, and the full stack restores performance to roughly the SFT baseline or above on most tasks. AIME is the exception (8.0 vs. 15.0).
3. Improvements Primarily Stem from Reducing No-Answer Outputs: With Joint Loss and Masking, the no-answer ratio on Planning and IF tasks falls from 90%+ to about 10% or lower.

Joint Loss α Hyperparameter¶

α	GSM8K	MATH	Plan.	IF(Zh)
cDPO (α=0)	87.5	48.6	4.4	61.9
α=0.5	86.5	50.0	7.8	68.8
α=1.0	86.8	48.6	8.6	72.3
α=1.5	85.5	48.4	7.6	68.4
α=2.0	85.6	48.0	8.4	70.7

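$\alpha = 1$ gives the best overall trade-off: it scores highest on Planning (8.6) and IF(Zh) (72.3), while $\alpha = 0.5$ is slightly better on MATH (50.0) and $\alpha = 0$ on GSM8K (87.5).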

MCTS Inference Exploration¶

Model	Test@1	Test@8	Test@32
Llama-3.1-8B Baseline	47.0	67.6	75.8
Our Best Model	51.0	70.2	79.2
+ MCTS Decode	51.0	70.8	82.8

MCTS decoding adds 3.6 points at Test@32 (79.2 to 82.8) and 0.6 points at Test@8, with no change at Test@1, demonstrating the potential of scaling test-time compute.

Highlights & Insights¶

Unveiling the Essence of the Distillation Bottleneck: Formalistic long-time thinking is not a simple underperformance issue, but rather a manifestation of smaller models mechanically mimicking reasoning patterns. This offers deeper insight than the generic statement of "weak reasoning capabilities in small models".
Constructing CoT Data From Scratch outperforms distilling from LRMs, which is the most significant empirical contribution of this paper.
Multi-model Collaborative MCTS Framework: Qwen handles reasoning while Llama handles reflection, which avoids the distributional bias of self-correction within a single model.
Failure of DPO on Long CoT is a crucial discovery (with no-answer rates >90% on Planning and IF tasks), demonstrating that standard DPO is unsuitable for direct application to reasoning models.
Complementary Combination of Four Techniques: Data balance, cDPO, Joint Loss, and Masking each address a different issue and are most effective when combined.
Quantitative Analysis of Formalistic Thinking: The no-answer ratio provides a direct measure of how severe the issue is.

Limitations & Future Work¶

Base models are only evaluated on Llama and Qwen series, without covering other model families.
MCTS data construction requires numerous LLM inference calls (Qwen-72B + Llama-70B), incurring high computational costs.
The final performance on AIME (8.0-15.0%) is still weak, indicating that the bottleneck of complex mathematical reasoning is not yet fully broken.
Masking-based DPO lowers performance on AIME (from 10.0% to 8.0%), indicating that the proposed combination of techniques is not uniformly beneficial across tasks.
Direct comparison with DeepSeek-R1 distilled models is lacking.

Related Work¶

Reasoning Models: OpenAI o1, DeepSeek-R1 (Guo et al. 2025), QwQ (Qwen Team 2024)
Knowledge Distillation: Direct Distillation (DeepSeek-R1), Sky-T1
MCTS for Reasoning: Tian et al. 2024, RStar (Qi et al. 2024), Math-Shepherd (Wang et al. 2024)
DPO Improvements: cDPO (Mitchell 2023), Joint SFT+DPO (Fernando et al. 2024)

Rating ⭐⭐⭐⭐¶

The analysis of the reasoning distillation bottleneck is thorough and empirically supported, with an original MCTS CoT construction framework. The combination of techniques is comprehensive, and each component is reasonably motivated. Limitations include limited improvements on competition-level mathematical reasoning, and the absence of a direct comparison with mainstream distillation methods (such as those from DeepSeek-R1).