
Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models

Conference: ACL 2025
arXiv: 2503.01461
Code: AIDC-AI/Marco-o1
Area: LLM Reasoning / Knowledge Distillation
Keywords: Reasoning Distillation, MCTS, Chain-of-Thought, DPO, Formalistic Thinking

TL;DR

Reveals the bottleneck of "formalistic long-time thinking" when directly distilling long CoT data from large reasoning models (e.g., DeepSeek-R1) to smaller models, and proposes reconstructing tree-structured CoT data from scratch using MCTS, combined with thoughts length balance, fine-grained DPO, and a joint training objective to alleviate this issue.


Background & Motivation

Background

Large reasoning models (LRMs) such as OpenAI o1 and DeepSeek-R1 achieve strong reasoning by scaling test-time compute and generating long Chain-of-Thought (CoT). Distilling this capability into smaller models (e.g., directly fine-tuning Qwen2.5-7B on data generated by LRMs) is an efficient strategy. For instance, DeepSeek-R1 distilled models reach 55.5% on AIME, compared with 9.3% for GPT-4o.

Core Problem

The authors discovered that smaller distilled models often exhibit Formalistic Long-time Thinking—mechanically mimicking the reasoning patterns of larger models without genuinely internalizing the underlying reasoning logic. This is specifically manifested in three types of errors:

Content Repetition: The model repeatedly generates identical text segments, failing to advance the reasoning (e.g., an infinite loop like "positions are considered up to consider that the positions are...")

Over-Reflection: The model constantly questions itself using patterns such as "Wait, perhaps...", "Alternatively,...", but fails to converge to an answer.

Instruction Failure: The model falls into unnecessarily long reasoning on simple tasks such as translation and ultimately fails to output a final answer.

Root Cause

  • Long CoT in distilled data presents a learning difficulty for smaller models.
  • SFT and RL methods lead to bias inheritance (e.g., overthinking patterns).
  • DPO training is highly sensitive to response length, exacerbating formalistic thinking.

Research Questions

How can long CoT reasoning be effectively transferred to smaller models through data construction, SFT, and RL methods?


Method

Overall Architecture

The approach consists of two parts:

  1. Data side: Constructing tree-structured CoT data from scratch based on MCTS (rather than distilling from LRMs).
  2. Method side: CoT-aware post-training techniques (Thoughts Length Balance + Fine-grained DPO + Joint Objective).

Key Designs

1. MCTS-based CoT Data Construction

Thought Node Definition:

| Node Type | Function | Prefix Prompt |
|---|---|---|
| Thinking | Open-ended reasoning continuation | (None, direct continuation) |
| Sub-Task | Task breakdown | "Firstly, I need to break down this task." |
| Reflection | Check & error correction | "Let's check the result. Wait! something is wrong..." |
| Hypothesis | Proposing hypothesis | "I propose the following hypothesis:" |
| Double-Check | Verification | "Now, I need to check whether all requirements are met." |
| Reclarify | Re-clarification | "To ensure clarity, let me restate..." |
| Answer | Providing answer | "The answer is:" |

MCTS Search Process:

  1. Node Selection: Uses the UCB formula to balance exploration and exploitation: \(UCB(n_i) = \frac{v(n_i)}{n_{\text{visits}}(n_i)} + C\sqrt{\frac{\ln(n_{\text{visits}}(n_{\text{parent}}))}{n_{\text{visits}}(n_i)}}\)
  2. Expansion: Prompts the LLM to generate the content for the corresponding node type based on a predefined node transition matrix.
  3. Rollout: Calculates rule-based correctness rewards upon reaching the Answer node.
  4. Backpropagation: Propagates rewards back up the tree.
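The selection and backpropagation steps can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the dictionary-based node layout, the function names, and the exploration constant `c=1.41` are assumptions (the source does not state the value of \(C\)), and unvisited nodes are given an infinite score so they are tried first.

```python
import math

def ucb(value, visits, parent_visits, c=1.41):
    """UCB score: average value plus an exploration bonus.

    `c` is an assumed exploration constant; unvisited nodes score +inf.
    """
    if visits == 0:
        return math.inf
    exploit = value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, parent_visits, c=1.41):
    """Pick the child node (a dict) with the highest UCB score."""
    return max(children, key=lambda n: ucb(n["value"], n["visits"], parent_visits, c))

def backpropagate(path, reward):
    """Add the rollout reward to every node on the root-to-leaf path."""
    for node in path:
        node["visits"] += 1
        node["value"] += reward

# Hypothetical tree state: a root with two expanded children.
root = {"name": "root", "value": 4.0, "visits": 6}
children = [
    {"name": "Thinking", "value": 3.0, "visits": 5},
    {"name": "Reflection", "value": 1.0, "visits": 1},
]
best = select_child(children, root["visits"])
backpropagate([root, best], reward=1.0)  # e.g. a rule-based reward of 1 for a correct answer
```

In this toy state the rarely visited Reflection child is selected because its exploration bonus outweighs the higher visit count of the Thinking child.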

Multi-Model Collaboration:

  • Qwen2.5-72B-Instruct is used for Thinking nodes.
  • Llama3.1-70B-Instruct is used for Reflection nodes.
  • Self-correction within the same model tends to reuse the same error distribution; switching models helps reduce repetitive errors.

Diversity of Reasoning Patterns: Four different node transition patterns are designed (e.g., Sub-Task→Thinking→Answer, Sub-Task→Hypothesis→Thinking→Answer, etc.) and randomly sampled to produce diverse reasoning paths.
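As a rough sketch of how a pattern could be sampled and turned into prefix prompts: the code below includes only the two patterns named in the text (the remaining patterns are not listed here and are deliberately left out), and the names `PATTERNS`, `sample_pattern`, and `prefixes_for` are hypothetical.

```python
import random

# Prefix prompts taken from the node table; Thinking has no prefix.
PREFIX = {
    "Thinking": "",
    "Sub-Task": "Firstly, I need to break down this task.",
    "Hypothesis": "I propose the following hypothesis:",
    "Answer": "The answer is:",
}

# Only the two transition patterns named in the text; the others are omitted.
PATTERNS = [
    ["Sub-Task", "Thinking", "Answer"],
    ["Sub-Task", "Hypothesis", "Thinking", "Answer"],
]

def sample_pattern(rng):
    """Randomly pick one node transition pattern."""
    return rng.choice(PATTERNS)

def prefixes_for(pattern):
    """Map each node type in a pattern to its prefix prompt."""
    return [PREFIX[node] for node in pattern]

rng = random.Random(0)  # seeded only to make the sketch reproducible
pattern = sample_pattern(rng)
prompts = prefixes_for(pattern)
```

Each prefix would then be prepended to the LLM's generation for that node during the expansion step.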

Data Extraction:

  • SFT Data: Selected from successful paths reaching the correct answer (highest-reward paths or paths of specific lengths).
  • DPO Data: Positive examples are correct paths, while negative ones are incorrect paths sharing the shortest common prefix with the positive example.

2. Thoughts Length Balance

  • It was observed that CoT length significantly impacts the DPO stage but has negligible effect on SFT.
  • Strategy: Use the longest CoT during the SFT stage and the shortest CoT during the DPO stage.
  • Extract paths from the CoT tree based on relative length (short/medium/long) instead of setting fixed token thresholds.
  • Shorter reasoning paths reduce redundant outputs, mitigating formalistic long-time thinking.
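A minimal sketch of length-relative selection follows. It assumes each correct path is a list of tokens and that "relative length" means rank among the paths of the same tree; the equal-thirds bucketing and both function names are illustrative assumptions, not details from the paper.

```python
def split_by_relative_length(paths):
    """Rank paths by length and label the thirds short/medium/long.

    The equal-thirds split is an assumption made for illustration.
    """
    ordered = sorted(paths, key=len)
    n = len(ordered)
    buckets = {"short": [], "medium": [], "long": []}
    for rank, path in enumerate(ordered):
        if rank < n / 3:
            label = "short"
        elif rank < 2 * n / 3:
            label = "medium"
        else:
            label = "long"
        buckets[label].append(path)
    return buckets

def pick_sft_and_dpo(correct_paths):
    """Return (longest path for SFT, shortest path for DPO)."""
    ordered = sorted(correct_paths, key=len)
    return ordered[-1], ordered[0]

# Hypothetical correct paths from one CoT tree, as token lists.
paths = [["t"] * 5, ["t"] * 2, ["t"] * 9]
buckets = split_by_relative_length(paths)
sft_path, dpo_path = pick_sft_and_dpo(paths)
```

Because the split is by rank rather than by a fixed token count, the same rule applies to easy problems with short trees and hard problems with long ones.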

3. Fine-grained DPO

Conservative DPO (cDPO):

  • To handle noisy preference labels, the preference probability is set to \(p(y_w \succ y_l) = 1 - \epsilon\).
  • The modified loss function is: \(\mathcal{L}_{\text{DPO}}^{\epsilon}(\theta, y_w, y_l) = -(1-\epsilon)\log\hat{p}_\theta(y_w \succ y_l) - \epsilon\log(1-\hat{p}_\theta(y_w \succ y_l))\)
  • Reduces the impact of noisy labels by softening gradient updates.
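The loss above can be written out for a single preference pair. The sketch assumes the standard DPO definition \(\hat{p}_\theta(y_w \succ y_l) = \sigma(\beta[(\log\pi_\theta(y_w) - \log\pi_{\text{ref}}(y_w)) - (\log\pi_\theta(y_l) - \log\pi_{\text{ref}}(y_l))])\), which the text does not spell out; the values `beta=0.1` and `eps=0.1` are placeholders, not the paper's settings.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def cdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, eps=0.1):
    """Conservative DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and the
    reference model. `beta` and `eps` are assumed placeholder values.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # 1 - sigmoid(m) == sigmoid(-m), so the second term uses log_sigmoid(-margin).
    return -(1 - eps) * log_sigmoid(margin) - eps * log_sigmoid(-margin)

# Hypothetical log-probabilities for one pair.
soft = cdpo_loss(-1.0, -5.0, -2.0, -2.0, eps=0.1)
hard = cdpo_loss(-1.0, -5.0, -2.0, -2.0, eps=0.0)  # eps=0 recovers standard DPO
```

With \(\epsilon > 0\) the loss keeps penalizing the model once the margin grows large, which limits how far a possibly mislabeled pair can push the policy.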

Masking-based DPO:

  • Identifies the number of common prefix tokens between positive and negative samples.
  • Masks the loss of the common prefix tokens to zero (similar to how padding tokens are handled).
  • Ensures the model focuses on discriminative segments rather than the shared prefix.
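A small sketch of the masking step, assuming both responses are already tokenized into ID lists and that per-token log-probabilities are available; the function names are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two token-ID sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prefix_masks(chosen, rejected):
    """Build 0/1 masks that zero out the shared prefix in both responses."""
    k = common_prefix_len(chosen, rejected)
    chosen_mask = [0] * k + [1] * (len(chosen) - k)
    rejected_mask = [0] * k + [1] * (len(rejected) - k)
    return chosen_mask, rejected_mask

def masked_logp(token_logps, mask):
    """Sequence log-probability counting only unmasked tokens."""
    return sum(lp * m for lp, m in zip(token_logps, mask))

# Hypothetical token IDs: the two paths diverge after three shared tokens.
chosen = [1, 2, 3, 4, 5]
rejected = [1, 2, 3, 9]
chosen_mask, rejected_mask = prefix_masks(chosen, rejected)
logp_w = masked_logp([-1.0, -1.0, -1.0, -0.5, -0.25], chosen_mask)
```

The masked log-probabilities would then replace the full-sequence values in the DPO margin, so only the diverging suffixes contribute to the gradient.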

4. Joint Post-training Objective

  • Training purely with DPO leads to catastrophic forgetting and distribution shift.
  • Incorporates SFT loss into the DPO loss: \(\mathcal{L} = \mathcal{L}_{\text{DPO}} + \alpha \mathcal{L}_{\text{SFT}}\).
  • \(\alpha = 1\) is identified as the optimal trade-off point.
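A worked instance of the joint objective, as a sketch: it assumes the SFT term is the mean token-level negative log-likelihood of the chosen response, which is a common choice but is not stated in the text, and the function names are illustrative.

```python
def sft_loss(chosen_token_logps):
    """Mean negative log-likelihood of the chosen response (assumed form)."""
    return -sum(chosen_token_logps) / len(chosen_token_logps)

def joint_loss(dpo_loss, chosen_token_logps, alpha=1.0):
    """L = L_DPO + alpha * L_SFT."""
    return dpo_loss + alpha * sft_loss(chosen_token_logps)

# Hypothetical values: DPO loss 0.5 and two chosen-token log-probabilities.
total = joint_loss(0.5, [-1.0, -3.0], alpha=1.0)   # 0.5 + 1.0 * 2.0
dpo_only = joint_loss(0.5, [-1.0, -3.0], alpha=0.0)
```

Setting `alpha=0.0` reduces the objective to the preference loss alone, which corresponds to the cDPO row in the α ablation.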

Experiments

Experimental Setup

  • Base Models: Llama-3.1-8B-Instruct, Llama-3.2-1B, Qwen2.5-7B/1.5B-Instruct
  • Benchmarks: GSM8K (elementary mathematics), MATH500 (advanced mathematics), AIME (competition mathematics), Blocksworld (planning), Multi-IF (instruction following in 8 languages)
  • Baselines: Sky-T1 dataset (distilled from QwQ 32B)

SFT Data Comparison

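| Model | Data | GSM8K | MATH | AIME | Blocksworld | IF(Zh) | IF(En) | IF(Other) |
|---|---|---|---|---|---|---|---|---|
| Llama-3.1-8B | Baseline | 85.5 | 47.0 | 11.7 | 10.0 | 61.5 | 76.2 | 67.1 |
| | +Sky-T1 | 84.8 | 44.0 | 6.7 | 2.0 | 25.4 | 31.6 | 29.7 |
| | +Our Data | 87.4 | 51.4 | 15.0 | 12.4 | 69.2 | 76.6 | 79.1 |
| Qwen2.5-7B | Baseline | 90.4 | 62.0 | 15.0 | 10.6 | 69.6 | 72.8 | 74.4 |
| | +Sky-T1 | 89.6 | 61.6 | 9.4 | 0.4 | 26.2 | 24.5 | 30.6 |
| | +Our Data | 90.7 | 64.0 | 15.0 | 12.0 | 73.1 | 73.4 | 78.8 |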

Key Observations:

  • Sky-T1 data degrades performance on both the 8B and 7B models across every benchmark, with IF scores dropping by roughly 36 to 48 points, which validates the distillation bottleneck.
  • The data constructed in this work matches or improves the baseline on every task (AIME on Qwen2.5-7B is unchanged at 15.0), and reportedly shows even larger improvements for smaller models (1B).

Step-by-Step Integration of Post-training Methods (Llama-3.1-8B)

| Method | GSM8K | MATH | AIME | Plan. | IF(Zh) | IF(En) | IF(Other) |
|---|---|---|---|---|---|---|---|
| SFT Baseline | 87.4 (0.23%) | 51.4 (5.4%) | 15.0 (30%) | 12.4 (1.8%) | 69.2 (0.77%) | 76.6 (1.69%) | 79.1 (1.08%) |
| + DPO | 86.2 (6.37%) | 41.8 (31.8%) | 8.3 (55%) | 2.0 (93.6%) | 5.7 (91.5%) | 6.3 (90.9%) | 6.7 (92.2%) |
| + Data Balance | 86.8 (5.08%) | 28.0 (46.4%) | 6.6 (65%) | 6.8 (44.6%) | 43.4 (30.8%) | 44.7 (44.7%) | 42.4 (45.3%) |
| + cDPO | 87.5 (3.71%) | 48.6 (15%) | 15.0 (45%) | 4.4 (47.4%) | 61.9 (11.2%) | 66.4 (15.6%) | 67.7 (15.4%) |
| + Joint Loss | 86.8 (0.38%) | 48.6 (8.6%) | 10.0 (31.7%) | 8.6 (9%) | 72.3 (1.15%) | 78.9 (1.9%) | 78.1 (2.22%) |
| + Masking | 87.2 (0.15%) | 51.0 (5.8%) | 8.0 (38.3%) | 12.6 (10.2%) | 72.0 (1.15%) | 77.2 (1.9%) | 79.1 (1.36%) |

(Percentages in parentheses indicate the ratio of no-answer outputs)
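One simple way to compute such a ratio is sketched below. It assumes an output counts as answered when it contains the Answer-node prefix "The answer is:"; the paper's actual detection rule is not described here, so both the marker and the function name are assumptions.

```python
def no_answer_ratio(outputs, marker="The answer is:"):
    """Percentage of outputs that never produce a final-answer marker.

    The marker-based check is an assumed stand-in for the real rule.
    """
    missing = sum(1 for text in outputs if marker not in text)
    return 100.0 * missing / len(outputs)

# Hypothetical model outputs: the third one loops without answering.
outputs = [
    "Firstly, I need to break down this task. ... The answer is: 12",
    "The answer is: 7",
    "Wait, perhaps... Alternatively, ... Wait, perhaps...",
    "Let's check the result. ... The answer is: 3",
]
ratio = no_answer_ratio(outputs)
```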

Key Findings:

  1. Pure DPO is Catastrophic: The ratio of no-answer outputs exceeds 90% on Planning and IF tasks, leading to a collapse in performance.
  2. Step-by-step Remedies are Effective: The techniques are complementary, and together they restore performance to a level close to or exceeding the SFT baseline on most tasks (AIME remains below it).
  3. Improvements Primarily Stem from Reducing No-Answer Outputs: With Joint Loss and Masking, the no-answer ratio falls from above 90% to about 1-2% on IF tasks and to roughly 10% on Planning.

Joint Loss α Hyperparameter

| α | GSM8K | MATH | Plan. | IF(Zh) |
|---|---|---|---|---|
| cDPO (α=0) | 87.5 | 48.6 | 4.4 | 61.9 |
| α=0.5 | 86.5 | 50.0 | 7.8 | 68.8 |
| α=1.0 | 86.8 | 48.6 | 8.6 | 72.3 |
| α=1.5 | 85.5 | 48.4 | 7.6 | 68.4 |
| α=2.0 | 85.6 | 48.0 | 8.4 | 70.7 |

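\(\alpha = 1\) is the best trade-off: it gives the highest Planning and IF(Zh) scores, at a small cost in GSM8K and MATH relative to the best settings for those benchmarks.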

MCTS Inference Exploration

| Model | Test@1 | Test@8 | Test@32 |
|---|---|---|---|
| Llama-3.1-8B Baseline | 47.0 | 67.6 | 75.8 |
| Our Best Model | 51.0 | 70.2 | 79.2 |
| + MCTS Decode | 51.0 | 70.8 | 82.8 |

MCTS inference achieves an additional 3.6-point gain on Test@32 (79.2 → 82.8), demonstrating the potential of scaling test-time compute.


Highlights & Insights

  1. Unveiling the Essence of the Distillation Bottleneck: Formalistic long-time thinking is not a simple underperformance issue, but rather a manifestation of smaller models mechanically mimicking reasoning patterns. This offers deeper insight than the generic statement of "weak reasoning capabilities in small models".
  2. Constructing CoT Data From Scratch outperforms distilling from LRMs, which is the most significant empirical contribution of this paper.
  3. Multi-model Collaborative MCTS Framework: Qwen handles reasoning while Llama handles reflection, avoiding the distributional bias of self-correction within the same model.
  4. Failure of DPO on Long CoT is a crucial discovery (with no-answer rates >90% on Planning and IF tasks), demonstrating that standard DPO is unsuitable for direct application to reasoning models.
  5. Orthogonal Compatibility of Four Techniques: Data balance, cDPO, Joint Loss, and Masking each resolve different issues, yielding substantial effects when combined.
  6. Quantitative Analysis of Formalistic Thinking: The no-answer ratio provides a direct measure of the severity of the issue.

Limitations & Future Work

  1. Base models are only evaluated on Llama and Qwen series, without covering other model families.
  2. MCTS data construction requires numerous LLM inference calls (Qwen-72B + Llama-70B), incurring high computational costs.
  3. The final performance on AIME (8.0-15.0%) is still weak, indicating that the bottleneck of complex mathematical reasoning is not yet fully broken.
  4. Masking DPO degrades performance on AIME (from 10.0% to 8.0%), showing that the proposed combination of techniques is not beneficial on every task.
  5. Direct comparison with DeepSeek-R1 distilled models is lacking.

Related Work

  • Reasoning Models: OpenAI o1, DeepSeek-R1 (Guo et al. 2025), QwQ (Qwen Team 2024)
  • Knowledge Distillation: Direct Distillation (DeepSeek-R1), Sky-T1
  • MCTS for Reasoning: Tian et al. 2024, rStar (Qi et al. 2024), Math-Shepherd (Wang et al. 2024)
  • DPO Improvements: cDPO (Mitchell 2023), Joint SFT+DPO (Fernando et al. 2024)

Rating ⭐⭐⭐⭐

The analysis of the reasoning distillation bottleneck is thorough and empirically supported, with an original MCTS CoT construction framework. The combination of techniques is comprehensive, and each component is reasonably motivated. Limitations include limited improvements on competition-level mathematical reasoning, and the absence of a direct comparison with mainstream distillation methods (such as those from DeepSeek-R1).