Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start¶
Conference: ICLR 2026 | arXiv: 2510.25801 | Code: Project Page | Area: Multimodal VLM / Reinforcement Learning | Keywords: Cold Start, DPO, Decoupled Learning, Self-Distillation, VLM Reasoning
TL;DR¶
This paper proposes SPECS, a three-stage cold-start framework that (1) generates preference data via self-distillation (distinguishing only format differences), (2) applies DPO for format pre-alignment as the cold start, and (3) follows with GRPO fine-tuning. By decoupling format learning from reasoning learning, SPECS achieves consistent performance gains of +4.1% on MEGA-Bench and +12.2% on MathVista.
Background & Motivation¶
Background: Inspired by DeepSeek-R1, a growing body of "MLLM-r1" works applies RL (particularly GRPO) to vision-language models to enhance reasoning. The standard training paradigm follows: cold start (SFT) → RL fine-tuning.
Limitations of Prior Work: (1) SFT-based cold start couples reasoning paradigms, task solutions, and output formats into a single learning objective, leading to instruction-style overfitting and weakened OOD generalization; (2) distillation from an external teacher model can degrade performance when the capability gap between teacher and student is too large; (3) SFT-based cold start is inconsistent with subsequent RL training objectives (SFT maximizes log-likelihood vs. RL optimizes rewards), undermining training stability.
Key Challenge: When the cold start phase learns too "deeply" (simultaneously acquiring format and reasoning content), the model overfits to the training distribution, thereby restricting the exploration space and generalization capacity available to subsequent RL.
Goal: Design a cold-start strategy better suited for subsequent RL training — one that restricts cold start to learning "shallow" format/structural conventions while reserving "deep" reasoning capability acquisition for the RL stage.
Key Insight: A Generalization Factor (GF) metric is proposed to quantify the generalization ability of different cold-start methods. Empirical analysis reveals that DPO-based cold start generalizes better than SFT-based cold start, motivating the design of a decoupled learning framework.
Core Idea: The cold start employs DPO exclusively for format alignment (where both chosen and rejected responses are correct but differ in format), while reasoning capability is delegated to RL — decoupling learning objectives to avoid the overfitting pitfalls of SFT.
Method¶
Overall Architecture¶
Three-stage training pipeline: Stage 1 applies brief GRPO to the base model to obtain GRPO-zero, which then generates preference data via self-distillation; Stage 2 performs the cold start, training on this data with a DPO + SFT hybrid loss for format pre-alignment; Stage 3 applies GRPO for the final RL fine-tuning.
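A minimal outline of this flow is sketched below, with the stage routines (`run_grpo`, `build_preference_pairs`, `train_dpo_sft`) passed in as hypothetical callables; it illustrates the pipeline description above and is not the authors' implementation.

```python
# Outline of the SPECS pipeline; the stage routines are hypothetical callables.
def train_specs(base_model, train_data, run_grpo, build_preference_pairs, train_dpo_sft):
    # Stage 1: brief GRPO on the base model yields GRPO-zero (reliable output format).
    grpo_zero = run_grpo(base_model, train_data, brief=True)

    # Self-distillation: GRPO-zero writes correct, well-formatted responses (chosen);
    # rejected copies are derived by corrupting only the format (see Key Designs).
    preference_pairs = build_preference_pairs(grpo_zero, train_data)

    # Stage 2: format pre-alignment cold start with the DPO + SFT hybrid loss.
    cold_start_model = train_dpo_sft(base_model, preference_pairs)

    # Stage 3: full GRPO fine-tuning for reasoning capability.
    return run_grpo(cold_start_model, train_data, brief=False)
```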
Key Designs¶
- Self-Distilled Preference Data Generation:
- Function: Generates chosen/rejected pairs via self-distillation using GRPO-zero, where both responses are correct but differ in format.
- Mechanism: (1) Brief GRPO on the base model yields \(\pi_{\text{GRPO-zero}}\) (format accuracy 96.74% vs. 41.62% for the base model); (2) chosen responses are generated by \(\pi_{\text{GRPO-zero}}\) and filtered by Gemini-2.5-flash for reasoning path consistency; (3) rejected responses are constructed by applying five types of format corruption (e.g., removing tags, shifting tags); a corruption sketch appears after this list.
- Design Motivation: Avoids reliance on an external large teacher model (experiments show that distillation from a 72B teacher underperforms self-distillation); restricting chosen/rejected differences to format alone ensures DPO learns format conventions rather than reasoning content.
- DPO-based Format Pre-Alignment Cold Start:
- Function: Trains on self-distilled preference data using a DPO + SFT hybrid loss as the RL cold start.
- Mechanism: \(\mathcal{L}_{\text{hybrid}} = \mathcal{L}_{\text{DPO}} + \lambda \mathcal{L}_{\text{SFT}}\). The DPO loss \(\mathcal{L}_{\text{DPO}} = -\mathbb{E}[\log \sigma(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)})]\) learns format preferences; the SFT loss on chosen responses regularizes against policy drift. A code sketch of the hybrid loss appears after this list.
- Design Motivation: DPO optimizes an implicit reward model, aligning more naturally with the reward-driven objective of subsequent GRPO and yielding more stable training. Empirical GF measurements confirm that DPO consistently outperforms SFT in generalization.
- Generalization Factor (GF) Metric:
- Function: Quantifies the generalization ability of different cold-start methods.
- Mechanism: \(\Gamma(n) = (1+\beta^2) \frac{G_{\text{ID}}(n) \cdot G_{\text{OOD}}(n)}{\beta^2 \cdot G_{\text{ID}}(n) + G_{\text{OOD}}(n)}\), where \(G_{\text{ID}}\) and \(G_{\text{OOD}}\) denote in-distribution and out-of-distribution performance gains, respectively. The \(F_\beta\)-score formulation with \(\beta=2\) places greater emphasis on OOD generalization; a worked example appears after this list.
- Design Motivation: The \(F_\beta\)-score property ensures that a low score on either ID or OOD alone yields a low overall score, making it well-suited for evaluating generalization ability.
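To make the rejected-response construction concrete, below is a minimal sketch, assuming a `<think>…</think><answer>…</answer>` output format; the tag layout and the two corruption types shown (removing tags, shifting a tag) are illustrative assumptions and do not reproduce the paper's full set of five corruptions.

```python
import random

def corrupt_format(chosen: str) -> str:
    """Derive a rejected response from a correct one by damaging only its format.

    The reasoning content is left untouched, so the resulting DPO preference pair
    differs in format alone and carries no reasoning signal.
    """
    corruption = random.choice(["remove_tags", "shift_tag"])
    if corruption == "remove_tags":
        # Strip the <think> tags so the structural check fails.
        return chosen.replace("<think>", "").replace("</think>", "")
    # Shift the closing </think> tag to the very end of the response.
    return chosen.replace("</think>", "", 1) + "</think>"

chosen = "<think>2 + 3 = 5</think><answer>5</answer>"
rejected = corrupt_format(chosen)  # same content, broken structure
```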
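For the DPO-based cold start, a minimal PyTorch-style sketch of the hybrid loss is given below, assuming sequence-level log-probabilities for the policy and the frozen reference model have already been computed; argument names and default hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
                ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
                beta: float = 0.1, lam: float = 1.0) -> torch.Tensor:
    """L_hybrid = L_DPO + lambda * L_SFT, with the SFT term on the chosen responses."""
    # Implicit-reward margin between chosen and rejected, measured against the reference.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    dpo_loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

    # SFT regularizer: keep the policy close to the well-formatted chosen responses.
    sft_loss = -policy_chosen_logps.mean()

    return dpo_loss + lam * sft_loss
```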
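Finally, a small worked example of the GF metric; the ID/OOD gain values below are made-up numbers for illustration, not results from the paper.

```python
def generalization_factor(g_id: float, g_ood: float, beta: float = 2.0) -> float:
    """F_beta-style combination of ID and OOD gains; beta = 2 weights OOD more heavily."""
    return (1 + beta**2) * g_id * g_ood / (beta**2 * g_id + g_ood)

# Either gain being small drags the score down, and OOD matters more than ID:
print(generalization_factor(g_id=8.0, g_ood=1.0))  # ~1.21
print(generalization_factor(g_id=1.0, g_ood=8.0))  # ~3.33
print(generalization_factor(g_id=4.0, g_ood=3.0))  # ~3.16
```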
Loss & Training¶
Stage 3 employs GRPO with reward function \(R_{\text{total}} = R_{\text{format}} + R_{\text{acc}}\), where format reward is 0.5 (correct structure) and accuracy reward is 1.0 (correct answer). Multiple-choice and numerical questions are evaluated by rule, while open-ended questions are judged by GPT-4o. Learning rate \(1 \times 10^{-6}\), batch size 128, 8 rollouts per sample.
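Below is a hedged sketch of this reward, assuming the `<think>…</think><answer>…</answer>` output structure and treating the GPT-4o judge as an injected callable; helper names and matching rules are illustrative assumptions.

```python
import re

def extract_answer(response: str) -> str:
    """Pull out the text inside <answer>...</answer>; empty string if absent."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else ""

def compute_reward(response: str, ground_truth: str, is_open_ended: bool,
                   llm_judge=None) -> float:
    """R_total = R_format + R_acc: 0.5 for correct structure, 1.0 for a correct answer."""
    reward = 0.0

    # Format reward: the response must follow the assumed tag structure.
    if re.fullmatch(r"\s*<think>.*</think>\s*<answer>.*</answer>\s*", response, re.DOTALL):
        reward += 0.5

    # Accuracy reward: rule-based match for multiple-choice/numerical answers;
    # open-ended answers are scored by an external judge (GPT-4o in the paper).
    answer = extract_answer(response)
    if is_open_ended:
        correct = bool(llm_judge and llm_judge(answer, ground_truth))
    else:
        correct = answer == ground_truth.strip()
    if correct:
        reward += 1.0

    return reward
```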
Key Experimental Results¶
Main Results¶
| Benchmark | Metric | SPECS (Ours-7B) | Backbone (Qwen2.5-VL-7B) | Gain (pts) |
|---|---|---|---|---|
| MEGA-Bench Core | Score | 39.17 | 35.07 | +4.1 |
| MathVista | Acc | 75.90 | 63.70 | +12.2 |
| MathVerse | Acc | 48.73 | 38.20 | +10.5 |
| MathVision | Acc | 29.50 | 25.40 | +4.1 |
| MMMU | Acc | 56.78 | 54.20 | +2.5 |
Ablation Study¶
| Configuration | AVG (Cold Start / Cold Start+RL) | Notes |
|---|---|---|
| Self-Distillation + Decoupled | 47.27 / 50.02 | Full SPECS |
| Qwen-72B Distillation | 44.90 / 48.98 | External teacher underperforms self-distillation |
| Qwen-32B Distillation | 42.89 / 46.43 | Larger capability gap yields worse results |
| Base Model Distillation | 45.07 / 48.79 | Self-distillation without GRPO-zero |
| Coupled Data (DPO) | 47.67 / 48.68 | Mixed format+content data degrades performance |
| SFT-based GRPO | — / 47.65 | SFT cold start |
| DPO-based GRPO | — / 50.02 | DPO cold start is superior |
Key Findings¶
- Self-distillation outperforms external teacher distillation: GRPO-zero achieves 96.74% format accuracy vs. 41.62% for the base model, providing higher-quality chosen responses.
- Decoupled data (format-only differences) outperforms coupled data (format + correctness differences): DPO cold start restricted to format learning better facilitates subsequent RL.
- DPO-based GRPO training is more stable (smoother policy loss curves) and achieves higher final performance than SFT-based GRPO.
- The GF metric confirms that DPO's OOD generalization advantage over SFT grows with training steps.
Highlights & Insights¶
- The core insight of "decoupled learning": shallow learning (format/structure) and deep learning (reasoning capability) are best handled separately by DPO and RL, respectively.
- Self-distillation circumvents the teacher-student capability gap problem; GRPO-zero serves as an effective intermediate that improves data quality while maintaining distributional consistency.
- The alignment between DPO and RL objectives explains the observed training stability difference — the transition from SFT (imitation learning) to RL (reward optimization) involves a discontinuous objective shift, whereas DPO (implicit reward) to RL (explicit reward) provides a more coherent progression.
Limitations & Future Work¶
- Stage 1 requires additional GRPO pre-training to produce GRPO-zero, incurring extra computational cost.
- Rejected responses in the preference data are constructed by rule-based format corruption, which may not reflect the true distribution of naturally occurring format errors.
- Evaluation of reasoning consistency in chosen responses relies on Gemini-2.5-flash, introducing a dependency on an external API.
- Validation is currently limited to the 7B scale; effectiveness on larger models remains unknown.
Related Work & Insights¶
- vs. SFT Cold Start (DeepSeek-R1 paradigm): Jointly learning format and reasoning in SFT degrades OOD generalization; SPECS's DPO cold start decouples these two objectives.
- vs. Orsta-7B: Using the same training data, SPECS outperforms by +0.86 on MEGA-Bench and +5.7 on MathVista, demonstrating the framework's advantage.
- vs. VL-Rethinker-7B: SPECS achieves comparable or slightly superior results on MEGA-Bench and MathVista, while offering a more generalizable cold-start strategy.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of decoupled learning, DPO-based cold start, and self-distillation constitutes a novel system design.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive multi-benchmark coverage with fine-grained ablations over distillation sources, data strategies, and cold-start methods.
- Writing Quality: ⭐⭐⭐ Content is solid but somewhat verbose; the exposition of the GF metric could be more concise.
- Value: ⭐⭐⭐⭐ Provides a superior cold-start paradigm for RL-based VLM training with practical guidance for the MLLM-r1 ecosystem.