MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale¶

Conference	arXiv	Code	Area	Keywords
ACL 2025	2412.05237	Project Page	multimodal_vlm	Multimodal Reasoning, Instruction Tuning, CoT, Data Rewriting, Large-Scale Training Data

TL;DR¶

Proposes a scalable, low-cost method for constructing MAmmoTH-VL-Instruct, a multimodal instruction-tuning dataset of 12 million instances rich in Chain-of-Thought (CoT) reasoning, using only open-source models. The resulting model, MAmmoTH-VL-8B, achieves state-of-the-art (SOTA) performance on multimodal reasoning benchmarks (e.g., MathVerse +8.1%, MMMU-Pro +7%, MuirBench +13.3%).

Background & Motivation¶

Limitations of Prior Work: Existing multimodal instruction tuning datasets primarily originate from academic VQA datasets (e.g., VQA, AI2D, ChartQA). These datasets focus on simpler tasks and only provide short phrase-level answers without intermediate reasoning processes, limiting the model's reasoning capabilities.
Key Challenge: While Chain-of-Thought (CoT) reasoning has shown significant efficacy in text-only LLMs, constructing large-scale multimodal CoT datasets faces two major obstacles: (1) ensuring instruction diversity and complexity, and (2) generating coherent responses with detailed justifications. Manual annotation is prohibitively expensive, and relying on closed-source models like GPT-4 involves high costs and copyright concerns.
Design Motivation: The goal is low-cost, scalable construction of multimodal CoT datasets using only open-source models, which lowers the barrier to entry for the open-source community.

Method¶

Overall Architecture¶

A three-step data construction pipeline:
1. Data Collection and Categorization: Data is collected from 153 public datasets and organized into 10 major categories (General, OCR, Chart, Caption, Domain-specific, Code&Math, Language, Detection, Multi-Image, Video).
2. Instruction Data Rewriting: Short answers are rewritten into detailed responses containing CoT reasoning using open-source models.
3. Self-Filtering: The same MLLM serves as a judge (Model-as-Judge) to filter out hallucinated content.
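The rewrite-then-filter part of the pipeline can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `Sample`, `build_instruct_data`, `toy_rewrite`, and `toy_judge` are hypothetical names, and the two toy callables stand in for calls to an open-source MLLM. The judge criterion used here (the original answer must survive the rewrite) is an assumption chosen to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Sample:
    question: str
    answer: str
    category: str  # one of the 10 categories, e.g. "Chart" or "OCR"


# Hypothetical signatures: in practice both callables would wrap an open-source MLLM.
Rewriter = Callable[[Sample], Sample]
Judge = Callable[[Sample, Sample], bool]


def build_instruct_data(
    samples: Iterable[Sample], rewrite: Rewriter, judge: Judge
) -> List[Sample]:
    """Steps 2 and 3: rewrite each collected sample, keep rewrites the judge accepts."""
    kept = []
    for original in samples:
        candidate = rewrite(original)
        if judge(original, candidate):
            kept.append(candidate)
    return kept


def toy_rewrite(sample: Sample) -> Sample:
    # Stand-in for the CoT rewrite; simulates a hallucinated answer on OCR data.
    final = "unreadable" if sample.category == "OCR" else sample.answer
    return Sample(
        sample.question,
        f"Reasoning steps... Final answer: {final}",
        sample.category,
    )


def toy_judge(original: Sample, candidate: Sample) -> bool:
    # Stand-in for Model-as-Judge: accept only if the original answer survives.
    return original.answer in candidate.answer


collected = [
    Sample("What is the peak value?", "42", "Chart"),
    Sample("What does the sign say?", "EXIT", "OCR"),
]
filtered = build_instruct_data(collected, toy_rewrite, toy_judge)
```

In this toy run the OCR sample is dropped because its rewritten answer no longer contains the original answer, while the Chart sample is kept with its expanded response.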

Key Designs¶

Three-Tier Data Grouping:
- Group A (58 datasets): High quality; the original data is retained as is.
- Group B (60 datasets): Promising, but responses are too brief; enhanced through rewriting.
- Group C (35 datasets): Excessively vague or short; discarded.
Task-Aware Rewriting Strategy: Tailored prompts are designed for each data category. For Caption-type data, a text-only LLM (Llama-3-70B) is used to generate task-oriented QA pairs. For other categories, a multimodal model (InternVL2-Llama3-76B) is utilized to ensure vision-language alignment.
Data Mixing Ratio: 70% rewritten data + 30% original data. t-SNE analysis reveals that the rewritten data retains the core characteristics of the original distribution while expanding the overall coverage.
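One way to realize a 70/30 mix is to sample each pool to a fixed share of the target size and then shuffle. The sketch below assumes this sampling scheme. The summary states only the ratio, so `mix_datasets` and its parameters are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_datasets(
    rewritten: Sequence[T],
    original: Sequence[T],
    total: int,
    rewritten_frac: float = 0.7,
    seed: int = 0,
) -> List[T]:
    """Draw `total` items, about `rewritten_frac` of them from the rewritten pool."""
    rng = random.Random(seed)
    n_rewritten = round(total * rewritten_frac)
    n_original = total - n_rewritten
    if n_rewritten > len(rewritten) or n_original > len(original):
        raise ValueError("not enough samples in one of the pools")
    # Sample without replacement from each pool, then shuffle the union.
    mixed = rng.sample(list(rewritten), n_rewritten)
    mixed += rng.sample(list(original), n_original)
    rng.shuffle(mixed)
    return mixed


rewritten_pool = [("rewritten", i) for i in range(100)]
original_pool = [("original", i) for i in range(100)]
mixed = mix_datasets(rewritten_pool, original_pool, total=10)
```

With `total=10` the result holds 7 rewritten and 3 original items in random order. A fixed seed keeps the mix reproducible across runs.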

Training Configuration¶

Three-stage training (based on the LLaVA-OneVision architecture):
- Stage-1: Language-image alignment pre-training (558K samples, projector-only training).
- Stage-2: Single-image visual instruction tuning (10M samples, full-parameter training).
- Stage-3: One-Vision multi-image/video fine-tuning (2M samples, full-parameter training).

LLM Backbone: Qwen2.5-7B-Instruct, Vision Encoder: SigLIP-so400m-patch14-384
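The training schedule can be written down as a small configuration table. The sketch below is illustrative only: the `Stage` class, the stage names, and the module names (`vision_encoder`, `projector`, `llm`) are assumptions, and "full-parameter training" is read here as unfreezing all three modules.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    num_samples: int
    trainable: Tuple[str, ...]  # modules that receive gradient updates


ALL_MODULES = ("vision_encoder", "projector", "llm")  # assumed module names

STAGES = (
    Stage("stage1_alignment", 558_000, ("projector",)),
    Stage("stage2_single_image", 10_000_000, ALL_MODULES),
    Stage("stage3_onevision", 2_000_000, ALL_MODULES),
)


def is_trainable(stage: Stage, module: str) -> bool:
    """Return True if `module` is updated during `stage`."""
    return module in stage.trainable


# Stages 2 and 3 together consume the instruction-tuning data.
instruction_samples = sum(s.num_samples for s in STAGES[1:])
```

The Stage-2 and Stage-3 sample counts sum to 12M, which matches the stated size of MAmmoTH-VL-Instruct.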

Experiments¶

Main Results: Multi-Disciplinary Knowledge & Mathematical Reasoning¶

Model	MMStar	MMMU (val)	MMMU-Pro	MathVerse	MathVista
GPT-4o	64.7	69.1	49.7	50.2	63.8
Qwen2-VL-7B	60.7	52.1	26.9	28.2	58.2
LLaVA-OV-7B	61.7	48.8	18.7	26.2	63.2
Llava-CoT-11B	57.6	48.9	18.5	24.2	54.8
MAmmoTH-VL-8B	63.0	50.8	25.3	34.2	67.6
Gain vs. Best Open-Source (~10B)	+1.3	+1.9	+7.1	+8.1	+4.4

Document and Chart Understanding¶

Model	AI2D	ChartQA	DocVQA	RealWorldQA
LLaVA-OV-7B	81.4	80.0	87.5	66.3
InternVL-2-8B	83.8	83.3	91.6	64.4
MAmmoTH-VL-8B	84.0	86.2	93.7	69.9
Gain vs. Best Open-Source (~10B)	+2.4	+2.1	+1.6	+0.6

Key Findings¶

Self-Filtering is Crucial: OCR and chart data exhibit the highest hallucination filtering rates, and removing the filtering step leads to a significant drop in model performance.
Improved Rewritten Data Quality: Rewritten data scores higher than the original data on both information richness and relevance (rated on a 5-point scale).
Significant Data Scaling Effects: Performance consistently improves as training data increases from 2M to 10M, demonstrating the scalability of large-scale CoT data.
Non-Reasoning Tasks Also Benefit: Improvements of up to 4% are observed even on non-reasoning benchmarks, indicating the generalization benefits of CoT training.

Highlights & Insights¶

Constructs a 12M large-scale multimodal CoT dataset using only open-source models, breaking the reliance on closed-source models like GPT-4.
The three-step pipeline (collect-rewrite-filter) is simple and scalable, with a methodology that can be replicated in other domains.
The 8B model significantly outperforms models of similar or even larger scales on reasoning-intensive tasks (e.g., MathVerse +8.1%).
The Model-as-Judge self-filtering method reaches a Kappa of 0.64 against human evaluation, a level conventionally described as substantial agreement.
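For readers unfamiliar with the agreement statistic, the sketch below shows how a kappa score between judge and human labels can be computed. It assumes the reported figure is Cohen's kappa, which the summary does not specify, and the labels are made-up toy data rather than the paper's annotations.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters over the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both raters give the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


# Toy labels: 1 = keep the rewritten sample, 0 = filter it out.
human_labels = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
judge_labels = [1, 1, 0, 0, 0, 0, 1, 1, 1, 0]
kappa = cohen_kappa(human_labels, judge_labels)
```

Here the raters agree on 8 of 10 items and chance agreement is 0.5, so kappa comes out at about 0.6. Kappa discounts the agreement expected by chance, which is why it is lower than the raw 80% agreement rate.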

Limitations & Future Work¶

Using the same generative model as a judge for self-filtering may lead to blind spots regarding its own specific error patterns.
High hallucination rates persist in OCR and chart data during rewriting, indicating that open-source MLLMs still fall short in fine-grained visual understanding.
Although data construction is cheaper than with GPT-4, training at the 10M scale still requires substantial computational resources.
Data collection depends on existing public datasets, which may leave new domains or tasks insufficiently covered.

Related Work¶

Multimodal Instruction Tuning: LLaVA (Liu et al. 2024b) pioneered the visual instruction tuning paradigm; LLaVA-OneVision (Li et al. 2024b) extended this to multi-image and video scenarios.
Reasoning Enhancement: Chain-of-Thought (Wei et al. 2022) introduces step-by-step reasoning; LLaVA-CoT (Xu et al. 2024a) introduces CoT in a single model but relies on GPT-4 generated data.
Data Quality and Filtering: Cambrian (Tong et al. 2024) explores multi-source data fusion in training; InternVL2 (Chen et al. 2023b) focuses on large-scale pre-training; this work highlights the feasibility of self-filtering using open-source models.

Rating¶

Metric	Score (1-10)
Novelty	7
Technical Depth	7
Experimental Thoroughness	9
Writing Quality	8
Overall	7.5