MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
| Conference | arXiv | Code | Area | Keywords |
|---|---|---|---|---|
| ACL 2025 | 2412.05237 | Project Page | multimodal_vlm | Multimodal Reasoning, Instruction Tuning, CoT, Data Rewriting, Large-Scale Training Data |
TL;DR
Proposes a scalable, low-cost method for constructing MAmmoTH-VL-Instruct, a multimodal instruction-tuning dataset of 12 million instances rich in Chain-of-Thought (CoT) reasoning, using only open-source models. The resulting model, MAmmoTH-VL-8B, achieves state-of-the-art (SOTA) performance among open-source models on multimodal reasoning benchmarks (e.g., MathVerse +8.1%, MMMU-Pro +7%, MuirBench +13.3%).
Background & Motivation
- Limitations of Prior Work: Existing multimodal instruction tuning datasets primarily originate from academic VQA datasets (e.g., VQA, AI2D, ChartQA). These datasets focus on simpler tasks and only provide short phrase-level answers without intermediate reasoning processes, limiting the model's reasoning capabilities.
- Key Challenge: While Chain-of-Thought (CoT) reasoning has shown significant efficacy in text-only LLMs, constructing large-scale multimodal CoT datasets faces two major obstacles: (1) ensuring instruction diversity and complexity, and (2) generating coherent responses with detailed justifications. Manual annotation is prohibitively expensive, and relying on closed-source models like GPT-4 involves high costs and copyright concerns.
- Design Motivation: To achieve low-cost and scalable construction of multimodal CoT datasets using open-source models, lowering the barrier to entry for the open-source community.
Method
Overall Architecture
A three-step data construction pipeline:

1. Data Collection and Categorization: Data is collected from 153 public datasets and organized into 10 major categories (General, OCR, Chart, Caption, Domain-specific, Code&Math, Language, Detection, Multi-Image, Video).
2. Instruction Data Rewriting: Open-source models rewrite short answers into detailed responses containing CoT reasoning.
3. Self-Filtering: The same MLLM acts as a judge (Model-as-Judge) to filter out hallucinated content.
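The rewrite-then-filter flow of steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Example` record, `build_instruct_set`, and the `toy_rewrite` / `toy_judge` stand-ins are all hypothetical names, and in the actual pipeline both the rewriter and the judge would be calls to an open-source model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Example:
    """Hypothetical record for one collected instruction instance."""
    image_id: str
    question: str
    answer: str
    category: str  # e.g. "Chart", "OCR", "Caption"


def build_instruct_set(
    examples: Iterable[Example],
    rewrite: Callable[[Example], Example],
    judge: Callable[[Example], bool],
) -> List[Example]:
    """Rewrite each collected example, then keep only those the judge accepts."""
    kept = []
    for ex in examples:
        candidate = rewrite(ex)   # step 2: expand the short answer into a CoT response
        if judge(candidate):      # step 3: model-as-judge filter
            kept.append(candidate)
    return kept


# Toy stand-ins (assumptions): a real pipeline would query a model here.
def toy_rewrite(ex: Example) -> Example:
    cot = f"Step 1: inspect the image. Therefore, the answer is {ex.answer}."
    return Example(ex.image_id, ex.question, cot, ex.category)


def toy_judge(ex: Example) -> bool:
    # Reject responses that admit they could not ground the answer.
    return "unknown" not in ex.answer
```

Passing the rewriter and judge as callables keeps the loop independent of any particular model API, so the same skeleton works whether the two roles are served by one model or by different ones.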
Key Designs
- Three-Tier Data Grouping:
    - Group A (58 datasets): High-quality; the original data is retained as-is.
    - Group B (60 datasets): Promising but brief responses; enhanced through rewriting.
    - Group C (35 datasets): Excessively vague or short; discarded.
- Task-Aware Rewriting Strategy: Tailored prompts are designed for each data category. For Caption-type data, a text-only LLM (Llama-3-70B) is used to generate task-oriented QA pairs. For other categories, a multimodal model (InternVL2-Llama3-76B) is used to ensure vision-language alignment.
- Data Mixing Ratio: 70% rewritten data + 30% original data. t-SNE analysis shows that the rewritten data retains the core characteristics of the original distribution while expanding the overall coverage.
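As a rough illustration of the grouping counts and the 70/30 mix, the sketch below samples from two pools to hit a target ratio. The source does not say how the mix is drawn, so uniform sampling without replacement is an assumption here, and `mix_pools` and `GROUP_SIZES` are hypothetical names.

```python
import random

# Dataset counts per group as reported above; together they cover all 153 sources.
GROUP_SIZES = {"A": 58, "B": 60, "C": 35}


def mix_pools(rewritten, original, total, rewritten_frac=0.7, seed=0):
    """Draw `total` items so that roughly `rewritten_frac` come from `rewritten`.

    Assumption: uniform sampling without replacement from each pool.
    """
    rng = random.Random(seed)
    n_rewritten = round(total * rewritten_frac)
    n_original = total - n_rewritten
    if n_rewritten > len(rewritten) or n_original > len(original):
        raise ValueError("pool too small for the requested mix")
    mixed = rng.sample(rewritten, n_rewritten) + rng.sample(original, n_original)
    rng.shuffle(mixed)  # interleave the two sources
    return mixed
```

For example, `mix_pools(rewritten, original, total=10)` returns seven rewritten items and three original ones in shuffled order, provided both pools are large enough.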
Training Configuration
Three-stage training (based on the LLaVA-OneVision architecture):

- Stage-1: Language-image alignment pre-training (558K samples, projector-only training).
- Stage-2: Single-image visual instruction tuning (10M samples, full-parameter training).
- Stage-3: One-Vision multi-image/video fine-tuning (2M samples, full-parameter training).
LLM Backbone: Qwen2.5-7B-Instruct, Vision Encoder: SigLIP-so400m-patch14-384
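The stage schedule can be written down as a small config sketch. The field names and module labels (`vision_encoder`, `projector`, `llm`) are illustrative assumptions, as is the reading of "full-parameter training" as unfreezing all three modules; only the sample counts and the projector-only first stage come from the summary above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    samples: int
    trainable: Tuple[str, ...]  # modules updated in this stage (assumed labels)


ALL_MODULES = ("vision_encoder", "projector", "llm")

STAGES = (
    Stage("stage1_alignment", 558_000, ("projector",)),
    Stage("stage2_single_image", 10_000_000, ALL_MODULES),
    Stage("stage3_onevision", 2_000_000, ALL_MODULES),
)


def instruction_tuning_samples(stages=STAGES) -> int:
    """Total samples in the stages that train all modules (stages 2 and 3)."""
    return sum(s.samples for s in stages if s.trainable == ALL_MODULES)
```

Under these numbers, stages 2 and 3 together consume 12M samples, which matches the stated size of MAmmoTH-VL-Instruct.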
Experiments
Main Results: Multi-Disciplinary Knowledge & Mathematical Reasoning
| Model | MMStar | MMMU (val) | MMMU-Pro | MathVerse | MathVista |
|---|---|---|---|---|---|
| GPT-4o | 64.7 | 69.1 | 49.7 | 50.2 | 63.8 |
| Qwen2-VL-7B | 60.7 | 52.1 | 26.9 | 28.2 | 58.2 |
| LLaVA-OV-7B | 61.7 | 48.8 | 18.7 | 26.2 | 63.2 |
| LLaVA-CoT-11B | 57.6 | 48.9 | 18.5 | 24.2 | 54.8 |
| MAmmoTH-VL-8B | 63.0 | 50.8 | 25.3 | 34.2 | 67.6 |
| Gain vs. Best Open-Source (~10B) | +1.3 | +1.9 | +7.1 | +8.1 | +4.4 |
Document and Chart Understanding
| Model | AI2D | ChartQA | DocVQA | RealWorldQA |
|---|---|---|---|---|
| LLaVA-OV-7B | 81.4 | 80.0 | 87.5 | 66.3 |
| InternVL-2-8B | 83.8 | 83.3 | 91.6 | 64.4 |
| MAmmoTH-VL-8B | 84.0 | 86.2 | 93.7 | 69.9 |
| Gain vs. Best Open-Source (~10B) | +2.4 | +2.1 | +1.6 | +0.6 |
Key Findings
- Self-Filtering is Crucial: OCR and chart data exhibit the highest hallucination filtering rates, and removing the filtering step leads to a significant drop in model performance.
- Improved Rewritten Data Quality: Rewritten data scores higher than the original data on both information richness and relevance (rated on a 5-point scale).
- Significant Data Scaling Effects: Performance consistently improves as training data increases from 2M to 10M, demonstrating the scalability of large-scale CoT data.
- Non-Reasoning Tasks Also Benefit: Improvements of up to 4% are observed even on non-reasoning benchmarks, indicating the generalization benefits of CoT training.
Highlights & Insights
- Constructs a 12M large-scale multimodal CoT dataset using only open-source models, breaking the reliance on closed-source models like GPT-4.
- The three-step pipeline (collect-rewrite-filter) is simple and scalable, with a methodology that can be replicated in other domains.
- The 8B model significantly outperforms models of similar or even larger scales on reasoning-intensive tasks (e.g., MathVerse +8.1%).
- The Model-as-Judge self-filtering method reaches a kappa of 0.64 against human evaluation, a level conventionally read as substantial agreement.
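For readers unfamiliar with the agreement statistic, the sketch below computes Cohen's kappa between a judge's keep/discard decisions and human labels. The summary only says "Kappa", so treating it as Cohen's kappa is an assumption, and the label lists are toy data, not the paper's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the raters labeled independently at their own rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data (assumption): 1 = keep, 0 = discard.
judge = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
human = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
kappa = cohens_kappa(judge, human)  # about 0.6 on this toy data
```

Here the raters agree on 8 of 10 items while chance agreement is 0.5, which shows why a kappa near 0.6 reflects noticeably stronger agreement than the raw 80% match rate alone would suggest.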
Limitations & Future Work
- Using the same generative model as a judge for self-filtering may lead to blind spots regarding its own specific error patterns.
- High hallucination rates persist in OCR and chart data during rewriting, indicating that open-source MLLMs still fall short in fine-grained visual understanding.
- Although building the data with open-source models costs less than using GPT-4, training at the 10M scale still requires substantial computational resources.
- Data collection depends on existing public datasets, which may leave new domains or tasks insufficiently covered.
Related Work
- Multimodal Instruction Tuning: LLaVA (Liu et al. 2024b) pioneered the visual instruction tuning paradigm; LLaVA-OneVision (Li et al. 2024b) extended this to multi-image and video scenarios.
- Reasoning Enhancement: Chain-of-Thought (Wei et al. 2022) introduces step-by-step reasoning; LLaVA-CoT (Xu et al. 2024a) introduces CoT in a single model but relies on GPT-4 generated data.
- Data Quality and Filtering: Cambrian (Tong et al. 2024) explores multi-source data fusion in training; InternVL2 (Chen et al. 2023b) focuses on large-scale pre-training; this work highlights the feasibility of self-filtering using open-source models.
Rating
| Metric | Score (1-10) |
|---|---|
| Novelty | 7 |
| Technical Depth | 7 |
| Experimental Thoroughness | 9 |
| Writing Quality | 8 |
| Overall | 7.5 |