ECCV 2024 Multimodal VLM Multimodal Large Language Models Pre-training Ablation Studies Vision Encoder Data Recipe Mixture-of-Experts

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training¶

Conference: ECCV 2024
arXiv: 2403.09611
Code: None (Internal Apple implementation using the AXLearn framework)
Area: Multimodal VLM
Keywords: Multimodal Large Language Models, Pre-training, Ablation Studies, Vision Encoder, Data Recipe, Mixture-of-Experts

TL;DR¶

Apple systematically ablates the three primary axes of MLLM construction (architecture, data, and training procedure) and derives key design principles: for the vision encoder, image resolution matters more than model size, which in turn matters more than training data; the choice of VL connector type has minimal impact; and a careful blend of caption, interleaved, and text-only data is crucial. The resulting MM1 model family (dense models from 3B to 30B, plus MoE variants with up to 64B total parameters) achieves state-of-the-art few-shot results in pre-training evaluations.

Background & Motivation¶

Existing MLLMs lack transparency: closed-source models (such as GPT-4V and Gemini) disclose virtually no architectural or data details, while open-source models release weights but rarely explain the design decisions behind them (i.e., why a particular architecture, data ratio, or training strategy was chosen). The authors argue that distilling reproducible design lessons holds more enduring value than any specific component implementation.

The core contribution of MM1 is not to propose a novel architecture, but rather to answer three pivotal design questions through large-scale systematic ablations:
1. How to select the vision encoder? Which is more important among resolution, model size, and pre-training objectives?
2. How to connect visual features to the LLM? Is it the connector architecture, token count, or resolution?
3. How to blend pre-training data? What are the optimal proportions for caption, interleaved, and text-only data?

Method¶

Overall Architecture¶

MM1 adopts a standard decoder-only MLLM architecture: Vision Encoder \(\rightarrow\) Vision-Language Connector (VL Connector) \(\rightarrow\) LLM Decoder. The input image is first processed by the encoder to extract features, then mapped into a sequence of visual tokens via the connector. These visual tokens are concatenated with text tokens and fed into the autoregressive LLM.
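The data flow described above can be sketched as a flattening step that turns text and images into one decoder input sequence. This is only an illustrative sketch: the `Image` class, `build_sequence`, the whitespace "tokenizer", and the placeholder slot strings are all hypothetical, and the value of 144 visual tokens per image is taken from the ablation baseline in this summary rather than being fixed by the architecture.

```python
from dataclasses import dataclass
from typing import List, Union

IMAGE_TOKENS = 144  # visual tokens per image (ablation baseline in this summary)


@dataclass
class Image:
    """Hypothetical placeholder; the real encoder and connector are not modeled."""
    name: str


Segment = Union[str, Image]


def build_sequence(segments: List[Segment]) -> List[str]:
    """Flatten text and image segments into a single decoder input sequence.

    Text is split on whitespace (a stand-in for a real tokenizer). Each image
    expands to IMAGE_TOKENS placeholder slots that the connector output would fill.
    """
    sequence: List[str] = []
    for seg in segments:
        if isinstance(seg, Image):
            sequence.extend(f"<img:{seg.name}:{i}>" for i in range(IMAGE_TOKENS))
        else:
            sequence.extend(seg.split())
    return sequence


doc = [Image("cat"), "A cat on a sofa.", Image("dog"), "A dog in a park."]
seq = build_sequence(doc)
print(len(seq))  # 2 * 144 visual slots + 10 text tokens = 298
```

The point of the sketch is that the LLM sees one ordinary token sequence; images only differ in how their embeddings are produced.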

Ablation baseline configuration:
- Vision encoder: ViT-L/14 (CLIP loss, DFN-5B + VeCap-300M, 336×336)
- VL connector: C-Abstractor, 144 image tokens
- Pre-training data: 45% caption + 45% interleaved + 10% text-only
- LLM: 1.2B transformer decoder

Key Design 1: Vision Encoder Selection¶

The authors compare two types of visual encoder pre-training objectives (using a 2.9B LLM to ensure sufficient capacity):

Contrastive Learning (CLIP-style): Trained on large-scale image-text data, exhibiting strong semantic understanding but weaker dense prediction capabilities.

Reconstructive Loss (AIM-style): Autoregressive reconstruction loss, which preserves global details of images and theoretically benefits tasks requiring fine-grained understanding such as VQA.

Encoder	Architecture	Resolution	Pre-training Data	0-shot	4-shot	8-shot
AIM	ViT/600M	224	DFN-2B	36.6	56.6	60.7
AIM	ViT/1B	224	DFN-2B	37.9	59.5	63.3
AIM	ViT/3B	224	DFN-2B	38.9	60.9	64.9
CLIP	ViT-L	224	DFN-5B+VeCap	36.9	58.7	62.2
CLIP	ViT-H	224	DFN-5B+VeCap	37.5	60.0	63.6
CLIP	ViT-L	336	DFN-5B+VeCap	39.9	62.4	66.0
CLIP	ViT-H	336	DFN-5B+VeCap	40.5	62.6	66.3
CLIP	ViT-H	378	DFN-5B	40.9	62.5	66.4

Encoder Rule: Image resolution has the largest impact (approx. +3% from 224 to 336), followed by model size (usually under 1% from ViT-L to ViT-H) and training data (adding VeCap synthetic captions adds 1% to 2% in few-shot settings). When other variables are controlled, the performance gap between CLIP and AIM is marginal, with CLIP slightly ahead overall.
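The resolution and model-size effects can be read directly off the CLIP rows of the table. The short script below simply subtracts the reported scores; the dictionary layout and the `delta` helper are illustrative names, not anything from the paper.

```python
# (encoder, resolution) -> (0-shot, 4-shot, 8-shot), copied from the table above
scores = {
    ("CLIP ViT-L", 224): (36.9, 58.7, 62.2),
    ("CLIP ViT-H", 224): (37.5, 60.0, 63.6),
    ("CLIP ViT-L", 336): (39.9, 62.4, 66.0),
    ("CLIP ViT-H", 336): (40.5, 62.6, 66.3),
}


def delta(before, after):
    """Per-setting score difference (after minus before), rounded to 0.1."""
    return tuple(round(b - a, 1) for a, b in zip(scores[before], scores[after]))


resolution_gain = delta(("CLIP ViT-L", 224), ("CLIP ViT-L", 336))
size_gain_224 = delta(("CLIP ViT-L", 224), ("CLIP ViT-H", 224))
size_gain_336 = delta(("CLIP ViT-L", 336), ("CLIP ViT-H", 336))

print("224 -> 336 (ViT-L):    ", resolution_gain)  # (3.0, 3.7, 3.8)
print("ViT-L -> ViT-H @ 224:  ", size_gain_224)    # (0.6, 1.3, 1.4)
print("ViT-L -> ViT-H @ 336:  ", size_gain_336)    # (0.6, 0.2, 0.3)
```

Note that at 224 px the few-shot gain from the larger encoder is slightly above one point, so the "under 1%" figure holds most clearly at 336 px; in both cases it is well below the gain from higher resolution.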

Key Design 2: Vision-Language Connector¶

The authors compare three VL connector architectures:

Average Pooling: \(n \times n\) average pooling + linear projection (similar to Emu2)
Attention Pooling: Uses \(k\) learnable queries for cross-attention
C-Abstractor: Convolutional mapping based on ResNet blocks (proposed by Honeybee), preserving local information + adaptive pooling

Complete ablations across four settings (combining 64 and 144 tokens, with 224 and 336 resolutions) demonstrate:

Connector Rule: The number of visual tokens and image resolution are of primary importance, whereas the type of connector architecture has negligible impact. All three architectures perform almost identically in the 336px/144-token setting. This contradicts the findings of the Honeybee paper, suggesting that differences between connector architectures wash out as training scales up. C-Abstractor is ultimately selected only because it performs slightly better in specific settings.
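To make the token-count arithmetic concrete, here is a minimal pure-Python sketch of the simplest connector, average pooling, applied to a patch grid. It assumes a ViT with patch size 14 (as in ViT-L/14) and assumes the 144-token setting at 336 px corresponds to 2×2 pooling of the 24×24 patch grid; the linear projection into the LLM embedding space is omitted, and all function names are hypothetical.

```python
from typing import List

Vector = List[float]


def patches_per_side(resolution: int, patch_size: int = 14) -> int:
    """Number of ViT patches along one image side."""
    return resolution // patch_size


def average_pool(grid: List[List[Vector]], window: int) -> List[Vector]:
    """Average non-overlapping window x window blocks of patch features."""
    side = len(grid)
    dim = len(grid[0][0])
    tokens: List[Vector] = []
    for r in range(0, side, window):
        for c in range(0, side, window):
            block = [grid[r + i][c + j] for i in range(window) for j in range(window)]
            tokens.append([sum(v[d] for v in block) / len(block) for d in range(dim)])
    return tokens


side = patches_per_side(336)  # 336 / 14 = 24 patches per side
# Toy 2-dimensional "features": each patch stores its own (row, col) index.
grid = [[[float(r), float(c)] for c in range(side)] for r in range(side)]
tokens = average_pool(grid, window=2)
print(side * side, "->", len(tokens))  # 576 -> 144
```

Attention pooling and C-Abstractor reach the same token count by different means, which is consistent with the finding that the count, not the mechanism, drives performance.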

Key Design 3: Pre-training Data Recipe¶

Utilizing three types of data, the authors summarize four key data principles:

Data Type	Data Source	Scale
Captioned Images	CC3M, CC12M, HQIPT-204M, COYO, Web Image-Text-1B	2 billion image-text pairs
Synthetic Captions	VeCap	300 million image-text pairs
Interleaved Image-Text	OBELICS + Internal Data	600 million documents
Text-only	Web pages, code, social media, encyclopedias, math	2T tokens

Data Rule 1: Interleaved data is "indispensable" for few-shot and text-only performance, whereas caption data improves zero-shot performance. Increasing the proportion of captions from 0% to 100% improves zero-shot performance from 25.8% to 39.3%; however, keeping at least 50% interleaved data is needed to hold 8-shot performance above 61%, and without interleaved data it drops to about 45%.

Data Rule 2: Text-only data assists few-shot learning and text understanding. Combining caption and text-only data significantly improves few-shot performance, while combining interleaved and text-only data causes a minor drop in multimodal performance; in both cases, text-only capability improves.

Data Rule 3: Meticulous blending can balance multimodal and text performance. The optimal ratio is determined to be caption:interleaved:text = 45:45:10 (approximately 5:5:1).
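One simple way to realize a 45:45:10 blend is weighted sampling of the data source for each training example. The sketch below is an assumption about how such a mixture could be implemented, not the authors' data loader; in particular, whether the ratio applies per sequence, per token, or per batch is not stated in this summary, and the `MIXTURE` and `sample_sources` names are hypothetical.

```python
import random
from collections import Counter

# Mixture weights from Data Rule 3 (caption : interleaved : text-only).
MIXTURE = {"caption": 0.45, "interleaved": 0.45, "text_only": 0.10}


def sample_sources(n: int, seed: int = 0) -> Counter:
    """Draw the data source for n training examples according to MIXTURE."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n))


n_examples = 100_000
counts = sample_sources(n_examples)
for name, target in MIXTURE.items():
    print(f"{name:12s} target={target:.2f} observed={counts[name] / n_examples:.3f}")
```

With enough samples the observed fractions converge to the target ratio, which is all the recipe requires at this level of abstraction.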

Data Rule 4: Synthetic caption data (VeCap), although amounting to only about 7% of all caption data, contributes significantly to few-shot performance (+2.4% to 4%).

Loss & Training¶

Pre-training Configuration:
- All parameters are unfrozen (vision encoder + LLM)
- Sequence length of 4096, up to 16 images per sequence (\(378 \times 378\)), batch size 512
- Trained for 200k steps (approx. 400B tokens, consistent with \(512 \times 4096\) tokens per step)
- Learning rate determined via grid search on smaller models (9M \(\rightarrow\) 1.2B) and extrapolated to larger models via linear regression in log space
- Extrapolation formula: \(\eta = \exp(-0.4214 \ln(N) - 0.5535)\), where \(N\) is the number of (non-embedding) parameters
- Weight decay: \(\lambda = 0.1\eta\)
- Cosine decay with a 2,000-step warmup, decaying to 10% of the peak value
- Final LR used for 3B/7B/30B: 6e-5 / 4e-5 / 2e-5
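The learning-rate recipe above is easy to reproduce numerically. The sketch below evaluates the extrapolation formula and a warmup-plus-cosine schedule. Two things are assumptions: the warmup is taken to be linear (the summary only gives its length), and the nominal sizes 3e9, 7e9, and 3e10 are plugged in for \(N\) even though the exact parameter counts may differ. Function names are hypothetical.

```python
import math


def peak_lr(num_params: float) -> float:
    """Extrapolated peak LR: eta = exp(-0.4214 * ln(N) - 0.5535)."""
    return math.exp(-0.4214 * math.log(num_params) - 0.5535)


def weight_decay(num_params: float) -> float:
    """Weight decay tied to the peak LR: lambda = 0.1 * eta."""
    return 0.1 * peak_lr(num_params)


def lr_at_step(step: int, peak: float, warmup_steps: int = 2_000,
               total_steps: int = 200_000, final_ratio: float = 0.1) -> float:
    """Linear warmup (assumed), then cosine decay to final_ratio * peak."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak * (final_ratio + (1.0 - final_ratio) * cosine)


for n in (3e9, 7e9, 3e10):
    # One significant figure reproduces the reported 6e-5 / 4e-5 / 2e-5.
    print(f"N={n:.0e}  peak_lr={peak_lr(n):.0e}  weight_decay={weight_decay(n):.1e}")
```

Rounded to one significant figure, the formula gives 6e-5, 4e-5, and 2e-5 for the three sizes, matching the final learning rates listed above.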

MoE Scaling:
- 3B-MoE: 64 experts, replacing a dense layer every 2 layers, with 64B total parameters
- 7B-MoE: 32 experts, replacing a dense layer every 4 layers, with 47B total parameters
- Top-2 gating + load balance loss (0.01) + router z-loss (0.001)
- Only FFN layers in the LLM decoder are replaced, leaving the vision encoder and VL connector unchanged
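The routing components listed above can be sketched in a few lines of plain Python. The summary gives only the loss coefficients (0.01 and 0.001), so the exact loss definitions below are assumptions: the load balance term follows the common "number of experts times the dot product of routed fraction and mean router probability" form, and the z-loss is the mean squared log-sum-exp of the router logits. All function names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_gate(logits: List[float]) -> List[Tuple[int, float]]:
    """Pick the two highest-probability experts; renormalize gates to sum to 1."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def router_z_loss(batch_logits: List[List[float]]) -> float:
    """Mean squared log-sum-exp of router logits (assumed formulation)."""
    def logsumexp(logits: List[float]) -> float:
        m = max(logits)
        return m + math.log(sum(math.exp(x - m) for x in logits))
    return sum(logsumexp(l) ** 2 for l in batch_logits) / len(batch_logits)


def load_balance_loss(batch_logits: List[List[float]]) -> float:
    """num_experts * sum_i(routed fraction_i * mean router prob_i) (assumed form)."""
    n_tokens, n_experts = len(batch_logits), len(batch_logits[0])
    routed = [0.0] * n_experts
    mean_prob = [0.0] * n_experts
    for logits in batch_logits:
        for i, _ in top2_gate(logits):
            routed[i] += 1.0 / (2 * n_tokens)  # each token fills two expert slots
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / n_tokens
    return n_experts * sum(f * p for f, p in zip(routed, mean_prob))


def auxiliary_loss(batch_logits: List[List[float]]) -> float:
    """Weighted sum using the coefficients reported in the summary."""
    return 0.01 * load_balance_loss(batch_logits) + 0.001 * router_z_loss(batch_logits)


gates = top2_gate([2.0, 1.0, 0.0, -1.0])
print(gates)  # experts 0 and 1, with gate weights summing to 1
```

In a full MoE layer, each token's output would be the gate-weighted sum of the two selected experts' FFN outputs; that part is omitted here.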

SFT Configuration:
- Approximately 1.45 million samples (LLaVA-Conv/Complex, ShareGPT-4V, academic VL datasets, etc.)
- 10k steps, batch size 256, sequence length 2048
- AdaFactor, LR 1e-5, cosine decay
- Both the vision encoder and LLM are unfrozen
- High-resolution: Position embedding interpolation + sub-image decomposition (SPHINX approach)
- Default SFT resolution is \(1344 \times 1344\) (five \(672 \times 672\) sub-images, totaling 720 tokens/image)
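The 720-token figure follows from the sub-image arithmetic. The sketch below assumes the five views are four non-overlapping \(672 \times 672\) crops plus one downscaled overview of the full image, each mapped to 144 tokens; this split is an interpretation of "five sub-images", and the helper names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def tile_boxes(image_size: int = 1344, tile_size: int = 672) -> List[Box]:
    """Non-overlapping square crops covering a square image."""
    if image_size % tile_size != 0:
        raise ValueError("image_size must be a multiple of tile_size")
    offsets = range(0, image_size, tile_size)
    return [(x, y, x + tile_size, y + tile_size) for y in offsets for x in offsets]


def tokens_per_image(image_size: int = 1344, tile_size: int = 672,
                     tokens_per_view: int = 144) -> int:
    """Crops plus one downscaled overview (assumed), each at tokens_per_view."""
    views = len(tile_boxes(image_size, tile_size)) + 1
    return views * tokens_per_view


print(tile_boxes())        # four 672 x 672 crops
print(tokens_per_image())  # (4 + 1) * 144 = 720
```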

Key Experimental Results¶

Main Pre-training Results (Few-shot SOTA)¶

Model	Shot	COCO	NoCaps	TextCaps	VQAv2	TextVQA	VizWiz	OKVQA
Flamingo-3B	0	73.0	–	–	49.2	30.1	28.9	41.2
Flamingo-3B	8	90.6	–	–	55.4	32.4	38.4	44.6
MM1-3B	0	73.5	55.6	63.3	46.2	29.4	15.6	26.1
MM1-3B	8	114.6	104.7	88.8	63.6	44.6	46.4	48.4
Flamingo-9B	8	99.0	–	–	58.0	33.6	39.4	50.0
Emu2-14B	8	–	–	–	59.0	–	43.9	–
MM1-7B	8	116.3	106.6	88.2	63.6	46.3	45.3	51.4
Flamingo-80B	8	108.8	–	–	65.6	37.3	44.8	57.5
IDEFICS-80B	8	114.3	105.7	77.6	64.8	35.7	46.1	55.1
Emu2-37B	8	–	–	–	67.8	49.3	54.7	54.1
MM1-30B	8	123.1	111.6	92.9	70.9	49.4	49.9	58.3

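In the 8-shot setting, MM1 leads nearly every comparison in the table. Notably, MM1-30B outperforms Flamingo-80B (which has about 2.7 times as many parameters) and IDEFICS-80B on every reported benchmark. The main exception is VizWiz, where Emu2-37B remains ahead (54.7 vs. 49.9); in the zero-shot setting, Flamingo-3B also beats MM1-3B on the VQA benchmarks.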

Main SFT Results (12 Benchmarks)¶

Model	VQAv2	TextVQA	SQA-I	MMMU (v/t)	MathV	MME-P	MME-C	MMB	SEED	POPE	LLaVA-W	MM-Vet
MM1-3B-Chat	82.0	71.9	69.4	33.9/33.7	32.0	1482	279	67.8	63.0/68.8	87.4	72.1	43.7
MM1-3B-MoE	82.5	72.9	76.1	38.6/35.7	32.6	1469	303	70.8	63.9/69.4	87.6	76.8	42.2
LLaVA-1.5-7B	78.5	58.2	66.8	–/–	–	1511	316	64.3	58.6/66.1	85.9	63.4	31.1
LLaVA-NeXT-7B	81.8	64.9	70.1	35.8/–	34.6	1519	332	67.4	–/70.2	86.5	81.6	43.9
MM1-7B-Chat	82.8	72.8	72.6	37.0/35.6	35.9	1529	329	72.3	64.0/69.9	86.6	81.5	42.1
MM1-7B-MoE	83.4	73.8	74.4	40.9/37.9	40.9	1597	395	72.7	65.5/70.9	87.8	84.7	45.2
MM1-30B-Chat	83.7	73.5	81.0	44.7/40.3	39.4	1638	431	75.1	65.9/72.1	87.6	89.3	48.7

MoE models outperform their corresponding dense models across nearly all benchmarks, demonstrating the substantial potential of MoE in MLLMs.

Key Findings¶

Resolution > Model Size > Training Data: The selection hierarchy for the encoder is clear.
Connector types are largely inconsequential: Average Pooling, Attention Pooling, and C-Abstractor exhibit nearly identical behavior.
Token count is significant: Moving from 64 to 144 tokens yields a substantial performance improvement.
Interleaved data acts as the primary driver of few-shot capabilities: Its structural format is naturally congruent with few-shot inputs.
Pre-training duration directly affects SFT performance: More pre-training steps lead to superior downstream performance (Figure 7c).
SFT experiences diminishing returns at extremely high resolutions: \(1344\text{px}\) is optimal, while \(1792\text{px}\) yields a slight degradation.
Pre-training principles transfer to SFT: Caption-only pre-training enhances zero-shot capabilities in SFT, while connector-related architectural discrepancies disappear post-SFT as well.
Few-shot capability is preserved post-SFT: MM1-30B-Chat on MathVista achieves 0-shot 39.4 \(\rightarrow\) 4-shot 41.9 \(\rightarrow\) 8-shot (mixed resolution) 44.4.

Highlights & Insights¶

The value of systematic ablations: The primary contribution of MM1 is not the model itself but rather its reproducible design principles. Such "recipe papers" offer more enduring value to the community than individual state-of-the-art weights.
The finding that connector type matters little: This challenges prior research that focuses heavily on VL connector design (e.g., Honeybee, BLIP-2) and suggests that simple projection methods suffice when training scales up.
Careful data recipe balance: The 45:45:10 ratio is not an arbitrary choice but a balance validated by systematic ablations. This suggests that the importance of data engineering in multimodal pre-training may be underestimated.
Efficient scaling of MoE: 3B-MoE (64B total parameters) matches or exceeds the 7B dense model on several benchmarks (e.g., SQA-I and MMMU) while activating only a fraction of its parameters per token, outlining a promising direction for efficient MLLM deployment.
Mixed-resolution few-shot strategy: Since sub-image decomposition results in prohibitive token counts in few-shot settings, the proposed mixed-resolution strategy (only keeping the final \(N\) samples at high resolution) is a highly practical engineering innovation.
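A rough token budget shows why the mixed-resolution strategy matters. The sketch below assumes 720 visual tokens for a high-resolution image (from the SFT configuration) and 144 for a low-resolution one (a single view without decomposition, which is an assumption), and it counts the query image among the "final" high-resolution images. The function name is hypothetical.

```python
HIGH_RES_TOKENS = 720  # per image with sub-image decomposition (SFT configuration)
LOW_RES_TOKENS = 144   # assumed: one view per image, no decomposition


def image_token_budget(num_shots: int, high_res_last: int) -> int:
    """Visual tokens for num_shots in-context images plus one query image.

    Only the last `high_res_last` images (counting the query) use high resolution.
    """
    total_images = num_shots + 1
    high = min(high_res_last, total_images)
    return high * HIGH_RES_TOKENS + (total_images - high) * LOW_RES_TOKENS


all_high = image_token_budget(num_shots=8, high_res_last=9)
mixed = image_token_budget(num_shots=8, high_res_last=1)
print(all_high, mixed)  # 6480 vs. 1872 visual tokens
```

Under these assumptions, 8-shot prompting with every image at high resolution needs 6480 visual tokens, more than the 4096-token pre-training sequence length, while keeping only the last image at high resolution needs 1872.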

Limitations & Future Work¶

Bounded ablation capacity: Component-level ablations use modest 1.2B/2.9B LLMs, and data ablations are restricted to 200k steps; whether these principles still hold strictly at massive scales remains unverified.
Imperfect comparison fairness for encoders: AIM was trained on less than half of CLIP's data volume, suggesting the conclusions should be cautiously interpreted.
Dependency on closed-source datasets: Web Image-Text-1B and proprietary interleaved data are not reproducible for the broader community.
Absence of video and audio modalities: The model is restricted to image-text modalities, omitting more general-purpose multimodal scenarios.
Conventional SFT data recipe: The model mostly follows the SFT recipe established by LLaVA-1.5, leaving systematic SFT data ablations largely unexplored.
Strict token limit per image: Pre-training uses only 144 tokens per image, and even with sub-image decomposition SFT uses 720 tokens per image (compared to 2880 tokens in LLaVA-NeXT), which may constrain fine-grained perception and comprehension.
Lack of RLHF/DPO explorations: The post-training relies solely on the SFT stage, without incorporating preference alignment techniques.

Related Work¶

Flamingo: The pioneering large-scale interleaved pre-trained MLLM which serves as the direct baseline and inspiration for MM1.
IDEFICS/OpenFlamingo: Open-source replication efforts of Flamingo, but lacking systematic architectural or component ablations.
LLaVA-1.5/NeXT: The primary sources for SFT recipes; the SFT data mixture in MM1 directly adapts elements from these models.
VILA: Also investigates pre-training components, but provides fewer details on optimization and pre-training evaluations.
Emu2: Discloses pre-training details but lacks controlled ablations. MM1 outperforms Emu2 on most few-shot benchmarks, although Emu2-37B remains ahead on VizWiz.
Honeybee: Introduces C-Abstractor, yet MM1 demonstrates that its advantages fade at larger training scales.

Takeaway for future investigation: MM1's ablation methodology offers a reusable template for scaling up: tune component choices at small scale, then extrapolate to larger models. The learning rate extrapolation formula in particular is practical and directly reusable.

Rating¶

Dimension	Score (1-5)	Explanation
Novelty	3.5	No architectural novelty, but the systematic ablation methodology and design guidelines are highly valuable
Experimental Thoroughness	5	Extremely thorough, covering three-axis ablations, multi-scale scale-up validation, and post-SFT transfer analysis
Engineering Value	5	The data recipe, learning rate extrapolation formula, and mixed-resolution strategies are highly actionable
Writing Quality	4.5	Well-structured, clear summarization of rules, and highly informative appendix
Overall Recommendation	⭐⭐⭐⭐⭐	A must-read recipe paper in the MLLM domain; the referential value of the ablation findings far exceeds the weights of the model itself