NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

Conference: NeurIPS 2025 · arXiv: 2510.08565 · Code: GitHub · Area: Multimodal VLM · Keywords: Native MLLM, Scaling Law, Visual Encoder, Mixture-of-Experts, End-to-End Training

TL;DR

This paper systematically investigates the design space and scaling properties of native multimodal large language models (native MLLMs) under data constraints. It finds that the optimal visual encoder size scales log-linearly with the LLM size, and based on this finding proposes NaViL, which achieves performance competitive with state-of-the-art MLLMs using only approximately 600 million pre-training image-text pairs.

Background & Motivation

Background: The dominant MLLM paradigm adopts compositional training — independently pre-training a visual encoder and an LLM, then aligning them through multimodal fine-tuning.

Limitations of Prior Work: Compositional training makes it difficult to explore the joint scaling properties of the visual and language components, and vision-language alignment is bottlenecked because the two components are pre-trained separately.

Key Challenge: Native MLLMs (end-to-end training) exhibit greater potential for favorable scaling laws, yet existing studies evaluate them primarily under the assumption of unlimited resources, leaving their practical feasibility under data- and compute-constrained settings largely unexplored.

Goal: Can native MLLMs match or even surpass the performance ceiling of top-tier compositional MLLMs under realistic data constraints?

Key Insight: Systematically explore key architectural choices for native MLLMs (LLM initialization, MoE, visual encoder design) and the joint scaling laws of the vision-language components.

Core Idea: The optimal parameter count of the visual encoder grows log-linearly with the LLM parameter count; the two must be scaled jointly to achieve optimal performance.

Method

Overall Architecture

NaViL is an end-to-end trained native MLLM composed of three components: a visual encoder \(\mathcal{V}_{d,w}\), an MLP connector \(\mathcal{C}\), and an MoE-augmented LLM. It supports arbitrary-resolution inputs and employs Visual Multi-scale Packing to enhance inference performance.

Key Designs

  1. LLM Initialization: Language parameters are initialized from a pre-trained LLM (InternLM2-Base) rather than trained from scratch. Experiments show that initialized models converge more than 10× faster than their from-scratch counterparts and exhibit significantly superior zero-shot image captioning performance. This is attributed to the substantially lower textual diversity and quality of multimodal training corpora compared to pure language pre-training data.

  2. Modality-Specific MoE: Modality-specific attention experts and FFN experts are introduced at every LLM layer. Separate projection matrices \(W_Q^m, W_K^m, W_V^m, W_O^m\) process visual and textual features, while attention is still computed globally over the full mixed sequence:

     \[x_{i,m}^{l'} = x_{i,m}^{l-1} + \text{MHA-MMoE}(\text{RMSNorm}(x_{i,m}^{l-1}))\]
     \[x_{i,m}^{l} = x_{i,m}^{l'} + \text{FFN-MMoE}(\text{RMSNorm}(x_{i,m}^{l'}))\]

     MoE enables the model to reach the same validation loss with only 1/10 of the data, without increasing training or inference cost (a minimal routing sketch appears after this list).

  3. Visual Encoder Architecture Search: Under a fixed parameter budget, the optimal combination of depth \(d\) and width \(w\) is explored (parameter count \(\mathcal{N} = 12 \times d \times w^2\)). Experiments show that extremely deep or extremely wide configurations perform poorly, while moderate configurations differ little once the data scale is large (see the worked budget example after this list).

  4. Visual Multi-scale Packing: At inference time, the input image is progressively downsampled at rate \(\tau\) to generate a multi-scale image sequence \(\{I_i\}_{i=0}^n\). Each scale is processed by the visual encoder independently, and the resulting token sequences are concatenated before being fed into the LLM, with a special `<end_of_scale>` token separating scales (a packing sketch follows this list).
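A minimal sketch of the modality-specific routing in point 2, assuming a boolean mask that marks visual tokens; the module and argument names here are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn

class ModalityMoEFFN(nn.Module):
    """Hard-routed FFN experts: visual tokens go to the visual expert,
    text tokens to the text expert. Illustrative sketch, not NaViL's code."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.text_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.visual_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor, is_visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_visual: (batch, seq) boolean mask
        out = torch.empty_like(x)
        out[is_visual] = self.visual_expert(x[is_visual])
        out[~is_visual] = self.text_expert(x[~is_visual])
        return out
```

The attention counterpart routes only the \(W_Q^m, W_K^m, W_V^m, W_O^m\) projections in the same way, while the softmax attention itself still runs over the full mixed token sequence.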
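For the parameter-count formula in point 3, a quick worked example of the depth/width trade-off: two very different shapes can sit at the same budget (values chosen for illustration, not taken from the paper):

```latex
\mathcal{N} = 12\,d\,w^2:\qquad
d = 24,\; w = 1024 \;\Rightarrow\; 12 \cdot 24 \cdot 1024^2 \approx 302\text{M};\qquad
d = 6,\; w = 2048 \;\Rightarrow\; 12 \cdot 6 \cdot 2048^2 \approx 302\text{M}
```

The architecture search asks which of these equal-budget shapes trains best.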
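And a sketch of the multi-scale packing in point 4; `encode_image` and `eos_embed` stand in for the actual visual encoder and the learned `<end_of_scale>` embedding, and \(\tau = 0.5\) is only an assumed rate:

```python
import torch
import torch.nn.functional as F

def multiscale_pack(image, encode_image, eos_embed,
                    tau: float = 0.5, n_scales: int = 3):
    """Encode an image at n_scales resolutions (each downsampled by rate tau)
    and concatenate the token sequences, separated by <end_of_scale> embeddings.
    Illustrative sketch: encode_image maps (1, 3, H, W) -> (num_tokens, d_model)."""
    packed = []
    current = image  # (1, 3, H, W), full resolution first
    for _ in range(n_scales):
        packed.append(encode_image(current))   # (num_tokens, d_model)
        packed.append(eos_embed.unsqueeze(0))  # (1, d_model) scale separator
        h, w = current.shape[-2:]
        current = F.interpolate(current, size=(int(h * tau), int(w * tau)),
                                mode="bilinear", align_corners=False)
    return torch.cat(packed, dim=0)  # one visual sequence for the LLM
```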

Scaling Property Findings

  • Scaling LLM Independently: Validation loss decreases log-linearly as LLM parameter count increases, consistent with conventional language scaling laws.
  • Scaling Visual Encoder Independently: Performance gains diminish — when the LLM is fixed, increasing the visual encoder beyond a threshold yields negligible benefit, indicating that the performance ceiling is constrained by LLM capacity.
  • Joint Scaling: The logarithm of the optimal visual encoder size grows linearly with the logarithm of the LLM size, necessitating joint scaling. This stands in contrast to the compositional approach of pairing LLMs of any size with a fixed-size visual encoder (the functional form is written out below).
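Written as a functional form (the coefficients here are placeholders, not the paper's fitted values):

```latex
\log N_V^{*} \;=\; \alpha \,\log N_{\text{LLM}} \;+\; \beta,
\qquad \alpha > 0
% equivalently a power law in the LLM size
```

Equivalently, the optimal encoder size follows a power law \(N_V^{*} \propto N_{\text{LLM}}^{\alpha}\), so scaling up the LLM calls for a predictable multiplicative increase in the encoder.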

Loss & Training

  • Stage 1 — Multimodal Generative Pre-training: 500 million image-text pairs (300 million web-crawled + 200 million synthetic captions) are used for training with only the visual parameters updated; subsequently, the text parameters of the attention layers are also unfrozen and training continues on 185 million high-quality samples.
  • Stage 2 — Supervised Fine-tuning: All parameters are unfrozen and fine-tuned on 68 million high-quality multimodal samples (a schedule sketch follows).
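A sketch of the resulting schedule as a plain config, reconstructed from the text above; the stage and field names are this note's own, not the authors' configuration format:

```python
# Illustrative NaViL training schedule; sample counts follow the description above.
TRAINING_SCHEDULE = [
    {"stage": "pretrain",    "samples": 500_000_000,  # 300M web + 200M synthetic
     "trainable": ["visual_parameters"]},             # LLM text parameters frozen
    {"stage": "pretrain_hq", "samples": 185_000_000,  # high-quality subset
     "trainable": ["visual_parameters", "attention_text_parameters"]},
    {"stage": "sft",         "samples": 68_000_000,   # supervised fine-tuning
     "trainable": ["all_parameters"]},
]
```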

Key Experimental Results

Main Results

| Model | #Active Params | Avg | MMVet | MMMU | MMB | MME | MathVista | OCRBench | CCB |
|---|---|---|---|---|---|---|---|---|---|
| InternVL-2.5-2B (Compositional) | 2.2B | 67.0 | 60.8 | 43.6 | 74.7 | 2138 | 51.3 | 804 | 81.7 |
| Mono-InternVL (Native) | 1.8B | 56.4 | 40.1 | 33.7 | 65.5 | 1875 | 45.7 | 767 | 66.3 |
| NaViL-2B (Native) | 2.4B | 67.1 | 78.3 | 41.8 | 71.2 | 1822 | 50.0 | 796 | 83.9 |

NaViL-2B matches the compositional baseline InternVL-2.5-2B on the benchmark average (67.1 vs. 67.0), surpasses it on MMVet and CCB, and substantially outperforms all existing native MLLMs.

Ablation Study

| Design Choice | Effect |
|---|---|
| With vs. without LLM initialization | Initialized version converges 10×+ faster with significantly better zero-shot captioning |
| With vs. without MoE | MoE version reaches the same loss with only 1/10 of the data |
| Visual encoder \(d\) = 3/6/12/24/48 | Extreme configurations underperform; moderate configurations show minimal differences |

Key Findings

  • Native MLLMs achieve performance competitive with top-tier compositional MLLMs for the first time at the 2B parameter scale.
  • The optimal visual encoder size grows log-linearly with LLM size.
  • MoE architecture is critical for handling heterogeneous multimodal data.

Highlights & Insights

  • First Systematic Study: Provides a comprehensive exploration of the design space and scaling properties of native MLLMs under data constraints.
  • High Practicality: A highly competitive native MLLM can be trained with only approximately 600 million pre-training samples.
  • Novel Scaling Law: The identified optimal joint scaling relationship between the visual encoder and the LLM offers important guidance for native MLLM design.
  • Modality-Specific MoE: Introducing attention experts in addition to FFN experts addresses the inter-modal feature scale discrepancy problem.

Limitations & Future Work

  • Only the image modality is explored; extension to video, audio, and other modalities remains unaddressed.
  • The validation of scaling laws is limited in scope (LLM up to 7B); whether the findings hold at larger scales is yet to be confirmed.
  • The visual encoder architecture directly reuses Transformer layers from the LLM; vision-specific architectural designs are not explored.
  • Visual Multi-scale Packing introduces additional computational overhead at inference time.

Related Work Comparison

  • Chameleon: A native MLLM trained from scratch; its substantially inferior performance compared to NaViL underscores the importance of LLM initialization.
  • Mono-InternVL: The first native MLLM to introduce modality-specific MoE; NaViL extends this by additionally incorporating attention experts.
  • Compositional MLLMs (InternVL, Qwen2VL): The strategy of using a fixed-size visual encoder is shown to be suboptimal.

Rating

  • Novelty: ⭐⭐⭐⭐ Systematic investigation of native MLLM scaling laws is a novel contribution, though the individual architectural components are not entirely new.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 14 benchmarks, extensive ablation studies, and comprehensive scaling analyses.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure, rich figures, and systematic analysis.
  • Value: ⭐⭐⭐⭐⭐ Provides important practical guidance and theoretical insights for native MLLM design.