Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
Conference: ICML 2025
arXiv: 2505.05464
Code: https://github.com/shiqichen17/VLM_Merging
Area: Multimodal VLM
Keywords: Model Merging, VLM Reasoning, Perception and Reasoning Decoupling, Cross-modal Transfer, Layer-wise Analysis
TL;DR
By directly performing weighted averaging of the parameters of a mathematical reasoning LLM and the text components of a VLM (model merging), the paper transfers reasoning capability to the VLM without any training. It also finds a layer-wise pattern: perception capabilities are concentrated in the early layers, while reasoning capabilities are concentrated in the middle and late layers.
Background & Motivation
Background
Background: VLMs perform exceptionally well on visual perception + language tasks. However, they lag far behind text-only LLMs in complex multimodal reasoning (such as math chart interpretation), partly due to the scarcity of multimodal reasoning data.
Limitations of Prior Work: Enhancing the reasoning capabilities of VLMs usually requires collecting a massive amount of multimodal reasoning data and fine-tuning, which is highly expensive.
Key Challenge: Can perception and reasoning capabilities be decoupled? Can reasoning capabilities be directly transferred from LLMs to VLMs?
Goal: To explore model merging as a pathway for cross-modal capability transfer.
Key Insight: The text components of VLMs share the same architecture and initialization as LLMs, satisfying the connected subspace hypothesis of model merging.
Core Idea: Perform weighted averaging of the text parameters of the VLM and the mathematical LLM to achieve zero-training reasoning capability transfer.
Method
Overall Architecture
- Select a VLM (e.g., LLaVA-NeXT/8B) and a mathematical LLM (e.g., Dart-Math) that share a base model.
- Calculate the task vector: \(\tau = W_{\text{math}} - W_{\text{base}}\)
- Merge: \(W_{\text{merged}} = W_{\text{VLM}} + \alpha \cdot \tau\)
- Directly deploy the merged model without any training.
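
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `merge_task_vector`, the `text_keys` argument, and the parameter names in the toy example are hypothetical. Weights are stored in plain dictionaries, and the values can be any type that supports `+`, `-`, and scalar multiplication (for example NumPy arrays or PyTorch tensors); plain floats are used here to keep the example dependency-free.

```python
from typing import Dict, Iterable, Optional


def merge_task_vector(
    vlm: Dict[str, float],
    base: Dict[str, float],
    math_llm: Dict[str, float],
    alpha: float,
    text_keys: Optional[Iterable[str]] = None,
) -> Dict[str, float]:
    """Return W_VLM + alpha * (W_math - W_base) for the selected parameters.

    Parameters that are not selected, or that do not exist in the base and
    math models (e.g. a vision encoder), are copied from the VLM unchanged.
    """
    keys = set(vlm) if text_keys is None else set(text_keys)
    merged = {}
    for name, w_vlm in vlm.items():
        if name in keys and name in base and name in math_llm:
            tau = math_llm[name] - base[name]   # task vector of the math LLM
            merged[name] = w_vlm + alpha * tau  # shift the VLM text weights
        else:
            merged[name] = w_vlm                # left untouched
    return merged


# Toy example with scalar "weights" (hypothetical parameter names).
base = {"layers.0.weight": 1.0, "layers.1.weight": 2.0}
math_llm = {"layers.0.weight": 1.5, "layers.1.weight": 1.0}
vlm = {"layers.0.weight": 1.25, "layers.1.weight": 2.5, "vision.weight": 9.0}

merged = merge_task_vector(vlm, base, math_llm, alpha=0.5)
print(merged)
```

With real checkpoints the same loop would run over the models' state dictionaries, restricted to the language-model parameters that the VLM shares with the base LLM.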
Key Designs
- Cross-modal Model Merging:
    - Function: Adds the task vector of the mathematical LLM to the text parameters of the VLM.
    - Mechanism: Since the VLM and LLM are fine-tuned from the same base model, their parameter spaces are connected. Weighted averaging can transfer capabilities.
    - Design Motivation: Reasoning ability should be encoded in the text processing layers. Keeping the visual encoder unchanged preserves perception capabilities.
- Layer-wise Capability Analysis (Knockout Analysis):
    - Function: Progressively masks merged parameters layer-by-layer to observe changes in perception and reasoning.
    - Key Findings: (a) Perception capabilities are concentrated in the early layers; (b) reasoning capabilities are concentrated in the middle and late layers; (c) after merging, reasoning capabilities expand to all layers while the distribution of perception remains unchanged.
    - Design Motivation: To understand how merging affects different capabilities within the parameter space.
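
One plausible reading of the layer-wise knockout is sketched below: for a chosen set of layers, the merged weights are reverted to the original VLM weights, and the resulting model is then re-evaluated on perception and reasoning benchmarks. This is an assumption about the procedure, not a description taken from the paper. The helper names `layer_index` and `knockout_layers` are hypothetical, and the `layers.<i>.` naming pattern is assumed to follow the common Llama-style convention.

```python
import re
from typing import Dict, Iterable, Optional

# Assumed naming convention: "...layers.<index>.<rest>" (Llama-style).
_LAYER_RE = re.compile(r"layers\.(\d+)\.")


def layer_index(name: str) -> Optional[int]:
    """Extract the transformer layer index from a parameter name, if any."""
    match = _LAYER_RE.search(name)
    return int(match.group(1)) if match else None


def knockout_layers(
    merged: Dict[str, float],
    vlm: Dict[str, float],
    layers: Iterable[int],
) -> Dict[str, float]:
    """Revert the given layers of the merged model to the original VLM weights."""
    drop = set(layers)
    return {
        name: (vlm[name] if layer_index(name) in drop else weight)
        for name, weight in merged.items()
    }


# Toy example with scalar "weights" (hypothetical parameter names).
vlm = {
    "model.layers.0.mlp.weight": 1.0,
    "model.layers.1.mlp.weight": 2.0,
    "model.embed.weight": 3.0,
}
merged = {
    "model.layers.0.mlp.weight": 1.5,
    "model.layers.1.mlp.weight": 2.5,
    "model.embed.weight": 3.0,
}

# Knock out layer 0 only: it falls back to the VLM value, layer 1 stays merged.
ablated = knockout_layers(merged, vlm, layers=[0])
print(ablated)
```

Sweeping `layers` over early, middle, and late blocks and comparing benchmark scores is the kind of experiment that would localize each capability.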
Loss & Training
- Completely training-free.
- The best-performing merging coefficient \(\alpha\) typically lies between 0.3 and 0.7.
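
The role of \(\alpha\) can be made explicit with a short derivation from the merge formula. The symbol \(\tau_{\text{VLM}}\) is introduced here for illustration only and does not appear in the source; it denotes the change in the VLM's text weights relative to the shared base model.

\[
\tau_{\text{VLM}} = W_{\text{VLM}} - W_{\text{base}}, \qquad \tau = W_{\text{math}} - W_{\text{base}}
\]

\[
W_{\text{merged}} = W_{\text{VLM}} + \alpha \cdot \tau = W_{\text{base}} + \tau_{\text{VLM}} + \alpha \cdot \tau
\]

Under this view the merged text weights are the base model plus a weighted combination of two task vectors. Setting \(\alpha = 0\) recovers the original VLM, and larger values of \(\alpha\) move the text weights further along the mathematical reasoning direction.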
Key Experimental Results
Main Results
| Model | MathVista (Math) | MathVerse (Vision-Only) | Perception Tasks |
|---|---|---|---|
| LLaVA-NeXT | 38.2 | 21.3 | 78.5 |
| + Dart-Math Merging | 41.8 (+3.6) | 22.7 (+1.4) | 78.1 (-0.4) |
| + MetaMath Merging | 40.9 (+2.7) | 22.0 (+0.7) | 78.3 (-0.2) |
Ablation Study
| Configuration | Math (change vs. baseline, ↑) | Perception (change vs. baseline) | Description |
|---|---|---|---|
| Early-layer merging only | +0.8 | Unchanged | Early layers primarily encode perception |
| Mid-to-late layer merging only | +3.1 | Unchanged | Reasoning is primarily in mid-to-late layers |
| Full-layer merging | +3.6 | Slight decrease | Optimal for reasoning, but perception is slightly affected |
| \(\alpha=0.3\) | +2.1 | Unchanged | Conservative merging |
| \(\alpha=0.7\) | +3.6 | -0.8 | Aggressive merging |
Key Findings
- Model merging consistently improves reasoning across multiple VLMs (LLaVA, Idefics2, InternVL2) and multiple mathematical LLMs.
- Perception and reasoning are largely decoupled in the parameter space: perception sits mainly in the early layers and reasoning in the mid-to-late layers.
- Merging alters the layer distribution of reasoning (expanding it to all layers), while the perception distribution remains unchanged.
Highlights & Insights
- Training-free capability transfer is highly practical: it requires only a weighted combination of parameters, so the cost is almost zero.
- The discovery of the layer-wise distribution of capabilities is of great value for understanding the internal mechanisms of VLMs.
- Opens up a new direction for "model merging as a multimodal integration tool".
Limitations & Future Work
- Requires the VLM and LLM to share a base model, which limits the scope of applicability.
- Only mathematical reasoning was tested; other types of reasoning (logic, common sense) have not been verified.
- The slight drop in perception tasks may be exacerbated under more aggressive merging.
- More complex merging strategies (such as TIES, DARE) were not explored.
Related Work & Insights
- vs. Traditional Model Merging: Prior works only merged models of the same modality, whereas this paper introduces cross-modal merging for the first time.
- vs. Fine-tuning with Reasoning Data: The latter requires large amounts of multimodal reasoning data, while this paper uses zero data.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of cross-modal model merging and layer-wise analysis is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across multiple VLMs, multiple LLMs, and layer-wise ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear motivation and in-depth analysis.
- Value: ⭐⭐⭐⭐⭐ Practical and insightful.