MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders

Conference: CVPR 2025
arXiv: 2501.01709
Code: https://github.com/hey-cjj/MoVE-KD
Area: Multimodal VLM
Keywords: Knowledge Distillation, Visual Encoder Fusion, Mixture of Experts, LoRA, Attention Guidance

TL;DR

This paper proposes MoVE-KD, the first framework to fuse the strengths of multiple visual encoders (CLIP/EVA/ConvNeXt/SAM) into a single encoder through knowledge distillation. It alleviates multi-teacher knowledge conflicts with a Mixture-of-LoRA-Experts (MoLE), adaptively weights distilled tokens and teachers using CLIP \([CLS]\) attention, and achieves consistent improvements on LLaVA/LLaVA-NeXT.

Background & Motivation

Different pre-trained visual encoders (such as CLIP, EVA, ConvNeXt, DINOv2) have their own strengths and perform differently across VL tasks. \(\rightarrow\) Current multi-encoder methods (e.g., Eagle, Mini-Gemini, \(S^2\)) run multiple encoders in parallel and combine them through feature concatenation or attention mechanisms, but the computational cost scales linearly with the number of encoders. \(\rightarrow\) AM-RADIO attempts to replicate the predictions of multiple foundation models with a single multi-head model, but forcing one shared backbone to learn diverse characteristics leads to conflicts. \(\rightarrow\) Key Challenge: How to absorb the respective advantages of multiple teacher encoders without conflict while keeping the efficiency of a single encoder? \(\rightarrow\) Key Insight: Combine knowledge distillation + MoE + LoRA: MoLE lets different tokens select different LoRA experts based on the input, and CLIP \([CLS]\) attention guides distillation toward the most valuable features.

Method

Overall Architecture

Training pipeline: Multiple teacher encoders (CLIP/EVA/ConvNeXt, optionally SAM) individually process images to obtain visual tokens \(\rightarrow\) each teacher's tokens are projected into a unified space via an independent 2-layer MLP encoder adapter \(\rightarrow\) token weights \(W^{(tok)}\) and teacher weights \(W^{(tea)}\) are computed based on pre-trained CLIP \([CLS]\) attention \(\rightarrow\) weighted MSE loss distills knowledge into the student encoder \(\rightarrow\) a MoLE structure is integrated within the student encoder to prevent knowledge conflicts \(\rightarrow\) Total Loss = Text Loss + KD Loss. Only the student encoder is used during inference.

Key Designs

Mixture-of-LoRA-Experts (MoLE):
- Function: Alleviates multi-teacher knowledge conflicts within the student encoder.
- Mechanism: Embeds the MoE architecture in each FFN layer of the student encoder, where each expert is a parameter-efficient LoRA module (two low-rank matrices); a Router linear layer dynamically selects the top-1 expert based on inputs: \(F^\star(x) = F(x) + E_i(x)\), \(i = \text{argmax}(\text{Softmax}(f(x)))\).
- Design Motivation: Directly fine-tuning the student encoder \(\rightarrow\) overfitting + catastrophic forgetting + training crashes (abnormal loss); using complete FFNs as experts \(\rightarrow\) parameter explosion; LoRA is both parameter-efficient (only 0.3% of total parameters) and generalizes better; ablation studies confirm that without MoLE, KD can even degrade performance below the baseline (Avg: 66.5 \(\rightarrow\) 66.3 for KD with interpolation in the ablation table; VQAv2: 76.7 \(\rightarrow\) 77.4 with MoLE vs 76.7 without MoLE/KD).
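
The routing rule \(F^\star(x) = F(x) + E_i(x)\) can be illustrated with a minimal, dependency-free sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the class names `LoRAExpert` and `MoLELayer`, the zero initialization of the LoRA up-projection, and the toy stand-in for the FFN are all hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix used for illustrative initialization."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LoRAExpert:
    """Low-rank expert E(x) = B(Ax), built from two low-rank matrices."""

    def __init__(self, dim, rank, rng):
        self.A = rand_matrix(rank, dim, rng)           # down-projection: dim -> rank
        self.B = [[0.0] * rank for _ in range(dim)]    # up-projection, zero-init (assumption)

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))


class MoLELayer:
    """Existing FFN plus one top-1 routed LoRA expert: F*(x) = F(x) + E_i(x)."""

    def __init__(self, ffn, dim, num_experts=3, rank=32, seed=0):
        rng = random.Random(seed)
        self.ffn = ffn
        self.router = rand_matrix(num_experts, dim, rng)  # linear router f(x)
        self.experts = [LoRAExpert(dim, rank, rng) for _ in range(num_experts)]

    def route(self, x):
        logits = matvec(self.router, x)
        # Softmax is monotonic, so argmax over logits equals argmax over Softmax(logits).
        return max(range(len(logits)), key=logits.__getitem__)

    def __call__(self, x):
        expert = self.experts[self.route(x)]
        return [base + delta for base, delta in zip(self.ffn(x), expert(x))]


# Toy usage: a doubling function stands in for the encoder's FFN.
layer = MoLELayer(ffn=lambda x: [2.0 * v for v in x], dim=4, rank=2)
token = [1.0, -2.0, 0.5, 3.0]
output = layer(token)
```

Because the up-projection starts at zero in this sketch, the layer initially reproduces the plain FFN output, and the experts only diverge once their weights are trained.
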
Attention-Guided KD Regularization (Token Weights + Teacher Weights):
- Function: Adaptively identifies which visual tokens and which teachers are more worthy of distillation.
- Mechanism: Utilizes the cross-attention between the pre-trained CLIP \([CLS]\) token and other visual tokens as a measure of importance.
  - Token Weights: \(W^{(tok)} = \text{Softmax}\left(\frac{V^{(cls)}W^{(Q)} \cdot (V^{(res)}W^{(K)})^T}{\sqrt{d}}\right)\), focusing the student on semantically rich regions (e.g., foreground objects) and ignoring background noise.
  - Teacher Weights: \(W^{(tea)} = \text{Softmax}\left(\text{mean}\left(\frac{V^{(cls)} \cdot (V_i^{(t)})^T}{\sqrt{d}}\right)\right)\), reflecting the response intensity of each teacher to a specific image.
- Design Motivation: \([CLS]\) attention naturally focuses on key areas (visualization shows CLIP's \([CLS]\) concentrates on meaningful objects), which is more efficient and generalizes better than learnable tokens; different teachers should contribute differently to different images, and uniform weighting would limit each teacher from exerting their unique strengths.
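
The two weighting rules can be sketched in plain Python. This is a hedged illustration, not the paper's code: the function names are hypothetical, the inputs are assumed to be already-projected vectors (so the learned projection matrices are not modeled), and the mean in the teacher weight is taken over each teacher's tokens.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def token_weights(cls_query, patch_vectors):
    """One weight per visual token: softmax of scaled [CLS]-to-token scores."""
    d = len(cls_query)
    return softmax([dot(cls_query, p) / math.sqrt(d) for p in patch_vectors])


def teacher_weights(cls_vector, teacher_tokens):
    """One weight per teacher: softmax of each teacher's mean scaled response to [CLS]."""
    d = len(cls_vector)
    scores = []
    for tokens in teacher_tokens:
        responses = [dot(cls_vector, t) / math.sqrt(d) for t in tokens]
        scores.append(sum(responses) / len(responses))
    return softmax(scores)


# Toy example: the first token aligns with [CLS], the other two do not.
tok_w = token_weights([1.0, 0.0], [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]])
# Toy example: the first teacher responds more strongly to [CLS] than the second.
tea_w = teacher_weights(
    [1.0, 0.0],
    [[[2.0, 0.0], [2.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]],
)
```

In this toy setup the aligned token and the more responsive teacher receive the largest weights, which is the behavior the attention-guided regularization relies on.
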
Encoder Adapter:
- Function: Aligns outputs from different teacher encoders into a unified representation space.
- Mechanism: Each teacher is equipped with an independent 2-layer MLP adapter to map outputs to the same dimension as the student.
- Design Motivation: Encoder output spaces from different pre-training sources are inconsistent, and simple linear interpolation cannot bridge the gap; the ablation shows that adapters outperform interpolation by 0.4 points on average (Avg: 66.3 \(\rightarrow\) 66.7).
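
A per-teacher adapter of this kind might look like the sketch below. The text only specifies "an independent 2-layer MLP", so the GELU activation, the hidden width (set equal to the student width), the initialization, and the names `EncoderAdapter`, `teacher_a`, and `teacher_b` are all assumptions for illustration.

```python
import math
import random


def gelu(x):
    """Exact GELU; the choice of activation is an assumption."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


class EncoderAdapter:
    """Hypothetical per-teacher adapter: Linear -> GELU -> Linear."""

    def __init__(self, teacher_dim, student_dim, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            bound = 1.0 / math.sqrt(cols)
            return [[rng.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]

        self.w1 = init(student_dim, teacher_dim)   # teacher_dim -> student_dim
        self.b1 = [0.0] * student_dim
        self.w2 = init(student_dim, student_dim)   # student_dim -> student_dim
        self.b2 = [0.0] * student_dim

    def __call__(self, token):
        hidden = [
            gelu(sum(w * x for w, x in zip(row, token)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        return [
            sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(self.w2, self.b2)
        ]


# Teachers with different output widths are mapped to the same student width.
adapters = {
    "teacher_a": EncoderAdapter(teacher_dim=6, student_dim=4, seed=1),
    "teacher_b": EncoderAdapter(teacher_dim=8, student_dim=4, seed=2),
}
aligned_a = adapters["teacher_a"]([0.1] * 6)
aligned_b = adapters["teacher_b"]([0.2] * 8)
```

Keeping one adapter per teacher means each mapping can specialize to its teacher's feature space, which a shared interpolation cannot do.
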

Loss & Training

Total Loss:

\[\mathcal{L}_{total} = \mathcal{L}_{text} + \lambda_{kd} \cdot \mathcal{L}_{kd}\]

\(\mathcal{L}_{text}\): Standard cross-entropy loss for text generation in VLMs.
\(\mathcal{L}_{kd}\): Weighted MSE distillation loss with token weights and teacher weights.
\(\lambda_{kd} = 0.5\), CLIP teacher has a fixed weight of 0.8 (since the student is initialized from CLIP), number of MoLE experts = 3, LoRA rank = 32.
Two-stage training: The pre-training stage freezes the LLM and trains only the MoLE modules, adapters, and projection layer; the fine-tuning stage unfreezes all parameters except the teachers.
Trained on 16\(\times\)A800 GPUs.
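
As a worked example of how the pieces combine, the sketch below assumes the weighted MSE takes the form \(\mathcal{L}_{kd} = \sum_i W^{(tea)}_i \sum_j W^{(tok)}_j \, \text{MSE}(s_j, t_{i,j})\), where \(s_j\) is the student's \(j\)-th token and \(t_{i,j}\) is the adapted \(j\)-th token of teacher \(i\). That exact form, the function names, and the toy numbers are assumptions; only \(\lambda_{kd} = 0.5\) and the overall structure of the total loss come from the text.

```python
def weighted_kd_loss(student_tokens, teacher_tokens, tok_w, tea_w):
    """Token- and teacher-weighted MSE between student and adapted teacher tokens."""
    loss = 0.0
    for tokens, w_tea in zip(teacher_tokens, tea_w):
        per_teacher = 0.0
        for s_tok, t_tok, w_tok in zip(student_tokens, tokens, tok_w):
            mse = sum((s - t) ** 2 for s, t in zip(s_tok, t_tok)) / len(s_tok)
            per_teacher += w_tok * mse
        loss += w_tea * per_teacher
    return loss


def total_loss(text_loss, kd_loss, lambda_kd=0.5):
    """L_total = L_text + lambda_kd * L_kd."""
    return text_loss + lambda_kd * kd_loss


# Toy example: two tokens of width 2, two teachers.
student = [[1.0, 1.0], [0.0, 0.0]]
teachers = [
    [[1.0, 1.0], [0.0, 0.0]],   # identical to the student, contributes 0
    [[3.0, 1.0], [0.0, 0.0]],   # differs on the first token only
]
kd = weighted_kd_loss(student, teachers, tok_w=[0.5, 0.5], tea_w=[0.8, 0.2])
total = total_loss(text_loss=1.0, kd_loss=kd)
```

Here the second teacher's first token has an MSE of 2.0, which the token weight 0.5 and teacher weight 0.2 scale down to a KD loss of 0.2, giving a total loss of 1.1 with a text loss of 1.0.
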

Key Experimental Results

Main Results (LLaVA-1.5 7B + MoVE-KD)

Method	VQAv2	GQA	TextVQA	POPE	SQA	MME	MMB	Avg
LLaVA-1.5	78.5	62.0	58.2	85.9	66.8	1510.7	64.3	66.5
+RADIO	76.3	63.0	56.3	86.2	—	—	—	—
+MoVE-KD v1.0	79.5	63.2	58.3	86.9	69.3	1524.5	66.3	68.0
+MoVE-KD v1.1	79.9	63.9	59.6	86.3	69.8	1509.1	67.4	—

Ablation Study

Configuration (Added step-by-step)	VQAv2	GQA	VizWiz	POPE	MMB	Avg
LLaVA-1.5 (baseline)	78.5	62.0	50.0	85.9	64.3	66.5
+ KD (interpolation)	79.0	62.4	50.9	84.7	62.9	66.3
+ Encoder adapter	79.3	62.4	51.2	85.2	63.8	66.7
+ MoLE	79.1	62.8	51.9	86.4	65.4	67.4
+ Token weight	79.3	63.1	52.5	86.7	66.0	67.7
+ Teacher weight (full)	79.5	63.2	52.3	86.9	66.3	68.0

Key Findings

RADIO suffers from noticeable forgetting on VQAv2/TextVQA (-2.2/-1.9), whereas MoVE-KD achieves comprehensive improvements—indicating that MoLE effectively mitigates knowledge conflicts.
Introducing MoLE alone without KD does not improve performance (Table 3), ruling out the hypothesis that "performance gains come merely from parameter increases".
Directly unfreezing the encoder for fine-tuning degrades performance, indicating that the improvement from MoVE-KD is not simply due to encoder unfreezing.
v1.0 \(\rightarrow\) v1.1 (adding the SAM teacher) brings further improvements on most benchmarks, especially TextVQA (+1.3), although POPE and MME dip slightly (86.9 \(\rightarrow\) 86.3 and 1524.5 \(\rightarrow\) 1509.1); this suggests the method scales well to additional teachers.
A CLIP teacher weight of 0.8 is optimal; a weight too low (0.6) leads to excessive forgetting of the original CLIP knowledge.
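
The deltas quoted in these findings can be re-derived from the main results table with a few lines of arithmetic. The scores below are copied from that table; the variable and function names are only illustrative.

```python
# Scores copied from the main results table (LLaVA-1.5 7B).
llava = {"VQAv2": 78.5, "GQA": 62.0, "TextVQA": 58.2, "POPE": 85.9,
         "SQA": 66.8, "MME": 1510.7, "MMB": 64.3}
radio = {"VQAv2": 76.3, "GQA": 63.0, "TextVQA": 56.3, "POPE": 86.2}
move_kd_v10 = {"VQAv2": 79.5, "GQA": 63.2, "TextVQA": 58.3, "POPE": 86.9,
               "SQA": 69.3, "MME": 1524.5, "MMB": 66.3}
move_kd_v11 = {"VQAv2": 79.9, "GQA": 63.9, "TextVQA": 59.6, "POPE": 86.3,
               "SQA": 69.8, "MME": 1509.1, "MMB": 67.4}


def deltas(new, old):
    """Per-benchmark difference, restricted to benchmarks reported for both rows."""
    return {name: round(new[name] - old[name], 1) for name in new if name in old}


radio_vs_base = deltas(radio, llava)          # forgetting on VQAv2 and TextVQA
v10_vs_base = deltas(move_kd_v10, llava)      # gains on every listed benchmark
v11_vs_v10 = deltas(move_kd_v11, move_kd_v10)  # effect of adding the SAM teacher

for name, table in [("RADIO vs LLaVA-1.5", radio_vs_base),
                    ("v1.0 vs LLaVA-1.5", v10_vs_base),
                    ("v1.1 vs v1.0", v11_vs_v10)]:
    print(name, table)
```

Running this reproduces the -2.2/-1.9 drop for RADIO on VQAv2/TextVQA and the +1.3 TextVQA gain from v1.0 to v1.1, and it also makes the small v1.1 regressions on POPE and MME visible.
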

Highlights & Insights

Pioneering Perspective: First to address VLM multi-encoder fusion from the KD perspective, which is more elegant than feature concatenation and incurs no extra inference overhead.
Elegant MoLE Design: LoRA as MoE experts = parameter efficiency + knowledge division of labor, with only 0.3% parameter overhead.
Reuse of [CLS] Attention: Utilizing CLIP's \([CLS]\) attention as a distillation guidance signal is an ingenious "free lunch".
Plug-and-Play: Can be directly applied to mainstream VLMs like LLaVA/LLaVA-NeXT without modifying the inference architecture.

Limitations & Future Work

Multi-teacher encoder forward passes are still required during training, leading to high training costs (16\(\times\)A800).
The \([CLS]\) attention fully relies on CLIP, which may result in insufficient distillation in regions that CLIP fails to attend to.
Validation is limited to the LLaVA series; generalizability has not been tested on architectures like Qwen-VL or InternVL.
MoLE currently uses only top-1 routing; top-2 or soft routing strategies have not been explored.
Occasional slight performance drops on TextVQA, possibly because enhancing visual tokens has an adverse effect on text understanding.

Related Work & Insights

AM-RADIO: Multi-head, single-backbone distillation trained on DataComp-1B; MoVE-KD requires no extra data and achieves better performance.
Eagle/Mini-Gemini: Uses extra encoders for high-resolution refinement, which incurs higher inference costs.
OneS: A pioneer in multi-teacher LLM distillation; MoVE-KD introduces multi-teacher KD to the visual side of VLMs.
Insight: \([CLS]\) attention is an underestimated signal—it encodes information on "where is important in the image" and can be widely applied in scenarios like distillation, pruning, and token compression.

Rating

Novelty: ⭐⭐⭐⭐ First to apply multi-teacher KD + MoLE on VLM visual encoders, presenting a novel perspective.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers LLaVA/LLaVA-NeXT, multiple scales (1.7B to 13B), 8 benchmarks, and detailed ablations.
Writing Quality: ⭐⭐⭐⭐ Clear structure and well-designed ablation studies.
Value: ⭐⭐⭐⭐ Highly practical, plug-and-play improvement for VLMs; excellent teacher scalability.