MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders
Conference: CVPR 2025
arXiv: 2501.01709
Code: https://github.com/hey-cjj/MoVE-KD
Area: Multimodal VLM
Keywords: Knowledge Distillation, Visual Encoder Fusion, Mixture of Experts, LoRA, Attention Guidance
TL;DR
This paper proposes MoVE-KD—the first framework to fuse the strengths of multiple visual encoders (CLIP/EVA/ConvNeXt/SAM) into a single encoder from the perspective of knowledge distillation. It alleviates multi-teacher knowledge conflicts through Mixture-of-LoRA-Experts (MoLE), adaptively weights distilled tokens and teachers using CLIP \([CLS]\) attention, and achieves consistent improvements on LLaVA/LLaVA-NeXT.
Background & Motivation
Different pre-trained visual encoders (such as CLIP, EVA, ConvNeXt, DINOv2) have their own strengths and perform differently across various VL tasks. \(\rightarrow\) Current multi-encoder methods (e.g., Eagle, Mini-Gemini, \(S^2\)) employ multiple encoders in parallel through feature concatenation or attention mechanisms, but the computational cost scales linearly with the number of encoders. \(\rightarrow\) AM-RADIO attempts to replicate predictions from multiple foundation models using a single model with multiple heads, but sharing a backbone to learn diverse characteristics leads to conflicts. \(\rightarrow\) Key Challenge: How to absorb the respective advantages of multiple teacher encoders without conflict while maintaining the efficiency of a single encoder? \(\rightarrow\) Key Insight: A combination of knowledge distillation + MoE + LoRA—using MoLE to allow different tokens to select different LoRA experts based on inputs, and utilizing CLIP \([CLS]\) attention to guide distillation to focus on valuable features.
Method
Overall Architecture
Training pipeline: Multiple teacher encoders (CLIP/EVA/ConvNeXt, optionally SAM) individually process images to obtain visual tokens \(\rightarrow\) each teacher's tokens are projected into a unified space via an independent 2-layer MLP encoder adapter \(\rightarrow\) token weights \(W^{(tok)}\) and teacher weights \(W^{(tea)}\) are computed based on pre-trained CLIP \([CLS]\) attention \(\rightarrow\) weighted MSE loss distills knowledge into the student encoder \(\rightarrow\) a MoLE structure is integrated within the student encoder to prevent knowledge conflicts \(\rightarrow\) Total Loss = Text Loss + KD Loss. Only the student encoder is used during inference.
Key Designs
- Mixture-of-LoRA-Experts (MoLE):
  - Function: Alleviates multi-teacher knowledge conflicts within the student encoder.
  - Mechanism: Embeds the MoE architecture in each FFN layer of the student encoder, where each expert is a parameter-efficient LoRA module (two low-rank matrices); a Router linear layer \(f\) dynamically selects the top-1 expert based on the input: \(F^\star(x) = F(x) + E_i(x)\), \(i = \text{argmax}(\text{Softmax}(f(x)))\).
  - Design Motivation: Directly fine-tuning the student encoder \(\rightarrow\) overfitting + catastrophic forgetting + training crash (abnormal loss); using complete FFNs as experts \(\rightarrow\) parameter explosion; LoRA is both parameter-efficient (only 0.3% of total parameters) and generalizes better. The ablation table supports this: without MoLE, KD can fall below the baseline on individual benchmarks (MMB: 64.3 \(\rightarrow\) 62.9 with interpolation-based KD, 63.8 with adapters), whereas adding MoLE lifts MMB to 65.4 and the average from 66.7 to 67.4.
- Attention-Guided KD Regularization (Token Weights + Teacher Weights):
  - Function: Adaptively identifies which visual tokens and which teachers are more worthy of distillation.
  - Mechanism: Utilizes the cross-attention between the pre-trained CLIP \([CLS]\) token and other visual tokens as a measure of importance.
    - Token Weights: \(W^{(tok)} = \text{Softmax}\left(\frac{V^{(cls)}W^{(Q)} \cdot (V^{(res)}W^{(K)})^T}{\sqrt{d}}\right)\), with \(V^{(cls)}\) the \([CLS]\) token and \(V^{(res)}\) the remaining visual tokens; this focuses the student on semantically rich regions (e.g., foreground objects) and down-weights background noise.
    - Teacher Weights: \(W^{(tea)} = \text{Softmax}\left(\text{mean}\left(\frac{V^{(cls)} \cdot (V_i^{(t)})^T}{\sqrt{d}}\right)\right)\), with \(V_i^{(t)}\) the tokens of the \(i\)-th teacher; this reflects how strongly each teacher responds to a specific image.
  - Design Motivation: \([CLS]\) attention naturally focuses on key areas (visualization shows CLIP's \([CLS]\) concentrates on meaningful objects), which is more efficient and generalizes better than learnable tokens; different teachers should contribute differently to different images, and uniform weighting would prevent each teacher from contributing its unique strengths.
- Encoder Adapter:
  - Function: Aligns outputs from different teacher encoders into a unified representation space.
  - Mechanism: Each teacher is equipped with an independent 2-layer MLP adapter that maps its outputs to the same dimension as the student.
  - Design Motivation: Encoder output spaces from different pre-training sources are inconsistent, and simple linear interpolation cannot bridge the gap; the ablation shows that adapters outperform interpolation by 0.4 points (avg: 66.3 \(\rightarrow\) 66.7).
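The top-1 routing rule \(F^\star(x) = F(x) + E_i(x)\) from the MoLE item can be sketched for a single token as follows. This is a minimal illustration in plain Python lists, not the authors' implementation: the names `LoRAExpert`, `MoLELayer`, and `matvec` are hypothetical, the zero-initialised up-projection is a common LoRA convention assumed here, and the tiny dimensions are chosen only for readability.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a (rows x cols) list-of-lists matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LoRAExpert:
    """One expert E_i(x) = up @ (down @ x), built from two low-rank matrices."""

    def __init__(self, dim, rank, rng):
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rank)]
        # Assumption: zero-initialised up-projection, so each expert starts as a no-op.
        self.up = [[0.0] * rank for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.up, matvec(self.down, x))


class MoLELayer:
    """Wraps a base FFN F with top-1 routed LoRA experts: F*(x) = F(x) + E_i(x)."""

    def __init__(self, ffn, dim, num_experts=3, rank=32, seed=0):
        rng = random.Random(seed)
        self.ffn = ffn
        # Router f: one linear layer producing a logit per expert.
        self.router = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_experts)]
        self.experts = [LoRAExpert(dim, rank, rng) for _ in range(num_experts)]

    def route(self, x):
        """Return i = argmax(Softmax(f(x)))."""
        probs = softmax(matvec(self.router, x))
        return max(range(len(probs)), key=probs.__getitem__)

    def __call__(self, x):
        expert = self.experts[self.route(x)]
        return [f + e for f, e in zip(self.ffn(x), expert(x))]


def identity_ffn(x):
    """Stand-in for the student's FFN, used only for this demo."""
    return list(x)


layer = MoLELayer(identity_ffn, dim=8, rank=2)
token = [0.1 * i for i in range(8)]
out = layer(token)
```

Because only one small expert is added per token, the base FFN output is untouched at initialisation, and different tokens can be routed to different experts instead of forcing one set of weights to fit every teacher.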
Loss & Training
Total Loss: \(\mathcal{L} = \mathcal{L}_{text} + \lambda_{kd}\,\mathcal{L}_{kd}\), where:
- \(\mathcal{L}_{text}\): Standard cross-entropy loss for text generation in VLMs.
- \(\mathcal{L}_{kd}\): Weighted MSE distillation loss with token weights and teacher weights.
- \(\lambda_{kd} = 0.5\), CLIP teacher has a fixed weight of 0.8 (since the student is initialized from CLIP), number of MoLE experts = 3, LoRA rank = 32.
- Two-stage training: The pre-training stage freezes the LLM and trains only the MoLE, adapters, and projection layer; the fine-tuning stage unfreezes all parameters except for the teachers.
- Trained on 16\(\times\)A800 GPUs.
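To make the weighted distillation loss concrete, here is a small sketch in plain Python. It rests on several assumptions that the summary does not spell out: the loss is taken as \(\sum_i W^{(tea)}_i \sum_j W^{(tok)}_j \cdot \text{MSE}(s_j, t_{ij})\), the CLIP teacher is pinned at 0.8 while the other teachers share the remaining 0.2 through a softmax, and teacher tokens are assumed to be already projected by their adapters. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def token_weights(cls_query, patch_keys):
    """W^(tok): softmax of scaled dot products between the [CLS] query and patch keys."""
    scale = math.sqrt(len(cls_query))
    return softmax([dot(cls_query, key) / scale for key in patch_keys])


def teacher_weights(cls_feature, teacher_tokens, clip_index=0, clip_weight=0.8):
    """W^(tea): CLIP gets a fixed weight; the rest is split by mean [CLS] response.

    Assumption: how the non-CLIP teachers share (1 - clip_weight) is illustrative.
    """
    scale = math.sqrt(len(cls_feature))
    others = [i for i in range(len(teacher_tokens)) if i != clip_index]
    responses = [
        sum(dot(cls_feature, tok) for tok in teacher_tokens[i])
        / (len(teacher_tokens[i]) * scale)
        for i in others
    ]
    weights = [0.0] * len(teacher_tokens)
    weights[clip_index] = clip_weight
    for i, share in zip(others, softmax(responses)):
        weights[i] = (1.0 - clip_weight) * share
    return weights


def kd_loss(student_tokens, teacher_tokens, tok_w, tea_w):
    """Weighted MSE between student tokens and each teacher's (projected) tokens."""
    loss = 0.0
    for w_tea, tokens in zip(tea_w, teacher_tokens):
        for w_tok, s, t in zip(tok_w, student_tokens, tokens):
            mse = sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
            loss += w_tea * w_tok * mse
    return loss


def total_loss(text_loss, kd, lambda_kd=0.5):
    """L = L_text + lambda_kd * L_kd."""
    return text_loss + lambda_kd * kd


# Toy data: 2 visual tokens of dimension 2, 3 teachers (index 0 plays the CLIP role).
student = [[1.0, 0.0], [0.0, 1.0]]
teachers = [student, [[1.0, 1.0], [0.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]]]
cls = [1.0, 0.0]

tok_w = token_weights(cls, student)
tea_w = teacher_weights(cls, teachers)
kd = kd_loss(student, teachers, tok_w, tea_w)
loss = total_loss(text_loss=2.0, kd=kd)
```

In this toy setup the first token aligns with the \([CLS]\) direction and therefore receives the larger token weight, so mismatches on that token dominate the distillation term.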
Key Experimental Results
Main Results (LLaVA-1.5 7B + MoVE-KD)
| Method | VQAv2 | GQA | TextVQA | POPE | SQA | MME | MMB | Avg |
|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 78.5 | 62.0 | 58.2 | 85.9 | 66.8 | 1510.7 | 64.3 | 66.5 |
| +RADIO | 76.3 | 63.0 | 56.3 | 86.2 | — | — | — | — |
| +MoVE-KD v1.0 | 79.5 | 63.2 | 58.3 | 86.9 | 69.3 | 1524.5 | 66.3 | 68.0 |
| +MoVE-KD v1.1 | 79.9 | 63.9 | 59.6 | 86.3 | 69.8 | 1509.1 | 67.4 | — |
Ablation Study
| Configuration (Added step-by-step) | VQAv2 | GQA | VizWiz | POPE | MMB | Avg |
|---|---|---|---|---|---|---|
| LLaVA-1.5 (baseline) | 78.5 | 62.0 | 50.0 | 85.9 | 64.3 | 66.5 |
| + KD (interpolation) | 79.0 | 62.4 | 50.9 | 84.7 | 62.9 | 66.3 |
| + Encoder adapter | 79.3 | 62.4 | 51.2 | 85.2 | 63.8 | 66.7 |
| + MoLE | 79.1 | 62.8 | 51.9 | 86.4 | 65.4 | 67.4 |
| + Token weight | 79.3 | 63.1 | 52.5 | 86.7 | 66.0 | 67.7 |
| + Teacher weight (full) | 79.5 | 63.2 | 52.3 | 86.9 | 66.3 | 68.0 |
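The contribution of each step can be read off the Avg column by differencing consecutive rows. The short script below does that arithmetic on the values copied from the table above; the helper name `stepwise_gains` is only illustrative.

```python
# Avg column of the ablation table, in the order the components are added.
ablation_avg = [
    ("LLaVA-1.5 (baseline)", 66.5),
    ("+ KD (interpolation)", 66.3),
    ("+ Encoder adapter", 66.7),
    ("+ MoLE", 67.4),
    ("+ Token weight", 67.7),
    ("+ Teacher weight (full)", 68.0),
]


def stepwise_gains(rows):
    """Return (label, gain over the previous row), rounded to one decimal."""
    return [
        (label, round(score - prev, 1))
        for (_, prev), (label, score) in zip(rows, rows[1:])
    ]


gains = stepwise_gains(ablation_avg)
largest = max(gains, key=lambda item: item[1])
overall = round(ablation_avg[-1][1] - ablation_avg[0][1], 1)

for label, gain in gains:
    print(f"{label}: {gain:+.1f}")
print(f"largest single step: {largest[0]}; overall gain: {overall:+.1f}")
```

Read this way, plain KD with interpolation is the only step that lowers the average, and MoLE is the largest single contributor.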
Key Findings
- RADIO suffers from noticeable forgetting on VQAv2/TextVQA (-2.2/-1.9), whereas MoVE-KD achieves comprehensive improvements—indicating that MoLE effectively mitigates knowledge conflicts.
- Introducing MoLE alone without KD does not improve performance (Table 3), ruling out the hypothesis that "performance gains come merely from parameter increases".
- Directly unfreezing the encoder for fine-tuning leads to degraded performance, proving that the improvement of MoVE-KD is not due to encoder unfreezing.
- v1.0 \(\rightarrow\) v1.1 (adding the SAM teacher) improves most benchmarks, most notably TextVQA (+1.3), although POPE (86.9 \(\rightarrow\) 86.3) and MME (1524.5 \(\rightarrow\) 1509.1) dip slightly; this suggests the method scales with additional teachers.
- A CLIP teacher weight of 0.8 is optimal; a weight too low (0.6) leads to excessive forgetting of the original CLIP knowledge.
Highlights & Insights
- Pioneering Perspective: First to address VLM multi-encoder fusion from the KD perspective, which is more elegant than feature concatenation and incurs no extra inference overhead.
- Elegant MoLE Design: LoRA as MoE experts = parameter efficiency + knowledge division of labor, with only 0.3% parameter overhead.
- Reuse of [CLS] Attention: Utilizing CLIP's \([CLS]\) attention as a distillation guidance signal is an ingenious "free lunch".
- Plug-and-Play: Can be directly applied to mainstream VLMs like LLaVA/LLaVA-NeXT without modifying the inference architecture.
Limitations & Future Work
- Multi-teacher encoder forward passes are still required during training, leading to high training costs (16\(\times\)A800).
- The \([CLS]\) attention fully relies on CLIP, which may result in insufficient distillation in regions that CLIP fails to attend to.
- Validation is limited to the LLaVA series; generalizability has not been tested on architectures like Qwen-VL or InternVL.
- MoLE currently uses only top-1 routing; top-2 or soft routing strategies have not been explored.
- Occasional slight performance drops on TextVQA, possibly because enhancing visual tokens has an adverse effect on text understanding.
Related Work & Insights
- AM-RADIO: Multi-head, single-backbone distillation trained on DataComp-1B; MoVE-KD requires no extra data and achieves better performance.
- Eagle/Mini-Gemini: Uses extra encoders for high-resolution refinement, which incurs higher inference costs.
- OneS: A pioneer in multi-teacher LLM distillation; MoVE-KD introduces multi-teacher KD to the visual side of VLMs.
- Insight: \([CLS]\) attention is an underestimated signal—it encodes information on "where is important in the image" and can be widely applied in scenarios like distillation, pruning, and token compression.
Rating
- Novelty: ⭐⭐⭐⭐ First to apply multi-teacher KD + MoLE on VLM visual encoders, presenting a novel perspective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers LLaVA/LLaVA-NeXT, multiple scales (1.7B to 13B), 8 benchmarks, and detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear structure and well-designed ablation studies.
- Value: ⭐⭐⭐⭐ Highly practical, plug-and-play improvement for VLMs; excellent teacher scalability.