LLaVA-KD: A Framework of Distilling Multimodal Large Language Models¶
Conference: ICCV2025 arXiv: 2410.16236 Code: GitHub Area: Multimodal VLM Keywords: Knowledge Distillation, Multimodal Large Language Models, Small Model Compression, Vision-Language Alignment, Relational Distillation
TL;DR¶
This paper proposes the LLaVA-KD framework, which efficiently transfers knowledge from large-scale MLLMs to small-scale MLLMs via two distillation strategies—Multimodal Distillation (MDist) and Relational Distillation (RDist)—combined with a three-stage training scheme (DPT-SFT-DFT), significantly improving small model performance without modifying the model architecture.
Background & Motivation¶
MLLMs have achieved remarkable success in unified vision-language understanding, but their ever-growing scale limits deployment in resource-constrained settings. Existing small-scale MLLMs (s-MLLMs) typically adopt lightweight LLM backbones to reduce computational cost, yet directly following the two-stage training paradigm (PT→SFT) of large models leads to significant performance degradation: the 4B TinyLLaVA reaches a 65.0% average score, but performance drops sharply to 54.7% at the 0.5B scale.
Prior work has attempted to address this through:
- Architectural optimization: MoE-LLaVA introduces a mixture-of-experts structure.
- Training data optimization: Bunny improves data quality via clustering and pruning.
However, these approaches either introduce additional parameters or incur higher data costs. The authors argue that optimizing the training paradigm itself is an underexplored yet promising direction, and identify two gaps: existing LLM distillation methods transfer knowledge only in the text modality, neglecting the critical role of visual representations in multimodal understanding; and naively incorporating distillation into the SFT stage yields only limited gains.
Method¶
Overall Architecture¶
LLaVA-KD consists of a large-scale teacher model (l-MLLM) and a small-scale student model (s-MLLM), both adopting the LLaVA-1.5 architecture (Visual Encoder + Projector + LLM). Teacher and student share the same visual encoder (SigLIP-B/14@384px), with a two-layer MLP projector mapping visual features \(Z_v \in \mathbb{R}^{N_p \times C}\) to the text embedding space \(H_v \in \mathbb{R}^{N_p \times D}\).
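A minimal PyTorch sketch of this projector (the dimension values below are illustrative assumptions; LLaVA-1.5's projector is a Linear-GELU-Linear stack):

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP mapping visual features Z_v (N_p x C) into the LLM
    text-embedding space H_v (N_p x D), as in the LLaVA-1.5 architecture."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, z_v: torch.Tensor) -> torch.Tensor:
        # (batch, N_p, C) -> (batch, N_p, D)
        return self.proj(z_v)

# Illustrative dimensions only: 729 SigLIP patch tokens of width 768,
# projected into a hypothetical 1024-dim student embedding space.
h_v = MLPProjector(vision_dim=768, llm_dim=1024)(torch.randn(1, 729, 768))
```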
Key Design 1: Multimodal Distillation (MDist)¶
MDist applies KL divergence distillation along two dimensions: response tokens and visual tokens.
Response Distillation: Aligns teacher and student output distributions over the response tokens via KL divergence. With \(N\) response tokens and teacher/student next-token distributions \(p^t, p^s\):

\[\mathcal{L}_{res} = \frac{1}{N}\sum_{i=1}^{N}\sum_{v=1}^{V} p^t_{i,v}\,\log\frac{p^t_{i,v}}{p^s_{i,v}}\]
Visual Distillation: Aligns teacher and student output distributions at the visual token positions in the same way:

\[\mathcal{L}_{vis} = \frac{1}{K}\sum_{k=1}^{K}\sum_{v=1}^{V} p^t_{k,v}\,\log\frac{p^t_{k,v}}{p^s_{k,v}}\]
where \(K\) is the number of visual tokens and \(V\) is the vocabulary size. Unlike conventional LLM distillation that targets only response tokens, MDist explicitly incorporates the visual modality into the distillation objective, ensuring comprehensive transfer of multimodal representations.
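A compact sketch of MDist under these definitions (the masking convention and temperature are assumptions, not details confirmed by the paper):

```python
import torch
import torch.nn.functional as F

def mdist_loss(t_logits, s_logits, visual_mask, response_mask, tau=1.0):
    """Sketch of MDist: forward KL(teacher || student), averaged separately
    over visual-token and response-token positions.

    t_logits, s_logits: (B, L, V) teacher / student logits.
    visual_mask, response_mask: (B, L) boolean masks over token positions.
    tau: softmax temperature (an assumption; 1.0 recovers plain KL).
    """
    p_t = F.softmax(t_logits / tau, dim=-1)
    log_p_s = F.log_softmax(s_logits / tau, dim=-1)
    # Per-token KL divergence: sum_v p_t * (log p_t - log p_s) -> (B, L)
    kl = (p_t * (p_t.clamp_min(1e-8).log() - log_p_s)).sum(dim=-1)
    l_res = kl[response_mask].mean()  # L_res over the N response tokens
    l_vis = kl[visual_mask].mean()    # L_vis over the K visual tokens
    return l_res, l_vis
```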
Key Design 2: Relational Distillation (RDist)¶
RDist transfers the teacher model's ability to capture inter-visual-token relationships by constructing self-correlation matrices of the visual tokens. With \(\hat{h}_k\) denoting the \(\ell_2\)-normalized hidden state of the \(k\)-th visual token, each model yields

\[R_{ij} = \hat{h}_i^{\top}\hat{h}_j, \qquad R \in \mathbb{R}^{K \times K}\]
The teacher and student matrices are then aligned by maximizing their cosine similarity, i.e., minimizing

\[\mathcal{L}_{rel} = 1 - \frac{\langle R^t, R^s\rangle_F}{\|R^t\|_F\,\|R^s\|_F}\]
This design, which draws on feature-relation transfer from classical vision tasks, encodes spatial and semantic dependencies among visual tokens (e.g., object positions, interaction relationships) that are critical for understanding complex visual scenes.
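A matching sketch of RDist (following the formulation above; variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def rdist_loss(t_visual, s_visual):
    """Sketch of RDist: build cosine self-correlation matrices over the K
    visual tokens of teacher and student, then minimize one minus the
    cosine similarity between the two matrices.

    t_visual: (B, K, D_t) teacher visual-token hidden states.
    s_visual: (B, K, D_s) student visual-token hidden states.
    The K x K correlation matrices are dimension-free, so teacher and
    student hidden sizes may differ.
    """
    def self_corr(h):
        h = F.normalize(h, dim=-1)        # unit-normalize each token
        return h @ h.transpose(-1, -2)    # (B, K, K) cosine similarities

    r_t = self_corr(t_visual).flatten(1)  # flatten each K x K matrix
    r_s = self_corr(s_visual).flatten(1)
    return (1.0 - F.cosine_similarity(r_t, r_s, dim=-1)).mean()
```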
Three-Stage Training Scheme¶
- Distilled Pre-Training (DPT): The visual encoder and the student LLM are frozen; only the projector is trained. MDist and RDist are incorporated on top of the standard autoregressive loss, \(\mathcal{L}_{DPT} = \mathcal{L}_{reg} + \alpha \mathcal{L}_{res} + \beta \mathcal{L}_{vis} + \gamma \mathcal{L}_{rel}\), enhancing vision-text alignment quality.
- Supervised Fine-Tuning (SFT): Standard SFT with the visual encoder frozen; the projector and LLM are jointly optimized to establish foundational multimodal understanding capabilities.
- Distilled Fine-Tuning (DFT): Distillation is reintroduced after SFT to refine the student model's knowledge: \(\mathcal{L}_{DFT} = \mathcal{L}_{reg} + \alpha' \mathcal{L}_{res} + \beta' \mathcal{L}_{vis} + \gamma' \mathcal{L}_{rel}\).
In both stages, the loss weights are set to \(\alpha = \alpha' = 1.0\), \(\beta = \beta' = 1.0\), and \(\gamma = \gamma' = 0.5\).
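Combining the pieces, both distillation stages optimize the same weighted sum (a sketch reusing the loss functions above; \(\mathcal{L}_{reg}\) is the standard autoregressive cross-entropy):

```python
def distillation_objective(l_reg, l_res, l_vis, l_rel,
                           alpha=1.0, beta=1.0, gamma=0.5):
    """Total DPT / DFT loss: L = L_reg + a*L_res + b*L_vis + g*L_rel.
    Defaults follow the reported weights (1.0, 1.0, 0.5); DFT uses its
    own primed weights alpha', beta', gamma' set to the same values."""
    return l_reg + alpha * l_res + beta * l_vis + gamma * l_rel
```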
Key Experimental Results¶
Main Results: Comparison with State-of-the-Art¶
| Method | LLM | VQAv2 | GQA | SciQA | TextVQA | MME | MMB | POPE | Avg₁₀ |
|---|---|---|---|---|---|---|---|---|---|
| TinyLLaVA | Qwen1.5-0.5B | 73.9 | 57.4 | 60.9 | 47.4 | 59.8 | 55.0 | 83.7 | 54.7 |
| LLaVA-MOD | Qwen1.5-0.5B | - | 56.2 | 62.8 | 53.9 | 65.3 | 58.8 | - | 54.1 |
| LLaVA-KD | Qwen1.5-0.5B | 77.0 | 59.6 | 60.6 | 49.9 | 64.5 | 60.1 | 85.9 | 57.9 |
| TinyLLaVA | Qwen1.5-1.8B | 73.1 | 55.5 | 65.3 | 47.7 | 61.2 | 57.1 | 83.4 | 56.8 |
| LLaVA-MOD | Qwen1.5-1.8B | - | 58.7 | 68.0 | 58.5 | 66.7 | 66.3 | 87.0 | 59.9 |
| LLaVA-KD | Qwen1.5-1.8B | 79.0 | 62.3 | 64.7 | 53.4 | 69.1 | 64.0 | 86.3 | 62.1 |
LLaVA-KD outperforms its baselines at both the 0.5B and 1.8B scales, improving Avg₁₀ over TinyLLaVA by 3.2 and 5.3 points respectively, while using only 1.2M training samples (vs. 5M for LLaVA-MOD).
Ablation Study: Effect of the Three-Stage Training Scheme¶
| Training Scheme | Avg₁₀ |
|---|---|
| PT-SFT (baseline) | 54.7 |
| DPT-SFT | 55.6 (+0.9) |
| PT-DFT | 55.8 |
| DPT-DFT | 55.9 |
| PT-SFT-DFT | 56.6 |
| DPT-SFT-DFT | 57.9 (+3.2) |
| DPT-DFT-DFT | 58.0 |
- DPT yields a +0.9% gain, confirming that distillation-based pre-training improves cross-modal alignment.
- DFT contributes the largest improvement (+2.3%), demonstrating effective teacher knowledge transfer.
- Skipping SFT (DPT-DFT) leads to performance degradation, confirming that SFT is indispensable for knowledge acquisition.
- DPT-DFT-DFT is marginally better (58.0 vs. 57.9) but incurs greater computational overhead (~120 GPU hours); DPT-SFT-DFT offers the best cost-performance trade-off.
Ablation Study: Distillation Objectives¶
| Stage | Response | Visual | Avg₁₀ |
|---|---|---|---|
| DPT | ✓ | ✗ | 54.9 |
| DPT | ✓ | ✓ | 55.1 |
| DFT | ✓ | ✗ | 57.2 |
| DFT | ✓ | ✓ | 57.7 |
Visual token distillation yields additional gains in both DPT and DFT stages, validating the importance of the visual distillation component in MDist.
Highlights & Insights¶
- Novel multimodal distillation formulation: This is the first work to extend distillation from response tokens to visual tokens within MLLMs, addressing the blind spot of prior LLM distillation methods that ignore the visual modality.
- Elegant relational distillation design: Spatial and semantic inter-token relationships are captured via visual token self-correlation matrices, going beyond simple feature alignment.
- Well-motivated three-stage training: Each stage has a clearly defined role—DPT for alignment, SFT for foundational capability, and DFT for knowledge refinement.
- Strong architecture-agnosticism: The framework requires no architectural modifications, is directly applicable to various LLaVA-style MLLMs, and is orthogonal to other optimization strategies (e.g., MoE, data curation).
Limitations & Future Work¶
- Teacher and student must share the same visual encoder and vocabulary, limiting distillation flexibility across architectures.
- Distillation increases training computation and memory overhead due to the need to run both teacher and student models simultaneously.
- Validation is limited to the LLaVA-1.5 architecture; applicability to more advanced architectures (e.g., dynamic resolution models) remains unknown.
- Loss weights (\(\alpha, \beta, \gamma\)) are fixed as empirical values without an adaptive adjustment mechanism.
Related Work & Insights¶
- Small-scale MLLMs: TinyLLaVA, Bunny, MoE-LLaVA, MobileVLM, MiniCPM-V reduce costs via lightweight backbones or architectural optimization.
- LLM Distillation: MiniLLM (reverse KLD), DistiLLM (skewed KLD), and CoT distillation focus on the text modality.
- MLLM Distillation: LLaVA-MoD (output KLD + preference distillation + MoE); LLaVADI finds that most LLM distillation strategies provide no additional benefit for MLLMs.
Rating¶
| Dimension | Score (1–5) |
|---|---|
| Novelty | 4 |
| Technical Quality | 4 |
| Experimental Thoroughness | 4 |
| Writing Quality | 4 |
| Value | 4 |
| Overall | 4.0 |