
Efficient Multi-modal Large Language Models via Progressive Consistency Distillation

Conference: NeurIPS 2025 · arXiv: 2510.00515 · Code: Available · Area: Multimodal VLM
Keywords: Multimodal LLM, visual token compression, progressive distillation, consistency distillation, inference efficiency

TL;DR

This paper proposes EPIC, a framework that addresses the optimization difficulty caused by feature space perturbation during visual token compression training via progressive consistency distillation along two dimensions (Token and Layer), achieving efficient multimodal LLMs without modifying model architecture.

Background & Motivation

Multimodal large language models (MLLMs) feed visual tokens extracted by a visual encoder into an LLM for understanding and reasoning. However, the large number of visual tokens (e.g., 576 tokens in LLaVA-v1.5) introduces substantial computational overhead:

  • The quadratic complexity of the attention mechanism makes long token sequences an inference bottleneck
  • High-resolution images and multi-frame videos further exacerbate the problem
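To make the quadratic scaling concrete (a rough back-of-envelope, not a number reported in the paper): the prefill attention cost grows with the square of the sequence length, so shrinking the visual portion from 576 to 128 tokens cuts the visual-to-visual attention term by roughly \((576/128)^2 \approx 20\times\), and the memory and per-step decoding cost attributable to those tokens by \(576/128 = 4.5\times\).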

Existing visual token compression methods fall into two categories:

Training-free methods (FastV, SparseVLM, etc.): Prune tokens based on importance or redundancy, leading to noticeable performance degradation.

Training-aware methods (MQT-LLaVA, TokenPacker, etc.): Achieve flexible compression through architectural modifications.

Core Problem: Existing training-aware methods primarily rely on architectural improvements while neglecting the training difficulties introduced by token compression. As shown in Figure 1:

  • Token compression alters the feature space distribution (introducing perturbations)
  • Perturbations shift the optimal point in parameter space
  • Higher compression ratios lead to larger shifts of the optimal point
  • Direct training is prone to getting trapped in local optima

Method

Overall Architecture

EPIC is built on a standard MLLM architecture (CLIP + MLP projector + Vicuna LLM) and does not modify any architectural components. The core innovation lies in the training strategy: it decomposes feature space perturbation into token and layer dimensions, proposing Token Consistency Distillation (TCD) and Layer Consistency Distillation (LCD) respectively.

A single model with shared weights serves simultaneously as both teacher and student.

Key Designs

1. Token Consistency Distillation (TCD)

Core Idea: Progressively increase the compression ratio so that the optimal-point shift at each step remains small, making optimization easier.

  • Student model: At training iteration \(t\), samples a compression ratio from the range \([R_{\min,t}^{\text{stu}}, R_{\max,t}^{\text{stu}}]\)
  • Teacher model: Uses a slightly lower compression ratio than the student (by a margin \(\Delta_t\)), providing better feature guidance
  • Progressive strategy:
    • Early training: Both teacher and student use low compression ratios (easy tasks)
    • Late training: Compression ratios gradually increase, and the teacher–student gap \(\Delta_t\) also increases progressively
  • When the gap is too large, the student cannot receive effective guidance from the teacher; hence \(\Delta_t\) also follows a progressive schedule

Any plug-and-play token compressor (FastV, DART, random pruning) can serve as the compression operator.
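A minimal sketch of how such a schedule could be implemented is shown below. The paper specifies that the student's ratio range and the teacher-student margin \(\Delta_t\) both grow with training progress; the linear form and the endpoint values used here are illustrative assumptions, not values from the paper.

```python
import random

def tcd_compression_ratios(t, T,
                           r_start=0.0, r_end=0.875,        # illustrative endpoints (not from the paper)
                           delta_start=0.05, delta_end=0.25):
    """Sample student/teacher compression ratios at training step t of T.

    Follows the paper's progressive idea: the student's ratio range
    [R_min,t, R_max,t] and the teacher-student margin Delta_t both grow
    with training progress. The linear schedules below are assumptions.
    """
    beta = t / T                                             # normalized training progress
    r_max = r_start + beta * (r_end - r_start)               # upper bound grows over training
    r_min = r_start + 0.5 * beta * (r_end - r_start)         # lower bound grows more slowly
    delta = delta_start + beta * (delta_end - delta_start)   # teacher-student gap Delta_t

    r_student = random.uniform(r_min, r_max)                 # student samples from [R_min,t, R_max,t]
    r_teacher = max(0.0, r_student - delta)                  # teacher compresses less than the student
    return r_student, r_teacher
```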

2. Layer Consistency Distillation (LCD)

LCD builds on the observation that attention to visual tokens drops significantly in the deeper layers of the LLM, so compressing tokens at deeper layers has a smaller impact on the output.

  • Normalized training progress is defined as \(\beta_t = t/T\)
  • Compression layer position: \(\ell_t = \text{Round}(L - \beta_t(L - \ell_{\min}))\)
  • Progressive strategy: Compression begins at the deepest layers early in training (minimal impact) and gradually moves to shallower layers (increasing impact)
  • The teacher–student compression ratio gap is maintained throughout
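The compression-layer schedule above can be written directly in code. The sketch below implements \(\ell_t = \text{Round}(L - \beta_t(L - \ell_{\min}))\) from the text; the default of 32 layers matches Vicuna-7B, while the value of \(\ell_{\min}\) is an illustrative assumption.

```python
def lcd_compression_layer(t, T, num_layers=32, layer_min=2):
    """Layer at which visual tokens are compressed at training step t of T.

    Implements l_t = Round(L - beta_t * (L - l_min)): compression starts at
    the deepest layer (beta_t = 0) and moves linearly toward layer_min as
    training progresses. num_layers=32 matches Vicuna-7B; layer_min=2 is an
    illustrative choice, not a value from the paper.
    """
    beta = t / T                                   # normalized training progress beta_t = t / T
    return round(num_layers - beta * (num_layers - layer_min))
```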

Loss & Training

TCD/LCD Training Objective: \(\mathcal{L}_{\text{total}}(\theta) = (1-\lambda) \cdot \mathcal{L}_{\text{SFT}}(\theta) + \lambda \cdot \mathcal{L}_{\text{TCD/LCD}}(\theta)\)

where \(\lambda = 0.7\) and \(\mathcal{L}_{\text{SFT}}\) is the standard autoregressive cross-entropy loss.

The distillation loss is a temperature-scaled KL divergence: \(\mathcal{L}_{\text{TCD}}(\theta) = \mathbb{E}_{I,P,t}\left[\text{KL}(p_{\text{tea}} \,\|\, p_{\text{stu}})\right]\), where \(p_{\text{tea}} = \text{Softmax}(h_{\text{tea}}/\tau)\) and \(p_{\text{stu}} = \text{Softmax}(h_{\text{stu}}/\tau)\).
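The sketch below shows how one training step with this objective could look in PyTorch, with the same shared-weight model acting as teacher (no gradients) and student. The `compression_ratio` keyword, the batch layout, and the alignment on trailing text positions are assumptions for illustration, not the paper's actual interface.

```python
import torch
import torch.nn.functional as F

def epic_training_step(model, batch, r_student, r_teacher, lam=0.7, tau=1.0):
    """One training step with the blended SFT + consistency-distillation loss.

    The same model (shared weights) plays teacher and student; only the
    compression ratio differs. Interfaces and shapes are illustrative.
    """
    # Teacher forward: lower compression ratio, no gradients flow to the teacher
    with torch.no_grad():
        teacher_logits = model(**batch, compression_ratio=r_teacher).logits

    # Student forward: higher compression ratio, gradients enabled
    student_logits = model(**batch, compression_ratio=r_student).logits

    # Both passes keep all text tokens, so compare the trailing text positions
    # (the two sequences differ only in how many visual tokens are retained).
    n_text = batch["labels"].shape[1]
    t_logits = teacher_logits[:, -n_text:]
    s_logits = student_logits[:, -n_text:]

    # Standard autoregressive cross-entropy (SFT) loss on the student
    sft_loss = F.cross_entropy(
        s_logits[:, :-1].reshape(-1, s_logits.size(-1)),
        batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,
    )

    # Temperature-scaled KL(p_tea || p_stu); F.kl_div takes student log-probs first
    p_tea = F.softmax(t_logits / tau, dim=-1)
    log_p_stu = F.log_softmax(s_logits / tau, dim=-1)
    distill_loss = F.kl_div(log_p_stu, p_tea, reduction="batchmean")

    # Blend per the paper: L_total = (1 - lambda) * L_SFT + lambda * L_TCD/LCD
    return (1 - lam) * sft_loss + lam * distill_loss
```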

Training requires only a single stage (visual instruction fine-tuning), completed in approximately 12 hours on 8×A100 GPUs.

Key Experimental Results

Main Results

Table 1: Performance on 10 visual understanding benchmarks (selected results, compared against LLaVA-v1.5-7B)

| Method | #Visual Tokens | VQAv2 | GQA | MME | MMB | Avg. (%) | vs LLaVA |
|---|---|---|---|---|---|---|---|
| LLaVA-v1.5 | 576 | 72.2 | 61.9 | 1785 | 64.1 | 61.4 | – |
| MQT-LLaVA | 64 | 65.6 | 58.7 | 1810 | 61.3 | 56.8 | -4.6 |
| MQT-LLaVA | 256 | 68.3 | 60.1 | 1740 | 61.7 | 57.8 | -3.6 |
| TokenPacker | 144 | 71.3 | 62.0 | 1716 | 63.9 | 59.9 | -1.5 |
| TCD (Ours) | 256 | 72.7 | 61.4 | 1807 | 66.1 | 61.7 | +0.3 |
| TCD (Ours) | 128 | 69.7 | 59.9 | 1861 | 65.6 | 61.3 | -0.1 |
| TCD (Ours) | 64 | 66.1 | 57.1 | 1809 | 64.2 | 59.4 | -2.0 |
| LCD (Ours) | 256 | 72.6 | 62.0 | 1834 | 64.3 | 62.2 | +0.8 |
| LCD (Ours) | 128 | 69.2 | 60.6 | 1832 | 64.1 | 61.3 | -0.1 |
| LCD (Ours) | 64 | 66.0 | 58.3 | 1794 | 62.1 | 59.4 | -2.0 |

Key findings:

  • Retaining 128 tokens (77.8% compression) incurs virtually no performance loss (−0.1%)
  • Retaining 256 tokens surpasses the original LLaVA-v1.5 (+0.3 ~ +0.8%)
  • Even at 64 tokens, performance drops only 2%, far outperforming MQT-LLaVA, LLaVA-Mini, etc.

Table 2: Inference efficiency analysis

| Method | Visual Tokens | KV Cache (MB) ↓ | CUDA Time (s) ↓ | FLOPs (T) ↓ |
|---|---|---|---|---|
| LLaVA-v1.5 | 576 | 367.2 | 1103.5 | 9.3 |
| EPIC + DART | 64 | 40.9 (↓88.9%) | 744.3 (↓32.6%) | 1.5 (↓83.9%) |
| EPIC + Random | 64 | 40.9 (↓88.9%) | 697.3 (↓36.8%) | 1.5 (↓83.9%) |

KV cache reduced by 88.9%, FLOPs reduced by 83.9%, with a practical speedup of approximately 1.6×.
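The KV-cache saving follows directly from the shorter sequence. A back-of-envelope check is sketched below, assuming Vicuna-7B dimensions (32 layers, hidden size 4096, fp16); these are assumptions of this note, not the paper's measurement code. The gap to the 367.2 MB / 40.9 MB figures in Table 2 is plausibly the text tokens, which compression does not touch.

```python
def kv_cache_mb(num_tokens, num_layers=32, hidden_size=4096, bytes_per_value=2):
    """Back-of-envelope KV-cache size in MiB for Vicuna-7B-style dimensions.

    Each token stores one key and one value vector (factor 2) of `hidden_size`
    per layer, in fp16 (2 bytes per value). Dimensions are assumptions.
    """
    return 2 * num_layers * hidden_size * bytes_per_value * num_tokens / 2**20

print(kv_cache_mb(576))  # ~288 MiB for the visual tokens alone
print(kv_cache_mb(64))   # ~32 MiB; text tokens add a further, compression-independent amount
```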

Ablation Study

Table 3: TCD ablation

| Method | VQAv2 | MME | MMB | Avg. (%) |
|---|---|---|---|---|
| TCD (128 tokens) | 69.7 | 1861 | 65.6 | 61.3 |
| w/o distillation loss | 67.2 | 1745 | 63.8 | 59.8 (−1.5) |
| w/o progressive compression ratio | 67.1 | 1788 | 63.8 | 59.1 (−2.2) |

Table 4: LCD ablation

| Method | VQAv2 | MME | MMB | Avg. (%) |
|---|---|---|---|---|
| LCD (128 tokens) | 69.2 | 1832 | 64.1 | 61.3 |
| w/o distillation loss | 67.1 | 1761 | 62.9 | 60.5 (−0.8) |
| w/o progressive compression layer | 68.7 | 1776 | 63.1 | 60.3 (−1.0) |

Key ablation conclusions:

  • Progressive strategy is indispensable: Removing it causes TCD to drop 2.2% and LCD to drop 1.0% on average
  • Teacher guidance is effective: Removing the distillation loss causes TCD to drop 1.5% and LCD to drop 0.8%
  • TCD is more sensitive to the progressive strategy; LCD requires both components

Key Findings

  1. Extreme compression yields diminishing returns: FLOPs drop sharply when going from 576 to 128 tokens, but pushing below 64 tokens (to 36 or 18) yields minimal additional FLOPs savings while causing severe performance degradation. The authors define a "high-ROI region" (≥64 tokens) and a "low-ROI region" (<64 tokens).
  2. Good generalization across compression strategies: Models trained with DART also perform well at inference with FastV and Random pruning.
  3. Low training cost: Only a single-stage fine-tuning is required (12 hours vs. 30–48 hours for architecture modification methods).
  4. Implies substantial visual token redundancy: 128 tokens suffice to match the performance of 576 tokens.

Highlights & Insights

  1. Deep problem insight: This work is the first to explicitly identify that the core difficulty of token compression lies in feature space perturbation → shift of the optimal point in parameter space → optimization trapped in local optima.
  2. No architectural modification required: The approach is a pure training strategy improvement, compatible with any plug-and-play token compressor.
  3. Elegant self-distillation design: Teacher and student share weights, eliminating the need to maintain a separate model.
  4. Theoretically intuitive progressive learning: The optimal point shift at each step is small, analogous to curriculum learning.

Limitations & Future Work

  1. Validation is currently limited to LLaVA-v1.5 (7B); applicability to larger models (13B+) and newer architectures (LLaVA-Next, etc.) remains to be verified.
  2. TCD and LCD are currently used independently; the effect of joint application has not been explored.
  3. The progressive schedule (linear growth) may not be optimal; adaptive scheduling strategies warrant further investigation.
  4. Sensitivity analysis of the distillation hyperparameters \(\lambda=0.7\) and temperature \(\tau\) is insufficient.
  5. Evaluation on video understanding scenarios (where token counts are even larger) has not been addressed.
Related Work

  • LLaVA [Liu et al., 2023/2024]: The base architecture of this work, using 576 visual tokens in full.
  • FastV [Chen et al., 2024]: Training-free token pruning based on attention scores; one of the compression components in this work.
  • MQT-LLaVA [Li et al., 2024]: Dynamic Q-former encoding for variable-length tokens; the primary training-aware baseline.
  • TokenPacker [Li et al., 2024]: Coarse-to-fine visual projector that compresses tokens via architectural modification.
  • VoCo-LLaMA [Ye et al., 2024]: Transfers visual information to a small number of VoCo tokens via attention modification.

Rating

  • Novelty: ★★★★☆ — The progressive distillation framework is novel; the combination of self-distillation and progressive learning is elegant.
  • Technical Depth: ★★★★☆ — Design intuition is clear; the decomposition into token and layer dimensions is insightful.
  • Experimental Thoroughness: ★★★★★ — 10 benchmarks, 3 compression strategies, comprehensive ablations, and efficiency analysis are all included.
  • Writing Quality: ★★★★☆ — Loss landscape visualization (Figure 1) is intuitive and effective; tables are well-organized.
  • Practicality: ★★★★★ — Code is open-sourced, no architectural modifications are required, training cost is low, and the method is directly deployable.