Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models¶
Conference: CVPR 2026 arXiv: 2603.16001 Code: https://github.com/LezJ/ATV-Pruning Area: Multimodal VLM Keywords: Weight Pruning, LVLM, Modality Asymmetry, Calibration Strategy, Sparsification
TL;DR¶
MoT probe experiments reveal asymmetric pruning sensitivity between the text and visual pathways in LVLMs — the text pathway is highly sensitive and must be calibrated with text tokens, while the visual pathway is highly redundant and tolerates 60% sparsity with negligible degradation. Based on these findings, ATV-Pruning constructs a calibration pool using all text tokens plus a small, layer-adaptively selected subset of visual tokens.
Background & Motivation¶
Background: LVLMs have enormous parameter counts, making weight pruning an effective approach for reducing deployment costs. SparseGPT and Wanda perform well on text-only LLMs, with the latter evaluating importance via weight magnitude × activation norm. However, direct application to LVLMs yields suboptimal results.
Limitations of Prior Work: Existing LVLM pruning methods (e.g., TAMP), while multimodal-aware, still process text and visual tokens within a unified framework, ignoring fundamental behavioral differences between the two modalities under pruning: (1) text and visual activations occupy distinct cluster regions in representation space (t-SNE visualization); (2) pruning masks derived from text-only vs. visual-only calibration exhibit a wide IoU distribution.
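The mask-divergence observation above can be quantified with a simple IoU over pruned positions. The sketch below (my illustration, not from the paper's code) assumes boolean keep-masks where `False` marks a pruned weight; `prune_mask_iou` is a hypothetical helper name.

```python
import numpy as np

def prune_mask_iou(mask_a, mask_b):
    """IoU between the sets of pruned (False) positions of two keep-masks.

    mask_a, mask_b: boolean arrays of the same shape; True = weight kept.
    A low IoU means the two calibration sources prune very different weights.
    """
    pruned_a, pruned_b = ~mask_a, ~mask_b
    inter = np.logical_and(pruned_a, pruned_b).sum()
    union = np.logical_or(pruned_a, pruned_b).sum()
    return inter / union
```

Comparing masks derived from text-only vs. visual-only calibration with this metric reproduces the wide IoU spread the paper reports.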
Key Challenge: Modality-agnostic calibration strategies dilute the linguistic signals necessary to protect text-related weights.
Goal: Design calibration strategies that account for the distinct sensitivities of the different modality pathways.
Key Insight: Explicitly decouple text and visual pathways via a Mixture-of-Transformer (MoT) analysis probe to independently investigate their respective pruning sensitivities.
Core Idea: The text pathway is calibrated using all text tokens (preserving sensitivity), while the visual pathway requires only a small number of high-saliency visual tokens as a supplement (exploiting redundancy).
Method¶
Overall Architecture¶
ATV-Pruning builds upon Wanda's activation-aware pruning framework. The key improvement lies in calibration pool construction: \(\mathcal{S}_{cal} = \mathcal{T} \cup \mathcal{V}_{sub}\), where \(\mathcal{T}\) contains all text tokens and \(\mathcal{V}_{sub}\) is a layer-adaptively selected subset of visual tokens.
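The pool construction \(\mathcal{S}_{cal} = \mathcal{T} \cup \mathcal{V}_{sub}\) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function name `build_calibration_pool` and the pre-computed per-token `drift` scores are assumptions (drift is defined in the Key Designs section below).

```python
import numpy as np

def build_calibration_pool(text_tokens, visual_tokens, drift, visual_ratio=0.10):
    """S_cal = T ∪ V_sub: keep ALL text tokens, add only the top-`visual_ratio`
    fraction of visual tokens ranked by their drift score.

    text_tokens:   (n_text, hidden_dim) activations
    visual_tokens: (n_visual, hidden_dim) activations
    drift:         (n_visual,) per-token saliency scores
    """
    k = max(1, int(len(visual_tokens) * visual_ratio))
    top = np.argsort(drift)[-k:]  # indices of the highest-drift visual tokens
    return np.concatenate([text_tokens, visual_tokens[top]], axis=0)
```

The default `visual_ratio=0.10` mirrors the paper's reported best trade-off.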
Key Designs¶
- MoT Sensitivity Analysis Probe (Motivating Experiment):
- Function: Decouple text/visual pathways and independently evaluate pruning sensitivity.
- Mechanism: Duplicate the QKV and FFN components of each Transformer block into separate text and visual pathways; prune each using text/visual/mixed calibration pools, then compare performance.
- Key Finding A: The text pathway is extremely sensitive — at 60% sparsity, text calibration retains 84.65% performance, visual calibration collapses to 50.92%, and mixed calibration achieves only 64.97%.
- Key Finding B: The visual pathway is highly redundant — at 60% sparsity, any calibration strategy retains 99.25%+ performance.
- Modality-Aware Calibration Pool:
- Function: Adaptively construct a calibration pool containing all text tokens and a small number of visual tokens.
- Mechanism: Per Finding A, all text tokens are retained to protect language capability; per Finding B, only a small number of visual tokens are needed to capture vision-specific weights.
- Layer-Adaptive Visual Token Selection:
- Function: Select the most important visual tokens at each Transformer block.
- Saliency Metric: Token representation drift (visual drift) \(s_v = 1 - \cos(\mathbf{X}_{in,v}, \mathbf{X}_{out,v})\).
- Intuition: If a block substantially updates the representation of a visual token, that token is actively engaged in computation within the block and should be included in calibration.
- The top-\(k\) visual tokens with the largest drift are added to the calibration pool.
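The drift metric and top-\(k\) selection above amount to a few lines of tensor arithmetic. A minimal sketch, assuming per-block input/output activations are available as NumPy arrays (function names are mine, not the paper's):

```python
import numpy as np

def visual_drift(x_in, x_out, eps=1e-8):
    """s_v = 1 - cos(x_in, x_out), computed per token.

    x_in, x_out: (n_visual, hidden_dim) activations of the visual tokens
    at the input and output of one Transformer block.
    """
    num = (x_in * x_out).sum(axis=1)
    den = np.linalg.norm(x_in, axis=1) * np.linalg.norm(x_out, axis=1) + eps
    return 1.0 - num / den

def topk_visual_tokens(x_in, x_out, k):
    """Indices of the k visual tokens this block updates most strongly."""
    return np.argsort(visual_drift(x_in, x_out))[-k:]
```

Because cosine similarity is scale-invariant, a token whose representation is merely rescaled by the block has zero drift; only directional updates count as "active engagement."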
Loss & Training¶
- Uses Wanda's importance score \(\mathbf{I}_{ij} = |\mathbf{W}_{ij}| \cdot \|\mathbf{X}_j\|_2\).
- Row-wise pruning of the lowest \(\rho\%\) scores to obtain an unstructured sparse model.
- No retraining required; this is a post-hoc pruning approach.
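The pruning step itself can be sketched in a few lines. This is an illustrative NumPy version of Wanda-style row-wise pruning under my own simplifications (dense score matrix, unstructured sparsity), not the official implementation:

```python
import numpy as np

def prune_wanda(W, X_cal, rho=0.5):
    """Zero the lowest-scoring `rho` fraction of weights in each output row.

    Score: I_ij = |W_ij| * ||X_j||_2, where X_j collects the j-th input
    feature over all calibration tokens.
    W:     (out_features, in_features) weight matrix
    X_cal: (n_tokens, in_features) calibration activations (S_cal in ATV-Pruning)
    """
    score = np.abs(W) * np.linalg.norm(X_cal, axis=0)  # broadcasts over rows
    k = int(round(W.shape[1] * rho))
    drop = np.argsort(score, axis=1)[:, :k]            # lowest-score columns per row
    W_sparse = W.copy()
    np.put_along_axis(W_sparse, drop, 0.0, axis=1)
    return W_sparse
```

ATV-Pruning's only change relative to this baseline is which tokens populate `X_cal`.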
Key Experimental Results¶
MoT Probe Experiments (LLaVA-NeXT)¶
| Pathway | Calibration Source | Mean @ 50% Sparsity | Mean @ 60% Sparsity |
|---|---|---|---|
| Text Pathway | Text | 98.26% | 84.65% |
| Text Pathway | Visual | 94.33% | 50.92% |
| Text Pathway | Mixed | 95.86% | 64.97% |
| Visual Pathway | Text | 100.27% | 100.05% |
| Visual Pathway | Visual | 99.37% | 99.25% |
| Visual Pathway | Mixed | 100.14% | 99.57% |
Main Results (9 Multimodal Benchmarks)¶
| Method | Sparsity | Multi-Benchmark Avg. | vs. Wanda | vs. TAMP |
|---|---|---|---|---|
| ATV-Pruning | 50% | Best | Significantly better | Exceeds |
| ATV-Pruning | 60% | Best | Substantially better | Exceeds |
Highlights & Insights¶
- The MoT probe experimental design is elegant, providing the first quantitative characterization of asymmetric pruning sensitivity between text and visual pathways in LVLMs.
- The method is remarkably simple — it modifies only the calibration token selection strategy on top of Wanda, yet achieves significant performance gains.
- The finding that the visual pathway suffers almost no performance degradation at 60% sparsity is a highly valuable empirical result.
- Visual drift serves as a token saliency metric that is intuitive, effective, and computationally lightweight.
- ATV-Pruning comprehensively outperforms baselines including Wanda, SparseGPT, and TAMP across 9 standard multimodal benchmarks.
- Finding B indicates substantial redundancy in the visual processing parameters of LVLMs, offering a new perspective for model compression.
Experimental Thoroughness¶
- Results are validated consistently across multiple models including LLaVA-NeXT and Qwen2-VL.
- At 50% sparsity, ATV-Pruning retains 90%+ performance on MMBench, substantially outperforming vanilla Wanda.
- The advantage is most pronounced on SQA-img, which places the highest demands on textual reasoning ability.
- Visual token ratios ranging from 5% to 30% all yield competitive results; the default setting of 10% achieves the best trade-off.
Limitations & Future Work¶
- Visual drift computation requires an additional forward pass (though this is a one-time cost during calibration).
- The top-\(k\) ratio for visual token selection requires hyperparameter tuning, and the optimal ratio may vary across models and tasks.
- The current work only validates unstructured sparsity; structured pruning scenarios (e.g., channel pruning) warrant further exploration.
- Extending the asymmetric paradigm to other compression techniques such as quantization and knowledge distillation is a promising direction.
- The MoT probe decoupling is used for analysis only; actual pruning still operates on shared weights, and a gap may exist between probe findings and practical implementation.
- For video-input LVLMs where visual token counts grow substantially, the scalability of the selection strategy requires verification.
- The phenomenon of post-pruning performance improvement on VizWiz merits deeper investigation.