⚡ VLM Efficiency

🔬 ICLR2026 · 18 paper notes


🔥 Top topics: Multimodal/VLM ×8 · Model Compression ×6 · Compression ×5 · LLM ×4

Enhancing Visual Token Representations for Video Large Language Models via Training-free Spatial-Temporal Pooling and Gridding: Video Large Language Models (Video LLMs) lose spatio-temporal information when they compress thousands of visual tokens into a limited context. To address this, the paper proposes ST-GridPool, a training-free method. It uses "Pyramidal Temporal Gridding" to aggregate frame tokens across different time scales, injecting multi-granularity motion information, and "Norm-based Spatial Pooling" to preserve high-information regions by weighting tokens according to their L2 norms. As a plug-and-play solution, it achieves consistent performance gains on LLaVA-Video / LLaVA-OneVision without retraining.
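The norm-based pooling idea in the ST-GridPool entry can be illustrated with a minimal sketch. The summary does not give the paper's exact weighting rule, so the normalized-norm weights below, and the names `l2_norm` and `norm_weighted_pool`, are assumptions for illustration only.

```python
import math

def l2_norm(vec):
    """Euclidean length of one token vector."""
    return math.sqrt(sum(x * x for x in vec))

def norm_weighted_pool(tokens):
    """Pool a window of token vectors into a single vector.

    Assumption: each token is weighted by its L2 norm divided by the
    sum of norms in the window, so high-norm tokens dominate the result.
    """
    norms = [l2_norm(t) for t in tokens]
    total = sum(norms)
    if total == 0.0:
        # All-zero window: fall back to a plain average.
        weights = [1.0 / len(tokens)] * len(tokens)
    else:
        weights = [n / total for n in norms]
    dim = len(tokens[0])
    return [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]

# A 2-token window: the first token has norm 5, the second norm 1.
window = [[3.0, 4.0], [0.0, 1.0]]
pooled = norm_weighted_pool(window)
```

In this toy window the weights are 5/6 and 1/6, so the pooled vector is approximately `[2.5, 3.5]`, much closer to the high-norm token than a plain average would be.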
HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit: The authors propose the HiDrop framework, which performs deep functional analysis of MLLM layers (Shallow = Propagators, Middle = Fusion Centers, Deep = Language Reasoning). It designs a three-stage strategy: Late Injection (skipping shallow layers), Concave Pyramid Pruning (pruning in middle layers), and Early Exit (exiting in deep layers). This approach compresses approximately 90% of vision tokens with negligible performance loss and achieves a 1.72× training speedup.
iLLaVA: An Image Is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models: iLLaVA breaks the inertia of "compressing tokens only in the LLM stage" by inserting token merging into both the image encoder and the LLM. Its "information tokens + recovery tokens" merging strategy retrieves useful information from discarded tokens. Without any training, it achieves 2× throughput and 4× prefilling acceleration while maintaining >95% performance.
IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning: This work reveals the implicit visual coordinate system (IVC tokens) established by RoPE positional encoding in LVLMs and proposes a training-free, prompt-aware vision token pruning strategy. By preserving both IVC tokens and semantic foreground tokens, it reduces visual tokens by approximately 50% while maintaining ≥99% of the original performance.
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models: LearnPruner empirically debunks the prevalent assumption that "attention score = token importance." It points out that [CLS] attention in vision encoders is contaminated by attention sinks, while in LLMs, only "text-to-vision" mid-layer attention is reliable. Consequently, it replaces [CLS] attention with a learnable pruning module and superimposes text-guided pruning at the LLM mid-layers. By retaining only ~5.5% of vision tokens, it maintains 95% performance and achieves a 3.2× speedup.
Lightweight Spatio-Temporal Modeling via Temporally Shifted Distillation for Real-Time Accident Anticipation: Using a frozen image-only CLIP teacher + temporally shifted distillation, a lightweight RepMixer+RWKV student learns "predictive" temporal capabilities without large-scale video pre-training. It achieves SOTA on the DAD/CCD accident anticipation benchmarks while being 3–7× smaller than competitors and running at 80 FPS on Jetson Orin Nano.
Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models: The paper observes that the KV cache in LVLMs exhibits modality-specific and head-specific semantic redundancy. Because selection based solely on importance loses semantic coverage, it proposes MixKV, which adaptively mixes importance and diversity scores per head for KV cache compression, achieving an average improvement of 5.1% under extreme compression.
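The per-head mixing described in the MixKV entry can be sketched as follows. This is not the paper's formula: the linear blend, the use of a single `redundancy` value in [0, 1] as the mixing weight, and the names `mix_scores` and `select_kv` are all assumptions chosen to show the general idea.

```python
def mix_scores(importance, diversity, redundancy):
    """Blend per-entry importance and diversity scores for one head.

    Assumption: `redundancy` in [0, 1] measures how redundant this head's
    cache is; more redundant heads lean more on the diversity score.
    """
    w = min(max(redundancy, 0.0), 1.0)
    return [(1.0 - w) * imp + w * div for imp, div in zip(importance, diversity)]

def select_kv(importance, diversity, redundancy, budget):
    """Return the indices of the `budget` cache entries with the highest mixed score."""
    mixed = mix_scores(importance, diversity, redundancy)
    ranked = sorted(range(len(mixed)), key=lambda i: mixed[i], reverse=True)
    return sorted(ranked[:budget])

importance = [0.9, 0.8, 0.1]
diversity = [0.1, 0.2, 0.9]
low_redundancy_keep = select_kv(importance, diversity, redundancy=0.0, budget=2)
high_redundancy_keep = select_kv(importance, diversity, redundancy=1.0, budget=2)
```

With zero redundancy the selection reduces to plain importance ranking and keeps entries `[0, 1]`; with full redundancy it follows diversity alone and keeps `[1, 2]`.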
Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning: This paper discovers that existing visual token pruning methods collapse on visual grounding (VG) tasks because they destroy the "global spatial reference system" constructed by positional encodings. Consequently, it proposes Nüwa—a two-stage pruning framework inspired by swarm intelligence (Boids) that employs a "Partition-Align-Aggregate" strategy on the vision encoder side to preserve spatial anchors, followed by text-guided refinement in the middle of the LLM. This approach improves performance retention on VG tasks from ~7% to 47% while maintaining VQA performance at 95%.
Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models: Photon is a multimodal large model that directly processes full 3D medical volume data (CT/MRI). It utilizes "Instruction-conditioned Token Scheduling (ITS)" to adaptively determine the number of visual tokens to retain for each question, and "Surrogate Gradient Propagation (SGP)" to ensure discrete token dropping remains differentiable during training. This approach achieves SOTA accuracy on medical VQA while providing approximately 5× training speedup and two-thirds memory savings.
PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models: PPE (Positional Preservation Embedding) leverages the rotation independence of RoPE dimensions to encode multiple original position IDs into different dimension segments of a merged token, so that a single compressed token carries the spatial/temporal position information of several original tokens. PPE is a zero-parameter, plug-and-play universal operator that yields only a 3.6% average performance drop on image tasks at 55% compression and maintains comparable performance at 90% compression through cascaded compression.
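To make the "multiple position IDs in different dimension segments" idea concrete, here is a minimal sketch that computes rotation angles only. It uses the standard RoPE frequency schedule; the contiguous, equal-sized split of dimension pairs across position IDs and the function names are assumptions, and the paper may assign segments differently.

```python
def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def ppe_angles(position_ids, head_dim, base=10000.0):
    """Rotation angle for each dimension pair of one merged token.

    Assumption: the dimension pairs are split into len(position_ids)
    contiguous segments, and segment k rotates by position_ids[k].
    With a single position ID this reduces to ordinary RoPE.
    """
    freqs = rope_frequencies(head_dim, base)
    n_pairs = len(freqs)
    k = len(position_ids)
    if k == 0 or k > n_pairs:
        raise ValueError("need between 1 and head_dim // 2 position IDs")
    angles = []
    for i, freq in enumerate(freqs):
        segment = i * k // n_pairs  # which original position this pair encodes
        angles.append(position_ids[segment] * freq)
    return angles

# A merged token that stands in for original positions 3 and 7.
merged = ppe_angles([3, 7], head_dim=8)
plain = ppe_angles([3], head_dim=8)
```

Here the first two dimension pairs of `merged` rotate as position 3 would and the last two as position 7 would, while `plain` matches ordinary RoPE for position 3 in every pair.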
Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity: PRUNESID is a training-free vision token compression framework that balances token semantic importance and information diversity through a two-stage pipeline consisting of "Principal Semantic Component Analysis (PSCA) clustering + Intra-group Non-Maximum Suppression (NMS)". By dynamically allocating token budgets according to image complexity, it maintains 96.3% relative accuracy on LLaVA-1.5 with only 11.1% of tokens, and achieves 92.8% relative performance under extreme compression (5.6%) on LLaVA-NeXT.
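The intra-group NMS step in the PRUNESID entry can be sketched in the spirit of classic non-maximum suppression. The cosine-similarity criterion, the fixed `threshold`, and the function names are assumptions; the summary does not specify the paper's similarity measure or how groups and importance scores are produced.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def group_nms(tokens, scores, threshold):
    """Greedy NMS inside one token group.

    Tokens are visited from highest to lowest score; a token is kept only
    if its similarity to every already-kept token is at most `threshold`.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(cosine(tokens[i], tokens[j]) <= threshold for j in kept):
            kept.append(i)
    return sorted(kept)

group = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
scores = [0.9, 0.8, 0.5]
kept = group_nms(group, scores, threshold=0.9)
```

In this toy group the second token is nearly parallel to the first and is suppressed, so `kept` is `[0, 2]`: the most important token plus one that adds different information.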
SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration: SP-VLA categorizes VLA action sequences into "deliberative" and "intuitive" types. Deliberative types invoke the large model, while intuitive types are approximated by lightweight Ridge Regression. Simultaneously, spatial-semantic dual-perceptive token pruning is applied, achieving 1.5× lossless acceleration on LIBERO and 2.4× acceleration on SimplerEnv.
ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs: To address the visual token explosion in multimodal large language models (MLLMs) when processing long videos, this paper proposes ST-SimDiff, a training-free framework. It constructs a spatiotemporal graph over all visual tokens and runs two procedures in parallel: community detection via "similarity" to retain representative tokens, and mutation detection via "difference" to retain event tokens. Finally, it adjusts the token budget using attention. At 30%/50% token budgets, it consistently outperforms SOTA methods like FastV and FrameFusion, even matching the performance of the full 100% token model on certain benchmarks.
SURGE: Surprise-Guided Token Reduction for Efficient Video Understanding with VLMs: SURGE measures surprise based on the "temporal predictability of tokens." Predictable redundant tokens are pruned while unpredictable informative tokens are retained. This training-free, backbone-agnostic method reduces tokens to 1/7 of the original and cuts prefill costs by 86–98% across five video understanding benchmarks, with accuracy staying within ±1 point of the full-token baseline.
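The surprise criterion in the SURGE entry can be illustrated with a minimal sketch. The predictor here is the simplest possible one, "the token at the same spatial position in the previous frame", and surprise is the squared distance to that prediction; both choices, and the function names, are assumptions rather than the paper's actual predictor.

```python
def surprise_scores(prev_frame, cur_frame):
    """Squared distance between each current token and its prediction.

    Assumption: the prediction for a token is the token at the same
    index in the previous frame.
    """
    return [
        sum((c - p) ** 2 for c, p in zip(cur, prev))
        for prev, cur in zip(prev_frame, cur_frame)
    ]

def prune_frame(prev_frame, cur_frame, keep):
    """Indices of the `keep` most surprising tokens in the current frame."""
    scores = surprise_scores(prev_frame, cur_frame)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

prev_frame = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
cur_frame = [[0.0, 0.0], [1.0, 3.0], [2.0, 2.5]]
kept = prune_frame(prev_frame, cur_frame, keep=2)
```

The first token is unchanged between frames (surprise 0) and is dropped, so `kept` is `[1, 2]`. A real system would also need a rule for the first frame, which has no predecessor.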
Task-Related Token Compression in Multimodal Large Language Models from an Explainability Perspective: This paper utilizes Transformer explainability methods to estimate the task-relevance of visual tokens relative to the current instruction. It trains a lightweight convolutional compressor to prune low-relevance tokens at the LLM input stage, significantly reducing FLOPs, prefill time, and KV-cache on Qwen2-VL, LLaVA-OneVision, and VILA1.5 while maintaining image and video understanding performance.
Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices: This paper proposes NANOMIND, which decomposes Large Multimodal Models (LMMs) into four independent "building blocks": vision, projection, language, and audio. These blocks are scheduled across heterogeneous accelerators (NPU/GPU/CPU) according to each accelerator's strengths. A Token-Aware Buffer Manager (TABM) enables zero-copy embedding transfer on unified memory. Combined with custom hardware, low-bit fused GEMM kernels, and battery-aware scheduling, the system enables a 2000 mAh battery-powered device to perform fully offline multimodal inference. End-to-end energy consumption is reduced by 42.3% compared to mainstream edge frameworks, and the device achieves nearly 18.8 hours of battery life in low-power mode.
VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration: VisionTrim is a training-free acceleration framework for Multi-modal Large Language Models (MLLMs). It employs two plug-and-play modules: DVTS (selecting dominant vision tokens by balancing global semantics and local spatial continuity) and TGVC (re-clustering discarded tokens into supplementary tokens via text guidance). By compressing vision tokens during both vision encoding and LLM decoding stages, it maintains 98.8% average performance on LLaVA-1.5 while removing 88.9% of vision tokens.
VQ-Transplant: Efficient VQ-Module Integration for Pre-trained Visual Tokenizers: VQ-Transplant freezes the encoder-decoder of a pre-trained visual tokenizer and only replaces and adapts the VQ module with lightweight adjustments. This allows new quantization algorithms to be integrated into strong tokenizers like VAR with a training cost of approximately 22 hours. Using MMD-VQ, it achieves an r-FID of 0.81 on ImageNet-1K, surpassing the 0.92 r-FID of the original VAR tokenizer.