
⚡ VLM Efficiency

📷 CVPR2026 · 63 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (18) · 💬 ACL2026 (6) · 🧪 ICML2026 (4) · 🤖 AAAI2026 (5) · 🧠 NeurIPS2025 (8) · 📹 ICCV2025 (11)

🔥 Top topics: Multimodal/VLM ×31 · Model Compression ×21 · Compression ×9 · LLM ×7 · Diffusion Models ×3

Accelerating Streaming Video Large Language Models via Hierarchical Token Compression: To address the slow real-time deployment of streaming Video Large Language Models (Streaming VideoLLM), this paper proposes STC, a plug-and-play two-level token compression framework. STC-Cacher caches and reuses static features from adjacent frames during the ViT encoding stage, recomputing only dynamic tokens. STC-Pruner utilizes "spatio-temporal dual anchors" to prune redundant tokens before entering the LLM. STC maintains approximately 99% accuracy on ReKV while reducing ViT encoding latency by 24.5% and LLM pre-filling latency by 45.3%.
Adapting Lightweight Image-based Counting Models for Video Crowd Counting: This paper avoids adding any temporal modules to Video Crowd Counting (VCC). Instead, it analytically formulates the spatiotemporal prior—that "crowd count changes between adjacent frames should be bounded"—as a frequency-domain statistical regulator based on the Characteristic Function (ChF). This regulator constrains a lightweight Image Crowd Counting (ICC) model only during training, while inference remains single-frame. It achieves SOTA accuracy across six datasets while reaching an inference frame rate of 99.5 fps.
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition: AdaptVision is proposed to enable VLMs to autonomously determine the minimum number of visual tokens required for each sample through a coarse-to-fine active vision mechanism and reinforcement learning. Combined with Decoupled Turn Policy Optimization (DTPO), it achieves an optimal balance between efficiency and accuracy.
ApET: Approximation-Error Guided Token Compression for Efficient VLMs: From an information-theoretic perspective, this paper proposes a visual token importance evaluation method based on linear approximation reconstruction error. It does not rely on attention weights, making it naturally compatible with FlashAttention. On LLaVA-1.5, it maintains 95.2% performance while compressing 88.9% of visual tokens.
Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding: AttentionPack leverages the inherent low-rank observation of LVLM KV caches (especially vision tokens). It compresses the cache along the hidden dimension using SVD via "multi-head concatenation + modality separation" and employs an "attention-aware partial decompression" strategy based on cumulative attention scores to select ranks on-demand. Without significant performance loss, it reduces memory consumption to 1/5–1/8 of the original, supporting larger batches/longer contexts and achieving up to a 74% increase in decoding throughput.
Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction: STAMP reformulates MLLM-based segmentation as a parallel "cloze" classification task for all image patches. By simultaneously predicting the entire mask using a single non-autoregressive forward pass, it achieves high segmentation precision and fast inference speed without compromising conversational capabilities, effectively resolving the long-standing "dialogue/performance/speed" trilemma in MLLM segmentation.
Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding: The Blink framework is proposed to adaptively enhance visual perception in a single forward pass by dynamically expanding and discarding visual tokens across different Transformer layers of MLLMs (mimicking human "rapid-blink" scanning), improving LLaVA-1.5 performance across multiple multimodal benchmarks.
Co-Me: Confidence Guided Token Merging for Visual Geometric Transformers: Co-Me equips visual geometric Transformers like VGGT and π3 with a lightweight "confidence predictor." It merges patch tokens that the network deems unimportant (low confidence) into a single token before passing them into the latter half of the network. This accelerates both attention and MLP without retraining or altering the backbone structure, achieving up to 21.5× speedup on VGGT with negligible accuracy loss.
CoIn: Coverage and Informativeness-Guided Token Reduction for Efficient Large Multimodal Models: This paper reformulates visual token reduction in large multimodal models (LMMs) as an "optimal subset selection" problem. It uses informativeness (visual saliency + cross-modal alignment) to score each token and coverage (log-det volume) to ensure the selected subset spans the feature space. A compact subset is then selected end-to-end via greedy submodular optimization—requiring no training, being independent of attention mechanisms, and compatible with FlashAttention/KV cache. On LLaVA-NeXT-7B, pruning 94.4% of visual tokens retains 86.7% performance with a 6.5× prefill speedup.
CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs: CORE shifts the visual token compression of LVLMs from "merging individual tokens by feature similarity" to "merging by objects." By utilizing a built-in segmentation head to generate masks for each object, it performs weighted averaging of tokens within the same object into a single compact token, combined with centroid sorting to preserve spatial order. It achieves SOTA performance in fixed-rate compression across six benchmarks; under extreme compression, it maintains 97.4% of the baseline performance while retaining only 2.2% of tokens.
Curvature-Aware Zeroth-Order Optimization for Memory-Efficient Test-Time Adaptation: For memory-constrained on-device Test-Time Adaptation (TTA), this paper utilizes forward-only, backpropagation-free Zeroth-Order (ZO) optimization to fine-tune a lightweight adapter. Observing that the Hessian remains low-rank and changes slowly during TTA, the authors replace isotropic random perturbations with curvature-aware anisotropic perturbations. This significantly reduces the variance of ZO gradient estimates, achieving a SOTA 69.0% on ImageNet-C while saving approximately 70% VRAM compared to BP-based methods.
Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression: RDVQ replaces the non-differentiable nearest neighbor indexing in vector quantization with a "distance-aware soft distribution," allowing rate loss gradients to flow back to the encoder. This achieves the first end-to-end rate-distortion joint optimization for VQ compression. Combined with a masked autoregressive entropy model, it obtains superior perceptual quality at ultra-low bitrates with less than 20% of the parameters of similar methods (saving up to 75.71% bitrate in DISTS on DIV2K-val compared to RDEIC).
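The "distance-aware soft distribution" idea in RDVQ can be illustrated with a minimal sketch. It assumes the soft distribution is a softmax over negative squared distances to the codewords and that the rate term is a cross-entropy against an entropy-model prior; the summary does not confirm either choice, and the function names and `temperature` parameter are hypothetical. In an autodiff framework the same expressions would be differentiable with respect to the latent `z`, which is what lets a rate loss reach the encoder.

```python
import math

def soft_assignment(z, codebook, temperature=1.0):
    """Softmax over negative squared distances from z to each codeword.

    Assumption: this is one plausible 'distance-aware soft distribution',
    not necessarily the exact form used in the paper.
    """
    logits = [-sum((a - b) ** 2 for a, b in zip(z, c)) / temperature
              for c in codebook]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_bits(probs, prior):
    """Cross-entropy (in bits) of the soft assignment under a prior."""
    return -sum(p * math.log2(q) for p, q in zip(probs, prior) if p > 0)

# Toy codebook with three 2-D codewords (illustrative values only).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
z = (0.9, 0.1)

probs = soft_assignment(z, codebook, temperature=0.1)
hard_index = probs.index(max(probs))  # what nearest-neighbor indexing would pick
rate = expected_bits(probs, prior=[1 / 3] * 3)
print(hard_index, [round(p, 4) for p in probs], round(rate, 4))
```

Lowering `temperature` pushes the soft distribution toward the hard nearest-neighbor choice, so it acts as a knob between smooth gradients and faithful indexing.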
DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning: Addressing the "large background + sparse evidence" nature of document images, DocPrune proposes a training-free, progressive three-stage visual token pruning framework (background removal → question-irrelevant region removal → adaptive pruning based on model comprehension). It improves encoder/decoder throughput by \(3.0\times / 3.3\times\) on M3DocRAG while increasing F1 by \(1.0\).
DUET-VLM: Dual Stage Unified Efficient Token Reduction for VLM Training and Inference: The DUET-VLM framework introduces a dual-stage visual token compression approach: the first stage selects dominant tokens within the vision encoder via V2V self-attention and aggregates the remaining tokens into contextual tokens using attention-guided local clustering; the second stage performs hierarchical pruning of visual tokens within the LLM using T2V cross-attention. On LLaVA-1.5-7B, it achieves 67% token reduction while maintaining 99%+ accuracy, 89% reduction with 97%+ accuracy, and reduces training time by 31%.
Dynamic Token Reweighting for Robust Vision-Language Models: This paper proposes Dtr (Dynamic Token Reweighting), the first inference-time defense method that optimizes the VLM's KV cache to counter multimodal jailbreak attacks. By defining "Reverse Safety Shift" (RSS), Dtr identifies vision tokens that cause safety degradation and dynamically adjusts their weights to restore safety alignment while maintaining performance on benign tasks.
EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling: EvoComp inserts a lightweight compressor between the MLLM's alignment module and the LLM. It is trained using supervision labels generated via an "evolutionary algorithm that finds the token subset minimizing task loss." This approach maintains 99.3%–94.9% of original accuracy under 3×–9× compression and achieves up to 2.0× speedup on mobile NPUs.
Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients: This paper proposes Quantization-aware Integrated Gradients (QIG), advancing the sensitivity analysis of LVLM quantization from the modality level to the token level. By utilizing axiomatic attribution principles, it precisely quantifies the contribution of each token to the quantization error. This approach significantly improves the accuracy of quantized models under W4A8 and W3A16 settings with almost no additional computational overhead.
FlashCache: Frequency-Domain-Guided Outlier-KV-Aware Multimodal KV Cache Compression: FlashCache is proposed as the first method to analyze the importance distribution of multimodal KV Cache from a frequency-domain perspective. It discovers that "Outlier KVs" deviating from low-frequency principal components encode critical features for inference. By identifying and prioritizing Outlier KVs through DCT low-pass filtering and performing dynamic layer-wise budget allocation, it achieves 1.69× decoding speedup with 80% KV memory compression and negligible performance loss, while maintaining native compatibility with FlashAttention.
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection: FocusUI allows UI grounding VLMs to retain only a few instruction-related visual tokens. It first uses a lightweight scorer trained with "instruction \(\times\) patch" saliency supervision to pick key patches, then uses POSPAD to compress each run of contiguous discarded tokens into a placeholder marker that retains the final coordinate. Keeping only 30% of visual tokens, it achieves a 1.44× speedup and 17% lower peak VRAM with only a 3.2% accuracy drop.
GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding: GroundVTS is proposed as a query-guided fine-grained visual token sampling architecture for Video Large Language Models (Vid-LLMs). By adaptively retaining query-relevant spatio-temporal information at the token level, it achieves an 18.4-point mIoU improvement on Charades-STA and a 20.6-point mAP improvement on QVHighlights.
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models: This paper proposes HAWK, a head importance-aware visual token pruning method. It dynamically evaluates visual token importance by combining offline-calculated head contribution weights with text-guided attention scores. On Qwen2.5-VL, it retains 96.0% of original performance after pruning 80.2% of visual tokens, while reducing inference latency by 26%.
Hi-Lo Prune: Look at What You'll Lose before Pruning with Hierarchical Token Selection: To address the high inference cost caused by excessive visual tokens in Multimodal Large Language Models (MLLMs), this paper proposes a training-free pruning method, Hi-Lo Prune. Following the core philosophy of "look at what you'll lose before pruning," it employs a coarse-to-fine hierarchical selection to define a preserved token set and a "most valuable discarded token" candidate set. Prune-Aware Fusion then migrates information from the candidate set to the preserved set in shallow layers via augmented attention, followed by a one-time removal of remaining tokens at a designated layer. This approach consistently outperforms existing methods on Qwen2/2.5/3-VL and LLaVA, even when pruning 90% of tokens.
HTTM: Head-wise Temporal Token Merging for Faster VGGT: HTTM is a training-free token merging method specifically tailored for VGGT global attention layers. By employing "head-wise independent merging + temporal reordered in-block merging + cross-head adaptive outlier filtering," it accelerates long-sequence 3D reconstruction inference by up to \(7\times\) with negligible performance degradation.
Hybrid Token Compression for Vision-Language Models: Addressing the dilemma where "continuous compression loses semantics and discrete quantization loses details" when visual tokens are compressed to 1, HTC-VLM utilizes a dual-path decoupling of a continuous channel (ViT patches for details) and a discrete channel (MGVQ for 4 semantic anchors). Through a decoupled attention mask and a <voco> bottleneck, 580 tokens are compressed into 1, improving performance retention from 81.0% to 87.2% across 7 benchmarks.
IF-Prune: Information-Flow Guided Token Pruning for Efficient Vision-Language Models: This paper proposes IF-Prune, which models visual token importance estimation as an amortized variational inference problem. Using a small VLM equipped with a token-level Variational Information Bottleneck (VIB), the KL divergence between the posterior and prior of each visual token latent variable is used as the importance score to prune the large VLM. Guidance is provided in a single forward pass. Even when retaining only 5% of visual tokens, the large model maintains 95% of its original performance, outperforming the previous SOTA by approximately 8%.
LazyVAR: Accelerating Visual Autoregressive Models via Scale-wise Token Pruning and Parallel Group Decoding: LazyVAR discovers that aggregated latent features in VAR on adjacent scales become increasingly similar as the scale increases. Therefore, it leverages a "scale update index" for training-free token pruning, and then groups and decodes scales with minimal updates in parallel. This accelerates the Infinity-2B text-to-image model by up to 2.94× (taking only 0.5 seconds for 1024×1024 resolution on a single RTX 4090 card) with almost no degradation in generation quality.
LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models: Addressing the instability of training small diffusion students distilled from large teachers, this paper decomposes distillation error into "coarse-easy" (low-order moment mismatch) and "fine-hard" (non-linear residual) components using linear regression. It proposes LIFT for coarse-to-fine refinement and PLACE for spatial local adaptivity via group ranking. Under extreme 90% pruning (where the student has only 1.6% of teacher parameters), it brings the FID back to 15.73 from the 50–200+ seen in conventional KD.
LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging: To address the quadratic complexity bottleneck of global attention in the 3D foundation model VGGT on long sequences, LiteVGGT proposes a "geometry-aware + cross-layer cached" token merging strategy. It preserves critical tokens based on geometric importance, merges redundant tokens into anchors, and reuses merging indices across layers. Coupled with fine-tuning and FP8 quantization, it achieves approximately 10× speedup compared to VGGT on 1000-image inputs with almost no performance degradation.
LS-ViT: Least-Squares Hessian Based Block Reconstruction for Low-Bit Post-Training Quantization of Vision Transformers: LS-ViT reformulates the estimation of the "representative Hessian" in ViT block reconstruction as a least-squares problem: it fits a shared Hessian using \((g, \Delta z)\) pairs across the entire calibration set. This explicitly recovers the covariance terms lost by previous methods due to the "sample independence assumption," achieving new SOTA at ultra-low bit widths such as W2/A3 and W2/A4. Each block requires only one backpropagation, making training 1.8–2.7× faster than FIMA-Q.
MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models: This work reveals the "smoothing misalignment" problem when channel-wise smoothing quantization (e.g., SmoothQuant) is directly applied to MLLMs—where huge differences in activation magnitudes across modalities lead to over-smoothing of non-dominant modalities. MASQuant addresses this via modality-aware smoothing factors and cross-modal low-rank compensation based on SVD whitening.
Merge3D: Efficient 3D Multimodal LLMs via Joint 2D-3D Token Merging: Merge3D introduces a semantic-geometric joint token merger (SemGeo Merger) for 3D video MLLMs with "2D semantic + 3D geometric" dual encoders. It uses 2D attention to select semantically salient main tokens and utilizes joint 2D×3D similarity to merge context tokens into spatial neighborhoods. While reducing visual tokens by up to 70% and achieving ~3× acceleration, it preserves performance in 3D grounding, description, and spatial reasoning.
MeToM: Metadata-Guided Token Merging for Efficient Video LLMs: MeToM utilizes "free" bitstream metadata from video codecs (residual energy, GoP packet size) as zero-cost proxies for spatio-temporal information density. It employs three modules—RPM, BTM, and MATM—to hierarchically merge visual tokens at "tokenization, pre-LLM, and intra-LLM" stages based on content complexity. Without any training, it achieves 2.65× end-to-end inference acceleration across multiple Video LLMs while maintaining or even improving accuracy.
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data and Training Recipes: MiniCPM-V 4.5 utilizes a "Unified 3D-Resampler for visual token compression + Unified Document/OCR learning via dynamic corruption + Short-long dual-mode hybrid RL" approach to build a highly efficient and powerful 8B MLLM. It outperforms GPT-4o-latest and Qwen2.5-VL 72B with a score of 77.0 on OpenCompass, while requiring only ~10% of the inference time on VideoMME.
MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning: The authors first observe that replacing a 7B language model in an MLLM with a 125M OPT can approximate large model performance on factual image descriptions. They then propose MM-SeR, a multimodal self-refinement framework: the lightweight model first generates a coarse description, which then guides the extraction of finer visual features for a second refinement stage. This achieves performance parity with large models on single-sentence/detailed descriptions and long-video QA, while reducing parameters by 93% and inference time by 97%.
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping: The authors propose MoDES, the first training-free expert skipping framework for MoE Multimodal Large Language Models (MLLMs). By utilizing Global Modulated Local Gating (GMLG) and a Dual-Modality Threshold (DMT) mechanism to adaptively skip redundant experts, MoDES retains 97%+ of the original performance while skipping 88% of experts, achieving a 2.16× prefill acceleration.
NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices: Addressing the overlooked scenario where "edge devices only focus on specific classes," NuWa first employs Self-Knowledge Purification (SKP) to learn binary masks that remove "class-harmful weights." Subsequently, it formulates MHA/MLP pruning as closed-form optimization problems, enabling the derivation of smaller ViTs from large ones without retraining. These derived models achieve higher accuracy than the original on target classes while being faster, with pruning speeds 33.69× faster than state-of-the-art training-dependent methods and costs reduced by up to 99.83%.
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models: OmniZip is the first training-free token compression framework for joint audio-video understanding in Omnimodal Large Language Models (OmniLLM). It utilizes the attention distribution of audio tokens as a prior for "information density/event boundaries" to dynamically determine video token pruning rates within each time window. Combined with an Interleaved Spatiotemporal Compression (ISTC) module, it achieves 3.42× prefill acceleration and 1.4× memory reduction on Qwen2.5-Omni with almost no performance degradation.
OmniZip: Learning a Unified and Lightweight Lossless Compressor for Multi-Modal Data: OmniZip utilizes a lightweight RWKV backbone (ranging from several MB to 152M parameters) combined with "unified modality tokenization + modality-routed MoE." It achieves lossless compression for seven modalities (including image, text, speech, tactile, gene, and database data) within a single model. It improves compression rates by 42%–62% over gzip and achieves near real-time speeds of approximately 1 MB/s on MacBook CPUs and iPhone NPUs.
One Layer's Trash is Another Layer's Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs: Addressing the inference slowdown caused by excessive visual tokens in Large Vision-Language Models (LVLMs), ALVTS avoids the one-time permanent pruning used by methods like FastV. Instead, it re-selects tokens at every decoding layer. Using a lightweight selector with low-rank approximation to score all visual tokens, important tokens participate in layer computation while unimportant ones skip the current layer and merge back later. This mechanism preserves 96.7% of the original accuracy while compressing 89% of tokens.
Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving: Prune2Drive is the first plug-and-play token pruning framework dedicated to multi-view autonomous driving VLMs. It uses T-FPS (Token-level Farthest Point Sampling) to maintain semantic and spatial diversity, combined with view-adaptive pruning rates that optimize token budgets across cameras. On DriveLM, it achieves 6.40× prefill acceleration with only 10% of tokens remaining and a performance drop of just 3%.
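Farthest point sampling, the building block named in the Prune2Drive summary, can be sketched generically as follows. This is plain greedy FPS with Euclidean distance over toy 2-D "token features"; the paper's token-level variant and its view-adaptive pruning rates are not reproduced here, and the function name, `start` parameter, and data are hypothetical.

```python
import math

def farthest_point_sampling(tokens, k, start=0):
    """Greedy FPS: repeatedly keep the token farthest from all kept tokens.

    Returns the sorted indices of the k kept tokens.
    """
    n = len(tokens)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(tokens)")
    selected = [start]
    # min_dist[i] is the distance from token i to its nearest kept token.
    min_dist = [math.dist(t, tokens[start]) for t in tokens]
    while len(selected) < k:
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            d = math.dist(tokens[i], tokens[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return sorted(selected)

# Three clusters of near-duplicate tokens (illustrative values only).
tokens = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
kept = farthest_point_sampling(tokens, k=3)
print(kept)
```

With a budget of three, the sketch keeps one token from each cluster, which is the diversity-preserving behavior the summary attributes to T-FPS.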
PS-SR: Pseudo-Single-Step Video Super-Resolution via Speculative Diffusion: PS-SR decomposes an expensive multi-step diffusion SR process into an asymmetric sampling sequence consisting of "1 step by a strong base model + T−1 steps of speculative refinement by a lightweight draft model." It then applies a frequency domain update rule to ensure subsequent steps only inject high-frequency details without altering low-frequency structures, achieving multi-step diffusion quality and detail at speeds approaching single-step models.
Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization: Quant Experts (QE) is a token-aware adaptive quantization error reconstruction framework based on Mixture-of-Experts. It partitions important channels into two groups: token-independent (high-frequency, global) and token-dependent (low-frequency, local). These are compensated using low-rank adapters from shared and routed experts, respectively, to mitigate global and local quantization errors. QE consistently improves VLM performance across various quantization settings ranging from W4A6 to W3A16.
QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer: Targeting the 1.26B parameter feed-forward 3D reconstruction model VGGT, this paper proposes QVGGT, a geometry-aware post-training quantization framework. By utilizing "block-wise sensitivity mixed precision + camera token filtering compensation + task-aware scale search," it achieves nearly lossless performance under W4A16 (CO3Dv2 camera pose AUC@30 89.4 vs. FP16 89.5), while reducing memory by 3–4.9× and providing up to 2.8× hardware speedup.
Rethinking Asymmetric Quantization: Hidden Symmetry in Vision Model Weights: The authors discover that vision model weights are approximately symmetric after removing a few outliers. Based on this, they propose DASQ—decomposing weights into a "dense symmetric kernel + sparse outliers," both represented by symmetric quantization (SymQ). This eliminates the expensive zero-point of asymmetric quantization (AsymQ). DASQ outperforms existing PTQ methods on ImageNet/COCO with lower BOPs and achieves higher accuracy and lower power consumption on FPGAs.
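The "dense symmetric kernel + sparse outliers" decomposition described for DASQ can be illustrated with a small sketch. It assumes a fixed magnitude threshold separates outliers from the dense part and that each part gets its own symmetric scale; how the paper actually chooses the split is not stated in the summary, and all function names, the threshold, and the weights are hypothetical.

```python
def symmetric_quantize(values, bits):
    """Symmetric uniform quantization: the zero-point is fixed at 0."""
    qmax = 2 ** (bits - 1) - 1
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def quantize_decomposed(weights, threshold, bits=4):
    """Quantize the dense part and the sparse outliers with separate scales."""
    dense = [w if abs(w) <= threshold else 0.0 for w in weights]
    outliers = {i: w for i, w in enumerate(weights) if abs(w) > threshold}
    dense_codes, dense_scale = symmetric_quantize(dense, bits)
    out_codes, out_scale = symmetric_quantize(list(outliers.values()), bits)
    recon = [c * dense_scale for c in dense_codes]
    for idx, code in zip(outliers, out_codes):
        recon[idx] = code * out_scale  # overwrite positions held by outliers
    return recon

def max_error(original, reconstructed):
    return max(abs(a - b) for a, b in zip(original, reconstructed))

# Mostly small weights plus one large outlier (illustrative values only).
weights = [-0.4, -0.1, 0.0, 0.3, 0.4, 3.0]

codes, scale = symmetric_quantize(weights, bits=4)
naive_err = max_error(weights, [c * scale for c in codes])
decomposed_err = max_error(weights, quantize_decomposed(weights, threshold=1.0))
print(naive_err, decomposed_err)
```

In this toy case a single symmetric scale is dominated by the outlier, so the small weights collapse onto very few levels; splitting the outlier off lets the dense part use its full range without needing a zero-point.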
Rethinking Token Reduction for Large Vision-Language Models: Aiming at multi-round visual question answering (MT-VQA) scenarios, this paper unifies visual token pruning and merging into a "learnable compression mapping \(P\)" and trains a meta-generator, MetaCompress, which relies solely on images and adapts to arbitrary resolutions to produce \(P\). At a 90% compression rate, it consistently outperforms heuristic methods like FastV and PruMerge, with inference efficiency approaching the fastest equidistant sampling baseline.
S2D: Selective Spectral Decay for Quantization-Friendly Conditioning of Neural Activations: S2D attributes the root cause of activation outliers to a few "bloated" principal singular values of the weight matrix. By applying selective spectral decay only to these largest singular values during the fine-tuning stage, the model is conditioned into a "quantization-friendly" state without requiring retraining from scratch. W4A4 PTQ on ImageNet achieves gains of up to 7%.
Saliency-Driven Token Merging for Vision Transformers: SAD-TM observes that existing token merging methods rely only on "current-layer" attention parameters, which fluctuate drastically across layers. It proposes a cross-layer consistent criterion using saliency (via backpropagation) to identify "saliency outlier" tokens that deviate from the global gradient direction. By fusing these with class attention and employing a "delayed merging" strategy that skips initial layers, it achieves almost lossless FLOPs reductions of 23%–45% on DeiT, MAE, and LV-ViT without training.
SCoRe: Salience-Coverage Reduction for Vision Token Pruning in Vision-Language Models: SCoRe reformulates LVLM vision token pruning from a two-stage heuristic—"attention-based Top-k followed by post-hoc diversity"—into a unified "representativeness optimization problem." It proves this problem is equivalent to the classical weighted k-Center problem and adopts a composite score encoding both salience and coverage for greedy selection. Being training-free and plug-and-play, it retains 95% performance while pruning 94.4% of tokens.
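The weighted k-Center view mentioned for SCoRe can be illustrated with a standard greedy heuristic. The sketch below assumes the composite score is simply salience multiplied by distance to the nearest already-selected token; the paper's actual score is not given in the summary, and the function name, features, and salience values are hypothetical.

```python
import math

def weighted_k_center_greedy(features, salience, k):
    """Greedy weighted k-center: pick the token that is both salient and
    poorly covered by the tokens selected so far."""
    n = len(features)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(features)")
    first = max(range(n), key=lambda i: salience[i])
    selected = [first]
    # min_dist[i] is the distance from token i to its nearest selected token.
    min_dist = [math.dist(f, features[first]) for f in features]
    while len(selected) < k:
        nxt = max(range(n), key=lambda i: salience[i] * min_dist[i])
        selected.append(nxt)
        min_dist = [min(d, math.dist(f, features[nxt]))
                    for d, f in zip(min_dist, features)]
    return sorted(selected)

# Toy 2-D token features with per-token salience (illustrative values only).
features = [(0.0, 0.0), (0.1, 0.0), (4.0, 0.0), (4.1, 0.0), (8.0, 0.0)]
salience = [0.9, 0.8, 0.2, 1.0, 0.1]

kept = weighted_k_center_greedy(features, salience, k=3)
print(kept)
```

A pure Top-k by salience would keep tokens 0, 1, and 3 here, two of which are near-duplicates; the coverage term instead spends the third slot on the distant, low-salience token.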
SegMo: Co-Designing Content-Aware Sparsity and Locally-Cohesive Segment Parallelism for Efficient VLM Inference: SegMo addresses the token explosion and \(O(N^2)\) prefill bottleneck in long-video VLMs. Through "algorithm-system co-design," it jointly optimizes what to compute (Content-Aware Sparsity, CAS) and how to compute (Locally-Cohesive Segment Parallelism, LSP). Leveraging the "local cohesion" property of VLM attention, it segments videos by scenes for parallel execution with zero cross-GPU communication during prefill. This achieves up to 12.00% accuracy improvement and up to 3.55× prefill acceleration across three long-video benchmarks.
SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer: SODA is proposed to achieve high-fidelity generation under controllable acceleration ratios for Diffusion Transformers without training, utilizing offline fine-grained sensitivity modeling, dynamic programming for interval optimization, and a unified adaptive pruning strategy.
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding: TimeViper utilizes a 9B large model backbone hybridizing Mamba-2 and self-attention. Leveraging the newly discovered "visual information converges into instruction tokens layer-by-layer" phenomenon, the authors propose the TransV module within the LLM to transfer and compress redundant visual tokens into instruction tokens via gated cross-attention. This enables the processing of hour-long videos with tens of thousands of frames on a single GPU, achieving performance comparable to Transformer-based MLLMs.
TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model: TransPrune proposes using "the changes in token representations during internal propagation" (token transition) to determine the importance of visual tokens. By combining two complementary signals—TTV (Token Transition Variation), which assesses the magnitude and direction changes of tokens, and IGA (Instruction-Guided Attention), which measures image token attention relative to instructions—the method achieves training-free progressive pruning. It reduces inference TFLOPs by half on LLaVA-1.5/Next and Qwen2.5-VL with almost no performance degradation.
UniCompress: Token Compression for Unified Vision-Language Understanding and Generation: UniCompress wraps off-the-shelf discrete tokenizers with lightweight "global meta-token extraction + average pooling compression + global-guided autoregressive decompression" modules. It reduces the visual token count of unified understanding-generation models by 4×, maintaining understanding performance with only minor generation degradation, all without retraining the language model.
Variation-Aware Vision Token Dropping for Faster Large Vision-Language Models: V2Drop is the first method to approach vision token dropping from the perspective of token variation. By progressively dropping "lazy" vision tokens with minimal variation within the LLM, it achieves training-free, position-bias-free LVLM inference acceleration compatible with efficient operators. It retains 94.0% and 98.6% of original performance in image and video understanding tasks respectively, while reducing LLM generation latency by 31.5% and 74.2%.
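Dropping "lazy" tokens by variation can be sketched in a few lines. The example assumes variation is measured as the L2 distance between a token's hidden state before and after a layer; the summary does not specify the metric, and the function name, `keep_ratio` parameter, and hidden states are hypothetical.

```python
import math

def drop_lazy_tokens(prev_hidden, curr_hidden, keep_ratio):
    """Keep the tokens whose hidden state changed most across one layer.

    Returns sorted indices of kept tokens, so the original order survives.
    """
    variation = [math.dist(p, c) for p, c in zip(prev_hidden, curr_hidden)]
    n_keep = max(1, round(len(variation) * keep_ratio))
    ranked = sorted(range(len(variation)),
                    key=lambda i: variation[i], reverse=True)
    return sorted(ranked[:n_keep])

# Hidden states of four vision tokens before and after a layer
# (illustrative values only).
prev_hidden = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
curr_hidden = [(1.0, 0.01), (0.5, 1.5), (1.0, 1.0), (3.0, 2.0)]

kept = drop_lazy_tokens(prev_hidden, curr_hidden, keep_ratio=0.5)
print(kept)
```

Because the score depends only on hidden states, not on attention maps, a criterion of this kind needs no access to attention weights, which is consistent with the summary's claim of compatibility with efficient operators.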
ViLearn: Accelerating Training Convergence of Image-to-3D Generation via Visibility Learning: ViLearn explicitly decouples two inherently different sub-tasks in "single-image-to-3D"—visible region reconstruction and invisible region hallucination—during the training phase. It utilizes the cross-attention of a pre-trained VecSet decoder to partition unordered shape tokens into two groups (Visibility Grouping, VG): visible and invisible. Subsequently, it employs Visibility-Aware Positional Encoding (VAPE) to strengthen the "image token \(\leftrightarrow\) visible token" correspondence and weaken entanglement with invisible tokens. This approach accelerates the training convergence of VecSet diffusion models by up to 4.4\(\times\) and surpasses vanilla baselines in final quality without altering the backbone or increasing inference overhead.
Vision-Oriented Lightweight Neural Architecture Search with Budget-Adaptive Evaluation: Addressing the dilemma between "accurate but slow training-based" and "fast but unreliable and family-specific training-free" NAS, this paper proposes six "vision-specific micro-tasks" with negligible training costs as architecture quality proxies. Combined with a quadratic response surface that automatically allocates data volume and training epochs within a given time budget, the method achieves SOTA rank correlation and final accuracy across CNN, Transformer, and Mamba families.
VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm: VLM-Pruner is proposed as a training-free centrifugal token pruning method that balances redundancy elimination and local detail integrity through the Buffered Spatial Sparsity (BSS) criterion. It consistently outperforms existing methods across five VLMs at an 88.9% pruning rate while achieving end-to-end inference acceleration.
VLM-PTQ: Efficient Post-Training Quantization for Large Vision-Language Models: VLM-PTQ observes two overlooked issues when migrating weight compensation quantization methods like GPTQ/GPTAQ to Vision-Language Models: "round-to-nearest" is sub-optimal under asymmetric targets, and vision/text channels are treated indiscriminately. The paper uses a closed-form correction term to shift the quantization target to the true optimum and redistributes channel weights using a modality-aware importance vector. This significantly improves 3-bit/2-bit quantization accuracy across 1B–72B VLMs with negligible overhead.
VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction: VQRAE transforms RAE (a representation autoencoder using a pre-trained visual backbone as the encoder) into a vector-quantized version. A single tokenizer simultaneously outputs continuous semantic features for understanding and discrete tokens for generation and reconstruction. It demonstrates for the first time that quantizing semantic features requires high dimensionality (1536) to reach 100% utilization and avoid collapse, and it dispenses entirely with dual encoders and CNN pixel encoders.
VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping: VVS introduces "partial verification skipping" to speculative decoding (SD) for visual autoregressive generation for the first time. By utilizing verification-free token selection, stale feature cache reuse, and similarity-driven skip scheduling, it reduces the number of target model forward passes by up to 2.86× and achieves an end-to-end acceleration of 1.76× with minimal loss in image quality. This breaks the ceiling of SD where the "one draft, one verification" paradigm could not explicitly reduce the number of forward passes.
What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models: The authors propose the EmbedLens probing tool to systematically analyze the internal structure of visual tokens in MLLMs. They discover that visual tokens are categorized into three types: sink, dead, and alive (approximately 40% are useless). Alive tokens already encode rich semantics before entering the LLM ("pre-linguistic" property), and internal visual computation within the LLM is redundant for most tasks, allowing for direct middle-layer injection.
When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs: This paper discovers that existing token pruning methods perform worse than random pruning in deep layers of VLLMs. It proposes a method to quantify visual token information based on output probability variations, revealing the "Information Horizon"—a critical layer where visual token information uniformly dissipates to zero. This horizon is dynamically influenced by task visual complexity and model capability, and the study proves that integrating simple random pruning effectively enhances existing methods.
ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models: ZOO-Prune utilizes "zeroth-order gradient estimation" on a lightweight projection layer to measure the "sensitivity" of each visual token. By multiplying sensitivity with feature diversity into a hybrid score for greedy selection, it achieves completely training-free pruning of up to 94.4% of visual tokens, reaching a 2.30× end-to-end inference speedup with negligible accuracy loss.