⚡ LLM Efficiency
📷 CVPR2026 · 8 paper notes
📌 Same area in other venues: 🔬 ICLR2026 (171) · 💬 ACL2026 (23) · 🧪 ICML2026 (48) · 🤖 AAAI2026 (9) · 🧠 NeurIPS2025 (34) · 📹 ICCV2025 (1)
- E\(^2\)-SCI: Elastic Edge-Cloud Speculative Decoding via Credit Inertia
This paper identifies strong temporal consistency in token acceptance rates across adjacent windows in edge-cloud speculative decoding (termed "Credit Inertia"). Based on this, it dynamically adjusts verification thresholds using historical acceptance rates. Combined with an Asynchronous Pipeline (PLC) that parallelizes draft generation and cloud verification, it achieves 9.4+ tokens/s on DeepSeek-R1-Distill-Qwen (1.5B/32B), representing an 88.5% speedup over the FSD baseline without compromising accuracy.
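The idea of steering a verification threshold by recent acceptance history can be sketched as an exponential moving average. This is a minimal illustration, not the paper's algorithm: the class name `CreditTracker`, the momentum value, the threshold range, and the linear mapping from credit to threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CreditTracker:
    """Smoothed acceptance rate mapped to a verification threshold.

    Every name and constant here is hypothetical, chosen for illustration.
    """

    momentum: float = 0.8       # weight on history (the "inertia")
    credit: float = 0.5         # smoothed acceptance rate in [0, 1]
    min_threshold: float = 0.2  # most lenient threshold, used at full credit
    max_threshold: float = 0.9  # strictest threshold, used at zero credit

    def update(self, accepted: int, proposed: int) -> float:
        """Fold one window's acceptance rate into the running credit."""
        if proposed <= 0:
            raise ValueError("proposed must be positive")
        rate = accepted / proposed
        self.credit = self.momentum * self.credit + (1.0 - self.momentum) * rate
        return self.credit

    def threshold(self) -> float:
        """Linear map: higher credit gives a lower (more lenient) threshold."""
        span = self.max_threshold - self.min_threshold
        return self.max_threshold - self.credit * span
```

Because adjacent windows tend to have similar acceptance rates, a high momentum keeps the threshold stable while still reacting when the draft model starts to drift. What the threshold actually gates in the real system is left open here.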
- Few-Shot Hybrid Incremental Learning: Continually Learning under Data Scarcity and Task Uncertainty
This paper proposes "Few-Shot Hybrid Incremental Learning (FSHIL)," a realistic new paradigm where data is scarce and task types (new classes, new domains, or both) appear stochastically. By introducing "Conditional Meta-Expanding Mixture of Experts (CME-MoE)" to reconcile stability and plasticity at the feature level and "Self-Expanding Prototype Classifier (SEPC)" to model multi-distribution boundaries at the classification layer, the method outperforms existing FSIL and HIL approaches across five datasets and three incremental settings.
- Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
This work reinterprets the state updates of Linear State Space Models (SSMs) as "performing test-time ridge regression on the entire history." It replaces the one-step gradient approximation in existing SSMs with the exact gain from Kalman filtering, and uses adaptive regularization and Chebyshev iterations to overcome the dual obstacles of low-precision numerical instability and parallel training. The resulting layer outperforms linear SSMs such as Mamba2 and Gated DeltaNet on short- and long-context tasks and on ImageNet.
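The contrast between a one-step gradient update and an exact gain can be made concrete with textbook recursive least squares. The notation is an assumption for illustration (state matrix \(S_t\), key \(k_t\), value \(v_t\), ridge weight \(\lambda\), step size \(\beta_t\)), and the fading-memory weighting suggested by the title is omitted, so the paper's exact formulation may differ. Ridge regression over the whole history has the closed form

\[
S_t = \arg\min_{S} \sum_{i=1}^{t} \lVert S k_i - v_i \rVert_2^2 + \lambda \lVert S \rVert_F^2
    = \Big(\sum_{i=1}^{t} v_i k_i^\top\Big)\Big(\sum_{i=1}^{t} k_i k_i^\top + \lambda I\Big)^{-1}.
\]

With \(P_t = \big(\sum_{i=1}^{t} k_i k_i^\top + \lambda I\big)^{-1}\), \(P_0 = \lambda^{-1} I\), and \(S_0 = 0\), the Sherman-Morrison identity gives an exact recursive form:

\[
g_t = \frac{P_{t-1} k_t}{1 + k_t^\top P_{t-1} k_t}, \qquad
S_t = S_{t-1} + (v_t - S_{t-1} k_t)\, g_t^\top, \qquad
P_t = P_{t-1} - g_t k_t^\top P_{t-1}.
\]

A delta-rule style update instead takes a single gradient step on the newest sample only:

\[
S_t = S_{t-1} + \beta_t\, (v_t - S_{t-1} k_t)\, k_t^\top.
\]

The two updates share the same prediction error \(v_t - S_{t-1} k_t\); they differ in the direction it is applied along, with the gain \(g_t\) accounting for all previously seen keys through \(P_{t-1}\).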
- Generalizable Video Quality Assessment via Weak-to-Strong Learning
Without relying on any human annotations, this work uses off-the-shelf VQA models as "weak teachers" to supervise a high-capacity Multimodal Large Language Model (MLLM) as the "strong student." The student then serves as the teacher in subsequent rounds. The final model matches the teachers' in-distribution performance and significantly surpasses all of them in out-of-distribution (OOD) scenarios, raising the overall OOD SRCC from 0.59 to 0.745.
- JUMP-Hand: Learning Joint-wise Uncertainty to Gate Mixture of View Experts for Multi-View 3D Hand Reconstruction
JUMP-Hand reformulates multi-view 3D hand reconstruction as a Mixture of Experts (MoE) problem where "each view is an expert," utilizing joint-wise, view-wise probabilistic uncertainty as an explicit gating signal. This signal drives both uncertainty-weighted triangulation in the coarse stage and uncertainty-gated cross-attention in the refinement stage, adaptively amplifying reliable views while suppressing noisy ones under severe occlusion, achieving SOTA results across three multi-view benchmarks.
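How per-view uncertainty can act as a gate is easiest to see with inverse-variance weighting. The sketch below is a simplification, not the paper's triangulation: it assumes each view already supplies a 3D estimate of the same joint in a shared world frame plus a scalar standard deviation, and the function name `fuse_joint` is hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def fuse_joint(
    estimates: Sequence[Point],
    sigmas: Sequence[float],
    eps: float = 1e-8,
) -> Point:
    """Fuse per-view estimates of one joint with inverse-variance weights.

    Assumption: estimates share a world frame, and sigmas[i] is the
    predicted standard deviation of estimates[i].
    """
    if not estimates or len(estimates) != len(sigmas):
        raise ValueError("need one sigma per estimate, and at least one view")
    # Confident views (small sigma) receive large weights.
    weights = [1.0 / (s * s + eps) for s in sigmas]
    total = sum(weights)
    fused = tuple(
        sum(w * p[axis] for w, p in zip(weights, estimates)) / total
        for axis in range(3)
    )
    return fused  # type: ignore[return-value]
```

A view that reports high uncertainty for an occluded joint contributes almost nothing to that joint while still contributing fully to joints it sees clearly, which is why the gating is applied per joint rather than per view.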
- ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding
Addressing two major bottlenecks in Video-LLM speculative decoding—"draft and target models waiting for each other" and "trade-off between speedup ratio and model alignment"—ParallelVLM implements both prefilling and decoding as draft/target parallel pipelines. It employs UV-Prune, an unbiased pruning method based on visual-text similarity variations (rather than attention scores), to expand the draft window. This achieves \(3.36\times\) and \(2.42\times\) lossless acceleration on LLaVA-OneVision-72B and Qwen2.5-VL-32B, respectively, while being training-free and plug-and-play.
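Why speculative decoding can be lossless is visible in a toy greedy version: the target model checks every drafted token, so the output equals what the target alone would produce. The sketch below is sequential and does not model ParallelVLM's parallel pipelines or UV-Prune; the function names and the `NextToken` callable type are assumptions for illustration.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(
    prefix: List[int],
    draft_next: NextToken,
    target_next: NextToken,
    window: int,
) -> List[int]:
    """Return the tokens committed by one draft-then-verify round."""
    # Draft phase: the small model proposes `window` tokens autoregressively.
    proposal: List[int] = []
    for _ in range(window):
        proposal.append(draft_next(prefix + proposal))

    # Verify phase: a real system scores all positions in one batched
    # target forward pass; this loop checks them one at a time for clarity.
    accepted: List[int] = []
    for token in proposal:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)

    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(
    prefix: List[int],
    draft_next: NextToken,
    target_next: NextToken,
    window: int,
    max_new: int,
) -> List[int]:
    """Repeat draft-then-verify rounds until `max_new` tokens are produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, window))
    return out[: len(prefix) + max_new]
```

Each round commits at least one target-approved token, so progress is guaranteed even when the draft is always wrong. A larger window only pays off when the draft stays aligned with the target, which is the trade-off the summary above refers to.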
- QuietPrune: Query-Guided Early Token Pruning for Vision-Language Models
QuietPrune proposes query-guided early pruning: visual tokens unrelated to the text query are pruned during the ViT forward process rather than after it. By utilizing a lightweight adapter initialized through an inverse transformation of the VLM projector, the text query is converted into a visual-domain [Q-CLS] token to provide guidance. Pruning is performed in a 2×2 semi-structured manner with redundant token aggregation. On Qwen3-VL and InternVL3, it reduces prefill latency by up to 19.0% while achieving 4.2% higher accuracy than existing late-pruning methods.
- Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference
ReMix inserts an iteratively refreshed "continuous mixed state" between the discrete "mask state \(\rightarrow\) token state" transitions in Diffusion Language Models (DLLMs). This allows multiple positions in parallel decoding to coordinate in continuous space before finalizing tokens. By applying a rejection rule that resets unstable positions to masks, the method achieves a 2–8\(\times\) inference speedup without training or performance degradation, and often improves accuracy.