📹 Video Understanding

🧪 ICML2025 · 4 paper notes


Fine-Grained Captioning of Long Videos through Scene Graph Consolidation: This paper proposes the SGVC framework, which achieves state-of-the-art zero-shot long video captioning performance while substantially reducing computational overhead compared to LLM-based methods. It parses segment-level video descriptions into scene graphs, iteratively consolidates them into a unified graph representation using the Hungarian algorithm, and generates video-level descriptions using a lightweight graph-to-text decoder.
MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition: MoMa injects Mamba's linear-complexity state space model (SSM) into a frozen CLIP Transformer through a scale-bias sequence modulation operation (SeqMod). This gives efficient global spatiotemporal modeling and reaches SOTA performance on multiple video recognition benchmarks at lower computational cost.
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation: ViLaMP proposes the Differential Distillation principle to achieve "mixed-precision" video processing through two mechanisms: hierarchical frame-level Differential Keyframe Selection (DKS) and patch-level Differential Feature Merging (DFM). In this paradigm, keyframes retain all visual tokens, while non-keyframes are compressed into a single token. This enables processing ultra-long videos of up to 10K frames (approximately 2.7 hours) on a single A100 GPU.
Unifying Specialized Visual Encoders for Video Language Models: MERV proposes a multi-encoder video representation method that integrates four visual encoders with different areas of expertise (DINOv2, ViViT, SigLIP, LanguageBind) into a single VideoLLM through spatio-temporal alignment and cross-attention fusion. It improves performance on video reasoning benchmarks by up to 4.62% compared to the baseline Video-LLaVA, validating the complementary strengths of different encoders.
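
To make the consolidation step in the SGVC note more concrete, here is a minimal sketch of matching the nodes of two segment-level scene graphs with the Hungarian algorithm. It is an illustration under stated assumptions, not the paper's implementation: the node labels, the similarity values, the 0.5 merge threshold, and the function names `hungarian` and `match_nodes` are all hypothetical.

```python
def hungarian(cost):
    """Minimum-cost one-to-one assignment (Kuhn-Munkres, O(n^2 * m)).

    cost is an n x m list of lists with n <= m.
    Returns a list where entry i is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    if n > m:
        raise ValueError("expected at most as many rows as columns")
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j] = row currently matched to column j (1-based)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path that was found.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def match_nodes(nodes_a, nodes_b, sim, threshold=0.5):
    """Pair nodes across two graphs; keep only pairs similar enough to merge."""
    cost = [[1.0 - s for s in row] for row in sim]  # high similarity = low cost
    assignment = hungarian(cost)
    merged = [
        (nodes_a[i], nodes_b[j])
        for i, j in enumerate(assignment)
        if sim[i][j] >= threshold
    ]
    return assignment, merged


# Hypothetical nodes and similarity scores (rows: graph A, columns: graph B).
nodes_a = ["man", "dog", "ball"]
nodes_b = ["dog", "person", "frisbee"]
sim = [
    [0.10, 0.90, 0.00],
    [0.95, 0.20, 0.10],
    [0.05, 0.00, 0.40],
]
assignment, merged = match_nodes(nodes_a, nodes_b, sim)
print(assignment)  # [1, 0, 2]
print(merged)      # [('man', 'person'), ('dog', 'dog')]
```

In this toy case "ball" and "frisbee" are assigned to each other but fall below the assumed threshold, so they would remain separate nodes in the consolidated graph. Repeating the pairing across successive segments would give the iterative consolidation the note describes.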
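
The "mixed-precision" idea in the ViLaMP note comes down to simple token arithmetic: keyframes keep every visual token and each non-keyframe contributes exactly one. The sketch below works through that count. The keyframe count (32) and the tokens per frame (196) are illustrative assumptions, not figures from the paper, and the function names are hypothetical.

```python
def mixed_precision_token_count(num_frames, num_keyframes, tokens_per_keyframe):
    """Tokens when keyframes keep all tokens and every other frame keeps one."""
    if not 0 <= num_keyframes <= num_frames:
        raise ValueError("num_keyframes must be between 0 and num_frames")
    non_keyframes = num_frames - num_keyframes
    return num_keyframes * tokens_per_keyframe + non_keyframes


def uniform_token_count(num_frames, tokens_per_frame):
    """Tokens when every frame keeps all of its tokens."""
    return num_frames * tokens_per_frame


# Assumed values for illustration only.
frames = 10_000
keyframes = 32
tokens_per_frame = 196

mixed = mixed_precision_token_count(frames, keyframes, tokens_per_frame)
full = uniform_token_count(frames, tokens_per_frame)
print(mixed, full, round(full / mixed, 1))  # 16240 1960000 120.7
```

Under these assumed numbers the mixed-precision scheme needs roughly 120 times fewer visual tokens than keeping every token of every frame, which is the kind of reduction that makes a 10K-frame input plausible on a single GPU.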