
📹 Video Understanding

💬 ACL2025 · 8 paper notes


🔥 Top topics: Object Tracking ×3

A Thousand Words Paint a Picture: Multimodal Goal Tracking for Grounded Social Intelligence: This paper proposes a multimodal goal tracking framework that reasons about the implicit goals of participants in social situations by integrating visual and linguistic cues, thereby enhancing the model's understanding of social contexts (i.e., "grounded social intelligence").
Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models: This study presents the first systematic exploration of selection bias in multiple-choice question answering (MCQA) with Video Language Models (VLMs). By analyzing bias sources through task decomposition, it proposes BOLD, a post-processing calibration technique that reduces bias while simultaneously improving model performance.
Attention-Seeker: Dynamic Self-Attention Scoring for Unsupervised Key-Frame Extraction: This paper proposes Attention-Seeker, an unsupervised method that dynamically analyzes the attention score distribution in the self-attention layers of Transformer models to extract the most representative key-frames from videos. It outperforms existing unsupervised methods on multiple video summarization benchmarks.
From Teacher to Student: Tracking Memorization Through Model Distillation: This work systematically investigates the impact of knowledge distillation (KD) on the memorization behavior of large language models, finding that distillation not only compresses the model but also significantly reduces the risk of verbatim memorization of training data. Reverse KL distillation (RKLD/MiniLLM) reduces the memorization ratio from 65.4% under SFT to as low as 6.0%.
Generative Frame Sampler for Long Video Understanding: This paper proposes GenS, a generative frame sampling module built on a VideoLLM. Given a question, it outputs the relevant frame intervals and their confidence scores in natural language format. As a plug-and-play module, it consistently improves multiple VideoLLMs by 2-4 points on LongVideoBench, MLVU, and HourVideo.
Improving Dialogue State Tracking through Combinatorial Search for In-Context Examples: This paper proposes CombiSearch, a method that employs combinatorial scoring to select the optimal combination of in-context examples for Dialogue State Tracking (DST). Using only 5% of the training data, it outperforms all baselines trained on 100% of the data. Under oracle settings, its Joint Goal Accuracy (JGA) upper bound is 12% higher than that of traditional methods.
RAVEN: Robust Advertisement Video Violation Temporal Grounding via Reinforcement Reasoning: This paper proposes the RAVEN framework, which integrates curriculum reinforcement learning with multimodal LLMs. Through hierarchical reward mechanisms and progressive training strategies, RAVEN achieves precise temporal grounding and category prediction of advertisement video violations, unlocking emergent reasoning capabilities without requiring explicit reasoning annotation data.
Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs: Based on the observation that attention scores in Video-LLMs are sparse, this paper proposes the Sparse-to-Dense (StD) decoding strategy. A top-K sparse attention model serves as the draft model and rapidly generates candidate tokens, which the dense model then verifies in parallel, achieving up to 1.94× lossless acceleration without additional training or architectural modifications.
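
The draft-then-verify idea behind the Sparse-to-Dense note can be illustrated with a toy greedy decoder. This is only a sketch under stated assumptions, not the paper's implementation: `toy_dense` and `toy_draft` are hypothetical stand-ins for the dense model and the top-K sparse-attention draft model, `draft_then_verify` and its parameter `k` are illustrative names, and decoding is assumed to be greedy. It shows why the output is lossless: every emitted token is one the dense model itself would have chosen, and the draft only determines how many tokens are accepted per verification pass.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to the next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def draft_then_verify(
    draft: NextToken, dense: NextToken, prompt: List[int], max_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy draft-then-verify decoding (toy version).

    Returns the decoded tokens and the number of verification passes.
    """
    out = list(prompt)
    target_len = len(prompt) + max_new
    verify_passes = 0
    while len(out) < target_len:
        # 1) The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2) The dense model checks the proposal. A real system would score
        #    all k + 1 positions in one parallel forward pass; this toy
        #    simulates that pass with a loop and counts it once.
        verify_passes += 1
        accepted: List[int] = []
        for i in range(k + 1):
            expected = dense(out + accepted)
            accepted.append(expected)  # always the dense model's own token
            if i == k or proposal[i] != expected:
                break  # bonus token reached, or first mismatch corrected
        out.extend(accepted)
    return out[:target_len], verify_passes


def toy_dense(tokens: List[int]) -> int:
    """Hypothetical dense model: an arbitrary deterministic rule."""
    return (tokens[-1] * 3 + len(tokens)) % 7


def toy_draft(tokens: List[int]) -> int:
    """Hypothetical draft model: agrees with toy_dense except on some prefixes."""
    guess = toy_dense(tokens)
    return (guess + 1) % 7 if len(tokens) % 3 == 0 else guess


if __name__ == "__main__":
    reference = greedy_decode(toy_dense, [1], 20)
    tokens, passes = draft_then_verify(toy_draft, toy_dense, [1], 20, k=4)
    print("identical to dense-only decoding:", tokens == reference)
    print("verification passes:", passes, "vs. 20 dense-only steps")
```

In this toy setting the speedup comes from needing fewer dense verification passes than generated tokens; how much is gained in practice depends on how often the draft agrees with the dense model and on the relative cost of the two.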