📹 Video Understanding
💬 ACL2026 · 11 paper notes
- ArrowGEV: Grounding Events in Video via Learning the Arrow of Time
This paper proposes ArrowGEV, a reinforcement learning framework inspired by the physical "arrow of time". By distinguishing temporally sensitive events from temporally insensitive ones, it models temporal directionality in videos, improving VLMs' event localization accuracy and temporal understanding.
- Distorted or Fabricated? A Survey on Hallucination in Video LLMs
This paper presents the first systematic taxonomy of hallucination in Video Large Language Models (Vid-LLMs). It proposes a mechanism-driven classification that separates "dynamic distortion" (errors in spatiotemporal relations and referential consistency) from "content fabrication" (driven by statistical priors and audio-visual conflicts), and surveys evaluation benchmarks, mitigation strategies, and root-cause analyses.
- GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
This paper presents GameplayQA, an end-to-end benchmarking framework built on multi-player 3D game videos. Through dense timeline annotation (1.22 labels/second) and a structured distractor taxonomy, it systematically evaluates multimodal large language models (MLLMs) on perception and reasoning in decision-dense, multi-view synchronized scenarios, revealing a substantial gap between frontier models and human performance.
- HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
This paper proposes HERMES, a training-free framework for efficient streaming video understanding grounded in a mechanistic analysis of layer-wise attention preferences in MLLMs. KV caches are conceptualized as a hierarchical memory system — shallow layers as sensory memory, middle layers as working memory, and deep layers as long-term memory — enabling real-time streaming video QA with a 68% reduction in video tokens while maintaining or improving accuracy, achieving TTFT latency below 30ms, which is 10× faster than the previous SOTA.
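The three-tier memory analogy above can be sketched as a toy cache. Everything below — the class name, tier sizes, and salience threshold — is an illustrative assumption, not the paper's actual mechanism:

```python
from collections import deque

class HierarchicalKVCache:
    """Toy sketch of the hierarchical-memory idea: route incoming
    per-frame KV entries into three tiers with different retention
    policies (sizes and threshold are illustrative)."""

    def __init__(self, sensory_size=4, working_size=16):
        self.sensory = deque(maxlen=sensory_size)   # shallow layers: only the newest frames
        self.working = deque(maxlen=working_size)   # middle layers: sliding window
        self.longterm = []                          # deep layers: sparse, salience-gated

    def add(self, frame_kv, salience):
        self.sensory.append(frame_kv)
        self.working.append(frame_kv)
        if salience > 0.8:                          # illustrative gate for long-term retention
            self.longterm.append(frame_kv)

    def tokens_kept(self):
        return len(self.sensory) + len(self.working) + len(self.longterm)
```

For a long stream, the cache retains far fewer entries than the number of frames seen, which is the intuition behind the reported token reduction.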
- Preference Estimation via Opponent Modeling in Multi-Agent Negotiation
This paper proposes a preference estimation method that integrates LLM-extracted natural language preference signals into a Bayesian opponent modeling framework. In multi-party, multi-issue negotiations, it fuses qualitative cues and quantitative bid information via a linguistic likelihood function, improving the full agreement rate (FAR) from 37% to 62%.
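The fusion of qualitative and quantitative signals described above amounts to repeated Bayesian updating over preference hypotheses. A minimal sketch, with entirely illustrative hypotheses and likelihood numbers (the paper's linguistic likelihood function is not reproduced here):

```python
def bayes_update(prior, likelihoods):
    """One Bayesian update over discrete preference hypotheses:
    posterior ∝ prior × likelihood, then renormalize."""
    post = {h: prior[h] * likelihoods[h] for h in prior}
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

# Hypotheses: which issue the opponent values most (illustrative)
prior = {"price": 1 / 3, "delivery": 1 / 3, "quality": 1 / 3}
# Quantitative evidence from observed bids
bid_lik = {"price": 0.7, "delivery": 0.2, "quality": 0.1}
# Qualitative evidence from an LLM's reading of the opponent's message
lang_lik = {"price": 0.6, "delivery": 0.3, "quality": 0.1}

posterior = bayes_update(bayes_update(prior, bid_lik), lang_lik)
```

Because updates multiply, an agreeing linguistic cue sharpens the bid-based estimate rather than merely averaging with it.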
- Probing for Reading Times
This paper probes the ability of representations from individual layers of language models to predict reading times, finding that early-layer representations outperform surprisal on early fixation measures, while surprisal performs better on late measures, and that the best predictor varies by language and metric.
- RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora
This paper proposes the RARE framework, which tracks cross-document redundancy by decomposing documents into atomic facts and introduces CRRF (Criterion-separated Reciprocal Rank Fusion) to stabilize multi-criteria LLM judgments. The framework constructs the RedQA benchmark over high-redundancy enterprise corpora in finance, legal, and patent domains, revealing that mainstream retrievers suffer a dramatic collapse in PerfRecall@10 from 66.4% to 5.0–27.9% under 4-hop high-overlap settings.
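CRRF's exact formulation is not given here; the sketch below shows plain Reciprocal Rank Fusion applied to per-criterion rankings, one plausible reading of "criterion-separated RRF". Document IDs and criteria are made up:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1/(k + rank(d)).
    k=60 is the conventional default smoothing constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Each LLM judging criterion produces its own ranking (illustrative)
by_relevance  = ["d2", "d1", "d3"]
by_factuality = ["d1", "d2", "d3"]
by_coverage   = ["d2", "d3", "d1"]
fused = rrf_fuse([by_relevance, by_factuality, by_coverage])
```

Fusing ranks rather than raw scores is what stabilizes the aggregate against noisy, differently-scaled per-criterion LLM judgments.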
- Saber: Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for DLMs
This paper proposes Saber, a training-free sampling algorithm for diffusion language models (DLMs) that achieves an average Pass@1 improvement of 1.9% on code generation together with a 251.4% inference speedup. It combines two strategies: adaptive acceleration (dynamically adjusting the degree of parallel decoding based on how much context is already established) and backtracking-enhanced remasking (revoking tokens falsified by newly established context).
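The two strategies can be sketched in one toy denoising step. The threshold schedule, the specific thresholds, and the placeholder tokens below are invented for illustration and are not Saber's actual settings:

```python
MASK = None  # a masked (not-yet-decoded) position

def saber_step(seq, confidences, base_thresh=0.9, remask_thresh=0.3):
    """One toy denoising step:
    - adaptive acceleration: the more positions are already decoded
      (established context), the lower the confidence bar for
      committing new positions in parallel;
    - backtracking remasking: committed tokens whose confidence has
      collapsed under newer context are reverted to MASK."""
    decoded = sum(tok is not MASK for tok in seq)
    thresh = base_thresh * (1 - decoded / len(seq))  # bar relaxes with context
    out = list(seq)
    for i, (tok, conf) in enumerate(zip(seq, confidences)):
        if tok is MASK and conf >= thresh:
            out[i] = f"tok{i}"          # commit (placeholder token)
        elif tok is not MASK and conf < remask_thresh:
            out[i] = MASK               # backtrack: revoke a falsified token
    return out
```

Early steps are conservative (few positions clear the high bar); later steps decode many positions at once, which is where the speedup would come from.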
- VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analysis
This paper proposes VC-Inspector, a reference-free video caption evaluation metric built on lightweight open-source multimodal models (Qwen2.5-VL 3B/7B). It generates training data via a controllable factual-error synthesis pipeline, achieves a human-judgment correlation of \(\tau_b = 42.58\) on VATEX-Eval, outperforming the GPT-4o-based G-VEval (\(\tau_b = 39.40\)), and reaches 99.6% accuracy on hallucination-detection benchmarks.
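The reported \(\tau_b\) is Kendall's tau-b, a standard rank-correlation statistic with a tie correction: \(\tau_b = (C - D)/\sqrt{(n_0 - n_1)(n_0 - n_2)}\), where \(C\)/\(D\) count concordant/discordant pairs and \(n_1, n_2\) count pairs tied in each variable. A self-contained implementation (the function is generic, not VC-Inspector's code):

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b between paired score lists, with tie correction."""
    c = d = tx = ty = n0 = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        n0 += 1                      # total number of pairs
        dx, dy = xi - xj, yi - yj
        if dx == 0:
            tx += 1                  # pair tied in x
        if dy == 0:
            ty += 1                  # pair tied in y
        if dx * dy > 0:
            c += 1                   # concordant
        elif dx * dy < 0:
            d += 1                   # discordant
    return (c - d) / sqrt((n0 - tx) * (n0 - ty))
```

Applied to metric scores versus averaged human ratings, a value of 42.58 corresponds to \(\tau_b \approx 0.43\) on the unit scale.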
- ViLL-E: Video LLM Embeddings for Retrieval
This paper proposes ViLL-E, the first unified Video LLM architecture supporting both text generation and embedding generation. Through a three-stage joint generative-contrastive training strategy and an adaptive KV-Former embedding head, ViLL-E approaches expert models on video retrieval and temporal grounding while maintaining competitive performance on VideoQA.
- VISTA: Verification In Sequential Turn-based Assessment
VISTA proposes a multi-turn dialogue factuality assessment framework based on claim-level decomposition and sequential consistency tracking, subdividing unverifiable content into four categories: subjective, contradicted, lacking evidence, and abstention. It significantly outperforms FActScore and LLM-as-Judge baselines across four dialogue benchmarks and eight LLMs.