📹 Video Understanding
🔬 ICLR2026 · 47 paper notes
📌 Same area in other venues: 📷 CVPR2026 (178) · 🧪 ICML2026 (17) · 🤖 AAAI2026 (27) · 🧠 NeurIPS2025 (39) · 📹 ICCV2025 (56)
🔥 Top topics: Object Tracking ×4 · LLM ×3 · Compression ×3 · Reasoning ×2 · Question Answering ×2
- A Training-Free Framework for Long Video Understanding via Video-Query-Options Similarity
-
Because hour-long videos cannot fit into the context window of Multi-modal Large Language Models (MLLMs), this paper proposes a training-free input-side framework. It uses a video-text retrieval model to score the relevance of video segments, followed by Adaptive Frame Sampling (AFS) and Dynamic Resolution Allocation (DRA). The relevance estimate is refined by incorporating candidate answers generated by the MLLM itself into the retrieval query (VQOS). The framework achieves an average improvement of 3 to 5 points for LLaVA-Video and Qwen2.5-VL across five long video benchmarks.
- A.I.R.: Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering
-
This paper proposes A.I.R., a training-free frame selection framework that is adaptive, iterative, and reasoning-driven. It addresses a dual dilemma in VideoQA: lightweight models such as CLIP give inaccurate similarity scores, while VLM-based analysis is prohibitively expensive. A two-stage strategy (GMM-based adaptive initial sampling followed by iterative fine-grained VLM analysis) tackles both. Even in the worst case, it analyzes only 72 frames (vs. the 128-frame baseline) while significantly improving performance across multiple long-video benchmarks.
- ARFlow: Auto-regressive Optical Flow Estimation for Arbitrary-Length Videos via Progressive Next-Frame Forecasting
-
ARFlow transforms multi-frame optical flow estimation from "one-time estimation within a fixed-length clip" to "step-by-step auto-regressive prediction of next-frame flow." By using historical flow to initialize current estimates and fusing short-term and long-term motion cues through multi-stride temporal forecasting, it improves accuracy on benchmarks like Sintel, KITTI, and Spring with nearly constant memory usage.
- AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
-
AVoCaDO, based on Qwen2.5-Omni, undergoes SFT using 107K high-quality temporally aligned audiovisual captioning data, followed by GRPO reward fine-tuning focused on key events, dialogue, and length. This enables the 7B audiovisual captioner to outperform existing open-source models on multiple benchmarks, with some metrics matching or exceeding the Gemini-2.5 series.
- Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models
-
This paper first reveals that current MLLMs fail to understand intuitive physics dynamics for continua (such as fluids) using two "low-level" diagnostic tasks: Next Frame Selection (NFS) and Temporal Coherence Verification (TCV). It then proposes Scene Dynamic Field (SDF)—mapping particle velocities calculated by a physics simulator into blue gradient maps as visual prompts. Combined with multi-task fine-tuning, this improves Qwen2-VL / GLM-4.1V performance on fluid tasks by up to 20.7%, with successful transfer to unseen physical domains like cloth, sand, and smoke.
- Cambrian-S: Towards Spatial Supersensing in Video
-
This paper proposes "spatial supersensing," a paradigm shift from passive task-driven sensing to active world modeling. It first shows via the VSI-SUPER benchmark that brute-force context expansion (including Gemini-2.5 and the self-trained Cambrian-S) fails completely on spatial recall and counting tasks in arbitrarily long videos. It then introduces a self-supervised "Latent Frame Prediction" head that uses prediction error ("surprise") as a control signal to drive memory management and event segmentation, significantly outperforming strong commercial baselines on long-video spatial tasks.
- CaReBench: A Fine-grained Benchmark for Video Captioning and Retrieval
-
CaReBench utilizes 1,000 manually annotated videos—each with captions exceeding 200 words and explicitly split into spatial and temporal versions—to establish a benchmark capable of simultaneously evaluating fine-grained video captioning and retrieval. It introduces two new metrics, ReBias and CapST, to quantify the spatiotemporal bias of VLMs, and provides a two-stage SFT baseline, CARE, which unifies captioning and retrieval into a single MLLM.
- Divid: Disentangled Spatial-Temporal Modeling within LLMs for Temporally Grounded Video Understanding
-
Divid explicitly disentangles temporal and spatial branches within the Video LLM decoder. It utilizes temporal attention to select high-resolution keyframes for queries and fuses information via a token-level soft-router. Combined with the 559K timestamp-supervised dataset TempGCap, it improves both accuracy and computational efficiency in temporal grounding and evidenced VideoQA.
- EAST: Early Action Prediction Sampling Strategy with Token Masking
-
EAST introduces a training strategy that randomly samples the observation ratio \(\rho\), allowing a single model to perform early action prediction across all observation ratios. Combined with a "dual classification compound loss (present + future)" and "difference masking" that discards half of the tokens based on temporal redundancy, it outperforms previous state-of-the-art methods by 10.1, 7.7, and 3.9 percentage points on NTU60, SSv2, and UCF101, respectively, while halving training memory and time.
- EgoBrain: Synergizing Minds and Eyes For Human Action Understanding
-
EgoBrain constructs the first large-scale dataset synchronizing egocentric video with 32-channel EEG for daily actions and proposes Brain-TIM, which utilizes a time-aware Transformer to fuse visual and brain signals, improving the visual baseline from 63.40% to 66.70% in cross-subject and cross-scene 29-category action recognition.
- Exposing and Defending the Achilles' Heel of Video Mixture-of-Experts
-
This paper systematically deconstructs the component-level adversarial vulnerabilities of Video MoE for the first time. It proposes the J-TLGA attack, which exposes the "Achilles' Heel" by first directing the router toward the weakest experts and then jointly perturbing both the router and experts. Accompanying this is J-TLAT, a hierarchical adversarial training method that repairs these weaknesses layer by layer, significantly enhancing robustness while maintaining over 60% inference computational savings.
- FARTrack: Fast Autoregressive Visual Tracking with High Performance
-
FARTrack "slims down" the autoregressive generative tracking framework of the ARTrack series. It utilizes Task-Specific Self-Distillation to compress model depth layer-by-layer and Inter-frame Autoregressive Sparsification to prune redundant background tokens in templates. Achieving 70.6% AO on GOT-10k while reaching speeds of 343 FPS on GPU and 121 FPS on CPU, it balances high performance with real-time efficiency.
- FlashVID: Efficient Video Large Language Models via Training-free Tree-Based Spatiotemporal Token Merging
-
FlashVID is proposed as a training-free inference acceleration framework for Video Large Language Models (VLLMs). By jointly modeling spatial and temporal redundancy through Tree-based Spatiotemporal Token Merging (TSTM), it maintains 99.1% of LLaVA-OneVision's performance while retaining only 10% of visual tokens. Furthermore, it enables a 10x increase in input frame capacity for Qwen2.5-VL.
- FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding
-
This paper proposes FLoC, a visual token compression framework based on the facility location function. Using submodular optimization, it rapidly selects a subset of tokens that is both representative and diverse under a given budget. FLoC achieves training-free, model-agnostic, and query-independent token compression for long video understanding.
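As a rough illustration of facility-location selection in general (not FLoC's actual implementation), the sketch below greedily maximizes f(S) = Σ_i max_{j∈S} sim(i, j) over token embeddings. The cosine similarity, the clipping of negative similarities to zero, and the function names `cosine` and `facility_location_greedy` are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def facility_location_greedy(tokens, budget):
    """Greedily pick up to `budget` token indices that maximize
    f(S) = sum_i max_{j in S} sim(i, j), with negative similarities clipped to 0."""
    n = len(tokens)
    sim = [[cosine(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]
    coverage = [0.0] * n  # best similarity of each token to the selected set so far
    selected = []
    for _ in range(min(budget, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much adding j improves every token's coverage.
            gain = sum(max(sim[i][j] - coverage[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        coverage = [max(coverage[i], sim[i][best_j]) for i in range(n)]
    return selected


# Two near-duplicate pairs: a budget of 2 keeps one token from each pair.
tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
picked = facility_location_greedy(tokens, budget=2)
```

Because the objective is monotone submodular, this plain greedy loop carries the classic (1 − 1/e) approximation guarantee; it costs O(n² · budget) here, so a practical system would need a faster variant.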
- FOCUS: Efficient Keyframe Selection for Long Video Understanding
-
FOCUS reformulates the task of "selecting the most query-relevant video frames under a strict token budget" as a Combinatorial Pure Exploration (CPE) problem in multi-armed bandits. By treating short video segments as arms and adaptively allocating the scoring budget using empirical means and Bernstein confidence radii, it significantly improves long video QA accuracy while viewing less than 2% of total frames.
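The bandit view can be sketched as follows, under stated assumptions: each segment is an arm, pulling an arm scores one frame from it, and the budget goes to the arm with the highest optimistic estimate (empirical mean plus an empirical Bernstein radius). The radius uses one common textbook form with scores assumed to lie in [0, 1]; the constants, the optimistic allocation rule, and the names `bernstein_radius`, `select_segments`, and `score_frame` are illustrative and not taken from the paper.

```python
import math
import random


def bernstein_radius(values, delta=0.05, value_range=1.0):
    """Empirical Bernstein confidence radius for scores in [0, value_range]."""
    n = len(values)
    if n < 2:
        return float("inf")  # too few samples to estimate a variance
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * variance * log_term / n) + 3.0 * value_range * log_term / n


def select_segments(score_frame, num_segments, budget, top_k, seed=0):
    """Spend `budget` frame-scoring calls across segments, then return the
    indices of the `top_k` segments with the highest empirical mean score."""
    rng = random.Random(seed)
    samples = [[] for _ in range(num_segments)]

    def upper_bound(arm):
        values = samples[arm]
        if len(values) < 2:
            return float("inf")  # force at least two pulls per arm
        return sum(values) / len(values) + bernstein_radius(values)

    for _ in range(budget):
        arm = max(range(num_segments), key=upper_bound)
        samples[arm].append(score_frame(arm, rng))

    means = [sum(v) / len(v) if v else 0.0 for v in samples]
    ranked = sorted(range(num_segments), key=lambda a: means[a], reverse=True)
    return sorted(ranked[:top_k])
```

Here `score_frame(arm, rng)` stands in for whatever relevance scorer is used on a randomly drawn frame of segment `arm`. Low-variance segments get tight radii quickly, so the budget concentrates on segments whose ranking is still uncertain.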
- From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning
-
This work identifies a vicious cycle in slot-based object-centric learning between the encoder (producing sharp but noisy attention maps) and the decoder (producing spatially consistent but blurry reconstruction masks). It proposes synergistic contrastive learning objectives and a slot regularization warmup strategy to transform this into a virtuous cycle, significantly improving object discovery performance on MOVi and YouTube-VIS.
- Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments
-
This paper introduces MicroG-4M, the first video benchmark for spatial-temporal and semantic understanding of human activities in microgravity (zero-gravity space) environments. It contains 4,759 real/cinematic clips, 13,261 action annotations, 1,238 captions, and 7,000+ Q&A pairs, covering fine-grained action recognition, video captioning, and visual question answering. The proposed MicroG-Bench systematically quantifies the significant performance collapse of Earth-trained models in space scenarios.
- HiTeA: Hierarchical Temporal Alignment for Training-Free Long-Video Temporal Grounding
-
HiTeA utilizes hierarchical temporal decomposition across event-scene-action levels to generate multi-granularity candidates for long videos. It then employs frozen VideoCLIP and Qwen2.5-VL for query matching and candidate refinement, significantly improving long-video temporal grounding without any task-specific training.
- IF-VidCap: Can Video Caption Models Follow Instructions?
-
This paper proposes IF-VidCap, the first instruction-following evaluation benchmark for "controllable video captioning," featuring 1,400 composite instructions with an average of 6 constraints each. Testing 26 MLLMs under a systematic automatic evaluation protocol with two dimensions (format correctness and content correctness), the paper finds that models specialized for dense captioning actually underperform general MLLMs under instruction constraints.
- Invert4TVG: A Temporal Video Grounding Framework with Inversion Tasks Preserving Action Understanding Ability
-
To address the "action understanding degradation" caused by optimizing only for IoU in Temporal Video Grounding (TVG) models, this paper inverts the input and output of the TVG task to construct three Invert-TVG auxiliary tasks (Verb Completion / Action Recognition / Video Description) that share the same annotations. These tasks are trained alternately with low probability within the GRPO reinforcement learning framework, achieving SOTA localization accuracy while preserving action semantic understanding.
- Language-guided Open-world Video Anomaly Detection under Weak Supervision
-
This paper proposes LaGoVAD, a language-guided open-world video anomaly detection paradigm. By modeling the anomaly definition as a random variable supplied in natural language, and combining it with dynamic video synthesis and contrastive learning regularization, it achieves zero-shot SOTA performance across seven datasets.
- Let's Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding
-
The paper proposes a new task called "Category Splitting," which discovers latent compositional structures within video classifier weights to split coarse-grained action categories into fine-grained sub-categories under zero-shot conditions, requiring no retraining or additional data.
- LLaVAction: Evaluating and Training Multi-modal Large Language Models for Action Understanding
-
This paper reconstructs EPIC-KITCHENS-100 into a benchmark that rigorously tests fine-grained action discrimination (EPIC-KITCHENS-100-MQA) by using "expert action recognition models to select hard distractors." It proposes LLaVAction—which strengthens visual information utilization via action tokens and a two-stage structured output—enabling general video MLLMs to outperform GPT-4o by 21 points in egocentric action recognition and achieve multiple new SOTA results.
- Measure Twice, Cut Once: A Semantic-Oriented Approach to Video Temporal Localization with Video LLMs
-
MeCo abandons the mainstream paradigm of "letting Video LLMs directly output boundary timestamps" in favor of a semantic-driven approach using structured tokens + query-focused captioning + contrastive grounding. By reframing video temporal localization as "understanding semantic structure before cutting segments," it consistently outperforms timestamp-generation methods across 9 tasks.
- Memento: Toward an All-Day Proactive Assistant for Ultra-Long Streaming Video
-
Memento utilizes "Dynamic Memory + Query-related Memory Selection + Step-Aware Memory Attention" to liberate online video LLMs from the predicament where "tokens accumulate until OOM within minutes." It achieves bounded memory usage and all-day proactive assistant capabilities on ultra-long video streams of up to 7 hours.
- OmniCVR: A Benchmark for Omni-Composed Video Retrieval with Vision, Audio, and Text
-
To address the neglect of audio in existing "Composed Video Retrieval" (CoVR) benchmarks, this paper constructs OmniCVR—the first large-scale CoVR benchmark treating vision, audio, and text as first-class modalities (50K triplets / 5K gold standard test set). It proposes AudioVLM2Vec, which converts audio into textual descriptions for VLM embedding, boosting R@1 from 12.4 to 77.2 on audio-centric queries.
- OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding
-
This paper extends classic Spatio-Temporal Video Grounding (STVG), which grounds only a single target, to OmniSTVG, which grounds all targets (including interacting objects) mentioned in a text query. It proposes the first large-scale benchmark, BOSTVG (10,018 videos, 287 categories, 1–10 targets), and a DETR-based method, OmniTube, which outperforms existing STVG methods adapted for this task across all metrics.
- Point Prompting: Counterfactual Tracking with Video Diffusion Models
-
This paper discovers that pre-trained image-conditioned video diffusion models possess inherent "zero-shot point tracking" capabilities. By painting a conspicuous red dot on the target point in the first frame and regenerating subsequent frames using SDEdit, the red dot is propagated through each frame to trace a trajectory. Combined with "counterfactual enhancement using the original frame as a negative prompt," this method outperforms all zero-shot baselines on TAP-Vid, approaches self-supervised methods, and can track through occlusions.
- PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
-
To address the inefficiency caused by excessive visual tokens in Video LLMs, PPLLaVA uses a "prompt-video relevance map" calculated by CLIP as a dynamic 3D convolution kernel to compress tokens. This approach shrinks the visual sequence to as little as 1/18 of its original length while preserving key information relevant to user instructions, achieving both speedup and performance gains across seven video understanding benchmarks.
- Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions
-
Addressing the issue of "when exactly to answer in an online streaming video"—a question often neglected by offline evaluations—this paper proposes the Thinking-QwenVL framework. It utilizes an Active Thought Decision Maker (ATDM) that externalizes progress \(\rho\) and confidence \(c\) to align the response timing with the "first sufficient evidence" moment \(t^\star\). Additionally, it maintains global causal states within token budgets through Hierarchical Progressive Semantic Integration (HPSI) tokens propagated across clips, improving the StreamingBench SOTA from 67.63% to 71.60%.
- QueryStream: Advancing Streaming Video Understanding with Query-Aware Pruning and Proactive Response
-
QueryStream integrates user queries directly into token pruning and response scheduling for streaming video. It uses Query-aware Differential Pruning (QDP) to filter irrelevant or redundant visual tokens and employs RTAR to proactively trigger the Video-LLM at "relevant and informative" moments. This approach matches or exceeds strong online baselines while retaining only 30%-57% of tokens.
- RIVER: A Real-Time Interaction Benchmark for Video LLMs
-
RIVER Bench decomposes the online interaction capabilities of video LLMs into three categories: recalling the past, understanding the present, and proactively responding after waiting for future events. By utilizing timestamped QA and response timing metrics, it demonstrates that traditional offline Video LLMs, despite performing well in offline QA, significantly lack memory and timing judgment in authentic streaming interactions. Long-short term memory and specialized proactive training can yield substantial improvements.
- ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
-
ScaleLong proposes the first benchmark that embeds questions across four timescales—Clip, Shot, Event, and Story—into the same long video. This allows for a direct comparison of MLLM capabilities across different temporal granularities while keeping content fixed, revealing a consistent U-shaped performance curve (high at both ends, collapsed in the middle) across 23 models.
- SPIKE-RL: Video-LLMs Meet Bayesian Surprise
-
This paper quantifies unexpected moments in videos into an interpretable score using "Bayesian Surprise." By tracking the KL divergence of a Video-LLM's belief distribution regarding "what happens next" before and after seeing new frames, it locates surprise segments and allocates more frame budget to these key moments via surprise-weighted sampling. Furthermore, it employs GRPO (SPIKE-RL) to optimize belief hypotheses using video captioning quality as a reward, achieving consistent improvements across five downstream video understanding tasks.
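A minimal sketch of the two ideas in this summary, assuming beliefs are discrete probability vectors over a fixed set of "what happens next" hypotheses: surprise is computed as KL(posterior ‖ prior), and a frame budget is then split across segments in proportion to surprise. The KL direction, the per-segment minimum, the rounding scheme, and the names `kl_divergence` and `allocate_frames` are assumptions for illustration, not details from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same hypotheses."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def allocate_frames(surprises, total_frames, min_per_segment=1):
    """Split `total_frames` across segments in proportion to non-negative
    surprise scores, guaranteeing `min_per_segment` frames per segment."""
    n = len(surprises)
    remaining = total_frames - min_per_segment * n
    if remaining < 0:
        raise ValueError("budget too small for the per-segment minimum")
    total = sum(surprises)
    if total <= 0:
        surprises, total = [1.0] * n, float(n)  # no surprise anywhere: uniform split
    allocation, cumulative, previous_edge = [], 0.0, 0
    for i, s in enumerate(surprises):
        cumulative += s
        # Round cumulative shares so the parts always sum to `remaining`.
        edge = remaining if i == n - 1 else round(remaining * cumulative / total)
        edge = min(max(edge, previous_edge), remaining)
        allocation.append(min_per_segment + edge - previous_edge)
        previous_edge = edge
    return allocation


# Belief before and after seeing new frames, for three consecutive segments.
prior = [0.5, 0.5]
posteriors = [[0.5, 0.5], [0.95, 0.05], [0.7, 0.3]]
surprises = [kl_divergence(post, prior) for post in posteriors]
frames = allocate_frames(surprises, total_frames=16)
```

In this toy run the second segment, where the belief shifts most, receives the largest share of the 16 frames, while the unsurprising first segment keeps only the minimum.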
- Steering and Rectifying Latent Representation Manifolds in Frozen Multi-Modal LLMs for Video Anomaly Detection
-
The SteerVAD framework is proposed to identify "Latent Anomaly Expert" (LAE) attention heads within a completely frozen Multi-modal Large Language Model (MLLM) and dynamically steer their representation manifolds using a hierarchical meta-controller. It achieves SOTA performance in tuning-free video anomaly detection using only 1% of the training data.
- StPR: Space-Time Preserving and Routing for Exemplar-Free Video Class-Incremental Learning
-
StPR explicitly decomposes video features into two branches: "inter-frame shared semantics" and "temporal dynamics." It utilizes Frame-Shared Semantic Distillation (FSSD) to lock important semantic channels to prevent forgetting and a Temporal Decomposition-based Mixture of Experts (TD-MoE) to weight task-specific experts during inference based on temporal dynamics. Without storing any old exemplars, StPR performs video class-incremental learning and outperforms all previous methods (including those requiring exemplars) on UCF101, HMDB51, SSv2, and Kinetics400.
- TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video
-
TAPTRv3 targets any point tracking in long videos. Built upon the DETR-like point query framework of TAPTRv2, it introduces spatial context cross-attention, visibility-aware long-term attention, and scene cut-triggered global matching. These enhancements effectively reduce feature drift under long sequences, occlusions, and camera switches, setting new state-of-the-art results on multiple TAP benchmarks.
- UniTrack: Differentiable Graph Representation Learning for Multi-Object Tracking
-
UniTrack models multi-object tracking as a differentiable "graph flow network" and proposes a plug-and-play graph-theoretic loss function. It unifies detection accuracy, identity preservation, and spatiotemporal consistency into an end-to-end trainable objective. Without modifying any model architecture, it can be integrated into 7 existing trackers, reducing ID switches by up to 53% and increasing IDF1 by up to 12% across multiple benchmarks.
- V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
-
V2P-Bench constructs a human-model interaction evaluation benchmark for video visual prompt understanding. Using 980 videos and 1172 QA samples with manually annotated visual prompt frames, it systematically examines whether LVLMs can perform fine-grained video understanding based on user-indicated "targets/moments." The study finds that while current models exhibit zero-shot understanding of some visual prompts, they significantly lag behind humans in spatiotemporal relations, long videos, and honesty in refusing to answer.
- VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks
-
VidBridge-R1 discovers a conflict between convergent answering in Video QA and divergent description in video captioning during RL training. It bridges these through two intermediate proxy tasks, DarkEventInfer and MixVidQA, simultaneously enhancing QA, reasoning, and captioning capabilities within a Reason-Then-Respond video model.
- Video-KTR: Reinforcing Video Reasoning via Key Token Attribution
-
This paper proposes Video-KTR, a modality-aware policy shaping framework that uses counterfactual analysis to identify three types of key tokens: visual-aware, temporal-sensitive, and high-entropy. By applying reinforcement learning updates selectively to these tokens, the method achieves SOTA performance on multiple video reasoning benchmarks (42.7% on Video-Holmes, surpassing GPT-4o).
- Video-LevelGauge: Investigating Contextual Positional Bias in Video Language Models
-
This paper proposes Video-LevelGauge, a benchmark specifically designed to evaluate the "contextual positional bias" of Large Video Language Models (LVLMs). By inserting standardized probe clips at different positions within a context, it uses relative scores and bias pattern recognition to quantify whether a model understands the same content consistently across locations. Evaluating 27 SOTA models, it reveals a prevalent preference for the head or near-end positions in open-source models.
- Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
-
Video-STAR reformulates Open-Vocabulary Action Recognition (OVAR) as a sequential decision process of "selecting tools first, then decomposing sub-actions." During inference, a Multimodal Large Language Model (MLLM) calls domain-specific tools (e.g., pose estimation, human detection, online retrieval) to supplement visual evidence and decomposes holistic actions into discriminative sub-action primitives for scoring and matching. A hierarchical reward system (rewarding accuracy, tool efficiency, and sub-action relevance) trains the model via GRPO, shifting it from "relying on text priors" to "vision-grounded reasoning" and significantly advancing the SOTA across five benchmarks: HMDB-51, UCF-101, K-400/600, and SSv2.
- Video Scene Segmentation with Genre and Duration Signals
-
This paper introduces "genre conventions" and "shot duration patterns" from professional filmmaking as metadata signals for video scene segmentation. It uses IMDb text definitions as soft semantic priors to enhance shot representations, employs inverse-duration-weighted sampling to generate diverse pseudo-boundaries during pre-training, and splits long shots during inference. This approach achieves SOTA performance on MovieNet-SSeg and BBC datasets and introduces the MovieChat-SSeg benchmark with scene boundary annotations.
- VideoNSA: Native Sparse Attention Scales Video Understanding
-
This paper proposes VideoNSA, which introduces Native Sparse Attention (NSA) into video-language models. Through a hybrid sparse attention mechanism with dynamic gating across three branches (compression, selection, and sliding window), it achieves 128K-token video understanding using only 3.6% of the attention budget. It comprehensively outperforms token compression and training-free sparse attention baselines in long video understanding, temporal reasoning, and spatial understanding tasks.
- VUDG: A Dataset for Video Understanding Domain Generalization
-
VUDG constructs the first dataset specifically designed to evaluate domain generalization (DG) in video understanding. By utilizing 11 domains that share the same semantic space but vary only in visual style, viewpoint, or environmental conditions—coupled with a multi-expert cascaded automated annotation pipeline generating 36K QA pairs—the results demonstrate that nearly all models, including the strongest LVLMs, suffer significant performance degradation when encountering domain shifts.
- What Happens Next? Anticipating Future Motion by Generating Point Trajectories
-
This paper recasts the inherently ambiguous task of "predicting future motion from a single image" as a conditional generation task on a dense grid of point trajectories. By using a trajectory VAE to compress full-image point trajectories into a latent space and sampling diverse possible futures via rectified flow matching, the proposed method is both more accurate than regressive trajectory predictors and more physically plausible than large video models that generate RGB pixels before tracking.