📹 Video Understanding

📷 CVPR2026 · 92 paper notes

A4VL: A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning: This paper proposes A4VL, a training-free multi-agent perception-action alliance framework in which multiple heterogeneous VLM agents perform iterative perception exploration (event-based segmentation + CLIP-guided clue alignment for keyframe localization) and action exploration (independent reasoning → cross-scoring → consensus/pruning). A4VL comprehensively outperforms 18 VLMs and 11 long-video-specialized methods across 5 VideoQA benchmarks, with significantly lower inference latency (74s vs. GPT-4o's 127s on MLVU).
A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning: This paper proposes A4VL, a training-free multi-agent perception-action alliance framework that achieves state-of-the-art performance across five VideoQA benchmarks—surpassing 28 baseline methods—while significantly reducing inference latency, through event-driven video segmentation, clue-guided keyframe selection, and a multi-round agent negotiation-and-pruning mechanism.
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding: This paper proposes AdaSpark, which reduces FLOPs for long-video processing by up to 57% while maintaining performance. It combines 3D spatiotemporal cube partitioning with two synergistic adaptive sparsity mechanisms: cube-level attention selection and token-level FFN selection.
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing: This paper proposes AutoGaze—a lightweight autoregressive module with only 3M parameters—that operates before the ViT to select the minimal set of patches in a multi-scale manner, eliminating spatiotemporal redundancy and achieving 4×–100× token compression and up to 19× ViT speedup, enabling MLLMs to scale to 1K-frame 4K-resolution video.
AutoGaze: Attend Before Attention — Efficient and Scalable Video Understanding via Autoregressive Gazing: This paper proposes AutoGaze, a lightweight 3M-parameter module that autoregressively selects the minimal multi-scale patch set minimizing reconstruction loss prior to ViT processing, removing redundant information from video. It achieves 4×–100× token compression and up to 19× ViT speedup, enabling MLLMs to scale to 1K-frame 4K-resolution video and reach 67.0% on VideoMME.
Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding: This paper identifies severe unreliability in single-sample teacher responses under black-box distillation for video LVLMs—manifested as cross-question variance (\(\sigma=0.22\)), intra-sampling variance (\(\sigma=0.07\)–\(0.15\)), and format violation rates (1%–10%)—and proposes R-MSD, a framework that addresses these issues through a multi-sample teacher pool, task-adaptive matching, and two-stage SFT→RL adversarial distillation. The resulting 4B student model comprehensively outperforms the same-scale Qwen3-VL-4B on VideoMME, Video-MMMU, and WorldSense.
Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding: This paper proposes the R-MSD framework, which constructs a teacher pool by sampling \(K\) responses per input, applies task-adaptive quality matching (quality-weighted pairing for closed-ended tasks and uniform pairing for open-ended tasks), and employs an online critic-as-discriminator adversarial distillation strategy to address the unreliability of single-sample supervision in black-box distillation of video LVLMs.
Temporally Consistent Long-Term Memory for 3D Single Object Tracking: This paper proposes ChronoTrack, a robust long-term 3D single object tracking framework built upon compact learnable memory tokens and two complementary objectives — a temporal consistency loss and a memory cycle-consistency loss — achieving state-of-the-art performance on multiple benchmarks while running in real time at 42 FPS.
CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization: This paper presents CineSRD, a training-free multimodal speaker diarization framework that performs speaker registration via visual anchor clustering and detects speaker turns using an audio language model, addressing open-world challenges in visual media such as long videos, large cast sizes, and audio-visual asynchrony.
CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning: This paper proposes the CLCR framework, which organizes each modality's features into three semantic hierarchy levels (shallow/middle/deep). An intra-level Controlled Exchange Domain (IntraCED) restricts cross-modal interaction to the shared subspace only, while an inter-level Collaborative Aggregation Domain (InterCAD) enables adaptive cross-level fusion, addressing the cross-level semantic asynchrony problem in multimodal learning.
Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining: This paper proposes ClusterSTM, which leverages intra-frame semantic clustering and a cluster-wise spatio-temporal masking strategy to retain semantically complete visual tokens under high masking ratios. A video-text relevance reconstruction objective is further introduced to enable efficient video-language pretraining at minimal computational cost, achieving a new state of the art among efficient models on retrieval, VQA, and captioning tasks.
Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing: This paper proposes a novel "grayscale always-on, color on demand" paradigm. ColorTrigger detects color redundancy online via lightweight quadratic programming on the grayscale stream, achieving 91.6% of the full-color baseline performance using only 8.1% RGB frames, enabling always-on video sensing on resource-constrained devices.
CVA: Context-aware Video-text Alignment for Video Temporal Grounding: This paper proposes CVA (Context-aware Video-text Alignment), a framework comprising three synergistic components—Query-aware Context Diversification (QCD), Context-invariant Boundary Discrimination (CBD) loss, and Context-enhanced Transformer Encoder (CTE)—to address false negatives and background association issues in video temporal grounding, achieving approximately 5-point improvement in R1@0.7 on QVHighlights.
Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection: This paper proposes the Phase-wise Decomposition and Alignment (PDA) framework, which leverages the CoT reasoning capability of LLMs to decompose action labels into start–middle–end phase descriptions. Through text-guided foreground filtering and adaptive phase-wise alignment, PDA achieves fine-grained action pattern transfer, attaining an Avg mAP of 46.9 on THUMOS14 OV-TAD, surpassing the previous SOTA Ti-FAD (41.2).
DIvide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding: This paper proposes DIG, a training-free frame selection framework that classifies queries into global and localization types. For global queries, uniform sampling is applied directly; for localization queries, a dedicated pipeline consisting of content-adaptive frame selection (CAFS), LMM-based reward scoring, and video refinement is employed. DIG consistently outperforms existing methods on three long-form video understanding benchmarks.
Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering: This paper proposes the EgoPointVQA dataset and the HINT (Hand Intent Tokens) method, which encodes 3D hand keypoints into hand intent tokens interleaved with visual tokens as input to an MLLM, addressing deictic gesture-based question answering in egocentric video. HINT-14B achieves 68.1% accuracy, surpassing InternVL3-14B by 5.4 pp.
Drift-Resilient Temporal Priors for Visual Tracking: This paper proposes DTPTrack—a lightweight plug-and-play temporal modeling module that assigns reliability scores to historical frames via a Temporal Reliability Calibrator (TRC) to filter noisy observations, and synthesizes the calibrated historical information into dynamic prior tokens via a Temporal Guidance Synthesizer (TGS) to suppress tracking drift, achieving state-of-the-art performance across multiple benchmarks.
Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual-Inertial Odometry: This paper proposes a dual-agent reinforcement learning framework comprising a Select Agent (which decides whether to activate the visual front-end based on IMU signals) and a Fusion Agent (which adaptively fuses visual-inertial states). Without completely removing VIBA, the framework substantially reduces its invocation frequency and computational overhead, achieving a superior accuracy–efficiency–memory trade-off.
Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition: Inspired by Kahneman's dual-system theory of human decision-making, TCEI proposes a test-time calibration framework for multi-object tracking. The intuitive system leverages transient memory of recently observed objects (confident samples as temporal priors and uncertain samples as reflective cases) for rapid prediction, while the experiential system validates and calibrates intuitive predictions using knowledge accumulated from historical videos. The entire process requires only forward passes without backpropagation, achieving significant robustness improvements under distribution shift across multiple MOT benchmarks.
EgoPointVQA: Gesture-Based Egocentric Video Question Answering: This paper proposes the EgoPointVQA dataset (4,000 synthetic + 400 real egocentric videos) and the HINT method, which encodes 3D hand keypoints into hand intent tokens interleaved with visual tokens as input to an MLLM, enabling the model to interpret pointing gestures and answer deictic questions. HINT-14B achieves 68.1% accuracy, outperforming InternVL3-14B by 6.6 percentage points.
EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions: This paper introduces EgoXtreme, the first large-scale benchmark for 6D object pose estimation in egocentric views under extreme conditions, encompassing three real-world challenges — severe motion blur, dynamic illumination, and smoke occlusion — and reveals critical failures of current state-of-the-art pose estimators under these conditions.
Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration: This paper proposes an efficient post-hoc calibration method based on isotonic regression that aligns the output distribution of uncertainty models with the observed distribution, addressing inaccurate uncertainty estimation caused by domain shift in gaze tracking. It also introduces Coverage Probability Error (CPE) as a more reliable uncertainty evaluation metric than EUC.
Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration: This paper proposes a data-efficient post-hoc calibration method that aligns the predictive distribution of uncertainty-aware gaze tracking models with the true observational distribution via isotonic regression. It also introduces Coverage Probability Error (CPE) as a replacement for the unreliable Error-Uncertainty Correlation (EUC) metric for evaluating uncertainty quality.
Envisioning the Future, One Step at a Time: This paper formulates open-set future scene dynamics prediction as stepwise reasoning over sparse point trajectories, enabling rapid generation of thousands of diverse future hypotheses from a single image via an autoregressive diffusion model — orders of magnitude faster than dense prediction models.
Event6D: Event-based Novel Object 6D Pose Tracking: EventTrack6D proposes an event-depth fusion framework for 6D pose tracking that bridges the temporal gap between event cameras and depth frame rates by reconstructing intensity and depth images at arbitrary timestamps, achieving robust tracking of unseen objects at 120+ FPS while trained exclusively on synthetic data.
FC-Track: Overlap-Aware Post-Association Correction for Online Multi-Object Tracking: FC-Track is a lightweight post-association correction framework that explicitly corrects identity switch errors caused by target overlap in online MOT. It employs IoA (Intersection over Area)-based overlap-aware appearance feature filtering and a local mismatch reassignment strategy, reducing the proportion of long-term identity switches from 36.86% to 29.55%.
FC-Track: Overlap-Aware Post-Association Correction for Online Multi-Object Tracking: This paper proposes FC-Track, a lightweight post-association correction framework that suppresses appearance updates via IoA triggering and reassigns locally mismatched detection–tracklet pairs, reducing the proportion of long-term identity switches from 36.86% to 29.55% while maintaining state-of-the-art performance on MOT17/MOT20.
FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding: This paper proposes FluxMem, a training-free streaming video understanding framework that employs a three-tier hierarchical memory design (short-term / medium-term / long-term) and two adaptive token compression modules — TAS for temporal redundancy removal and SDC for spatial redundancy reduction. FluxMem achieves new state-of-the-art results on StreamingBench and OVO-Bench while discarding 60–70% of visual tokens.
Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding: This paper proposes Frame2Freq—the first family of PEFT adapters that performs temporal modeling in the frequency domain. By transforming frozen VFM frame embeddings into the spectral space via FFT and learning frequency band-level filtering, Frame2Freq surpasses fully fine-tuned models on five fine-grained action recognition benchmarks with fewer than 10% trainable parameters.
GoalForce: Teaching Video Models to Accomplish Physics-Conditioned Goals: This paper proposes the GoalForce framework, which trains video generation models on simple synthetic data using multi-channel physical control signals (goal force, direct force, and mass), enabling the model to learn backward causal planning from desired effects. The approach achieves zero-shot generalization to complex real-world scenarios such as tool use and human–object interaction.
Hear What Matters! Text-conditioned Selective Video-to-Audio Generation: SelVA introduces the text-conditioned selective video-to-audio (V2A) generation task. Through a learnable supplementary token [SUP] and a self-supervised video mixing strategy, the model generates only the user-specified target sound from multi-source videos guided by text prompts, surpassing existing methods in audio quality, semantic alignment, and temporal synchronization.
HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering: HERBench is a video question answering benchmark specifically designed for multi-evidence integration, comprising 26,806 five-choice questions, each structurally requiring the fusion of \(\ge 3\) temporally dispersed, non-overlapping visual cues. By introducing the Minimum Required Frame Set (MRFS) metric, the benchmark exposes two critical bottlenecks in current Video-LLMs: insufficient frame retrieval and evidence fusion failure.
HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling: HieraMamba proposes a Mamba-based hierarchical architecture for video temporal grounding. Its core contribution is the Anchor-MambaPooling (AMP) module, which employs Mamba's selective scanning to progressively compress video features into multi-scale anchor tokens. Complementary anchor-conditioned and segment-pooled contrastive losses enhance the compactness and discriminability of hierarchical representations, achieving state-of-the-art performance on Ego4D-NLQ, MAD, and TACoS.
How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms: This paper compares three mainstream temporal output paradigms for video temporal grounding (VTG) — text-number generation, temporal token generation, and continuous time decoding — within a unified framework, finding that the continuous distribution paradigm consistently achieves the best efficiency–accuracy Pareto frontier.
LAOF: Robust Latent Action Learning with Optical Flow Constraints: This paper proposes the LAOF framework, which leverages agent optical flow as a pseudo-supervision signal to constrain latent action learning, yielding latent action representations that are more robust to distractors. LAOF substantially outperforms unsupervised baselines on LIBERO and PROCGEN, and matches or surpasses supervised methods that use 1% action labels, while requiring no labels at all.
Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning: This paper proposes AssistMimic, which formulates physics-based imitation of human-human assistive interactions as a multi-agent reinforcement learning (MARL) problem. Through motion prior initialization, dynamic reference retargeting, and contact-promoting rewards, it achieves, for the first time, physics-simulation tracking of force-exchanging assistive motions.
LensWalk: Agentic Video Understanding by Planning How You See in Videos: This paper presents LensWalk, an agentic framework that enables an LLM reasoner to actively control the temporal scope and sampling density of video observations. Through a reason-plan-observe loop, LensWalk achieves adaptive video understanding without any fine-tuning, yielding plug-and-play performance gains exceeding 5% on long video benchmarks.
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding: This paper proposes LongVideo-R1, a reasoning-capable multimodal agent that organizes videos into a hierarchical tree structure and employs an intelligent navigation strategy to achieve efficient long-video question answering with an average of only 10.5 tool calls, significantly outperforming exhaustive methods on the accuracy–efficiency trade-off.
Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding: This paper proposes Mamba-VMR, a two-stage video moment retrieval framework. The first stage employs LLM-guided caption matching and generates auxiliary short videos as temporal priors; the second stage uses a multimodal-controlled Mamba network to efficiently fuse the generated priors with long sequences. The framework achieves state-of-the-art performance on TVR (R@1/IoU=0.5: 45.20%) while reducing computational overhead.
MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters: This paper proposes MaskAdapt, a two-stage residual learning framework that first trains a mask-invariant robust base policy and then trains a residual policy on top of the frozen base controller to modify target body parts, enabling flexible and precise motion adaptation for physics-based humanoid characters.
MINERVA-Cultural: A Benchmark for Cultural and Multilingual Long Video Reasoning: This paper introduces MINERVA-Cultural, a benchmark comprising 2,400 manually annotated video reasoning questions spanning 18 language/region locales, and reveals severe deficiencies in cultural visual perception among state-of-the-art Video-LLMs through evidence graphs and an iterative error isolation strategy (best model Gemini-2.5-Pro: 45.07% vs. human: 95.22%).
Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos: This paper introduces the Mistake Attribution (MATT) task, which attributes action mistakes in egocentric videos along three dimensions: semantic (which component of the instruction was violated), temporal (at which frame the point of no return, PNR, occurs), and spatial (which region in the PNR frame contains the error). A data engine called MisEngine automatically constructs large-scale mistake samples from existing action datasets, and a unified Transformer model, MisFormer, simultaneously addresses all three attribution sub-tasks, surpassing task-specific SOTA methods across multiple benchmarks.
MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark: This paper introduces MovieRecapsQA, a multimodal open-ended video QA benchmark constructed from movie recap videos, comprising approximately 8.2K questions across 60 movies. It proposes a reference-free evaluation metric based on atomic facts and reveals that the critical bottleneck of current MLLMs lies in visual perception rather than reasoning.
Ninja Codes: Neurally Generated Fiducial Markers for Stealthy 6-DoF Tracking: Ninja Codes leverages deep steganography to transform arbitrary images into visually inconspicuous fiducial markers via an end-to-end trained encoder. The resulting markers can be printed with standard printers and detected using RGB cameras, enabling stealthy 6-DoF pose tracking.
Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking: This paper proposes OA-SORT, an occlusion-aware tracking framework that explicitly models target occlusion states to mitigate positional cost ambiguity and Kalman Filter estimation instability. The method achieves state-of-the-art results on DanceTrack, SportsMOT, and MOT17, and all of its components are plug-and-play compatible with multiple tracker architectures.
OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments: This paper presents OpenMarcie, the largest-scale multimodal action recognition dataset for industrial environments, integrating 8 sensing modalities, 200+ channels, and 37+ hours of recordings from wearable sensors and visual data. Three benchmarks—HAR classification, open-vocabulary description, and cross-modal alignment—demonstrate the superiority of inertial+vision fusion.
Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation: This paper presents the first systematic analysis of adversarial vulnerabilities in Tracking-by-Query-Propagation (TBP) trackers, and proposes the FADE attack framework. FADE employs two complementary strategies — Temporal Query Flooding (TQF) to exhaust fixed query budgets by generating persistent spurious tracks, and Temporal Memory Corruption (TMC) to disrupt hidden state propagation of legitimate tracks. On MOT17/MOT20, FADE causes up to ~30 points of HOTA degradation and more than 10× identity switches on MOTR/MOTRv2/MeMOTR/Samba/CO-MOT.
Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding: This paper proposes QViC-MF, a framework that achieves state-of-the-art performance on MLVU, LVBench, and VNBench through question-guided multi-frame visual compression (QMSA) and a contextual memory feedback mechanism, using as few as 16 visual tokens per frame.
RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation: This paper is the first to introduce textual descriptions into RGBT tracking, proposing RAGTrack, a retrieval-augmented generation (RAG)-based framework. Through a Multimodal Transformer Encoder (MTE), Adaptive Token Fusion (ATF), and a Context-aware Reasoning Module (CRM), it achieves state-of-the-art performance on four RGBT benchmarks.
Real-World Point Tracking with Verifier-Guided Pseudo-Labeling: This paper proposes the Verifier — a meta-model that learns to assess the per-frame reliability of predictions from multiple pre-trained trackers, selecting the best candidate at each frame to construct high-quality pseudo-label trajectories. This enables annotation-free fine-tuning for real-world point tracking and achieves state-of-the-art performance on four real-world benchmarks.
Real-World Point Tracking with Verifier-Guided Pseudo-Labeling: This paper proposes a learnable Verifier meta-model trained on synthetic data to assess the reliability of tracker predictions and transfer this capability to the real world. By evaluating per-frame predictions from six pretrained trackers and selecting the most reliable as pseudo-labels, the proposed Track-On-R model is fine-tuned on only ~5K real videos and achieves comprehensive state-of-the-art performance across four real-world benchmarks.
Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning: This paper proposes SlotCurri, a reconstruction-guided slot-count curriculum learning strategy that begins training with very few slots and progressively expands slot capacity only in regions with high reconstruction error. Combined with structure-aware loss and cyclic inference, SlotCurri effectively addresses the over-fragmentation problem — where a single object is erroneously split across multiple slots — in video object-centric learning, achieving a +6.8 FG-ARI improvement on YouTube-VIS.
FlexHook: Rethinking Two-Stage Referring-by-Tracking in RMOT: This paper proposes FlexHook, a novel two-stage Referring-by-Tracking framework that redefines feature construction via a sampling-based Conditioning Hook (C-Hook) and replaces CLIP cosine similarity matching with a Pairwise Correspondence Decoder (PCD), making a two-stage method comprehensively surpass current state-of-the-art one-stage methods for the first time.
FlexHook: Rethinking Two-Stage Referring-by-Tracking in RMOT: FlexHook revitalizes the two-stage Referring-by-Tracking (RBT) paradigm: it introduces C-Hook to directly sample target features from the backbone (replacing dual encoding) and inject language-conditioned cues, and replaces CLIP cosine similarity with PCD (Pairwise Correspondence Decoder) for active correspondence modeling. This marks the first time a two-stage method comprehensively surpasses one-stage RMOT state-of-the-art — achieving HOTA of 42.53 (vs. 10.32 for iKUN) on Refer-KITTI-V2, with training completed in only 1.91 hours (2×4090).
SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning: This paper proposes SAIL, which achieves state-of-the-art performance on both dense video captioning and event localization on ActivityNet and YouCook2 under a weakly-supervised setting (caption annotations only, no temporal boundaries), via cross-modal similarity-guided semantic-aware mask generation and auxiliary supervision from LLM-synthesized captions.
SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion: This paper proposes SAVA-X, a framework comprising three complementary modules—adaptive sampling, scene-aware view embedding, and bidirectional cross-attention fusion—to address cross-view temporal error detection in the exocentric-demonstration-to-egocentric-imitation setting, achieving comprehensive improvements over existing baselines on the EgoMe benchmark.
SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion: This paper formalizes the Ego→Exo imitation error detection task and proposes the SAVA-X (Align–Fuse–Detect) framework, which jointly addresses three core challenges—temporal misalignment, video redundancy, and cross-view domain gap—through three modules: adaptive sampling, scene-adaptive view embeddings, and bidirectional cross-attention fusion.
Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting: This paper proposes Seen-to-Scene, a unified video outpainting framework that integrates propagation-based and generation-based paradigms. By combining reference-frame-guided latent-space propagation with a video diffusion model, it achieves spatiotemporal consistency and visual fidelity in zero-shot inference that surpasses prior methods requiring input-specific adaptation.
SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild: This paper introduces SHOW3D, the first hand-object interaction dataset with accurate 3D annotations captured in truly in-the-wild environments. Through a lightweight wearable multi-camera backpack system and an ego-exo fusion annotation pipeline, the dataset comprises 4.3 million frames of multi-view data, achieving sub-centimeter annotation accuracy for both hands and objects. Cross-dataset experiments validate the generalization advantage of models trained on SHOW3D.
SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition: This paper proposes SkeletonContext, a framework that recovers the missing environmental and object context semantics in skeleton data from pretrained language models via a cross-modal context prompt module, and enhances the discriminability of motion-critical joints through a key part decoupling module. The method achieves state-of-the-art performance on NTU-60/120 and PKU-MMD under both zero-shot (ZSL) and generalized zero-shot (GZSL) settings.
SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding: This paper proposes SlotVTG, a framework that inserts a lightweight Slot Adapter into the early layers of an MLLM decoder to decompose visual tokens into object-level slot representations. A Slot Alignment Loss guided by DINOv2 priors encourages semantically coherent slot formation, substantially improving out-of-domain (OOD) generalization for video temporal grounding (up to +4.3 OOD R1@0.5), while introducing only ~0.25% additional trainable parameters.
SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking: This paper proposes SpikeTrack, described as the first RGB visual tracking framework fully compliant with the spike-driven paradigm. Through asymmetric temporal step expansion, unidirectional information flow, and a brain-inspired Memory Retrieval Module (MRM), it achieves state-of-the-art results among SNN-based trackers and performs on par with ANN-based trackers, while consuming only 1/26 of the energy of TransT.
Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning: ROS-DVC introduces three complementary components for DETR-based dense video captioning (DVC): role-specific query initialization (separate localization and captioning queries), a cross-task contrastive alignment loss, and an overlap suppression loss. Without pretraining or LLMs, it achieves a CIDEr of 39.18 on YouCook2, surpassing DDVC which relies on GPT-2.
Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning: This paper proposes ROS-DVC, which decouples the shared queries in DETR-based DVC frameworks into independent localization queries and caption queries, introduces an Overlap Suppression Loss to penalize temporal overlap between queries, and employs Cross-Task Contrastive Alignment to maintain cross-task semantic consistency. The approach achieves state-of-the-art captioning and localization performance on YouCook2 and ActivityNet Captions.
STORM: End-to-End Referring Multi-Object Tracking in Videos: STORM is the first end-to-end multimodal large language model framework for Referring Multi-Object Tracking (RMOT). It substantially reduces reliance on RMOT-annotated data through a task composition learning strategy and introduces the high-quality STORM-Bench dataset.
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos: This work presents StreamGaze, the first gaze-guided streaming video understanding benchmark, comprising 8,521 QA pairs covering three task categories — past, present, and proactive prediction. A gaze trajectory–video alignment pipeline is proposed to generate spatiotemporally grounded QA pairs, revealing a substantial gap in current MLLMs' ability to leverage gaze signals for temporal reasoning.
StreamingTOM: Streaming Token Compression for Efficient Video Understanding: This paper proposes StreamingTOM, a training-free two-stage framework for streaming video understanding. Causal Temporal Reduction (CTR) compresses per-frame tokens from 196 to 50 via causal temporal selection before the LLM, while Online Quantized Memory (OQM) constrains KV-cache growth after the LLM through 4-bit quantization and on-demand retrieval. The framework achieves a 15.7× compression ratio, 1.2× lower peak memory, and 2× faster TTFT.
StreamingTOM: Streaming Token Compression for Efficient Video Understanding: StreamingTOM is the first training-free framework to address both the pre-LLM prefill and post-LLM KV-cache efficiency bottlenecks in streaming video VLMs, achieving 15.7× compression with bounded active memory.
StreamReady: Learning What to Answer and When in Long Streaming Videos: This paper introduces a readiness-aware paradigm for streaming video understanding. By incorporating a learnable <RDY> token and proposing the Answer Readiness Score (ARS) metric, the model is trained not only to produce correct answers but also to respond at the appropriate moment when sufficient evidence has appeared. The approach achieves state-of-the-art results on 9 streaming and offline video benchmarks.
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration: This paper proposes SVAgent, a storyline-guided cross-modal multi-agent framework for long video question answering. By progressively constructing narrative representations, employing DPP-based evidence selection, cross-modal consistency verification, and iterative refinement, SVAgent achieves performance gains of 5.5%–11.5% over baselines.
TCEI: Dual-level Adaptation for Multi-Object Tracking via Test-Time Calibration: Inspired by the dual-system model of human decision-making, this paper proposes TCEI, a test-time calibration framework for multi-object tracking: an intuition system leverages instantaneous memory for rapid prediction, while an experience system calibrates those predictions using accumulated knowledge. Confident and uncertain samples serve as historical priors and reflective cases, respectively, enabling online adaptation.
Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition: Inspired by Kahneman's dual-process theory, the TCEI framework proposes a test-time adaptation method that combines an intuitive system (rapid inference via transient memory of recently observed objects) with an experiential system (calibration of intuitive predictions using knowledge accumulated from historical videos), achieving significant improvements in multi-object tracking under distribution shift without requiring backpropagation.
Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models: This paper proposes the AOT framework, which establishes local-global token anchors and employs Optimal Transport (OT) to aggregate the semantic information of pruned/merged tokens at both intra-frame and inter-frame levels. The method achieves training-free video token compression, retaining 97.6% of original performance while discarding 90% of tokens.
TrajTok: Learning Trajectory Tokens Enhances Video Understanding: This paper proposes TrajTok — an end-to-end differentiable trajectory tokenizer that implicitly clusters video pixels into object trajectory tokens, replacing external segmentation-and-tracking pipelines. It achieves significant improvements across three settings: training from scratch (TrajViT2), feature adaptation (TrajAdapter), and vision-language model connectors (TrajVLM), with particularly large gains on long-video QA over patch pooling.
TrajTok: Learning Trajectory Tokens Enhances Video Understanding: This paper proposes TrajTok—the first end-to-end differentiable trajectory-based video tokenizer—which encodes video into object trajectory tokens via implicit spatiotemporal clustering, requiring no external segmentation or tracking pipeline. TrajTok achieves +4.8% on K400, +4.1% on SSv2, and +8.8% on long-video QA benchmarks, with inference efficiency on par with the most efficient baselines.
U2Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation: U2Flow is the first recurrent unsupervised framework that jointly estimates optical flow and per-pixel uncertainty. Through augmentation-consistency-based decoupled uncertainty learning and uncertainty-guided bidirectional flow fusion, it achieves unsupervised state-of-the-art performance on KITTI and Sintel.
UETrack: A Unified and Efficient Framework for Single Object Tracking: This paper proposes UETrack, a unified and efficient single object tracking framework capable of handling five modalities simultaneously: RGB, Depth, Thermal, Event, and Language. UETrack addresses a critical gap in efficient multi-modal tracking — existing efficient trackers are limited to RGB, while multi-modal trackers are too slow for practical deployment due to complex designs. The core contributions include: (1) Token-Pooling-based Mixture-of-Experts (TP-MoE), which replaces conventional gating mechanisms with similarity-based soft assignment to enable efficient expert collaboration and specialization; and (2) Target-aware Adaptive Distillation (TAD), which adaptively determines whether each sample is suitable for distillation, filtering out unreliable teacher signals. Evaluated across 12 benchmarks on 3 hardware platforms, UETrack achieves an optimal speed-accuracy trade-off — UETrack-B attains 69.2% AUC on LaSOT at 163/56/60 FPS on GPU/CPU/AGX respectively, with only 13M parameters.
UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models: UFVideo is the first Video LLM to unify global, pixel-level, and temporal-level video understanding within a single model. Through a visual-language guided alignment strategy and the SAM2 mask decoder, it simultaneously supports video question answering, object referring, video segmentation, and temporal grounding, and introduces UFVideo-Bench, a multi-grained cooperative understanding benchmark.
Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability: This paper investigates, from an interpretability perspective, the root cause of temporal logic inconsistency in Video-LLMs—namely, that cross-modal attention heads fail to effectively discriminate video tokens at different timestamps—and proposes TCAS (Temporally Conditioned Attention Sharpening), which significantly improves temporal logic consistency and general temporal grounding performance by optimizing attention distributions.
Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention: This paper proposes a unified spatiotemporal token compression method that jointly evaluates token contribution and semantic redundancy via a global retention pool, and introduces a text-aware merging mechanism inside the LLM. At an extreme compression ratio retaining only ~2% of visual tokens, the method preserves 90.1% of baseline performance while reducing FLOPs to ~2.6%.
UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking: This paper proposes UTPTrack, the first unified framework that jointly prunes tokens from all three components — search region (SR), dynamic template (DT), and static template (ST) — within one-stream Transformer trackers, achieving 65–67% visual token reduction across both RGB and multimodal/language-guided tracking tasks while maintaining 99.7%–100.5% of baseline performance.
VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference: This paper identifies a strong "vertical vector" sparsity pattern in the attention maps of video models and proposes VecAttention, a fine-grained vector-wise sparse attention framework. Through TilingSelect and minS filtering, the method efficiently selects important KV vectors, achieving accuracy on par with full attention at over 78% sparsity while delivering a 2.65× speedup in attention computation.
VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding: VideoARM proposes an agentic reasoning paradigm built upon a Hierarchical Multimodal Memory (HM3) structure. Through an adaptive observe–think–act–memorize loop and a coarse-to-fine tool-calling strategy, it surpasses state-of-the-art methods on long-form video understanding benchmarks while reducing token consumption to 1/34 of DVD.
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice: This paper proposes VideoAuto-R1, an on-demand reasoning framework for video understanding. During training, it adopts a "think once, answer twice" (answer→think→answer) paradigm; during inference, it uses the confidence of the first answer to determine whether to invoke CoT reasoning. The approach maintains SOTA accuracy while reducing average response length from 149 to 44 tokens (approximately 3.3× compression).
VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning: This paper proposes VideoChat-M1, which replaces conventional fixed tool-calling strategies with Collaborative Policy Planning (CPP) and Multi-Agent Reinforcement Learning (MARL). Multiple policy agents dynamically generate, execute, and communicate tool-invocation plans, achieving state-of-the-art results on 8 video understanding benchmarks—surpassing Gemini 2.5 Pro by 3.6% and GPT-4o by 15.6% on LongVideoBench.
VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning: VideoChat-M1 proposes the Collaborative Policy Planning (CPP) paradigm and a Multi-Agent Reinforcement Learning (MARL) training framework, enabling four heterogeneous VLM agents to dynamically generate and update tool-calling policies for video understanding. It surpasses Gemini 2.5 Pro by 3.6% and GPT-4o by 15.6% on LongVideoBench.
VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking: VideoSeek proposes a long-horizon video agent that actively seeks critical evidence by following the video's logical flow rather than exhaustively parsing all frames. Through a think-act-observe loop and a multi-granularity toolkit (overview/skim/focus), it achieves a 10.2-point improvement over the base model GPT-5 on LVBench while reducing frame usage by 93%.
VidTAG: Temporally Aligned Video to GPS Geolocalization: This paper proposes VidTAG, a dual-encoder (CLIP+DINOv2) frame-to-GPS retrieval framework that achieves temporally consistent per-frame video geolocalization at global scale, via a TempGeo module for inter-frame temporal alignment and a GeoRefiner encoder-decoder module for GPS prediction refinement.
VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding: This paper proposes VirtueBench, the first long video understanding benchmark for evaluating VLM trustworthiness under uncertainty. By constructing multi-level frame sampling for each video and annotating answerable/unanswerable ground truth at each level, it reveals that existing models tend to guess rather than honestly refuse to answer.
VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues: This paper proposes VRR-QA, a benchmark comprising 1K carefully annotated video QA pairs designed to evaluate models' ability to reason about implicit visual relationships in videos—such as off-screen events, cross-frame causality, and spatial relationship inference. The benchmark reveals significant deficiencies in implicit reasoning among current state-of-the-art VideoQA models, including GPT-O3: the best-performing model achieves only 64% accuracy, far below the human baseline of 83%.
VSI: Visual-Subtitle Integration for Keyframe Selection to Enhance Long Video Understanding: VSI proposes a dual-branch collaborative retrieval framework (Video Search + Subtitle Match) that fuses visual and textual information for precise keyframe localization, and is presented as the first cross-modal keyframe retrieval method. On text-dominant subtasks, it improves search accuracy from 29.48 to 45.00.
Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding: This paper proposes WFS-SB, a training-free frame selection framework that applies wavelet transforms to query-frame similarity signals for semantic boundary detection. The video is segmented into semantically coherent segments, over which frame budgets are adaptively allocated and diversity-aware sampling is performed. WFS-SB substantially surpasses state-of-the-art methods on VideoMME, MLVU, and LongVideoBench.
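The WFS-SB entry above describes a three-step pipeline: detect semantic boundaries in the query-frame similarity signal, split the video into segments, then allocate a frame budget per segment. The sketch below is a rough illustration of that idea, not the paper's implementation. It assumes a single-scale Haar-style response as the wavelet step, a budget proportional to each segment's mean similarity, and evenly spaced sampling in place of the diversity-aware sampling the paper uses. All function names, the `scale` and `threshold` parameters, and the toy similarity values are hypothetical.

```python
def haar_response(signal, scale):
    """Haar-style response at position t: mean of the next `scale` samples
    minus the mean of the previous `scale` samples."""
    responses = {}
    for t in range(scale, len(signal) - scale + 1):
        left = sum(signal[t - scale:t]) / scale
        right = sum(signal[t:t + scale]) / scale
        responses[t] = right - left
    return responses


def detect_boundaries(signal, scale=2, threshold=0.3):
    """Keep positions whose response magnitude is a local peak above `threshold`."""
    resp = haar_response(signal, scale)
    boundaries = []
    for t, value in resp.items():
        mag = abs(value)
        prev_mag = abs(resp.get(t - 1, 0.0))
        next_mag = abs(resp.get(t + 1, 0.0))
        if mag >= threshold and mag > prev_mag and mag >= next_mag:
            boundaries.append(t)
    return boundaries


def split_segments(num_frames, boundaries):
    """Turn boundary positions into half-open (start, end) segments."""
    edges = [0] + list(boundaries) + [num_frames]
    return list(zip(edges[:-1], edges[1:]))


def allocate_budget(signal, segments, budget):
    """Give every segment one frame, then share the rest in proportion to
    mean similarity (assumes non-negative similarities with a positive sum)."""
    if budget < len(segments):
        raise ValueError("budget must cover at least one frame per segment")
    weights = [sum(signal[s:e]) / (e - s) for s, e in segments]
    spare = budget - len(segments)
    total = sum(weights)
    shares = [spare * w / total for w in weights]
    alloc = [int(x) for x in shares]
    leftover = spare - sum(alloc)
    # Largest-remainder rounding so the allocations sum to `budget`.
    order = sorted(range(len(segments)),
                   key=lambda i: shares[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return [a + 1 for a in alloc]


def sample_frames(segments, budgets):
    """Evenly spaced frame indices inside each segment (a stand-in for
    diversity-aware sampling)."""
    frames = []
    for (s, e), k in zip(segments, budgets):
        k = min(k, e - s)  # never request more frames than the segment holds
        frames.extend(s + ((2 * j + 1) * (e - s)) // (2 * k) for j in range(k))
    return frames


# Toy query-frame similarities: low, then high, then medium relevance.
sims = [0.1] * 4 + [0.8] * 4 + [0.3] * 4
boundaries = detect_boundaries(sims)
segments = split_segments(len(sims), boundaries)
budgets = allocate_budget(sims, segments, budget=6)
frames = sample_frames(segments, budgets)
print(boundaries, segments, budgets, frames)
```

In this toy run the segment with the highest query similarity receives the largest share of the budget, while every segment keeps at least one frame so that low-relevance context is not dropped entirely.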