📹 Video Understanding
📹 ICCV2025 · 58 paper notes
- 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
4D-Bench is the first benchmark for evaluating multi-modal large language models (MLLMs) on 4D object understanding. It encompasses two tasks—4D object question answering and captioning—and reveals that even the state-of-the-art GPT-4o achieves only 63% accuracy against a human baseline of 91%, exposing significant deficiencies in multi-view temporal reasoning among current MLLMs.
- 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
This paper introduces 4D-Bench, the first benchmark for evaluating multi-modal large language models (MLLMs) on 4D object (dynamic 3D object) understanding, comprising two tasks: 4D object question answering and 4D object captioning. The benchmark reveals that even GPT-4o achieves only 63% accuracy on simple 4D objects (vs. 91% human baseline), with particularly weak performance on object counting and temporal understanding.
- 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
This paper introduces 4D-Bench, the first benchmark for evaluating multimodal large language models (MLLMs) on 4D object understanding (i.e., 3D objects with temporal evolution). It comprises two core tasks: 4D Object QA (751 QA pairs) and 4D Object Captioning (580 objects × 5 annotations). Evaluation reveals that even the state-of-the-art GPT-4o achieves only 63% accuracy compared to 91% for humans, exposing a substantial gap in multi-view spatiotemporal understanding among MLLMs.
- Adaptive Hyper-Graph Convolution Network for Skeleton-Based Human Action Recognition
This paper proposes Hyper-GCN, which replaces conventional binary graphs with an adaptive non-uniform hypergraph to model skeletal topology, and introduces virtual hyper-joints to create virtual connections that enable direct modeling of multi-joint cooperative relationships. The approach achieves state-of-the-art performance on NTU-60/120 and NW-UCLA with the most lightweight GCN design (base variant: only 1.1M parameters, 1.63 GFLOPs).
- Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual Connections
This paper proposes Hyper-GCN, which addresses a limitation of conventional GCNs, namely that they model only pairwise joint relationships, by introducing adaptive non-uniform hypergraph convolution and virtual hyper-joints. The design efficiently aggregates multi-joint collaborative semantics, achieving state-of-the-art performance on the NTU-60/120 and NW-UCLA benchmarks with the most lightweight GCN architecture to date.
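For intuition, here is a minimal sketch of a generic hypergraph convolution layer over skeleton joints (the standard Dv^-1 H W De^-1 H^T X Theta propagation); the paper's adaptive non-uniform hyperedges, virtual hyper-joints, and temporal modeling are not reproduced, and all shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    """Generic hypergraph convolution over skeleton joints (illustration only)."""

    def __init__(self, in_dim, out_dim, num_joints=25, num_edges=8):
        super().__init__()
        # Soft incidence matrix H (joints x hyperedges); the paper adapts it per
        # sample, here it is simply a shared learnable parameter.
        self.H = nn.Parameter(torch.rand(num_joints, num_edges))
        self.edge_weight = nn.Parameter(torch.ones(num_edges))
        self.theta = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x):
        # x: (batch, joints, in_dim) joint features, e.g. 3D coordinates
        H = torch.sigmoid(self.H)                               # memberships in [0, 1]
        dv = (H * self.edge_weight).sum(dim=1).clamp(min=1e-6)  # vertex degrees
        de = H.sum(dim=0).clamp(min=1e-6)                       # hyperedge degrees
        A = (H * self.edge_weight / de) @ H.t() / dv[:, None]   # joints x joints propagation
        return A @ self.theta(x)                                # aggregate multi-joint context

if __name__ == "__main__":
    layer = HypergraphConv(in_dim=3, out_dim=64)
    print(layer(torch.randn(2, 25, 3)).shape)  # torch.Size([2, 25, 64])
```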
- AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
This paper proposes AIM, a training-free adaptive inference method for multimodal LLMs that achieves a 6.8× FLOPs reduction while maintaining performance, through similarity-based iterative visual token merging before the LLM and progressive PageRank-based token pruning within LLM layers. Under equal compute budgets, AIM even surpasses the SOTA on long video understanding (+4.6 points on MLVU).
- AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
This paper proposes a training-free adaptive inference framework that achieves flexible accuracy–efficiency trade-offs across a 40× FLOPs range for multimodal LLMs. The method combines iterative token merging based on embedding cosine similarity before the LLM, and progressive token pruning based on PageRank-derived multimodal importance scores within LLM layers. Strong performance is demonstrated on both video and image understanding benchmarks.
- AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
This paper proposes AIM, a training-free adaptive inference method that combines iterative token merging before the LLM (based on embedding cosine similarity) with progressive token pruning within LLM layers (based on PageRank importance scores), achieving a 6.8× FLOPs reduction with negligible performance loss, and even surpassing SOTA on long video understanding benchmarks.
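The merging half of AIM is, at its core, similarity-driven reduction of visual tokens before they reach the LLM. The sketch below illustrates that idea by greedily averaging the most similar neighboring tokens; the authors' actual merge schedule and the PageRank-based pruning inside the LLM layers are not reproduced, so this is an illustration rather than the AIM implementation.

```python
import torch
import torch.nn.functional as F

def merge_most_similar(tokens: torch.Tensor, num_merges: int) -> torch.Tensor:
    """Iteratively average the adjacent token pair with the highest cosine
    similarity. tokens: (N, D) -> (N - num_merges, D)."""
    for _ in range(num_merges):
        x = F.normalize(tokens, dim=-1)
        sim = (x[:-1] * x[1:]).sum(dim=-1)            # cosine similarity of neighbours
        i = int(sim.argmax())                         # most redundant pair
        merged = (tokens[i] + tokens[i + 1]) / 2
        tokens = torch.cat([tokens[:i], merged[None], tokens[i + 2:]], dim=0)
    return tokens

if __name__ == "__main__":
    vis = torch.randn(576, 1024)                      # e.g. ViT patch tokens of one frame
    print(merge_most_similar(vis, num_merges=288).shape)  # torch.Size([288, 1024])
```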
- Aligning Effective Tokens with Video Anomaly in Large Language Models
This paper proposes VA-GPT, which efficiently aligns anomaly-relevant tokens within MLLMs via two modules — Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG) — enabling precise detection, description, and temporal localization of anomalous events.
- AllTracker: Efficient Dense Point Tracking at High Resolution
AllTracker reformulates point tracking as a multi-frame long-range optical flow problem, iteratively refining correspondence estimates on low-resolution grids via 2D convolutions and pixel-aligned temporal attention, followed by upsampling. With only 16M parameters, it achieves state-of-the-art accuracy and enables high-resolution (768×1024) dense tracking of all pixels at speeds approaching those of optical flow methods.
- An Empirical Study of Autoregressive Pre-training from Videos
This paper systematically investigates autoregressive pre-training from videos (termed Toto), training a causal Transformer on over one trillion visual tokens. Despite minimal inductive biases, the approach achieves competitive performance across image recognition, video classification, object tracking, and robot manipulation, while exhibiting scaling laws analogous to those of language models, albeit at a slower rate.
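The training objective here is plain next-token prediction over discretized visual tokens. The toy sketch below illustrates that loss with a tiny causal Transformer; the tokenizer, vocabulary size, and model are stand-ins, not the paper's Toto setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalLM(nn.Module):
    """A tiny causal Transformer used only to illustrate the training objective."""

    def __init__(self, vocab=8192, dim=256, layers=2, heads=4, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        # tokens: (batch, T) discrete visual token ids
        T = tokens.size(1)
        x = self.tok(tokens) + self.pos(torch.arange(T, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        return self.head(self.encoder(x, mask=mask))  # (batch, T, vocab)

if __name__ == "__main__":
    model = TinyCausalLM()
    video_tokens = torch.randint(0, 8192, (2, 257))   # tokenized video patches (stand-in)
    logits = model(video_tokens[:, :-1])              # predict token t+1 from tokens <= t
    loss = F.cross_entropy(logits.reshape(-1, 8192), video_tokens[:, 1:].reshape(-1))
    print(loss.item())
```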
- Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking
This paper proposes TRACT, a method that leverages trajectory-level information to enhance open-vocabulary multi-object tracking (OV-MOT). It improves association via Trajectory Consistency Reinforcement (TCR) and improves classification via Trajectory Feature Aggregation (TFA) and Trajectory Semantic Enrichment (TSE). TRACT achieves significant performance gains on the OV-TAO benchmark, particularly in classification accuracy.
- Beyond Label Semantics: Language-Guided Action Anatomy for Few-shot Action Recognition
This paper proposes the Language-Guided Action Anatomy (LGA) framework, which leverages large language models to decompose action labels into atomic-level action descriptions encoded as subject–motion–object triplets. On the video side, a clustering-based segmentation strategy partitions frame sequences into corresponding atomic action stages. Multimodal fusion and matching are then performed at the atomic level, yielding substantial improvements in few-shot action recognition performance.
- Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos
This paper introduces Argus, the first model to generate complete 360° panoramic videos from standard perspective videos. Through three geometry- and motion-aware techniques—camera movement simulation, view-based frame alignment, and blended decoding—Argus achieves spatially consistent and temporally coherent panoramic video generation within a diffusion-based framework.
- BlinkTrack: Feature Tracking over 80 FPS via Events and Images
BlinkTrack introduces a differentiable Kalman filter into a learning framework to address the challenges of asynchronous data association and uncertainty-aware fusion between event cameras and conventional cameras, achieving feature tracking at over 80 FPS with significantly superior performance in occlusion scenarios compared to existing methods.
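For reference, the classical building block behind this kind of fusion is a Kalman filter over a feature's image-plane state. Below is a minimal constant-velocity predict/update step in plain NumPy with illustrative noise levels; BlinkTrack's differentiable, learned variant and its event/frame measurement models are not shown.

```python
import numpy as np

def kalman_step(x, P, z, dt=1.0, q=1e-2, r=1.0):
    """One predict/update step. x: (4,) state [px, py, vx, vy]; P: (4, 4); z: (2,) measurement."""
    F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]], float)
    H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
    Q, R = q * np.eye(4), r * np.eye(2)
    x, P = F @ x, F @ P @ F.T + Q       # predict with a constant-velocity model
    S = H @ P @ H.T + R                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x + K @ (z - H @ x)             # correct with the new measurement
    P = (np.eye(4) - K @ H) @ P
    return x, P

if __name__ == "__main__":
    x, P = np.zeros(4), np.eye(4)
    for z in [np.array([1.0, 0.5]), np.array([2.1, 1.0]), np.array([3.0, 1.4])]:
        x, P = kalman_step(x, P, z)
    print(x.round(2))  # position and velocity estimate after three measurements
```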
- Breaking the Encoder Barrier for Seamless Video-Language Understanding
This paper proposes ELVA, the first encoder-free Video Large Language Model (Video-LLM), which achieves performance comparable to encoder-based architectures through hierarchical token merging, video guidance supervision, and hybrid resolution inference, using only 7M publicly available video-text pairs while reducing FLOPs by 95% and inference latency by 92%.
- DeSPITE: Exploring Contrastive Deep Skeleton-PointCloud-IMU-Text Embeddings for Action Recognition
DeSPITE proposes a privacy-preserving multimodal contrastive pre-training framework that aligns four modalities — LiDAR point clouds, skeleton poses, IMU signals, and text — into a unified embedding space, enabling cross-modal matching, retrieval, and a pre-training paradigm for human activity recognition.
- DisTime: Distribution-based Time Representation for Video Large Language Models
This paper proposes DisTime, a framework that enables continuous time representation in Video-LLMs via a single learnable time token and a distribution-based time decoder. Complemented by the large-scale automatically annotated dataset InternVid-TG (1.25M events), DisTime achieves state-of-the-art performance on three categories of time-sensitive tasks: moment retrieval, dense video captioning, and grounded VQA.
- DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding
DynImg proposes a novel video representation method that appends non-key frames as "temporal visual prompts" below key frames to form dynamic images, enabling fine-grained spatiotemporal interaction inside the visual encoder (rather than at the high-level token stage). Combined with a 4D rotary positional encoding to maintain correct spatiotemporal ordering, DynImg surpasses SOTA by approximately 2% on multiple video understanding benchmarks while using fewer visual tokens.
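The layout can be pictured as pasting a strip of downscaled non-key frames beneath the key frame before patch embedding. The sketch below shows such a composition with crude nearest-neighbor resizing; the strip height and thumbnail sizes are arbitrary choices, and the paper's exact prompt placement and 4D positional encoding are omitted.

```python
import numpy as np

def _resize(img, new_h, new_w):
    """Crude nearest-neighbour resize, to keep the sketch dependency-free."""
    ys = np.linspace(0, img.shape[0] - 1, new_h).astype(int)
    xs = np.linspace(0, img.shape[1] - 1, new_w).astype(int)
    return img[ys][:, xs]

def make_dynamic_image(key_frame, nonkey_frames):
    """key_frame: (H, W, 3); nonkey_frames: list of (H, W, 3) arrays."""
    h, w, _ = key_frame.shape
    n = len(nonkey_frames)
    thumbs = [_resize(f, h // 4, w // n) for f in nonkey_frames]  # temporal prompt thumbnails
    strip = _resize(np.concatenate(thumbs, axis=1), h // 4, w)    # one row, matched to key width
    return np.concatenate([key_frame, strip], axis=0)             # prompt appended below key frame

if __name__ == "__main__":
    key = np.zeros((224, 224, 3), dtype=np.uint8)
    others = [np.full((224, 224, 3), 64 * i, dtype=np.uint8) for i in range(1, 4)]
    print(make_dynamic_image(key, others).shape)  # (280, 224, 3)
```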
- EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception
EgoAdapt is a framework that jointly trains cross-modal distillation and policy learning to adaptively select the optimal modality combination, achieving up to 89% GMACs reduction while maintaining performance on par with or superior to SOTA on egocentric perception tasks.
- egoPPG: Heart Rate Estimation from Eye-Tracking Cameras in Egocentric Systems to Benefit Downstream Vision Tasks
This paper introduces egoPPG as a new egocentric vision task, proposes PulseFormer to estimate heart rate (MAE=7.67 bpm) from the eye-tracking cameras of unmodified egocentric head-mounted devices, and demonstrates that heart rate estimation improves skill assessment accuracy on EgoExo4D by 14.1%.
- EMoTive: Event-Guided Trajectory Modeling for 3D Motion Estimation
This paper proposes EMoTive, an event camera-based 3D motion estimation framework that encodes fine-grained temporal evolution via Event Kymograph and models spatiotemporal trajectories using event-density-guided non-uniform NURBS parametric curves. Optical flow and motion-in-depth fields are derived from these trajectories, achieving state-of-the-art performance on the newly constructed CarlaEvent3D dataset and real-world benchmarks.
- Factorized Learning for Temporally Grounded Video-Language Models
This paper proposes D2VLM, a framework that decomposes video understanding into a "first localize evidence, then generate answers based on evidence" paradigm. It introduces evidence tokens to capture event-level visual semantics and designs Factorized Preference Optimization (FPO) to simultaneously improve temporal grounding and text response quality.
- Fine-grained Spatiotemporal Grounding on Egocentric Videos
This paper presents EgoMask, the first pixel-level spatiotemporal grounding benchmark for egocentric videos, comprising short/medium/long evaluation splits and a large-scale training set EgoMask-Train. Through systematic analysis, it reveals key differences between egocentric and exocentric videos, and demonstrates that fine-tuned models can achieve substantial performance gains.
- Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow
Flow4Agent is the first work to introduce optical flow motion priors into LLM-based video understanding. It employs Temporal Granularity Optimization (TGO) to cluster video events via coarse-grained optical flow and filter redundant scenes using semantic priors, and Motion Token Pruning (MTP) to remove intra-frame static redundant tokens via fine-grained optical flow. The method achieves state-of-the-art performance on long-video benchmarks including VideoMME, MLVU, and LongVideoBench.
- FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases
FlowSeek integrates the prior knowledge of a depth foundation model (Depth Anything V2) and classical low-dimensional motion parameterization (motion bases) into an optical flow network, achieving state-of-the-art cross-dataset generalization while training on a single consumer-grade GPU.
- Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition
This paper proposes FS-VAE (Frequency-Semantic Enhanced Variational Autoencoder), which achieves significant performance gains in zero-shot skeleton-based action recognition through three key contributions: frequency decomposition for enhanced skeleton semantic learning, multilevel semantic alignment to bridge the visual-text modality gap, and a calibrated cross-alignment loss to mitigate alignment ambiguity.
- General Compression Framework for Efficient Transformer Object Tracking
This paper proposes CompressTracker, a general Transformer tracker compression framework that achieves architecture-agnostic efficient compression through three progressive innovations—stage division, replacement training, and feature mimicking—delivering a 2.42× speedup while retaining approximately 99% of SUTrack's accuracy.
- HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics
This paper proposes HERMES, a framework comprising two general-purpose modules — the Episodic COmpressor (ECO) and the Semantics reTRiever (SeTR) — that capture episodic memory and semantic information from video respectively. HERMES can serve as a standalone system achieving state-of-the-art performance, or be integrated as plug-and-play components into existing video-language models, simultaneously reducing inference latency by up to 43% and memory consumption by up to 46%.
- Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding
This paper addresses the Online Video Temporal Grounding (OnVTG) task by proposing a hierarchical event memory mechanism that stores historical event information at multiple temporal scales. Combined with a segment-tree-based event proposal structure and a future prediction branch, the method achieves state-of-the-art grounding accuracy and low-latency prediction on TACoS, ActivityNet Captions, and MAD.
- Learning to Generalize Without Bias for Open-Vocabulary Action Recognition
This paper proposes Open-MeDe, a meta-learning-based framework for open-vocabulary action recognition (OVAR). By simulating "known-to-open" generalization tasks via cross-batch meta-optimization and stabilizing training with a Gaussian weight averaging strategy, the framework improves generalization in both in-context and out-of-context settings without relying on CLIP regularization.
- MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation
MEMFOF is the first memory-efficient multi-frame optical flow method. By reducing the correlation volume resolution and introducing a high-resolution training strategy, it achieves state-of-the-art accuracy on Spring, Sintel, and KITTI benchmarks while requiring only 2.09 GB of GPU memory for 1080p inference.
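For intuition about where the memory goes, the sketch below builds a standard all-pairs correlation volume and compares its footprint at two feature resolutions; the chosen scales and random feature maps are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def correlation_volume(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """f1, f2: (C, H, W) feature maps -> (H, W, H, W) all-pairs correlation."""
    c = f1.shape[0]
    return torch.einsum("chw,cij->hwij", f1, f2) / c ** 0.5

if __name__ == "__main__":
    # All-pairs volumes grow with the 4th power of feature resolution, so
    # halving the feature resolution shrinks the volume by roughly 16x.
    h8, w8 = 1080 // 8, 1920 // 8      # hypothetical 1/8-scale features at 1080p
    h16, w16 = 1080 // 16, 1920 // 16  # hypothetical 1/16-scale features
    print((h8 * w8) ** 2 / (h16 * w16) ** 2)  # ~16x fewer correlation entries
    print(correlation_volume(torch.randn(32, 24, 32), torch.randn(32, 24, 32)).shape)
```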
- MikuDance: Animating Character Art with Mixed Motion Dynamics
This paper proposes MikuDance, a diffusion-based character art animation system that achieves high-dynamic animation of complex character artwork through two core contributions: Mixed Motion Modeling, which unifies character motion and 3D camera motion into a pixel-space representation, and Mixed-Control Diffusion, which implicitly aligns character shape/scale with motion guidance within the Reference UNet.
- MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of Restoration
This paper introduces MIORe and VAR-MIORe, two multi-task motion restoration benchmark datasets captured using a 1000fps industrial-grade camera and a professional lens array. The benchmarks span a full motion magnitude spectrum from near-static to extreme motion, employ an adaptive frame-averaging mechanism to generate consistent motion blur, and provide a unified evaluation platform for deblurring, frame interpolation, and optical flow estimation.
- MobileViCLIP: An Efficient Video-Text Model for Mobile Devices
MobileViCLIP introduces spatiotemporal structural re-parameterization into the efficient image-text model MobileCLIP and trains it on large-scale video-text datasets, yielding a mobile-deployable video-text model that achieves performance comparable to much larger models on zero-shot retrieval and action recognition.
- Moment Quantization for Video Temporal Grounding
This paper proposes MQVTG, which for the first time introduces vector quantization into video temporal grounding (VTG) by mapping video clips to discrete vectors via a moment codebook and soft quantization, thereby enhancing foreground/background discriminability and achieving state-of-the-art performance on 6 benchmarks.
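The quantization step amounts to softly assigning each clip feature to entries of a learned codebook. Below is a generic soft-VQ sketch of that idea; MQVTG's moment codebook construction and grounding head are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftQuantizer(nn.Module):
    """Softly assign clip features to a learned codebook (generic soft VQ)."""

    def __init__(self, num_codes=64, dim=256, temperature=0.07):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))
        self.temperature = temperature

    def forward(self, clip_feats):
        # clip_feats: (batch, clips, dim)
        logits = F.normalize(clip_feats, dim=-1) @ F.normalize(self.codebook, dim=-1).t()
        weights = F.softmax(logits / self.temperature, dim=-1)  # (batch, clips, num_codes)
        return weights @ self.codebook                          # quantized clip features

if __name__ == "__main__":
    q = SoftQuantizer()
    print(q(torch.randn(2, 100, 256)).shape)  # torch.Size([2, 100, 256])
```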
- Multi-modal Multi-platform Person Re-Identification: Benchmark and Method
This paper presents MP-ReID, the first multi-modal multi-platform person re-identification benchmark encompassing three modalities (RGB, infrared, thermal) and two platforms (ground and UAV), along with a unified prompt learning framework, Uni-Prompt ReID, which leverages modality-aware, platform-aware, and visual-enhanced prompts to substantially improve ReID performance under complex real-world conditions.
- Online Dense Point Tracking with Streaming Memory
This paper proposes SPOT, a framework for online dense long-range point tracking via a customized memory readout module, sensory memory, and visibility-guided splatting. SPOT achieves state-of-the-art performance on the CVO benchmark with 10× fewer parameters and 2× faster speed, while matching or surpassing offline methods on multiple sparse tracking benchmarks.
- OVG-HQ: Online Video Grounding with Hybrid-modal Queries
This paper proposes OVG-HQ, a new online video grounding task supporting hybrid-modal queries (text, image, and video clip), and introduces a Parametric Memory Block (PMB) to retain historical context alongside a hybrid distillation strategy to mitigate modality imbalance, enabling real-time moment localization in streaming video.
- PriOr-Flow: Enhancing Primitive Panoramic Optical Flow with Orthogonal View
This paper proposes PriOr-Flow, a dual-branch framework that leverages the low-distortion prior of orthogonal views to compensate for severe distortions in polar regions of ERP panoramic images, achieving significant improvements in panoramic optical flow estimation — reducing EPE by 30.0% on MPFDataset and 29.6% on FlowScape.
- Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
This paper proposes Q-Frame, a training-free plug-and-play framework for video frame selection and multi-resolution adaptation. By leveraging CLIP cross-modal matching and the Gumbel-Max trick, Q-Frame achieves query-aware frame selection, enabling Video-LLMs to process more informative frames under the same computational budget. It achieves significant performance gains on three benchmarks: MLVU, LongVideoBench, and Video-MME.
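A minimal sketch of the selection step, assuming per-frame CLIP image embeddings and a query text embedding have already been computed: relevance scores receive Gumbel noise, and the top-k noisy scores give a stochastic, query-aware frame sample (a Gumbel-top-k variant of the Gumbel-Max trick). The multi-resolution adaptation stage is not shown.

```python
import torch
import torch.nn.functional as F

def gumbel_max_select(frame_feats, text_feat, k, tau=1.0):
    """frame_feats: (T, D) CLIP image features; text_feat: (D,) CLIP text feature.
    Returns indices of k sampled frames, sorted temporally."""
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    scores = frame_feats @ text_feat / tau                    # query-frame relevance
    gumbel = -torch.log(-torch.log(torch.rand_like(scores)))  # Gumbel(0, 1) noise
    return torch.topk(scores + gumbel, k).indices.sort().values

if __name__ == "__main__":
    idx = gumbel_max_select(torch.randn(128, 512), torch.randn(512), k=16)
    print(idx)  # 16 sampled frame indices in temporal order
```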
- RainbowPrompt: Diversity-Enhanced Prompt-Evolving for Continual Learning
This paper proposes RainbowPrompt, a prompt-evolving mechanism that integrates multiple task-specific prompts into a diversity-enhanced unified prompt via attention-based transformation and task-guided alignment, achieving an average improvement of 8.23% over existing methods on image classification and video action recognition tasks.
- ResidualViT for Efficient Temporally Dense Video Encoding
This paper proposes ResidualViT, which draws an analogy to I-frame/P-frame strategies in video compression by alternating between a full ViT and a lightweight residual ViT for encoding video frames. The approach achieves up to 60% reduction in computational cost and 2.5× inference speedup while maintaining accuracy close to the original CLIP.
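The alternation can be sketched as running a full encoder on every N-th frame and a cheap encoder in between, reusing the last keyframe features. The placeholders below only illustrate the control flow; the actual ResidualViT reuses and updates tokens differently, so treat this purely as an analogy in code.

```python
import torch
import torch.nn as nn

def encode_video(frames, full_encoder, light_encoder, keyframe_every=4):
    """Alternate an expensive encoder ("I-frames") with a cheap one ("P-frames")."""
    feats, last_key = [], None
    for t, frame in enumerate(frames):
        if t % keyframe_every == 0:
            last_key = full_encoder(frame)     # expensive keyframe encoding
            feats.append(last_key)
        else:
            residual = light_encoder(frame)    # cheap residual update
            feats.append(last_key + residual)  # reuse the last keyframe features
    return torch.stack(feats)

if __name__ == "__main__":
    full = nn.Linear(768, 768)                                     # placeholder for a full ViT
    light = nn.Sequential(nn.Linear(768, 64), nn.Linear(64, 768))  # placeholder residual encoder
    video = [torch.randn(768) for _ in range(8)]                   # 8 pre-pooled frame features
    print(encode_video(video, full, light).shape)                  # torch.Size([8, 768])
```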
- Simultaneous Motion And Noise Estimation with Event Cameras
This paper presents the first joint method for simultaneous motion estimation and noise estimation with event cameras. It scores each event using the local contrast in the motion-compensated image of warped events (IWE) within the Contrast Maximization (CMax) framework, and obtains motion parameters along with signal/noise classification through alternating optimization. The method achieves state-of-the-art performance on the E-MLB denoising benchmark.
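The Contrast-Maximization objective referenced above can be sketched as follows: warp events with a candidate 2D velocity, accumulate an image of warped events (IWE), and score the hypothesis by the image variance. The paper's per-event noise scoring and alternating optimization are not reproduced.

```python
import numpy as np

def iwe_contrast(xs, ys, ts, flow, height=180, width=240):
    """xs, ys, ts: event coordinates and timestamps (1D arrays); flow: (vx, vy) in px/s."""
    t_ref = ts[0]
    wx = np.clip(np.round(xs - flow[0] * (ts - t_ref)).astype(int), 0, width - 1)
    wy = np.clip(np.round(ys - flow[1] * (ts - t_ref)).astype(int), 0, height - 1)
    iwe = np.zeros((height, width))
    np.add.at(iwe, (wy, wx), 1.0)   # accumulate the warped events into an image
    return iwe.var()                # well-aligned (sharper) IWE => higher contrast

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 5000
    xs = rng.uniform(0, 240, n)
    ys = rng.uniform(0, 180, n)
    ts = np.sort(rng.uniform(0, 0.05, n))
    print(iwe_contrast(xs, ys, ts, flow=(20.0, 0.0)))  # score of one motion hypothesis
```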
- Sparse-Dense Side-Tuner for Efficient Video Temporal Grounding
This paper proposes SDST (Sparse-Dense Side-Tuner), the first anchor-free side-tuning architecture for video temporal grounding (VTG). Through a sparse-dense dual-stream design, SDST jointly addresses moment retrieval (MR) and highlight detection (HD). A novel Reference-based Deformable Self-Attention (RDSA) module is introduced to resolve the context deficiency in standard deformable cross-attention. SDST achieves state-of-the-art or highly competitive results on QVHighlights, TACoS, and Charades-STA while reducing trainable parameters to 27% of the current SOTA.
- TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding
This paper proposes TimeExpert — the first MoE-based Video-LLM framework that routes timestamps, saliency scores, and text descriptions to specialized experts via task-aware dynamic gating and token-adaptive routing, complemented by task-dependent auxiliary losses. TimeExpert achieves state-of-the-art performance across three VTG task categories: Dense Video Captioning, Moment Retrieval, and Video Highlight Detection.
- TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision
This paper proposes TOGA — a weakly supervised vision-language model that generates pseudo temporal labels via a multi-scale visual-language connector and consistency constraints, enabling joint generation of open-ended answers and temporal grounding without any temporal annotations, achieving SOTA on NExT-GQA, MSVD-QA, and ActivityNet-QA.
- Towards Efficient General Feature Prediction in Masked Skeleton Modeling
This paper proposes GFP (General Feature Prediction), a framework that elevates the reconstruction target in masked skeleton modeling from low-level joint coordinates to multi-scale high-level semantic feature prediction. Coupled with a lightweight Target Generation Network and an information maximization constraint, GFP achieves a 6.2× training speedup while attaining state-of-the-art performance.
- Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding
This paper introduces Video Thinking Test (Video-TT), a benchmark for evaluating both the correctness and robustness of video large language models (Video LLMs). It comprises 1,000 YouTube Shorts videos and 5,000 questions, designed around visual/narrative complexity factors and natural adversarial question variants. The benchmark reveals a substantial gap between the best-performing model (GPT-4o, 36.6%) and humans (84.3%).
- Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition
This paper proposes the Trokens framework, which converts point trajectories into semantically-aware relational tokens via semantic-aware trajectory point sampling and relational motion modeling (comprising intra-trajectory HoD and inter-trajectory relative displacement descriptors). By fusing these tokens with appearance features, Trokens achieves state-of-the-art performance on six few-shot action recognition benchmarks.
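As an illustration of a histogram-of-displacements style descriptor, the sketch below bins the per-step displacement directions of a single tracked point, weighted by step length; this is a generic HoD, not the exact Trokens formulation.

```python
import numpy as np

def hod_descriptor(track, num_bins=8):
    """track: (T, 2) point positions over T frames -> (num_bins,) direction histogram."""
    disp = np.diff(track, axis=0)                # (T-1, 2) per-step displacements
    angles = np.arctan2(disp[:, 1], disp[:, 0])  # direction of each step
    magnitudes = np.linalg.norm(disp, axis=1)    # weight each step by its length
    hist, _ = np.histogram(angles, bins=num_bins, range=(-np.pi, np.pi), weights=magnitudes)
    return hist / (hist.sum() + 1e-6)

if __name__ == "__main__":
    t = np.linspace(0, 2 * np.pi, 32)
    circle = np.stack([np.cos(t), np.sin(t)], axis=1)  # a circular trajectory
    print(hod_descriptor(circle).round(3))             # roughly uniform over directions
```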
- UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions
UMDATrack proposes the first unified multi-domain adaptive tracking framework. It leverages text-guided diffusion models to synthesize a small amount of unlabeled multi-weather video (under 2% of the frames), employs Domain-Customized Adapters (DCA) to efficiently transfer object representations across weather domains, and introduces Target-aware Confidence Alignment (TCA) based on optimal transport to enhance cross-domain localization consistency. The framework substantially outperforms existing state-of-the-art trackers under nighttime, hazy, and rainy conditions.
- Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras
This paper proposes the first unsupervised learning framework based on a single network for jointly estimating optical flow and image intensity from event camera data. The core contribution is a complementary loss formulation combining a newly derived Event-based Photometric Error (PhE) with Contrast Maximization (CMax).
- Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
This paper proposes Vamba — a hybrid Mamba-Transformer large multimodal model (LMM) that encodes video tokens with linear complexity via Mamba-2 blocks and updates text tokens via cross-attention. Vamba processes up to 1024 frames on a single GPU and outperforms all efficient LMM methods on hour-level video understanding benchmarks.
- VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges
This paper proposes VideoLLaMB, which achieves long streaming video understanding with linear GPU memory scaling via SceneTiling semantic segmentation, recurrent memory bridge layers, and a memory-cache retrieval mechanism, yielding an average improvement of 4.2 points across four VideoQA benchmarks.
- VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization
This paper proposes VideoMiner, a tree-structured reinforcement learning framework for long-form video understanding. It iteratively applies segmentation–captioning–clustering to construct a hierarchical video tree, and introduces T-GRPO (Tree-based Group Relative Policy Optimization) to guide a policy model in adaptively exploring key frames. VideoMiner achieves state-of-the-art performance on four long-video benchmarks, and it is observed that T-GRPO spontaneously elicits chain-of-thought reasoning.
- VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning
This paper proposes VTimeCoT, a training-free visual-temporal chain-of-thought framework that overlays a synchronized progress bar and highlights key segments at the bottom of video frames, enabling multimodal large language models (MLLMs) to accurately perceive timestamps. The approach substantially outperforms GPT-4o and Qwen2VL-7B baselines on temporal grounding and reasoning QA tasks.
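The visual prompt itself is straightforward to reproduce: overlay a synchronized progress bar at the bottom of each frame so the model can read the timestamp off the pixels. The PIL sketch below does exactly that; the bar geometry and colors are arbitrary choices.

```python
from PIL import Image, ImageDraw

def add_progress_bar(frame, t, duration, bar_h=12):
    """Overlay a progress bar encoding timestamp t (seconds) of a video of given duration."""
    out = frame.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    draw.rectangle([0, h - bar_h, w, h], fill=(40, 40, 40))                       # bar background
    draw.rectangle([0, h - bar_h, int(w * t / duration), h], fill=(255, 64, 64))  # elapsed portion
    return out

if __name__ == "__main__":
    frame = Image.new("RGB", (640, 360), (0, 120, 0))  # stand-in video frame
    add_progress_bar(frame, t=30.0, duration=120.0).save("frame_with_bar.png")
```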
- What You Have is What You Track: Adaptive and Robust Multimodal Tracking
This paper proposes FlexTrack—the first framework to systematically study tracking under temporally incomplete multimodal data—achieving adaptive computational complexity via a Heterogeneous Mixture-of-Experts fusion module (HMoE) combined with a video-level masking training strategy. FlexTrack achieves state-of-the-art performance on 9 benchmarks, with gains of 2.6% under complete modalities and 10.2% under missing-modality scenarios.
- XTrack: Multimodal Training Boosts RGB-X Video Object Trackers
This paper proposes XTrack, which employs a Mixture of Modal Experts (MeME) framework and a soft-routing classifier to enable cross-modal knowledge sharing across RGB-D/T/E modalities, allowing inference with a single modality to benefit from multimodal training knowledge, achieving an average precision gain of 3%.