📹 Video Understanding

📷 CVPR2026 · 178 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (47) · 🧪 ICML2026 (17) · 🤖 AAAI2026 (27) · 🧠 NeurIPS2025 (39) · 📹 ICCV2025 (56)

🔥 Top topics: Object Tracking ×33 · Segmentation ×14 · Compression ×11 · Multimodal/VLM ×10 · Anomaly Detection ×8

A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett-Luce Ranking

The authors propose PL-Stitch, a self-supervised framework that utilizes the Plackett-Luce probabilistic ranking model to treat the temporal ordering of video frames as a pre-training signal. By learning "procedure-aware" video representations, it significantly outperforms existing self-supervised methods in surgical phase recognition and cooking action segmentation.
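
To make the ranking signal concrete, here is a minimal sketch of the standard Plackett-Luce log-likelihood, assuming each frame receives one scalar score and that frames are listed in their true temporal order. The function name and the toy scores are illustrative assumptions, not taken from PL-Stitch.

```python
import math

def plackett_luce_log_likelihood(scores_in_order):
    """Log-probability of an ordering under the Plackett-Luce model.

    scores_in_order[k] is the score of the item placed at rank k
    (assumed here: the frame that truly comes k-th in time).
    """
    total = 0.0
    for k in range(len(scores_in_order)):
        remaining = scores_in_order[k:]
        # Numerically stable log-sum-exp over the items not yet ranked.
        m = max(remaining)
        log_norm = m + math.log(sum(math.exp(s - m) for s in remaining))
        total += scores_in_order[k] - log_norm
    return total

# Scores that decrease along the true order make that order more likely
# than scores that increase along it.
good = plackett_luce_log_likelihood([2.0, 1.0, 0.0])
bad = plackett_luce_log_likelihood([0.0, 1.0, 2.0])
print(good > bad)
```

Maximizing this quantity with respect to the scores pushes earlier frames toward higher scores, which is one way a temporal-ordering objective can be posed; the paper's actual loss may differ.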

Active Intelligence in Video Avatars via Closed-loop World Modeling

To address the issue of current video avatars "passively following speech/pose while lacking autonomous goal-driven behavior," this paper proposes the L-IVA task (modeling avatar control as a POMDP with I2V generation models as environment simulators) and the ORCA framework. ORCA utilizes an "Observe-Think-Act-Reflect" (OTAR) closed-loop to counteract generational randomness and a System 2/System 1 dual-system hierarchy for open-domain planning and precise grounding. On a benchmark of 100 tasks, it achieves an average task success rate of 71.0%, significantly exceeding open-loop, reactive, and reflection-free baselines.

Adaptive Capacity Autoregressive Visual Tracking

ARTrack-AC extends autoregressive tracking from "fixed-capacity per-frame prediction" to "system-level autoregression." It uses a lightweight diffusion trajectory estimator to pre-judge the stability of future video segments. A controller then switches to a low-capacity parallel mode for simple segments and a high-capacity sequential mode for difficult frames, achieving 66.7% AUC on LaSOT while being 2.9x faster than its predecessor.

AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

This paper proposes AdaSpark, which cuts long-video processing FLOPs by up to 57% while maintaining performance. It combines 3D spatiotemporal cube partitioning with two complementary adaptive sparsity mechanisms: cube-level attention selection and token-level FFN selection.

Affordance-First Decomposition for Continual Learning in Video–Language Understanding

Addressing the blurred boundary of "what to stabilize and what to plasticize" in video-language continual learning, this paper proposes Affordance-First Decomposition (AFD). It maps videos to slowly-varying affordance tokens as a shared, stable "evidence foundation" across tasks, while concentrating plasticity into a LoRA scheduler that utilizes query-based routing and conflict-triggered rank expansion. Combined with question-only replay distillation (storing no videos) for anti-forgetting, AFD achieves higher accuracy and lower forgetting on ViLCo-Bench and domain/time-incremental VideoQA.

Alert-CLIP: Abnormality-aware Latent-Enhanced Representation Tuning of CLIP for Video Anomaly Detection

To address the issue where CLIP's text space highly entangles "normal" and "abnormal" descriptions—causing near-identical similarity scores for both types of prompts—this paper reshapes CLIP's embedding geometry via three-level (Global/Regional/Hard Negative) cross-modal contrastive training using a self-built dataset (VAGTA). This transforms CLIP into a more abnormality-aware backbone, consistently outperforming original CLIP in weakly supervised, zero-shot, and open-vocabulary VAD settings.

\(\alpha\)Matte4K & \(\mu\)Matting: Dataset and Model for Ultra-Micro Precision Alpha Video Matting

Targeting 4K portrait video matting, this paper introduces \(\alpha\)Matte4K, a large-scale dataset with pixel-level precision and physical consistency generated via Physics-Based Rendering (PBR). It also proposes \(\mu\)Matting, which utilizes a portrait prior (MAE) to predict a coarse alpha and identify "difficult regions," followed by sparse 3D convolution refinement only on these regions. This approach achieves full-resolution 4K video matting without downsampling for the first time, surpassing existing SOTA in both accuracy and temporal consistency.

An Efficient Token Compression Framework for Visual Object Tracking

To address the visual token explosion and redundancy in multi-frame template tracking, ETCTrack utilizes a learnable Adaptive Token Compressor (ATC) to compress historical template frames into a refined subset. This is followed by a Hierarchical Interaction Block (HIBlock) for deep interaction with the search region. It sets new state-of-the-art accuracy across 7 benchmarks while reducing computation (template tokens reduced by 60%, MACs reduced by 21.4%, with only a 0.4% drop in accuracy).

An Empirical Study on How Video-LLMs Answer Video Questions

This paper systematically dissects the internal mechanisms of how Video-LLMs answer video questions using "attention knockout." It identifies a clear "early-layer perception, late-layer reasoning" two-stage pattern and finds that spatiotemporal modeling relies primarily on language-to-video retrieval rather than intra/inter-frame video self-attention. Furthermore, only a few intermediate layers are critical. Based on these insights, the authors design a simple strategy involving early exit for visual tokens and temporal attention pruning, which significantly reduces computational cost with almost no performance degradation.

Asynchronous Temporal Modeling with Two-Agent Framework for Streaming Dense Video Captioning

Addressing the "when to speak" challenge in streaming dense video captioning, which is difficult to control via thresholds, this paper proposes Takusen. It is an asynchronous dual-agent framework using a small model as an "Oracle" to detect event boundaries ahead of time and a large model as a "Listener" to generate descriptions only upon receiving signals. This mechanism eliminates thresholds and achieves streaming SOTA on ActivityNet Captions and YouCook2.

AutoCut: End-to-end Advertisement Video Editing Based on Multimodal Discretization and Controllable Generation

AutoCut proposes an end-to-end advertisement video editing framework that unifies video, audio, and text into a shared discrete token space via Residual Vector Quantization (RQVAE). By performing multimodal alignment and supervised fine-tuning on Qwen3-8B, it achieves unified processing of four tasks—video selection, ordering, script generation, and background music (BGM) selection—outperforming GPT-4o baselines on multiple metrics.
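
The discretization step can be illustrated with a generic residual vector quantization sketch: each stage quantizes the residual left by the previous stage, so a vector becomes a short sequence of code indices. The toy codebooks below are hand-written assumptions; in a trained RQ-VAE they would be learned, and AutoCut's exact configuration is not given in this summary.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for i, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the vector as the sum of the selected code vectors."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical two-stage codebooks: a coarse stage and a finer correction stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
tokens = rvq_encode([1.2, 0.8], codebooks)
print(tokens, rvq_decode(tokens, codebooks))
```

The resulting index sequence is what a language model could consume as discrete tokens alongside text.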

Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

AutoGaze is a lightweight module with only 3M parameters that runs before the ViT and autoregressively selects a minimal set of multi-scale patches sufficient to reconstruct the video. It removes \(4\times\) to \(100\times\) spatio-temporal redundancy, achieving up to \(19\times\) acceleration for ViT and \(10\times\) for MLLMs. It enables MLLMs to scale to 1K-frame 4K-resolution videos for the first time, reaching 67.0% on VideoMME.

Beyond Explicit Language: Plug-and-Play Visual-to-Linguistic Modeling Toward General Object Tracking

Addressing the issues where vision-language tracking relies on static text and fails when text is missing, this paper proposes TIMI, a plug-and-play module. Using a "Textual Inversion Module," visual patches from templates and search regions are inversely mapped into "pseudo-descriptions" within the CLIP text embedding space. These implicit linguistic cues are then injected back into the visual backbone layer-by-layer via a "Multi-layer Semantic Injection mechanism." This provides dynamic, adaptive semantic guidance without any explicit text input, achieving stable performance gains across multiple trackers like MCITrack, DUTrack, and SeqTrack with minimal overhead.

Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation

Without modifying the plain ViT backbone and lightweight decoder of ViTPose, TAR-ViTPose employs "Joint-centered Temporal Aggregation (JTA) + Global Restore Attention (GRA)" to align, aggregate, and inject joint features from adjacent frames back into the current frame. This plug-and-play approach improves 2D video pose estimation by +2.3 mAP on PoseTrack2017 compared to single-frame ViTPose, while achieving higher speeds (413 fps for ViT-S).

Boosting Self-Supervised Tracking with Contextual Prompts and Noise Learning

PNTrack equips self-supervised trackers with a Dual-mode Contextual Association (DCA) mechanism. In early training stages, semantic patch tokens are fed as prompts to forward/backward tracking branches to accelerate convergence. In later stages, random background tokens are injected as noise to perturb the feature space, forcing the model to learn robust representations. This entire mechanism is enabled only during training and completely removed during inference, achieving a new self-supervised SOTA across 8 tracking benchmarks.

Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation

DiTTA utilizes a lightweight temporal add-on to perform test-time adaptation (TTA) on the initial frames of a test video for an image semantic segmentation (ISS) model. By distilling the temporal propagation capabilities of SAM2, it "bootstraps" the model into a video-specific VSS model. The model is subsequently frozen for high-speed inference on the remaining frames without requiring video labels, outperforming fully supervised VSS methods on VSPW.

Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions

The authors propose DynUAV—a multi-object tracking benchmark (42 videos, 1.7M+ bounding boxes, 8 categories) that intentionally creates intense ego-motion through aggressive UAV maneuvers. It breaks the implicit "smooth near-linear motion" assumption of existing UAV-MOT datasets. Experiments with 11 SOTA trackers demonstrate that existing methods' detection and association simultaneously collapse under drastic viewpoint and scale mutations.

Building a Precise Video Language with Human-AI Oversight

Addressing the long-standing issues of video captioning—"lack of specifications, lack of oversight, and model hallucinations"—this work defines "what should be described" via a structured specification (5 dimensions + 200+ visual primitives). It introduces CHAI (Critique-based Human-AI Oversight), where the model generates a pre-caption, humans provide only "critiques" to pinpoint errors, and the model revises the text into a post-caption, naturally producing (pre-caption, critique, post-caption) triples. By using these preference and critique signals for SFT/DPO post-training, the open-source Qwen3-VL-8B outperforms Gemini-3.1-Pro across captioning, reward modeling, and critique generation. It further enhances Wan2.2 text-to-video generation in following 400-word long prompts.

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects

To address the lack of object-level annotation data in Dense Video Object Captioning (DVOC), this paper employs a VLM (Gemini 2.0 Flash) to automatically generate object-level captions for videos with drawn bounding boxes. By augmenting LVIS/LV-VIS into the first DVOC training sets (LVISCap / LV-VISCap) containing (mask, box, category, caption) quadruplets, the authors train CaptionFormer—the first end-to-end model to jointly segment, detect, track, and caption every object trajectory—achieving new SOTA results on VidSTG, VLN, and BenSMOT benchmarks.

CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization

CineSRD is a training-free multi-modal speaker diarization framework. It registers speakers via visual anchor clustering and detects speaker transitions using an audio language model, addressing open-world challenges in cinematic works such as long-duration videos, large character counts, and audio-visual asynchrony.

CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning

CLCR organizes the features of each modality into three semantic levels (shallow/middle/deep). It uses an Intra-level Controlled Exchange Domain (IntraCED) to restrict cross-modal interaction to a shared subspace and an Inter-layer Collaborative Aggregation Domain (InterCAD) for adaptive cross-layer fusion, addressing the cross-level semantic asynchrony problem in multimodal learning.

Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining

This paper proposes ClusterSTM, which retains semantically complete visual tokens under high masking rates through intra-frame semantic clustering and cluster-wise spatio-temporal masking. With an added video-text relevance reconstruction objective, it achieves efficient video-language pretraining at very low computational cost, reaching a new SOTA among efficient models on tasks such as retrieval, VQA, and captioning.

CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection

Addressing the issue where existing AIGC video detection datasets rely on low-quality open-source models and fail to generalize to high-fidelity commercial models, this research constructs the CoCoVideo-26K benchmark covering 13 commercial models with 26,000 "semantically aligned real-fake paired" segments. It proposes the CoCoDetect framework, which captures texture-level differences using dual-head contrastive training with R3D-18 and routes uncertain samples via confidence gating to an MLLM for physical/semantic reasoning, achieving an average Acc of 90.69% and AUC of 95.93%.

Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing

This paper proposes a new paradigm of "Grayscale Always-on, RGB On-demand." ColorTrigger detects color redundancy online using lightweight quadratic programming on the grayscale stream. Using only 8.1% of RGB frames, it retains 91.6% of the full-color baseline performance, enabling always-on video perception on resource-constrained devices.

CoWTracker: Tracking by Warping instead of Correlation

CoWTracker replaces the "cost volume calculation for matching" in dense point tracking with "warping target frame features back to the reference frame based on current trajectory estimates + global reasoning via spatio-temporal Transformer." By removing the cost volume, which grows quadratically with resolution, it achieves SOTA results on TAP-Vid / RoboTAP. Furthermore, the same model outperforms specialized optical flow methods when zero-shot transferred to optical flow tasks.

CVA: Context-aware Video-text Alignment for Video Temporal Grounding

The Context-aware Video-text Alignment (CVA) framework is proposed, consisting of three synergistic components: Query-aware Context Diversification (QCD), Context-invariant Boundary Discrimination (CBD) loss, and Context-enhanced Transformer Encoder (CTE). It addresses false negatives and background correlation issues in video temporal grounding, achieving approximately a 5-point improvement in [email protected] on the QVHighlights dataset.

D2FANet: Enhancing Video Object Detection with Dual-Domain Feature Aggregation Network

D2FANet introduces frequency-domain feature aggregation to video object detection for the first time. It employs a frequency-domain branch (Octave convolution for high/low frequency decomposition + cross-scale neighborhood fusion + frequency temporal attention) and a spatio-temporal branch (adaptive token aggregation guided by importance maps) to enhance object queries independently. The concatenated queries are fed into the detection head, achieving 91.8% mAP with Swin-Base on ImageNet VID with the fastest inference speed.

DarkAct: A RGB-Thermal Dataset and Fusion Framework for Multimodal Low-Light Action Recognition

To address the lack of data and methods for human action recognition under nighttime/low-light conditions, the authors construct the first large-scale paired RGB-Thermal video dataset, DarkAct (12,778 video pairs, 27 action classes). They further propose DarkAct-Net, a fusion framework that extracts motion-salient regions using motion-aware attention and dynamically integrates the two modalities based on reliability through illumination-adaptive fusion. It achieves 74.4% Top-1 accuracy in multimodal recognition, significantly exceeding all single-modality and existing fusion baselines.

DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions

Aiming at the realistic but long-neglected action recognition scenario of "low-light + handheld 6-DoF shaking," this paper first utilizes an Adaptive IMU Motion Compensation (AIMC) driven by angular velocity to correct event stream distortions caused by shaking. Subsequently, Iterative Greedy Sampling (IGS) is employed to select the most informative keyframes, followed by a four-stage Hybrid Spatio-Temporal Swin Transformer (HSTS) for recognition. The authors also release DarkShake-DVS (18,041 segments, 62 classes), the first event-based action dataset featuring low-light, intense shaking, and synchronized IMU data, outperforming SOTA on three benchmarks.

DeRVOS: Decoupling Consistent Trajectory Generation and Multimodal Understanding for Referring Video Object Segmentation

DeRVOS decouples Referring Video Object Segmentation (RVOS) into two upstream branches: "Consistent Trajectory Generation" and "Multimodal Understanding." Using a frozen DVIS++ and a pre-trained BEiT-3, the model directly produces stable instance trajectories and aligned vision-language features. A TAIS module then converges the task into "Referring Expression \(\leftrightarrow\) Instance Trajectory" matching, outperforming LVLM-based methods by 4.7% on MeViS.

DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning

Addressing the novel non-intrusive activity recognition scenario of "fixed exocentric video + ambient sensors," DETACH decomposes both video and sensor data into "spatial components + temporal components." It establishes cross-modal spatial correspondence via online clustering and performs fine-grained temporal alignment using spatial-guided weighted contrastive loss. On Opportunity++ and HWU-USP, it achieves up to a 30% F1 improvement and a 50% mAP improvement over methods adapted from egocentric baselines.

DIvide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

DIG is a training-free frame selection framework that categorizes queries into global or grounding types. It employs uniform sampling for global queries and a specialized pipeline—consisting of content-adaptive frame selection, LMM-based reward scoring, and video refinement—for grounding queries, consistently outperforming existing methods across three long video understanding benchmarks.

Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering

The paper introduces the EgoPointVQA dataset and the HINT (Hand Intent Tokens) method. By encoding 3D hand keypoints into hand intent tokens and interleaving them with vision tokens as input to an MLLM, the approach solves gesture-based deictic question answering in egocentric videos. HINT-14B achieves 68.1% accuracy, outperforming InternVL3-14B by 5.4pp.

Drift-Resilient Temporal Priors for Visual Tracking

This paper proposes DTPTrack, a lightweight plug-and-play temporal modeling module. A Temporal Reliability Calibrator (TRC) assigns reliability scores to historical frames to filter noise, and a Temporal Guidance Synthesizer (TGS) synthesizes the calibrated history into dynamic prior tokens that suppress tracking drift, achieving SOTA performance across multiple benchmarks.

Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual-Inertial Odometry

This paper proposes a dual-agent reinforcement learning framework with a Select Agent (deciding whether to trigger the visual front-end based on IMU signals) and a Fusion Agent (adaptively fusing visual-inertial states). It significantly reduces how often VIBA is invoked, and the associated computational overhead, without removing VIBA entirely, achieving a superior trade-off between accuracy, efficiency, and memory usage.

Dynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos

This paper redefines "inferring rigid-body physical states and parameters from monocular video" as a text generation problem: training a VLM (ΔYNAMICS, based on Qwen2.5-VL-3B) to directly output a YAML configuration describing the entire scene (geometry / initial velocity / material / camera / gravity). This is then passed to MuJoCo for re-simulation. Performance is enhanced by "reasoning about motion events in natural language before generating the configuration" and using "optical flow input." It achieves a segmentation IoU 7 times higher than mainstream VLMs on CLEVRER and successfully transfers to 235 real-world videos.

EarlyTom: Early Token Compression Completes Fast Video Understanding

EarlyTom is a training-free video token compression framework that shifts the compression point from "after the visual encoder" to "inside the visual encoder" via intra-encoder frame merging, paired with a decoupled spatial token selection strategy. On LLaVA-OneVision-7B, it reduces Time-to-First-Token (TTFT) by up to \(2.65\times\) and FLOPs by 61%, while maintaining over 96% of the full-token baseline accuracy.

Efficient All-Pairs Correlation Volume Sampling for Optical Flow Estimation

Addressing the dilemma in RAFT-based optical flow methods where "all-pairs correlation volume sampling" either leads to memory explosion or low speed at high resolutions, this paper observes that only 1.6% of the correlation volume is actually sampled. Based on this, a sampling operator featuring block sparsity + patch-major layout + fused CUDA kernel is designed. It mathematically reproduces the RAFT sampling definition with bit-accuracy while reducing both time and memory complexity from quadratic to linear \(\mathcal{O}(n)\). The method saves up to 63–67% of end-to-end inference time and achieves SOTA on the precision-speed Pareto front using a self-built 8K dataset.

Efficient Frame Selection for Long Video Understanding via Reinforcement Learning

Addressing the issue that "uniform sampling misses keyframes" in long video understanding, this paper trains a lightweight, plug-and-play query-adaptive frame selector. It first distills a semantic relevance prior from a frozen CLIP and then fine-tunes it using an improved GRPO (with hierarchical rewards at both frame and combination levels), directly using the downstream MLLM's accuracy as the signal. The method achieves an average +3.28% gain across four mid-to-long video benchmarks, with more significant improvements on longer videos.
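
As a rough illustration of the reinforcement signal, the sketch below computes the group-relative advantages used in standard GRPO: several candidate outputs are sampled for one query, each gets a reward, and rewards are normalized within the group. It assumes a reward of 1 when the downstream MLLM answers correctly with a candidate frame set and 0 otherwise; the paper's hierarchical frame-level and combination-level rewards are not reproduced here.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (standard GRPO-style advantage)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # eps guards against division by zero when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group: four candidate frame selections for one question,
# rewarded by whether the downstream model answered correctly.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Selections that beat the group average receive positive advantages and are reinforced, so no separate value network is needed.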

Ego-Grounding for Personalized Question-Answering in Egocentric Videos

This paper proposes MyEgo—the first diagnostic benchmark for "personalized egocentric video question answering" (541 long videos, 5K questions regarding "my things/my activities/my past"). It systematically examines whether mainstream MLLMs can perform ego-grounding (understanding, remembering, and tracking the "camera wearer/me"). The results reveal that GPT-5 achieves only 46% accuracy, trailing humans by nearly 40 points. Furthermore, increasing model scale or adding Chain-of-Thought (CoT) fails to solve the issue, as the bottleneck lies in long-term memory and identity tracking.

EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions

This paper proposes EgoXtreme, the first large-scale 6D object pose estimation benchmark for egocentric views under extreme conditions. It covers three real-world challenges (severe motion blur, dynamic lighting, and smoke occlusion) and reveals significant failures of current SOTA pose estimators in these environments.

ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Video Understanding

This paper identifies "Semantic Aggregation Hallucinations (SAH)," an overlooked type of video hallucination where a model perceives each frame correctly but misattributes semantics during cross-event aggregation. The authors construct ELV-Halluc, the first benchmark targeting SAH (348 multi-event videos, adversarial triplet Q&A). Systematic evaluation of 19 MLLMs shows that SAH increases with semantic complexity. By employing improved positional encoding and DPO with 8K adversarial pairs, the SAH Ratio is reduced by up to 27.7%.

Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration

This paper proposes a data-efficient post-hoc calibration method that aligns the predicted distribution of uncertainty-aware gaze tracking models with the true observed distribution using isotonic regression. It introduces the Coverage Probability Error (CPE) metric to replace the unreliable Error-Uncertainty Correlation (EUC) for evaluating uncertainty quality.

Enhancing Video Vision Language Model with Hippocampal Sensing

This paper mimics the hippocampal cross-modal association mechanism by first performing SFT on a Video VLM using "cross-modal temporal prediction" (completing audio from video, and vice versa), followed by a contrastive RL strategy (VANAO) with "negative-aware rewards" to enforce genuine joint audio-visual reasoning. This approach enables 7B/8B small models to rival GPT-4o and Gemini-1.5-Pro across multiple video VQA benchmarks.

Envisioning the Future, One Step at a Time

This paper models open-set prediction of future scene dynamics as step-by-step reasoning over sparse point trajectories. Using an autoregressive diffusion model, it rapidly generates thousands of diverse future hypotheses from a single image, several orders of magnitude faster than dense models.

EthoCLIP: Ontology-Enhanced Video-Language Pretraining for Animal Behavior Understanding

Addressing the "extreme data scarcity" in animal behavior videos, this paper injects the expert-constructed Neuro Behavior Ontology (NBO) as an inductive bias into CLIP-style video-language contrastive learning. The authors first construct the AnimalBand dataset (74,000 videos) using a unified ontological labeling scheme. They then explicitly encode "parent-child/synonym" relationships between behavioral labels using Ontological Semantic Embedding (OSE) and Ontology-Aware Graph Modeling (OAGM). EthoCLIP significantly outperforms traditional backbones and general VLMs in transfer and classification, approaching full-scale performance with only 40%–60% of the data.

EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

EVATok is a framework with a three-step pipeline: optimal token allocation estimation, a lightweight router, and adaptive tokenizer training. This allows the video tokenizer to adaptively allocate token lengths based on clip complexity, saving over 24.4% of tokens while achieving SOTA generation quality on UCF-101.

Event6D: Event-based Novel Object 6D Pose Tracking

EventTrack6D proposes an event-depth fusion framework for 6D pose tracking. By reconstructing intensity and depth images at arbitrary timestamps, it bridges the gap between event cameras and low-frame-rate depth sensors. Trained exclusively on synthetic data, it achieves robust tracking of unseen objects at 120+ FPS.

Exploring Adaptive Masked Reconstruction for Self-Supervised Skeleton-Based Action Recognition

To address the issues of slow training and uniform treatment of all spatio-temporal regions in skeleton Masked Autoencoders (MAE), AMR utilizes a "decoupled cross-attention decoder" to achieve significant acceleration by "predicting fewer and larger patches." It then employs "motion energy-guided focal reconstruction" to concentrate the reconstruction focus of large patches on high-motion regions, achieving an 8x speedup and performance improvements on NTU-60/120 and PKU-II, surpassing existing SOTA.

Fine-VAD: Towards Fine-Grained Video Anomaly Detection via Progressive Cross-Granularity Learning

Addressing the challenge of scarce samples per anomaly class in fine-grained video anomaly detection, this paper proposes a progressive cross-granularity learning paradigm: first learning general anomaly representations with abundant binary labels, then constructing an intermediate semantic skeleton via K-means pseudo-macro clustering, and finally refining with sparse category labels. Implemented as Fine-VAD with CLIP alignment, it achieves a relative improvement of 47.7% in mean AVG mAP on UCF-Crime and XD-Violence.

First Frame Is the Place to Go for Video Content Customization

This paper finds that video generation models have an inherent ability to implicitly treat the first frame as a "concept memory buffer" for storing and reusing multiple visual entities. It proposes FFGo, a lightweight LoRA adaptation method that uses only 20-50 training samples to activate this capability without modifying the architecture, enabling multi-reference video content customization. In user studies, it is rated best in 81.2% of cases.

FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance

This paper proposes FlashMotion, the first three-stage training framework to achieve few-step (4-step) trajectory-controllable video generation. It trains a trajectory adapter, distills a fast generator, and then fine-tunes the adapter with a hybrid adversarial-diffusion approach. With 4-step inference, it outperforms existing multi-step methods in both visual quality and trajectory accuracy while achieving a 47x speedup.

FlexiVideo: Variation-Aware Temporal Dynamics Modeling for Efficient Video Understanding

FlexiVideo replaces the fixed multi-frame encoding window with a mechanism that first segments video into "internally stable" scene clips based on frame differencing. It then employs a shared 3D convolutional kernel with dynamic temporal window adjustment for scene-level encoding. This approach reduces visual tokens by 43.5% while consistently outperforming Qwen2.5-VL-3B across six video benchmarks.
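
The segmentation idea can be sketched with plain frame differencing: a new clip starts whenever the mean absolute difference between consecutive frames exceeds a threshold. The flat-list frame representation, the function name, and the threshold value are assumptions for illustration; FlexiVideo's actual criterion is not specified in this summary.

```python
def segment_by_frame_difference(frames, threshold):
    """Split a frame sequence into clips with small frame-to-frame change.

    frames: list of equally sized flat lists of pixel values.
    Returns inclusive (start, end) index pairs, one per clip.
    """
    if not frames:
        return []
    clips = []
    start = 0
    for t in range(1, len(frames)):
        # Mean absolute pixel difference between frame t and frame t-1.
        diff = sum(abs(a - b) for a, b in zip(frames[t], frames[t - 1])) / len(frames[t])
        if diff > threshold:
            clips.append((start, t - 1))
            start = t
    clips.append((start, len(frames) - 1))
    return clips

# Toy video: two near-static frames, a scene change, then two near-static frames.
toy_frames = [[0, 0], [0, 1], [10, 10], [10, 11]]
print(segment_by_frame_difference(toy_frames, threshold=2.0))
```

Each resulting clip could then be encoded with a temporal window matched to its length, which is where the token savings would come from.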

FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding

FluxMem is a training-free streaming video understanding framework that utilizes a hierarchical memory design (Short/Medium/Long-term) and two adaptive token compression modules (TAS for temporal redundancy + SDC for spatial redundancy). It achieves new SOTA on StreamingBench and OVO-Bench while discarding 60-70% of visual tokens.

Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding

Frame2Freq is the first family of PEFT adapters for temporal modeling in the frequency domain. By using FFT to transform frozen VFM patch embeddings into the spectral space and learning band-level filtering, it outperforms full fine-tuning models on five fine-grained action recognition benchmarks with <10% trainable parameters.
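
A minimal sketch of band-level filtering along the time axis, assuming a single feature channel observed over a few frames: the sequence is moved to the frequency domain, each frequency bin is scaled by a gain, and the result is transformed back. A naive DFT from the standard library stands in for an FFT, and the fixed gains stand in for parameters that an adapter would learn; neither is taken from Frame2Freq itself.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def filter_temporal_bands(signal, gains):
    """Scale each temporal frequency of a real signal by gains[f], f = 0..n//2.

    Mirrored bins k and n-k share one gain so the output stays real.
    """
    n = len(signal)
    spectrum = dft(signal)
    filtered = [spectrum[k] * gains[min(k, n - k)] for k in range(n)]
    return [value.real for value in idft(filtered)]

# One hypothetical feature channel over four frames; keep only the DC band.
smoothed = filter_temporal_bands([1.0, 2.0, 3.0, 4.0], gains=[1.0, 0.0, 0.0])
print(smoothed)
```

Keeping only the zero-frequency band collapses the channel to its temporal mean, while larger gains on higher bands would emphasize fast motion cues.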

From Contrast to Consistency: Rethinking Event-based Continuous-Time Optical Flow Estimation

Addressing the lack of dense ground truth (GT) for continuous-time event flow and the limitation of Contrast Maximization (CM) focusing only on "alignment to a point" while ignoring trajectory continuity, this paper proposes the Spatio-Temporal Structural Consistency (STSC) self-supervised paradigm. It treats events as samples on a spatio-temporal manifold rather than discrete points to be aligned. Combined with a bidirectional multi-scale network and curriculum-guided hybrid supervision, it achieves SOTA results on DSEC-Flow and MVSEC for both standard and high temporal resolution (HTR) flow (DSEC EPE 0.663, an 11.6% reduction relative to BFlow).

Gamba: Mamba-based Graph Convolutional Network with Dynamic Graph Topology Learning for Action Recognition

To address the issue where directly stacking GCN and Mamba causes Mamba to scan along physically non-adjacent joint sequences, Gamba uses a node classification module to rearrange joints into Mamba-friendly sequences based on motion categories. It then employs a unidirectional State Space Model (SSM) to simultaneously model intra-class local and inter-class global relationships, paired with Mamba-TCN for temporal modeling, achieving SOTA on NTU RGB+D 60/120 and NW-UCLA with lower self-attention overhead.

GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding

GIFT is a training-free keyframe selection framework that reformulates the problem of "which frames to feed into a Video VLM" from a greedy frame-by-frame addition to a global evaluation of each frame's "irreplaceability" (high relevance \(\times\) visual isolation among more relevant frames). Utilizing "Budget-Aware Refinement," it gradually recovers temporal context as the frame budget increases, achieving a maximum average improvement of 12.5% over uniform sampling on LLaVA-Video-7B.
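
One possible reading of "high relevance \(\times\) visual isolation among more relevant frames" is sketched below. The isolation term (one minus the maximum cosine similarity to any more relevant frame) and the function name `irreplaceability` are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def irreplaceability(relevance, features):
    """Score each frame as relevance * isolation (illustrative definition)."""
    relevance = np.asarray(relevance, dtype=np.float64)
    feats = np.asarray(features, dtype=np.float64)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    similarity = feats @ feats.T                      # cosine similarity
    scores = np.empty_like(relevance)
    for i in range(len(relevance)):
        more_relevant = relevance > relevance[i]
        if more_relevant.any():
            # A frame is replaceable if a more relevant frame looks like it.
            isolation = 1.0 - similarity[i, more_relevant].max()
        else:
            isolation = 1.0
        scores[i] = relevance[i] * isolation
    return scores

# Frame 1 duplicates frame 0 visually; frame 2 is less relevant but distinct.
relevance = [0.9, 0.8, 0.5]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scores = irreplaceability(relevance, features)
ranking = [int(i) for i in np.argsort(-scores)]
```

In this toy case the near-duplicate frame drops to the bottom of the ranking even though its relevance is high, which is the behavior a greedy relevance-only selector would miss.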

Gloria: Consistent Character Video Generation via Content Anchors

Gloria proposes using a compact set of "Content Anchors" to represent the multi-view appearance and expression identity of a character. Through two mechanisms—superset content anchoring (to prevent copy-pasting) and RoPE weak conditioning (to distinguish multiple anchors)—it achieves consistent character video generation exceeding 10 minutes.

GoalForce: Teaching Video Models to Accomplish Physics-Conditioned Goals

This paper proposes the Goal Force framework, which trains video generation models on simple synthetic data using multi-channel physical control signals (goal force, direct force, mass). This enables models to learn reverse planning of causal chains from target effects, achieving zero-shot generalization to complex real-world scenarios such as tool use and human-object interaction.

Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

SelVA introduces a text-conditioned selective video-to-audio (V2A) generation task. By utilizing learnable supplementary tokens [SUP] and a self-supervised video mixing strategy, the model selectively generates target sounds specified by text prompts from multi-source videos, outperforming existing methods in audio quality, semantic alignment, and temporal synchronization.

HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering

HERBench is a VideoQA benchmark specifically designed for multi-evidence integration, consisting of 26,806 five-choice multiple-choice questions. Each question structurally necessitates the fusion of \(\ge 3\) temporally dispersed, non-overlapping visual cues. By introducing the Minimum Required Frame Set (MRFS) metric, it identifies two critical bottlenecks in current Video-LLMs: insufficient frame retrieval and evidence fusion failure.

HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos

This paper proposes the new task of "Open-Vocabulary Temporal Sentence Grounding in Videos" (OV-TSGV) and constructs two benchmarks, Charades-OV and ActivityNet-OV. It introduces HERO, a plug-and-play framework that captures multi-granularity semantics via hierarchical text embeddings and enhances alignment through parallel semantic-guided visual filtering and contrastive masked text refinement, achieving SOTA performance on both standard and open-vocabulary benchmarks.

HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling

HieraMamba proposes a Mamba-based hierarchical video temporal grounding architecture centered on the Anchor-MambaPooling (AMP) module. This module uses Mamba's selective scanning to compress video features layer-by-layer into multi-scale anchor tokens. Combined with anchor-conditioned and segment-pooled contrastive losses, it enhances the compactness and discriminativeness of hierarchical representations, achieving SOTA on Ego4D-NLQ, MAD, and TACoS.

Hierarchical Action Learning for Weakly-Supervised Action Segmentation

HAL leverages the time-scale asymmetry—where low-level visual features change rapidly while high-level action semantics change slowly—to construct a hierarchical causal generative process with a smooth transition constraint. This allows the model to learn identifiable high-level action latent variables under weak supervision using only action transcripts, mitigating over-segmentation and achieving new SOTA results on the Breakfast, CrossTask, Hollywood, and GTEA benchmarks.

Hypergraph-State Collaborative Reasoning for Multi-Object Tracking

Addressing the long-standing issues of independent prediction, jitter, and fragmentation under occlusion in motion estimation for multi-object tracking, this paper proposes HyperSSM. It utilizes hypergraphs to connect targets with similar motion states into hyperedges to achieve "group consensus," and embeds hypergraph convolutions directly into the state transitions of a State Space Model (SSM) to maintain temporal smoothness. This allows correlated targets to mutually constrain and complement each other's motion. HyperSSM achieves SOTA performance across four linear and non-linear benchmarks: MOT17, MOT20, DanceTrack, and SportsMOT.

Image Guides Images: Consistent Video Amodal Completion with Rectified In-Context Exemplar Guidance

IC-Amodal proposes a training-free framework for Video Amodal Completion (VAC). By leveraging a pre-trained image inpainting model (Flux.1-Fill), it reformulates VAC as "rectified in-context learning." It utilizes dual-frame collaboration to construct reliable exemplars to address the cold-start problem, followed by sub-region attention weight modulation to anchor the model's focus on the exemplars. This achieves both open-world generalization and temporal consistency without fine-tuning, outperforming state-of-the-art (SOTA) models fine-tuned on synthetic data.

Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation

This paper proposes a new paradigm of "Interactive Tracking," where users can guide or correct the tracker using natural language instructions at any time. The authors release InteractTrack, the first large-scale interactive tracking benchmark (150 videos, 140,000 frames, 4D evaluation protocol), showing that 25 SOTA trackers fail in this setting. Finally, a strong baseline IMAT is introduced, featuring positive and negative memory banks.

InternVideo-Next: Towards World-Understanding Video Models

InternVideo-Next decomposes the traditional "Encoder-Decoder" architecture of Masked Video Modeling (MVM) into a three-stage Encoder-Predictor-Decoder (EPD) framework. It utilizes a two-stage self-supervised pre-training strategy: Stage 1 constructs a latent space that is both detail-preserving and semantically rich using a conditional diffusion decoder and image-level semantic priors; Stage 2 performs latent space prediction toward a frozen teacher to learn world knowledge. Using only publicly available unlabeled videos, this model, which lacks any video-text supervision, outperforms video-text pre-trained competitors on benchmarks like K400 and SSv2 for the first time.

InterRVOS: Interaction-Aware Referring Video Object Segmentation

This work extends Referring Video Object Segmentation (RVOS) from segmenting only the referred subject (actor) to a new task, InterRVOS, which simultaneously segments both the actor and target in an interaction. The authors constructed InterRVOS-127K, a dataset with 127,000 actor-target dual-mask annotations, and proposed ReVIOSa, an MLLM-based architecture. ReVIOSa explicitly models interaction directionality using two role-specific tokens ([SEG_ACT] and [SEG_TAR]) combined with an Attention Mask Loss (AML), significantly outperforming existing methods on the new benchmark.

Joint Learning of General and Diverse Patterns with Mixture of Memory Experts for Weakly-Supervised Video Anomaly Detection

MoME introduces a sparse Mixture of Experts framework with "Internal Memory + Shared External Memory," allowing normal/abnormal experts to learn commonalities in external memory and disparities in internal memory. Guided by LLM-generated semantic prototypes for expert routing, it balances generalization and discrimination, achieving SOTA results on UCF-Crime and XD-Violence (88.32% AUC / 86.15% AP).

LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation

LaDy introduces an overlooked "Physical Dynamics" dimension to Skeleton-based Temporal Action Segmentation (STAS). It utilizes a Lagrangian dynamics branch to explicitly synthesize generalized joint forces (torques) from joint coordinates, ensures these forces adhere to the work-energy theorem via an energy consistency loss, and injects force information into spatial features (fusion) and temporal features (hierarchical gating). It achieves new SOTA results across six datasets, notably improving F1@50 by up to 5.2% on PKU-MMD v2 with only 1.83M parameters.

LAOF: Robust Latent Action Learning with Optical Flow Constraints

The proposed LAOF framework uses the agent's optical flow as a pseudo-supervision signal to constrain latent action learning, making latent action representations more robust to interference. It significantly outperforms unsupervised baselines on LIBERO and PROCGEN and, without any action labels, matches or exceeds supervised methods that use 1% action labels.

Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation

LMFT quantifies the motion intensity of each patch using the "L1 motion difference of tokens from adjacent frames" in video domain adaptation. It then utilizes reinforcement learning to learn a fine-tunable motion threshold to discard low-motion (background) tokens, feeding only action-related tokens into the ViT. This simultaneously mitigates domain shift caused by labels/backgrounds and reduces training time by 10–20 times.
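
The motion measure described above can be sketched in a few lines. This is a minimal sketch, assuming tokens of shape (T, N, D) and a fixed `threshold`; in the method as summarized the threshold is learned with reinforcement learning, which is not shown here, and `motion_mask` is a hypothetical name.

```python
import numpy as np

def motion_mask(tokens, threshold):
    """Return a boolean mask of shape (T - 1, N); True marks tokens to keep.

    Motion is the L1 difference between each token and the token at the
    same patch position in the previous frame.
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    motion = np.abs(tokens[1:] - tokens[:-1]).sum(axis=-1)
    return motion > threshold

# Two frames, two patches: patch 0 is static, patch 1 changes.
tokens = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [3.0, 1.0]],
]
keep = motion_mask(tokens, threshold=0.5)
```

Only the tokens marked `True` would be passed on to the ViT, which is where the reported training-time savings would come from.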

Learning from Noisy Supervision: A Denoising-Debiasing Framework for Weakly Supervised Video Anomaly Detection

Addressing the noisy supervision problem in the MIL framework, where normal snippets in abnormal bags are misidentified as anomalies, this paper proposes the plug-and-play D2MIL framework. It dynamically discards high-loss noisy samples based on the observation that "noise samples exhibit higher loss," and subsequently recovers mistakenly discarded hard samples using a frozen VLM. D2MIL provides consistent improvements across five mainstream MIL baselines on ShanghaiTech, UCF-Crime, and MSAD.
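
The "discard high-loss samples" step resembles the familiar small-loss selection trick, sketched below. The fixed `drop_ratio` and the function name `drop_high_loss` are assumptions for illustration; the summarized method adjusts the discarding dynamically, and the VLM-based recovery of hard samples is not shown.

```python
def drop_high_loss(losses, drop_ratio):
    """Return indices of samples kept after dropping the highest losses.

    drop_ratio is a hypothetical fixed fraction of samples to discard.
    """
    keep_count = len(losses) - int(len(losses) * drop_ratio)
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:keep_count])

# Snippet 1 has a much higher loss than the rest and is treated as noise.
snippet_losses = [0.1, 2.5, 0.3, 0.2]
kept = drop_high_loss(snippet_losses, drop_ratio=0.25)
```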

Learning from Synthetic Data via Provenance-Based Input Gradient Guidance

This paper proposes utilizing "provenance" automatically obtained during the synthetic data generation process as an auxiliary supervision signal. By employing input gradient guidance—specifically inhibiting input gradients in non-target regions—the method directly encourages models to learn discriminative representations focused on target areas. Its effectiveness is validated across multiple tasks and modalities, including weakly supervised localization, spatio-temporal action detection, and image classification.

Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning

AssistMimic is proposed to model the physical imitation of human-human assistive interactions as a Multi-Agent Reinforcement Learning (MARL) problem. Through motion prior initialization, dynamic reference retargeting, and contact-promoting rewards, it achieves the first physics-based simulation and tracking of assistive movements involving force exchange.

Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding

Addressing the blind assumption of "always providing a segment for any query" in Video Temporal Grounding (VTG), this paper proposes Refusal-Aware Reinforcement Fine-Tuning (RA-RFT) based on GRPO. Combined with four rewards (format, refusal-IoU, explanation, query correction) and a specifically constructed "hard-irrelevant query" dataset HI-VTG, the model learns to refuse queries that are highly semantically similar but actually mismatched and explain why. This significantly improves refusal and explanation quality across several relevance-aware VTG scenarios without compromising standard grounding accuracy.

LensWalk: Agentic Video Understanding by Planning How You See in Videos

LensWalk is proposed as an agentic framework that allows an LLM reasoner to actively control the temporal scope and sampling density of video observations. Through a reason-plan-observe loop, it achieves adaptive video understanding, providing a plug-and-play performance gain of over 5% on long video benchmarks without the need for fine-tuning.

Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation

To address the issues in existing Referring Video Object Segmentation (RVOS) datasets, which contain only short clips of a few seconds and where targets are visible throughout, the authors construct Long-RVOS. This is the first minute-level long video benchmark, featuring 2,193 videos with an average duration of 60 seconds, frequent occlusions, target disappearance/reappearance, and scene cuts. It includes three types of descriptions (Static, Dynamic, and Mixed) and two new metrics (\(tIoU\) and \(vIoU\)). The authors also propose a motion-enhanced baseline, ReferMo, which utilizes MPEG-4 keyframes and motion vectors for a "local perception to global interaction" workflow. ReferMo is supervised only on keyframes and uses SAM2 for propagation during inference, significantly outperforming seven SOTA methods in long-video scenarios.

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

LongVideo-R1 is proposed as a multimodal Agent equipped with reasoning capabilities. Utilizing a hierarchical video tree structure and an intelligent navigation strategy, it achieves efficient long video question answering with an average of only 10.5 tool calls, significantly outperforming exhaustive methods in the accuracy-efficiency trade-off.

M4-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection

To efficiently adapt SAM2 for RGB-D Video Salient Object Detection (RGB-D VSOD), M4-SAM injects "Modality-Aware MoE-LoRA" into the frozen SAM2 encoder for parameter-efficient fine-tuning, utilizes "Gated Multi-Level Feature Fusion + Memory Bank" to aggregate multi-scale temporal information, and employs "Pseudo-Guided Initialization" to eliminate dependence on manual prompts. It achieves SOTA across all metrics on three RGB-D VSOD datasets, with the entire training process taking approximately 5 hours on two 4090 GPUs.

MA-Bench: Towards Fine-grained Micro-Action Understanding

This paper proposes MA-Bench, a micro-action understanding benchmark containing 1,000 videos and 12,000 structured QA pairs. It systematically evaluates the fine-grained micro-action understanding capabilities of 23 MLLMs through a three-layer "Perception-Understanding-Reasoning" architecture and provides MA-Bench-Train (20.5K samples) for model fine-tuning.

MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters

This paper proposes the MaskAdapt framework, which achieves flexible and precise motion adaptation for physics-based humanoid characters through a two-stage residual learning paradigm: first training a mask-invariant robust base policy, and then training a residual policy on a frozen base controller to modify target body parts.

Matching Every Pair to Track Every Point: PairFormer for All-Pairs Tracking and Video Trajectory Fields

PairFormer upgrades video motion modeling from "tracking a few query points" to "predicting dense displacement and visibility fields for every frame pair" (All-Pairs Tracking, APT). Using a feed-forward Transformer (Spatio-temporal encoder + CorrBank + Broadcast Motion Mixer + Trajectory Field Decoder), it outputs sequence-consistent dense trajectory fields in a single forward pass. Accompanied by the PAIRender synthetic data platform providing all-to-all supervision and benchmarks, it achieves SOTA on APT-Bench and competitiveness on standard TAP benchmarks.

MDS-VQA: Model-Informed Data Selection for Video Quality Assessment

MDS-VQA enables a VQA model to "identify which videos it cannot assess accurately" by using a ranking-based failure predictor to estimate difficulty combined with content diversity for greedy selection. By annotating only a 5% "difficult and diverse" subset for active fine-tuning, the average multi-domain SRCC improved from 0.651 to 0.722, and the method achieved first place in the gMAD competition.
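
A greedy selection that trades off difficulty against diversity might look like the sketch below. The scoring rule (predicted difficulty plus a weighted distance to the nearest already-selected video) and the names `greedy_select` and `diversity_weight` are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def greedy_select(difficulty, features, budget, diversity_weight=1.0):
    """Pick `budget` items that are both difficult and mutually distant."""
    difficulty = np.asarray(difficulty, dtype=np.float64)
    feats = np.asarray(features, dtype=np.float64)
    selected = []
    # Distance from each item to its nearest selected item so far.
    min_dist = np.full(len(difficulty), np.inf)
    for _ in range(min(budget, len(difficulty))):
        gain = difficulty.copy()
        if selected:
            gain += diversity_weight * min_dist
            gain[selected] = -np.inf          # never pick an item twice
        best = int(np.argmax(gain))
        selected.append(best)
        dist = np.linalg.norm(feats - feats[best], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return selected

# Videos 0 and 1 are both hard but nearly identical; video 2 is different.
difficulty = [0.9, 0.85, 0.3]
features = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0]]
chosen = greedy_select(difficulty, features, budget=2)
```

Here the second pick skips the near-duplicate hard video in favor of the distinct one, which is the "difficult and diverse" behavior the summary describes.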

Memory Matters: Boosting Training-Free Zero-Shot Temporal Action Localization with a Learnable Lookup Table

Addressing the issue where "Training-Free Zero-Shot Temporal Action Localization (TF ZS-TAL) adapts independently per video, discards knowledge after use, and cannot accumulate historical insights," this paper proposes a Learnable Lookup Table (LLT) maintained by action category and updated online during the test stream. High-confidence "easy-to-judge frames" are aggregated into category prototypes, and a lightweight residual module aligns lookup items and text prototypes to the current video. Without fine-tuning the VLM, this allows training-free ZS-TAL to reuse knowledge across videos, improving the average mAP on THUMOS'14 (75/25 split) from 9.2 (T3AL) to 12.8 (a relative +40%).

MER-Tracker: Towards High-Speed 3D Point Tracking via Multi-View Event-RGB Hybrid Cameras

To address the issues of low frame rates in standard RGB cameras (~30fps), which cause motion blur and missing dynamics in high-speed motion, this paper constructs a cuboid capture rig with "4 RGB + 2 Event cameras" and proposes MER-Tracker. By fusing the texture fidelity of RGB with the microsecond-level temporal resolution of event streams, it outputs accurate high-speed 3D point trajectories at 150fps, representing the first systematic work in high-speed 3D point tracking.

META: Meta Evolution of Tool Trajectory Adaptation for Long-Video Understanding

META enables a training-free video understanding agent to "self-evolve its toolbox" through iterative problem-solving. It condenses recurring multi-step tool combinations from successful trajectories into reusable macro-tools and distills failure trajectories into tool usage constraints. Without updating any parameters, it improves strong VLMs by 4.6%–7.6% across three long-video benchmarks.

Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

Minerva-Ego is a benchmark for complex reasoning in egocentric long videos, consisting of 1,160 human-annotated five-way multiple-choice questions. Each question is paired with a dense reasoning trajectory that binds "when (timestamp)" to "where (segmentation mask)". The authors reveal that state-of-the-art (SOTA) video models (Gemini 2.5 Pro at 40.1% vs. humans at 91.8%) are primarily bottlenecked by perceptual grounding. They demonstrate that directly prompting the model on pixels regarding "where to look" and "which frames to watch" can improve accuracy by up to approximately 5.8%.

Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos

This paper proposes the Mistake Attribution (MATT) task, which attributes execution errors in egocentric videos to three dimensions: semantic (which instruction component was violated), temporal (which frame contains the Point of No Return, PNR), and spatial (where the error region is in the PNR frame). Through the MisEngine data engine, large-scale mistake samples are automatically constructed from existing action datasets. A unified Transformer model, MisFormer, is designed to simultaneously complete three attribution sub-tasks, outperforming specialized SOTA methods across multiple benchmarks.

MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

The authors distill the "motion priors" naturally encoded in Video Diffusion Models (VDMs) to serve as auxiliary supervision for aligning VLM text-visual attention. This significantly enhances the VLM's fine-grained motion understanding without adding trainable parameters or modifying the architecture.

MoVie: Broaden Your Views with Human Motion for Action Detection

MoVie decomposes human skeleton motion into a set of "motion primitives" (a learnable motion dictionary) and utilizes an orthogonal projection to treat these fine-grained motion signals as a "regularizer" to calibrate RGB visual features. This approach moves beyond naive feature concatenation/fusion, achieving a new SOTA in frame-level action detection across four real-world datasets: TSU, Charades, Multi-THUMOS, and PKU-MMD (e.g., an improvement of roughly 15.9% mAP over the visual-only baseline on TSU-CS).

MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark

This paper introduces MovieRecapsQA, a multimodal open-ended video question-answering benchmark constructed from movie recap videos. It contains approximately 8.2K questions covering 60 movies and features a reference-free evaluation metric based on atomic facts, revealing that the primary bottleneck for current MLLMs lies in visual perception rather than reasoning.

MPL: Match-guided Prototype Learning for Few-shot Action Recognition

Addressing the issue where prototype learning and video matching operate independently and incompatibly in few-shot action recognition, MPL utilizes matching results as guidance signals for prototype construction. By sequentially applying sample-level E-Match for query-semantic enhancement, cross-sample attention for shared action pattern aggregation, and frame-level K-Match for refinement, the method generates discriminative prototypes that are inherently compatible with the matching mechanism. SOTA results are achieved across four datasets.

MS-Temba: Multi-Scale Temporal Mamba for Understanding Long Untrimmed Videos

MS-Temba transforms the Mamba State Space Model into a "Multi-Scale Dilated SSM" by stacking parallel branches with varying temporal dilation rates into a hierarchical structure, then uses a lightweight Mamba fuser to unify multi-scale features. With only 17M parameters, it achieves SOTA in Temporal Action Detection (TAD) on 40-minute-long, densely annotated daily activity videos, reducing parameters by 5x compared to Transformer-based solutions.

MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering

MuKV simultaneously stores the historical KV cache of long streaming videos at patch, frame, and segment granularities. It employs "self-attention + frequency" dual-signal pruning to compress redundancy and utilizes "semi-hierarchical retrieval" for online recall. This approach significantly improves long streaming VideoQA accuracy without increasing memory footprint or online latency.

MV-TAP: Tracking Any Point in Multi-View Videos

MV-TAP extends "Track Any Point" (TAP) from single-view to multi-view synchronized videos by modeling directly in 2D pixel space. It utilizes camera ray encoding to inject geometric context and a view attention layer to exchange information across viewpoints. This allows for trajectory completion in the presence of occlusion or motion blur in a single view by leveraging other views, significantly outperforming single-view SOTA methods processed independently on DexYCB, Panoptic Studio, Kubric, and Harmony4D.

Neural-Centric Video Processing Pipeline for Unified Multi-Task Inference

This work encodes videos directly into Implicit Neural Representations (INR/NeRV). It utilizes Centered Kernel Alignment (CKA) to identify optimal "INR intermediate layer ↔ downstream backbone injection point" pairs and trains ultra-lightweight 1×1 convolutional Micro Adapters for feature conversion. During inference, it only decodes up to the required intermediate layer, skipping pixel reconstruction and early backbone layers. This unified representation serves multiple tasks (classification, detection, action recognition, captioning) simultaneously, reducing end-to-end latency by up to 89.5% and inference FLOPs by up to 29.9%.
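
For reference, linear Centered Kernel Alignment between two feature matrices computed on the same inputs can be written as below, followed by a toy search for the best-matching layer. The layer lists and the way the "backbone" features are constructed are hypothetical; only the linear CKA formula itself is standard.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between features x (n, d1) and y (n, d2) of the same n inputs."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
# Hypothetical activations of three INR layers on 32 inputs.
inr_layers = [rng.normal(size=(32, 6)) for _ in range(3)]
# Hypothetical backbone features, built as a scaled copy of INR layer 1.
backbone_features = 2.0 * inr_layers[1]
scores = [linear_cka(layer, backbone_features) for layer in inr_layers]
best_layer = int(np.argmax(scores))
```

Because CKA is invariant to isotropic scaling, the scaled copy scores 1 and is selected; in practice the same search would run over every candidate pair of INR layer and backbone injection point.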

No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection

The paper proposes LAVIDA, an end-to-end zero-shot Video Anomaly Detection (VAD) framework. An Anomaly Exposure Sampler transforms semantic segmentation datasets into pseudo-anomalies for training. By combining an MLLM for deep anomaly semantic extraction with anti-attention token compression to handle spatio-temporal sparsity, LAVIDA achieves SOTA results at both frame and pixel levels without using any real VAD data.

Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking

The authors propose OA-SORT, an occlusion-aware tracking framework that explicitly models the occlusion states of objects to mitigate position cost confusion and Kalman Filter estimation instability. It achieves SOTA-level improvements on DanceTrack, SportsMOT, and MOT17, with components that can be integrated into various trackers in a plug-and-play manner.

OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios

To address the issues of narrow categories and oversimplified scenarios in existing Spatio-Temporal Video Grounding (STVG) datasets, this paper constructs the OmniGround benchmark, covering 81 categories across 3,475 real-world complex videos. The study introduces a Forward-Backward-Refine (FBR) annotation pipeline, a four-dimensional data quality evaluation framework (DeepSTG), and a training-free two-stage baseline (PG-TAF). PG-TAF improves state-of-the-art (SOTA) grounding accuracy on OmniGround by 25.6% and 35.6% (relative m_tIoU/m_vIoU gains), respectively.

OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

Addressing the bottleneck where Video Temporal Grounding (VTG) fails to accurately locate "rare concepts" in open-world scenarios, the authors constructed OmniVTG—a large-scale dataset featuring 2,124 hours, 350,000 queries, and a vocabulary far exceeding the sum of existing datasets—using a "Semantic Coverage Iterative Expansion" pipeline. They further proposed a three-stage training paradigm (SFT→CoT→RL) based on "predict-then-self-correct," enabling Qwen2.5-VL-7B to achieve zero-shot SOTA on four public VTG benchmarks while maintaining nearly consistent performance on rare concepts.

One-Shot Flow, Any-Time Frame: A Bidirectional Warping Framework for Event-Based Video Frame Interpolation

Addressing the dilemma in Event-based Video Frame Interpolation (E-VFI) where "forward warping is fast but suffers from holes, while backward warping yields high quality but requires recomputation for every frame," this paper proposes "One-Shot Flow, Any-Time Frame." Because a bidirectional motion representation covering the entire duration is computed once, optical flow at any time can be queried directly. A bidirectional warping mechanism with explicit repair masks then fuses the strengths of both directions, setting new records for both reconstruction quality and inference efficiency on synthetic and real-world datasets (PSNR 36.90 on GOPRO Skip 15, with only 7.27GB VRAM for 127-frame interpolation, whereas TLXNet runs out of memory).

OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments

This paper proposes OpenMarcie, the largest multimodal action recognition dataset for industrial scenarios to date, integrating 8 sensing modalities from wearable sensors and vision data with 200+ channels and 37+ hours of recording. The superiority of inertial + vision fusion is validated across three benchmarks: HAR classification, open-vocabulary description, and cross-modal alignment.

Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation

This work provides the first systematic analysis of the adversarial vulnerability of Tracking-by-Query-Propagation (TBP) trackers. It proposes the FADE attack framework, utilizing Temporal Query Flooding (TQF) to exhaust fixed query budgets and Temporal Memory Corrosion (TMC) to disrupt hidden state propagation. On MOT17/MOT20, it achieves up to a 30-point HOTA decrease and over a 10x increase in ID switches against MOTR, MOTRv2, MeMOTR, Samba, and CO-MOT.

Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning

Addressing the bimanual action segmentation task of simultaneously labeling per-frame actions for left and right hands in unedited videos, this paper proposes Polyphony—a three-stage method. It utilizes a shared ViT with alternating training to resolve dominant hand gradient monopoly, structured semantic conditioning to eliminate fine-grained action ambiguity, and a diffusion segmenter with cross-hand feature fusion to model bimanual coordination. It achieves up to 16.8 points improvement on HA-ViD/ATTACH datasets and outperforms SOTA on the single-stream Breakfast dataset with a \(12\times\) smaller backbone.

Progressive Cross-Modal Causal Intervention for Long-Term Action Recognition

PCMCI decomposes three types of "spurious correlations" relied upon by Vision-Language Models (VLMs) in long-term action recognition—co-occurrence hallucination, codependency illusion, and visual confounders—into a three-stage progressive causal intervention pipeline (OT-augmented backdoor adjustment → relation-aware backdoor adjustment → cross-modal front-door adjustment). By deconfounding step-by-step to obtain robust text/video representations, it significantly improves mAP on Breakfast, COIN, and Charades (e.g., Breakfast mAP from 76.32 to 90.51).

Progressive Multi-cue Alignment for Unaligned RGBT Tracking

PMATrack decomposes the "one-time regression" of cross-modal alignment parameters in unaligned RGBT tracking into a three-stage progressive estimation: "center offset → scale transformation → global refinement." By employing difficulty-aware routing to select the most cost-effective expert from three alignment cues at each stage, it sets new SOTA records on benchmarks like the newly created MUART244 with reduced computational overhead.

ProgTrack: A Multi-Object Tracking Algorithm with Progressive Matching Strategy

ProgTrack mimics the human eye's tracking habit of "large first, small later, then fill gaps" by decomposing UAV multi-object tracking into a three-stage progressive matching process: "large objects use IoU, small objects use Context-Enhanced ReID, and remaining hard-to-match targets use relative inter-object positions." Coupled with a Pure Kalman Filter (PKF) that handles occlusions and missed detections, it achieves SOTA MOTP/IDF1 results on VisDrone2019 and MDMT.

Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding

QViC-MF achieves SOTA on multiple benchmarks, including MLVU, LVBench, and VNBench, using minimal visual tokens (16 per frame) through question-guided multi-frame visual compression (QMSA) and a context memory feedback mechanism.

RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation

This paper introduces text descriptions to RGBT tracking for the first time, proposing the RAGTrack framework based on Retrieval-Augmented Generation (RAG). By utilizing a multimodal Transformer encoder, adaptive token fusion, and a context-aware reasoning module, it achieves SOTA performance on four RGBT benchmarks.

Real-World Point Tracking with Verifier-Guided Pseudo-Labeling

This paper proposes a learnable Verifier meta-model, trained on synthetic data to "judge tracker prediction reliability" and transferred to the real world. The Verifier evaluates predictions from 6 pre-trained teachers frame by frame to select the most reliable pseudo-labels, with which the Track-On-R model achieves comprehensive SOTA across 4 real-world benchmarks using only ~5K real videos for fine-tuning.

Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning

This paper proposes SlotCurri, a reconstruction-guided curriculum learning strategy for slot quantity. By starting training with minimal slots and incrementally expanding slot capacity only in regions with high reconstruction errors, combined with structure-aware loss and recurrent inference, it effectively addresses the over-fragmentation problem in video object-centric learning where a single object is incorrectly split across multiple slots. It achieves a +6.8 FG-ARI improvement on YouTube-VIS.

Rethinking Occlusion Modeling for UAV Tracking

Addressing the "clustered" nature of real-world occlusions in UAV perspectives, this paper generates spatially correlated occlusion masks (COM) via cluster sampling to train robust representations. Combined with a cost-aware depth bias (CADB) that ties inference to layer costs for automatic shallow-layer termination, the resulting OCTrack achieves a balance between accuracy and a real-time speed of 265 FPS across four UAV benchmarks.

FlexHook: Rethinking Two-Stage Referring-by-Tracking in RMOT

This paper proposes FlexHook, a novel two-stage Referring-by-Tracking framework that redefines feature construction through a sampling-based Conditioning Hook (C-Hook) and replaces CLIP cosine similarity matching with a Pairwise Correspondence Decoder (PCD). For the first time, a two-stage method fully surpasses current state-of-the-art (SOTA) one-stage methods.

Robust Promptable Video Object Segmentation

Addressing the performance collapse of promptable video object segmentation (PVOS) models like SAM2 under adverse weather and noise, this paper constructs the first RobustPVOS benchmark (351 real-world adverse videos + large-scale time-varying synthetic degradation data) and proposes MoGA. MoGA uses object pointers from the memory bank to "condition" the gating of a shared low-rank adapter, providing each object with unique, cross-frame consistent robustification. Training only 1.1M parameters, it consistently outperforms frame-by-frame robustification methods across various degradations.

SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning

SAIL achieves SOTA in both dense video captioning and event localization on ActivityNet and YouCook2 under a weakly-supervised setting (captions only, no temporal boundaries). It does so through cross-modal similarity-guided, semantic-aware mask generation and auxiliary supervision from LLM-synthesized captions.

SAM2Text: Towards Prompt-Free and Multi-Resolution Video Scene Text Segmentation

SAM2 is systematically adapted into SAM2Text specialized for video scene text segmentation (video STS): LoRA is employed to enable the encoder to learn text features, a self-prompting module is added to eliminate external prompts, 512/1024 high-resolution branches are appended to the decoder to preserve stroke details, and a "Short-term FIFO + Top-K Long-term Retrieval" dual-layer memory is used to stabilize cross-frame flickering. Two pixel-level video text datasets (STS-SynthV / STS-RealV) are released, achieving SOTA performance on both image and video benchmarks.
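
The "Short-term FIFO + Top-K Long-term Retrieval" memory can be pictured with the toy structure below. This is only a sketch under stated assumptions: the summary does not say how SAM2Text scores long-term entries, so the dot-product similarity, the class name `DualMemory`, and the sizes are hypothetical.

```python
from collections import deque
import heapq


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


class DualMemory:
    """Recent frames live in a FIFO; evicted frames stay retrievable by similarity."""

    def __init__(self, short_size=3, top_k=2):
        self.short = deque(maxlen=short_size)  # short-term FIFO
        self.long = []                         # long-term pool of evicted entries
        self.top_k = top_k

    def add(self, frame_id, feature):
        if len(self.short) == self.short.maxlen:
            # The deque would silently drop its oldest entry; keep it long-term instead.
            self.long.append(self.short[0])
        self.short.append((frame_id, feature))

    def read(self, query):
        """Return all short-term entries plus the top-k most similar long-term ones."""
        retrieved = heapq.nlargest(
            self.top_k, self.long, key=lambda entry: dot(query, entry[1])
        )
        return list(self.short) + retrieved
```

The FIFO keeps the most recent context cheap to access, while retrieval lets an older but relevant frame re-enter the context without storing every frame in the active window.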

SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion

This work formalizes the Ego→Exo imitation error detection task and proposes the SAVA-X (Align–Fuse–Detect) framework. It jointly addresses three major challenges—temporal misalignment, video redundancy, and cross-view domain gaps—through three modules: adaptive sampling, scene-adaptive view embedding (SVE), and bidirectional cross-view fusion.

Scene-Centric Unsupervised Video Panoptic Segmentation

This paper introduces the first fully unsupervised Video Panoptic Segmentation (VPS) task and proposes VideoCUPS. Starting from monocular "scene-centric" videos, the method generates temporally consistent panoptic pseudo-labels using self-supervised depth, motion, and visual cues. A novel Video DropLoss is then employed to train a VPS model on these pseudo-labels. VideoCUPS significantly outperforms four strong baselines on Cityscapes-VPS, KITTI-STEP, Waymo, and MOTS, while demonstrating robust label-efficient transfer capabilities.

SDTrack: A Baseline for Event-based Tracking via Spiking Neural Networks

This paper proposes SDTrack, the first fully Spiking Neural Network (SNN) based Transformer pipeline for event tracking. By utilizing Global Trajectory Prompt (GTP), asynchronous event streams are aggregated into 3-channel event frames rich in trajectory information. A full spike-driven SNN Transformer tracker, featuring Intrinsic Position Learning (IPL), predicts target boxes end-to-end. SDTrack achieves competitive or SOTA accuracy on three event-tracking benchmarks with minimal parameters and energy consumption (Tiny version: 19.61M / 8.16mJ).

SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding

SEASON is a training-free decoding method for VideoLLMs. It constructs "temporally homogenized" hard negatives that disrupt temporal information while preserving spatial structure. A per-token self-diagnostic mechanism determines whether a token is prone to temporal or spatial hallucination, adaptively applying contrastive decoding. It outperforms all training-free methods on three hallucination benchmarks without degrading general video understanding capabilities.
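
For readers unfamiliar with contrastive decoding, the sketch below shows the generic form such methods commonly use: amplify the logits from the original input and subtract those obtained from a degraded "negative" input. The formula and the per-token weight `alpha` are assumptions drawn from the general contrastive decoding literature; the summary does not specify SEASON's exact rule or how its self-diagnostic mechanism sets the weight.

```python
def contrastive_logits(orig, neg, alpha):
    """Generic contrastive decoding: (1 + alpha) * orig - alpha * neg, per vocabulary entry.

    orig  : logits from the real video input
    neg   : logits from the negative input (e.g. temporally homogenized frames)
    alpha : contrast strength; 0 recovers ordinary decoding
    """
    return [(1 + alpha) * o - alpha * n for o, n in zip(orig, neg)]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])
```

With `orig = [2.0, 1.5]` and `neg = [2.0, 0.5]`, token 0 wins under ordinary decoding, but it scores equally high without temporal information, so a contrast weight of `alpha = 1` shifts the choice to token 1.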

Seeing Conversations: Communication Context Identification in Egocentric Video

This paper proposes "Communication Context Identification (CCI)," a new task aimed at determining whether individuals in an egocentric video belong to the wearer's conversation group. The authors release a 68.9-hour multi-person, multi-conversation dataset and design CoCoNet—a lightweight model utilizing only structured facial features with joint temporal-relational reasoning—achieving a 96% balanced accuracy on CCI.

Seeing Motion Through Polarity for Event-based Action Recognition

Addressing the issue where existing event-text cross-modal action recognition methods stack positive and negative polarities into a single frame, thereby losing motion direction cues, POKER introduces a Polarity Motion Catcher (PMC) to explicitly decouple polarities and extract spatio-temporal motion primitives. Simultaneously, a Polarity Motion Reasoner (PMR) enables MLLMs to progressively reason about polarity-aware motion text descriptions. Finally, a polarity alignment loss pulls both feature paths toward class centers, delivering stable improvements of 1.3~2.6 points over the EventBind baseline on three EAR benchmarks.

Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark

This paper proposes SceneBench, a scene-level long video understanding benchmark that reveals severe "forgetting" in mainstream VLMs over long-range, cross-scene contexts (indicated by a sharp drop in accuracy). A lightweight Scene-RAG (Scene Retrieval-Augmented Generation) dynamically recalls cross-scene context into the input and yields a +2.50% improvement, which supports the conclusion that "models indeed fail to remember long-range contexts."

Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation

This paper reformulates movie trailer generation as a task of "mask reconstruction on trailer shot sequences using movie shots as prompts." By employing a Transformer encoder with self-paced mask rate scheduling and iterative re-masking (self-correction), the model significantly outperforms selection-then-ranking and autoregressive methods in F1 and ranking accuracy.

SHANDS: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training

SHANDS is the first multi-view RGB video dataset for open surgery training, recording incision and suturing operations from 52 experts/trainees using five synchronized cameras. It provides frame-level annotations for 15 gesture primitives and 8 clinically-validated error categories, establishing benchmarks for mainstream video models across single-view, multi-view, and cross-view protocols.

SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild

This paper presents SHOW3D, the first hand-object interaction dataset captured in true in-the-wild environments with precise 3D annotations. Using a lightweight wearable multi-camera backpack system and an ego-exo fusion annotation pipeline, the authors collected 4.3 million frames of multi-view data. Both hands and objects achieve sub-centimeter annotation accuracy. Cross-dataset experiments validate the generalization advantages of models trained on this data.

SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition

This paper proposes the SkeletonContext framework, which reconstructs missing environmental and object contextual semantics from pre-trained language models via a cross-modal context prompt module. It further enhances the discriminativeness of motion-critical joints through a key part decoupling module, achieving SOTA performance on NTU-60/120 and PKU-MMD under Zero-Shot (ZSL) and Generalized Zero-Shot (GZSL) settings.

SkillSight: Efficient First-Person Skill Assessment with Gaze

SkillSight models skill levels using egocentric video + gaze. It first trains a teacher model on "Video + Gaze" to achieve SOTA, then distills it into a student model that uses only gaze and turns off the camera during inference. On three cross-domain datasets, it approaches or exceeds heavy video-based methods with 14–73x lower power consumption.

SMV-EAR: Bring Spatiotemporal Multi-View Representation Learning into Efficient Event-Based Action Recognition

For Event-Based Action Recognition (EAR), rather than aggregating events into H-W frames along the temporal axis, this paper projects them along the H/W axes into two "temporal views" (T-H and T-W). It systematically re-engineers three stages: representation (translation-invariant TISM), fusion (dual-branch dynamic fusion DDCF), and augmentation (diverse temporal warping DTW). The method achieves Top-1 gains of +7.0%/+10.7%/+10.2% on three EAR benchmarks while reducing parameters by 30.1% and computation by 35.7%.
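
To make the T-H and T-W views concrete, the sketch below bins events by time and accumulates counts along one spatial axis at a time. It is only an illustration: the `(x, y, t, polarity)` event layout, the uniform time bins, and the function name `temporal_views` are assumptions, polarity is ignored for brevity, and the paper's TISM representation is not reproduced.

```python
def temporal_views(events, height, width, t_bins, t_max):
    """Project events onto a T-H view (collapsing x) and a T-W view (collapsing y).

    events : iterable of (x, y, t, polarity) with 0 <= t <= t_max
    Returns two count maps of shape [t_bins][height] and [t_bins][width].
    """
    th = [[0] * height for _ in range(t_bins)]
    tw = [[0] * width for _ in range(t_bins)]
    for x, y, t, _polarity in events:
        b = min(int(t / t_max * t_bins), t_bins - 1)  # clamp t == t_max into last bin
        th[b][y] += 1  # event lands in row y of the T-H view
        tw[b][x] += 1  # event lands in column x of the T-W view
    return th, tw
```

Unlike an H-W frame, each row of these views is a time slice, so the temporal order of events survives the projection.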

SoccerMaster: A Vision Foundation Model for Soccer Understanding

SoccerMaster utilizes a shared spatio-temporal ViT encoder and five lightweight task heads to integrate four categories of "spatial perception + semantic reasoning" tasks—player detection/identification, pitch registration, event classification, and vision-language alignment—into a single supervised multi-task pre-training stage. Supported by an automated annotation pipeline, SoccerFactory, which mass-produces dense spatial labels, the model outperforms general vision foundation models (SigLIP 2 / DINOv3) and specialized soccer models (MatchVision) across downstream tasks such as detection, tracking, camera calibration, and commentary generation.

Spatio-Temporal Conditional Denoising Transformer for Modality-Missing RGBT Tracking

The paper unifies "modality-missing completion" and "complete modality enhancement" in RGBT tracking into a single spatio-temporal conditional denoising process. By using short-term and long-term temporal cues from historical frames as conditions, a denoiser reconstructs missing modalities under strong noise and enhances complete modalities under weak noise. This single architecture and parameter set handle both scenarios, achieving SOTA or near-SOTA performance on three RGBT benchmarks under both complete and missing settings.

Spectral Scalpel: Amplifying Adjacent Action Discrepancy via Frequency-Selective Filtering for Skeleton-Based Action Segmentation

To address the issues of "indistinguishable adjacent actions" and "blurry boundaries" in Skeleton-based Temporal Action Segmentation (STAS), this paper shifts modeling to the frequency domain. It employs a learnable "Spectral Scalpel" (Multi-scale Adaptive Spectral Filtering, MASF) to amplify action-specific frequencies and suppress shared frequencies, while using an "Adjacent Action Discrepancy Loss" (AADL) as an explicit target to widen the amplitude spectrum gap between adjacent segments. This approach achieves SOTA results across five datasets with lower FLOPs and parameters.

SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking

This paper proposes SpikeTrack, the first RGB visual tracking framework fully compliant with the spike-driven paradigm. By utilizing asymmetric timestep expansion, unidirectional information flow, and a brain-inspired Memory Retrieval Module (MRM), it achieves SOTA performance among SNN trackers and performs on par with ANN trackers, while consuming only 1/26 the energy of TransT.

SpikeTrack: High-performance and Energy-efficient Event-Based Object Tracking with Spiking Neural Network

SpikeTrack utilizes a pure spike-driven Spiking Transformer for event-based single object tracking. By employing a "Multi-Search-Single-Template (MSST)" training paradigm, it feeds the inherent temporal continuity of tracking into the membrane potential accumulation of the SNN. Furthermore, "Dynamic Integer LIF (DI-LIF)" neurons adaptively adjust the spike firing upper limit based on input sparsity. It achieves SOTA accuracy on FE108, FELT, and VisEvent benchmarks, while consuming only 6.6% of the energy and 25.8% of the parameters compared to the second-best method.

SPOT: Spatiotemporal Prompt Optimization for Motion-Stabilized MLLM-Guided Video Segmentation

SPOT achieves SOTA across 6 benchmarks (Ref-YouTube-VOS, MeViS, ReVOS, etc.) without altering architectures or performing video pre-training. It relies solely on two new loss constraints to regulate the spatiotemporal behavior of prompt points generated by image-pretrained MLLMs for SAM: a Brownian Bridge loss models target trajectories as endpoint-constrained Gaussian processes for temporal smoothness, and a prompt quality loss ensures spatial geometric consistency.
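
The idea of an endpoint-constrained Gaussian process can be illustrated with the standard Brownian bridge: pinned at point a at time 0 and point b at time T, its marginal at time t has mean a + (t/T)(b - a) and variance sigma^2 * t(T - t)/T. The sketch below penalizes interior prompt points by their squared deviation from that mean, scaled by the marginal variance. It is an assumption-laden simplification, not the paper's loss: it ignores cross-time covariance, and `sigma` and the function name `bridge_penalty` are hypothetical.

```python
def bridge_penalty(points, sigma=1.0):
    """Sum of per-frame Gaussian penalties under a Brownian bridge pinned at both ends.

    points : list of (x, y) prompt locations, one per frame
    """
    T = len(points) - 1
    (ax, ay), (bx, by) = points[0], points[-1]
    total = 0.0
    for t in range(1, T):  # interior frames only; endpoints are fixed by construction
        w = t / T
        mean_x = ax + w * (bx - ax)
        mean_y = ay + w * (by - ay)
        var = sigma ** 2 * t * (T - t) / T  # largest mid-sequence, shrinking near the ends
        px, py = points[t]
        total += ((px - mean_x) ** 2 + (py - mean_y) ** 2) / (2 * var)
    return total
```

A trajectory that moves in a straight line at constant speed incurs zero penalty, while a point that jumps away from the interpolated path is penalized more heavily the closer it sits to an endpoint.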

Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning

This paper proposes ROS-DVC, which decouples shared queries in the DETR-based DVC framework into independent localization and caption queries. It introduces an Overlap Suppression Loss to penalize temporal overlaps between queries and a Cross-Task Contrastive Alignment to ensure cross-task semantic consistency, achieving SOTA captioning and localization performance on YouCook2 and ActivityNet Captions.

StreamingTOM: Streaming Token Compression for Efficient Video Understanding

StreamingTOM is a training-free, two-stage streaming video understanding framework. Causal Temporal Reduction (CTR) performs causal temporal selection before the LLM, compressing tokens per frame from 196 to 50; Online Quantized Memory (OQM) limits KV-cache growth after the LLM via 4-bit quantization and on-demand retrieval. The framework achieves a 15.7× compression ratio, 1.2× lower peak VRAM, and 2× faster TTFT.
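
The summary does not state how the 15.7× figure is derived. One decomposition that is consistent with the reported numbers multiplies the token reduction by the bit-width reduction, assuming a 16-bit baseline cache (that baseline is an assumption, not something stated above):

```python
def compression_ratio(tokens_before, tokens_after, bits_before, bits_after):
    """Overall ratio as token reduction times bit-width reduction."""
    token_ratio = tokens_before / tokens_after  # 196 / 50 = 3.92
    bit_ratio = bits_before / bits_after        # 16 / 4 = 4.0 (16-bit baseline assumed)
    return token_ratio * bit_ratio


ratio = compression_ratio(196, 50, bits_before=16, bits_after=4)
print(f"{ratio:.2f}x")
```

Under that assumption the product is 3.92 × 4 = 15.68, which rounds to the reported 15.7×.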

StreamRAG: Enhancing Real-Time Video Understanding with Retrieval Augmentation

StreamRAG systematically introduces RAG to streaming video QA for the first time. By employing three plug-and-play modules—Real-time Event Segmentation, Low-latency Knowledge Extraction via historical token reuse, and Dynamic Retrieval Range selection based on query recency—it enhances models like Qwen2-VL and ViSpeak on OVO-Bench/StreamingBench with accuracy gains of approximately 11%~20% without altering the backbone MLLM architecture, while nearly halving caption generation latency.

StreamReady: Learning What to Answer and When in Long Streaming Videos

This paper proposes a readiness-aware streaming video understanding paradigm. By introducing a learnable <RDY> token and the Answer Readiness Score (ARS) metric, the model learns not only to provide correct answers but also to answer at the precise moment evidence appears. It achieves SOTA performance across 9 streaming and offline video benchmarks.

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

The authors propose SVAgent, a storyline-guided cross-modal multi-agent framework for long video question answering. By progressively constructing narrative representations, utilizing DPP evidence selection, performing cross-modal consistency verification, and implementing iterative refinement, the framework achieves a performance improvement of 5.5%-11.5% over baselines.

T2SGrid: Temporal-to-Spatial Gridification for Video Temporal Grounding

T2SGrid transforms Video Temporal Grounding (VTG) from "frame-by-frame processing" to "grid-by-grid processing." By using a sliding window to concatenate continuous frames into a 2D grid in row-major order, it enables Vision-LLMs to utilize their superior spatial reasoning for temporal interpretation. Combined with a "composite text timestamp shared by the entire grid" for absolute time perception, it boosts the mIoU of Qwen2-VL-7B (which lacks temporal encoding) from 7.9 to 44.3 on Charades-STA and ActivityNet.
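
The row-major gridification step can be sketched as follows. Frames are represented by placeholder ids rather than images, and the window stride and function name `frames_to_grids` are assumptions; actual pixel concatenation and the shared timestamp text are left out.

```python
def frames_to_grids(frames, rows, cols, stride=None):
    """Slide a window of rows * cols frames and lay each window out in row-major order.

    frames : sequence of frames (any objects; ids are used here for illustration)
    stride : step between windows; defaults to non-overlapping windows
    Trailing frames that do not fill a whole window are dropped in this sketch.
    """
    size = rows * cols
    stride = stride or size
    grids = []
    for start in range(0, len(frames) - size + 1, stride):
        window = frames[start:start + size]
        # Row r holds frames r*cols .. (r+1)*cols - 1, so time runs left-to-right, top-to-bottom.
        grids.append([window[r * cols:(r + 1) * cols] for r in range(rows)])
    return grids
```

With six frames, a 2×2 grid, and a stride of 2, this yields two overlapping grids: frames 0 to 3 and frames 2 to 5.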

TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation

This paper proposes TacSIm, the first large-scale dataset and benchmark designed to reconstruct full-team trajectories from real Premier League broadcast footage and perform tactical style imitation in a virtual football environment. It quantifies tactical imitation fidelity using two metrics: spatial occupancy similarity and motion vector similarity.

Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition

Inspired by Kahneman's dual-process theory, the TCEI framework proposes a test-time adaptation method that combines an intuitive system (fast inference using episodic memory of recently observed objects) and an experiential system (calibrating intuitive predictions using accumulated experience from historical videos). It significantly improves multi-object tracking performance under distribution shifts without requiring backpropagation.

TF-CADE: Foreground-Concentrated Text-Video Alignment for Zero-Shot Temporal Action Detection

Addressing the issue where "text does not influence predictions" in zero-shot temporal action detection, this paper introduces an Action-Concentrated Aggregation (ACA) module. ACA aggregates video features into a foreground video embedding based on temporal foreground saliency for explicit alignment with text. Furthermore, a Certainty-based Confidence Reweighting (CCR) mechanism injects video-level priors back into snippet-level classification scores to suppress semantically irrelevant action classes. This approach achieves SOTA performance on THUMOS14/ActivityNet in both in-distribution and cross-dataset zero-shot settings.

TGTrack: Temporal Generative Learning for Unified Single Object Tracking

TGTrack introduces a parallel generative supervision task of "predicting the next frame" into a unified single object tracking framework. By utilizing an autoregressive generative decoder with gated fusion and polar temporal tokens, it converts traditional implicit and passive temporal modeling into explicit and active temporal learning, achieving SOTA results across 11 benchmarks in 5 modalities (e.g., 75.3% AUC on LaSOT).

The Road Less Seen: Segment Exploration for Weakly Supervised Video Anomaly Detection

Addressing the issue where top-k selection in WSVAD focuses only on the highest-scoring segments and misses dispersed or vague anomalies, this paper proposes a Temporal Clustering + Uncertainty Dual Exploration strategy to cover diverse and ambiguous anomaly segments. It advocates using Recall@FPR and AP to replace AUROC, which is heavily "inflated" by class imbalance, improving AP on UCF-Crime from 35.48% to 38.33%.
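
For reference, the two metrics the paper advocates can be computed as below. This is a minimal sketch that ignores tied scores and uses the common step-wise definition of AP; the paper's exact evaluation protocol is not given in the summary, and the function names are hypothetical.

```python
def average_precision(scores, labels):
    """Mean of precision values taken at the rank of each positive (label == 1)."""
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            ap += tp / rank
    return ap / n_pos


def recall_at_fpr(scores, labels, max_fpr):
    """Highest recall reachable while the false positive rate stays <= max_fpr."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    best = 0.0
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        if n_neg and fp / n_neg > max_fpr:
            break
        best = tp / n_pos
    return best
```

Both metrics center on how well the rare anomalous segments are recovered, which is why they are less flattered by a large pool of easy normal segments than AUROC.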

Time Blindness: Why Video-Language Models Can't See What Humans Can?

The authors construct SpookyBench, a synthetic benchmark where information exists "purely in inter-frame temporal dynamics while single frames are total noise." While humans can read text or identify objects with 98% accuracy using motion grouping, 15 state-of-the-art Video-VLMs (including GPT-4o, Gemini 2.5 Pro, and Qwen2.5-VL-72B) all achieved 0% accuracy. This clearly exposes a "time blindness" in current video models: they rely on per-frame spatial features and lack mechanisms for processing pure temporal information.

TimeBridge: Self-Supervised Video Representation Learning via Start-End Joint Embedding and In-Between Frame Prediction

TimeBridge introduces an auxiliary task to the iBOT joint embedding framework: given only the start and end frames of a video, the model must "reconstruct" the intermediate frames. This forces the model to learn authentic temporal transformations. With 400 epochs of training, it achieves new SOTA on dense video prediction benchmarks such as DAVIS (73.5 J&F) and VIP (47.5 mIoU).

TLMA: Mitigating the Impact of Weakly Labeled Information for Video Anomaly Detection

To mitigate the interference of "Weakly Labeled Information (WLI)" arising from video-level labels in WSVAD, TLMA utilizes a triplet learning strategy dynamically constructed from model predictions to push WLI away from true anomalies in the feature space. Combined with a motion-aware feature enhancement module based on frame-wise Sobel edge differences to highlight foreground dynamics, the method achieves SOTA performance on UCF-Crime, XD-Violence, and MSAD benchmarks while significantly reducing false alarm rates.
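
The summary does not give TLMA's loss, but the standard triplet margin loss conveys the mechanism of pushing one group of features away from another. In the sketch below the roles are an assumption: the anchor and positive are features predicted as true anomalies, and the negative is a feature treated as weakly labeled information.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_ap = math.dist(anchor, positive)  # pull anchor and positive together
    d_an = math.dist(anchor, negative)  # push anchor and negative apart
    return max(0.0, d_ap - d_an + margin)
```

The loss is zero once the negative sits at least `margin` farther from the anchor than the positive does, so only triplets that are still confusable contribute to training.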

Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

The AOT framework is proposed to achieve training-free video token compression by establishing local-global token anchors and utilizing Optimal Transport (OT) to aggregate the semantic information of pruned/merged tokens at both intra-frame and inter-frame levels. It retains 97.6% of the original performance even when 90% of tokens are pruned.

Toward Low-Cost yet Effective Temporal Learning for UAV Tracking

Addressing single object tracking for Unmanned Aerial Vehicles (UAVs), this paper first proposes an evaluation metric, Precision per FLOP (PPF), which couples accuracy gain with computational overhead. This metric reveals that existing temporal modules generally possess low "cost-effectiveness." Consequently, a lightweight temporal module (LETL) is designed that propagates and merges only a small number of representative appearance tokens. Integrated into a one-stream framework, the resulting LETrack achieves SOTA performance across six aerial datasets with negligible additional computational cost.

Towards Streaming Referring Video Segmentation via Large Language Model

StreamingRVOS transforms MLLM-based referring image segmentation into a "frame-by-frame streaming" referring video segmentation paradigm. It utilizes Semantic Embedding Reuse (SER) to feed the previous frame's [SEG] token back into the MLLM as temporal context, and employs Online Mask Consistency Perception (OMCP) to determine whether to re-invoke the MLLM for the current frame. Without adding any parameters, the 1B variant achieves a 19.2% improvement over Sa2VA on MeViS, while streaming inference reaches 7 FPS on a single A800 GPU.

TrajTok: Learning Trajectory Tokens Enhances Video Understanding

This paper proposes TrajTok, an end-to-end differentiable trajectory tokenizer that implicitly clusters video pixels into object trajectory tokens, replacing external segmentation+tracking pipelines. It achieves significant improvements across three scenarios: training from scratch (TrajViT2), feature adaptation (TrajAdapter), and Vision-Language Model connectors (TrajVLM), notably outperforming patch pooling in long-video QA.

TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events

TAPFormer utilizes a "Transient Asynchronous Fusion" mechanism to integrate low-frame-rate RGB frames with high-frequency event streams into a continuous latent representation that updates alongside events. This enables stable, high-frequency arbitrary point tracking in motion-blurred, low-light, and high-speed scenarios, improving average pixel error within thresholds by 28.2% on a self-built real-world frame-event dataset.

TVHighlights: LLM-Guided Human-Free Collaborative Training for Video Highlight Detection in Movies and TV Dramas

To address the challenge where highlight segments in movies and TV dramas lack a unified definition and manual annotation is both expensive and subjective, the authors first automatically construct the human-free TVHighlights dataset by repurposing community derivative works. Subsequently, LTV-HD is proposed: a lightweight multimodal network is pre-trained with video-level weak labels, followed by a self-improving closed loop involving an LLM for mutual error correction. Ultimately, this achieves SOTA performance of 92.74% AUC / 71.20% AP without any human annotation.

U2Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation

U2Flow is the first recurrent unsupervised framework for joint estimation of optical flow and pixel-wise uncertainty. By leveraging decoupled uncertainty learning based on augmentation consistency and uncertainty-guided bidirectional flow fusion, it achieves unsupervised SOTA on KITTI and Sintel.

UETrack: A Unified and Efficient Framework for Single Object Tracking

This paper proposes UETrack, a unified and efficient single object tracking framework capable of processing five modalities: RGB, Depth, Thermal, Event, and Language. UETrack addresses the gap in efficient multi-modal tracking: existing efficient trackers are limited to RGB, while multi-modal trackers are often too slow due to complex designs. Core innovations include: (1) Token-Pooling-based Mixture-of-Experts (TP-MoE), which replaces traditional gating mechanisms with similarity-based soft assignment for efficient expert collaboration and specialization; (2) Target-aware Adaptive Distillation (TAD), which adaptively determines whether each sample is suitable for distillation to filter unreliable teacher signals. Across 12 benchmarks and 3 hardware platforms, UETrack achieves an optimal balance between speed and accuracy. For instance, UETrack-B achieves 69.2% AUC on LaSOT and runs at 163/56/60 FPS on GPU/CPU/AGX, respectively.

UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models

UFVideo is the first Video LLM to unify global, pixel-level, and temporal-level video understanding capabilities. Through a vision-language guided alignment strategy and a SAM2 mask decoder, it simultaneously supports video QA, object referring, video segmentation, and temporal grounding within a single model. Furthermore, the multi-granularity cooperative understanding benchmark, UFVideo-Bench, is introduced.

Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability

This paper analyzes the root cause of temporal logic inconsistency in Video-LLMs from an interpretability perspective—specifically the inability of cross-modal attention heads to effectively distinguish video tokens at different timestamps—and proposes TCAS (Temporally Conditioned Attention Sharpening) to significantly improve temporal logic consistency and general temporal grounding performance by optimizing attention distribution.

Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention

This paper proposes a unified spatiotemporal token compression method that jointly evaluates token contribution and semantic redundancy via a global retention pool. By introducing text-aware merging within the LLM, the method maintains 90.1% of baseline performance at an extreme 2% visual token retention rate, while reducing FLOPs to approximately 2.6%.

UniVBench: Towards Unified Evaluation for Video Foundation Models

UniVBench utilizes 200 human-crafted, copyright-free multi-shot videos and an agentic evaluation system, UniV-Eval, to evaluate video understanding, generation, editing, and the newly proposed "video reconstruction" within a single framework. It is the first to provide a unified answer to whether unified video models truly excel in both perception and generation.

Unstitching the Chimera: Frame-Level Risk and Train-Free Mitigation for Video Hallucination

This paper characterizes a neglected type of video hallucination from the perspective of "frames" rather than "tokens"—Chimera Hallucination: where the model stitches together fragments that actually exist in the video but do not belong to the same event chain into a false continuous narrative. To address this, the authors propose CH-Risk, a single-forward-pass, reference-free risk metric to quantify this risk, and CH-M, a training-free two-stage intervention (segment routing sSAFR + residual token calibration RTC), to correct high-risk samples. This approach consistently reduces hallucinations and improves accuracy across 9 benchmarks and 6 VideoLLMs with <5% latency, <2.5% VRAM, and ≈1% FLOPs overhead.

UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking

This paper proposes UTPTrack, the first unified framework to jointly prune tokens across three components: Search Region (SR), Dynamic Template (DT), and Static Template (ST) within one-stream Transformer trackers. It achieves 65–67% visual token reduction in RGB and multimodal/language-guided tracking while maintaining 99.7%–100.5% of baseline performance.

VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference

This paper identifies a strong "vertical vector" sparsity pattern in video model attention maps and proposes VecAttention, a fine-grained vector-wise sparse attention framework. By implementing efficient important vector selection via TilingSelect + minS filtering, it achieves video understanding accuracy comparable to full attention at 78%+ sparsity, accelerating attention computation by 2.65x.

Video Panels for Long Video Understanding

The authors propose tiling multiple adjacent video frames together into a single "comic-style" panel image to trade spatial resolution for temporal resolution. This approach improves the long video understanding capabilities of existing VLMs, increasing VideoLLaMA 3's QA accuracy by 19.4% on the TimeScope (Long) benchmark, without modifying architectures, training further, or adding parameters.

VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

VideoChat-M1 proposes the Collaborative Policy Planning (CPP) paradigm and a Multi-Agent Reinforcement Learning (MARL) training method. By employing 4 heterogeneous VLM agents to dynamically generate and update tool-calling policies for video understanding, it outperforms Gemini 2.5 Pro by 3.6% and GPT-4o by 15.6% on LongVideoBench.

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

VideoITG reformulates "selecting frames based on user instructions" as a standalone temporal grounding task. By utilizing a GPT-4o-driven three-stage pipeline (VidThinker), it automatically annotates "which frames are relevant to an instruction" across 40K videos, generating 500K instruction-aligned annotations. A plug-and-play frame selector is then trained and prepended to various Video-LLMs, achieving or exceeding the performance of 64-frame uniform sampling using only 16–32 frames.

VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

VideoNet constructs a video action recognition benchmark covering 37 domains and 1,000 fine-grained "domain-specific actions" (offering both multiple-choice and binary few-shot protocols). Using a fully automated pipeline, it collects nearly 500,000 VQA training pairs, bringing the "forgotten task" of domain-specific action recognition back into the VLM evaluation spotlight. The results show that open-source 8B VLMs achieve less than 50% accuracy in multiple-choice, while a 4B model fine-tuned on this data outperforms all 8B open-source models.

VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

VideoSeek proposes a long-horizon video agent that utilizes video logical flow to actively "seek" key evidence rather than exhaustively parsing all frames. Through a think-act-observe loop and a multi-granularity toolkit (overview/skim/focus), it achieves a 10.2-point improvement over the base GPT-5 model on LVBench while reducing frame usage by 93%.

VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer

VidPrism transforms the Mixture-of-Experts in Image-to-Video (I2V) transfer from a "group of homogeneous generalists" into "heterogeneous experts specialized by temporal resolution." By utilizing content-aware multi-rate sampling to feed different rhythms of video streams to each expert and dynamic bidirectional interaction for information exchange between fast/slow paths, it achieves new SOTA results on K400/UCF-101/HMDB-51/SSv2 with lower computational costs.

VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding

This paper proposes VirtueBench, the first long video understanding benchmark evaluating VLM trustworthiness under uncertainty. By constructing multi-level frame sampling for each video and labeling answerable/unanswerable ground truths, it reveals a widespread tendency among existing models to guess rather than honestly refuse.

Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding

WFS-SB is a training-free frame selection framework that uses wavelet transforms to detect semantic boundaries within query-frame similarity signals. By partitioning videos into semantically coherent segments, it adaptively allocates frame budgets and performs diversity-aware sampling, significantly outperforming SOTA methods on VideoMME, MLVU, and LongVideoBench.
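As a rough illustration of boundary detection on a query-frame similarity signal, the sketch below uses level-1 undecimated Haar detail coefficients and a fixed threshold. The wavelet choice, the threshold, and the helper names `haar_detail` and `split_at_boundaries` are assumptions made for this example; the summary does not specify the paper's actual wavelet, scales, or budget allocation rule.

```python
import math
from typing import List, Tuple


def haar_detail(signal: List[float]) -> List[float]:
    """Level-1 undecimated Haar detail: scaled difference of neighbouring samples."""
    return [(b - a) / math.sqrt(2) for a, b in zip(signal, signal[1:])]


def split_at_boundaries(signal: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Split frame indices into [start, end) segments at large detail coefficients."""
    detail = haar_detail(signal)
    # A large coefficient between frames i and i+1 starts a new segment at i+1.
    cuts = [i + 1 for i, d in enumerate(detail) if abs(d) > threshold]
    edges = [0] + cuts + [len(signal)]
    return [(start, end) for start, end in zip(edges, edges[1:]) if end > start]


# Hypothetical per-frame similarity to the query: low, then high, then low again.
similarity = [0.1, 0.1, 0.1, 0.8, 0.8, 0.2, 0.2]
segments = split_at_boundaries(similarity, threshold=0.3)
```

Each resulting segment could then receive its own share of the frame budget; how that share is computed is left out here because the summary does not describe it.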

Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning

LAS-VAD uses an Anomaly-Connected Components (ACC) mechanism to group video frames into semantically consistent clusters for pseudo-label generation, mitigating the lack of frame-level annotations. It further incorporates an Intention Awareness Mechanism (IAM) that leverages position-velocity-acceleration features to distinguish between normal and abnormal behaviors with similar appearances, achieving 89.96% AP (I3D) on XD-Violence.
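Position-velocity-acceleration features can be approximated with finite differences over a 2D track. The sketch below assumes one (x, y) position per frame and a constant frame interval `dt`; the function name `kinematic_features` and the alignment of the three quantities are illustrative choices, not the paper's formulation.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def kinematic_features(track: List[Point], dt: float = 1.0) -> List[Tuple[Point, Point, Point]]:
    """Return (position, velocity, acceleration) per frame via backward differences."""
    # First differences of position give velocity between consecutive frames.
    vel = [((x1 - x0) / dt, (y1 - y0) / dt)
           for (x0, y0), (x1, y1) in zip(track, track[1:])]
    # First differences of velocity give acceleration.
    acc = [((vx1 - vx0) / dt, (vy1 - vy0) / dt)
           for (vx0, vy0), (vx1, vy1) in zip(vel, vel[1:])]
    # All three quantities are available from the third frame onward.
    return [(track[i + 2], vel[i + 1], acc[i]) for i in range(len(acc))]


# A hypothetical object speeding up along the x axis.
track = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
features = kinematic_features(track)
```

Two tracks with similar appearance but different motion, such as walking versus a sudden sprint, would differ in the velocity and acceleration components even when the positions overlap.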

WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

This paper proposes WorldMM, a video reasoning agent based on multimodal memory. It constructs three complementary types of memory: episodic memory (multi-time-scale textual knowledge graph), semantic memory (continuously updated relational knowledge graph), and visual memory (frame-level retrieval library). Through an adaptive multi-round retrieval agent, it dynamically selects the most relevant memory sources and temporal granularities, outperforming the previous SOTA by an average of 8.4% across five long-video QA benchmarks.

Your One-Stop Solution for AI-Generated Video Detection

The authors construct AIBD-Bench, a large-scale benchmark for AI-generated video detection covering 31 recent video generation models and 440k+ videos. They provide a standardized data construction pipeline featuring "attribute balancing + comprehensive selection + de-biased preprocessing." By conducting over 1,500 evaluations on 33 detectors, they derive 8 analyses and 4 new findings (crucially: "higher generation quality \(\neq\) harder to detect").