
📹 Video Understanding

🧪 ICML2026 · 17 paper notes

📌 Same area in other venues: 📷 CVPR2026 (187) · 🔬 ICLR2026 (48) · 🤖 AAAI2026 (27) · 🧠 NeurIPS2025 (39) · 📹 ICCV2025 (56) · 🧪 ICML2025 (4)

🔥 Top topics: Object Tracking ×4 · Reasoning ×2

AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes: This paper proposes the AVTrack dataset and the AVTracker baseline method for the Audio-Visual Instance Segmentation and tracking (AVIS) task in complex human-centric scenes. It defines eight challenging conditions to build a rigorous evaluation benchmark and designs a three-stage divide-and-conquer framework (ASR segmented aggregation → local speaker localization → global identity association), which outperforms existing state-of-the-art methods by approximately 8 percentage points on the HOTA metric.
Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning: Foresee-to-Ground (F2G) reformulates Video Temporal Grounding (VTG) from direct timestamp regression into an "identify-then-measure" two-stage problem. By utilizing predictive temporal perception and a span evidence encoder to build a candidate event evidence pool, the LLM generates precise boundaries constrained by selected events. This approach improves [email protected] by 4.1 points on Charades-STA and 6.7 points on ActivityNet.
MetaphorVU: Towards Metaphorical Video Understanding: This paper proposes the first metaphorical video understanding benchmark, MetaphorVU-Bench (860 videos + 8-category metaphor taxonomy), and an enhancement method, MetaphorBoost, which uses a metaphor knowledge graph with 54K nodes and 200K edges as an external cognitive scaffold. The study shows quantitatively that the core bottleneck for MLLMs in metaphorical video understanding is the "lack of cross-domain mapping" rather than visual recognition errors. The best-performing model still trails human performance (83.4) by 17 points.
OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models: This paper argues that existing Omni-LLM token compression methods are suboptimal due to their "symmetric" treatment of audio and video. It proposes OmniSIFT—a two-stage asymmetric compression framework that first prunes video redundancy via spatio-temporal saliency to obtain "visual anchors," which then guide audio selection. With only 4.85M additional parameters, it consistently outperforms existing baselines and even the original model on Qwen2.5-Omni-7B while retaining only 25% of tokens.
Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection: The authors propose OPL (Orthogonal Projection Layer) and an enhanced version, G-OPL, which use a learnable orthogonal subspace derived from QR decomposition to explicitly project "task-irrelevant variables" and "facial privacy components" out of the video anomaly detection (VAD) feature space. They introduce four privacy-aware metrics (SSC/ARD/PD/FPD) and show that the accuracy of linear SVM probes at predicting faces drops significantly while VAD AUC is maintained or even improved.
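Projecting a subspace out of a feature vector can be sketched in a few lines of Python. This is a simplified illustration, not the authors' layer: the nuisance directions are fixed rather than learned, the orthonormal basis is built with Gram-Schmidt (which yields the Q factor of a QR decomposition), and `orthonormalize` and `project_out` are hypothetical names.

```python
import math


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: orthonormal basis spanning the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = sum(wi * qi for wi, qi in zip(w, q))
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > eps:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_out(x, basis):
    """Return x - Q Q^T x: the part of x orthogonal to the basis."""
    out = list(x)
    for q in basis:
        coeff = sum(xi * qi for xi, qi in zip(x, q))
        out = [oi - coeff * qi for oi, qi in zip(out, q)]
    return out


# Assumption for illustration: the first two axes carry the nuisance signal.
nuisance = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
feature = [3.0, 4.0, 5.0]
cleaned = project_out(feature, orthonormalize(nuisance))
print(cleaned)  # only the third component should survive, up to rounding
```

After the projection, a linear probe has no component along the removed directions to read from, which is the intuition behind the drop in probe accuracy.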
ProAct-VL: A Proactive VideoLLM for Real-Time AI Companions: ProAct-VL enables VideoLLMs to autonomously decide when to respond and generate short-segment commentary under streaming input via a chunk-level I/O paradigm, a lightweight FLAG decision head, and transition-aware loss functions. It achieves ~1s low latency and strong proactivity—obtaining a TimeDiff of only 1.20s and a trigger F1 of 63.25% in game commentary tasks, significantly outperforming offline models like GPT-4o.
RELO: Reinforcement Learning to Localize for Visual Object Tracking: RELO reformulates the "where is the target" problem in single object tracking as an MDP on a spatial feature map. It treats each spatial position as an action and replaces traditional manual center heatmap supervision with actor-critic + direct IoU/AUC rewards. Coupled with two stabilization designs ("regression warmup" and "layer-aligned temporal token propagation"), it achieves SOTA with 57.5% AUC on LaSOT_ext.
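A rough Python sketch of "spatial position as action, IoU as reward" is shown below. It makes several assumptions that are not from the paper: the policy is a softmax over a small score map, the predicted box has a fixed size centered on the chosen cell, and `sample_action`, `iou_reward`, `stride`, and `box_size` are hypothetical names.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(score_map, rng):
    """Sample a (row, col) cell with probability softmax(score)."""
    flat = [s for row in score_map for s in row]
    index = rng.choices(range(len(flat)), weights=softmax(flat), k=1)[0]
    return divmod(index, len(score_map[0]))


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_reward(action, box_size, stride, gt_box):
    """Reward = IoU between a box centered on the chosen cell and the ground truth."""
    row, col = action
    cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
    w, h = box_size
    pred = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
    return box_iou(pred, gt_box)


rng = random.Random(0)
action = sample_action([[0.0, 1.0], [2.0, 3.0]], rng)
print(action, iou_reward(action, (16, 16), 16, (16, 16, 32, 32)))
```

The reward is computed directly from the localization quality of the sampled position, so no hand-designed center heatmap is needed as a training target.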
Return of Frustratingly Easy Unsupervised Video Domain Adaptation: This paper proposes MetaTrans—a "frustratingly easy" Unsupervised Video Domain Adaptation (UVDA) method. It decouples spatial and temporal domain gaps through spatio-temporal feature subtraction in a dual-stream Transformer. By using only two basic losses (supervised + domain adversarial), it outperforms complex SOTA methods and reduces hyperparameter search costs from exponential to linear.
Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval: To address the query ambiguity and temporally sparse supervision caused by "short queries vs. long videos" in Partially Relevant Video Retrieval (PRVR), this paper proposes Holmes, a hierarchical evidential learning framework based on the Dirichlet distribution. At the inter-video level, it distinguishes precise, polysemous, and under-determined queries using a three-fold principle for adaptive label calibration; at the intra-video level, it achieves dense alignment via flexible optimal transport with a dustbin. Holmes achieves SOTA on the ActivityNet, Charades, and TVR datasets.
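For readers unfamiliar with Dirichlet-based evidential learning, the sketch below shows the standard subjective-logic mapping from non-negative evidence to belief masses and an explicit uncertainty mass. It illustrates the general formulation only; how Holmes produces evidence and applies its three-fold principle is not reproduced here, and `dirichlet_opinion` is a hypothetical helper name.

```python
def dirichlet_opinion(evidence):
    """Map non-negative evidence over K candidates to (belief, uncertainty, expected_prob)."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    k = len(evidence)
    alpha = [e + 1.0 for e in evidence]       # Dirichlet parameters
    strength = sum(alpha)                     # Dirichlet strength S
    belief = [e / strength for e in evidence]
    uncertainty = k / strength                # mass not assigned to any candidate
    expected = [a / strength for a in alpha]  # mean of the Dirichlet
    return belief, uncertainty, expected


# Strong evidence for one candidate: low uncertainty.
print(dirichlet_opinion([8.0, 0.0, 0.0]))
# Evidence split across candidates: belief is divided.
print(dirichlet_opinion([4.0, 4.0, 0.0]))
# No evidence at all: uncertainty is maximal.
print(dirichlet_opinion([0.0, 0.0, 0.0]))
```

Belief masses and the uncertainty mass always sum to one, so a query with little total evidence is flagged as uncertain rather than forced into a confident match.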
SkelHCC: A Hyperbolic CLIP-Driven Cache Adaptation Framework for Skeleton-based One-Shot Action Recognition: SkelHCC maps CLIP to hyperbolic space to explicitly align skeleton-language representations across three granularities: "Joint → Body Part → Full Body." It utilizes LLM-generated body part importance masks for training-free multi-granularity voting cache inference, achieving a 9% improvement over the previous SOTA on NTU120 one-shot action recognition with only 0.5M trainable parameters.
SLAP: The Semantic Least Action Principle for Variational Video-Language Modeling: SLAP transplants the "Least Action Principle of classical mechanics" onto the video semantic manifold, modeling the completion of missing frames in sparsely sampled videos as a two-point boundary value problem on a Riemannian manifold. By replacing probabilistic generation with semantic dynamics to enforce object permanence, it achieves 83.9% accuracy on tunnel occlusion tests (outperforming diffusion models by 12 points) with a 177× inference speedup.
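The "two-point boundary value problem" framing can be made concrete with a toy Python sketch. It is deliberately simplified and is not the paper's method: it works in flat Euclidean space with no potential term, where the discrete action is the sum of squared step lengths and the minimizer is a straight line between the two known frames. `least_action_path` and its parameters are hypothetical names.

```python
def least_action_path(start, end, n_missing, iters=500):
    """Fill n_missing intermediate states by minimizing sum ||x[i+1] - x[i]||^2.

    The endpoints stay fixed (the two boundary conditions). Setting the
    gradient to zero gives x[i] = (x[i-1] + x[i+1]) / 2, which is applied
    repeatedly (Gauss-Seidel relaxation) until the path settles.
    """
    dim = len(start)
    path = [list(start)] + [list(start) for _ in range(n_missing)] + [list(end)]
    for _ in range(iters):
        for i in range(1, n_missing + 1):
            path[i] = [(path[i - 1][d] + path[i + 1][d]) / 2 for d in range(dim)]
    return path


def action(path):
    """Discrete action: total squared displacement along the path."""
    return sum(
        sum((b - a) ** 2 for a, b in zip(p, q)) for p, q in zip(path, path[1:])
    )


path = least_action_path([0.0, 0.0], [4.0, 8.0], n_missing=3)
print(path, action(path))
```

On a curved manifold the update rule would involve the metric and the result would be a geodesic rather than a straight line, but the structure (fixed endpoints, interior states chosen to make the action stationary) is the same.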
STORM: Segment, Track, and Object Re-Localization from a Single Image: STORM proposes a 6D pose tracking framework that "runs with only a single reference image": it utilizes Hierarchical Spatial Fusion Attention (HSFA) for reference-query feature alignment (producing segmentation masks + SAM3D meshes) and trains a Tracking Verifier using BCE binary classification. The negative logit is defined as the energy score \(E=-g_\theta\), and re-localization is automatically triggered when the score exceeds a threshold for \(L=3\) consecutive frames, pushing zero-shot 6D tracking accuracy on LM-O/YCB-V close to the ground-truth mask upper bound.
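The re-localization trigger can be sketched as a simple streak counter in Python. The energy definition \(E=-g_\theta\) and the \(L=3\) window come from the summary above; the threshold value, the reset-after-trigger behavior, and the name `relocalization_frames` are assumptions made for illustration.

```python
def relocalization_frames(verifier_logits, threshold, patience=3):
    """Return frame indices at which re-localization would be triggered.

    verifier_logits: per-frame verifier outputs g_theta (higher = tracking looks fine).
    threshold: energy level above which a frame counts as a likely failure (assumed value).
    patience: number of consecutive high-energy frames required (L in the text).
    """
    triggers = []
    streak = 0
    for frame_index, logit in enumerate(verifier_logits):
        energy = -logit  # E = -g_theta
        streak = streak + 1 if energy > threshold else 0
        if streak >= patience:
            triggers.append(frame_index)
            streak = 0  # assumption: start counting afresh after re-localizing
    return triggers


logits = [2.0, -1.0, -1.5, -2.0, 3.0, -1.0, -1.0, 1.0, -1.0]
print(relocalization_frames(logits, threshold=0.5))
```

Requiring several consecutive high-energy frames keeps a single noisy verifier output from forcing an expensive re-localization.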
Unified Multimodal Visual Tracking with Dual Mixture-of-Experts: OneTrackerV2 unifies five tracking tasks (RGB / RGB+D / RGB+T / RGB+E / RGB+N) into a single network for end-to-end training. It utilizes a Meta Merger for modality fusion and a Dual MoE to explicitly decouple heterogeneous features—"spatial-temporal matching" and "modality fusion"—into T-MoE and M-MoE blocks. A dissimilarity loss and router clustering are employed to prevent these features from collapsing into the same subspace.
Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding: Video-MTR is an RL-based multi-turn reasoning framework that guides MLLMs to iteratively select key video segments through a gated dual-level reward mechanism. It achieves SOTA performance in long video understanding using only 8K training samples, outperforming methods that require 257K to 4.4 million samples (roughly 32× to 550× more data, i.e., a data-efficiency gain of up to two orders of magnitude).
VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority: VideoSEAL identifies the "evidence misalignment" problem in existing agentic long video QA systems—where agents answer correctly without actually seeing the evidence—and attributes the root cause to "coupled agents conflating planning and answering authority." It proposes a planner-inspector decoupling framework: the planner handles long-horizon evidence search, while the inspector holds exclusive answering authority and only releases the answer when pixel-level evidence is sufficient. This improves accuracy on LVBench from 48.2% to 55.1% (↑20.5%) and on LongVideoBench from 52.2% to 62.0%.
VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking: VideoTemp-o3 is a unified agentic video understanding framework. It jointly models temporal grounding and video QA, using a unified masking strategy for cold-start SFT and a penalty-aware IoU reward, to achieve high-quality multi-round iterative grounding and precise answering in long video understanding. It reaches an mIoU of 15.6% on ultra-long videos (> 20 minutes), surpassing Gemini-2.5-Pro's 14.8%.
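The mIoU figure quoted above is a mean of temporal IoU scores between predicted and ground-truth time spans. A minimal Python sketch of that metric follows; it covers only the plain IoU, since the form of the paper's penalty term is not given here, and `temporal_iou` and `mean_iou` are hypothetical helper names.

```python
def temporal_iou(pred, gt):
    """IoU of two time spans given as (start_seconds, end_seconds)."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("need equally sized, non-empty lists")
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


preds = [(10.0, 20.0), (0.0, 5.0)]
gts = [(15.0, 25.0), (10.0, 20.0)]
print(temporal_iou(preds[0], gts[0]), mean_iou(preds, gts))
```

Because a short ground-truth span inside a video longer than 20 minutes is easy to miss entirely, many predictions score zero, which is consistent with mIoU values in the mid-teens for this setting.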
VSCD: Video Scene Change Detection in Unaligned Scenarios: This paper introduces the VSCD task: detecting object-level changes, pixel by pixel, between two video sequences of the same environment recorded at different times. Its query-centric multi-reference model uses temporal consistency, patch-level correspondence, and confidence-weighted fusion to handle unconstrained camera motion and severe viewpoint mismatch.