
📹 Video Understanding

🧪 ICML2026 · 8 paper notes


🔥 Top topics: Object Tracking ×3

CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

This paper proposes CLEAR for video subtitle removal, trained in two stages: Stage I uses a dual encoder with orthogonal decoupling to learn a subtitle prior mask in a self-supervised way; Stage II adds LoRA and an occlusion head to the Wan2.1 video diffusion model for adaptive weighting. Inference needs neither a mask nor a text detector. With only 0.77% of parameters trainable, CLEAR reaches 26.80 dB PSNR on a Chinese test set (+6.77 dB over the strongest baseline) and generalizes zero-shot to six languages.

Find, Fix, Reason: Context Repair for Video Reasoning

This work addresses the dilemma in video reasoning where "on-policy RL stagnates at a capability ceiling, while off-policy distillation leads to entropy collapse." It introduces a frozen, tool-integrated large teacher model that inserts minimal "evidence patches" (key-frame intervals, error types) when the student fails during rollout. The student then re-answers the same question, and the repaired trajectory is incorporated into GRPO optimization via a chosen-rollout mechanism.
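The summary does not spell out the chosen-rollout mechanism, so the following is only a minimal sketch under stated assumptions. It uses the standard GRPO group-relative advantage (reward minus group mean, divided by group standard deviation) and an assumed rule in which a repaired rollout replaces the failed one when its reward is higher. The function names and the toy rewards are hypothetical.

```python
from statistics import mean, pstdev


def choose_rollouts(original_rewards, repaired_rewards):
    """Keep the repaired reward when one exists and beats the original.

    This replacement rule is an assumption for illustration, not the
    mechanism described in the paper.
    """
    chosen = []
    for orig, rep in zip(original_rewards, repaired_rewards):
        if rep is not None and rep > orig:
            chosen.append(rep)
        else:
            chosen.append(orig)
    return chosen


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO advantage: normalize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: the student fails every rollout, so all rewards are equal.
original = [0.0, 0.0, 0.0, 0.0]
# The teacher's patch lets two re-answered rollouts succeed (None = not repaired).
repaired = [None, 1.0, None, 1.0]

flat_advantages = group_relative_advantages(original)
chosen = choose_rollouts(original, repaired)
advantages = group_relative_advantages(chosen)
```

When every rollout in a group fails, all advantages are zero and the policy gradient carries no signal, which is one way to read the "capability ceiling" above. Mixing in repaired rollouts restores reward variance inside the group.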

Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection

The authors propose OPL (Orthogonal Projection Layer) and its enhanced version G-OPL, which use a learnable orthogonal subspace derived from QR decomposition to explicitly project out "task-irrelevant variables" and "facial privacy components" from the feature space of video anomaly detection. Four privacy-aware metrics (SSC/ARD/PD/FPD) are introduced. While maintaining or improving VAD AUC, the accuracy of linear SVM probes for facial prediction drops significantly.
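To make "projecting out" a subspace concrete, here is a small sketch assuming the unwanted directions are already given as fixed vectors. In the paper the subspace is learnable; the toy vectors, function names, and the use of modified Gram-Schmidt (which yields the Q factor of a thin QR decomposition) are assumptions for illustration. The cleaned feature is \(x - QQ^\top x\).

```python
def orthonormalize(vectors, tol=1e-12):
    """Modified Gram-Schmidt: orthonormal basis of span(vectors).

    The result plays the role of Q in a thin QR decomposition.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = sum(wi * qi for wi, qi in zip(w, q))
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = sum(wi * wi for wi in w) ** 0.5
        if norm > tol:  # skip linearly dependent directions
            basis.append([wi / norm for wi in w])
    return basis


def project_out(x, basis):
    """Remove the component of x that lies in span(basis): x - Q Q^T x."""
    out = list(x)
    for q in basis:
        coeff = sum(xi * qi for xi, qi in zip(x, q))
        out = [oi - coeff * qi for oi, qi in zip(out, q)]
    return out


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


# Hypothetical directions standing in for privacy / task-irrelevant factors.
nuisance = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
feature = [3.0, 2.0, 5.0, -1.0]

Q = orthonormalize(nuisance)
cleaned = project_out(feature, Q)  # -> [0.0, 0.0, 5.0, -1.0]
```

After projection the feature has zero inner product with every nuisance direction, so a linear probe restricted to that subspace has nothing left to read. This is the intuition behind the drop in linear SVM probe accuracy.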

RELO: Reinforcement Learning to Localize for Visual Object Tracking

RELO reframes the "where is the target" problem in visual single-object tracking as an MDP over a spatial feature map, treating each spatial location as an action. It replaces traditional handcrafted center heatmap supervision with actor-critic and direct IoU/AUC rewards, and introduces two stabilization designs: "warmup regression" and "layer-aligned temporal token propagation." On LaSOT\(_{\text{ext}}\), it achieves SOTA with 57.5% AUC.
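The "location as action" framing can be sketched as follows: a softmax policy over the flattened score map, one sampled cell per step, and an IoU reward against the ground-truth box. The fixed-size box built around the sampled cell, the `stride` and `size` values, and all function names are assumptions made for illustration. The summary does not say how RELO turns a location into a box.

```python
import math
import random


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sample_location(score_map, rng):
    """Treat each cell as an action; return ((row, col), probability)."""
    width = len(score_map[0])
    flat = [s for row in score_map for s in row]
    probs = softmax(flat)
    action = rng.choices(range(len(flat)), weights=probs, k=1)[0]
    return divmod(action, width), probs[action]


def box_at(cell, stride=16, size=32):
    """Hypothetical fixed-size box centered on a feature-map cell."""
    row, col = cell
    cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
    half = size / 2
    return (cx - half, cy - half, cx + half, cy + half)


rng = random.Random(0)
score_map = [[0.1, 0.2, 0.1], [0.3, 2.5, 0.4], [0.0, 0.2, 0.1]]
cell, prob = sample_location(score_map, rng)
reward = iou(box_at(cell), box_at((1, 1)))  # ground truth placed at the center cell
```

In an actor-critic setup the reward would weight the log-probability of the sampled cell, so the policy is trained on the tracking metric itself rather than on a handcrafted heatmap target.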

Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval

This paper addresses the query ambiguity and temporally sparse supervision in Partially Relevant Video Retrieval (PRVR) caused by "short queries vs. long videos." It proposes Holmes, a hierarchical evidential learning framework based on the Dirichlet distribution. Holmes distinguishes precise, polysemous, and under-determined queries across videos using a triple-principle and adaptively calibrates labels. Within videos, it achieves dense alignment via flexible optimal transport with a dustbin. The method achieves SOTA on ActivityNet, Charades, and TVR datasets.
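For readers new to evidential learning, the sketch below shows the standard Dirichlet-based opinion used in that literature: non-negative evidence \(e_k\) gives \(\alpha_k = e_k + 1\), strength \(S = \sum_k \alpha_k\), belief \(b_k = e_k / S\), and uncertainty \(u = K / S\). How Holmes applies its triple-principle on top of these quantities is not covered by the summary, so the toy evidence vectors and their labels are hypothetical.

```python
def dirichlet_opinion(evidence):
    """Return (belief masses, uncertainty) for non-negative evidence.

    alpha_k = e_k + 1, S = sum(alpha), b_k = e_k / S, u = K / S,
    so the beliefs and the uncertainty always sum to 1.
    """
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return belief, uncertainty


# Hypothetical query-to-video evidence over three candidate videos.
focused_b, focused_u = dirichlet_opinion([18.0, 0.0, 0.0])  # one clear match
split_b, split_u = dirichlet_opinion([9.0, 9.0, 0.0])       # two plausible matches
weak_b, weak_u = dirichlet_opinion([0.5, 0.5, 0.5])         # little evidence overall
```

The first two cases have the same low uncertainty (3/21) but differ in how belief is spread, while the third has high uncertainty (3/4.5). Distinctions of this kind are what let an evidential model tell a confident query from an ambiguous one and from an under-specified one.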

STORM: Segment, Track, and Object Re-Localization from a Single Image

STORM proposes a "single reference image" 6D pose tracking framework: hierarchical spatial fusion attention (HSFA) aligns reference-query features (producing segmentation masks + SAM3D mesh), then a BCE-trained Tracking Verifier outputs a logit whose negative is used as an energy score \(E=-g_\theta\). If the score exceeds a threshold for \(L=3\) consecutive frames, automatic re-localization is triggered. This pushes annotation-free 6D tracking accuracy on LM-O / YCB-V close to the ground-truth mask upper bound.
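The trigger rule described above can be sketched in a few lines: compute \(E=-g_\theta\) per frame and fire when the energy stays above a threshold for \(L=3\) consecutive frames. The threshold value of 0.0, the reset of the counter after a trigger, and the toy logits are assumptions, since the summary does not give them.

```python
def relocalization_frames(logits, threshold=0.0, patience=3):
    """Return frame indices at which re-localization would be triggered.

    logits: per-frame verifier outputs g_theta (higher = tracking looks good).
    The energy is E = -g_theta; a trigger fires once E > threshold has held
    for `patience` consecutive frames.
    """
    triggers = []
    streak = 0
    for t, g in enumerate(logits):
        energy = -g
        streak = streak + 1 if energy > threshold else 0
        if streak >= patience:
            triggers.append(t)
            streak = 0  # assumed: the counter restarts after re-localization
    return triggers


# Toy verifier logits: a two-frame dip (ignored), then a three-frame failure.
logits = [2.1, 1.7, -0.4, -0.9, 0.3, -1.2, -0.8, -2.0, 1.5]
triggers = relocalization_frames(logits)  # -> [7]
```

Requiring several consecutive high-energy frames keeps a single noisy frame from forcing an expensive re-localization.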

Unified Multimodal Visual Tracking with Dual Mixture-of-Experts

OneTrackerV2 unifies five tracking tasks (RGB, RGB+D, RGB+T, RGB+E, RGB+N) in a single network trained end-to-end. It uses a Meta Merger for modality fusion, and a Dual MoE that explicitly decouples "spatiotemporal matching" and "modality fusion" into T-MoE and M-MoE, respectively. A dissimilarity loss and router clustering keep the two expert groups from collapsing into the same subspace.

VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority

VideoSEAL identifies a prevalent "correct answer without seeing evidence" misalignment in existing agentic long video QA systems, attributing the root cause to "coupled agents conflating planning and answering authority." It proposes a planner-inspector decoupling framework: the planner is responsible for long-horizon evidence search, while the inspector holds exclusive answering authority and only permits answers when pixel-level evidence is sufficient. On LVBench, accuracy improves from 48.2% to 55.1% (↑20.5%), and on LongVideoBench from 52.2% to 62.0%.