📹 Video Understanding
🧪 ICML2026 · 8 paper notes
📌 Same area in other venues: 💬 ACL2026 (8) · 📷 CVPR2026 (77) · 🔬 ICLR2026 (22) · 🤖 AAAI2026 (33) · 🧠 NeurIPS2025 (58) · 📹 ICCV2025 (57)
🔥 Top topics: Object Tracking ×3
- CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal
This paper proposes CLEAR, a video subtitle removal method trained in two stages: Stage I uses a dual encoder with orthogonal decoupling to learn a subtitle prior mask in a self-supervised manner, and Stage II adds LoRA and an occlusion head to the Wan2.1 video diffusion model for adaptive weighting. Inference requires no mask or text detector. With only 0.77% of parameters trainable, CLEAR reaches 26.80 dB PSNR on a Chinese test set (+6.77 dB over the strongest baseline) and generalizes zero-shot to six languages.
- Find, Fix, Reason: Context Repair for Video Reasoning
This work addresses the dilemma in video reasoning where "on-policy RL stagnates at a capability ceiling, while off-policy distillation leads to entropy collapse." It introduces a frozen, tool-integrated large teacher model that inserts minimal "evidence patches" (key-frame intervals, error types) when the student fails during rollout. The student then re-answers the same question, and the repaired trajectory is incorporated into GRPO optimization via a chosen-rollout mechanism.
- Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection
The authors propose OPL (Orthogonal Projection Layer) and its enhanced version G-OPL, which use a learnable orthogonal subspace derived from QR decomposition to explicitly project out "task-irrelevant variables" and "facial privacy components" from the feature space of video anomaly detection. Four privacy-aware metrics (SSC/ARD/PD/FPD) are introduced. While maintaining or improving VAD AUC, the accuracy of linear SVM probes for facial prediction drops significantly.
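As a rough illustration of the projection idea, the sketch below builds an orthonormal basis with a reduced QR decomposition and removes the component of each feature that lies in that subspace. It is a minimal NumPy sketch under stated assumptions, not the authors' OPL layer: the names `orthonormal_basis` and `project_out`, the matrix `W` standing in for the learnable parameter, and all sizes are hypothetical.

```python
import numpy as np


def orthonormal_basis(W):
    """Return an orthonormal basis Q for the column space of W (reduced QR)."""
    Q, _ = np.linalg.qr(W)  # Q has shape (d, k) with orthonormal columns
    return Q


def project_out(features, Q):
    """Remove the component of each feature row that lies in span(Q)."""
    return features - features @ Q @ Q.T


rng = np.random.default_rng(0)
d, k, n = 16, 3, 5            # hypothetical feature dim, subspace rank, batch size
W = rng.normal(size=(d, k))   # stands in for a learnable parameter (assumption)
X = rng.normal(size=(n, d))   # stands in for per-clip video features (assumption)

Q = orthonormal_basis(W)
X_clean = project_out(X, Q)

# Largest remaining overlap between the cleaned features and the removed subspace.
residual = np.abs(X_clean @ Q).max()
```

After projection, `residual` is zero up to floating-point error, which is the property a linear probe restricted to the removed subspace would rely on. How the subspace is trained to capture facial or task-irrelevant information is not covered by this sketch.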
- RELO: Reinforcement Learning to Localize for Visual Object Tracking
RELO reframes the "where is the target" problem in visual single-object tracking as an MDP over a spatial feature map, treating each spatial location as an action. It replaces traditional handcrafted center heatmap supervision with actor-critic training driven by direct IoU/AUC rewards, and introduces two stabilization designs: "warmup regression" and "layer-aligned temporal token propagation." On \(\text{LaSOT}_{\text{ext}}\), it achieves SOTA with 57.5% AUC.
- Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval
This paper addresses the query ambiguity and temporally sparse supervision in Partially Relevant Video Retrieval (PRVR) caused by "short queries vs. long videos." It proposes Holmes, a hierarchical evidential learning framework based on the Dirichlet distribution. Holmes distinguishes precise, polysemous, and under-determined queries across videos using a triple-principle and adaptively calibrates labels. Within videos, it achieves dense alignment via flexible optimal transport with a dustbin. The method achieves SOTA on ActivityNet, Charades, and TVR datasets.
- STORM: Segment, Track, and Object Re-Localization from a Single Image
STORM proposes a "single reference image" 6D pose tracking framework: hierarchical spatial fusion attention (HSFA) aligns reference-query features (producing segmentation masks + SAM3D mesh), then a BCE-trained Tracking Verifier outputs a logit whose negative is used as an energy score \(E=-g_\theta\). If the score exceeds a threshold for \(L=3\) consecutive frames, automatic re-localization is triggered. This pushes annotation-free 6D tracking accuracy on LM-O / YCB-V close to the ground-truth mask upper bound.
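The re-localization rule described above can be sketched as a small streak counter over verifier logits. This is a hedged illustration only: the function name `relocalization_frames`, the example logits, the threshold value, and the choice to reset the counter after a trigger are assumptions, not details taken from the paper.

```python
def relocalization_frames(logits, threshold, patience=3):
    """Return frame indices at which re-localization would be triggered.

    The energy of a frame is E = -logit. A trigger fires once E has exceeded
    `threshold` for `patience` consecutive frames. Resetting the streak after
    a trigger is an assumption of this sketch.
    """
    triggers = []
    streak = 0
    for t, logit in enumerate(logits):
        energy = -logit
        # Extend the streak on a high-energy frame, otherwise start over.
        streak = streak + 1 if energy > threshold else 0
        if streak >= patience:
            triggers.append(t)
            streak = 0
    return triggers


# Hypothetical verifier logits for nine frames (higher = more confident track).
verifier_logits = [2.0, 1.5, -1.0, -2.0, -1.5, 0.5, -3.0, -3.0, 1.0]
frames = relocalization_frames(verifier_logits, threshold=0.0)
print(frames)  # [4]
```

Requiring several consecutive high-energy frames, rather than reacting to a single one, keeps a brief occlusion or one noisy verifier output from forcing a re-localization.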
- Unified Multimodal Visual Tracking with Dual Mixture-of-Experts
OneTrackerV2 unifies five tracking tasks (RGB, RGB+D, RGB+T, RGB+E, RGB+N) into a single network trained end-to-end. It uses a Meta Merger for modality fusion, and a Dual MoE that explicitly decouples "spatiotemporal matching" and "modality fusion" into T-MoE and M-MoE, respectively. A dissimilarity loss and router clustering keep the two expert groups from collapsing into the same subspace.
- VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority
VideoSEAL identifies a prevalent "correct answer without seeing evidence" misalignment in existing agentic long video QA systems, attributing the root cause to "coupled agents conflating planning and answering authority." It proposes a planner-inspector decoupling framework: the planner is responsible for long-horizon evidence search, while the inspector holds exclusive answering authority and only permits answers when pixel-level evidence is sufficient. On LVBench, accuracy improves from 48.2% to 55.1% (↑20.5%), and on LongVideoBench from 52.2% to 62.0%.