
📹 Video Understanding

🎞️ ECCV2024 · 51 paper notes


🔥 Top topics: Object Tracking ×10 · Human Pose ×4 · Self-Supervised Learning ×2 · Reasoning ×2 · Few-/Zero-Shot Learning ×2

ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos: Proposes ActionSwitch, the first online temporal action localization (On-TAL) framework that detects overlapping action instances in streaming videos without category information. The core idea is to model multi-action detection as a state classification problem for a finite state machine, augmented by a conservativeness loss that reduces fragmented false positives. It achieves SOTA among OAD-extension methods on datasets such as THUMOS14, FineAction, and Epic-Kitchens 100.
Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts: Proposes Adapt2Reward, which adapts pre-trained video-language models into generalizable language-conditioned reward functions using learnable failure prompts. Requiring only a small amount of robot data from a single environment, it generalizes to new environments and tasks, outperforming prior methods by approximately 28% on MetaWorld.
AMEGO: Active Memory from Long EGOcentric Videos: Proposes AMEGO, a method for online construction of structured "active memory" from long egocentric videos. By combining HOI tracklets, location segments, and semantic-free visual queries, it outperforms Video QA baselines by 12.7% on the newly proposed AMB benchmark.
Bayesian Evidential Deep Learning for Online Action Detection: This paper proposes the BEDL (Bayesian Evidential Deep Learning) framework. Using a Bayesian teacher and evidential student architecture, it achieves accurate and efficient inference as well as reliable uncertainty quantification in online action detection. It also designs an attention module based on Bayesian mutual information for active feature selection.
Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects: Based on the HANDS23 challenge (using the AssemblyHands and ARCTIC datasets), this study systematically benchmarks and deeply analyzes 3D pose estimation methods for egocentric hand-object interactions, revealing the effectiveness of distortion correction, high-capacity Transformers, and multi-view fusion, while highlighting unresolved challenges such as rapid motion, severe occlusion, and object reconstruction under narrow viewpoints.
Boosting 3D Single Object Tracking with 2D Matching Distillation and 3D Pre-training: This paper proposes a unified 3D single object tracking (SOT) framework that addresses the scarcity of point cloud data and the sparseness/incompleteness of LiDAR scans through 3D generative pre-training and matching knowledge distillation from a pre-trained 2D foundation tracker, achieving SOTA performance on KITTI, Waymo, and nuScenes.
Classification Matters: Improving Video Action Detection with Class-Specific Attention: Proposes a class-specific query (class queries) mechanism, which assigns an independent learnable query to each action class, allowing the model to dynamically attend to context regions relevant to each class, significantly improving classification performance in video action detection.
CrossGLG: LLM Guides One-Shot Skeleton-Based 3D Action Recognition in a Cross-Level Manner: This paper proposes the CrossGLG framework, which utilizes LLM-generated text descriptions to guide skeleton feature learning in a "global \(\to\) local \(\to\) global" manner, significantly outperforming competitors in one-shot 3D action recognition with only 2.8% of the parameter size of the SOTA model.
Data Collection-Free Masked Video Modeling: This paper proposes a Pseudo-Motion Generator (PMG) to recursively generate pseudo-motion videos from static images. Combined with Masked Video Modeling (VideoMAE) for self-supervised pre-training, it entirely eliminates the collection costs, privacy, and copyright concerns of real video data, and even enables effective video Transformer pre-training using only synthetic images.
DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video: This paper proposes DINO-Tracker, which combines the semantic features of pretrained DINOv2 with test-time single-video optimization. Through Delta-DINO residual fine-tuning and multi-source self-supervised losses, it achieves long-range dense point tracking. It reaches state-of-the-art (SOTA) performance among self-supervised methods and is comparable to supervised trackers, particularly outperforming existing methods by a wide margin in long-term occlusion scenarios.
Efficient Few-Shot Action Recognition via Multi-Level Post-Reasoning: EMP-Net proposes an efficient multi-level post-reasoning network. It reduces the domain alignment overhead of CLIP in few-shot action recognition by avoiding most gradient backpropagations through a post-reasoning mechanism. Meanwhile, it leverages multi-level representations (global, patch, and frame levels) to enhance feature discriminativeness, achieving an optimal balance between efficiency and performance.
EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere: Proposes EgoPoser to robustly estimate full-body poses from sparse and intermittent tracking signals of the head and hands from head-mounted devices. Through four core designs—global motion decomposition, realistic field-of-view modeling, SlowFast temporal fusion, and body-shape-aware pose optimization—it achieves state-of-the-art performance in large-scale real-world scenarios while running at over 600 fps.
Exploring the Feature Extraction and Relation Modeling For Light-Weight Transformer Tracking: This paper proposes FERMT (Feature Extraction and Relation Modeling Tracker). By decomposing the attention mechanism in a one-stream tracker into four functionally distinct sub-modules—shallow attentive feature extraction and deep attentive relation modeling—and introducing a dual attention unit for feature preprocessing, it outperforms leading real-time trackers by 5.6% in AO score on GOT-10k while achieving a 54% speedup on CPU.
FinePseudo: Improving Pseudo-Labelling through Temporal-Alignability for Semi-Supervised Fine-Grained Action Recognition: This paper proposes the FinePseudo framework, which utilizes metric learning based on temporal alignability to improve pseudo-label quality. It represents the first systematic approach to semi-supervised fine-grained action recognition, significantly outperforming existing methods on four fine-grained datasets.
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos: This work introduces the Goldfish framework, which achieves efficient understanding of arbitrarily long videos by segmenting them into short clips and utilizing a text-similarity-based retrieval mechanism to select the top-k segments most relevant to the question. It also presents the MiniGPT4-Video short video model and the TVQA-long benchmark for long video evaluation.
HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization: Proposes HAT—the first anchor-based Transformer framework that introduces long-term historical context in Online Temporal Action Localization (OnTAL). Through action anticipation-guided history compression and future-driven history refinement, it significantly outperforms OAT on procedural egocentric datasets (EGTEA/EK100) and achieves comparable or superior performance on standard datasets (THUMOS/MUSES).
IAM-VFI: Interpolate Any Motion for Video Frame Interpolation with Motion Complexity Map: The IAM-VFI framework is proposed, which introduces a Motion Complexity Map (MCM) to perceive the difficulty levels of local motion. By adaptively allocating computational resources and processing strategies to regions with varying complexities, it achieves robust video frame interpolation for arbitrary motion patterns.
Interleaving One-Class and Weakly-Supervised Models with Adaptive Thresholding for Unsupervised Video Anomaly Detection: A framework for unsupervised video anomaly detection is proposed, which interleaves the training of a weighted one-class classification (wOCC) model and a weakly-supervised (WS) model. It mitigates training fluctuations using soft labels and progressively optimizes the segmentation threshold via an adaptive thresholding strategy, achieving performance close to weakly-supervised methods without requiring any manual annotations.
LayeredFlow: A Real-World Benchmark for Non-Lambertian Multi-Layer Optical Flow: Proposes LayeredFlow—the first real-world non-Lambertian benchmark dataset containing multi-layer optical flow annotations (150k optical flow pairs, 185 scenes, 360 objects), while defining the multi-layer optical flow task, introducing a large-scale synthetic training dataset, and presenting a RAFT-based multi-layer optical flow baseline.
Learning Anomalies with Normality Prior for Unsupervised Video Anomaly Detection: This paper proposes LANP, an unsupervised video anomaly detection method based on a normality prior (where the beginning and ending segments of a video are typically normal events). The normalness of unlabeled segments is estimated through normality propagation, combined with a loss re-weighting strategy to mitigate the negative influence of mispropagated labels, achieving superior performance on ShanghaiTech and UCF-Crime compared to existing methods.
Leveraging Temporal Contextualization for Video Action Recognition: This paper proposes the TC-CLIP framework. By introducing a temporal contextualization (TC) mechanism, global video action cues are compressed into a small number of context tokens and injected into the CLIP encoding process. Additionally, a video-conditional prompting (VP) module is designed to inject visual information into the text branch. Under four settings—zero-shot, few-shot, base-to-novel, and fully supervised—TC-CLIP consistently outperforms existing CLIP-based video recognition methods.
Local All-Pair Correspondence for Point Tracking: This paper proposes LocoTrack, which achieves all-pair correspondence matching for any points in a video via a local 4D correlation volume. Combined with a lightweight correlation encoder and a length-generalizable Transformer, it obtains state-of-the-art accuracy across all TAP-Vid benchmarks while executing nearly 6 times faster than SOTA methods.
Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition: This paper proposes EVI-MAE, the first multimodal representation learning method that jointly models egocentric video and body-worn IMUs. Through self-supervised MAE pre-training, it learns cross-modal video-IMU alignment and utilizes a graph neural network to model cooperative movement relationships among multiple IMU devices. It achieves state-of-the-art (SOTA) performance on egocentric action recognition with outstanding robustness.
Motion-prior Contrast Maximization for Dense Continuous-Time Motion Estimation: This paper proposes a self-supervised method that integrates a non-linear motion prior (parametric trajectory function) into the contrast maximization framework for dense continuous-time motion estimation with event cameras, improving the zero-shot performance of synthetic-data pre-trained models by 29% on the real-world EVIMO2 dataset.
Occluded Gait Recognition with Mixture of Experts: An Action Detection Perspective: This paper revisits the problem of occluded gait recognition from the perspective of action detection, proposing the GaitMoE method. GaitMoE adaptively constructs action anchors through Mixture of Temporal Experts (MTE) and generates action proposals using Mixture of Action Experts (MAE). Trained end-to-end using only ID labels, it effectively handles various occlusion scenarios. Additionally, the first unified occluded gait dataset, OccGait, is constructed.
On the Utility of 3D Hand Poses for Action Recognition: This paper proposes HandFormer, a lightweight multimodal Transformer that combines densely sampled 3D hand poses (to capture fine-grained actions) with sparsely sampled RGB frames (to provide scene semantics). By efficiently modeling hand-object interactions through micro-action temporal decomposition and trajectory encoding, it achieves state-of-the-art (SOTA) performance on Assembly101 and H2O. Notably, the pose-only model outperforms existing skeleton-based methods with \(5\times\) fewer FLOPs.
OneTrack: Demystifying the Conflict Between Detection and Tracking in End-to-End 3D Trackers: This paper provides an in-depth analysis of the fundamental cause of the performance conflict between detection and tracking tasks in end-to-end 3D trackers—subtle differences in positive sample assignment lead to contradictory classification gradients. It proposes OneTrack, which leverages gradient coordination, query grouping, and attention masking to achieve conflict-free joint optimization of detection and tracking under a unified feature representation for the first time, achieving SOTA performance on nuScenes.
Online Temporal Action Localization with Memory-Augmented Transformer: This paper proposes MATR (Memory-Augmented Transformer), which models long-term context by selectively storing historical segment features in a memory queue, and employs a dual Transformer decoder to locate the end and start times of actions respectively. It achieves new state-of-the-art results on two online temporal action localization benchmarks, THUMOS14 and MUSES, even comparable to some offline methods.
Optimizing Factorized Encoder Models: Time and Memory Reduction for Scalable and Efficient Action Recognition: By freezing the spatial Transformers in the ViViT factorised encoder and introducing a rational temporal Transformer initialization strategy along with compact adapters, this paper significantly reduces training costs and memory consumption while preserving or even slightly improving accuracy, offering a more efficient action recognition training solution for resource-constrained researchers.
PiTe: Pixel-Temporal Alignment for Large Video-Language Model: The PiTe model is proposed to achieve spatiotemporal video-language alignment at the pixel level using object trajectories. The PiTe-143k dataset is constructed, and the method significantly outperforms existing approaches in zero-shot QA, temporal localization, and dense captioning tasks.
R²-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding: This paper proposes R²-Tuning, which appends a lightweight R² Block (only 1.5% of total parameters) recursively in a backward manner onto the last several layers of a frozen CLIP model. It enables query-modulated spatial pooling and coarse-to-fine temporal refinement, outperforming state-of-the-art (SOTA) methods that require additional temporal backbones on 6 VTG benchmarks across 3 tasks with only 2.7M parameters.
Referring Atomic Video Action Recognition: This paper proposes a new task, "Referring Atomic Video Action Recognition" (RAVAR), and the RefAVA dataset (containing 36,630 instances). It also introduces RefAtomNet, which fuses visual, textual, and location-semantic tri-stream tokens through cross-stream agent attention, improving mAP by 3.85%/3.17% over the best baseline BLIPv2.
Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data: Proposes the Retrieval from Counterfactually Augmented Data (RCAD) task and the Feint6K dataset, revealing that SOTA video-text models lag far behind humans in action semantic understanding (InternVideo 58.2% vs. human 95.2%). It also introduces the LLM-teacher method, which improves action embedding learning via LLM knowledge distillation.
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos: RGNet is proposed to deeply unify the two stages of long video temporal grounding—clip retrieval and temporal localization—into a single network. Through the sparse attention of the RG-Encoder and contrastive clip sampling, end-to-end optimization is achieved, yielding SOTA performance on MAD and Ego4D.
SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders: SA-DVAE introduces feature disentanglement to zero-shot skeleton-based action recognition for the first time. Using a dual-head VAE, it splits skeleton features into a semantic-related branch and a semantic-unrelated branch, aligning only the semantic-related part with text. Coupled with an adversarial total correlation penalty to enhance disentanglement, it achieves SOTA performance on NTU RGB+D 60/120 and PKU-MMD benchmarks.
SAFNet: Selective Alignment Fusion Network for Efficient HDR Imaging: SAFNet proposes a selective alignment fusion strategy that jointly refines raw valuable region masks and cross-exposure optical flows through a pyramid decoder. It explicitly fuses HDR images only after performing precise alignment in valuable regions, outperforming the state-of-the-art on Kalantari 17 and a self-built Challenge123 dataset while achieving an order of magnitude faster inference speed.
SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow: SEA-RAFT achieves SOTA accuracy while maintaining a simple architecture through three improvements: Mixture of Laplace (MoL) loss, direct regression of initial optical flow, and rigid-flow pre-training, achieving a speedup of over 2.3× compared to existing methods.
Self-Supervised Any-Point Tracking by Contrastive Random Walks: Proposes GMRW (Global Matching Random Walk) which combines a global matching Transformer architecture with a contrastive random walk self-supervised objective, achieving robust "Tracking Any Point" (TAP) performance without annotations for the first time, and designs label warping data augmentation to prevent the Transformer from learning shortcut solutions.
SemTrack: A Large-Scale Dataset for Semantic Tracking in the Wild: This paper proposes the SemTrack dataset and SemTracker method, expanding traditional object tracking from "locating where the target is" to "understanding what the target is doing"—tracking targets while capturing their semantic trajectories (who/what they interact with, when, where, and how they interact), and introducing a meta-learning strategy to address the challenges of long-tailed interaction categories.
SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking: SLAck proposes to uniformly fuse three cues—semantics, location, and appearance—during the early association stage of multi-object tracking. By learning implicit motion priors and cross-cue synergy through a lightweight Spatial-Temporal Object Graph (STOG), it avoids heuristic post-processing rules and significantly improves the tracking performance of novel categories on open-vocabulary MOT and TAO TETA benchmarks.
SPAMming Labels: Efficient Annotations for the Trackers of Tomorrow: Proposes the SPAM video annotation engine, which combines synthetic data pre-training, pseudo-label self-training, and graph-hierarchy active learning, generating Multiple Object Tracking (MOT) annotations close to ground-truth (GT) quality with only 3-20% of the manual annotation effort.
Spatio-Temporal Proximity-Aware Dual-Path Model for Panoramic Activity Recognition: This paper proposes SPDP-Net, which models social relationships among individuals through spatio-temporal proximity and utilizes a Dual-Path Transformer (DPATr) architecture to synergistically recognize multi-granular activities along two paths: individual-to-global and individual-to-social. It significantly outperforms previous SOTA models on the JRDB-PAR dataset with an overall F1 score of 46.5%.
Spherical World-Locking for Audio-Visual Localization in Egocentric Videos: This paper proposes the Spherical World-Locking (SWL) framework, which implicitly transforms multimodal perception streams into a world-locked spherical coordinate system to eliminate the challenges posed by self-motion, thereby achieving more precise audio-visual localization in egocentric videos.
Text-Guided Video Masked Autoencoder: A text-guided masking (TGM) strategy is proposed to mask salient video regions by utilizing natural language descriptions instead of motion priors, unifying MAE with video-text contrastive learning to achieve state-of-the-art relative performance on five action recognition datasets and one egocentric dataset.
TimeCraft: Navigate Weakly-Supervised Temporal Grounded Video Question Answering via Bi-directional Reasoning: This paper proposes a bi-directional reasoning framework, TimeCraft, to address the task of weakly-supervised temporal grounded video question answering (temporal grounded VQA). By establishing two symmetric reasoning paths (forward: temporal grounding \(\rightarrow\) answering; backward: answering \(\rightarrow\) temporal grounding) and employing cycle-consistency constraints to provide self-supervised signals, the model simultaneously localizes the video segments supporting the answer and yields the correct answer without requiring temporal annotations.
Towards Model-Agnostic Dataset Condensation by Heterogeneous Models: The Heterogeneous Model Dataset Condensation (HMDC) method is proposed. By simultaneously using two structurally different models (such as ConvNet and ViT) for dataset condensation, and designing a Gradient Balancing Module (GBM) and a Mutual Distillation (MD) mechanism, it generates condensed images that are universally applicable to various models, addressing the limitation where conventional methods overfit to a single model.
Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance: LoRAT introduces LoRA to visual object tracking for the first time. Through two LoRA-friendly designs—decoupled position encoding (shared spatial components + independent type embeddings) and a pure MLP detection head—it enables training a tracker with a ViT-g backbone using laboratory-level resources. It achieves a SUC of 0.762 on LaSOT (new SOTA), while the lightest variant, LoRAT-B-224, runs at 209 FPS.
UniINR: Event-guided Unified Rolling Shutter Correction, Deblurring, and Interpolation: This paper proposes the UniINR framework, which leverages a unified spatio-temporal implicit neural representation (INR) to simultaneously perform rolling shutter correction, deblurring, and arbitrary frame-rate video frame interpolation from a single rolling shutter blurred frame and paired event streams in a single pass.
Vamos: Versatile Action Models for Video Understanding: Proposes the Vamos framework, which uses Large Language Models as reasoners to flexibly unify visual embeddings and general text descriptions as video representations. It discovers that text-only representations consistently achieve competitive or even superior performance across multiple video understanding benchmarks. Furthermore, it designs a Token Bottleneck Model to achieve interpretable evidence selection and a 5x inference speedup.
VideoMamba: Spatio-Temporal Selective State Space Model: This paper proposes VideoMamba (KAIST version), a pure Mamba-based video recognition model. By designing a Spatio-Temporal Forward and Backward SSM, it effectively handles the complex interaction between non-sequential spatial information and sequential temporal information in videos, achieving competitive performance with Transformers while maintaining linear complexity.
VideoMamba: State Space Model for Efficient Video Understanding: This work innovatively adapts Mamba's selective state space model to the video domain, proposing VideoMamba, a pure SSM architecture. It achieves efficient spatiotemporal context modeling with linear complexity, demonstrating superior performance on both short and long video understanding tasks.
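
The two VideoMamba entries above both rely on a selective state space scan that runs in time linear in the sequence length, and the KAIST version adds forward and backward passes. The following is a minimal toy sketch of that idea, not either paper's implementation: it uses a scalar state, and the function names (`selective_scan`, `bidirectional_scan`) and parameters (`decay`, `w_delta`, `w_b`, `w_c`) are hypothetical choices made for illustration. Summing the two directions is likewise an assumption.

```python
import math


def softplus(z):
    # Numerically stable softplus; keeps the step size positive.
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)


def selective_scan(xs, decay=1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Toy scalar selective SSM: one pass over xs, so O(len(xs)) time.

    "Selective" here means the step size, input gate, and readout all
    depend on the current input (an illustrative simplification).
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)      # input-dependent step size
        a_bar = math.exp(-delta * decay)   # discretised transition, in (0, 1)
        b_bar = delta * (w_b * x)          # input-dependent input gate
        h = a_bar * h + b_bar * x          # state update (the recurrence)
        ys.append((w_c * x) * h)           # input-dependent readout
    return ys


def bidirectional_scan(xs, **kwargs):
    """Run the scan forward and backward, then sum the two outputs."""
    forward = selective_scan(xs, **kwargs)
    backward = selective_scan(xs[::-1], **kwargs)[::-1]
    return [f + b for f, b in zip(forward, backward)]


if __name__ == "__main__":
    print(bidirectional_scan([0.5, 1.0, -0.3, 2.0]))
```

The forward scan is causal (each output depends only on earlier inputs), which is why a backward pass is needed when the flattened spatial tokens of a video frame have no natural left-to-right order.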