ETAP: Event-based Tracking of Any Point

Conference: CVPR 2025
arXiv: 2412.00133
Code: https://github.com/tub-rip/ETAP
Area: Video Understanding
Keywords: Event-based Camera, Tracking Any Point, Contrastive Learning, Feature Alignment, Motion Robustness

TL;DR

This paper proposes ETAP, the first tracking any point (TAP) approach designed purely for event cameras. It addresses the motion dependency inherent in event data with a novel feature-alignment contrastive loss. Trained on the newly constructed synthetic dataset EventKubric, ETAP clearly outperforms baseline methods, improving the AJ metric by up to 136% over the E2Vid+CoTracker baseline (on EventKubric).

Background & Motivation

Background: Tracking Any Point (TAP) marks a recent paradigm shift in motion estimation, from tracking a few salient keypoints to tracking arbitrary points. Representative approaches such as CoTracker and TAPIR perform very well in benign, standard scenarios.

Limitations of Prior Work: Existing TAP methods rely exclusively on conventional frame-based cameras, which are severely constrained under extreme lighting or high-speed motion. Limited frame rates, motion blur, and saturation artifacts lead to aliasing and tracker drift, a critical bottleneck in real-world deployments such as robotic perception.

Key Challenge: Event cameras, with their microsecond-level temporal resolution and high dynamic range (HDR), are inherently well-suited for high-speed tracking. However, event data poses a fundamental challenge: feature appearances are highly dependent on the direction of scene motion. The event data generated by the exact same scene under different motion directions is substantially different, rendering correlation-based tracking methods difficult to apply directly. Furthermore, training data poses another challenge: existing synthetic event datasets are overly simplistic, employing only 2D planar motion, which leads to poor generalization in the real world.

Goal: (1) Develop the first event-based TAP method; (2) Resolve the motion-dependency issue of event features; (3) Construct a high-quality synthetic training dataset.

Key Insight: The authors exploit how event data behaves under time reversal: reversing time changes the motion direction but preserves the scene structure. They use this property to design a contrastive loss that pushes the encoder toward motion-invariant features.

Core Idea: Pairs of data with different motion directions for the same scene are generated via time reversal, and a contrastive loss is used to constrain the feature descriptors of corresponding points to remain consistent under different motions, thereby learning motion-robust correlation features.

Method

Overall Architecture

The overall pipeline of ETAP is as follows: Input Event Data Stream → Convert to Event Stack (image-like tensor) → Extract spatial features via a Multi-Scale Feature Encoder → Update point positions and descriptors using a Transformer-based Iterative Optimization Module → Output trajectories, visibility flags, and descriptors for each point. During training, a time-reversed and rotated variant of the data is additionally generated to compute the feature alignment loss.
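The training-time variant mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the `(x, y, t, p)` array layout, and the polarity flip under time reversal (a brightness increase played backwards becomes a decrease) are assumptions made here, and the rotation convention simply follows `np.rot90`.

```python
import numpy as np

def reverse_and_rotate(x, y, t, p, width, height, k):
    """Time-reverse an event window and rotate it by k * 90 degrees.

    x, y: integer pixel coordinates; t: timestamps; p: polarities in {-1, +1}.
    Returns the transformed events plus the new sensor width and height.
    """
    # Time reversal: mirror timestamps inside the window. Flipping the
    # polarity is an assumption of this sketch (see the lead-in).
    t_rev = t.min() + t.max() - t
    p_rev = -p
    # Rotate coordinates counter-clockwise, matching np.rot90 on an image.
    w, h = width, height
    for _ in range(k % 4):
        x, y = y, w - 1 - x
        w, h = h, w
    # Re-sort so that timestamps are ascending again.
    order = np.argsort(t_rev, kind="stable")
    return x[order], y[order], t_rev[order], p_rev[order], w, h
```

The same coordinate transform would have to be applied to the ground-truth tracks so that corresponding points can still be matched between the original and the variant sequence.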

Key Designs

Event Stack Representation and Multi-scale Feature Encoding:
- Function: Convert asynchronous and sparse event data into a regular grid representation compatible with CNNs.
- Mechanism: A Mixed-Density event stack is adopted, where the \(N_e\) events preceding each timestep are hierarchically binned into \(C=10\) channels. Channel \(h_c\) (\(c = 1, \dots, C\)) aggregates \(N_e/2^{c-1}\) events, so the channels range from a broad temporal context down to the finest, most recent slice. A feature encoder \(\phi_\lambda\) extracts \(d\)-dimensional feature maps at 4 scales, which are used to initialize point descriptors and compute correlation features.
- Design Motivation: The hierarchical temporal binning preserves the fine-grained temporal information of recent events while still covering a broader temporal context; in the ablation studies it slightly outperforms a plain voxel grid.
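The binning rule can be illustrated with a short NumPy sketch. It assumes events are sorted by time, that each channel takes the most recent \(N_e/2^{c-1}\) events, and that polarities are accumulated as a signed sum; these details and the function name are illustrative assumptions rather than the paper's exact definition.

```python
import numpy as np

def mixed_density_stack(x, y, p, height, width, num_channels=10):
    """Build a (C, H, W) event stack from N_e time-sorted events.

    Channel index c (0-based here) accumulates the most recent
    N_e // 2**c events, i.e. N_e / 2**(c-1) for the 1-based c of the text.
    """
    p = np.asarray(p, dtype=np.float32)
    n_events = len(x)
    stack = np.zeros((num_channels, height, width), dtype=np.float32)
    for c in range(num_channels):
        n_c = n_events // (2 ** c)
        if n_c == 0:
            break  # too few events to fill the remaining channels
        recent = slice(n_events - n_c, n_events)
        # Unbuffered add so repeated pixels accumulate correctly.
        np.add.at(stack[c], (y[recent], x[recent]), p[recent])
    return stack
```

With eight positive events at one pixel and four channels, the per-channel sums at that pixel are 8, 4, 2, and 1, which shows how later channels focus on ever more recent events.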
Feature Alignment Loss (FA-loss):
- Function: Enforce the feature encoder to learn motion-invariant descriptors.
- Mechanism: For each training sample, a variant is generated by time reversal plus a random rotation (\(\theta \in \{0°, 90°, 180°, 270°\}\)), which keeps the scene structure intact while changing the motion direction. The corresponding point descriptors \(d_{t}^{s,i}\) and \(\tilde{d}_{t}^{s,i}\) are extracted from the original and variant sequences and aligned by penalizing the squared cosine distance between them: \(\mathcal{L}_{fa} = \sum_t \frac{1}{|\mathcal{P}_t|} \sum_{i,s} \left(1 - \langle u(d_{t}^{s,i}), u(\tilde{d}_{t}^{s,i}) \rangle\right)^2\), where \(u(\cdot)\) denotes normalization to unit length and \(\mathcal{P}_t\) is the set of points entering the loss at timestep \(t\). Although the events themselves differ under time reversal, their trigger conditions are equivalent (this follows from the linear event generation model), so the descriptors of corresponding points should in theory be identical.
- Design Motivation: This targets the core challenge of pure event tracking: motion-direction dependency causing correlation features to degrade over time. The contrastive loss provides an explicit motion-invariance constraint, which experiments demonstrate reduces the gap between keypoint inter-cluster and intra-cluster similarity from 0.38 to 0.067.
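The FA-loss formula can be written out directly. The sketch below is a NumPy rendering under assumed tensor shapes: descriptors are stored as `(T, S, N, D)` arrays (timesteps, scales, points, channels) and a boolean mask marks which points belong to \(\mathcal{P}_t\). The shapes, the mask, and the `eps` guard are assumptions for illustration.

```python
import numpy as np

def fa_loss(desc, desc_var, valid, eps=1e-8):
    """Feature-alignment loss between original and variant descriptors.

    desc, desc_var: arrays of shape (T, S, N, D).
    valid: boolean array of shape (T, N), True where point i is in P_t.
    """
    # u(.): normalize every descriptor to unit length.
    u = desc / (np.linalg.norm(desc, axis=-1, keepdims=True) + eps)
    u_var = desc_var / (np.linalg.norm(desc_var, axis=-1, keepdims=True) + eps)
    cos = np.sum(u * u_var, axis=-1)              # (T, S, N)
    err = (1.0 - cos) ** 2 * valid[:, None, :]    # drop points outside P_t
    # Sum over points and scales, divide by |P_t|, then sum over time.
    per_t = err.sum(axis=(1, 2)) / np.maximum(valid.sum(axis=1), 1)
    return per_t.sum()
```

Because only the direction of each descriptor matters, rescaling one set of descriptors leaves the loss unchanged, while exactly opposite descriptors give the maximum per-pair penalty of 4.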
Transformer Iterative Refinement Tracker:
- Function: Track multiple points in parallel, iteratively updating positions and descriptors.
- Mechanism: Following the CoTracker architecture, a token \(\mathcal{O}_t^{s,i,m}\) is constructed for each point at each timestep, containing displacement, visibility, descriptor, correlation features, and positional encoding. Cross-point attention (across points) and temporal attention (across time) alternate over \(M=4\) iterative refinement steps. Correlation features are the inner products between a point's descriptors and the feature-map vectors in a neighborhood around its position, giving \(49 \times 4 = 196\) values (presumably 49 neighbors at each of the 4 scales).
- Design Motivation: Parallel multi-point tracking exploits spatial relationships among points (e.g., rigid body constraints), and the alternating attention scheme captures spatio-temporal dependencies without sacrificing efficiency.
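To make the 196-dimensional correlation vector concrete, here is a rough sketch. It assumes the 49 values per scale come from a 7×7 window, that scale \(s\) is downsampled by \(2^s\), and it uses nearest-neighbor lookup with border clamping in place of whatever interpolation the actual model uses. All of these, including the function name, are assumptions.

```python
import numpy as np

def correlation_features(descs, fmaps, xy, radius=3):
    """Inner products between per-scale descriptors and nearby features.

    descs: list of S descriptors, each of shape (D,).
    fmaps: list of S feature maps, each of shape (D, H_s, W_s).
    xy: (x, y) point position in full-resolution pixels.
    Returns S * (2 * radius + 1) ** 2 values (196 for S=4, radius=3).
    """
    feats = []
    for s, (d, fmap) in enumerate(zip(descs, fmaps)):
        _, h, w = fmap.shape
        # Assumed pyramid: scale s is downsampled by a factor of 2**s.
        cx = int(round(xy[0] / 2 ** s))
        cy = int(round(xy[1] / 2 ** s))
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                yy = min(max(cy + dy, 0), h - 1)  # clamp at the border
                xx = min(max(cx + dx, 0), w - 1)
                feats.append(float(d @ fmap[:, yy, xx]))
    return np.asarray(feats)
```

A motion-invariant descriptor matters here because these inner products are only informative if the same scene point produces similar features regardless of motion direction.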

Loss & Training

The total loss is formulated as \(\mathcal{L} = 0.1 \mathcal{L}_{tp} + \mathcal{L}_{vis} + 0.1 \mathcal{L}_{fa}\), where \(\mathcal{L}_{tp}\) is the trajectory prediction error (absolute distance) and \(\mathcal{L}_{vis}\) is the cross-entropy loss for visibility. Training follows a two-stage strategy: the first \(10^5\) steps optimize only trajectory and visibility losses, followed by another \(1.2 \times 10^5\) steps where the FA-loss is introduced. Training data is sourced from the EventKubric dataset (10,173 samples), generated via a three-step pipeline: Kubric rendering + FILM frame upsampling + ESIM event simulation.
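The weighting and the two-stage schedule can be summarized in a few lines. This is a sketch of the stated formula only: the function name is hypothetical, and whether the FA term is switched on exactly at step \(10^5\) or one step later is an assumption.

```python
def total_loss(l_tp, l_vis, l_fa, step, fa_start_step=100_000):
    """Combine the three loss terms with the weights given in the text.

    Stage one (step < fa_start_step) uses only trajectory and visibility
    losses; afterwards the feature-alignment term is added.
    """
    loss = 0.1 * l_tp + l_vis
    if step >= fa_start_step:
        loss = loss + 0.1 * l_fa
    return loss
```

For example, with `l_tp=1.0`, `l_vis=2.0`, and `l_fa=3.0`, the loss is 2.1 during the first stage and 2.4 once the FA-loss is active.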

Key Experimental Results

Main Results

Task / Dataset	Metric	ETAP	E2Vid+CoTracker	Gain
TAP / EventKubric	AJ	0.539	0.229	+136%
TAP / EventKubric	\(\delta_{avg}^x\)	0.668	0.328	+104%
TAP / E2D2 (fidget spinner)	AJ	0.389	0.179	+117%
Feature Tracking / EDS	Feature Age	0.701	-	-
Feature Tracking / EDS	Expected FA	0.610	-	-

Method	Input	EDS FA↑	EDS EFA↑	EC FA↑	EC EFA↑
ETAP (Ours)	E	0.701	0.610	0.891	0.886
FE-TAP	E+F	0.676	0.589	0.844	0.838
DDFT	E+F	0.576	0.472	0.825	0.818
HASTE	E	0.096	0.063	0.442	0.427

Ablation Study

Configuration	EDS FA↑	EDS EFA↑	EC FA↑	Description
ETAP Full Model	0.701	0.610	0.891	Optimal combination of all design decisions
w/o FA-loss	0.686	0.593	0.887	Removing the FA-loss lowers EDS FA by 2.1%
Low resolution 256×256	0.598	0.500	0.780	Resolution has the largest impact
High resolution 512×512	0.659	0.561	0.808	About 10% higher EDS FA than at 256×256
MOVi-F Baseline Data	0.598	0.500	0.780	EventKubric reduces domain gap yielding 8% gain over MOVi-F

Key Findings

Resolution is the most influential factor; scaling up from 256 to 512 yields an approximate 10% improvement in FA.
Sampling the contrast threshold from \(\mathcal{U}(0.16, 0.34)\) instead of using a fixed value improves results by approximately 5%.
EventKubric improves performance by 8% over the pre-rendered MOVi-F dataset, validating the importance of high-quality synthetic data.
FA-loss yields a 2.1% gain on EDS, and feature independence experiments show it effectively minimizes the feature gap between different motion directions.
ETAP is the first pure event-based method to outperform combined event+frame approaches on the Feature Tracking benchmark.
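The contrast-threshold finding refers to the event simulation step. As a rough illustration of what randomizing the threshold means, the sketch below uses the standard idealized event model (an event fires each time the log intensity changes by the threshold). It is not the ESIM implementation: drawing one threshold per call, ignoring noise and refractory effects, and the function name are all assumptions.

```python
import numpy as np

def simulate_events(log_prev, log_curr, rng, c_low=0.16, c_high=0.34):
    """Signed per-pixel event counts between two log-intensity frames.

    A single contrast threshold is drawn from U(c_low, c_high), so
    repeated calls see different sensor sensitivities.
    """
    c = rng.uniform(c_low, c_high)
    delta = log_curr - log_prev
    # Idealized model: one event per full threshold crossing.
    n_events = np.floor(np.abs(delta) / c).astype(int)
    polarity = np.sign(delta).astype(int)
    return c, n_events * polarity
```

A lower threshold yields more events for the same brightness change, so training across a range of thresholds exposes the model to both dense and sparse event streams.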

Highlights & Insights

Using time reversal to generate training pairs is elegant. The properties of the event generation model yield motion-variant pairs without extra data or annotations, and the approach rests on a solid physical and theoretical foundation. The idea may carry over to other motion-dependent sensor data.
Systematic Data Engineering: Beyond constructing a new synthetic data pipeline, the authors conduct careful ablations on every design decision (resolution, frame rate, threshold, scene dynamics), demonstrating how to advance model performance through rigorous data engineering.
Cross-Modal Outperformance: The pure event-based method outperforms a frame+event fusion method (FE-TAP) on the feature tracking benchmark, underscoring the advantages of event cameras in high-speed tracking scenarios.

Limitations & Future Work

Event cameras currently provide only monochromatic information, so color cues cannot be used to establish appearance correspondences.
Features initialized during periods without motion (and therefore without events) are of poor quality, an inherent limitation of event data.
While EventKubric is more realistic than prior synthetic datasets, a sim-to-real gap remains.
Potential improvement directions: incorporating sparse frame information for feature initialization, or re-initializing features once motion is detected.

vs CoTracker: CoTracker is a state-of-the-art frame-based TAP approach. ETAP employs a similar Transformer architecture but adapts it to event-based inputs. The two perform comparably on benign sequences, while ETAP excels under high-speed and HDR conditions.
vs DDFT: DDFT was previously the strongest event feature tracking approach, but it is trained on simple 2D planar motion and requires self-supervised fine-tuning. ETAP significantly outperforms it (by 19%) using more realistic 3D data and the FA-loss.
vs FE-TAP: FE-TAP integrates frames and events for correlation-based tracking but inherits standard frame-based limitations in high-speed scenarios. In contrast, ETAP's pure event-based approach proves to be more robust.

Rating

Novelty: ⭐⭐⭐⭐ First event-based TAP method; the FA-loss design is novel and theoretically grounded.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 5 datasets, 2 tasks, 8 tables, exhaustive ablations.
Writing Quality: ⭐⭐⭐⭐ Clear logic, rigorous mathematical derivations.
Value: ⭐⭐⭐⭐ Fills a void in event-based TAP, holding substantial practical value for high-speed robotic perception.