Motion-prior Contrast Maximization for Dense Continuous-Time Motion Estimation¶

Conference: ECCV 2024
arXiv: 2407.10802
Code: GitHub
Area: Video Understanding
Keywords: Event Camera, contrast maximization, motion estimation, self-supervised learning, optical flow

TL;DR¶

This paper proposes a self-supervised method that integrates a non-linear motion prior (a parametric trajectory function) into the contrast maximization framework for dense continuous-time motion estimation with event cameras. On the real-world EVIMO2 dataset, it improves the zero-shot performance of models pre-trained on synthetic data by about 29% in TEPE.

Background & Motivation¶

Background: Optical flow estimation and point tracking are core tasks in low-level vision. Recently, frame-based methods have made significant progress relying on large-scale synthetic datasets. As a novel vision sensor, event cameras feature high speed, high dynamic range, and low power consumption, making them naturally suited for motion perception tasks.

Limitations of Prior Work: The event camera domain faces two main constraints: (a) large-scale labeled real data is scarce and event simulators are immature, so models trained on synthetic data suffer a significant sim-to-real gap; (b) existing self-supervised methods based on the contrast loss are limited to low-degree-of-freedom problems (e.g., feature tracking, ego-motion estimation) or to short-term (\(\le 0.1\)s) linear optical flow, and cannot handle long-term, complex non-linear motion.

Key Challenge: Long-term dense motion estimation requires handling non-linear trajectories, but the contrast maximization framework previously only supported linear motion models. Additionally, the association between events and trajectories is a high-dimensional assignment problem, which is difficult to solve efficiently and differentiably.

Goal: Estimate dense, non-linear motion from events over long horizons (\(\ge 0.3\)s) while narrowing the domain gap between synthetic and real data.

Key Insight: A two-stage strategy: supervised pre-training on synthetic data, followed by fine-tuning on real data using the proposed self-supervised contrast loss.

Core Idea: Introduce a continuous-time parametric trajectory prior (e.g., Bézier curves) into the contrast maximization framework, and efficiently and differentiably associate events with dense non-linear trajectories via KNN soft assignment and coarse-grained spatio-temporal displacement fields.

Method¶

Overall Architecture¶

The overall pipeline is divided into a prediction module and a loss module. Events within the time interval \([t_s, t_e]\) are first voxelized and fed into a neural network (BFlow architecture or UNet), which predicts continuous-time trajectory coefficients for each pixel location. The predicted trajectories and the original events are then passed to the self-supervised loss module, which works in three steps: a coarse-grained spatio-temporal displacement field is interpolated from the trajectories, each event is warped by the displacement it looks up in that field, and an IWE (Image of Warped Events) is built from the warped events. The gradient magnitude of the IWE drives the training loss.
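The voxelization step mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `voxelize_events`, the `(x, y, t, polarity)` event layout, and the linear split of each event between its two nearest temporal bins are all assumptions made for the example.

```python
import math

def voxelize_events(events, num_bins, height, width, t_start, t_end):
    """Accumulate events (x, y, t, polarity) into a [num_bins][height][width] grid.

    Assumption: each event's polarity is split linearly between its two
    nearest temporal bins. The paper's exact voxelization may differ.
    """
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    duration = t_end - t_start
    for x, y, t, polarity in events:
        if not (0 <= x < width and 0 <= y < height):
            continue  # drop events outside the sensor area
        # Map the timestamp to a continuous bin coordinate in [0, num_bins - 1].
        tau = (t - t_start) / duration * (num_bins - 1)
        tau = min(max(tau, 0.0), float(num_bins - 1))
        lower = int(math.floor(tau))
        upper = min(lower + 1, num_bins - 1)
        w_upper = tau - lower
        grid[lower][y][x] += polarity * (1.0 - w_upper)
        grid[upper][y][x] += polarity * w_upper
    return grid

# Usage: three events on a 2x4 sensor, binned into 3 temporal slices.
events = [(1, 0, 0.0, 1.0), (1, 0, 0.5, -1.0), (2, 1, 1.0, 1.0)]
grid = voxelize_events(events, num_bins=3, height=2, width=4, t_start=0.0, t_end=1.0)
```

The resulting grid is the kind of dense tensor a convolutional network can consume, while the raw events are kept aside for the loss module.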

Key Designs¶

Trajectory Motion Prior:
- Function: Predict a continuous-time trajectory \(\mathbf{q}_n(t)\) represented by a weighted combination of parametric basis functions for each pixel.
- Mechanism: Represent the trajectory as a weighted sum of basis functions \(\mathbf{q}_n(t) = \sum_{j=1}^{N_c} g_j(t) \mathbf{p}_{n,j}\), where \(g_j(t)\) is a shared temporal basis function, and \(\mathbf{p}_{n,j}\) are pixel-level control points. The paper explores polynomial bases \(g_j(t)=t^j\), Bézier curve bases \(g_j(t)=\binom{N_c}{j}(1-t)^{N_c-j}t^j\), and learnable bases.
- Design Motivation: The non-linear trajectory prior strikes a balance between motion versatility and regularization. Compared to linear optical flow models, it can handle long-term complex motion, while the smoothness of trajectories naturally provides temporal regularization to prevent event collapse.
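The basis-function formulation above can be made concrete with a short sketch. It evaluates the polynomial and Bézier bases exactly as written in the formulas (index \(j\) running from 1 to \(N_c\)) and forms the weighted sum with 2D control points. The function names are hypothetical, and reading \(\mathbf{q}_n(t)\) as a displacement relative to the starting pixel is an interpretation: it follows from both bases vanishing at \(t=0\) when \(j\) starts at 1.

```python
from math import comb

def polynomial_basis(t, num_coeffs):
    """g_j(t) = t^j for j = 1..num_coeffs."""
    return [t ** j for j in range(1, num_coeffs + 1)]

def bezier_basis(t, num_coeffs):
    """g_j(t) = C(N_c, j) * (1 - t)^(N_c - j) * t^j for j = 1..num_coeffs."""
    n = num_coeffs
    return [comb(n, j) * (1.0 - t) ** (n - j) * t ** j for j in range(1, n + 1)]

def evaluate_trajectory(basis_values, control_points):
    """q(t) = sum_j g_j(t) * p_j, with 2D control points p_j = (x, y)."""
    qx = sum(g * p[0] for g, p in zip(basis_values, control_points))
    qy = sum(g * p[1] for g, p in zip(basis_values, control_points))
    return (qx, qy)

# Usage: one pixel with three illustrative control points (made-up values).
control_points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
start = evaluate_trajectory(bezier_basis(0.0, 3), control_points)  # zero displacement
end = evaluate_trajectory(bezier_basis(1.0, 3), control_points)    # last control point
```

Because the temporal basis is shared across pixels, the network only has to predict the control points, and the trajectory can be queried at any continuous time \(t \in [0, 1]\).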
KNN Soft Assignment:
- Function: Efficiently and differentiably compute the displacement \(\Delta\mathbf{x}_k\) for each event.
- Mechanism: First relax the problem into interpolation on a coarse-grained spatio-temporal displacement field of shape \([N_\text{bins}, h/4, w/4]\). For each temporal bin center, KNN search is executed using the KeOps symbolic matrix framework to find the \(N_\text{traj}\) nearest neighbor trajectories, computing the displacement as the average displacement of neighbor trajectories: \(\Delta\mathbf{x}_k = \frac{1}{N_\text{traj}} \sum_{n=1}^{N_\text{traj}} [\mathbf{q}_n(t_\text{ref}) - \mathbf{q}_n(t_k)]\).
- Design Motivation: Directly computing associations for all event-trajectory pairs is computationally infeasible (the number of events can reach tens of millions). Through coarse-grained lookup tables and KNN interpolation, the problem is dimensionally reduced from 3D to 2D, which saves memory while maintaining differentiability.
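The averaging formula for \(\Delta\mathbf{x}_k\) can be illustrated with a brute-force sketch. It is deliberately simplified: the paper performs the neighbor search with KeOps on a coarse grid, whereas this example uses a plain sorted search in pure Python, and `linear_trajectory` is a hypothetical stand-in for the predicted trajectories \(\mathbf{q}_n(t)\). Differentiability is not demonstrated here; an autodiff framework would be needed for that.

```python
def linear_trajectory(start, velocity):
    """Hypothetical stand-in for a predicted trajectory q_n(t) (absolute position)."""
    def q(t):
        return (start[0] + velocity[0] * t, start[1] + velocity[1] * t)
    return q

def knn_displacement(query_xy, t_query, t_ref, trajectories, num_neighbors):
    """Average q_n(t_ref) - q_n(t_query) over the trajectories nearest to query_xy."""
    positions = [q(t_query) for q in trajectories]

    def squared_distance(n):
        return (positions[n][0] - query_xy[0]) ** 2 + (positions[n][1] - query_xy[1]) ** 2

    # Brute-force K-nearest-neighbor search at the query time.
    nearest = sorted(range(len(trajectories)), key=squared_distance)[:num_neighbors]
    count = len(nearest)
    dx = sum(trajectories[n](t_ref)[0] - positions[n][0] for n in nearest) / count
    dy = sum(trajectories[n](t_ref)[1] - positions[n][1] for n in nearest) / count
    return (dx, dy)

# Usage: two trajectories pass near the query at t = 1, one is far away.
trajectories = [
    linear_trajectory((-2.0, 0.0), (2.0, 0.0)),
    linear_trajectory((-3.0, 0.0), (4.0, 0.0)),
    linear_trajectory((100.0, 0.0), (-50.0, 0.0)),
]
delta = knn_displacement((0.0, 0.0), t_query=1.0, t_ref=0.0,
                         trajectories=trajectories, num_neighbors=2)
```

In this toy case the far trajectory is ignored and the displacement is the mean of the two nearby ones, which is the behavior the soft assignment relies on.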
Random Reference Time + Regularization:
- Function: Randomly sample reference times \(t_\text{ref} \sim \mathcal{U}(0,1)\) during training and impose spatial smoothness regularization.
- Mechanism: The training loss is \(\mathcal{L} = 1/G + \lambda R\), where \(G\) is the IWE gradient magnitude and \(R\) is the L1 norm of the spatial gradient of the displacement field. A different reference time is randomly selected at each iteration.
- Design Motivation: (a) Random reference times require the IWE to be sharp at any moment, enhancing the regularization effect, with memory overhead growing only linearly (vs. previous methods using a fixed set of multiple reference times); (b) the spatial smoothness term encourages consistency of neighboring trajectories, further suppressing event collapse.
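The loss \(\mathcal{L} = 1/G + \lambda R\) can be sketched in a few lines. Several details are assumptions made for illustration only: events are voted into the IWE at the nearest pixel (a bilinear vote would be needed for gradients to flow), \(G\) is taken as the mean squared forward-difference gradient, \(R\) as the mean absolute forward difference of the displacement field, and `eps` is added to avoid division by zero. All function names are hypothetical.

```python
import random

def build_iwe(events, displacements, height, width):
    """Accumulate warped events (x + dx, y + dy) into an image by nearest-pixel voting."""
    iwe = [[0.0] * width for _ in range(height)]
    for (x, y), (dx, dy) in zip(events, displacements):
        u, v = int(round(x + dx)), int(round(y + dy))
        if 0 <= u < width and 0 <= v < height:
            iwe[v][u] += 1.0
    return iwe

def gradient_magnitude(image):
    """Mean squared forward-difference gradient; an assumed stand-in for G."""
    h, w = len(image), len(image[0])
    total = 0.0
    for r in range(h):
        for c in range(w):
            gx = image[r][c + 1] - image[r][c] if c + 1 < w else 0.0
            gy = image[r + 1][c] - image[r][c] if r + 1 < h else 0.0
            total += gx * gx + gy * gy
    return total / (h * w)

def smoothness_l1(field):
    """Mean absolute forward difference of a [h][w] field of (dx, dy); stand-in for R."""
    h, w = len(field), len(field[0])
    total = 0.0
    for r in range(h):
        for c in range(w):
            for k in range(2):  # k = 0: dx component, k = 1: dy component
                if c + 1 < w:
                    total += abs(field[r][c + 1][k] - field[r][c][k])
                if r + 1 < h:
                    total += abs(field[r + 1][c][k] - field[r][c][k])
    return total / (h * w)

def contrast_loss(iwe, field, lam=0.003, eps=1e-9):
    """L = 1 / G + lambda * R (eps is an added safeguard, not from the source)."""
    return 1.0 / (gradient_magnitude(iwe) + eps) + lam * smoothness_l1(field)

# A new reference time is drawn at every training iteration.
t_ref = random.uniform(0.0, 1.0)

# Usage: four events on a line; the correct warp stacks them on one pixel.
events = [(0, 1), (1, 1), (2, 1), (3, 1)]
flat_field = [[(0.0, 0.0)] * 5 for _ in range(3)]
sharp = build_iwe(events, [(2, 0), (1, 0), (0, 0), (-1, 0)], height=3, width=5)
blurry = build_iwe(events, [(0, 0)] * 4, height=3, width=5)
```

With these toy inputs the well-aligned IWE has the larger gradient magnitude and therefore the smaller loss, which is the signal contrast maximization exploits.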

Loss & Training¶

Pre-training: Supervised pre-training (L1 loss) on the MultiFlow synthetic dataset for 50 epochs with a batch size of 10.
Self-supervised fine-tuning: Fine-tuned on EVIMO2 using contrast loss for 15 epochs, with batch size 6, learning rate \(10^{-4}\), using the AdamW optimizer.
Spatial smoothness weight \(\lambda=0.003\), number of KNN neighbors \(N_\text{traj}=32\).
For the optical flow task, a UNet architecture was additionally trained on DSEC, demonstrating the versatility of the method.

Key Experimental Results¶

Main Results¶

EVIMO2 Continuous Optical Flow Dataset:

Method	TEPE ↓	TAE ↓	%Out ↓
Paredes et al. (Self-supervised)	21.69	51.91	0.634
E-RAFT (Linear)	19.38	74.52	0.656
BFlow (Zero-shot/OOD)	8.63	19.94	0.363
Ours (Self-supervised FT)	6.14	16.98	0.254
BFlow (In-domain Supervised)	3.38	11.68	0.166

DSEC Optical Flow Benchmark (Comparison of self-supervised methods):

Method	Inference Time (ms)	EPE ↓	AE ↓	%Out ↓
E-RAFT (Supervised)	46.3	0.788	10.56	2.684
Ours (Self-supervised)	~9	Best	AE Gain 19%	%Out Gain 14%

Ablation Study¶

Impact of Different Motion Priors (EVIMO2):

Motion Prior	TEPE ↓	TAE ↓	%Out ↓
Polynomial Basis (SSL)	6.78	19.76	0.272
Learnable Basis (SSL)	7.46	19.78	0.284
Bézier Curve (SSL)	6.14	16.98	0.254
Polynomial Basis (OOD)	9.36	20.88	0.363
Bézier Curve (OOD)	8.63	19.94	0.363

Key Findings¶

Self-supervised fine-tuning improves zero-shot performance by approximately 29% (TEPE decreased from 8.63 to 6.14), effectively bridging the sim-to-real gap.
The Bézier curve performs the best among the three motion priors (\(N_c=10\)).
For long-term prediction (>200ms), the improvement from self-supervised fine-tuning is particularly significant.
The inference speed on DSEC is 5 times faster than previous self-supervised methods.
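The 29% figure can be checked directly from the EVIMO2 table. The short sketch below recomputes the relative error reduction between the zero-shot BFlow row and the self-supervised fine-tuned row; the helper name `relative_reduction` is hypothetical, and the numbers are copied from the table above.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` lowers an error metric relative to `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# Values from the EVIMO2 table: BFlow (Zero-shot/OOD) vs. Ours (Self-supervised FT).
zero_shot = {"TEPE": 8.63, "TAE": 19.94, "%Out": 0.363}
fine_tuned = {"TEPE": 6.14, "TAE": 16.98, "%Out": 0.254}

reductions = {
    metric: relative_reduction(zero_shot[metric], fine_tuned[metric])
    for metric in zero_shot
}
for metric, value in reductions.items():
    print(f"{metric}: {value:.1f}% lower after fine-tuning")
```

The TEPE reduction rounds to 29%, matching the headline claim. The other two metrics improve by different amounts, so the 29% figure should be read as a TEPE-specific number.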

Highlights & Insights¶

Integrating motion priors into the CM framework is an elegant design: non-linear trajectories adapt to complex motion, while their smoothness naturally provides regularization that helps suppress event collapse.
The KNN scheme implemented using the KeOps symbolic matrix framework resolves the scalability issues of large-scale event-trajectory association, representing a practical engineering contribution.
The construction of the EVIMO2 continuous optical flow dataset is itself a contribution, providing a new evaluation benchmark for long-term motion estimation with event cameras.

Limitations & Future Work¶

A significant gap still exists between self-supervised methods and in-domain supervised methods (TEPE 6.14 vs. 3.38), indicating that the information provided by the self-supervised loss signal is limited.
Handling of occlusions and independently moving objects may not be robust enough.
KeOps KNN computation may still become a bottleneck at very large resolutions.
More advanced trajectory representations (such as B-splines and neural implicit representations) have not yet been explored.

vs. BFlow [27]: BFlow relies on supervised training with synthetic data and suffers from a sim-to-real gap on real data. This paper significantly narrows this gap through self-supervised fine-tuning.
vs. Shiba et al. [57]: Limited to linear motion models and fixed reference times, whereas this paper extends to non-linear trajectories and random reference times.
vs. Paredes et al. [47]: Uses piecewise linear optical flow stitching to approximate non-linear motion, yielding far inferior performance compared to explicit non-linear trajectory modeling.

Rating¶

Novelty: ⭐⭐⭐⭐ The idea of introducing motion priors into the CM framework is novel, and the KNN association scheme is practical.
Experimental Thoroughness: ⭐⭐⭐⭐ Validated on two tasks (trajectory estimation, optical flow) and multiple datasets, with comprehensive ablation studies.
Writing Quality: ⭐⭐⭐⭐ Clear logic, complete formula derivations, and rich figures and tables.
Value: ⭐⭐⭐⭐ Provides an effective tool for mitigating the sim-to-real gap in event camera motion estimation, offering practical value to the event camera community.