Event6D: Event-based Novel Object 6D Pose Tracking

Conference: CVPR 2026
arXiv: 2603.28045
Code: https://chohoonhee.github.io/Event6D
Area: Video Understanding
Keywords: Event Camera, 6D Pose Tracking, Novel Object Generalization, Bimodal Reconstruction, Sim-to-Real Transfer

TL;DR

Event6D proposes an event-depth fusion framework for 6D pose tracking that bridges the temporal gap between high-rate event streams and low-frame-rate depth sensors by reconstructing intensity and depth images at arbitrary timestamps, achieving robust tracking of unseen objects at 120+ FPS while being trained exclusively on synthetic data.

Background & Motivation

Event cameras provide microsecond-level latency, making them well-suited for 6D object pose tracking in fast dynamic scenes where conventional RGB-D approaches suffer from motion blur and large inter-frame pixel displacements. However, the sparse asynchronous output of event cameras is incompatible with standard pose estimation pipelines, and existing event-based 6D pose datasets are small in scale with limited motion diversity.

Core Challenge: Depth frame rates are typically far lower than the temporal resolution of the event stream, creating a temporal gap between them. Dense photometric and geometric information must be interpolated between depth frames.

Method

Overall Architecture

The inputs are an event stream and low-frame-rate depth maps. A dual reconstruction module recovers intensity and depth images at arbitrary timestamps, yielding dense photometric and geometric cues that feed a render-and-compare 6D pose tracker.
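The tracking loop above can be sketched as follows. This is a minimal illustrative skeleton, not the authors' code: the module names (`reconstruct_intensity`, `reconstruct_depth`, `render_and_compare`) are placeholders standing in for the paper's learned components, and the reconstructions here are identity stubs.

```python
import numpy as np

# Hypothetical stand-ins for the paper's learned modules (names are illustrative).
def reconstruct_intensity(events, last_depth, t):
    """Dense intensity image at timestamp t, conditioned on the latest depth."""
    return np.zeros_like(last_depth)

def reconstruct_depth(events, last_depth, t):
    """Dense depth map at timestamp t, conditioned on the latest depth."""
    return last_depth.copy()

def render_and_compare(intensity, depth, pose):
    """One render-and-compare refinement step; here an identity placeholder."""
    return pose

def track(event_stream, depth_frames, depth_times, query_times, init_pose):
    """Track a 6-DoF pose (4x4 matrix) at arbitrary query timestamps.

    For each query time, find the most recent depth frame, gather the events
    since that frame, reconstruct both modalities, then refine the pose.
    """
    pose = init_pose
    poses = []
    for t in query_times:
        i = max(np.searchsorted(depth_times, t, side="right") - 1, 0)
        last_depth = depth_frames[i]
        # Events are (x, y, timestamp, polarity) tuples in this sketch.
        ev = [e for e in event_stream if depth_times[i] <= e[2] <= t]
        intensity = reconstruct_intensity(ev, last_depth, t)
        depth = reconstruct_depth(ev, last_depth, t)
        pose = render_and_compare(intensity, depth, pose)
        poses.append(pose)
    return poses
```

The key point is that `query_times` is decoupled from `depth_times`: the pose can be updated at the event stream's temporal resolution rather than the depth sensor's.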

Key Designs

  1. Bimodal Reconstruction (Intensity + Depth):

    • Function: Recover dense intensity and depth maps at arbitrary timestamps from sparse event streams.
    • Mechanism: Conditioned on the most recent depth measurement, the method leverages the temporal information in the event stream to reconstruct both modalities. Intensity reconstruction recovers scene appearance from brightness changes encoded in events; depth reconstruction infers geometric changes from motion information in events. Both reconstructions operate within a shared feature space.
    • Design Motivation: To fill the temporal gap between depth frames so that tracking can operate at the temporal resolution of the event stream.
  2. Large-Scale Synthetic Benchmark Suite:

    • Function: Provide large-scale event + depth + pose annotation data for training and evaluation.
    • Mechanism: Three benchmark components are constructed: (1) EventBlender6D — a large-scale synthetic training set (495,840 samples, 1,033 objects); (2) a simulated evaluation set; (3) a real-event evaluation set. Synthetic data encompasses diverse motion patterns and object appearances.
    • Design Motivation: Existing event-based 6D pose datasets are too small (e.g., YCB-Ev contains only 21 objects) to support training for novel object generalization.
  3. Novel Object Generalization:

    • Function: Track previously unseen objects without object-specific training.
    • Mechanism: The model is trained solely on synthetic data, learning generalizable tracking capabilities from a sufficiently diverse set of 1,033 objects and motion patterns. At test time, it generalizes directly to unseen real-world objects without fine-tuning.
    • Design Motivation: Retraining the model for every new object is impractical in real-world deployments.
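To make the depth-conditioned reconstruction in Design 1 concrete, here is a sketch of one common way to prepare the network input: accumulate events into a spatio-temporal voxel grid and stack it with the most recent depth frame. The voxel-grid encoding is a standard event representation and an assumption here; the paper's exact encoding may differ.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate polarity into a (num_bins, H, W) spatio-temporal voxel grid.

    events: array of shape (N, 4) with columns (x, y, t, polarity in {-1, +1}).
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return grid
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]
    # Map timestamps into [0, num_bins - 1] and bin them.
    t0, t1 = t.min(), t.max()
    scale = (num_bins - 1) / (t1 - t0) if t1 > t0 else 0.0
    b = ((t - t0) * scale).astype(int)
    np.add.at(grid, (b, y, x), p)  # unbuffered accumulate (handles repeats)
    return grid

def build_network_input(events, last_depth, num_bins=5):
    """Stack the event voxel grid with the most recent depth frame, so the
    reconstruction network sees motion cues plus a geometric anchor."""
    h, w = last_depth.shape
    voxels = events_to_voxel_grid(events, num_bins, h, w)
    return np.concatenate([voxels, last_depth[None]], axis=0)
```

Conditioning on the latest depth frame gives the network an absolute geometric reference, while the voxel grid supplies the relative brightness/motion changes since that frame.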

Loss & Training

The training objective combines an intensity reconstruction loss, a depth reconstruction loss, and a render-and-compare pose estimation loss. The model is trained exclusively on synthetic data and transferred to real scenes zero-shot.
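Written out, the combined objective is a weighted sum of the three terms (the weights λ are assumptions here; the notes do not specify how the terms are balanced):

```latex
\mathcal{L} = \lambda_{\mathrm{int}}\,\mathcal{L}_{\mathrm{int}}
            + \lambda_{\mathrm{depth}}\,\mathcal{L}_{\mathrm{depth}}
            + \lambda_{\mathrm{pose}}\,\mathcal{L}_{\mathrm{pose}}
```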

Key Experimental Results

Main Results

Method                    | Data Type      | FPS  | Novel Object Generalization | Fast Motion Robustness
Traditional RGB-D methods | RGB-D          | <30  | No                          | Poor (motion blur)
Event6D                   | Events + Depth | 120+ | Yes                         | Strong

Event6D significantly outperforms traditional RGB-D methods in highly dynamic scenes.

Ablation Study

Configuration                 | Tracking Accuracy | Notes
Intensity reconstruction only | Medium            | Lacks geometric information
Depth reconstruction only     | Medium            | Lacks appearance information
Bimodal reconstruction        | Best              | Photometric + geometric complementarity
Without depth conditioning    | Poor              | Depth conditioning is critical

Key Findings

  • The complementarity of bimodal reconstruction is essential — using either modality alone leads to a significant performance drop.
  • Zero-shot sim-to-real transfer works well, suggesting that object diversity across 1,033 instances is sufficient for learning generalizable tracking.
  • Tracking at 120+ FPS (versus <30 FPS for RGB-D pipelines) brings the low-latency advantage of event cameras to bear on tracking applications.

Highlights & Insights

  • Practical Event-Based 6D Tracking: This work provides the first systematic validation of event cameras for novel-object 6D pose tracking; 120+ FPS is particularly valuable for real-time applications such as robotic manipulation.
  • Large-Scale Synthetic Data Strategy: Training on synthetic data with 1,033 objects circumvents the bottleneck of real-world annotation for generalization.
  • Depth-Conditioned Reconstruction: Using depth frames as anchors for interpolation-based reconstruction yields greater stability than purely event-driven reconstruction.

Limitations & Future Work

  • The cost and availability of event cameras still constrain practical deployment.
  • Depth camera frame rate remains a bottleneck — reconstruction quality degrades when the interval between depth frames is too large.
  • Sim-to-real domain gap may manifest in certain extreme scenarios.
  • Future work may explore tracking from events alone, without depth input.

Comparison with Related Work

  • vs. Traditional RGB-D Tracking (BundleSDF, etc.): Frame-rate limitations and motion blur are fundamental issues for frame-based sensing; Event6D sidesteps them at the sensor level by adopting event cameras.
  • vs. YCB-Ev / E-POSE Datasets: These datasets are too small with limited motion diversity; EventBlender6D provides a larger and more diverse benchmark.
  • vs. FoundationPose: FoundationPose excels in static or slow-motion scenes but degrades under fast motion.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of event cameras and 6D tracking is valuable; the large-scale benchmark is a meaningful contribution.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Dual validation on synthetic and real data with sufficient ablations.
  • Writing Quality: ⭐⭐⭐⭐ Benchmark descriptions are detailed and clear.
  • Value: ⭐⭐⭐⭐ Applicable to robotics and AR domains.