Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation¶
Metadata¶
- Conference: ICCV 2025
- arXiv: 2509.15224
- Code: bartn8/depthanyevent
- Area: 3D Vision / Event Camera Depth Estimation
- Keywords: Event Camera, Monocular Depth Estimation, Cross-Modal Distillation, Vision Foundation Model, Recurrent Architecture
TL;DR¶
This paper proposes a cross-modal distillation paradigm that leverages a vision foundation model (VFM) from the image domain (Depth Anything v2) to generate pseudo-labels for training event-based depth estimation networks. It further introduces a VFM-based recurrent architecture, DepthAnyEvent-R, achieving state-of-the-art performance in event-based monocular depth estimation without requiring costly depth annotations.
Background & Motivation¶
- Background: Event cameras capture brightness changes at microsecond-level temporal resolution with high dynamic range, making them particularly suitable for high-speed motion and challenging illumination scenarios (autonomous driving, UAVs, robotics, etc.)
- Limitations of Prior Work: Event data lacks large-scale datasets with dense depth annotations; the prohibitive annotation cost severely constrains learning-based depth estimation methods.
- Key Challenge: VFMs in the image domain (e.g., Depth Anything v2) achieve strong depth estimation capabilities through large-scale pretraining, yet no equivalent large-scale dataset exists in the event domain.
- Goal: Exploit DAVIS cameras, which simultaneously output spatially aligned RGB frames and event streams, as a natural bridge for transferring depth knowledge from the image domain to the event domain.
Method¶
Overall Architecture¶
The paper presents two main contributions:
1. Cross-Modal Distillation Paradigm: a pretrained VFM serves as the teacher, processing RGB frames to generate pseudo depth labels \(\mathbf{D}^*\) that supervise event-domain student models (e.g., E2Depth, EReFormer).
2. VFM Adaptation to the Event Domain: DAv2 is either fine-tuned directly on event data (DepthAnyEvent) or extended into a recurrent architecture (DepthAnyEvent-R).
Cross-Modal Distillation¶
- Teacher Model: DAv2 ViT-Large, fine-tuned on EventScape for 10K steps.
- Student Model: Any event-based depth estimation network.
- Alignment Condition: Frames and events are spatially and temporally aligned (naturally satisfied by DAVIS cameras).
- Training Loss: a scale-and-shift-invariant depth term combined with a gradient regularization term.
  - The scale \(s\) and shift \(t\) aligning the prediction \(\mathbf{D}\) to the pseudo-label \(\mathbf{D}^*\) are solved via least squares: \((s, t) = \arg\min_{s,t} \sum_i (s\mathbf{D}_i + t - \mathbf{D}_i^*)^2\)
  - The gradient regularization term penalizes mismatches between the gradients of the aligned prediction and those of the pseudo-label, preserving sharp depth discontinuities.
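A minimal sketch of the distillation step, assuming the frozen teacher is wrapped as a callable that maps an aligned RGB frame to a relative depth map and the student consumes Tencode inputs. The closed-form least-squares solve follows the formula above; the robust weighting and gradient term used in the paper are omitted, and all names (`teacher`, `student`, `distill_loss`) are illustrative.

```python
import torch

def align_scale_shift(pred, target):
    """Least-squares solve of (s, t) = argmin sum((s * pred + t - target)^2),
    computed per image and treated as constants for the loss below."""
    p, g = pred.flatten(), target.flatten()
    A = torch.stack([p, torch.ones_like(p)], dim=1)            # (N, 2)
    sol = torch.linalg.lstsq(A, g.unsqueeze(1)).solution.squeeze(1)
    return sol[0], sol[1]                                       # s, t

def distill_loss(student_depth, pseudo_depth):
    """Scale-and-shift-invariant residual against the teacher pseudo-label D*
    (the gradient regularization term is not shown in this sketch)."""
    with torch.no_grad():
        s, t = align_scale_shift(student_depth, pseudo_depth)
    return (s * student_depth + t - pseudo_depth).abs().mean()

def training_step(student, teacher, tencode_img, rgb_frame, optimizer):
    with torch.no_grad():
        pseudo = teacher(rgb_frame)          # D*: pseudo-label from the frozen VFM
    pred = student(tencode_img)              # student depth from events only
    loss = distill_loss(pred, pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```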
Event Representation — Tencode¶
To minimize modifications to the pretrained VFM, Tencode is adopted to encode events as RGB images:
The R/B channels encode positive/negative polarity, and the G channel encodes relative timestamps, preserving both spatial and temporal information.
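A minimal sketch of a Tencode-style encoding under the channel assignment described above, assuming the most recent event at each pixel wins and timestamps are normalized to the current window; the exact accumulation and normalization rules in the paper may differ.

```python
import numpy as np

def tencode(xs, ys, ts, ps, height, width, t0, t1):
    """Tencode-style event image: R/B carry positive/negative polarity,
    G carries the normalized relative timestamp of the latest event at
    each pixel. xs/ys are integer pixel coordinates, ps is +1/-1 polarity."""
    img = np.zeros((height, width, 3), dtype=np.uint8)
    order = np.argsort(ts)                          # later events overwrite earlier ones
    rel = (ts[order] - t0) / max(t1 - t0, 1e-9)     # relative time in [0, 1]
    for x, y, r, p in zip(xs[order], ys[order], rel, ps[order]):
        g = np.uint8(255 * r)
        if p > 0:
            img[y, x] = (255, g, 0)                 # positive polarity -> R channel
        else:
            img[y, x] = (0, g, 255)                 # negative polarity -> B channel
    return img
```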
DepthAnyEvent (Vanilla VFM Adaptation)¶
DAv2 ViT-Small is directly fine-tuned using the Tencode representation without any architectural modification.
DepthAnyEvent-R (Recurrent VFM Architecture)¶
ConvLSTM modules are inserted after the multi-scale feature maps of the DAv2 encoder to incorporate temporal information from historical event stacks:
- Encoder \(\mathcal{G}\) splits Tencode images into patches → multi-layer Transformer → multi-scale feature maps \(\mathbf{F}_s\)
- At each scale \(s\), ConvLSTM module \(\mathcal{R}_s\) receives \(\mathbf{F}_s\) and hidden state \(\mathbf{H}_s^i\), outputting enhanced features \(\hat{\mathbf{F}}_s\) and updated hidden state \(\mathbf{H}_s^{i+1}\)
- Hierarchical fusion → decoder \(\mathcal{D}\) → final depth map
- This design addresses degraded prediction quality in static scenes where events are sparse.
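A minimal PyTorch sketch of the per-scale recurrence described above: a ConvLSTM cell enhances each multi-scale feature map \(\mathbf{F}_s\) from the encoder while carrying a hidden state \(\mathbf{H}_s\) across consecutive event stacks, before the features enter hierarchical fusion and the decoder. The cell implementation and module names are illustrative, not the paper's exact code.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Standard ConvLSTM cell (illustrative implementation)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 4 * channels, kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

class RecurrentScales(nn.Module):
    """Apply one ConvLSTM per encoder scale, carrying hidden state across
    consecutive Tencode stacks of the event stream."""
    def __init__(self, channels_per_scale):
        super().__init__()
        self.cells = nn.ModuleList(ConvLSTMCell(c) for c in channels_per_scale)

    def forward(self, feats, states=None):
        # feats: list of multi-scale feature maps F_s from the ViT encoder
        # states: list of (H_s, C_s) hidden states, or None at sequence start
        out_feats, new_states = [], []
        for cell, f, st in zip(self.cells, feats, states or [None] * len(feats)):
            if st is None:
                zeros = torch.zeros_like(f)
                st = (zeros, zeros)
            f_hat, st = cell(f, st)
            out_feats.append(f_hat)      # enhanced features fed to the decoder
            new_states.append(st)
        return out_feats, new_states
```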
Key Experimental Results¶
Main Results: Zero-Shot Generalization (Trained Only on EventScape Synthetic Data)¶
| Model | Dataset | Abs Rel↓ | RMSE↓ | δ<1.25↑ |
|---|---|---|---|---|
| E2Depth | MVSEC | 0.527 | 7.894 | 0.363 |
| EReFormer | MVSEC | 0.518 | 8.423 | 0.361 |
| DepthAnyEvent | MVSEC | 0.466 | 7.824 | 0.408 |
| DepthAnyEvent-R | MVSEC | 0.469 | 8.064 | 0.428 |
| E2Depth | DSEC | 0.395 | 13.258 | 0.409 |
| EReFormer | DSEC | 0.297 | 11.608 | 0.524 |
| DepthAnyEvent | DSEC | 0.297 | 11.072 | 0.519 |
| DepthAnyEvent-R | DSEC | 0.276 | 10.942 | 0.555 |
Ablation Study: Distillation vs. Full Supervision (After MVSEC Fine-tuning)¶
| Model | Supervision | Abs Rel↓ | RMSE↓ | δ<1.25↑ |
|---|---|---|---|---|
| E2Depth | Synth | 0.527 | 7.894 | 0.363 |
| E2Depth | Distilled | 0.400 | 6.786 | 0.479 |
| E2Depth | Supervised | 0.420 | 7.268 | 0.432 |
| DepthAnyEvent | Synth | 0.466 | 7.824 | 0.408 |
| DepthAnyEvent | Distilled | 0.397 | 6.910 | 0.461 |
| DepthAnyEvent | Supervised | 0.373 | 6.627 | 0.471 |
| DepthAnyEvent-R | Distilled | 0.399 | 6.830 | 0.462 |
| DepthAnyEvent-R | Supervised | 0.365 | 6.465 | 0.489 |
Key Findings¶
- Distillation vs. Full Supervision: On E2Depth, the distilled model surpasses the fully supervised counterpart on several metrics (RMSE 6.786 vs. 7.268), indicating that the density of VFM pseudo-labels compensates for the sparsity of LiDAR annotations.
- DSEC Dataset: DepthAnyEvent-R achieves Abs Rel 0.226 under distillation vs. 0.191 under full supervision — a manageable gap.
- Tencode vs. Voxel Grid: Ablation experiments (C) vs. (D) demonstrate that Tencode outperforms Voxel Grid.
- Importance of Pretraining: Without pretraining (E), Abs Rel degrades to 0.446, significantly worse than 0.365 with pretraining (C).
- Mixed Supervision (F): Training jointly with ground truth and distilled labels yields the best performance on several metrics.
Highlights & Insights¶
- Paradigm Innovation: This work is the first to systematically distill knowledge from an image-domain VFM into the event domain, elegantly exploiting the natural alignment property of DAVIS cameras.
- Practical Value: The distillation scheme entirely eliminates the need for costly depth annotation, requiring only aligned RGB frames and event streams for training.
- VFM Pseudo-Labels Outperform Sparse LiDAR: Dense pseudo-labels generated by the VFM outperform LiDAR annotations in certain scenarios, since LiDAR ground truth is itself sparse.
- Principled Recurrent Architecture: The ConvLSTM modules naturally integrate temporal information into the VFM, improving depth estimation quality in static scenes and continuous sequences.
Limitations & Future Work¶
- Dependency on RGB Frame Alignment: The distillation approach requires DAVIS cameras or similar frame-event aligned devices; it cannot be directly applied in pure event-camera settings.
- Fixed VFM Backbone: Experiments are limited to DAv2 ViT-Small/Large; larger models or alternative VFMs (e.g., Metric3D) remain unexplored.
- Information Loss in Tencode: Three-channel RGB encoding inevitably discards fine-grained temporal information.
- Inference Speed Not Thoroughly Analyzed: The overhead of recurrent unrolling and ConvLSTM in DepthAnyEvent-R is not discussed in detail.
Related Work & Insights¶
- The large-scale pretraining strategy of Depth Anything v2 (DAv2) forms the foundation of the proposed distillation approach.
- Ablation experiment (B) comparing teacher models shows that teacher model quality directly impacts distillation effectiveness.
- Self-supervised event-based depth estimation (Zhu et al.) avoids annotations but sacrifices accuracy; the distillation scheme achieves a favorable balance between the two.
- This cross-modal distillation paradigm could be generalized to other event-camera tasks, such as optical flow and semantic segmentation.
Rating ⭐⭐⭐⭐¶
The method is concise and elegant, with thorough experiments and clearly articulated contributions. The cross-modal distillation paradigm demonstrates strong generalizability and practical value; distillation performance approaches or even surpasses full supervision, validating the feasibility of transferring VFM knowledge to the event domain. The recurrent architecture design is principled and natural. A limitation lies in insufficient exploration of alternative VFMs and event representations.