EAG3R: Event-Augmented 3D Geometry Estimation for Dynamic and Extreme-Lighting Scenes¶
Conference: NeurIPS 2025 | arXiv: 2512.00771 | Code: To be confirmed | Area: 3D Vision | Keywords: Event camera, 3D geometry estimation, low-light, point map reconstruction, dynamic scene reconstruction
TL;DR¶
EAG3R integrates asynchronous event streams from event cameras into the MonST3R point map reconstruction framework. Through a Retinex enhancement module, an SNR-aware fusion mechanism, and an event photometric consistency loss, it achieves robust depth estimation, pose tracking, and 4D reconstruction in extreme low-light dynamic scenes, significantly outperforming RGB-only methods via zero-shot transfer to nighttime scenarios.
Background & Motivation¶
Background: DUSt3R and MonST3R leverage Transformers to directly regress dense point maps for pose-free 3D reconstruction, sparking a wave of follow-up research on challenging settings such as long sequences and dynamic scenes.
Limitations of Prior Work: In real-world scenarios such as autonomous driving, rapid motion and drastic illumination changes cause RGB image degradation including blur, overexposure, and underexposure. Existing RGB-only methods suffer severe performance degradation under these conditions.
Key Challenge: RGB cameras rely on long-exposure imaging, making them inherently ill-suited for extreme lighting and fast-motion scenes. Event cameras offer high temporal resolution and high dynamic range, yet have not been integrated into modern learning-based geometry estimation pipelines.
Goal: To effectively incorporate event camera data into a point map-based reconstruction framework, maintaining robustness in extreme low-light dynamic scenes.
Key Insight: Lightweight event adapters and SNR-guided adaptive fusion are added on top of the MonST3R backbone, while event streams are used to construct photometric consistency constraints in global optimization.
Core Idea: SNR-aware fusion trusts RGB features in high-SNR regions and event features in low-SNR regions, while brightness changes from events serve as additional supervision signals in global optimization.
Method¶
Overall Architecture¶
EAG3R enhances MonST3R in two aspects: (1) feature extraction stage — Retinex enhancement + lightweight event adapter + SNR-aware fusion; (2) global optimization stage — event photometric consistency loss. The inputs are low-light video and corresponding event streams; the outputs include per-frame depth maps, camera poses, and a global dynamic point cloud.
Key Designs¶
- Retinex Image Enhancement Module:
- Function: Restores visibility of low-light images and generates an SNR confidence map.
- Design Motivation: Directly using degraded low-light images causes feature extraction failure; a pixel-level "reliability" metric is needed to guide subsequent fusion.
- Mechanism: A shallow network estimates the illumination map \(L_{\text{illum}}^t\), producing the enhanced image \(I_{\text{lu}}^t = I^t \odot L_{\text{illum}}^t\). The SNR map is then computed as \(\mathcal{M}_{\text{snr}}^t = \frac{\widetilde{I}_g^t}{|I_g^t - \widetilde{I}_g^t| + \epsilon}\), where \(\widetilde{I}_g^t\) is the mean-filtered grayscale image (see the first code sketch after this list).
- Novelty: Unlike standalone low-light enhancement preprocessing (e.g., RetinexFormer), this module is jointly trained with downstream tasks and additionally outputs an SNR map for fusion.
- Lightweight Event Adapter:
- Function: Extracts high-fidelity features from sparse event streams.
- Design Motivation: Event data can still capture structural information in low-light scenes, but requires a dedicated encoder.
- Mechanism: A pretrained Swin Transformer serves as the event encoder. Events are voxelized and hierarchical features \(\{F_{\text{evt},l}^t\}_{l=1}^4\) are extracted. Cross-attention is applied at each level to interact with image features (second sketch below): \(F'_{\text{evt},l} = \text{CrossAttn}(Q=F_{\text{evt},l}^t, K=F_{\text{img},l}^t, V=F_{\text{img},l}^t)\)
- Novelty: The image encoder is frozen; only the event adapter is trained, preserving pretrained image features.
- SNR-Aware Feature Fusion:
- Function: Adaptively combines image and event features.
- Design Motivation: Image quality varies significantly across regions — RGB is reliable in bright areas while events are more reliable in dark areas.
- Mechanism: Weighted fusion using the normalized SNR map: \(F_{\text{cat}}^t = (F_{\text{img-final}}^t \odot \hat{\mathcal{M}}_{\text{snr}}^t) \| (F_{\text{evt-final}}'^t \odot (1 - \hat{\mathcal{M}}_{\text{snr}}^t))\). High-SNR regions favor image features; low-SNR regions favor event features (third sketch below).
- Novelty: Compared to simple uniform fusion or attention-based fusion, SNR guidance provides a physically meaningful prior.
- Event Photometric Consistency Loss:
- Function: Introduces event-based spatiotemporal consistency constraints in global optimization.
- Design Motivation: MonST3R's original optical flow constraints are unreliable under low light; event streams provide a more stable motion signal.
- Mechanism: Salient patches are defined at Harris corner locations. The brightness increment observed from events, \(\Delta L_{\mathcal{P}_m}(u)\), is compared against the predicted brightness increment derived from image gradients and motion fields, \(\Delta \hat{L}_{\mathcal{P}_m}(u; X_{\text{global}})\). After normalization, an L2 residual is computed (fourth sketch below): \(\mathcal{L}_{\text{event}} = \sum_{\mathcal{P}_m} \sum_{u \in \mathcal{P}_m} \left\| \frac{\Delta L_{\mathcal{P}_m}(u)}{\|\Delta L_{\mathcal{P}_m}\|} - \frac{\Delta \hat{L}_{\mathcal{P}_m}(u; X_{\text{global}})}{\|\Delta \hat{L}_{\mathcal{P}_m}\|} \right\|^2\)
- Novelty: Normalization eliminates the unknown contrast sensitivity threshold \(C\), making the loss invariant to different event sensors.
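Each of the four designs is compact enough to sketch in code. First, a minimal PyTorch sketch of the Retinex enhancement step and the SNR-map formula above; `IllumNet` is a hypothetical stand-in for the paper's shallow illumination estimator, and the channel width, filter kernel, and \(\epsilon\) are assumptions rather than the paper's values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IllumNet(nn.Module):
    """Hypothetical shallow CNN predicting a per-pixel illumination map L_illum."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Softplus(),  # keep gains positive
        )

    def forward(self, img):
        return self.net(img)

def enhance_and_snr(img, illum_net, kernel=5, eps=1e-3):
    """img: (B,3,H,W) low-light frame in [0,1].
    Returns I_lu = I ⊙ L_illum and M_snr = Ĩ_g / (|I_g − Ĩ_g| + eps)."""
    l_illum = illum_net(img)                 # illumination map L_illum
    img_lu = img * l_illum                   # element-wise (Hadamard) enhancement
    # Grayscale of the enhanced frame (using the enhanced rather than the raw
    # frame here is an assumption); Ĩ_g is its mean-filtered version.
    gray = img_lu.mean(dim=1, keepdim=True)
    gray_smooth = F.avg_pool2d(gray, kernel, stride=1, padding=kernel // 2)
    return img_lu, gray_smooth / ((gray - gray_smooth).abs() + eps)
```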
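Second, the per-level cross-attention of the event adapter, with event tokens as queries and the frozen image features as keys/values, matching the equation above. The token dimension and head count are assumptions, and the residual connection plus LayerNorm are added for stability (also an assumption, not stated in the paper); voxelization and the pretrained Swin encoder are omitted.

```python
import torch.nn as nn

class EventImageCrossAttn(nn.Module):
    """One adapter level: F'_evt = CrossAttn(Q=F_evt, K=F_img, V=F_img)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_evt, f_img):
        # f_evt, f_img: (B, N_tokens, dim) features at one pyramid level.
        out, _ = self.attn(query=f_evt, key=f_img, value=f_img)
        return self.norm(f_evt + out)  # residual keeps the event identity

# One such module per pyramid level l = 1..4; the image encoder stays frozen
# and only these adapter blocks receive gradients.
```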
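Third, the SNR-aware fusion reduces to two element-wise gates and a channel concatenation. The sketch assumes the pixel-level SNR map is bilinearly resized to feature resolution and min–max normalized to \([0,1]\) to obtain \(\hat{\mathcal{M}}_{\text{snr}}\); the paper's exact normalization may differ.

```python
import torch
import torch.nn.functional as F

def snr_fusion(f_img, f_evt, snr_map, eps=1e-6):
    """f_img, f_evt: (B,C,h,w) final image/event features.
    snr_map: (B,1,H,W) raw SNR map from the enhancement module."""
    snr = F.interpolate(snr_map, size=f_img.shape[-2:], mode="bilinear",
                        align_corners=False)
    lo = snr.amin(dim=(2, 3), keepdim=True)
    hi = snr.amax(dim=(2, 3), keepdim=True)
    snr_hat = (snr - lo) / (hi - lo + eps)   # normalized SNR map in [0,1]
    # High-SNR pixels keep image features, low-SNR pixels keep event features;
    # '||' in the paper's formula is channel-wise concatenation.
    return torch.cat([f_img * snr_hat, f_evt * (1.0 - snr_hat)], dim=1)
```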
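Fourth, the event photometric consistency residual. Per-patch L2 normalization cancels the unknown contrast threshold \(C\), which scales every increment in a patch uniformly, so only the pattern of brightness change is compared. Tensor shapes are assumptions.

```python
import torch

def event_photometric_loss(delta_l_obs, delta_l_pred, eps=1e-8):
    """delta_l_obs:  (M, P) brightness increments accumulated from events,
                     one row per Harris-corner patch of P pixels.
    delta_l_pred: (M, P) increments predicted from image gradients and the
                  motion field induced by the global point maps X_global."""
    obs = delta_l_obs / (delta_l_obs.norm(dim=1, keepdim=True) + eps)
    pred = delta_l_pred / (delta_l_pred.norm(dim=1, keepdim=True) + eps)
    return ((obs - pred) ** 2).sum()  # sum over patches and pixels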
Loss & Training¶
The joint optimization objective is \(X_{\text{global}}^* = \arg\min_{X_{\text{global}}} \left(\mathcal{L}_{\text{align}} + w_{\text{smooth}}\mathcal{L}_{\text{smooth}} + w_{\text{flow}}\mathcal{L}_{\text{flow}} + w_{\text{event}}\mathcal{L}_{\text{event}}\right)\)
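A hedged sketch of how such a weighted objective could be minimized over \(X_{\text{global}}\) with a first-order optimizer; the optimizer choice, step count, and learning rate are placeholders, not the paper's settings.

```python
import torch

def optimize_global(x_global, losses, weights, steps=500, lr=1e-2):
    """x_global: dict of tensors (point maps, poses) with requires_grad=True.
    losses: dict name -> callable taking x_global and returning a scalar loss."""
    opt = torch.optim.Adam(list(x_global.values()), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        total = sum(weights.get(name, 1.0) * fn(x_global)
                    for name, fn in losses.items())  # L_align + w·L_smooth + ...
        total.backward()
        opt.step()
    return x_global
```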
- Training uses only the MVSEC outdoor_day2 sequence (daytime); zero-shot evaluation is performed on outdoor_night1–3 (nighttime).
- Fine-tuned components include MonST3R's ViT-Base decoder, DPT head, enhancement network, and event adapter.
- Trained for 25 epochs on 4× RTX 3090 GPUs for approximately 24 hours.
Key Experimental Results¶
Main Results — Monocular Depth Estimation (MVSEC Night1–3)¶
| Method | Night1 Abs Rel↓ | Night1 δ<1.25↑ | Night2 Abs Rel↓ | Night2 δ<1.25↑ | Night3 Abs Rel↓ | Night3 δ<1.25↑ |
|---|---|---|---|---|---|---|
| DUSt3R | 0.407 | 0.393 | 0.415 | 0.384 | 0.463 | 0.335 |
| MonST3R | 0.370 | 0.373 | 0.309 | 0.469 | 0.317 | 0.453 |
| DUSt3R (LightUp) | 0.425 | 0.351 | 0.462 | 0.347 | 0.525 | 0.293 |
| MonST3R (Finetune) | 0.376 | 0.426 | 0.328 | 0.472 | 0.302 | 0.509 |
| EAG3R | 0.353 | 0.491 | 0.307 | 0.518 | 0.288 | 0.533 |
Main Results — Camera Pose Estimation (MVSEC Night1–3)¶
| Method | Night1 ATE↓ | Night2 ATE↓ | Night3 ATE↓ |
|---|---|---|---|
| DUSt3R | 1.474 | 3.921 | 4.109 |
| MonST3R | 0.559 | 0.626 | 0.733 |
| MonST3R (Finetune) | 0.580 | 0.467 | 0.402 |
| Easi3R_monst3r (Finetune) | 0.540 | 0.448 | 0.394 |
| EAG3R | 0.482 | 0.428 | 0.409 |
Ablation Study (Night3 Depth Estimation)¶
| Method | Abs Rel↓ | δ<1.25↑ | RMSE log↓ |
|---|---|---|---|
| MonST3R (Baseline) | 0.317 | 0.453 | 0.418 |
| MonST3R (Finetune) | 0.302 | 0.509 | 0.401 |
| + Event | 0.297 | 0.518 | 0.396 |
| + Event + LightUp | 0.291 | 0.523 | 0.388 |
| + Event + LightUp + SNR Fusion (Full) | 0.288 | 0.533 | 0.371 |
Key Findings¶
- Event streams are the core contributor: adding event input alone improves Abs Rel from 0.302 to 0.297, supporting the central role of event signals in low-light scenes.
- LightUp enhancement helps only when trained jointly: applying RetinexFormer as standalone preprocessing can actually worsen results (DUSt3R with LightUp performs worse than vanilla DUSt3R), demonstrating that enhancement must be optimized jointly with the downstream task.
- SNR fusion is critical: The final SNR fusion step reduces RMSE log from 0.388 to 0.371, indicating that adaptive weight allocation is more effective than simple concatenation.
- Zero-shot nighttime generalization: Training exclusively on daytime data achieves substantial improvements over all baselines at night, demonstrating the cross-scene generalization capability of event data.
Highlights & Insights¶
- First event-augmented point map-based reconstruction framework: Integrating event cameras into the DUSt3R/MonST3R paradigm opens a new direction at the intersection of event cameras and geometric foundation models.
- SNR-guided fusion: Using signal-to-noise ratio as a physical prior to guide multimodal fusion is more interpretable and efficient than purely learned attention mechanisms.
- Event photometric consistency loss: The brightness change model from events is cleverly used as a constraint in global optimization; normalization renders the loss insensitive to sensor-specific parameters.
- Zero-shot nighttime generalization: Training solely on daytime data yet significantly outperforming all methods at night directly demonstrates the high dynamic range advantage of event cameras.
Limitations & Future Work¶
- Validation is conducted only on the MVSEC dataset, which is relatively small and limited in scene diversity (outdoor driving only).
- Event data in training comes from real event cameras; V2E-synthesized events cause gradient explosion, limiting scalability to larger datasets.
- The event adapter uses a Swin Transformer, but its parameter count and computational overhead are not reported in detail.
- The impact of event streams under normal lighting conditions — whether they introduce noise — is not evaluated.
- Dynamic reconstruction is assessed only qualitatively, without quantitative metrics.
Related Work & Insights¶
- DUSt3R/MonST3R: The backbone architecture of this work and the pioneering contributions to point map-based pose-free reconstruction.
- RetinexFormer: The inspiration for the Retinex enhancement module; this work demonstrates that standalone preprocessing is inferior to end-to-end joint training.
- EvLight: A pioneering work on adaptive event-image feature fusion.
- Insights: The SNR-aware fusion concept is generalizable to other multimodal tasks (e.g., RGB-LiDAR, RGB-Thermal fusion), where physically-informed fusion weights are more robust than purely data-driven approaches.
Rating¶
- Novelty: ⭐⭐⭐⭐ First integration of event cameras into the point map reconstruction framework; SNR-aware fusion is innovative.
- Experimental Thoroughness: ⭐⭐⭐ Three tasks with complete ablation studies, but evaluation on a single dataset is a limitation.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, complete mathematical derivations, and highly informative figures.
- Value: ⭐⭐⭐⭐ Provides a practical solution for 3D reconstruction under extreme conditions; the event + foundation model direction holds considerable promise.