Depth Hypothesis Guided Iterative Refinement for Event-Image Monocular Depth Estimation¶

Conference: CVPR 2026
Paper: CVF Open Access
Code: Not available
Area: 3D Vision
Keywords: Event camera, Monocular depth estimation, Depth hypothesis volume, Cost volume, Iterative refinement

TL;DR¶

HypoDepth reformulates event-image monocular depth estimation from "direct regression of continuous depth" to "constrained search within discrete depth hypotheses." By utilizing a lightweight 3D cost volume and a GRU iterative unit, it progressively refines residual depth from low to high resolution. It achieves SOTA on DSEC and MVSEC, with a Tiny version capable of real-time operation on resource-constrained devices.

Background & Motivation¶

Background: Event cameras possess high temporal resolution and high dynamic range, capturing scene changes in scenarios where conventional cameras fail (e.g., high-speed motion, low light). Thus, combining events and images for monocular depth estimation (MDE) is promising. Current mainstream approaches (e.g., RAMNet using RNN alignment, UniCT using self-attention, SRFNet/PCDepth using iterative feature refinement) focus on optimizing contextual features and then using a regression head to output dense depth.

Limitations of Prior Work: Monocular depth estimation is inherently ill-posed, since the same set of features can correspond to infinitely many depth solutions. Direct regression of the full depth distribution is highly non-linear, difficult to converge, and sensitive to noise. Regardless of feature alignment or fusion quality, the final mapping from "feature to continuous depth" remains a non-linear bottleneck, and refinement is often limited to the feature level without explicit constraints in the depth space.

Key Challenge: Treating depth as an unbounded continuous variable for regression results in a solution space that is too large and unconstrained, which does nothing to alleviate the ill-posed nature of the problem. Conversely, fields like optical flow and stereo matching have shown that replacing "regression" with "searching within constrained candidates" significantly stabilizes optimization (e.g., RAFT maintains 2D/4D cost volumes and iteratively predicts residual flow). However, monocular depth lacks a second view from which to build such cost volumes, making direct adaptation impossible.

Goal: (1) Reformulate the ill-posed depth regression as a constrained depth search task; (2) Construct a cost volume capable of guiding iterative refinement in a monocular setting without a second view; (3) Ensure the mechanism is lightweight enough to run across multiple resolutions for real-time deployment.

Key Insight: The authors observe that correspondences can be artificially created in a "virtual depth space." By discretizing continuous depth into a set of candidate depth values (depth hypotheses) and calculating the matching confidence between contextual features and "geometric projections under those hypotheses," a 3D cost volume is obtained. This effectively replaces the "matching between two frames" in RAFT with "matching between context features and depth hypotheses."

Core Idea: Use a discrete Depth Hypothesis Volume (DHV) to constrain the depth search space to a set of reasonable candidates. Construct a 3D cost volume for multi-scale correlation lookup and use a GRU to iteratively predict residual depth for global-to-local refinement. In short, unbounded depth regression is replaced with constrained depth search to alleviate ill-posedness.

Method¶

Overall Architecture¶

Given an image \(I\) and a corresponding event stream \(E\), the goal is to estimate a dense depth map. HypoDepth is a three-stage, multi-resolution iterative pipeline: images and events pass through backbones to extract multi-resolution context features \(\rightarrow\) EIFusion adaptively fuses the two modalities by resolution \(\rightarrow\) Iterative depth decoders run sequentially at 1/16, 1/8, and 1/4 resolutions. Internally, each decoder constructs a cost volume via DHV, performs correlation lookup, and updates residual depth using a GRU, refining depth from global consistency to local detail. The 1/16 stage initializes depth to zero, and each subsequent resolution uses the previous stage's output as a prior.

The architecture follows a dual-loop structure: "multi-resolution serial execution + multiple iterations within a single resolution."

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Image + Event Stream"] --> B["Dual-stream Backbone<br/>Multi-res Context Features"]
    B --> C["EIFusion Cross-modal Fusion<br/>Low-res: CA / High-res: Conv"]
    C --> D["Depth Hypothesis Volume (DHV)<br/>Discrete Candidates"]
    D --> E["3D Cost Volume<br/>Context ↔ DHV Confidence"]
    E --> F["Correlation Lookup<br/>Local Search + Multi-scale Pooling"]
    F --> G["Geometric Encoding + GRU Iteration<br/>Predict Residual Depth ×N"]
    G -->|"Resolution < 1/4<br/>Current Depth as Prior"| D
    G -->|"Resolution = 1/4 Finished"| H["Dense Depth Map"]

Key Designs¶

1. Depth Hypothesis Volume (DHV): Replacing Unbounded Regression with Bounded Search

The DHV targets the non-linearity and convergence issues of direct regression. At each pixel, \(D\) candidate depth values \(\{\hat d_i\}_{i=0}^{D-1}\) are uniformly sampled within a depth range \([d_{min}, d_{max}]\) provided by the prior. Following RAMNet, each candidate is converted to normalized log depth for robustness:

\[d_i = \frac{\log(\hat d_i) - \log(d_{min})}{\log(d_{max}) - \log(d_{min})}, \quad DHV = \{d_{0}, d_{1}, \dots, d_{D-1}\}\]

The DHV is then linearly projected into the same feature space as context features \(F_{ct}\) to get \(F_{dhv} = \text{Embedding}(DHV)\), where \(F_{dhv} \in \mathbb{R}^{c \times D}\) represents geometric projections at specific depth hypotheses. This compresses the solution space from the "entire real axis" to "\(D\) candidates," providing noise resistance and a stable starting point for refinement.
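The construction above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `build_dhv` and `embed_dhv`, the depth range of 1 to 80 m, the feature width `c = 64`, and the random projection weights are all assumptions, and the candidates are assumed to be spaced uniformly in log depth.

```python
import numpy as np

def build_dhv(d_min, d_max, num_hyp):
    """Return (metric candidates d_hat, normalized log-depth hypotheses d)."""
    # Assumption: candidates are spaced uniformly in log depth.
    d_hat = np.exp(np.linspace(np.log(d_min), np.log(d_max), num_hyp))
    # Normalized log depth, as in the formula above: maps [d_min, d_max] to [0, 1].
    d = (np.log(d_hat) - np.log(d_min)) / (np.log(d_max) - np.log(d_min))
    return d_hat, d

def embed_dhv(d, weight, bias):
    """Linear projection of each scalar hypothesis to a c-dim feature, shape (c, D)."""
    return weight[:, None] * d[None, :] + bias[:, None]

rng = np.random.default_rng(0)
c, num_hyp = 64, 32                      # hypothetical sizes
d_hat, dhv = build_dhv(d_min=1.0, d_max=80.0, num_hyp=num_hyp)
f_dhv = embed_dhv(dhv, rng.normal(size=c), rng.normal(size=c))
print(f_dhv.shape)                       # (c, D)
```

In a trained model the projection weights would be learned; the point here is only that every hypothesis becomes one column of \(F_{dhv}\).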

2. 3D Cost Volume + Multi-scale Correlation Lookup: Creating "Matches" in Virtual Space

To solve the lack of explicit correspondences in monocular setups, the authors perform bi-directional cross-attention between \(F_{ct}\) and \(F_{dhv}\):

\[F'_{dhv} = F_{dhv} + \text{CA}(Q_{dhv}, K_{ct}, V_{ct}), \quad F'_{ct} = F_{ct} + \text{CA}(Q_{ct}, K_{dhv}, V_{dhv})\]
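A rough NumPy sketch of the two residual cross-attention updates follows. It assumes single-head scaled dot-product attention, flattened context features, one hypothesis set shared by all pixels, and random projection matrices; none of these details are confirmed by the source, and `cross_attend` is a hypothetical helper name.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_feat, source_feat, w_q, w_k, w_v):
    """Residual cross-attention: query_feat + CA(Q, K, V).

    query_feat: (n_q, c) provides the queries; source_feat: (n_s, c) provides keys/values.
    """
    q = query_feat @ w_q
    k = source_feat @ w_k
    v = source_feat @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (n_q, n_s)
    return query_feat + attn @ v

rng = np.random.default_rng(0)
c, h, w, num_hyp = 16, 6, 8, 32              # hypothetical sizes

def rand_proj():
    return rng.normal(size=(c, c)) / np.sqrt(c)

f_ct = rng.normal(size=(h * w, c))           # flattened context features
f_dhv = rng.normal(size=(num_hyp, c))        # embedded hypotheses, stored as (D, c)

# Both directions read the original (un-updated) features, matching the equation.
f_dhv_new = cross_attend(f_dhv, f_ct, rand_proj(), rand_proj(), rand_proj())
f_ct_new = cross_attend(f_ct, f_dhv, rand_proj(), rand_proj(), rand_proj())
```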

A 3D cost volume \(CV = \text{Sim}(F'_{ct}, F'_{dhv}) \in \mathbb{R}^{H \times W \times D}\) is computed via dot-product similarity, encoding "matching confidence at each hypothesis." A pyramid \(\{CV^1, CV^2, CV^3\}\) is formed by pooling along the depth dimension. During the \(k\)-th iteration for pixel \(P\), the index \(i^*\) closest to the previous depth \(\psi_P^{k-1}\) is found:

\[i^* = \arg\min_i |\psi_P^{k-1} - d_i|\]

Local correlation vectors are retrieved around \(i^*\) with radius \(r\). The multi-scale nature allows the model to handle both near and far objects efficiently. Since \(D \ll H \times W\), this 3D volume is significantly more efficient than the 4D volumes used in RAFT.
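The cost volume, depth-axis pyramid, and local lookup can be sketched as below. This is an illustrative reading of the description, with several assumptions: average pooling by a factor of 2 per pyramid level, index clipping at the borders of the hypothesis axis, hypotheses shared across pixels, and random features in place of the attended ones. The function names are hypothetical.

```python
import numpy as np

def cost_volume(f_ct, f_dhv):
    """Dot-product similarity. f_ct: (H, W, c), f_dhv: (c, D) -> (H, W, D)."""
    return np.einsum("hwc,cd->hwd", f_ct, f_dhv)

def depth_pyramid(cv, levels=3):
    """Average-pool along the depth axis by 2 per level (assumed pooling scheme)."""
    pyramid = [cv]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        d = prev.shape[-1] // 2 * 2
        pyramid.append(prev[..., :d].reshape(*prev.shape[:-1], d // 2, 2).mean(-1))
    return pyramid

def lookup(pyramid, dhv, depth, radius):
    """Gather 2r+1 correlation values around the nearest hypothesis at every level."""
    i_star = np.abs(depth[..., None] - dhv).argmin(-1)        # (H, W) nearest index
    offsets = np.arange(-radius, radius + 1)
    out = []
    for level, cv in enumerate(pyramid):
        centre = i_star // (2 ** level)                        # index at this level
        idx = np.clip(centre[..., None] + offsets, 0, cv.shape[-1] - 1)
        out.append(np.take_along_axis(cv, idx, axis=-1))
    return np.concatenate(out, axis=-1)                        # (H, W, levels*(2r+1))

rng = np.random.default_rng(0)
h, w, c, num_hyp = 4, 5, 8, 32                                 # hypothetical sizes
dhv = np.linspace(0.0, 1.0, num_hyp)
cv = cost_volume(rng.normal(size=(h, w, c)), rng.normal(size=(c, num_hyp)))
pyr = depth_pyramid(cv)
depth = np.full((h, w), dhv[10])                               # previous estimate
feats = lookup(pyr, dhv, depth, radius=2)
```

With `radius=2` and three levels, each pixel receives 15 correlation values. The coarser levels cover a wider depth span per bin, which is how the pyramid widens the search range without enlarging the window.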

3. Geometric Encoding + GRU Iterative Residual Update: Stable Convergence

The Geometric Encoder processes multi-scale correlations and injects fused context features \(F_{ct}\) to ensure spatial consistency. This is fed into a GRU-based iterative unit. The GRU's hidden state passes historical information across iterations (ablation shows simple convolutions converge poorly without this). The prediction head outputs residual depth \(\Delta\Psi\), updating \(\Psi^k = \Psi^{k-1} + \Delta\Psi\). Finally, convex upsampling and mapping back to the true scale are applied:

\[\hat\Psi^k = d_{max}\exp\left(\log\tfrac{d_{max}}{d_{min}}(\Psi^k - 1)\right)\]
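As a quick check, the mapping above is the exact inverse of the normalized log-depth transform used to build the DHV. The sketch below applies a few residual updates and converts back to metric depth; the residual values stand in for GRU outputs, the depth range is an assumption, and convex upsampling is omitted.

```python
import numpy as np

def to_normalized(depth, d_min, d_max):
    """Metric depth -> normalized log depth in [0, 1]."""
    return (np.log(depth) - np.log(d_min)) / (np.log(d_max) - np.log(d_min))

def to_metric(psi, d_min, d_max):
    """Normalized log depth -> metric depth, following the formula above."""
    return d_max * np.exp(np.log(d_max / d_min) * (psi - 1.0))

d_min, d_max = 1.0, 80.0                 # hypothetical depth bounds
psi = np.zeros((2, 3))                   # the coarsest stage starts from zero
for delta in (0.4, 0.1, -0.05):          # stand-ins for GRU-predicted residuals
    psi = psi + delta                    # Psi^k = Psi^{k-1} + delta
metric = to_metric(psi, d_min, d_max)
```

By construction, `psi = 0` maps to `d_min` and `psi = 1` maps to `d_max`, so the zero initialization corresponds to the near bound of the range.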

4. Multi-resolution Adaptive Fusion & Coarse-to-fine Refinement

The pipeline follows 1/16 \(\rightarrow\) 1/8 \(\rightarrow\) 1/4. EIFusion uses cross-attention at low resolutions (1/16, 1/8) for global semantic alignment and simple convolutional fusion at high resolution (1/4) to preserve structural details while saving computation. The number of hypotheses \(D\) is increased with resolution (e.g., \(32, 64, 96\)).
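The dual-loop control flow (stages outside, iterations inside) might look like the following sketch. Only the stage order, the example hypothesis counts, the fusion type per stage, the zero initialization, and the 8 iterations come from the text; `toy_update` is a hypothetical stand-in for the lookup plus GRU step, and nearest-neighbour upsampling of the prior is an assumption.

```python
import numpy as np

# (downsampling factor, number of hypotheses D, fusion type) per stage
STAGES = [(16, 32, "cross-attention"), (8, 64, "cross-attention"), (4, 96, "conv")]

def upsample2x(depth):
    """Nearest-neighbour upsampling of the previous stage's depth (assumed)."""
    return depth.repeat(2, axis=0).repeat(2, axis=1)

def run_pipeline(height, width, update_fn, iters=8):
    depth, log = None, []
    for factor, num_hyp, fusion in STAGES:               # outer loop: resolutions
        h, w = height // factor, width // factor
        dhv = np.linspace(0.0, 1.0, num_hyp)             # hypotheses for this stage
        depth = np.zeros((h, w)) if depth is None else upsample2x(depth)
        for _ in range(iters):                           # inner loop: refinement
            depth = depth + update_fn(depth, dhv)        # residual update
        log.append((f"1/{factor}", depth.shape, num_hyp, fusion))
    return depth, log

def toy_update(depth, dhv):
    """Hypothetical stand-in for lookup + GRU: step halfway toward 0.5."""
    return 0.5 * (0.5 - depth)

depth, log = run_pipeline(480, 640, toy_update)
for entry in log:
    print(entry)
```

For a 480×640 input the three stages operate on 30×40, 60×80, and 120×160 maps, so most iterations run where they are cheapest.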

Loss & Training¶

The SILog loss is applied to all \(K\) iterative predictions \(\{\hat\Psi_i\}_{i=1}^K\) with exponential weighting:

\[L = \sum_{i=1}^K 0.8^{K-i}\,\alpha\sqrt{V(\delta_i)} - \lambda E(\delta_i), \quad \delta_i = \log(\hat\Psi_i) - \log(\Psi_{gt})\]
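The exponential weighting can be illustrated with the sketch below. Note that it uses the commonly cited form of the scale-invariant log loss, \(\alpha\sqrt{E[\delta^2] - \lambda E[\delta]^2}\), which may differ in detail from the expression written above; the values `alpha=10.0` and `lam=0.85` are typical defaults from other depth-estimation work, not values reported for HypoDepth.

```python
import numpy as np

def silog(pred, gt, alpha=10.0, lam=0.85):
    """Commonly used SILog form (assumed): alpha * sqrt(E[d^2] - lam * E[d]^2)."""
    delta = np.log(pred) - np.log(gt)
    return alpha * np.sqrt(np.mean(delta ** 2) - lam * np.mean(delta) ** 2)

def sequence_loss(preds, gt, gamma=0.8):
    """Weight the i-th of K predictions by gamma^(K - i), so later ones count more."""
    k = len(preds)
    return sum(gamma ** (k - i) * silog(p, gt) for i, p in enumerate(preds, start=1))

gt = np.full((4, 4), 10.0)
pred = 2.0 * gt                          # every pixel over-estimated by a factor of 2
loss = sequence_loss([pred, pred, pred], gt)
```

With \(K = 3\) the weights are 0.64, 0.8, and 1.0, so the final iteration dominates while earlier iterations are still supervised.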

Events are represented as voxel grids (\(B=3\) bins). The backbone is a fine-tuned Swin-T. Training uses 8 iterations per stage, the AdamW optimizer, and one-cycle scheduling.

Key Experimental Results¶

Main Results¶

On the DSEC dataset (480×640, dense events), E denotes Event-only, E+I denotes Event + Image:

Input	Method	δ1 ↑	Abs Rel ↓	RMSE ↓	FLOPs (G)	Params (M)	Inference (ms)
E	DepthAnyEvent-R	0.592	0.252	9.824	39	25.5	12.9
E	Ours-E-T (Tiny)	0.755	0.168	5.377	5	1.4	11.8
E	Ours-E-B	0.839	0.129	4.442	134	32.3	58.2
E+I	PCDepth	0.893	0.103	3.591	162	66.5	64.3
E+I	Ours-B	0.901	0.099	3.583	155	60.2	71.6

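Ours-E-B reduces Abs Rel by ~49% relative to DepthAnyEvent-R (0.252 → 0.129). The Tiny version trades some accuracy for efficiency: with 1.4M parameters, 5 GFLOPs, and 11.8 ms inference, it still reaches δ1 = 0.755, well ahead of DepthAnyEvent-R (0.592).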

On MVSEC (Abs Rel):

Input	Method	day1 Abs Rel ↓	night1 Abs Rel ↓
E	HMNet	0.254	0.323
E+I	PCDepth	0.228	0.271
E+I	Ours-B	0.212	0.268

Ablation Study¶

(Performed on DSEC using a reduced base model):

Configuration	δ1 ↑	Abs Rel ↓	Remark
Full model	0.885	0.105	Baseline
W/o EIFusion	0.877	0.110	Simple concat reduces complementarity
W/o CA in decoder	0.869	0.111	Weakens semantic/geometric correspondence
W/o GRU	0.858	0.115	Conv layers lack temporal prior, weakest convergence
W/o DHV	0.873	0.113	Continuous depth prior leads to error accumulation

Key Findings¶

GRU is critical: Removing it caused the largest performance drop (δ1 0.885 → 0.858), which supports the need for "historical prior transfer" across iterations.
DHV outperforms continuous priors: Discrete hypotheses allow the search mechanism to correct previous errors, whereas continuous regression tends to accumulate them.
Efficiency through 3D volumes: Since \(D \ll H \times W\), the computation is manageable even at high resolutions.
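To make the efficiency argument concrete, the snippet below counts cost-volume entries at the 1/4 stage of a 480×640 input. It assumes \(D = 96\) at that stage (the example value given for the finest resolution) and compares against an all-pairs 4D volume built at the same resolution, which is for illustration only: RAFT itself builds its volume at 1/8 resolution.

```python
def volume_sizes(height, width, factor, num_hyp):
    """Entry counts of an H x W x D volume and an all-pairs (H x W)^2 volume."""
    n_pix = (height // factor) * (width // factor)
    return n_pix * num_hyp, n_pix * n_pix

size_3d, size_4d = volume_sizes(480, 640, factor=4, num_hyp=96)
ratio = size_4d / size_3d
print(size_3d, size_4d, ratio)   # 1843200 368640000 200.0
```

Under these assumptions the 3D volume is 200 times smaller, because the second factor is the number of hypotheses rather than the number of pixels.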

Highlights & Insights¶

Adapting the RAFT paradigm to Monocular Vision: HypoDepth cleverly creates artificial correspondences through discrete hypotheses in "virtual space," a concept transferable to other continuous regression tasks lacking matching pairs.
"Regression to Search": Discretizing the solution space provides inherent noise resistance and stabilizes the ill-posed problem.
Resolution-Aware Resource Allocation: Using expensive cross-attention only at low resolutions and increasing hypothesis density \(D\) as resolution grows keeps computation in check without giving up accuracy.

Limitations & Future Work¶

Dependence on Depth Bounds: DHV sampling relies on \([d_{min}, d_{max}]\). If the actual scene depth falls outside these bounds, the search space will fail to cover the ground truth.
Latency: Multiple iterations at high resolutions increase inference time (71.6 ms for Ours-B, versus 64.3 ms for PCDepth), so strict real-time applications require the Tiny version.
Sampling Density: Uniform log-space sampling may not be optimal for all scenes; adaptive bins (similar to AdaBins) could be explored.

vs. RAFT: While RAFT uses a 4D cost volume from two frames, HypoDepth uses a lighter 3D volume from a single frame + hypotheses.
vs. PCDepth / SRFNet: These rely on feature-level refinement. HypoDepth focuses on explicit spatial search in the depth domain, outperforming PCDepth on DSEC/MVSEC.
vs. AdaBins: Both use discretization, but HypoDepth uses it to build a cost volume for search-based refinement rather than for direct bin classification.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ (Excellent adaptation of cost-volume search to monocular settings)
Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Dual datasets, zero-shot, and exhaustive ablation)
Writing Quality: ⭐⭐⭐⭐ (Clear logic and formulations)
Value: ⭐⭐⭐⭐⭐ (Significant performance gains and a deployable Tiny version)