DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation

Conference: ECCV 2024
arXiv: 2303.05021
Code: https://github.com/duanyiqun/DiffusionDepth
Area: 3D Vision
Keywords: Monocular Depth Estimation, Diffusion Models, Denoising Process, Depth Refinement, Self-Diffusion

TL;DR

This paper introduces the diffusion denoising process to monocular depth estimation for the first time. It performs visually conditioned iterative denoising in a latent depth space and proposes a self-diffusion mechanism to resolve the mode collapse caused by sparse ground-truth (GT) depth, achieving SOTA performance on KITTI and NYU-Depth-V2.

Background & Motivation

Background: Monocular depth estimation is a fundamental visual task that predicts pixel-wise depth from a single 2D image, widely applied in autonomous driving, robotics, and augmented reality. Existing methods are mainly categorized into: (1) Regression methods (e.g., BTS, PixelFormer) which directly regress depth values; (2) Classification methods (e.g., AdaBins, BinsFormer) which discretize continuous depth before classification.

Limitations of Prior Work: (1) Pure regression methods are prone to overfitting, struggle to generate fine depth details, and often result in blurry object boundaries and shapes. (2) Classification-based methods discretize depth values using bin centers, leading to discontinuities and fuzziness in depth maps. (3) Both categories adopt static, single-step prediction, lacking the capability for progressive refinement.

Key Challenge: Depth estimation must capture both the coarse scene layout and fine-grained object details, which single-step prediction paradigms struggle to do simultaneously. Diffusion models have demonstrated strong progressive refinement in generative tasks, but applying them to depth estimation faces a critical barrier: GT depth in outdoor scenes is extremely sparse (e.g., KITTI has only 3.75%-5% valid pixels), and training a generative model directly on sparse GT leads to mode collapse.

Goal: (1) How to introduce the iterative refinement capability of diffusion models into depth estimation? (2) How to effectively train diffusion models under sparse GT depth scenarios?

Key Insight: Reframe depth estimation as a conditional denoising diffusion process: starting from a random depth distribution, the model progressively denoises it into an accurate depth map under the guidance of monocular visual conditions. In addition, a self-diffusion strategy adds noise to the model's own refined depth instead of the sparse GT, which sidesteps the sparsity issue.

Core Idea: Reframe depth estimation as a visually-conditioned iterative denoising process, and resolve the training challenge under sparse GT scenarios via a self-diffusion mechanism, thereby achieving progressive, high-quality depth refinement.

Method

Overall Architecture

The overall architecture of DiffusionDepth consists of three core components: (1) A visual feature extractor (Swin Transformer + FPN) to construct multi-scale visual conditions \(c\) from the input image; (2) A depth encoder-decoder to map depth maps to the latent space and back; (3) A Monocular Conditional Denoising Block (MCDB) to perform conditional denoising in the latent space. During inference, starting from random noise \(x_T\), a refined depth latent representation \(x_0\) is obtained through a \(T\)-step denoising process, which is then decoded to output the final depth map.

Key Designs

Task Reformulation - Depth Estimation as Denoising:
- Function: Transitions depth estimation from conventional regression/classification to a conditional generative paradigm.
- Mechanism: Defines the conditional denoising process as \(p_\theta(x_{t-1}|x_t, c) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t, c), \sigma_t^2 I)\), where \(c\) denotes the monocular visual conditions. DDIM sampling is used to accelerate inference, with the sampling variance \(\sigma_t\) set to 0 so that predictions are deterministic.
- Design Motivation: The iterative refinement of diffusion models is naturally suited for depth estimation—the initial steps establish the coarse structure, while subsequent steps progressively correct details and distance relationships.
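
The deterministic DDIM update described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes a noise-prediction (epsilon) parameterization and a linear beta schedule, neither of which is stated in this summary, and the names `linear_alpha_bar`, `ddim_step`, and `sample_timesteps` are hypothetical.

```python
import math

def linear_alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed schedule)."""
    alpha_bar, running = [], 1.0
    for i in range(num_train_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_train_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def sample_timesteps(num_train_steps=1000, num_infer_steps=20):
    """Evenly strided subset of the training timesteps, noisiest first."""
    stride = num_train_steps // num_infer_steps
    return list(range(num_train_steps - 1, -1, -stride))

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One DDIM update with zero sampling variance, element-wise on a flat latent."""
    x_prev = []
    for x, eps in zip(x_t, eps_pred):
        # Estimate the clean latent implied by the predicted noise.
        x0_pred = (x - math.sqrt(1.0 - ab_t) * eps) / math.sqrt(ab_t)
        # Move to the previous noise level without injecting fresh noise.
        x_prev.append(math.sqrt(ab_prev) * x0_pred + math.sqrt(1.0 - ab_prev) * eps)
    return x_prev
```

In a full sampler, `eps_pred` would come from the conditional denoiser at each of the 20 selected timesteps, and `ab_prev` would be taken as 1.0 on the last step so that the update returns the estimated clean latent.
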
Monocular Conditional Denoising Block (MCDB):
- Function: Performs denoising on the depth latent representation guided by visual conditions.
- Mechanism: The visual conditions \(c \in \mathbb{R}^{H/4 \times W/4 \times C}\) (with \(C\) channels) are upsampled via a local projection layer to match the size of the depth latent \(x_t \in \mathbb{R}^{H/2 \times W/2 \times d}\), fused with it by element-wise addition, and further processed by CNN blocks and self-attention layers. The block is kept lightweight to preserve inference efficiency.
- Design Motivation: Visual conditions provide scene semantics and structural information. Hierarchical fusion allows the denoising process to utilize visual cues in the image to guide depth refinement.
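
The fusion step in the MCDB can be illustrated with a toy example on nested lists. This sketch makes several assumptions: nearest-neighbour upsampling and a bias-free per-pixel linear map stand in for the learned local projection layer, the CNN blocks and self-attention layers are omitted, and the function names are hypothetical.

```python
def upsample2x_nearest(feat):
    """Nearest-neighbour x2 upsampling of a [H][W][C] nested list."""
    out = []
    for row in feat:
        wide = [list(px) for px in row for _ in range(2)]  # repeat each pixel along width
        out.append(wide)
        out.append([list(px) for px in wide])              # repeat the row along height
    return out

def project_channels(feat, weight):
    """Per-pixel linear map from C_in to C_out channels; weight has shape [C_out][C_in]."""
    return [[[sum(w * v for w, v in zip(w_row, px)) for w_row in weight]
             for px in row] for row in feat]

def fuse_condition(depth_latent, visual_cond, weight):
    """Bring the condition to the latent's resolution and width, then add element-wise."""
    cond = project_channels(upsample2x_nearest(visual_cond), weight)
    return [[[a + b for a, b in zip(px_x, px_c)]
             for px_x, px_c in zip(row_x, row_c)]
            for row_x, row_c in zip(depth_latent, cond)]
```

For instance, a 1×1 condition with two channels and a weight of `[[1.0, 1.0]]` is expanded to a 2×2 single-channel map before being added to a 2×2 latent.
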
Self-Diffusion Mechanism:
- Function: Resolves the mode collapse issue of training diffusion models under sparse GT depths.
- Mechanism: During training, instead of adding noise to the sparse GT depths, noise is added to the refined depth latent \(x_0\) output by the model's current denoising process to produce \(x_t\). The supervision signal aligns the refined depth with the GT using a sparse validity mask. Random cropping, jittering, and flipping augmentations are applied.
- Design Motivation: Outdoor depth GT covers only 3.75%-5% of the pixels; training diffusion models directly on sparse GT results in collapse (experimentally verified as non-convergent on KITTI). Self-diffusion enables the model to learn to "organize" the entire depth map rather than merely regressing known parts.
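
A minimal sketch of the two ingredients of self-diffusion follows: noising the model's own refined latent with the standard forward process \(x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon\), and supervising only on valid GT pixels. The function names are hypothetical, the convention that a GT value of 0 marks an invalid pixel is an assumption, and the L1 error is used purely for illustration.

```python
import math
import random

def add_noise(x0, alpha_bar_t, rng):
    """Forward diffusion q(x_t | x_0) on a flat latent; returns (x_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
           for x, e in zip(x0, eps)]
    return x_t, eps

def masked_l1(pred_depth, gt_depth):
    """Mean absolute error over pixels with valid GT (assumed: gt > 0 means valid)."""
    valid = [(p, g) for p, g in zip(pred_depth, gt_depth) if g > 0]
    if not valid:
        return 0.0
    return sum(abs(p - g) for p, g in valid) / len(valid)

# Self-diffusion: x0 is the refined latent produced by the model itself,
# not an encoding of the sparse GT, so every position carries a value.
rng = random.Random(0)
refined_latent = [0.2, -0.4, 1.1]
x_t, eps = add_noise(refined_latent, alpha_bar_t=0.5, rng=rng)
```

The dense latent supplies the diffusion target everywhere, while the sparse GT only enters through the masked depth-space error.
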

Loss & Training

The total loss is a weighted sum of three terms: \(L = \lambda_1 L_{ddim} + \lambda_2 L_{pixel} + \lambda_3 L_{latent}\). Here \(L_{ddim} = \|x_{t-1} - \mu_\theta(x_t, t, c)\|^2\) supervises the denoising step; \(L_{pixel}\) is the pixel-level depth loss (a variant of SILog); and \(L_{latent} = \|x_0 - \hat{x}_0\|^2\) aligns the latent space. Training uses 1000 diffusion steps and inference uses 20 denoising steps. During the first 50% of training iterations, auxiliary L1 and L2 losses are used. The model is trained with the AdamW optimizer on 8×A100 GPUs for 30 epochs.
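
The weighted sum can be sketched as follows. The exact SILog variant and the values of \(\lambda_1, \lambda_2, \lambda_3\) are not given in this summary, so the code uses the common scale-invariant log form with an assumed variance-focus coefficient of 0.85 and placeholder weights; all function names are hypothetical.

```python
import math

def sq_norm(a, b):
    """Squared L2 distance ||a - b||^2 between two flat vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def silog(pred_depth, gt_depth, variance_focus=0.85):
    """Scale-invariant log loss over valid pixels (assumed: gt > 0 means valid)."""
    d = [math.log(p) - math.log(g) for p, g in zip(pred_depth, gt_depth) if g > 0]
    if not d:
        return 0.0
    mean_sq = sum(v * v for v in d) / len(d)
    mean = sum(d) / len(d)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(mean_sq - variance_focus * mean * mean, 0.0))

def total_loss(l_ddim, l_pixel, l_latent, weights=(1.0, 1.0, 1.0)):
    """L = lambda_1 * L_ddim + lambda_2 * L_pixel + lambda_3 * L_latent."""
    w1, w2, w3 = weights
    return w1 * l_ddim + w2 * l_pixel + w3 * l_latent
```

With `variance_focus=1.0` the pixel term becomes fully scale-invariant, so a prediction that is off by a constant factor incurs no penalty; smaller values reintroduce some sensitivity to absolute scale.
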

Key Experimental Results

Main Results

Dataset	Metric	DiffusionDepth	Prev. SOTA (URCDC-Depth)	Gain
KITTI Eigen (0-80m)	RMSE↓	2.016	2.032	0.8%
KITTI Eigen (0-80m)	Abs Rel↓	0.050	0.050	-
KITTI Official (0-50m)	RMSE↓	1.418	1.528	7.2%
KITTI Official (0-50m)	Abs Rel↓	0.041	0.049	16.3%
NYU-Depth-V2	RMSE↓	0.295	-	-
NYU-Depth-V2	Abs Rel↓	0.085	-	-
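
The Gain column is consistent with the relative reduction of each lower-is-better metric with respect to the previous SOTA. The short check below reproduces the reported percentages from the table values; `relative_gain` is a hypothetical helper name.

```python
def relative_gain(ours, baseline):
    """Percentage reduction of a lower-is-better metric relative to the baseline."""
    return 100.0 * (baseline - ours) / baseline

kitti_eigen_rmse = round(relative_gain(2.016, 2.032), 1)       # Eigen split, RMSE
kitti_official_rmse = round(relative_gain(1.418, 1.528), 1)    # Official split, RMSE
kitti_official_absrel = round(relative_gain(0.041, 0.049), 1)  # Official split, Abs Rel
```
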

Ablation Study

Configuration	Rel.↓	RMSE↓	Description
20 inference steps (Standard)	0.086	0.298	Default configuration
15 steps (Direct change)	0.118	0.455	Severe performance degradation
10 steps (Direct change)	0.182	0.751	Unusable
15 steps (Retrained)	-	~0.30	Only minor degradation
Diffusion on sparse GT (KITTI)	-	Not converging	Mode collapse
Diffusion on sparse GT (NYU)	-	Comparable	Feasible in dense GT scenes

Key Findings

The self-diffusion mechanism is the key to successful training in outdoor scenes—direct diffusion on sparse GT does not converge on KITTI.
Visualization of the denoising process shows: the first 10 steps establish scene structure, and subsequent steps refine distance relationships and boundaries.
Using \(\times 2\) downsampling in the depth latent space is slightly superior to \(\times 4\) downsampling.
The framework is compatible with various visual backbones (ResNet, Swin) and can be combined with classification-based methods like Bins.
Inference speed: 14 FPS with ResNet backbone, 5 FPS with Swin backbone (RTX 3090, 20 steps).

Highlights & Insights

First to introduce diffusion models to depth estimation: Opens up a new paradigm of generative methods in 3D perception tasks.
Ingeniously designed self-diffusion mechanism: Solves the core difficulty of training generative models under sparse GT with a simple and clean idea.
Visualization reveals denoising semantics: Visualizations of the intermediate steps suggest an interpretable progression in which the process first identifies shape boundaries and then corrects depth relationships.
Highly versatile: The diffusion head can be freely integrated with different backbones and depth feature extraction methods.

Limitations & Future Work

Inference speed is constrained by the number of denoising steps; while 20 steps is practical, it remains slower than direct regression methods.
Directly changing the number of inference steps (without retraining) leads to severe performance drops, limiting flexibility.
Performance on the online benchmark (KITTI Online) is slightly lower than VA-Depth and URCDC-Depth, likely due to not using advanced feature extraction techniques such as long-range attention.
Faster sampling strategies (such as Consistency Models) can be explored to reduce inference steps to 1-4.

Related Work & Insights

DiffusionDet: A pioneer in object detection using diffusion models, framing bounding box generation as a denoising process.
Latent Diffusion: The concept of performing diffusion in the latent space directly inspired the design of the depth latent space in this study.
VA-Depth: Performs depth refinement using variational inference, representing a non-diffusion refinement paradigm.
Insights: The iterative refinement paradigm of diffusion models can be extended to other 3D perception tasks such as stereo matching and depth completion.

Rating

Novelty: ⭐⭐⭐⭐ Establishes diffusion models for depth estimation for the first time; the self-diffusion mechanism is highly creative.
Experimental Thoroughness: ⭐⭐⭐⭐ Indoor and outdoor datasets, extensive ablations, and visualizations of the denoising process.
Writing Quality: ⭐⭐⭐⭐ Clearly structured with well-articulated motivations.
Value: ⭐⭐⭐⭐ Pioneering work that paves the way for subsequent methods like Marigold.