MVDD: Multi-View Depth Diffusion Models¶

Conference: ECCV 2024
arXiv: 2312.04875
Code: Project Page
Area: Image Generation
Keywords: Diffusion Models, Multi-View Depth, 3D Shape Generation, Epipolar Attention, Depth Completion

TL;DR¶

MVDD is a diffusion model built on a multi-view depth map representation. By combining epipolar "line segment" attention with denoising depth fusion, it generates 3D-consistent, high-quality shapes and can synthesize dense point clouds with more than 20K points.

Background & Motivation¶

Background: 3D shape generation is an important direction in AIGC. Existing generative models include methods based on implicit functions (AutoSDF, 3D-LDM), voxels (Vox-Diff), and point clouds (DPM, PVD, LION). While diffusion models have achieved immense success in 2D image generation, replicating this success in 3D generation remains challenging.

Limitations of Prior Work: (a) Implicit methods suffer from cubic computational growth with resolution or generate over-smoothed shapes; (b) point cloud diffusion models train extremely slowly on unstructured data (\(>10000\) epochs) and can only generate around 2048 points, failing to capture fine details; (c) multi-view RGB diffusion suffers from the Janus problem and 3D inconsistency.

Key Challenge: High-quality 3D generation requires high resolution and fine details, but existing point cloud, voxel, or implicit representations are either limited in resolution or excessively costly to train. A representation is needed that fits the diffusion framework well and efficiently represents complex 3D shapes.

Goal: Design a diffusion model capable of generating 3D-consistent multi-view depth maps for high-quality point cloud and mesh generation.

Key Insight: Multi-view depth maps are a representation that "registers" 3D surfaces onto 2D grids, naturally fitting 2D diffusion architectures while generating higher-resolution outputs than voxel or point cloud methods.

Core Idea: Replace point clouds or implicit functions with multi-view depth maps as the generation target of the 3D diffusion model, and leverage epipolar line segment attention to resolve cross-view consistency.

Method¶

Overall Architecture¶

MVDD represents a 3D shape \(\mathcal{X}\) as \(N\) multi-view depth maps \(\mathbf{x} \in \mathbb{R}^{N \times H \times W}\). The forward process adds noise to each depth map independently for \(T\) steps, while the reverse process denoises them using a U-Net. The key is introducing cross-view conditioning during the denoising process, so that the denoising step for each view is conditioned on its neighboring views:

\[p_\theta(\mathbf{x}_{t-1}^v | \mathbf{x}_t^v, \mathbf{x}_t^{r_1:r_R}) := \mathcal{N}(\mathbf{x}_{t-1}^v; \mu_\theta(\mathbf{x}_t^v, \mathbf{x}_t^{r_1:r_R}, t), \beta_t \mathbf{I})\]
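
The conditional reverse step above can be sketched as a single ancestral sampling step. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: `eps_model` is a hypothetical stand-in for the cross-view-conditioned U-Net, and the mean uses the standard DDPM noise-prediction parameterization.

```python
import numpy as np

def reverse_step(x_t, neighbors_t, t, betas, eps_model, rng):
    """One DDPM ancestral step for a single view, conditioned on neighbor views.

    x_t:         (H, W) noisy depth map of view v at step t
    neighbors_t: (R, H, W) noisy depth maps of the R neighboring views
    eps_model:   hypothetical callable (x_t, neighbors_t, t) -> predicted noise
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    eps_hat = eps_model(x_t, neighbors_t, t)
    # Posterior mean written in terms of the predicted noise (standard DDPM form).
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added at the last step
    # Variance beta_t * I, matching the Gaussian in the equation above.
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
```

Running this for every view at every step, with each view's neighbors passed in as conditioning, gives the joint multi-view denoising loop.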

Finally, the multi-view depth maps are back-projected and fused to obtain a dense point cloud (20K+ points). Optionally, SAP (Shape As Points) can be applied to reconstruct a high-quality mesh from that point cloud.
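
As a rough illustration of the back-projection step, the sketch below lifts \(N\) depth maps into one world-space point cloud. The shared pinhole intrinsics `A`, the camera-to-world poses, and the convention that a background pixel has depth 0 are all assumptions made for this example. With 8 views at \(128 \times 128\) there are at most 131,072 candidate pixels, so a cloud of 20K+ foreground points is well within reach.

```python
import numpy as np

def backproject_views(depths, A, cam_to_world, background=0.0):
    """Back-project N depth maps (N, H, W) into one world-space point cloud.

    A:            (3, 3) shared pinhole intrinsics (assumed)
    cam_to_world: (N, 4, 4) camera-to-world poses (assumed)
    Pixels whose depth equals `background` are treated as empty.
    """
    n, h, w = depths.shape
    cols, rows = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([cols, rows, np.ones_like(cols)], axis=-1)   # (H, W, 3) homogeneous pixels
    rays = pix.reshape(-1, 3) @ np.linalg.inv(A).T               # camera-space rays with z = 1
    points = []
    for d, pose in zip(depths, cam_to_world):
        flat = d.reshape(-1)
        mask = flat != background
        cam_pts = rays[mask] * flat[mask, None]                  # scale each ray by its depth
        points.append(cam_pts @ pose[:3, :3].T + pose[:3, 3])    # move into the world frame
    return np.concatenate(points, axis=0)
```

The result is an unordered `(M, 3)` array that a surface reconstruction method could consume directly.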

Key Designs¶

Epipolar Line Segment Attention: Unlike the full attention of MVDream or the epipolar attention of SyncDreamer, MVDD uses the current step's depth estimate to narrow the attention range. For a pixel \(v_{ij}\) in the source view \(v\), its depth value \(\mathbf{x}_t^{v_{ij}}\) is first back-projected into 3D space to obtain a point \(\rho^{v_{ij}}\):

\[\rho^{v_{ij}} = \mathbf{x}_t^{v_{ij}} A^{-1} v_{ij}\]
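
A small worked example may make the back-projection concrete. It assumes that \(A\) is a pinhole intrinsic matrix with focal length \(f\) and principal point \((c_x, c_y)\), and that \(v_{ij} = (i, j, 1)^\top\) is the pixel in homogeneous coordinates; the numbers are illustrative only.

\[A = \begin{pmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{pmatrix}, \qquad A^{-1} v_{ij} = \begin{pmatrix} (i - c_x)/f \\ (j - c_y)/f \\ 1 \end{pmatrix}\]

With \(f = 64\), \(c_x = c_y = 64\), pixel \(v_{ij} = (96, 32, 1)^\top\), and depth \(\mathbf{x}_t^{v_{ij}} = 2\):

\[\rho^{v_{ij}} = 2 \cdot \begin{pmatrix} 0.5 \\ -0.5 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \\ 2 \end{pmatrix}\]

Under this convention the \(z\)-coordinate of \(\rho^{v_{ij}}\) equals the depth value itself.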

Then, \(k-1\) additional equally spaced points are selected along the ray through this 3D point, giving \(k\) samples in total, and all of them are projected onto the neighboring view \(r\) to form a "line segment." Features are sampled only at these locations to serve as K and V. Design Motivation: The current depth estimate narrows the search from the entire epipolar line to the vicinity of the estimated position, which balances efficiency and effectiveness.
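
The sampling of the line segment might look roughly like the following. This is a sketch under assumptions: the spacing of the samples (`half_width`), the shared intrinsics `A`, and the relative pose `T_v_to_r` are hypothetical parameters introduced for illustration.

```python
import numpy as np

def project_segment(pixel, depth, A, T_v_to_r, k=5, half_width=0.1):
    """Sample k points around the back-projected point along its ray and
    project them into the neighboring view r.

    pixel:      (u, v) pixel coordinates in source view v
    T_v_to_r:   (4, 4) rigid transform from view v's camera frame to view r's (assumed)
    half_width: hypothetical half-length of the depth interval around `depth`
    Returns (k, 2) pixel coordinates in view r and the (k,) depths seen from r.
    """
    ray = np.linalg.inv(A) @ np.array([pixel[0], pixel[1], 1.0])     # camera-space ray, z = 1
    depths = np.linspace(depth - half_width, depth + half_width, k)  # includes `depth` when k is odd
    pts_v = depths[:, None] * ray                                    # (k, 3) points on the ray
    pts_r = pts_v @ T_v_to_r[:3, :3].T + T_v_to_r[:3, 3]             # same points in view r's frame
    proj = pts_r @ A.T                                               # homogeneous pixel coordinates
    return proj[:, :2] / proj[:, 2:3], pts_r[:, 2]
```

The returned pixel coordinates are where features would be sampled for K and V, and the returned depths are the values that can later be concatenated to V.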

Visibility Threshold Filtering: Visibility is determined by checking the difference between the back-projected depth and the predicted depth of the neighboring view:

\[M(r_{mn}) = \|z(\pi_{v \to r} \rho^{v_{ij}}) - \mathbf{x}_t^{r_{mn}}\| < \tau\]

Attention weights at locations where this condition is not met are set to a very small value.
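
One common way to realize this kind of filtering is to push the logits of invisible samples to a large negative value before the softmax, so their weights become negligible. The sketch below shows that variant for a single query pixel; the function names and the `masked_logit` constant are assumptions for illustration, not details confirmed by the source.

```python
import numpy as np

def visibility_mask(projected_depth, neighbor_depth, tau):
    """M = |z(projected point) - predicted neighbor depth| < tau, per sampled location."""
    return np.abs(projected_depth - neighbor_depth) < tau

def masked_attention(q, keys, values, visible, masked_logit=-1e9):
    """Single-query attention over k sampled locations with a visibility mask.

    q: (C,), keys: (k, C), values: (k, D), visible: (k,) boolean mask.
    """
    logits = keys @ q / np.sqrt(q.shape[0])
    logits = np.where(visible, logits, masked_logit)   # suppress invisible samples
    logits = logits - logits.max()                      # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ values, weights
```

Appending each sample's depth as an extra column of `values` means the attention output also carries a visibility-weighted depth.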

Depth Concatenation: The depth values of the sampled points \(\{z(\rho_1^{v_{ij}}), \ldots, z(\rho_k^{v_{ij}})\}\) are concatenated to the feature dimension of V, enabling the model to perceive the spatial positions of these points. Intuition: If the geometric feature of \(v_{ij}\) closely matches that of a sampled point \(\rho_1^{v_{ij}}\), the denoised depth should shift toward the depth value of that point.

Denoising Depth Fusion: Even with epipolar attention ensuring semantic consistency, the back-projected 3D points may still not align perfectly, causing "double-layer" artifacts. Borrowing from MVS (multi-view stereo) methods, depth averaging is performed after the U-Net output during denoising steps: for each pixel, the back-projected depths from the other views in which it is visible are averaged, re-noised, and passed to the next step. The visibility check uses two thresholds:

\[\|v_{ij} - v_{\tilde{i}\tilde{j}}\| < \psi_{\max}, \quad \frac{|\mathbf{x}^{v_{ij}} - z(\rho^{v_{\tilde{i}\tilde{j}}})|}{|\mathbf{x}^{v_{ij}}|} < \epsilon_\theta\]

This fusion is only applied in the final 20 steps, with an extra depth filtering in the final step to remove invisible points.
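
The averaging step with its two thresholds can be sketched as follows. The sketch assumes the other views' depths have already been re-expressed in view \(v\) and that a per-pixel reprojection error is available; the default threshold values are placeholders, and the re-noising step is left out.

```python
import numpy as np

def fuse_depth(own_depth, reproj_depth, reproj_err, psi_max=1.0, eps=0.01):
    """Average each pixel's depth with consistent estimates from other views.

    own_depth:    (H, W) depth predicted for view v (assumed positive)
    reproj_depth: (R, H, W) depths from the other views, re-expressed in view v
    reproj_err:   (R, H, W) pixel distance between a pixel and its reprojection
    psi_max, eps: the two thresholds; the default values are placeholders
    Returns the fused depth and the number of other views used per pixel.
    """
    denom = np.maximum(np.abs(own_depth[None]), 1e-8)            # guard against division by zero
    rel_err = np.abs(own_depth[None] - reproj_depth) / denom
    ok = (reproj_err < psi_max) & (rel_err < eps)                # both checks must pass
    total = own_depth + np.where(ok, reproj_depth, 0.0).sum(axis=0)
    count = 1 + ok.sum(axis=0)                                   # own estimate always counts
    return total / count, ok.sum(axis=0)
```

In a sampler that counts down from \(t = T\), a guard such as `if t < 20:` would restrict the call to the final 20 steps.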

Loss & Training¶

Training uses the standard DDPM noise-prediction objective:

\[L_t = \mathbb{E}_{t, \mathbf{x}_0, \epsilon_t} \left[\|\epsilon_t - \epsilon_\theta(\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \epsilon_t, t)\|^2\right]\]

Training setup: \(T=1000\) steps, cosine schedule, depth map resolution of \(128 \times 128\), 8 views, trained on 8×A100 for approximately 3000 epochs.
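
For reference, the sketch below pairs a cosine noise schedule with a one-sample estimate of the objective. It assumes the widely used cosine formulation with offset \(s = 0.008\); the source only says "cosine schedule", so the exact variant is an assumption, and `eps_model` is a hypothetical stand-in for the network.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal level alpha_bar_1 .. alpha_bar_T for a cosine schedule."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1 + s) * np.pi / 2) ** 2
    return (f / f[0])[1:]

def ddpm_loss(x0, eps_model, alpha_bar, rng):
    """Monte Carlo estimate of the noise-prediction loss for one random timestep."""
    t = rng.integers(len(alpha_bar))
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    # Mean instead of sum: the same objective up to a constant factor.
    return np.mean((eps - eps_model(x_t, t)) ** 2)
```

Here `x0` would be a stack of clean depth maps, for example an array of shape `(8, 128, 128)` under the setup above.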

Key Experimental Results¶

Main Results¶

ShapeNet Unconditional Generation (1-NNA EMD↓):

Category	DPM	PVD	LION	3D-LDM	IM-GAN	MVDD
Airplane	73.47	64.89	63.49	80.10	64.04	62.50
Car	80.33	71.29	65.70	-	57.04	56.80
Chair	65.73	56.14	57.31	65.30	55.54	54.51
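
For readers unfamiliar with the metric, 1-NNA (1-nearest-neighbor accuracy) pools generated and reference shapes, classifies each shape by the label of its nearest neighbor in a leave-one-out fashion, and reports the accuracy; values near 50% mean the two sets are hard to tell apart. The sketch below uses Chamfer distance for simplicity, whereas the table above reports the EMD variant. It returns a fraction rather than a percentage.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point clouds a (n, 3) and b (m, 3)."""
    d = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def one_nna(generated, reference, dist=chamfer):
    """Leave-one-out 1-nearest-neighbor accuracy between two sets of point clouds."""
    clouds = list(generated) + list(reference)
    labels = np.array([0] * len(generated) + [1] * len(reference))
    n = len(clouds)
    D = np.full((n, n), np.inf)          # inf on the diagonal excludes self-matches
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dist(clouds[i], clouds[j])
    nearest = D.argmin(axis=1)
    return float(np.mean(labels[nearest] == labels))
```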

Depth Completion (EMD×10²↓):

Category	PointFlow	PVD	DPF-Net	MVDD
Airplane	1.180	1.030	1.105	0.900
Chair	3.649	2.939	3.320	2.400
Car	2.851	2.146	2.318	1.460

Ablation Study¶

Ablation on Epipolar Attention Design (Chair, 1-NNA CD↓):

Component	Full Model	w/o Line Segment Attention	w/o Depth Concatenation	w/o Threshold Filtering
1-NNA↓	Best	Significant degradation	Quality degradation	Slight degradation

The effectiveness of denoising depth fusion is intuitively shown in Fig. 4: noticeable "double-layer" artifacts appear when fusion is not utilized.

Key Findings¶

The point clouds generated by MVDD are roughly 10 times denser than those of existing point cloud diffusion models (20K+ vs. 2048 points), which lets them capture fine structures such as chair slats and thin airplane wings.
As point cloud density increases, the performance of sparse methods like LION drops sharply, whereas MVDD remains stable.
MVDD comprehensively surpasses all baselines in the depth completion task, demonstrating that the model has learned realistic 3D shape priors.
It can serve as a 3D prior for downstream tasks such as GAN inversion to prevent geometric collapse.

Highlights & Insights¶

Insights on representation choice: Multi-view depth maps "reduce the dimensionality" of 3D generation to 2D generation, perfectly fitting mature 2D diffusion architectures, which is more efficient than denoising on unstructured point clouds.
Utilizing intermediate results: Epipolar "line segment" attention cleverly utilizes the depth estimates from diffusion intermediate steps to narrow down the search range, which was unexploited in previous multi-view diffusion works.
Versatility: The same unconditional generative model can be directly applied to depth completion and used as a 3D prior, demonstrating exceptional flexibility.

Limitations & Future Work¶

The \(128 \times 128\) depth map resolution still has room for improvement; higher resolutions could provide more geometric details.
The fixed camera configuration of 8 views may not be suitable for all scenes (e.g., objects with thin structures might require more views).
Training data is limited to single classes from ShapeNet, and cross-category generalization has not been demonstrated.
Integration with current mainstream text-to-3D methods (SDS-based) has not yet been explored.
Inference requires 1000 denoising steps, leaving room for speed optimization.

Related Work¶

DDPM/DDIM: Foundational diffusion frameworks.
MVDream: Multi-view RGB diffusion using 3D self-attention. MVDD borrows the concept of cross-view interaction but designs a more efficient attention mechanism specifically for depth maps.
PVD: Point cloud diffusion baseline, which denoises directly on point coordinates, leading to slow training and limited numbers of points.
SAP: A post-processing method for reconstructing meshes from point clouds.

Rating¶

Novelty: ⭐⭐⭐⭐ — First to introduce multi-view depth representations to 3D diffusion generation, with a cleverly designed epipolar line segment attention.
Experimental Thoroughness: ⭐⭐⭐⭐ — Covers three tasks (generation, completion, and GAN prior) with comprehensive quantitative and qualitative results.
Writing Quality: ⭐⭐⭐⭐ — Clear motivation with explicit physical intuition behind each module design.
Value: ⭐⭐⭐⭐ — The potential of multi-view depth as a 3D representation warrants further research follow-up.