PCDreamer: Point Cloud Completion Through Multi-view Diffusion Priors

Conference: CVPR 2025
arXiv: 2411.19036
Project page: https://gsw-d.github.io/PCDreamer/
Area: 3D Vision / Point Cloud Completion
Keywords: Point Cloud Completion, Multi-view Diffusion Priors, Multi-modality Fusion, Shape Consolidation, Confidence Filtering

TL;DR

PCDreamer uses large-scale multi-view diffusion models to "dream" multi-view images of the missing regions from a partial point cloud. A multi-modality shape fusion module and a confidence-guided shape consolidation module then turn these images into a high-fidelity completion, with particularly strong recovery of fine-grained local details.

Background & Motivation

Key Challenge

Point cloud completion is a critical task in 3D vision, but single-view partial point clouds often lack more than half of their shape information, resulting in an exceptionally large search space.

Background

Existing methods extract features from partial point clouds to directly predict the missing regions, which tends to yield random guesses under severe occlusions (e.g., a table lamp missing its top shade).

Limitations of Prior Work

Although using images as auxiliary guidance can improve performance, acquiring paired image and partial point cloud data is highly challenging in practice.

Proposed Solution

Multi-view consistent images generated by large-scale multi-view diffusion models encode global and local shape cues, which are particularly beneficial for shape completion (e.g., capturing symmetric structures).

Supplementary Notes

However, multi-view images generated by diffusion models exhibit inherent inconsistencies, and direct fusion introduces noise and unreliable points.

Method

Overall Architecture

PCDreamer consists of three core modules: (1) The multi-view image generation module utilizes a chain of foundation models (ControlNet→RGB, Wonder3D/SVD→multi-view, DepthAnything→depth map) to "dream" multi-view images from the partial point cloud; (2) The multi-modality shape fusion module fuses the features of the point cloud and multi-view images via an attention mechanism to generate an initial complete shape; (3) The shape consolidation module filters out unreliable points based on confidence scores and performs upsampling to generate the final dense, uniform, and complete point cloud.

Key Designs

Design 1: Multi-modality Shape Fusion

Function: Effectively fuses the reliable geometric information of the partial point cloud with the global and local shape cues in the multi-view images.
Mechanism: A dual-encoder architecture is designed. The point cloud encoder employs a patch-based Transformer (extracting patch features via DGCNN + sinusoidal position encoding + Transformer encoder) to obtain \(\mathcal{F}_P \in \mathbb{R}^{128}\). The image encoder uses a ResNet backbone + camera pose MLP encoding + Transformer encoder to obtain \(\mathcal{F}_I \in \mathbb{R}^{128}\). Fusion is performed through cross-attention in which point cloud features serve as the queries Q and image features serve as the keys K and values V: \(\mathcal{F}_{fusion} = \text{softmax}\left(\frac{QK^{\top}}{\sqrt{128}}\right)V\), where 128 is the feature dimension.
Design Motivation: Directly fusing multi-view depth maps into 3D would result in irregular and noisy shapes; by performing fusion in the feature space rather than the geometric space, the system can extract useful information while mitigating inconsistencies.
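The cross-attention step of Design 1 can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the learned query/key/value projections, multiple heads, and the Transformer layers around the attention are omitted, the function names are hypothetical, and the toy feature dimension is 4 rather than 128.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: softmax(Q K^T / sqrt(d)) V.

    queries: point tokens, each a list of d floats (point branch)
    keys, values: image tokens, each a list of d floats (image branch)
    Returns one fused d-dimensional token per query.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in queries:
        # Similarity of this point token to every image token
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Convex combination of the image value vectors
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(d)])
    return fused

# Toy example (assumed data): 2 point tokens attend over 3 image tokens.
F_P = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
F_I = [[10.0, 0.0, 0.0, 0.0], [0.0, 10.0, 0.0, 0.0], [0.0, 0.0, 10.0, 0.0]]
F_fusion = cross_attention(F_P, F_I, F_I)
```

Each fused token is a weighted average of image tokens, so a point token pulls in mostly the image features it agrees with. This is the sense in which fusion in feature space can down-weight inconsistent views instead of copying their geometry directly.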

Design 2: Confidence-Guided Shape Consolidation

Function: Filters out unreliable points from the initial complete point cloud to generate clean and dense final results.
Mechanism: A confidence score is calculated for each predicted point by considering two factors: (1) the consistency between the points and the multi-view features \(\mathcal{F}_I\); (2) the mutual consistency among the points. The point coordinates are encoded using an MLP and concatenated with image features, and the average self-attention scores are processed via a Sigmoid function to serve as confidence scores. The points are partitioned into high and low confidence sets (75% / 25%). The high-confidence set serves as the filtered intermediate point cloud, which is then processed by an upsampling network to generate the final dense results.
Design Motivation: The inherent inconsistencies in multi-view images generated by diffusion models often cause noise points and local holes in the initial completion results.
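The partition step of Design 2 can be sketched as follows. The raw per-point scores stand in for the averaged self-attention scores, which would come from the network and are simply given as inputs here. Splitting by rank at a fixed 75% ratio is an assumption about how the 75% / 25% partition is realized, and `split_by_confidence` is a hypothetical name.

```python
import math

def sigmoid(x):
    """Map a real-valued score to a confidence in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def split_by_confidence(points, raw_scores, keep_ratio=0.75):
    """Partition points into high- and low-confidence sets.

    points: list of (x, y, z) tuples from the initial completion
    raw_scores: one real-valued score per point (assumed to be given)
    keep_ratio: fraction of points kept as high-confidence
    """
    confidences = [sigmoid(s) for s in raw_scores]
    # Indices sorted from most to least confident
    order = sorted(range(len(points)),
                   key=lambda i: confidences[i], reverse=True)
    n_keep = int(round(keep_ratio * len(points)))
    high = [points[i] for i in order[:n_keep]]
    low = [points[i] for i in order[n_keep:]]
    return high, low

# Toy example (assumed data): the outlier receives the lowest score.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
raw_scores = [2.0, 1.5, 0.7, -3.0]
high, low = split_by_confidence(points, raw_scores)
```

In this sketch only `high` would be passed on to the upsampling network; `low` is discarded.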

Design 3: Flexible Multi-view Generation Pipeline

Function: Automatically generates matching multi-view RGB and depth images from a partial point cloud.
Mechanism: A chain of foundation models is designed: partial point cloud → depth map → ControlNet generating RGB → Wonder3D/SVD generating multi-view RGB → DepthAnything generating multi-view depth maps. This method does not rely on a specific diffusion model and is compatible with both Wonder3D and SVD.
Design Motivation: There are no off-the-shelf foundation models capable of directly generating consistent multi-view depth maps from a partial point cloud; combining multiple large models compensates for their individual limitations.
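Only the first link of this chain (partial point cloud → depth map) is simple enough to sketch without the foundation models themselves. The example below assumes an orthographic camera looking along +z and a tiny image resolution; the camera model, resolution, and rendering details actually used are not stated here, and `render_depth_map` is a hypothetical helper.

```python
import math

def render_depth_map(points, size=5, extent=1.0):
    """Orthographic depth map of a point cloud, camera looking along +z.

    points: iterable of (x, y, z) tuples
    size: output resolution (size x size pixels)
    extent: half-width of the viewing window in x and y
    Returns a size x size grid; pixels hit by no point hold math.inf.
    """
    depth = [[math.inf] * size for _ in range(size)]
    for x, y, z in points:
        if abs(x) > extent or abs(y) > extent:
            continue  # outside the viewing window
        # Map x, y from [-extent, extent] to pixel indices [0, size - 1]
        col = round((x + extent) / (2 * extent) * (size - 1))
        row = round((extent - y) / (2 * extent) * (size - 1))
        # z-buffer: keep the point closest to the camera (smallest z)
        depth[row][col] = min(depth[row][col], z)
    return depth

# Toy example (assumed data): two points share a pixel, one lies off-screen.
partial = [(0.0, 0.0, 0.5), (0.0, 0.0, 0.2), (-1.0, 1.0, 0.3), (2.0, 0.0, 0.1)]
depth = render_depth_map(partial)
```

The resulting depth image is what the later links would condition on: ControlNet for RGB, Wonder3D or SVD for multi-view RGB, and DepthAnything for multi-view depth.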

Loss & Training

Chamfer Distance (CD) is used to supervise the distance between the ground truth (GT) and both the initial and final point clouds.
The upsampling network employs an additional uniformity loss to ensure a uniform distribution of points.
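As a reference point, a brute-force Chamfer Distance can be written in a few lines. This sketch uses the averaged, unsquared (L1-style) variant; whether the training loss uses this or the squared variant, and how the two supervision terms are weighted, are assumptions here. The uniformity loss is not sketched because its form is not specified above.

```python
import math

def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance between two point sets.

    Averages, over both directions, the distance from each point to its
    nearest neighbor in the other set. Brute force, O(len(pred) * len(gt)).
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(pred, gt) + one_way(gt, pred))

# Toy example (assumed data): supervise both the initial and final outputs.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
initial = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.2)]
final = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
loss = chamfer_distance(initial, gt) + chamfer_distance(final, gt)
```

Practical implementations replace the nested loops with batched GPU nearest-neighbor kernels, but the quantity being computed is the same.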

Key Experimental Results

PCN Dataset Results (CD ×10³ ↓)

Main Results

Method	Plane	Cabinet	Car	Chair	Lamp	Avg CD ↓	F1 ↑
PCN	5.82	10.91	9.00	11.09	11.91	9.93	0.657
PoinTr	4.31	9.23	7.60	8.35	8.27	7.76	0.810
SnowFlakeNet	3.95	8.82	7.52	7.48	6.34	6.96	0.828
AnchorFormer	3.62	8.79	7.20	7.12	—	—	—
PCDreamer	Best	Best	Best	Best	Best	Best	Best
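The F1 column is presumably the F-score commonly reported in completion benchmarks: the harmonic mean of precision and recall at a distance threshold. The sketch below shows one way to compute it; the threshold value and the use of unsquared Euclidean distance are assumptions, not settings taken from the table.

```python
import math

def f_score(pred, gt, threshold=0.01):
    """F-score between a predicted and a ground-truth point set.

    precision: fraction of predicted points within `threshold` of some GT point
    recall: fraction of GT points within `threshold` of some predicted point
    """
    def hit_ratio(src, dst):
        hits = sum(1 for p in src
                   if min(math.dist(p, q) for q in dst) < threshold)
        return hits / len(src)
    precision = hit_ratio(pred, gt)
    recall = hit_ratio(gt, pred)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Toy example (assumed data): one of two points matches in each direction.
pred_pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_pts = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
score = f_score(pred_pts, gt_pts)
```

Unlike Chamfer Distance, this metric is bounded in [0, 1] and is less sensitive to a few far-away outliers, which is why the two are usually reported together.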

Key Findings

PCDreamer's advantage is largest on categories where large parts of the shape are missing, such as lamps (where the missing top structure must be inferred).
Using SVD for multi-view generation achieves better cross-category generalization than Wonder3D.
Feature space fusion achieves significantly better results than direct 3D fusion, as verified by ablation studies.
Confidence filtering discards roughly the 25% of predicted points with the lowest confidence, which significantly improves completion accuracy.
Depth maps provide more accurate shape information for completion than RGB images.

Highlights & Insights

Leveraging Diffusion Priors for Point Cloud Completion: Creatively reformulates "acquiring information about missing regions" into "allowing diffusion models to imagine the appearance of missing regions".
Dual Considerations in Confidence Scoring: Simultaneously considers the consistency between points and multi-view features as well as the mutual consistency among points, effectively addressing diffusion-related inconsistencies.
No Paired Data Required: Multi-view images are automatically generated via a foundation model chain, addressing the practical challenge of acquiring paired data.

Limitations & Future Work

Heavy reliance on a chain of foundation models (ControlNet + Wonder3D/SVD + DepthAnything) makes inference relatively slow.
The quality of diffusion generation directly affects the completion results, making it potentially non-robust to low-quality or out-of-domain inputs.
The pipeline design might lead to error accumulation (depth estimation error → multi-view inconsistency → completion error).
Future work can explore end-to-end training or more efficient multi-view generation methods.

Related Work

PoinTr [Yu et al.] predicts missing points from partial point clouds using a Transformer architecture.
Wonder3D [Long et al.] generates multi-view images and normals through cross-domain attention.
DepthAnything [Yang et al.] provides robust single-view depth estimation.
PCDreamer introduces a new paradigm for point cloud completion based on generative priors.

Rating

⭐⭐⭐⭐ — Leveraging diffusion priors to "dream" missing information for point cloud completion is a creative concept. The designs of multi-modality fusion and confidence consolidation are reasonable and effective. Its advantage in fine structure recovery is impressive, but inference efficiency remains a notable bottleneck.