PCDreamer: Point Cloud Completion Through Multi-view Diffusion Priors¶
Conference: CVPR 2025
arXiv: 2411.19036
Code: https://gsw-d.github.io/PCDreamer/
Area: 3D Vision / Point Cloud Completion
Keywords: Point Cloud Completion, Multi-view Diffusion Priors, Multi-modality Fusion, Shape Consolidation, Confidence Filtering
TL;DR¶
PCDreamer uses large-scale multi-view diffusion models to "dream" multi-view images of the missing regions from a partial point cloud. It then completes the shape with a multi-modality shape fusion module followed by a confidence-guided shape consolidation module, and is reported to be particularly strong at reconstructing fine-grained local details.
Background & Motivation¶
Key Challenge¶
Point cloud completion is a critical task in 3D vision, but a single-view partial point cloud often lacks more than half of the shape information, which makes the search space for the missing geometry exceptionally large.
Background¶
Existing methods extract features from the partial point cloud and predict the missing regions directly. Under severe occlusion (e.g., a table lamp missing its top shade), this tends to degenerate into guesswork.
Limitations of Prior Work¶
Using images as auxiliary guidance can improve performance, but acquiring paired image and partial point cloud data is highly challenging in practice.
Proposed Solution¶
Multi-view consistent images generated by large-scale multi-view diffusion models encode both global and local shape cues (e.g., symmetric structures), which are particularly beneficial for shape completion.
Supplementary Notes¶
Multi-view images generated by diffusion models are not perfectly consistent across views, so fusing them directly introduces noise and unreliable points.
Method¶
Overall Architecture¶
PCDreamer consists of three core modules: (1) The multi-view image generation module utilizes a chain of foundation models (ControlNet→RGB, Wonder3D/SVD→multi-view, DepthAnything→depth map) to "dream" multi-view images from the partial point cloud; (2) The multi-modality shape fusion module fuses the features of the point cloud and multi-view images via an attention mechanism to generate an initial complete shape; (3) The shape consolidation module filters out unreliable points based on confidence scores and performs upsampling to generate the final dense, uniform, and complete point cloud.
Key Designs¶
Design 1: Multi-modality Shape Fusion
- Function: Effectively fuses the reliable geometric information of the partial point cloud with the global and local shape cues in the multi-view images.
- Mechanism: A dual-encoder architecture is used. The point cloud encoder is a patch-based Transformer (DGCNN patch features + sinusoidal position encoding + Transformer encoder) that produces \(\mathcal{F}_P \in \mathbb{R}^{128}\). The image encoder combines a ResNet backbone, an MLP encoding of the camera pose, and a Transformer encoder to produce \(\mathcal{F}_I \in \mathbb{R}^{128}\). The two are fused by cross-attention, with point cloud features as Q and image features as K/V: \(\mathcal{F}_{fusion} = \text{softmax}\left(\frac{QK^\top}{\sqrt{128}}\right)V\).
- Design Motivation: Directly fusing multi-view depth maps into 3D would result in irregular and noisy shapes; by performing fusion in the feature space rather than the geometric space, the system can extract useful information while mitigating inconsistencies.
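The cross-attention step in Design 1 can be illustrated with a minimal sketch. It assumes plain scaled dot-product attention with no learned projection matrices, multiple heads, or batching; the function names `softmax` and `cross_attention` are illustrative and are not taken from the PCDreamer code. Pure Python is used for readability, whereas a real implementation would use a tensor library.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q, k, v):
    """Scaled dot-product cross-attention (illustrative sketch).

    q: n_q rows of length d   (stand-in for point cloud features)
    k: n_k rows of length d   (stand-in for image features)
    v: n_k rows of length d_v (stand-in for image features)
    Returns n_q fused rows of length d_v.
    """
    d = len(q[0])
    scale = math.sqrt(d)
    fused = []
    for q_row in q:
        # One similarity score per key: (q . k_j) / sqrt(d)
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale
                  for k_row in k]
        weights = softmax(scores)
        # Attention-weighted sum of the value rows
        fused.append([sum(w * v_row[c] for w, v_row in zip(weights, v))
                      for c in range(len(v[0]))])
    return fused
```

With 128-dimensional features the scale factor becomes \(\sqrt{128}\), matching the formula above. Each fused row is a convex combination of image-feature rows, so no 3D points from the generated views enter the geometry directly.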
Design 2: Confidence-Guided Shape Consolidation
- Function: Filters out unreliable points from the initial complete point cloud to generate clean and dense final results.
- Mechanism: Each predicted point receives a confidence score that reflects two factors: (1) its consistency with the multi-view features \(\mathcal{F}_I\), and (2) its consistency with the other predicted points. Point coordinates are encoded by an MLP and concatenated with the image features; the averaged self-attention scores are then passed through a Sigmoid to give per-point confidences. Points are split by confidence into a high-confidence set (75%) and a low-confidence set (25%). The high-confidence set is kept as the filtered intermediate point cloud, which an upsampling network densifies into the final result.
- Design Motivation: The inherent inconsistencies in multi-view images generated by diffusion models often cause noise points and local holes in the initial completion results.
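The 75% / 25% split in Design 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: `raw_scores` is a hypothetical stand-in for the averaged self-attention scores (how they are computed is outside the sketch), and `split_by_confidence` and `keep_ratio` are names introduced here for illustration.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def split_by_confidence(points, raw_scores, keep_ratio=0.75):
    """Split points into high- and low-confidence sets.

    raw_scores: one score per point (assumed to come from attention).
    keep_ratio: fraction of points kept as high confidence.
    Returns (high_points, low_points, confidences).
    """
    if len(points) != len(raw_scores):
        raise ValueError("need exactly one score per point")
    conf = [sigmoid(s) for s in raw_scores]
    # Rank point indices from most to least confident
    order = sorted(range(len(points)), key=lambda i: conf[i], reverse=True)
    n_keep = int(round(keep_ratio * len(points)))
    keep = sorted(order[:n_keep])   # restore original point order
    drop = sorted(order[n_keep:])
    return [points[i] for i in keep], [points[i] for i in drop], conf
```

Only the high-confidence set would be passed on to the upsampling network. Because the Sigmoid is monotonic, the ranking is the same as ranking by raw score; it mainly maps scores into (0, 1) so they can be read as confidences.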
Design 3: Flexible Multi-view Generation Pipeline
- Function: Automatically generates matching multi-view RGB and depth images from a partial point cloud.
- Mechanism: A chain of foundation models is designed: partial point cloud → depth map → ControlNet generating RGB → Wonder3D/SVD generating multi-view RGB → DepthAnything generating multi-view depth maps. This method does not rely on a specific diffusion model and is compatible with both Wonder3D and SVD.
- Design Motivation: There are no off-the-shelf foundation models capable of directly generating consistent multi-view depth maps from a partial point cloud; combining multiple large models compensates for their individual limitations.
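The first link of the chain in Design 3 (partial point cloud → depth map) can be illustrated with a simple z-buffer projection. The sketch below rests on assumptions that this summary does not confirm: an orthographic camera looking down the -z axis, points normalized to \([-1, 1]^3\), and one pixel per point with no splatting. The function `render_depth` and its parameters are hypothetical; the later stages (ControlNet, Wonder3D/SVD, DepthAnything) are external models and are not sketched here.

```python
def render_depth(points, size=8, background=0.0):
    """Orthographic z-buffer rendering of points in [-1, 1]^3.

    The camera sits at z = +1 looking down the -z axis, so
    depth = 1 - z (a smaller depth means closer to the camera).
    Pixels that no point falls into keep the `background` value.
    """
    depth = [[None] * size for _ in range(size)]
    for x, y, z in points:
        # Map x, y from [-1, 1] to pixel indices, clamping at the border
        col = min(size - 1, max(0, int((x + 1.0) / 2.0 * size)))
        row = min(size - 1, max(0, int((1.0 - y) / 2.0 * size)))
        d = 1.0 - z
        # Keep only the nearest point per pixel
        if depth[row][col] is None or d < depth[row][col]:
            depth[row][col] = d
    return [[background if d is None else d for d in r] for r in depth]
```

A sparse partial scan leaves many empty pixels in such a map, which is one reason a conditional image generator is useful for turning it into a plausible RGB view.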
Loss & Training¶
- Chamfer Distance (CD) to the ground truth (GT) is used to supervise both the initial and the final point clouds.
- The upsampling network employs an additional uniformity loss to ensure a uniform distribution of points.
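For reference, a brute-force Chamfer Distance can be sketched as below. It assumes the symmetric L1-style variant (mean nearest-neighbour Euclidean distance in both directions, averaged), which is commonly reported on the PCN benchmark; this summary does not state which variant or loss weights PCDreamer uses, and the function name `chamfer_l1` is illustrative.

```python
import math


def chamfer_l1(pred, gt):
    """Symmetric Chamfer Distance, L1-style variant (illustrative sketch).

    pred, gt: non-empty sequences of 3D points (tuples or lists).
    Runs in O(len(pred) * len(gt)); practical code would use a
    nearest-neighbour structure or a GPU kernel instead.
    """
    def one_way(src, dst):
        # Mean distance from each source point to its nearest target point
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(pred, gt) + one_way(gt, pred))
```

Under the training setup described above, this distance would be evaluated once for the initial completion and once for the final upsampled point cloud, each against the same GT.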
Key Experimental Results¶
PCN Dataset Results (CD ×10³ ↓)¶
Main Results¶
| Method | Plane | Cabinet | Car | Chair | Lamp | Avg CD ↓ | F1 ↑ |
|---|---|---|---|---|---|---|---|
| PCN | 5.82 | 10.91 | 9.00 | 11.09 | 11.91 | 9.93 | 0.657 |
| PoinTr | 4.31 | 9.23 | 7.60 | 8.35 | 8.27 | 7.76 | 0.810 |
| SnowflakeNet | 3.95 | 8.82 | 7.52 | 7.48 | 6.34 | 6.96 | 0.828 |
| AnchorFormer | 3.62 | 8.79 | 7.20 | 7.12 | — | — | — |
| PCDreamer | Best | Best | Best | Best | Best | Best | Best |
Key Findings¶
- PCDreamer's advantage is largest on categories where large, detailed parts are missing, such as lamps (which require inferring the missing top structure).
- Using SVD for multi-view generation achieves better cross-category generalization than Wonder3D.
- Feature space fusion achieves significantly better results than direct 3D fusion, as verified by ablation studies.
- Confidence filtering removes the lowest-confidence 25% of predicted points, which are treated as unreliable, and this improves completion accuracy.
- Depth maps provide more accurate shape information for completion than RGB images.
Highlights & Insights¶
- Leveraging Diffusion Priors for Point Cloud Completion: Creatively reformulates "acquiring information about missing regions" into "allowing diffusion models to imagine the appearance of missing regions".
- Dual Considerations in Confidence Scoring: Simultaneously considers the consistency between points and multi-view features as well as the mutual consistency among points, effectively addressing diffusion-related inconsistencies.
- No Paired Data Required: Multi-view images are automatically generated via a foundation model chain, addressing the practical challenge of acquiring paired data.
Limitations & Future Work¶
- Reliance on a chain of foundation models (ControlNet + Wonder3D/SVD + DepthAnything) makes inference relatively slow.
- The quality of diffusion generation directly affects the completion results, making it potentially non-robust to low-quality or out-of-domain inputs.
- The pipeline design might lead to error accumulation (depth estimation error → multi-view inconsistency → completion error).
- Future work can explore end-to-end training or more efficient multi-view generation methods.
Related Work & Insights¶
- PoinTr [Yu et al.] predicts missing points from partial point clouds using a Transformer architecture.
- Wonder3D [Long et al.] generates multi-view images and normals through cross-domain attention.
- DepthAnything [Yang et al.] provides robust single-view depth estimation.
- PCDreamer introduces a new paradigm for point cloud completion that draws on generative priors.
Rating¶
⭐⭐⭐⭐ — Leveraging diffusion priors to "dream" missing information for point cloud completion is a creative concept. The designs of multi-modality fusion and confidence consolidation are reasonable and effective. Its advantage in fine structure recovery is impressive, but inference efficiency remains a notable bottleneck.