Rethinking Multimodal Point Cloud Completion: A Completion-by-Correction Perspective¶
Conference: AAAI 2026 | arXiv: 2511.12170 | Code: https://github.com/RobWonn/PGNet | Area: 3D Vision | Keywords: Point Cloud Completion, Multimodal Fusion, Generative Prior, Correction Paradigm, Feature Alignment
TL;DR¶
This paper proposes a novel Completion-by-Correction paradigm that leverages a pretrained image-to-3D model to generate a topologically complete shape prior, then corrects it in feature space to align with local observations. This replaces the conventional Completion-by-Inpainting approach, achieving a 23.5% reduction in average CD and a 7.1% improvement in F-score on ShapeNet-ViPC.
Background & Motivation¶
State of the Field¶
Point cloud completion aims to recover complete 3D shapes from partial observations, with broad applications in autonomous driving, augmented reality, and robotics. Deep learning methods such as PoinTr and SeedFormer have achieved notable progress, yet purely unimodal approaches still struggle to distinguish occlusion-induced missing regions from genuine structural holes under severe occlusion. Multimodal methods therefore exploit RGB images, whose texture and semantic cues complement the partial geometry.
Limitations of Prior Work¶
Existing multimodal methods (CSDN, XMFNet, EGIInet, etc.) all follow the Completion-by-Inpainting paradigm—fusing image and point cloud features and then directly synthesizing missing geometry from the fused latent representation. Through empirical analysis, the authors identify fundamental drawbacks of this approach:
Structural inconsistency: Without an explicit structural skeleton, the network must "hallucinate" missing structures, often producing topological artifacts.
Semantic ambiguity: Under severe degradation, the fused features provide insufficient constraints, yielding results that are semantically plausible but geometrically incoherent.
Unconstrained synthesis: Synthesizing geometry from incomplete representations is inherently ill-posed.
Root Cause & Starting Point¶
The authors argue that the fundamental issue lies in performing unconstrained synthesis from incomplete representations. Instead, they propose first obtaining a topologically complete initial shape via an image-to-3D model, and then correcting it to align with the observation. This transforms the problem from unconstrained synthesis to guided refinement, making it better-posed.
Method¶
Overall Architecture¶
PGNet (PriorGroundNet) consists of three stages:
- Corrective Dual-Feature Encoding: Parallel encoding of the generated prior and the partial observation, with feature-space correction of the prior.
- Grounded Seed Generation: Synthesis of a coarse yet topologically complete seed point cloud as a structural skeleton.
- Hierarchical Grounded Refinement: Iterative geometry refinement through two stacked Grounded Refinement Blocks (GRBs).
The inputs are a partial point cloud \(P_o \in \mathbb{R}^{M \times 3}\) and a corresponding single-view RGB image \(I\), with the goal of reconstructing the complete point cloud. A pretrained Trellis image-to-3D model first generates a prior point cloud \(P_g\) from the image, which is then aligned to the observation via a learned correction function \(\mathcal{T}\).
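To make the data flow concrete, here is a minimal, runnable PyTorch sketch of the three-stage pipeline. Every module body is a placeholder MLP and none of the internals match the paper's designs; only the staging and the \(512 \to 1024 \to 2048\) upsampling schedule follow the description above.

```python
import torch
import torch.nn as nn

class PGNetSketch(nn.Module):
    """Skeleton of the three-stage pipeline; all internals are placeholder
    MLPs, only the overall data flow follows the paper's description."""

    def __init__(self, d=256, n_seed=512, up_ratio=2, k_blocks=2):
        super().__init__()
        self.n_seed, self.up_ratio = n_seed, up_ratio
        self.enc_o = nn.Sequential(nn.Linear(3, d), nn.ReLU(), nn.Linear(d, d))
        self.enc_g = nn.Sequential(nn.Linear(3, d), nn.ReLU(), nn.Linear(d, d))
        self.seed_head = nn.Linear(2 * d, n_seed * 3)       # stand-in for seed generation
        self.offset_heads = nn.ModuleList(
            nn.Linear(3, up_ratio * 3) for _ in range(k_blocks)
        )

    def forward(self, P_o, P_g):
        # Stage 1: dual-feature encoding (feature-space correction omitted here)
        F_o = self.enc_o(P_o).max(dim=1).values             # (B, d) observation code
        F_g = self.enc_g(P_g).max(dim=1).values             # (B, d) prior code
        # Stage 2: coarse but topologically complete seed cloud (N_c = 512)
        P_c = self.seed_head(torch.cat([F_o, F_g], -1)).view(-1, self.n_seed, 3)
        # Stage 3: hierarchical refinement via per-point displacement vectors
        outs, P = [P_c], P_c
        for head in self.offset_heads:
            off = head(P).view(P.size(0), P.size(1), self.up_ratio, 3)
            P = (P.unsqueeze(2) + off).reshape(P.size(0), -1, 3)
            outs.append(P)
        return outs                                         # supervised at every level

P_o, P_g = torch.randn(1, 2048, 3), torch.randn(1, 2048, 3)
print([tuple(p.shape) for p in PGNetSketch()(P_o, P_g)])
# [(1, 512, 3), (1, 1024, 3), (1, 2048, 3)]
```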
Key Designs¶
1. Corrective Dual-Feature Encoding¶
Mechanism: Since \(P_g\) and \(P_o\) differ in scale, pose, and point distribution, feature-space alignment is required.
- Partial point cloud encoder: Employs hierarchical local feature aggregation (FPS + DGCNN) to extract \(N_e = 128\) representative points and initial features. Learnable relative positional encodings \(\Phi\) are incorporated to mitigate pose discrepancy. A Salient Transformer (dual-branch structure) then fuses global and local context:
- Global branch: MHSA produces long-range context \(A_o\)
- Local branch: kNN + shared MLP + max pooling produces local patterns \(X_o\)
- A learnable saliency gate \(G_o = \sigma(\text{MLP}([A_o, X_o]))\) adaptively fuses both
- Generative prior encoder: Uses the same hierarchical encoding, but employs a Grounding Transformer to correct the prior in feature space:
- A self-attention branch captures the internal structure of the prior
- A grounding branch (cross-attention) takes \(F_g''\) as query and \(F_o\) as key/value to yield observation-aligned features
- A saliency gate fuses both branches analogously
Design Motivation: The Salient Transformer enhances the reliability of \(F_o\) (attending globally in sparse regions, locally in detailed regions), while the Grounding Transformer injects reliable observation signals into the generated prior.
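As a concrete illustration of the shared gating design, below is a minimal PyTorch sketch of the saliency gate. Only the formula \(G = \sigma(\text{MLP}([A, X]))\) comes from the paper; the layer widths and the convex-combination readout are assumptions.

```python
import torch
import torch.nn as nn

class SaliencyGate(nn.Module):
    """Saliency-gated fusion of a global branch A and a local branch X.
    The gate formula follows the paper; widths and readout are assumed."""

    def __init__(self, d=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, A, X):
        # A: global-branch features (e.g., MHSA output); X: local-branch features
        G = torch.sigmoid(self.mlp(torch.cat([A, X], dim=-1)))  # per-channel gate in (0, 1)
        return G * A + (1.0 - G) * X                            # adaptive fusion

gate = SaliencyGate(d=256)
A = torch.randn(1, 128, 256)   # N_e = 128 representative points
X = torch.randn(1, 128, 256)
print(gate(A, X).shape)        # torch.Size([1, 128, 256])
```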
2. Grounded Seed Generation¶
Mechanism: Produces a coarse but topologically complete and geometrically grounded skeleton point cloud.
- Max pooling over \(F_g\) and \(F_o\) extracts global representations \(\hat{F}_g\) and \(\hat{F}_o\)
- Cross-attention fuses the global features to yield \(\hat{F}_{\text{fused}}\)
- In a PixelShuffle-inspired step, an MLP followed by a reshape expands the global feature into \(N_c = 512\) seed features
- Cross-attention then aligns the seed features with \(F_o\) (grounding)
- A final MLP maps the grounded seed features to the coarse point cloud \(P_c\); a sketch of the full module follows this list
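The sketch below follows the pooling → fuse → expand → ground → project order of the list above; head counts, feature widths, and the exact expansion factor are assumptions.

```python
import torch
import torch.nn as nn

class GroundedSeedGenerator(nn.Module):
    """Pool -> cross-attention fuse -> MLP+reshape expand -> ground -> project.
    Dimensions and head counts are illustrative, not the paper's values."""

    def __init__(self, d=128, n_seed=512):
        super().__init__()
        self.n_seed, self.d_low = n_seed, d // 4
        self.fuse = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.expand = nn.Linear(d, n_seed * self.d_low)    # MLP + reshape expansion
        self.lift = nn.Linear(self.d_low, d)
        self.ground = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.to_xyz = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 3))

    def forward(self, F_g, F_o):
        # Global representations via max pooling: (B, N, d) -> (B, 1, d)
        g_hat = F_g.max(dim=1, keepdim=True).values
        o_hat = F_o.max(dim=1, keepdim=True).values
        fused, _ = self.fuse(g_hat, o_hat, o_hat)          # cross-attention fusion
        seeds = self.expand(fused).view(-1, self.n_seed, self.d_low)
        seeds = self.lift(seeds)                           # (B, 512, d) seed features
        seeds, _ = self.ground(seeds, F_o, F_o)            # ground seeds on F_o
        return self.to_xyz(seeds)                          # coarse point cloud P_c

F_g, F_o = torch.randn(1, 128, 128), torch.randn(1, 128, 128)
print(GroundedSeedGenerator()(F_g, F_o).shape)             # torch.Size([1, 512, 3])
```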
3. Hierarchical Grounded Refinement¶
Mechanism: \(K=2\) stacked Grounded Refinement Blocks (GRBs) progressively improve geometric fidelity. Each GRB contains two components:
(a) Dual-Source Feature Association (a minimal sketch of the shared interpolation routine follows this block):
- Query from observation: for each point, interpolate features from \(F_o\) via kNN + IDW (inverse distance weighting)
- Query from prior: since \(P_o\) and \(P_g\) are spatially misaligned, kNN + IDW interpolation is performed in feature space rather than in geometric space
- The dual-source features are concatenated: \(f_{as}(p_i) = [f_{\text{interp},o}(p_i), f_{\text{interp},g}(p_i)]\)
(b) Structure-Aware Upsampling:
- Cross-Scale Shape Context (CSSC) module: for each point, a geometric transformer aggregates multi-scale shape context from the previous resolution
- Attention weights jointly account for feature similarity and relative spatial position
- Predicts \(r = 2\) displacement vectors per point, upsampling hierarchically: \(512 \to 1024 \to 2048\)
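Below is a minimal sketch of the kNN + IDW interpolation that both queries of (a) share; the value of \(k\) and the stabilizing epsilon are assumptions. The only difference between the two branches is the space in which distances are measured.

```python
import torch

def knn_idw_interpolate(query, keys, values, k=3, eps=1e-8):
    """kNN + inverse-distance-weighted interpolation.
    query: (B, N, c); keys: (B, M, c); values: (B, M, d) -> (B, N, d)."""
    dist = torch.cdist(query, keys)                      # (B, N, M) pairwise distances
    d_k, idx = dist.topk(k, dim=-1, largest=False)       # k nearest neighbours
    w = 1.0 / (d_k + eps)
    w = w / w.sum(dim=-1, keepdim=True)                  # normalised IDW weights
    gathered = torch.gather(
        values.unsqueeze(1).expand(-1, query.size(1), -1, -1),
        2, idx.unsqueeze(-1).expand(-1, -1, -1, values.size(-1))
    )                                                    # (B, N, k, d) neighbour features
    return (w.unsqueeze(-1) * gathered).sum(dim=2)

# Observation branch: distances measured between 3D coordinates.
P, P_o, F_o = torch.randn(2, 512, 3), torch.randn(2, 128, 3), torch.randn(2, 128, 64)
f_obs = knn_idw_interpolate(P, P_o, F_o)                 # (2, 512, 64)
# Prior branch: distances measured between feature vectors, sidestepping
# the spatial misalignment between P_o and P_g.
F_q, F_g = torch.randn(2, 512, 64), torch.randn(2, 128, 64)
f_prior = knn_idw_interpolate(F_q, F_g, F_g)             # (2, 512, 64)
f_dual = torch.cat([f_obs, f_prior], dim=-1)             # dual-source association
print(f_dual.shape)                                      # torch.Size([2, 512, 128])
```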
Loss & Training¶
L1 Chamfer Distance is used as the training objective, applied to both the coarse output and each upsampled output (multi-level supervision).
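The paper's exact per-level weighting is not reproduced here; assuming the standard CD-ℓ1 definition (as in PCN and its successors) and an unweighted sum over levels, the objective reads:

\[
\mathcal{L}_{\mathrm{CD}}(P, Q) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} \lVert p - q \rVert_2 + \frac{1}{|Q|} \sum_{q \in Q} \min_{p \in P} \lVert q - p \rVert_2, \qquad \mathcal{L} = \mathcal{L}_{\mathrm{CD}}(P_c, P_{\mathrm{gt}}) + \sum_{k=1}^{K} \mathcal{L}_{\mathrm{CD}}(P_k, P_{\mathrm{gt}})
\]

where \(P_c\) is the coarse seed cloud, \(P_k\) the output of the \(k\)-th GRB, and \(P_{\mathrm{gt}}\) the ground truth (typically downsampled to match each level's resolution).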
Training details: AdamW optimizer, initial learning rate \(2 \times 10^{-4}\), cosine annealing, trained per-category for 100K steps, batch size 192, NVIDIA RTX 4090. Prior generation uses the Trellis model with Poisson disk sampling to 2048 points.
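For the sampling step, here is a minimal sketch using Open3D's Poisson disk sampler, assuming the Trellis output has already been exported as a mesh file (the path and export step are hypothetical; the Trellis API itself is not shown):

```python
import numpy as np
import open3d as o3d

# Assumed setup: the Trellis image-to-3D output was saved as a mesh file.
mesh = o3d.io.read_triangle_mesh("trellis_output.obj")       # hypothetical path
pcd = mesh.sample_points_poisson_disk(number_of_points=2048)  # evenly spaced samples
P_g = np.asarray(pcd.points)                                  # (2048, 3) prior cloud
```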
Key Experimental Results¶
Main Results¶
Evaluated on ShapeNet-ViPC (38,328 objects, 13 categories):
| Method | Type | Avg. CD (×10⁻³) ↓ | Avg. F-score ↑ |
|---|---|---|---|
| PoinTr | Unimodal | 2.851 | 0.683 |
| SeedFormer | Unimodal | 2.902 | 0.688 |
| ViPC | Multimodal | 3.308 | 0.591 |
| CSDN | Multimodal | 2.570 | 0.695 |
| XMFNet | Multimodal | 1.454 | 0.797 |
| EGIInet | Multimodal | 1.211 | 0.836 |
| PGNet (Ours) | Multimodal | 0.926 | 0.895 |
Compared to the previous SOTA EGIInet: CD reduced by 23.5%, F-score improved by 7.1%. Gains are especially pronounced on heavily occluded categories such as cabinet (+42.2%) and sofa (+26.6%).
Ablation Study¶
Ablation on the cabinet category (CD ×10⁻³ / F-score):
| Configuration | CD ↓ | F-score ↑ | Note |
|---|---|---|---|
| w/o Prior Feature Grounding | 1.185 | 0.827 | Removes feature-space correction |
| w/o Seed Grounding | 1.219 | 0.821 | Removes seed grounding |
| w/o Dual-Source Association | 1.324 | 0.803 | Largest impact; dual-source association is central |
| w/o Structure-Aware | 1.275 | 0.800 | Removes structure-aware upsampling |
| PGNet (Full) | 1.111 | 0.839 | Complete model |
Paradigm comparison (Inpainting vs. Correction): an Inpainting variant of the same network reaches an average CD of 1.10 versus 0.93 for full PGNet, and on cabinet its CD is 41.4% higher.
Key Findings¶
- The Completion-by-Correction paradigm is inherently more robust than Completion-by-Inpainting, demonstrating that correcting from a complete prior is superior to synthesizing from incomplete features.
- Dual-Source Feature Association is the most critical component (removing it increases CD by 19.2%), confirming that jointly exploiting observational fidelity and prior structural information is essential.
- Gains are largest on heavily occluded categories (cabinet, sofa), validating the central role of the prior skeleton when missing regions are large.
Highlights & Insights¶
- Paradigm innovation: Proposes a new completion paradigm—shifting from "filling in missing parts" to "correcting a complete prior"—transforming an ill-posed synthesis problem into a well-posed refinement problem.
- Elegant use of image-to-3D models: Rather than fusing at the geometric level (susceptible to pose/scale discrepancy), correction is performed in feature space, yielding a more principled design.
- Saliency gating mechanism: A unified gating design across both the Salient Transformer and the Grounding Transformer is concise and effective.
- Feature-space interpolation: Dual-Source Association applies kNN in feature space rather than geometric space for prior features, elegantly circumventing geometric misalignment.
Limitations & Future Work¶
- The method depends on the pretrained Trellis image-to-3D model; prior quality directly bounds performance, and prior generation adds inference overhead.
- Each category is trained separately for 100K steps, incurring high training cost; cross-category generalization has not been verified.
- Evaluation is limited to the ShapeNet-ViPC synthetic dataset; real-scene validation is absent.
- Biases (hallucinations) introduced by the prior generation model may cause systematic errors, a concern not thoroughly analyzed in the paper.
Related Work & Insights¶
- SymmCompletion (AAAI 2025): Exploits symmetry priors for point cloud completion; combining it with the present framework—generating a prior then correcting it—is a natural direction.
- PCDreamer (CVPR 2025): Diffusion-based point cloud completion; direct geometric fusion suffers from pose discrepancy, a problem the present method avoids via feature-space correction.
- Advances in image-to-3D models (e.g., Trellis, TripoSR) will directly raise the ceiling of this framework.
- The "generate-then-correct" paradigm is broadly transferable to other 3D reconstruction tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — A paradigm-level innovation; the shift from Inpainting to Correction is well-motivated and compelling.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive ablations, but limited to a single dataset with no real-scene validation.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear motivation, high-quality figures, and coherent narrative.
- Value: ⭐⭐⭐⭐ — Opens a new direction for multimodal point cloud completion, though real-world deployment requires further validation.