P2P-Bridge: Diffusion Bridges for 3D Point Cloud Denoising

Conference: ECCV 2024
arXiv: 2408.16325
Code: Project Page
Area: 3D Vision
Keywords: Point Cloud Denoising, Schrödinger Bridge, Diffusion Models, Optimal Transport, DINOv2

TL;DR

P2P-Bridge formulates point cloud denoising as a Schrödinger Bridge problem and learns the optimal transport plan between noisy and clean point clouds. It replaces the usual data-to-noise diffusion with a data-to-data diffusion bridge, presented as a first for this task, and outperforms existing methods by a wide margin on real-world indoor scenes (ScanNet++, ARKitScenes) and in the higher-noise synthetic settings.

Background & Motivation

  • Background: Point cloud denoising is a fundamental preprocessing task in 3D vision. Deep learning methods (ScoreDenoise, MAG, PD-Flow) have demonstrated superior performance compared to traditional approaches, but they are primarily trained under the assumption of synthetic Gaussian noise.
  • Limitations of Prior Work: Noise generated by real-world scanners (LiDAR, mobile phones) is far more complex than isotropic Gaussian noise, including effects like outlier clusters, ghost points, and edge flares. Consequently, existing methods suffer from significant performance degradation in real-world scenarios.
  • Key Challenge: Traditional diffusion models use a Gaussian prior (data-to-noise) and cannot learn sensor-specific noise characteristics. Furthermore, the distance metrics used by existing methods scale non-linearly with the point cloud size, hindering model scalability.
  • Goal: To design a denoising framework capable of learning data-specific noise characteristics, achieving outstanding performance on both synthetic noise and real-world indoor scene noise.
  • Key Insight: Reformulating the denoising problem as a Schrödinger Bridge problem—seeking the optimal transport path from noisy point clouds to clean point clouds.
  • Core Idea: Replacing the traditional data-to-noise diffusion process with a data-to-data diffusion bridge, incorporating shortest-path interpolation to achieve meaningful interpolation between unordered point clouds, and introducing DINOv2 semantic features to assist the denoising process.

Method

Overall Architecture

P2P-Bridge formulates denoising as a reverse diffusion process, where the noisy point cloud \(\tilde{\mathcal{P}}\) serves as the prior distribution \(p_{\text{prior}}\) and the clean point cloud \(\mathcal{P}\) serves as the data distribution \(p_{\text{data}}\). A network is trained to learn the optimal transport plan from \(\tilde{\mathcal{P}}\) to \(\mathcal{P}\), and iterative denoising is performed during inference using DDPM sampling. The model is based on the PVCNN architecture, taking coordinates, RGB, and DINOv2 features as inputs.

Key Designs

Module 1: Tractable Diffusion Bridge

Noisy-clean point cloud pairs are treated as the paired boundary data of the Schrödinger Bridge. With the drift set to \(\mathbf{f} := 0\) and a linear diffusion schedule \(g^2(t)\), the posterior distribution has an analytical form:

\[q(\mathbf{x}_t | \mathbf{x}_0, \mathbf{x}_T) = \mathcal{N}(\mathbf{x}_t; \mu_t(\mathbf{x}_0, \mathbf{x}_T), \Sigma_t)\]

where the posterior mean and variance are:

\[\mu_t = \frac{\bar{\sigma}_t^2}{\bar{\sigma}_t^2 + \sigma_t^2} \mathbf{x}_0 + \frac{\sigma_t^2}{\bar{\sigma}_t^2 + \sigma_t^2} \mathbf{x}_T, \quad \Sigma_t = \frac{\sigma_t^2 \bar{\sigma}_t^2}{\sigma_t^2 + \bar{\sigma}_t^2}\]

where \(\sigma_t^2 = \int_0^t g^2(\tau) d\tau\) and \(\bar{\sigma}_t^2 = \int_t^1 g^2(\tau) d\tau\). This simplifies the complex SB problem into a tractable learning framework.
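As a minimal sketch of the posterior above, the following NumPy code evaluates \(\sigma_t^2\), \(\bar{\sigma}_t^2\), \(\mu_t\), and \(\Sigma_t\) in closed form and draws a sample \(\mathbf{x}_t\). The linear parametrization \(g^2(t) = \beta_{\min} + (\beta_{\max} - \beta_{\min})\,t\) and the values of `beta_min` and `beta_max` are illustrative assumptions, not settings taken from the paper, and the two clouds are assumed to be already in point-wise correspondence.

```python
import numpy as np

def sigma_sq(t, beta_min=0.1, beta_max=1.0):
    """Integral of g^2(tau) = beta_min + (beta_max - beta_min) * tau over [0, t].
    The schedule parameters are hypothetical."""
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2

def bridge_posterior(x0, xT, t, beta_min=0.1, beta_max=1.0):
    """Mean and (scalar) variance of q(x_t | x_0, x_T) for t in [0, 1]."""
    s2 = sigma_sq(t, beta_min, beta_max)                 # sigma_t^2
    s2_bar = sigma_sq(1.0, beta_min, beta_max) - s2      # bar{sigma}_t^2
    denom = s2 + s2_bar
    mean = (s2_bar / denom) * x0 + (s2 / denom) * xT
    var = s2 * s2_bar / denom
    return mean, var

def sample_xt(x0, xT, t, rng, beta_min=0.1, beta_max=1.0):
    """Draw x_t ~ N(mu_t, Sigma_t * I) for paired clouds of shape (N, 3)."""
    mean, var = bridge_posterior(x0, xT, t, beta_min, beta_max)
    return mean + np.sqrt(var) * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
clean = rng.normal(size=(8, 3))                      # stand-in for x_0
noisy = clean + 0.05 * rng.normal(size=(8, 3))       # stand-in for x_T
x_half = sample_xt(clean, noisy, 0.5, rng)
```

At \(t = 0\) the mean reduces to \(\mathbf{x}_0\) and at \(t = 1\) to \(\mathbf{x}_T\), with zero variance at both ends, which is what makes the process a bridge between the two clouds.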

Module 2: Shortest-Path Interpolation for Unordered Point Clouds

Since point clouds are unordered, the interpolation described by \(\mu_t\) requires defining reasonable point correspondences. The shortest-path interpolation from PointMixup is adopted to find the optimal bijective assignment between the noisy and clean point clouds:

\[\phi^* = \arg\min_{\phi \in \Phi} \sum_{i=1}^{N} \|\mathbf{x}_T^i - \mathbf{x}_0^{\phi(i)}\|_2\]

When the stochasticity of the bridge vanishes (\(g^2(t) \to 0\)), the bridge SDE degenerates to an optimal transport ODE:

\[d\mathbf{x}_t = \frac{g^2(t)}{\sigma_t^2} (\mathbf{x}_t - \mathbf{x}_0) dt\]

The assignment \(\phi^*\) only needs to be computed once per data pair; the reordered clean point clouds can then be reused directly throughout training.
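To make the assignment objective concrete, here is a small sketch that finds \(\phi^*\) by exhaustive search over all permutations and then reorders the clean cloud accordingly. Brute force is used only because it is easy to verify on a toy example; it scales factorially, so for realistic point counts a polynomial-time solver (for instance a Hungarian-style routine such as SciPy's `linear_sum_assignment`) would be the usual choice. The function names are hypothetical.

```python
from itertools import permutations
import math

def optimal_assignment(noisy, clean):
    """Return (phi, cost) where phi[i] is the index of the clean point
    matched to noisy point i and cost is the summed Euclidean distance."""
    n = len(noisy)
    best_phi, best_cost = None, math.inf
    for phi in permutations(range(n)):
        cost = sum(math.dist(noisy[i], clean[phi[i]]) for i in range(n))
        if cost < best_cost:
            best_phi, best_cost = phi, cost
    return best_phi, best_cost

def reorder_clean(clean, phi):
    """Reorder the clean cloud so that row i corresponds to noisy point i."""
    return [clean[j] for j in phi]

clean = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
# Same shape, shuffled and slightly perturbed.
noisy = [(0.98, 0.03, 0.0), (0.02, 1.01, 0.0), (0.01, -0.02, 0.0)]

phi, cost = optimal_assignment(noisy, clean)
clean_sorted = reorder_clean(clean, phi)  # can be cached and reused
```

Once `clean_sorted` is stored, the interpolation \(\mu_t\) can be applied row by row without recomputing the matching.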

Module 3: Semantics-augmented Feature Embedding

Point-wise DINOv2 features are introduced: pixel-level DINOv2 features are projected onto the noisy point cloud using camera poses and intrinsic parameters, providing high-level semantic information for each point. The network is based on PVCNN (Point-Voxel CNN), extended with multi-head global attention and feature embedding modules that use 1x1 convolutions to map input features into a high-dimensional space.
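The projection step can be sketched with a plain pinhole camera model. The code below is an illustration under several assumptions that are not stated in the source: the pose is given as a world-to-camera rotation `R` and translation `t`, the feature map has already been resized to image resolution, features are looked up by nearest pixel, and occlusion is ignored. The function name `project_features` is hypothetical.

```python
import numpy as np

def project_features(points, feat_map, K, R, t):
    """Attach per-pixel features to 3D points.

    points:   (N, 3) world coordinates
    feat_map: (H, W, C) per-pixel features (e.g. upsampled image features)
    K:        (3, 3) camera intrinsics
    R, t:     world-to-camera rotation (3, 3) and translation (3,)
    Returns (N, C) features (zeros where not visible) and an (N,) bool mask.
    """
    H, W, C = feat_map.shape
    cam = points @ R.T + t                      # world -> camera frame
    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)         # avoid division by zero
    uvw = cam @ K.T                             # homogeneous pixel coords
    u = np.round(uvw[:, 0] / safe_z).astype(int)
    v = np.round(uvw[:, 1] / safe_z).astype(int)
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    feats = np.zeros((len(points), C), dtype=feat_map.dtype)
    feats[valid] = feat_map[v[valid], u[valid]]  # nearest-pixel lookup
    return feats, valid
```

With several views per scene, one simple option would be to average the features of all views in which a point is valid; the source does not specify how multiple views are combined.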

Loss & Training

  • Noise Prediction Loss: Standard diffusion model noise prediction objective:
\[\mathcal{L} = \|\epsilon_\theta(\mathbf{x}_t, t) - \frac{\mathbf{x}_t - \mathbf{x}_0}{\sigma_t}\|_2^2\]
  • DDPM Denoising: Iterative DDPM sampling is used during inference, yielding strong results with only 3 function evaluations (steps).
  • Patch Processing: Large-scale indoor scenes are processed using a patch-based approach, and coordinates in overlapping regions are averaged (rather than relying on direct concatenation followed by FPS sampling), which effectively reduces patch boundary artifacts.
  • Training Configuration: PU-Net is used for the object-level experiments; scene-level models are trained with a batch size of 32 for up to 100K steps.
  • Timestep Conditioning: The timestep \(t\) is conditioned using sinusoidal positional embeddings, and global features are injected via adaptive group normalization (AdaGN).
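The noise prediction objective above can be sketched in a few lines. The code computes the regression target \((\mathbf{x}_t - \mathbf{x}_0)/\sigma_t\), the squared-error loss, and the clean-cloud estimate \(\hat{\mathbf{x}}_0 = \mathbf{x}_t - \sigma_t\,\epsilon_\theta\) that follows from inverting the target. Averaging the per-point squared norms over the cloud is an assumed reduction, and `eps_pred` stands in for the network output; neither detail is confirmed by the source.

```python
import numpy as np

def noise_target(x_t, x0, sigma_t):
    """Regression target (x_t - x_0) / sigma_t for clouds of shape (N, 3)."""
    return (x_t - x0) / sigma_t

def noise_prediction_loss(eps_pred, x_t, x0, sigma_t):
    """Squared L2 error per point, averaged over points (assumed reduction)."""
    target = noise_target(x_t, x0, sigma_t)
    return float(np.mean(np.sum((eps_pred - target) ** 2, axis=-1)))

def predict_x0(x_t, eps_pred, sigma_t):
    """Invert the target: estimate of the clean cloud from a noise prediction."""
    return x_t - sigma_t * eps_pred
```

A perfect prediction gives zero loss and recovers \(\mathbf{x}_0\) exactly, which is why each sampling step can use the current \(\hat{\mathbf{x}}_0\) to move \(\mathbf{x}_t\) toward the clean cloud.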

Key Experimental Results

Main Results (Object-level Denoising, PU-Net Dataset, CD × 10⁴)

| Method | 10K pts, 1% noise (CD) | 10K pts, 3% noise (CD) | 50K pts, 3% noise (CD) | 50K pts, 3% noise (P2M) |
|---|---|---|---|---|
| ScoreDenoise | 2.52 | 4.71 | 1.93 | 1.04 |
| MAG | 2.50 | 4.69 | 1.93 | 1.05 |
| PD-Flow | 2.13 | 5.19 | 3.90 | 2.86 |
| I-PFN | 2.31 | 5.49 | 2.54 | 1.65 |
| P2P-Bridge | 2.28 | 3.99 | 1.56 | 0.84 |

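The advantage is largest in the high-noise (3%) settings: at 50K points, CD drops from 1.93 (ScoreDenoise) to 1.56 (-19%) and P2M from 1.04 to 0.84 (-19%). At 10K points with 1% noise, PD-Flow (2.13) remains ahead of P2P-Bridge (2.28).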
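For readers unfamiliar with the CD metric, the sketch below implements one common form of the Chamfer distance: the mean squared nearest-neighbour distance in both directions. Whether the reported numbers use squared or unsquared distances, and how the two directions are normalized, is an assumption here and not taken from the source. The dense pairwise matrix also limits this version to small clouds.

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between clouds p (N, 3) and q (M, 3),
    using squared Euclidean nearest-neighbour distances (one common variant)."""
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)  # (N, M)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

For clouds with tens of thousands of points, a spatial index such as a k-d tree would normally replace the dense distance matrix.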

Real-world Indoor Scenes (ScanNet++ Apple LiDAR + 3DMatch, Metrics × 10⁴)

| Method | Features | CD | P2M |
|---|---|---|---|
| Bilateral | XYZ | 64.28 | 63.51 |
| ScoreDenoise | XYZ | 58.78 | 57.99 |
| PD-Flow | XYZ | 54.02 | 53.14 |
| I-PFN | XYZ | 52.31 | 51.49 |
| P2P-Bridge | XYZ | 35.56 | 34.78 |
| P2P-Bridge | XYZ+RGB | 35.17 | 34.39 |
| P2P-Bridge | XYZ+RGB+DINO | 34.88 | 34.11 |

On ScanNet++, CD falls from 52.31 for the second-best method (I-PFN) to 34.88, a 33% reduction.
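The relative reductions quoted for the two result tables can be reproduced with a short calculation on the reported values; `relative_reduction` is a helper name introduced here for illustration.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# PU-Net, 50K points, 3% noise: ScoreDenoise vs. P2P-Bridge
print(round(relative_reduction(1.93, 1.56), 1))    # CD  -> 19.2
print(round(relative_reduction(1.04, 0.84), 1))    # P2M -> 19.2

# ScanNet++: I-PFN vs. P2P-Bridge (XYZ+RGB+DINO)
print(round(relative_reduction(52.31, 34.88), 1))  # CD  -> 33.3
```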

Key Findings

  • The data-to-data paradigm shows massive benefits on real noise: While methods perform similarly under synthetic Gaussian noise, P2P-Bridge leads by a large margin in real-world indoor scenes, validating the importance of learning data-specific noise characteristics.
  • Excellent results are achieved in only 3 denoising steps: The robustness of DDPM sampling makes the model insensitive to the number of denoising steps.
  • DINOv2 semantic features give a further, modest gain: on ScanNet++, adding RGB lowers CD from 35.56 to 35.17, and adding DINOv2 features on top lowers it to 34.88; the high-level semantic information helps in distinguishing structural boundaries.
  • Methods trained on Gaussian noise generate severe artifacts during patch processing (boundary points are misclassified as outliers), whereas P2P-Bridge effectively avoids this through its coordinate averaging strategy and training on real-world noise.
  • It generalizes well to the unseen PC-Net dataset, showing superior adaptability compared to competing methods.
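The coordinate averaging over overlapping patches can be sketched as follows. The code assumes each patch carries the global indices of its points, so that a point denoised in several patches can be identified and its coordinates averaged; this bookkeeping and the function name `merge_patches` are illustrative assumptions, not details taken from the source.

```python
import numpy as np

def merge_patches(num_points, patch_indices, patch_coords):
    """Average denoised coordinates of points that appear in several patches.

    patch_indices: list of (k_i,) integer arrays with global point indices
    patch_coords:  list of (k_i, 3) arrays with denoised coordinates
    Returns (num_points, 3) merged coordinates and a mask of covered points.
    """
    acc = np.zeros((num_points, 3))
    counts = np.zeros(num_points)
    for idx, xyz in zip(patch_indices, patch_coords):
        np.add.at(acc, idx, xyz)     # accumulate coordinates per global index
        np.add.at(counts, idx, 1)    # count how many patches saw each point
    covered = counts > 0
    acc[covered] /= counts[covered, None]
    return acc, covered
```

Because every input point keeps exactly one output position, no resampling step is needed after the patches are merged.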

Highlights & Insights

  • Elegant Problem Formulation: Reconceptualizes denoising as an optimal transport problem under the Schrödinger Bridge framework. It provides a solid theoretical foundation and extends diffusion models from "noise-to-data" to "data-to-data".
  • Crucial Shortest-Path Interpolation: Ingeniously solves the core technical challenge of interpolating between unordered point clouds. The assignment only needs to be computed once, ensuring efficient training.
  • Pioneering Real-world Evaluation: This is the first work to systematically evaluate point cloud denoising methods on realistic scan datasets such as ScanNet++ and ARKitScenes, filling an assessment gap in the field.
  • Semantics-assisted Denoising: The introduction of DINOv2 features adds a new dimension to point cloud denoising, moving beyond purely geometric features.

Limitations & Future Work

  • It requires paired noisy-clean training data, which is expensive to acquire (relying on high-precision Faro scanners).
  • The computation of the optimal assignment \(\phi^*\) is \(O(N^3)\), which potentially becomes a bottleneck for large-scale point clouds.
  • It cannot address missing or incomplete point cloud regions, requiring integration with point cloud completion methods.
  • The patch partitioning strategy for large-scale scenes still needs optimization as spatial consistency across patches requires improvement.
  • Unsupervised or self-supervised variants have not been explored; these could reduce the dependence on paired data.

Related Work & Insights

  • ScoreDenoise / MAG: Representative score-matching methods that perform well under Gaussian noise but degrade severely in real-world environments.
  • PD-Flow: A normalizing flow-based method that shows reasonable performance on real-world noise but suffers from patch-wise artifacts.
  • I²-SB (Schrödinger Bridge for Images): An application of SB in image-to-image translation, from which this work borrows its tractable bridge framework.
  • PointMixup: Provides the theoretical foundation for meaningful interpolation between unordered point clouds.
  • Insight: The SB paradigm can be generalized to other 3D data-to-data transformation tasks such as point cloud completion, shape generation, and point cloud registration.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Translating Schrödinger Bridge to point cloud denoising for the first time; the data-to-data diffusion paradigm is pioneering.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across synthetic and real-world datasets, compared against multiple baselines, with ablation studies covering key design choices.
  • Writing Quality: ⭐⭐⭐⭐ — Clean mathematical derivations, naturally transitioning from SDE to the tractable framework with intuitive illustrations.
  • Value: ⭐⭐⭐⭐⭐ — Realizes high performance in only 3 denoising steps with open-source code and remarkable results on real-world scenes, indicating extremely broad practical application prospects.