# UPP: Unified Point-Level Prompting for Robust Point Cloud Analysis
Conference: ICCV 2025 arXiv: 2507.18997 Code: GitHub Area: 3D Vision Keywords: point cloud analysis, parameter-efficient fine-tuning, denoising, completion, prompt learning
## TL;DR
This paper proposes UPP, a unified point-level prompting framework that reformulates point cloud denoising and completion as prompting mechanisms for downstream tasks. It introduces a Rectification Prompter to filter noise, a Completion Prompter to recover missing regions, and a Shape-Aware Unit to capture geometry-sensitive features. With only 6.3% of the parameters, UPP surpasses full fine-tuning on noisy and incomplete point clouds.
## Background & Motivation
Pre-trained point cloud models (Point-MAE, ReCon, etc.) have achieved remarkable progress on various downstream tasks. However, real-world point clouds frequently suffer from noise and incompleteness due to object occlusion, reflective surfaces, and sensor resolution limitations, which severely degrades model performance.
Limitations of existing approaches:
Dedicated denoising/completion models + downstream tasks (pipeline paradigm):

- Conflicting objectives between denoising and completion: denoising removes excess points while completion adds missing ones, and naive integration leads to mutual interference
- Domain gap between the enhancement tasks and the downstream task, resulting in suboptimal performance
- Complex training pipelines with high computational and memory overhead
Parameter-efficient fine-tuning (PEFT) methods (IDPT, Point-PEFT, DAPT):

- Improve representational capacity only in the latent feature space
- Neglect explicit suppression of noise and defects in the input point cloud
- Features become indistinguishable on low-quality data, leading to severe performance degradation
UPP's innovation: Denoising and completion are reformulated as downstream task-oriented prompting mechanisms, intervening in the input data space rather than only the feature space, with unified end-to-end training.
## Method

### Overall Architecture
The pre-trained backbone is frozen; three trainable components are inserted:

1. Rectification Prompter: predicts corrective vector prompts after the shallow blocks to filter noise
2. Completion Prompter: generates completion point prompts after the deep blocks to recover missing regions
3. Shape-Aware Unit: inserted within each transformer block to capture geometry-sensitive features
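To make the parameter budget concrete, here is a back-of-envelope sketch in Python. All module shapes and counts below are invented for illustration; only the freezing pattern (frozen backbone, small trainable inserts) mirrors UPP.

```python
import numpy as np

# Hypothetical sizes: a frozen backbone plus three small trainable modules.
def n_params(shapes):
    """Total parameter count for a list of weight-matrix shapes."""
    return sum(int(np.prod(s)) for s in shapes)

backbone = n_params([(384, 384)] * 48)           # frozen transformer weights
rect_prompter = n_params([(384, 64), (64, 3)])   # correction-vector MLP
compl_prompter = n_params([(384, 64), (64, 96)]) # coarse-center predictor
sa_units = n_params([(384, 16), (16, 384)] * 12) # low-rank adapters per block

trainable = rect_prompter + compl_prompter + sa_units
print(f"{trainable / (backbone + trainable):.1%} of parameters are trainable")
```

Even with these made-up shapes, the trainable fraction lands in the low single digits, which is the regime the paper reports (3.2%–6.3%).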
### Rectification Prompter
Given a noisy, incomplete point cloud \(\boldsymbol{x} \in \mathbb{R}^{S \times 3}\), the input is first encoded into \(L\) tokens and processed through \(d_r\) transformer blocks. Features are then propagated from the sparse center tokens back to the dense points via spatial interpolation.
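The interpolation step itself is not spelled out here; a standard choice, assuming PointNet++-style feature propagation, is inverse-distance weighting over the \(k\) nearest center tokens:

\[
\boldsymbol{f}(p) = \frac{\sum_{i=1}^{k} w_i(p)\, \boldsymbol{f}_i}{\sum_{i=1}^{k} w_i(p)}, \qquad w_i(p) = \frac{1}{\|p - \boldsymbol{c}_i\|_2^{2}}
\]

where \(\boldsymbol{c}_i\) are the \(k\) nearest token centers to point \(p\) and \(\boldsymbol{f}_i\) their features. The exact weighting in the paper may differ.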
An MLP predicts a correction vector \(\boldsymbol{v}_r \in \mathbb{R}^{S \times 3}\) for each point. Large-magnitude vectors indicate low-confidence noisy points, which are filtered out by a threshold \(\tau\).
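A minimal NumPy sketch of this step, with the MLP abstracted away as a given correction field `v_r`; the value of `tau`, the toy data, and the choice to apply corrections before filtering are all assumptions of this sketch:

```python
import numpy as np

def rectify(x, v_r, tau=0.05):
    """x: (S, 3) points, v_r: (S, 3) predicted corrections.
    Keeps points whose correction magnitude is at most tau."""
    mag = np.linalg.norm(v_r, axis=1)
    keep = mag <= tau          # large corrections => low-confidence noise
    return (x + v_r)[keep]     # apply corrections, discard outliers

x = np.zeros((4, 3))
v_r = np.array([[0.01, 0, 0], [0.2, 0, 0], [0, 0.02, 0], [0, 0, 0.5]])
clean = rectify(x, v_r)
print(clean.shape)  # (2, 3): the two high-magnitude points are dropped
```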
Training objective: noisy points are supervised with displacement vectors pointing toward the clean surface, while clean points are supervised with zero displacement.
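One plausible form of this objective, sketched from the description (the paper's exact norm and weighting may differ), is a per-point regression on the predicted vectors:

\[
\mathcal{L}_r = \frac{1}{S} \sum_{i=1}^{S} \left\| \boldsymbol{v}_r^{(i)} - \boldsymbol{v}^{\ast(i)} \right\|_2^2,
\qquad
\boldsymbol{v}^{\ast(i)} =
\begin{cases}
\operatorname{proj}\!\left(\boldsymbol{x}^{(i)}\right) - \boldsymbol{x}^{(i)}, & \boldsymbol{x}^{(i)} \text{ noisy} \\
\boldsymbol{0}, & \boldsymbol{x}^{(i)} \text{ clean}
\end{cases}
\]

where \(\operatorname{proj}(\cdot)\) denotes projection onto the clean surface.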
### Completion Prompter
The rectified point cloud \(\boldsymbol{x}_r\) is re-sampled and re-encoded; after \(d_c\) blocks, tokens are down-projected and concatenated into a global feature \(\boldsymbol{f}_c\) to predict coarse centers \(\boldsymbol{c}_m\) for missing regions.
Key design: the MAE pre-trained decoder is reused for local patch reconstruction.
The final output merges rectified and completed points via FPS re-sampling: \(\boldsymbol{x}_c = \text{FPS}([\boldsymbol{x}_m, \boldsymbol{x}_r])\)
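FPS here is the standard greedy farthest point sampling. A minimal NumPy version (point counts are illustrative):

```python
import numpy as np

def fps(points, n):
    """points: (N, 3); greedily selects n points that are maximally
    spread out, updating each point's distance to the chosen set."""
    idx = [0]                                    # start from the first point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(n - 1):
        nxt = int(np.argmax(dist))               # farthest from chosen set
        idx.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[idx]

x_r = np.random.default_rng(0).normal(size=(600, 3))  # rectified points
x_m = np.random.default_rng(1).normal(size=(400, 3))  # completed points
x_c = fps(np.concatenate([x_m, x_r]), 512)            # merged, fixed-size output
print(x_c.shape)  # (512, 3)
```

Merging via FPS keeps the output size fixed while mixing rectified and completed points evenly over the shape.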
Loss function: the L1 Chamfer Distance between the predicted and ground-truth points.
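The standard L1 Chamfer Distance between a predicted set \(P\) and ground truth \(Q\) is:

\[
\mathcal{L}_{\mathrm{CD}}(P, Q) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} \|p - q\|_2 + \frac{1}{|Q|} \sum_{q \in Q} \min_{p \in P} \|q - p\|_2
\]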
### Shape-Aware Unit
Inserted within each transformer block, comprising two innovations:
- Shape-Aware Attention: Establishes connections based on spatial distance rather than feature similarity; noisy outliers are unlikely to alter spatial neighborhood relations, yielding greater robustness
- Low-rank adapter: \(\boldsymbol{h}_{i+1} = W_2 \cdot \sigma(W_1(\hat{\boldsymbol{h}}_i)) + \hat{\boldsymbol{h}}_i\), preventing feature over-smoothing
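The adapter equation above can be sketched directly in NumPy; the choice of ReLU for \(\sigma\), the rank, and the zero initialization of \(W_2\) are assumptions of this sketch, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 384, 16                      # token dim and adapter rank (r << d)
W1 = rng.normal(0, 0.02, (r, d))    # down-projection
W2 = np.zeros((d, r))               # up-projection, zero-init

def adapter(h):
    """h: (N, d) token features; returns W2 · sigma(W1 h) + h (residual)."""
    z = np.maximum(W1 @ h.T, 0.0)   # ReLU assumed as the nonlinearity sigma
    return (W2 @ z).T + h

h = rng.normal(size=(5, d))
out = adapter(h)
print(np.allclose(out, h))  # True: zero-init W2 makes the adapter start as identity
```

The residual connection means the adapter can only perturb, never replace, the frozen backbone's features, which is what keeps them from over-smoothing.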
### Total Loss
A staged optimization strategy is adopted to improve training stability.
## Key Experimental Results

### Noisy Point Cloud Classification (Main Results)
| Method | Reference | Params (M)↓ | Noisy ModelNet40↑ | Noisy ShapeNet55↑ |
|---|---|---|---|---|
| Point-MAE (FFT) | ECCV22 | 22.1 (100%) | 89.42 | 88.13 |
| +Point-PEFT | AAAI24 | 0.7 (3.2%) | 87.52 (−1.90) | 86.01 (−2.12) |
| +DAPT | CVPR24 | 1.1 (5.0%) | 86.43 (−2.99) | 86.33 (−1.80) |
| +UPP (Ours) | — | 1.4 (6.3%) | 92.95 (+3.53) | 90.40 (+2.27) |
| ReCon (FFT) | ICML23 | 43.6 (100%) | 89.67 | 89.01 |
| +UPP (Ours) | — | 1.4 (3.2%) | 91.69 (+2.02) | 89.68 (+0.67) |
| Point-FEMAE (FFT) | AAAI24 | 27.4 (100%) | 89.59 | 88.63 |
| +UPP (Ours) | — | 1.4 (5.1%) | 91.94 (+2.35) | 90.08 (+1.45) |
UPP surpasses full fine-tuning across all three backbones while using only 3.2%–6.3% of the parameters. Existing PEFT methods consistently degrade performance.
### Real-World Data (ScanObjectNN)
| Method | Params (M) | Acc. (%) |
|---|---|---|
| Point-FEMAE (baseline) | 27.4 | 90.71 |
| +Point-PEFT | 0.7 | 89.16 |
| +DAPT | 1.1 | 89.67 |
| +UPP (Ours) | 1.4 | 91.39 |
### Ablation Study
| Base | Rect. Prompter | Compl. Prompter | SA-Unit | Acc. (%) |
|---|---|---|---|---|
| ✓ | ✗ | ✗ | ✗ | 89.42 |
| ✓ | ✓ | ✗ | ✗ | 90.90 |
| ✓ | ✗ | ✓ | ✗ | 91.36 |
| ✓ | ✗ | ✗ | ✓ | 91.28 |
| ✓ | ✓ | ✓ | ✓ | 92.95 |
Each component individually contributes 1.5–2 percentage points; their combination achieves the best performance.
## Key Findings
- Adverse effect of PEFT methods: Existing 3D PEFT methods (Point-PEFT, DAPT) actually hurt performance on noisy data, as they ignore explicit handling of input noise
- Importance of input-space intervention: UPP performs rectification and completion in the data space rather than solely in the feature space, yielding more direct and effective improvements
- Robustness of Shape-Aware Attention: Spatial-distance-based attention is more resistant to noise interference than feature-similarity-based attention
- Backbone agnosticism: UPP generalizes effectively across Point-MAE, ReCon, and Point-FEMAE
## Highlights & Insights
- Paradigm shift: Denoising and completion are transformed from standalone pre-processing steps into unified prompts for downstream tasks, eliminating domain gaps and objective conflicts
- Data-space prompting: Unlike VPT and similar methods that add prompt tokens solely in the feature space, UPP operates directly in the point coordinate space (displacing or adding discrete points)
- Reuse of pre-trained decoder: The MAE decoder weights, which are typically discarded after pre-training, are cleverly repurposed for point cloud completion
## Limitations & Future Work
- The staged optimization strategy increases training complexity
- The number of completion points \(M\) is a fixed hyperparameter, limiting adaptability to varying degrees of incompleteness
- Validation is limited to classification; effectiveness on segmentation and detection tasks remains to be confirmed
## Related Work & Insights
- Point cloud pre-training: Point-MAE, ReCon, Point-FEMAE, PointGPT
- Point cloud enhancement: ScoreDenoise, PoinTr, T-CorresNet
- 3D PEFT: IDPT, Point-PEFT, DAPT, GAPrompt
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — Paradigm innovation unifying denoising and completion as prompting mechanisms
- Technical Depth: ⭐⭐⭐⭐ — Elegant three-component design; Shape-Aware Attention is supported by theoretical analysis
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple backbones, multiple datasets, comprehensive ablations
- Value: ⭐⭐⭐⭐ — Parameter-efficient, open-source, directly improves robustness of existing models