UPP: Unified Point-Level Prompting for Robust Point Cloud Analysis

Conference: ICCV 2025 · arXiv: 2507.18997 · Code: GitHub · Area: 3D Vision · Keywords: point cloud analysis, parameter-efficient fine-tuning, denoising, completion, prompt learning

TL;DR

This paper proposes UPP, a unified point-level prompting framework that reformulates point cloud denoising and completion as prompting mechanisms for downstream tasks. It introduces a Rectification Prompter to filter noise, a Completion Prompter to recover missing regions, and a Shape-Aware Unit to capture geometry-sensitive features. Training only 6.3% of the parameters, UPP surpasses full fine-tuning on noisy and incomplete point clouds.

Background & Motivation

Pre-trained point cloud models (Point-MAE, ReCon, etc.) have achieved remarkable progress on various downstream tasks. However, real-world point clouds frequently suffer from noise and incompleteness due to object occlusion, reflective surfaces, and sensor resolution limitations, which severely degrades model performance.

Limitations of existing approaches:

Dedicated denoising/completion models + downstream tasks (pipeline paradigm):

  • Conflicting objectives between denoising and completion: denoising removes excess points while completion adds missing ones, and naive integration leads to mutual interference
  • Domain gap between the enhancement tasks and the downstream task, resulting in suboptimal performance
  • Complex training pipelines with high computational and memory overhead

Parameter-efficient fine-tuning (PEFT) methods (IDPT, Point-PEFT, DAPT):

  • Improve representational capacity only in the latent feature space
  • Neglect explicit suppression of noise and defects in the input point cloud
  • Features become indistinguishable on low-quality data, leading to severe performance degradation

UPP's innovation: Denoising and completion are reformulated as downstream task-oriented prompting mechanisms, intervening in the input data space rather than only the feature space, with unified end-to-end training.

Method

Overall Architecture

The pre-trained backbone is frozen and three trainable components are inserted, as sketched below:

  1. Rectification Prompter: predicts corrective vector prompts after the shallow blocks to filter noise
  2. Completion Prompter: generates completion point prompts after the deep blocks to recover missing regions
  3. Shape-Aware Unit: inserted within each block to capture geometry-sensitive features
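
A minimal PyTorch sketch of this arrangement, assuming hypothetical `rect_prompter`, `comp_prompter`, and `task_head` modules (names and call signatures are illustrative, not the released implementation):

```python
import torch.nn as nn

class UPPWrapper(nn.Module):
    """Hypothetical wrapper: trainable UPP modules around a frozen backbone."""
    def __init__(self, backbone: nn.Module, rect_prompter: nn.Module,
                 comp_prompter: nn.Module, task_head: nn.Module):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # pre-trained weights stay frozen
            p.requires_grad = False
        self.rect_prompter = rect_prompter     # trainable: filters noisy points
        self.comp_prompter = comp_prompter     # trainable: adds missing points
        self.task_head = task_head             # trainable downstream head
        # Shape-Aware Units would additionally be inserted inside each backbone
        # block; that surgery is omitted in this sketch.

    def forward(self, points):                 # points: (B, S, 3)
        x_r = self.rect_prompter(points)       # rectified point prompt
        x_c = self.comp_prompter(x_r)          # completed point prompt
        return self.task_head(self.backbone(x_c))
```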

Rectification Prompter

Given a noisy, incomplete point cloud \(\boldsymbol{x} \in \mathbb{R}^{S \times 3}\), the input is encoded into \(L\) tokens and processed through \(d_r\) transformer blocks. Features are then propagated from the sparse center tokens to the dense points via spatial interpolation:

\[\boldsymbol{f}_r = \mathcal{F}(\boldsymbol{h}_{d_r}, \boldsymbol{c}, \boldsymbol{x}) \in \mathbb{R}^{S \times D_r}\]
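
One common choice for such a propagation operator \(\mathcal{F}\) is PointNet++-style inverse-distance interpolation over the \(k\) nearest centers; a minimal sketch (with \(k\) an assumed hyperparameter, not taken from the paper):

```python
import torch

def propagate_features(h: torch.Tensor, c: torch.Tensor, x: torch.Tensor,
                       k: int = 3, eps: float = 1e-8) -> torch.Tensor:
    """Interpolate sparse per-center features h (B, L, D) located at centers
    c (B, L, 3) onto dense points x (B, S, 3) by inverse-distance weighting
    over the k nearest centers. Returns per-point features of shape (B, S, D)."""
    dist = torch.cdist(x, c)                                   # (B, S, L)
    knn_dist, knn_idx = dist.topk(k, dim=-1, largest=False)    # k closest centers
    w = 1.0 / (knn_dist + eps)
    w = w / w.sum(dim=-1, keepdim=True)                        # (B, S, k) weights
    B, S, _ = x.shape
    D = h.shape[-1]
    idx = knn_idx.unsqueeze(-1).expand(B, S, k, D)
    neighbours = torch.gather(h.unsqueeze(1).expand(B, S, -1, D), 2, idx)
    return (w.unsqueeze(-1) * neighbours).sum(dim=2)
```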

An MLP predicts a correction vector \(\boldsymbol{v}_r \in \mathbb{R}^{S \times 3}\) for each point. Large-magnitude vectors correspond to low-confidence noisy points, which are filtered by threshold \(\tau\):

\[\boldsymbol{x}_r = \{\boldsymbol{x}^i + \alpha \, \boldsymbol{v}_r^i \mid \|\boldsymbol{v}_r^i\| < \tau\}\]
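
A hedged sketch of this rectification step, with `tau` and `alpha` as assumed hyperparameters and an illustrative two-layer MLP:

```python
import torch
import torch.nn as nn

class RectificationHead(nn.Module):
    """Predicts a per-point correction vector from the propagated features and
    keeps only the points whose correction magnitude falls below tau."""
    def __init__(self, feat_dim: int, tau: float = 0.1, alpha: float = 1.0):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU(),
                                 nn.Linear(feat_dim, 3))
        self.tau, self.alpha = tau, alpha

    def forward(self, f_r: torch.Tensor, x: torch.Tensor):
        v_r = self.mlp(f_r)                        # (B, S, 3) correction vectors
        keep = v_r.norm(dim=-1) < self.tau         # small magnitude = confident point
        x_rect = x + self.alpha * v_r              # displace points toward the surface
        # Clouds keep different numbers of points after filtering, hence a list.
        return [x_rect[b][keep[b]] for b in range(x.shape[0])], v_r
```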

Training objective: Noisy points are supervised with displacement vectors toward the clean surface; clean points are supervised with zero displacement:

\[\mathcal{L}_{\text{rect}} = \frac{1}{S_n}\sum_{i \in \boldsymbol{n}} \|\boldsymbol{v}_r^i - \boldsymbol{v}_{gt}^i\|^2 + \frac{1}{S}\sum_{i \in \boldsymbol{x}} \|\boldsymbol{v}_r^i\|^2\]
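
One reading of this objective in code, with the zero-displacement term averaged over all \(S\) points as the formula is written; `noise_mask` is a hypothetical per-point corruption label available during training:

```python
import torch

def rectification_loss(v_pred: torch.Tensor, v_gt: torch.Tensor,
                       noise_mask: torch.Tensor) -> torch.Tensor:
    """v_pred, v_gt: (B, S, 3) predicted / target displacements.
    noise_mask: (B, S) bool, True where a point was corrupted by noise.
    The first term regresses displacements on noisy points only; the second
    pulls predictions toward zero displacement, averaged over all points."""
    mask = noise_mask.float()
    sq_err = (v_pred - v_gt).pow(2).sum(dim=-1)                   # (B, S)
    noisy_term = (sq_err * mask).sum() / mask.sum().clamp(min=1.0)
    zero_term = v_pred.pow(2).sum(dim=-1).mean()
    return noisy_term + zero_term
```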

Completion Prompter

The rectified point cloud \(\boldsymbol{x}_r\) is re-sampled and re-encoded; after \(d_c\) blocks, tokens are down-projected and concatenated into a global feature \(\boldsymbol{f}_c\) to predict coarse centers \(\boldsymbol{c}_m\) for missing regions.

Key design: reuse of the MAE pre-trained decoder for local patch reconstruction:

\[\boldsymbol{x}_m = \mathcal{D}([\boldsymbol{h}_m + \text{Embed}(\boldsymbol{c}_m), \boldsymbol{h}_{d_c}])\]

The final output merges rectified and completed points via farthest point sampling (FPS): \(\boldsymbol{x}_c = \text{FPS}([\boldsymbol{x}_m, \boldsymbol{x}_r])\)
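
A rough sketch of this completion path under the definitions above: a down-projected global feature predicts \(M\) coarse centers, a reused (frozen) decoder refines them into local patches, and FPS merges the result with the rectified cloud. The decoder call signature, the dimensions, and the naive FPS routine are all placeholders, not the authors' exact design:

```python
import torch
import torch.nn as nn

def farthest_point_sample(points: torch.Tensor, n_samples: int) -> torch.Tensor:
    """Naive O(N * n_samples) FPS on a single point cloud (N, 3)."""
    N = points.shape[0]
    idx = torch.zeros(n_samples, dtype=torch.long)          # first index is 0
    dist = torch.full((N,), float("inf"))
    for i in range(1, n_samples):
        dist = torch.minimum(dist, (points - points[idx[i - 1]]).norm(dim=-1))
        idx[i] = dist.argmax()                               # farthest remaining point
    return points[idx]

class CompletionPrompter(nn.Module):
    """Illustrative completion head: global feature -> coarse centers ->
    frozen (reused) decoder -> FPS merge with the rectified cloud."""
    def __init__(self, token_dim: int, n_tokens: int, mae_decoder: nn.Module,
                 n_missing: int = 64, proj_dim: int = 32, out_points: int = 2048):
        super().__init__()
        self.down = nn.Linear(token_dim, proj_dim)                  # per-token down-projection
        self.center_head = nn.Linear(n_tokens * proj_dim, n_missing * 3)
        self.decoder = mae_decoder                                  # reused MAE decoder, kept frozen
        self.n_missing, self.out_points = n_missing, out_points

    def forward(self, tokens, x_r):          # tokens: (B, L, D); x_r: list of (N_b, 3)
        B = tokens.shape[0]
        f_c = self.down(tokens).flatten(1)   # concatenated global feature
        c_m = self.center_head(f_c).view(B, self.n_missing, 3)     # coarse missing centers
        x_m = self.decoder(c_m, tokens)      # placeholder call: local patch reconstruction
        merged = [farthest_point_sample(
                      torch.cat([x_m[b].reshape(-1, 3), x_r[b]]), self.out_points)
                  for b in range(B)]
        return torch.stack(merged), c_m
```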

Loss function (L1 Chamfer Distance):

\[\mathcal{L}_{\text{comp}} = \mathcal{C}_1(\boldsymbol{c}_m, \mathcal{P}_m) + \mathcal{C}_1(\boldsymbol{x}_m, \mathcal{P}_m) + \mathcal{C}_1(\boldsymbol{x}_c, \mathcal{P}_{gt})\]
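
For reference, a minimal implementation of the symmetric \(\ell_1\) Chamfer distance \(\mathcal{C}_1\) as it is commonly defined in the completion literature (mean nearest-neighbour Euclidean distance in both directions):

```python
import torch

def chamfer_l1(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric L1 Chamfer distance between point sets a (B, N, 3) and b (B, M, 3)."""
    dist = torch.cdist(a, b)                              # (B, N, M) pairwise distances
    return dist.min(dim=2).values.mean() + dist.min(dim=1).values.mean()
```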

Shape-Aware Unit

Inserted within each transformer block, comprising two innovations (both sketched below):

  1. Shape-Aware Attention: Establishes connections based on spatial distance rather than feature similarity; noisy outliers are unlikely to alter spatial neighborhood relations, yielding greater robustness
  2. Low-rank adapter: \(\boldsymbol{h}_{i+1} = W_2 \cdot \sigma(W_1(\hat{\boldsymbol{h}}_i)) + \hat{\boldsymbol{h}}_i\), preventing feature over-smoothing
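
A hedged sketch of one possible reading of the unit: attention logits come from negative, scaled pairwise distances between patch centers instead of query-key similarity, followed by the low-rank residual adapter from the formula above. The temperature, rank, and exact placement inside the block are assumptions:

```python
import torch
import torch.nn as nn

class ShapeAwareUnit(nn.Module):
    """Distance-based attention over patch centers plus a low-rank residual adapter."""
    def __init__(self, dim: int, rank: int = 16, temperature: float = 0.1):
        super().__init__()
        self.value = nn.Linear(dim, dim)
        self.temperature = temperature
        self.down = nn.Linear(dim, rank)      # adapter W1
        self.up = nn.Linear(rank, dim)        # adapter W2
        self.act = nn.GELU()

    def forward(self, h, centers):            # h: (B, L, D), centers: (B, L, 3)
        # Attention weights from spatial proximity, robust to feature-level noise.
        attn = torch.softmax(-torch.cdist(centers, centers) / self.temperature, dim=-1)
        h_hat = h + attn @ self.value(h)      # spatial-distance-based token mixing
        return h_hat + self.up(self.act(self.down(h_hat)))   # low-rank residual adapter
```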

Total Loss

\[\mathcal{L} = \mathcal{L}_{\text{rect}} + \mathcal{L}_{\text{comp}} + \mathcal{L}_{\text{task}}\]

A staged optimization strategy is adopted to improve training stability.

Key Experimental Results

Noisy Point Cloud Classification (Main Results)

| Method | Reference | Params (M) ↓ | Noisy ModelNet40 ↑ | Noisy ShapeNet55 ↑ |
|---|---|---|---|---|
| Point-MAE (FFT) | ECCV22 | 22.1 (100%) | 89.42 | 88.13 |
| +Point-PEFT | AAAI24 | 0.7 (3.2%) | 87.52 (−1.90) | 86.01 (−2.12) |
| +DAPT | CVPR24 | 1.1 (5.0%) | 86.43 (−2.99) | 86.33 (−1.80) |
| +UPP (Ours) | | 1.4 (6.3%) | 92.95 (+3.53) | 90.40 (+2.27) |
| ReCon (FFT) | ICML23 | 43.6 (100%) | 89.67 | 89.01 |
| +UPP (Ours) | | 1.4 (3.2%) | 91.69 (+2.02) | 89.68 (+0.67) |
| Point-FEMAE (FFT) | AAAI24 | 27.4 (100%) | 89.59 | 88.63 |
| +UPP (Ours) | | 1.4 (5.1%) | 91.94 (+2.35) | 90.08 (+1.45) |

UPP surpasses full fine-tuning across all three backbones while training only 3.2%–6.3% of their parameters, whereas the existing PEFT methods consistently degrade performance on noisy data.

Real-World Data (ScanObjectNN)

| Method | Params (M) | Acc. (%) |
|---|---|---|
| Point-FEMAE (baseline) | 27.4 | 90.71 |
| +Point-PEFT | 0.7 | 89.16 |
| +DAPT | 1.1 | 89.67 |
| +UPP (Ours) | 1.4 | 91.39 |

Ablation Study

| Base | Rect. Prompter | Compl. Prompter | SA-Unit | Acc. (%) |
|---|---|---|---|---|
| ✓ | | | | 89.42 |
| ✓ | ✓ | | | 90.90 |
| ✓ | | ✓ | | 91.36 |
| ✓ | | | ✓ | 91.28 |
| ✓ | ✓ | ✓ | ✓ | 92.95 |

Each component individually contributes 1.5–2 percentage points; their combination achieves the best performance.

Key Findings

  1. Adverse effect of PEFT methods: Existing 3D PEFT methods (Point-PEFT, DAPT) actually hurt performance on noisy data, as they ignore explicit handling of input noise
  2. Importance of input-space intervention: UPP performs rectification and completion in the data space rather than solely in the feature space, yielding more direct and effective improvements
  3. Robustness of Shape-Aware Attention: Spatial-distance-based attention is more resistant to noise interference than feature-similarity-based attention
  4. Backbone agnosticism: UPP generalizes effectively across Point-MAE, ReCon, and Point-FEMAE

Highlights & Insights

  1. Paradigm shift: Denoising and completion are transformed from standalone pre-processing steps into unified prompts for downstream tasks, eliminating domain gaps and objective conflicts
  2. Data-space prompting: Unlike VPT and similar methods that add prompt tokens solely in the feature space, UPP operates directly in the point coordinate space (displacing or adding discrete points)
  3. Reuse of pre-trained decoder: The MAE decoder weights, which are typically discarded after pre-training, are cleverly repurposed for point cloud completion

Limitations & Future Work

  1. The staged optimization strategy increases training complexity
  2. The number of completion points \(M\) is a fixed hyperparameter, limiting adaptability to varying degrees of incompleteness
  3. Validation is limited to classification; effectiveness on segmentation and detection tasks remains to be confirmed

Related Work

  • Point cloud pre-training: Point-MAE, ReCon, Point-FEMAE, PointGPT
  • Point cloud enhancement: ScoreDenoise, PoinTr, T-CorresNet
  • 3D PEFT: IDPT, Point-PEFT, DAPT, GAPrompt

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Paradigm innovation unifying denoising and completion as prompting mechanisms
  • Technical Depth: ⭐⭐⭐⭐ — Elegant three-component design; Shape-Aware Attention is supported by theoretical analysis
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple backbones, multiple datasets, comprehensive ablations
  • Value: ⭐⭐⭐⭐ — Parameter-efficient, open-source, directly improves robustness of existing models