Deformation-based In-Context Learning for Point Cloud Understanding¶
Conference: CVPR 2026 arXiv: 2604.02845 Code: link Area: 3D Vision Keywords: point cloud in-context learning, deformation network, geometric reasoning, masked point modeling, multi-task general model
TL;DR¶
This paper proposes DeformPIC, which reframes point cloud In-Context Learning from a "masked reconstruction" paradigm to a "deformation transfer" paradigm. A Deformation Extraction Network (DEN) extracts task-specific semantics from the prompt pair, and a Deformation Transfer Network (DTN) applies the extracted deformation to the query point cloud, reducing Chamfer Distance by 1.6/1.8/4.7 on reconstruction/denoising/registration respectively, relative to the best prior ICL baseline on each task.
Background & Motivation¶
Background: 3D point cloud ICL aims to enable models to handle diverse tasks (reconstruction, denoising, registration, segmentation) from a small number of examples. Existing methods (PIC, PIC++) are built upon Masked Point Modeling (MPM).
Limitations of Prior Work: (1) Geometry-free: MPM predicts target point clouds from geometry-free masked tokens, lacking explicit geometric reasoning; (2) Train-inference mismatch: during training the target is partially masked (allowing the model to exploit visible parts), whereas at inference the target is entirely unknown.
Key Challenge: Masked tokens are abstract placeholders that encode no geometric correspondence, forcing the model to implicitly infer spatial structure through self-attention alone.
Goal: To equip ICL with explicit geometric manipulation capability and eliminate the inconsistency between training and inference objectives.
Key Insight: Tasks are reformulated as "deforming the query point cloud under prompt guidance," since deformation inherently preserves geometric continuity.
Core Idea: Extract task-specific deformation information from the prompt pair via DEN, then transfer and apply it to the query point cloud via DTN.
Method¶
Overall Architecture¶
A dual-network architecture: DEN extracts a task token \(\hat{T}_{\text{task}}\) from the prompt input→target pair; DTN deforms the query input under AdaLN-Zero modulation conditioned on \(\hat{T}_{\text{task}}\).
Key Designs¶
- Deformation Extraction Network (DEN): A mini-PointNet encodes the prompt input and target tokens, which are concatenated with a learnable task token and processed by a Transformer to extract \(\hat{T}_{\text{task}} = \mathcal{E}([T_{\text{task}} \| T_{P_i} \| T_{P_t}])\). Design Motivation: PIC processes prompt and query jointly, but task-semantic extraction and geometric reconstruction are distinct objectives; decoupling them improves efficiency.
- Deformation Transfer Network (DTN): AdaLN-Zero injects the task token into each Transformer layer: \(h^{(l+1)} = h^{(l)} + \sigma^{(l)} \cdot \mathcal{A}[(1+\eta^{(l)}) \cdot \text{LN}(h^{(l)}) + \kappa^{(l)}]\), where \(\sigma, \eta, \kappa\) are generated from \(\hat{T}_{\text{task}}\) via zero-initialized MLPs. Design Motivation: AdaLN-Zero, borrowed from DiT, enables fine-grained, layer-wise conditioning.
- Train-Inference Consistency: Both training and inference execute the same deformation process: the query point cloud is provided as input and the deformed point cloud is produced as output, with no masking operation required.
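The AdaLN-Zero modulation above can be made concrete with a minimal numpy sketch. The attention sub-layer here is a toy single-head stand-in for \(\mathcal{A}\), and all names are illustrative, not the paper's implementation; the point is the modulation rule itself and the zero-initialization property: since \(\sigma\) starts at zero, the residual branch is gated off and each DTN layer begins training as an identity map.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token over the feature dimension (no learnable affine here).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(x):
    # Toy single-head self-attention standing in for the sub-layer A.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ x

def adaln_zero_block(h, t_task, W_mod, b_mod):
    # W_mod/b_mod map the task token to (sigma, eta, kappa); both start at
    # zero, so sigma = 0 gates off the branch and the block is an identity.
    d = h.shape[-1]
    mod = t_task @ W_mod + b_mod
    sigma, eta, kappa = mod[:d], mod[d:2*d], mod[2*d:]
    return h + sigma * attention((1 + eta) * layer_norm(h) + kappa)

d, d_task = 8, 4
rng = np.random.default_rng(0)
h = rng.normal(size=(16, d))       # 16 query tokens
t_task = rng.normal(size=d_task)   # task token produced by the DEN
W_mod = np.zeros((d_task, 3 * d))  # zero-initialized modulation MLP
b_mod = np.zeros(3 * d)
out = adaln_zero_block(h, t_task, W_mod, b_mod)
print(np.allclose(out, h))  # True: at init the layer passes h through unchanged
```

This identity-at-init behavior is the same trick DiT uses to make conditioning layers trainable without destabilizing a pretrained residual stream.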
Loss & Training¶
- \(L_2\) Chamfer Distance: \(\mathcal{L} = \frac{1}{|\hat{R}|}\sum_{p \in \hat{R}} \min_{g \in R} \|p - g\|_2^2 + \frac{1}{|R|}\sum_{g \in R} \min_{p \in \hat{R}} \|g - p\|_2^2\)
- AdamW + cosine decay, lr warmup for 10 epochs, 300 total training epochs, batch size 128
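The symmetric \(L_2\) Chamfer Distance above is straightforward to compute directly; a minimal numpy sketch (not the paper's code) for small point sets:

```python
import numpy as np

def chamfer_l2(pred, gt):
    # Symmetric L2 Chamfer Distance between point sets pred (N,3) and gt (M,3):
    # mean squared nearest-neighbor distance in each direction, summed.
    d2 = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Identical clouds score 0; a single point shifted by one unit scores 2.0
# (squared distance 1.0 contributed by each direction).
p = np.array([[0.0, 0.0, 0.0]])
g = np.array([[1.0, 0.0, 0.0]])
print(chamfer_l2(p, p))  # 0.0
print(chamfer_l2(p, g))  # 2.0
```

The O(NM) pairwise matrix is fine at the paper's scale (1,024 points); larger clouds would use a KD-tree or GPU nearest-neighbor kernel instead.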
Key Experimental Results¶
Main Results (ShapeNet In-Context Dataset, Chamfer Distance ×1000 ↓)¶
| Method | Reconstruction Avg | Denoising Avg | Registration Avg | Segmentation mIoU↑ |
|---|---|---|---|---|
| PIC-Cat | 4.3 | 5.3 | 14.1 | 79.0 |
| PIC-S-Cat | 6.9 | 6.5 | 24.1 | 83.8 |
| PIC-S-Sep | 5.1 | 12.0 | 6.7 | 83.7 |
| DeformPIC | 2.7 | 3.5 | 2.0 | 83.9 |
Ablation Study¶
| Comparison | Metric Change | Remark |
|---|---|---|
| vs PIC-Cat (reconstruction) | 4.3→2.7 (−1.6) | Deformation outperforms masked reconstruction |
| vs PIC-Cat (denoising) | 5.3→3.5 (−1.8) | Explicit geometric manipulation is effective |
| vs PIC-Cat (registration) | 14.1→2.0 (−12.1) | Registration is inherently geometric; deformation is a natural fit |
| vs task-specific PCT | 2.6/2.2/6.3 vs 2.7/3.5/2.0 | Competitive on reconstruction, behind on denoising, far ahead on registration |
Key Findings¶
- Registration shows the largest improvement (CD 14.1→2.0), as registration is essentially rigid-body transformation, which is naturally aligned with the deformation paradigm.
- Segmentation performance remains at SOTA (83.9 mIoU), demonstrating that the deformation paradigm generalizes to discrete semantic tasks.
- SOTA results are also achieved in cross-domain evaluation on ModelNet40 and ScanObjectNN, confirming strong generalization.
- Qualitative results show that DeformPIC produces more complete and geometrically accurate 3D shapes.
Highlights & Insights¶
- Paradigm shift: Moving from "predicting masked content" to "deforming the input" better reflects the geometric nature of 3D data.
- Train-inference consistency is critical: eliminating the mismatch yields substantial performance gains.
- Decoupled design (DEN for extraction + DTN for transfer) is more effective than joint processing.
- Successful transfer of AdaLN-Zero from DiT to point cloud ICL demonstrates the cross-domain applicability of diffusion model conditioning techniques.
- Registration benefits from a natural alignment with the deformation framework, as it is intrinsically a geometric transformation, driving CD from 14.1 down to 2.0.
- Maintaining SOTA on segmentation (a discrete semantic task) confirms the generality of the deformation framework.
- Strong cross-domain generalization (ShapeNet→ModelNet40/ScanObjectNN) validates the robustness of the proposed method.
Limitations & Future Work¶
- The deformation paradigm adapts less naturally to discrete semantic tasks such as part segmentation compared to continuous geometric tasks.
- Primary evaluation is conducted on synthetic datasets; performance on real-world point clouds remains to be verified.
- Larger-scale pre-training has not been explored.
- DEN and DTN employ separate encoders; a shared encoder may yield further improvements.
- The information from a single prompt pair may be insufficient; few-shot ICL with multiple prompts warrants exploration.
- The method may be limited when deformation magnitude is extremely large (e.g., deforming a cup into a car).
- Scalability of 300-epoch training to larger datasets has not been verified.
Related Work & Insights¶
- Core distinction from PIC/PIC++: masked reconstruction → deformation transfer; joint processing → decoupled processing.
- Neural deformation methods (FlowNet3D, Pixel2Mesh) have already demonstrated the effectiveness of deformation-based strategies.
- AdaLN-Zero, originating from DiT, shows that conditioning techniques developed for diffusion models transfer effectively to other domains.
- DG-PIC and PCoTTA employ transfer learning for adaptation to new scenarios, and are orthogonally complementary to DeformPIC.
Technical Details¶
- Point cloud sampling: 1,024 points/object, 64 patches × 32 points/patch
- Point Encoder: mini-PointNet maps point patches to tokens
- AdaLN-Zero initialization: \(W_1, W_2, W_3\) are zero-initialized so that DTN is equivalent to an unconditional Transformer at the start of training
- 5 difficulty levels: L1 (slight perturbation) to L5 (high noise / large-angle rotation)
- vs task-specific models: competitive on reconstruction (2.7 vs 2.5), a remaining gap on denoising (3.5 vs 2.2), far ahead on registration (2.0 vs 5.9)
- End-to-end deformation objective: directly predicts deformed point cloud coordinates, avoiding instabilities of displacement-field optimization
- Training configuration: AdamW, lr warmup 1e-6→1e-4 (10 epochs), cosine decay, weight decay 0.05
- Baselines: four categories — task-specific models (PointNet/DGCNN/PCT/ACT), multi-task models, pre-trained multi-task models, and ICL models
- Dataset scale: 174,404 training samples + 43,050 test samples, covering 4 tasks × 5 difficulty levels
- Single-GPU training: all experiments are completed on a single NVIDIA TITAN RTX 24 GB GPU
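The 1,024-point → 64 patches × 32 points/patch tokenization is typically implemented with farthest point sampling (FPS) to pick patch centers followed by k-nearest-neighbor grouping, as in Point-MAE-style pipelines; the sketch below assumes that standard recipe, since the notes do not quote the paper's exact grouping code.

```python
import numpy as np

def farthest_point_sample(points, n_centers):
    # Greedy FPS: repeatedly pick the point farthest from all chosen centers,
    # giving an approximately uniform cover of the shape.
    n = points.shape[0]
    idx = np.zeros(n_centers, dtype=int)
    dist = np.full(n, np.inf)
    for i in range(1, n_centers):
        d = ((points - points[idx[i - 1]]) ** 2).sum(-1)
        dist = np.minimum(dist, d)
        idx[i] = dist.argmax()
    return idx

def group_patches(points, n_centers=64, k=32):
    # kNN grouping around each FPS center, yielding (n_centers, k, 3) patches
    # in center-relative coordinates (standard before a mini-PointNet encoder).
    centers = points[farthest_point_sample(points, n_centers)]
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, :k]
    return points[knn] - centers[:, None, :]

cloud = np.random.default_rng(0).normal(size=(1024, 3))
patches = group_patches(cloud)
print(patches.shape)  # (64, 32, 3)
```

Each (32, 3) patch is then what the mini-PointNet encoder maps to a single token.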
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Redefines point cloud ICL from masked reconstruction to deformation transfer; paradigm-level innovation
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation on ShapeNet with cross-domain assessment, though real-world scenarios are limited
- Writing Quality: ⭐⭐⭐⭐ — Problem analysis is clear and comparison figures are intuitive
- Value: ⭐⭐⭐⭐ — Achieves significant progress in the emerging direction of point cloud ICL