Deformation-based In-Context Learning for Point Cloud Understanding¶
Conference: CVPR 2026 arXiv: 2604.02845 Code: link Area: 3D Vision Keywords: point cloud in-context learning, deformation network, geometric reasoning, masked point modeling, multi-task general model
TL;DR¶
This paper proposes DeformPIC, which reframes point cloud In-Context Learning from a "masked reconstruction" paradigm to a "deformation transfer" paradigm. A Deformation Extraction Network (DEN) extracts task-specific semantics from the prompt pair, and a Deformation Transfer Network (DTN) applies the extracted deformation to the query point cloud, reducing Chamfer Distance by 1.6/1.8/4.7 on reconstruction/denoising/registration respectively, relative to the best prior ICL baseline on each task.
Background & Motivation¶
Background: 3D point cloud ICL aims to enable models to handle diverse tasks (reconstruction, denoising, registration, segmentation) from a small number of examples. Existing methods (PIC, PIC++) are built upon Masked Point Modeling (MPM).
Limitations of Prior Work: (1) Geometry-free: MPM predicts target point clouds from geometry-free masked tokens, lacking explicit geometric reasoning; (2) Train-inference mismatch: during training the target is partially masked (allowing the model to exploit visible parts), whereas at inference the target is entirely unknown.
Key Challenge: Masked tokens are abstract placeholders that encode no geometric correspondence, forcing the model to implicitly infer spatial structure through self-attention alone.
Goal: To equip ICL with explicit geometric manipulation capability and eliminate the inconsistency between training and inference objectives.
Key Insight: Tasks are reformulated as "deforming the query point cloud under prompt guidance," since deformation inherently preserves geometric continuity.
Core Idea: Extract task-specific deformation information from the prompt pair via DEN, then transfer and apply it to the query point cloud via DTN.
Method¶
Overall Architecture¶
A dual-network architecture: DEN extracts a task token \(\hat{T}_{\text{task}}\) from the prompt input→target pair; DTN deforms the query input under AdaLN-Zero modulation conditioned on \(\hat{T}_{\text{task}}\).
Key Designs¶
- Deformation Extraction Network (DEN): A mini-PointNet encodes the prompt input and target tokens, which are concatenated with a learnable task token and processed by a Transformer to extract \(\hat{T}_{\text{task}} = \mathcal{E}([T_{\text{task}} \| T_{P_i} \| T_{P_t}])\). Design Motivation: PIC processes prompt and query jointly, but task-semantic extraction and geometric reconstruction are distinct objectives; decoupling them improves efficiency.
- Deformation Transfer Network (DTN): AdaLN-Zero injects the task token into each Transformer layer: \(h^{(l+1)} = h^{(l)} + \sigma^{(l)} \cdot \mathcal{A}[(1+\eta^{(l)}) \cdot \text{LN}(h^{(l)}) + \kappa^{(l)}]\), where \(\sigma, \eta, \kappa\) are generated from \(\hat{T}_{\text{task}}\) via zero-initialized MLPs. Design Motivation: AdaLN-Zero, borrowed from DiT, enables fine-grained, layer-wise conditioning.
- Train-Inference Consistency: Both training and inference execute the same deformation process: the query point cloud is provided as input and the deformed point cloud is produced as output, with no masking operation required.
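The AdaLN-Zero modulation above can be made concrete with a minimal numpy sketch. The attention sub-layer here is a toy single-head stand-in for \(\mathcal{A}\), and all names are illustrative, not the paper's implementation; the point is the modulation rule itself and the zero-initialization property: since \(\sigma\) starts at zero, the residual branch is gated off and each DTN layer begins training as an identity map.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token over the feature dimension (no learnable affine here).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(x):
    # Toy single-head self-attention standing in for the sub-layer A.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ x

def adaln_zero_block(h, t_task, W_mod, b_mod):
    # W_mod/b_mod map the task token to (sigma, eta, kappa); both start at
    # zero, so sigma = 0 gates off the branch and the block is an identity.
    d = h.shape[-1]
    mod = t_task @ W_mod + b_mod
    sigma, eta, kappa = mod[:d], mod[d:2*d], mod[2*d:]
    return h + sigma * attention((1 + eta) * layer_norm(h) + kappa)

d, d_task = 8, 4
rng = np.random.default_rng(0)
h = rng.normal(size=(16, d))       # 16 query tokens
t_task = rng.normal(size=d_task)   # task token produced by the DEN
W_mod = np.zeros((d_task, 3 * d))  # zero-initialized modulation MLP
b_mod = np.zeros(3 * d)
out = adaln_zero_block(h, t_task, W_mod, b_mod)
print(np.allclose(out, h))  # True: at init the layer passes h through unchanged
```

This identity-at-init behavior is the same trick DiT uses to make conditioning layers trainable without destabilizing a pretrained residual stream.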
Loss & Training¶
- \(L_2\) Chamfer Distance: \(\mathcal{L} = \frac{1}{|\hat{R}|}\sum_{p \in \hat{R}} \min_{g \in R} \|p - g\|_2^2 + \frac{1}{|R|}\sum_{g \in R} \min_{p \in \hat{R}} \|g - p\|_2^2\)
- AdamW + cosine decay, lr warmup for 10 epochs, 300 total training epochs, batch size 128
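The symmetric \(L_2\) Chamfer Distance above is straightforward to compute directly; a minimal numpy sketch (not the paper's code) for small point sets:

```python
import numpy as np

def chamfer_l2(pred, gt):
    # Symmetric L2 Chamfer Distance between point sets pred (N,3) and gt (M,3):
    # mean squared nearest-neighbor distance in each direction, summed.
    d2 = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Identical clouds score 0; a single point shifted by one unit scores 2.0
# (squared distance 1.0 contributed by each direction).
p = np.array([[0.0, 0.0, 0.0]])
g = np.array([[1.0, 0.0, 0.0]])
print(chamfer_l2(p, p))  # 0.0
print(chamfer_l2(p, g))  # 2.0
```

The O(NM) pairwise matrix is fine at the paper's scale (1,024 points); larger clouds would use a KD-tree or GPU nearest-neighbor kernel instead.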
Key Experimental Results¶
Main Results (ShapeNet In-Context Dataset, Chamfer Distance ×1000 ↓)¶
| Method | Reconstruction Avg | Denoising Avg | Registration Avg | Segmentation mIoU↑ |
|---|---|---|---|---|
| PIC-Cat | 4.3 | 5.3 | 14.1 | 79.0 |
| PIC-S-Cat | 6.9 | 6.5 | 24.1 | 83.8 |
| PIC-S-Sep | 5.1 | 12.0 | 6.7 | 83.7 |
| DeformPIC | 2.7 | 3.5 | 2.0 | 83.9 |
Ablation Study¶
| Comparison | Metric Change | Remark |
|---|---|---|
| vs PIC-Cat (reconstruction) | 4.3→2.7 (−1.6) | Deformation outperforms masked reconstruction |
| vs PIC-Cat (denoising) | 5.3→3.5 (−1.8) | Explicit geometric manipulation is effective |
| vs PIC-Cat (registration) | 14.1→2.0 (−12.1) | Registration is inherently geometric; deformation is a natural fit |
| vs task-specific PCT | 2.6/2.2/6.3 vs 2.7/3.5/2.0 | Competitive on reconstruction, behind on denoising, far ahead on registration |
Key Findings¶
- Registration shows the largest improvement (CD 14.1→2.0), as registration is essentially rigid-body transformation, which is naturally aligned with the deformation paradigm.
- Segmentation performance remains at SOTA (83.9 mIoU), demonstrating that the deformation paradigm generalizes to discrete semantic tasks.
- SOTA results are also achieved in cross-domain evaluation on ModelNet40 and ScanObjectNN, confirming strong generalization.
- Qualitative results show that DeformPIC produces more complete and geometrically accurate 3D shapes.
Highlights & Insights¶
- Paradigm shift: Moving from "predicting masked content" to "deforming the input" better reflects the geometric nature of 3D data.
- Train-inference consistency is critical: eliminating the mismatch yields substantial performance gains.
- Decoupled design (DEN for extraction + DTN for transfer) is more effective than joint processing.
- Successful transfer of AdaLN-Zero from DiT to point cloud ICL demonstrates the cross-domain applicability of diffusion model conditioning techniques.
- Registration benefits from a natural alignment with the deformation framework, as it is intrinsically a geometric transformation, driving CD from 14.1 down to 2.0.
- Maintaining SOTA on segmentation (a discrete semantic task) confirms the generality of the deformation framework.
- Strong cross-domain generalization (ShapeNet→ModelNet40/ScanObjectNN) validates the robustness of the proposed method.
Limitations & Future Work¶
- The deformation paradigm adapts less naturally to discrete semantic tasks such as part segmentation compared to continuous geometric tasks.
- Primary evaluation is conducted on synthetic datasets; performance on real-world point clouds remains to be verified.
- Larger-scale pre-training has not been explored.
- DEN and DTN employ separate encoders; a shared encoder may yield further improvements.
- The information from a single prompt pair may be insufficient; few-shot ICL with multiple prompts warrants exploration.
- The method may be limited when deformation magnitude is extremely large (e.g., deforming a cup into a car).
- Scalability of 300-epoch training to larger datasets has not been verified.
Related Work & Insights¶
- Core distinction from PIC/PIC++: masked reconstruction → deformation transfer; joint processing → decoupled processing.
- Neural deformation methods (FlowNet3D, Pixel2Mesh) have already demonstrated the effectiveness of deformation-based strategies.
- AdaLN-Zero, originating from DiT, shows that conditioning techniques developed for diffusion models transfer effectively to other domains.
- DG-PIC and PCoTTA employ transfer learning for adaptation to new scenarios, and are orthogonally complementary to DeformPIC.
Technical Details¶
- Point cloud sampling: 1,024 points/object, 64 patches × 32 points/patch
- Point Encoder: mini-PointNet maps point patches to tokens
- AdaLN-Zero initialization: \(W_1, W_2, W_3\) are zero-initialized so that DTN is equivalent to an unconditional Transformer at the start of training
- 5 difficulty levels: L1 (slight perturbation) to L5 (high noise / large-angle rotation)
- vs task-specific models: competitive on reconstruction (2.7 vs 2.5), a remaining gap on denoising (3.5 vs 2.2), far ahead on registration (2.0 vs 5.9)
- End-to-end deformation objective: directly predicts deformed point cloud coordinates, avoiding instabilities of displacement-field optimization
- Training configuration: AdamW, lr warmup 1e-6→1e-4 (10 epochs), cosine decay, weight decay 0.05
- Baselines: four categories — task-specific models (PointNet/DGCNN/PCT/ACT), multi-task models, pre-trained multi-task models, and ICL models
- Dataset scale: 174,404 training samples + 43,050 test samples, covering 4 tasks × 5 difficulty levels
- Single-GPU training: all experiments are completed on a single NVIDIA TITAN RTX 24 GB GPU
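The 1,024-point → 64 patches × 32 points/patch tokenization is typically implemented with farthest point sampling (FPS) to pick patch centers followed by k-nearest-neighbor grouping, as in Point-MAE-style pipelines; the sketch below assumes that standard recipe, since the notes do not quote the paper's exact grouping code.

```python
import numpy as np

def farthest_point_sample(points, n_centers):
    # Greedy FPS: repeatedly pick the point farthest from all chosen centers,
    # giving an approximately uniform cover of the shape.
    n = points.shape[0]
    idx = np.zeros(n_centers, dtype=int)
    dist = np.full(n, np.inf)
    for i in range(1, n_centers):
        d = ((points - points[idx[i - 1]]) ** 2).sum(-1)
        dist = np.minimum(dist, d)
        idx[i] = dist.argmax()
    return idx

def group_patches(points, n_centers=64, k=32):
    # kNN grouping around each FPS center, yielding (n_centers, k, 3) patches
    # in center-relative coordinates (standard before a mini-PointNet encoder).
    centers = points[farthest_point_sample(points, n_centers)]
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, :k]
    return points[knn] - centers[:, None, :]

cloud = np.random.default_rng(0).normal(size=(1024, 3))
patches = group_patches(cloud)
print(patches.shape)  # (64, 32, 3)
```

Each (32, 3) patch is then what the mini-PointNet encoder maps to a single token.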
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Redefines point cloud ICL from masked reconstruction to deformation transfer; paradigm-level innovation
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation on ShapeNet with cross-domain assessment, though real-world scenarios are limited
- Writing Quality: ⭐⭐⭐⭐ — Problem analysis is clear and comparison figures are intuitive
- Value: ⭐⭐⭐⭐ — Achieves significant progress in the emerging direction of point cloud ICL