Deformation-based In-Context Learning for Point Cloud Understanding

Conference: CVPR 2026 arXiv: 2604.02845 Code: link Area: 3D Vision Keywords: point cloud in-context learning, deformation network, geometric reasoning, masked point modeling, multi-task general model

TL;DR

This paper proposes DeformPIC, which reframes point cloud In-Context Learning from a "masked reconstruction" paradigm to a "deformation transfer" paradigm. A Deformation Extraction Network (DEN) extracts task-specific semantics from the prompt pair, and a Deformation Transfer Network (DTN) applies the extracted deformation to the query point cloud, reducing Chamfer Distance by 1.6/1.8/4.7 on reconstruction/denoising/registration relative to the strongest prior ICL baseline on each task.

Background & Motivation

Background: 3D point cloud ICL aims to enable models to handle diverse tasks (reconstruction, denoising, registration, segmentation) from a small number of examples. Existing methods (PIC, PIC++) are built upon Masked Point Modeling (MPM).

Limitations of Prior Work: (1) Geometry-free: MPM predicts target point clouds from geometry-free masked tokens, lacking explicit geometric reasoning; (2) Train-inference mismatch: during training the target is partially masked (allowing the model to exploit visible parts), whereas at inference the target is entirely unknown.

Key Challenge: Masked tokens are abstract placeholders that encode no geometric correspondence, forcing the model to implicitly infer spatial structure through self-attention alone.

Goal: To equip ICL with explicit geometric manipulation capability and eliminate the inconsistency between training and inference objectives.

Key Insight: Tasks are reformulated as "deforming the query point cloud under prompt guidance," since deformation inherently preserves geometric continuity.

Core Idea: Extract task-specific deformation information from the prompt pair via DEN, then transfer and apply it to the query point cloud via DTN.

Method

Overall Architecture

A dual-network architecture: DEN extracts a task token \(\hat{T}_{\text{task}}\) from the prompt input→target pair; DTN deforms the query input under AdaLN-Zero modulation conditioned on \(\hat{T}_{\text{task}}\).
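The dual-network data flow can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the token dimension is a toy size, and the mean-pool stand-ins for the DEN Transformer and the per-point encoder weights are illustrative assumptions; only the sequence layout (task token, prompt-input tokens, prompt-target tokens) follows the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # token dimension (toy size for illustration)

def mini_pointnet(points, W):
    """Toy stand-in for the mini-PointNet tokenizer: per-point linear + ReLU, max-pool per patch."""
    # points: (n_patches, pts_per_patch, 3) -> tokens: (n_patches, D)
    return np.maximum(points @ W, 0).max(axis=1)

W_enc = rng.normal(size=(3, D))
prompt_in  = rng.normal(size=(64, 32, 3))   # prompt input patches
prompt_tgt = rng.normal(size=(64, 32, 3))   # prompt target patches
query_in   = rng.normal(size=(64, 32, 3))   # query input patches

# DEN: concatenate [task token || prompt-input tokens || prompt-target tokens]
task_token = np.zeros((1, D))               # learnable in the real model
den_seq = np.concatenate([task_token,
                          mini_pointnet(prompt_in, W_enc),
                          mini_pointnet(prompt_tgt, W_enc)], axis=0)
# Stand-in for the DEN Transformer: pool the sequence into the extracted task token
t_task = den_seq.mean(axis=0)               # \hat{T}_task, shape (D,)

# DTN input: query tokens, to be deformed under AdaLN-Zero conditioning on t_task
query_tokens = mini_pointnet(query_in, W_enc)
print(den_seq.shape, t_task.shape, query_tokens.shape)  # (129, 32) (32,) (64, 32)
```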

Key Designs

  1. Deformation Extraction Network (DEN): A mini-PointNet encodes the prompt input and target tokens, which are concatenated with a learnable task token and processed by a Transformer to extract \(\hat{T}_{\text{task}} = \mathcal{E}([T_{\text{task}} \| T_{P_i} \| T_{P_t}])\). Design Motivation: PIC processes prompt and query jointly, but task-semantic extraction and geometric reconstruction are distinct objectives; decoupling them improves efficiency.

  2. Deformation Transfer Network (DTN): AdaLN-Zero injects the task token into each Transformer layer: \(h^{(l+1)} = h^{(l)} + \sigma^{(l)} \cdot \mathcal{A}[(1+\eta^{(l)}) \cdot \text{LN}(h^{(l)}) + \kappa^{(l)}]\) where \(\sigma, \eta, \kappa\) are generated from \(\hat{T}_{\text{task}}\) via zero-initialized MLPs. Design Motivation: AdaLN-Zero, borrowed from DiT, enables fine-grained, layer-wise conditioning.

  3. Train-Inference Consistency: Both training and inference execute the same deformation process — the query point cloud is provided as input and the deformed point cloud is produced as output, with no masking operation required.
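The AdaLN-Zero update above can be sketched in NumPy. This is a schematic single layer, not the paper's code: the attention sublayer is replaced by a simple token mixer, and sizes are toy assumptions. What it does demonstrate faithfully is the zero-initialization property: with \(W_{\sigma}, W_{\eta}, W_{\kappa}\) at zero, the gate \(\sigma^{(l)}\) is zero and the block reduces to the identity at the start of training.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention_stub(x, W):
    # Stand-in for the attention sublayer \mathcal{A}; any token mixer works for the sketch.
    return np.tanh(x @ W)

# Zero-initialized MLPs map \hat{T}_task -> (sigma, eta, kappa) for this layer.
W_sigma = np.zeros((D, D)); W_eta = np.zeros((D, D)); W_kappa = np.zeros((D, D))
W_attn = rng.normal(size=(D, D)) * 0.1

def adaln_zero_block(h, t_task):
    sigma = t_task @ W_sigma          # residual gate; zero at init
    eta   = t_task @ W_eta            # LN scale shift; zero at init
    kappa = t_task @ W_kappa          # LN bias shift; zero at init
    modulated = (1.0 + eta) * layer_norm(h) + kappa
    return h + sigma * attention_stub(modulated, W_attn)

h = rng.normal(size=(64, D))          # query tokens entering the layer
t_task = rng.normal(size=(D,))        # task token from DEN

out = adaln_zero_block(h, t_task)
# Zero-initialized modulation MLPs gate off the conditioned branch at init:
print(np.allclose(out, h))            # True
```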

Loss & Training

  • \(L_2\) Chamfer Distance: \(\mathcal{L} = \frac{1}{|\hat{R}|}\sum_{p \in \hat{R}} \min_{g \in R} \|p - g\|_2^2 + \frac{1}{|R|}\sum_{g \in R} \min_{p \in \hat{R}} \|g - p\|_2^2\)
  • AdamW + cosine decay, lr warmup for 10 epochs, 300 total training epochs, batch size 128
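The symmetric \(L_2\) Chamfer Distance above can be written directly from its definition. A minimal NumPy version (brute-force pairwise distances, fine at 1,024 points; real pipelines typically use a KD-tree or CUDA kernel):

```python
import numpy as np

def chamfer_l2(pred, gt):
    """Symmetric L2 Chamfer Distance between point sets pred (N,3) and gt (M,3):
    mean squared nearest-neighbor distance in both directions, as in the loss above."""
    d2 = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1)  # (N, M) squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

rng = np.random.default_rng(0)
pts = rng.normal(size=(1024, 3))
print(chamfer_l2(pts, pts))  # 0.0 for identical sets
```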

Key Experimental Results

Main Results (ShapeNet In-Context Dataset, Chamfer Distance ×1000 ↓)

| Method | Reconstruction Avg | Denoising Avg | Registration Avg | Segmentation mIoU ↑ |
| --- | --- | --- | --- | --- |
| PIC-Cat | 4.3 | 5.3 | 14.1 | 79.0 |
| PIC-S-Cat | 6.9 | 6.5 | 24.1 | 83.8 |
| PIC-S-Sep | 5.1 | 12.0 | 6.7 | 83.7 |
| DeformPIC | 2.7 | 3.5 | 2.0 | 83.9 |

Ablation Study

| Comparison | Metric Change | Remark |
| --- | --- | --- |
| vs PIC-Cat (reconstruction) | 4.3 → 2.7 (−1.6) | Deformation outperforms masked reconstruction |
| vs PIC-Cat (denoising) | 5.3 → 3.5 (−1.8) | Explicit geometric manipulation is effective |
| vs PIC-Cat (registration) | 14.1 → 2.0 (−12.1) | Registration is inherently geometric; deformation is a natural fit |
| vs task-specific PCT | 2.6/2.2/6.3 vs 2.7/3.5/2.0 | ICL greatly surpasses task-specific models on registration |

Key Findings

  • Registration shows the largest improvement (CD 14.1→2.0), as registration is essentially rigid-body transformation, which is naturally aligned with the deformation paradigm.
  • Segmentation performance remains at SOTA (83.9 mIoU), demonstrating that the deformation paradigm generalizes to discrete semantic tasks.
  • SOTA results are also achieved in cross-domain evaluation on ModelNet40 and ScanObjectNN, confirming strong generalization.
  • Qualitative results show that DeformPIC produces more complete and geometrically accurate 3D shapes.

Highlights & Insights

  • Paradigm shift: Moving from "predicting masked content" to "deforming the input" better reflects the geometric nature of 3D data.
  • Train-inference consistency is critical: eliminating the mismatch yields substantial performance gains.
  • Decoupled design (DEN for extraction + DTN for transfer) is more effective than joint processing.
  • Successful transfer of AdaLN-Zero from DiT to point cloud ICL demonstrates the cross-domain applicability of diffusion model conditioning techniques.
  • Registration benefits from a natural alignment with the deformation framework, as it is intrinsically a geometric transformation, driving CD from 14.1 down to 2.0.
  • Maintaining SOTA on segmentation (a discrete semantic task) confirms the generality of the deformation framework.
  • Strong cross-domain generalization (ShapeNet→ModelNet40/ScanObjectNN) validates the robustness of the proposed method.

Limitations & Future Work

  • The deformation paradigm adapts less naturally to discrete semantic tasks such as part segmentation compared to continuous geometric tasks.
  • Primary evaluation is conducted on synthetic datasets; performance on real-world point clouds remains to be verified.
  • Larger-scale pre-training has not been explored.
  • DEN and DTN employ separate encoders; a shared encoder may yield further improvements.
  • The information from a single prompt pair may be insufficient; few-shot ICL with multiple prompts warrants exploration.
  • The method may be limited when deformation magnitude is extremely large (e.g., deforming a cup into a car).
  • Scalability of 300-epoch training to larger datasets has not been verified.
Related Work & Positioning

  • Core distinction from PIC/PIC++: masked reconstruction → deformation transfer; joint processing → decoupled processing.
  • Neural deformation methods (FlowNet3D, Pixel2Mesh) have already demonstrated the effectiveness of deformation-based strategies.
  • AdaLN-Zero, originating from DiT, shows that conditioning techniques developed for diffusion models transfer effectively to other domains.
  • DG-PIC and PCoTTA employ transfer learning for adaptation to new scenarios, and are orthogonally complementary to DeformPIC.

Technical Details

  • Point cloud sampling: 1,024 points/object, 64 patches × 32 points/patch
  • Point Encoder: mini-PointNet maps point patches to tokens
  • AdaLN-Zero initialization: \(W_1, W_2, W_3\) are zero-initialized so that DTN is equivalent to an unconditional Transformer at the start of training
  • 5 difficulty levels: L1 (slight perturbation) to L5 (high noise / large-angle rotation)
  • vs task-specific models: reconstruction is competitive (2.7 vs 2.5), denoising has a gap (3.5 vs 2.2), registration greatly surpasses (2.0 vs 5.9)
  • End-to-end deformation objective: directly predicts deformed point cloud coordinates, avoiding instabilities of displacement-field optimization
  • Training configuration: AdamW, lr warmup 1e-6→1e-4 (10 epochs), cosine decay, weight decay 0.05
  • Baselines: four categories — task-specific models (PointNet/DGCNN/PCT/ACT), multi-task models, pre-trained multi-task models, and ICL models
  • Dataset scale: 174,404 training samples + 43,050 test samples, covering 4 tasks × 5 difficulty levels
  • Single-GPU training: all experiments are completed on a single NVIDIA TITAN RTX 24 GB GPU
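The warmup-plus-cosine schedule from the training configuration can be sketched as a small function. The warmup endpoints (1e-6 → 1e-4 over 10 epochs) and the 300-epoch horizon come from the notes above; the floor `lr_min` after decay is an assumption, not stated in the paper.

```python
import math

def lr_at_epoch(epoch, warmup_epochs=10, total_epochs=300,
                lr_start=1e-6, lr_peak=1e-4, lr_min=1e-6):
    """Linear warmup lr_start -> lr_peak, then cosine decay toward lr_min.
    lr_min is an assumed floor; the paper only specifies warmup and cosine decay."""
    if epoch < warmup_epochs:
        return lr_start + (lr_peak - lr_start) * epoch / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)  # progress in [0, 1]
    return lr_min + 0.5 * (lr_peak - lr_min) * (1.0 + math.cos(math.pi * t))

print(lr_at_epoch(0), lr_at_epoch(10), lr_at_epoch(300))
```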

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Redefines point cloud ICL from masked reconstruction to deformation transfer; paradigm-level innovation
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation on ShapeNet with cross-domain assessment, though real-world scenarios are limited
  • Writing Quality: ⭐⭐⭐⭐ — Problem analysis is clear and comparison figures are intuitive
  • Value: ⭐⭐⭐⭐ — Achieves significant progress in the emerging direction of point cloud ICL