DG-PIC: Domain Generalized Point-In-Context Learning for Point Cloud Understanding¶

Conference: ECCV 2024
arXiv: 2407.08801
Code: https://github.com/Jinec98/DG-PIC
Area: 3D Vision
Keywords: Point Cloud Understanding, Domain Generalization, In-Context Learning, Multi-Task Learning, Test-Time Adaptation

TL;DR¶

DG-PIC is the first point cloud understanding framework to address multi-domain and multi-task learning simultaneously in a single unified model. Using dual-level source prototype estimation and a test-time feature shifting mechanism, it improves generalization to unseen domains without updating the model.

Background & Motivation¶

Background: Point cloud understanding is widely used in fields such as autonomous driving and robotics, but models are typically trained and tested on a single dataset. Performance degrades significantly when facing new data with different distributions (e.g., from synthetic ModelNet40 to real-scan ScanObjectNN).

Limitations of Prior Work:
- Domain Generalization (DG) methods are generally designed for a single task (e.g., classification), lack multi-task capabilities, and make no use of the test data itself.
- In-Context Learning (ICL) methods (such as PIC) can handle multiple tasks but are restricted to a single dataset, exhibiting poor cross-domain generalization.
- Neither type of method can simultaneously solve the "multi-domain" and "multi-task" problems.

Key Challenge: A unified model needs to reconcile task generalization (multi-task) and domain generalization (multi-domain), whereas existing methods can only address one of them.

Goal: To handle point cloud understanding across multiple domains and tasks in a unified model, and generalize to unseen domains during testing without updating model parameters.

Key Insight: Combine the multi-task ICL of PIC with test-time domain generalization, i.e., learn cross-domain generalizable information via PIC during pre-training, then shift target-domain features towards the source domains during testing.

Core Idea: Dual-level source prototypes + dual-level test-time feature shifting, which aligns unseen test domain data to known source domains without requiring model updates.

Method¶

Overall Architecture¶

DG-PIC consists of two stages. (1) Pre-training: a PIC model is trained on multiple source domains within the Masked Point Modeling (MPM) framework, using a cross-domain prompt pairing strategy. (2) Testing: the model is frozen; dual-level source prototype estimation measures the distance between each test sample and every source domain, dual-level feature shifting aligns the test features to the source domains, and the most similar sample from the nearest source domain is then selected as the prompt.

Key Designs¶

Multi-domain Prompt Pairing:
- Function: Randomly selects prompt samples from different source domains during pre-training to enhance cross-domain associations.
- Mechanism: Assuming the query is from domain \(D_s^i\) and the prompt is from domain \(D_s^j\) (\(j \neq i\)), the predicted masked patches are \(P \sim (D_s^i, D_s^j) = \text{Trans}([F_\theta(I_i) \oplus F_\theta(T_i^k) \oplus F_\theta(I_j) \oplus F_\theta(T_j^k)], \text{Mask})\), where \(I\) and \(T^k\) denote an input point cloud and its target for task \(k\). The training loss is the Chamfer Distance between the prediction \(P\) and the ground truth \(G\): \(\text{CD}(P,G) = \frac{1}{|P|}\sum_{x \in P}\min_{y \in G}\|x-y\|^2 + \frac{1}{|G|}\sum_{y \in G}\min_{x \in P}\|y-x\|^2\).
- Design Motivation: Cross-domain pairing forces the model to learn domain-invariant feature representations.
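
The Chamfer Distance used as the training loss can be illustrated with a small NumPy sketch. This is only a brute-force transcription of the formula above (squared L2 norms, mean over each set), not the authors' implementation; the function name `chamfer_distance` and the toy points are assumptions made for illustration.

```python
import numpy as np

def chamfer_distance(P: np.ndarray, G: np.ndarray) -> float:
    """Symmetric Chamfer Distance with squared L2 norms.

    P: (|P|, 3) predicted points, G: (|G|, 3) ground-truth points.
    Brute force O(|P| * |G|); fine for small patches, not for large clouds.
    """
    # Pairwise squared distances, shape (|P|, |G|).
    diff = P[:, None, :] - G[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # Mean over P of the nearest-neighbour distance in G, plus the reverse term.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

# Toy example (hypothetical points, not from the benchmark).
P = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
G = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
cd = chamfer_distance(P, G)
print(cd)
```

Both directions are averaged separately, so the loss stays comparable when the predicted and ground-truth sets have different sizes.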
Dual-level Source Prototype Estimation:
- Function: Computes global and local prototypes for each source domain to act as anchors for test-time feature alignment.
- Mechanism:
  - Local prototype \(Z_{local}^{i,m}\): Averages the feature of the \(m\)-th patch over all \(N_{D_s^i}\) samples in domain \(D_s^i\): \(Z_{local}^{i,m} = \frac{1}{N_{D_s^i}} \sum_{n=1}^{N_{D_s^i}} F_\theta(P_m^n)\), where \(P_m^n\) is the \(m\)-th patch of the \(n\)-th sample.
  - Global prototype \(Z_{global}^i\): Averages the max-pooled patch features of each sample: \(Z_{global}^i = \frac{1}{N_{D_s^i}} \sum_{n=1}^{N_{D_s^i}} \max_m F_\theta(P_m^n)\)
  - Computes the Euclidean distance of test samples to each source domain prototype: \(\mathcal{E}_{global}^i = \|F_{global} - Z_{global}^i\|\), \(\mathcal{E}_{local}^{i,m} = \|F_{local}^m - Z_{local}^{i,m}\|\)
- Design Motivation: Global features capture shape context, whereas local features capture geometric structural details. Dual-level representation is required to comprehensively represent the source domains.
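
A minimal sketch of the prototype estimation and the two distances, assuming the encoder features are already available as arrays. The array layout `(N, M, C)`, the function names, and the choice to max-pool the test sample's patch features to obtain \(F_{global}\) are assumptions for illustration, not details confirmed by the summary above.

```python
import numpy as np

def source_prototypes(feats: np.ndarray):
    """feats: (N, M, C) patch features for the N samples of one source domain.

    Returns Z_local with shape (M, C) and Z_global with shape (C,).
    """
    z_local = feats.mean(axis=0)               # average each patch position over samples
    z_global = feats.max(axis=1).mean(axis=0)  # max-pool over patches, then average
    return z_local, z_global

def prototype_distances(f_local: np.ndarray, z_local: np.ndarray, z_global: np.ndarray):
    """f_local: (M, C) patch features of a single test sample."""
    f_global = f_local.max(axis=0)                       # assumed: same pooling as the prototype
    e_global = float(np.linalg.norm(f_global - z_global))
    e_local = np.linalg.norm(f_local - z_local, axis=1)  # one distance per patch, shape (M,)
    return e_global, e_local

# Hypothetical features: 5 source samples, 64 patches, 16 channels.
rng = np.random.default_rng(0)
source_feats = rng.normal(size=(5, 64, 16))
z_local, z_global = source_prototypes(source_feats)
e_global, e_local = prototype_distances(rng.normal(size=(64, 16)), z_local, z_global)
print(e_global, e_local.shape)
```

In practice these prototypes would be computed once per source domain after pre-training and cached, since the model is frozen at test time.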
Dual-level Test-time Feature Shifting:
- Function: Shifts target domain features towards the source domain direction during test time without updating model parameters.
- Mechanism:
  - Macro semantic coefficient \(\alpha\): Derived from the global distance, controlling the contribution level of each source domain to the feature shift: \(\alpha = softmax(\mathcal{E}_{global})\)
  - Micro positional coefficient \(\beta^i\): Derived from the local distance, considering the alignment of patch positions: \(\beta^i = softmax(\mathcal{E}_{local}^i)\)
  - Final feature shifting formula: \(F'_{local} = \frac{1}{R}\sum_{i=1}^{R}\alpha_i\left(\frac{1}{M}\sum_{m=1}^{M}\beta^{i,m}F_{local}^m\right) + \frac{1}{R}\sum_{i=1}^{R}(1-\alpha_i)\left(\frac{1}{M}\sum_{m=1}^{M}(1-\beta^{i,m})Z_{local}^{i,m}\right)\)
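- Design Motivation: \(\alpha\) uses cross-domain semantic similarity to regulate the overall shifting intensity, while \(\beta\) uses the geometric similarity of co-located patches for fine-grained alignment. Because both coefficients are softmaxes over distances, a nearer prototype receives a smaller \(\alpha_i\) (or \(\beta^{i,m}\)) and therefore a larger weight \(1-\alpha_i\) (or \(1-\beta^{i,m}\)) on the prototype term. Even across domains, patches at the same relative position should share similar geometric structures (e.g., table edges versus a flat surface).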
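
The shifting formula can be transcribed into NumPy as follows. This is a literal reading of the equation as written above, under assumed array shapes; read literally, the sums over \(m\) collapse the patch axis, so the sketch returns a single \(C\)-dimensional vector, and the authors' code may organize the output differently. The names `softmax` and `shift_features` are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shift_features(f_local, z_local, e_global, e_local):
    """Literal transcription of the dual-level shifting formula.

    f_local : (M, C)    patch features of the test sample
    z_local : (R, M, C) local prototypes of the R source domains
    e_global: (R,)      global distances to each source domain
    e_local : (R, M)    per-patch local distances to each source domain
    """
    alpha = softmax(e_global)          # macro semantic coefficients, shape (R,)
    beta = softmax(e_local, axis=1)    # micro positional coefficients, shape (R, M)
    num_patches = f_local.shape[0]
    # (1/M) * sum_m beta^{i,m} F^m for every domain i, shape (R, C).
    target_term = beta @ f_local / num_patches
    # (1/M) * sum_m (1 - beta^{i,m}) Z^{i,m} for every domain i, shape (R, C).
    source_term = np.einsum("rm,rmc->rc", 1.0 - beta, z_local) / num_patches
    per_domain = alpha[:, None] * target_term + (1.0 - alpha)[:, None] * source_term
    return per_domain.mean(axis=0)     # (1/R) * sum over domains, shape (C,)

# Hypothetical sizes: 3 source domains, 64 patches, 8 channels.
rng = np.random.default_rng(0)
shifted = shift_features(
    rng.normal(size=(64, 8)),
    rng.normal(size=(3, 64, 8)),
    rng.uniform(size=3),
    rng.uniform(size=(3, 64)),
)
print(shifted.shape)
```

No gradients or parameter updates are involved: the shift is a closed-form reweighting of test features and cached prototypes.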
Test-time Prompt Selection:
- Function: Selects the most similar sample from the nearest source domain as the prompt.
- Mechanism: Determines the nearest source domain by combining global and local distances: \(\mathcal{E}^i = \lambda \cdot \mathcal{E}_{global}^i + (1-\lambda) \cdot \frac{1}{M}\sum_{m=1}^{M}\mathcal{E}_{local}^{i,m}\) (\(\lambda=0.5\)), and then finds the sample with the minimum feature distance in that domain to serve as the prompt.
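
A minimal sketch of the prompt selection step, assuming the global and local distances have already been computed. The summary does not say which features or metric define the "minimum feature distance" within the chosen domain; Euclidean distance between global features is used here as an assumption, and `select_prompt` is a hypothetical name.

```python
import numpy as np

def select_prompt(e_global, e_local, f_global, source_global_feats, lam=0.5):
    """Pick the nearest source domain, then the nearest sample inside it.

    e_global           : (R,)   global distances to each source domain
    e_local            : (R, M) per-patch local distances to each source domain
    f_global           : (C,)   global feature of the test sample
    source_global_feats: list of R arrays, each (N_i, C), one row per source sample
    """
    # E^i = lam * E_global^i + (1 - lam) * mean_m E_local^{i,m}
    combined = lam * e_global + (1.0 - lam) * e_local.mean(axis=1)
    domain = int(np.argmin(combined))
    # Assumed metric: Euclidean distance between global features.
    dists = np.linalg.norm(source_global_feats[domain] - f_global, axis=1)
    return domain, int(np.argmin(dists))

# Toy example with two source domains (hypothetical values).
e_global = np.array([1.0, 0.2])
e_local = np.array([[1.0, 1.0], [0.4, 0.2]])
banks = [np.zeros((2, 2)), np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])]
domain, sample = select_prompt(e_global, e_local, np.array([0.9, 1.2]), banks)
print(domain, sample)
```

With \(\lambda = 0.5\), the global and averaged local distances contribute equally to the choice of domain.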

Loss & Training¶

Pre-training uses the Chamfer Distance loss.
Optimizer: AdamW with lr = 0.001 and a cosine learning rate schedule.
Training runs for 300 epochs with a batch size of 128.
Each point cloud is sampled to 1024 points and divided into 64 patches of 32 points each (the patches overlap, since 64 × 32 = 2048 > 1024), with a mask ratio of 0.7.
No model parameters are updated at test time.
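
To make the schedule and masking numbers concrete, here is a small standard-library sketch. It assumes a plain cosine decay from the base learning rate to zero with no warmup, and rounds the number of masked patches down; the summary specifies neither detail, so both are assumptions.

```python
import math

BASE_LR = 1e-3      # lr = 0.001
EPOCHS = 300
NUM_PATCHES = 64
MASK_RATIO = 0.7

def cosine_lr(epoch: int, base_lr: float = BASE_LR, total: int = EPOCHS,
              min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr to min_lr (assumed: no warmup, min_lr = 0)."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * epoch / total))

# Number of masked patches per sample (assumed: rounded down).
num_masked = int(MASK_RATIO * NUM_PATCHES)
num_visible = NUM_PATCHES - num_masked

for epoch in (0, 150, 300):
    print(epoch, cosine_lr(epoch))
print(num_masked, num_visible)
```

Under these assumptions, roughly 44 of the 64 patches are hidden from the encoder at each step and must be reconstructed.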

Key Experimental Results¶

Main Results (ScanObjectNN as target domain, CD×10⁻³ ↓)¶

| Method | Setting | Reconstruction | Denoising | Registration |
| --- | --- | --- | --- | --- |
| DG-PIC (Ours) | Test-time DG | 4.1 | 15.2 | 5.8 |
| PIC | Supervised | 72.9 | 80.0 | 12.7 |
| Point-MAE (task-specific) | Supervised | 30.4 | 36.0 | 31.2 |
| PCT (multi-task) | Supervised | 31.5 | 36.5 | 34.9 |
| PointCutMix (task-specific) | Train-time DG | 44.8 | 43.5 | 41.3 |

Ablation Study (CD×10⁻³ ↓)¶

| Model | Prototype Estimation | Feature Shifting | Anchor Domain | Reconstruction | Denoising | Registration |
| --- | --- | --- | --- | --- | --- | --- |
| Model A | Random | Mean | Single Domain | 8.4 | 40.5 | 6.7 |
| Model B | Global Only | Mean | Single Domain | 7.2 | 38.3 | 6.4 |
| Model C | Local Only | Mean | Single Domain | 7.3 | 36.7 | 6.7 |
| Model D | Global+Local | Mean | Single Domain | 6.8 | 35.1 | 6.2 |
| Model E | Global+Local | Mean | All Domains | 6.3 | 32.4 | 6.5 |
| Model F | Global+Local | Macro Only | All Domains | 5.2 | 22.7 | 6.0 |
| Model G | Global+Local | Micro Only | All Domains | 4.9 | 25.6 | 6.2 |
| Ours | Global+Local | Macro+Micro | All Domains | 4.1 | 15.2 | 5.8 |

Key Findings¶

DG-PIC significantly outperforms all baseline methods across all three tasks; on the reconstruction task, its CD is only 5.6% of PIC's (4.1 vs 72.9).
Traditional methods (PointNet, DGCNN, etc.) exhibit poor cross-domain generalization, with CD values generally falling within the 30-45 range.
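Although DG methods (Pointmixup, PointCutMix) introduce domain generalization, their results vary little across tasks (e.g., 41.3 to 44.8 for PointCutMix), suggesting that they address domain differences while ignoring task differences.
PIC can handle multiple tasks but fails in cross-domain scenarios on reconstruction and denoising (CD 72.9 and 80.0), although its registration CD is lower at 12.7; this suggests that ICL alone is insufficient to bridge domain gaps.
Each component of the dual-level design (global+local prototypes, macro+micro shifting) contributes to performance on reconstruction and denoising, with denoising benefiting the most (CD 40.5 for Model A versus 15.2 for the full model). Registration gains are smaller and not monotonic: moving from a single anchor domain to all domains (Model D to Model E) raises registration CD from 6.2 to 6.5.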
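
The ratios quoted in these findings can be re-derived from the two tables above with a few lines of Python. The numbers are copied from the tables; the variable names are illustrative only.

```python
tasks = ("reconstruction", "denoising", "registration")

# CD x 10^-3 values copied from the ablation and main-results tables.
model_a = dict(zip(tasks, (8.4, 40.5, 6.7)))
ours = dict(zip(tasks, (4.1, 15.2, 5.8)))
pic_reconstruction = 72.9

# DG-PIC reconstruction CD as a fraction of PIC's.
ratio_vs_pic = ours["reconstruction"] / pic_reconstruction

# Relative CD reduction from Model A (random prototypes, mean shifting) to the full model.
relative_drop = {t: (model_a[t] - ours[t]) / model_a[t] for t in tasks}
largest = max(relative_drop, key=relative_drop.get)

print(f"DG-PIC / PIC on reconstruction: {ratio_vs_pic:.1%}")
for t in tasks:
    print(f"{t}: {relative_drop[t]:.1%} lower CD than Model A")
print("largest relative gain:", largest)
```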

Highlights & Insights¶

Pioneering Setting: This paper introduces the novel setting of multi-domain multi-task point cloud understanding and establishes a benchmark containing 4 datasets (2 synthetic + 2 real), 7 object categories, 3 tasks, and 30,954 samples.
Test-Time Generalization Without Model Updates: Domain adaptation is achieved via feature-space shifting rather than fine-tuning, maintaining a manageable computational overhead.
Elegant Design Intuition for Micro Positional Coefficients: Exploits the effective prior that patches at the same relative position on an object should exhibit similar geometric structures across domains.
A Unified Model for Three Tasks: Reconstruction, denoising, and registration share a single network, with tasks toggled via prompts.

Limitations & Future Work¶

The benchmark only contains 7 shared classes, with limited scale and diversity.
The framework only considers regression-like tasks for coordinates (reconstruction/denoising/registration) and does not cover discriminative tasks like classification or segmentation.
Source domain prototypes are calculated as a simple average of all samples, potentially losing intra-class multi-modal distribution information.
The formulation for feature shifting is somewhat heuristic and lacks rigorous theoretical analysis.
Ablation studies and thorough experiments were primarily conducted with ScanObjectNN as the target domain, leaving other target domain configurations insufficiently explored.

vs PIC: PIC is a single-domain multi-task ICL model. DG-PIC introduces test-time domain generalization on top of it, achieving massive performance improvements (CD: 72.9 \(\rightarrow\) 4.1 on the reconstruction task).
vs Point-BERT / Point-MAE: Although these self-supervised methods learn rich representations, they do not account for cross-domain generalization, so direct transfer to unseen domains performs poorly.
vs Pointmixup / PointCutMix: Train-time DG methods focus on "domain invariance" via mixed data augmentation but ignore "task invariance", leading to mediocre results.
vs DGLSS / SemanticSTF: Pioneering works in point cloud DG, but they are task-specific and do not support a unified multi-task model.

Rating¶

Novelty: ⭐⭐⭐⭐ Proposes the multi-domain multi-task point cloud understanding setting for the first time; the dual-level design is highly creative.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive ablation studies and various baseline comparisons, though the choice of target domain configurations is somewhat simple.
Writing Quality: ⭐⭐⭐⭐ Clear structure, well-justified motivation, and intuitive diagrams.
Value: ⭐⭐⭐⭐ The new setting and benchmark make a strong contribution to advancing point cloud generalization research.