DG-PIC: Domain Generalized Point-In-Context Learning for Point Cloud Understanding¶
Conference: ECCV 2024
arXiv: 2407.08801
Code: https://github.com/Jinec98/DG-PIC
Area: 3D Vision
Keywords: Point Cloud Understanding, Domain Generalization, In-Context Learning, Multi-Task Learning, Test-Time Adaptation
TL;DR¶
DG-PIC is the first point cloud understanding framework to address multi-domain and multi-task learning in a single unified model. Using dual-level source prototype estimation and a test-time feature shifting mechanism, it generalizes to unseen domains without updating model parameters.
Background & Motivation¶
Background: Point cloud understanding is widely used in fields such as autonomous driving and robotics, but models are typically trained and tested on a single dataset. Performance degrades significantly when facing new data with different distributions (e.g., from synthetic ModelNet40 to real-scan ScanObjectNN).
Limitations of Prior Work:
- Domain Generalization (DG) methods are generally designed for a single task (e.g., classification), lack multi-task capabilities, and ignore the valuable utility of the test data itself.
- In-Context Learning (ICL) methods (such as PIC) can handle multiple tasks but are restricted to a single dataset, exhibiting poor cross-domain generalization.
- Neither type of method can simultaneously solve the "multi-domain" and "multi-task" problems.
Key Challenge: A unified model needs to reconcile task generalization (multi-task) and domain generalization (multi-domain), whereas existing methods can only address one of them.
Goal: To handle point cloud understanding across multiple domains and tasks in a unified model, and generalize to unseen domains during testing without updating model parameters.
Key Insight: Combine the multi-task ICL of PIC with test-time domain generalization: learn cross-domain generalizable information via PIC during pre-training, then shift target-domain features towards the source domains at test time.
Core Idea: Dual-level source prototypes + dual-level test-time feature shifting, which aligns unseen test domain data to known source domains without requiring model updates.
Method¶
Overall Architecture¶
DG-PIC consists of two stages. (1) Pre-training stage: a PIC model is trained on multiple source domains within the Masked Point Modeling (MPM) framework, using a cross-domain prompt pairing strategy. (2) Testing stage: the model is frozen. Dual-level source prototype estimation measures the distance between each test sample and every source domain, dual-level feature shifting then aligns the test features to the source domains, and finally the most similar sample from the nearest source domain is selected as the prompt.
Key Designs¶
- Multi-domain Prompt Pairing:
    - Function: Randomly selects prompt samples from different source domains during pre-training to enhance cross-domain associations.
    - Mechanism: Assuming the query is from domain \(D_s^i\) and the prompt is from domain \(D_s^j\) (\(j \neq i\)), the predicted masked patch is: \(P \sim (D_s^i, D_s^j) = \mathrm{Trans}([F_\theta(I_i) \oplus F_\theta(T_i^k) \oplus F_\theta(I_j) \oplus F_\theta(T_j^k)], \mathrm{Mask})\). Here \(I\) and \(T^k\) denote the input and task-\(k\) target point clouds, \(F_\theta\) is the feature encoder, and \(\oplus\) is concatenation. The training loss is the Chamfer Distance between the prediction \(P\) and the ground truth \(G\): \(\text{CD}(P,G) = \frac{1}{|P|}\sum_{x \in P}\min_{y \in G}\|x-y\|^2 + \frac{1}{|G|}\sum_{y \in G}\min_{x \in P}\|y-x\|^2\)
    - Design Motivation: Cross-domain pairing forces the model to learn domain-invariant feature representations.
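The Chamfer Distance used as the training loss can be sketched in a few lines of NumPy. This is a minimal illustration of the formula above, not the authors' implementation; the function name `chamfer_distance` and the toy point sets are assumptions made for the example.

```python
import numpy as np

def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric squared-L2 Chamfer Distance between point sets (N, 3) and (M, 3)."""
    # Pairwise squared distances between every predicted and ground-truth point, shape (N, M).
    diff = pred[:, None, :] - gt[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Average nearest-neighbour distance in each direction, then sum the two terms.
    pred_to_gt = sq_dist.min(axis=1).mean()
    gt_to_pred = sq_dist.min(axis=0).mean()
    return float(pred_to_gt + gt_to_pred)

# Toy example (hypothetical data): two points per set.
pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
print(chamfer_distance(pred, gt))  # 1.0
```

The brute-force pairwise matrix is fine for 1024-point clouds; larger clouds would typically use a KD-tree or a GPU kernel instead.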
- Dual-level Source Prototype Estimation:
    - Function: Computes global and local prototypes for each source domain to act as anchors for test-time feature alignment.
    - Mechanism:
        - Local prototype \(Z_{local}^{i,m}\): Averages the features of the \(m\)-th patch over all \(N_{D_s^i}\) samples in domain \(D_s^i\): \(Z_{local}^{i,m} = \frac{1}{N_{D_s^i}} \sum_{n=1}^{N_{D_s^i}} F_\theta(P_m^n)\), where \(P_m^n\) is the \(m\)-th patch of the \(n\)-th sample.
        - Global prototype \(Z_{global}^i\): Averages the max-pooled patch features of each sample: \(Z_{global}^i = \frac{1}{N_{D_s^i}} \sum_{n=1}^{N_{D_s^i}} \max_{m} F_\theta(P_m^n)\)
        - Computes the Euclidean distance from the test sample to each source domain prototype: \(\mathcal{E}_{global}^i = \|F_{global} - Z_{global}^i\|\), \(\mathcal{E}_{local}^{i,m} = \|F_{local}^m - Z_{local}^{i,m}\|\)
    - Design Motivation: Global features capture shape context, whereas local features capture geometric structural details. Both levels are needed to represent a source domain comprehensively.
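The prototype computation and the two distances can be sketched as follows. This is an illustrative reading of the formulas above: it assumes the patch features of one source domain are already stacked into an array of shape `(N, M, C)` (samples, patches, channels), and the function names are hypothetical.

```python
import numpy as np

def source_prototypes(feats: np.ndarray):
    """feats: (N, M, C) patch features of all N samples in one source domain."""
    # Local prototype: average each patch position over the samples -> (M, C).
    z_local = feats.mean(axis=0)
    # Global prototype: max-pool over patches per sample, then average -> (C,).
    z_global = feats.max(axis=1).mean(axis=0)
    return z_global, z_local

def prototype_distances(test_feats: np.ndarray, z_global: np.ndarray, z_local: np.ndarray):
    """test_feats: (M, C) patch features of a single test sample."""
    f_global = test_feats.max(axis=0)                        # (C,)
    e_global = np.linalg.norm(f_global - z_global)           # scalar
    e_local = np.linalg.norm(test_feats - z_local, axis=1)   # (M,), one distance per patch
    return float(e_global), e_local

# Hypothetical features: 10 source samples, 64 patches, 8 channels.
rng = np.random.default_rng(0)
z_global, z_local = source_prototypes(rng.normal(size=(10, 64, 8)))
e_global, e_local = prototype_distances(rng.normal(size=(64, 8)), z_global, z_local)
print(z_global.shape, z_local.shape, e_local.shape)  # (8,) (64, 8) (64,)
```

Repeating this for every source domain yields the distance vectors \(\mathcal{E}_{global}\) and \(\mathcal{E}_{local}^i\) used at test time.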
- Dual-level Test-time Feature Shifting:
    - Function: Shifts target domain features towards the source domains at test time without updating model parameters.
    - Mechanism:
        - Macro semantic coefficient \(\alpha\): Derived from the global distances, controlling how much each source domain contributes to the feature shift: \(\alpha = \mathrm{softmax}(\mathcal{E}_{global})\)
        - Micro positional coefficient \(\beta^i\): Derived from the local distances, accounting for the alignment of patch positions: \(\beta^i = \mathrm{softmax}(\mathcal{E}_{local}^i)\)
        - Final feature shifting formula over \(R\) source domains and \(M\) patches: \(F'_{local} = \frac{1}{R}\sum_{i=1}^{R}\alpha_i\left(\frac{1}{M}\sum_{m=1}^{M}\beta^{i,m}F_{local}^m\right) + \frac{1}{R}\sum_{i=1}^{R}(1-\alpha_i)\left(\frac{1}{M}\sum_{m=1}^{M}(1-\beta^{i,m})Z_{local}^{i,m}\right)\). In this formula, \(\alpha_i\) and \(\beta^{i,m}\) weight the test sample's own features, while \(1-\alpha_i\) and \(1-\beta^{i,m}\) weight the source prototypes.
    - Design Motivation: \(\alpha\) uses cross-domain semantic similarity to regulate the overall shifting intensity, while \(\beta\) uses the geometric similarity of co-located patches for fine-grained alignment. Even across domains, patches at the same relative position should share similar geometric structures (e.g., table edges versus a flat surface).
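A literal NumPy transcription of the shifting formula is sketched below. It follows the equation exactly as written in this summary, so the sums over \(m\) reduce the patch axis and the result is a single \(C\)-dimensional vector; whether the released code keeps a per-patch output instead is not stated here. The function names and array layouts are assumptions for illustration.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shift_features(f_local, z_local, e_global, e_local):
    """
    f_local:  (M, C)    patch features of the test sample
    z_local:  (R, M, C) local prototypes of the R source domains
    e_global: (R,)      global distances to each source domain
    e_local:  (R, M)    per-patch local distances to each source domain
    """
    alpha = softmax(e_global)        # (R,)   macro semantic coefficients
    beta = softmax(e_local, axis=1)  # (R, M) micro positional coefficients
    # (1/M) * sum_m beta^{i,m} F^m, one vector per source domain -> (R, C).
    target_term = (beta[:, :, None] * f_local[None, :, :]).mean(axis=1)
    # (1/M) * sum_m (1 - beta^{i,m}) Z^{i,m} -> (R, C).
    source_term = ((1.0 - beta)[:, :, None] * z_local).mean(axis=1)
    # (1/R) * sum_i [alpha_i * target_i + (1 - alpha_i) * source_i] -> (C,).
    mixed = alpha[:, None] * target_term + (1.0 - alpha)[:, None] * source_term
    return mixed.mean(axis=0)

# Hypothetical sizes: R=3 source domains, M=64 patches, C=8 channels.
rng = np.random.default_rng(0)
shifted = shift_features(
    rng.normal(size=(64, 8)), rng.normal(size=(3, 64, 8)),
    rng.random(3), rng.random((3, 64)),
)
print(shifted.shape)  # (8,)
```

No gradients or parameter updates are involved: the shift is a closed-form reweighting of frozen features and stored prototypes.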
- Test-time Prompt Selection:
    - Function: Selects the most similar sample from the nearest source domain as the prompt.
    - Mechanism: Determines the nearest source domain by combining global and local distances: \(\mathcal{E}^i = \lambda \cdot \mathcal{E}_{global}^i + (1-\lambda) \cdot \frac{1}{M}\sum_{m=1}^{M}\mathcal{E}_{local}^{i,m}\) (\(\lambda=0.5\)), and then finds the sample with the minimum feature distance in that domain to serve as the prompt.
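The selection rule can be sketched as below. The combined distance follows the formula above with \(\lambda = 0.5\); comparing candidates by Euclidean distance between global feature vectors is an assumption, since the summary does not say which features are compared, and the function names are hypothetical.

```python
import numpy as np

def combined_distance(e_global: np.ndarray, e_local: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """e_global: (R,), e_local: (R, M). Returns one combined distance per source domain."""
    return lam * e_global + (1.0 - lam) * e_local.mean(axis=1)

def select_prompt(e_global, e_local, query_feat, domain_feats, lam: float = 0.5):
    """
    query_feat:   (C,) feature of the test sample
    domain_feats: list of R arrays, each (N_i, C), one row per source sample
    Returns (nearest domain index, index of the closest sample in that domain).
    """
    nearest = int(np.argmin(combined_distance(e_global, e_local, lam)))
    dists = np.linalg.norm(domain_feats[nearest] - query_feat, axis=1)
    return nearest, int(np.argmin(dists))

# Toy example (hypothetical values): two source domains, two patches each.
e_global = np.array([2.0, 1.0])
e_local = np.array([[1.0, 1.0], [3.0, 3.0]])
domain_feats = [np.array([[0.0, 0.0], [1.0, 1.0]]), np.array([[5.0, 5.0]])]
print(combined_distance(e_global, e_local))  # [1.5 2. ]
print(select_prompt(e_global, e_local, np.array([0.9, 1.2]), domain_feats))  # (0, 1)
```

In the toy example, domain 1 is closer globally but domain 0 wins once the local distances are averaged in, which is the situation the combined score is meant to handle.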
Loss & Training¶
- Pre-training uses Chamfer Distance loss
- AdamW optimizer, lr=0.001, cosine learning rate schedule
- Trained for 300 epochs, batch size 128
- Each point cloud is sampled with 1024 points, divided into 64 patches (32 points each), with a mask ratio of 0.7
- No model parameters are updated at test time
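The patch and masking numbers above imply some simple arithmetic, sketched here. Note that 64 patches of 32 points give 2048 point slots for a 1024-point cloud, so patches necessarily overlap. Uniform random patch masking and rounding the masked count down are assumptions made for this sketch; the constant and function names are hypothetical.

```python
import numpy as np

NUM_POINTS = 1024   # points sampled per cloud
NUM_PATCHES = 64    # patches per cloud
PATCH_SIZE = 32     # points per patch
MASK_RATIO = 0.7    # fraction of patches that are masked

def random_patch_mask(num_patches: int, mask_ratio: float, rng: np.random.Generator) -> np.ndarray:
    """Boolean mask over patches; True marks a masked patch (assumed uniform sampling)."""
    num_masked = int(num_patches * mask_ratio)  # rounds down: 44 for 64 patches at 0.7
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

mask = random_patch_mask(NUM_PATCHES, MASK_RATIO, np.random.default_rng(0))
print(NUM_PATCHES * PATCH_SIZE, int(mask.sum()))  # 2048 44
```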
Key Experimental Results¶
Main Results (ScanObjectNN as target domain, CD×10⁻³ ↓)¶
| Method | Setting | Reconstruction | Denoising | Registration |
|---|---|---|---|---|
| DG-PIC (Ours) | Test-time DG | 4.1 | 15.2 | 5.8 |
| PIC | Supervised | 72.9 | 80.0 | 12.7 |
| Point-MAE (task-specific) | Supervised | 30.4 | 36.0 | 31.2 |
| PCT (multi-task) | Supervised | 31.5 | 36.5 | 34.9 |
| PointCutMix (task-specific) | Train-time DG | 44.8 | 43.5 | 41.3 |
Ablation Study (CD×10⁻³ ↓)¶
| Model | Prototype Estimation | Feature Shifting | Anchor Domain | Reconstruction | Denoising | Registration |
|---|---|---|---|---|---|---|
| Model A | Random | Mean | Single Domain | 8.4 | 40.5 | 6.7 |
| Model B | Global Only | Mean | Single Domain | 7.2 | 38.3 | 6.4 |
| Model C | Local Only | Mean | Single Domain | 7.3 | 36.7 | 6.7 |
| Model D | Global+Local | Mean | Single Domain | 6.8 | 35.1 | 6.2 |
| Model E | Global+Local | Mean | All Domains | 6.3 | 32.4 | 6.5 |
| Model F | Global+Local | Macro Only | All Domains | 5.2 | 22.7 | 6.0 |
| Model G | Global+Local | Micro Only | All Domains | 4.9 | 25.6 | 6.2 |
| Ours | Global+Local | Macro+Micro | All Domains | 4.1 | 15.2 | 5.8 |
Key Findings¶
- DG-PIC significantly outperforms all baseline methods across all three tasks; on the reconstruction task, its CD is only 5.6% of PIC's (4.1 vs 72.9).
- Traditional methods (PointNet, DGCNN, etc.) exhibit poor cross-domain generalization, with CD values generally falling within the 30-45 range.
- Although DG methods (Pointmixup, PointCutMix) introduce domain generalization, the variance of results across tasks is small, indicating they focus solely on domain differences while ignoring task differences.
- PIC can handle multiple tasks but degrades sharply in cross-domain scenarios (CD 72.9 on reconstruction and 80.0 on denoising), indicating that ICL alone is insufficient to bridge domain gaps.
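- Each component of the dual-level design (global+local prototypes, macro+micro shifting) improves reconstruction and denoising in the ablation, with denoising benefiting the most (CD from 40.5 for Model A to 15.2 for the full model). Registration gains are smaller (6.7 to 5.8) and not strictly monotonic (6.2 for Model D versus 6.5 for Model E).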
Highlights & Insights¶
- Pioneering Setting: This paper introduces the novel setting of multi-domain multi-task point cloud understanding and establishes a benchmark containing 4 datasets (2 synthetic + 2 real), 7 object categories, 3 tasks, and 30,954 samples.
- Test-Time Generalization Without Model Updates: Domain adaptation is achieved via feature-space shifting rather than fine-tuning, maintaining a manageable computational overhead.
- Elegant Design Intuition for Micro Positional Coefficients: Exploits the effective prior that patches at the same relative position on an object should exhibit similar geometric structures across domains.
- A Unified Model for Three Tasks: Reconstruction, denoising, and registration share a single network, with tasks toggled via prompts.
Limitations & Future Work¶
- The benchmark only contains 7 shared classes, with limited scale and diversity.
- The framework only considers regression-like tasks for coordinates (reconstruction/denoising/registration) and does not cover discriminative tasks like classification or segmentation.
- Source domain prototypes are calculated as a simple average of all samples, potentially losing intra-class multi-modal distribution information.
- The formulation for feature shifting is somewhat heuristic and lacks rigorous theoretical analysis.
- Ablation studies and thorough experiments were primarily conducted with ScanObjectNN as the target domain, leaving other target domain configurations insufficiently explored.
Related Work & Insights¶
- vs PIC: PIC is a single-domain multi-task ICL model. DG-PIC adds test-time domain generalization on top of it, yielding a large improvement (CD: 72.9 \(\rightarrow\) 4.1 on the reconstruction task).
- vs Point-BERT / Point-MAE: Although these self-supervised methods learn rich representations, they do not account for cross-domain generalization, yielding poor direct transfer performance.
- vs Pointmixup / PointCutMix: Train-time DG methods focus on "domain invariance" via mixed data augmentation but ignore "task invariance", leading to mediocre results.
- vs DGLSS / SemanticSTF: Pioneering works in point cloud DG, but they are task-specific and do not support a unified multi-task model.
Rating¶
- Novelty: ⭐⭐⭐⭐ Proposes the multi-domain multi-task point cloud understanding setting for the first time; the dual-level design is highly creative.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive ablation studies and a broad set of baseline comparisons, though the range of target domain configurations is limited.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, well-justified motivation, and intuitive diagrams.
- Value: ⭐⭐⭐⭐ The new setting and benchmark make a strong contribution to advancing point cloud generalization research.