Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning¶
Conference: CVPR 2026 | arXiv: 2603.22070 | Code: N/A | Area: 3D Vision / Point Cloud Analysis | Keywords: Test-time adaptation, point cloud recognition, Bayesian inference, multimodal distribution learning, zero-shot generalization
TL;DR¶
This paper proposes BayesMM, a training-free dynamic Bayesian distribution-learning framework that models the textual and geometric modalities as per-class Gaussian distributions and balances their contributions automatically via Bayesian model averaging, achieving robust test-time adaptation across multiple point cloud benchmarks with an average improvement exceeding 4%.
Background & Motivation¶
Background: Large multimodal 3D vision-language models (e.g., ULIP-2, Uni3D) achieve strong zero-shot generalization through contrastive pre-training, yet suffer notable performance degradation under distribution shift.
Limitations of Prior Work:
- Cache-based test-time adaptation (TTA) methods maintain sample caches of limited capacity, where sample replacement leads to progressive information loss;
- Fusing zero-shot and cached logits relies on empirically tuned hyperparameters (\(\lambda\), \(\gamma\)) that lack theoretical grounding, resulting in unstable adaptation.
Key Challenge: How to continuously exploit statistical information from all historical samples at test time while fusing different modalities in a principled manner?
Key Insight: Model textual and geometric features of each class as Gaussian distributions, and automatically balance the contributions of both modalities within a Bayesian framework.
Core Idea: Replace discrete caches with distributions and replace heuristic fusion with Bayesian model averaging, enabling continuous, stable, training-free test-time adaptation.
Method¶
Overall Architecture¶
Input: streaming point cloud sequence \(\{X_t\}\) + fixed text prototypes \(\{T_c\}\) → frozen point cloud encoder \(\Phi\) and text encoder \(\Psi\) → textual distribution learning (offline) + geometric distribution learning (online update) → Bayesian weighted fusion → predicted class.
Key Designs¶
- Textual Distribution Learning:
- Function: Estimate a per-class Gaussian distribution from \(M\) LLM-generated semantic paraphrases.
- Mechanism: Compute the empirical mean \(\bar{\mathbf{z}}^c\) and covariance \(\mathbf{S}^c\), establish a prior \(p(\boldsymbol{\nu}^c) = \mathcal{N}(\bar{\mathbf{z}}^c, \beta^2\mathbf{I})\), and derive the deterministic prototype \(\boldsymbol{\nu}^c_{\text{MAP}}\) via MAP estimation.
- Design Motivation: A single text template cannot capture semantic diversity; Gaussian modeling over multiple paraphrases provides richer class-level semantic priors.
- Geometric Distribution Learning:
- Function: Maintain an online Gaussian distribution \(\{\boldsymbol{\mu}_t^c, \boldsymbol{\Sigma}_t^c\}\) per class and update it recursively as new samples arrive.
- Mechanism: Initialized from the text prototypes, \(\boldsymbol{\mu}_0^c = \bar{\mathbf{z}}^c\), with closed-form recursive Bayesian updates \(\boldsymbol{\Sigma}_t^c = \left((\boldsymbol{\Sigma}_{t-1}^c)^{-1} + (\boldsymbol{\Sigma}^c)^{-1}\right)^{-1}\) and \(\boldsymbol{\mu}_t^c = \boldsymbol{\Sigma}_t^c\left((\boldsymbol{\Sigma}^c)^{-1}\mathbf{x}_t + (\boldsymbol{\Sigma}_{t-1}^c)^{-1}\boldsymbol{\mu}_{t-1}^c\right)\), i.e., the standard conjugate-Gaussian posterior obtained by combining the observation likelihood of \(\mathbf{x}_t\) (with covariance \(\boldsymbol{\Sigma}^c\)) and the previous posterior.
- Design Motivation: Distribution parameters continuously accumulate statistics from all historical samples, eliminating cache capacity constraints and information loss.
- Bayesian Model Averaging:
- Function: Automatically fuse the posterior predictions of the textual and geometric modalities.
- Mechanism: \(p(c|\mathbf{x}_t) = p(c|\mathbf{x}_t, \boldsymbol{\Omega}^c) p(\boldsymbol{\Omega}^c|\mathbf{x}_t) + p(c|\mathbf{x}_t, \boldsymbol{\Theta}_t^c) p(\boldsymbol{\Theta}_t^c|\mathbf{x}_t)\)
- The weight of each modality is its posterior model evidence, \(p(\boldsymbol{\Omega}^c|\mathbf{x}_t)\) or \(p(\boldsymbol{\Theta}_t^c|\mathbf{x}_t)\); since these posterior weights sum to one, the balance between modalities adjusts automatically for each test sample.
- Design Motivation: The fusion weight \(\lambda\) in cache-based methods must be tuned manually; the Bayesian framework allocates weights automatically from data evidence, yielding greater robustness. (A combined NumPy sketch of all three components follows this list.)
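To make the three components concrete, here is a minimal NumPy sketch assembled from the formulas above. It is a sketch under simplifying assumptions, not the authors' implementation: all names (`text_prototype_map`, `GaussianTrack`, `bma_predict`) are mine, the textual likelihood is reduced to an isotropic Gaussian, and the modality evidences are approximated by per-modality marginal likelihoods, which the paper may compute differently.

```python
import numpy as np

def text_prototype_map(Z, prior_mean, beta=0.1):
    """MAP estimate of a class text prototype from M paraphrase embeddings
    Z (M, D), under a conjugate Gaussian prior N(prior_mean, beta^2 I).
    Assumes an isotropic unit-variance likelihood, a simplification of the
    paper's per-class covariance S^c."""
    M = Z.shape[0]
    prior_prec = 1.0 / beta**2
    # Conjugate-Gaussian posterior mean: precision-weighted combination.
    return (Z.sum(axis=0) + prior_prec * prior_mean) / (M + prior_prec)

class GaussianTrack:
    """Per-class online Gaussian N(mu_t, Sigma_t) with the closed-form
    recursive updates quoted in the Method section."""
    def __init__(self, mu0, Sigma0):
        self.mu, self.Sigma = mu0.copy(), Sigma0.copy()

    def update(self, x, obs_cov):
        P_prev = np.linalg.inv(self.Sigma)          # (Sigma_{t-1}^c)^{-1}
        P_obs = np.linalg.inv(obs_cov)              # (Sigma^c)^{-1}
        self.Sigma = np.linalg.inv(P_prev + P_obs)  # Sigma_t^c
        self.mu = self.Sigma @ (P_obs @ x + P_prev @ self.mu)  # mu_t^c

def _gauss_logpdf(x, mu, Sigma):
    D = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (D * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

def _softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def bma_predict(x, text_mus, text_Sigmas, geo_tracks):
    """Bayesian model averaging of the textual (Omega) and geometric (Theta)
    class posteriors: p(c|x) = p(c|x,Omega) p(Omega|x) + p(c|x,Theta) p(Theta|x).
    Evidence weights are approximated here by each modality's marginal
    likelihood of x (log-sum-exp over classes)."""
    ll_text = np.array([_gauss_logpdf(x, m, S)
                        for m, S in zip(text_mus, text_Sigmas)])
    ll_geo = np.array([_gauss_logpdf(x, g.mu, g.Sigma) for g in geo_tracks])
    w_text, w_geo = _softmax(np.array([np.logaddexp.reduce(ll_text),
                                       np.logaddexp.reduce(ll_geo)]))
    return w_text * _softmax(ll_text) + w_geo * _softmax(ll_geo)

# Toy streaming loop with random stand-ins for the frozen encoders Phi/Psi.
if __name__ == "__main__":
    rng, D, C, M = np.random.default_rng(0), 16, 5, 8
    # Stand-in LLM paraphrase embeddings per class (frozen text encoder output).
    Z_all = rng.standard_normal((C, 1, D)) + 0.1 * rng.standard_normal((C, M, D))
    text_mus = np.stack([text_prototype_map(Z, Z.mean(axis=0)) for Z in Z_all])
    text_Sigmas = [np.eye(D)] * C
    # mu_0^c initialized at the text prototype (= z_bar^c under this prior).
    tracks = [GaussianTrack(mu, np.eye(D)) for mu in text_mus]
    obs_cov = 0.5 * np.eye(D)
    for _ in range(20):                              # streaming test features x_t
        x = text_mus[rng.integers(C)] + 0.3 * rng.standard_normal(D)
        pred = int(np.argmax(bma_predict(x, text_mus, text_Sigmas, tracks)))
        tracks[pred].update(x, obs_cov)              # online geometric update
```

The toy loop at the bottom mirrors the streaming setting: each test feature is classified by the fused posterior, then folded into the predicted class's geometric Gaussian, so no sample cache is ever needed.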
Loss & Training¶
- Entirely training-free: All encoders are frozen; adaptation is performed solely via closed-form Bayesian updates of the distribution parameters (a one-step derivation of this update is sketched after this list).
- No additional hyperparameters require domain-specific tuning.
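As a sanity check on the closed form, the recursive update quoted in the Method section is exactly one conjugate-Gaussian Bayes step; the derivation below is a textbook identity, not reproduced from the paper:

```latex
% One conjugate-Gaussian Bayes step: precisions add, means are precision-weighted.
\begin{aligned}
p(\boldsymbol{\theta}\mid\mathbf{x}_t)
  &\propto \mathcal{N}\!\left(\mathbf{x}_t \mid \boldsymbol{\theta},\, \boldsymbol{\Sigma}^c\right)
           \mathcal{N}\!\left(\boldsymbol{\theta} \mid \boldsymbol{\mu}_{t-1}^c,\, \boldsymbol{\Sigma}_{t-1}^c\right)
   = \mathcal{N}\!\left(\boldsymbol{\theta} \mid \boldsymbol{\mu}_t^c,\, \boldsymbol{\Sigma}_t^c\right), \\
(\boldsymbol{\Sigma}_t^c)^{-1} &= (\boldsymbol{\Sigma}_{t-1}^c)^{-1} + (\boldsymbol{\Sigma}^c)^{-1},
\qquad
\boldsymbol{\mu}_t^c = \boldsymbol{\Sigma}_t^c\left((\boldsymbol{\Sigma}^c)^{-1}\mathbf{x}_t
  + (\boldsymbol{\Sigma}_{t-1}^c)^{-1}\boldsymbol{\mu}_{t-1}^c\right).
\end{aligned}
```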
Key Experimental Results¶
Main Results (ModelNet-C; 4 of 7 corruption types shown, Mean taken over all 7)¶
| Backbone | Method | Add Global | Add Local | Drop Global | Jitter | Mean |
|---|---|---|---|---|---|---|
| ULIP | Zero-shot | 33.55 | 43.92 | 54.70 | 44.08 | 48.60 |
| ULIP | + Hierarchical Cache | 46.15 | 47.85 | 59.16 | 49.92 | 55.02 |
| ULIP | + BayesMM | 54.82 | 53.93 | 63.09 | 53.04 | 59.42 |
| Uni3D | Zero-shot | 72.45 | 56.36 | 68.15 | 56.24 | 69.69 |
| Uni3D | + Hierarchical Cache | 77.51 | 71.15 | 72.16 | 62.52 | 74.63 |
| Uni3D | + BayesMM | 77.59 | 73.30 | 74.96 | 65.84 | 76.56 |
Ablation Study (Distribution alignment verification)¶
| Configuration | KL Divergence (init→final) | MMD (init→final) | Note |
|---|---|---|---|
| Text modality only | High | High | Single modality insufficient |
| Geometric modality only | Medium | Medium | Lacks semantic prior |
| BayesMM (full) | 17.2 → 12.6 | 0.91 → 0.71 | Bayesian fusion converges continuously |
Key Findings¶
- BayesMM yields consistent improvements across all four backbone models (ULIP, ULIP-2, OpenShape, Uni3D).
- Effective under the Sim-to-Real setting, demonstrating cross-domain generalization.
- KL divergence and MMD decrease continuously throughout adaptation, indicating progressive distribution alignment rather than overfitting.
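Since the paper does not spell out its KL/MMD estimators, here is a minimal sketch of the conventional choices one could use to reproduce this monitoring: the closed-form KL divergence between two Gaussians and a biased RBF-kernel MMD estimate (function names and the `sigma` bandwidth are mine):

```python
import numpy as np

def kl_gaussians(mu0, S0, mu1, S1):
    """Closed-form KL( N(mu0, S0) || N(mu1, S1) )."""
    D = mu0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - D
                  + logdet1 - logdet0)

def mmd_rbf(X, Y, sigma=1.0):
    """Biased MMD^2 estimate with an RBF kernel between sample sets
    X (n, D) and Y (m, D)."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

Tracking both quantities between the adapted geometric Gaussians and held-out target features over the stream would reproduce the "init→final" trend reported in the ablation table.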
Highlights & Insights¶
- Fully training-free TTA: No gradient updates required; adaptation is achieved entirely via closed-form Bayesian updates.
- Introduces distribution learning into 3D multimodal TTA, offering a theoretically more principled alternative to cache-based methods.
- Model-agnostic: plug-and-play compatible with any pre-trained 3D vision-language model.
Limitations & Future Work¶
- The Gaussian assumption may be inappropriate for complex non-Gaussian feature distributions.
- Maintaining per-class covariance matrices incurs non-trivial computational overhead when the number of classes is large.
- Geometric distributions may be poorly estimated when very few test samples from a given class appear in the stream.
Related Work & Insights¶
- Conceptually similar to DOTA (online Gaussian TTA for 2D VLMs), but extended to 3D multimodal settings.
- The Bayesian model averaging paradigm is generalizable to other multimodal fusion scenarios.
Rating¶
- Novelty: ⭐⭐⭐⭐ Bayesian framework replaces cache-based methods with theoretical elegance
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four backbone models × multiple benchmarks × diverse settings
- Writing Quality: ⭐⭐⭐⭐ Derivations are clear and formulations are rigorous
- Value: ⭐⭐⭐⭐ A practical plug-and-play TTA solution