Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning

Conference: CVPR 2026 arXiv: 2603.22070 Code: N/A Area: 3D Vision / Point Cloud Analysis Keywords: Test-time adaptation, point cloud recognition, Bayesian inference, multimodal distribution learning, zero-shot generalization

TL;DR

BayesMM is a training-free test-time adaptation framework that models the textual and geometric modalities as per-class Gaussian distributions and balances their contributions automatically via Bayesian model averaging, yielding robust adaptation across multiple point cloud benchmarks with an average improvement of more than 4%.

Background & Motivation

Background: Large multimodal 3D vision-language models (e.g., ULIP-2, Uni3D) achieve strong zero-shot generalization through contrastive pre-training, yet suffer notable performance degradation under distribution shift.

Limitations of Prior Work:

  • Cache-based test-time adaptation (TTA) methods maintain sample caches of limited capacity, so sample replacement leads to progressive information loss.
  • Fusing zero-shot and cached logits relies on empirically tuned hyperparameters (\(\lambda\), \(\gamma\)) without theoretical grounding, resulting in unstable adaptation.

Key Challenge: How to continuously exploit statistical information from all historical samples at test time while fusing different modalities in a principled manner?

Key Insight: Model textual and geometric features of each class as Gaussian distributions, and automatically balance the contributions of both modalities within a Bayesian framework.

Core Idea: Replace discrete caches with distributions and replace heuristic fusion with Bayesian model averaging, enabling continuous, stable, training-free test-time adaptation.

Method

Overall Architecture

Input: streaming point cloud sequence \(\{X_t\}\) + fixed text prototypes \(\{T_c\}\) → frozen point cloud encoder \(\Phi\) and text encoder \(\Psi\) → textual distribution learning (offline) + geometric distribution learning (online update) → Bayesian weighted fusion → predicted class.
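The pipeline above can be sketched end-to-end with toy stand-ins for the frozen encoders. All names, dimensions, and the unit-covariance likelihood below are illustrative assumptions, not the paper's implementation; the fusion uses equal modality weights for brevity, whereas BayesMM derives them via Bayesian model averaging:

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 8, 3  # toy feature dimension / class count (illustrative values)

# Stand-ins for frozen-encoder outputs: text prototypes nu^c_MAP and
# per-class geometric state (mu_t^c, Sigma_t^c) initialized from text.
text_protos = rng.normal(size=(C, D))
geo_mu = text_protos.copy()
geo_Sigma = np.stack([np.eye(D)] * C)
Sigma_obs = np.eye(D)  # fixed observation covariance Sigma^c

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def class_probs(z, means):
    # class posterior from (unit-covariance) Gaussian likelihoods
    return softmax(-0.5 * ((means - z) ** 2).sum(axis=1))

def step(z):
    """One streaming step: fuse text/geometry, predict, update."""
    p_text = class_probs(z, text_protos)
    p_geo = class_probs(z, geo_mu)
    p = 0.5 * p_text + 0.5 * p_geo  # equal weights for this sketch only
    c = int(np.argmax(p))
    # closed-form recursive update of the predicted class's Gaussian
    P_prev, P_obs = np.linalg.inv(geo_Sigma[c]), np.linalg.inv(Sigma_obs)
    geo_Sigma[c] = np.linalg.inv(P_prev + P_obs)
    geo_mu[c] = geo_Sigma[c] @ (P_obs @ z + P_prev @ geo_mu[c])
    return c

preds = [step(rng.normal(size=D)) for _ in range(5)]
```

Note that nothing is trained: each step only reads frozen features and applies a closed-form update, which is what makes the method streaming-friendly.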

Key Designs

  1. Textual Distribution Learning:

    • Function: Estimate a per-class Gaussian distribution from \(M\) LLM-generated semantic paraphrases.
    • Mechanism: Compute the empirical mean \(\bar{\mathbf{z}}^c\) and covariance \(\mathbf{S}^c\), establish a prior \(p(\boldsymbol{\nu}^c) = \mathcal{N}(\bar{\mathbf{z}}^c, \beta^2\mathbf{I})\), and derive the deterministic prototype \(\boldsymbol{\nu}^c_{\text{MAP}}\) via MAP estimation.
    • Design Motivation: A single text template cannot capture semantic diversity; Gaussian modeling over multiple paraphrases provides richer class-level semantic priors.
  2. Geometric Distribution Learning:

    • Function: Maintain an online Gaussian distribution \(\{\boldsymbol{\mu}_t^c, \boldsymbol{\Sigma}_t^c\}\) per class and update it recursively as new samples arrive.
    • Mechanism: Initialized from text prototypes \(\boldsymbol{\mu}_0^c = \bar{\mathbf{z}}^c\), with closed-form recursive Bayesian updates: \(\boldsymbol{\mu}_t^c = \boldsymbol{\Sigma}_t^c((\boldsymbol{\Sigma}^c)^{-1}\mathbf{x}_t + (\boldsymbol{\Sigma}_{t-1}^c)^{-1}\boldsymbol{\mu}_{t-1}^c)\) \(\boldsymbol{\Sigma}_t^c = ((\boldsymbol{\Sigma}_{t-1}^c)^{-1} + (\boldsymbol{\Sigma}^c)^{-1})^{-1}\)
    • Design Motivation: Distribution parameters continuously accumulate statistics from all historical samples, eliminating cache capacity constraints and information loss.
  3. Bayesian Model Averaging:

    • Function: Automatically fuse the posterior predictions of the textual and geometric modalities.
    • Mechanism: \(p(c|\mathbf{x}_t) = p(c|\mathbf{x}_t, \boldsymbol{\Omega}^c) p(\boldsymbol{\Omega}^c|\mathbf{x}_t) + p(c|\mathbf{x}_t, \boldsymbol{\Theta}_t^c) p(\boldsymbol{\Theta}_t^c|\mathbf{x}_t)\)
    • Each modality's weight is its posterior evidence, \(p(\boldsymbol{\Omega}^c|\mathbf{x}_t)\) or \(p(\boldsymbol{\Theta}_t^c|\mathbf{x}_t)\), so the balance adjusts automatically per sample.
    • Design Motivation: The \(\lambda\) in cache-based methods requires manual tuning; the Bayesian framework allocates weights automatically based on data evidence, yielding greater robustness.
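The two closed-form pieces above — the recursive Gaussian update from Key Design 2 and the evidence-weighted fusion from Key Design 3 — can be sketched directly. The update follows the stated formulas; the log-evidence inputs to the fusion are assumptions (the paper's exact evidence computation is not reproduced here):

```python
import numpy as np

def recursive_gaussian_update(mu_prev, Sigma_prev, x_t, Sigma_obs):
    """Conjugate update from Key Design 2:
    Sigma_t = (Sigma_{t-1}^{-1} + Sigma^{-1})^{-1}
    mu_t    = Sigma_t (Sigma^{-1} x_t + Sigma_{t-1}^{-1} mu_{t-1})."""
    P_prev = np.linalg.inv(Sigma_prev)  # prior precision
    P_obs = np.linalg.inv(Sigma_obs)    # observation precision
    Sigma_t = np.linalg.inv(P_prev + P_obs)
    mu_t = Sigma_t @ (P_obs @ x_t + P_prev @ mu_prev)
    return mu_t, Sigma_t

def bma_fuse(p_text, p_geo, log_evid_text, log_evid_geo):
    """Bayesian model averaging: per-class predictions weighted by each
    modality's posterior evidence (normalized via softmax over the
    log-evidences, a common and numerically stable choice)."""
    m = max(log_evid_text, log_evid_geo)
    e = np.array([np.exp(log_evid_text - m), np.exp(log_evid_geo - m)])
    w = e / e.sum()
    return w[0] * np.asarray(p_text) + w[1] * np.asarray(p_geo)

# With identity covariances, one update halves the variance and moves
# the mean halfway toward the new sample:
mu, S = recursive_gaussian_update(
    np.zeros(2), np.eye(2), np.array([2.0, 2.0]), np.eye(2)
)
# mu -> [1., 1.], S -> 0.5 * I
```

Because the update is precision-additive, every past sample stays encoded in \((\boldsymbol{\mu}_t^c, \boldsymbol{\Sigma}_t^c)\) at constant memory, which is exactly the property that removes the cache-capacity bottleneck.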

Loss & Training

  • Entirely training-free: All encoders are frozen; adaptation is performed solely via closed-form Bayesian updates of distribution parameters.
  • No additional hyperparameters require domain-specific tuning.

Key Experimental Results

Main Results (ModelNet-C, 7 corruption types; four representative corruptions shown, Mean over all seven)

Backbone   Method                  Add Global   Add Local   Drop Global   Jitter    Mean
ULIP       Zero-shot                    33.55       43.92         54.70    44.08   48.60
ULIP       + Hierarchical Cache         46.15       47.85         59.16    49.92   55.02
ULIP       + BayesMM                    54.82       53.93         63.09    53.04   59.42
Uni3D      Zero-shot                    72.45       56.36         68.15    56.24   69.69
Uni3D      + Hierarchical Cache         77.51       71.15         72.16    62.52   74.63
Uni3D      + BayesMM                    77.59       73.30         74.96    65.84   76.56

Ablation Study (distribution-alignment verification)

Configuration             KL Divergence (init → final)   MMD (init → final)   Note
Text modality only        high                           high                 single modality insufficient
Geometric modality only   medium                         medium               lacks semantic prior
BayesMM (full)            17.2 → 12.6                    0.91 → 0.71          Bayesian fusion converges continuously

Key Findings

  • BayesMM yields consistent improvements across all four backbone models (ULIP, ULIP-2, OpenShape, Uni3D).
  • Effective under the Sim-to-Real setting, demonstrating cross-domain generalization.
  • KL divergence and MMD decrease continuously throughout adaptation, indicating progressive distribution alignment rather than overfitting.

Highlights & Insights

  • Fully training-free TTA: No gradient updates required; adaptation is achieved entirely via closed-form Bayesian updates.
  • Introduces distribution learning into 3D multimodal TTA, offering a theoretically more principled alternative to cache-based methods.
  • Model-agnostic: plug-and-play compatible with any pre-trained 3D vision-language model.

Limitations & Future Work

  • The Gaussian assumption may be inappropriate for complex non-Gaussian feature distributions.
  • Maintaining per-class covariance matrices incurs non-trivial computational overhead when the number of classes is large.
  • Geometric distributions may be poorly estimated when very few test samples from a given class appear in the stream.
  • The approach is conceptually similar to DOTA (online Gaussian TTA for 2D vision-language models), extended here to the 3D multimodal setting.
  • The Bayesian model averaging paradigm is generalizable to other multimodal fusion scenarios.

Rating

  • Novelty: ⭐⭐⭐⭐ Bayesian framework replaces cache-based methods with theoretical elegance
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four backbone models × multiple benchmarks × diverse settings
  • Writing Quality: ⭐⭐⭐⭐ Derivations are clear and formulations are rigorous
  • Value: ⭐⭐⭐⭐ A practical plug-and-play TTA solution