TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models¶
Conference: AAAI 2026 | arXiv: 2511.16423 | Code: To be confirmed | Area: Multimodal VLM | Keywords: Federated Learning, CLIP Adaptation, One-Shot FL, Hierarchical Bayes, Training-Free
TL;DR¶
TOFA is a federated learning framework that adapts CLIP through hierarchical Bayesian inference of personalized visual prototype distributions, globally aligned LLM-based text augmentation, and adaptive modality fusion. It requires no training and only a single communication round, yet outperforms one-shot baselines and even some multi-round training methods across 9 datasets.
Background & Motivation¶
Background: Adapting VLMs such as CLIP in federated learning (FL) has attracted growing attention. Existing approaches primarily rely on prompt learning (PromptFL, pFedPrompt) or fine-tuning for downstream task adaptation, but require multiple rounds of client–server communication.
Limitations of Prior Work:

- High communication overhead: Multi-round interaction incurs substantial communication costs and demands long-term system stability.
- Insufficient computational resources: Many mobile clients and low-capacity servers cannot support model training.
- One-shot methods ill-suited for VLMs: Existing one-shot FL methods (e.g., FedLPA, FENS) are designed for traditional models and do not effectively leverage the multimodal information in VLMs.
- Data heterogeneity: Non-IID data distributions cause misalignment between local and global optimization objectives.
Key Challenge: How to fully exploit the multimodal information of VLMs while handling data heterogeneity under the strict constraints of zero training and a single communication round.
Goal: Design a training-free, one-shot FL framework for high-quality VLM adaptation.
Key Insight: Extract complementary information from two parallel pipelines — a visual pipeline using Bayesian inference for personalized prototypes, and a textual pipeline using LLM augmentation with global alignment for robust text representations.
Core Idea: Combine hierarchical Bayesian inference for personalized visual distribution estimation, LLM-augmented text with global alignment, and confidence-based modality fusion to achieve training-free, single-round VLM adaptation in FL.
Method¶
Overall Architecture¶
TOFA consists of three modules:

- Visual Pipeline: Each client computes local visual statistics → uploads them to the server → the server derives global class prototype distributions using an uninformative prior → the distributions are sent back to clients → each client uses the global distribution as a prior to infer a personalized local posterior → classification via Gaussian Discriminant Analysis (GDA).
- Textual Pipeline: Each client generates augmented text descriptions using a local LLM → computes importance scores for each text prompt → the server performs global alignment to select robust text prompts → the selected prompts are combined via weighted aggregation for classification.
- Adaptive Fusion: Sample-wise confidence-based fusion dynamically balances visual and textual predictions.
Key Designs¶
- Collaborative Prototype Distribution Learning (Visual Pipeline):
    - Function: Learn personalized class-specific visual prototype distributions for each client.
    - Mechanism: CLIP visual features are assumed to follow a class-specific Gaussian distribution \(\mathcal{N}(\mathbf{w}_c, \mathbf{\Sigma})\). A hierarchical Bayesian model is employed: the global posterior \(q(\theta) = \pi(\theta|D)\) is computed by aggregating local statistics from all clients; this global posterior then serves as an informative prior and, combined with local data, yields a personalized posterior \(q(\theta^k) \propto L(D^k|\theta^k)\,[L(D|\theta^k)]^{\alpha}\,\pi(\theta^k)\), where the power-prior parameter \(\alpha\) controls the influence of global information. A Normal-Inverse-Wishart conjugate prior ensures a closed-form posterior, eliminating the need for iterative optimization. GDA produces the final classification (see the first sketch after this list).
    - Design Motivation: Using only mean prototypes discards distributional variance information. The Bayesian framework naturally trades off global generalization against local personalization, with \(\alpha\) controlling this balance. The conjugate prior enables one-shot feasibility by avoiding iterative inference.
- Globally Aligned Text Augmentation (Textual Pipeline):
    - Function: Select robust and generalizable text prompts from LLM-generated class descriptions.
    - Mechanism: Each client uses an LLM to generate dataset-aware class descriptions \(\{t_c^m\}_{m=1}^M\) and computes a per-class classification confidence \(u^k(t_c^m)\) for each text prompt. The server scores each prompt against the reference prompt \(t_c^0\) using a KL-divergence-inspired importance metric \(r(t_c^m) = \frac{1}{K}\sum_{k=1}^K u^k(t_c^0)\log\frac{u^k(t_c^m)}{u^k(t_c^0)}\), where \(u^k\) measures client \(k\)'s confidence in distinguishing the target class from the others. High-scoring prompts exhibit stable performance across heterogeneous data environments (see the second sketch after this list).
    - Design Motivation: Manual templates such as "A photo of a {class}" are overly simplistic, while LLM-augmented descriptions introduce rich semantics but vary in quality. Global alignment ensures that the selected text prompts are effective across heterogeneous clients.
- Adaptive Modality Fusion:
    - Function: Fuse visual and textual predictions in a sample-wise manner.
    - Mechanism: The fusion rule is \(f_M^k(\mathbf{z}) = \eta(\mathbf{z})f_V^k(\mathbf{z}) + (1-\eta(\mathbf{z}))f_T(\mathbf{z})\), with weight \(\eta(\mathbf{z}) = \sigma(\log\frac{\max_j \text{softmax}(f_V^k(\mathbf{z}))_j}{\max_j \text{softmax}(f_T(\mathbf{z}))_j})\). Theorem 1 proves that when \(\eta\) is proportional to the difference in loss between the two modalities, the generalization error of the fused classifier is minimized. Calibrated confidence is used as a proxy for accuracy (see the third sketch after this list).
    - Design Motivation: Different samples exhibit varying reliability across modalities: for some samples the visual modality is more accurate, while for others the textual modality is superior. A fixed weighting scheme cannot adapt to this sample-level variation.
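A minimal NumPy sketch of the visual pipeline's closed-form machinery. The Normal-Inverse-Wishart update follows the standard conjugate formulas; the base hyperparameters, the power-prior bookkeeping (scaling the global evidence by \(\alpha\)), and the covariance handling are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def niw_update(mu0, kappa0, nu0, psi0, xbar, scatter, n):
    """Standard closed-form Normal-Inverse-Wishart posterior update from
    one class's sufficient statistics: mean xbar, scatter matrix, count n."""
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    nu_n = nu0 + n
    d = (xbar - mu0)[:, None]
    psi_n = psi0 + scatter + (kappa0 * n / kappa_n) * (d @ d.T)
    return mu_n, kappa_n, nu_n, psi_n

def personalized_prototype(global_stats, local_stats, alpha, dim):
    """Two-stage hierarchical update for one class on client k. The power
    prior [L(D|theta)]^alpha is realized here by scaling the global evidence
    (scatter and count) by alpha, an assumption that is exact for Gaussian
    likelihoods since their log-likelihood is linear in the statistics."""
    # Uninformative base prior (hypothetical hyperparameters).
    mu0, kappa0 = np.zeros(dim), 1e-3
    nu0, psi0 = dim + 2.0, np.eye(dim)
    # Stage 1: fold in globally aggregated statistics, down-weighted by alpha.
    g_mean, g_scatter, g_n = global_stats
    prior = niw_update(mu0, kappa0, nu0, psi0,
                       g_mean, alpha * g_scatter, alpha * g_n)
    # Stage 2: fold in the client's own statistics -> personalized posterior.
    l_mean, l_scatter, l_n = local_stats
    mu_n, kappa_n, nu_n, psi_n = niw_update(*prior, l_mean, l_scatter, l_n)
    # Posterior-mean prototype and covariance (mean of the Inverse-Wishart).
    return mu_n, psi_n / (nu_n - dim - 1)

def gda_logits(z, prototypes, sigma, class_priors):
    """Linear discriminant scores under a shared covariance; the paper
    assumes one Sigma across classes, so per-class estimates can be pooled."""
    inv = np.linalg.inv(sigma)
    return np.stack(
        [z @ inv @ w - 0.5 * w @ inv @ w + np.log(p)
         for w, p in zip(prototypes, class_priors)],
        axis=-1,
    )
```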
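A compact sketch of the server-side importance scoring for one class, assuming the confidences \(u^k(t_c^m)\) arrive as a strictly positive array with the reference prompt \(t_c^0\) in column 0 (the array layout and the top-n selection rule are assumptions):

```python
import numpy as np

def importance_scores(u):
    """u: (K, M+1) confidences for one class across K clients; column 0
    holds the reference prompt t_c^0, columns 1..M the LLM prompts.
    Returns r(t_c^m) for m = 1..M, averaged over clients."""
    base = u[:, :1]                       # u^k(t_c^0), shape (K, 1)
    log_ratio = np.log(u[:, 1:] / base)   # log u^k(t_c^m) / u^k(t_c^0)
    return np.mean(base * log_ratio, axis=0)

def select_prompts(u, top_n=5):
    """Keep the highest-scoring prompts for the weighted aggregation
    (top-n is a convenient choice here, not the paper's stated rule)."""
    return np.argsort(importance_scores(u))[::-1][:top_n] + 1  # prompt indices
```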
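The fusion rule translates directly into code; whether it mixes logits or probabilities, and how the confidences are calibrated, are assumptions in this minimal sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse(logits_v, logits_t):
    """Sample-wise fusion f_M = eta * f_V + (1 - eta) * f_T for (N, C) scores.
    Since sigmoid(log(a/b)) = a / (a + b), eta is simply the visual
    top-confidence normalized against the textual one."""
    conf_v = softmax(logits_v).max(axis=-1)   # max_j softmax(f_V)_j
    conf_t = softmax(logits_t).max(axis=-1)   # max_j softmax(f_T)_j
    eta = conf_v / (conf_v + conf_t)          # = sigmoid(log(conf_v / conf_t))
    return eta[:, None] * logits_v + (1.0 - eta[:, None]) * logits_t
```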
Loss & Training¶
TOFA is entirely training-free:

- The visual pipeline transmits only sufficient statistics (mean, scatter matrix, sample count).
- The textual pipeline transmits only importance scores.
- Single communication round: one client→server→client exchange completes the entire process.
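To make the communication cost concrete, here is a sketch of the per-class visual payload a client would upload in the single round; field and function names are hypothetical, but the contents follow the statistics listed above.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ClassStats:
    """Per-class upload: O(d^2) floats regardless of local dataset size."""
    mean: np.ndarray      # class mean of CLIP visual features, shape (d,)
    scatter: np.ndarray   # scatter matrix around the mean, shape (d, d)
    count: int            # number of local samples of this class

def local_stats(features: np.ndarray) -> ClassStats:
    """Sufficient statistics from one class's CLIP features, shape (n, d)."""
    xbar = features.mean(axis=0)
    centered = features - xbar
    return ClassStats(mean=xbar, scatter=centered.T @ centered,
                      count=len(features))
```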
Key Experimental Results¶
Main Results¶
CLIP Datasets (16-shot, 10 clients, label shift):
| Method | Training-free | One-shot | OxfordPets | Flowers102 | Food101 | Caltech101 | DTD |
|---|---|---|---|---|---|---|---|
| CoOp | ✗ | ✗ | 89.18 | 69.03 | 82.54 | 90.62 | 63.97 |
| pFedPrompt | ✗ | ✗ | 91.84 | 96.46 | 92.26 | 96.54 | 77.14 |
| Zero-Shot CLIP | ✓ | ✓ | 85.77 | 66.14 | 77.31 | 86.29 | 42.32 |
| CLIP-GDA | ✓ | ✓ | 88.81 | 91.23 | 79.05 | 92.55 | 60.64 |
| FedLPA+PromptFL | ✗ | ✓ | 83.42 | 78.60 | 74.74 | 88.69 | 52.75 |
| TOFA | ✓ | ✓ | 91.23 | 95.78 | 85.49 | 94.58 | 71.68 |
CIFAR-10/100 (100 clients, Dir(0.3)):
| Method | CIFAR-10 | CIFAR-100 |
|---|---|---|
| FedAvg | 75.10 | 42.52 |
| Zero-Shot CLIP | 87.71 | 64.92 |
| CoOp | 93.11 | 74.83 |
| TOFA | 93.18 | 76.63 |
Ablation Study¶
| Configuration | Performance | Note |
|---|---|---|
| Visual only | Below full model | Lacks robust textual information |
| Textual only | Below full model | Lacks personalized visual information |
| w/o Global Alignment | Performance drop | LLM text quality is inconsistent |
| w/o Adaptive Fusion (fixed weight) | Performance drop | Cannot adapt to sample-level differences |
| Full TOFA | Best | Three modules are complementary |
Key Findings¶
- As a training-free, one-shot method, TOFA surpasses multi-round training approaches such as CoOp and PromptFL on multiple datasets.
- TOFA remains effective under extreme heterogeneity (CIFAR-100, 100 clients, Dir(0.3)), demonstrating strong robustness.
- Competitive performance on the DomainNet feature-shift scenario confirms that the method generalizes to both label shift and feature shift settings.
Highlights & Insights¶
- Elegant use of hierarchical Bayes: Using the global posterior as an informative local prior elegantly achieves personalization within a single communication round. The conjugate prior guarantees a closed-form solution, avoiding any iterative optimization — this is the key mathematical infrastructure enabling training-free adaptation.
- Global alignment of text augmentation: Rather than simply averaging text scores across clients, a KL divergence-inspired selection criterion identifies text prompts that are robust across heterogeneous environments, yielding substantially higher quality than directly using raw LLM outputs.
- Theoretically grounded sample-wise fusion: Theorem 1 connects the generalization error bound of the fused classifier to the mixing coefficient, providing principled justification for the confidence-based weighting rather than ad hoc design.
Limitations & Future Work¶
- Gaussian distribution assumption: Whether CLIP features truly follow a class-conditional Gaussian distribution remains an open question; more flexible distributional assumptions may be needed in complex scenarios such as fine-grained recognition.
- LLM consistency requirement: All clients are required to use the same LLM version for text augmentation generation, which may be difficult to guarantee in practical FL deployments.
- Classification-only applicability: The GDA-based visual pipeline restricts the method to classification tasks, precluding extension to detection or segmentation.
- Insufficient privacy analysis: Although only statistics rather than raw data are transmitted, whether class-specific means and covariance matrices could leak private information warrants deeper investigation.
Related Work & Insights¶
- vs. PromptFL/pFedPrompt: These multi-round training methods require repeated client–server communication and gradient computation. TOFA is entirely training-free and completes adaptation in a single round, achieving comparable or superior performance on most datasets.
- vs. CLIP-GDA: Both methods employ GDA, but CLIP-GDA is a purely local approach without federated aggregation. TOFA incorporates global information via hierarchical Bayes, yielding greater robustness under heterogeneous settings.
- vs. FedLPA: FedLPA is a one-shot method but requires client-side training and relies solely on the visual modality. TOFA is training-free and leverages both modalities, achieving substantially superior performance.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of hierarchical Bayes, global text alignment, and adaptive fusion represents a pioneering contribution in the FL + VLM domain.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on 9 datasets with diverse heterogeneity settings, 4 categories of baselines, and comprehensive ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are rigorous, though the high density of formulas somewhat reduces readability.
- Value: ⭐⭐⭐⭐ Provides a practical solution for resource-constrained federated VLM adaptation; the performance under training-free and one-shot constraints is impressive.