
FLUX: Efficient Descriptor-Driven Clustered Federated Learning under Arbitrary Distribution Shifts

Conference: NeurIPS 2025 arXiv: 2511.22305 Code: To be confirmed Area: AI Safety Keywords: Clustered Federated Learning, Descriptor, Distribution Shift, Test-Time Adaptation, DBSCAN, Wasserstein Distance

TL;DR

Flux extracts compact distribution descriptors on the client side (mean and covariance of the marginal \(P(X)\), plus per-class conditional means and covariances), performs unsupervised clustering on the server via adaptive DBSCAN to automatically determine both the number of clusters and the group assignments, trains cluster-specific models, and at test time matches unlabeled new clients to the best-fitting model using feature-only descriptors. It is the first method to simultaneously handle four types of distribution shift while keeping communication overhead comparable to FedAvg.

Background & Motivation

Background: Federated learning (FL) enables privacy-preserving collaborative training across multiple parties, but conventional approaches assume IID client data. Clustered federated learning (CFL) addresses non-IID heterogeneity by grouping clients with similar distributions and training independently per group, while personalized federated learning (PFL) tailors a model for each client.

Limitations of Prior Work: - Existing CFL methods (IFCA, FedRC, etc.) require the number of clusters \(M\) to be specified in advance, which is infeasible in real-world deployments. - Most methods handle only a single type of distribution shift (e.g., feature shift only or label shift only) and cannot accommodate multiple co-occurring shifts. - At test time, new clients that did not participate in training cannot be assigned a model — PFL methods suffer severe performance degradation on unseen clients. - Methods such as FedDrift incur computational overhead more than 300× that of FedAvg, making them unscalable.

Key Challenge: Real-world FL deployments involve unknown shift types, an unknown number of clusters, and unlabeled clients at test time — yet existing methods assume at least one of these conditions is known.

Goal: Design a CFL framework that requires no prior knowledge during either training or testing, uniformly handling four types of distribution shifts (\(P(X)\) feature shift, \(P(Y)\) label shift, \(P(Y|X)\) concept shift, \(P(X|Y)\) concept shift), while maintaining computational and communication efficiency comparable to FedAvg.

Key Insight: Rather than clustering on model parameters or loss values — signals with limited discriminative power for distribution shifts — the method operates directly on statistical characteristics of data distributions, extracting compact distribution descriptors as the basis for clustering.

Core Idea: Approximate the 2-Wasserstein distance using moment statistics (mean + covariance) of client data to construct descriptors, apply adaptive DBSCAN for automatic clustering, and decompose the CFL problem into three independently optimizable subproblems: descriptor extraction, unsupervised clustering, and local classification.
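As a concrete reference point for the moment-based approximation, the 2-Wasserstein distance between two Gaussians has a closed form in their means and covariances (the Bures metric on covariances). A minimal numpy sketch, assuming the descriptor moments are treated as Gaussian parameters; `gaussian_w2` and `psd_sqrt` are illustrative names, not from the paper:

```python
import numpy as np

def psd_sqrt(m):
    """Square root of a symmetric positive-semidefinite matrix
    via eigendecomposition (eigenvalues clipped at zero)."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def gaussian_w2(mu1, cov1, mu2, cov2):
    """Closed-form 2-Wasserstein distance between N(mu1, cov1) and N(mu2, cov2):
    W2^2 = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1^{1/2} C2 C1^{1/2})^{1/2})."""
    mean_term = float(np.sum((mu1 - mu2) ** 2))
    s = psd_sqrt(cov1)
    bures = float(np.trace(cov1 + cov2 - 2.0 * psd_sqrt(s @ cov2 @ s)))
    return np.sqrt(max(mean_term + bures, 0.0))  # clamp tiny negative round-off

mu, cov = np.zeros(3), np.eye(3)
print(gaussian_w2(mu, cov, mu, cov))        # 0.0
print(gaussian_w2(mu, cov, mu + 2.0, cov))  # sqrt(12) ≈ 3.464
```

The paper's Lipschitz-equivalence result then bounds how far Euclidean distances between the concatenated descriptors can drift from this quantity.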

Method

Overall Architecture

Flux formulates CFL as a probabilistic graphical model (PGM), decomposing the joint distribution into three independently optimizable components:

  1. Local Classifier \(P(Y|X; \theta)\): each client independently trains a local model.
  2. Descriptor Extractor \(P(Y,X|D; \psi)\): maps high-dimensional data to compact descriptors.
  3. Unsupervised Clustering \(P(D|C; \lambda)\): groups clients based on descriptors.

Training pipeline: local client training → descriptor extraction and upload to server → server-side clustering → intra-cluster FedAvg aggregation → repeat until convergence. Test pipeline: new client extracts a feature-only descriptor → matches against cluster centroids → retrieves the nearest cluster's specialist model.
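The client-side descriptor step of this pipeline can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes features have already been encoded and PCA-reduced to dimension \(l\), stores diagonal covariances rather than full matrices for compactness, and `extract_descriptor` is a hypothetical name:

```python
import numpy as np

def extract_descriptor(z, y, num_classes, with_labels=True):
    """Flux-style distribution descriptor from PCA-reduced features
    z (n x l) and labels y (n,): marginal moments of P(X), plus
    per-class conditional moments when labels are available.
    At test time, with_labels=False yields the label-free sub-vector."""
    parts = [z.mean(axis=0), z.var(axis=0)]           # mu_x, diag(Sigma_x)
    if with_labels:
        for u in range(num_classes):
            zu = z[y == u]
            if zu.shape[0] == 0:                      # class absent on this client
                parts += [np.zeros(z.shape[1])] * 2
            else:
                parts += [zu.mean(axis=0), zu.var(axis=0)]
    return np.concatenate(parts)                      # length 2(U+1)l when labeled
```

For example, a client with \(l=4\) reduced features and \(U=2\) classes produces a 24-dimensional descriptor; the server would average the members' label-free sub-vectors to form cluster centroids for test-time matching.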

Key Designs

  1. Distribution Descriptor Extraction

    • Function: Compresses each client's private data into a compact distributional representation to serve as the basis for clustering.
    • Mechanism: The joint distribution is factored as \(P(X,Y)=P(Y|X)P(X)\) and each factor is encoded separately. A shared encoder \(f_e\) maps raw data to a latent space, and a client-invariant dimensionality reduction operator \(\xi\) (shared PCA, \(l=10\)) compresses the representation to low dimension. The following statistics are computed from the compressed representation: (a) marginal \(P(X)\) mean \(\mu_x\) and covariance \(\Sigma_x\); (b) per-class conditional \(P(X|Y=u)\) mean \(\mu_u\) and covariance \(\Sigma_u\). The final descriptor is \(d=[\mu_x, \Sigma_x, \mu_1, \Sigma_1, \ldots, \mu_U, \Sigma_U] \in \mathbb{R}^{2(U+1)l}\). This descriptor is proven to be Lipschitz-equivalent to the 2-Wasserstein distance (approximation error below 1.1 on MNIST), with a communication ratio \(L/p \leq 3.5\times10^{-3}\) — virtually zero additional communication overhead.
    • Design Motivation: Parameter-based clustering is susceptible to permutation invariance and over-parameterization, leading to erroneous groupings; loss-based methods cannot distinguish clients with identical loss values but different distributions; directly using distributional statistics precisely captures all four shift types. The design also satisfies label-independence requirements: at test time, only \(d'=[\mu_x, \Sigma_x]\) is needed for matching.
  2. Adaptive Density-Based Clustering

    • Function: Automatically determines the number of clusters and assigns clients to groups on the server side.
    • Mechanism: Extends the DBSCAN algorithm — the \(\varepsilon\) parameter is estimated via elbow detection on the sorted second-nearest-neighbor distance curve, calibrated with a dataset-specific scaling factor, and noise points are reassigned as singleton clusters to ensure full client coverage. Clustering complexity is \(O(L \cdot \log(L))\), far below FedAvg's aggregation cost of \(O(N_{\text{client}} \cdot \theta)\).
    • Design Motivation: Eliminates the need to preset the number of clusters \(K\) — a core assumption of existing CFL methods (IFCA, FedEM, FedRC) that is infeasible in practice. As a density-based method, DBSCAN naturally determines the cluster count automatically and is insensitive to cluster shape.
  3. Label-Free Test-Time Adaptation

    • Function: Enables new clients that did not participate in training to obtain the optimal cluster model without any labels.
    • Mechanism: A new client \(q\) extracts only the feature descriptor \(d'(q)=[\mu_x, \Sigma_x]\) (the label-independent sub-vector), computes Euclidean distance to the centroid \(\gamma_m\) of each training cluster (the mean of member \(d'\) vectors within the cluster), and selects the nearest cluster model: \(c^*(q) = \arg\min_m \kappa(d'(q) - \gamma_m)\). No additional training, online adaptation, or repeated server interaction is required.
    • Design Motivation: PFL methods (pFedMe, APFL) are inherently supervised and completely fail on unseen clients (N/A in Table 1); TTA-FL methods (ATP) rely on unsupervised objectives such as entropy minimization, which can yield overconfident erroneous predictions under concept shift. Flux's descriptor matching is deterministic, single-pass, and zero-cost.
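The clustering and matching steps above can be sketched with plain numpy. This is a simplified stand-in, not the paper's calibrated estimator: `estimate_eps` locates the elbow of the sorted second-nearest-neighbor distance curve via the largest discrete second difference, and `assign_new_client` is the single-pass centroid match:

```python
import numpy as np

def estimate_eps(descriptors, k=2):
    """Heuristic eps for DBSCAN: elbow of the sorted k-th-nearest-neighbor
    distance curve (k=2 following the paper), approximated as the point of
    largest discrete second difference."""
    d = np.linalg.norm(descriptors[:, None, :] - descriptors[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, k]        # column 0 is the zero self-distance
    curve = np.sort(knn)
    if curve.size < 3:
        return float(curve[-1])
    elbow = int(np.argmax(np.diff(curve, n=2))) + 1
    return float(curve[elbow])

def assign_new_client(d_feat, centroids):
    """Label-free test-time matching: nearest cluster centroid in the
    feature-only descriptor space (a single Euclidean-distance pass)."""
    return int(np.argmin(np.linalg.norm(centroids - d_feat, axis=1)))
```

The estimated eps would then be handed to any DBSCAN implementation (e.g. scikit-learn's `DBSCAN(eps=..., min_samples=...)`), with noise points promoted to singleton clusters as described above.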

Loss & Training

The overall optimization objective decomposes into three subproblems, each optimized independently:

\[\{\theta^{(k),*}\}, \psi^*, \lambda^* = \arg\max \sum_{k=1}^{K} \left[ \log P(d^{(k)}|c^{(k)};\lambda) + \sum_{(x,y)} \log P(y,x|d^{(k)};\psi) + \sum_{(x,y)} \log P(y|x;\theta^{(k)}) \right]\]
  • First term: clustering quality (adaptive optimization of DBSCAN's \(\varepsilon\)).
  • Second term: descriptor extraction quality (optimization of PCA fitting parameters \(\psi\)).
  • Third term: standard classification loss (local cross-entropy optimization per client).

All three terms are fully decoupled and optimized independently — the theoretical foundation of Flux's efficiency. Differential privacy can be seamlessly integrated into the descriptor \(d\) without affecting accuracy (validated in Appendix C.2).

Key Experimental Results

Main Results

Test-phase performance — model assignment for new unlabeled clients:

| Dataset | FedAvg | IFCA | APFL | ATP | CFL | FeSEM | Flux | Gain |
|---|---|---|---|---|---|---|---|---|
| MNIST | 85.6 | 78.2 | 84.7 | 85.6 | 86.1 | 82.8 | 94.0 | +7.9pp |
| FMNIST | 68.8 | 63.5 | 69.2 | 68.4 | 69.4 | 66.2 | 81.2 | +11.8pp |
| CIFAR-10 | 31.9 | 36.6 | 36.6 | 33.6 | 33.2 | 35.3 | 38.7 | +2.1pp |
| CIFAR-100 | 38.0 | 38.6 | 37.3 | 37.5 | 38.6 | 39.8 | 41.3 | +1.5pp |
| CheXpert (AUC) | 56.1 | 58.5 | 64.0 | N/A | 58.5 | 58.3 | 78.6 | +14.6pp |
| Office-Home | 37.1 | 29.6 | 36.7 | 37.9 | 21.0 | 25.8 | 39.2 | +1.3pp |

Ablation Study

| Ablation | Configuration | MNIST Accuracy | Difference |
|---|---|---|---|
| Descriptor matching vs. random assignment | Feature shift setting | 95.0% vs. 41.9% | +53.1pp |
| Full descriptor \(P(X)+P(Y\|X)\) | Full version | 93.86% | baseline |
| Marginal descriptor only \(P(X)\) | Conditional term removed | 90.96% | −2.9pp |
| DBSCAN clustering | Default | 94.0% | baseline |
| Replaced with K-Means (requires preset \(K\)) | Flux-prior | 95.7% | +1.7pp |
| Scalability, 100 clients | Flux vs. APFL | >84% vs. ~70% | >14pp |

Key Findings

  • Detection of all four shift types: The descriptor design decomposes \(P(X,Y)=P(Y|X)P(X)\) and extracts marginal and conditional statistics separately, enabling discrimination of all four distribution shift types — a capability absent from any existing CFL method.
  • Extreme efficiency advantage: FedDrift's training time exceeds Flux's by more than 300×; FeSEM's by more than 4×; Flux runs on par with FedAvg (difference on the order of seconds).
  • Strong performance on real-world datasets: On the CheXpert medical imaging dataset, Flux outperforms the best baseline (APFL) by 14.6pp at test time; on Office-Home, most CFL baselines degenerate to a single global model, while Flux continues to cluster effectively.
  • Flux-prior upper bound: When the true cluster count \(K\) is provided (Flux-prior), performance further improves to 95.7% on MNIST, but Flux without knowledge of \(K\) already achieves 94.0%, leaving only a small gap.

Highlights & Insights

  1. Unified handling of four shift types: Decomposing \(P(X,Y)\) into \(P(X)\) and \(P(Y|X)\) statistics simultaneously covers feature shift, label shift, \(P(Y|X)\) concept shift, and \(P(X|Y)\) concept shift — no prior framework has achieved this.
  2. Mathematical elegance of the distribution descriptor: The theoretical guarantee of Lipschitz equivalence between the descriptor and the 2-Wasserstein distance ensures that distances in descriptor space directly reflect distributional differences, providing principled justification for clustering outcomes.
  3. Independent optimization via PGM decomposition: The three subproblems (classification, descriptor extraction, clustering) are fully decoupled and each optimized independently — simplifying algorithm design while guaranteeing scalability.
  4. Minimalist test-time adaptation: New client model assignment requires a single Euclidean distance computation, with no online adaptation, no multi-round communication, and no labels — of significant practical value for real-world deployment.

Limitations & Future Work

  1. Data quantity dependency: The statistical robustness of descriptors relies on clients having sufficiently diverse training data; moment estimates may be inaccurate for clients with small datasets.
  2. Static framework: The clustering is performed once and does not accommodate temporal evolution of client distributions (concept drift) — although the authors note that the clustering procedure can be repeated, no formal mechanism is provided.
  3. Blind spot under \(P(Y|X)\) concept shift at test time: Since labels are unavailable at test time, only \(P(X)\) descriptors can be used, making it impossible to distinguish \(P(Y|X)\) concept shift (same inputs, different labels) — a theoretical ceiling of the framework.
  4. Limited gains on complex datasets: Improvements on CIFAR-10/100 are only 1.5–2.1pp, suggesting that descriptor discriminability may be insufficient for high-dimensional complex visual tasks.

Comparison with Related Methods

  • vs. IFCA/FedRC (metric/parameter-based CFL): Require a preset number of clusters; loss- or gradient-based clustering signals have limited sensitivity to the type of distribution shift. Flux's descriptors directly encode distributional characteristics, yielding stronger discriminative power.
  • vs. FedDrift (dynamic CFL): Supports distribution drift but incurs 300× the computational cost of Flux and cannot scale to 100 clients. Flux trades minimal overhead for scalability.
  • vs. APFL/pFedMe (PFL): Strong performance under known associations, but completely fail on new clients (APFL degrades from >90% to ~70% at 100 clients). Flux maintains stable performance in both settings.
  • vs. ATP (TTA-FL): Entropy minimization-based unsupervised adaptation is unstable on complex datasets such as CIFAR-100 and prone to overconfident mispredictions. Flux's deterministic descriptor matching is more reliable.
  • Insights: The idea of using descriptors as an approximation of the Wasserstein distance is transferable to distribution matching scenarios such as domain adaptation and OOD detection. The PGM decomposition makes each component of the framework independently replaceable and upgradeable (e.g., substituting a stronger clustering algorithm).

Rating

  • Novelty: ⭐⭐⭐⭐ — The first descriptor-driven unified CFL framework with an elegant PGM decomposition; however, the core techniques (moment statistics + DBSCAN) are relatively mature.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Six datasets (including two real-world), ten SOTA baselines, four shift types × eight severity levels, scalability experiments, and comprehensive ablations — exceptionally thorough.
  • Writing Quality: ⭐⭐⭐⭐ — Rigorous problem formulation, clear PGM modeling, and strong theory-to-practice correspondence; the logical chain from motivation to method to experiments is complete and coherent.
  • Value: ⭐⭐⭐⭐ — Zero prior knowledge + test-time adaptation + FedAvg-level overhead directly addresses core obstacles to real-world CFL deployment (no need to know the number of clusters or the type of distribution shift).