
Covariances for Free: Exploiting Mean Distributions for Training-free Federated Learning

Conference: NeurIPS 2025 arXiv: 2412.14326 Code: dipamgoswami/FedCOF Area: Optimization Keywords: federated learning, training-free, covariance estimation, pre-trained models, communication efficiency

TL;DR

This paper proposes FedCOF, which uses only the class means uploaded by clients to build an unbiased server-side estimate of the class covariance matrices. These estimates initialize a global classifier with no training and minimal communication overhead, matching or surpassing Fed3R, which requires transmitting second-order statistics.

Background & Motivation

A central challenge in federated learning (FL) is performance degradation caused by client data heterogeneity (non-iid). Pre-trained models can substantially alleviate this issue, and several recent training-free approaches have emerged:

  • FedNCM: Each client uploads only class means; the server aggregates them and directly initializes the classifier using normalized means. Communication overhead is minimal, but accuracy is limited since only first-order statistics are exploited.
  • Fed3R: Each client additionally uploads a \(d \times d\) feature matrix \(G_k\) and a \(d \times C\) label matrix \(B_k\); the server solves for the classifier via ridge regression. Accuracy is higher, but communication grows by \(d^2 K\), which is especially costly for high-dimensional features and large client populations (see the back-of-envelope estimate after this list).
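
As a rough back-of-envelope (assuming 32-bit floats and MobileNetv2's 1280-dimensional features; both are assumptions made here for illustration), Fed3R's per-client \(d \times d\) upload alone costs

\[d^2 \times 4\ \text{bytes} = 1280^2 \times 4 \approx 6.5\ \text{MB per client},\]

which across the 9,275 clients of iNat-120K totals roughly 61 GB, consistent with the communication figures reported in the experiments below. Class means, by contrast, cost only \(d\) floats per class present on a client.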

Key Challenge: Second-order statistics (covariances) significantly improve classifier quality, yet directly transmitting covariance matrices incurs prohibitive communication costs and privacy risks. The question is whether the benefits of second-order statistics can be obtained while transmitting only first-order statistics (class means).

Core Problem

  1. How to construct an unbiased estimator of the global class covariance matrix from client class means?
  2. How to use the estimated covariances to efficiently initialize the global classifier?
  3. Can the accuracy of Fed3R be matched or exceeded under the same communication budget as FedNCM?

Method

4.1 Estimating Covariances from Client Means

The mathematical foundation is as follows: for class \(c\), the class mean \(\hat{\mu}_{k,c}\) computed at client \(k\) satisfies:

\[\mathbb{E}[\hat{\mu}_{k,c}] = \mu_c, \quad \text{Var}[\hat{\mu}_{k,c}] = \frac{\Sigma_c}{n_{k,c}}\]

That is, the spread of client means around the global class mean directly reflects the underlying class covariance, scaled by \(1/n_{k,c}\); weighting each deviation term by \(n_{k,c}\) cancels this scaling, and normalizing by \(K-1\) corrects for estimating the global mean from the same data. Based on this, the paper proposes an unbiased covariance estimator:

\[\hat{\Sigma}_c = \frac{1}{K-1} \sum_{k=1}^{K} n_{k,c} (\hat{\mu}_{k,c} - \hat{\mu}_c)(\hat{\mu}_{k,c} - \hat{\mu}_c)^\top + \gamma I_d\]

where \(\hat{\mu}_{k,c}\) is the class \(c\) mean at client \(k\), \(\hat{\mu}_c\) is the global class mean, \(n_{k,c}\) is the sample count, and \(\gamma I_d\) is a shrinkage regularization term. Critically, this estimator requires only client class means and sample counts — no covariance matrices need to be transmitted.
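
To make the server-side computation concrete, here is a minimal NumPy sketch of the estimator above, assuming each client uploads its per-class mean and sample count; the function name and array layout are illustrative, not the authors' implementation:

```python
import numpy as np

def estimate_class_covariance(client_means, client_counts, gamma=0.1):
    """Unbiased covariance estimate for one class from client means (Sec. 4.1).

    client_means : (K, d) array of per-client class means mu_hat_{k,c}
    client_counts: (K,)   array of per-client sample counts n_{k,c}
    gamma        : shrinkage strength for the gamma * I_d regularizer
    """
    K, d = client_means.shape
    # Count-weighted global class mean mu_hat_c
    mu_c = client_counts @ client_means / client_counts.sum()
    centered = client_means - mu_c  # (K, d)
    # (1 / (K - 1)) * sum_k n_{k,c} (mu_kc - mu_c)(mu_kc - mu_c)^T
    sigma_c = (centered.T * client_counts) @ centered / (K - 1)
    return sigma_c + gamma * np.eye(d)
```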

4.2 Classifier Initialization Using Only the Within-Class Scatter Matrix

The paper decomposes the ridge regression matrix \(G = FF^\top\) into three components:

\[G = \underbrace{\sum_{c}(N_c-1)\hat{S}_c}_{G_{\text{within}}} + \underbrace{\sum_{c} N_c(\hat{\mu}_c - \hat{\mu}_g)(\hat{\mu}_c - \hat{\mu}_g)^\top}_{G_{\text{btw}}} + N\hat{\mu}_g\hat{\mu}_g^\top\]

Experiments reveal that the between-class scatter matrix \(G_{\text{btw}}\) is severely ill-conditioned (condition numbers on the order of \(10^7\)), whereas the within-class scatter matrix \(G_{\text{within}}\) has condition numbers of only about \(10^3\). FedCOF therefore drops \(G_{\text{btw}}\) and, substituting the estimated \(\hat{\Sigma}_c\) for the empirical scatter \(\hat{S}_c\), initializes the classifier using only within-class covariances:

\[W^* = \hat{G}^{-1} B, \quad \hat{G} = \sum_{c}(N_c - 1)\hat{\Sigma}_c + N\hat{\mu}_g\hat{\mu}_g^\top\]

This strategy parallels the use of only the within-class scatter matrix in Linear Discriminant Analysis (LDA).
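
A minimal sketch of this initialization, assuming one-hot labels so that column \(c\) of \(B\) equals \(N_c \hat{\mu}_c\) (this reconstruction of \(B\) from means, and all names below, are illustrative assumptions, not the paper's code):

```python
import numpy as np

def init_classifier(class_means, class_counts, class_covs):
    """Training-free classifier init from estimated statistics (Sec. 4.2).

    class_means : (C, d) global class means mu_hat_c
    class_counts: (C,)   total per-class sample counts N_c
    class_covs  : (C, d, d) estimated class covariances Sigma_hat_c
    """
    C, d = class_means.shape
    N = class_counts.sum()
    mu_g = class_counts @ class_means / N  # global feature mean mu_hat_g
    # G_hat = sum_c (N_c - 1) Sigma_hat_c + N mu_g mu_g^T  (no G_btw term)
    G_hat = np.einsum('c,cij->ij', class_counts - 1, class_covs)
    G_hat += N * np.outer(mu_g, mu_g)
    # With one-hot labels, column c of B is N_c * mu_hat_c
    B = (class_counts[:, None] * class_means).T  # (d, C)
    # Solve G_hat W = B rather than forming an explicit inverse
    return np.linalg.solve(G_hat, B)  # (d, C) classifier weights
```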

4.3 Multi-Round Federated Setting

In multi-round scenarios where clients participate in batches, the server accumulates means and counts from all clients seen so far to update the covariance estimate. Each client transmits its statistics only once, so the total communication overhead is identical to the single-round case.
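
A sketch of the server-side bookkeeping this implies (class and method names are illustrative):

```python
from collections import defaultdict

class ServerStats:
    """Accumulates client statistics across rounds (Sec. 4.3 sketch).

    Each client uploads its per-class means and counts exactly once;
    covariances can be re-estimated from everything seen so far.
    """
    def __init__(self):
        self.means = defaultdict(list)   # class c -> list of client means
        self.counts = defaultdict(list)  # class c -> list of client counts

    def add_client(self, client_stats):
        """client_stats: {c: (mean_vector, n_kc)} for classes on this client."""
        for c, (mu_kc, n_kc) in client_stats.items():
            self.means[c].append(mu_kc)
            self.counts[c].append(n_kc)
```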

4.4 Improvement for Few-Client Settings

When the number of clients is small, too few client means are available for reliable covariance estimation. The paper therefore proposes sampling multiple subset means per client: each client partitions its data and uploads one mean per subset. For example, with 10 clients uploading 2 subset means each, accuracy improves by roughly 2.6%.
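
A client-side sketch of this strategy; the random partitioning below is one plausible way to form subsets and is an assumption for illustration:

```python
import numpy as np

def subset_means(class_features, m=2, seed=0):
    """Return m subset means (and counts) for one class (Sec. 4.4 sketch).

    class_features: (n, d) features of one class on this client
    m             : number of subsets; each contributes one "client mean"
                    to the server-side covariance estimator
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(class_features))
    return [(class_features[part].mean(axis=0), len(part))
            for part in np.array_split(idx, m)]
```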

Key Experimental Results

Main Results (5 Datasets, 3 Pre-trained Models)

| Method | Key Advantage | Communication Cost |
| --- | --- | --- |
| FedNCM | Baseline, uses means only | \(dC'K\) (lowest) |
| Fed3R | Uses second-order statistics | \((dC' + d^2)K\) (high) |
| FedCOF | Uses estimated covariances | \(dC'K\) (same as FedNCM) |

Key results:

  • CUB200 (SqueezeNet): FedCOF 53.7% vs. Fed3R 50.4% vs. FedNCM 37.8%; FedCOF surpasses Fed3R by 3.3 points
  • Stanford Cars (ViT-B/16): FedCOF 52.5% vs. Fed3R 47.9%, a 4.6-point gain
  • iNat-120K (MobileNetv2): FedCOF 44.1% vs. Fed3R 41.5% vs. FedNCM 36.0%; communication cost drops from 61k MB to 280 MB (218× compression)
  • Compared to FedNCM, FedCOF improves accuracy on Cars by 24–26 points with identical communication overhead

Comparison with Prompt-Tuning Methods (ViT-B/32)

| Dataset | PFPT | FedCOF | FedCOF Advantage |
| --- | --- | --- | --- |
| CIFAR-100 | 75.1% (847 MB) | 75.3% (9 MB) | 94× lower communication |
| CUB200 | 38.6% (1766 MB) | 65.0% (7 MB) | 245× lower communication |
| Cars | 12.9% (1736 MB) | 50.4% (8 MB) | 37.5 points higher accuracy |

Comparison with Full Fine-Tuning (SqueezeNet)

FedCOF without any training surpasses FedAvg and FedAdam. Further fine-tuning (FedCOF + FedAdam) reaches 55.7% on CUB200, exceeding Fed3R + FedAdam at 51.2%.

Highlights & Insights

  1. Mathematical elegance: Starting from the fundamental statistical property of the variance of sample means, an unbiased covariance estimator is derived with rigorous and complete proofs.
  2. No added communication overhead: the communication budget is identical to FedNCM's, yet performance improves substantially (by up to 26 points).
  3. Insight on removing the between-class scatter matrix: Condition number analysis reveals the ill-conditioned nature of \(G_{\text{btw}}\), motivating a within-class-only covariance strategy that outperforms Fed3R's use of the full \(G\).
  4. Strong generalizability: Applicable to diverse pre-trained models (SqueezeNet, MobileNetv2, ViT); serves as a superior initialization for subsequent fine-tuning and linear probing.
  5. Large-scale validation: Effectiveness demonstrated on real-world iNat-120K (9,275 clients, 1,203 classes).

Limitations & Future Work

  1. Reliance on the iid assumption: The unbiasedness of the covariance estimator rests on the assumption that same-class samples are iid across clients; while the generalization capability of pre-trained models partially compensates for this, bias may arise in extreme non-iid settings.
  2. Sensitivity to client count: Estimation quality degrades with fewer clients; although multi-subset sampling mitigates this, it increases client-side computation.
  3. Limited privacy guarantees: Transmitting class means and counts is preferable to transmitting covariances, but is not directly compatible with secure aggregation protocols and requires additional privacy-preserving mechanisms.
  4. Shrinkage parameter tuning: The choice of \(\gamma\) varies across model feature dimensions (1 for SqueezeNet, 0.1 for MobileNetv2), and no adaptive selection strategy is provided.
A summary comparison of FedCOF against representative baselines:

| Method | Type | Transmitted Content | Communication Cost | Training Required |
| --- | --- | --- | --- | --- |
| FedAvg / FedAdam | Full fine-tuning | Model parameters | Very high | Yes |
| PFPT | Prompt tuning | Prompt parameters | High | Yes |
| FedNCM | Training-free | Class means | Very low | No |
| Fed3R | Training-free | Class means + \(G_k\) | High | No |
| CCVR | Post-calibration | Class means + covariances | Very high | Partial |
| FedCOF | Training-free | Class means | Very low | No |

The core advantage of FedCOF is that its communication overhead matches FedNCM while achieving accuracy on par with or superior to Fed3R, without any training.

The approach of estimating second-order statistics from means may generalize to other settings where covariances are needed but communication or privacy constraints are prohibitive, such as distributed continual learning or federated class-incremental learning. The condition number analysis further reveals structural issues with the \(G\) matrix in ridge regression, offering insights potentially applicable to classifier initialization beyond the federated setting. Additionally, FedCOF's use of normalized means for classifier initialization aligns with the neural collapse phenomenon, wherein classifier weights converge toward class mean directions.

Rating

  • Novelty: ⭐⭐⭐⭐ — The derivation of the unbiased covariance estimator is concise and elegant; the core idea of obtaining covariances "for free" from means is highly ingenious
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 5 datasets, 3 models, diverse baselines, and comprehensive ablation studies
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure with a well-connected chain from motivation to theory to experiments to analysis
  • Value: ⭐⭐⭐⭐ — Provides an elegant and practical solution to the real-world problem of communication efficiency in federated learning