# DeepAFL: Deep Analytic Federated Learning
Conference: ICLR 2026 · arXiv: 2603.00579 · Code: None · Area: Optimization / Federated Learning · Keywords: Federated Learning, Analytic Learning, Gradient-Free Training, Residual Blocks, Data Heterogeneity
## TL;DR
This paper proposes DeepAFL, which combines gradient-free analytic residual blocks with a layer-wise federated training protocol to obtain, for the first time, a deep analytic federated learning model with representation learning capability. The method retains exact invariance to data heterogeneity while overcoming the fundamental restriction of existing analytic approaches to single-layer linear models, and surpasses the state of the art by 5.68%–8.42% across three benchmark datasets.
## Background & Motivation
Federated Learning (FL) is the dominant distributed learning paradigm for breaking down data silos. However, conventional gradient-based FL methods (e.g., FedAvg, FedProx, SCAFFOLD) face four core challenges: (1) Data heterogeneity—distributional discrepancies across clients degrade model performance after aggregation, especially in non-IID settings; (2) Convergence—heterogeneous data causes client model divergence, potentially leading to suboptimal global solutions after aggregation; (3) Scalability—communication and computational overhead grows substantially as more clients participate; (4) Communication cost—multi-round gradient exchanges require significant bandwidth.
In recent years, Analytic Learning has emerged as a promising alternative. Its core idea is to replace iterative gradient updates with closed-form solutions, thereby fundamentally eliminating the instability of gradient-based training. Several works have introduced analytic learning into federated settings (e.g., FedAnalytic), demonstrating superior invariance to data heterogeneity: because closed-form solutions do not depend on learning rates or iterative rounds, they are unaffected by non-IID data distributions.
However, existing analytic FL methods face a fundamental bottleneck: they are restricted to training single-layer linear models (e.g., ridge regression / least-squares classifiers) on top of frozen pre-trained backbones. Without representation learning capability, these models rely entirely on the quality of pre-trained features and perform suboptimally on tasks requiring feature adaptation.
The central question addressed in this paper is: how can analytic models be endowed with deep representation learning capability while preserving invariance to data heterogeneity? The core idea draws on the success of ResNet: design gradient-free analytic residual blocks, each of which admits a closed-form solution, so that stacking them progressively enables deep representation learning.
## Method

### Overall Architecture
The overall pipeline of DeepAFL proceeds as follows: (1) all clients share a pre-trained backbone for initial feature extraction; (2) analytic residual blocks are stacked layer-by-layer on top of the backbone; (3) training at each layer is completed in a single round of local client computation followed by server aggregation (no multi-round communication required); (4) after layer-wise training is complete, a full deep model is obtained. The inputs are local data from each client, and the output is a globally applicable deep classification/regression model.
### Key Designs
- Gradient-Free Analytic Residual Block: Inspired by ResNet, each residual block takes the form \(\mathbf{h}^{(l+1)} = \mathbf{h}^{(l)} + f^{(l)}(\mathbf{h}^{(l)})\), where \(f^{(l)}\) applies a nonlinear feature mapping followed by a linear transformation. The key innovation is that the parameters of \(f^{(l)}\) are obtained via regularized least squares rather than gradient descent, yielding a closed-form solution.
Specifically, given the input feature matrix \(\mathbf{H}^{(l)}\) and target \(\mathbf{Y}\) at layer \(l\), the parameters \(\mathbf{W}^{(l)}\) of the residual mapping \(f^{(l)}\) are obtained by solving:
\(\mathbf{W}^{(l)} = \arg\min_{\mathbf{W}} \|\phi(\mathbf{H}^{(l)}) \mathbf{W} - (\mathbf{Y} - \mathbf{H}^{(l)})\|_F^2 + \lambda \|\mathbf{W}\|_F^2\)
where \(\phi(\cdot)\) denotes a nonlinear feature mapping (e.g., random features or kernel approximations) and \(\lambda\) is the regularization coefficient. This problem admits the closed-form solution: \(\mathbf{W}^{(l)} = (\phi(\mathbf{H}^{(l)})^\top \phi(\mathbf{H}^{(l)}) + \lambda \mathbf{I})^{-1} \phi(\mathbf{H}^{(l)})^\top (\mathbf{Y} - \mathbf{H}^{(l)})\).
The skip connection ensures stable information flow—even if a given layer's mapping is suboptimal, the shortcut preserves input information. Stacking multiple such blocks enables progressive feature refinement.
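Since the paper releases no code, the following is a minimal NumPy sketch of the per-block closed-form solve above. All names are illustrative, and the targets \(\mathbf{Y}\) are assumed to be embedded in the same \(d\)-dimensional space as \(\mathbf{H}^{(l)}\) so that the residual target \(\mathbf{Y} - \mathbf{H}^{(l)}\) is well-defined.

```python
import numpy as np

def solve_residual_block(H, Y, phi, lam=1e-3):
    """Closed-form ridge solve for one analytic residual block (illustrative).

    H   : (n, d) input features at layer l
    Y   : (n, d) targets, assumed embedded in the same space as H
    phi : nonlinear feature mapping, (n, d) -> (n, m)
    lam : ridge regularization coefficient lambda
    """
    Phi = phi(H)                      # (n, m) nonlinear features
    R = Y - H                         # residual target: what the block must add
    # W = (Phi^T Phi + lam*I)^{-1} Phi^T (Y - H); solve() avoids an explicit inverse
    W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ R)
    H_next = H + Phi @ W              # skip connection: h^(l+1) = h^(l) + f(h^(l))
    return W, H_next
```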
- Layer-Wise FL Protocol: Conventional FL trains the entire model, requiring many rounds of communication. DeepAFL adopts a layer-wise protocol: for layer \(l\), each client \(k\) computes locally the covariance matrix \(\mathbf{A}_k^{(l)} = \phi(\mathbf{H}_k^{(l)})^\top \phi(\mathbf{H}_k^{(l)})\) and the cross-covariance matrix \(\mathbf{B}_k^{(l)} = \phi(\mathbf{H}_k^{(l)})^\top (\mathbf{Y}_k - \mathbf{H}_k^{(l)})\), and uploads these matrices to the server.
The server simply performs summation aggregation: \(\mathbf{A}^{(l)} = \sum_k \mathbf{A}_k^{(l)}\), \(\mathbf{B}^{(l)} = \sum_k \mathbf{B}_k^{(l)}\), and computes the global solution: \(\mathbf{W}^{(l)} = (\mathbf{A}^{(l)} + \lambda \mathbf{I})^{-1} \mathbf{B}^{(l)}\).
This protocol offers three key advantages:

- Invariance to data heterogeneity: Due to the associativity of matrix summation, the aggregated result is mathematically equivalent to centralized training regardless of how data is distributed across clients.
- Single-round communication per layer: Each layer requires only one communication round (upload matrices → server computes → broadcast parameters), with no iteration needed.
- Privacy-friendly: Only aggregated statistics, rather than raw data or gradients, are transmitted.
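A minimal sketch of this protocol (illustrative names, `np.tanh` as a placeholder nonlinearity): clients upload only their sufficient statistics, and the server sums them and solves once. The final assertion checks the invariance claim numerically: however the samples are partitioned across clients, the federated solution coincides with the centralized one.

```python
import numpy as np

def client_statistics(H_k, Y_k, phi):
    """Client k uploads only its sufficient statistics A_k, B_k (never raw data)."""
    Phi_k = phi(H_k)
    A_k = Phi_k.T @ Phi_k             # (m, m) covariance
    B_k = Phi_k.T @ (Y_k - H_k)       # (m, d) cross-covariance vs. the residual target
    return A_k, B_k

def server_aggregate(stats, lam=1e-3):
    """Server sums the statistics and solves once: W = (sum A_k + lam*I)^{-1} sum B_k."""
    A = sum(A_k for A_k, _ in stats)
    B = sum(B_k for _, B_k in stats)
    return np.linalg.solve(A + lam * np.eye(A.shape[0]), B)

# The federated solution is identical to the centralized one, regardless of
# how the samples are partitioned across clients.
rng = np.random.default_rng(0)
phi = np.tanh                                # placeholder nonlinearity
H, Y = rng.normal(size=(200, 16)), rng.normal(size=(200, 16))
splits = np.array_split(np.arange(200), 3)   # any (non-IID) partition gives the same W
W_fed = server_aggregate([client_statistics(H[s], Y[s], phi) for s in splits])
Phi = phi(H)
W_cen = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(16), Phi.T @ (Y - H))
assert np.allclose(W_fed, W_cen)
```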
- Feature Mapping Strategy: To introduce nonlinearity while preserving closed-form solutions, DeepAFL employs Random Features to approximate kernel mappings. This is a classical technique: random projections followed by nonlinear activations implicitly compute high-dimensional kernel features at controlled computational cost. Different random feature mappings can be used at each layer to increase representational diversity.
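The paper does not pin down the exact mapping, so the sketch below uses random Fourier features, one standard kernel approximation, as a plausible stand-in. The mapping is frozen after construction, which keeps the downstream ridge solve closed-form, and a different seed per layer yields the per-layer diversity mentioned above.

```python
import numpy as np

def make_random_feature_map(d_in, d_out, sigma=1.0, seed=0):
    """Random Fourier features approximating an RBF kernel (one common choice;
    the paper's exact mapping may differ). Frozen after construction."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=1.0 / sigma, size=(d_in, d_out))
    b = rng.uniform(0.0, 2.0 * np.pi, size=d_out)
    return lambda X: np.sqrt(2.0 / d_out) * np.cos(X @ P + b)

# A different seed per layer gives each block a distinct mapping.
phi_layer0 = make_random_feature_map(16, 64, seed=0)
phi_layer1 = make_random_feature_map(16, 64, seed=1)
```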
### Loss & Training
The training objective at each layer is a regularized least-squares regression:

\(\min_{\mathbf{W}} \|\phi(\mathbf{H}^{(l)}) \mathbf{W} - (\mathbf{Y} - \mathbf{H}^{(l)})\|_F^2 + \lambda \|\mathbf{W}\|_F^2\)
Since this is a convex problem, it has a unique global optimum. The training procedure is purely feed-forward—layers are solved sequentially from bottom to top, each in a single pass, with no backpropagation required. The total number of training rounds equals the number of layers (rather than the hundreds or thousands of communication rounds typical of gradient-based FL).
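Putting the pieces together, a hypothetical end-to-end training loop might look as follows (self-contained, under the same assumptions as the sketches above: random Fourier feature maps, targets embedded in the feature space). Note the one communication round per layer and the complete absence of a backward pass.

```python
import numpy as np

def train_deep_afl(client_data, d, n_layers=4, lam=1e-3, width=64):
    """Layer-wise training: one closed-form solve (= one communication round)
    per layer, no backpropagation. client_data is a list of (H_k, Y_k) pairs,
    one per client, already in d-dimensional backbone-feature space."""
    layers = []
    H_clients = [H_k for H_k, _ in client_data]
    for l in range(n_layers):
        rng = np.random.default_rng(l)                 # fresh random mapping per layer
        P = rng.normal(size=(d, width))
        b = rng.uniform(0.0, 2.0 * np.pi, size=width)
        phi = lambda X, P=P, b=b: np.sqrt(2.0 / width) * np.cos(X @ P + b)
        # Clients upload sufficient statistics; the server sums and solves once.
        A = sum(phi(H).T @ phi(H) for H in H_clients)
        B = sum(phi(H).T @ (Y - H) for H, (_, Y) in zip(H_clients, client_data))
        W = np.linalg.solve(A + lam * np.eye(width), B)
        # Each client applies the broadcast block locally before the next layer.
        H_clients = [H + phi(H) @ W for H in H_clients]
        layers.append((phi, W))
    return layers
```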
## Key Experimental Results

### Main Results
Comparison on three benchmark datasets under non-IID federated settings:
| Method | Dataset 1 | Dataset 2 | Dataset 3 | Training Paradigm |
|---|---|---|---|---|
| FedAvg | Baseline | Baseline | Baseline | Multi-round gradient |
| FedProx | ≈ FedAvg | ≈ FedAvg | ≈ FedAvg | Multi-round gradient + regularization |
| SCAFFOLD | > FedAvg | > FedAvg | > FedAvg | Multi-round gradient + variance reduction |
| FedAnalytic (single-layer) | Limited by linear model | Limited by linear model | Limited by linear model | Single-layer analytic |
| DeepAFL | SOTA (+5.68%–8.42%) | SOTA | SOTA | Deep analytic |
DeepAFL outperforms the previous state-of-the-art by 5.68%–8.42% across all three benchmark datasets.
### Ablation Study
| Configuration | Key Metric | Remark |
|---|---|---|
| 1 layer vs. multi-layer | Multi-layer significantly better | Validates the necessity of deep representation learning |
| With vs. without skip connections | With skip connections more stable | Skip connections ensure information flow |
| Varying number of layers | Diminishing returns | Improvement plateaus after 3–5 layers |
| IID vs. non-IID | Negligible performance gap | Confirms invariance to data heterogeneity |
| Varying number of clients | Stable performance | Good scalability |
### Key Findings
- Depth + Analytic = Win-Win: DeepAFL is the first to demonstrate that analytic learning can be made "deep," and that depth yields substantial performance gains (surpassing both single-layer analytic methods and multi-round gradient-based methods).
- Heterogeneity invariance validated both theoretically and empirically: Regardless of how data is partitioned in a non-IID manner, DeepAFL produces results consistent with centralized training—a property that gradient-based FL cannot achieve.
- Highly communication-efficient: Each layer requires only one communication round; the total number of rounds equals the number of layers (typically 3–5), far fewer than the hundreds of rounds required by gradient-based methods.
- No hyperparameter tuning burden: There are no learning rates, momentum terms, or similar hyperparameters to tune; the regularization coefficient \(\lambda\) is the only hyperparameter that needs to be set.
## Highlights & Insights
- Breaking the perception that "analytic learning = shallow models": Through the design of analytic residual blocks, this work demonstrates that gradient-free methods can construct deep networks—a methodological breakthrough.
- Elegant transfer of the ResNet paradigm: The most successful architectural design in deep learning (residual connections) is seamlessly transferred to the analytic learning framework, exemplifying cross-paradigm methodological synthesis.
- A paradigm shift for federated learning: Rather than mitigating the central challenge of data heterogeneity in FL through various heuristics, DeepAFL fundamentally eliminates its influence—a qualitative rather than incremental improvement.
- Minimalist algorithmic design: The entire method involves only matrix multiplication, inversion, and summation, making it straightforward to implement and theoretically transparent.
- Complete theoretical guarantees: Invariance to data heterogeneity is established with rigorous mathematical proofs, not merely empirical observations.
## Limitations & Future Work
- Dependence on pre-trained backbone quality: Although DeepAFL introduces representation learning capability, it still operates on top of frozen pre-trained features. Poor backbone quality is difficult to compensate for with deeper analytic blocks.
- Computational bottleneck of matrix inversion: Each layer requires inverting a \(d \times d\) matrix (where \(d\) is the feature dimension), which becomes non-trivial when feature dimensions are large (e.g., 1024-dimensional features from ViT-Large).
- Limitations of random features: While random feature approximations of kernel mappings are efficient, the resulting representations may still lag behind the hierarchical features learned by end-to-end deep networks.
- Restricted task types: Validation is currently limited to classification tasks. Applicability to generative tasks (e.g., federated LLM training) remains unexplored.
- Privacy risks from transmitted matrices: Although only aggregated statistics rather than raw data are transmitted, covariance matrices may still leak statistical properties of client data, necessitating further differential privacy analysis.
- Potential future directions: Integration with differential privacy; end-to-end analytic feature learning without freezing the backbone; more efficient matrix computation methods (e.g., the Woodbury identity).
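On the last point, the saving is easy to illustrate: by the push-through form of the Woodbury identity, the \(m \times m\) solve can be traded for an \(n \times n\) solve whenever a client or layer sees fewer samples than features. A quick numerical check (not from the paper):

```python
import numpy as np

# Push-through identity (a Woodbury corollary):
#   (Phi^T Phi + lam*I_m)^{-1} Phi^T  ==  Phi^T (Phi Phi^T + lam*I_n)^{-1}
# so when n << m one can invert an n x n matrix instead of an m x m one.
rng = np.random.default_rng(0)
n, m, lam = 50, 500, 1e-3
Phi, R = rng.normal(size=(n, m)), rng.normal(size=(n, 8))
W_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ R)  # m x m solve
W_dual = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), R)    # n x n solve
assert np.allclose(W_primal, W_dual)
```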
## Related Work & Insights
- FedAvg (McMahan et al., 2017): The foundational FL algorithm, aggregating client models through multi-round averaging. DeepAFL replaces multi-round approximate averaging with single-round exact summation.
- Analytic federated learning (e.g., FedCR, ACIL-FL): Direct predecessors of DeepAFL, but constrained to single-layer linear models. DeepAFL's residual block design overcomes this fundamental limitation.
- Extreme Learning Machines (ELM): A classical approach combining random features with least-squares solutions, which can be viewed as a special case of DeepAFL's single-layer setting.
- Deep Unfolding: The idea of unrolling iterative optimization steps layer-by-layer shares conceptual similarity with DeepAFL's layer-wise solving approach.
- Insights: Analytic learning, as an alternative paradigm to gradient-based learning, exhibits unique advantages in federated settings that place high demands on convergence stability. Future work could explore analytic learning in other distributed and decentralized scenarios.
## Rating
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐