FedQS: Optimizing Gradient and Model Aggregation for Semi-Asynchronous Federated Learning¶
Conference: NeurIPS 2025 arXiv: 2510.07664 Code: GitHub Area: Optimization Keywords: Federated Learning, Semi-Asynchronous, Gradient Aggregation, Model Aggregation, Divide-and-Conquer
TL;DR¶
This paper proposes FedQS, the first framework to simultaneously optimize both gradient aggregation and model aggregation strategies in semi-asynchronous federated learning (SAFL). By partitioning clients into four categories and adaptively adjusting training strategies, FedQS achieves comprehensive improvements over baselines in accuracy, convergence speed, and stability.
Background & Motivation¶
In federated learning, the semi-asynchronous paradigm (SAFL) strikes a balance between synchronous and asynchronous training, yet faces critical challenges:
- Performance gap between the two aggregation strategies (see the sketch after this list):
  - Gradient aggregation (FedSGD): faster convergence and higher accuracy, but severe oscillation
  - Model aggregation (FedAvg): more stable, but slower convergence and lower accuracy
  - When stale updates and non-IID data coexist, the accuracy gap between the two surges to 11.52%
- Lack of theoretical understanding: existing analyses of this gap remain largely empirical
- Limitations of server-side vs. client-side approaches:
  - Server-side methods: tightly coupled with specific aggregation strategies
  - Client-side methods: lack access to global information
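To make the gap concrete, the two standard update rules being compared can be sketched as follows. This is a minimal NumPy illustration of generic FedSGD/FedAvg aggregation, not code from the paper; function names and signatures are ours:

```python
import numpy as np

def fedsgd_step(w_global, client_grads, weights, lr):
    """Gradient aggregation (FedSGD-style): the server averages client
    gradients and applies a single descent step to the global model."""
    g_agg = sum(p * g for p, g in zip(weights, client_grads))
    return w_global - lr * g_agg

def fedavg_step(client_models, weights):
    """Model aggregation (FedAvg-style): the server replaces the global
    model with a weighted average of locally trained client models."""
    return sum(p * w for p, w in zip(weights, client_models))

# Toy usage: two clients, a 3-parameter "model".
w = np.zeros(3)
grads = [np.array([1.0, 0.0, -1.0]), np.array([0.5, 0.5, 0.5])]
models = [w - 0.1 * grads[0], w - 0.1 * grads[1]]
print(fedsgd_step(w, grads, [0.5, 0.5], lr=0.1))
print(fedavg_step(models, [0.5, 0.5]))
```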
Method¶
Overall Architecture¶
FedQS comprises three modules:
- Mod① (Global Aggregation Estimation): deployed on clients to estimate the global gradient direction
- Mod② (Local Training Adaptation): deployed on clients to adjust training strategies according to client type
- Mod③ (Global Model Aggregation): deployed on the server for adaptive weighted aggregation
Key Designs¶
Mod①: Global Aggregation Estimation
- Each client stores the two most recent global models and computes a pseudo-global gradient: \(L_g(w_g^t) = w_g^t - w_g^{t-1}\)
- It then computes the local-global gradient similarity \(s_i^t\) (e.g., cosine similarity)
- Core innovation: global information is acquired from the client's perspective, decoupling the method from any specific aggregation strategy
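A minimal sketch of Mod①'s client-side computation, assuming flattened parameter vectors (function names are illustrative, not from the paper):

```python
import numpy as np

def pseudo_global_gradient(w_g_t, w_g_prev):
    """Approximate the global update direction from the two most recent
    global models the client has received: L_g(w_g^t) = w_g^t - w_g^{t-1}."""
    return w_g_t - w_g_prev

def gradient_similarity(local_grad, pseudo_grad, eps=1e-12):
    """Cosine similarity s_i^t between the client's local gradient and the
    pseudo-global gradient; per the ablation, Euclidean or Manhattan
    distance could be swapped in with similar results."""
    denom = np.linalg.norm(local_grad) * np.linalg.norm(pseudo_grad) + eps
    return float(np.dot(local_grad, pseudo_grad)) / denom
```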
Mod②: Local Training Adaptation (Divide-and-Conquer Strategy)
Clients are partitioned into four categories based on update frequency \(f_i^t\) and gradient similarity \(s_i^t\); a minimal classification sketch follows the table:
| Type | Speed | Similarity | Strategy |
|---|---|---|---|
| FBC (fast but biased) | \(f_i^t > \bar{f}^t\) | \(s_i^t < \bar{s}^t\) | Maintain learning rate; trigger feedback mechanism to increase aggregation weight |
| FUC (fast and aligned) | \(f_i^t > \bar{f}^t\) | \(s_i^t > \bar{s}^t\) | Reduce learning rate \(\eta_i^t = \eta_i^{t-1} - a\mathcal{F}\); incorporate momentum |
| SUC (slow but aligned) | \(f_i^t < \bar{f}^t\) | \(s_i^t > \bar{s}^t\) | Increase learning rate \(\eta_i^t = \eta_i^{t-1} + a\mathcal{F}\); incorporate momentum |
| SBC (slow and biased) | \(f_i^t < \bar{f}^t\) | \(s_i^t < \bar{s}^t\) | Increase learning rate; distinguish between straggling and distribution shift via validation set |
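The partition rule, assuming the per-round means \(\bar{f}^t, \bar{s}^t\) are available to the client (how ties at exact equality are broken is our choice, not specified in the summary):

```python
def classify_client(f_i, s_i, f_mean, s_mean):
    """Mod②'s divide-and-conquer partition: compare a client's update
    frequency f_i^t and gradient similarity s_i^t to the round means."""
    fast = f_i > f_mean
    aligned = s_i > s_mean
    if fast and not aligned:
        return "FBC"  # keep lr; trigger feedback to raise aggregation weight
    if fast and aligned:
        return "FUC"  # reduce lr; incorporate momentum
    if not fast and aligned:
        return "SUC"  # increase lr; incorporate momentum
    return "SBC"      # increase lr; use validation set to separate
                      # straggling from distribution shift
```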
Momentum update formula:

$$w_{i,e}^t = w_{i,e-1}^t - \eta_i^t \left[\sum_{r=1}^{e}(m_i^t)^r \nabla F_{i,e-r}(w_{i,e-r-1}^t) + \nabla F_{i,e}(w_{i,e-1}^t)\right]$$
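Note that the bracketed term is a geometric accumulation of past local gradients, so it can be maintained recursively as \(v_e = \nabla F_{i,e} + m_i^t \, v_{e-1}\), which unrolls to exactly the sum above. A sketch of one local step under that reading (our unrolling, not the paper's code):

```python
def local_momentum_step(w, grad_fn, lr, m, v):
    """One local epoch of the Mod② momentum update, using the recursive
    buffer v_e = grad_e + m * v_{e-1}, equivalent to the unrolled sum
    sum_r m^r * grad_{e-r} + grad_e from the formula above."""
    g = grad_fn(w)   # nabla F_{i,e}(w_{i,e-1}^t)
    v = g + m * v    # geometrically decayed history of local gradients
    return w - lr * v, v
```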
Mod③: Global Model Aggregation
- Adjusts aggregation weights for clients that trigger the feedback mechanism: \(p_i = \frac{\exp(\phi - \mathcal{F})}{2\phi - \mathcal{F}} \cdot \frac{1+\mathcal{G}}{2K}\)
- Performs normalized weighted aggregation
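A direct transcription of the quoted weight formula plus the normalization step; the precise meanings of \(\phi, \mathcal{F}, \mathcal{G}, K\) follow the paper, and the code below only mirrors the expression as written:

```python
import numpy as np

def feedback_weight(phi, F, G, K):
    """Raw aggregation weight p_i for a feedback-triggering client,
    transcribing p_i = exp(phi - F) / (2*phi - F) * (1 + G) / (2*K)."""
    return np.exp(phi - F) / (2 * phi - F) * (1 + G) / (2 * K)

def aggregate(models, raw_weights):
    """Normalized weighted aggregation of client models on the server."""
    p = np.asarray(raw_weights, dtype=float)
    p /= p.sum()
    return sum(pi * wi for pi, wi in zip(p, models))
```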
Loss & Training¶
Convergence guarantees (Theorems 4.2 and 4.3):
- Both FedQS-SGD and FedQS-Avg achieve exponential convergence rates
- The convergence bound consists of three terms: \(\mathcal{V}^t\) (exponential decay), \(\mathcal{U} = O(\delta^2)\) (data heterogeneity), and \(\mathcal{W} = O(G_c^2)\) (gradient variation)
- Assumptions: \(L\)-smoothness, bounded gradients, bounded heterogeneity
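Schematically, the stated bound has the following shape (our paraphrase of the three terms, not the theorems' exact statement):

```latex
\mathbb{E}\left[F(w_g^t)\right] - F^\ast
  \;\le\;
  \underbrace{\mathcal{V}^t}_{\text{exponential decay}}
  + \underbrace{\mathcal{U}}_{O(\delta^2)\ \text{(data heterogeneity)}}
  + \underbrace{\mathcal{W}}_{O(G_c^2)\ \text{(gradient variation)}}
```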
Key Experimental Results¶
Main Results¶
Test accuracy (%) on three task types:
| Method | CV (x=0.1) | CV (x=0.5) | CV (x=1) | NLP (R=200) | NLP (R=600) | RWD-Gender | RWD-Ethnicity |
|---|---|---|---|---|---|---|---|
| FedAvg | 56.05 | 73.71 | 77.86 | 47.04 | 45.52 | 77.10 | 77.25 |
| M-step | 62.17 | 80.49 | 82.46 | 49.38 | 48.12 | 78.20 | 78.01 |
| FedQS-Avg | 63.91 | 80.26 | 82.74 | 50.43 | 50.08 | 78.94 | 78.85 |
| FedSGD | 65.71 | 83.87 | 85.42 | 48.04 | 49.64 | 77.15 | 78.33 |
| WKAFL | 64.66 | 85.14 | 86.02 | 50.49 | 50.09 | 78.96 | 76.97 |
| FedQS-SGD | 68.88 | 86.11 | 86.79 | 52.22 | 52.49 | 78.74 | 79.24 |
Wall-clock time comparison (seconds):
| Method | CV (x=0.1) | NLP (R=200) | RWD-Gender |
|---|---|---|---|
| FedAvg (Sync) | 78,048 | 22,417 | 30,149 |
| FedQS-Avg (SAFL) | 32,827 | 6,023 | 5,701 |
| FedQS-SGD (SAFL) | 32,784 | 5,248 | 5,523 |
Compared to synchronous baselines, FedQS reduces wall-clock time by approximately 70% on average.
Ablation Study¶
Module ablation (average over CV tasks):
| Module | Configuration | FedQS-Avg Acc. (%) | FedQS-SGD Acc. (%) | FedQS-Avg Epochs | FedQS-SGD Epochs |
|---|---|---|---|---|---|
| Mod① | Cosine | 74.14 | 80.59 | 251 | 230 |
| Mod① | Euclidean | 75.69 | 79.55 | 244 | 232 |
| Mod① | Manhattan | 76.56 | 80.28 | 228 | 221 |
| Mod② | w/o momentum | 73.21 | 78.88 | 269 | 242 |
| Mod② | with momentum | 74.14 | 80.59 | 251 | 230 |
| Mod③ | w/o feedback | 68.35 | 78.83 | 284 | 268 |
| Mod③ | with feedback | 74.14 | 80.59 | 251 | 230 |
Key ablation findings:
- Removing momentum: average accuracy drops by 4.3%, and convergence requires about 6% more epochs
- Removing the feedback mechanism: FedQS-Avg accuracy drops sharply, by 7.81% relative (74.14 → 68.35)
- The choice of similarity function has a relatively minor impact (differences among cosine, Euclidean, and Manhattan are small)
Robustness to system configurations (accuracy %; N clients, fastest-to-slowest speed ratio):
| Scenario | FedAvg | FedQS-Avg | FedSGD | FedQS-SGD |
|---|---|---|---|---|
| N=50, 1:20 | 70.1 | 79.2 | 77.4 | 80.7 |
| N=200, 1:100 | 49.4 | 64.7 | 74.4 | 80.1 |
Under extreme heterogeneity (200 clients, speed ratio 1:100), FedQS-SGD still achieves 80.1%.
Key Findings¶
- Root cause of the gradient–model aggregation gap: stale gradients in gradient aggregation only perturb the direction and magnitude of the global update, whereas stale models in model aggregation reset the optimization trajectory; non-IID data further amplifies this gap.
- Effectiveness of the divide-and-conquer strategy: the four-category client partition covers all combinations of heterogeneity, each addressed by a targeted optimization strategy.
- Complementarity of momentum and feedback mechanism: momentum accelerates local convergence while the feedback mechanism improves global aggregation.
- Hyperparameter sensitivity: \(a\) (learning rate adjustment rate) has the largest impact; \(k\) (momentum adjustment speed) has the smallest.
Highlights & Insights¶
- First unified framework: simultaneously optimizes both gradient and model aggregation strategies rather than focusing on one alone.
- Client-side adaptivity: does not require the server to have prior knowledge of client characteristics; clients can dynamically adjust strategies in response to changing resources.
- Theoretical guarantees: provides exponential convergence proofs for both aggregation strategies.
- Negligible additional overhead: clients only require one additional similarity computation and two comparisons; communication overhead increases by only a 1-bit signal and a few floating-point values.
- Broad experimental coverage: three task types — CV (CIFAR-10), NLP (Shakespeare), and real-world data (UCI Adult).
Limitations & Future Work¶
- The model aggregation variant (FedQS-Avg) still introduces a degree of training oscillation.
- Three new hyperparameters (\(a, m_0, k\)) are introduced, increasing implementation and reproduction complexity.
- Experiments are limited to medium-scale models (ResNet-18, LSTM, FCN); applicability to large-scale models remains unverified.
- Automatic hyperparameter tuning mechanisms could be explored in future work.
Related Work & Insights¶
- WKAFL employs cosine similarity for weighted aggregation of stale gradients, inspiring Mod① of FedQS.
- FedAT adopts a hierarchical asynchronous framework but requires prior knowledge of client performance distributions.
- The gradient correction idea from SCAFFOLD is applied in FedAC; FedQS addresses the problem from an orthogonal perspective via divide-and-conquer.
- The role of momentum in federated optimization is further validated.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First to jointly optimize both aggregation strategies
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three task types + 8 baselines + ablations + hyperparameter analysis + system configuration analysis
- Practicality: ⭐⭐⭐⭐ — Low additional overhead, good scalability
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with rich figures and tables