# Streaming Federated Learning with Markovian Data
Conference: NeurIPS 2025 | arXiv: 2503.18807 | Code: None | Area: Optimization | Keywords: Federated Learning, Markovian Data Streams, Stochastic Gradient Descent, Non-convex Optimization, Client Drift
## TL;DR
This work provides the first rigorous analysis of streaming federated learning with Markovian data under non-convex objectives. It establishes that Minibatch SGD, Local SGD, and Local SGD-M all achieve per-client sample complexity inversely proportional to the number of clients (linear speedup), and that Local SGD-M matches the communication complexity of Minibatch SGD without requiring heterogeneity assumptions.
## Background & Motivation
Classical federated learning (FL) assumes clients hold pre-collected datasets; however, in applications such as health monitoring, environmental sensing, and robotic control, clients continuously collect data, forming data streams. These streams are characterized by:
- Markovian dependence: data generated by physical or biological systems naturally exhibits temporal correlations rather than i.i.d. sampling.
- Memory constraints: clients can only store a limited number of recent samples and cannot control the sampling process.
- Non-stationarity: the initial distribution of a Markov chain differs from its stationary distribution, introducing bias in early samples.
Core Problem: Can FL still support collaborative learning under Markovian data streams? Specifically, does the per-client sample complexity decrease in proportion to \(1/M\), where \(M\) is the number of clients?
Limitations of Prior Work:
- Markovian SGD has been studied in the centralized setting, but the federated literature focuses on federated reinforcement learning (FRL) and is limited to convex objectives or linear function approximation.
- Studies on streaming data in FL assume either adversarial environments or i.i.d. data streams.
Goal: To rigorously characterize the impact of Markovian sampling on FL convergence rates under non-convex smooth objectives, and to demonstrate that momentum can eliminate client drift without heterogeneity assumptions.
## Method
### Overall Architecture
The paper considers \(M\) clients collaboratively minimizing \(\min_{w} \frac{1}{M}\sum_{m=1}^M F_m(w)\), where each \(F_m(w) = \mathbb{E}_{x \sim \pi_m}[f_m(w;x)]\). Each data stream \(X_m = (x_t^m)_{t \in \mathbb{N}}\) is modeled as a time-homogeneous Markov chain converging to a unique stationary distribution \(\pi_m\).
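To ground the data model, here is a minimal illustrative sketch (not from the paper; the two-state chains and all names are assumptions) in which each client reads a correlated Markovian stream forward instead of drawing i.i.d. samples from \(\pi_m\):

```python
import numpy as np

def markov_stream(P, x0, rng):
    """Yield a time-homogeneous Markov chain with transition matrix P,
    started from x0 rather than from its stationary distribution."""
    x = x0
    while True:
        yield x
        x = rng.choice(len(P), p=P[x])

# M illustrative clients, each with its own sticky two-state chain; the
# self-loop probability controls how slowly the chain mixes (spectral gap).
M, rng = 4, np.random.default_rng(0)
streams = []
for m in range(M):
    p_stay = 0.90 + 0.02 * m   # stickier chain -> smaller spectral gap
    P = np.array([[p_stay, 1 - p_stay],
                  [1 - p_stay, p_stay]])
    streams.append(markov_stream(P, x0=0, rng=rng))

# Clients can only read their streams forward, a few samples per round;
# consecutive samples are correlated and cannot be reshuffled.
print([[next(s) for _ in range(10)] for s in streams])
```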
### Key Designs
- Minibatch SGD Analysis (problem class \(\mathcal{F}_1\)):
  - Requires only global smoothness (Assumption 4.1) and bounded gradient noise (Assumption 4.3).
  - Convergence upper bound: \(\mathbb{E}[\|\nabla F(\hat{w}_T)\|^2] \leq \mathcal{O}\left(\frac{\Delta_0}{\gamma T} + \frac{C_\infty \sigma^2}{\nu_{ps} MK}\right)\).
  - The second term reflects the cost of Markovian data: the gradient noise is amplified by \(C_\infty/\nu_{ps}\), i.e., by the inverse of the spectral gap, and cannot be eliminated by reducing the learning rate.
  - Sample complexity \(K = \mathcal{O}(C_\infty \sigma^2 / (\nu_{ps} M\epsilon^2))\) is proportional to \(1/M\): linear speedup ✓
- Local SGD Analysis (problem class \(\mathcal{F}_3\), requires heterogeneity assumptions):
  - Additionally requires per-sample smoothness (Assumption 4.2) and bounded gradient divergence (BGD, Assumption 4.4).
  - Introduces a client-drift term \(\frac{LK\eta(\theta^2+\sigma^2)}{\delta^2}\), controllable via a small learning rate.
  - Communication complexity: \(T = \mathcal{O}\left(\frac{L\Delta_0}{\epsilon^2}\left(\delta^2 \vee \frac{\theta^2+\sigma^2}{\delta^2\epsilon^2}\right)\right)\).
  - The optimal communication complexity is attained only under low heterogeneity and low noise.
- Local SGD with Momentum (Local SGD-M), operating on problem class \(\mathcal{F}_2\):
  - Core update: \(v_t^{(m,k)} = \beta \nabla f_m(w_t^{(m,k)}; x_t^{(m,k)}) + (1-\beta) v_t\), a convex combination of the local gradient estimate and the previous round's aggregated update \(v_t\) (see the sketch after this list).
  - Key advantage: requires no BGD heterogeneity assumption.
  - Convergence upper bound: \(\mathbb{E}[\|\nabla F(\hat{w}_T)\|^2] \leq \mathcal{O}\left(\frac{L\Delta_0}{\beta T} + \frac{C_\infty \sigma^2}{\nu_{ps} MK}\right)\).
  - Communication complexity \(T = \mathcal{O}(L\Delta_0/(\beta\epsilon^2))\): matches the i.i.d. lower bound up to the factor \(1/\beta\).
- Relaxed Markovian Assumptions:
  - Does not require the commonly assumed uniform geometric ergodicity (exponentially fast convergence to stationarity); instead requires only absolute continuity of the transition kernel with respect to the stationary measure (Assumption 3.2).
  - Non-stationarity is handled via a simple change-of-measure argument rather than by relying on fast mixing.
  - The results therefore apply to a broader class of physical and biological systems.
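The following minimal NumPy sketch shows one communication round of Local SGD-M as described by the core update above; all names are illustrative (the paper releases no code), and aggregating the local directions into \(v_{t+1}\) is an assumption. Setting \(\beta = 1\) recovers plain Local SGD; Minibatch SGD would instead keep \(w\) fixed within the round, average the \(K\) gradients on each client, and take a single server step.

```python
import numpy as np

def local_sgd_m_round(w, v_prev, streams, grad, K, eta, beta):
    """One round of Local SGD with momentum (illustrative sketch).

    w       : global model at round t
    v_prev  : previous round's aggregated update direction v_t
    streams : one Markovian sample iterator per client
    grad    : grad(m, w, x) -> stochastic gradient of f_m at w on sample x
    """
    local_models, local_dirs = [], []
    for m, stream in enumerate(streams):
        w_mk = w.copy()
        for _ in range(K):                  # K local steps on fresh samples
            x = next(stream)                # consecutive, correlated samples
            # Convex combination with the shared direction v_prev;
            # small beta keeps all clients close to the global direction.
            v_mk = beta * grad(m, w_mk, x) + (1 - beta) * v_prev
            w_mk = w_mk - eta * v_mk
            local_dirs.append(v_mk)
        local_models.append(w_mk)
    # Server: simple averaging of local models (as in the paper); averaging
    # the local directions to form v_{t+1} is one plausible choice.
    return np.mean(local_models, axis=0), np.mean(local_dirs, axis=0)
```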
### Loss & Training
- Non-convex smooth loss functions bounded below globally.
- Server aggregation via simple averaging.
## Key Experimental Results
### Main Results
| Method | K=10 | K=50 | K=100 | Note |
|---|---|---|---|---|
| Minibatch SGD | Continuous improvement | Continuous improvement | Continuous improvement | Gradient norm decreases monotonically with K |
| Local SGD-M | Continuous improvement | Continuous improvement | Continuous improvement | Consistent trend with Minibatch SGD |
| Local SGD | Improvement | Marginal improvement | Nearly unchanged | Heterogeneity becomes the bottleneck |
| SCAFFOLD | Improvement | Nearly unchanged | Nearly unchanged | Limited advantage under Markovian data |
Experiments use the Beijing Multi-Site Air Quality dataset (12 stations, 2013–2017) for PM2.5 concentration prediction.
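For concreteness, a plausible pipeline consistent with this setup (the exact preprocessing is not described here, so the file names and columns below are assumptions based on the public UCI release) treats each station as one client and consumes its hourly records strictly in time order:

```python
import pandas as pd

# One CSV per monitoring station; the station list is truncated for brevity.
stations = ["Aotizhongxin", "Changping"]   # ... 12 stations = 12 clients
streams = {}
for s in stations:
    df = (pd.read_csv(f"PRSA_Data_{s}.csv")           # hypothetical file name
            .sort_values(["year", "month", "day", "hour"]))
    # No shuffling: consecutive rows remain temporally correlated (Markovian).
    streams[s] = df.itertuples(index=False)
```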
### Ablation Study (Effect of Number of Clients)
| Method | 60 clients | 120 clients | 240 clients | Note |
|---|---|---|---|---|
| Minibatch SGD | Higher gradient norm | Moderate | Lowest | Linear speedup verified |
| Local SGD | Same trend | Same trend | Same trend | Collaboration is beneficial |
| Local SGD-M | Same trend | Same trend | Same trend | Unaffected by heterogeneity |
### Key Findings
- More clients lead to lower gradient norms across all three methods, confirming that collaboration is beneficial under Markovian data.
- Local SGD saturates as K increases (limited by heterogeneity), whereas Minibatch SGD and Local SGD-M continue to benefit.
- SCAFFOLD, a classical algorithm for eliminating heterogeneity under i.i.d. data, performs poorly under Markovian data.
- Local SGD-M eliminates client drift via momentum rather than control variates, offering a simpler and more effective alternative.
## Highlights & Insights
- Theoretical clarity: the three problem classes \(\mathcal{F}_3 \subset \mathcal{F}_2 \subset \mathcal{F}_1\) form a hierarchy that maps cleanly onto the assumptions required by the three algorithms.
- A new role for momentum: momentum serves not only as an acceleration mechanism but also as a means of eliminating client drift; by forming a convex combination with the previous round's global direction, it naturally constrains how far local updates can deviate (see the identity below).
- Practical significance of relaxed assumptions: dropping the uniform geometric ergodicity requirement (which is difficult to verify in practice) makes the results applicable to a broader range of physical and biological systems.
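To see the drift-control mechanism in one line (a direct consequence of the stated update rule, not an equation quoted from the paper): since every client mixes its own stochastic gradient with the same shared direction \(v_t\), the shared part cancels when comparing two clients,

\[
v_t^{(m,k)} - v_t^{(m',k)} = \beta\left(\nabla f_m(w_t^{(m,k)}; x_t^{(m,k)}) - \nabla f_{m'}(w_t^{(m',k)}; x_t^{(m',k)})\right),
\]

so a small \(\beta\) caps the disagreement between local update directions without any bound on how far \(\nabla F_m\) and \(\nabla F_{m'}\) may diverge.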
## Limitations & Future Work
- Only independent clients are considered (\(X_m\) and \(X_{m'}\) are independent for \(m \neq m'\)); spatially correlated data streams are not addressed.
- Sample complexity under Markovian data remains higher than in the i.i.d. setting, inflated by the factor \(C_\infty / \nu_{ps}\).
- Partial participation scenarios are not considered.
- The combined effect of momentum and control variates under Markovian data is not analyzed.
## Related Work & Insights
- vs. SCAFFLSA (Wang et al.): SCAFFLSA eliminates heterogeneity via control variates but requires more samples per round and heterogeneity assumptions; Local SGD-M achieves the same effect with a simpler mechanism.
- vs. Deng et al. (2024): establishes linear speedup for federated reinforcement learning, but only under linear function approximation and convex objectives; the present work covers general non-convex objectives.
- vs. Sun et al. (2018): analyzes centralized Markovian SGD, where communication complexity does not arise.
## Rating
- Novelty: ⭐⭐⭐⭐ The combination of Markovian data streams, federated learning, and non-convex objectives constitutes a novel theoretical contribution.
- Experimental Thoroughness: ⭐⭐⭐ Experiments are limited in scale (linear regression with non-convex regularization); deep learning experiments are absent.
- Writing Quality: ⭐⭐⭐⭐⭐ Theoretical exposition is rigorous and clear; the relationships among assumptions and results are organized exceptionally well.
- Value: ⭐⭐⭐⭐ Fills a gap in the theory of streaming FL and provides guidance for applications such as health monitoring and environmental sensing.