Streaming Federated Learning with Markovian Data

Conference: NeurIPS 2025 arXiv: 2503.18807 Code: None Area: Optimization Keywords: Federated Learning, Markovian Data Streams, Stochastic Gradient Descent, Non-convex Optimization, Client Drift

TL;DR

This work provides the first rigorous analysis of streaming federated learning with Markovian data under non-convex objectives, establishing that Minibatch SGD, Local SGD, and Local SGD-M all achieve sample complexity inversely proportional to the number of clients (linear speedup), and that Local SGD-M matches the communication complexity of Minibatch SGD without requiring heterogeneity assumptions.

Background & Motivation

Classical federated learning (FL) assumes clients hold pre-collected datasets; however, in applications such as health monitoring, environmental sensing, and robotic control, clients continuously collect data, forming data streams. These streams are characterized by:

Markovian dependence: Data generated by physical or biological systems naturally exhibits temporal correlations rather than i.i.d. sampling.

Memory constraints: Clients can only store a limited number of current samples and cannot control the sampling process.

Non-stationarity: The initial distribution of a Markov chain differs from its stationary distribution, introducing bias in early samples.

Core Problem: Can FL still support collaborative learning under Markovian data streams? Specifically, can the per-client sample complexity decrease linearly with the number of clients \(M\)?

Limitations of Prior Work:

  • Markovian SGD has been analyzed in the centralized setting, but federated work focuses on federated reinforcement learning (FRL) and is limited to convex objectives or linear function approximation.
  • Studies on streaming data in FL assume either adversarial environments or i.i.d. data streams.

Goal: To rigorously characterize the impact of Markovian sampling on FL convergence rates under non-convex smooth objectives, and to demonstrate that momentum can eliminate client drift without heterogeneity assumptions.

Method

Overall Architecture

The paper considers \(M\) clients collaboratively minimizing \(\min_{w} \frac{1}{M}\sum_{m=1}^M F_m(w)\), where each \(F_m(w) = \mathbb{E}_{x \sim \pi_m}[f_m(w;x)]\). Each data stream \(X_m = (x_t^m)_{t \in \mathbb{N}}\) is modeled as a time-homogeneous Markov chain converging to a unique stationary distribution \(\pi_m\).
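To make this data model concrete, the following minimal sketch simulates \(M\) independent clients, each drawing samples from its own time-homogeneous finite-state Markov chain that mixes toward a stationary distribution \(\pi_m\). This is an illustration only, not the authors' code: the finite state space, random transition matrices, and biased initial state are assumptions chosen for demonstration.

```python
# Illustrative sketch of the data model: each client m draws from its own
# time-homogeneous Markov chain X_m with stationary distribution pi_m.
import numpy as np

rng = np.random.default_rng(0)

def make_chain(n_states: int):
    """Build a random row-stochastic transition matrix and its stationary law."""
    P = rng.random((n_states, n_states))
    P /= P.sum(axis=1, keepdims=True)
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    return P, pi / pi.sum()

def stream(P, x0: int):
    """Generate the Markovian data stream x_0, x_1, ... for one client."""
    x = x0
    while True:
        yield x
        x = rng.choice(len(P), p=P[x])

# M independent clients, each started away from stationarity (x0 = 0),
# mirroring the non-stationarity discussed above.
M, n_states = 4, 5
chains = [make_chain(n_states) for _ in range(M)]
streams = [stream(P, x0=0) for P, _ in chains]

# Consecutive samples are correlated, so a minibatch of K samples per round
# carries less information than K i.i.d. draws; the analysis quantifies this
# through the spectral gap nu_ps and the constant C_inf.
K = 8
batch = [[next(s) for _ in range(K)] for s in streams]
print(batch)
```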

Key Designs

  1. Minibatch SGD Analysis (problem class \(\mathcal{F}_1\)):

    • Requires only global smoothness (Assumption 4.1) and bounded gradient noise (Assumption 4.3).
    • Convergence upper bound: \(\mathbb{E}[\|\nabla F(\hat{w}_T)\|^2] \leq \mathcal{O}\left(\frac{\Delta_0}{\gamma T} + \frac{C_\infty \sigma^2}{\nu_{ps} MK}\right)\)
    • The second term reflects the cost of Markovian data: gradient noise is amplified by the inverse of the spectral gap, \(1/\nu_{ps}\), and cannot be eliminated by reducing the learning rate.
    • Sample complexity \(K = \mathcal{O}(C_\infty \sigma^2 / (\nu_{ps} M\epsilon^2))\): proportional to \(1/M\), i.e., linear speedup ✓
  2. Local SGD Analysis (problem class \(\mathcal{F}_3\), requires heterogeneity assumptions):

    • Additionally requires per-sample smoothness (Assumption 4.2) and bounded gradient divergence (BGD, Assumption 4.4).
    • Introduces a client drift term \(\frac{LK\eta(\theta^2+\sigma^2)}{\delta^2}\), controllable via a small learning rate.
    • Communication complexity: \(T = \mathcal{O}\left(\frac{L\Delta_0}{\epsilon^2}(\delta^2 \vee \frac{\theta^2+\sigma^2}{\delta^2\epsilon^2})\right)\)
    • Optimal communication complexity is attained only under low heterogeneity and low noise conditions.
  3. Local SGD with Momentum (Local SGD-M):

    • Core update: \(v_t^{(m,k)} = \beta \nabla f_m(w_t^{(m,k)}; x_t^{(m,k)}) + (1-\beta) v_t\)
    • Forms a convex combination of the fresh gradient estimate with the previous round's aggregated update \(v_t\); a minimal runnable sketch follows this list.
    • Key advantage: Requires no BGD heterogeneity assumption and operates on problem class \(\mathcal{F}_2\).
    • Convergence upper bound: \(\mathbb{E}[\|\nabla F(\hat{w}_T)\|^2] \leq \mathcal{O}\left(\frac{L\Delta_0}{\beta T} + \frac{C_\infty \sigma^2}{\nu_{ps} MK}\right)\)
    • Communication complexity \(T = \mathcal{O}(L\Delta_0/(\beta\epsilon^2))\): matches the i.i.d. lower bound up to a constant factor \(1/\beta\).
  4. Relaxed Markovian Assumptions:

    • Does not require the commonly used uniform geometric ergodicity (exponentially fast convergence to stationarity); instead requires only absolute continuity of the transition kernel with respect to the stationary measure (Assumption 3.2).
    • Non-stationarity is handled via a simple measure-change argument rather than relying on fast mixing.
    • Results are applicable to a broader class of physical and biological systems.
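As referenced in the Local SGD-M item above, here is a minimal sketch of one communication round. It follows the displayed update \(v_t^{(m,k)} = \beta \nabla f_m(w_t^{(m,k)}; x_t^{(m,k)}) + (1-\beta) v_t\) and aggregates by simple averaging; the gradient-oracle interface and the choice to average the clients' final momentum buffers into \(v_{t+1}\) are assumptions of this sketch, not details taken from the paper.

```python
# Sketch of one Local SGD-M round (assumptions noted in comments; not the
# authors' reference implementation).
import numpy as np

def local_sgd_m_round(w, v, client_grads, eta=0.1, beta=0.9, K=10):
    """One communication round of Local SGD-M.

    w, v         : global model and aggregated momentum from round t
    client_grads : list of callables g(w) -> stochastic gradient of f_m at w,
                   fed by that client's Markovian stream (assumed interface)
    """
    local_ws, local_vs = [], []
    for g in client_grads:                    # each client m, in parallel
        w_mk, v_mk = w.copy(), v.copy()
        for _ in range(K):                    # K local steps on the stream
            # v_t^{(m,k)} = beta * grad f_m(w; x) + (1 - beta) * v_t
            # Note: the momentum anchors to the round-level v_t, which is
            # what constrains client drift without a BGD assumption.
            v_mk = beta * g(w_mk) + (1.0 - beta) * v
            w_mk = w_mk - eta * v_mk
        local_ws.append(w_mk)
        local_vs.append(v_mk)
    # Server aggregation via simple averaging; averaging the final momentum
    # buffers to form v_{t+1} is an assumption of this sketch.
    return np.mean(local_ws, axis=0), np.mean(local_vs, axis=0)

# Toy usage: quadratic per-client objectives with additive noise standing in
# for the Markovian gradient noise (illustrative only).
rng = np.random.default_rng(1)
targets = [rng.normal(size=3) for _ in range(4)]
grads = [lambda w, a=a: (w - a) + 0.1 * rng.normal(size=3) for a in targets]
w, v = np.zeros(3), np.zeros(3)
for t in range(50):
    w, v = local_sgd_m_round(w, v, grads)
print(w)  # approaches the average of the targets, the global minimizer
```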

Loss & Training

  • Non-convex smooth loss functions bounded below globally.
  • Server aggregation via simple averaging.

Key Experimental Results

Main Results

| Method | K=10 | K=50 | K=100 | Note |
|---|---|---|---|---|
| Minibatch SGD | Continuous improvement | Continuous improvement | Continuous improvement | Gradient norm decreases monotonically with K |
| Local SGD-M | Continuous improvement | Continuous improvement | Continuous improvement | Trend consistent with Minibatch SGD |
| Local SGD | Improvement | Marginal improvement | Nearly unchanged | Heterogeneity becomes the bottleneck |
| SCAFFOLD | Improvement | Nearly unchanged | Nearly unchanged | Limited advantage under Markovian data |

Experiments use the Beijing Multi-Site Air Quality dataset (12 stations, 2013–2017) for PM2.5 concentration prediction.

Ablation Study (Effect of Number of Clients)

| Method | 60 clients | 120 clients | 240 clients | Note |
|---|---|---|---|---|
| Minibatch SGD | Higher gradient norm | Moderate | Lowest | Linear speedup verified |
| Local SGD | Same trend | Same trend | Same trend | Collaboration is beneficial |
| Local SGD-M | Same trend | Same trend | Same trend | Unaffected by heterogeneity |

Key Findings

  • More clients lead to lower gradient norms across all three methods, confirming that collaboration is beneficial under Markovian data.
  • Local SGD saturates as K increases (limited by heterogeneity), whereas Minibatch SGD and Local SGD-M continue to benefit.
  • SCAFFOLD, a classical algorithm for eliminating heterogeneity under i.i.d. data, performs poorly under Markovian data.
  • Local SGD-M eliminates client drift via momentum rather than control variates, offering a simpler and more effective alternative.

Highlights & Insights

  • Theoretical clarity: Three nested problem classes \(\mathcal{F}_3 \subset \mathcal{F}_2 \subset \mathcal{F}_1\) are defined, each matching the assumption set required by one of the three algorithms.
  • A new role for momentum: Momentum serves not only as an acceleration mechanism but also as a means of eliminating client drift: by forming a convex combination with the previous round's global direction, it naturally constrains the deviation of local updates.
  • Practical significance of relaxed assumptions: The absence of a uniform geometric ergodicity requirement (which is difficult to verify in practice) makes the results applicable to a broader range of physical and biological systems.

Limitations & Future Work

  • Only independent clients are considered (\(X_m\) and \(X_{m'}\) are independent); spatially correlated data streams are not addressed.
  • Sample complexity under Markovian data remains higher than in the i.i.d. setting, inflated by the factor \(C_\infty / \nu_{ps}\).
  • Partial participation scenarios are not considered.
  • The combined effect of momentum and control variates under Markovian data is not analyzed.
  • vs. SCAFFLSA (Wang et al.): SCAFFLSA eliminates heterogeneity via control variates but requires larger per-round sample counts and heterogeneity assumptions; Local SGD-M achieves the same effect more simply.
  • vs. Deng et al. (2024): Linear speedup is established for federated reinforcement learning but is restricted to linear function approximation and convex objectives; this work extends the results to the non-convex setting.
  • vs. Sun et al. (2018): Centralized Markovian SGD involves no communication, so communication complexity does not arise there.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of Markovian data streams, federated learning, and non-convex objectives constitutes a novel theoretical contribution.
  • Experimental Thoroughness: ⭐⭐⭐ Experiments are limited in scale (linear regression with non-convex regularization); deep learning experiments are absent.
  • Writing Quality: ⭐⭐⭐⭐⭐ Theoretical exposition is rigorous and clear; the relationships among assumptions and results are organized exceptionally well.
  • Value: ⭐⭐⭐⭐ Fills a gap in the theory of streaming FL and provides guidance for applications such as health monitoring and environmental sensing.