MAR-FL: A Communication Efficient Peer-to-Peer Federated Learning System¶
Conference: NeurIPS 2025 arXiv: 2512.05234 Code: https://github.com/felix-fjm/mar-fl Area: Optimization / Federated Learning Keywords: Federated Learning, P2P Communication, Moshpit All-Reduce, Differential Privacy, Knowledge Distillation
TL;DR¶
This paper proposes MAR-FL, a system that reduces the communication complexity of P2P federated learning from \(O(N^2)\) to \(O(N \log N)\) via a Moshpit All-Reduce mechanism and dynamic group aggregation, while maintaining robustness against network jitter.
Background & Motivation¶
Background: Federated learning (FL) is transitioning from centralized to P2P architectures to eliminate central server bottlenecks and single points of failure.
Limitations of Prior Work: Existing P2P FL approaches incur high communication costs (e.g., RDFL: \(O(N^2)\)), making them difficult to scale to large deployments; they are also vulnerable to node churn.
Key Challenge: Global model aggregation requires all nodes to communicate, yet P2P settings lack a central coordinator — the key challenge is how to efficiently achieve global averaging in a fully distributed setting.
Key Insight: The paper draws inspiration from Moshpit SGD's dynamic grouping strategy, using a distributed hash table (DHT) to coordinate only metadata rather than model parameters.
Core Idea: Global aggregation is decomposed into \(\lceil\log_M N\rceil\) rounds of local group-wise aggregation, with deterministic group keys used to avoid redundant pairings.
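The round structure above can be illustrated with a small simulation. The sketch below (my own illustration, not the paper's code) indexes \(N = M^d\) peers by \(d\)-digit base-\(M\) keys; in round \(r\), peers that agree on every digit except digit \(r\) form a deterministic group of size \(M\) and average their values, so after \(d = \log_M N\) rounds every peer holds the global mean:

```python
import itertools

def moshpit_average(values, M):
    """Simulate Moshpit-style group averaging for N = M**d peers.

    Each peer is indexed by a d-digit base-M tuple (the deterministic
    group key source); in round r, peers that agree on every digit
    except digit r form a group of size M and replace their value with
    the group mean. After d rounds every peer holds the global average.
    """
    N = len(values)
    d = 0
    while M ** d < N:
        d += 1
    assert M ** d == N, "sketch assumes N is an exact power of M"
    digits = list(itertools.product(range(M), repeat=d))
    vals = list(values)
    for r in range(d):
        groups = {}
        for i, key in enumerate(digits):
            gk = key[:r] + key[r + 1:]   # group key: all digits except position r
            groups.setdefault(gk, []).append(i)
        new_vals = vals[:]
        for members in groups.values():
            avg = sum(vals[i] for i in members) / len(members)
            for i in members:
                new_vals[i] = avg
        vals = new_vals
    return vals

peers = [float(x) for x in range(9)]   # N = 9 peers, M = 3 -> 2 rounds
result = moshpit_average(peers, M=3)
print(result[0])                       # every peer ends with the global mean 4.0
```

In a real deployment each "value" is a full model parameter vector and each group runs an all-reduce internally; this scalar version only demonstrates why \(\lceil\log_M N\rceil\) rounds suffice.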
Method¶
Overall Architecture¶
Each peer performs: local Momentum-SGD update → multiple rounds of Moshpit All-Reduce (MAR) group aggregation → optional Moshpit Knowledge Distillation (MKD) → globally averaged model.
Key Designs¶
- Moshpit All-Reduce Aggregation (MAR)
    - Function: Partitions \(N\) peers into groups of size \(M\), with each peer communicating with \(M-1\) others per round.
    - Mechanism: Global averaging is achieved in \(\lceil\log_M N\rceil \approx \log N\) rounds, yielding a total communication complexity of \(O(N \log N)\).
    - Design Motivation: A single peer dropout affects only its local group and does not block the overall process.
- Distributed Coordination (DHT)
    - Function: Uses Hivemind Kademlia DHT to coordinate lightweight information only.
    - Mechanism: Model weights are never transmitted through the DHT; only barrier and group metadata are exchanged.
    - Design Motivation: Control-plane overhead is \(O(N \log N)\), which is negligible relative to model-exchange traffic.
- Moshpit Knowledge Distillation (MKD)
    - Function: Accelerates convergence and further reduces the number of required communication rounds.
    - Mechanism: Automatically selects top-\(\ell\) teachers based on minimum KL divergence, combining KL-divergence and cross-entropy losses with a smooth weight-decay schedule \(\lambda = \max(0, 1 - (t-1)/K)\).
    - Design Motivation: Reduces communication requirements by an additional 2–3×.
- Differential Privacy Adaptation
    - Function: Adapts the DP mechanism from FedAvg to a serverless P2P architecture.
    - Mechanism: Each peer performs local gradient clipping and noise addition; aggregation is performed within groups via MAR.
    - Design Motivation: DP guarantees are fully decentralized, requiring no central server.
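The DP adaptation can be sketched as a purely local step run before group aggregation. The snippet below is an illustrative DP-SGD-style privatizer, not the paper's exact implementation; the clip norm and noise multiplier are assumed hyperparameters:

```python
import random

_rng = random.Random(0)  # fixed seed only for reproducibility of this sketch

def dp_local_update(grad, clip_norm=1.0, noise_mult=1.1, rng=_rng):
    """Clip a peer's gradient to L2 norm `clip_norm`, then add Gaussian
    noise with std = noise_mult * clip_norm (DP-SGD-style calibration).
    Runs entirely on the peer; only the privatized gradient enters MAR.
    """
    norm = sum(g * g for g in grad) ** 0.5
    scale = min(1.0, clip_norm / max(norm, 1e-12))
    clipped = [g * scale for g in grad]          # bounded sensitivity
    sigma = noise_mult * clip_norm
    return [g + rng.gauss(0.0, sigma) for g in clipped]

# Each peer privatizes locally, then groups average via MAR;
# no party ever observes another peer's raw gradient.
noisy = dp_local_update([3.0, 4.0])   # original norm 5.0 is clipped to 1.0 before noise
```

Because noise is added before any communication, the guarantee holds against every other peer, which is what makes the scheme serverless.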
Loss & Training¶
Knowledge distillation loss: \(L = \lambda\tau^2 D_{KL}(\text{teacher} \| \text{student}) + (1-\lambda)\,CE(y, \text{student})\), with \(\lambda = \max(0, 1 - (t-1)/K)\) shifting weight from distillation to cross-entropy over training. Convergence analysis shows that the mixing error decays exponentially in the number of MAR rounds.
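A minimal sketch of this objective, assuming a temperature \(\tau = 2\) and \(K = 20\) (the ablation's setting; both are illustrative choices here):

```python
import math

def _softmax(z, temp=1.0):
    zs = [v / temp for v in z]
    m = max(zs)
    exps = [math.exp(v - m) for v in zs]
    s = sum(exps)
    return [e / s for e in exps]

def mkd_loss(teacher_logits, student_logits, label, t, K=20, tau=2.0):
    """L = lam * tau^2 * KL(teacher || student) + (1 - lam) * CE(label, student),
    with the linear decay lam = max(0, 1 - (t - 1) / K): early rounds lean on the
    teacher, later rounds on the hard labels."""
    lam = max(0.0, 1.0 - (t - 1) / K)
    p = _softmax(teacher_logits, tau)   # softened teacher distribution
    q = _softmax(student_logits, tau)   # softened student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    ce = -math.log(_softmax(student_logits)[label])
    return lam * tau ** 2 * kl + (1.0 - lam) * ce

loss_early = mkd_loss([2.0, 0.5, -1.0], [1.0, 0.2, -0.5], label=0, t=1)   # lam = 1: pure distillation
loss_late  = mkd_loss([2.0, 0.5, -1.0], [1.0, 0.2, -0.5], label=0, t=25)  # lam = 0: pure cross-entropy
```

Teacher selection would sit on top of this: each peer computes the KL term against each candidate teacher's outputs and keeps only the top-\(\ell\) with the smallest divergence.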
Key Experimental Results¶
Main Results¶
| Method | Communication Cost (Relative) | Model Performance | Scalability |
|---|---|---|---|
| FedAvg (Centralized) | 1.0× | 99%+ | Central bottleneck |
| RDFL (P2P, 125 nodes) | 10.0× | 99%+ | \(O(N^2)\) |
| MAR-FL | 1.0× | 99%+ | \(O(N\log N)\) |
Ablation Study¶
| Configuration | Communication Rounds to 50% Accuracy | Relative Improvement |
|---|---|---|
| MAR-FL (base) | 8 rounds | Baseline |
| + MKD (K=20) | 3.5 rounds | 2.3× |
| + Teacher selection | 3.2 rounds | 2.5× |
| + Progressive decay | 3.1 rounds | 2.6× |
Key Findings¶
- Communication efficiency improves by 10× over RDFL (\(O(N\log N)\) vs. \(O(N^2)\)).
- Performance degrades by less than 5% at 50% participation rate, demonstrating robustness to partial participation.
- DP-MAR-FL exhibits the same privacy-utility tradeoff curve as centralized FedAvg with DP.
- MKD further reduces the communication required for convergence by 56%.
Highlights & Insights¶
- Communication Complexity Breakthrough: The reduction from \(O(N^2)\) to \(O(N\log N)\) represents a significant advance for P2P FL, enabling scalability to thousands of nodes.
- Modular Serverless Design: Fully decentralized with no single point of failure; DHT transmits only metadata.
- Natural Extension to Privacy: DP adaptation requires no central coordination, preserving privacy guarantees end-to-end.
Limitations & Future Work¶
- A performance gap relative to centralized FedAvg remains, though the advantage over P2P baselines is clear.
- Performance degrades noticeably when participation rate falls below 50%.
- Experiments are conducted only on MNIST and 20NG; deployment over real wireless networks is not simulated.
Related Work & Insights¶
- vs. RDFL: RDFL uses a closed topology with \(O(N^2)\) communication and lacks fault tolerance; MAR-FL employs dynamic grouping with strong fault tolerance.
- vs. Moshpit SGD: This work is the first to apply dynamic grouping in the FL setting, additionally incorporating differential privacy and knowledge distillation.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Communication complexity breakthrough + knowledge distillation integration
- Experimental Thoroughness: ⭐⭐⭐⭐ Optimization experiments are thorough; real-network deployment is absent
- Writing Quality: ⭐⭐⭐⭐⭐ Clear logic and rigorous derivations
- Value: ⭐⭐⭐⭐⭐ Directly addresses the scalability challenge in P2P FL