Distributed and Decentralised Training: Technical Governance Challenges in a Shifting AI Landscape
Conference: ICML 2025
arXiv: 2507.07765
Code: None
Area: AI Safety
Keywords: Distributed training, decentralised training, compute governance, low-communication algorithms, AI policy
TL;DR
This paper systematically distinguishes two emerging paradigms: distributed training (multi-data centre) and decentralised training (community-driven). It analyzes how low-communication training algorithms such as DiLoCo enable both paradigms and examines the challenges and opportunities they create for technical AI governance (compute structuring, capability proliferation, and shut-down capability).
Background & Motivation
Frontier LLM training currently relies heavily on large centralized data centres. The power required for a single training run has exceeded 100 megawatts and is projected to grow to over 5 gigawatts by 2030. This centralized setup was long viewed as a natural chokepoint for AI governance: easy to detect and concentrated in the hands of a few entities.
However, recent developments are disrupting this landscape:
- Energy bottlenecks driving distribution: Building a single hyper-scale cluster runs into permitting and energy supply bottlenecks, prompting hyper-scalers to adopt multi-data centre training (e.g., GPT-4.5 and Gemini-1.5 were already trained across data centres).
- The rise of decentralised AI: Since early 2024, startups such as Prime Intellect and Nous Research have raised approximately $145 million, aiming to train open-source models that compete with frontier models using community-contributed compute.
- Lagging policy discourse: "Distributed" and "decentralised" are frequently conflated and lack precise definitions, which hinders effective governance discussions.
The core motivation of this paper is to provide a clear conceptual framework for the AI policy community to understand the technical foundations, differences, and governance implications of these two training paradigms.
Method
Overall Architecture
The paper proposes a two-dimensional analytical framework (see Figure 1 in the paper), classifying training paradigms into four quadrants based on two dimensions: number of parties and number of compute locations:
| Quadrant | Parties | Compute Locations | Representative Cases |
|---|---|---|---|
| Centralized | Single | Single | Grok 3 (xAI) |
| Distributed | Single | Multiple | GPT-4.5 (OpenAI) |
| Decentralised | Multiple | Multiple | INTELLECT-1 |
| Hypothetical | Multiple | Single | No instances yet |
Definitions of Core Terms:
- Distributed Training: Training across multiple physically dispersed compute pools, but coordinated by a centralized entity. A more precise term would be "multi-data centre training" or "geographically distributed training".
- Decentralised Training: Utilizing community-contributed compute resources, without a centralized coordinating entity.
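The four quadrants above reduce to a two-question lookup: does one party or several control the run, and does the compute sit in one location or several? The sketch below is only an illustration of that classification; the function name and the integer inputs are assumptions introduced here, not something the paper defines.

```python
def classify_training_paradigm(num_parties: int, num_locations: int) -> str:
    """Map (parties, compute locations) to the quadrant names used in the table.

    Hypothetical helper for illustration; the paper presents the framework
    as a figure, not as code.
    """
    if num_parties < 1 or num_locations < 1:
        raise ValueError("need at least one party and one compute location")

    multiple_parties = num_parties > 1
    multiple_locations = num_locations > 1

    if not multiple_parties and not multiple_locations:
        return "Centralized"
    if not multiple_parties and multiple_locations:
        return "Distributed"
    if multiple_parties and multiple_locations:
        return "Decentralised"
    # Multiple parties sharing one location: no known instances yet.
    return "Hypothetical"


print(classify_training_paradigm(1, 1))    # single party, single site
print(classify_training_paradigm(1, 3))    # single party, several data centres
print(classify_training_paradigm(40, 40))  # many contributors, many sites
```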
Key Designs
1. Low-Communication Data-Parallel Training Algorithms
Traditional data parallelism synchronizes gradients after every training batch, which incurs heavy communication overhead. Consider training over an ordinary internet connection: if computing a batch takes 4 seconds and synchronization takes 60 seconds, GPU utilization is only 4 / (4 + 60) = 6.25%.
DiLoCo (Distributed Low Communication) and its successors rest on the following ideas:
- Dual-optimizer architecture: A local (inner) optimizer, AdamW in the original DiLoCo, handles local updates; a global (outer) optimizer, Nesterov momentum in the original DiLoCo, updates the shared parameters after each periodic synchronization.
- 500x reduction in synchronization frequency: Synchronization drops from every step to once every 500 steps.
- Effect on utilization: In the same internet example, synchronizing once every 500 steps (about 33 minutes of computation) raises GPU utilization from 6.25% to over 97%.
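The utilization figures quoted above follow from simple arithmetic. The sketch below reproduces them under two simplifying assumptions that are not stated in the text: synchronization never overlaps with computation, and one synchronization costs 60 seconds regardless of how many steps preceded it. The function name is illustrative.

```python
def gpu_utilization(step_seconds: float, sync_seconds: float, steps_per_sync: int) -> float:
    """Fraction of wall-clock time spent computing rather than synchronizing.

    Assumes communication blocks computation (no overlap).
    """
    compute_seconds = step_seconds * steps_per_sync
    return compute_seconds / (compute_seconds + sync_seconds)


every_step = gpu_utilization(step_seconds=4.0, sync_seconds=60.0, steps_per_sync=1)
every_500_steps = gpu_utilization(step_seconds=4.0, sync_seconds=60.0, steps_per_sync=500)

print(f"sync every step:      {every_step:.2%}")       # 6.25%
print(f"sync every 500 steps: {every_500_steps:.2%}")  # 97.09%
print(f"compute between syncs: {4.0 * 500 / 60:.1f} minutes")  # 33.3 minutes
```

Methods that overlap communication with computation, such as Streaming DiLoCo, would do better than this blocking model suggests.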
Summary of key algorithmic developments:
| Algorithm | Core Mechanism | Communication Reduction | Key Features |
|---|---|---|---|
| DiLoCo | Dual optimizers + sparse synchronization | ~500× | Robust to asynchronous and heterogeneous data |
| Async Local-SGD | Asynchronous local SGD | Significant | Adaptable to unstable connections |
| DeMo | Decoupled momentum optimization | Significant | Reduces transmitted data volume |
| Streaming DiLoCo | Overlapping communication and computation | ~500× | Close to a "free lunch" |
| Eager Updates | Eager update strategy | ~500× | Further optimizes DiLoCo |
Important additional features of these algorithms:
- Robustness to asynchronous communication: Training tolerates some replicas missing a synchronization round, which suits unstable network connections.
- Robustness to heterogeneous data distribution: Performance holds up when different workers train on data from different distributions, which is key to unlocking private data.
- Benefits that scale with model size: Charles et al. (2025) found that under certain settings, DiLoCo even outperforms traditional data-parallel training, and this advantage grows with model size.
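To make the inner/outer optimizer split concrete, here is a toy sketch in pure Python. It is not the DiLoCo implementation: the model is a single scalar, each worker's loss is a hypothetical quadratic `0.5 * (theta - target)**2`, the inner optimizer is plain SGD for brevity, and all hyperparameter values are illustrative. What it does preserve is the communication pattern: workers train for many steps without talking to each other, then exchange one averaged parameter delta that the outer optimizer treats as a gradient.

```python
from typing import List


def inner_steps(theta: float, target: float, lr: float, steps: int) -> float:
    """Run `steps` local SGD updates on the toy loss 0.5 * (theta - target)**2."""
    for _ in range(steps):
        grad = theta - target
        theta -= lr * grad
    return theta


def diloco_toy(targets: List[float], rounds: int = 50, inner: int = 500,
               inner_lr: float = 0.01, outer_lr: float = 0.7,
               momentum: float = 0.9) -> float:
    """Toy two-level optimization; each entry of `targets` is one worker's data."""
    theta = 0.0     # shared (global) parameter
    velocity = 0.0  # momentum buffer of the outer optimizer
    for _ in range(rounds):
        # Each worker starts from the global parameter and trains locally,
        # with no communication during the inner steps.
        local = [inner_steps(theta, t, inner_lr, inner) for t in targets]
        # One communication per round: average the parameter deltas.
        outer_grad = sum(theta - w for w in local) / len(local)
        # Outer update with Nesterov-style momentum.
        velocity = momentum * velocity + outer_grad
        theta -= outer_lr * (outer_grad + momentum * velocity)
    return theta


# Three workers with different (heterogeneous) targets; the minimizer of the
# summed loss is the mean of the targets.
result = diloco_toy([1.0, 2.0, 6.0])
print(result)
```

With 500 inner steps per round, the workers in this toy communicate 50 times instead of 25,000 times, yet the shared parameter still settles at the mean of the three targets.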
2. Peer-to-Peer (P2P) Communication Architectures
Decentralised training further eliminates centralized bottlenecks:
- No centralized synchronization node: Model synchronization is achieved through direct node-to-node communication (peer-to-peer), rather than through a central parameter server.
- Trustless participation: New nodes can join the training using cryptographic proofs without requiring centralized authorization.
- Fault tolerance and dynamic reorganization: Single GPU failures do not interrupt training; the network automatically reorganizes and accepts replacement GPUs.
- Architecture closer to blockchain: The resulting system resembles a decentralized network such as Ethereum more than a traditional data-parallel cluster.
Prime Intellect has successfully pre-trained a 10-billion-parameter model (INTELLECT-1) on GPUs spread across different continents at standard internet speeds, and Nous Research has reported a comparable decentralised pre-training run.
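Synchronizing without a central parameter server can be illustrated with pairwise gossip averaging, a standard technique in which two randomly chosen peers repeatedly replace their values with their mutual average. The sketch below is an assumption chosen for illustration; it is not the protocol used by INTELLECT-1 or Hivemind, and each node holds a single number rather than model weights.

```python
import random
from typing import List


def gossip_average(values: List[float], rounds: int, seed: int = 0) -> List[float]:
    """Pairwise gossip: in each round, two random peers average their values.

    No node ever sees the full set of values, yet every node drifts toward
    the global mean, and the sum across nodes is preserved.
    """
    rng = random.Random(seed)
    state = list(values)
    for _ in range(rounds):
        i, j = rng.sample(range(len(state)), 2)  # two distinct peers
        pair_mean = (state[i] + state[j]) / 2.0
        state[i] = pair_mean
        state[j] = pair_mean
    return state


initial = [float(k) for k in range(8)]  # eight peers, global mean 3.5
final = gossip_average(initial, rounds=500)
print(min(final), max(final))
```

Because every exchange involves only two peers, a node that drops out simply stops being sampled, which is one way to picture the fault tolerance described in the list above.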
3. Natural Fit Between Reasoning Models and Decentralisation
The paper argues that reasoning models are particularly well suited to decentralized environments, for four reasons:
- Traditional self-supervised (SSL) pre-training: The ratio of forward to backward passes is 1:1; every forward pass is followed by a weight update and synchronization.
- RL post-training: The ratio of forward to backward passes can reach 1000:1; the model explores a large number of "thinking trajectories" before performing a single weight update.
- Lowered hardware barrier: The computation and memory requirements for generating thinking trajectories are lower than those for full backpropagation; consumer-grade hardware (such as Apple M3 Ultra, 512GB RAM) can run the inference of DeepSeek-R1 (671B parameters).
- Further reduced communication demand: Only thinking trajectories and reward values need to be transmitted, avoiding the transfer of full gradients.
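A back-of-the-envelope comparison shows why sending trajectories instead of gradients cuts traffic. Every number below is a hypothetical assumption chosen for illustration (a 10-billion-parameter model, 2-byte gradient values, 1000 trajectories of 8192 tokens, 4-byte token ids and rewards); none of them comes from the paper.

```python
def gradient_bytes(num_params: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to send one full, uncompressed gradient."""
    return num_params * bytes_per_value


def trajectory_bytes(num_trajectories: int, tokens_per_trajectory: int,
                     bytes_per_token: int = 4, bytes_per_reward: int = 4) -> int:
    """Bytes needed to send sampled token ids plus one scalar reward each."""
    per_trajectory = tokens_per_trajectory * bytes_per_token + bytes_per_reward
    return num_trajectories * per_trajectory


grad = gradient_bytes(10_000_000_000)                # 20,000,000,000 bytes
traj = trajectory_bytes(1000, 8192)                  # 32,772,000 bytes
print(f"full gradient: {grad / 1e9:.1f} GB")         # 20.0 GB
print(f"trajectories:  {traj / 1e6:.1f} MB")         # 32.8 MB
print(f"ratio:         {grad / traj:.0f}x")          # 610x
```

The estimate covers only what workers upload. Workers still need to receive updated weights after each update, so the download side is not captured here.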
Loss & Training
As this is a governance analysis paper, it does not propose a new loss function. Its discussion of training strategies makes three points:
- DiLoCo uses an outer optimizer to periodically correct the global parameters, which helps keep convergence stable under sparse synchronization.
- In decentralized scenarios, each node runs its own local optimizer, and the credibility of compute contributions is checked through cryptographic verification (such as the TopLoc locality-sensitive hashing scheme).
- Post-training RL strategies such as GRPO are naturally suited for low-bandwidth environments because the generation phase is an "inference-only" operation.
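The GRPO point can be made concrete with its group-relative advantage: for a group of completions sampled from the same prompt, each reward is normalized against the group's mean and standard deviation. Workers that only generate and score completions therefore need to report scalars, not gradients. The sketch below is a minimal illustration; using the population standard deviation and the `eps` guard are assumptions made here, and implementations differ on both.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar per completion sampled for the same prompt.
    `eps` avoids division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored 1.0 (correct) or 0.0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Completions scoring above the group average get positive advantages and the rest get negative ones, so no separately trained value model has to be shipped between nodes.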
Key Experimental Results
Main Results
This is a position paper and does not contain traditional experiments. The core empirical evidence is sourced from prior work:
| Training Paradigm | Model | Scale | Communication Conditions | Conclusion |
|---|---|---|---|---|
| Decentralised Pre-training | INTELLECT-1 | 10B | Cross-continent / Internet | Feasibility proven, but performance has not yet caught up with models of the same scale |
| Distributed Training | GPT-4.5 | Undisclosed | Multi-data centre | Already adopted by frontier models |
| Distributed Training | Gemini-1.5 | Undisclosed | Multi-data centre | Already adopted by frontier models |
| Precedent: Distributed Computing | Folding@Home | 280,000 GPUs | Internet | Peak compute exceeded \(10^{18}\) FLOPS |
Ablation Study
The paper reports no ablations of its own. The table below summarizes its illustrative figures on how synchronization frequency affects GPU utilization:
| Synchronization Frequency | Connection Type | GPU Utilization | Note |
|---|---|---|---|
| Every 1 step (Traditional) | High-speed interconnect / Single cluster | ~80% | Traditional data parallelism, communication overhead ~20% |
| Every 1 step | Internet | 6.25% | Cross-geographic training completely infeasible |
| Every 500 steps (DiLoCo) | Internet | >97% | Low-communication algorithms make cross-geographic training feasible |
Key Findings
- Scaling effects of low-communication algorithms: The performance of DiLoCo and subsequent methods improves as model size increases, suggesting they can keep pace with the scaling paradigm.
- Explosive growth of the decentralised startup ecosystem: Since early 2024, at least 6 startups have raised approximately $145 million, aiming to train o3-grade models.
- INTELLECT-1 synchronization interval: Synchronization occurs every 38 minutes and can be deliberately extended to obfuscate communication patterns.
- Post-training is better suited for decentralisation: The high forward-to-backward propagation ratio of RL post-training naturally reduces communication demands.
Highlights & Insights
- Valuable conceptual clarification: Clearly distinguishing distributed from decentralised training lays a foundation for policy discussions. These two terms were previously conflated in policy literature, which left governance measures poorly targeted.
- Balanced double-edged sword perspective: The paper analyzes governance challenges (compute structuring to evade regulation, capability proliferation, and loss of shut-down capability) while acknowledging the benefits of decentralisation (privacy-preserving training that unlocks more data, and reduced concentration of power).
- Distinctive analysis of reasoning models: It explains why the RL post-training of reasoning models aligns naturally with decentralized training, an analysis that is rare in existing literature.
- "Marginal risk" framework: Adopting the marginal risk analysis framework from Kapoor et al., it argues that if decentralized AI lags significantly behind the frontier and defensive technologies are sufficient, government intervention may not be necessary.
Limitations & Future Work
- Lack of quantitative analysis: The paper lacks a concrete estimation of the total compute capacity that decentralized training pools can aggregate, relying only on a rough analogy to Folding@Home.
- Superficial safety discussion: There is a lack of discussion on how safety alignment is executed in decentralized training (e.g., decentralized collection of human feedback in RLHF).
- Limited technical depth: The analysis of algorithms like DiLoCo remains at an overview level, without addressing technical details such as theoretical guarantees of convergence or specific communication compression schemes.
- Lack of a game-theoretic perspective: The strategic interaction between hyper-scalers and decentralized communities is not analyzed—would the former take actions to suppress the latter?
- Insufficient international governance dimension: Issues such as jurisdictional conflicts in cross-border decentralized training and international coordination mechanisms receive little attention.
- Limited empirical cases: INTELLECT-1 and INTELLECT-2 are currently the only case studies of decentralized large models, and their performance remains questionable, leaving the generalizability of the conclusions to be validated.
Related Work & Insights
- DiLoCo series (Douillard et al., 2023; 2025): The algorithmic foundation of low-communication training, enabling cross-geographic training.
- Sastry et al., 2024: A theoretical framework for compute governance, upon which this paper discusses the impact of new paradigms on governance assumptions.
- Kapoor et al., 2024: The marginal risk framework, providing a methodological tool to evaluate the risks and benefits of decentralized AI.
- Seferis & Fist, 2025: A concrete technical proposal for detecting compute structuring.
- Hivemind (Ryabinin et al., 2020): A decentralized deep learning framework in PyTorch, which is part of the technical foundation of INTELLECT-1.
- Federated Learning and Privacy Protection (Sani et al., 2024): The data privacy direction of decentralized pre-training.
Inspiration for subsequent research: The governance challenges of decentralized training may catalyze a new paradigm of "on-chain AI governance" (combining blockchain and AI safety), an interdisciplinary direction that warrants attention.
Rating
| Dimension | Score | Description |
|---|---|---|
| Novelty | ★★★★☆ | First to systematically clarify the differences between the two training paradigms for the AI policy community |
| Technical Depth | ★★★☆☆ | Overview nature, lacking algorithmic details, but policy analysis is thorough |
| Practicality | ★★★★☆ | Directly useful for policymakers and governance researchers |
| Clarity | ★★★★★ | Excellent writing, clear conceptual definitions, and rigorous argumentative structure |
| Overall Rating | ★★★★☆ | High-quality governance analysis paper, filling an important gap in AI policy |