
DeToNATION: Decoupled Torch Network-Aware Training on Interlinked Online Nodes

Conference: AAAI 2026
arXiv: 2502.06728
Code: github.com/schneiderkamplab/DeToNATION
Area: Other (Distributed Training / Systems Optimization)
Keywords: distributed training, decoupled momentum, FSDP, gradient compression, large language models

TL;DR

This paper proposes FlexDeMo — a hybrid sharding training strategy that integrates Fully Sharded Data Parallelism (FSDP) with decoupled momentum optimization. It applies FSDP sharding within nodes and synchronizes only the fast-moving momentum components across nodes, achieving loss convergence comparable to full-synchronization AdamW while substantially accelerating training.

Background & Motivation

Distributed Training Bottleneck

Training large deep neural networks requires transmitting gradients across accelerators, and network bandwidth is increasingly becoming a bottleneck. As the number of participating nodes grows and network congestion intensifies, the overhead of gradient synchronization escalates sharply.

Limitations of Prior Work

The Decoupled Momentum (DeMo) optimizer reduces communication by exchanging only the fast-moving components of gradients, but it has three critical limitations:

Models must fit on a single accelerator: DeMo is built on Distributed Data Parallelism (DDP), requiring the model and optimizer states to reside entirely within each accelerator's memory, making it unsuitable for LLMs.

all_gather bandwidth grows linearly: Communication cost scales linearly with the number of accelerators (rather than the number of nodes), leading to poor scalability (a back-of-envelope illustration follows this list).

Unclear hyperparameter selection: Hyperparameters such as chunk size, TopK, and the sign function lack systematic investigation.
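
As a rough illustration of the second limitation (assuming DeMo's all_gather runs over all \(W\) accelerators): each rank receives the compressed payload \(q_t\) from every other rank, i.e., roughly \((W-1)\,|q_t|\) values per step, so doubling the number of accelerators per node roughly doubles the inter-node traffic even though the node count is unchanged.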

Paper Goals

This work combines the memory efficiency of FSDP with the communication compression of DeMo to overcome the "large model + low bandwidth" training bottleneck. It also introduces new replication schemes and systematically analyzes key hyperparameters.

Method

Overall Architecture

FlexDeMo adopts a hybrid sharding strategy:

  • Intra-node: FSDP is used to shard model and optimizer states across multiple accelerators.
  • Inter-node: Only selected gradient components (rather than full gradients) are synchronized using a DeMo-style replication scheme.

Core Idea: The typically higher intra-node bandwidth supports full communication, while the lower inter-node bandwidth transmits only compressed, critical information.
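A minimal sketch of the two process groups this implies, assuming a layout of 2 nodes × 4 GPUs launched with torchrun; the group construction and variable names below are illustrative assumptions, not the repository's actual code.

```python
# Illustrative setup of the intra-node sharding group S and the inter-node
# replication group R (assumes torchrun with 4 GPUs per node; names are ours).
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank, world_size = dist.get_rank(), dist.get_world_size()
gpus_per_node = 4
num_nodes = world_size // gpus_per_node
node_id, local_rank = divmod(rank, gpus_per_node)

# Sharding group S: all ranks on one node -> high-bandwidth intra-node FSDP traffic.
# dist.new_group must be called by every rank for every group, so we build them all.
shard_groups = [dist.new_group(list(range(n * gpus_per_node, (n + 1) * gpus_per_node)))
                for n in range(num_nodes)]
shard_group = shard_groups[node_id]

# Replication group R: accelerator i of every node -> only compressed momentum
# components cross the slower inter-node links.
repl_groups = [dist.new_group(list(range(i, world_size, gpus_per_node)))
               for i in range(gpus_per_node)]
repl_group = repl_groups[local_rank]
```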

Key Designs

1. FlexDeMo Optimizer: Fusion of FSDP and Decoupled Momentum

Algorithm flow (Algorithm 1); a minimal code sketch follows the list:

  1. Gradient Reduce-Scatter: Perform reduce-scatter within the intra-node sharding group \(S\) to obtain the gradients of the local parameter shards.
  2. Local Gradient: Compute the local gradient \(\Delta_t\).
  3. Momentum Accumulation: \(m_t \leftarrow \beta m_t + \Delta_t\).
  4. Extract Fast Component: Extract the fast-moving component \(q_t\) of the momentum \(m_t\) via DCT-II (or an alternative scheme).
  5. Momentum Update: \(m_{t+1} \leftarrow m_t - q_t\), removing the portion that is about to be synchronized from the momentum.
  6. Inter-node Synchronization: Synchronize \(q_t\) within the replication group \(R\), yielding the aggregated component \(Q_t\).
  7. Parameter Update: \(\theta_{t+1} \leftarrow \theta_t - \eta Q_t\).
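
A hedged per-shard sketch of the loop above, reusing the shard_group/repl_group names from the setup sketch and an extract_fast_component hook (a chunked DCT-II version is sketched under the hyperparameter analysis below); the sign-then-average synchronization shown here is one simplified option, not necessarily the repository's exact collective.

```python
# Hedged sketch of one FlexDeMo-style step for a single local parameter shard.
# `grad` is assumed to already be the reduce-scattered gradient of this shard (step 1).
import torch
import torch.distributed as dist

def flexdemo_step(param, grad, momentum, extract_fast_component, lr, beta, repl_group):
    # Steps 2-3: accumulate the local gradient Delta_t into the decoupled momentum.
    momentum.mul_(beta).add_(grad)

    # Step 4: extract the fast-moving component q_t (DCT-II top-k, Random, Striding, ...).
    q = extract_fast_component(momentum)

    # Step 5: remove the part that is about to be synchronized from the momentum.
    momentum.sub_(q)

    # Step 6: synchronize q_t across the replication group R only.
    # Here: sign quantization followed by an averaging all-reduce, a simplification
    # of DeMo's gather-and-decode scheme.
    Q = torch.sign(q)
    dist.all_reduce(Q, op=dist.ReduceOp.SUM, group=repl_group)
    Q.div_(dist.get_world_size(group=repl_group))

    # Step 7: SGD-style parameter update with the aggregated component Q_t.
    param.data.add_(Q, alpha=-lr)
```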

Key implementation details:

  • The no_sync context manager is used to disable automatic gradient synchronization (a usage sketch follows this list).
  • Accelerator 0 of node 0 replicates only with accelerator 0 of node 1, substantially reducing cross-node communication.
  • Degenerate behavior: \(|R|=1\) reduces to plain FSDP; \(|S|=1\) reduces to DDP+DeMo.
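
A usage sketch of the no_sync detail, assuming an FSDP-wrapped model and a FlexDeMo-style optimizer; the training-step structure and names are illustrative assumptions.

```python
# Hedged sketch: disable FSDP's automatic gradient synchronization and let the
# optimizer perform its own intra-/inter-node communication (names illustrative).
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def train_step(model: FSDP, optimizer, inputs, targets, loss_fn):
    optimizer.zero_grad()
    with model.no_sync():   # gradients stay local during backward
        loss = loss_fn(model(inputs), targets)
        loss.backward()
    optimizer.step()        # optimizer handles sharding-group and replication-group comms
    return loss.detach()
```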

2. Replication Schemes: Challenging DeMo's Design Choices

Four replication schemes are introduced and compared:

| Replication Scheme | Selection Strategy | Index Transmission Required | Characteristics |
|---|---|---|---|
| DeMo | Extract fast-moving momentum components via DCT-II | Yes | Original scheme, strong theoretical foundation |
| Random | Randomly select \(n\) indices | No (shared seed) | Halves bandwidth, independent of frequency-domain transforms |
| Striding | Select indices at uniform intervals of \(n\) | No (shared seed) | Structured sampling |
| DiLoCo | Full synchronization every \(n\) steps | No | Federated-learning style |

Advantage of Random: Because no index transmission is required, Random achieves half the bandwidth of DeMo at the same compression ratio.
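
A minimal sketch of the shared-seed idea behind Random, assuming every node derives the same seed from the step counter and that the momentum buffer is contiguous; function and argument names are ours.

```python
# Hedged sketch of the Random replication scheme: a seed derived from the step
# number is identical on every node, so all nodes select the same indices and
# only the selected values (no indices) need to be transmitted.
import torch

def random_selection(momentum, fraction, step, base_seed=1234):
    flat = momentum.reshape(-1)        # assumes a contiguous momentum buffer (view, not copy)
    n = max(1, int(flat.numel() * fraction))
    gen = torch.Generator().manual_seed(base_seed + step)    # same on every node
    idx = torch.randperm(flat.numel(), generator=gen)[:n].to(flat.device)
    values = flat[idx].clone()         # these values are what gets synchronized across R
    flat[idx] = 0                      # remove the synchronized part from the momentum
    return idx, values
```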

3. Sign Function and Hyperparameter Analysis

Systematic experiments confirm the following key design choices:

  • Sign before synchronization: Quantizing gradient values into a ternary system (−1, 0, 1) substantially reduces transmitted data; experiments indicate that directional information matters more than magnitude.
  • Communication precision: fp32 outperforms fp16; full precision has a significant impact on both the DeMo and Random schemes.
  • Chunk size = 32: Validated as the default choice through experimentation (see the sketch after this list).
  • TopK = 4: Achieves the best performance in the T5 experiments.
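
A hedged sketch of the chunked DCT-II top-k extraction that the chunk-size and TopK hyperparameters refer to (chunk = 32, TopK = 4). In DeMo the kept coefficients and their indices are what get transmitted; this sketch simply reconstructs the fast component locally, and all names are illustrative.

```python
# Illustrative chunked DCT-II top-k extraction (chunk size 32, top-k 4).
import math
import torch
import torch.nn.functional as F

def dct_matrix(n, device=None, dtype=torch.float32):
    # Orthonormal DCT-II basis: row k holds sqrt(2/n)*cos(pi/n*(i+0.5)*k), row 0 scaled by 1/sqrt(2).
    i = torch.arange(n, dtype=dtype, device=device).unsqueeze(0)
    k = torch.arange(n, dtype=dtype, device=device).unsqueeze(1)
    D = torch.cos(math.pi / n * (i + 0.5) * k) * math.sqrt(2.0 / n)
    D[0] /= math.sqrt(2.0)
    return D

def extract_fast_component(momentum, chunk=32, topk=4):
    flat = momentum.reshape(-1)
    pad = (-flat.numel()) % chunk
    chunks = F.pad(flat, (0, pad)).reshape(-1, chunk)
    D = dct_matrix(chunk, device=flat.device, dtype=flat.dtype)
    coeffs = chunks @ D.T                                  # DCT-II of every chunk
    idx = coeffs.abs().topk(topk, dim=1).indices           # keep the k largest coefficients
    kept = torch.zeros_like(coeffs).scatter_(1, idx, 1.0) * coeffs
    fast = kept @ D                                        # inverse transform (D is orthonormal)
    return fast.reshape(-1)[: momentum.numel()].reshape(momentum.shape)
```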

4. Decoupled AdamW: A Decoupled Variant of AdamW

A variant of AdamW is implemented in which the first- and second-order moments (EMA and moving average of squared gradients) are not synchronized. However, experiments show that DeMo-SGD outperforms Decoupled AdamW in most settings.
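
A hedged sketch of the decoupled-AdamW idea under our own assumptions about the update order: the Adam moments live only on the local shard and are never communicated; what would cross nodes is a compressed component of the resulting update, analogous to the FlexDeMo step above. Names are illustrative.

```python
# Illustrative local AdamW update with unsynchronized moments (names are ours).
import torch

def decoupled_adamw_local(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
                          eps=1e-8, weight_decay=0.01):
    state["step"] += 1
    exp_avg, exp_avg_sq, t = state["exp_avg"], state["exp_avg_sq"], state["step"]
    exp_avg.mul_(betas[0]).add_(grad, alpha=1 - betas[0])                 # local first moment
    exp_avg_sq.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])    # local second moment
    update = (exp_avg / (1 - betas[0] ** t)) / ((exp_avg_sq / (1 - betas[1] ** t)).sqrt() + eps)
    param.data.mul_(1 - lr * weight_decay)   # decoupled weight decay
    # Only a compressed component of `update` would be replicated across nodes;
    # exp_avg and exp_avg_sq themselves never leave the local shard.
    return update
```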

Loss & Training

  • Standard task losses are used (translation: cross-entropy; classification: cross-entropy; language modeling: causal LM loss).
  • Optimizer: DeMo-SGD (SGD + momentum accumulation + decoupled replication).
  • No additional loss terms are introduced; the core contribution lies in the communication strategy rather than the objective function.

Key Experimental Results

Main Results

T5-base Translation Task (Opus Books En-Fr):

| Replication Scheme | Compression Ratio | Validation Loss Rank | Notes |
|---|---|---|---|
| Random | 1/2, 1/4 (50%, 25%) | Best | Fast convergence |
| DeMo | 1/8, 1/4 (12.5%, 25%) | Second | Higher compression but slightly inferior |
| DiLoCo | Various | Slower | Converges more slowly than DeMo/Random |
| Striding | Various | Slowest | Not competitive |

OLMo2-1B Causal Language Modeling (Dolma v1.6, 2 nodes × 4 GPUs, 10K steps):

| Method | Training Loss | Wall-clock Time | Speedup vs. Full Sync |
|---|---|---|---|
| DeMo 1/32 | Best | ~2.6× faster | Significant improvement |
| DeMo 1/16 | Near best | ~2.6× faster | Significant improvement |
| Random 1/4 | Good | ~2.6× faster | Significant improvement |
| Hybrid-FSDP + AdamW | Baseline | Baseline | — |

Ablation Study

Bandwidth-Constrained Experiments (ViT-B, 2 nodes, varying bandwidth):

| Bandwidth (Mbps) | Random SGD 1/32 | DeMo SGD 1/32 | Decoupled AdamW / Full Sync |
|---|---|---|---|
| 10 | ~3.33× faster than DeMo | Baseline | ~18× slower than Random |
| 100 | Noticeably faster | Moderate | Significantly slower |
| 1000 | Gap narrows | Moderate | Gap narrows |
| 10000 | Nearly identical | Nearly identical | Nearly identical |

Bandwidth Usage Measurement (T5-small, compression ratio 1/16):

| Method | Average Bandwidth (Mbps) | Relative Ratio |
|---|---|---|
| Full Sync | 1070 | 100% |
| DeMo | 291 | ~27% |
| Random | 152 | ~14% |

Key Findings

  1. Random uses half the bandwidth of DeMo: No index transmission for selected gradients is required.
  2. Sign is foundational: Applying the sign function before synchronization yields significant positive effects across all replication schemes.
  3. Optimal replication scheme is task/architecture-dependent: DeMo performs best on ViT and decoder architectures; Random performs best on encoder-decoder architectures.
  4. 64-node scaling experiments: DeMo does not scale well due to all_gather; Random is 64% faster than full synchronization.
  5. DeMo-SGD > Decoupled AdamW: SGD is more suitable for decoupled training in the vast majority of settings.

Highlights & Insights

  • Engineering meets theory: The work addresses DeMo's practical incompatibility with FSDP while introducing new replication schemes that challenge existing design choices.
  • Elegance of the Random scheme: Requiring no frequency-domain transforms, no index transmission, and minimal implementation complexity, it achieves competitive performance in many settings.
  • Significance of Sign: The finding that gradient direction is more important than magnitude carries profound implications for optimization theory.
  • Comprehensive hyperparameter analysis: Provides practitioners with clear guidance for tuning.

Limitations & Future Work

  1. Asynchronous communication not yet exploited: The potential for overlapping communication and computation via CUDA streams remains untapped.
  2. FSDP2/SimpleFSDP not adopted: Newer FSDP variants may further accelerate base communication.
  3. Cross-node sharding not implemented: The current approach assumes the model fits within the combined memory of all accelerators on a single node.
  4. No final model quality evaluation: Only training/validation losses are reported; downstream task performance is not assessed.
  5. Decoupled AdamW underperforms: Dedicated moment synchronization strategies may be required.

Related Work

  • ZeRO / DeepSpeed: Pioneered staged parameter sharding.
  • DiLoCo: Local optimization with periodic global averaging in a federated learning style.
  • SignSGD: Theoretical foundation for gradient sign compression.
  • GradZip / PowerSGD: Low-rank gradient compression approaches.
  • Insight: The hybrid sharding + decoupled optimization paradigm is extensible to heterogeneous clusters (e.g., CPU+GPU, cloud+edge).

Rating

  • Novelty: ⭐⭐⭐⭐ (Combining FSDP and DeMo is natural yet non-trivial)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (T5/ViT/OLMo2 across three domains + bandwidth/scalability/hyperparameter analysis)
  • Writing Quality: ⭐⭐⭐⭐ (Clear structure, rich figures and tables)
  • Value: ⭐⭐⭐⭐⭐ (Directly lowers the bandwidth barrier for large model training; highly practical)