CALF: Communication-Aware Learning Framework for Distributed Reinforcement Learning

Conference: CVPR 2025
arXiv: 2603.12543
Code: Yes (released with paper)
Area: Reinforcement Learning / Distributed Systems
Keywords: Distributed RL, network-aware training, sim-to-real, latency robustness, edge deployment

TL;DR

This paper proposes CALF, a framework that injects configurable network delay, jitter, and packet loss models into RL training. Policies trained this way degrade approximately 3-4 times less when deployed on real distributed edge devices, which shows that network conditions are an important but overlooked dimension of the sim-to-real gap.

Background & Motivation

Background: RL policies are typically trained under the assumption of zero-latency, synchronous interaction, and sim-to-real research focuses primarily on physical and visual domain randomization. Large-scale distributed RL frameworks (such as IMPALA and SEED RL) optimize communication within the training infrastructure, not network latency in the control loop.

Limitations of Prior Work: (1) When policies are deployed to edge devices (e.g., Raspberry Pi + cloud servers), Wi-Fi latency (30-80ms), jitter, and packet loss cause a 40-80% performance drop; (2) Existing latency-aware methods (like DCAC and delayed MDPs) either assume fixed latency or require modifying the RL algorithm itself; (3) There is a lack of reproducible, cross-hardware, network-aware RL training infrastructure.

Key Challenge: RL training assumes perfect communication, whereas real-world deployment faces imperfect networks. Network conditions constitute a sim-to-real gap dimension that is orthogonal to physical and visual domains.

Goal: (1) Quantify the impact of networks on distributed RL; (2) Verify whether network-aware training can narrow this gap; (3) Provide reproducible infrastructure.

Key Insight: Instead of modifying the RL algorithm, modify the training environment to transparently inject network impairments into the communication link. This is an algorithm-agnostic infrastructure solution.

Core Idea: Use NetworkShim middleware to transparently inject latency, jitter, and packet loss into agent-environment communication, allowing the training process to "experience" deployment-time network conditions.

Method

Overall Architecture

CALF implements the policy and environment as networked services communicating via message passing. It supports three progressive deployment modes: Mode 1 (local simulation, zero latency) \(\to\) Mode 2 (simulation + simulated network) \(\to\) Mode 3 (real edge hardware + real network). NetworkShim middleware sits on the communication link to transparently inject network impairments.

Key Designs

NetworkShim Network Impairment Injector:
- Function: Transparently injects latency, jitter, and packet loss into agent-environment communication.
- Mechanism: For each packet, packet loss is first sampled according to Bernoulli(\(p_{loss}\)), and then latency is sampled from \(\max(0, \mathcal{N}(\mu_{latency}, \sigma_{jitter}^2))\). It supports two types of models: synthetic models (Ethernet 2ms±0.5ms, Wi-Fi-normal 30ms±10ms/2% packet loss, Wi-Fi-degraded 80ms±40ms/10% packet loss) and replay models based on real Wi-Fi trace collection.
- Design Motivation: The transparent design makes the agent and environment network-unaware, requiring no modifications to any RL algorithm; the synthetic + trace models cover both controllable experiments and real-world scenarios.
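
The per-packet mechanism above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CALF implementation: the names `NetProfile`, `PROFILES`, and `sample_packet` are hypothetical, and the Ethernet loss rate is assumed to be zero because the summary does not state it. The latency, jitter, and loss values are the synthetic models listed above.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NetProfile:
    """Synthetic link model: mean latency, jitter (std dev), loss probability."""
    mean_ms: float
    jitter_ms: float
    p_loss: float = 0.0  # assumption: no loss unless a rate is given


# Hypothetical profile table using the values quoted in the Mechanism bullet.
PROFILES = {
    "ethernet": NetProfile(2.0, 0.5),
    "wifi_normal": NetProfile(30.0, 10.0, 0.02),
    "wifi_degraded": NetProfile(80.0, 40.0, 0.10),
}


def sample_packet(profile: NetProfile, rng: random.Random) -> Optional[float]:
    """Return the delay in ms for one packet, or None if the packet is dropped."""
    # Step 1: Bernoulli(p_loss) decides whether the packet is lost.
    if rng.random() < profile.p_loss:
        return None
    # Step 2: Gaussian latency, clipped at zero so a delay is never negative.
    return max(0.0, rng.gauss(profile.mean_ms, profile.jitter_ms))


if __name__ == "__main__":
    rng = random.Random(0)
    delays = [sample_packet(PROFILES["wifi_degraded"], rng) for _ in range(10_000)]
    delivered = [d for d in delays if d is not None]
    print(f"loss rate:  {1 - len(delivered) / len(delays):.3f}")
    print(f"mean delay: {sum(delivered) / len(delivered):.1f} ms")
```

Because the sampler only needs a profile and a random source, a trace-replay model could in principle expose the same interface and return recorded delays instead of Gaussian draws.
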
Progressive Deployment Modes:
- Function: Enables staged verification, from pure simulation to real hardware.
- Mechanism: Mode 1 has zero network overhead (~100K steps/hr on CartPole) for rapid iteration; Mode 2 injects a simulated network (~50K steps/hr) for network-aware training; Mode 3 runs the environment on a Raspberry Pi and the policy on a desktop (~20K steps/hr) for real hardware validation. The exact same policy code runs across all three modes.
- Design Motivation: Deployment parity, meaning that training and deployment share the exact same code path.
Latency-Robust State Representation:
- Function: Enables the policy to infer the current state from delayed observations.
- Mechanism: CartPole uses frame stacking (\(k=d+1\) frames for a latency of \(d\) steps) to infer velocity; MiniGrid uses an LSTM to maintain a belief state. Optionally, action history can be tracked to account for in-flight actions.
- Design Motivation: Under network latency, the policy can only observe outdated states, necessitating temporal information to infer the current state.
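
The frame-stacking rule \(k=d+1\) can be illustrated with a small buffer that delivers each observation \(d\) steps late and hands the policy the last \(k\) delivered frames. This is only a sketch: `DelayedFrameStack` and `delay_in_steps` are hypothetical names, and converting a latency in milliseconds to whole control steps by rounding up against CartPole's 20 ms time step is an assumption, not something the summary specifies.

```python
from collections import deque
from math import ceil
from typing import Deque, List, Sequence


def delay_in_steps(latency_ms: float, step_ms: float = 20.0) -> int:
    """Round a link latency up to whole control steps (assumed mapping)."""
    return ceil(latency_ms / step_ms)


class DelayedFrameStack:
    """Delivers observations d steps late and stacks the last k = d + 1 frames."""

    def __init__(self, delay_steps: int, first_obs: Sequence[float]):
        self.k = delay_steps + 1
        # Observations still "on the wire": the policy cannot see these yet.
        self.in_flight: Deque[List[float]] = deque(
            list(first_obs) for _ in range(delay_steps)
        )
        # Frames already delivered, padded with the first observation at reset.
        self.frames: Deque[List[float]] = deque(
            (list(first_obs) for _ in range(self.k)), maxlen=self.k
        )

    def push(self, obs: Sequence[float]) -> List[float]:
        """Send a fresh observation; return the stacked (delayed) policy input."""
        self.in_flight.append(list(obs))
        self.frames.append(self.in_flight.popleft())
        # Flatten oldest-to-newest so the policy sees one fixed-size vector.
        return [x for frame in self.frames for x in frame]


if __name__ == "__main__":
    d = delay_in_steps(30.0)  # 30 ms mean latency rounded up to control steps
    stack = DelayedFrameStack(d, first_obs=[0.0])
    for t in range(1, 6):
        print(t, stack.push([float(t)]))
```

With a one-dimensional observation equal to the step index, the newest frame in the stack always lags the true step by `d`, which is the staleness the policy has to compensate for.
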

Loss & Training

Standard PPO (via Stable-Baselines3) is used under three training regimes: Baseline (Mode 1, no network), Delay-Only (Mode 2, fixed 50 ms delay), and Full Net-Aware (Mode 2, latency + jitter + loss). Ten random seeds are used, and each policy is evaluated under 5 deployment conditions for 50 episodes.

Key Experimental Results

Main Results

Training Regime	Sim-Clean	Wi-Fi-Normal	Wi-Fi-Degraded	Sim-to-Real Gap
Baseline (CartPole)	500±0	285±45	120±60	76%
Delay-Only	498±3	410±25	280±40	44%
Full Net-Aware	495±5	465±15	420±30	15%

Training Regime	Sim-Clean	Wi-Fi-Normal	Wi-Fi-Degraded	Sim-to-Real Gap
Baseline (MiniGrid)	0.92±0.05	0.55±0.12	0.28±0.15	70%
Full Net-Aware	0.88±0.06	0.78±0.08	0.72±0.10	18%
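
The Sim-to-Real Gap column appears consistent with the relative drop from the Sim-Clean mean to the Wi-Fi-Degraded mean. That definition is inferred from the reported numbers rather than stated in this summary, so the helper below (`sim_to_real_gap`, a hypothetical name) should be read as a consistency check on the tables, not as the paper's formula.

```python
def sim_to_real_gap(clean: float, degraded: float) -> float:
    """Relative drop from the clean score to the degraded score, in percent."""
    return 100.0 * (clean - degraded) / clean


# (Sim-Clean mean, Wi-Fi-Degraded mean) copied from the two tables above.
RESULTS = {
    "CartPole / Baseline": (500, 120),
    "CartPole / Delay-Only": (498, 280),
    "CartPole / Full Net-Aware": (495, 420),
    "MiniGrid / Baseline": (0.92, 0.28),
    "MiniGrid / Full Net-Aware": (0.88, 0.72),
}

for name, (clean, degraded) in RESULTS.items():
    print(f"{name}: {sim_to_real_gap(clean, degraded):.0f}%")
```

Rounded to whole percentages, these values match the gap column in both tables (76, 44, 15, 70, and 18).
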

Ablation Study

Network Phenomenon	Impact on CartPole	Description
Fixed delay 50ms	Moderate degradation (~15%)	Can be compensated via frame stacking
Random jitter 10ms	Significant degradation (~25%)	Unpredictability is more destructive
2% packet loss	Severe degradation (~35%)	Missing observations lead to control failure
Jitter + packet loss	Maximum degradation (~55%)	Combined effect far outweighs individual impacts

Key Findings

Network-aware training reduces the sim-to-real gap of CartPole from 76% to 15% (roughly a 5x reduction) and that of MiniGrid from 70% to 18% (roughly 4x).
Random jitter and packet loss are more destructive than fixed delay, so modeling fixed delay alone is insufficient.
Policies trained with Full Net-Aware lose only ~1% on Sim-Clean in CartPole (495 vs. 500) and about 4% in MiniGrid (0.88 vs. 0.92), retaining near-optimal performance under ideal conditions.
Hierarchical policy graphs (with distributed deployment of multiple policy units) can also run successfully under the CALF framework.

Highlights & Insights

Reveals an overlooked dimension of the sim-to-real gap: Network conditions are orthogonal to physical and visual domains. Even with perfect physical modeling, a 100ms latency can cause control failure. This observation is simple yet crucial.
Algorithm-agnostic infrastructure solution: Modifying the training environment rather than the RL algorithm allows any RL method to be plug-and-play. This software design philosophy is highly valuable.
Quantitative analysis of individual network phenomena: The ablation ranks destructiveness as packet loss (~35%) > jitter (~25%) > fixed latency (~15%), which provides practical guidance for system design.

Limitations & Future Work

Validations were only conducted on CartPole and MiniGrid, which are relatively simple tasks.
WAN or adversarial network conditions were not tested.
Real-world hardware experiments were limited to a Raspberry Pi + Desktop setup, leaving more complex edge devices (such as Jetson, mobile phones) untested.
Throughput drops significantly (Mode 3 is only 20% of Mode 1), meaning the feasibility of large-scale training remains to be verified.
The framework could be combined with domain randomization strategies to simultaneously randomize both physical and network parameters.

vs. DCAC: DCAC modifies the TD-learning algorithm itself to handle delay, which requires an algorithm redesign. CALF stays algorithm-agnostic by injecting impairments at the environment level.
vs. IMPALA/SEED RL: These frameworks optimize worker-learner communication within the training infrastructure. CALF focuses instead on agent-environment communication in the control loop.
vs. Domain Randomization: Domain randomization varies physical and visual parameters. CALF complements it by adding network conditions to the randomization distribution.

Rating

Novelty: ⭐⭐⭐⭐ Pointing out networks as an independent dimension of the sim-to-real gap is an important insight.
Experimental Thoroughness: ⭐⭐⭐ Simple environments and limited quantitative results.
Writing Quality: ⭐⭐⭐⭐ Clear problem definition and rigorous experimental design.
Value: ⭐⭐⭐⭐ High practical engineering value for edge-deployed RL.