URB -- Urban Routing Benchmark for RL-Equipped Connected Autonomous Vehicles¶
- Conference: NeurIPS 2025
- arXiv: 2505.17734
- Code: GitHub
- Area: Autonomous Driving / Reinforcement Learning for Route Planning
- Keywords: urban routing benchmark, multi-agent reinforcement learning, autonomous driving, traffic simulation, game theory
TL;DR¶
This paper presents URB — the first large-scale MARL benchmark environment for urban mixed-traffic (human + CAV) routing, integrating 29 real-world traffic networks, the microscopic traffic simulator SUMO, and empirical travel demand patterns. Experiments reveal that current state-of-the-art MARL algorithms rarely outperform human drivers, highlighting the urgent need for algorithmic advances in this domain.
Background & Motivation¶
Connected autonomous vehicles (CAVs) hold promise for alleviating urban congestion through optimized routing decisions, yet the central question remains: which algorithms should govern collective routing? Reinforcement learning (RL) naturally frames routing as a sequential decision-making problem, but a standardized, realistic benchmark for evaluating and comparing algorithms is lacking.
Key challenges and gaps in prior work include: (1) urban routing is inherently hard for RL: the action space is enormous (the number of paths grows exponentially with network size), the environment is non-stationary because agents compete for shared road capacity, and simulation is costly; (2) prior studies are confined to simplified scenarios and lack unified evaluation protocols; and (3) the impact of CAV deployment on human drivers is largely neglected.
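To make the action-space point concrete, consider a toy grid network (our illustration, not from the paper): even counting only the shortest monotone corner-to-corner paths in an \(n \times n\) grid, the count is \(\binom{2n}{n}\) and explodes combinatorially; real street networks with detours grow far faster.

```python
from math import comb

def monotone_grid_paths(n: int) -> int:
    """Number of shortest (monotone) corner-to-corner paths in an
    n x n grid: choose which n of the 2n moves go east -> C(2n, n)."""
    return comb(2 * n, n)

for n in (2, 5, 10):
    print(n, monotone_grid_paths(n))
# Growth is combinatorial even under this severe restriction to
# shortest paths; allowing detours makes the path set far larger.
```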
The paper's starting point is to construct a comprehensive benchmark framework that enables RL researchers to test algorithms on realistic traffic scenarios while allowing transportation researchers to assess the societal implications of CAV deployment using state-of-the-art RL methods. The central finding is that current MARL algorithms are far from mature for large-scale urban routing and often fail to surpass even random policies.
Method¶
Overall Architecture¶
URB formalizes urban routing as an Agent-Environment Cycle (AEC) game: human drivers first learn and stabilize their routing strategies → a fraction of vehicles is converted to CAVs → CAVs are trained with MARL algorithms → the resulting strategies are evaluated. The environment is simulated in SUMO, human behavior follows a day-to-day route-choice learning model, and CAVs select routes based on local observations.
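The four-phase cycle above can be sketched as a pipeline skeleton; all class and method names here are illustrative stand-ins, not URB's actual API:

```python
# Hedged sketch of the four-phase URB pipeline; names are ours, not URB's.

class RoutingPipeline:
    """Toy skeleton: human learning -> CAV mutation -> MARL training -> evaluation."""

    def __init__(self, n_vehicles: int, cav_share: float):
        self.n_vehicles = n_vehicles
        self.n_cavs = int(cav_share * n_vehicles)  # e.g. 40% market penetration

    def run(self, human_days: int = 100, marl_episodes: int = 6000):
        log = []
        # Phase 1: humans adapt day-to-day until near user equilibrium
        log.append(("human_learning", human_days))
        # Phase 2: a fixed fraction of vehicles is converted to CAVs
        log.append(("mutation", self.n_cavs))
        # Phase 3: CAVs trained with a MARL algorithm (IQL/IPPO/MAPPO/QMIX)
        log.append(("marl_training", marl_episodes))
        # Phase 4: frozen policies evaluated on held-out days
        log.append(("evaluation", "t_test, WR, Delta_V, Delta_L"))
        return log

pipeline = RoutingPipeline(n_vehicles=1000, cav_share=0.4)
for phase, detail in pipeline.run():
    print(phase, detail)
```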
Key Designs¶
- Real-World Traffic Networks and Demand:
  - 29 real-world traffic networks: 28 sub-regions of Île-de-France and 1 Ingolstadt network
  - Each network is paired with synthetic travel demand patterns derived from empirical data (AM peak period)
  - Network scales range from small (St. Arnoult) to large (Ingolstadt), covering varying levels of complexity
- Flexible Parameterization:
  - CAV market penetration rate (0–100%) is configurable
  - CAV behavioral profiles: selfish (minimize own travel time), cooperative (minimize system-wide travel time), altruistic, or even adversarial
  - Observation space: CAVs observe the routing choices of agents that departed earlier
  - Action space: selection from \(K_i\) pre-computed candidate paths
- Human Behavior Modeling:
  - Based on the classical day-to-day routing learning model (Gawron 1998)
  - Human drivers update their expected travel times based on recent experience and select the subjectively optimal path each day
  - After sufficient learning iterations, behavior converges toward user equilibrium (a special case of Nash equilibrium)
- Comprehensive Evaluation Metrics:
  - Core metric: travel time \(t\) (\(t^{pre}\), \(t^{train}\), \(t^{test}\), \(t_{CAV}\), \(t_{HDV}\))
  - Training cost \(c_{all}\): measures the negative systemic impact incurred during training
  - Change analysis: mean speed change \(\Delta_V\) and mileage change \(\Delta_L\)
  - Win rate \(WR\): proportion of runs in which CAV post-training travel time is lower than the human baseline
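The day-to-day human learning step can be sketched as follows. This is a deliberate simplification (the update rule and function names are ours): each driver smooths the expected travel time of the route just driven toward the experienced time, then picks the subjectively best of its \(K_i\) candidate routes for the next day. Gawron's 1998 model is richer, also adjusting costs of unchosen routes and using probabilistic choice.

```python
def day_to_day_step(expected, experienced_route, experienced_time, alpha=0.2):
    """One day of a simplified Gawron-style update for a single driver.

    expected:          list of expected travel times, one per candidate route
    experienced_route: index of the route driven today
    experienced_time:  travel time actually experienced on it
    alpha:             smoothing weight toward new experience (a simplification;
                       Gawron's original model also updates unchosen routes)
    """
    updated = expected.copy()
    updated[experienced_route] = (
        (1 - alpha) * expected[experienced_route] + alpha * experienced_time
    )
    # Tomorrow's choice: the subjectively optimal candidate route
    choice = min(range(len(updated)), key=lambda k: updated[k])
    return updated, choice

# A driver expects route 0 to be fastest, but experiences congestion (16.0):
expected, choice = day_to_day_step([10.0, 12.0, 11.0],
                                   experienced_route=0, experienced_time=16.0)
print(expected, choice)  # route 0's estimate rises to 11.2, so route 2 wins
```

Iterating this update over all drivers for many simulated days is what drives the population toward user equilibrium before CAVs are introduced.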
Loss & Training¶
The benchmark implements four MARL algorithms:
- IQL (Independent Q-Learning): each agent trains an independent DQN
- IPPO (Independent PPO): each agent independently applies PPO
- MAPPO (Multi-Agent PPO): centralized critic with decentralized actors
- QMIX: factorizes the joint Q-function via a mixing network
Cooperative algorithms (MAPPO, QMIX) use a cooperative reward (minimizing system-wide travel time); independent algorithms (IQL, IPPO) use a selfish reward. Baselines include All-or-Nothing (AON), random, and human routing.
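The selfish/cooperative reward split can be sketched as below; the function names and the use of the mean (rather than the sum) of travel times are our assumptions:

```python
def selfish_reward(agent_time: float, all_times) -> float:
    """Selfish objective (IQL, IPPO): minimize the agent's own travel time."""
    return -agent_time

def cooperative_reward(agent_time: float, all_times) -> float:
    """Cooperative objective (MAPPO, QMIX): minimize system-wide mean travel
    time; every agent receives the same team reward."""
    return -sum(all_times) / len(all_times)

# Travel times for one evaluation day (hypothetical values):
times = {"cav_0": 3.2, "cav_1": 2.9, "hdv_0": 3.8}
print(selfish_reward(times["cav_0"], list(times.values())))
print(cooperative_reward(times["cav_0"], list(times.values())))
```

Note that under the cooperative reward all agents share one scalar signal, which is exactly what a centralized critic (MAPPO) or mixing network (QMIX) is designed to decompose into per-agent credit.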
Key Experimental Results¶
Main Results¶
Scenario 1: 40% CAV market penetration (average over 5 repetitions):
| Algorithm | St. Arnoult \(t^{test}\) | Provins \(t^{test}\) | Ingolstadt \(t^{test}\) | St. Arnoult WR |
|---|---|---|---|---|
| Human baseline | 3.15 | 2.80 | 4.21 | 100% |
| AON | 3.15 | 2.67 | 4.11 | 100% |
| Random | 3.38 | 2.93 | 4.40 | 0% |
| IPPO | 3.28 | 2.90 | 4.37 | 0% |
| IQL | 3.36 | 2.91 | 4.41 | 0% |
| MAPPO | 3.35 | 2.91 | 4.38 | 0% |
| QMIX | 3.24 | 2.94 | 4.47 | 80% |
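The \(WR\) column can be computed as below; the five run values are hypothetical numbers chosen only to illustrate how an 80% win rate over 5 repetitions arises:

```python
def win_rate(test_times, baseline_time: float) -> float:
    """WR: fraction of independent training runs whose post-training mean
    travel time is strictly below the human-routing baseline."""
    return sum(t < baseline_time for t in test_times) / len(test_times)

# Hypothetical per-run results for 5 repetitions (not the paper's raw data):
qmix_runs = [3.10, 3.12, 3.48, 3.11, 3.14]
print(win_rate(qmix_runs, baseline_time=3.15))  # 4 of 5 runs win -> 0.8
```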
Ablation Study¶
| Configuration | Finding |
|---|---|
| QMIX 6000 episodes | 80% WR only on the smallest network, St. Arnoult |
| QMIX 20000 episodes | Extended training yields diminishing returns; performance remains poor on larger networks |
| Varying CAV penetration | Higher CAV penetration does not necessarily improve system-level performance |
| Fixed episode budget | MARL training converges extremely slowly; each episode requires a full SUMO simulation |
Key Findings¶
- MARL algorithms rarely surpass human routing: Only QMIX achieves an 80% win rate on the smallest network (St. Arnoult); on larger networks it even underperforms the random policy
- CAV deployment harms human drivers: Across all scenarios \(t_{HDV} > t^{pre}\), indicating that CAV presence increases travel time for human drivers
- Training imposes substantial system costs: System efficiency consistently degrades during training (reduced speeds, increased mileage)
- The simple AON baseline performs best in several scenarios, exposing the immaturity of current MARL approaches
Highlights & Insights¶
- The value of negative results: Reporting poor MARL performance candidly is more meaningful than presenting superficial successes, and points the community toward genuinely needed breakthroughs
- Interdisciplinary design: Transportation engineering (SUMO simulation), transport science (human routing behavior modeling), and machine learning (MARL) are organically integrated
- Societal impact assessment: Beyond evaluating CAV performance, the benchmark tracks the effects of CAV deployment on human drivers, reflecting responsible AI research practice
- Open and extensible: The modular design allows the community to freely add new algorithms, networks, and tasks
Limitations & Future Work¶
- Simulation efficiency bottleneck: each episode requires a full SUMO simulation, making large-scale training extremely time-consuming
- Only pre-trip routing decisions are currently supported; en-route rerouting is not considered
- Human behavior is modeled with a standardized formulation that does not capture heterogeneity or irrational behavior
- Only discrete route selection is employed; continuous routing optimization remains unexplored
- Practical CAV deployment constraints such as communication latency and partial observability are not modeled
Related Work & Insights¶
- vs. FLOW / RESCO: These traffic RL benchmarks focus on signal control rather than routing decisions; URB fills the gap for urban routing MARL benchmarks
- vs. RouteRL: URB extends RouteRL by incorporating 29 networks, system-level evaluation metrics, and benchmark algorithms
- A cautionary message for MARL research: The paper underscores the substantial gap between real-world large-scale non-stationary multi-agent problems and commonly used MARL benchmarks
Rating¶
- Novelty: ⭐⭐⭐⭐ First large-scale urban routing MARL benchmark with an innovative problem formulation, though core methods rely on existing algorithms
- Experimental Thoroughness: ⭐⭐⭐⭐ Three networks, four algorithms, and multi-metric evaluation, though only a single scenario configuration is presented
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear, interdisciplinary content is well organized, and background knowledge is sufficiently covered
- Value: ⭐⭐⭐⭐⭐ Provides important benchmark value for both MARL and autonomous vehicle routing; the negative results are highly instructive