VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play¶

Conference: NeurIPS 2025
arXiv: 2502.01932
Code: https://github.com/thu-uav/VolleyBots
Area: Reinforcement Learning
Keywords: Multi-drone systems, robot sports, multi-agent reinforcement learning, game theory, sim-to-real

TL;DR¶

This paper presents VolleyBots, a multi-drone volleyball testbed that combines mixed cooperative-adversarial gameplay, turn-based interaction, and agile 3D motion control. Built on Isaac Sim, it defines a task curriculum that runs from single-agent training to multi-agent competition. On the 3v3 task, a hierarchical policy achieves a 69.5% win rate against the strongest baseline, and a policy trained purely in simulation transfers zero-shot to a real quadrotor.

Background & Motivation¶

Robot sports are considered an ideal setting for showcasing embodied intelligence due to their well-defined objectives, explicit rules, and dynamic interactions. Prior work has made progress in individual sub-domains: RoboCup focuses on team cooperation and adversarial play; table tennis robots emphasize turn-based interaction; drone pursuit-evasion tasks require agile 3D maneuvering. However, no existing platform simultaneously encompasses mixed cooperative-adversarial gameplay, turn-based interaction, and agile 3D maneuvering.

Volleyball naturally possesses all three characteristics: intra-team coordination requires tight cooperation to execute bump-set-spike sequences, inter-team play demands anticipation and exploitation of opponent strategies, and ball trajectories require drones to perform rapid acceleration, sharp turns, and precise positioning under underactuated quadrotor dynamics. These features are deeply intertwined, forming a complex problem that demands both low-level motion control and high-level strategic decision-making, without access to any expert demonstration data.

The central idea is to use volleyball as a vehicle and, mirroring how humans learn the sport step by step, to design a task curriculum that runs from simple to complex. This lets multi-agent RL and game-theoretic algorithms be benchmarked systematically, and advanced, on challenging embodied tasks.

Method¶

Overall Architecture¶

VolleyBots is built on NVIDIA Isaac Sim, leveraging the OmniDrones simulator for GPU-parallelized data collection. The platform comprises three components: (1) Environment — defining entities, observations, actions, and rewards; (2) Tasks — including 3 single-agent tasks, 3 multi-agent cooperative tasks, and 3 multi-agent competitive tasks; (3) Algorithms — covering RL, MARL, and game-theoretic baselines.

Key Designs¶

Simulated Entities and Physical Interaction:
- Uses the Iris quadrotor model with a virtual paddle of radius 0.2 m (restitution coefficient 0.8)
- Ball: radius 0.1 m, mass 5 g, restitution coefficient 0.8, supporting realistic bouncing and collision
- Court follows standard volleyball dimensions \(9\text{m} \times 18\text{m}\), net height 2.43 m
- Design Motivation: High-fidelity physical interaction is the foundation for sim-to-real transfer
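
To give a feel for what a restitution coefficient of 0.8 means, the following is a minimal sketch of a single bounce. It assumes an ideal vertical impact on a fixed rigid surface with an effective coefficient of 0.8 and no air drag; the function name is illustrative and is not part of VolleyBots. Physics engines may combine the restitution values of the two colliding bodies in other ways, so the simulator's actual rebound can differ.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def rebound_height(drop_height_m: float, restitution: float = 0.8) -> float:
    """Peak height after one bounce on a fixed rigid surface (no drag)."""
    impact_speed = math.sqrt(2.0 * G * drop_height_m)
    # The restitution coefficient scales the normal speed at impact.
    rebound_speed = restitution * impact_speed
    return rebound_speed ** 2 / (2.0 * G)


if __name__ == "__main__":
    print(f"{rebound_height(1.0):.2f} m")  # prints "0.64 m"
```

Because rebound height scales with the square of the coefficient, a ball dropped from 1 m returns to about 0.64 m under these assumptions.
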
Progressive Task Curriculum:
- Single-agent tasks: Back and Forth (sprint traversal), Hit the Ball (striking distance), Solo Bump (continuous juggling) — evaluating fundamental 3D maneuvering capability
- Cooperative tasks: Bump and Pass (two-agent rally), Set and Spike Easy/Hard (set-spike coordination) — introducing intra-team turn-based coordination
- Competitive tasks: 1v1, 3v3, 6v6 — integrating all challenges: cooperation, adversarial play, turn-based interaction, and 3D maneuvering
- Design Motivation: Mirrors the human learning progression from fundamentals to team coordination to formal match play
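
The curriculum above can be written down as a small registry. This is only an illustrative data structure using the task names from this summary; the dictionary layout and the helper `drones_in_match` are assumptions and do not reflect how the VolleyBots repository organizes its tasks.

```python
# Illustrative registry of the nine tasks, grouped by curriculum stage.
CURRICULUM = {
    "single_agent": ["Back and Forth", "Hit the Ball", "Solo Bump"],
    "cooperative": ["Bump and Pass", "Set and Spike (Easy)", "Set and Spike (Hard)"],
    "competitive": ["1v1", "3v3", "6v6"],
}


def drones_in_match(task: str) -> int:
    """Total number of drones in an NvN competitive task, e.g. '3v3' -> 6."""
    left, right = task.split("v")
    return int(left) + int(right)


if __name__ == "__main__":
    for stage, tasks in CURRICULUM.items():
        print(stage, tasks)
    print(drones_in_match("6v6"))  # prints 12
```
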
Dual Action Spaces and Hierarchical Rewards:
- Two action space options: CTBR (collective thrust + body rates, higher-level abstraction) and PRT (per-rotor thrust, fine-grained control)
- Reward function consists of three components: misbehave penalty (general violation penalties), task reward (sparse task-level reward), and shaping reward (auxiliary guidance reward)
- Design Motivation: PRT offers stronger maneuverability but is harder to train; CTBR facilitates sim-to-real transfer; the hierarchical reward balances exploration and learning efficiency
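
The three-part reward can be sketched as a simple sum. The field names, the sign convention (penalty stored as a non-negative magnitude), and the `shaping_weight` parameter are assumptions made for illustration; the summary does not give the actual per-task terms or weights.

```python
from dataclasses import dataclass


@dataclass
class RewardTerms:
    misbehave_penalty: float  # non-negative magnitude of violation penalties
    task_reward: float        # sparse task-level reward
    shaping_reward: float     # dense auxiliary guidance reward


def total_reward(terms: RewardTerms, shaping_weight: float = 1.0,
                 use_shaping: bool = True) -> float:
    """Combine the three components into one scalar reward."""
    shaping = shaping_weight * terms.shaping_reward if use_shaping else 0.0
    return terms.task_reward + shaping - terms.misbehave_penalty


if __name__ == "__main__":
    step = RewardTerms(misbehave_penalty=0.5, task_reward=1.0, shaping_reward=0.2)
    print(total_reward(step))                     # with shaping
    print(total_reward(step, use_shaping=False))  # shaping switched off
```

The `use_shaping` flag corresponds to the kind of with/without-shaping comparison reported in the ablation table.
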
Hierarchical Policy:
- Low level: a set of primitive skills (Hover, Serve, Pass, Set, Attack) trained with PPO
- High level: a rule-based, event-driven policy that assigns low-level skills to each drone upon each hit
- For the Attack skill, the high-level policy selects left or right direction with equal probability
- Design Motivation: End-to-end approaches struggle to jointly optimize motion control and strategic play; hierarchical decomposition effectively alleviates this tension
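
A rule-based, event-driven high level of this kind might look like the sketch below. The specific rules (a fixed Pass, Set, Attack sequence indexed by the team's touch count, with all other drones hovering) and every name in the code are hypothetical; only the skill set and the equal-probability left/right choice for Attack come from the description above.

```python
import random
from enum import Enum


class Skill(Enum):
    HOVER = "Hover"
    SERVE = "Serve"
    PASS = "Pass"
    SET = "Set"
    ATTACK = "Attack"


def assign_skills(team_size, next_hitter, touches_so_far, rng, is_serve=False):
    """Return one (skill, direction) command per drone after a hit event.

    Hypothetical rule: the next hitter follows Pass -> Set -> Attack based on
    how many touches the team has already used; everyone else hovers.
    """
    if is_serve:
        skill = Skill.SERVE
    else:
        skill = [Skill.PASS, Skill.SET, Skill.ATTACK][min(touches_so_far, 2)]
    # Attack direction is drawn with equal probability, as in the summary.
    direction = rng.choice(["left", "right"]) if skill is Skill.ATTACK else None
    commands = [(Skill.HOVER, None)] * team_size
    commands[next_hitter] = (skill, direction)
    return commands


if __name__ == "__main__":
    rng = random.Random(0)
    print(assign_skills(team_size=3, next_hitter=1, touches_so_far=2, rng=rng))
```

Each command would then select which PPO-trained low-level skill policy controls that drone until the next hit event.
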

Loss & Training¶

- Single-agent and cooperative tasks use the standard PPO/MAPPO clipped surrogate objective.
- Competitive tasks incorporate game-theoretic training paradigms, including Self-Play (SP), Fictitious Self-Play (FSP), and PSRO with Uniform and Nash meta-solvers.
- All algorithms share unified hyperparameters within each task to assess cross-task robustness.
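
For reference, the standard PPO clipped surrogate can be written in a few lines. This is a dependency-free sketch of the textbook objective, not the VolleyBots training code; the clip range of 0.2 is a common default and is assumed here, and value-loss and entropy terms are omitted.

```python
import math


def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Negative clipped surrogate, averaged over samples (to be minimized)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


if __name__ == "__main__":
    # Ratio 1.5 with a positive advantage is clipped to 1.2.
    print(ppo_clip_loss([math.log(1.5)], [0.0], [1.0]))
```

MAPPO applies the same objective per agent, typically with a centralized value function.
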

Key Experimental Results¶

Main Results¶

| Task | Metric | PPO (PRT) | TD3 (PRT) | SAC (PRT) | Note |
|---|---|---|---|---|---|
| Back and Forth | Waypoints reached | 10.04±0.20 | 0.99±0.01 | 0.83±0.25 | PPO far surpasses the off-policy methods |
| Hit the Ball | Hitting distance (m) | 11.40±0.06 | 5.29±1.28 | 3.87±2.34 | PPO ≈ 2× TD3, the next-best method |
| Solo Bump | Juggling count | 10.83±1.24 | 3.68±1.43 | 1.36±0.60 | PPO holds a clear lead |

| Competitive Task | Metric | SP | FSP | PSRO_Nash | Hierarchical |
|---|---|---|---|---|---|
| 3v3 | Exploitability↓ | 25.76 | 38.86 | 35.83 | n/a |
| 3v3 | Win Rate↑ | 0.59 | 0.52 | 0.61 | 0.695 |
| 3v3 | Elo↑ | 1077 | 906 | 1268 | n/a |
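
To help read the Elo row, the sketch below converts a rating gap into an expected head-to-head score. It assumes the conventional logistic Elo model with a 400-point scale; the summary does not state which Elo parameters the paper uses, so the number is illustrative only.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # PSRO_Nash (1268) versus SP (1077), ratings taken from the table above.
    print(round(elo_expected_score(1268, 1077), 2))  # prints 0.75
```

Under this assumption, a 191-point gap corresponds to an expected score of roughly 0.75 for the higher-rated policy.
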

Ablation Study¶

| Configuration | Bump and Pass | Set & Spike (Easy) | Set & Spike (Hard) |
|---|---|---|---|
| MAPPO w/o shaping | 11.32±0.91 | 0.25 | 0.25 |
| MAPPO w. shaping | 13.71±0.58 | 0.99 | 0.75 |
| MADDPG w. shaping | 0.84±0.09 | 0.23 | 0.22 |
| QMIX w. shaping | 0.09±0.00 | 0.02 | 0.02 |

Key Findings¶

- On-policy methods (PPO/MAPPO) consistently outperform off-policy methods across all tasks and exhibit stronger cross-task hyperparameter robustness.
- The PRT action space achieves marginally higher final performance than CTBR, though CTBR converges faster.
- No algorithm converges to an effective strategy on the 6v6 task, exposing a scalability bottleneck in current methods.
- The hierarchical policy achieves a 69.5% win rate against the strongest baseline (SP's Nash equilibrium policy) on the 3v3 task.

Highlights & Insights¶

- VolleyBots is the first work to unify cooperative-adversarial, turn-based, and 3D maneuvering challenges within a single platform, filling a gap in robot sports research and providing a standard testbed for embodied intelligence.
- Zero-shot sim-to-real deployment supports the platform's practical relevance: the Solo Bump policy, trained entirely in simulation, successfully juggles the ball on a real quadrotor, which suggests that the simulated dynamics are close enough to real-world physics for this skill.

Limitations & Future Work¶

- No algorithm currently converges on the 6v6 format; more advanced hierarchical or communication mechanisms are needed to handle large-scale coordination.
- The high-level layer of the hierarchical policy remains hand-crafted; future work could employ RL or LLMs to learn strategic decision-making automatically.
- The current observation space is state-based; incorporating visual observations would allow exploration of more realistic perceptual challenges.

Comparison with Related Work¶

- vs. MQE (quadruped soccer): MQE also supports cooperative-adversarial play, but VolleyBots additionally features turn-based interaction and 3D aerial maneuvering, which place higher demands on low-level control.
- vs. Robot Table Tennis: Table tennis robots involve turn-based interaction but lack multi-agent cooperation; VolleyBots offers richer multi-agent coordination.
- vs. SMPLOlympics: The humanoid Olympic setting covers a broad range of sports but lacks the depth of 3D aerial locomotion and team coordination.

Rating¶

- Novelty: ⭐⭐⭐⭐ First to bring volleyball to a multi-drone platform with systematic task design, though the core contribution leans toward benchmarking.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage across 9 tasks, diverse RL/MARL/game-theoretic algorithms, ablations, and sim-to-real validation.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, rich figures, and detailed task descriptions.
- Value: ⭐⭐⭐⭐ As a standardized benchmark, it offers long-term impact for multi-agent embodied intelligence research.