GauDP: Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies¶
Conference: NeurIPS 2025 arXiv: 2511.00998 Code: Available Area: 3D Vision Keywords: 3D Gaussian Splatting, Multi-Agent Collaboration, Diffusion Policy, Imitation Learning, Robot Manipulation
TL;DR¶
GauDP is proposed to enable scalable, perception-enhanced multi-agent collaborative imitation learning by constructing a globally consistent 3D Gaussian field from decentralized RGB observations of multiple agents and dynamically allocating Gaussian attributes back to each agent's local viewpoint.
Background & Motivation¶
In multi-agent embodied collaboration (e.g., industrial assembly, surgical robotics, domestic assistance), each agent must synchronize with others while completing its own subtask. Existing approaches face two core dilemmas:
Local observations only: Concatenating all agents' local views as input to a shared policy fails to capture the joint collaborative state, leading to desynchronized execution (e.g., one arm attempting to place food before another has opened the lid).
Global observations only: Provides a consistent scene representation but lacks high-resolution agent-specific information, degrading fine-grained control (grasping, placement) performance.
Naively fusing global and local signals lacks 3D structural constraints, making spatial reasoning difficult. A unified representation that simultaneously encodes global consistency and local precision is therefore required.
Method¶
Overall Architecture¶
GauDP operates in four stages:
- Local context extraction: Each agent extracts local features from its own 2D observations.
- Global 3D Gaussian field construction: A shared 3D Gaussian field is built from all views as global context.
- Global context allocation and fusion: Global context is fused with local context and processed through an encoder.
- Action prediction: A diffusion policy predicts actions via cross-attention over the per-agent fused features.
Formally, given synchronized multi-view observations \(\mathcal{O} = \{\mathcal{I}_1, \dots, \mathcal{I}_N\}\), the goal is to predict future action sequences \(\mathbf{a} = \{a_1, \dots, a_L\}\). The conditional policy is defined as \(\pi_\Phi(\mathbf{a} | \mathcal{O}) := \pi_\Phi(\mathbf{a} | \mathcal{O}, \mathcal{G})\), where \(\mathcal{G} = \mathcal{F}(\mathcal{O})\) denotes the mapping from observations to Gaussians.
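To make the data flow concrete, below is a minimal PyTorch-style sketch of the four-stage forward pass. It is a sketch under stated assumptions, not the authors' implementation: every module (local encoder, Gaussian head, fusion conv, policy cross-attention) is a simplified placeholder, channel sizes are arbitrary, and the diffusion noise level / timestep conditioning is omitted for brevity.

```python
import torch
import torch.nn as nn

class GauDPSketch(nn.Module):
    """Illustrative four-stage forward pass; all module names and sizes are placeholders."""
    def __init__(self, n_agents=2, feat_dim=64, gauss_dim=14, horizon=8, act_dim=7):
        super().__init__()
        self.local_enc = nn.Conv2d(3, feat_dim, 3, stride=2, padding=1)               # 1) per-agent local features
        self.gauss_head = nn.Conv2d(feat_dim * (n_agents + 1), gauss_dim, 1)          # 2) stand-in for the feed-forward Gaussian field
        self.fuse = nn.Conv2d(feat_dim + gauss_dim, feat_dim, 1)                      # 3) pixel-level fusion of local + allocated global context
        self.policy_attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)  # 4) cross-attention inside the diffusion policy
        self.act_query = nn.Parameter(torch.randn(1, horizon, feat_dim))
        self.act_out = nn.Linear(feat_dim, act_dim)

    def forward(self, obs):                                    # obs: (B, N, 3, H, W) synchronized RGB views
        B, N = obs.shape[:2]
        local = [self.local_enc(obs[:, i]) for i in range(N)]                          # stage 1
        joint = torch.cat(local, dim=1)                                                # crude stand-in for cross-view attention
        gauss = [self.gauss_head(torch.cat([local[i], joint], 1)) for i in range(N)]   # stage 2: per-view, pixel-aligned Gaussians
        fused = [self.fuse(torch.cat([local[i], gauss[i]], 1)) for i in range(N)]      # stage 3: allocate back and fuse per agent
        tokens = torch.cat([f.flatten(2).transpose(1, 2) for f in fused], dim=1)       # (B, N*h*w, C) conditioning tokens
        ctx, _ = self.policy_attn(self.act_query.expand(B, -1, -1), tokens, tokens)    # stage 4 (noise / timestep conditioning omitted)
        return self.act_out(ctx)                                                       # (B, horizon, act_dim) action sequence

print(GauDPSketch()(torch.randn(2, 2, 3, 64, 64)).shape)  # torch.Size([2, 8, 7])
```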
Key Designs¶
1. Global Context Reconstruction¶
Objective: Construct a unified, view-agnostic 3D representation from multi-view 2D RGB inputs. Conventional 3DGS requires dense views, precise poses, and per-scene optimization over several minutes, making it unsuitable for rapid adaptation in embodied settings.
Solution: NoPoSplat, a feed-forward reconstruction network, is adopted to reconstruct a 3D Gaussian representation directly from sparse, pose-free views and is fine-tuned on robot manipulation scenes. The pipeline proceeds as follows:
- Each RGB image is independently encoded by a shared-weight ViT encoder.
- A cross-view ViT decoder fuses information across views via cross-attention layers.
- A Gaussian parameter prediction head estimates per-pixel 3D Gaussians: \(\mathcal{G}_i = \mathcal{F}(\mathbf{x}_i)\), \(\mathcal{G}_i \in \mathbb{R}^{C_\mathcal{G} \times H \times W}\).
Additional depth supervision is introduced: depth maps \(\hat{D}\) are rendered by projecting each Gaussian into the camera coordinate frame and compared against ground-truth depth \(D\). Reconstruction quality improves substantially (PSNR 17.9 → 23.4).
Key point: Depth and pose are used only during fine-tuning; only RGB input is required at deployment.
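As a concrete illustration of the depth supervision, here is a minimal sketch of the fine-tuning objective, consistent with the reconstruction loss \(\mathcal{L}_{rec} = \mathcal{L}_{rgb} + \alpha \cdot \mathcal{L}_{depth}\) listed under Loss & Training. The `render_depth` helper is a hypothetical simplification that keeps the z-coordinate of each pixel-aligned Gaussian center in the camera frame (the real renderer alpha-blends Gaussians), and `alpha=0.5` is an arbitrary placeholder for the paper's \(\alpha\).

```python
import torch
import torch.nn.functional as F

def render_depth(means_world, extrinsic, H, W):
    """Illustrative depth from per-pixel Gaussian centers: transform each pixel-aligned
    Gaussian mean into the camera frame and keep its z-coordinate (no alpha blending)."""
    R, t = extrinsic[:, :3, :3], extrinsic[:, :3, 3]                 # world-to-camera rotation / translation
    means_cam = torch.einsum('bij,bnj->bni', R, means_world) + t[:, None]
    return means_cam[..., 2].reshape(-1, H, W)                       # (B, H, W) pseudo depth map

def finetune_loss(pred_rgb, gt_rgb, means_world, extrinsic, gt_depth, alpha=0.5):
    """L_rec = L_rgb + alpha * L_depth; alpha value is a placeholder."""
    H, W = gt_depth.shape[-2:]
    depth = render_depth(means_world, extrinsic, H, W)
    return F.mse_loss(pred_rgb, gt_rgb) + alpha * F.l1_loss(depth, gt_depth)

# Toy shapes: one scene, 32x32 image, one Gaussian per pixel.
B, H, W = 1, 32, 32
loss = finetune_loss(
    pred_rgb=torch.rand(B, 3, H, W), gt_rgb=torch.rand(B, 3, H, W),
    means_world=torch.rand(B, H * W, 3), extrinsic=torch.eye(4).expand(B, 4, 4),
    gt_depth=torch.rand(B, H, W),
)
print(loss.item())
```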
2. Global Context Allocation and Pixel-Level Synergy¶
Broadcasting the full global context to every agent introduces irrelevant information. A selective distribution mechanism is proposed:
- The natural alignment between Gaussians and source pixels established during reconstruction is exploited.
- Each agent receives only the subset of Gaussians associated with its own view, which already incorporates information from other views via cross-attention.
- Selected Gaussians are transformed back into a 2D grid matching the spatial dimensions of the original image.
- The result is concatenated with local image features and fused at the pixel level through a lightweight convolutional fusion module.
This design ensures each agent obtains a targeted global representation while maintaining spatial consistency.
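A minimal sketch of this allocation-and-fusion step, assuming the Gaussians are stored pixel-aligned per source view so that "allocation" reduces to indexing an agent's own slice and reshaping it back to an H×W grid; module names, channel sizes, and the two-layer fusion conv are illustrative placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PixelLevelFusion(nn.Module):
    """Illustrative allocation + pixel-level synergy for one agent."""
    def __init__(self, img_feat_dim=64, gauss_dim=14, out_dim=64):
        super().__init__()
        # Lightweight convolutional fusion over concatenated image features and Gaussian attributes.
        self.fuse = nn.Sequential(
            nn.Conv2d(img_feat_dim + gauss_dim, out_dim, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(out_dim, out_dim, kernel_size=3, padding=1),
        )

    def forward(self, img_feat, all_gaussians, agent_idx, H, W):
        # all_gaussians: (B, N_views, H*W, C_g) pixel-aligned Gaussians, one slice per source view.
        g = all_gaussians[:, agent_idx]                      # allocate: keep only this agent's slice
        g = g.transpose(1, 2).reshape(g.shape[0], -1, H, W)  # back to a 2D grid matching the image
        return self.fuse(torch.cat([img_feat, g], dim=1))    # pixel-level fusion with local features

B, N, H, W = 2, 3, 32, 32
out = PixelLevelFusion()(torch.randn(B, 64, H, W), torch.randn(B, N, H * W, 14), agent_idx=1, H=H, W=W)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```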
3. Coordinate System Selection¶
Ablation studies show that using each agent's local camera coordinate system outperforms a unified world coordinate system—preserving agent-centric spatial relationships and avoiding cross-view alignment errors.
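For reference, the two options differ only by a rigid transform of each Gaussian; here \(R_i, t_i\) denote agent \(i\)'s world-to-camera extrinsics (notation assumed for this note, not taken from the paper):

\[
\mu^{\text{cam}}_i = R_i\,\mu^{\text{world}} + t_i, \qquad
\Sigma^{\text{cam}}_i = R_i\,\Sigma^{\text{world}}\,R_i^{\top},
\]

so keeping Gaussians in each agent's camera frame skips this alignment step and the calibration error it can introduce.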
Loss & Training¶
- Reconstruction loss (NoPoSplat fine-tuning stage): \(\mathcal{L}_{rec} = \mathcal{L}_{rgb} + \alpha \cdot \mathcal{L}_{depth}\)
- Diffusion policy loss (policy training stage): the standard DDPM noise-prediction (denoising) loss, sketched after this list
- Training configuration: action prediction horizon 8, observation steps 3, action execution steps 6; DDPM 100 steps; Adam, \(lr=10^{-4}\), warm-up + cosine decay; 100 epochs, batch size 32, single A800 GPU.
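A minimal sketch of the policy-stage objective, assuming the standard DDPM noise-prediction formulation; the linear beta schedule, placeholder denoiser, and tensor shapes (batch 4, horizon 8, 7-DoF actions) are illustrative only and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def ddpm_policy_loss(denoiser, actions, cond, alphas_cumprod):
    """Standard DDPM noise-prediction loss on an action sequence (B, horizon, act_dim)."""
    B = actions.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=actions.device)   # random timestep per sample
    a_bar = alphas_cumprod[t].view(B, 1, 1)
    noise = torch.randn_like(actions)
    noisy_actions = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise          # forward diffusion q(a_t | a_0)
    return F.mse_loss(denoiser(noisy_actions, t, cond), noise)                   # predict the injected noise

# Toy setup: 100-step linear beta schedule; the denoiser is a placeholder that ignores t and cond.
betas = torch.linspace(1e-4, 2e-2, 100)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
denoiser = lambda a, t, cond: torch.zeros_like(a)
loss = ddpm_policy_loss(denoiser, torch.randn(4, 8, 7), cond=None, alphas_cumprod=alphas_cumprod)
print(loss.item())
```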
Key Experimental Results¶
Main Results¶
Evaluated on the RoboFactory benchmark, comprising 6 collaborative manipulation tasks with 2–4 robot arms:
| Method | Lift Barrier | Place Food | Stack Cube | Align Camera | Stack Cube(4) | Take Photo | Avg. |
|---|---|---|---|---|---|---|---|
| DP3(XYZ+RGB) | 31% | 25% | 1% | 18% | 0% | 11% | 14.33% |
| 3D Dense Policy | 28% | 18% | 0% | 0% | 0% | 7% | 8.83% |
| DP | 9% | 12% | 6% | 3% | 0% | 0% | 5.00% |
| GauDP | 72% | 15% | 2% | 26% | 0% | 3% | 19.67% |
3D reconstruction quality (2-view reconstruction):
| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Pretrain (NoPoSplat) | 17.918 | 0.580 | 0.492 |
| After fine-tuning | 23.424 | 0.779 | 0.148 |
Ablation Study¶
| Configuration | Lift Barrier | Place Food | Stack Cube | Align Camera | Avg. (all 6 tasks) |
|---|---|---|---|---|---|
| Unified world coordinate | 30% | 1% | 8% | 26% | 10.83% |
| Coarse fusion (w/o prefuse) | 2% | 4% | 0% | 1% | 1.17% |
| Gaussian only (w/o Image) | 32% | 7% | 0% | 28% | 11.17% |
| Image only (w/o Gaussian) | 9% | 12% | 6% | 3% | 5.00% |
| Full model | 72% | 15% | 2% | 26% | 19.67% |
Real-robot experiments: GauDP achieves 17/30, 19/30, and 27/30 on Card Box Stacking, Handover, and Grab Roller, respectively, outperforming the DP baseline on all tasks.
Key Findings¶
- GauDP surpasses all baselines using only RGB input, achieving the highest average success rate of 19.67%.
- On Lift Barrier, it reaches 72%, substantially outperforming the second-best method DP3 (31%).
- Removing pixel-level fusion causes performance to collapse to 1.17%, demonstrating the critical importance of fine-grained fusion.
- Both image and Gaussian representations are indispensable: images provide appearance cues while Gaussians provide global structure.
Highlights & Insights¶
- Elegant design philosophy: 3DGS serves as a bridge to unify local precision and global consistency without requiring additional sensor modalities.
- Naturally scalable: The flexibility of the Gaussian representation requires no architectural changes as the number of agents increases.
- Self-supervised reconstruction: The same multi-view data used to train the diffusion policy is leveraged for 3DGS fine-tuning, requiring no additional data.
- No pose or depth at inference: Only RGB input is needed at deployment.
Limitations & Future Work¶
- Success rates on high-precision tasks such as Stack Cube remain very low (2%), indicating significant room for improvement in fine-grained manipulation.
- Overall success rates under 3–4 arm configurations are low, suggesting that high-complexity collaboration remains challenging.
- Training is somewhat more expensive (6.5 GPU hours vs. 4.8 and 2.5 for the compared baselines), and inference is marginally slower (1.28 FPS vs. 1.49 FPS).
- Integration with VLA models and the application of Gaussians as world models in dynamic scenes remain unexplored.
Related Work & Insights¶
- NoPoSplat: The core dependency, enabling feed-forward 3DGS reconstruction from sparse, pose-free views.
- 3D Diffusion Policy (DP3): The primary baseline, a diffusion policy with point cloud input.
- RoboFactory: A multi-arm collaborative manipulation benchmark built via automated data collection.
- Insight: 3DGS as an intermediate representation for aggregating multi-view information is broadly applicable to multi-agent perception–decision tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐ Combining 3DGS with diffusion policies for multi-agent settings is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive ablations with both simulation and real-robot experiments.
- Writing Quality: ⭐⭐⭐⭐ Framework design is clear and motivation is well articulated.
- Value: ⭐⭐⭐⭐ Opens a new direction for visual representations in multi-agent collaboration.