EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation¶
Conference: NeurIPS 2025 arXiv: 2501.01895 Code: To be confirmed Area: 3D Vision Keywords: embodied AI, video diffusion, multi-view generation, robotic manipulation, 4D Gaussian Splatting
TL;DR¶
EnerVerse is a generative robotic foundation model that constructs a 4D embodied space via chunk-wise autoregressive video diffusion, sparse context memory, and multi-view generation priors. Combined with a 4DGS data flywheel to narrow the Sim2Real gap, it translates 4D world representations into physical actions through a policy head, achieving state-of-the-art performance on the LIBERO benchmark.
Background & Motivation¶
- Video generation models have made significant advances in spatiotemporal imagination, naturally motivating their application to robotic action planning.
- Existing methods naively adapt general-purpose video generation models to robotic tasks, overlooking the substantial gap between 2D video representation spaces and 3D robot environments.
- Multi-view observation is critical for robotic manipulation (resolving occlusion and motion ambiguity), yet multi-camera calibration and data collection are prohibitively costly.
- The Sim2Real gap remains a core bottleneck for large-scale application of simulated data.
Core Problem¶
How can a unified framework generate high-quality 4D embodied spaces and translate them directly into physical actions, while also addressing the scarcity of multi-view data and the Sim2Real gap?
Method¶
1. Chunk-wise Autoregressive Video Diffusion¶
The minimal unit of future space is defined as a chunk, and the model iteratively predicts the next chunk to extend the space. Training optimizes a standard denoising objective under the v-prediction parameterization. At inference, newly denoised frames serve as clean conditioning for the next iteration, and generation terminates when an end-of-sequence (EOS) frame is detected.
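The paper's exact loss is not reproduced in this note; assuming the usual v-prediction form, it reads

\[
x_t = \alpha_t\,x_0 + \sigma_t\,\epsilon, \qquad
v_t = \alpha_t\,\epsilon - \sigma_t\,x_0, \qquad
\mathcal{L} = \mathbb{E}_{x_0,\,\epsilon,\,t}\,\bigl\|\,v_\theta(x_t, t, c) - v_t\,\bigr\|_2^2,
\]

where \(c\) denotes the conditioning context (previous frames and task instruction).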
2. Sparse Memory Mechanism¶
During training, sparsely sampled frames (approximately 80% discarded) are used as context rather than consecutive frames. Benefits include:

- Reducing redundancy and encouraging the model to learn deeper chunk prediction capabilities.
- Enhancing robustness to out-of-distribution (OOD) scenarios.
- At inference, sliding-window smoothing enables seamless transitions while conserving GPU memory.
Ablation: without sparse memory, the LIBERO-Long success rate is only 30.8, versus 73.0 with it.
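The sparse-context sampling and chunk-wise rollout can be sketched as follows. This is a minimal illustration, not the paper's implementation: all function names are hypothetical, and `denoise_chunk` stands in for the video diffusion model.

```python
import random

def sample_sparse_context(frames, keep_ratio=0.2, seed=0):
    """Keep roughly 20% of past frames (about 80% discarded) as sparse
    context. The first frame is always kept to anchor the scene."""
    rng = random.Random(seed)
    return [frames[0]] + [f for f in frames[1:] if rng.random() < keep_ratio]

def rollout(denoise_chunk, first_frame, chunk_len=8, max_chunks=16,
            is_eos=lambda f: f is None):
    """Chunk-wise autoregressive rollout: each newly denoised chunk becomes
    clean context for the next iteration, until an EOS frame is detected."""
    video = [first_frame]
    for _ in range(max_chunks):
        context = sample_sparse_context(video)
        chunk = denoise_chunk(context, chunk_len)  # predict the next chunk
        for frame in chunk:
            if is_eos(frame):                      # EOS frame terminates generation
                return video
            video.append(frame)
    return video
```

Because only a sparse subset of past frames is re-encoded each step, memory stays roughly constant no matter how long the generated sequence grows.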
3. Multi-view Diffusion Generation¶
Single-view generation is extended to multi-view video generation by:

- Encoding camera intrinsics and extrinsics via ray direction maps.
- Cross-view attention to ensure geometric consistency.
- Temporal attention to capture scene dynamics.
Pre-training on multi-view data establishes a 3D prior; at inference, auxiliary views are generated from a single camera with depth warping.
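A ray direction map encodes camera pose as an image-aligned conditioning signal. The sketch below shows one standard way to compute it (a hypothetical helper, not the paper's code): back-project each pixel through the intrinsics and rotate into world coordinates.

```python
import numpy as np

def ray_direction_map(K, R, height, width):
    """Per-pixel unit ray directions in world coordinates.
    K: 3x3 camera intrinsics, R: 3x3 camera-to-world rotation."""
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3) homogeneous pixels
    cam_dirs = pix @ np.linalg.inv(K).T               # back-project to camera-frame rays
    world_dirs = cam_dirs @ R.T                       # rotate rays into the world frame
    return world_dirs / np.linalg.norm(world_dirs, axis=-1, keepdims=True)
```

The resulting (H, W, 3) map can be concatenated channel-wise with the video latents, giving the diffusion model a dense, per-pixel description of each camera's geometry.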
4. EnerVerse-D Data Flywheel¶
Combining the generative model with 4D Gaussian Splatting:

1. Sparse real observations are complemented by the generative model to produce multi-view videos.
2. 4DGS reconstructs the 4D scene and renders high-fidelity images.
3. Rendered images are fed back to the generative model for further refinement, forming an iterative loop.
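The loop structure can be sketched schematically; the function names below are placeholders for the generator, the 4DGS reconstruction, and the renderer, not the paper's API:

```python
def data_flywheel(observations, generate_views, fit_4dgs, render, iters=3):
    """One plausible shape of the EnerVerse-D loop:
    generate -> reconstruct -> render -> refine, with renders fed back
    as conditioning for the next round."""
    videos = generate_views(observations)  # 1. complete sparse obs into multi-view videos
    for _ in range(iters):
        scene = fit_4dgs(videos)           # 2. reconstruct the 4D Gaussian scene
        renders = render(scene)            # 3. render high-fidelity images
        videos = generate_views(renders)   # 4. refine with the generator; repeat
    return videos
```

Each pass tightens the agreement between the generated videos and a geometrically consistent 4D reconstruction, which is what narrows the Sim2Real gap.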
5. EnerVerse-A Policy Head¶
Visual features \(E\) are extracted from the first denoising step of the UNet intermediate layers, cached, and passed to a DiT action head. The head predicts action chunks (\(\tau\) steps \(\times\) 7-DoF delta pose). Inference of 8-step actions takes approximately 280 ms on a single RTX 4090.
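As a shape-level illustration of the interface (the linear map below is a toy stand-in for the real DiT head, and all names are hypothetical):

```python
import numpy as np

def policy_head(unet_features, tau=8, dof=7, w=None, rng=None):
    """Toy stand-in for the DiT action head: map cached UNet features E
    (from the first denoising step) to an action chunk of tau steps,
    each a 7-DoF delta pose."""
    feat = unet_features.reshape(-1)  # flatten cached visual features E
    rng = rng or np.random.default_rng(0)
    if w is None:
        w = rng.standard_normal((feat.size, tau * dof)) * 0.01
    return (feat @ w).reshape(tau, dof)  # (tau, 7) action chunk
```

Caching features from the first denoising step is what keeps inference cheap: the full iterative denoising of the video is not needed to produce actions.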
Key Experimental Results¶
LIBERO Benchmark¶
| Model | Visual Input | Spatial | Object | Goal | Long | Avg |
|---|---|---|---|---|---|---|
| Diffusion Policy | S-RGB | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
| OpenVLA | S-RGB | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| | S-RGB x2 | 76.0 | 90.0 | 82.0 | 78.0 | 81.5 |
| EnerVerse | S-RGB | 92.1 | 93.2 | 78.1 | 73.0 | 84.1 |
| EnerVerse | RGB+2Render | 91.2 | 97.7 | 85.0 | 80.0 | 88.5 |
CALVIN (ABC → D)¶
| Method | Input | 1 | 2 | 3 | 4 | 5 | Avg Len |
|---|---|---|---|---|---|---|---|
| RoboFlamingo | S-RGB, G-RGB | 82.4 | 61.9 | 46.6 | 33.1 | 23.5 | 2.47 |
| GR-1 | S-RGB, G-RGB, P | 85.4 | 71.2 | 59.6 | 49.7 | 40.1 | 3.06 |
| EnerVerse | S-RGB | 90.8 | 73.0 | 57.3 | 43.7 | 35.6 | 3.00 |
Ablation Study on Training Strategy (LIBERO-Spatial)¶
| Strategy | Success Rate |
|---|---|
| Train from scratch | Failed |
| Load general pre-training | 79 |
| Single-stage joint training | 86.3 |
| Two-stage fine-tuning | 92.1 |
Highlights & Insights¶
- The combination of chunk-wise autoregression and sparse memory enables sequence generation of, in principle, unbounded length.
- Multi-view diffusion priors allow single-camera deployment to benefit from 3D spatial understanding.
- The 4DGS data flywheel elegantly addresses the Sim2Real gap.
- A unified backbone simultaneously supports video generation and action prediction.
Limitations & Future Work¶
- Video generation inevitably introduces artifacts, which are particularly pronounced in highly dynamic robot scenarios.
- Rendered viewpoints are currently determined heuristically; Next-Best-View methods have not been integrated.
- The relationship between video generation quality and control success rate is not yet well understood.
- The data flywheel operates offline; online adaptation has not been realized.
Related Work & Insights¶
- vs. AVID: simply adapts DynamiCrafter without 3D priors; EnerVerse's multi-view pre-training provides spatial understanding.
- vs. Diffusion Policy: directly learns actions without video generation priors; EnerVerse leverages video imagination to enhance the policy.
- vs. OpenVLA: a 7B VLA model that EnerVerse surpasses with a much smaller backbone.
- vs. GR-2: also adopts video pre-training but remains in 2D; EnerVerse extends to 4D.
- Video generation as a pre-training task for robotic policy learning is a promising paradigm.
- The 4DGS data flywheel concept generalizes to other domains requiring cross-domain data augmentation.
- Single-camera plus depth warping for multi-view generation is a practical deployment strategy.
Rating¶
- ⭐ Novelty: 4/5 — A well-designed 4D embodied space generation framework with a creative data flywheel.
- ⭐ Experimental Thoroughness: 4.5/5 — Comprehensive validation on LIBERO, CALVIN, and real-world settings with detailed ablations.
- ⭐ Writing Quality: 3.5/5 — Content-rich but somewhat verbose; core contributions could be more prominently highlighted.
- ⭐ Value: 4/5 — Provides a unified paradigm of video generation and policy learning for embodied intelligence.