EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

Conference: NeurIPS 2025 arXiv: 2501.01895 Code: To be confirmed Area: 3D Vision Keywords: embodied AI, video diffusion, multi-view generation, robotic manipulation, 4D Gaussian Splatting

TL;DR

EnerVerse is a generative robotic foundation model that constructs a 4D embodied space via chunk-wise autoregressive video diffusion, sparse context memory, and multi-view generation priors. Combined with a 4DGS data flywheel to narrow the Sim2Real gap, it translates 4D world representations into physical actions through a policy head, achieving state-of-the-art performance on the LIBERO benchmark.

Background & Motivation

  • Video generation models have made significant advances in spatiotemporal imagination, naturally motivating their application to robotic action planning.
  • Existing methods naively adapt general-purpose video generation models to robotic tasks, overlooking the substantial gap between 2D video representation spaces and 3D robot environments.
  • Multi-view observation is critical for robotic manipulation (resolving occlusion and motion ambiguity), yet multi-camera calibration and data collection are prohibitively costly.
  • The Sim2Real gap remains a core bottleneck for large-scale application of simulated data.

Core Problem

How can a unified framework be built that generates high-quality 4D embodied spaces and translates them directly into physical actions, while simultaneously addressing the scarcity of multi-view data and the Sim2Real gap?

Method

1. Chunk-wise Autoregressive Video Diffusion

The minimal unit of future space is defined as a chunk. The model iteratively predicts the next chunk to extend the space. Training optimizes a denoising objective:

\[\min_{\theta} \mathbb{E}_{t, \mathbf{z}, \boldsymbol{\epsilon}} \|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\mathbf{z}_t^{1:M}, \mathbf{o}_t^{1:K}, t)\|_2^2\]

Here \(M\) is the number of frames per chunk and \(K\) the number of conditioning observations. At inference, newly denoised frames serve as clean inputs for the next iteration, and generation terminates upon detection of an end-of-sequence (EOS) frame. The v-prediction parameterization is adopted.
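
A minimal sketch of this rollout, with `denoise_chunk` (runs the full reverse diffusion for one chunk, conditioned on clean context frames) and `is_eos` (detects the end-of-sequence frame) as user-supplied callables; the names are illustrative, not from the released code:

```python
import torch

@torch.no_grad()
def chunkwise_rollout(denoise_chunk, is_eos, context, chunk_shape, max_chunks=32):
    """Extend the generated sequence one chunk at a time."""
    frames = list(context)                 # clean conditioning frames
    for _ in range(max_chunks):
        noise = torch.randn(chunk_shape)   # fresh noise for the next chunk
        chunk = denoise_chunk(noise, torch.stack(frames))
        frames.extend(chunk.unbind(0))     # denoised frames join the context
        if is_eos(chunk):                  # terminate at the EOS frame
            break
    return torch.stack(frames)
```

In practice, the context passed to each step would itself be subsampled by the sparse memory mechanism described next.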

2. Sparse Memory Mechanism

During training, sparsely sampled frames (approximately 80% of candidate frames are discarded) are used as context rather than consecutive frames. Benefits include:

  • Reducing redundancy and encouraging the model to learn deeper chunk-prediction capabilities.
  • Enhancing robustness to out-of-distribution (OOD) scenarios.
  • At inference, sliding-window smoothing enables seamless transitions while conserving GPU memory.
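
A minimal sketch of the sparse sampling, assuming past frames are stacked along the first tensor dimension (the names and exact ratio are illustrative):

```python
import torch

def sparse_context(past_frames, keep_ratio=0.2):
    """Randomly keep ~20% of the history as context (temporally ordered),
    instead of conditioning on the immediately preceding frames."""
    T = past_frames.shape[0]
    k = max(1, int(T * keep_ratio))
    idx = torch.randperm(T)[:k].sort().values  # random but time-ordered subset
    return past_frames[idx]
```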

Ablation: without sparse memory, the LIBERO-Long success rate drops to 30.8%, versus 73.0% with it.

3. Multi-view Diffusion Generation

Single-view generation is extended to multi-view video generation by:

  • Encoding camera intrinsics and extrinsics as per-pixel ray direction maps (a sketch follows below).
  • Applying cross-view attention to enforce geometric consistency.
  • Applying temporal attention to capture scene dynamics.

Pre-training on multi-view data establishes a 3D prior; at inference, auxiliary views are generated from a single camera with depth warping.
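
One common way to construct such ray maps, assuming a standard pinhole model with world-to-camera extrinsics (EnerVerse's exact convention may differ):

```python
import torch
import torch.nn.functional as F

def ray_direction_map(K, R, t, H, W):
    """Per-pixel ray directions in world coordinates for a pinhole camera.

    K: (3, 3) intrinsics; R, t: world-to-camera rotation and translation.
    t only fixes the camera origin (-R^T @ t); the directions themselves
    depend on K and R. Returns an (H, W, 3) map of unit ray directions."""
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32),
                            indexing="ij")
    # Homogeneous pixel centers, back-projected through the inverse intrinsics.
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)
    cam_dirs = pix @ torch.linalg.inv(K).T   # rays in the camera frame
    world_dirs = cam_dirs @ R                # (R^T d) for each pixel
    return F.normalize(world_dirs, dim=-1)
```

Such maps are typically concatenated channel-wise with the frame latents, giving the diffusion backbone a resolution-aligned encoding of each camera's pose.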

4. EnerVerse-D Data Flywheel

Combining the generative model with 4D Gaussian Splatting (4DGS) forms an iterative loop (sketched below):

  1. Sparse real-world observations are completed by the generative model into multi-view videos.
  2. 4DGS reconstructs the 4D scene and renders high-fidelity images.
  3. The rendered images are fed back to the generative model for further refinement.
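
In pseudocode, the loop alternates between generation and reconstruction; the four callables below are placeholders for the paper's components, not actual APIs:

```python
def data_flywheel(complete_views, fit_4dgs, render, refine,
                  sparse_obs, target_views, num_rounds=3):
    """EnerVerse-D sketch: generate -> reconstruct -> render -> refine."""
    videos = complete_views(sparse_obs, target_views)  # 1. fill in missing views
    for _ in range(num_rounds):
        scene = fit_4dgs(videos)                 # 2. reconstruct the 4D scene
        renders = render(scene, target_views)    #    render high-fidelity images
        videos = refine(renders)                 # 3. feed back and refine
    return videos
```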

5. EnerVerse-A Policy Head

Visual features \(E\) are extracted from intermediate UNet layers at the first denoising step, cached, and passed to a DiT action head that predicts action chunks of \(\tau\) steps \(\times\) 7-DoF delta poses. Inferring an 8-step action chunk takes approximately 280 ms on a single RTX 4090.
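
A sketch of the feature caching with a forward hook (the hook-based interface and argument names are assumptions; the paper only states that intermediate UNet features from the first denoising step are reused):

```python
import torch
import torch.nn as nn

def first_step_features(unet: nn.Module, mid_block: nn.Module, z_T, t, cond):
    """Run only the first denoising step and capture the activations of an
    intermediate UNet block via a forward hook. The cached features E are
    then fed to the DiT action head, which outputs a (tau, 7) action chunk
    of delta end-effector poses."""
    feats = {}
    handle = mid_block.register_forward_hook(
        lambda module, inputs, output: feats.update(E=output))
    with torch.no_grad():
        unet(z_T, t, cond)  # single forward pass; no full reverse diffusion
    handle.remove()
    return feats["E"]
```

Because only one denoising step is executed, the visual features come at a fraction of the cost of full video generation, consistent with the reported ~280 ms latency.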

Key Experimental Results

LIBERO Benchmark

All numbers are success rates (%); S-RGB denotes a single static RGB camera.

Model | Visual Input | Spatial | Object | Goal | Long | Avg
Diffusion Policy | S-RGB | 78.3 | 92.5 | 68.3 | 50.5 | 72.4
OpenVLA | S-RGB | 84.7 | 88.4 | 79.2 | 53.7 | 76.5
MAIL | S-RGB ×2 | 76.0 | 90.0 | 82.0 | 78.0 | 81.5
EnerVerse | S-RGB | 92.1 | 93.2 | 78.1 | 73.0 | 84.1
EnerVerse | S-RGB + 2 rendered views | 91.2 | 97.7 | 85.0 | 80.0 | 88.5

CALVIN (ABC → D)

Columns 1–5 give the success rate (%) at completing chains of 1–5 consecutive tasks; Avg Len is the average number of tasks completed per chain (S-RGB: static camera RGB; G-RGB: gripper camera RGB; P: proprioceptive state).

Method | Input | 1 | 2 | 3 | 4 | 5 | Avg Len
RoboFlamingo | S-RGB, G-RGB | 82.4 | 61.9 | 46.6 | 33.1 | 23.5 | 2.47
GR-1 | S-RGB, G-RGB, P | 85.4 | 71.2 | 59.6 | 49.7 | 40.1 | 3.06
EnerVerse | S-RGB | 90.8 | 73.0 | 57.3 | 43.7 | 35.6 | 3.00

Ablation Study on Training Strategy (LIBERO-Spatial)

Strategy | Success Rate (%)
Train from scratch | Failed
Load general pre-training | 79.0
Single-stage joint training | 86.3
Two-stage fine-tuning | 92.1

Highlights & Insights

  • The combination of chunk-wise autoregression and sparse memory enables theoretically unlimited sequence generation.
  • Multi-view diffusion priors allow single-camera deployment to benefit from 3D spatial understanding.
  • The 4DGS data flywheel offers an elegant way to narrow the Sim2Real gap.
  • A unified backbone simultaneously supports video generation and action prediction.

Limitations & Future Work

  • Video generation inevitably introduces artifacts, which are particularly pronounced in highly dynamic robot scenarios.
  • Rendered viewpoints are currently determined heuristically; Next-Best-View methods have not been integrated.
  • The relationship between video generation quality and control success rate is not yet well understood.
  • The data flywheel operates offline; online adaptation has not been realized.

Comparison with Related Methods

  • vs. AVID: adapts DynamiCrafter to robotics without 3D priors; EnerVerse's multi-view pre-training provides genuine spatial understanding.
  • vs. Diffusion Policy: learns actions directly, without video-generation priors; EnerVerse leverages video imagination to strengthen the policy.
  • vs. OpenVLA: a 7B-parameter VLA model that EnerVerse surpasses with a substantially smaller model.
  • vs. GR-2: also adopts video pre-training but remains in 2D; EnerVerse extends the representation to 4D.

Takeaways

  • Video generation as a pre-training task for robotic policy learning is a promising paradigm.
  • The 4DGS data-flywheel concept generalizes to other domains that need cross-domain data augmentation.
  • Single-camera capture plus depth warping for multi-view generation is a practical deployment strategy.

Rating

  • ⭐ Novelty: 4/5 — A well-designed 4D embodied space generation framework with a creative data flywheel.
  • ⭐ Experimental Thoroughness: 4.5/5 — Comprehensive validation on LIBERO, CALVIN, and real-world settings with detailed ablations.
  • ⭐ Writing Quality: 3.5/5 — Content-rich but somewhat verbose; core contributions could be more prominently highlighted.
  • ⭐ Value: 4/5 — Provides a unified paradigm of video generation and policy learning for embodied intelligence.