EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

Conference: NeurIPS 2025 arXiv: 2501.01895 Code: To be confirmed Area: 3D Vision Keywords: embodied AI, video diffusion, multi-view generation, robotic manipulation, 4D Gaussian Splatting

TL;DR

EnerVerse is a generative robotic foundation model that constructs a 4D embodied space via chunk-wise autoregressive video diffusion, sparse context memory, and multi-view generation priors. Combined with a 4DGS data flywheel to narrow the Sim2Real gap, it translates 4D world representations into physical actions through a policy head, achieving state-of-the-art performance on the LIBERO benchmark.

Background & Motivation

  • Video generation models have made significant advances in spatiotemporal imagination, naturally motivating their application to robotic action planning.
  • Existing methods naively adapt general-purpose video generation models to robotic tasks, overlooking the substantial gap between 2D video representation spaces and 3D robot environments.
  • Multi-view observation is critical for robotic manipulation (resolving occlusion and motion ambiguity), yet multi-camera calibration and data collection are prohibitively costly.
  • The Sim2Real gap remains a core bottleneck for large-scale application of simulated data.

Core Problem

How can a unified framework be built that generates high-quality 4D embodied spaces and translates them directly into physical actions, while simultaneously addressing the scarcity of multi-view data and the Sim2Real gap?

Method

1. Chunk-wise Autoregressive Video Diffusion

The minimal unit of future space is defined as a chunk. The model iteratively predicts the next chunk to extend the space. Training optimizes a denoising objective:

\[\min_{\theta} \mathbb{E}_{t, \mathbf{z}, \boldsymbol{\epsilon}} \|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\mathbf{z}_t^{1:M}, \mathbf{o}_t^{1:K}, t)\|_2^2\]

Here \(M\) is the number of frames per chunk and \(K\) the number of conditioning observations. At inference, newly denoised frames serve as clean inputs for the next iteration, and generation terminates upon detection of an end-of-sequence (EOS) frame. The v-prediction parameterization is adopted.
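
A minimal sketch of this rollout, with `denoise_chunk` (runs the full reverse diffusion for one chunk, conditioned on clean context frames) and `is_eos` (detects the end-of-sequence frame) as user-supplied callables; the names are illustrative, not from the released code:

```python
import torch

@torch.no_grad()
def chunkwise_rollout(denoise_chunk, is_eos, context, chunk_shape, max_chunks=32):
    """Extend the generated sequence one chunk at a time."""
    frames = list(context)                 # clean conditioning frames
    for _ in range(max_chunks):
        noise = torch.randn(chunk_shape)   # fresh noise for the next chunk
        chunk = denoise_chunk(noise, torch.stack(frames))
        frames.extend(chunk.unbind(0))     # denoised frames join the context
        if is_eos(chunk):                  # terminate at the EOS frame
            break
    return torch.stack(frames)
```

In practice, the context passed to each step would itself be subsampled by the sparse memory mechanism described next.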

2. Sparse Memory Mechanism

During training, sparsely sampled frames (approximately 80% of candidate frames are discarded) are used as context rather than consecutive frames. Benefits include:

  • Reducing redundancy and encouraging the model to learn deeper chunk-prediction capabilities.
  • Enhancing robustness to out-of-distribution (OOD) scenarios.
  • At inference, sliding-window smoothing enables seamless transitions while conserving GPU memory.
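
A minimal sketch of the sparse sampling, assuming past frames are stacked along the first tensor dimension (the names and exact ratio are illustrative):

```python
import torch

def sparse_context(past_frames, keep_ratio=0.2):
    """Randomly keep ~20% of the history as context (temporally ordered),
    instead of conditioning on the immediately preceding frames."""
    T = past_frames.shape[0]
    k = max(1, int(T * keep_ratio))
    idx = torch.randperm(T)[:k].sort().values  # random but time-ordered subset
    return past_frames[idx]
```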

Ablation: without sparse memory, the LIBERO-Long success rate drops to 30.8%, versus 73.0% with it.

3. Multi-view Diffusion Generation

Single-view generation is extended to multi-view video generation by:

  • Encoding camera intrinsics and extrinsics as per-pixel ray direction maps (a sketch follows below).
  • Applying cross-view attention to enforce geometric consistency.
  • Applying temporal attention to capture scene dynamics.

Pre-training on multi-view data establishes a 3D prior; at inference, auxiliary views are generated from a single camera with depth warping.
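
One common way to construct such ray maps, assuming a standard pinhole model with world-to-camera extrinsics (EnerVerse's exact convention may differ):

```python
import torch
import torch.nn.functional as F

def ray_direction_map(K, R, t, H, W):
    """Per-pixel ray directions in world coordinates for a pinhole camera.

    K: (3, 3) intrinsics; R, t: world-to-camera rotation and translation.
    t only fixes the camera origin (-R^T @ t); the directions themselves
    depend on K and R. Returns an (H, W, 3) map of unit ray directions."""
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32),
                            indexing="ij")
    # Homogeneous pixel centers, back-projected through the inverse intrinsics.
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)
    cam_dirs = pix @ torch.linalg.inv(K).T   # rays in the camera frame
    world_dirs = cam_dirs @ R                # (R^T d) for each pixel
    return F.normalize(world_dirs, dim=-1)
```

Such maps are typically concatenated channel-wise with the frame latents, giving the diffusion backbone a resolution-aligned encoding of each camera's pose.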

4. EnerVerse-D Data Flywheel

Combining the generative model with 4D Gaussian Splatting (4DGS) forms an iterative loop (sketched below):

  1. Sparse real-world observations are completed by the generative model into multi-view videos.
  2. 4DGS reconstructs the 4D scene and renders high-fidelity images.
  3. The rendered images are fed back to the generative model for further refinement.
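
In pseudocode, the loop alternates between generation and reconstruction; the four callables below are placeholders for the paper's components, not actual APIs:

```python
def data_flywheel(complete_views, fit_4dgs, render, refine,
                  sparse_obs, target_views, num_rounds=3):
    """EnerVerse-D sketch: generate -> reconstruct -> render -> refine."""
    videos = complete_views(sparse_obs, target_views)  # 1. fill in missing views
    for _ in range(num_rounds):
        scene = fit_4dgs(videos)                 # 2. reconstruct the 4D scene
        renders = render(scene, target_views)    #    render high-fidelity images
        videos = refine(renders)                 # 3. feed back and refine
    return videos
```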

5. EnerVerse-A Policy Head

Visual features \(E\) are extracted from intermediate UNet layers at the first denoising step, cached, and passed to a DiT action head that predicts action chunks of \(\tau\) steps \(\times\) 7-DoF delta poses. Inferring an 8-step action chunk takes approximately 280 ms on a single RTX 4090.
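
A sketch of the feature caching with a forward hook (the hook-based interface and argument names are assumptions; the paper only states that intermediate UNet features from the first denoising step are reused):

```python
import torch
import torch.nn as nn

def first_step_features(unet: nn.Module, mid_block: nn.Module, z_T, t, cond):
    """Run only the first denoising step and capture the activations of an
    intermediate UNet block via a forward hook. The cached features E are
    then fed to the DiT action head, which outputs a (tau, 7) action chunk
    of delta end-effector poses."""
    feats = {}
    handle = mid_block.register_forward_hook(
        lambda module, inputs, output: feats.update(E=output))
    with torch.no_grad():
        unet(z_T, t, cond)  # single forward pass; no full reverse diffusion
    handle.remove()
    return feats["E"]
```

Because only one denoising step is executed, the visual features come at a fraction of the cost of full video generation, consistent with the reported ~280 ms latency.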

Key Experimental Results

LIBERO Benchmark

All numbers are success rates (%); S-RGB denotes a single static RGB camera.

Model | Visual Input | Spatial | Object | Goal | Long | Avg
Diffusion Policy | S-RGB | 78.3 | 92.5 | 68.3 | 50.5 | 72.4
OpenVLA | S-RGB | 84.7 | 88.4 | 79.2 | 53.7 | 76.5
MAIL | S-RGB ×2 | 76.0 | 90.0 | 82.0 | 78.0 | 81.5
EnerVerse | S-RGB | 92.1 | 93.2 | 78.1 | 73.0 | 84.1
EnerVerse | S-RGB + 2 rendered views | 91.2 | 97.7 | 85.0 | 80.0 | 88.5

CALVIN (ABC → D)

Columns 1–5 give the success rate (%) at completing chains of 1–5 consecutive tasks; Avg Len is the average number of tasks completed per chain (S-RGB: static camera RGB; G-RGB: gripper camera RGB; P: proprioceptive state).

Method | Input | 1 | 2 | 3 | 4 | 5 | Avg Len
RoboFlamingo | S-RGB, G-RGB | 82.4 | 61.9 | 46.6 | 33.1 | 23.5 | 2.47
GR-1 | S-RGB, G-RGB, P | 85.4 | 71.2 | 59.6 | 49.7 | 40.1 | 3.06
EnerVerse | S-RGB | 90.8 | 73.0 | 57.3 | 43.7 | 35.6 | 3.00

Ablation Study on Training Strategy (LIBERO-Spatial)

Strategy | Success Rate (%)
Train from scratch | Failed
Load general pre-training | 79.0
Single-stage joint training | 86.3
Two-stage fine-tuning | 92.1

Highlights & Insights

  • The combination of chunk-wise autoregression and sparse memory enables theoretically unlimited sequence generation.
  • Multi-view diffusion priors allow single-camera deployment to benefit from 3D spatial understanding.
  • The 4DGS data flywheel offers an elegant way to narrow the Sim2Real gap.
  • A unified backbone simultaneously supports video generation and action prediction.

Limitations & Future Work

  • Video generation inevitably introduces artifacts, which are particularly pronounced in highly dynamic robot scenarios.
  • Rendered viewpoints are currently determined heuristically; Next-Best-View methods have not been integrated.
  • The relationship between video generation quality and control success rate is not yet well understood.
  • The data flywheel operates offline; online adaptation has not been realized.

Comparison with Related Methods

  • vs. AVID: adapts DynamiCrafter to robotics without 3D priors; EnerVerse's multi-view pre-training provides genuine spatial understanding.
  • vs. Diffusion Policy: learns actions directly, without video-generation priors; EnerVerse leverages video imagination to strengthen the policy.
  • vs. OpenVLA: a 7B-parameter VLA model that EnerVerse surpasses with a substantially smaller model.
  • vs. GR-2: also adopts video pre-training but remains in 2D; EnerVerse extends the representation to 4D.

Takeaways

  • Video generation as a pre-training task for robotic policy learning is a promising paradigm.
  • The 4DGS data-flywheel concept generalizes to other domains that need cross-domain data augmentation.
  • Single-camera capture plus depth warping for multi-view generation is a practical deployment strategy.

Rating

  • ⭐ Novelty: 4/5 — A well-designed 4D embodied space generation framework with a creative data flywheel.
  • ⭐ Experimental Thoroughness: 4.5/5 — Comprehensive validation on LIBERO, CALVIN, and real-world settings with detailed ablations.
  • ⭐ Writing Quality: 3.5/5 — Content-rich but somewhat verbose; core contributions could be more prominently highlighted.
  • ⭐ Value: 4/5 — Provides a unified paradigm of video generation and policy learning for embodied intelligence.