4D Visual Pre-training for Robot Learning

  • Conference: ICCV 2025
  • arXiv: 2508.17230
  • Code: https://github.com/JackHck/FVP
  • Area: 3D Vision / Robot Learning
  • Keywords: Point cloud pre-training, diffusion models, imitation learning, 3D representation learning, robot manipulation

TL;DR

FVP formulates 3D visual pre-training as a next-point-cloud-prediction problem, training a conditional diffusion model to predict the current-frame point cloud from historical-frame point clouds. This approach achieves a 28% average success rate improvement over DP3 across 12 real-world manipulation tasks, establishing a new state of the art.

Background & Motivation

  • Existing robot visual representation learning predominantly relies on 2D image pre-training (e.g., R3M, MVP, VC-1), neglecting the inherently 3D nature of the physical world.
  • 3D point clouds have demonstrated superior efficiency and generalization over 2D images as visual inputs for robot manipulation (e.g., DP3, RISE).
  • However, large-scale 3D data is scarce on the internet, making it difficult to train general-purpose 3D representations from web data as done in the 2D domain.
  • Existing 3D pre-training methods (contrastive learning, masked reconstruction) do not sufficiently exploit temporal information for understanding robot motion dynamics.

Core Problem

  1. How can a general-purpose 3D visual pre-training method be designed to improve robot manipulation performance in the absence of large-scale 3D data?
  2. How should the pre-training objective be designed so that the visual encoder learns the dynamic changes of the physical environment and robot motion characteristics?
  3. Can the pre-training framework generalize across different 3D encoders, datasets, robot platforms (single-arm / dual-arm / humanoid), and large VLA models?

Method

Overall Architecture

FVP (4D Visual Pre-training) formulates the visual pre-training objective as next-point-cloud-prediction. The pipeline consists of two stages:

  1. Pre-training stage: A 3D visual encoder maps the previous-frame point cloud \(o_{t-1}\) to a latent representation \(\mathbf{z}\), which then conditions a point cloud diffusion model to iteratively denoise Gaussian noise into the current-frame point cloud \(o_t\).
  2. Downstream fine-tuning: The pre-trained 3D visual encoder replaces the encoder in a downstream imitation learning method (e.g., DP3, RISE) and is fine-tuned end-to-end.
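A minimal PyTorch-style sketch of this two-stage pipeline is given below. All names (`pretrain`, `finetune`, `fvp_loss`, `policy.point_encoder`) are illustrative stand-ins under my assumptions, not the authors' code.

```python
def pretrain(encoder, denoiser, dataloader, optimizer, num_epochs=100):
    """Stage 1: next-point-cloud-prediction pre-training.
    `encoder` maps a (B, N, 3) point cloud to per-point latents (B, N, C_v);
    `denoiser` stands in for the conditional diffusion backbone
    (Point-Voxel Diffusion in the paper)."""
    for _ in range(num_epochs):
        for o_prev, o_curr in dataloader:   # consecutive point cloud frames
            # fvp_loss is sketched under "Loss & Training" below
            loss = fvp_loss(encoder, denoiser, o_prev, o_curr)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return encoder

def finetune(policy, pretrained_encoder):
    """Stage 2: swap the pre-trained encoder into a downstream policy (e.g. DP3/RISE)
    and keep it trainable -- the paper's ablation shows freezing it hurts."""
    # `point_encoder` is a hypothetical attribute name for the policy's visual encoder
    policy.point_encoder.load_state_dict(pretrained_encoder.state_dict())
    for p in policy.point_encoder.parameters():
        p.requires_grad_(True)
    return policy
```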

Key Designs

  1. Next-Point-Cloud-Prediction objective: Unlike contrastive learning (same timestep as positive pairs, different timesteps as negatives) or masked point cloud reconstruction, FVP predicts the next-frame point cloud conditioned on the previous-frame observation. This enables the visual model to capture robot motion characteristics and temporal environment dynamics, acquiring the behavioral information critical for imitation learning.
  2. Conditional diffusion model: A Point-Voxel Diffusion network serves as the denoising backbone. The latent representation \(\mathbf{z} \in \mathbb{R}^{N \times C_v}\) output by the visual encoder is concatenated with the noisy point cloud \(o_t^T \in \mathbb{R}^{N \times 3}\) to form \(o_t^{T,+} \in \mathbb{R}^{N \times (C_v+3)}\), from which the diffusion model predicts the noise \(\epsilon\) (tensor shapes are sketched after this list).
  3. Universal encoder interface: FVP imposes no constraints on the 3D encoder architecture, supporting PointNet++, Point Transformer, DP3 Encoder, RISE Encoder, and others, making it a plug-and-play pre-training module.
  4. Flexible pre-training data: FVP supports both in-domain (small-scale task demonstration data) and out-of-domain (e.g., the large-scale RoboMind dataset) pre-training.
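To make the conditioning in design (2) concrete, here is a small tensor-shape sketch; the dimensions are chosen arbitrarily for illustration.

```python
import torch

B, N, C_v = 8, 1024, 64           # batch size, number of points, encoder feature dim (illustrative)
z = torch.randn(B, N, C_v)        # per-point latent from the encoder on o_{t-1}
o_t_noisy = torch.randn(B, N, 3)  # noisy current-frame point cloud at diffusion step T
o_t_plus = torch.cat([o_t_noisy, z], dim=-1)
assert o_t_plus.shape == (B, N, C_v + 3)
# o_t_plus and the diffusion timestep are fed to the Point-Voxel Diffusion
# denoiser, which predicts the per-point noise epsilon of shape (B, N, 3).
```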

Loss & Training

  • Pre-training loss: Standard \(L_2\) noise prediction loss of the diffusion model (a minimal training-step sketch follows this list): \(\mathcal{L} = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \mathbf{I})} \left[ \| \epsilon - \epsilon_\theta(o_t^{T,+}, T) \|_2^2 \right]\)
  • Downstream fine-tuning: The pre-trained encoder initializes the encoder of DP3/RISE and is fine-tuned end-to-end (unfrozen). Ablation experiments show that freezing the encoder significantly degrades performance due to domain gap.
  • Using a single previous frame (1 frame) as the conditioning input yields the best results; incorporating additional historical frames proves detrimental.
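A minimal sketch of one pre-training step, assuming a simple linear noise schedule and the placeholder `encoder`/`denoiser` interfaces from the pipeline sketch above; the paper's actual schedule and Point-Voxel Diffusion backbone are not reproduced here.

```python
import torch
import torch.nn.functional as F

def fvp_loss(encoder, denoiser, o_prev, o_curr, num_steps=1000):
    """One FVP pre-training step: predict the noise added to the current-frame
    point cloud, conditioned on the previous-frame latent."""
    B, N, _ = o_curr.shape
    betas = torch.linspace(1e-4, 0.02, num_steps)         # assumed linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, num_steps, (B,))                 # random diffusion step per sample
    a = alpha_bar[t].view(B, 1, 1)
    eps = torch.randn_like(o_curr)                        # target noise
    o_t_noisy = a.sqrt() * o_curr + (1 - a).sqrt() * eps  # forward-diffused current frame

    z = encoder(o_prev)                                   # (B, N, C_v) condition from o_{t-1}
    o_t_plus = torch.cat([o_t_noisy, z], dim=-1)          # (B, N, C_v + 3)
    eps_pred = denoiser(o_t_plus, t)                      # predicted noise, (B, N, 3)
    return F.mse_loss(eps_pred, eps)                      # L2 noise-prediction loss
```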

Key Experimental Results

Simulation Experiments (Adroit + MetaWorld)

| Method | In-domain Avg. Gain | Out-of-domain Avg. Gain |
| --- | --- | --- |
| FVP (DP3) | +16.9% | +24.7% |

FVP surpasses 3D pre-training methods including PointMAE, STRL, and C2P, as well as 2D pre-training methods R3M and MVP, on both Adroit and MetaWorld benchmarks.

Real-World Experiments (12 Tasks, 4 Robot Types)

| Task | DP3 | DP3+FVP | RISE+FVP |
| --- | --- | --- | --- |
| PickSquare | 14/20 | 20/20 | 20/20 |
| PlaceBottle | 13/20 | 20/20 | 19/20 |
| PickPlace | 11/20 | 17/20 | 17/20 |
| FlipCup | 10/20 | 16/20 | 14/20 |
| Assembly | 6/20 | 13/20 | 13/20 |
| ArtiManip | 7/20 | 16/20 | 13/20 |

FVP achieves absolute success rate improvements of 15%–55% across real-world tasks.

2D Pre-training vs. FVP (using DP3 as the policy)

| Method | Average |
| --- | --- |
| R3M | 12.5/20 |
| MVP | 15.5/20 |
| MAE (Soup-1M+100 DoH) | 15.3/20 |
| DP3+FVP | 16.4/20 |

VLA Model (RDT-1B) Results

| Input | PickSquare | PlaceBottle | PutBox | StackBowl | WipePlate |
| --- | --- | --- | --- | --- | --- |
| 2D Image | 12/20 | 10/20 | 6/20 | 8/20 | 3/20 |
| 2D Image + R3M | 15/20 | 12/20 | 7/20 | 11/20 | 4/20 |
| 3D Point Cloud | 14/20 | 12/20 | 9/20 | 13/20 | 4/20 |
| 3D + FVP Pre-training | 18/20 | 17/20 | 9/20 | 16/20 | 5/20 |

Ablation Study

  1. Historical-frame vs. current-frame conditioning: Using the previous frame as the condition significantly outperforms using the current frame (e.g., PickSquare: 20/20 vs. 15/20), confirming the importance of temporal information for pre-training.
  2. Frozen vs. fine-tuned encoder: Freezing the pre-trained encoder causes a sharp performance drop on downstream tasks (e.g., PickSquare: 20/20 → 11/20), indicating that domain gap between out-of-domain pre-training and in-domain tasks necessitates end-to-end fine-tuning.
  3. Number of historical frames: A single frame yields the best results; performance degrades monotonically as more frames are added (PickSquare: 20→19→17→15 for 1/2/3/4 frames), suggesting that excessive historical context introduces noise.

Highlights & Insights

  1. Novel pre-training paradigm: FVP is the first to adopt next-point-cloud-prediction as a 3D visual pre-training objective, naturally incorporating temporal dynamics in a manner more suitable for robot tasks than contrastive learning or masked reconstruction.
  2. Strong generality: FVP supports arbitrary 3D encoders (PointNet++, Point Transformer, DP3 Encoder) and is compatible with diverse robot platforms (single-arm with gripper/dexterous hand, dual-arm, humanoid), and extends to large VLA models.
  3. Large-scale real-world validation: Consistent and substantial gains are demonstrated across 12 real-world manipulation tasks and 4 robot morphologies.
  4. Simplicity and effectiveness: The core idea is intuitive and straightforward to implement; as a plug-and-play module, FVP directly enhances existing 3D imitation learning methods.

Limitations & Future Work

  1. Dependence on point cloud data: The method requires datasets with depth or point cloud information. Large-scale open-source datasets such as Open-X-Embodiment lack camera parameters and depth data, precluding their direct use.
  2. Limited pre-training data scale: Due to the scarcity of 3D data, pre-training is currently conducted on small in-domain datasets or RoboMind; the effectiveness of truly web-scale large-scale pre-training has not yet been validated.
  3. End-to-end fine-tuning required: The significant performance drop when the encoder is frozen indicates that the pre-trained representations are not fully domain-agnostic, and domain gap remains a concern.
  4. Limited gains on VLA models: Improvements on language-conditioned and long-horizon tasks are relatively modest (e.g., long-horizon tasks improve from only 0/20 to 3/20), suggesting that 3D information provides limited benefit for high-level semantic reasoning.

Comparison with Related Methods

| Dimension | FVP | DP3 | RISE |
| --- | --- | --- | --- |
| Pre-training | ✅ Next-point-cloud-prediction | ❌ None | ❌ None |
| Input | 3D point cloud | 3D point cloud | 3D point cloud |
| Encoder | Universal (multiple supported) | Lightweight DP3 Encoder | Sparse conv + Transformer |
| Role | Pre-training module | End-to-end policy | End-to-end policy |

  • vs. PointMAE / STRL / C2P (3D pre-training methods): These methods rely on masked reconstruction or contrastive learning without exploiting temporal information. FVP introduces dynamic information by predicting the next frame, yielding substantial improvements in both simulation and real-world tasks.
  • vs. R3M / MVP (2D pre-training methods): Despite being pre-trained on far larger datasets (>300M samples), 2D representations are less effective than FVP's 3D pre-trained representations for 3D manipulation tasks.

Broader Implications

  1. Universality of the "next-X-prediction" paradigm: Next-token-prediction in LLMs, next-frame-prediction in video models, and next-point-cloud-prediction here all demonstrate that predicting the next temporal state is a highly effective self-supervised objective across modalities.
  2. 3D vs. 2D for robot manipulation: This work further substantiates the advantage of 3D point cloud inputs for robot manipulation, particularly in scenarios requiring fine-grained 3D spatial perception such as dexterous hand manipulation.
  3. Pre-training + fine-tuning in robotics: With the emergence of 3D robot datasets such as RoboMind, large-scale 3D pre-training is poised to become a new trend.
  4. Extensibility to other 3D tasks: The next-point-cloud-prediction objective may also benefit other domains requiring dynamic 3D perception, such as autonomous driving and 3D scene understanding.

Rating

  • Novelty: ⭐⭐⭐⭐ Introducing the next-prediction paradigm to 3D point cloud pre-training is a novel contribution, though the diffusion model itself is an existing tool.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 12 real-world tasks, 4 robot types, simulation and real-world evaluation, multiple encoders, VLA extension, and comprehensive ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure and well-organized experiments; some notation and descriptions could be made more concise.
  • Value: ⭐⭐⭐⭐⭐ A simple yet effective general-purpose method with large-scale real-world validation; of significant reference value to the 3D robot learning community.