# Image Generation as a Visual Planner for Robotic Manipulation
- Conference: CVPR 2026
- arXiv: 2512.00532
- Code: GitHub
- Area: Image Generation / Robotic Manipulation
- Keywords: Visual Planning, Robotic Manipulation, Diffusion Models, Grid Image Generation, LoRA
## TL;DR
This work adapts a pretrained image generation model (DiT) via LoRA fine-tuning into a visual planner for robotic manipulation, generating temporally coherent action sequences in the form of \(3\times3\) grid images, supporting both text-conditioned and trajectory-conditioned control modes.
## Background & Motivation
Generating realistic robotic manipulation videos is a critical step toward unified perception-planning-action systems. Existing video diffusion models require large domain-specific datasets, exhibit limited generalization, and incur high computational costs. Meanwhile, large-scale image generation models (e.g., FLUX.1-dev), trained on language-image pairs, demonstrate strong compositional generation capabilities—arranging multiple semantically consistent sub-images within a single grid layout, implicitly exhibiting temporal transitions reminiscent of short video clips.
The central hypothesis of this paper is that pretrained image generators already encode transferable temporal priors, and lightweight LoRA fine-tuning is sufficient to repurpose them as visual planners for robotic manipulation, without designing dedicated video architectures.
## Method
### Overall Architecture
The framework reformulates robotic manipulation video generation as a \(3\times3\) grid image problem. Nine frames are uniformly sampled from each manipulation video and arranged in a serpentine layout (\(1\to2\to3\), \(6\leftarrow5\leftarrow4\), \(7\to8\to9\)) to form a grid image. During training, only the top-left first frame is retained as the conditioning input (the remaining cells are masked to black), and the model learns to predict the complete 9-frame grid. The method is built upon FLUX.1-dev (DiT architecture) and employs LoRA adapters for parameter-efficient fine-tuning.
### Key Designs
- Serpentine Grid Layout: Nine frames are arranged in serpentine order within a \(3\times3\) grid, ensuring that temporally adjacent frames are also spatially adjacent. This design exploits the local attention mechanism of the Transformer to model short-range temporal dependencies, achieving inter-frame consistency without explicit temporal modeling. The grid is constructed as

    \[
    \mathbf{D} = \begin{bmatrix} f_1 & f_2 & f_3 \\ f_6 & f_5 & f_4 \\ f_7 & f_8 & f_9 \end{bmatrix},
    \]

    where \(f_i\) denotes the \(i\)-th sampled frame, so that any two temporally adjacent frames are also physically adjacent in the grid, allowing local attention to capture continuous motion.
- Dual-Mode Conditioning:
    - Text-conditioned generation: Given a language instruction (e.g., "pick up the red cup") and the first frame, text embeddings \(c_{\text{text}} = \{e_{\text{clip}}, E_{\text{t5}}\}\) are encoded via CLIP and T5 and injected into the DiT via cross-attention. This mode emphasizes semantic understanding: the model must interpret high-level semantics and translate them into a plausible action sequence.
    - Trajectory-conditioned generation: A 2D end-effector trajectory (color-coded red→blue to indicate temporal progression) is rendered onto the first frame, and the composited image replaces the first frame as the conditioning input \(\tilde{\mathbf{D}}^{\tau}\). This mode emphasizes spatial precision: the model generates actions along the provided trajectory path.
- Parameter-Efficient Adaptation via LoRA: LoRA is applied to the query/value projections of the self-attention layers and to the feed-forward layers of the DiT (with low rank \(r \ll d\)), training only \(O(rd)\) parameters instead of \(O(d^2)\). This enables efficient transfer from general image generation to the robotic video domain without increasing inference latency; see the sketch after this list.
### Loss & Training
A latent-space MSE loss is employed: \(\mathcal{L}_{\text{lat}} = \|\mathcal{E}(\mathbf{D}_{\text{gt}}) - \mathcal{E}(\hat{\mathbf{D}})\|_2^2\), where \(\mathcal{E}\) denotes the VAE encoder. The model performs single-pass grid generation (non-autoregressive): it predicts the complete 9-frame grid in a single pass rather than frame by frame, leveraging the compositional priors of the image generation model for implicit temporal reasoning.
## Key Experimental Results
### Main Results
| Dataset | Conditioning | FVD↓ | SSIM↑ | MSE↓ | Success↑ |
|---|---|---|---|---|---|
| JacoPlay | Text | 490.7 | 0.797 | 0.00695 | 80.6% |
| JacoPlay | Traj | 503.37 | 0.802 | 0.00680 | 74.0% |
| BridgeV2 | Text | 644.2 | 0.733 | 0.0135 | 73.2% |
| BridgeV2 | Traj | 693.2 | 0.726 | 0.0152 | 70.9% |
| RT-1 | Text | 698.0 | 0.727 | 0.0118 | 72.4% |
| RT-1 | Traj | 688.1 | 0.731 | 0.0117 | 81.7% |
### Ablation Study (BridgeV2)
| Configuration | FVD↓ | SSIM↑ | Success↑ | Note |
|---|---|---|---|---|
| Full (Text) | 644.2 | 0.733 | 73.2% | Full model (text-conditioned) |
| Full (Traj) | 693.2 | 0.726 | 70.9% | Full model (trajectory-conditioned) |
| w/o LoRA | 4377.1 | 0.064 | 0% | Frozen backbone fails completely |
| w/o Prompt Template | 843.4 | 0.754 | 2.5% | Severe degradation in semantic guidance |
| w/o Trajectory Overlay | 720.0 | 0.749 | 3.9% | Loss of spatial control |
### Key Findings
- LoRA is a critical component: removing it causes FVD to surge from 644 to 4377, with success rate dropping to 0%.
- The prompt template is essential for semantic understanding: its removal reduces success rate from 73.2% to 2.5%.
- Text conditioning achieves higher success rates than trajectory conditioning on JacoPlay and BridgeV2 (semantic following), while trajectory conditioning is stronger on RT-1 (spatial following).
- The two conditioning modes are complementary: text excels at semantic reasoning, while trajectory excels at geometric precision.
## Highlights & Insights
- Novel perspective: This work is the first to systematically demonstrate that pretrained image generation models can serve as visual planners for robotics—converting an image generator into a video synthesizer via LoRA fine-tuning alone.
- Extremely simple temporal modeling: Temporal consistency is achieved entirely through grid layout and local attention, without any dedicated temporal modules.
- Cost efficiency: The approach requires neither large-scale video datasets nor specialized video architectures, leveraging the compositional priors of pretrained image generators with only LoRA fine-tuning.
## Limitations & Future Work
- Occasional color/texture inconsistencies appear at grid tile boundaries, with minor misalignment at seams.
- The 9-frame sequence length is relatively short, making it difficult to cover long-horizon manipulation tasks.
- Success rate is evaluated via visual inspection and has not been validated in closed-loop real robot execution.
- Evaluation is limited to 3 datasets; cross-domain generalization (e.g., from JacoPlay to BridgeV2) remains unverified.
- Trajectory conditioning requires a pre-provided 2D trajectory, limiting its practical utility for autonomous planning.
## Related Work & Insights
- RIGVid: Estimates 6-DoF trajectories from AI-generated task videos for execution on real robots.
- Gen2Act: Generates human execution videos and conditions policies on them to generalize to new scenes.
- ControlNet: Injects spatial conditioning via a branch into a frozen text-to-image model, inspiring the trajectory conditioning design in this work.
- Insight: The compositional priors of image generation models may be broadly applicable to other planning tasks—such as navigation path planning and assembly sequence planning.
## Rating
- Novelty: ⭐⭐⭐⭐ The idea of repurposing image generators as visual planners is novel and thought-provoking.
- Experimental Thoroughness: ⭐⭐⭐ Three datasets with complete ablations, but direct comparison with video generation baselines and real-robot validation are lacking.
- Writing Quality: ⭐⭐⭐ The structure is clear, though the mathematical descriptions are somewhat repetitive.
- Value: ⭐⭐⭐ Offers an interesting research direction, but practical value is limited by the absence of closed-loop execution validation.