Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning¶
- Conference: CVPR 2026
- arXiv: 2512.00961
- Code: https://qiwang067.github.io/genreward
- Area: Diffusion Models / Reinforcement Learning
- Keywords: Video Diffusion Models, Goal-Driven Reward, Reinforcement Learning, Forward-Backward Representation, World Knowledge Transfer
TL;DR¶
This paper proposes GenReward, a framework that fine-tunes a pre-trained video diffusion model to generate goal-conditioned videos, and derives two-level goal-driven reward signals—video-level and frame-level—to guide reinforcement learning agents without manually designed reward functions, achieving substantial improvements over baselines on Meta-World robotic manipulation tasks.
Background & Motivation¶
Background: Reinforcement learning relies on carefully designed reward functions to guide policy learning. However, crafting appropriate rewards demands domain expertise and generalizes poorly across tasks. Existing approaches such as RoboCLIP compute text/video–observation similarity via VLMs as rewards, Diffusion Reward uses the entropy of a conditional diffusion model as a reward signal, and TADPoLe computes zero-shot rewards using a frozen text-conditioned diffusion model.
Limitations of Prior Work: Existing methods do not fully exploit generated videos as goal-driven rewards to transfer the rich world knowledge embedded in generative models. (1) RoboCLIP and similar methods depend on expert demonstration videos; (2) Diffusion Reward utilizes only the entropy of the diffusion model rather than its generated content; (3) TADPoLe disregards action information and thus cannot provide fine-grained goal-achievement guidance. These approaches offer limited reward signal quality on complex tasks.
Key Challenge: Video diffusion models encode rich world knowledge (e.g., how objects are manipulated), yet prior work has not found an effective way to convert this knowledge into fine-grained, actionable reward signals.
Goal: (1) How can videos generated by diffusion models provide trajectory-level (video-level) rewards? (2) How can agents be guided at the frame level toward specific goal states? (3) How can action information be incorporated for more precise goal achievement?
Key Insight: The key idea is to fine-tune a pre-trained video diffusion model to generate goal-conditioned videos, then exploit the generated videos at two levels: (1) measuring trajectory-level alignment via the latent space of a video encoder; and (2) using CLIP to select the most relevant frame as the goal state and learning a forward-backward representation to estimate the probability of reaching that goal.
Core Idea: A fine-tuned video diffusion model generates goal videos; a video encoder computes video-level rewards from these videos, while a learned forward-backward representation yields frame-level rewards, enabling goal-driven reinforcement learning without manually engineered reward functions.
Method¶
Overall Architecture¶
GenReward consists of three stages: (a) fine-tuning a pre-trained video diffusion model (CogVideoX-5B-I2V) to support domain-specific goal-conditioned video generation; (b) computing video-level rewards as the latent-space cosine similarity between the agent's trajectory and the generated goal video, both encoded by the diffusion model's video encoder; (c) selecting a goal frame from the generated video via CLIP and learning a forward-backward (FB) representation to compute frame-level rewards. The final reward is \(r^{\text{gen}} = \alpha \cdot r^{\text{video}} + \beta \cdot r^{\text{FB}} + r^{\text{env}}\). The entire framework is built on top of the DreamerV3 world model.
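As a reading aid, a minimal sketch of this fusion, assuming scalar per-step rewards; \(\alpha\) and \(\beta\) are the paper's weighting hyperparameters, everything else is illustrative:

```python
def fused_reward(r_video: float, r_fb: float, r_env: float,
                 alpha: float, beta: float) -> float:
    """r_gen = alpha * r_video + beta * r_fb + r_env (illustrative sketch)."""
    return alpha * r_video + beta * r_fb + r_env
```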
Key Designs¶
- Video Diffusion Model Adaptation and Video-Level Reward (see the sketch after this list):
- Function: Generate goal-conditioned videos and provide trajectory-level behavioral imitation signals.
- Mechanism: The 3D Causal VAE of CogVideoX-5B-I2V encodes the agent's historical observations \(\mathbf{o}_{0:T}\) and the generated goal video \(\mathbf{V}^{\text{goal}}\) into latent vectors \(\mathbf{z}^v\) and \(\mathbf{z}^{\text{goal}}\), respectively. The video-level reward is computed as the cosine similarity \(r^{\text{video}} = \cos(\mathbf{z}^v, \mathbf{z}^{\text{goal}})\). To handle length mismatches, both sequences are uniformly sampled to 16 frames. This reward is computed once every 128 online interaction steps.
- Design Motivation: The encoder of a video diffusion model, pre-trained on large-scale video data, naturally encodes semantic understanding of action sequences in its latent space, making it more suitable than image-based models such as CLIP for measuring temporal sequence alignment.
- Frame-Level Goal Selection and Forward-Backward Representation (see the sketch after this list):
- Function: Provide fine-grained, action-aware goal-achievement rewards.
- Mechanism: (1) OpenCLIP computes the similarity between each frame of the generated video and the task description; the highest-scoring frame \(I^*\) is selected as the goal state. (2) A forward representation \(F: S \times A \times Z \to Z\) and a backward representation \(B: S \to Z\) are learned such that \(F(s,a,z)^\top B(s')\) approximates the long-term state occupancy probability of reaching \(s'\) from \((s,a)\). (3) The frame-level reward is \(r^{\text{FB}}(s,a,I^*) = F(s,a,\psi(I^*))^\top B(\psi(I^*))\), where \(\psi\) denotes the DINOv3 encoder. Training minimizes the Bellman residual and uses a target network for stability.
- Design Motivation: Video-level rewards reflect only overall trajectory similarity and lack fine-grained guidance toward specific goal states. The FB representation, which incorporates action information, estimates the probability of reaching the goal from the current state-action pair, enabling truly goal-driven action selection.
- Training Pipeline and Reward Fusion (schedule sketched after this list):
- Function: Balance learning stability and utilization of world knowledge.
- Mechanism: The FB network is trained for an initial 100K steps, then frozen for reward computation. During online interaction, the generative reward replaces the environment reward every \(\Delta_t\) steps; the original environment reward is used otherwise. Policy optimization and world model learning proceed within the DreamerV3 framework.
- Design Motivation: Pre-training the FB network ensures reward quality, while freezing it afterward prevents reward distribution drift from destabilizing policy learning.
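A minimal sketch of the video-level reward from the first design above. Here `vae_encode` stands in for the 3D Causal VAE encoder of CogVideoX-5B-I2V, and flattening the latent before the cosine is an implementation assumption, not a detail confirmed by the paper:

```python
import torch
import torch.nn.functional as F

def uniform_sample(frames: torch.Tensor, n: int = 16) -> torch.Tensor:
    """Uniformly subsample a (T, C, H, W) clip to n frames to align lengths."""
    idx = torch.linspace(0, frames.shape[0] - 1, steps=n).round().long()
    return frames[idx]

@torch.no_grad()
def video_level_reward(vae_encode, traj: torch.Tensor, goal: torch.Tensor) -> float:
    """r_video = cos(z_v, z_goal) between trajectory and goal-video latents."""
    z_v = vae_encode(uniform_sample(traj)).flatten()
    z_goal = vae_encode(uniform_sample(goal)).flatten()
    return F.cosine_similarity(z_v, z_goal, dim=0).item()
```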
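For the frame-level pathway of the second design, a sketch of CLIP-based goal-frame selection followed by the FB reward. `encode_image`/`encode_text` are placeholder CLIP encoders, `psi_goal` is a placeholder DINOv3 feature of the selected frame, and the MLP widths are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def select_goal_frame(encode_image, encode_text, frames, task_text: str):
    """Pick the generated frame whose CLIP embedding best matches the task text."""
    img = F.normalize(encode_image(frames), dim=-1)      # (T, d)
    txt = F.normalize(encode_text([task_text]), dim=-1)  # (1, d)
    return frames[(img @ txt.T).squeeze(-1).argmax()]

class ForwardNet(nn.Module):
    """F(s, a, z) -> Z: state-action embedding conditioned on goal feature z."""
    def __init__(self, s_dim: int, a_dim: int, g_dim: int, z_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim + g_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, z_dim))

    def forward(self, s, a, z):
        return self.net(torch.cat([s, a, z], dim=-1))

class BackwardNet(nn.Module):
    """B(s') -> Z: embedding of a candidate goal state (here, a DINOv3 feature)."""
    def __init__(self, g_dim: int, z_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(g_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, z_dim))

    def forward(self, s):
        return self.net(s)

@torch.no_grad()
def fb_reward(F_net: ForwardNet, B_net: BackwardNet, s, a, psi_goal):
    """r_FB(s, a, I*) = F(s, a, psi(I*))^T B(psi(I*))."""
    return (F_net(s, a, psi_goal) * B_net(psi_goal)).sum(dim=-1)
```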
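Finally, a sketch of the third design's reward-replacement schedule. \(\Delta_t\) is a hyperparameter; using 128 mirrors the video-level reward interval mentioned above and is an assumption here:

```python
def step_reward(step: int, r_gen: float, r_env: float, delta_t: int = 128) -> float:
    """Use the fused generative reward every delta_t online interaction steps,
    and the plain environment reward otherwise."""
    return r_gen if step % delta_t == 0 else r_env
```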
Loss & Training¶
Fine-tuning of the video diffusion model uses the standard denoising objective \(\|\hat{\epsilon}_\theta(\mathbf{x}_t, t, c_{\text{text}}, c_{\text{image}}) - \epsilon\|_2^2\). The FB representation is trained by minimizing the Bellman residual with a slow-moving average target network. The policy and value function are optimized within the DreamerV3 framework.
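A generic DDPM-style sketch of this denoising objective. CogVideoX trains in its VAE latent space and its exact noise parameterization may differ, so `eps_model` (the conditional denoiser) and `alpha_bar` (the cumulative noise schedule) are placeholders:

```python
import torch

def denoising_loss(eps_model, x0: torch.Tensor, t: torch.Tensor,
                   c_text, c_image, alpha_bar: torch.Tensor) -> torch.Tensor:
    """|| eps_hat(x_t, t, c_text, c_image) - eps ||^2 with
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast per sample
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
    return ((eps_model(x_t, t, c_text, c_image) - eps) ** 2).mean()
```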
Key Experimental Results¶
Main Results (Meta-World Dense Reward)¶
| Task | Dense Reward | RoboCLIP | Diffusion Reward | TADPoLe | GenReward |
|---|---|---|---|---|---|
| Pick Out of Hole | 193 | ~250 | ~300 | ~100 | 582 |
| Bin Picking | 398 | ~500 | ~450 | ~200 | 822 |
| Shelf Place | 154 | ~300 | ~350 | ~100 | 814 |
Ablation Study¶
| Configuration | Performance (Pick Place) | Notes |
|---|---|---|
| Full GenReward | Best | Complete model |
| w/o video-level reward | Significant drop | Without video-level reward, the agent fails to imitate generated video behavior |
| w/o FB reward | Moderate drop | Without frame-level reward, fine-grained goal-achievement capability is reduced |
Key Findings¶
- GenReward substantially surpasses the original dense reward on all three tasks—Pick Out of Hole, Bin Picking, and Shelf Place (193→582, 398→822, 154→814).
- TADPoLe performs worst across most tasks, indicating that a frozen text-conditioned diffusion model alone provides limited reward signal.
- Both excessively large and excessively small video-level reward weights \(\alpha\) degrade performance: too small a weight fails to drive imitation of the generated videos, while too large a weight hinders exploration.
- Videos generated from different source datasets (RT-1, RLBench, Bridge) consistently yield improvements, validating the robustness of world knowledge transfer.
- The quality of frame-level goal selection depends on CLIP's ability to align task descriptions with video frames.
Highlights & Insights¶
- This work is the first to use the generated output of a video diffusion model—rather than merely its internal representations—as goal-driven rewards for RL, marking a transition from "using diffusion models to understand the world" to "using diffusion models to guide action." The paradigm is potentially transferable to other RL scenarios that would otherwise require learning from demonstrations.
- The integration of the forward-backward representation endows the reward with action awareness, compensating for the limitation of pure visual similarity rewards that ignore environment dynamics. Treating "the probability of reaching the goal" as a reward signal is more informative than simple distance metrics.
- The framework operates without expert demonstrations by treating the diffusion model's generated videos as "virtual experts," substantially reducing data requirements.
Limitations & Future Work¶
- Computational overhead: Computing video-level and frame-level rewards (video encoding + FB inference) increases training cost.
- Goal frame selection depends on the cross-domain generalization of CLIP, which may be unreliable for scenes outside its training distribution.
- Evaluation scope: Experiments are limited to Meta-World and DCS, which are relatively controlled environments; transferability to real-world robotic settings remains unknown.
- Fine-tuning data: The video diffusion model requires domain-specific fine-tuning data, which may necessitate additional collection efforts for entirely new domains.
Related Work & Insights¶
- vs. Diffusion Reward: Diffusion Reward uses the entropy of a conditional diffusion model as a reward, whereas this paper leverages latent features from the generated video encoder combined with an FB representation, yielding richer and more directional signals.
- vs. RoboCLIP: RoboCLIP relies on CLIP embeddings of expert videos/text as sparse rewards; this paper provides dense, goal-driven rewards without requiring expert demonstrations.
- vs. TADPoLe: TADPoLe applies a frozen text-conditioned diffusion model for zero-shot rewards but achieves poor results, demonstrating that denoising gradients alone are insufficient to provide effective reward signals.
- vs. UniPi: UniPi generates videos from text and trains an inverse dynamics model to predict actions; this paper directly uses generated videos as reward signals, which is a more straightforward approach.
Rating¶
- Novelty: ⭐⭐⭐⭐ The dual video-level and frame-level reward design is relatively novel, though the theoretical foundations of individual components (FB representation, VAE encoder similarity) are drawn from prior work.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive ablation and sensitivity analyses are provided, but the test environments are relatively simple and real-robot experiments are absent.
- Writing Quality: ⭐⭐⭐⭐ The method is described clearly with complete algorithmic pseudocode.
- Value: ⭐⭐⭐⭐ Establishes a new paradigm for leveraging generative model priors in RL; practical impact depends on generalization to more complex environments.