Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

Conference: ECCV 2024
arXiv: 2407.06642
Code: GitHub
Area: Image Generation / Personalized Text-to-Image
Keywords: personalized T2I, reinforcement learning, deterministic policy gradient, look forward, DINO reward

TL;DR

This work casts personalized T2I generation as a Deterministic Policy Gradient (DPG) problem, with the diffusion model acting as the policy and its per-step noise predictions as the actions. By adding a "look forward" mechanism that captures long-term visual consistency and a DINO similarity reward, it raises the DINO score from 0.694 to 0.738 (+6.3% relative) and CLIP-I from 0.762 to 0.797 (+4.6% relative) over DreamBooth on the DreamBooth benchmark.

Background & Motivation

Background: Personalized T2I (Textual Inversion, DreamBooth, Custom Diffusion) embeds personal concepts (pets, friends, etc.) by fine-tuning diffusion models, but generally suffers from the loss of visual details—the color, texture, and structure of generated objects are often inconsistent with the reference images.

Limitations of Prior Work: (1) Existing methods use a simple step-by-step reconstruction loss (\(\epsilon\)-prediction), which cannot directly optimize the visual consistency of the final generated output; (2) different denoising timesteps focus on different features (early stages on structure, later stages on details), but a step-by-step reconstruction loss does not account for this; (3) RL methods for general T2I (DPOK, DRaFT) rely on human-preference or aesthetic rewards, whereas personalized scenarios typically offer only 4 to 6 reference images, which makes it difficult to train a dedicated reward model.

Key Challenge: Step-by-step reconstruction loss cannot capture the long-term visual consistency of the diffusion process, especially the correspondence in structure and detail between the final generated image and the reference images.

Goal: To design a flexible RL framework utilizing various differentiable/non-differentiable objective functions to improve the visual fidelity of personalized T2I.

Key Insight: Treating the diffusion model as a deterministic policy and introducing a Q-function to learn cumulative rewards, enabling a "look forward" mechanism to the final generated output.

Core Idea: Use the Q-function in the DPG framework to learn the cumulative "look forward" error between the clean latent \(\hat{x}_{0,t}\) predicted from the current timestep and the reference latent \(x_0\), namely \(\|\hat{x}_{0,t} - x_0\|^2 = \frac{1-\bar{\alpha}_t}{\bar{\alpha}_t}\|\hat{z}_t - z_t\|^2\), and combine it with a DINO similarity reward to directly optimize visual consistency.

Method

Overall Architecture

Reference image set → Diffusion process noise injection → U-Net policy noise prediction → "Look forward" to obtain \(\hat{x}_{0,t}\) → Decode to image → DINO encoder feature extraction → Reward calculation → Q-function cumulative reward learning → Gradient backpropagation to optimize U-Net policy.

Key Designs

Deterministic Policy Gradient (DPG) Framework
- State: \(\{x_t, t, \tau(y)\}\) (latent state + timestep + text condition)
- Action: \(\hat{z}_t = \epsilon_\theta(x_t, t, \tau(y))\) (predicted noise)
- Policy: Diffusion model \(\epsilon_\theta\)
- Q-function \(Q_\phi\) estimates cumulative rewards, with the optimization objective: \(\max_\theta \mathbb{E}[Q_\phi(x_t, \epsilon_\theta(x_t, t, \tau(y)))]\)
- Design Motivation: Q-learning in RL naturally supports long-term cumulative rewards, offsetting the myopic nature of step-by-step reconstruction loss.
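
The policy update in the list above can be illustrated with a toy deterministic policy gradient step. This is only a sketch under strong assumptions, not the paper's implementation: the state is a scalar, the policy is linear (`a = theta * s`), and the critic is a fixed, hand-written quadratic rather than a learned \(Q_\phi\). All names here are hypothetical.

```python
# Toy deterministic policy gradient (DPG) update; not the paper's code.
# Assumptions: scalar state s, linear policy a = theta * s, and a fixed
# critic Q(s, a) = -(a - 2*s)**2 that is maximized at a = 2*s.

def policy(theta, s):
    """Deterministic policy: maps a state directly to an action."""
    return theta * s

def dq_da(s, a):
    """Analytic derivative of Q(s, a) = -(a - 2*s)**2 with respect to a."""
    return -2.0 * (a - 2.0 * s)

def dpg_step(theta, states, lr=0.1):
    """One gradient-ascent step on the batch mean of Q(s, policy(s))."""
    grad = 0.0
    for s in states:
        a = policy(theta, s)
        # Chain rule: dQ/dtheta = dQ/da * da/dtheta, with da/dtheta = s.
        grad += dq_da(s, a) * s
    grad /= len(states)
    return theta + lr * grad

theta = 0.0
states = [0.5, 1.0, 1.5]
for _ in range(200):
    theta = dpg_step(theta, states)
print(theta)  # approaches 2.0, the action scale the critic prefers
```

In the paper's setting the policy is the U-Net \(\epsilon_\theta\) and the critic is the learned \(Q_\phi\); the chain rule through the action is the part this toy is meant to show.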

"Look Forward" Mechanism
- Predict the final output at any timestep \(t\): \(\hat{x}_{0,t} = \frac{1}{\sqrt{\bar{\alpha}_t}}(x_t - \sqrt{1-\bar{\alpha}_t}\hat{z}_t)\)
- Rewrite the look-forward error as \(\|\hat{x}_{0,t} - x_0\|^2 = \frac{1-\bar{\alpha}_t}{\bar{\alpha}_t}\|\hat{z}_t - z_t\|^2\), where \(z_t\) is the noise actually added at step \(t\). This is a reconstruction loss with timestep-dependent weighting.
- The Q-function learns the cumulative reward with discount factor \(\gamma\): \(Q_\phi(x_t, \cdot) = \mathcal{L}(x_t, \cdot) + \gamma Q_\phi(x_{t-1}, \cdot)\)
- Design Motivation: The "look forward" results at different timesteps reflect different levels of features (early steps = structure, later steps = details). The Q-function implicitly learns to focus on different levels.
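
The look-forward identity and the Q recursion above can be checked numerically. The sketch below assumes the standard forward process \(x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,z_t\), uses random vectors as stand-ins for latents and for the U-Net prediction, and assumes a base case \(Q_0 = 0\) for the recursion. The function names and the value of `alpha_bar` are illustrative only.

```python
import math
import random

def look_forward(x_t, z_hat, alpha_bar):
    """Predict the clean latent x0_hat from x_t and the predicted noise z_hat."""
    return [(x - math.sqrt(1.0 - alpha_bar) * zh) / math.sqrt(alpha_bar)
            for x, zh in zip(x_t, z_hat)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cumulative_q(step_losses, gamma):
    """Unroll Q_t = L_t + gamma * Q_{t-1}; step_losses[0] is L_1, Q_0 = 0 (assumed)."""
    q, out = 0.0, []
    for loss in step_losses:
        q = loss + gamma * q
        out.append(q)
    return out

random.seed(0)
alpha_bar = 0.3                                       # hypothetical schedule value
x0 = [random.gauss(0.0, 1.0) for _ in range(8)]       # reference latent
z = [random.gauss(0.0, 1.0) for _ in range(8)]        # noise actually added
z_hat = [random.gauss(0.0, 1.0) for _ in range(8)]    # stand-in for the prediction

# Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * z
x_t = [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * b
       for a, b in zip(x0, z)]

lhs = sq_dist(look_forward(x_t, z_hat, alpha_bar), x0)
rhs = (1.0 - alpha_bar) / alpha_bar * sq_dist(z_hat, z)
print(math.isclose(lhs, rhs, rel_tol=1e-9))  # True

print(cumulative_q([1.0, 1.0, 1.0], 0.5))    # [1.0, 1.5, 1.75]
```

The weight \((1-\bar{\alpha}_t)/\bar{\alpha}_t\) grows as \(\bar{\alpha}_t\) shrinks, so the same noise-prediction error costs more at noisier timesteps.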

DINO Reward
- Decode \(\hat{x}_{0,t}\) into image \(\hat{I} = \mathcal{D}(\hat{x}_{0,t})\) and extract feature \(\hat{\kappa}\) using a DINO encoder.
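- Reward \(r(x_t) = -(1 - \hat{\kappa} \cdot \kappa)\): the negative cosine distance between \(\hat{\kappa}\) and the DINO feature \(\kappa\) of the reference image (taking both features as unit-normalized), so the reward reaches its maximum of zero when the features align.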
- Combined with reconstruction reward: \(\nabla_\theta \frac{1}{B}\sum_B(\lambda Q_\phi + (-\|\epsilon - \epsilon_\theta\|^2))\)
- Design Motivation: DINO excels at capturing unique visual characteristics of objects, making it a more suitable personalization reward signal than human preferences.
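
The reward and the combined objective above can be written out as a small sketch. It assumes the DINO features are L2-normalized before the dot product, and it uses plain Python lists as hypothetical stand-ins for encoder outputs, Q-values, and noise tensors; no real DINO encoder or VAE decoder is called.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dino_reward(feat_gen, feat_ref):
    """r = -(1 - <k_hat, k>), computed on unit-normalized features."""
    k_hat = l2_normalize(feat_gen)
    k = l2_normalize(feat_ref)
    cos = sum(a * b for a, b in zip(k_hat, k))
    return -(1.0 - cos)

def combined_objective(q_values, eps_true, eps_pred, lam):
    """Batch mean of lam * Q - ||eps - eps_theta||^2 (to be maximized)."""
    total = 0.0
    for q, e, e_hat in zip(q_values, eps_true, eps_pred):
        recon = sum((a - b) ** 2 for a, b in zip(e, e_hat))
        total += lam * q - recon
    return total / len(q_values)

# Orthogonal features give reward -1; identical features give (about) 0.
print(dino_reward([1.0, 0.0], [0.0, 3.0]))
print(dino_reward([2.0, 1.0], [2.0, 1.0]))

# Two-sample batch with hypothetical Q-values and noise vectors.
obj = combined_objective(
    q_values=[-0.2, -0.4],
    eps_true=[[1.0, 0.0], [0.0, 1.0]],
    eps_pred=[[1.0, 0.0], [0.0, 0.0]],
    lam=1.0,
)
print(obj)
```

The reward ranges from -2 (opposite features) to 0 (aligned features), so maximizing it pulls the generated image's features toward the reference.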

Loss & Training

The Q-function and the U-Net are optimized alternately (Algorithm 1 in the paper). The method builds on the DreamBooth baseline with Stable Diffusion V1.4; the Q-function has only 0.26M parameters (vs. 859.4M for the U-Net). Training was performed on a 32 GB V100 GPU.

Key Experimental Results

Main Results

Comparison on the DreamBooth benchmark (30 concepts, 25 prompts):

Method	DINO↑	CLIP-I↑	CLIP-T↑
Custom Diffusion	0.649	0.712	0.321
Custom Diffusion + DINO reward	0.640	0.715	0.320
Custom Diffusion + Look Forward	0.669	0.728	0.322
DreamBooth	0.694	0.762	0.282
DreamBooth + DINO reward	0.723	0.783	0.270
DreamBooth + Look Forward	0.738	0.797	0.269

Comparison on the Custom benchmark:

Method	DINO↑	CLIP-I↑	CLIP-T↑
DreamBooth	0.640	0.737	0.309
DreamBooth + Look Forward	0.680	0.773	0.303
DreamBooth + DINO reward	0.653	0.753	0.310

Ablation Study

Ablation Item	DINO↑	CLIP-I↑	CLIP-T↑
DreamBooth Baseline	0.644	0.707	0.239
w/o discount factor \(\gamma\)	0.727	0.761	0.209
\(\gamma = 0.9986\)	0.704	0.743	0.213
\(\lambda = 0.1\) (DINO weight)	0.704	0.743	0.213
\(\lambda = 1\) (DINO weight)	0.727	0.746	0.211
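
To get a feel for the discount factor in the ablation, the short calculation below shows how far \(\gamma = 0.9986\) reaches along the denoising chain. The chain length `T = 1000` is an assumption (a common number of training timesteps for Stable Diffusion), not a value stated in this summary.

```python
import math

gamma = 0.9986
T = 1000  # assumed number of diffusion timesteps

effective_horizon = 1.0 / (1.0 - gamma)         # sum of gamma**k over all k >= 0
half_life = math.log(0.5) / math.log(gamma)     # steps until the weight halves
weight_after_T = gamma ** T                     # weight on a loss T steps away

print(round(effective_horizon, 1))
print(round(half_life, 1))
print(round(weight_after_T, 3))
```

Under this assumption the discounted sum spans roughly 700 steps, and a loss at the far end of the chain still carries about a quarter of its weight, so the Q-function sees most of the trajectory rather than only nearby steps.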

User Study:

Preference	Ours	DreamBooth	Similar
Image Fidelity	55.1%	12.0%	32.9%
Text Fidelity	19.6%	20.4%	60.0%

Key Findings

Look Forward brings the most significant improvement: DINO increases from 0.694 to 0.738 (+6.3%), and CLIP-I from 0.762 to 0.797 (+4.6%).
DINO reward improves DINO score by 4.2% (0.694 to 0.723) over the DreamBooth baseline.
An inherent trade-off exists between image fidelity and text fidelity, but the drop in text fidelity is small (CLIP-T from 0.282 to 0.269).
The Q-network is extremely small (0.26M parameters vs. 859.4M for the U-Net), so the Q-network itself adds almost no computational overhead.
55.1% of users prefer the image fidelity of this method (vs. 12.0% for DreamBooth).
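
The relative changes quoted above can be reproduced from the table values. The snippet uses only numbers already reported in this summary; the helper name `rel_change` is illustrative.

```python
def rel_change(before, after):
    """Relative change in percent, measured against the baseline value."""
    return (after - before) / before * 100.0

dino_look_forward = rel_change(0.694, 0.738)    # DreamBooth -> + Look Forward
clip_i_look_forward = rel_change(0.762, 0.797)
dino_with_reward = rel_change(0.694, 0.723)     # DreamBooth -> + DINO reward
clip_t_look_forward = rel_change(0.282, 0.269)  # text-fidelity side of the trade-off
q_param_share = 0.26 / 859.4 * 100.0            # Q-network size as % of the U-Net

print(round(dino_look_forward, 1))    # 6.3
print(round(clip_i_look_forward, 1))  # 4.6
print(round(dino_with_reward, 1))     # 4.2
print(round(clip_t_look_forward, 1))  # -4.6
print(round(q_param_share, 3))        # 0.03
```

On the same relative scale, the CLIP-T drop is about 4.6%, which is worth keeping in mind when weighing the fidelity trade-off.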

Highlights & Insights

The DPG framework elegantly maps the diffusion process into an RL problem, where the Q-function learns long-term cumulative rewards.
The derivation of the "look forward" mechanism is concise: it is equivalent to a timestep-weighted reconstruction loss, but achieved cumulatively via the Q-function.
The framework is highly flexible: any differentiable or non-differentiable reward function can be plugged in (DINO is just one realization).
The Q-network has only 0.26M parameters, leading to a lightweight, low-cost implementation.

Limitations & Future Work

In some scenarios, overemphasizing visual fidelity might lead to a decline in text alignment.
Only DINO is used as an example reward; other, possibly stronger visual similarity metrics (such as DINOv2 or SSIM) are left unexplored.
Based on the DreamBooth baseline, performance is bound by the generation capabilities and text encoder of Stable Diffusion V1.4.
Direct comparison with contemporaneous RL-based T2I methods (like DRaFT) in personalized scenarios is missing.

Comparison with Related Work

vs. DreamBooth: DreamBooth only uses step-by-step reconstruction loss, whereas this work introduces long-term visual consistency via the DPG framework.
vs. DRaFT: DRaFT propagates gradients directly based on differentiable rewards, whereas this work supports more flexible rewards via the Q-function.
vs. DPOK: DPOK uses stochastic policy gradient with KL regularization for general T2I, whereas this work uses deterministic policy gradient for personalization.
vs. Custom Diffusion: Custom Diffusion only fine-tunes cross-attention, leading to weaker visual fidelity; incorporating Look Forward can improve this.

Rating

Novelty: ⭐⭐⭐⭐ Modeling the diffusion process as DPG provides an elegant theoretical framework.
Experimental Thoroughness: ⭐⭐⭐⭐ DreamBooth + Custom benchmarks + user study + ablation studies.
Writing Quality: ⭐⭐⭐ Extensive mathematical derivations in the methodology section, but overall clear.
Value: ⭐⭐⭐⭐ DINO improved by 6.3%, 55.1% user preference, showing significant practical efficacy.