Enhancing Diffusion Models with Text-Encoder Reinforcement Learning¶
Conference: ECCV 2024
arXiv: 2311.15657
Code: GitHub
Area: Image Generation
Keywords: Diffusion Models, Text Encoder, Reinforcement Learning, LoRA, Text-Image Alignment
TL;DR¶
This paper proposes TexForce, which utilizes reinforcement learning (DDPO) combined with LoRA to fine-tune the text encoder of diffusion models, thereby improving text-image alignment and visual quality. It can be seamlessly combined with existing U-Net fine-tuning methods to achieve superior performance.
Background & Motivation¶
Current text-to-image diffusion models (such as Stable Diffusion) are trained to optimize log-likelihood, which creates a gap between this objective and the specific requirements of downstream tasks (e.g., image aesthetics, text-image alignment). Existing improvement methods mainly fine-tune the U-Net through reinforcement learning (DDPO, DPOK) or direct backpropagation (ReFL, AlignProp). However, several key issues remain:
Neglecting the Text Encoder: Almost all methods keep the pretrained text encoder frozen and only tune the U-Net. However, the text encoder itself is sub-optimal—users often need to carefully engineer prompts to obtain satisfactory results.
Side Effects of U-Net Fine-tuning: Fine-tuning the U-Net can easily disrupt the visual appearance of images, leading to style degradation or mode collapse.
Text Encoder as a Bottleneck: Even if the U-Net is fine-tuned, the sub-optimal text encoder still limits overall performance.
The authors observe a key phenomenon: Fine-tuning the U-Net tends to improve reward scores by changing the visual appearance, whereas fine-tuning the text encoder achieves the same goal by introducing new visual concepts, with the latter being superior in maintaining semantics.
Method¶
Overall Architecture¶
The core of TexForce is straightforward: using the DDPO (Denoising Diffusion Policy Optimization) algorithm, LoRA fine-tuning is applied to the text encoder \(\tau_\phi\), which is optimized based on task-specific reward functions. The key features are:
- The text encoder is fine-tuned via RL, while the U-Net is frozen.
- Parameter-efficient fine-tuning is achieved using LoRA.
- Simple combination with existing fine-tuning methods of U-Net is enabled without extra training.
Key Designs¶
Reinforcement Learning in Diffusion Models¶
The denoising process is formalized as a Markov Decision Process. The text encoder \(\tau_\phi\) acts as a policy network that maps the text description to actions (text embeddings), thereby influencing the generation process of the diffusion model. The optimization goal is to maximize the expected reward:
The policy gradient can be computed as:
In practice, PPO (Proximal Policy Optimization) is used to stabilize training:
where \(A\) is the advantage value of the normalized reward, and \(r_t\) is the probability ratio of the new policy to the old policy.
LoRA Low-Rank Adaptation¶
Trainable low-rank matrices are inserted into the feed-forward layers of the text encoder: \(W' = W + \alpha \Delta W\), where \(\Delta W\) is initialized to zero. The advantages of LoRA are:
- Overfitting Prevention: The low-rank constraint limits the parameter space.
- Flexible Switching: LoRA weights for different tasks can be flexibly replaced.
- Weight Merging: LoRA weights for different tasks can be directly weight-blended to combine capabilities.
Why Fine-tune the Text Encoder¶
From a theoretical perspective, when optimizing the ELBO with a small number of prompts \(s\) during the fine-tuning stage:
When \(\phi\) is fixed, \(q_\phi\) may serve as a sub-optimal estimation of \(p(\mathbf{z})\), leading to an increase in the KL divergence term. Therefore, fine-tuning the text encoder \(\tau_\phi\) can reduce the KL term, which is particularly important under limited data.
Direct Fusion of U-Net and Text Encoder LoRA Weights¶
The most important design is that the text encoder LoRA weights in TexForce can be directly combined with existing U-Net LoRA weights without any additional training. Experiments demonstrate that this simple fusion strategy consistently outperforms joint training.
The authors hypothesize that physical reason: The fixed U-Net acts as a pixel generation prior during text encoder fine-tuning, and joint fine-tuning would make the optimization of the text encoder more complex.
Loss & Training¶
- Reward Functions: Supports various non-differentiable rewards, including ImageReward, HPSv2, JPEG compression size, face quality score, and hand detection confidence.
- Optimization Algorithm: PPO with importance sampling.
- LoRA Fusion: \(\sum_i \alpha_i \theta_i\), where LoRA weights for different tasks are combined via weighted summation.
Key Experimental Results¶
Main Results¶
Quantitative Results on ImageReward Dataset
| Method | ImageReward Score↑ |
|---|---|
| SDv1.4 | 0.2154 |
| ReFL (U-Net fine-tuning) | 0.4485 |
| TexForce (Text Encoder fine-tuning) | 0.4556 |
| ReFL + TexForce | 0.6553 |
Quantitative Results on HPSv2 Dataset
| Method | HPSv2 Score↑ |
|---|---|
| SDv1.4 | 0.2752 |
| AlignProp (U-Net fine-tuning) | 0.2821 |
| TexForce | 0.2767 |
| AlignProp + TexForce | 0.2914 |
Multi-backbone Results (ImageReward)
| Backbone | Original | ReFL | TexForce | ReFL+TexForce |
|---|---|---|---|---|
| SDv1.5 | 0.2140 | 0.5484 | 0.4086 | 0.6703 |
| SDv2.1 | 0.3891 | 0.5223 | 0.5084 | 0.6158 |
Ablation Study¶
Simple Fusion vs. Joint Training
| Method | ImageReward Score |
|---|---|
| SDv1.4 | 0.2154 |
| ReFL Only | 0.4485 |
| TexForce Only | 0.4556 |
| Joint Training | 0.5009 |
| Simple Fusion (ReFL+TexForce) | 0.6553 |
Simple fusion significantly outperforms joint training, validating the importance of keeping U-Net fixed as a prior.
Key Findings¶
- Behavioral Difference: Fine-tuning the U-Net alters visual appearance to pursue rewards, whereas fine-tuning the text encoder introduces new concepts to achieve the same goal—the latter preserves semantics better.
- Complementarity: The capacities learned by both are complementary; a simple combination of them outperforms joint training.
- GPT-4V Evaluation: TexForce scores the highest in text-image alignment, ReFL performs better in visual appearance, and the combined scheme achieves the best overall performance.
- Cross-backbone Robustness: The method is effective across SDv1.4, SDv1.5, and SDv2.1, consistently improving even the already strong SDv2.1.
- Mixable LoRA Weights: LoRA weights trained for different tasks (e.g., ImageReward + face quality) can be directly blended to enhance specific target qualities.
Highlights & Insights¶
- Novel Perspective: While almost all concurrent works focus on U-Net fine-tuning, this paper is among the earliest to systematically study text encoder fine-tuning.
- Simple yet Powerful Combination Strategy: Merges the advantages of different fine-tuning schemes without requiring additional training, significantly lowering the barrier to practical deployment.
- Theoretical and Empirical Consistency: Provides theoretical motivation for fine-tuning the text encoder starting from ELBO analysis, which is perfectly validated by experiments.
- High Flexibility: Does not require differentiable reward functions; any evaluation metric for image quality can serve as a reward.
Limitations & Future Work¶
- RL training is slower and more computationally heavy than direct backpropagation (as shown in Figure 4, optimizing the text encoder is more challenging).
- Validated only on the Stable Diffusion series; SDXL or newer architectures are not tested.
- The fusion coefficient \(\alpha_i\) for combining different reward-trained LoRA weights requires manual tuning.
- Generalization capability of the fine-tuned text encoder to out-of-distribution prompts has not been studied.
- Lacks direct comparison with the concurrent work TextCraftor.
Related Work & Insights¶
- DDPO: The base RL algorithm used in this work, formulating the diffusion denoising process as an MDP.
- ReFL / AlignProp / DRaFT: Direct backpropagation methods, which are more prone to overfitting and mode collapse.
- TextCraftor: A concurrent work exploring text encoder fine-tuning, but it relies on differentiable rewards and is thus less flexible.
- Insight: The concept of text encoder fine-tuning can be extended to diffusion models in other modalities, such as video generation and 3D generation.
Rating¶
| Dimension | Rating (1-5) |
|---|---|
| Novelty | 4 |
| Theoretical Depth | 3.5 |
| Experimental Thoroughness | 4.5 |
| Practical Value | 4.5 |
| Writing Quality | 4 |
| Overall Rating | 4.1 |