Towards Reliable Advertising Image Generation Using Human Feedback¶
Conference: ECCV 2024
arXiv: 2408.00418
Code: Yes
Area: Image Generation
Keywords: Advertising Image Generation, Human Feedback, Diffusion Model Fine-Tuning, Multimodal Detection, E-commerce Applications
TL;DR¶
Constructs RF1M, a million-scale human-annotated advertising image dataset; proposes RFNet, a multimodal network that automatically detects whether a generated image is usable; and designs RFFT, a fine-tuning method with Consistent Condition regularization, raising the advertising image availability rate from 56.4% to 85.5%.
Background & Motivation¶
In the e-commerce sector, attractive advertising images are crucial for improving click-through rates (CTR). Although diffusion models (paired with ControlNet) can automatically generate harmonious backgrounds for products, unqualified images are frequently produced during the generation process:
- Space Mismatch: Improper spatial relationships between the product and the background (e.g., the product floating).
- Size Mismatch: The size of the product is inconsistent with the background (e.g., a massage chair being smaller than a cabinet).
- Indistinctiveness: The product fails to stand out due to complex backgrounds or similar colors.
- Shape Hallucination: The background erroneously extends the product's shape (e.g., adding bases or stands).
These unqualified images can mislead consumers, requiring extensive manual inspection. Core Problem: How to establish a reliable advertising image generation pipeline to produce images with high availability rates?
Two-fold solution:
- Recurrent Generation: Exploit sampling randomness to regenerate until an image passes, replacing manual inspection with automatic detection.
- Model Fine-Tuning (RFFT): Fine-tune the diffusion model with human feedback so that it produces available images more often in the first place.
Method¶
Overall Architecture¶
The complete pipeline consists of three core components:
- RF1M Dataset: Over 1.05 million human-annotated advertising images, each labeled with one of five fine-grained categories.
- RFNet (Reliable Feedback Network): A multimodal detection network that automatically evaluates the usability of generated images.
- RFFT (Reliable Feedback Fine-Tuning): Fine-tuning ControlNet using RFNet feedback, combined with Consistent Condition regularization to prevent collapse.
Generation flow: Product image + text prompt → Stable Diffusion + ControlNet inpainting → RFNet detection → accept the image if it is Available, otherwise regenerate.
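The flow above amounts to a generate-and-check loop. A minimal sketch, assuming hypothetical `generate` and `is_available` callables that stand in for the Stable Diffusion + ControlNet sampler and for RFNet's Available / not-Available decision; the attempt cap `max_attempts` is also an assumption, not a value from the paper:

```python
from typing import Callable, Optional


def recurrent_generate(
    generate: Callable[[int], str],
    is_available: Callable[[str], bool],
    max_attempts: int = 5,
) -> Optional[str]:
    """Regenerate with a new seed until the detector accepts an image."""
    for seed in range(max_attempts):
        image = generate(seed)          # one sampling pass with this seed
        if is_available(image):         # detector replaces manual inspection
            return image
    return None                         # give up after max_attempts failures


# Toy stand-ins (hypothetical): images are strings, and only seed 2 passes.
def fake_generate(seed: int) -> str:
    return f"image-{seed}"


def fake_detector(image: str) -> bool:
    return image.endswith("-2")


accepted = recurrent_generate(fake_generate, fake_detector)
rejected = recurrent_generate(fake_generate, lambda image: False, max_attempts=3)
print(accepted, rejected)
```

Capping the number of attempts keeps the worst-case inference cost bounded when a product is hard to render well.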
Key Designs¶
1. RF1M Dataset Construction¶
Built from product image generation on JD.com, RF1M contains 1,058,230 samples. Each sample includes:
- The generated advertising image and the product image with a transparent background
- A prompt written by professional designers
- A depth map (generated by DPT) and a saliency map (generated by U2-Net)
- The product title/description
- A human-annotated label from five categories (Available / Space Mismatch / Size Mismatch / Indistinctiveness / Shape Hallucination)
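The per-sample contents listed above could be represented as a simple record. This is only an illustrative sketch: the class name, field names, and file paths are hypothetical and do not reflect the released dataset's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    """The five annotation categories named in the summary."""
    AVAILABLE = "Available"
    SPACE_MISMATCH = "Space Mismatch"
    SIZE_MISMATCH = "Size Mismatch"
    INDISTINCTIVENESS = "Indistinctiveness"
    SHAPE_HALLUCINATION = "Shape Hallucination"


@dataclass(frozen=True)
class RF1MSample:
    generated_image: str   # path to the generated advertising image
    product_image: str     # path to the product image (transparent background)
    prompt: str            # designer-written text prompt
    depth_map: str         # path to the DPT depth map
    saliency_map: str      # path to the U2-Net saliency map
    caption: str           # product title/description
    label: Label           # human annotation

    @property
    def is_available(self) -> bool:
        return self.label is Label.AVAILABLE


# Placeholder values, for illustration only.
sample = RF1MSample(
    generated_image="ad.png",
    product_image="product.png",
    prompt="a chair in a bright living room",
    depth_map="depth.png",
    saliency_map="saliency.png",
    caption="ergonomic massage chair",
    label=Label.SIZE_MISMATCH,
)
print(sample.label.value, sample.is_available)
```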
Online A/B testing on JD.com, with more than 60 million impressions in total, showed a 2.2% increase in CTR.
2. RFNet Multimodal Detection Network¶
RFNet fuses five input modalities to classify each generated image.
Input Modalities:
- \(I_o\) (original product image): conveys the product's appearance.
- \(I_g\) (generated advertising image): used to evaluate the overall result.
- \(I_d\) (depth map): used to determine the spatial relationship between the product and the background.
- \(I_s\) (saliency map): used to check whether the product contour stands out.
- \(Cap\) (product description text): provides knowledge about product attributes.
Network Architecture:
- Image Encoder: a pre-trained ResNet50 encodes the four images into \(\{e_o, e_g, e_d, e_s\}\).
- Text Encoder: a fine-tuned RoBERTa encodes the product description into \(e_c\).
- Feature Filter Module (FFM): \(N_1\) cross-attention + convolutional modules that use \(e_o\) as Query and \(e_c\) as Key/Value to extract the visually relevant attributes from the text.
- Self-Attention Fusion: \(N_2\) self-attention layers integrate the features from all modalities.
- Classifier: a final fully connected layer outputs probabilities for the five categories.
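One way to write the architecture described above in symbols. The names \(e_f\), \(z\), and \(p\) are introduced here for illustration, and the exact set of tokens passed to the fusion layers (for example, whether \(e_o\) is also included) is an assumption rather than something this summary states:

\[
e_f = \mathrm{FFM}(e_o, e_c) = \mathrm{Conv}\big(\mathrm{CrossAttn}(Q = e_o,\; K = e_c,\; V = e_c)\big)
\]

\[
z = \mathrm{SelfAttn}^{(N_2)}\big([\,e_f;\; e_g;\; e_d;\; e_s\,]\big), \qquad p = \mathrm{softmax}\big(\mathrm{FC}(z)\big) \in \mathbb{R}^{5}
\]

Using the image feature as the query means the text is filtered by what is visible in the product image, so attributes in the description that have no visual counterpart receive little weight.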
3. RFFT Fine-tuning Method¶
Key Challenge: Fine-tuning directly on the usability feedback leads to generation collapse: the model learns to produce simple, repetitive backgrounds that avoid unqualified cases, reaching a 99.8% availability rate at the cost of severely degraded image aesthetics.
Feedback Signal \(F_{AC}\): a classification loss between RFNet's prediction for the generated image and \(y_d\), where \(y_d\) is the one-hot vector for the "Available" category. Its gradients are backpropagated to tune ControlNet.
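A form consistent with the description above, assuming a cross-entropy objective (the loss type and the symbol \(p_k\) for RFNet's class probabilities are assumptions; RFNet's auxiliary inputs are omitted for brevity):

\[
F_{AC} = \mathrm{CE}\big(y_d,\; \mathrm{RFNet}(\hat{x}_0^t)\big) = -\sum_{k=1}^{5} y_{d,k}\, \log p_k(\hat{x}_0^t) = -\log p_{\text{Available}}(\hat{x}_0^t)
\]

Because \(y_d\) is one-hot, the sum reduces to the negative log-probability of the "Available" class, so minimizing \(F_{AC}\) pushes generated images toward that class.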
Consistent Condition (CC) Regularization:
Key Insight: Rather than forcing the generated image to stay unchanged (as KL regularization does), the model should keep the direction in which the text condition influences generation consistent. First, extract the text guiding direction from classifier-free guidance, i.e. the difference between the conditional and unconditional noise predictions. Then, constrain the fine-tuned model's condition direction to match the reference model's.
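In symbols, a sketch of these two steps. Classifier-free guidance combines predictions as \(\tilde{\epsilon} = \epsilon(x_t, \varnothing, t) + w\,\big(\epsilon(x_t, c, t) - \epsilon(x_t, \varnothing, t)\big)\), so the bracketed difference is the text guiding direction. The symbol \(d\) and the squared \(\ell_2\) distance below are assumptions for illustration:

\[
d_\theta(x_t, c) = \epsilon_\theta(x_t, c, t) - \epsilon_\theta(x_t, \varnothing, t)
\]

\[
L_{CC} = \mathbb{E}_{x_t,\, c,\, t}\, \big\| d_\theta(x_t, c) - d_{\mathrm{ref}}(x_t, c) \big\|_2^2
\]

Here \(\theta\) denotes the model being fine-tuned and \(\mathrm{ref}\) the frozen reference model.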
Essential difference between KL and CC regularization:
- KL regularization works against \(F_{AC}\): one tries to alter the output while the other tries to keep it unchanged.
- CC regularization works with \(F_{AC}\): it preserves how the condition steers generation while the availability rate improves.
Final loss: \(F_{total} = F_{AC} + \beta L_{CC}\)
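A toy numeric sketch of why the two regularizers behave differently. It uses plain Python lists as stand-ins for noise predictions, a mean squared difference as the CC distance, and a mean squared output difference as a simplified stand-in for a KL-style penalty; all function names and numbers are illustrative assumptions.

```python
def guidance_direction(eps_cond, eps_uncond):
    """Conditional minus unconditional prediction (the text guiding direction)."""
    return [c - u for c, u in zip(eps_cond, eps_uncond)]


def mean_sq_diff(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cc_loss(ft_cond, ft_uncond, ref_cond, ref_uncond):
    """Penalize changes in the guiding direction, not in the output itself."""
    return mean_sq_diff(
        guidance_direction(ft_cond, ft_uncond),
        guidance_direction(ref_cond, ref_uncond),
    )


def total_loss(f_ac, l_cc, beta):
    return f_ac + beta * l_cc


# Reference model predictions (toy values).
ref_cond, ref_uncond = [1.0, 2.0], [0.5, 1.0]
# Fine-tuned model: both predictions shifted by the same offset of 0.25,
# so the output changes but the guiding direction does not.
ft_cond, ft_uncond = [1.25, 2.25], [0.75, 1.25]

l_cc = cc_loss(ft_cond, ft_uncond, ref_cond, ref_uncond)
output_penalty = mean_sq_diff(ft_cond, ref_cond)   # KL-style stand-in
print(l_cc, output_penalty, total_loss(0.4, l_cc, beta=0.1))
```

In this example the output-based penalty is nonzero while the CC term is zero: the fine-tuned model is free to change what it generates as long as the text condition keeps steering it the same way.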
Loss & Training¶
RFNet Training:
- Encoders: ResNet50 (ImageNet pre-trained) + RoBERTa (fine-tuned on product descriptions); images are resized to 384×384.
- FFM: width 384, 8 heads, \(N_1=1\); Self-Attention: \(N_2=3\).
- Schedule: 10 epochs, initial learning rate 1e-4, decayed by a factor of 10 at epoch 5.
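The learning-rate schedule above is a single step decay. A minimal sketch, assuming epochs are counted from 0 and the decay takes effect from epoch 5 onward (the indexing convention is not stated in this summary):

```python
def rfnet_learning_rate(
    epoch: int,
    base_lr: float = 1e-4,
    decay_epoch: int = 5,
    factor: float = 10.0,
) -> float:
    """Step decay: base_lr before decay_epoch, base_lr / factor from then on."""
    return base_lr if epoch < decay_epoch else base_lr / factor


# Learning rate for each of the 10 training epochs.
schedule = [rfnet_learning_rate(epoch) for epoch in range(10)]
print(schedule)
```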
RFFT Fine-tuning:
- Hardware and optimizer: 8×A100, local batch size 4, 4-step gradient accumulation, AdamW with a learning rate of 1e-5.
- Base model: MajicmixRealistic_v7 + ControlNet v1.1.
- Only the ControlNet parameters are trained; everything else is frozen.
- Sampling uses 40-step DDIM; one of the last 10 steps is chosen at random to produce \(\hat{x}_0^t\), which is evaluated by RFNet and used for backpropagation.
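Two quantities follow from this setup: the number of samples per optimizer step and the choice of the feedback step. A small sketch, assuming data-parallel training across the 8 GPUs and sampling steps indexed 0 to 39 in the order they are executed (both are assumptions; `pick_feedback_step` is a hypothetical helper):

```python
import random

NUM_GPUS = 8
LOCAL_BATCH = 4
GRAD_ACCUM_STEPS = 4

# Samples contributing to each optimizer update under data parallelism.
effective_batch = NUM_GPUS * LOCAL_BATCH * GRAD_ACCUM_STEPS


def pick_feedback_step(rng: random.Random, num_steps: int = 40, last_k: int = 10) -> int:
    """Pick one of the last `last_k` sampling steps uniformly at random."""
    return rng.randrange(num_steps - last_k, num_steps)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
picked = [pick_feedback_step(rng) for _ in range(1000)]
print(effective_batch, min(picked), max(picked))
```

Restricting the feedback to late steps means RFNet sees a nearly clean \(\hat{x}_0^t\), while backpropagating through a single step keeps memory use manageable.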
Key Experimental Results¶
Main Results¶
RFNet Detection Performance Comparison (1000 test images):
| Model | Precision | Recall | F1 | AP |
|---|---|---|---|---|
| ResNet50 | 74.87 | 73.66 | 74.26 | 77.29 |
| ResNeXt50 | 77.73 | 76.88 | 77.30 | 79.62 |
| HRNet | 72.89 | 73.12 | 73.01 | 73.07 |
| ViT | 75.59 | 78.33 | 76.93 | 79.31 |
| RFNet | 86.45 | 85.23 | 85.83 | 87.58 |
RFNet leads on every metric, exceeding the second-best model (ResNeXt50) by 8.5 percentage points in F1.
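The table can be sanity-checked by recomputing F1 from precision and recall. This sketch assumes the reported F1 is the harmonic mean of the reported precision and recall; small differences in the last digit are expected from rounding.

```python
# (precision, recall, reported F1) copied from the table above.
reported = {
    "ResNet50": (74.87, 73.66, 74.26),
    "ResNeXt50": (77.73, 76.88, 77.30),
    "HRNet": (72.89, 73.12, 73.01),
    "ViT": (75.59, 78.33, 76.93),
    "RFNet": (86.45, 85.23, 85.83),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Largest gap between recomputed and reported F1 across all models.
max_gap = max(abs(f1_score(p, r) - f1) for p, r, f1 in reported.values())

# Margin of RFNet over the best baseline, in percentage points.
best_baseline_f1 = max(f1 for name, (_, _, f1) in reported.items() if name != "RFNet")
margin = reported["RFNet"][2] - best_baseline_f1

print(round(max_gap, 3), round(margin, 2))
```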
Advertising Image Availability Rate Comparison (1000 products, single generation):
| Method | Ava (RFNet) ↑ | Human Ava ↑ |
|---|---|---|
| Ori (Original Model) | 56.4% | 70.1% |
| PromptEng | 62.9% | 73.2% |
| PPO | 65.9% | 74.9% |
| DPO | 57.3% | 71.8% |
| ReFL | 84.7% | 84.9% |
| Ours (RFFT) | 85.5% | 86.3% |
RFFT raises the availability rate by 29.1 percentage points over the original model (56.4% \(\to\) 85.5%). Human inspection ranks the methods in the same order as RFNet, which supports RFNet's reliability as an evaluator.
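The agreement between the two columns, and the difference between an absolute gain in percentage points and a relative gain in percent, can be checked directly from the table. A short sketch using only the numbers above:

```python
# Availability rates (%) copied from the table above: (RFNet, human).
rates = {
    "Ori": (56.4, 70.1),
    "PromptEng": (62.9, 73.2),
    "PPO": (65.9, 74.9),
    "DPO": (57.3, 71.8),
    "ReFL": (84.7, 84.9),
    "RFFT": (85.5, 86.3),
}

# Order the methods by each evaluator and compare the two orderings.
rank_by_rfnet = sorted(rates, key=lambda method: rates[method][0])
rank_by_human = sorted(rates, key=lambda method: rates[method][1])
same_ranking = rank_by_rfnet == rank_by_human

abs_gain = rates["RFFT"][0] - rates["Ori"][0]   # percentage points
rel_gain = abs_gain / rates["Ori"][0] * 100     # percent of the baseline

print(rank_by_rfnet)
print(same_ranking, round(abs_gain, 1), round(rel_gain, 1))
```

The absolute values differ between the two evaluators (RFNet is stricter on the weaker methods), but the ordering is identical.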
Ablation Study¶
RFNet Modality Contribution Ablation:
| \(I_o\) | \(I_g\) | \(I_d\) | \(I_s\) | Cap | AP |
|---|---|---|---|---|---|
| ✗ | ✓ | ✓ | ✓ | ✓ | 81.17 |
| ✓ | ✗ | ✓ | ✓ | ✓ | 82.06 |
| ✓ | ✓ | ✗ | ✓ | ✓ | 85.31 |
| ✓ | ✓ | ✓ | ✗ | ✓ | 83.91 |
| ✓ | ✓ | ✓ | ✓ | ✗ | 84.53 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 87.58 |
| Coarse-grained labels | | | | | 82.06 |
The original product image \(I_o\) contributes the most (removing it drops AP by 6.41), and fine-grained labels outperform coarse-grained labels by 5.52.
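The per-modality contributions quoted above are differences against the full model's AP. A short sketch that recomputes them from the ablation table:

```python
FULL_AP = 87.58          # all five inputs, fine-grained labels
COARSE_LABEL_AP = 82.06  # coarse-grained labels

# AP when one input is removed (from the ablation table).
ap_without = {
    "I_o": 81.17,
    "I_g": 82.06,
    "I_d": 85.31,
    "I_s": 83.91,
    "Cap": 84.53,
}

# AP lost by removing each input; a larger drop means a larger contribution.
drops = {name: round(FULL_AP - ap, 2) for name, ap in ap_without.items()}
most_important = max(drops, key=drops.get)
label_gain = round(FULL_AP - COARSE_LABEL_AP, 2)

print(drops)
print(most_important, label_gain)
```

By the same measure, the depth map \(I_d\) shows the smallest drop among the five inputs.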
CC Regularization vs. KL Regularization: As \(\beta\) increases, the availability rate of KL regularization significantly decreases (adversarial effect), whereas CC regularization maintains a high availability rate.
Key Findings¶
- RFNet's "Ava" and "Human Ava" follow the same trend across methods, indicating that RFNet tracks human judgment well.
- The recurrent generation strategy can further improve the availability rate, but RFFT fine-tuning requires fewer attempts to yield acceptable results.
- The fine-tuned ControlNet generalizes well to different LoRAs and diffusion model weights (e.g., Maji_v6, SD_v1.5) without retraining.
- Human preference evaluation by 200 professionals indicates that RFFT matches the original model in terms of image aesthetics, significantly outperforming ReFL.
- Aesthetic feedback (ImageReward) + CC regularization can be combined with usability feedback without mutual conflict.
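The claim that fine-tuning reduces the number of regeneration attempts can be made concrete with a back-of-the-envelope model. The sketch below treats each attempt as an independent trial whose success probability equals the single-generation availability rate measured by RFNet (56.4% before and 85.5% after RFFT). Independence is a simplifying assumption, since some products are systematically harder than others.

```python
import math


def expected_attempts(p: float) -> float:
    """Mean of a geometric distribution with success probability p."""
    return 1.0 / p


def attempts_for_confidence(p: float, target: float = 0.99) -> int:
    """Smallest k such that 1 - (1 - p)**k >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


for name, p in [("original", 0.564), ("RFFT", 0.855)]:
    print(name, round(expected_attempts(p), 2), attempts_for_confidence(p))
```

Under these assumptions the fine-tuned model needs about 1.2 attempts on average instead of about 1.8, and 3 attempts instead of 6 to obtain an available image with 99% probability.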
Highlights & Insights¶
- Industrial-Grade Complete Solution: Provides a comprehensive solution for reliable advertising image generation from dataset \(\to\) detection network \(\to\) model fine-tuning \(\to\) online deployment.
- CC Regularization Resolves Aesthetic-Usability Trade-off: Shifts the constraint from "fixed output" to "consistent conditional direction", which avoids the adversarial effect seen with KL regularization.
- Million-Scale Labeled Dataset: RF1M serves as the first large-scale, multimodal, human-annotated dataset specifically for advertising image generation.
- Multimodal Fusion Detection: Auxiliary cues such as depth maps, saliency maps, and product descriptions significantly enhance detection precision.
Limitations & Future Work¶
- Single dataset source (JD.com), which may contain domain bias.
- The five-category classification granularity of RFNet can be further refined.
- RFFT fine-tunes only ControlNet; fine-tuning the U-Net is not explored.
- The hyperparameter \(\beta\) for CC regularization requires manual tuning.
- The current pipeline has a high inference cost (multiple generation passes + multimodal detection), leaving room for efficiency optimizations.
Related Work & Insights¶
- ReFL / DRaFT: Fine-tune diffusion models end to end by backpropagating the gradients of a differentiable reward; RFFT follows a similar end-to-end approach.
- DDPO / DPOK: Model the denoising process as a multi-step MDP and update the model via policy gradient, at a higher training cost.
- Diffusion-DPO: Improves diffusion models with human comparison data, but is not specifically optimized for usability problems.
- ControlNet: RFFT fine-tunes only the ControlNet parameters, which favors generalization and training efficiency.
Rating¶
- Novelty: ⭐⭐⭐⭐ - Using CC regularization to resolve RLHF collapse is a distinctive insight.
- Effectiveness: ⭐⭐⭐⭐⭐ - Availability rate increased by 29.1 percentage points and online CTR by 2.2%, validated in a real-world industrial system.
- Engineering Value: ⭐⭐⭐⭐⭐ - A complete pipeline deployed in production at JD.com, with a publicly available dataset.
- Recommendation: ⭐⭐⭐⭐ - Pioneering work applying RLHF to advertising images; CC regularization is highly transferable.