Towards Reliable Advertising Image Generation Using Human Feedback¶

Conference: ECCV 2024
arXiv: 2408.00418
Code: Yes
Area: Image Generation
Keywords: Advertising Image Generation, Human Feedback, Diffusion Model Fine-Tuning, Multimodal Detection, E-commerce Applications

TL;DR¶

Constructs RF1M, a million-scale human-annotated advertising image dataset; proposes RFNet, a multimodal network that automatically detects whether a generated image is usable; and designs RFFT, a fine-tuning method regularized by Consistent Condition, raising the advertising image availability rate from 56.4% to 85.5%.

Background & Motivation¶

In the e-commerce sector, attractive advertising images are crucial for improving click-through rates (CTR). Although diffusion models (paired with ControlNet) can automatically generate harmonious backgrounds for products, unqualified images are frequently produced during the generation process:

Space Mismatch: Improper spatial relationships between the product and the background (e.g., the product floating).
Size Mismatch: The size of the product is inconsistent with the background (e.g., a massage chair being smaller than a cabinet).
Indistinctiveness: The product fails to stand out due to complex backgrounds or similar colors.
Shape Hallucination: The background erroneously extends the product's shape (e.g., adding bases or stands).

These unqualified images can mislead consumers, requiring extensive manual inspection. Core Problem: How to establish a reliable advertising image generation pipeline to produce images with high availability rates?

Two-fold solution:

Recurrent Generation: Leveraging randomness to generate multiple times, replacing manual inspection with automatic detection.

Model Fine-Tuning (RFFT): Fine-tuning diffusion models using human feedback to fundamentally improve the availability rate.

Method¶

Overall Architecture¶

The complete pipeline consists of three core components:

RF1M Dataset: 1.05 million human-annotated advertising images, each labeled with one of five fine-grained categories.
RFNet (Reliable Feedback Network): A multimodal detection network that automatically evaluates the usability of generated images.
RFFT (Reliable Feedback Fine-Tuning): Fine-tuning ControlNet using RFNet feedback, combined with Consistent Condition regularization to prevent collapse.

Generation flow: Product image + text prompt → Stable Diffusion + ControlNet inpainting → RFNet detection → Available (keep) / not available (regenerate).
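
The flow above amounts to a generate-then-check loop. A minimal sketch, assuming hypothetical `generate` and `is_available` callables that stand in for the diffusion pipeline and RFNet, and a hypothetical `max_attempts` cap (the source does not state one):

```python
import random
from typing import Callable, Optional, Tuple, TypeVar

Image = TypeVar("Image")


def recurrent_generation(
    generate: Callable[[int], Image],
    is_available: Callable[[Image], bool],
    max_attempts: int = 5,  # hypothetical cap, not from the source
) -> Tuple[Optional[Image], int]:
    """Regenerate with fresh seeds until the detector accepts an image.

    Returns (image, attempts_used); image is None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        seed = random.getrandbits(32)  # new randomness for each attempt
        image = generate(seed)
        if is_available(image):
            return image, attempt
    return None, max_attempts
```

Returning the attempt count alongside the image makes it easy to measure how many generations a given model needs on average.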

Key Designs¶

1. RF1M Dataset Construction¶

Built from advertising images generated for products on JD.com, containing 1,058,230 samples. Each sample includes:
- Generated advertising image + product image with a transparent background
- Prompt written by professional designers
- Depth map (generated by DPT) and saliency map (generated by U2-Net)
- Product title/description
- One of five human-annotated category labels (Available / Space Mismatch / Size Mismatch / Indistinctiveness / Shape Hallucination)

Validated by online A/B testing on JD.com: the images received over 60 million impressions in total and boosted CTR by 2.2%.

2. RFNet Multimodal Detection Network¶

Fuses cues from five modalities to make its judgment:

Input Modalities:
- \(I_o\) (original product image): For understanding product appearance.
- \(I_g\) (generated advertising image): For evaluating overall appearance.
- \(I_d\) (depth map): For determining the spatial relationship between the product and background.
- \(I_s\) (saliency map): For checking if the product contour is highlighted.
- \(Cap\) (product description text): For providing product attribute knowledge.

Network Architecture:
- Image Encoder: Pre-trained ResNet50 to encode the four images into \(\{e_o, e_g, e_d, e_s\}\).
- Text Encoder: Fine-tuned RoBERTa to encode the product description into \(e_c\).
- Feature Filter Module (FFM): \(N_1\) cross-attention + convolutional modules, using \(e_o\) as Query and \(e_c\) as Key/Value to extract visually-related attributes:

\[e_f = \text{Conv}(\text{Conv}(\text{CrossAttn}(e_o, e_c)) \otimes \text{Conv}(e_o)) + e_o\]

Self-Attention Fusion: \(N_2\) self-attention layers to integrate features from all modalities:

\[f = \text{SelfAttention}(\text{Concat}(e_f, e_g, e_d, e_s))\]

A final fully connected classifier outputs probabilities for the five categories.

3. RFFT Fine-tuning Method¶

Key Challenge: Direct fine-tuning with usability feedback leads to generation collapse: the model learns to generate simple, repetitive backgrounds that avoid unqualified cases, reaching a 99.8% availability rate at the cost of severely degraded image aesthetics.

Feedback Signal \(F_{AC}\):

\[F_{AC} = -\frac{1}{N} \sum_{i=1}^{N} y_d \log(\hat{o}_i)\]

where \(y_d\) is the one-hot vector for the "Available" category and \(\hat{o}_i\) is the RFNet prediction for the \(i\)-th generated image. The gradient of \(F_{AC}\) is backpropagated to tune ControlNet.
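
Because \(y_d\) is one-hot, \(F_{AC}\) reduces to the mean negative log-probability that RFNet assigns to "Available". A minimal sketch of that arithmetic in plain Python; the label order in `LABELS` and the `eps` clamp are assumptions for illustration, and a real training loop would compute this inside an autodiff framework so gradients can flow back to ControlNet:

```python
import math
from typing import Sequence

# Assumed label order; the source lists the five categories but not their indices.
LABELS = ["Available", "Space Mismatch", "Size Mismatch",
          "Indistinctiveness", "Shape Hallucination"]
AVAILABLE = LABELS.index("Available")


def f_ac(probs: Sequence[Sequence[float]], eps: float = 1e-12) -> float:
    """Mean negative log-probability of the "Available" class over a batch.

    probs[i] is assumed to be the five-way probability vector predicted
    for the i-th generated image. Only the "Available" entry contributes
    because the target y_d is one-hot on that class.
    """
    n = len(probs)
    return -sum(math.log(max(p[AVAILABLE], eps)) for p in probs) / n


# One confident "Available" prediction and one uncertain prediction.
batch = [[0.9, 0.025, 0.025, 0.025, 0.025],
         [0.5, 0.2, 0.1, 0.1, 0.1]]
print(f"F_AC = {f_ac(batch):.4f}")
```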

Consistent Condition (CC) Regularization:

Key Insight: Rather than forcing the generated image to stay unchanged (as KL regularization does), the regularizer should keep the direction in which the text condition influences generation consistent. First, extract the text guiding direction from classifier-free guidance:

\[\nabla_{x_t} \log p_\theta^t(z|x_t, c) \approx -\frac{1}{\sqrt{1-\bar{\alpha}_t}} (\epsilon_\theta(x_t, z, c) - \epsilon_\theta(x_t, c))\]

Then, constrain the fine-tuned model's condition direction to match the reference model's:

\[L_{CC} = \|\nabla_{x_t} \log p_\theta^t(z|x_t, c) - \nabla_{x_t} \log p_{ref}^t(z|x_t, c)\|_2\]

Essential differences between KL and CC regularization:
- KL regularization works against \(F_{AC}\): the feedback tries to alter the output while the regularizer tries to keep it unchanged.
- CC regularization works with \(F_{AC}\): it keeps the text control consistent while the availability rate improves.

Final loss: \(F_{total} = F_{AC} + \beta L_{CC}\)
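
The two CC formulas and the final loss can be traced numerically. A minimal sketch on flattened noise predictions stored as plain lists, assuming \(z\) denotes the text condition and \(c\) the remaining condition (the source does not define them explicitly); all function names are hypothetical:

```python
import math
from typing import List, Sequence

Vec = Sequence[float]


def text_direction(eps_text: Vec, eps_no_text: Vec, alpha_bar_t: float) -> List[float]:
    """Approximate text-condition score from two noise predictions.

    eps_text    ~ eps(x_t, z, c): prediction with the text condition
    eps_no_text ~ eps(x_t, c):    prediction without it
    """
    scale = -1.0 / math.sqrt(1.0 - alpha_bar_t)
    return [scale * (a - b) for a, b in zip(eps_text, eps_no_text)]


def l_cc(dir_theta: Vec, dir_ref: Vec) -> float:
    """L2 distance between the fine-tuned and reference text directions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(dir_theta, dir_ref)))


def total_loss(f_ac: float, cc: float, beta: float) -> float:
    """F_total = F_AC + beta * L_CC."""
    return f_ac + beta * cc


# Toy values: the fine-tuned model's text direction drifts slightly from the reference.
d_ref = text_direction([0.2, -0.1], [0.0, 0.0], alpha_bar_t=0.75)
d_theta = text_direction([0.3, -0.1], [0.0, 0.0], alpha_bar_t=0.75)
print(total_loss(f_ac=0.4, cc=l_cc(d_theta, d_ref), beta=0.1))
```

The penalty is zero whenever the fine-tuned model responds to the text exactly as the reference does, even if the generated image itself has changed, which is why it does not fight the feedback term the way an output-matching penalty does.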

Loss & Training¶

RFNet Training:
- ResNet50 (ImageNet pre-trained) + RoBERTa (fine-tuned on product descriptions), images resized to 384×384.
- FFM: width 384, 8 heads, \(N_1=1\); Self-Attention: \(N_2=3\).
- Trained for 10 epochs, initial learning rate of 1e-4, decayed by a factor of 10 at epoch 5.

RFFT Fine-tuning:
- 8×A100, local batch size of 4, 4-step gradient accumulation, AdamW with a learning rate of 1e-5.
- Base model: MajicmixRealistic_v7 + ControlNet v1.1.
- Only the ControlNet parameters are trained; the rest are frozen.
- 40-step DDIM sampling; one of the last 10 steps is randomly chosen to produce \(\hat{x}_0^t\), which is evaluated by RFNet and used for backpropagation.
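
The training settings above imply a few derived quantities. A small sketch, assuming "local batch size" means per-GPU batch size, that epochs are 0-indexed, and that the feedback step is drawn uniformly; none of these details are stated in the source, and the function names are hypothetical:

```python
import random

NUM_GPUS = 8
LOCAL_BATCH = 4        # assumed to be the per-GPU batch size
GRAD_ACCUM_STEPS = 4
# Samples contributing to one optimizer step under the assumption above.
EFFECTIVE_BATCH = NUM_GPUS * LOCAL_BATCH * GRAD_ACCUM_STEPS


def rfnet_lr(epoch: int, base_lr: float = 1e-4, decay_epoch: int = 5) -> float:
    """Step schedule: base_lr before decay_epoch, base_lr / 10 from then on."""
    return base_lr if epoch < decay_epoch else base_lr / 10


def sample_feedback_step(total_steps: int = 40, last_k: int = 10) -> int:
    """Pick the 0-indexed DDIM step whose x0 estimate is passed to RFNet."""
    return random.randrange(total_steps - last_k, total_steps)


print(EFFECTIVE_BATCH, [rfnet_lr(e) for e in range(10)], sample_feedback_step())
```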

Key Experimental Results¶

Main Results¶

RFNet Detection Performance Comparison (1000 test images):

Model	Precision	Recall	F1	AP
ResNet50	74.87	73.66	74.26	77.29
ResNeXt50	77.73	76.88	77.30	79.62
HRNet	72.89	73.12	73.01	73.07
ViT	75.59	78.33	76.93	79.31
RFNet	86.45	85.23	85.83	87.58

RFNet leads significantly across all metrics, exceeding the second-best method by 8.5 percentage points in F1.
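
As a sanity check on the table, F1 can be recomputed from the reported precision and recall. This sketch assumes the reported F1 is the harmonic mean of the reported precision and recall (the source does not say how it was aggregated); under that assumption the recomputed values agree with the table to within about 0.01:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "ResNet50": (74.87, 73.66, 74.26),
    "ResNeXt50": (77.73, 76.88, 77.30),
    "HRNet": (72.89, 73.12, 73.01),
    "ViT": (75.59, 78.33, 76.93),
    "RFNet": (86.45, 85.23, 85.83),
}

for name, (p, r, reported) in rows.items():
    print(f"{name:10s} computed={f1_score(p, r):.2f} reported={reported:.2f}")

# Margin of RFNet over the best baseline, using the reported F1 values.
best_baseline = max(v[2] for k, v in rows.items() if k != "RFNet")
margin = rows["RFNet"][2] - best_baseline
print(f"F1 margin over best baseline: {margin:.2f} points")
```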

Advertising Image Availability Rate Comparison (1000 products, single generation):

Method	Ava (RFNet) ↑	Human Ava ↑
Ori (Original Model)	56.4%	70.1%
PromptEng	62.9%	73.2%
PPO	65.9%	74.9%
DPO	57.3%	71.8%
ReFL	84.7%	84.9%
Ours (RFFT)	85.5%	86.3%

RFFT increases the availability rate by 29.1 percentage points compared to the original model (56.4% \(\to\) 85.5%). Human evaluation ranks the methods in the same order as RFNet, which supports the reliability of RFNet as an automatic judge.

Ablation Study¶

RFNet Modality Contribution Ablation:

\(I_o\)	\(I_g\)	\(I_d\)	\(I_s\)	Cap	AP
✗	✓	✓	✓	✓	81.17
✓	✗	✓	✓	✓	82.06
✓	✓	✗	✓	✓	85.31
✓	✓	✓	✗	✓	83.91
✓	✓	✓	✓	✗	84.53
✓	✓	✓	✓	✓	87.58
Coarse-grained labels					82.06

The original product image \(I_o\) contributes the most (removing it drops AP by 6.41), and fine-grained labels outperform coarse-grained labels by 5.52.
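
The drops quoted above follow directly from the ablation table. A short sketch that recomputes the AP drop for each removed modality and ranks them; the numbers are copied from the table and the variable names are illustrative:

```python
FULL_AP = 87.58            # all five modalities, fine-grained labels
COARSE_LABEL_AP = 82.06    # all modalities, coarse-grained labels

# AP when a single modality is removed.
ablated_ap = {"I_o": 81.17, "I_g": 82.06, "I_d": 85.31, "I_s": 83.91, "Cap": 84.53}

drops = {name: round(FULL_AP - ap, 2) for name, ap in ablated_ap.items()}
ranking = sorted(drops, key=drops.get, reverse=True)

for name in ranking:
    print(f"without {name}: AP drops by {drops[name]:.2f}")
print(f"fine-grained vs. coarse-grained labels: +{FULL_AP - COARSE_LABEL_AP:.2f}")
```

Ranked this way, the depth map \(I_d\) contributes the least among the five inputs, while the two RGB images contribute the most.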

CC Regularization vs. KL Regularization: As \(\beta\) increases, the availability rate of KL regularization significantly decreases (adversarial effect), whereas CC regularization maintains a high availability rate.

Key Findings¶

The trends of RFNet's "Ava" and "Human Ava" are consistent (both rank the six methods in the same order), indicating that RFNet reflects human feedback well.
The recurrent generation strategy can further improve the availability rate, but RFFT fine-tuning requires fewer attempts to yield acceptable results.
The fine-tuned ControlNet generalizes well to different LoRAs and diffusion model weights (e.g., Maji_v6, SD_v1.5) without retraining.
Human preference evaluation by 200 professionals indicates that RFFT matches the original model in terms of image aesthetics, significantly outperforming ReFL.
Aesthetic feedback (ImageReward) + CC regularization can be combined with usability feedback without mutual conflict.
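
The finding on recurrent generation can be illustrated with a simple model. This sketch assumes every attempt succeeds independently with the single-generation availability rate measured by RFNet, which is a simplification: difficult products likely fail repeatedly, so these figures are optimistic. Under that assumption the expected number of attempts falls from about 1.77 for the original model to about 1.17 after RFFT:

```python
def success_within(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts is usable."""
    return 1.0 - (1.0 - p) ** k


def expected_attempts(p: float) -> float:
    """Mean number of attempts until the first usable image (geometric model)."""
    return 1.0 / p


# Single-generation availability rates (RFNet) from the results table.
rates = {"Ori": 0.564, "RFFT": 0.855}

for name, p in rates.items():
    print(f"{name}: expected attempts = {expected_attempts(p):.2f}, "
          f"P(usable within 3 tries) = {success_within(p, 3):.3f}")
```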

Highlights & Insights¶

Industrial-Grade Complete Solution: Provides a comprehensive solution for reliable advertising image generation from dataset \(\to\) detection network \(\to\) model fine-tuning \(\to\) online deployment.
CC Regularization Resolves Aesthetic-Usability Trade-off: Shifts the constraint from "fixed output" to "consistent conditional direction", avoiding the adversarial effect seen with KL regularization.
Million-Scale Labeled Dataset: RF1M serves as the first large-scale, multimodal, human-annotated dataset specifically for advertising image generation.
Multimodal Fusion Detection: Auxiliary cues such as depth maps, saliency maps, and product descriptions significantly enhance detection precision.

Limitations & Future Work¶

Single dataset source (JD.com), which may contain domain bias.
The five-category classification granularity of RFNet can be further refined.
RFFT only fine-tunes ControlNet without exploring fine-tuning on U-Net.
The hyperparameter \(\beta\) for CC regularization requires manual tuning.
The current pipeline incurs a high inference cost (multiple generation passes + multimodal detection), suggesting potential efficiency optimizations.

Related Work¶

ReFL / DRaFT: End-to-end fine-tuning of diffusion models directly using gradients of differentiable rewards; RFFT follows a similar end-to-end strategy.
DDPO / DPOK: Model the denoising process as a multi-step MDP and update the diffusion model via policy gradient, but incur higher training costs.
Diffusion-DPO: Enhances diffusion models with human comparison data, but lacks optimization oriented specifically towards usability challenges.
ControlNet: RFFT fine-tunes only the ControlNet parameters, which gives good generalization and training efficiency.

Rating¶

Novelty: ⭐⭐⭐⭐ - The approach of using CC regularization to resolve RLHF collapse offers a unique insight.
Effectiveness: ⭐⭐⭐⭐⭐ - Availability rate increased by 29.1 percentage points, online CTR boosted by 2.2%, validated by real-world industrial systems.
Engineering Value: ⭐⭐⭐⭐⭐ - A complete pipeline with a publicly available dataset, deployed in production at JD.com.
Recommendation: ⭐⭐⭐⭐ - Pioneering work applying RLHF to advertising images; CC regularization is highly transferable.