AutoPP: Towards Automated Product Poster Generation and Optimization

Conference: AAAI 2026
arXiv: 2512.21921
Code: JD-GenX/AutoPP
Area: Recommender Systems
Keywords: product poster generation, CTR optimization, diffusion model, DPO, multimodal generation

TL;DR

This paper proposes AutoPP, the first pipeline to unify automated product poster generation with CTR-feedback-driven optimization in a single framework. It employs a unified design module to jointly design background, text, and layout; an element rendering module for efficient and controllable poster generation; and Isolated DPO (IDPO) to achieve element-level click-through rate optimization.

Background & Motivation

Product posters require the artful combination of product images, text, and backgrounds to attract user clicks, yet manual creation and iterative refinement are highly time- and labor-intensive.
Existing methods exhibit clear automation bottlenecks:
- PAID adopts a four-stage pipeline (prompt → layout → background → text rendering), in which text attributes (font, color) rely on hand-crafted rules, limiting automation and compromising visual coherence.
- PosterMaker leverages SD3 + ControlNet for simultaneous background and text rendering, but still requires users to manually design the layout and advertising copy for each poster, making large-scale production inefficient.
On the optimization side, CG4CTR and CAIG only optimize the CTR of background elements, neglecting the impact of text and layout on click-through rates; moreover, their coarse-grained joint optimization cannot attribute improvements to specific elements.

Core Problem

Generation side: How can high-quality posters be generated automatically from only basic product information (product image + candidate copy), without manual layout design, copywriting, or text attribute specification?
Optimization side: How can CTR improvements be precisely attributed to specific poster elements (background / text / layout) to enable fine-grained, element-level optimization rather than coarse holistic adjustment?

Method

Overall Architecture

AutoPP consists of two major components: a generator and an optimizer.

1. Unified Design Module

A multimodal large language model (MLLM, initialized from LLaVA) jointly generates three key elements: background prompt \(b\), selected copy \(T^*\), and layout \(l\).
Input: product image \(I_{\text{product}}\) + candidate copy set \(T\).
The joint distribution is modeled autoregressively: \(\pi(y|I_{\text{product}}, X_{\text{instr}}) = \prod_i p(y_i | I_{\text{product}}, X_{\text{instr}}, y_{<i})\)
Compared to distributed multi-model pipelines, single-model joint inference ensures design coherence.
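
The autoregressive factorization above means that the log-probability of a complete design (background prompt, selected copy, and layout in one sequence) is the sum of per-token conditional log-probabilities. A minimal sketch with made-up token probabilities rather than real model outputs; the function name is hypothetical:

```python
import math

def sequence_log_prob(token_probs):
    """Sum per-token conditional log-probabilities (chain rule)."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical values of p(y_i | image, instruction, y_<i) for a
# four-token design sequence; the numbers are illustrative only.
token_probs = [0.9, 0.5, 0.8, 0.25]

log_prob = sequence_log_prob(token_probs)
joint_prob = math.exp(log_prob)  # equals the product of the four probabilities
```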

2. Element Rendering Module

Built upon FLUX.1 dev; product images and glyph images are encoded as condition tokens.
Key innovation: Decomposed Attention (DA) mechanism replacing full attention in MM-DiT:
- Condition Self-Attention: Glyph and product tokens each perform self-attention independently, capturing intra-element dependencies.
- Image-Condition Cross-Attention: Query = [prompt tokens; noise tokens], Key/Value = [all token types concatenated], enabling cross-modal information exchange.
Advantage of the token-based mechanism: no pixel-level alignment required, robust to spatial misalignment between glyph images and target images.
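
The two attention paths described above can be sketched in plain Python. This is a simplified illustration, not the paper's implementation: it assumes a single head, identity query/key/value projections, and no positional encoding, and all function and variable names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def decomposed_attention(prompt, noise, glyph, product):
    # Condition self-attention: each condition stream attends only to itself.
    glyph_out = attention(glyph, glyph, glyph)
    product_out = attention(product, product, product)
    # Image-condition cross-attention: prompt and noise tokens query all tokens.
    image_q = prompt + noise
    all_kv = prompt + noise + glyph + product
    image_out = attention(image_q, all_kv, all_kv)
    return image_out, glyph_out, product_out

# Toy 2-dimensional tokens (illustrative values only).
prompt = [[1.0, 0.0]]
noise = [[0.0, 1.0], [1.0, 1.0]]
glyph = [[0.5, 0.5], [0.2, 0.1]]
product = [[0.3, 0.9]]
image_out, glyph_out, product_out = decomposed_attention(prompt, noise, glyph, product)
```

Note that glyph and product tokens never act as queries against the image tokens, which is where the savings over full attention come from.
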
Training loss: flow matching loss plus an OCR perceptual loss weighted by \(\lambda=0.1\); the OCR term uses intermediate features from the PaddleOCRv4 backbone to enforce clarity in text regions.
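
Written out, the combined training objective has the form below. The flow matching term is shown in the rectified-flow form commonly used with FLUX-style models; the noise convention and the exact OCR feature distance are assumptions, with \(\phi_k\) standing in for the \(k\)-th intermediate feature map of the OCR backbone and \(\hat{x}_0\) for the model's predicted clean image:

\[
\mathcal{L} = \mathcal{L}_{\text{FM}} + \lambda \, \mathcal{L}_{\text{OCR}}, \qquad \lambda = 0.1
\]

\[
\mathcal{L}_{\text{FM}} = \mathbb{E}_{x_0, \epsilon, t} \left\| v_\theta(x_t, t, c) - (\epsilon - x_0) \right\|^2, \qquad x_t = (1 - t)\, x_0 + t\, \epsilon
\]

\[
\mathcal{L}_{\text{OCR}} = \sum_k \left\| \phi_k(\hat{x}_0) - \phi_k(x_0) \right\|^2 \quad \text{(one plausible form, not confirmed by the source)}
\]
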

3. Systematic Element Replacement

Starting from generated posters, one element is replaced at a time (keeping the rest fixed) to create variants:
- Background replacement: GPT-4o generates alternative background descriptions based on the original prompt.
- Text replacement: a candidate copy of equal length is selected from the candidate set.
- Layout replacement: the unified design module regenerates the layout.
Variant posters are randomly displayed on the JD.com platform to collect CTR feedback.
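
The one-element-at-a-time rule can be sketched as follows. The `PosterDesign` structure and the alternative values are hypothetical; in the pipeline described above, alternative backgrounds would come from GPT-4o and alternative layouts from the unified design module.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PosterDesign:
    background: str   # background prompt b
    copy: str         # selected copy T*
    layout: tuple     # layout l, here a single text box (x, y, w, h)

def single_element_variants(base, alt_backgrounds, candidate_copy, alt_layouts):
    """Return (changed_element, variant) pairs that differ from `base` in exactly one element."""
    variants = []
    for bg in alt_backgrounds:
        if bg != base.background:
            variants.append(("background", replace(base, background=bg)))
    for text in candidate_copy:
        # Keep only replacements of equal length, as in the text replacement rule.
        if text != base.copy and len(text) == len(base.copy):
            variants.append(("copy", replace(base, copy=text)))
    for lay in alt_layouts:
        if lay != base.layout:
            variants.append(("layout", replace(base, layout=lay)))
    return variants

base = PosterDesign("wooden table, soft light", "Fresh daily", (40, 60, 300, 80))
variants = single_element_variants(
    base,
    alt_backgrounds=["marble counter, morning sun"],
    candidate_copy=["Fresh daily", "Crisp today", "Buy one get one free"],
    alt_layouts=[(40, 500, 300, 80)],
)
```

Because every variant differs from the base in a single field, any CTR gap between the two can be attributed to that field.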

4. Isolated Direct Preference Optimization (IDPO)

Standard DPO performs coarse-grained alignment over the entire output, unable to distinguish the contribution of individual elements.
IDPO introduces element-aware weights: \(w_i = \sum_{c \in \{b, T^*, l\}} \alpha_c \cdot \mathbb{I}(y_i \in c)\)
Tokens of the replaced element receive weight \(\alpha=5\) and tokens of unchanged elements receive \(\alpha=1\), so CTR feedback is concentrated on the element that differs between the two posters in a pair.
Weighted log-likelihood normalization: \(\log \pi^w(y|I,X) = \frac{\sum_i w_i \log p(y_i|I,X,y_{<i})}{\sum_i w_i}\)
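
A minimal sketch of the element-aware weights and the weighted log-likelihood. Plugging the weighted log-likelihood into the standard DPO logistic loss is an assumption about how the pieces fit together, and `beta=0.1` is an assumed value; neither is confirmed by the text above. All names are hypothetical.

```python
import math

ALPHA_REPLACED = 5.0   # weight for tokens of the replaced element (from the text)
ALPHA_UNCHANGED = 1.0  # weight for tokens of the other elements

def element_weights(token_elements, replaced):
    """w_i per token, given the element ('b', 'T', or 'l') each token belongs to."""
    return [ALPHA_REPLACED if e == replaced else ALPHA_UNCHANGED for e in token_elements]

def weighted_log_prob(token_log_probs, weights):
    """Weight-normalized sequence log-likelihood, log pi^w(y | I, X)."""
    return sum(w * lp for w, lp in zip(weights, token_log_probs)) / sum(weights)

def idpo_loss(policy_win, policy_lose, ref_win, ref_lose, beta=0.1):
    """Standard DPO logistic loss on weighted log-likelihoods (assumed combination)."""
    margin = beta * ((policy_win - ref_win) - (policy_lose - ref_lose))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Five tokens: two background, two copy, one layout; the copy was replaced.
elements = ["b", "b", "T", "T", "l"]
weights = element_weights(elements, replaced="T")
token_log_probs = [-1.0, -1.0, -2.0, -2.0, -1.0]
score = weighted_log_prob(token_log_probs, weights)  # (-1 - 1 - 10 - 10 - 1) / 13
```

With all weights set to 1, `weighted_log_prob` reduces to the ordinary length-normalized log-likelihood.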

AutoPP1M Dataset

Generation subset: 1 million high-quality product posters (1:1 aspect ratio, ≥800×800), sourced from JD.com, with aesthetic filtering, blur detection, and watermark removal.
Optimization subset: 50,000 pairwise comparisons collected via a 10-day random display experiment; each poster viewed by at least 50 users, with over 1.118 million users participating; pairwise CTR difference ≥1%.
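
The filtering rules for the optimization subset can be illustrated as below. This sketch assumes that the 1% threshold is an absolute CTR gap and that each variant is paired with its base poster; the source does not spell out either detail, and the click counts are invented.

```python
def preference_pairs(base, variants, min_views=50, min_gap=0.01):
    """base: (clicks, views); variants: dict name -> (clicks, views).
    Returns (winner, loser) name pairs, pairing each variant with the base."""
    pairs = []
    base_clicks, base_views = base
    if base_views < min_views:
        return pairs
    base_ctr = base_clicks / base_views
    for name, (clicks, views) in variants.items():
        if views < min_views:
            continue  # not enough exposure for a reliable CTR estimate
        gap = clicks / views - base_ctr
        if abs(gap) >= min_gap:
            pairs.append((name, "base") if gap > 0 else ("base", name))
    return pairs

pairs = preference_pairs(
    base=(30, 1000),                 # CTR 3.0%
    variants={
        "new_bg": (45, 1000),        # CTR 4.5%, kept (variant wins)
        "new_copy": (33, 1000),      # CTR 3.3%, gap too small
        "new_layout": (15, 1000),    # CTR 1.5%, kept (base wins)
        "sparse": (4, 40),           # fewer than 50 views, dropped
    },
)
```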

Key Experimental Results

Poster Generation (Offline Evaluation, 500 Posters)

Method	FID↓	CLIP-T↑	Alignment↓	Overlap↓	MIoU↑
P&R	104.05	27.21	0.014	0.024	0.203
PAID	83.55	28.92	0.013	0.041	0.215
GPT-4o	63.47	29.58	0.009	0.018	0.140
AutoPP	60.71	29.75	0.007	0.011	0.256

Text Rendering Quality

Method	Sen. Acc↑	NED↓	FID↓	CLIP-T↑
PosterMaker	57.87	21.93	49.76	30.43
AutoPP	65.19	12.94	43.19	30.49
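
For readers unfamiliar with the two text metrics, the sketch below shows the usual definitions: sentence accuracy as exact-match rate, and NED as Levenshtein distance divided by the longer string's length. The table values appear to be reported as percentages, and the paper's exact normalization may differ; the sample strings are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred, target):
    longest = max(len(pred), len(target))
    return edit_distance(pred, target) / longest if longest else 0.0

def sentence_accuracy(preds, targets):
    """Fraction of rendered strings that match the target exactly."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

# Hypothetical OCR readings of rendered text versus the intended copy.
preds = ["SALE 50% OFF", "NEW ARIVAL"]
targets = ["SALE 50% OFF", "NEW ARRIVAL"]

acc = sentence_accuracy(preds, targets)
ned = sum(normalized_edit_distance(p, t) for p, t in zip(preds, targets)) / len(targets)
```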

Online CTR Optimization (JD.com, 1-Week Experiment, 10,000 Products)

AutoPP (IDPO): relative CTR improvement +4.49%
AutoPP (standard DPO): +3.10%
CG4CTR / CAIG: negative CTR gain (due to optimizing background only, neglecting text and layout)
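
Relative CTR improvement is conventionally the CTR gap divided by the control CTR. A small sketch with invented traffic numbers; which baseline the paper uses as control is not stated above:

```python
def relative_ctr_lift(control_clicks, control_views, treat_clicks, treat_views):
    """(CTR_treatment - CTR_control) / CTR_control."""
    control_ctr = control_clicks / control_views
    treat_ctr = treat_clicks / treat_views
    return (treat_ctr - control_ctr) / control_ctr

# Hypothetical counts: control CTR 2.00%, treatment CTR 2.09%.
lift = relative_ctr_lift(2000, 100_000, 2090, 100_000)
```

A relative lift of this size corresponds to a much smaller absolute change in CTR, which is why the two should not be confused when reading the results.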

Efficiency

DA mechanism reduces MM-DiT block GFLOPs by 18% at 800×800 resolution and 24% at 1024×1024.
The DA mechanism introduces no additional parameters, whereas PosterMaker adds 1.6B and Flux-ControlNet adds 4.2B.
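
The source of the savings can be seen by counting query-key pairs under each scheme. The token counts below are illustrative assumptions, and the count covers attention scores only, not projections or MLP layers, so it is not expected to reproduce the reported block-level GFLOPs figures.

```python
def full_attention_pairs(n_prompt, n_noise, n_glyph, n_product):
    """Every token attends to every token."""
    total = n_prompt + n_noise + n_glyph + n_product
    return total * total

def decomposed_attention_pairs(n_prompt, n_noise, n_glyph, n_product):
    """Condition self-attention plus image-condition cross-attention."""
    total = n_prompt + n_noise + n_glyph + n_product
    cond_self = n_glyph ** 2 + n_product ** 2
    image_cross = (n_prompt + n_noise) * total
    return cond_self + image_cross

# Hypothetical token counts per stream.
counts = dict(n_prompt=4, n_noise=16, n_glyph=8, n_product=8)
full = full_attention_pairs(**counts)          # 36 * 36
decomp = decomposed_attention_pairs(**counts)  # 64 + 64 + 20 * 36
```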

Effect of Data Scale

Reward accuracy improves as the optimization set grows: 10K→51.20%, 30K→67.19%, 50K→75.99%.
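
Reward accuracy is not defined above; the sketch below assumes the usual DPO definition, namely the fraction of held-out pairs in which the implicit reward of the higher-CTR poster exceeds that of the lower-CTR one. The log-likelihood values and `beta` are invented for illustration.

```python
def reward_accuracy(pairs, beta=0.1):
    """pairs: list of (policy_win, ref_win, policy_lose, ref_lose) log-likelihoods."""
    correct = 0
    for policy_win, ref_win, policy_lose, ref_lose in pairs:
        reward_win = beta * (policy_win - ref_win)     # implicit reward of preferred poster
        reward_lose = beta * (policy_lose - ref_lose)  # implicit reward of rejected poster
        correct += reward_win > reward_lose
    return correct / len(pairs)

# Hypothetical held-out pairs.
held_out = [
    (-1.0, -1.5, -2.0, -1.8),
    (-2.0, -2.0, -1.0, -1.5),
    (-0.5, -1.0, -1.0, -1.0),
    (-1.2, -1.4, -1.6, -1.5),
]
accuracy = reward_accuracy(held_out)
```

Under this definition, 50% corresponds to chance, which puts the 10K result of 51.20% close to an uninformative model.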

Highlights & Insights

End-to-end full automation: From basic product information to the final optimized poster, no manual layout design, text attribute specification, or manual adjustment is required.
Fine-grained attribution via IDPO: Through systematic single-element replacement combined with element-aware weighted DPO, this work is the first to precisely attribute CTR improvements to isolated elements.
Decomposed Attention: Without adding parameters, decomposing full attention into condition SA + image-condition CA reduces computational overhead for long sequences.
Large-scale industrial validation: AutoPP1M is the largest product poster dataset to date; the online experiment involved over one million real users.
Cross-lingual generalization: Although trained primarily on Chinese data, the model exhibits emergent cross-lingual generation capability in English, Japanese, and Korean.

Limitations & Future Work

CTR optimization uses aggregated data from all users, potentially overlooking minority preferences; personalized preference learning could be explored in future work.
The design module and rendering module remain two separate stages; future work could integrate them into a single autoregressive model with unified RLHF-based optimization.
The element replacement strategy relies on GPT-4o for generating background variants, introducing a dependency on an external model.
Validation is limited to the JD.com platform; cross-platform generalizability remains unknown.

Comparison with related methods:

Method	Fully Automated	Text Rendering	Layout Design	CTR Optimization	Element Attribution
PAID	✗ (manual text rules)	Rule-based	Automatic	✗	✗
PosterMaker	✗ (user-provided layout & copy)	SD3+ControlNet	Manual	✗	✗
CG4CTR	-	-	-	✓ (background only)	✗
CAIG	-	-	-	✓ (background only)	✗
AutoPP	✓	Token+DA	Automatic	✓ (all elements)	✓ (IDPO)

Transferable ideas:

The element-isolation optimization paradigm of IDPO generalizes to other multi-element compositional optimization scenarios (e.g., ad creatives, web design, recommendation feed cards).
The condition SA + cross-attention decomposition strategy of Decomposed Attention applies to any multi-condition controlled generation task.
The paradigm of systematic element replacement combined with preference optimization can be applied to other settings requiring online A/B testing feedback.
The use of MLLMs for joint design (simultaneously outputting layout, copy selection, and background description) is worth emulating in other multi-step design tasks.

Rating

Novelty: ⭐⭐⭐⭐ — The element-level attribution optimization of IDPO and the fully automated pipeline design are novel, though individual sub-modules (MLLM design, FLUX rendering) build on mature architectures.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Dual offline and online validation, million-user-scale experiments, and complete ablation studies.
Writing Quality: ⭐⭐⭐⭐ — Clear structure, rich illustrations, and complete method descriptions.
Value: ⭐⭐⭐⭐⭐ — Strong industrial deployment value; actually deployed on JD.com, where even a 0.5% CTR improvement yields significant commercial returns.