AutoPP: Towards Automated Product Poster Generation and Optimization
Conference: AAAI 2026
arXiv: 2512.21921
Code: JD-GenX/AutoPP
Area: Recommender Systems
Keywords: product poster generation, CTR optimization, diffusion model, DPO, multimodal generation
TL;DR
This paper proposes AutoPP, the first pipeline to unify automated product poster generation with CTR-feedback-driven optimization in a single framework. It employs a unified design module to jointly design background, text, and layout; an element rendering module for efficient and controllable poster generation; and Isolated DPO (IDPO) to achieve element-level click-through rate optimization.
Background & Motivation
- Product posters require the artful combination of product images, text, and backgrounds to attract user clicks, yet manual creation and iterative refinement are highly time- and labor-intensive.
- Existing methods exhibit clear automation bottlenecks:
    - PAID adopts a four-stage pipeline (prompt → layout → background → text rendering), in which text attributes (font, color) rely on hand-crafted rules, limiting automation and compromising visual coherence.
    - PosterMaker leverages SD3 + ControlNet for simultaneous background and text rendering, but still requires users to manually design the layout and advertising copy for each poster, making large-scale production inefficient.
- On the optimization side, CG4CTR and CAIG only optimize the CTR of background elements, neglecting the impact of text and layout on click-through rates; moreover, their coarse-grained joint optimization cannot attribute improvements to specific elements.
Core Problem
- Generation side: How can high-quality posters be generated automatically from only basic product information (product image + candidate copy), without manual layout design, copywriting, or text attribute specification?
- Optimization side: How can CTR improvements be precisely attributed to specific poster elements (background / text / layout) to enable fine-grained, element-level optimization rather than coarse holistic adjustment?
Method
Overall Architecture
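AutoPP consists of two major components: a generator (the unified design module and the element rendering module) and an optimizer (systematic element replacement followed by IDPO).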
1. Unified Design Module
- A multimodal large language model (MLLM, initialized from LLaVA) jointly generates three key elements: background prompt \(b\), selected copy \(T^*\), and layout \(l\).
- Input: product image \(I_{\text{product}}\) + candidate copy set \(T\).
- The joint distribution is modeled autoregressively: \(\pi(y|I_{\text{product}}, X_{\text{instr}}) = \prod_i p(y_i | I_{\text{product}}, X_{\text{instr}}, y_{<i})\)
- Compared to distributed multi-model pipelines, single-model joint inference ensures design coherence.
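
The autoregressive factorization above can be illustrated with a minimal sketch. The three token probabilities are made-up numbers, and `sequence_log_prob` is a hypothetical helper name, not something taken from the paper or its code.

```python
import math

def sequence_log_prob(token_probs):
    """Chain rule: log pi(y | I, X) is the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical conditional probabilities p(y_i | I_product, X_instr, y_<i)
# for a three-token output; a real design sequence would be much longer.
token_probs = [0.5, 0.25, 0.8]
log_pi = sequence_log_prob(token_probs)
joint_prob = math.exp(log_pi)  # equals 0.5 * 0.25 * 0.8
```

Working in log space keeps long sequences numerically stable, and the same per-token log-probabilities are what the IDPO weighting later operates on.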
2. Element Rendering Module
- Built upon FLUX.1 dev; product images and glyph images are encoded as condition tokens.
- Key innovation: Decomposed Attention (DA) mechanism replacing full attention in MM-DiT:
    - Condition Self-Attention: Glyph and product tokens each perform self-attention independently, capturing intra-element dependencies.
    - Image-Condition Cross-Attention: Query = [prompt tokens; noise tokens], Key/Value = [all token types concatenated], enabling cross-modal information exchange.
- Advantage of the token-based mechanism: no pixel-level alignment required, robust to spatial misalignment between glyph images and target images.
- Training loss: flow matching loss + OCR perceptual loss (intermediate features from PaddleOCRv4 backbone enforce text region clarity, \(\lambda=0.1\)).
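
The Decomposed Attention pattern described above might be sketched as follows. This is a single-head NumPy illustration without learned projections, not the FLUX.1 implementation. The function names, the concatenation order, and the choice to feed the raw (not self-attended) condition tokens into the cross-attention are all assumptions.

```python
import numpy as np

def attention(q, k, v):
    """Single-head scaled dot-product attention (no learned projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def decomposed_attention(prompt, noise, product, glyph):
    # Condition self-attention: each condition stream attends only to itself.
    product_out = attention(product, product, product)
    glyph_out = attention(glyph, glyph, glyph)
    # Image-condition cross-attention: prompt and noise tokens query all tokens.
    queries = np.concatenate([prompt, noise], axis=0)
    keys_values = np.concatenate([prompt, noise, product, glyph], axis=0)
    image_out = attention(queries, keys_values, keys_values)
    return image_out, product_out, glyph_out

# Toy token counts and feature size, chosen only for illustration.
rng = np.random.default_rng(0)
dim = 8
prompt = rng.normal(size=(4, dim))
noise = rng.normal(size=(16, dim))
product = rng.normal(size=(6, dim))
glyph = rng.normal(size=(5, dim))
image_out, product_out, glyph_out = decomposed_attention(prompt, noise, product, glyph)
```

In this sketch the product stream never sees glyph tokens and vice versa, so changing one condition leaves the other condition's self-attention output untouched.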
3. Systematic Element Replacement
- Starting from generated posters, one element is replaced at a time (keeping the rest fixed) to create variants:
    - Background replacement: GPT-4o generates alternative background descriptions based on the original prompt.
    - Text replacement: a candidate copy of equal length is selected from the candidate set.
    - Layout replacement: the unified design module regenerates the layout.
- Variant posters are randomly displayed on the JD.com platform to collect CTR feedback.
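
One way to picture the single-element replacement step is the sketch below. `PosterDesign`, `single_element_variants`, and the example strings are hypothetical. In the actual pipeline the alternatives come from GPT-4o, the candidate copy set, and the design module rather than from a hand-written dictionary.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PosterDesign:
    background: str   # background prompt b
    text: str         # selected copy T*
    layout: tuple     # layout l, here a single (x, y, w, h) text box

def single_element_variants(base, alternatives):
    """Return (element, variant) pairs that differ from `base` in one field only."""
    variants = []
    for element, options in alternatives.items():
        for option in options:
            if option == getattr(base, element):
                continue  # not a real replacement
            if element == "text" and len(option) != len(base.text):
                continue  # the source replaces copy with one of equal length
            variants.append((element, replace(base, **{element: option})))
    return variants

base = PosterDesign("sunlit kitchen counter", "Fresh daily", (0.1, 0.1, 0.8, 0.2))
alternatives = {
    "background": ["white marble tabletop"],
    "text": ["Crisp today", "Buy now"],      # "Buy now" is dropped: length differs
    "layout": [(0.1, 0.7, 0.8, 0.2)],
}
variants = single_element_variants(base, alternatives)
```

Because every variant differs from the base in exactly one field, any CTR gap between the two can be tied to that field, which is what the element-aware weighting in IDPO relies on.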
4. Isolated Direct Preference Optimization (IDPO)
- Standard DPO performs coarse-grained alignment over the entire output, unable to distinguish the contribution of individual elements.
- IDPO introduces element-aware weights: \(w_i = \sum_{c \in \{b, T^*, l\}} \alpha_c \cdot \mathbb{I}(y_i \in c)\)
- Tokens belonging to the replaced element receive weight \(\alpha=5\) and tokens of unchanged elements receive \(\alpha=1\), so the CTR feedback is concentrated on the element that actually differs between the two posters.
- Weighted log-likelihood normalization: \(\log \pi^w(y|I,X) = \frac{\sum_i w_i \log p(y_i|I,X,y_{<i})}{\sum_i w_i}\)
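
The element-aware weights and the normalized weighted log-likelihood can be sketched as below. The weight values 5 and 1 come from the text above. The element tags, the function names, `beta=0.1`, and the idea of plugging the weighted score into the standard DPO logistic loss are assumptions made for illustration, since the summary does not spell out the final objective.

```python
import math

ALPHA_REPLACED = 5.0   # weight for tokens of the replaced element
ALPHA_OTHER = 1.0      # weight for tokens of unchanged elements

def element_weights(token_elements, replaced):
    """w_i = alpha_c for the element c that token i belongs to."""
    return [ALPHA_REPLACED if e == replaced else ALPHA_OTHER for e in token_elements]

def weighted_log_likelihood(token_log_probs, weights):
    """Weighted sum of token log-probs, normalized by the total weight."""
    return sum(w * lp for w, lp in zip(weights, token_log_probs)) / sum(weights)

def idpo_loss(policy_win, policy_lose, ref_win, ref_lose, beta=0.1):
    """Standard DPO logistic loss applied to weighted scores (assumed form)."""
    margin = beta * ((policy_win - ref_win) - (policy_lose - ref_lose))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Four tokens tagged by element: background "b", copy "T", layout "l".
token_elements = ["b", "b", "T", "l"]
weights = element_weights(token_elements, replaced="T")   # [1.0, 1.0, 5.0, 1.0]
log_probs = [-1.0, -1.0, -2.0, -1.0]
score = weighted_log_likelihood(log_probs, weights)        # -13 / 8
```

With uniform weights the score reduces to the ordinary length-normalized log-likelihood, so the sketch differs from plain DPO only in how much each token counts.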
AutoPP1M Dataset
- Generation subset: 1 million high-quality product posters (1:1 aspect ratio, ≥800×800), sourced from JD.com, with aesthetic filtering, blur detection, and watermark removal.
- Optimization subset: 50,000 pairwise comparisons collected via a 10-day random display experiment; each poster viewed by at least 50 users, with over 1.118 million users participating; pairwise CTR difference ≥1%.
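
The filtering rules for the optimization subset could look roughly like this. The record fields, the function name, pairing within the same product, and reading "≥1%" as an absolute gap of 0.01 in CTR are all assumptions. The summary does not say whether the threshold is absolute or relative.

```python
from itertools import combinations

MIN_VIEWS = 50        # each poster must be seen by at least 50 users
MIN_CTR_GAP = 0.01    # assumed to mean an absolute CTR difference of 1 point

def build_preference_pairs(records):
    """records: dicts with poster_id, product_id, views, clicks."""
    by_product = {}
    for r in records:
        if r["views"] >= MIN_VIEWS:
            by_product.setdefault(r["product_id"], []).append(r)
    pairs = []
    for posters in by_product.values():
        for a, b in combinations(posters, 2):
            ctr_a = a["clicks"] / a["views"]
            ctr_b = b["clicks"] / b["views"]
            if abs(ctr_a - ctr_b) >= MIN_CTR_GAP:
                winner, loser = (a, b) if ctr_a > ctr_b else (b, a)
                pairs.append((winner["poster_id"], loser["poster_id"]))
    return pairs

# Made-up display logs for two products.
records = [
    {"poster_id": "A", "product_id": "p1", "views": 100, "clicks": 10},
    {"poster_id": "B", "product_id": "p1", "views": 100, "clicks": 8},
    {"poster_id": "C", "product_id": "p1", "views": 40, "clicks": 20},   # too few views
    {"poster_id": "D", "product_id": "p2", "views": 200, "clicks": 10},
    {"poster_id": "E", "product_id": "p2", "views": 200, "clicks": 11},  # gap too small
]
pairs = build_preference_pairs(records)
```

Only the A versus B comparison survives here, with A as the preferred poster.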
Key Experimental Results
Poster Generation (Offline Evaluation, 500 Posters)
| Method | FID↓ | CLIP-T↑ | Alignment↓ | Overlap↓ | MIoU↑ |
|---|---|---|---|---|---|
| P&R | 104.05 | 27.21 | 0.014 | 0.024 | 0.203 |
| PAID | 83.55 | 28.92 | 0.013 | 0.041 | 0.215 |
| GPT-4o | 63.47 | 29.58 | 0.009 | 0.018 | 0.140 |
| AutoPP | 60.71 | 29.75 | 0.007 | 0.011 | 0.256 |
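
The summary does not define the layout metrics. As a rough guide, assuming MIoU is a mean intersection-over-union between predicted and reference text boxes, the computation would resemble the sketch below. The box format `(x1, y1, x2, y2)` and the function names are assumptions.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(predicted, reference):
    """Average IoU over index-aligned box pairs."""
    return sum(iou(p, r) for p, r in zip(predicted, reference)) / len(predicted)

# Two unit-overlap squares: intersection 1, union 4 + 4 - 1 = 7.
example = iou((0, 0, 2, 2), (1, 1, 3, 3))
```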
Text Rendering Quality
| Method | Sen. Acc↑ | NED↓ | FID↓ | CLIP-T↑ |
|---|---|---|---|---|
| PosterMaker | 57.87 | 21.93 | 49.76 | 30.43 |
| AutoPP | 65.19 | 12.94 | 43.19 | 30.49 |
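
For readers unfamiliar with the two text metrics, here is a minimal sketch under common definitions: NED as Levenshtein distance divided by the longer string's length, and sentence accuracy as the share of exact matches between OCR output and target copy. Whether the paper uses exactly these definitions is an assumption, and the table appears to report both as percentages.

```python
def edit_distance(a, b):
    """Levenshtein distance via a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred, target):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(pred), len(target))
    return edit_distance(pred, target) / longest if longest else 0.0

def sentence_accuracy(pairs):
    """Fraction of (recognized, target) pairs that match exactly."""
    return sum(1 for pred, target in pairs if pred == target) / len(pairs)

ned = normalized_edit_distance("kitten", "sitting")   # 3 edits over 7 characters
acc = sentence_accuracy([("New arrival", "New arrival"), ("Big sa1e", "Big sale")])
```

A single wrong character fails sentence accuracy outright but only slightly raises NED, which is why the two columns are reported together.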
Online CTR Optimization (JD.com, 1-Week Experiment, 10,000 Products)
- AutoPP (IDPO): relative CTR improvement +4.49%
- AutoPP (standard DPO): +3.10%
- CG4CTR / CAIG: negative CTR gain (due to optimizing background only, neglecting text and layout)
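
A relative CTR improvement is the treatment CTR's gain divided by the control CTR. The sketch below uses made-up click and view counts picked so that the result matches the +4.49% figure. They are not the experiment's actual traffic numbers.

```python
def relative_ctr_lift(control_clicks, control_views, treat_clicks, treat_views):
    """(CTR_treatment - CTR_control) / CTR_control."""
    control_ctr = control_clicks / control_views
    treat_ctr = treat_clicks / treat_views
    return (treat_ctr - control_ctr) / control_ctr

# Hypothetical counts: control CTR 1.0000%, treatment CTR 1.0449%.
lift = relative_ctr_lift(10_000, 1_000_000, 10_449, 1_000_000)
```

Note that a relative lift of about 4.5% on a 1% baseline corresponds to less than 0.05 percentage points in absolute CTR.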
Efficiency
- DA mechanism reduces MM-DiT block GFLOPs by 18% at 800×800 resolution and 24% at 1024×1024.
- DA introduces no additional parameters, whereas PosterMaker adds 1.6B and Flux-ControlNet adds 4.2B.
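
To see where the savings come from, one can count query-key pairs under full attention and under the decomposition. The token counts below are toy values, and the count covers attention scores only. The reported 18% and 24% figures are for the whole MM-DiT block, which presumably also includes projections and MLP layers, so the numbers are not directly comparable.

```python
def attention_pairs_full(prompt, noise, product, glyph):
    """Query-key pairs when every token attends to every token."""
    total = prompt + noise + product + glyph
    return total * total

def attention_pairs_decomposed(prompt, noise, product, glyph):
    """Query-key pairs under condition self-attention plus cross-attention."""
    total = prompt + noise + product + glyph
    cross = (prompt + noise) * total          # image-condition cross-attention
    cond_self = product ** 2 + glyph ** 2     # per-condition self-attention
    return cross + cond_self

# Made-up token counts, chosen only to keep the arithmetic easy to follow.
counts = dict(prompt=2, noise=4, product=3, glyph=3)
full = attention_pairs_full(**counts)               # 12 * 12 = 144
decomposed = attention_pairs_decomposed(**counts)   # 6 * 12 + 9 + 9 = 90
saving = 1 - decomposed / full                      # 0.375
```

The pairs that disappear are the condition tokens' queries over prompt, noise, and the other condition, so the saving grows with the share of condition tokens in the sequence.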
Effect of Data Scale
- Reward Accuracy improves with data scale: 10K→51.20%, 30K→67.19%, 50K→75.99%.
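
Reward accuracy is typically the fraction of held-out preference pairs in which the model scores the higher-CTR poster above the lower-CTR one. The sketch below assumes that definition. The reward values are invented, and how the paper derives rewards (for example, from the DPO implicit reward) is not stated in this summary.

```python
def reward_accuracy(reward_pairs):
    """reward_pairs: (reward_of_preferred, reward_of_rejected) per comparison."""
    correct = sum(1 for preferred, rejected in reward_pairs if preferred > rejected)
    return correct / len(reward_pairs)

# Four hypothetical held-out pairs; two are ranked correctly.
accuracy = reward_accuracy([(0.3, 0.1), (0.2, 0.5), (1.0, -1.0), (0.0, 0.4)])
```

Under this reading, random guessing scores about 50%, so the 51.20% obtained with 10K pairs is barely above chance while 75.99% at 50K is a clear signal.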
Highlights & Insights
- End-to-end full automation: From basic product information to the final optimized poster, with no manual layout design, text attribute specification, or post-hoc adjustment required.
- Fine-grained attribution via IDPO: Through systematic single-element replacement combined with element-aware weighted DPO, this work is the first to precisely attribute CTR improvements to isolated elements.
- Decomposed Attention: Without adding parameters, decomposing full attention into condition SA + image-condition CA reduces computational overhead for long sequences.
- Large-scale industrial validation: AutoPP1M is the largest product poster dataset to date; the online experiment involved over one million real users.
- Cross-lingual generalization: Although trained primarily on Chinese data, the model exhibits emergent cross-lingual generation capability in English, Japanese, and Korean.
Limitations & Future Work
- CTR optimization uses aggregated data from all users, potentially overlooking minority preferences; personalized preference learning could be explored in future work.
- The design module and rendering module remain two separate stages; future work could integrate them into a single autoregressive model with unified RLHF-based optimization.
- The element replacement strategy relies on GPT-4o for generating background variants, introducing a dependency on an external model.
- Validation is limited to the JD.com platform; cross-platform generalizability remains unknown.
Related Work & Insights
| Method | Fully Automated | Text Rendering | Layout Design | CTR Optimization | Element Attribution |
|---|---|---|---|---|---|
| PAID | ✗ (manual text rules) | Rule-based | Automatic | ✗ | ✗ |
| PosterMaker | ✗ (user-provided layout & copy) | SD3+ControlNet | Manual | ✗ | ✗ |
| CG4CTR | - | - | - | ✓ (background only) | ✗ |
| CAIG | - | - | - | ✓ (background only) | ✗ |
| AutoPP | ✓ | Token+DA | Automatic | ✓ (all elements) | ✓ (IDPO) |
- The element-isolation optimization paradigm of IDPO generalizes to other multi-element compositional optimization scenarios (e.g., ad creatives, web design, recommendation feed cards).
- The condition SA + cross-attention decomposition strategy of Decomposed Attention applies to any multi-condition controlled generation task.
- The paradigm of systematic element replacement combined with preference optimization can be applied to other settings requiring online A/B testing feedback.
- The use of MLLMs for joint design (simultaneously outputting layout, copy selection, and background description) is worth emulating in other multi-step design tasks.
Rating
- Novelty: ⭐⭐⭐⭐ — The element-level attribution optimization of IDPO and the fully automated pipeline design are novel, though individual sub-modules (MLLM design, FLUX rendering) build on mature architectures.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Dual offline and online validation, million-user-scale experiments, and complete ablation studies.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, rich illustrations, and complete method descriptions.
- Value: ⭐⭐⭐⭐⭐ — Strong industrial deployment value; actually deployed on JD.com, where even a 0.5% CTR improvement yields significant commercial returns.