Rethinking Layered Graphic Design Generation with a Top-Down Approach

Metadata

  • Conference: ICCV 2025
  • arXiv: 2507.05601
  • Code: Not released
  • Area: Diffusion Models · Graphic Design Generation
  • Keywords: Layered design, VLM, top-down, text rendering, design automation

TL;DR

This paper proposes Accordion, a top-down framework that converts AI-generated rasterized design images into editable layered designs comprising background, foreground object, and vectorized text layers. A single VLM plays distinct roles across its three stages: reference creation, design planning, and layer generation.

Background & Motivation

Root Cause

Graphic designs (posters, advertisements, etc.) are inherently layered — background layers, foreground object layers, and vectorized text layers. Generative AI models can produce visually appealing rasterized designs, yet their outputs lack editability (e.g., text cannot be modified, elements cannot be separated).

Existing methods adopt a bottom-up strategy (e.g., COLE sequentially generates the background, adds objects, and finally places text), leading to:

  1. Lack of global coordination among visual elements — elements added later may conflict with earlier decisions
  2. Text that occupies excessive space or overlaps with foreground objects
  3. No global visual reference to serve as an overall design blueprint

Core Insight: Human designers typically consult existing designs for inspiration (color, layout, typographic style) before creating layers — this is the top-down strategy. This paper represents the first attempt to convert AI-generated rasterized design images into editable layered designs.

Method

Overall Architecture: Three-Stage VLM Pipeline

Accordion is built around a VLM (LLaVA-1.5-7B), which assumes different roles across three stages:

Stage 1: Reference Creation

VLM role: Prompt Enhancer

  • Input: the user's short intent \(I\) or sketch draft \(S\)
  • The VLM expands the brief description into a detailed prompt \(P_{des}\) via in-context learning
  • \(P_{des}\) is fed into a T2I model (Flux) to generate a reference image \(R\)
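
A minimal sketch of this stage, assuming hypothetical `vlm_generate` and `t2i_generate` callables standing in for LLaVA-1.5-7B and Flux (neither name comes from the paper):

```python
# Stage 1 sketch: in-context prompt enhancement followed by T2I generation.
# `vlm_generate` and `t2i_generate` are assumed callables, not real APIs.

FEW_SHOT_EXAMPLES = [
    ("Coffee shop grand opening poster",
     "A warm poster with a latte-art photo background, a bold serif "
     "headline reading 'GRAND OPENING', and a muted brown palette."),
]

def enhance_prompt(user_intent: str, vlm_generate) -> str:
    """Expand a short intent I into a detailed design prompt P_des
    via in-context learning (the Prompt Enhancer role)."""
    shots = "\n\n".join(
        f"Intent: {i}\nDetailed prompt: {p}" for i, p in FEW_SHOT_EXAMPLES
    )
    query = f"{shots}\n\nIntent: {user_intent}\nDetailed prompt:"
    return vlm_generate(query)

def create_reference(user_intent: str, vlm_generate, t2i_generate):
    """Produce the reference image R that anchors the later stages."""
    p_des = enhance_prompt(user_intent, vlm_generate)
    return p_des, t2i_generate(p_des)  # R: the global design blueprint
```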

Stage 2: Design Planning

VLM role: Design Planner

  • Input: the reference image \(R\) and a composite prompt \(P = P_{task} + P_{des} + P_{ocr}\)
  • The VLM outputs an ordered dictionary sequence \(\{D_*\}\); each entry contains a bounding box and attributes
  • Text attributes include content, color, font, alignment, line count, angle, etc.
  • Key capability: correcting meaningless AI-generated text into semantically valid content (illustrated in the sketch below)
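
A hedged sketch of what one ordered plan \(\{D_*\}\) could look like; the field names and values are illustrative assumptions, not the paper's exact schema:

```python
# Hypothetical Stage 2 output: an ordered list of layer dictionaries.
plan = [
    {
        "type": "object",                # foreground object layer
        "bbox": [120, 340, 480, 760],    # x1, y1, x2, y2 in pixels
    },
    {
        "type": "text",                  # vectorized text layer (topmost)
        "bbox": [60, 40, 580, 140],
        "content": "SUMMER SALE",        # corrected, semantically valid text
        "color": "#FFFFFF",
        "font": "Montserrat-Bold",
        "alignment": "center",
        "line_count": 1,
        "angle": 0,                      # rotation in degrees
    },
]
```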

Stage 3: Layer Generation

VLM role: Quality Selector

Layers are extracted sequentially according to the plan:

  1. Text removal: text regions are removed conditioned on the planned text bounding boxes
  2. Object extraction: SAM extracts foreground objects → an inpainting model fills the exposed background
  3. Result selection: the VLM evaluates multiple candidates and selects the best

Final composition: background \(B\) + object layers \(\{O_n\}\) + vectorized text layer \(T\).
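
A minimal sketch of this extraction loop, assuming hypothetical `remove_text`, `sam_segment`, `inpaint`, and `vlm_rank` callables for the text-removal model, SAM, the inpainting model, and the VLM selector:

```python
# Stage 3 sketch: peel layers off the reference per the Stage 2 plan.
# All four model callables are assumed stand-ins, not real APIs.

def extract_layers(reference, plan, remove_text, sam_segment, inpaint,
                   vlm_rank, num_candidates=4):
    text_entries = [d for d in plan if d["type"] == "text"]
    object_entries = [d for d in plan if d["type"] == "object"]

    # 1. Text removal, conditioned on the planned text bounding boxes;
    #    the VLM (Quality Selector) picks the best of several candidates.
    candidates = [remove_text(reference, [d["bbox"] for d in text_entries])
                  for _ in range(num_candidates)]
    background = candidates[vlm_rank(candidates)]

    # 2. Object extraction: segment each foreground object, then inpaint
    #    the hole it leaves so the background layer stays complete.
    objects = []
    for d in object_entries:
        mask = sam_segment(background, d["bbox"])  # SAM prompted by the box
        objects.append((background, mask))         # (image, mask) object layer
        background = inpaint(background, mask)

    # 3. Final composition: background B + object layers {O_n} + text T,
    #    where T is re-rendered as vectors from the plan's text attributes.
    return background, objects, text_entries
```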

Training Data Construction

Three types of references are used to fine-tune the VLM:

  1. Original designs: parsed directly to train text de-rendering capability
  2. Designs with meaningless text: text regions are corrupted using an SD1.5 inpainting model (strength 0.5–0.7) to train repair capability
  3. Text-free backgrounds: all text removed to train the ability to add text onto a bare background

A total of 156,932 training samples (39K × 3 reference types + questionnaire dataset).
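
A minimal sketch of how the three reference types could be derived from one parsed layered design; the inputs and the `sd15_inpaint` callable are assumptions about the data pipeline, not its actual implementation:

```python
import random

def build_references(composite, text_boxes, text_free_bg, sd15_inpaint):
    """Derive the three training-reference types from one parsed design."""
    refs = []
    # 1. Original design: trains text de-rendering (image -> text plan).
    refs.append(("original", composite))
    # 2. Meaningless-text design: corrupt text regions so the VLM learns
    #    to replace garbled AI text with semantically valid content.
    strength = random.uniform(0.5, 0.7)  # the paper's stated strength range
    refs.append(("meaningless_text",
                 sd15_inpaint(composite, text_boxes, strength=strength)))
    # 3. Text-free background: trains adding text onto a bare background.
    refs.append(("text_free", text_free_bg))
    return refs
```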

Key Experimental Results

Main Results: Quantitative Comparison on the DesignIntention Benchmark

Method      Design & Layout   Content Relevance   Typography & Color   Graphics & Imagery   Creativity   Average
COLE        6.0               6.9                 5.7                  6.2                  5.1          6.0
Open-COLE   6.3               7.0                 5.6                  7.1                  5.3          6.3
Accordion   6.7               7.4                 6.1                  7.3                  5.1          6.5

Accordion achieves an average score of 6.5, surpassing COLE (6.0) and Open-COLE (6.3), while using only 39K training samples — far fewer than COLE's 100K.

Designer User Study (29 designers, 30 cases)

Evaluation Dimension                   Accordion Preferred over COLE (%)
Text-to-template editability           73.5
Sketch-to-description reasonableness   87.2

Key Findings

  1. Accordion produces an average text length of 61.7 characters vs. COLE's 42.3 (1.5×) — indicating more effective use of available space
  2. Aesthetic score of 4.98 vs. COLE's 4.72 — global reference ensures visual harmony among elements
  3. Layer count detection MAE: 0.494 for text layers, 0.274 for object layers — indicating high layering accuracy
  4. VLM questionnaire-based selection improves text removal PSNR from 18.01 to 21.18
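
For context on finding 4, PSNR here is the standard peak signal-to-noise ratio over the restored region; a minimal NumPy sketch of the metric (not the paper's evaluation code):

```python
import numpy as np

def psnr(clean: np.ndarray, restored: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means the text-removed
    result is closer to the ground-truth clean background."""
    mse = np.mean((clean.astype(np.float64) - restored.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```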

Highlights & Insights

  1. Paradigm shift: top-down vs. bottom-up — establishing a global reference before layered extraction avoids incremental visual conflicts
  2. Triple-role VLM: a single VLM serves as prompt enhancer, design planner, and quality selector across different stages
  3. Model agnosticism: any T2I model (Flux / SD3 / future models) can be plugged in as the reference source without retraining
  4. Design variants: supports creative exploration via upstream model variants, inference-time variants, and downstream model variants

Limitations & Future Work

  • SAM achieves an IoU of only 68.4%, with difficulties in extracting transparent or hollow objects
  • The framework assumes the text layer always resides above object layers, precluding more complex layer hierarchies
  • The text layer supports only 2,000 predefined styles, with no support for freeform text or special-effect typography
  • Inference time of 36.7 seconds per sample leaves room for improving interactive usability

Related Work

  • Layered design generation: COLE, Open-COLE, De-Render
  • Text rendering: TextDiffuser, TextDiffuser-2
  • VLM applications: LLaVA, GPT-4V applied to design evaluation

Rating

  • Novelty: ★★★★☆ — Top-down strategy and first attempt at converting AI-generated designs into editable layered formats
  • Technical Depth: ★★★★☆ — Elegantly designed three-stage pipeline with thoughtful training data construction
  • Practicality: ★★★★★ — Directly addresses real-world design needs