PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow

Conference: CVPR 2026
arXiv: 2603.25738
Code: https://henghuiding.com/PSDesigner/
Area: Image Generation / Automated Design
Keywords: Automated graphic design, PSD file manipulation, tool invocation, reinforcement learning, creative workflow

TL;DR

This paper presents PSDesigner, an automated graphic design system that simulates the creative workflow of human designers. It operates through three collaborative modules — AssetCollector (resource collection), GraphicPlanner (tool-call planning), and ToolExecutor (PSD operation execution) — and is trained on CreativePSD, the first PSD-format design dataset, enabling the system to learn professional design workflows and directly generate editable PSD files.

Background & Motivation

Background: Graphic design plays a critical role in e-commerce and advertising. Existing automated approaches fall into two categories: (a) text-to-image models (e.g., FLUX, Glyph-Byt5) that generate design images; and (b) MLLM-driven methods (e.g., LaDeCo, COLE) that directly generate editable design files in JSON format.
Limitations of Prior Work: Text-to-image methods produce non-editable outputs with inaccurate text rendering (particularly for Chinese); MLLM-based methods group layers by predefined categories (underlay/text) and predict all attributes in a single pass, resulting in an unintuitive design process with limited flexibility.
Key Challenge: Existing methods significantly oversimplify professional design workflows — (a) grouping by category is less intuitive than grouping by visual concept; (b) one-shot prediction of all layer attributes lacks the flexibility of progressive design; (c) only simple layer hierarchies and a limited set of attribute types are supported, falling far short of production-level design requirements.
Goal: To construct an automated design system that simulates the workflow of human designers, handles complex PSD layer hierarchies (averaging 48.35 layers), supports diverse layer types and 60+ attribute types, and generates editable, professional-grade PSD files.
Key Insight: By observing how human designers work — first collecting thematic assets, then iteratively integrating resources grouped by visual concept, and refining defects after each integration step — the paper formalizes this process as a tool-call prediction problem for VLMs.
Core Idea: Graphic design is modeled as a sequential tool-call prediction task for VLMs, with SFT and GRPO training enabling the model to learn iterative asset integration and defect correction operations.

Method

Overall Architecture

Given a user instruction, PSDesigner operates in three stages: (1) AssetCollector employs an LLM to identify visual concepts and gather relevant assets (images/text) for each concept; (2) the nested layer hierarchy is traversed bottom-up, with GraphicPlanner predicting tool calls for asset integration (\(\mathcal{X}_{gen}\) mode) followed by defect correction (\(\mathcal{X}_{edt}\) mode) at each iteration; (3) ToolExecutor executes these tool calls in Adobe Photoshop via the UXP API, supporting 70+ PSD operations.
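
The three stages above can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `Group`, `bottom_up`, `run_design`, and the planner and executor callables are hypothetical names, and the real GraphicPlanner is a VLM rather than a Python function.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Group:
    """A visual-concept group: its own assets plus nested child groups (hypothetical)."""
    name: str
    assets: List[str] = field(default_factory=list)
    children: List["Group"] = field(default_factory=list)


def bottom_up(group: Group) -> List[Group]:
    """Post-order traversal: every child group is visited before its parent."""
    order: List[Group] = []
    for child in group.children:
        order.extend(bottom_up(child))
    order.append(group)
    return order


def run_design(
    root: Group,
    plan_gen: Callable[[Group], List[Dict]],  # stands in for the X_gen mode
    plan_edt: Callable[[Group], List[Dict]],  # stands in for the X_edt mode
    execute: Callable[[Dict], None],          # stands in for ToolExecutor
) -> List[Tuple[str, str, str]]:
    """Integrate each group, then correct defects, and log every executed call."""
    log: List[Tuple[str, str, str]] = []
    for group in bottom_up(root):
        # Asset integration runs first; defect correction follows in the same iteration.
        for mode, planner in (("gen", plan_gen), ("edt", plan_edt)):
            for call in planner(group):
                execute(call)
                log.append((group.name, mode, call["tool"]))
    return log
```

With stub planners, a poster made of a title group and a product group is processed in the order title, product, poster, so inner groups are composed before the group that contains them.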

Key Designs

CreativePSD Dataset (the first PSD-format design dataset):
- Function: Provides supervised training data on professional design operations for GraphicPlanner.
- Mechanism: Constructed in three stages — Stage I: high-quality PSD files are collected from the internet and commercial sources, with professional annotators grouping layers by visual concept; Stage II: PSD files are parsed to extract raw assets, metadata, and intermediate rendering results; Stage III: training samples are constructed in both \(\mathcal{X}_{gen}\) (asset integration) and \(\mathcal{X}_{edt}\) (defect correction) modes from the extracted information. Each training sample is a triplet \((a, \mathcal{C}, x)\): assets + observations + tool-call sequence.
- Design Motivation: Existing datasets (CGL/Crello/Design39K) average only 4–5 layers, 2 layer types, and limited attributes. CreativePSD contains 10,454 samples with an average of 48.35 layers, 5 layer types, and 60+ attribute types, enabling learning of realistic design workflows.
GraphicPlanner (Dual-Mode VLM Tool-Call Predictor):
- Function: Predicts the next PSD operation tool call based on the current design state.
- Mechanism: Built on Qwen2.5-VL-7B with mode-specific LoRA modules injected. The \(\mathcal{X}_{gen}\) mode receives assets \(a\) and observations \((M, R)\) (layer metadata + current rendering) to predict integration tool calls; the \(\mathcal{X}_{edt}\) mode receives observations \((M, R, G)\) (perturbed metadata + rendering + pre-group rendering) to predict correction tool calls. Training proceeds in two stages: SFT first establishes basic tool–parameter mapping, followed by GRPO reinforcement learning to refine the precision of parameter values.
- Design Motivation: Asset integration and defect correction are fundamentally different operations; mode-specific LoRA modules prevent task interference. The GRPO reward function directly evaluates the correctness of tool names and parameter values, improving the accuracy of tool-call predictions.
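
To make the two modes concrete, the sketch below models a training triplet \((a, \mathcal{C}, x)\) and checks that the observation matches the mode: \((M, R)\) for \(\mathcal{X}_{gen}\) and \((M, R, G)\) for \(\mathcal{X}_{edt}\). All class, field, and adapter names are assumptions made for illustration; the source only states that each mode has its own LoRA module.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Hypothetical adapter identifiers, one per mode.
ADAPTER_FOR_MODE = {"gen": "lora_gen", "edt": "lora_edt"}


@dataclass
class Observation:
    metadata: Dict[str, Any]         # M: layer metadata (perturbed in edt mode)
    rendering: str                   # R: path to the current rendering
    pre_group: Optional[str] = None  # G: pre-group rendering, edt mode only


@dataclass
class Sample:
    mode: str                         # "gen" or "edt"
    assets: List[str]                 # a
    obs: Observation                  # C
    tool_calls: List[Dict[str, Any]]  # x: target tool-call sequence


def check_sample(sample: Sample) -> str:
    """Validate the observation for the sample's mode and return its adapter name."""
    if sample.mode not in ADAPTER_FOR_MODE:
        raise ValueError(f"unknown mode: {sample.mode}")
    has_pre_group = sample.obs.pre_group is not None
    if sample.mode == "edt" and not has_pre_group:
        raise ValueError("edt samples need the pre-group rendering G")
    if sample.mode == "gen" and has_pre_group:
        raise ValueError("gen samples observe only (M, R)")
    return ADAPTER_FOR_MODE[sample.mode]
```

Keeping the mode on the sample lets a single data loader route each example to the matching adapter, which is one plausible way to keep the two tasks from interfering.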
ToolExecutor (UXP-Based PSD Operation Executor):
- Function: Translates tool calls predicted by GraphicPlanner into actual PSD file operations.
- Mechanism: Implements 70+ Photoshop operations using JavaScript APIs on the Adobe UXP framework, including inserting images/text/adjustment layers, applying effects (inner glow, drop shadow, etc.), setting blending modes, and clipping masks.
- Design Motivation: Direct manipulation of PSD files — rather than JSON representations — enables support for the complex layer attributes and effect configurations required in production-level design.
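
The dispatch step, turning a predicted tool call into an operation, can be illustrated with a small registry. The actual ToolExecutor is written in JavaScript against the Adobe UXP API; the Python sketch below only mirrors the dispatch logic on an in-memory list of layers, and the tool names and parameters (`insert_text`, `add_drop_shadow`) are invented for the example.

```python
from typing import Any, Callable, Dict, List

Layer = Dict[str, Any]
REGISTRY: Dict[str, Callable[..., None]] = {}


def tool(name: str) -> Callable:
    """Register a handler under the tool name that the planner would emit."""
    def register(fn: Callable[..., None]) -> Callable[..., None]:
        REGISTRY[name] = fn
        return fn
    return register


@tool("insert_text")
def insert_text(doc: List[Layer], content: str, x: int, y: int) -> None:
    # Append a new text layer on top of the stack.
    doc.append({"type": "text", "content": content, "x": x, "y": y, "effects": []})


@tool("add_drop_shadow")
def add_drop_shadow(doc: List[Layer], layer_index: int, opacity: float) -> None:
    # Attach an effect to an existing layer.
    doc[layer_index]["effects"].append({"kind": "drop_shadow", "opacity": opacity})


def execute(doc: List[Layer], call: Dict[str, Any]) -> None:
    """Look up the handler by tool name and apply it with the predicted parameters."""
    handler = REGISTRY.get(call["tool"])
    if handler is None:
        raise KeyError(f"unknown tool: {call['tool']}")
    handler(doc, **call["params"])
```

A registry of this kind makes it cheap to grow the tool set, since each new operation is one decorated function and the planner's output format stays unchanged.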

Loss & Training

The SFT stage uses the standard autoregressive cross-entropy loss. The GRPO stage uses a dedicated reward function \(r\) that compares each generated tool call against the ground truth on tool names, parameter names, and parameter values. SFT runs for 15,000 steps (\(\mathcal{X}_{gen}\)) and 12,000 steps (\(\mathcal{X}_{edt}\)) with batch size 64, learning rate \(2 \times 10^{-4}\), and LoRA rank 32. GRPO runs for 6,000 steps with group size 8 on 4,000 PSD files.
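
A reward of this kind, together with the group-relative advantage that GRPO computes from it, might look like the sketch below. The equal weighting of the three components and the 5% numeric tolerance are assumptions, since the summary does not give the exact formula; the advantage normalization (reward minus the group mean, divided by the group standard deviation) follows the standard GRPO formulation.

```python
import statistics
from typing import Any, Dict, List


def values_match(pred: Any, gt: Any, tol: float) -> bool:
    """Numbers match within a relative tolerance; everything else must be equal."""
    numeric = (int, float)
    if isinstance(pred, numeric) and isinstance(gt, numeric):
        return abs(pred - gt) <= tol * max(abs(gt), 1.0)
    return pred == gt


def call_reward(pred: Dict[str, Any], gt: Dict[str, Any], tol: float = 0.05) -> float:
    """Score one tool call in [0, 1] on tool name, parameter names, and parameter values."""
    if pred["tool"] != gt["tool"]:
        return 0.0  # a wrong tool makes its parameters meaningless
    gt_params, pred_params = gt["params"], pred["params"]
    if not gt_params:
        return 1.0
    shared = [k for k in gt_params if k in pred_params]
    name_score = len(shared) / len(gt_params)
    value_hits = sum(values_match(pred_params[k], gt_params[k], tol) for k in shared)
    value_score = value_hits / len(gt_params)
    # Assumed weighting: tool name, parameter names, and values count equally.
    return (1.0 + name_score + value_score) / 3.0


def group_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one sampled group (size 8 in the reported setup)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]
```

Because advantages are computed relative to the other samples in the group, a prediction is rewarded for being closer to the ground truth than its siblings, which is what pushes the policy toward more precise parameter values.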

Key Experimental Results

Main Results

User intent → design (VLM score, out of 10):

Method	Quality	Layout	Relevance	Harmony	Creativity	Editable
PSDesigner	7.62	8.68	7.78	8.02	8.45	✓ PSD
CanvaGPT	8.52	8.15	4.72	7.21	7.52	✓
FLUX	8.18	6.88	6.92	6.82	6.95	✗
PosterCraft	7.95	8.35	8.42	8.05	5.87	✗
OpenCOLE	5.12	3.66	5.25	6.68	6.08	Partial

Design composition on Crello-v5 (VLM score):

Method	Quality	Layout	Harmony	Creativity
Ours	7.85	7.43	6.77	6.94
LaDeCo	5.95	6.03	7.22	5.75
Ground Truth	8.13	9.18	8.90	7.12

Ablation Study

Configuration	Quality	Layout	Harmony	Creativity
Full model (Crello)	7.85	7.43	6.77	6.94
w/o \(\mathcal{X}_{edt}\)	6.05	5.88	5.90	6.75
w/o layer info \(M\)	6.25	6.10	6.18	6.02
w/o RL (GRPO)	6.38	6.00	6.35	6.20
Full model (PSD)	6.28	6.15	7.02	6.88
w/o \(\mathcal{X}_{edt}\) (PSD)	5.32	5.15	6.22	6.05

Key Findings

Removing the \(\mathcal{X}_{edt}\) mode (defect correction) has the largest impact on quality and layout (drops of 1.80 and 1.55 points respectively on Crello), demonstrating that iterative refinement is critical to design quality.
Removing layer metadata \(M\) prevents the model from perceiving other elements within the same group, leading to significant degradation in layout and harmony.
GRPO reinforcement learning is essential for accurately predicting tool-call parameter values; its removal causes a 1.43-point drop in layout score.
PSDesigner is the only system capable of simultaneously generating editable PSD files and accurately rendering Chinese text.
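
The score drops quoted above follow directly from the Crello rows of the ablation table. The short check below recomputes them; the dictionary keys are shorthand labels chosen for this example.

```python
# Crello ablation scores copied from the table above.
CRELLO = {
    "full":      {"quality": 7.85, "layout": 7.43, "harmony": 6.77, "creativity": 6.94},
    "w/o X_edt": {"quality": 6.05, "layout": 5.88, "harmony": 5.90, "creativity": 6.75},
    "w/o M":     {"quality": 6.25, "layout": 6.10, "harmony": 6.18, "creativity": 6.02},
    "w/o GRPO":  {"quality": 6.38, "layout": 6.00, "harmony": 6.35, "creativity": 6.20},
}


def drops(config: str) -> dict:
    """Per-metric score drop of an ablated configuration relative to the full model."""
    full = CRELLO["full"]
    return {metric: round(full[metric] - CRELLO[config][metric], 2) for metric in full}


def largest_drop(metric: str) -> str:
    """Name of the ablation that hurts the given metric the most."""
    ablations = [name for name in CRELLO if name != "full"]
    return max(ablations, key=lambda name: drops(name)[metric])


for name in CRELLO:
    if name != "full":
        print(name, drops(name))
```

Removing \(\mathcal{X}_{edt}\) gives the 1.80 (quality) and 1.55 (layout) drops, removing GRPO gives the 1.43 layout drop, and `largest_drop` confirms that \(\mathcal{X}_{edt}\) is the most damaging ablation for both quality and layout.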

Highlights & Insights

Human-Like Workflow Design: Decomposing the design process into an iterative "collect → integrate → refine" pipeline closely mirrors how human designers think. This problem decomposition paradigm is transferable to other multi-step creative tasks (e.g., slide authoring, video editing).
CreativePSD Dataset: The first PSD-format training dataset for design, with an average of 48 layers and 60+ attributes, substantially expanding the complexity level that automated design systems can handle. The three-stage data construction pipeline (collection → parsing → training data construction) also constitutes a reusable methodology.
GRPO for Tool-Call Refinement: Reinforcement learning is applied specifically to improve the precision of tool-call parameters. SFT can only approximate the distribution of parameter values, whereas the GRPO reward directly scores each generated tool call against the ground truth.

Limitations & Future Work

Evaluation relies primarily on VLM scoring and user studies, lacking more objective quantitative metrics (e.g., layer attribute prediction accuracy).
AssetCollector depends on the quality of external image search/generation models; failures in asset collection cascade into degraded design quality.
Only static designs are supported; dynamic or interactive designs (e.g., web pages, animations) are not addressed.
The dataset scale (10,454 samples) remains small compared to LLM training corpora, potentially limiting generalization.
Deep coupling with Photoshop via the UXP API restricts extensibility to other design tools.

vs. LaDeCo: LaDeCo groups layers by predefined categories (image/text) and predicts all attributes for each category in a single pass, failing to handle complex hierarchies. PSDesigner groups by visual concept and integrates resources iteratively, achieving 1.9 points higher quality on Crello.
vs. COLE/OpenCOLE: COLE constructs a pipeline of multiple task-specific models, but its editability is limited (only a single image layer plus text layers). PSDesigner unifies the process under VLM-driven tool calls, supporting full PSD layer hierarchies.
vs. T2I methods (FLUX/PosterCraft): These methods produce visually high-quality but non-editable raster images, with frequent errors in rendering Chinese or complex text.

Rating

Novelty: ⭐⭐⭐⭐ — First application of the VLM tool-call paradigm to PSD-level automated design; CreativePSD is also a first-of-its-kind contribution.
Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive comparisons against multiple baselines with thorough ablations, though objective metrics such as tool-call accuracy are absent.
Writing Quality: ⭐⭐⭐⭐ — The comparison figure between human designer workflows and PSDesigner is highly intuitive; problem motivation is clearly articulated.
Value: ⭐⭐⭐⭐ — Significant contribution to the automated design field, demonstrating the potential of VLM + tool invocation for creative tasks.