coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation¶

Conference: CVPR 2025
arXiv: 2603.12829
Code: To be confirmed
Area: Image Generation / Compositional Generation
Keywords: Multi-Agent, Compositional Generation, Layout Planning, Text-to-Image, Closed-Loop Reasoning

TL;DR¶

This paper proposes coDrawAgents, an interactive multi-agent dialogue framework consisting of four expert agents: Interpreter, Planner, Checker, and Painter. Through divide-and-conquer incremental layout planning, visual context-aware reasoning, and explicit error correction, it achieves 0.94 (SOTA) on GenEval and 85.17 (SOTA) on DPG-Bench.

Background & Motivation¶

Background: T2I models still struggle to correctly compose multiple objects and maintain attribute consistency in complex scenes.

Limitations of Prior Work: (1) Global layout planning faces \(O(n^2)\) relational complexity; (2) Most methods lack visual feedback for spatial reasoning; (3) It is difficult to correct errors once a rough structure is settled early in the diffusion pipeline.

Key Challenge: The capacity bottleneck of a single agent is apparent, while existing multi-agent methods mostly rely on fixed pipelines and lack closed-loop reasoning.

Goal: To solve compositional image generation in complex scenes via a multi-agent closed-loop dialogue.

Key Insight: Decompose the generation process into four specialized roles, interacting through dynamic dialogue rather than sequential pipelining.

Core Idea: Compositional image generation requires "planning-checking-rendering" closed-loop collaboration.

Method¶

Overall Architecture¶

The Interpreter first decides the generation mode (layout-free or layout-aware). In layout-aware mode, it parses the text into structured object descriptions and groups them by semantic priority. Then, for each group: the Planner performs incremental planning -> the Checker validates and corrects -> the Painter renders the canvas -> the loop moves to the next group.
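
The layout-aware loop can be sketched as a small driver. This is a minimal illustration under stated assumptions, not the authors' implementation: `Canvas`, `run_layout_aware`, and the three stub agents are hypothetical names, the real Planner and Checker are LLM calls, and the layout-free branch is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]
Layout = dict[str, Box]


@dataclass
class Canvas:
    layout: Layout = field(default_factory=dict)  # boxes accepted so far
    renders: int = 0                               # how many times the Painter ran


def run_layout_aware(
    groups: list[list[str]],
    plan: Callable[[list[str], Canvas], Layout],
    check: Callable[[Layout, Canvas], Layout],
    paint: Callable[[Canvas], None],
) -> Canvas:
    """Plan -> check -> paint once per priority group (assumed control flow)."""
    canvas = Canvas()
    for group in groups:                          # highest-priority group first
        proposal = plan(group, canvas)            # Planner sees the current canvas
        canvas.layout = check(proposal, canvas)   # Checker may also revise history
        paint(canvas)                             # re-render before the next group
    return canvas


# Stub agents (assumptions) so the loop can be exercised without any model.
def stub_plan(group: list[str], canvas: Canvas) -> Layout:
    width = 1.0 / len(group)
    return {name: (i * width, 0.0, (i + 1) * width, 1.0)
            for i, name in enumerate(group)}


def stub_check(proposal: Layout, canvas: Canvas) -> Layout:
    return {**canvas.layout, **proposal}          # accept everything unchanged


def stub_paint(canvas: Canvas) -> None:
    canvas.renders += 1


result = run_layout_aware([["table"], ["cat", "vase"]],
                          stub_plan, stub_check, stub_paint)
print(result.renders, sorted(result.layout))
```

In this sketch the Checker returns the full layout rather than only the new boxes, which is one simple way to let it rewrite earlier decisions.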

Key Designs¶

Interpreter
- Function: Decides the generation mode and, for complex prompts, decomposes them into structured object descriptions.
- Mechanism: LLM + CoT in three steps (identification, priority grouping, attribute enrichment).
- Design Motivation: Simple prompts are generated directly in layout-free mode; this adaptive selection avoids unnecessary planning overhead.
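
To make the Interpreter's three steps concrete, the sketch below post-processes a hand-written reply. The JSON field names (`name`, `priority`, `attributes`), the convention that a lower priority value is planned earlier, and the helper `parse_interpreter_reply` are all assumptions for illustration; the notes do not specify the actual output schema.

```python
import json
from itertools import groupby

# Hypothetical Interpreter reply; field names are assumptions, not the paper's schema.
RAW_REPLY = """
[
  {"name": "table", "priority": 0, "attributes": ["round", "wooden"]},
  {"name": "cat",   "priority": 1, "attributes": ["black"]},
  {"name": "vase",  "priority": 1, "attributes": ["blue", "ceramic"]}
]
"""


def parse_interpreter_reply(raw: str) -> list[list[dict]]:
    objects = json.loads(raw)                      # identification: one entry per object
    for obj in objects:                            # attribute enrichment
        obj["description"] = " ".join(obj["attributes"] + [obj["name"]])
    objects.sort(key=lambda o: o["priority"])      # stable sort keeps prompt order
    # priority grouping: objects with equal priority are planned together
    return [list(g) for _, g in groupby(objects, key=lambda o: o["priority"])]


groups = parse_interpreter_reply(RAW_REPLY)
for group in groups:
    print([obj["description"] for obj in group])
```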
Planner + Visual Chain-of-Thought (VCoT)
- Function: Incrementally plans the layout of current high-priority objects.
- Mechanism: Employs GPT-5 for multimodal VCoT, taking global text, object descriptions, historical layouts, partial images, and object localization as input.
- Three-step Reasoning: Canvas state analysis -> Context-aware planning -> Physical constraint execution.
- Design Motivation: Divide-and-conquer reduces complexity, and visual context eliminates "imagining from thin air".
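
The Planner's inputs and its last reasoning step can be sketched as follows. `PlannerInput` simply mirrors the five inputs listed above. `clamp_box` is one possible reading of "physical constraint execution" (keep each box on the canvas with a minimum size); that reading, the normalized coordinates, and the `min_size` threshold are assumptions, since the notes do not define the constraints.

```python
from dataclasses import dataclass
from typing import Optional

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


@dataclass
class PlannerInput:
    global_text: str                     # the full user prompt
    object_descriptions: list[str]       # current priority group only
    layout_history: dict[str, Box]       # boxes accepted in earlier rounds
    partial_image: Optional[bytes]       # canvas rendered so far, None in round one
    object_locations: dict[str, Box]     # where objects were localized in that image


def clamp_box(box: Box, min_size: float = 0.05) -> Box:
    """Force a proposed box onto the canvas with at least min_size per side."""
    x0, x1 = sorted((box[0], box[2]))    # tolerate swapped corners
    y0, y1 = sorted((box[1], box[3]))
    x0 = min(max(x0, 0.0), 1.0 - min_size)
    y0 = min(max(y0, 0.0), 1.0 - min_size)
    x1 = min(max(x1, x0 + min_size), 1.0)
    y1 = min(max(y1, y0 + min_size), 1.0)
    return (x0, y0, x1, y1)


request = PlannerInput(
    global_text="a black cat next to a blue vase on a wooden table",
    object_descriptions=["black cat", "blue vase"],
    layout_history={"table": (0.1, 0.5, 0.9, 0.9)},
    partial_image=None,
    object_locations={},
)
fixed = clamp_box((-0.2, 0.5, 0.3, 1.4))   # a proposal that spills off the canvas
print(fixed)
```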
Checker
- Function: Two-stage checking and correction (current proposal check + full history backtracking).
- Mechanism: Performs object-level (size, ratio) and global-level (position, occlusion) checks, backtracking to correct history.
- Design Motivation: Explicit correction compensates for the "hard-to-change once decided" drawback of diffusion models.
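
A rule-based stand-in for the two-stage check is sketched below. It only detects problems; in the framework the Checker is an LLM that also proposes corrections. The thresholds (`min_area`, `max_aspect`, `max_iou`) and the use of IoU as an occlusion proxy are assumptions made for illustration.

```python
from itertools import combinations

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a: Box, b: Box) -> float:
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def object_issues(name: str, box: Box,
                  min_area: float = 0.01, max_aspect: float = 4.0) -> list[str]:
    """Object-level checks: size and aspect ratio."""
    w, h = box[2] - box[0], box[3] - box[1]
    if w <= 0 or h <= 0:
        return [f"{name}: degenerate box"]
    issues = []
    if w * h < min_area:
        issues.append(f"{name}: too small")
    if max(w / h, h / w) > max_aspect:
        issues.append(f"{name}: extreme aspect ratio")
    return issues


def global_issues(layout: dict[str, Box], max_iou: float = 0.5) -> list[str]:
    """Global checks: position on the canvas and heavy overlap between objects."""
    issues = [f"{n}: outside canvas" for n, b in layout.items()
              if not all(0.0 <= v <= 1.0 for v in b)]
    for (na, a), (nb, b) in combinations(layout.items(), 2):
        if iou(a, b) > max_iou:
            issues.append(f"{na}/{nb}: heavy overlap")
    return issues


def check(proposal: dict[str, Box], history: dict[str, Box]) -> list[str]:
    # Stage 1: inspect only the new proposal.
    issues = [i for n, b in proposal.items() for i in object_issues(n, b)]
    # Stage 2: re-inspect the full layout, history included.
    return issues + global_issues({**history, **proposal})


history = {"table": (0.1, 0.5, 0.9, 0.9)}
print(check({"vase": (0.4, 0.2, 0.6, 0.5)}, history))
print(check({"cat": (0.1, 0.5, 0.9, 0.9)}, history))
```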
Painter
- Function: Calls Flux for layout-free generation and 3DIS for layout-aware generation.
- Designed as plug-and-play, allowing replacement with any T2I/L2I models.
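
The plug-and-play design can be illustrated with a small backend interface. `PainterBackend`, `Painter`, and `StubBackend` are hypothetical names, and the stubs return tagged bytes instead of calling Flux or 3DIS, whose actual APIs are not described in these notes.

```python
from typing import Optional, Protocol

Layout = dict[str, tuple[float, float, float, float]]


class PainterBackend(Protocol):
    def render(self, prompt: str, layout: Optional[Layout]) -> bytes: ...


class StubBackend:
    """Stand-in for a real T2I or L2I model; returns a tagged byte string."""

    def __init__(self, tag: str) -> None:
        self.tag = tag

    def render(self, prompt: str, layout: Optional[Layout]) -> bytes:
        boxes = 0 if layout is None else len(layout)
        return f"{self.tag}:{prompt}:{boxes}".encode()


class Painter:
    """Routes a request to whichever backend matches the generation mode."""

    def __init__(self, layout_free: PainterBackend, layout_aware: PainterBackend) -> None:
        self.layout_free = layout_free
        self.layout_aware = layout_aware

    def paint(self, prompt: str, layout: Optional[Layout] = None) -> bytes:
        backend = self.layout_aware if layout else self.layout_free
        return backend.render(prompt, layout)


painter = Painter(StubBackend("t2i-stub"), StubBackend("l2i-stub"))
print(painter.paint("a cat"))
print(painter.paint("a cat on a table", {"cat": (0.3, 0.2, 0.7, 0.6)}))
```

Because `Painter` depends only on the `render` method, swapping in another model means writing one adapter class and leaving the other agents untouched.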

Loss & Training¶

Training-free framework. Planner/Checker utilize GPT-5 for inference.

Key Experimental Results¶

Main Results¶

GenEval comparison:

Model	Overall
DALL-E 3	0.67
FLUX.1-dev	0.67
SD3-Medium	0.74
GPT Image 1	0.84
coDrawAgents	0.94

DPG-Bench: Overall 85.17 (SOTA), Relation 92.92 (best).

Ablation Study¶

Configuration	DPG-Bench Overall
Layout-free only	77.60
+ Layout-aware	82.61
+ Visual context	84.51
+ Checker	85.17

Agent efficiency: On average, each image needs only 1.52 Planner calls and 1.62 Checker calls, both well below the average object count of 2.79.

Key Findings¶

The GenEval score of 0.94 is 0.10 higher than that of GPT Image 1 (0.84).
Position metric increases from FLUX's 0.20 to 0.95, indicating that incremental planning + checking is highly effective for spatial localization.
Counting metric improves from 0.79 to 0.94, showing that the divide-and-conquer strategy effectively resolves object counting inaccuracy.
Each component contributes independently (layout-aware divide-and-conquer +5.01, visual context +1.90, Checker +0.66 points on the ablation's Overall score).
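
The per-component gains and the efficiency ratios follow directly from the numbers reported above. The short script below recomputes them; it uses only figures from the tables in these notes, and the variable names are illustrative.

```python
# Overall scores from the ablation table, in the order components are added.
ablation = [
    ("Layout-free only", 77.60),
    ("+ Layout-aware", 82.61),
    ("+ Visual context", 84.51),
    ("+ Checker", 85.17),
]

# Gain of each configuration over the previous row.
gains = {
    name: round(score - prev_score, 2)
    for (_, prev_score), (name, score) in zip(ablation, ablation[1:])
}

# GenEval gap between coDrawAgents and GPT Image 1.
geneval_gap = round(0.94 - 0.84, 2)

# Average agent calls per object (calls per image / objects per image).
planner_calls_per_object = round(1.52 / 2.79, 2)
checker_calls_per_object = round(1.62 / 2.79, 2)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
print(f"GenEval gap: {geneval_gap:.2f}")
print(f"Planner calls per object: {planner_calls_per_object:.2f}")
print(f"Checker calls per object: {checker_calls_per_object:.2f}")
```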

Highlights & Insights¶

Closed-loop multi-agent dialogue allows the Checker to backtrack and correct errors made in earlier iterations.
Visual context-aware planning avoids "imagining from thin air".
Semantic priority grouping effectively reduces the number of agent invocations.
The GenEval score of 0.94 is the highest among all compared models.

Limitations & Future Work¶

Heavily relies on GPT-5, resulting in high computational costs and susceptibility to hallucinations.
The Painter is constrained by the performance of the underlying base models.
Limited to 2D generation.

Related Work & Insights¶

GoT performs one-shot global reasoning, whereas coDrawAgents conducts incremental, visually aware reasoning.
The divide-and-conquer + visual CoT paradigm can be generalized to video and 3D generation.

Rating¶

Novelty: ⭐⭐⭐⭐ The closed-loop multi-agent dialogue framework is a meaningful innovation.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on two benchmarks with thorough ablation studies.
Writing Quality: ⭐⭐⭐⭐ Clear architectural diagram.
Value: ⭐⭐⭐⭐ The 0.94 SOTA on GenEval holds system-level significance.