MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation

Conference: CVPR 2025
arXiv: 2505.02648
Code: None
Area: Image Generation
Keywords: Compositional Diffusion, Multi-Agent Collaboration, Complex Scene Generation, Text-to-Image, Training-free

TL;DR

MCCD is a training-free compositional diffusion method built on multi-agent collaboration. An MLLM-driven multi-agent system parses the complex prompt, and hierarchical compositional diffusion (Gaussian masks plus regional enhancement) then generates multi-object scenes accurately and with high fidelity.

Background & Motivation

Background: Diffusion models perform exceptionally well on text-to-image generation, but frequently suffer from issues like missing objects and incorrect attribute binding when handling complex prompts containing multiple objects, attributes, and relations.

Limitations of Prior Work: (1) Standard diffusion models struggle to correctly handle spatial relations and attribute binding for multiple objects; (2) existing compositional methods (such as Attend-and-Excite) remain inadequate in extremely complex scenes; (3) scene parsing typically relies on simple rules, failing to handle semantically complex descriptions.

Key Challenge: Complex scenes contain multi-level information (object counts, locations, attributes, relationships), requiring end-to-end support from language understanding to visual generation.

Goal: Significantly enhance the generation capabilities of diffusion models in complex scenes in a training-free manner.

Key Insight: Use an MLLM-based multi-agent system for systematic scene parsing, and hierarchical diffusion for fine-grained regional generation.

Core Idea: Multi-agent collaborative parsing of complex prompts → Generating structured layout → Gaussian mask regional constraints → Regional enhancement for fine-grained generation.

Method

Overall Architecture

Given a complex text prompt, the multi-agent collaborative scene parsing module first structurally decomposes the prompt into an object list, attributes, spatial relationships, and layout information. Subsequently, the hierarchical compositional diffusion module refines each object region using Gaussian masks and filtering mechanisms, achieving accurate generation through regional enhancement.
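
As a rough illustration of what the structured parse might look like, the sketch below models objects, attributes, bounding boxes, and relations as plain Python dataclasses. The class names, field names, and the normalized `(x0, y0, x1, y1)` box convention are assumptions made for this example; the source does not specify the exact format the agents exchange.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    name: str                                # e.g. "cat"
    attributes: list[str]                    # e.g. ["black"]
    box: tuple[float, float, float, float]   # (x0, y0, x1, y1), assumed normalized to [0, 1]

    def prompt(self) -> str:
        """Build an object-specific prompt such as 'black cat'."""
        return " ".join(self.attributes + [self.name])


@dataclass
class Relation:
    subject: str
    predicate: str                           # e.g. "left of"
    target: str


@dataclass
class SceneLayout:
    objects: list[SceneObject] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)


# Hypothetical parse of "a black cat to the left of a white dog".
layout = SceneLayout(
    objects=[
        SceneObject("cat", ["black"], (0.05, 0.40, 0.45, 0.95)),
        SceneObject("dog", ["white"], (0.55, 0.40, 0.95, 0.95)),
    ],
    relations=[Relation("cat", "left of", "dog")],
)
```

Keeping a per-object prompt next to each box is what lets the later diffusion stages condition every region on its own text.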

Key Designs

Multi-Agent Collaborative Scene Parsing:
- Function: Structurally decompose complex text prompts
- Mechanism: Designs multiple MLLM agents with distinct roles: an object extraction agent identifies all objects and their attributes, a layout planning agent generates spatial coordinates (bounding boxes), and a relationship verification agent checks whether the relationships between objects are satisfied. The agents collaborate over several rounds to iteratively refine the parsing result.
- Design Motivation: A single LLM struggles to process all complex semantics at once; divided collaboration is closer to how humans process complex information.
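
The extract, plan, and verify roles described above can be sketched as a small control loop. This is a minimal sketch, not the authors' implementation: the agents are passed in as plain callables, the stub agents stand in for real MLLM calls, and the feedback format, the `max_rounds` limit, and the stopping rule are all assumptions.

```python
def parse_scene(prompt, extract, plan, verify, max_rounds=3):
    """Extract objects once, then alternate layout planning and verification."""
    objects = extract(prompt)
    feedback, layout, rounds = [], {}, 0
    for rounds in range(1, max_rounds + 1):
        layout = plan(objects, feedback)      # planner may use the verifier's feedback
        feedback = verify(prompt, layout)     # empty list means all relations hold
        if not feedback:
            break
    return layout, rounds


# --- Stub agents (hypothetical stand-ins for MLLM calls) ---
def extract_stub(prompt):
    return ["cat", "dog"]


def plan_stub(objects, feedback):
    # The first attempt deliberately orders objects wrongly; feedback corrects it.
    ordered = objects if feedback else list(reversed(objects))
    width = 1.0 / len(ordered)
    return {name: (i * width, 0.0, (i + 1) * width, 1.0) for i, name in enumerate(ordered)}


def verify_stub(prompt, layout):
    # Checks the single relation "cat left of dog" by comparing left edges.
    if layout["cat"][0] < layout["dog"][0]:
        return []
    return ["cat should be left of dog"]


layout, rounds = parse_scene(
    "a cat to the left of a dog", extract_stub, plan_stub, verify_stub
)
```

With these stubs the first layout fails verification and the second passes, so the loop stops after two rounds.
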
Hierarchical Compositional Diffusion:
- Function: Precisely control the generation of each object during the diffusion sampling process
- Mechanism: Generates a Gaussian mask for each object region as a soft spatial constraint, blending noise predictions of each region using the masks during denoising. A filtering operation is applied to eliminate information leakage between regions, ensuring each region independently generates the correct object.
- Design Motivation: Simple attention manipulation is insufficient for handling multi-object scenes; explicit spatial constraints are more reliable.
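
To make the soft spatial constraint concrete, the sketch below builds a Gaussian mask per bounding box and uses the masks to blend per-region noise predictions into the global one. It uses nested Python lists in place of latent tensors so it runs without dependencies. The `sigma_scale` parameter, the thresholding used as the "filtering" step, and the normalization are assumptions for illustration; the source does not give the exact formulas.

```python
import math


def gaussian_mask(height, width, box, sigma_scale=0.25):
    """Soft mask that peaks at the box center; box = (x0, y0, x1, y1) in pixels."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    sx, sy = sigma_scale * (x1 - x0), sigma_scale * (y1 - y0)
    return [
        [math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2)) for x in range(width)]
        for y in range(height)
    ]


def blend_noise(global_eps, region_eps, masks, threshold=0.05):
    """Blend regional noise predictions into the global prediction, pixel by pixel."""
    height, width = len(global_eps), len(global_eps[0])
    out = [row[:] for row in global_eps]
    for y in range(height):
        for x in range(width):
            # Assumed filtering rule: weights below the threshold are zeroed so a
            # region cannot leak into pixels that belong to another region.
            weights = [m[y][x] if m[y][x] >= threshold else 0.0 for m in masks]
            total = sum(weights)
            if total == 0.0:
                continue  # no region claims this pixel; keep the global prediction
            regional = sum(w * eps[y][x] for w, eps in zip(weights, region_eps)) / total
            alpha = min(1.0, total)
            out[y][x] = alpha * regional + (1 - alpha) * global_eps[y][x]
    return out


# Toy example: two side-by-side regions on an 8x8 grid with constant predictions.
H, W = 8, 8
masks = [gaussian_mask(H, W, (0, 0, 4, 8)), gaussian_mask(H, W, (4, 0, 8, 8))]
global_eps = [[0.0] * W for _ in range(H)]
region_eps = [[[1.0] * W for _ in range(H)], [[-1.0] * W for _ in range(H)]]
blended = blend_noise(global_eps, region_eps, masks)
```

In this toy setup each region's center takes its own prediction, the influence fades toward the box edges, and pixels far from every box keep the global prediction.
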
Regional Enhancement:
- Function: Improve the generation quality and details of each object region
- Mechanism: At specific diffusion steps, each target region is locally enhanced: its details are regenerated with an object-specific prompt restricted to that region, and the result is fused back into the global generation. This ensures correct attributes and rich details for each object.
- Design Motivation: Global generation often compromises on details, and regional enhancement provides an opportunity for fine-grained correction.
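
The regenerate-then-fuse step can be sketched as follows. This is a hedged illustration only: the `refine` callable is a hypothetical stand-in for a denoiser call conditioned on the object-specific prompt, and the enhancement window (`start_frac`, `end_frac`) and the fusion `weight` are assumed values, not ones reported in the source.

```python
def in_enhancement_window(step, total_steps, start_frac=0.3, end_frac=0.7):
    """True for the steps at which regional enhancement is applied (assumed window)."""
    return start_frac * total_steps <= step < end_frac * total_steps


def enhance_regions(latent, regions, refine, weight=0.8):
    """Refine each box with its own prompt and fuse it back into the global latent.

    latent:  2D list of floats (stand-in for a latent tensor)
    regions: list of (prompt, (x0, y0, x1, y1)) with integer pixel boxes
    refine:  callable(crop, prompt) -> crop of the same size (hypothetical)
    """
    fused = [row[:] for row in latent]
    for prompt, (x0, y0, x1, y1) in regions:
        crop = [row[x0:x1] for row in latent[y0:y1]]
        refined = refine(crop, prompt)
        for dy, row in enumerate(refined):
            for dx, value in enumerate(row):
                y, x = y0 + dy, x0 + dx
                # Linear fusion of the refined region with the global result.
                fused[y][x] = weight * value + (1 - weight) * latent[y][x]
    return fused


def toy_refine(crop, prompt):
    # Placeholder "regeneration": returns a crop of ones with the same shape.
    return [[1.0 for _ in row] for row in crop]


latent = [[0.0] * 4 for _ in range(4)]
fused = enhance_regions(latent, [("black cat", (0, 0, 2, 4))], toy_refine)
```

Only the pixels inside the box change, which mirrors the idea that enhancement corrects one object at a time without disturbing the rest of the image.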

Loss & Training

MCCD is a training-free method that directly manipulates the diffusion sampling process during inference, without modifying the model weights.

Key Experimental Results

Main Results

| Benchmark | Metric | MCCD | Baseline | Gain |
| --- | --- | --- | --- | --- |
| T2I-CompBench | Attribute Binding | Substantially improved | Standard SD | Significant accuracy improvement |
| T2I-CompBench | Spatial Relations | Substantially improved | Standard SD | Improved relationship accuracy |
| Complex Scenes | Object Completeness | Substantially improved | Attend-and-Excite | More objects correctly generated |

Ablation Study

| Configuration | Effect | Explanation |
| --- | --- | --- |
| Full MCCD | Best | Multi-agent + Hierarchical diffusion |
| Single-agent Parsing | Degraded | Insufficient complex scene parsing |
| Without Gaussian Mask | Degraded | Inaccurate object positioning |
| Without Regional Enhancement | Detail degraded | Attribute binding errors |

Key Findings

Multi-agent collaborative parsing performs significantly better than a single LLM call.
The combination of Gaussian masks and regional enhancement is key—the former manages positioning, while the latter manages quality.
The training-free approach can significantly boost the capabilities of baseline models.

Highlights & Insights

LLM Multi-Agent for Image Generation: Applies the multi-agent collaboration paradigm to scene parsing for text-to-image generation. The framework could plausibly transfer to more complex generative tasks such as video and 3D generation.
Elegance of Hierarchical Composition: Gaussian masks provide soft constraints while regional enhancement provides hard corrections, complementing each other to form complete spatial control.
Practical Value of Training-Free Methods: Directly enhances the capabilities of existing models without retraining, lowering the threshold for application.

Limitations & Future Work

Multi-agent invocation increases inference latency and API costs.
The quality of layout planning is limited by the spatial reasoning capabilities of MLLMs.
For highly overlapping object scenes (e.g., stacked items), Gaussian masks might be insufficient.
Only validated on the SD series; compatibility with newer models like FLUX remains unknown.

Comparison with Related Work

vs Attend-and-Excite: A&E enhances key tokens through attention manipulation, whereas MCCD provides a more comprehensive framework spanning from parsing to generation.
vs RPG (Regional Planning): RPG also performs regional generation but plans with a single LLM, whereas MCCD distributes parsing across multiple collaborating agents.
vs LayoutGPT: LayoutGPT generates layouts with an LLM, whereas MCCD adds role-specialized agents and a relationship verification step on top of layout planning.

Rating

Novelty: ⭐⭐⭐⭐ The combination of multi-agent collaboration and compositional diffusion is novel.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive validation across multiple benchmarks.
Writing Quality: ⭐⭐⭐⭐ Clear and systematic methodology description.
Value: ⭐⭐⭐⭐ Effectively advances complex scene generation.