Laytrol: Preserving Pretrained Knowledge in Layout Control for Multimodal Diffusion Transformers

Conference: AAAI 2026
arXiv: 2511.07934
Code: https://github.com/HHHHStar/Laytrol
Area: Image Generation / Controllable Generation
Keywords: Layout Control, Multimodal Diffusion Transformer, Parameter Copying, ControlNet, FLUX

TL;DR

Laytrol achieves high-quality layout-to-image generation on FLUX by initializing the layout control network via parameter copying from MM-DiT, adopting a dedicated initialization scheme (layout encoder initialized as a pure text encoder with zero-initialized outputs), and constructing the LaySyn dataset using FLUX-generated images to mitigate distribution shift.

Background & Motivation

State of the Field

Background: As MM-DiT architectures (e.g., SD3, FLUX) emerge as state-of-the-art T2I models, enabling spatial layout control on these models has become a critical challenge. Existing layout-to-image methods (GLIGEN, MIGC, SiamLayout) typically train new control modules from scratch, resulting in low visual quality and stylistic inconsistency with the base model. Two root causes are identified: (1) training datasets (COCO/LAION) suffer from distribution shift relative to the base model's pretraining data; and (2) control modules trained from scratch cannot inherit pretrained knowledge.

Paper Goals

Goal: The paper aims to incorporate layout control capabilities into MM-DiT (FLUX) while maximally preserving the pretrained model's image generation quality and style. The core challenge lies in adapting ControlNet-style parameter copying to layout conditions, whose token structure (text + coordinates) is fundamentally different from pixel-level conditions (e.g., depth maps, edge maps) and cannot simply be added to the image tokens.

Method

Overall Architecture

Laytrol constructs a parallel layout control network on top of FLUX's MM-DiT. Inputs include a global text prompt and layout conditions (N entities, each with a local prompt and bounding box coordinates). The layout control network shares the MM-DiT architecture and is initialized by copying parameters from MM-DiT. During training, the base model parameters are frozen; only the layout encoder and layout control modules are trained.
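
To make the parameter-copying initialization concrete, here is a minimal PyTorch sketch. It is a sketch under assumptions, not the released implementation: `build_laytrol_branch`, `mmdit_blocks`, and the `interval` argument (which reappears in the ablations below) are hypothetical names.

```python
import copy

import torch.nn as nn


def build_laytrol_branch(mmdit_blocks: nn.ModuleList, interval: int = 1) -> nn.ModuleList:
    """Copy every `interval`-th MM-DiT block into a trainable control branch.

    The base blocks are frozen; only the copies (and the layout encoder,
    not shown here) receive gradients.
    """
    control_blocks = nn.ModuleList(
        copy.deepcopy(block)
        for i, block in enumerate(mmdit_blocks)
        if i % interval == 0
    )
    for p in mmdit_blocks.parameters():
        p.requires_grad_(False)
    for p in control_blocks.parameters():
        p.requires_grad_(True)
    return control_blocks
```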

Key Designs

  1. Layout Encoder Initialized as Pure Text Encoder (satisfying C1): Layout tokens are encoded as \(C_L = \text{T5}(p_i) + W_0 \times \text{MLP}(\text{Fourier}(b_i))\), where \(W_0\) is zero-initialized. At the start of training, \(C_L = \text{T5}(p_i)\) is therefore a pure text token, which falls naturally within MM-DiT's input domain and correctly activates the copied parameters. As training proceeds, \(W_0\) moves away from zero and spatial information is progressively injected (see the sketch after this list).

  2. Zero-Initialized Layout Control Output (satisfying C2): The output of each Laytrol block is fused into the base model via a zero-initialized linear layer \(W_0\): \(X' = X_T' + W_0 \times X_L'\). At the beginning of training, Laytrol does not interfere with the base model, ensuring training stability.

  3. Object-Level RoPE: Each layout token is assigned the positional index of the patch at its bounding-box center for its RoPE rotation matrix, rather than sharing the position \((0,0)\) across all layout tokens. This encourages image tokens near the bounding box to attend more to the corresponding layout token, providing coarse-grained spatial information (see the position-index sketch after this list).

  4. LaySyn Dataset: Approximately 400K images are generated by FLUX itself and annotated with layouts using Grounding DINO. Layout prompting (randomly inserting spatial/size phrases such as "on the left," "tiny," and "large" into object descriptions) is employed to mitigate FLUX's inherent layout bias, which causes generated images to cluster around repetitive layout patterns.
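
Below is a minimal PyTorch sketch of designs 1 and 2, assuming normalized box coordinates and per-entity T5 embeddings pooled to a single vector; `fourier_features`, `LayoutEncoder`, and `ZeroFuse` are hypothetical names, and the actual model operates on full token sequences.

```python
import math

import torch
import torch.nn as nn


def fourier_features(boxes: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Map (N, 4) box coordinates in [0, 1] to sin/cos features."""
    freqs = (2.0 ** torch.arange(num_freqs, device=boxes.device)) * math.pi
    angles = boxes.unsqueeze(-1) * freqs                               # (N, 4, F)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)  # (N, 8F)


class LayoutEncoder(nn.Module):
    """Design 1: C_L = T5(p_i) + W_0 * MLP(Fourier(b_i)), W_0 zero-initialized."""

    def __init__(self, dim: int, num_freqs: int = 8):
        super().__init__()
        self.num_freqs = num_freqs
        self.mlp = nn.Sequential(
            nn.Linear(8 * num_freqs, dim), nn.SiLU(), nn.Linear(dim, dim)
        )
        self.w0 = nn.Linear(dim, dim, bias=False)
        nn.init.zeros_(self.w0.weight)  # at step 0, C_L = T5(p_i): pure text tokens

    def forward(self, text_emb: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # text_emb: (N, dim) T5 embeddings of the per-entity local prompts
        return text_emb + self.w0(self.mlp(fourier_features(boxes, self.num_freqs)))


class ZeroFuse(nn.Module):
    """Design 2: X' = X_T' + W_0 * X_L', so the base model is untouched at step 0."""

    def __init__(self, dim: int):
        super().__init__()
        self.w0 = nn.Linear(dim, dim, bias=False)
        nn.init.zeros_(self.w0.weight)

    def forward(self, x_base: torch.Tensor, x_layout: torch.Tensor) -> torch.Tensor:
        return x_base + self.w0(x_layout)
```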

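A sketch of the object-level RoPE index assignment (design 3). The (row, col) convention and the `object_rope_positions` name are assumptions; the paper specifies only that each layout token takes the position of the patch at its bounding-box center.

```python
import torch


def object_rope_positions(boxes: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
    """Give each layout token the RoPE index of the patch under its
    bounding-box center, rather than a shared (0, 0) position.

    boxes: (N, 4) as (x_min, y_min, x_max, y_max), normalized to [0, 1].
    Returns (N, 2) integer (row, col) positions on the latent patch grid.
    """
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    col = (cx * grid_w).long().clamp_(0, grid_w - 1)
    row = (cy * grid_h).long().clamp_(0, grid_h - 1)
    return torch.stack([row, col], dim=-1)
```
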
Loss & Training

  • Standard denoising diffusion loss plus a region-aware loss that upweights the bounding-box regions with weight \(\lambda = 2\)
  • Power-law timestep sampling \(\pi(t;\alpha) = \alpha \cdot t^{\alpha-1}\) with \(\alpha = 1.4\), biasing samples toward higher timesteps to emphasize layout information
  • Random global prompt dropping with probability \(p_d = 0.5\), replacing the prompt with null tokens so that image tokens attend more to layout tokens (see the sketch below)
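
A sketch of these training choices, with assumptions flagged: the inverse-CDF sampler follows from the CDF \(F(t) = t^{\alpha}\); the region-aware loss is shown as one plausible per-pixel weighting, since the paper states only a bounding-box weight of \(\lambda = 2\); all function names are hypothetical.

```python
import torch


def sample_timesteps(batch_size: int, alpha: float = 1.4) -> torch.Tensor:
    """Inverse-CDF sampling from pi(t; alpha) = alpha * t**(alpha - 1) on [0, 1].

    The CDF is F(t) = t**alpha, so t = u**(1 / alpha) for u ~ Uniform(0, 1).
    With alpha > 1 the density increases in t, biasing training toward
    high-noise timesteps, where layout is decided.
    """
    return torch.rand(batch_size).pow(1.0 / alpha)


def region_weighted_loss(pred: torch.Tensor, target: torch.Tensor,
                         box_mask: torch.Tensor, lam: float = 2.0) -> torch.Tensor:
    """Denoising MSE with positions inside bounding boxes upweighted by lam."""
    weight = 1.0 + (lam - 1.0) * box_mask  # lam inside boxes, 1 elsewhere
    return (weight * (pred - target).pow(2)).mean()


def maybe_drop_prompt(text_tokens: torch.Tensor, null_tokens: torch.Tensor,
                      p_drop: float = 0.5) -> torch.Tensor:
    """Randomly replace the global prompt with null tokens."""
    return null_tokens if torch.rand(()).item() < p_drop else text_tokens
```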

Key Experimental Results

| Dataset | Metric | Laytrol | SiamLayout-FLUX | MIGC | GLIGEN |
|---|---|---|---|---|---|
| T2I-CompBench | Spatial ↑ | 47.40 | 35.84 | 36.39 | 33.22 |
| T2I-CompBench | Color ↑ | 80.65 | 76.63 | 65.34 | 34.00 |
| COCO 2017 | mIoU ↑ | 80.08 | 70.09 | 77.64 | 79.71 |
| COCO 2017 | AP ↑ | 70.11 | 56.62 | 65.11 | 68.92 |
| COCO 2017 | FID ↓ | 34.34 | 36.66 | 39.25 | 39.85 |

Ablation Study

  • Parameter copying (P-Copy) contributes the most: removing it alone drops mIoU from 76.75 to 64.92 and AP from 64.11 to 51.78
  • Object-Level RoPE and random prompt dropping each contribute a 2–5 point mIoU improvement
  • The number of Laytrol blocks is flexible: mIoU decreases from 76.75 when every MM-DiT block is copied (interval = 1) to 72.16 with every sixth block (interval = 6), retaining reasonable performance
  • In both human and GPT-4o evaluations, Laytrol outperforms SiamLayout on aesthetics (3.96 vs. 3.32), realism (3.72 vs. 3.58), and semantic consistency (4.24 vs. 4.09)

Highlights & Insights

  • Elegant adaptation of ControlNet's parameter copying to layout control in MM-DiT: The "initialize as a pure text encoder" strategy elegantly resolves the structural mismatch between layout tokens and image tokens.
  • Self-synthesized dataset: Using the model's own generated images as training data fundamentally eliminates distribution shift—a paradigm generalizable to other controllable generation tasks.
  • Layout Prompting to address layout bias: A simple yet effective approach that enriches layout diversity in generated images by injecting spatial descriptors into prompts.

Limitations & Future Work

  • High inference cost: Laytrol-1 (a control block for every MM-DiT block, i.e., interval = 1) requires 2.1× the TFLOPs of FLUX (15.6 vs. 7.4), with roughly doubled latency.
  • Only bounding-box-level control is supported; finer-grained instance segmentation masks or keypoints are not addressed.
  • The LaySyn dataset relies on GPT-4o and Grounding DINO, so annotation quality is bounded by the capabilities of these models.
  • Joint use with other control conditions (depth maps, pose, etc.) remains unexplored.

Comparison with Related Methods

  • vs. SiamLayout: SiamLayout is also built on MM-DiT but trains its control module from scratch; Laytrol achieves substantially higher spatial scores via parameter copying (47.40 vs. 35.84).
  • vs. ControlNet: ControlNet handles pixel-level conditions (edge maps, etc.) that can be added directly to image tokens; Laytrol handles heterogeneous layout conditions by initializing the layout encoder as a pure text encoder.
  • vs. GLIGEN: GLIGEN injects Fourier-embedded layout tokens through gated self-attention layers in a U-Net; Laytrol achieves more natural layout control on MM-DiT via parameter copying and object-level RoPE.

Transferable Insights

  • The self-synthesized dataset paradigm is equally applicable to other controllable generation tasks (e.g., pose control, style transfer).
  • The training paradigm of "initialize to a known in-domain state → progressively inject new information" constitutes a general and efficient fine-tuning strategy.

Rating

  • Novelty: ⭐⭐⭐⭐ The adaptation of ControlNet parameter copying to heterogeneous input conditions is cleverly designed.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers two benchmarks with complete ablations, human evaluation, and efficiency analysis.
  • Writing Quality: ⭐⭐⭐⭐ The abstraction of conditions C1/C2 and the problem analysis are clear and well-structured.
  • Value: ⭐⭐⭐⭐ Provides a practical advancement for controllable generation on MM-DiT; code is open-sourced.