LACONIC: A 3D Layout Adapter for Controllable Image Creation¶
Conference: ICCV 2025 arXiv: 2507.03257 Code: N/A Area: 3D Vision Keywords: 3D layout guidance, image generation, diffusion model adapter, decoupled cross-attention, scene editing
TL;DR¶
This paper proposes LACONIC, a lightweight adapter based on parameterized 3D semantic bounding boxes that injects explicit 3D geometric information into a pretrained text-to-image diffusion model via a decoupled cross-attention mechanism. It is the first method to simultaneously support camera control, 3D object-level semantic guidance, and full scene context modeling of off-screen objects, achieving a 75.8% reduction in FID compared to SceneCraft.
Background & Motivation¶
Problem Definition¶
Given a scene layout composed of 3D semantic bounding boxes and a target camera viewpoint, the goal is to generate a single-view image that is consistent with the 3D structure and semantically coherent. The method additionally supports independent editing of individual objects' position, rotation, size, and semantics.
Limitations of Prior Work¶
Limitations of text guidance: Text-to-image models struggle to precisely convey complex spatial and geometric relationships (e.g., "books inside a bookshelf") through text alone; multi-object spatial arrangements are particularly difficult.
Fundamental limitations of 2D conditions: Conditions such as 2D bounding boxes, semantic maps, and depth maps inherently discard critical 3D information:
- Cannot handle nested objects (e.g., a book inside a bookshelf)
- Are strongly view-dependent, yielding inconsistent condition representations of the same scene across different viewpoints
- Cannot perceive off-screen objects (e.g., lighting from outside a window affecting the interior)
Limitations of existing 3D-aware methods: SceneCraft projects 3D bounding boxes into 2D depth and semantic maps and then applies ControlNet, but the final conditions remain in 2D space, are limited to a fixed set of categories, and cannot handle nesting or occlusion. ControlRoom3D and Ctrl-Room rely on panoramas; Build-A-Scene does not scale to complex scenes.
Core Motivation¶
Key insight: Explicit, view-independent 3D geometric information (parameterized 3D bounding boxes with free-text descriptions) should be used directly as diffusion model conditions, rather than first projecting 3D to 2D and then encoding. This enables: (1) structural consistency across viewpoints; (2) natural handling of nesting and occlusion; (3) complete scene context including off-screen objects. An adapter architecture is employed to preserve the rich priors of the pretrained T2I model.
Method¶
Overall Architecture¶
Given a 3D layout \(\mathcal{S}\) and camera pose \(\mathcal{C}\), trainable modules encode the geometric and semantic attributes of each object into a sequence of tokens. After processing by a Transformer encoder, these tokens are injected into the frozen Stable Diffusion backbone via decoupled cross-attention to guide the denoising process and produce the target rendering.
Key Designs¶
1. Parameterized 3D Semantic Bounding Box Representation¶
- Function: Defines an intuitive and explicit 3D scene condition representation.
- Mechanism: The scene \(\mathcal{S}\) contains \(N\) objects \(\mathcal{O} = \{o_1, ..., o_N\}\) and an optional floor plan \(\mathcal{F} \in \mathbb{R}^{P \times 3}\). Each object is defined as a "semantic 3D bounding box": \(o_i = (p_i, d_i, R_i, s_i)\), where \(p_i \in \mathbb{R}^3\) is the center position, \(d_i \in \mathbb{R}^3\) is the dimensions, \(R_i \in \mathbb{R}^{3 \times 3}\) is the rotation matrix, and \(s_i = [s_i^1, ..., s_i^M]\) is a free-form text description of \(M\) tokens.
Camera viewpoint transformation: Spatial parameters are transformed from world coordinates to camera coordinates: \(p_i^{\mathcal{C}} = R_{\mathcal{C}}^\top(p_i - p_{\mathcal{C}}), \quad R_i^{\mathcal{C}} = R_{\mathcal{C}}^\top R_i\). This closed-form transformation is executed at runtime, eliminating the need for the network to learn a complex 3D→2D mapping (a minimal sketch of the representation and this transform follows this block).
- Design Motivation: The view-independent 3D representation ensures cross-viewpoint structural consistency. The explicit camera transformation directly encodes viewpoint information into spatial features, relieving the network of this burden. Free-form text replaces one-hot category labels, enabling open-vocabulary support.
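Below is a minimal sketch of the box parameterization and the closed-form world-to-camera transform described above; the names (`SemanticBox`, `Camera`, `to_camera_frame`) are illustrative stand-ins, not the authors' code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SemanticBox:
    p: np.ndarray  # (3,) center position
    d: np.ndarray  # (3,) dimensions
    R: np.ndarray  # (3, 3) rotation matrix
    s: str         # free-form text description

@dataclass
class Camera:
    p: np.ndarray  # (3,) camera position in world coordinates
    R: np.ndarray  # (3, 3) camera orientation (columns = camera axes in world frame)

def to_camera_frame(box: SemanticBox, cam: Camera) -> SemanticBox:
    """p_i^C = R_C^T (p_i - p_C),  R_i^C = R_C^T R_i  (closed form, nothing learned)."""
    return SemanticBox(
        p=cam.R.T @ (box.p - cam.p),
        d=box.d,                      # dimensions are viewpoint-independent
        R=cam.R.T @ box.R,
        s=box.s,                      # the text description is untouched
    )

# Example: a bookshelf one metre in front of a camera placed at the origin.
cam = Camera(p=np.zeros(3), R=np.eye(3))
shelf = SemanticBox(p=np.array([0.0, 0.0, 1.0]),
                    d=np.array([1.0, 2.0, 0.3]),
                    R=np.eye(3),
                    s="a wooden bookshelf filled with books")
shelf_in_cam = to_camera_frame(shelf, cam)
```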
2. 3D Layout Encoder + Decoupled Cross-Attention¶
- Function: Encodes the 3D layout into condition embeddings compatible with the diffusion model.
- Mechanism:
Object encoding: The spatial features of each object (position, dimensions, rotation) are encoded via sinusoidal positional encoding followed by a fully connected layer; semantic descriptions are encoded by a pretrained text encoder \(\tau_\theta\). After concatenation, each object produces a token \(\mathcal{T}_{o_i}\). The optional floor plan is encoded into independent tokens \(\mathcal{T}_{\mathcal{F}}\) via PointNet. All tokens are processed by a Transformer encoder.
Decoupled cross-attention: Following the IP-Adapter paradigm, the 3D layout embedding \(\hat{\mathcal{T}}\) is projected through additional trainable linear layers to produce keys \(K^y\) and values \(V^y\), which are combined with the query \(Q\) derived from the image feature map: \(H^y = \text{softmax}\left(\frac{Q(K^y)^\top}{\sqrt{d}}\right) \cdot V^y\). The final hidden state is a weighted sum of the text condition and the 3D layout condition: \(H = H^c + \gamma H^y\), where \(\gamma\) controls the strength of 3D layout guidance (a sketch of this mechanism follows this block).
- Design Motivation: The decoupled design leaves the original T2I text conditioning mechanism intact and introduces 3D conditions through separate KV projections. The \(\gamma\) parameter enables flexible control over structural adherence — low \(\gamma\) preserves creative freedom driven by text, while high \(\gamma\) enforces structural precision.
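A minimal PyTorch sketch of the decoupled cross-attention summarized above, assuming an IP-Adapter-style layer; module and argument names are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    """Text cross-attention plus a parallel, separately projected 3D layout branch."""

    def __init__(self, dim: int, text_dim: int, layout_dim: int, gamma: float = 1.0):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        # Frozen text pathway of the pretrained T2I model.
        self.to_k_text = nn.Linear(text_dim, dim, bias=False)
        self.to_v_text = nn.Linear(text_dim, dim, bias=False)
        # New trainable K/V projections for the 3D layout tokens.
        self.to_k_layout = nn.Linear(layout_dim, dim, bias=False)
        self.to_v_layout = nn.Linear(layout_dim, dim, bias=False)
        self.gamma = gamma  # strength of the 3D layout guidance

    def forward(self, x, text_tokens, layout_tokens):
        q = self.to_q(x)
        # H^c: standard attention over the text condition (unchanged pathway).
        h_text = F.scaled_dot_product_attention(
            q, self.to_k_text(text_tokens), self.to_v_text(text_tokens))
        # H^y: attention over the 3D layout condition via the extra projections.
        h_layout = F.scaled_dot_product_attention(
            q, self.to_k_layout(layout_tokens), self.to_v_layout(layout_tokens))
        # H = H^c + gamma * H^y
        return h_text + self.gamma * h_layout
```

Because only the new K/V projections (and the layout encoder) are trained, the text pathway and its prior stay intact; at inference, lowering `gamma` favors that prior while raising it enforces the 3D layout, consistent with the ablation reported below.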
3. Training Strategy and Application Scenarios¶
- Function: Efficient training and support for diverse editing applications.
- Mechanism:
Training dynamics: Classifier-free guidance training is adopted, randomly dropping the 3D layout input \(y\) with probability \(p_{\text{drop}}\). The global text description \(c\) is always set to an empty prompt during training — no text–image pair supervision is required. Object semantic descriptions can be generated automatically using a VLM (e.g., BLIP).
Application scenarios:
- Structurally consistent multi-view generation: generating multiple views with structural consistency across different camera poses \(\mathcal{C}_i\)
- Text-driven scene stylization: transforming scene style via global text prompts by leveraging the preserved T2I prior
- Object attribute-level editing: independently adjusting the position, size, or semantic description of individual objects (a minimal editing sketch follows this block)
- Design Motivation: Independence from text–image pair training makes the method applicable to 3D scene datasets lacking global descriptions. Performing editing operations in 3D space (rather than 2D pixel space) provides more intuitive and precise control.
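As a concrete illustration of object attribute-level editing in 3D layout space, the small sketch below edits one box and re-renders; it reuses the `SemanticBox` class from the earlier sketch, and `generate_view` is a hypothetical wrapper around the frozen backbone plus the adapter.

```python
import copy
import numpy as np

def edit_object(scene, idx, translate=None, scale=None, text=None):
    """Return a copy of `scene` (a list of SemanticBox) with one box edited."""
    edited = copy.deepcopy(scene)
    box = edited[idx]
    if translate is not None:
        box.p = box.p + np.asarray(translate, dtype=float)  # move the box
    if scale is not None:
        box.d = box.d * np.asarray(scale, dtype=float)      # resize the box
    if text is not None:
        box.s = text                                         # re-describe the box
    return edited

# Example with a hypothetical rendering call: shift the bookshelf 0.5 m,
# restyle it, and re-render from the same camera so the rest of the scene stays fixed.
# new_scene = edit_object(scene, idx=0, translate=[0.5, 0.0, 0.0],
#                         text="a metal shelving unit")
# image = generate_view(new_scene, camera, prompt="", gamma=1.5)
```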
Loss & Training¶
- Training objective: Standard diffusion model denoising loss: \(\mathcal{L}_{\text{DM}} = \mathbb{E}_{x,c,y,\epsilon \sim \mathcal{N}(0,I),t} \left[\|\epsilon - \epsilon_\theta(x_t, t, c, y)\|_2^2\right]\) (a training-step sketch follows this list)
- Backbone: Stable Diffusion v1.5 (frozen)
- Training data: HyperSim (326 scenes, 24,383 images) + custom bedroom dataset (72,000 scenes)
- Optimizer: AdamW
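Putting the denoising loss and the classifier-free-guidance dropout together, one training step might look like the sketch below; `unet`, `layout_encoder`, and `noise_scheduler` are stand-ins for the frozen SD 1.5 UNet with the added adapter layers, the 3D layout encoder, and a diffusers-style DDPM scheduler, and the `layout_tokens` keyword is an assumed interface.

```python
import torch
import torch.nn.functional as F

def training_step(unet, layout_encoder, noise_scheduler, latents, layout,
                  empty_text_embeds, p_drop: float = 0.1):
    """One denoising step with classifier-free-guidance dropout of the 3D layout."""
    # Sample noise and a random timestep, then form the noised latents x_t.
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)

    # Encode the 3D layout; drop it with probability p_drop for CFG training
    # (zeroing the tokens is one simple choice of null condition).
    layout_tokens = layout_encoder(layout)
    if torch.rand(()) < p_drop:
        layout_tokens = torch.zeros_like(layout_tokens)

    # The global text prompt is always empty during training, so no
    # text-image pair supervision is needed.
    eps_pred = unet(noisy_latents, t,
                    encoder_hidden_states=empty_text_embeds,
                    layout_tokens=layout_tokens)

    # Standard epsilon-prediction denoising loss.
    return F.mse_loss(eps_pred, noise)
```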
Key Experimental Results¶
Main Results¶
3D layout-guided image generation (HyperSim dataset):
| Method | FID↓ | KID↓ | IS↑ | SOC↑ |
|---|---|---|---|---|
| SceneCraft (w/o text) | 39.36 | 28.26 | 7.72 | 17.59 |
| DM-FS (w/o text) | 15.83 | 7.29 | 8.69 | 18.22 |
| LACONIC (w/o text) | 9.50 | 3.44 | 9.74 | 18.36 |
| SceneCraft (w/ text) | 27.69 | 15.21 | 14.55 | 17.40 |
| LACONIC (w/ text) | 10.12 | 3.91 | 10.60 | 18.39 |
Ablation Study¶
Effect of adapter strength \(\gamma\):
| \(\gamma\) value | Effect |
|---|---|
| Low (~0.5) | Preserves text prior creativity but weak structural control |
| Medium (~1.0) | Reasonable semantics and geometry |
| High (~2.0) | Strict adherence to 3D layout structure |
Architectural design validation:
| Method | FID↓ | SOC↑ | Note |
|---|---|---|---|
| SceneCraft (2D projection condition) | 39.36 | 17.59 | Information loss from 3D→2D projection |
| DM-FS (trained from scratch) | 15.83 | 18.22 | No pretrained prior |
| LACONIC (adapter) | 9.50 | 18.36 | Adapter + 3D encoder achieves best performance |
Key Findings¶
- Direct 3D encoding outperforms 2D projection: LACONIC reduces FID by 75.8% compared to SceneCraft (39.36→9.50), validating the advantage of directly using 3D representations.
- Adapter outperforms training from scratch: DM-FS (trained from scratch) underperforms LACONIC (adapter) on all metrics, demonstrating the importance of leveraging pretrained T2I priors.
- Off-screen objects affect global appearance: Experiments show that removing a window alters global illumination — an effect impossible to capture with 2D conditioning methods.
- Object-level SOC metric: The newly proposed Scene Object CLIP score enables quantitative evaluation of object-level condition adherence (a hedged sketch of such a score follows this list).
- Precise semantic concept assignment: Semantic concepts in text prompts (e.g., wallpaper patterns) are accurately assigned to the relevant objects without leaking to floors or ceilings.
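This summary does not spell out how SOC is computed, so the following is only a plausible sketch of an object-level CLIP score in its spirit: crop each object's 2D image region (assumed here to come from projecting the 3D boxes, projection helper not shown) and compare it with that object's description using CLIP. The paper's exact SOC definition may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def object_clip_score(image: Image.Image, boxes2d, descriptions) -> float:
    """Average CLIP similarity between per-object crops and their descriptions."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    # boxes2d: (left, top, right, bottom) pixel boxes, one per described object.
    crops = [image.crop(b) for b in boxes2d]
    inputs = processor(text=list(descriptions), images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    # One crop per description, so the matched pairing is elementwise.
    return (img * txt).sum(dim=-1).mean().item()
```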
Highlights & Insights¶
- Representational innovation: The 3D semantic bounding box — position + dimensions + rotation + free-form text — is concise yet sufficient to express complex indoor scene structures.
- Closed-form camera transformation: The critical 3D→2D mapping is made explicit through coordinate transformation, relieving the network of learning this complex operation and substantially improving training efficiency.
- No text supervision required during training: Global text descriptions are not needed at training time, yet the T2I backbone's text prior can be exploited at inference — an elegant property enabled by the decoupled cross-attention design.
- SOC evaluation metric: An object-level semantic alignment evaluation method is proposed specifically for layout-guided generation scenarios.
- Powerful editing capability: Per-object translation, scaling, and re-description can be applied iteratively while maintaining global 3D consistency.
Limitations & Future Work¶
- Training data distribution: A model trained on bedroom data is unlikely to generate plausible kitchen scenes; generalization is constrained by the training domain.
- Limited view consistency: Although the 3D structure is consistent, texture and appearance details generated across different viewpoints are not fully consistent — an inherent limitation of single-view training.
- Backbone limitations: SD 1.5 is used as the backbone, limiting generation quality to that of the base model (compatibility with DiT backbones has been verified).
- Scene complexity: Validation is primarily on indoor scenes; scalability to more complex outdoor or large-scale scenes remains unknown.
- Dataset scale: HyperSim contains only 326 unique layouts, potentially leading to scene memorization.
Related Work & Insights¶
- Relation to IP-Adapter: LACONIC adopts its decoupled cross-attention approach but extends the conditioning modality from image guidance to 3D layout guidance.
- Fundamental difference from SceneCraft: SceneCraft operates in 2D space (depth maps + semantic maps), whereas LACONIC operates directly in 3D space.
- Difference from GLIGEN: GLIGEN uses 2D bounding boxes with one-hot category labels; LACONIC uses 3D bounding boxes with free-form text.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First to use parameterized 3D layouts directly as diffusion model conditions, though the adapter design follows established paradigms.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Includes quantitative and qualitative evaluations, user study, and a new SOC metric, though dataset scale is limited.
- Writing Quality: ⭐⭐⭐⭐⭐ — Problem motivation is clearly articulated with rich and intuitive illustrations.
- Value: ⭐⭐⭐⭐ — Provides a more direct and flexible solution for 3D-aware image generation.