SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation¶
Conference: NeurIPS 2025
arXiv: 2511.16666
Code: https://henghuiding.com/SceneDesigner/ (Project Page)
Area: Controllable Image Generation / 3D-Aware Generation
Keywords: 9DoF Pose Control, CNOCS Representation, Multi-Object Generation, Reinforcement Learning Fine-tuning, ControlNet
TL;DR¶
SceneDesigner combines a CNOCS map representation with a two-stage training strategy that ends in reinforcement-learning fine-tuning. It is the first method to achieve precise 9-DoF pose control (position, size, and orientation) over multiple objects, significantly outperforming existing methods in both controllability and generation quality.
Background & Motivation¶
3D-aware controllable image generation is an important yet insufficiently addressed problem. Existing methods suffer from the following limitations:
2D Control Dominance: Most approaches (GLIGEN, InstanceDiffusion) operate only on 2D bounding boxes, failing to capture 3D attributes.
Insufficient Orientation Control: LOOSECONTROL uses 3D bounding boxes but lacks orientation information, so a single box may correspond to either the front or back of an object.
Difficulty with Multi-Object Scenes: Continuous 3D Words and COMPASS handle only single objects and produce unrealistic visual styles.
Poor Method Compatibility: ORIGEN relies on one-step generation models and is incompatible with mainstream multi-step diffusion frameworks.
Key Insight: Design an efficient 9D pose encoding representation (CNOCS map), combined with a dedicated dataset and two-stage training, to enable precise multi-object pose control.
Method¶
Overall Architecture¶
SceneDesigner introduces a ControlNet branch into pretrained Stable Diffusion 3.5, accepting CNOCS maps encoding 9D pose as control conditions. Training proceeds in two stages: the first stage learns basic control capability on the ObjectPose9D dataset, and the second stage applies reinforcement learning fine-tuning to improve generation quality for low-frequency poses. At inference time, Disentangled Object Sampling (DOS) is employed to handle multi-object scenes.
Key Designs¶
- CNOCS Map (Cuboid Normalized Object Coordinate System)
- Design Motivation: Traditional NOCS requires precise 3D shapes (CAD models), which is user-unfriendly.
- Simplification: CNOCS approximates object geometry using only a cuboid, preserving key geometric information.
- Construction: (1) For each pixel, find the intersection coordinate on the 3D bounding box surface → (2) Transform from camera coordinate system to object coordinate system → (3) Normalize by bounding box dimensions to \([-1, 1]\) → (4) Map to final pixel values via encoding function \(f\).
- Variants: C-CNOCS (constant function, e.g., Euler angles), I-CNOCS (identity function, direct coordinates), S-CNOCS (spherical harmonics). I-CNOCS is selected based on experiments.
- Advantage: Users only need to manipulate a cuboid mesh in 3D space to specify object pose, making the interface intuitive and accessible.
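The construction steps above can be sketched as a per-pixel ray-cuboid intersection rendered in normalized object coordinates. This is a minimal I-CNOCS sketch, not the paper's implementation: the function name `cnocs_map`, the slab-test formulation, and the convention of zeroing background pixels are all assumptions for illustration.

```python
import numpy as np

def cnocs_map(H, W, K, R, t, extents):
    """Render an I-CNOCS map (illustrative sketch): for each pixel, intersect
    the camera ray with the object's 3D bounding box and record the hit point
    in normalized object coordinates ([-1, 1] per axis; 0 = background).

    K: 3x3 camera intrinsics; R, t: object-to-camera rotation/translation;
    extents: (3,) half-sizes of the cuboid along its own axes.
    """
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3)
    dirs_cam = pix @ np.linalg.inv(K).T           # ray directions, camera frame
    origin = -R.T @ t                             # camera center in object frame
    dirs = dirs_cam @ R                           # R^T applied to each direction
    # Slab test against the axis-aligned cuboid [-extents, extents].
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = (-extents - origin) / dirs
        t2 = (extents - origin) / dirs
    tmin = np.nanmax(np.minimum(t1, t2), axis=1)
    tmax = np.nanmin(np.maximum(t1, t2), axis=1)
    hit = (tmax >= tmin) & (tmax > 0)
    # Entry point for rays starting outside the box, exit point if inside.
    tnear = np.where(hit, np.where(tmin > 0, tmin, tmax), 0.0)
    pts = origin + tnear[:, None] * dirs          # intersection, object frame
    cnocs = np.where(hit[:, None], pts / extents, 0.0)
    return cnocs.reshape(H, W, 3)
```

For a camera looking straight at the cuboid, the center pixel hits the front face, so its normalized z-coordinate is -1; the C- and S-CNOCS variants would apply a different encoding function to the same intersection coordinates.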
- ObjectPose9D Dataset
- Base Data: Approximately 110K images with accurate pose annotations selected from OmniNOCS (Objectron + Cityscapes subset).
- Extended Data: Large-scale 9D pose annotation of MS-COCO (~65K samples) to enrich object category and scene diversity.
- Annotation Pipeline: Filter suitable objects → estimate orientation via Orient Anything → reconstruct 3D point cloud via MoGe → compute 3D bounding boxes → manual verification.
- Total: 125,486 training samples.
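One step of the annotation pipeline, computing a 3D bounding box from a reconstructed point cloud given an estimated orientation, reduces to fitting the tightest box aligned with the orientation's axes. The sketch below is an assumption about how this step could work, not the paper's code; `oriented_bbox` is a hypothetical helper.

```python
import numpy as np

def oriented_bbox(points, R):
    """Fit the tightest 3D box aligned with an estimated orientation R
    (object-to-world rotation) around a point cloud (N, 3): rotate points
    into the object frame, take per-axis min/max, and map the box center
    back to world coordinates. Returns (center_world, half_extents, R).
    """
    local = points @ R                 # world -> object frame (R^T per point)
    lo, hi = local.min(axis=0), local.max(axis=0)
    center_local = (lo + hi) / 2.0
    half_extents = (hi - lo) / 2.0
    center_world = R @ center_local    # box center back in world coordinates
    return center_world, half_extents, R
```

In the paper's pipeline, `R` would come from Orient Anything and `points` from MoGe's reconstruction, followed by manual verification of the resulting boxes.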
- Two-Stage Training Strategy
- Stage 1: Train the ControlNet branch on ObjectPose9D using flow matching loss (45K iterations) to acquire basic pose control.
- Stage 2: RL fine-tuning (5K iterations) to address poor generation quality for low-frequency poses (e.g., rear views of most animals).
- Reward Design: Location/size reward \(r_{ls}\) (IoU via Grounding DINO) + orientation reward \(r_o\) (KL divergence via Orient Anything).
- Randomized truncated backpropagation and gradient checkpointing are used to reduce memory overhead.
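The reward design above can be sketched numerically: an IoU term standing in for \(r_{ls}\) (in the paper, computed against Grounding DINO detections) plus a negative-KL term standing in for \(r_o\) (computed against Orient Anything's predicted orientation distribution). The weights and the exact combination rule below are illustrative assumptions.

```python
import numpy as np

def iou_2d(a, b):
    """IoU of two boxes (x1, y1, x2, y2); stand-in for the location/size
    reward r_ls the paper computes from Grounding DINO detections."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def orientation_reward(pred_probs, target_probs, eps=1e-8):
    """Stand-in for r_o: negative KL divergence between the target orientation
    distribution and the one an estimator predicts on the generated image."""
    p = np.asarray(target_probs, dtype=float) + eps
    q = np.asarray(pred_probs, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return -float(np.sum(p * np.log(p / q)))

def pose_reward(pred_box, gt_box, pred_probs, gt_probs, w_ls=1.0, w_o=1.0):
    # Weighted sum; w_ls / w_o are illustrative, not the paper's values.
    return w_ls * iou_2d(pred_box, gt_box) + w_o * orientation_reward(pred_probs, gt_probs)
```

A perfectly aligned sample (matching box, matching orientation distribution) scores \(w_{ls}\), since the IoU term saturates at 1 and the KL term vanishes.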
- Disentangled Object Sampling (DOS)
- Design Motivation: In multi-object scenes, the model struggles to correctly associate each object with its corresponding pose, leading to under-generation and concept confusion.
- Mechanism: At each denoising step, multiple noise latents are composed and sampled conditioned on either global or individual object conditions; results are merged in latent space via regional masks.
- Can be combined with user personalization weights (LoRA) to enable customized pose control for reference subjects.
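The latent-space merging step of DOS can be sketched as masked compositing of per-condition predictions at a single denoising step. This is a minimal sketch of the merging rule only; the paper's exact blending, scheduling across steps, and mask handling may differ.

```python
import numpy as np

def dos_merge(global_pred, object_preds, masks):
    """One denoising step of Disentangled Object Sampling (sketch): start
    from the prediction under the global condition, then overwrite each
    object's region with the prediction made under that object's individual
    condition.

    global_pred: (C, H, W); object_preds: list of (C, H, W);
    masks: list of (H, W) binary regional masks at latent resolution.
    """
    merged = global_pred.copy()
    for pred, mask in zip(object_preds, masks):
        m = mask[None].astype(global_pred.dtype)  # broadcast over channels
        merged = merged * (1.0 - m) + pred * m
    return merged
```

Because each object's region is denoised under its own condition, prompts and poses stay correctly associated, at the cost of one extra model evaluation per object per step.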
Loss & Training¶
- Stage 1: Standard flow matching loss \(\|v_\theta(x_t, t, c_p, \{P_i\}) - (\epsilon - x_0)\|^2\), where \(x_t\) interpolates between the clean latent \(x_0\) and noise \(\epsilon\).
- Stage 2: Minimize \(-\beta r(x, c_p, \{P_i\}) + L_{prior}\), where \(L_{prior}\) is the Stage 1 loss used to stabilize training.
- AdamW optimizer, learning rate 5e-6, resolution 512×512, batch size 48, 6× A800 GPUs.
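The Stage 1 objective can be written out as a short numerical sketch. The linear interpolation schedule \(x_t = (1 - t)\,x_0 + t\,\epsilon\) is the standard rectified-flow choice and an assumption here; `velocity_fn` stands in for \(v_\theta\) conditioned on the CNOCS map and prompt.

```python
import numpy as np

def flow_matching_loss(x0, eps, t, velocity_fn):
    """Stage-1 flow matching objective (sketch): interpolate x_t between the
    clean latent x0 and noise eps, then regress the model's predicted
    velocity toward the rectified-flow target (eps - x0)."""
    x_t = (1.0 - t) * x0 + t * eps
    target = eps - x0
    v = velocity_fn(x_t, t)            # stands in for v_theta(x_t, t, c_p, {P_i})
    return float(np.mean((v - target) ** 2))
```

A model that predicts the target velocity exactly drives this loss to zero; Stage 2 then adds \(-\beta r\) on top while keeping this term as \(L_{prior}\) to stabilize training.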
Key Experimental Results¶
Main Results (Pose Alignment Accuracy)¶
| Benchmark | Method | Acc_ls(%)↑ | mIoU(%)↑ | Abs.Err↓ | Acc@22.5°(%)↑ |
|---|---|---|---|---|---|
| Single-Front | C3DW | 2.02 | 19.61 | 50.01 | 60.32 |
| Single-Front | LOOSECONTROL | 23.89 | 27.12 | 87.26 | 23.08 |
| Single-Front | SceneDesigner | 50.20 | 57.21 | 13.23 | 89.47 |
| Single-Back | LOOSECONTROL | 24.36 | 30.49 | 132.26 | 7.05 |
| Single-Back | SceneDesigner | 52.56 | 60.66 | 17.47 | 83.33 |
| Multi | LOOSECONTROL | 14.85 | 22.58 | 147.42 | 4.80 |
| Multi | SceneDesigner | 47.16 | 52.16 | 23.14 | 80.79 |
Ablation Study¶
| Configuration | Acc_ls(%)↑ | mIoU(%)↑ | Abs.Err↓ | Acc@22.5°(%)↑ |
|---|---|---|---|---|
| w/o MS-COCO | 41.69 | 50.07 | 74.89 | 24.32 |
| w/o RL Fine-tuning | 43.18 | 50.32 | 43.85 | 52.36 |
| C-CNOCS | 40.45 | 49.86 | 37.86 | 73.70 |
| Pose Embedding | 32.51 | 40.73 | 49.65 | 47.15 |
| Text Description Only | 12.90 | 14.32 | 88.43 | 25.31 |
| SceneDesigner | 51.12 | 58.55 | 14.87 | 87.10 |
Key Findings¶
- The I-CNOCS map substantially outperforms direct pose embedding injection (Acc@22.5°: 87.1% vs. 47.2%), validating the effectiveness of the spatial representation.
- RL fine-tuning improves orientation accuracy from 52.36% to 87.10%, particularly benefiting rear-view pose generation.
- MS-COCO data augmentation raises orientation accuracy from 24.32% to 87.10%, demonstrating the critical importance of category diversity.
- DOS significantly improves all metrics in multi-object scenes (Acc_ls: 36.68% → 47.16%).
- In user studies, SceneDesigner substantially outperforms competing methods in image quality (0.96) and orientation fidelity (0.91).
Highlights & Insights¶
- The CNOCS map is an elegant design: replacing precise 3D shapes with cuboids lowers the usage barrier while preserving geometric interpretability.
- The two-stage training strategy directly optimizes for pose control objectives, addressing data imbalance via RL more efficiently than data resampling.
- DOS trades additional inference computation for improved multi-object scene quality, embodying a straightforward and effective design philosophy.
- Integration with DreamBooth/LoRA for personalized pose control enhances practical applicability.
Limitations & Future Work¶
- Precise object shape cannot be controlled; only the cuboid bounding box can be manipulated.
- Multi-object scenes remain subject to concept confusion as the number of objects increases, constrained by the base model's capacity.
- DOS introduces additional computational overhead, requiring independent sampling per object.
- The current method supports image generation only; extension to video generation requires further work.
Related Work & Insights¶
- Compared to LOOSECONTROL, CNOCS supplies the critical missing orientation information.
- The RL fine-tuning strategy is generalizable to other generative tasks requiring alignment between control conditions and outputs.
- The variant design of CNOCS (constant / identity / spherical harmonic functions) offers a reference design space for representation engineering.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐