SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes
Conference: ICML 2026
arXiv: 2602.09153
Code: https://scenesmith.github.io/ (Project Page)
Area: 3D Vision / Indoor Scene Generation / Agentic AI / Robot Simulation
Keywords: Indoor Scene Synthesis, VLM Agent, Robot Simulation, Text-to-3D, Hierarchical Generation
TL;DR
SceneSmith uses a "designer-critic-orchestrator" triad of VLM agents to construct indoor scenes hierarchically along a "layout → furniture → small objects" tree. Text-to-3D generation, articulated object retrieval, and physical property estimation are all exposed to the agents as tools. From a single natural language prompt, it produces dense, manipulable environments that can be simulated immediately, averaging 71 objects per room (vs. 11–23 for baselines), with an object collision rate below 2% and a gravity-based stability rate of 96%, well ahead of the compared baselines on these metrics.
Background & Motivation
Background: Training home robots increasingly relies on large-scale simulation. However, existing simulation scenes are mostly "empty rooms with sparsely placed furniture": they are either procedurally generated from hand-written rules with limited expressiveness (ProcTHOR, Infinigen Indoors), or data-driven (DiffuScene, etc.) and restricted by SE(2) ground-alignment assumptions. Recent LLM/VLM-driven methods (Holodeck, I-Design, LayoutVLM, SceneWeaver) focus on furniture-level layout and visual realism while ignoring small objects, articulated parts, and physical properties.
Limitations of Prior Work: Real home scenes contain dense, articulated, and manipulatable clutter, such as "cabinets filled with dishes," whereas simulation rooms typically contain only a few static objects. Robot policies learned in sparse scenes often collapse in real environments, as clutter manipulation is a primary challenge. Furthermore, scenes generated by prior methods lack collision geometry, mass, friction, or inertia, making them unsuitable for direct use in physical simulators.
Key Challenge: Existing pipelines split "asset generation" and "scene organization." Asset-side research (e.g., generating single 3D objects) and scene-side research (e.g., placing layouts from fixed libraries) operate independently. Consequently, no system can generate a "simulation-ready" house—dense, physically feasible, and with complete geometry—from a single prompt. Additionally, the single-agent "reason-act-reflect" paradigm (SceneWeaver) is prone to self-evaluation bias, making it difficult to converge on dense yet feasible configurations.
Goal: To enable a single natural language prompt to directly generate "immediately simulatable" multi-room indoor environments that simultaneously satisfy: (1) object density close to real homes; (2) open-vocabulary assets generated on demand; (3) geometric non-penetration and stability under gravity; (4) a full pipeline without human intervention.
Key Insight: Scene construction is decomposed into a tree-structured stage pipeline (layout → furniture → wall-hangings → ceiling → independent small object branches for each supporting surface). Each stage employs three VLM agents (designer / critic / orchestrator). Simultaneously, text-to-3D, articulated object retrieval, thin coverings, and physical property estimation are unified as agent tools managed by an asset router.
Core Idea: Replace single-shot generation or single-agent reflection with a "hierarchical agent tree + designer-critic-orchestrator division of labor + asset generation-routing-verification integration." This merges scene generation and asset generation at the agent tool level into an end-to-end, simulation-ready pipeline.
Method
Overall Architecture
The input is a natural language scene prompt \(\mathcal{T}\), and the output is a multi-room scene \(\mathcal{S}=\{\mathcal{R}_j\}\) exportable to Drake / MuJoCo / Isaac Sim / Genesis. Each room \(\mathcal{R}_j=(\mathcal{G}_j, \mathcal{O}_j)\) includes architectural geometry (walls with thickness, floors, doors, and windows) and a set of objects \(\{(\mathcal{A}_i, \mathcal{X}_i)\}\). Each asset \(\mathcal{A}_i\) includes visual meshes, convex decomposed collision geometry, and physical properties (mass, center of mass, inertia, friction). Articulated objects also include joint definitions.
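The scene representation described above can be pictured as a few nested records. The following is a minimal sketch in Python, not the project's actual data model: every class and field name (`PhysicalProperties`, `Asset`, `PlacedObject`, `Room`, `Scene`) is hypothetical, the inertia is simplified to a diagonal, and the units are an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class PhysicalProperties:
    """Physical properties attached to each asset (units are assumed)."""
    mass_kg: float
    center_of_mass: Vec3
    inertia_diag: Vec3  # simplification: diagonal of the inertia tensor
    friction: float


@dataclass
class Asset:
    """One asset A_i: visual mesh, convex collision pieces, physics, joints."""
    name: str
    visual_mesh: str                 # hypothetical path to the visual mesh
    collision_meshes: List[str]      # hypothetical paths to convex pieces
    physics: PhysicalProperties
    joints: List[str] = field(default_factory=list)  # empty for static assets

    @property
    def is_articulated(self) -> bool:
        return bool(self.joints)


@dataclass
class PlacedObject:
    """An (A_i, X_i) pair: an asset together with its pose in the room."""
    asset: Asset
    position: Vec3
    roll_pitch_yaw: Vec3


@dataclass
class Room:
    """A room R_j = (G_j, O_j): architectural geometry plus objects."""
    name: str
    geometry: dict                   # walls, floor, doors, windows
    objects: List[PlacedObject] = field(default_factory=list)


@dataclass
class Scene:
    """A multi-room scene S = {R_j}."""
    rooms: List[Room]

    def object_count(self) -> int:
        return sum(len(room.objects) for room in self.rooms)
```

A flat structure like this is what an exporter for a specific simulator would walk; the export step itself is not sketched here.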
The construction follows a stage tree: the root stage uses a layout agent to generate the architectural geometry of \(M\) rooms. Each room independently proceeds through "furniture → wall-hangings → ceiling" stages, with the prompt refined from the global \(\mathcal{T}\) to room-level \(\mathcal{T}_j\). Subsequently, selected supporting entities (furniture surfaces, wall shelves, floor areas) in each room branch out to add small objects using entity-level prompts \(\mathcal{T}_{j,k}\). Cross-surface coordination is explicitly constrained within these branch prompts. After all stages, physical post-processing (projection de-penetration + gravity settling) is performed before flattening into \(\mathcal{S}\).
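The stage tree and its prompt refinement can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's implementation: `Stage`, `refine`, `build_stage_tree`, and `traversal_order` are hypothetical names, and `refine` merely appends a scope to the parent prompt where the real system would call a VLM.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Stage:
    """One node of the stage tree: a stage kind and the prompt it works from."""
    kind: str
    prompt: str
    children: List["Stage"] = field(default_factory=list)


def refine(parent_prompt: str, scope: str) -> str:
    """Stand-in for VLM prompt refinement: narrow the parent prompt to a scope."""
    return f"{parent_prompt} | focus: {scope}"


def build_stage_tree(global_prompt: str,
                     supports_by_room: Dict[str, List[str]]) -> Stage:
    """Build layout -> furniture -> wall-hangings -> ceiling -> small-object branches."""
    root = Stage("layout", global_prompt)
    for room, supports in supports_by_room.items():
        room_prompt = refine(global_prompt, room)          # T -> T_j
        furniture = Stage("furniture", room_prompt)
        wall = Stage("wall-hangings", room_prompt)
        ceiling = Stage("ceiling", room_prompt)
        furniture.children.append(wall)
        wall.children.append(ceiling)
        for support in supports:                           # T_j -> T_{j,k}
            ceiling.children.append(
                Stage("small-objects", refine(room_prompt, support)))
        root.children.append(furniture)
    return root


def traversal_order(stage: Stage) -> List[str]:
    """Depth-first order in which stages would run."""
    order = [stage.kind]
    for child in stage.children:
        order.extend(traversal_order(child))
    return order
```

Each leaf prompt carries the global and room-level context with it, which is the property the refinement chain \(\mathcal{T} \to \mathcal{T}_j \to \mathcal{T}_{j,k}\) is meant to provide.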
Key Designs
- Designer-Critic-Orchestrator Trio and Constrained Tools:
    - Function: Completely separates "proposal / evaluation / control" responsibilities to avoid single-agent bias and structure iterative refinement through tool calls.
    - Mechanism: The designer uses scene modification tools (placing/adjusting assets, snapping, assembling complex objects). The critic has access only to observation and verification tools (querying poses, rendering views, collision/reachability detection) and outputs a scalar score with natural language feedback. The orchestrator schedules the designer and critic, maintains history checkpoints, and rolls back if scores drop. Agents use a sliding-window memory, with earlier turns compressed via LLM summarization.
    - Design Motivation: Single-agent reflection often falls into the trap of "grading its own work as 90/100." Role differentiation allows the critic to catch semantic and physical issues from an external perspective, while the orchestrator's rollback mechanism turns exploration into "safe exploration."
- Asset Routing + Three-way On-demand Generation:
    - Function: Automatically turns designer requests (e.g., "red apple" or "kitchen cabinet with drawers") into simulation-ready assets with collision geometry and physics, covering static, articulated, and thin decoration types.
    - Mechanism: The asset router decomposes complex requests into atomic assets. Static objects use a text-to-3D pipeline: GPT Image 1.5 generates the reference image \(\to\) SAM3 segments the foreground \(\to\) SAM3D reconstructs a textured mesh \(\to\) alignment/scaling \(\to\) convex decomposition \(\to\) VLM physics estimation. Articulated objects are retrieved from the ArtVIP library. Thin coverings (rugs, posters) use lightweight surface meshes with PBR materials. All candidates undergo mesh integrity and VLM semantic checks.
    - Design Motivation: Text-to-3D is currently unreliable for articulated kinematics. Relying solely on generation produces "cabinets with doors that won't open," while retrieval-only is limited by library size. "Selecting the best strategy per object type + unified physics post-processing" balances open-vocabulary support against simulation readiness.
- Hierarchical Tree Construction + Physical Post-processing:
    - Function: Commits scene structure from large to small scales, ensuring local decisions are consistent with global intent, while automated post-processing ensures the output is "simulation-ready."
    - Mechanism: The hierarchical tree refines prompts (\(\mathcal{T} \to \mathcal{T}_j \to \mathcal{T}_{j,k}\)) to pass global style and surface semantics downward. Object placement is specified in the surface's coordinate system as an \(SE(2)\) pose, then lifted to \(SE(3)\) to prevent objects from floating or clipping. After the furniture and small-object stages, post-processing projects objects to the nearest collision-free configuration and performs gravity simulation in Drake for static equilibrium.
    - Design Motivation: Strict physical constraints are expensive for agents to satisfy directly. Decoupling semantics (via agents) from physical feasibility (via solvers) preserves agent flexibility while driving penetrations close to zero and leaving nearly all objects stable (1.2% collision rate and 95.6% stability in the reported results).
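The \(SE(2) \to SE(3)\) lift mentioned in the mechanism above can be made concrete with a short worked example. This sketch assumes a convention that the source does not spell out: the support surface has a world-frame origin and rotation, its local z-axis is the surface normal, and an object placed at \((x, y, \theta)\) sits at local height zero. The function names are hypothetical.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Matrix = List[List[float]]


def rot_z(theta: float) -> Matrix:
    """Rotation by theta (radians) about the local z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """3x3 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def lift_se2_to_se3(surface_origin: Vec3, surface_rotation: Matrix,
                    x: float, y: float, theta: float) -> Tuple[Vec3, Matrix]:
    """Lift an on-surface pose (x, y, theta) to a world-frame position and rotation.

    Assumed convention: the object rests on the surface plane (local z = 0),
    so it can neither float above nor sink into the support.
    """
    local = (x, y, 0.0)
    position = tuple(
        surface_origin[i] + sum(surface_rotation[i][k] * local[k] for k in range(3))
        for i in range(3)
    )
    rotation = matmul(surface_rotation, rot_z(theta))
    return position, rotation
```

Because the agent only chooses three numbers per object, the height and tilt are fixed by the supporting surface rather than left to the VLM.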
Loss & Training
SceneSmith does not train new models. It is composed of off-the-shelf VLMs (e.g., GPT) and vision foundation models (SAM3, SAM3D, text-to-image). Scalar scores from the critic are used for orchestrator decision-making (accept/rollback), not gradient optimization. Agent behavior is controlled entirely by prompt engineering and tool-call budgets.
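Since the critic's score drives accept/rollback decisions rather than gradients, the control flow can be sketched as a plain loop. The code below is an illustrative sketch under stated assumptions: `orchestrate`, `max_rounds`, and `target_score` are hypothetical names, the scene is a toy list of object names, and the designer and critic are passed in as ordinary callables standing in for VLM agents.

```python
import copy
from typing import Callable, List, Tuple

Scene = List[str]  # toy scene: a list of object names
Designer = Callable[[Scene, str], Scene]
Critic = Callable[[Scene], Tuple[float, str]]


def orchestrate(scene: Scene, designer: Designer, critic: Critic,
                max_rounds: int = 5,
                target_score: float = 9.0) -> Tuple[Scene, float, List[float]]:
    """Alternate designer and critic, keeping a checkpoint of the best scene.

    A proposal is accepted only if its score does not drop below the
    checkpoint's score; otherwise the loop rolls back to the checkpoint.
    """
    best_scene = copy.deepcopy(scene)
    best_score, feedback = critic(best_scene)
    history = [best_score]
    for _ in range(max_rounds):          # round budget (assumed knob)
        if best_score >= target_score:
            break
        # The designer always works on a copy of the last accepted checkpoint.
        candidate = designer(copy.deepcopy(best_scene), feedback)
        score, new_feedback = critic(candidate)
        history.append(score)
        if score >= best_score:
            best_scene, best_score, feedback = candidate, score, new_feedback
        # else: rollback, i.e. keep best_scene and its feedback unchanged
    return best_scene, best_score, history
```

With a designer that sometimes proposes a harmful edit, the returned history shows the dip while the returned scene never includes the rejected change, which is the "safe exploration" behavior attributed to the orchestrator.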
Key Experimental Results
Main Results
The evaluation covers 210 prompts across five categories: SceneEval-100, Type Diversity, Object Density, Themed Scenes, and House-Level scenes. The user study included 205 crowdsourced participants and 3,051 valid comparisons.
| Dataset / Dimension | Metric | Ours (SceneSmith) | Prev. SOTA | Gain |
|---|---|---|---|---|
| Indoor Scenes | Objects/Room | 71.1 ± 13.0 | HSM 22.7 / Holodeck 23.0 | 3–6× |
| Indoor Scenes | Collision Rate COL ↓ | 1.2% | 3–29% (Baselines) | Significant |
| Indoor Scenes | Static Stability STB ↑ | 95.6% | 8–61% (Baselines) | 1.5–12× |
| Indoor Scenes | Obj-Obj Relations OOR ↑ | 67.6 | I-Design 28.6 | 2.4× |
| User Study | Realism Win Rate | 92.2% | — | All p<0.001 |
| User Study | Prompt Fidelity Win Rate | 91.5% | — | All p<0.001 |
| House-Level | Object Count | 214.1 ± 60.9 | Holodeck 81.3 | 2.6× |
Ablation Study
Six ablations compare SceneSmith against its variants using user studies and automated metrics; four of them are listed below.
| Configuration | Realism / Fidelity Win Rate | Obj Count | Key Insight |
|---|---|---|---|
| Full SceneSmith | — | 71.1 | Baseline for comparison. |
| w/o Generated | 63.8% / 67.0% | 57.7 | Generated assets are critical for realism and open-vocabulary support. |
| w/o AssetValidation | 63.0% / 62.2% | 72.7 | Validation prevents low-quality assets from entering the scene. |
| w/o ObserveScene | 61.5% / 53.2% | 69.7 | Visual feedback significantly contributes to realism. |
| w/o Critic | 51.8% / 47.5% | 54.0 | Saves 70% cost but object count drops 24%; a viable cost-efficient option. |
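The object-count column can be turned into relative changes against the full system with a few lines of arithmetic. The numbers below are copied from the table; the helper name `relative_change` is only for illustration.

```python
FULL_COUNT = 71.1  # objects per room, full SceneSmith

ABLATION_COUNTS = {
    "w/o Generated": 57.7,
    "w/o AssetValidation": 72.7,
    "w/o ObserveScene": 69.7,
    "w/o Critic": 54.0,
}


def relative_change(value: float, baseline: float) -> float:
    """Percentage change of value relative to baseline."""
    return 100.0 * (value - baseline) / baseline


changes = {
    name: round(relative_change(count, FULL_COUNT), 1)
    for name, count in ABLATION_COUNTS.items()
}

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

Removing the critic gives the largest drop, about 24%, matching the figure quoted in the table, while removing asset validation slightly increases the count, consistent with validation acting as a filter on assets.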
Key Findings
- Density is the primary differentiator: 71 vs. 11–23 objects per room determines whether a robot can learn to handle manipulation in cluttered environments.
- Physical readiness is a qualitative shift: Baselines suffer high collision rates (3–29%) and low stability (as low as 8%), meaning simulation often fails immediately. SceneSmith achieves 1.2% collision and 96% stability.
- House connectivity is significantly more logical: Generated hotels correctly follow a "lobby \(\to\) hallway \(\to\) rooms" topology, whereas baselines like Holodeck often generate nonsensical layouts.
- Closed-loop policy evaluation: The automated evaluator achieved 99.7% agreement with human labels across 300 cases, successfully distinguishing between standard and degraded policies.
Highlights & Insights
- Designer-Critic-Orchestrator Role Isolation: This is a transferable design pattern. Dividing "proposal, review, and control" by permissions is much steadier than simple role-playing in prompts.
- "Agent for Semantics + Solver for Physics": Decoupling the flexible agent (for aesthetics/semantics) from a rigid solver (for mm-level de-penetration/settling) is the current optimal engineering balance for simulation-ready generation.
- On-demand Asset Routing: Admitting that text-to-3D is currently poor for articulation and bypassing it via retrieval for those specific cases is a pragmatic choice that enables deployment today.
- Reduction of Data Contamination: By generating assets rather than using fixed libraries, SceneSmith avoids the bias where robot policies "overfit" to known 3D models.
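The per-type routing idea highlighted above (generate static objects, retrieve articulated ones, use thin meshes for coverings) can be sketched as a small dispatcher. This is a toy under loud assumptions: the keyword lists, `Strategy`, `decompose`, `route`, and `route_request` are all hypothetical, and keyword matching stands in for whatever VLM-based classification the actual router performs.

```python
import re
from enum import Enum
from typing import List, Tuple


class Strategy(Enum):
    GENERATE = "text-to-3D generation"
    RETRIEVE = "articulated-library retrieval"
    THIN = "thin surface mesh"


# Hypothetical keyword hints; a real router would not rely on string matching.
ARTICULATED_HINTS = ("drawer", "door", "cabinet", "fridge", "microwave")
THIN_HINTS = ("rug", "poster", "carpet", "painting")


def decompose(request: str) -> List[str]:
    """Split a compound request into atomic asset descriptions."""
    parts = re.split(r",|\band\b", request)
    return [part.strip() for part in parts if part.strip()]


def route(description: str) -> Strategy:
    """Pick an acquisition strategy for one atomic asset description."""
    text = description.lower()
    if any(hint in text for hint in THIN_HINTS):
        return Strategy.THIN
    if any(hint in text for hint in ARTICULATED_HINTS):
        return Strategy.RETRIEVE
    return Strategy.GENERATE


def route_request(request: str) -> List[Tuple[str, Strategy]]:
    """Decompose a request and route each atomic asset."""
    return [(item, route(item)) for item in decompose(request)]
```

In this framing, the validation and physics-estimation steps would run on every routed asset regardless of strategy, which is what keeps the three paths interchangeable from the designer's point of view.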
Limitations & Future Work
- The pipeline relies heavily on closed-source frontier models (GPT Image 1.5, etc.), resulting in high token costs and latency.
- Articulated objects are still limited by the coverage of the ArtVIP library; true on-demand generation for articulated parts remains unsolved.
- Post-processing does not explicitly optimize for dynamic reachability or kinematic envelopes; some objects might be stable but inaccessible to a robot arm.
- Evaluation depends partially on VLM-based scoring (SceneEval), which has its own noise. User studies might also be biased by the visual impact of higher object density.
Related Work & Insights
- vs. HSM (Pun et al., 2026): HSM introduced hierarchical levels, but SceneSmith adds prompt refinement and the agent triad, resulting in much higher object counts (71 vs. 23) and superior stability (96% vs. 45%).
- vs. Holodeck (Yang et al., 2024b): Holodeck uses constraint solvers; SceneSmith achieves 2.6× more objects at the house level and much lower collision rates.
- vs. SceneWeaver (Yang et al., 2025): SceneWeaver uses a single-agent planner; SceneSmith's three-agent division of labor yields a 91.7% win rate in comparisons.
- vs. ProcTHOR / Infinigen Indoors: These rule-based methods lack semantic expressiveness; SceneSmith finds a middle ground by using VLM-guided semantics with rule-constrained tools.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of triad agents and hierarchical asset routing is a robust end-to-end system design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extensive user studies, automated metrics, physics metrics, and robot policy tests.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear motivations regarding manipulation in clutter and transparent discussion of trade-offs.
- Value: ⭐⭐⭐⭐⭐ A landmark contribution that moves environment generation from "research demo" to "industrial utility" for robot learning.