HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
Conference: CVPR 2026 | arXiv: 2604.10772 | Code: None | Area: Multimodal VLM | Keywords: 3D scene generation, scene editing, vision-language models, hierarchical optimization, RAG
TL;DR
This paper proposes HOG-Layout, a hierarchical framework for 3D indoor scene generation, optimization, and editing built on LLMs and VLMs. On SceneEval it outperforms LayoutVLM while running about 4.5× faster, thanks to RAG-enhanced semantic consistency and force-directed hierarchical optimization for physical plausibility.
Background & Motivation
- Background: 3D indoor scene generation serves interior design, VR, and embodied AI. Traditional approaches learn layouts from data (graph networks, Transformers, diffusion models) or directly generate appearance (NeRF, Gaussian Splatting), but are limited by diversity or lack of interactivity. The emergence of LLM/VLM enables open-vocabulary scene generation.
- Limitations of Prior Work: LLMs directly generating layouts (e.g., LayoutGPT) may produce collisions and implausible placements; incorporating spatial relationship constraints (e.g., Holodeck) improves plausibility at the cost of diversity; VLM-based methods (e.g., LayoutVLM) improve semantic consistency but require predefined object sets and rely on computationally expensive gradient-based optimization (~322 s/scene). All of these methods focus primarily on generation from scratch, neglecting the practically more important need for scene editing.
- Key Challenge: Generating semantically consistent and physically plausible scenes requires satisfying both soft constraints (semantic relationships) and hard constraints (collision-free, within boundaries). Existing methods struggle to balance both while maintaining computational efficiency.
- Goal: Build a hierarchical framework that supports both scene generation and editing, achieving low latency while guaranteeing semantic consistency and physical plausibility.
- Key Insight: Objects are organized into a hierarchical structure based on support relationships (floor → table → objects on table), with optimization applied within each layer and across parent-child levels, decomposing complex 3D constraints into planar forces, vertical forces, and rotational torques.
- Core Idea: Four modules work in concert: RAG-enhanced scene planning + VLM-based initial layout generation + force-directed hierarchical optimization + LLM-parsed editing instructions, enabling efficient scene generation and editing.
Method
Overall Architecture
The pipeline consists of four modules: (1) Scene Planning — LLM + RAG generates a structured plan from text; (2) Layout Generation — VLM combines top-down views to generate hierarchical layouts and retrieve objects; (3) Hierarchical Optimization — force-directed iterative optimization for physical and semantic constraints; (4) Scene Editing — LLM parses editing instructions into add/delete/move operations.
Key Designs
- RAG-Enhanced Scene Planning:
- Function: Generate structured object lists and layout guidance from text descriptions.
- Mechanism: A template library of layout constraint rules is constructed; Qwen3-Embedding-4B extracts 1024-dimensional feature vectors stored in a FAISS database. At inference, cosine similarity retrieves the top-3 most relevant layout rules, which are combined with user input and fed to an LLM to generate a scene plan (object IDs, names, sizes, groupings, etc.). Objects are grouped by functional zones (e.g., a combined living-dining room is split into a dining group and a TV-viewing group).
- Design Motivation: Direct LLM layout generation lacks domain knowledge constraints; RAG compensates by injecting human design rules.
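As a rough sketch of the retrieval step, here is a toy top-k cosine-similarity lookup in pure Python. The 2-D vectors and rule texts are invented stand-ins; the paper uses 1024-d Qwen3-Embedding-4B vectors in a FAISS index, not this brute-force scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def retrieve_rules(query_vec, rule_db, k=3):
    """Return the texts of the k layout rules most similar to the query embedding."""
    scored = sorted(rule_db, key=lambda r: cosine(query_vec, r["vec"]), reverse=True)
    return [r["text"] for r in scored[:k]]

# Toy rule library with hypothetical 2-D embeddings.
rules = [
    {"text": "beds face away from the door", "vec": [1.0, 0.0]},
    {"text": "sofas face the TV",            "vec": [0.0, 1.0]},
    {"text": "nightstands flank the bed",    "vec": [0.9, 0.1]},
]
print(retrieve_rules([1.0, 0.2], rules, k=2))
```

The retrieved rule texts would then be concatenated with the user prompt before the LLM planning call.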
- Force-Directed Hierarchical Optimization:
- Function: Iteratively optimize an initial layout to a physically and semantically stable state.
- Mechanism: Objects form a hierarchical tree based on support relationships. Each object maintains three force accumulators: planar force \(F_{i,\text{plane}} \in \mathbb{R}^2\) (collision, boundary, proximity, wall-hugging), vertical force \(F_{i,\text{vert}} \in \mathbb{R}\) (inter-level collision, vertical boundary), and rotational torque \(\tau_i\) (orientation, alignment). Positions and rotations are updated via explicit Euler integration. Deadlock detection and avoidance are incorporated: horizontal deadlocks apply a vertical force to "push out," while vertical deadlocks directly scale the Z-axis. Convergence is declared when residual forces fall below threshold \(\epsilon_{\text{conv}}\).
- Design Motivation: Unifying all constraints as continuous forces avoids the computational overhead of mixed-integer programming; the hierarchical decomposition enables parallel optimization of intra-level and parent-child constraints.
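A minimal sketch of the accumulate-forces-then-Euler-integrate loop, assuming a simple pairwise planar repulsion as the only force term. The paper's full set of forces (boundary, proximity, wall-hugging, vertical, torque), the support hierarchy, and deadlock handling are omitted; the `Node` fields mirror the three accumulators described above.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    pos: list                 # planar (x, y) position
    z: float = 0.0            # height of the object's base
    theta: float = 0.0        # yaw orientation
    f_plane: list = field(default_factory=lambda: [0.0, 0.0])  # planar force accumulator
    f_vert: float = 0.0       # vertical force accumulator
    tau: float = 0.0          # torque accumulator

def add_collision_force(a, b, min_dist=1.0, k=1.0):
    """Repulsive planar force between two same-level objects closer than min_dist."""
    dx, dy = a.pos[0] - b.pos[0], a.pos[1] - b.pos[1]
    d = (dx * dx + dy * dy) ** 0.5
    if 0.0 < d < min_dist:
        push = k * (min_dist - d) / d
        a.f_plane[0] += push * dx; a.f_plane[1] += push * dy
        b.f_plane[0] -= push * dx; b.f_plane[1] -= push * dy

def relax(objs, dt=0.1, eps=1e-3, max_iters=500):
    """Explicit-Euler iteration until the residual force drops below eps."""
    for _ in range(max_iters):
        for i in range(len(objs)):
            for j in range(i + 1, len(objs)):
                add_collision_force(objs[i], objs[j])
        residual = 0.0
        for o in objs:
            o.pos[0] += dt * o.f_plane[0]
            o.pos[1] += dt * o.f_plane[1]
            o.z += dt * o.f_vert
            o.theta += dt * o.tau
            residual = max(residual, abs(o.f_plane[0]), abs(o.f_plane[1]),
                           abs(o.f_vert), abs(o.tau))
            o.f_plane = [0.0, 0.0]; o.f_vert = 0.0; o.tau = 0.0
        if residual < eps:
            break
    return objs
```

For two overlapping objects, the repulsion pushes them apart until their separation reaches `min_dist` and the residual force vanishes, which is the convergence condition described above.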
- Text-Driven Scene Editing:
- Function: Support precise scene modification through natural language.
- Mechanism: An LLM maps user text to four primitive operations (plan/add/delete/move). Add operations are routed to the layout generation module; move operations have the VLM output the target object ID and new position; delete operations have the VLM output the ID to remove. After modification, the result is passed to the hierarchical optimization module to produce the final scene.
- Design Motivation: In practice, users more often refine than rebuild scenes; editing capability is critical for transitioning to interactive scene design.
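The dispatch step might look like the following sketch, assuming a hypothetical JSON schema for the LLM's parsed operation (the paper does not specify the interface at this level, and the real add/move paths would route through the layout-generation and optimization modules rather than write positions directly).

```python
import json

def apply_edit(scene, op_json):
    """Apply one parsed edit primitive to a scene (a dict keyed by object ID)."""
    op = json.loads(op_json)
    if op["op"] == "add":
        # Real pipeline: route to the layout-generation module for placement.
        scene[op["id"]] = {"name": op["name"], "pos": op.get("pos", [0.0, 0.0])}
    elif op["op"] == "delete":
        scene.pop(op["id"], None)
    elif op["op"] == "move":
        scene[op["id"]]["pos"] = op["pos"]
    elif op["op"] == "plan":
        pass  # would re-invoke the scene-planning module; out of scope here
    return scene

scene = {"chair_0": {"name": "chair", "pos": [1.0, 2.0]}}
apply_edit(scene, '{"op": "move", "id": "chair_0", "pos": [3.0, 2.0]}')
apply_edit(scene, '{"op": "add", "id": "lamp_0", "name": "lamp", "pos": [0.5, 0.5]}')
```

After such edits, the modified scene would be handed back to the hierarchical optimizer to restore physical plausibility.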
Loss & Training
No training is required. GPT-4o is used uniformly as the LLM/VLM backbone. Object retrieval combines weighted scores from SBERT text similarity, OpenCLIP image-text similarity, and size matching.
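A sketch of the combined retrieval score; the 0.4/0.4/0.2 weights, field names, and candidate values below are illustrative assumptions, not numbers from the paper.

```python
def retrieval_score(text_sim, clip_sim, size_sim, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of SBERT text similarity, OpenCLIP image-text similarity,
    and size matching. The weights here are illustrative placeholders."""
    wt, wc, ws = weights
    return wt * text_sim + wc * clip_sim + ws * size_sim

def best_asset(candidates, weights=(0.4, 0.4, 0.2)):
    """Pick the candidate asset with the highest combined score."""
    return max(candidates, key=lambda c: retrieval_score(
        c["text_sim"], c["clip_sim"], c["size_sim"], weights))

# Hypothetical candidates with precomputed per-cue similarities.
candidates = [
    {"id": "sofa_a", "text_sim": 0.9, "clip_sim": 0.6, "size_sim": 0.5},
    {"id": "sofa_b", "text_sim": 0.7, "clip_sim": 0.8, "size_sim": 0.9},
]
```

With these toy weights, a candidate that matches better on image similarity and size can outrank one with higher text similarity alone.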
Key Experimental Results
Main Results
| Method | COL_ob↓ | COL_sc↓ | SUP↑ | OAR↑ | SP↑ | Time↓ |
|---|---|---|---|---|---|---|
| LayoutGPT | 35.67% | 49% | 34.39% | 11.48% | 35.14 | 37s |
| Holodeck | 12.24% | 63% | 34.72% | 38.27% | 55.45 | 272s |
| LayoutVLM | 29.44% | 55% | 77.54% | 61.99% | 65.54 | 322s |
| HOG-Layout | 5.28% | 16% | 81.17% | 75.74% | 69.69 | 70s |
Human Evaluation (7-point scale):
| Method | Plausibility | Semantic Alignment |
|---|---|---|
| LayoutGPT | 2.43 | 2.58 |
| Holodeck | 3.97 | 3.66 |
| LayoutVLM | 3.69 | 4.61 |
| HOG-Layout | 5.33 | 5.75 |
Ablation Study
| Configuration | COL_ob↓ | SP↑ | Note |
|---|---|---|---|
| HOG-Layout (full) | 5.28% | 69.69 | All modules |
| w/o RAG | Higher | ~65 | Weakened semantic constraints |
| w/o hierarchical optimization | ~20% | ~60 | Significant collision increase |
| w/o force decomposition | ~15% | ~64 | Improper vertical constraint handling |
Key Findings
- Nearly 6× reduction in collision rate: HOG-Layout achieves an object collision rate of only 5.28% (vs. 29.44% for LayoutVLM) and a scene collision rate of 16% (vs. 55% for LayoutVLM).
- 4.5× speed improvement: 70s vs. 322s for LayoutVLM, as force-directed optimization is substantially faster than gradient-based optimization.
- Consistency between automated and human evaluation: GPT-5 scores align with human evaluation trends; HOG-Layout leads significantly on both metrics.
Highlights & Insights
- Force-directed hierarchical optimization is the core innovation: framing scene layout optimization as a physical force equilibrium problem is both intuitive and efficient. The deadlock detection and avoidance mechanism further enhances robustness.
- Editing support is a critical step toward practical utility: most scene generation works focus solely on generation from scratch; HOG-Layout's add/delete/move editing capabilities bring it closer to real-world usage.
- Group-based generation strategy is worth adopting: progressive generation by functional zone, where each group's top-down view serves as context for the next, ensures spatial consistency across groups.
Limitations & Future Work
- Object retrieval relies on existing 3D asset libraries (3D-FUTURE, Objaverse) and cannot generate objects absent from the database.
- Force-directed optimization may converge to local optima; the deadlock avoidance strategy is heuristic.
- Only indoor scenes are supported; applicability to outdoor or large-scale scenes is unverified.
- Editing operations are relatively basic (add/delete/move); more complex semantic edits (e.g., "make the room cozier") are not supported.
Related Work & Insights
- vs. LayoutVLM: LayoutVLM uses gradient-based optimization, incurring high computational cost (322s) and requiring predefined object sets. HOG-Layout's force-directed optimization is 4.5× faster and supports open-vocabulary inputs.
- vs. Holodeck: Holodeck uses DFS/MILP to satisfy hard constraints but neglects soft semantic constraints. HOG-Layout addresses both physical and semantic constraints simultaneously.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of hierarchical force-directed optimization and RAG-enhanced planning is novel and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐ SceneEval with 100 scenes + human evaluation + editing experiments.
- Writing Quality: ⭐⭐⭐ Multiple modules are described clearly, though some details require reference to supplementary materials.
- Value: ⭐⭐⭐⭐ The unified generation-and-editing framework offers practical value, with clear speed advantages.