Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans (HouseMind)¶
Conference: CVPR 2026 · arXiv: 2603.11640 · Code: housemind.github.io · Area: Multimodal VLM / Architectural Floor Plan Design · Keywords: Multimodal large language models, VQ-VAE, spatial tokenization, floor plan generation, floor plan editing, instruction tuning
TL;DR¶
This paper presents HouseMind, which discretizes architectural floor plans into room-level spatial tokens via a hierarchical VQ-VAE, enabling floor plan understanding, generation, and editing within a unified MLLM framework. The approach comprehensively outperforms diffusion-model and general-purpose VLM baselines in geometric validity and controllability.
Background & Motivation¶
High cognitive complexity of architectural floor plan design: Floor plan design requires simultaneous reasoning over geometric, semantic, and spatial hierarchical relationships. Patterns are not sequential but embedded in complex relational structures, posing a significant challenge for AI.
Lack of global spatial consistency in existing methods: Diffusion models and autoregressive models have achieved improvements in visual fidelity but treat layout synthesis as a purely visual process, lacking explicit room-instance-level reasoning. This leads to locally plausible but globally spatially incoherent results (e.g., inconsistent adjacency and circulation relationships).
Insufficient interpretability and controllability: Large-scale vision-language models commonly function as black-box generators, with limited spatial controllability and interpretability.
Inability to unify understanding, generation, and editing: Existing frameworks struggle to simultaneously handle understanding, generation, and editing tasks within a single architecture, particularly given the geometric and semantic complexity of architectural layouts.
High computational overhead and difficulty of local deployment: Most AI systems demand substantial computational resources, making integration into practical design workflows difficult.
Existing LLM-driven design approaches remain modular: Methods such as Tell2Design, ChatHouseDiffusion, and FloorPlanLLaMA improve interpretability but operate as independent modules, lacking unified multi-task reasoning.
Method¶
Overall Architecture¶
HouseMind consists of two core components: Room-Instance Tokenization and Multimodal Alignment & Instruction Tuning.
A floor plan is decomposed into an outline \(x_o\) and \(N\) room instances \(\{x_{r_i}\}_{i=1}^N\), which are encoded into discrete token sequences via two separate VQ-VAEs and then interleaved into a unified sequence:

\[
Z = \left[\, \boldsymbol{z}_o,\ \ell_{r_1}, \boldsymbol{z}_{r_1},\ \dots,\ \ell_{r_N}, \boldsymbol{z}_{r_N} \,\right],
\]

where \(\ell_{r_i}\) denotes the semantic label token of room \(i\), and \(\boldsymbol{z}_o\) and \(\boldsymbol{z}_{r_i}\) are the discrete tokens for the outline and each room, respectively.
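To make the discretization concrete, here is a minimal sketch of the vector-quantization step shared by the outline and room VQ-VAEs. The module structure, codebook size, and feature dimensions are illustrative assumptions, not the paper's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Minimal VQ layer: maps continuous encoder features to discrete
    codebook indices (the 'spatial tokens'). Sizes are illustrative."""

    def __init__(self, num_codes: int = 512, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z_e: torch.Tensor):
        # z_e: (B, D, H, W) encoder features -> flatten to (B*H*W, D)
        B, D, H, W = z_e.shape
        flat = z_e.permute(0, 2, 3, 1).reshape(-1, D)
        # Pick the nearest codebook entry by Euclidean distance
        dist = torch.cdist(flat, self.codebook.weight)
        indices = dist.argmin(dim=1)  # the discrete spatial tokens
        z_q = self.codebook(indices).view(B, H, W, D).permute(0, 3, 1, 2)
        # Straight-through estimator: copy gradients from z_q to z_e
        z_q_st = z_e + (z_q - z_e).detach()
        # Standard VQ-VAE codebook + commitment losses
        vq_loss = F.mse_loss(z_q, z_e.detach()) + 0.25 * F.mse_loss(z_e, z_q.detach())
        return z_q_st, indices.view(B, H, W), vq_loss
```

Two such quantizers with separate codebooks \(\mathcal{Z}_o\) and \(\mathcal{Z}_r\) would yield the outline tokens \(\boldsymbol{z}_o\) and room tokens \(\boldsymbol{z}_{r_i}\) interleaved into \(Z\) above.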
Key Designs¶
- Outline Discretization: A CNN encoder \(E_o\) extracts features from the binary outline mask, which are vector-quantized into discrete tokens via the outline codebook \(\mathcal{Z}_o\); a decoder reconstructs the outline.
- Conditional Room Discretization: The room encoder \(E_r\) jointly encodes each room mask conditioned on the outline, and quantizes it via the room codebook \(\mathcal{Z}_r\). Conditional encoding enables room representations to be globally context-aware, capturing geometric and spatial adjacency relationships.
- Three-stage Multimodal Training:
- Stage 1 – Embedding Initialization: Spatial codes from the VQ-VAE codebook are mapped to trainable token embeddings in the LLM vocabulary, establishing a one-to-one correspondence between discrete spatial codes and text tokens.
- Stage 2 – Multimodal Pre-training: The model is trained on large-scale paired data consisting of text descriptions, outline tokens, and room tokens using an autoregressive language modeling objective, achieving bidirectional alignment between language and geometry.
- Stage 3 – Instruction Tuning (SFT): Supervised fine-tuning is performed on instruction data covering understanding, generation, and editing tasks, endowing the model with task awareness and spatial reasoning capability.
- Unified Task Modeling: Understanding (inferring room functions and topology from \(Z\)), generation (autoregressively generating a layout given text \(s\) and outline \(\boldsymbol{z}_o\)), and editing (generating a modified layout \(Z^{tgt}\) given a source layout \(Z^{src}\) and instruction \(s\)) are all formulated as a unified sequence modeling problem.
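The unified formulation lends itself to a simple serialization. Below is a minimal sketch of how the three tasks could be packed into a single autoregressive token sequence; the special tokens and prompt templates are hypothetical illustrations, not taken from the paper:

```python
# Hypothetical serialization showing how all three tasks reduce to
# next-token prediction over one vocabulary that mixes text tokens
# with discrete spatial tokens.

def layout_to_sequence(z_o, rooms):
    """Interleave outline tokens with (label, room-tokens) pairs -> Z."""
    seq = [f"<o_{t}>" for t in z_o]
    for label, z_r in rooms:
        seq += [f"<room:{label}>"] + [f"<r_{t}>" for t in z_r]
    return seq

def build_training_sequence(task, text_tokens, Z_src=None, Z_tgt=None):
    """Everything left of the target acts as the prompt; the model is
    trained autoregressively on the concatenation (teacher forcing)."""
    if task == "understand":   # layout -> textual answer
        return Z_src + ["<answer>"] + text_tokens
    if task == "generate":     # text + outline -> full layout
        return text_tokens + Z_src + ["<layout>"] + Z_tgt
    if task == "edit":         # source layout + instruction -> edited layout
        return Z_src + text_tokens + ["<layout>"] + Z_tgt
    raise ValueError(task)

# Example: a generation sample with a 4-token outline and two rooms.
Z = layout_to_sequence([3, 17, 42, 8], [("bedroom", [5, 9]), ("kitchen", [12])])
seq = build_training_sequence("generate", ["Design", "a", "two-room", "flat"],
                              Z_src=[f"<o_{t}>" for t in [3, 17, 42, 8]], Z_tgt=Z)
```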
Backbone & Efficiency¶
HouseMind is built upon Qwen3-0.6B as the language model backbone. Its small parameter count supports real-time inference and local deployment on a single RTX 3090 GPU.
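As an illustration of the Stage-1 idea on this backbone, here is a minimal sketch of extending a Hugging Face Qwen3-0.6B vocabulary with spatial tokens and initializing their embeddings from a stand-in codebook; the token naming, codebook dimension, and projection scheme are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# One new text token per discrete spatial code (codebook sizes assumed).
spatial_tokens = [f"<o_{i}>" for i in range(512)] + [f"<r_{i}>" for i in range(512)]
tokenizer.add_tokens(spatial_tokens)
model.resize_token_embeddings(len(tokenizer))

# Stage-1-style initialization: project VQ-VAE codebook vectors (here a
# random stand-in) into the LLM embedding space and write them into the
# newly added rows of the input embedding matrix.
embeddings = model.get_input_embeddings().weight             # (vocab, hidden)
codebook = torch.randn(len(spatial_tokens), 256)             # stand-in codebooks
proj = torch.nn.Linear(256, embeddings.shape[1], bias=False)
with torch.no_grad():
    embeddings[-len(spatial_tokens):] = proj(codebook)
```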
Experiments¶
Dataset & Benchmark¶
A unified benchmark for evaluating floor plan understanding, generation, and editing is constructed based on the RPLAN dataset, comprising 80,738 samples: 76,122 for training, 2,308 for validation, and 2,308 for testing. Each floor plan includes a JSON representation and both simple and detailed text descriptions (generated by Qwen3-30B-A3B).
Understanding Task Results¶
| Method | RMR | LocAcc | AreaDiff (m²)↓ | AdjAcc | RelAcc | Time (s) |
|---|---|---|---|---|---|---|
| LLaVA-v1.6-Mistral-7B | 0.616 | 0.225 | 3.649 | 0.134 | 0.056 | ~6 |
| Qwen3-VL-8B | 0.698 | 0.347 | 5.837 | 0.382 | 0.128 | ~8 |
| InternVL3.5-8B | 0.847 | 0.546 | 12.234 | 0.469 | 0.157 | ~13 |
| MiniCPM-V 4.5 | 0.904 | 0.492 | 13.765 | 0.597 | 0.208 | ~14 |
| HouseMind-U | 0.998 | 0.969 | 0.549 | 0.990 | 0.808 | ~3 |
Compared with the best baseline on each metric, HouseMind-U improves room localization accuracy and adjacency accuracy by roughly 40 absolute percentage points, and reduces the mean area error from several square meters to below 0.6 m².
Generation Task Results¶
| Method | Micro IoU | Macro IoU | FID↓ | GED↓ | Node F1 | Edge Ovl. | Time (s) |
|---|---|---|---|---|---|---|---|
| Tell2Design | 0.390 | 0.307 | 30.5 | 6.94 | 0.808 | 0.197 | ~15 |
| ChatHouseDiffusion | 0.589 | 0.521 | 11.3 | 2.36 | 0.985 | 0.710 | ~30 |
| FloorPlanLLaMA | 0.607 | 0.511 | 49.3 | 2.68 | 0.922 | 0.574 | ~1 |
| HouseMind-G | 0.709 | 0.653 | 1.91 | 1.01 | 0.994 | 0.880 | ~2 |
Micro IoU improves by 12 absolute points over ChatHouseDiffusion (0.589 → 0.709), and FID drops from 11.3 to 1.91.
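For reference, a sketch of the two IoU variants under their standard definitions (micro pools intersections and unions over all room classes; macro averages per-class IoU). This assumes per-pixel room-type maps and is not the paper's evaluation code:

```python
import numpy as np

def micro_macro_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """pred/gt: (H, W) integer room-type maps."""
    inter, union, per_class = 0, 0, []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        i, u = np.logical_and(p, g).sum(), np.logical_or(p, g).sum()
        inter += i
        union += u
        if u > 0:
            per_class.append(i / u)
    micro = inter / union if union > 0 else 0.0   # pooled over classes
    macro = float(np.mean(per_class)) if per_class else 0.0  # class-averaged
    return micro, macro
```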
Editing Task Results¶
| Method | ΔIoU | ΔMSE↓ | Node F1 | Edge Ovl. |
|---|---|---|---|---|
| FLUX.1-Kontext-dev | 0.053 | 0.0162 | 0.765 | 0.222 |
| Qwen-Image-Edit | 0.088 | 0.0074 | 0.915 | 0.426 |
| HouseMind-E | 0.608 | 0.0019 | 0.998 | 0.934 |
Ablation Study¶
| Configuration | Train Loss↓ | Eval Loss↓ |
|---|---|---|
| w/o Stage 1&2 | 0.0729 | 0.0836 |
| w/o Stage 1 | 0.0659 | 0.0840 |
| w/o Stage 2 | 0.0712 | 0.0831 |
| Full | 0.0644 | 0.0830 |
Key Findings¶
- Removing Stage 1 (embedding initialization) leads to unstable optimization, preventing spatial tokens from settling into a stable embedding space.
- Removing Stage 2 (multimodal pre-training) deprives the model of high-level text–layout correspondences.
- The unified variant HouseMind-O achieves performance close to or on par with task-specific models trained independently on each task.
- In qualitative comparisons with GPT-5 and Gemini 2.5 Pro, HouseMind demonstrates superior spatial consistency and controllability.
Highlights & Insights¶
- Room-level discrete tokenization is the core innovation: it bridges continuous geometric layouts to discrete sequence modeling, enabling LLMs to perform interpretable spatial reasoning directly in token space.
- Unified three-task modeling: a single model simultaneously handles understanding, generation, and editing without modular composition.
- Extremely lightweight and deployable: built on Qwen3 with 0.6B parameters, inference runs on a single RTX 3090, requiring only 2–3 seconds per sample.
- Conditional room encoding: room encoding is conditioned on the outline, naturally capturing global context and adjacency relationships.
- First unified benchmark: a standardized evaluation protocol covering all three tasks is established.
Limitations & Future Work¶
- The editing functionality supports only simple addition and deletion operations, without handling complex topological transformations such as global reorganization.
- Functional components such as doors, windows, and furniture are not modeled, limiting applicability to detailed interior design.
- Generated results are not aligned with human design preferences and aesthetic constraints, leaving a gap relative to professional design standards.
- The dataset is based on RPLAN (predominantly Chinese residential buildings); generalizability to other architectural typologies and cultural styles remains to be verified.
Related Work & Insights¶
- GAN-based: Graph-constrained GANs and similar approaches improve realism but overfit to local geometry.
- Graph/GNN-based: Methods such as Graph2Plan model room connectivity, but discrete graph representations limit geometric fidelity.
- Diffusion-based: GSDiff, FloorPlan Diffusion, and related methods are stable but computationally expensive and limited to single tasks.
- LLM-driven: Tell2Design establishes a text-to-floor-plan benchmark; ChatHouseDiffusion and FloorPlanLLaMA introduce language control but remain modular architectures.
- Positioning of this work: HouseMind is the first unified multi-task multimodal framework to jointly learn geometric, semantic, and topological representations.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The combination of room-level VQ-VAE tokenization with a unified three-task LLM framework is novel, recasting spatial design as token sequence modeling.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation across three tasks with multiple baselines (including GPT-5/Gemini) and ablation studies validating the training strategy.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, formalized problem definitions, and rich figures and tables.
- Value: ⭐⭐⭐⭐ — Establishes a unified paradigm in architectural design AI; the lightweight deployable design has practical application potential.