HouseMind: Tokenization Allows MLLMs to Understand, Generate and Edit Architectural Floor Plans

Conference: CVPR 2026
arXiv: 2603.11640
Code: https://housemind.github.io/
Area: Multimodal VLM
Keywords: Architectural Floor Plans, VQ-VAE, Multimodal LLM, Spatial Reasoning, Unified Generation

TL;DR

This paper proposes HouseMind, a framework that uses a hierarchical VQ-VAE to discretize architectural floor plans into structured sequences of contour tokens and room instance tokens. With Qwen3-0.6B as the backbone and a three-stage training procedure (vocabulary integration, autoregressive pretraining, and instruction fine-tuning), HouseMind handles floor plan understanding, generation, and editing in a single model and substantially outperforms existing methods in geometric validity and controllability.

Background & Motivation

Background: AI-assisted architectural floor plan design has seen diverse approaches including GANs, graph neural networks, and diffusion models. Recent LLM/MLLM-based paradigms, such as Tell2Design and ChatHouseDiffusion, have explored language-driven layout generation.

Limitations of Prior Work: (1) Diffusion and autoregressive models treat layout generation as a purely visual process, lacking explicit room-instance-level reasoning; (2) large models largely operate as black boxes with poor spatial controllability; (3) understanding, generation, and editing cannot be unified; (4) high computational overhead makes local deployment impractical.

Key Challenge: Floor plan design requires joint processing of geometric and semantic information. Existing methods either excel at geometry but lack semantic reasoning, or possess language capabilities but sacrifice spatial precision.

Goal: To build an efficient, locally deployable unified framework that simultaneously supports spatial understanding, conditional generation, and controllable editing of floor plans.

Key Insight: The breakthrough lies at the representation level: a VQ-VAE discretizes floor plans into structured token sequences, so an LLM can process spatial layouts the way it processes language.

Core Idea: A hierarchical VQ-VAE represents floor plans as discrete sequences of contour tokens and room instance tokens, enabling an MLLM to autoregressively unify all three design tasks.

Method

Overall Architecture

A floor plan is decomposed into a contour and \(N\) rooms. Two independent VQ-VAEs encode them into discrete token sequences, which together with room labels form a structured sequence \(Z = [\mathbf{z}_o, \ell_{r_1}, \mathbf{z}_{r_1}, \dots]\). This sequence is interleaved with text tokens and fed into Qwen3-0.6B, which is trained via a three-stage procedure for unified autoregressive modeling.
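The structured sequence described above can be sketched as follows. This is a minimal illustration, not the authors' code: the token spellings (`<c_…>`, `<r_…>`, `<room:…>`) and the helper names are hypothetical, and the codebook indices are assumed to have already been produced by the two VQ-VAEs.

```python
from typing import List, Tuple


def contour_token(code: int) -> str:
    """Render a contour codebook index as a token string (hypothetical spelling)."""
    return f"<c_{code}>"


def room_token(code: int) -> str:
    """Render a room codebook index as a token string (hypothetical spelling)."""
    return f"<r_{code}>"


def label_token(label: str) -> str:
    """Render a room label as a token string (hypothetical spelling)."""
    return f"<room:{label}>"


def build_sequence(contour_codes: List[int],
                   rooms: List[Tuple[str, List[int]]]) -> List[str]:
    """Assemble Z = [z_o, l_r1, z_r1, ..., l_rN, z_rN]."""
    seq = [contour_token(c) for c in contour_codes]
    for label, codes in rooms:
        seq.append(label_token(label))          # room label l_ri
        seq.extend(room_token(c) for c in codes)  # room instance tokens z_ri
    return seq


# Toy example: a 2-token contour and two rooms.
z = build_sequence([3, 17], [("kitchen", [5, 9]), ("bedroom", [2])])
print(" ".join(z))
```

Placing each room label directly before that room's tokens keeps every room instance addressable in the sequence, which is what later makes room-level editing a matter of rewriting a contiguous span.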

Key Designs

Hierarchical VQ-VAE (Contour + Conditional Room Discretization):
- Function: Encodes the global contour and each room instance separately into discrete tokens.
- Mechanism: The contour is quantized with a CNN encoder against codebook \(\mathcal{Z}_o\); each room is encoded conditionally using the room mask and contour as joint input: \(z_{i,j}^{(r)} = \arg\min_k \|E_r(x_{r_i}, x_o)_j - e_k^{(r)}\|_2\).
- Design Motivation: Conditional encoding allows room tokens to capture both geometric shape and spatial position relative to the contour simultaneously.
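The quantization rule in the mechanism above is a nearest-neighbor lookup in the codebook. The sketch below shows only that lookup in plain Python; it assumes the feature vectors have already been produced by the room encoder \(E_r(x_{r_i}, x_o)\), and the toy codebook and features are made up for illustration. Minimizing the squared distance selects the same index as minimizing the \(L_2\) norm.

```python
from typing import List, Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: List[Sequence[float]],
             codebook: List[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    indices = []
    for f in features:
        best = min(range(len(codebook)), key=lambda k: squared_l2(f, codebook[k]))
        indices.append(best)
    return indices


# Toy 2-D codebook with three entries, and three encoder outputs (illustrative values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
features = [(0.9, 0.1), (0.1, 0.8), (0.2, 0.1)]
tokens = quantize(features, codebook)
print(tokens)  # [1, 2, 0]
```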
Three-Stage Multimodal Alignment Training:
- Function: Progressively establishes alignment between spatial tokens and language tokens.
- Mechanism: Stage 1 integrates VQ-VAE codebook embeddings into the LLM vocabulary; Stage 2 performs autoregressive pretraining on large-scale paired data; Stage 3 applies supervised fine-tuning (SFT) on instruction data covering all three tasks.
- Design Motivation: Progressive alignment avoids optimization instability that arises from directly training on complex tasks.
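One way to picture Stage 1 is as an id-mapping step: each codebook index receives a new token id appended after the existing text vocabulary. The sketch below covers only that bookkeeping. The vocabulary and codebook sizes are placeholders, the function names are hypothetical, and how the new embedding rows are initialized is not specified in this summary, so it is left out.

```python
from typing import Dict


def build_token_ranges(text_vocab_size: int,
                       contour_codes: int,
                       room_codes: int) -> Dict[str, range]:
    """Assign contiguous id ranges to spatial tokens, appended after the text vocabulary."""
    contour_start = text_vocab_size
    room_start = contour_start + contour_codes
    return {
        "contour": range(contour_start, room_start),
        "room": range(room_start, room_start + room_codes),
    }


def to_llm_id(kind: str, code: int, ranges: Dict[str, range]) -> int:
    """Translate a codebook index into its id in the extended vocabulary."""
    ids = ranges[kind]
    if not 0 <= code < len(ids):
        raise ValueError(f"{kind} code {code} is outside the codebook")
    return ids[code]


# Placeholder sizes, chosen only to keep the arithmetic easy to follow.
ranges = build_token_ranges(text_vocab_size=1000, contour_codes=4, room_codes=8)
print(to_llm_id("contour", 0, ranges))  # 1000
print(to_llm_id("room", 3, ranges))     # 1007
```

Keeping contour and room codes in disjoint id ranges means the model can never confuse a contour token with a room token that happens to share the same codebook index.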
Unified Sequence Modeling (Understanding / Generation / Editing):
- Function: Unifies all three tasks as conditional autoregressive prediction.
- Mechanism: Generation follows \(p(Z|\mathbf{z}_o, s) = \prod_t p(Z_t|Z_{<t}, \mathbf{z}_o, s)\); understanding outputs textual descriptions; editing takes the original sequence plus an instruction and outputs the modified sequence.
- Design Motivation: A unified format enables the same model to share knowledge across different tasks.
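The three tasks differ only in what is placed in the conditioning context and what the model is asked to predict. A minimal sketch of that framing is shown below; the function names, the token spellings, and the choice to concatenate context and target into one training sequence are illustrative assumptions rather than the paper's exact format.

```python
from typing import List, Tuple

# (conditioning tokens, target tokens)
Example = Tuple[List[str], List[str]]


def generation_example(contour: List[str], instruction: List[str],
                       layout: List[str]) -> Example:
    """Generation: contour + text instruction -> room layout tokens."""
    return contour + instruction, layout


def understanding_example(layout: List[str], question: List[str],
                          answer: List[str]) -> Example:
    """Understanding: layout tokens + question -> textual description."""
    return layout + question, answer


def editing_example(layout: List[str], instruction: List[str],
                    edited_layout: List[str]) -> Example:
    """Editing: original layout + instruction -> modified layout tokens."""
    return layout + instruction, edited_layout


def to_training_sequence(example: Example) -> Tuple[List[str], int]:
    """Concatenate context and target; also return the index where the target starts."""
    context, target = example
    return context + target, len(context)


# Toy editing sample: delete the only room from a one-room plan.
tokens, target_start = to_training_sequence(
    editing_example(["<c_1>", "<room:kitchen>", "<r_4>"],
                    ["remove", "the", "kitchen"],
                    ["<c_1>"])
)
print(tokens, target_start)
```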

Loss & Training

The VQ-VAE is trained with reconstruction loss and codebook loss. The LLM component uses cross-entropy autoregressive loss. The model is built on Qwen3-0.6B and trained on the RPLAN dataset (78,738 samples), with inference on a single RTX 3090.
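For the LLM component, the cross-entropy autoregressive loss is the mean negative log-likelihood of each target token given its prefix. The sketch below computes it from raw per-step logits in plain Python; the function names are hypothetical and the logits stand in for whatever the model would output.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized score vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def autoregressive_nll(step_logits: List[List[float]], targets: List[int]) -> float:
    """Mean negative log-likelihood; step_logits[t] scores token t given tokens < t."""
    if not targets or len(step_logits) != len(targets):
        raise ValueError("need one non-empty logit vector per target token")
    total = -sum(log_softmax(logits)[y] for logits, y in zip(step_logits, targets))
    return total / len(targets)


# With uniform logits over 4 tokens, every step costs log(4).
loss = autoregressive_nll([[0.0] * 4, [0.0] * 4], [1, 3])
print(loss)
```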

Key Experimental Results

Main Results

Method	Micro IoU	FID↓	Node F1	Edge Overlap	Inference (s)
Tell2Design	0.390	30.5	0.808	0.197	~15
ChatHouseDiffusion	0.589	11.3	0.985	0.710	~30
HouseMind-G	0.709	1.91	0.994	0.880	~2
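This summary does not define Micro IoU. Assuming it follows the common micro-averaged convention (intersections and unions are summed over all room types before dividing once), it could be computed as in the sketch below. The grid-cell representation and the function name are illustrative assumptions.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]  # (row, col) of one occupied grid cell


def micro_iou(pred: Dict[str, Set[Cell]], target: Dict[str, Set[Cell]]) -> float:
    """Sum intersections and unions over all room types, then divide once."""
    inter = 0
    union = 0
    for room_type in pred.keys() | target.keys():
        p = pred.get(room_type, set())
        t = target.get(room_type, set())
        inter += len(p & t)
        union += len(p | t)
    return inter / union if union else 1.0


# Toy example: the kitchen overlaps on 1 of 3 cells, the bedroom matches exactly.
pred = {"kitchen": {(0, 0), (0, 1)}, "bedroom": {(1, 0)}}
target = {"kitchen": {(0, 1), (0, 2)}, "bedroom": {(1, 0)}}
print(micro_iou(pred, target))  # 0.5
```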

Comparison on the Understanding Task

Method	RMR	LocAcc	AreaDiff↓	AdjAcc	RelAcc
Qwen3-VL-8B	0.698	0.347	5.837	0.382	0.128
InternVL3.5-8B	0.847	0.546	12.234	0.469	0.157
HouseMind-U	0.998	0.969	0.549	0.990	0.808

Key Findings

FID drops from 11.3 (ChatHouseDiffusion) to 1.91, marking a substantial improvement in generation quality.
On the understanding task, HouseMind-U reaches a room localization accuracy (LocAcc) of 0.969, more than 40 percentage points above the best VLM baseline (InternVL3.5-8B, 0.546), with an area error (AreaDiff) of only 0.549 m².
On the editing task, ΔIoU reaches 0.608, far surpassing FLUX (0.053) and Qwen-Edit (0.088).
The unified model HouseMind-O achieves performance close to single-task variants across all tasks.
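The size of these gaps can be checked directly from the reported numbers. The short script below is only arithmetic on values quoted in this summary; the variable names are illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a lower-is-better metric shrinks."""
    return (before - after) / before


# FID: ChatHouseDiffusion 11.3 -> HouseMind-G 1.91
fid_reduction = relative_reduction(11.3, 1.91)   # about 0.831, i.e. ~83% lower

# LocAcc: HouseMind-U 0.969 vs. InternVL3.5-8B 0.546
loc_gap_points = (0.969 - 0.546) * 100           # about 42.3 percentage points

# Editing delta-IoU: HouseMind 0.608 vs. Qwen-Edit 0.088 and FLUX 0.053
ratio_vs_qwen_edit = 0.608 / 0.088               # about 6.9x
ratio_vs_flux = 0.608 / 0.053                    # about 11.5x

# Inference time: ~30 s (ChatHouseDiffusion) vs. ~2 s (HouseMind-G)
speedup = 30 / 2                                 # 15x

print(round(fid_reduction, 3), round(loc_gap_points, 1),
      round(ratio_vs_qwen_edit, 1), round(ratio_vs_flux, 1), speedup)
```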

Highlights & Insights

Room-level tokenization: Naturally mirrors the cognitive process of architectural design (global outline first → room layout second).
Extreme efficiency: A 0.6B-parameter model outperforms 7–8B VLMs, with inference of only ~2 seconds per sample.
Comparison with GPT-5/Gemini Pro: Domain-specific tokenization achieves superior precision over general-purpose large models.

Limitations & Future Work

Editing is limited to room addition and deletion; complex topological transformations are not supported.
Functional components such as doors, windows, and furniture are not modeled.
Evaluation is conducted solely on the RPLAN dataset.

Comparison with Related Work

vs. MaskPLAN: MaskPLAN encodes the layout holistically, whereas HouseMind encodes it hierarchically, room by room, which preserves structural information.
vs. FloorPlanLLaMA: FloorPlanLLaMA applies VQ-VAE encoding to the entire layout, which leads to boundary inconsistencies; HouseMind maintains geometric precision by encoding each room conditionally on the contour.

Rating

Novelty: ⭐⭐⭐⭐ — Hierarchical tokenization concept is novel; three-task unified design is elegant.
Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation across three tasks, though limited to RPLAN.
Writing Quality: ⭐⭐⭐⭐ — Clear structure.
Value: ⭐⭐⭐⭐ — Direct practical value for AI-assisted architectural design.