Unified Vector Floorplan Generation via Markup Representation

Conference: CVPR 2026
arXiv: 2604.04859
Code: https://mapooon.github.io/FMLPage
Area: Image Generation
Keywords: floorplan generation, markup language, autoregressive sequence model, constrained decoding, vector representation

TL;DR

This paper proposes the Floorplan Markup Language (FML), which encodes floorplan elements such as rooms and doors into structured token sequences. A LLaMA-style Transformer model (FMLM) trained on this representation unifies unconditional, boundary-conditioned, graph-conditioned, and completion tasks within a single framework. In the graph-conditioned setting its FID is more than 80% lower than HouseDiffusion's (3.41 vs. 29.31).

Background & Motivation

Background: Automated floorplan generation is a core requirement in architectural design and real estate. Existing methods are task-specific — Graph2Plan handles boundary conditions, while HouseGAN++/HouseDiffusion address adjacency graph conditions — each requiring a dedicated model.
Limitations of Prior Work: (1) Different generation tasks rely on different architectures, precluding unification; (2) diffusion-based methods (e.g., GSDiff) produce raster images, and post-processing conversion to vector format introduces errors; (3) GAN-based methods suffer from mode collapse and limited generation diversity.
Key Challenge: Floorplans are inherently structured vector data (room polygons + door positions + connectivity), yet existing methods either operate in pixel space (losing structural information) or require task-specific graph neural networks.
Goal: Design a unified representation that reformulates all floorplan generation tasks as a single sequence prediction problem.
Key Insight: Markup languages such as HTML and XML show that token sequences governed by syntactic rules can represent structured information naturally, and such sequences are directly amenable to autoregressive Transformer modeling.
Core Idea: Define an FML grammar that encodes a floorplan as a token sequence built from tags, coordinates, room indices, and room types, and apply constrained decoding to guarantee the syntactic validity of generated outputs.

Method

Overall Architecture

Optional input conditions (boundary point sequence / adjacency graph / partial floorplan) → encoded as FML condition segment → FMLM autoregressively generates FML sequence → constrained decoding enforces syntactic validity → FML parsed into vector floorplan (room polygons and door positions).
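The last step of this pipeline, parsing a generated sequence back into vector geometry, can be illustrated with a minimal sketch. The token spellings (`"<room>"`, `"type:bedroom"`, integer coordinate tokens) and the function name `parse_fml` are hypothetical choices for illustration, not the paper's actual vocabulary or code; only the decoding rule for \(z = x + y \times W\) with \(W = 256\) comes from the text.

```python
W = 256  # grid width used by the 1D coordinate encoding z = x + y * W


def decode(z):
    """Invert z = x + y * W into an (x, y) grid point."""
    return z % W, z // W


def parse_fml(tokens):
    """Parse a toy FML-like token list into room polygons and doors.

    Token spellings are illustrative assumptions, not the paper's vocabulary.
    Unknown tokens (e.g. wrapper tags) are ignored.
    """
    rooms, doors = [], []
    current, kind = None, None
    for tok in tokens:
        if tok in ("<room>", "<door>"):
            kind = tok
            current = {"type": None, "points": []}
        elif tok in ("</room>", "</door>") and current is not None:
            if kind == "<room>":
                rooms.append(current)
            else:
                doors.append(current["points"])
            current, kind = None, None
        elif isinstance(tok, int) and current is not None:
            current["points"].append(decode(tok))
        elif isinstance(tok, str) and tok.startswith("type:") and current is not None:
            current["type"] = tok[len("type:"):]
    return rooms, doors


# One square bedroom and one door segment on its right wall.
tokens = [
    "<room>", "type:bedroom", 0, 10, 10 + 10 * W, 10 * W, "</room>",
    "<door>", 10 + 2 * W, 10 + 4 * W, "</door>",
]
rooms, doors = parse_fml(tokens)
print(rooms[0]["type"], rooms[0]["points"])
print(doors)
```

The sketch does no validation; in the described system, validity is enforced earlier, during decoding, so the parser can stay simple.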

Key Designs

Floorplan Markup Language (FML)
Function: Encodes all floorplan elements into a linear token sequence.
Mechanism: Defines four token types — tags (e.g., <room>, <door>), coordinates (1D encoding \(z = x + y \times W\), \(W=256\)), room indices, and room types. The grammar follows <sequence> → <condition> → <floorplan> → rooms → doors → front_door → </sequence>. Rooms are ordered by descending index.
Design Motivation: 1D coordinate encoding avoids the high-dimensional sparsity of 2D positional representations. The ablation supports descending room order: it lowers FID from 94.57 (ascending order) to 25.50. Tag tokens provide structural supervision signals.
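The 1D coordinate encoding described above can be sketched in a few lines. The formula and \(W = 256\) are from the text; the grid height `H = 256` is assumed from the 256×256 grid mentioned elsewhere in this summary, and the function names are hypothetical.

```python
W = 256  # grid width, as stated in the text
H = 256  # grid height; assumed equal to W (256 x 256 grid)

# Number of distinct coordinate tokens implied by the grid.
COORD_VOCAB_SIZE = W * H


def encode_coord(x, y):
    """Map a grid point (x, y) to a single coordinate token z = x + y * W."""
    if not (0 <= x < W and 0 <= y < H):
        raise ValueError(f"point ({x}, {y}) is outside the {W}x{H} grid")
    return x + y * W


def decode_coord(z):
    """Recover (x, y) from a coordinate token z."""
    if not 0 <= z < COORD_VOCAB_SIZE:
        raise ValueError(f"token {z} is not a valid coordinate")
    y, x = divmod(z, W)
    return x, y


z = encode_coord(3, 2)
print(z, decode_coord(z), COORD_VOCAB_SIZE)
```

Each vertex costs one token instead of two, at the price of a coordinate vocabulary that grows with the product of width and height.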
FMLM Model Architecture
Function: Autoregressively generates FML token sequences.
Mechanism: A LLaMA-3-style Transformer with 24 layers, 512-dimensional hidden states, and 32 attention heads. Coordinate tokens use sinusoidal positional encoding with a learnable projection; tag/index/type tokens use learnable embeddings. A unified output head \(W \in \mathbb{R}^{(C_{tag}+C_{coord}+C_{index}+C_{type}) \times C}\) is shared across all token types.
Design Motivation: The unified output head allows the model to learn when to generate each token type automatically, eliminating the need for manually switching decoding modes.
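One way to picture the unified output head is as a single vocabulary in which the four token groups occupy consecutive index ranges, so one softmax covers all of them. The sketch below shows only that index bookkeeping. The group sizes for tags, indices, and types are made-up placeholders; only the coordinate count (256 × 256) follows from the text, and the helper names are hypothetical.

```python
# Token groups in the order they are concatenated in the shared output head.
# Only the coordinate count follows from the text (256 x 256 grid);
# the other sizes are placeholder assumptions.
GROUP_SIZES = [("tag", 12), ("coord", 256 * 256), ("index", 16), ("type", 10)]

# Starting offset of each group inside the shared vocabulary.
OFFSETS = {}
_start = 0
for _name, _size in GROUP_SIZES:
    OFFSETS[_name] = _start
    _start += _size
VOCAB_SIZE = _start  # C_tag + C_coord + C_index + C_type


def to_global(group, local_id):
    """Map a (group, local id) pair to its row in the shared output head."""
    size = dict(GROUP_SIZES)[group]
    if not 0 <= local_id < size:
        raise ValueError(f"{local_id} is out of range for group {group!r}")
    return OFFSETS[group] + local_id


def to_local(global_id):
    """Map a shared-vocabulary id back to its (group, local id) pair."""
    if not 0 <= global_id < VOCAB_SIZE:
        raise ValueError(f"{global_id} is outside the shared vocabulary")
    for name, size in GROUP_SIZES:
        if global_id < OFFSETS[name] + size:
            return name, global_id - OFFSETS[name]


print(VOCAB_SIZE, to_global("coord", 515), to_local(527))
```

With this layout the model never needs an explicit "mode switch": predicting a token id already determines whether the next token is a tag, coordinate, index, or type.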
Constrained Decoding
Function: Guarantees syntactic validity of generated FML sequences at inference time.
Mechanism: Hard constraints include: doors must have exactly 2 vertices; room polygons must not overlap with existing rooms; doors must lie on room boundaries. These rules are enforced by masking invalid token probabilities during decoding.
Design Motivation: Autoregressive models may generate syntactically invalid sequences (e.g., a door with 3 vertices). Constrained decoding guarantees 100% valid outputs at zero additional computational cost.
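The masking mechanism can be sketched for the simplest rule, "a door has exactly 2 vertices". This is a toy illustration under stated assumptions: logits are a plain dict, coordinate tokens are integers, tags are strings, and the helper names are hypothetical. The geometric rules (no room overlap, doors on room boundaries) would need additional checks that are not shown here.

```python
import math


def allowed_in_door(door_prefix, vocab):
    """Return the set of tokens allowed next inside an open <door> element.

    door_prefix holds the tokens emitted since the opening <door> tag.
    """
    n_coords = sum(isinstance(t, int) for t in door_prefix)
    if n_coords < 2:
        # Still collecting the two endpoints: only coordinates are valid.
        return {t for t in vocab if isinstance(t, int)}
    # Two endpoints emitted: the only valid continuation is the closing tag.
    return {"</door>"}


def mask_logits(logits, allowed):
    """Set the logit of every disallowed token to -inf."""
    return {t: (v if t in allowed else -math.inf) for t, v in logits.items()}


def greedy_pick(logits):
    """Pick the highest-scoring token."""
    return max(logits, key=logits.get)


# Toy scores where the unconstrained model would emit a third vertex.
logits = {"<door>": 0.1, "</door>": 0.2, 5: 1.0, 9: 0.5, 12: 2.0}
prefix = [5, 9]  # two door endpoints already generated

masked = mask_logits(logits, allowed_in_door(prefix, logits))
print(greedy_pick(logits), greedy_pick(masked))
```

Because invalid tokens receive zero probability after the softmax, sampling strategies other than greedy selection are constrained in the same way.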

Loss & Training

Standard cross-entropy loss is computed over non-structural tag tokens in the FML sequence. Room permutation augmentation (random shuffling of room order) is applied during training to encourage the model to learn permutation equivariance — ablation shows this reduces FID from 24.36 to 14.17.
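Room permutation augmentation amounts to shuffling the room list before serialization. The sketch below assumes rooms are stored as `(type, vertices)` tuples and that each room is shuffled as a whole unit; the function name is hypothetical, and how the shuffle interacts with room indices and door references in the actual training pipeline is not described in this summary.

```python
import random


def permute_rooms(rooms, rng):
    """Return a shuffled copy of the room list, leaving each room intact.

    rng is a random.Random instance so that augmentation is reproducible.
    """
    shuffled = list(rooms)
    rng.shuffle(shuffled)
    return shuffled


rooms = [
    ("living", ((0, 0), (20, 0), (20, 15), (0, 15))),
    ("bedroom", ((20, 0), (30, 0), (30, 10), (20, 10))),
    ("bathroom", ((20, 10), (30, 10), (30, 15), (20, 15))),
]

augmented = permute_rooms(rooms, random.Random(0))
print([room_type for room_type, _ in augmented])
```

Drawing a fresh permutation each time a sample is loaded shows the model the same floorplan under many room orders.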

Key Experimental Results

Main Results

Task	Method	FID↓	GED↓	IoU↑
Unconditional	GSDiff	15.02	-	-
Unconditional	FMLM	7.22	-	-
Boundary-conditioned	Graph2Plan	34.20	-	95.87%
Boundary-conditioned	FMLM	6.51	-	97.86%
Graph-conditioned (ALL)	HouseGAN++	48.44	2.57	-
Graph-conditioned (ALL)	HouseDiffusion	29.31	1.55	-
Graph-conditioned (ALL)	FMLM	3.41	1.21	-
Boundary+Graph (ALL)	Graph2Plan	22.87	3.43	92.96%
Boundary+Graph (ALL)	FMLM	14.17	1.24	97.59%

Ablation Study

Configuration	FID↓	GED↓	IoU↑	Note
Full + permutation aug.	14.17	1.24	97.59%	Full model
w/o permutation aug.	24.36	2.35	95.82%	FID +72%
Ascending index order	94.57	-	-	Very poor FID
Descending index order	25.50	-	-	Descending far superior

Key Findings

Room permutation augmentation is critical — removing it increases FID from 14.17 to 24.36 (+72%), indicating that learning permutation equivariance is essential for generalization.
FMLM substantially outperforms GAN-based and diffusion-based methods across all conditioning settings.
Constrained decoding guarantees 100% syntactically valid outputs, whereas post-processing pipelines in methods such as HouseDiffusion cannot provide this guarantee.
Performance degrades slightly for 8-room layouts (FID increases from 3.41 to 4.64) due to limited training samples.
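The percentages quoted in this summary can be checked directly from the reported FID values. The snippet below is only arithmetic on numbers taken from the tables and findings above; the function name is a hypothetical helper.

```python
def relative_change(baseline, new):
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


# Removing permutation augmentation: FID 14.17 -> 24.36.
no_aug = relative_change(14.17, 24.36)

# Graph-conditioned (ALL): HouseDiffusion 29.31 -> FMLM 3.41.
vs_housediffusion = relative_change(29.31, 3.41)

# Graph-conditioned FID on 8-room layouts: 3.41 -> 4.64.
eight_rooms = relative_change(3.41, 4.64)

print(f"{no_aug:+.1f}% {vs_housediffusion:+.1f}% {eight_rooms:+.1f}%")
```

The first value is roughly +72%, matching the ablation note, and the second is a reduction of about 88%, consistent with the "over 80% lower FID" claim in the TL;DR.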

Highlights & Insights

Elegance of the markup representation: Reformulating structured generation as sequence prediction via grammar rules is a clean and transferable paradigm, applicable to other structured generation tasks such as circuit layout or molecular structure generation.
Zero-overhead hard constraints: Masking invalid tokens at inference time enforces hard syntactic constraints without additional computation, which is more reliable than post-hoc correction.
Multi-task unification: A single model handles unconditional, boundary-conditioned, graph-conditioned, and completion tasks simultaneously, eliminating the redundancy of maintaining task-specific architectures.

Limitations & Future Work

Only single-story floorplans are supported; multi-story buildings would require extending the FML grammar.
Performance degrades for 8-room layouts (graph-conditioned FID rises from 3.41 to 4.64), which is attributed to limited training samples.
Coordinate quantization to a 256×256 grid may sacrifice precision; higher resolutions would increase vocabulary size.
Integrating with LLMs (natural language specification → floorplan generation) is a promising future direction.

Comparison with Prior Work

vs. HouseDiffusion: Diffusion methods model continuous spaces and require vectorization post-processing, whereas FMLM directly generates vector results in discrete token space with greater precision.
vs. Graph2Plan: Graph2Plan requires a GNN encoder for adjacency graph conditioning, which adds architectural complexity. FMLM serializes adjacency relationships directly into the FML condition segment and needs no additional encoder.
vs. GSDiff: The raster-based diffusion approach achieves FID 15.02, compared to FMLM's 7.22; the gap primarily stems from the structural prior embedded in the vector representation.

Rating

Novelty: ⭐⭐⭐⭐ The markup language representation is a novel perspective, though autoregressive generation itself is not new.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive comparisons across four conditioning settings, with ablations and multi-room-count analysis.
Writing Quality: ⭐⭐⭐⭐ Clear and fluent, with a rigorous definition of FML grammar.
Value: ⭐⭐⭐⭐ Directly applicable to architectural design, with a transferable markup-based generation paradigm.