HouseTour: A Virtual Real Estate A(I)gent¶
Conference: ICCV 2025 · arXiv: 2510.18054 · Code: https://house-tour.github.io/
Area: 3D Vision / Vision-Language Models / Trajectory Generation
Keywords: Camera Trajectory Generation, Real Estate, Diffusion, VLM, 3D Gaussian Splatting
TL;DR¶
HouseTour jointly generates human-like 3D camera trajectories and real estate textual descriptions from a set of indoor images with known poses. The system couples a Residual Diffuser for diffusion-based trajectory planning with Qwen2-VL-3D, which integrates spatial features to produce 3D-grounded text summaries.
Background & Motivation¶
Property video tours are a critical tool in the U.S. real estate market (valued at $3.43 trillion). However, producing professional tour videos requires: (1) on-site filming by agents equipped with high-end photography equipment; and (2) carefully crafted spatial descriptions. This process is labor-intensive and costly.
Limitations of Prior Work:
- Vision-language models (VLMs) lack the geometric reasoning needed to understand 3D spatial layouts.
- Camera trajectories in existing 3D datasets are designed for reconstruction tasks (close-up surface scanning, jittery motion) and are ill-suited for showcasing overall spaces.
- Scene description datasets enumerate furniture and their spatial relations but lack professional descriptions of spatial layout, architectural features, materials, and ambiance.
Goal: To enable ordinary users to automatically generate professional-grade property tour videos and textual descriptions by simply uploading a set of smartphone photos, without requiring specialized equipment.
Method¶
Overall Architecture¶
Given a sparse ordered set of camera poses \(\mathcal{C}=[c_1,...,c_{N_c}]\) and corresponding RGB images \(\mathcal{I}\):
1. Residual Diffuser: generates a smooth, human-like tour trajectory \(\tau\) of \(N > N_c\) frames.
2. Qwen2-VL-3D: produces a real estate textual summary \(\Sigma\) from trajectory spatial features and visual tokens.
3. 3DGS Rendering: renders the final tour video along the generated trajectory.
Residual Diffuser — Diffusion-Based Trajectory Planning¶
Core Idea: Rather than directly learning absolute trajectories (which vary greatly across different property layouts), the model learns residuals relative to a spline interpolation.
Formulation: \(\tilde{p} = \mathcal{S} + \Delta p\), where \(\mathcal{S}\) is the spline interpolation between known poses and \(\Delta p\) is the predicted residual. At time steps corresponding to known poses, the residual is a zero vector.
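The composition above can be written as a minimal numpy sketch (function names and the linear spline are illustrative, not the paper's implementation; the point is that residuals are zeroed at known-pose frames so those poses are reproduced exactly):

```python
import numpy as np

def compose_trajectory(spline, residual, known_idx):
    """Compose the final trajectory p~ = S + Δp.

    spline:    (N, 3) dense interpolation between known poses
    residual:  (N, 3) residuals predicted by the diffusion model
    known_idx: frames coinciding with input poses; their residual is
               forced to zero so the known poses are hit exactly.
    """
    residual = residual.copy()
    residual[known_idx] = 0.0          # zero residual at known poses
    return spline + residual

# Toy example: 5-frame trajectory with known poses at frames 0 and 4.
spline = np.linspace([0.0, 0.0, 0.0], [4.0, 0.0, 0.0], 5)
residual = np.full((5, 3), 0.5)
traj = compose_trajectory(spline, residual, known_idx=[0, 4])
```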
Reverse Diffusion Process: at each denoising step, residuals for frames coinciding with known poses are clamped to zero, while all other frames follow the learned Gaussian transition:

$$
\Delta p_{t-1}^i \sim
\begin{cases}
\delta(\vec{0}) & \text{if } i \in t_\tau \\
p_\theta(\Delta p_{t-1}^i \mid \Delta p_t^i, \mathcal{S}) = \mathcal{N}(\Delta p_{t-1}^i;\, \mu_\theta, \Sigma_\theta) & \text{otherwise}
\end{cases}
$$
Trajectory Loss:
- Translation: L2 norm over uniformly sampled dense spline points.
- Rotation: geodesic loss on the SO(3) manifold.

$$
\mathcal{L}_\theta = \mathbb{E}_{t,\tau,\epsilon}\left[\|\epsilon_{pos} - \epsilon_\theta(pos_t,t)\|^2 + d_{geo}(\epsilon_{rot}, \epsilon_\theta(rot_t,t))\right]
$$
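The geodesic term \(d_{geo}\) measures the angle of the relative rotation between two SO(3) matrices, recoverable from the trace. A minimal sketch (helper names are my own, not from the paper):

```python
import numpy as np

def geodesic_distance(R1, R2):
    """Angle of the relative rotation R1^T R2 on SO(3), via its trace:
    cos(theta) = (tr(R) - 1) / 2, clipped for numerical safety."""
    R = R1.T @ R2
    cos = (np.trace(R) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def rot_z(theta):
    """Rotation by theta radians about the z-axis (for the example)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Two z-rotations 0.5 rad apart have geodesic distance 0.5.
d = geodesic_distance(rot_z(0.3), rot_z(0.8))
```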
Key Design: Uniform sample points on spline segments between consecutive camera poses are evaluated efficiently using Horner's method, modeling the trajectory as a continuous function rather than a discrete point sequence.
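Horner's method evaluates each cubic segment with one multiply-add per coefficient instead of computing powers of \(t\) separately. A small sketch with hypothetical segment coefficients:

```python
import numpy as np

def horner(coeffs, t):
    """Evaluate a polynomial with Horner's method.
    coeffs are highest-degree first: c0*t^3 + c1*t^2 + c2*t + c3."""
    result = np.zeros_like(np.asarray(t, dtype=float))
    for c in coeffs:
        result = result * t + c    # one fused multiply-add per degree
    return result

# Sample one cubic spline segment uniformly on [0, 1].
coeffs = [2.0, -1.0, 0.5, 3.0]     # hypothetical, not from the paper
t = np.linspace(0.0, 1.0, 8)
samples = horner(coeffs, t)
```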
Qwen2-VL-3D — Spatially Aware Text Generation¶
Two-Stage Training:
1. LoRA Fine-tuning: Qwen2-VL is fine-tuned on 96-frame multi-image inputs to learn the linguistic style and architectural terminology of real estate descriptions.
2. Spatial Feature Integration:
- Special tokens `<|traj_start|>`, `<|traj_pad|>`, and `<|traj_end|>` are introduced.
- The denoised poses \(p_0^i\) and bottleneck features \(f_0^i\) from the Residual Diffuser are concatenated and projected into the VLM's language embedding space via a linear layer.
- Spatial information for each frame is encoded using a single token.
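A numpy sketch of the per-frame projection described above (all dimensions and names are illustrative assumptions; in the paper the linear layer is learned jointly with the VLM):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: pose (translation + quaternion), diffuser
# bottleneck feature, and VLM language-embedding size.
POSE_DIM, FEAT_DIM, EMBED_DIM = 7, 256, 1536
W = rng.standard_normal((POSE_DIM + FEAT_DIM, EMBED_DIM)) * 0.02
b = np.zeros(EMBED_DIM)

def project_frames(poses, feats):
    """One spatial token per frame: concatenate the denoised pose p0
    with the bottleneck feature f0, then apply a linear projection
    into the language embedding space."""
    x = np.concatenate([poses, feats], axis=-1)   # (N, POSE+FEAT)
    return x @ W + b                              # (N, EMBED_DIM)

tokens = project_frames(rng.standard_normal((96, POSE_DIM)),
                        rng.standard_normal((96, FEAT_DIM)))
# These token embeddings would sit between the <|traj_start|> and
# <|traj_end|> special tokens in the VLM input sequence.
```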
HouseTour Dataset¶
- 1,639 tour videos covering diverse properties ranging from apartments to multi-story villas.
- 1,298 videos with textual descriptions (half with timestamps); 878 scenes with 3D reconstructions.
- Provides scene-level human trajectories, dense point clouds, and professional real estate descriptions.
Key Experimental Results¶
End-to-End Performance¶
| Method | R@75cm ↑ | Rot. Score ↑ | BT ↑ | SLS ↑ |
|---|---|---|---|---|
| Baseline (Catmull-Rom + Qwen2-VL SFT) | 57.1 | 96.8 | 71.4 | 71.7 |
| HouseTour | 60.2 | 97.1 | 79.5 | 76.0 |
HouseTour outperforms the baseline on all metrics, with particularly notable gains in text generation (BT +8.1).
Ablation Study on Trajectory Generation¶
| Method | R@50cm ↑ | R@1m ↑ | Euclidean ↓ | DTW ↓ | Geodesic ↓ |
|---|---|---|---|---|---|
| Linear Interp. | 41.2% | 59.8% | 145.8 | 192.1 | 0.20 |
| Catmull-Rom | 45.9% | 64.7% | 106.2 | 146.3 | 0.10 |
| Residual Diffuser | 46.2% | 69.4% | 73.9 | 128.8 | 0.09 |
Key Findings:
- The Residual Diffuser substantially outperforms interpolation baselines at R@1m (69.4% vs. 64.7%), indicating fewer large errors.
- Euclidean distance is reduced by over 30% relative to Catmull-Rom, demonstrating that residual learning significantly outperforms direct trajectory learning.
- The advantage of the Residual Diffuser is most pronounced in high-uncertainty regions, i.e., locations far from known poses.
Highlights & Insights¶
- Residual Diffusion Modeling: Reformulating trajectory generation from learning absolute positions to learning residuals relative to a spline elegantly addresses the challenge of varying layouts across scenes.
- Novel Joint Evaluation Metric SLS: The paper introduces a Spatial-Language Score (harmonic mean of translation recall, rotation score, and Bradley-Terry score) as the first joint spatial-language evaluation metric.
- Tri-modal VLM: Language, vision, and 3D localization are integrated into a single VLM, enabling spatially grounded text generation.
- Strong Practical Value: The end-to-end system is directly applicable to the real estate and tourism industries.
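As a sanity check of the SLS definition, the harmonic mean of the three component scores from the end-to-end table reproduces the reported values (60.2, 97.1, 79.5 → 76.0 for HouseTour; 57.1, 96.8, 71.4 → 71.7 for the baseline):

```python
def harmonic_mean(scores):
    """Harmonic mean of component scores: a low value on any single
    axis drags the joint score down, so all three must be high."""
    return len(scores) / sum(1.0 / s for s in scores)

# Component scores from the end-to-end results table
# (translation recall, rotation score, Bradley-Terry score).
sls_housetour = harmonic_mean([60.2, 97.1, 79.5])  # ~76.0
sls_baseline = harmonic_mean([57.1, 96.8, 71.4])   # ~71.7
```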
Limitations & Future Work¶
- Performance depends on the quality and coverage of smartphone-captured images.
- 3DGS rendering quality is limited under sparse-view conditions (explicitly noted by the authors as out of scope).
- The dataset covers only real estate scenes and has not been extended to other tour scenarios (e.g., museums, tourist attractions).
- Text generation still relies on LoRA fine-tuning; generalization to new domains requires further investigation.
Related Work & Insights¶
- Long Video Understanding: Methods such as TimeChat handle long sequences but lack domain knowledge of architectural environments.
- Trajectory Planning: The Diffuser family is applied to robot decision-making in fixed environments, not in varying layouts.
- 3D Vision-Language: ScanRefer and DenseCap describe object relations but overlook spatial layout and architectural features.
Rating¶
- Novelty: ★★★★☆ — The combination of residual diffusion trajectory planning and a tri-modal VLM is novel.
- Practical Value: ★★★★★ — Directly targets a trillion-dollar real estate market.
- Experimental Thoroughness: ★★★★☆ — Introduces a new dataset and evaluation metrics with thorough ablations.
- Writing Quality: ★★★★☆ — Well-structured with a clearly defined problem formulation.