RoomPainter: View-Integrated Diffusion for Consistent Indoor Scene Texturing¶

Conference: CVPR 2025
arXiv: 2412.16778
Code: None
Area: Image Generation
Keywords: Indoor scene texturing, Diffusion models, Multi-view consistency, Zero-shot texture synthesis, Occlusion inpainting

TL;DR¶

RoomPainter adapts a 2D diffusion model into a 3D-consistent indoor scene texturing tool through zero-shot Multi-View Integrated Sampling (MVIS) and Correlated View Attention, using a two-stage strategy to ensure both global and local consistency.

Background & Motivation¶

Indoor scene texture synthesis is critical for applications in VR, digital media, and creative arts. Existing methods face three major challenges: (1) inpainting-based methods (e.g., Text2Tex) generate textures view by view, leading to severe cross-view inconsistencies and visible seams; (2) optimization-based methods (e.g., SceneTex) optimize a global texture with a score-distillation objective (VSD in the case of SceneTex), which incurs huge computational overhead and unstable training; (3) objects occlude parts of the scene, leaving some regions without texture.

Core Problem: How can globally consistent indoor scene textures be generated efficiently while also resolving occlusions? RoomPainter addresses this with a two-stage strategy: (1) a global stage that uses MVIS to generate globally consistent textures, and (2) a refinement stage that uses Multi-View Integrated Repaint Sampling (MVRS) to inpaint occluded regions and refine the texture of each instance.

Method¶

Overall Architecture¶

Relying on SDXL + ControlNet (conditioned on depth), the framework operates in two stages. In the first stage, \(N\) cameras surrounding the room center generate textures simultaneously, achieving global consistency by dynamically merging UV texture maps. In the second stage, MVRS is applied to each instance individually to repair occluded regions and refine textures. The key zero-shot techniques are MVIS and Correlated View Attention.

Key Design 1: Multi-View Integrated Sampling (MVIS)¶

Function: Zero-shot adaptation of 2D diffusion models to generate multi-view consistent texture maps.

Mechanism: Given \(N\) cameras \(\{C^n\}\), at each time step \(t\) of diffusion sampling: (1) run denoising for all views to obtain the estimate \(x_{0,t}^n\); (2) decode to image space to get \(\mathcal{I}_t^n\); (3) back-project to UV space to obtain per-view textures \(\mathcal{T}_t^n\); (4) fuse them using a dynamic merging strategy:

\[\mathcal{T}_t = \frac{\sum_{n=1}^{N}\left(W_n^{exp(t)} \cdot \mathcal{T}_t^n\right)}{\sum_{n=1}^{N} W_n^{exp(t)} + \gamma}\]

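where \(W_n\) is the per-view weight, based on the inverse of normal-view cosine similarity; the exponent \(exp(t)\) increases linearly as \(t\) decreases, which makes the fusion sharper in later steps; and \(\gamma\) is a constant in the denominator (presumably a small value that guards against division by zero where no view covers a texel); (5) re-render the images of each view from \(\mathcal{T}_t\) and encode them back into the latent space to replace the original \(x_{0,t}^n\) for guiding the next sampling step.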
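The dynamic merging in step (4) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the exponent range (`exp_start`, `exp_end`), and the value of `gamma` are assumptions, the weights are taken as given inputs rather than derived from normals, and the sketch normalizes by the same exponentiated weights so that the result stays a weighted average.

```python
import numpy as np

def fusion_exponent(t, num_steps, exp_start=1.0, exp_end=6.0):
    """Exponent that grows linearly as t goes from num_steps - 1 down to 0.

    The range [exp_start, exp_end] is a hypothetical choice for illustration.
    """
    progress = 1.0 - t / max(num_steps - 1, 1)
    return exp_start + (exp_end - exp_start) * progress

def merge_textures(view_textures, view_weights, exponent, gamma=1e-8):
    """Fuse per-view UV textures into one texture map.

    view_textures: (N, H, W, C) array, the back-projected textures T_t^n.
    view_weights:  (N, H, W) array of non-negative weights W_n
                   (0 where a view does not see the texel).
    """
    w = np.power(view_weights, exponent)[..., None]  # (N, H, W, 1)
    numerator = (w * view_textures).sum(axis=0)      # weighted sum over views
    denominator = w.sum(axis=0) + gamma              # gamma avoids 0 / 0
    return numerator / denominator
```

With a larger exponent, the view with the highest weight dominates each texel, which is one way to read "sharper" fusion in the later sampling steps.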

Design Motivation: By projecting information from all views into a shared UV space and dynamically fusing them at each diffusion step, a loop of "consistent global textures guiding individual view sampling" is established. The angle-based weighting ensures that each patch utilizes the texture from the optimal view.
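The feedback loop described above can be outlined as a single sampling step. This is only a structural sketch: every callable passed in (`predict_x0`, `decode`, `back_project`, `merge`, `render`, `encode`) is a hypothetical placeholder for the diffusion model, VAE, and rasterizer, none of which are specified here.

```python
import numpy as np

def mvis_step(latents, t, predict_x0, decode, back_project, merge, render, encode):
    """One multi-view integrated sampling step (structural sketch only).

    latents:      list of N per-view latents at timestep t.
    predict_x0:   (latent, t, view_index) -> clean-latent estimate x_{0,t}^n.
    decode:       latent -> image.
    back_project: (image, view_index) -> (uv_texture, uv_weight).
    merge:        (textures, weights, t) -> fused UV texture.
    render:       (uv_texture, view_index) -> image seen from that view.
    encode:       image -> latent.
    """
    # (1) denoise every view to get its current clean estimate
    x0_estimates = [predict_x0(x, t, n) for n, x in enumerate(latents)]
    # (2) decode the estimates to image space
    images = [decode(x0) for x0 in x0_estimates]
    # (3) back-project each image into the shared UV space
    projected = [back_project(img, n) for n, img in enumerate(images)]
    textures = np.stack([tex for tex, _ in projected])
    weights = np.stack([w for _, w in projected])
    # (4) fuse all views into a single texture map
    fused = merge(textures, weights, t)
    # (5) re-render each view from the fused texture and re-encode it,
    #     replacing the per-view estimates with consistent ones
    consistent_x0 = [encode(render(fused, n)) for n in range(len(latents))]
    return fused, consistent_x0
```

The returned `consistent_x0` list would then take the place of the per-view estimates when the sampler computes the next latent.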

Key Design 2: Multi-View Integrated Repaint Sampling (MVRS)¶

Function: Inpainting occluded regions and refining instance-level textures.

Mechanism: After each instance is decoupled from the scene, a variant of MVIS is applied. The region already textured in the first stage is noised to the current timestep and then blended with the MVIS sampling result via a mask \(P\): painted regions keep their existing texture, while unpainted regions are generated by MVIS. In effect, this combines diffusion inpainting (repaint) with multi-view sampling.
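The mask-based blending can be sketched as below, assuming the standard DDPM forward process \(x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon\) is used to noise the known region. The function names are hypothetical, and the sketch does not commit to whether the blend happens in latent or UV space; it only assumes all arrays share one shape.

```python
import numpy as np

def noise_to_timestep(x0, alpha_bar_t, rng):
    """DDPM forward noising: bring clean content x0 to noise level alpha_bar_t."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def repaint_blend(known_x0, generated_xt, painted_mask, alpha_bar_t, rng):
    """Keep already-painted content, take the rest from the sampler.

    known_x0:     clean content from the first stage.
    generated_xt: current sample produced by the multi-view sampler.
    painted_mask: P, 1 where texture already exists, 0 where it is missing.
    """
    known_xt = noise_to_timestep(known_x0, alpha_bar_t, rng)
    return painted_mask * known_xt + (1.0 - painted_mask) * generated_xt
```

Noising the known region to the same level as the sample keeps the two parts statistically compatible at every step, so the unpainted region can be generated in harmony with what is already there.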

Design Motivation: Certain regions remain textureless during the first stage due to object occlusions. MVRS maintains style consistency with the existing textures when inpainting occlusions, and avoids the computational waste of global regeneration through instance-level operations.

Key Design 3: Correlated View Attention Mechanism¶

Function: Enhancing multi-view information sharing during the diffusion model sampling process.

Mechanism: The self-attention of the U-Net is modified. For each view \(n\), the Query \(Q_n\) attends to the concatenated Keys and Values of its correlated views (e.g., the adjacent left and right views):

\[\text{softmax}\left(\frac{Q_n \tilde{K}_n^T}{\sqrt{d}}\right) \tilde{V}_n\]

where \(\tilde{K}_n = [K_1, K_2, \ldots, K_R]^T\) is the concatenation of the Keys of the \(R\) correlated views, and \(\tilde{V}_n\) is the corresponding concatenation of their Values.
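A minimal NumPy sketch of attention over concatenated Keys and Values is given below. It is single-head and omits the learned projections; the function names and the ring-shaped neighbourhood (which also includes the view itself) are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def correlated_view_attention(q, k, v, correlated):
    """Attention where each view's queries see several views' keys/values.

    q, k, v:    (N, L, d) arrays for N views with L tokens each.
    correlated: list of N index lists; correlated[n] names the views
                whose keys/values view n attends to.
    """
    n_views, _, d = q.shape
    out = np.empty_like(q)
    for n in range(n_views):
        # concatenate along the token axis: (R * L, d)
        k_cat = np.concatenate([k[r] for r in correlated[n]], axis=0)
        v_cat = np.concatenate([v[r] for r in correlated[n]], axis=0)
        attn = softmax(q[n] @ k_cat.T / np.sqrt(d))  # (L, R * L)
        out[n] = attn @ v_cat                        # (L, d)
    return out

def ring_neighbours(n_views):
    """Hypothetical room-level choice: left neighbour, self, right neighbour."""
    return [[(n - 1) % n_views, n, (n + 1) % n_views] for n in range(n_views)]
```

Passing `[[n] for n in range(N)]` as `correlated` recovers ordinary per-view self-attention, while passing every index for every view corresponds to the all-views setting described for the instance level.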

Design Motivation: Cross-view information is injected during diffusion sampling without any training, which keeps texture style and content consistent among adjacent views. Adjacent views are used at the room level, while all views are used at the instance level.

Loss & Training¶

The method is training-free: all constraints are applied during zero-shot inference. It builds on the DDPM sampling process and guides generation through the texture map feedback mechanism.

Key Experimental Results¶

Main Results: Indoor Scene Texture Synthesis¶

Method	Generation Time (mins) ↓	CLIP Score ↑	Aesthetic Score ↑
Text2Tex-H	8.50	21.58	4.34
Text2Tex-C	70.75	21.93	4.85
SceneTex	2614.50	21.87	4.75
RoomPainter	46.00	23.47	5.03

Ablation Study¶

Configuration	CLIP Score ↑	Aesthetic Score ↑
Full Method	23.47	5.03
W/o Correlated View Attention	23.32	5.01
W/o MVIS	23.27	5.01
W/o MVRS	22.39	4.40

Key Findings¶

About 57x faster than SceneTex (46.00 mins vs 2614.50 mins) while achieving a CLIP Score 1.60 points higher (23.47 vs 21.87).
MVRS contributes the most to final quality: removing it drops the CLIP Score from 23.47 to 22.39 and the Aesthetic Score from 5.03 to 4.40, indicating that occlusion inpainting is crucial.
Text2Tex-H is fast but suffers from poor consistency (visible seams); Text2Tex-C offers acceptable quality but takes over 70 minutes for instance-by-instance generation.
Correlated View Attention changes the metrics only slightly (CLIP Score 23.32 without it vs 23.47 with it), but qualitative comparisons show that it improves cross-view consistency by eliminating color and style inconsistencies.
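The headline numbers above can be re-derived from the two tables with a few lines of Python. The values are copied from the tables; the variable names are only for illustration.

```python
# (generation time in minutes, CLIP Score, Aesthetic Score) from the main table
main = {
    "Text2Tex-H": (8.50, 21.58, 4.34),
    "Text2Tex-C": (70.75, 21.93, 4.85),
    "SceneTex": (2614.50, 21.87, 4.75),
    "RoomPainter": (46.00, 23.47, 5.03),
}
# (CLIP Score, Aesthetic Score) from the ablation table
ablation = {
    "Full Method": (23.47, 5.03),
    "W/o Correlated View Attention": (23.32, 5.01),
    "W/o MVIS": (23.27, 5.01),
    "W/o MVRS": (22.39, 4.40),
}

speedup_vs_scenetex = main["SceneTex"][0] / main["RoomPainter"][0]
clip_gain_vs_scenetex = main["RoomPainter"][1] - main["SceneTex"][1]

full_clip, full_aes = ablation["Full Method"]
drops = {
    name: (round(full_clip - clip, 2), round(full_aes - aes, 2))
    for name, (clip, aes) in ablation.items()
    if name != "Full Method"
}

print(f"Speedup over SceneTex: {speedup_vs_scenetex:.1f}x")
print(f"CLIP gain over SceneTex: {clip_gain_vs_scenetex:.2f}")
for name, (d_clip, d_aes) in drops.items():
    print(f"{name}: CLIP -{d_clip:.2f}, Aesthetic -{d_aes:.2f}")
```

The largest drop in both metrics comes from removing MVRS, which matches the reading that occlusion inpainting matters most for final quality.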

Highlights & Insights¶

Zero-shot 3D Consistency: Requires no training of multi-view diffusion models; cross-view consistency is achieved during the sampling process through a UV texture map feedback mechanism.
Two-Stage Strategy: A coarse-to-fine design in which global MVIS establishes the overall style, while instance-level MVRS repairs occlusions and enhances details.
Clear Efficiency Advantage: Runs over 50x faster than the optimization-based SceneTex, making practical deployment more feasible.

Limitations & Future Work¶

Requires obtaining high-quality indoor scene meshes beforehand as input.
Camera positions in the global stage are tied to the room center, which may lack flexibility for irregular rooms.
Texture resolution is limited by the output resolution of SDXL.
Future work could explore combining with 3D generative models to directly synthesize textured scenes.

Related Work¶

Text2Tex / TEXTure: Pioneers in view-by-view inpainting-based texture generation, but lack global consistency.
SceneTex: Scene-level texturing based on VSD optimization; produces high quality but is extremely time-consuming.
SyncMVD: Similarly modifies self-attention to share information across views, as Correlated View Attention does.

Rating¶

⭐⭐⭐⭐: The method design is clean and practical, and the two-stage strategy addresses the core problems of indoor scene texturing. The results, a roughly 57x speedup over SceneTex accompanied by quality improvements, are impressive. Because the approach is zero-shot, it should in principle transfer to other diffusion models. However, the dependence on input mesh quality limits its application scenarios.