MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes¶

Conference: CVPR 2025
arXiv: 2412.11457
Code: Project Page
Area: 3D Vision
Keywords: novel view synthesis, multi-object, structure-aware diffusion, timestep scheduling, cross-view consistency

TL;DR¶

Addressing novel view synthesis (NVS) for multi-object indoor scenes, this paper significantly improves cross-view object placement and geometric consistency through three key designs: injecting structure-aware features (depth + object masks), introducing an auxiliary mask prediction task, and designing a structure-guided timestep sampling scheduler.

Background & Motivation¶

Background: Single-object NVS based on pretrained diffusion models (such as Zero-1-to-3) has achieved impressive results. However, almost all methods are trained on single-object datasets like Objaverse and cannot be directly scaled to multi-object compositional scenes.

Limitations of Prior Work: Directly applying single-object NVS methods to multi-object scenes leads to severe issues: incorrect object placement, deformed shapes, inconsistent appearances, and even missing objects. The fundamental reason is the lack of structure-awareness.

Key Challenge: Structural information in multi-object scenes is hierarchical: high-level object placement (position and orientation) sits above low-level per-object geometry and appearance. The one-to-one mapping paradigm of single-object methods cannot handle this compositional complexity.

Goal: To enhance the perception of multi-object compositional structures in view-conditioned diffusion models, achieving cross-view consistent multi-object NVS.

Key Insight: Comprehensively inject structure-aware signals from three dimensions: model inputs, auxiliary tasks, and training strategies.

Core Idea: By leveraging depth/mask inputs, target-view mask prediction, and timestep resampling scheduling, the diffusion model is forced to learn global layout before local details.

Method¶

Overall Architecture¶

The model fine-tunes a pretrained Stable Diffusion on pairs of input-view and target-view images. Structure-aware features are concatenated at the input of the denoising U-Net, semantic information is injected via cross-attention, and the object mask of the target view is predicted at the same time.

Key Designs¶

1. Structure-Aware Feature Amalgamation¶

The depth map and object mask of the input view are incorporated as additional inputs:
- Object Mask: The rendered instance-ID map is normalized into a continuous image, providing a coarse perception of object placement and shape.
- Depth Map: Encodes the relative positions and shapes of visible objects.
- Both are replicated to 3 channels to mimic RGB, encoded by the VAE, and concatenated with the noisy target view image.
- During inference, these can be obtained using off-the-shelf predictors such as SAM (masks) and Marigold (depth).
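The mask preprocessing described above can be sketched in a few lines. This is only an illustration under stated assumptions: the summary does not give the exact normalization, so dividing by the largest instance ID is a guess, and the helper names `normalize_instance_ids` and `to_three_channels` are hypothetical. The VAE encoding and the concatenation with the noisy target latent are omitted.

```python
def normalize_instance_ids(id_map):
    """Map integer instance IDs (0 = background) to floats in [0, 1].

    Assumption: normalization divides by the largest ID in the map.
    """
    max_id = max(max(row) for row in id_map)
    if max_id == 0:
        # No objects present: return an all-zero map of the same shape.
        return [[0.0 for _ in row] for row in id_map]
    return [[value / max_id for value in row] for row in id_map]


def to_three_channels(gray):
    """Replicate an H x W map into 3 x H x W so it looks like an RGB image."""
    return [[row[:] for row in gray] for _ in range(3)]


# Toy 2 x 3 instance-ID rendering with two objects (IDs 1 and 2).
id_map = [[0, 0, 1],
          [2, 2, 1]]
mask_gray = normalize_instance_ids(id_map)  # -> [[0.0, 0.0, 0.5], [1.0, 1.0, 0.5]]
mask_rgb = to_three_channels(mask_gray)     # 3 identical channels
```

The same replication step would apply to the depth map before both images are passed through the VAE encoder.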

The modified learning objective is: \(\mathbb{E}[\|\epsilon_\theta(\alpha_t x_0 + \sigma_t \epsilon, t, C_{SA}(\hat{x}_0, R, T, \hat{D}, \hat{M})) - \epsilon\|^2]\)

2. Auxiliary Target-View Mask Prediction¶

Analogous to the concept of classifier guidance, target-view object mask prediction is introduced as an auxiliary training task:
- The mask predictor extracts features from the last layer of the denoising U-Net, conditioned on the noisy image \(x_t\), timestep \(t\), and structure-aware features.
- Joint training loss: diffusion reconstruction + \(\gamma \| M_{tgt} - M_t \|^2\) (\(\gamma = 0.1\)).
- This forces the model to explicitly learn "where objects should be placed in the target view."

3. Structure-Guided Timestep Sampling Scheduler¶

Key Observation: Global object placement is restored in the early denoising stage (large \(t\)), while fine geometric details are restored in the late stage (small \(t\)).

The uniform sampling \(t \sim \mathcal{U}(1, 1000)\) is modified to Gaussian sampling \(t \sim \mathcal{N}(\mu(s), \sigma)\), where \(\mu(s)\) linearly decays from \(\mu_{global}=1000\) to \(\mu_{local}=500\) (\(\sigma=200\)):
- First 4000 steps of warmup: \(\mu = 1000\), emphasizing global layout learning.
- Linear decay from 4000 to 6000 steps.
- After 6000 steps: \(\mu = 500\), shifting focus to fine-details learning.
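The schedule above can be written as a small piecewise function of the training step \(s\). The following is a minimal sketch, not the authors' implementation: the constants come from the text, but rounding to an integer and clamping samples into \([1, 1000]\) are assumptions (the summary does not say how out-of-range draws are handled), and the function names are hypothetical.

```python
import random

MU_GLOBAL, MU_LOCAL, SIGMA = 1000.0, 500.0, 200.0
WARMUP_STEPS, DECAY_END = 4000, 6000
T_MIN, T_MAX = 1, 1000


def mu_schedule(step):
    """Mean of the timestep distribution at a given training step."""
    if step < WARMUP_STEPS:
        return MU_GLOBAL  # warmup: focus on large t (global layout)
    if step >= DECAY_END:
        return MU_LOCAL   # late training: focus on smaller t (details)
    frac = (step - WARMUP_STEPS) / (DECAY_END - WARMUP_STEPS)
    return MU_GLOBAL + frac * (MU_LOCAL - MU_GLOBAL)


def sample_timestep(step, rng=random):
    """Draw t ~ N(mu(step), SIGMA), rounded to an integer timestep."""
    t = round(rng.gauss(mu_schedule(step), SIGMA))
    # Assumption: out-of-range draws are clamped rather than resampled.
    return min(max(t, T_MIN), T_MAX)


rng = random.Random(0)
early = [sample_timestep(100, rng) for _ in range(5)]   # drawn around t = 1000
late = [sample_timestep(8000, rng) for _ in range(5)]   # drawn around t = 500
```

With clamping, roughly half of the warmup draws land exactly on \(t = 1000\); resampling out-of-range values would be an alternative design with a smoother distribution near the boundary.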

Loss & Training¶

\(\mathcal{L} = \|\epsilon_\theta - \epsilon\|^2 + \gamma \|M_{tgt} - M_t\|^2\), where \(\gamma = 0.1\), combining the diffusion reconstruction loss and the mask prediction MSE loss.
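As a rough illustration of how the two terms combine, the sketch below computes the joint objective on flat lists of numbers. It makes assumptions that the summary does not confirm: each squared norm is averaged over elements (a common reduction choice), and the names `mse` and `joint_loss` are hypothetical. In a real training loop these would be tensor operations over the U-Net noise prediction and the mask predictor output.

```python
GAMMA = 0.1  # weight of the mask term, as stated in the text


def mse(pred, target):
    """Mean squared error between two equally sized flat sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def joint_loss(eps_pred, eps, mask_pred, mask_tgt, gamma=GAMMA):
    """Diffusion noise-prediction loss plus weighted mask-prediction loss."""
    return mse(eps_pred, eps) + gamma * mse(mask_pred, mask_tgt)


# Toy values: noise error contributes 2.0, mask error contributes 0.1 * 0.5.
loss = joint_loss([0.0, 2.0], [0.0, 0.0], [1.0, 1.0], [0.0, 1.0])
```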

Key Experimental Results¶

Main Results: C3DFS Test Set¶

Method	PSNR↑	SSIM↑	LPIPS↓	IoU↑	Hit Rate↑	Dist↓
ZeroNVS	10.7	0.533	0.481	21.6	1.4	135.2
Zero-1-to-3	14.3	0.771	0.302	33.7	4.4	86.7
Free3D	14.4	0.774	0.297	34.2	4.8	83.6
MOVIS	17.4	0.825	0.171	58.1	19.3	44.9

Relative to Zero-1-to-3, MOVIS improves IoU (object placement accuracy) by 72% (33.7 to 58.1) and Hit Rate (cross-view matching) by 339% (4.4 to 19.3).
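The percentages quoted here are relative gains over the Zero-1-to-3 row of the table, not absolute differences. A short check of the arithmetic, using only the table values (the helper name `relative_gain_percent` is ours):

```python
def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


iou_gain = relative_gain_percent(58.1, 33.7)  # about 72.4
hit_gain = relative_gain_percent(19.3, 4.4)   # about 338.6
```

In absolute terms the same comparison is +24.4 IoU points and +14.9 Hit Rate points.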

Generalization: Objaverse + Room-Texture¶

Objaverse: PSNR 17.7 / IoU 51.3 / Hit Rate 17.0 (outperforming others by a large margin)
Room-Texture: PSNR 10.0 / IoU 24.2 / Hit Rate 4.4 (maintaining advantages across domains)

Ablation Study¶

Variant	PSNR↑	LPIPS↓	IoU↑
w/o depth	17.1	0.178	57.2
w/o mask (auxiliary task)	16.9	0.187	54.7
w/o scheduler	16.2	0.212	49.1
Full MOVIS	17.4	0.171	58.1

Key Findings¶

Timestep scheduler is the most critical component: Removing it degrades IoU by 9 percentage points, indicating that the "global first, local second" learning sequence is crucial for multi-object scenes.
The auxiliary mask prediction task is the second-largest contributor (removing it lowers IoU by 3.4 points); direct supervision helps the model distinguish object instances.
Cross-view consistency metrics (Hit Rate/Dist) complement traditional NVS metrics, revealing structural problems that traditional metrics fail to reflect.
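The ranking of components in these findings follows directly from the IoU column of the ablation table. A small sketch that recomputes the drops (values copied from the table; the variable names are illustrative):

```python
FULL_IOU = 58.1
ablation_iou = {
    "w/o depth": 57.2,
    "w/o mask (auxiliary task)": 54.7,
    "w/o scheduler": 49.1,
}

# IoU lost when each component is removed, in percentage points.
iou_drop = {name: round(FULL_IOU - iou, 1) for name, iou in ablation_iou.items()}

# Components ordered from most to least impactful.
ranking = sorted(iou_drop, key=iou_drop.get, reverse=True)
```

This gives drops of 9.0, 3.4, and 0.9 points for the scheduler, the auxiliary mask task, and the depth input respectively.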

Highlights & Insights¶

New Evaluation Dimension: Proposes cross-view consistency metrics (Hit Rate and Dist based on MASt3R image matching) to fill the blind spot in NVS evaluation.
Hierarchical Analysis of Denoising: By visualizing intermediate predictions at different timesteps, it uncovers the pattern of global layout restoration in the early stage and fine-mask prediction in the late stage.
Timestep Scheduler Design Philosophy: Applies curriculum learning (easy-to-hard) to diffusion training; the "coarse-to-fine" process in multi-object scenes naturally aligns with the denoising progression.
Structure-Aware Input Design: Simple and intuitive. Because depth and masks can be obtained from off-the-shelf predictors (SAM + Marigold), the extra inputs remain available at inference time, which keeps the method practical.
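To make the cross-view consistency metrics more concrete, here is one plausible way to compute them once point correspondences are available. This is a hedged sketch, not the paper's definition: the summary only states that Hit Rate and Dist are based on MASt3R image matching, so the inputs (where each matched point lands in the predicted versus the ground-truth target view), the pixel threshold of 20, and the function name `cross_view_metrics` are all assumptions. The matching step itself is not shown.

```python
import math


def cross_view_metrics(pred_pts, gt_pts, hit_threshold=20.0):
    """Return (hit_rate_percent, mean_distance) for paired 2D points.

    pred_pts[i] and gt_pts[i] are assumed to be the locations of the same
    input-view point in the predicted and ground-truth target views.
    hit_threshold is a hypothetical pixel tolerance.
    """
    if len(pred_pts) != len(gt_pts) or not pred_pts:
        raise ValueError("need the same non-zero number of points in both lists")
    dists = [math.dist(p, g) for p, g in zip(pred_pts, gt_pts)]
    hits = sum(d <= hit_threshold for d in dists)
    return 100.0 * hits / len(dists), sum(dists) / len(dists)


# Toy example: errors of 0, 5, and 100 pixels.
hit_rate, mean_dist = cross_view_metrics(
    [(0.0, 0.0), (3.0, 4.0), (100.0, 0.0)],
    [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)],
)
```

Under this reading, a higher Hit Rate and a lower Dist both indicate that objects in the synthesized view sit where the ground-truth view places them.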

Limitations & Future Work¶

Only focuses on foreground objects without modeling the background (left for future work).
Trained on the synthetic dataset C3DFS, limiting generalization to real-world indoor scenes (such as SUNRGB-D).
Still requires input-view depth and mask as extra conditions, increasing inference costs.
Performance degrades under large view changes, where occluded regions demand stronger generative capabilities.

Related Work & Insights¶

Zero-1-to-3: Pioneered using diffusion models as NVS synthesizers, but limited to single objects; MOVIS demonstrates that multi-object extension requires explicit structure awareness.
Compositional 3D Reconstruction series (ComboVerse, etc.): A pipeline paradigm of segmentation -> completion -> single-object 3D -> composition suffers from accumulated cascade errors; the end-to-end scheme of MOVIS is more concise.
Related Insight: The concept of a timestep resampling scheduler can be extended to other diffusion tasks requiring "hierarchical generation" (e.g., scene generation from layout to texture).

Rating¶

⭐⭐⭐⭐ — Clear problem definition. The three design choices are orthogonal to each other and their contributions are well validated through ablation. The insight behind the timestep scheduler is particularly interesting. Generalization on real-world data still needs further verification.