# MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion
- Conference: ICLR 2026
- arXiv: 2510.13702
- Code: Project Page
- Area: Diffusion Models / Personalized Generation
- Keywords: Multi-view customized generation, camera pose control, feature field rendering, video diffusion, geometric consistency
## TL;DR
This paper introduces a new task, multi-view customization, and proposes the MVCustom framework, which leverages a video diffusion backbone with dense spatio-temporal attention for holistic frame consistency. At inference time, two novel techniques are introduced: depth-aware feature rendering and consistency-aware latent completion. Together they achieve, for the first time, simultaneous camera pose control, subject identity preservation, and cross-view geometric consistency.
## Background & Motivation
Background: Controllable image generation has two key dimensions—camera control (multi-view generation) and customization (preserving subject identity from reference images). Each dimension has been extensively studied, yet methods that jointly address both remain virtually absent.
Limitations of Prior Work:
- Traditional customization methods (DreamBooth, Custom Diffusion) do not support camera pose control.
- Multi-view generation methods (CameraCtrl, SEVA) do not support personalized customization.
- Customization methods with viewpoint control (CustomDiffusion360, CustomNet) focus solely on the subject, neglecting cross-view consistency of the background.
- Directly applying customization methods (e.g., DreamBooth-LoRA) to multi-view generation backbones leads to loss of subject identity and degraded camera control.
Key Challenge: Multi-view generation relies on large-scale data to learn 3D geometry, whereas customization scenarios provide only a handful of reference images—a fundamental tension between data scarcity and the demand for geometric consistency.
Goal: Define and address the "multi-view customization" task: (i) generate images matching specified camera poses; (ii) preserve subject identity from reference images; (iii) maintain cross-view consistency for both subject and background.
Key Insight: Decouple the training and inference stages—use limited data during training to learn subject identity and geometry, and apply explicit geometric constraints (depth rendering) at inference to enforce consistency.
Core Idea: Leverage a video diffusion backbone to learn temporal consistency, employ feature field modeling for geometry, and use depth-guided rendering at inference to ensure cross-view geometric consistency.
## Method

### Overall Architecture
MVCustom consists of two stages:
- Training Stage: A video diffusion backbone based on AnimateDiff with dense spatio-temporal attention (replacing the original 1D temporal attention), a pose-conditioned Transformer block (incorporating FeatureNeRF), and textual inversion to learn the subject embedding.
- Inference Stage: Depth-aware feature rendering explicitly enforces geometric consistency across views; consistency-aware latent completion fills newly visible regions.
### Key Designs
- Pose-Conditioned Transformer Block (FeatureNeRF):
    - Function: Injects camera pose information into the diffusion model and learns the geometric structure of the subject.
    - Mechanism: A dual-branch architecture is designed: the main branch generates target-view feature maps, while the multi-view branch aggregates reference-view features via FeatureNeRF. FeatureNeRF exploits epipolar geometry and volume rendering to synthesize pose-aligned feature maps \(\bm{X}_y\) from reference image features \(\{(\bm{X}_i, \pi_i)\}\) (see the rendering sketch after this list).
    - Design Motivation: Enables the diffusion model to learn 3D geometric information from a small number of reference images.
- Dense Spatio-Temporal Attention:
    - Function: Replaces AnimateDiff's 1D temporal attention to enable information exchange across frames and spatial positions.
    - Mechanism: The original 1D temporal attention only interacts between frames at identical spatial positions, failing to model spatial displacements caused by viewpoint changes. Dense 3D spatio-temporal attention allows cross-frame interaction at arbitrary spatial positions (see the attention sketch after this list). A progressively expanding spatial attention scope is adopted to maintain training stability and preserve pre-trained knowledge.
    - Design Motivation: Ablation studies confirm that when performing feature replacement, 1D temporal attention fails to propagate spatial flow correctly; dense spatio-temporal attention is essential.
- Depth-Aware Feature Rendering:
    - Function: Explicitly enforces cross-view geometric consistency at inference time.
    - Mechanism: An anchor frame is selected, its depth is estimated via ZoeDepth, and an anchor feature grid \(\mathcal{M}_a = (\bm{P}_a, \bm{F}_a, \mathcal{T}_a)\) is constructed. A differentiable mesh renderer projects the anchor-frame features onto the other camera poses. During the first 35 DDIM sampling steps, visible regions are replaced with rendered features: \(\hat{\bm{F}}_n = \bm{M}_n^a \odot \bm{F}_n^a + (1-\bm{M}_n^a) \odot \bm{F}_n\) (see the DDIM loop sketch after this list).
    - Design Motivation: Data scarcity during training precludes implicit learning of geometric consistency as in large-scale multi-view methods; explicit geometric constraints are therefore necessary.
- Consistency-Aware Latent Completion:
    - Function: Generates plausible content for newly visible (disoccluded) regions arising from viewpoint changes.
    - Mechanism: During denoising, \(x_0\) is predicted from the intermediate latent \(x_t\), then re-noised to timestep \(t\) to obtain a perturbed latent \(x_t'\). Newly visible regions in the original latent are replaced with their perturbed counterparts. This process iterates from timestep \(T\) down to an early timestep \(\tau\) near \(T\) (also shown in the DDIM loop sketch after this list).
    - Design Motivation: Feature rendering can only handle regions visible in the anchor frame; newly visible regions require the generative model's prior knowledge for coherent completion.
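To make the FeatureNeRF mechanism concrete, below is a minimal PyTorch sketch of the general recipe it builds on: sampling 3D points along target-view rays, projecting them into the reference views (the epipolar constraint), aggregating the sampled features, and volume-rendering a pose-aligned feature per ray. Everything here (function names, the mean aggregation, the `density_mlp` head) is our assumption, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def project_points(points, K, w2c):
    """Project 3D world points [N, 3] into a camera with intrinsics K [3, 3]
    and world-to-camera extrinsics w2c [4, 4]; returns pixel coords [N, 2]."""
    homog = torch.cat([points, torch.ones_like(points[:, :1])], dim=-1)
    cam = (w2c @ homog.T).T[:, :3]                        # camera-space points
    pix = (K @ cam.T).T
    return pix[:, :2] / pix[:, 2:].clamp(min=1e-6)

def sample_features(feat_map, pix):
    """Bilinearly sample a [C, H, W] feature map at pixel coords [N, 2]."""
    _, h, w = feat_map.shape
    grid = torch.stack([pix[:, 0] / (w - 1) * 2 - 1,      # x -> [-1, 1]
                        pix[:, 1] / (h - 1) * 2 - 1], -1)  # y -> [-1, 1]
    out = F.grid_sample(feat_map[None], grid.view(1, -1, 1, 2),
                        align_corners=True)                # [1, C, N, 1]
    return out[0, :, :, 0].T                               # [N, C]

def render_pose_aligned_features(ref_feats, ref_K, ref_w2c, ray_o, ray_d,
                                 density_mlp, n_samples=32, near=0.5, far=4.0):
    """Volume-render one pose-aligned feature vector per target-view ray by
    aggregating features sampled from all reference views."""
    ts = torch.linspace(near, far, n_samples, device=ray_o.device)
    pts = ray_o[:, None, :] + ray_d[:, None, :] * ts[None, :, None]  # [R,S,3]
    flat = pts.reshape(-1, 3)

    # Epipolar aggregation: look up every sample point in every reference view.
    per_view = [sample_features(f, project_points(flat, K, w2c))
                for f, K, w2c in zip(ref_feats, ref_K, ref_w2c)]
    agg = torch.stack(per_view).mean(dim=0)                          # [R*S,C]

    # Standard volume rendering over the aggregated features.
    sigma = F.softplus(density_mlp(agg)).reshape(len(ray_o), n_samples)
    alpha = 1 - torch.exp(-sigma * (far - near) / n_samples)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1 - alpha[:, :-1]], dim=1),
        dim=1)
    weights = (alpha * trans)[..., None]                             # [R,S,1]
    return (weights * agg.reshape(len(ray_o), n_samples, -1)).sum(1)  # [R,C]
```

The main branch would then fuse the rendered \(\bm{X}_y\) with the target-view features inside the pose-conditioned Transformer block.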
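The attention redesign is easiest to see in tensor shapes. Here is a minimal sketch contrasting AnimateDiff-style 1D temporal attention with the dense variant; the progressive expansion of the spatial attention scope is omitted for brevity.

```python
import torch
from torch import nn

class DenseSpatioTemporalAttention(nn.Module):
    """Sketch of replacing 1D temporal attention with dense 3D attention.
    1D temporal attention reshapes [B, F, C, H, W] so each spatial position
    attends only across frames; the dense variant lets every token attend
    to every (frame, position) pair."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, dense=True):
        b, f, c, h, w = x.shape
        if dense:
            # Dense: one sequence of all F*H*W tokens, so cross-frame
            # interaction can happen at arbitrary spatial positions.
            tokens = x.permute(0, 1, 3, 4, 2).reshape(b, f * h * w, c)
            out, _ = self.attn(tokens, tokens, tokens)
            return out.reshape(b, f, h, w, c).permute(0, 1, 4, 2, 3)
        # 1D temporal (AnimateDiff-style): each (h, w) location becomes its
        # own batch element; attention runs over the F frames only.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
```

In the 1D case a token at position (h, w) can never see another position in any frame, which is why feature replacement fails to propagate spatial flow; flattening all F·H·W tokens into one sequence removes that restriction, at quadratic cost in sequence length.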
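The two inference-time techniques can be read as edits to a plain DDIM loop. The sketch below is our reconstruction under stated assumptions: `denoiser(x, t)` predicts noise with all conditioning folded in, `alphas_bar` is the cumulative noise schedule, `rendered` holds the anchor features projected to each view by the differentiable mesh renderer, and `vis_mask` is 1 where the anchor frame is visible; `unproject_depth` only hints at how the anchor point cloud \(\bm{P}_a\) would be built from ZoeDepth output.

```python
import torch

def unproject_depth(depth, K):
    """Lift a depth map [H, W] to camera-space 3D points [H*W, 3] with
    pinhole intrinsics K [3, 3]; a hint at anchor-grid construction."""
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)
    return (pix.float() @ torch.linalg.inv(K).T) * depth.reshape(-1, 1)

@torch.no_grad()
def ddim_with_rendering_and_completion(x_t, timesteps, denoiser, alphas_bar,
                                       rendered, vis_mask,
                                       n_render_steps=35, tau_idx=10):
    """DDIM sampling with depth-aware feature rendering and
    consistency-aware latent completion (rough sketch)."""
    for i, t in enumerate(timesteps):                      # T -> 0
        # (1) Depth-aware feature rendering, first 35 steps:
        #     F_hat = M * F_rendered + (1 - M) * F.
        #     (In practice the rendered features would be matched to the
        #     noise level of step t.)
        if i < n_render_steps:
            x_t = vis_mask * rendered + (1 - vis_mask) * x_t

        a_t = alphas_bar[t]
        eps = denoiser(x_t, t)
        x0 = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # predicted x0

        # (2) Latent completion for early timesteps near T: re-noise x0
        #     back to t and paste it into the disoccluded regions.
        if i < tau_idx:
            x_t_pert = a_t.sqrt() * x0 + (1 - a_t).sqrt() * torch.randn_like(x0)
            x_t = (1 - vis_mask) * x_t_pert + vis_mask * x_t
            eps = denoiser(x_t, t)                          # re-predict
            x0 = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()

        # Deterministic DDIM update (eta = 0).
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else 0
        a_prev = alphas_bar[t_prev]
        x_t = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return x_t
```

Note that the two masks are complementary: rendering writes into anchor-visible regions, completion writes into everything else, matching the paper's point that the techniques handle disjoint region types.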
### Loss & Training
Standard denoising loss combined with the FeatureNeRF loss. The video backbone is trained on a subset of WebVid10M (430K samples); CO3Dv2 is used for customization experiments (3 concepts each for car, chair, and motorcycle categories).
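Written out, the combined objective presumably takes the usual form below; the balancing weight \(\lambda\) is our notation, as the exact weighting is not stated here:

$$
\mathcal{L} \;=\; \mathbb{E}_{x,\,c,\,\epsilon\sim\mathcal{N}(0,\mathbf{I}),\,t}\big[\,\lVert \epsilon-\epsilon_\theta(x_t,\,t,\,c)\rVert_2^2\,\big] \;+\; \lambda\,\mathcal{L}_{\mathrm{FeatureNeRF}}
$$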
## Key Experimental Results

### Main Results
| Method | Camera Pose Accuracy↑ | Multi-View Consistency↓ | Identity Preservation↓ | Text Alignment↑ |
|---|---|---|---|---|
| Custom Img + Img-MV gen | 0.675 | 0.214 | 0.504 | 0.676 |
| Txt-MV gen with DB | 0.283 | 0.116 | 0.557 | 0.723 |
| CustomDiffusion360 | 0.000 | 0.190 | 0.417 | 0.806 |
| MVCustom (ours) | 0.735 | 0.121 | 0.448 | 0.744 |
MVCustom is the only method that performs strongly on camera pose accuracy and multi-view consistency at the same time.
### Ablation Study
| Configuration | Outcome |
|---|---|
| Customization fine-tuning only (no DFR/LCC) | Background remains static across viewpoints |
| + Depth-Aware Feature Rendering (DFR) | Background shifts correctly with camera motion, but disoccluded regions exhibit repeated content |
| + Consistency-Aware Latent Completion (LCC) | Disoccluded regions completed naturally; full geometric consistency achieved |
| 1D temporal attention + feature replacement | Spatial flow propagation fails |
| Dense spatio-temporal attention + feature replacement | Spatial consistency propagated correctly |
### Key Findings
- COLMAP reconstruction fails entirely for CustomDiffusion360 (pose accuracy = 0), demonstrating that focusing solely on the subject while ignoring background consistency is infeasible.
- Depth-aware feature rendering and latent completion are complementary: the former ensures geometric consistency in visible regions, while the latter handles generation in invisible regions.
- Dense spatio-temporal attention is a prerequisite for the feature replacement strategy to function correctly.
- MVCustom incurs notable computational overhead (130.92s, 19.29GB), primarily due to the depth estimator and feature replacement operations.
## Highlights & Insights
- Clear and systematic task formulation: Table 1 systematically analyzes the capability gaps of existing methods across all dimensions of multi-view customization, defining an important and unaddressed task.
- Elegant training-inference decoupling: Subject representation is learned from limited data at training time, while explicit geometric constraints compensate for data insufficiency at inference—a strategy generalizable to other data-scarce generative settings.
- Transfer from video diffusion to multi-view generation: Exploiting the temporal consistency of video models to achieve multi-view consistency represents an effective cross-task transfer.
## Limitations & Future Work
- The framework cannot alter the intrinsic pose of the subject through text (e.g., changing from "sitting" to "standing"), as FeatureNeRF learns a fixed canonical pose.
- Computational overhead is substantially higher than competing methods (130.92s vs. 27–97s; 19.29GB vs. 5–7GB).
- Evaluation is conducted on only 3 categories from CO3Dv2, leaving generalization insufficiently validated.
- Depth estimation quality directly impacts rendering results and may lack robustness for complex scenes.
## Related Work & Insights
- vs. CustomDiffusion360: Both target viewpoint-controllable customization, but CD360 neglects background consistency, causing COLMAP failure; MVCustom addresses this via a video backbone combined with inference-time geometric constraints.
- vs. SEVA (Img-MV gen): Supports multi-view generation from a single image but lacks subject identity information, suffering severe degradation at views far from the input.
- vs. CameraCtrl + DB: Directly fine-tuning a multi-view model with DreamBooth paradoxically degrades camera control capability.
- The depth-aware feature rendering paradigm is transferable to other generation tasks requiring geometric consistency (e.g., video editing, 3D-aware inpainting).
## Rating
- Novelty: ⭐⭐⭐⭐ The task formulation is original, and the inference-time geometric constraint strategy is elegant.
- Experimental Thoroughness: ⭐⭐⭐ Limited to 3 categories; large-scale validation is absent.
- Writing Quality: ⭐⭐⭐⭐ Problem definition is clear and method description is thorough.
- Value: ⭐⭐⭐⭐ Opens a new direction in multi-view customized generation with promising applications in 3D content creation.