# MV-Adapter: Multi-view Consistent Image Generation Made Easy
Conference: ICCV 2025 | arXiv: 2412.03632 | Code: Available | Area: 3D Vision / Multi-view Generation | Keywords: Multi-view Generation, Adapter, Diffusion Model, 3D Generation, Texture Generation
## TL;DR
This paper proposes MV-Adapter, the first adapter-based framework for multi-view image generation. By duplicating self-attention layers and adopting a parallel attention architecture, it enables plug-and-play multi-view generation on SDXL at 768×768 resolution, while remaining compatible with diverse T2I-derived models.
## Background & Motivation
Multi-view image generation is a fundamental task in 2D/3D content creation. Existing methods (e.g., MVDream, Era3D) suffer from three major limitations:
High computational cost: Intrusive modifications to pretrained T2I models followed by full fine-tuning require processing \(n\) views simultaneously, making it infeasible to scale to larger base models or higher resolutions.
Degraded image quality: High-quality 3D data is scarce, and full-model fine-tuning tends to overfit, leading to generation quality degradation.
Lack of flexibility: Modifying the original model architecture precludes compatibility with personalized models, LoRA, ControlNet, and other T2I-derived tools.
Core idea: The adapter paradigm is naturally suited to multi-view generation, since it involves fewer parameters, is easier to train, preserves pretrained knowledge, and supports plug-and-play use. The key challenge is encoding 3D geometric knowledge effectively without modifying the original network structure.
## Method
### Overall Architecture
MV-Adapter consists of two core components:

1. Condition Guider: encodes camera parameters or geometric information.
2. Decoupled Attention Layers: comprising multi-view attention and image cross-attention.
At inference time, MV-Adapter can be inserted into any personalized or distilled T2I model to serve as a multi-view generator.
### Key Designs
1. Condition Guider
- Camera conditioning: Represented as raymaps encoding the per-pixel ray origin and direction, at the same spatial resolution as the latent representation.
- Geometry conditioning: Represented as position maps and normal maps; position maps provide global cross-view point correspondences, while normal maps capture fine geometric detail.
- A lightweight convolutional network extracts multi-scale features that are added to the corresponding levels of the U-Net encoder.
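A minimal sketch of what such a guider could look like (PyTorch; the module layout, channel sizes, and zero-initialized projections are our assumptions, not the paper's released code):

```python
import torch
import torch.nn as nn

class ConditionGuider(nn.Module):
    """Lightweight conv encoder: maps a raymap (or position/normal map)
    to multi-scale features added to the U-Net encoder levels.
    Channel sizes are illustrative (SDXL-like), not the paper's config."""

    def __init__(self, in_channels=6, unet_channels=(320, 640, 1280)):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, unet_channels[0], 3, padding=1)
        self.stages, self.projs = nn.ModuleList(), nn.ModuleList()
        ch = unet_channels[0]
        for i, out_ch in enumerate(unet_channels):
            stride = 1 if i == 0 else 2  # downsample between U-Net levels
            self.stages.append(nn.Sequential(
                nn.Conv2d(ch, out_ch, 3, stride=stride, padding=1),
                nn.SiLU(),
                nn.Conv2d(out_ch, out_ch, 3, padding=1),
                nn.SiLU(),
            ))
            proj = nn.Conv2d(out_ch, out_ch, 1)
            nn.init.zeros_(proj.weight)  # zero-init (our assumption) so the
            nn.init.zeros_(proj.bias)    # guider starts as a no-op
            self.projs.append(proj)
            ch = out_ch

    def forward(self, raymap):
        # raymap: (B * n_views, 6, H, W), per-pixel ray origin + direction,
        # at the same spatial resolution as the latents
        feats, h = [], self.stem(raymap)
        for stage, proj in zip(self.stages, self.projs):
            h = stage(h)
            feats.append(proj(h))  # added to the matching encoder level
        return feats
```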
2. Duplicated Self-Attention Layers
The core principle is to preserve the original network structure and feature space unchanged. Rather than modifying the base model's self-attention, the paper duplicates its structure and weights to create new multi-view attention and image cross-attention layers, with the output projection layers zero-initialized. This ensures that the new layers learn geometric knowledge without interfering with the original model.
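A sketch of the duplication step, assuming a diffusers-style attention block where `to_out[0]` is the final linear projection (the exact module layout is an assumption):

```python
import copy
import torch.nn as nn

def make_adapter_attention(self_attn: nn.Module) -> nn.Module:
    """Duplicate a frozen self-attention layer (structure + weights) to
    serve as a new multi-view or image cross-attention layer, then
    zero-initialize its output projection so that, at the start of
    training, the new branch contributes nothing and the base model's
    feature space is untouched."""
    new_attn = copy.deepcopy(self_attn)
    out_proj = new_attn.to_out[0]  # assumes diffusers-style layout
    nn.init.zeros_(out_proj.weight)
    if out_proj.bias is not None:
        nn.init.zeros_(out_proj.bias)
    for p in new_attn.parameters():  # only adapter layers are trained
        p.requires_grad_(True)
    return new_attn
```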
3. Parallel Attention Architecture
In contrast to a serial organization, MV-Adapter adopts a parallel architecture: the duplicated multi-view and image cross-attention layers run alongside the original self-attention on the same input, and their outputs are summed.
The key advantage of this design is that the new layers receive the same input as the original self-attention layer, making the pretrained weight initialization effective and allowing the new layers to directly inherit image priors. In a serial architecture, the new layers receive inputs in a different domain, rendering such initialization ineffective.
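Schematically, with \(\mathbf{h}\) denoting the hidden states entering the attention block (notation ours, not the paper's):

\[
\text{serial: } \mathbf{h}' = \mathrm{MVAttn}\bigl(\mathrm{SelfAttn}(\mathbf{h})\bigr)
\qquad \text{vs.} \qquad
\text{parallel: } \mathbf{h}' = \mathrm{SelfAttn}(\mathbf{h}) + \mathrm{MVAttn}(\mathbf{h}) + \mathrm{ImgAttn}(\mathbf{h}, \mathbf{h}_{\mathrm{ref}})
\]

where \(\mathbf{h}_{\mathrm{ref}}\) denotes the reference-image features. With the zero-initialized output projections, \(\mathrm{MVAttn}\) and \(\mathrm{ImgAttn}\) contribute nothing at the start of training, so the parallel form initially reproduces the base model exactly.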
4. Multi-view Attention Strategies
- 3D object generation: 0° elevation, row-wise self-attention.
- 3D texture generation: Four 0° views plus two top/bottom views, row-wise and column-wise self-attention.
- Arbitrary-view generation: Full self-attention.
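Row-wise attention can be implemented purely by reshaping, so the duplicated layer is reused unchanged. A sketch (here `attn` is any callable mapping `(B, L, C)` to `(B, L, C)`; the tensor layout is our assumption):

```python
from einops import rearrange

def row_wise_mv_attention(attn, hidden, n_views, height, width):
    """hidden: (B * n_views, H * W, C) tokens from one U-Net block.
    Group the same pixel row of all views into one sequence, so each
    token attends across views along its shared epipolar row (valid
    for same-elevation orbit cameras)."""
    tokens = rearrange(hidden, "(b n) (h w) c -> (b h) (n w) c",
                       n=n_views, h=height, w=width)
    tokens = attn(tokens)  # attention over n_views * W tokens per row
    return rearrange(tokens, "(b h) (n w) c -> (b n) (h w) c",
                     n=n_views, h=height, w=width)
```

Column-wise attention for the top/bottom views follows the same idea with rows and columns swapped.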
5. Image Cross-Attention
A pretrained, frozen U-Net serves as the image encoder. The clean reference latent is passed through it at timestep \(t = 0\) (i.e., with no added noise) to extract multi-scale self-attention features, which are then injected into the denoising U-Net through the image cross-attention layers.
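One plausible way to implement this extraction is with forward hooks on the frozen U-Net; a sketch below, where the `attn1` naming for self-attention layers follows diffusers conventions and is our assumption:

```python
import torch

@torch.no_grad()
def extract_reference_features(frozen_unet, ref_latent, text_emb):
    """Run the frozen U-Net on the clean reference latent at t = 0 and
    record the hidden states entering each self-attention layer; these
    act as keys/values for the image cross-attention layers."""
    feats, hooks = {}, []
    for name, module in frozen_unet.named_modules():
        if name.endswith("attn1"):  # self-attention blocks
            def hook(mod, inputs, output, key=name):
                feats[key] = inputs[0]  # hidden states entering the layer
            hooks.append(module.register_forward_hook(hook))
    t = torch.zeros(ref_latent.shape[0], dtype=torch.long,
                    device=ref_latent.device)
    frozen_unet(ref_latent, t, encoder_hidden_states=text_emb)
    for h in hooks:
        h.remove()
    return feats  # {layer_name: (B, L, C) reference features}
```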
### Loss & Training
- Standard diffusion training objective; only MV-Adapter parameters are optimized.
- Reference image features are randomly zeroed to support classifier-free guidance.
- The noise schedule is shifted toward higher noise levels: the log-SNR is lowered by \(\log(n)\), where \(n\) is the number of generated views (see the snippet after this list).
- Training data: a subset of Objaverse.
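A small snippet illustrating the schedule shift (the sign convention, lowering the log-SNR by \(\log n\) to move toward higher noise, is our reading of the paper):

```python
import torch

def shift_alphas_cumprod(alphas_cumprod: torch.Tensor, n_views: int):
    """Shift a diffusion schedule toward higher noise by lowering the
    log-SNR by log(n_views). With SNR = a / (1 - a) for
    a = alphas_cumprod, subtracting log(n) from the log-SNR is the
    same as dividing the SNR by n."""
    snr = alphas_cumprod / (1.0 - alphas_cumprod)
    snr = snr / n_views          # log-SNR -> log-SNR - log(n_views)
    return snr / (1.0 + snr)     # convert back to alphas_cumprod
```

This would be applied, for example, to a scheduler's `alphas_cumprod` before training.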
## Key Experimental Results
### Main Results
Text-to-multi-view generation:
| Method | FID↓ | IS↑ | CLIP Score↑ |
|---|---|---|---|
| MVDream | 32.15 | 14.38 | 31.76 |
| SPAD | 48.79 | 12.04 | 30.87 |
| Ours (SDXL) | 29.71 | 16.38 | 33.17 |
Image-to-multi-view generation:
| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Era3D | 20.890 | 0.8601 | 0.1199 |
| Ouroboros3D | 20.810 | 0.8535 | 0.1193 |
| Ours (SDXL) | 22.131 | 0.8816 | 0.1002 |
### Ablation Study
Training efficiency comparison (batch size = 1):
| Method | Trainable Params | VRAM | Training Speed |
|---|---|---|---|
| Era3D (SD2.1) | 993M | 36 GB | 2.2 iter/s |
| Ours (SD2.1) | 127M | 17 GB | 3.1 iter/s |
| Era3D (SDXL) | 3.1B | >80 GB | Infeasible |
| Ours (SDXL) | 490M | 60 GB | 1.05 iter/s |
Attention architecture ablation:
| Architecture | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Serial (SDXL) | 20.687 | 0.8681 | 0.1149 |
| Parallel (SDXL) | 22.131 | 0.8816 | 0.1002 |
### Key Findings
- Parallel vs. serial: The parallel architecture substantially outperforms the serial one (PSNR gain of 1.44); the serial architecture produces artifacts and inconsistent details.
- Training efficiency: Trainable parameters are roughly 1/8 of full fine-tuning on SD2.1 (127M vs. 993M) and about 1/6 on SDXL (490M vs. 3.1B); VRAM usage is roughly halved on SD2.1, and full fine-tuning (Era3D) is infeasible on SDXL.
- Texture generation: FID of 27.28 (image-conditioned, SDXL), 24% lower than the best baseline SyncMVD (36.13), with inference in only 33 seconds.
- 3D reconstruction quality: Chamfer Distance of 0.0206, significantly outperforming Era3D's 0.0329.
## Highlights & Insights
- First introduction of the adapter paradigm: Applying adapter-based methods to multi-view generation achieves "train once, use everywhere" flexibility.
- Elegant parallel attention design: By sharing the input with the original self-attention, the new layers inherit pretrained weight initialization effectively.
- Zero-initialization strategy: Zero-initializing the output projection of new layers ensures that the original feature space is not disrupted at the start of training.
- Decoupled learning paradigm: Provides a general framework extensible to modeling other types of knowledge, such as physical or temporal priors.
## Limitations & Future Work
- Fixed number of views: Each application currently requires training a separate adapter for a specific number of views.
- Room for improvement in 3D consistency: Post-processing is still required to obtain the final 3D model.
- Dependency on training data: 3D datasets such as Objaverse are still required.
- Potential extension to video generation: The parallel attention architecture may be applicable to temporal consistency modeling.
## Related Work & Insights
- MVDream: Replaces self-attention with a 3D variant through intrusive modifications, resulting in incompatibility with T2I-derived models.
- Era3D: Achieves efficient multi-view interaction via row-wise self-attention, but requires full fine-tuning.
- SPAD: Employs epipolar-constrained cross-attention, with computational cost between dense and row-wise attention.
- IP-Adapter: Its decoupled cross-attention design inspired MV-Adapter's image conditioning mechanism.
- Insight: In the era of large-scale models, parameter-efficient fine-tuning is not merely an efficiency concern, but a critical strategy for preserving pretrained priors and enabling flexible compositional use.
## Rating
| Dimension | Score (1–5) |
|---|---|
| Novelty | 4.5 |
| Technical Depth | 4 |
| Experimental Thoroughness | 4.5 |
| Writing Quality | 4.5 |
| Practical Value | 5 |
| Overall | 4.5 |