MVGBench: a Comprehensive Benchmark for Multi-view Generation Models

Conference: ICCV 2025 | arXiv: 2507.00006 | Code: Project Page | Area: 3D Vision | Keywords: multi-view generation, 3D consistency, benchmark, 3DGS, vision-language model evaluation

TL;DR

This paper presents MVGBench, a comprehensive evaluation framework for multi-view generation (MVG) models. It introduces a novel 3D consistency metric based on 3DGS self-consistency (requiring no 3D ground truth), systematically evaluates 12 state-of-the-art methods across three dimensions—peak performance, generalization, and robustness—and derives a new method, ViFiGen, from the identified best practices.

Background & Motivation

Limitations of Prior Work

Multi-view generation (MVG) models are central to modern 3D content creation, yet evaluation methodology lags significantly behind, suffering from three key problems:

Unfair comparison against GT: Generative models sample from the solution-space distribution and may produce outputs that differ from the ground truth yet remain valid; existing per-view PSNR/SSIM metrics ignore 3D consistency entirely.

Incomparable methods: Different MVG methods adopt different camera configurations (focal length, distance, elevation), making numerical results evaluated under their respective GTs incomparable across methods.

Insufficient evaluation dimensions: Critical aspects such as generalization to real-world images and robustness to input perturbations are not covered.

Core Insight: If the generated multi-view images are 3D-consistent, then 3DGS models reconstructed independently from disjoint subsets of those views should be mutually similar. This motivates the proposed self-consistency metric, which requires no 3D ground truth.

Method

Overall Architecture

MVGBench evaluates three aspects:

  • Peak performance: each method is assessed under its own optimal camera configuration.
  • Generalization to real images: evaluated on manually annotated CO3D and MVImgNet datasets.
  • Robustness: sensitivity to variations in elevation, azimuth, and illumination.

Key Designs

  1. 3D Consistency Metric (Self-Consistency):

    • Given \(N\) views generated by an MVG model, partition them into two disjoint subsets \(\mathcal{I}_1, \mathcal{I}_2\).
    • Fit separate 3DGS models \(\mathcal{G}_1, \mathcal{G}_2\) to each subset.
    • Geometric consistency:
      • Chamfer Distance (CD): computed by resampling point clouds from the Gaussians using their covariance matrices (not Gaussian centers), downsampled to 60k points.
      • Depth error \(e_d\): render depth maps from the same viewpoint and compute the L1 difference.
    • Texture consistency: render RGB images from the same viewpoint and compute cPSNR, cSSIM, and cLPIPS.
    • Advantages: enables quantitative evaluation on real images (no 3D GT required) and provides a fair comparison across methods (each using its own optimal camera setup); a minimal sketch of this protocol follows this list.
  2. Alignment Scheme for Fair Comparison:

    • Synthetic data: 3D objects are normalized to a unit cube; input images are rendered using each method's own training camera settings; the reconstructed 3DGS models are inherently aligned.
    • Real data: ICP with uniform scaling is applied to align each method's 3DGS to a reference 3DGS.
    • For methods with different numbers of output views, a small degree of overlap between subsets is permitted to equalize the upper bound of 3DGS fitting accuracy.
  3. Semantic and Image Quality Metrics:

    • oFID: object-level FID computed per object and then averaged, rather than over the full dataset; more consistent with human preference (Pearson correlation 0.69).
    • IQ-vlm: a pretrained VLM assesses image quality via binary (yes/no) responses; strongly correlated with human ratings.
    • Semantic consistency: a VLM evaluates whether the generated views are consistent with the input in terms of category, color, and style.
  4. Evaluation Datasets:

    • Peak performance: GSO (100 objects), OmniObject3D (202 objects).
    • Generalization: CO3D (102 images), MVImgNet (230 images); front-facing views are manually selected and elevation angles are annotated.
    • Robustness: GSO30 rendered under varying elevations, azimuths, and illumination conditions.
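
As forward-referenced in item 1, the sketch below illustrates the self-consistency protocol: split the generated views into two disjoint subsets, fit a 3DGS model to each, and compare renderings from shared viewpoints. The `fit_3dgs` and `render` callables are hypothetical stand-ins for an actual 3DGS pipeline, not the authors' code; cSSIM, cLPIPS, and the Chamfer term would be computed analogously.

```python
import numpy as np
from typing import Callable, Sequence, Tuple

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = float(np.mean((a - b) ** 2))
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def self_consistency(
    views: Sequence[np.ndarray],   # N generated RGB views, each (H, W, 3) in [0, 1]
    cams: Sequence[dict],          # matching camera parameters
    fit_3dgs: Callable,            # (views, cams) -> 3DGS model    (user-supplied)
    render: Callable,              # (model, cam) -> (rgb, depth)   (user-supplied)
) -> Tuple[float, float]:
    """Fit two 3DGS models on disjoint view subsets and compare their renderings.

    Returns (mean cPSNR, mean depth L1 error) over all evaluation viewpoints.
    """
    idx = np.arange(len(views))
    i1, i2 = idx[0::2], idx[1::2]  # two disjoint, interleaved subsets
    g1 = fit_3dgs([views[i] for i in i1], [cams[i] for i in i1])
    g2 = fit_3dgs([views[i] for i in i2], [cams[i] for i in i2])

    cpsnrs, depth_errs = [], []
    for cam in cams:               # render both models from the same viewpoints
        rgb1, depth1 = render(g1, cam)
        rgb2, depth2 = render(g2, cam)
        cpsnrs.append(psnr(rgb1, rgb2))
        depth_errs.append(float(np.mean(np.abs(depth1 - depth2))))
    return float(np.mean(cpsnrs)), float(np.mean(depth_errs))
```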

Loss & Training

ViFiGen (derived from best practices identified through the analysis):

  • Adopts a video diffusion model architecture (video priors provide a better 3D consistency–quality trade-off).
  • Replaces CLIP with ConvNextV2 for input image encoding, preserving more fine-grained detail.
  • Incorporates an improved camera embedding design.
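
A minimal sketch of the encoder swap, assuming a timm ConvNeXtV2 backbone; the specific model variant and the way features are turned into conditioning tokens are illustrative assumptions, not the authors' exact configuration.

```python
import timm
import torch

# Illustrative encoder swap: dense ConvNeXtV2 features instead of a single CLIP
# image embedding, keeping spatial tokens that preserve fine-grained detail.
encoder = timm.create_model("convnextv2_base", pretrained=True, features_only=True)
encoder.eval()

image = torch.randn(1, 3, 224, 224)                 # stand-in for a preprocessed input view
with torch.no_grad():
    feats = encoder(image)                          # multi-scale feature maps (strides 4..32)
cond_tokens = feats[-1].flatten(2).transpose(1, 2)  # (B, H*W, C) tokens for cross-attention
print(cond_tokens.shape)                            # e.g. torch.Size([1, 49, 1024])
```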

Key Experimental Results

Main Results

Performance comparison on CO3D (real) and GSO (synthetic); representative rows from the 12 evaluated methods:

| Method | CO3D CD↓ | CO3D cPSNR↑ | CO3D IQ-vlm↑ | GSO CD↓ | GSO cPSNR↑ | GSO IQ-vlm↑ |
|---|---|---|---|---|---|---|
| Zero123 | 12.06 | 13.16 | 0.38 | 10.99 | 17.37 | 0.73 |
| SyncDreamer | 3.04 | 25.30 | 0.12 | 2.99 | 26.83 | 0.53 |
| SV3D | 3.48 | 23.72 | 0.29 | 3.47 | 26.75 | 0.77 |
| Hi3D | 5.60 | 20.92 | 0.35 | 3.29 | 24.60 | 0.87 |
| ViFiGen (Ours) | 3.02 | 25.82 | 0.29 | 3.15 | 28.93 | 0.82 |

Ablation Study

Metric robustness validation (self-consistency scores on GT under varying numbers of views and camera settings):

| # Views | Camera Setting | CD↓ | cPSNR↑ | cSSIM↑ |
|---|---|---|---|---|
| 16 | [17] | 1.993 | 30.281 | 0.924 |
| 18 | [9] | 2.119 | 30.688 | 0.934 |
| 20 | [37] | 2.133 | 30.448 | 0.925 |
| 20 | [54] | 2.091 | 30.800 | 0.932 |
| Relative Std. Dev. | – | 0.026 | 0.006 | 0.004 |

Metric validation: Relative standard deviations below 8% confirm that the metrics are insensitive to the number of views and camera configuration.
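
To make explicit how the "Relative Std. Dev." row is read (standard deviation divided by mean, per metric column), a quick check against the table values; exact third-decimal agreement depends on the rounding of the table entries.

```python
import numpy as np

# Metric columns from the robustness-validation table above.
cd    = np.array([1.993, 2.119, 2.133, 2.091])
cpsnr = np.array([30.281, 30.688, 30.448, 30.800])
cssim = np.array([0.924, 0.934, 0.925, 0.932])

for name, col in [("CD", cd), ("cPSNR", cpsnr), ("cSSIM", cssim)]:
    print(name, round(float(col.std() / col.mean()), 3))
# CD 0.026, cPSNR 0.007, cSSIM 0.005 -- close to the reported 0.026 / 0.006 / 0.004
```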

VLM metrics align well with human preference (400 images, 10 users), with a Pearson correlation evaluated at the 0.95 confidence level; oFID matches human rankings at 0.77, compared to 0.50 for conventional FID.

Key Findings

  1. Trade-off between 3D consistency and image quality: No existing method occupies the top-right corner (optimal on both axes); methods tend to be either consistent but lacking detail, or detailed but inconsistent.
  2. Significant synthetic-to-real gap: All methods exhibit substantial performance degradation on real images, especially on IQ-vlm.
  3. Video-based models are superior: Video diffusion models such as SV3D and ViFiGen achieve a better balance between consistency and quality.
  4. Critical design choices:
    • Input image encoder: ConvNextV2 > CLIP (richer detail preservation).
    • Camera embedding: explicit camera parameter encoding outperforms implicit approaches.
    • Video prior: spatiotemporal attention provides a natural 3D consistency constraint.
  5. Robustness is universally insufficient: Most methods are sensitive to changes in elevation and azimuth.

Highlights & Insights

  • Evaluation paradigm innovation: The self-consistency metric is an elegant solution for assessing 3D consistency in generative models and is generalizable to other 3D generation tasks.
  • Value of systematic analysis: Large-scale comparative analysis across 12 methods, 4 datasets, and 3 evaluation dimensions identifies critical design choices.
  • Resampling from Gaussians rather than using centers: Resampling point clouds via covariance matrices yields more accurate CD computation than directly using Gaussian centers.
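
A minimal sketch of the "resample rather than use centers" idea behind the CD computation; the opacity-proportional sample allocation and the exact Chamfer definition (mean nearest-neighbour distance in both directions) are assumptions for illustration, not the paper's precise procedure.

```python
import numpy as np
from scipy.spatial import cKDTree

def resample_gaussians(means, covs, opacities, n_points=60_000, seed=0):
    """Draw a point cloud from 3D Gaussians (means (M,3), covs (M,3,3)) instead of
    taking the centers; samples are allocated proportionally to opacity (an
    illustrative choice, not necessarily the paper's exact weighting)."""
    rng = np.random.default_rng(seed)
    weights = opacities / opacities.sum()
    counts = rng.multinomial(n_points, weights)
    chunks = [rng.multivariate_normal(means[i], covs[i], size=c)
              for i, c in enumerate(counts) if c > 0]
    return np.concatenate(chunks, axis=0)

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point clouds p (N,3) and q (M,3),
    using KD-trees for nearest-neighbour lookups."""
    d_pq, _ = cKDTree(q).query(p)  # nearest point in q for each point of p
    d_qp, _ = cKDTree(p).query(q)  # nearest point in p for each point of q
    return float(d_pq.mean() + d_qp.mean())
```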

Limitations & Future Work

  • Errors introduced by 3DGS fitting itself may confound consistency evaluation (though experiments show the impact is below 8%).
  • Evaluation is limited to object-level MVG; scene-level generation is not covered.
  • VLM-based evaluation is constrained by the capabilities of current VLMs and may shift as VLMs evolve.
  • Text-guided MVG methods are not evaluated.

Related Work

  • Zero123 / SV3D / SyncDreamer: Representative MVG methods, each exhibiting distinct consistency–quality trade-offs.
  • MEt3R: A concurrent work focusing on consistency in 3D scene generation; this paper targets object-level generation.
  • LGM: A fast method for reconstructing 3DGS from multi-view images, used here to render evaluation views.

Rating

  • Novelty: ⭐⭐⭐⭐ (innovative self-consistency metric)
  • Technical Depth: ⭐⭐⭐⭐ (comprehensive evaluation framework design)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (12 methods × 4 datasets × 3 dimensions)
  • Practical Value: ⭐⭐⭐⭐⭐ (benchmark tooling is highly valuable to the community)