# MVGBench: A Comprehensive Benchmark for Multi-view Generation Models
**Conference:** ICCV 2025 · **arXiv:** 2507.00006 · **Code:** Project Page · **Area:** 3D Vision · **Keywords:** multi-view generation, 3D consistency, benchmark, 3DGS, vision-language model evaluation
## TL;DR
This paper presents MVGBench, a comprehensive evaluation framework for multi-view generation (MVG) models. It introduces a novel 3D consistency metric based on 3DGS self-consistency (requiring no 3D ground truth), systematically evaluates 12 state-of-the-art methods across three dimensions—peak performance, generalization, and robustness—and derives a new method, ViFiGen, from the identified best practices.
## Background & Motivation

### Limitations of Prior Work
Background: Multi-view generation (MVG) models are central to modern 3D content creation, yet evaluation methodologies lag significantly behind, suffering from three key problems:
- Unfair comparison against GT: generative models sample from a distribution of plausible solutions, so outputs may differ from the ground truth yet remain valid; per-view PSNR/SSIM against GT penalizes such samples, and these metrics ignore 3D consistency entirely.
- Incomparable methods: different MVG methods adopt different camera configurations (focal length, distance, elevation), so numbers computed against each method's own GT renderings are not comparable across methods.
- Insufficient evaluation dimensions: critical aspects such as generalization to real-world images and robustness to input perturbations are not covered.
Core Insight: If the generated multi-view images are 3D-consistent, then 3DGS models reconstructed independently from disjoint subsets of those views should be mutually similar. This motivates the proposed self-consistency metric, which requires no 3D ground truth.
## Method

### Overall Architecture
MVGBench evaluates three aspects:

- Peak performance: each method is assessed under its own optimal camera configuration.
- Generalization to real images: evaluated on manually annotated CO3D and MVImgNet datasets.
- Robustness: sensitivity to variations in elevation, azimuth, and illumination.
### Key Designs
- 3D Consistency Metric (Self-Consistency), sketched in code after this list:
    - Given \(N\) views generated by an MVG model, partition them into two disjoint subsets \(\mathcal{I}_1, \mathcal{I}_2\).
    - Fit separate 3DGS models \(\mathcal{G}_1, \mathcal{G}_2\) to each subset.
    - Geometric consistency:
        - Chamfer Distance (CD): computed between point clouds resampled from the Gaussians using their covariance matrices (not the Gaussian centers), downsampled to 60k points.
        - Depth error \(e_d\): render depth maps from the same viewpoint and compute the L1 difference.
    - Texture consistency: render RGB images from the same viewpoint and compute cPSNR, cSSIM, and cLPIPS.
    - Advantages: enables quantitative evaluation on real images (no 3D GT required) and fair comparison across methods (each using its own optimal camera setup).
- Alignment Scheme for Fair Comparison:
    - Synthetic data: 3D objects are normalized to a unit cube and input images are rendered with each method's own training camera settings, so the reconstructed 3DGS models are inherently aligned.
    - Real data: ICP with uniform scaling aligns each method's 3DGS to a reference 3DGS.
    - For methods with different numbers of output views, a small degree of overlap between subsets is permitted to equalize the upper bound of 3DGS fitting accuracy.
- Semantic and Image Quality Metrics:
    - oFID: object-level FID, computed per object and then averaged rather than over the full dataset; more consistent with human preference (Pearson correlation 0.69).
    - IQ-vlm: a pretrained VLM assesses image quality via binary (yes/no) responses; strongly correlated with human ratings.
    - Semantic consistency: a VLM evaluates whether the generated views match the input in category, color, and style.
- Evaluation Datasets:
    - Peak performance: GSO (100 objects), OmniObject3D (202 objects).
    - Generalization: CO3D (102 images), MVImgNet (230 images); front-facing views are manually selected and elevation angles annotated.
    - Robustness: GSO30 rendered under varying elevations, azimuths, and illumination conditions.
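Putting the pieces above together, here is a minimal sketch of the self-consistency computation. It is not the official MVGBench implementation: `fit_3dgs` and `render` are hypothetical stand-ins for a 3DGS trainer and rasterizer, and only CD, scaled ICP (the real-data alignment), and cPSNR are shown.

```python
# Minimal sketch of the self-consistency metric (not the official MVGBench
# code). `fit_3dgs(images, cameras)` and `render(gaussians, camera)` are
# hypothetical stand-ins for a 3DGS trainer and rasterizer.
import numpy as np
import open3d as o3d

def sample_from_gaussians(means, covs, n_per_gaussian=8, n_total=60_000, seed=0):
    """Resample a point cloud from the fitted Gaussians via their covariance
    matrices (rather than taking the Gaussian centers), then downsample."""
    rng = np.random.default_rng(seed)
    pts = np.concatenate([rng.multivariate_normal(mu, cov, size=n_per_gaussian)
                          for mu, cov in zip(means, covs)])
    keep = rng.choice(len(pts), size=min(n_total, len(pts)), replace=False)
    return pts[keep]

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two (N, 3) point clouds."""
    pa = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(a))
    pb = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(b))
    return (np.asarray(pa.compute_point_cloud_distance(pb)).mean()
            + np.asarray(pb.compute_point_cloud_distance(pa)).mean())

def align_scaled_icp(src, dst, threshold=0.05):
    """ICP with uniform scaling (the real-data alignment step); objects are
    assumed roughly normalized, hence the small correspondence threshold."""
    ps = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(src))
    pd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(dst))
    est = o3d.pipelines.registration.TransformationEstimationPointToPoint(
        with_scaling=True)
    reg = o3d.pipelines.registration.registration_icp(
        ps, pd, threshold, np.eye(4), est)
    ps.transform(reg.transformation)
    return np.asarray(ps.points)

def psnr(a, b):
    """PSNR between two images with values in [0, 1]."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(1.0 / mse)

def self_consistency(views, cameras, fit_3dgs, render, real_data=False):
    # 1) Partition the generated views into two disjoint subsets.
    half = len(views) // 2
    g1 = fit_3dgs(views[:half], cameras[:half])
    g2 = fit_3dgs(views[half:], cameras[half:])
    # 2) Geometric consistency: CD between resampled point clouds.
    p1 = sample_from_gaussians(g1.means, g1.covs)
    p2 = sample_from_gaussians(g2.means, g2.covs)
    if real_data:                      # synthetic data is aligned by design
        p2 = align_scaled_icp(p2, p1)
    cd = chamfer_distance(p1, p2)
    # 3) Texture consistency: render both models from the same viewpoints.
    cpsnr = np.mean([psnr(render(g1, c)[0], render(g2, c)[0]) for c in cameras])
    return cd, cpsnr
```

Because the two 3DGS models are fitted to disjoint generated views, any residual disagreement between them is attributable to inconsistency in the generation rather than to a mismatch with some reference reconstruction.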
### Loss & Training
ViFiGen (derived from the best practices identified through the analysis):

- Adopts a video diffusion model architecture (video priors provide a better 3D consistency–quality trade-off).
- Replaces CLIP with ConvNextV2 for input-image encoding, preserving more fine-grained detail (see the sketch below).
- Incorporates an improved camera embedding design.
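A minimal sketch of what the encoder swap could look like, assuming timm's pretrained ConvNeXt V2 weights; the checkpoint name, projection layer, and token layout are illustrative assumptions, not ViFiGen's published code:

```python
import timm
import torch.nn as nn

class ImageConditionEncoder(nn.Module):
    """Encode the input image into patch-level tokens for cross-attention.
    A convolutional feature map keeps fine-grained spatial detail that a
    single CLIP class token discards."""
    def __init__(self, token_dim=1024):
        super().__init__()
        # features_only exposes intermediate feature maps instead of logits.
        self.backbone = timm.create_model(
            "convnextv2_base.fcmae_ft_in22k_in1k",  # assumed checkpoint
            pretrained=True, features_only=True)
        feat_dim = self.backbone.feature_info.channels()[-1]
        self.proj = nn.Linear(feat_dim, token_dim)

    def forward(self, images):                    # (B, 3, H, W)
        fmap = self.backbone(images)[-1]          # (B, C, h, w), last stage
        tokens = fmap.flatten(2).transpose(1, 2)  # (B, h*w, C)
        return self.proj(tokens)                  # (B, h*w, token_dim)
```

The resulting patch tokens would then replace the CLIP embedding as the conditioning signal for the diffusion backbone's cross-attention.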
## Key Experimental Results

### Main Results
Performance comparison of the 12 methods on CO3D (real) and GSO (synthetic); representative entries shown below:
| Method | CO3D CD↓ | CO3D cPSNR↑ | CO3D IQ-vlm↑ | GSO CD↓ | GSO cPSNR↑ | GSO IQ-vlm↑ |
|---|---|---|---|---|---|---|
| Zero123 | 12.06 | 13.16 | 0.38 | 10.99 | 17.37 | 0.73 |
| SyncDreamer | 3.04 | 25.30 | 0.12 | 2.99 | 26.83 | 0.53 |
| SV3D | 3.48 | 23.72 | 0.29 | 3.47 | 26.75 | 0.77 |
| Hi3D | 5.60 | 20.92 | 0.35 | 3.29 | 24.60 | 0.87 |
| ViFiGen (Ours) | 3.02 | 25.82 | 0.29 | 3.15 | 28.93 | 0.82 |
### Ablation Study
Metric robustness validation (self-consistency scores on GT under varying numbers of views and camera settings):
| # Views | Camera Setting | CD↓ | cPSNR↑ | cSSIM↑ |
|---|---|---|---|---|
| 16 | [17] | 1.993 | 30.281 | 0.924 |
| 18 | [9] | 2.119 | 30.688 | 0.934 |
| 20 | [37] | 2.133 | 30.448 | 0.925 |
| 20 | [54] | 2.091 | 30.800 | 0.932 |
| Relative Std. Dev. | - | 0.026 | 0.006 | 0.004 |
Metric validation: relative standard deviations stay below 8% across view counts and camera configurations (at most 2.6%, for CD), confirming that the self-consistency metrics are insensitive to these choices; the last table row can be reproduced as shown below.
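The relative standard deviation row can be checked directly against the table values (standard deviation divided by the mean):

```python
import numpy as np

cd    = np.array([1.993, 2.119, 2.133, 2.091])
cpsnr = np.array([30.281, 30.688, 30.448, 30.800])

def rel_std(x):
    return np.std(x) / np.mean(x)  # population standard deviation / mean

print(f"CD: {rel_std(cd):.3f}  cPSNR: {rel_std(cpsnr):.3f}")
# CD: 0.026  cPSNR: 0.007  (the table reports 0.026 and 0.006; the small
# cPSNR difference is presumably just rounding)
```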
VLM-based metrics align well with human preference in a user study (400 images, 10 users; Pearson correlations with 95% confidence intervals): oFID matches human rankings at 0.77, versus 0.50 for conventional FID.
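To make the oFID structure concrete, a minimal sketch (the standard Fréchet distance applied per object and then averaged; the image feature extractor is left abstract, and with only a handful of views per object the covariance estimates are necessarily coarse):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """FID between two (N, D) feature sets:
    ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2})."""
    mu1, mu2 = feats_a.mean(axis=0), feats_b.mean(axis=0)
    c1 = np.cov(feats_a, rowvar=False)
    c2 = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):       # numerical noise can introduce tiny
        covmean = covmean.real         # imaginary components
    return float(((mu1 - mu2) ** 2).sum() + np.trace(c1 + c2 - 2.0 * covmean))

def ofid(gt_feats_per_object, gen_feats_per_object):
    """Object-level FID: Frechet distance per object, then averaged, rather
    than a single distance over the pooled dataset."""
    return float(np.mean([frechet_distance(gt, gen) for gt, gen
                          in zip(gt_feats_per_object, gen_feats_per_object)]))
```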
### Key Findings
- Trade-off between 3D consistency and image quality: No existing method occupies the top-right corner (optimal on both axes); methods tend to be either consistent but lacking detail, or detailed but inconsistent.
- Significant synthetic-to-real gap: All methods exhibit substantial performance degradation on real images, especially on IQ-vlm.
- Video-based models are superior: Video diffusion models such as SV3D and ViFiGen achieve a better balance between consistency and quality.
- Critical design choices (a camera-embedding sketch follows this list):
    - Input image encoder: ConvNextV2 > CLIP (richer detail preservation).
    - Camera embedding: explicit camera parameter encoding outperforms implicit approaches.
    - Video prior: spatiotemporal attention provides a natural 3D consistency constraint.
- Robustness is universally insufficient: Most methods are sensitive to changes in elevation and azimuth.
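To make "explicit camera parameter encoding" concrete, here is a minimal sketch under an assumed parameterization: each target view's elevation/azimuth/distance is mapped through sinusoidal features and an MLP. The dimensions and injection point are illustrative, not ViFiGen's published design.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embed(x, dim=64):
    """Map scalar camera parameters to sinusoidal features, in the style of
    diffusion timestep embeddings."""
    half = dim // 2
    freqs = torch.exp(-math.log(10_000.0)
                      * torch.arange(half, dtype=torch.float32) / half)
    angles = x[..., None] * freqs                            # (..., half)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)   # (..., dim)

class CameraEmbedding(nn.Module):
    """Explicitly encode (elevation, azimuth, distance) for each target view
    instead of leaving the camera implicit."""
    def __init__(self, dim=64, out_dim=1024):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(nn.Linear(3 * dim, out_dim), nn.SiLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, elev, azim, dist):      # each of shape (B, V)
        feats = torch.cat([sinusoidal_embed(p, self.dim)
                           for p in (elev, azim, dist)], dim=-1)
        return self.mlp(feats)                # (B, V, out_dim)
```

The resulting per-view embedding would typically be added to the latent tokens of the corresponding frame, analogous to a timestep embedding.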
## Highlights & Insights
- Evaluation paradigm innovation: The self-consistency metric is an elegant solution for assessing 3D consistency in generative models and is generalizable to other 3D generation tasks.
- Value of systematic analysis: Large-scale comparative analysis across 12 methods, 4 datasets, and 3 evaluation dimensions identifies critical design choices.
- Resampling from Gaussians rather than using centers: Resampling point clouds via covariance matrices yields more accurate CD computation than directly using Gaussian centers.
## Limitations & Future Work
- Errors introduced by 3DGS fitting itself may confound consistency evaluation (though experiments show the impact is below 8%).
- Evaluation is limited to object-level MVG; scene-level generation is not covered.
- VLM-based evaluation is constrained by the capabilities of current VLMs and may shift as VLMs evolve.
- Text-guided MVG methods are not evaluated.
## Related Work & Insights
- Zero123 / SV3D / SyncDreamer: Representative MVG methods, each exhibiting distinct consistency–quality trade-offs.
- MEt3R: A concurrent work focusing on consistency in 3D scene generation; this paper targets object-level generation.
- LGM: A fast method for reconstructing 3DGS from multi-view images, used here to render evaluation views.
## Rating
- Novelty: ⭐⭐⭐⭐ (innovative self-consistency metric)
- Technical Depth: ⭐⭐⭐⭐ (comprehensive evaluation framework design)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (12 methods × 4 datasets × 3 dimensions)
- Practical Value: ⭐⭐⭐⭐⭐ (benchmark tooling is highly valuable to the community)