3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement¶

Conference: CVPR 2025
arXiv: 2412.18565
Code: https://github.com/yihangluo/3DEnhancer
Area: 3D Vision / Neural Rendering
Keywords: Multi-view Diffusion, 3D Enhancement, View Consistency, Epipolar Aggregation, Texture Refinement

TL;DR¶

Proposes a 3D enhancement framework built on a multi-view latent diffusion model. A pose-aware encoder, multi-view row attention, and adjacent-view epipolar aggregation allow it to significantly improve the texture quality of low-quality 3D generation results while maintaining cross-view consistency.

Background & Motivation¶

Background: Current 3D generation mainly adopts a two-stage pipeline: a multi-view diffusion model (e.g., MVDream) first generates images from multiple angles, and a feed-forward reconstruction model (e.g., LGM) then produces the 3D model. However, high-quality 3D data is scarce: datasets such as Objaverse are small in scale and fall far short of billion-scale 2D image datasets.

Limitations of Prior Work: Images generated by multi-view diffusion models suffer from two critical issues: low resolution/coarse textures, and a severe lack of consistency between views. These problems directly propagate to the final 3D reconstruction quality.

Key Challenge: Existing enhancement methods have their own limitations: image SR methods (e.g., RealESRGAN) process each view independently, failing to guarantee cross-view consistency; video SR methods (e.g., Upscale-A-Video) rely on temporal attention, which fails when facing large viewpoint variations; UV-space enhancement is only applicable to meshes with UV coordinates.

Key Insight: The authors' key insight is that if high-quality and multi-view consistent 2D renderings can be obtained, the quality of 3D generation will improve accordingly. Therefore, instead of altering the 3D representation itself, this work focuses on "enhancing the intermediate multi-view images."

Core Idea: Design a multi-view diffusion framework specifically tailored for 3D enhancement, combining implicit (row attention) and explicit (epipolar aggregation) mechanisms to guarantee multi-view consistency.

Method¶

Overall Architecture¶

The framework is based on a DiT latent diffusion model (utilizing PixArt-Σ as the backbone), taking low-quality multi-view images and corresponding camera poses as inputs, and outputting enhanced high-quality multi-view images. The framework consists of: a pose-aware encoder (to inject camera information), view-consistent DiT blocks (including row attention and epipolar aggregation), and a diverse data synthesis/augmentation pipeline. The enhancement results can be directly input into LGM for 3D reconstruction, or serve as pseudo ground truths to iteratively optimize coarse 3D models.

Key Designs¶

Pose-Aware Encoder:
- Function: Encodes low-quality multi-view images and camera poses into latent representations.
- Mechanism: Employs Plücker coordinates \(\mathbf{r}_v^i = (\mathbf{d}^i, \mathbf{o}^i \times \mathbf{d}^i) \in \mathbb{R}^6\), where \(\mathbf{o}^i\) and \(\mathbf{d}^i\) are the origin and direction of the ray through pixel \(i\) of view \(v\), to encode the camera pose. These are concatenated with the RGB values along the channel dimension and fed into a trainable encoder \(\mathcal{E}_\psi\), whose output is injected into the pre-trained DiT via a learnable copy.
- Design Motivation: Plücker coordinates provide a compact 6D representation that effectively encodes ray information in 3D space, enabling the network to learn camera-content correspondences.
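
To make the 6D ray embedding concrete, here is a minimal sketch of the Plücker formula \((\mathbf{d}, \mathbf{o} \times \mathbf{d})\) for a single ray, written in plain Python so it runs without dependencies. The function names are illustrative and not taken from the 3DEnhancer code; normalizing the direction is an assumption commonly made with this parameterization.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def plucker_ray(origin, direction):
    """Return the 6D embedding (d, o x d) of one camera ray.

    Hypothetical helper: direction is normalized first (an assumption).
    """
    d = normalize(direction)
    moment = cross(origin, d)  # o x d
    return d + moment          # tuple concatenation gives 6 values

# Camera centre at (1, 0, 0), ray pointing along -Z.
ray = plucker_ray((1.0, 0.0, 0.0), (0.0, 0.0, -2.0))

# The embedding identifies the line, not the point on it: sliding the
# origin along the ray leaves all six values unchanged.
same_ray = plucker_ray((1.0, 0.0, -3.0), (0.0, 0.0, -1.0))
```

Concatenating these six values with the three RGB channels gives nine channels per pixel, which is the kind of input the encoder described above would receive.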
Multi-View Row Attention (Implicit Consistency):
- Function: Performs cross-view attention interactions on the same horizontal line of multi-view features.
- Mechanism: By epipolar geometry, under the common camera configuration where the Y-axis is aligned with gravity, epipolar lines can be approximated as horizontal lines. Self-attention is therefore extended to all positions on the same row \(Y=v\) across all views (here \(v\) is a row coordinate, not a view index), which gives efficient cross-view information exchange.
- Design Motivation: Features significantly lower computational and memory overhead than dense multi-view attention, while implicitly capturing correspondences across views.
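
The row-wise attention pattern can be sketched as follows. This is a toy, dependency-free illustration under simplifying assumptions: queries, keys, and values are the raw tokens (no learned projections, no multiple heads), and `feats[v][y][x]` is a hypothetical layout for the token at row `y`, column `x` of view `v`. It is not the paper's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def row_attention(feats):
    """Self-attention restricted to tokens that share a row, across all views."""
    n_views, height, width = len(feats), len(feats[0]), len(feats[0][0])
    dim = len(feats[0][0][0])
    out = [[[None] * width for _ in range(height)] for _ in range(n_views)]
    for y in range(height):
        # Gather row y from every view: n_views * width tokens, view-major order.
        tokens = [feats[v][y][x] for v in range(n_views) for x in range(width)]
        for idx, q in enumerate(tokens):
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
            weights = softmax(scores)
            mixed = [sum(w * k[c] for w, k in zip(weights, tokens)) for c in range(dim)]
            out[idx // width][y][idx % width] = mixed
    return out

# Two views, two rows, one column, 2-dim tokens.
feats = [
    [[[1.0, 0.0]], [[0.0, 5.0]]],  # view 0
    [[[1.0, 0.0]], [[0.0, 7.0]]],  # view 1
]
out = row_attention(feats)
```

In this sketch, row 0 never sees row 1, so information only flows between views along the same horizontal line. For \(N\) views of size \(H \times W\), dense multi-view attention compares \((NHW)^2\) token pairs, while the row-wise variant compares \(H \cdot (NW)^2\), a factor of \(H\) fewer.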
Adjacent-View Epipolar Aggregation (Explicit Consistency):
- Function: Explicitly propagates corresponding features from adjacent views through epipolar-constrained feature matching.
- Mechanism: For each feature position \(i\) in view \(v\), it searches each of the two nearest adjacent views \(k\) for the best-matching feature position along the epipolar line: \(M_{v,k}[i] = \arg\min_{j:\, j^\top F i = 0} D(\mathbf{f}_v[i], \mathbf{f}_k[j])\), where \(F\) is the fundamental matrix between views \(v\) and \(k\) and \(D\) is a feature distance. The matched features from the two adjacent views are then linearly fused, and the result is averaged with the original feature (weight 0.5 each) to prevent token loss during large viewpoint variations.
- Design Motivation: Relying solely on attention is insufficient for establishing precise cross-view correspondences, requiring explicit feature propagation. Introducing learnable fusion weights allows for simultaneous consideration of geometric distances and feature similarities.
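
The matching rule \(M_{v,k}[i]\) and the subsequent fusion can be illustrated with a small sketch. Several details here are assumptions rather than the paper's choices: squared Euclidean distance stands in for \(D\), a tolerance `tol` relaxes the exact constraint \(j^\top F i = 0\) on a discrete grid, and the fusion weight `w_a` is a fixed placeholder for whatever weights the model actually uses. All function names are hypothetical.

```python
def epipolar_residual(F, i, j):
    """|j^T F i| for homogeneous pixel coordinates i (view v) and j (view k)."""
    Fi = [sum(F[r][c] * i[c] for c in range(3)) for r in range(3)]
    return abs(sum(j[r] * Fi[r] for r in range(3)))

def sq_dist(a, b):
    """Squared Euclidean distance, used here as a stand-in for D."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def match_on_epipolar_line(F, i, f_i, neighbor, tol=0.5):
    """Best-matching (x, y) in `neighbor` among positions near the epipolar line.

    `neighbor` maps (x, y) -> feature vector. Returns None if nothing qualifies.
    """
    best, best_d = None, float("inf")
    for (x, y), f_j in neighbor.items():
        if epipolar_residual(F, i, (x, y, 1.0)) > tol:
            continue  # off the epipolar line: not a valid candidate
        d = sq_dist(f_i, f_j)
        if d < best_d:
            best, best_d = (x, y), d
    return best

def fuse(f_orig, f_a, f_b, w_a=0.5):
    """Linearly mix two matched neighbor features, then average with the original."""
    mixed = [w_a * a + (1.0 - w_a) * b for a, b in zip(f_a, f_b)]
    return [0.5 * o + 0.5 * m for o, m in zip(f_orig, mixed)]

# Fundamental matrix of a purely horizontal camera shift: j^T F i = y_i - y_j,
# so the epipolar line of a pixel is simply its own row.
F = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
neighbor = {
    (0, 0): [1.0, 0.0],  # identical feature, but on the wrong row
    (0, 1): [0.0, 1.0],
    (1, 1): [0.9, 0.1],
    (2, 1): [0.2, 0.8],
}
match = match_on_epipolar_line(F, (1.0, 1.0, 1.0), [1.0, 0.0], neighbor)
fused = fuse([0.0, 0.0], [4.0, 2.0], [0.0, 2.0])
```

In this toy example the unconstrained nearest feature sits at `(0, 0)`, off the epipolar line, so the constraint is what steers the match to `(1, 1)`. Dropping it would pull features from a geometrically invalid location.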
Multi-View Data Augmentation:
- Texture degradation: downsampling, blurring, noise, and JPEG compression.
- Texture deformation + camera jitter: grid warping + slight camera parameter perturbations.
- Color drift: randomly altering patch colors to simulate cross-view color inconsistency and 3DGS ghosting artifacts.
- Controllable noise: adding controllable noise to adjust enhancement intensity.
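
A rough sketch of how such a degradation pipeline might be composed is given below, covering only downsampling, additive noise, and a patch-wise color shift on tiny grayscale images. Blur, JPEG compression, grid warping, and camera jitter are omitted. The parameter ranges, the per-view independent sampling, and all function names are assumptions for illustration, not values from the paper.

```python
import random

def downsample2x(img):
    """Average-pool a grayscale image (list of rows, values in [0, 1]) by 2."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [
        [
            (img[2 * y][2 * x] + img[2 * y][2 * x + 1]
             + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(w)
        ]
        for y in range(h)
    ]

def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise and clamp to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in img]

def color_drift(img, top, left, size, shift):
    """Shift intensities in one square patch to mimic cross-view color inconsistency."""
    out = [row[:] for row in img]
    for y in range(top, min(top + size, len(img))):
        for x in range(left, min(left + size, len(img[0]))):
            out[y][x] = min(1.0, max(0.0, out[y][x] + shift))
    return out

def degrade_views(views, sigma=0.05, seed=0):
    """Degrade each view independently (an assumption) to build low-quality inputs."""
    rng = random.Random(seed)
    degraded = []
    for img in views:
        lq = add_gaussian_noise(downsample2x(img), sigma, rng)
        top, left = rng.randrange(len(lq)), rng.randrange(len(lq[0]))
        size = max(1, len(lq) // 2)
        lq = color_drift(lq, top, left, size, rng.uniform(-0.2, 0.2))
        degraded.append(lq)
    return degraded

# Two flat 4x4 views become two noisy 2x2 views with a shifted patch each.
views = [[[0.5] * 4 for _ in range(4)] for _ in range(2)]
low_quality = degrade_views(views)
```

Because each view draws its own noise and patch shift, the degraded set is inconsistent across views, which is the kind of input the enhancer has to learn to repair.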

Loss & Training¶

Uses the standard multi-view diffusion training objective \(\mathcal{L}_{MV}(\theta) = \mathbb{E}[\|\epsilon - \epsilon_\theta(\mathcal{Z}_t; y, \pi, t)\|_2^2]\). Training uses approximately 400K objects from Objaverse and takes 10 days on 8×A100-80G GPUs, at a resolution of 512×512 with a batch size of 256 and a learning rate of 2e-5. Inference uses DDIM with 20 steps and a classifier-free guidance (CFG) scale of 4.5.
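
The noise-prediction objective and the guidance step at inference can be sketched numerically. The forward-noising rule below is the standard DDPM form, and the guidance rule is the usual classifier-free guidance extrapolation. Which conditioning signals the paper drops for the unconditional branch is not stated here, so `eps_uncond` is simply a placeholder, and flat Python lists stand in for the multi-view latents \(\mathcal{Z}_t\).

```python
import math
import random

def forward_diffuse(z0, eps, alpha_bar_t):
    """Standard DDPM forward process: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]

def noise_prediction_loss(eps, eps_pred):
    """Squared L2 distance between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

def cfg_combine(eps_uncond, eps_cond, scale=4.5):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and past) the conditional one by `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in range(8)]   # stand-in for clean latents
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]  # sampled noise
z_t = forward_diffuse(z0, eps, alpha_bar_t=0.5)

perfect = noise_prediction_loss(eps, eps)        # a perfect predictor scores 0
blind = noise_prediction_loss(eps, [0.0] * 8)    # predicting zero noise scores > 0
guided = cfg_combine([0.0, 1.0], [1.0, 1.0])     # scale 4.5 amplifies the difference
```

With `scale=1.0` the guided prediction reduces to the conditional one. Larger values push the sample further toward the conditioning.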

3D Optimization and Inference¶

The enhanced multi-view images can serve as pseudo ground truths to optimize the coarse 3D representation: \(\mathcal{M}' = \arg\min_\mathcal{M} \sum_{v=1}^N \mathcal{L}(\mathbf{x}_v', \text{Rend}(\mathcal{M}, \pi_v))\), where \(\mathcal{L}\) combines L1 and LPIPS losses.

Key Experimental Results¶

Main Results: Objaverse Synthetic Dataset Multi-View Enhancement¶

Method	PSNR↑	SSIM↑	LPIPS↓
Input (LQ)	26.15	0.9056	0.1257
RealESRGAN	26.02	0.9185	0.0877
StableSR	25.12	0.8914	0.1130
RealBasicVSR	26.21	0.9212	0.0888
Upscale-A-Video	25.57	0.8937	0.1153
3DEnhancer	27.53	0.9265	0.0626

Ablation Study: Cross-View Modules¶

Configuration	Multi-View Attn	Epipolar Agg	PSNR↑	SSIM↑	LPIPS↓
(a) W/o consistency module	✗	✗	25.11	0.9067	0.081
(b) Row attention only	✓	✗	25.95	0.9147	0.072
(c) Epipolar aggregation only	✗	✓	26.92	0.9226	0.0642
(d) Combination of both	✓	✓	27.53	0.9265	0.0626

Key Findings¶

Epipolar aggregation alone contributes more (+1.81 PSNR vs. +0.84 for row attention), indicating that explicit feature correspondence is more crucial than implicit attention.
Combining the two yields a significant complementary effect (+2.42 PSNR over the baseline), where row attention provides global view information and epipolar aggregation ensures precise correspondence.
Removing epipolar constraints causes the model to aggregate textures from incorrect areas (e.g., erroneously propagating the texture from the weapon's top to the handle region).
On in-the-wild datasets, the method achieves the best 3D reconstruction quality among the compared methods (FID=71.78, IS=9.96).
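
The PSNR gains quoted above can be re-derived from the ablation table with a few lines of arithmetic. The dictionary keys are shorthand labels introduced here for configurations (a) to (d).

```python
# PSNR values copied from the ablation table, configurations (a) to (d).
psnr = {"none": 25.11, "row_attention": 25.95, "epipolar": 26.92, "both": 27.53}

base = psnr["none"]
gains = {name: round(value - base, 2) for name, value in psnr.items() if name != "none"}
sum_of_single_gains = round(gains["row_attention"] + gains["epipolar"], 2)

print(gains)                # {'row_attention': 0.84, 'epipolar': 1.81, 'both': 2.42}
print(sum_of_single_gains)  # 2.65
```

The combined gain (2.42 dB) is larger than either individual gain but smaller than their sum (2.65 dB), so the two modules partly overlap in what they fix while still adding to each other.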

Highlights & Insights¶

Problem Redefinition: Translates "poor 3D generation quality" into a "multi-view consistent enhancement" problem, identifying multi-view image quality as the key bottleneck in the two-stage pipeline. This approach is more elegant than directly modifying the 3D representation.
Implicit + Explicit Hybrid Strategy: Row attention handles global information flow with high efficiency, while epipolar aggregation excels in establishing precise correspondences; the two are complementary. This hybrid design philosophy of "efficient approximation + precise compensation" can be transferred to many domains.
Value of Epipolar Geometry Priors: Classical 3D geometric constraints (fundamental matrices, epipolar lines) still play a powerful guiding role in deep learning, being difficult to replace with pure end-to-end learning.
Plug-and-Play Design: Can be seamlessly integrated into existing pipelines like MVDream \(\rightarrow\) LGM, or directly optimize NeRF/3DGS, exhibiting strong generalizability.

Limitations & Future Work¶

The assumption that the camera Y-axis is aligned with gravity and that views are roughly horizontal limits the application scope; scenes with large pitch angles, for example, violate it.
Epipolar aggregation only considers the two nearest adjacent views, potentially missing useful information from further views.
Because the method focuses on texture enhancement, its ability to correct 3D geometric structures (such as the Janus problem) is limited.
Training requires rendering 400K 3D objects along with extensive data augmentation, making the computational cost relatively high.

vs SuperGaussian/3DGS-Enhancer: These methods use video diffusion models for 3D enhancement, but temporal attention fails under large viewpoint variations. This work instead builds multi-view geometry into the model through row attention (implicitly) and epipolar aggregation (explicitly), making it more suitable for multi-view scenarios.
vs RealESRGAN/StableSR: Single-view enhancement cannot guarantee cross-view consistency, whereas this work addresses the issue from the perspective of joint multi-view enhancement.
vs TokenFlow: TokenFlow propagates tokens in video editing, whereas this work introduces learnable fusion weights and epipolar constraints, making it better suited for 3D geometric scenarios.

Rating¶

Novelty: ⭐⭐⭐⭐ The formulation of multi-view consistency as the core problem of 3D enhancement is novel, and the hybrid implicit + explicit design strategy is ingenious. However, the fundamental technologies (diffusion models, epipolar geometry) are well-known.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on both synthetic and in-the-wild datasets, with ablation studies clearly proving the contributions of individual modules. However, a systematic analysis of different camera configurations is lacking.
Writing Quality: ⭐⭐⭐⭐ Clear logic, detailed and reproducible method descriptions, and rich information in tables and figures.
Value: ⭐⭐⭐⭐ Solves practical pain points in 3D generation, benefits from a highly practical plug-and-play design, and open-sources the code.