Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework¶
Conference: ICCV 2025 | arXiv: 2508.04090 | Code: Project Page | Area: 3D Vision | Keywords: 3D super-resolution, diffusion models, 3D Gaussian splatting, multi-view consistency, novel view synthesis
TL;DR¶
This paper proposes 3DSR, a framework that integrates 2D diffusion-based super-resolution with 3D Gaussian Splatting (3DGS) representations. At each diffusion denoising step, multi-view 3D consistency is enforced via 3DGS rendering, enabling high-fidelity and spatially consistent 3D scene super-resolution.
Background & Motivation¶
3D representation learning (e.g., NeRF, 3DGS) has achieved remarkable progress in novel view synthesis, yet synthesized images often lack fine-grained details due to the limited spatial resolution of input camera views. Existing super-resolution approaches suffer from three major limitations:
Image Super-Resolution (ISR): Processes each view independently, introducing cross-view inconsistencies that lead to artifacts in 3D reconstruction.
Video Super-Resolution (VSR): Achieves a degree of consistency through implicit spatio-temporal aggregation, but cannot guarantee 3D view consistency and exhibits poor detail preservation.
Diffusion-based Super-Resolution: Capable of generating high-quality details but prone to hallucinations, resulting in texture inconsistencies across views.
Core insight: diffusion models excel at generating high-frequency details but lack 3D geometric awareness, while 3DGS excels at enforcing multi-view consistency. Combining both yields the advantages of each.
Method¶
Overall Architecture¶
At each denoising timestep, 3DSR executes the following loop: (a) the diffusion SR model predicts clean latents → (b) decodes them into high-resolution images → (c) trains 3DGS with the high-resolution images and renders consistent views → (d) encodes the 3D-consistent renderings back into latent space → (e) performs the denoising step using the 3D-consistent latents. The encoder, decoder, and diffusion model are all kept frozen; only the 3D representation is optimized.
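A minimal sketch of one such denoising step, assuming hypothetical interfaces (`sr_model.predict_x0`, `sr_model.ddim_step`, `vae.encode`/`vae.decode`, `gaussians.fit`/`gaussians.render`) that are illustrative, not from the authors' code:

```python
def threedsr_step(x_t, t, cameras, sr_model, vae, gaussians, gs_iters=5_000):
    """One 3DSR denoising step: stages (a)-(e) of the loop above."""
    x0_hat = sr_model.predict_x0(x_t, t)        # (a) per-view clean-latent estimates
    H = vae.decode(x0_hat)                      # (b) intermediate HR images H^i
    gaussians.fit(images=H, cameras=cameras,    # (c) optimize 3DGS on H^i ...
                  iters=gs_iters)
    R = gaussians.render(cameras)               #     ... and render consistent views R^i
    x0_cons = vae.encode(R)                     # (d) 3D-consistent latents
    return sr_model.ddim_step(x_t, t, x0_cons)  # (e) denoise with consistent latents
```

Note that only `gaussians` carries trainable state here; the SR model and VAE stay frozen throughout, matching the paper's description.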
Key Designs¶
- Multiview Consistent Noise Sampling: Given a set of low-resolution (LR) input images \(\{L^i\}\) and camera poses \(\{P^i\}\), a 3DGS model is first pre-trained on the LR images. At the initial diffusion sampling step, noise latents \(x_t^i\) are randomly initialized, and the SR model estimates clean latents \(\hat{x}_0^i\), which are decoded into intermediate high-resolution images \(H^i\). The critical step is to update the 3DGS representation with \(H^i\) and render 3D-consistent high-resolution images \(R^i\), which are re-encoded into latent space to yield 3D-consistent latents \(\ddot{x}_0^i\). These replace the original estimate \(\hat{x}_0^i\) in the denoising update (see the sampling sketch after this list):
\(x_{t-1}^i = \sqrt{\alpha_{t-1}} \ddot{x}_0^i + \eta_t \epsilon_\theta(x_t^i, t) + \sigma_t \epsilon_t\)
- Subsampling-based Regularization: The total loss comprises a high-resolution supervision term and a low-resolution consistency term (see the loss sketch after this list):
\(\mathcal{L}_{\text{all}} = \mathcal{L}_{\text{hr}}(H^i, R^i) + \lambda \mathcal{L}_{\text{lr}}(L^i, R_{lr}^i)\)
where \(R_{lr}^i\) is the rendered image downsampled to LR resolution. Both terms use a combination of \(\ell_1\) loss and D-SSIM loss:
\(\mathcal{L}_\alpha = (1-\delta)\mathcal{L}_1^\alpha + \delta \mathcal{L}_{\text{D-SSIM}}^\alpha, \quad \alpha \in \{\text{hr}, \text{lr}\}\)
- Plug-and-Play SR Model Compatibility: The framework does not depend on a specific diffusion SR model; both single-image (ISR) and video (VSR) models can be dropped in, as can future, more advanced SR methods. Unlike SuperGaussian, the proposed method requires no fine-tuning of video models and enforces consistency more directly through an explicit 3D representation.
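A minimal sketch of the modified sampling update, substituting the 3D-consistent latents \(\ddot{x}_0^i\) for the model's own estimate; the schedule coefficients \(\alpha_{t-1}\), \(\eta_t\), \(\sigma_t\) are assumed to be supplied by the sampler:

```python
import math
import torch

def consistent_ddim_step(x_t, x0_consistent, eps_pred, alpha_prev, eta_t, sigma_t):
    """x_{t-1} = sqrt(alpha_{t-1}) * x0_consistent + eta_t * eps_theta + sigma_t * eps."""
    noise = torch.randn_like(x_t) if sigma_t > 0 else torch.zeros_like(x_t)
    return math.sqrt(alpha_prev) * x0_consistent + eta_t * eps_pred + sigma_t * noise
```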
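And a sketch of the subsampling-based regularization, under stated assumptions: `ssim_fn` is a stand-in for any differentiable SSIM implementation (D-SSIM is taken as \(1 - \text{SSIM}\) here; conventions vary), and antialiased bilinear interpolation stands in for the paper's downsampling operator:

```python
import torch.nn.functional as F

def l1_dssim(pred, target, ssim_fn, delta=0.2):
    """Per-term loss L_alpha = (1 - delta) * L1 + delta * D-SSIM."""
    l1 = (pred - target).abs().mean()
    dssim = 1.0 - ssim_fn(pred, target)  # D-SSIM from an SSIM score
    return (1.0 - delta) * l1 + delta * dssim

def total_loss(H, R, L, ssim_fn, delta=0.2, lam=1.0):
    """L_all = L_hr(H, R) + lambda * L_lr(L, downsample(R))."""
    R_lr = F.interpolate(R, size=L.shape[-2:], mode="bilinear", antialias=True)
    return l1_dssim(H, R, ssim_fn, delta) + lam * l1_dssim(L, R_lr, ssim_fn, delta)
```

The LR term anchors the rendered scene to the ground-truth inputs, which keeps the 3DGS optimization from drifting toward hallucinated diffusion details.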
Loss & Training¶
- Mip-Splatting is used as the 3D representation, pre-trained for 30K iterations on LR images.
- The number of diffusion sampling steps is set to 4 (using StableSR-Turbo), with 3DGS trained for 5K iterations per step.
- Subsampling weight \(\lambda=1\), D-SSIM weight \(\delta=0.2\).
- All experiments are conducted on a single Nvidia A6000 GPU (48 GB).
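For reference, the stated hyperparameters collected into a single config sketch (key names are illustrative, not from the authors' code):

```python
config = dict(
    representation="Mip-Splatting",  # 3D backbone
    lr_pretrain_iters=30_000,        # 3DGS pre-training on LR inputs
    diffusion_steps=4,               # StableSR-Turbo sampling steps
    gs_iters_per_step=5_000,         # 3DGS optimization per denoising step
    lambda_lr=1.0,                   # subsampling (LR consistency) weight
    delta_dssim=0.2,                 # D-SSIM weight
)
```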
Key Experimental Results¶
Main Results¶
LLFF dataset (×8 downsampling, ×4 upsampling):
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | NIQE↓ | MEt3R↓ | FID↓ |
|---|---|---|---|---|---|---|
| SuperGaussian | 23.054 | 0.725 | 0.296 | 4.553 | 0.541 | 51.199 |
| DiSR-NeRF | 22.504 | 0.697 | 0.310 | 6.293 | 0.518 | 54.138 |
| StableSR | 22.748 | 0.717 | 0.219 | 4.793 | 0.531 | 41.129 |
| 3DSR (Ours) | 24.212 | 0.754 | 0.181 | 4.632 | 0.516 | 20.731 |
MipNeRF360 dataset (×8 downsampling, ×4 upsampling):
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | NIQE↓ | MEt3R↓ | FID↓ |
|---|---|---|---|---|---|---|
| SuperGaussian | 25.252 | 0.725 | 0.303 | 4.694 | 0.675 | 32.652 |
| StableSR | 24.313 | 0.700 | 0.326 | 5.177 | 0.644 | 44.174 |
| 3DSR (Ours) | 26.097 | 0.746 | 0.222 | 5.065 | 0.625 | 22.438 |
Ablation Study¶
Qualitative comparisons against each baseline and across scene types are used in place of a quantitative ablation table:
| Comparison | Observation |
|---|---|
| SuperGaussian vs. 3DSR | SuperGaussian produces blurrier outputs due to limited training views in LLFF |
| StableSR vs. 3DSR | StableSR introduces artifacts and geometric distortions; hallucinated details are misaligned across views |
| DiSR-NeRF vs. 3DSR | DiSR-NeRF frequently produces black and blurry artifacts |
| 3DSR on indoor/outdoor scenes | Structural integrity and fine texture are preserved in both scene types |
- On LLFF (×8→×4), 3DSR achieves +1.46 dB PSNR and −20.4 FID compared to StableSR.
- On MipNeRF360, 3DSR achieves +0.85 dB PSNR and −10.2 FID compared to SuperGaussian.
- MEt3R (3D consistency metric) improves from 0.531/0.644 (StableSR) to 0.516/0.625 (3DSR).
Key Findings¶
- Hallucinated details from diffusion models are inconsistent across views when SR is applied independently, causing blurring after 3DGS reconstruction.
- "Fusing" SR results from multiple views through 3DGS effectively smooths cross-view inconsistencies.
- The method requires no fine-tuning of diffusion or video models and serves as a general plug-and-play framework.
- 3DSR ranks second on the NIQE perceptual quality metric, suggesting that the 3D consistency constraint slightly limits single-image perceptual quality.
Highlights & Insights¶
- Elegant combination of diffusion and 3D representations: The diffusion model is left unmodified; the 3D representation serves solely as a cross-view consistency regularizer.
- Simple and general design: The 3D representation can be replaced with other forms (e.g., NeRF), and the SR model can be substituted with any diffusion-based SR method.
- Comprehensive evaluation metrics: MEt3R (3D consistency) and FID are introduced into 3D super-resolution evaluation for the first time.
- Qualitative comparisons against StableSR clearly illustrate the fundamental difference between "SR then 3D" and "SR interleaved with 3D optimization."
Limitations & Future Work¶
- Training 3DGS at every diffusion denoising step incurs high computational cost.
- Only ×4 and ×2 upsampling are evaluated; larger scale factors (e.g., ×8) remain unverified.
- StableSR-Turbo uses only 4 denoising steps; more steps may yield better results at higher computational cost.
- The impact of different 3D representations (e.g., NeRF) on performance is not explored.
- 3DSR is not the top performer on NIQE, indicating a trade-off between 3D consistency and single-image perceptual quality.
Related Work & Insights¶
- MultiDiffusion and Generative Power of Ten achieve consistency at the 2D level via noise averaging; this paper extends that idea to 3D.
- SuperGaussian relies on the implicit consistency of VSR, whereas this work enforces consistency more directly through an explicit 3D representation.
- DiSR-NeRF employs Score Distillation Sampling; the proposed \(x_0\)-guidance strategy is more stable.
Rating¶
- Novelty: ⭐⭐⭐⭐ The idea of embedding 3DGS optimization within the diffusion denoising loop is natural and novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across two datasets, multiple metrics, and multiple baselines; however, a quantitative ablation table is absent.
- Writing Quality: ⭐⭐⭐⭐ The method is clearly described, and the algorithmic pseudocode aids understanding.
- Value: ⭐⭐⭐⭐ Provides a general framework for high-quality 3D scene rendering with strong practical application potential.