
3DSR: Bridging Diffusion Models and 3D Representations for 3D Consistent Super-Resolution

Conference: ICCV 2025 arXiv: 2508.04090 Code: https://consistent3dsr.github.io/ Area: Image Generation Keywords: 3D-consistent super-resolution, diffusion models, 3DGS, multi-view consistency, denoising guidance

TL;DR

3DSR is an alternating iterative framework that couples diffusion-based SR with 3DGS to achieve 3D-consistent super-resolution: after each denoising step, the SR images are used to train a 3DGS, and the resulting 3D-consistent renderings are re-encoded into the latent space to guide the next denoising step. Without fine-tuning any model, it explicitly enforces cross-view consistency, gaining +1.16 dB PSNR over the best baseline and halving FID (41.1 → 20.7 vs. StableSR) on LLFF.

Background & Motivation

Root Cause

3D scene reconstruction is fundamentally limited by the resolution of the input images. Applying image SR frame-by-frame introduces inconsistent hallucinated details across views, causing blur and geometric artifacts after 3DGS training. Video SR implicitly models temporal consistency but does not guarantee 3D consistency. What is needed is a framework that explicitly leverages 3D representations to constrain multi-view consistency in diffusion-based SR.

Mechanism

Goal: How can a diffusion model be exploited to generate high-quality SR details while simultaneously guaranteeing 3D structural consistency across views?

Method

Overall Architecture

Low-resolution multi-view images → 3DGS pretrained on the LR inputs → iterative denoising with 4-step StableSR-Turbo. At each denoising step (sketched in code below):

  1. For each view, diffusion denoising yields a clean latent \(\hat{x}_0\), which is decoded into an SR image \(H_i\).
  2. All SR images are used to train a 3DGS (5K iterations), which renders a 3D-consistent image \(R_i\) per view.
  3. Each \(R_i\) is re-encoded into a 3D-consistent latent \(\tilde{x}_0\).
  4. \(\tilde{x}_0\) is combined with the current \(x_t\) to guide the next denoising step.
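To make the alternation concrete, here is a minimal sketch of the loop in PyTorch style. All interfaces are hypothetical placeholders, not the paper's code: `sr_unet` stands for the StableSR-Turbo denoiser, `vae` for its autoencoder, `scheduler` for a diffusers-style noise scheduler, and `fit_3dgs` for a 3DGS training routine that already knows the camera poses. The initialization from noised LR encodings and the re-noising in step (4) are assumptions about details the notes leave open.

```python
import torch

@torch.no_grad()
def consistent_super_resolve(lr_views, sr_unet, vae, scheduler, fit_3dgs,
                             num_steps=4, gs_iters=5000):
    """One 3DSR pass: few-step diffusion SR alternated with 3DGS fitting.

    lr_views : list of low-resolution view tensors (also the SR conditioning)
    fit_3dgs : trains a 3DGS scene on the given images (cameras captured in
               the closure) and returns per-view renderings
    """
    # Initialization (assumed): encode the LR views and noise them to the
    # first timestep of the 4-step schedule.
    timesteps = scheduler.timesteps[:num_steps]
    latents = [scheduler.add_noise(vae.encode(v), torch.randn_like(vae.encode(v)),
                                   timesteps[0])
               for v in lr_views]

    for i, t in enumerate(timesteps):
        # (1) Per-view denoising: predict the clean latent x0_hat and decode
        #     it into an SR image H_i (detailed but view-inconsistent).
        x0_hats = [scheduler.step(sr_unet(x, t, cond=lr), t, x).pred_original_sample
                   for x, lr in zip(latents, lr_views)]
        sr_images = [vae.decode(x0) for x0 in x0_hats]

        # (2) Fit a 3DGS scene to all SR images (5K iterations in the paper);
        #     the renderings R_i are 3D-consistent by construction.
        renders = fit_3dgs(sr_images, iters=gs_iters)

        # (3) Re-encode the consistent renderings into the latent space.
        x0_tildes = [vae.encode(r) for r in renders]

        # (4) Guide the next denoising step with the consistent latents
        #     (assumed rule: re-noise x0_tilde to the next timestep).
        if i + 1 < len(timesteps):
            t_next = timesteps[i + 1]
            latents = [scheduler.add_noise(x0, torch.randn_like(x0), t_next)
                       for x0 in x0_tildes]
        else:
            latents = x0_tildes  # final step: keep the clean consistent latents

    return [vae.decode(x) for x in latents]
```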

Key Designs

  1. Diffusion–3D Representation Alternation: This is the core innovation: 3DGS training and rendering are inserted at every denoising step. The diffusion model supplies fine-grained details, while 3DGS enforces 3D consistency. \(\hat{x}_0\) (diffusion prediction: rich details but view-inconsistent) → train 3DGS → \(R_i\) (3D-consistent but potentially losing details) → \(\tilde{x}_0\) (3D-consistent latent) guides the next denoising step (one plausible form of this guidance is sketched after this list).
  2. Subsampling Regularization: SR images are downsampled to LR resolution and aligned with the original LR input (\(\mathcal{L}_{lr}\)), preventing SR details from deviating excessively from the original signal. Total loss \(= \mathcal{L}_{hr}\)(SR image vs. rendering) \(+ \lambda \cdot \mathcal{L}_{lr}\)(original LR vs. downsampled rendering).
  3. Plug-and-Play without Fine-tuning: The diffusion model, VAE, and 3DGS are all frozen or trained only with the rendering loss. No fine-tuning of video models or training of additional networks is required. The framework is compatible with arbitrary SR diffusion models or 3D representations.
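The notes only say that \(\tilde{x}_0\) is "combined with the current \(x_t\)". One plausible reading, sketched below, is standard forward-diffusion re-noising of the consistent latent to the next timestep; the \(\bar{\alpha}\) coefficients follow the usual DDPM forward process, and this specific rule is an assumption rather than the paper's stated formula.

```python
import torch

def guide_next_step(x0_tilde: torch.Tensor, alpha_bar_next: float) -> torch.Tensor:
    """Re-noise the 3D-consistent latent to the next timestep (assumed rule):

        x_{t-1} = sqrt(abar_{t-1}) * x0_tilde + sqrt(1 - abar_{t-1}) * eps
    """
    eps = torch.randn_like(x0_tilde)
    abar = torch.as_tensor(alpha_bar_next, dtype=x0_tilde.dtype)
    return abar.sqrt() * x0_tilde + (1.0 - abar).sqrt() * eps
```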

Loss & Training

  • \(\mathcal{L}_{\text{all}} = \mathcal{L}_{hr}(H_i, R_i) + \lambda \cdot \mathcal{L}_{lr}(L_i, R_{lr}^i)\)
  • Both terms use the 3DGS-style photometric loss \(\mathcal{L}_{\alpha} = (1-\delta)\cdot\mathcal{L}_1 + \delta \cdot \mathcal{L}_{\text{DSSIM}}\) for \(\alpha \in \{hr, lr\}\), with \(\lambda=1\) and \(\delta=0.2\) (sketched in code after this list)
  • StableSR-Turbo (4-step denoising), 5K 3DGS training iterations per denoising step
  • Mip-Splatting as the 3D representation; A6000 GPU
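A compact sketch of the two-level objective, assuming the 3DGS-style photometric loss above. `pytorch_msssim` stands in for any standard SSIM implementation, and bicubic downsampling of the rendering to LR resolution is an assumption: the notes only say the rendering is subsampled back to the LR input, not which kernel is used.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # any standard SSIM implementation works here

def photometric_loss(pred: torch.Tensor, target: torch.Tensor,
                     delta: float = 0.2) -> torch.Tensor:
    """L_alpha = (1 - delta) * L1 + delta * D-SSIM, with D-SSIM taken as
    1 - SSIM following common 3DGS implementations."""
    l1 = F.l1_loss(pred, target)
    d_ssim = 1.0 - ssim(pred, target, data_range=1.0)
    return (1.0 - delta) * l1 + delta * d_ssim

def total_loss(render_hr: torch.Tensor, sr_image: torch.Tensor,
               lr_image: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """L_all = L_hr(H_i, R_i) + lambda * L_lr(L_i, R_i subsampled to LR).

    Inputs are (B, C, H, W) tensors: the rendering R_i, the SR image H_i,
    and the original LR view L_i.
    """
    l_hr = photometric_loss(render_hr, sr_image)
    # Subsampling regularization: compare the rendering, downsampled to the
    # LR resolution, against the original LR input (bicubic is an assumption).
    render_lr = F.interpolate(render_hr, size=lr_image.shape[-2:],
                              mode="bicubic", align_corners=False)
    l_lr = photometric_loss(render_lr, lr_image)
    return l_hr + lam * l_lr
```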

Key Experimental Results

LLFF (×8 downsampling, ×4 upsampling)

Main Results

| Method | PSNR↑ | SSIM↑ | LPIPS↓ | MEt3R (3D consistency)↓ | FID↓ |
|---|---|---|---|---|---|
| SuperGaussian (VSR) | 23.05 | 0.725 | 0.296 | 0.541 | 51.2 |
| StableSR (ISR) | 22.75 | 0.717 | 0.219 | 0.531 | 41.1 |
| DiSR-NeRF | 22.50 | 0.697 | 0.310 | 0.518 | 54.1 |
| 3DSR (Ours) | 24.21 | 0.754 | 0.181 | 0.516 | 20.7 |

MipNeRF360 (×8 downsampling, ×4 upsampling)

Main Results

| Method | PSNR↑ | LPIPS↓ | FID↓ |
|---|---|---|---|
| SuperGaussian | 25.25 | 0.303 | 32.7 |
| StableSR | 24.31 | 0.326 | 44.2 |
| 3DSR | 26.10 | 0.222 | 22.4 |

Highlights & Insights

  • 3D Representation as a Consistency Constraint: Inserting 3DGS into the diffusion denoising loop is a simple yet effective means of enforcing 3D consistency.
  • FID Halved: From 41.1 → 20.7 (LLFF), demonstrating that 3D consistency constraints substantially reduce the distributional shift introduced by hallucinated details.
  • Plug-and-Play without Fine-tuning: No pretrained model is modified; the method purely exploits the geometric inductive bias of 3DGS.
  • Compatible with Arbitrary SR Models: The framework is applicable to both ISR and VSR approaches.

Limitations & Future Work

  • Training the 3DGS for 5K iterations at every denoising step introduces significant computational overhead.
  • 3DSR scores worse than StableSR on NIQE (a no-reference perceptual quality metric), suggesting that the 3D consistency constraint may sacrifice some perceptual quality.
  • Validation is limited to ×4 upsampling.
Comparison with Baselines

  • vs. StableSR (ISR): per-frame SR without 3D consistency yields blurry results after 3DGS training; 3DSR explicitly enforces consistency.
  • vs. SuperGaussian (VSR): relies on the temporal consistency of video models, which does not amount to true 3D consistency.
  • vs. DiSR-NeRF: optimizes a NeRF via SDS, which produces black artifacts; 3DSR instead uses 3DGS renderings directly as guidance.

Relevance to My Research

  • The alternating diffusion–3D iteration framework is transferable to tasks such as 3D editing and 3D completion.
  • The plug-and-play role of 3DGS within diffusion pipelines warrants further exploration.

Rating

  • Novelty: ⭐⭐⭐⭐ The idea of alternating diffusion and 3DGS is intuitive yet effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Two datasets, multiple SR baselines, and the MEt3R 3D consistency metric.
  • Writing Quality: ⭐⭐⭐⭐ The motivation figure (Fig. 2) is persuasive and the method is described clearly.
  • Value: ⭐⭐⭐⭐ The framework combining 3D consistency with diffusion-based SR offers strong reference value.