# SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting
- Conference: CVPR 2026
- arXiv: 2602.24020
- Code: Project Page
- Area: 3D Vision
- Keywords: 3D super-resolution, 3D Gaussian splatting, feed-forward reconstruction, Gaussian offset learning, sparse-view reconstruction
## TL;DR
SR3R reformulates 3D super-resolution (3DSR) as a feed-forward mapping from sparse low-resolution (LR) views to high-resolution (HR) 3DGS. Through Gaussian offset learning and feature refinement, it reconstructs high-fidelity HR 3DGS without per-scene optimization while generalizing strongly in zero-shot settings.
## Background & Motivation
- Limitations of Prior Work: Existing 3DSR methods rely on dense LR inputs and pretrained 2D super-resolution (2DSR) models to generate pseudo-HR images, which then supervise per-scene HR 3DGS optimization. This paradigm has three fundamental limitations:
    - Constrained high-frequency priors: high-frequency knowledge is derived solely from 2DSR model priors, failing to capture 3D-specific high-frequency geometry and texture structures.
    - Reconstruction fidelity ceiling: the quality of the pseudo-HR labels inherently caps reconstruction quality.
    - High computational cost: dense multi-view synthesis combined with per-scene iterative optimization is expensive and precludes cross-scene generalization.
- Key Observation: Feed-forward 3DGS reconstruction models can directly predict 3DGS from sparse views, but reconstruction quality is severely constrained by input resolution. This motivates the question: can 3DSR also be formulated as a feed-forward mapping that learns 3D-specific high-frequency priors from large-scale multi-scene data?
- Paradigm Shift: From "per-scene HR 3DGS self-optimization" to "generalizable HR 3DGS feed-forward prediction," fundamentally changing the way 3DSR acquires high-frequency knowledge.
## Method
### Overall Architecture
SR3R adopts a plug-and-play design with a four-stage pipeline:
- LR 3DGS Reconstruction: Any feed-forward 3DGS backbone (e.g., NoPoSplat, DepthSplat) is used to obtain the LR 3DGS \(\mathcal{G}^{\text{LR}}\) from two LR views.
- Gaussian Densification: \(\mathcal{G}^{\text{LR}}\) is densified into \(\mathcal{G}^{\text{Dense}}\) via Gaussian Shuffle Split, serving as a structural scaffold.
- Mapping Network: Upsampled LR images are processed through a ViT encoder, a feature refinement module, and a ViT decoder to extract multi-view fused features.
- Gaussian Offset Learning: Residual offsets from \(\mathcal{G}^{\text{Dense}}\) to \(\mathcal{G}^{\text{HR}}\) are predicted to obtain the final HR 3DGS.
The core formulation defines 3DSR as a direct feed-forward mapping from the sparse LR views to the HR 3DGS:

\[
\mathcal{G}^{\text{HR}} = f_{\theta}\big(\boldsymbol{I}_1^{\text{LR}}, \boldsymbol{I}_2^{\text{LR}}\big),
\]

where each 3D Gaussian primitive is parameterized as \((\boldsymbol{\mu}, \alpha, \boldsymbol{r}, \boldsymbol{s}, \boldsymbol{c})\), corresponding to center position, opacity, quaternion rotation, scale, and spherical harmonic coefficients, respectively.
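To make the four-stage data flow concrete, the following schematic sketches the forward pass in PyTorch-style Python. Every module callable is a hypothetical stand-in; the interfaces are illustrative, not the paper's actual API.

```python
import torch.nn.functional as F

def sr3r_forward(lr_views, backbone, densify, encode, refine, decode, reason, head,
                 scale=4):
    """Schematic four-stage SR3R forward pass (all callables are stand-ins).

    lr_views: (B, 3, h, w) batch of sparse LR input views.
    """
    g_lr, geo_feats = backbone(lr_views)       # 1. frozen feed-forward backbone -> LR 3DGS
    g_dense = densify(g_lr)                    # 2. Gaussian Shuffle Split scaffold
    up = F.interpolate(lr_views, scale_factor=scale, mode="bicubic")
    tokens = decode(refine(encode(up), geo_feats))  # 3. ViT encode -> refine -> cross-view fuse
    offsets = head(reason(g_dense, tokens))    # 4. PTv3 reasoning -> residual offsets
    return g_dense + offsets                   # HR 3DGS = scaffold + residuals
```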
### Key Design 1: Gaussian Shuffle Split Densification
For each Gaussian primitive in \(\mathcal{G}^{\text{LR}}\), six child Gaussians are generated along the positive and negative directions of its three principal axes, providing a finer structural scaffold:

\[
\boldsymbol{\mu}_{j,k}^{\pm} = \boldsymbol{\mu}_j \pm \beta \, s_{j,k} \, \boldsymbol{R}_j \boldsymbol{e}_k, \qquad k \in \{1, 2, 3\},
\]

- \(\boldsymbol{R}_j\) is the rotation matrix corresponding to quaternion \(\boldsymbol{r}_j\); \(\boldsymbol{e}_k\) denotes the unit vector along the \(k\)-th principal axis; \(s_{j,k}\) is the parent scale along that axis.
- \(\beta = 0.5\) controls the offset magnitude; the child Gaussian scale along the offset axis is reduced to \(\frac{1}{4}\) of the original.
- Applied only to Gaussians with opacity > 0.5, focusing on structurally significant regions.
- After densification, \(\mathcal{G}^{\text{Dense}}\) contains \(N = 6M\) primitives (\(M\) being the number of LR Gaussians).
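A minimal PyTorch sketch of the split as described above; that the children inherit the parent's rotation, opacity, and SH coefficients is an assumption, since only centers and scales are specified here.

```python
import torch

def quat_to_rotmat(q):
    """(M, 4) unit quaternions (w, x, y, z) -> (M, 3, 3) rotation matrices."""
    w, x, y, z = q.unbind(-1)
    return torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(-1, 3, 3)

def gaussian_shuffle_split(mu, quat, scale, opacity, beta=0.5, thresh=0.5):
    """Split each high-opacity Gaussian into 6 children along +/- principal axes.

    Children inherit the parent's rotation (an assumption); the scale along
    the offset axis is shrunk to 1/4 of the parent's.
    """
    keep = opacity > thresh                     # only structurally significant Gaussians
    mu, quat, scale = mu[keep], quat[keep], scale[keep]
    R = quat_to_rotmat(quat)                    # (M, 3, 3)
    child_mu, child_scale = [], []
    for k in range(3):                          # three principal axes
        axis = R[:, :, k]                       # k-th column: axis direction in world space
        step = beta * scale[:, k:k + 1] * axis  # offset magnitude tied to parent extent
        s = scale.clone()
        s[:, k] = s[:, k] / 4.0                 # shrink along the offset axis
        for sign in (1.0, -1.0):                # positive and negative directions
            child_mu.append(mu + sign * step)
            child_scale.append(s)
    return torch.cat(child_mu), torch.cat(child_scale), quat.repeat(6, 1)
```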
### Key Design 2: Feature Refinement Module
Upsampled LR images contain blurred and hallucinated high-frequency patterns introduced by interpolation, which cause 3D geometry and texture artifacts if used directly. The feature refinement module aligns the ViT-encoded features \(\boldsymbol{t}_{en}\) with geometry-aware features \(\boldsymbol{t}_{gs}\) from the pretrained 3DGS backbone via bidirectional cross-attention:

\[
\boldsymbol{t}_{ca} = \operatorname{FC}\big(\big[\operatorname{CA}(\boldsymbol{t}_{en}, \boldsymbol{t}_{gs}, \boldsymbol{t}_{gs});\ \operatorname{CA}(\boldsymbol{t}_{gs}, \boldsymbol{t}_{en}, \boldsymbol{t}_{en})\big]\big),
\]

where \(\operatorname{CA}(Q, K, V)\) denotes cross-attention and \([\cdot\,;\cdot]\) channel-wise concatenation. The outputs of both attention directions are concatenated and fused through a fully connected layer to produce the refined feature \(\boldsymbol{t}_{ca}\). The core mechanism transfers reliable 3D geometric priors from the 3DGS backbone into the 2D feature space, suppressing the ambiguities introduced by upsampling.
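A minimal sketch of such a module follows. It assumes the two token sequences are spatially aligned with equal length \(N\) so the two attention outputs can be concatenated per token; dimensions and head counts are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FeatureRefinement(nn.Module):
    """Bidirectional cross-attention between 2D ViT tokens and 3DGS backbone tokens."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn_2d_q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_3d_q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)       # FC fusion of both directions

    def forward(self, t_en, t_gs):
        # t_en: (B, N, C) ViT tokens of the upsampled LR images
        # t_gs: (B, N, C) geometry-aware tokens from the frozen 3DGS backbone
        a, _ = self.attn_2d_q(t_en, t_gs, t_gs)   # 2D queries attend to 3D priors
        b, _ = self.attn_3d_q(t_gs, t_en, t_en)   # 3D queries attend to 2D detail
        return self.fuse(torch.cat([a, b], dim=-1))  # refined feature t_ca
```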
### Key Design 3: Gaussian Offset Learning
This is the most critical module for SR3R's performance gains. The core idea is to predict residual offsets from \(\mathcal{G}^{\text{Dense}}\) to \(\mathcal{G}^{\text{HR}}\) rather than directly regressing absolute Gaussian parameters.
Procedure:

1. Each dense Gaussian center \(\boldsymbol{\mu}_i\) is projected onto the image plane to obtain 2D coordinates \(\boldsymbol{p}_i\).
2. Local features \(\boldsymbol{F}_i\) are queried at position \(\boldsymbol{p}_i\) from the ViT decoder feature map \(\boldsymbol{t}_{de}\).
3. Gaussian centers, queried features, and camera intrinsics are aggregated and fed into PointTransformerV3 (PTv3) for spatial reasoning.
- A Gaussian Head (lightweight MLP) predicts residual offsets for every attribute: \(\Delta\boldsymbol{g}_i = (\Delta\boldsymbol{\mu}_i, \Delta\alpha_i, \Delta\boldsymbol{r}_i, \Delta\boldsymbol{s}_i, \Delta\boldsymbol{c}_i)\).
- Residual combination yields the final HR 3DGS: \(\boldsymbol{g}_i^{\text{HR}} = \boldsymbol{g}_i^{\text{Dense}} + \Delta\boldsymbol{g}_i\).
Design Motivation: Since \(\mathcal{G}^{\text{Dense}}\) already provides a reliable coarse structural scaffold, the remaining discrepancy primarily consists of local high-frequency signals. Learning offsets rather than absolute parameters constrains the search space to local neighborhoods, substantially improving training stability and reconstruction sharpness.
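Steps 1–2 of the procedure can be sketched as follows. This is a single-view illustration with assumed tensor shapes; multi-view aggregation and the PTv3 stage are omitted.

```python
import torch
import torch.nn.functional as F

def query_gaussian_features(mu_cam, K, feat_map):
    """Project Gaussian centers into one view and bilinearly sample decoder features.

    mu_cam:   (N, 3) centers in that camera's frame
    K:        (3, 3) pinhole intrinsics
    feat_map: (C, H, W) ViT decoder feature map t_de
    Returns:  (N, C) per-Gaussian feature vectors F_i
    """
    uv = (K @ mu_cam.T).T                          # homogeneous pixel coordinates
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)    # perspective divide -> p_i
    _, H, W = feat_map.shape
    grid = torch.empty_like(uv)
    grid[:, 0] = 2 * uv[:, 0] / (W - 1) - 1        # normalize to [-1, 1] for grid_sample
    grid[:, 1] = 2 * uv[:, 1] / (H - 1) - 1
    sampled = F.grid_sample(feat_map[None], grid[None, :, None, :],
                            mode="bilinear", align_corners=True)  # (1, C, N, 1)
    return sampled[0, :, :, 0].T

# Residual combination (schematic): the Gaussian Head's offsets are added
# attribute-wise to the dense scaffold, e.g. mu_hr = mu_dense + delta_mu.
```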
### Key Design 4: ViT Decoder Cross-View Fusion
The refined features \(\boldsymbol{t}_{ca}\) are processed by the ViT decoder via:

- Intra-view self-attention: aggregates global contextual information within each view.
- Inter-view cross-attention: fuses complementary information across views, mitigating inconsistencies caused by inaccurate poses or insufficient view overlap.
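A minimal sketch of one such decoder block; the pre-norm placement, layer sizes, and two-view interface are assumptions rather than the paper's exact design.

```python
import torch.nn as nn

class CrossViewDecoderBlock(nn.Module):
    """Intra-view self-attention followed by inter-view cross-attention."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, tokens_self, tokens_other):
        # tokens_self:  (B, N, C) tokens of the current view
        # tokens_other: (B, N, C) tokens of the other input view
        x = tokens_self
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]    # intra-view: global context
        h = self.norm2(x)
        x = x + self.cross_attn(h, tokens_other, tokens_other)[0]  # inter-view fusion
        return x
```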
### Loss & Training
A combination of a pixel-level MSE reconstruction loss and a perceptual LPIPS loss is adopted:

\[
\mathcal{L} = \mathcal{L}_{\text{MSE}} + \lambda \, \mathcal{L}_{\text{LPIPS}},
\]

where \(\lambda\) balances the two terms.
End-to-end training is performed via differentiable Gaussian rasterization.
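As a concrete reference, here is a minimal sketch of this objective using the publicly available `lpips` package; the weighting value `lam` is an assumption, not the paper's coefficient.

```python
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net="vgg")   # frozen perceptual network

def sr3r_loss(render, target, lam=0.05):
    """Pixel-level MSE + perceptual LPIPS on rendered vs. ground-truth HR views.

    render, target: (B, 3, H, W) images scaled to [-1, 1].
    """
    return F.mse_loss(render, target) + lam * lpips_fn(render, target).mean()
```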
## Key Experimental Results
### Experimental Setup
- Datasets: RealEstate10K (RE10K, indoor), ACID (outdoor aerial), DTU (object-centric), ScanNet++ (indoor)
- Super-resolution scale: 4× (64×64 → 256×256)
- Backbones: NoPoSplat, DepthSplat
- Training: 75K iterations, batch size 8, learning rate 2.5e-5, 4× RTX 5090 GPUs
### Main Results
| Method | Dataset | PSNR ↑ | SSIM ↑ | LPIPS ↓ | # Gaussians |
|---|---|---|---|---|---|
| NoPoSplat | RE10K | 21.33 | 0.612 | 0.307 | 2.7M |
| Up-NoPoSplat | RE10K | 23.37 | 0.771 | 0.251 | 44.5M |
| SR3R (NoPoSplat) | RE10K | 24.79 | 0.827 | 0.188 | 16.5M |
| DepthSplat | RE10K | 23.15 | 0.699 | 0.281 | 2.3M |
| Up-DepthSplat | RE10K | 24.71 | 0.793 | 0.244 | 38.3M |
| SR3R (DepthSplat) | RE10K | 26.25 | 0.856 | 0.165 | 14.2M |
| NoPoSplat | ACID | 21.45 | 0.606 | 0.531 | 2.7M |
| Up-NoPoSplat | ACID | 23.91 | 0.692 | 0.384 | 44.5M |
| SR3R (NoPoSplat) | ACID | 25.54 | 0.746 | 0.283 | 16.5M |
| DepthSplat | ACID | 23.80 | 0.624 | 0.437 | 2.3M |
| Up-DepthSplat | ACID | 25.32 | 0.721 | 0.322 | 38.3M |
| SR3R (DepthSplat) | ACID | 27.02 | 0.797 | 0.261 | 14.2M |
Key Findings: Across both datasets and backbones, SR3R improves PSNR by 1.4–1.7 dB over the direct-upsampling baselines and by 3.1–4.1 dB over the LR backbones, while using only ~37% as many Gaussians as direct upsampling (16.5M vs. 44.5M; 14.2M vs. 38.3M).
### Zero-Shot Generalization (RE10K → DTU)
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Reconstruction Time |
|---|---|---|---|---|
| SRGS (per-scene opt.) | 12.42 | 0.327 | 0.598 | 300s |
| FSGS+SRGS (per-scene opt.) | 13.72 | 0.444 | 0.481 | 420s |
| NoPoSplat | 12.63 | 0.343 | 0.581 | 0.01s |
| Up-NoPoSplat | 16.64 | 0.598 | 0.369 | 0.16s |
| SR3R (NoPoSplat) | 17.24 | 0.607 | 0.291 | 1.69s |
SR3R not only outperforms all feed-forward baselines but also surpasses the per-scene optimization methods SRGS and FSGS+SRGS (+3.5 dB PSNR over the stronger FSGS+SRGS), while being 177–248× faster.
### Ablation Study
| Component | PSNR ↑ | SSIM ↑ | LPIPS ↓ | # Gaussians |
|---|---|---|---|---|
| NoPoSplat (baseline) | 21.33 | 0.612 | 0.307 | 2.7M |
| + Upsampling | 23.37 | 0.771 | 0.251 | 44.5M |
| + Cross-attention | 23.50 | 0.784 | 0.237 | 44.5M |
| + Gaussian offset (w/o PTv3) | 24.45 | 0.808 | 0.211 | 16.5M |
| + PTv3 (Full SR3R) | 24.79 | 0.827 | 0.188 | 16.5M |
Key Findings:

1. Gaussian offset learning contributes the most: +0.95 dB PSNR, while reducing the Gaussian count from 44.5M to 16.5M.
2. Cross-attention feature refinement improves structural consistency (+0.13 dB PSNR, −0.014 LPIPS).
3. PTv3 multi-scale spatial reasoning further enhances sharpness (+0.35 dB PSNR, −0.023 LPIPS).
4. The components are complementary, progressively improving reconstruction quality.
### Upsampling Strategy Robustness
| Upsampling Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| Bilinear | 24.59 | 0.795 | 0.204 |
| Bicubic | 24.66 | 0.817 | 0.193 |
| SwinIR | 24.79 | 0.827 | 0.188 |
| HAT | 24.78 | 0.819 | 0.183 |
Even with simple bilinear interpolation, SR3R surpasses all feed-forward baselines, demonstrating that the framework does not depend on any specific upsampling design.
## Highlights & Insights
- 🔄 Paradigm Shift: SR3R transitions 3DSR from "per-scene optimization with 2DSR pseudo-supervision" to "large-scale cross-scene feed-forward prediction," fundamentally changing how high-frequency knowledge is acquired.
- 🔌 Plug-and-Play: Compatible with any feed-forward 3DGS backbone, offering an elegant and practical design.
- 📐 Offset Learning > Direct Regression: Learning residual offsets rather than absolute parameters improves reconstruction quality while cutting the Gaussian count to ~37% of the direct-upsampling baseline.
- 🎯 Zero-Shot Generalization: Surpasses per-scene optimization methods on unseen scenes, with inference speeds two orders of magnitude faster.
- ⚡ Efficient and Practical: Complete HR 3D reconstruction from only 2 LR input views.
## Limitations & Future Work
- Inference time (1.69 s), while far faster than optimization-based methods (300+ s), is still roughly 170× slower than the base feed-forward models (0.01 s), limiting real-time applicability.
- Only 4× super-resolution is validated; performance at higher scales (8×/16×) remains unexplored.
- The densification strategy (fixed 6 child Gaussians) is heuristic; adaptive densification may be more effective.
- Training requires 4× RTX 5090 GPUs, imposing a high computational resource threshold.
- Validation is limited to indoor, outdoor, and object-centric scenes; generalization to large-scale outdoor environments (e.g., autonomous driving) remains untested.
## Rating
- Novelty: ⭐⭐⭐⭐ — The feed-forward mapping paradigm for 3DSR is novel, and the Gaussian offset learning design is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 3 datasets, zero-shot generalization, ablation studies, and upsampling strategy analysis.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured with thorough motivation and rigorous formulations.
- Value: ⭐⭐⭐⭐ — Introduces a new paradigm for 3DSR with strong practical utility and plug-and-play applicability.