FreeScale: Scaling 3D Scenes via Certainty-Aware Free-View Generation

Conference: CVPR 2026
arXiv: 2604.10512
Code: https://mvp-ai-lab.github.io/FreeScale
Area: 3D Vision
Keywords: Novel View Synthesis, Data Augmentation, 3D Gaussian Splatting, Feed-Forward Reconstruction, Certainty-Aware Sampling

TL;DR

FreeScale scales limited real-world captures into large training sets by sampling high-quality free-view images from existing scene reconstructions under certainty guidance, improving the feed-forward novel view synthesis model LVSM by 2.7 dB PSNR on DL3DV under large camera motion.

Background & Motivation

Background: Novel view synthesis (NVS) is transitioning from per-scene optimization (NeRF, 3DGS) toward generalizable feed-forward models (e.g., LVSM) that learn cross-scene priors from large-scale data and perform efficient 3D reconstruction at inference time.

Limitations of Prior Work: The bottleneck for feed-forward models lies in the scarcity of large-scale training data with diverse and accurate camera trajectories. Real-world data is photorealistic but sparsely captured and expensive to collect; synthetic data suffers from domain gaps; and images generated by diffusion models cannot provide precise camera poses.

Key Challenge: Real-world scene capture yields only discrete, sparse viewpoint coverage, while the continuous 3D representations obtained after reconstruction can theoretically be rendered from arbitrary viewpoints—yet directly sampling from imperfect reconstructions amplifies artifacts.

Goal: Design a data generation engine that produces diverse, high-quality, accurately posed free-view images from existing real-world scene reconstructions.

Key Insight: Imperfect reconstructed scenes can serve as rich geometric proxies; the key is identifying which novel viewpoints are both informative and uncontaminated by reconstruction errors.

Core Idea: Employ a certainty-aware free-view sampling strategy to identify high-certainty regions within 3DGS reconstructions, generating high-quality training data to scale feed-forward model training.

Method

Overall Architecture

Sparse image sequence → 3DGS reconstruction → Certainty grid construction → Virtual camera placement (10 trajectory patterns) → View graph filtering of redundant views → Image quality assessment and pose correction → Diffusion model enhancement → Output high-quality free-view images for training feed-forward models or augmenting per-scene optimization.

Key Designs

Certainty Grid:
- Function: Quantifies reconstruction reliability across scene regions.
- Mechanism: The scene bounding box is discretized into a \(128^3\) voxel grid. The certainty score of voxel \(v_i\) is \(\mathcal{C}(v_i) = \sum_{j \in v_i} \alpha_j / (\text{Vol}_j + \epsilon)\), where the sum runs over the Gaussians falling within the voxel, \(\alpha_j\) is a Gaussian's opacity, \(\text{Vol}_j\) is its volume, and \(\epsilon\) is a small constant that avoids division by zero. Small, opaque Gaussians therefore contribute the most and mark high-certainty regions.
- Design Motivation: High-certainty regions yield clean rendered images, while low-certainty regions are prone to artifacts and must be treated differently.
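
The certainty score above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `certainty_grid`, the use of Gaussian centers to decide voxel membership, and the ellipsoid formula for \(\text{Vol}_j\) are all assumptions made for the example.

```python
import numpy as np

def certainty_grid(centers, opacities, scales, bbox_min, bbox_max, res=128, eps=1e-8):
    """Accumulate alpha_j / (Vol_j + eps) into the voxel that contains each Gaussian center."""
    centers = np.asarray(centers, dtype=np.float64)
    opacities = np.asarray(opacities, dtype=np.float64)
    scales = np.asarray(scales, dtype=np.float64)
    bbox_min = np.asarray(bbox_min, dtype=np.float64)
    bbox_max = np.asarray(bbox_max, dtype=np.float64)

    # Assumption: Vol_j is the volume of an ellipsoid whose semi-axes are the Gaussian scales.
    volumes = (4.0 / 3.0) * np.pi * np.prod(scales, axis=1)

    # Normalize centers to [0, 1) inside the bounding box and drop Gaussians outside it.
    rel = (centers - bbox_min) / (bbox_max - bbox_min)
    inside = np.all((rel >= 0.0) & (rel < 1.0), axis=1)
    idx = np.floor(rel[inside] * res).astype(np.int64)

    # np.add.at performs an unbuffered scatter-add, so repeated voxel indices accumulate.
    grid = np.zeros((res, res, res), dtype=np.float64)
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]),
              opacities[inside] / (volumes[inside] + eps))
    return grid
```

Under these assumptions, two small opaque Gaussians in one voxel produce a far larger score than a single large, nearly transparent Gaussian elsewhere, which matches the intuition stated in the mechanism.
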
Virtual Camera Placement and View Graph:
- Function: Generates a large pool of candidate viewpoints and efficiently selects the optimal subset.
- Mechanism: Ten camera trajectory patterns are designed (orbit, spiral, fly-through, etc.), with anchor points selected from training cameras pointing toward high-certainty regions in the certainty grid. After generating 2,000+ candidate views, a view graph based on weighted IoU (WIoU) is constructed to measure informational overlap between views, and NMS is applied to eliminate redundant candidates.
- Design Motivation: Feature-matching-based redundancy estimation is computationally prohibitive; WIoU over the certainty grid enables efficient geometric-level redundancy assessment.
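
A rough sketch of the redundancy filter follows. The summary does not give the exact WIoU formula or the NMS ranking criterion, so both are assumptions here: WIoU is taken as certainty mass in the intersection of two views' visible voxels divided by the certainty mass in their union, and views are ranked by the certainty they cover. The boolean visibility masks (for example from a frustum test against the grid) and the names `wiou` and `nms_views` are hypothetical.

```python
import numpy as np

def wiou(vis_a, vis_b, certainty):
    """Certainty-weighted IoU of two boolean visibility masks over the voxel grid."""
    inter = certainty[vis_a & vis_b].sum()
    union = certainty[vis_a | vis_b].sum()
    return float(inter / union) if union > 0 else 0.0

def nms_views(vis_masks, certainty, thresh=0.7):
    """Greedy NMS: keep the best-scoring view, drop candidates that overlap it too much."""
    # Assumption: a view's score is the total certainty of the voxels it sees.
    scores = np.array([certainty[m].sum() for m in vis_masks])
    order = [int(i) for i in np.argsort(-scores)]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if wiou(vis_masks[best], vis_masks[i], certainty) <= thresh]
    return keep
```

Because the masks live on the certainty grid, each pairwise comparison is a couple of masked sums rather than a feature-matching pass, which is the efficiency argument made in the design motivation.
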
Free-View Correction and Curriculum Learning:
- Function: Repairs low-quality candidate views and guides stable training.
- Mechanism: Poses of low-quality views are corrected by interpolating toward the nearest anchor camera; image quality is subsequently enhanced using the DIFIX3D diffusion model. During feed-forward model training, a curriculum learning strategy is adopted—training begins with high-WIoU neighbors (stable) and progressively shifts toward low-WIoU views (increasing diversity).
- Design Motivation: Discarding low-quality candidates is wasteful; corrected views recover valuable training samples. Curriculum learning prevents training instability caused by large camera motions.
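
One possible way to realize the curriculum is to lower a minimum-WIoU threshold over training, so early batches draw only strongly overlapping target views and later batches admit views with larger camera motion. The linear schedule, its endpoints, and the function names below are assumptions for illustration; the summary does not specify the schedule actually used.

```python
import numpy as np

def wiou_floor(step, total_steps, start=0.8, end=0.2):
    """Linearly lower the minimum allowed WIoU from `start` to `end` over training."""
    t = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + t * (end - start)

def sample_target_views(wiou_row, step, total_steps, k, rng):
    """Sample up to k target views whose WIoU with the input view is above the current floor."""
    wiou_row = np.asarray(wiou_row, dtype=np.float64)
    floor = wiou_floor(step, total_steps)
    candidates = np.flatnonzero(wiou_row >= floor)
    if candidates.size == 0:
        # Fallback: nothing passes the floor, so use the k most overlapping views.
        candidates = np.argsort(-wiou_row)[:k]
    return rng.choice(candidates, size=min(k, candidates.size), replace=False)
```

Here `wiou_row` is one row of the view graph (the WIoU between a chosen input view and every other view), and `rng` is a `numpy.random.Generator` such as `np.random.default_rng(0)`.
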

Loss & Training

When augmenting per-scene 3DGS optimization, the top-\(K\) free-view images with the lowest WIoU relative to training cameras are selected as auxiliary supervision targets. The loss is a weighted combination of L1 and SSIM.
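
The selection and loss can be sketched as follows. The summary gives neither the loss weights nor the SSIM variant, so the code assumes the form used by the original 3DGS formulation, \((1 - \lambda)\,\mathcal{L}_1 + \lambda\,(1 - \text{SSIM})\) with \(\lambda = 0.2\), and substitutes a simplified SSIM computed from whole-image statistics in place of the usual sliding-window version. It also assumes each free view has a single WIoU score relative to the training cameras. All function names are hypothetical.

```python
import numpy as np

def select_exploratory_views(wiou_to_train, k):
    """Indices of the k free views with the lowest WIoU relative to the training cameras."""
    return np.argsort(np.asarray(wiou_to_train))[:k]

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM from global image statistics; images are assumed to lie in [0, 1]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def photometric_loss(render, target, lam=0.2):
    """Weighted combination of L1 and (1 - SSIM); lam = 0.2 is an assumed weight."""
    l1 = np.abs(render - target).mean()
    return (1.0 - lam) * l1 + lam * (1.0 - global_ssim(render, target))
```

In a real training loop the loss would be computed on differentiable tensors with a windowed SSIM; the NumPy version only shows how the two terms are weighted.
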

Key Experimental Results

Main Results

| Dataset / Setting | Metric | LVSM Baseline | LVSM + FreeScale | Gain (dB) |
| --- | --- | --- | --- | --- |
| DL3DV (large motion) | PSNR (dB) | 18.75 | 21.45 | +2.70 |
| DL3DV (small motion) | PSNR (dB) | 22.20 | 24.20 | +2.00 |
| MipNeRF360 (large motion) | PSNR (dB) | 13.88 | 17.27 | +3.39 |
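
For readers less familiar with the metric, PSNR is \(10 \log_{10}(\text{MAX}^2 / \text{MSE})\). The helper below is a generic sketch of that definition, not the evaluation code behind the table; the name `psnr` and the default peak value of 1.0 are assumptions.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with peak value `max_val`."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

For a single image at a fixed peak value, a gain of 2.7 dB corresponds to the MSE shrinking by a factor of \(10^{-0.27} \approx 0.54\). Reported PSNR is typically averaged over images, so this conversion is only indicative.
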

Ablation Study

| Configuration | Description |
| --- | --- |
| w/o certainty guidance | Sampling low-quality regions degrades performance |
| w/o view graph filtering | More redundant views reduce training efficiency and quality |
| w/o curriculum learning | Large camera motions cause training instability |

Key Findings

- Adding only 22% generated data significantly improves generalization in sparse-view reconstruction.
- In per-scene 3DGS optimization, exploratory views targeting low-certainty regions also yield consistent improvements.
- The view graph is more effective than simple frame-distance sampling for guiding training batch selection.

Highlights & Insights

- Elegant reuse of the certainty grid: A single, simple voxel statistic serves view filtering, view graph construction, and exploratory training at once, which keeps the whole pipeline unified.
- Data engine paradigm: Treating 3D reconstruction as a data factory rather than an end product is a transferable idea applicable to data augmentation across a broader range of 3D tasks.

Limitations & Future Work

- The approach depends on the quality of the initial 3DGS reconstruction; performance degrades when input images are extremely sparse.
- Generated data still exhibits a synthetic-to-real domain gap, particularly at scene boundaries.
- Future work could integrate stronger generative models to further improve free-view image quality.

- vs. Megasynth: Megasynth stacks textures onto amorphous geometry, resulting in low data fidelity; FreeScale leverages real-scene reconstructions and preserves semantic consistency.
- vs. DIFIX3D: DIFIX3D is a single-scene post-processing enhancement method, whereas FreeScale is a data generation engine designed to scale training data.

Rating

- Novelty: ⭐⭐⭐⭐. The certainty-guided data scaling concept is novel, though the overall contribution is largely an engineering integration.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐. Both feed-forward and per-scene application scenarios are thoroughly validated.
- Writing Quality: ⭐⭐⭐⭐. Well-structured with detailed method descriptions.
- Value: ⭐⭐⭐⭐. Addresses a key data bottleneck in 3D vision with strong practical utility.