ERUPT: Efficient Rendering with Unposed Patch Transformer

Conference: CVPR 2025
arXiv: 2503.24374
Code: None (Dataset MSVS-1M provided)
Area: 3D Vision
Keywords: Novel View Synthesis, Scene Representation, Unposed Rendering, Patch Decoder, Latent View Synthesis

TL;DR

ERUPT is an efficient latent view synthesis model. It replaces pixel-level decoding with a patch-based decoder, adds learnable latent camera poses, and uses a frozen DINOv2 feature extractor. From only 5 unposed input images it renders novel views at 600 fps and reaches SOTA performance on the MSN dataset.

Background & Motivation

Background: Novel view synthesis has mostly relied on two frameworks, NeRF and 3D Gaussian Splatting (3DGS). Both achieve high-quality rendering by training scene-specific models, but they require dense images and precise camera poses. More recently, SRT and RUST have explored feed-forward, generalizable schemes based on latent scene representations.
Limitations of Prior Work: NeRF and 3DGS must be re-trained for each new scene and depend on a large number of input images with accurate poses. SRT requires precise camera parameters for all images. RUST supports unposed training, but it still needs part of the target image at inference time to query the model, which prevents direct camera control. Both SRT and RUST decode pixel by pixel, which is computationally expensive.
Key Challenge: Existing latent scene representation methods face bottlenecks along three dimensions: (1) inability to train effectively on unposed data, (2) inability to directly control the camera at inference time, and (3) extremely low computational efficiency due to pixel-by-pixel decoding.
Goal: Design a generalized novel view synthesis model that supports both posed and unposed training, allows direct camera pose specification at inference time, and improves computational efficiency by an order of magnitude.
Key Insight: Alternate self-attention and cross-attention in the encoder to distinguish tokens from different images, and introduce patch-based decoding to replace pixel-level decoding.
Core Idea: Address unposed training, direct camera control, and computational efficiency together with three architectural components: patch ray queries, learnable latent poses, and an alternating-attention scene Transformer.

Method

Overall Architecture

The input to ERUPT is a set of unordered (potentially unposed) scene images (typically 5), and the output is a novel view image rendered from any specified camera pose. The entire pipeline consists of three stages: (1) extracting feature tokens from each image using a frozen DINOv2; (2) generating a compact scene representation and estimating the camera pose for each image via a scene Transformer that alternates self-attention and cross-attention; (3) efficiently rendering the target image from the scene representation using a patch-based decoder.
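To give a feel for the sizes involved in stages (1) and (2), the sketch below counts the tokens that enter the scene Transformer. The 224-pixel input resolution fed to the backbone is an assumption made here for illustration; the patch size of 14 is the one used by the public DINOv2 ViT models, and the one-camera-token-per-image convention follows the description in this summary. The function name is hypothetical.

```python
# Token bookkeeping for the encoder side of the pipeline.
NUM_IMAGES = 5       # typical number of input views (from the text)
IMAGE_SIZE = 224     # ASSUMPTION: backbone input resolution
BACKBONE_PATCH = 14  # patch size of the public DINOv2 ViT models


def encoder_token_count(num_images, image_size, patch):
    """Feature tokens plus one learnable camera token per input image."""
    per_image = (image_size // patch) ** 2
    return num_images * (per_image + 1)


tokens = encoder_token_count(NUM_IMAGES, IMAGE_SIZE, BACKBONE_PATCH)
print(tokens)  # 5 * (16 * 16 + 1) = 1285
```

Under these assumptions the scene Transformer mixes on the order of a thousand tokens, independent of how many target views are later rendered from the resulting scene representation.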

Key Designs

Alternating-Attention Scene Transformer:
- Function: Extract a compact scene representation from unposed input images, while estimating relative camera poses for each image.
- Mechanism: Unlike SRT, which concatenates all tokens and applies global attention, ERUPT alternates intra-image self-attention (mixing among the tokens of a single image) with scene-wide cross-attention (each image's tokens attend to all scene tokens). A learnable camera token is appended to each image's token sequence and aggregates camera information as the two attention types alternate. The scene Transformer consists of 6 blocks.
- Design Motivation: Simple token mixing in SRT cannot distinguish tokens from different images, and RUST only uses simple tags to differentiate reference and non-reference frames. The alternating attention design allows the model to distinguish different images and construct robust scene representations even without knowing camera parameters.
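The alternation described above can be sketched in a few lines of plain Python. This is a deliberately stripped-down illustration, not the paper's implementation: it uses single-head attention with identity projections and a residual connection, and omits learned weights, layer normalization, and MLPs. The function names and the toy token sizes are assumptions; only the self/cross alternation, the per-image camera token, and the 6-block depth come from the text.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head dot-product attention with identity projections (a simplification)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        mixed = [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        out.append([x + y for x, y in zip(q, mixed)])  # residual connection
    return out


def alternating_block(images):
    """images: list of per-image token lists; the last token of each is the camera token."""
    # 1) Intra-image self-attention: tokens only see tokens of their own image.
    images = [attend(tokens, tokens) for tokens in images]
    # 2) Scene-wide cross-attention: each image's tokens query all scene tokens.
    scene = [tok for tokens in images for tok in tokens]
    return [attend(tokens, scene) for tokens in images]


def scene_transformer(images, num_blocks=6):
    for _ in range(num_blocks):
        images = alternating_block(images)
    return images


random.seed(0)
# 5 images, 4 feature tokens + 1 camera token each, 8-dimensional (toy sizes).
images = [[[random.gauss(0, 1) for _ in range(8)] for _ in range(5)] for _ in range(5)]
out = scene_transformer(images)
camera_tokens = [tokens[-1] for tokens in out]
print(len(out), len(out[0]), len(camera_tokens[0]))  # 5 5 8
```

Because step 1 never mixes tokens across images, the grouping of tokens by source image is preserved throughout, which is the property the design motivation relies on.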
Target Camera Switching:
- Function: Support unposed data during training while enabling direct camera control at inference time.
- Mechanism: The estimated latent pose is concatenated with the sinusoidal encoding of the ground-truth pose. During training, one of three modes is sampled at random, each with probability 1/3: latent pose only, ground-truth pose only, or both. During inference, only the encoded target pose is used, which enables direct camera control. Ablation studies show minimal performance degradation when ground-truth poses are available for only 5% of targets.
- Design Motivation: RUST requires half of the target image to estimate the pose, which limits the views that can be generated. The switching mechanism lets the model fall back on the latent channel to produce correct outputs when poses are imprecise, while retaining accurate camera control during inference.
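A minimal sketch of the three-mode switching follows. Several details are assumptions made for illustration and are not confirmed by this summary: the unused channel is zeroed out, the sinusoidal encoding uses power-of-two frequencies, and targets without a ground-truth pose always fall back to the latent-only mode. All function and constant names are hypothetical.

```python
import math
import random

MODES = ("latent_only", "gt_only", "both")


def sinusoidal_encoding(values, num_freqs=4):
    """Encode each scalar with sin/cos at frequencies 1, 2, 4, ... (assumed scheme)."""
    enc = []
    for v in values:
        for k in range(num_freqs):
            enc.append(math.sin((2 ** k) * v))
            enc.append(math.cos((2 ** k) * v))
    return enc


def sample_training_mode(rng, has_gt_pose=True):
    """Pick one of the three modes uniformly (probability 1/3 each)."""
    if not has_gt_pose:
        return "latent_only"  # ASSUMPTION: unposed targets can only use the latent pose
    return rng.choice(MODES)


def build_pose_query(latent_pose, gt_pose, mode, num_freqs=4):
    """Concatenate latent pose and encoded ground-truth pose, masking per mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    latent = list(latent_pose)
    gt_enc = sinusoidal_encoding(gt_pose, num_freqs)
    if mode == "latent_only":
        gt_enc = [0.0] * len(gt_enc)  # ASSUMPTION: masked channel is zeroed
    elif mode == "gt_only":
        latent = [0.0] * len(latent)
    return latent + gt_enc


rng = random.Random(0)
mode = sample_training_mode(rng)
query = build_pose_query([0.1, 0.2], [0.0, 1.0, 2.0], mode)
print(mode, len(query))
```

At inference time the call would use `mode="gt_only"` with the user-specified target pose in place of a ground-truth pose, which is what gives direct camera control.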
Patch-Based Decoder:
- Function: Improve rendering efficiency by an order of magnitude.
- Mechanism: Use \(8 \times 8\) patch rays instead of pixel-by-pixel rays to query the scene representation, reconstructing 64 pixels per query instead of 1. The decoder consists of 4 standard Transformer decoder blocks that alternate self-attention and cross-attention with the scene representation, followed by 3 convolutional pixel shuffle upsampling blocks. An additional token decoder matches the semantic embeddings of the DINOv2 backbone.
- Design Motivation: Pixel-by-pixel decoding, as used in SRT and RUST, is memory-intensive: RUST runs out of memory (OOM) on a 48GB A6000 at 224 resolution. Patch decoding reduces VRAM requirements by \(64\times\).
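The \(64\times\) figure matches the reduction in the number of decoder queries, and three upsampling stages are exactly what is needed to expand one token into an \(8 \times 8\) patch if each stage doubles the resolution. The sketch below illustrates both points. It assumes each pixel shuffle stage upsamples by a factor of 2 and omits the convolutions between stages; `pixel_shuffle` follows the same channel layout as `torch.nn.PixelShuffle` but is written in plain Python, and the 192-channel token width is chosen only so that the arithmetic works out to 3 output channels.

```python
def query_counts(image_size=224, patch=8):
    """Number of decoder queries for per-pixel rays versus patch rays."""
    pixel_queries = image_size * image_size
    patch_queries = (image_size // patch) ** 2
    return pixel_queries, patch_queries, pixel_queries // patch_queries


def pixel_shuffle(x, r):
    """Rearrange [C*r*r][H][W] into [C][H*r][W*r] (layout of torch.nn.PixelShuffle)."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[c][y * r + i][z * r + j] = src[y][z]
    return out


print(query_counts())  # (50176, 784, 64)

# One decoded token viewed as a 192 x 1 x 1 tensor (toy values).
x = [[[float(ch)]] for ch in range(192)]
for _ in range(3):  # ASSUMPTION: three stages, each upsampling by 2
    x = pixel_shuffle(x, 2)
print(len(x), len(x[0]), len(x[0][0]))  # 3 8 8
```

Each shuffle trades a factor of 4 in channels for a factor of 2 in height and width, so 192 channels at 1 x 1 become 3 channels at 8 x 8 after three stages.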

Loss & Training

The image decoder is trained with an \(L_2\) pixel loss. The token decoder uses the ArcGeo auxiliary loss to match DINOv2 semantic features. Camera poses are trained with an \(L_2\) position loss plus a negative cosine view loss. Five target images are rendered per scene during training so that each scene representation is reused. The model is optimized with AdamW for 160 epochs at a batch size of 128. For GAN fine-tuning, \(L_2\) is replaced with \(L_1\) + perceptual + GAN loss. For Stable Diffusion (SD) rendering, the token decoder output is used as the prompt and the SD U-Net is fine-tuned for 20 epochs.
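The camera pose objective can be sketched as follows. This is an illustrative reading of "\(L_2\) position loss + negative cosine view loss", assuming a squared Euclidean distance on camera positions, a cosine similarity on viewing directions, and an equal weighting of the two terms; the function names and the `view_weight` parameter are hypothetical.

```python
import math


def l2_position_loss(pred_pos, true_pos):
    # ASSUMPTION: squared Euclidean distance between camera positions.
    return sum((p - t) ** 2 for p, t in zip(pred_pos, true_pos))


def negative_cosine_view_loss(pred_dir, true_dir, eps=1e-8):
    """Negative cosine similarity between predicted and true viewing directions."""
    dot = sum(p * t for p, t in zip(pred_dir, true_dir))
    norm = math.sqrt(sum(p * p for p in pred_dir)) * math.sqrt(sum(t * t for t in true_dir))
    return -dot / max(norm, eps)


def camera_pose_loss(pred_pos, true_pos, pred_dir, true_dir, view_weight=1.0):
    # ASSUMPTION: the two terms are summed with a scalar weight on the view term.
    position = l2_position_loss(pred_pos, true_pos)
    view = negative_cosine_view_loss(pred_dir, true_dir)
    return position + view_weight * view


# A perfect prediction gives 0 for the position term and -1 for the view term.
print(camera_pose_loss([0, 0, 0], [0, 0, 0], [0, 0, 1], [0, 0, 1]))  # -1.0
```

With this formulation the loss is minimized at -1 rather than 0, since the cosine term rewards alignment instead of penalizing misalignment.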

Key Experimental Results

Main Results

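| Method | Input Poses | Target Poses | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ |
| --- | --- | --- | --- | --- | --- | --- |
| SRT | ✓ | ✓ | 23.41 | 0.697 | 0.369 | - |
| SRT* | ✓ | ✓ | 25.93 | - | 0.237 | 67.29 |
| RUST | ✗ | ✗ | 23.49 | 0.703 | 0.351 | - |
| DORSal | ✗ | ✗ | 18.99 | - | 0.265 | 9.00 |
| ERUPT L+LoRA | ✗ | Partial | 25.26 | 0.769 | 0.340 | 91.1 |
| ERUPT B+GAN | ✗ | Partial | 23.38 | 0.713 | 0.204 | 7.45 |
| ERUPT B+SD | ✗ | Partial | 21.06 | 0.637 | 0.234 | 6.89 |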

Ablation Study

| Configuration | PSNR↑ | SSIM↑ | Description |
| --- | --- | --- | --- |
| ERUPT B (baseline) | 23.85 | 0.718 | Full model |
| Single target training | 23.43 | 0.700 | Reusing scene representation for multiple targets is beneficial |
| Patch RUST | 23.20 | 0.690 | Simple token mixing is inferior to alternating attention |
| ERUPT B+LoRA | 24.69 | 0.749 | LoRA fine-tuning of the backbone yields significant improvement |
| ERUPT L+LoRA | 25.26 | 0.769 | Scaling up the model further improves performance |
| 5% known poses | 23.55 | 0.706 | Requires very few ground-truth poses |

Key Findings

LoRA fine-tuning of the DINOv2 backbone contributes the most (PSNR +0.84), indicating that even robust foundation models need adaptation for 3D scene synthesis tasks.
Using only 5% of the ground-truth target poses results in a performance drop of only 0.3 PSNR, demonstrating the robustness of the pose-switching strategy.
The patch decoder makes training 5 times faster than RUST (at 224 resolution) and reduces VRAM footprint by \(64\times\); RUST directly runs out of memory (OOM) on a 48GB GPU at 224 resolution.
GAN and SD fine-tuning significantly improve perceptual quality: in the main results, FID is 91.1 for ERUPT L+LoRA versus 7.45 for ERUPT B+GAN and 6.89 for ERUPT B+SD. Multi-view consistency for SD remains challenging.
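The PSNR differences quoted in these findings can be reproduced directly from the ablation table. The short script below is only a bookkeeping aid; the dictionary copies the table's PSNR column and the helper name is hypothetical.

```python
# PSNR values (dB) copied from the ablation table.
ABLATION_PSNR = {
    "ERUPT B (baseline)": 23.85,
    "Single target training": 23.43,
    "Patch RUST": 23.20,
    "ERUPT B+LoRA": 24.69,
    "ERUPT L+LoRA": 25.26,
    "5% known poses": 23.55,
}


def psnr_delta(name, reference="ERUPT B (baseline)"):
    """PSNR difference in dB relative to a reference configuration."""
    return round(ABLATION_PSNR[name] - ABLATION_PSNR[reference], 2)


for name in ABLATION_PSNR:
    print(f"{name}: {psnr_delta(name):+.2f} dB")
```

Relative to the baseline this gives +0.84 dB for LoRA fine-tuning and -0.30 dB when only 5% of target poses are known, matching the figures quoted above.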

Highlights & Insights

Patch ray queries are the core efficiency innovation: each query decodes 64 pixels instead of 1, with almost no loss in quality and an order-of-magnitude gain in efficiency. The idea can be transferred to any ray-query-based scene representation method.
Random pose-switching training is an elegant design: sampling among three modes, each with probability 1/3, teaches the model to use both ground-truth and latent poses, so that either can be selected at inference time. This strategy of mixing multiple input modes during training can be transferred to other multimodal tasks.
The MSVS-1M real-world dataset (1 million images from Mapillary street views) addresses the lack of large-scale real-world datasets in this field.

Limitations & Future Work

\(L_2\) loss produces blurry outputs in scenes with high uncertainty. GAN/SD fine-tuning helps, but it introduces new artifacts (GAN) or multi-view inconsistency (SD).
SD rendering runs at only about 1 fps (at 512 resolution), which rules out real-time use.
Each frame is rendered independently, lacking temporal consistency; incorporating multi-view diffusion models (such as the approach used in DORSal) is a promising future direction.
Performance drops significantly on the real-world MSVS-1M dataset (PSNR 20.64, versus 24.69 on MSN), indicating that generalization to complex real-world scenes still needs improvement.

vs SRT: SRT requires precise poses for all images, whereas ERUPT requires no input poses and only a fraction of target poses. ERUPT naturally resolves the pose issue using alternating attention and camera tokens.
vs RUST: RUST requires half of the target image during inference and cannot directly control the camera. ERUPT's pose-switching mechanism achieves both unposed training and camera control at inference time.
vs DORSal: DORSal uses multi-view diffusion to guarantee consistency, which yields a good FID (9.00) but a much lower PSNR (18.99). ERUPT B+SD achieves a better FID (6.89) while maintaining a higher PSNR (21.06).

Rating

Novelty: ⭐⭐⭐⭐ Patch decoding and pose-switching training are highly innovative, though the overall architecture still follows the encoder-decoder paradigm.
Experimental Thoroughness: ⭐⭐⭐⭐ Experiments on MSN and real-world datasets are comprehensive, ablations are thorough, and computational efficiency comparisons are detailed.
Writing Quality: ⭐⭐⭐⭐ Structurally clear with complete technical details, though it contains quite a few equations and notations.
Value: ⭐⭐⭐⭐ The efficiency gains from patch decoding and the contribution of the real-world dataset are of practical value, and the pose-switching training paradigm is inspiring.