Scaling Sequence-to-Sequence Generative Neural Rendering¶
Conference: ICLR 2026 | arXiv: 2510.04236 | Code: Project Page | Area: 3D Vision | Keywords: Neural Rendering, Novel View Synthesis, Rectified Flow Transformer, Masked Autoregression, Unified Positional Encoding, Video-3D Unification
TL;DR¶
This paper presents Kaleido, a family of decoder-only rectified flow transformers that treats 3D as a special subdomain of video. Through Unified Positional Encoding, a masked autoregressive framework, and a video pretraining strategy, Kaleido achieves "any-to-any" 6-DoF novel view synthesis without any explicit 3D representation. It is the first generative method to match per-scene optimization (InstantNGP) in rendering quality under multi-view settings, and it raises generative rendering resolution from the 512/576px ceiling of prior models to 1024px.
Background & Motivation¶
Novel View Synthesis (NVS) is a core task in 3D vision: given a set of reference views, generate images at arbitrary target viewpoints. Existing methods face clear technical bottlenecks:
| Paradigm | Representative Works | Core Limitations |
|---|---|---|
| Per-scene optimization | 3DGS, NeRF, InstantNGP | Requires many views + per-scene optimization taking minutes; cannot generalize zero-shot |
| Feed-forward reconstruction | PixelNeRF, LRM | Relies on explicit 3D representations (point clouds/tri-planes); limited generalization |
| Diffusion-based generation | Zero123++, SV3D, SEVA | U-Net architecture limits resolution to 512/576px and scalability; most require SDS two-stage refinement |
| Video-controlled generation | MotionCtrl, CameraCtrl, GEN3C | Essentially 4D temporal prediction; constrained by single reference frames/fixed trajectories; cannot handle "any-to-any" spatial queries |
Core Insight: 3D can be viewed as a special subdomain of video—both are fundamentally sequences of images, differing only in whether the inter-frame camera transformations are known. However, directly fine-tuning video models does not work: video models rely on temporal VAEs that assume high temporal correlation between frames, an assumption that breaks down in sparse-view 3D tasks.
Key Motivations:

- Camera-annotated 3D data is extremely scarce, while video data is orders of magnitude more abundant
- SEVA, the previously strongest general NVS model, uses a U-Net architecture with poor scalability and is limited to 576px
- A scalable pure-Transformer architecture is needed that can learn spatial priors ("visual commonsense") from video and transfer them to 3D
Method¶
Overall Architecture¶
Kaleido formulates novel view synthesis entirely as a sequence-to-sequence image synthesis task: given \(N\) reference views with 6-DoF camera poses \(\{(I_i, P_i)\}_{i=1}^{N}\) and \(M\) target poses \(\{P_j\}_{j=1}^{M}\), it directly generates \(M\) target views \(\{I_j\}_{j=1}^{M}\). The entire process uses a single decoder-only rectified flow transformer with no explicit 3D representation (no point clouds, no NeRF, no 3DGS, no depth estimation).
Core architectural components:

- Tokenization: each image is encoded into latent tokens via a VAE
- Pose conditioning: camera pose information is injected into the token sequence via Unified Positional Encoding
- Masking strategy: reference-view tokens are kept unmasked (as conditioning), while target-view tokens are masked (to be generated)
- Generation process: masked tokens are denoised via rectified flow
Key Designs¶
1. Unified Positional Encoding
This is the key architectural innovation enabling a single Transformer to handle both video and 3D data simultaneously, requiring no additional trainable parameters:
- For video data: RoPE encodes the temporal position \(t\) and spatial position \((h, w)\) of each frame
- For 3D data: RoPE encodes Plücker ray coordinates \((\mathbf{o}, \mathbf{d})\) (computed from camera intrinsics and extrinsics), where \(\mathbf{o}\) is the ray origin and \(\mathbf{d}\) is the ray direction
- This unified design allows temporal consistency priors from video to naturally transfer as spatial consistency in 3D, and the architecture switches between video and 3D training with zero modifications
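To make the 3D branch of the encoding concrete, here is a minimal numpy sketch of per-pixel ray computation from pinhole intrinsics `K` and a camera-to-world matrix `c2w` (names and conventions are assumptions, not the paper's code); the resulting 6D ray coordinates stand in for the \((t, h, w)\) indices that RoPE would consume on video data. Note that the classical Plücker form pairs the direction \(\mathbf{d}\) with the moment \(\mathbf{o} \times \mathbf{d}\):

```python
import numpy as np

def pixel_rays(K, c2w, H, W):
    """Per-pixel rays for an H x W image: returns (H, W, 6) arrays of
    [direction, Pluecker moment o x d] in world coordinates."""
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)          # (H, W, 3)
    # Unproject to camera-space directions, rotate into world space.
    dirs = (pix @ np.linalg.inv(K).T) @ c2w[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # All rays of one view share the camera center as origin.
    origins = np.broadcast_to(c2w[:3, 3], dirs.shape)
    moments = np.cross(origins, dirs)                          # o x d
    return np.concatenate([dirs, moments], axis=-1)            # (H, W, 6)
```

A camera at the world origin yields zero moments, so the encoding degenerates gracefully for the canonical view.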
2. Masked Autoregressive Framework
Enables flexible "any-to-any" inference:

- Arbitrary \(N\) reference views → arbitrary \(M\) target views (\(N\) and \(M\) can vary at both training and inference time)
- Causal masking distinguishes conditioning views (clean tokens) from target views (noisy tokens)
- Autoregressive iteration: already-generated high-quality views can be added as new conditions, progressively expanding coverage
- Supports extreme extrapolation: from a 12-view training length to 480-frame autoregressive generation (40× the training length)
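The clean/noisy conditioning split can be sketched in a few lines of numpy (shapes, names, and the per-sample scalar `t` are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def build_sequence(ref_latents, tgt_latents, t):
    """Assemble one any-to-any sample: N reference views stay clean,
    M target views are noised via the rectified-flow interpolation.
    ref_latents: (N, L, D), tgt_latents: (M, L, D), t in (0, 1).
    Returns the token sequence and a mask (1 = to be generated)."""
    noise = rng.standard_normal(tgt_latents.shape)
    noisy_tgt = t * tgt_latents + (1.0 - t) * noise
    tokens = np.concatenate([ref_latents, noisy_tgt], axis=0)   # (N+M, L, D)
    mask = np.concatenate([np.zeros(len(ref_latents)),
                           np.ones(len(tgt_latents))])
    return tokens, mask

# Autoregressive expansion: denoised target views can simply be appended
# to the reference set for the next round, e.g.
#   ref_latents = np.concatenate([ref_latents, generated], axis=0)
```

Because the mask, not the architecture, decides which views are conditions, \(N\) and \(M\) can change freely between batches.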
3. Systematic Resolution of Scaling Bottlenecks
Through extensive ablation studies, the paper identifies and resolves non-obvious bottlenecks that prevent pure seq2seq models from scaling:

- Activation overflow (Sec 2.2.2): activations overflow during large-model training; resolved through specific activation-function choices
- Suboptimal SNR sampler (Sec 2.2.3): standard diffusion SNR sampling strategies are suboptimal for 3D rendering; a noise-biased sampling scheme is proposed to improve training
- These "scaling recipe" findings are themselves core contributions, making simple pure-Transformer designs viable for generative NVS for the first time
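As an illustration of what a noise-biased sampler might look like (the paper's exact scheme is not reproduced here, so this logit-normal form with a negative shift is purely an assumption): under the convention \(x_t = t\,x_1 + (1-t)\,x_0\) with \(x_0\) the noise, shifting probability mass toward small \(t\) means training more often on noisier steps.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_t(batch, shift=-1.0):
    """Logit-normal timestep sampler; a negative shift biases t toward 0
    (high noise). A stand-in for the paper's noise-biased sampler."""
    return 1.0 / (1.0 + np.exp(-(rng.standard_normal(batch) + shift)))

t = sample_t(100_000)
print(round(t.mean(), 2))  # noticeably below the uniform mean of 0.5
```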
Loss & Training¶
Two-stage training paradigm:
- Video pretraining stage: Training on large-scale video data with the rectified flow matching loss \(\mathcal{L} = \mathbb{E}_{t}\left[\|v_\theta(x_t, t) - (x_1 - x_0)\|^2\right]\), learning spatiotemporal consistency priors
- 3D fine-tuning stage: Fine-tuning on camera-annotated 3D datasets; the Unified Positional Encoding automatically switches to Plücker ray mode to learn precise geometric correspondences
Key training strategies:

- Rectified flow offers better training stability and sampling efficiency than standard DDPM
- Classifier-Free Guidance (CFG) is supported to enhance generation quality
- A noise-biased sampling strategy optimizes the SNR distribution for 3D generation
- No architectural modifications are required for the video-to-3D transfer
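The flow-matching loss from the pretraining stage can be written out in a few lines (a numpy sketch under the convention \(x_t = t\,x_1 + (1-t)\,x_0\), with a placeholder predictor standing in for the transformer):

```python
import numpy as np

rng = np.random.default_rng(0)

def rf_loss(v_pred_fn, x1, t):
    """Rectified-flow matching loss: interpolate x_t = t*x1 + (1-t)*x0
    and regress the constant velocity v = x1 - x0."""
    x0 = rng.standard_normal(x1.shape)                 # noise sample
    xt = t[:, None] * x1 + (1.0 - t[:, None]) * x0     # noisy latents
    v_pred = v_pred_fn(xt, t)
    return np.mean((v_pred - (x1 - x0)) ** 2)

# Toy check with a zero predictor: the loss reduces to E[(x1 - x0)^2].
x1 = rng.standard_normal((8, 16))
t = rng.uniform(size=8)
loss = rf_loss(lambda xt, tt: np.zeros_like(xt), x1, t)
```

The same objective is reused unchanged during 3D fine-tuning; only the positional encoding (and hence the conditioning) differs between the two stages.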
Key Experimental Results¶
Main Results: NVS Benchmark Comparison¶
| Setting | Comparison Methods | Kaleido Performance | Key Metric |
|---|---|---|---|
| Few-view (1–3 views) | Zero123++, SV3D, SEVA, EscherNet | Zero-shot; significantly outperforms all generative methods | PSNR substantially higher |
| Many-view (>10 views) | InstantNGP (per-scene optimization) | First to match optimization-based quality | Comparable PSNR |
| Single-view 3D reconstruction | All prior methods | SOTA | CD=1.83, VIoU=70% |
| Resolution | SEVA (576px), CAT3D (512px) | First 1024px generative rendering model | Resolution doubled |
Comparison with Video-Controlled Generation Models (Appendix H, added in rebuttal)¶
| Method | Type | Camera Accuracy (\(R_{err}\)↓ / \(T_{err}\)↓) | Visual Quality (LPIPS↓) |
|---|---|---|---|
| Wonderland (CVPR 2025) | Video→3DGS pipeline | Worse | Comparable |
| ViewCrafter | Video generation | Worse | Worse |
| VD3D | Video diffusion | Worse | Worse |
| MotionCtrl (SIGGRAPH 2024) | Video control | Worse | Worse |
| Kaleido | Seq2Seq NVS | Best | Best or comparable |
On DL3DV and Tanks & Temples, Kaleido significantly outperforms all video baseline models in camera accuracy.
Ablation Study¶
| Ablation Configuration | Impact | Notes |
|---|---|---|
| Remove video pretraining | Significant degradation in spatial consistency | Validates the "3D ≈ video subdomain" hypothesis |
| Standard PE vs. Unified PE | Performance drop | Unified PE is critical for video→3D transfer |
| Encoder-decoder vs. Decoder-only | Performance drop | Decoder-only better suits variable-length sequence tasks |
| Standard SNR sampling vs. Noise-biased sampling | Performance drop | SNR distribution optimization is critical for 3D rendering |
| Activation overflow not fixed | Unstable training | Key bottleneck for large-model scaling |
| Reduced model scale | Consistent degradation | 3D geometric understanding requires sufficient model capacity |
Key Findings¶
- 3D ≈ a special subdomain of video: The hypothesis is empirically validated—video pretraining substantially improves 3D rendering quality, and temporal consistency effectively transfers as spatial consistency
- Clear scaling law: Increasing model scale consistently improves rendering quality; a clear scaling law exists for NVS with pure Transformers
- Implicit geometric understanding: The model learns meaningful geometry without explicit 3D representations, as evidenced by 3D reconstruction quality (CD=1.83)
- Video models cannot be naively substituted: Comparative experiments show that the temporal VAE assumption of video models fails on sparse 3D tasks; methods such as GEN3C collapse under large viewpoint changes due to "empty cache" issues
Highlights & Insights¶
- Paradigm breakthrough: The first demonstration that a pure 2D sequence model can match per-scene optimization (e.g., InstantNGP) in rendering quality, with no SDS two-stage refinement required
- Architectural innovation: Unified Positional Encoding enables a single architecture to handle both video and 3D with zero modifications—an elegant and practical design
- Data scale leverage: Video pretraining elegantly compensates for the scarcity of 3D data with clearly transferable benefits—offering a paradigm reference for other 3D tasks facing data scarcity
- Scaling recipe: Systematic identification and resolution of seq2seq 3D rendering scaling bottlenecks (activation overflow, SNR sampling) yields broadly applicable insights
- 512px to 1024px: The first generative rendering model to surpass U-Net limitations and achieve 1024px resolution
Limitations & Future Work¶
- Inference efficiency: Large models with long sequences incur significant inference costs; zero-shot inference, while faster than per-scene optimization, remains far from real-time
- Geometric precision ceiling: Implicit 3D understanding may be insufficient for applications requiring sub-pixel-level geometric accuracy (e.g., AR/VR)
- Camera parameter dependency: Known 6-DoF camera parameters for target views are still required as conditions
- Long-sequence degradation: Extreme autoregressive generation at 480 frames still exhibits artifacts; consistency over very long trajectories remains to be improved
- High training cost: The combined cost of video pretraining and 3D fine-tuning is substantial
- No 4D capability: The current design targets static scenes and does not address dynamic scene reconstruction
Related Work & Insights¶
- NeRF / 3DGS: Per-scene optimization methods serving as quality upper bounds; Kaleido is the first to match InstantNGP in multi-view settings
- SEVA (ICCV 2025): A U-Net-based general NVS model; Kaleido comprehensively surpasses it in scalability and resolution
- CAT3D / ReconFusion / ZeroNVS: Generative methods requiring SDS two-stage refinement; Kaleido achieves higher quality in a single stage
- GEN3C (NVIDIA): A video model based on explicit depth reprojection that collapses under large viewpoint changes due to empty caches; Kaleido's implicit priors are more robust
- Wonderland (CVPR 2025): A video→3DGS pipeline; Kaleido significantly outperforms it in camera accuracy
- Insights: The approach of unifying specialized domain tasks as sequence-to-sequence problems and leveraging large-scale pretraining on adjacent data can be generalized to 3D editing, dynamic scene reconstruction, robotic manipulation, and other data-scarce settings
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (Triple innovation: "3D-as-Video" unified paradigm + Unified Positional Encoding + scaling recipe)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Comprehensive multi-benchmark evaluation with thorough ablations; comparison with video models was missing from the initial submission and added in the rebuttal)
- Writing Quality: ⭐⭐⭐⭐ (Concepts are clearly articulated and systematically organized; the distinguishing arguments in the rebuttal are more precise than those in the main paper)
- Value: ⭐⭐⭐⭐⭐ (Opens a new direction for generative neural rendering; first to match optimization-based quality; scaling insights have broad guiding significance)