RayNova: Scale-Temporal Autoregressive World Modeling in Ray Space¶
Conference: CVPR 2026 arXiv: 2602.20685 Authors: Yichen Xie, Chensheng Peng, Mazen Abdelfattah, Yihan Hu et al. (Applied Intuition, UC Berkeley) Code: Project Page Area: 3D Vision Keywords: World Model, Multi-View Video Generation, Autoregressive, Plücker Rays, Autonomous Driving
TL;DR¶
This paper proposes RayNova, a geometry-agnostic multi-view world model based on dual-causal (scale + temporal) autoregressive modeling. By leveraging relative Plücker ray positional encodings, RayNova achieves unified 4D spatiotemporal reasoning and attains state-of-the-art multi-view video generation performance on nuScenes.
Background & Motivation¶
World foundation models (WFMs) aim to simulate the physical evolution of real-world environments. Existing approaches suffer from fundamental limitations:
- Decoupled spatial-temporal design: Spatial relationships are modeled via multi-view adjacency while temporal dynamics rely on video generation techniques, handled separately — limiting adaptability to novel camera configurations and fast motion.
- Reliance on strong 3D priors: Methods depend on explicit 3D representations such as point clouds or BEV, which restricts generalization to open-world settings.
- Fixed camera configuration binding: Most methods assume a fixed sensor layout and adjacency structure.
Core Problem¶
How can a world model be constructed with minimal inductive bias that remains physically plausible while generalizing to arbitrary camera configurations and motions?
Method¶
3.1 Next-Scale Prediction (Foundation)¶
Building on visual autoregressive (VAR) modeling, RayNova quantizes each image into \(K\) multi-scale token maps \(X_{1:K}\), generated autoregressively from coarse to fine, with all tokens of a scale predicted in parallel:

\(p(X_{1:K}) = \prod_{k=1}^{K} p\big(X_k \mid X_{1:k-1}\big)\)
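As a sketch, this coarse-to-fine factorization is just a generation loop over scales; `predict_logits` below is a hypothetical stand-in for the transformer, and the codebook size and scale schedule are illustrative, not the paper's:

```python
import numpy as np

def next_scale_generation(scales, predict_logits):
    """Coarse-to-fine generation over multi-scale token maps X_1..X_K.

    scales         : list of (h, w) token-map sizes, coarsest first
    predict_logits : stand-in for the model; maps the already generated
                     coarser maps to logits over the codebook for scale k
    """
    generated = []
    for h, w in scales:
        logits = predict_logits(generated, (h, w))  # (h, w, vocab)
        # all tokens of one scale are decoded in parallel, conditioned
        # only on the coarser scales already produced
        generated.append(logits.argmax(axis=-1))
    return generated

# toy stub: random logits over a 16-entry codebook
rng = np.random.default_rng(0)
stub = lambda prev, hw: rng.random((hw[0], hw[1], 16))
maps = next_scale_generation([(1, 1), (2, 2), (4, 4)], stub)
```

Each step conditions only on coarser maps, which is what makes the per-scale decoding parallel rather than token-by-token.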
3.2 Dual-Causal Autoregression¶
Scale causality: All \(V\) views within a single frame are modeled jointly (as they depict the same 3D space), with generation proceeding scale by scale:

\(p\big(X_{1:K}^{1:V}\big) = \prod_{k=1}^{K} p\big(X_k^{1:V} \mid X_{1:k-1}^{1:V}\big)\)
Temporal causality: The current frame is conditioned on all views from all historical frames, without assuming strong dependencies between frames of the same camera:

\(p\big(X^{1:V,1:T}\big) = \prod_{t=1}^{T} p\big(X^{1:V,t} \mid X^{1:V,<t}\big)\)
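The two causality rules can be combined into a single block-causal attention mask. The sketch below is an assumption about how such a mask could be built (the token ordering and helper name are illustrative, not from the paper):

```python
import numpy as np

def dual_causal_mask(num_frames, num_scales, num_views):
    """Boolean attention mask for dual-causal (scale + temporal) modeling.

    Tokens are ordered frame-major, then scale, then view.  A query token
    may attend to (a) every token of any earlier frame (temporal causality)
    and (b) tokens of its own frame at the same or a coarser scale (scale
    causality); views within a frame are never masked against each other.
    """
    n = num_frames * num_scales * num_views
    idx = np.arange(n)
    frame = idx // (num_scales * num_views)
    scale = (idx // num_views) % num_scales
    earlier_frame = frame[:, None] > frame[None, :]
    same_frame = frame[:, None] == frame[None, :]
    coarser_or_same = scale[:, None] >= scale[None, :]
    return earlier_frame | (same_frame & coarser_or_same)

mask = dual_causal_mask(num_frames=2, num_scales=2, num_views=1)
```

Note that the view index never appears in the mask: all views of a frame form one joint block, which is what decouples the model from any particular camera rig.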
3.3 Isotropic Spatiotemporal Representation¶
Core Innovation: Rotary positional encoding (RoPE) based on relative Plücker rays.
For each token \(k\) in view \(v\) at time \(t\), the spatiotemporal Plücker coordinate is \(\mathbf{p}_k^{v,t} = (\mathbf{m}_k^{v,t}, \mathbf{d}_k^{v,t}, t) \in \mathbb{R}^7\), where \(\mathbf{d}_k^{v,t}\) is the ray direction through the token's pixel and \(\mathbf{m}_k^{v,t} = \mathbf{o}^{v,t} \times \mathbf{d}_k^{v,t}\) is its moment about the camera origin \(\mathbf{o}^{v,t}\).
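A minimal sketch of the ray computation, assuming `origin` is the camera center \(\mathbf{o}^{v,t}\) and `direction` the token's ray direction (the function name is illustrative):

```python
import numpy as np

def plucker_ray(origin, direction, t):
    """7-D coordinate (m, d, t) for one token's viewing ray.

    origin    : camera center in world coordinates, shape (3,)
    direction : ray direction through the token's pixel, shape (3,)
    t         : frame timestamp
    """
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    m = np.cross(origin, d)  # moment: unchanged when origin slides along the ray
    return np.concatenate([m, d, [t]])

p = plucker_ray(origin=[1.0, 0.0, 0.0], direction=[0.0, 0.0, 1.0], t=0.5)
```

A useful sanity check: sliding the camera center along the ray leaves \((\mathbf{m}, \mathbf{d})\) unchanged, so the encoding depends on the ray itself rather than on where the camera sits on it.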
RoPE is extended to the 7D ray space by assigning each coordinate its own bank of rotated channel pairs. Because the rotations compose, the attention score between two tokens depends only on their relative position:

\(\big\langle R(\mathbf{p}_i)\,\mathbf{q}_i,\; R(\mathbf{p}_j)\,\mathbf{k}_j \big\rangle = \mathbf{q}_i^\top R\big(\mathbf{p}_j - \mathbf{p}_i\big)\,\mathbf{k}_j\)
Key advantages:
- Isotropic across all scales, views, and frames; no camera-specific assumptions are required.
- Relative encoding naturally supports extrapolation beyond the training distribution.
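The relative property behind the second advantage can be verified numerically. The sketch below assumes one independent bank of rotated channel pairs per ray coordinate; the channel counts and frequencies are illustrative, not the paper's:

```python
import numpy as np

def rope_rotate(x, pos, freqs):
    """Rotate consecutive channel pairs of x by angles pos * freqs."""
    pairs = x.reshape(-1, 2)
    theta = pos * freqs                       # one angle per channel pair
    c, s = np.cos(theta), np.sin(theta)
    out = np.stack([pairs[:, 0] * c - pairs[:, 1] * s,
                    pairs[:, 0] * s + pairs[:, 1] * c], axis=1)
    return out.reshape(-1)

def rope_7d(x, p, freqs):
    """Extend RoPE to a 7-D position p: each coordinate gets its own bank
    of channel pairs, rotated by that coordinate's value."""
    banks = x.reshape(7, -1)
    return np.concatenate([rope_rotate(b, p[i], freqs)
                           for i, b in enumerate(banks)])

rng = np.random.default_rng(0)
freqs = 1.0 / 100.0 ** (np.arange(2) / 2.0)   # 2 pairs -> 4 channels per coord
q, k = rng.standard_normal(28), rng.standard_normal(28)
p_q, p_k, shift = (rng.standard_normal(7) for _ in range(3))
score = rope_7d(q, p_q, freqs) @ rope_7d(k, p_k, freqs)
shifted = rope_7d(q, p_q + shift, freqs) @ rope_7d(k, p_k + shift, freqs)
```

Shifting both positions by the same 7-D offset leaves the attention score unchanged, so only relative ray geometry enters the attention computation.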
3.4 Transformer Architecture¶
Each block contains three attention layers:
1. Image-wise self-attention: processes each image independently with 2D axial RoPE, preserving per-image fidelity.
2. Global self-attention: unified attention across all views and frames with Plücker ray RoPE, ensuring spatiotemporal consistency.
3. Image-wise cross-attention: fuses conditional inputs such as text, 3D bounding boxes, and HD maps.
Condition encoding: 3D bounding box corners are projected into image space and encoded; T5 text embeddings are used for text; HD map points are sampled in 3D, projected, and encoded via PointNet.
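Both the box corners and the sampled HD-map points pass through the same geometric step: projection into image space. A hedged sketch of that step using a standard pinhole model (`project_points` and the camera parameters are illustrative):

```python
import numpy as np

def project_points(pts_world, R, t, K):
    """Project Nx3 world points (e.g. 3D box corners or sampled HD-map
    points) into pixel coordinates with extrinsics (R, t) and intrinsics K.
    Points at or behind the camera plane come back as NaN."""
    cam = pts_world @ R.T + t             # world -> camera frame
    z = cam[:, 2:3]
    uv = (cam @ K.T)[:, :2] / np.where(z > 1e-6, z, np.nan)
    return uv

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
uv = project_points(np.array([[0.0, 0.0, 5.0]]), np.eye(3), np.zeros(3), K)
```

A point on the optical axis projects to the principal point, which makes for an easy correctness check.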
3.5 Long-Video Recurrent Training¶
To address distribution shift in long-video generation, a recurrent training strategy is proposed:
- Frame-by-frame forward and backward passes with gradient accumulation, followed by a single unified parameter update.
- Latent features (rather than KV caches) are cached, reducing GPU memory by 50% while retaining gradients through the KV projection layers.
- Random bit-flip noise is injected into visual token inputs to simulate inference-time errors.
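The bit-flip noise on quantized token indices can be sketched as follows; `vocab_bits` and `flip_prob` are hypothetical parameter names, and the actual noise schedule is not specified here:

```python
import numpy as np

def bitflip_tokens(tokens, vocab_bits, flip_prob, rng):
    """Randomly flip bits of quantized token indices so training sees the
    kinds of corrupted inputs that arise during autoregressive inference."""
    flips = (rng.random((*tokens.shape, vocab_bits)) < flip_prob).astype(np.int64)
    xor_mask = (flips << np.arange(vocab_bits)).sum(axis=-1)  # per-token XOR mask
    return tokens ^ xor_mask

rng = np.random.default_rng(0)
tokens = np.arange(8)
clean = bitflip_tokens(tokens, vocab_bits=4, flip_prob=0.0, rng=rng)
noisy = bitflip_tokens(tokens, vocab_bits=4, flip_prob=0.3, rng=rng)
```

Because the corruption is an XOR, noisy indices stay inside the codebook range, so the model never sees an out-of-vocabulary token.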
Key Experimental Results¶
| Method | Resolution | FID ↓ | FVD ↓ | Throughput ↑ (img/s) |
|---|---|---|---|---|
| MagicDrive | 224×400 | 16.2 | - | 1.76 |
| DriveDreamer | 256×448 | 14.9 | 341 | 0.37 |
| Panacea | 256×512 | 17.0 | 139 | 0.67 |
| RayNova | 384×672 | 10.5 | 91 | 1.96 |

| Evaluation Dimension | Method | Metric (relative to Oracle) |
|---|---|---|
| Object Conditioning (StreamPETR) | Panacea | 32.1 NDS (68%) |
| Object Conditioning (StreamPETR) | RayNova | 41.9 NDS (89%) |
| Object Conditioning (SparseFusion) | X-Drive | 69.6 NDS (95%) |
| Object Conditioning (SparseFusion) | RayNova | 72.0 NDS (99%) |
| Novel View Synthesis FID (shift 4m) | StreetGaussian | 67.44 |
| Novel View Synthesis FID (shift 4m) | RayNova | 17.48 |
Highlights & Insights¶
- Geometry-agnostic design: No reliance on point clouds, BEV, or depth-based 3D priors; geometric awareness is achieved solely through relative ray positional encodings.
- Dual-causal autoregression: A unified scale-temporal causal framework that is more flexible than decoupled spatial-temporal attention designs.
- Strong novel-view generalization: Zero-shot adaptation to unseen camera configurations; FID of 17.48 vs. StreetGaussian's 67.44 under a 4m camera shift.
- Efficient generation: Throughput of 1.96 img/s significantly outperforms diffusion-based baselines (0.37–1.76 img/s).
- Heterogeneous data compatibility: Supports mixed training data with varying sensor configurations, resolutions, and frame rates.
Limitations & Future Work¶
- An image-based VAE is used, which may affect FID/FVD metrics.
- Training data volume (~60 hours) remains limited compared to some methods trained on private large-scale datasets.
- Recurrent training incurs longer overall training time.
- The 3D point projection for map conditioning lacks height information.
- Experiments are conducted exclusively in driving scenarios; generalization to indoor or other settings remains unverified.
Related Work & Insights¶
- vs. Panacea: Panacea assumes strong temporal dependencies between frames of the same camera, binding it to specific camera configurations; RayNova is fully decoupled, achieving FVD 91 vs. 139.
- vs. X-Drive: X-Drive relies on point clouds as a 3D prior; RayNova requires no explicit 3D representation.
- vs. StreetGaussian/OmniRe: Explicit 3D representations degrade sharply under large camera shifts (FID 67+), whereas RayNova remains robust (17.48).
- vs. BEVWorld: BEV representations are tied to a specific ground plane height; RayNova's ray space is more general.
- The relative Plücker ray encoding principle transfers to other geometry-aware generative tasks.
- The dual-causal autoregressive framework offers a unified paradigm for multi-modal and multi-resolution generation.
- The recurrent training strategy for mitigating distribution shift is broadly applicable to long-sequence generation.
- The combination with VAR (visual autoregressive modeling) warrants further investigation.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Dual-causal autoregression combined with relative ray positional encoding represents a fundamentally novel design paradigm.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluation covers multiple dimensions (quality, conditioning, novel-view synthesis, motion), though limited to driving scenarios.
- Writing Quality: ⭐⭐⭐⭐⭐ — Mathematical derivations are rigorous and figures are well-crafted.
- Value: ⭐⭐⭐⭐⭐ — Establishes a new direction for geometry-agnostic world modeling.