RayNova: Scale-Temporal Autoregressive World Modeling in Ray Space
Conference: CVPR 2026
arXiv: 2602.20685
Authors: Yichen Xie, Chensheng Peng, Mazen Abdelfattah, Yihan Hu et al. (Applied Intuition, UC Berkeley)
Code: Project Page
Area: 3D Vision
Keywords: World Models, Multi-view Video Generation, Autoregressive, Plücker Rays, Autonomous Driving
TL;DR
The paper proposes RayNova, a geometry-agnostic multi-view world model built on dual-causal (scale + temporal) autoregression. Relative Plücker ray positional encoding gives it unified 4D spatio-temporal reasoning, and it achieves state-of-the-art multi-view video generation on nuScenes.
Background & Motivation
World Foundation Models (WFM) aim to simulate the physical evolution of the real world. Existing methods suffer from fundamental limitations:
- Spatio-temporal Decoupled Design: Spatial processing uses multi-view adjacency while temporal processing uses video generation techniques separately, limiting adaptability to new camera configurations and rapid motion.
- Strong 3D Prior Dependence: Dependence on explicit 3D representations like Point Clouds or BEV limits generalization in open-world scenarios.
- Fixed Camera Configuration Binding: Most methods assume a fixed sensor layout and adjacency relationships.
Core Problem
How can a world model generalize to arbitrary camera configurations and motions while remaining physically plausible and carrying minimal inductive bias?
Method
Overall Architecture
RayNova aims to build a multi-view world model that is not tied to specific camera configurations or dependent on explicit 3D priors like Point Clouds/BEV, capable of generating physically plausible futures under arbitrary sensor layouts and rapid motions. It places world modeling entirely within a "ray space" autoregressive framework: multi-view images of each frame are first quantized into multi-scale tokens and then generated step-by-step along two causal chains: "scale" and "temporal". Each generation step is completed by a three-layer attention block, where geometric information is injected entirely through relative Plücker ray positional encoding rather than any 3D representation.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
A["Multi-view image sequence<br/>(Arbitrary camera layout)"] --> B["Next-Scale Prediction<br/>Quantized into K multi-scale token maps"]
B --> C["Dual-causal autoregression<br/>Scale causality: multi-views of the same frame progress by scale<br/>Temporal causality: current frame conditioned on all historical views"]
C --> D
subgraph D["Three-layer attention Transformer block"]
direction TB
E["image-wise self-attention<br/>2D Axial RoPE, ensures single-image realism"] --> F["global self-attention<br/>Cross-view and cross-frame, ensures 4D spatio-temporal consistency"]
F --> G["image-wise cross-attention<br/>Integrates bbox / map / text conditions"]
end
H["Relative Plücker ray RoPE<br/>7-dimensional isotropic geometric encoding"] -.Inject geometry.-> F
D --> I["Multi-view future video"]
Key Designs
1. Next-Scale Prediction: Decomposing Image Generation into Coarse-to-Fine Scale Autoregression
The generation backbone of RayNova follows visual autoregressive models: each image is first quantized into \(K\) multi-scale token maps \(X_{1:K}\), then generated scale-by-scale from coarse to fine, with each scale conditioned on the coarser scales:
\[ p(X_{1:K}) = \prod_{k=1}^{K} p\left(X_k \mid X_{1:k-1}\right) \]
This coarse-to-fine sequential generation provides a natural progressive structure for the subsequent "scale + temporal" dual causality.
2. Dual-Causal Autoregression: Unifying 4D Spacetime via Scale and Temporal Causal Chains
Most existing world models decouple space and time, handling space via multi-view adjacency and time via video generation techniques, which results in poor performance under new camera configurations or rapid motion. RayNova unifies both into two causal chains. Scale Causality allows all views of the same frame to be modeled jointly (as they describe the same 3D space), progressing by scale. Writing \(X_k^{1:V,t}\) for the scale-\(k\) token maps of all \(V\) views at frame \(t\):
\[ p\left(X_{1:K}^{1:V,t}\right) = \prod_{k=1}^{K} p\left(X_k^{1:V,t} \mid X_{1:k-1}^{1:V,t}\right) \]
Temporal Causality ensures the current frame is conditioned on all views of all historical frames, without assuming strong dependencies between frames of the same camera. Over \(T\) frames:
\[ p\left(X_{1:K}^{1:V,1:T}\right) = \prod_{t=1}^{T} p\left(X_{1:K}^{1:V,t} \mid X_{1:K}^{1:V,1:t-1}\right) \]
Avoiding the bias that "adjacent frames of the same camera are most relevant" is key to its adaptation to arbitrary camera layouts.
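The two causal chains can be read as a visibility rule between groups of tokens. The sketch below is one possible reading, not the paper's implementation: the function names are hypothetical, and it assumes block-wise causality in which tokens at the same scale of the same frame can see each other. The camera index is deliberately ignored, which reflects the absence of any per-camera adjacency assumption.

```python
from itertools import product

def can_attend(query, key):
    """Return True if the query token group may attend to the key token group.

    Each group is identified by (frame t, scale k, view v).
    """
    tq, kq, _vq = query
    tk, kk, _vk = key
    if tk < tq:
        # Temporal causality: every view and every scale of past frames is visible.
        return True
    if tk == tq:
        # Scale causality: every view at the same or a coarser scale is visible
        # (same-scale visibility is an assumption of this sketch).
        return kk <= kq
    # Future frames are never visible.
    return False

def build_mask(num_frames, num_scales, num_views):
    """Enumerate all (t, k, v) groups and the boolean visibility matrix."""
    groups = list(product(range(num_frames), range(num_scales), range(num_views)))
    mask = [[can_attend(q, k) for k in groups] for q in groups]
    return groups, mask

groups, mask = build_mask(num_frames=2, num_scales=2, num_views=2)
```

Because the view index never enters `can_attend`, adding or removing a camera only changes the number of groups, not the rule itself.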
3. Isotropic Spatio-temporal Representation: Relative Plücker Ray RoPE instead of Explicit 3D Priors
This is the core of RayNova's geometry-agnosticism. Instead of relying on Point Clouds/BEV, it calculates a Plücker ray \(\mathbf{p}_k^{v,t} = (\mathbf{m}, \mathbf{d}, t) \in \mathbb{R}^7\) (where \(\mathbf{m} = \mathbf{o}^{v,t} \times \mathbf{d}_k^{v,t}\)) for each token and extends Rotary Positional Encoding (RoPE) to this 7D space:
The attention score depends only on the relative position between tokens: \(a_{i,j} = (\mathbf{R}_i \mathbf{q}_i)^T (\mathbf{R}_j \mathbf{k}_j) = \mathbf{q}_i^T \mathbf{R}_\Delta^{i,j} \mathbf{k}_j\), where \(\mathbf{R}_\Delta^{i,j} = \mathbf{R}_i^T \mathbf{R}_j\). Because the encoding is relative and isotropic across all scales, views, and frames, the model extrapolates to camera configurations outside the training distribution, which is why it remains robust even under camera displacements of 4 m.
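To make the relative property concrete, here is a minimal sketch that builds the 7D ray \((\mathbf{m}, \mathbf{d}, t)\) and applies one 2D rotation per ray coordinate. The single frequency `freq`, the 14-dimensional query/key size, and all function names are assumptions made for illustration; these notes do not give the paper's actual frequency layout.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def plucker_ray(origin, direction, t):
    """7D ray (m, d, t) with m = o x d, using a unit-normalised direction."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / norm for c in direction)
    return cross(origin, d) + d + (t,)

def rotate(vec, ray, freq=1.0):
    """Rotate consecutive pairs of vec, one pair per ray coordinate."""
    out = []
    for c, coord in enumerate(ray):
        angle = freq * coord
        x, y = vec[2 * c], vec[2 * c + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

def score(q, k, ray_q, ray_k):
    """Attention logit (R_i q)^T (R_j k) for 14-dimensional q and k."""
    rq, rk = rotate(q, ray_q), rotate(k, ray_k)
    return sum(a * b for a, b in zip(rq, rk))

q = [0.1 * i for i in range(14)]
k = [0.05 * (14 - i) for i in range(14)]
ray_a = plucker_ray((1.0, 0.0, 0.5), (0.0, 1.0, 1.0), t=0.0)
ray_b = plucker_ray((0.0, 2.0, 0.5), (1.0, 0.0, 1.0), t=0.5)

# Adding the same offset to both encodings leaves the logit unchanged,
# because only the difference ray_b - ray_a enters R_i^T R_j.
offset = (0.3, -0.2, 0.1, 0.0, 0.4, -0.1, 2.0)
shifted_a = tuple(x + o for x, o in zip(ray_a, offset))
shifted_b = tuple(x + o for x, o in zip(ray_b, offset))
difference = abs(score(q, k, ray_a, ray_b) - score(q, k, shifted_a, shifted_b))
```

The offset here is applied in the encoding space purely to exhibit the algebra; it is not claimed to correspond to a rigid motion of the cameras.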
4. Three-layer Attention Transformer: Decoupling Realism, Consistency, and Controllability
Each block uses three layers of attention for distinct roles: image-wise self-attention with 2D Axial RoPE processes each image independently to ensure single-image realism; global self-attention integrates cross-view and cross-frame attention with Plücker ray RoPE to ensure 4D spatio-temporal consistency; image-wise cross-attention integrates conditional signals. For conditions, each bounding box has its 8 corners projected into image space and is encoded together with T5 text embeddings, while maps are sampled as 3D points, projected, and encoded using PointNet.
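The corner-projection step for bounding-box conditions can be sketched as follows. This is an illustration under simplifying assumptions: an axis-aligned box already expressed in camera coordinates (no yaw), a plain pinhole camera, and hypothetical intrinsics; the helper names are not from the paper.

```python
from itertools import product

def box_corners(center, size):
    """Return the 8 corners of an axis-aligned box from its center and extents."""
    cx, cy, cz = center
    return [(cx + sx * size[0] / 2, cy + sy * size[1] / 2, cz + sz * size[2] / 2)
            for sx, sy, sz in product((-1, 1), repeat=3)]

def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (z forward) to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        return None  # the point lies behind the camera
    return (fx * x / z + cx, fy * y / z + cy)

# Hypothetical intrinsics for a 384x672 image with a centered principal point.
corners = box_corners(center=(0.0, 0.0, 10.0), size=(2.0, 2.0, 4.0))
pixels = [project(p, fx=500.0, fy=500.0, cx=336.0, cy=192.0) for p in corners]
```

The resulting 8 pixel coordinates are what a condition encoder would consume alongside the text embedding of the object class.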
Loss & Training
The greatest enemy of long video generation is distribution drift. RayNova counters this with recursive training, which has three parts: (1) forward and backward propagation run frame by frame, with a single parameter update after gradient accumulation; (2) latent features are cached instead of KV pairs, which saves 50% of GPU memory while retaining gradients for the KV projection layers; (3) random bit-flip noise is injected into the visual token inputs to simulate inference errors, aligning the training distribution with real autoregressive inference.
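The noise-injection idea can be illustrated with a few lines of Python. This sketch assumes the visual tokens are represented as lists of bits, as in bitwise tokenizers; the function name and the flip probability are hypothetical, and the notes do not state the rate used in the paper.

```python
import random

def flip_bits(bits, flip_prob, rng):
    """Flip each binary token bit independently with probability flip_prob."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in bits]

rng = random.Random(0)           # seeded generator for reproducibility
clean = [0, 1] * 8               # toy bit sequence standing in for one token map
noisy = flip_bits(clean, flip_prob=0.1, rng=rng)
```

During training, the model would read `noisy` as its input while still being supervised with the clean targets, so that it learns to recover from the kind of errors its own predictions introduce at inference time.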
Key Experimental Results
| Method | Resolution | FID ↓ | FVD ↓ | Throughput ↑ (img/s) |
|---|---|---|---|---|
| MagicDrive | 224×400 | 16.2 | - | 1.76 |
| DriveDreamer | 256×448 | 14.9 | 341 | 0.37 |
| Panacea | 256×512 | 17.0 | 139 | 0.67 |
| Ours (RayNova) | 384×672 | 10.5 | 91 | 1.96 |

| Evaluation Dimension | Method | Metric (Relative to Oracle) |
|---|---|---|
| Object Conditioning (StreamPETR) | Panacea | 32.1 NDS (68%) |
| Object Conditioning (StreamPETR) | Ours | 41.9 NDS (89%) |
| Object Conditioning (SparseFusion) | X-Drive | 69.6 NDS (95%) |
| Object Conditioning (SparseFusion) | Ours | 72.0 NDS (99%) |
| Novel View Synthesis FID (shift 4m) | StreetGaussian | 67.44 |
| Novel View Synthesis FID (shift 4m) | Ours | 17.48 |
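
As a quick reading aid, the relative gains implied by the tables can be computed directly. The snippet below only does arithmetic on the numbers reported above; the choice of "best listed baseline" per metric is made here for illustration.

```python
def relative_reduction(baseline, ours):
    """Fractional reduction of a lower-is-better metric."""
    return (baseline - ours) / baseline

fid_gain = relative_reduction(14.9, 10.5)     # best listed baseline FID: DriveDreamer
fvd_gain = relative_reduction(139, 91)        # best listed baseline FVD: Panacea
nvs_gain = relative_reduction(67.44, 17.48)   # novel-view FID at a 4 m shift
print(f"{fid_gain:.1%} {fvd_gain:.1%} {nvs_gain:.1%}")
```

These ratios compare models evaluated at different resolutions, so they indicate the size of the gap rather than a controlled comparison.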
Highlights & Insights
- Geometry-agnostic Design: Does not rely on 3D priors like Point Clouds/BEV/Depth, achieving geometric awareness solely through relative ray positional encoding.
- Dual-Causal Autoregression: A unified scale+temporal causal framework that is more flexible than decoupled spatio-temporal attention.
- Strong Novel-View Generalization: Zero-shot adaptation to unseen camera configurations, with an FID of only 17.48 under 4m displacement vs 67.44 for StreetGaussian.
- Efficient Generation: Throughput of 1.96 img/s exceeds every listed diffusion baseline (0.37 to 1.76 img/s), at a higher resolution (384×672).
- Heterogeneous Data Compatibility: Can mix training data with different sensor configurations, resolutions, and frame rates.
Limitations & Future Work
- Use of image-based VAE may affect FID/FVD metrics.
- Training data volume (~60 hours) is still limited compared to some private data methods.
- Recursive training requires longer training times.
- Map-conditioned 3D point projection lacks height information.
- Experiments only validated in driving scenarios; indoor or other scenes remain unverified.
Related Work & Insights
- vs Panacea: Panacea assumes strong multi-frame dependencies per camera, which limits it to specific configurations; RayNova makes no per-camera assumption and reaches FVD 91 vs 139.
- vs X-Drive: X-Drive uses Point Clouds as 3D priors; RayNova requires no 3D representation.
- vs StreetGaussian/OmniRe: Explicit 3D representations degrade sharply under large camera offsets (FID 67+); RayNova remains robust (17.48).
- vs BEVWorld: BEV representations are tied to specific height planes; RayNova's ray space is more universal.
- The design of relative Plücker ray encoding can be extended to other generative tasks requiring geometric awareness.
- Dual-causal autoregression provides a unified framework for multi-modal/multi-resolution generation.
- Recursive training for distribution drift has implications for other long-sequence generation tasks.
- The integration with VAR (Visual Autoregressive Model) is noteworthy.
Rating
- Novelty: ⭐⭐⭐⭐⭐ — Dual-causal autoregression + relative ray positional encoding is a brand-new design paradigm.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-dimensional evaluation (quality/conditioning/novel view/motion), though limited to driving scenarios.
- Writing Quality: ⭐⭐⭐⭐⭐ — Rigorous mathematical derivation and excellent illustrations.
- Value: ⭐⭐⭐⭐⭐ — Defines a new direction for geometry-agnostic world models.