# Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos
- Conference: ICCV 2025
- arXiv: 2504.07940
- Code: None (project page available)
- Area: Video Understanding / Video Generation
- Keywords: 360° video generation, video outpainting, diffusion model, panoramic video, view synthesis
## TL;DR
This paper introduces Argus, the first model to generate complete 360° panoramic videos from standard perspective videos. Through three geometry- and motion-aware techniques—camera movement simulation, view-based frame alignment, and blended decoding—Argus achieves spatially consistent and temporally coherent panoramic video generation within a diffusion-based framework.
## Background & Motivation
360° videos offer an omnidirectional field of view, providing a far more complete representation of dynamic visual scenes than standard narrow-FOV footage. However, existing video generation models are limited to producing narrow-FOV standard videos and cannot generate panoramic content. The video-to-360° task presents three core challenges:
1. Extreme FOV expansion: the input video covers only a limited viewing angle, requiring the model to infer the spatial layout and object dynamics of the entire scene.
2. Limitations of existing outpainting methods: models trained on narrow-FOV videos produce content whose quality degrades as it moves farther from the input viewpoint.
3. Non-linear distortion in equirectangular projection (ERP): objects and spatial structures are warped by the projection, and discontinuities arise at the image boundaries.
The authors' key insight is that the large corpus of 360° videos available on the internet constitutes a relatively underexploited data source for learning panoramic priors.
## Method
### Overall Architecture
Argus is built upon the Stable Video Diffusion architecture and formulates the problem as video outpainting under a dynamic mask. Given a perspective video, the model first estimates camera poses to project it into equirectangular coordinates, then performs diffusion-based generation conditioned on this projection. The core components are: video-conditioned 360° diffusion, camera movement simulation, view-based frame alignment, and blended decoding.
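As a mental model, the inference flow can be summarized in a few lines of glue code. This is a hypothetical sketch: every callable below is a placeholder for a pipeline stage named in this review, not the authors' actual API.

```python
from typing import Callable

def argus_inference(
    video,                      # input perspective video, e.g. (T, H, W, 3)
    estimate_poses: Callable,   # SLAM pose estimation (the paper uses MegaSaM)
    project: Callable,          # view-based frame alignment onto an equirect canvas
    diffuse: Callable,          # video-conditioned 360° diffusion sampler
    blend_decode: Callable,     # blended decoding of the denoised latent
):
    """Hypothetical glue code for the Argus pipeline described above."""
    poses = estimate_poses(video)     # rotations relative to the first frame
    x_equi = project(video, poses)    # unmapped regions stay black
    latent = diffuse(x_equi)          # outpaint the full panorama in latent space
    return blend_decode(latent)       # seam-free 360° video
```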
### Key Designs
- Video-Conditioned 360° Diffusion:
    - Input perspective video \(X_{\text{pers}} \in \mathbb{R}^{T \times 3 \times H \times W}\); output 360° panoramic video \(Y_{\text{equi}} \in \mathbb{R}^{T \times 3 \times H' \times W'}\).
    - The perspective video is converted to equirectangular format \(X_{\text{equi}}\), with unmapped regions set to black.
    - An encoder \(\mathcal{E}\) encodes both representations into latent space; noise is added to \(\mathbf{y}_{\text{equi}}\), which is then concatenated with \(\mathbf{x}_{\text{equi}}\) and fed into the denoising network \(f_\theta\).
    - CLIP is used to extract image feature sequences for cross-attention conditioning.
    - The training loss incorporates a latitude-dependent reweighting function \(\lambda(h) = (\frac{1}{2} - |\frac{1}{2} - h|)^2 + \delta\), assigning higher weights to equatorial regions, since polar areas are disproportionately magnified under equirectangular projection (see the weight sketch after this list).
- Camera Movement Simulation:
    - Training pairs are constructed by sampling perspective viewpoints from real 360° videos.
    - Natural human camera motion is simulated via linear drift, oscillation, and noise terms; e.g., for the yaw angle (a trajectory sketch follows this list):
        - \(\phi_{\text{yaw}}(k) = \mathcal{N}(0, \eta_y) + a_y \sin(\omega k + \tau_y) + d_y k + \phi_0\)
    - Horizontal and vertical field-of-view angles are randomly sampled from \([30°, 120°]\).
    - Data augmentation: random circular translation (a horizontal rotation that preserves the 360° property).
- View-Based Frame Alignment:
    - Core problem: how should a perspective video be projected into equirectangular format?
    - Naively centering each frame at the equirectangular map center forces the model to implicitly learn camera motion; scene content such as the sky appears at inconsistent positions across frames, increasing learning difficulty.
    - Solution: a SLAM method (MegaSaM) estimates relative camera poses; Euler angles are computed relative to the first frame, and each frame is projected into a shared coordinate system (a simplified projection sketch follows this list).
    - Effect: each region of the equirectangular map corresponds to approximately the same scene area across frames (sky consistently at the top, road consistently at the bottom).
- Blended Decoding:
    - Problem: the left and right boundaries of an equirectangular image are spatially adjacent in the scene but distant in image space, causing boundary artifacts.
    - The model decodes both the original latent representation and its 180°-rotated counterpart, producing two outputs with identical content but artifacts at different locations.
    - The two outputs are fused by distance-based weighted averaging (a pixel-space sketch follows this list): \(\hat{Y}_{k,i,j} = h_W(i) \cdot Y_{k,i,j} + (1 - h_W(i)) \cdot Y'_{k,i,j}\),
    - where \(h_W(x) = 1 - 2|\frac{x}{W} - \frac{1}{2}|\) and \(i\) indexes the horizontal axis, assigning higher weight to pixels farther from the boundary.
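The latitude reweighting in the first design is simple enough to state directly. A minimal NumPy sketch, assuming \(h\) is the row position normalized to \([0, 1]\); the value of \(\delta\) here is an assumption:

```python
import numpy as np

def latitude_weight(height: int, delta: float = 0.01) -> np.ndarray:
    """Per-row loss weight lambda(h) = (1/2 - |1/2 - h|)^2 + delta.

    The weight peaks at the equator (h = 0.5) and falls off toward the
    poles, which equirectangular projection disproportionately magnifies.
    delta = 0.01 is an illustrative floor, not the paper's value.
    """
    h = (np.arange(height) + 0.5) / height        # row centers in [0, 1]
    return (0.5 - np.abs(0.5 - h)) ** 2 + delta   # shape: (height,)

# Broadcast across the width before multiplying the per-pixel loss,
# e.g. for a 512x1024 equirectangular frame:
w = latitude_weight(512)[:, None]                 # shape: (512, 1)
```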
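The yaw trajectory above combines Gaussian jitter, a sinusoidal oscillation, and linear drift. A sketch of the camera movement simulation follows; all sampling ranges are invented for illustration (the paper only specifies the \([30°, 120°]\) FOV range):

```python
import numpy as np

def simulate_yaw(num_frames: int, rng: np.random.Generator) -> np.ndarray:
    """phi_yaw(k) = N(0, eta) + a*sin(w*k + tau) + d*k + phi_0, in degrees."""
    eta = rng.uniform(0.1, 1.0)         # jitter std (assumed range)
    a = rng.uniform(0.0, 10.0)          # oscillation amplitude (assumed range)
    w = rng.uniform(0.0, 0.5)           # oscillation frequency, rad/frame (assumed)
    tau = rng.uniform(0.0, 2 * np.pi)   # phase offset
    d = rng.uniform(-1.0, 1.0)          # linear drift per frame (assumed range)
    phi_0 = rng.uniform(-180.0, 180.0)  # initial heading

    k = np.arange(num_frames)
    return rng.normal(0.0, eta, num_frames) + a * np.sin(w * k + tau) + d * k + phi_0

yaw = simulate_yaw(25, np.random.default_rng(0))  # one 25-frame trajectory
```

Pitch and roll can be simulated analogously with their own parameters.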
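View-based frame alignment amounts to warping each frame onto a shared equirectangular canvas using its estimated rotation. A simplified nearest-neighbor sketch, assuming a y-up/z-forward camera convention and a camera-to-world rotation matrix `R_wc` from SLAM; the authors' implementation details are not specified:

```python
import numpy as np

def place_frame_on_canvas(frame, R_wc, fov_deg, canvas_h, canvas_w):
    """Warp one perspective frame onto a shared equirectangular canvas."""
    h, w = frame.shape[:2]
    f = (w / 2) / np.tan(np.radians(fov_deg) / 2)    # focal length in pixels

    # Unit direction of every canvas pixel in world coordinates.
    lon = (np.arange(canvas_w) + 0.5) / canvas_w * 2 * np.pi - np.pi
    lat = np.pi / 2 - (np.arange(canvas_h) + 0.5) / canvas_h * np.pi
    lon, lat = np.meshgrid(lon, lat)
    dirs = np.stack([np.cos(lat) * np.sin(lon),      # x: right
                     np.sin(lat),                    # y: up
                     np.cos(lat) * np.cos(lon)], -1) # z: forward

    # Rotate rays into the camera frame and project onto the image plane.
    cam = dirs @ R_wc                      # row-vector form of R_wc.T @ dir
    valid = cam[..., 2] > 1e-6             # keep rays in front of the camera
    z = np.where(valid, cam[..., 2], 1.0)  # avoid divide-by-zero off-mask
    u = f * cam[..., 0] / z + w / 2
    v = -f * cam[..., 1] / z + h / 2

    canvas = np.zeros((canvas_h, canvas_w, 3), frame.dtype)  # unmapped = black
    inside = valid & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    canvas[inside] = frame[v[inside].astype(int), u[inside].astype(int)]
    return canvas
```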
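Blended decoding mixes the two decodes in pixel space. A sketch assuming a channels-last `(T, H, W, C)` layout and a decoder that is translation-equivariant along the width, so that rolling the latent by half its width corresponds to a 180° rotation of the panorama; `decode` stands in for the VAE decoder:

```python
import numpy as np

def blended_decode(decode, latent):
    """Blend two decodes so seams at the equirectangular boundary cancel."""
    y = decode(latent)                                   # seam at left/right edge
    y_rot = decode(np.roll(latent, latent.shape[2] // 2, axis=2))
    y_prime = np.roll(y_rot, y.shape[2] // 2, axis=2)    # undo the 180° rotation

    # h_W(x) = 1 - 2|x/W - 1/2|: weight 1 at the image center, 0 at the seam.
    W = y.shape[2]
    hw = 1 - 2 * np.abs((np.arange(W) + 0.5) / W - 0.5)
    hw = hw[None, None, :, None]                         # broadcast over (T, H, W, C)
    return hw * y + (1 - hw) * y_prime

# Stand-in identity decoder, just to exercise the blending logic:
out = blended_decode(lambda z: z, np.random.rand(4, 8, 16, 3))
```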
### Loss & Training
- Built on the EDM diffusion framework with a denoising score matching objective (a generic sketch of this objective follows the list).
- Initialized from Stable Video Diffusion-I2V-XL.
- Two-stage training: first trained at \(384 \times 768\) resolution for 100K iterations, then fine-tuned on a high-quality subset at \(512 \times 1024\) for 20K iterations with batch size 16.
- Long video generation: context-aware training alternating between standard and context-aware inputs.
- Data: 283,863 video clips filtered from the 360-1M dataset.
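For reference, a generic single-step EDM denoising loss in the style of Karras et al. (2022). The hyperparameters below are the EDM paper's image defaults, not necessarily those used here, and the latitude weight \(\lambda(h)\) from the Method section would additionally multiply the per-pixel error:

```python
import torch

def edm_loss(denoiser, x, sigma_data=0.5, p_mean=-1.2, p_std=1.2):
    """One EDM-style denoising score matching step on a clean batch x."""
    # Sample a per-sample noise level from a log-normal distribution.
    sigma = (torch.randn(x.shape[0]) * p_std + p_mean).exp()
    sigma = sigma.view(-1, *([1] * (x.dim() - 1)))        # broadcast over x

    noised = x + sigma * torch.randn_like(x)
    denoised = denoiser(noised, sigma)                    # preconditioned network

    weight = (sigma ** 2 + sigma_data ** 2) / (sigma * sigma_data) ** 2
    return (weight * (denoised - x) ** 2).mean()
```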
## Key Experimental Results
### Main Results (Video-to-360° Generation)
| Method | PSNR↑ | LPIPS↓ | FVD↓ | Motion↑ | Line consistency↑ |
|---|---|---|---|---|---|
| PanoDiffusion | 16.44 | 0.4138 | 2649.0 | 0.9426 | 0.6504 |
| Argus | 21.83 | 0.2409 | 1228.6 | 0.9802 | 0.8506 |
Comparison with video outpainting methods (FOV = 60°):
| Method | Imaging↑ | Aesthetic↑ | Motion↑ |
|---|---|---|---|
| Be-Your-Outpainter | 0.4014 | 0.3461 | 0.9683 |
| Follow-Your-Canvas | 0.4268 | 0.4750 | 0.9704 |
| Argus | 0.4760 | 0.4722 | 0.9816 |
### Ablation Study
| Variant | PSNR↑ | LPIPS↓ | FVD↓ | Imaging↑ |
|---|---|---|---|---|
| w/o frame alignment | 20.42 | 0.3194 | 1349.6 | 0.3816 |
| w/o blended decoding | 22.09 | 0.2675 | 1226.3 | 0.4574 |
| Full model | 21.83 | 0.2409 | 1228.6 | 0.4939 |
| VAE reconstruction upper bound | 24.54 | 0.1663 | 121.8 | 0.5272 |
### Key Findings
- View-based frame alignment is critical to overall performance (removing it drops Imaging from 0.4939 to 0.3816).
- Blended decoding substantially improves boundary consistency, though its effect on quantitative metrics is modest.
- Argus demonstrates an understanding of dynamic content in the input video (e.g., vehicle motion) and extrapolates it reasonably; predicted trajectories closely match ground truth.
- 3D reconstruction validation: the mean rotational deviation of generated panoramic videos is only \((0.22°, 0.30°, 0.34°)\).
- Video outpainting methods degrade with increasing angular distance from the input viewpoint, whereas Argus maintains consistent quality across the full 360° range.
## Highlights & Insights
- 360° videos as a prior data source: This work breaks away from the convention of training exclusively on narrow-FOV videos, leveraging the abundance of 360° video on the internet to learn panoramic priors.
- Geometry-aware design: The latitude reweighting loss, view-based frame alignment, and blended decoding all reflect a deep understanding of the geometric properties of equirectangular projection.
- Rich downstream applications: Video stabilization (without FOV loss), free-viewpoint control, dynamic environment mapping, and interactive VQA demonstrate the broad application potential of 360° video generation.
## Limitations & Future Work
- The output resolution of \(512 \times 1024\) is substantially below that of typical 4K panoramic video, and resolution degrades further when reprojecting back to perspective views.
- Object shape inconsistencies and physical artifacts persist, limitations shared with the underlying SVD foundation model.
- Camera pose estimation is required at inference time, increasing pipeline complexity.
- Computational constraints limit generation at higher resolutions and longer durations.
## Related Work & Insights
- Unlike video panorama stitching methods such as VidPanos, Argus can extrapolate content well beyond the input viewpoint.
- The blended decoding strategy (mixing in pixel space rather than latent space) may generalize to other generation tasks involving boundary discontinuities.
- The camera movement simulation approach provides a systematic data construction methodology for training on 360° video data.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ First systematic treatment of the video-to-360° task, with technically rich geometric insights.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers quantitative evaluation, qualitative results, ablation studies, and multiple downstream application demonstrations.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear and method motivation is well articulated.
- Value: ⭐⭐⭐⭐⭐ Opens a new direction in panoramic video generation with substantial downstream application potential.