Pano360: Perspective to Panoramic Vision with Geometric Consistency

Conference: CVPR 2025
arXiv: 2603.12013
Code: https://github.com/KiMomota/Pano360
Area: 3D Vision
Keywords: Panoramic Stitching, 3D Geometric Consistency, Transformer, Seam Detection, Large-scale Dataset

TL;DR

Proposes Pano360, the first Transformer framework that performs panoramic stitching in 3D photogrammetric space. It uses a pretrained VGGT backbone for 3D-aware multi-view feature alignment and multi-feature joint optimization for seam detection. It accepts anywhere from 2 to several hundred input images and achieves a 97.8% success rate in challenging scenarios with weak texture, large parallax, or repetitive patterns.

Background & Motivation

Background: Traditionally, panoramic stitching relies on pairwise feature matching to estimate homography matrices (SIFT/ORB + RANSAC). CNN-based methods like UDIS/UDIS2 improve stitching through end-to-end learning but remain limited to pairwise processing.

Limitations of Prior Work: (a) Pairwise matching in multi-image stitching leads to error accumulation, resulting in severe distortion; (b) Feature matching is unreliable in weak texture, large parallax, or repetitive pattern scenes, causing homography estimation to fail; (c) CNN-based methods only process image pairs and cannot exploit multi-view global geometric consistency.

Key Challenge: Existing methods establish pairwise correspondences in 2D space and neglect 3D projective geometry, even though multi-view geometric relations are more accurate and globally consistent in 3D space.

Goal: Extend panoramic stitching from 2D pairwise alignment to 3D global alignment by leveraging multi-view geometric consistency.

Key Insight: Leverage the inherent 3D feature correspondence awareness of pretrained large vision models (VGGT) to elevate the image alignment task into the 3D photogrammetric space.

Core Idea: Utilize the 3D perception capability of VGGT to directly estimate camera parameters for alignment in 3D space, and use multi-feature joint optimization to generate globally optimal seams in a single pass.

Method

Overall Architecture

Takes \(N\) partially overlapping images as input and outputs a seamless panorama. The architecture has two branches: (1) Projection branch: DINO encoding + VGGT alternating attention \(\rightarrow\) camera tokens \(\rightarrow\) decoded into intrinsic/extrinsic parameters \(\rightarrow\) global 3D alignment + local mesh warp correction; (2) Seam branch: feature tokens \(\rightarrow\) seam decoder \(\rightarrow\) multi-feature joint optimization to generate seam masks \(\rightarrow\) fusion output.
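The alternating attention in the projection branch can be illustrated with a toy sketch. This is not the VGGT implementation: it assumes single-head attention with identity query/key/value projections and omits residual connections, normalization, and MLP blocks. All function names are hypothetical. It only shows which tokens can exchange information in each step: frame-wise attention mixes tokens within one image, while global attention mixes tokens across all images.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(tokens):
    """Toy single-head self-attention with identity Q/K/V (illustrative only)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, tokens)) for j in range(d)])
    return out

def frame_attention(frames):
    """Each image attends only to its own tokens."""
    return [self_attention(f) for f in frames]

def global_attention(frames):
    """All tokens of all images attend to each other, then are split back per image."""
    flat = [t for f in frames for t in f]
    mixed = self_attention(flat)
    out, i = [], 0
    for f in frames:
        out.append(mixed[i:i + len(f)])
        i += len(f)
    return out

def alternating_attention(frames, num_layers):
    """Alternate frame-wise and global attention for num_layers rounds."""
    for _ in range(num_layers):
        frames = frame_attention(frames)
        frames = global_attention(frames)
    return frames

# Two images, two 2-D tokens each (made-up values).
frames = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 0.0]]]
tokens_out = alternating_attention(frames, num_layers=2)
```

In this sketch, changing the tokens of the second image leaves the frame-wise output of the first image untouched but changes its global-attention output, which is the property that lets camera tokens pick up cross-view information.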

Key Designs

  1. 3D-Aware Feature Backbone:

    • Function: Extract global 3D geometric relationships from multiple images.
    • Mechanism: DINO encodes each image into patch tokens, to which learnable camera tokens and register tokens are prepended. These are processed through \(L\) layers of alternating attention (frame-wise self-attention + global self-attention) in the pretrained VGGT, which outputs camera tokens carrying 3D geometric correspondences and feature tokens preserving detail.
    • Design Motivation: VGGT, trained with 3D supervision, possesses an inherent awareness of multi-view 3D correspondence, making it more accurate and global than traditional feature matching.
  2. Projection Head:

    • Function: Decode camera intrinsic and extrinsic parameters from camera tokens to directly align images in 3D space.
    • Mechanism: Predicts the intrinsic matrix \(\mathbf{K}_i\) and extrinsic parameters \((\mathbf{R}_i, \mathbf{t}_i)\) for each image, which define a projection function \(\mathbf{P}_i\) that maps pixels to the panoramic coordinate system. The full warp is \(\mathcal{W}_i(\mathbf{u}) = \mathbf{P}_i(\mathbf{u}) + W_i(\mathbf{u})\), where \(\mathbf{u}\) is a pixel coordinate and \(W_i\) is a local mesh warp that handles parallax. Multiple projection formats are supported, such as planar, equirectangular, and spherical.
    • Design Motivation: Camera parameters provide globally consistent 3D constraints that are more robust than 2D homography matrices, while local mesh warping compensates for depth variations.
  3. Seam Head:

    • Function: Predict globally optimal seam masks for each image.
    • Mechanism: Jointly compute seam labels from color, gradient, and texture features across all images to train the network. Single-pass forward inference predicts all seam masks simultaneously, bypassing pairwise graph-cuts.
    • Design Motivation: Traditional pairwise seam detection is computationally expensive and easily trapped in local optima. Joint multi-feature optimization and global prediction yield better results in complex multi-image overlap areas and make seam detection about 10x faster in large-scale scenes.
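The projection-head warp described above can be sketched as follows for the equirectangular case. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, `R` is assumed to be a camera-to-panorama rotation, the translation \(\mathbf{t}_i\) is ignored (all rays are treated as starting from a shared center), and the local mesh warp is a placeholder that returns a zero offset.

```python
import math

def project_to_equirect(u, v, fx, fy, cx, cy, R, pano_w, pano_h):
    """P_i: map pixel (u, v) to equirectangular panorama coordinates."""
    # Back-project the pixel to a viewing ray: K^{-1} [u, v, 1]^T.
    ray_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the ray into the panorama frame (R assumed camera-to-panorama).
    X, Y, Z = (sum(R[r][c] * ray_cam[c] for c in range(3)) for r in range(3))
    lon = math.atan2(X, Z)                 # longitude in [-pi, pi]
    lat = math.atan2(Y, math.hypot(X, Z))  # latitude in [-pi/2, pi/2], y pointing down
    px = (lon / (2.0 * math.pi) + 0.5) * pano_w
    py = (lat / math.pi + 0.5) * pano_h
    return px, py

def zero_mesh_offset(u, v):
    """Placeholder for the local mesh warp W_i (hypothetical; returns no offset)."""
    return 0.0, 0.0

def warp(u, v, fx, fy, cx, cy, R, pano_w, pano_h, mesh_offset=zero_mesh_offset):
    """Full warp: P_i(u) + W_i(u)."""
    px, py = project_to_equirect(u, v, fx, fy, cx, cy, R, pano_w, pano_h)
    dx, dy = mesh_offset(u, v)
    return px + dx, py + dy

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Rotation that turns the camera's optical axis (+z) toward +x (90 degrees about y).
YAW_90 = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]

center = warp(320, 240, 500.0, 500.0, 320.0, 240.0, IDENTITY, 2048, 1024)
turned = warp(320, 240, 500.0, 500.0, 320.0, 240.0, YAW_90, 2048, 1024)
```

With the identity rotation the principal point lands at the panorama center, and the 90-degree yaw shifts it a quarter of the panorama width to the right. A learned mesh warp would replace `zero_mesh_offset` with per-pixel offsets interpolated from a deformable grid.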

Training Data

Constructs the Pano360 dataset: 200 real-world scenes and 14,400+ images, covering a 360° FoV per scene, with all images annotated with ground-truth (GT) camera parameters. It contains challenging conditions such as weak textures, diverse lighting, and extreme weather.

  • Data Collection: Captured using a calibrated multi-camera system, with GT intrinsics and extrinsics obtained via SfM.
  • Train/Test Split: 180/20 scenes, ensuring the test set covers all challenging types.
  • Each scene averages 72 images with an overlap of approximately 30%-50%, simulating non-uniform sampling in real applications.

Key Experimental Results

Main Results

| Method | QA_q↑ | QA_a↑ | BRISQUE↓ | NIQE↓ |
|---|---|---|---|---|
| AutoStitch | 3.82 | 3.20 | 40.98 | 4.55 |
| GES-GSP | 3.95 | 3.20 | 36.45 | 3.36 |
| UDIS2 | 3.02 | 2.97 | 60.55 | 5.23 |
| Pano360 | Best | Best | Lowest | Lowest |

Success rate in challenging scenes: Pano360 reaches 97.8%, while traditional methods frequently fail under repetitive patterns and weak textures.

Ablation Study

| Configuration | Description |
|---|---|
| w/o 3D alignment (pure 2D) | Severe distortion and misalignment under large parallax |
| w/o local mesh correction | Residual misalignment in depth-varying areas |
| w/o multi-feature seams | Unnatural seams in complex overlapping areas |
| Full Pano360 | Geometrically consistent + visually seamless |

Efficiency Comparison

| Method | 8 images | 32 images | 128 images |
|---|---|---|---|
| AutoStitch | 2.1 s | 18.4 s | Timeout |
| UDIS2 | 1.8 s | Pairwise \(\times N\) | Not supported |
| Pano360 | 1.5 s | 4.2 s | 12.8 s |

Key Findings

  • 3D space alignment fundamentally solves the error accumulation problem in multi-image stitching.
  • The 3D correspondence perception of VGGT effectively filters out unreliable matches under repetitive patterns.
  • Seam detection speed is 10 times faster than pairwise methods in large scenes (global one-pass prediction vs. pairwise graph-cuts).
  • Supports flexible input from several to hundreds of images, making it applicable to real-world scenarios such as autonomous driving and VR.
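  • In the reported timings, Pano360's processing time grows roughly linearly with the number of images (1.5 s, 4.2 s, and 12.8 s for 8, 32, and 128 images), whereas AutoStitch grows much faster (2.1 s to 18.4 s from 8 to 32 images) and times out at 128 images.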
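The scaling claim can be checked against the efficiency table with a short calculation. The sketch below assumes runtime follows a power law \(t \propto N^{k}\) between two measurements and estimates the exponent \(k\) from the reported timings; the function name is illustrative, and two or three data points only give a rough indication.

```python
import math

def scaling_exponent(n1, t1, n2, t2):
    """Estimate k in t ~ N^k from two (image count, seconds) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)

# Timings taken from the efficiency comparison table.
pano360_k = scaling_exponent(8, 1.5, 128, 12.8)
autostitch_k = scaling_exponent(8, 2.1, 32, 18.4)

print(f"Pano360 exponent:    {pano360_k:.2f}")     # 0.77
print(f"AutoStitch exponent: {autostitch_k:.2f}")  # 1.57
```

An exponent below 1 for Pano360 means the per-image cost falls as the batch grows (from about 0.19 s per image at 8 images to 0.10 s at 128), while the AutoStitch estimate sits between linear and quadratic growth over the range where it still completes.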

Highlights & Insights

  • Shifting panoramic stitching from 2D pairwise to 3D global represents a paradigm shift, resolving the fundamental flaws of traditional methods by leveraging the 3D perception of pretrained large models.
  • Dataset contribution holds long-term value: 200 real-world scenes with 360° FoV and GT camera parameters fill the gap in training/evaluation data for panoramic stitching.
  • Supports multiple projection formats (planar, equirectangular, spherical) and adaptive selection, indicating strong practicality.

Limitations & Future Work

  • Relies on pretrained VGGT weights; performance may degrade in extreme scenarios out of the training distribution.
  • Assumes all cameras share the same focal length and the principal point is at the center, which is not applicable to non-standard lenses.
  • The local mesh warping is not described in full detail, and its ability to handle extremely large parallax remains to be verified.
  • Although the dataset is large, the diversity of 200 scenes may still be limited.

Comparison with Related Work

  • vs UDIS2: CNN-based and end-to-end but limited to pairwise, requiring complex post-processing for multi-image stitching. Pano360 natively supports global multi-image alignment.
  • vs GES-GSP: Traditional geometric feature-based method, which fails in challenging scenes due to matching errors. Pano360 replaces handcrafted features with learned 3D correspondences.
  • vs AutoStitch: Classical automatic stitching, but with severe accumulated error. Pano360 eliminates cumulative errors through 3D global alignment.

Rating

  • Novelty: ⭐⭐⭐⭐ Paradigm innovation of global alignment in 3D space for panoramic stitching.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated across multiple scenes + multiple baselines + dataset construction.
  • Writing Quality: ⭐⭐⭐⭐ Clear writing with excellent motivation analysis.
  • Value: ⭐⭐⭐⭐⭐ Significant advancement for practical applications of panoramic stitching.