Pano360: Perspective to Panoramic Vision with Geometric Consistency

Conference: CVPR 2025
arXiv: 2603.12013
Code: https://github.com/KiMomota/Pano360
Area: 3D Vision
Keywords: Panoramic Stitching, 3D Geometric Consistency, Transformer, Seam Detection, Large-scale Dataset

TL;DR

Proposes Pano360, described as the first Transformer framework to perform panoramic stitching in 3D photogrammetric space. It uses a pretrained VGGT backbone for 3D-aware multi-view feature alignment and multi-feature joint optimization for seam detection. It accepts from two to hundreds of input images and reaches a 97.8% success rate in challenging scenes with weak texture, large parallax, or repetitive patterns.

Background & Motivation

Background: Traditionally, panoramic stitching relies on pairwise feature matching to estimate homography matrices (SIFT/ORB + RANSAC). CNN-based methods like UDIS/UDIS2 improve stitching through end-to-end learning but remain limited to pairwise processing.

Limitations of Prior Work: (a) Pairwise matching in multi-image stitching leads to error accumulation, resulting in severe distortion; (b) Feature matching is unreliable in weak texture, large parallax, or repetitive pattern scenes, causing homography estimation to fail; (c) CNN-based methods only process image pairs and cannot exploit multi-view global geometric consistency.

Key Challenge: Existing methods establish pairwise correspondences in 2D space and neglect 3D projective geometry, even though multi-view geometric relations are more accurate and globally consistent in 3D space.

Goal: Extend panoramic stitching from 2D pairwise alignment to 3D global alignment by leveraging multi-view geometric consistency.

Key Insight: Leverage the inherent 3D feature correspondence awareness of pretrained large vision models (VGGT) to elevate the image alignment task into the 3D photogrammetric space.

Core Idea: Utilize the 3D perception capability of VGGT to directly estimate camera parameters for alignment in 3D space, and use multi-feature joint optimization to generate globally optimal seams in a single pass.

Method

Overall Architecture

Takes \(N\) partially overlapping images as input and outputs a seamless panorama. The architecture has two branches: (1) Projection branch: DINO encoding + VGGT alternating attention \(\rightarrow\) camera tokens \(\rightarrow\) decoded into intrinsic/extrinsic parameters \(\rightarrow\) global 3D alignment + local mesh warp correction; (2) Seam branch: feature tokens \(\rightarrow\) seam decoder \(\rightarrow\) multi-feature joint optimization to generate seam masks \(\rightarrow\) fusion output.

Key Designs

3D-Aware Feature Backbone:
- Function: Extract global 3D geometric relationships from multiple images.
- Mechanism: DINO encodes each image into patch tokens, to which learnable camera tokens and register tokens are prepended. These are processed through \(L\) layers of alternating attention (intra-frame self-attention + global attention across all frames) in the pretrained VGGT, which outputs camera tokens carrying 3D geometric correspondences and feature tokens preserving image detail.
- Design Motivation: VGGT, trained with 3D supervision, possesses an inherent awareness of multi-view 3D correspondence, making it more accurate and global than traditional feature matching.
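The alternating pattern described above can be illustrated with a minimal NumPy sketch. This is not VGGT's actual implementation: the attention here is single-head with no learned projections, and the token counts, dimensions, and function names (`self_attention`, `alternating_attention`) are assumptions chosen for illustration. It only shows how the same token tensor is reshaped so that one step attends within each frame and the next attends over the tokens of all frames together.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head attention without learned weights (illustrative only).

    tokens: (batch, length, dim). Each batch entry attends only to itself.
    """
    dim = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(dim)
    return tokens + softmax(scores) @ tokens  # residual connection

def alternating_attention(tokens, num_layers):
    """tokens: (N, T, D) for N frames with T tokens each."""
    n, t, d = tokens.shape
    for _ in range(num_layers):
        # Frame-wise step: the batch axis is the frame axis, so no mixing across frames.
        tokens = self_attention(tokens)
        # Global step: flatten all frames into one sequence so every token sees every frame.
        tokens = self_attention(tokens.reshape(1, n * t, d)).reshape(n, t, d)
    return tokens

rng = np.random.default_rng(0)
N, P, D = 3, 16, 8                      # frames, patch tokens per frame, channels (assumed)
patch = rng.normal(size=(N, P, D))      # stand-in for DINO patch tokens
camera = np.zeros((N, 1, D))            # stand-in for the learnable camera token
register = np.zeros((N, 4, D))          # stand-in for learnable register tokens
tokens = np.concatenate([camera, register, patch], axis=1)

out = alternating_attention(tokens, num_layers=2)
camera_tokens = out[:, 0]               # (N, D), would feed the projection head
feature_tokens = out[:, 5:]             # (N, P, D), would feed the seam head
print(camera_tokens.shape, feature_tokens.shape)
```

Changing one frame's tokens leaves the other frames untouched after a frame-wise step but alters them after a global step, which is the property that lets camera tokens absorb multi-view information.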
Projection Head:
- Function: Decode camera intrinsic and extrinsic parameters from camera tokens to directly align images in 3D space.
- Mechanism: Predicts the intrinsic matrix \(\mathbf{K}_i\) and extrinsic parameters \((\mathbf{R}_i, \mathbf{t}_i)\) for each image, which define a projection function \(\mathbf{P}_i\) that maps pixels to the panoramic coordinate system. The full warp is \(\mathcal{W}_i(\mathbf{u}) = \mathbf{P}_i(\mathbf{u}) + W_i(\mathbf{u})\), where \(W_i\) is the local mesh warp that handles parallax. Multiple projection formats are supported, such as planar, equirectangular, and spherical.
- Design Motivation: Camera parameters provide globally consistent 3D constraints that are more robust than 2D homography matrices, while local mesh warping compensates for depth variations.
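As a rough illustration of the warp \(\mathbf{P}_i(\mathbf{u}) + W_i(\mathbf{u})\), the sketch below projects pixels to an equirectangular panorama. Several things here are assumptions rather than details from the paper: \(\mathbf{R}_i\) is taken to map world to camera coordinates, translation is ignored (a rotation-only panorama model), the panorama center corresponds to the world +z direction, and the local mesh warp is represented as a plain per-pixel offset array. The function names are hypothetical.

```python
import numpy as np

def project_to_equirect(uv, K, R, pano_w, pano_h):
    """Map pixel coordinates (M, 2) to equirectangular panorama coordinates (M, 2).

    Assumes R rotates world rays into the camera frame, so world_ray = R^T K^-1 [u, v, 1].
    """
    uv = np.asarray(uv, dtype=float)
    homog = np.hstack([uv, np.ones((uv.shape[0], 1))])
    rays_cam = (np.linalg.inv(K) @ homog.T).T
    rays_world = rays_cam @ R                      # row-vector form of R^T @ ray
    rays_world /= np.linalg.norm(rays_world, axis=1, keepdims=True)
    lon = np.arctan2(rays_world[:, 0], rays_world[:, 2])       # in [-pi, pi]
    lat = np.arcsin(np.clip(rays_world[:, 1], -1.0, 1.0))      # in [-pi/2, pi/2]
    x = (lon / (2.0 * np.pi) + 0.5) * pano_w
    y = (lat / np.pi + 0.5) * pano_h
    return np.stack([x, y], axis=1)

def warp(uv, K, R, pano_w, pano_h, local_offset=None):
    """Global projection plus an optional local residual (stand-in for the mesh warp)."""
    p = project_to_equirect(uv, K, R, pano_w, pano_h)
    return p if local_offset is None else p + np.asarray(local_offset, dtype=float)

# Hypothetical camera: 500 px focal length, principal point at the image center.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
center = np.array([[320.0, 240.0]])
print(warp(center, K, np.eye(3), pano_w=2048, pano_h=1024))
```

With an identity rotation, the principal point lands at the center of the panorama; rotating the camera about the vertical axis shifts it horizontally while the row stays fixed.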
Seam Head:
- Function: Predict globally optimal seam masks for each image.
- Mechanism: Jointly compute seam labels from color, gradient, and texture features across all images to train the network. Single-pass forward inference predicts all seam masks simultaneously, bypassing pairwise graph-cuts.
- Design Motivation: Traditional pairwise seam detection is computationally expensive and easily trapped in local optima. Joint multi-feature optimization with global prediction gives better results in complex multi-image overlap areas, with a reported 10x speedup in large-scale scenes.
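To make the idea of multi-image seam masks concrete, the following is a hand-written stand-in, not the paper's learned seam head or its label-generation procedure. It assumes the images are already warped into a common panorama frame, combines only a color term and a gradient term (the texture term is omitted), and picks, per pixel, the image with the lowest cost. The weights and the function names `seam_masks` and `fuse` are hypothetical.

```python
import numpy as np

def seam_masks(images, valid, w_color=1.0, w_grad=0.5):
    """Assign each panorama pixel to one source image.

    images: (N, H, W, 3) float array of warped images.
    valid:  (N, H, W) bool array, True where an image covers the pixel.
    Returns boolean masks of shape (N, H, W), mutually exclusive per pixel.
    """
    images = np.asarray(images, dtype=float)
    valid = np.asarray(valid, dtype=bool)
    gray = images.mean(axis=-1)
    # Color term: deviation from the mean intensity of all images covering the pixel.
    count = np.maximum(valid.sum(axis=0), 1)
    consensus = (gray * valid).sum(axis=0) / count
    color_cost = np.abs(gray - consensus)
    # Gradient term: penalize cutting through strong edges.
    gy, gx = np.gradient(gray, axis=(1, 2))
    grad_cost = np.hypot(gx, gy)
    cost = w_color * color_cost + w_grad * grad_cost
    cost = np.where(valid, cost, np.inf)           # never pick an image that is absent
    labels = cost.argmin(axis=0)
    index = np.arange(images.shape[0])[:, None, None]
    return (labels[None] == index) & valid

def fuse(images, masks):
    """Composite the panorama by selecting each pixel from its assigned image."""
    return (np.asarray(images, dtype=float) * masks[..., None]).sum(axis=0)

# Two constant-color images that overlap in the two middle columns.
imgs = np.stack([np.full((4, 6, 3), 0.2), np.full((4, 6, 3), 0.8)])
cover = np.zeros((2, 4, 6), dtype=bool)
cover[0, :, :4] = True
cover[1, :, 2:] = True
masks = seam_masks(imgs, cover)
panorama = fuse(imgs, masks)
print(masks.sum(axis=0).min(), masks.sum(axis=0).max())
```

All masks come out of one call over the full image stack, which mirrors the single-pass, all-images-at-once character of the seam head, although the per-pixel argmin here ignores the spatial smoothness a learned or graph-cut solution would enforce.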

Training Data

Constructs the Pano360 dataset: 200 real-world scenes and 14,400+ images, covering a 360° FoV per scene, with every image annotated with ground-truth (GT) camera parameters. It includes challenging conditions such as weak textures, diverse lighting, and extreme weather.
- Data Collection: Captured using a calibrated multi-camera system, with GT intrinsics and extrinsics obtained via SfM.
- Train/Test Split: 180/20 scenes, ensuring the test set covers all challenging types.
- Scene Statistics: Each scene averages 72 images with an overlap of approximately 30%-50%, simulating non-uniform sampling in real applications.

Key Experimental Results

Main Results

Method	QA_q↑	QA_a↑	BRISQUE↓	NIQE↓
AutoStitch	3.82	3.20	40.98	4.55
GES-GSP	3.95	3.20	36.45	3.36
UDIS2	3.02	2.97	60.55	5.23
Pano360	Best	Best	Lowest	Lowest

Success rate in challenging scenes: Pano360 reaches 97.8%, while traditional methods frequently fail under repetitive patterns and weak textures.

Ablation Study

Configuration	Description
w/o 3D alignment (Pure 2D)	Severe distortion and misalignment in large parallax
w/o Local mesh correction	Residual misalignment in depth-varying areas
w/o Multi-feature seams	Unnatural seams in complex overlapping areas
Full Pano360	Geometrically consistent + visually seamless

Efficiency Comparison

Method	Image Count = 8	Image Count = 32	Image Count = 128
AutoStitch	2.1s	18.4s	Timeout
UDIS2	1.8s	Pairwise \(\times N\)	Not Supported
Pano360	1.5s	4.2s	12.8s

Key Findings

3D space alignment fundamentally solves the error accumulation problem in multi-image stitching.
The 3D correspondence perception of VGGT effectively filters out unreliable matches under repetitive patterns.
Seam detection speed is 10 times faster than pairwise methods in large scenes (global one-pass prediction vs. pairwise graph-cuts).
Supports flexible input from several to hundreds of images, making it applicable to real-world scenarios such as autonomous driving and VR.
Pano360's processing time scales nearly linearly with the number of images (1.5s to 12.8s when going from 8 to 128 images), whereas traditional methods scale superlinearly or time out.
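The scaling behavior can be sanity-checked against the Efficiency Comparison table with a simple log-log slope between consecutive measurements. This is only a rough estimate from two or three data points per method, and the helper name `scaling_exponents` is hypothetical. An exponent near 1 indicates linear growth and an exponent near 2 indicates quadratic growth.

```python
import math

# Timings in seconds, copied from the Efficiency Comparison table.
timings = {
    "AutoStitch": {8: 2.1, 32: 18.4},          # times out at 128 images
    "Pano360": {8: 1.5, 32: 4.2, 128: 12.8},
}

def scaling_exponents(points):
    """Estimate k in time ~ n^k between each pair of consecutive image counts."""
    counts = sorted(points)
    return [
        math.log(points[b] / points[a]) / math.log(b / a)
        for a, b in zip(counts, counts[1:])
    ]

for method, points in timings.items():
    print(method, [round(k, 2) for k in scaling_exponents(points)])
```

Under this estimate, Pano360's exponents come out at roughly 0.74 and 0.80, slightly below linear, while AutoStitch's is about 1.57 between 8 and 32 images, clearly superlinear though short of quadratic over that range.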

Highlights & Insights

Shifting panoramic stitching from 2D pairwise to 3D global represents a paradigm shift, resolving the fundamental flaws of traditional methods by leveraging the 3D perception of pretrained large models.
Dataset contribution holds long-term value: 200 real-world scenes with 360° FoV and GT camera parameters fill the gap in training/evaluation data for panoramic stitching.
Supports multiple projection formats (planar, equirectangular, spherical) with adaptive selection, which makes the method practical across different output requirements.

Limitations & Future Work

Relies on pretrained VGGT weights; performance may degrade in extreme scenarios out of the training distribution.
Assumes all cameras share the same focal length and that the principal point lies at the image center, an assumption that does not hold for non-standard lenses.
The local mesh warping is not described in full detail, and its ability to handle extremely large parallax remains to be verified.
Although the dataset is large, the diversity of 200 scenes may still be limited.

vs UDIS2: CNN-based and end-to-end but limited to pairwise, requiring complex post-processing for multi-image stitching. Pano360 natively supports global multi-image alignment.
vs GES-GSP: Traditional geometric feature-based method, which fails in challenging scenes due to matching errors. Pano360 replaces handcrafted features with learned 3D correspondences.
vs AutoStitch: Classical automatic stitching, but with severe accumulated error. Pano360 eliminates cumulative errors through 3D global alignment.

Rating

Novelty: ⭐⭐⭐⭐ Paradigm innovation of global alignment in 3D space for panoramic stitching.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated across multiple scenes + multiple baselines + dataset construction.
Writing Quality: ⭐⭐⭐⭐ Clear writing with excellent motivation analysis.
Value: ⭐⭐⭐⭐⭐ Significant advancement for practical applications of panoramic stitching.