Pano360: Perspective to Panoramic Vision with Geometric Consistency
Conference: CVPR 2025
arXiv: 2603.12013
Code: https://github.com/KiMomota/Pano360
Area: 3D Vision
Keywords: Panoramic Stitching, 3D Geometric Consistency, Transformer, Seam Detection, Large-scale Dataset
TL;DR
Proposes Pano360, presented as the first Transformer framework that performs panoramic stitching in 3D photogrammetric space. It uses a pretrained VGGT backbone for 3D-aware multi-view feature alignment and applies multi-feature joint optimization for seam detection. It accepts from two to hundreds of input images and reports a 97.8% success rate in challenging scenarios with weak texture, large parallax, or repetitive patterns.
Background & Motivation
Background: Traditionally, panoramic stitching relies on pairwise feature matching to estimate homography matrices (SIFT/ORB + RANSAC). CNN-based methods like UDIS/UDIS2 improve stitching through end-to-end learning but remain limited to pairwise processing.
Limitations of Prior Work: (a) Pairwise matching in multi-image stitching leads to error accumulation, resulting in severe distortion; (b) Feature matching is unreliable in weak texture, large parallax, or repetitive pattern scenes, causing homography estimation to fail; (c) CNN-based methods only process image pairs and cannot exploit multi-view global geometric consistency.
Key Challenge: Existing methods establish pairwise correspondences in 2D space and neglect 3D projective geometry, even though multi-view geometric relations are more accurate and globally consistent in 3D space.
Goal: Extend panoramic stitching from 2D pairwise alignment to 3D global alignment by leveraging multi-view geometric consistency.
Key Insight: Leverage the inherent 3D feature correspondence awareness of pretrained large vision models (VGGT) to elevate the image alignment task into the 3D photogrammetric space.
Core Idea: Utilize the 3D perception capability of VGGT to directly estimate camera parameters for alignment in 3D space, and use multi-feature joint optimization to generate globally optimal seams in a single pass.
Method
Overall Architecture
Takes \(N\) partially overlapping images as input and outputs a seamless panorama. The architecture has two branches: (1) Projection branch: DINO encoding + VGGT alternating attention \(\rightarrow\) camera tokens \(\rightarrow\) decoded into intrinsic/extrinsic parameters \(\rightarrow\) global 3D alignment + local mesh warp correction; (2) Seam branch: feature tokens \(\rightarrow\) seam decoder \(\rightarrow\) multi-feature joint optimization to generate seam masks \(\rightarrow\) fusion output.
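The alternating-attention step in the pipeline above can be illustrated with a toy NumPy sketch. It is not the VGGT implementation: the attention here has a single head and no learned projections, and the token counts (1 camera token, 4 register tokens, 16 patch tokens per image) are made-up values chosen only to show how tokens are regrouped between frame-wise and global attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention without learned weights (illustration only).

    tokens: array of shape (batch, length, dim).
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def alternating_attention(tokens, num_layers=2):
    """tokens: (N, T, D) for N images with T tokens each."""
    n, t, d = tokens.shape
    for _ in range(num_layers):
        # Frame-wise attention: each image attends only to its own tokens.
        tokens = tokens + self_attention(tokens)
        # Global attention: all tokens of all images form one sequence.
        flat = tokens.reshape(1, n * t, d)
        tokens = tokens + self_attention(flat).reshape(n, t, d)
    return tokens

# Hypothetical toy sizes: 3 images, 1 camera + 4 register + 16 patch tokens.
rng = np.random.default_rng(0)
N, T, D = 3, 1 + 4 + 16, 8
tokens = rng.standard_normal((N, T, D))

out = alternating_attention(tokens)
camera_tokens = out[:, 0]     # (N, D): one camera token per image
feature_tokens = out[:, 5:]   # (N, 16, D): patch tokens for the seam branch
```

The only difference between the two attention steps is the reshape: frame-wise attention treats each image as its own batch element, while global attention flattens all images into one sequence so that information can flow across views.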
Key Designs
- 3D-Aware Feature Backbone:
    - Function: Extract global 3D geometric relationships from multiple images.
    - Mechanism: DINO encodes each image into patch tokens, to which learnable camera tokens and register tokens are prepended. These are processed through \(L\) layers of alternating attention (frame-wise self-attention + global self-attention) in the pretrained VGGT, outputting camera tokens that contain 3D geometric correspondences and feature tokens that preserve detail.
    - Design Motivation: VGGT, trained with 3D supervision, possesses an inherent awareness of multi-view 3D correspondence, making it more accurate and more global than traditional feature matching.
- Projection Head:
    - Function: Decode camera intrinsic and extrinsic parameters from camera tokens to directly align images in 3D space.
    - Mechanism: Predicts the intrinsic matrix \(\mathbf{K}_i\) and extrinsic parameters \((\mathbf{R}_i, \mathbf{t}_i)\) for each image, which together define a projection function \(P_i\) that maps pixels to the panoramic coordinate system. The full warp is \(\mathcal{W}_i(\mathbf{u}) = P_i(\mathbf{u}) + W_i(\mathbf{u})\), where \(W_i\) is a local mesh warp offset that handles parallax. Multiple projection formats are supported, such as planar, equirectangular, and spherical.
    - Design Motivation: Camera parameters provide globally consistent 3D constraints that are more robust than 2D homography matrices, while local mesh warping compensates for depth variations.
- Seam Head:
    - Function: Predict globally optimal seam masks for each image.
    - Mechanism: The seam labels used to supervise the network are computed jointly from color, gradient, and texture features across all images. At inference, a single forward pass predicts all seam masks simultaneously, bypassing pairwise graph cuts.
    - Design Motivation: Traditional pairwise seam detection is computationally expensive and easily trapped in local optima. Joint multi-feature optimization and global prediction yield better results in complex multi-image overlap areas, with a reported 10x speedup in large-scale scenes.
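The warp \(\mathcal{W}_i(\mathbf{u}) = P_i(\mathbf{u}) + W_i(\mathbf{u})\) from the Projection Head can be sketched for the equirectangular case as follows. This is a minimal illustration under stated assumptions, not the paper's code: the extrinsics are taken as world-to-camera, translation is ignored for the global projection (pure-rotation model), the mesh term is passed in as a precomputed per-pixel offset, and the function names `project_to_equirect` and `warp` are hypothetical.

```python
import numpy as np

def project_to_equirect(uv, K, R, pano_w, pano_h):
    """Map pixels (M, 2) to equirectangular panorama coordinates (M, 2).

    Assumes x_cam = R @ x_world (world-to-camera rotation) and ignores
    translation, i.e. every pixel is treated as a viewing direction.
    """
    uv = np.asarray(uv, dtype=float)
    homog = np.hstack([uv, np.ones((uv.shape[0], 1))])
    rays_cam = (np.linalg.inv(K) @ homog.T).T   # back-project to camera rays
    rays = rays_cam @ R                          # row-vector form of R^T @ ray
    rays /= np.linalg.norm(rays, axis=1, keepdims=True)
    lon = np.arctan2(rays[:, 0], rays[:, 2])         # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(rays[:, 1], -1.0, 1.0))  # latitude, +y points down
    x = (lon / (2.0 * np.pi) + 0.5) * pano_w
    y = (lat / np.pi + 0.5) * pano_h
    return np.stack([x, y], axis=1)

def warp(uv, K, R, pano_w, pano_h, mesh_offset=None):
    """Global projection P_i plus an optional local mesh offset W_i."""
    p = project_to_equirect(uv, K, R, pano_w, pano_h)
    if mesh_offset is not None:
        p = p + np.asarray(mesh_offset, dtype=float)
    return p

# Toy camera: 640x480 image, focal length 500 px, principal point at center.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R_front = np.eye(3)
# Camera whose optical axis points along world +x (90 degree yaw).
R_side = np.array([[0.0, 0.0, -1.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0]])

center = np.array([[320.0, 240.0]])
front_xy = warp(center, K, R_front, 2048, 1024)
side_xy = warp(center, K, R_side, 2048, 1024)
shifted_xy = warp(center, K, R_front, 2048, 1024, mesh_offset=[[2.0, -1.0]])
```

Under these conventions the image center of the front-facing camera lands at the middle of the panorama, and the camera yawed by 90 degrees lands a quarter of the panorama width to the right.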
Training Data
Constructs the Pano360 dataset: 200 real-world scenes, 14,400+ images, mapping 360° FoV per scene, with all images annotated with ground-truth (GT) camera parameters. It contains challenging conditions such as weak textures, diverse lighting, and extreme weather.

- Data Collection: Captured using a calibrated multi-camera system, with GT intrinsics and extrinsics obtained via SfM.
- Train/Test Split: 180/20 scenes, ensuring the test set covers all challenging types.
- Each scene averages 72 images with an overlap of approximately 30%-50%, simulating non-uniform sampling in real applications.
Key Experimental Results
Main Results
| Method | QA_q↑ | QA_a↑ | BRISQUE↓ | NIQE↓ |
|---|---|---|---|---|
| AutoStitch | 3.82 | 3.20 | 40.98 | 4.55 |
| GES-GSP | 3.95 | 3.20 | 36.45 | 3.36 |
| UDIS2 | 3.02 | 2.97 | 60.55 | 5.23 |
| Pano360 | Best | Best | Lowest | Lowest |
Success rate in challenging scenes: Pano360 reaches 97.8%, while traditional methods frequently fail under repetitive patterns and weak textures.
Ablation Study
| Configuration | Description |
|---|---|
| w/o 3D alignment (Pure 2D) | Severe distortion and misalignment in large parallax |
| w/o Local mesh correction | Residual misalignment in depth-varying areas |
| w/o Multi-feature seams | Unnatural seams in complex overlapping areas |
| Full Pano360 | Geometrically consistent + visually seamless |
Efficiency Comparison
| Method | Image Count = 8 | Image Count = 32 | Image Count = 128 |
|---|---|---|---|
| AutoStitch | 2.1s | 18.4s | Timeout |
| UDIS2 | 1.8s | Pairwise \(\times N\) | Not Supported |
| Pano360 | 1.5s | 4.2s | 12.8s |
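As a quick reading aid for the table above, the empirical scaling exponent \(k\) in \(t \propto n^k\) can be estimated from consecutive measurements as \(k = \log(t_2/t_1) / \log(n_2/n_1)\). The sketch below uses only the timings listed in the table; UDIS2 is omitted because it has a single numeric entry.

```python
import math

# Timings in seconds, copied from the efficiency table.
timings = {
    "AutoStitch": {8: 2.1, 32: 18.4},
    "Pano360": {8: 1.5, 32: 4.2, 128: 12.8},
}

def scaling_exponents(points):
    """Exponent k of t ~ n^k between each pair of consecutive image counts."""
    counts = sorted(points)
    return [
        math.log(points[b] / points[a]) / math.log(b / a)
        for a, b in zip(counts, counts[1:])
    ]

exponents = {method: scaling_exponents(p) for method, p in timings.items()}
for method, ks in exponents.items():
    print(method, [round(k, 2) for k in ks])
```

The exponents come out at roughly 0.74 and 0.80 for Pano360 and about 1.57 for AutoStitch, so on these numbers Pano360 grows slightly slower than linearly while AutoStitch grows clearly faster than linearly.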
Key Findings
- 3D space alignment fundamentally solves the error accumulation problem in multi-image stitching.
- The 3D correspondence perception of VGGT effectively filters out unreliable matches under repetitive patterns.
- Seam detection speed is 10 times faster than pairwise methods in large scenes (global one-pass prediction vs. pairwise graph-cuts).
- Supports flexible input from several to hundreds of images, making it applicable to real-world scenarios such as autonomous driving and VR.
- Pano360's processing time grows at most linearly with the number of images over the measured range (1.5 s at 8 images, 12.8 s at 128), whereas AutoStitch grows much faster (2.1 s at 8 images, 18.4 s at 32) and times out at 128 images.
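To make the idea of a multi-feature seam cost concrete, here is a deliberately simple stand-in: a greedy per-pixel labeling that picks, for each canvas pixel, the image whose color and gradient deviate least from the average of the images covering that pixel. This is not the paper's label-generation procedure or its seam head, it is not a graph cut, and it omits the texture term. The function name, the weights `w_color` and `w_grad`, and the grayscale inputs are all assumptions made for illustration.

```python
import numpy as np

def gradient_magnitude(img):
    gy, gx = np.gradient(img)
    return np.hypot(gx, gy)

def greedy_seam_labels(images, masks, w_color=1.0, w_grad=1.0):
    """Assign each canvas pixel to one source image.

    images: (N, H, W) grayscale images already warped to a common canvas.
    masks:  (N, H, W) boolean validity masks.
    Returns an (H, W) integer label map, with -1 where no image covers a pixel.
    """
    images = np.asarray(images, dtype=float)
    masks = np.asarray(masks, dtype=bool)
    count = np.maximum(masks.sum(axis=0), 1)

    # Per-pixel consensus color and gradient over the covering images.
    mean_img = (images * masks).sum(axis=0) / count
    grads = np.stack([gradient_magnitude(im) for im in images])
    mean_grad = (grads * masks).sum(axis=0) / count

    # Combined cost: deviation from consensus in color and in gradient.
    cost = w_color * np.abs(images - mean_img) + w_grad * np.abs(grads - mean_grad)
    cost = np.where(masks, cost, np.inf)   # invalid pixels can never win

    labels = cost.argmin(axis=0)
    labels[~masks.any(axis=0)] = -1
    return labels

# Toy canvas: two constant images overlapping in columns 2-3; row 0 uncovered.
H, W = 4, 6
images = np.stack([np.full((H, W), 0.2), np.full((H, W), 0.8)])
masks = np.zeros((2, H, W), dtype=bool)
masks[0, 1:, :4] = True
masks[1, 1:, 2:] = True
labels = greedy_seam_labels(images, masks)
```

All labels are produced in one vectorized pass over the whole canvas rather than one image pair at a time, which is the structural property the one-pass seam prediction relies on; a real system would add spatial smoothness, which this sketch lacks.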
Highlights & Insights
- Shifting panoramic stitching from 2D pairwise to 3D global represents a paradigm shift, resolving the fundamental flaws of traditional methods by leveraging the 3D perception of pretrained large models.
- Dataset contribution holds long-term value: 200 real-world scenes with 360° FoV and GT camera parameters fill the gap in training/evaluation data for panoramic stitching.
- Supports multiple projection formats (planar, equirectangular, spherical) with adaptive selection among them, which makes the method practical to deploy.
Limitations & Future Work
- Relies on pretrained VGGT weights; performance may degrade in extreme scenarios outside the training distribution.
- Assumes that all cameras share the same focal length and that the principal point lies at the image center, which does not hold for non-standard lenses.
- The local mesh warping is not described in full detail, and its ability to handle extremely large parallax remains to be verified.
- Although the dataset is large, the diversity of 200 scenes may still be limited.
Related Work & Insights
- vs UDIS2: CNN-based and end-to-end but limited to pairwise, requiring complex post-processing for multi-image stitching. Pano360 natively supports global multi-image alignment.
- vs GES-GSP: Traditional geometric feature-based method, which fails in challenging scenes due to matching errors. Pano360 replaces handcrafted features with learned 3D correspondences.
- vs AutoStitch: Classical automatic stitching, but with severe accumulated error. Pano360 eliminates cumulative errors through 3D global alignment.
Rating
- Novelty: ⭐⭐⭐⭐ Paradigm innovation of global alignment in 3D space for panoramic stitching.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated across multiple scenes + multiple baselines + dataset construction.
- Writing Quality: ⭐⭐⭐⭐ Clear writing with excellent motivation analysis.
- Value: ⭐⭐⭐⭐⭐ Significant advancement for practical applications of panoramic stitching.