Pano360: Perspective to Panoramic Vision with Geometric Consistency¶
Conference: CVPR 2026 | arXiv: 2603.12013 | Code: https://github.com/KiMomota/Pano360
Area: 3D Vision / Image Stitching | Keywords: Panoramic Stitching, 3D Geometric Consistency, Transformer, Seam Detection, Multi-View Alignment
TL;DR¶
Pano360 proposes a Transformer-based panoramic stitching framework that extends the conventional 2D pairwise alignment paradigm to 3D space, directly leveraging camera poses to guide global multi-image alignment. Combined with a multi-feature joint optimization strategy for seam detection, the method achieves a 97.8% success rate on challenging scenarios including weak texture, large parallax, and repetitive patterns, substantially outperforming existing approaches.
Background & Motivation¶
- Background: Panoramic stitching is increasingly demanded in autonomous driving, VR, and related domains. Traditional methods rely on hand-crafted features (SIFT, SURF, ORB) for pairwise matching to estimate homography matrices, while learning-based methods such as UDIS/UDIS2 employ CNN-based end-to-end learning yet remain constrained to pairwise stitching.
- Limitations of Prior Work: Pairwise methods accumulate projection errors, causing severe distortion when aligning multiple images. Feature matching is unreliable or fails entirely under weak texture, large parallax, and repetitive texture conditions. CNN-based methods handle only image pairs, requiring complex post-processing for multi-image stitching and thus offering limited practical applicability.
- Key Challenge: Nearly all existing methods establish pairwise correspondences in 2D space, neglecting the global consistency demanded by the underlying 3D projective geometry. This fundamentally limits alignment accuracy and robustness.
- Goal: To achieve globally consistent panoramic stitching directly in 3D space, scaling from a handful of input images to hundreds.
- Key Insight: Motivated by the inherent 3D-awareness of large vision models such as VGGT, the paper elevates feature matching and alignment from 2D to 3D measurement space, leveraging pretrained Transformers to aggregate global multi-view information.
- Core Idea: A Transformer directly predicts camera parameters to guide global multi-image alignment in 3D space, bypassing the 2D paradigm of pairwise feature matching.
Method¶
Overall Architecture¶
Given \(N\) partially overlapping images \(\{I_i\}_{i=1}^N\), Pano360 adopts a dual-branch architecture: ① a projection head that decodes camera parameters to guide global alignment, and ② a seam head that predicts optimal seam masks for seamless blending. Each image is first encoded by a DINO encoder to extract patch embeddings; a learnable camera token is then appended, and the sequence is fed into alternating-attention Transformer layers (global attention + frame attention) pretrained in VGGT, yielding per-image camera tokens and feature tokens.
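To make the data flow concrete, the following is a minimal PyTorch-style sketch of this dual-branch forward pass. Everything here is an illustrative assumption rather than the released Pano360 code: `PatchEncoder` stands in for the frozen DINO encoder, `AlternatingBlock` for the VGGT-pretrained layers, and the two linear heads for the actual projection and seam heads.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Stand-in for the frozen DINO encoder: images -> patch embeddings."""
    def __init__(self, dim, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                  # (N, 3, H, W) -> (N, P, dim)
        return self.proj(x).flatten(2).transpose(1, 2)

class AlternatingBlock(nn.Module):
    """Stand-in for a VGGT-style layer: frame attention within each image,
    then global attention across the tokens of all images."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                  # x: (N, P + 1, dim)
        x = x + self.frame_attn(x, x, x)[0]
        n, t, d = x.shape
        g = x.reshape(1, n * t, d)         # pool every image's tokens together
        return x + self.global_attn(g, g, g)[0].reshape(n, t, d)

class Pano360Sketch(nn.Module):
    def __init__(self, dim=256, depth=2):
        super().__init__()
        self.encoder = PatchEncoder(dim)
        self.cam_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable camera token
        self.blocks = nn.ModuleList(AlternatingBlock(dim) for _ in range(depth))
        self.projection_head = nn.Linear(dim, 9)  # toy decoder for f, R, t per image
        self.seam_head = nn.Linear(dim, 1)        # toy per-patch seam logit

    def forward(self, images):                    # images: (N, 3, H, W)
        tokens = self.encoder(images)
        cam = self.cam_token.expand(images.shape[0], -1, -1)
        x = torch.cat([tokens, cam], dim=1)       # append the camera token
        for blk in self.blocks:
            x = blk(x)
        return self.projection_head(x[:, -1]), self.seam_head(x[:, :-1])

cameras, seam_logits = Pano360Sketch()(torch.randn(4, 3, 64, 64))
```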
Key Designs¶
- 3D Pose-Guided Global Alignment (Projection Head)
- Function: Decodes camera intrinsics \(\mathbf{K}_i\) and extrinsics \(\{\mathbf{R}_i, \mathbf{t}_i\}\) from Transformer camera tokens to perform global alignment directly in 3D space.
- Mechanism: All cameras are assumed to share a common focal length, with principal points at image centers; the pose of the first image is fixed as the reference frame. The network predicts camera parameters at a fixed scale and linearly rescales them to the original resolution at inference. The alignment transformation is decomposed into a global projection \(P_i\) (defined by the camera parameters) and a local deformation \(W_i\) (mesh warping for parallax correction), yielding the complete transformation \(\mathcal{W}_i(\mathbf{u}) = P_i(\mathbf{u}) + W_i(\mathbf{u})\) (see the warp sketch after this list). Multiple projection formats are supported, including planar, equirectangular, and spherical projection.
- Design Motivation: Aligning images directly via 3D camera parameters inherently exploits global multi-view geometric constraints, avoiding the error accumulation of pairwise matching. In particular, 2D matching is highly ambiguous under repetitive textures, whereas 3D spatial constraints can effectively suppress unreliable correspondences.
- Multi-Feature Joint Seam Detection (Seam Head)
- Function: Identifies optimal seams in overlapping regions to ensure seamless and visually imperceptible transitions, supporting simultaneous processing of multiple images.
- Mechanism: The problem is formulated as energy minimization \(E(\mathcal{I}) = E_l + E_c\), where the label cost \(E_l\) ensures each pixel originates from a valid image and the compatibility cost \(E_c\) penalizes adjacent pixels assigned different labels. The per-pixel cost \(C(p) = F_{color}(p) + F_{gradient}(p) \times F_{ratio}(p)\) jointly accounts for color discrepancy, gradient magnitude, and texture complexity (see the seam-cost sketch after this list). Unlike conventional pairwise graph-cut approaches, which are prone to local optima, the proposed head considers the multi-feature costs of all images within the overlapping regions in a single forward pass, predicting globally optimal seams.
- Design Motivation: Texture-rich regions with significant parallax or depth variation require heavy penalties to route seams away, while seams in homogeneous regions are perceptually inconspicuous. The multi-feature joint strategy is more robust than any single metric based on color or gradient alone.
- Large-Scale Real-World Dataset
- Function: Provides a high-quality benchmark for Transformer training and panoramic stitching evaluation.
- Mechanism: The dataset comprises 200 diverse real-world scenes (50% tourism, 30% extreme sports, 20% challenging lighting), each captured at 3 focal lengths with 24 frames covering a 360° FoV, for a total of 14,400 frames with per-frame ground-truth camera parameters. Images are provided at 2048×2048 resolution, and a 2° random rotation jitter is applied during augmentation.
- Design Motivation: Existing datasets are predominantly synthetic or contain only image pairs, lacking the multi-view diversity of real scenes necessary for effectively training and evaluating 3D-aware stitching networks.
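To ground the projection head's warp decomposition \(\mathcal{W}_i(\mathbf{u}) = P_i(\mathbf{u}) + W_i(\mathbf{u})\), here is a toy NumPy sketch under simplifying assumptions: a rotation-only rig (translation ignored), the planar projection format only, and a given rather than network-predicted mesh-offset grid. All function names are hypothetical.

```python
import numpy as np

def global_projection(uv, K, R, f_out):
    """Toy P_i (planar format only): back-project pixels of image i to rays,
    rotate them into the reference frame, and reproject onto a shared plane.
    Translation is ignored, i.e. a rotation-only panorama rig is assumed."""
    rays = (np.linalg.inv(K) @ np.column_stack([uv, np.ones(len(uv))]).T).T
    rays = rays @ R                       # applies R.T row-wise: into reference frame
    rays /= rays[:, 2:3]                  # perspective divide
    return f_out * rays[:, :2]

def mesh_deformation(uv, offsets, img_size):
    """Toy W_i: bilinear lookup of a coarse offset grid (given here; in the
    paper, predicted by the network) for local parallax correction."""
    g = uv / img_size * (offsets.shape[0] - 1)
    x0 = np.clip(np.floor(g[:, 0]).astype(int), 0, offsets.shape[1] - 2)
    y0 = np.clip(np.floor(g[:, 1]).astype(int), 0, offsets.shape[0] - 2)
    fx, fy = g[:, 0] - x0, g[:, 1] - y0
    return ((1 - fx) * (1 - fy))[:, None] * offsets[y0, x0] \
         + (fx * (1 - fy))[:, None] * offsets[y0, x0 + 1] \
         + ((1 - fx) * fy)[:, None] * offsets[y0 + 1, x0] \
         + (fx * fy)[:, None] * offsets[y0 + 1, x0 + 1]

# Complete transformation: global projection plus local mesh deformation.
uv = np.array([[100.0, 200.0], [640.0, 360.0]])
K = np.array([[800.0, 0.0, 640.0], [0.0, 800.0, 360.0], [0.0, 0.0, 1.0]])
warped = global_projection(uv, K, np.eye(3), f_out=800.0) \
       + mesh_deformation(uv, np.zeros((8, 8, 2)), np.array([1280.0, 720.0]))
```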
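Likewise for the seam head, a toy version of the per-pixel cost \(C(p) = F_{color}(p) + F_{gradient}(p) \times F_{ratio}(p)\) on a single overlapping pair. The concrete feature definitions below (mean absolute color difference, `np.gradient` magnitude, 3×3 local variance as the texture term) are guesses for illustration; the paper's exact formulas may differ, and the real head scores all images jointly rather than one pair.

```python
import numpy as np

def seam_pixel_cost(img_a, img_b):
    """Toy per-pixel seam cost for one overlapping pair, combining color
    discrepancy, gradient magnitude, and texture complexity. The individual
    feature definitions here are illustrative assumptions."""
    gray = img_a.mean(axis=-1)
    f_color = np.abs(img_a - img_b).mean(axis=-1)      # color discrepancy
    gy, gx = np.gradient(gray)
    f_grad = np.hypot(gx, gy)                          # gradient magnitude
    # Texture complexity as a 3x3 local variance (stand-in for F_ratio).
    pad = np.pad(gray, 1, mode='edge')
    h, w = gray.shape
    win = np.stack([pad[i:i + h, j:j + w] for i in range(3) for j in range(3)])
    f_ratio = win.var(axis=0)
    return f_color + f_grad * f_ratio                  # high cost repels the seam

cost = seam_pixel_cost(np.random.rand(64, 64, 3), np.random.rand(64, 64, 3))
```

Routing the seam along minimal-cost paths then naturally avoids textured, high-parallax regions, where the \(F_{gradient}(p) \times F_{ratio}(p)\) term is large, and prefers homogeneous regions where transitions are imperceptible.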
Loss & Training¶
A multi-task loss is adopted: \(\mathcal{L}_{cam}\) (Huber loss on camera parameters) + \(\mathcal{L}_{seam}\) (L1 loss on seam masks) + \(\mathcal{L}_{proj}\) (projection consistency loss, enabled from the start of training to ensure gradient continuity). Weights of the alternating-attention module are initialized from the VGGT pretrained model and frozen during training. All quantities are expressed in the coordinate frame of the first image to achieve permutation invariance.
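A minimal sketch of this objective, assuming equal weights and an L1 form for the projection-consistency term (the paper specifies Huber for \(\mathcal{L}_{cam}\) and L1 for \(\mathcal{L}_{seam}\); the rest is an assumption):

```python
import torch.nn.functional as F

def pano360_loss(cam_pred, cam_gt, seam_pred, seam_gt, proj_pred, proj_gt,
                 w=(1.0, 1.0, 1.0)):
    """Multi-task objective sketch; the weights and the exact form of the
    projection-consistency term are assumptions, not the paper's values."""
    l_cam = F.huber_loss(cam_pred, cam_gt)    # Huber loss on camera parameters
    l_seam = F.l1_loss(seam_pred, seam_gt)    # L1 loss on seam masks
    # Projection consistency: reprojections should match ground truth; it is
    # enabled from the first training step (no delayed switch-on).
    l_proj = F.l1_loss(proj_pred, proj_gt)
    return w[0] * l_cam + w[1] * l_seam + w[2] * l_proj
```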
Key Experimental Results¶
Main Results: Success Rate and Runtime¶
| Method | Success Rate (%) | Runtime |
|---|---|---|
| LoFTR+RANSAC | 63.4 | ~13s |
| LightGlue+RANSAC | 66.7 | ~11s |
| AutoStitch | 46.7 | ~60s |
| GES-GSP | 83.3 | ~20s |
| UDIS2 | - | - |
| Pano360 | 97.8 | ~5s |
Cross-Dataset Generalization (UDIS-D Dataset)¶
| Method | PSNR↑ | SSIM↑ | PIQE↓ | NIQE↓ |
|---|---|---|---|---|
| APAP | 23.79 | 0.794 | 53.36 | 14.16 |
| UDIS2 | 25.43 | 0.838 | 48.09 | 6.11 |
| DHS | 25.88 | 0.845 | 45.73 | 6.18 |
| Pano360 | 25.97 | 0.852 | 42.12 | 5.78 |
Ablation Study¶
| \(\mathcal{L}_{cam}\) | \(\mathcal{L}_{proj}\) | \(\mathcal{L}_{seam}\) | QA_quality↑ | BRISQUE↓ | NIQE↓ |
|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 2.76 | 62.47 | 5.31 |
| ✓ | ✗ | ✗ | 3.45 | 47.43 | 4.65 |
| ✓ | ✓ | ✗ | 3.68 | 43.71 | 3.97 |
| ✓ | ✓ | ✓ | 4.09 | 37.96 | 3.37 |
Key Findings¶
- Pose-guided image transformation contributes the most (QA: 2.76→3.45); aligning images directly via camera parameters eliminates the cumulative errors of pairwise matching.
- Non-perspective projection, enabled by \(\mathcal{L}_{proj}\), further reduces distortion (QA: 3.45→3.68); seam detection achieves its best performance only on top of accurate alignment (QA: 3.68→4.09).
- On repetitive-texture and other challenging scenarios, Pano360 achieves a 97.8% success rate, far exceeding the strongest geometric baseline GES-GSP at 83.3%, while running roughly 4× faster (~5s vs. ~20s).
- On the unseen UDIS-D dataset, Pano360 achieves substantially better perceptual quality (PIQE/NIQE) than all competing methods, demonstrating strong generalization.
Highlights & Insights¶
- The paradigm shift from 2D pairwise matching to 3D global alignment is the most fundamental contribution. By exploiting the 3D-awareness of pretrained large models (VGGT/DINO) to establish multi-view geometric relationships directly in 3D space, the paper offers an elegant problem reformulation.
- Multi-feature joint seam optimization simultaneously considers color, gradient, and texture costs across all images, avoiding the local optima of conventional pairwise graph-cut while achieving over 10× speedup.
- The dataset construction pipeline — multi-focal-length capture + rotation jitter augmentation + ground-truth camera parameter annotation — is broadly applicable to other 3D vision tasks.
Limitations & Future Work¶
- Images with strong lens distortion (e.g., from fisheye lenses) are not directly supported; pre-calibration and undistortion are required.
- Extreme large-parallax scenarios (i.e., the same object captured from very different viewpoints) still require 3D reconstruction as an auxiliary step, which stitching alone cannot resolve.
- Although diverse, the current dataset contains only 200 scenes, which may remain insufficient for training large-scale Transformers.
- Integration with 3D Gaussian Splatting could be explored to handle complex scenes with significant depth variation.
Related Work & Insights¶
- vs. UDIS/UDIS2 [26,27]: CNN-based methods handle only pairwise inputs and require complex post-processing for multi-image stitching, whereas Pano360 supports an arbitrary number of inputs with end-to-end prediction. Pano360 also surpasses UDIS2 on the UDIS-D dataset.
- vs. GES-GSP [6]: The strongest geometry-based feature method achieves 83.3% success rate but fails under repetitive textures; Pano360 reaches 97.8% by exploiting 3D consistency and runs 4× faster.
- vs. VGGT [34]: Pano360 draws on VGGT's 3D-aware Transformer architecture but adapts it specifically for panoramic stitching via the projection head and seam head.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Reframes panoramic stitching as a global alignment problem in 3D space — a paradigm-level innovation
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-dataset validation with complete ablations; success rate results are highly compelling
- Writing Quality: ⭐⭐⭐⭐ Clear structure and well-defined problem formulation
- Value: ⭐⭐⭐⭐⭐ Addresses a core pain point in real-world scenarios; open-source code; high practical utility