
PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery

- Conference: CVPR 2026
- arXiv: 2603.17571
- Code: Available (coming soon)
- Area: 3D Vision
- Keywords: Panoramic 3D reconstruction, feed-forward multi-view reconstruction, spherical position encoding, SO(3) data augmentation, large-scale panoramic dataset

TL;DR

This paper proposes PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and globally consistent 3D point clouds from one or more unordered panoramic images in a single feed-forward pass. The paper also contributes PanoCity — a large-scale dataset comprising over 120,000 outdoor panoramic images.

Background & Motivation

Background: Feed-forward 3D reconstruction models such as DUSt3R, VGGT, and \(\pi^3\) have achieved remarkable success on perspective images, jointly inferring depth, pose, and 3D structure in a single forward pass.

Limitations of Prior Work:
- These models are fundamentally built on the pinhole projection assumption; applying them directly to equirectangular panoramic images introduces seam artifacts, inconsistent parallax, and geometric drift.
- Decomposing panoramas into multiple perspective crops and stitching the results introduces additional artifacts.
- Existing panoramic datasets suffer from insufficient scale, incomplete annotations, and inadequate viewpoint overlap.

Key Challenge: Panoramic images exhibit non-pinhole distortion and spherical geometry, rendering the position encodings, data augmentation strategies, and geometric reasoning of existing feed-forward models inapplicable.

Goal: (a) Extend the feed-forward 3D reconstruction paradigm to the panoramic image domain; (b) construct a sufficiently large panoramic dataset to support training.

Key Insight: Enable the Transformer to perform effective geometric reasoning in the spherical domain through spherical-aware position encoding and \(SO(3)\) rotation augmentation.

Core Idea: Extend the feed-forward reconstruction paradigm of VGGT/\(\pi^3\) to panoramic imagery via spherical position encoding, three-axis rotation augmentation, and a stochastic anchoring strategy.

Method

Overall Architecture

Given a set of unordered panoramic images \(\{I_i\}_{i=1}^N\), the model passes them through an encode–aggregate–decode architecture to output camera poses \(G = \{g_i\}\), depth maps \(D = \{D_i\}\), and a world-coordinate 3D point cloud \(P = \{P_i\}\). The model is permutation-equivariant: reordering the input images reorders the per-image outputs in the same way and leaves each prediction otherwise unchanged.

Key Designs

Spherical-aware Position Embedding
- Function: Provides geometrically correct positional information for panoramic image patches under equirectangular projection.
- Mechanism: For each patch, the spherical center coordinates \((\theta, \phi)\) are computed and encoded as a 4D cyclically symmetric vector \(p_{\text{vec}} = [\sin\theta, \cos\theta, \sin\phi, \cos\phi]\), which is then mapped to a high-dimensional embedding \(p_{\text{embed}} \in \mathbb{R}^C\) via an MLP.
- Design Motivation: Standard ViT position encodings cannot handle the spatially varying sampling density of equirectangular projection. Trigonometric encodings naturally preserve wrap-around continuity at the boundary \(\theta = \pm\pi\). More critically, under \(SO(3)\) rotation augmentation, the decoupling of fixed position encodings from dynamically rotated content forces the network to disentangle distortion effects from semantic content.
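
The mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the patch size of 14, the longitude/latitude convention, and the two-layer ReLU MLP (`mlp_embed`, with caller-supplied weights) are all assumptions made for the example.

```python
import numpy as np

def spherical_patch_vectors(height, width, patch):
    """Return an (H_p * W_p, 4) array of [sin θ, cos θ, sin φ, cos φ] at patch centers."""
    hp, wp = height // patch, width // patch
    # Patch-center pixel coordinates, measured from the top-left image corner.
    u = (np.arange(wp) + 0.5) * patch
    v = (np.arange(hp) + 0.5) * patch
    # Assumed convention: longitude θ in (-π, π), latitude φ in (-π/2, π/2).
    theta = (u / width) * 2.0 * np.pi - np.pi
    phi = np.pi / 2.0 - (v / height) * np.pi
    theta_grid, phi_grid = np.meshgrid(theta, phi)  # both of shape (hp, wp)
    vec = np.stack(
        [np.sin(theta_grid), np.cos(theta_grid), np.sin(phi_grid), np.cos(phi_grid)],
        axis=-1,
    )
    return vec.reshape(hp * wp, 4)

def mlp_embed(p_vec, w1, b1, w2, b2):
    """Hypothetical two-layer MLP lifting the 4D vector to C channels."""
    hidden = np.maximum(p_vec @ w1 + b1, 0.0)  # ReLU
    return hidden @ w2 + b2

# Example: a 336 x 672 panorama with an assumed patch size of 14.
p_vec = spherical_patch_vectors(336, 672, 14)
```

Because only sines and cosines of \(\theta\) enter the vector, patches on either side of the left/right image seam receive nearly identical longitude features, which is the wrap-around property the design motivation refers to.
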
Geometry Aggregator
- Function: Performs local and global geometric reasoning across multi-view tokens.
- Mechanism: \(L\) alternating attention blocks, each comprising (a) intra-frame self-attention — tokens within the same panorama attend to each other to capture local structure and projection distortion; and (b) global self-attention — tokens from all panoramas are mixed to enable cross-view correspondence reasoning. The aggregated features are combined with spherical embeddings via three lightweight adapters and fed into three prediction heads: a camera pose head, a local point cloud head, and a global point cloud head.
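
The alternation between intra-frame and global attention amounts to a reshape of the token tensor. The sketch below assumes tokens of shape `(N, T, C)` (panoramas, tokens per panorama, channels) and uses a weight-free single-head attention purely for illustration; the real blocks would have learned projections, normalization, and feed-forward layers, none of which are shown here.

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention without learned weights.

    x has shape (B, T, C); attention is computed independently per batch entry.
    """
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def alternating_block(tokens):
    """One intra-frame step followed by one global step. tokens: (N, T, C)."""
    n, t, c = tokens.shape
    # (a) Intra-frame: each panorama is its own batch entry.
    tokens = tokens + self_attention(tokens)
    # (b) Global: flatten all panoramas into a single sequence of N * T tokens.
    flat = tokens.reshape(1, n * t, c)
    flat = flat + self_attention(flat)
    return flat.reshape(n, t, c)

def aggregator(tokens, num_blocks):
    """Apply L alternating blocks in sequence."""
    for _ in range(num_blocks):
        tokens = alternating_block(tokens)
    return tokens
```

Since neither step depends on the order of the panoramas, permuting the first axis of the input permutes the output in the same way, which matches the permutation-equivariance property claimed for the model.
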
Stochastic Anchoring
- Function: Resolves global coordinate frame ambiguity arising from the permutation-equivariant design.
- Mechanism: At each training iteration, a random panorama \(k\) is selected as the anchor, and all poses and point clouds are aligned to a coordinate frame centered on frame \(k\).
- Design Motivation: Fixing the first frame as the origin (as in VGGT) introduces ordering bias and is unstable under unordered inputs. Stochastic anchoring establishes a stable "hub-and-spoke" geometric structure while maintaining full permutation equivariance.
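
As a rough sketch of the re-anchoring step, the ground truth can be re-expressed in the frame of a randomly chosen panorama \(k\). The code assumes poses are 4×4 camera-to-world matrices and that points are given in world coordinates; the summary does not state the pose convention, so treat both as assumptions, and the function names as hypothetical.

```python
import numpy as np

def anchor_to_frame(poses_c2w, points_world, k):
    """Re-express poses (N, 4, 4) and points (N, M, 3) in the frame of panorama k."""
    anchor_inv = np.linalg.inv(poses_c2w[k])
    # Left-multiplying by the inverse anchor pose makes pose k the identity.
    new_poses = anchor_inv @ poses_c2w
    rot, trans = anchor_inv[:3, :3], anchor_inv[:3, 3]
    # Apply the same rigid transform to every world-space point (row-vector form).
    new_points = points_world @ rot.T + trans
    return new_poses, new_points

def stochastic_anchor(poses_c2w, points_world, rng):
    """Pick a random anchor index for this training iteration and re-anchor."""
    k = int(rng.integers(len(poses_c2w)))
    new_poses, new_points = anchor_to_frame(poses_c2w, points_world, k)
    return k, new_poses, new_points
```

Relative poses between any two panoramas are unchanged by this transform; only the choice of origin varies from iteration to iteration.
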
Panorama-specific Three-axis \(SO(3)\) Data Augmentation
- Function: Exploits panoramas' natural support for spherical rotations to enable unbounded data augmentation.
- Mechanism: For each RGB–depth–pose triplet, a random rotation \(R_{\text{aug}} \in SO(3)\) is sampled; poses are updated as \(g_i' = R_{\text{aug}} \cdot g_i\), and the panorama and depth map are projected onto the sphere, rotated, and resampled back to equirectangular format.
- Design Motivation: Augmentation for perspective images is constrained by planar geometry, whereas panoramas support arbitrary three-axis rotations without compromising geometric validity, substantially alleviating data scarcity.
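
The project-rotate-resample step can be illustrated as follows. This is a simplified sketch under stated assumptions: a y-up axis convention, nearest-neighbor lookup instead of interpolation, and depth stored as radial distance (so a pure rotation leaves depth values unchanged and the same resampling applies to the depth map). The pose update \(g_i' = R_{\text{aug}} \cdot g_i\) from the text is not reproduced here.

```python
import numpy as np

def pixel_directions(height, width):
    """Unit ray direction for every equirectangular pixel center, shape (H, W, 3)."""
    theta = (np.arange(width) + 0.5) / width * 2.0 * np.pi - np.pi    # longitude
    phi = np.pi / 2.0 - (np.arange(height) + 0.5) / height * np.pi    # latitude
    theta, phi = np.meshgrid(theta, phi)
    return np.stack(
        [np.cos(phi) * np.sin(theta), np.sin(phi), np.cos(phi) * np.cos(theta)],
        axis=-1,
    )

def rotate_equirect(image, rotation):
    """Resample so that content seen along direction d moves to rotation @ d."""
    height, width = image.shape[:2]
    # For each output direction d_out, the source direction is R^T d_out;
    # with row vectors this is d_out @ R.
    src = pixel_directions(height, width) @ rotation
    theta = np.arctan2(src[..., 0], src[..., 2])
    phi = np.arcsin(np.clip(src[..., 1], -1.0, 1.0))
    col = np.floor((theta + np.pi) / (2.0 * np.pi) * width).astype(int) % width
    row = np.floor((np.pi / 2.0 - phi) / np.pi * height).astype(int)
    row = np.clip(row, 0, height - 1)
    return image[row, col]  # works for (H, W) depth and (H, W, C) color

def random_rotation(rng):
    """Draw a random rotation matrix via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # fix the sign ambiguity of the QR factors
    if np.linalg.det(q) < 0:      # ensure a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return q
```

With the identity matrix the image is returned unchanged, and a 180° rotation about the vertical axis reduces to a horizontal shift by half the image width, which is a convenient sanity check for the conventions chosen.
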

Loss & Training

Scale-consistent local/global geometry losses: \(\mathcal{L}_{\text{lp}} + \mathcal{L}_{\text{gp}}\), with an optimal scale factor \(s^*\) estimated in closed form to ensure metric consistency.
Normal consistency regularization: \(\mathcal{L}_{\text{nor}}\), penalizing angular differences between surface normals.
Relative pose supervision: rotation loss \(\mathcal{L}_{\text{rot}}\) (angular distance) + translation loss \(\mathcal{L}_{\text{trans}}\) (L1).
Total loss: \(\mathcal{L} = \mathcal{L}_{\text{lp}} + \mathcal{L}_{\text{gp}} + \mathcal{L}_{\text{nor}} + 0.1(100 \cdot \mathcal{L}_{\text{trans}} + \mathcal{L}_{\text{rot}})\)
Training takes approximately 10 days on 8× A100 GPUs.
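
To make the weighting concrete, the sketch below combines the five terms exactly as in the total-loss formula above. The helper functions are illustrative assumptions: `optimal_scale_l2` shows the closed form a scale factor would take under a least-squares objective (the summary does not state which norm is used to estimate \(s^*\)), and `rotation_angle` is the standard geodesic angle between two rotation matrices.

```python
import numpy as np

def optimal_scale_l2(pred, target):
    """Least-squares scale: argmin_s ||s * pred - target||^2 (assumed L2 form)."""
    return float(np.sum(pred * target) / np.sum(pred * pred))

def rotation_angle(r_pred, r_gt):
    """Geodesic angle in radians between two 3x3 rotation matrices."""
    cos = (np.trace(r_pred.T @ r_gt) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def total_loss(l_lp, l_gp, l_nor, l_trans, l_rot):
    """Combine the individual terms with the weights from the total-loss formula."""
    return l_lp + l_gp + l_nor + 0.1 * (100.0 * l_trans + l_rot)
```

Expanding the last term gives effective weights of 10 on \(\mathcal{L}_{\text{trans}}\) and 0.1 on \(\mathcal{L}_{\text{rot}}\) relative to the geometry and normal losses.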

Key Experimental Results

Main Results — Camera Pose Estimation

| Method | Matterport3D AUC@30↑ | PanoCity AUC@30↑ | PanoCity Rot. Err.↓ | PanoCity Trans. Err.↓ |
| --- | --- | --- | --- | --- |
| BiFuse++ | 0.007 | 0.833 | 1.655° | 5.044° |
| VGGT | 0.034 | 0.205 | 7.659° | 35.867° |
| \(\pi^3\) | 0.047 | 0.571 | 7.669° | 16.780° |
| \(\pi^3\)* (panoramic retrain) | 0.305 | 0.682 | — | — |
| PanoVGGT | 0.459 | 0.949 | 0.873° | 2.168° |
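
The summary does not define AUC@30. A common convention in feed-forward pose evaluation, assumed here, takes the larger of the angular rotation and translation errors for each image pair and averages the recall over integer thresholds from 1° to 30°. The function name and signature below are hypothetical.

```python
import numpy as np

def pose_auc(rot_err_deg, trans_err_deg, max_threshold=30):
    """Area under the recall curve of max(rotation, translation) angular error."""
    worst = np.maximum(
        np.asarray(rot_err_deg, float), np.asarray(trans_err_deg, float)
    )
    # One-degree bins from 0 to max_threshold; larger errors fall outside all bins.
    bins = np.arange(max_threshold + 1)
    hist, _ = np.histogram(worst, bins=bins)
    recall = np.cumsum(hist) / len(worst)
    return float(np.mean(recall))
```

Under this convention a pair only counts as correct at a given threshold when both its rotation and its translation direction are within that angle, so a single poor translation estimate lowers the score even if the rotation is accurate.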

Monocular Depth Estimation

| Method | Matterport3D Abs Rel↓ | Stanford2D3D Abs Rel↓ | PanoCity Abs Rel↓ |
| --- | --- | --- | --- |
| EGFormer | 0.0987 | 0.0929 | 0.0363 |
| BiFuse++ | 0.1076 | 0.1120 | 0.0200 |
| PanoVGGT (mono) | 0.0884 | 0.0711 | 0.0312 |
| PanoVGGT (multi-view) | 0.0840 | 0.0778 | 0.0196 |
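
For reference, Abs Rel is the mean absolute relative depth error over valid pixels. The sketch below uses the standard definition; treating pixels with non-positive ground-truth depth as invalid is an assumption, and any scale alignment applied before evaluation in the paper is not modeled.

```python
import numpy as np

def abs_rel(pred, gt, valid=None):
    """Mean of |pred - gt| / gt over valid pixels."""
    if valid is None:
        valid = gt > 0  # assumed validity mask: positive ground-truth depth
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))
```

A prediction that overestimates every valid depth by 10% yields an Abs Rel of 0.1, which gives a sense of scale for the values in the table.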

Key Findings

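PanoVGGT outperforms all listed baselines on pose estimation: AUC@30 rises from 0.305 (second-best, \(\pi^3\)*) to 0.459 on Matterport3D, and from 0.833 (second-best, BiFuse++) to 0.949 on PanoCity.
In the monocular setting, PanoVGGT surpasses the dedicated depth estimation models on Matterport3D and Stanford2D3D (Abs Rel 0.0884 and 0.0711). On PanoCity, BiFuse++ (0.0200) is ahead of the monocular variant (0.0312), and only the multi-view variant (0.0196) improves on it. PanoVGGT achieves these results as a single unified model under a multi-task joint prediction setting.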
BiFuse++ performs poorly on Matterport3D/Stanford2D3D because its self-supervised training relies on ordered narrow-baseline frames, which is mismatched with these sparse unordered panoramas.
The PanoCity dataset (120K frames) far exceeds the scale of existing panoramic datasets (Matterport3D: 10K; Structured3D: 12K) and provides complete multi-view overlap.

Highlights & Insights

The synergistic design of spherical position encoding and \(SO(3)\) augmentation is particularly elegant: rotating content under fixed position encodings forces the network to learn to decouple projection distortion effects from semantic content, achieving an effect analogous to spherical convolutions without their complexity.
Stochastic anchoring is a simple yet effective solution to the global coordinate frame ambiguity in permutation-equivariant models, and is more robust than fixing the first frame.
The PanoCity dataset is itself a significant contribution: it provides the first large-scale outdoor panoramic dataset with continuous trajectories, complete 6-DoF poses, and high-precision depth, filling a critical gap in the field.
Overcomplete supervision: jointly regressing both depth maps and 3D point clouds, which are in theory derivable from each other, is shown empirically to improve the accuracy of all predictions significantly.

Limitations & Future Work

Training resolution is limited to \(336 \times 672\) (height × width), far below the native \(2048 \times 4096\), potentially losing high-frequency detail.
Indoor datasets (Matterport3D, Stanford2D3D) contain few valid multi-view samples with poor overlap, limiting the thoroughness of indoor evaluation.
The model is large (DINOv2 backbone); inference efficiency on edge devices is not discussed.
PanoCity is a synthetic dataset; the domain gap with real-world panoramas is not sufficiently analyzed.

Related Work & Insights

vs. VGGT / \(\pi^3\): Both models are built on the pinhole assumption and suffer drastic performance degradation when applied directly to panoramas. PanoVGGT effectively bridges this gap through spherical position encoding and rotation augmentation.
vs. BiFuse++: BiFuse++ is designed for panoramas but relies on ordered narrow-baseline self-supervision, collapsing on sparse unordered viewpoints; PanoVGGT's permutation-equivariant design naturally accommodates unordered inputs.
vs. Traditional Panoramic Methods: Most existing panoramic methods address only a single task (depth or pose); PanoVGGT is the first unified feed-forward model to jointly predict poses, depth, and point clouds in the panoramic domain.

Rating

Novelty: ⭐⭐⭐⭐ The synergistic design of spherical position encoding and \(SO(3)\) augmentation is creative, though the overall architecture follows VGGT/\(\pi^3\).
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluation spans multiple datasets and tasks, including cross-domain generalization and complete ablations, with a high-quality dataset contribution.
Writing Quality: ⭐⭐⭐⭐ Well-structured; both dataset construction and method design are described with sufficient clarity.
Value: ⭐⭐⭐⭐⭐ The PanoCity dataset and the panoramic feed-forward reconstruction paradigm will have substantial impact on the community.