SAP: Segment Any 4K Panorama¶
Conference: CVPR 2026 · arXiv: 2603.12759 · Code: Available (Project Page)
Area: Panoramic Image Segmentation
Keywords: Panoramic Segmentation, SAM2, 4K High Resolution, Topology-Memory Alignment, Perspective Video Reconstruction
TL;DR¶
This paper proposes SAP (Segment Any 4K Panorama), which converts panoramic images into perspective pseudo-video sequences sampled along fixed spherical trajectories, addressing the structural mismatch of SAM2's streaming memory mechanism on 360° images. By synthesizing a 183K instance-annotated 4K panoramic dataset for fine-tuning, SAP achieves a zero-shot mIoU improvement of +17.2 on real-world panoramic benchmarks.
Background & Motivation¶
With the growing adoption of 360° cameras in robotics, AR/VR, and embodied intelligence, the demand for high-quality instance segmentation of panoramic images has increased substantially. However, existing segmentation foundation models (e.g., SAM/SAM2) face three major challenges:
- Resolution loss: SAM supports only \(1024^2\) inputs; a 4K 2:1 panorama (\(4096 \times 2048\)) is downsampled to \(1024 \times 512\) and padded, losing significant detail.
- Geometric distortion: Equirectangular projection (ERP) introduces severe polar distortion and left-right seam discontinuities.
- Violation of structural assumptions: SAM2's streaming memory mechanism assumes consecutive frames correspond to smooth camera motion and overlapping visual content, whereas ERP panoramas have no intrinsic temporal order—sliding-window cropping disrupts physical viewpoint continuity.
OmniSAM attempts to apply SAM2 with a sliding window directly on ERP, but distortion and seam issues are merely surface-level symptoms. The key insight of this paper is that spherical panoramas fundamentally violate the structural assumptions of streaming memory models, and this must be addressed from a topology-memory alignment perspective.
Method¶
Overall Architecture¶
The SAP pipeline consists of four steps:

1. Project the ERP panorama and prompt points into a perspective pseudo-video sequence sampled along a fixed trajectory.
2. Process the pseudo-video with fine-tuned SAM2 to generate per-frame segmentation masks.
3. Back-project the perspective frame masks and fuse them onto the ERP plane.
4. Output the final panoramic segmentation result.
Key Designs¶
- Panorama-to-perspective video conversion: Given an ERP image \(I^{ERP} \in \mathbb{R}^{H \times W \times 3}\), \(N\) perspective views are generated via a three-stage geometric projection:
    - Define camera intrinsics (FoV \(\beta = 90°\), focal length \(f = \frac{L-1}{2\tan(\beta/2)}\))
    - Back-project pixels to ray directions: \(\mathbf{r}^{cam} \propto \mathbf{K}^{-1}[u,v,1]^T\)
    - Rotate to world coordinates and convert to spherical coordinates for sampling: \([x,y,z]^T = \text{Normalize}(\mathbf{R}_i \mathbf{K}^{-1}[u,v,1]^T)\)
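The three-stage projection above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the ERP angle convention (longitude over width, latitude over height), the axis ordering, and nearest-neighbor sampling are all assumptions.

```python
import numpy as np

def perspective_from_erp(erp, yaw, pitch, L=256, fov_deg=90.0):
    """Sample one L x L perspective view from an ERP image of shape (H, W, 3)."""
    H, W = erp.shape[:2]
    # Stage 1: intrinsics, f = (L-1) / (2 tan(beta/2)), then back-project the
    # pixel grid to camera-frame ray directions K^{-1} [u, v, 1]^T.
    f = (L - 1) / (2.0 * np.tan(np.radians(fov_deg) / 2))
    u, v = np.meshgrid(np.arange(L), np.arange(L))
    rays = np.stack([(u - (L - 1) / 2) / f,
                     (v - (L - 1) / 2) / f,
                     np.ones((L, L))], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Stage 2: rotate camera rays into world coordinates (R_i = R_yaw @ R_pitch).
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    R_pitch = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    R_yaw = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    d = rays @ (R_yaw @ R_pitch).T
    # Stage 3: world directions -> spherical coordinates -> ERP pixel lookup.
    lon = np.arctan2(d[..., 0], d[..., 2])              # [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))      # [-pi/2, pi/2]
    src_u = ((lon / np.pi + 1) / 2 * (W - 1)).round().astype(int) % W
    src_v = ((lat / (np.pi / 2) + 1) / 2 * (H - 1)).round().astype(int)
    return erp[src_v, src_u]
```

Nearest-neighbor lookup keeps the sketch short; a real pipeline would interpolate bilinearly and handle the seam wrap (done here with the modulo on `src_u`).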
- Column-first zigzag scanning trajectory: This is the central design innovation of the paper. Compared to row-first scanning, column-first scanning possesses an infinite-loop property: starting from any point, alternating up-and-down motion returns exactly to the starting point. Formally, the visiting order for column \(j\) is \(\mathcal{O}_j = \begin{cases} (j,1),(j,2),\dots,(j,N_{pitch}), & j \bmod 2 = 1 \\ (j,N_{pitch}),\dots,(j,2),(j,1), & j \bmod 2 = 0 \end{cases}\). Consecutive frames differ in only one angular dimension (yaw or pitch), ensuring smooth video-like transitions. The sampling grid is determined by the FoV and overlap ratio \(r=0.5\): \(\Delta_{yaw} = \beta_h(1-r)\), \(N_{yaw} = \lceil 360°/\Delta_{yaw} \rceil\).
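The visiting order and grid-size rule above can be sketched as follows. With \(\beta = 90°\) and \(r = 0.5\) the yaw step is \(45°\) and \(N_{yaw} = 8\); the value of \(N_{pitch}\) is an assumption here, as is the function name.

```python
import math

def zigzag_trajectory(fov_deg=90.0, overlap=0.5, n_pitch=5):
    """Column-first zigzag visiting order over the (yaw, pitch) sampling grid."""
    step = fov_deg * (1 - overlap)        # Delta_yaw = beta_h * (1 - r)
    n_yaw = math.ceil(360.0 / step)       # N_yaw = ceil(360 / Delta_yaw)
    order = []
    for j in range(1, n_yaw + 1):         # columns, 1-indexed as in the paper
        rows = range(1, n_pitch + 1)      # odd columns: top-down
        if j % 2 == 0:
            rows = reversed(rows)         # even columns: bottom-up
        order.extend((j, i) for i in rows)
    return order

traj = zigzag_trajectory()  # 8 columns x 5 rows = 40 frames
```

Note that every pair of consecutive frames, including the pair straddling a column boundary, differs in exactly one of the two grid coordinates, which is what makes the sequence resemble smooth camera motion.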
- Cyclic extension for arbitrary starting points: The trajectory is duplicated to \(2 \cdot N\) frames; during training, a random start index \(s \in \{0, \dots, N-1\}\) is sampled and a contiguous window of \(N\) frames is extracted, guaranteeing that any window covers every viewpoint exactly once.
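The cyclic-extension windowing is simple enough to state directly in code; this is a minimal sketch with an illustrative function name.

```python
import random

def random_cyclic_window(trajectory, rng=random):
    """Duplicate the N-frame trajectory to 2N frames and cut a random
    contiguous window of length N starting at s in {0, ..., N-1}."""
    n = len(trajectory)
    doubled = trajectory + trajectory
    s = rng.randrange(n)
    return doubled[s:s + n]

frames = list(range(8))                   # stand-in for an 8-frame trajectory
window = random_cyclic_window(frames)     # a rotation of the original sequence
```

Because any length-\(N\) slice of the doubled sequence is just a rotation of the original trajectory, coverage is independent of the start index.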
- 183K synthetic dataset: Using the InfiniGen engine and 40,000 GPU hours, 183,440 panoramic images at \(4096 \times 2048\) resolution were synthesized, comprising 6,409,732 instance masks. Object size distribution: small 37.84%, medium 25.70%, large 36.47%.
- Prompt point projection: A user prompt point \(\mathbf{p} = (u_p, v_p)\) on the ERP is first converted to a spherical direction vector \(\mathbf{d} = [\cos\theta_p\cos\phi_p, \cos\theta_p\sin\phi_p, \sin\theta_p]^T\), then projected onto each perspective frame to determine visibility.
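The pixel-to-direction step can be sketched as below. The mapping from ERP pixel coordinates to longitude \(\phi_p\) and latitude \(\theta_p\) is an assumption (longitude spanning the width, latitude the height); the direction formula itself is the one quoted above.

```python
import numpy as np

def prompt_to_direction(u_p, v_p, W, H):
    """Convert an ERP prompt pixel (u_p, v_p) to a unit direction vector
    d = [cos(theta) cos(phi), cos(theta) sin(phi), sin(theta)]^T."""
    phi = (u_p / (W - 1) - 0.5) * 2 * np.pi   # longitude in [-pi, pi]
    theta = (0.5 - v_p / (H - 1)) * np.pi     # latitude in [-pi/2, pi/2]
    return np.array([np.cos(theta) * np.cos(phi),
                     np.cos(theta) * np.sin(phi),
                     np.sin(theta)])
```

Visibility in a given frame can then be tested by rotating \(\mathbf{d}\) into that frame's camera coordinates and checking that it projects inside the image bounds.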
Mask Fusion¶
Per-frame perspective masks are back-projected to the ERP plane and fused by taking the element-wise maximum across all frames that observe each pixel.
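A minimal sketch of the max-value aggregation, assuming the masks have already been back-projected onto a common ERP grid (the back-projection itself, the inverse of the sampling in the conversion step, is omitted here):

```python
import numpy as np

def fuse_masks_max(erp_masks):
    """Element-wise maximum over a list of (H, W) mask score maps
    that already live on the ERP plane."""
    fused = np.asarray(erp_masks[0], dtype=float)
    for m in erp_masks[1:]:
        fused = np.maximum(fused, m)
    return fused
```

Taking the maximum rather than the mean means a pixel is kept whenever at least one view segments it confidently, which suits the 50%-overlap grid where most pixels are seen by several frames.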
Loss & Training¶
- Built on SAM2 (Hiera-Large encoder)
- Image encoder is frozen; only the memory attention, memory encoder, mask decoder, and prompt encoder are updated
- Mixed training: synthetic panoramic data + SAM2 original training data (SA-1B + SA-V) to prevent catastrophic forgetting
- AdamW optimizer, batch size 128, lr \(2 \times 10^{-4}\) (cosine schedule), weight decay 0.1, gradient clipping 0.1
Key Experimental Results¶
PAV-SOD Real-World 4K Panoramic Benchmark (Zero-Shot)¶
| Method | 1-click Overall | 1-click Small | 1-click Large | 3-click Overall |
|---|---|---|---|---|
| SAM2-tiny | 51.6 | 46.3 | 49.1 | 82.2 |
| SAM2-tiny+scan | 65.1 | 49.6 | 70.0 | 83.0 |
| SAP-tiny | 75.8 | 53.9 | 79.7 | 84.8 |
| Δ(SAP-SAM2) | +24.2 | +7.6 | +30.6 | +2.6 |
| SAM2-large | 66.3 | 50.7 | 64.4 | 84.3 |
| SAM2-large+scan | 69.0 | 58.4 | 73.8 | 84.1 |
| SAP-large | 77.3 | 61.1 | 81.7 | 86.1 |
| Δ(SAP-SAM2) | +11.0 | +10.4 | +17.3 | +1.8 |
InfiniGen Synthetic 4K Panoramic Benchmark¶
| Method | 1-click Overall | 1-click Small | 1-click Large | 3-click Overall |
|---|---|---|---|---|
| SAM2-base | 62.0 | 57.6 | 59.8 | 81.4 |
| SAP-base | 81.8 | 72.3 | 89.6 | 88.9 |
| Δ(SAP-SAM2) | +19.8 | +14.7 | +29.8 | +7.5 |
| SAM2-large | 62.8 | 59.7 | 60.7 | 81.4 |
| SAP-large | 81.9 | 72.5 | 90.7 | 89.0 |
| Δ(SAP-SAM2) | +19.1 | +12.8 | +30.0 | +7.6 |
Key Findings¶
- SAP substantially outperforms SAM2 across all model sizes, with an average gain of +17.2 mIoU across four model variants (PAV-SOD 1-click).
- The scanning strategy alone (without fine-tuning) yields +13.5 improvement on the tiny model, but fine-tuning provides larger and more consistent gains.
- Improvement is most pronounced for large objects (PAV-SOD tiny: +30.6), indicating that cross-view propagation is especially critical for large-scale instances.
- On HunyuanWorld (cartoon-style 8K panoramas), applying the scan strategy without fine-tuning actually degrades performance, underscoring the necessity of fine-tuning.
- Ablation studies confirm that mixing SAM2 original training data significantly improves generalization (PAV-SOD: 67.3 → 77.3).
Highlights & Insights¶
- Precise problem formulation: Reframing "ERP distortion and seam discontinuity" as a "topology-memory alignment" problem provides a fundamental explanation for SAM2's failure on panoramic inputs.
- Elegant column-first zigzag trajectory design: Satisfies the infinite-loop constraint and guarantees full coverage from any starting point with a simple, purely geometric rule.
- Effective large-scale synthetic data: The 183K synthetic images with SAM2 fine-tuning prove effective not only on synthetic test sets but also on real-world data, validating synthetic-to-real transfer.
- Essential distinction from prior work: OmniSAM applies a sliding window directly on ERP; SAP operates entirely in perspective space, avoiding distortion altogether.
Limitations & Future Work¶
- The large number of perspective frames (\(N_{yaw} \times N_{pitch}\) frames × 2 for cyclic extension) results in high inference cost.
- The fixed FoV of \(90°\) and overlap ratio of \(50\%\) are manually selected rather than adaptively optimized.
- Evaluation is limited to SAM2 as the foundation model; other segmentation foundation models are not assessed.
- Cross-frame instance consistency relies on SAM2's memory mechanism, which may still struggle in complex occlusion scenarios.
- Although synthetic data is effective, the domain gap persists, particularly for small objects where improvement is relatively limited.
Related Work & Insights¶
- SAM2 [Meta 2024]: A video segmentation foundation model providing a streaming memory mechanism—the backbone of this work.
- OmniSAM [2024]: Applies SAM2 with a sliding window on ERP for semantic segmentation; the primary baseline improved upon in this paper.
- InfiniGen [2024]: A data engine for generating large-scale synthetic panoramic images.
- Trans4PASS / PanoFormer: Deformable embedding / tangent patch methods for handling spherical distortion.
- Inspiration: The topology-memory alignment paradigm can be generalized to adapt other foundation models to non-standard geometries such as spherical, cylindrical, and fisheye imaging.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Practicality | ⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐⭐ |
| Overall | ⭐⭐⭐⭐⭐ |