
Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image

Conference: CVPR 2026 arXiv: 2603.05908 Code: Available (project page) Area: 3D Vision Keywords: Panoramic 3D reconstruction, compositional scene generation, feed-forward transformation prediction, VGGT, 3D Gaussian splatting

TL;DR

This paper proposes Pano3DComposer, a modular feed-forward compositional 3D scene generation framework that takes a single panoramic image as input. A plug-and-play Object-World Transformation Predictor (based on Alignment-VGGT) maps generated 3D objects from local coordinates to world coordinates, producing high-fidelity 3D scenes in approximately 20 seconds on an RTX 4090.

Background & Motivation

Background: 3D scene generation is foundational for VR/AR and digital twins. Existing methods primarily rely on perspective images with limited field of view; panoramic images provide a complete 360° spatial context but introduce severe distortion.

Limitations of Prior Work:

  • Feed-forward scene understanding methods (Total3D, InstPIFu) are constrained by insufficient 3D mesh supervision and limited generalization
  • Feed-forward multi-instance generation models (MIDI, SceneGen) require expensive fine-tuning and tightly couple object generation with layout estimation
  • Compositional optimization methods (GALA3D, LayoutYour3D) require time-consuming iterative optimization that does not meet efficiency requirements
  • Panorama-specific methods (DeepPanoContext, PanoContext-Former) can only produce texture-free meshes

Key Challenge: How to simultaneously achieve efficiency, decouple object generation from layout estimation, and handle panoramic distortion.

Goal: (a) Replace time-consuming iterative optimization with feed-forward inference; (b) Decouple object generation from layout estimation; (c) Address panoramic distortion via perspective reprojection preprocessing.

Key Insight: Reformulate the object-to-world coordinate transformation from the challenging 3D space to the more robust 2D image space, exploiting correspondences between multi-view renderings and target crop images.

Core Idea: Use Alignment-VGGT to predict, in a single feed-forward pass, the rotation \(\mathbf{R}\), translation \(\mathbf{t}\), and anisotropic scaling \(\mathbf{S}\) that map each 3D object from its local coordinate frame to the world coordinate frame.
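
Written out, the predicted transform maps each object-frame point \(\mathbf{p}\) to world coordinates as

\[
\mathbf{T}_i(\mathbf{p}) = \mathbf{R}\,\mathbf{S}\,\mathbf{p} + \mathbf{t}, \qquad \mathbf{S} = \mathrm{diag}(s_x, s_y, s_z),
\]

where the scale-before-rotation composition order is an assumption made here for concreteness rather than a detail confirmed by the paper.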

Method

Overall Architecture

Given an equirectangular panoramic image \(\mathbf{I} \in \mathbb{R}^{H \times W \times 3}\), the framework produces a compositional 3D scene through four stages:

  1. Preprocessing: Object detection and perspective reprojection for distortion removal
  2. Object Generation & Alignment: 3D object generation + Object-World Transformation Predictor
  3. Background Modeling: Inpainted panorama → 3DGS background
  4. Composition: Fusion of all aligned objects and background
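
A minimal sketch of this four-stage flow, written as a skeleton that takes each component as a callable; every name and signature here is a placeholder for illustration, not the authors' released API:

```python
from typing import Any, Callable


def compose_scene(
    panorama: Any,
    detect: Callable,            # panorama -> detections (mask, theta, phi, fov)
    to_perspective: Callable,    # (panorama, detection) -> undistorted crop
    generate_3d: Callable,       # crop -> 3D object in local coordinates
    align: Callable,             # (object, crop) -> (R, t, S), the Alignment-VGGT step
    build_background: Callable,  # (panorama, detections) -> 3DGS background
    place: Callable,             # (object, transform) -> object in world coordinates
    fuse: Callable,              # (background, objects) -> composed scene
) -> Any:
    """Hypothetical skeleton of the four-stage pipeline described above."""
    detections = detect(panorama)                                  # stage 1: detection
    crops = [to_perspective(panorama, d) for d in detections]      # stage 1: undistort
    objects = [generate_3d(c) for c in crops]                      # stage 2: generation
    transforms = [align(o, c) for o, c in zip(objects, crops)]     # stage 2: alignment
    background = build_background(panorama, detections)            # stage 3: background
    placed = [place(o, T) for o, T in zip(objects, transforms)]    # stage 4: placement
    return fuse(background, placed)                                # stage 4: fusion
```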

Key Designs

  1. Preprocessing Module — Panoramic Distortion Removal

    • Function: Projects detected objects from the panorama into undistorted perspective crop images.
    • Mechanism: For each object, SAM extracts mask \(\mathbf{M}_i\); given its spherical longitude-latitude \((\theta_i, \phi_i)\) and field-of-view angle \(\alpha_i\), a perspective projection operator \(\Pi_{\text{persp}}\) yields an undistorted crop: \(\mathbf{I}_i^{\text{crop}} = \Pi_{\text{persp}}(\mathbf{I} \odot \mathbf{M}_i; \theta_i, \phi_i, \alpha_i)\) (a minimal projection sketch follows this list)
    • Design Motivation: The distortion inherent in equirectangular projection prevents off-the-shelf image-to-3D models from operating directly on panoramas; perspective reprojection enables the use of any existing 3D generator.
  2. Object-World Transformation Predictor (Alignment-VGGT)

    • Function: Predicts the transformation parameters — rotation \(\mathbf{R}\), translation \(\mathbf{t}\), and anisotropic scaling \(\mathbf{S}\) — that map a generated 3D object from its local coordinate frame to the world frame.
    • Mechanism: The VGGT architecture is adapted to accept the target crop \(\mathbf{I}_i^{\text{crop}}\) (as the first frame in the sequence) alongside multi-view renderings of the generated object \(\{\mathbf{I}_{i,v}^{\text{gen}}\}_{v=1}^V\), with known camera parameters provided to resolve intrinsic/extrinsic ambiguity. A scaling head is added alongside the existing camera head to output anisotropic scaling factors \(\hat{\mathbf{S}} = \text{diag}(\hat{s}_x, \hat{s}_y, \hat{s}_z)\).
    • The unknown local extrinsics \(\mathbf{E}_0^{\text{obj}}\) are derived via relative pose chaining and combined with world-frame extrinsics to obtain the non-rigid transformation \(\mathbf{T}_i\).
    • Design Motivation: Direct alignment in 3D space relies on monocular panoramic depth estimation, which is inaccurate. Shifting to 2D space and exploiting correspondences between multi-view renderings and crop images yields greater robustness.
  3. Pseudo-Geometry Distillation (PGD) Supervision

    • Function: Resolves supervision signal mismatch caused by shape discrepancies between generated and ground-truth objects.
    • Mechanism: For each generated object, a differentiable optimizer is run offline (bidirectional Chamfer loss, or unidirectional Chamfer + mask loss) to obtain pseudo-GT transformation parameters \((\mathbf{R}^\star, \mathbf{t}^\star, \mathbf{S}^\star)\), which supervise the network predictions via an L1 loss.
    • Training Loss: \(\mathcal{L} = \lambda_{\text{CD}}\mathcal{L}_{\text{CD}} + \lambda_{\text{PGD}}\mathcal{L}_{\text{PGD}} + \lambda_{\text{MASK}}\mathcal{L}_{\text{MASK}}\)
    • Design Motivation: GT mesh pose annotations correspond to GT geometry, not generated geometry; directly supervising with GT poses produces a misaligned training signal.
  4. Coarse-to-Fine (C2F) Alignment Mechanism

    • Function: Iteratively refines object poses at inference time for out-of-distribution inputs.
    • Mechanism: A separate C2F Refiner based on Alignment-VGGT is trained. At each step, the object is rendered under the current pose and compared with the target crop; the refiner predicts a relative pose update \(\Delta\mathbf{T}^{(k)}\), updating only rotation and translation while fixing scale. Convergence is monitored via Chamfer distance: the process terminates when \(\mathcal{L}_{\text{CD}}^{(k)} - \mathcal{L}_{\text{CD}}^{(k+1)} < \tau\).
    • Design Motivation: The feed-forward predictor may be insufficiently accurate on out-of-distribution data; rendering-feedback iteration progressively corrects errors without requiring gradient-based optimization.
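
To make the preprocessing operator \(\Pi_{\text{persp}}\) concrete, here is a minimal NumPy sketch of equirectangular-to-perspective resampling; the nearest-neighbor sampling and the axis/sign conventions are assumptions of this sketch, not the paper's implementation:

```python
import numpy as np


def equirect_to_perspective(pano, theta, phi, fov, out_size=512):
    """Sample an undistorted perspective crop from an equirectangular panorama.

    pano        : (H, W, 3) equirectangular image
    theta, phi  : viewing direction (longitude, latitude) in radians
    fov         : horizontal field of view in radians
    """
    H, W = pano.shape[:2]
    f = 0.5 * out_size / np.tan(0.5 * fov)          # pinhole focal length in pixels

    # Ray directions in the camera frame (z forward, x right, y down).
    u, v = np.meshgrid(np.arange(out_size), np.arange(out_size))
    x = (u - 0.5 * out_size) / f
    y = (v - 0.5 * out_size) / f
    dirs = np.stack([x, y, np.ones_like(x)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate rays so the camera looks toward (theta, phi) on the sphere.
    yaw = np.array([[np.cos(theta), 0, np.sin(theta)],
                    [0, 1, 0],
                    [-np.sin(theta), 0, np.cos(theta)]])
    pitch = np.array([[1, 0, 0],
                      [0, np.cos(phi), -np.sin(phi)],
                      [0, np.sin(phi), np.cos(phi)]])
    dirs = dirs @ (yaw @ pitch).T

    # Convert each ray back to panorama pixel coordinates via longitude/latitude.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))
    px = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    py = ((lat / np.pi + 0.5) * H).astype(int).clip(0, H - 1)
    return pano[py, px]
```

In the pipeline itself, the panorama is first multiplied by the instance mask \(\mathbf{M}_i\) (as in the formula of design 1) so that only the detected object appears in the crop.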

Loss & Training

  • Chamfer loss \(\mathcal{L}_{\text{CD}}\): Bidirectional when GT mesh is available; otherwise unidirectional with depth back-projected point clouds.
  • PGD loss \(\mathcal{L}_{\text{PGD}}\): L1 regression on quaternion rotation, translation, and scale (a pseudo-GT fitting sketch follows this list).
  • Mask loss \(\mathcal{L}_{\text{MASK}}\): MSE + IoU between rendered mask and instance mask.
  • The DINOv2 backbone and VGGT frame-attention layers are frozen; learning rate is \(1 \times 10^{-4}\); training takes approximately 2 days on a single RTX 4090.
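
For design 3, the offline pseudo-GT fitting can be sketched as a small gradient-based optimization of \((\mathbf{R}, \mathbf{t}, \mathbf{S})\) under a bidirectional Chamfer objective; the axis-angle rotation parameterization, log-scale variables, and optimizer settings are illustrative assumptions, not the paper's exact recipe:

```python
import torch


def chamfer(a, b):
    """Bidirectional Chamfer distance between point clouds a (N, 3) and b (M, 3)."""
    d = torch.cdist(a, b)                             # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()


def axis_angle_to_matrix(w):
    """Rodrigues' formula: axis-angle vector w (3,) -> rotation matrix (3, 3)."""
    angle = w.norm() + 1e-8
    k = w / angle
    zero = torch.zeros((), dtype=w.dtype)
    K = torch.stack([torch.stack([zero, -k[2], k[1]]),
                     torch.stack([k[2], zero, -k[0]]),
                     torch.stack([-k[1], k[0], zero])])
    return torch.eye(3, dtype=w.dtype) + torch.sin(angle) * K \
        + (1 - torch.cos(angle)) * (K @ K)


def fit_pseudo_gt(obj_pts, target_pts, steps=500, lr=1e-2):
    """Fit pseudo-GT (R*, t*, S*) aligning a generated object's points (N, 3)
    to the reference geometry (GT mesh samples or depth back-projected points)."""
    w = (torch.randn(3) * 1e-3).requires_grad_()      # axis-angle rotation (small init)
    t = torch.zeros(3, requires_grad=True)            # translation
    log_s = torch.zeros(3, requires_grad=True)        # log anisotropic scale
    opt = torch.optim.Adam([w, t, log_s], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        R = axis_angle_to_matrix(w)
        S = torch.diag(log_s.exp())
        aligned = obj_pts @ (R @ S).T + t             # scale, rotate, then translate
        chamfer(aligned, target_pts).backward()
        opt.step()

    return axis_angle_to_matrix(w).detach(), t.detach(), log_s.exp().detach()
```

The fitted \((\mathbf{R}^\star, \mathbf{t}^\star, \mathbf{S}^\star)\) then serve as regression targets for the L1 PGD loss above; when no GT mesh is available, target_pts would come from depth back-projection and the Chamfer term becomes unidirectional, as noted in the loss list.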

Key Experimental Results

Main Results

| Method | CD-S↓ | CD-O↓ | F-Score-S↑ | F-Score-O↑ | IoU-B↑ | Training Cost | Inference Time |
|---|---|---|---|---|---|---|---|
| OPT (differentiable optimization) | 0.1059 | 0.1128 | 0.5535 | 0.5640 | 0.4010 | – | 120s |
| ICP | 0.2483 | 0.2305 | 0.4524 | 0.4896 | 0.2830 | – | 1s |
| DeepPanoContext | 0.7851 | 0.1657 | 0.3101 | 0.3822 | 0.0021 | – | 14s |
| SceneGen | 0.1765 | 0.0914 | 0.4575 | 0.4827 | 0.1124 | 56 GPU days | 63s |
| Pano3DComposer | 0.0787 | 0.0765 | 0.6923 | 0.6926 | 0.5679 | 2 GPU days | 20s |
| Pano3DComposer-C2F | 0.0784 | 0.0762 | 0.6930 | 0.6937 | 0.5699 | 4 GPU days | 24s |

Ablation Study

| Configuration | CD-S↓ | CD-O↓ | F-Score-S↑ | F-Score-O↑ | IoU-B↑ |
|---|---|---|---|---|---|
| \(\mathcal{L}_{\text{CD}}\) only | 0.8688 | 0.9027 | 0.1980 | 0.1888 | 0.0906 |
| + \(\mathcal{L}_{\text{PGD}}\) | 0.1266 | 0.1219 | 0.5675 | 0.5670 | 0.4670 |
| + \(\mathcal{L}_{\text{MASK}}\) | 0.1120 | 0.1063 | 0.5788 | 0.5850 | 0.4818 |
| w/o camera information | 0.1850 | 0.1705 | 0.4673 | 0.4691 | 0.3830 |

Key Findings

  • Training with Chamfer loss alone yields very poor results (CD-S 0.87); adding the pseudo-geometry distillation (PGD) loss improves CD-S substantially, to 0.13.
  • Removing camera parameter inputs causes a significant performance drop, validating the importance of camera priors.
  • Compared to SceneGen, training cost is reduced by 28× (2 vs. 56 GPU days) and inference is 3× faster (20s vs. 63s).
  • The C2F mechanism adds only 4 seconds to inference time while achieving substantially better generalization on real-world scenes.

Highlights & Insights

  • The pseudo-geometry supervision strategy is particularly elegant: generated objects inevitably differ in shape from GT objects, so directly supervising with GT poses misleads the network. Using an offline differentiable optimizer to produce object-specific pseudo-GT parameters elegantly resolves the shape mismatch while providing high-quality supervision for the feed-forward predictor. This paradigm is transferable to any generate-then-align task.
  • Shifting alignment from 3D to 2D: avoiding unreliable monocular panoramic depth estimation and instead exploiting multi-view rendering correspondences in 2D image space is a practical and effective design choice.
  • Modularity and flexibility: the 3D generator can be swapped at any time (e.g., TRELLIS, Amodal3R) without requiring joint retraining.

Limitations & Future Work

  • The pipeline depends on SAM segmentation quality; heavily occluded or small objects may fail to segment correctly.
  • Training and evaluation are currently limited to indoor scenes (3D-FRONT, Structured3D); generalization to outdoor environments remains unverified.
  • Each object requires independent 3D asset generation (~4s per object), causing total inference time to grow linearly with scene complexity.
  • The C2F mechanism still relies on depth estimation to construct reference point clouds, and inaccurate depth may limit the extent of refinement.

Comparison with Related Methods

  • vs. SceneGen: SceneGen jointly generates multiple instances end-to-end but requires substantial fine-tuning for panoramic inputs (56 GPU days); the decoupled design proposed here is more flexible and incurs 28× lower training cost.
  • vs. GALA3D / DreamScene: These methods optimize appearance via SDS (30–60 min per object) and rely on LLM-based layout planning that is prone to physical constraint violations; the proposed method directly infers layout from the panorama, making it more efficient and physically grounded.
  • vs. CAST: CAST also predicts alignment parameters but couples object generation and alignment, precluding plug-and-play replacement of the generator.

Rating

  • Novelty: ⭐⭐⭐⭐ Pseudo-geometry supervision and Alignment-VGGT are creative designs, though the overall framework is a modular assembly of existing components.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers both synthetic and real-world scenes with comprehensive ablations, but quantitative evaluation on more diverse real-world data is lacking.
  • Writing Quality: ⭐⭐⭐⭐ Method description is clear and mathematical derivations are complete.
  • Value: ⭐⭐⭐⭐ Offers an efficient and practical panoramic 3D scene generation solution with direct applicability to VR/AR.