# Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image
- Conference: CVPR 2026
- arXiv: 2603.05908
- Code: Available (project page)
- Area: 3D Vision
- Keywords: Panoramic 3D reconstruction, compositional scene generation, feed-forward transformation prediction, VGGT, 3D Gaussian splatting
## TL;DR
This paper proposes Pano3DComposer, a modular feed-forward compositional 3D scene generation framework that takes a single panoramic image as input. A plug-and-play Object-World Transformation Predictor (Alignment-VGGT, a VGGT-based model) maps generated 3D objects from their local coordinate frames to world coordinates, producing high-fidelity 3D scenes in approximately 20 seconds on a single RTX 4090.
## Background & Motivation
Background: 3D scene generation is foundational for VR/AR and digital twins. Existing methods primarily rely on perspective images with limited field of view; panoramic images provide a complete 360° spatial context but introduce severe distortion.
Limitations of Prior Work:
- Feed-forward scene understanding methods (Total3D, InstPIFu) are constrained by insufficient 3D mesh supervision and limited generalization
- Feed-forward multi-instance generation models (MIDI, SceneGen) require expensive fine-tuning and tightly couple object generation with layout estimation
- Compositional optimization methods (GALA3D, LayoutYour3D) rely on time-consuming iterative optimization that does not meet efficiency requirements
- Panorama-specific methods (DeepPanoContext, PanoContext-Former) can only produce texture-free meshes
Key Challenge: How to simultaneously achieve efficiency, decouple object generation from layout estimation, and handle panoramic distortion.
Goal: (a) Replace time-consuming iterative optimization with feed-forward inference; (b) Decouple object generation from layout estimation; (c) Address panoramic distortion via perspective reprojection preprocessing.
Key Insight: Reformulate the object-to-world coordinate transformation from the challenging 3D space to the more robust 2D image space, exploiting correspondences between multi-view renderings and target crop images.
Core Idea: Use Alignment-VGGT to predict, in a single feed-forward pass, the rotation \(\mathbf{R}\), translation \(\mathbf{t}\), and anisotropic scaling \(\mathbf{S}\) that map each 3D object from its local coordinate frame to the world coordinate frame.
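Concretely (the composition order below is our reading, not spelled out in these notes), a point \(\mathbf{x}^{\text{obj}}\) in the object's local frame is mapped to the world frame by

\[
\mathbf{x}^{\text{world}} = \mathbf{R}\,\mathbf{S}\,\mathbf{x}^{\text{obj}} + \mathbf{t},
\qquad
\mathbf{S} = \text{diag}(s_x, s_y, s_z),
\]

i.e., a nine-parameter non-rigid transform: three degrees of freedom each for rotation, translation, and per-axis scale.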
## Method
### Overall Architecture
Given an equirectangular panoramic image \(\mathbf{I} \in \mathbb{R}^{H \times W \times 3}\), the framework produces a compositional 3D scene through four stages:

1. Preprocessing: Object detection and perspective reprojection for distortion removal
2. Object Generation & Alignment: 3D object generation + Object-World Transformation Predictor
3. Background Modeling: Inpainted panorama → 3DGS background
4. Composition: Fusion of all aligned objects and background
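A compact sketch of this four-stage decomposition, with every stage supplied as a hypothetical callable; it reflects the list above, not the authors' actual code:

```python
def pano3d_compose(pano, detect, reproject, generate_3d, predict_transform,
                   inpaint, fit_background, place):
    """Sketch of the four-stage pipeline; every stage is a caller-supplied callable."""
    # 1. Preprocessing: detect objects, reproject each into an undistorted crop.
    detections = detect(pano)                                  # masks + (theta, phi, alpha)
    crops = [reproject(pano * d.mask, d.theta, d.phi, d.alpha) for d in detections]

    # 2. Object generation & alignment: per-object 3D generation, then one
    #    feed-forward pass of the Object-World Transformation Predictor per object.
    objects = [generate_3d(c) for c in crops]                  # e.g. TRELLIS / Amodal3R
    transforms = [predict_transform(o, c) for o, c in zip(objects, crops)]  # R, t, S

    # 3. Background modelling: inpaint the panorama, then fit a 3DGS background.
    background = fit_background(inpaint(pano, detections))

    # 4. Composition: place every aligned object into the shared world frame.
    scene = background
    for obj, T in zip(objects, transforms):
        scene = place(scene, obj, T)
    return scene
```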
### Key Designs
- Preprocessing Module — Panoramic Distortion Removal
- Function: Projects detected objects from the panorama into undistorted perspective crop images.
- Mechanism: For each object, SAM extracts mask \(\mathbf{M}_i\); given its spherical longitude-latitude \((\theta_i, \phi_i)\) and field-of-view angle \(\alpha_i\), a perspective projection operator \(\Pi_{\text{persp}}\) yields an undistorted crop: \(\mathbf{I}_i^{\text{crop}} = \Pi_{\text{persp}}(\mathbf{I} \odot \mathbf{M}_i; \theta_i, \phi_i, \alpha_i)\) (see the reprojection sketch after this list)
- Design Motivation: The distortion inherent in equirectangular projection prevents off-the-shelf image-to-3D models from operating directly on panoramas; perspective reprojection enables the use of any existing 3D generator.
- Object-World Transformation Predictor (Alignment-VGGT)
- Function: Predicts the transformation parameters — rotation \(\mathbf{R}\), translation \(\mathbf{t}\), and anisotropic scaling \(\mathbf{S}\) — that map a generated 3D object from its local coordinate frame to the world frame.
- Mechanism: The VGGT architecture is adapted to accept the target crop \(\mathbf{I}_i^{\text{crop}}\) (as the first frame in the sequence) alongside multi-view renderings of the generated object \(\{\mathbf{I}_{i,v}^{\text{gen}}\}_{v=1}^V\), with known camera parameters provided to resolve intrinsic/extrinsic ambiguity. A scaling head is added alongside the existing camera head to output anisotropic scaling factors \(\hat{\mathbf{S}} = \text{diag}(\hat{s}_x, \hat{s}_y, \hat{s}_z)\).
- The unknown local extrinsics \(\mathbf{E}_0^{\text{obj}}\) are derived via relative pose chaining and combined with world-frame extrinsics to obtain the non-rigid transformation \(\mathbf{T}_i\).
- Design Motivation: Direct alignment in 3D space relies on monocular panoramic depth estimation, which is inaccurate. Shifting to 2D space and exploiting correspondences between multi-view renderings and crop images yields greater robustness.
- Pseudo-Geometry Supervision (PGD)
- Function: Resolves supervision signal mismatch caused by shape discrepancies between generated and ground-truth objects.
- Mechanism: For each generated object, a differentiable optimizer is run offline (bidirectional Chamfer loss, or unidirectional Chamfer + mask loss) to obtain pseudo-GT transformation parameters \((\mathbf{R}^\star, \mathbf{t}^\star, \mathbf{S}^\star)\), which supervise the network predictions via an L1 loss.
- Training Loss: \(\mathcal{L} = \lambda_{\text{CD}}\mathcal{L}_{\text{CD}} + \lambda_{\text{PGD}}\mathcal{L}_{\text{PGD}} + \lambda_{\text{MASK}}\mathcal{L}_{\text{MASK}}\)
- Design Motivation: GT mesh pose annotations correspond to GT geometry, not generated geometry; directly supervising with GT poses produces a misaligned training signal.
- Coarse-to-Fine (C2F) Alignment Mechanism
- Function: Iteratively refines object poses at inference time for out-of-distribution inputs.
- Mechanism: A separate C2F Refiner based on Alignment-VGGT is trained. At each step, the object is rendered under the current pose and compared with the target crop; the refiner predicts a relative pose update \(\Delta\mathbf{T}^{(k)}\), updating only rotation and translation while fixing scale. Convergence is monitored via Chamfer distance: the process terminates when \(\mathcal{L}_{\text{CD}}^{(k)} - \mathcal{L}_{\text{CD}}^{(k+1)} < \tau\) (see the refinement-loop sketch after this list).
- Design Motivation: The feed-forward predictor may be insufficiently accurate on out-of-distribution data; rendering-feedback iteration progressively corrects errors without requiring gradient-based optimization.
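A minimal sketch of the perspective reprojection used in the preprocessing module, assuming a standard equirectangular-to-pinhole resampling with nearest-neighbour lookup; the function name, the square output resolution, and the axis/sign conventions are illustrative choices, not taken from the paper:

```python
import numpy as np


def equirect_to_perspective(pano, theta, phi, fov, out_size=512):
    """Resample an undistorted pinhole crop from an equirectangular panorama.

    pano:  H x W x 3 equirectangular image
    theta: longitude of the crop centre (radians)
    phi:   latitude of the crop centre (radians)
    fov:   horizontal field of view of the virtual pinhole camera (radians)
    """
    H, W, _ = pano.shape
    f = 0.5 * out_size / np.tan(0.5 * fov)            # pinhole focal length in pixels

    # Pixel grid of the virtual perspective camera; rays in camera coordinates.
    u, v = np.meshgrid(np.arange(out_size), np.arange(out_size))
    x = (u - 0.5 * out_size) / f
    y = (v - 0.5 * out_size) / f
    z = np.ones_like(x)
    rays = np.stack([x, y, z], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    # Rotate the rays so the virtual camera looks at (theta, phi) on the sphere.
    R_yaw = np.array([[ np.cos(theta), 0, np.sin(theta)],
                      [ 0,             1, 0            ],
                      [-np.sin(theta), 0, np.cos(theta)]])
    R_pitch = np.array([[1, 0,            0           ],
                        [0, np.cos(phi), -np.sin(phi)],
                        [0, np.sin(phi),  np.cos(phi)]])
    rays = rays @ (R_yaw @ R_pitch).T

    # Convert the rotated rays back to spherical coordinates, then to pano pixels.
    lon = np.arctan2(rays[..., 0], rays[..., 2])       # [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1, 1))      # [-pi/2, pi/2]
    px = ((lon / np.pi + 1) * 0.5 * (W - 1)).astype(int)
    py = ((lat / (0.5 * np.pi) + 1) * 0.5 * (H - 1)).astype(int)

    return pano[py, px]                                # nearest-neighbour lookup
```

In the pipeline, this crop (with the SAM mask already applied) is what gets fed to the image-to-3D generator and to Alignment-VGGT as the target frame.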
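A minimal sketch of the coarse-to-fine refinement loop, assuming caller-supplied refiner, rendering, Chamfer, and composition helpers (all hypothetical names); only rotation and translation are updated while the scale stays fixed, and iteration stops once the Chamfer improvement falls below \(\tau\):

```python
def coarse_to_fine_align(obj_points, init_T, target_crop, refiner,
                         render_fn, chamfer_fn, compose_fn,
                         tau=1e-4, max_iters=10):
    """Iteratively refine an object's pose with rendering feedback (hypothetical API).

    init_T:      initial transform (R, t, S) from the feed-forward predictor
    refiner:     C2F Refiner (Alignment-VGGT variant) predicting a relative update
    render_fn:   renders the object under a candidate transform
    chamfer_fn:  maps a candidate transform to the Chamfer distance vs. the reference
    compose_fn:  applies a relative (R, t) update to a transform, keeping S fixed
    """
    T = init_T
    prev_cd = chamfer_fn(T)

    for _ in range(max_iters):
        delta = refiner(render_fn(obj_points, T), target_crop)  # relative update ΔT^(k)
        T_new = compose_fn(T, delta)            # update rotation/translation, keep scale

        cd = chamfer_fn(T_new)
        if prev_cd - cd < tau:                  # improvement below τ: converged
            break
        T, prev_cd = T_new, cd

    return T
```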
### Loss & Training
- Chamfer loss \(\mathcal{L}_{\text{CD}}\): Bidirectional when GT mesh is available; otherwise unidirectional with depth back-projected point clouds.
- PGD loss \(\mathcal{L}_{\text{PGD}}\): L1 regression on quaternion rotation, translation, and scale.
- Mask loss \(\mathcal{L}_{\text{MASK}}\): MSE + IoU between rendered mask and instance mask.
- The DINOv2 backbone and VGGT frame-attention layers are frozen; learning rate is \(1 \times 10^{-4}\); training takes approximately 2 days on a single RTX 4090.
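A PyTorch-style sketch of how the three terms might be combined, assuming caller-supplied render_mask_fn and chamfer_fn helpers and illustrative \(\lambda\) weights (the paper's actual values are not given in these notes); pytorch3d's quaternion_to_matrix is used only for convenience:

```python
import torch.nn.functional as F
from pytorch3d.transforms import quaternion_to_matrix  # real-part-first quaternions


def total_loss(pred, pseudo_gt, obj_points, ref_points, gt_mask,
               render_mask_fn, chamfer_fn,
               lambda_cd=1.0, lambda_pgd=1.0, lambda_mask=1.0):
    """Combine the Chamfer, pseudo-geometry (PGD), and mask terms.

    pred / pseudo_gt: dicts with 'quat' (rotation), 't' (translation), 's' (scale)
    obj_points:       generated object points in the local frame, (N, 3)
    ref_points:       GT mesh points, or depth back-projected points, (M, 3)
    render_mask_fn / chamfer_fn: caller-supplied helpers (hypothetical)
    """
    # Warp the generated object with the predicted parameters (scale, rotate, translate).
    R = quaternion_to_matrix(pred['quat'])                      # (3, 3)
    warped = (obj_points * pred['s']) @ R.T + pred['t']

    # Chamfer term: bidirectional when a GT mesh exists, else unidirectional.
    l_cd = chamfer_fn(warped, ref_points)

    # PGD term: L1 regression against the offline-optimised pseudo-GT parameters.
    l_pgd = (F.l1_loss(pred['quat'], pseudo_gt['quat'])
             + F.l1_loss(pred['t'], pseudo_gt['t'])
             + F.l1_loss(pred['s'], pseudo_gt['s']))

    # Mask term: MSE plus an IoU penalty between rendered and instance masks.
    rendered = render_mask_fn(warped)
    inter = (rendered * gt_mask).sum()
    union = rendered.sum() + gt_mask.sum() - inter
    l_mask = F.mse_loss(rendered, gt_mask) + (1.0 - inter / union.clamp(min=1e-6))

    return lambda_cd * l_cd + lambda_pgd * l_pgd + lambda_mask * l_mask
```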
## Key Experimental Results
### Main Results
| Method | CD-S↓ | CD-O↓ | F-Score-S↑ | F-Score-O↑ | IoU-B↑ | Training Cost | Inference Time |
|---|---|---|---|---|---|---|---|
| OPT (differentiable optimization) | 0.1059 | 0.1128 | 0.5535 | 0.5640 | 0.4010 | — | 120s |
| ICP | 0.2483 | 0.2305 | 0.4524 | 0.4896 | 0.2830 | — | 1s |
| DeepPanoContext | 0.7851 | 0.1657 | 0.3101 | 0.3822 | 0.0021 | — | 14s |
| SceneGen | 0.1765 | 0.0914 | 0.4575 | 0.4827 | 0.1124 | 56 GPU days | 63s |
| Pano3DComposer | 0.0787 | 0.0765 | 0.6923 | 0.6926 | 0.5679 | 2 GPU days | 20s |
| Pano3DComposer-C2F | 0.0784 | 0.0762 | 0.6930 | 0.6937 | 0.5699 | 4 GPU days | 24s |
### Ablation Study
| Configuration | CD-S↓ | CD-O↓ | F-Score-S↑ | F-Score-O↑ | IoU-B↑ |
|---|---|---|---|---|---|
| \(\mathcal{L}_{\text{CD}}\) only | 0.8688 | 0.9027 | 0.1980 | 0.1888 | 0.0906 |
| + \(\mathcal{L}_{\text{PGD}}\) | 0.1266 | 0.1219 | 0.5675 | 0.5670 | 0.4670 |
| + \(\mathcal{L}_{\text{MASK}}\) | 0.1120 | 0.1063 | 0.5788 | 0.5850 | 0.4818 |
| w/o camera information | 0.1850 | 0.1705 | 0.4673 | 0.4691 | 0.3830 |
### Key Findings
- Training with Chamfer loss alone yields very poor results (CD-S 0.87); adding the pseudo-geometry distillation (PGD) loss improves CD-S substantially to 0.13.
- Removing camera parameter inputs causes a significant performance drop, validating the importance of camera priors.
- Compared to SceneGen, training cost is reduced by 28× (2 vs. 56 GPU days) and inference is 3× faster (20s vs. 63s).
- The C2F mechanism adds only 4 seconds to inference time while achieving substantially better generalization on real-world scenes.
## Highlights & Insights
- The pseudo-geometry supervision strategy is particularly elegant: generated objects inevitably differ in shape from GT objects, so directly supervising with GT poses misleads the network. Using an offline differentiable optimizer to produce object-specific pseudo-GT parameters elegantly resolves the shape mismatch while providing high-quality supervision for the feed-forward predictor. This paradigm is transferable to any generate-then-align task.
- Shifting from 3D to 2D alignment: by avoiding unreliable monocular panoramic depth estimation and instead exploiting multi-view rendering correspondences in 2D image space, the method makes a practical and effective design choice.
- Modularity and flexibility: the 3D generator can be swapped at any time (e.g., TRELLIS, Amodal3R) without requiring joint retraining.
## Limitations & Future Work
- The pipeline depends on SAM segmentation quality; heavily occluded or small objects may fail to segment correctly.
- Training and evaluation are currently limited to indoor scenes (3D-FRONT, Structured3D); generalization to outdoor environments remains unverified.
- Each object requires independent 3D asset generation (~4s per object), causing total inference time to grow linearly with scene complexity.
- The C2F mechanism still relies on depth estimation to construct reference point clouds, and inaccurate depth may limit the extent of refinement.
## Related Work & Insights
- vs. SceneGen: SceneGen jointly generates multiple instances end-to-end but requires substantial fine-tuning for panoramic inputs (56 GPU days); the decoupled design proposed here is more flexible and incurs 28× lower training cost.
- vs. GALA3D / DreamScene: These methods optimize appearance via SDS (30–60 min per object) and rely on LLM-based layout planning that is prone to physical constraint violations; the proposed method directly infers layout from the panorama, making it more efficient and physically grounded.
- vs. CAST: CAST also predicts alignment parameters but couples object generation and alignment, precluding plug-and-play replacement of the generator.
## Rating
- Novelty: ⭐⭐⭐⭐ Pseudo-geometry supervision and Alignment-VGGT are creative designs, though the overall framework is a modular assembly of existing components.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers both synthetic and real-world scenes with comprehensive ablations, but quantitative evaluation on more diverse real-world data is lacking.
- Writing Quality: ⭐⭐⭐⭐ Method description is clear and mathematical derivations are complete.
- Value: ⭐⭐⭐⭐ Offers an efficient and practical panoramic 3D scene generation solution with direct applicability to VR/AR.