PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction¶
Conference: CVPR 2026 | arXiv: 2603.05888 | Code: Project Page | Area: 3D Vision | Keywords: Single-view scene reconstruction, autoregressive mesh generation, native mesh, artist-ready, compositional 3D
TL;DR¶
This paper proposes PixARMesh, the first autoregressive framework for single-view scene reconstruction that operates natively in mesh space (rather than SDF space). By enhancing a point cloud encoder with pixel-aligned image features and global scene context, the method jointly predicts object poses and meshes within a unified token sequence. PixARMesh achieves scene-level state-of-the-art on 3D-FRONT while producing compact, editable, artist-ready meshes.
Background & Motivation¶
Background: Single-view 3D scene reconstruction is a long-standing ill-posed problem. The compositional generation paradigm has attracted growing attention, driven by advances in large-scale object-level reconstruction models (TRELLIS, CLAY, etc.).
Limitations of Prior Work:

- Holistic methods (Panoptic3D, Uni-3D) are constrained by voxel resolution and the limited expressiveness of feed-forward decoders.
- Compositional methods (Gen3DSR, DeepPriorAssembly) must inpaint occluded regions before generation and then estimate layout via optimization, which is prone to local optima.
- MIDI avoids layout optimization but still generates in normalized scene coordinates using SDF.
- All existing methods rely on SDF representations and require Marching Cubes for surface extraction, which produces over-triangulated, overly smooth, high-polygon meshes unsuitable for editing.
Key Challenge: Mesh generation models (MeshGPT, EdgeRunner, BPT) are limited to single-object outputs; no prior work has extended them to scene-level reconstruction.
Key Insight: Leverage a pretrained object-level autoregressive mesh generator (EdgeRunner/BPT), augment its point cloud encoder to incorporate appearance and global context, and jointly predict poses and meshes within a unified token sequence.
Core Idea: Jointly predict object poses (tokenized as bounding box corners) and native meshes (tokenized as vertices/faces) within a single autoregressive sequence, eliminating SDF extraction and post-hoc layout optimization.
Method¶
Overall Architecture¶
Input RGB image → depth estimation + instance segmentation + image feature extraction (all using off-the-shelf models) → depth unprojection to obtain global and per-object point clouds → pixel-aligned point cloud encoder fusing geometry and appearance → scene context aggregation → Transformer decoder autoregressively generating [pose tokens | mesh tokens].
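As a concrete reference for the unprojection step, here is a minimal NumPy sketch that lifts the estimated depth map into camera-space point clouds; the function name and the commented model calls are illustrative, not the paper's code:

```python
import numpy as np

def unproject_depth(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift an H x W depth map to an (H*W, 3) point cloud in camera space."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))          # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T                         # back-projected rays
    return rays * depth.reshape(-1, 1)                      # scale rays by depth

# Global scene cloud plus one cloud per segmented instance:
# depth = DepthPro(image); masks = GroundedSAM(image)       # off-the-shelf models
# scene_pts = unproject_depth(depth, K)
# object_pts = [scene_pts[m.reshape(-1)] for m in masks]
```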
Key Designs¶
- Pixel-Aligned Point Cloud Encoder
  - Function: Fuse image appearance features into the point cloud encoder.
  - Mechanism: For each 3D point \(p\) in the instance point cloud \(P_i\), project it onto the image plane via camera intrinsics \((u,v) = \text{Proj}(K, p)\), extract the DINOv2 feature \(\mathbf{f}_p^{\text{img}}\) at the corresponding pixel, and concatenate it with the geometric feature \(\mathbf{f}_p^{\text{pc}}\) before feeding into a Transformer fusion block. Learnable query vectors aggregate the fused features into a compact latent code \(\mathbf{z}_i\) (a minimal sketch follows this item).
  - Design Motivation: The original EdgeRunner/BPT point cloud encoders process only coordinates, ignoring rich appearance cues from images. In single-view scenes where objects are heavily occluded, appearance features are essential for inferring complete geometry.
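A minimal PyTorch sketch of this design; the feature dimensions, query count, bilinear sampling, and module layout are my assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_pixel_features(points, K, feat_map, H, W):
    """Project points (B, N, 3) through intrinsics K (3, 3) and bilinearly
    sample the DINOv2 feature map feat_map (B, C, h, w) at those pixels."""
    uv = points @ K.T                                   # homogeneous pixel coords
    uv = uv[..., :2] / uv[..., 2:3].clamp(min=1e-6)     # perspective divide
    grid = torch.stack([uv[..., 0] / (W - 1),           # normalize to [-1, 1]
                        uv[..., 1] / (H - 1)], dim=-1) * 2 - 1
    f = F.grid_sample(feat_map, grid.unsqueeze(2), align_corners=True)
    return f.squeeze(-1).transpose(1, 2)                # (B, N, C)

class PixelAlignedFusion(nn.Module):
    """Concatenate geometric and pixel-aligned features, then pool them into
    a compact latent z_i with learnable queries (all dims are assumptions)."""
    def __init__(self, d_geo=512, d_img=768, d_model=512, n_queries=256):
        super().__init__()
        self.proj = nn.Linear(d_geo + d_img, d_model)
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.attn = nn.MultiheadAttention(d_model, 8, batch_first=True)

    def forward(self, f_pc, f_img):                     # (B, N, d_geo), (B, N, d_img)
        fused = self.proj(torch.cat([f_pc, f_img], dim=-1))
        q = self.queries.expand(fused.size(0), -1, -1)
        z_i, _ = self.attn(q, fused, fused)             # queries attend to fused points
        return z_i                                      # (B, n_queries, d_model)
```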
- Scene Context Aggregation
  - Function: Inject global scene context into each object's representation.
  - Mechanism: All point clouds are first normalized in a unified scene coordinate system (rather than independently per object) to preserve spatial consistency. A global scene point cloud is encoded to produce \(\mathbf{z}_{\text{scene}}\), and each object latent code aggregates scene information via cross-attention: \(\mathbf{z}_i^{\text{agg}} = \text{CrossAttn}(q=\mathbf{z}_i, k=\mathbf{z}_{\text{scene}}, v=\mathbf{z}_{\text{scene}})\) (a minimal sketch follows this item).
  - Design Motivation: Local point cloud information from individual objects is insufficient for inferring complete geometry and precise poses. Contextual cues from nearby similar objects provide complementary information, especially under heavy occlusion.
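A minimal sketch of the aggregation step; the q/k/v assignment follows the formula above, while the residual connection and LayerNorm are assumptions beyond what the paper specifies:

```python
import torch
import torch.nn as nn

class SceneContextAggregation(nn.Module):
    """Cross-attention that lets each object latent z_i attend to the
    global scene latent z_scene (dimensions are assumptions)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, z_obj, z_scene):                # (B, Nq, d), (B, Ns, d)
        ctx, _ = self.cross(z_obj, z_scene, z_scene)  # q=object, k/v=scene
        return self.norm(z_obj + ctx)                 # aggregated latent z_i^agg
```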
- Unified Pose–Mesh Tokenization
  - Function: Encode both object poses and meshes into a unified token sequence using the same vocabulary.
  - Mechanism: Poses are represented as gravity-aligned 7-DoF bounding boxes, encoded as the 3D coordinates of their 8 corner points. The mesh generator's coordinate vocabulary is reused (EdgeRunner: 3 tokens per point, `<x><y><z>`, 24 tokens total; BPT: 2 tokens per point, `<block_id><offset_id>`, 16 tokens total). At inference, the local-to-global affine transformation \(\mathbf{T}^\star\) is recovered from the 8 corner points to map the canonical-space mesh back to scene coordinates (see the sketch after this list).
  - Final sequence format: `<bos> [pose_seq] <sep> [mesh_seq] <eos>`
  - Design Motivation: Avoids introducing new vocabulary types, enabling full vocabulary sharing. Pose sequences add only 16–24 tokens, negligible compared to the mesh sequence.
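The two halves of this tokenization can be sketched as follows: EdgeRunner-style quantization of the 8 corners into the mesh coordinate vocabulary, and least-squares recovery of \(\mathbf{T}^\star\) at inference. The bin count, the \([-1, 1]\) coordinate range, and the function names are assumptions:

```python
import numpy as np

def corners_to_pose_tokens(corners, n_bins=512):
    """Quantize the 8 box corners (8, 3) with the mesh coordinate vocabulary:
    one <x><y><z> triplet per corner (EdgeRunner-style), 24 tokens total."""
    q = np.clip((corners + 1) / 2 * (n_bins - 1), 0, n_bins - 1)
    return q.round().astype(int).reshape(-1)                      # (24,)

def recover_affine(pred_corners, canon_corners):
    """Least-squares fit of the local-to-global affine T* (4x4) mapping
    canonical-space corners to the predicted scene-space corners."""
    X = np.concatenate([canon_corners, np.ones((8, 1))], axis=1)  # (8, 4)
    A, *_ = np.linalg.lstsq(X, pred_corners, rcond=None)          # (4, 3)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = A[:3].T, A[3]
    return T                                                      # apply as T @ [p; 1]
```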
Loss & Training¶
- A single next-token prediction cross-entropy loss: \(\mathcal{L}_{\text{ce}} = -\sum_t \log p_\theta(s_t | s_{<t}, \mathbf{z}_{\text{agg}})\)
- Depth jitter (±0.02) is applied during training to simulate monocular depth inaccuracies.
- Trained on 8×H100 GPUs: approximately 2 days for the EdgeRunner variant and 18 hours for the BPT variant.
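A compact sketch of the resulting training step, combining the depth jitter with the next-token objective; the `encoder`/`decoder` call signatures are assumed interfaces, while the jitter magnitude and the loss come from the paper:

```python
import torch
import torch.nn.functional as F

def training_step(encoder, decoder, tokens, image, depth, K, pad_id=-100):
    """Depth jitter followed by next-token cross-entropy."""
    depth = depth + (0.04 * torch.rand_like(depth) - 0.02)  # jitter in [-0.02, 0.02]
    z_agg = encoder(image, depth, K)              # pixel-aligned encoding + scene context
    logits = decoder(tokens[:, :-1], cond=z_agg)  # predict s_t from s_<t
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1),
                           ignore_index=pad_id)
```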
Key Experimental Results¶
Main Results (3D-FRONT Dataset)¶
| Method | Scene CD↓ (×10⁻³) | Scene CD-S↓ | Scene F-Score↑ | Object CD↓ | Object F-Score↑ |
|---|---|---|---|---|---|
| InstPIFu | 213.4 | 124.9 | 13.72% | 44.74 | 29.63% |
| MIDI | 156.3 | 79.3 | 24.83% | 6.71 | 72.69% |
| DepR | 153.2 | 56.4 | 25.00% | 2.57 | 89.66% |
| PixARMesh-ER | 98.8 | 49.1 | 33.55% | 4.04 | 82.27% |
| PixARMesh-BPT | 98.4 | 47.6 | 32.26% | 4.57 | 80.30% |
Ablation Study (Necessity of Joint Pose–Mesh Modeling)¶
| Configuration | Scene CD↓ | Scene F-Score↑ | Object CD↓ | Object F-Score↑ |
|---|---|---|---|---|
| EdgeRunner-FT (no layout) | 119.8 | 27.81% | 4.75 | 80.57% |
| Two-stage (separate models) | 99.8 | 33.32% | 4.75 | 80.85% |
| PixARMesh (joint) | 98.8 | 33.55% | 4.04 | 82.27% |
Ablation Study (Module Contributions, Using GT Inputs)¶
| Image Features | Scene Context | Scene CD↓ | Scene F-Score↑ | Object CD↓ |
|---|---|---|---|---|
| ✗ | ✗ | 57.78 | 41.02% | 5.29 |
| ✓ | ✗ | 55.44 | 42.84% | 5.56 |
| ✗ | ✓ | 39.30 | 44.67% | 3.64 |
| ✓ | ✓ | 39.88 | 46.15% | 4.04 |
Key Findings¶
- PixARMesh achieves scene-level SOTA across all metrics: scene CD drops from 153.2 (DepR) to 98.4 (PixARMesh-BPT), a 36% reduction, and scene F-Score improves from 25.0% to 33.6% (PixARMesh-ER).
- DepR retains an advantage at the object level (CD 2.57 vs. 4.04), as diffusion-based SDF generation yields higher geometric fidelity. However, PixARMesh outputs compact artist-ready meshes (only a few thousand faces per object), whereas SDF-based methods produce densely triangulated high-polygon meshes.
- Scene context aggregation is the most critical module: adding it reduces scene CD from 57.78 to 39.30, contributing far more than image features alone.
- Joint modeling outperforms the two-stage baseline: object CD improves from 4.75 to 4.04, demonstrating that geometry generation benefits from pose inference.
- The EdgeRunner variant outperforms the BPT variant at the object level (CD 4.04 vs. 4.57, F-Score 82.27% vs. 80.30%), as its higher quantization resolution preserves more geometric detail; the two are nearly tied on scene-level metrics.
Highlights & Insights¶
- First extension of autoregressive mesh generation to the scene level: breaks the assumption that mesh generation models are limited to single objects. A clean token sequence design enables unified decoding of poses and meshes without post-hoc layout optimization.
- Vocabulary-sharing pose tokenization is elegantly designed: bounding box corner coordinates reuse the mesh vocabulary, so no new token types are introduced and the pose sequence adds only 16–24 tokens per object. Layout is recovered at inference via least-squares affine fitting.
- Emergent benefit of joint modeling: pose prediction and mesh generation mutually reinforce each other—geometric information aids localization, and pose context aids geometry completion. This synergy is unattainable in two-stage pipelines.
- Output meshes are directly usable in graphics applications (editing, rendering, simulation), whereas Marching Cubes outputs from SDF-based methods require extensive post-processing.
Limitations & Future Work¶
- Object-level geometric accuracy falls short of diffusion-based SDF methods such as DepR; autoregressive mesh generation has an inherent disadvantage on fine surface details.
- Currently trained only on 3D-FRONT indoor furniture scenes with a limited object category set.
- Relies on Grounded-SAM for segmentation and Depth Pro for depth estimation; errors in upstream models cascade through the pipeline.
- Autoregressive decoding slows down as the number of objects increases, since sequence length grows linearly.
Related Work & Insights¶
- vs. DepR: DepR uses depth-guided diffusion to generate in SDF space, achieving finer object geometry (CD 2.57 vs. 4.04) but inferior scene layout (scene CD 153.2 vs. 98.8), and it requires Marching Cubes post-processing.
- vs. MIDI: MIDI directly generates SDF in normalized scene space, avoiding layout optimization, but still requires surface extraction and achieves lower scene-level accuracy than PixARMesh.
- vs. original EdgeRunner / BPT: these models support only single-object generation; PixARMesh extends them to the scene level by injecting pixel-aligned features and scene context.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First mesh-native scene reconstruction; the unified tokenization design for poses and meshes is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers both synthetic and real data with thorough ablations, but lacks comparison against more mesh generation baselines.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear writing with a complete and coherent logical chain from motivation to design to experiments.
- Value: ⭐⭐⭐⭐⭐ — Establishes a new paradigm for mesh-native scene reconstruction with significant implications for future work.