
NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction

Conference: ICLR 2026 | arXiv: 2603.04179 | Code: Project Page | Area: 3D Vision / Reconstruction | Keywords: non-pixel-aligned, amodal 3D reconstruction, scene tokens, flow-matching, complete point cloud

TL;DR

This paper proposes NOVA3R — a non-pixel-aligned amodal 3D reconstruction framework from pose-free images. It employs learnable scene tokens to aggregate global information across views and a flow-matching-based diffusion 3D decoder to generate complete point clouds (including occluded regions). The method addresses two fundamental limitations of pixel-aligned approaches — inability to reconstruct occluded surfaces and redundant geometry in overlapping regions — and outperforms prior SOTA on scene-level and object-level benchmarks including SCRREAM and GSO.

Background & Motivation

Background: DUSt3R pioneered the pixel-aligned feed-forward 3D reconstruction paradigm, where each pixel predicts a 3D point along its ray. Subsequent methods (MASt3R, CUT3R, VGGT) extend this to multi-view settings while remaining pixel-aligned. A separate line of work pursues latent 3D generation (TripoSG/TRELLIS), but is largely limited to object-level tasks and requires high-quality mesh supervision.

Limitations of Prior Work: Pixel-aligned methods suffer from two fundamental deficiencies: (1) they can only reconstruct visible surfaces, leaving occluded regions geometrically empty; (2) in multi-view overlapping regions, the same physical 3D point is independently predicted by multiple rays, producing physically implausible redundant overlapping point layers.

Key Challenge: In the real world, a scene consists of a fixed number of physical points regardless of the number of viewpoints. When a 3D point is observed from multiple images, the correct representation should contain only a single point rather than one per observation. The pixel-aligned paradigm fundamentally violates this physical fact.

Goal: (a) How to learn a global, view-independent scene representation from pose-free images? (b) How to decode it into a complete (visible + occluded) non-pixel-aligned point cloud? (c) How to supervise an unordered point set (L2 loss is not directly applicable to unordered points)?

Key Insight: The problem is decomposed into two stages — first, a 3D point cloud autoencoder is trained to compress complete point clouds into latent tokens and decode them back via flow-matching; second, an image encoder is trained to map images into the same latent space. This decoupled two-stage design avoids the instability of end-to-end training.

Core Idea: Replace per-ray pixel-aligned prediction with learnable scene tokens, and combine them with a flow-matching decoder to enable feed-forward reconstruction of complete, non-pixel-aligned 3D point clouds from pose-free images.

Method

Overall Architecture

Input: \(K\) pose-free images. Output: Complete 3D point cloud \(P \in \mathbb{R}^{N \times 3}\) (covering both visible and occluded regions).

The method consists of two training stages:

  • Stage 1 (3D Autoencoder): complete point cloud → \(M\) latent tokens → reconstruction via a flow-matching decoder. This stage learns a compact representation space for 3D point clouds.
  • Stage 2 (Image-to-Latent): input images + learnable scene tokens → image encoder (based on VGGT) → scene latent \(\hat{Z}\) → point cloud generated by the frozen Stage 1 decoder. At inference, only the Stage 2 pipeline (image encoder plus the frozen decoder) is used; the Stage 1 point-cloud encoder is discarded.
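
To make the two-stage data flow concrete, here is a minimal shape-level sketch; the module names (`PointCloudEncoder`, `FlowMatchingDecoder`, `SceneTokenImageEncoder`) and the channel width \(C\) are illustrative stand-ins, not the authors' code:

```python
import torch

# Shape-level sketch of the two-stage pipeline (illustrative only).
K, H, W = 2, 224, 224      # pose-free input views (image size assumed)
M, C = 768, 512            # scene/latent token count; channel width C is assumed
N = 4096                   # number of output points (assumed)

# Stage 1 (3D autoencoder): complete point cloud -> M latent tokens -> points.
P = torch.randn(N, 3)                       # complete GT point cloud, first-view frame
Z = torch.randn(M, C)                       # stands in for PointCloudEncoder(P)

# Stage 2 (image-to-latent): images + learnable scene tokens -> scene latent.
images = torch.randn(K, 3, H, W)
scene_tokens = torch.nn.Parameter(torch.randn(M, C))   # learnable queries
Z_hat = torch.randn(M, C)                   # stands in for SceneTokenImageEncoder(images, scene_tokens)

# Inference: the frozen flow-matching decoder denoises N Gaussian points into a
# complete, non-pixel-aligned point cloud conditioned on Z_hat.
x = torch.randn(N, 3)                       # N can be changed freely at test time
# P_hat = FlowMatchingDecoder.sample(x, Z_hat)
```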

Key Designs

  1. Definition and Construction of Complete Point Clouds:

    • Function: Defines the "complete point cloud" required for training supervision, encompassing both visible and occluded regions.
    • Mechanism: When a ground-truth mesh is available, points are sampled uniformly from it; otherwise, dense-view depth maps are back-projected and aggregated, deduplicated via voxel-grid filtering, cropped to the input view frustum, and FPS-sampled to \(N\) points for training (see the first sketch after this list).
    • Design Motivation: This resolves the supervision data problem for non-pixel-aligned methods — watertight meshes are not required; depth maps alone suffice to approximate complete point clouds. All points are defined in the first-view coordinate system, preserving view independence.
  2. Flow-Matching-Based 3D Latent Autoencoder (Stage 1):

    • Function: Compresses complete point clouds into \(M\) latent tokens and decodes them back.
    • Mechanism: The encoder applies FPS to sample \(M\) query points from \(P\), concatenates them with learnable tokens, and processes them through cross- and self-attention to obtain \(Z \in \mathbb{R}^{M \times C}\). The decoder is a diffusion model: given \(N\) noisy points \(x_t\), latent \(Z\), and timestep \(t\), it predicts the velocity field. Training loss: \(\mathcal{L}_{flow}^{AE} = \mathbb{E}[\|\Phi_{dec}(x_t, Z, t) - (\epsilon - x_0)\|_2^2]\)
    • Design Motivation: Traditional 3D VAEs decode via occupancy/SDF and require a canonical space and watertight meshes — conditions that scene-level data cannot satisfy. Direct coordinate prediction is infeasible for unordered point clouds due to L2 loss incompatibility. Flow-matching elegantly resolves the unordered matching problem: the decoder learns a deterministic ODE trajectory from noise to the target point cloud without requiring point-to-point correspondence.
    • Joint Decoder Architecture: Self-attention layers are inserted between cross-attention layers, enabling inter-point spatial information exchange and yielding higher precision than an independent decoder (validated in Table 5).
  3. Learnable Scene Token Image Encoding (Stage 2):

    • Function: Extracts a global scene representation \(\hat{Z} \in \mathbb{R}^{M \times C}\) from \(K\) pose-free images.
    • Mechanism: \(M\) learnable scene tokens \(t_S\) are introduced alongside the image tokens of VGGT. All tokens are processed jointly through alternating frame-level and global-level self-attention. Scene tokens are treated as a global frame in the first-view coordinate system and share the first-view camera token.
    • Design Motivation: In pixel-aligned methods, the token count is \(K \times H \times W\), growing linearly with the number of views and tied to pixels. The scene token count is fixed at \(M\) regardless of the number of input views, naturally avoiding redundancy in overlapping regions and supporting an arbitrary number of inputs (see the second sketch after this list).
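
For design (1), the depth-based construction of complete point-cloud supervision can be sketched as follows. This is a minimal NumPy sketch under assumed pinhole intrinsics; the function names and voxel size are illustrative, and frustum cropping is omitted for brevity:

```python
import numpy as np

def backproject(depth, intr, cam_to_ref):
    """Lift a depth map (H, W) into 3D points expressed in the reference (first-view) frame."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    uv1 = np.stack([u.reshape(-1), v.reshape(-1), np.ones(H * W)], axis=0)
    pts_cam = (np.linalg.inv(intr) @ uv1) * z         # (3, H*W) camera-frame points
    pts_cam = pts_cam[:, z > 0]                       # drop invalid depths
    pts_hom = np.vstack([pts_cam, np.ones((1, pts_cam.shape[1]))])
    return (cam_to_ref @ pts_hom)[:3].T               # (n, 3) in the reference frame

def voxel_dedup(points, voxel=0.02):
    """Keep one point per occupied voxel, removing redundant overlapping layers."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return points[idx]

def farthest_point_sample(points, n):
    """Greedy FPS down to n points (quadratic, but fine for a sketch)."""
    chosen = [0]
    dists = np.linalg.norm(points - points[0], axis=1)
    for _ in range(n - 1):
        chosen.append(int(dists.argmax()))
        dists = np.minimum(dists, np.linalg.norm(points - points[chosen[-1]], axis=1))
    return points[chosen]

# Aggregate dense-view depth maps into one complete cloud, deduplicate, and sample
# the supervision target. `depths`, `intrinsics`, `cam_to_ref_poses` are assumed inputs.
# all_pts = np.concatenate([backproject(d, K_i, T_i)
#                           for d, K_i, T_i in zip(depths, intrinsics, cam_to_ref_poses)])
# target = farthest_point_sample(voxel_dedup(all_pts), n=4096)
```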
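
For design (3), a conceptual stand-in for the alternating frame-level / global attention over image tokens and \(M\) learnable scene tokens might look like the following. Plain PyTorch encoder layers replace the VGGT backbone, and all module and argument names are assumptions:

```python
import torch
import torch.nn as nn

class SceneTokenEncoderSketch(nn.Module):
    """Conceptual stand-in: M learnable scene tokens are processed jointly with
    per-view image tokens through alternating frame-level and global self-attention."""

    def __init__(self, m_tokens=768, dim=512, depth=4, heads=8):
        super().__init__()
        self.scene_tokens = nn.Parameter(torch.randn(m_tokens, dim) * 0.02)
        self.frame_blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in range(depth)])
        self.global_blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in range(depth)])

    def forward(self, image_tokens):                  # image_tokens: (K, T, C) patch tokens per view
        K, T, C = image_tokens.shape
        z = self.scene_tokens.unsqueeze(0)            # (1, M, C), anchored to the first-view frame
        for frame_blk, global_blk in zip(self.frame_blocks, self.global_blocks):
            image_tokens = frame_blk(image_tokens)    # frame-level: attention within each view
            joint = torch.cat([z, image_tokens.reshape(1, K * T, C)], dim=1)
            joint = global_blk(joint)                 # global-level: scene + all image tokens jointly
            m = z.shape[1]
            z, image_tokens = joint[:, :m], joint[:, m:].reshape(K, T, C)
        return z.squeeze(0)                           # (M, C) scene latent

# Usage: patch tokens from K views (e.g. a ViT backbone) in, scene latent out.
# z_hat = SceneTokenEncoderSketch()(torch.randn(2, 196, 512))
```

The resulting latent would then condition the frozen Stage 1 decoder exactly as \(Z\) does in Stage 1.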

Loss & Training

  • Stage 1: End-to-end training of the autoencoder with flow-matching loss; 50 epochs.
  • Stage 2: Decoder frozen; only the image Transformer and scene tokens are trained with the same flow-matching loss; 50 epochs.
  • No KL loss or additional regularization is used. Optimizer: AdamW with lr = 3e-4, trained on 4× A40 GPUs with batch size 32.
  • The image encoder is initialized from VGGT pretrained weights (16 layers instead of 24).
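
The flow-matching objective shared by both stages reduces to a single training step. Below is a minimal sketch under the linear-interpolation path \(x_t = (1-t)\,x_0 + t\,\epsilon\), which is consistent with the velocity target \(\epsilon - x_0\) in the loss above; the decoder signature is a placeholder:

```python
import torch

def flow_matching_step(decoder, optimizer, x0, z):
    """One flow-matching training step.
    x0: (N, 3) complete target point cloud; z: (M, C) conditioning latent.
    `decoder(x_t, z, t)` is assumed to predict the velocity field."""
    eps = torch.randn_like(x0)                 # Gaussian noise endpoint
    t = torch.rand(1, device=x0.device)        # timestep in [0, 1]
    x_t = (1.0 - t) * x0 + t * eps             # point on the linear interpolation path
    v_target = eps - x0                        # path velocity; no point correspondence needed
    v_pred = decoder(x_t, z, t)
    loss = ((v_pred - v_target) ** 2).mean()   # L_flow
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In Stage 1 this step updates both the point-cloud encoder and the decoder; in Stage 2 the decoder is frozen and the same loss presumably backpropagates through it into the image Transformer and scene tokens that produce \(z = \hat{Z}\).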

Key Experimental Results

Main Results: Scene Completion (SCRREAM Dataset)

Complete-scene evaluation with K = 1 and K = 2 input views (CD: lower is better; FS: higher is better):

| Method | Type | CD (K=1) | FS@0.1 (K=1) | FS@0.05 (K=1) | CD (K=2) | FS@0.1 (K=2) |
|--------|------|----------|--------------|---------------|----------|--------------|
| DUSt3R | Multi-view | 0.086 | 0.757 | 0.565 | 0.061 | 0.833 |
| CUT3R | Multi-view | 0.091 | 0.753 | 0.543 | 0.092 | 0.739 |
| VGGT | Multi-view | 0.070 | 0.810 | 0.657 | 0.065 | 0.821 |
| LaRI | Single-view | 0.059 | 0.825 | 0.590 | - | - |
| NOVA3R | Multi-view | 0.048 | 0.882 | 0.687 | 0.053 | 0.862 |
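
For reference, the table's metrics can be computed as below: a brute-force sketch of symmetric Chamfer Distance and F-score at threshold \(\tau\) (conventions for averaging vs. summing the two CD terms vary; this is not the authors' evaluation code):

```python
import torch

def chamfer_and_fscore(pred, gt, tau=0.1):
    """Symmetric Chamfer Distance and F-score at threshold tau.
    pred: (N, 3) predicted points; gt: (M, 3) ground-truth points."""
    d = torch.cdist(pred, gt)                  # (N, M) pairwise Euclidean distances
    d_pred = d.min(dim=1).values               # distance from each predicted point to GT
    d_gt = d.min(dim=0).values                 # distance from each GT point to the prediction
    cd = d_pred.mean() + d_gt.mean()           # symmetric Chamfer Distance
    precision = (d_pred < tau).float().mean()  # fraction of predicted points near GT
    recall = (d_gt < tau).float().mean()       # fraction of GT points covered
    fscore = 2 * precision * recall / (precision + recall + 1e-8)
    return cd.item(), fscore.item()

# FS@0.1 and FS@0.05 differ only in tau:
# cd, fs = chamfer_and_fscore(pred, gt, tau=0.05)
```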

Ablation Study (SCRREAM Complete K=1)

| Configuration | CD | FS@0.05 | FS@0.02 | Notes |
|---------------|----|---------|---------|-------|
| Point query only | 0.011 | 0.991 | 0.894 | FPS points as query |
| Learnable only | 0.013 | 0.981 | 0.841 | Learnable tokens as query |
| Hybrid (default) | 0.011 | 0.993 | 0.904 | Point + learnable concatenation is best |
| 256 tokens | 0.014 | 0.975 | 0.811 | Insufficient token count |
| 768 tokens (default) | 0.011 | 0.993 | 0.904 | More tokens yield better results |
| FM loss | 0.011 | 0.993 | 0.904 | High reconstruction quality |
| CD loss | 0.024 | 0.907 | 0.575 | Chamfer loss performs significantly worse |

Key Findings

  • FM vs. CD loss: FM improves FS@0.02 from 0.575 to 0.904, demonstrating its substantially superior ability to handle unordered point set matching compared to Chamfer Distance.
  • Hole Ratio: NOVA3R achieves 0.088 vs. VGGT 0.307 vs. DUSt3R 0.317, demonstrating that non-pixel-aligned methods significantly reduce surface holes.
  • Density Variance: NOVA3R achieves 5.127 vs. 7.105 for the best baseline, indicating a more uniform point cloud distribution.
  • Object-Level Generalization: On the GSO dataset, NOVA3R achieves CD 0.020 vs. TripoSG 0.025, showing the method generalizes beyond scene-level tasks.

Highlights & Insights

  • Paradigm Shift: Pixel-Aligned to Non-Pixel-Aligned. The transition from "predict one point per ray" to "learn a global scene representation and decode" represents a conceptual breakthrough in 3D reconstruction.
  • Two-Stage Decoupled Training is particularly elegant: the 3D autoencoder first establishes a latent space for point clouds (without any images), and the image encoder then learns to map into this space. Decoupling clarifies the objective of each stage and stabilizes training.
  • Flow-Matching for Unordered Point Set Matching: FM models the decoding process as a continuous ODE, naturally handling unordered sets — a technique transferable to other tasks requiring unordered set generation.
  • Variable-Resolution Inference: Since the model learns a point distribution rather than a per-pixel map, the number of query points can be adjusted at inference time to control the density of the output point cloud (see the sketch below).
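
A minimal Euler-integration sketch of that idea; the decoder call is a placeholder, and the 0.04 step size matches the value reported in the limitations below:

```python
import torch

@torch.no_grad()
def sample_points(decoder, z, num_points=4096, step=0.04):
    """Integrate the learned velocity field from noise (t = 1) to data (t = 0)
    with Euler steps; num_points can be chosen freely at inference time."""
    x = torch.randn(num_points, 3)              # start from Gaussian noise
    t = 1.0
    while t > 1e-8:
        dt = min(step, t)
        v = decoder(x, z, torch.full((1,), t))  # predicted velocity at time t
        x = x - dt * v                          # step toward the data end of the path
        t -= dt
    return x                                    # complete point cloud in the first-view frame
```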

Limitations & Future Work

  • Due to compute constraints, training is limited to \(K \leq 2\) views; generalization to large-scale scenes with more views remains to be validated.
  • \(M = 768\) scene tokens may be insufficient for complex large-scale scenes — an adaptive token count selection strategy is needed.
  • Only static scenes are supported; dynamic objects are not handled. The authors discuss possible extensions to 4D reconstruction.
  • The FM decoder requires multi-step denoising (step size 0.04); decoding takes 2.985 s, versus 0.557 s for the CD-loss variant.
Comparison with Prior Methods

  • vs. DUSt3R/VGGT: These representative pixel-aligned methods are simple and efficient, but they cannot complete occluded regions and are prone to redundant overlapping geometry. NOVA3R transcends pixel constraints via scene tokens to achieve complete reconstruction.
  • vs. TripoSG/TRELLIS: These perform latent-space 3D generation but are limited to object-level tasks and require a canonical space. NOVA3R requires no canonical space and handles scene-level inputs.
  • vs. LaRI: LaRI also performs amodal reconstruction but remains ray-conditional and requires explicit separation of visible and occluded points. NOVA3R makes no such distinction, generating all points in a unified manner.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Non-pixel-aligned feed-forward 3D reconstruction is a conceptually new paradigm; the scene token + FM decoder design is highly innovative.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive validation at scene level (SCRREAM/7-Scenes/NRGBD) and object level (GSO), with rich ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ — Problem formulation is clear; the pixel-aligned vs. non-pixel-aligned contrast is presented intuitively.
  • Value: ⭐⭐⭐⭐⭐ — Establishes a new paradigm for non-pixel-aligned reconstruction with significant implications for the 3D vision community.