GGPT: Geometry-Grounded Point Transformer

Conference: CVPR 2026
arXiv: 2603.11174
Code: Available
Area: 3D Vision / 3D Reconstruction
Keywords: sparse-view 3D reconstruction, Point Transformer, SfM, feed-forward, multi-view geometry

TL;DR

Proposes the GGPT framework: obtains geometrically consistent sparse point clouds via an improved lightweight SfM pipeline (dense matching + sparse BA + DLT triangulation), then utilizes 3D Point Transformer V3 to directly fuse sparse geometric guidance with feed-forward dense predictions in 3D space for residual refinement. Trained solely on ScanNet++, it significantly improves various feed-forward 3D reconstruction models across architectures and datasets.

Background & Motivation

Background: Feed-forward 3D reconstruction networks (DUSt3R → MASt3R → VGGT) predict dense point maps and camera parameters in a single pass, with high speed and good visual quality. However, the lack of explicit multi-view constraints leads to geometric inconsistency, particularly in out-of-distribution scenes (medical/surgical/human body).

Limitations of Prior Work: (1) SfM is geometrically consistent but fragile under wide-baseline/sparse-view conditions, and it recovers only sparse points; (2) previous methods for fusing geometric guidance depend on pseudo-GT SfM points or dense video sequences, which are unavailable in real sparse-view scenarios; (3) existing refinement methods operate in 2D image space (depth completion/image Transformers) and cannot achieve true cross-view consistency.

Key Challenge: Feed-forward predictions are dense but inconsistent, while SfM is consistent but sparse. Existing methods either rely on unrealistic GT guidance or refine in 2D image space, which cannot guarantee 3D consistency.

Goal: To combine the geometric accuracy of SfM with the dense completeness of feed-forward networks in 3D space, achieving sparse-view 3D reconstruction refinement that generalizes across architectures without fine-tuning.

Key Insight: Use a two-stage approach. First obtain geometrically consistent sparse geometry from the input RGB via an improved SfM, then perform attention and residual correction directly in point cloud space using a 3D Point Transformer.

Core Idea: Performing geometric fusion refinement in 3D space rather than 2D image space is the key to cross-domain generalization.

Method

Overall Architecture

GGPT reconciles two complementary approaches: feed-forward networks (dense but inconsistent) and traditional SfM (consistent but sparse and fragile). It uses a two-stage pipeline. The first stage is an improved lightweight SfM: it initializes cameras and points using a feed-forward model, obtains global correspondences via dense matchers (RoMa + UFM) with cycle-consistency filtering, then runs sparse BA (only 2048 points/view) to estimate cameras and DLT triangulation to produce a geometrically consistent sparse point cloud \(\mathbf{X}_s\). The second stage is GGPT: it encodes the offsets between feed-forward dense predictions \(\mathbf{X}_d\) and their corresponding sparse guidance points, then processes \(\mathbf{X}_d\) and \(\mathbf{X}_s\) jointly in a global 3D coordinate system using Point Transformer V3 (53M parameters) to predict a residual displacement \(\boldsymbol{\delta}\) and a confidence \(c\), outputting the refined dense reconstruction \(\hat{\mathbf{X}}_d\).

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
    IN["Sparse-view RGB"] --> FF["Feed-forward Net (VGGT, etc.)<br/>Dense Prediction X_d"]
    subgraph SFM["Improved SfM Pipeline"]
        direction TB
        A["VGGT Init Camera/Points"] --> B["Dense Matching RoMa+UFM<br/>+ Cycle-consistency Filter"]
        B -->|"High Threshold ε_BA Few Points"| C["Sparse BA for Camera"]
        B -->|"Low Threshold ε_DLT Denser Match"| D["DLT Triangulation"]
        C --> D
        D --> XS["Consistent Sparse Cloud X_s"]
    end
    IN --> SFM
    subgraph GGPT["GGPT Refinement (PTv3)"]
        direction TB
        ENC["Geo-guidance Encoding<br/>PE + Type Tag + Offset Δ"]
        ATT["PTv3 3D Attention<br/>3D Local Patch Self-attention"]
        ENC --> ATT
    end
    FF --> ENC
    XS --> ENC
    ATT --> OUT["Residual δ + Confidence c<br/>→ Refined Dense Recon X̂_d"]

Key Designs

1. Improved SfM Pipeline: Obtaining Geometrically Consistent Sparse Point Clouds from Sparse RGB

Traditional SfM is fragile under sparse views, while prior fusion methods rely on pseudo-GT or dense sequences. GGPT uses a feed-forward model (VGGT) for initialization, dense feature matching (RoMa+UFM) to obtain a correspondence tensor \(\mathbf{T} \in \mathbb{R}^{N \times N \times W \times H \times 2}\), and cycle-consistency filtering. It separates non-linear optimization and linear triangulation using two confidence thresholds \(\epsilon_{BA} > \epsilon_{DLT}\). Sparse BA uses very few high-confidence points to estimate cameras accurately, while DLT triangulation uses a denser subset for efficient linear reconstruction of 3D points.
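The two-threshold split and the linear triangulation step can be illustrated with a small NumPy sketch. This is not the authors' code: `split_by_confidence` and `triangulate_dlt` are hypothetical helper names, the threshold values are made up, and the triangulation is the textbook homogeneous DLT solved by SVD for a single point with known projection matrices.

```python
import numpy as np

def split_by_confidence(conf, eps_ba, eps_dlt):
    """Return (ba_mask, dlt_mask) for match confidences, with eps_ba > eps_dlt.

    The stricter mask keeps few points for bundle adjustment; the looser
    mask keeps a denser subset for triangulation.
    """
    assert eps_ba > eps_dlt
    return conf >= eps_ba, conf >= eps_dlt

def triangulate_dlt(projections, pixels):
    """Linear (DLT) triangulation of one 3D point from two or more views.

    projections: (V, 3, 4) camera matrices P = K [R | t]
    pixels:      (V, 2) observed pixel coordinates (u, v)
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        rows.append(u * P[2] - P[0])  # u * (p3 . X) - (p1 . X) = 0
        rows.append(v * P[2] - P[1])  # v * (p3 . X) - (p2 . X) = 0
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                        # right singular vector of the smallest singular value
    return X[:3] / X[3]               # de-homogenize

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Toy setup (all values illustrative): two cameras with a 0.5 m baseline.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])
X_true = np.array([0.2, -0.1, 4.0])
pixels = np.stack([project(P0, X_true), project(P1, X_true)])
X_est = triangulate_dlt(np.stack([P0, P1]), pixels)

conf = np.array([0.95, 0.60, 0.30])
ba_mask, dlt_mask = split_by_confidence(conf, eps_ba=0.9, eps_dlt=0.5)
```

Because every BA point also passes the looser threshold, the BA set is a subset of the triangulation set; the expensive non-linear solve touches few points while the cheap linear solve covers many.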

2. Geometry-Grounded Encoding: Feeding "Prediction vs. Geometric Prior Difference" to the Network

Inputting sparse point clouds alone is insufficient; the network must know where to refine. The embedding for a dense point \(\mathbf{x}_d\) consists of four parts: its own positional encoding \(\text{PE}(\mathbf{x}_d)\), a type tag \(\mathbf{e}_{type(d)}\), the positional encoding of the corresponding sparse guidance point \(\text{PE}(\mathbf{x}_{d \to s})\), and the offset \(\Delta_{d \to s} = \mathbf{x}_{d \to s} - \mathbf{x}_d\) from the dense prediction to its guidance point. The offset \(\Delta_{d \to s}\) explicitly encodes how much correction is needed, which makes it the most critical component for refinement.
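A minimal sketch of how such a four-part feature could be assembled is shown here. Several choices are assumptions rather than details from the paper: the sinusoidal form of PE, the one-hot stand-in for the type tag, zeroing the guidance terms for points without a match, and the names `sinusoidal_pe` and `encode_dense_points`. The offset is taken as guidance point minus dense prediction, following the overview's description of offsets between dense predictions and their guidance points.

```python
import numpy as np

def sinusoidal_pe(xyz, num_freqs=4):
    """Sinusoidal positional encoding of (N, 3) coordinates -> (N, 6 * num_freqs)."""
    freqs = 2.0 ** np.arange(num_freqs)                 # (F,)
    angles = xyz[..., None] * freqs                     # (N, 3, F)
    pe = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return pe.reshape(xyz.shape[0], -1)

def encode_dense_points(x_d, x_guide, has_guide, num_freqs=4):
    """Concatenate [PE(x_d), type tag, PE(x_guide), offset] per dense point.

    x_d:       (N, 3) dense feed-forward predictions
    x_guide:   (N, 3) matched sparse guidance point per dense point
               (rows where has_guide is False are ignored)
    has_guide: (N,) boolean mask
    """
    mask = has_guide[:, None]
    safe_guide = np.where(mask, x_guide, 0.0)           # drop placeholder values
    m = mask.astype(float)
    offset = (safe_guide - x_d) * m                     # zero when unguided
    pe_self = sinusoidal_pe(x_d, num_freqs)
    pe_guide = sinusoidal_pe(safe_guide, num_freqs) * m
    type_tag = np.concatenate([1.0 - m, m], axis=1)     # one-hot: unguided / guided
    return np.concatenate([pe_self, type_tag, pe_guide, offset], axis=1)

x_d = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 2.0]])
x_guide = np.array([[0.0, 0.1, 1.0], [np.nan, np.nan, np.nan]])
has_guide = np.array([True, False])
feats = encode_dense_points(x_d, x_guide, has_guide)    # shape (2, 53)
```

In a trained model the type tag would more likely be a learned embedding; the one-hot here only marks which rows carry guidance.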

3. Direct 3D Attention PTv3: Natural Cross-view Consistency via 3D Neighbors

Previous refinement methods in 2D space are view-dependent. GGPT uses Point Transformer V3 (8 layers, 53M parameters) to perform patch-wise self-attention over 3D neighbors. Receptive fields are defined by spatial proximity rather than pixel coordinates, so points observed from different views but close in 3D attend to each other, which promotes multi-view consistency. To handle large scenes, it partitions the scene into overlapping cubic blocks (radius = 0.2 × scene radius), processes up to 400,000 points per block independently, and averages the predictions in overlap regions.
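The block-wise processing can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: `refine_in_blocks` is a hypothetical name, the "radius" is interpreted as the half side length of an axis-aligned cube, the 50% overlap is an arbitrary choice, `refine_fn` stands in for the PTv3 forward pass, and the 400,000-point cap per block is omitted.

```python
import itertools
import numpy as np

def refine_in_blocks(points, refine_fn, block_frac=0.2, overlap=0.5):
    """Apply refine_fn to overlapping cubic blocks and average the results.

    points:     (N, 3) array of 3D points
    refine_fn:  maps an (M, 3) array to refined (M, 3) positions
    block_frac: block half-size as a fraction of the scene radius
    overlap:    fraction by which neighbouring blocks overlap, in [0, 1)
    """
    center = points.mean(axis=0)
    scene_radius = np.linalg.norm(points - center, axis=1).max()
    half = block_frac * scene_radius
    step = 2.0 * half * (1.0 - overlap)      # spacing between block centers
    lo, hi = points.min(axis=0), points.max(axis=0)
    axes = [np.arange(l, h + step, step) for l, h in zip(lo, hi)]

    acc = np.zeros_like(points, dtype=float)  # sum of per-block predictions
    cnt = np.zeros(len(points))               # number of blocks covering each point
    for c in itertools.product(*axes):
        inside = np.all(np.abs(points - np.array(c)) <= half + 1e-9, axis=1)
        if inside.any():
            acc[inside] += refine_fn(points[inside])
            cnt[inside] += 1
    assert cnt.min() > 0, "every point should fall in at least one block"
    return acc / cnt[:, None]                 # average where blocks overlap

rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(2000, 3))
shifted = refine_in_blocks(pts, lambda p: p + 0.01)  # toy "refinement"
```

With a real network, predictions for the same point differ between blocks, and the averaging step is what smooths the seams between them.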

Loss & Training

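  • Confidence-weighted Regression: \(\mathcal{L}_{conf} = \sum \left( c \, \|\hat{\mathbf{x}} - \mathbf{x}_{GT}\| - \alpha \log c \right)\). This heteroscedastic form lets the model lower its confidence, and hence the regression weight, in uncertain regions, while the \(-\alpha \log c\) term penalizes collapsing \(c\) toward zero.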
  • Identity Consistency: \(\mathcal{L}_{id} = \sum \|\hat{\mathbf{x}} - \mathbf{x}_{d \to s}\|\), encouraging dense points with correspondences to align with geometric guidance.
  • Total loss \(\mathcal{L} = \mathcal{L}_{conf} + \lambda_{id} \mathcal{L}_{id}\), where \(\lambda_{id}=1, \alpha=0.2\).
  • Training: 20k sequences from ScanNet++, one day on 8×GH200 GPUs.
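The two loss terms above can be written out in a few lines of NumPy. This sketch makes assumptions the summary does not settle: the norm is taken to be Euclidean, the sums run over points without normalization, and the identity term is applied only to points that have a guidance match. The name `ggpt_loss` and its arguments are hypothetical.

```python
import numpy as np

def ggpt_loss(x_hat, x_gt, conf, x_guide, has_guide, alpha=0.2, lambda_id=1.0):
    """Confidence-weighted regression plus identity consistency.

    x_hat, x_gt, x_guide: (N, 3) refined, ground-truth, and guidance points
    conf:                 (N,) strictly positive confidences
    has_guide:            (N,) boolean mask of points with a sparse match
    """
    err = np.linalg.norm(x_hat - x_gt, axis=1)
    l_conf = np.sum(conf * err - alpha * np.log(conf))
    id_err = np.linalg.norm(x_hat - x_guide, axis=1)
    l_id = np.sum(id_err[has_guide])          # only matched points contribute
    return l_conf + lambda_id * l_id

x_gt = np.zeros((4, 3))
x_guide = x_gt.copy()                          # guidance equals GT in this toy case
has_guide = np.array([True, True, False, False])
conf = np.ones(4)

loss_perfect = ggpt_loss(x_gt, x_gt, conf, x_guide, has_guide)
x_off = x_gt + np.array([0.3, 0.4, 0.0])       # every point is 0.5 away from GT
loss_off = ggpt_loss(x_off, x_gt, conf, x_guide, has_guide)
```

For a fixed error \(e\), the per-point term \(c\,e - \alpha \log c\) is minimized at \(c = \alpha / e\), so larger errors pull the optimal confidence down.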

Key Experimental Results

Main Results (AUC@5/10 cm ↑, 8 Views)

| Method | ScanNet++ | ETH3D | T&T |
|---|---|---|---|
| VGGT | 19/32 | 23/36 | 25/39 |
| VGGT + Ours | 45/60 | 47/61 | 42/57 |
| Pi3 | 56/71 | 25/41 | 26/42 |
| Pi3 + Ours | 56/72 | 36/53 | 32/50 |
| MapAnything | 38/57 | 7/15 | 9/20 |
| MapAnything + Ours | 48/64 | 33/45 | 40/55 |

Ablation Study

| Ablation Item | ScanNet++ 4v | ETH3D 4v | Remarks |
|---|---|---|---|
| Full GGPT | 38/53 | 41/55 | Baseline |
| W/O \(\mathbf{X}_s\) guidance | Learnable but OOD collapse | Significant drop | Guidance is indispensable |
| W/O Offset Encoding \(\Delta_{d \to s}\) | Significant drop | Significant drop | Most critical component |
| 2D Transformer instead of PTv3 | Small gap in-domain | Large gap OOD | 3D attention generalization advantage |
| Patch r=0.1 vs 0.2 vs 0.5 | r=0.2 optimal | | Small patches enhance generalization |

Key Findings

  • Strong Out-of-Distribution Generalization: Trained only on ScanNet++, it improves 5 models across 5 datasets without any fine-tuning.
  • Largest Gain for VGGT: AUC@5 increased from 19 to 45 (+137%) on ScanNet++ and from 23 to 47 (+104%) on ETH3D.
  • Impressive OOD Data Performance: On 4D-DRESS, VGGT AUC@1/5cm improved from 10/45 to 66/77 with Ours; on MV-dVRK, it improved from 8/33 to 45/61.
  • 3D vs 2D Refinement: PTv3 clearly outperforms the 2D Transformer variant on cross-domain data (small gap in-domain, large gap OOD), which supports refining in 3D space.
  • SfM Ablation: Dense matchers are much better than sparse matchers (MASt3R); DLT is hundreds of times faster than RANSAC triangulation with comparable accuracy; 512 points are sufficient for sparse BA.
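The relative gains quoted for VGGT can be checked with simple arithmetic on the AUC@5 values from the main results table; `relative_gain` is just an illustrative helper.

```python
def relative_gain(before, after):
    """Relative improvement in percent."""
    return 100.0 * (after - before) / before

gain_scannetpp = relative_gain(19, 45)  # VGGT AUC@5 on ScanNet++
gain_eth3d = relative_gain(23, 47)      # VGGT AUC@5 on ETH3D
```

Rounded to whole percentages, these match the +137% and +104% figures above.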

Highlights & Insights

  • Performing geometric fusion in 3D space rather than 2D image space provides a fundamental advantage for cross-domain generalization.
  • The design philosophy of "training a single configuration to improve multiple feed-forward methods without fine-tuning" is highly valuable.
  • The separation strategy of sparse BA + DLT is elegant and efficient: non-linear optimization is reserved for high-confidence sparse points, while triangulation uses linear methods.
  • The geometry-grounded encoding is clever: \(\Delta_{d \to s}\) directly encodes the "correction amount," giving the network an explicit input cue for where and how far each point should move.

Limitations & Future Work

  • SfM Error Propagation: SfM and GGPT execute sequentially; if SfM fails (e.g., in textureless scenes), refinement cannot recover the geometry.
  • Patch Artifacts: Block-wise processing may cause boundary discontinuities; while averaging overlap regions helps, it does not fully eliminate them.
  • Indoor Training Focus: The model has not been validated on large-scale outdoor scenes or scenarios with more than 16 views.
  • Computational Overhead: Requires additional execution of dense matchers and BA, increasing total inference time.

Related Work & Insights

  • vs. DUSt3R/VGGT: Feed-forward predictions are fast but inconsistent; GGPT serves as a universal post-processing module to provide geometric consistency.
  • vs. COLMAP: Traditional incremental SfM is fragile under sparse views; the global SfM + dense matching used here is more robust and efficient.
  • vs. 2D Depth Completion: Image-space refinement is inherently view-dependent; attention in 3D space addresses cross-view consistency directly.
  • Insight: The paradigm of 3D space processing > 2D image processing is worth validating in more tasks, such as multi-view fusion for semantic segmentation or object detection.

Rating

⭐⭐⭐⭐⭐ (5/5)

Reasoning: The method is elegantly designed with a strong motivation (3D vs. 2D refinement). The experiments are exceptionally comprehensive (5 feed-forward methods across 5 datasets, including OOD medical/human data). Its generalization capability is impressive (universal improvement with a single configuration), and the ablation studies are thorough with clear conclusions. The designs of the improved SfM pipeline and the 3D Point Transformer are both independently valuable. This is high-quality work in the field of sparse-view 3D reconstruction.