
VGG-T3: Offline Feed-Forward 3D Reconstruction at Scale

Conference: CVPR 2026 | arXiv: 2602.23361 | Code: N/A | Area: 3D Vision / 3D Reconstruction | Keywords: 3D Reconstruction, Test-Time Training, Linear Complexity, KV Compression, Visual Localization

TL;DR

This paper proposes VGG-T3, which compresses the variable-length KV representations in VGGT's global attention layers into fixed-size MLP weights via test-time training (TTT), reducing the computational complexity of offline feed-forward 3D reconstruction from \(O(n^2)\) to \(O(n)\), enabling large-scale scene reconstruction at the thousand-image level (1k images in only 58 seconds).

Background & Motivation

Background: Feed-forward multi-view 3D reconstruction methods (e.g., VGGT, Fast3R) leverage Transformer global self-attention for multi-view reasoning, achieving accuracy comparable to the classical COLMAP pipeline with greater robustness under challenging conditions.

Limitations of Prior Work: The computational complexity and memory requirements of these methods scale quadratically with the number of input images \(n\), with the core bottleneck being that global softmax attention must query a variable-length KV space spanning all image tokens. VGGT requires over 11 minutes to process 1k images.

Key Challenge: Existing acceleration methods (e.g., token merging in FastVGGT, sparse attention in SparseVGGT) reduce constant factors but leave the asymptotic complexity quadratic: \(O(n^2) \to O((n/r)^2)\). Online methods (e.g., CUT3R, Must3R) use fixed-size implicit memories but suffer from limited accuracy and drift.

Goal: Reduce complexity to linear \(O(n)\) while preserving the accuracy advantage of global offline reconstruction, supporting reconstruction from arbitrarily large image collections.

Key Insight: Following DeepSDF's idea of compressing a variable-length representation into fixed-size optimizable parameters, the variable-length KV space of VGGT's global attention layers is distilled into the weights of a fixed-size MLP via TTT.

Core Idea: TTT is used to learn an MLP \(T_\theta\) that captures the Key-to-Value mapping (\(\arg\min_\theta \sum_i L_t(T_\theta(k_i), v_i)\)); at inference time, querying this MLP yields the output with cost linear in sequence length.
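
Written out with the apply step made explicit (a restatement of the objective above, using the loss defined later in Loss & Training), each global layer solves

\[
\theta^{\star} = \arg\min_{\theta} \sum_{i=1}^{n} L_t\big(T_{\theta}(k_i),\, v_i\big), \qquad o_j = T_{\theta^{\star}}(q_j),
\]

so both the fit (a fixed number of passes over the \((k_i, v_i)\) pairs) and the queries cost time linear in the token count \(n\), while the scene representation stays fixed at \(|\theta|\) parameters.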

Method

Overall Architecture

VGG-T3 retains VGGT's image tokenizer and prediction heads, replacing only the global attention layers with TTT layers. Each TTT layer runs in two stages (sketched in code below):

  • Update stage: input tokens are projected into QKV, and TTT compresses the K→V mapping into fixed-size MLP weights \(\theta\).
  • Apply stage: the optimized MLP is applied to the queries \(q\) to produce output tokens, which are passed to the next layer.
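
A minimal PyTorch sketch of the two stages, assuming a generic fast-weight MLP; plain SGD and the learning rate are placeholders for the Muon optimizer and hyperparameters the paper actually uses:

```python
import torch

def ttt_update_apply(fast_mlp, q, k, v, lr=1.0, num_steps=1):
    """Update stage: fit fast_mlp to the K -> V mapping by gradient descent.
    Apply stage: answer the queries with the fitted network.

    Each stage is a constant number of passes over the tokens, so the
    whole layer costs O(n) in sequence length.
    """
    opt = torch.optim.SGD(fast_mlp.parameters(), lr=lr)
    k, v = k.detach(), v.detach()        # only the fast weights are optimized
    for _ in range(num_steps):           # update stage
        loss = -(fast_mlp(k) * v).sum(dim=-1).mean()  # dot-product loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                # apply stage
        return fast_mlp(q)
```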

Key Designs

1. Linearizing the Pretrained Model

  • Function: Initialized from VGGT pretrained weights, retaining the \(W_q, W_k, W_v\) projection matrices.
  • Mechanism: VGGT's QK projections use LayerNorm (\(q_i = \text{LN}_q(W_q x_i)\)), but LN's learnable parameters distort the input space during TTT optimization, causing extremely slow convergence. Replacing LN with \(L_2\) normalization unlocks fast convergence (see the snippet after this list).
  • Design Motivation: Post-training linearization strategies have proven successful in LLMs and can significantly reduce training costs.
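
A sketch of the substitution (a hypothetical helper; `w_q` and `w_k` stand for the retained VGGT projection layers):

```python
import torch.nn.functional as F

def project_qk(w_q, w_k, x):
    # VGGT: q_i = LN_q(W_q x_i). Here LN is replaced by parameter-free L2
    # normalization, so no learnable scale/shift distorts the TTT input space.
    q = F.normalize(w_q(x), dim=-1)
    k = F.normalize(w_k(x), dim=-1)
    return q, k
```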

2. ShortConv2D Nonlinear Spatial Mixing

  • Function: Applies 2D convolution in the Value space to break the linear dependency in the K→V mapping.
  • Mechanism: Since both \(K = W_k x\) and \(V = W_v x\) are linear projections of \(x\), in principle \(V = W_v W_k^{-1} K\) (when \(W_k\) is invertible), so the TTT objective admits a trivial linear solution. After applying ShortConv2D, the target becomes learning \(K \to V'\), where \(V'\) encodes local spatial context (sketched after this list):
    • Reshape the 1D token sequence into a 2D image grid of shape \((N, H/p, W/p, d)\)
    • Apply a single-layer 2D convolution to aggregate local neighborhood information
    • Flatten back to a 1D sequence
  • Design Motivation: Forces the MLP to predict targets containing neighborhood information from single-token features, yielding a more robust geometric scene representation.
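
A self-contained sketch of this reshape-convolve-flatten pipeline; the depthwise kernel is an assumption on our part, since the paper only specifies a single-layer 2D convolution:

```python
import torch
import torch.nn as nn

class ShortConv2D(nn.Module):
    """Spatial mixing of value tokens (illustrative sizes).

    Each target v'_i now aggregates context from neighboring tokens, so no
    per-token linear map of k_i alone can reproduce it, and the trivial
    solution to the TTT objective disappears.
    """
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, v, grid_hw):
        n, t, d = v.shape                            # (N, H/p * W/p, d)
        h, w = grid_hw
        v = v.transpose(1, 2).reshape(n, d, h, w)    # back onto the patch grid
        v = self.conv(v)                             # aggregate local context
        return v.reshape(n, d, t).transpose(1, 2)    # flatten to 1D sequence
```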

3. Test-Time Scaling

  • Function: Handles large-scale image collections beyond the training distribution.
  • Key Findings: One optimization step suffices at training scale, but 1k-image scenes require more; simply increasing to 2 steps yields near-constant accuracy across sequence lengths (see the snippet after this list).
  • Design Motivation: A fixed number of optimization steps is insufficient to compress significantly larger scenes into a fixed-dimensional MLP.
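
In code, test-time scaling just raises the step count passed to the update stage, reusing the `ttt_update_apply` sketch from above (the 256-image threshold here is an illustrative guess, not a value from the paper):

```python
# 1 step suffices at training scale; 2 keep accuracy near-constant at 1k images.
num_steps = 1 if num_images <= 256 else 2
out = ttt_update_apply(fast_mlp, q, k, v, num_steps=num_steps)
```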

Loss & Training

The dot-product loss is adopted for TTT optimization: \(L_t(T_\theta(k_i), v_i) = -T_\theta(k_i)^{\top} v_i\), negated so that minimization aligns \(T_\theta(k_i)\) with \(v_i\).

The Muon optimizer is used with a SwiGLU MLP as the fast weight network. All original VGGT parameters are frozen; only the global attention layers are fine-tuned for 100k steps (approximately 12% of the cost of training VGGT from scratch).
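
A sketch of these choices; layer names and sizes are illustrative, with the gated MLP following the standard SwiGLU form:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFastWeights(nn.Module):
    """Fast-weight network T_theta: a gated (SwiGLU) MLP, fitted per scene."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, k):
        return self.w_down(F.silu(self.w_gate(k)) * self.w_up(k))

def dot_product_loss(pred, v):
    # Negated so that gradient descent pulls T_theta(k_i) toward v_i.
    return -(pred * v).sum(dim=-1).mean()
```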

Key Experimental Results

Main Results: Standard Benchmarks

| Method | Complexity | DTU CD↓ | ETH3D CD↓ | NRGBD-D CD↓ | 7scenes-D NC↑ |
| --- | --- | --- | --- | --- | --- |
| VGGT | \(O(n^2)\) | 1.537 | 0.279 | 0.014 | 0.668 |
| SparseVGGT | \(O(n^2)\) | 1.541 | 0.327 | 0.018 | 0.665 |
| TTT3R | \(O(n)\) | 5.708 | 0.885 | 0.071 | 0.666 |
| VGG-T3 | \(O(n)\) | 1.654 | 0.480 | 0.029 | 0.679 |
  • Pointmap estimation: substantially outperforms the only \(O(n)\) baseline TTT3R across all datasets (DTU CD drops from 5.708 to 1.654), while remaining competitive with \(O(n^2)\) methods.
  • Video depth estimation: \(\delta<1.25\) of 0.967 on KITTI, on par with \(O(n^2)\) methods.

Large-Scale Reconstruction Performance

| No. of Images | VGG-T3 | VGGT | FastVGGT | TTT3R |
| --- | --- | --- | --- | --- |
| 1k | 58 s | 11 min (11.6× slower) | 4 min (4.3× slower) | ~60 s |
| 2k (4 GPUs) | 48.5 s | 1590 s | N/A | N/A |

Ablation Study

| Design variant | Effect (DTU / ETH3D CD↓) |
| --- | --- |
| w/o ShortConv2D | Significant degradation on both |
| LayerNorm instead of \(L_2\) norm | Extremely slow convergence |
| 1-step TTT (1k images) | Error increases ~5× |
| 2-step TTT (1k images) | Approaches small-scale accuracy; stable |

Key Findings

  1. The quality gap between VGG-T3 and \(O(n^2)\) methods narrows as the number of images increases.
  2. Supports single-GPU processing of arbitrarily large image collections (via minibatch offloading to CPU) as well as multi-GPU distributed inference.
  3. Visual localization: freezing the TTT-MLP enables feed-forward localization, achieving \(e_r = 6.71^\circ\), \(e_t = 0.15\,\text{m}\) on 7scenes.

Highlights & Insights

  1. Elegant core insight: Treating the KV space in attention as a "variable-length scene representation" and compressing it into a "fixed-size scene representation" via TTT — a natural and profound analogy to DeepSDF.
  2. Practical large-scale solution: The additivity of the TTT objective (gradients can be accumulated over minibatches) naturally supports distributed inference and CPU offloading, which softmax attention cannot achieve (see the sketch after this list).
  3. Unified reconstruction and localization: The same model and TTT-MLP support both scene mapping and localization, opening a new path toward unified end-to-end solutions.
  4. Low fine-tuning cost: The majority of VGGT parameters are frozen; only the new parameters in global attention layers are trained, at approximately 12% of the cost of training from scratch.
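
To make the additivity point concrete, here is a sketch of one TTT update with gradients accumulated over offloaded KV chunks; the chunking scheme and plain SGD step are illustrative stand-ins for the paper's Muon-based update:

```python
import torch

def ttt_step_minibatched(fast_mlp, kv_chunks, lr=1.0):
    """One TTT update with gradients accumulated over offloaded KV chunks.

    Because the objective is a sum over tokens, per-chunk gradients add up
    to the full gradient exactly; chunks can live on CPU (or other GPUs)
    and be streamed in one at a time, which softmax attention's global
    normalization does not allow.
    """
    device = next(fast_mlp.parameters()).device
    fast_mlp.zero_grad()
    for k, v in kv_chunks:                  # chunks may live on CPU
        k, v = k.to(device), v.to(device)
        loss = -(fast_mlp(k) * v).sum()     # summed loss: chunk grads add exactly
        loss.backward()                     # accumulates into .grad
    with torch.no_grad():
        for p in fast_mlp.parameters():
            p -= lr * p.grad                # plain SGD step as a stand-in
```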

Limitations & Future Work

  1. Weak camera pose estimation: The TTT-linearized model performs poorly on pose estimation, likely related to the heterogeneous design of camera tokens in VGGT; this remains a key issue for future work.
  2. Remaining gap with softmax attention: Especially in wide-baseline settings, the fixed capacity of the MLP limits scene representation ability.
  3. Non-trivial training cost: Although only 12% of VGGT's cost, training still requires 8×A100-80GB GPUs for 100k steps.
  4. Limited visual localization validation: Demonstrated only on 7scenes and Wayspots, with a notable gap compared to dedicated localization pipelines (e.g., Reloc3R).

Related Work
  • VGGT: The base architecture for this work; global softmax attention enables multi-view reasoning with high accuracy but \(O(n^2)\) complexity.
  • FastVGGT / SparseVGGT: Accelerate via token merging / block-sparse attention; asymptotic complexity remains unchanged.
  • TTT3R: A concurrent work; an autoregressive TTT model based on CUT3R; \(O(n)\) but lower accuracy and does not support unordered inputs.
  • CUT3R / Must3R / Point3R: Online methods using fixed-size implicit/spatial memories; linear complexity but poor global consistency.
  • LaCT (Sun et al.): The originator of the TTT framework adopted here; VGG-T3 uses its SwiGLU MLP and Muon optimizer.
  • DeepSDF: A classic implicit representation work; the central idea of "encoding instance geometry in a fixed-size network" directly underpins VGG-T3.

Rating

  • Novelty: ⭐⭐⭐⭐ — Elegantly transfers post-training linearization and TTT from the LLM domain to 3D reconstruction; the ShortConv2D design is targeted and well-motivated.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers pointmap estimation, depth, pose, and localization; includes large-scale evaluation and distributed inference; ablations are comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ — Logically clear with progressively developed motivation; figures and tables are highly informative.
  • Value: ⭐⭐⭐⭐⭐ — Addresses the scalability bottleneck of feed-forward 3D reconstruction; achieves 11.6× speedup with minimal accuracy loss; has direct practical value for large-scale scene reconstruction.