
TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction

Conference: ICCV2025 arXiv: 2412.16919 Code: GitHub Area: 3D Vision Keywords: 3D generation, autoregressive, VQ-VAE, triplane representation, next-part prediction, GPT

TL;DR

TAR3D is proposed as the first framework to quantize triplane representations into discrete geometric parts and generate them autoregressively via GPT. A 3D VQ-VAE encodes meshes of arbitrary face counts into fixed-length sequences, while TriPE positional encoding preserves 3D spatial information. The method comprehensively outperforms existing approaches on text/image-to-3D tasks.

Background & Motivation

Background: Conditional 3D generation has achieved significant progress along three main directions: (1) SDS-based optimization (e.g., DreamFusion), which is time-consuming and prone to multi-face artifacts; (2) multi-view synthesis (e.g., Zero123), which depends on view consistency and tends to lose fine details; (3) native 3D diffusion (e.g., Direct3D, Clay), which is difficult to unify with LLM-based modeling.

Limitations of Prior Work: Approaches such as MeshGPT and MeshXL attempt to introduce autoregressive LLMs into 3D generation, but directly quantizing mesh faces yields a sequence length of \(9 \times\) the face count. Industrial-grade models with hundreds of thousands of faces thus produce sequences of intractable length that do not scale.

Core Problem: Can meshes of arbitrary face counts be encoded into fixed-length sequences, making GPT-based autoregressive modeling of 3D objects feasible?

Key Insight: Triplane representations naturally compress 3D information into fixed-size feature maps, and their 2D feature map structure is amenable to image-based VQ quantization strategies, enabling face-count-agnostic fixed-length discrete sequences.

Method

Overall Architecture

TAR3D consists of two core modules: a 3D VQ-VAE that encodes 3D shapes into discrete triplane features, and a 3D GPT that autoregressively predicts codebook index sequences. Training proceeds in two stages: the VQ-VAE is first trained to obtain a high-quality discrete representation, followed by GPT training for sequence generation.

3D VQ-VAE

3D Shape Encoder

  • Input: 81,920 points uniformly sampled from the 3D mesh surface (with normals), \(P \in \mathbb{R}^{B \times N_p \times 6}\)
  • Fourier positional encoding is applied to the point cloud to capture high-frequency details
  • Transformer architecture: 1 cross-attention layer + 8 self-attention layers
  • Point cloud information is injected into learnable query tokens (\(3 \times 32 \times 32 = 3072\) tokens, channel dimension 768) via cross-attention
  • Output: triplane latent representation \(\hat{z} \in \mathbb{R}^{B \times 3072 \times d_z}\)
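
A minimal PyTorch sketch of this encoder, following the bullets above (one cross-attention layer injecting point features into 3,072 learnable query tokens, then eight self-attention layers). Module names, the head count, and the Fourier frequency count are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class PointCloudTriplaneEncoder(nn.Module):
    """Injects point-cloud features into learnable triplane query tokens
    via cross-attention, then refines them with self-attention."""

    def __init__(self, dim=768, num_queries=3 * 32 * 32,
                 num_self_layers=8, num_heads=12, num_freqs=8):
        super().__init__()
        # 3 x 32 x 32 = 3072 learnable query tokens, channel dim 768
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.num_freqs = num_freqs
        in_dim = 6 + 6 * 2 * num_freqs        # xyz + normal, plus sin/cos per frequency
        self.point_proj = nn.Linear(in_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_layers = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, num_heads, dim * 4,
                                       batch_first=True, norm_first=True)
            for _ in range(num_self_layers)
        ])

    def fourier_encode(self, p):
        # p: (B, N_p, 6) surface samples with normals
        freqs = 2.0 ** torch.arange(self.num_freqs, device=p.device)
        angles = p.unsqueeze(-1) * freqs                   # (B, N_p, 6, F)
        enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return torch.cat([p, enc.flatten(-2)], dim=-1)     # (B, N_p, 6 + 12F)

    def forward(self, points):
        kv = self.point_proj(self.fourier_encode(points))
        q = self.queries.unsqueeze(0).expand(points.shape[0], -1, -1)
        z, _ = self.cross_attn(q, kv, kv)                  # point info -> query tokens
        for layer in self.self_layers:
            z = layer(z)
        return z                                           # (B, 3072, dim) triplane latent
```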

Quantizer

  • Learnable codebook \(Z = \{z_k\}_{k=1}^{K}\), \(K = 16384\), channel dimension \(d_q = 8\)
  • A linear projection maps continuous features to the codebook dimension
  • Element-wise nearest-neighbor quantization: \(z_q^{(ij)} = \arg\min_{z_k \in Z} \|\tilde{z}_{ij} - z_k\|_2\)
  • Sequence length is fixed at \(3 \times 32 \times 32 = 3072\), entirely independent of mesh face count
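
A sketch of this quantization step under the hyperparameters above (K = 16384, d_q = 8); the straight-through estimator and the commitment term are the standard VQ-VAE choices and are assumptions about details not spelled out here:

```python
import torch
import torch.nn as nn

class TriplaneQuantizer(nn.Module):
    """Element-wise nearest-neighbour quantization against a learnable codebook."""

    def __init__(self, codebook_size=16384, d_q=8, d_z=768, beta=0.25):
        super().__init__()
        self.proj = nn.Linear(d_z, d_q)               # project features to codebook dim
        self.codebook = nn.Embedding(codebook_size, d_q)
        self.beta = beta

    def forward(self, z):
        # z: (B, 3072, d_z) continuous triplane latent from the encoder
        z_e = self.proj(z)                            # (B, 3072, d_q)
        cb = self.codebook.weight                     # (K, d_q)
        # squared distances to every codebook entry, then pick the nearest
        d = (z_e.pow(2).sum(-1, keepdim=True)
             - 2 * z_e @ cb.t()
             + cb.pow(2).sum(-1))
        idx = d.argmin(dim=-1)                        # (B, 3072) codebook indices
        z_q = self.codebook(idx)                      # quantized features
        # codebook loss + beta-weighted commitment loss (standard VQ-VAE terms)
        cb_loss = ((z_q - z_e.detach()) ** 2).mean() \
                  + self.beta * ((z_e - z_q.detach()) ** 2).mean()
        z_q = z_e + (z_q - z_e).detach()              # straight-through estimator
        return z_q, idx, cb_loss
```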

3D Geometry Decoder

  • \(N_2 = 6\) attention layers, each comprising 2 feature deformation steps and self-attention
  • Intra-plane self-attention: ignores the plane axis, processing each plane independently
  • Inter-plane self-attention: concatenates the three planes along the height dimension for cross-plane information interaction (PII)
  • Triplane features are upsampled to \(256 \times 256\) resolution
  • Query point features are sampled from the triplane via bilinear interpolation and fed into an MLP to predict occupancy values
  • Query points consist of 20,480 volumetric points + 20,480 near-surface points
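
The two attention patterns and the occupancy query could look roughly as follows (upsampling to \(256 \times 256\) is omitted); the plane-to-axis projections (xy/xz/yz), the use of grid_sample, and the MLP interface are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def intra_plane_attention(x, attn):
    """x: (B, 3, H, W, C). Each plane attends only over its own tokens."""
    B, P, H, W, C = x.shape
    tokens = x.reshape(B * P, H * W, C)
    out, _ = attn(tokens, tokens, tokens)
    return out.reshape(B, P, H, W, C)

def inter_plane_attention(x, attn):
    """x: (B, 3, H, W, C). All three planes are stacked into one token sequence
    so tokens from different planes can interact (plane information interaction)."""
    B, P, H, W, C = x.shape
    tokens = x.reshape(B, P * H * W, C)
    out, _ = attn(tokens, tokens, tokens)
    return out.reshape(B, P, H, W, C)

def query_occupancy(triplanes, points, mlp):
    """triplanes: (B, 3, C, H, W); points: (B, N, 3) assumed normalized to [-1, 1].
    Bilinearly sample each plane at the point's 2D projection, concatenate the
    three features, and predict an occupancy logit with an MLP."""
    projections = (points[..., [0, 1]],   # xy plane
                   points[..., [0, 2]],   # xz plane
                   points[..., [1, 2]])   # yz plane
    feats = []
    for plane, coords in zip(triplanes.unbind(1), projections):
        grid = coords.unsqueeze(2)                               # (B, N, 1, 2)
        f = F.grid_sample(plane, grid, mode='bilinear', align_corners=True)
        feats.append(f.squeeze(-1).transpose(1, 2))              # (B, N, C)
    return mlp(torch.cat(feats, dim=-1))                         # (B, N, 1) occupancy logits
```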

3D GPT

Sequence Construction

  • Sequences are built from codebook indices obtained from the pretrained VQ-VAE
  • Intra-plane: raster scan order
  • Inter-plane: indices at the same spatial position across the three planes are arranged adjacently
  • Conditional prompts are embedded as prefilling tokens
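
A small sketch of this ordering, mapping per-plane index maps to the interleaved 3,072-token sequence and back (shapes assume the \(3 \times 32 \times 32\) configuration):

```python
import torch

def build_sequence(indices):
    """indices: (B, 3, H, W) codebook indices of the three planes.
    Raster-scan each plane, then interleave so that the three indices at
    the same (row, col) position are adjacent in the final sequence."""
    B, P, H, W = indices.shape
    flat = indices.reshape(B, P, H * W)                   # raster-scan order per plane
    return flat.permute(0, 2, 1).reshape(B, H * W * P)    # (B, 3072)

def split_sequence(seq, P=3, H=32, W=32):
    """Inverse of build_sequence: recover per-plane index maps."""
    B = seq.shape[0]
    return seq.reshape(B, H * W, P).permute(0, 2, 1).reshape(B, P, H, W)
```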

TriPE Positional Encoding

  • A custom 3D positional encoding combining 2D RoPE and 1D RoPE
  • TriP_2D: each element of the 2D positional encoding is repeated 3 times and arranged adjacently, preserving intra-plane 2D spatial information
  • TriP_1D: each of the three elements of the 1D positional encoding is repeated \(h \times w\) times, distinguishing the three feature planes
  • TriPE = TriP_2D + TriP_1D (element-wise addition)
  • Retains significantly more 3D spatial information than naive 1D RoPE
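
One possible reading of this layout, expressed as per-token position ids; the plane-id pattern shown cycles 0..2 at each spatial position to stay consistent with the interleaved sequence above, which is an interpretation rather than the paper's exact formulation:

```python
import torch

def tripe_position_ids(H=32, W=32, P=3):
    """Per-token position ids for the interleaved triplane sequence:
    the (row, col) id of TriP_2D is repeated P times so the three plane
    tokens at one spatial position share it, while the plane id of TriP_1D
    distinguishes the planes. A 2D RoPE driven by (rows, cols) and a 1D RoPE
    driven by plane would then be combined per token to form TriPE."""
    rows = torch.arange(H).repeat_interleave(W)   # raster-scan row index, length H*W
    cols = torch.arange(W).repeat(H)              # raster-scan column index, length H*W
    rows = rows.repeat_interleave(P)              # repeat each 2D position P times
    cols = cols.repeat_interleave(P)
    plane = torch.arange(P).repeat(H * W)         # plane id per token, length P*H*W
    return plane, rows, cols
```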

Autoregressive Generation

  • Sequence \(s \in \{0, \ldots, K-1, K\}^{3 \cdot h \cdot w}\)
  • GPT-L architecture (LLamaGen configuration): 24 Transformer layers, 16 heads, hidden dimension 1024
  • Condition encoding: DINO (ViT-B16) for images; FLAN-T5 XL for text
  • Classifier-free guidance (CFG, scale = 7.5) is applied at inference to improve quality and alignment
  • Training objective: maximize the log-likelihood of the triplane index sequence
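
A generic sketch of CFG-guided autoregressive sampling at scale 7.5; `gpt` is a placeholder callable (the actual model prefills DINO/FLAN-T5 condition embeddings and would use a KV cache rather than re-running the full prefix each step):

```python
import torch

@torch.no_grad()
def sample_with_cfg(gpt, cond, uncond, seq_len=3072, cfg_scale=7.5):
    """Autoregressive sampling with classifier-free guidance.
    `gpt(condition, tokens)` is a placeholder callable returning logits of
    shape (B, T, vocab) given prefilled condition embeddings and the tokens
    generated so far; `uncond` is the null/empty condition."""
    B = cond.size(0)
    generated = torch.zeros(B, 0, dtype=torch.long, device=cond.device)
    for _ in range(seq_len):
        logits_c = gpt(cond, generated)[:, -1]       # conditional next-token logits
        logits_u = gpt(uncond, generated)[:, -1]     # unconditional next-token logits
        logits = logits_u + cfg_scale * (logits_c - logits_u)   # CFG combination
        probs = torch.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        generated = torch.cat([generated, next_tok], dim=1)
    return generated   # (B, 3072) triplane codebook indices for the 3D VQ-VAE decoder
```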

Training Details

  • VQ-VAE loss: \(\mathcal{L} = \lambda_{\text{rec}} \cdot \mathcal{L}_{\text{rec}}(\text{BCE}) + \lambda_{\text{cb}} \cdot \mathcal{L}_{\text{cb}}(\text{codebook learning})\)
  • \(\lambda_{\text{rec}} = 1\), \(\lambda_{\text{cb}} = 0.1\), \(\beta = 0.25\) (commitment weight within the codebook loss)
  • Optimizer: AdamW, learning rate \(1 \times 10^{-4}\), cosine annealing schedule
  • VQ-VAE: batch size 128, 100K steps, \(8 \times\) A100
  • GPT: batch size 80, 100K steps
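
How the two loss terms combine with the weights above (a sketch; argument names are placeholders, and the quantizer is assumed to return its codebook/commitment term with \(\beta\) already applied):

```python
import torch.nn.functional as F

def vqvae_loss(occ_logits, occ_target, cb_loss, lambda_rec=1.0, lambda_cb=0.1):
    """Two-term 3D VQ-VAE objective: binary cross-entropy on the predicted
    occupancy at the query points plus the weighted codebook term."""
    rec = F.binary_cross_entropy_with_logits(occ_logits, occ_target)
    return lambda_rec * rec + lambda_cb * cb_loss
```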

Key Experimental Results

Datasets

| Dataset | Scale | Usage |
| --- | --- | --- |
| ShapeNet | 52,472 meshes / 55 categories | 48,597 train + 2,592 test |
| Objaverse | ~100K (filtered from 800K) | ~99K train + 1K evaluation |
| GSO | ~1,000 real scans | Out-of-domain generalization |

Quantitative Comparison on Image-to-3D (Tab. 1)

| Method | PSNR↑ | SSIM↑ | CLIP↑ | LPIPS↓ | CD↓ | F-Score↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Shap-E | 10.991 | 0.702 | 0.834 | 0.325 | 0.156 | 0.163 |
| SyncDreamer | 11.269 | 0.706 | 0.837 | 0.320 | 0.158 | 0.178 |
| Michelangelo | 11.928 | 0.734 | 0.864 | 0.278 | 0.117 | 0.226 |
| InstantMesh | 11.560 | 0.721 | 0.847 | 0.303 | 0.137 | 0.179 |
| LGM | 11.363 | 0.714 | 0.841 | 0.317 | 0.149 | 0.172 |
| TAR3D | 13.626 | 0.763 | 0.868 | 0.216 | 0.066 | 0.303 |

  • PSNR improves by 1.7+, CD is reduced by over 43%, and F-Score improves by over 34%, achieving substantial leads across all metrics.

Ablation Study

| Ablation | CD↓ | F-Score↑ |
| --- | --- | --- |
| 3D VAE (no quantization) | 0.018 | 0.811 |
| 3D VQ-VAE | 0.016 | 0.822 |
| Without PII | 0.023 | 0.661 |
| With PII | 0.016 | 0.822 |

Triplane Resolution Ablation

| Triplane Size | CD↓ | Inference Time |
| --- | --- | --- |
| \(3 \times 16 \times 16\) | 0.157 | 17.7s |
| \(3 \times 32 \times 32\) | 0.066 | 67.6s |
| \(3 \times 48 \times 48\) | 0.062 | 143.9s |

  • \(32 \times 32\) achieves the best quality-efficiency trade-off: CD drops from 0.157 to 0.066 at an acceptable inference cost, while \(48 \times 48\) yields only marginal improvement at twice the inference time.

Highlights & Insights

  1. Fixed-length sequences are the key breakthrough: MeshGPT's sequence length scales with face count, making it non-scalable. TAR3D compresses meshes of arbitrary complexity into 3,072 tokens via triplane VQ quantization, enabling truly scalable 3D autoregressive modeling.

  2. Elegant design of TriPE: Rather than naively applying 1D positional encoding to a flattened sequence, TriPE fuses 2D (intra-plane spatial) and 1D (inter-plane identity) encodings. Ablation experiments confirm that naive 1D RoPE loses critical geometric detail.

  3. Importance of PII: F-Score improves from 0.661 to 0.822 (+24%), demonstrating that cross-plane information interaction is essential for high-quality reconstruction and that the three planes cannot be processed independently.

  4. VQ-VAE outperforms VAE: Contrary to the intuition that quantization degrades information, the 3D VQ-VAE achieves better reconstruction quality than its continuous counterpart (CD: 0.016 vs. 0.018), suggesting that the discrete codebook learns superior geometric part representations.

  5. GPT paradigm unifies multi-modal 3D generation: Image and text conditions are naturally unified as prefilling tokens, supporting multi-modal conditional generation in a manner compatible with the broader LLM ecosystem.

Limitations & Future Work

  1. Geometry only, no texture: The current version generates shape only; texture relies on an external synthesizer (SyncMVD), and end-to-end texture generation remains to be addressed.
  2. Inference efficiency: The \(32 \times 32\) configuration requires 67.6 seconds per sample, which is slower than diffusion-based methods that generate in seconds; autoregressive sequential dependencies limit parallelization.
  3. Occupancy field representation: The use of occupancy fields rather than SDF or DMTet may constrain topological complexity and thin-structure reconstruction.
  4. Limited data scale: Only ~100K Objaverse samples are used; the authors indicate that future work will incorporate Objaverse-XL and explore scaling laws.
  5. Larger GPT variants unexplored: Only GPT-L (24 layers, hidden dim 1024) is evaluated; scaling behavior and emergent capabilities remain to be investigated.

Related Work & Connections

  • MeshGPT/MeshXL: Direct quantization of mesh faces yields excessively long sequences; TAR3D circumvents this fundamental bottleneck via triplane representations.
  • Direct3D/Clay: 3D diffusion-based approaches; TAR3D provides an autoregressive alternative that integrates more naturally with the LLM paradigm.
  • LLamaGen: Successful autoregressive image generation validates the paradigm; TAR3D extends it to the 3D domain.
  • Inspiration: The triplane + VQ approach is extensible to 4D (dynamic 3D) generation; the TriPE positional encoding design is applicable to other structured sequence modeling tasks.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First framework combining triplane quantization with GPT autoregressive 3D generation, resolving the sequence length bottleneck
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-dataset, multi-baseline comparisons with thorough ablations, though large-scale experiments are absent
  • Writing Quality: ⭐⭐⭐⭐ Clear structure, well-motivated, with intuitive illustrations
  • Value: ⭐⭐⭐⭐⭐ Opens a viable path toward unifying 3D generation with the LLM ecosystem