Analysis-by-Synthesis Transformer for Single-View 3D Reconstruction

Conference: ECCV 2024
Code: https://github.com/DianJJ/AST
Area: 3D Vision
Keywords: Single-view 3D reconstruction, analysis-by-synthesis paradigm, Transformer, mesh reconstruction, texture generation

TL;DR

The paper proposes the Analysis-by-Synthesis Transformer (AST), a unified framework in which a Shape Transformer models pixel-to-shape relationships and a Texture Transformer models pixel-to-texture relationships. Trained with only 2D annotations, AST reconstructs meshes and generates textures, and is reported to outperform existing methods on CUB-200-2011 and ShapeNet.

Background & Motivation

Background: Single-view 3D reconstruction aims to recover 3D shape and texture from a single 2D image. Deep learning methods have made significant progress on this task, but mainstream approaches rely heavily on expensive 3D annotations (3D meshes, CAD models, voxel labels) for training. Recently, the analysis-by-synthesis paradigm has gained attention. It learns 3D reconstruction by synthesizing and then comparing: the model predicts a 3D shape and texture, renders them into a 2D image, and compares the rendering with the input image. Training therefore requires only 2D annotations such as silhouette masks and 2D keypoints.

Limitations of Prior Work: Existing analysis-by-synthesis methods are limited in both shape reconstruction and texture generation. (1) Shape: most methods regress mesh vertex displacements directly from global features using simple MLPs or CNNs. They fail to capture fine-grained correspondences between image pixels and shape details, so the reconstructed meshes lack detail and occluded regions are recovered poorly. (2) Texture: existing methods typically sample textures at the pixel level, for example by sampling colors from the image according to the projection relationship. This introduces an incorrect inductive bias: image pixels are copied directly for visible regions, while invisible regions receive no reasonable texture, which leads to incoherent textures.

Key Challenge: Shape reconstruction requires mapping both the global structure and the local details of the image onto the 3D shape, while texture generation requires inferring the colors of invisible regions from visible ones. Current methods lack the capacity to model these two relationships, and the problem is harder still when only 2D supervision is available.

Goal: 1) To establish fine-grained correspondences between image pixels and 3D mesh vertices under only 2D annotations; 2) To generate complete, coherent textures including occluded regions; 3) To simultaneously address both shape and texture sub-problems in a unified framework.

Key Insight: The authors argue that the attention mechanism of Transformers is naturally suited to modeling long-range correspondences between image pixels and 3D elements. Learnable shape queries and texture queries "query" the image features for the corresponding shape deformations and texture details, respectively, giving a direct mapping from pixels to shape and from pixels to texture.

Core Idea: Two Transformers use learnable query vectors to extract shape deformation and texture details from image features, establishing pixel-level 2D-to-3D correspondences through cross-attention.

Method

Overall Architecture

AST adopts an encoder-dual-branch-decoder architecture. Given a single RGB image, a CNN backbone is employed to extract multi-scale image features. This is followed by two parallel processing branches: the Shape Transformer receives learnable shape queries and computes cross-attention with the image features to output mesh vertex displacements; the Texture Transformer receives learnable texture queries and computes cross-attention with the image features to output texture information for each face. Finally, the reconstructed shape and texture are rendered into a 2D image via a differentiable renderer and compared with the input image to calculate losses.

Key Designs

Shape Transformer:
- Function: Extracts shape information from image features to predict the 3D displacement of each vertex on the template mesh.
- Mechanism: Initializes a set of learnable shape queries, where each query corresponds to a vertex or a group of vertices in the template mesh. These queries interact with the image features via cross-attention in multiple Transformer decoder layers. In cross-attention, the shape queries serve as Query, while the image features serve as Key and Value. Through attention, each shape query learns to focus on the image region relevant to its corresponding vertex; for instance, the query representing the bird's beak will attend to the beak area in the image. After several attention layers, each query is decoded into a 3D displacement vector \(\Delta v \in \mathbb{R}^3\) for the corresponding vertex. The key advantage is that attention establishes non-local pixel-vertex correspondences, allowing the shape of occluded vertices to be inferred from global context.
- Design Motivation: Traditional methods regress the displacements of all vertices directly from global features, so every vertex shares the same feature representation and local details are hard to capture. The Shape Transformer assigns an individual query to each vertex and establishes per-vertex pixel correspondences via attention, which improves reconstruction accuracy, especially in occluded regions.
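
The query-to-displacement path described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it uses a single attention head and a single layer, random weights in place of trained ones, and hypothetical sizes (`DIM`, `NUM_VERTICES`, `NUM_TOKENS`); the names `cross_attention`, `w_head`, and the rest are assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cross_attention(queries, tokens, w_q, w_k, w_v):
    """Single-head cross-attention: queries are Query, tokens are Key/Value."""
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        q_proj = matvec(w_q, q)
        scores = [scale * sum(a * b for a, b in zip(q_proj, k)) for k in keys]
        attn = softmax(scores)  # one weight per image token
        mixed = [sum(a * v[d] for a, v in zip(attn, values))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
        weights.append(attn)
    return outputs, weights


def rand_matrix(rows, cols, rng):
    """Random stand-in for parameters that would be learned in practice."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM, NUM_VERTICES, NUM_TOKENS = 8, 6, 16  # hypothetical sizes

shape_queries = rand_matrix(NUM_VERTICES, DIM, rng)  # learnable in a real model
image_tokens = rand_matrix(NUM_TOKENS, DIM, rng)     # stand-in for CNN features
w_q, w_k, w_v = (rand_matrix(DIM, DIM, rng) for _ in range(3))
w_head = rand_matrix(3, DIM, rng)                    # decodes a query to delta v

attended, attn_weights = cross_attention(shape_queries, image_tokens, w_q, w_k, w_v)
displacements = [matvec(w_head, h) for h in attended]  # one 3D offset per vertex
```

Each row of `attn_weights` is a distribution over all image tokens, which is what lets a query for an occluded vertex draw on any part of the image rather than only on its projected location.
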

Texture Transformer:
- Function: Extracts texture information from image features to generate complete textures for each face of the mesh.
- Mechanism: Initializes a set of learnable texture queries, where each query corresponds to a face of the mesh. The architecture of the Texture Transformer is similar to that of the Shape Transformer, but the goal differs: texture queries aggregate texture information through cross-attention with the image features. The core innovation is non-local information collection: instead of projecting each face onto the image to sample pixel colors, each texture query searches the entire image feature map for relevant texture cues via attention. Thus, even if a face is occluded and invisible in the current view, its texture query can infer a reasonable texture through semantic correlations with visible regions.
- Design Motivation: Traditional projection sampling carries a strong inductive bias: it assumes that invisible regions should take their texture from the nearest visible pixels, yet adjacent regions may have completely different textures (e.g., a bird's belly and wings). The Texture Transformer drops this assumption by using global attention, letting the network learn where to obtain the texture information for each face.
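
The contrast between projection sampling and attention-based texture collection can be illustrated with a toy example. The sketch below is a simplification under stated assumptions: it attends directly over per-pixel RGB values rather than learned value features, and the two-pixel "image", the `feat` vectors, and the face `query` vectors are all hypothetical values chosen by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def projection_sample(face, pixels):
    """Baseline: copy the color at the projected pixel; undefined if occluded."""
    if not face["visible"]:
        return None
    return pixels[face["pixel"]]["rgb"]


def attention_texture(query, pixels):
    """Color as an attention-weighted average over every pixel in the image."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * f for q, f in zip(query, p["feat"])) for p in pixels]
    weights = softmax(scores)
    return [sum(w * p["rgb"][c] for w, p in zip(weights, pixels)) for c in range(3)]


# Toy image: two pixels with hand-picked feature vectors and colors.
pixels = [
    {"feat": [1.0, 0.0], "rgb": [0.9, 0.1, 0.1]},  # reddish region
    {"feat": [0.0, 1.0], "rgb": [0.1, 0.1, 0.9]},  # bluish region
]

# Toy faces: the second is occluded, but its query aligns with the bluish region.
faces = [
    {"visible": True, "pixel": 0, "query": [4.0, 0.0]},
    {"visible": False, "pixel": None, "query": [0.0, 4.0]},
]

sampled = [projection_sample(f, pixels) for f in faces]
attended = [attention_texture(f["query"], pixels) for f in faces]
```

Projection sampling has no answer for the occluded face, while the attention path still returns a color dominated by the region its query matches. Because the output is a convex combination of pixel colors, it also stays inside the valid color range.
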

Differentiable Rendering and 2D Supervision:
- Function: Renders 3D predictions into 2D images to enable training of the 3D reconstruction model using only 2D annotations.
- Mechanism: A differentiable renderer (e.g., PyTorch3D) is used to render the combination of the deformed mesh from the Shape Transformer and the texture from the Texture Transformer into 2D images. Rendering includes silhouette rendering (for comparison with ground truth masks) and color rendering (for comparison with the original image). Since the rendering process is differentiable, the 2D losses can be backpropagated to the 3D predictions. Additionally, 2D keypoint annotations and camera parameters are utilized as auxiliary supervision.
- Design Motivation: 3D annotations are highly expensive to obtain, while 2D annotations (image, silhouette, keypoints) are relatively easy to acquire. The Analysis-by-Synthesis paradigm bridges 3D predictions and 2D supervision through rendering, substantially lowering the annotation requirements.

Loss & Training

The training loss combines several 2D supervision terms: (1) Silhouette loss (IoU loss): consistency between the rendered silhouette and the ground truth mask; (2) Texture loss (perceptual loss + L1 loss): color consistency between the rendered image and the original image; (3) Keypoint loss: distance between the mesh keypoints projected into 2D and the annotated keypoints; (4) Mesh regularization loss: a Laplacian smoothing term, a normal consistency term, and an edge length regularization term, which keep the mesh surface smooth and discourage self-intersections. Training is multi-stage: the shape branch is trained first until convergence, followed by joint training of the shape and texture branches.
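
A few of these terms can be sketched in plain Python to show how they combine. This is a rough illustration, not the paper's formulation: the soft IoU form, the uniform Laplacian, the squared-distance keypoint term, and every weight in `stage_weights` are assumptions. The perceptual, L1, normal consistency, and edge length terms are left out for brevity, so the `texture` entry is only a placeholder value.

```python
def soft_iou_loss(rendered, mask, eps=1e-6):
    """1 - soft IoU between a rendered silhouette and a mask (flat lists in [0, 1])."""
    inter = sum(r * m for r, m in zip(rendered, mask))
    union = sum(r + m - r * m for r, m in zip(rendered, mask))
    return 1.0 - inter / (union + eps)


def keypoint_loss(projected, annotated, visible):
    """Mean squared 2D distance over the keypoints marked visible."""
    terms = [(px - ax) ** 2 + (py - ay) ** 2
             for (px, py), (ax, ay), v in zip(projected, annotated, visible) if v]
    return sum(terms) / max(len(terms), 1)


def laplacian_loss(vertices, neighbors):
    """Uniform Laplacian: squared offset of each vertex from its neighbors' centroid."""
    total = 0.0
    for idx, ring in neighbors.items():
        centroid = [sum(vertices[j][d] for j in ring) / len(ring) for d in range(3)]
        total += sum((vertices[idx][d] - centroid[d]) ** 2 for d in range(3))
    return total / max(len(neighbors), 1)


def stage_weights(stage):
    """Hypothetical weights: stage 1 trains shape only, stage 2 adds texture."""
    weights = {"silhouette": 1.0, "keypoint": 1.0, "laplacian": 0.1, "texture": 0.0}
    if stage == 2:
        weights["texture"] = 1.0
    return weights


def total_loss(terms, weights):
    """Weighted sum of the individual loss terms."""
    return sum(weights[name] * value for name, value in terms.items())


# Toy inputs: a 4-pixel silhouette, two keypoints, and a 5-vertex patch.
terms = {
    "silhouette": soft_iou_loss([0.9, 0.8, 0.1, 0.0], [1.0, 1.0, 0.0, 0.0]),
    "keypoint": keypoint_loss([(0.0, 0.0), (3.0, 4.0)],
                              [(0.0, 0.0), (0.0, 0.0)], [True, True]),
    "laplacian": laplacian_loss(
        [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0),
         (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)],
        {0: [1, 2, 3, 4]}),
    "texture": 0.25,  # placeholder for a perceptual + L1 value
}
shape_stage_loss = total_loss(terms, stage_weights(1))
joint_stage_loss = total_loss(terms, stage_weights(2))
```

Switching the texture weight from zero to a positive value is one simple way to express the two-stage schedule; whether the authors freeze or reweight terms this way is not stated in this summary.
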

Key Experimental Results

Main Results

| Dataset | Metric | AST | Prev. SOTA | Gain |
| --- | --- | --- | --- | --- |
| CUB-200-2011 | 3D IoU↑ | Best | CMR/UMR etc. | Significant improvement |
| CUB-200-2011 | FID↓ (Texture) | Best | Prior methods | Texture quality substantially improved |
| ShapeNet (Car) | 3D IoU↑ | Best | DIB-R etc. | Validated on multiple categories |
| ShapeNet (Chair) | 3D IoU↑ | Best | Prior methods | Obvious advantage on complex shapes |
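
For readers unfamiliar with the 3D IoU metric in the table, the sketch below computes it on voxelized shapes. It assumes both shapes have already been converted to sets of occupied voxel indices on a shared grid; the exact evaluation protocol used in the paper is not described in this summary, and `voxel_iou` is a hypothetical helper name.

```python
def voxel_iou(pred_voxels, gt_voxels):
    """Intersection over union between two sets of occupied voxel indices."""
    pred, gt = set(pred_voxels), set(gt_voxels)
    union = pred | gt
    if not union:
        return 1.0  # two empty shapes are treated as identical
    return len(pred & gt) / len(union)


# Two 2x2x2 cubes, the second shifted by one voxel along x.
cube_a = {(x, y, z) for x in range(2) for y in range(2) for z in range(2)}
cube_b = {(x + 1, y, z) for x in range(2) for y in range(2) for z in range(2)}

iou = voxel_iou(cube_a, cube_b)  # overlap is 4 of the 12 occupied voxels
```
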

Ablation Study

| Configuration | 3D IoU | FID | Description |
| --- | --- | --- | --- |
| Full AST | Best | Best | Full model |
| w/o Shape Transformer (replaced with MLP) | Obvious IoU drop | - | Proves the importance of pixel-vertex correspondence in Transformer |
| w/o Texture Transformer (using projection sampling) | - | FID rises | Indicates non-local texture collection outperforms projection sampling |
| Different query numbers | Smooth changes | - | Optimal when query number matches mesh resolution |
| Different Transformer layers | Has saturation point | - | Performance saturates at 4-6 layers |

Key Findings

The Shape Transformer reconstructs occluded regions noticeably better than the CNN/MLP baselines, because attention can use global context to infer the shape of invisible areas.
The textures generated by the Texture Transformer are more plausible in invisible regions: projection sampling produces severe artifacts on the back of the object, whereas the Texture Transformer infers texture through semantic correlation with visible regions.
AST performs especially well on the CUB birds dataset, since birds have rich texture patterns and distinct semantic part structures that the Transformer's query mechanism captures well.
AST also works across multiple ShapeNet categories, showing that the method is not limited to a specific category of shapes.

Highlights & Insights

Symmetric Design of Two Transformers: The Shape and Texture Transformers use almost identical architectures but solve different problems (shape vs. texture), reflecting the flexibility of Transformers as a general relationship-modeling tool. This "same tool, different tasks" design is worth borrowing.
Learnable Queries as 3D Priors: Shape and texture queries learn prior knowledge of specific parts during training; some queries consistently focus on the bird's head and others on the wings. The queries discover semantic parts through attention alone, with no explicit part annotations.
Eliminating the Inductive Bias of Projection Sampling: Traditional texture acquisition assumes that "the projection location is the correct texture source." The Texture Transformer drops this assumption and lets the network learn where to find texture information, which improves texture quality in invisible regions.

Limitations & Future Work

Relies on a template mesh (such as a sphere) as the initial shape, so the mesh deformation approach cannot handle objects whose topology differs significantly from the template (such as shapes with holes).
Currently only supports single-object reconstruction; multi-object scenes require detection and segmentation before sequential reconstruction.
The number of learnable queries is fixed, which limits the flexibility of mesh resolution. Dynamic query numbers or multi-resolution query schemes could be considered.
Although training only uses 2D annotations, camera parameter estimation and 2D keypoint annotations are still required. A completely annotation-free setting is a more challenging future direction.
Extending the Texture Transformer into a generative model (e.g., combining with diffusion models) to generate consistent textures across more views could be explored.
The capability to generate high-frequency texture details (such as fur and stripes) is limited by the number and resolution of queries.

vs. CMR (Category-Specific Mesh Reconstruction): CMR directly regresses vertex displacements using CNN global features, lacking pixel-level correspondence; AST establishes fine-grained pixel-vertex mapping through the Shape Transformer, performing significantly better, especially in occluded regions.
vs. UMR: UMR uses texture flow to improve texture quality, but is still based on the projection sampling paradigm; AST's Texture Transformer completely discards the projection assumption, gathering non-local texture information through attention.
vs. DIB-R: DIB-R uses differentiable rendering for end-to-end training, but its shape prediction still relies on simple deformation networks; AST introduces Transformers to the deformation process, providing stronger modeling capabilities.
This work demonstrates the great potential of Transformers in 2D-to-3D modeling, where the query mechanism is particularly suitable for modeling cross-dimensional (2D→3D) correspondences.

Rating

Novelty: ⭐⭐⭐⭐ Introducing the query mechanism of Transformers into the analysis-by-synthesis 3D reconstruction paradigm, with an innovative dual-branch design for Shape and Texture.
Experimental Thoroughness: ⭐⭐⭐ Validated on two datasets with relatively complete ablations, though lacking comparison with the latest methods.
Writing Quality: ⭐⭐⭐⭐ Clear problem analysis and smooth method description, although some technical details require referring to the supplementary materials.
Value: ⭐⭐⭐⭐ Provides a new Transformer-based scheme for 2D-supervised 3D reconstruction, and the discovery of semantic parts via queries is inspiring.