CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model

Conference: ECCV 2024
arXiv: 2403.05034
Code: Yes (https://github.com/thu-ml/CRM)
Area: 3D Vision
Keywords: Single-image 3D generation, Convolutional reconstruction, Triplane, FlexiCubes, Multi-view diffusion

TL;DR

This paper proposes CRM (Convolutional Reconstruction Model), which exploits the spatial alignment between a triplane and six orthographic views. It replaces the Transformer with a U-Net that maps the six views directly to a triplane, and uses FlexiCubes for end-to-end training. CRM generates a high-fidelity textured mesh from a single image in about 10 seconds, at only 1/8 of the training cost of LRM.

Background & Motivation

Feed-forward 3D generation models (such as LRM) demonstrate extremely fast generation speeds but suffer from the following limitations:

Transformer architectures do not leverage geometric priors: LRM-based methods use Transformers to generate triplane patches but fail to exploit the spatial alignment relationship between the triplane and the input images.

Scarcity of 3D data: The largest 3D dataset, Objaverse, contains only around one million objects, which is far smaller than LAION's 5 billion images. Therefore, incorporating prior knowledge into the architecture is particularly crucial.

Non-end-to-end training: Methods using NeRF or Gaussian Splatting as representations require additional post-processing steps to obtain textured meshes.

High training costs: LRM requires a batch size of 1024 and substantial GPU resources.

Key Observation: The visualization of a triplane shares an inherent spatial alignment with six orthographic views (front, back, left, right, top, bottom)—their silhouettes and textures align naturally. This inspires replacing the Transformer with a convolutional U-Net, which possesses strong pixel-alignment capabilities.

Method

Overall Architecture

CRM inference pipeline (approx. 10 seconds):

  1. Input single image → Multi-view diffusion model generates six orthographic views (~5s)
  2. Another diffusion model generates a Canonical Coordinate Map (CCM) (~1s)
  3. Six views + CCM → Convolutional U-Net → rolled-out triplane → MLP decoding → FlexiCubes → Textured mesh (~4s)

Key Designs

1. Spatial Alignment of Six Orthographic Views

Key Insight: The three planes of a triplane (\(xy\), \(xz\), \(yz\)) are spatially aligned with the orthographic views in their respective directions. Therefore:

  • Six orthographic views (front/back/left/right/top/bottom) are selected as the input for reconstruction, naturally corresponding to the triplane structure.
  • The six views are split into two groups of three, and each group is concatenated side by side into a 256×768 strip. The CCMs (Key Design 3) are arranged the same way, giving four 3-channel strips that are stacked into a 12-channel input.
  • The U-Net directly maps this input to a rolled-out triplane.
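
The input layout described above can be sketched with NumPy. This is an illustrative shape walk-through, not the authors' code: the assignment of views to the two groups, the view names, and the helpers `make_strip` and `build_unet_input` are assumptions; only the 256×768 strip size and the 12-channel total come from the text.

```python
import numpy as np

H = W = 256  # per-view resolution implied by the 256x768 strips

# Hypothetical split of the six views into two strips of three (assumption).
GROUP_A = ("front", "right", "top")
GROUP_B = ("back", "left", "bottom")


def make_strip(images, names):
    """Concatenate three (3, H, W) images side by side into (3, H, 3*W)."""
    return np.concatenate([images[n] for n in names], axis=2)


def build_unet_input(rgb, ccm):
    """Stack 2 RGB strips and 2 CCM strips along channels -> (12, H, 3*W)."""
    strips = [
        make_strip(rgb, GROUP_A),
        make_strip(rgb, GROUP_B),
        make_strip(ccm, GROUP_A),
        make_strip(ccm, GROUP_B),
    ]
    return np.concatenate(strips, axis=0)


# Random stand-ins for the six generated views and their CCMs.
rng = np.random.default_rng(0)
views = GROUP_A + GROUP_B
rgb = {v: rng.random((3, H, W), dtype=np.float32) for v in views}
ccm = {v: rng.random((3, H, W), dtype=np.float32) for v in views}

x = build_unet_input(rgb, ccm)
print(x.shape)  # (12, 256, 768)
```

Because each view keeps its own 256-pixel-wide slot in the strip, a convolution sees every input pixel at the same location as the triplane pixel it should influence, which is the alignment the U-Net is meant to exploit.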

2. Replacing Transformer with Convolutional U-Net

A pixel-aligned U-Net architecture is used instead of a Transformer:

  • Channel configuration: [64, 128, 128, 256, 256, 512, 512]
  • Self-attention blocks are added at resolutions [32, 16, 8]
  • Approximately 300M parameters

Advantages:

  • Larger bandwidth: The U-shaped structure is superior to Transformers in retaining input details, producing finer triplane features.
  • Extremely fast convergence: Reasonable reconstruction results emerge in only 280 iterations (20 minutes).
  • High training efficiency: The batch size is reduced to only 32 (compared to 1024 for LRM), requiring only 6 days of training on 8 A800 GPUs.
  • Total training cost is only 1/8 of LRM.
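
The cost figures above can be put side by side with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes the reported 1/8 ratio can be read as a GPU-day ratio, which ignores differences between GPU models.

```python
# Figures stated in the text.
crm_gpus = 8
crm_days = 6
crm_batch_size = 32
lrm_batch_size = 1024
cost_ratio = 1 / 8  # CRM training cost relative to LRM, as reported

crm_gpu_days = crm_gpus * crm_days
# Implied LRM budget if the 1/8 ratio is a GPU-day ratio (assumption).
implied_lrm_gpu_days = crm_gpu_days / cost_ratio
batch_reduction = lrm_batch_size // crm_batch_size

print(crm_gpu_days)          # 48
print(implied_lrm_gpu_days)  # 384.0
print(batch_reduction)       # 32
```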

3. Canonical Coordinate Map (CCM)

The CCM contains the 3D coordinates in canonical space for each pixel (3 channels, valued in [0,1]), providing crucial geometric information.

  • Generated by a second diffusion model conditioned on the six views.
  • Concatenated with RGB images and fed into the U-Net.
  • Ablation studies demonstrate that geometric quality drops significantly without CCM input, especially for complex geometries.
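
To make the CCM definition concrete, the sketch below turns per-pixel 3D positions into a 3-channel image in [0, 1]. The canonical cube bounds of [-1, 1], the zero background, and the function name `canonical_coordinate_map` are assumptions for illustration; in CRM the CCM is produced by a diffusion model at inference time, not computed from known geometry.

```python
import numpy as np


def canonical_coordinate_map(xyz, mask, lo=-1.0, hi=1.0):
    """Map per-pixel 3D positions (H, W, 3) to a 3-channel image in [0, 1].

    xyz  : canonical-space position of the surface point seen at each pixel
    mask : (H, W) boolean foreground mask
    lo/hi: assumed bounds of the canonical cube (hypothetical defaults)
    """
    ccm = (xyz - lo) / (hi - lo)   # affine rescale to [0, 1]
    ccm = np.clip(ccm, 0.0, 1.0)
    ccm[~mask] = 0.0               # background convention (assumption)
    return ccm


# Toy 2x2 "image": three foreground pixels and one background pixel.
xyz = np.array([[[-1.0, -1.0, -1.0], [0.0, 0.0, 0.0]],
                [[1.0, 1.0, 1.0], [0.5, -0.5, 0.0]]])
mask = np.array([[True, True], [True, False]])

ccm = canonical_coordinate_map(xyz, mask)
print(ccm[0, 1])  # [0.5 0.5 0.5]
```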

4. FlexiCubes End-to-End Training

  • FlexiCubes (grid size 80) is used instead of NeRF/Gaussian Splatting.
  • Meshes are directly extracted during training using Dual Marching Cubes.
  • An MLP decodes the triplane features into SDF, deformations, weights, and colors.
  • Achieves end-to-end training with the textured mesh as the final output.
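
The decoding step can be sketched as follows: split the rolled-out triplane into its three planes, bilinearly sample each at the projections of a query point, and pass the result through a small MLP. Everything beyond that outline is an assumption made for illustration: the plane order in the roll-out, concatenation as the way to combine plane features, the tiny random MLP, and the 21 per-cell weights (taken from the FlexiCubes formulation). Mesh extraction itself is not shown.

```python
import numpy as np


def split_rolled_out(triplane):
    """Split a rolled-out triplane (C, R, 3R) into xy, xz, yz planes (C, R, R)."""
    c, r, w = triplane.shape
    assert w == 3 * r
    return triplane[:, :, :r], triplane[:, :, r:2 * r], triplane[:, :, 2 * r:]


def bilinear(plane, u, v):
    """Bilinearly sample a (C, R, R) plane at u, v in [-1, 1]."""
    r = plane.shape[1]
    x = (u + 1.0) * 0.5 * (r - 1)  # column coordinate
    y = (v + 1.0) * 0.5 * (r - 1)  # row coordinate
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, r - 1), min(y0 + 1, r - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * plane[:, y0, x0] + fx * plane[:, y0, x1]
    bot = (1 - fx) * plane[:, y1, x0] + fx * plane[:, y1, x1]
    return (1 - fy) * top + fy * bot


def query_features(triplane, p):
    """Project p = (x, y, z) onto the three planes and concatenate features."""
    xy, xz, yz = split_rolled_out(triplane)
    x, y, z = p
    return np.concatenate([bilinear(xy, x, y), bilinear(xz, x, z), bilinear(yz, y, z)])


def decode(feat, w1, b1, w2, b2):
    """Two-layer MLP producing SDF, deformation, cell weights, and color."""
    h = np.maximum(feat @ w1 + b1, 0.0)  # ReLU hidden layer
    out = h @ w2 + b2
    return {"sdf": out[0], "deform": out[1:4], "weights": out[4:25], "color": out[25:28]}


rng = np.random.default_rng(0)
C, R, HID = 4, 8, 16  # toy sizes, far smaller than a real model
tri = rng.standard_normal((C, R, 3 * R))
w1, b1 = rng.standard_normal((3 * C, HID)), np.zeros(HID)
w2, b2 = rng.standard_normal((HID, 28)), np.zeros(28)

out = decode(query_features(tri, (0.1, -0.3, 0.7)), w1, b1, w2, b2)
print({k: np.shape(v) for k, v in out.items()})
```

In the actual pipeline these quantities would be evaluated on the FlexiCubes grid, and the extracted mesh would be rendered and compared against ground-truth views, which is what makes the training end-to-end.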

5. Training Enhancements for the Multi-view Diffusion Model

  • Fine-tuned based on ImageDream and expanded to 6 views.
  • Zero-SNR: Resolves the discrepancy between the initial noise during sampling and the noisiest training samples.
  • Random Resizing: Prevents the model from consistently generating objects that fill the entire image.
  • Silhouette Augmentation: Randomly changes silhouette colors to prevent the colors on the backside from overly depending on the input silhouette.
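
As an illustration of the Zero-SNR idea, the sketch below applies a commonly used schedule-rescaling recipe: shift and scale the square root of the cumulative alpha product so that the final timestep carries pure noise. The linear beta schedule and the helper names are assumptions, and the text does not state that CRM uses exactly this recipe.

```python
import math


def linear_betas(n, beta_start=1e-4, beta_end=0.02):
    """A standard linear beta schedule (hypothetical defaults)."""
    step = (beta_end - beta_start) / (n - 1)
    return [beta_start + i * step for i in range(n)]


def rescale_zero_terminal_snr(betas):
    """Rescale betas so that alpha_bar at the last step is exactly 0."""
    # sqrt of the cumulative product of (1 - beta)
    abar_sqrt = []
    prod = 1.0
    for b in betas:
        prod *= 1.0 - b
        abar_sqrt.append(math.sqrt(prod))
    first, last = abar_sqrt[0], abar_sqrt[-1]
    # Shift so the last value is 0, then scale so the first value is kept.
    abar_sqrt = [(a - last) * first / (first - last) for a in abar_sqrt]
    abar = [a * a for a in abar_sqrt]
    # Convert the cumulative products back to per-step betas.
    alphas = [abar[0]] + [abar[i] / abar[i - 1] for i in range(1, len(abar))]
    return [1.0 - a for a in alphas]


betas = linear_betas(1000)
new_betas = rescale_zero_terminal_snr(betas)

abar_T_old = math.prod(1.0 - b for b in betas)
abar_T_new = math.prod(1.0 - b for b in new_betas)
print(abar_T_old > 0.0, abar_T_new == 0.0)  # True True
```

With the original schedule a small amount of signal survives at the last training step, while sampling starts from pure noise; after rescaling, the two match.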

Loss & Training

\[\mathcal{L} = \mathcal{L}_{MSE}(x, x^{GT}) + \lambda_{LPIPS}\mathcal{L}_{LPIPS}(x, x^{GT}) + \lambda_{depth}\mathcal{L}_{MSE}(x_{depth}, x_{depth}^{GT}) + \lambda_{mask}\mathcal{L}_{MSE}(x_{mask}, x_{mask}^{GT}) + \lambda_{reg}\mathcal{L}_{reg}\]
  • \(\lambda_{LPIPS}{=}0.1\), \(\lambda_{depth}{=}0.5\), \(\lambda_{mask}{=}0.5\), \(\lambda_{reg}{=}0.005\)
  • Each shape is supervised by randomly sampling 8 views (out of 16 in total).
  • Small Gaussian noise is added to the input to enhance robustness against multi-view inconsistency.
  • The reconstruction model is trained for 110K steps, and the diffusion model is trained for 10K steps (with 12 gradient accumulation steps, effective batch size = 1536).
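
The weighted objective above can be written out as a small function. This is a minimal sketch with flat Python lists standing in for rendered images; the LPIPS and regularization terms are passed in as precomputed numbers because their computation is outside the scope of the example, and the function names are illustrative.

```python
# Loss weights as listed in the text.
WEIGHTS = {"lpips": 0.1, "depth": 0.5, "mask": 0.5, "reg": 0.005}


def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(rgb, rgb_gt, depth, depth_gt, mask, mask_gt,
               lpips_value, reg_value, w=WEIGHTS):
    """Combine the RGB, LPIPS, depth, mask, and regularization terms."""
    return (mse(rgb, rgb_gt)
            + w["lpips"] * lpips_value
            + w["depth"] * mse(depth, depth_gt)
            + w["mask"] * mse(mask, mask_gt)
            + w["reg"] * reg_value)


loss = total_loss(
    rgb=[0.5, 0.5], rgb_gt=[0.0, 1.0],      # MSE = 0.25
    depth=[1.0, 2.0], depth_gt=[1.0, 1.0],  # MSE = 0.5
    mask=[1.0, 0.0], mask_gt=[1.0, 0.0],    # MSE = 0.0
    lpips_value=0.2, reg_value=2.0,
)
print(round(loss, 4))  # 0.53

# Diffusion training: batch per optimizer micro-step implied by the text.
per_step_batch = 1536 // 12
print(per_step_batch)  # 128
```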

Key Experimental Results

Geometric Quality (GSO Dataset)

| Method | Chamfer Dist.↓ | Vol. IoU↑ | F-Score (%)↑ |
|---|---|---|---|
| One-2-3-45 | 0.0172 | 0.4463 | 72.19 |
| SyncDreamer | 0.0140 | 0.3900 | 75.74 |
| Wonder3D | 0.0186 | 0.4398 | 76.75 |
| LGM | 0.0117 | 0.4685 | 68.69 |
| CRM (Ours) | 0.0094 | 0.6131 | 79.38 |
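
For readers unfamiliar with the geometric metrics, the sketch below computes a Chamfer distance and an F-score between two small point sets. Conventions differ between papers (squared vs. unsquared distances, sum vs. mean of the two directions, choice of threshold), so the specific choices here are assumptions and will not reproduce the numbers in the table.

```python
import math


def nn_dists(src, dst):
    """Distance from each point in src to its nearest neighbour in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(pred, gt):
    """Sum of the two mean nearest-neighbour distances (one convention)."""
    d_pg = nn_dists(pred, gt)
    d_gp = nn_dists(gt, pred)
    return sum(d_pg) / len(d_pg) + sum(d_gp) / len(d_gp)


def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = sum(d <= tau for d in nn_dists(pred, gt)) / len(pred)
    recall = sum(d <= tau for d in nn_dists(gt, pred)) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]

print(round(chamfer_distance(pred, gt), 4))  # 0.3333
print(round(f_score(pred, gt, tau=0.1), 4))  # 0.6667
```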

Texture Quality (GSO Dataset)

| Method | PSNR↑ | SSIM↑ | LPIPS↓ | CLIP-Sim↑ |
|---|---|---|---|---|
| OpenLRM | 14.30 | 0.8294 | 0.2276 | 84.20 |
| Magic123 | 12.69 | 0.7984 | 0.2442 | 85.16 |
| LGM | 13.28 | 0.7946 | 0.2560 | 85.20 |
| CRM (Ours) | 16.22 | 0.8381 | 0.2143 | 87.55 |

Multi-view Diffusion Quality

| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| SyncDreamer | 20.30 | 0.7804 | 0.2932 |
| Wonder3D | 23.76 | 0.8127 | 0.2210 |
| CRM (Ours) | 29.36 | 0.8721 | 0.1354 |
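
Since PSNR is logarithmic, a gap in dB corresponds to a ratio of mean squared errors. The sketch below defines PSNR for values in [0, 1] and converts the CRM vs. Wonder3D gap from the table into such a ratio. Treating the dataset-averaged PSNR gap as if it applied to a single image is a simplifying assumption.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for flat lists of pixel values."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_val ** 2 / mse)


# Toy example: every pixel is off by 0.1, so MSE = 0.01.
print(round(psnr([0.5, 0.5], [0.4, 0.6]), 2))  # 20.0

# PSNR values taken from the table above.
gap_db = 29.36 - 23.76
mse_ratio = 10 ** (gap_db / 10)
print(round(gap_db, 2))     # 5.6
print(round(mse_ratio, 2))  # 3.63
```

Under that assumption, the 5.6 dB gap means the generated views have roughly 3.6 times lower mean squared error.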

Ablation Study

Impact of CCM: Without CCM input, the geometry degrades significantly, especially for complex structures (e.g., animals).

Multi-view Diffusion Training Techniques:

| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| ImageDream (6 view) | 28.99 | 0.8565 | 0.1497 |
| + Zero-SNR | 29.13 | 0.8598 | 0.1498 |
| + Random Resizing | 29.36 | 0.8721 | 0.1354 |

Key Findings

  1. CRM outperforms all baseline methods across all geometric and texture metrics.
  2. The training cost is only 1/8 of LRM: CRM trains for 6 days on 8 A800 GPUs with a batch size of 32, whereas LRM uses a batch size of 1024.
  3. Reasonable reconstructions emerge in only 280 iterations (20 minutes), showing that the spatial alignment prior greatly accelerates convergence.
  4. The multi-view diffusion model reaches a PSNR 5.60 dB higher than Wonder3D (29.36 vs. 23.76).

Highlights & Insights

  1. Triplane Spatial Alignment Prior: The biggest insight—incorporating correct priors into the architecture is more effective than stacking computational power.
  2. U-Net > Transformer (for this task): The inductive bias of convolution is better suited than a general-purpose Transformer for pixel-alignment tasks.
  3. End-to-end Mesh Output: FlexiCubes avoids the post-processing distortion associated with NeRF-to-mesh conversion.
  4. Silhouette Augmentation Technique: While it does not improve quantitative metrics, it substantially enhances robustness for in-the-wild inputs.
  5. Extremely Fast Convergence: Reasonable results can be obtained with only 20 minutes of training, showing that the prior plays a vital role.

Limitations & Future Work

  1. Multi-view diffusion models cannot guarantee complete consistency; inconsistent images degrade the 3D quality.
  2. The FlexiCubes grid resolution is limited to 80, which constrains ultra-fine geometric details.
  3. Performance is limited for inputs with high pitch angles or non-standard FoVs (inherited from ImageDream).
  4. The six views are fixed orthographic viewpoints, which might not be the optimal configuration for all objects.

Related Work & Insights

  • LRM: Pioneered Transformer-based triplane generation, but does not leverage spatial alignment priors.
  • LGM: Uses Gaussian Splatting representation, but requires an additional conversion step to obtain a mesh.
  • SyncDreamer/Wonder3D: Multi-view consistent generation, but requires test-time optimization for reconstruction.
  • Insight: When data is scarce, encoding domain priors into the architecture is more effective than data augmentation or scaling up models.

Rating

| Attribute | Score (1-10) |
|---|---|
| Novelty | 8 |
| Technical Depth | 7 |
| Experimental Thoroughness | 8 |
| Writing Quality | 8 |
| Value | 9 |
| Total Score | 8.0 |