CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model¶

Conference: ECCV 2024
arXiv: 2403.05034
Code: Yes (https://github.com/thu-ml/CRM)
Area: 3D Vision
Keywords: Single-image 3D generation, Convolutional reconstruction, Triplane, FlexiCubes, Multi-view diffusion

TL;DR¶

This paper proposes CRM (Convolutional Reconstruction Model), which exploits the spatial alignment between a triplane and six orthographic views. It replaces the Transformer with a U-Net that maps the six views directly to a triplane, and uses FlexiCubes so that the model can be trained end to end on textured meshes. CRM generates a high-fidelity textured mesh from a single image in about 10 seconds, at roughly 1/8 of LRM's training cost.

Background & Motivation¶

Feed-forward 3D generation models (such as LRM) demonstrate extremely fast generation speeds but suffer from the following limitations:

Transformer architectures do not leverage geometric priors: LRM-based methods use Transformers to generate triplane patches but fail to exploit the spatial alignment relationship between the triplane and the input images.

Scarcity of 3D data: The largest 3D dataset, Objaverse, contains only around one million objects, which is far smaller than LAION's 5 billion images. Therefore, incorporating prior knowledge into the architecture is particularly crucial.

Non-end-to-end training: Methods using NeRF or Gaussian Splatting as representations require additional post-processing steps to obtain textured meshes.

High training costs: LRM requires a batch size of 1024 and substantial GPU resources.

Key Observation: A visualized triplane is inherently spatially aligned with six orthographic views (front, back, left, right, top, bottom): their silhouettes and textures line up naturally. This motivates replacing the Transformer with a convolutional U-Net, whose convolutions preserve pixel-level correspondence between input and output.

Method¶

Overall Architecture¶

CRM inference pipeline (approx. 10 seconds):

Input single image → Multi-view diffusion model generates six orthographic views (~5s)
Another diffusion model generates a Canonical Coordinate Map (CCM) (~1s)
Six views + CCM → Convolutional U-Net → rolled-out triplane → MLP decoding → FlexiCubes → Textured mesh (~4s)

Key Designs¶

1. Spatial Alignment of Six Orthographic Views¶

Key Insight: The three planes of a triplane (\(xy\), \(xz\), \(yz\)) are spatially aligned with the orthographic views in their respective directions. Therefore:

Six orthographic views (front/back/left/right/top/bottom) are selected as the input for reconstruction, naturally corresponding to the triplane structure.
The six views are split into two groups of three by position, and each group is concatenated side by side into a 256×768 image. Doing this for both the RGB views and their CCMs gives four such three-channel images, which are stacked into a 12-channel input.
The U-Net directly maps this input to a rolled-out triplane.
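
The channel count above is simple bookkeeping. The sketch below is a minimal illustration, assuming 256×256 views, that RGB and CCM strips are built separately for each group, and that the strips are stacked along the channel axis. The function name and its parameters are hypothetical and are not taken from the CRM code.

```python
def rolled_out_input_shape(view_size=256, views_per_group=3, groups=2,
                           modalities=("rgb", "ccm"), channels_per_image=3):
    """Return (channels, height, width) of the U-Net input under the assumed layout."""
    height = view_size
    # Three views side by side form one strip.
    width = view_size * views_per_group
    # One strip per (group, modality) pair; the RGB/CCM split is an assumption.
    strips = groups * len(modalities)
    channels = strips * channels_per_image
    return channels, height, width


print(rolled_out_input_shape())  # (12, 256, 768)
```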

2. Replacing Transformer with Convolutional U-Net¶

A pixel-aligned U-Net architecture is used instead of a Transformer:

Channel configuration: [64, 128, 128, 256, 256, 512, 512]
Self-attention blocks are added at resolutions [32, 16, 8]
Approximately 300M parameters

Advantages:

Larger bandwidth: The U-shaped structure is superior to Transformers in retaining input details, producing finer triplane features.
Extremely fast convergence: Reasonable reconstruction results emerge in only 280 iterations (20 minutes).
High training efficiency: The batch size is reduced to only 32 (compared to 1024 for LRM), requiring only 6 days of training on 8 A800 GPUs.
Total training cost is only 1/8 of LRM.
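
As a rough sanity check on these figures, the arithmetic can be written out. This is only a back-of-the-envelope sketch: it assumes a constant time per iteration, ignores differences between GPU models, and derives the LRM budget purely from the stated 1/8 ratio. The 110K-step count is the one given in the training details of this summary.

```python
def gpu_days(num_gpus, days):
    """Training budget expressed in GPU-days."""
    return num_gpus * days


crm_budget = gpu_days(8, 6)
implied_lrm_budget = crm_budget * 8   # implied by the 1/8 ratio, not measured

# 280 iterations in 20 minutes, assuming a constant step time.
seconds_per_iter = 20 * 60 / 280
days_for_110k_steps = 110_000 * seconds_per_iter / 86_400

print(crm_budget)                     # 48
print(implied_lrm_budget)             # 384
print(f"{seconds_per_iter:.2f}")      # 4.29
print(f"{days_for_110k_steps:.2f}")   # 5.46
```

Under these assumptions the extrapolated run time of about 5.5 days is consistent with the reported 6 days.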

3. Canonical Coordinate Map (CCM)¶

The CCM contains the 3D coordinates in canonical space for each pixel (3 channels, valued in [0,1]), providing crucial geometric information.

Generated by a second diffusion model conditioned on the six views.
Concatenated with RGB images and fed into the U-Net.
Ablation studies demonstrate that geometric quality drops significantly without CCM input, especially for complex geometries.
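
To make the CCM encoding concrete, here is a minimal sketch of how a surface point could be turned into a CCM pixel value. The canonical bounding box of [-0.5, 0.5] per axis and both function names are assumptions made for illustration; the summary only states that values lie in [0,1] and occupy 3 channels.

```python
def canonical_coordinate(point, box_min=(-0.5, -0.5, -0.5), box_max=(0.5, 0.5, 0.5)):
    """Map a 3D point inside the (assumed) canonical box to CCM values in [0, 1]."""
    return tuple(
        min(1.0, max(0.0, (p - lo) / (hi - lo)))  # normalise, then clamp
        for p, lo, hi in zip(point, box_min, box_max)
    )


def to_rgb8(ccm_value):
    """Store a CCM value as an 8-bit, 3-channel pixel (x -> R, y -> G, z -> B)."""
    return tuple(round(255 * c) for c in ccm_value)


print(canonical_coordinate((0.0, 0.0, 0.0)))    # (0.5, 0.5, 0.5)
print(canonical_coordinate((-0.5, 0.5, 0.0)))   # (0.0, 1.0, 0.5)
```

Because every pixel carries the 3D position of the surface it shows, the U-Net receives explicit geometry alongside appearance.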

4. FlexiCubes End-to-End Training¶

FlexiCubes (grid size 80) is used instead of NeRF/Gaussian Splatting.
Meshes are directly extracted during training using Dual Marching Cubes.
An MLP decodes the triplane features into SDF, deformations, weights, and colors.
Achieves end-to-end training with the textured mesh as the final output.
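
Before the MLP can decode anything, each query point has to be looked up in the triplane. The sketch below shows only that lookup step, under assumptions that are not stated in this summary: points live in [-1, 1]^3, the rolled-out triplane places the xy, xz and yz planes side by side in that order, and nearest-pixel sampling is used. The MLP itself and the FlexiCubes mesh extraction are omitted.

```python
# Axis indices projected onto each plane; the plane order is an assumption.
PLANES = {"xy": (0, 1), "xz": (0, 2), "yz": (1, 2)}


def rolled_out_pixel(point, res=256):
    """Return {plane: (row, col)} for a point in [-1, 1]^3 in a res x (3*res) rollout."""
    coords = {}
    for idx, (name, (a, b)) in enumerate(PLANES.items()):
        # Orthographic projection: drop one axis, rescale the other two to pixels.
        u = (point[a] + 1.0) / 2.0 * (res - 1)
        v = (point[b] + 1.0) / 2.0 * (res - 1)
        # Each plane occupies its own res-wide block of columns.
        coords[name] = (round(v), idx * res + round(u))
    return coords


print(rolled_out_pixel((-1.0, -1.0, -1.0)))
# {'xy': (0, 0), 'xz': (0, 256), 'yz': (0, 512)}
```

In a full model, the features found at these three locations would be combined and passed to the MLP that predicts SDF, deformation, weights, and color.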

5. Training Enhancements for the Multi-view Diffusion Model¶

Fine-tuned based on ImageDream and expanded to 6 views.
Zero-SNR: Resolves the discrepancy between the initial noise during sampling and the noisiest training samples.
Random Resizing: Prevents the model from consistently generating objects that fill the entire image.
Silhouette Augmentation: Randomly changes silhouette colors to prevent the colors on the backside from overly depending on the input silhouette.
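
This summary does not say how Zero-SNR is implemented in CRM. One widely used recipe rescales the noise schedule so that the cumulative signal coefficient reaches exactly zero at the last timestep; the sketch below illustrates that recipe in plain Python. The linear beta schedule and both function names are assumptions chosen for illustration.

```python
import math


def linear_betas(num_steps=1000, start=1e-4, end=0.02):
    """A plain linear beta schedule (illustrative choice only)."""
    return [start + (end - start) * t / (num_steps - 1) for t in range(num_steps)]


def rescale_zero_terminal_snr(betas):
    """Rescale betas so that the final cumulative alpha (and hence the SNR) is zero."""
    alphas_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas_bar.append(running)
    sqrt_ab = [math.sqrt(a) for a in alphas_bar]
    first, last = sqrt_ab[0], sqrt_ab[-1]
    # Shift so the last value becomes 0, then scale so the first is unchanged.
    sqrt_ab = [(s - last) * first / (first - last) for s in sqrt_ab]
    new_ab = [s * s for s in sqrt_ab]
    # Convert cumulative products back to per-step alphas, then to betas.
    new_alphas = [new_ab[0]] + [new_ab[t] / new_ab[t - 1] for t in range(1, len(new_ab))]
    return [1.0 - a for a in new_alphas]


betas = rescale_zero_terminal_snr(linear_betas())
print(betas[-1])  # 1.0, i.e. the last training step is pure noise
```

With the terminal step being pure noise, the noisiest training sample matches the Gaussian noise that sampling starts from.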

Loss & Training¶

\[\mathcal{L} = \mathcal{L}_{MSE}(x, x^{GT}) + \lambda_{LPIPS}\mathcal{L}_{LPIPS}(x, x^{GT}) + \lambda_{depth}\mathcal{L}_{MSE}(x_{depth}, x_{depth}^{GT}) + \lambda_{mask}\mathcal{L}_{MSE}(x_{mask}, x_{mask}^{GT}) + \lambda_{reg}\mathcal{L}_{reg}\]

\(\lambda_{LPIPS}{=}0.1\), \(\lambda_{depth}{=}0.5\), \(\lambda_{mask}{=}0.5\), \(\lambda_{reg}{=}0.005\)
Each shape is supervised by randomly sampling 8 views (out of 16 in total).
Small Gaussian noise is added to the input to enhance robustness against multi-view inconsistency.
The reconstruction model is trained for 110K steps, and the diffusion model is trained for 10K steps (with 12 gradient accumulation steps, effective batch size = 1536).
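
The objective and the training bookkeeping above can be sketched as follows. The individual loss terms are passed in as plain numbers because their computation (rendering, LPIPS, the regularizer) is outside the scope of this sketch; all function names are hypothetical.

```python
import random

# Weights as listed above; the RGB MSE term has an implicit weight of 1.
LOSS_WEIGHTS = {"lpips": 0.1, "depth": 0.5, "mask": 0.5, "reg": 0.005}


def total_loss(mse_rgb, lpips, mse_depth, mse_mask, reg, weights=LOSS_WEIGHTS):
    """Weighted sum of the five loss terms."""
    return (mse_rgb
            + weights["lpips"] * lpips
            + weights["depth"] * mse_depth
            + weights["mask"] * mse_mask
            + weights["reg"] * reg)


def sample_supervision_views(num_available=16, num_sampled=8, rng=random):
    """Pick the view indices used to supervise one shape."""
    return sorted(rng.sample(range(num_available), num_sampled))


def per_step_batch(effective_batch=1536, accumulation_steps=12):
    """Batch size per optimizer micro-step implied by gradient accumulation."""
    return effective_batch // accumulation_steps


print(per_step_batch())  # 128
```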

Key Experimental Results¶

Geometric Quality (GSO Dataset)¶

Method	Chamfer Dist.↓	Vol. IoU↑	F-Score (%)↑
One-2-3-45	0.0172	0.4463	72.19
SyncDreamer	0.0140	0.3900	75.74
Wonder3D	0.0186	0.4398	76.75
LGM	0.0117	0.4685	68.69
CRM (Ours)	0.0094	0.6131	79.38

Texture Quality (GSO Dataset)¶

Method	PSNR↑	SSIM↑	LPIPS↓	CLIP-Sim↑
OpenLRM	14.30	0.8294	0.2276	84.20
Magic123	12.69	0.7984	0.2442	85.16
LGM	13.28	0.7946	0.2560	85.20
CRM (Ours)	16.22	0.8381	0.2143	87.55

Multi-view Diffusion Quality¶

Method	PSNR↑	SSIM↑	LPIPS↓
SyncDreamer	20.30	0.7804	0.2932
Wonder3D	23.76	0.8127	0.2210
CRM (Ours)	29.36	0.8721	0.1354

Ablation Study¶

Impact of CCM: Without CCM input, the geometry degrades significantly, especially for complex structures (e.g., animals).

Multi-view Diffusion Training Techniques:

Method	PSNR↑	SSIM↑	LPIPS↓
ImageDream (6 view)	28.99	0.8565	0.1497
+ Zero-SNR	29.13	0.8598	0.1498
+ Random Resizing	29.36	0.8721	0.1354

Key Findings¶

CRM outperforms all baseline methods across all geometric and texture metrics.
The training cost is only about 1/8 that of LRM: CRM trains on 8 A800 GPUs for 6 days (48 GPU-days).
Reasonable reconstructions emerge in only 280 iterations (20 minutes), showing that the spatial alignment prior greatly accelerates convergence.
The multi-view diffusion model's PSNR is 5.6 dB higher than Wonder3D's (29.36 vs. 23.76).
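
These claims can be checked directly against the reported numbers. The sketch below copies the geometry table and the two PSNR values from the tables above and recomputes the margins; the helper name is hypothetical and nothing here is a new measurement.

```python
# (Chamfer distance, volume IoU, F-score %) on GSO, copied from the geometry table.
GEOMETRY = {
    "One-2-3-45": (0.0172, 0.4463, 72.19),
    "SyncDreamer": (0.0140, 0.3900, 75.74),
    "Wonder3D": (0.0186, 0.4398, 76.75),
    "LGM": (0.0117, 0.4685, 68.69),
    "CRM": (0.0094, 0.6131, 79.38),
}


def crm_is_best(table, lower_is_better):
    """True if the CRM row strictly beats every other row on every metric."""
    crm = table["CRM"]
    for i, lower in enumerate(lower_is_better):
        others = [row[i] for name, row in table.items() if name != "CRM"]
        best_other = min(others) if lower else max(others)
        if (crm[i] >= best_other) if lower else (crm[i] <= best_other):
            return False
    return True


best_baseline_cd = min(row[0] for name, row in GEOMETRY.items() if name != "CRM")
cd_reduction = (best_baseline_cd - GEOMETRY["CRM"][0]) / best_baseline_cd
psnr_gap = round(29.36 - 23.76, 2)

print(crm_is_best(GEOMETRY, (True, False, False)))  # True
print(f"{cd_reduction:.1%}")                        # 19.7%
print(psnr_gap)                                     # 5.6
```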

Highlights & Insights¶

Triplane Spatial Alignment Prior: The central insight is that building the right prior into the architecture is more effective than stacking computational power.
U-Net > Transformer (for this task): The inductive bias of convolution is better suited than a general-purpose Transformer for pixel-alignment tasks.
End-to-end Mesh Output: FlexiCubes avoids the post-processing distortion associated with NeRF-to-mesh conversion.
Silhouette Augmentation Technique: While it does not improve quantitative metrics, it substantially enhances robustness for in-the-wild inputs.
Extremely Fast Convergence: Reasonable results can be obtained with only 20 minutes of training, showing that the prior plays a vital role.

Limitations & Future Work¶

Multi-view diffusion models cannot guarantee complete consistency; inconsistent images degrade the 3D quality.
The FlexiCubes grid resolution is limited to 80, which constrains ultra-fine geometric details.
Performance is limited for inputs with high pitch angles or non-standard FoVs (inherited from ImageDream).
The six views are fixed orthographic viewpoints, which might not be the optimal configuration for all objects.

Related Work & Insights¶

LRM: Pioneered Transformer-based triplane generation, but does not leverage spatial alignment priors.
LGM: Uses Gaussian Splatting representation, but requires an additional conversion step to obtain a mesh.
SyncDreamer/Wonder3D: Generate multi-view consistent images, but require test-time optimization for reconstruction.
Insight: When data is scarce, encoding domain priors into the architecture is more effective than data augmentation or scaling up models.

Rating¶

Attribute	Score (1-10)
Novelty	8
Technical Depth	7
Experimental Thoroughness	8
Writing Quality	8
Value	9
Total Score	8.0