CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model¶
Conference: ECCV 2024
arXiv: 2403.05034
Code: Yes (https://github.com/thu-ml/CRM)
Area: 3D Vision
Keywords: Single-image 3D generation, Convolutional reconstruction, Triplane, FlexiCubes, Multi-view diffusion
TL;DR¶
This paper proposes CRM (Convolutional Reconstruction Model), which exploits the spatial alignment between a triplane and six orthographic views. It replaces the Transformer with a U-Net that maps the six views directly to a triplane, and uses FlexiCubes for end-to-end training. CRM generates a high-fidelity textured mesh from a single image in about 10 seconds, with only 1/8 of the training cost of LRM.
Background & Motivation¶
Feed-forward 3D generation models (such as LRM) demonstrate extremely fast generation speeds but suffer from the following limitations:
- Transformer architectures do not leverage geometric priors: LRM-based methods use Transformers to generate triplane patches but fail to exploit the spatial alignment relationship between the triplane and the input images.
- Scarcity of 3D data: The largest 3D dataset, Objaverse, contains only around one million objects, which is far smaller than LAION's 5 billion images. Therefore, incorporating prior knowledge into the architecture is particularly crucial.
- Non-end-to-end training: Methods using NeRF or Gaussian Splatting as representations require additional post-processing steps to obtain textured meshes.
- High training costs: LRM requires a batch size of 1024 and substantial GPU resources.
Key Observation: A visualized triplane is inherently spatially aligned with six orthographic views (front, back, left, right, top, bottom): their silhouettes and textures line up naturally. This motivates replacing the Transformer with a convolutional U-Net, which has strong pixel-alignment capabilities.
Method¶
Overall Architecture¶
CRM inference pipeline (approx. 10 seconds):
- Input single image → Multi-view diffusion model generates six orthographic views (~5s)
- A second diffusion model, conditioned on the six views, generates the Canonical Coordinate Maps (CCMs) (~1s)
- Six views + CCM → Convolutional U-Net → rolled-out triplane → MLP decoding → FlexiCubes → Textured mesh (~4s)
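The three stages above can be written down as a small time budget. This is only a bookkeeping sketch: the `Stage` class and the stage labels are hypothetical names introduced here, and the timings are the approximate figures quoted in the list, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of the inference pipeline with its approximate duration."""
    name: str
    seconds: float  # rough wall-clock time quoted in the summary


# Stage names are illustrative labels, not identifiers from the CRM code base.
PIPELINE = [
    Stage("multi-view diffusion: image -> six orthographic views", 5.0),
    Stage("CCM diffusion: six views -> canonical coordinate maps", 1.0),
    Stage("U-Net + MLP + FlexiCubes: views and CCMs -> textured mesh", 4.0),
]


def total_seconds(stages):
    """Sum the approximate per-stage durations."""
    return sum(stage.seconds for stage in stages)


if __name__ == "__main__":
    for stage in PIPELINE:
        print(f"{stage.seconds:4.1f}s  {stage.name}")
    print(f"total: {total_seconds(PIPELINE):.1f}s")
```

The budget shows that the two diffusion stages account for a bit more than half of the total, so the reconstruction network itself is not the dominant cost at inference time.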
Key Designs¶
1. Spatial Alignment of Six Orthographic Views¶
Key Insight: The three planes of a triplane (\(xy\), \(xz\), \(yz\)) are spatially aligned with the orthographic views in their respective directions. Therefore:
- Six orthographic views (front/back/left/right/top/bottom) are selected as the input for reconstruction, naturally corresponding to the triplane structure.
- The six views are split into two groups of three by position, and each group is concatenated side by side into a 256×768 image. The CCMs are arranged the same way, so four 3-channel strips (two RGB, two CCM) are stacked along the channel axis to form a 12-channel input.
- The U-Net directly maps this input to a rolled-out triplane.
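The input layout described above can be sketched with NumPy. This is a minimal illustration under stated assumptions: the view ordering in `VIEW_ORDER`, the split into the first three and last three views, and the function names are all hypothetical; only the 256×768 strip size and the 12-channel total come from the summary.

```python
import numpy as np

# Assumed ordering; the actual grouping used by CRM may differ.
VIEW_ORDER = ["front", "back", "left", "right", "top", "bottom"]


def make_strips(views):
    """Turn (6, H, W, 3) views into two (H, 3W, 3) strips of three views each."""
    first = np.concatenate(list(views[:3]), axis=1)   # concatenate along width
    second = np.concatenate(list(views[3:]), axis=1)
    return first, second


def build_unet_input(rgb_views, ccm_views):
    """Stack two RGB strips and two CCM strips along the channel axis."""
    rgb_a, rgb_b = make_strips(rgb_views)
    ccm_a, ccm_b = make_strips(ccm_views)
    return np.concatenate([rgb_a, rgb_b, ccm_a, ccm_b], axis=-1)


# Dummy data standing in for six 256x256 RGB views and their CCMs.
rgb = np.zeros((6, 256, 256, 3), dtype=np.float32)
ccm = np.zeros((6, 256, 256, 3), dtype=np.float32)
unet_input = build_unet_input(rgb, ccm)
print(unet_input.shape)  # (256, 768, 12)
```

Because each strip keeps every view at its original pixel positions, a convolutional network sees inputs that are already laid out like the rolled-out triplane it has to produce.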
2. Replacing Transformer with Convolutional U-Net¶
A pixel-aligned U-Net architecture is used instead of a Transformer:
- Channel configuration: [64, 128, 128, 256, 256, 512, 512]
- Self-attention blocks are added at resolutions [32, 16, 8]
- Approximately 300M parameters
Advantages:
- Larger bandwidth: The U-shaped structure is superior to Transformers in retaining input details, producing finer triplane features.
- Extremely fast convergence: Reasonable reconstruction results emerge in only 280 iterations (20 minutes).
- High training efficiency: The batch size is reduced to only 32 (compared to 1024 for LRM), requiring only 6 days of training on 8 A800 GPUs.
- Total training cost is only 1/8 of LRM.
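The efficiency figures can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in this summary (including the 110K reconstruction steps from the training section); the "implied" LRM budget is derived from the 1/8 ratio and is not a figure stated here, and the day estimate assumes the early iteration speed holds for the whole run.

```python
# Figures quoted in this summary.
CRM_GPUS = 8
CRM_DAYS = 6
CRM_BATCH = 32
LRM_BATCH = 1024
COST_RATIO = 8          # "1/8 of LRM"
EARLY_ITERS = 280       # iterations until reasonable reconstructions
EARLY_MINUTES = 20
TOTAL_STEPS = 110_000   # reconstruction model training steps

crm_gpu_days = CRM_GPUS * CRM_DAYS                      # 48
implied_lrm_gpu_days = crm_gpu_days * COST_RATIO        # 384, derived from the ratio
batch_reduction = LRM_BATCH // CRM_BATCH                # 32x smaller batch
seconds_per_iter = EARLY_MINUTES * 60 / EARLY_ITERS     # about 4.29 s
estimated_days = TOTAL_STEPS * seconds_per_iter / 86_400

print(f"CRM budget: {crm_gpu_days} GPU-days")
print(f"LRM budget implied by the 1/8 ratio: {implied_lrm_gpu_days} GPU-days")
print(f"batch size reduction: {batch_reduction}x")
print(f"estimated training time: {estimated_days:.2f} days")
```

The estimate lands at roughly five and a half days, which is consistent with the quoted six days. Note that GPU-days ignore differences between GPU models, so the comparison is only approximate.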
3. Canonical Coordinate Map (CCM)¶
The CCM contains the 3D coordinates in canonical space for each pixel (3 channels, valued in [0,1]), providing crucial geometric information.
- Generated by a second diffusion model conditioned on the six views.
- Concatenated with RGB images and fed into the U-Net.
- Ablation studies demonstrate that geometric quality drops significantly without CCM input, especially for complex geometries.
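To make the CCM encoding concrete, the sketch below maps a 3D point to the three [0,1] channel values stored at its pixel. The canonical bounding box of [-1, 1] on each axis and the function names are assumptions made for illustration; the summary only states that the three channels hold canonical coordinates valued in [0,1].

```python
def to_ccm_value(point, box_min=(-1.0, -1.0, -1.0), box_max=(1.0, 1.0, 1.0)):
    """Map a canonical-space point to three channel values in [0, 1].

    box_min / box_max describe an assumed canonical bounding box.
    """
    values = []
    for p, lo, hi in zip(point, box_min, box_max):
        t = (p - lo) / (hi - lo)
        values.append(min(1.0, max(0.0, t)))  # clamp points outside the box
    return tuple(values)


def from_ccm_value(value, box_min=(-1.0, -1.0, -1.0), box_max=(1.0, 1.0, 1.0)):
    """Inverse mapping: recover the canonical-space point from a CCM pixel."""
    return tuple(lo + v * (hi - lo) for v, lo, hi in zip(value, box_min, box_max))


print(to_ccm_value((0.0, 0.0, 0.0)))    # (0.5, 0.5, 0.5)
print(from_ccm_value((1.0, 0.0, 0.5)))  # (1.0, -1.0, 0.0)
```

Read this way, a CCM is an image whose colors are positions: every foreground pixel tells the U-Net where in 3D the surface it shows is located, which is the geometric cue the ablation finds important.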
4. FlexiCubes End-to-End Training¶
- FlexiCubes (grid size 80) is used instead of NeRF/Gaussian Splatting.
- Meshes are directly extracted during training using Dual Marching Cubes.
- An MLP decodes the triplane features into SDF, deformations, weights, and colors.
- Achieves end-to-end training with the textured mesh as the final output.
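Before the MLP can decode SDF, deformation, weights, and color, it needs a feature vector for each queried 3D position. The sketch below shows one plausible way to gather that vector from a rolled-out triplane. The plane order (xy, xz, yz), nearest-pixel lookup instead of interpolation, concatenation of the three features, and all function names are assumptions for illustration, not details confirmed by the summary.

```python
import numpy as np


def split_rolled_out(triplane):
    """Split a (C, H, 3W) rolled-out triplane into three (C, H, W) planes."""
    _, _, width3 = triplane.shape
    w = width3 // 3
    return [triplane[:, :, i * w:(i + 1) * w] for i in range(3)]


def to_pixel(u, size):
    """Map a coordinate in [-1, 1] to the nearest pixel index in [0, size - 1]."""
    index = int(round((u + 1.0) * 0.5 * (size - 1)))
    return min(size - 1, max(0, index))


def query_point(triplane, point):
    """Gather one feature per plane for a 3D point and concatenate them."""
    x, y, z = point
    planes = split_rolled_out(triplane)            # assumed order: xy, xz, yz
    projections = ((x, y), (x, z), (y, z))
    features = []
    for plane, (a, b) in zip(planes, projections):
        _, h, w = plane.shape
        features.append(plane[:, to_pixel(b, h), to_pixel(a, w)])
    return np.concatenate(features)                # shape (3C,)


triplane = np.random.default_rng(0).normal(size=(32, 256, 768)).astype(np.float32)
feature = query_point(triplane, (0.0, 0.5, -0.5))
print(feature.shape)  # (96,)
```

In a full system this vector would be evaluated at the FlexiCubes grid locations and passed to the MLP heads; a real implementation would use differentiable bilinear sampling so gradients from the rendered mesh can flow back into the triplane.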
5. Training Enhancements for the Multi-view Diffusion Model¶
- Fine-tuned based on ImageDream and expanded to 6 views.
- Zero-SNR: Resolves the discrepancy between the initial noise during sampling and the noisiest training samples.
- Random Resizing: Prevents the model from consistently generating objects that fill the entire image.
- Silhouette Augmentation: Randomly changes silhouette colors to prevent the colors on the backside from overly depending on the input silhouette.
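The Zero-SNR item can be illustrated with the noise-schedule rescaling commonly used for this purpose: shift and scale the square root of the cumulative alpha product so that the last timestep carries no signal. Whether CRM applies exactly this recipe is an assumption; the function name and the linear example schedule are chosen here for illustration.

```python
import math


def rescale_zero_terminal_snr(betas):
    """Rescale a beta schedule so the final timestep has zero signal-to-noise ratio."""
    # Cumulative product of alphas, then its square root.
    sqrt_alpha_bar = []
    product = 1.0
    for beta in betas:
        product *= 1.0 - beta
        sqrt_alpha_bar.append(math.sqrt(product))

    first, last = sqrt_alpha_bar[0], sqrt_alpha_bar[-1]
    # Shift so the last value is zero, then scale so the first value is unchanged.
    sqrt_alpha_bar = [(s - last) * first / (first - last) for s in sqrt_alpha_bar]

    # Convert back to per-step betas.
    alpha_bar = [s * s for s in sqrt_alpha_bar]
    alphas = [alpha_bar[0]] + [
        alpha_bar[i] / alpha_bar[i - 1] for i in range(1, len(alpha_bar))
    ]
    return [1.0 - a for a in alphas]


# Example: a linear schedule with 1000 steps (values chosen for illustration).
steps = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (steps - 1) for i in range(steps)]
new_betas = rescale_zero_terminal_snr(betas)
print(new_betas[-1])  # 1.0
```

With the original schedule the noisiest training sample still contains a faint trace of the image, while sampling starts from pure noise. After rescaling, the final beta equals 1, so training at the last timestep also sees pure noise and the two regimes match.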
Loss & Training¶
- \(\lambda_{LPIPS}{=}0.1\), \(\lambda_{depth}{=}0.5\), \(\lambda_{mask}{=}0.5\), \(\lambda_{reg}{=}0.005\)
- Each shape is supervised by randomly sampling 8 views (out of 16 in total).
- Small Gaussian noise is added to the input to enhance robustness against multi-view inconsistency.
- The reconstruction model is trained for 110K steps, and the diffusion model is trained for 10K steps (with 12 gradient accumulation steps, effective batch size = 1536).
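The training settings above can be put together in a short sketch. It assumes the total objective is an unweighted RGB reconstruction term plus the four weighted terms; that form, the function names, and the per-step batch derived from the effective batch size are assumptions, while the weights and counts come from the list.

```python
import random

# Loss weights quoted in the summary.
LOSS_WEIGHTS = {"lpips": 0.1, "depth": 0.5, "mask": 0.5, "reg": 0.005}


def total_loss(rgb, lpips, depth, mask, reg, weights=LOSS_WEIGHTS):
    """Combine per-term losses; the RGB term is assumed to have weight 1."""
    return (
        rgb
        + weights["lpips"] * lpips
        + weights["depth"] * depth
        + weights["mask"] * mask
        + weights["reg"] * reg
    )


def sample_supervision_views(num_total=16, num_sampled=8, rng=random):
    """Pick the views used to supervise one shape in one iteration."""
    return sorted(rng.sample(range(num_total), num_sampled))


# Diffusion model: effective batch = per-step batch * accumulation steps.
EFFECTIVE_BATCH = 1536
ACCUMULATION_STEPS = 12
per_step_batch = EFFECTIVE_BATCH // ACCUMULATION_STEPS  # 128

print(sample_supervision_views(rng=random.Random(0)))
print(per_step_batch)
```

The regularization weight is two orders of magnitude smaller than the depth and mask weights, so the supervision is dominated by the rendered images, depth, and silhouettes.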
Key Experimental Results¶
Geometric Quality (GSO Dataset)¶
| Method | Chamfer Dist.↓ | Vol. IoU↑ | F-Score (%)↑ |
|---|---|---|---|
| One-2-3-45 | 0.0172 | 0.4463 | 72.19 |
| SyncDreamer | 0.0140 | 0.3900 | 75.74 |
| Wonder3D | 0.0186 | 0.4398 | 76.75 |
| LGM | 0.0117 | 0.4685 | 68.69 |
| CRM (Ours) | 0.0094 | 0.6131 | 79.38 |
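
For readers unfamiliar with the geometric metrics in this table, the sketch below computes a Chamfer distance and an F-score between two small point sets. Conventions differ between papers (squared versus unsquared distances, sum versus mean, choice of threshold), and the summary does not state which ones CRM uses, so the definitions and the threshold here are assumptions. The brute-force search is only suitable for tiny examples.

```python
import math


def nearest_distances(source, target):
    """Distance from each source point to its nearest target point."""
    return [min(math.dist(p, q) for q in target) for p in source]


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest distance in both directions."""
    d_ab = nearest_distances(a, b)
    d_ba = nearest_distances(b, a)
    return sum(d_ab) / len(d_ab) + sum(d_ba) / len(d_ba)


def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a distance threshold."""
    d_pred = nearest_distances(pred, gt)
    d_gt = nearest_distances(gt, pred)
    precision = sum(d <= threshold for d in d_pred) / len(d_pred)
    recall = sum(d <= threshold for d in d_gt) / len(d_gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1), (0.0, 1.0, 0.1)]
print(chamfer_distance(pred, gt))
print(f_score(pred, gt, threshold=0.2))
```

Lower Chamfer distance means the predicted surface lies closer to the ground truth on average, while the F-score reports what fraction of points fall within the tolerance in both directions.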
Texture Quality (GSO Dataset)¶
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | CLIP-Sim↑ |
|---|---|---|---|---|
| OpenLRM | 14.30 | 0.8294 | 0.2276 | 84.20 |
| Magic123 | 12.69 | 0.7984 | 0.2442 | 85.16 |
| LGM | 13.28 | 0.7946 | 0.2560 | 85.20 |
| CRM (Ours) | 16.22 | 0.8381 | 0.2143 | 87.55 |
Multi-view Diffusion Quality¶
| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| SyncDreamer | 20.30 | 0.7804 | 0.2932 |
| Wonder3D | 23.76 | 0.8127 | 0.2210 |
| CRM (Ours) | 29.36 | 0.8721 | 0.1354 |
Ablation Study¶
Impact of CCM: Without CCM input, the geometry degrades significantly, especially for complex structures (e.g., animals).
Multi-view Diffusion Training Techniques:
| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| ImageDream (6 view) | 28.99 | 0.8565 | 0.1497 |
| + Zero-SNR | 29.13 | 0.8598 | 0.1498 |
| + Random Resizing | 29.36 | 0.8721 | 0.1354 |
Key Findings¶
- CRM outperforms all baseline methods across all geometric and texture metrics.
- The training cost is only 1/8 of LRM's: CRM trains on 8 A800 GPUs for 6 days with a batch size of 32.
- Reasonable reconstructions emerge in only 280 iterations (20 minutes), showing that the spatial alignment prior greatly accelerates convergence.
- The PSNR of the multi-view diffusion model is 5.6 dB higher than that of Wonder3D (29.36 vs. 23.76).
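To give a feel for the size of the PSNR gap, the sketch below converts a PSNR difference into the corresponding ratio of mean squared errors, using the standard definition PSNR = 10·log10(MAX² / MSE). The conversion holds exactly for a single image; for averages over a test set it is only indicative. The function names are chosen here for illustration.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_psnr_gap(gap_db):
    """How many times smaller the MSE is for a PSNR that is gap_db higher."""
    return 10.0 ** (gap_db / 10.0)


crm_psnr, wonder3d_psnr = 29.36, 23.76    # from the multi-view diffusion table
gap = crm_psnr - wonder3d_psnr
ratio = mse_ratio_from_psnr_gap(gap)
print(f"gap: {gap:.2f} dB, MSE ratio: {ratio:.2f}x")
```

Under this reading, the gap corresponds to a pixel-level squared error that is between three and four times smaller.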
Highlights & Insights¶
- Triplane Spatial Alignment Prior: The central insight is that building the right prior into the architecture is more effective than adding compute.
- U-Net > Transformer (for this task): The inductive bias of convolution is better suited than a general-purpose Transformer for pixel-alignment tasks.
- End-to-end Mesh Output: FlexiCubes avoids the post-processing distortion associated with NeRF-to-mesh conversion.
- Silhouette Augmentation Technique: While it does not improve quantitative metrics, it substantially enhances robustness for in-the-wild inputs.
- Extremely Fast Convergence: Reasonable results can be obtained with only 20 minutes of training, showing that the prior plays a vital role.
Limitations & Future Work¶
- Multi-view diffusion models cannot guarantee complete consistency; inconsistent images degrade the 3D quality.
- The FlexiCubes grid resolution is limited to 80, which constrains ultra-fine geometric details.
- Performance is limited for inputs with high pitch angles or non-standard FoVs (inherited from ImageDream).
- The six views are fixed orthographic viewpoints, which might not be the optimal viewpoint configuration for all objects.
Related Work & Insights¶
- LRM: Pioneered Transformer-based triplane generation, but does not leverage spatial alignment priors.
- LGM: Uses Gaussian Splatting representation, but requires an additional conversion step to obtain a mesh.
- SyncDreamer/Wonder3D: Generate multi-view consistent images, but both require test-time optimization for reconstruction.
- Insight: When data is scarce, encoding domain priors into the architecture is more effective than data augmentation or scaling up models.
Rating¶
| Attribute | Score (1-10) |
|---|---|
| Novelty | 8 |
| Technical Depth | 7 |
| Experimental Thoroughness | 8 |
| Writing Quality | 8 |
| Value | 9 |
| Total Score | 8.0 |