3D Mesh Editing using Masked LRMs¶
- Conference: ICCV 2025
- arXiv: 2412.08641
- Code: https://chocolatebiscuit.github.io/MaskedLRM/
- Area: 3D Vision / Shape Editing / Large Reconstruction Models
- Keywords: LRM, Masked Reconstruction, 3D Editing, Conditional Inpainting, Multi-view Consistency
TL;DR¶
This paper proposes MaskedLRM, which reformulates 3D shape editing as a conditional reconstruction problem. During training, randomly generated 3D occluders mask the multi-view inputs, and a single clean conditioning view guides completion of the occluded regions. At inference, the user defines an edit region and provides a single edited image; the model produces an edited 3D mesh in a single forward pass in under 3 seconds (2–10× faster than prior feed-forward editors and orders of magnitude faster than optimization-based methods), while supporting topological changes (e.g., adding holes or handles) and achieving reconstruction quality on par with state-of-the-art methods.
Background & Motivation¶
3D shape editing remains far less mature than 2D image editing. Existing approaches fall into two categories: (1) Optimization-based methods (TextDeformer, MagicClay): optimize meshes using SDS losses, which suffer from noisy gradients, poor controllability, and slow runtimes (20 minutes to 1 hour), and cannot handle topological changes such as adding holes; (2) Generative methods (InstantMesh): first generate edited multi-view images via multi-view diffusion, then reconstruct with an LRM, but multi-view diffusion frequently produces inconsistent artifacts. The root difficulty is that producing a single 2D edit is easy, while lifting that edit into 3D with guaranteed cross-view consistency remains a fundamental challenge.
Core Problem¶
How can a single edited image drive 3D mesh editing while simultaneously guaranteeing: (1) faithfulness of the edited region to the 2D edit; (2) precise preservation of the original geometry in unedited regions; (3) multi-view consistency; (4) support for arbitrary topological changes; and (5) fast, feed-forward inference?
Method¶
Overall Architecture¶
Training: Multi-view renderings of the original shape → random 3D occluder masking → occluded patches replaced with learnable mask tokens → conditioning branch receives one clean view → Transformer (self-attention + cross-attention) outputs a triplane → volume rendering reconstructs all views (including occluded regions).
Inference: User defines the edit region → 2D inpainting produces an edited image → edited image serves as the condition + masked original multi-view renders as the primary input → single forward pass outputs an edited SDF → Marching Cubes extracts the mesh.
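A minimal PyTorch sketch of this pipeline follows; all module names, shapes, and hyperparameters are illustrative assumptions, not the paper's implementation. It isolates the two defining pieces: occluded patches are replaced by a learnable mask token (with Plücker pose encodings added afterwards), and the clean conditioning view is fused via cross-attention.

```python
import torch
import torch.nn as nn

def plucker_rays(origins, dirs):
    """Per-pixel Plücker coordinates (d, o × d) used as the pose encoding.
    origins, dirs: (V, 3, H, W) ray origins and unit directions."""
    moment = torch.cross(origins, dirs, dim=1)
    return torch.cat([dirs, moment], dim=1)            # (V, 6, H, W)

class MaskedLRMSketch(nn.Module):
    """Masked multi-view branch fused with a single clean conditioning view."""

    def __init__(self, dim=768, depth=12, patch=16):
        super().__init__()
        self.embed_rgb = nn.Conv2d(3, dim, patch, patch)   # patchify images
        self.embed_ray = nn.Conv2d(6, dim, patch, patch)   # patchify Plücker rays
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # TransformerDecoderLayer = self-attention + cross-attention to `memory`
        self.blocks = nn.ModuleList(
            nn.TransformerDecoderLayer(dim, 8, batch_first=True)
            for _ in range(depth))
        self.to_triplane = nn.Linear(dim, 3 * 32)          # toy triplane feature head

    def forward(self, views, rays, occluded, cond_tokens):
        # views: (V, 3, H, W); rays: (V, 6, H, W); occluded: (V*N,) bool patch mask;
        # cond_tokens: (1, Nc, D) patch tokens of the single clean/edited view.
        tok = lambda z: z.flatten(2).transpose(1, 2).flatten(0, 1)[None]  # (1, V*N, D)
        x, r = tok(self.embed_rgb(views)), tok(self.embed_ray(rays))
        x = torch.where(occluded[None, :, None], self.mask_token, x)  # hide occluded content
        x = x + r   # Plücker rays added AFTER masking, so masked tokens keep pose info
        for blk in self.blocks:
            x = blk(x, cond_tokens)                        # fuse condition via cross-attn
        return self.to_triplane(x)                         # later decoded to SDF + RGB
```

At inference, the predicted triplane is queried on a dense grid to obtain an SDF volume, and the mesh is extracted with Marching Cubes (e.g., `skimage.measure.marching_cubes(volume, level=0)` from scikit-image).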
Key Designs¶
- 3D-consistent occlusion training strategy: Rather than applying random patch masks (MAE-style), a randomly generated 3D cuboid occluder is rendered to produce multi-view-consistent occlusion regions. This eliminates the training–inference mask distribution gap, since user-defined edit regions at inference time are also spatially contiguous 3D regions. Ablation experiments confirm that random patch masking leads to blurry and unrealistic edits. (A mask-generation sketch follows this list.)
- Conditioning branch design: The primary branch processes 6–8 masked multi-view images; the conditioning branch processes a single clean edited image. Conditioning is fused into the primary branch via cross-attention. Plücker ray coordinates serve as pose encodings and are added after masking, so occluded tokens retain their spatial information (cf. the forward-pass sketch above).
- SDF + volume rendering: The network outputs a triplane representation, which is decoded into SDF and RGB values. Volume rendering against ground-truth images provides the training signal, and normal-map supervision ensures high-quality surface reconstruction. (A rendering-and-loss sketch follows the Loss & Training details below.)
- Two-stage training: Stage 1 uses 256×256 output resolution with 128 samples per ray (no normal loss); Stage 2 uses 384×384 output resolution with 512 samples per ray and adds normal-loss supervision, progressively increasing precision.
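The masking strategy in the first bullet is simple to reproduce. The numpy sketch below approximates the paper's rendered cuboid occluder by scattering samples on the surface of a random cuboid and projecting them into every view; each image patch hit by a projection is masked. The parameter ranges (cuboid extents, sample count) are illustrative assumptions.

```python
import numpy as np

def cuboid_occlusion_masks(K, w2cs, img_hw=(256, 256), patch=16,
                           n_pts=4096, seed=None):
    """Multi-view-consistent patch masks from one random 3D cuboid occluder.
    K: (3, 3) shared intrinsics; w2cs: (V, 3, 4) world-to-camera extrinsics.
    Returns a (V, H//patch, W//patch) bool array (True = occluded patch)."""
    rng = np.random.default_rng(seed)
    center = rng.uniform(-0.5, 0.5, size=3)            # somewhere in the object's box
    half = rng.uniform(0.1, 0.4, size=3)               # random cuboid half-extents
    pts = center + rng.uniform(-1, 1, (n_pts, 3)) * half
    ax = rng.integers(0, 3, n_pts)                     # snap each sample to a random
    sign = rng.choice([-1.0, 1.0], n_pts)              # face so it lies on the surface
    pts[np.arange(n_pts), ax] = center[ax] + sign * half[ax]

    H, W = img_hw
    masks = np.zeros((len(w2cs), H // patch, W // patch), dtype=bool)
    homo = np.concatenate([pts, np.ones((n_pts, 1))], axis=1)   # (N, 4)
    for v, w2c in enumerate(w2cs):
        cam = homo @ w2c.T                             # (N, 3) camera-space points
        cam = cam[cam[:, 2] > 1e-6]                    # keep points in front of camera
        uv = cam @ K.T
        uv = uv[:, :2] / uv[:, 2:3]                    # perspective divide -> pixels
        ok = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        pxy = (uv[ok] // patch).astype(int)            # patch-grid coordinates
        masks[v][pxy[:, 1], pxy[:, 0]] = True
    return masks
```

Because the same cuboid drives every view, the resulting masks are consistent across views by construction, which is exactly the property random per-view patch masking lacks.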
Loss & Training¶
$$\mathcal{L} = w_I\|I - \hat{I}\|_2^2 + w_N\|N - \hat{N}\|_2^2 + w_M\|M - \hat{M}\|_2^2 + w_P\,\mathcal{L}_{\mathrm{LPIPS}}$$

- 64× H100 GPUs; Stage 1 trained for 30 epochs, Stage 2 for 20 epochs.
- Objaverse dataset; 40 renders at 512×512 per shape.
- Random 128×128 crops are used for supervision at each step.
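A hedged sketch of the rendering-and-loss step referenced in the design list: the SDF is converted to density with a VolSDF-style Laplace CDF (an assumed conversion; the paper only states that it volume-renders an SDF), samples are alpha-composited along each ray, and the composited crops are scored with the loss above. `lpips` is the standard perceptual-loss package; the weights are placeholders.

```python
import torch
import torch.nn.functional as F
import lpips   # pip install lpips

def sdf_to_density(sdf, beta=0.01):
    """VolSDF-style Laplace-CDF mapping from signed distance to density
    (an assumption; any standard SDF-to-density conversion would do)."""
    return (0.5 + 0.5 * sdf.sign() * torch.expm1(-sdf.abs() / beta)) / beta

def composite(rgb, sigma, deltas):
    """Standard alpha compositing. rgb: (R, S, 3); sigma, deltas: (R, S).
    Returns per-ray color (R, 3) and accumulated opacity (R,)."""
    alpha = 1.0 - torch.exp(-sigma * deltas)           # per-sample opacity
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], -1), -1)[:, :-1]
    w = alpha * trans                                  # render weights
    return (w[..., None] * rgb).sum(1), w.sum(1)

lpips_fn = lpips.LPIPS(net='vgg')                      # perceptual term L_LPIPS

def training_loss(pred_img, gt_img, pred_n, gt_n, pred_m, gt_m,
                  w=(1.0, 1.0, 1.0, 1.0)):
    """L = w_I * image MSE + w_N * normal MSE + w_M * mask MSE + w_P * LPIPS.
    Image crops are (B, 3, h, w) in [-1, 1], as lpips expects."""
    return (w[0] * F.mse_loss(pred_img, gt_img)
            + w[1] * F.mse_loss(pred_n, gt_n)          # skipped in Stage 1
            + w[2] * F.mse_loss(pred_m, gt_m)
            + w[3] * lpips_fn(pred_img, gt_img).mean())
```

Normals for the normal-loss term come from the SDF gradient at the rendered surface; that query is omitted here for brevity.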
Key Experimental Results¶
Reconstruction Quality (though not the primary goal)¶
| Method | ABO PSNR↑ | ABO LPIPS↓ | GSO PSNR↑ | GSO LPIPS↓ |
|---|---|---|---|---|
| InstantMesh (Mesh) | — | — | 22.79 | 0.120 |
| MeshLRM | 26.09 | 0.102 | 27.93 | 0.081 |
| MaskedLRM (8 views) | 28.65 | 0.078 | 27.58 | 0.085 |
MaskedLRM surpasses MeshLRM by 2.56 dB PSNR on ABO and matches state-of-the-art on GSO.
Editing Speed¶
| Method | Type | Runtime |
|---|---|---|
| TextDeformer | Optimization | 20 min |
| MagicClay | Optimization | 1 hour |
| InstantMesh | LRM | 30 sec |
| PrEditor3D | LRM | 80 sec |
| Instant3Dit | LRM | 6 sec |
| MaskedLRM | LRM | <3 sec |
Editing Quality (CLIP Similarity)¶
| Method | ViT-L-14 | ViT-BigG-14 |
|---|---|---|
| MagicClay | 0.285 | 0.286 |
| Instant3Dit | 0.303 | 0.309 |
| MaskedLRM | 0.323 | 0.337 |
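For context on this metric: CLIP similarity for 3D edits is typically the cosine similarity between renders of the edited shape and the edit's text prompt, averaged over views; whether the paper scores against text or the conditioning image is an assumption here. A minimal sketch with the open_clip package, using the model names from the table:

```python
import torch
import open_clip
from PIL import Image

def clip_similarity(image_path, prompt, model_name="ViT-L-14",
                    pretrained="openai"):
    """Cosine similarity between one rendered view and a text prompt."""
    model, _, preprocess = open_clip.create_model_and_transforms(
        model_name, pretrained=pretrained)
    tokenizer = open_clip.get_tokenizer(model_name)
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    txt = tokenizer([prompt])
    with torch.no_grad():
        img_f = model.encode_image(img)
        txt_f = model.encode_text(txt)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)   # unit-normalize features
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (img_f * txt_f).sum(-1).item()
```

The ViT-bigG-14 column would use, e.g., `clip_similarity(path, prompt, "ViT-bigG-14", "laion2b_s39b_b160k")` (an available open_clip checkpoint tag).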
Ablation Study¶
- 3D occlusion vs. random patch masking: Random patch masking produces blurry artifacts due to the training–inference distribution gap; 3D occluders yield sharp, clean geometry.
- Normal supervision: Without normal supervision, surfaces exhibit holes and bumps; depth supervision provides weaker benefits; normal supervision produces smooth and accurate surfaces.
- Topological changes: The method can add handles or holes to objects (genus change); optimization-based methods cannot achieve this due to fixed mesh topology.
Highlights & Insights¶
- Shape editing as conditional reconstruction: An elegantly principled problem reformulation — editing is recast as "reconstruct the original shape while completing missing regions conditioned on the edit."
- 3D-consistent masking strategy: Using a 3D occluder to generate multi-view-consistent masks is the core technical contribution, eliminating the training–inference distribution gap.
- Topological change capability: Because the output is a new mesh reconstructed from an SDF rather than a deformation of the original mesh, genus changes are naturally supported.
- Extreme speed: Feed-forward inference in under 3 seconds, i.e., roughly 400–1200× faster than the optimization-based baselines in the speed table (20 min to 1 h).
- Identity preservation: Reconstruction quality in unedited regions matches state-of-the-art reconstruction methods.
Limitations & Future Work¶
- Edit quality is bounded by the quality of the conditioning image, requiring iterative 2D editing to obtain satisfactory results.
- The uniformity of Marching Cubes triangulation limits surface detail fidelity.
- Very fine-grained details (e.g., human faces) may appear blurry.
- Unedited regions cannot be directly frozen (unlike in MagicClay); preservation relies entirely on reconstruction quality.
- Training requires 64× H100 GPUs, imposing high resource demands.
Related Work & Insights¶
- TextDeformer: Text-guided mesh deformation. SDS gradients are noisy and cause global distortion; runtime is 20 minutes. MaskedLRM uses image conditioning for greater precision in under 3 seconds.
- MagicClay: Local SDS optimization. Results are sometimes successful but unpredictable (e.g., failures on fedora/top hat); runtime is 1 hour. MaskedLRM produces highly predictable outputs.
- InstantMesh: Single-view → multi-view diffusion → LRM pipeline. Multi-view diffusion introduces inconsistent artifacts. MaskedLRM uses ground-truth renderings as multi-view inputs, bypassing the consistency problem entirely.
- PrEditor3D / Instant3Dit: Multi-view diffusion-based editing. Semantically correct but lacking in detail. MaskedLRM's conditioning branch combined with masked training produces more realistic edits.
The paradigm of "conditional reconstruction = editing" is generalizable to other 3D editing tasks (texture editing, scene editing). The 3D-consistent masking training strategy is applicable to other tasks requiring cross-view-consistent inpainting.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Reformulating editing as conditional reconstruction combined with the 3D-consistent masking strategy is both elegant and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comparisons against 5 methods, reconstruction metrics, editing metrics, ablations, CLIP scores, speed benchmarks, and topological change evaluation.
- Writing Quality: ⭐⭐⭐⭐⭐ — Motivation is clear, method description is systematic, and comparisons are fair and thorough.
- Value: ⭐⭐⭐⭐⭐ — Feed-forward 3D editing in under 3 seconds has high practical utility; topological change support represents a key breakthrough.