
3D Mesh Editing using Masked LRMs

Conference: ICCV 2025 arXiv: 2412.08641 Code: https://chocolatebiscuit.github.io/MaskedLRM/ Area: 3D Vision / Shape Editing / Large Reconstruction Models Keywords: LRM, Masked Reconstruction, 3D Editing, Conditional Inpainting, Multi-view Consistency

TL;DR

This paper proposes MaskedLRM, which reformulates 3D shape editing as a conditional reconstruction problem. During training, randomly generated 3D occluders mask the multi-view inputs, and a single clean conditioning view guides completion of the occluded regions. At inference, the user defines an edit region and provides a single edited image; the model produces an edited 3D mesh in one forward pass in under 3 seconds, roughly 2–10× faster than prior feed-forward editors and orders of magnitude faster than optimization-based methods, while supporting topological changes (e.g., adding holes or handles) and matching state-of-the-art reconstruction quality.

Background & Motivation

3D shape editing remains far less mature than 2D image editing. Existing approaches fall into two categories:

  • Optimization-based methods (TextDeformer, MagicClay) optimize meshes with SDS losses, which suffer from noisy gradients, poor controllability, and slow runtimes (20 minutes to 1 hour), and cannot handle topological changes such as adding holes.
  • Generative methods (InstantMesh) first generate edited multi-view images via multi-view diffusion and then reconstruct with an LRM, but multi-view diffusion frequently produces inconsistent artifacts.

The underlying difficulty is that editing in 2D is straightforward, while keeping the edit consistent across views in 3D remains a fundamental challenge.

Core Problem

How can a single edited image drive 3D mesh editing while simultaneously guaranteeing: (1) faithfulness of the edited region to the 2D edit; (2) precise preservation of the original geometry in unedited regions; (3) multi-view consistency; (4) support for arbitrary topological changes; and (5) fast, feed-forward inference?

Method

Overall Architecture

Training: Multi-view renderings of the original shape → random 3D occluder masking → occluded patches replaced with learnable mask tokens → conditioning branch receives one clean view → Transformer (self-attention + cross-attention) outputs a triplane → volume rendering reconstructs all views (including occluded regions).

Inference: User defines the edit region → 2D inpainting produces an edited image → edited image serves as the condition + masked original multi-view renders as the primary input → single forward pass outputs an edited SDF → Marching Cubes extracts the mesh.
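The whole inference path can be summarized as one conditional forward pass. Below is a minimal sketch of that flow; the function and argument names (`edit_mesh`, `model.query_sdf`, `cond_image`, etc.) are hypothetical placeholders for a MaskedLRM-like interface, not the authors' API.

```python
import torch
from skimage.measure import marching_cubes

def edit_mesh(model, views, view_rays, edit_masks, edited_view, edited_rays):
    """One feed-forward edit (hypothetical interface).

    views:       (V, 3, H, W) renderings of the original shape
    view_rays:   (V, 6, H, W) Pluecker ray encodings of the same cameras
    edit_masks:  (V, 1, H, W) binary masks of the user-defined 3D edit region,
                 rendered into every camera (1 = inside the edit region)
    edited_view: (3, H, W) single 2D-edited/inpainted conditioning image
    edited_rays: (6, H, W) ray encoding of the conditioning camera
    """
    # 1. Blank out the edit region; learnable mask tokens take its place inside the model.
    masked_views = views * (1.0 - edit_masks)

    with torch.no_grad():
        # 2. Single conditional forward pass: masked views feed the primary branch,
        #    the clean edited image feeds the conditioning branch (cross-attention).
        triplane = model(masked_views, view_rays, edit_masks,
                         cond_image=edited_view[None], cond_rays=edited_rays[None])

        # 3. Query the SDF head on a dense grid and extract the edited mesh.
        sdf_grid = model.query_sdf(triplane, resolution=256)   # (256, 256, 256)

    verts, faces, _, _ = marching_cubes(sdf_grid.cpu().numpy(), level=0.0)
    return verts, faces
```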

Key Designs

  1. 3D-consistent occlusion training strategy: Rather than applying random patch masks (MAE-style), a randomly generated 3D cuboid occluder is rendered to produce multi-view-consistent occlusion regions (see the first sketch after this list). This eliminates the training–inference mask distribution gap, since user-defined edit regions at inference time are also spatially contiguous 3D regions. Ablation experiments confirm that random patch masking leads to blurry and unrealistic edits.

  2. Conditioning branch design: The primary branch processes 6–8 masked multi-view images; the conditioning branch processes a single clean edited image, which is fused into the primary branch via cross-attention. Plücker ray coordinates serve as pose encodings and are added after masking, so even masked tokens retain camera and positional information.

  3. SDF + volume rendering: The network outputs a triplane representation, which is decoded into SDF and RGB values (see the second sketch after this list). Volume rendering against ground-truth images provides the training signal, and normal map supervision ensures high-quality surface reconstruction.

  4. Two-stage training: Stage 1 uses 256×256 output resolution with 128 samples per ray and no normal loss; Stage 2 uses 384×384 output resolution with 512 samples per ray and adds normal supervision, progressively increasing rendering fidelity.
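As a concrete illustration of point 1, here is a minimal sketch that samples a random axis-aligned cuboid occluder and rasterizes it into a per-view occlusion mask by projecting dense interior samples into each camera. The camera convention and helper names are assumptions, and this toy version ignores depth ordering against the object, which a real renderer would handle.

```python
import numpy as np

def random_cuboid_occlusion_masks(cam_K, cam_w2c, hw=(256, 256), n_pts=200_000, rng=None):
    """Render one random cuboid into every camera -> multi-view-consistent masks.

    cam_K:   (V, 3, 3) pinhole intrinsics (assumed convention)
    cam_w2c: (V, 3, 4) world-to-camera extrinsics [R | t]
    Returns: (V, H, W) uint8 masks, 1 = occluded.
    """
    rng = rng or np.random.default_rng()
    H, W = hw

    # Random cuboid inside the normalized object bounds [-1, 1]^3.
    center = rng.uniform(-0.6, 0.6, size=3)
    half_size = rng.uniform(0.15, 0.5, size=3)
    # Dense point samples filling the cuboid; projecting them fills its silhouette.
    pts = center + rng.uniform(-1.0, 1.0, size=(n_pts, 3)) * half_size   # (N, 3)

    masks = np.zeros((len(cam_K), H, W), dtype=np.uint8)
    for v, (K, Rt) in enumerate(zip(cam_K, cam_w2c)):
        cam_pts = pts @ Rt[:, :3].T + Rt[:, 3]        # world -> camera
        cam_pts = cam_pts[cam_pts[:, 2] > 1e-6]        # keep points in front of the camera
        uvw = cam_pts @ K.T                            # pinhole projection
        uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)
        inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        masks[v, uv[inside, 1], uv[inside, 0]] = 1     # same cuboid seen from every view
    return masks
```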
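And for point 3, a minimal sketch of how a triplane can be decoded into SDF and RGB at query points, plus a VolSDF-style SDF-to-density conversion so standard volume rendering applies. The feature dimensions, MLP sizes, and the density transform are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TriplaneDecoder(nn.Module):
    """Decode (SDF, RGB) from a triplane at 3D query points (illustrative sizes)."""
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + 3),                  # 1 SDF value + 3 RGB channels
        )

    def forward(self, triplane, pts):
        """triplane: (3, C, R, R) planes for XY, XZ, YZ; pts: (N, 3) in [-1, 1]."""
        xy, xz, yz = pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]
        feats = []
        for plane, coords in zip(triplane, (xy, xz, yz)):
            grid = coords.view(1, -1, 1, 2)                           # (1, N, 1, 2)
            f = F.grid_sample(plane[None], grid, align_corners=False)  # (1, C, N, 1)
            feats.append(f.view(plane.shape[0], -1).t())               # (N, C)
        out = self.mlp(torch.cat(feats, dim=-1))
        sdf, rgb = out[:, :1], torch.sigmoid(out[:, 1:])
        return sdf, rgb

def sdf_to_density(sdf, beta=0.02):
    """Laplace-CDF (VolSDF-style) conversion of signed distance to volume density;
    one common choice, the paper may use a different transform."""
    x = -sdf
    cdf = torch.where(x <= 0, 0.5 * torch.exp(x / beta),
                      1.0 - 0.5 * torch.exp(-x / beta))
    return cdf / beta
```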

Loss & Training

$$\mathcal{L} = w_I\,\|I - \hat{I}\|_2^2 + w_N\,\|N - \hat{N}\|_2^2 + w_M\,\|M - \hat{M}\|_2^2 + w_P\,\mathcal{L}_{\mathrm{LPIPS}}$$

with image, normal, mask, and LPIPS perceptual terms weighted by $w_I, w_N, w_M, w_P$.

  • 64× H100 GPUs; Stage 1 trained for 30 epochs, Stage 2 for 20 epochs.
  • Objaverse dataset; 40 renders at 512×512 per shape.
  • Random 128×128 crops are used for supervision at each step.
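To make the objective concrete, a minimal PyTorch sketch follows, assuming the lpips package for the perceptual term; the weight values, tensor layout, and dict interface are placeholders rather than the paper's settings.

```python
import torch
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net='vgg')  # perceptual term

def masked_lrm_loss(pred, gt, w_img=1.0, w_normal=0.5, w_mask=1.0, w_lpips=2.0):
    """pred/gt: dicts with 'rgb' (B,3,h,w), 'normal' (B,3,h,w), 'mask' (B,1,h,w),
    rendered on random 128x128 crops. Weight values here are illustrative."""
    l_img    = torch.mean((pred['rgb']    - gt['rgb'])    ** 2)
    l_normal = torch.mean((pred['normal'] - gt['normal']) ** 2)   # used in stage 2 only
    l_mask   = torch.mean((pred['mask']   - gt['mask'])   ** 2)
    # LPIPS expects inputs scaled to [-1, 1]
    l_lpips  = lpips_fn(pred['rgb'] * 2 - 1, gt['rgb'] * 2 - 1).mean()
    return w_img * l_img + w_normal * l_normal + w_mask * l_mask + w_lpips * l_lpips
```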

Key Experimental Results

Reconstruction Quality (though not the primary goal)

| Method | ABO PSNR↑ | ABO LPIPS↓ | GSO PSNR↑ | GSO LPIPS↓ |
|---|---|---|---|---|
| InstantMesh (Mesh) | – | – | 22.79 | 0.120 |
| MeshLRM | 26.09 | 0.102 | 27.93 | 0.081 |
| MaskedLRM (8 views) | 28.65 | 0.078 | 27.58 | 0.085 |

MaskedLRM surpasses MeshLRM by 2.56 dB PSNR on ABO and matches state-of-the-art on GSO.

Editing Speed

| Method | Type | Runtime |
|---|---|---|
| TextDeformer | Optimization | 20 min |
| MagicClay | Optimization | 1 hour |
| InstantMesh | LRM | 30 sec |
| PrEditor3D | LRM | 80 sec |
| Instant3Dit | LRM | 6 sec |
| MaskedLRM | LRM | <3 sec |

Editing Quality (CLIP Similarity)

| Method | ViT-L-14 | ViT-BigG-14 |
|---|---|---|
| MagicClay | 0.285 | 0.286 |
| Instant3Dit | 0.303 | 0.309 |
| MaskedLRM | 0.323 | 0.337 |
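For context, a minimal sketch of how such a CLIP similarity is typically computed with the open_clip package, assuming the score is the mean cosine similarity between CLIP embeddings of rendered views of the edited mesh and the edit's text prompt (the paper's exact evaluation protocol may differ).

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms('ViT-L-14', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-L-14')

@torch.no_grad()
def clip_similarity(render_paths, prompt):
    """Mean cosine similarity between rendered views and the edit prompt."""
    images = torch.stack([preprocess(Image.open(p).convert('RGB')) for p in render_paths])
    img_feats = model.encode_image(images)
    txt_feats = model.encode_text(tokenizer([prompt]))
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    return (img_feats @ txt_feats.t()).mean().item()
```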

Ablation Study

  • 3D occlusion vs. random patch masking: Random patch masking produces blurry artifacts due to the training–inference distribution gap; 3D occluders yield sharp, clean geometry.
  • Normal supervision: Without normal supervision, surfaces exhibit holes and bumps; depth supervision provides weaker benefits; normal supervision produces smooth and accurate surfaces.
  • Topological changes: The method can add handles or holes to objects (genus change); optimization-based methods cannot achieve this due to fixed mesh topology.

Highlights & Insights

  • Shape editing as conditional reconstruction: An elegantly principled problem reformulation — editing is recast as "reconstruct the original shape while completing missing regions conditioned on the edit."
  • 3D-consistent masking strategy: Using a 3D occluder to generate multi-view-consistent masks is the core technical contribution, eliminating the training–inference distribution gap.
  • Topological change capability: Because the output is a new mesh reconstructed from an SDF rather than a deformation of the original mesh, genus changes are naturally supported.
  • Extreme speed: Feed-forward inference in under 3 seconds, 100–1000× faster than optimization-based methods.
  • Identity preservation: Reconstruction quality in unedited regions matches state-of-the-art reconstruction methods.

Limitations & Future Work

  • Edit quality is bounded by the quality of the conditioning image, requiring iterative 2D editing to obtain satisfactory results.
  • The uniformity of Marching Cubes triangulation limits surface detail fidelity.
  • Very fine-grained details (e.g., human faces) may appear blurry.
  • Unedited regions cannot be directly frozen (unlike in MagicClay); preservation relies entirely on reconstruction quality.
  • Training requires 64× H100 GPUs, imposing high resource demands.
Comparison with Prior Methods

  • TextDeformer: Text-guided mesh deformation. SDS gradients are noisy and cause global distortion; runtime is 20 minutes. MaskedLRM uses image conditioning for greater precision in under 3 seconds.
  • MagicClay: Local SDS optimization. Results are sometimes successful but unpredictable (e.g., failures on fedora/top hat); runtime is 1 hour. MaskedLRM produces highly predictable outputs.
  • InstantMesh: Single-view → multi-view diffusion → LRM pipeline. Multi-view diffusion introduces inconsistent artifacts. MaskedLRM uses ground-truth renderings as multi-view inputs, bypassing the consistency problem entirely.
  • PrEditor3D / Instant3Dit: Multi-view diffusion-based editing. Semantically correct but lacking in detail. MaskedLRM's conditioning branch combined with masked training produces more realistic edits.

Takeaways

The paradigm of "conditional reconstruction = editing" is generalizable to other 3D editing tasks (texture editing, scene editing). The 3D-consistent masking training strategy is applicable to other tasks requiring cross-view-consistent inpainting.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Reformulating editing as conditional reconstruction combined with the 3D-consistent masking strategy is both elegant and effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comparisons against 5 methods, reconstruction metrics, editing metrics, ablations, CLIP scores, speed benchmarks, and topological change evaluation.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Motivation is clear, method description is systematic, and comparisons are fair and thorough.
  • Value: ⭐⭐⭐⭐⭐ — Feed-forward 3D editing in under 3 seconds has high practical utility; topological change support represents a key breakthrough.