Mesh-RFT: Enhancing Mesh Generation via Fine-Grained Reinforcement Fine-Tuning

Conference: NeurIPS 2025 · arXiv: 2505.16761 · Code: Project Page · Area: 3D Vision / Mesh Generation · Keywords: mesh generation, reinforcement fine-tuning, DPO, topology-aware, fine-grained optimization

TL;DR

This paper proposes Mesh-RFT, a framework that achieves face-level fine-grained mesh quality optimization through a topology-aware scoring system and Masked Direct Preference Optimization (M-DPO), significantly improving the geometric integrity and topological regularity of generated meshes.

Background & Motivation

High-quality 3D mesh generation faces two major challenges:

Limitations of Prior Work: Autoregressive mesh generation methods (MeshGPT, MeshXL, etc.) are prone to structural ambiguities and "hallucinations" (inconsistent edges, non-manifold vertices, deformations, holes) when generating long-sequence, high-resolution meshes.

Shortcomings of Global Reinforcement Learning: DeepMesh applies DPO for preference alignment but relies on manually annotated preference pairs (only 5,000 samples), and its global reward signal fails to capture local topological variations. A key observation is that high-quality and low-quality structures frequently coexist within the same mesh.

Core problem: How to achieve face-level fine-grained optimization rather than applying a uniform global reward to the entire mesh?

Method

Overall Architecture

A three-stage pipeline:

  1. Pre-training: supervised learning with an Hourglass Autoregressive Transformer and a Shape Encoder.
  2. Preference Dataset Construction: the pre-trained model generates candidate meshes, and a topology-aware scoring system establishes preference pairs.
  3. Post-training: fine-grained reinforcement fine-tuning with Masked DPO.

Key Designs

  1. Topology-Aware Scoring System: Two objective topological metrics are proposed to replace manual annotation:

    • Boundary Edge Ratio (BER): \(BER(\mathcal{M}) = E_{\partial\mathcal{M}} / E_{\mathcal{M}}\), measuring mesh integrity. A closed manifold mesh should have BER = 0; a high BER indicates surface discontinuities, holes, and similar issues.
    • Topology Score (TS): \(TS(\mathcal{M}) = \sum_{i=1}^{4} w_i s_i(\mathcal{Q}(\mathcal{M}))\), evaluated by converting the triangle mesh to a quadrilateral mesh and assessing four components (weights \(w_i\) in parentheses): quad ratio (0.4), angle quality (0.2), aspect ratio (0.3), and adjacency consistency (0.1).
    • Hausdorff Distance (HD) is additionally used to measure geometric consistency.
  2. Preference Dataset Construction: Eight candidate meshes are generated per input point cloud, yielding \(C(8,2)=28\) exhaustive pairs. A preference relation is established only when one mesh strictly outperforms another across all three metrics—BER, TS, and HD—thereby avoiding ambiguous preferences.

  3. Masked Direct Preference Optimization (M-DPO): The core innovation. Quality is assessed at the individual triangle face level to construct a token-level binary mask \(\phi(\mathcal{M}) \in \{0,1\}^{|\mathcal{M}|}\). For chosen samples, the mask amplifies contributions from high-quality regions; for rejected samples, the inverted mask focuses the penalty on low-quality regions:

    • \(\mathcal{L}^+\): tokens corresponding to high-quality regions in chosen samples are selected by the mask.
    • \(\mathcal{L}^-\): tokens corresponding to low-quality regions in rejected samples are selected by the inverted mask.

    This achieves the effect of preserving good regions while focusing repair efforts on defective regions.
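The two automated steps above — the BER metric and the strict-dominance pairing rule — can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: triangles are assumed to be given as vertex-index tuples, and the function names are made up for this note.

```python
from collections import Counter
from itertools import combinations

def boundary_edge_ratio(faces):
    """BER = boundary edges / total edges; 0 for a closed manifold mesh.

    `faces` is a list of vertex-index triangles, e.g. [(0, 1, 2), ...].
    An edge shared by exactly one face is a boundary edge (hole/discontinuity).
    """
    edge_counts = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edge_counts[tuple(sorted((u, v)))] += 1
    boundary = sum(1 for n in edge_counts.values() if n == 1)
    return boundary / len(edge_counts)

def preference_pairs(scores):
    """Keep (winner, loser) only under strict dominance on all three metrics.

    `scores` maps mesh id -> (BER, HD, TS); lower BER/HD and higher TS win.
    Ambiguous comparisons (mixed wins) yield no pair.
    """
    pairs = []
    for i, j in combinations(scores, 2):
        (ber_i, hd_i, ts_i), (ber_j, hd_j, ts_j) = scores[i], scores[j]
        if ber_i < ber_j and hd_i < hd_j and ts_i > ts_j:
            pairs.append((i, j))
        elif ber_j < ber_i and hd_j < hd_i and ts_j > ts_i:
            pairs.append((j, i))
    return pairs
```

With eight candidates per input, the 28 exhaustive pairs are run through `preference_pairs`, and only the strictly dominant comparisons survive as training pairs.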

Loss & Training

  • Pre-training strategy: truncated training on fixed-length segments + sliding-window inference (the window starts sliding after 40% coverage, retaining the most recent 30% of context).
  • M-DPO loss: \(\mathcal{L}_{\text{M-DPO}} = -\mathbb{E}[\log \sigma(\beta \mathcal{L}^+ - \beta \mathcal{L}^-)]\)
  • Model architecture: Hourglass Transformer (with 2 shortening and 2 upsampling operations); the Hunyuan3D 2.0 point cloud encoder is injected via cross-attention.
  • Compute: pre-training on 256 H20 GPUs for 10 days; M-DPO on 64 GPUs for 8 hours with learning rate 5e-7.
  • Training data: 2M meshes for pre-training, 800K filtered meshes for fine-tuning, 10K meshes for preference data construction.
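The M-DPO loss above can be sketched in NumPy as follows. This is an illustration under assumptions, not the paper's implementation: per-token log-probabilities from the policy and the frozen reference model are assumed given, the mask names are invented here, and any per-mask normalization used in the paper is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def m_dpo_loss(logp_c, logp_r, ref_c, ref_r, mask_c, mask_r, beta=0.1):
    """Masked DPO loss sketch for one preference pair.

    logp_*/ref_* hold per-token log-probs of the chosen (c) / rejected (r)
    mesh under the policy and the frozen reference model; mask_* is the
    {0,1} face-quality mask phi lifted to tokens (1 = high-quality region).
    """
    # L+: keep only high-quality tokens of the chosen sample.
    l_plus = np.sum(mask_c * (logp_c - ref_c))
    # L-: inverted mask keeps only low-quality tokens of the rejected sample.
    l_minus = np.sum((1 - mask_r) * (logp_r - ref_r))
    return -np.log(sigmoid(beta * l_plus - beta * l_minus))
```

The mask inversion is the key asymmetry: good regions of the winner are reinforced, while only the defective regions of the loser are pushed down, so shared high-quality structure is left alone.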

Key Experimental Results

Main Results

| Method | CD↓ | HD↓ | TS↑ | BER↓ | US↑ |
|---|---|---|---|---|---|
| MeshAnythingV2 | 0.2265 | 0.4760 | 72.0 | 0.0913 | 8% |
| BPT | 0.1615 | 0.3347 | 73.7 | 0.0113 | 18% |
| DeepMesh* (0.5B) | 0.1760 | 0.3570 | 75.8 | 0.0044 | 20% |
| Mesh-RFT | 0.1286 | 0.2411 | 79.4 | 0.0015 | 40% |

(Results on the Dense Meshes track; the user study preference rate (US) improves from DeepMesh's 20% to 40%.)

Ablation Study

| Configuration | CD↓ | HD↓ | TS↑ | BER↓ | US↑ |
|---|---|---|---|---|---|
| Pretrain | 0.1588 | 0.3196 | 76.5 | 0.0033 | 30% |
| N-DPO (HD only) | 0.1455 | 0.2919 | 75.7 | 0.0028 | 32% |
| S-DPO (scoring system) | 0.1348 | 0.2625 | 77.9 | 0.0023 | 35% |
| M-DPO (masked) | 0.1286 | 0.2411 | 79.4 | 0.0015 | 40% |

Key Findings

  • Compared to the pre-trained model, M-DPO reduces HD by 24.6% and improves TS by 3.8%.
  • Compared to global DPO (S-DPO), M-DPO reduces HD by 17.4% and improves TS by 4.9%.
  • Using HD alone as the preference criterion (N-DPO) actually degrades TS, demonstrating the necessity of multi-metric composite scoring.
  • M-DPO achieves 40% user preference (vs. 30% for the pre-trained model), validating perceptual quality gains.
  • Strong performance on out-of-distribution Hunyuan2.5-generated meshes demonstrates generalization capability.

Highlights & Insights

  • First face-level RL optimization method: Breaks the limitation of global reward signals by enabling precise repair of local defects.
  • The objective topology scoring system replaces manual annotation and offers strong scalability (vs. DeepMesh's 5,000 annotated samples).
  • The BER and TS designs are elegant: evaluating triangle mesh quality via quadrilateral conversion aligns well with the industrial preference for quad meshes.
  • The engineering design of truncated training combined with sliding-window inference addresses practical challenges in long-sequence mesh generation.

Limitations & Future Work

  • Only point-cloud-conditioned generation is evaluated; text- and image-conditioned variants remain unexplored.
  • The comparison with DeepMesh is limited to the 0.5B version, which may not be fully fair.
  • The weights \(w_i\) in the scoring system are manually specified; adaptive learning could be considered.
  • Using quadrilateral quality as a proxy metric for triangle mesh quality may introduce bias.
  • The number of faces and resolution of generated meshes are constrained by sequence length.
  • Compared to DeepMesh, which uses global DPO with manual annotations, Mesh-RFT employs local M-DPO with automated scoring.
  • Transferring DPO/RLHF paradigms from NLP to 3D mesh generation is a growing trend, but adaptation to 3D structural properties is essential.
  • The Masked DPO concept is generalizable to other sequence generation scenarios with significant local quality variation (e.g., code generation, music generation).
  • The hierarchical design of the Hourglass Transformer provides a useful reference for long-sequence generation tasks.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First face-level RL optimization combined with an objective topology scoring system; the M-DPO design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Complete ablations with user studies and OOD testing, though the number of baselines is limited.
  • Writing Quality: ⭐⭐⭐⭐ Rich figures and tables; method description is clear.
  • Value: ⭐⭐⭐⭐⭐ Directly applicable to production-level mesh generation; the objective scoring system is reusable.