TriTex: Learning Texture from a Single Mesh via Triplane Semantic Features¶

Conference: CVPR 2025
arXiv: 2503.16630
Code: Project Page
Area: 3D Vision
Keywords: Texture Transfer, Triplane Representation, Semantic Features, One-Shot Learning, 3D Mesh

TL;DR¶

This paper proposes TriTex, a method for learning a volumetric texture field from a single textured mesh. By projecting Diff3F semantic features into a triplane representation, TriTex utilizes a convolutional network and an MLP to achieve semantic-aware, feed-forward texture transfer, outperforming existing methods in both inference speed and texture fidelity.

Background & Motivation¶

3D texture transfer (applying the semantic texture of a source mesh to a target mesh) is a fundamental requirement in game development, simulation, and video production.
Existing diffusion model-based methods (such as TEXTure, EASI-TEX) are proficient at texture generation but struggle to faithfully preserve the original appearance of the source texture.
Score Distillation Sampling (SDS) optimization methods (such as Latent-NeRF, Paint-it) require lengthy optimization for every single object, making them impractical for large-scale scenes.
Iterative depth-conditioned inpainting methods are prone to viewpoint inconsistency artifacts.
IP-Adapter-based methods (such as MVEdit, EASI-TEX) treat the reference image only as loose guidance and frequently deviate from the source texture.
Methods requiring training on large-scale 3D datasets (such as Texturify, AUV-net) are restricted by category and data availability.
There is a lack of efficient methods that can learn solely from a single textured mesh and generalize to new target meshes of the same category.
Texture transfer requires implicit or explicit semantic correspondence rather than simple pixel-level copying.

Method¶

Overall Architecture¶

TriTex takes a 3D mesh with pre-extracted Diff3F semantic features and a 3D query point as input, and outputs the color at that point. It first projects the semantic features from 6 orthogonal views into a triplane representation \(\mathcal{T} \in \mathbb{R}^{3 \times W \times H \times 2D}\), which a triplane-aware convolutional block processes into \(\mathcal{T}'\). Each query point then samples features from the three planes of \(\mathcal{T}'\); the sampled features are concatenated and passed through a coloring MLP \(c: \mathbb{R}^{3D'} \to [0,1]^3\) that outputs the RGB color. At inference, the same pipeline is applied to query points on the target mesh. Training uses only a single textured mesh and a rendering reconstruction loss.
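
The projection step can be illustrated with a small sketch. It is not the authors' implementation: it assumes the surface is given as sample points in \([0,1]^3\) with one feature vector each, that each of the 6 views is an orthographic projection keeping the point nearest to the camera per pixel, and that the two opposite views sharing a plane are concatenated channel-wise (which would account for the \(2D\) channels). The names `project_view` and `build_triplane` are hypothetical.

```python
def project_view(points, feats, axis, sign, res, dim):
    """Orthographic projection along `axis`, seen from the `sign` side.

    Assumption: points lie in [0, 1]^3. For every pixel, the feature of the
    point closest to the camera is kept (a simple depth test).
    """
    u_ax, v_ax = [a for a in range(3) if a != axis]
    grid = [[[0.0] * dim for _ in range(res)] for _ in range(res)]
    depth = [[None] * res for _ in range(res)]
    for p, f in zip(points, feats):
        i = min(int(p[u_ax] * res), res - 1)
        j = min(int(p[v_ax] * res), res - 1)
        d = sign * p[axis]  # larger d means closer to the camera on that side
        if depth[i][j] is None or d > depth[i][j]:
            depth[i][j] = d
            grid[i][j] = list(f)
    return grid


def build_triplane(points, feats, res, dim):
    """Return 3 planes of shape res x res x (2 * dim), one per dropped axis."""
    planes = []
    for axis in range(3):
        front = project_view(points, feats, axis, +1, res, dim)
        back = project_view(points, feats, axis, -1, res, dim)
        # Concatenate the positive-side and negative-side views channel-wise.
        planes.append([[front[i][j] + back[i][j] for j in range(res)]
                       for i in range(res)])
    return planes


# Two points on the same line of sight along z: each side sees a different one.
points = [(0.1, 0.1, 0.9), (0.1, 0.1, 0.2)]
feats = [[1.0], [2.0]]
triplane = build_triplane(points, feats, res=2, dim=1)
```

In this toy case the plane perpendicular to z stores both features in the same pixel, one per viewing side, so information from the front and back of the object is kept separately.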

Key Designs¶

Design 1: Diff3F Semantic Features + Triplane Projection
- Function: Establishes semantic correspondences between the source and target meshes.
- Mechanism: Utilizes Diff3F (frozen diffusion model + DINO features) to extract zero-shot 3D semantic descriptors for the mesh, which are then projected onto the triplane representation from 6 orthogonal directions (3 positive and 3 negative directions along the axes). The triplane-aware convolutional block aggregates features across the three planes (by averaging each plane along its axis and replicating it to the others) to achieve cross-plane information interaction.
- Design Motivation: Diff3F features exhibit cross-shape semantic consistency, allowing the semantic-to-color mapping learned on a single mesh to generalize to other objects in the same category with significant geometric variation. The triplane representation enables efficient processing of 3D information using 2D convolutions.
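
The cross-plane aggregation can be read in more than one way. The sketch below shows one plausible interpretation, not necessarily the one used in the paper: to enrich the XY plane, the XZ and YZ planes are averaged along their z axis, and the resulting per-x and per-y profiles are replicated across the XY grid and concatenated to its channels before any 2D convolution. The plane layout (`xy[x][y]`, `xz[x][z]`, `yz[y][z]`) and the function names are assumptions made for illustration.

```python
def mean_over_second_axis(plane):
    """plane[a][b] is a feature list; return one channel-wise mean per a."""
    profile = []
    for row in plane:
        channels = len(row[0])
        profile.append([sum(cell[k] for cell in row) / len(row)
                        for k in range(channels)])
    return profile


def aggregate_xy(xy, xz, yz):
    """Augment every XY cell with pooled context from the other two planes."""
    x_profile = mean_over_second_axis(xz)  # pool over z: one vector per x
    y_profile = mean_over_second_axis(yz)  # pool over z: one vector per y
    return [[xy[i][j] + x_profile[i] + y_profile[j]  # list concatenation
             for j in range(len(xy[0]))]
            for i in range(len(xy))]


# Tiny single-channel example on a 2 x 2 grid.
xy = [[[10.0], [11.0]], [[12.0], [13.0]]]
xz = [[[1.0], [3.0]], [[5.0], [7.0]]]
yz = [[[0.0], [2.0]], [[4.0], [4.0]]]
xy_with_context = aggregate_xy(xy, xz, yz)
```

The other two planes would be handled symmetrically. A 2D convolution applied to the widened plane then sees, at each pixel, a summary of the features along the axis that the plane itself collapses.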

Design 2: One-shot Training Strategy and Data Augmentation
- Function: Learns from a single textured mesh while preventing overfitting to the specific triplane projection.
- Mechanism: Training utilizes a rendering reconstruction loss \(\mathcal{L} = \mathbb{E}_\theta[\mathcal{L}_{MSE}(\theta) + \delta_{app}\mathcal{L}_{app}(\theta)]\) to compare the predicted texture against the ground truth texture under random viewpoints. It applies two levels of data augmentation: (1) pre-processing level 3D transformations on the mesh with re-extracted features, and (2) training-level perturbations including translation, scaling, and small rotations.
- Design Motivation: The core challenge of one-shot learning is generalization. Data augmentation expands the distribution of semantic features, preventing the network from overfitting to the source mesh's specific triplane projection.
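
A minimal sketch of the training-level perturbations follows. The ranges, the choice of the y axis as rotation axis, and the function name `augment_points` are all assumptions; the summary above only states that translation, scaling, and small rotations are used.

```python
import math
import random


def augment_points(points, rng, max_shift=0.05, scale_range=(0.9, 1.1),
                   max_angle_deg=10.0):
    """Apply one random small rotation, scale, and shift to all points.

    All default ranges are hypothetical. The rotation is about the y axis,
    which is also an assumption.
    """
    angle = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    scale = rng.uniform(*scale_range)
    shift = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    augmented = []
    for x, y, z in points:
        # Rotate in the xz plane, then scale uniformly, then translate.
        xr = cos_a * x + sin_a * z
        zr = -sin_a * x + cos_a * z
        augmented.append((scale * xr + shift[0],
                          scale * y + shift[1],
                          scale * zr + shift[2]))
    return augmented


rng = random.Random(0)
surface_points = [(0.2, 0.5, 0.1), (-0.3, 0.0, 0.4)]
perturbed = augment_points(surface_points, rng)
```

Drawing a fresh perturbation at every training step means the network rarely sees the exact same triplane projection twice, which is the stated purpose of the augmentation.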

Design 3: Coloring Neural Field
- Function: Maps triplane semantic features to RGB colors.
- Mechanism: For any 3D query point, features are sampled via bilinear interpolation from the three processed planes, concatenated into a single vector, and mapped to the \([0,1]^3\) color space through a lightweight MLP. Positional encoding is used to improve fine detail, since the triplane resolution is only \(32 \times 32\).
- Design Motivation: The MLP's implicit representation is spatially continuous and generalizes well, making it more robust than operating directly in UV space.
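
The query path described in Design 3 can be sketched as follows. This is an illustration under assumptions, not the paper's network: planes are nested lists indexed by the two axes that remain after dropping axis 0, 1, or 2, query points lie in \([0,1]^3\), the MLP has one hidden ReLU layer with random weights, a sigmoid keeps the output in \([0,1]^3\), and positional encoding is omitted.

```python
import math
import random


def bilinear_sample(plane, u, v):
    """Bilinearly interpolate plane[i][j] (feature lists) at u, v in [0, 1]."""
    n_u, n_v = len(plane), len(plane[0])
    a = min(max(u, 0.0), 1.0) * (n_u - 1)
    b = min(max(v, 0.0), 1.0) * (n_v - 1)
    i0, j0 = int(a), int(b)
    i1, j1 = min(i0 + 1, n_u - 1), min(j0 + 1, n_v - 1)
    fa, fb = a - i0, b - j0
    return [(1 - fa) * ((1 - fb) * plane[i0][j0][c] + fb * plane[i0][j1][c])
            + fa * ((1 - fb) * plane[i1][j0][c] + fb * plane[i1][j1][c])
            for c in range(len(plane[0][0]))]


def sample_triplane(planes, point):
    """Sample each plane at the point's two remaining coordinates, concatenate."""
    feats = []
    for axis, plane in enumerate(planes):
        u, v = [point[a] for a in range(3) if a != axis]
        feats.extend(bilinear_sample(plane, u, v))
    return feats


def coloring_mlp(feats, w1, b1, w2, b2):
    """One hidden ReLU layer, then a sigmoid so each channel lies in [0, 1]."""
    hidden = [max(0.0, sum(w * f for w, f in zip(row, feats)) + b)
              for row, b in zip(w1, b1)]
    return [1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + b)))
            for row, b in zip(w2, b2)]


def random_layer(rng, n_out, n_in):
    """Hypothetical untrained weights, for demonstration only."""
    return ([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)


rng = random.Random(0)
channels, res = 4, 8
planes = [[[[rng.random() for _ in range(channels)] for _ in range(res)]
           for _ in range(res)] for _ in range(3)]
w1, b1 = random_layer(rng, 16, 3 * channels)
w2, b2 = random_layer(rng, 3, 16)
rgb = coloring_mlp(sample_triplane(planes, (0.3, 0.6, 0.9)), w1, b1, w2, b2)
```

Because the color is a function of a continuous 3D position, any surface point of any mesh can be queried without a UV parameterization.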

Loss & Training¶

The total loss is \(\mathcal{L} = \mathbb{E}_\theta[\mathcal{L}_{MSE}(\theta) + \delta_{app}\mathcal{L}_{app}(\theta)]\), where the expectation is taken over random viewpoints \(\theta\). The MSE term enforces pixel-level accuracy, while the perceptual term \(\mathcal{L}_{app}\), weighted by \(\delta_{app}\), emphasizes high-level semantic feature alignment and improves texture realism.
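
How the two terms combine can be shown with a short sketch. The feature extractor of the perceptual term is not specified in this summary, so the code takes it as a parameter; `neighbor_differences` is a deliberately trivial stand-in, and the value of `delta_app` is hypothetical. Rendered views are represented as flat lists of pixel intensities.

```python
def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def neighbor_differences(pixels):
    """Stand-in feature extractor (NOT the paper's perceptual network)."""
    return [q - p for p, q in zip(pixels, pixels[1:])]


def total_loss(pred_views, gt_views, feature_fn, delta_app):
    """Average over sampled views of MSE + delta_app * appearance loss."""
    total = 0.0
    for pred, gt in zip(pred_views, gt_views):
        pixel_term = mse(pred, gt)
        appearance_term = mse(feature_fn(pred), feature_fn(gt))
        total += pixel_term + delta_app * appearance_term
    return total / len(pred_views)


# One rendered view with two pixels; delta_app = 0.5 is a made-up weight.
loss = total_loss([[0.0, 1.0]], [[0.0, 0.0]], neighbor_differences, delta_app=0.5)
```

Averaging over the sampled views is a Monte Carlo estimate of the expectation over \(\theta\) in the formula.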

Key Experimental Results¶

Main Results: Texture Transfer Quality Comparison¶

Method	SIFID↓	CLIP sim.↑	Inference Time
TEXTure	0.34	0.84	5 min
MVEdit	0.38	0.84	1 min
EASI-TEX	0.29	0.85	15 min
TriTex	0.22	0.87	1 min

Ablation Study: Component Contributions¶

Setting	SIFID↓	CLIP sim.↑
w/o network (Nearest Neighbor)	0.23	0.86
w/o \(\mathcal{L}_{MSE}\)	0.23	0.86
w/o \(\mathcal{L}_{app}\)	0.28	0.85
w/o augmentations	0.21	0.85
TriTex (full)	0.22	0.87

Key Findings¶

TriTex outperforms the best baseline EASI-TEX in both SIFID (0.22 vs 0.29) and CLIP similarity (0.87 vs 0.85).
In an Amazon Mechanical Turk user study, TriTex was strongly preferred in all three comparisons (>65% of votes).
Removing the perceptual loss leads to blurry outputs (SIFID rises from 0.22 to 0.28), indicating that high-level feature alignment is crucial.
Removing augmentations lowers CLIP similarity (0.87 to 0.85) but leaves SIFID essentially unchanged (0.21 vs 0.22), so the evidence that generalization relies on augmentation comes from the CLIP score alone.
Inference takes about 1 minute, on par with MVEdit, while achieving better SIFID (0.22 vs 0.38) and CLIP similarity (0.87 vs 0.84).
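
To put the SIFID gaps in relative terms, the numbers from the main results table can be plugged into a few lines of arithmetic. The helper name `relative_reduction` is introduced here for illustration only.

```python
def relative_reduction(baseline, ours):
    """Fractional reduction of a lower-is-better metric relative to a baseline."""
    return (baseline - ours) / baseline


# SIFID values copied from the main results table.
baseline_sifid = {"TEXTure": 0.34, "MVEdit": 0.38, "EASI-TEX": 0.29}
tritex_sifid = 0.22

reductions = {name: relative_reduction(value, tritex_sifid)
              for name, value in baseline_sifid.items()}
for name, reduction in reductions.items():
    print(f"{name}: SIFID reduced by {reduction:.1%}")
```

Relative to the strongest baseline, EASI-TEX, this is a reduction of about 24%, obtained with a 1 minute inference time instead of 15 minutes.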

Highlights & Insights¶

Elegant Combination of One-shot Learning and Semantic Correspondence: Pre-trained semantic features (Diff3F) turn one-shot texture learning into the simpler problem of learning a semantics-to-color mapping.
Practicality of the Triplane Representation: Strikes an excellent balance between 3D and 2D operations, supporting efficient 2D convolutional processing.
Strong Cross-Shape Generalization: Although trained on only a single mesh, it retains high-quality transfer performance even under significant shape variations in objects of the same category.
Fast Feed-forward Inference: No per-object optimization is required during inference, making it ideal for large-scale scene applications.

Limitations & Future Work¶

Lack of generative capabilities: Cannot generate new details that do not exist in the source texture (e.g., if the source dog lacks a tongue, the target dog's tongue cannot be textured).
Does not leverage text-to-image priors: Cannot compensate when semantic feature matching is ambiguous.
Requires target shapes to be aligned with the source shape orientation; allowing arbitrary rotation degrades detail preservation.
Cross-category transfer performance depends on the degree of semantic overlap.
Future work could incorporate cross-attention to handle reference images, enabling image-based feed-forward texturing.

Unlike Texturify/AUV-net, which require large-scale datasets, TriTex only requires a single textured mesh.
Similar to Splice, which utilizes DINO features for 2D color transfer, TriTex extends this to 3D semantic texture transfer.
The cross-shape semantic consistency of Diff3F features is a critical foundation for the success of this method.

Rating¶

⭐⭐⭐⭐: The method is simple yet efficient. Pairing single-instance learning with triplane semantic features is an ingenious design. The experiments are thorough (quantitative metrics plus a user study) and demonstrate clear advantages in texture fidelity and speed.