MaRI: Material Retrieval Integration across Domains

Conference: CVPR 2025
arXiv: 2503.08111
Code: Project Page
Area: Self-Supervised
Keywords: material retrieval, contrastive learning, DINOv2, cross-domain, PBR materials, ZeST

TL;DR

This work proposes the MaRI framework, which constructs a shared embedding space using dual DINOv2 encoders (image + material) via contrastive learning. By combining synthetic data from Blender and real-world material data generated by ZeST, it achieves accurate cross-domain PBR material retrieval.

Background & Motivation

Background: Accurate material retrieval is crucial for creating realistic 3D assets and is widely applied in AR/VR, digital content creation, and industrial design. In principle, material retrieval can be treated as an image search problem, but directly applying general image search methods yields poor results.

Limitations of Prior Work:

1. Image Space ≠ Material Space: Material retrieval requires capturing physical properties such as texture, albedo, and surface roughness. General image search models (ViT/CLIP/DINOv2) fail to represent these properties effectively.
2. Lack of Dedicated Datasets: Large-scale paired image-material datasets do not exist for training material embeddings.
3. Synthetic-to-Real Domain Gap: Although synthetic rendered data is controllable, it cannot fully represent the diverse appearance of real-world materials.
4. Limitations of Existing Methods: MaPa utilizes GPT-4V for classification combined with CLIP for retrieval, which has limited accuracy. Make-it-Real heavily relies on pre-annotated datasets.

Key Challenge: A unified embedding model is needed to align visual features and physical material properties into the same space, but there is a lack of training data and effective alignment methods.

Key Insight: Inspired by CLIP, this work designs a dual-encoder architecture to align image and material representations, and automatically constructs a paired synthetic and real-world dataset.

Method

Overall Architecture

MaRI consists of three parts (dataset construction, dual-encoder training, and retrieval inference), carried out in four steps:

1. Synthetic Data: Objaverse 3D models + AmbientCG PBR materials + HDR illumination $\to$ Blender rendering.
2. Real-World Data: Real images crawled online $\to$ Grounded-SAM segmentation $\to$ ZeST material transfer $\to$ material spheres.
3. Contrastive Training of Dual DINOv2 Encoders: Image encoder $E_I$ and material encoder $E_M$.
4. Inference: Query image $\to$ $E_I$ $\to$ cosine similarity $\to$ nearest neighbor in the material library.
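The inference step above can be sketched as a nearest-neighbor search under cosine similarity. This is a minimal, dependency-free illustration, not the authors' implementation: the function names, the toy 3-D embeddings, and the material ids are all hypothetical stand-ins for the outputs of $E_I$ and $E_M$.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def retrieve_top_k(query_emb, library_embs, k=5):
    """Return the k material ids with the highest cosine similarity to the query."""
    q = l2_normalize(query_emb)
    scores = []
    for material_id, emb in library_embs.items():
        m = l2_normalize(emb)
        # The dot product of unit vectors equals their cosine similarity.
        scores.append((sum(a * b for a, b in zip(q, m)), material_id))
    scores.sort(key=lambda s: s[0], reverse=True)
    return [material_id for _, material_id in scores[:k]]

# Toy embeddings (illustrative values only, not real encoder outputs).
library = {
    "wood_01": [0.9, 0.1, 0.0],
    "metal_07": [0.0, 1.0, 0.1],
    "fabric_03": [0.1, 0.0, 1.0],
}
query = [0.8, 0.2, 0.1]
top2 = retrieve_top_k(query, library, k=2)
print(top2)
```

In practice the library embeddings would typically be computed and normalized once, so that each query costs a single pass over the stored vectors.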

Key Designs

1. Synthetic Dataset Construction (394,560 samples)

- Function: Samples 3D models from Objaverse, normalizes them to the center of a unit cube, and applies 1,605 PBR materials (86 categories) from AmbientCG. Using 712 HDR environment lights, 8 views are rendered from random positions on a hemisphere. Each sample contains a rendered image $x_i$, a segmentation mask, and a material descriptor $m_i$.
- Mechanism: By varying shape (multiple objects), lighting (multiple HDRs), and viewpoint (multiple camera positions), the model is forced to learn shape-invariant and illumination-invariant material features.
- Design Motivation: Existing datasets lack either diversity or paired annotations. An automated synthetic pipeline allows scalable generation of large-scale paired data.
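Two geometric steps of the synthetic pipeline above, unit-cube normalization and hemispherical view sampling, can be sketched as follows. This is an illustrative sketch under stated assumptions: the camera radius, the seed, uniform-area sampling, and both function names are hypothetical choices, and the actual rendering is done in Blender, which is not reproduced here.

```python
import math
import random

def normalize_to_unit_cube(vertices):
    """Center vertices at the origin and scale so the longest bounding-box side is 1."""
    mins = [min(v[i] for v in vertices) for i in range(3)]
    maxs = [max(v[i] for v in vertices) for i in range(3)]
    center = [(lo + hi) / 2 for lo, hi in zip(mins, maxs)]
    extent = max(hi - lo for lo, hi in zip(mins, maxs)) or 1.0  # guard degenerate meshes
    return [tuple((v[i] - center[i]) / extent for i in range(3)) for v in vertices]

def sample_hemisphere_cameras(num_views=8, radius=2.0, seed=0):
    """Sample camera positions uniformly (by area) on the upper hemisphere, z >= 0."""
    rng = random.Random(seed)
    cameras = []
    for _ in range(num_views):
        z = rng.random()                       # uniform height gives uniform area
        phi = rng.uniform(0.0, 2.0 * math.pi)  # azimuth angle
        r_xy = math.sqrt(1.0 - z * z)
        cameras.append((radius * r_xy * math.cos(phi),
                        radius * r_xy * math.sin(phi),
                        radius * z))
    return cameras

cams = sample_hemisphere_cameras()
verts = normalize_to_unit_cube([(0, 0, 0), (4, 2, 1)])
```

Each sampled position would then be used as a camera location looking at the origin, where the normalized object sits.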

2. Real-World Dataset Construction (30,000 samples)

- Function: Collects real-world images from online sources and multiple datasets, segments foreground objects using Grounded-SAM based on material prompts, and transfers the object materials onto neutral material spheres via the ZeST pipeline to generate standardized material representations.
- Mechanism: The material transfer capability of ZeST "maps" the materials from arbitrary real-world images into a unified material sphere representation, achieving self-supervised pairing without human annotation. This covers 8 major material categories (metal, fabric, wood, ceramic, etc.).
- Design Motivation: Synthetic data alone leaves a domain gap. Real-world data provides diverse practical material appearances, so the two sources complement each other.

3. Domain-Adaptive Contrastive Learning

- Function: Two DINOv2 encoders process the masked image $x_i \odot \text{mask}_i$ and the material sphere $m_i$ respectively. Only the last Transformer block of each encoder is fine-tuned. The InfoNCE loss aligns positive pairs and separates negative pairs, with temperature $\tau = 0.07$.
- Mechanism: With $\mathbf{z}_I^i = E_I(x_i \odot \text{mask}_i)$ and $\mathbf{z}_M^i = E_M(m_i)$ denoting the embeddings of the $i$-th pair in a batch of $N$ pairs, the loss is

$$\mathcal{L}_{\text{contrast}} = -\frac{1}{N}\sum_i \log \frac{\exp(\text{sim}(\mathbf{z}_I^i, \mathbf{z}_M^i)/\tau)}{\sum_j \exp(\text{sim}(\mathbf{z}_I^i, \mathbf{z}_M^j)/\tau)}$$

- Design Motivation: (1) Freezing most parameters preserves the generalization ability of DINOv2. (2) Fine-tuning only the last block is sufficient to learn fine-grained differences specific to the material domain. (3) InfoNCE outperforms Triplet Loss due to its batch-level global optimization capability.
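The loss above can be written out directly. The sketch below is a plain-Python illustration rather than the training code: it assumes $\text{sim}$ is cosine similarity (the measure used at inference) and takes embeddings as lists of floats. The names `cosine` and `info_nce` are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def info_nce(image_embs, material_embs, tau=0.07):
    """Image-to-material InfoNCE; the i-th image and i-th material are the positive pair."""
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(image_embs[i], material_embs[j]) / tau for j in range(n)]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(logit - peak) for logit in logits))
        total += -(logits[i] - log_denom)
    return total / n

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs, correctly paired embeddings give a loss near zero, while swapping the materials gives a loss near $1/\tau$, which shows how strongly a small temperature penalizes mismatched pairs.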

Loss & Training

Loss: InfoNCE (temperature $\tau = 0.07$)
Encoders: Dual DINOv2, fine-tuning only the last Transformer block
Image Input: Foreground objects masked by segmentation masks (removing background influence)
Material Input: Standardized material sphere images (unified size and shape)
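The "fine-tune only the last Transformer block" setting can be sketched as a selection over parameter names. This is a hypothetical illustration: the `blocks.{i}.` naming scheme and the 12-block depth are assumptions, real checkpoints may name parameters differently, and whether the final norm layer is also trained is not stated in the summary (it is kept frozen here).

```python
def split_trainable(param_names, num_blocks):
    """Split parameter names into (trainable, frozen); only the last block is trainable."""
    # The trailing dot keeps "blocks.1." from matching "blocks.11." and vice versa.
    last_prefix = f"blocks.{num_blocks - 1}."
    trainable = [n for n in param_names if n.startswith(last_prefix)]
    frozen = [n for n in param_names if not n.startswith(last_prefix)]
    return trainable, frozen

# Hypothetical parameter names for a 12-block ViT (assumed naming, for illustration).
names = ["patch_embed.proj.weight"]
names += [f"blocks.{i}.attn.qkv.weight" for i in range(12)]
names += [f"blocks.{i}.mlp.fc1.weight" for i in range(12)]
names += ["norm.weight"]

trainable, frozen = split_trainable(names, num_blocks=12)
print(trainable)
```

In a PyTorch setup, the same split would typically be applied by setting `requires_grad` on each parameter according to its name.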

Key Experimental Results

Main Results

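Metrics: T1I and T5I are Top-1 and Top-5 instance-level accuracy, T1C is Top-1 category accuracy, and T3IoU is Top-3 IoU. "Trained" and "Unseen" refer to materials seen and not seen during training.

Method	Trained T1I↑	Trained T5I↑	Trained T1C↑	Trained T3IoU↑	Unseen T1I↑	Unseen T5I↑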
ViT	3.5%	12.0%	16.0%	0.41	16.5%	56.0%
DINOv2	7.5%	28.0%	69.0%	0.67	31.0%	62.5%
CLIP	2.0%	11.0%	36.5%	0.47	14.0%	29.5%
Make-it-Real	8.5%	16.0%	76.5%	0.60	42.5%	75.0%
MaPa	2.5%	17.5%	80.0%	0.80	19.5%	69.0%
MaRI	26.0%	90.0%	81.5%	0.77	54.0%	89.0%

MaRI leads by a large margin in instance-level retrieval: its Trained Top-5 accuracy reaches 90.0% (the strongest baseline, DINOv2, reaches 28.0%, and MaPa reaches 17.5%), and its Unseen Top-1 accuracy reaches 54.0% (compared to 42.5% for Make-it-Real, the strongest baseline).
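For reference, instance-level Top-k accuracy can be computed as below. This is a sketch, assuming T1I and T5I count a query as correct when its ground-truth material id appears among the top-k retrieved ids. The function name and the toy rankings are illustrative and do not come from the paper's evaluation code.

```python
def top_k_accuracy(ranked_lists, ground_truth, k):
    """Fraction of queries whose ground-truth material id is among the top-k results."""
    hits = sum(1 for ranked, gt in zip(ranked_lists, ground_truth) if gt in ranked[:k])
    return hits / len(ground_truth)

# Toy retrieval results for four queries (made-up ids, best match first).
ranked = [
    ["wood_01", "wood_02", "metal_07"],
    ["metal_03", "metal_07", "wood_01"],
    ["fabric_03", "wood_01", "metal_07"],
    ["wood_02", "fabric_03", "stone_05"],
]
truth = ["wood_01", "metal_07", "metal_07", "ceramic_09"]

t1 = top_k_accuracy(ranked, truth, k=1)
t3 = top_k_accuracy(ranked, truth, k=3)
print(t1, t3)
```

Only the first query is a Top-1 hit, while the first three are Top-3 hits, so a large gap between Top-1 and Top-5 numbers (as in the table) indicates that the correct material is often ranked near, but not at, the top.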

Ablation Study

Data Scale Influence:

Data Scale	Trained T1I↑	Trained T5I↑	Unseen T1I↑	Unseen T5I↑
25%	19.5%	55.5%	44.5%	83.5%
50%	20.0%	63.5%	46.0%	85.5%
100%	26.0%	90.0%	54.0%	89.0%

Architecture/Data Combination:

Dual Encoder	Real Data	Synthetic Data	Trained T5I↑	Unseen T5I↑
✓	✗	✓	62.0%	78.0%
✓	✓	✗	27.5%	63.5%
✗	✓	✓	61.0%	85.5%
✓	✓	✓	90.0%	89.0%

Fine-Tuning Strategy:

Fine-tuning Scope	Loss Function	Trained T5I↑	Unseen T5I↑
All Parameters	InfoNCE	42.5%	67.0%
All Parameters	Triplet	21.0%	52.5%
Last Layer	Triplet	31.5%	71.5%
Last Layer	InfoNCE	90.0%	89.0%

Key Findings

General Search Models are Severely Inadequate: CLIP's instance-level Top-1 accuracy on trained materials is only 2.0%, supporting the claim that material space and image space are distinct.
Both Synthetic and Real-World Data are Indispensable: Removing synthetic data drops Trained T5I from 90.0% to 27.5%, while removing real-world data drops it to 62.0%.
Less is More in Fine-Tuning: With InfoNCE, fine-tuning only the last layer far outperforms full-parameter fine-tuning (Trained T5I of 90.0% vs. 42.5%), plausibly because it avoids overfitting to the training distribution.
InfoNCE >> Triplet: With last-layer fine-tuning, InfoNCE reaches a Trained T5I of 90.0% versus 31.5% for Triplet loss, consistent with batch-level global optimization outperforming triplet-level local optimization.
Limited IoU Advantage for MaPa: MaPa's T3IoU (0.80) is slightly higher than MaRI's (0.77), likely because MaPa first uses GPT-4V for category classification, which narrows the search space.

Highlights & Insights

Clear Problem Definition: Systematically investigates material retrieval as an independent task for the first time, clearly distinguishing it from general image retrieval.
Automated Data Construction: Both the synthetic pipeline (Objaverse+AmbientCG+Blender) and the real-world pipeline (ZeST+Grounded-SAM) can be scaled automatically.
Effective Minimal Fine-Tuning Strategy: Fine-tuning only the last layer of DINOv2 is sufficient to learn material-domain-specific representations while avoiding overfitting.
High Practical Value: Material retrieval can be directly integrated into 3D asset generation pipelines (e.g., MaPa and Make-it-Real).

Limitations & Future Work

Real-world material spheres generated via ZeST may suffer from distortion, affecting pairing quality.
The evaluation sets are relatively small (200 materials each for the Trained and Unseen sets), so the reported differences carry limited statistical weight.
Efficiency issues of retrieving from larger-scale material libraries (>10K materials) have not been explored.
Material categories are still relatively limited (86 synthetic + 8 real), with insufficient coverage of rare materials.
Retrieval scenarios for multi-part objects (different parts with different materials) were not considered.

Related Work

CLIP: MaRI's dual-encoder and contrastive learning framework is directly inspired by CLIP but replaces cross-modal (image-text) alignment with cross-domain (image-material) visual alignment.
ZeST: A zero-shot material transfer method, creatively utilized here to automatically generate material pairings for real-world data.
DINOv2: A powerful vision foundation model providing high-quality pretrained features.
MaPa / Make-it-Real: Pioneering works that integrate material retrieval into 3D generation pipelines, though with limited retrieval accuracy.

Rating ⭐⭐⭐⭐

Novelty: ⭐⭐⭐⭐ The first systematic material retrieval framework, with innovative data construction methods.
Experimental Thoroughness: ⭐⭐⭐⭐ Comparisons against multiple baselines + ablation studies across three dimensions (data, architecture, and training).
Writing Quality: ⭐⭐⭐⭐ Clear motivation of the problem, standard and precise description of the methodology.
Value: ⭐⭐⭐⭐ Can be directly integrated into 3D asset creation workflows.