ZeST: Zero-Shot Material Transfer from a Single Image

Conference: ECCV 2024
arXiv: 2404.06425
Code: Project Page
Area: 3D Vision

TL;DR

ZeST is a zero-shot, training-free material transfer method. It combines three parallel branches: IP-Adapter extracts a material representation, ControlNet provides geometric guidance, and a foreground grayscale image supplies lighting cues. Together they achieve 2D material transfer from a single material exemplar image to a target object.

Background & Motivation

Background

Editing the materials of objects in images (e.g., transforming marble to steel) is of great value in applications such as game design and e-commerce.

Limitations of Prior Work

Traditional methods require explicit 3D geometry, illumination estimation, and material parameter specification, which makes the editing pipeline highly complex.

Key Challenge

Text-driven methods struggle to precisely describe the fine texture details of materials.

Core Idea

Existing approaches like TextureDreamer require 3-5 material images for DreamBooth fine-tuning, which is time-consuming and does not scale. ZeST instead works from a single exemplar image with no fine-tuning.

Core Problem

How to transfer material to an object in a target image from a single material exemplar image under training-free conditions.

Method

Overall Architecture

Three parallel branches feed into Stable Diffusion XL Inpainting:

  1. Material Encoding Branch: IP-Adapter encodes the material exemplar to extract the material latent representation \(z_M\).
  2. Geometric Guidance Branch: DPT estimates the depth map \(D_I\) of the input image, which ControlNet uses as a structural constraint.
  3. Lighting Guidance Branch: A foreground grayscale image \(I_{init}\) is created from the input image and the foreground mask \(F\), and is used to initialize the inpainting model.

\[I_{gen} = \mathcal{S}(z_M, D_I, I_{init}, F)\]
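The data flow behind this formula can be sketched as plain function composition. This is a minimal illustration only, not the authors' code: every name below (`ZestInputs`, `prepare_inputs`, `transfer_material`, and the injected callables) is hypothetical, and the real models (IP-Adapter, DPT, Rembg, SDXL Inpainting with ControlNet) would be passed in as the callables.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ZestInputs:
    """The four conditioning signals in I_gen = S(z_M, D_I, I_init, F)."""
    z_m: Any    # material embedding z_M from the exemplar
    depth: Any  # depth map D_I of the input image
    init: Any   # grayscale-foreground initialization I_init
    mask: Any   # foreground mask F


def prepare_inputs(image, exemplar, *, encode_material, estimate_depth,
                   segment_foreground, make_init):
    # All four callables are hypothetical stand-ins for the real models.
    mask = segment_foreground(image)
    return ZestInputs(
        z_m=encode_material(exemplar),  # branch 1: material encoding
        depth=estimate_depth(image),    # branch 2: geometric guidance
        init=make_init(image, mask),    # branch 3: lighting guidance
        mask=mask,
    )


def transfer_material(image, exemplar, *, sampler, **stages):
    # `sampler` plays the role of S; no training or fine-tuning step occurs.
    inputs = prepare_inputs(image, exemplar, **stages)
    return sampler(inputs.z_m, inputs.depth, inputs.init, inputs.mask)
```

Passing the models in as arguments keeps the sketch independent of any particular library version. It also makes the branch structure explicit: the exemplar only ever reaches the sampler through \(z_M\), while the input image contributes depth, mask, and initialization.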

Key Designs

Material Encoding (IP-Adapter):

  • Utilizes a CLIP image encoder to extract the feature representation of the material exemplar.
  • Injects it into the diffusion model via cross-attention.
  • Requires no DreamBooth fine-tuning; a single image is sufficient.

Geometric Guidance (ControlNet):

  • Depth-based ControlNet obtains structural information from the depth map of the input image.
  • Overrides the geometric information in the material encoding \(z_M\), ensuring the generated object retains its original shape.
  • IP-Adapter + Img2Img fails to preserve the original geometry (Key Finding).

Lighting Guidance via Foreground Grayscale (Core Design Choice):

  • Directly using the original image: The original object's color acts as a strong prior that interferes with the material color (e.g., an orange pumpkin).
  • Random noise initialization: Loses lighting and shading direction information.
  • Foreground grayscale image (Optimal): Removes color priors while preserving lighting and shading information.
  • Formally, \(I_{init} = F \odot I_{gray} + (1-F) \odot I\), where \(F\) is the foreground mask, \(I\) the input image, and \(I_{gray}\) its grayscale version.
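The compositing formula for \(I_{init}\) is simple enough to write out directly. The following is a minimal NumPy sketch, not the authors' implementation. The function name `make_init_image` is hypothetical, and the BT.601 luma weights used for the grayscale conversion are an assumption, since the summary does not say how \(I_{gray}\) is computed.

```python
import numpy as np


def make_init_image(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Compute I_init = F * I_gray + (1 - F) * I.

    image: H x W x 3 float array in [0, 1] (the input image I).
    mask:  H x W array in [0, 1], 1 = foreground (the mask F).
    """
    # Assumption: BT.601 luma weights; the source does not specify the conversion.
    weights = np.array([0.299, 0.587, 0.114])
    gray = image @ weights                           # H x W luminance
    gray3 = np.repeat(gray[..., None], 3, axis=-1)   # back to 3 channels
    f = mask[..., None].astype(image.dtype)          # broadcast F over channels
    # Foreground loses its color but keeps shading; background is untouched.
    return f * gray3 + (1.0 - f) * image
```

A soft mask (values between 0 and 1) blends the two images at the object boundary, which the formula permits without any change to the code.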

Implementation Details:

  • DPT is used for depth estimation and Rembg for foreground extraction.
  • Based on SDXL Inpainting plus the corresponding versions of ControlNet and IP-Adapter.
  • Takes about 15 seconds to generate one image on a single A10 GPU.

Loss & Training

No training is involved; the entire process is completed during the inference stage of the pre-trained models.

Key Experimental Results

Main Results

Quantitative comparison on the synthetic dataset (9 materials \(\times\) 10 meshes = 90 pairs):

| Method | PSNR↑ | LPIPS↓ | CLIP↑ |
|---|---|---|---|
| IP-Adapter + InstructPix2Pix | 16.92 | 0.096 | 0.745 |
| Dreambooth + Geo/Illum Guidance | 25.46 | 0.053 | 0.893 |
| ZeST | 25.82 | 0.046 | 0.899 |
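For readers unfamiliar with the PSNR column, the standard definition is \(10 \log_{10}(\text{MAX}^2 / \text{MSE})\) in decibels. The sketch below implements that textbook formula. Whether the paper computes it over the whole image or only the object region is not stated in this summary, so the whole-image averaging here is an assumption, and the function name `psnr` is illustrative.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Assumption: mean squared error over every pixel and channel.
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

On this scale, the gap between ZeST and the DreamBooth baseline in the table is 25.82 - 25.46 = 0.36 dB, far smaller than the gap to the InstructPix2Pix baseline.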

User study (real images, 1-5 scale):

| Method | Material Fidelity↑ | Realism↑ |
|---|---|---|
| IP-Adapter + InstructPix2Pix | 1.48 | 3.23 |
| Dreambooth + Geo/Illum Guidance | 3.25 | 3.41 |
| ZeST | 4.05 | 3.78 |

Ablation Study

Comparative validation of lighting guidance methods: The original image preserves the object's base color, interfering with the material color (Setting 1); random noise leads to incorrect lighting direction (Setting 2); the foreground grayscale image is optimal (Setting 3).

Robustness testing:

  • Changing the illumination direction and rotation angle of the material exemplar \(\rightarrow\) the generation results are highly consistent.
  • Scaling the material exemplar image \(\rightarrow\) the model automatically adjusts the texture scale to fit the target object.

Key Findings

  • ZeST leads the user study on material fidelity (4.05 vs. 3.25 for the DreamBooth baseline) and on realism (3.78 vs. 3.41), so the zero-shot method is rated above the fine-tuning baseline on both criteria.
  • Foreground grayscale is the best of the three lighting-guidance initializations compared in the ablation, which supports the core design choice.
  • ZeST is robust to changes in lighting, rotation, and scaling of the material exemplar.
  • The DreamBooth encoding process loses material information and causes color shifts (especially in real-world scenes).
  • Can be extended to multi-object editing (iterating with SAM) and illumination-aware material transfer.

Highlights & Insights

  • A purely engineering yet extremely elegant pipeline design—three branches each perform their own functions, completely training-free.
  • The insight of foreground grayscaling is simple yet key: removing color \(\rightarrow\) removing color priors, while preserving grayscale \(\rightarrow\) preserving lighting and shading.
  • As an early work on a new problem (2D-to-2D material transfer), it proposes both synthetic and real-world evaluation datasets.
  • Can be combined with 3D texturing methods like Text2Tex to bring material-exemplar-driven texturing into 3D.

Limitations & Future Work

  • Sometimes transfers material only to the most "plausible" regions of the object (partial transfer issue).
  • Multiple materials contained within the material exemplar may get blended.
  • IP-Adapter lacks region-level material extraction capabilities.
  • The latent space control of diffusion models is sometimes unpredictable.

Rating

  • Novelty: ⭐⭐⭐⭐ — New problem definition + clever training-free scheme
  • Effectiveness: ⭐⭐⭐⭐ — Significant lead in user studies
  • Practicality: ⭐⭐⭐⭐⭐ — Zero-shot, training-free, 15-second generation
  • Recommendation: ⭐⭐⭐⭐