EasyCraft: A Robust and Efficient Framework for Automatic Avatar Crafting

Conference: CVPR 2025
arXiv: 2503.01158
Code: None
Area: Image Generation / Game Character Customization
Keywords: Virtual Character Creation, Automatic Avatar Crafting, Self-Supervised Learning, ViT Encoder, Text-to-Character

TL;DR

Proposes EasyCraft, an end-to-end automatic avatar crafting framework that translates photos of arbitrary styles into game crafting parameters using a general ViT encoder pre-trained with self-supervision, and supports text-driven character creation by integrating Stable Diffusion.

Background & Motivation

Background: Character customization ("avatar crafting") is a core gameplay feature in RPGs, but adjusting parameters by hand is slow and laborious. Existing automatic methods rely on semantic constraints tied to specific image domains (e.g., segmentation, perceptual features, CLIP) and require a neural renderer to be developed for each engine style.

Limitations of Prior Work: Engine styles vary greatly (realistic, anime, cartoon), and existing methods depend on supervision signals tied to a specific style, which makes them difficult to transfer across engines. They also typically accept either image or text input, but not both.

Key Challenge: A translator trained only on engine-rendered data cannot handle input images in non-engine styles.

Core Idea: Pre-train a general ViT encoder on multi-style face datasets using MAE self-supervised learning to unify the feature distribution across styles, then freeze the encoder and train only the parameter generation module.

Method

Overall Architecture

Two-stage process: (1) Pre-train a general ViT encoder using MAE on 5.1 million multi-style face images; (2) Freeze the encoder and train the parameter generation module on parameter-screenshot pairs randomly sampled from the engine. At inference time, the translator accepts input photos of arbitrary styles. For text-driven creation, a Stable Diffusion (SD) model first generates an engine-style face image from the prompt, and that image is then passed through the same translator.
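
The stage-2 training data described above (parameter-screenshot pairs randomly sampled from the engine) could be collected roughly as follows. This is a minimal sketch, not the authors' code: the `CraftingParams` layout, the parameter counts, the [0, 1] value range, and the `fake_render` stand-in for the game engine are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CraftingParams:
    structure: List[float]   # continuous facial-structure sliders, assumed in [0, 1]
    texture_ids: List[int]   # discrete makeup-texture choices
    attributes: List[float]  # continuous makeup attributes, assumed in [0, 1]


def sample_params(rng: random.Random,
                  n_structure: int = 8,
                  texture_options: Tuple[int, ...] = (5, 3),
                  n_attributes: int = 4) -> CraftingParams:
    """Draw one random parameter set; all sizes here are hypothetical."""
    return CraftingParams(
        structure=[rng.random() for _ in range(n_structure)],
        texture_ids=[rng.randrange(k) for k in texture_options],
        attributes=[rng.random() for _ in range(n_attributes)],
    )


def build_pairs(render: Callable[[CraftingParams], object],
                n_pairs: int, seed: int = 0) -> list:
    """Return (screenshot, params) pairs; `render` stands in for the engine."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        params = sample_params(rng)
        pairs.append((render(params), params))
    return pairs


def fake_render(params: CraftingParams) -> str:
    # Placeholder: a real engine would return a rendered face screenshot.
    return f"screenshot({len(params.structure)} sliders)"


pairs = build_pairs(fake_render, n_pairs=3)
```

Because the labels come directly from the sampled parameters, no manual annotation or external supervision is needed, which is what makes the module easy to retrain for another engine.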

Key Designs

  1. General ViT Encoder (MAE Pre-training):

    • Function: Encodes face images of arbitrary styles into a unified feature space.
    • Mechanism: Collects a multi-style face dataset (~5.1 million images) containing real, anime, and game engine styles, and pre-trains a ViT with a 75% masking rate using the MAE strategy, enabling the encoder to learn unified face features across styles.
    • Design Motivation: The unified feature distribution allows the parameter generation module trained on engine data to generalize to arbitrary input styles.
  2. Engine-Specific Parameter Generation Module:

    • Function: Translates unified features into specific game engine avatar parameters.
    • Mechanism: Three parallel MLPs predict facial structure parameters (continuous values, L1 loss), makeup texture parameters (discrete values, cross-entropy loss), and makeup attribute parameters (continuous values, L1 loss with conditional masking), respectively.
    • Design Motivation: Training only requires randomly sampled data from the engine without any external supervision, allowing easy transfer to other engines.
  3. Engine-Style Stable Diffusion:

    • Function: Enables text-to-character creation.
    • Mechanism: Fine-tunes SD v1.5 on 7,000 engine-rendered images paired with GPT-4o-generated descriptions so that it produces engine-style face images; these images are then passed to the translator to obtain crafting parameters.
    • Design Motivation: The face style generated by the original SD does not match the engine and lacks makeup details; fine-tuning resolves this domain gap.
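
The 75% masking used in MAE pre-training (design 1 above) can be illustrated with a small sketch of the random patch selection step. It assumes a 14×14 patch grid (196 patches), which is the usual layout for a 224×224 input with 16×16 patches but is not stated in the source; `random_mask` is a hypothetical helper name.

```python
import random


def random_mask(num_patches: int, mask_ratio: float = 0.75, seed=None):
    """Split patch indices into visible and masked sets, MAE-style.

    Only the visible patches would be fed to the encoder; the masked
    ones are what the decoder has to reconstruct.
    """
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)               # random permutation of patch indices
    keep = sorted(order[:num_keep])  # visible patches
    masked = sorted(order[num_keep:])
    return keep, masked


# Assumed 14 x 14 grid of patches.
keep, masked = random_mask(14 * 14, mask_ratio=0.75, seed=0)
```

With these assumptions the encoder sees 49 of the 196 patches per image, so it must learn face structure that holds across real, anime, and engine styles in order to reconstruct the rest.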

Loss & Training

MAE pre-training takes two weeks on 8 A100 GPUs. The translator is trained for 50 epochs on 4 A30 GPUs, updating only the parameter generation module while the encoder stays frozen. Inference takes 0.026 seconds per image.
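
The three loss terms of the parameter generation module (L1 for facial structure, cross-entropy for makeup textures, masked L1 for makeup attributes) can be sketched in plain Python. This is an illustration under assumptions: the equal weighting of the three terms, the dictionary layout, and the interpretation of "conditional masking" as zeroing out attributes whose makeup item is not applied are not confirmed by the source.

```python
import math


def l1_loss(pred, target, mask=None):
    """Mean absolute error; entries with mask 0 are ignored."""
    if mask is None:
        mask = [1.0] * len(pred)
    total = sum(m * abs(p - t) for p, t, m in zip(pred, target, mask))
    count = sum(mask)
    return total / count if count else 0.0


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one discrete choice (log-sum-exp form)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target_index]


def translator_loss(pred, target):
    """Sum of the three head losses; equal weights are an assumption."""
    structure = l1_loss(pred["structure"], target["structure"])
    texture = sum(
        cross_entropy(logits, idx)
        for logits, idx in zip(pred["texture_logits"], target["texture_ids"])
    ) / len(target["texture_ids"])
    attributes = l1_loss(pred["attributes"], target["attributes"],
                         target["attribute_mask"])
    return structure + texture + attributes


# Toy example with hypothetical sizes.
pred = {"structure": [0.5, 0.2],
        "texture_logits": [[0.0, 0.0]],
        "attributes": [1.0, 0.3]}
target = {"structure": [0.0, 0.2],
          "texture_ids": [0],
          "attributes": [0.0, 0.3],
          "attribute_mask": [0.0, 1.0]}  # first attribute is inactive
loss = translator_loss(pred, target)
```

In the toy example the first attribute has a large error but contributes nothing, because its mask entry is 0; only the structure and texture terms are non-zero.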

Key Experimental Results

Main Results

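| Method    | Identity Similarity ↑ | FID ↓ | Speed (per image) |
|-----------|-----------------------|-------|-------------------|
| F2P       | 0.376                 | 40.69 | 1.14s             |
| F2P v2    | 0.275                 | 34.27 | 0.007s            |
| EasyCraft | 0.351                 | 17.65 | 0.026s            |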
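
To put the table in relative terms, the short script below computes how EasyCraft compares with each baseline. The numbers are copied from the table; the helper name `relative_change` and the choice of metrics to compare are only for illustration.

```python
results = {
    "F2P":       {"identity": 0.376, "fid": 40.69, "seconds": 1.14},
    "F2P v2":    {"identity": 0.275, "fid": 34.27, "seconds": 0.007},
    "EasyCraft": {"identity": 0.351, "fid": 17.65, "seconds": 0.026},
}


def relative_change(new: float, old: float) -> float:
    """Fractional change of `new` with respect to `old`."""
    return (new - old) / old


ours = results["EasyCraft"]
summary = {}
for name, row in results.items():
    if name == "EasyCraft":
        continue
    summary[name] = {
        # Negative FID change means EasyCraft is better (lower FID).
        "fid_change": relative_change(ours["fid"], row["fid"]),
        # Positive identity change means EasyCraft is better.
        "identity_change": relative_change(ours["identity"], row["identity"]),
        # Ratio above 1 means EasyCraft is slower per image.
        "time_ratio": ours["seconds"] / row["seconds"],
    }

for name, stats in summary.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

On these numbers, EasyCraft's FID is roughly 57% lower than F2P's and 48% lower than F2P v2's. Its identity similarity is slightly below F2P's but well above F2P v2's, and it is about 44 times faster than F2P while being about 3.7 times slower than F2P v2.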

Key Findings

  • Removing ViT pre-training leads to severe distortion for non-engine style inputs.
  • In text-driven scenarios, the engine-style SD is significantly more accurate than the original SD.
  • In user studies, 87% of users preferred EasyCraft (vs. 11% for F2P).

Highlights & Insights

  • Key Insight: Unifying the encoder's feature distribution across styles lets a parameter generation module trained on engine data alone generalize to non-engine inputs.
  • The method can be easily transferred to any system supporting parameterized customization.
  • Supporting both text and image input makes the framework more practical than single-input solutions.

Limitations & Future Work

  • The encoder pre-training cost is high (two weeks on 8×A100).
  • Performance may degrade on input images with extreme angles or occlusions.
  • The diversity of text-driven generation is limited by the engine's parameter space.

Rating

  • Novelty: 7/10 — The idea of combining MAE with cross-style unification is novel.
  • Technical Depth: 7/10 — The method is simple yet effective.
  • Experimental Thoroughness: 8/10 — Validated on two game engines along with user studies.
  • Writing Quality: 7/10 — Clear and standardized.