
FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation

Conference: CVPR 2026 | arXiv: 2603.22054 | Code: N/A | Area: Diffusion Models / Image Generation | Keywords: artistic font generation, element-driven, visual in-context generation, image inpainting, style control

TL;DR

FontCrafter reframes artistic font generation as a visual in-context generation task. By horizontally concatenating reference element images with a blank canvas and feeding the result into a pretrained inpainting model (FLUX.1-Fill), it achieves high-fidelity element-driven font creation, significantly outperforming existing methods in both texture and structural fidelity.

Background & Motivation

  1. Background: Artistic font generation aims to synthesize stylized glyphs conditioned on reference styles. Existing approaches fall into two major paradigms: GAN-based feature fusion methods and zero-shot diffusion model methods augmented with adapters (e.g., IP-Adapter).
  2. Limitations of Prior Work: GAN-based methods suffer from limited model capacity and training on small-scale, simple-texture datasets, resulting in poor generalization. Diffusion-based methods with style adapters capture only global features, ignoring pixel-level details, making it difficult to precisely match the reference style. Both paradigms support only coarse-grained control (color/overall style).
  3. Key Challenge: Faithfully preserving both texture and structural information from reference elements while balancing style diversity and fine-grained control remains an open challenge.
  4. Goal: (a) achieve pixel-level element style transfer rather than merely transferring global semantics; (b) control glyph shape in a lightweight manner; (c) prevent hallucinated strokes in background regions.
  5. Key Insight: The authors draw inspiration from the "context propagation" capability of inpainting models (FLUX.1-Fill)—inpainting models can propagate visual cues from visible regions into masked regions. Leveraging this property, the reference element image serves as the visible context and the glyph region serves as the masked area, naturally enabling style transfer.
  6. Core Idea: Reformulate artistic font generation as a visual in-context inpainting task, allowing reference elements to directly "fill" glyph regions in pixel space.

Method

Overall Architecture

The inputs are a reference element image and a glyph mask. The element image is horizontally concatenated with a blank canvas in pixel space to form the input image; the corresponding inpainting mask is an all-zero region over the reference half concatenated with the glyph mask over the canvas half (a minimal sketch of this construction follows). The overall framework is built upon the FLUX.1-Fill inpainting model, enhanced by three additional components: a Context-aware Mask Adapter (CMA) that injects glyph structure, Attention Redirection (AR) that suppresses hallucinated strokes and enables region-aware style blending, and Edge Repainting that refines glyph boundaries.
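
A minimal sketch of this pixel-space input construction, assuming a white blank canvas and equal sizes for the reference and canvas halves (both assumptions, not stated in the paper):

```python
import numpy as np
from PIL import Image

def build_inpainting_inputs(element_img: Image.Image, glyph_mask: Image.Image):
    """Concatenate [reference element | blank canvas] and the matching inpainting mask."""
    w, h = element_img.size
    blank = Image.new("RGB", (w, h), "white")          # blank canvas to be filled (color assumed)

    image = Image.new("RGB", (2 * w, h))
    image.paste(element_img, (0, 0))                   # visible context on the left
    image.paste(blank, (w, 0))                         # region to be generated on the right

    mask = np.zeros((h, 2 * w), dtype=np.uint8)        # all-zero over the reference half
    mask[:, w:] = np.array(glyph_mask.convert("L").resize((w, h)))  # glyph mask over the canvas half
    return image, Image.fromarray(mask)
```

The concatenated image and mask are then handed to FLUX.1-Fill as an ordinary inpainting request, so the reference element acts as visible context for the masked glyph region.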

Key Designs

  1. Context-aware Mask Adapter (CMA):

    • Function: Injects glyph shape information to control the structure of the generated glyph.
    • Mechanism: A lightweight module is inserted at the end of each MM-DiT block, consisting of two linear layers with a GELU activation in between. The downsampled glyph mask is concatenated with the MM-DiT block's output features along the channel dimension as input. The first layer reduces the channel dimension to 64, and the second layer restores the original dimension. By fusing contextual features with the glyph mask, CMA adaptively generates control signals conditioned on different reference elements (see the CMA sketch after this list).
    • Design Motivation: If the control signal were derived from the glyph mask alone, it would be independent of the reference element. However, even for the same glyph, different elements should produce different structural characteristics (e.g., a flower element vs. a stone element). Fusing contextual features endows the control signal with element-awareness. CMA accounts for only 0.5% of the model parameters (22.4M vs. ControlNet's 743.81M).
  2. Attention Redirection:

    • Function: Suppresses hallucinated strokes in background regions and enables region-aware style blending.
    • Mechanism: An attenuation matrix \(M_{attenuate} \in \mathbb{R}^{L \times L}\) is defined, where entry \((i, j)\) is set to 1 when token \(i\) belongs to the glyph background region and token \(j\) belongs to the reference foreground region. The attention logits are modified during self-attention as \(\hat{A} = A + M_{attenuate} \cdot \log_e(\lambda)\), where \(\lambda \in (0,1)\) is an attenuation factor; adding \(\log_e(\lambda)\) to a logit multiplies the corresponding exponentiated attention score by \(\lambda\), suppressing how strongly glyph-background queries attend to reference-foreground keys (see the redirection sketch after this list).
    • Design Motivation: The model occasionally generates extraneous content outside the glyph region (hallucinated strokes). By suppressing cross-region interactions from the reference foreground to the glyph background, style transfer is confined to the masked stroke regions. This mechanism requires no training and is applied directly at inference time.
  3. Edge Repainting:

    • Function: Refines glyph boundaries so that they more naturally reflect the characteristics of the reference element.
    • Mechanism: A narrow mask region is defined around the glyph contour, and a fine-tuned FLUX.1-Fill LoRA model reconstructs this region, leveraging the surrounding visual context to recover boundary details consistent with the reference style (see the band-mask sketch after this list).
    • Design Motivation: At inference time, glyph masks are derived from standard font libraries and have uniformly clean contours. For amorphous elements (e.g., clouds, flames), the model adheres too strictly to mask boundaries, producing edges that are overly smooth and unnatural.
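
The sketches below illustrate the three designs under stated assumptions; they are minimal reconstructions from the descriptions above, not the authors' code.

Context-aware Mask Adapter: a two-layer bottleneck MLP over the channel-wise concatenation of block features and the downsampled glyph mask. The 64-dim bottleneck follows the text; treating the output as a residual added to the block output is an assumption.

```python
import torch
import torch.nn as nn

class ContextAwareMaskAdapter(nn.Module):
    """Minimal CMA sketch: Linear -> GELU -> Linear with a 64-dim bottleneck."""

    def __init__(self, hidden_dim: int, mask_dim: int = 1, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim + mask_dim, bottleneck)  # reduce to 64 channels
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_dim)               # restore the original dimension

    def forward(self, block_out: torch.Tensor, mask_tokens: torch.Tensor) -> torch.Tensor:
        # block_out:   (B, L, hidden_dim)  output of an MM-DiT block
        # mask_tokens: (B, L, mask_dim)    glyph mask downsampled to the token grid
        x = torch.cat([block_out, mask_tokens], dim=-1)           # fuse context with the mask
        control = self.up(self.act(self.down(x)))                 # element-aware control signal
        return block_out + control                                # residual injection (assumption)
```

Attention Redirection: training-free modification of the self-attention logits, \(\hat{A} = A + M_{attenuate} \cdot \log_e(\lambda)\); `glyph_bg` and `ref_fg` are assumed boolean token masks derived from the concatenated layout.

```python
import math
import torch

def redirect_attention(attn_logits: torch.Tensor,
                       glyph_bg: torch.Tensor,
                       ref_fg: torch.Tensor,
                       lam: float = 0.1) -> torch.Tensor:
    """attn_logits: (..., L, L) pre-softmax scores A; lam: attenuation factor in (0, 1)."""
    # M_attenuate[i, j] = 1 when query i lies in the glyph background and key j in the reference foreground.
    m_attenuate = glyph_bg.float()[:, None] * ref_fg.float()[None, :]
    # Adding log(lambda) at these entries scales the exponentiated scores by lambda after softmax.
    return attn_logits + m_attenuate * math.log(lam)
```

Edge Repainting band: one plausible way to obtain the narrow region around the glyph contour is dilation minus erosion of the glyph mask; the band width is an assumption.

```python
import cv2
import numpy as np

def edge_repaint_mask(glyph_mask: np.ndarray, band: int = 8) -> np.ndarray:
    """glyph_mask: uint8 binary mask (255 inside strokes); returns a thin ring around the contour."""
    kernel = np.ones((band, band), np.uint8)
    ring = cv2.subtract(cv2.dilate(glyph_mask, kernel), cv2.erode(glyph_mask, kernel))
    return ring  # this band is handed to the fine-tuned FLUX.1-Fill LoRA for boundary re-synthesis
```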

Loss & Training

The model is trained with a flow matching loss at a learning rate of \(1 \times 10^{-4}\). LoRA fine-tuning is applied to the linear layers of all MM-DiT blocks, and CMA modules are trained jointly with LoRA. Because amorphous elements and object elements differ substantially, separate LoRA and CMA parameters are used for each category. Text inputs are left empty during training, as the reference image already provides sufficient style conditioning. Training data is constructed by randomly cropping texture patches (amorphous elements) or concatenating segmented object instances (object elements), with glyph composition and rotation augmentation to increase structural diversity.
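
For concreteness, a generic rectified-flow-style objective is sketched below; the linear interpolation path and uniform timestep sampling are assumptions rather than details from the paper, and `model` stands in for FLUX.1-Fill with the trainable LoRA and CMA parameters.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, cond):
    """Generic flow matching loss sketch (not the paper's exact formulation).

    x0:   clean latents of the concatenated target image, shape (B, C, H, W)
    cond: conditioning dict (masked input latents, glyph mask, empty text prompt)
    """
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1)   # uniform timesteps (assumption)
    noise = torch.randn_like(x0)
    x_t = (1.0 - t) * x0 + t * noise                        # linear interpolation path
    v_target = noise - x0                                   # target velocity field
    v_pred = model(x_t, t.flatten(), **cond)                # only LoRA + CMA parameters receive gradients
    return F.mse_loss(v_pred, v_target)
```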

Key Experimental Results

Main Results

| Method | Type | FID↓ | CLIPIm↑ | FIDp↓ | Consistency↑ | Readability↑ | SR↑ |
|---|---|---|---|---|---|---|---|
| StyleAligned | Object | 200.3 | 0.70 | 291.2 | 78.8 | 2.5 | 73.2 |
| FontStudio | Object | 205.4 | 0.75 | 271.3 | 80.6 | 4.0 | 72.6 |
| FontCrafter | Object | 127.5 | 0.91 | 190.6 | 94.2 | 93.5 | 92.0 |
| StyleAligned | Amorphous | 227.9 | 0.74 | 304.2 | 82.6 | 4.0 | 85.2 |
| FontStudio | Amorphous | 225.2 | 0.73 | 283.1 | 89.4 | 6.5 | 84.8 |
| FontCrafter | Amorphous | 128.3 | 0.92 | 193.4 | 92.4 | 89.5 | 96.6 |

Ablation Study

| Control Method | Type | Params | FID↓ | CLIPIm↑ | FIDp↓ | Consistency↑ | Readability↑ |
|---|---|---|---|---|---|---|---|
| w/ ControlNet | Object | 743.81M | 193.2 | 0.74 | 252.1 | 68.4 | 82.2 |
| w/ T2I-Adapter | Object | 79.03M | 183.1 | 0.75 | 246.2 | 81.2 | 86.8 |
| w/ IP-Adapter | Object | - | 213.2 | 0.71 | 283.2 | 62.2 | 89.0 |
| Ours (CMA) | Object | 22.4M | 127.5 | 0.91 | 190.6 | 92.0 | 94.2 |

Key Findings

  • CMA surpasses ControlNet (743.81M) and T2I-Adapter (79.03M) with only 22.4M parameters, roughly 33× fewer than ControlNet.
  • IP-Adapter provides only coarse-grained control (color and category features) and fails to preserve fine-grained texture and structure; the visual in-context generation strategy leads by 0.20 on CLIPIm.
  • In Attention Redirection, decreasing the attenuation factor \(\lambda\) progressively eliminates hallucinated strokes without affecting legitimate strokes.
  • The method naturally supports cross-category style blending, with the style ratio controllable by adjusting element density in the reference region.

Highlights & Insights

  • Elegant formulation via visual in-context generation: Recasting font generation as an inpainting task exploits the context propagation capability of inpainting models for pixel-level style transfer, circumventing the reliance on text descriptions or global features inherent in traditional approaches.
  • Lightweight CMA design: By fusing contextual features with mask information, CMA achieves shape control superior to ControlNet with far fewer parameters, demonstrating that "task-specific information + context-awareness" is more effective than large-scale independent control networks.
  • Training-free Attention Redirection: Manipulating the attention matrix at inference time resolves both hallucination and region-level style control, and is transferable to other generation tasks requiring region-specific control.
  • ElementFont dataset: Covering 6,000 element types and 19,000 glyphs, the dataset is built through a systematic pipeline (LLM generates element names → DALL·E 3 generates images → SAM segments → GPT quality-checks) and can serve as a standard benchmark for future research.

Limitations & Future Work

  • The method relies on FLUX.1-Fill as the backbone, resulting in a large model size and potentially slow inference speed.
  • Amorphous elements and object elements require separate LoRA parameters, precluding unified processing.
  • The paper does not provide large-scale quantitative evaluation on complex scripts such as Chinese characters (only qualitative results are shown).
  • Edge Repainting as an optional post-processing step increases pipeline complexity.
  • The ElementFont dataset is generated using DALL·E 3, which may introduce model-specific generation biases.
  • The paper does not evaluate resolution limitations or per-glyph inference time.

Comparison with Related Work

  • vs. FontStudio: FontStudio employs a shape-adaptive diffusion model but relies on a Style Adapter that captures only global style; FontCrafter achieves fine-grained control via pixel-space concatenation.
  • vs. Anything2Glyph: Anything2Glyph controls style via text prompts, supporting only coarse-grained object category control with cluttered backgrounds (FID as high as 297.8); FontCrafter uses reference images for precise control (FID reduced to 213.6).
  • vs. IP-Adapter: IP-Adapter injects global features through cross-attention and cannot preserve pixel-level details; the visual in-context strategy directly propagates visual cues in pixel space.

Rating

  • Novelty: ⭐⭐⭐⭐ Applying the context propagation capability of inpainting models to font generation is a novel perspective, though the core technical components are relatively standard.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage including main experiments, ablations, user studies, style blending, and generalization experiments.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated, methods are presented intuitively, and the ElementFont dataset construction is described in detail.
  • Value: ⭐⭐⭐⭐ Significant contribution to the artistic font generation field; both the dataset and the method have practical utility.