VecGlypher: Unified Vector Glyph Generation with Language Models¶
Conference: CVPR 2026 arXiv: 2602.21461 Code: https://xk-huang.github.io/VecGlypher Area: Multimodal VLM Keywords: Font Generation, Vector Graphics, SVG, Multimodal Language Models, Typography
TL;DR¶
VecGlypher is proposed as the first unified language model for text- and image-guided vector glyph generation. Through a two-stage training pipeline (large-scale SVG syntax learning followed by expert-annotated alignment), it autoregressively generates editable SVG paths directly, with no intermediate rasterization step and no vectorization post-processing.
Background & Motivation¶
Background: Vector glyphs are the atomic units of digital typography. However, existing learning-based methods remain predominantly image-guided — generating vector outlines for unseen characters given a few reference glyph images — and rely on carefully prepared reference sheets and raster-to-vector post-processing.
Limitations of Prior Work: (a) Image-guided methods require users to first create or collect reference glyphs, which is a bottleneck for non-expert users; (b) intermediate rasterization steps introduce vectorization artifacts that degrade editability; (c) general-purpose SVG generation LLMs fail entirely on glyph generation, as fonts impose extremely strict requirements on coordinate precision, topological correctness, and stylistic consistency.
Key Challenge: Natural language is a more accessible interface for font design, and SVG paths are inherently text sequences well-suited for language modeling — yet this requires (a) large-scale font training data to teach models how to "draw" characters, and (b) typography-aware data engineering for coordinate normalization and path canonicalization.
Goal: To support both text descriptions and image references as conditions within a single multimodal LLM, directly generating high-fidelity, editable SVG glyphs.
Key Insight: A two-stage training strategy — first learning to draw SVGs on large-scale noisy fonts, then learning instruction-following on small-scale expert-annotated fonts.
Core Idea: Glyph generation is formulated as a multimodal language modeling problem. The model learns SVG syntax from 39K Envato fonts and style alignment from 2.5K Google Fonts, producing correct SVG paths in a single forward pass.
Method¶
Overall Architecture¶
The input consists of a textual style description (e.g., "high-contrast, serif, art-deco") or reference glyph images (1–8 images), along with a target character identity (e.g., "A"). The model autoregressively predicts an SVG path token sequence, which is then de-tokenized into a valid SVG path. No raster denoiser, vector post-optimizer, or contour simplifier is required.
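Since the model's output is just a path token sequence, turning a prediction into a usable asset is trivial. The sketch below is a hypothetical helper (the paper's exact de-tokenization format is not specified here); it assumes the model emits a valid `d` attribute string in the UPM=1000 coordinate space used for normalization and wraps it into a standalone SVG document:

```python
def wrap_glyph_path(d: str, upm: int = 1000) -> str:
    """Wrap a predicted SVG path `d` string into a standalone SVG document.

    Hypothetical helper: assumes the de-tokenized output is already a
    valid `d` attribute in a UPM x UPM coordinate space.
    """
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'viewBox="0 0 {upm} {upm}">'
        f'<path d="{d}" fill="black"/></svg>'
    )

# A rough capital "A" outline as the model might emit it:
svg = wrap_glyph_path("M100 900 L500 100 L900 900 L700 900 L500 500 L300 900 Z")
```

Because the output is a plain SVG file, it can be opened and edited directly in any vector editor, which is the editability advantage over raster-then-vectorize pipelines.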
Key Designs¶
- Typography-Aware Data Engineering:
- Function: Clean, normalize, and prepare training font data.
- Mechanism: Four-stage filtering (character coverage, path length, deduplication, MLLM OCR verification) → coordinate normalization to UPM=1000 → path canonicalization (preserving command letters, one-decimal quantization) → train/test split by font family.
- Design Motivation: Fonts from different sources vary widely in coordinate systems and path formats. Without normalization, error accumulation during long-sequence decoding is significant. One-decimal quantization balances precision and sequence length.
- Two-Stage Supervised Fine-Tuning (SFT):
- Stage 1 (Learning to Draw — Envato): Text-guided SFT on 39K noisy fonts to learn SVG syntax, long-sequence coordinate prediction, and character-conditioned geometry.
- Stage 2 (Instruction Following — Google Fonts): Text- and image-guided SFT on 2.5K expert-annotated fonts to align geometric and appearance instructions.
- Design Motivation: Large-scale Stage 1 training is critical — ablations show that models without Stage 1 exhibit significantly degraded generalization and contour stability.
- Unified Multimodal Conditioning:
- Function: A single model and decoding procedure handles both text and image input modalities.
- Mechanism: Text conditions are processed via the tokenizer using style tags and target character; image conditions encode 1–8 reference glyphs rendered at 192×192 via a vision encoder. The two modalities are mutually exclusive alternatives: a given generation is conditioned on text or on images, not both.
- Design Motivation: A unified architecture eliminates the redundancy of maintaining separate models for each input type. In practice, the workflow can first generate reference glyphs from text, then use those images to guide full font generation.
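The normalization and quantization steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions (a space-separated path with decimal coordinates; the paper's actual parser would handle the full SVG command grammar): rescale every numeric coordinate from the source font's units-per-em to UPM=1000 and quantize to one decimal, preserving command letters.

```python
import re

def normalize_path(d: str, src_upm: float, dst_upm: float = 1000.0) -> str:
    """Rescale all numeric coordinates in a path string from src_upm to
    dst_upm and quantize to one decimal place; command letters pass
    through untouched. A toy sketch of the described canonicalization."""
    scale = dst_upm / src_upm

    def quantize(m: re.Match) -> str:
        return f"{float(m.group()) * scale:.1f}"

    # Match signed integers and decimals; letters (M, L, C, ...) are untouched.
    return re.sub(r"-?\d+(?:\.\d+)?", quantize, d)

# A glyph drawn at UPM=2048 (e.g. a TrueType source), rescaled to UPM=1000:
canonical = normalize_path("M 1024 0 L 2048 2048", src_upm=2048)
```

Fixing one decimal place keeps sequences short while still resolving sub-unit detail at UPM=1000, which matches the stated precision/length trade-off.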
Loss & Training¶
Standard next-token prediction cross-entropy loss over SVG path text. Envato data is trained for 1 epoch; Google Fonts for 3 epochs. Greedy decoding is used to evaluate raw generation capability. The backbone is the Qwen3-VL series (4B/27B/70B).
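The objective and the decoding rule are both standard; a toy illustration of each (pure Python, not the actual training code) makes the formulation concrete. `probs_seq[t]` stands in for the model's softmax output at step `t` over a tiny vocabulary:

```python
import math

def next_token_xent(probs_seq, target_ids):
    """Average next-token cross-entropy: the mean negative log-probability
    the model assigns to each ground-truth SVG path token."""
    losses = [-math.log(p[t]) for p, t in zip(probs_seq, target_ids)]
    return sum(losses) / len(losses)

def greedy_step(probs):
    """Greedy decoding: always emit the most probable next token."""
    return max(range(len(probs)), key=probs.__getitem__)

# Two decoding steps over a 3-token toy vocabulary:
loss = next_token_xent([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]], [0, 1])
```

Greedy decoding deliberately removes sampling noise so that benchmark numbers reflect the model's raw single-path generation quality rather than decoding luck.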
Key Experimental Results¶
Main Results (Google Fonts Cross-Family OOD)¶
| Method | Type | FID↓ | L1↓ | SSIM↑ |
|---|---|---|---|---|
| DeepVecFont-v2 | Image-ref | 45.2 | 0.089 | 0.82 |
| DualVector | Image-ref | 38.6 | 0.075 | 0.85 |
| StarVector | SVG-LLM | 89.4 | 0.142 | 0.68 |
| GPT-5.2 | General LLM | 92.1 | 0.158 | 0.61 |
| VecGlypher-4B (text-ref) | LLM | 52.3 | 0.095 | 0.79 |
| VecGlypher-4B (image-ref) | LLM | 31.2 | 0.065 | 0.88 |
Ablation Study¶
| Configuration | FID↓ | L1↓ | Notes |
|---|---|---|---|
| Stage 2 only (Google) | 58.7 | 0.102 | No Envato pre-training |
| Stage 1 only (Envato) | 43.5 | 0.081 | No Google fine-tuning |
| Stage 1 + Stage 2 | 31.2 | 0.065 | Full two-stage training |
| 4B model | 31.2 | 0.065 | — |
| 27B model | 26.8 | 0.054 | Larger model = higher quality |
| 70B model | 24.1 | 0.048 | Further improvement |
| Relative coords | 35.8 | 0.072 | Relative coordinate serialization |
| Absolute coords | 31.2 | 0.065 | Absolute coordinates are superior |
Key Findings¶
- VecGlypher outperforms all specialized baselines in image-guided generation, including DeepVecFont-v2 and DualVector.
- General-purpose LLMs (GPT-5.2) and SVG-LLMs (StarVector) fail entirely on glyph generation, demonstrating the indispensability of typography-specific training.
- Model scale is the primary driver of vector fidelity — the 70B model achieves approximately 23% lower FID than the 4B model.
- Large-scale Stage 1 pre-training yields substantially greater OOD gains than training on expert data alone, reducing FID from 58.7 to 31.2.
- Absolute coordinate serialization outperforms relative coordinates, likely because absolute coordinates provide the model with more direct spatial reference.
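The serialization ablation above is easy to picture with a toy converter. Under relative serialization the model must internally accumulate a running offset, so any early coordinate error shifts every later point; absolute serialization gives each point directly. This sketch handles only space-separated move/line commands (real SVG has many more command types and terser syntax):

```python
def to_absolute(d: str) -> str:
    """Convert a toy relative path (lowercase m/l, space-separated) to
    absolute commands by accumulating the running position."""
    tokens = d.split()
    out, x, y, i = [], 0.0, 0.0, 0
    while i < len(tokens):
        cmd = tokens[i]
        dx, dy = float(tokens[i + 1]), float(tokens[i + 2])
        if cmd in ("m", "l"):   # relative: offsets accumulate
            x, y = x + dx, y + dy
        else:                   # uppercase: already absolute
            x, y = dx, dy
        out.append(f"{cmd.upper()} {x:g} {y:g}")
        i += 3
    return " ".join(out)

# Same triangle leg, two serializations:
abs_path = to_absolute("m 100 900 l 400 -800 l 400 800")
```

In the relative form, a single mispredicted `dx` corrupts all subsequent absolute positions; the ablation's FID gap (35.8 vs. 31.2) is consistent with this accumulation effect.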
Highlights & Insights¶
- Unified Language Modeling Paradigm: Font design is reframed from a "drawing problem" to a "code-writing problem," effectively transferring the code generation capabilities of LLMs to this domain. This paradigm is extensible to any parameterizable design task (e.g., logos, icons, UI components).
- Two-Stage Data Strategy: The pattern of learning syntax from large noisy data and semantics from small refined data mirrors the NLP paradigm of pre-training followed by instruction tuning; its successful application to visual generation is noteworthy.
- Practical Workflow: The progressive design pipeline of text → initial glyphs → image-ref → complete font genuinely lowers the barrier to font creation.
Limitations & Future Work¶
- The current system supports only 62 characters ("0–9a–zA–Z"); large character sets such as Chinese fonts are not yet supported.
- Greedy decoding limits diversity; beam search or nucleus sampling may produce stylistically richer variants.
- Generation quality may degrade for long sequences (some glyphs exceed 1,000 tokens), necessitating improved long-sequence modeling.
- Fine-grained conditional control (e.g., independently controlling stroke weight, serif style, or aspect ratio) remains unexplored.
Related Work & Insights¶
- vs. DeepVecFont-v2: DVF-v2 employs a dedicated encoder-decoder with geometric post-optimization; VecGlypher uses a single LLM with a single forward pass, achieving a simpler architecture with superior results.
- vs. StarVector: StarVector targets general SVG icons; VecGlypher introduces font-specific data curation and training strategies that address typography-specific constraints StarVector cannot handle.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First successful application of a multimodal LLM to vector glyph generation, unifying two conditioning modalities.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multi-scale models, cross-domain evaluation, detailed ablations, with comprehensive qualitative and quantitative analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Data engineering is described in detail; paradigm comparisons are clearly articulated.
- Value: ⭐⭐⭐⭐⭐ A practical tool that lowers the barrier to font design, with broad prospects for industrial application.