TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

Conference: ECCV 2024
arXiv: 2311.16465
Area: Image Generation

TL;DR

TextDiffuser-2 utilizes two language models for layout planning and layout encoding respectively, achieving more flexible, automated, and diverse visual text rendering, significantly enhancing font style diversity while maintaining text accuracy.

Background & Motivation

Diffusion models have achieved great success in image synthesis, but still face challenges in visual text rendering. Prior methods exhibit three major limitations:

Lack of flexibility and automation: GlyphControl requires users to manually design glyph images, while GlyphDraw and TextDiffuser require manual keyword specification, so none of them can generate images directly from natural language prompts.
Limited layout prediction capability: GlyphDraw can only render single-line text, and the layouts generated by the Layout Transformer in TextDiffuser are not aesthetically pleasing.
Restrained font style diversity: TextDiffuser uses character-level segmentation masks as control signals, which implicitly constrains character positions and limits the generation of handwritten or artistic fonts.

The root cause of these issues is that previous methods employ overly rigid character-level control signals and lack flexible layout planning capabilities. TextDiffuser-2 aims to unleash the potential of language models in text rendering to address these problems.

Method

Overall Architecture

TextDiffuser-2 adopts a two-stage training architecture centered around two language models:

Language Model M1 (Layout Planner): Fine-tuned from Vicuna-7B; converts user prompts into text-formatted layouts.
Language Model M2 (Layout Encoder): The CLIP text encoder within the diffusion model; encodes line-level text positions and content.

Key Designs

1. Layout Planning via Language Model (M1)

The Vicuna-7B model is fine-tuned on caption-OCR pairs from the MARIO-10M dataset to serve as the layout planner:

Supports two modes: (a) Automatically infers text content and layout when users do not provide keywords; (b) Determines only the corresponding layout positions when users provide keywords.
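Each output line has the form "textline x0, y0, x1, y1", where the four values give the top-left and bottom-right corners of the line's bounding box, with coordinates normalized to the range 0 to 128.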
Supports layout modification (regenerating, adding, moving keywords) through multi-turn dialogue.
Achieves its best fine-tuning performance with only 5k caption-OCR pairs.
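
The text-formatted layout described above is simple to consume downstream. The following is a minimal sketch, not the authors' code: the names `TextLine`, `parse_layout`, and `to_pixels` are hypothetical, the example layout is made up for illustration, and the rescaling assumes the 0 to 128 grid maps linearly onto the 512×512 image.

```python
import re
from typing import List, NamedTuple, Tuple


class TextLine(NamedTuple):
    text: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) on the planner's 0-128 grid


# One planner output line: the text, then four comma-separated integers.
_LINE_RE = re.compile(r"^(.*\S)\s+(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*$")


def parse_layout(planner_output: str) -> List[TextLine]:
    """Parse 'textline x0, y0, x1, y1' lines; lines that do not match are skipped."""
    lines = []
    for raw in planner_output.splitlines():
        match = _LINE_RE.match(raw.strip())
        if match is None:
            continue
        x0, y0, x1, y1 = (int(g) for g in match.groups()[1:])
        lines.append(TextLine(match.group(1), (x0, y0, x1, y1)))
    return lines


def to_pixels(box, grid=128, image_size=512):
    """Rescale a box from the normalized grid to pixel coordinates (assumed linear)."""
    scale = image_size / grid
    return tuple(round(v * scale) for v in box)


# Illustrative planner output, not taken from the paper.
example = "WILD 30, 20, 98, 50\nWEST SHOW 24, 60, 104, 84"
layout = parse_layout(example)
```

Because the layout is plain text, the regenerate, add, and move operations from multi-turn dialogue only require re-parsing the planner's latest reply.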

2. Layout Encoding via Language Model (M2)

The CLIP text encoder is utilized within the diffusion model to encode line-level layout information:

Hybrid-granularity tokenization strategy: Retains the original BPE tokenization for the prompt, while introducing character-level tokenization for keywords (e.g., "WILD" \(\rightarrow\) "[W]", "[I]", "[L]", "[D]").
Introduces 256 coordinate tokens and 95 character tokens to encode positions and content.
Line-level bounding boxes provide more flexible generation control without restricting font style diversity.
The maximum sequence length L is set to 128, covering 94% of the training samples.
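
A rough sketch of how a hybrid-granularity sequence could be assembled is shown below. Several details are assumptions rather than facts from the paper: the coordinate token spelling (`[x30]`, `[y20]`), the ordering of coordinates before characters, and the use of a pre-tokenized prompt as a stand-in for real BPE output. Only the `[W]`-style character tokens and the length cap of 128 come from the description above.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]


def char_tokens(keyword: str) -> List[str]:
    """Character-level tokens for one keyword, e.g. 'WILD' -> [W] [I] [L] [D]."""
    return [f"[{ch}]" for ch in keyword]


def coord_tokens(box: Box) -> List[str]:
    """Hypothetical coordinate tokens for the top-left and bottom-right corners."""
    x0, y0, x1, y1 = box
    return [f"[x{x0}]", f"[y{y0}]", f"[x{x1}]", f"[y{y1}]"]


def build_sequence(
    prompt_tokens: Sequence[str],
    lines: Sequence[Tuple[str, Box]],
    max_len: int = 128,
) -> List[str]:
    """Concatenate prompt tokens with per-line coordinate and character tokens.

    prompt_tokens stands in for the BPE tokens of the prompt; each text line
    contributes its box tokens followed by its character tokens. The result
    is truncated to max_len.
    """
    seq = list(prompt_tokens)
    for text, box in lines:
        seq += coord_tokens(box) + char_tokens(text)
    return seq[:max_len]


sequence = build_sequence(["a", "poster"], [("WILD", (30, 20, 98, 50))])
```

Spelling keywords one character per token is what lets the encoder see the exact letters to render, while the prompt keeps the shorter BPE encoding.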

3. Model Capacity

The overall model is based on SD 1.5, containing 922M parameters, with an input image size of 512×512.

Loss & Training

Stage 1 (Layout Planning): Cross-entropy loss is used to train M1, covering both the with-keyword and without-keyword scenarios.
Stage 2 (Image Generation): L2 denoising loss is used to train M2 and the U-Net.

\[\mathcal{L}_{denoise} = \mathbb{E}_{z_0, \epsilon, t} \| \epsilon - \epsilon_\theta(z_t, c, t) \|^2\]
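
The summary does not define the symbols in this loss. In the usual latent-diffusion reading, \(z_0\) is the clean latent, \(\epsilon\) is Gaussian noise, \(z_t\) is the noised latent at timestep \(t\), \(c\) is the conditioning (here presumably the M2 encoding), and \(\epsilon_\theta\) is the U-Net's noise prediction. The toy sketch below illustrates that reading on plain lists. It assumes the standard DDPM forward process with a cumulative schedule value `alpha_bar_t`, which this summary does not specify, and it replaces the U-Net with fixed guesses.

```python
import math
import random


def add_noise(z0, eps, alpha_bar_t):
    """Standard DDPM forward step: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]


def denoise_loss(eps, eps_pred):
    """Squared L2 distance between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in range(8)]   # toy stand-in for a latent
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]  # sampled noise
z_t = add_noise(z0, eps, alpha_bar_t=0.5)

perfect = denoise_loss(eps, eps)            # a perfect predictor gives zero loss
zero_guess = denoise_loss(eps, [0.0] * 8)   # predicting no noise gives ||eps||^2
```

In training, the expectation in the formula corresponds to averaging this quantity over sampled latents, noise draws, and timesteps.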

Key Experimental Results

Main Results

Quantitative results and user study on the MARIO-Eval benchmark:

Metric	SD-XL	PixArt-α	GlyphControl	TextDiffuser	TextDiffuser-2
FID↓	62.54	87.09	50.82	38.76	33.66
CLIPScore↑	31.31	27.88	34.56	34.36	34.50
OCR Accuracy↑	0.31	0.02	32.56	56.09	57.58
OCR F-measure↑	3.66	0.03	64.07	78.24	75.06
Text Quality (Human)↑	14.58	3.65	21.35	23.44	36.98
Text-Image Match (Human)↑	7.14	3.30	29.67	19.23	40.66

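TextDiffuser-2 achieves the best FID, OCR accuracy, and both human-rated scores. The exceptions are CLIPScore, where GlyphControl is marginally higher (34.56 vs. 34.50), and OCR F-measure, where TextDiffuser leads (78.24 vs. 75.06).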

Ablation Study

Ablation on fine-tuning data volume (M1 layout planner):

Data Volume	Accuracy↑	Precision↑	Recall↑	F-measure↑	IoU↓
0k-2shot	49.65	84.18	69.69	76.25	19.69
2.5k	61.10	82.20	85.18	83.67	3.21
5k	64.85	84.98	86.38	85.67	3.25
10k	64.85	84.38	86.23	85.29	4.27
100k	62.87	85.26	85.98	85.62	4.31

The 5k setting gives the best accuracy (tied with 10k), recall, and F-measure. More data brings no significant improvement, and at 100k accuracy drops slightly to 62.87.

Ablation on coordinate representation and tokenization granularity:

Representation	Accuracy↑	Precision↑	Recall↑	F-measure↑
Center (Char)	35.19	61.75	62.71	62.23
LT (Char)	28.32	54.94	55.64	55.29
LT+RB (Subword)	15.48	41.74	42.53	42.13
LT+RB (Char)	57.58	74.02	76.14	75.06

Using the top-left and bottom-right (LT+RB) corner points combined with character-level tokenization yields the best performance, while subword-level tokenization lowers accuracy by 42.1 percentage points (from 57.58 to 15.48).
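
The numbers in this table can be sanity-checked with a few lines of arithmetic. The sketch below assumes that F-measure is the harmonic mean of precision and recall, which the summary does not state but which reproduces every reported value to within rounding. It also separates the absolute accuracy gap in percentage points from the relative drop.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of F-measure)."""
    return 2 * precision * recall / (precision + recall)


# (accuracy, precision, recall, reported F-measure), copied from the table above.
rows = {
    "Center (Char)": (35.19, 61.75, 62.71, 62.23),
    "LT (Char)": (28.32, 54.94, 55.64, 55.29),
    "LT+RB (Subword)": (15.48, 41.74, 42.53, 42.13),
    "LT+RB (Char)": (57.58, 74.02, 76.14, 75.06),
}

# Recomputing from rounded precision and recall can differ in the last digit.
max_gap = max(abs(f_measure(p, r) - f) for _, p, r, f in rows.values())

char_acc = rows["LT+RB (Char)"][0]
subword_acc = rows["LT+RB (Subword)"][0]
absolute_drop = char_acc - subword_acc     # in percentage points
relative_drop = absolute_drop / char_acc   # as a fraction of the char-level accuracy
```

The absolute gap is about 42.1 points, which corresponds to losing roughly 73% of the character-level accuracy in relative terms.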

Key Findings

Language models exhibit flexibility in autonomously inferring keywords, such as automatically correcting spelling errors (e.g., "RRAINBOW" \(\rightarrow\) "RAINBOW").
Line-level guidance yields more diverse font styles than character-level guidance, at a small cost in text accuracy (in the main results, OCR F-measure is 75.06 versus 78.24 for the character-level TextDiffuser).
Layouts can be flexibly manipulated through multi-turn dialogues, supporting regenerating, adding, or moving keywords.
TextDiffuser-2 demonstrates stronger robustness to overlapping bounding boxes.

Highlights & Insights

Minimal data fine-tuning: Just 5k caption-OCR pairs can train a 7B language model into a high-quality layout planner, demonstrating the strong cross-domain transfer capability of LLMs.
Control granularity trade-off: Changing from character-level to line-level control signals trades a small amount of accuracy for a significant boost in style diversity, making for an elegant design compromise.
Interactive layout editing: The layout planner, fine-tuned on a chat model, natively supports multi-turn dialogue layout modification, enhancing practical utility.
Hybrid-granularity tokenization: The hybrid strategy of using BPE for prompts and character-level tokenization for keywords balances efficiency and spelling capability.

Limitations & Future Work

Cannot render languages with very large character sets (such as Chinese), because the vast character set makes many characters few-shot or even zero-shot cases.
The generation resolution is limited to 512×512.
Although line-level control improves diversity, its accuracy is slightly lower than that of character-level methods in scenarios requiring precise character alignment.

Rating

Novelty: ★★★★☆ — The dual language model architecture design is novel, and the hybrid-granularity tokenization strategy is elegant.
Utility: ★★★★★ — High degree of automation, supporting multi-turn interactive editing.
Experimental Thoroughness: ★★★★★ — Comprehensive ablation studies, including human and GPT-4V user studies.
Writing Quality: ★★★★☆ — Clear structure, well-elaborated motivation.