Text to Sketch Generation with Multi-Styles¶
Conference: NeurIPS 2025 arXiv: 2511.04123 Code: GitHub Area: Image Generation, Style Transfer, Sketch Synthesis Keywords: Sketch Generation, Multi-Style Synthesis, Diffusion Models, K/V Injection, AdaIN
TL;DR¶
This paper proposes M3S (Multi-Style Sketch Synthesis), a training-free framework that achieves single- and multi-style sketch generation conditioned on text prompts and reference style sketches, via linearly smoothed K/V feature injection, joint AdaIN style tendency control, and style-content disentangled guidance.
Background & Motivation¶
- Sketches serve as a cross-lingual visual medium with broad applications ranging from industrial prototyping to artistic expression.
- The scarcity of high-quality sketch datasets (requiring professional skills and substantial time) constrains model training.
- Limitations of existing approaches:
- CLIPasso/DiffSketcher: lack precise control over style attributes.
- K/V replacement methods (e.g., MasaCtrl): suffer from misalignment between Q and the substituted K/V in cross-domain settings, leading to content leakage and structural incoherence.
- StyleAligned: aligns statistical distributions via AdaIN but performs poorly in structurally divergent domains such as sketches.
- Text-conditioned style control lacks expressiveness and cannot accurately match specific styles.
Method¶
Overall Architecture¶
A training-free framework built upon Stable Diffusion v1.5/SDXL, supporting both single-style and multi-style sketch generation.
Key Designs¶
1. Style Feature Injection¶
Why not direct replacement: \(Attention(Q_{tar}, K_{ref}, V_{ref})\) introduces structural incoherence in cross-domain settings.
Why not AdaIN alignment: \(Q_{tar} = AdaIN(Q_{tar}, Q_{ref})\) is detrimental to sketch generation.
M3S solution: linearly smoothed feature concatenation

$$Attention\left(Q_{tar}, \begin{bmatrix}K_{tar}\\\hat{K}_{ref}\end{bmatrix}, \begin{bmatrix}V_{tar}\\\hat{V}_{ref}\end{bmatrix}\right)$$

$$\hat{K}_{ref} = \lambda K_{tar} + (1-\lambda)K_{ref}, \quad \hat{V}_{ref} = \lambda V_{tar} + (1-\lambda)V_{ref}$$

- \(\lambda \in [0,1]\) balances content fidelity and style consistency.
- Increasing \(\lambda\) enhances aesthetics and text alignment, but excessively large values may cause style degradation.
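The injection above can be sketched in a few lines of numpy. This is a minimal single-head illustration, not the paper's implementation: shapes (`[tokens, dim]`) and the `m3s_attention` name are assumptions, and in practice the smoothing is applied inside the self-attention layers of the diffusion U-Net.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def m3s_attention(q_tar, k_tar, v_tar, k_ref, v_ref, lam=0.5):
    """Linearly smoothed K/V injection (illustrative, single head).

    lam -> 1 recovers ordinary self-attention on target features;
    lam -> 0 concatenates the raw reference K/V (closer to replacement).
    """
    # Smooth reference features toward the target before injecting them.
    k_hat = lam * k_tar + (1 - lam) * k_ref
    v_hat = lam * v_tar + (1 - lam) * v_ref
    # Concatenate along the token axis so Q_tar attends to both streams.
    k = np.concatenate([k_tar, k_hat], axis=0)
    v = np.concatenate([v_tar, v_hat], axis=0)
    scores = q_tar @ k.T / np.sqrt(q_tar.shape[-1])
    return softmax(scores) @ v
```

Note that the blend requires reference and target K/V to share a token count, which holds when both latents have the same spatial resolution.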
2. Multi-Style Tendency Control (Joint AdaIN)¶
$$z_t^{tar} = \eta \cdot AdaIN(z_t^{tar}, z_t^{ref_1}) + (1-\eta) \cdot AdaIN(z_t^{tar}, z_t^{ref_2})$$

- \(\eta \in [0,1]\): style tendency parameter.
- Intuition: dense-stroke sketches have lower mean → AdaIN biases toward detailed output; sparse sketches have higher mean → minimalist results.
- Even at \(\eta=0\) or \(\eta=1\), multi-style characteristics are preserved because the self-attention module incorporates K/V features from both styles.
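A minimal numpy sketch of the joint AdaIN blend, assuming latents flattened to `[channels, pixels]` with per-channel statistics; function names are illustrative, not from the paper's code.

```python
import numpy as np

def adain(x, y, eps=1e-5):
    """Standard AdaIN: shift x's per-channel statistics to match y's.

    x, y: arrays of shape [channels, pixels].
    """
    mu_x, sd_x = x.mean(axis=1, keepdims=True), x.std(axis=1, keepdims=True)
    mu_y, sd_y = y.mean(axis=1, keepdims=True), y.std(axis=1, keepdims=True)
    return sd_y * (x - mu_x) / (sd_x + eps) + mu_y

def joint_adain(z_tar, z_ref1, z_ref2, eta=0.5):
    """Blend two AdaIN-aligned latents; eta sets the style tendency."""
    return eta * adain(z_tar, z_ref1) + (1 - eta) * adain(z_tar, z_ref2)
```

With `eta=1` the latent statistics follow reference 1 entirely, yet the output still carries both styles because both K/V sets remain in self-attention.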
3. Style-Content Disentangled Guidance¶
$$\tilde{\epsilon}_t = \epsilon_\theta(z_t^{tar}, t, \emptyset) + \omega_1 \cdot \underbrace{[\epsilon_\theta^\times(\cdot, text, K_{ref}, V_{ref}) - \epsilon_\theta(\cdot, \emptyset)]}_{\text{content guidance}} + \omega_2 \cdot \underbrace{[\epsilon_\theta^\times(\cdot, \emptyset, K_{ref}, V_{ref}) - \epsilon_\theta(\cdot, \emptyset)]}_{\text{style guidance}}$$

- \(\omega_1, \omega_2\) control the strength of content and style guidance, respectively.
- \(\omega_2\) is linearly increased from \(\omega_2/3\) to \(\omega_2\) throughout the denoising process (gradually reinforcing style).
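The guidance combination and the \(\omega_2\) ramp can be written down directly. This is a hedged sketch: the three noise predictions are assumed to come from separate U-Net passes (with text + injected K/V, with injected K/V only, and fully unconditional), and the schedule endpoints follow the \(\omega_2/3 \to \omega_2\) description above.

```python
import numpy as np

def disentangled_guidance(eps_uncond, eps_text_kv, eps_null_kv, w1, w2):
    """Combine the unconditional prediction with separate content/style terms."""
    content = eps_text_kv - eps_uncond  # text prompt + injected K/V vs. null
    style = eps_null_kv - eps_uncond    # injected K/V alone vs. null
    return eps_uncond + w1 * content + w2 * style

def w2_schedule(step, total_steps, w2):
    """Linearly ramp the style weight from w2/3 (step 0) to w2 (last step)."""
    frac = step / max(total_steps - 1, 1)
    return w2 / 3 + frac * (w2 - w2 / 3)
```

Setting `w2 = 0` recovers ordinary classifier-free guidance on the text prompt, which is a useful sanity check when tuning.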
4. Contour-Based Regularization Guidance (SD v1.5)¶
- The denoised latent is estimated as \(z_{0|t}^{tar}\) via the Tweedie formula and decoded into an image.
- Sobel operators are applied to extract directional gradients, maximizing edge responses.
- \(\mathcal{L}_{edge} = -|grad_x| - |grad_y|\)
- Suppresses artifacts in abstract sketch generation.
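The edge loss above amounts to Sobel filtering the decoded image and negating the mean gradient magnitude, so gradient descent on it sharpens contours. A self-contained numpy sketch (the naive `conv2d` loop and `edge_loss` name are illustrative; a real pipeline would differentiate through the VAE decoder with an autodiff framework):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d(img, kernel):
    """Naive 'valid' 2-D correlation with a 3x3 kernel (illustration only)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = (img[i:i + 3, j:j + 3] * kernel).sum()
    return out

def edge_loss(img):
    """L_edge = -|grad_x| - |grad_y| (mean); minimizing it maximizes edges."""
    gx = conv2d(img, SOBEL_X)
    gy = conv2d(img, SOBEL_Y)
    return -(np.abs(gx).mean() + np.abs(gy).mean())
```

A flat image scores zero while any visible stroke pushes the loss negative, which is why this term steers abstract sketches away from smooth, artifact-prone regions.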
Key Experimental Results¶
Main Results: Quantitative Comparison (Styles 1 and 5 shown, of 6 evaluated)¶
| Method | Style1 CLIP-T↑ | Style1 DINO↑ | Style1 VGG↓ | Style5 CLIP-T↑ | Style5 DINO↑ |
|---|---|---|---|---|---|
| StyleAligned | 0.3130 | 0.6691 | 0.0308 | 0.3004 | 0.5428 |
| AttentionDistill | 0.3305 | 0.7738 | 0.0930 | 0.3377 | 0.6221 |
| InstantStyle | 0.3512 | 0.4934 | 0.0417 | 0.3480 | 0.4408 |
| CSGO | 0.3336 | 0.5276 | 0.0571 | 0.3298 | 0.4288 |
| M3S (SD v1.5) | 0.3507 | 0.6383 | 0.0200 | 0.3494 | 0.5777 |
| M3S (SDXL) | 0.3607 | 0.6545 | 0.0165 | 0.3467 | 0.5332 |
Human Preference Scores (1–8)¶
| Method | Avg. Score |
|---|---|
| StyleAligned | 2.77 |
| CSGO | 3.83 |
| StyleStudio | 4.22 |
| AttentionDistill | 4.28 |
| InstantStyle | 5.08 |
| M3S (SD v1.5) | 5.44 |
| M3S (SDXL) | 6.19 |
Multi-Style Generation (Validation of \(\eta\) Control)¶
| \(\eta\) | DINO-ref1↑ | DINO-ref2↑ | CLIP-T↑ |
|---|---|---|---|
| 0 | 0.3936 | 0.4944 | 0.3442 |
| 0.25 | 0.4180 | 0.4821 | 0.3514 |
| 0.5 | 0.4408 | 0.4556 | 0.3495 |
| 0.75 | 0.4578 | 0.4221 | 0.3499 |
| 1.0 | 0.4693 | 0.3975 | 0.3470 |
Key Findings¶
- M3S (SDXL) substantially outperforms all baselines in human preference with a score of 6.19.
- Linear smoothing (\(\lambda\)) effectively reduces content leakage compared to direct replacement and AdaIN alignment.
- AttentionDistill achieves high DINO scores but suffers from severe content leakage, mixing reference image content into the target output.
- The multi-style parameter \(\eta\) reliably controls style tendency: DINO-ref1 increases monotonically with \(\eta\) while DINO-ref2 decreases monotonically.
- Even at \(\eta=0\) or \(\eta=1\), the outputs retain characteristics of both styles due to the two sets of K/V features present in self-attention.
Highlights & Insights¶
- Training-free framework: operates directly on pretrained diffusion models without any fine-tuning.
- Elegant simplicity: linear smoothing replaces complex feature alignment or distillation schemes.
- Controllable multi-style synthesis: the first method to achieve multi-style fusion and continuous style interpolation for sketches.
- Cross-domain robustness: maintains generation quality even when reference and target structures differ substantially — a known failure mode of K/V replacement methods.
- Dual-backbone validation: verified on both SD v1.5 and SDXL.
Limitations & Future Work¶
- \(\omega_1, \omega_2, \lambda\) require manual tuning per style type (e.g., abstract sketches in Style 6 require different parameter settings).
- Abstract sketch generation on SD v1.5 may produce artifacts, necessitating contour regularization guidance.
- The base model is trained on natural images; excessively large \(\lambda\) may cause outputs to revert to a naturalistic appearance.
- 100-step DDIM sampling introduces considerable inference latency.
- Video and animated sketch generation remain unexplored.
Related Work & Insights¶
- DiffSketcher: a pioneering method for text-driven sketch synthesis, but lacks style control.
- Cross-image/MasaCtrl: establishes the K/V replacement paradigm — this paper analyzes the root causes of its cross-domain failure.
- B-LoRA/IP-Adapter/CSGO: style adapter approaches primarily targeting natural images.
- AdaIN (Huang 2017): a classical method for real-time arbitrary style transfer — extended in this paper to multi-style sketch control.
Rating¶
- Novelty: ⭐⭐⭐⭐ (first exploration of multi-style sketch generation with an elegant method design)
- Technical Depth: ⭐⭐⭐⭐ (thorough analysis of style injection mechanisms with comprehensive ablations)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (6 styles × 8 baselines + multi-style + human evaluation)
- Writing Quality: ⭐⭐⭐⭐ (rich visualizations and clear comparisons)