Sketch-HARP: Hierarchical Autoregressive Sketch Generation for Flexible Stroke-Level Drawing Manipulation¶
Conference: AAAI 2026 arXiv: 2511.07889 Code: https://github.com/SCZang/Sketch-HARP Area: Sketch Generation / Image Generation Keywords: Sketch Generation, Hierarchical Autoregressive, Stroke-Level Manipulation, Sketch-HARP, Sketch Editing
TL;DR¶
This paper proposes Sketch-HARP, a hierarchical autoregressive sketch generation framework that achieves, for the first time, flexible stroke-level manipulation during the drawing process through a three-stage hierarchical pipeline (predicting stroke embeddings → determining canvas positions → generating drawing action sequences). The method significantly outperforms SketchEdit on tasks including stroke replacement, erasure, and extension.
Background & Motivation¶
Controllable sketch manipulation is an important generative task. Existing methods (e.g., SketchEdit) require all edited stroke embeddings to be collected and fed into the generator simultaneously before generation begins, prohibiting further manipulation during the generation process. This limits precise control over specific local content such as individual stroke shapes.
The core challenges are:
- Exposing editable intermediate representations that users can modify at any point during generation
- Maintaining global consistency in autoregressive generation — the shape and position of the current stroke must coordinate with previously drawn strokes
- Simultaneously controlling stroke features (what to draw) and position (where to draw), which are tightly coupled
- Existing instance-level manipulation methods (controllable synthesis, sketch inpainting, sketch analogy) lack the granularity for reliable single-stroke control
- Diffusion-based methods, while high quality, are difficult to interface at the stroke level
- The model must emulate the human drawing process — deciding what to draw, then where, then executing the stroke
Method¶
Overall Architecture¶
Encoding stage: Input sketches are stroke-separated → a Stroke Encoder (BiLSTM) extracts stroke embeddings \(e_k \in \mathbb{R}^{128}\) and a Position Encoder (FC layers) extracts position embeddings \(p_k \in \mathbb{R}^{128}\) → a Relationship Encoder (gMLP) learns inter-stroke relationships \(r_k\) → embeddings are fused as \(\tilde{e}_k = e_k + r_k\) → a Sketch Encoder (LSTM) produces the sketch code \(y \in \mathbb{R}^{128}\).
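The relationship encoder can be pictured as a standard gMLP stack with spatial gating across the stroke dimension. Below is a minimal numpy sketch assuming the paper's hyperparameters (2 layers, \(d_\text{model}=128\), \(d_\text{ffn}=512\), ~7 strokes per sketch); the random stand-in embeddings, the pass-through gate initialization, and reading \(r_k\) off as the residual update are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_strokes, d_model, d_ffn = 7, 128, 512   # average stroke count ~7 per sketch

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def init_block(n, d_in, d_ffn, rng):
    return {
        "w1": rng.normal(0, 0.02, (d_in, d_ffn)),
        "w_spatial": np.zeros((n, n)),     # gate weights start at zero...
        "b_spatial": np.ones((n, 1)),      # ...so the gate begins as pass-through
        "w2": rng.normal(0, 0.02, (d_ffn // 2, d_in)),
    }

def gmlp_block(x, p):
    """One gMLP block: channel projection, spatial gating across strokes, residual."""
    u = gelu(x @ p["w1"])                          # (n, d_ffn)
    u1, u2 = np.split(u, 2, axis=1)                # channel split for the gating unit
    gate = p["w_spatial"] @ u2 + p["b_spatial"]    # mixes information across strokes
    return x + (u1 * gate) @ p["w2"]               # residual back to d_model

# Stand-ins for the BiLSTM stroke embeddings e_k and FC position embeddings p_k.
e = rng.normal(size=(n_strokes, d_model))
p_pos = rng.normal(size=(n_strokes, d_model))

ctx = e + p_pos                                    # position-aware stroke features
h = ctx
for blk in (init_block(n_strokes, d_model, d_ffn, rng) for _ in range(2)):
    h = gmlp_block(h, blk)
r = h - ctx                                        # read r_k off as the residual update
e_tilde = e + r                                    # fused embedding: e~_k = e_k + r_k
```

With the gate initialized to pass-through, each block starts close to a per-stroke MLP and gradually learns cross-stroke mixing through `w_spatial`.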
Generation stage (three-stage hierarchical): Sketch code \(y\) → (I) Stroke Decoder autoregressively predicts stroke embeddings \(\hat{e}_k\) and stop tokens \(\hat{\eta}_k\) → (II) Position Decoder predicts 2D starting coordinates via a bivariate Gaussian distribution → (III) Sequence Decoder translates embeddings into drawing action sequences using a GMM. All three decoders receive preceding outputs as conditional input in an autoregressive manner.
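Stages (II) and (III) both end in sampling steps: a single bivariate Gaussian for the starting coordinate, then a 20-component bivariate GMM plus a categorical pen state for the action offsets. A minimal numpy sketch of the sampling side follows; all parameter values here are made-up stand-ins for decoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bivariate(mu_x, mu_y, sigma_x, sigma_y, rho, rng):
    """Draw one (x, y) from a bivariate Gaussian given its five parameters."""
    mean = np.array([mu_x, mu_y])
    cov = np.array([[sigma_x**2, rho * sigma_x * sigma_y],
                    [rho * sigma_x * sigma_y, sigma_y**2]])
    return rng.multivariate_normal(mean, cov)

# Stage (II): one bivariate Gaussian gives the stroke's starting coordinate.
start_xy = sample_bivariate(0.1, -0.2, 0.05, 0.05, 0.3, rng)

# Stage (III): a mixture of M = 20 bivariate Gaussians models each offset
# (dx, dy); the pen state comes from a 3-way categorical distribution.
M = 20
pi = np.full(M, 1.0 / M)                        # mixture weights (softmax output)
comp = rng.normal(0, 0.1, size=(M, 5))          # (mu_x, mu_y, sigma_x, sigma_y, rho)
comp[:, 2:4] = np.abs(comp[:, 2:4]) + 1e-2      # sigmas must be positive
comp[:, 4] = np.tanh(comp[:, 4])                # rho constrained to (-1, 1)

k = rng.choice(M, p=pi)                         # pick one mixture component
dx, dy = sample_bivariate(*comp[k], rng)
pen = rng.choice(3, p=[0.8, 0.15, 0.05])        # pen-down / pen-up / end-of-stroke
```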
Key Designs¶
- Relationship Embedding: A gMLP (2 layers, \(d_\text{model}=128\), \(d_\text{ffn}=512\)) serves as the relationship encoder, integrating three types of inter-stroke relationships — spatial (relative positions, e.g., nose above mouth), contextual (drawing-order proximity, e.g., two eyes of a pig drawn consecutively), and semantic (component-level dependencies, e.g., a pig requiring exactly two ears). The relationship vector \(r_k\) is added to the stroke embedding \(e_k\), endowing each stroke with a global view.
- Hierarchical Autoregressive Generation: The framework emulates the human drawing process — (I) what to draw (predicting stroke embeddings): an LSTM stroke decoder (hidden=1024) takes the sketch code \(y\) and the previous step's \(\tilde{e}_{k-1} + p_{k-1}\) to output \(\hat{e}_k\) and stop token \(\hat{\eta}_k\); (II) where to place the stroke (determining position): an LSTM position decoder models a bivariate Gaussian distribution \(\mathcal{N}(\mu_{px}, \mu_{py}, \sigma_{px}, \sigma_{py}, \rho_p)\) and samples the starting coordinate; (III) executing the stroke (generating actions): an LSTM sequence decoder models offsets \((\Delta x, \Delta y)\) with \(M=20\) bivariate Gaussian mixtures and pen states with a categorical distribution.
- Flexible Manipulation Interface: The exposed intermediate stroke embeddings \(\{\hat{e}_k\}\) can be edited at any point during generation, supporting stroke replacement (substituting \(\hat{e}_1\) with another stroke embedding), stroke erasure (skipping position determination and action generation for a given stroke), stroke extension (injecting an external embedding \(\varepsilon\) to add a new stroke with automatically generated position and actions), and combined operations (erasure followed by extension equals replacement). Error accumulation effects are mild due to the short sequence length at each level (average stroke count ~7, average actions per stroke ~10).
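Because the manipulation interface is just the list of stroke embeddings emitted by stage (I), the operations reduce to list edits performed before stages (II)/(III) run. A toy Python sketch; the `decode_stroke` stub and random embeddings are placeholders for the real decoders.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 128  # stroke embedding dimension

def decode_stroke(e, k):
    """Stand-in for stages (II)+(III): position + action decoding for one embedding."""
    return {"stroke_id": k, "embedding": e}

# Suppose stage (I) has emitted four stroke embeddings so far.
stream = [rng.normal(size=D) for _ in range(4)]
external = rng.normal(size=D)          # an embedding taken from another sketch

edited = list(stream)
edited[1] = external                   # replacement: swap an embedding before drawing
edited = [e for i, e in enumerate(edited) if i != 2]   # erasure: stroke 2 never decoded
edited.append(external)                # extension: inject a new embedding; its position
                                       # and actions are generated downstream

canvas = [decode_stroke(e, k) for k, e in enumerate(edited)]  # stages (II)+(III)
```

Erasure followed by extension at the same index reproduces replacement, matching the combined-operation equivalence noted above.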
Loss & Training¶
Five loss terms are combined with weights: \(\mathcal{L} = \mathcal{L}_\text{seq} + \mathcal{L}_\text{pos} + \mathcal{L}_\text{stp} + 5 \cdot \mathcal{L}_\text{sok} + 0.5 \cdot \mathcal{L}_\text{img}\).
| Loss | Meaning | Weight | Form |
|---|---|---|---|
| \(\mathcal{L}_\text{seq}\) | Drawing action reconstruction | 1.0 | GMM negative log-likelihood + pen state cross-entropy |
| \(\mathcal{L}_\text{pos}\) | Starting position reconstruction | 1.0 | Bivariate Gaussian negative log-likelihood |
| \(\mathcal{L}_\text{stp}\) | Stop token prediction | 1.0 | Categorical cross-entropy |
| \(\mathcal{L}_\text{sok}\) | Stroke embedding regularization | 5.0 | L2 distance; highest weight to ensure embedding quality |
| \(\mathcal{L}_\text{img}\) | Canvas image reconstruction | 0.5 | CNN decoder reconstructing \(128\times128\) rasterized image |
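The weighting in the table can be read off directly; a one-line helper makes the relative emphasis explicit (the individual loss values are placeholders).

```python
def total_loss(l_seq, l_pos, l_stp, l_sok, l_img):
    """Paper's weighted sum: stroke-embedding regularization dominates (w=5),
    image reconstruction is a softer auxiliary term (w=0.5)."""
    return l_seq + l_pos + l_stp + 5.0 * l_sok + 0.5 * l_img
```

With all five terms at 1.0 the total is 8.5, i.e. \(\mathcal{L}_\text{sok}\) alone contributes more than the three unit-weight terms combined.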
Key Experimental Results¶
Main Results: Sketch Generation and Retrieval Quality¶
| Dataset | Metric | Sketch-HARP | DC-gra2seq | SketchEdit | SketchKnitter | SP-gra2seq |
|---|---|---|---|---|---|---|
| DS1 (17 classes) | FID↓ | 9.96 | 12.83 | 25.40 | 11.32 | 14.64 |
| DS1 | LPIPS↓ | 0.28 | 0.30 | 0.37 | 0.31 | 0.33 |
| DS1 | Rec@1↑ | 89.90% | 85.45% | 70.91% | 78.45% | 80.18% |
| DS2 (5 classes) | FID↓ | 6.97 | 11.01 | 28.64 | 8.10 | 13.42 |
| DS2 | Rec@1↑ | 96.00% | 95.27% | 82.00% | 95.07% | 92.13% |
Ablation Study & Human Evaluation¶
Ablation (DS1, 17 classes):

| Configuration | FID↓ | Rec@1↑ | Key Note |
|---|---|---|---|
| Full model | 9.96 | 89.90% | All three stages active |
| w/o relationship embedding | 25.45 | 74.55% | Most critical component (FID +155%) |
| w/o \(\mathcal{L}_\text{img}\) | 10.30 | 70.18% | Image regularization strongly affects retrieval |
| w/o \(\mathcal{L}_\text{sok}\) | 52.31 | 65.91% | Confirms the need for stroke embedding regularization at \(w=5\) |
| w/o position encoder | 14.51 | 82.91% | Position information is important |

Human evaluation (mean ± std, higher is better):

| Task | Sketch-HARP↑ | SketchEdit↑ | Relative gain |
|---|---|---|---|
| Stroke replacement | 2.09 ± 1.06 | 1.45 ± 1.27 | +44% |
| Stroke erasure | 2.30 ± 0.96 | 1.04 ± 1.14 | +121% |
Key Findings¶
- Human evaluation shows a highly significant advantage (paired t-test, \(p = 6.04 \times 10^{-13}\)) over all compared baselines
- t-SNE visualization shows that relationship embeddings cause strokes to cluster automatically by semantic function (e.g., left wheel / right wheel / window position)
- Without visual features (pure sequence modeling) the method still surpasses DC-gra2seq, which uses visual inputs, validating the effectiveness of the hierarchical design
- The weight of \(\mathcal{L}_\text{sok}\) at 5 is far larger than other terms, reflecting the cascading impact of stroke embedding accuracy on downstream position and sequence generation
- The stop token \(\hat{\eta}_k\) enables the model to adaptively determine the number of strokes without hard-coded upper limits
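The adaptive termination above can be sketched as a loop that samples strokes until the stop token fires, with the hard cap \(N_\text{max}^\text{num}=25\) as a safety limit; the decoder stub and its stop-probability schedule are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MAX = 25   # hard cap on stroke count from the paper

def stroke_decoder_step(step, rng):
    """Stand-in for stage (I): returns (stroke embedding, stop probability)."""
    return rng.normal(size=128), min(1.0, 0.15 * step)   # stop grows more likely

embeddings = []
for step in range(N_MAX):
    e_hat, p_stop = stroke_decoder_step(step, rng)
    if rng.random() < p_stop:        # stop token fired: sketch is complete
        break
    embeddings.append(e_hat)
# len(embeddings) varies per sketch instead of being hard-coded
```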
Highlights & Insights¶
- The generation process precisely emulates the human drawing cognitive pipeline: deciding what to draw → where to place the stroke → executing it, with the three stages corresponding to distinct cognitive steps
- Exposing intermediate stroke embeddings as an editable interface enables genuine manipulation during generation rather than only prior to it
- Cross-category stroke replacement still produces coherent sketches (e.g., angel wings replaced with different shapes but size-matched)
- A special case of stroke extension — continuing from an incomplete sketch — can generate creative cross-category sketches (e.g., a pig-head clock)
Limitations & Future Work¶
- Processes only sequential-format sketches (QuickDraw dataset); rasterized sketches and sketches extracted from natural photos are not supported
- Autoregressive generation still carries the risk of error accumulation (though practical impact is limited)
- Integration with stronger generators such as diffusion models has not been explored
- Stroke count and action length are constrained by fixed upper bounds (\(N_\text{max}^\text{num}=25\), \(N_\text{max}^\text{len}=32\))
- The application domain is relatively niche and the industrial deployment pathway is unclear
Related Work & Insights¶
- vs. SketchEdit: manipulation is possible during generation vs. only before generation; human evaluation scores are substantially higher
- vs. DC-gra2seq: superior FID is achieved without visual features, validating the sufficiency of sequence modeling
- vs. SketchKnitter (diffusion-based): finer stroke-level control is provided compared to holistic generation
- The hierarchical autoregressive paradigm is transferable to structured generation tasks such as vector graphics generation, CAD design, and architectural sketching
Rating¶
⭐⭐⭐⭐ (4/5) The method is elegantly designed with a natural and well-motivated hierarchical process, and the human evaluation results demonstrate strong statistical significance. The application domain is relatively niche, but the methodology offers broad inspirational value.