FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
TL;DR
FlipSketch is the first method to generate unconstrained raster sketch animations from a single static sketch and a text prompt. It combines three components: LoRA fine-tuning of a T2V diffusion model, a reference-frame mechanism based on DDIM inversion, and dual attention composition. Together they produce smooth, dynamic animation sequences while preserving the identity of the input sketch.
Background & Motivation
- Appeal and pain points of sketch animation: Flip-book animation is the most classic form of animation, but traditional animation requires many professional artists to draw keyframes and in-betweens.
- Limitations of prior work:
    - Vector-based animation methods (Live-Sketch): Animate by transforming control point coordinates, which has three drawbacks: (1) existing strokes can only be displaced or scaled, never added or deleted; (2) a 2D sketch captures only one view of a 3D object, so perspective changes cannot be expressed; and (3) SDS optimization is extremely costly in time and compute.
    - I2V methods (SVD, DynamiCrafter): Suffer from a sketch-to-photo domain gap, making it difficult to preserve sketch identity in the generated results.
    - Skeleton-based methods: Require human-like inputs and are not applicable to general objects.
- Key Challenge:
    - How to make video generation models produce sketch-style frames.
    - How to maintain the visual integrity (identity consistency) of the input sketch.
    - How to support unconstrained motion (beyond stroke displacement).
Method
Overall Architecture
Based on the ModelScope T2V diffusion model, the pipeline is divided into three parts:

1. LoRA Fine-tuning: Adapts the T2V model to the sketch style using synthetic sketch animations.
2. Reference Frame Mechanism: Constructs the reference noise via DDIM inversion followed by iterative frame alignment.
3. Dual Attention Composition: Injects reference frame information during spatial and temporal attention to guide denoising.
Key Designs
1. LoRA Fine-Tuning for Sketch-Style Adaptation
- Uses synthetic vector animations from Live-Sketch as training data.
- Trains LoRA (rank = 4) on the 3D U-Net of ModelScope T2V for only 2,500 iterations.
- The fine-tuned model can generate sketch-style frame sequences from text prompts.
- Very few trainable parameters (\(< 0.01\%\)), which preserves the strong motion priors of the T2V model.
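To make the low-rank idea concrete, here is a minimal NumPy sketch of a LoRA-adapted linear layer. Only rank = 4 comes from the text above; the layer size, the `alpha` scaling, and the zero initialization of `B` follow common LoRA practice and are assumptions, not details taken from FlipSketch.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, d_in, d_out, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen base weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01            # trainable
        self.B = np.zeros((d_out, rank))                             # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (batch, d_in) -> (batch, d_out)
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_fraction(self):
        # Share of this layer's parameters that belong to the LoRA matrices.
        lora = self.A.size + self.B.size
        return lora / (lora + self.W.size)

layer = LoRALinear(d_in=320, d_out=320, rank=4)   # toy layer size, not from the paper
x = np.ones((2, 320))
base_out = x @ layer.W.T
lora_out = layer(x)
```

Because `B` starts at zero, the adapted layer initially reproduces the base layer, so training starts from the pretrained behavior. The per-layer fraction in this toy case (about 2.4%) says nothing about the model-wide figure quoted above, which depends on which layers are adapted and on the size of the full model.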
2. Reference Frame Mechanism via DDIM Inversion
- Setup: Encodes the input sketch \(I_s\) and performs DDIM inversion with the null text prompt \(\mathcal{P}_{null}\) to obtain the reference noise \(x_T^r\).
- The first frame uses reference noise \(x_T^r\), while the remaining \(M-1\) frames are sampled from a standard normal distribution \(\{f_T^i\}_{i=2}^M \sim \mathcal{N}(0, \mathbf{I})\).
- Iterative Frame Alignment:
    - For each timestep \(t \in [T, \tau_1]\):
        - Denoise the reference frame independently: \(\eta_1 = \epsilon_\theta(x_t^r, t, \mathcal{P}_{null})\), which serves as the ground-truth target.
        - Jointly denoise all frames: \([\eta'_i] = \epsilon_\theta([x_t^r, f_t^{train}], t, \mathcal{P}_{input})\), where \(f_t^{train}\) is the trainable noise of the remaining frames.
        - Calculate alignment loss: \(\mathcal{L}_{align} = \|\eta'_1 - \eta_1\|_2^2\).
        - Backpropagate to optimize \(f_t^{train}\), aligning the first frame in joint denoising with the independently denoised reference.
    - Performed only at early timesteps (\(\tau_1 = 2T/5\)), as coarse structures are determined early in the denoising process.
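The noise initialization and the alignment loop above can be sketched on a toy problem. Everything here is a stand-in: `joint_denoise` is a hypothetical linear "denoiser" that mixes frames toward the clip mean (imitating temporal layers), the reference noise is random rather than DDIM-inverted, prompts are ignored, and the gradient is written analytically instead of being backpropagated through a network. Only the structure (fixed reference frame, trainable remaining frames, loss on the first frame's prediction) mirrors the description.

```python
import numpy as np

def joint_denoise(frames, mix=0.5):
    """Toy stand-in for eps_theta: each frame's output is pulled toward the clip mean."""
    mean = frames.mean(axis=0, keepdims=True)
    return frames + mix * (mean - frames)

def align_frames(x_ref, f_train, steps=50, lr=1.0, mix=0.5):
    """Gradient descent on the non-reference frames only; x_ref stays fixed."""
    num_frames = f_train.shape[0] + 1
    eta_ref = joint_denoise(x_ref[None], mix)[0]      # reference denoised on its own
    losses = []
    for _ in range(steps):
        frames = np.concatenate([x_ref[None], f_train], axis=0)
        eta_joint = joint_denoise(frames, mix)        # all frames denoised together
        diff = eta_joint[0] - eta_ref                 # mismatch on the first frame
        losses.append(float(np.sum(diff ** 2)))       # L_align
        # Analytic gradient of L_align w.r.t. each trainable frame (valid for this toy model).
        grad = 2.0 * diff * (mix / num_frames)
        f_train = f_train - lr * grad[None]
    return f_train, losses

rng = np.random.default_rng(0)
M, D = 4, 8                                   # frames and flattened latent size (toy values)
x_ref = rng.standard_normal(D)                # stand-in for the inverted noise x_T^r
f_init = rng.standard_normal((M - 1, D))      # frames 2..M sampled from N(0, I)
f_aligned, losses = align_frames(x_ref, f_init)
```

In this toy setting the loss shrinks geometrically because the model is linear. With a real video diffusion model the same loop would be run per timestep with autograd, and convergence is not guaranteed.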
3. Dual Attention Composition
At timestep \(t \in [T, \tau_2]\) (\(\tau_2 = 3T/5\)), two-stream denoising is performed simultaneously:

- (i) Jointly denoise all frames: \(\epsilon_\theta([x_t^r, f_t^i], t, \mathcal{P}_{input})\)
- (ii) Denoise the reference frame only: \(\epsilon_\theta([x_t^r], t, \mathcal{P}_{null})\)
Spatial Attention Composition \(\mathcal{C}^S\):

- Performs cross-attention between reference frame query \(q_t^r\) and joint frame key \(k_t^g\), replacing part of the self-attention.
- Repeats the reference frame \(N\) times (\(N\) linearly decays from \(M\) to 1) to prevent the generated frames from degenerating into static frames.
- Effect: Injects spatial features (stroke positions, structures) of the reference frame into the generated frames.
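A rough sketch of the two ingredients named above: a linearly decaying repeat count \(N\) and cross-attention between reference queries and joint-stream keys. The summary does not say how the \(N\) copies are wired into the attention layers, so the pairing used here (copy \(i\) attends to joint frame \(i\)) is purely illustrative, as are the function names, tensor sizes, and the number of steps.

```python
import numpy as np

def repeat_schedule(num_steps, num_frames):
    """Linearly decay the repeat count N from num_frames down to 1."""
    if num_steps == 1:
        return [num_frames]
    counts = []
    for step in range(num_steps):
        frac = step / (num_steps - 1)
        counts.append(int(round(num_frames - (num_frames - 1) * frac)))
    return counts

def cross_attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
M, tokens, d = 10, 16, 8                          # toy sizes
q_ref = rng.standard_normal((tokens, d))          # reference-frame queries q_t^r
k_joint = rng.standard_normal((M, tokens, d))     # joint-stream keys k_t^g, one set per frame
v_joint = rng.standard_normal((M, tokens, d))

schedule = repeat_schedule(num_steps=8, num_frames=M)
N = schedule[0]                                   # repeat count at the first composition step
# Illustrative pairing: reference copy i attends to joint frame i, for the first N frames.
composed = np.stack([cross_attention(q_ref, k_joint[i], v_joint[i]) for i in range(N)])
```

As \(N\) shrinks, fewer frames receive reference features, which matches the stated goal of keeping later frames free to move.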
Temporal Attention Composition \(\mathcal{C}^T\):

- Directly replaces the first frame's key in temporal self-attention with the reference frame key \(k_t^r\).
- Controls how strongly the first frame influences the other frames.
- Supports a motion-fidelity trade-off parameter \(\lambda\): \(k_t^r \leftarrow k_t^r \cdot (1 + \lambda \cdot 2e^{-2})\), where a high \(\lambda\) enhances stability and a low \(\lambda\) increases motion magnitude.
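The key replacement and the \(\lambda\) scaling can be illustrated for a single spatial location. This is a sketch under assumptions: `gain` stands in for the constant that multiplies \(\lambda\) in the formula above and is set to a hypothetical, deliberately large value so the effect is visible; queries and the reference key are made positive so that \(q \cdot k^r > 0\) in the demo.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def temporal_attention_weights(q, k, k_ref, lam, gain):
    """Attention over frames, with the first frame's key replaced by the
    reference key scaled by (1 + lam * gain)."""
    k = k.copy()
    k[0] = k_ref * (1.0 + lam * gain)
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

rng = np.random.default_rng(0)
M, d = 10, 8                                  # frames and channel size (toy values)
q = np.abs(rng.standard_normal((M, d)))       # positive entries so q . k_ref > 0 here
k = rng.standard_normal((M, d))
k_ref = np.abs(rng.standard_normal(d))
gain = 0.5                                    # hypothetical constant, chosen for visibility

w_motion = temporal_attention_weights(q, k, k_ref, lam=0.0, gain=gain)
w_fidelity = temporal_attention_weights(q, k, k_ref, lam=1.0, gain=gain)
first_frame_gain = w_fidelity[:, 0] - w_motion[:, 0]   # extra attention on the reference
```

With a positive dot product, a larger \(\lambda\) raises the logit of the reference key and every frame attends more to the first frame, which is the fidelity end of the trade-off. If \(q \cdot k^r\) were negative, the same scaling would lower that weight, so the positivity here is a convenience of the demo.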
Loss & Training
- LoRA training: Standard diffusion denoising loss.
- Inference-time frame alignment: \(\mathcal{L}_{align} = \|\eta'_1 - \eta_1\|_2^2\) (optimizes only the sampling noise without updating model parameters).
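For reference, "standard diffusion denoising loss" usually means the noise-prediction objective below. This assumes the common \(\epsilon\)-prediction parameterization with a cumulative noise schedule \(\bar{\alpha}_t\); the symbols \(z_0\) (clean video latent) and \(\mathcal{P}\) (training caption) are notation introduced here, and the summary does not spell out the exact form used.

\[
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad
\mathcal{L}_{LoRA} = \mathbb{E}_{z_0,\ \epsilon \sim \mathcal{N}(0, \mathbf{I}),\ t} \left[ \left\| \epsilon - \epsilon_\theta(z_t, t, \mathcal{P}) \right\|_2^2 \right]
\]

During LoRA training only the low-rank matrices receive gradients from this loss. The alignment loss, by contrast, is minimized at inference time with respect to the noise \(f_t^{train}\).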
Key Experimental Results
Quantitative Comparison (Tab. 1: CLIP Metrics)
| Method | S2V Consistency↑ | T2V Alignment↑ |
|---|---|---|
| SVD | 0.917 | - |
| DynamiCrafter | 0.780 | 0.127 |
| Live-Sketch | 0.965 | 0.142 |
| FlipSketch | 0.956 | 0.172 |
| FlipSketch (λ=1) | 0.968 | 0.170 |
Ablation Study
| Configuration | S2V Consistency↑ | T2V Alignment↑ |
|---|---|---|
| FlipSketch (Full) | 0.956 | 0.172 |
| w/o frame alignment | 0.952 | 0.171 |
| w/o \(\mathcal{C}^T\) & \(\mathcal{C}^S\) | 0.876 | 0.168 |
| λ=0 (max motion) | 0.949 | 0.174 |
| λ=1 (max fidelity) | 0.968 | 0.170 |
User Study (Tab. 2)
Users rated FlipSketch higher than Live-Sketch and the ablated versions in both text fidelity and sketch consistency.
Key Findings
- Removing the dual attention composition (\(\mathcal{C}^T\) & \(\mathcal{C}^S\)) drops S2V consistency from 0.956 to 0.876, the largest change in the ablation, which indicates that it is the main contributor to identity preservation.
- FlipSketch scores clearly higher than Live-Sketch on text-video alignment (0.172 vs. 0.142), consistent with richer motion that follows the prompt more closely.
- Live-Sketch is slightly ahead of the default FlipSketch setting in S2V consistency (0.965 vs. 0.956) because vector-based methods naturally constrain strokes; with \(\lambda = 1\), FlipSketch reaches 0.968.
- Computational efficiency: FlipSketch generates a 10-frame animation in a few seconds, whereas Live-Sketch requires hours of SDS optimization.
- Frame extrapolation: animations can be smoothly concatenated by using the last frame of one segment as the input sketch for the next.
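The gaps discussed in these findings follow directly from the reported CLIP scores. The short script below recomputes them from the table values; the dictionary keys and helper names are just labels chosen for this example.

```python
# CLIP scores copied from the tables above: (S2V consistency, T2V alignment).
scores = {
    "Live-Sketch": (0.965, 0.142),
    "FlipSketch": (0.956, 0.172),
    "FlipSketch (lambda=1)": (0.968, 0.170),
    "w/o composition": (0.876, 0.168),
}

def s2v_gap(a, b):
    """Absolute difference in S2V consistency between methods a and b."""
    return scores[a][0] - scores[b][0]

def t2v_relative_gain(a, b):
    """Relative improvement of a over b in T2V alignment."""
    return (scores[a][1] - scores[b][1]) / scores[b][1]

composition_drop = s2v_gap("FlipSketch", "w/o composition")
alignment_gain = t2v_relative_gain("FlipSketch", "Live-Sketch")
fidelity_margin = s2v_gap("FlipSketch (lambda=1)", "Live-Sketch")

print(f"S2V drop without composition: {composition_drop:.3f}")
print(f"T2V gain over Live-Sketch:    {alignment_gain:.1%}")
print(f"S2V margin at lambda=1:       {fidelity_margin:+.3f}")
```

The numbers work out to an absolute S2V drop of 0.080 without composition, a relative T2V gain of roughly 21% over Live-Sketch, and a margin of +0.003 in S2V consistency at \(\lambda = 1\).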
Highlights & Insights
- Raster vs. Vector Paradigm Shift: Abandoning vector-level constraints in favor of raster-level degrees of freedom allows the animation to depict stroke additions/deletions and perspective transformations that vector-based methods cannot achieve.
- Effective Use of DDIM Inversion: Because the reference frame starts from the inverted noise of the input sketch, denoising it reconstructs the sketch closely, which addresses identity preservation at its source.
- Inference-Time vs. Training-Time Optimization: Frame alignment is performed at inference time by optimizing noise (rather than model parameters), which keeps the overhead manageable.
- Explicit Motion-Fidelity Control: The \(\lambda\) parameter acts as a user-adjustable knob to meet different creative demands.
- Minimalist LoRA Adaptation: Standard training with only rank = 4 and 2,500 steps successfully adapts the T2V model to the sketch domain.
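The reconstruction property behind the DDIM-inversion highlight can be checked on a toy example. This is a sketch under assumptions: the cumulative-alpha schedule, the latent size, and both "denoisers" are made up for illustration, and no real diffusion model is involved. With a noise predictor that ignores its input the round trip is exact; once the predictor depends on the latent, inversion is only approximate, which is the situation with a real network.

```python
import numpy as np

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update between two cumulative-alpha levels."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def ddim_invert(x0, eps_fn, alphas):
    """Walk from the clean latent to noise; alphas runs from nearly clean to noisy."""
    x = x0
    for i in range(len(alphas) - 1):
        x = ddim_step(x, eps_fn(x, i), alphas[i], alphas[i + 1])
    return x

def ddim_denoise(x_noisy, eps_fn, alphas):
    """Walk back from noise to the clean latent with the same deterministic rule."""
    x = x_noisy
    for i in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, eps_fn(x, i), alphas[i], alphas[i - 1])
    return x

def round_trip_error(x0, eps_fn, alphas):
    recon = ddim_denoise(ddim_invert(x0, eps_fn, alphas), eps_fn, alphas)
    return float(np.abs(recon - x0).max())

rng = np.random.default_rng(0)
alphas = np.linspace(0.999, 0.01, 50)        # toy schedule, clean -> noisy
sketch_latent = rng.standard_normal(16)      # stand-in for the encoded sketch
fixed_eps = rng.standard_normal(16)

def const_model(x, i):
    return fixed_eps                         # ignores its input: round trip is exact

def state_model(x, i):
    return 0.1 * np.tanh(x)                  # depends on x: round trip is approximate

err_const = round_trip_error(sketch_latent, const_model, alphas)
err_state = round_trip_error(sketch_latent, state_model, alphas)
```

The gap between `err_const` and `err_state` is why the reconstruction is best described as close rather than guaranteed, and it gives some intuition for why extra mechanisms such as frame alignment and attention composition are useful.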
Limitations & Future Work
- 10-Frame Limitation: Generates roughly 10 frames per run, needing frame extrapolation to splice longer animations, which might accumulate drift.
- Sketch-Motion Consistency: For complex 3D motion (e.g., rotation), raster frames may exhibit unnatural distortions.
- Depth of Text Understanding: Performance is heavily dependent on the T2V model's text comprehension, yielding limited fidelity to precise motion descriptions.
- Post-Processing Constraints: Output frames are forced into black strokes on a white background via post-processing, potentially discarding grayscale details.
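The frame-extrapolation workaround for the 10-frame limit can be sketched as a simple chaining loop. `fake_generate` is a hypothetical stand-in for the animator (it just shifts the sketch sideways), and dropping the first frame of later segments is an assumption made here to avoid a duplicated frame at each seam.

```python
import numpy as np

def extend_animation(generate_fn, sketch, prompt, num_segments):
    """Chain segments: the last frame of each segment seeds the next one."""
    pieces = []
    current = sketch
    for seg in range(num_segments):
        segment = generate_fn(current, prompt)          # shape: (frames, H, W)
        # Later segments start with a copy of the previous last frame, so skip it.
        pieces.append(segment if seg == 0 else segment[1:])
        current = segment[-1]
    return np.concatenate(pieces, axis=0)

def fake_generate(sketch, prompt, num_frames=10):
    """Hypothetical animator: frame 0 is the input, later frames shift it right."""
    return np.stack([np.roll(sketch, shift=i, axis=1) for i in range(num_frames)])

sketch = np.zeros((4, 12))
sketch[:, 0] = 1.0                                       # a single vertical stroke
clip = extend_animation(fake_generate, sketch, "a stroke sliding right", num_segments=3)
```

Because each segment is conditioned only on the previous segment's last frame, any distortion in that frame is inherited by everything after it, which is the drift risk noted in the first limitation.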
Related Work & Insights
- Live-Sketch (CVPR'24): Vector-based animation optimized via SDS \(\rightarrow\) This work demonstrates that direct raster generation is faster and more flexible.
- ModelScope T2V: Open-source T2V model \(\rightarrow\) Provides foundational motion priors.
- TF-ICON: Attention composition in diffusion models \(\rightarrow\) Inspired the design of the dual attention composition.
- Insights: Performing lightweight adaptation (LoRA) on diffusion models combined with delicate inference-time control (attention manipulation and noise optimization) can unlock novel generation capabilities at a very low cost.
Rating
⭐⭐⭐⭐ — Highly creative work that elegantly combines the simple experience of flip-book animations with modern T2V techniques. The three core innovations (LoRA, reference frame, and dual attention) fulfill their respective roles and coordinate well to form a practical and entertaining system.