FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
Conference: CVPR 2025
arXiv: 2411.10818
Code: None
Area: Image Generation
Keywords: Sketch Animation, Text-guided, Video Diffusion Models, Flipbook Animation, DDIM Inversion
TL;DR
This work proposes FlipSketch, the first system that generates unconstrained raster sketch animations from a single static sketch and a text description, achieving smooth animation through three key innovations: fine-tuning a text-to-video diffusion model, iterative reference frame alignment, and dual-attention composition.
Background & Motivation
Background: Sketch animation is a powerful visual storytelling medium. Traditional animation production requires professional teams to draw keyframes and in-betweens, and existing automated attempts still require users to specify exact motion paths or multiple keyframes.
Limitations of Prior Work: Vector-based sketch animation methods (e.g., Live-Sketch) are constrained to stroke-by-stroke translation operations, so they cannot freely redraw or reinterpret the subject. Image-to-video methods (e.g., SVD, DynamiCrafter) suffer from the sketch-photo domain gap, which makes it difficult to preserve sketch identity.
Key Challenge: The method must preserve the visual integrity of the input sketch while allowing unconstrained motion; existing methods achieve only one of the two.
Goal: To realize a simple workflow of "draw a sketch + describe motion = animation".
Key Insight: Leveraging the motion priors of text-to-video diffusion models, adapting them to the sketch domain via LoRA fine-tuning, and utilizing DDIM inversion to provide reference frame constraints.
Core Idea: Embedding the sketch as a reference frame into a video diffusion model via DDIM inversion, and achieving smooth animation while maintaining sketch identity through iterative frame alignment and dual-attention composition.
Method
Overall Architecture
The ModelScope T2V model is first fine-tuned with LoRA to generate sketch-style videos. At inference, given a static input sketch and a text prompt: (1) the input sketch is embedded via DDIM inversion into reference noise that serves as the first frame; (2) the remaining frames are sampled from a Gaussian distribution; (3) the animation is generated through joint denoising guided by iterative reference frame alignment and dual-attention composition.
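Steps (1) and (2) can be illustrated with a minimal NumPy sketch. It assumes the DDIM-inverted noise of the sketch is already available as an array; the function name `init_video_latents`, the latent shape, and the frame count are hypothetical choices for illustration, not values taken from the paper.

```python
import numpy as np

def init_video_latents(ref_noise, num_frames, rng):
    """Stack the reference noise with fresh Gaussian noise for the other frames.

    ref_noise: array of shape (C, H, W), assumed to come from DDIM inversion
               of the input sketch (the inversion itself is not shown here).
    Returns an array of shape (num_frames, C, H, W) whose first frame is ref_noise.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    # Remaining frames are sampled from a standard Gaussian distribution.
    rest = rng.standard_normal((num_frames - 1, *ref_noise.shape))
    return np.concatenate([ref_noise[None], rest], axis=0)

rng = np.random.default_rng(0)
ref_noise = rng.standard_normal((4, 8, 8))  # stand-in for DDIM-inverted noise
latents = init_video_latents(ref_noise, num_frames=10, rng=rng)
print(latents.shape)
```

The resulting stack is what the joint denoising in step (3) would start from; the denoising model is outside the scope of this sketch.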
Key Designs
- LoRA Fine-tuned Text-to-Sketch Animation Baseline:
- Function: Adapting the T2V model to the sketch animation domain
- Mechanism: Training a LoRA (rank=4) of ModelScope T2V on synthetic vector sketch animation data generated by Live-Sketch for only 2500 iterations
- Design Motivation: Transferring the motion prior of video diffusion models to the sketch domain using synthetic data
- Iterative Frame Alignment:
- Function: Ensuring that the first frame accurately reconstructs the input sketch during joint denoising
- Mechanism: During early timesteps \(t > \tau_1\), the reference frame is denoised independently to obtain features \(\eta_1\). These are compared with the first frame's features \(\eta_1'\) from the joint denoising process through the alignment loss \(\mathcal{L}_{align} = \|\eta_1' - \eta_1\|_2^2\), and backpropagation is used to optimize the noise of the remaining frames \(f_t^{train}\)
- Design Motivation: Resolving the issue where temporal attention layers cause the reference frame to be corrupted by random noise frames
- Dual-Attention Composition:
- Function: Injecting the identity information of the reference sketch into the generated frames
- Mechanism: Running independent denoising of the reference frame and joint denoising of all frames in parallel, extracting query-key pairs \((q_t^r, k_t^r)\) from the reference frame to compose them into spatial and temporal attentions, respectively. Spatial attention performs cross-attention with the keys of the joint frames by repeating the reference frame \(N\) times (with linear decay); temporal attention directly replaces the first frame's key with the reference key
- Design Motivation: Transmitting both coarse-grained and fine-grained features of the reference sketch concurrently across spatial and temporal dimensions
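The iterative frame alignment step described above can be sketched with a toy model. This is only an illustration under strong assumptions: the video diffusion network is replaced by a fixed linear mix across frames (a stand-in for temporal attention), the independent denoising of the reference frame is treated as the identity, and the gradient is written out by hand instead of being backpropagated through a real model. The names `joint_first_frame`, `align`, and `mix` are hypothetical.

```python
import numpy as np

def joint_first_frame(frames, mix):
    # Toy stand-in for temporal attention: the first frame's joint output
    # is a weighted mix of all frames (this is how other frames corrupt it).
    return mix @ frames

def align(frames, mix, steps=50, lr=0.5):
    """Minimise ||eta_1' - eta_1||^2 by updating only the non-reference frames."""
    frames = frames.copy()
    target = frames[0].copy()  # eta_1: reference processed on its own (identity here)
    losses = []
    for _ in range(steps):
        residual = joint_first_frame(frames, mix) - target  # eta_1' - eta_1
        losses.append(float(np.sum(residual ** 2)))
        # d loss / d frames[j] = 2 * mix[j] * residual; frame 0 is left untouched.
        grad = 2.0 * mix[1:, None] * residual[None, :]
        frames[1:] -= lr * grad
    return frames, losses

rng = np.random.default_rng(0)
frames = rng.standard_normal((4, 16))   # 4 frames, 16-dim toy latents
mix = np.array([0.4, 0.2, 0.2, 0.2])    # illustrative temporal mixing weights
aligned, losses = align(frames, mix)
print(losses[0], losses[-1])
```

In this toy setting the loss shrinks geometrically while the reference frame itself never changes, which mirrors the idea of fixing the first frame's reconstruction by adjusting the noise of the other frames.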
Loss & Training
During LoRA fine-tuning, a standard diffusion loss is used. At inference, a controllable parameter \(\lambda\) regulates the motion-fidelity trade-off by scaling the reference key: \(k_t^r \leftarrow k_t^r \cdot (1 + \lambda \cdot 2e^{-2})\). Lower \(\lambda\) produces more motion; higher \(\lambda\) improves stability. The timestep thresholds are set to \(\tau_1 = 2T/5\) and \(\tau_2 = 3T/5\).
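A minimal sketch of how the \(\lambda\)-scaled reference key could enter the temporal attention, where the first frame's key is replaced by the reference key. The constant multiplying \(\lambda\) is passed in as `gain` by the caller; the value used below is illustrative only, and the function names, shapes, and single-head layout are assumptions rather than details from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scale_reference_key(k_ref, lam, gain):
    # k_ref <- k_ref * (1 + lam * gain); lam = 0 leaves the key unchanged.
    return k_ref * (1.0 + lam * gain)

def temporal_attention_weights(q, k, k_ref, lam, gain):
    """Attention across frames at one spatial location.

    q, k:  arrays of shape (num_frames, dim) from the joint denoising pass.
    k_ref: array of shape (dim,), key from the independent reference pass.
    """
    k = k.copy()
    k[0] = scale_reference_key(k_ref, lam, gain)  # replace the first frame's key
    logits = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(logits, axis=-1)

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 8))
k = rng.standard_normal((5, 8))
k_ref = rng.standard_normal(8)
weights = temporal_attention_weights(q, k, k_ref, lam=1.0, gain=0.02)  # gain is illustrative
print(weights.shape)
```

Each row of `weights` is a distribution over frames; raising `lam` enlarges the reference key and therefore the magnitude of every logit that involves it.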
Key Experimental Results
Main Results
| Method | S2V Consistency ↑ | T2V Alignment ↑ |
|---|---|---|
| Live-Sketch | 0.965 | 0.142 |
| DynamiCrafter | 0.780 | 0.127 |
| FlipSketch | 0.956 | 0.172 |
Ablation Study
- Without frame alignment: S2V consistency drops only slightly, from 0.956 to 0.952, but visual quality degrades noticeably
- Without dual-attention composition: S2V consistency drops to 0.876, and identity preservation is severely compromised
- \(\lambda=0\) (maximum motion) vs \(\lambda=1\) (maximum fidelity) demonstrates smooth trade-off control
Key Findings
- Raster-based frame animation is more flexible than vector animation, enabling 3D perspective transformations
- A frame extrapolation strategy can generate longer and more complex animation sequences
- In user studies, the proposed method significantly outperforms Live-Sketch in both MOS and text alignment
Highlights & Insights
- The "draw + describe" interaction paradigm is highly intuitive, lowering the barrier for creative animation production
- The idea of iterative frame alignment is ingenious: the reference frame's reconstruction is improved by optimizing the noise of the other frames
- The frame extrapolation scheme for longer animations is simple yet effective
Limitations & Future Work
- There is an inherent trade-off between motion and identity preservation
- Relying on the motion priors of T2V models means highly complex movements may lack precision
- Iterative alignment requires additional forward and backward passes at inference time, leaving room for efficiency improvement
Related Work & Insights
- Live-Sketch is the most direct baseline, but it is limited by its vector representation
- The combined strategy of DDIM inversion and attention control can be generalized to other conditional video generation tasks
- The frame extrapolation idea is applicable to long video generation for any short video model
Rating
- Novelty: 8/10. A new paradigm for sketch animation
- Technical Depth: 8/10. The three technical components are elegantly designed
- Experimental Thoroughness: 7/10. Thorough user study, but quantitative metrics are somewhat limited
- Writing Quality: 8/10. Vivid narrative and clear motivation