FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations

Conference: CVPR 2025
arXiv: 2411.10818
Code: None
Area: Image Generation
Keywords: Sketch Animation, Text-guided, Video Diffusion Models, Flipbook Animation, DDIM Inversion

TL;DR

This work proposes FlipSketch, presented as the first system to generate unconstrained raster sketch animations from a single static sketch and a text description. It achieves smooth animation through three components: LoRA fine-tuning of a text-to-video diffusion model, iterative reference frame alignment, and dual-attention composition.

Background & Motivation

Background: Sketch animation is a powerful visual storytelling medium. Traditional animation production requires professional teams to draw keyframes and in-betweens, and existing automated attempts still require users to specify exact motion paths or multiple keyframes.

Limitations of Prior Work: Vector-based sketch animation methods (e.g., Live-Sketch) are constrained to stroke-by-stroke translation operations, so they cannot freely redraw or reinterpret the subject. Image-to-video methods (e.g., SVD, DynamiCrafter) suffer from the domain gap between sketches and photos, which makes it difficult for them to preserve sketch identity.

Key Challenge: The visual integrity of the input sketch must be preserved while the motion remains unconstrained, yet existing methods achieve only one of the two.

Goal: To realize a simple workflow of "draw a sketch + describe motion = animation".

Key Insight: Leveraging the motion priors of text-to-video diffusion models, adapting them to the sketch domain via LoRA fine-tuning, and utilizing DDIM inversion to provide reference frame constraints.

Core Idea: Embedding the sketch as a reference frame into a video diffusion model via DDIM inversion, and achieving smooth animation while maintaining sketch identity through iterative frame alignment and dual-attention composition.

Method

Overall Architecture

The ModelScope T2V model is first fine-tuned with LoRA to generate sketch-style videos. At inference, given a static input sketch and a text prompt: (1) the input sketch is embedded via DDIM inversion into reference noise that serves as the first frame; (2) the noise for the remaining frames is sampled from a Gaussian distribution; (3) the animation is generated by jointly denoising all frames, guided by iterative reference frame alignment and dual-attention composition.
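
Steps (1) and (2) amount to assembling the initial latent video. The following is a minimal sketch, assuming the DDIM-inverted sketch is already available as an array; the function name, the latent shape, and the frame count are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def build_initial_latents(reference_noise, num_frames, seed=0):
    """Stack the inverted reference noise (frame 0) with Gaussian noise for the rest."""
    rng = np.random.default_rng(seed)
    other_frames = rng.standard_normal((num_frames - 1, *reference_noise.shape))
    return np.concatenate([reference_noise[None], other_frames], axis=0)

# Stand-in for the DDIM-inverted sketch latent (hypothetical shape: 4 x 32 x 32).
reference_noise = np.random.default_rng(1).standard_normal((4, 32, 32))
latents = build_initial_latents(reference_noise, num_frames=10)
```

Only frame 0 carries information about the input sketch at this point; the other frames receive it during joint denoising.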

Key Designs

LoRA Fine-tuned Text-to-Sketch Animation Baseline:
- Function: Adapting the T2V model to the sketch animation domain
- Mechanism: Training a LoRA (rank=4) of ModelScope T2V on synthetic vector sketch animation data generated by Live-Sketch for only 2500 iterations
- Design Motivation: Transferring the motion prior of video diffusion models to the sketch domain using synthetic data
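
For readers unfamiliar with LoRA, the rank-4 adaptation can be illustrated on a single linear layer. This is a generic sketch of the LoRA update, not the authors' training code; the layer width, the scaling factor `alpha`, and the initialisation scale are assumptions.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=4.0):
    """Compute x W^T + (alpha / r) * x A^T B^T; W stays frozen, only A and B train."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4                  # rank=4 as in the text; widths are hypothetical
W = rng.standard_normal((d_out, d_in))         # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero-initialised
x = rng.standard_normal((2, d_in))

y = lora_forward(x, W, A, B)
trainable_params = A.size + B.size             # parameters updated during fine-tuning
frozen_params = W.size                         # parameters left untouched
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained output, and the low rank keeps the number of trainable parameters small relative to the frozen weight.
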
Iterative Frame Alignment:
- Function: Ensuring that the first frame accurately reconstructs the input sketch during joint denoising
- Mechanism: During the early denoising timesteps \(t > \tau_1\), the reference frame is denoised independently to obtain features \(\eta_1\). These are compared with the first frame's features \(\eta_1'\) from the joint denoising process through the alignment loss \(\mathcal{L}_{align} = \|\eta_1' - \eta_1\|_2^2\), and backpropagation is used to optimize the noise of the remaining frames \(f_t^{train}\)
- Design Motivation: Resolving the issue where temporal attention layers cause the reference frame to be corrupted by random noise frames
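
The alignment idea can be reproduced on a toy problem. In the sketch below, a fixed row of mixing weights stands in for the temporal attention that blends the first frame with the others, and the independent denoising of the reference frame is replaced by the identity; both are simplifying assumptions, as are the step count and learning rate. Only the non-reference latents are updated, mirroring the optimization of \(f_t^{train}\).

```python
import numpy as np

def alignment_loss_and_grad(z, mix_row):
    """Return L_align = ||eta1' - eta1||^2 and its gradient w.r.t. frames 1..N-1."""
    eta1 = z[0]                     # reference frame processed on its own (toy: identity)
    eta1_joint = mix_row @ z        # first frame after mixing with all frames
    residual = eta1_joint - eta1
    loss = float(np.sum(residual ** 2))
    grad_rest = 2.0 * mix_row[1:, None] * residual[None, :]
    return loss, grad_rest

def align_frames(z, mix_row, steps=50, lr=1.0):
    """Gradient descent on the non-reference latents; the reference latent is untouched."""
    z = z.copy()
    losses = []
    for _ in range(steps):
        loss, grad_rest = alignment_loss_and_grad(z, mix_row)
        losses.append(loss)
        z[1:] -= lr * grad_rest
    return z, losses

rng = np.random.default_rng(0)
z_init = rng.standard_normal((4, 16))          # 4 frames, 16-dim toy latents
mix_row = np.array([0.4, 0.2, 0.2, 0.2])       # hypothetical attention weights for frame 0
z_aligned, losses = align_frames(z_init, mix_row)
```

The loss shrinks because the other frames are adjusted until their contribution to the first frame no longer disturbs it, which is the effect the design motivation describes.
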
Dual-Attention Composition:
- Function: Injecting the identity information of the reference sketch into the generated frames
- Mechanism: Independent denoising of the reference frame runs in parallel with joint denoising of all frames. Query-key pairs \((q_t^r, k_t^r)\) extracted from the reference frame are composed into both the spatial and the temporal attention. For spatial attention, the reference frame is repeated \(N\) times (with linear decay) and cross-attended with the keys of the joint frames; for temporal attention, the first frame's key is replaced directly with the reference key
- Design Motivation: Transmitting both coarse-grained and fine-grained features of the reference sketch concurrently across spatial and temporal dimensions
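
The temporal half of this composition is simple enough to sketch directly. The code below covers only the key replacement in temporal attention for a single spatial location; the spatial branch is omitted because the description above does not pin down how the linear decay is applied. Shapes and names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention_with_reference(q, k, v, k_ref):
    """Attention over the frame axis with frame 0's key swapped for the reference key.

    q, k, v: (num_frames, dim) for one spatial location; k_ref: (dim,).
    """
    k = k.copy()                    # avoid mutating the caller's keys
    k[0] = k_ref                    # replace the first frame's key
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
num_frames, dim = 6, 8
q = rng.standard_normal((num_frames, dim))
k = rng.standard_normal((num_frames, dim))
v = rng.standard_normal((num_frames, dim))
k_ref = rng.standard_normal(dim)    # key from the independently denoised reference frame

out, weights = temporal_attention_with_reference(q, k, v, k_ref)
```

Every frame's query now scores against the clean reference key rather than the key of the jointly denoised first frame, so the reference identity propagates along the time axis.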

Loss & Training

The LoRA fine-tuning phase uses a standard diffusion loss. At inference, a user-controllable parameter \(\lambda\) regulates the motion-fidelity trade-off by rescaling the reference key: \(k_t^r \leftarrow k_t^r \cdot (1 + \lambda \cdot 2e^{-2})\). A lower \(\lambda\) produces more motion, while a higher \(\lambda\) improves stability. The timestep thresholds are set to \(\tau_1 = 2T/5\) and \(\tau_2 = 3T/5\).
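
The key rescaling and the thresholds reduce to a few lines of arithmetic. In this sketch the constant multiplying \(\lambda\) is exposed as a parameter `c`, because the notation \(2e^{-2}\) could be read either as \(2 \times 10^{-2}\) or as \(2 \cdot \exp(-2)\); the default of 0.02 is an assumption, as are the function names and the example value of \(T\).

```python
def scale_reference_key(k_ref, lam, c=2e-2):
    """Return k_ref * (1 + lam * c); c=0.02 is an assumed reading of the constant."""
    return [x * (1.0 + lam * c) for x in k_ref]

def timestep_thresholds(T):
    """Return (tau_1, tau_2) = (2T/5, 3T/5) for T denoising steps."""
    return 2 * T / 5, 3 * T / 5

k_ref = [1.0, -2.0, 0.5]
k_max_motion = scale_reference_key(k_ref, lam=0.0)    # lambda = 0 leaves the key unchanged
k_max_fidelity = scale_reference_key(k_ref, lam=1.0)  # lambda = 1 applies the largest boost
tau_1, tau_2 = timestep_thresholds(T=50)              # T=50 is a hypothetical step count
```

Scaling the reference key up increases its attention scores relative to the other frames, which is consistent with higher \(\lambda\) favouring fidelity over motion.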

Key Experimental Results

Main Results

Method	S2V Consistency ↑	T2V Alignment ↑
Live-Sketch	0.965	0.142
DynamiCrafter	0.780	0.127
FlipSketch	0.956	0.172

Ablation Study

Without frame alignment: S2V consistency drops from 0.956 to 0.952, with a noticeable degradation in visual quality
Without dual-attention composition: S2V consistency drops to 0.876, and identity preservation is severely compromised
\(\lambda=0\) (maximum motion) vs \(\lambda=1\) (maximum fidelity) demonstrates smooth trade-off control

Key Findings

Raster-based frame animation is more flexible than vector animation, enabling 3D perspective transformations
A frame extrapolation strategy can generate longer and more complex animation sequences
In user studies, the proposed method significantly outperforms Live-Sketch in both mean opinion score (MOS) and text alignment
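
The summary does not say how frame extrapolation is carried out. One plausible reading, shown here purely as an assumption, is to chain short clips so that the last frame of each clip becomes the reference for the next. `generate_clip` is a hypothetical callable standing in for the full FlipSketch pipeline, and the toy version below uses integers as frames.

```python
def extrapolate(reference, prompts, generate_clip):
    """Chain clips: each clip is conditioned on the last frame produced so far."""
    frames = [reference]
    for prompt in prompts:
        clip = generate_clip(frames[-1], prompt)
        frames.extend(clip[1:])     # clip[0] repeats the reference frame, so skip it
    return frames

def toy_generate_clip(reference, prompt, num_frames=4):
    """Hypothetical stand-in: frames are integers counting up from the reference."""
    return [reference + i for i in range(num_frames)]

frames = extrapolate(0, ["walks forward", "jumps"], toy_generate_clip)
```

With two prompts and four frames per clip, the chained sequence has seven frames, since each later clip shares its first frame with the end of the previous one.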

Highlights & Insights

The "draw + describe" interaction paradigm is highly intuitive, lowering the barrier for creative animation production
The idea of iterative frame alignment is ingenious: reference frame reconstruction is improved by optimizing the noise of the other frames
The frame extrapolation scheme for longer animations is simple yet effective

Limitations & Future Work

There is an inherent trade-off between motion and identity preservation
Relying on the motion priors of T2V models means highly complex movements may lack precision
Iterative alignment requires multiple forward passes per inference, which leaves room for efficiency improvements

Live-Sketch is the most direct baseline, but it is limited by its vector representation
The combined strategy of DDIM inversion and attention control can be generalized to other conditional video generation tasks
The frame extrapolation idea is applicable to long video generation for any short video model

Rating

Novelty: 8/10 — A new paradigm for sketch animation
Technical Depth: 8/10 — The three technical components are elegantly designed
Experimental Thoroughness: 7/10 — Thorough user study, but quantitative metrics are somewhat limited
Writing Quality: 8/10 — Vivid narrative and clear motivation