FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations

Conference: CVPR 2025
arXiv: 2411.10818
Code: None
Area: Image Generation
Keywords: Sketch Animation, Text-guided, Video Diffusion Models, Flipbook Animation, DDIM Inversion

TL;DR

This work proposes FlipSketch, the first system that generates unconstrained raster sketch animations from a single static sketch and a text description, achieving smooth animation through three key innovations: fine-tuning a text-to-video diffusion model, iterative reference frame alignment, and dual-attention composition.

Background & Motivation

Background: Sketch animation is a powerful visual storytelling medium. Traditional animation production requires professional teams to draw keyframes and in-betweens, and existing automated attempts still require users to specify exact motion paths or multiple keyframes.

Limitations of Prior Work: Vector-based sketch animation methods (e.g., Live-Sketch) are limited to stroke-by-stroke translation operations and therefore cannot freely redraw or reinterpret the subject. Image-to-video methods (e.g., SVD, DynamiCrafter) suffer from the sketch-photo domain gap, which makes it difficult to preserve sketch identity.

Key Challenge: The visual integrity of the input sketch must be preserved while still allowing unconstrained motion; existing methods achieve only one of the two.

Goal: To realize a simple workflow of "draw a sketch + describe motion = animation".

Key Insight: Leveraging the motion priors of text-to-video diffusion models, adapting them to the sketch domain via LoRA fine-tuning, and utilizing DDIM inversion to provide reference frame constraints.

Core Idea: Embedding the sketch as a reference frame into a video diffusion model via DDIM inversion, and achieving smooth animation while maintaining sketch identity through iterative frame alignment and dual-attention composition.

Method

Overall Architecture

The ModelScope T2V model is first fine-tuned with LoRA to generate sketch-style videos. At inference, given a static input sketch and a text prompt: (1) the input sketch is converted by DDIM inversion into a reference noise that serves as the first frame; (2) the remaining frames are sampled from a Gaussian distribution; (3) the animation is generated by joint denoising of all frames, guided by iterative reference frame alignment and dual-attention composition.
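Steps (1) and (2) amount to assembling the initial latent video. The following is a minimal sketch under stated assumptions: the function name `build_initial_latents`, the latent shape, and the zero-filled stand-in for the DDIM-inverted sketch are all hypothetical and are not taken from the paper.

```python
import numpy as np

def build_initial_latents(reference_noise, num_frames, seed=0):
    """Stack the inverted reference noise (frame 0) with Gaussian noise frames."""
    rng = np.random.default_rng(seed)
    # Frames 1..N-1 are sampled from a standard Gaussian, as in step (2).
    rest = rng.standard_normal((num_frames - 1, *reference_noise.shape))
    # Frame 0 is the DDIM-inverted sketch, as in step (1).
    return np.concatenate([reference_noise[None], rest], axis=0)

# Stand-in for the output of DDIM inversion on the input sketch (hypothetical shape).
reference_noise = np.zeros((4, 32, 32))
latents = build_initial_latents(reference_noise, num_frames=10)
print(latents.shape)  # (10, 4, 32, 32)
```

Only the first frame carries information about the sketch at this point, which is why the later alignment and attention steps are needed to propagate it.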

Key Designs

  1. LoRA Fine-tuned Text-to-Sketch Animation Baseline:

    • Function: Adapting the T2V model to the sketch animation domain
    • Mechanism: Training a LoRA (rank=4) of ModelScope T2V on synthetic vector sketch animation data generated by Live-Sketch for only 2500 iterations
    • Design Motivation: Transferring the motion prior of video diffusion models to the sketch domain using synthetic data
  2. Iterative Frame Alignment:

    • Function: Ensuring that the first frame accurately reconstructs the input sketch during joint denoising
    • Mechanism: During the early denoising steps \(t > \tau_1\), the reference frame is also denoised independently to obtain features \(\eta_1\). These are compared with the first-frame features \(\eta_1'\) from the joint denoising pass through the alignment loss \(\mathcal{L}_{align} = \|\eta_1' - \eta_1\|_2^2\), and the loss is backpropagated to optimize the noise of the remaining frames \(f_t^{train}\)
    • Design Motivation: Resolving the issue where temporal attention layers cause the reference frame to be corrupted by random noise frames
  3. Dual-Attention Composition:

    • Function: Injecting the identity information of the reference sketch into the generated frames
    • Mechanism: Independent denoising of the reference frame and joint denoising of all frames run in parallel. Query-key pairs \((q_t^r, k_t^r)\) extracted from the reference pass are composed into the spatial and temporal attention of the joint pass. For spatial attention, the reference frame is repeated \(N\) times and cross-attended with the keys of the joint frames (with linear decay); for temporal attention, the first frame's key is replaced directly with the reference key
    • Design Motivation: Transmitting both coarse-grained and fine-grained features of the reference sketch concurrently across spatial and temporal dimensions
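The alignment objective in item 2 can be illustrated with a toy example. This is a sketch under strong assumptions, not the paper's implementation: the "joint pass" is replaced by a hypothetical linear blend (`joint_first_frame`) in which the other frames contaminate frame 0, the independent pass is the identity, and the gradient is written analytically instead of being backpropagated through a diffusion model.

```python
import numpy as np

def joint_first_frame(ref, others, mix=0.5):
    """Toy joint pass: frame 0 is blended with the mean of the other frames."""
    return (1.0 - mix) * ref + mix * others.mean(axis=0)

def alignment_loss(ref, others, mix=0.5):
    """L_align = ||eta1' - eta1||_2^2, with eta1 = ref in this toy setup."""
    return float(np.sum((joint_first_frame(ref, others, mix) - ref) ** 2))

def align_other_frames(ref, others, steps=100, lr=1.0, mix=0.5):
    """Gradient descent on the other frames only; the reference stays fixed."""
    others = others.copy()
    n = others.shape[0]
    for _ in range(steps):
        residual = joint_first_frame(ref, others, mix) - ref
        # Analytic gradient of L_align w.r.t. each other frame: 2 * residual * mix / n.
        others -= lr * 2.0 * residual * (mix / n)
    return others

rng = np.random.default_rng(0)
ref = rng.standard_normal(16)          # stand-in for reference features eta1
others = rng.standard_normal((4, 16))  # stand-in for the remaining frames' noise
before = alignment_loss(ref, others)
after = alignment_loss(ref, align_other_frames(ref, others))
print(after < before)  # True
```

The point of the toy is the direction of the update: the reference is never modified, and only the noise of the other frames is adjusted until the jointly denoised first frame matches the independently denoised one.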

Loss & Training

During the LoRA fine-tuning phase, a standard diffusion loss is used. During inference, a controllable parameter \(\lambda\) regulates the motion-fidelity trade-off by rescaling the reference key: \(k_t^r \leftarrow k_t^r \cdot (1 + \lambda \cdot 2e^{-2})\). A lower \(\lambda\) produces more motion, and a higher \(\lambda\) improves stability. The timestep thresholds are set to \(\tau_1 = 2T/5\) and \(\tau_2 = 3T/5\).
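The inference-time settings can be written out as a small helper. This is a hedged sketch: the function names are hypothetical, `T=25` is an arbitrary example value, and the constant in the key update is read as scientific notation (`2e-2`, i.e. 0.02), which is an assumption about the notation rather than something the summary confirms. The range \(\lambda \in [0, 1]\) follows the ablation, which compares \(\lambda=0\) and \(\lambda=1\).

```python
def thresholds(T):
    """Return (tau_1, tau_2) = (2T/5, 3T/5) as stated in the text."""
    return 2 * T / 5, 3 * T / 5

def scale_reference_key(key, lam, gain=2e-2):
    """k <- k * (1 + lam * gain); gain=2e-2 is an assumed reading of the constant."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is expected in [0, 1]")
    factor = 1.0 + lam * gain
    return [k * factor for k in key]

tau_1, tau_2 = thresholds(T=25)  # T=25 is a hypothetical number of denoising steps
print(tau_1, tau_2)  # 10.0 15.0
print(scale_reference_key([1.0, -2.0], lam=0.0))  # [1.0, -2.0]
```

With \(\lambda = 0\) the reference key is left unchanged, and increasing \(\lambda\) strengthens the reference key's contribution to attention, which matches the stated motion-fidelity trade-off.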

Key Experimental Results

Main Results

| Method | S2V Consistency ↑ | T2V Alignment ↑ |
|---|---|---|
| Live-Sketch | 0.965 | 0.142 |
| DynamiCrafter | 0.780 | 0.127 |
| FlipSketch | 0.956 | 0.172 |

Ablation Study

  • Without frame alignment: S2V consistency drops from 0.956 to 0.952, with a noticeable degradation in visual quality
  • Without dual-attention composition: S2V consistency drops to 0.876, and identity preservation is severely compromised
  • \(\lambda=0\) (maximum motion) vs \(\lambda=1\) (maximum fidelity) demonstrates smooth trade-off control

Key Findings

  • Raster-based frame animation is more flexible than vector animation, enabling 3D perspective transformations
  • A frame extrapolation strategy can generate longer and more complex animation sequences
  • In user studies, the proposed method significantly outperforms Live-Sketch in both mean opinion score (MOS) and text alignment
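The summary does not say how frame extrapolation works. One plausible reading, assumed here purely for illustration, is that the last frame of each generated clip becomes the reference sketch for the next clip. The `animate` callable below is a hypothetical stand-in for the full sketch-plus-prompt pipeline, and the sketch further assumes each clip starts with its reference frame.

```python
def extrapolate(sketch, prompts, animate):
    """Chain clips: the last frame of each clip is the reference for the next one."""
    frames = []
    reference = sketch
    for prompt in prompts:
        clip = animate(reference, prompt)
        # Later clips start with a copy of the previous clip's last frame, so skip it.
        frames.extend(clip if not frames else clip[1:])
        reference = clip[-1]
    return frames

def dummy_animate(reference, prompt):
    """Stand-in animator: integers play the role of frames."""
    return [reference, reference + 1, reference + 2]

print(extrapolate(0, ["jump", "land"], dummy_animate))  # [0, 1, 2, 3, 4]
```

Under this reading, each segment can also take a different prompt, which would account for the "more complex" sequences mentioned above.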

Highlights & Insights

  • The "draw + describe" interaction paradigm is highly intuitive, lowering the barrier for creative animation production
  • The idea of iterative frame alignment is ingenious: reconstruction of the reference frame is improved by optimizing the noise in the other frames
  • The frame extrapolation scheme for longer animations is simple yet effective

Limitations & Future Work

  • There is an inherent trade-off between motion and identity preservation
  • Relying on the motion priors of T2V models means highly complex movements may lack precision
  • The need for multiple forward passes during each inference (due to iterative alignment) leaves room for efficiency improvement

Related Work & Insights

  • Live-Sketch is the most direct baseline, but it is limited by its vector representation
  • The combined strategy of DDIM inversion and attention control can be generalized to other conditional video generation tasks
  • The frame extrapolation idea is applicable to long video generation for any short video model

Rating

  • Novelty: 8/10 — A new paradigm for sketch animation
  • Technical Depth: 8/10 — The three technical components are elegantly designed
  • Experimental Thoroughness: 7/10 — Thorough user study, but quantitative metrics are somewhat limited
  • Writing Quality: 8/10 — Vivid narrative and clear motivation