AniDoc: Animation Creation Made Easier¶
Background & Motivation¶
Animation production is highly labor-intensive, and colorization is one of the most time-consuming stages of the traditional workflow. A standard animated film consists of tens of thousands of frames, each of which must be colored manually by professional artists. This is costly and depends heavily on skilled labor.
In recent years, deep learning-based video colorization methods have made progress, but existing approaches face the following key challenges:
- Temporal Consistency: Frame-by-frame colorization is prone to color flickering, and color assignment between adjacent frames can be inconsistent.
- Sparse Reference: In actual production, animators usually only provide color references for keyframes, requiring the intermediate frames to be automatically inferred.
- Sketch Quality Variance: Hand-drawn sketches and digital line art differ significantly in clarity, line thickness, closure, and other aspects.
- Background Interference: Complex backgrounds in training data interfere with the model learning to colorize the foreground characters.
This paper proposes AniDoc, a video sketch colorization system based on diffusion models, which addresses the aforementioned challenges through explicit correspondence guidance and data augmentation strategies.
Method¶
Overall Architecture¶
AniDoc is built upon a video diffusion model. Given a set of color reference frames and a grayscale sketch video, it generates a complete colorized animation video. The system has three key components:
- Correspondence Guidance
- Data Augmentation Strategy
- Sparse Sketch Training
Correspondence Guidance¶
To resolve the temporal consistency issue, AniDoc builds explicit inter-frame correspondences:
| Stage | Technology | Effect |
|---|---|---|
| Feature Extraction | SIFT + LightGlue | Establish sparse feature matching between reference and target frames |
| Dense Tracking | Co-Tracker | Expand sparse matches into dense optical flow fields |
| Color Propagation | Warping + Attention | Propagate color from reference frames to target frames based on correspondence |
Specific workflow:
1. Extract SIFT feature points from reference frames and each target frame.
2. Use LightGlue for feature matching to obtain reliable sparse corresponding point pairs.
3. Use sparse corresponding points as initialization and input them into Co-Tracker to obtain dense pixel-level tracking results.
4. The correspondence information is injected into the cross-attention layer of the diffusion model in the form of feature maps.
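The matching and propagation idea can be sketched in a few lines. This is a minimal, library-free illustration and not AniDoc's implementation: mutual nearest-neighbor matching on toy descriptors stands in for SIFT + LightGlue, Co-Tracker's dense tracking is omitted, and the names `mutual_nn_matches`, `propagate_colors`, and the `max_dist` cutoff are hypothetical.

```python
from math import dist


def mutual_nn_matches(desc_ref, desc_tgt, max_dist=0.5):
    """Return (i, j) index pairs where reference descriptor i and target
    descriptor j are each other's nearest neighbor and closer than max_dist.

    Stand-in for a learned matcher; descriptors are plain coordinate tuples.
    """
    if not desc_ref or not desc_tgt:
        return []
    # Nearest target index for every reference descriptor.
    nn_ref = [min(range(len(desc_tgt)), key=lambda j: dist(d, desc_tgt[j]))
              for d in desc_ref]
    # Nearest reference index for every target descriptor.
    nn_tgt = [min(range(len(desc_ref)), key=lambda i: dist(d, desc_ref[i]))
              for d in desc_tgt]
    return [(i, j) for i, j in enumerate(nn_ref)
            if nn_tgt[j] == i and dist(desc_ref[i], desc_tgt[j]) <= max_dist]


def propagate_colors(matches, tgt_points, ref_colors):
    """Copy the reference color of each matched keypoint to its target location.

    ref_colors[i] is the color sampled at reference keypoint i;
    tgt_points[j] is the (x, y) location of target keypoint j.
    """
    return {tgt_points[j]: ref_colors[i] for i, j in matches}


# Toy usage: two reference keypoints, three target keypoints (one unmatched).
desc_ref = [(0.0, 0.0), (1.0, 1.0)]
desc_tgt = [(1.0, 0.9), (0.1, 0.0), (5.0, 5.0)]
matches = mutual_nn_matches(desc_ref, desc_tgt)
colors = propagate_colors(matches, [(10, 10), (2, 1), (50, 50)], ["red", "blue"])
```

The mutual check discards one-sided matches such as the third target keypoint, which is the kind of filtering that keeps the sparse correspondences reliable before any dense tracking step.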
Data Augmentation Strategy¶
Binarization Augmentation¶
During training, random binarization is applied to the input sketch to simulate sketches of various styles and qualities. The binarization is governed by an adaptive threshold \(\tau\) and a random perturbation \(\epsilon \sim \mathcal{U}(-\delta, \delta)\).
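One plausible reading of this augmentation is shown below. It is a sketch under stated assumptions, since the text does not fix the exact formula: the sketch is thresholded at \(\tau + \epsilon\), the adaptive threshold \(\tau\) is taken to be the mean intensity, and pixel values lie in [0, 1] with dark lines on a light background. The function names are hypothetical.

```python
import random


def adaptive_threshold(sketch):
    """Assumed adaptive threshold: the mean intensity of the sketch."""
    pixels = [p for row in sketch for p in row]
    return sum(pixels) / len(pixels)


def random_binarize(sketch, delta=0.1, rng=None):
    """Binarize a grayscale sketch at tau + eps, with eps ~ U(-delta, delta).

    sketch: rows of floats in [0, 1]; returns rows of 0.0 (line) / 1.0 (paper).
    """
    rng = rng or random.Random()
    tau = adaptive_threshold(sketch)
    eps = rng.uniform(-delta, delta)  # random perturbation of the threshold
    threshold = tau + eps
    return [[1.0 if p > threshold else 0.0 for p in row] for row in sketch]


# Toy usage: a 2x2 sketch whose values sit far from the perturbed threshold.
sketch = [[0.0, 1.0], [0.9, 0.1]]
binary = random_binarize(sketch, delta=0.1, rng=random.Random(0))
```

A larger `delta` moves the threshold further from \(\tau\), which thins or thickens faint strokes and so varies the apparent line quality seen during training.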
Background Augmentation¶
To reduce background interference on foreground colorization, the background is randomly replaced during training:
- 50% probability of using a pure white background
- 30% probability of using a random solid color background
- 20% probability of retaining the original background
This forces the model to focus on the structure and color of the foreground characters instead of relying on background cues.
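The 50/30/20 sampling can be sketched as follows. This is an illustrative sketch only: it assumes a foreground mask is available for each frame, represents frames as nested lists of RGB tuples, and uses hypothetical names (`pick_background_mode`, `augment_background`).

```python
import random

WHITE = (255, 255, 255)


def pick_background_mode(u):
    """Map a uniform sample u in [0, 1) to a mode with 50/30/20 odds."""
    if u < 0.5:
        return "white"
    if u < 0.8:
        return "solid"
    return "original"


def augment_background(frame, fg_mask, rng):
    """Return (new_frame, mode), replacing background pixels per the sampled mode.

    frame:   rows of (r, g, b) tuples
    fg_mask: rows of booleans, True where the pixel belongs to the character
    """
    mode = pick_background_mode(rng.random())
    if mode == "original":
        return [list(row) for row in frame], mode
    if mode == "white":
        fill = WHITE
    else:
        fill = tuple(rng.randrange(256) for _ in range(3))  # random solid color
    out = [[px if fg else fill for px, fg in zip(row, mask_row)]
           for row, mask_row in zip(frame, fg_mask)]
    return out, mode


# Toy usage: one foreground pixel and one background pixel.
frame = [[(1, 2, 3), (4, 5, 6)]]
mask = [[True, False]]
augmented, mode = augment_background(frame, mask, random.Random(0))
```

Foreground pixels are never touched, so the only signal that changes across augmented copies is the background.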
Sparse Sketch Training¶
In practical applications, animators typically only paint color versions of the first and last keyframes. AniDoc proposes a sparse training strategy:
- During training, only the first and last frames of the video sequence are provided as color references.
- Intermediate frames are input in sketch form.
- The model must automatically infer the colorization scheme of the intermediate frames.
This training scheme teaches the model to interpolate colors plausibly from limited reference information.
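The sample construction described above can be sketched as follows. It is a simplified illustration under assumptions: frames are opaque objects, missing references are marked with `None`, and the function name and dictionary keys are hypothetical rather than taken from AniDoc's code.

```python
def build_sparse_sample(color_frames, sketch_frames):
    """Build one training sample with color references only at both ends.

    color_frames:  ground-truth colored frames of a clip
    sketch_frames: the corresponding sketches, same length
    """
    n = len(color_frames)
    if n != len(sketch_frames) or n < 2:
        raise ValueError("need at least two frames and matching lengths")
    last = n - 1
    # Only the first and last frames keep their color as references.
    references = [frame if i in (0, last) else None
                  for i, frame in enumerate(color_frames)]
    return {
        "references": references,        # sparse color conditioning
        "sketches": list(sketch_frames), # every frame is given as a sketch
        "targets": list(color_frames),   # supervision for all frames
    }


# Toy usage with string placeholders for frames.
sample = build_sparse_sample(["c0", "c1", "c2", "c3"], ["s0", "s1", "s2", "s3"])
```

Because the targets still cover every frame, the loss on the intermediate frames is what pushes the model to infer their colors from the two end references.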
Sakuga-42M Dataset¶
This paper collects and organizes the Sakuga-42M dataset, which contains:
- Source: Public animation databases and video platforms
- Scale: Approximately 42 million frames of animation data
- Processing: Automatic sketch extraction, keyframe labeling, and filtering of low-quality samples
Experimental Results¶
Quantitative Comparison¶
| Method | FID↓ | FVD↓ | LPIPS↓ | Temporal Consistency↑ |
|---|---|---|---|---|
| Reference-based (baseline) | 78.42 | 312.5 | 0.183 | 0.891 |
| w/o correspondence matching | 75.91 | 298.7 | 0.171 | 0.907 |
| AniDoc (full) | 54.33 | 215.8 | 0.124 | 0.952 |
| AniDoc (sparse, 2-ref) | 58.17 | 234.2 | 0.138 | 0.941 |
Ablation Study¶
| Component | FID↓ | Description |
|---|---|---|
| Full Model | 54.33 | Full AniDoc |
| w/o Correspondence | 75.91 | Without correspondence guidance, chaotic color assignment |
| w/o Binarization Aug | 61.27 | Degraded generalization to hand-drawn sketches |
| w/o Background Aug | 59.84 | Degraded colorization quality in background regions |
| w/o Sparse Training | 63.15 | Only supports dense references, reducing practicality |
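To make the ablation easier to read, the FID gap of each variant against the full model can be computed directly from the table. The short script below uses only the numbers reported above; the helper names are hypothetical.

```python
FULL_FID = 54.33
ABLATIONS = {
    "w/o Correspondence": 75.91,
    "w/o Binarization Aug": 61.27,
    "w/o Background Aug": 59.84,
    "w/o Sparse Training": 63.15,
}


def fid_deltas(full, ablations):
    """FID increase of each ablated variant over the full model."""
    return {name: round(fid - full, 2) for name, fid in ablations.items()}


def relative_reduction(before, after):
    """Fractional FID reduction when going from `before` to `after`."""
    return (before - after) / before


deltas = fid_deltas(FULL_FID, ABLATIONS)
# Components ordered from largest to smallest FID impact.
ranked = sorted(deltas, key=deltas.get, reverse=True)
corr_gain = relative_reduction(ABLATIONS["w/o Correspondence"], FULL_FID)
```

Removing correspondence guidance costs about 21.6 FID points, a relative reduction of roughly 28% when it is restored, which is more than twice the gap of any other component in the table.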
Training Details¶
- Hardware: 16× NVIDIA A100 GPUs
- Training Time: 5 days
- Video Resolution: 512×512, 16 frames
- Optimizer: AdamW, lr=1e-5
- Batch Size: 2 video clips per GPU
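These settings imply a few derived quantities. The sketch below collects them in a hypothetical `TrainConfig` dataclass; it assumes plain data parallelism with no gradient accumulation (the `grad_accum_steps` field is an assumption, not something the text states).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    num_gpus: int = 16
    clips_per_gpu: int = 2
    frames_per_clip: int = 16
    resolution: int = 512
    learning_rate: float = 1e-5
    training_days: int = 5
    grad_accum_steps: int = 1  # assumption: no gradient accumulation

    @property
    def effective_batch_clips(self):
        """Video clips contributing to one optimizer step."""
        return self.num_gpus * self.clips_per_gpu * self.grad_accum_steps

    @property
    def frames_per_step(self):
        """Individual frames processed per optimizer step."""
        return self.effective_batch_clips * self.frames_per_clip

    @property
    def gpu_days(self):
        """Total compute budget in GPU-days."""
        return self.num_gpus * self.training_days


cfg = TrainConfig()
```

Under these assumptions each optimizer step sees 32 clips (512 frames at 512×512), and the full run amounts to 80 A100 GPU-days.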
Summary & Outlook¶
By combining explicit correspondence guidance, targeted data augmentation, and sparse reference training, AniDoc improves both the quality and the practicality of animation sketch colorization. The drop in FID from 75.91 (without correspondence guidance) to 54.33 (full model) shows that correspondence guidance is the most important component. The system is designed to fit into existing animation production pipelines and can substantially reduce manual labor costs in the colorization stage.