AniDoc: Animation Creation Made Easier

Background & Motivation

Animation production is highly labor-intensive. In traditional workflows, colorization is one of the most time-consuming stages: an animated film consists of tens of thousands of frames, and professional artists must color each one by hand. This makes the stage both costly and heavily dependent on skilled labor.

In recent years, deep learning-based video colorization methods have made progress, but existing approaches face the following key challenges:

Temporal Consistency: Frame-by-frame colorization is prone to color flickering, and color assignment between adjacent frames can be inconsistent.

Sparse Reference: In actual production, animators usually only provide color references for keyframes, requiring the intermediate frames to be automatically inferred.

Sketch Quality Variance: Hand-drawn sketches and digital line art differ significantly in clarity, line thickness, closure, and other aspects.

Background Interference: Complex backgrounds in training data interfere with the model learning to colorize the foreground characters.

This paper proposes AniDoc, a video sketch colorization system based on diffusion models, which addresses the aforementioned challenges through explicit correspondence guidance and data augmentation strategies.

Method

Overall Architecture

AniDoc is built upon a video diffusion model. Given a set of color reference frames and a sketch video, it generates a fully colorized animation video. The system has three key components:

Correspondence Guidance
Data Augmentation
Sparse Sketch Training

Correspondence Guidance

To resolve the temporal consistency issue, AniDoc builds explicit inter-frame correspondences:

Stage	Technology	Effect
Feature Extraction	SIFT + LightGlue	Establish sparse feature matching between reference and target frames
Dense Tracking	Co-Tracker	Expand sparse matches into dense optical flow fields
Color Propagation	Warping + Attention	Propagate color from reference frames to target frames based on correspondence

Specific workflow:

1. Extract SIFT feature points from reference frames and each target frame.
2. Use LightGlue for feature matching to obtain reliable sparse corresponding point pairs.
3. Use sparse corresponding points as initialization and input them into Co-Tracker to obtain dense pixel-level tracking results.
4. The correspondence information is injected into the cross-attention layer of the diffusion model in the form of feature maps.
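The propagation idea behind this workflow can be illustrated with a small, dependency-free sketch. It assumes the correspondences have already been computed (by the matching and tracking steps above) and are passed in as integer pixel pairs. The function name `propagate_colors` and the "hint map plus validity mask" output are hypothetical choices made for illustration. The source describes injection through cross-attention feature maps, which this simplified stand-in does not model.

```python
def propagate_colors(reference, matches, height, width):
    """Copy reference colors to matched target locations.

    reference : 2D list of (r, g, b) tuples (the colored reference frame)
    matches   : list of ((x_ref, y_ref), (x_tgt, y_tgt)) integer pixel pairs
    Returns (hint, valid): a color hint map for the target frame and a
    boolean mask marking which target pixels received a color.
    """
    hint = [[(0, 0, 0)] * width for _ in range(height)]
    valid = [[False] * width for _ in range(height)]
    for (xr, yr), (xt, yt) in matches:
        in_ref = 0 <= yr < len(reference) and 0 <= xr < len(reference[0])
        in_tgt = 0 <= yt < height and 0 <= xt < width
        if in_ref and in_tgt:  # silently skip out-of-bounds correspondences
            hint[yt][xt] = reference[yr][xr]
            valid[yt][xt] = True
    return hint, valid


# Toy example: a 2x2 reference frame and two correspondences.
reference = [[(255, 0, 0), (0, 255, 0)],
             [(0, 0, 255), (255, 255, 0)]]
matches = [((0, 0), (1, 1)), ((1, 0), (2, 0))]
hint, valid = propagate_colors(reference, matches, height=2, width=3)
```

With dense tracking, `matches` would cover most foreground pixels; with sparse matches only, the hint map stays mostly empty and the model has to fill in the rest.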

Data Augmentation Strategy

Binarization Augmentation

During training, random binarization is applied to the input sketch to simulate sketches of various styles and qualities:

\[I_{bin} = \begin{cases} 1, & I_{gray} > \tau + \epsilon \\ 0, & \text{otherwise} \end{cases}\]

where $\tau$ is an adaptive threshold and $\epsilon \sim \mathcal{U}(-\delta, \delta)$ is a random perturbation.
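A minimal sketch of this thresholding rule follows. The source does not say how the adaptive threshold is computed, so the mean intensity is used here purely as an assumption. The function name `binarize_sketch`, the default `delta`, and the plain-list image representation are likewise illustrative.

```python
import random


def binarize_sketch(gray, delta=0.1, rng=None):
    """Binarize a grayscale sketch with a randomly perturbed threshold.

    gray  : 2D list of floats in [0, 1]
    delta : half-width of the perturbation, epsilon ~ U(-delta, delta)
    rng   : optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    pixels = [p for row in gray for p in row]
    # Assumption: the adaptive threshold tau is the mean intensity.
    tau = sum(pixels) / len(pixels)
    eps = rng.uniform(-delta, delta)
    # I_bin = 1 where I_gray > tau + eps, else 0.
    return [[1 if p > tau + eps else 0 for p in row] for row in gray]


gray = [[0.0, 1.0],
        [0.2, 0.9]]
binary = binarize_sketch(gray, delta=0.1, rng=random.Random(0))
```

Because a fresh perturbation is drawn on every call, the same sketch yields slightly thinner or thicker line masks across training iterations.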

Background Augmentation

To reduce background interference on foreground colorization, the background is randomly replaced during training:

- 50% probability of using a pure white background
- 30% probability of using a random solid color background
- 20% probability of retaining the original background

This forces the model to focus on the structure and color of the foreground characters instead of relying on background cues.
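The sampling scheme can be sketched as below. The foreground mask is an assumption: the source does not state how the character is separated from the background. The function names and the tuple-based pixel format are hypothetical.

```python
import random


def sample_background_mode(rng):
    """Pick a background treatment with the 50/30/20 split."""
    u = rng.random()
    if u < 0.5:
        return "white"
    if u < 0.8:
        return "solid"
    return "original"


def augment_background(image, fg_mask, rng=None):
    """Replace background pixels according to a randomly sampled mode.

    image   : 2D list of (r, g, b) tuples
    fg_mask : 2D list of booleans, True where the pixel belongs to the character
    Returns (augmented_image, mode).
    """
    rng = rng or random.Random()
    mode = sample_background_mode(rng)
    if mode == "original":
        return [row[:] for row in image], mode
    if mode == "white":
        fill = (255, 255, 255)
    else:
        fill = tuple(rng.randrange(256) for _ in range(3))
    out = [
        [px if is_fg else fill for px, is_fg in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, fg_mask)
    ]
    return out, mode
```

Foreground pixels are never touched, so the only signal that varies across augmented copies is the background.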

Sparse Sketch Training

In practical applications, animators typically only paint color versions of the first and last keyframes. AniDoc proposes a sparse training strategy:

During training, only the first and last frames of the video sequence are provided as color references.
Intermediate frames are input in sketch form.
The model must automatically infer the colorization scheme of the intermediate frames.

This training strategy teaches the model to interpolate colors plausibly from limited reference information.
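How the per-frame inputs of one training clip could be assembled under this strategy is sketched below. The function name `build_sparse_inputs` and the dictionary layout are hypothetical. Frames are treated as opaque objects, so any image representation would work.

```python
def build_sparse_inputs(color_frames, sketch_frames):
    """Return per-frame conditioning where only the endpoints keep color.

    color_frames, sketch_frames : equal-length sequences of frames
    """
    if len(color_frames) != len(sketch_frames):
        raise ValueError("color and sketch sequences must have equal length")
    if len(color_frames) < 2:
        raise ValueError("need at least two frames (first and last reference)")
    last = len(color_frames) - 1
    inputs = []
    for i, (color, sketch) in enumerate(zip(color_frames, sketch_frames)):
        if i in (0, last):
            # First and last frames are given to the model in color.
            inputs.append({"index": i, "kind": "color_reference", "frame": color})
        else:
            # Intermediate frames are given as sketches only.
            inputs.append({"index": i, "kind": "sketch", "frame": sketch})
    return inputs


# A 16-frame clip, using strings as stand-ins for image data.
colors = [f"color_{i}" for i in range(16)]
sketches = [f"sketch_{i}" for i in range(16)]
clip_inputs = build_sparse_inputs(colors, sketches)
```

In this setup the colored versions of the intermediate frames would serve only as training targets, never as inputs.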

Sakuga-42M Dataset

AniDoc is trained on the Sakuga-42M dataset:

- Source: Public animation databases and video platforms
- Scale: Approximately 42 million frames of animation data
- Processing: Automatic sketch extraction, keyframe labeling, and filtering of low-quality samples

Experimental Results

Quantitative Comparison

Method	FID↓	FVD↓	LPIPS↓	Temporal Consistency↑
Reference-based (baseline)	78.42	312.5	0.183	0.891
w/o correspondence matching	75.91	298.7	0.171	0.907
AniDoc (full)	54.33	215.8	0.124	0.952
AniDoc (sparse, 2-ref)	58.17	234.2	0.138	0.941
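To put the table in relative terms, the short script below computes the improvement of the full model over the reference-based baseline for each metric. The numbers are copied from the table. The abbreviation `TC` for temporal consistency and the helper name `relative_gain` are introduced here for illustration.

```python
baseline = {"FID": 78.42, "FVD": 312.5, "LPIPS": 0.183, "TC": 0.891}
full = {"FID": 54.33, "FVD": 215.8, "LPIPS": 0.124, "TC": 0.952}
LOWER_IS_BETTER = {"FID", "FVD", "LPIPS"}


def relative_gain(metric):
    """Relative improvement of the full model over the baseline, in percent."""
    b, f = baseline[metric], full[metric]
    change = (b - f) if metric in LOWER_IS_BETTER else (f - b)
    return 100.0 * change / b


for name in baseline:
    print(f"{name}: {relative_gain(name):.1f}%")
```

The three distance-style metrics each improve by roughly 31 to 32 percent, while temporal consistency improves by about 7 percent.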

Ablation Study

Component	FID↓	Description
Full Model	54.33	Full AniDoc
w/o Correspondence	75.91	Without correspondence guidance, chaotic color assignment
w/o Binarization Aug	61.27	Degraded generalization to hand-drawn sketches
w/o Background Aug	59.84	Degraded colorization quality in background regions
w/o Sparse Training	63.15	Only supports dense references, reducing practicality
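The contribution of each component can be ranked by how much FID rises when it is removed. The snippet below does this arithmetic on the values from the table. The variable names are illustrative.

```python
FULL_FID = 54.33
ablations = {
    "w/o Correspondence": 75.91,
    "w/o Binarization Aug": 61.27,
    "w/o Background Aug": 59.84,
    "w/o Sparse Training": 63.15,
}

# FID increase caused by removing each component, largest first.
ranked = sorted(
    ((name, fid - FULL_FID) for name, fid in ablations.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, delta in ranked:
    print(f"{name}: +{delta:.2f}")
```

Removing correspondence guidance costs about 21.6 FID points, more than twice the penalty of any other ablation.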

Training Details

Hardware: 16× NVIDIA A100 GPUs
Training Time: 5 days
Video Resolution: 512×512, 16 frames
Optimizer: AdamW, lr=1e-5
Batch Size: 2 video clips per GPU
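These settings imply a global batch size and compute budget that can be worked out directly. The calculation below assumes plain data parallelism with no gradient accumulation, which the source does not state.

```python
NUM_GPUS = 16
CLIPS_PER_GPU = 2
FRAMES_PER_CLIP = 16
TRAINING_DAYS = 5

# Assumption: plain data parallelism, no gradient accumulation.
clips_per_step = NUM_GPUS * CLIPS_PER_GPU              # global batch, in clips
frames_per_step = clips_per_step * FRAMES_PER_CLIP     # frames seen per optimizer step
gpu_days = NUM_GPUS * TRAINING_DAYS                    # total compute budget

print(clips_per_step, frames_per_step, gpu_days)  # 32 512 80
```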

Summary & Outlook

By combining explicit correspondence guidance, targeted data augmentation, and sparse reference training, AniDoc improves both the quality and the practicality of animation sketch colorization. Removing correspondence guidance raises FID from 54.33 to 75.91, the largest degradation among the ablations, which underlines its key role. The system is intended to fit into existing animation production pipelines and to reduce manual labor in the colorization stage.