# MikuDance: Animating Character Art with Mixed Motion Dynamics
Conference: ICCV 2025 · arXiv: 2411.08656 · Code: https://kebii.github.io/MikuDance · Area: Video Understanding · Keywords: Character Animation, Diffusion Model, Camera Control, Mixed Motion, Image-to-Video
## TL;DR
This paper proposes MikuDance, a diffusion-based character art animation system that achieves high-dynamic animation of complex character artwork through two core contributions: Mixed Motion Modeling, which unifies character motion and 3D camera motion into a pixel-space representation, and Mixed-Control Diffusion, which implicitly aligns character shape/scale with motion guidance within the Reference UNet.
## Background & Motivation
Animating static character artwork has substantial demand in film, gaming, and digital design. Traditional tools (MMD, Live2D) require professional expertise, while existing image-to-video methods (Animate Anyone, DISCO) primarily target real humans and cannot be directly applied to character artwork. Two major challenges arise:
- High-dynamic motion guidance: Animating character artwork requires simultaneously handling complex foreground character motion and large-scale camera motion in the background. Existing methods support only static backgrounds with human body motion and cannot model full-scene dynamics.
- Reference–guidance misalignment: Anime characters exhibit unique head-to-body ratios, exaggerated poses, and diverse artistic styles, creating severe scale and body-shape discrepancies with motion guidance (typically derived from real-person videos). Explicit alignment is infeasible given the variability of character forms, necessitating implicit alignment.
## Method
### Overall Architecture
MikuDance is built on SD-1.5 and consists of:

- Mixed Motion Modeling: extracts character poses (Xpose) and camera poses (DROID-SLAM) from a driving video, then converts the camera motion into a pixel-level scene motion representation via Scene Motion Tracking.
- Mixed-Control Diffusion: concatenates the reference image, reference pose, and all driving poses (VAE-encoded) along the channel dimension as input to the Reference UNet, and injects scene motion via Motion-Adaptive Normalization.
- Two-stage mixed-source training: first trains the Reference UNet on stylized frame pairs, then trains MAN and the temporal modules jointly.
### Key Designs
- Scene Motion Tracking (SMT):
    - Constructs a scene point cloud \(\phi^l \in \mathbb{R}^{N \times 3}\) from the depth map of the reference image.
    - Projects the point cloud from frame \(l\) to frame \(l+1\) using the camera transformation matrices \(\mathcal{T}^l\) (frame \(l\)) and \(\mathcal{Y}^{l+1}\) (frame \(l+1\)).
    - Computes pixel correspondences between the two projected images to obtain the scene motion \(\mathbf{m}^s \in \mathbb{R}^{N \times 2}\).
    - Formulation: \((z^l - z^{l+1})[\mathbf{m}^s; \mathbf{1}] = \mathcal{K}^l[\phi^l; \mathbf{1}] - \mathcal{K}^{l+1}\mathcal{Y}^{l+1}\mathcal{T}^l[\phi^l; \mathbf{1}]\)

    Key distinctions from optical flow: (1) SMT is independent of the driving video's content, whereas optical flow is content-dependent; (2) SMT tracks 3D point clouds, while optical flow tracks 2D pixels. SMT therefore provides disentangled camera dynamics.

    Design Motivation: converts 3D camera motion into a 2D pixel representation that lives in the same domain as character poses, enabling unified motion guidance (see the first sketch after this list).
- Mixed-Control Diffusion:
    - No separate encoders: unlike Animate Anyone and similar methods, no independent pose encoder or ControlNet is used.
    - The reference image, reference pose, and all driving poses are VAE-encoded and concatenated along the channel dimension as input to the Reference UNet.
    - The input convolutional layer of the Reference UNet is widened accordingly; the newly added parameters are zero-initialized (zero convolution).
    - The reference image is also encoded with a CLIP image encoder and injected into both UNets via cross-attention.

    Design Motivation: implicit alignment through mixed inputs outperforms the combination of explicit alignment and separate encoders; the ablations confirm that this simpler architecture achieves the best results (see the second sketch after this list).
- Motion-Adaptive Normalization (MAN):
    - Inspired by SPADE: applies instance normalization to the features at each downsampling block of the Reference UNet, then uses the scene motion \(\mathbf{m}^s\) to generate spatially adaptive scale \(\gamma^i\) and shift \(\beta^i\) parameters.
    - \(f^{i'} = \gamma^i_{C,H,W}(\mathbf{m}^s) \frac{f^i_{C,H,W} - \mu^i_C}{\sigma^i_C} + \beta^i_{C,H,W}(\mathbf{m}^s)\)
    - \(\gamma^i\) and \(\beta^i\) carry spatial dimensions, enabling pixel-level scene motion guidance.

    Design Motivation: scene motion exerts a global influence on the animated frames; adaptive normalization injects this global motion effectively while preserving local consistency (see the third sketch after this list).
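To make the SMT geometry concrete, below is a minimal NumPy sketch of the projection logic, under my own reading of the formulation: I take \(\mathcal{T}^l\) as the camera-to-world transform at frame \(l\) and \(\mathcal{Y}^{l+1}\) as the world-to-camera transform at frame \(l+1\). All function and variable names are mine, not the authors'.

```python
import numpy as np

def scene_motion_tracking(depth, K_l, K_l1, T_l, Y_l1):
    """Sketch of SMT: per-pixel scene motion from camera poses alone.

    depth : (H, W) depth map of the reference image
    K_l, K_l1 : (3, 3) intrinsics at frames l and l+1
    T_l  : (4, 4) camera-to-world transform at frame l (assumed convention)
    Y_l1 : (4, 4) world-to-camera transform at frame l+1 (assumed convention)
    Returns an (H, W, 2) field of pixel displacements m^s.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, N)

    # Back-project reference pixels into the camera-space point cloud phi^l.
    phi = np.linalg.inv(K_l) @ (pix * depth.reshape(1, -1))            # (3, N)
    phi_h = np.vstack([phi, np.ones((1, phi.shape[1]))])               # (4, N)

    # Map points camera_l -> world -> camera_{l+1}, then project with K^{l+1}.
    cam1 = (Y_l1 @ T_l @ phi_h)[:3]
    proj1 = K_l1 @ cam1
    proj1 = proj1[:2] / proj1[2:3]                                     # (2, N)

    # Scene motion m^s: displacement of every reference pixel.
    m_s = proj1.T - pix[:2].T                                          # (N, 2)
    return m_s.reshape(H, W, 2)
```

Note that nothing here reads the driving video's pixels: the motion field depends only on the reference depth and the camera trajectory, which is exactly the disentanglement from content that the paper contrasts with optical flow.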
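The mixed-control input path can be sketched in PyTorch as follows: channel-concatenate the VAE latents and widen the UNet's input convolution with zero-initialized weights for the new channels, so the network reproduces its pretrained behavior at initialization. Channel counts and names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

def expand_conv_in(conv: nn.Conv2d, extra_ch: int) -> nn.Conv2d:
    """Widen a pretrained input conv; weights for the new input channels
    start at zero (the 'zero convolution' initialization described above)."""
    new = nn.Conv2d(conv.in_channels + extra_ch, conv.out_channels,
                    conv.kernel_size, conv.stride, conv.padding)
    with torch.no_grad():
        new.weight.zero_()
        new.weight[:, :conv.in_channels] = conv.weight  # keep pretrained weights
        new.bias.copy_(conv.bias)
    return new

# Hypothetical latents: 4 VAE channels each for the reference image, the
# reference pose, and (for brevity) a single driving pose, at 768/8 = 96.
ref_lat = torch.randn(1, 4, 96, 96)
ref_pose_lat = torch.randn(1, 4, 96, 96)
drv_pose_lat = torch.randn(1, 4, 96, 96)

conv_in = nn.Conv2d(4, 320, kernel_size=3, padding=1)  # stand-in for SD-1.5 conv_in
conv_in = expand_conv_in(conv_in, extra_ch=8)

x = torch.cat([ref_lat, ref_pose_lat, drv_pose_lat], dim=1)  # (1, 12, 96, 96)
feat = conv_in(x)  # equals the pretrained output on the original 4 channels
```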
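Finally, a SPADE-style sketch of MAN matching the formula above; the hidden width and the two-conv structure of the \(\gamma/\beta\) predictor are my assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionAdaptiveNorm(nn.Module):
    """Instance-normalize UNet features, then modulate them with
    spatially varying scale/shift maps predicted from scene motion m^s."""
    def __init__(self, feat_ch: int, motion_ch: int = 2, hidden: int = 128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_ch, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(motion_ch, hidden, 3, padding=1), nn.ReLU())
        self.to_gamma = nn.Conv2d(hidden, feat_ch, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_ch, 3, padding=1)

    def forward(self, f, m_s):
        # Resize the motion map (B, 2, H0, W0) to the feature resolution.
        m = F.interpolate(m_s, size=f.shape[-2:], mode="bilinear",
                          align_corners=False)
        h = self.shared(m)
        gamma, beta = self.to_gamma(h), self.to_beta(h)  # (B, C, H, W) each
        return gamma * self.norm(f) + beta               # pixel-wise modulation

man = MotionAdaptiveNorm(feat_ch=320)
f = torch.randn(1, 320, 48, 48)   # a Reference UNet feature map
m_s = torch.randn(1, 2, 96, 96)   # scene motion from SMT
out = man(f, m_s)                 # same shape as f
```

Because \(\gamma\) and \(\beta\) are full-resolution maps rather than per-channel scalars, every spatial location receives its own motion-dependent modulation, which is what lets a single normalization layer carry pixel-level camera guidance.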
### Loss & Training
Two-stage mixed-source training:
- Stage 1: paired-frame training without MAN or the temporal modules.
    - Randomly mixes in stylized frame pairs: each pair is spatially concatenated and style-transferred with SDXL-Neta, so both frames share a consistent style.
    - The reference frame is randomly sampled from frames outside the target sequence, simulating the inference scenario where the reference and driving video are unrelated.
    - Resolution 768×768, batch size 128, trained for 120k steps.
- Stage 2: MAN and the temporal modules are added; all other parameters are frozen.
    - Mixes MMD videos with character-free, camera-motion-only videos.
    - 24-frame sequences, batch size 16, trained for 60k steps.
    - Hardware: 16× A800 GPUs.
Both stages randomly drop out pose and scene motion guidance (rate 0.2) to improve robustness. Inference uses DDIM with 20 steps.
Standard diffusion training loss: \(\mathcal{L}_{simple} = \mathbb{E}_{\epsilon, t, c}[\|\epsilon - \epsilon_\theta(x_t, t, c)\|_2^2]\)
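A condensed sketch of one training step with the 0.2 guidance dropout described above (a diffusers-style scheduler with `add_noise` is assumed, and the `unet` call signature is a stand-in for the actual model interface):

```python
import torch
import torch.nn.functional as F

def training_step(unet, scheduler, x0, cond, pose_guidance, scene_motion,
                  drop_rate=0.2):
    """One optimization step of L_simple with random guidance dropout."""
    # Randomly drop each guidance signal to improve robustness.
    if torch.rand(()) < drop_rate:
        pose_guidance = torch.zeros_like(pose_guidance)
    if torch.rand(()) < drop_rate:
        scene_motion = torch.zeros_like(scene_motion)

    # Forward diffusion: add noise at a random timestep.
    noise = torch.randn_like(x0)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (x0.shape[0],), device=x0.device)
    x_t = scheduler.add_noise(x0, noise, t)

    # Predict the noise and regress it (the L_simple objective).
    eps_pred = unet(x_t, t, cond, pose_guidance, scene_motion)
    return F.mse_loss(eps_pred, noise)
```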
## Key Experimental Results
### Main Results (Table)
Quantitative comparison (100 MMD test videos):
| Method | FID ↓ | SSIM ↑ | PSNR ↑ | LPIPS ↓ | L1 ↓ | FID-VID ↓ | FVD ↓ |
|---|---|---|---|---|---|---|---|
| AniAny | 43.945 | 0.488 | 12.530 | 0.548 | 7.31E-5 | 38.179 | 846.414 |
| AniAny* | 28.833 | 0.526 | 13.610 | 0.517 | 6.23E-5 | 26.764 | 575.304 |
| DISCO | 59.221 | 0.313 | 10.732 | 0.615 | 9.25E-5 | 46.852 | 923.921 |
| MagicPose | 44.258 | 0.424 | 12.357 | 0.554 | 7.77E-5 | 41.347 | 886.691 |
| UniAnimate | 47.328 | 0.417 | 12.074 | 0.571 | 7.93E-5 | 40.924 | 882.245 |
| MikuDance | 24.597 | 0.576 | 14.592 | 0.493 | 5.73E-5 | 22.868 | 502.380 |

AniAny* denotes Animate Anyone fine-tuned on the anime domain.
### Ablation Study (Table)
Ablation of key designs:
| Setting | FID ↓ | SSIM ↑ | PSNR ↑ | FID-VID ↓ | FVD ↓ |
|---|---|---|---|---|---|
| w/o MIX (separate Ref UNet + 2 ControlNets) | 27.315 | 0.523 | 14.004 | 24.124 | 541.453 |
| w/o MAN (scene motion via direct concatenation) | 24.985 | 0.542 | 14.501 | 23.366 | 509.342 |
| w/o SMT (no scene motion) | 25.472 | 0.534 | 14.312 | 23.362 | 517.673 |
| w/ Plücker (Plücker coordinates as substitute) | 25.918 | 0.538 | 14.261 | 23.471 | 521.853 |
| w/ Flow (optical flow as substitute) | 26.141 | 0.516 | 14.088 | 23.079 | 505.533 |
| MikuDance (full) | 24.597 | 0.576 | 14.592 | 22.868 | 502.380 |
User study: 50 volunteers ranked the anonymized results of four methods on 20 videos. MikuDance substantially outperforms all baselines on three criteria (overall quality, frame quality, and temporal quality), with >97% of users preferring MikuDance.
### Key Findings
- Mixed control outperforms separate encoders: w/o MIX (2 ControlNets) achieves FID 27.3 vs. 24.6 for the full model; separate processing fails to resolve character–guidance misalignment.
- MAN outperforms direct concatenation: Spatially adaptive normalization injects global motion more effectively, yielding FVD 502 vs. 509.
- SMT outperforms Plücker coordinates and optical flow: Pixel-level scene motion representation shares the same domain as character poses; Plücker coordinates introduce a domain gap, and optical flow is content-dependent and non-generalizable.
- Even domain-finetuned baselines fall far short: AniAny* (fine-tuned on anime domain) achieves FVD 575 vs. MikuDance's 502.
- Unique capability for high-dynamic scenes: MikuDance maintains high-fidelity animation under the combination of large dance motions and rapid camera movements.
## Highlights & Insights
- Elegant unification into pixel-space motion representation: Heterogeneous character poses (keypoints) and camera motion (3D poses) are unified into a 2D pixel motion space.
- "Simplicity wins" mixed-control design: Removing complex architectures such as ControlNet and directly mixing all guidance signals within the Reference UNet yields superior results.
- Practical two-stage training strategy: Stylized frame pairs simulate diverse character styles, while camera-motion videos teach background dynamics.
- MMD dataset: 3,600 MMD animations are collected and segmented into 120k clips totaling 10.2 million frames for training.
- Overwhelming user study victory: >97% preference rate.
## Limitations & Future Work
- Some generated animations exhibit background distortion and artifacts—3D ambiguity in image animation is an inherent challenge.
- SMT assumes a static scene; in practice, background objects may also move.
- Depth map quality affects scene point cloud accuracy and consequently SMT quality.
- The method relies on DROID-SLAM for camera pose extraction, which may be inaccurate under extreme motion.
- Long video generation depends on temporal aggregation methods for stitching, potentially introducing transition artifacts.
- Training cost is substantial (16× A800, 180k steps total).
## Related Work & Insights
- Animate Anyone series: Establishes the foundational Reference UNet + Denoising UNet architecture; this work addresses the additional challenges specific to character artwork on top of this framework.
- Camera-controlled video generation: Methods such as Human4DiT and HumanVid use Plücker coordinates; this work demonstrates the superiority of pixel-level representations.
- SPADE → MAN: The concept of spatially adaptive normalization in semantic space is transferred to motion-adaptive normalization.
- The method holds significant commercial value for anime and game character animation applications.
## Rating
- Novelty: ⭐⭐⭐⭐ The SMT strategy and mixed-control design are original; the unification of heterogeneous motion signals into pixel space is an elegant idea.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive quantitative comparisons, thorough ablations, and convincing user study; character animation–specific benchmarks beyond MMD are lacking.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with rich illustrations; the two challenges and their corresponding solutions are explicitly paired.
- Value: ⭐⭐⭐⭐ The first work to introduce high-dynamic camera motion control for character artwork animation, with direct applicability to the anime and gaming industries.