EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation¶

Conference: CVPR 2025
arXiv: 2411.10061
Code: https://github.com/antgroup/echomimic_v2
Area: Image Inpainting
Keywords: Human Animation, Audio-driven, Semi-body Animation, Pose Simplification, Diffusion Models

TL;DR¶

This paper proposes an Audio-Pose Dynamic Harmonization (APDH) training strategy that progressively shifts control from full-body poses to audio: keypoints are gradually removed (only the hands are retained) while the audio control scope expands from the lips to the full body. The resulting model generates high-quality semi-body animation from only audio, a reference image, and hand poses.

Background & Motivation¶

Background: Audio-driven human animation generates speaking/motion videos from reference images and audio. Existing methods usually require full-body pose sequences as additional control conditions, which are costly to obtain.

Limitations of Prior Work:
1. Full-body poses (facial, body, and hand keypoints) are difficult to acquire: they typically require extraction tools such as OpenPose, and poses that do not come from an existing video need extra design effort.
2. Audio and body movements are naturally correlated (e.g., breathing, gestures), yet existing methods fail to exploit this.
3. Portrait (headshot) data is abundant, whereas semi-body data is scarce.

Key Challenge: Full-body pose conditions offer precise control but are costly to obtain; audio signals contain rich rhythmic and emotional information but are insufficient to directly control full-body movements with high quality.

Goal: Simplify control conditions—reducing from full-body poses to only hand poses and audio, while maintaining or even improving the animation quality.

Key Insight: The pose and audio conditions move like partners in a "waltz": pose takes a step back (keypoints are progressively reduced) while audio takes a step forward (its control range is progressively expanded). During training, keypoints are removed in the order lip \(\rightarrow\) head \(\rightarrow\) body, while audio attention expands from the lip region to the head, and eventually to the entire body.

Core Idea: Deploy an "Audio-Pose Dynamic Harmonization" (APDH) progressive coordination strategy where audio takes over the control responsibilities of the poses. Ultimately, only hand poses + audio are required to generate high-quality semi-body animations.

Method¶

Overall Architecture¶

Inputs: Reference Image + Audio + Hand Pose Sequences \(\rightarrow\) SD U-Net Base Diffusion Model \(\rightarrow\) semi-body animation video. Training applies the APDH Coordination Strategy, which combines Pose Sampling (progressive keypoint removal) with Audio Diffusion (progressive expansion of the audio control region), together with Head Partial Attention (portrait data augmentation) and the PhD Loss (loss weighting specific to each denoising phase).

Key Designs¶

Audio-Pose Dynamic Harmonization (APDH):
- Function: Progressively shifts control from pose to audio.
- Mechanism: Pose Sampling progressively removes keypoints (lip \(\rightarrow\) head \(\rightarrow\) body, retaining only hands), while Audio Diffusion simultaneously expands the spatial attention mask (lip region \(\rightarrow\) head region \(\rightarrow\) full-body region). Early in training, audio controls only the lips; in later stages, it controls the entire body.
- Design Motivation: Removing most pose conditions directly, without APDH, degrades quality (FID 51.53 vs. 49.33 with APDH). Progressive coordination lets the model learn step by step to replace poses with audio.
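
The coordination described above can be sketched as a training-stage schedule. This is an illustrative sketch only, not the authors' implementation: the names `apdh_state`, `APDHState`, and the group labels are hypothetical, and the assumption that training splits into three equal-length stages (each stage dropping one more keypoint group while widening the audio region) is ours, since the summary does not give the actual schedule.

```python
from dataclasses import dataclass

# Hypothetical labels for keypoint groups and audio regions.
ALL_GROUPS = ("lips", "head", "body", "hands")   # hands are never dropped
DROP_ORDER = ("lips", "head", "body")            # order in which pose groups are removed
AUDIO_REGIONS = ("lips", "head", "full_body")    # order in which the audio mask grows


@dataclass(frozen=True)
class APDHState:
    pose_groups: tuple   # keypoint groups still fed to the model as pose condition
    audio_region: str    # region covered by the audio attention mask


def apdh_state(step: int, total_steps: int) -> APDHState:
    """Return the pose/audio configuration for one training step.

    Assumption: equal-length stages; stage i drops DROP_ORDER[:i + 1]
    and lets audio attend to AUDIO_REGIONS[i].
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    n_stages = len(DROP_ORDER)
    stage = step * n_stages // total_steps        # integer stage index in [0, n_stages)
    dropped = set(DROP_ORDER[: stage + 1])
    kept = tuple(g for g in ALL_GROUPS if g not in dropped)
    return APDHState(pose_groups=kept, audio_region=AUDIO_REGIONS[stage])
```

With `total_steps=300`, step 0 keeps head, body, and hand keypoints while audio attends to the lips, and step 299 keeps only the hands while audio attends to the full body.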
Head Partial Attention (HPA):
- Function: Leverages portrait data at zero cost to enhance facial quality.
- Mechanism: Crops and pads portrait photos to semi-body size, reusing the head region in the attention mask. No extra modules are required—the same model handles both semi-body and portrait data.
- Design Motivation: High-quality semi-body data is scarce, but portrait data is abundant. HPA provides a "free lunch" style data augmentation.
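
The crop-and-pad step can be illustrated with a small pure-Python sketch. It assumes the portrait is a 2D grid of pixel values and that the paste location (`top_left`) is chosen by the caller; the function name `pad_portrait` and both of these choices are hypothetical and are not taken from the paper or its code.

```python
def pad_portrait(portrait, canvas_hw, top_left, fill=0):
    """Paste a portrait onto a blank semi-body-sized canvas.

    Returns (canvas, head_mask). head_mask is 1 where the portrait was
    pasted, so it can play the role of the head region in the attention mask.
    """
    canvas_h, canvas_w = canvas_hw
    h, w = len(portrait), len(portrait[0])
    y0, x0 = top_left
    if y0 < 0 or x0 < 0 or y0 + h > canvas_h or x0 + w > canvas_w:
        raise ValueError("portrait does not fit inside the canvas")

    canvas = [[fill] * canvas_w for _ in range(canvas_h)]
    head_mask = [[0] * canvas_w for _ in range(canvas_h)]
    for r in range(h):
        for c in range(w):
            canvas[y0 + r][x0 + c] = portrait[r][c]
            head_mask[y0 + r][x0 + c] = 1
    return canvas, head_mask


# Usage: a 2x2 "portrait" placed at the top of a 4x3 "semi-body" canvas.
canvas, head_mask = pad_portrait([[5, 6], [7, 8]], canvas_hw=(4, 3), top_left=(0, 1))
```

Because the padded sample has the same size as a semi-body frame, the same model can consume it without extra modules, which matches the "no extra modules" point above.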
Phase-specific Denoising Loss (PhD Loss):
- Function: Focuses on different quality dimensions at different denoising stages.
- Mechanism: The first 10% of denoising steps (\(L_{pose}\)) focus on pose consistency; the middle 60% (\(L_{detail}\)) focus on detail (e.g., lip synchronization); the final 30% (\(L_{low}\)) focus on low-level visual quality (as measured by SSIM and PSNR).
- Design Motivation: Ablation studies show that removing \(L_{pose}\) drops HKC from 0.923 to 0.874, and removing \(L_{low}\) drops SSIM from 0.738 to 0.675.
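
The 10% / 60% / 30% split can be sketched as a mapping from a denoising step to its dominant loss term. This is a minimal sketch under two assumptions that the summary does not confirm: "first" means the earliest (noisiest) denoising steps, and the phase boundaries are hard cut-offs. The function name `phd_phase` is hypothetical.

```python
from collections import Counter


def phd_phase(step_index: int, num_steps: int) -> str:
    """Map a denoising step (0 = first step) to its dominant loss term.

    Integer arithmetic avoids floating-point edge cases at the boundaries.
    """
    if not 0 <= step_index < num_steps:
        raise ValueError("step_index must satisfy 0 <= step_index < num_steps")
    if 10 * step_index < 1 * num_steps:     # first 10% of steps
        return "L_pose"
    if 10 * step_index < 7 * num_steps:     # next 60% of steps
        return "L_detail"
    return "L_low"                          # final 30% of steps


# Usage: count how many of 50 denoising steps fall in each phase.
phase_counts = Counter(phd_phase(i, 50) for i in range(50))
```

For 50 steps this assigns 5 steps to `L_pose`, 30 to `L_detail`, and 15 to `L_low`.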

Loss & Training¶

The PhD Loss is a three-phase loss weighted by denoising timestep. The model is trained on CelebV-HQ videos.

Key Experimental Results¶

Main Results¶

Method	FID↓	FVD↓	SSIM↑	Sync-C↑	HKC↑	CSIM↑
AnimateAnyone	58.98	1016	0.729	0.987	0.809	0.387
MimicMotion	53.47	623	0.702	1.495	0.907	0.526
EchoMimicV2	49.33	598	0.738	7.219	0.923	0.558
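
To make the comparison concrete, the numbers in the table above can be checked with a few lines of arithmetic. The dictionary below only copies values from the table; the helper name `best_method` is hypothetical.

```python
results = {
    "AnimateAnyone": {"FID": 58.98, "FVD": 1016, "SSIM": 0.729, "Sync-C": 0.987, "HKC": 0.809, "CSIM": 0.387},
    "MimicMotion":   {"FID": 53.47, "FVD": 623,  "SSIM": 0.702, "Sync-C": 1.495, "HKC": 0.907, "CSIM": 0.526},
    "EchoMimicV2":   {"FID": 49.33, "FVD": 598,  "SSIM": 0.738, "Sync-C": 7.219, "HKC": 0.923, "CSIM": 0.558},
}
LOWER_IS_BETTER = {"FID", "FVD"}


def best_method(metric: str) -> str:
    """Return the method with the best score for a metric."""
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(results, key=lambda name: results[name][metric])


# Ratio of lip-sync confidence between EchoMimicV2 and MimicMotion (about 4.83).
sync_ratio = results["EchoMimicV2"]["Sync-C"] / results["MimicMotion"]["Sync-C"]
# Absolute FID reduction relative to MimicMotion (about 4.14).
fid_reduction = results["MimicMotion"]["FID"] - results["EchoMimicV2"]["FID"]
```

On these numbers EchoMimicV2 is the best method on every listed metric, and its Sync-C is roughly 4.8 times that of MimicMotion.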

Ablation Study¶

Component	Impact of Removal
Without APDH	FID 51.53, CSIM 0.508
Without \(L_{pose}\)	HKC 0.874 (-0.049)
Without \(L_{detail}\)	Sync-C 6.985 (-0.234)
Without \(L_{low}\)	SSIM 0.675 (-0.063)
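
The deltas in parentheses follow directly from the full-model row of the main results. A short check, using only values already listed in the two tables (the variable names are illustrative):

```python
# Full-model scores from the main results table.
full = {"FID": 49.33, "CSIM": 0.558, "HKC": 0.923, "Sync-C": 7.219, "SSIM": 0.738}

# Scores reported after removing each component.
ablations = {
    "APDH":     {"FID": 51.53, "CSIM": 0.508},
    "L_pose":   {"HKC": 0.874},
    "L_detail": {"Sync-C": 6.985},
    "L_low":    {"SSIM": 0.675},
}

# Change relative to the full model, rounded to three decimals.
deltas = {
    component: {metric: round(value - full[metric], 3) for metric, value in scores.items()}
    for component, scores in ablations.items()
}
```

Removing APDH raises FID by 2.2 and lowers CSIM by 0.05; the three loss ablations reproduce the listed deltas of -0.049 (HKC), -0.234 (Sync-C), and -0.063 (SSIM).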

Key Findings¶

Significant lead in lip sync: Sync-C of 7.219 vs. MimicMotion's 1.495 (about 4.8× higher), suggesting that audio drives lip motion far better than pose keypoints do.
Hand poses are sufficient: Performance improves after facial and body keypoints are removed via APDH (FID 49.33 vs. 51.53 without it), plausibly because the model no longer has to learn to ignore noisy poses.
Free augmentation via HPA: Portrait data enhances facial quality with zero extra overhead.

Highlights & Insights¶

The "waltz-like" progressive coordination is an elegant training strategy—much smoother than simply dropping conditions.
The finding that audio carries enough rhythmic and emotional information to drive breathing and gestures is inspiring; it implies that audio conditions may be far more powerful than previously thought.

Limitations & Future Work¶

Hand poses are still required as input—achieving complete audio-driven generation (without any poses) is the next goal.
Infrequent gestures (such as large waving) still rely on pose inputs.
Only validated in talking scenarios; effectiveness in dancing or sports scenarios remains unexplored.

Comparison with Related Work¶

vs. Hallo / V-Express: These methods only generate portrait animations. EchoMimicV2 extends to semi-body and simplifies the control conditions.
vs. MimicMotion: MimicMotion requires full-body poses. EchoMimicV2 requires only hand poses and improves lip sync (Sync-C) by roughly \(4.8\times\) (7.219 vs. 1.495).

Rating¶

Novelty: ⭐⭐⭐⭐ Creative designs with APDH progressive coordination and PhD Loss phase-wise optimization.
Experimental Thoroughness: ⭐⭐⭐⭐ Detailed component ablations and multi-metric comparisons.
Writing Quality: ⭐⭐⭐⭐ Vivid "waltz" metaphor.
Value: ⭐⭐⭐⭐ Direct value for digital human and virtual anchor applications.