CVPR 2025 Human Understanding Audio-driven facial animation Diffusion models Keyframe generation Long-sequence generation Emotion modeling Non-verbal vocalizations

KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation

Conference: CVPR 2025
arXiv: 2503.01715
Code: See project page
Area: Human Understanding
Keywords: Audio-driven facial animation, Diffusion models, Keyframe generation, Long-sequence generation, Emotion modeling, Non-verbal vocalizations

TL;DR

KeyFace proposes a two-stage diffusion framework: a keyframe model first generates anchor frames that capture key expressions at a low frame rate, and an interpolation model then fills in the intermediate frames. This design addresses the identity drift and quality degradation that existing audio-driven facial animation methods exhibit on long sequences. It also introduces continuous emotion modeling (valence/arousal) and animation of various non-speech vocalizations (NSVs), both supported for the first time.

Background & Motivation

Background: Audio-driven facial animation has made significant progress in recent years with the development of GANs and diffusion models, yielding generation quality close to real videos. This technology is widely applied in fields such as virtual assistants, education, and VR.

Limitations of Prior Work:
1. Quality degradation in long sequences: Most methods suffer from identity drift and overall quality degradation once sequences exceed a few seconds. Autoregressive methods in particular accumulate errors over time, so quality deteriorates sharply.
2. Cost of extra spatial control: Supplying target head poses or landmarks as inputs improves temporal consistency on long sequences but restricts the naturalness and flexibility of expressions.
3. Limitations in emotion modeling: Existing emotion-driven methods assume fixed emotional states or use discrete emotion labels (such as "angry" or "sad"), failing to capture continuous transitions of emotion.
4. Neglect of non-speech vocalizations: NSVs such as laughter and sighing are crucial for natural communication but are almost entirely ignored by existing methods.

Key Challenge: Long-sequence animation requires global temporal consistency, but the local receptive field of autoregressive generation leads to error accumulation. This calls for a generation strategy that establishes long-range dependencies without introducing rigid spatial constraints.

Goal: To design an audio-driven facial animation framework capable of generating long-sequence, high-fidelity, emotionally rich animations with support for NSVs.

Method

Overall Architecture

KeyFace is a two-stage pipeline based on the Stable Video Diffusion (SVD) architecture:
1. Keyframe Generation Stage: Generates \(T\) keyframes at a low frame rate (gaps of \(S\) frames), conditioned on identity frames, audio embeddings, and emotion parameters. The sparse sampling covers a wider time span and captures the key emotional variations.
2. Interpolation Stage: Uses the same architecture with different conditional inputs to fill in the \(S-2\) intermediate frames between two adjacent keyframes, which serve as anchors, ensuring smooth transitions and temporal consistency.

Long videos are generated by repeating this process, with the interpolation model providing seamless transitions between segments.
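
The two-stage schedule can be sketched as simple index bookkeeping. This is a minimal sketch rather than the authors' code: it assumes each interpolated segment spans \(S\) frames counting both anchor keyframes (so \(S-2\) frames are new) and that adjacent segments share a keyframe. The function names are hypothetical, and a linear blend of scalars stands in for the interpolation model.

```python
def keyframe_positions(num_keyframes, segment_len):
    """Frame indices of the keyframes when consecutive segments share an anchor."""
    stride = segment_len - 1  # assumption: a segment counts both of its anchors
    return [k * stride for k in range(num_keyframes)]


def assemble_video(keyframes, interpolate, segment_len):
    """Stitch keyframes and interpolated frames into one sequence."""
    frames = [keyframes[0]]
    for left, right in zip(keyframes, keyframes[1:]):
        # S - 2 new frames between each pair of adjacent keyframes
        frames.extend(interpolate(left, right, segment_len - 2))
        frames.append(right)
    return frames


def linear_interpolate(left, right, count):
    """Stand-in for the interpolation model: evenly spaced scalar values."""
    step = (right - left) / (count + 1)
    return [left + step * (i + 1) for i in range(count)]


# Toy run: scalar "frames", T = 3 keyframes, S = 4 frames per segment.
keys = [0.0, 3.0, 6.0]
video = assemble_video(keys, linear_interpolate, segment_len=4)
print(keyframe_positions(3, 4))  # [0, 3, 6]
print(len(video))                # 7
```

Under these assumptions, \(T\) keyframes yield \((T-1)(S-1)+1\) frames, and every keyframe appears unchanged in the final sequence.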

Key Designs

Keyframe Generation Model:
- Function: Generates sparse but informative anchor frames, implicitly decoupling motion and identity control.
- Mechanism: The identity frame is encoded by the VAE, repeated \(T\) times, and concatenated with the input noise sequence. Identity details are preserved through the U-Net skip connections. Audio is injected via both cross-attention and the timestep embedding.
- Design Motivation: Low frame rate generation allows the model to span a longer duration (several seconds), capturing facial expressions and motion patterns from a global perspective to bypass the short-sighted nature of autoregressive methods.
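
The identity conditioning described above can be illustrated with nested lists in place of tensors. This is a rough sketch under assumptions not stated in the source: the concatenation is taken to be along the channel axis, the sizes are toy values, and `build_unet_input` is a hypothetical name.

```python
import random


def build_unet_input(identity_latent, num_frames, seed=0):
    """Concatenate per-frame noise with a repeated identity latent.

    identity_latent: list of C channels, each a flat list of values.
    Returns T frames, each with 2*C channels (noise first, identity after).
    """
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):
        # Fresh Gaussian noise with the same shape as the identity latent
        noise = [[rng.gauss(0.0, 1.0) for _ in channel] for channel in identity_latent]
        # Assumption: concatenation happens along the channel axis
        frames.append(noise + [list(channel) for channel in identity_latent])
    return frames


identity = [[0.1, 0.2], [0.3, 0.4]]  # C = 2 channels, 2 spatial positions
x = build_unet_input(identity, num_frames=3)
print(len(x), len(x[0]))  # 3 4
```

Every frame carries the same identity channels, so the denoiser sees the reference appearance at each keyframe position.
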
Dual Audio Encoders (WavLM + BEATs):
- Function: Simultaneously captures linguistic content and non-speech acoustic features.
- Mechanism: WavLM excels at extracting linguistic features (lip synchronization), while BEATs is adept at capturing broad acoustic signals (including NSVs). The concatenated embeddings of both are injected into the model through two paths: (a) as key/value for cross-attention; (b) added to the diffusion timestep embedding after an MLP.
- Experimental Validation: Removing BEATs drops NSV accuracy from 42% to 10% (close to random), while removing WavLM deteriorates lip synchronization quality.
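
The two injection paths can be sketched as follows. This is an illustrative sketch, not the released implementation: it assumes both encoders already yield one feature vector per video frame, a single dense layer stands in for the MLP, and all names and sizes are hypothetical.

```python
def concat_audio_features(wavlm_feats, beats_feats):
    """Concatenate per-frame features from the two encoders along the feature axis."""
    if len(wavlm_feats) != len(beats_feats):
        raise ValueError("both encoders must yield one vector per frame")
    return [w + b for w, b in zip(wavlm_feats, beats_feats)]


def linear(vec, weights, bias):
    """Single dense layer (stand-in for the MLP); weights has one row per output."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def condition_timestep(t_emb, audio_vec, weights, bias):
    """Path (b): project the audio vector and add it to the timestep embedding."""
    projected = linear(audio_vec, weights, bias)
    return [t + p for t, p in zip(t_emb, projected)]


# Toy sizes: 2 frames, 2-dim WavLM features, 1-dim BEATs features.
wavlm = [[1.0, 2.0], [3.0, 4.0]]
beats = [[0.5], [0.25]]
audio = concat_audio_features(wavlm, beats)   # path (a): cross-attention keys/values
weights = [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]  # hypothetical 3 -> 2 projection
t_emb = condition_timestep([10.0, 20.0], audio[0], weights, bias=[0.0, 0.0])
print(audio)  # [[1.0, 2.0, 0.5], [3.0, 4.0, 0.25]]
print(t_emb)  # [11.0, 21.0]
```
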
Continuous Emotion Modeling (Valence & Arousal):
- Function: Supports continuous emotional variation during the video generation process.
- Mechanism: For each frame, valence and arousal are extracted using a pre-trained emotion recognition model, encoded as sinusoidal embeddings, and added to the diffusion timestep embedding.
- Scope: Emotion conditioning is applied only in the keyframe model; the interpolation model propagates the emotional expressions from the keyframes.
- Inference: Users can provide arbitrary valence/arousal values to control the emotional state, and can interpolate them within a single video to create an emotional progression.
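
The emotion conditioning and the interpolation of emotion values at inference time can be sketched as below. This is a hedged sketch: the Transformer-style embedding formula, the \([-1, 1]\) value range, the summing of the valence and arousal embeddings, and all function names are assumptions rather than details from the source.

```python
import math


def sinusoidal_embedding(value, dim=8, max_period=10000.0):
    """Transformer-style sinusoidal embedding of a scalar (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(value * f) for f in freqs] + [math.sin(value * f) for f in freqs]


def emotion_schedule(start, end, num_keyframes):
    """Linearly interpolate (valence, arousal) pairs across the keyframes."""
    if num_keyframes == 1:
        return [tuple(start)]
    schedule = []
    for k in range(num_keyframes):
        a = k / (num_keyframes - 1)
        schedule.append(tuple((1 - a) * s + a * e for s, e in zip(start, end)))
    return schedule


def condition_on_emotion(t_emb, valence, arousal):
    """Add both emotion embeddings to the diffusion timestep embedding."""
    v = sinusoidal_embedding(valence, len(t_emb))
    a = sinusoidal_embedding(arousal, len(t_emb))
    return [t + x + y for t, x, y in zip(t_emb, v, a)]


# Move from a mildly negative, calm state to a positive, excited one.
schedule = emotion_schedule((-0.5, 0.2), (0.5, 0.8), num_keyframes=3)
conditioned = [condition_on_emotion([0.0] * 8, v, a) for v, a in schedule]
print(schedule)
```

Each keyframe receives its own valence/arousal pair, so an emotional progression is expressed purely through the conditioning signal.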

Loss & Training

\[L = \lambda_{tot}(L_2(z_0, z_{gt}) + L_2(x_0, x_{gt}) + L_p(x_0, x_{gt}))\]

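where \(\lambda_{tot} = \lambda(t) \cdot \lambda_{lower}\) combines a timestep-dependent weight \(\lambda(t)\) with the lower-half weight \(\lambda_{lower}\). The individual terms are: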

- Latent-space L2 loss: Standard diffusion training loss, computed for all frames.
- Pixel-space L2 loss: Reconstruction loss against the ground-truth frame after decoding back to RGB space, computed on a single random frame to save GPU memory.
- Perceptual loss: Based on VGG feature matching to enhance perceptual quality.
- Lower-half weight \(\lambda_{lower}=3\): Applies a higher weight to the lower half of the image (mouth region) to enhance lip synchronization quality.
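
The effect of the lower-half weight can be illustrated on one L2 term. This is a minimal sketch under assumptions: \(\lambda_{lower}\) is treated as a spatial weight map that equals 1 on the upper half and 3 on the lower half, \(\lambda(t)\) is passed in as a plain scalar, and the loss is averaged over pixels. The function names are hypothetical.

```python
def lower_half_weights(height, width, lam_lower=3.0):
    """Weight map: 1 on the upper half of the image, lam_lower on the lower half."""
    return [[lam_lower if r >= height // 2 else 1.0 for _ in range(width)]
            for r in range(height)]


def weighted_l2(pred, target, weights, lam_t=1.0):
    """Mean of lam_t * w * (pred - target)^2 over all pixels."""
    total, count = 0.0, 0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            total += lam_t * w * (p - t) ** 2
            count += 1
    return total / count


# Toy 2x2 image with a unit error at every pixel.
pred = [[1.0, 1.0], [1.0, 1.0]]
target = [[0.0, 0.0], [0.0, 0.0]]
w = lower_half_weights(2, 2)
print(weighted_l2(pred, target, w))  # 2.0
```

With identical per-pixel errors, the lower half contributes three times as much as the upper half, which is what pushes the model to prioritize the mouth region.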

The keyframe model uses decoupled classifier-free guidance (CFG), which controls the guidance strength of identity and audio separately, while the interpolation model uses Autoguidance to avoid the over-amplification of conditional signals caused by standard CFG.
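
The two guidance schemes can be compared in a few lines. This is a sketch under assumptions: the source does not give the decoupled CFG formula, so the common two-scale formulation for multiple conditions is used here, and Autoguidance is written as an extrapolation from a weaker guide model toward the main model. Short lists stand in for model outputs.

```python
def decoupled_cfg(eps_uncond, eps_id, eps_id_audio, w_id, w_audio):
    """Two-scale guidance: unconditional -> identity -> identity + audio."""
    return [u + w_id * (i - u) + w_audio * (a - i)
            for u, i, a in zip(eps_uncond, eps_id, eps_id_audio)]


def autoguidance(out_main, out_weak, w):
    """Extrapolate from the weak guide toward the main model (w = 1 returns main)."""
    return [g + w * (m - g) for m, g in zip(out_main, out_weak)]


print(decoupled_cfg([0.0], [1.0], [3.0], w_id=2.0, w_audio=1.5))  # [5.0]
print(autoguidance([1.0], [0.5], w=2.0))                          # [1.5]
```

In the decoupled form, `w_id` and `w_audio` can be tuned independently. In Autoguidance both models receive the same conditioning, so the extrapolation amplifies the quality gap between them instead of the conditioning signal.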

Key Experimental Results

Main Results

Method (HDTF)	FID↓	FVD↓	LipScore↑	AQ↑	Elo↑
Ours	16.76	137.25	0.36	0.59	1091.52
Hallo	19.22	236.97	0.27	0.55	1054.69
V-Express	34.68	200.67	0.37	0.55	985.35
AniPortrait	20.68	299.09	0.14	0.56	887.84
EchoMimic	20.35	213.30	0.17	0.55	1023.53
SadTalker	60.55	410.86	0.24	0.52	960.44

Method (emotion evaluation, MEAD)	FID↓	FVD↓	Emo_acc↑
Ours (V&A)	44.43	447.74	0.67
Ours (Discrete Labels)	50.34	509.13	0.43
EDTalk	101.19	619.90	0.72
EAT	75.69	560.61	0.54
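
To read the Elo column of the first table, a rating gap can be converted into an expected head-to-head preference rate. This sketch assumes the standard logistic Elo model with a scale of 400; the rating protocol actually used in the paper may differ, and `elo_expected_score` is a hypothetical helper.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# KeyFace (1091.52) against Hallo (1054.69), ratings taken from the table above.
p = elo_expected_score(1091.52, 1054.69)
print(round(p, 2))
```

Under this assumption, the gap to the strongest baseline corresponds to a preference rate only modestly above one half, so the Elo column is best read as a ranking, not as a large margin.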

Highlights & Insights

Keyframe + interpolation two-stage paradigm: Implicitly decouples motion planning (keyframes capture what happens) from motion refinement (interpolation handles how it transitions). Unlike autoregressive methods, it avoids error accumulation over time, so FID shows little to no growth as the sequence gets longer.
Continuous emotion vs. discrete labels: The continuous representation of valence/arousal is not only more fine-grained than discrete labels (improving Emo_acc from 0.43 to 0.67), but also crucially allows smooth emotional transitions within a video, which is vital for long-sequence narration.
Autoguidance replacing CFG for interpolation: The interpolation stage must produce delicate, detailed expression transitions, and standard CFG scaling can over-amplify the conditioning and harm naturalness. Autoguidance instead guides the model with a weaker version of itself (a smaller model or one trained for fewer steps), balancing quality and diversity.
New LipScore metric: A perceptual score based on a lipreader, presented as more reliable than SyncNet. The lipreader is trained on 6 times more data than SyncNet, and the score aligns better with human perception.

Limitations & Future Work

The keyframe interval \(S\) is a fixed hyperparameter, which may require adaptive adjustment for different speech rates or scene complexities.
Training requires 160 hours of speech and 30 hours of NSV data, which incurs a high data collection cost.
Emotion labels are derived from pseudo-labels (extracted via pre-trained models), making the accuracy heavily reliant on the capability of the emotion recognition model.
The model currently processes only the facial region, excluding the generation of hand gestures or body movements.

Related Work

Audio-driven facial animation: Wav2Lip (lip sync expert discriminator), Hallo (autoregressive diffusion), AniPortrait (audio \(\rightarrow\) landmark \(\rightarrow\) video two-stage), EchoMimic (audio + landmark conditioned diffusion)
Emotion-driven generation: EAT (discrete emotion labels), EDTalk (video-driven emotion transfer)
Non-speech vocalizations: Laughing Matters (laughter diffusion model), LaughTalk (3D laughter + speech model)
Video diffusion models: SVD (Stable Video Diffusion, the base architecture), EDM (the diffusion training and sampling framework of Karras et al.)

Rating

Novelty: ⭐⭐⭐⭐ (The combination of the keyframe + interpolation paradigm, continuous emotion, and NSV is completely novel in facial animation)
Value: ⭐⭐⭐⭐⭐ (Directly resolves the core challenges of long-sequence facial animation, with clear application scenarios)
Technical Depth: ⭐⭐⭐⭐ (Two-stage diffusion design, decoupled CFG/Autoguidance, multi-loss combination)
Writing Quality: ⭐⭐⭐⭐⭐ (Clear structure, thorough ablation studies, and well-justified new metric)