EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing

Conference: CVPR 2025
arXiv: 2412.08988
Code: https://github.com/GalaxyCong/DubFlow
Area: Diffusion Models
Keywords: Movie Dubbing, Emotion Controllable, Lip Synchronization, Flow Matching, Positive-Negative Guidance

TL;DR

This paper proposes EmoDubber, an emotion-controllable movie dubbing framework. It aligns lip movements with prosody through duration-level contrastive learning, improves speech clarity with a pronunciation enhancement strategy, and controls emotion category and intensity through a flow-matching-based positive-negative guidance mechanism. On the reported benchmarks, EmoDubber outperforms existing methods in lip-sync and pronunciation clarity.

Background & Motivation

Background: Movie dubbing (Visual Voice Cloning, V2C) aims to convert text into speech synchronized with video lip movements while matching the specified speaker's voiceprint. Existing methods are divided into two categories: those focusing on speaker style representation (V2C-Net, StyleDubber), and those utilizing video information for prosody modeling (HPMDubbing, MCDubber).

Limitations of Prior Work: (1) Audio-visual synchronization and clear pronunciation are hard to guarantee simultaneously: existing methods operate at the video-frame and mel-spectrogram levels and ignore phone-level pronunciation information, which leads to mumbled speech. (2) Emotional expression is stiff and not controllable: users cannot specify the emotion category or intensity, which is a critical requirement in movie post-production.

Key Challenge: The dubbing task requires simultaneous optimization across four dimensions (lip-sync, pronunciation clarity, speaker cloning, and emotional control), whereas existing methods address only a subset of the first three.

Goal: (1) How to achieve precise lip-movement and prosody alignment? (2) How to improve pronunciation clarity? (3) How to allow users to flexibly control emotion categories and intensity?

Key Insight: The authors observe that emotions in human speech are often mixed rather than single. Therefore, a positive-negative guidance mechanism is designed to enhance the target emotion while suppressing other emotions, achieving more precise emotional control.

Core Idea: Align lip movements and prosody through duration-level contrastive learning, integrate phoneme sequences using pronunciation enhancement, and achieve flexible emotional intensity control through flow-matching guided by positive and negative classifiers.

Method

Overall Architecture

EmoDubber takes four inputs: silent video \(V_l\), reference audio \(R_a\), text \(T_p\), and a user emotion instruction \(E = \{c, \alpha, \beta\}\), where \(c\) is the target emotion category and \(\alpha\), \(\beta\) are the positive and negative guidance strengths. The output is the emotion-controllable dubbing audio \(\hat{Y}\). Four modules run in sequence: (1) Lip-related Prosody Aligning (LPA) aligns lip movements with prosody; (2) Pronunciation Enhancing (PE) enriches pronunciation information; (3) Speaker Identity Adapting (SIA) injects speaker style to generate an acoustic prior; (4) Flow-based User Emotion Controlling (FUEC) generates mel-spectrograms via flow matching with emotion injected. Finally, a vocoder converts the mel-spectrogram into a waveform.

Key Designs

Lip-related Prosody Aligning (LPA):
- Function: Learn the inherent consistency between lip movements and speech prosody, establishing accurate temporal alignment.
- Mechanism: Use the lip embedding \(\mathcal{E}\) as Query and the phone prosody embedding \(\mathcal{O}_p\) (style phonemes + pitch + energy) as Key/Value, and obtain the lip-prosody context sequence \(C_{pho}\) through multi-head attention. The key innovation is Duration-Level Contrastive Learning (DLCL): the 0-1 duration matrix \(M^{gt}_{lip,pho}\), obtained by forced alignment with MFA (Montreal Forced Aligner), marks the positive lip-phoneme pairs, and the loss pushes attention weights at the correct temporal positions above those elsewhere. The loss is \(\mathcal{L}_{cl} = -\log \frac{\sum\exp(\text{sim}^+/\tau)}{\sum\exp(\text{sim}/\tau)}\), where \(\text{sim}^+\) ranges over positive pairs, \(\text{sim}\) over all pairs, and \(\tau\) is the temperature. This encourages a monotonic and surjective alignment.
- Design Motivation: Previous methods used simple MSE or unconstrained attention, which failed to guarantee precise temporal correspondence between lip movements and phonemes. DLCL explicitly enforces monotonic alignment via contrastive learning of positive-negative pairs, which is more flexible and precise compared to diagonal constraints.
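
The DLCL objective described above can be illustrated with a minimal NumPy sketch. It assumes the loss is computed per lip frame (one row of the similarity matrix) and then averaged; the summary does not state the exact reduction, so that choice, the function name `dlcl_loss`, the toy matrices, and the temperature value are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def dlcl_loss(sim, mask, tau=0.1):
    """Duration-level contrastive loss, averaged over lip frames.

    sim:  (F, P) similarity between each lip frame and each phoneme.
    mask: (F, P) 0-1 matrix, 1 where the phoneme is active in that frame.
    tau:  temperature (0.1 is an illustrative value, not from the paper).
    """
    logits = sim / tau
    # Subtract the row-wise max for numerical stability; the ratio is unchanged.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    pos = (exp * mask).sum(axis=1)   # mass on correctly aligned phonemes
    total = exp.sum(axis=1)          # mass on all phonemes
    return float(-np.log(pos / total).mean())

# Toy example: 4 lip frames, 2 phonemes, each phoneme spans 2 frames.
mask = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
aligned = np.where(mask == 1, 1.0, -1.0)   # high similarity where mask is 1
misaligned = -aligned                      # high similarity in the wrong place
loss_aligned = dlcl_loss(aligned, mask)
loss_misaligned = dlcl_loss(misaligned, mask)
```

In this toy case the loss is close to zero when the similarity mass sits on the positions marked by the duration matrix and large when it sits elsewhere, which is the behaviour the contrastive term is meant to reward.
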
Pronunciation Enhancing (PE):
- Function: Expand phone-level information to the video-frame level and fuse it with the lip-prosody sequence to enhance speech clarity.
- Mechanism: Extract the explicit duration \(D_p\) of each phoneme from the attention matrix with Monotonic Alignment Search (MAS), then use a length regulator to expand the phone embedding \(\mathcal{O}_s \in \mathbb{R}^{P \times d_m}\) to the video-frame level, \(\mathcal{O}^v_s \in \mathbb{R}^{F \times d_m}\), where \(P\) is the number of phonemes and \(F\) the number of video frames. An Audio-Visual Efficient Conformer (AVEC; 5 Conformer blocks + a CTC layer) then fuses the lip-prosody context \(C_{pho}\) with the pronunciation enhancement sequence \(\mathcal{O}^v_s\). The CTC layer supervises pronunciation by maximizing the probability of the correct phoneme sequence.
- Design Motivation: Existing dubbing methods operate at frame or mel-spectrogram levels, neglecting phoneme-level pronunciation details. Explicitly extending phone sequences and fusing them via Conformer provides phoneme-level supervision to avoid "muffled" speech.
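
The length-regulator step described above can be sketched in a few lines of NumPy. The sketch assumes a hard 0-1 alignment is already available (MAS itself is not implemented here), and the helper names `durations_from_alignment` and `length_regulate` are hypothetical, chosen only for illustration.

```python
import numpy as np

def durations_from_alignment(align):
    """Count frames per phoneme from a hard 0-1 alignment of shape (F, P)."""
    return align.sum(axis=0).astype(int)

def length_regulate(phone_emb, durations):
    """Repeat each phoneme embedding by its duration in video frames.

    phone_emb: (P, d_m) array, one row per phoneme.
    durations: length-P sequence of non-negative integers summing to F.
    Returns an (F, d_m) array.
    """
    durations = np.asarray(durations, dtype=int)
    if durations.shape[0] != phone_emb.shape[0]:
        raise ValueError("need one duration per phoneme")
    return np.repeat(phone_emb, durations, axis=0)

# Toy example: P = 3 phonemes, F = 6 video frames, d_m = 2.
align = np.array([[1, 0, 0],
                  [1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 1],
                  [0, 0, 1]])
phone_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
durations = durations_from_alignment(align)      # frames per phoneme
frame_emb = length_regulate(phone_emb, durations)
```

After expansion, every video frame carries the embedding of the phoneme spoken in that frame, so the result can be fused frame by frame with the lip-prosody context.
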
Flow-based User Emotion Controlling (FUEC):
- Function: Inject user-specified emotion categories and intensities during the flow matching generation process.
- Mechanism: Train a Flow Matching Prediction Network (FMPN) with OT-CFM to generate mel-spectrograms. During inference, a Positive-Negative Guidance Mechanism (PNGM) is introduced: a pre-trained emotion classifier \(\psi\) predicts the emotion distribution of the current intermediate state \(\phi_t(x)\), and its gradients with respect to that state modify the velocity field to \(\tilde{v}_{t,i} = v_t + \gamma\left(\alpha \nabla\log p_\psi(c_i \mid \phi_t(x)) - \beta \nabla\log p_\psi\left(\sum_{j\neq i} l_j c_j \mid \phi_t(x)\right)\right)\). Here \(\gamma\) is the overall guidance scale, \(\alpha\) scales the positive guidance that enhances the target emotion \(c_i\), and \(\beta\) scales the negative guidance that suppresses the mix of the other emotions \(c_j\) with weights \(l_j\). Users control the emotion by adjusting \(\alpha \in [0,9]\) and \(\beta \in [0,2]\).
- Design Motivation: Traditional classifier guidance only enhances the target emotion, but human speech features mixed emotions. PNGM achieves more precise emotional styling by performing simultaneous enhancement and suppression: increasing \(\alpha\) emphasizes the target emotion, while increasing \(\beta\) dampens other emotions, allowing both to be independently adjusted.
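
The guided velocity field above can be made concrete with a small NumPy sketch. Several things here are assumptions made only for illustration: the emotion classifier is replaced by a toy linear-softmax model whose gradients are available in closed form (the paper uses a pre-trained network), the probability of the "mix of other emotions" is read as \(\sum_{j\neq i} l_j\, p_\psi(c_j \mid \cdot)\), and the weights default to uniform. The function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_probs(W, x):
    """Return p(c_k | x) and d/dx log p(c_k | x) for a linear-softmax classifier."""
    p = softmax(W @ x)
    return p, W - p @ W              # row k: W_k - sum_m p_m W_m

def guided_velocity(v, W, x, target, alpha, beta, gamma, weights=None):
    """v + gamma * (alpha * grad log p(target) - beta * grad log p(mix of others))."""
    p, g = grad_log_probs(W, x)
    others = [k for k in range(W.shape[0]) if k != target]
    l = np.ones(len(others)) / len(others) if weights is None else np.asarray(weights)
    mix = (l * p[others]).sum()                     # assumed p(mix) = sum_j l_j p_j
    grad_mix = (l * p[others]) @ g[others] / mix    # gradient of log p(mix)
    return v + gamma * (alpha * g[target] - beta * grad_mix)

# Toy setup: 3 emotion classes, 3-dimensional state, zero base velocity.
W = np.eye(3)
x = np.zeros(3)
v = np.zeros(3)
v_pos = guided_velocity(v, W, x, target=0, alpha=1.0, beta=0.0, gamma=1.0)
v_neg = guided_velocity(v, W, x, target=0, alpha=0.0, beta=1.0, gamma=1.0)
p_before = softmax(W @ x)[0]
p_after = softmax(W @ (x + 0.5 * v_pos))[0]        # one Euler step of size 0.5
```

In this toy case both terms move the state toward the target class: the positive term raises its log-probability directly, while the negative term lowers the probability mass of the remaining classes. With \(\alpha = \beta = 0\) the function returns the unguided velocity \(v_t\).
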

Loss & Training

The total training loss has three terms: the flow matching loss \(\mathcal{L}_\theta\) (MSE between predicted and target vector fields), the contrastive learning loss \(\mathcal{L}_{cl}\) (DLCL in LPA), and the CTC loss (pronunciation correctness in PE). The emotion classifier \(\psi\) is pre-trained on 13 emotion datasets from EmoBox (50,000+ recordings). The phoneme encoder, USL, and flow decoder are pre-trained on LibriSpeech. During inference, the guidance scale is set to \(\gamma=15\).
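
The flow matching term can be sketched as follows. This uses the commonly used OT-CFM formulation, with \(x_t = (1 - (1-\sigma_{\min})t)\,x_0 + t\,x_1\) and target velocity \(u_t = x_1 - (1-\sigma_{\min})\,x_0\); the summary does not spell out the exact variant or the value of \(\sigma_{\min}\), so both are assumptions here, and `predict_velocity` is a hypothetical stand-in for the FMPN.

```python
import numpy as np

def cfm_training_pair(x1, x0, t, sigma_min=1e-4):
    """Return the interpolated sample x_t and the target velocity u_t."""
    xt = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    ut = x1 - (1.0 - sigma_min) * x0
    return xt, ut

def cfm_loss(predict_velocity, x1, rng, sigma_min=1e-4):
    """MSE between the predicted and target velocity at a random time step."""
    x0 = rng.standard_normal(x1.shape)   # noise sample
    t = rng.uniform()                    # time in [0, 1)
    xt, ut = cfm_training_pair(x1, x0, t, sigma_min)
    return float(np.mean((predict_velocity(xt, t) - ut) ** 2))

# Toy example: an 80-bin, 10-frame "mel-spectrogram" and a predictor that outputs zeros.
rng = np.random.default_rng(0)
x1 = np.ones((80, 10))
loss = cfm_loss(lambda xt, t: np.zeros_like(xt), x1, rng)
```

Because \(x_t\) is linear in \(t\), the target velocity is constant along each path, which is what makes the regression target simple to compute.
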

Key Experimental Results

Main Results

Chem benchmark (Setting 1.0 + 2.0):

Method	LSE-C↑	LSE-D↓	WER (%)↓	SECS (%)↑
GT	8.12	6.59	3.85	100.0
HPMDubbing	7.85	7.19	16.05	85.09
StyleDubber	3.87	10.92	13.14	87.72
Speaker2Dub	3.76	10.56	16.98	74.73
EmoDubber	8.11	6.92	11.72	90.62

Zero-shot setting (Dub 3.0):

Method	LSE-C↑	LSE-D↓	WER (%)↓	MOS-S↑	MOS-N↑
StyleDubber	6.17	9.11	15.10	4.03	3.85
Speaker2Dub	4.83	10.39	15.91	3.98	4.01
EmoDubber	7.40	6.65	14.03	4.07	4.05

Ablation Study

Configuration	LSE-C↑	LSE-D↓	WER (%)↓	Description
Full EmoDubber	8.11	6.92	11.72	Full model
w/o DLCL	~5.5	~9.0	~14.0	Removing contrastive learning leads to a significant decrease in lip-sync
w/o PE	~7.0	~7.5	~15.0	Removing pronunciation enhancement increases WER
w/o PNGM	-	-	-	Emotion intensity becomes uncontrollable

Key Findings

EmoDubber nearly matches the ground truth on LSE-C (8.11 vs. 8.12) and leads all baselines on both lip-sync metrics; HPMDubbing is the closest baseline at an LSE-C of 7.85.
WER drops to 11.72%, compared with 16.05% for HPMDubbing and 13.14% for StyleDubber, a notable improvement in pronunciation clarity.
Speaker similarity (SECS) reaches 90.62%, above every baseline; StyleDubber is next at 87.72%.
In the zero-shot multi-speaker setting, EmoDubber is best on all five reported metrics, indicating strong generalization.
In PNGM, \(\alpha\) and \(\beta\) can be controlled independently: increasing \(\alpha\) raises the Intensity Score of the target emotion, while increasing \(\beta\) suppresses the other emotions.

Highlights & Insights

Near-GT lip-sync: An LSE-C of 8.11 against 8.12 for the ground truth suggests that DLCL's explicit duration-level alignment is highly effective. It is a more direct alignment signal than the hierarchical modeling in HPMDubbing.
Positive-Negative Guidance (PNGM): Beyond enhancing the target emotion, it actively suppresses non-target emotional components, which fits the mixed nature of human expression. The design could plausibly transfer to other flow-matching or diffusion generation tasks that need precise attribute control.
Integrated pipeline from phonemes to waveform: Alignment, pronunciation enhancement, style injection, and emotion control are combined in a single system, covering all four dimensions named in the Key Challenge.

Limitations & Future Work

The emotion classifier is fixed and pre-trained, which limits the set of controllable emotion classes to those it was trained on.
Evaluation covers only English datasets; cross-lingual adaptability is unexplored.
Training requires MFA forced alignments and emotion labels as additional supervision.
Usability cost is non-trivial, since \(\gamma\), \(\alpha\), and \(\beta\) in PNGM must be set manually by the user.
EmoDubber is not compared with recent large-scale speech generation models (e.g., VALL-E, VoiceCraft).

vs HPMDubbing: HPMDubbing models the visual-to-prosody mapping hierarchically using lip, face, and scene tracks. EmoDubber aligns explicitly at the duration level via DLCL, leads in lip-sync (LSE-C 8.11 vs. 7.85; LSE-D 6.92 vs. 7.19), and additionally offers emotion control.
vs StyleDubber: StyleDubber leverages a multi-scale style adapter to enhance speaker characteristics. EmoDubber's SIA module serves a similar purpose but incorporates an emotional control dimension via FUEC.
vs Emotional TTS (e.g., EmoSphere): Conventional emotional TTS relies only on positive guidance for emotional enhancement. EmoDubber's PNGM incorporates negative guidance to suppress non-target emotions, securing finer control.

Rating

Novelty: ⭐⭐⭐ The positive-negative guidance mechanism and duration-level contrastive learning are novel, though the overall system is essentially a multi-module pipeline.
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple dubbing datasets, a zero-shot setup, and MOS evaluations offer solid coverage.
Writing Quality: ⭐⭐⭐⭐ The architecture is clearly detailed with excellent formula-to-figure correspondence.
Value: ⭐⭐⭐⭐ Unifies lip-sync, pronunciation clarity, and emotion control in a single dubbing framework, making it directly relevant to movie post-production.