OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers¶
Conference: NeurIPS 2025 (Spotlight) arXiv: 2505.21448 Project Page: https://ziqiaopeng.github.io/OmniSync/ Area: Image Generation Keywords: Lip Synchronization, Diffusion Transformer, Flow Matching, Classifier-Free Guidance, AIGC Video
TL;DR¶
OmniSync proposes a universal lip synchronization framework based on Diffusion Transformers, introducing three key innovations—a mask-free training paradigm, Flow Matching-based progressive noise initialization, and dynamic spatiotemporal CFG—to substantially outperform prior methods on both real and AI-generated videos, achieving an 87.78% success rate on stylized character lip sync (vs. 67.78% for the previous best).
Background & Motivation¶
Lip synchronization aims to align a speaker's lip movements in video with a target audio track, with broad applications in film dubbing, digital avatars, and telepresence. As AI video generation (T2V models such as Kling, Wan, and Hunyuan) has gained momentum, lip synchronization has evolved from a specialized technique into a foundational capability within the video generation ecosystem.
Existing methods exhibit three major limitations:
- Reliance on reference frames and inpainting masks: Conventional methods extract appearance from reference frames and mask the mouth region of target frames before generating lip motion. This causes boundary artifacts, identity drift, and severe quality degradation under inconsistent head poses.
- Lip leakage: Audio signals are substantially weaker than visual signals; models tend to "peek" at the original lip shape rather than fully replacing it with the mouth shape driven by the target audio.
- Inability to handle stylized content: Methods that depend on face detection and alignment fail outright on non-realistic characters (cartoons, animations, non-human entities)—precisely the type of content T2V models excel at generating.
Furthermore, no evaluation benchmark exists in the literature specifically targeting lip synchronization in AI-generated videos.
Core Problem¶
How to build a universal lip synchronization framework that works on both real human face videos and stylized character videos generated by AI? The core challenges are: (1) eliminating dependence on conventional preprocessing such as face detection and alignment; (2) overcoming the inherent difficulty of using audio as a weak conditioning signal; and (3) strictly preserving identity, pose, and background while modifying lip motion.
Method¶
Overall Architecture¶
OmniSync's pipeline is straightforward: given a source video \(V_{cd}\) and target audio \(A_{ab}\), the model directly outputs a video \(V_{ab}\) whose lip movements are synchronized with the target audio, without requiring any masks or reference frames. The entire system is built on a Diffusion Transformer (DiT) with Flow Matching as the training objective. Audio features are extracted using a pretrained Whisper encoder, and text conditioning is handled by a T5 encoder (descriptive text labels such as "A person speaking loudly with clear facial and tooth movements" are used during training to enhance lip clarity).
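For concreteness, here is a minimal PyTorch sketch of this interface. Every dimension, the token layout, and the way conditioning is fused are illustrative assumptions, not the authors' implementation; the sketch only mirrors the inputs described above (noisy target latents, source-video latents, Whisper audio features, T5 text features, timestep) and the velocity-field output used by Flow Matching.

```python
import torch
import torch.nn as nn

class OmniSyncDiTSketch(nn.Module):
    """Toy stand-in for the DiT editor described above. All dimensions and the
    conditioning fusion scheme are assumptions, not the authors' implementation."""

    def __init__(self, channels=8, audio_dim=384, text_dim=512, hidden=256):
        super().__init__()
        self.in_proj = nn.Linear(2 * channels, hidden)   # noisy target latent + source latent, per position
        self.audio_proj = nn.Linear(audio_dim, hidden)   # Whisper audio features
        self.text_proj = nn.Linear(text_dim, hidden)     # T5 text features
        self.t_embed = nn.Linear(1, hidden)              # timestep embedding
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.out_proj = nn.Linear(hidden, channels)      # predicted velocity field

    def forward(self, x_t, v_src, audio, text, t):
        # x_t, v_src: (B, C, H, W) latents; audio: (B, Na, audio_dim);
        # text: (B, Nt, text_dim); t: (B,) timesteps
        b, c, h, w = x_t.shape
        tokens = torch.cat([x_t, v_src], dim=1).flatten(2).transpose(1, 2)  # (B, H*W, 2C)
        tokens = self.in_proj(tokens) + self.t_embed(t.float()[:, None, None])
        cond = torch.cat([self.audio_proj(audio), self.text_proj(text)], dim=1)
        out = self.blocks(torch.cat([cond, tokens], dim=1))[:, cond.size(1):]
        return self.out_proj(out).transpose(1, 2).reshape(b, c, h, w)
```

The training and inference sketches under Key Designs below assume a model with this call signature.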
Three core modules address, respectively, the training paradigm, inference stability, and the weak audio conditioning signal.
Key Designs¶
- Mask-Free Training Paradigm: The conventional "mask the mouth → inpaint driven by audio" paradigm is abandoned in favor of training the DiT to directly learn a cross-frame editing mapping \((V_{cd}, A_{ab}) \mapsto V_{ab}\). The key challenge is that direct frame editing requires perfectly paired training data (identical pose and identity, differing only in lip shape), which is nearly impossible to obtain in practice. The authors cleverly exploit the progressive denoising property of diffusion models, proposing a timestep-dependent data sampling strategy (a training-step sketch follows this list):
- High-noise timesteps (\(t > 850\), responsible for generating the coarse facial structure): paired pseudo-data from the MEAD dataset (recorded in a fixed laboratory setting, where utterances from the same speaker naturally yield pose-consistent pairs) are used to teach the model to construct facial structure while preserving pose and identity.
- Mid-to-low noise timesteps (\(t \leq 850\), responsible for lip shape generation and detail refinement): the training switches to more diverse, unpaired YouTube data, enabling the model to learn a more generalizable audio-to-lip mapping.
The training loss is the standard Conditional Flow Matching (CFM) loss: \(\mathcal{L}_{CFM}(\theta) = \mathbb{E}[\|v_\theta(x_t, V_{cd}, A_{ab}, t) - u_t(x_t|V_{ab})\|^2_2]\)
- Flow Matching-Based Progressive Noise Initialization: Rather than starting inference from pure random noise (which causes pose drift and identity leakage due to accumulated errors in the early denoising steps), controlled noise is added to the source video frames: \(x_{init} = (1-\tau)V_{source} + \tau\epsilon\) (with \(\tau=0.92\)), and only the remaining 50 denoising steps are executed. This effectively skips the early diffusion stages (which govern macro-level structure), directly inheriting the source video's pose and global layout, allowing the model to focus on editing the mouth region. This ensures spatial consistency while reducing computation.
- Dynamic Spatiotemporal Classifier-Free Guidance (DS-CFG): Addresses the weakness of the audio condition. Standard CFG poses a dilemma: a high scale yields accurate lip sync but introduces artifacts, while a low scale preserves visual quality but degrades lip accuracy. DS-CFG adapts the guidance along two dimensions (an inference sketch combining DS-CFG with the noise initialization above follows this list):
- Spatial dimension: A Gaussian weight matrix \(\mathbf{G}_{spatial}(x,y)\) centered on the mouth is constructed, applying maximal guidance strength (\(\omega_{peak}\)) at the mouth and minimal strength (\(\omega_{base}\)) at distant regions, ensuring strong audio guidance is applied only where modification is needed.
- Temporal dimension: Guidance strength decays over the denoising trajectory as \(\omega(t) = \omega_{peak} \cdot (t/T)^\gamma\) (\(\gamma=1.5\)), with strong guidance early in denoising to establish correct lip shape and weaker guidance later to preserve texture details.
- Final formulation: \(\hat{\epsilon}_\theta = \epsilon_\theta(x_t, \varnothing, t) + \mathbf{G}_{spatial} \cdot \omega(t) \cdot [\epsilon_\theta(x_t, c, t) - \epsilon_\theta(x_t, \varnothing, t)]\)
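Below is a minimal sketch of what the timestep-dependent data sampling and CFM objective could look like in a single training step. The function and loader names (`training_step`, `mead_loader`, `youtube_loader`), the single shared timestep per step, the linear interpolation path, and `T_MAX = 1000` are assumptions for illustration; only the \(t > 850\) data switch, the MEAD/YouTube split, and the L2 velocity loss come from the descriptions above.

```python
import torch

T_MAX, T_THRESHOLD = 1000, 850   # threshold as reported in the paper

def training_step(model, mead_loader, youtube_loader, optimizer):
    """One Conditional Flow Matching step with timestep-dependent data sampling.

    mead_loader / youtube_loader are hypothetical iterators yielding
    (source latents, target latents, audio features, text features) batches.
    """
    t = torch.randint(1, T_MAX + 1, (1,)).item()   # one shared timestep per step (simplification)
    if t > T_THRESHOLD:
        # High-noise steps: pose-consistent MEAD pseudo-pairs teach structure and identity preservation.
        v_src, v_tgt, audio, text = next(mead_loader)
    else:
        # Mid/low-noise steps: diverse YouTube clips teach a generalizable audio-to-lip
        # mapping (the exact source/target pairing for this unpaired data is assumed here).
        v_src, v_tgt, audio, text = next(youtube_loader)

    tau = t / T_MAX                                 # noise level in [0, 1]
    noise = torch.randn_like(v_tgt)
    x_t = (1 - tau) * v_tgt + tau * noise           # linear interpolation path (one common choice)
    u_t = noise - v_tgt                             # its velocity, the regression target

    t_batch = torch.full((v_tgt.shape[0],), t)
    v_pred = model(x_t, v_src, audio, text, t_batch)
    loss = ((v_pred - u_t) ** 2).mean()             # L_CFM

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```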
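And a sketch of inference combining progressive noise initialization with DS-CFG, assuming the model interface from the architecture sketch. The values \(\tau = 0.92\), 50 steps, and \(\gamma = 1.5\), plus the mouth-centred Gaussian weighting and temporal decay, follow the descriptions above; the guidance constants (`w_peak`, `w_base`, `sigma`), zeroed features as the null condition, the exact way the spatial map and temporal decay are combined, the Euler integration, and the externally supplied `mouth_xy` are assumptions.

```python
import torch

def gaussian_spatial_map(h, w, mouth_xy, sigma):
    """Normalised Gaussian map in [0, 1]: 1 at the mouth centre, ~0 far away."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    d2 = (xs - mouth_xy[0]).float() ** 2 + (ys - mouth_xy[1]).float() ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

@torch.no_grad()
def sample(model, v_source, audio, text, mouth_xy,
           tau0=0.92, steps=50, gamma=1.5, w_peak=6.0, w_base=1.0, sigma=32.0):
    """Progressive noise initialization + DS-CFG (a sketch, not the released code)."""
    b, _, h, w = v_source.shape
    # Start from a partially noised copy of the source video instead of pure noise,
    # so pose and global layout are inherited and only ~50 denoising steps remain.
    x = (1 - tau0) * v_source + tau0 * torch.randn_like(v_source)
    g = gaussian_spatial_map(h, w, mouth_xy, sigma)            # (H, W), peaks at the mouth

    taus = torch.linspace(tau0, 0.0, steps + 1)
    null_audio, null_text = torch.zeros_like(audio), torch.zeros_like(text)  # assumed null condition
    for i in range(steps):
        tau = taus[i].item()
        t_batch = torch.full((b,), tau * 1000)
        v_cond = model(x, v_source, audio, text, t_batch)        # audio/text-conditioned velocity
        v_uncond = model(x, v_source, null_audio, null_text, t_batch)
        w_t = w_peak * tau ** gamma                              # temporal decay of guidance strength
        guidance = w_base + g * w_t                              # extra guidance only around the mouth
        x = x + (taus[i + 1].item() - tau) * (v_uncond + guidance * (v_cond - v_uncond))  # Euler step
    return x
```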
Loss & Training¶
- Training objective: Conditional Flow Matching loss (L2 loss on the learned velocity field)
- Training scale: 64× A100 GPUs, batch size 64, 80k steps, completed in 80 hours
- Optimizer: AdamW, lr=1e-5
- Timestep threshold: \(t_{threshold} = 850\)
- Inference noise parameter: \(\tau = 0.92\), 50 denoising steps
- Text conditioning: videos are annotated with descriptive prompts during training; at inference, prompt engineering can control lip clarity and motion magnitude
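For quick reference, the reported settings could be collected into a config like the following; the dict keys and the stand-in module are hypothetical, and only the values are those listed above.

```python
import torch

train_config = {
    "gpus": 64,              # A100s
    "batch_size": 64,
    "train_steps": 80_000,
    "lr": 1e-5,              # AdamW
    "t_threshold": 850,      # timestep boundary for the data-sampling switch
    "inference_tau": 0.92,   # progressive noise initialization strength
    "inference_steps": 50,
}

model = torch.nn.Linear(4, 4)  # placeholder for the DiT backbone
optimizer = torch.optim.AdamW(model.parameters(), lr=train_config["lr"])
```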
Key Experimental Results¶
HDTF Dataset (Real Video)¶
| Method | FID↓ | FVD↓ | CSIM↑ | NIQE↓ | BRISQUE↓ | HyperIQA↑ | LMD↓ | LSE-C↑ |
|---|---|---|---|---|---|---|---|---|
| Wav2Lip | 14.91 | 543.3 | 0.852 | 6.50 | 53.37 | 45.82 | 10.01 | 7.63 |
| IP-LAP | 9.51 | 325.7 | 0.809 | 6.53 | 54.40 | 50.09 | 7.70 | 7.26 |
| MuseTalk | 8.76 | 231.4 | 0.862 | 5.82 | 46.00 | 55.40 | 8.70 | 6.89 |
| LatentSync | 8.52 | 216.9 | 0.859 | 6.27 | 50.86 | 53.21 | 17.34 | 8.05 |
| OmniSync | 7.86 | 199.6 | 0.875 | 5.48 | 37.92 | 56.36 | 7.10 | 7.31 |
AIGC-LipSync Benchmark (AI-Generated Video)¶
| Method | FID↓ | FVD↓ | CSIM↑ | Generation Success Rate↑ | Stylized Character Success Rate↑ |
|---|---|---|---|---|---|
| Wav2Lip | 22.99 | 562.2 | 0.727 | 71.38% | 26.67% |
| MuseTalk | 17.67 | 297.6 | 0.667 | 92.20% | 67.78% |
| LatentSync | 15.37 | 263.1 | 0.751 | 74.96% | 35.56% |
| OmniSync | 10.68 | 211.4 | 0.808 | 97.40% | 87.78% |
User study (5-point scale, 39 participants): OmniSync leads across all five dimensions—lip sync accuracy (3.92), identity preservation (4.13), temporal stability (4.04), image quality (4.05), and video naturalness (3.87).
Ablation Study¶
| Variant | FID↓ | FVD↓ | CSIM↑ | LSE-C↑ |
|---|---|---|---|---|
| Full model | 15.71 | 287.2 | 0.814 | 7.06 |
| w/o timestep-dependent sampling | 21.55 | 549.8 | 0.727 | 7.00 |
| w/o progressive noise initialization | 16.73 | 361.3 | 0.805 | 7.03 |
| Low static CFG | — | — | — | 4.16 |
| High static CFG | 22.73 | 348.3 | 0.782 | 7.10 |
- Timestep-dependent sampling contributes most: removing it drops CSIM by 10.7% and increases FVD from 287 to 550, with visible facial misalignment.
- Progressive noise initialization: removing it increases FVD from 287 to 361, with noticeably degraded temporal consistency.
- Balancing role of DS-CFG: low CFG yields poor lip sync (LSE-C = 4.16); high CFG introduces artifacts (FID = 22.73); DS-CFG achieves a favorable balance (LSE-C = 7.06, FID = 15.71).
Highlights & Insights¶
- Timestep-dependent sampling is the central contribution: leveraging the property that diffusion models learn different content at different timesteps (early → structure; mid → semantics; late → details), different training data are used at different stages. This idea is transferable to any diffusion-based task requiring localized editing.
- Genuinely mask-free design: complete elimination of face detection, alignment, and masking enables the method to handle stylized characters such as cartoons and non-human figures. The 87.78% stylized character success rate on AIGC-LipSync provides substantive empirical support.
- Elegant DS-CFG design: replacing a single global CFG scale with a spatiotemporally adaptive control field—Gaussian-weighted in space and power-decayed in time—offers a useful template for other conditional generation tasks such as audio-driven expression or motion synthesis.
- AIGC-LipSync Benchmark: the first evaluation benchmark specifically targeting lip synchronization in AI-generated video (615 videos covering real humans, stylized characters, and non-human entities), filling a notable gap in the field.
- Solid engineering: 64× A100 training over 80 hours, support for unlimited-length inference, and a user study with 39 evaluators (Cronbach's α = 0.98).
Limitations & Future Work¶
- LSE-C not optimal: LatentSync achieves a slightly higher LSE-C (8.05 vs. 7.31) on HDTF due to its SyncNet-based loss. Incorporating a SyncNet loss could further improve lip accuracy.
- High training and inference cost: 64× A100 training and 50-step denoising at inference limit real-time applicability and accessibility.
- Mouth center localization: DS-CFG's spatial Gaussian requires mouth center coordinates \((x_m, y_m)\); how to determine this center for extremely stylized characters is not fully discussed.
- Limited scale of AIGC-LipSync Benchmark: 615 videos sourced from a small number of T2V models; broader generalization remains to be validated.
- No quantitative comparison with portrait animation methods: comparisons with EchoMimic, Hallo3, and Sonic are confined to qualitative results in the appendix.
Related Work & Insights¶
- vs. Wav2Lip: Wav2Lip pioneered SyncNet-supervised training but suffers from poor visual quality and dependence on face detection. OmniSync improves FID from 14.91 to 7.86 and AIGC video success rate from 71% to 97%.
- vs. LatentSync: Both are diffusion-based methods, but LatentSync still relies on reference frames and inpainting masks. OmniSync leads on all visual quality metrics (FID/FVD/CSIM), while LatentSync holds a marginal advantage in LSE-C due to its SyncNet loss. OmniSync's key advantage is the generality afforded by its mask-free design.
- vs. MuseTalk: MuseTalk achieves a 92.20% success rate on AIGC videos (by selecting pose-matched reference images), but only 67.78% on stylized characters, compared to OmniSync's 87.78%. This demonstrates the fundamental limitation of mask-based methods on non-standard faces.
- vs. portrait animation methods (EMO/Hallo/Sonic): These are image-to-video methods and do not guarantee consistency with a source video. As a video-to-video method, OmniSync inherits the source video's texture, dynamics, and speaking style.
Rating¶
- Novelty: ⭐⭐⭐⭐ The three innovations—mask-free training, timestep-dependent sampling, and DS-CFG—each offer distinct contributions, though individually each can be viewed as a clever combination of existing techniques.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Dual evaluation on HDTF and the newly constructed AIGC-LipSync Benchmark, seven baselines, complete ablation study, 39-participant user study with Cronbach's α = 0.98.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, rigorous notation, well-motivated design choices, with detailed justifications in the appendix.
- Value: ⭐⭐⭐⭐ NeurIPS Spotlight; first to extend lip synchronization to AI-generated video; establishes a new benchmark; offers practical value to industry.