
DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation

Conference: ICCV 2025
arXiv: 2410.07151
Code: Project Page
Area: Image Generation / Face Video Generation
Keywords: face video dataset, video generation, text-to-video, image-to-video, diffusion models

TL;DR

This paper introduces DH-FaceVid-1K, a large-scale high-quality face video dataset comprising 1,200+ hours, 270,043 video clips, and 20,000+ unique identities. It specifically addresses the severe underrepresentation of Asian faces in existing datasets and empirically validates scaling laws with respect to data volume and model parameter count through systematic experiments.

Background & Motivation

Face video generation is one of the most active tasks in video generation, underpinning applications such as talking-head synthesis and text-driven video generation. However, many state-of-the-art methods rely on proprietary private data, while publicly available datasets suffer from three core limitations:

Insufficient total duration: CelebV-HQ (68h) and CelebV-Text (279h) are far from adequate for pretraining needs.

Quality–quantity trade-off: VoxCeleb2 offers 2,400h but at only 224×224 resolution; TalkingHead-1KH is similarly resolution-constrained.

Lack of diversity: Asian faces are severely underrepresented in existing datasets, limiting model generalization across ethnic groups.

The paper further identifies common quality issues in existing public datasets, including low sharpness/resolution, multiple faces per frame, hand/object occlusion, and overlaid captions or noise artifacts, all of which substantially degrade training effectiveness.

Method

Overall Architecture

The construction of DH-FaceVid-1K proceeds in four key stages:

  1. Raw video collection: Interview programs and vlog-style videos are collected from crowdsourcing platforms (single subject, professional recording environment, high-quality equipment), yielding over 2,000 hours of raw footage.

  2. Face detection and cropping: Clips are cropped to include the full face and upper-shoulder region, ensuring a minimum face region of 256×256.

  3. Noise filtering: Subtitle detection via OCR, letterbox detection, multi-face exclusion, manual screening for hand/occlusion artifacts, and enhancement of blurry clips with CodeFormer.

  4. Annotation generation: DWPose extracts facial keypoints; PLLaVA generates initial text descriptions; 100+ human annotators perform cross-validation over six months.
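
A minimal Python sketch of the cropping step in stage 2, assuming an OpenCV Haar-cascade detector and an ad-hoc shoulder margin; the paper does not specify its detector or crop parameters, so those choices are illustrative only.

```python
import cv2

# Hypothetical sketch of the face-cropping stage: detect the single face in a
# sampled frame, expand the box to include the upper-shoulder region, and
# reject frames whose face region is smaller than 256x256.
FACE_MIN_SIZE = 256          # minimum face region required by the dataset
SHOULDER_MARGIN = 0.6        # assumed expansion factor around the face box

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def crop_face_region(frame):
    """Return a face+shoulder crop, or None if the frame should be discarded."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) != 1:                     # no-face or multi-face frames are excluded
        return None
    x, y, w, h = faces[0]
    if w < FACE_MIN_SIZE or h < FACE_MIN_SIZE:
        return None                         # face too small for the 256x256 requirement
    # Expand the box downward and sideways to keep the upper-shoulder region.
    dx, dy = int(w * SHOULDER_MARGIN), int(h * SHOULDER_MARGIN)
    x0, y0 = max(0, x - dx), max(0, y - dy // 2)
    x1 = min(frame.shape[1], x + w + dx)
    y1 = min(frame.shape[0], y + h + 2 * dy)
    return frame[y0:y1, x0:x1]
```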

Key Designs

  1. Multi-ethnic coverage and Asian face supplementation: Approximately 1,000 hours of in-house collected data (95% Asian faces) are combined with 200 hours of cleaned data from CelebV-HQ, CelebV-Text, and TalkingHead-1KH. The final dataset comprises 80% Asian, 11% Caucasian, 4% African, and 5% other ethnicities, addressing the demographic bias prevalent in existing datasets.

  2. Rigorous data quality control pipeline: (1) Subtitle detection: five frames are randomly sampled per clip for OCR; clips with more than 10 recognized characters are flagged; (2) Letterbox detection: continuous black borders exceeding 20 pixels are identified; (3) Face filtering: OpenCV discards clips with multiple faces; FaceXFormer filters out individuals under 22 years of age; (4) Two-stage manual verification: Stage 1 cross-validates static and dynamic attributes per ISO 2859 standards; Stage 2 further reviews flagged problematic samples. A minimal sketch of the automatic checks appears after this list.

  3. Audio filtering and lip-audio synchronization: To address SyncNet score bias for non-English speech, a SyncNet model is retrained to generate synchronization scores for each clip, filtering out samples with poor lip-audio alignment and ensuring dataset suitability for audio-driven talking-head generation.
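
The sketch below shows how the automatic checks from design 2 and the sync-score filter from design 3 could be chained. Only the 5-frame, 10-character, and 20-pixel rules come from the paper; the pytesseract OCR backend, the black-pixel intensity cutoff, and the sync-score threshold are assumptions for illustration.

```python
import numpy as np
import pytesseract  # assumed OCR backend; the paper does not name its OCR tool

NUM_SAMPLED_FRAMES = 5       # frames sampled per clip for subtitle OCR (from the paper)
MAX_OCR_CHARS = 10           # clips with more recognized characters are flagged (from the paper)
LETTERBOX_MIN_PX = 20        # continuous dark border width that triggers rejection (from the paper)
BLACK_THRESHOLD = 16         # assumed pixel-intensity cutoff for "black" borders

def has_subtitles(frames):
    """Flag a clip if OCR finds more than MAX_OCR_CHARS characters in sampled frames."""
    idx = np.linspace(0, len(frames) - 1, NUM_SAMPLED_FRAMES).astype(int)
    for i in idx:
        text = pytesseract.image_to_string(frames[i]).strip()
        if len(text.replace(" ", "")) > MAX_OCR_CHARS:
            return True
    return False

def has_letterbox(frame):
    """Detect continuous dark borders wider than LETTERBOX_MIN_PX at the top/bottom."""
    gray = frame.mean(axis=2)                  # H x W luminance approximation
    row_is_dark = (gray < BLACK_THRESHOLD).all(axis=1)
    top = np.argmax(~row_is_dark)              # first non-dark row from the top
    bottom = np.argmax(~row_is_dark[::-1])     # first non-dark row from the bottom
    return top > LETTERBOX_MIN_PX or bottom > LETTERBOX_MIN_PX

def keep_clip(frames, sync_score, sync_threshold=3.0):
    """Combine the automatic filters; sync_threshold is an assumed value."""
    return (not has_subtitles(frames)
            and not has_letterbox(frames[0])
            and sync_score >= sync_threshold)
```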

Loss & Training

As a dataset paper, no training strategy is proposed per se. However, the following empirical guidelines are provided for downstream model training:

  • The optimal data scale for a 2B-parameter DiT model is approximately 600 hours.
  • Models with 5B/6B parameters require the full 1,000+ hours to reach their performance ceiling.
  • Small models overfit on large datasets; large models exhibit training instability on small datasets.

Key Experimental Results

Main Results (Text-to-Video)

Method Dataset FVD (↓) FID (↓) CLIP (↑)
CogVideoX HDTF 127.88 17.83 0.9247
CogVideoX CelebV-HQ 137.62 17.31 0.9305
CogVideoX CelebV-Text 129.59 15.82 0.9388
CogVideoX DH-FaceVid-1K 98.01 11.73 0.9401
EasyAnimate CelebV-Text 121.22 16.53 0.9274
EasyAnimate DH-FaceVid-1K 113.27 13.91 0.9240

Ablation Study (Data Scaling Laws, CogVideoX T2V)

Data Scale 2B FVD 2B FID 5B FVD 5B FID
100h 215.06 17.01 237.52 18.17
200h 185.06 16.23 203.52 16.37
400h 177.18 14.16 180.99 14.25
600h 145.14 12.50 143.25 12.83
800h 148.82 13.15 121.63 12.05
1000h 150.27 13.31 98.01 11.73

The 2B model reaches a performance inflection point at 600h (additional data yields only marginal gains or slight degradation), whereas the 5B model continues to benefit monotonically from more data.
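
A small matplotlib sketch that plots the FVD columns from the table above to make the inflection point visible; the numbers are copied from the table, everything else is illustrative.

```python
import matplotlib.pyplot as plt

# FVD values copied from the scaling-law table above (CogVideoX T2V).
hours = [100, 200, 400, 600, 800, 1000]
fvd_2b = [215.06, 185.06, 177.18, 145.14, 148.82, 150.27]
fvd_5b = [237.52, 203.52, 180.99, 143.25, 121.63, 98.01]

plt.plot(hours, fvd_2b, marker="o", label="CogVideoX 2B")
plt.plot(hours, fvd_5b, marker="s", label="CogVideoX 5B")
plt.axvline(600, linestyle="--", color="gray")   # 2B inflection point around 600h
plt.xlabel("Training data (hours)")
plt.ylabel("FVD (lower is better)")
plt.legend()
plt.title("Data scaling: 2B saturates near 600h, 5B keeps improving")
plt.show()
```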

Key Findings

  1. All models trained on DH-FaceVid-1K significantly outperform those trained on public datasets: CogVideoX T2V FVD decreases from 129.59 to 98.01 (−24.3%) and FID from 15.82 to 11.73 (−25.9%).
  2. Scaling laws: The 2B model achieves optimal cost-effectiveness at ~600h; models with 5B+ parameters require the full dataset to realize their potential.
  3. DiT vs. UNet: DiT architectures (Latte, CogVideoX) generally outperform UNet architectures (AnimateDiff), though at higher training resource cost.
  4. Models trained on CelebV-Text exhibit visible artifacts when generating Asian faces (random hand-like noise, multi-face artifacts); models trained on DH-FaceVid-1K do not exhibit these issues.
  5. I2V tasks also benefit: CogVideoX I2V achieves FVD 92.31 on DH-FaceVid-1K vs. 123.25 on CelebV-Text.

Highlights & Insights

  • Exceptional data engineering value: A two-stage cross-validation quality control pipeline with 100+ annotators over six months ensures industry-leading data quality.
  • Scaling law experiments provide actionable guidance: different model scales correspond to distinct optimal data volumes, preventing blind data accumulation.
  • Empirical DiT vs. UNet comparison offers a valuable reference for backbone selection.
  • Addresses the practical gap of Asian face data scarcity, with significant implications for real-world deployment.

Limitations & Future Work

  • The 80% proportion of Asian faces leaves other ethnicities still underrepresented, potentially introducing a new demographic bias.
  • The strict dataset access process (application form submission and agreement signing) may limit broad adoption by the academic community.
  • Generation of longer videos (>15 s) remains unexplored.
  • Direct comparison with recent dedicated talking-head methods such as Hallo3 is absent.
  • Experiments on audio-driven generation tasks are limited.
  • The results validate the importance of domain-specific high-quality datasets for fine-tuning pretrained models, a conclusion generalizable to other vertical domains.
  • The data processing pipeline (subtitle detection → letterbox removal → multi-face filtering → manual review → audio synchronization) can serve as a standard reference for video dataset construction.
  • The scaling law analysis framework (model parameters × data scale × performance metrics) provides an evaluation methodology for future dataset papers.

Rating

  • Novelty: ⭐⭐⭐ The dataset construction pipeline is well-established but methodologically incremental; the primary contribution lies in the data itself.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Systematic experiments across 5 T2V models, 5 I2V models, 6 data scales, and 2 model parameter scales.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure, comprehensive statistics, and in-depth scaling law analysis.
  • Value: ⭐⭐⭐⭐ Fills a critical gap in large-scale high-quality face video datasets; the scaling law analysis offers practical guidance.