Zero-Shot Text-to-Speech for Vietnamese
Conference: ACL 2025
arXiv: 2506.01322
Code: https://huggingface.co/datasets/thivux/phoaudiobook (dataset)
Area: Audio & Speech / TTS
Keywords: Zero-Shot TTS, Vietnamese, PhoAudiobook, Speech Synthesis, Low-Resource Languages
TL;DR
To address the lack of high-quality long-audio datasets for Vietnamese zero-shot TTS, the authors construct the 941-hour PhoAudiobook dataset. Systematic experiments with three SOTA zero-shot TTS models (VALL-E, VoiceCraft, and XTTS-v2) show that training on PhoAudiobook improves model performance. Specifically, XTTS-v2 trained on PhoAudiobook outperforms the baseline viXTTS on long-sentence synthesis, while VALL-E and VoiceCraft are more robust on short-sentence synthesis.
Background & Motivation
Zero-shot TTS aims to synthesize speech for unseen speakers using only a few seconds of reference audio, which has been a research hotspot in the TTS field recently. Language modeling-based methods such as VALL-E and VoiceCraft have achieved remarkable results in English.
Limitations of Low-Resource Languages:
- Low-resource languages like Vietnamese lack the large-scale, high-quality datasets required for training zero-shot TTS.
- Existing Vietnamese speech datasets (such as VinBigData, BUD500, and viVoice) suffer from several key defects:
  - Audio samples are too short (typically <10 seconds), making them unsuitable for TTS models that require long contexts.
  - Speaker identities are missing (viVoice uses YouTube channel names as approximations, but a channel may contain multiple speakers).
  - Text is not normalized (e.g., numbers are not written out as words).
  - Audio quality is inconsistent (e.g., recordings from consumer-grade devices, background noise).

Our Contributions: Instead of proposing a new TTS model architecture, this work focuses on dataset construction: building a large-scale Vietnamese audio dataset that is genuinely suitable for zero-shot TTS training, and validating the performance of existing SOTA models on Vietnamese using this dataset.
Method
Overall Architecture
- Construction of the PhoAudiobook dataset: Starting from raw audiobook audio, passing through a complete pipeline of background removal \(\rightarrow\) transcription \(\rightarrow\) quality filtering \(\rightarrow\) speaker diarization \(\rightarrow\) text normalization.
- Training three zero-shot TTS models (VALL-E, VoiceCraft, and XTTS-v2) on PhoAudiobook.
- Comparative evaluation against the baseline viXTTS on multiple test sets using both objective and subjective metrics.
Key Designs
PhoAudiobook Dataset Construction Pipeline:
- Raw Data Collection: Collected 23K hours of audio, covering 2,697 audiobooks and 735 speakers, from public audiobook websites.
- Background Music Removal: Used Demucs to extract vocal tracks.
- Transcription Generation: Whisper-large-v3 was used to generate transcriptions and timestamps.
- Long Audio Merging: Merged continuous short audio segments into 10-20 second long samples (this is a key innovation, as other datasets have audio <10 seconds).
- Dual-Model Cross-Validation: Whisper-large-v3 and PhoWhisper-large independently generated transcriptions, keeping only samples where both transcriptions matched exactly — ensuring transcription quality.
- Multi-Speaker Filtering: Used the wav2vec2-bartpho model to identify and filter out audio containing multiple speakers.
- Post-processing: Removed excessively short transcriptions (<25 words), trimmed leading/trailing silences, and normalized volume.
- Text Normalization: Finetuned an mbart-large-50 model to handle conversions such as numbers to words.
- Speaker Balancing: Capped at 4 hours per speaker, resulting in a final dataset of 941 hours across 735 speakers.
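Two of the steps above, long audio merging and dual-model cross-validation, can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: the `Segment` type, the greedy merging rule, the `max_gap` threshold, and the text normalization applied before comparing transcripts are all hypothetical choices (the source only states that segments are merged into 10-20 second samples and that both transcriptions must match exactly).

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def merge_segments(segments, min_len=10.0, max_len=20.0, max_gap=0.5):
    """Greedily merge consecutive segments into samples of min_len..max_len seconds.

    Assumption: segments are sorted by start time. A segment is appended to the
    current sample if it starts within max_gap seconds of the sample's end and
    the merged sample would not exceed max_len. Samples shorter than min_len
    are dropped.
    """
    merged, current = [], None
    for seg in segments:
        if current is None:
            current = Segment(seg.start, seg.end, seg.text)
            continue
        contiguous = seg.start - current.end <= max_gap
        fits = seg.end - current.start <= max_len
        if contiguous and fits:
            current = Segment(current.start, seg.end, current.text + " " + seg.text)
        else:
            if current.end - current.start >= min_len:
                merged.append(current)
            current = Segment(seg.start, seg.end, seg.text)
    if current is not None and current.end - current.start >= min_len:
        merged.append(current)
    return merged


def normalize(text):
    """Assumed normalization: NFC, lowercase, punctuation removed, spaces collapsed."""
    text = unicodedata.normalize("NFC", text).lower()
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(kept.split())


def transcripts_agree(hyp_a, hyp_b):
    """Keep a sample only if both ASR transcripts are identical after normalization."""
    return normalize(hyp_a) == normalize(hyp_b)
```

In use, `hyp_a` and `hyp_b` would be the Whisper-large-v3 and PhoWhisper-large outputs for the same merged sample; a sample is discarded whenever `transcripts_agree` returns `False`.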
Data Augmentation Strategy:
- Reprocessed the 940-hour training set through the pipeline as new raw data, but skipped the long audio merging and short sample filtering steps.
- Obtained an additional 554 hours of short audio, bringing the total training data to 1,494 hours.
- Design Motivation: To ensure the models can handle input texts of varying lengths.
Evaluation System Design:
- 4 Test Sets: PhoAudiobook-Seen (seen speakers), PhoAudiobook-Unseen (unseen speakers), VIVOS (out-of-distribution short audio), viVoice (out-of-distribution).
- Objective Metrics: WER (word error rate / intelligibility), MCD (mel-cepstral distortion), RMSE_F0 (prosody matching).
- Subjective Metrics: MOS (mean opinion score for overall quality, 1-5 scale), SMOS (similarity MOS for speaker similarity, 1-5 scale).
- 10-20 native evaluators, with model names anonymized and randomized during testing.
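For reference, WER is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below implements that definition directly. It assumes whitespace tokenization and already-normalized text; in TTS evaluation the hypothesis is typically an ASR transcript of the synthesized audio, and that ASR step is outside the scope of this example.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

The function returns a fraction, so multiply by 100 to compare with the percentages in the tables. Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference, which is how redundant trailing speech inflates the score on short inputs.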
Loss & Training
Each of the three models uses its standard training procedure:
- VALL-E: Conditional codec language modeling.
- VoiceCraft: Token rearrangement + left-to-right language modeling.
- XTTS-v2: Multilingual finetuning based on the Tortoise architecture.

All models were trained on the 1,494 hours of augmented training data.
Key Experimental Results
Main Results (Objective Metrics)
| Model | PAB-S WER↓ | PAB-U WER↓ | VIVOS WER↓ | viVoice WER↓ |
|---|---|---|---|---|
| VALL-E_PAB | 24.96 | 12.90 | 12.63 | 13.58 |
| VoiceCraft_PAB | 7.53 | 15.14 | 13.53 | 21.70 |
| XTTS-v2_PAB | 4.16 | 4.31 | 37.81 | 8.32 |
| viXTTS (baseline) | 4.23 | 5.17 | 37.81 | 12.54 |
| Model | PAB-S MCD↓ | PAB-U RMSE_F0↓ | viVoice MCD↓ |
|---|---|---|---|
| XTTS-v2_PAB | 6.30 | 242.51 | 8.34 |
| viXTTS | 7.47 | 271.70 | 8.71 |
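To make the WER comparison between XTTS-v2_PAB and the viXTTS baseline easier to read, the relative reduction per test set can be computed from the table values. The numbers below are copied from the WER table above; the helper name `relative_reduction` is just an illustrative choice.

```python
# WER values (%) copied from the objective-metrics table above.
wer = {
    "XTTS-v2_PAB": {"PAB-S": 4.16, "PAB-U": 4.31, "VIVOS": 37.81, "viVoice": 8.32},
    "viXTTS": {"PAB-S": 4.23, "PAB-U": 5.17, "VIVOS": 37.81, "viVoice": 12.54},
}


def relative_reduction(baseline, system):
    """Relative WER reduction in percent; positive means the system is better."""
    return 100.0 * (baseline - system) / baseline


reductions = {
    test_set: relative_reduction(wer["viXTTS"][test_set], wer["XTTS-v2_PAB"][test_set])
    for test_set in wer["viXTTS"]
}

for test_set, value in reductions.items():
    print(f"{test_set}: {value:.2f}% relative WER reduction")
```

With these values the reduction is roughly a third on viVoice and about a sixth on PAB-U, small on PAB-S, and zero on VIVOS, where both systems report the same WER.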
Main Results (Subjective Metrics)
| Model | PAB-S MOS↑ | PAB-U MOS↑ | VIVOS MOS↑ | viVoice MOS↑ |
|---|---|---|---|---|
| XTTS-v2_PAB | 4.20 | 3.89 | 2.79 | 3.98 |
| VoiceCraft_PAB | 4.16 | 3.75 | 3.85 | 3.98 |
| VALL-E_PAB | 3.96 | 4.04 | 3.44 | 3.75 |
| viXTTS | 4.05 | 3.85 | 2.37 | 3.48 |
| Model | PAB-S SMOS↑ | PAB-U SMOS↑ | VIVOS SMOS↑ | viVoice SMOS↑ |
|---|---|---|---|---|
| VALL-E_PAB | 3.77 | 3.46 | 3.35 | 3.20 |
| XTTS-v2_PAB | 3.55 | 3.56 | 3.03 | 3.39 |
| viXTTS | 2.88 | 2.63 | 2.48 | 3.11 |
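The speaker-similarity gap over the baseline can be read off the SMOS table with a short calculation. The values below are copied from the table above; the script is only a convenience for checking the per-test-set differences and is not part of the paper.

```python
# SMOS values copied from the table above.
smos = {
    "VALL-E_PAB": {"PAB-S": 3.77, "PAB-U": 3.46, "VIVOS": 3.35, "viVoice": 3.20},
    "XTTS-v2_PAB": {"PAB-S": 3.55, "PAB-U": 3.56, "VIVOS": 3.03, "viVoice": 3.39},
}
baseline = {"PAB-S": 2.88, "PAB-U": 2.63, "VIVOS": 2.48, "viVoice": 3.11}

# Absolute SMOS gain of each PhoAudiobook-trained model over viXTTS,
# rounded to two decimals to match the table's precision.
gains = {
    (model, test_set): round(score - baseline[test_set], 2)
    for model, scores in smos.items()
    for test_set, score in scores.items()
}

# (model, test set) pair with the largest gain.
best = max(gains, key=gains.get)

for (model, test_set), gain in sorted(gains.items()):
    print(f"{model:12s} {test_set:8s} +{gain:.2f}")
```

Every gain in the table is positive, and the smallest gaps appear on viVoice, the test set drawn from the data viXTTS was trained on.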
Ablation Study
| Analysis Dimension | Key Findings |
|---|---|
| PhoAudiobook vs viVoice | XTTS-v2_PAB achieves a WER of 8.32 on the viVoice test set, significantly outperforming viXTTS (12.54), even though the latter was trained on this data. |
| Long Sentences vs Short Sentences | VALL-E and VoiceCraft perform much better than XTTS-v2 on the VIVOS short sentence set (WER 12-13 vs 37.81). |
| XTTS-v2 Short Sentence Issues | XTTS-v2 generates redundant/verbose trailing speech for short input texts, which is an architectural issue rather than a data issue. |
| Data Quality | PhoAudiobook has the highest scale-invariant signal-to-noise ratio (SI-SNR, 4.91 dB) among all compared datasets, and it provides clear speaker IDs. |
Key Findings
- PhoAudiobook consistently improves the performance of all models, validating the critical role of high-quality data for low-resource language TTS.
- XTTS-v2 leads comprehensively in long-sentence synthesis, but has architectural defects (generating redundant trailing speech) on short sentences.
- VALL-E and VoiceCraft are more robust in short-sentence scenarios and exhibit complementarity.
- Regarding speaker similarity (SMOS), models trained on PhoAudiobook outperform viXTTS on every test set; the largest gap in the SMOS table is +0.93 points (XTTS-v2_PAB on PAB-U, 3.56 vs 2.63).
- The audio length distribution of the dataset has a direct impact on model behavior (10-20s training samples benefit long-sentence synthesis).
Highlights & Insights
- Reusable Dataset Construction Pipeline: The entire pipeline from audiobooks to high-quality TTS datasets can be directly migrated to other low-resource languages, offering high engineering value.
- Dual-ASR Cross-Validation: Validating transcription accuracy with both Whisper and PhoWhisper and retaining only consistent results is a strategy worth adopting in any speech dataset construction.
- Clever Data Augmentation: Reprocessing the training set through the pipeline while skipping the merging step to obtain short audio augmentation is a simple yet effective solution to the input length generalization problem.
- Main Takeaway: In low-resource language TTS, data quality is often more important than model choice. XTTS-v2 trained on PhoAudiobook achieves a lower WER than viXTTS even on the viVoice test set, which comes from the data viXTTS was trained on.
Limitations & Future Work
- Performance in code-mixed (Vietnamese + English) scenarios was not evaluated.
- The audiobook domain results in relatively slow speech rates (201 wpm vs 229-243 wpm in other datasets), which may affect generalization to fast speech.
- The dataset is released for non-commercial use only, which limits practical applications.
- The performance of newer TTS models (e.g., CosyVoice, etc.) was not explored.
- The "Hard" task contains only 40 samples, lacking statistical reliability.
- The 16kHz sampling rate is lower than the standard of some high-quality TTS studies (24kHz+).
Related Work & Insights
- VALL-E (Wang et al., 2023) pioneered the codec language modeling paradigm for TTS; this work extends it to Vietnamese.
- The token rearrangement strategy of VoiceCraft (Peng et al., 2024) performs well on zero-shot TTS and speech editing.
- XTTS-v2 (Casanova et al., 2024) is currently the strongest multilingual zero-shot TTS, but this work reveals its short-text limitations.
- Insight: The primary bottleneck in low-resource language TTS research is data rather than models. Audiobooks have unique advantages as speech data sources (professional recording, long content, and identifiable speakers), which makes the approach worth extending to more languages.
Rating
- Novelty: ⭐⭐⭐ High engineering innovation in dataset construction, but limited methodology innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage with 3 models, 4 test sets, and both objective and subjective metrics.
- Writing Quality: ⭐⭐⭐⭐ Detailed dataset description, standard experimental design, and high reproducibility.
- Value: ⭐⭐⭐⭐ Direct boost to the Vietnamese TTS community, with pipeline designs that serve as a valuable reference for other low-resource languages.