SpeechWeave: Diverse Multilingual Synthetic Text & Audio Data Generation Pipeline for Training Text to Speech Models
Conference: ACL 2025
arXiv: 2509.14270
Code: None
Area: Audio & Speech
Keywords: Text-to-Speech, Synthetic Data, Multilingual, Text Normalization, Voice Cloning
TL;DR
SpeechWeave is an end-to-end synthetic speech data generation pipeline that increases text diversity through keyphrase sampling, normalizes text at the generation stage (\(97\%\) accuracy on English), and uses cross-lingual voice cloning to standardize speakers. The generated data is \(10\)-\(48\%\) more diverse than data from direct LLM prompting, and fine-tuning a downstream TTS model on it reduces WER in both English and Spanish.
Background & Motivation
In the text-to-speech (TTS) domain, high-quality training data is crucial for model success. However, existing data acquisition methods face four core challenges:
Domain Specificity: Public TTS datasets (e.g., LibriSpeech) primarily consist of audiobooks or generic passages, lacking data for specific commercial domains (automotive, healthcare, retail, etc.). Scraping from the web or purchasing data involves high costs and copyright concerns.
Lack of Text Diversity: Although LLMs can generate text, experiments show that when the input prompt stays the same, the short sentences they produce are highly similar, even with high temperature and top_p settings. Pushing these parameters to extreme values makes the output unstable and degrades quality. Direct LLM-based generation of large-scale training data is therefore impractical.
Text Normalization Challenges: Written text and its spoken form often differ for entities such as dates, addresses, and abbreviations (known as "semiotic classes"). Existing normalization tools (such as NeMo) are error-prone and miss variations when the same entity can appear in multiple formats.
Non-scalable Voice Recording: Commercial TTS systems require standardized speech from specific speakers. Relying on professional voice artists for recording is expensive and does not scale.
To address these limitations, the authors propose SpeechWeave, an automated synthetic data pipeline that integrates text generation, normalization, and speech synthesis. The core idea is to resolve diversity and normalization issues directly during the data generation phase rather than applying post-hoc fixes.
Method
Overall Architecture
SpeechWeave consists of four core modules: Keyphrase Sampler, Entity Sampler with At-Source Normalizer, Postprocessor, and Audio Generation Module. The overall workflow is as follows: first generate diverse keyphrases \(\rightarrow\) generate text containing semiotic class entities \(\rightarrow\) perform normalization simultaneously \(\rightarrow\) postprocess \(\rightarrow\) generate standardized speech.
Key Designs
- Keyphrase Sampling:
  - Core Problem: LLMs generate repetitive texts under fixed prompts.
  - Mechanism: Use keyphrase infusion to increase prompt diversity. For example, changing "Generate sentences for the financial domain" to "Generate sentences for the financial domain containing keyphrases 'mortgage loan, asset finance'".
  - Multi-step Generation: Instruct the LLM to generate a list of sub-domains \(\rightarrow\) randomly select one \(\rightarrow\) generate creative paragraphs \(\rightarrow\) extract keyphrases. This iterative multi-step prompting significantly enhances the diversity of ideas.
  - De-duplication Mechanism: Maintain an in-memory keyphrase store and deduplicate based on fuzzy search (token sort ratio + Levenshtein distance). Experiments indicate this generates more diverse keyphrases compared to PhraseBERT embedding similarity methods.
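The keyphrase infusion and fuzzy de-duplication steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: `KeyphraseStore`, `token_sort_similarity`, `infuse_prompt`, and the `0.8` threshold are hypothetical names and values, and the similarity score is a standard-library approximation of a token sort ratio (sort the tokens, then compare with normalized Levenshtein distance) rather than the exact score a fuzzy-matching library would return.

```python
import re
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def token_sort_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] after lowercasing and sorting the tokens."""
    sa = " ".join(sorted(re.findall(r"\w+", a.lower())))
    sb = " ".join(sorted(re.findall(r"\w+", b.lower())))
    longest = max(len(sa), len(sb))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(sa, sb) / longest


class KeyphraseStore:
    """In-memory store that rejects near-duplicate keyphrases."""

    def __init__(self, threshold: float = 0.8):  # threshold is an assumption
        self.threshold = threshold
        self.phrases: List[str] = []

    def add(self, phrase: str) -> bool:
        """Add the phrase unless it is too similar to a stored one."""
        for existing in self.phrases:
            if token_sort_similarity(phrase, existing) >= self.threshold:
                return False
        self.phrases.append(phrase)
        return True


def infuse_prompt(domain: str, keyphrases: List[str]) -> str:
    """Build a keyphrase-infused prompt in the style quoted above."""
    joined = ", ".join(keyphrases)
    return (f"Generate sentences for the {domain} domain "
            f"containing keyphrases '{joined}'")


store = KeyphraseStore()
candidates = ["mortgage loan", "loan mortgage", "Mortgage loans", "asset finance"]
kept = [p for p in candidates if store.add(p)]
print(infuse_prompt("financial", kept))
```

Because the comparison is literal rather than semantic, reordered or lightly inflected variants such as "loan mortgage" and "Mortgage loans" are dropped, while a genuinely different phrase such as "asset finance" is kept.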
- Entity Sampler & At-Source Normalization:
  - Novelty: Instead of normalizing text after it has been generated, the normalized form is produced at the same time as the entity itself.
  - Design Motivation: Traditional normalizers can miss certain variants when the same entity appears in multiple formats (e.g., dates like 03/01/2005, 01-Mar-2005, March 01, 2005). Normalization at source is deterministic because the entity is assembled from components whose spoken forms are already known.
  - Function: Supports 9 entity types (including address, phone, email, URL, date, time, percentage, and name with honorific), generating thousands of unique combinations, with multilingual and locale-sensitive support.
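To make the at-source idea concrete, here is a minimal sketch for a single entity type (dates), assuming a US English locale and years between 1900 and 2099. The function names, the three written styles, and the spoken-form conventions are illustrative assumptions, not the paper's actual entity sampler. The point it shows is that the written and spoken forms are assembled from the same sampled components, so no parser ever has to guess the format.

```python
import random
from typing import Dict

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
ORDINALS = ["", "first", "second", "third", "fourth", "fifth", "sixth",
            "seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth",
            "thirteenth", "fourteenth", "fifteenth", "sixteenth",
            "seventeenth", "eighteenth", "nineteenth"]
WRITTEN_STYLES = ("numeric", "abbreviated", "long")  # hypothetical style names


def two_digit_words(n: int) -> str:
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, rest = divmod(n, 10)
    return TENS[tens] + (f"-{ONES[rest]}" if rest else "")


def day_words(day: int) -> str:
    """Spell out a day of the month (1..31) as an ordinal."""
    if day < 20:
        return ORDINALS[day]
    tens, rest = divmod(day, 10)
    if rest == 0:
        return {2: "twentieth", 3: "thirtieth"}[tens]
    return f"{TENS[tens]}-{ORDINALS[rest]}"


def year_words(year: int) -> str:
    """Spell out a year in 1900..2099 (one common reading, by assumption)."""
    century, rest = divmod(year, 100)
    if century == 20 and rest < 10:
        return "two thousand" + (f" {ONES[rest]}" if rest else "")
    if rest == 0:
        return f"{two_digit_words(century)} hundred"
    if rest < 10:
        return f"{two_digit_words(century)} oh {ONES[rest]}"
    return f"{two_digit_words(century)} {two_digit_words(rest)}"


def build_date(month: int, day: int, year: int, style: str) -> Dict[str, str]:
    """Return the written and spoken forms built from the same components."""
    name = MONTHS[month - 1]
    written = {
        "numeric": f"{month:02d}/{day:02d}/{year}",
        "abbreviated": f"{day:02d}-{name[:3]}-{year}",
        "long": f"{name} {day:02d}, {year}",
    }[style]
    spoken = f"{name} {day_words(day)}, {year_words(year)}"
    return {"written": written, "spoken": spoken}


def sample_date(rng: random.Random) -> Dict[str, str]:
    """Sample a random date entity together with its normalized form."""
    return build_date(rng.randint(1, 12), rng.randint(1, 28),
                      rng.randint(1990, 2029), rng.choice(WRITTEN_STYLES))


for style in WRITTEN_STYLES:
    print(build_date(3, 1, 2005, style))
```

All three written variants of the example date map to the same spoken form, which is the property a post-hoc normalizer has to recover by parsing and can get wrong (for instance, by reading 03/01/2005 as day-first).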
- Speech Audio Generation & Cross-Lingual Voice Cloning:
  - The normalized text is fed into a pre-trained TTS model (Zhao et al. 2023) to generate base audio.
  - OpenVoiceV2's tone converter is used to perform speaker standardization with reference voice artist audio.
  - Key Insight: The tone converter is language-independent, allowing English reference audio to standardize speech in other languages, achieving cross-lingual speaker consistency.
Loss & Training
Because this paper focuses on a data generation pipeline rather than model training, it does not design a loss function. Text generation uses Mistral-7b-Instruct, with JSON-formatted output enforced via lm-format-enforcer. The post-processing phase applies basic heuristics to expand abbreviations, convert numbers to words, and so on.
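A heuristic postprocessor of the kind described above might look like the sketch below. The abbreviation table, the supported number range (0 to 999), and the function names are assumptions made for illustration; the paper's actual rules are not reproduced here.

```python
import re

# Hypothetical abbreviation table; a real pipeline would use a larger,
# locale-specific list.
ABBREVIATIONS = {
    "Dr.": "Doctor",
    "Mr.": "Mister",
    "approx.": "approximately",
}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + (f"-{ONES[rest]}" if rest else "")
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head + (f" {number_to_words(rest)}" if rest else "")


def postprocess(text: str) -> str:
    """Expand known abbreviations, then spell out small standalone integers.

    Decimals, currency, and integers above 999 are deliberately left alone
    or only partially handled in this sketch.
    """
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = re.sub(rf"(?<!\w){re.escape(abbreviation)}", expansion, text)

    def spell(match: re.Match) -> str:
        value = int(match.group())
        return number_to_words(value) if value < 1000 else match.group()

    return re.sub(r"\b\d+\b", spell, text)


print(postprocess("Dr. Lee ordered 42 units"))
```

Rules like these only cover the formats someone thought to write down, which is why the pipeline relies on at-source normalization for entities and keeps the postprocessor as a fallback for the remaining text.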
Key Experimental Results
Main Results
The experiment generated 3,000 data points across 16 commercial domains, 5 sentence structures, 9 semiotic classes, and 2 reference speakers (male/female), evaluated for both English and Spanish.
| Metric | Baseline (Direct Prompting) | SpeechWeave | LibriSpeech | Explanation |
|---|---|---|---|---|
| Mean Similarity (Grouped, English) | \(0.48\) | \(0.26\) | - | Reduced by \(45.8\%\) |
| Mean Similarity (Grouped, Spanish) | \(0.54\) | \(0.30\) | - | Reduced by \(44.4\%\) |
| TTR (English) | \(0.118\) | \(0.167\) | \(0.123\) | Highest lexical richness |
| MATTR (English) | \(0.761\) | \(0.803\) | \(0.758\) | Highest moving average TTR |
| Diphone Coverage (English) | \(1442\) | \(1694\) | \(1792\) | About \(17.5\%\) higher than baseline; still below LibriSpeech |
| Normalization Accuracy (English) | \(0.67\) (NeMo) | \(0.97\) | - | \(+0.30\) absolute over NeMo |
| Normalization Accuracy (Spanish) | \(0.54\) (NeMo) | \(0.94\) | - | \(+0.40\) absolute over NeMo |
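For readers unfamiliar with the lexical diversity metrics in the table, the sketch below shows the usual definitions of TTR (unique tokens divided by total tokens) and MATTR (the mean TTR over a sliding window), plus the relative-change arithmetic used in the Explanation column. The tokenizer and the window size of 50 are assumptions; the paper's exact settings may differ.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase word tokenizer (an assumption; any tokenizer would do)."""
    return re.findall(r"\w+", text.lower())


def ttr(tokens: List[str]) -> float:
    """Type-token ratio: unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mattr(tokens: List[str], window: int = 50) -> float:
    """Moving-average TTR: mean TTR over every window of fixed length."""
    if len(tokens) <= window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)


def relative_change(old: float, new: float) -> float:
    """Signed relative change from old to new."""
    return (new - old) / old


tokens = tokenize("the cat sat on the mat")
print(ttr(tokens))                   # 5 unique tokens out of 6
print(relative_change(0.48, 0.26))   # English mean similarity, baseline vs. SpeechWeave
```

Plain TTR falls as a corpus grows because common words keep repeating, so MATTR is the more reliable figure when comparing corpora of different sizes. Applying `relative_change` to the English similarity scores gives roughly \(-0.458\), matching the \(45.8\%\) reduction in the table.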
Ablation Study
| Configuration | English WER (%) | Spanish WER (%) | Explanation |
|---|---|---|---|
| LibriTTS Checkpoint (Baseline) | \(15.37\) | \(85.05\) | Before fine-tuning |
| Baseline + Fine-tuning with SpeechWeave Data | \(9.36\) | \(48.44\) | Relative WER reduction of about \(39\%\) (English) and \(43\%\) (Spanish), computed from the values in this table |
Audio quality: English \(\text{SNR}=59.82\text{ dB}\), \(\text{MOS}=4.95\), \(\text{WER}=9.32\%\); Spanish \(\text{SNR}=53.01\text{ dB}\), \(\text{MOS}=4.87\), \(\text{WER}=15.21\%\).
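WER figures like those above are conventionally computed as the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below follows that standard definition; the lowercasing and whitespace tokenization are simplifying assumptions, and the evaluation setup actually used in the paper is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # Word-level Levenshtein distance, keeping only two rows in memory.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1] / len(ref)


# One substitution ("was" -> "is") and one deletion ("today")
# against five reference words.
print(word_error_rate("the loan was approved today", "the loan is approved"))
```

Multiplying the result by 100 gives the percentage form used in the tables. Because insertions count as errors, WER can exceed \(100\%\) when the hypothesis is much longer than the reference.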
Key Findings
- Directly prompting LLMs results in low text diversity even with high temperature; keyphrase infusion is an effective diversity boosting method.
- At-source normalization accuracy far outperforms the post-processing normalizer NeMo (\(97\%\) vs \(67\%\) in English, \(94\%\) vs \(54\%\) in Spanish).
- Cross-lingual voice cloning allows a single reference speaker to be shared across multiple languages.
- Synthetic data effectively improves the performance of downstream TTS models on real-world test sets.
Highlights & Insights
- "Generation as Normalization" Design Philosophy: Moving normalization from post-processing to the data generation stage fundamentally avoids coverage blind spots of normalizers on various entity formats. This is a simple but elegant engineering innovation.
- Deduplication Strategy in Keyphrase Sampling: Using fuzzy search instead of semantic embeddings yielded better deduplication results, indicating that diversity control requires attention to literal-level differences.
- Practical Value of Cross-Lingual Voice Cloning: Standardizing speech in other languages using English reference audio significantly lowers the threshold for producing multilingual TTS data.
Limitations & Future Work
- The entity sampler supports a limited number of semiotic classes (only 9 types); extending to new types requires manually writing rules.
- Evaluation was only conducted on English and Spanish; the performance on morphologically rich languages (e.g., Arabic, Japanese) remains unknown.
- The postprocessor can still introduce normalization errors when handling unsupported entity types.
- Lack of subjective human evaluation (e.g., large-scale MOS testing); speech naturalness needs further validation.
- Stylized/emotional speech generation is not supported.
Related Work & Insights
- Similar to Cornell et al. (2024)'s LLM+TTS pipeline for ASR data, but this work integrates text normalization and speaker standardization.
- Gunduz et al. (2024)'s open-source TTS data generation tool lacks text transcript generation and normalization functions.
- The keyphrase infusion technique is derived from Eldan and Li (2023), but this work systematizes it into a complete solution with multi-step sampling and deduplication.
- The pipeline design offers useful insights for other scenarios that require large-scale domain-specific data (e.g., ASR, NLU data augmentation).
Rating
- Novelty: ⭐⭐⭐ While individual modules are not technologically novel, the combination method and at-source normalization perspective are valuable.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-dimensional evaluation (diversity, normalization, speech quality, downstream tasks), though verification across more languages is lacking.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with abundant tables and examples, making it easy to understand and replicate.
- Value: ⭐⭐⭐⭐ Addresses real pain points in TTS data generation, showing high practical value for the industry.