EuroSpeech: A Multilingual Speech Corpus¶
Conference: NeurIPS 2025 arXiv: 2510.00514 Code: disco-eth/EuroSpeech Area: Speech Processing / Multilingual Datasets Keywords: multilingual speech, parliamentary recordings, ASR, speech-text alignment, dataset construction, low-resource languages
TL;DR¶
This paper presents a scalable, open-source pipeline for automatically constructing the EuroSpeech dataset from recordings of 22 European parliaments — yielding 61K hours of high-quality speech-text aligned data across 22 languages, with 19 languages exceeding 1K hours. Fine-tuning Whisper on this data reduces average WER by 41.8%.
Background & Motivation¶
- Core Problem: Multilingual ASR/TTS models depend on large-scale annotated speech data, yet publicly available multilingual datasets exhibit severe imbalance in language coverage — most languages fall well below the 1K-hour threshold required for effective training.
- Limitations of Prior Datasets:
- Common Voice covers 133 languages, but only 8 exceed 1K hours.
- VoxPopuli draws from EU parliamentary sessions but covers only 16 languages with 1.8K hours total, none exceeding 1K hours.
- Whisper's 680K-hour training set is not publicly released; MMS-Lab covers 1,107 languages but remains private.
- FLEURS contains only 1.4K hours and targets benchmarking rather than training.
- Opportunity: National parliament recordings paired with official transcripts constitute a natural source of high-quality multilingual speech. However, data formats are highly heterogeneous, transcripts are not verbatim, and recordings are long and unsegmented — making scalable processing with existing pipelines impractical.
- Motivation: Design a source-agnostic, automated pipeline capable of handling non-verbatim transcripts and cross-format data sources to construct a large-scale, linguistically balanced multilingual speech dataset.
Method¶
1. Overall Pipeline Architecture¶
The full pipeline comprises three stages:
- Data Source Collection and Metadata Curation: Manual inspection of parliamentary websites → writing custom scraping scripts → generating standardized CSVs containing audio/video URLs, transcript links, and session IDs.
- Download Pipeline: A dispatch architecture distributes URLs to dedicated handlers (direct download, YouTube, dynamic pages, etc.), with support for resumable transfers, parallel downloading, and PostgreSQL-based status tracking.
- Alignment Pipeline: Audio segmentation → ASR transcription → two-stage dynamic alignment → CER filtering → output of aligned dataset.
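The download dispatch stage can be sketched as a simple URL router. The handler names and routing rules below are illustrative assumptions, not the paper's actual code:

```python
from urllib.parse import urlparse

# Hypothetical handlers standing in for the paper's dedicated downloaders.
def direct_download(url):       return f"direct:{url}"
def youtube_download(url):      return f"youtube:{url}"
def dynamic_page_download(url): return f"dynamic:{url}"

def dispatch(url):
    """Route a media URL to the appropriate download handler."""
    host = urlparse(url).netloc
    if "youtube.com" in host or "youtu.be" in host:
        return youtube_download(url)
    if url.endswith((".mp3", ".mp4", ".wav")):
        return direct_download(url)
    # Anything else is assumed to need headless-browser scraping.
    return dynamic_page_download(url)
```

In the real pipeline each handler would also report status to the PostgreSQL tracker and support resumable transfers.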
2. Two-Stage Dynamic Alignment Algorithm (Core Contribution)¶
To address noise in parliamentary transcripts — including non-verbatim text, speaker annotations, and procedural language — a coarse-to-fine alignment strategy is proposed:
Stage 1 — Coarse Search:
- For each ASR-transcribed segment, a sliding window of length \(n\) (number of words in the ASR segment) is applied.
- Search proceeds sequentially from the previous match position last_end_idx.
- CER is computed for each window against the ASR text; the first candidate with CER < 30% is selected. If none exists, the \(k=3\) candidates with the lowest CER are retained.
Stage 2 — Refined Search:
- Local optimization is applied to candidates identified in the coarse search.
- The start position is varied within \(\pm 15\) words, and the window size is tested over the range \((n-15, n+15)\), where \(n\) is the length of the ASR segment in words.
- The (start, size) combination minimizing CER is selected as the final match.
Fallback Mechanism:
- If CER exceeds threshold \(\theta\) after local search, a global coarse search is re-executed from the beginning of the transcript, resolving position drift caused by prior misalignments.
- If the global search also fails, a default match is performed (refined search near last_end_idx, retained regardless of CER), ensuring complete dataset coverage.
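The two-stage search can be sketched in Python. The CER definition, the 30% threshold, \(k=3\), and the \(\pm 15\)-word radius follow the description above; function names, the word-list data layout, and tie-breaking details are assumptions:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance over reference length."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (ref[i - 1] != hyp[j - 1]))
        prev = cur
    return prev[n] / max(m, 1)

def coarse_search(words, asr_words, start, cer_thresh=0.30, k=3):
    """Stage 1: slide a window of len(asr_words) forward from the last match."""
    n = len(asr_words)
    scored = []
    for i in range(start, len(words) - n + 1):
        c = cer(" ".join(words[i:i + n]), " ".join(asr_words))
        if c < cer_thresh:
            return [(i, c)]                    # first good-enough window wins
        scored.append((i, c))
    return sorted(scored, key=lambda t: t[1])[:k]  # else keep the k best

def refined_search(words, asr_words, cand_start, radius=15):
    """Stage 2: vary start by +/-radius and window size by +/-radius."""
    n = len(asr_words)
    best = (None, None, float("inf"))          # (start, size, cer)
    for s in range(max(0, cand_start - radius), cand_start + radius + 1):
        for size in range(max(1, n - radius), n + radius + 1):
            if s + size > len(words):
                break
            c = cer(" ".join(words[s:s + size]), " ".join(asr_words))
            if c < best[2]:
                best = (s, size, c)
    return best
```

The fallback logic (global re-scan, then a default match near `last_end_idx`) would wrap these two functions with the threshold \(\theta\).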
3. Transcript Preprocessing¶
- Built-in parsers for PDF, DOCX, HTML, TXT, and SRT formats.
- An optional LLM-based cleaning step (defaulting to Gemini Flash 2.0) removes speaker labels, procedural annotations, and other non-speech content.
- In German-language tests, LLM-based cleaning reduced the median CER of aligned segments from 12.3% to 9.7%.
4. Multi-Transcript Selection Strategy¶
A parliament may release multiple transcripts in different formats for the same session, with no explicit correspondence to the audio. The proposed solution:
- Align the audio against all candidate transcripts × all formats independently.
- Select the best format by median CER.
- Select the final transcript(s) according to a configurable criterion (lowest CER or all below a threshold).
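The selection logic reduces to ranking candidates by their median per-segment CER. A minimal sketch, in which the `alignments` layout and function name are assumptions:

```python
from statistics import median

def select_transcript(alignments, cer_ceiling=0.20):
    """Pick the best (transcript, format) pair by median segment CER.

    `alignments` maps (transcript_id, fmt) -> list of per-segment CERs,
    one entry per independent alignment run (hypothetical layout).
    """
    scored = {key: median(cers) for key, cers in alignments.items() if cers}
    # Best candidate overall: lowest median CER.
    best = min(scored, key=scored.get)
    # Alternative configurable criterion: all candidates under a CER ceiling.
    passing = [k for k, m in scored.items() if m < cer_ceiling]
    return best, passing
```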
5. CER-Based Tiered Quality Filtering¶
CER serves as the primary quality control metric. Aligned duration statistics at multiple thresholds are reported:
| Filter Level | Aligned Duration | Proportion |
|---|---|---|
| All aligned | 78.1K h | 100% |
| CER < 30% | 61.0K h | 78.2% |
| CER < 20% | 50.5K h | 65.4% |
| CER < 10% | 32.3K h | 41.0% |
CER < 20% defines the primary dataset (consistent with VoxPopuli), with audio segments of 3–20 seconds and a sampling rate of 16 kHz.
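Since every segment carries a CER score, tier construction is just bucketing by threshold. A sketch, where the `(duration, cer)` segment layout is an illustrative assumption:

```python
def tier_durations(segments, thresholds=(0.30, 0.20, 0.10)):
    """Total aligned duration at each CER tier, as in the table above.

    `segments` is a list of (duration_seconds, cer) pairs (assumed layout).
    """
    report = {"all": sum(d for d, _ in segments)}
    for t in thresholds:
        report[f"cer<{int(t * 100)}%"] = sum(d for d, c in segments if c < t)
    return report
```

Downstream users can thus trade quality for quantity by picking a tier rather than re-running alignment.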
EuroSpeech Dataset Analysis¶
Language Coverage and Scale¶
| Language | Duration (CER<20%) | Language | Duration (CER<20%) |
|---|---|---|---|
| Croatian | 5615.8 h | Bulgarian | 2200.1 h |
| Danish | 5559.8 h | German | 2184.4 h |
| Norwegian | 3866.7 h | Serbian | 1855.7 h |
| Portuguese | 3293.5 h | Finnish | 1848.2 h |
| Italian | 2813.7 h | Latvian | 1218.8 h |
| Lithuanian | 2681.2 h | Ukrainian | 1191.1 h |
| English | 2609.3 h | Slovenian | 1156.4 h |
| Slovak | 2553.6 h | Estonian | 1014.9 h |
| Greek | 2395.4 h | Bosnian | 691.3 h |
| Swedish | 2312.8 h | Icelandic | 647.4 h |
| French | 2249.8 h | Maltese | 613.0 h |
Key Comparison: EuroSpeech surpasses the scale of existing public datasets for 12 languages, with 8 languages crossing the 1K-hour threshold for the first time. For 5 languages, the data volume is 10–100× that of the previous best (e.g., Lithuanian: 25h → 2681h; Slovak: 61h → 2554h; Maltese: 44h → 613h).
Data Splits¶
Train/dev/test splits are defined at the level of complete parliamentary sessions, preventing segments from the same session from appearing across different splits.
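A session-level split can be implemented with a stable hash on the session ID, so every segment of a session deterministically lands in the same split. The segment layout and split fractions below are assumptions:

```python
import hashlib

def session_split(segments, dev_frac=0.05, test_frac=0.05):
    """Assign whole parliamentary sessions to train/dev/test.

    Hashing the session ID (not the segment) guarantees no session
    straddles two splits; `segments` is a list of dicts with a
    "session_id" key (hypothetical layout).
    """
    splits = {"train": [], "dev": [], "test": []}
    for seg in segments:
        h = int(hashlib.sha1(seg["session_id"].encode()).hexdigest(), 16)
        r = (h % 10_000) / 10_000          # stable pseudo-random in [0, 1)
        if r < test_frac:
            splits["test"].append(seg)
        elif r < test_frac + dev_frac:
            splits["dev"].append(seg)
        else:
            splits["train"].append(seg)
    return splits
```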
Key Experimental Results¶
ASR Fine-Tuning Evaluation¶
Whisper v3 Turbo is fine-tuned on approximately 200 hours of the lowest-CER data per language for 6 low-resource languages, and evaluated on the out-of-domain FLEURS test set:
| Language | Baseline WER | Fine-tuned WER | Relative Gain |
|---|---|---|---|
| Maltese | 72.2% | 25.9% | 64.1% |
| Icelandic | 20.0% | 15.0% | 25.0% |
| Lithuanian | 25.0% | 15.9% | 36.4% |
| Latvian | 19.3% | 11.1% | 42.5% |
| Slovenian | 20.5% | 13.0% | 36.7% |
| Estonian | 18.4% | 9.9% | 46.1% |
| Average | 29.2% | 15.1% | 41.8% |
- Maltese shows the largest improvement (64.1%), transitioning from near-unusable to practically viable.
- All 6 languages achieve substantial gains on the out-of-domain test set, validating the dataset's generalization value.
- Training efficiency is high: only 200 hours of data and 1.3–43 GPU hours per language are required.
Computational Cost¶
| Stage | Resource Consumption |
|---|---|
| Video download | ~3,930 CPU·h |
| Transcript retrieval | ~280 CPU·h |
| Alignment processing | ~5,548 GPU·h (mixed GPU types) |
| ASR fine-tuning | 1.3–43 h/language (A6000) |
Highlights & Insights¶
- Balanced coverage as core value: Rather than maximizing total duration, EuroSpeech's distinguishing contribution is that all 22 languages exceed 500 hours — in contrast, only 8 of Common Voice's 133 languages exceed 1K hours.
- Elegant two-stage alignment design: Linear scanning ensures efficiency; local refined search ensures precision; two-level fallback ensures robustness — all without requiring manual transcript preprocessing.
- Engineering value of the pipeline: The modular design (metadata CSV → download dispatch → alignment → filtering) enables extension to new parliaments by writing only a metadata scraping script, with full reuse of core logic.
- LLM-assisted cleaning reduces barriers: Gemini Flash is used to automatically remove non-speech elements from PDF transcripts, substantially reducing manual preprocessing effort.
- Deliberate selection of hardest languages for evaluation: The 6 languages on which Whisper performs worst are specifically chosen — demonstrating the largest absolute gains while also serving as a rigorous stress test of pipeline robustness (poorer ASR → noisier alignment → stronger fault tolerance required).
Limitations & Future Work¶
- Domain homogeneity: Coverage is limited to formal parliamentary speech, which is planned and formal in style; models may generalize poorly to conversational or informal settings.
- Limited dialect and variety coverage: Parliamentary language tends toward standard/official registers, with insufficient representation of dialectal and sociolinguistic variation.
- Dependence on ASR quality: Alignment quality is bounded by the underlying ASR system; for low-resource languages where ASR is inherently weak, alignment may introduce systematic biases.
- Geographic limitation: Coverage is restricted to 22 European nations; low-resource languages in Africa, Asia, and South America are not addressed.
- Potential misuse risk: The data contains identifiable speech from public political figures and could be exploited for voice synthesis or deepfake generation, though the formal parliamentary register may partially limit misuse.
- Future directions: Extension to conversational speech, incorporation of speaker metadata, and a 24 kHz TTS-oriented variant.
Related Work & Insights¶
| Dataset | Languages | Total Duration | Languages >1K h | Public |
|---|---|---|---|---|
| Common Voice | 133 | 22.1K h | 8 | ✓ |
| VoxPopuli | 16 | 1.8K h | 0 | ✓ |
| YODAS | 149 | 369.5K h | 13 | ✓ |
| Whisper Data | 91 | 680K h | 16 | ✗ |
| MMS-Lab | 1,107 | 44.7K h | 0 | ✗ |
| EuroSpeech | 22 | 61K h | 19 | ✓ |
Among publicly available datasets, EuroSpeech achieves the highest number of languages exceeding 1K hours (19), trading breadth of language coverage for depth within each language.
Further Insights:
- Tiered quality–quantity tradeoff: The multi-level CER filtering design (10%/20%/30%) is broadly applicable — different downstream tasks have different noise tolerances (TTS requires high quality; ASR pretraining can leverage noisier data).
- Systematic utilization of public data: Parliamentary recordings represent a class of "government-released but systematically underutilized" data sources; analogous opportunities exist in other modalities (legal texts, policy documents, sign language video).
- Pipeline-first thinking: Rather than manually annotating small high-quality datasets, designing robust pipelines to automatically extract value from large noisy sources is the prevailing paradigm for large-scale dataset construction.
Rating¶
- Novelty: ⭐⭐⭐ — pipeline is not novel per se, but the scale is significant
- Experimental Thoroughness: ⭐⭐⭐⭐ — validated across multiple languages for ASR
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐ — large-scale multilingual speech resources carry high community value