On the Robust Approximation of ASR Metrics¶

Conference: ACL 2025
arXiv: 2502.12408
Code: None
Area: Speech
Keywords: ASR evaluation, label-free metrics, WER approximation, multimodal embeddings, proxy reference

TL;DR¶

This paper proposes a label-free method for approximating ASR performance metrics. It combines speech-text similarity in a unified multimodal embedding space with proxy metrics computed against high-quality proxy models, and trains an ensemble regression model on these features to predict WER/CER. The absolute error stays within single digits across more than 40 models and 14 datasets, and the method outperforms the most recent baseline by more than 50%.

Background & Motivation¶

Background: ASR models are typically evaluated using WER and CER, which rely heavily on ground-truth transcriptions. While large-scale speech foundation models perform excellently on standard benchmarks, their generalization capabilities under diverse domains and testing conditions remain unclear.

Limitations of Prior Work:
- Annotating data is expensive and time-consuming, limiting the evaluation of model performance in new domains.
- Existing reference-free evaluation methods (such as NoRefER) mainly provide relative quality assessment and cannot provide precise error rates.
- Existing metric approximation methods (such as eWER3) are primarily validated in IID settings, lacking evaluations of OOD generalization.

Key Challenge: The need to obtain reliable quantitative indicators of ASR performance without labels, while maintaining robustness across out-of-distribution (OOD) data and different ASR models.

Goal: To achieve robust ASR metric approximation across four evaluation scenarios (IID-Source, IID-Target, OOD-Source, and OOD-Target), covering over 40 models and 14 datasets.

Key Insight: Combining speech-text similarity from the SONAR multimodal unified embedding space with the WER/CER of high-quality proxy models as features to train an ensemble regression model.

Core Idea: Using the cosine similarity of speech-transcription in a unified representation space and the error rates of proxy models as features to train a regression model that predicts the ground-truth WER/CER.

Method¶

Overall Architecture¶

The pipeline consists of three components:
1. Similarity calculation within a unified representation space.
2. Consistency measurement against proxy references.
3. Training regression models to predict ASR metrics.

Key Designs¶

Similarity in a Unified Representation Space: Using the SONAR model, the speech signal \(x_{\text{speech}}\) and the ASR transcription \(x_{\text{text}}\) are mapped into a shared 1024-dimensional embedding space to compute the cosine similarity:

\[\text{Similarity}(x_{\text{speech}}, x_{\text{text}}) = \frac{e_{\text{speech}} \cdot e_{\text{text}}}{\|e_{\text{speech}}\| \|e_{\text{text}}\|}\]

Intuition: Higher similarity indicates better transcription quality and superior alignment with the actual content.
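
The similarity formula above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `cosine_similarity` and the toy vectors are assumptions, and the embeddings are plain lists standing in for the 1024-dimensional SONAR outputs (no SONAR API is called here).

```python
import math
from typing import Sequence


def cosine_similarity(e_speech: Sequence[float], e_text: Sequence[float]) -> float:
    """Cosine similarity between a speech embedding and a text embedding."""
    if len(e_speech) != len(e_text):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(a * b for a, b in zip(e_speech, e_text))
    norm_speech = math.sqrt(sum(a * a for a in e_speech))
    norm_text = math.sqrt(sum(b * b for b in e_text))
    if norm_speech == 0.0 or norm_text == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_speech * norm_text)


# Toy 3-dimensional stand-ins for real embeddings (hypothetical values).
e_speech = [0.2, 0.5, 0.1]
e_text = [0.25, 0.45, 0.05]
similarity = cosine_similarity(e_speech, e_text)
```

The result lies in [-1, 1]; in the pipeline it would be computed once per utterance and passed on as a regression feature.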

Proxy Reference Mechanism: A high-quality ASR model serves as a proxy, and the WER (pWER) and CER (pCER) between the target model's transcription and the proxy's transcription are used as features. The proxy is selected dynamically: the 41 models are ranked by their average performance across the datasets, and each target model is assigned the highest-ranked model other than itself.

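
As a rough sketch of the proxy mechanism, the snippet below picks the best-ranked model other than the target and scores the target's transcript against the proxy's transcript. It assumes the proxy transcript plays the role of the reference in a standard edit-distance error rate; the function names, the model names, and the example sentences are all hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """WER (unit='word') or CER (unit='char') of hypothesis against reference."""
    ref = reference.split() if unit == "word" else list(reference)
    hyp = hypothesis.split() if unit == "word" else list(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def select_proxy(ranked_models: List[str], target: str) -> str:
    """Return the best-ranked model that is not the target itself."""
    for model in ranked_models:
        if model != target:
            return model
    raise ValueError("no candidate proxy other than the target")


# Hypothetical ranking (best first) and transcripts.
ranking = ["model_a", "model_b", "model_c"]
proxy = select_proxy(ranking, target="model_a")  # falls through to "model_b"
proxy_text = "the cat sat on the mat"
target_text = "the cat sat on mat"
p_wer = error_rate(proxy_text, target_text, unit="word")
p_cer = error_rate(proxy_text, target_text, unit="char")
```

Because the proxy transcript is itself imperfect, pWER and pCER only measure agreement with the proxy, which is why they are used as features and not reported as error rates directly.
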
Ensemble Regression Model: The similarity and proxy metrics are concatenated into a feature vector \(z = [\text{Similarity}, \text{pWER}/\text{pCER}]\), which is input into ensemble regressors to predict aWER/aCER. The ensemble includes Random Forest, Gradient Boosting, Histogram-based Gradient Boosting, and Ridge Regression with non-negativity constraints. Hyperparameters are tuned using RandomizedSearchCV to minimize MAE.

Four Evaluation Settings¶

Case 1: IID data + Source model — trained on \(\mathcal{D}_{S,B}^{\text{train}}\) and tested on \(\mathcal{D}_{S,B}^{\text{test-IID}}\)
Case 2: IID data + Target model — tested on \(\mathcal{D}_{T,B}^{\text{test-IID}}\)
Case 3: OOD data + Source model — tested on \(\mathcal{D}_{S,W}\) (wild dataset)
Case 4: OOD data + Target model — tested on \(\mathcal{D}_{T,W}\)

Loss & Training¶

The regression models minimize MAE (Mean Absolute Error).
Leave-one-out strategy: Training on 9 out of 10 standard benchmarks and testing on the remaining 1.
The regression target is the absolute number of errors (word/character level), which is normalized to obtain aWER/aCER.
Extraction of SONAR embeddings for 1,000 samples per dataset takes only about 1 minute.
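
The leave-one-out protocol and the normalization step can be sketched as below. The benchmark names are placeholders, and the denominator is an assumption: the summary does not state which length is used to normalize predicted error counts, so this sketch simply takes per-utterance unit counts (words or characters) supplied by the caller.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_dataset_out(names: Sequence[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training datasets, held-out dataset) for every benchmark."""
    for held_out in names:
        train = [name for name in names if name != held_out]
        yield train, held_out


def approximate_error_rate(pred_error_counts: Sequence[float],
                           unit_counts: Sequence[int]) -> float:
    """Turn predicted per-utterance error counts into a corpus-level rate.

    Assumption: unit_counts holds the number of words (for aWER) or
    characters (for aCER) per utterance, chosen by the caller.
    """
    total_units = sum(unit_counts)
    if total_units == 0:
        raise ValueError("total unit count must be positive")
    # Error counts cannot be negative, so clip regressor outputs at zero.
    total_errors = sum(max(0.0, e) for e in pred_error_counts)
    return total_errors / total_units


# Ten placeholder benchmark names (hypothetical).
benchmarks = [f"bench_{i}" for i in range(1, 11)]
splits = list(leave_one_dataset_out(benchmarks))

# Hypothetical predictions for three utterances.
a_wer = approximate_error_rate([1.2, 0.0, 2.8], [10, 8, 22])
```

Each of the ten splits trains on nine benchmarks and evaluates on the remaining one, so every benchmark serves as the test set exactly once.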

Key Experimental Results¶

Main Results — WER Approximation on Wild Datasets (Selected Models)¶

Each cell reports true WER / approximated aWER, in %.
Model	LS_Noise	Primock57	ATCOsim	VP_Acc
canary-1b	4.1/6.4	16.2/13.4	30.4/35.5	23.2/12.1
whisper-l-v3	4.6/5.9	18.7/12.0	64.7/73.9	19.2/18.1
parakeet-tdt-1.1b	3.4/6.0	13.5/13.2	28.3/35.7	17.9/10.2
data2vec-large	7.2/8.6	28.3/30.7	44.0/51.1	21.4/26.5
mms-1b-f102	24.0/24.9	70.2/67.8	93.4/99.0	39.4/38.2
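
To read the table in terms of approximation error, the absolute gap |WER - aWER| can be computed per cell. The snippet below is only an illustrative calculation over the values listed in the table; the variable and function names are assumptions.

```python
from statistics import mean

# (WER, aWER) pairs in %, copied from the table above.
RESULTS = {
    "canary-1b": {"LS_Noise": (4.1, 6.4), "Primock57": (16.2, 13.4),
                  "ATCOsim": (30.4, 35.5), "VP_Acc": (23.2, 12.1)},
    "whisper-l-v3": {"LS_Noise": (4.6, 5.9), "Primock57": (18.7, 12.0),
                     "ATCOsim": (64.7, 73.9), "VP_Acc": (19.2, 18.1)},
    "parakeet-tdt-1.1b": {"LS_Noise": (3.4, 6.0), "Primock57": (13.5, 13.2),
                          "ATCOsim": (28.3, 35.7), "VP_Acc": (17.9, 10.2)},
    "data2vec-large": {"LS_Noise": (7.2, 8.6), "Primock57": (28.3, 30.7),
                       "ATCOsim": (44.0, 51.1), "VP_Acc": (21.4, 26.5)},
    "mms-1b-f102": {"LS_Noise": (24.0, 24.9), "Primock57": (70.2, 67.8),
                    "ATCOsim": (93.4, 99.0), "VP_Acc": (39.4, 38.2)},
}


def absolute_gaps(results):
    """|WER - aWER| in percentage points for every model and dataset."""
    return {
        model: {ds: abs(wer - awer) for ds, (wer, awer) in row.items()}
        for model, row in results.items()
    }


def mean_gap_per_dataset(results):
    """Average absolute gap over the listed models, per dataset."""
    gaps = absolute_gaps(results)
    datasets = next(iter(gaps.values())).keys()
    return {ds: mean(g[ds] for g in gaps.values()) for ds in datasets}


GAPS = absolute_gaps(RESULTS)
MEAN_GAPS = mean_gap_per_dataset(RESULTS)
```

Since only five of the 40+ models are listed, these averages describe the selected rows and should not be read as the paper's overall MAE.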

Key Findings in Benchmark Datasets¶

On AMI_IHM, Whisper-large-v3 has a true WER of 19.0% against an aWER of 17.1%, a gap of only 1.9 percentage points.
Approximations are more accurate for high-performance models (low WER); low-performance models show larger deviations, which still remain within a reasonable range.
For most models on most datasets, |WER - aWER| is below 5 percentage points.

Cross-Lingual Experiments¶

Training Language→Testing Language	MAE (word error count)
EN→EN (IID)	0.56-0.66
EN→DE (OOD)	0.75-1.60
DE→EN (OOD)	0.50-1.59

Ablation Study¶

Similarity only: Performance drops significantly, indicating that proxy metrics contribute critical information.
Proxy metrics only: Outperforms Similarity-only, but the combination of both yields optimal performance.
Training Data Size: Training with only 20% of the data can achieve over 90% of the full-data performance.
vs. eWER3 Baseline: Outperforms eWER3 by more than 50% in all settings.

Key Findings¶

High-performance ASR models (low WER) are easier to approximate accurately; low-performance models show larger deviations, but the absolute gap still stays in single digits.
The method is agnostic to both models and data, showing strong generalizability across models and domains.
It maintains reasonable accuracy even cross-linguistically (EN↔DE).
OOD generalization performance on wild datasets (real-world scenarios) is highly satisfactory.
The approximation deviation increases at extremely high error rates (e.g., WER > 90%), but distinguishing high from extremely high error rates has limited practical significance in such cases.

Highlights & Insights¶

High Practical Value: Evaluates ASR model performance in new domains without labels, which is highly useful for domain adaptation assessment before large-scale model deployment.
Ingenious Use of SONAR: Leverages the pre-trained multimodal unified embedding space as a zero-shot feature extractor, avoiding end-to-end training.
Flexible Proxy Selection Strategy: Dynamically ranks and selects the best proxy to avoid self-referential bias.
Impressive Scale: 40+ models \(\times\) 14 datasets \(\times\) 4 evaluation settings make for an extremely comprehensive evaluation.

Limitations & Future Work¶

The approximation accuracy for models with extremely high error rates (WER > 80%) is lower, which may require non-linear features.
The selection of proxy models depends on the pre-evaluation of multiple models, leading to a high initial setup cost.
The regression models are based on traditional ML (RandomForest/GBT), leaving neural network-based regressors unexplored.
Mostly validated on English data, with cross-lingual experiments only involving German.
Sentence-level approximations might be less stable than corpus-level ones.

Related Work & Insights¶

The capabilities of the SONAR (Duquenne et al., 2023) multimodal unified embedding model extend beyond translation evaluation, proving equally effective in ASR quality assessment.
eWER3 (Chowdhury and Ali, 2023) uses an end-to-end approach based on wav2vec2 + RoBERTa but lacks proxy information and OOD generalization.
Potential application in pseudo-labeling: Helps filter high-quality transcriptions for knowledge distillation.

Rating¶

Novelty: ⭐⭐⭐⭐ — The combination of Proxy + SONAR similarity is simple and effective.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 40+ models \(\times\) 14 datasets \(\times\) 4 settings, extremely comprehensive.
Writing Quality: ⭐⭐⭐⭐ — Clear framework and rigorous evaluation setup.
Value: ⭐⭐⭐⭐ — High practical value for the area of label-free ASR evaluation.