PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement

Conference: AAAI 2026
arXiv: 2511.13300
Code: https://github.com/cisco-open/pase
Area: Hallucination Detection
Keywords: Speech Enhancement, Hallucination Suppression, WavLM, Phonological Prior, Representation Distillation

TL;DR

This paper proposes PASE, a framework that uses the robust phonological prior embedded in pretrained WavLM, through Denoising Representation Distillation (DRD), to suppress linguistic hallucinations. A dual-stream representation (high-level phonetic plus low-level acoustic) addresses acoustic hallucinations. On the simulated LibriTTS test set, PASE attains the best content fidelity among the compared systems (lowest WER, highest SBS and LPS) with competitive perceptual quality.

Background & Motivation

Hallucination in Generative Speech Enhancement

Speech enhancement (SE) aims to recover clean speech from noisy mixtures. Generative models (GANs, diffusion, flow matching, language models) surpass conventional discriminative approaches in perceptual quality, yet they suffer from a largely overlooked problem, hallucination: enhanced speech that is inconsistent with the original signal in linguistic content or speaker characteristics.

Taxonomy of Two Hallucination Types

The authors are the first to systematically categorize hallucinations in speech enhancement into two types:

Linguistic Hallucination: Enhanced speech contains incorrect spoken content because the model fails to constrain its output to valid phonological structures. This is the more fundamental challenge.

Acoustic Hallucination: Enhanced speech exhibits inconsistent speaker characteristics, arising from the loss of fine-grained acoustic details. This can be mitigated by supplementing acoustic cues.

Fundamental Limitations of Prior Work

Both major paradigms suffer from critical flaws:

Continuous Representation Mapping (CRM):
- Treats speech self-supervised model (S3M) representations as simple feature vector sequences, ignoring the contextual structure underlying their pseudo-linguistic properties
- Performance gains stem primarily from the strength of the representations themselves rather than from effective exploitation of their internal structure

Discrete Language Modeling (DLM):
- Discretizes S3M representations into token sequences and models them autoregressively
- Prior contamination risk: Learning from noise-corrupted representations tends to produce linguistically inconsistent outputs
- Discretization discards critical acoustic information (pitch, timbre), precluding mitigation of acoustic hallucinations
- Noise differs from true masking: noise acts as a "soft mask" that merely distorts information, allowing the model to bypass contextual knowledge by reconstructing from local cues alone

Core Argument

The authors introduce the concept of a Phonological Prior: S3Ms (e.g., WavLM) do not truly understand linguistic semantics; rather, they acquire pseudo-linguistic properties that mimic language comprehension by learning statistical co-occurrence patterns of phonetic structures from large-scale audio data. This prior should be leveraged directly rather than relearned from corrupted inputs.

Method

Overall Architecture

PASE (Phonologically Anchored Speech Enhancer) consists of two core components:

DeWavLM (Denoising WavLM): WavLM fine-tuned via Denoising Representation Distillation to serve as a denoising expert
Dual-Stream Vocoder: Reconstructs enhanced waveforms from DeWavLM's high-level phonetic representations and low-level acoustic representations

Key Designs

1. Denoising Representation Distillation (DRD)

Core Idea: Rather than learning phonological priors from noisy inputs (which risks prior contamination), directly leverage the robust priors already present in pretrained WavLM.

Method:
- Instantiate two WavLM copies: a frozen teacher and a trainable student, both initialized from pretrained weights
- The student learns to map noisy input waveforms to clean representations:
  - Input: noisy speech waveform
  - Target: final-layer output produced by the teacher from the corresponding clean speech
  - Loss: MSE loss
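
The teacher/student setup above can be illustrated with a deliberately tiny sketch. This is not the authors' code: `encode` is a hypothetical one-parameter stand-in for WavLM, and the signals, noise level, and learning rate are made up. It only shows the data flow: the frozen teacher sees clean speech, the student sees the noisy mixture, and the MSE between their outputs updates the student alone.

```python
import math
import random

def encode(wave, weight):
    """Toy stand-in for an encoder (assumption): one scalar feature per sample."""
    return [weight * x for x in wave]

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def drd_step(student_w, teacher_w, noisy, clean, lr):
    """One gradient step on the distillation loss; only the student changes."""
    target = encode(clean, teacher_w)   # frozen teacher sees clean speech
    pred = encode(noisy, student_w)     # student sees the noisy mixture
    # d/dw mean((w * x - t)^2) = mean(2 * (w * x - t) * x)
    grad = sum(2 * (p - t) * x for p, t, x in zip(pred, target, noisy)) / len(noisy)
    return student_w - lr * grad, mse(pred, target)

rng = random.Random(0)
clean = [math.sin(0.3 * t) for t in range(200)]
noisy = [c + rng.gauss(0.0, 0.5) for c in clean]

teacher_w = 1.0          # "pretrained" weight, never updated
student_w = teacher_w    # student starts from the same pretrained weight
losses = []
for _ in range(50):
    student_w, loss = drd_step(student_w, teacher_w, noisy, clean, lr=0.05)
    losses.append(loss)

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}, student weight: {student_w:.3f}")
```

In the real system both copies would be full WavLM networks and the target would be the teacher's final-layer hidden states, but the roles of the clean and noisy inputs stay the same.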

Key Findings:
- Distilling from the teacher's final layer into the student's final layer (layer 24 → layer 24) yields the best performance
- Using only the KD loss (without the original masked prediction loss) is most effective: the pure KD objective provides strong regularization, preventing knowledge degradation and catastrophic forgetting
- Joint SSL+KD objectives are suboptimal, as the SSL loss induces representation drift

Catastrophic Forgetting: In principle, the new denoising objective could overwrite the model's original phonological priors. Experiments indicate otherwise:
- PNMI (phoneme discriminability) remains stable across all fine-tuning objectives
- The KD loss acts as a regularizer that pulls representations back toward the original manifold, achieving an RFS of 0.98
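
For readers unfamiliar with PNMI, the sketch below computes it under the definition commonly used in HuBERT-style analyses: the mutual information between frame-level phone labels and cluster IDs, normalized by the phone entropy. How the paper obtains its clusters and alignments is not stated in this summary, so the inputs here are hypothetical paired label sequences.

```python
import math
from collections import Counter

def pnmi(phones, clusters):
    """Phone-normalized mutual information I(phone; cluster) / H(phone)."""
    if len(phones) != len(clusters) or not phones:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(phones)
    joint = Counter(zip(phones, clusters))
    phone_counts = Counter(phones)
    cluster_counts = Counter(clusters)

    # Entropy of the phone labels.
    h_phone = -sum((c / n) * math.log(c / n) for c in phone_counts.values())
    if h_phone == 0.0:
        raise ValueError("PNMI is undefined when only one phone class is present")

    # Mutual information between phone labels and cluster IDs.
    mi = sum(
        (c / n) * math.log((c / n) / ((phone_counts[y] / n) * (cluster_counts[z] / n)))
        for (y, z), c in joint.items()
    )
    return mi / h_phone

# Clusters that perfectly track phones versus clusters that carry no phone information.
aligned = pnmi(["a", "a", "b", "b"], [0, 0, 1, 1])
unrelated = pnmi(["a", "a", "b", "b"], [0, 1, 0, 1])
```

A value near 1 means the cluster IDs almost determine the phone, while a value near 0 means they are uninformative, which is why a stable PNMI after fine-tuning is read as preserved phonetic structure.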

2. Dual-Stream Acoustic Conditioning Reconstruction

Core Idea: Two complementary layers are selected as vocoder inputs:

Phonetic Representation: Final Transformer layer output, rich in abstract, context-dependent phonetic content → ensures linguistic integrity
Acoustic Representation: First Transformer layer output, preserving fine-grained acoustic details → retains speaker identity and prosody

The two representations are fused by simple element-wise addition after linear projection. In the fusion ablation, addition performs on par with concatenation, cross-attention, and FiLM (WER 7.50% vs. 7.49–7.78%, SpkSim 0.79–0.80), so more complex strategies yield no meaningful benefit. The explanation offered is that the two representations are largely orthogonal, which makes addition sufficient for fusion.
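
A minimal sketch of additive fusion follows. It assumes each stream has its own linear projection into a shared dimension before the element-wise sum; the summary does not say which streams are projected, and the function names and toy shapes are hypothetical.

```python
def linear(frame, weight, bias):
    """Apply y = W x + b to a single feature frame (plain-Python matrix product)."""
    return [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]

def fuse_additive(phonetic, acoustic, proj_phonetic, proj_acoustic):
    """Project both streams frame by frame, then add them element-wise."""
    fused = []
    for p_frame, a_frame in zip(phonetic, acoustic):
        p = linear(p_frame, *proj_phonetic)
        a = linear(a_frame, *proj_acoustic)
        fused.append([x + y for x, y in zip(p, a)])
    return fused

# Two frames of 2-dimensional toy features per stream.
phonetic = [[1.0, 2.0], [0.0, 1.0]]     # stands in for final-layer features
acoustic = [[0.5, -1.0], [2.0, 2.0]]    # stands in for first-layer features
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])  # (weight, bias) placeholder

fused = fuse_additive(phonetic, acoustic, identity, identity)
```

With identity projections the result is just the frame-wise sum; in practice the projections would be learned jointly with the vocoder.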

3. Vocoder Architecture

Backbone: Enhanced Vocos with integrated attention modules for improved contextual modeling
Adversarial training: Multi-Period Discriminator (MPD) + Multi-Band Multi-Scale STFT Discriminator (MBMSD), with equal adversarial loss weights
Loss weights: Reconstruction : Adversarial : Feature Matching = 15 : 2 : 1
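
The weighting above can be written as a single generator objective. The sketch assumes the 15 : 2 : 1 ratio is applied directly as coefficients, that "equal adversarial loss weights" means the two discriminator terms are summed with weight 1 each, and that the feature-matching value is already aggregated over both discriminators. None of these details are confirmed by the summary.

```python
def generator_loss(recon, adv_mpd, adv_mbmsd, feat_match,
                   w_recon=15.0, w_adv=2.0, w_fm=1.0):
    """Weighted sum of vocoder generator losses (coefficients are assumptions)."""
    adversarial = adv_mpd + adv_mbmsd   # both discriminators weighted equally
    return w_recon * recon + w_adv * adversarial + w_fm * feat_match

# Example with made-up loss values.
total = generator_loss(recon=0.1, adv_mpd=0.5, adv_mbmsd=0.5, feat_match=2.0)
```

The large reconstruction weight keeps the output anchored to the target spectrum, with the adversarial and feature-matching terms acting as smaller perceptual corrections.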

Loss & Training

DeWavLM training: 100K steps, batch size 4, learning rate 1e-4, AdamW + cosine decay
Vocoder training: 200K steps, batch size 12, learning rate 2e-4, DeWavLM frozen during training
Hardware: 4 × NVIDIA RTX 4090
Training data: ~2,000 hours of clean speech, SNR range \([-5, 15]\) dB
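
Noisy training inputs at a given SNR are typically built by scaling a noise clip before adding it to clean speech. The sketch below shows that standard scaling; drawing the SNR uniformly from the stated range is an assumption, and the synthetic sine "speech" and Gaussian noise are placeholders for real recordings.

```python
import math
import random

def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested SNR in dB."""
    scale = math.sqrt(power(clean) / (power(noise) * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]

def measured_snr_db(clean, noisy):
    """SNR of a mixture, computed from the residual noisy - clean."""
    residual = [y - c for y, c in zip(noisy, clean)]
    return 10 * math.log10(power(clean) / power(residual))

rng = random.Random(0)
clean = [math.sin(0.05 * t) for t in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

snr_db = rng.uniform(-5.0, 15.0)   # range from the text; uniform draw is assumed
noisy = mix_at_snr(clean, noise, snr_db)
```

Measuring the SNR of the resulting mixture with `measured_snr_db` recovers the sampled value, which is a quick sanity check on the scaling.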

Key Experimental Results

Main Results

Comparison on the simulated LibriTTS test set:

Model	Params (M)	DNSMOS↑	UTMOS↑	SBS↑	LPS↑	SpkSim↑	WER (%)↓
Noisy	-	1.33	1.44	0.62	0.63	0.77	14.35
TF-GridNet	2.77	3.04	2.62	0.85	0.90	0.80	9.93
StoRM	55.12	3.07	2.55	0.68	0.65	0.63	45.94
LLaSE-G1	1895.63	3.16	3.17	0.74	0.71	0.42	36.58
AES-V2	-	3.35	4.09	0.79	0.85	0.60	21.32
PASE (ours)	382.14	3.12	3.09	0.90	0.93	0.80	7.49

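PASE achieves the lowest WER (7.49%) and the highest SBS and LPS, and it matches TF-GridNet for the highest SpkSim among enhanced outputs (0.80), at the lowest computational cost (21.42 G MACs/s). Its DNSMOS and UTMOS trail AES-V2 and LLaSE-G1, so the result is best read as a balance between perceptual quality and content fidelity rather than a win on every metric.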
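
Since WER is the main content-fidelity metric in the table, here is a minimal sketch of how it is conventionally computed: word-level edit distance divided by the number of reference words. The ASR system and text normalization used in the paper are not given in this summary, so this only illustrates the metric itself.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

# One substituted word out of six reference words.
score = wer("the cat sat on the mat", "the cat sat on a mat")
```

A linguistic hallucination shows up directly in this number, because a fluent but wrong word counts as a substitution even when the audio sounds clean.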

Ablation Study

Ablation on DRD objective functions:

Configuration	DNSMOS↑	UTMOS↑	SpkSim↑	WER (%)↓	Notes
w/o DRD	1.55	1.39	0.46	32.33	No fine-tuning baseline
SSL only	2.64	1.96	0.39	15.38	Masked prediction loss only
KD only	3.26	3.42	0.57	7.62	MSE distillation loss only
SSL+KD	3.07	2.95	0.52	8.78	Joint loss

Investigation of phonological prior sources:

DeWavLM Variant	DNSMOS↑	WER (%)↓	Notes
Base (960h pretrained)	3.32	15.49	Small model + less data
Base+ (94Kh pretrained)	3.30	13.34	Small model + more data
Large (94Kh pretrained)	3.26	7.62	Large model + more data
Base-FS (trained from scratch)	3.33	36.16	Small model, no pretraining
Large-FS (trained from scratch)	3.24	38.62	Large model, no pretraining

Ablation on acoustic conditioning schemes:

Scheme	SpkSim↑	WER (%)↓	Notes
w/o condition	0.57	7.62	No acoustic conditioning
Add	0.80	7.50	Simple addition
Concat	0.80	7.49	Concatenation
Cross-Attention	0.79	7.78	Cross-attention
FiLM	0.80	7.58	FiLM modulation

Key Findings

Pretrained priors are indispensable: From-scratch (FS) models yield WERs of 36–39%, whereas pretrained models reach 7.6–15.5%
Phonological priors originate from masked prediction objectives: MRS (Masked Reconstruction Score) is highly correlated with denoising WER; the contextual reasoning capability cultivated by masked prediction is the fundamental source of the prior
Data scale amplifies but does not establish priors: Effective priors can be established with as little as 960h of data (Base WER 15.49% vs. Base-FS 36.16%), but the large model + large data combination yields the best results
Simple additive fusion suffices: High-level phonetic and low-level acoustic representations are largely orthogonal, and addition performs on par with more complex fusion (WER 7.50% vs. 7.49% for concatenation)
Commercial solution AES-V2 achieves the best perceptual quality (DNSMOS 3.35, UTMOS 4.09) but a WER of 21.32%: This highlights the persistent trade-off between perceptual quality and linguistic accuracy

Highlights & Insights

Strong conceptual contribution: The paper is the first to systematically distinguish linguistic and acoustic hallucinations in speech enhancement and trace each back to its root cause
Paradigm shift: Moves from "learning priors from corrupted inputs" to "directly leveraging existing robust priors," avoiding prior contamination
In-depth prior source analysis: Through carefully designed experiments using three metrics (PNMI, RFS, MRS), the paper reveals that phonological priors fundamentally originate from the contextual reasoning capability cultivated by masked prediction objectives
Highly practical: Computational cost of only 21.42 G MACs/s, far lower than StoRM (\(317.76 \times 30\)) and FlowSE (\(36.79 \times 32\)), while achieving superior performance
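
To make the cost comparison concrete, the figures quoted above can be multiplied out. The sketch assumes the second factor for StoRM and FlowSE is the number of sampling steps (network evaluations) per output and that all values are in G MACs/s; the summary does not spell this out.

```python
# Figures quoted in the text (G MACs/s); the "x steps" reading is an assumption.
pase_cost = 21.42
storm_cost = 317.76 * 30    # per-evaluation cost times assumed number of steps
flowse_cost = 36.79 * 32

storm_ratio = storm_cost / pase_cost
flowse_ratio = flowse_cost / pase_cost

print(f"StoRM:  {storm_cost:.2f} G MACs/s, {storm_ratio:.1f}x PASE")
print(f"FlowSE: {flowse_cost:.2f} G MACs/s, {flowse_ratio:.1f}x PASE")
```

Under that reading, StoRM costs roughly 445 times and FlowSE roughly 55 times as much compute as PASE per second of audio.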

Limitations & Future Work

UTMOS scores are relatively low on the DNS1 with-reverb subset, suggesting a possible domain mismatch between reverberant training and test conditions
Reliance on WavLM-Large as the backbone results in a non-trivial parameter count (382M), posing challenges for edge deployment
The low-level acoustic representation in the dual-stream design may carry residual noise, causing UTMOS to drop from 3.42 to 3.09
Generalization to multilingual speech is not discussed
Adversarial training of the vocoder may introduce artifacts; further validation under extreme conditions is needed

Related Work & Insights

WavLM's layer-wise analysis, in which lower layers encode acoustic/speaker information and higher layers encode abstract linguistic information, provides the theoretical basis for the proposed design
HuBERT's masked prediction training paradigm informs the analysis of phonological prior sources
The shortcomings of prior work such as DeVo/GenSE/SELM/LLaSE-G1 (prior contamination, information loss) directly motivate the paradigm shift proposed in this paper
Insight: The strategy of directly exploiting pretrained knowledge rather than relearning it may be equally effective in other S3M-based tasks, such as automatic speech recognition and speaker verification

Rating

Novelty: ⭐⭐⭐⭐⭐ — Outstanding in both conceptual innovation (hallucination taxonomy + paradigm shift) and technical innovation (DRD + dual-stream design)
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Multi-dataset, multi-metric, in-depth ablations (distillation layers, objective functions, prior sources, fusion schemes)
Writing Quality: ⭐⭐⭐⭐⭐ — Problem definition is precise, analysis is logically layered, and causal reasoning is rigorous
Value: ⭐⭐⭐⭐⭐ — Achieves the best balance between perceptual quality and content fidelity, with open-source code and high practical utility