HCFD: A Benchmark for Audio Deepfake Detection in Healthcare
Conference: ACL 2026
arXiv: 2604.17642
Code: GitHub
Area: Medical Imaging
Keywords: Audio Deepfake Detection, Pathological Speech, Neural Audio Codec, Hyperbolic Prototype, Medical Security
TL;DR
This paper proposes HCFD, the task of detecting codec-forged speech in medical scenarios; constructs HCFK, the first codec-forged speech dataset covering several clinical pathological conditions (depression, Alzheimer’s, dysarthria); and introduces the PHOENIX-Mamba framework, which models multi-modal forgery evidence with prototypes in hyperbolic space and reaches 97.04% detection accuracy on the English depression subset.
Background & Motivation
Background: Audio deepfake detection has developed rapidly in recent years, with benchmarks like ASVspoof and CodecFake driving progress. Existing detectors are primarily trained and evaluated on healthy speech and have demonstrated some capability in detecting forged speech generated by neural audio codecs (NAC).
Limitations of Prior Work: (1) Speech in medical scenarios (e.g., remote consultation, telephone screening) faces a real risk of being replaced by codec-generated synthetic speech, yet existing detectors have never been evaluated on pathological speech; (2) Pathological speech exhibits disease-related abnormalities in prosody, articulation, and phonation, which systematically alter acoustic features and may mask or be confused with the subtle artifacts introduced by codecs; (3) Experiments show that AASIST trained on healthy speech (CodecFake) falls below 50% accuracy on pathological speech (48.62% on the English depression subset).
Key Challenge: Codec forgery detection relies on capturing subtle artifacts introduced by quantization and bandwidth compression, but the acoustic variations of pathological speech (abnormal speech rate, altered voice quality, reduced clarity) overlap heavily with these artifacts in spectral features, making it difficult for detectors to tell "disease characteristics" apart from "forgery traces."
Goal: (1) Construct the first pathology-aware codec forged speech dataset HCFK; (2) Systematically evaluate the failure modes of existing detectors on medical speech; (3) Design a detection framework specifically targeting the heterogeneity of pathological speech.
Key Insight: The authors observe that codec artifacts can appear in several heterogeneous patterns within pathological speech (different disease conditions, different codec families), and that a single vector representation cannot capture this multi-modal (multi-cluster) distribution. A suitable method therefore has to preserve multiple pieces of local evidence and model heterogeneous forgery patterns.
Core Idea: Use multi-prototype clustering in hyperbolic space to model the heterogeneous patterns of codec-forged speech—preserving multiple local evidence vectors and achieving automatic pattern discovery and classification through exponential mapping and prototype distances on the Poincaré ball.
Method
Overall Architecture
The PHOENIX-Mamba workflow: (1) Input speech passes through a pre-trained encoder (e.g., PaSST) to extract feature sequences \(X \in \mathbb{R}^{T \times D}\); (2) Mapped to a low-dimensional space \(U\) via an adapter; (3) Long-range temporal modeling using a Mamba state space model to obtain \(Z\); (4) Learnable pooling compresses \(Z\) into \(M\) evidence vectors \(E\); (5) Evidence vectors are embedded into the Poincaré ball via exponential mapping; (6) Classification scores are calculated by geodesic distances to positive/negative prototypes in hyperbolic space.
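Steps (5) and (6) rely on two operations on the Poincaré ball. This summary does not reproduce the paper's formulas, so the block below states the standard definitions for a ball of curvature \(-c\) with \(c > 0\); that the paper uses exactly this parametrization is an assumption.

\[
\operatorname{Exp}_0^c(v) = \tanh\!\left(\sqrt{c}\,\lVert v \rVert\right)\frac{v}{\sqrt{c}\,\lVert v \rVert},
\qquad
d_c(x, y) = \frac{2}{\sqrt{c}}\,\operatorname{artanh}\!\left(\sqrt{c}\,\lVert (-x) \oplus_c y \rVert\right)
\]

\[
x \oplus_c y = \frac{\left(1 + 2c\langle x, y\rangle + c\lVert y \rVert^2\right)x + \left(1 - c\lVert x \rVert^2\right)y}{1 + 2c\langle x, y\rangle + c^2\lVert x \rVert^2\lVert y \rVert^2}
\]

Here \(\oplus_c\) is Möbius addition. The exponential map sends any Euclidean vector strictly inside the ball of radius \(1/\sqrt{c}\), and distances grow rapidly near the boundary, which is the property the prototype design relies on.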
Key Designs
- Multi-Evidence Pooling:
    - Function: Compresses long sequence representations into \(M\) local evidence vectors rather than a single global vector.
    - Mechanism: Uses learnable attention weights for weighted summation of temporal features, where each evidence vector \(e_m = \sum_t a_{m,t} z_t\) focuses on different local regions of the sequence. Weights are generated by a differentiable scoring mechanism.
    - Design Motivation: Codec artifacts are unevenly and sparsely distributed in pathological speech; a single pooling vector loses critical local cues. Multi-evidence design allows the model to retain multiple discriminative features at different positions.
- Hyperbolic Prototype Reasoning:
    - Function: Performs distance-based classification using multiple positive prototypes and one negative prototype on the Poincaré ball.
    - Mechanism: Each evidence vector is projected onto the Poincaré ball via the exponential map \(h_m = \text{Exp}_0^c(We_m)\). The model parameterizes \(K\) positive (forged) prototypes and 1 negative (real) prototype. From the geodesic distance between each evidence vector and each prototype, a soft assignment \(q_{m,k}\) is obtained via a temperature-controlled softmax, and instance-level positive and negative scores are then aggregated via log-sum-exp.
    - Design Motivation: Different codec families and pathological conditions produce heterogeneous forgery patterns, which are difficult to separate with a single decision boundary in Euclidean space. The exponential volume growth of hyperbolic space is suitable for modeling tree-like hierarchical class structures, and the multi-prototype design allows the model to automatically discover sub-categories within the forgery class.
- Geometry-aware Regularization Loss:
    - Function: Guides prototypes toward a structure in which each cluster is compact and different clusters are well separated.
    - Mechanism: Total loss \(\mathcal{L} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{cluster} + \beta \mathcal{L}_{sep}\). The clustering loss pulls evidence points closer to their assigned positive prototypes and controls assignment sharpness with an entropy term; the separation loss pushes different positive prototypes away from each other and from the negative prototype.
    - Design Motivation: Prevents prototype collapse (all prototypes converging to the same point) and maintains pattern diversity. Training with only classification loss may cause prototypes to degenerate into unimodal solutions.
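The first two designs above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the query matrix `Q`, the projection `W`, the temperature `tau`, and the exact way the log-sum-exp aggregates distances are all assumptions, since this summary only names the ingredients.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def logsumexp(x):
    # Numerically stable log-sum-exp over a 1-D array.
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def pool_evidence(Z, Q):
    """Z: (T, d) sequence, Q: (M, d) hypothetical learnable queries -> E: (M, d)."""
    A = softmax(Q @ Z.T / np.sqrt(Z.shape[1]), axis=1)  # a_{m,t}; each row sums to 1
    return A @ Z, A                                     # e_m = sum_t a_{m,t} z_t

def expmap0(V, c=1.0, eps=1e-9):
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    n = np.maximum(np.linalg.norm(V, axis=-1, keepdims=True), eps)
    return np.tanh(np.sqrt(c) * n) * V / (np.sqrt(c) * n)

def mobius_add(x, y, c=1.0):
    xy = (x * y).sum(-1, keepdims=True)
    x2 = (x * x).sum(-1, keepdims=True)
    y2 = (y * y).sum(-1, keepdims=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def poincare_dist(x, y, c=1.0):
    """Geodesic distance; broadcasts over leading axes."""
    diff = np.linalg.norm(mobius_add(-x, y, c), axis=-1)
    return 2 / np.sqrt(c) * np.arctanh(np.clip(np.sqrt(c) * diff, 0.0, 1 - 1e-7))

def score(E, W, P_pos, p_neg, c=1.0, tau=0.5):
    """Returns (forged-minus-real logit, soft assignment q of shape (M, K))."""
    H = expmap0(E @ W.T, c)                                     # h_m = Exp_0^c(W e_m)
    d_pos = poincare_dist(H[:, None, :], P_pos[None, :, :], c)  # (M, K)
    d_neg = poincare_dist(H, p_neg[None, :], c)                 # (M,)
    q = softmax(-d_pos / tau, axis=1)                           # q_{m,k}
    # Assumed aggregation: log-sum-exp of negative scaled distances.
    s_pos = logsumexp((-d_pos / tau).ravel())
    s_neg = logsumexp(-d_neg / tau)
    return float(s_pos - s_neg), q

# Toy shapes only; random values stand in for trained parameters.
rng = np.random.default_rng(0)
T, d, M, K, p = 50, 16, 4, 3, 8
Z = rng.normal(size=(T, d))
Q = rng.normal(size=(M, d))
W = rng.normal(size=(p, d)) * 0.1
P_pos = expmap0(rng.normal(size=(K, p)) * 0.5)
p_neg = expmap0(rng.normal(size=(p,)) * 0.5)
E, A = pool_evidence(Z, Q)
logit, q = score(E, W, P_pos, p_neg)
```

A positive `logit` means the evidence sits closer to the forged prototypes than to the real one. Keeping `M` separate evidence vectors means a single strongly forged region can dominate the log-sum-exp even when the rest of the utterance looks clean.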
Loss & Training
Trained using the AdamW optimizer for 20 epochs, batch size 32, weight decay 0.01, and gradient clipping 1.0. Upstream PTM parameters are frozen; only the adapter, Mamba backbone, and prototype parameters are trained, with a trainable parameter count of 2M-5M. Evaluation metrics include Accuracy, Macro F1, and EER.
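The total loss \(\mathcal{L} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{cluster} + \beta \mathcal{L}_{sep}\) from the regularization design can be illustrated with one plausible form for each term. The hinge margin, the entropy weight `gamma`, and the default values of `lam` and `beta` are hypothetical; this summary does not give the paper's exact formulas or hyperparameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def poincare_dist(x, y, c=1.0):
    """Geodesic distance on the Poincare ball of curvature -c (arccosh form)."""
    sq = ((x - y) ** 2).sum(-1)
    x2 = (x ** 2).sum(-1)
    y2 = (y ** 2).sum(-1)
    arg = 1 + 2 * c * sq / ((1 - c * x2) * (1 - c * y2))
    return np.arccosh(np.maximum(arg, 1.0)) / np.sqrt(c)

def bce_from_logit(logit, y):
    """Binary cross-entropy for label y (1 = forged, 0 = real)."""
    return float(np.logaddexp(0.0, -logit) if y == 1 else np.logaddexp(0.0, logit))

def cluster_loss(d_pos, tau=0.5, gamma=0.1):
    """d_pos: (M, K) distances from evidence to positive prototypes."""
    q = softmax(-d_pos / tau, axis=1)
    pull = (q * d_pos).sum(axis=1).mean()                    # pull toward assigned prototypes
    entropy = -(q * np.log(q + 1e-12)).sum(axis=1).mean()    # low entropy = sharp assignment
    return float(pull + gamma * entropy)

def separation_loss(P_pos, p_neg, margin=1.0):
    """Hinge penalty on every prototype pair closer than `margin` (assumed form)."""
    P = np.vstack([P_pos, p_neg[None, :]])
    D = poincare_dist(P[:, None, :], P[None, :, :])
    iu = np.triu_indices(len(P), k=1)                        # each unordered pair once
    return float(np.maximum(0.0, margin - D[iu]).mean())

def total_loss(l_cls, l_cluster, l_sep, lam=0.1, beta=0.1):
    return l_cls + lam * l_cluster + beta * l_sep

# Collapsed prototypes are penalised; well-spread ones are not.
P_collapsed, n_collapsed = np.full((3, 2), 0.1), np.full(2, 0.1)
P_spread = np.array([[0.9, 0.0], [-0.9, 0.0], [0.0, 0.9]])
n_spread = np.array([0.0, -0.9])
sep_collapsed = separation_loss(P_collapsed, n_collapsed)
sep_spread = separation_loss(P_spread, n_spread)
```

With these toy prototypes the collapsed configuration incurs the full margin penalty while the spread configuration incurs none, which is the behaviour the separation term is meant to encourage.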
Key Experimental Results
Main Results
| Method | En-Depression Acc | En-Alzheimer's Acc | En-Dysarthria Acc | Zh-Depression Acc |
|---|---|---|---|---|
| AASIST (CodecFake train) | 48.62 | 34.19 | 36.71 | 45.81 |
| AASIST (In-domain train) | 60.84 | 52.14 | 56.07 | 58.06 |
| PaSST+CNN | 78.98 | 69.27 | 71.03 | 75.69 |
| PHOENIX-Mamba (PaSST) | 97.04 | 96.73 | 96.57 | 94.41 |
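The table above reports accuracy only, while the evaluation protocol also lists EER. As a reminder of how that metric is defined, here is a minimal threshold-sweep sketch. It assumes that higher scores mean "more likely forged" and that label 1 marks forged speech; the paper's own scoring convention is not stated in this summary.

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal error rate: the point where false-positive and false-negative rates meet."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.unique(scores)  # sorted ascending
    # One extra threshold above the maximum so the sweep reaches FPR = 0.
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)
    real, forged = scores[labels == 0], scores[labels == 1]
    fpr = np.array([(real >= t).mean() for t in thresholds])    # real flagged as forged
    fnr = np.array([(forged < t).mean() for t in thresholds])   # forged passed as real
    i = np.argmin(np.abs(fpr - fnr))
    return float((fpr[i] + fnr[i]) / 2)

eer_separable = compute_eer([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
eer_overlap = compute_eer([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

Unlike accuracy, EER does not depend on a fixed decision threshold, so the two metrics can rank systems differently when scores are poorly calibrated.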
Ablation Study
| Configuration | En-Depression Acc | En-Alzheimer's Acc | Description |
|---|---|---|---|
| Full PHOENIX-Mamba | 97.04 | 96.73 | Complete model |
| CNN Head (No Mamba) | 82.26 | 75.52 | No temporal modeling |
| Single evidence (M=1) | 73.51 | 55.03 | Significant degradation with single evidence |
| PHOENIX-Euc (Euclidean) | 83.62 | 79.48 | Removed hyperbolic geometry |
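To make the size of each ablation effect explicit, the snippet below computes the absolute accuracy drop of every ablated configuration relative to the full model. The numbers are copied from the table above; only the variable names are new.

```python
full = {"En-Depression": 97.04, "En-Alzheimer's": 96.73}
ablations = {
    "CNN Head (No Mamba)": {"En-Depression": 82.26, "En-Alzheimer's": 75.52},
    "Single evidence (M=1)": {"En-Depression": 73.51, "En-Alzheimer's": 55.03},
    "PHOENIX-Euc (Euclidean)": {"En-Depression": 83.62, "En-Alzheimer's": 79.48},
}

# Absolute drop in accuracy points, rounded to match the table's precision.
drops = {
    name: {subset: round(full[subset] - acc, 2) for subset, acc in accs.items()}
    for name, accs in ablations.items()
}

# Configuration whose removal hurts most on each subset.
worst = {subset: max(drops, key=lambda n: drops[n][subset]) for subset in full}

for name, d in drops.items():
    print(f"{name}: {d}")
```

The single-evidence variant loses 23.53 points on En-Depression and 41.70 points on En-Alzheimer's, the largest drop on both subsets.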
Key Findings
- There is a significant domain shift from healthy speech to pathological speech; AASIST trained on CodecFake scores below 50% accuracy on all four HCFK subsets (34.19% to 48.62%).
- Multi-evidence pooling contributes the most: reducing it to a single evidence vector (M=1) drops Alzheimer's accuracy from 96.73% to 55.03%, which suggests that forgery cues are distributed very unevenly in pathological speech.
- PaSST as an upstream encoder consistently outperforms speech SSL models like WavLM and Wav2vec2, likely because its patch-based spectro-temporal representation is better suited for capturing codec artifacts.
- Cross-pathology transfer experiments show that training on Depression + Dysarthria and transferring to Alzheimer's reaches 98.53% Acc.
Highlights & Insights
- Extending codec forgery detection to medical scenarios is a valuable new direction: the spread of telemedicine and voice biometric authentication makes this threat increasingly real.
- The combination of multi-evidence pooling and hyperbolic prototypes directly targets the "heterogeneous forgery pattern" problem; in the ablation, the full model outperforms both the single-evidence variant and the Euclidean variant (PHOENIX-Euc) by more than 13 accuracy points.
- Cross-pathology transfer results are surprisingly good (98.53%), suggesting that the core features of codec artifacts might be pathology-independent, which provides hope for practical deployment.
Limitations & Future Work
- Coverage is limited: the benchmark currently includes only three clinical conditions and two languages.
- Only considers codec resynthesis attacks, without involving other forgery means like TTS/VC/diffusion models.
- Open-set detection and uncertainty estimation have not been studied.
- The construction of HCFK relies on the availability of existing clinical speech datasets; privacy and ethical constraints may limit expansion.
Related Work & Insights
- vs CodecFake (Wu et al.): CodecFake builds its benchmark on healthy speech; this paper finds that a detector trained on it falls below 50% accuracy on pathological speech, which supports the need for domain-specific benchmarks.
- vs AASIST: AASIST uses graph attention networks to model spectro-temporal relationships, but its single-vector representation cannot handle the heterogeneity of pathological speech; the multi-evidence + multi-prototype design of PHOENIX-Mamba specifically addresses this.
- vs SASTNet: SASTNet attempts to unify semantic and acoustic representations for detection but still evaluates on healthy speech; this paper focuses on the more challenging scenario of pathological speech.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First systematic study of codec forgery detection in medical scenarios; the problem definition is forward-looking.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers three diseases, two languages, and seven codecs with sufficient ablation, though model scale is limited.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation, detailed method description, though the paper is quite long.
- Value: ⭐⭐⭐⭐ Medical speech security is a critical issue, but practical deployment still requires more validation.