HCFD: A Benchmark for Audio Deepfake Detection in Healthcare

Conference: ACL 2026 · arXiv: 2604.17642 · Code: GitHub · Area: Medical Imaging · Keywords: Audio deepfake detection, pathological speech, neural audio codec, hyperbolic space prototypes, healthcare security

TL;DR

This paper introduces HCFD, a codec-based audio deepfake detection task for healthcare settings. It constructs HCFK, the first codec-forged speech dataset covering multiple clinical pathological conditions (depression, Alzheimer's disease, dysarthria), and proposes the PHOENIX-Mamba framework, which models heterogeneous forgery evidence prototypes in hyperbolic space, achieving 97.04% detection accuracy on the English depression subset.

Background & Motivation

Background: Audio deepfake detection has advanced rapidly in recent years, with benchmarks such as ASVspoof and CodecFake driving progress. Existing detectors are primarily trained and evaluated on healthy speech and have demonstrated reasonable capability in detecting forgeries generated by neural audio codecs (NACs).

Limitations of Prior Work: (1) Speech in clinical settings—such as remote consultations and telephone screening—faces a real risk of being replaced by codec-synthesized impostors, yet existing detectors have never been evaluated on pathological speech. (2) Pathological speech exhibits disease-induced abnormalities in prosody, articulation, and phonation that systematically alter acoustic features, which may mask or confound the subtle artifacts introduced by codecs. (3) Empirical results show that AASIST trained on healthy speech drops to near-random performance (48.62%) when evaluated on pathological speech.

Key Challenge: Codec forgery detection relies on capturing subtle artifacts from quantization and bandwidth compression, but the acoustic variability of pathological speech (abnormal speaking rate, altered voice quality, reduced intelligibility) highly overlaps with these artifacts in the spectral domain, making it difficult for detectors to distinguish disease-related characteristics from forgery traces.

Goal: (1) Construct HCFK, the first pathology-aware codec-forged speech dataset; (2) systematically evaluate the failure modes of existing detectors on clinical speech; (3) design a detection framework specifically addressing the heterogeneity of pathological speech.

Key Insight: The authors observe that codec artifacts may manifest in multiple heterogeneous patterns across pathological speech (different disease conditions, different codec families), which cannot be captured by a single vector representation. A method is therefore needed that can retain multiple local evidence vectors and model heterogeneous forgery patterns.

Core Idea: Model heterogeneous patterns of codec-forged speech via multi-prototype clustering in hyperbolic space—retaining multiple local evidence vectors and performing automatic pattern discovery and classification through exponential mapping on the Poincaré ball and prototype distances.

Method

Overall Architecture

The PHOENIX-Mamba pipeline proceeds as follows: (1) Input speech is passed through a pretrained encoder (e.g., PaSST) to extract a feature sequence \(X \in \mathbb{R}^{T \times D}\); (2) an adapter projects features to a lower-dimensional space \(U\); (3) a Mamba state-space model performs long-range temporal modeling to produce \(Z\); (4) learnable pooling compresses \(Z\) into \(M\) evidence vectors \(E\); (5) exponential mapping embeds evidence vectors onto the Poincaré ball; (6) classification scores are computed in hyperbolic space via geodesic distances to positive/negative class prototypes.
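
To make the data flow concrete, here is a minimal, shape-level PyTorch sketch of steps (2)–(4). The dimensions (`feat_dim`, `adapter_dim`, `num_evidence`) and the attention form of the learnable pooling are illustrative assumptions, and a GRU stands in for the Mamba block so the sketch runs without extra dependencies; the actual backbone would come from a Mamba implementation such as the `mamba-ssm` package.

```python
import torch
import torch.nn as nn

class MultiEvidencePooling(nn.Module):
    """Compress a length-T sequence into M local evidence vectors,
    e_m = sum_t a_{m,t} z_t, with learnable attention weights a."""
    def __init__(self, dim: int, num_evidence: int = 8):
        super().__init__()
        self.score = nn.Linear(dim, num_evidence)  # one scoring head per evidence slot

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        a = self.score(z).softmax(dim=1)            # (B, T, M): weights over time
        return torch.einsum("btm,btd->bmd", a, z)   # (B, M, D): M evidence vectors

class PhoenixPipelineSketch(nn.Module):
    """Steps (2)-(4) of the pipeline; the upstream encoder (step 1, e.g. PaSST)
    is frozen and applied outside this module."""
    def __init__(self, feat_dim: int = 768, adapter_dim: int = 128, num_evidence: int = 8):
        super().__init__()
        self.adapter = nn.Linear(feat_dim, adapter_dim)   # step 2: X -> U
        # Step 3: the paper uses a Mamba state-space block; a GRU is a stand-in
        # here purely to keep the sketch dependency-free.
        self.temporal = nn.GRU(adapter_dim, adapter_dim, batch_first=True)
        self.pool = MultiEvidencePooling(adapter_dim, num_evidence)  # step 4

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.adapter(x)        # (B, T, d)
        z, _ = self.temporal(u)    # (B, T, d)
        return self.pool(z)        # (B, M, d): evidence for hyperbolic scoring (steps 5-6)
```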

Key Designs

  1. Multi-Evidence Pooling:

    • Function: Compresses long-sequence representations into \(M\) local evidence vectors rather than a single global vector.
    • Mechanism: Learnable attention weights perform a weighted sum over temporal features, with each evidence vector \(e_m = \sum_t a_{m,t} z_t\) attending to different local regions of the sequence. Weights are generated by a differentiable scoring mechanism.
    • Design Motivation: Codec artifacts are unevenly and sparsely distributed across pathological speech; a single pooling vector discards critical local cues. The multi-evidence design enables the model to retain discriminative features from multiple distinct positions.
  2. Hyperbolic Prototype Reasoning:

    • Function: Performs distance-based classification on the Poincaré ball using multiple positive-class prototypes and one negative-class prototype.
    • Mechanism: Each evidence vector is projected onto the Poincaré ball via the exponential map \(h_m = \text{Exp}_0^c(We_m)\). The model parameterizes \(K\) positive-class (forged) prototypes and one negative-class (genuine) prototype. Geodesic distances from each evidence vector to all prototypes are computed, soft assignments \(q_{m,k}\) are obtained via a temperature-controlled softmax, and instance-level positive/negative scores are aggregated via log-sum-exp (see the sketch after this list).
    • Design Motivation: Different codec families and pathological conditions produce heterogeneous forgery patterns that a single decision boundary in Euclidean space struggles to separate. The exponential volume growth of hyperbolic space is well-suited to modeling tree-structured hierarchical class organization, and the multi-prototype design allows the model to automatically discover subclasses within the forged category.
  3. Geometry-Aware Regularization Loss:

    • Function: Encourages prototypes to form compact yet dispersed cluster structures.
    • Mechanism: The total loss is \(\mathcal{L} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{cluster} + \beta \mathcal{L}_{sep}\). The clustering loss pulls evidence points toward their assigned positive-class prototypes and uses an entropy term to control assignment sharpness; the separation loss pushes different positive-class prototypes apart and maintains distance between positive and negative prototypes.
    • Design Motivation: Prevents prototype collapse (all prototypes converging to the same point) and preserves pattern diversity. Training with only the classification loss may cause prototypes to degenerate into a single-mode solution.
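
The hyperbolic operations in Design 2 are standard Poincaré-ball formulas. The sketch below shows the exponential map at the origin, the geodesic distance via Möbius addition, the temperature-controlled soft assignment, and log-sum-exp aggregation; the curvature \(c\), the temperature, and the exact axes over which scores are aggregated are assumptions, since the paper's equations are summarized rather than quoted here.

```python
import torch

def exp_map_0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Exponential map at the origin of the Poincare ball:
    Exp_0^c(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||)."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def mobius_add(x: torch.Tensor, y: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Mobius addition on the Poincare ball (needed for the geodesic distance)."""
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den.clamp_min(1e-6)

def poincare_dist(x: torch.Tensor, y: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Geodesic distance d_c(x, y) = (2 / sqrt(c)) * artanh(sqrt(c) * ||(-x) (+)_c y||)."""
    sqrt_c = c ** 0.5
    norm = mobius_add(-x, y, c).norm(dim=-1)
    return (2.0 / sqrt_c) * torch.atanh((sqrt_c * norm).clamp(max=1 - 1e-5))

def prototype_scores(h, pos_protos, neg_proto, tau: float = 0.1, c: float = 1.0):
    """h: (B, M, d) evidence on the ball; pos_protos: (K, d); neg_proto: (1, d).
    Returns instance-level positive/negative scores and soft assignments q."""
    d_pos = poincare_dist(h.unsqueeze(2), pos_protos[None, None], c)  # (B, M, K)
    d_neg = poincare_dist(h, neg_proto, c)                            # (B, M)
    q = torch.softmax(-d_pos / tau, dim=-1)        # soft assignment over K prototypes
    # log-sum-exp over evidence (and prototypes) acts as a smooth maximum
    s_pos = torch.logsumexp((-d_pos / tau).flatten(1), dim=1)         # (B,)
    s_neg = torch.logsumexp(-d_neg / tau, dim=1)                      # (B,)
    return s_pos, s_neg, q

# Tying this to the pipeline sketch: with evidence e of shape (B, M, d) and a
# learnable projection W (a hypothetical nn.Linear), the ball embedding is
#   h = exp_map_0(W(e))
# and s_pos / s_neg feed the classifier while q drives the clustering loss.
```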

Loss & Training

Training uses the AdamW optimizer for 20 epochs with batch size 32, weight decay 0.01, and gradient clipping at 1.0. Upstream pretrained model parameters are frozen; only the adapter, Mamba backbone, and prototype parameters are trained, yielding 2M–5M trainable parameters. Evaluation metrics are accuracy, macro F1, and EER.
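
Putting Design 3 and this recipe together, the loss composition might be sketched as follows. The coefficients (\(\lambda\), \(\beta\), the entropy weight) and the exact forms of the clustering and separation terms are assumptions; `q` and `d_pos` are the soft assignments and evidence-to-prototype distances from the scoring sketch above, and `pos_proto_dists` / `pos_neg_dists` would be prototype distances computed with `poincare_dist`.

```python
import torch
import torch.nn.functional as F

def phoenix_loss(s_pos, s_neg, labels, q, d_pos, pos_proto_dists, pos_neg_dists,
                 lam: float = 0.1, beta: float = 0.1, entropy_w: float = 0.1):
    """Total loss L = L_cls + lam * L_cluster + beta * L_sep.
    All coefficients and the exact term forms are illustrative assumptions."""
    # classification over the two aggregated scores: label 0 = genuine, 1 = forged
    logits = torch.stack([s_neg, s_pos], dim=-1)              # (B, 2)
    l_cls = F.cross_entropy(logits, labels)
    # clustering: pull each evidence point toward its soft-assigned positive
    # prototype; the entropy term controls how sharp the assignments become
    entropy = -(q * q.clamp_min(1e-8).log()).sum(-1).mean()
    l_cluster = (q * d_pos).sum(-1).mean() + entropy_w * entropy
    # separation: keep positive prototypes mutually dispersed and away from
    # the negative prototype (negated mean distance, one simple choice)
    l_sep = -pos_proto_dists.mean() - pos_neg_dists.mean()
    return l_cls + lam * l_cluster + beta * l_sep
```

Under the stated recipe, this loss would be minimized with AdamW (weight decay 0.01) over only the adapter, Mamba, and prototype parameters, with gradient norms clipped at 1.0.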

Key Experimental Results

Main Results

| Method | EN-Depression Acc (%) | EN-Alzheimer's Acc (%) | EN-Dysarthria Acc (%) | ZH-Depression Acc (%) |
| --- | --- | --- | --- | --- |
| AASIST (trained on CodecFake) | 48.62 | 34.19 | 36.71 | 45.81 |
| AASIST (in-domain trained) | 60.84 | 52.14 | 56.07 | 58.06 |
| PaSST+CNN | 78.98 | 69.27 | 71.03 | 75.69 |
| PHOENIX-Mamba (PaSST) | 97.04 | 96.73 | 96.57 | 94.41 |

Ablation Study

| Configuration | EN-Depression Acc (%) | EN-Alzheimer's Acc (%) | Note |
| --- | --- | --- | --- |
| Full PHOENIX-Mamba | 97.04 | 96.73 | Complete model |
| CNN head (w/o Mamba) | 82.26 | 75.52 | No temporal modeling |
| Single evidence (M=1) | 73.51 | 55.03 | Severe degradation |
| PHOENIX-Euc (Euclidean) | 83.62 | 79.48 | Without hyperbolic geometry |

Key Findings

  • A large domain shift exists between healthy and pathological speech: AASIST trained on CodecFake performs near chance on HCFK, demonstrating the necessity of domain-specific benchmarks.
  • Multi-evidence pooling contributes the most—removing it (M=1) causes Alzheimer's detection to plummet from 96.73% to 55.03%, indicating highly uneven distribution of forgery cues in pathological speech.
  • PaSST consistently outperforms speech SSL models such as WavLM and Wav2vec2 as the upstream encoder, likely because its patch-based spectro-temporal representation is better suited to capturing codec artifacts.
  • Cross-pathology transfer experiments show that a model trained on depression and dysarthria speech achieves 98.53% accuracy when transferred to the Alzheimer's subset.

Highlights & Insights

  • Extending codec forgery detection to clinical settings is a practically valuable new direction—the proliferation of telemedicine and voice-based biometric authentication makes this threat increasingly real.
  • The combination of multi-evidence pooling and hyperbolic prototypes elegantly addresses the heterogeneous forgery pattern problem, substantially outperforming a single Euclidean-space classifier.
  • The surprisingly strong cross-pathology transfer results (98.53%) suggest that the core characteristics of codec artifacts may be pathology-agnostic, offering optimism for real-world deployment.

Limitations & Future Work

  • Only three clinical conditions and two languages are covered, limiting generalizability.
  • Only codec resynthesis attacks are considered; other forgery modalities such as TTS, voice conversion, and diffusion-based methods are not addressed.
  • Open-set detection and uncertainty estimation are not investigated.
  • Construction of HCFK depends on the availability of existing clinical speech datasets; privacy and ethical constraints may limit future expansion.

Comparison with Related Work

  • vs. CodecFake (Wu et al.): CodecFake establishes a benchmark on healthy speech; this paper finds that detectors trained on it completely fail on pathological speech, demonstrating the necessity of domain-specific benchmarks.
  • vs. AASIST: AASIST employs graph attention networks to model spectro-temporal relationships, but its single-vector representation cannot handle the heterogeneity of pathological speech; PHOENIX-Mamba's multi-evidence and multi-prototype design specifically addresses this limitation.
  • vs. SASTNet: SASTNet attempts to unify semantic and acoustic representations for detection but is still evaluated on healthy speech; this paper targets the more challenging setting of pathological speech.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First systematic study of codec forgery detection in healthcare settings; the problem formulation is forward-looking.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers three diseases, two languages, and seven codecs with thorough ablations, though model scale is limited.
  • Writing Quality: ⭐⭐⭐⭐ Problem motivation is clear and method description is detailed, though the paper is lengthy.
  • Value: ⭐⭐⭐⭐ Healthcare speech security is an important problem, but further validation is needed before practical deployment.