
Semi-Supervised Disease Detection from Speech Dialogues with Multi-Level Data Modeling

Conference: ACL 2026 arXiv: 2601.04744 Code: GitHub Area: Medical Speech Processing Keywords: Semi-supervised learning, pathological speech detection, multi-granularity modeling, pseudo-labeling, clinical dialogue

TL;DR

This paper proposes an audio-only semi-supervised learning framework that jointly models pathological speech features in clinical dialogues at three levels—session, clip, and frame—using an EMA teacher-student network to dynamically generate high-quality pseudo-labels. With only 11 annotated samples, the framework achieves 90% of fully supervised performance on depression and Alzheimer's disease detection.

Background & Motivation

Background: Leveraging acoustic speech features as biomarkers for disease detection is a growing area of interest. Existing approaches predominantly rely on fully supervised learning with pretrained audio encoders such as wav2vec2 and HuBERT. Multimodal methods incorporating audio, text, and vision have also been explored, but face challenges including modality conflicts and ASR error propagation.

Limitations of Prior Work: (1) Annotating medical speech data requires clinical expertise, making acquisition prohibitively expensive and resulting in severe data scarcity. (2) Clinical annotations exhibit substantial inter-rater subjectivity, rendering labels inherently noisy. (3) The most critical challenge is distal weak supervision: a dialogue lasting several minutes carries only a single session-level label (e.g., "depressed/healthy"), yet pathological features are not uniformly distributed across utterances. (4) Existing methods segment long recordings and process them independently, implicitly assuming uniform symptom expression across segments—an assumption that frequently fails in practice. (5) General-purpose semi-supervised methods from the audio domain cannot be directly transferred to clinical settings, as pathological patterns are sparsely distributed in speech.

Key Challenge: Supervision signals exist at the session level (macro), whereas meaningful acoustic features must be extracted at the frame or clip level (micro). Models must learn to localize diagnostically informative segments within long dialogues without fine-grained annotations.

Goal: Design a semi-supervised framework tailored to medical speech that simultaneously addresses weak supervision, data scarcity, and label noise.

Key Insight: The hierarchical structure of clinical dialogues—frame-level acoustics → clip-level (utterance) semantics → session-level diagnostic labels—is explicitly modeled rather than compressed into a single granularity.

Core Idea: Joint modeling across three granularities (session-level classification + clip-level pseudo-label refinement + frame-level consistency regularization) enables dynamic pseudo-label generation and updating within a single-stage end-to-end training procedure, facilitating efficient exploitation of unlabeled data.

Method

Overall Architecture

The framework adopts a teacher-student architecture with three levels of modeling: (1) a session-level main pipeline that encodes the full dialogue and aggregates representations via a Transformer for final diagnosis; (2) a clip-level module that applies an RNN over utterance-level embeddings and is trained using pseudo-labels generated by the main pipeline; and (3) a frame-level module that enforces consistency constraints via a Siamese network. Teacher network parameters are updated via EMA and do not receive gradients.

Key Designs

  1. Session-level Pipeline:

    • Function: Processes the complete dialogue audio and produces the final diagnostic output.
    • Mechanism: The dialogue is segmented into clips; each clip is encoded by an audio encoder \(E\) (e.g., HuBERT/wav2vec2) as \(embed_{clip_i} = POOL(E(clip_i))\). The embeddings are concatenated with learnable positional encodings and fed into a 3-layer, 16-head Transformer to obtain a session-level representation, which is subsequently aggregated via temporal attention and passed to a classification head. Pseudo-labels are dynamically generated for unlabeled samples and incorporated into the training set.
    • Design Motivation: Processing the full dialogue directly avoids information loss from independent segment processing; multi-head attention implicitly learns the relative diagnostic importance of each utterance.
  2. Clip-level Pseudo-labeling:

    • Function: Encourages the audio encoder to learn finer-grained, utterance-level pathological representations.
    • Mechanism: Clip embeddings \(embed_{clip_i}\) from the session-level pipeline are fed into a two-layer bidirectional LSTM to produce refined clip-level representations \(h_{clip_i} = RNN(embed_{clip_i})\). Pseudo-labels are derived directly from session-level inference, and the module is trained with cross-entropy loss. Crucially, no assumption is made that pathological features are uniformly expressed; pseudo-labels are updated dynamically throughout training.
    • Design Motivation: Eliminates the strong "uniform expression" assumption of existing methods and handles mixed dialogues containing both interviewer and subject speech without requiring speaker diarization.
  3. Frame-level Consistency Regularization:

    • Function: Captures fine-grained acoustic feature consistency at the frame level.
    • Mechanism: A Siamese network applies two different augmentations (speed perturbation, pitch shifting, time masking) to the same input, generating two views that are passed through the student and teacher networks respectively. An MSE loss enforces consistency between frame-level embeddings: \(Loss_{frame} = MSELoss(embed_{teacher}, embed_{student})\). The teacher is updated via EMA: \(\theta_{teacher} \leftarrow m \cdot \theta_{teacher} + (1-m) \cdot \theta_{student}\), with decay rate \(m=0.999\).
    • Design Motivation: Frame-level training is independent of pseudo-labels, providing inherent robustness to pseudo-label noise—low-level acoustic pattern learning remains unaffected even when session- or clip-level pseudo-labels are inaccurate.
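The temporal-attention aggregation in the session-level pipeline can be sketched as follows. This is a minimal NumPy illustration; the single attention vector `w` is a hypothetical stand-in for the paper's learned attention head, not the actual parameterization.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def session_representation(clip_embeds, w):
    # clip_embeds: (T, D) Transformer outputs, one row per clip
    # w: (D,) attention vector (hypothetical parameterization)
    scores = clip_embeds @ w           # (T,) per-clip relevance scores
    alpha = softmax(scores)            # attention weights over clips
    return alpha @ clip_embeds         # (D,) session-level representation
```

The attention weights implicitly score each utterance's diagnostic relevance, which is what lets the model localize sparsely expressed pathological segments without clip-level labels.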

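The frame-level consistency loss and EMA teacher update from the mechanism above can be sketched as follows (a NumPy sketch; parameter dictionaries stand in for the network weights):

```python
import numpy as np

def frame_consistency_loss(emb_teacher, emb_student):
    """MSE between frame-level embeddings of the two augmented views."""
    return float(np.mean((emb_teacher - emb_student) ** 2))

def ema_update(theta_teacher, theta_student, m=0.999):
    """theta_teacher <- m * theta_teacher + (1 - m) * theta_student,
    applied per parameter tensor. The teacher receives no gradients;
    it only tracks a slow-moving average of the student."""
    return {k: m * theta_teacher[k] + (1 - m) * theta_student[k]
            for k in theta_teacher}
```

With \(m = 0.999\), the teacher integrates student updates over roughly a thousand steps, which smooths out the noise that individual pseudo-labels could otherwise inject.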
Loss & Training

The total loss is a weighted sum of the three-level losses: \(Loss = \alpha Loss_{session} + \beta Loss_{clip} + \gamma Loss_{frame}\). Training follows a single-stage online scheme: after a warm-up of \(k_0\) steps, all pseudo-labels are re-evaluated and updated every \(k\) steps. A confidence-threshold strategy is applied: unlabeled samples whose prediction confidence exceeds the threshold are included in training, while those below it are excluded. The optimal threshold is 0.75, achieving a Macro F1 of 68.58% on EATD-Corpus.
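The confidence-threshold selection and weighted total loss can be sketched as follows (function names are illustrative, not from the paper's code; the default loss weights are placeholders, as the paper's \(\alpha, \beta, \gamma\) values are not restated here):

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.75):
    # probs: (N, C) session-level softmax outputs for unlabeled samples
    conf = probs.max(axis=1)           # per-sample prediction confidence
    labels = probs.argmax(axis=1)      # candidate pseudo-labels
    mask = conf >= threshold           # keep only confident samples
    return labels[mask], mask

def total_loss(l_session, l_clip, l_frame, alpha=1.0, beta=1.0, gamma=1.0):
    # Loss = alpha * L_session + beta * L_clip + gamma * L_frame
    return alpha * l_session + beta * l_clip + gamma * l_frame
```

Because selection is re-run every \(k\) steps on the current teacher, the pseudo-labeled set grows and self-corrects as the model improves, rather than being frozen after a first pass.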

Key Experimental Results

Main Results

Detection Performance at Varying Annotation Ratios (Macro F1)

| Method | 100% | 50% | 40% | 30% | 20% | 10% |
| --- | --- | --- | --- | --- | --- | --- |
| Depression-Baseline | 59.53 | 57.41 | 55.78 | 56.04 | 55.00 | 51.73 |
| Depression-Ours | 63.26 | 62.00 | 58.51 | 58.59 | 57.70 | 54.37 |
| Alzheimer's-Baseline | 71.25 | 70.18 | 69.80 | 67.79 | 67.45 | 65.09 |
| Alzheimer's-Ours | 73.01 | 71.35 | 72.14 | 70.11 | 69.80 | 69.47 |

Ablation Study

Incremental Ablation of Hierarchical Components (Depression Detection, Macro F1)

| Method | 100% | 50% | 30% | 10% |
| --- | --- | --- | --- | --- |
| Baseline | 59.53 | 57.41 | 56.04 | 51.73 |
| +Session-level | - | 60.55 | 58.25 | 52.95 |
| +Clip-level | 60.78 | 58.30 | 56.55 | 52.50 |
| +Frame-level | 62.87 | 60.31 | 58.21 | 54.21 |
| Ours (all + fine-tuned encoder) | 63.26 | 62.00 | 58.59 | 54.37 |

Key Findings

  • Using only 10% of labeled data (~11 samples) achieves 90% of fully supervised baseline performance; 30% labeled data matches full supervision.
  • The framework also outperforms the baseline under full supervision, demonstrating that multi-granularity modeling itself enhances feature learning.
  • The frame-level component contributes most significantly, due to its inherent robustness to pseudo-label noise.
  • The framework generalizes robustly across encoders (wav2vec2, HuBERT, WavLM), languages (Chinese and English), and disease types.
  • Pseudo-label quality improves continuously during training; in later stages, the refined pseudo-labels become more beneficial than adding further manually annotated labels.
  • Compared to the FixMatch baseline, the proposed method achieves substantial gains across all annotation ratios (e.g., 69.47 vs. 61.93 at 10%), as FixMatch does not model the hierarchical sparsity of pathological features.

Highlights & Insights

  • The three-granularity joint modeling design is particularly elegant: the session level handles global decisions, the clip level performs fine-grained learning, and the frame level enforces consistency regularization; the three components are complementary, yet each functions independently.
  • Single-stage online pseudo-label updating avoids the complexity and additional inference cost of conventional multi-stage SSL pipelines.
  • No speaker diarization preprocessing is required; the framework processes raw dialogues containing both interviewer and subject speech, and experiments show negligible performance degradation when interviewer speech is included.
  • Clip-level pseudo-label analysis (Figure 5) clearly demonstrates that the model learns to distinguish subject from interviewer: the proportion of "pathological" pseudo-labels assigned to subjects increases markedly during training, while those for interviewers remain low.

Limitations & Future Work

  • The audio-only approach cannot exploit information from text and visual modalities.
  • Dataset scale is limited (EATD-Corpus: 162 subjects; ADReSSo21: 237 subjects); performance on larger-scale data remains to be verified.
  • Integration with multimodal methods has not been explored.
  • Cross-lingual pretrained models (e.g., English WavLM applied to Chinese datasets) yield limited gains, underscoring the importance of language-matched pretraining.
  • vs. FixMatch: General-purpose SSL methods cannot model the hierarchical sparsity of pathological features; the proposed multi-granularity framework is specifically designed to address this limitation.
  • vs. Multimodal methods (CAMFM, ACMA): Multimodal approaches achieve higher accuracy, but the audio-only method offers advantages in cross-lingual transfer and deployment simplicity.
  • vs. Conventional fully supervised methods: Traditional methods segment dialogues and process each segment independently under a uniform symptom expression assumption; this work explicitly models the sparse expression property of pathological features.

Rating

  • Novelty: ⭐⭐⭐⭐ The three-granularity semi-supervised framework is introduced for the first time in medical speech detection; the single-stage online pseudo-label updating scheme is concise and effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Validated across two languages, two diseases, three encoders, multiple annotation ratios, and detailed ablations; however, dataset scale is relatively small.
  • Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear, method hierarchy is well-structured, and pseudo-label analysis visualizations are informative.
  • Value: ⭐⭐⭐⭐ Provides a practical semi-supervised solution for annotation-scarce medical speech analysis; the model-agnostic design enhances generalizability.