SpeakerSleuth: Can Large Audio-Language Models Judge Speaker Consistency across Multi-turn Dialogues?¶
Conference: ACL 2026 · arXiv: 2601.04029 · Code: https://github.com/holi-lab/SpeakerSleuth · Area: Audio & Speech
Keywords: Large Audio-Language Models, Speaker Consistency, Multi-turn Dialogue, Benchmark, Modality Bias
TL;DR¶
SpeakerSleuth introduces the first benchmark (1,818 instances) for evaluating LALMs' ability to judge speaker consistency in multi-turn dialogues. A systematic evaluation of 12 LALMs and 6 embedding methods reveals that models struggle to detect and localize acoustic inconsistencies and exhibit a severe text-over-acoustics modality bias, yet perform relatively well on comparison and ranking tasks involving acoustic variants.
Background & Motivation¶
Background: Speech synthesis technology can now generate naturalistic human speech and is widely deployed in voice assistants, podcasts, film dubbing, and dialogue agents. Maintaining speaker identity consistency across multi-turn dialogues—timbre, pitch, and voice quality—is a fundamental requirement.
Limitations of Prior Work:
- Even state-of-the-art speech synthesis models suffer from speaker confusion, timbre drift, and voice quality variation.
- Existing evaluation methods compute pairwise similarity via embedding models, which cannot holistically assess consistency across an entire dialogue and require manual threshold tuning.
- Although LALMs can process an entire dialogue in one pass and directly produce a judgment, whether their acoustic discrimination capability is reliable remains entirely unknown.
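The pairwise-embedding evaluation just described can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: it assumes one speaker embedding per turn has already been extracted (e.g. with ECAPA-TDNN or WavLM), and the `threshold` argument is exactly the manually tuned value the text criticizes.

```python
import numpy as np

def detect_and_localize(embs, threshold=0.6):
    """Pairwise-similarity baseline for consistency detection and
    turn localization. `embs` holds one L2-normalizable speaker
    embedding per turn, shape (n_turns, dim). A turn is flagged as
    inconsistent if its mean cosine similarity to all other turns
    falls below `threshold` (a hand-tuned value, illustrating the
    calibration problem noted above)."""
    e = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = e @ e.T                               # pairwise cosine similarities
    n = len(embs)
    mean_sim = (sims.sum(axis=1) - 1.0) / (n - 1)  # exclude self-similarity of 1
    flagged = [i for i, s in enumerate(mean_sim) if s < threshold]
    return len(flagged) == 0, flagged

# Toy dialogue: turns 0, 1, 3 share a "voice"; turn 2 is an outlier.
embs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
consistent, flagged = detect_and_localize(embs)   # -> (False, [2])
```

Note that the result flips with the threshold choice, which is why such baselines need per-dataset tuning while an LALM judge would, in principle, decide in one pass.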
Key Challenge: LALMs are theoretically capable of serving as comprehensive audio-language judges, yet no systematic benchmark exists to assess whether they possess reliable acoustic discrimination capability, particularly in multi-turn dialogue settings.
Goal: Construct a benchmark to systematically evaluate both LALMs and embedding-based methods on speaker consistency judgment in multi-turn dialogues, and to reveal their strengths, weaknesses, and fundamental limitations.
Key Insight: Three progressively challenging tasks are designed—detection (consistent or not) → localization (which turn is inconsistent) → discrimination (comparison and ranking of variants)—to comprehensively assess acoustic discrimination capability at different granularities.
Core Idea: A controlled experimental design (identical dialogue content × three scenarios: fully consistent / gender-switched / similar-speaker substitution) isolates acoustic factors for systematic evaluation, thereby exposing the modality bias of LALMs.
Method¶
Overall Architecture¶
The SpeakerSleuth benchmark comprises: (1) collecting multi-turn dialogue audio from 4 datasets; (2) generating 3 scenario variants per dialogue (S1: fully consistent, S2: gender-switched, S3: similar-speaker substitution); (3) quality assurance via human verification; and (4) evaluating 12 LALMs and 6 embedding methods across detection, localization, and discrimination tasks.
Key Designs¶
- Three Controlled Scenario Designs:
- Function: Isolate acoustic discrimination capability through controlled variables.
- Mechanism:
- S1 (Fully Consistent): The original dialogue serves as the positive sample.
- S2 (Gender-Switched): One randomly selected turn is replaced with an opposite-gender speaker via voice conversion, introducing an obvious acoustic deviation.
- S3 (Similar Speaker): One turn is replaced with the most acoustically similar same-gender speaker (highest cosine similarity under ECAPA-TDNN embeddings), testing fine-grained discrimination.
- Design Motivation: S1/S2/S3 share identical dialogue content, so performance differences directly reflect acoustic discrimination ability. The increasing difficulty from S2 to S3 probes the gradient of models' acoustic sensitivity.
- Three-Level Task Hierarchy:
- Function: Evaluate acoustic discrimination capability at coarse-to-fine granularities.
- Mechanism:
- Detection (absolute judgment): Determine whether all turns belong to the same speaker, requiring a stable internal threshold.
- Localization (fine-grained analysis): Identify which specific turn is inconsistent, requiring turn-level acoustic feature discrimination.
- Discrimination (relative comparison): Rank three candidate audio clips by acoustic similarity; assessed in both classification and ranking forms.
- Design Motivation: Mirrors real TTS workflows—first detect inconsistency → localize the problematic turn → regenerate and select the best candidate.
- Modality Bias Experiment (Effect of Textual Context):
- Function: Reveal modality imbalance between textual and acoustic signals in LALMs.
- Mechanism: Building on the main experiment, textual context from other speakers' turns is additionally provided, and the resulting effect on detection performance is observed.
- Design Motivation: In real-world applications, LALMs receive both audio and text simultaneously; it is therefore necessary to verify whether models neglect acoustic inconsistencies due to textual coherence.
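The S3 similar-speaker selection described above reduces to a nearest-neighbor search in embedding space. A minimal sketch, assuming speaker embeddings (ECAPA-TDNN in the paper) have already been extracted for the target speaker and a same-gender candidate pool; the toy 4-dim vectors stand in for real 192-dim ECAPA-TDNN embeddings:

```python
import numpy as np

def most_similar_speaker(target_emb, candidate_embs, candidate_ids):
    """Return the candidate speaker whose embedding has the highest
    cosine similarity to the target: the S3 substitution rule."""
    t = target_emb / np.linalg.norm(target_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = c @ t                       # cosine similarity to the target
    best = int(np.argmax(sims))
    return candidate_ids[best], float(sims[best])

# Toy example with made-up embeddings.
target = np.array([1.0, 0.0, 0.0, 0.0])
pool = np.array([[0.9, 0.1, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0]])
spk, sim = most_similar_speaker(target, pool, ["spk_A", "spk_B"])  # -> spk_A
```

Picking the hardest (most similar) substitute is what makes S3 a stress test of fine-grained discrimination, in contrast to the coarse gender switch of S2.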
Loss & Training¶
SpeakerSleuth is an evaluation benchmark rather than a training framework. Data construction involves FreeVC for voice conversion, automated text filtering (Qwen3-32B), and manual audio quality verification.
Key Experimental Results¶
Main Results (Detection — Balanced Accuracy)¶
| Model | S1 Acc | S2 Acc | S3 Acc | Balanced Acc | Notes |
|---|---|---|---|---|---|
| Gemini-2.5-Pro | 73.9 | 71.6 | 39.3 | 64.7 | Best LALM |
| GPT-4o-audio | 72.9 | 32.8 | 29.5 | 52.0 | Weak detection |
| Pairwise (WavLM) | 91.8 | 38.4 | 37.7 | 64.9 | Best embedding method |
| Pairwise (ECAPA) | 36.0 | 88.4 | 86.3 | 61.7 | Over-detection tendency |
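The Balanced Acc column is consistent, for every row above, with averaging accuracy on the "consistent" class (S1) against the pooled "inconsistent" class (S2 and S3). A small sketch of that reading (an inference from the table values, not a formula stated in this note):

```python
def balanced_accuracy(s1_acc, s2_acc, s3_acc):
    """Mean of per-class accuracy: the consistent class (S1) vs. the
    inconsistent class (S2 and S3 averaged). Reproduces every
    Balanced Acc value in the detection table to one decimal."""
    inconsistent_acc = (s2_acc + s3_acc) / 2
    return (s1_acc + inconsistent_acc) / 2

balanced_accuracy(73.9, 71.6, 39.3)   # Gemini-2.5-Pro row -> ~64.7
```

This metric is what exposes the anti-diagonal failure mode discussed under Key Findings: a model that always answers "consistent" scores high on S1 but near 50 on balance, and vice versa.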
Discrimination Task¶
| Model | Classification Acc | NDCG@1 | Exact Match | Notes |
|---|---|---|---|---|
| Gemini-2.5-Pro | 81.5 | 88.8 | 71.5 | Strong relative judgment |
| Pairwise (ECAPA) | 99.2 | 99.6 | 58.6 | Excellent embedding ranking |
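For concreteness, the two ranking metrics above can be sketched as follows. The exact gain function the paper uses for NDCG is not specified in this note, so the standard exponential gain here is an assumption; with three candidates and a cutoff at rank 1 there is no log discount.

```python
def ndcg_at_1(predicted_order, relevance):
    """NDCG truncated at rank 1: gain of the top-ranked candidate over
    the gain of the ideal top candidate. `predicted_order` lists
    candidate indices best-first; `relevance[i]` is a graded relevance
    (here 2 = closest acoustic variant, 0 = furthest). Exponential
    gain (2**rel - 1) is assumed."""
    gain = 2 ** relevance[predicted_order[0]] - 1
    ideal = 2 ** max(relevance) - 1
    return gain / ideal

def exact_match(predicted_order, gold_order):
    """1 if the model's full ranking of the candidates equals the
    gold ranking, else 0."""
    return int(predicted_order == gold_order)

rel = [2, 0, 1]                        # candidate index -> relevance grade
ndcg_at_1([0, 2, 1], rel)              # best clip ranked first -> 1.0
exact_match([0, 2, 1], [0, 2, 1])      # full order correct -> 1
```

NDCG@1 only rewards putting the best candidate first, while Exact Match demands the entire order, which is why the two columns can diverge (e.g. ECAPA's 99.6 NDCG@1 vs. 58.6 Exact Match).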
Effect of Textual Context (Detection)¶
| Model | S2 Audio-only | S2 +Text | Δ | Notes |
|---|---|---|---|---|
| GPT-4o-audio | 32.8 | 6.3 | −26.5 | Severe textual interference |
| Gemini-2.5-Flash-Lite | 70.3 | 3.3 | −67.0 | Near-complete failure |
| Gemini-2.5-Pro | 71.6 | 46.8 | −24.8 | Affected but retains some discrimination |
Key Findings¶
- Unstable Detection Threshold: LALMs cluster along the anti-diagonal—either systematically predicting consistency (e.g., MiniCPM-o) or systematically predicting inconsistency (e.g., Qwen2.5-Omni-7B)—indicating a lack of calibrated internal thresholds.
- Very Weak Localization: Most models either default to marking no turns or indiscriminately mark all turns (e.g., Gemma-3n achieves 95% recall but only 19% precision).
- Strong Discrimination Performance: The same models perform well on relative comparison and ranking of acoustic variants (Gemini-2.5-Pro: 88.8% NDCG@1), suggesting that models possess inherent acoustic discrimination capacity but cannot make reliable absolute judgments.
- Severe Textual Modality Bias: When textual context is added, models prioritize textual coherence over acoustic cues, failing to detect even highly salient inconsistencies such as gender switching.
- Systematic Embedding Bias: Embedding-based methods are also affected; ECAPA-TDNN tends toward over-detection, while WavLM tends toward under-detection.
Highlights & Insights¶
- The discovery of a fundamental text-over-acoustics modality bias in LALMs serves as an important warning for the development of reliable audio-language judges.
- The counter-intuitive finding that models detect poorly yet discriminate well reveals the root cause: the problem is not a lack of acoustic perceptual ability, but rather the absence of reliable internal decision thresholds.
- The controlled three-scenario design (consistent / gender-switched / similar speaker) is elegant in its clean isolation of acoustic factors.
- The concurrent evaluation of both LALMs and embedding-based methods provides a fair comparison and complementary insights across both paradigms.
Limitations & Future Work¶
- The voice conversion tool used in benchmark construction may introduce artifacts, potentially affecting the naturalness of certain scenarios.
- Only English dialogue data are evaluated; cross-lingual assessment remains to be conducted.
- Each target speaker is fixed at 5 turns; consistency evaluation over longer dialogues is not addressed.
- The evaluation set size (1,818 instances) is relatively limited, and statistical power may be insufficient to distinguish subtle performance differences between certain models.
Related Work & Insights¶
- vs. Traditional Speaker Verification (ECAPA-TDNN): Traditional methods perform pairwise comparison; SpeakerSleuth evaluates holistic dialogue-level consistency judgment.
- vs. LALM-as-Judge (Speech Quality Assessment): Existing LALM-based judges primarily focus on single-dimension speech quality; SpeakerSleuth is the first to evaluate cross-turn speaker consistency.
- vs. Speaker Recognition/Diarization: Traditional tasks identify who is speaking; SpeakerSleuth evaluates whether utterances claimed to be from the same speaker are acoustically consistent.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The first benchmark for speaker consistency evaluation in multi-turn dialogues; the modality bias finding carries significant implications.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive coverage of 12 LALMs, 6 embedding methods, three-level tasks, textual context analysis, and reference audio influence analysis.
- Writing Quality: ⭐⭐⭐⭐ — The logical flow from task design to benchmark construction to experimental analysis is clear, and key findings are well summarized.