Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

Conference: ACL2026
arXiv: 2605.28618
Code: https://swanaigc.github.io/#bench
Area: Audio & Speech / Long-form Speech Generation Evaluation
Keywords: Long-form Speech Generation, Speech Evaluation Benchmark, Expressiveness Assessment, Prosodic Consistency, Multi-scenario TTS

TL;DR

This paper introduces SwanBench-Speech, a benchmark that systematically evaluates long-form speech generation using 1,101 samples across 17 real-world downstream scenarios and 7 automatic evaluation dimensions. The study concludes that while current models approach usability in content accuracy, they still lag significantly behind real recordings in reverb consistency, long-range prosody, and expressive hierarchy.

Background & Motivation

Background: Speech generation is evolving from sentence-level TTS toward paragraph-level and minute-level generation. Typical applications include podcasts, audiobooks, course explanations, news broadcasts, interviews, and multi-person dialogues. Existing models can generate high-fidelity short speech, but long-text scenarios require models to simultaneously maintain timbre, sound field, semantics, rhythm, and emotional transitions.

Limitations of Prior Work: Existing test sets often cover only a few domains or single-speaker scenarios, with metrics still biased toward short-sentence quality, such as WER/CER, MOS, or single-sentence similarity. This leads to two issues: first, evaluation scenarios diverge significantly from real applications; second, model failures common in long text—such as timbre drift, reverb drift, prosodic collapse, and flat expression—are difficult to quantify.

Key Challenge: The quality of long-form speech is not a single score but is determined by acoustic stability, semantic intelligibility, and expressive dynamics. Traditional metrics can measure clarity but fail to address whether a segment sounds like a continuous natural narrative, whether the sound field of a multi-person dialogue is unified, or whether emotions evolve with the progression of paragraphs.

Goal: The authors aim to construct a standardized benchmark for long-form speech: covering various speaking styles in real applications while decomposing "long-form quality" into metrics that are automatically evaluable, interpretable, and correlated with human perception.

Key Insight: Instead of proposing a new TTS model, the paper focuses on the evaluation system. It organizes data and metrics around three challenges: Acoustics (timbre, reverb, fidelity), Semantics (content accuracy and prosodic naturalness), and Expressiveness (emotional richness and paragraph-level expression hierarchy).

Core Idea: Replace single MOS-style evaluations with "multi-scenario data + decomposed metrics + human correlation verification" to expose the genuine weak points of long-form speech generation models.

Method

The methodology of SwanBench-Speech is a complete evaluation protocol rather than a model training algorithm. It first constructs a long-form speech test set covering a wide range of scenarios, then designs a set of automatic metrics to evaluate different models, and finally validates the key automatic metrics against human perception via preference experiments.

Overall Architecture

The input is a long text or dialogue script designed for a specific scenario, along with the TTS or dialogue speech generation model to be evaluated. After the model generates long-form speech, SwanBench-Speech calculates metrics along three axes: the acoustic axis measures timbre consistency, reverb consistency, and no-reference sound quality; the semantic axis measures content accuracy and prosodic coherence; and the expressive axis measures local emotional richness and overall expression hierarchy. The output is a set of diagnostic multi-dimensional results rather than a black-box total score.
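To make the "multi-dimensional results rather than a total score" idea concrete, here is a minimal sketch of how a per-system result could be held and grouped by axis. The class name `DiagnosticReport`, the field names, and the `AXES` mapping are hypothetical and chosen for illustration; they are not taken from the benchmark's code.

```python
from dataclasses import dataclass, asdict

# Hypothetical grouping of the seven dimensions (CER and WER are reported
# as a pair) into the three axes described above.
AXES = {
    "acoustic": ("timbre", "reverb", "fidelity"),
    "semantic": ("cer", "wer", "prosody"),
    "expressive": ("richness", "hierarchy"),
}


@dataclass(frozen=True)
class DiagnosticReport:
    """One system's scores, kept separate instead of averaged into one number."""
    timbre: float     # higher is better
    reverb: float     # lower is better
    fidelity: float   # higher is better
    cer: float        # lower is better
    wer: float        # lower is better
    prosody: float    # higher is better
    richness: float   # higher is better
    hierarchy: float  # higher is better

    def by_axis(self):
        """Return the scores as {axis: {metric: value}}."""
        values = asdict(self)
        return {axis: {m: values[m] for m in metrics}
                for axis, metrics in AXES.items()}
```

Keeping the fields separate means a system that is strong acoustically but weak expressively stays visible as such, which a single averaged score would hide.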

Data construction involves three sources. The first is online text corpora, such as audiobooks, dramas, and news scripts; the second is online audio media, processed through noise reduction, DNS-MOS quality filtering, speaker diarization, and SenseVoice transcription followed by manual verification; the third is supplemental test samples generated by GPT-5 to expand topic and scenario diversity. The final data undergoes semantic deduplication, content quality filtering, privacy/ethical risk detection, and manual review, resulting in 1,101 samples.

Key Designs

  1. Long-form Test Set with Three Axes and Seventeen Scenarios:

    • Function: Decomposes long-form speech generation into three challenge types—Acoustics, Semantics, and Expressiveness—mapped to 17 downstream scenarios.
    • Mechanism: Acoustic-related scenarios include customer service, podcasts, chit-chat, debates, audiobooks, and interviews; semantic-intensive scenarios include courses, science communication, presentations, seminars, and news; expressive scenarios include drama, talk shows, hosting, speeches, live streaming, and sports commentary.
    • Design Motivation: Short-sentence TTS tests cannot expose cumulative errors in minute-level generation. Splitting by real scenarios allows for a clearer view of where models fail in specific applications.
  2. Seven Diagnostic Automatic Metrics:

    • Function: Breaks down "long-form quality" into timbre consistency, reverb consistency, sound quality, content accuracy, prosodic coherence, expressive richness, and hierarchy.
    • Mechanism: Timbre consistency is measured by pairwise cosine similarity of sliding-window speaker embeddings; reverb consistency is measured by the standard deviation of SRMR sequences (lower is more stable); content accuracy uses WER/CER between ASR transcriptions and the original text; prosody is scored by SpeechJudge; expressive richness and hierarchy are evaluated by LALM/Gemini3-Pro using specialized prompts.
    • Design Motivation: Long-form failures are often not due to "unclear speech," but rather unstable identity, sound field, and expressive state over time. Decomposed metrics help researchers identify whether to improve data, architecture, or expressive modeling.
  3. Human Perception Alignment Verification:

    • Function: Ensures automatic metrics reflect human preferences rather than being arbitrary.
    • Mechanism: For prosody, 50 pairs of audio clips generated by different models from the same text were selected, and 10 evaluators rated each pair with a relative preference from -2 to 2. For expressiveness, 200 audio segments were selected and scored by 10 evaluators using the same prompts, and the Spearman rank correlation (SRCC) with human MOS was compared across various MOS networks and LALM evaluators.
    • Design Motivation: Expressive metrics are particularly prone to model evaluator bias, so SRCC is used to verify consistency between automatic scores and human judgments.
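The two consistency metrics in the second design can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the speaker embeddings and per-window SRMR values are assumed to come from external models that are not shown, the pairwise similarities are aggregated by their mean, and the population standard deviation is used. The summary does not specify window sizes or either of those aggregation choices, and all function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def sliding_windows(n_samples, win, hop):
    """(start, end) sample indices of full windows over a signal of n_samples."""
    return [(s, s + win) for s in range(0, n_samples - win + 1, hop)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def timbre_consistency(window_embeddings):
    """Mean pairwise cosine similarity of per-window speaker embeddings.

    Assumption: the mean is used to aggregate the pairs. Higher is better.
    """
    pairs = list(combinations(window_embeddings, 2))
    if not pairs:
        raise ValueError("need at least two windows")
    return mean(cosine(u, v) for u, v in pairs)


def reverb_consistency(srmr_per_window):
    """Standard deviation of the per-window SRMR sequence. Lower is more stable.

    Assumption: population standard deviation.
    """
    return pstdev(srmr_per_window)
```

For example, three windows whose embeddings are `[1, 0]`, `[1, 0]`, and `[0, 1]` give one matching pair out of three, so the timbre score is 1/3; a constant SRMR sequence gives a reverb score of 0.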

Loss & Training

Because the paper does not train a new speech generation model, there is no loss function or training procedure. The counterpart is the evaluation protocol: generated audio is analyzed with sliding windows, ASR transcription, no-reference quality models, SpeechJudge, and Gemini3-Pro evaluators, and evaluator reliability is calibrated by SRCC against human ratings. The evaluated models cover two tasks, single-speaker long-form text generation and dialogue generation, with open-source and closed-source systems compared within each task.
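The content-accuracy step of this protocol compares an ASR transcript of the generated audio with the input text. A minimal sketch of WER and CER via edit distance is shown below. It assumes the transcript is already available as a string and skips text normalization (casing, punctuation, numerals), which the summary does not describe; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref = reference.split()
    if not ref:
        raise ValueError("empty reference")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference, hypothesis):
    """Character error rate, ignoring whitespace."""
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("empty reference")
    return edit_distance(ref, hyp) / len(ref)
```

For instance, `wer("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion over six reference words, giving 2/6.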

Key Experimental Results

Main Results

| Evaluated Object | Timbre↑ | Reverb↓ | Fidelity↑ | CER/WER↓ | Prosody↑ | Richness↑ | Hierarchy↑ |
|---|---|---|---|---|---|---|---|
| Avg. Open-source (Single) | 0.93 | 1.95 | 3.63 | 0.073 / 0.164 | 3.43 | 3.03 | 2.67 |
| Avg. Closed-source (Single) | 0.93 | 1.96 | 3.55 | 0.065 / 0.138 | 3.79 | 3.42 | 3.01 |
| Real Recording (Single) | 0.96 | 1.91 | 3.62 | 0.070 / 0.074 | 4.04 | 4.35 | 3.94 |
| Avg. Open-source (Dialogue) | 0.92 | 3.45 | 3.02 | 0.129 / 0.137 | 3.41 | 3.07 | 3.06 |
| Avg. Closed-source (Dialogue) | 0.92 | 3.36 | 3.17 | 0.095 / 0.103 | 3.83 | 3.51 | 3.76 |
| Real Recording (Dialogue) | 0.95 | 2.73 | 2.94 | 0.050 / 0.137 | 3.95 | 4.42 | 4.17 |

Closed-source systems generally outperform open-source systems in prosody and expression but still show a significant gap compared to real recordings. In single-speaker scenarios, real recordings achieve a Richness of 4.35 and Hierarchy of 3.94, while closed-source averages are only 3.42 and 3.01. In dialogue scenarios, the closed-source hierarchy average reaches 3.76, still lower than the 4.17 of real dialogues; reverb consistency also shows a clear gap, with closed-source dialogues averaging 3.36 Reverb against 2.73 for real dialogues.
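The gaps quoted above follow directly from the table. The short sketch below recomputes them, flipping the sign for Reverb because lower is better there. The numbers are copied from the table; the helper name `shortfall` and the dictionary layout are illustrative only.

```python
# Closed-source averages vs. real recordings, copied from the results table.
REAL = {
    "single": {"richness": 4.35, "hierarchy": 3.94},
    "dialogue": {"hierarchy": 4.17, "reverb": 2.73},
}
CLOSED = {
    "single": {"richness": 3.42, "hierarchy": 3.01},
    "dialogue": {"hierarchy": 3.76, "reverb": 3.36},
}
LOWER_IS_BETTER = {"reverb"}


def shortfall(task):
    """Per-metric distance by which closed-source systems trail real recordings.

    Positive values mean the model average is worse than the real recording.
    """
    out = {}
    for metric, real_value in REAL[task].items():
        model_value = CLOSED[task][metric]
        if metric in LOWER_IS_BETTER:
            diff = model_value - real_value
        else:
            diff = real_value - model_value
        out[metric] = round(diff, 2)
    return out


print(shortfall("single"))    # {'richness': 0.93, 'hierarchy': 0.93}
print(shortfall("dialogue"))  # {'hierarchy': 0.41, 'reverb': 0.63}
```

On the 1 to 5 style scales used for Richness and Hierarchy, a gap of 0.93 points in the single-speaker setting is large relative to the 0.41 gap for dialogue Hierarchy.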

Ablation Study

| Validation Item | Setting | Key Result | Description |
|---|---|---|---|
| Prosody Metric Alignment | 50 pairs, 10 human evaluators | \(SRCC = 0.82\) | The modified SpeechJudge prosody score correlates highly with human preference. |
| Expressive Richness Alignment | 200 segments, 10 human evaluators | \(SRCC = 0.71\) | Gemini3-Pro has the highest correlation with human MOS for expressive richness. |
| Expressive Hierarchy Alignment | 200 segments, 10 human evaluators | \(SRCC = 0.62\) | Paragraph-level expressive dynamics are harder to evaluate automatically than local emotion. |
| Generation Length Analysis | MegaTTS3, F5TTS, CosyVoice2, etc. | Multi-dimensional degradation after 100 words | Long-range dependency issues affect content accuracy, prosody, and expression. |
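The SRCC values in this table are Spearman rank correlations between automatic scores and human ratings. A minimal pure-Python sketch of that computation is given below, using average ranks for ties followed by a Pearson correlation on the ranks. The function names are hypothetical, and averaging the 10 evaluators' scores into one human MOS per clip is an assumption about how the human side was aggregated.

```python
import math
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def srcc(automatic_scores, human_scores):
    """Spearman rank correlation between automatic and human scores."""
    return pearson(average_ranks(automatic_scores), average_ranks(human_scores))


def human_mos(ratings_per_clip):
    """Assumed aggregation: mean of all evaluators' ratings for each clip."""
    return [mean(r) for r in ratings_per_clip]
```

Because only ranks matter, an automatic evaluator that scores clips on a different scale than the human raters can still reach an SRCC of 1 as long as it orders the clips the same way.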

Key Findings

  • Content accuracy is no longer the sole bottleneck. Many models have CER/WER close to real speech, but Prosody, Richness, and Hierarchy remain significantly lower.
  • Expressive scenarios are where models fail most easily. In principle, drama, hosting, and sports commentary should offer a higher expressive ceiling, but current models degrade across multiple metrics in these scenarios, suggesting that both training data and modeling for expressive speech are insufficient.
  • Clear trade-offs exist between AR and NAR architectures. NAR models are more stable and efficient but prone to over-smoothing; AR models are more expressive but prone to error propagation in long sequences.
  • Data quality is more critical than pure scaling. Short-fragment training data brings short-form bias; unstable sound fields in "in-the-wild" data induce acoustic drift; and large-scale averaging may weaken dynamic expression.

Highlights & Insights

  • A key highlight of this paper is the granular decomposition of long-form TTS failure modes. It does not just ask "which model is best," but rather where models lose out in terms of timbre, reverb, prosody, and expressive hierarchy.
  • SwanBench-Speech is highly relevant to real-world applications. The 17 scenarios correspond to speech forms actually encountered by users, making the results more useful for guiding model iteration than tests limited to a single read-speech style.
  • Human perception alignment makes the evaluation protocol credible. For expressiveness metrics in particular, without manual SRCC validation, they risk becoming another uninterpretable model score.
  • An important insight is that "long-form capability" cannot be solved simply by a larger context window. Speech models require training data that is temporally continuous, acoustically stable, and expressively structured at the paragraph level.

Limitations & Future Work

  • Limited language coverage. SwanBench-Speech currently covers mainly Chinese and English; low-resource languages, dialects, and accents are not yet fully included.
  • Semantic understanding metrics are still preliminary. The paper admits that current metrics prioritize acoustic consistency over transitions in emotion and style driven by deep semantics.
  • Reference timbre lacks diversity. Prompt speech in experiments primarily comes from approximately 20 open-source speakers, which may introduce timbre bias.
  • Expressiveness evaluation partially relies on the closed-source model Gemini3-Pro, affecting reproducibility if API updates occur; future work could distill an open-source evaluator.
  • Future directions should include broader language coverage, stronger open expressive evaluators, more real long-context recordings, and curriculum training from sentence-level to paragraph-level.

Related Work & Insights

  • vs. SeedTTS-Eval / EmergentTTS-Eval: These benchmarks cover some short speech or limited scenarios. SwanBench-Speech's advantage lies in the completeness of its 17 scenarios and long-form dimensions, though at the cost of a more complex evaluation protocol.
  • vs. MultiDialog / LibriSpeech-long: The latter provide long text or dialogue materials, but their metrics may not fully capture expressive hierarchy. This paper integrates data coverage with multi-dimensional automatic metrics for better model diagnostics.
  • vs. MOS / WER single-metric evaluation: While easy to use, MOS and WER conflate multiple failure modes. SwanBench-Speech suggests that long-form generation quality should be decomposed into addressable sub-problems.
  • Insights for subsequent research: To improve long-form TTS, researchers should not only pursue larger models but also design specific training objectives and data recipes targeting reverb drift, paragraph prosody, emotional hierarchy, and high-quality continuous data.

Rating

  • Novelty: ⭐⭐⭐⭐ The technical form of the benchmark paper is not radical, but the organization of 3 axes, 7 metrics, and 17 scenarios is highly diagnostic.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers over 20 models, both single-speaker and dialogue tasks, with human correlation validation; the experimental scale is robust.
  • Writing Quality: ⭐⭐⭐⭐ The structure is clear and metrics are well-explained; some tables have high information density requiring careful reading.
  • Value: ⭐⭐⭐⭐⭐ Highly practical for the long-form TTS field, directly informing researchers that current models lack long-range consistency and expressive structure rather than just sound quality.