FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing¶
Conference: NeurIPS 2025 arXiv: 2507.14815 Code: Available (github.com/ictnlp/FastLongSpeech) Area: Model Compression / Speech Model Efficiency Keywords: speech compression, long-speech processing, iterative fusion, CTC, dynamic compression training
TL;DR¶
FastLongSpeech compresses redundant speech representations via an iterative fusion strategy and transfers short-speech capabilities to long-speech scenarios through dynamic compression training. This enables large speech-language models (LSLMs) to process long speech efficiently without any long-speech training data, achieving state-of-the-art performance on long-speech QA with a 70% improvement in inference efficiency.
Background & Motivation¶
LSLMs such as Qwen2-Audio perform well on short-speech tasks, but face two major challenges when processing long speech (>30 seconds):
Scarcity of training data: Long-speech alignment and instruction fine-tuning data are extremely limited and costly to construct.
High computational cost: Speech representation sequences are typically more than 4× longer than their text equivalents, causing the computational cost of LLM inference to surge dramatically for long speech.
Limitations of existing approaches:
- Cascaded methods (ASR→LLM): error propagation and loss of paralinguistic information
- NTK-RoPE: extends the positional-encoding limit but does not compress the sequence, leaving computation unchanged (61.21 TFLOPs)
- Naive compression (random sampling, average pooling): significant information loss and poor generation quality
- Only a few LSLMs (e.g., Gemini) handle 30-minute speech, by constructing large-scale long-speech datasets at prohibitive cost
Core Problem: How can LSLMs efficiently process long speech using only short-speech data?
Method¶
Overall Architecture¶
FastLongSpeech builds upon Qwen2-Audio with an additional speech extractor module:
Raw speech s → Audio encoder (Whisper) → Speech representation h (25Hz)
→ Extractor (CTC decoder + Iterative Fusion) → Compressed representation h' (length ≤ L)
→ LLM → Text response y
For long-speech inference: the input is first segmented into 30-second chunks → each chunk passes through the audio encoder independently → the outputs are concatenated into a full speech representation → iterative fusion compresses it to the target length \(L\).
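The chunked pipeline above can be sketched as follows. This is a minimal illustration with placeholder shapes: `encode_chunk` stands in for the frozen Whisper encoder, and the feature dimension 1280 is an assumption; the 25 Hz frame rate and 30-second window come from the paper.

```python
import numpy as np

CHUNK_SEC = 30        # encoder's native window
FRAME_HZ = 25         # frames per second after the audio encoder
SAMPLE_RATE = 16000

def encode_chunk(chunk: np.ndarray) -> np.ndarray:
    """Stand-in for the frozen audio encoder: one 25 Hz frame per 1/25 s."""
    n_frames = int(len(chunk) / SAMPLE_RATE * FRAME_HZ)
    return np.zeros((n_frames, 1280))  # (T, d) placeholder features

def encode_long_speech(waveform: np.ndarray) -> np.ndarray:
    """Split into 30 s chunks, encode each independently, concatenate."""
    step = CHUNK_SEC * SAMPLE_RATE
    chunks = [waveform[i:i + step] for i in range(0, len(waveform), step)]
    return np.concatenate([encode_chunk(c) for c in chunks], axis=0)

# 95 seconds of audio -> 95 * 25 = 2375 frames before compression
h = encode_long_speech(np.zeros(95 * SAMPLE_RATE))
```

The concatenated representation `h` is then handed to iterative fusion, which reduces it to the target length \(L\).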
Key Designs¶
1. Iterative Fusion¶
Core Idea: Progressively merge redundant frames while retaining frames with high information density. Each iteration halves the sequence length until the target length \(L\) is reached.
Two metrics:
Content Density: The sum of non-blank token probabilities from the CTC decoder output, measuring the textual information content of each frame: \(d_j = \sum_{a \neq \epsilon} p_{\mathrm{ctc}}(a \mid h_j)\)
Inter-frame Similarity: Cosine similarity between adjacent frames: \(e_{j,j+1} = \frac{h_j^\top h_{j+1}}{\|h_j\|\,\|h_{j+1}\|}\)
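Both metrics are straightforward to compute from the CTC posteriors and the frame features. A numpy sketch, where the `(T, V)` posterior layout and blank index 0 are assumptions:

```python
import numpy as np

def content_density(p_ctc: np.ndarray, blank_id: int = 0) -> np.ndarray:
    """d_j: summed non-blank probability, i.e. 1 - P(blank | h_j)."""
    return 1.0 - p_ctc[:, blank_id]

def adjacent_similarity(h: np.ndarray) -> np.ndarray:
    """e_{j,j+1}: cosine similarity between each frame and its successor."""
    a, b = h[:-1], h[1:]
    return np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )

# A frame whose posterior mass sits on real tokens scores high density;
# identical adjacent frames score similarity 1.0.
d = content_density(np.array([[0.9, 0.1], [0.2, 0.8]]))
e = adjacent_similarity(np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```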
Iterative Process:
1. Compute the target length from the current length \(T^{(m)}\): \(T^{(m+1)} = \lfloor T^{(m)}/2 \rfloor\) if \(T^{(m)} > 2L\), otherwise \(T^{(m+1)} = L\)
2. Compute the number of frames to eliminate: \(r^{(m)} = T^{(m)} - T^{(m+1)}\)
3. Identify the \(r^{(m)}\) most similar adjacent frame pairs
4. Group consecutively identified frames into spans; within each span, merge all frames into a single frame via content-density-weighted fusion
5. Repeat until the sequence length is \(\leq L\)
Key Advantage: Because the sequence is halved one round at a time, each fused frame's receptive field grows gradually rather than all at once, preserving semantic information better than one-shot compression; content density guides the retention of high-information frames.
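One plausible reading of the loop, in numpy (a sketch, not the authors' implementation; carrying the span's summed density forward as the fused frame's density is an assumption):

```python
import numpy as np

def iterative_fusion(h: np.ndarray, d: np.ndarray, L: int) -> np.ndarray:
    """Roughly halve the sequence each round until its length reaches L.
    h: (T, dim) frame features, d: (T,) content densities."""
    while len(h) > L:
        T = len(h)
        target = T // 2 if T > 2 * L else L
        r = T - target                        # frames to eliminate this round
        a, b = h[:-1], h[1:]                  # cosine similarity of adjacent pairs
        sims = np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
        )
        merge = set(np.argsort(sims)[-r:].tolist())   # r most similar pairs
        new_h, new_d, j = [], [], 0
        while j < T:
            span = [j]
            while j in merge:                 # extend span across selected pairs
                j += 1
                span.append(j)
            w = d[span] / (d[span].sum() + 1e-8)      # density-weighted fusion
            new_h.append(w @ h[span])
            new_d.append(d[span].sum())               # assumption: densities add
            j += 1
        h, d = np.stack(new_h), np.array(new_d)
    return h
```

Each round removes exactly \(r^{(m)}\) frames: every selected pair merges one frame into its neighbor, and runs of consecutively selected pairs collapse into a single fused frame.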
2. Dynamic Compression Training (DCT)¶
Problem: The LLM has only been exposed to speech representations at their original length; feeding compressed representations directly causes a distributional mismatch.
Solution: During training, the target length \(L\) is sampled uniformly at random from \(\{750, 400, 200, 100, 50, 25, 12\}\), so the LLM adapts to compressed representations at varying compression ratios, from uncompressed (750 frames, i.e., 30 s at 25 Hz) up to roughly 60× compression (\(L = 12\)).
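A hypothetical sketch of the per-step sampling (the helper name and return shape are illustrative; only the candidate lengths come from the paper):

```python
import random

# Candidate target lengths from the paper; 750 frames = 30 s at 25 Hz (uncompressed).
LENGTH_CHOICES = [750, 400, 200, 100, 50, 25, 12]

def sample_compression(n_frames: int = 750) -> tuple[int, float]:
    """Pick L uniformly at random and report the resulting compression ratio."""
    L = random.choice(LENGTH_CHOICES)
    return L, n_frames / L

L, ratio = sample_compression()   # at L=12 the ratio is 750/12 = 62.5, the "60x" upper end
```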
Loss & Training¶
Two-stage training:
Stage 1 — CTC Training:
- Only the CTC decoder is trained, to learn content-density estimation for speech frames
- Training data: LibriSpeech 960h + MLS 3000h (ASR data)
- All other modules are frozen
Stage 2 — Dynamic Compression Training:
- The LLM component of Qwen2-Audio is fine-tuned with LoRA
- Training data: OpenASQA (5.9kh) + LibriSQA (360h) + Common Voice (1.7kh), all short speech (<30s)
- CTC decoder is frozen
- Speech window \(L = 750\) (the original Qwen2-Audio setting)
Key Experimental Results¶
Main Results: Long-Speech Spoken QA¶
| Method | Score (↑) | Note |
|---|---|---|
| Random | 2.54 | Random frame sampling |
| Similar (MostSim) | 3.08 | Merge most similar frames |
| AvgPool | 3.10 | Average pooling |
| NTK-RoPE | 3.44 | Extended positional encoding (no compression) |
| FastLongSpeech | 3.55 | Iterative fusion + DCT |
Under the same speech window as NTK-RoPE, FastLongSpeech achieves the best performance on long-speech understanding.
Inference Efficiency¶
| Method | Score | TFLOPs (↓) | Time/s (↓) |
|---|---|---|---|
| NTK-RoPE | 3.44 | 61.21 | 4.80 |
| Cascaded (Whisper+LLM) | 3.75 | n/a | 17.23+1.38 |
| FastLongSpeech | 3.55 | 26.44 | 1.47 |
Computation is reduced by 57% and inference is 3.3× faster than NTK-RoPE; the method is also roughly 7× faster than the cascaded baseline.
Short-Speech Efficiency (LibriTTS OpenASQA)¶
| Method | Score (↑) | TFLOPs (↓) |
|---|---|---|
| Baseline (Qwen2-Audio) | 3.73 | 9.79 |
| Ours (\(L=400\)) | 3.80 | 8.54 |
| Ours (\(L=200\)) | 3.87 | 5.64 |
| Ours (\(L=100\)) | 3.71 | 4.17 |
On short speech, the proposed method matches or exceeds the baseline at half the computational cost.
Ablation Study¶
| Method | Score (↑) |
|---|---|
| FastLongSpeech (full) | 3.55 |
| w/o DCT | 3.33 (−0.22) |
| w/o Iterative Fusion (one-shot compression) | 3.41 (−0.14) |
| w/o Content Density (uniform-weight fusion) | 3.28 (−0.27) |
All three components contribute significantly. Content density guidance contributes the most, underscoring the importance of distinguishing informative from redundant frames.
Key Findings¶
- Iterative (multi-round halving) compression outperforms one-shot compression: progressively expanding the receptive field better aggregates semantic content.
- CTC content density is an effective measure of frame informativeness, guiding retention of high-information frames.
- Dynamic compression training successfully transfers short-speech capabilities to long-speech scenarios without requiring any long-speech training data.
- For ASR tasks, WER increases by only 0.23 over Qwen2-Audio at a low compression ratio (\(L=400\)), but degrades substantially at high compression (\(L=100\)), indicating that the optimal compression ratio is task-dependent.
- The method generalizes directly to Qwen2.5-Omni without DCT retraining.
Highlights & Insights¶
- Zero long-speech training data: Long-speech processing capability is transferred via dynamic compression training using only short-speech data (<30s).
- Information-aware compression: The CTC output distribution naturally provides frame-level information density, proving more effective than simple similarity measures or random sampling.
- Flexible efficiency–quality trade-off: The target length \(L\) can be adjusted freely to balance inference efficiency and generation quality.
- LongSpeech-Eval benchmark: A novel evaluation benchmark for long-speech understanding is constructed, filling a gap in the field.
Limitations & Future Work¶
- Evaluated mainly on Qwen2-Audio: Beyond the transfer check on Qwen2.5-Omni, generalizability to other LSLMs remains to be verified.
- Long-speech training data may eventually win out: As long-speech corpora accumulate, direct training could outperform compression-based transfer.
- CTC decoder introduces additional parameters: Although relatively lightweight, it increases system complexity.
- Significant ASR degradation at high compression ratios: The trade-off between content fidelity and efficiency requires further optimization.
- Promising future directions include end-to-end joint training of the CTC decoder and LLM, and adaptive compression ratio selection (dynamically adjusting \(L\) based on speech complexity).
Related Work & Insights¶
- SpeechPrune / FastAdaSP: Token selection/pruning strategies; the fusion strategy in FastLongSpeech preserves more information by merging rather than discarding frames.
- StreamUni: Segmentation strategies for real-time speech translation; iterative fusion can be viewed as an offline counterpart for efficient compression.
- NTK-RoPE: A classical approach for extending context length, but without reducing computational cost.
- Insight: The high redundancy of speech (frame rate far exceeding the text token rate) naturally accommodates compression, and CTC blank probabilities serve as an excellent proxy for frame-level redundancy.
Rating¶
- Novelty: ★★★★☆ (the combination of iterative fusion, CTC content density, and dynamic compression training is novel)
- Technical Depth: ★★★★☆ (clear problem decomposition; each component has well-motivated design with theoretical grounding)
- Experimental Thoroughness: ★★★★☆ (multi-task evaluation, ablation study, and efficiency analysis are comprehensive, though limited to a single base model)
- Value: ★★★★★ (no long-speech data required, plug-and-play, substantial inference speedup — high practical applicability)