
FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing

Conference: NeurIPS 2025 arXiv: 2507.14815 Code: Available (github.com/ictnlp/FastLongSpeech) Area: Model Compression / Speech Model Efficiency Keywords: speech compression, long-speech processing, iterative fusion, CTC, dynamic compression training

TL;DR

FastLongSpeech compresses redundant speech representations with an iterative fusion strategy and transfers short-speech capabilities to long-speech scenarios through dynamic compression training. This lets large speech-language models (LSLMs) process long speech efficiently without any long-speech training data, achieving state-of-the-art long-speech QA performance while cutting inference latency by roughly 70%.

Background & Motivation

LSLMs such as Qwen2-Audio perform well on short-speech tasks, but face two major challenges when processing long speech (>30 seconds):

Scarcity of training data: Long-speech alignment and instruction fine-tuning data are extremely limited and costly to construct.

High computational cost: Speech representation sequences are typically more than 4× longer than their text equivalents, causing the computational cost of LLM inference to surge dramatically for long speech.

Limitations of existing approaches:

  • Cascaded methods (ASR → LLM): error propagation and loss of paralinguistic information
  • NTK-RoPE: extends the positional-encoding limit but does not compress the sequence, leaving computation unchanged (61.21 TFLOPs)
  • Naive compression (random sampling, average pooling): significant information loss and poor generation quality
  • Only a few LSLMs (e.g., Gemini) handle 30-minute speech, and they do so by constructing large-scale long-speech datasets at prohibitive cost

Core Problem: How can LSLMs efficiently process long speech using only short-speech data?

Method

Overall Architecture

FastLongSpeech builds upon Qwen2-Audio with an additional speech extractor module:

Raw speech s → Audio encoder (Whisper) → Speech representation h (25Hz)
→ Extractor (CTC decoder + Iterative Fusion) → Compressed representation h' (length ≤ L)
→ LLM → Text response y

For long-speech inference: the input is first segmented into 30-second chunks → each chunk passes through the audio encoder independently → the outputs are concatenated into a full speech representation → iterative fusion compresses it to the target length \(L\).
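
To make the long-speech path concrete, here is a minimal numpy sketch of the chunk–encode–concatenate–compress flow; `audio_encoder` and `compress_fn` are hypothetical stand-ins (the latter corresponds to the extractor sketched under Key Designs), and the 16 kHz sample rate is an assumption rather than a detail from the paper.

```python
import numpy as np

SAMPLE_RATE = 16_000      # assumed input sample rate (not stated in this note)
CHUNK_SECONDS = 30        # Qwen2-Audio's native speech window

def long_speech_inference_repr(waveform, audio_encoder, compress_fn, target_len=750):
    """Hypothetical long-speech path: split into 30 s chunks, encode each chunk
    independently, concatenate, then compress to `target_len` frames for the LLM."""
    chunk_size = CHUNK_SECONDS * SAMPLE_RATE
    chunks = [waveform[i:i + chunk_size] for i in range(0, len(waveform), chunk_size)]
    reps = [audio_encoder(c) for c in chunks]      # each chunk -> (T_i, d) at ~25 Hz
    h = np.concatenate(reps, axis=0)               # full representation: (sum_i T_i, d)
    return compress_fn(h, target_len)              # extractor: CTC decoder + iterative fusion
```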

Key Designs

1. Iterative Fusion

Core Idea: Progressively merge redundant frames while retaining frames with high information density. Each iteration halves the sequence length until the target length \(L\) is reached.

Two metrics:

Content Density: the sum of non-blank token probabilities from the CTC decoder output, measuring each frame's textual information content: \(d_j = \sum_{a_j \neq \epsilon} p_{\text{ctc}}(a_j \mid h_j)\)

Inter-frame Similarity: cosine similarity between adjacent frames: \(e_{j,j+1} = \dfrac{h_j \cdot h_{j+1}}{\|h_j\| \, \|h_{j+1}\|}\)
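
As a concrete reading of the two metrics, a small numpy sketch follows; `ctc_probs` is assumed to be the CTC decoder's per-frame probability distribution with the blank token at index 0 (the actual blank index and tensor layout in the released code may differ).

```python
import numpy as np

def content_density(ctc_probs, blank_id=0):
    """d_j: total non-blank probability mass per frame from the CTC decoder.
    ctc_probs: (T, V) per-frame probabilities; summing every non-blank entry
    is equivalent to 1 - p(blank)."""
    return 1.0 - ctc_probs[:, blank_id]

def adjacent_similarity(h):
    """e_{j,j+1}: cosine similarity between adjacent frames.
    h: (T, d) speech representation; returns a (T-1,) vector."""
    a, b = h[:-1], h[1:]
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    return num / den
```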

Iterative Process:

  1. Given the current length \(T^{(m)}\), set the next target length \(T^{(m+1)} = \lfloor T^{(m)}/2 \rfloor\) if \(T^{(m)} > 2L\), otherwise \(T^{(m+1)} = L\)
  2. Compute the number of frames to eliminate: \(r^{(m)} = T^{(m)} - T^{(m+1)}\)
  3. Identify the \(r^{(m)}\) most similar adjacent frame pairs
  4. Group consecutively marked frames into spans; within each span, merge the frames into a single frame via content-density-weighted fusion
  5. Repeat until the sequence length is \(\leq L\)
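
The loop below is a minimal sketch of this procedure under the same assumptions as the metric snippets above (numpy arrays, blank token at index 0); it illustrates the halving schedule, span grouping, and density-weighted merging rather than reproducing the released implementation.

```python
import numpy as np

def iterative_fusion(h, ctc_probs, L, blank_id=0):
    """Sketch of iterative fusion: repeatedly merge the most similar adjacent
    frames (weighted by CTC content density) until the sequence length is <= L."""
    d = 1.0 - ctc_probs[:, blank_id]                     # content density d_j
    while len(h) > L:
        T = len(h)
        T_next = T // 2 if T > 2 * L else L              # halve, or jump to L
        r = T - T_next                                   # frames to eliminate this round
        a, b = h[:-1], h[1:]                             # adjacent-frame cosine similarity
        sim = np.sum(a * b, axis=-1) / (
            np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8)
        merge = np.zeros(T - 1, dtype=bool)
        merge[np.argsort(-sim)[:r]] = True               # r most similar adjacent pairs
        new_h, new_d, span = [], [], [0]
        for j in range(T - 1):
            if merge[j]:
                span.append(j + 1)                       # frame j+1 joins the current span
            else:
                w = d[span] / (d[span].sum() + 1e-8)     # density-weighted fusion
                new_h.append(w @ h[span])
                new_d.append(d[span].sum())              # merged density as span sum (assumption)
                span = [j + 1]
        w = d[span] / (d[span].sum() + 1e-8)             # flush the last span
        new_h.append(w @ h[span])
        new_d.append(d[span].sum())
        h, d = np.stack(new_h), np.array(new_d)
    return h
```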

Key Advantage: Because the sequence length is only halved per round, each fused frame's receptive field grows gradually, which preserves semantic information better than one-shot compression; content density guides the retention of high-information frames.

2. Dynamic Compression Training (DCT)

Problem: The LLM has only been exposed to speech representations at their original length; feeding compressed representations directly causes a distributional mismatch.

Solution: During training, the target length \(L\) is sampled randomly, enabling the LLM to adapt to compressed representations at varying compression ratios:

\[\mathcal{L}_{dct} = -\,\mathbb{E}_{L \sim \mathcal{U}(\mathbf{L})}\left[\log p(\mathbf{y} \mid \mathbf{x}, \text{IF}(\mathbf{h}, L))\right]\]

where \(L\) is drawn uniformly from the candidate set \(\mathbf{L} = \{750, 400, 200, 100, 50, 25, 12\}\). This trains the LLM to handle inputs ranging from uncompressed to roughly 60× compressed.
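
A minimal sketch of how the per-batch target length might be sampled during DCT; `llm_nll` (the model's negative log-likelihood on the text response), `iterative_fusion`, and the batch field names are illustrative stand-ins, not the released training code.

```python
import random

CANDIDATE_LENGTHS = [750, 400, 200, 100, 50, 25, 12]    # the candidate set L from the paper

def dct_training_step(batch, llm_nll, iterative_fusion):
    """One dynamic-compression-training step (sketch): draw a target length
    uniformly, compress the speech representation to it, and compute the loss."""
    L = random.choice(CANDIDATE_LENGTHS)                 # L ~ U(L)
    h_comp = iterative_fusion(batch["speech_repr"], batch["ctc_probs"], L)
    return llm_nll(batch["instruction"], h_comp, batch["target_text"])
```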

Loss & Training

Two-stage training:

Stage 1 — CTC Training:

  • Only the CTC decoder is trained, learning to estimate the content density of speech frames
  • Training data: LibriSpeech 960h + MLS 3000h (ASR data)
  • All other modules are frozen

Stage 2 — Dynamic Compression Training:

  • The LLM component of Qwen2-Audio is fine-tuned with LoRA
  • Training data: OpenASQA (5.9kh) + LibriSQA (360h) + Common Voice (1.7kh), all short speech (<30s)
  • The CTC decoder is frozen
  • Speech window \(L = 750\) (the original Qwen2-Audio setting)
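
For orientation, here is a hedged sketch of what the Stage-2 LoRA wrapping could look like with HuggingFace peft; the rank, alpha, and target modules below are assumptions for illustration, not the hyperparameters reported in the paper.

```python
from peft import LoraConfig, get_peft_model

def add_lora_adapters(llm):
    """Wrap the LLM with LoRA adapters for Stage-2 fine-tuning; the base weights
    stay frozen and only the adapters receive gradients. Hyperparameters assumed."""
    cfg = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
        task_type="CAUSAL_LM",
    )
    return get_peft_model(llm, cfg)
```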

Key Experimental Results

Main Results: Long-Speech Spoken QA

| Method            | Score (↑) | Note                                           |
|-------------------|-----------|------------------------------------------------|
| Random            | 2.54      | Random frame sampling                          |
| Similar (MostSim) | 3.08      | Merge most similar frames                      |
| AvgPool           | 3.10      | Average pooling                                |
| NTK-RoPE          | 3.44      | Extended positional encoding (no compression)  |
| FastLongSpeech    | 3.55      | Iterative fusion + DCT                         |

Under the same speech window as NTK-RoPE, FastLongSpeech achieves the best performance on long-speech understanding.

Inference Efficiency

| Method                  | Score | TFLOPs (↓) | Time/s (↓)   |
|-------------------------|-------|------------|--------------|
| NTK-RoPE                | 3.44  | 61.21      | 4.80         |
| Cascaded (Whisper+LLM)  | 3.75  | n/a        | 17.23 + 1.38 |
| FastLongSpeech          | 3.55  | 26.44      | 1.47         |

Relative to NTK-RoPE, computation is reduced by 57% and inference is 3.3× faster; against the cascaded baseline's combined latency (17.23 s + 1.38 s), the speedup exceeds 10×.

Short-Speech Efficiency (LibriTTS OpenASQA)

| Method                  | Score (↑) | TFLOPs (↓) |
|-------------------------|-----------|------------|
| Baseline (Qwen2-Audio)  | 3.73      | 9.79       |
| Ours (\(L=400\))        | 3.80      | 8.54       |
| Ours (\(L=200\))        | 3.87      | 5.64       |
| Ours (\(L=100\))        | 3.71      | 4.17       |

On short speech, the proposed method matches or exceeds the baseline at roughly half the computational cost (e.g., \(L = 200\)).

Ablation Study

| Method                                       | Score (↑)    |
|----------------------------------------------|--------------|
| FastLongSpeech (full)                        | 3.55         |
| w/o DCT                                      | 3.33 (−0.22) |
| w/o Iterative Fusion (one-shot compression)  | 3.41 (−0.14) |
| w/o Content Density (uniform-weight fusion)  | 3.28 (−0.27) |

All three components contribute significantly. Content density guidance contributes the most, underscoring the importance of distinguishing informative from redundant frames.

Key Findings

  1. Iterative (multi-round halving) compression outperforms one-shot compression: progressively expanding the receptive field better aggregates semantic content.
  2. CTC content density is an effective measure of frame informativeness, guiding retention of high-information frames.
  3. Dynamic compression training successfully transfers short-speech capabilities to long-speech scenarios without requiring any long-speech training data.
  4. For ASR tasks, WER increases by only 0.23 over Qwen2-Audio at a low compression ratio (\(L=400\)), but degrades substantially at high compression (\(L=100\)), indicating that the optimal compression ratio is task-dependent.
  5. The method generalizes directly to Qwen2.5-Omni without DCT retraining.

Highlights & Insights

  • Zero long-speech training data: Long-speech processing capability is transferred via dynamic compression training using only short-speech data (<30s).
  • Information-aware compression: The CTC output distribution naturally provides frame-level information density, proving more effective than simple similarity measures or random sampling.
  • Flexible efficiency–quality trade-off: The target length \(L\) can be adjusted freely to balance inference efficiency and generation quality.
  • LongSpeech-Eval benchmark: A novel evaluation benchmark for long-speech understanding is constructed, filling a gap in the field.

Limitations & Future Work

  1. Evaluated mainly on Qwen2-Audio: apart from the Qwen2.5-Omni transfer result, generalizability to other LSLMs remains to be verified.
  2. Long-speech training data still has potential: As long-speech data accumulates, direct training may outperform compression-based transfer.
  3. CTC decoder introduces additional parameters: Although relatively lightweight, it increases system complexity.
  4. Significant ASR degradation at high compression ratios: The trade-off between content fidelity and efficiency requires further optimization.
  5. Promising future directions include end-to-end joint training of the CTC decoder and LLM, and adaptive compression ratio selection (dynamically adjusting \(L\) based on speech complexity).

Related Work & Comparisons

  • SpeechPrune / FastAdaSP: Token selection/pruning strategies; the fusion strategy in FastLongSpeech preserves more information by merging rather than discarding frames.
  • StreamUni: Segmentation strategies for real-time speech translation; iterative fusion can be viewed as an offline counterpart for efficient compression.
  • NTK-RoPE: A classical approach for extending context length, but without reducing computational cost.
  • Insight: The high redundancy of speech (frame rate far exceeding the text token rate) naturally accommodates compression, and CTC blank probabilities serve as an excellent proxy for frame-level redundancy.

Rating

  • Novelty: ★★★★☆ (the combination of iterative fusion, CTC content density, and dynamic compression training is novel)
  • Technical Depth: ★★★★☆ (clear problem decomposition; each component has well-motivated design with theoretical grounding)
  • Experimental Thoroughness: ★★★★☆ (multi-task evaluation, ablation study, and efficiency analysis are comprehensive, though limited to a single base model)
  • Value: ★★★★★ (no long-speech data required, plug-and-play, substantial inference speedup — high practical applicability)