
End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering

Conference: AAAI2026 arXiv: 2511.09282 Code: 193746/CLSR Area: Audio & Speech Keywords: Spoken Question Answering, Contrastive Learning, Retrieval-Augmented Generation, Speech-Text Alignment, CIF

TL;DR

This paper proposes CLSR, an end-to-end contrastive language-speech retriever that converts acoustic representations into text-like representations before aligning them with text, enabling efficient extraction of question-relevant segments from long-form audio to support RAG-based spoken question answering for downstream LALMs.

Background & Motivation

  • Most existing SQA (Spoken Question Answering) models can only handle short audio clips under one minute, whereas real-world scenarios (meetings, lectures, online discussions) typically involve audio exceeding ten minutes.
  • Large Audio Language Models (LALMs), despite their strong speech understanding capabilities, suffer from slow inference and degraded accuracy on long-form audio.
  • RAG has demonstrated significant effectiveness in long-context textual QA, naturally raising the question of whether RAG can be applied to speech to retrieve the most relevant segments from long audio.
  • Existing speech retrievers (e.g., CLAP, SpeechDPR) fall short: CLAP is trained to align general audio (e.g., sound events) with text rather than spoken content with text, while SpeechDPR is limited by text-free training and data scarcity.

Core Problem

How can an end-to-end speech-text retriever be built that, without relying on cascaded ASR followed by text retrieval, matches or surpasses pipeline-level retrieval accuracy while substantially reducing inference time and recognition errors for long-form spoken question answering?

Method

Overall Architecture

CLSR consists of two components:

  1. Left branch: a CIF (Continuous Integrate-and-Fire)-based non-autoregressive attention encoder-decoder (AED) that takes speech input \(X\) and outputs a token probability distribution \(D\).
  2. Right branch: a Transformer text encoder (frozen BGE-base) that receives either text-like embeddings or real text embeddings and produces sentence-level representations for contrastive learning.
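A minimal structural sketch of the two branches (PyTorch-style; `speech_branch`, `vq_adaptor`, and `text_encoder` are placeholder module handles, not the authors' implementation):

```python
import torch.nn as nn

class CLSRSketch(nn.Module):
    """Structural sketch: speech branch -> VQ adaptor -> frozen text encoder."""
    def __init__(self, speech_branch, vq_adaptor, text_encoder):
        super().__init__()
        self.speech_branch = speech_branch   # CIF-based non-autoregressive AED
        self.vq_adaptor = vq_adaptor         # token distribution D -> text-like embeddings
        self.text_encoder = text_encoder     # frozen BGE-base Transformer

    def encode_speech(self, speech_feats):
        token_dist = self.speech_branch(speech_feats)   # D: (B, T_tok, vocab)
        text_like = self.vq_adaptor(token_dist)         # E^{Y'}: (B, T_tok, d_model)
        return self.text_encoder(text_like)[:, 0]       # CLS sentence embedding

    def encode_text(self, token_embeds):
        return self.text_encoder(token_embeds)[:, 0]    # CLS sentence embedding
```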

CIF Module

  • A speech encoder (SAN-M architecture) extracts acoustic features \(H^s\).
  • CIF computes per-frame weights \(\alpha_i \in [0,1]\) via convolution, accumulating them until the sum exceeds a threshold \(\beta\), thereby mapping time steps to token counts and producing acoustic representations \(E^a\).
  • This step realizes soft monotonic alignment from frame-level to token-level representations.
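A single-utterance sketch of the integrate-and-fire step, assuming frame features of shape (T, d) and precomputed weights \(\alpha\) (in the model, the weights come from a convolution over the encoder output); the paper's batched, length-regularized version is more involved:

```python
import torch

def cif_integrate(frame_feats, alphas, beta=1.0):
    """Fold frame-level features H^s (T, d) into token-level acoustic
    embeddings E^a (N_tok, d) using per-frame weights alphas (T,)."""
    d = frame_feats.size(-1)
    tokens = []
    acc_w = 0.0
    acc_feat = frame_feats.new_zeros(d)
    for h, a in zip(frame_feats, alphas):
        a = float(a)
        if acc_w + a < beta:                    # keep integrating the current token
            acc_w += a
            acc_feat = acc_feat + a * h
        else:                                   # fire: split the frame's weight at the boundary
            used = beta - acc_w
            tokens.append(acc_feat + used * h)
            acc_w = a - used                    # leftover weight opens the next token
            acc_feat = acc_w * h
    return torch.stack(tokens) if tokens else frame_feats.new_zeros(0, d)
```

In the full model, a quantity loss (presumably the \(\mathcal{L}_{MAE}\) term in the total loss) encourages the accumulated weights to match the reference token count, so the number of fired tokens matches the transcript length.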

Sampler Training Optimization

  • Training uses a two-pass scheme: in the first pass, \(E^a\) is used directly to predict the token distribution, yielding the ASR hypothesis \(Y^{asr}\).
  • In the second pass, \(Y^{asr}\) is compared against the ground truth \(Y^{con}\); at erroneous token positions, the correct token embeddings are substituted into \(E^a\) with sampling rate \(\lambda\), producing the mixed features \(E^s\).
  • \(E^s\) is then used to re-predict the token distribution \(D'\), enhancing the decoder's contextual modeling capacity.
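A sketch of the substitution step, assuming the first-pass hypothesis and the reference are aligned token-for-token and `ref_embed` is an embedding table for the ground-truth tokens (both assumptions for illustration):

```python
import torch

def sample_mix(E_a, y_asr, y_ref, ref_embed, lam=0.5):
    """Where the first-pass token y_asr disagrees with the reference y_ref,
    replace the acoustic embedding in E^a (T, d) with the reference token
    embedding, with probability lam, yielding the mixed features E^s."""
    wrong = (y_asr != y_ref)                                      # error positions
    pick = wrong & (torch.rand_like(y_ref, dtype=torch.float) < lam)
    E_s = E_a.clone()
    E_s[pick] = ref_embed(y_ref[pick])                            # substitute correct embeddings
    return E_s
```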

VQ Adaptor (Vector Quantization Adaptor)

  • Argmax is applied to the token probability distribution \(D\) to obtain the highest-probability token index \(q_i\).
  • Temperature-scaled softmax (\(\gamma=0.1\)) combined with straight-through gradient estimation preserves gradient flow.
  • The quantized one-hot matrix \(Q^{st}\) is multiplied with the text encoder's embedding weight matrix \(W^{te}\) to produce text-like embeddings \(E^{Y'}\).
  • Core idea: rather than directly aligning acoustic representations with text representations, VQ "translates" acoustic representations into approximate representations in the text space, where contrastive learning is then performed.
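A minimal sketch of the VQ adaptor with a straight-through estimator; `W_te` stands for the text encoder's token embedding matrix:

```python
import torch.nn.functional as F

def vq_adaptor(D, W_te, gamma=0.1):
    """Map the token distribution D (B, T, V) to text-like embeddings
    E^{Y'} (B, T, d) via the text embedding matrix W_te (V, d).
    Forward pass uses the argmax one-hot; gradients flow through the
    temperature-scaled softmax (straight-through estimator)."""
    soft = F.softmax(D / gamma, dim=-1)                  # differentiable surrogate
    idx = D.argmax(dim=-1)                               # q_i: highest-probability token indices
    hard = F.one_hot(idx, num_classes=D.size(-1)).to(soft.dtype)
    Q_st = hard + (soft - soft.detach())                 # one-hot forward, soft backward
    return Q_st @ W_te                                   # E^{Y'} in the text embedding space
```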

Contrastive Learning & Loss Function

  • The text-like embedding of the context and the text embedding of the question are fed into the text encoder; CLS token outputs are used as sentence-level representations.
  • Cosine similarity combined with NLL loss is used for alignment training.
  • Total loss: \(\mathcal{L}_{total} = (1-\alpha-\beta)\mathcal{L}_{ASR} + \alpha\mathcal{L}_{MAE} + \beta\mathcal{L}_{NLL}\), where \(\alpha=\beta=\frac{1}{3}\).
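A sketch of the cosine-similarity NLL objective in a standard in-batch form; the temperature `tau` and the negative-sampling scheme are assumptions, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def contrastive_nll(ctx_emb, q_emb, tau=0.05):
    """ctx_emb / q_emb: (B, d) CLS representations of text-like contexts and
    text questions; the i-th pair is the positive, other in-batch contexts
    serve as negatives."""
    ctx = F.normalize(ctx_emb, dim=-1)
    q = F.normalize(q_emb, dim=-1)
    logits = q @ ctx.t() / tau                          # (B, B) cosine similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)             # NLL of the positive pair
```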

Training Strategy

  • Pre-training stage: Paraformer is pre-trained on LibriSpeech 460h for ASR; BGE is pre-trained on clean text pairs.
  • Joint training: BGE is frozen; the ASR module and contrastive loss are jointly optimized.
  • Post-training: The ASR module is frozen; BGE is fine-tuned for a few epochs to adapt to text-like representations.
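A sketch of the freeze/unfreeze schedule for the joint-training and post-training stages (`asr_branch` and `bge` are assumed module handles; optimizers and data loading are omitted):

```python
def set_trainable(module, flag: bool):
    """Toggle gradient updates for all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

# Joint training: BGE frozen; ASR branch (incl. VQ adaptor) optimized with the contrastive loss
set_trainable(bge, False)
set_trainable(asr_branch, True)

# Post-training: ASR branch frozen; BGE fine-tuned for a few epochs on text-like inputs
set_trainable(asr_branch, False)
set_trainable(bge, True)
```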

Key Experimental Results

Datasets

Four datasets: Spoken-SQuAD, LibriSQA, SLUE-SQA-5 (real recordings), and DRCD (Chinese).

Main Results (Spoken-SQuAD*)

| Model | Paradigm | WER ↓ | Q→C R@1 ↑ | Q→C R@10 ↑ |
|---|---|---|---|---|
| CLAP | E2E | – | 2.93 | 14.84 |
| Whisper+BGE | Pipeline | 19.39 | 69.93 | 90.53 |
| CLSR | E2E | 15.14 | 70.03 | 90.68 |
  • CLSR substantially outperforms CLAP across all four datasets (R@1 improves from ~3% to ~70%) and surpasses SpeechDPR.
  • CLSR achieves performance comparable to or better than the Whisper+BGE pipeline while attaining lower WER (15.14 vs. 19.39).
  • On LibriSQA, CLSR achieves R@1 = 85.04%, approaching the text-only BGE baseline of 86.91%.

Ablation Study

  • Removing the VQ adaptor: R@10 drops sharply from ~86% to ~44%, validating the central role of text-like representations.
  • Removing the Sampler: WER increases from 15.01 to 16.18, with retrieval recall also declining.
  • Pre-training both the ASR module and BGE each contributes significantly to final performance.
  • WER ~16.75% constitutes a threshold beyond which retrieval performance degrades sharply.

Long-form SQA Performance

Evaluated on Spoken Wikipedia (average audio length ~30 minutes):

  • Without CLSR: EM = 18.00, F1 = 23.55, inference time = 7935 s.
  • With CLSR: EM = 27.60, F1 = 35.10, inference time = 783 s (≈10× speedup).
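A sketch of the retrieval-then-answer loop on long recordings, assuming a trained CLSR retriever exposing `encode_speech` / `encode_text` and a downstream LALM with an `answer(question, chunks)` interface (all interfaces are placeholders):

```python
import torch
import torch.nn.functional as F

def answer_long_audio(audio, question_ids, clsr, lalm, chunk_sec=40, sr=16000, top_k=3):
    # 1) split the long recording into fixed 40-second chunks (the paper's segmentation)
    chunk = chunk_sec * sr
    chunks = [audio[i:i + chunk] for i in range(0, len(audio), chunk)]
    # 2) embed chunks and the question with CLSR, rank by cosine similarity
    with torch.no_grad():
        c_emb = F.normalize(torch.stack([clsr.encode_speech(c) for c in chunks]), dim=-1)
        q_emb = F.normalize(clsr.encode_text(question_ids), dim=-1)
    scores = c_emb @ q_emb
    keep = scores.topk(min(top_k, len(chunks))).indices.tolist()
    # 3) pass only the retrieved chunks to the LALM instead of the full audio
    return lalm.answer(question_ids, [chunks[i] for i in keep])
```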

Highlights & Insights

  1. First application of RAG to SQA, providing a systematic framework for long-form spoken question answering.
  2. The text-like representation bridging strategy elegantly circumvents the difficulty of direct speech-text alignment by leveraging mature text-based contrastive learning models for high-quality cross-modal retrieval.
  3. No large-scale speech-text pre-training is required; joint training on task data alone achieves performance on par with pipeline methods.
  4. The straight-through estimator in the VQ adaptor ensures end-to-end training feasibility.

Limitations & Future Work

  • Evaluation is limited to TTS-synthesized speech and a small number of real recordings; robustness to noisy environments and multi-speaker scenarios remains unexplored.
  • The current approach fixes long audio segmentation at 40-second chunks, lacking adaptive semantic segmentation strategies.
  • BGE is frozen during joint training; gains from post-training are limited, and better unfreezing strategies warrant investigation.
  • Comparisons with more recent speech foundation models (e.g., Whisper-v3, SeamlessM4T) are absent.
  • The long-form audio experiment uses only 500 samples, which is relatively small in scale.

Comparison with Related Methods

| Method | Characteristics | Limitations / Remarks |
|---|---|---|
| CLAP | Audio-text contrastive learning | Suited to sound-effect matching, not speech-content retrieval |
| SpeechDPR | Text-free training for speech retrieval | Data scarcity leads to poor performance (R@20 only 19.94) |
| Whisper+BGE | ASR-cascaded text retrieval | Dependent on ASR quality; error propagation; weak Chinese support |
| CLSR | VQ bridging + joint training | End-to-end, reaches pipeline-level performance; WER and retrieval jointly optimized |

The text-like representation idea is transferable to other cross-modal retrieval tasks (e.g., translating video representations into the text space for video-text retrieval). The CIF + VQ combination can serve as a general-purpose speech-to-discrete-token frontend, replacing conventional discretization schemes. The long-form audio RAG framework can be integrated with streaming ASR to enable real-time meeting question answering systems.

Rating

  • Novelty: ⭐⭐⭐⭐ (First introduction of RAG into SQA; text-like representation bridging approach is novel)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (Four datasets + ablations + long-form audio validation, though the long-form experiment scale is limited)
  • Writing Quality: ⭐⭐⭐⭐ (Clear structure, complete formulations, well-supported by figures)
  • Value: ⭐⭐⭐⭐ (Provides a practical framework for long-form SQA; 10× inference speedup has strong application potential)