Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning¶
- Conference: ICLR 2026
- arXiv: 2602.11909
- Code: GitHub
- Area: Reinforcement Learning
- Keywords: Audio Understanding, Large Audio-Language Models, Audio-Interleaved Reasoning, Reinforcement Learning, Chain-of-Thought
TL;DR¶
This paper proposes a novel paradigm called audio-interleaved reasoning, which treats audio as an active component during inference rather than a static context, enabling LALMs to dynamically locate and re-listen to audio segments during the reasoning process. Through a two-stage SFT+RL training framework and a structured data generation pipeline, the authors build the Echo model, which surpasses GPT-4o and Gemini-2.0-Flash on both expert-level and general audio understanding benchmarks.
Background & Motivation¶
Large audio-language models (LALMs) perform well on basic audio tasks (speech recognition, sound classification, music analysis), but exhibit a significant performance gap on complex audio tasks requiring fine-grained interpretation and reasoning.
The existing reasoning paradigm — audio-conditioned text reasoning — suffers from a fundamental information bottleneck:
- Audio is encoded once into contextual embeddings, after which reasoning proceeds entirely in the text modality.
- Audio is a continuous signal carrying richer and more fine-grained information than text; a single encoding cannot retain all subtle details.
- Empirical evidence: during inference, LALM attention to audio tokens drops rapidly to <5% after the first 25 steps.
Human cognitive inspiration: Human auditory processing involves cyclically re-listening to critical acoustic segments, driven by auditory working memory and top-down attentional control. This paper emulates this mechanism by enabling LALMs to actively re-listen to audio during reasoning.
Core contrast: shifting from "thinking about audio" to "thinking with audio" — analogous to the paradigm shift in visual reasoning from "thinking about images" to "thinking with images."
Method¶
Overall Architecture¶
A two-stage training framework:
- Stage 1 (SFT): Teaches the model to locate critical audio segments and generate audio-anchored reasoning.
- Stage 2 (RL): Activates audio-interleaved reasoning capability through reasoning format adaptation and reinforcement learning.
Key Design 1: Audio-Anchored Reasoning (SFT Stage)¶
Initialized from Qwen2.5-Omni (7B), the base model tends toward pure text reasoning without actively referencing audio segments.
SFT data format: Each sample contains multimodal input \((A, q)\) (audio + question) and ground-truth \((c, a)\) (CoT + answer), where the CoT densely embeds <seg>start, end</seg> tag pairs referencing audio segments. Each reference is preceded by a calling rationale and followed by fine-grained analysis grounded in the segment.
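The sample layout above can be sketched in code. This is a hypothetical illustration, not the authors' actual schema: the field names, the example audio path, and the `extract_segments` helper are all assumptions for clarity.

```python
import re

# Hypothetical EAQA-SFT sample: the CoT densely embeds <seg>start, end</seg>
# tag pairs, each preceded by a calling rationale and followed by analysis
# grounded in the referenced segment. Field names are illustrative only.
sample = {
    "audio": "audioset_clip_001.wav",                       # multimodal input A
    "question": "What happens right after the dog barks?",  # q
    "cot": (
        "To identify the event after the bark, I should re-listen to the "
        "onset region. <seg>1.2, 2.8</seg> In this segment, a dog barks "
        "twice and a door slams immediately afterwards."
    ),
    "answer": "A door slams.",
}

def extract_segments(cot: str) -> list[tuple[float, float]]:
    """Parse all <seg>start, end</seg> time references from a CoT string."""
    return [(float(s), float(e))
            for s, e in re.findall(r"<seg>([\d.]+),\s*([\d.]+)</seg>", cot)]

print(extract_segments(sample["cot"]))  # [(1.2, 2.8)]
```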
Training objective: Standard cross-entropy over the CoT and answer tokens: \(\mathcal{L}_\text{SFT}(\theta) = -\frac{1}{n_i}\sum_{t=1}^{n_i} \log \pi_\theta(y_{i,t}^* \mid x_i, y_{i,<t}^*)\)
This yields a "cold-start model" capable of referencing specific time intervals during reasoning, though still limited to the text modality.
Key Design 2: Audio-Interleaved Reasoning (RL Stage)¶
Reasoning format adaptation: The cold-start model's reasoning is extended from text-only to a truly multimodal process — whenever the model generates a <seg> tag pair, generation is paused, the corresponding segment \(A_{s:e}\) is cropped from the original audio, its tokens are inserted into the reasoning sequence, and generation resumes from the augmented input. This loop continues until <eos> is generated.
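The pause-crop-resume loop can be sketched as follows. This is a minimal simulation, not the Qwen2.5-Omni implementation: `step_fn` is a hypothetical stand-in for one decoding call of the LALM, and the inserted audio tokens are represented by a text placeholder.

```python
import re

# A completed <seg>start, end</seg> tag pair at the end of the sequence
# triggers a pause in generation.
SEG_RE = re.compile(r"<seg>([\d.]+),\s*([\d.]+)</seg>$")

def crop(audio, start_s, end_s, sr=16000):
    """Crop A_{s:e} from the raw waveform (a list of samples) by timestamps."""
    return audio[int(start_s * sr):int(end_s * sr)]

def interleaved_generate(step_fn, audio, prompt, max_steps=64):
    """Audio-interleaved decoding: whenever the model emits a full <seg> tag
    pair, pause, crop the referenced segment, append its (placeholder) audio
    tokens, and resume generation until <eos>."""
    seq = prompt
    for _ in range(max_steps):
        chunk = step_fn(seq)          # hypothetical LALM decoding step
        seq += chunk
        m = SEG_RE.search(seq)
        if m:                         # pause: insert tokens of the segment
            segment = crop(audio, float(m.group(1)), float(m.group(2)))
            seq += f"[AUDIO:{len(segment)} samples]"
        if chunk.endswith("<eos>"):
            break
    return seq

# Toy usage: a scripted "model" that first requests a segment, then answers.
chunks = iter(["Re-listen to the onset. <seg>0.0, 1.0</seg>",
               " A bark, then a door slam.<eos>"])
waveform = [0.0] * 32000  # 2 s of silence at 16 kHz
out = interleaved_generate(lambda seq: next(chunks), waveform,
                           "Q: what follows the bark?\n")
```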
RL reward design:
| Reward Component | Score | Description |
|---|---|---|
| \(\mathcal{R}_\text{format}\) | 0.5 | Correct use of enclosing tags |
| \(\mathcal{R}_\text{consist}\) | −0.1/instance, max −0.5 | Penalizes semantic discontinuity after </seg> (e.g., capitalized letter or <) |
| \(\mathcal{R}_\text{acc}\) | 0.5 | Answer matches ground truth |
| \(\mathcal{R}_\text{seg}\) | 0.5 | Correct answer with at least one segment reference; 0 otherwise |
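Composing the four terms from the table might look like the sketch below. The function signature and its scalar inputs are assumptions for illustration; the paper defines only the reward values, not this code.

```python
def total_reward(has_valid_format: bool, n_discontinuities: int,
                 answer_correct: bool, n_seg_refs: int) -> float:
    """Hypothetical composition of the four reward components.

    - format: 0.5 for correct use of enclosing tags
    - consistency: -0.1 per semantic discontinuity after </seg>, capped at -0.5
    - accuracy: 0.5 when the answer matches ground truth
    - segment: 0.5 only when the answer is correct AND at least one
      audio segment was referenced during reasoning
    """
    r_format = 0.5 if has_valid_format else 0.0
    r_consist = -min(0.1 * n_discontinuities, 0.5)
    r_acc = 0.5 if answer_correct else 0.0
    r_seg = 0.5 if (answer_correct and n_seg_refs >= 1) else 0.0
    return r_format + r_consist + r_acc + r_seg
```

Note that \(\mathcal{R}_\text{seg}\) is conditioned on answer correctness, which discourages the model from emitting segment references as reward hacking without actually using them to get the answer right.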
Optimization algorithm: GRPO (Group Relative Policy Optimization), which samples \(G=8\) candidate responses per prompt, normalizes their rewards within the group to compute advantages, and applies PPO-style clipping with a KL-divergence constraint.
All inserted audio tokens are excluded from loss computation.
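The group-relative advantage computation and the audio-token loss mask can be sketched as below; this is a simplified illustration of the standard GRPO normalization, and the `"<audio>"` token label is a hypothetical placeholder.

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each sampled response's reward
    by the group mean and standard deviation (GRPO's critic-free baseline)."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# G = 8 candidate responses per prompt, as used in the paper
advs = grpo_advantages([1.5, 1.0, 0.0, 1.5, 0.5, 1.5, 0.0, 1.0])

# Inserted audio tokens are excluded from the loss; one simple way is a
# per-token mask over the reasoning sequence ("<audio>" is a placeholder):
tokens = ["I", "<seg>", "<audio>", "</seg>", "hear"]
loss_mask = [tok != "<audio>" for tok in tokens]
```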
Key Design 3: Data Generation Pipeline¶
Built upon audio datasets with temporal metadata, including AudioSet-Strong and MusicBench:
- Qwen2.5-Omni converts audio into three types of textual descriptions (comprehensive description, speech transcription, musical elements).
- Combined with temporal metadata, DeepSeek-R1 synthesizes QA-CoT triplets.
- Two-stage quality filtering: high-quality QA+CoT → SFT dataset; high-quality QA only → RL dataset.
Output: EAQA-SFT (75.9k samples with CoT) + EAQA-RL (21.9k samples without CoT).
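The three pipeline steps above can be sketched as a single routine. The callables are hypothetical stand-ins for Qwen2.5-Omni captioning, DeepSeek-R1 synthesis, and the two quality filters; only the branching logic mirrors the paper's description.

```python
def build_eaqa(clips, describe, synthesize, filter_qa_cot, filter_qa):
    """Hypothetical sketch of the EAQA data pipeline.

    describe:       audio -> textual descriptions (stand-in for Qwen2.5-Omni)
    synthesize:     descriptions + temporal metadata -> QA-CoT triplet
                    (stand-in for DeepSeek-R1)
    filter_qa_cot:  True when both QA and CoT pass quality filtering
    filter_qa:      True when the QA alone passes quality filtering
    """
    sft, rl = [], []
    for clip in clips:
        qa_cot = synthesize(describe(clip), clip["metadata"])
        if filter_qa_cot(qa_cot):
            sft.append(qa_cot)                               # -> EAQA-SFT
        elif filter_qa(qa_cot):
            rl.append({k: v for k, v in qa_cot.items() if k != "cot"})  # -> EAQA-RL
    return sft, rl

# Toy usage with stub callables
clips = [{"metadata": 1}, {"metadata": 2}]
sft, rl = build_eaqa(
    clips,
    describe=lambda c: "description",
    synthesize=lambda d, m: {"q": "?", "a": "!", "cot": "...", "ok": m == 1},
    filter_qa_cot=lambda x: x["ok"],
    filter_qa=lambda x: True,
)
```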
Key Experimental Results¶
Main Results: MMAR Expert-Level Audio Reasoning¶
| Model | Size | Sound | Music | Speech | Mixed Modal Avg | Overall Avg |
|---|---|---|---|---|---|---|
| Qwen2.5-Omni | 7B | 58.79 | 40.78 | 59.86 | ~58 | 57.33 |
| GPT-4o-Audio | — | 53.94 | 50.97 | 70.41 | ~65 | 64.09 |
| Gemini-2.0-Flash | — | 61.21 | 50.97 | 72.11 | ~70 | 67.90 |
| Audio-Thinker | 7B | 68.48 | 53.88 | 64.29 | ~70 | 67.25 |
| Echo | 7B | 67.27 | 60.68 | 69.39 | ~71 | 69.99 |
As a 7B open-source model, Echo surpasses GPT-4o-Audio by +5.9 points and Gemini-2.0-Flash by +2.1 points in overall MMAR accuracy.
Main Results: MMAU General Audio Understanding¶
| Model | MMAU-mini Avg | MMAU Avg |
|---|---|---|
| Qwen2.5-Omni (7B) | 71.53 | 71.00 |
| Audio-Thinker (7B) | 78.00 | 75.39 |
| Gemini-2.5-Pro | 71.60 | 69.36 |
| Echo (7B) | 80.41 | 76.61 |
Echo exceeds Audio-Thinker by +2.41 points on MMAU-mini and +1.22 points on MMAU.
Ablation Study: Training Framework (MMAR Mean Accuracy)¶
| Model | SFT Data | RL Data | Reasoning Format | Accuracy |
|---|---|---|---|---|
| Base Model | — | — | Text-conditioned | 51.80% |
| Cold-Start | EAQA-SFT | — | Audio-anchored | 56.77% |
| Cold-Start | EAQA-SFT | — | Audio-interleaved | 52.26% |
| Echo | EAQA-SFT | EAQA-RL | Audio-interleaved | 69.99% |
| Direct RL | — | EAQA-RL | Text-conditioned | 63.15% |
Key Findings¶
- Reasoning format comparison: moving from text-conditioned through audio-anchored to audio-interleaved reasoning, greater audio participation correlates with higher performance, while output length and latency remain comparable across formats.
- Training dynamics: During RL, the model stabilizes at ~1.9 segment references per response, average segment duration of 3.0s, and segment overlap rate of only ~0.1.
- Segment coverage: 99.4% of responses re-listen to at least one segment; 78.0% re-listen to two or more; segments are distributed uniformly across the audio timeline.
- Skill improvements: Multi-speaker role mapping +37.0%, event-based sound reasoning +20.8%, emotional state summarization +20.5%.
- Generalization: Although the SFT data covers only the first 10 seconds of each audio clip, Echo accurately localizes informative segments in longer audio.
Highlights & Insights¶
- Paradigm innovation: "Thinking with audio" rather than "thinking about audio" elevates audio from a static context to an active reasoning component.
- Attention analysis provides direct evidence: audio-interleaved reasoning raises attention on audio tokens from <5% to 10–14%, a roughly +140% relative increase.
- The engineering design of reasoning format adaptation is elegant and effective, requiring only pausing generation at <seg> tags to insert audio tokens.
- The consistency reward \(\mathcal{R}_\text{consist}\) and segment reward \(\mathcal{R}_\text{seg}\) are well-designed, effectively guiding the model to learn meaningful re-listening behavior.
Limitations & Future Work¶
- The current re-listening implementation is relatively simple; more advanced audio operations such as slow playback and frequency band isolation could be explored.
- CoT annotations in EAQA-SFT are automatically generated from fixed temporal metadata, lacking human-crafted heuristics.
- Due to DeepSeek-R1's "rumination" tendency, the data may exhibit insufficient diversity in reasoning paths.
- Computational overhead: each re-listening requires reprocessing audio tokens, resulting in inference latency of ~2.12s (vs. baseline 1.18s).
Related Work & Insights¶
- Analogous to the evolution in visual reasoning: from Multimodal CoT → visual grounding reasoning → direct insertion of image patches.
- Complementary to Audio-Reasoner (SFT-only) and Omni-R1 (RL-only); Echo demonstrates the superiority of the two-stage SFT+RL approach.
- The data generation pipeline of "audio → multi-perspective text → LLM-synthesized QA-CoT" is generalizable to other modalities such as video.
- The design pattern of GRPO + multi-component rewards has broad applicability in multimodal RL.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (Audio-interleaved reasoning is an entirely new paradigm)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (3 benchmarks, detailed ablations, training dynamics analysis)
- Writing Quality: ⭐⭐⭐⭐⭐ (Clear structure, polished figures, in-depth analysis)
- Value: ⭐⭐⭐⭐⭐ (Opens a new direction for audio understanding with strong empirical support)