Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

Conference: ACL 2026
arXiv: 2601.09413
Code: GitHub
Area: audio_speech
Keywords: Speech Recognition, Audio Reasoning, Multimodal Agent, Self-Reflection, Generative Error Correction

TL;DR

Speech-Hands is proposed as a learnable speech agent framework that decides whether to trust its own perception or external ASR hypotheses by generating explicit action tokens (<internal>/<external>/<rewrite>) at inference time. It achieves an average WER reduction of 12.1% across 7 benchmarks on the OpenASR leaderboard and reaches 77.37% accuracy in Audio QA.

Background & Motivation

Omni-multimodal models (e.g., Qwen2.5-Omni) can process audio and text simultaneously. However, a critical and counter-intuitive finding is that naively fine-tuning omni-models to combine speech recognition with external sound understanding tasks often degrades performance. Preliminary experiments show that using Qwen2.5-Omni for Generative Error Correction (GER) on Whisper's N-best hypotheses leads to WER deterioration (8.52%-9.05%) across 7 ASR benchmarks. Further zero-shot analysis reveals that foundation models lack intrinsic arbitration capabilities: their choice between candidates tracks the phrasing of the prompt rather than which candidate is actually correct. This indicates a need for an explicit self-reflection mechanism that lets the model learn when to trust itself and when to seek external help.

Method

Overall Architecture

Speech-Hands models speech understanding as an agentic decision process. Given an input audio \(A\) and an optional query \(Q\), the model first generates its own response \(H_{omni}\) (internal perception) while obtaining a response \(H_{ext}\) from an external model. Then, based on the full context \((A, Q, H_{omni}, H_{ext})\), the model generates an explicit action token to guide the subsequent generation strategy: <internal> to trust itself, <external> to adopt the external result, or <rewrite> for fusion-based rewriting.
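
The decision step above can be illustrated with a minimal sketch. It assumes the model emits the action token as a prefix of its generated string and that `<internal>` and `<external>` simply select the corresponding hypothesis; the function names `split_action` and `resolve` are hypothetical and are not taken from the paper's code.

```python
from typing import Tuple

# The three action tokens named in the text.
ACTIONS = ("<internal>", "<external>", "<rewrite>")


def split_action(generated: str) -> Tuple[str, str]:
    """Split a generated string into its leading action token and the rest."""
    text = generated.strip()
    for action in ACTIONS:
        if text.startswith(action):
            return action, text[len(action):].strip()
    raise ValueError("no action token at start of output")


def resolve(generated: str, h_omni: str, h_ext: str) -> str:
    """Pick the final answer according to the action token (assumed policy)."""
    action, body = split_action(generated)
    if action == "<internal>":
        return h_omni   # trust the model's own perception
    if action == "<external>":
        return h_ext    # adopt the external model's result
    return body         # <rewrite>: use the fused text that follows the token


final = resolve(
    "<rewrite> turn left at the light",
    h_omni="turn left at the night",
    h_ext="turn lift at the light",
)
print(final)  # turn left at the light
```

Because the action is an ordinary token in the output, the arbitration decision can be read off and logged for every sample, which is what makes the process inspectable.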

Key Designs

  1. ASR Action Token Construction: For each training sample, the WER of the internal transcription \(T_{int}\), external transcription \(T_{ext}\), and GER-fused transcription \(T_{ger}\) are calculated. If \(T_{int}\) is identical to the ground truth (\(WER=0\)) or has the lowest WER, it is labeled <internal>; if \(T_{ext}\) is optimal, it is labeled <external>; if \(T_{ger}\) is optimal, it is labeled <rewrite>. This fine-grained WER-based labeling provides a strong supervision signal.
  2. Audio QA Action Token Construction: Since QA provides discrete correct/incorrect signals, a multi-sampling stability strategy is introduced. The external model is sampled 5 times, and a majority vote determines the <external> or <rewrite> label, reducing the impact of external prediction randomness on the decision boundary.
  3. Unified End-to-End Training: Each training sample is formatted as "action token + target text." Action selection and subsequent generation are jointly supervised via a single cross-entropy loss, enabling the model to internalize the mapping from multimodal evidence to action selection.
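
The three designs above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the tie-breaking order (internal, then external, then rewrite), the mapping of a correct majority vote to `<external>`, the use of the reference text as the training target, and all function names are assumptions. The summary also does not say how `<internal>` is assigned for Audio QA, so the sketch leaves that case out.

```python
from collections import Counter
from typing import List


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def asr_action_label(reference: str, t_int: str, t_ext: str, t_ger: str) -> str:
    """Label a sample with the action whose transcription has the lowest WER."""
    scores = [
        ("<internal>", wer(reference, t_int)),
        ("<external>", wer(reference, t_ext)),
        ("<rewrite>", wer(reference, t_ger)),
    ]
    # min() keeps the first minimum, so ties favor <internal> (assumed order).
    return min(scores, key=lambda item: item[1])[0]


def qa_external_or_rewrite(sampled_answers: List[str], gold: str) -> str:
    """Majority vote over repeated external samples (5 in the text)."""
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    # Assumed mapping: a correct majority means the external model is reliable.
    return "<external>" if majority_answer == gold else "<rewrite>"


def format_sample(action: str, target: str) -> str:
    """Build the "action token + target text" training string."""
    return f"{action} {target}"


label = asr_action_label(
    reference="the cat sat",
    t_int="the bat sat",
    t_ext="the cat sat",
    t_ger="the cat",
)
print(format_sample(label, "the cat sat"))  # <external> the cat sat
```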

Loss & Training

  • Standard cross-entropy loss is used to jointly optimize the action token and the target sequence.
  • Fine-tuning is based on Qwen2.5-Omni for 5 epochs with a batch size of 64, a learning rate of \(1 \times 10^{-4}\) (cosine decay), and fp16 training.
  • A maximum of 20,000 training samples per dataset is used (limited by inference computation).
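
One way to write the joint objective, assuming the target sequence \(y\) is the action token \(a\) followed by the target text tokens \(w_1, \dots, w_n\) (the symbols \(y\), \(a\), \(w_i\), and \(\theta\) are notation introduced here, not taken from the paper):

\[
\mathcal{L}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid y_{<t}, A, Q, H_{omni}, H_{ext}\right), \qquad y = (a, w_1, \dots, w_n)
\]

Under this reading, the first term of the sum (\(t = 1\)) supervises action selection and the remaining terms supervise generation, so no separate classification head or loss weight is needed.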

Key Experimental Results

Main Results

ASR Task (7 OpenASR datasets, WER%):

| Method | AMI | Tedlium | GigaSpeech | SPGISpeech | VoxPopuli | Libri-clean | Libri-other | Average WER↓ |
|---|---|---|---|---|---|---|---|---|
| Whisper-v2-large | 16.88 | 4.32 | 11.45 | 3.94 | 7.57 | 2.91 | 5.15 | 7.17 |
| Qwen2.5-Omni | 19.77 | 5.17 | 11.26 | 4.58 | 6.59 | 2.09 | 3.85 | 7.33 |
| Phi-4-MM | 11.69 | 2.90 | 9.78 | 3.13 | 5.93 | 1.68 | 3.83 | 6.14 |
| GER ⇒ Whisper | 23.44 | 6.15 | 12.15 | 3.94 | 7.53 | 2.97 | 4.89 | 8.44 |
| Speech-Hands ⇌ parakeet | 11.20 | 4.37 | 11.10 | 2.26 | 6.02 | 1.67 | 3.18 | 5.69 |

Audio QA Task (Accuracy%):

| Method | Bio-acoustic | Soundscape | Complex QA | Average Acc↑ |
|---|---|---|---|---|
| Qwen2.5-Omni | 47.32 | 56.32 | 59.89 | 57.87 |
| AudioFlamingo 3 | 71.88 | 57.31 | 81.26 | 74.49 |
| Speech-Hands + majority | 81.25 | 59.4 | 85.7 | 77.37 |

Ablation Study

| Experiment Content | Key Findings |
|---|---|
| Prompt Ablation (GER SFT) | All prompt strategies failed (WER 8.44-9.05), proving implicit fusion is infeasible. |
| Zero-shot Arbitration | Model decisions are sensitive to prompt phrasing rather than the correct answer (validated by confusion matrix). |
| Action token F1 | <internal> F1 > 0.8 (most datasets), <external> F1 0.65-0.89, <rewrite> F1 < 0.4 (limited by data sparsity). |
| Training Data Size | Outperforms full-training baselines with only 20k samples per dataset. |
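
For reference, the per-action F1 reported in the ablation can be computed as in the sketch below. It assumes gold and predicted action tokens are available as parallel lists; the function name `per_action_f1` and the toy labels are hypothetical and do not reproduce any number from the paper.

```python
from typing import Dict, List

ACTION_TOKENS = ("<internal>", "<external>", "<rewrite>")


def per_action_f1(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """One-vs-rest F1 for each action token."""
    scores = {}
    for action in ACTION_TOKENS:
        tp = sum(g == action and p == action for g, p in zip(gold, pred))
        fp = sum(g != action and p == action for g, p in zip(gold, pred))
        fn = sum(g == action and p != action for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs.
        scores[action] = 2 * tp / denom if denom else 0.0
    return scores


gold = ["<internal>", "<internal>", "<external>", "<rewrite>"]
pred = ["<internal>", "<external>", "<external>", "<internal>"]
f1 = per_action_f1(gold, pred)
```

With a class as rare as `<rewrite>`, a single missed example removes a large share of its true positives, which is consistent with the low F1 attributed to data sparsity.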

Key Findings

  • Cascaded GER (ASR followed by LLM correction) is consistently inferior to the original ASR, whereas the parallel agentic architecture of Speech-Hands is consistently superior to both baselines.
  • Despite the extreme sparsity of the <rewrite> label (<2%), the model maintains high precision when triggered, reflecting cautious but reliable rewrite detection.
  • On AMI (meeting speech, the noisiest scenario), Speech-Hands reduces the WER of Qwen2.5-Omni from 19.77% to 11.20%, a relative reduction of about 43%.
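
The AMI figure is a relative reduction, which can be checked directly from the two WER values in the ASR table (the helper name `relative_reduction` is introduced here for illustration):

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return (baseline - improved) / baseline * 100.0


ami_gain = relative_reduction(19.77, 11.20)
print(round(ami_gain, 1))  # 43.3
```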

Highlights & Insights

  • Core Insight: The fundamental problem of multimodal models is not insufficient perception capacity, but a lack of a mechanism to arbitrate between multiple information sources. Explicit action tokens transform implicit information fusion into an interpretable decision process.
  • "Knowing what one doesn't know": The framework draws an analogy to self-reflection in developmental psychology, where a learner moves from an egocentric perspective to a stage at which it can "step outside its own thinking" and evaluate the reliability of its beliefs.
  • Natural Generalization: The transition from ASR to Audio QA requires no architectural modifications, only an adjustment of the action token construction strategy.

Limitations & Future Work

  • Training data for the <rewrite> action is extremely sparse, leading to low F1; data augmentation strategies are needed.
  • Currently, only Qwen2.5-Omni is used as the backbone; generalization to other omni-models remains to be verified.
  • Tool-calling actions (e.g., calling external APIs) have not yet been implemented and represent a future direction.
  • Multilingual ASR scenarios have not been explored.

Related Work & Insights

  • Generative Error Correction (GER) (Yang et al., 2023): Pure text-based cascaded correction cannot utilize raw audio; Speech-Hands shows this to be a fundamental limitation of "non-agentic" approaches.
  • Qwen2.5-Omni / Phi-4-MM: Current state-of-the-art omni-models still lack explicit arbitration mechanisms.
  • Self-Reflection (Madaan et al., 2023): Existing reflection methods intervene only after perceptual fusion; the innovation of Speech-Hands lies in reflecting on the perceptual behavior itself.
  • Insight: The action token approach can be extended to any multimodal task requiring arbitration between multiple information sources.

Rating

| Dimension | Score (1-10) |
|---|---|
| Novelty | 8 |
| Experimental Thoroughness | 8 |
| Writing Quality | 7 |
| Value | 8 |
| Total Score | 7.8 |
