
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Conference: ICCV 2025
arXiv: 2412.09501
Code: https://github.com/dvlab-research/Lyra
Area: Multimodal VLM / Omni-Modal Understanding / Speech Modality
Keywords: omni-cognition, speech-centric, multi-modality LoRA, latent cross-modality regularizer, long speech, token extraction

TL;DR

This paper proposes Lyra, a speech-centric omni-modal MLLM framework consisting of three core components — a DTW-based cross-modality regularizer, multi-modality LoRA, and a latent multi-modality extractor — along with the first long-speech SFT dataset (12K samples). Using only 2.7M training samples and modest compute, Lyra achieves state-of-the-art performance simultaneously on vision-language, vision-speech, and speech-language benchmarks, while supporting speech inputs of up to 2 hours in length.

Background & Motivation

Background: Existing omni-models (e.g., VITA, EMOVA, Intern-Omni) have begun exploring multimodal fusion, but the speech modality remains severely underutilized — their speech evaluation is largely confined to speech-text ASR metrics (e.g., LibriSpeech WER), neglecting cross-modal interactions between speech and other modalities such as vision.

Limitations of Prior Work: (1) Speech tokens and text tokens share semantic content but differ greatly in length (Whisper produces far more tokens than the corresponding transcription), making direct joint training suboptimal; (2) Long speech support is extremely limited (VITA supports only ~1 minute); (3) Training large omni-models requires massive data (VITA: 5M, Intern-Omni: 27M), resulting in low efficiency.

Key Challenge: Incorporating the speech modality into an LLM through joint training degrades existing visual and language capabilities; directly replacing text instructions with speech tokens leads to a significant performance drop (MM-Vet drops 8% under speech instructions vs. text instructions).

Goal: To efficiently build an MLLM with genuine support for image/video/speech/sound omni-modal understanding and interaction, with a focus on deep integration of speech with other modalities.

Key Insight: (1) Use DTW to align speech tokens with transcription text tokens to reduce the modality gap; (2) Use modality-specific LoRA to preserve existing capabilities; (3) Use a query-aware token extractor to reduce long-context overhead.

Core Idea: By adopting a speech-centric design with DTW-based cross-modality regularization, multi-modality LoRA, and progressive multimodal token extraction, Lyra constructs an omni-modal MLLM efficiently with minimal training data.

Method

Overall Architecture

Built upon Qwen2-VL, Lyra incorporates a Whisper-v3 speech encoder and an ImageBind audio encoder. Multimodal input tokens are passed through modality-specific projectors into the LLM (equipped with multi-modality LoRA and the latent extractor), producing both text and streaming speech outputs. Training proceeds in four stages: (1) speech alignment pretraining → (2) three-modality joint training → (3) long-speech SFT → (4) speech generation.

Key Designs

  1. Latent Cross-Modality Regularizer (LCMR):

    • Function: Aligns the latent representations of speech tokens with their corresponding transcription text tokens during training.
    • Mechanism: Speech tokens \(X^{[speech]} \in \mathbb{R}^{d \times L}\) and STT text tokens \(X^{[STT]} \in \mathbb{R}^{d \times S}\) differ in length (\(L \gg S\)). Dynamic Time Warping (DTW) is applied to find the optimal alignment path, minimizing the aligned cosine distance: \(\mathcal{L}_{LCMR} = \frac{1}{L+S} D_{L,S}\), where the DTW distance matrix is defined as \(D_{l,s} = \text{dist}(l,s) + \min\{D_{l,s-1}, D_{l-1,s}, D_{l-1,s-1}\}\), with \(\text{dist}(l,s)\) the cosine distance between the \(l\)-th speech token and the \(s\)-th text token.
    • Effect: Incorporating LCMR improves MM-Vet under speech instructions from 53.1 to 58.1 (+5.0), while also improving the text-instruction score from 61.1 to 62.6, narrowing the cross-modal performance gap from 8% to 4.5%.
    • Design Motivation: Speech and text are different surface representations of the same semantic content but differ in sequence length. DTW handles variable-length alignment at low cost with a well-established algorithm, ensuring speech tokens carry text-equivalent semantic information before entering the LLM.
  2. Multi-Modality LoRA (MLoRA):

    • Function: Trains separate LoRA adapters for different modality combinations, preventing new-modality training from degrading existing capabilities.
    • Mechanism: \(H = (B^{[M]}A^{[M]} + W)X^{[M]}\), where \(M\) denotes the modality combination (text, image, speech, etc.) and each combination is assigned an independent low-rank adapter.
    • Effect: Compared to full-parameter SFT, MLoRA better preserves visual capabilities (TextVQA: 82.6 vs. 81.3) while more effectively developing speech capabilities (MM-VetS: 60.0 vs. 54.0), using only 50% of the data.
    • Design Motivation: Pretrained models such as Qwen2-VL are already highly capable. Full-parameter fine-tuning on limited data causes catastrophic forgetting. LoRA freezes the original weights and applies low-rank updates, fundamentally reducing inter-modality interference.
  3. Latent Multi-Modality Extractor (LMME):

    • Function: At the output of each LLM block, progressively removes redundant multimodal tokens based on attention relevance between text query tokens and multimodal tokens.
    • Mechanism: The LLM is divided into \(n\) blocks. At the end of each block, the method computes \(\text{topk}(\text{softmax}(\frac{Q^{[text]} K^{[\neg text]T}}{\sqrt{d}}))\) and retains only the \(\rho L\) most relevant multimodal tokens. The token count decays exponentially across blocks.
    • Effect: LMME(4, 0.7) reduces prefill time by nearly half (0.65s → 0.37s at \(2^{14}\) tokens), training time by 29–54%, and GPU memory by over 50%, with negligible performance loss (average ±0.1%–1.5% across benchmarks).
    • Design Motivation: In long-video or long-speech scenarios, token counts can reach hundreds of thousands, most of which are irrelevant to the current query. Progressive extraction (as opposed to one-shot pruning) allows the model to retain information at different levels of granularity across different depths.
  4. Long Speech Capability:

    • The paper introduces the first long-speech SFT dataset (12K samples), covering YouTube audio ranging from several minutes to 2 hours.
    • Long audio is handled via a strategy analogous to LLaVA-NeXT image slicing: segment → Whisper encode → compress to 300 tokens/segment → flatten.
    • Needle-in-a-Haystack evaluation: the baseline model fails beyond 450 seconds; with long-speech SFT, it supports up to 4,500 seconds (98% accuracy); with LMME, it supports up to 9,900 seconds (2.75 hours).
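Design 1's DTW regularizer can be sketched as a small reference implementation. This is a minimal numpy version assuming an `(L, d)` / `(S, d)` token layout; the function names `lcmr_loss` and `cosine_dist` are illustrative, not the paper's API:

```python
import numpy as np

def cosine_dist(a, b):
    # 1 - cosine similarity between two d-dim token vectors
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def lcmr_loss(speech, text):
    """DTW-based latent cross-modality regularizer (sketch).

    speech: (L, d) speech token latents; text: (S, d) STT text token
    latents, with L >> S in practice. Returns L_LCMR = D[L, S] / (L + S),
    where D is the standard DTW accumulated-cost matrix over cosine
    distances, D[l,s] = dist(l,s) + min(D[l,s-1], D[l-1,s], D[l-1,s-1]).
    """
    L, S = len(speech), len(text)
    D = np.full((L + 1, S + 1), np.inf)
    D[0, 0] = 0.0
    for l in range(1, L + 1):
        for s in range(1, S + 1):
            d = cosine_dist(speech[l - 1], text[s - 1])
            D[l, s] = d + min(D[l, s - 1], D[l - 1, s], D[l - 1, s - 1])
    return D[L, S] / (L + S)
```

Because DTW allows a speech token to repeat against the same text token along the path, a long speech sequence whose latents match the (shorter) text latents incurs near-zero cost, which is exactly the alignment the regularizer rewards.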
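Design 2's modality-combination routing can likewise be sketched. This follows the rule \(H = (B^{[M]}A^{[M]} + W)X^{[M]}\); the class name and the init scheme (zero \(B\), small \(A\), in the spirit of standard LoRA) are assumptions, not the paper's implementation:

```python
import numpy as np

class MultiModalityLoRA:
    """Multi-modality LoRA (sketch).

    One frozen base weight W is shared by all inputs; each modality
    combination M (e.g. "text", "text+image", "text+speech") gets its own
    low-rank pair (A[M], B[M]), so H = (B[M] @ A[M] + W) @ x.
    """
    def __init__(self, d_in, d_out, rank, combos, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))  # frozen pretrained weight
        self.A = {m: rng.normal(size=(rank, d_in)) * 0.01 for m in combos}
        self.B = {m: np.zeros((d_out, rank)) for m in combos}  # B = 0 at init

    def forward(self, x, combo):
        # route to the adapter of this modality combination only
        delta = self.B[combo] @ self.A[combo]
        return (self.W + delta) @ x
```

Because W stays frozen and each combination owns a separate (A, B) pair, training one modality combination cannot perturb the mapping seen by another — which is the mechanism behind the reduced inter-modality interference.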
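Design 3's per-block extraction step can be sketched as follows. Here `keep_ratio` plays the role of \(\rho\); the function name and tensor shapes are illustrative assumptions:

```python
import numpy as np

def lmme_extract(text_q, mm_k, mm_tokens, keep_ratio):
    """One latent multi-modality extractor step (sketch).

    At the end of an LLM block, score each non-text token by the softmax
    attention mass it receives from the text query tokens, then keep only
    the top ceil(keep_ratio * L) multimodal tokens, preserving order.
    text_q: (T, d) text queries; mm_k: (L, d) multimodal keys;
    mm_tokens: (L, d_h) multimodal hidden states to prune.
    """
    d = text_q.shape[1]
    logits = text_q @ mm_k.T / np.sqrt(d)              # (T, L)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)      # softmax over mm tokens
    score = attn.sum(axis=0)                           # relevance per mm token
    k = max(1, int(np.ceil(keep_ratio * len(mm_tokens))))
    keep = np.sort(np.argsort(score)[-k:])             # top-k, original order
    return mm_tokens[keep], keep
```

Applying this at the end of each of the \(n\) blocks multiplies the multimodal token count by \(\rho\) each time, giving the exponential decay across depth described above.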
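The long-audio slicing strategy reduces to a simple segmentation plan. In this sketch the 30-second window is an assumption (Whisper's native chunk length); the 300 tokens/segment figure is from the paper, and the encoder/compressor calls are stand-ins omitted here:

```python
def slice_long_audio(num_samples, sr=16000, seg_seconds=30, tokens_per_seg=300):
    """Long-speech handling sketch (analogous to LLaVA-NeXT image slicing).

    Split the waveform into fixed-length segments, encode each with the
    speech encoder, compress each to a fixed token budget, and flatten.
    Returns (number of segments, total flattened token count).
    """
    seg_len = seg_seconds * sr
    n_segs = (num_samples + seg_len - 1) // seg_len   # ceil division
    return n_segs, n_segs * tokens_per_seg
```

For example, 2 hours of 16 kHz audio yields 240 segments and a 72,000-token flattened sequence — long enough that the LMME pruning above matters for prefill cost.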

Key Experimental Results

Omni-Modal Benchmark Comparison

| Model | Params | Training Data | MME | TextVQA | MMMU | TextVQAS | DocVQAS | MM-VetS | WER↓ |
|---|---|---|---|---|---|---|---|---|---|
| VITA | 66B | 5M | 2097 | - | 41.6 | - | - | - | 8.1 |
| EMOVA | 14B | 4M | 2205 | - | 55.8 | - | - | - | 4.0 |
| Intern-Omni | 8B | 27M | 2210 | - | 60.0 | 69.1 | 79.9 | 56.0 | - |
| Lyra-Base | 9B | 2.7M | 2335 | 82.6 | 63.5 | 80.0 | 85.5 | 61.0 | 2.0 |
| Lyra-Pro | 74B | 2.7M | 2485 | 83.5 | 71.4 | 81.0 | 89.4 | 68.5 | 1.8 |

(Superscript S marks the speech-instruction variant of a benchmark.)

Ablation Study: LCMR Regularizer

| Configuration | TextVQAS | MM-VetS | TextVQAT | MM-VetT | WER↓ |
|---|---|---|---|---|---|
| w/o LCMR | 76.7 | 53.1 | 79.5 | 61.1 | 1.9 |
| w/ LCMR | 77.8 | 58.1 | 80.1 | 62.6 | 2.0 |

(Superscript S: speech instructions; T: text instructions.)

Efficiency: LMME Extractor

| Configuration | Prefill @\(2^{14}\) tokens | TPS @\(2^{14}\) | Memory @\(2^{14}\) | Training on 1.5M data |
|---|---|---|---|---|
| Baseline | 0.65s | 27.3 tok/s | 30G | 66h |
| LMME(4, 0.7) | 0.37s (−43%) | 32.5 (+19%) | 19G (−37%) | 47h (−29%) |

Key Findings

  • Speech changes how omni-modal evaluation should be conducted: Speech-text WER metrics fail to reflect the true capability of omni-models — models with similar WER can differ by up to 9% on vision-speech tasks. Speech-centric multimodal evaluation is necessary.
  • Superior performance with minimal data: Lyra outperforms Intern-Omni (27M) and VITA (5M) using only 2.7M data samples, achieving a 10× improvement in data efficiency. This is primarily attributed to MLoRA preserving pretrained capabilities and LCMR improving speech alignment quality.
  • Long speech is an underexplored frontier for MLLMs: Existing models support at most ~1 minute of speech; Lyra is the first to support 2 hours. Long speech alone can resolve one-third of VideoMME questions (78.6% accuracy using audio only, surpassing GPT-4o with captions on the long-video subset).
  • LMME token extraction is interpretable: The retained token regions are highly correlated with the user query (different queries retain different visual/audio regions), with only 10–25% of tokens ultimately preserved.

Highlights & Insights

  • DTW-based speech-text alignment is an elegant cross-modal regularization technique: It leverages the natural correspondence between speech and text, and DTW handles variable-length alignment at low computational cost with a well-established algorithm. This paradigm can be generalized to any cross-modal alignment scenario with a natural correspondence.
  • Modality-combination routing in MLoRA is simple yet effective: Assigning different LoRA adapters to different modality combinations achieves better results than full-parameter SFT with only half the data — a highly practical paradigm for efficient multimodal training.
  • First systematic validation of the value of "long speech + vision": The paper demonstrates that long speech information can substantially complement visual understanding (VideoMME +6.5%), and that audio alone can address a large portion of video understanding tasks. This opens a new direction of "audio as a visual supplement."

Limitations & Future Work

  • Speech generation quality depends on CTC and a vocoder, precluding the generation of emotionally expressive speech.
  • The sound modality relies on ImageBind's single-token encoding, which limits generalization.
  • Long-speech inference still requires substantial GPU memory even with LMME.
  • No direct comparison with closed-source models such as GPT-4o on speech interaction tasks is provided.

Comparison with Related Work

  • vs. VITA: VITA requires 66B parameters and 5M data and supports only 1 minute of speech. Lyra-Base, with 9B parameters and 2.7M data, supports 2 hours and achieves stronger performance across multiple benchmarks. The key differentiators are LCMR and MLoRA.
  • vs. EMOVA: EMOVA focuses on expressive speech but does not support long speech or vision-speech interaction. Lyra offers broader coverage.
  • vs. Intern-Omni: Intern-Omni requires 27M training samples, whereas Lyra requires only 2.7M. Lyra outperforms Intern-Omni by approximately 9% on vision-speech tasks (DocVQAS: 85.5 vs. 79.9).

Rating

  • Novelty: ⭐⭐⭐⭐ — DTW-based cross-modal regularization combined with multi-modality LoRA is novel; the long-speech SFT dataset fills an important gap.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across vision-language, vision-speech, and speech-language tasks; three model scales; long-speech Needle-in-a-Haystack tests; extensive ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure; the speech-centric evaluation perspective offers genuine insight.
  • Value: ⭐⭐⭐⭐⭐ — Outperforms a competitor trained on 27M samples using only 2.7M data; first to support 2-hour long speech; significant implications for the omni-modal research direction.