Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition¶
Conference: ICCV 2025 arXiv: 2412.09501 Code: https://github.com/dvlab-research/Lyra Area: Multimodal VLM / Omni-Modal Understanding / Speech Modality Keywords: omni-cognition, speech-centric, multi-modality LoRA, latent cross-modality regularizer, long speech, token extraction
TL;DR¶
This paper proposes Lyra, a speech-centric omni-modal MLLM framework consisting of three core components — a DTW-based cross-modality regularizer, multi-modality LoRA, and a latent multi-modality extractor — along with the first 12K long-speech SFT dataset. Using only 2.7M training samples and modest compute, Lyra achieves state-of-the-art performance simultaneously on vision-language, vision-speech, and speech-language benchmarks, while supporting speech inputs of up to 2 hours in length.
Background & Motivation¶
Background: Existing omni-models (e.g., VITA, EMOVA, Intern-Omni) have begun exploring multimodal fusion, but the speech modality remains severely underutilized — their speech evaluation is largely confined to speech-text ASR metrics (e.g., LibriSpeech WER), neglecting cross-modal interactions between speech and other modalities such as vision.
Limitations of Prior Work: (1) Speech tokens and text tokens share semantic content but differ greatly in length (the Whisper token sequence is far longer than the corresponding text), making direct training suboptimal; (2) long-speech support is extremely limited (VITA supports only ~1 minute); (3) training large omni-models requires massive data (VITA: 5M, Intern-Omni: 27M), resulting in low efficiency.
Key Challenge: Incorporating the speech modality into an LLM through joint training degrades existing visual and language capabilities; directly replacing text instructions with speech tokens leads to a significant performance drop (MM-Vet drops 8% under speech instructions vs. text instructions).
Goal: To efficiently build an MLLM with genuine support for image/video/speech/sound omni-modal understanding and interaction, with a focus on deep integration of speech with other modalities.
Key Insight: (1) Use DTW to align speech tokens with transcription text tokens to reduce the modality gap; (2) Use modality-specific LoRA to preserve existing capabilities; (3) Use a query-aware token extractor to reduce long-context overhead.
Core Idea: By adopting a speech-centric design with DTW-based cross-modality regularization, multi-modality LoRA, and progressive multimodal token extraction, Lyra constructs an omni-modal MLLM efficiently with minimal training data.
Method¶
Overall Architecture¶
Built upon Qwen2-VL, Lyra incorporates a Whisper-v3 speech encoder and an ImageBind audio encoder. Multimodal input tokens are passed through modality-specific projectors into the LLM (equipped with multi-modality LoRA and the latent extractor), producing both text and streaming speech outputs. Training proceeds in four stages: (1) speech alignment pretraining → (2) three-modality joint training → (3) long-speech SFT → (4) speech generation.
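To make the data flow concrete, here is a minimal sketch of how encoded multimodal features reach the LLM, assuming placeholder projector modules; the feature dimensions and function names below are illustrative assumptions, not Lyra's released code.

```python
# Illustrative data flow: modality encoders -> per-modality projectors -> LLM input.
# Dimensions are hypothetical (Whisper-v3 features ~1280-d; 3584-d LLM hidden size assumed).
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Small MLP that maps encoder features into the LLM embedding space."""
    def __init__(self, in_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (batch, seq, in_dim)
        return self.net(feats)

SPEECH_DIM, VISION_DIM, LLM_DIM = 1280, 1024, 3584           # assumed sizes for illustration
speech_proj = ModalityProjector(SPEECH_DIM, LLM_DIM)
vision_proj = ModalityProjector(VISION_DIM, LLM_DIM)

def build_llm_inputs(text_emb, speech_feat=None, image_feat=None):
    """Concatenate projected multimodal tokens ahead of the text embeddings."""
    parts = []
    if image_feat is not None:
        parts.append(vision_proj(image_feat))
    if speech_feat is not None:
        parts.append(speech_proj(speech_feat))
    parts.append(text_emb)
    return torch.cat(parts, dim=1)   # (batch, L_image + L_speech + L_text, LLM_DIM)
```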
Key Designs¶
- Latent Cross-Modality Regularizer (LCMR):
- Function: Aligns the latent representations of speech tokens with their corresponding transcription text tokens during training.
- Mechanism: Speech tokens \(X^{[speech]} \in \mathbb{R}^{d \times L}\) and STT text tokens \(X^{[STT]} \in \mathbb{R}^{d \times S}\) differ in length (\(L \gg S\)). Dynamic Time Warping (DTW) is applied to find the optimal alignment path, minimizing the aligned cosine distance: \(\mathcal{L}_{LCMR} = \frac{1}{L+S} D_{L,S}\), where the DTW distance matrix is defined as \(D_{l,s} = \text{dist}(l,s) + \min\{D_{l,s-1}, D_{l-1,s}, D_{l-1,s-1}\}\). A minimal code sketch of this loss is given after this list.
- Effect: Incorporating LCMR improves MM-Vet under speech instructions from 53.1 to 58.1 (+5.0), while also improving the text-instruction score from 61.1 to 62.6, narrowing the cross-modal performance gap from 8% to 4.5%.
- Design Motivation: Speech and text are different surface representations of the same semantic content but differ in sequence length. DTW handles variable-length alignment at low cost with a well-established algorithm, ensuring speech tokens carry text-equivalent semantic information before entering the LLM.
- Multi-Modality LoRA (MLoRA):
- Function: Trains separate LoRA adapters for different modality combinations, preventing new-modality training from degrading existing capabilities.
- Mechanism: \(H = (B^{[M]}A^{[M]} + W)X^{[M]}\), where \(M\) denotes the modality combination (text, image, speech, etc.) and each combination is assigned an independent low-rank adapter (see the sketch after this list).
- Effect: Compared to full-parameter SFT, MLoRA better preserves visual capabilities (TextVQA: 82.6 vs. 81.3) while more effectively developing speech capabilities (MM-Vet\(^S\): 60.0 vs. 54.0), and it does so with only 50% of the data.
- Design Motivation: Pretrained models such as Qwen2-VL are already highly capable. Full-parameter fine-tuning on limited data causes catastrophic forgetting. LoRA freezes the original weights and applies low-rank updates, fundamentally reducing inter-modality interference.
- Latent Multi-Modality Extractor (LMME):
- Function: At the output of each LLM block, progressively removes redundant multimodal tokens based on attention relevance between text query tokens and multimodal tokens.
- Mechanism: The LLM is divided into \(n\) blocks. At the end of each block, the method computes \(\text{topk}(\text{softmax}(\frac{Q^{[text]} K^{[\neg text]T}}{\sqrt{d}}))\) and retains only the \(\rho L\) most relevant multimodal tokens, so the token count decays exponentially across blocks (a minimal sketch is given after this list).
- Effect: LMME(4, 0.7) reduces prefill time by nearly half (0.65s → 0.37s at \(2^{14}\) tokens), training time by 29–54%, and GPU memory by over 50%, with negligible performance loss (average ±0.1%–1.5% across benchmarks).
- Design Motivation: In long-video or long-speech scenarios, token counts can reach hundreds of thousands, most of which are irrelevant to the current query. Progressive extraction (as opposed to one-shot pruning) allows the model to retain information at different levels of granularity across different depths.
- Long Speech Capability:
- The paper introduces the first long-speech SFT dataset (12K samples), covering YouTube audio ranging from several minutes up to 2 hours.
- Long audio is handled via a strategy analogous to LLaVA-NeXT image slicing: segment → Whisper encode → compress to 300 tokens/segment → flatten (see the slicing sketch after this list).
- Needle-in-a-Haystack evaluation: the baseline model fails beyond 450 seconds; with long-speech SFT, it supports up to 4,500 seconds (98% accuracy); with LMME, it supports up to 9,900 seconds (2.75 hours).
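For concreteness, here is a minimal PyTorch sketch of the DTW-based LCMR loss described above; `lcmr_loss`, the tensor shapes, and the differentiability handling are assumptions for illustration rather than the authors' implementation.

```python
# Illustrative sketch of the LCMR loss: DTW over cosine distances between
# latent speech tokens and their STT (transcription) text tokens.
# D[l][s] = dist(l, s) + min(D[l][s-1], D[l-1][s], D[l-1][s-1]); loss = D[L][S] / (L + S).
import torch
import torch.nn.functional as F

def lcmr_loss(speech_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
    """speech_tokens: (L, d) speech latents; text_tokens: (S, d) STT text latents."""
    L, S = speech_tokens.size(0), text_tokens.size(0)
    # Pairwise cosine distance matrix, shape (L, S).
    dist = 1.0 - F.cosine_similarity(speech_tokens.unsqueeze(1),
                                     text_tokens.unsqueeze(0), dim=-1)
    # DTW table kept as scalar tensors so the loss stays differentiable.
    inf = dist.new_tensor(float("inf"))
    D = [[inf] * (S + 1) for _ in range(L + 1)]
    D[0][0] = dist.new_tensor(0.0)
    for l in range(1, L + 1):
        for s in range(1, S + 1):
            D[l][s] = dist[l - 1, s - 1] + torch.min(
                torch.stack([D[l][s - 1], D[l - 1][s], D[l - 1][s - 1]]))
    # Normalize by the total path length L + S, as in the paper's formulation.
    return D[L][S] / (L + S)
```

The quadratic Python loop is for readability only; a practical version would vectorize the recurrence or compute it on pooled features.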
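Next, a minimal sketch of a multi-modality LoRA linear layer in the spirit of \(H = (B^{[M]}A^{[M]} + W)X^{[M]}\): the frozen base weight is shared, and each modality combination routes through its own low-rank adapter. The class name, rank, and combination keys are assumptions for illustration.

```python
# Illustrative MLoRA layer: frozen base weight W plus one (B, A) adapter pair
# per modality combination M, selected per sample.
import torch
import torch.nn as nn

class MLoRALinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, rank: int = 16,
                 combos=("text", "image+text", "speech+text", "image+speech+text")):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad_(False)        # pretrained weight W stays frozen
        self.A = nn.ParameterDict({c: nn.Parameter(0.01 * torch.randn(rank, in_dim))
                                   for c in combos})  # A^[M]
        self.B = nn.ParameterDict({c: nn.Parameter(torch.zeros(out_dim, rank))
                                   for c in combos})  # B^[M], zero-init so training starts at W

    def forward(self, x: torch.Tensor, combo: str) -> torch.Tensor:
        # x: (batch, seq, in_dim); only the adapter matching this sample's
        # modality combination contributes: H = W x + B^[M] A^[M] x.
        return self.base(x) + x @ self.A[combo].T @ self.B[combo].T
```

Because only the per-combination adapters receive gradients, adding the speech pathway cannot overwrite the weights that carry the pretrained visual and language skills.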
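A minimal sketch of one LMME pruning step follows: non-text tokens are scored by the attention they receive from the text query tokens, and only the top \(\rho L\) are kept at each block boundary. The function name, the averaging over text queries, and the shapes are assumptions.

```python
# Illustrative LMME step: rank multimodal tokens by text-query attention, keep top rho*L.
import torch

def lmme_prune(text_q: torch.Tensor, mm_k: torch.Tensor,
               mm_hidden: torch.Tensor, rho: float = 0.7):
    """text_q: (T, d) text query states; mm_k: (L, d) multimodal key states;
    mm_hidden: (L, d) multimodal hidden states to prune at this block boundary."""
    d = text_q.size(-1)
    attn = torch.softmax(text_q @ mm_k.T / d ** 0.5, dim=-1)   # (T, L)
    score = attn.mean(dim=0)                                    # per-token relevance to the query
    keep = max(1, int(rho * mm_hidden.size(0)))
    idx = score.topk(keep).indices.sort().values                # preserve original token order
    return mm_hidden[idx], idx

# Applied at the end of each of n = 4 blocks with rho = 0.7, roughly
# 0.7 ** 4 ≈ 24% of the multimodal tokens reach the final block,
# consistent with the 10-25% retention reported in the Key Findings.
```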
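Finally, a minimal sketch of the long-speech slicing pipeline (segment, Whisper-encode, compress to ~300 tokens per segment, flatten); the segment length, the pooling-based compression, and the `whisper_encode` callable are assumptions for illustration.

```python
# Illustrative long-speech handling: split the waveform into fixed-length segments,
# encode each with the Whisper encoder, pool each segment's features to ~300 tokens,
# then flatten the segments in temporal order.
import torch
import torch.nn.functional as F

SEGMENT_SEC, SAMPLE_RATE, TOKENS_PER_SEGMENT = 30, 16_000, 300

def slice_long_audio(waveform: torch.Tensor):
    """waveform: (num_samples,) mono audio at 16 kHz -> list of 30 s segments."""
    step = SEGMENT_SEC * SAMPLE_RATE
    return [waveform[i:i + step] for i in range(0, waveform.numel(), step)]

def compress_segment(features: torch.Tensor, num_tokens: int = TOKENS_PER_SEGMENT):
    """features: (frames, d) Whisper encoder output -> (num_tokens, d) via 1D pooling."""
    pooled = F.adaptive_avg_pool1d(features.T.unsqueeze(0), num_tokens)  # (1, d, num_tokens)
    return pooled.squeeze(0).T

def encode_long_speech(waveform: torch.Tensor, whisper_encode):
    """whisper_encode: callable mapping a 30 s segment to (frames, d) features."""
    segments = slice_long_audio(waveform)
    return torch.cat([compress_segment(whisper_encode(s)) for s in segments], dim=0)
```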
Key Experimental Results¶
Omni-Modal Benchmark Comparison¶
| Model | Params | Training Data | MME | TextVQA | MMMU | TextVQA\(^S\) | DocVQA\(^S\) | MM-Vet\(^S\) | WER↓ |
|---|---|---|---|---|---|---|---|---|---|
| VITA | 66B | 5M | 2097 | - | 41.6 | - | - | - | 8.1 |
| EMOVA | 14B | 4M | 2205 | - | 55.8 | - | - | - | 4.0 |
| Intern-Omni | 8B | 27M | 2210 | - | 60.0 | 69.1 | 79.9 | 56.0 | - |
| Lyra-Base | 9B | 2.7M | 2335 | 82.6 | 63.5 | 80.0 | 85.5 | 61.0 | 2.0 |
| Lyra-Pro | 74B | 2.7M | 2485 | 83.5 | 71.4 | 81.0 | 89.4 | 68.5 | 1.8 |

Superscript \(S\) denotes the speech-instruction version of a benchmark (the text instruction replaced by its spoken form); \(T\) (used below) denotes the text-instruction version.
Ablation Study: LCMR Regularizer¶
| Configuration | TextVQA\(^S\) | MM-Vet\(^S\) | TextVQA\(^T\) | MM-Vet\(^T\) | WER↓ |
|---|---|---|---|---|---|
| w/o LCMR | 76.7 | 53.1 | 79.5 | 61.1 | 1.9 |
| w/ LCMR | 77.8 | 58.1 | 80.1 | 62.6 | 2.0 |
Efficiency: LMME Extractor¶
| Configuration | Prefill time @ \(2^{14}\) tokens | Throughput @ \(2^{14}\) (tok/s) | GPU memory @ \(2^{14}\) | Training time (1.5M samples) |
|---|---|---|---|---|
| Baseline | 0.65 s | 27.3 | 30 GB | 66 h |
| LMME(4, 0.7) | 0.37 s (−43%) | 32.5 (+19%) | 19 GB (−37%) | 47 h (−29%) |
Key Findings¶
- Speech changes how omni-modal evaluation should be conducted: Speech-text WER metrics fail to reflect the true capability of omni-models — models with similar WER can differ by up to 9% on vision-speech tasks. Speech-centric multimodal evaluation is necessary.
- Superior performance with minimal data: Lyra outperforms Intern-Omni (27M) and VITA (5M) using only 2.7M data samples, achieving a 10× improvement in data efficiency. This is primarily attributed to MLoRA preserving pretrained capabilities and LCMR improving speech alignment quality.
- Long speech is an underexplored frontier for MLLMs: Existing models support at most ~1 minute of speech; Lyra is the first to support 2 hours. Long speech alone can resolve one-third of VideoMME questions (78.6% accuracy using audio only, surpassing GPT-4o with captions on the long-video subset).
- LMME token extraction is interpretable: The retained token regions are highly correlated with the user query (different queries retain different visual/audio regions), with only 10–25% of tokens ultimately preserved.
Highlights & Insights¶
- DTW-based speech-text alignment is an elegant cross-modal regularization technique: It leverages the natural correspondence between speech and text, and DTW handles variable-length alignment at low computational cost with a well-established algorithm. This paradigm can be generalized to any cross-modal alignment scenario with a natural correspondence.
- Modality-combination routing in MLoRA is simple yet effective: Assigning different LoRA adapters to different modality combinations achieves better results than full-parameter SFT with only half the data — a highly practical paradigm for efficient multimodal training.
- First systematic validation of the value of "long speech + vision": The paper demonstrates that long speech information can substantially complement visual understanding (VideoMME +6.5%), and that audio alone can address a large portion of video understanding tasks. This opens a new direction of "audio as a visual supplement."
Limitations & Future Work¶
- Speech generation quality depends on CTC and a vocoder, precluding the generation of emotionally expressive speech.
- The sound modality relies on ImageBind's single-token encoding, which limits generalization.
- Long-speech inference still requires substantial GPU memory even with LMME.
- No direct comparison with closed-source models such as GPT-4o on speech interaction tasks is provided.
Related Work & Insights¶
- vs. VITA: VITA requires 66B parameters and 5M data and supports only 1 minute of speech. Lyra-Base, with 9B parameters and 2.7M data, supports 2 hours and achieves stronger performance across multiple benchmarks. The key differentiators are LCMR and MLoRA.
- vs. EMOVA: EMOVA focuses on expressive speech but does not support long speech or vision-speech interaction. Lyra offers broader coverage.
- vs. Intern-Omni: Intern-Omni requires 27M training samples, whereas Lyra requires only 2.7M. Lyra outperforms Intern-Omni by approximately 9% on vision-speech tasks (DocVQA\(^S\): 85.5 vs. 79.9).
Rating¶
- Novelty: ⭐⭐⭐⭐ — DTW-based cross-modal regularization combined with multi-modality LoRA is novel; the long-speech SFT dataset fills an important gap.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across vision-language, vision-speech, and speech-language tasks; three model scales; long-speech Needle-in-a-Haystack tests; extensive ablation studies.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure; the speech-centric evaluation perspective offers genuine insight.
- Value: ⭐⭐⭐⭐⭐ — Outperforms a competitor trained on 27M samples using only 2.7M data; first to support 2-hour long speech; significant implications for the omni-modal research direction.