
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Conference: NeurIPS 2025 · arXiv: 2501.01957 · Code: https://github.com/VITA-MLLM/VITA
Area: Multimodal LLM / Speech Interaction
Keywords: multimodal LLM, vision-speech interaction, end-to-end speech, three-stage training, omni model

TL;DR

VITA-1.5 proposes a carefully designed three-stage progressive training strategy that incrementally integrates visual and speech capabilities into an LLM. The resulting model supports end-to-end, real-time vision-speech interaction without standalone ASR/TTS modules, while attaining state-of-the-art performance among open-source models on image, video, and speech benchmarks.

Background & Motivation

Multimodal large language models (MLLMs) have achieved significant progress in vision-language integration (e.g., LLaVA, InternVL, Qwen-VL), yet the integration of the speech modality remains comparatively underdeveloped. In practical human-computer interaction systems, speech serves as a critical medium for information transmission and substantially enhances the naturalness and convenience of interaction.

Key Challenge: The modality gap between vision and speech is fundamentally distinct—visual data encodes spatial information whereas speech data encodes temporal information. Simultaneously optimizing both modalities frequently induces training conflicts: incorporating speech data may degrade visual task performance, and vice versa.

Conventional speech-to-speech systems rely on cascaded ASR + LLM + TTS pipelines, suffering from high latency, loss of paralinguistic information such as prosody and emotion, and poor real-time performance. GPT-4o has demonstrated the feasibility of end-to-end multimodal interaction, yet the open-source community still exhibits a considerable gap in models that simultaneously possess strong visual and speech capabilities. VITA-1.0 made an initial attempt, but introducing speech data caused interference with visual performance, and speech generation still depended on an external TTS system.

Key Insight: Through a carefully designed three-stage training strategy, the paper progressively introduces different modalities, enabling the model to acquire new modality capabilities while preserving the performance of existing ones. The final system achieves end-to-end speech output, eliminating dependence on external TTS modules.

Method

Overall Architecture

The VITA-1.5 model architecture comprises:

Input Side:

  • Visual encoder: InternViT-300M (448×448 input, 256 visual tokens per image), with dynamic tiling for high-resolution images
  • Audio encoder: 350M parameters (4× downsampling convolution + 24-layer Transformer, hidden dimension 1024), output frame rate 12.5 Hz
  • Visual adapter: two-layer MLP
  • Speech adapter: multi-layer convolution with 2× downsampling
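A minimal PyTorch sketch of the two input-side adapters may help fix the shapes. The module names and the exact layer layout are illustrative assumptions; 1024 matches the stated audio-encoder width (and InternViT-300M's feature dimension), and 3584 is Qwen2-7B's hidden size.

```python
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Two-layer MLP projecting InternViT features into the LLM embedding space."""
    def __init__(self, vit_dim=1024, llm_dim=3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vit_tokens):        # (B, 256, vit_dim) per image
        return self.proj(vit_tokens)      # (B, 256, llm_dim)

class SpeechAdapter(nn.Module):
    """Conv layer that 2x-downsamples audio-encoder frames before the LLM."""
    def __init__(self, audio_dim=1024, llm_dim=3584):
        super().__init__()
        self.down = nn.Conv1d(audio_dim, llm_dim, kernel_size=3, stride=2, padding=1)

    def forward(self, audio_frames):      # (B, T, audio_dim) encoder frames
        x = audio_frames.transpose(1, 2)  # (B, audio_dim, T)
        x = self.down(x)                  # (B, llm_dim, T/2) after 2x downsampling
        return x.transpose(1, 2)          # (B, T/2, llm_dim)
```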

Output Side:

  • Text output: generated directly by the LLM (Qwen2-7B)
  • Speech output: end-to-end speech generation module
  • TiCodec codec: single codebook (size 1024), encoding speech into discrete tokens at 40 Hz and decoding back to 24 kHz waveforms
  • NAR (non-autoregressive) speech decoder: 4-layer LLaMA decoder that processes global semantic information from text tokens
  • AR (autoregressive) speech decoder: 4-layer LLaMA decoder that autoregressively generates high-quality speech tokens
  • Each decoder contains approximately 120M parameters (hidden dimension 896)
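The output path reduces to a three-step pipeline: NAR pass for global semantics, AR pass for stepwise speech tokens, codec decode for the waveform. A hedged sketch follows; the callable interfaces, the `condition=` keyword, and the `SPEECH_EOS` convention are hypothetical placeholders, not the released API.

```python
import torch

CODEBOOK_SIZE = 1024          # single TiCodec codebook
SPEECH_EOS = CODEBOOK_SIZE    # hypothetical end-of-speech id (not in the paper)

@torch.no_grad()
def generate_speech(text_embeds, nar_decoder, ar_decoder, codec,
                    max_seconds=20, token_rate=40):
    """Illustrative end-to-end speech path over black-box decoder modules."""
    # 1) NAR pass: one parallel step over the LLM's text-token embeddings
    #    produces global semantic features for the whole utterance.
    semantic = nar_decoder(text_embeds)                  # (B, T_text, 896)

    # 2) AR pass: emit discrete speech tokens step by step, conditioned
    #    on the semantic features (greedy decoding for simplicity).
    tokens = torch.zeros(semantic.size(0), 0, dtype=torch.long)
    for _ in range(max_seconds * token_rate):            # 40 tokens ≈ 1 s
        logits = ar_decoder(tokens, condition=semantic)  # (B, T+1, 1025)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
        if (next_tok == SPEECH_EOS).all():
            break

    # 3) Codec decode: 40 Hz discrete tokens -> 24 kHz waveform.
    return codec.decode(tokens)                          # (B, num_samples)
```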

Key Designs

  1. Video Processing Strategy (see the frame-sampling sketch after this list):
     • Videos <4 s: uniformly sample 4 frames
     • Videos 4–16 s: sample 1 frame per second
     • Videos >16 s: uniformly sample 16 frames
     • Dynamic tiling is not applied to video frames, to avoid excessive token counts

  2. End-to-End Speech Generation:
     • Conventional approach: LLM → text → TTS → speech (cascaded, high latency)
     • VITA-1.5: LLM text tokens → NAR decoder (global semantic features) → AR decoder (stepwise speech token generation) → codec decoder (speech waveform)
     • The LLM remains frozen during speech-decoder training, leaving multimodal comprehension performance unaffected

  3. Input Classification Head: a classification head added to the LLM output in Stage 2.2 distinguishes whether the input originates from speech or text, enabling the model to handle either input modality flexibly
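To make the frame-sampling rule concrete, here is a small self-contained sketch. The paper only gives the three duration ranges; the exact behavior at the 4 s and 16 s boundaries is our assumption.

```python
import numpy as np

def sample_frame_indices(duration_s: float, fps: float) -> np.ndarray:
    """Pick frame indices per the rule above."""
    total = max(int(duration_s * fps), 1)
    if duration_s < 4:          # short clip: 4 uniformly spaced frames
        n = 4
    elif duration_s <= 16:      # medium clip: 1 frame per second
        n = int(duration_s)
    else:                       # long clip: cap at 16 uniform frames
        n = 16
    return np.linspace(0, total - 1, n).round().astype(int)

# e.g. a 30 s video at 25 fps -> 16 indices spread over 750 frames
print(sample_frame_indices(30, 25))
```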

Three-Stage Training Strategy

Mechanism: Progressively introduce different modalities to avoid modality conflicts.
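One way to read the strategy is as a schedule over which modules are trainable in each sub-stage detailed below. A minimal sketch, using placeholder module names rather than the repository's identifiers:

```python
# Trainability schedule distilled from the stage descriptions below.
TRAINABLE = {
    "stage_1.1": ["visual_adapter"],                           # visual alignment
    "stage_1.2": ["visual_encoder", "visual_adapter", "llm"],  # visual understanding
    "stage_1.3": ["visual_encoder", "visual_adapter", "llm"],  # visual SFT
    "stage_2.1a": ["audio_encoder"],                           # CTC on ASR
    "stage_2.1b": ["speech_adapter", "llm"],                   # audio alignment
    "stage_2.2": "all",                                        # audio SFT
    "stage_3.2": ["nar_decoder", "ar_decoder"],                # LLM frozen
}

def configure_stage(model, stage: str) -> None:
    """Freeze everything, then unfreeze only the modules owned by `stage`."""
    spec = TRAINABLE[stage]
    for p in model.parameters():
        p.requires_grad = (spec == "all")
    if spec != "all":
        for name in spec:
            for p in getattr(model, name).parameters():
                p.requires_grad = True
```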

Stage 1: Vision-Language Training

  • Stage 1.1 Visual Alignment: train only the visual adapter with 20% caption data; all other modules are frozen
  • Stage 1.2 Visual Understanding: train the visual encoder, adapter, and LLM with 100% caption data to learn image description
  • Stage 1.3 Visual SFT: train the visual modules and LLM with 100% QA data + 20% caption data to acquire instruction-following and visual QA capabilities

Stage 2: Audio Input Tuning

  • Stage 2.1 Audio Alignment:
  • (a) Train the speech encoder with CTC loss on ASR tasks
  • (b) Train the speech adapter and LLM on 110,000 hours of speech-transcript pairs so the LLM learns to understand audio input
  • Stage 2.2 Audio SFT: 4% caption + 20% QA data, with approximately half of the text questions replaced by TTS-generated speech versions; all modules are trainable
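In Stage 2.1(a), the CTC objective aligns the long encoder frame sequence with a much shorter transcript without needing frame-level labels. A toy PyTorch example, with made-up shapes and vocabulary size (blank id 0 by PyTorch convention):

```python
import torch
import torch.nn.functional as F

# 120 encoder frames vs. a 20-token transcript, batch of 2.
B, T_in, T_out, vocab = 2, 120, 20, 5000
log_probs = F.log_softmax(torch.randn(T_in, B, vocab), dim=-1)  # (T, B, V)
targets = torch.randint(1, vocab, (B, T_out))                   # transcript ids
input_lens = torch.full((B,), T_in)
target_lens = torch.full((B,), T_out)

# CTC marginalizes over all monotonic alignments of the frame sequence
# to the shorter transcript, so no frame-level labels are required.
loss = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=0)
print(loss.item())
```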

Stage 3: Audio Output Tuning

  • Stage 3.1 Codec Training: train a single-codebook codec model on 3,000 hours of speech data
  • Stage 3.2 NAR + AR Decoder Training: text-speech paired data with the LLM frozen; only the speech decoders are trained
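The single-codebook design keeps the discrete budget small for the AR decoder; the arithmetic below follows directly from the figures above (40 Hz tokens, codebook size 1024, 24 kHz output).

```python
import math

token_rate = 40        # discrete speech tokens per second
codebook_size = 1024   # single TiCodec codebook
sample_rate = 24_000   # decoded waveform sample rate

bits_per_token = math.log2(codebook_size)      # 10 bits
bitrate = token_rate * bits_per_token          # 400 bps of discrete code
samples_per_token = sample_rate / token_rate   # 600 samples per token

print(f"{bitrate:.0f} bps, {samples_per_token:.0f} samples/token")
# -> one second of 24 kHz speech costs the AR decoder only 40 steps
```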

Training Data

  • Multimodal instruction tuning: 22,133.16K (≈22.1M) QA pairs in total (Chinese and English), covering:
  • General image captioning/QA: ShareGPT4V, LLaVA series, LVIS-Instruct, etc.
  • OCR & charts: Anyword-3M, UReader, SynDOG, etc.
  • Video: ShareGemini + synthetic data
  • Text-only QA: synthetic data
  • ASR data: 110,000 hours of internal speech-transcript pairs (Chinese and English)
  • Speech generation data: 3,000 hours of TTS-generated text-speech pairs

Key Experimental Results

Image Understanding Benchmarks

| Method             | LLM            | MMBench | MMStar | MMMU | MathVista | OCRBench | Avg  |
|--------------------|----------------|---------|--------|------|-----------|----------|------|
| GPT-4o             | –              | 82.8    | 61.6   | 62.8 | 56.5      | 663      | 69.3 |
| InternVL2          | InternLM2.5-7B | 79.4    | 61.5   | 51.2 | 58.3      | 794      | 67.3 |
| MiniCPM-V 2.6      | Qwen2-7B       | 78.0    | 57.5   | 49.8 | 60.6      | 852      | 68.5 |
| VITA-1.0           | Mixtral-8x7B   | 71.8    | 46.4   | 47.3 | 44.9      | 678      | 57.8 |
| VITA-1.5 (Stage 1) | Qwen2-7B       | 77.1    | 59.1   | 53.1 | 66.2      | 752      | 67.1 |
| VITA-1.5 (Stage 3) | Qwen2-7B       | 76.7    | 59.9   | 52.1 | 66.2      | 732      | 66.8 |

(OCRBench is on a 0–1000 scale; Avg is the paper's average over its full image-understanding suite, a superset of the columns shown.)

Automatic Speech Recognition (ASR)

| Model         | aishell-1 (CER↓) | test_net (CER↓) | dev_clean (WER↓) | test_clean (WER↓) |
|---------------|------------------|-----------------|------------------|-------------------|
| Wav2vec2-base | –                | –               | 6.0              | –                 |
| Mini-Omni2    | –                | –               | 4.8              | 4.7               |
| Freeze-Omni   | 2.8              | 12.6            | 4.2              | 4.1               |
| VITA-1.0      | –                | 12.2            | 7.6              | 8.1               |
| VITA-1.5      | 2.2              | 8.4             | 3.3              | 3.4               |

(aishell-1 and WenetSpeech test_net are Mandarin benchmarks scored by CER; LibriSpeech dev_clean/test_clean are English, scored by WER; lower is better.)

Ablation Study: Effect of the Three-Stage Training Strategy

| Stage                 | MMBench | MMStar | MMMU | MathVista | Video-MME | Note                                      |
|-----------------------|---------|--------|------|-----------|-----------|-------------------------------------------|
| Stage 1 (vision only) | 77.1    | 59.1   | 53.1 | 66.2      | 56.8      | Vision baseline                           |
| Stage 3 (full)        | 76.7    | 59.9   | 52.1 | 66.2      | 56.1      | Near-lossless vision after adding speech  |

Key Findings

  • Modality conflict is largely eliminated: compared with Stage 1, Stage 3 drops only 0.3 points in average image understanding (67.1→66.8) and 0.7 points in video understanding (56.8→56.1 on Video-MME), demonstrating that the three-stage training strategy effectively mitigates modality conflict.
  • VITA-1.5 comprehensively outperforms dedicated speech models (Freeze-Omni, Mini-Omni2) in ASR, achieving a Chinese CER as low as 2.2%.
  • Compared to VITA-1.0, average image understanding improves by 9 points (57.8→66.8) and ASR error rates drop substantially (English test_clean WER from 8.1 to 3.4).
  • Image understanding performance approaches GPT-4o (66.8 vs. 69.3) and surpasses GPT-4V (58.5) and GPT-4o-mini (66.3).
  • End-to-end speech output eliminates the additional latency of a TTS module, significantly improving the real-time interaction experience.

Highlights & Insights

  • Three-stage progressive training: This is the paper's most central contribution. By carefully arranging the order in which data and trainable modules are introduced, it effectively resolves the vision-speech modality conflict. The strategy carries strong methodological value.
  • End-to-end speech generation: The NAR + AR dual-decoder design achieves speech output while keeping the LLM frozen, leaving existing multimodal comprehension capabilities intact.
  • Fine-grained data engineering: The data mixing ratios at each stage (e.g., ~50% speech question replacement rate in Stage 2.2, retaining 20% caption data in Stage 1.3 for diversity) are carefully tuned.
  • Productization potential: VITA-1.5 demonstrates the feasibility of open-source omni models simultaneously achieving SOTA-level performance in both the visual and speech dimensions.

Limitations & Future Work

  • Video understanding still lags approximately 16 points behind GPT-4o (56.1 vs. 71.9), indicating substantial room for improvement in the video modality.
  • The quality and naturalness of speech output are not evaluated in detail (e.g., MOS scores), and the actual user experience remains to be verified.
  • The speech-side training data relies on internal ASR data (110K hours) and TTS-generated data, which may hinder reproducibility.
  • Detailed evaluation of duplex dialogue capability is absent.
  • Single-codebook codec speech quality may be inferior to multi-codebook approaches (e.g., EnCodec), though it simplifies decoding.

Comparison with Related Work

  • vs. VITA-1.0: The key improvements in version 1.5 are end-to-end speech output (replacing the external TTS) and a more refined three-stage training strategy.
  • vs. Mini-Omni2 / LLaMA-Omni / Moshi: These models achieve duplex speech interaction but lack visual understanding capabilities.
  • vs. GPT-4o: VITA-1.5 is currently the open-source omni model most comparable to GPT-4o, yet gaps remain in video understanding and speech naturalness.
  • The strategy of freezing the LLM when training the speech decoder originates from Freeze-Omni; VITA-1.5 demonstrates that this approach is highly effective at preserving multimodal performance.
  • Insight: Training strategy design may be more important than architectural design for multimodal models; progressively introducing modalities is an effective paradigm for handling modality conflict.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐⭐