Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations

  • Conference: ICCV 2025
  • arXiv: 2503.06273
  • Code: available (the paper states that code and models are released online)
  • Area: Audio & Speech
  • Keywords: zero-shot speech recognition, audio-visual speech recognition, romanization, large language models, multilingual

TL;DR

This paper proposes Zero-AVSR, a framework that transcribes speech into language-agnostic romanized text (Roman text) and then leverages an LLM to convert the Roman text into target-language graphemes, enabling zero-shot audio-visual speech recognition without any target-language speech data. The authors also construct the Multilingual Audio-Visual Romanized Corpus (MARC), covering 82 languages and 2,916 hours of audio-visual data.

Background & Motivation

Audio-visual speech recognition (AVSR) combines audio and lip-movement information to enhance speech understanding, particularly under noisy conditions. However, existing AVSR research has focused predominantly on English, and multilingual AVSR datasets cover only 9 languages. Acquiring sufficient annotated audio-visual data for each language is highly challenging, severely limiting the extension of AVSR to broader languages.

Core Problem: How can speech recognition be performed without using any speech data from the target language?

Key insight: Different languages share phonetic characteristics at the phoneme level, and romanization can unify all languages into a single language-agnostic representation space. Furthermore, pre-trained LLMs already possess the capability to convert Roman text into the writing systems of various languages.
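
To make the shared Roman space concrete, a rule-based transliteration library such as `unidecode` illustrates the idea. Note this is only an illustration; the paper generates its romanization labels with GPT-4o-mini, not a rule-based tool:

```python
from unidecode import unidecode  # pip install unidecode

# Rule-based transliteration into ASCII; illustrative only, since the paper's
# romanization annotations come from GPT-4o-mini.
print(unidecode("Привет, мир"))  # -> "Privet, mir" (Cyrillic)
print(unidecode("こんにちは"))    # -> "konnichiha" (Japanese kana)
# The mapping is lossy (consonantal scripts, for instance, largely drop
# vowels), which is exactly why a de-romanization model is needed to
# recover native graphemes.
```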

Method

Overall Architecture

Zero-AVSR consists of two core components: (1) AV-Romanizer, which predicts language-agnostic romanized text from multilingual audio-visual speech input; and (2) an LLM decoder, which converts the romanized text into target-language graphemes. The framework is instantiated in two forms: Cascaded and Unified.

Key Designs

  1. MARC Dataset: Integrates four datasets—LRS3, MuAViC, VoxCeleb2, and AVSpeech—to construct a multilingual romanized corpus covering 82 languages and 2,916 hours of audio-visual data. Romanization annotations are generated using GPT-4o-mini, which is empirically shown to achieve the best performance in romanization and de-romanization reconstruction tests. For unlabeled datasets, pre-trained language identification and ASR models are used to obtain language IDs and transcriptions.

  2. AV-Romanizer: Based on the AV-HuBERT architecture, comprising an audio encoder \(\mathcal{F}_a\) (linear layer), a visual encoder \(\mathcal{F}_v\) (ResNet-18 + 3D convolution), a Transformer encoder \(\mathcal{B}\) (24 layers), and a linear classification head. Audio features \(f_a\) and visual features \(f_v\) are concatenated along the channel dimension, projected via a linear layer, and fed into the Transformer encoder to obtain fused features \(f_{av} = \mathcal{B}((f_a \oplus f_v)W)\). Trained with CTC loss to predict romanized text (see the fusion sketch after this list).

  3. Cascaded Zero-AVSR: Cascades the AV-Romanizer with a pre-trained LLM (e.g., GPT-4o-mini) without fine-tuning the LLM. The AV-Romanizer converts audio-visual speech into Roman text, which is then passed to the LLM via instruction prompting to produce target-language graphemes (see the prompting sketch after this list). This approach is compatible with any LLM, including API-based models.

  4. Zero-AVSR (Unified Model): Feeds the audio-visual features encoded by the AV-Romanizer directly into an LLM (Llama3.2-3B) and achieves end-to-end zero-shot recognition through multi-task training (see the bridge sketch after this list):

     • Task 1 (Alignment): Uses a length compressor (1D convolution, kernel=2, stride=2) and an adapter to map the audio-visual features into the LLM embedding space, trained on seen languages with a language-modeling objective. The AV-Romanizer and the LLM's original weights are frozen; only the LoRA weights, the compressor, and the adapter are trained.

     • Task 2 (Learning De-romanization): A text-only task that trains the LLM to convert Roman text into target-language graphemes, covering both seen and unseen languages so the LLM does not forget its multilingual capabilities. Only the LoRA weights are trained.
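
The fusion step from item 2, \(f_{av} = \mathcal{B}((f_a \oplus f_v)W)\), can be sketched in PyTorch. This is a minimal sketch, not the released implementation: the dimensions (d=1024, 24 layers) follow AV-HuBERT-Large, the head count and Roman-character vocabulary size are assumptions, and the audio/visual front-ends are omitted:

```python
import torch
import torch.nn as nn

class AVRomanizerFusion(nn.Module):
    """Channel-wise concat of audio/visual features, linear projection W,
    Transformer encoder B, and a CTC head over Roman characters."""

    def __init__(self, d: int = 1024, layers: int = 24, heads: int = 16, vocab: int = 64):
        super().__init__()
        self.proj = nn.Linear(2 * d, d)                      # W
        block = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)  # B
        self.ctc_head = nn.Linear(d, vocab)                  # vocab size is a guess

    def forward(self, f_a: torch.Tensor, f_v: torch.Tensor) -> torch.Tensor:
        # f_a, f_v: (batch, time, d), frame-synchronized by the front-ends
        f_av = self.encoder(self.proj(torch.cat([f_a, f_v], dim=-1)))
        return self.ctc_head(f_av).log_softmax(-1)  # per-frame Roman-char log-probs
```

For training, `torch.nn.CTCLoss` expects `(time, batch, vocab)` log-probabilities, so the output would be transposed before computing the loss.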
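
The Cascaded variant (item 3) reduces to a transcribe-then-prompt loop. In the sketch below, `av_romanizer.transcribe` and `llm_call` are hypothetical stand-ins for the CTC decoder and any chat-completion API, and the prompt wording is illustrative rather than the paper's actual template:

```python
def build_deromanization_prompt(roman_text: str, target_language: str) -> str:
    """Instruction asking the LLM to restore target-language graphemes."""
    return (
        f"The following is romanized {target_language} speech transcribed by an "
        f"audio-visual speech recognizer. Rewrite it in the {target_language} "
        f"writing system, correcting obvious recognition errors:\n\n{roman_text}"
    )

def cascaded_zero_avsr(av_romanizer, llm_call, audio, video, target_language: str) -> str:
    roman_text = av_romanizer.transcribe(audio, video)  # CTC decoding to Roman text
    return llm_call(build_deromanization_prompt(roman_text, target_language))
```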
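
For the unified model (item 4), the feature bridge trained in Task 1 is compact. A sketch with assumed dimensions (1024 for the AV-Romanizer features, 3072 for Llama3.2-3B's embedding width); the class and variable names are mine:

```python
import torch
import torch.nn as nn

class AVToLLMBridge(nn.Module):
    """Length compressor (1D conv, kernel=2, stride=2) that halves the temporal
    resolution, followed by a linear adapter into the LLM embedding space."""

    def __init__(self, feat_dim: int = 1024, llm_dim: int = 3072):
        super().__init__()
        self.compressor = nn.Conv1d(feat_dim, feat_dim, kernel_size=2, stride=2)
        self.adapter = nn.Linear(feat_dim, llm_dim)

    def forward(self, f_av: torch.Tensor) -> torch.Tensor:
        # f_av: (batch, time, feat_dim) from the frozen AV-Romanizer
        x = self.compressor(f_av.transpose(1, 2)).transpose(1, 2)  # (batch, time//2, feat_dim)
        return self.adapter(x)  # prepended to the text-prompt embeddings
```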

Loss & Training

  • The AV-Romanizer is trained with CTC loss.
  • Three-stage learning rate schedule: 10K warmup steps, 40K hold, 50K decay, with a peak learning rate of 1e-4 (sketched after this list).
  • MUSAN noise is randomly added to audio during training (0 dB SNR).
  • The LLM stage of Zero-AVSR uses a cosine scheduler (0.5K warmup + 29.5K decay) with QLoRA fine-tuning.
  • Beam search (width=2, temperature=0.3) is used at inference.
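
The three-stage schedule in the second bullet maps directly to code. A minimal sketch; the linear shape of the warmup and decay phases is an assumption (tri-stage schedulers are sometimes implemented with exponential decay):

```python
def tri_stage_lr(step: int, peak_lr: float = 1e-4,
                 warmup: int = 10_000, hold: int = 40_000, decay: int = 50_000) -> float:
    """Warmup to peak_lr, hold, then decay toward zero (shapes assumed linear)."""
    if step < warmup:
        return peak_lr * step / warmup
    if step < warmup + hold:
        return peak_lr
    progress = min(1.0, (step - warmup - hold) / decay)
    return peak_lr * (1.0 - progress)
```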

Key Experimental Results

Main Results

WER (%) per language; lower is better:

| Method | Modality | Training Hours | Languages | Ara | Deu | Ell | Spa | Fra | Ita | Por | Rus | Eng | Avg (w/ Eng) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AV-HuBERT | AVSR | 1759h | 9 | 89.4 | 52.0 | 46.2 | 17.4 | 20.3 | 20.8 | 22.1 | 44.7 | 1.7 | 35.0 |
| XLAVS-R 2B | AVSR | 437Kh | 9 | 79.3 | 44.4 | 19.0 | 9.1 | 12.3 | 10.6 | 11.2 | 25.0 | 1.7 | 23.6 |
| MMS Zero-shot | ASR | 476Kh | 1078+ | 84.9 | 31.5 | 47.9 | 17.7 | 33.6 | 19.0 | 35.5 | 42.8 | 35.7 | 38.9 |
| Cascaded Zero-AVSR | AVSR | 2916h | 82+ | 82.1 | 29.3 | 47.2 | 16.3 | 28.9 | 21.6 | 20.2 | 42.9 | 2.9 | 30.2 |
| Zero-AVSR | AVSR | 2916h | 82+ | 81.4 | 27.8 | 38.4 | 13.1 | 14.3 | 15.9 | 15.4 | 32.6 | 1.5 | 25.2 |

Ablation Study

Effectiveness of MARC Dataset (Cascaded Zero-AVSR; target unseen language: Rus):

| Training Data | # Training Languages | Training Hours | Zero-shot CER (Rus) ↓ | Avg CER ↓ |
| --- | --- | --- | --- | --- |
| MuAViC | 8 | 745h | 62.3 | 48.3 |
| +MARC (8 langs) | 8 | 1944h | 61.0 | 28.3 |
| +MARC (40 langs) | 40 | 2418h | 49.5 | 25.1 |
| +MARC (81 langs) | 81 | 2793h | 40.0 | 21.9 |

Effect of Different LLMs on Cascaded Zero-AVSR:

| LLM | Avg CER ↓ |
| --- | --- |
| Llama3.2-3B | 35.7 |
| Mistral-7B | 29.6 |
| Llama3.1-8B | 27.3 |
| Llama3.1-70B | 21.3 |
| GPT-4o-mini | 19.5 |

Key Findings

  • Zero-AVSR achieves an average WER of 25.2% using only 2,916 hours of audio-visual data, comparable to XLAVS-R 2B (23.6%) trained on 437K hours of audio data.
  • Increasing language diversity (8 → 81 languages) substantially improves zero-shot performance, reducing the zero-shot CER on Russian from 62.3% to 40.0%.
  • Data from linguistically related languages provides greater benefit for zero-shot performance, validating linguistic priors.
  • Larger LLMs yield better decoding performance in the Cascaded setting.

Highlights & Insights

  • Romanization as a language-agnostic representation: Compared to phoneme representations, Roman text is simpler, and LLMs already possess the ability to convert it back into native writing systems.
  • Practicality of the Cascaded approach: No LLM fine-tuning is required, enabling direct use of closed-source API-based models.
  • Scaling effect of data volume and language diversity: More languages and more data significantly enhance zero-shot capability.
  • LLM as a universal de-romanizer: This eliminates the need to train separate language models for each target language.

Limitations & Future Work

  • Zero-shot performance still lags behind supervised methods, particularly for languages with highly divergent writing systems such as Arabic.
  • The Unified Zero-AVSR employs the relatively small Llama3.2-3B; larger models may yield further improvements.
  • Romanization is not a fully invertible transformation, introducing some information loss.
  • Coverage of 82 languages remains incomplete with respect to all world languages, especially low-resource ones.

Related Work

  • AV-HuBERT: Foundational architecture for self-supervised audio-visual pre-training.
  • MMS Zero-shot: A pioneer in zero-shot speech recognition in the audio domain.
  • XLAVS-R: Multilingual audio-visual self-supervised learning.
  • Inspiration: The romanization + LLM paradigm is potentially generalizable to other multilingual tasks.

Rating

  • Novelty: ⭐⭐⭐⭐ First zero-shot AVSR framework; the romanization + LLM design is highly original.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 12 ablation experiments, comprehensive evaluation across 8 languages, and comparisons with multiple LLMs.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with well-motivated problem formulation.
  • Value: ⭐⭐⭐⭐ Contributions in both methodology and dataset for multilingual audio-visual recognition across 82 languages.