Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

Conference: ACL 2026
arXiv: 2604.18360
Code: Web Demo
Area: Multimodal VLM / Audio Retrieval
Keywords: Audio-text retrieval, CLAP, multimodal LLM, user-intent queries, hard negative discrimination

TL;DR

This paper proposes OEA (Omni-Embed-Audio), which employs a multimodal LLM as a unified encoder to construct a retrieval-oriented audio-text embedding space. It introduces the User-Intent Queries (UIQ) benchmark and hard-negative discrimination metrics (HNSR/TFR), demonstrating that the LLM backbone significantly outperforms CLAP-based methods on T2T retrieval (+22%) and hard negative discrimination (+4.3%p HNSR@10).

Background & Motivation

Background: Contrastive Language-Audio Pretraining (CLAP)-based methods have become the dominant paradigm for audio-text retrieval. The latest M2D-CLAP achieves state-of-the-art performance by combining self-supervised masked modeling with CLAP. Performance on standard benchmarks (AudioCaps, Clotho) continues to improve.

Limitations of Prior Work: (1) Standard benchmarks use caption-style queries that diverge substantially from real-world search behavior—actual Freesound queries average only 1.8 words; (2) existing models suffer up to 16% performance degradation on paraphrase queries; (3) standard evaluation metrics only check whether the target is retrieved, without measuring whether models can suppress acoustically similar but semantically distinct distractors—i.e., discrimination ability is not evaluated.

Key Challenge: CLAP text encoders are lightweight and optimized for contrastive audio alignment, compressing entire queries into "bag-of-content" vectors—rendering them unable to handle negation semantics ("no thunder") or fine-grained semantic distinctions, which are precisely the core requirements of real-world search scenarios.

Goal: (1) Build a unified retrieval encoder based on a multimodal LLM; (2) systematically evaluate retrieval robustness across diverse real-world query types; (3) propose new metrics that measure hard negative discrimination capability.

Key Insight: LLMs encounter abundant negation patterns ("not," "except") during instruction-following pretraining, and their attention mechanisms can preserve compositional semantic structures—complementing the limited capacity of CLAP's lightweight text encoders.

Core Idea: Use a multimodal LLM with native audio understanding as a unified encoder, combined with LoRA adaptation and contrastive learning, to surpass dedicated CLAP models in retrieval quality and semantic discrimination.

Method

Overall Architecture

OEA uses a single shared multimodal LLM backbone to process both text and audio inputs. Text queries are encoded with a "query:" prefix, while audio inputs are processed with a "passage:" prefix through the model's native audio encoder. Both modalities obtain representations via mean pooling over the final hidden layer, which are then mapped to a shared 512-dimensional L2-normalized embedding space through modality-specific projection heads. Backbone weights are frozen; only LoRA adapters (~11–16M parameters) and projection heads are trained.
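
A minimal sketch of the embedding path described above, assuming a Hugging Face-style multimodal backbone that exposes its final hidden states; the class and attribute names (`OmniEmbedEncoder`, `text_proj`, `audio_proj`) are illustrative rather than the authors' code, and the audio preprocessing of the actual Qwen2.5-Omni / Nemotron backbones is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniEmbedEncoder(nn.Module):
    """Sketch of the shared-backbone encoder: frozen multimodal LLM,
    mean pooling over the final hidden layer, and per-modality projection
    heads into a shared 512-d L2-normalized embedding space."""

    def __init__(self, backbone, hidden_dim: int, embed_dim: int = 512):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)  # base weights frozen; trainable LoRA adapters would be attached afterwards (e.g. via PEFT)
        self.text_proj = nn.Linear(hidden_dim, embed_dim)
        self.audio_proj = nn.Linear(hidden_dim, embed_dim)

    def _pool(self, hidden_states, attention_mask):
        # Mean pooling over non-padding tokens of the last hidden layer.
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)

    def encode_text(self, inputs):
        # `inputs` is the tokenized query, already prefixed with "query:".
        out = self.backbone(**inputs, output_hidden_states=True)
        pooled = self._pool(out.hidden_states[-1], inputs["attention_mask"])
        return F.normalize(self.text_proj(pooled), dim=-1)

    def encode_audio(self, inputs):
        # `inputs` holds the preprocessed audio passage ("passage:" prefix + audio tokens).
        out = self.backbone(**inputs, output_hidden_states=True)
        pooled = self._pool(out.hidden_states[-1], inputs["attention_mask"])
        return F.normalize(self.audio_proj(pooled), dim=-1)
```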

Key Designs

  1. Unified LLM Backbone Encoder Architecture:

    • Function: Encodes both text and audio with a single Transformer, eliminating the modality gap inherent in conventional dual encoders.
    • Mechanism: A multimodal LLM with native audio understanding (Nemotron-3B, Qwen2.5-Omni-3B/7B) is selected as the shared backbone. LoRA adapters are attached to attention layers, and modality-specific projection heads compress backbone representations into a 512-dimensional shared space. All backbone weights are frozen; only LoRA adapters and projection heads are trained (~0.29–0.36% of total parameters).
    • Design Motivation: Conventional dual encoders (e.g., CLAP) use separate lightweight text and audio encoders with limited text representational capacity; a shared LLM backbone allows audio understanding to benefit from the LLM's rich linguistic priors.
  2. User-Intent Queries (UIQ) Benchmark:

    • Function: Systematically evaluates retrieval model robustness across diverse real-world query types.
    • Mechanism: Five query types are defined across three categories: conversational (Question—natural language questions; Imperative—command-style instructions), reformulation (Keyphrase—keyword tags; Paraphrase—synonymous rewrites), and exclusion (Negative—queries specifying content to exclude). Queries are generated using GPT-5.1 under lexical constraints and length controls, with human verification (9 annotators, mean score 4.15/5).
    • Design Motivation: Existing benchmarks test only caption-style queries, failing to reflect the diversity of real user search behavior—particularly imperative and exclusion-based queries.
  3. Hard Negative Mining and Discrimination Metrics (HNSR/TFR):

    • Function: Evaluates a model's ability to retrieve the target while simultaneously suppressing acoustically similar distractors.
    • Mechanism: A four-stage hard negative mining pipeline: MGA-CLAP acoustic similarity retrieval → BGE text semantic dissimilarity filtering → human verification → construction of (target audio, hard negative audio) pairs. HNSR@k is defined as the proportion of cases where the target is within top-k and the hard negative is outside top-k. \(\Delta\)-Rank = Rank(HN) − Rank(Target) measures separation (a short metric sketch follows this list).
    • Design Motivation: Standard R@k only checks whether the target is retrieved; for exclusion-type queries, the core challenge is whether the model can simultaneously suppress acoustically similar distractors.
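
A minimal sketch of the discrimination metrics defined in item 3, assuming the 1-based rank of the target audio and of its paired hard negative have already been computed for each query; the function names are illustrative, and TFR is omitted since its exact definition is not spelled out above:

```python
import numpy as np

def hnsr_at_k(target_ranks, hard_neg_ranks, k: int = 10) -> float:
    """HNSR@k: fraction of (target, hard-negative) pairs where the target
    lands inside the top-k while the hard negative stays outside it."""
    t = np.asarray(target_ranks)    # 1-based rank of the target audio per query
    h = np.asarray(hard_neg_ranks)  # 1-based rank of the paired hard-negative audio
    return float(np.mean((t <= k) & (h > k)))

def mean_delta_rank(target_ranks, hard_neg_ranks) -> float:
    """Mean Delta-Rank = Rank(HN) - Rank(Target); larger means better separation."""
    return float(np.mean(np.asarray(hard_neg_ranks) - np.asarray(target_ranks)))

# Toy example: three queries with (target rank, hard-negative rank) pairs.
targets, hard_negs = [1, 4, 12], [15, 8, 3]
print(hnsr_at_k(targets, hard_negs, k=10))   # 0.333... (only the first pair counts)
print(mean_delta_rank(targets, hard_negs))   # 3.0
```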

Loss & Training

Symmetric InfoNCE contrastive learning loss is used with temperature \(\tau = 0.07\). Multi-stage curriculum learning is applied: initial audio-text alignment on WavCaps (275K samples), followed by fine-tuning on AudioCaps v2 (91K samples); Clotho v2 training data may optionally be added (denoted +Cl). Training uses the AdamW optimizer with PyTorch DDP and BFloat16 precision.
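
A minimal sketch of the symmetric InfoNCE objective described above, using in-batch negatives over L2-normalized text and audio embeddings; this is the standard two-direction contrastive formulation with \(\tau = 0.07\), not the authors' exact training code:

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(text_emb: torch.Tensor, audio_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of L2-normalized embeddings.

    Row i of `text_emb` and row i of `audio_emb` form a positive pair;
    all other in-batch items serve as negatives.
    """
    logits = text_emb @ audio_emb.T / tau                 # (B, B) cosine similarities scaled by temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_t2a = F.cross_entropy(logits, labels)            # text query -> matching audio clip
    loss_a2t = F.cross_entropy(logits.T, labels)          # audio clip -> matching caption
    return 0.5 * (loss_t2a + loss_a2t)
```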

Key Experimental Results

Main Results (T2A Retrieval)

| Model | AudioCaps R@5 | Clotho R@5 | MECAT R@5 |
| --- | --- | --- | --- |
| M2D-CLAP | 77.13 | 42.91 | 23.55 |
| OEA-Nemo3B | 72.64 | 40.57 | 24.53 |
| OEA-Qwen3B (+Cl) | 69.35 | 49.78 | 17.16 |
| OEA-Qwen7B | 72.25 | 44.78 | 23.29 |

T2T Retrieval and Hard Negative Discrimination

| Model | Clotho T2T R@1 | MECAT T2T R@5 | HNSR@10 |
| --- | --- | --- | --- |
| M2D-CLAP | 55.85 | 38.74 | 30.3% |
| OEA-Qwen7B (+Cl) | 63.58 | 47.41 | 34.6% |
| Gain | +13.8% | +22.4% | +4.3%p |

Key Findings

  • On T2A retrieval, OEA and M2D-CLAP are broadly comparable; M2D-CLAP is stronger in-domain on AudioCaps, while OEA generalizes better cross-domain (Clotho/MECAT).
  • On T2T retrieval, OEA leads by a substantial margin (+22% relative gain), owing to the LLM backbone's superior text comprehension compared to CLAP's lightweight text encoder.
  • On imperative queries, OEA holds an exclusive advantage (+5.1%p), attributable to the LLM's instruction-following pretraining.
  • On hard negative discrimination, OEA is significantly stronger (HNSR@10 +4.3%p, TFR@10 +34.7%), as the LLM's attention mechanism preserves the compositional structure of negation semantics.
  • The 7B model does not consistently outperform the 3B model—retrieval quality is more constrained by contrastive alignment and data-backbone compatibility.

Highlights & Insights

  • The "complementary capability" argument is clearly articulated—M2D-CLAP is stronger for in-domain caption-style retrieval, while OEA excels at T2T and semantic discrimination; the two have distinct deployment scenarios, and the paper provides concrete decision rules.
  • The HNSR/TFR metrics address a gap in evaluating exclusion-type queries—standard R@k cannot distinguish between "target retrieved but hard negatives also present" and "clean target retrieval."
  • Transforming an LLM into a retrieval encoder by training only 0.29–0.36% of parameters demonstrates extreme parameter efficiency.

Limitations & Future Work

  • OEA depends on multimodal LLM backbones with native audio understanding, limiting the choice of base models.
  • Memory footprint is substantially larger than that of CLAP (18.3 GB vs. ~0.6 GB); edge deployment requires quantization or distillation.
  • Hard negative mining via MGA-CLAP + BGE filtering may miss certain types of acoustic confusions.
  • UIQ is generated by a single LLM and, despite human verification, may not cover all real-world query styles.
  • Performance in multilingual audio retrieval settings is not evaluated.
  • Audio encoding latency is high (666 ms/clip for Qwen7B), requiring pre-computation for real-time scenarios.

Comparison with Related Work

  • vs M2D-CLAP (Niizumi et al., 2025): M2D-CLAP is stronger on T2A but lacks T2T and discrimination capability; OEA's semantic-understanding advantage stems from the LLM backbone.
  • vs RobustCLAP (Selvakumar et al., 2024): RobustCLAP is optimized for paraphrase robustness but does not handle exclusion queries; OEA naturally handles negation semantics through the LLM.
  • vs NevIR/ExcluIR (Weller et al., 2023): These works find that text retrieval models perform near-randomly on negation queries; OEA demonstrates that an LLM backbone can partially address this problem.

Rating

  • Novelty: ⭐⭐⭐⭐ Using a multimodal LLM as an audio retrieval encoder is a novel perspective; the UIQ benchmark and discrimination metrics are meaningful contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets, six OEA variants, four CLAP baselines, five query types, and multi-dimensional analysis.
  • Writing Quality: ⭐⭐⭐⭐ Conclusions are clear, experimental design is thorough, and deployment recommendations are practical.
  • Value: ⭐⭐⭐⭐ Advances the evaluation paradigm for audio retrieval; the UIQ benchmark can be broadly adopted by the community.