Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods

Conference: CVPR 2026 arXiv: 2603.25767 Code: https://github.com/AudenAI/Auden/tree/main/examples/uts Area: Audio & Speech Keywords: audio pre-training, unified tag system, data-centric, label quality, cross-domain generalization

TL;DR

Through systematic data-centric experiments, this paper demonstrates that audio pre-training performance is primarily driven by label/supervision quality rather than model design. It proposes the Unified Tag System (UTS), which unifies speech, music, and environmental sound under a high-granularity vocabulary of 800–3k tags. Models trained with UTS surpass AudioSet baselines on out-of-domain tasks such as speaker verification (VoxCeleb2) and music (MusicCaps) using 5× less data.

Background & Motivation

  1. Background: Audio pre-training is dominated by two paradigms: (1) label classification pre-training (with AudioSet-527 as the standard); and (2) audio-language alignment pre-training (e.g., CLAP, audio captioning). The former relies on AudioSet's manually defined label taxonomy; the latter depends on the quality of text descriptions.
  2. Limitations of Prior Work: (1) AudioSet's 527 tags primarily cover environmental sounds, with severely insufficient coverage of speech and music labels, leading to poor generalization of pre-trained models on speech/music downstream tasks; (2) gains from scaling data size and model architecture are approaching saturation, yet the role of label quality remains substantially underestimated.
  3. Key Challenge: The field pursues ever-larger datasets and models while potentially overlooking a more fundamental question—whether the label system itself is adequate. Without sufficiently fine-grained labels, additional data cannot support learning of fine-grained semantic distinctions.
  4. Goal: Design a unified, high-quality label system and systematically compare different pre-training objectives (classification, captioning, contrastive, multi-task) under this label system.
  5. Key Insight: Leverage powerful audio LLMs such as Qwen3-Omni to generate high-fidelity audio descriptions (averaging 388 words), then use an LLM to extract semantic tags and construct a cross-domain unified tag vocabulary.
  6. Core Idea: Automatically extract tags from high-quality audio descriptions using an LLM, construct the UTS vocabulary via TF-IDF filtering, and systematically compare classification, generative, contrastive, and multi-task pre-training objectives under this tag system.

Method

Overall Architecture

CaptionStew 400K dataset → Qwen3-Omni generates high-fidelity audio descriptions → Qwen2.5-7B extracts semantic tags → TF-IDF filtering → UTS vocabulary (\(K\) = 800–3k) → train classification/captioning/contrastive/multi-task models on UTS → evaluate on 7+ downstream tasks.

Key Designs

  1. Unified Tag System (UTS) Construction (a vocabulary-construction sketch follows this list)

     • Function: Create a unified semantic tag vocabulary spanning the speech, music, and environmental sound domains.

     • Mechanism: Qwen3-Omni first generates detailed descriptions for each audio clip (mean 388 words); Qwen2.5-7B-Instruct then extracts semantic tags from these descriptions (outperforming NLTK POS tagging on modern, complex descriptions). Tags are filtered by the TF-IDF score \(s(t) = df(t) \cdot \log(\frac{N+1}{df(t)+1})\) to retain the most informative ones, yielding vocabularies of size \(K \in \{800, 1\text{k}, 1.5\text{k}, 2\text{k}, 3\text{k}\}\).

     • Design Motivation: AudioSet-527 offers narrow coverage defined by human annotators, whereas UTS leverages LLMs to automatically mine finer-grained, cross-domain semantic coverage. t-SNE analysis confirms that the AudioSet semantic space is fully subsumed by UTS.

  2. Parallel Decoding Objective (PAR) (a loss sketch follows this list)

     • Function: Force the encoder to learn richer representations through non-autoregressive caption generation.

     • Mechanism: Multi-hot label vectors are converted to canonical text sequences \(Y_i = \text{"tag\_a, tag\_d, tag\_k"}\), but during decoding all inputs are masked and causal attention is removed, yielding parallel generation: \(\mathcal{L}_{\text{par}} = -\sum_{t=1}^T \log p_\phi(y_t|z_i^a)\). Unlike standard AR decoding, the PAR decoder's sole information source is the audio encoder representation.

     • Design Motivation: AR decoding suffers from a "language prior bias": the model can predict the next token from already-generated tokens without fully exploiting audio features. PAR eliminates this shortcut.

  3. Multi-Task Joint Training (a combined-loss sketch follows this list)

     • Function: Simultaneously cultivate discriminative and descriptive capabilities.

     • Mechanism: Jointly optimize \(\mathcal{L}_{\text{MTL}} = \mathcal{L}_{\text{MTC}} + \lambda \mathcal{L}_{\text{gen}}\), where MTC is the multi-label binary cross-entropy classification objective, gen is a mixed AR/PAR captioning objective (0.25 AR + 0.75 PAR), and \(\lambda\) controls the task weighting.

     • Design Motivation: Single-objective training induces task bias: models trained purely for classification perform poorly on captioning and retrieval, and vice versa. Multi-task joint training balances the two.
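
The sketch below shows one way the TF-IDF tag filtering could be implemented, using the score \(s(t) = df(t) \cdot \log(\frac{N+1}{df(t)+1})\) quoted above. The function name and the per-clip tag-list input format are assumptions for illustration, not the released UTS toolchain.

```python
from collections import Counter
import math

def build_uts_vocabulary(clip_tags, k=1500):
    """Select the top-k tags by the document-frequency-weighted score
    s(t) = df(t) * log((N + 1) / (df(t) + 1)) described in the paper.

    clip_tags: list of tag lists, one list of LLM-extracted tags per audio clip.
    Returns the selected vocabulary as a list of tag strings.
    """
    n_clips = len(clip_tags)
    # Document frequency: number of clips whose tag set contains the tag.
    df = Counter(tag for tags in clip_tags for tag in set(tags))
    scores = {
        tag: count * math.log((n_clips + 1) / (count + 1))
        for tag, count in df.items()
    }
    # Keep the K most informative tags (ties broken alphabetically for determinism).
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]

# Toy usage with hypothetical extracted tags; the real pipeline runs over ~400K clips.
clips = [
    ["female speech", "telephone", "reverberant room"],
    ["acoustic guitar", "fingerpicking", "female speech"],
    ["rain", "thunder", "distant traffic"],
]
print(build_uts_vocabulary(clips, k=4))
```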
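
Next, a minimal PyTorch-style sketch of the PAR objective as described above: decoder inputs are all replaced by a shared mask embedding, no causal mask is applied, and every tag token is predicted in parallel conditioned only on the audio encoder output \(z_i^a\). The module layout, dimensions, and names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class PARDecoder(nn.Module):
    """Non-autoregressive tag-sequence decoder: every target position attends to
    the audio representation only; no ground-truth tokens are fed back in."""

    def __init__(self, vocab_size, d_model=768, n_layers=6, n_heads=8, max_len=64):
        super().__init__()
        self.mask_embed = nn.Parameter(torch.randn(1, 1, d_model))   # shared [MASK] input
        self.pos_embed = nn.Parameter(torch.randn(1, max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, z_audio, target_ids, pad_id=0):
        # z_audio: (B, T_a, d_model) encoder output; target_ids: (B, T_y) canonical tag text.
        B, T_y = target_ids.shape
        # Decoder input is ONLY mask + position embeddings: no causal mask, no teacher forcing.
        queries = self.mask_embed.expand(B, T_y, -1) + self.pos_embed[:, :T_y]
        hidden = self.decoder(tgt=queries, memory=z_audio)   # full (non-causal) self-attention
        logits = self.lm_head(hidden)                         # (B, T_y, vocab_size)
        # L_par = -sum_t log p(y_t | z_a), averaged over non-padding positions.
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            target_ids.reshape(-1),
            ignore_index=pad_id,
        )
```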
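
Finally, a few illustrative lines for the multi-task combination \(\mathcal{L}_{\text{MTL}} = \mathcal{L}_{\text{MTC}} + \lambda \mathcal{L}_{\text{gen}}\) with the 0.25 AR / 0.75 PAR captioning mix. The AR and PAR loss values are assumed to be computed elsewhere (e.g. by a standard AR decoder and the PARDecoder sketch above); the function signature is hypothetical.

```python
import torch.nn.functional as F

def multitask_loss(cls_logits, multi_hot_targets, loss_ar, loss_par, lam=1.0):
    """L_MTL = L_MTC + lambda * L_gen, with L_gen = 0.25 * L_AR + 0.75 * L_PAR.

    cls_logits / multi_hot_targets: (B, K) multi-label classification head outputs
    and 0/1 targets over the UTS vocabulary; loss_ar and loss_par are scalar
    captioning losses from the AR and PAR decoding heads.
    """
    l_mtc = F.binary_cross_entropy_with_logits(cls_logits, multi_hot_targets.float())
    l_gen = 0.25 * loss_ar + 0.75 * loss_par
    return l_mtc + lam * l_gen
```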

Loss & Training

MTC: multi-label binary cross-entropy. Contrastive learning: symmetric InfoNCE. Captioning: mixed AR/PAR. Multi-task: weighted combination. Backbone: Zipformer-M encoder + BERT-base text encoder + BART-base decoder. Training: 700k steps (MTC) or 400k steps (others), 8 × V100 GPUs, batch size 640 audio seconds.
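
The summary above lists symmetric InfoNCE as the contrastive objective. A minimal sketch of that standard symmetric formulation (audio-to-text plus text-to-audio cross-entropy over in-batch similarities) follows; the temperature value and embedding shapes are illustrative assumptions, not the paper's reported hyperparameters.

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(audio_emb, text_emb, temperature=0.07):
    """CLIP/CLAP-style symmetric InfoNCE over an in-batch similarity matrix.

    audio_emb, text_emb: (B, D) pooled embeddings from the audio and text encoders;
    row i of each tensor comes from the same audio clip / tag-text pair.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    loss_a2t = F.cross_entropy(logits, targets)              # audio -> text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)          # text -> audio direction
    return 0.5 * (loss_a2t + loss_t2a)
```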

Key Experimental Results

Main Results

| Model | FSD-50k | VggSound | VoxCeleb2↑ | CREMA-D↑ | MTAT | NSynth |
|-------|---------|----------|------------|----------|------|--------|
| MTC-AudioSet (baseline) | 0.656 | 56.46 | 18.84 | 67.14 | 0.407 | 67.19 |
| MTC-UTS (Ours) | 0.459 | 37.70 | 37.10 | 66.01 | 0.375 | 63.62 |
| Contrastive (Ours) | 0.445 | 40.78 | 33.88 | 67.29 | 0.396 | 61.40 |
| Multi-task (Ours) | 0.485 | 40.81 | 34.62 | 65.31 | 0.396 | 59.94 |

Ablation Study

| UTS Size | Linear Probe | Captioning | Retrieval | Notes |
|----------|--------------|------------|-----------|-------|
| \(K\)=800 | Moderate | Moderate | Moderate | Tags too coarse |
| \(K\)=1.5k | Peak | Peak | Peak | Optimal balance |
| \(K\)=3k | Drops | Robust | Slight drop | Increased data sparsity |

Key Findings

  • Most critical finding: UTS-MTC outperforms AudioSet-MTC on the speech task (VoxCeleb2) by 18.26 points (37.10 vs. 18.84) while using 5× less data, demonstrating out-of-domain superiority and confirming that supervision quality > data quantity.
  • The AudioSet baseline remains strongest on in-domain tasks (FSD-50k, VggSound), indicating that the AudioSet label system is highly optimized for environmental sound.
  • PAR decoding outperforms AR on the speech task (38.78 vs. 29.87), confirming that eliminating the language shortcut drives the encoder to learn richer audio representations.
  • An optimal label vocabulary size exists (\(K\)=1.5k); excessively large vocabularies lead to insufficient training of long-tail tags.

Highlights & Insights

  • Strong empirical evidence for "data quality > data quantity": UTS trained on 80k samples surpasses the AudioSet baseline trained on 2M samples on out-of-domain tasks—a finding with broad implications for the pre-training field.
  • PAR decoding eliminates the language shortcut: The design philosophy of "strengthening the encoder by weakening the decoder" is elegant and transferable to other modalities such as visual captioning.
  • Scalability of the UTS pipeline: The toolchain (LLM captioner → LLM tagger → TF-IDF filtering) is fully automated and requires zero manual effort to adapt to new domains.

Limitations & Future Work

  • UTS relies on descriptions generated by a single "teacher" model (Qwen3-Omni), introducing systematic bias from that model.
  • In-domain tasks (FSD-50k, VggSound) still lag behind the AudioSet baseline, indicating that large-scale data retains advantages within the source domain.
  • The optimal vocabulary size (\(K\)=1.5k) may vary with data distribution, and an adaptive selection mechanism is absent.
  • Designing a single unified objective that simultaneously achieves optimal performance across all downstream tasks remains an open challenge.
  • Future work could combine data mixing strategies and train with larger-scale data under the UTS label system.

Comparison with Related Work

  • vs. AudioSet-MTC: AudioSet labels offer broad coverage but coarse semantic granularity (only 527 classes); UTS fills the semantic gaps in speech and music.
  • vs. CLAP/LAION-Audio: Contrastive learning methods depend heavily on the quality of short text–audio pairs, whereas this work achieves more precise semantic alignment through the tag system.
  • vs. BEATs/Audio-MAE: Self-supervised methods need no labels but pre-train less efficiently and still require substantial annotated data for downstream fine-tuning.

Rating

  • Novelty: ⭐⭐⭐⭐ The UTS construction pipeline and PAR decoding design are innovative, though the central message that "data quality matters" is not entirely novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five pre-training objectives × multiple vocabulary sizes × 7 downstream tasks × linear probing + captioning + retrieval + QA—extremely comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ The data-centric narrative is logically clear.
  • Value: ⭐⭐⭐⭐⭐ Provides a systematic answer to the label system question in audio pre-training; the open-sourced UTS toolchain is directly reusable.