Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods

Conference: CVPR 2026
arXiv: 2603.25767
Code: https://github.com/AudenAI/Auden/tree/main/examples/uts
Area: Audio & Speech
Keywords: audio pre-training, unified tag system, data-centric, label quality, cross-domain generalization

TL;DR

Through systematic data-centric experiments, this paper demonstrates that audio pre-training performance is primarily driven by label/supervision quality rather than model design. It proposes the Unified Tag System (UTS), which unifies speech, music, and environmental sound under a high-granularity vocabulary of 800–3k tags. Models trained with UTS surpass AudioSet baselines on out-of-domain tasks such as speaker verification (VoxCeleb2) and music (MusicCaps) using 5× less data.

Background & Motivation

Background: Audio pre-training is dominated by two paradigms: (1) label classification pre-training (with AudioSet-527 as the standard); and (2) audio-language alignment pre-training (e.g., CLAP, audio captioning). The former relies on AudioSet's manually defined label taxonomy; the latter depends on the quality of text descriptions.
Limitations of Prior Work: (1) AudioSet's 527 tags primarily cover environmental sounds, with severely insufficient coverage of speech and music labels, leading to poor generalization of pre-trained models on speech/music downstream tasks; (2) gains from scaling data size and model architecture are approaching saturation, yet the role of label quality remains substantially underestimated.
Key Challenge: The field pursues ever-larger datasets and models while potentially overlooking a more fundamental question—whether the label system itself is adequate. Without sufficiently fine-grained labels, additional data cannot support learning of fine-grained semantic distinctions.
Goal: Design a unified, high-quality label system and systematically compare different pre-training objectives (classification, captioning, contrastive, multi-task) under this label system.
Key Insight: Leverage powerful audio LLMs such as Qwen3-Omni to generate high-fidelity audio descriptions (averaging 388 words), then use an LLM to extract semantic tags and construct a cross-domain unified tag vocabulary.
Core Idea: Automatically extract tags from high-quality audio descriptions using an LLM, construct the UTS vocabulary via TF-IDF filtering, and systematically compare classification, generative, contrastive, and multi-task pre-training objectives under this tag system.

Method

Overall Architecture

CaptionStew 400K dataset → Qwen3-Omni generates high-fidelity audio descriptions → Qwen2.5-7B extracts semantic tags → TF-IDF filtering → UTS vocabulary (\(K\) = 800–3k) → train classification/captioning/contrastive/multi-task models on UTS → evaluate on 7+ downstream tasks.

Key Designs

Unified Tag System (UTS) Construction
Function: Create a unified semantic tag vocabulary spanning speech, music, and environmental sound domains.
Mechanism: Qwen3-Omni first generates a detailed description for each audio clip (mean 388 words); Qwen2.5-7B-Instruct then extracts semantic tags from these descriptions, which the paper reports works better than NLTK POS tagging on long, complex descriptions. Tags are ranked by the TF-IDF-style score \(s(t) = df(t) \cdot \log(\frac{N+1}{df(t)+1})\), where \(df(t)\) is the number of descriptions containing tag \(t\) and \(N\) is the total number of descriptions. Keeping the top-scoring tags yields vocabularies of size \(K \in \{800, 1\text{k}, 1.5\text{k}, 2\text{k}, 3\text{k}\}\).
Design Motivation: AudioSet-527 offers narrow coverage defined by human annotators. UTS leverages LLMs to automatically mine finer-grained, cross-domain semantic coverage. t-SNE analysis confirms that the AudioSet semantic space is fully subsumed by UTS.
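The scoring-and-selection step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function name `select_tags`, the toy clip tags, and the alphabetical tie-break are all hypothetical; only the score formula comes from the text above.

```python
import math
from collections import Counter

def select_tags(tag_lists, k):
    """Rank tags by s(t) = df(t) * log((N + 1) / (df(t) + 1)) and keep the top k.

    tag_lists: one list of extracted tags per audio description (hypothetical input format).
    """
    n_docs = len(tag_lists)
    # Document frequency: number of descriptions whose tag set contains the tag.
    df = Counter(tag for tags in tag_lists for tag in set(tags))
    scores = {t: d * math.log((n_docs + 1) / (d + 1)) for t, d in df.items()}
    # Sort by score (descending); ties are broken alphabetically (an assumption, for determinism).
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k], scores

# Toy input: made-up tags standing in for LLM tagger output.
clips = [
    ["speech", "male voice", "applause"],
    ["speech", "piano"],
    ["speech", "piano", "rain"],
    ["speech", "dog bark"],
]
vocab, scores = select_tags(clips, k=2)
print(vocab)
```

In this toy set, "speech" appears in every clip and so scores exactly zero: a tag that never discriminates between clips carries no information under this formula, while very rare tags are held back by the \(df(t)\) factor.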
Parallel Decoding Objective (PAR)
Function: Force the encoder to learn richer representations through non-autoregressive caption generation.
Mechanism: Multi-hot label vectors are converted to canonical text sequences, e.g. \(Y_i = \text{"tag\_a, tag\_d, tag\_k"}\). During decoding, all decoder inputs are masked and causal attention is removed, so every token is predicted in parallel: \(\mathcal{L}_{\text{par}} = -\sum_{t=1}^T \log p_\phi(y_t|z_i^a)\), where \(z_i^a\) is the audio encoder representation of clip \(i\) and \(T\) is the target length. Unlike standard AR decoding, the PAR decoder's sole information source is the audio encoder representation.
Design Motivation: AR decoding suffers from a "language prior bias"—the model can predict the next token from already-generated tokens without fully exploiting audio features. PAR eliminates this shortcut.
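A minimal sketch of the PAR target construction and loss follows, assuming the decoder has already produced one logit vector per output position from the audio representation alone. The helper names and the toy vocabulary are hypothetical, and the real model (a BART-base decoder over Zipformer features) is replaced here by hand-written logits.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's token logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def tags_to_sequence(multi_hot, vocab):
    """Canonical text target: active tags joined in vocabulary order."""
    return ", ".join(tag for tag, on in zip(vocab, multi_hot) if on)

def par_loss(position_logits, target_ids):
    """L_par = -sum_t log p(y_t | z): each position is scored independently,
    with no dependence on previously generated tokens."""
    return -sum(log_softmax(logits)[y] for logits, y in zip(position_logits, target_ids))

vocab = ["tag_a", "tag_b", "tag_d", "tag_k"]          # hypothetical tag vocabulary
target_text = tags_to_sequence([1, 0, 1, 1], vocab)   # "tag_a, tag_d, tag_k"

# With uninformative (uniform) logits over 5 tokens, each of the 3 positions costs log(5).
uniform_loss = par_loss([[0.0] * 5 for _ in range(3)], [0, 2, 4])
print(target_text, uniform_loss)
```

Because no term in the sum conditions on earlier tokens, the only way to lower this loss is to make the audio-derived logits more informative, which is the shortcut-removal argument made above.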
Multi-Task Joint Training
Function: Simultaneously cultivate discriminative and descriptive capabilities.
Mechanism: Jointly optimize \(\mathcal{L}_{\text{MTL}} = \mathcal{L}_{\text{MTC}} + \lambda \mathcal{L}_{\text{gen}}\), where MTC is the multi-label binary cross-entropy classification objective and gen is a mixed AR/PAR captioning objective (0.25 AR + 0.75 PAR). \(\lambda\) controls task weighting.
Design Motivation: Single-objective training induces task bias—models trained purely for classification perform poorly on captioning and retrieval tasks, and vice versa. Multi-task joint training achieves a balance between the two.
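The combined objective can be written out as a short sketch. Two points are assumptions rather than facts from the source: the 0.25/0.75 AR/PAR mix is treated here as a weighted sum of the two losses (it could equally describe a sampling ratio), and the default `lam=1.0` is a placeholder because the value of \(\lambda\) is not given above.

```python
import math

def bce_multilabel(probs, targets):
    """Mean binary cross-entropy over tags; probs are sigmoid outputs in (0, 1)."""
    eps = 1e-12  # guards against log(0)
    terms = [
        -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        for p, y in zip(probs, targets)
    ]
    return sum(terms) / len(terms)

def multitask_loss(l_mtc, l_ar, l_par, lam=1.0, ar_weight=0.25):
    """L_MTL = L_MTC + lam * L_gen, with L_gen = ar_weight * L_AR + (1 - ar_weight) * L_PAR."""
    l_gen = ar_weight * l_ar + (1.0 - ar_weight) * l_par
    return l_mtc + lam * l_gen

# Toy numbers only: an uninformative classifier and made-up captioning losses.
l_mtc = bce_multilabel([0.5, 0.5, 0.5], [1, 0, 1])
total = multitask_loss(l_mtc, l_ar=2.0, l_par=4.0, lam=0.5)
print(l_mtc, total)
```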

Loss & Training

MTC: multi-label binary cross-entropy.
Contrastive learning: symmetric InfoNCE.
Captioning: mixed AR/PAR.
Multi-task: weighted combination of the MTC and captioning losses.
Backbone: Zipformer-M encoder + BERT-base text encoder + BART-base decoder.
Training: 700k steps (MTC) or 400k steps (others), 8 × V100 GPUs, batch size of 640 seconds of audio.
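For readers unfamiliar with the contrastive term, the following is a minimal sketch of symmetric InfoNCE over a batch similarity matrix. It assumes matched audio-text pairs sit on the diagonal; the function names and the temperature of 0.07 are illustrative choices, not values taken from the paper.

```python
import math

def _log_softmax(row):
    """Numerically stable log-softmax over one row of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_info_nce(sim, temperature=0.07):
    """sim[i][j] is the similarity of audio i and text j; pair (i, i) is the positive."""
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Audio-to-text direction: softmax over each row.
    a2t = -sum(_log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text-to-audio direction: softmax over each column.
    cols = [[scaled[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(_log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)

n = 4
uninformative = [[0.0] * n for _ in range(n)]                           # every pair looks alike
aligned = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # positives stand out
print(symmetric_info_nce(uninformative), symmetric_info_nce(aligned))
```

With all similarities equal, the loss equals \(\log n\) for a batch of \(n\) pairs; it approaches zero as the diagonal entries dominate their rows and columns.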

Key Experimental Results

Main Results

Model	FSD-50k↑	VggSound↑	VoxCeleb2↑	CREMA-D↑	MTAT↑	NSynth↑
MTC-AudioSet (baseline)	0.656	56.46	18.84	67.14	0.407	67.19
MTC-UTS (Ours)	0.459	37.70	37.10	66.01	0.375	63.62
Contrastive (Ours)	0.445	40.78	33.88	67.29	0.396	61.40
Multi-task (Ours)	0.485	40.81	34.62	65.31	0.396	59.94

Ablation Study

UTS Size	Linear Probe	Captioning	Retrieval	Notes
\(K\)=800	Moderate	Moderate	Moderate	Tags too coarse
\(K\)=1.5k	Peak	Peak	Peak	Optimal balance
\(K\)=3k	Drops	Robust	Slight drop	Increased data sparsity

Key Findings

Most critical finding: MTC-UTS outperforms MTC-AudioSet on the speech task (VoxCeleb2) by 18.26 absolute points (37.10 vs. 18.84) while using 5× less data. This out-of-domain gain supports the claim that supervision quality matters more than data quantity.
The AudioSet baseline remains strongest on in-domain tasks (FSD-50k, VggSound), indicating that the AudioSet label system is highly optimized for environmental sound.
PAR decoding outperforms AR on the speech task (38.78 vs. 29.87), confirming that eliminating the language shortcut drives the encoder to learn richer audio representations.
An optimal label vocabulary size exists (\(K\)=1.5k); excessively large vocabularies lead to insufficient training of long-tail tags.
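As a quick arithmetic check on the first two findings, the per-task gaps between MTC-UTS and MTC-AudioSet can be recomputed from the Main Results table. The numbers below are copied from that table; the variable names are only illustrative.

```python
# Scores copied from the Main Results table (higher is better on every task).
baseline = {"FSD-50k": 0.656, "VggSound": 56.46, "VoxCeleb2": 18.84,
            "CREMA-D": 67.14, "MTAT": 0.407, "NSynth": 67.19}
uts = {"FSD-50k": 0.459, "VggSound": 37.70, "VoxCeleb2": 37.10,
       "CREMA-D": 66.01, "MTAT": 0.375, "NSynth": 63.62}

# Absolute difference per task, rounded to suppress floating-point noise.
deltas = {task: round(uts[task] - baseline[task], 3) for task in baseline}
# Tasks on which MTC-UTS beats the AudioSet baseline.
wins = [task for task, d in deltas.items() if d > 0]
# Relative improvement on VoxCeleb2, as a fraction of the baseline score.
relative_gain = (uts["VoxCeleb2"] - baseline["VoxCeleb2"]) / baseline["VoxCeleb2"]

print(deltas)
print(wins, f"{relative_gain:.1%}")
```

Among the six tasks in this table, VoxCeleb2 is the only one where MTC-UTS comes out ahead; the 18.26-point gap there corresponds to a relative improvement of roughly 97% over the baseline.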

Highlights & Insights

Strong empirical evidence for "data quality > data quantity": UTS models trained on the 400K-sample CaptionStew dataset surpass the AudioSet baseline, trained on roughly 2M samples (5× more data), on out-of-domain tasks. The finding has broad implications for the pre-training field.
PAR decoding eliminates the language shortcut: The design philosophy of "strengthening the encoder by weakening the decoder" is elegant and transferable to other modalities such as visual captioning.
Scalability of the UTS pipeline: The toolchain (LLM captioner → LLM tagger → TF-IDF filtering) is fully automated and requires zero manual effort to adapt to new domains.

Limitations & Future Work

UTS relies on descriptions generated by a single "teacher" model (Qwen3-Omni), introducing systematic bias from that model.
In-domain tasks (FSD-50k, VggSound) still lag behind the AudioSet baseline, indicating that large-scale data retains advantages within the source domain.
The optimal vocabulary size (\(K\)=1.5k) may vary with data distribution, and an adaptive selection mechanism is absent.
Designing a single unified objective that simultaneously achieves optimal performance across all downstream tasks remains an open challenge.
Future work could combine data mixing strategies and train with larger-scale data under the UTS label system.

Related Work & Insights

vs. AudioSet-MTC: AudioSet's 527 classes describe general sound events at coarse semantic granularity, with thin coverage of speech and music; UTS fills those semantic gaps.
vs. CLAP/LAION-Audio: Contrastive learning methods depend heavily on the quality of short text–audio pairs, whereas this work achieves more precise semantic alignment through the tag system.
vs. BEATs/Audio-MAE: Self-supervised methods require no labels but exhibit low pre-training efficiency and require substantial annotated data for downstream fine-tuning.

Rating

Novelty: ⭐⭐⭐⭐ The UTS construction pipeline and PAR decoding design are innovative, though the central message that "data quality matters" is not entirely novel.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five pre-training objectives × multiple vocabulary sizes × 7 downstream tasks × linear probing + captioning + retrieval + QA—extremely comprehensive.
Writing Quality: ⭐⭐⭐⭐ The data-centric narrative is logically clear.
Value: ⭐⭐⭐⭐⭐ Provides a systematic answer to the label system question in audio pre-training; the open-sourced UTS toolchain is directly reusable.