FLAM: Frame-Wise Language-Audio Modeling
Conference: ICML 2025
arXiv: 2505.05335
Code: flam-model.github.io
Area: Audio & Speech
Keywords: Open-Vocabulary Sound Event Detection, Frame-Level Contrastive Learning, Logit Adjustment, Audio-Language Alignment, Data Augmentation
TL;DR
Proposes FLAM, a frame-level audio-language contrastive model that localizes open-vocabulary sound events in time using text-dependent logit bias correction and a million-scale synthetic SED dataset, while improving zero-shot classification and keeping global retrieval close to the global-only baseline.
Background & Motivation
- Limitations of Prior Work (ALMs): Audio-language models like CLAP learn instance-level global embeddings and excel at text-audio retrieval, but fail to precisely localize the temporal boundaries of sound events.
- Limitations of Traditional SED: Traditional sound event detection (SED) models can localize events precisely but are confined to predefined categories, failing to handle out-of-distribution events in an open-vocabulary manner.
- Scarcity of Annotated Data: Unlike the image domain, audio frame-level annotations are extremely scarce, and manual labeling is highly costly. Existing SED datasets are small in scale and limited in categories.
- Insufficiency of Self-Supervised Methods: Previous self-supervised local alignment methods (e.g., MGA-CLAP) improve frame-level capabilities to some extent, but lack fine-grained annotations, resulting in limited localization performance.
Core Problem: How can ALMs be endowed with open-vocabulary, frame-level sound event localization capabilities while preserving their global retrieval capacity?
Method
Overall Architecture
FLAM is based on the LAION-CLAP architecture (HTSAT audio encoder + RoBERTa text encoder), extended to simultaneously output both global embeddings and frame-level embedding sequences. A 10-second audio input processed by HTSAT yields a frame-level embedding sequence \(\mathbf{e}^{a,loc}(x) \in \mathbb{R}^{L \times d}\) of length \(L=32\), and the global representation is obtained by averaging these frame embeddings.
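As a rough illustration of the shapes involved, the sketch below builds a stand-in frame embedding sequence with \(L=32\) frames and averages it into a global embedding. The embedding size `d = 512`, the random values, and the L2 normalization are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

L, d = 32, 512  # 32 frames per 10 s clip (from the text); d = 512 is an assumed size

rng = np.random.default_rng(0)
frame_emb = rng.normal(size=(L, d))  # stand-in for the frame-level embeddings of one clip

# Global embedding: the average of the frame embeddings, as described above.
global_emb = frame_emb.mean(axis=0)

# L2-normalize so that dot products are cosine similarities
# (an assumption, common in CLIP-style models).
frame_emb_n = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
global_emb_n = global_emb / np.linalg.norm(global_emb)

print(frame_emb_n.shape, global_emb_n.shape)
```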
Frame-Level Contrastive Loss
Open-vocabulary SED is modeled as frame-level binary classification: for every (audio frame, text event description) pair, the model predicts whether the event is active in that frame.
The logit for each pair applies a text-dependent scale and bias to the frame-text similarity:
- \(\alpha^t(y) > 0\): Text-dependent logit scaling, produced by \(\text{MLP}^\alpha(E^t(y))\)
- \(\beta^t(y)\): Text-dependent logit bias, produced by \(\text{MLP}^p(E^t(y))\)
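A minimal sketch of how such a logit and a sigmoid binary loss could be computed is given below. It assumes the logit is the scaled frame-text dot product plus the bias, that labels are \(z \in \{+1, -1\}\), and that the loss is averaged over all pairs; the function names, the normalization, and the use of `exp` to keep the scale positive are illustrative choices, not the paper's implementation.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)); note that -log(sigmoid(u)) == softplus(-u).
    return np.logaddexp(0.0, x)

def frame_logits(frame_emb, text_emb, alpha, beta):
    """frame_emb: (B, L, d), text_emb: (K, d), alpha and beta: (K,).
    Returns logits of shape (B, L, K), one per (audio frame, text) pair."""
    sim = np.einsum("bld,kd->blk", frame_emb, text_emb)
    return alpha * sim + beta  # alpha and beta broadcast over the text axis K

def sed_loss(logits, z):
    """Sigmoid binary loss; z is +1 where the event is active and -1 otherwise.
    Averaging over all pairs is an assumed normalization."""
    return softplus(-z * logits).mean()

rng = np.random.default_rng(0)
B, L, K, d = 2, 32, 3, 8
frames = rng.normal(size=(B, L, d))
texts = rng.normal(size=(K, d))
alpha = np.exp(rng.normal(size=K))  # exp keeps the scale positive (assumed parameterization)
beta = rng.normal(size=K)
z = rng.choice([-1.0, 1.0], size=(B, L, K))

logits = frame_logits(frames, texts, alpha, beta)
loss = sed_loss(logits, z)
```

Because the audio and text sides only meet in the dot product, the same frame embeddings can be scored against any number of text queries.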
Logit Adjustment: Handling Event-Dependent Class Imbalance
Frame-level labels suffer from severe imbalance (most frame-text pairs are negative samples), and the degree of imbalance varies across events (e.g., "thunder" is rare and short-lived, while "rain" is frequent and long-lasting).
Bayes-Optimal Classifier: At inference, an unbiased classifier is used, so that raw predictions are normalized into a localization score that does not depend on the event prior.
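One common way to obtain a prior-independent score, following standard logit adjustment, is to drop the prior term from the logit before applying the sigmoid. The sketch below assumes that this is what the unbiased classifier does; the numeric values for the scale and the two biases are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def localization_score(sim, alpha, beta, remove_prior=True):
    """Score per frame; with remove_prior=True the bias is subtracted again,
    leaving sigmoid(alpha * sim). This reading of the method is an assumption."""
    logit = alpha * sim + beta
    if remove_prior:
        logit = logit - beta
    return sigmoid(logit)

sim = np.array([0.1, 0.6, 0.6, 0.1])  # frame-text similarity for four frames
alpha = 10.0                          # hypothetical scale
beta_rare, beta_common = -4.0, 1.0    # hypothetical prior log-odds of two events

raw_rare = localization_score(sim, alpha, beta_rare, remove_prior=False)
raw_common = localization_score(sim, alpha, beta_common, remove_prior=False)
adj_rare = localization_score(sim, alpha, beta_rare)
adj_common = localization_score(sim, alpha, beta_common)
```

With the prior removed, the same similarity pattern yields the same scores for the rare and the common event, whereas the raw scores differ.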
Bias Training: \(\beta^t\) is trained separately with an auxiliary loss \(\mathcal{L}_p\) to approximate the optimal bias \(\beta^*(y) = \log\frac{p(z=1|y)}{p(z=-1|y)}\), the prior log-odds that event \(y\) is active in a frame. Gradients from \(\mathcal{L}_{\text{SED}}\) are blocked from reaching it.
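To see why a bias trained on its own ends up at the prior log-odds, the toy example below fits a single scalar bias to frame labels with a sigmoid binary loss. The loss form, the 5% activity rate, and the plain gradient descent loop are assumptions for illustration; only the bias is optimized, which mimics the effect of blocking gradients from the frame-level loss.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# Frame labels for one text query: +1 for active frames, -1 otherwise (about 5% active).
z = np.where(rng.random(20_000) < 0.05, 1.0, -1.0)

beta = 0.0
lr = 1.0
for _ in range(2000):
    # Gradient of mean(-log sigmoid(z * beta)) with respect to beta.
    grad = np.mean(-z * sigmoid(-z * beta))
    beta -= lr * grad

p_pos = np.mean(z == 1.0)
beta_star = np.log(p_pos / (1.0 - p_pos))  # empirical prior log-odds
print(round(float(beta), 4), round(float(beta_star), 4))
```

The fitted bias matches the empirical log-odds of the labels, so a rare event receives a strongly negative bias and a frequent one a bias closer to zero.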
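Joint Training Objective
The training objective is a weighted sum of the global contrastive loss, the frame-level loss \(\mathcal{L}_{\text{SED}}\), and the bias loss \(\mathcal{L}_p\), with weights \(\gamma^{\text{CLIP}}=1\), \(\gamma^{\text{SED}}=200\), and \(\gamma^p=1\).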
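Written out, an objective with these three weights would presumably take the form below. The name \(\mathcal{L}_{\text{global}}\) is a placeholder chosen here for the global (clip-level) contrastive loss weighted by \(\gamma^{\text{CLIP}}\); it is an assumption, not notation taken from the paper.

\[
\mathcal{L} = \gamma^{\text{CLIP}}\,\mathcal{L}_{\text{global}} + \gamma^{\text{SED}}\,\mathcal{L}_{\text{SED}} + \gamma^{p}\,\mathcal{L}_{p}
= \mathcal{L}_{\text{global}} + 200\,\mathcal{L}_{\text{SED}} + \mathcal{L}_{p}
\]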
Memory-Efficient Training
A SigLIP-style block-wise ring communication strategy is adopted: each GPU processes a local subset of the frame-text pair loss, and text embeddings are passed around the GPUs in a ring. This prevents the gathering of all embeddings on a single GPU, enabling large-batch training.
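The ring idea can be simulated on a single machine. In the sketch below each "device" keeps its own audio chunk, the text chunks rotate around the ring, and the per-device partial losses add up to the loss over the full batch. This is a toy NumPy simulation under assumed shapes, with the text-dependent scale and bias omitted for brevity; a real implementation would use collective communication between GPUs.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def pair_loss(frames, texts, z):
    """Summed sigmoid loss for one (audio chunk, text chunk) block.
    frames: (b, L, d), texts: (k, d), z: (b, L, k) with entries in {+1, -1}."""
    logits = np.einsum("bld,kd->blk", frames, texts)
    return softplus(-z * logits).sum()

def ring_loss(frame_chunks, text_chunks, labels):
    """labels[i][j] holds the targets between audio chunk i and text chunk j."""
    n = len(frame_chunks)
    totals = np.zeros(n)
    held = list(range(n))  # held[dev] = index of the text chunk currently on device dev
    for _ in range(n):
        for dev in range(n):
            j = held[dev]
            totals[dev] += pair_loss(frame_chunks[dev], text_chunks[j], labels[dev][j])
        # Pass every text chunk on to the next device in the ring.
        held = [held[(dev - 1) % n] for dev in range(n)]
    return totals.sum()

rng = np.random.default_rng(0)
n, b, L, d = 4, 2, 32, 8
frame_chunks = [rng.normal(size=(b, L, d)) for _ in range(n)]
text_chunks = [rng.normal(size=(b, d)) for _ in range(n)]
labels = [[rng.choice([-1.0, 1.0], size=(b, L, b)) for _ in range(n)] for _ in range(n)]

# Reference: the same loss computed with every embedding gathered in one place.
full_z = np.concatenate([np.concatenate(row, axis=2) for row in labels], axis=0)
full = pair_loss(np.concatenate(frame_chunks), np.concatenate(text_chunks), full_z)
ring = ring_loss(frame_chunks, text_chunks, labels)
```

At no point does a device hold more than one text chunk, yet after `n` rotations every audio chunk has been compared with every text chunk.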
Data Augmentation Pipeline
The pipeline synthesizes 1 million 10-second mixtures from 1.1M source clips (a licensed sound effects library plus CC-licensed general-purpose datasets). Each mixture is built as follows:
- Randomly select a background audio (\(\ge\)10s, containing the "ambiance" keyword).
- Sample \(N \sim \mathcal{U}(1, 10)\) foreground events (80% from sound effects, 20% from general datasets).
- Randomly place events, with a maximum of 3 overlapping concurrently.
- Split 10% of the events into 2-3 fragments, and repeat another 10% 2-3 times to simulate realistic scenarios.
- Apply a random loudness shift of \(\mathcal{U}(6,30)\) dB and a 10ms fade-in/fade-out.
- Perform boundary calibration based on A-weighted RMS loudness (regions below -70 dB are marked inactive).
Mixtral is used to generate 2-13 word captions for the sound effects.
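The placement and labeling steps of the pipeline can be sketched as follows. The example samples the number of events, places them at random with at most 3 overlapping, converts the resulting segments into 32 frame labels, and draws a loudness shift in dB. The 10 ms occupancy grid, the rejection sampling, the event durations, and the conversion of the shift to an amplitude gain are assumptions; splitting, repetition, fades, and the A-weighted boundary calibration are left out.

```python
import numpy as np

CLIP_SEC, N_FRAMES = 10.0, 32  # clip length and frame count from the text
MAX_OVERLAP = 3

def place_events(durations, rng, max_tries=100):
    """Place events at random so that at most MAX_OVERLAP are active at any time.
    Returns (onset_sec, offset_sec) per event, or None if no position was found."""
    grid = np.zeros(1000, dtype=int)  # occupancy count per 10 ms step (assumed resolution)
    placed = []
    for dur in durations:
        n = max(1, int(round(min(dur, CLIP_SEC) * 100)))
        spot = None
        for _ in range(max_tries):
            start = int(rng.integers(0, len(grid) - n + 1))
            if grid[start:start + n].max() < MAX_OVERLAP:
                grid[start:start + n] += 1
                spot = (start / 100.0, (start + n) / 100.0)
                break
        placed.append(spot)
    return placed

def frame_labels(segments):
    """Convert segments into a (num_events, N_FRAMES) matrix of +1/-1 frame labels."""
    edges = np.linspace(0.0, CLIP_SEC, N_FRAMES + 1)
    z = -np.ones((len(segments), N_FRAMES))
    for k, seg in enumerate(segments):
        if seg is None:
            continue
        onset, offset = seg
        overlaps = (edges[:-1] < offset) & (edges[1:] > onset)  # frame touches the event
        z[k, overlaps] = 1.0
    return z

def db_to_gain(db):
    # Standard conversion from a level change in dB to an amplitude factor.
    return 10.0 ** (db / 20.0)

rng = np.random.default_rng(0)
n_events = int(rng.integers(1, 11))               # N ~ U(1, 10), upper bound inclusive
durations = rng.uniform(0.5, 4.0, size=n_events)  # hypothetical event lengths in seconds
segments = place_events(durations, rng)
z = frame_labels(segments)
gains = db_to_gain(rng.uniform(6.0, 30.0, size=n_events))
```

Because the mixture is assembled from known segments, the frame labels come for free, which is what makes synthetic data attractive when manual frame-level annotation is scarce.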
Key Experimental Results
Sound Event Detection (Table 1)
| Model | Held-out AUROC | ASFX-SED AUROC | DESED AUROC | MAESTRO MPAUC | AudioSet-S AUROC | UrbanSED AUROC |
|---|---|---|---|---|---|---|
| FLAM-Global | 67.76 | 65.14 | 85.52 | 51.13 | 82.54 | 67.39 |
| FLAM | 91.00 | 81.23 | 91.66 | 56.97 | 94.76 | 93.62 |
| MGA-CLAP* | 74.17 | 69.56 | 89.28 | 52.50 | 79.12 | 78.22 |
FLAM outperforms both baselines on every metric shown above. Relative to FLAM-Global, the open-vocabulary (Held-out) AUROC rises from 67.76 to 91.00, compared with 74.17 for MGA-CLAP.
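For readers unfamiliar with the metric, AUROC for one text query can be computed from per-frame scores and binary frame labels as sketched below, using the rank-sum identity. How the paper aggregates frames, clips, and classes is not stated here, so this only illustrates the basic quantity; the table reports it as a percentage.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) identity.
    labels are 1 (active) or 0 (inactive); tied scores receive average ranks."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):
        tie = scores == s
        ranks[tie] = ranks[tie].mean()
    n_pos, n_neg = labels.sum(), (~labels).sum()
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

frame_scores = [0.1, 0.2, 0.8, 0.9]
frame_active = [0, 0, 1, 1]
value = auroc(frame_scores, frame_active)
```

A value of 0.5 corresponds to chance-level ranking of active versus inactive frames.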
Zero-Shot Classification (Table 3)
| Model | ESC-50 | US8K | VGGSound |
|---|---|---|---|
| FLAM-Global | 81.6 | 65.4 | 38.9 |
| FLAM | 86.9 | 75.6 | 39.3 |
| MGA-CLAP* | 72.6 | 69.9 | 38.6 |
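Frame-level supervision does not compromise the global representation. It improves zero-shot accuracy on all three benchmarks, most clearly on ESC-50 (81.6 \(\to\) 86.9) and US8K (65.4 \(\to\) 75.6); the VGGSound gain is marginal (38.9 \(\to\) 39.3).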
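Zero-shot classification with a contrastive audio-language model is usually done by embedding one text prompt per class and picking the class whose embedding is most similar to the global audio embedding. The sketch below shows that procedure on toy vectors; the embeddings stand in for encoder outputs, and the exact prompts and encoders used in the evaluation are not specified here.

```python
import numpy as np

def zero_shot_predict(audio_emb, class_text_emb):
    """audio_emb: (n, d) global audio embeddings; class_text_emb: (c, d), one row per class prompt.
    Returns the index of the most similar class for each clip (cosine similarity)."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    return (a @ t.T).argmax(axis=1)

# Toy stand-ins for encoder outputs: three classes, two clips.
class_text_emb = np.eye(3, 4)
audio_emb = np.array([[0.9, 0.1, 0.0, 0.2],
                      [0.0, 0.2, 1.5, 0.1]])
true_class = np.array([0, 2])

pred = zero_shot_predict(audio_emb, class_text_emb)
accuracy = np.mean(pred == true_class)
```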
Retrieval Performance (Table 2)
FLAM's retrieval performance is somewhat below that of FLAM-Global (AudioCaps T2A R@1: 32.1 vs. 36.0, a 3.9-point drop), indicating that frame-level training carries a modest cost for global retrieval.
Highlights & Insights
- Clear Modeling of Open-Vocabulary SED: Formulates frame-level event localization as binary classification over frame-text pairs. Because the contrastive framework encodes audio and text separately, audio frame embeddings can be precomputed and only new text queries need to be encoded at inference time.
- Text-Dependent Logit Adjustment: Introduces event-specific scaling and bias to handle event-level class imbalance, transforming predictions from raw cosine similarity into calibrated probabilities with mathematically rigorous derivations.
- Large-Scale Synthetic Data Pipeline: Overcomes the scarcity of frame-level annotations by synthesizing 1 million mixed samples with precise boundary labels, successfully simulating realistic scenarios using splitting and repetition strategies.
- Frame-Level Supervision Benefits Global Representations: The improvement in zero-shot classification performance (ESC-50 81.6 \(\to\) 86.9) demonstrates that fine-grained alignment also boosts overall discriminative ability.
- Memory-Efficient Training: Implementing ring transmission avoids centralized gathering, making large-scale frame-level contrastive training highly feasible.
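The efficiency point in the first highlight can be made concrete: frame embeddings for a clip are computed once and then scored against any number of text queries, and the per-frame scores are turned into time segments. The sketch below assumes random stand-in embeddings, a threshold of 0.5, and 10 s / 32 frames = 0.3125 s per frame; the query strings and the decoding rule are illustrative only.

```python
import numpy as np

FRAME_SEC = 10.0 / 32  # 0.3125 s per frame (32 frames per 10 s clip)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def active_segments(scores, threshold=0.5):
    """Turn per-frame scores into (onset_sec, offset_sec) runs of consecutive active frames."""
    active = np.asarray(scores) >= threshold
    padded = np.concatenate([[False], active, [False]])
    # Indices where activity switches on or off; they alternate start, end, start, end.
    change = np.flatnonzero(padded[1:] != padded[:-1])
    return [(s * FRAME_SEC, e * FRAME_SEC) for s, e in zip(change[::2], change[1::2])]

rng = np.random.default_rng(0)
cached_frames = rng.normal(size=(32, 8))  # stand-in frame embeddings, computed once per clip
queries = {                               # stand-in text embeddings for two new queries
    "dog barking": rng.normal(size=8),
    "rain": rng.normal(size=8),
}

results = {}
for text, text_emb in queries.items():
    scores = sigmoid(cached_frames @ text_emb)  # only the text side changes per query
    results[text] = active_segments(scores)
```

For example, `active_segments([0.1, 0.9, 0.8, 0.2])` marks frames 1 and 2 as active and returns a single segment from 0.3125 s to 0.9375 s.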
Limitations & Future Work
- Fixed Input Length: Supports only 10-second audio at a coarse frame resolution (32 frames per 10 s, about 0.31 s per frame), making it hard to handle long-form audio or tasks that need finer temporal granularity.
- Synthetic vs. Real Data Gap: Training data consists of synthesized mixtures, whereas real-world scenes feature more complex reverberation, masking, and co-occurrence patterns, so a domain gap may remain.
- Lightweight Model Scale: The HTSAT + RoBERTa architecture is relatively lightweight; adopting larger encoders or more expressive backbones may yield further improvements.
- Suboptimal PSDS on DESED: Having only 692 real annotated samples leads to high variance, hinting at potential limitations in scenarios with small-scale real annotations.
- Minor Drop in Retrieval Performance: AudioCaps T2A R@1 drops from 36.0 to 32.1, indicating a slight trade-off between frame-level training and global retrieval targets.
Related Work & Insights
- CLAP / LAION-CLAP: Serves as the backbone architecture for FLAM, extending its capability from instance-level to frame-level alignment.
- SigLIP: The conceptual ideas of sigmoid contrastive loss and logit bias directly inspired the design of FLAM's frame-level objective.
- MGA-CLAP: A representative method for self-supervised local alignment that serves as a primary baseline, though it lacks explicit frame-level supervision.
- Scaper / Synthetic SED Data: Traditional synthesizer frameworks for SED data, which FLAM extends to open-vocabulary contexts.
- GLIP / PACL (Computer Vision): Open-vocabulary detection/segmentation works in the visual domain that provide analogous inspiration for the audio domain.
Rating
- Novelty: ⭐⭐⭐⭐ — Successfully combines frame-level contrastive learning with text-dependent logit adjustment for open-vocabulary SED with a clear and highly original approach.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Thoroughly covers open/closed SED, retrieval, and zero-shot classification, supported by comprehensive ablation studies.
- Writing Quality: ⭐⭐⭐⭐ — Mathematically rigorous derivations, consistent notations, and highly intuitive diagrams.
- Value: ⭐⭐⭐⭐ — Open-vocabulary frame-level audio localization is a practical and vital direction; the proposed method is generalizable to broader audio understanding tasks.