AVRobustBench: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time

Conference: NeurIPS 2025 arXiv: 2506.00358 Code: Available (mentioned in the paper) Area: Audio/Speech (Multimodal Robustness) Keywords: audio-visual robustness, distribution shift, test-time adaptation, multimodal benchmark, co-occurring corruptions

TL;DR

This paper proposes AVRobustBench, the first benchmark that systematically evaluates the test-time robustness of audio-visual models under co-occurring, correlated dual-modality corruptions, comprising 4 datasets × 75 corruptions (15 types × 5 severity levels), and introduces AV2C, a TTA method based on low-entropy sample selection.

Background & Motivation

Background: Audio-visual models such as UAVM, CAV-MAE, and ImageBind have achieved remarkable progress, yet their robustness to test-time distribution shifts remains largely understudied. Existing robustness benchmarks focus primarily on single modalities (e.g., ImageNet-C for images) or apply uncorrelated perturbations across modalities.

Limitations of Prior Work: (a) Existing benchmarks corrupt only a single modality, or apply corruptions to each modality independently; (b) In real-world scenarios (e.g., autonomous driving in rain), corruptions affect audio and video simultaneously and in a correlated manner; (c) No systematic evaluation of state-of-the-art audio-visual models under co-occurring corruptions exists.

Key Challenge: Audio-visual models perform well on clean data, but their robustness when both modalities are simultaneously corrupted remains entirely unknown.

Goal: To construct a comprehensive audio-visual robustness benchmark that systematically evaluates state-of-the-art supervised/self-supervised models and TTA methods.

Key Insight: Design 15 real-world-inspired, co-occurring and correlated audio-visual corruptions (each at 5 severity levels), spanning three categories: digital, environmental, and adversarial.

Core Idea: Real-world distribution shifts affect audio and video simultaneously and in a correlated fashion — existing models and TTA methods fail severely under such conditions.

Method

Overall Architecture

AVRobustBench comprises:

  • 4 benchmark datasets: AudioSet-2C (16,742 samples, 527 classes), VGGSound-2C (14,046 samples, 309 classes), Kinetics-2C (3,111 samples, 32 classes), EpicKitchens-2C (205 samples, 97 + 300 classes)
  • 75 corruptions: 15 types × 5 severity levels
  • Evaluation coverage: 6 supervised models, 3 self-supervised models, and 6 TTA methods

Key Designs

Corruption Taxonomy (15 types, applied synchronously to audio and video)

| Category | Video Corruption | Audio Corruption |
|---|---|---|
| Digital | Gaussian/Impulse/Shot/Speckle noise + JPEG compression | Corresponding noise (SNR-controlled) + DCT quantization |
| Environmental | Snow/Frost/Spatter/Wind (motion blur)/Rain/Underwater | Snow/frost/droplet/wind/rain/underwater sounds |
| Adversarial | Concert (brightness shift)/Smoke (haze)/Crowd (shadow occlusion)/Interference (random rotation) | Music noise/alarm/crowd noise/random silence |

Key property: all corruptions are applied simultaneously to both audio and video, and are correlated (e.g., rain simultaneously introduces raindrop visuals and rain audio).
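The co-occurring design can be sketched for the digital (Gaussian noise) case: both modalities are corrupted at the same severity level, with the audio noise scaled to hit an SNR target. The severity-to-parameter mapping below is a hypothetical illustration, not the benchmark's actual configuration.

```python
import numpy as np

def corrupt_pair(video, audio, severity, snr_db_by_severity=(40, 30, 20, 10, 5)):
    """Apply a correlated digital corruption (Gaussian noise) to both
    modalities at the same severity level (1-5). Illustrative sketch only;
    the severity-to-noise mapping is an assumption, not the paper's."""
    # Video: additive Gaussian noise with severity-scaled std (frames in [0, 1]).
    sigma = 0.04 * severity
    noisy_video = np.clip(video + np.random.normal(0, sigma, video.shape), 0.0, 1.0)

    # Audio: additive Gaussian noise at a severity-controlled SNR.
    snr_db = snr_db_by_severity[severity - 1]
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noisy_audio = audio + np.random.normal(0, np.sqrt(noise_power), audio.shape)
    return noisy_video, noisy_audio
```

Tying the noise levels of both modalities to a single severity knob is what makes the corruption "correlated" rather than independently sampled per modality.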

Evaluation Metrics

  • Accuracy/mAP: classification accuracy under corruption
  • Absolute robustness: \(\alpha_{i,s} = 1 - \frac{\delta A}{100}\)
  • Relative robustness: \(\rho_{i,s} = 1 - \frac{\delta A}{A_{cl}}\), where \(\delta A = A_{cl} - A_{i,s}\), \(A_{cl}\) is clean accuracy, and \(A_{i,s}\) is accuracy under corruption \(i\) at severity \(s\)
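The two scores are direct to compute from a clean/corrupted accuracy pair (percentages in [0, 100]):

```python
def robustness_scores(acc_clean, acc_corrupt):
    """Absolute and relative robustness for one (corruption i, severity s) pair.
    Accuracies are percentages in [0, 100]."""
    delta = acc_clean - acc_corrupt        # δA = A_cl - A_{i,s}
    alpha = 1 - delta / 100                # absolute robustness α
    rho = 1 - delta / acc_clean            # relative robustness ρ
    return alpha, rho
```

As a sanity check, the CAV-MAE row of the VGGSound-2C results table (corrupted mAcc 35.54, drop -29.96, so clean accuracy inferred as 65.50) yields ρ ≈ 0.54, matching the reported value.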

AV2C — Proposed TTA Method

  • Adapts QKV attention weights (analogous to READ)
  • Minimizes weighted Shannon entropy, assigning higher weights to low-entropy (reliable) samples
  • Selects diverse samples based on similarity between current predictions and historical exponential moving averages
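The entropy-weighting idea in the first two bullets can be sketched as follows. This is an illustrative reconstruction of the objective's spirit (EATA-style reliable-sample weighting), not the paper's exact loss; the entropy margin is an assumed hyperparameter, and the EMA-based diversity selection is omitted.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def weighted_entropy_loss(logits, margin):
    """Weighted Shannon entropy over a test batch: low-entropy (confident)
    samples get larger weights; samples above the entropy margin are dropped.
    Sketch of the idea only; the EMA diversity filter is not modeled here."""
    p = softmax(logits)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)   # per-sample H
    keep = entropy < margin                          # select reliable samples
    weights = np.exp(margin - entropy) * keep        # higher weight for lower H
    if weights.sum() == 0:
        return 0.0
    return float((weights * entropy).sum() / weights.sum())
```

Down-weighting or dropping high-entropy samples keeps noisy, unreliable predictions from dominating the adaptation gradient, which is exactly the failure mode plain entropy minimization (TENT) exhibits under heavy corruption.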

Loss & Training

  • Evaluation uses frozen pretrained models (standard robustness benchmark protocol)
  • TTA experiments: batch=16, single forward+backward pass
  • AV2C: online adaptation of QKV weights in the CAV-MAE joint encoder
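Restricting adaptation to the joint encoder's QKV projections amounts to a name filter over model parameters; a minimal sketch, in which the parameter names are hypothetical placeholders rather than CAV-MAE's actual module names:

```python
def adaptable_params(named_params):
    """Return only the QKV projection weights of the joint audio-visual
    encoder for online TTA (READ-style); everything else stays frozen.
    Parameter names here are hypothetical, not CAV-MAE's actual names."""
    qkv_keys = ("q_proj", "k_proj", "v_proj", "qkv")
    return {name: p for name, p in named_params
            if "joint_encoder" in name
            and name.endswith(".weight")
            and any(k in name for k in qkv_keys)}
```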

Key Experimental Results

Main Results — Test-Time Robustness (Severity=5)

| Model | VGGSound-2C mAcc | Drop | \(\rho\) | Kinetics-2C mAcc | Drop |
|---|---|---|---|---|---|
| UAVM | 27.41 | -38.39 | 0.42 | 48.06 | -30.06 |
| CAV-MAE | 35.54 | -29.96 | 0.54 | 58.15 | -29.95 |
| EquiAV | 33.78 | -28.12 | 0.55 | 63.73 | -22.29 |
| AudioCLIP | 11.14 | -15.64 | 0.41 | 23.57 | -27.44 |
| ImageBind | 10.25 | -17.93 | 0.36 | 26.82 | -25.64 |
| Wav2CLIP | 4.99 | -19.33 | 0.21 | 17.25 | -35.40 |

All models exhibit significant performance degradation at severity level 5.

TTA Results (VGGSound-2C, Severity=5)

| TTA Method | Mean Accuracy | vs. Source |
|---|---|---|
| Source (CAV-MAE) | 35.54 | — |
| TENT | 19.09 | -16.45 |
| SAR | 26.07 | -9.47 |
| EATA | 40.60* | +5.06 |
| READ | 35.28 | -0.26 |
| SuMi | 32.10 | -3.44 |
| AV2C (ours) | 40.60 | +5.06 |

Note: EATA and AV2C achieve the best results on VGGSound-2C. TENT degrades severely.

Ablation Study — Effect of Corruption Severity

  • The \(\rho\) of all models decreases monotonically as severity increases
  • Exception: the Interference corruption has relatively little impact on robustness (some models can still recognize samples correctly even under heavy frame rotation and audio silencing)
  • Digital corruptions (e.g., Gaussian noise) cause the largest performance drops

Key Findings

  1. Supervised models: EquiAV > CAV-MAE > UAVM; equivariant feature learning yields better robustness
  2. Self-supervised models: zero-shot generalization collapses under corruption — ImageBind degrades sharply, and relative robustness falls as low as \(\rho = 0.21\) (Wav2CLIP on VGGSound-2C)
  3. TTA methods broadly fail: LayerNorm updates in TENT/RPL/SAR lead to overfitting under dual-modality corruption
  4. Modality bias in READ: Under dual-modality corruption, cross-attention exhibits progressively increasing modality bias (visual→audio weight increases from 13.09 at \(t=0\) to 20.04 at \(t=100\))
  5. Prompt engineering is ineffective: Switching ImageBind to noise-aware prompts yields only marginal improvement

Highlights & Insights

  • First co-occurring correlated corruption benchmark: Fills a critical gap in audio-visual robustness evaluation
  • Comprehensive failure analysis: Systematically exposes the vulnerability of supervised/self-supervised models and TTA methods under dual-modality corruption
  • Discovery of modality bias: The shift in attention weights during adaptation in READ constitutes an interesting and previously unreported failure mode
  • Simplicity of AV2C: Low-entropy sample selection combined with QKV adaptation is conceptually straightforward yet effective

Limitations & Future Work

  1. AV2C achieves significant improvement only on VGGSound-2C; gains on Kinetics-2C are marginal
  2. The corruption suite can be further expanded (e.g., network latency, codec distortions)
  3. Large-scale foundation models (e.g., VideoLLaMA) are not evaluated
  4. Only online TTA is considered; offline or few-shot adaptation strategies are not explored

Related Work & Context

  • ImageNet-C (Hendrycks & Dietterich, 2019): Pioneering work on single-modality visual robustness benchmarking
  • READ (2024): The first audio-visual TTA method, yet shown to fail under dual-modality corruption
  • TENT (Wang et al., 2020): Foundational TTA method based on entropy minimization
  • Insight: The robustness of multimodal models cannot be simply inferred from single-modality results — co-occurring corruptions present fundamentally new challenges

Rating

⭐⭐⭐⭐ (4/5) The benchmark construction is solid and comprehensive, with broad experimental coverage; however, the proposed AV2C method offers limited improvement, and the work is more diagnostic than prescriptive.