AVRobustBench: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time

Conference: NeurIPS 2025 arXiv: 2506.00358 Code: Available (mentioned in the paper) Area: Audio/Speech (Multimodal Robustness) Keywords: audio-visual robustness, distribution shift, test-time adaptation, multimodal benchmark, co-occurring corruptions

TL;DR

This paper proposes AVRobustBench, the first benchmark that systematically evaluates the test-time robustness of audio-visual models under co-occurring, correlated dual-modality corruptions, comprising 4 datasets × 75 corruptions (15 types × 5 severity levels), and introduces AV2C, a TTA method based on low-entropy sample selection.

Background & Motivation

Background: Audio-visual models such as UAVM, CAV-MAE, and ImageBind have achieved remarkable progress, yet their robustness to test-time distribution shifts remains largely understudied. Existing robustness benchmarks focus primarily on single modalities (e.g., ImageNet-C for images) or apply uncorrelated perturbations across modalities.

Limitations of Prior Work: (a) Existing benchmarks corrupt only a single modality, or apply corruptions to each modality independently; (b) In real-world scenarios (e.g., autonomous driving in rain), corruptions affect audio and video simultaneously and in a correlated manner; (c) No systematic evaluation of state-of-the-art audio-visual models under co-occurring corruptions exists.

Key Challenge: Audio-visual models perform well on clean data, but their robustness when both modalities are simultaneously corrupted remains entirely unknown.

Goal: To construct a comprehensive audio-visual robustness benchmark that systematically evaluates state-of-the-art supervised/self-supervised models and TTA methods.

Key Insight: Design 15 real-world-inspired, co-occurring and correlated audio-visual corruptions (each at 5 severity levels), spanning three categories: digital, environmental, and adversarial.

Core Idea: Real-world distribution shifts affect audio and video simultaneously and in a correlated fashion — existing models and TTA methods fail severely under such conditions.

Method

Overall Architecture

AVRobustBench comprises:

  • 4 benchmark datasets: AudioSet-2C (16,742 samples, 527 classes), VGGSound-2C (14,046 samples, 309 classes), Kinetics-2C (3,111 samples, 32 classes), EpicKitchens-2C (205 samples, 97 + 300 classes)
  • 75 corruptions: 15 types × 5 severity levels
  • Evaluation coverage: 6 supervised models, 3 self-supervised models, and 6 TTA methods

Key Designs

Corruption Taxonomy (15 types, applied synchronously to audio and video)

| Category | Video Corruption | Audio Corruption |
|---|---|---|
| Digital | Gaussian/Impulse/Shot/Speckle noise + JPEG compression | Corresponding noise (SNR-controlled) + DCT quantization |
| Environmental | Snow/Frost/Spatter/Wind (motion blur)/Rain/Underwater | Snow/frost/droplet/wind/rain/underwater sounds |
| Adversarial | Concert (brightness shift)/Smoke (haze)/Crowd (shadow occlusion)/Interference (random rotation) | Music noise/alarm/crowd noise/random silence |

Key property: all corruptions are applied simultaneously to both audio and video, and are correlated (e.g., rain simultaneously introduces raindrop visuals and rain audio).
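The co-occurring design can be sketched for the digital (Gaussian noise) case: both modalities are corrupted at the same severity level, with the audio noise scaled to hit an SNR target. The severity-to-parameter mapping below is a hypothetical illustration, not the benchmark's actual configuration.

```python
import numpy as np

def corrupt_pair(video, audio, severity, snr_db_by_severity=(40, 30, 20, 10, 5)):
    """Apply a correlated digital corruption (Gaussian noise) to both
    modalities at the same severity level (1-5). Illustrative sketch only;
    the severity-to-noise mapping is an assumption, not the paper's."""
    # Video: additive Gaussian noise with severity-scaled std (frames in [0, 1]).
    sigma = 0.04 * severity
    noisy_video = np.clip(video + np.random.normal(0, sigma, video.shape), 0.0, 1.0)

    # Audio: additive Gaussian noise at a severity-controlled SNR.
    snr_db = snr_db_by_severity[severity - 1]
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noisy_audio = audio + np.random.normal(0, np.sqrt(noise_power), audio.shape)
    return noisy_video, noisy_audio
```

Tying the noise levels of both modalities to a single severity knob is what makes the corruption "correlated" rather than independently sampled per modality.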

Evaluation Metrics

  • Accuracy/mAP: classification accuracy under corruption
  • Absolute robustness: \(\alpha_{i,s} = 1 - \frac{\delta A}{100}\)
  • Relative robustness: \(\rho_{i,s} = 1 - \frac{\delta A}{A_{cl}}\), where \(\delta A = A_{cl} - A_{i,s}\), \(A_{cl}\) is clean accuracy, and \(A_{i,s}\) is accuracy under corruption \(i\) at severity \(s\)
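The two scores are direct to compute from a clean/corrupted accuracy pair (percentages in [0, 100]):

```python
def robustness_scores(acc_clean, acc_corrupt):
    """Absolute and relative robustness for one (corruption i, severity s) pair.
    Accuracies are percentages in [0, 100]."""
    delta = acc_clean - acc_corrupt        # δA = A_cl - A_{i,s}
    alpha = 1 - delta / 100                # absolute robustness α
    rho = 1 - delta / acc_clean            # relative robustness ρ
    return alpha, rho
```

As a sanity check, the CAV-MAE row of the VGGSound-2C results table (corrupted mAcc 35.54, drop -29.96, so clean accuracy inferred as 65.50) yields ρ ≈ 0.54, matching the reported value.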

AV2C — Proposed TTA Method

  • Adapts QKV attention weights (analogous to READ)
  • Minimizes weighted Shannon entropy, assigning higher weights to low-entropy (reliable) samples
  • Selects diverse samples based on similarity between current predictions and historical exponential moving averages
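The entropy-weighting idea in the first two bullets can be sketched as follows. This is an illustrative reconstruction of the objective's spirit (EATA-style reliable-sample weighting), not the paper's exact loss; the entropy margin is an assumed hyperparameter, and the EMA-based diversity selection is omitted.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def weighted_entropy_loss(logits, margin):
    """Weighted Shannon entropy over a test batch: low-entropy (confident)
    samples get larger weights; samples above the entropy margin are dropped.
    Sketch of the idea only; the EMA diversity filter is not modeled here."""
    p = softmax(logits)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)   # per-sample H
    keep = entropy < margin                          # select reliable samples
    weights = np.exp(margin - entropy) * keep        # higher weight for lower H
    if weights.sum() == 0:
        return 0.0
    return float((weights * entropy).sum() / weights.sum())
```

Down-weighting or dropping high-entropy samples keeps noisy, unreliable predictions from dominating the adaptation gradient, which is exactly the failure mode plain entropy minimization (TENT) exhibits under heavy corruption.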

Loss & Training

  • Evaluation uses frozen pretrained models (standard robustness benchmark protocol)
  • TTA experiments: batch=16, single forward+backward pass
  • AV2C: online adaptation of QKV weights in the CAV-MAE joint encoder
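Restricting adaptation to the joint encoder's QKV projections amounts to a name filter over model parameters; a minimal sketch, in which the parameter names are hypothetical placeholders rather than CAV-MAE's actual module names:

```python
def adaptable_params(named_params):
    """Return only the QKV projection weights of the joint audio-visual
    encoder for online TTA (READ-style); everything else stays frozen.
    Parameter names here are hypothetical, not CAV-MAE's actual names."""
    qkv_keys = ("q_proj", "k_proj", "v_proj", "qkv")
    return {name: p for name, p in named_params
            if "joint_encoder" in name
            and name.endswith(".weight")
            and any(k in name for k in qkv_keys)}
```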

Key Experimental Results

Main Results — Test-Time Robustness (Severity=5)

| Model | VGGSound-2C mAcc | Drop | \(\rho\) | Kinetics-2C mAcc | Drop |
|---|---|---|---|---|---|
| UAVM | 27.41 | -38.39 | 0.42 | 48.06 | -30.06 |
| CAV-MAE | 35.54 | -29.96 | 0.54 | 58.15 | -29.95 |
| EquiAV | 33.78 | -28.12 | 0.55 | 63.73 | -22.29 |
| AudioCLIP | 11.14 | -15.64 | 0.41 | 23.57 | -27.44 |
| ImageBind | 10.25 | -17.93 | 0.36 | 26.82 | -25.64 |
| Wav2CLIP | 4.99 | -19.33 | 0.21 | 17.25 | -35.40 |

All models exhibit significant performance degradation at severity level 5.

TTA Results (VGGSound-2C, Severity=5)

| TTA Method | Mean Accuracy | vs. Source |
|---|---|---|
| Source (CAV-MAE) | 35.54 | — |
| TENT | 19.09 | -16.45 |
| SAR | 26.07 | -9.47 |
| EATA | 40.60* | +5.06 |
| READ | 35.28 | -0.26 |
| SuMi | 32.10 | -3.44 |
| AV2C (ours) | 40.60 | +5.06 |

Note: EATA and AV2C achieve the best results on VGGSound-2C. TENT degrades severely.

Ablation Study — Effect of Corruption Severity

  • The \(\rho\) of all models decreases monotonically as severity increases
  • Exception: the Interference corruption has relatively little impact on robustness (some models can still recognize samples correctly even under heavy frame rotation and audio silencing)
  • Digital corruptions (e.g., Gaussian noise) cause the largest performance drops

Key Findings

  1. Supervised models: EquiAV > CAV-MAE > UAVM; equivariant feature learning yields better robustness
  2. Self-supervised models: zero-shot generalization collapses under corruption — ImageBind degrades sharply, and relative robustness falls as low as \(\rho = 0.21\) (Wav2CLIP on VGGSound-2C)
  3. TTA methods broadly fail: LayerNorm updates in TENT/RPL/SAR lead to overfitting under dual-modality corruption
  4. Modality bias in READ: Under dual-modality corruption, cross-attention exhibits progressively increasing modality bias (visual→audio weight increases from 13.09 at \(t=0\) to 20.04 at \(t=100\))
  5. Prompt engineering is ineffective: Switching ImageBind to noise-aware prompts yields only marginal improvement

Highlights & Insights

  • First co-occurring correlated corruption benchmark: Fills a critical gap in audio-visual robustness evaluation
  • Comprehensive failure analysis: Systematically exposes the vulnerability of supervised/self-supervised models and TTA methods under dual-modality corruption
  • Discovery of modality bias: The shift in attention weights during adaptation in READ constitutes an interesting and previously unreported failure mode
  • Simplicity of AV2C: Low-entropy sample selection combined with QKV adaptation is conceptually straightforward yet effective

Limitations & Future Work

  1. AV2C achieves significant improvement only on VGGSound-2C; gains on Kinetics-2C are marginal
  2. The corruption suite can be further expanded (e.g., network latency, codec distortions)
  3. Large-scale foundation models (e.g., VideoLLaMA) are not evaluated
  4. Only online TTA is considered; offline or few-shot adaptation strategies are not explored

Related Work & Context

  • ImageNet-C (Hendrycks & Dietterich, 2019): Pioneering work on single-modality visual robustness benchmarking
  • READ (2024): The first audio-visual TTA method, yet shown to fail under dual-modality corruption
  • TENT (Wang et al., 2020): Foundational TTA method based on entropy minimization
  • Insight: The robustness of multimodal models cannot be simply inferred from single-modality results — co-occurring corruptions present fundamentally new challenges

Rating

⭐⭐⭐⭐ (4/5) The benchmark construction is solid and comprehensive, with broad experimental coverage; however, the proposed AV2C method offers limited improvement, and the work is more diagnostic than prescriptive.