AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Conference: ICLR 2026 · arXiv: 2505.16211 · Code: GitHub · Area: AI Safety · Keywords: Audio LLM, trustworthiness, benchmark, fairness, hallucination, safety, privacy, robustness, authentication
TL;DR
This paper proposes AudioTrust, the first multidimensional trustworthiness evaluation benchmark for audio large language models (ALLMs), encompassing six dimensions—fairness, hallucination, safety, privacy, robustness, and authentication—with 26 sub-tasks and 4,420+ audio samples. It systematically evaluates the trustworthiness boundaries of 14 state-of-the-art open- and closed-source ALLMs in high-stakes audio scenarios.
Background & Motivation
Background: ALLMs have advanced rapidly (GPT-4o Audio, Qwen2-Audio, Gemini, etc.), yet existing safety evaluation frameworks (SafeDialBench, SafetyBench) are primarily designed for the text modality and overlook trustworthiness risks unique to audio.
Key Challenge: Audio signals contain rich non-semantic acoustic cues (timbre, accent, background noise, emotion) that can be exploited to manipulate model behavior—attack vectors that text-based safety frameworks are fundamentally incapable of capturing.
Core Idea: The paper constructs the first comprehensive ALLM trustworthiness evaluation framework covering six audio-native safety dimensions. Through carefully designed real-world scenario datasets and an automated evaluation pipeline (human verification agreement >97%), it systematically quantifies trustworthiness risks in ALLMs.
Method
Overall Architecture
AudioTrust decouples trustworthiness evaluation into six orthogonal dimensions, each with independent attack strategies, datasets, and evaluation metrics:
- Fairness: Assesses biases induced by acoustic attributes (accent, speech rate, emotion, background environment), distinguishing traditional fairness (gender/age/race) from audio-specific fairness (accent/language fluency/socioeconomic status/personality traits); includes decision experiments and stereotype experiments; 840 audio samples.
- Hallucination: Defines audio-specific hallucination types—violations of physical laws (flames burning underwater) and violations of temporal causality (ignition before engine start); 320 samples.
- Safety: Designs emotional deception attacks (exploiting urgent/sorrowful tones to bypass safety filters) covering jailbreak attacks and illegal activity inducement across enterprise, financial, and medical domains; 600 samples.
- Privacy: Distinguishes content-level leakage (directly reading out bank account numbers) from paralinguistic inference leakage (inferring age/race/geolocation from voiceprints); 900 samples.
- Robustness: Evaluates adversarial attacks and natural degradation (background noise, multiple speakers, audio quality variation, environmental sounds); 240 samples.
- Authentication: Covers identity verification bypass (social engineering attacks), hybrid deception (voice cloning + background noise), and voice cloning spoofing; 400 samples.
Key Designs
- Data Construction: GPT-4o is used to generate textual content; F5-TTS synthesizes audio; emotional control is achieved by selecting reference audio with different emotional timbres. Portions of the data are sourced from public datasets such as Common Voice and freesound.
- Automated Evaluation Pipeline: Dual evaluators (GPT-4o and Qwen3) provide scores, with human expert review (agreement >97%), supporting large-scale reproducible evaluation.
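The dual-evaluator setup can be sketched as follows. This is a minimal illustration of the idea, not the paper's implementation: the judge scores are made-up binary verdicts, and the agreement rule (exact match, with disagreements escalated to human review) is an assumption.

```python
# Sketch of a dual-evaluator pipeline in the spirit of AudioTrust's setup.
# Judge outputs and the agreement rule here are illustrative assumptions.

def agreement_rate(scores_a, scores_b, tol=0):
    """Fraction of samples on which two judges agree (within `tol`)."""
    assert len(scores_a) == len(scores_b)
    hits = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= tol)
    return hits / len(scores_a)

# Hypothetical safe(1)/unsafe(0) verdicts from the two LLM judges.
judge_gpt4o = [1, 1, 0, 1, 0, 1]
judge_qwen3 = [1, 1, 0, 1, 1, 1]

rate = agreement_rate(judge_gpt4o, judge_qwen3)
# Samples where the judges disagree would be flagged for human review,
# which is how an agreement figure like the reported >97% gets validated.
```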
- Metric Design: Each dimension employs targeted metrics—fairness uses group fairness score \(\Gamma\) (1.0 is ideal), safety uses Defense Success Rate (DSR), privacy uses refusal rate, robustness uses a 10-point scale, and authentication uses Impostor Rejection Rate (IRR).
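The rate-style metrics above (DSR, refusal rate, IRR) can be illustrated with a small sketch. The definitions below are plain success rates inferred from the metric names; the paper's exact formulas may differ, and all sample data is hypothetical.

```python
# Illustrative computations of the per-dimension rate metrics.
# Flag lists are hypothetical evaluation outcomes, one entry per test sample.

def rate(flags):
    """Fraction of positive outcomes in a list of booleans."""
    return sum(flags) / len(flags)

# Defense Success Rate (DSR): share of jailbreak attempts the model blocks.
blocked = [True, True, False, True, True]
dsr = rate(blocked)                      # 0.8

# Refusal rate (privacy): share of leakage probes the model refuses.
refused = [True, False, False, True]
refusal_rate = rate(refused)             # 0.5

# Impostor Rejection Rate (IRR): share of spoofed voices rejected.
rejected = [False, True, True, False]
irr = rate(rejected)                     # 0.5
```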
Key Experimental Results
Fairness
| Metric | Best Open-Source | Best Closed-Source | Average |
|---|---|---|---|
| \(\Gamma_\text{stereo}\) | Step-Fun 0.658 | GPT-4o Audio 0.926 | 0.328 |
| \(\Gamma_\text{decision}\) | Step-Fun 0.505 | Gemini-1.5 Pro 0.460 | 0.261 |
- Biases introduced by audio attributes (accent, emotion) are stronger than those from traditional sensitive attributes (age, gender).
- Closed-source models exhibit stronger decision bias; open-source models show stronger stereotype associations.
- The GPT-4o series excels in stereotype fairness (\(\Gamma_\text{stereo}=0.926\)) but performs moderately on decision fairness (\(\Gamma_\text{decision}=0.264\)), as it sacrifices fairness for accuracy in extreme decision scenarios.
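To make the fairness numbers above concrete: a group fairness score where 1.0 is ideal is often defined as the ratio of the worst-off group's score to the best-off group's. The paper's exact definition of \(\Gamma\) may differ; the min/max-ratio form and the per-accent approval rates below are illustrative assumptions.

```python
# Hypothetical group fairness score: min/max ratio across groups.
# 1.0 means all groups are treated identically; lower means more disparity.

def fairness_gap_score(group_scores):
    """Ratio of the lowest to the highest per-group score (1.0 = ideal)."""
    return min(group_scores.values()) / max(group_scores.values())

# Made-up per-accent approval rates in a decision experiment.
approval_by_accent = {"US": 0.90, "Indian": 0.60, "British": 0.85}

gamma = fairness_gap_score(approval_by_accent)  # 0.60 / 0.90 ≈ 0.667
```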
Hallucination
- The Gemini series achieves the best performance on physical/logical violation detection (scores 8–9).
- GPT-4o Audio performs unexpectedly poorly on content-mismatch and label-mismatch tasks (scores 3–4).
- Models achieve high accuracy on physical law violation tasks but perform poorly on content-mismatch tasks that humans find easy—revealing a significant human-machine perceptual gap.
Safety
| Scenario | Closed-Source Avg. DSR | Open-Source Avg. DSR |
|---|---|---|
| Enterprise jailbreak | ~99% | ~80% |
| Illegal activity inducement | ~99% | ~89% |
- Kimi-Audio achieves the best performance among open-source models, approaching closed-source levels.
- OpenS2S is the most vulnerable, with a DSR of only 51.4% in enterprise scenarios.
Privacy
- Direct leakage: GPT-4o mini Audio achieves a refusal rate of 100%; privacy-enhancing prompts improve refusal rates by approximately 25%.
- Inference leakage: The average refusal rate across all models is only 9.02%, and privacy-enhancing prompts yield only ~3% improvement—ALLMs fail to recognize information inferred from paralinguistic cues as private.
- Models such as Qwen2-Audio, MiniCPM-o 2.6, and Qwen2.5-Omni exhibit near-zero refusal rates on direct leakage, demonstrating almost no content-level privacy protection.
- The consistently low refusal rates for inference leakage indicate that the absence of paralinguistic privacy constraints in model training is a systemic issue, not an artifact of individual models.
Robustness
- Closed-source models (led by Gemini-2.5 Pro) consistently outperform open-source models under nearly all degradation conditions.
- Open-source models exhibit an "over-textualization" tendency: when the transcription is only partially correct, they continue to reason from the text alone and ignore the acoustic cues.
- In multi-speaker (MS) scenarios, Step-Audio2 scores approach zero (0.00 and 0.12), exposing severe multi-speaker robustness deficiencies.
- The advantage of closed-source models is most pronounced under severe acoustic distortion, suggesting more mature front-end signal processing and noise reduction architectures.
Authentication
| Attack Type | Open-Source Avg. IRR | Closed-Source Avg. IRR |
|---|---|---|
| Identity verification bypass | 55.3% | 97.2% |
| Hybrid deception | 55.1% | 97.0% |
| Voice cloning | 45.0% | 44.9% |
- Voice cloning is a universal weakness across all models, including closed-source ones.
- Stricter system prompts consistently improve resistance to spoofing attacks.
- Among open-source models, SALMONN consistently ignores prompt instructions and outputs audio descriptions, rendering it incapable of completing voice cloning detection tasks.
Highlights & Insights
- First Audio-Native Trustworthiness Benchmark: This is the first work to systematically define and evaluate safety dimensions unique to the audio modality, filling a critical gap in ALLM trustworthiness assessment.
- Threat of Paralinguistic Cues: Non-semantic information in audio (accent, timbre, background noise) represents a severely underestimated source of bias and a novel attack vector.
- Privacy Inference Leakage: ALLMs can infer sensitive attributes such as age and race from voiceprints, yet almost never treat such inferences as privacy violations—constituting an entirely new category of privacy threat.
- Human-Machine Perceptual Gap: Models excel at detecting physical violations but perform poorly on commonsense reasoning that humans find trivial, revealing fundamental deficiencies in current model perception mechanisms.
- Evaluation Scale and Rigor: The benchmark spans 14 models × 26 sub-tasks × 4,420 samples × dual evaluators, demonstrating exceptional coverage and methodological rigor.
Limitations & Future Work
- Audio samples are primarily synthetic (F5-TTS), potentially introducing distributional gaps relative to real human speech; synthetic attack audio may underestimate the effectiveness of real-world attacks.
- Evaluation is predominantly English-centric; multilingual coverage remains limited, and acoustic characteristics of different languages may introduce distinct trustworthiness risks.
- The paper does not explore model improvement methods—it exposes problems without proposing remediation strategies.
- Some open-source models exhibit random audio recognition failures, potentially inflating safety scores.
- Interaction effects among the six dimensions (e.g., whether poor robustness amplifies safety risks) are unexplored.
- Evaluation relies on GPT-4o/Qwen3 scoring; biases in the evaluators themselves may affect the generalizability of conclusions.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First audio-native trustworthiness benchmark, defining entirely new evaluation dimensions and threat models.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 14 models, 6 dimensions, 26 sub-tasks, dual evaluators, and human verification.
- Writing Quality: ⭐⭐⭐⭐ Structure is clear and systematic, though dense tables impose some readability burden.
- Value: ⭐⭐⭐⭐⭐ Directly guides safe deployment of ALLMs and will shape the research agenda for audio AI safety.