Analyzing Reasoning Shifts in Audio Deepfake Detection under Adversarial Attacks: The Reasoning Tax versus Shield Bifurcation

Conference: ACL 2026
arXiv: 2601.03615
Code: None
Area: LLM Security / Audio Deepfake Detection
Keywords: Audio Language Model, Chain-of-Thought, Adversarial Robustness, Cognitive Dissonance, Forensic Auditing

TL;DR

This paper designs a "three-dimensional forensic auditing" framework (acoustic perception / cognitive coherence / cognitive dissonance) for Audio Language Models (ALMs) performing deepfake detection with Chain-of-Thought (CoT) reasoning. It discovers that CoT is not a universal enhancement—it serves as a "Shield" for models with strong acoustic perception (Qwen2-Audio) but acts as a "Reasoning Tax" for models with weak perception (Gemma-3n, Phi-4). Furthermore, when a model is compromised, high cognitive dissonance can serve as a "Silent Alarm" to alert human auditors.

Background & Motivation

Background: Audio Deepfake Detection (ADD) has traditionally relied on binary black-box classifiers (RawNet-2 / AASIST-2 / CLAD). Recently, Audio Language Models (ALMs) with CoT reasoning (Qwen2-Audio, Phi-4-multimodal, gemma-3n-E4B, granite-3.3-8b) have been introduced for "glass-box" adjudication—providing both "Fake/Real" labels and intermediate reasoning steps.

Limitations of Prior Work: The industry defaults to the belief that "explicit reasoning \(\Rightarrow\) higher robustness," but there has been no systematic auditing of how reasoning itself "drifts" under adversarial attacks. Existing interpretability methods (occlusion, attention rollout, SHAP) are post-hoc visualizations that do not reveal whether reasoning truly supports the conclusion or if it changes covertly under attack.

Key Challenge: In forensic scenarios, binary "Fake/Real" labels are insufficient. Auditors need to understand why a model reached a decision, whether the reasoning is consistent with the judgment, and if the reasoning retains any forensic value even if the final judgment is incorrect.

Goal: The study is organized around three research questions: ① Is the ALM's description truly grounded in the raw audio (RQ1 Acoustic Perception)? ② Does the reasoning chain logically support the final conclusion (RQ2 Cognitive Coherence)? ③ Can the reasoning layer serve as a "Silent Alarm" when the judgment is compromised (RQ3 Cognitive Dissonance)?

Key Insight: Borrowing from the courtroom tradition of "qualifying the witness"—first perform a voir dire to verify the model's "hearing," then examine its "logic," and finally evaluate whether it "realizes" when its judgment is wrong.

Core Idea: Shift the evaluation of reasoning robustness from the "final-label" dimension to a three-dimensional framework of perception + coherence + dissonance, using differential metrics \(\Delta\Phi\) and \(\Delta\Psi\) to quantify shifts in reasoning morphology between "Original vs. Adversarial" states.

Method

Overall Architecture

Input: Audio \(X\) (clean or adversarial \(\tilde{X}=\mathrm{Adv}(X,\theta)\)) + ALM \(\mathcal{F}\). Output: \(Y=\{r_1,\dots,r_N,c\}\), containing free-text reasoning across \(N=6\) forensic dimensions (Prosody / Disfluency / Speed / Speaking Style / Liveliness / Quality) + a final label \(c\in\{\text{fake},\text{real}\}\). The auditing framework computes three types of metrics in parallel:

  • Perception \(\Phi_{\text{Perc}}\): Verifies whether the acoustic attributes perceived by the model match a ground-truth (GT) question bank \(\mathcal{Q}_k\).
  • Coherence \(\Phi_{\text{Coh}}\): Determines whether reasoning \(r_i\) logically entails the current label \(c\).
  • Dissonance \(\Psi_{\text{Diss}}\): On misclassified samples, measures whether \(r_i\) and \(c\) contradict each other (high dissonance = the reasoning is still signaling "something is wrong").

Differential \(\Delta\) metrics compare morphological shifts between the ORG (original) and PER (adversarial) states.
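The three metrics above reduce to simple averages once the judge outputs are available as binary values. The following is a minimal sketch, not the authors' implementation: the `AuditRecord` schema and all function names are hypothetical, and the binary verification and entailment results are assumed to have already been produced by an external judge.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AuditRecord:
    """Judge outputs for one sample and one forensic dimension (hypothetical schema)."""
    perception_checks: List[int]  # V(X, q) in {0, 1}, one entry per question q in Q_k
    entails: int                  # E(r_i, c) in {0, 1}
    label_correct: bool           # whether the final label c matches the ground truth


def perception_score(records: List[AuditRecord]) -> float:
    """Phi_Perc: fraction of question-bank checks that pass, over all samples."""
    if not records:
        raise ValueError("no records")
    n_questions = len(records[0].perception_checks)
    if any(len(r.perception_checks) != n_questions for r in records):
        raise ValueError("every sample must use the same question bank")
    passed = sum(sum(r.perception_checks) for r in records)
    return passed / (len(records) * n_questions)


def coherence_score(records: List[AuditRecord]) -> float:
    """Phi_Coh: fraction of samples whose reasoning entails the predicted label."""
    if not records:
        raise ValueError("no records")
    return sum(r.entails for r in records) / len(records)


def dissonance_score(records: List[AuditRecord]) -> Optional[float]:
    """Psi_Diss: among misclassified samples, fraction whose reasoning contradicts the label."""
    wrong = [r for r in records if not r.label_correct]
    if not wrong:
        return None  # undefined when nothing was misclassified
    return sum(1 - r.entails for r in wrong) / len(wrong)


# Toy records for illustration only; these are not values from the paper.
toy = [
    AuditRecord([1, 1, 0, 1], entails=1, label_correct=True),
    AuditRecord([1, 0, 0, 1], entails=0, label_correct=False),
    AuditRecord([1, 1, 1, 1], entails=1, label_correct=False),
    AuditRecord([0, 1, 1, 1], entails=1, label_correct=True),
]
print(perception_score(toy), coherence_score(toy), dissonance_score(toy))
```

The functions return fractions in [0, 1]; the tables in this summary report the same quantities as percentages.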

Key Designs

  1. Three-Dimensional Forensic Auditing Metrics (Perception / Coherence / Dissonance):

    • Function: Decomposes "reasoning robustness" into three semantically independent and measurable forensic dimensions.
    • Mechanism: Defines a verification function \(\mathcal{V}:(X,q)\mapsto\{0,1\}\) and an entailment function \(\mathcal{E}:(r_i,c)\mapsto\{0,1\}\). With \(j\) indexing samples in the dataset \(\mathcal{D}\), the metrics are defined as \(\Phi_{\text{Perc}}(r_k)=\frac{1}{|\mathcal{D}|\cdot|\mathcal{Q}_k|}\sum_{j\in\mathcal{D}}\sum_{q\in\mathcal{Q}_k}\mathcal{V}(X^j,q)\), \(\Phi_{\text{Coh}}(r_i)=\frac{1}{|\mathcal{D}|}\sum_{j\in\mathcal{D}}\mathcal{E}(r_i^j,c^j)\), and \(\Psi_{\text{Diss}}(r_i)=\frac{1}{|\mathcal{D}_{\text{Wrong}}|}\sum_{j\in\mathcal{D}_{\text{Wrong}}}\bigl(1-\mathcal{E}(r_i^j,c^j)\bigr)\). These are implemented via a frontier LLM ensemble (GPT-5 / Gemini-3).
    • Design Motivation: High coherence is not inherently good (it could be "self-consistent hallucination"). Dissonance is needed to distinguish between "confident errors" and "struggling failures."
  2. Differential Metrics \(\Delta\Phi, \Delta\Psi\) and Failure Mode Classification:

    • Function: Automatically labels model failure modes (Coherence Erosion / Resistance / Silent Alarm / Systemic Deception) using the delta between "Adversarial vs. Original" states.
    • Mechanism: \(\Delta\Phi_{\text{Coh}}=\Phi_{\text{Coh}}^{\text{PER}}-\Phi_{\text{Coh}}^{\text{ORG}}\). \(\Delta\Phi\ll 0\) indicates reasoning collapse ("Panic"), while \(\Delta\Phi\ge 0\) with an incorrect judgment indicates rationalized hallucination. \(\Delta\Psi\ge 0\) indicates a Silent Alarm.
    • Design Motivation: Looking at the PER state alone confuses inherent weakness with attack-induced degradation. Differentials isolate the true reasoning shift caused by the attack.
  3. Acoustic vs. Linguistic Dual-Track Adversarial Protocol:

    • Function: Splits adversarial attacks into two independent tracks to induce "Panic" and "Rationalization" failure modes respectively.
    • Mechanism: Acoustic attacks use CLAD recipes (Background Noise, Time & Pitch shifts, Shape & Space distortions). Linguistic attacks use TAPAS+TextFooler for synonym substitution, followed by Kokoro TTS synthesis, which preserves the original voice but complicates the prosody.
    • Design Motivation: Acoustic attacks destroy perceptual evidence (leaving artifacts), while linguistic attacks only change transcript complexity. This contrast reveals the bifurcation of Tax vs. Shield.
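The differential metrics and failure-mode labels from Key Design 2 can be sketched as a small rule table. This is an illustrative reading of the summary, not the paper's exact procedure: the `collapse_margin` threshold standing in for "\(\Delta\Phi\ll 0\)", the rule ordering, and the function names are all assumptions.

```python
def delta(per: float, org: float) -> float:
    """Differential metric: value in the adversarial (PER) state minus the original (ORG) state."""
    return per - org


def classify_failure_mode(
    d_coh: float,
    d_diss: float,
    judgment_wrong: bool,
    collapse_margin: float = -10.0,  # hypothetical cutoff for "much less than zero"
) -> str:
    """Map (delta coherence, delta dissonance) in percentage points to a failure-mode label."""
    if not judgment_wrong:
        return "Resistance"          # the attack did not flip the judgment
    if d_coh <= collapse_margin:
        return "Coherence Erosion"   # reasoning collapses ("Panic")
    if d_diss >= 0:
        return "Silent Alarm"        # reasoning still contradicts the wrong label
    if d_coh >= 0:
        return "Systemic Deception"  # rationalized hallucination
    return "Unclassified"            # mild coherence drop with falling dissonance


# Deltas taken from the Gemma-3n-E4B rows of the attack table in this summary.
print(classify_failure_mode(-27.4, -15.5, judgment_wrong=True))  # acoustic
print(classify_failure_mode(22.9, -1.4, judgment_wrong=True))    # linguistic
```

With the assumed margin of 10 percentage points, the two Gemma rows land on the same labels the table assigns to them; a different margin or rule order would shift the boundaries.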

Loss & Training

All ALMs are fine-tuned using QLoRA on a CoT dataset synthesized from ASVSpoof 2019 + DeepSeek-R1 cold-start traces: BF16, AdamW (\(\beta_1=0.9, \beta_2=0.95\)), LR=\(1\times10^{-4}\) with linear decay, global batch=16, weight decay=0.1. LoRA rank=16, \(\alpha=64\). Loss is calculated only on Assistant tokens (reasoning + judgment). The CoT data was refined to 25,108 samples via 3-round majority-vote bootstrapping.
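Two parts of this setup are easy to make concrete: the hyperparameters, and the restriction of the loss to Assistant tokens. The sketch below is framework-free and hedged: the dictionary keys and the `mask_non_assistant` helper are hypothetical names, and the use of `-100` as the ignore value is an assumption borrowed from the default `ignore_index` of PyTorch's cross-entropy loss, not something the source states.

```python
from typing import List

# Hyperparameters as reported in this summary; key names are illustrative.
HPARAMS = {
    "precision": "bf16",
    "optimizer": "adamw",
    "adam_betas": (0.9, 0.95),
    "learning_rate": 1e-4,
    "lr_schedule": "linear_decay",
    "global_batch_size": 16,
    "weight_decay": 0.1,
    "lora_rank": 16,
    "lora_alpha": 64,
}

# Assumed ignore value: torch.nn.CrossEntropyLoss skips targets equal to -100 by default.
IGNORE_INDEX = -100


def mask_non_assistant(input_ids: List[int], is_assistant: List[bool]) -> List[int]:
    """Build training labels that keep only Assistant tokens (reasoning + judgment).

    Positions belonging to the system or user turns are replaced by IGNORE_INDEX
    so that they contribute nothing to the loss.
    """
    if len(input_ids) != len(is_assistant):
        raise ValueError("input_ids and is_assistant must have the same length")
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, is_assistant)]


# Toy sequence: two prompt tokens followed by three assistant tokens.
labels = mask_non_assistant([11, 12, 21, 22, 23], [False, False, True, True, True])
print(labels)  # [-100, -100, 21, 22, 23]
```

Masking at the label level leaves the prompt tokens in the model's context while removing them from the training objective.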

Key Experimental Results

Main Results

Evaluation on ASVSpoof 2019 LA with 4 ALMs (NON = standard classification / RSN = explicit reasoning) + 3 traditional ADD baselines.

| Model | Mode | Acc. | Real F1 | Fake F1 |
|---|---|---|---|---|
| AASIST-2 | Binary | 99.58% | 98.02% | 99.77% |
| Qwen2-Audio-7B | NON | 98.00% | 91.19% | 98.88% |
| Qwen2-Audio-7B | RSN | 98.20% | 91.70% | 99.00% |
| granite-3.3-8b | NON | 99.87% | 99.39% | 99.93% |
| granite-3.3-8b | RSN | 96.11% | 78.39% | 97.88% |
| gemma-3n-E4B | NON | 99.89% | 99.52% | 99.94% |
| gemma-3n-E4B | RSN | 95.63% | 81.95% | 97.73% |

Only Qwen2-Audio maintains or improves performance under RSN; Gemma’s Real F1 drops from 99.52% to 81.95%—a typical reasoning tax.
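The size of the tax can be read off as the RSN minus NON difference per model. The short sketch below does this arithmetic on the values from the main results table; the dictionary layout and the `reasoning_delta` name are illustrative choices, not part of the paper.

```python
# (Acc., Real F1, Fake F1) in percent, copied from the main results table.
RESULTS = {
    "Qwen2-Audio-7B": {"NON": (98.00, 91.19, 98.88), "RSN": (98.20, 91.70, 99.00)},
    "granite-3.3-8b": {"NON": (99.87, 99.39, 99.93), "RSN": (96.11, 78.39, 97.88)},
    "gemma-3n-E4B": {"NON": (99.89, 99.52, 99.94), "RSN": (95.63, 81.95, 97.73)},
}
METRICS = ("acc", "real_f1", "fake_f1")


def reasoning_delta(model: str) -> dict:
    """RSN minus NON in percentage points; negative values indicate a reasoning tax."""
    non, rsn = RESULTS[model]["NON"], RESULTS[model]["RSN"]
    return {name: round(r - n, 2) for name, n, r in zip(METRICS, non, rsn)}


for model in RESULTS:
    print(model, reasoning_delta(model))
```

The Real F1 column carries most of the effect: roughly a 21-point drop for granite and a 17.57-point drop for Gemma, against a 0.51-point gain for Qwen2-Audio.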

Ablation Study (\(\Delta\) Metrics under Acoustic vs. Linguistic Attacks)

| Model | Attack Type | ASR (%) | \(\Phi_{\text{Coh}}^{PER}\) (Δ vs. ORG) | \(\Psi_{\text{Diss}}^{PER}\) (Δ vs. ORG) | Failure Mode |
|---|---|---|---|---|---|
| Qwen2-Audio (RSN) | Acoustic | 45.7 | 78.0 (↓8.2) | 29.2 (↓16.8) | Reasoning Shield |
| Qwen2-Audio (RSN) | Linguistic | 31.5 | 80.6 (↓7.3) | 9.6 (↑6.8) | Robust |
| Gemma-3n-E4B (RSN) | Acoustic | 49.1 | 43.2 (↓27.4) | 67.9 (↓15.5) | Coherence Erosion |
| Gemma-3n-E4B (RSN) | Linguistic | 82.8 | 86.9 (↑22.9) | 11.2 (↓1.4) | Systemic Deception |
| Phi-4 (RSN) | Acoustic (Noise) | 44.8 | 72.1 | 41.3 (↑) | Silent Alarm Example |

In linguistic attacks on "American Female" voices, Gemma reached ASR=100% (completely fooled) while maintaining 95.3% coherence and only 4.7% dissonance—"confidently talking nonsense."

Key Findings

  • Tax vs. Shield is determined by Acoustic Perception: Qwen2-Audio is the only model with >80% perception scores across Prosody/Speed/Disfluency, and the only one where RSN does not underperform NON. For models with poor perception, CoT becomes a new attack surface for "verbal overshadowing."
  • Strong Negative Correlation between Coherence and Dissonance (\(r=-0.79, p<.001\)): Models trade off between being self-consistent but deceptive and being internally conflicted with disorganized expression.
  • Attack Modality Dictates Failure Mode: Acoustic attacks \(\rightarrow\) Panic (coherence ↓ + dissonance ↑); Linguistic attacks \(\rightarrow\) Rationalization Trap (coherence ↑ + dissonance ↓).
  • Silent Alarm efficacy: In 78.2% of samples where Gemma was compromised by Shape-Space attacks, dissonance remained high, proving that reasoning can still signal useful forensic information to human auditors even when the judgment is wrong.
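The coherence/dissonance correlation in the findings above is a plain Pearson coefficient. As a hedged illustration, the sketch below computes it over the five (coherence, dissonance) pairs listed in the attack table of this summary. That is a far smaller set than whatever the paper used for its reported \(r=-0.79\), so the result is not expected to match that figure; the `pearson_r` helper is written by hand here purely to keep the example self-contained.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# PER-state coherence and dissonance (percent) from the five attack-table rows.
coherence = [78.0, 80.6, 43.2, 86.9, 72.1]
dissonance = [29.2, 9.6, 67.9, 11.2, 41.3]

r = pearson_r(coherence, dissonance)
print(f"r = {r:.2f}")  # strongly negative on these five rows
```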

Highlights & Insights

  • The "Reasoning Tax vs. Shield" duality: Disrupts the naive assumption that adding CoT is always beneficial and provides a falsifiable criterion (Strong Perception \(\rightarrow\) Shield, Weak \(\rightarrow\) Tax).
  • Cognitive Dissonance as a Silent Alarm: A true forensic innovation. While traditional robustness focuses on labels, this paper demonstrates that "wrong judgment but screaming reasoning" is a vital forensic signal.
  • Model Profiling: The use of perception question banks and frontier-model ensembles for GT generation provides a robust template for "LLM-evaluating-LLM" in multimodal tasks.

Limitations & Future Work

  • Evaluated primarily on ASVSpoof 2019 LA (English); generalizability to "In-the-Wild" datasets or other languages is unconfirmed.
  • Model scale was limited to 7-8B; whether larger models (e.g., Qwen3-Omni-30B) overcome the reasoning tax remains a scaling law question.
  • The work is diagnostic; it does not yet provide training objectives or architectural fixes to mitigate the reasoning tax.
  • Forensic dimensions (\(N=6\)) were fixed; sensitivity analysis on the number of reasoning dimensions was not performed.

Related Work

  • vs. ALLM4ADD (Gu et al. 2025): While both use ALMs for ADD, this work is the first to systematically audit the drift of the reasoning itself under attack.
  • vs. CoT Robustness (NLP domain): While previous work looked at CoT robustness in text-only settings, this paper reveals that the "perception bottleneck" is the true constraint on CoT utility in multimodal settings.
  • vs. TAPAS (Nguyen et al. 2025): Reuses the linguistic attack protocol but shifts focus from "label fooling" to "reasoning redirection."

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The dual concepts of Tax vs. Shield and Silent Alarm are formal firsts, challenging industry assumptions.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Good coverage across 4 ALMs and multiple attack modalities, though it lacks cross-dataset generalization.
  • Writing Quality: ⭐⭐⭐⭐⭐ Extremely clear structure; metaphors like "panic" and "silent alarm" are effective and impactful.
  • Value: ⭐⭐⭐⭐ High practical value for the deployment of forensic audio AI; the Silent Alarm concept is extensible to other high-stakes ML systems.