Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
Conference: ICML 2026
arXiv: 2605.04700
Code: Not released
Area: Audio & Speech / AI Safety
Keywords: Audio Language Models, Jailbreak Attacks, Sparse Optimization, Gradient Heterogeneity, Adversarial Perturbations
TL;DR
This paper discovers that waveform gradients in Audio Language Model (ALM) jailbreak optimization are highly concentrated on a few tokens. It proposes TAGO, which updates only the waveform segments corresponding to the top-\(\zeta\) high-energy tokens at each step. On Qwen3-Omni, retaining only 25% of tokens maintains an 86% LLM-judge jailbreak success rate (vs. 87% for full tokens).
Background & Motivation
Background: ALMs (e.g., Qwen-Omni / LLaMA-Omni) feed speech directly into LLM backbones to generate natural language responses and are widely deployed in human-computer interaction. Like text LLMs, they are exposed to jailbreak attacks. Existing ALM jailbreak attacks (SpeechGuard / AdvWave) follow the GCG approach from the text side, using "target prefix + Teacher Forcing cross-entropy" as the loss and performing dense PGD updates on the entire waveform.
Limitations of Prior Work: Audio waveforms are extremely high-dimensional (tens of thousands of samples per second). Dense updates over the entire waveform are slow and spend gradient budget on redundant signal (long silent sections, steady-state vowel regions). On models with strong safety alignment, such as Qwen3-Omni, existing attacks reach an ASR\(_l\) (LLM-judged attack success rate) of at most 45% (SpeechGuard 100%/42% and AdvWave 70%/45% in ASR\(_r\)/ASR\(_l\)), indicating that dense optimization is neither efficient nor strong enough.
Key Challenge: The authors deconstruct this problem from the "structural" level of optimization signals. Dense updates implicitly assume that gradient energy is uniformly distributed across tokens. However, ALM audio tokens are generated by front-end convolution/downsampling, where each token corresponds to a receptive field. Different tokens have vastly different impacts on the jailbreak prefix probability. If gradients are highly concentrated, distributing updates across all tokens dilutes the truly effective directions.
Goal: (1) Quantify the degree of token-level gradient non-uniformity in ALM jailbreak optimization; (2) Design a jailbreak algorithm that enforces token-level sparsity during the optimization process; (3) Test whether "dense optimization followed by pruning" is equivalent (the experiments show it is not).
Key Insight: Aggregate waveform gradients according to the token receptive fields at each step to obtain the token-aligned gradient energy \(\tilde{g}_i^{(k)}\). Characterize its distribution using three metrics: coefficient of variation (CV), top-\(q\) mass, and \(q_\alpha\). Evaluations on Qwen3-Omni show that the top 16% of audio tokens account for 90% of the gradient energy (under sum aggregation: CV = 2.74, \(q_{0.9}=9.64\) out of an average of 60 tokens).
Core Idea: Use a token-aware sparse mask to apply gradients at each step only to the receptive fields of the top-\(\zeta\) high-energy tokens, masking other positions to 0. Simultaneously, combine "model-compatible prefix templates" with an EOS suppression term to systematically bypass the "prefix alignment" shortcut.
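The three concentration metrics named above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes every token has a non-overlapping receptive field of equal length `token_len` (a simplification of \(\mathcal{R}(i)\)), and the function names are hypothetical.

```python
import numpy as np

def token_energy(grad, token_len):
    """Sum squared waveform gradients over each token's receptive field.

    Assumption: receptive fields are consecutive, non-overlapping windows
    of `token_len` samples; trailing samples that do not fill a window are dropped.
    """
    grad = np.asarray(grad, dtype=float)
    n_tokens = len(grad) // token_len
    squared = grad[: n_tokens * token_len] ** 2
    return squared.reshape(n_tokens, token_len).sum(axis=1)

def coefficient_of_variation(energy):
    """Standard deviation of token energy divided by its mean."""
    return float(np.std(energy) / np.mean(energy))

def top_q_mass(energy, q):
    """Fraction of total energy held by the q highest-energy tokens."""
    ordered = np.sort(energy)[::-1]
    return float(ordered[:q].sum() / ordered.sum())

def q_alpha(energy, alpha):
    """Smallest number of tokens whose cumulative energy reaches a fraction alpha."""
    ordered = np.sort(energy)[::-1]
    cumulative = np.cumsum(ordered) / ordered.sum()
    count = int(np.searchsorted(cumulative, alpha)) + 1
    return min(count, len(ordered))
```

A high CV together with a small \(q_\alpha\) relative to the token count is what the note describes as gradient concentration; the reported \(q_{0.9}=9.64\) is non-integer because it is an average over samples.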
Method
Overall Architecture
Input: Original benign audio \(x\in\mathbb{R}^L\), fixed text prompt \(t_{1:n}\), harmful query \(q\), white-box ALM \(\theta\), token retention ratio \(\zeta\), perturbation budget \(\epsilon\), maximum iterations \(K\), early-stop threshold \(\tau\).

Output: Adversarial audio \(x+\delta\) that causes the ALM response to start with the target prefix \(r_{1:m}\) and continue generating harmful content.

Procedure:
- (1) Construct prefix \(r_{1:m}\leftarrow\mathsf{Prefix}(q)\) using the model's own benign response style.
- (2) Compute loss \(\mathcal{L}\) (prefix cross-entropy + \(L_2\) penalty + EOS suppression) in the forward pass.
- (3) Backpropagate to obtain waveform gradients, aggregate them into token-aligned gradients by receptive field, and select the top-\(\lceil\zeta T\rceil\) tokens to form a binary mask \(M^{(k)}\).
- (4) Perform PGD updates with clipping only within the mask: \(\delta^{(k+1)}=\mathrm{Clip}_{[-\epsilon,\epsilon]}(\delta^{(k)}-\eta(M^{(k)}\odot\nabla_\delta\mathcal{L}))\).
- (5) Terminate early when prefix CE falls below \(\tau(\rho)=-\log\rho\).
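Steps (3) and (4) of the procedure can be sketched as follows. This is a hedged NumPy illustration rather than the paper's implementation: the gradient is passed in as a plain array (in practice it would come from backpropagation through the ALM), receptive fields are assumed to be equal-length, non-overlapping windows of `token_len` samples, and `token_mask` and `masked_pgd_step` are hypothetical names.

```python
import numpy as np

def token_mask(grad, token_len, zeta):
    """Binary waveform mask over the receptive fields of the top-ceil(zeta*T) tokens.

    Assumption: token i covers samples [i*token_len, (i+1)*token_len).
    """
    grad = np.asarray(grad, dtype=float)
    n_tokens = len(grad) // token_len
    # Token-level energy: sum of squared sample gradients per receptive field.
    energy = (grad[: n_tokens * token_len] ** 2).reshape(n_tokens, token_len).sum(axis=1)
    k = int(np.ceil(zeta * n_tokens))
    keep = np.argsort(energy)[::-1][:k]  # indices of the k highest-energy tokens
    mask = np.zeros(len(grad))
    for i in keep:
        mask[i * token_len : (i + 1) * token_len] = 1.0
    return mask

def masked_pgd_step(delta, grad, mask, eta, eps):
    """One update: delta <- Clip_[-eps, eps](delta - eta * (mask * grad))."""
    return np.clip(delta - eta * mask * np.asarray(grad, dtype=float), -eps, eps)
```

Because the mask is recomputed from the current gradient at every iteration, the selected tokens can change from step to step, which is the dynamic selection the method relies on.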
Key Designs
- Sparse Token-Selective Optimization:
- Function: Aggregates waveform-level gradients into token-level energy based on the audio token receptive fields and retains only waveform gradients corresponding to top-\(\zeta\) tokens in each iteration.
- Mechanism: Map each pre-attention audio token \(\Phi_i(x)\) generated by the front-end \(\Phi(\cdot)\) to a unique waveform interval \(\mathcal{R}(i)\subseteq\{1,\dots,L\}\). Define sample-level energy \(g^{(k)}(s)=([\nabla_\delta\mathcal{L}]_s)^2\) and token-level energy \(\tilde{g}_i^{(k)}=\sum_{s\in\mathcal{R}(i)}g^{(k)}(s)\). Select top indices \(\mathcal{S}^{(k)}\) based on \(\tilde{g}_i^{(k)}\) to obtain mask \(M^{(k)}=\mathbf{1}_{\cup_{i\in\mathcal{S}^{(k)}}\mathcal{R}(i)}\). The update rule is the masked PGD step given under Overall Architecture.
- Design Motivation: Gradient analysis shows the top 16% of tokens hold 90% of the energy. Dense updates distribute the perturbation budget to low-energy tokens, reducing the effective step size per token. Dynamic (re-selected every step) rather than static masking follows the shift in the optimization trajectory, preventing "early high-energy tokens from becoming ineffective later."
- Model-Compatible Target Prefix:
- Function: Avoids hand-writing prefixes like "Sure, here is..." for every harmful query, which may not match the ALM's actual response style. Otherwise, Teacher Forcing would force the model out of distribution, increasing optimization difficulty.
- Mechanism: First query the target ALM with a small batch of benign prompts and extract the first sentences of its responses as a template \(\mathsf{Prefix}(\cdot)\) with placeholders. For any harmful query \(q\), instantiate it as \(r_{1:m}(q)=\mathsf{Prefix}(q)\) for Teacher Forcing optimization.
- Design Motivation: Response styles vary significantly across ALMs (Qwen-Omni vs. LLaMA-Omni). Using the model's own style keeps the "prefix alignment" problem on the output manifold already familiar to the model, making the CE loss smoother and easier to drop below \(\tau\) for early stopping.
- Suppressing Premature Termination:
- Function: Prevents the ALM from immediately outputting <|im_end|> to cut off generation after being forced to spit out the target prefix (a "shortcut" in safety alignment: shaping only the first few tokens, followed by immediate termination).
- Mechanism: Add a term \(\mathcal{L}_{\mathrm{eos}}=p_\theta(\mathrm{EOS}\mid h_m)\) to the total loss, where \(h_m\) is the decoding context after the prefix. Final objective: \(\mathcal{L}=\frac{1}{m}\sum_{i=1}^m\mathcal{L}_{\mathrm{CE}}(r_i,p_\theta(\cdot\mid h_{i-1}))+\lambda\|\delta\|_2^2+\lambda_{\mathrm{eos}}\mathcal{L}_{\mathrm{eos}}\).
- Design Motivation: Citing Qi et al. 2025, the authors point out that safety alignment is primarily achieved through "distribution shaping of the first few tokens." Only by forcibly suppressing the EOS probability can the model be pressured into truly generating harmful content; otherwise, false positives occur where the "prefix looks like a jailbreak but ends abruptly."
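The final objective above combines three terms, which the sketch below evaluates from raw logits. It is an illustration under stated assumptions, not the released method: the logits are supplied as arrays (teacher-forced logits at the \(m\) prefix positions and the logits at the position right after the prefix), and `tago_objective`, `eos_id`, `lam`, and `lam_eos` are hypothetical names standing in for \(\lambda\) and \(\lambda_{\mathrm{eos}}\).

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def tago_objective(prefix_logits, prefix_ids, eos_logits, eos_id, delta, lam, lam_eos):
    """Mean prefix cross-entropy + lam * ||delta||_2^2 + lam_eos * p(EOS | h_m).

    prefix_logits: (m, V) teacher-forced logits, one row per target prefix token.
    prefix_ids:    (m,) target prefix token ids r_1..r_m.
    eos_logits:    (V,) logits for the token that follows the prefix.
    """
    logp = log_softmax(np.asarray(prefix_logits, dtype=float))
    ce = -logp[np.arange(len(prefix_ids)), prefix_ids].mean()
    p_eos = np.exp(log_softmax(np.asarray(eos_logits, dtype=float)))[eos_id]
    l2 = np.sum(np.asarray(delta, dtype=float) ** 2)
    total = ce + lam * l2 + lam_eos * p_eos
    return float(total), float(ce), float(p_eos)
```

Returning the prefix CE separately is convenient because the early-stopping test is applied to that term alone, not to the full objective.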
Loss & Training
The total loss is the final objective stated under Suppressing Premature Termination (Equation 13 in the paper), containing prefix CE, \(L_2\) perturbation penalty, and EOS suppression. Optimization uses PGD with \(\ell_\infty\) clipping. Early stopping criterion: if the prefix CE term \(\leq\tau(\rho)=-\log\rho\), the prefix is considered aligned with high confidence, and iterations stop immediately to save time. Main experiments set \(\rho=0.9\) (corresponding to \(\tau\approx 0.105\)), token retention ratio \(\zeta\in\{1.0,0.75,0.5,0.25\}\), with a white-box threat model perturbing only waveforms and keeping text prompts fixed.
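The early-stopping threshold is easy to check numerically. The sketch below assumes the natural logarithm (consistent with \(\tau\approx 0.105\) for \(\rho=0.9\)); the function names are hypothetical.

```python
import math

def early_stop_threshold(rho):
    """tau(rho) = -log(rho), using the natural logarithm."""
    return -math.log(rho)

def should_stop(prefix_ce, rho):
    """Stop once the mean prefix cross-entropy is at or below tau(rho)."""
    return prefix_ce <= early_stop_threshold(rho)

print(round(early_stop_threshold(0.9), 3))  # 0.105
print(round(early_stop_threshold(0.7), 3))  # 0.357
```

Since the prefix CE is a mean of per-token negative log-probabilities, the condition CE \(\leq -\log\rho\) is the same as requiring the geometric mean of the prefix token probabilities to be at least \(\rho\). A smaller \(\rho\) therefore stops earlier with a less confidently aligned prefix.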
Key Experimental Results
Main Results
On AdvBench-50 (each query synthesized with 2 speakers via Google TTS, 100 samples total), comparing three ALMs:
| Model | Method | ASR\(_r\) (%) | ASR\(_l\) (%) |
|---|---|---|---|
| Qwen3-Omni | Direct | 0 | 0 |
| Qwen3-Omni | SpeechGuard | 100 | 42 |
| Qwen3-Omni | AdvWave | 70 | 45 |
| Qwen3-Omni | Post-hoc prune (\(\zeta=0.25\)) | 9 | 1 |
| Qwen3-Omni | TAGO (\(\zeta=1.0\)) | 100 | 87 |
| Qwen3-Omni | TAGO (\(\zeta=0.25\)) | 99 | 86 |
| Qwen2.5-Omni | AdvWave | 36 | 4 |
| Qwen2.5-Omni | TAGO (\(\zeta=0.25\)) | 97 | 53 |
| LLaMA-Omni | AdvWave | 100 | 68 |
| LLaMA-Omni | TAGO (\(\zeta=0.25\)) | 100 | 72 |
The advantage holds on HarmBench: on Qwen3-Omni, ASR\(_l\) is 4.5 for Direct, 76.5 for TAGO (\(\zeta=1.0\)), and 70.0 for TAGO (\(\zeta=0.25\)).
Ablation Study
Sensitivity of TAGO to \(\zeta\) and early stop \(\rho\) (Qwen3-Omni):
| Config | ASR\(_l\) (%) | Avg. Iterations | Note |
|---|---|---|---|
| \(\zeta=1.0,\rho=0.9\) | 87 | 256.64 | Dense baseline |
| \(\zeta=0.25,\rho=0.9\) | 86 | 323.16 | Only 25% tokens; +25.92% iterations recovers performance |
| \(\zeta=1.0,\rho=0.7\) | 32 | 75.39 | Early stop too loose; insufficient prefix alignment |
| Post-hoc prune \(\zeta=0.25\) | 1 | — | Dense optimization then prune — complete failure |
Under the SNR metric, TAGO (\(\zeta=0.25,\rho=0.9\)) produced 20.65 / 21.83 / 22.45 dB on Qwen3-Omni / Qwen2.5-Omni / LLaMA-Omni, respectively, indicating moderate perturbation energy.
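For reference, a signal-to-noise ratio in dB can be computed as below. The note does not state which SNR definition the paper uses, so this sketch assumes the common power-ratio form \(10\log_{10}(\|x\|^2/\|\delta\|^2)\); `snr_db` is a hypothetical helper.

```python
import numpy as np

def snr_db(clean, perturbation):
    """SNR in dB, assuming the standard ratio of signal power to perturbation power."""
    clean = np.asarray(clean, dtype=float)
    perturbation = np.asarray(perturbation, dtype=float)
    signal_power = np.mean(clean ** 2)
    noise_power = np.mean(perturbation ** 2)
    return float(10.0 * np.log10(signal_power / noise_power))
```

Under this definition, a value near 20 dB means the perturbation carries about 1% of the power of the clean audio.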
Key Findings
- Gradient Heterogeneity is a Universal Law: The CV of sum-gradients on Qwen3-Omni is as high as 2.74, with the top 10 tokens accounting for 91.52% of energy (consistent with \(q_{0.9}=9.64\) out of about 60 tokens, i.e. roughly 16%). This is the entire foundation for TAGO.
- Sparsity Must be Applied Online: Post-hoc pruning at \(\zeta=0.25\) yields an ASR\(_l\) of only 1% (vs. TAGO's 86%), indicating that the optimization trajectory itself is reshaped by the sparse mask, rather than being a small perturbation that can be approximated post-hoc. This comparison is the paper's strongest counterexample.
- Sublinear Increase in Iterations: Reducing \(\zeta\) from 1.0 to 0.25 uses only 25.92% more iterations instead of 4x, showing that the top-\(\zeta\) tokens truly carry most of the effective optimization direction, rather than simply "doing less and moving slower."
- Significant Cross-Model Variance in Safety Alignment: Direct attacks on Qwen3-Omni / Qwen2.5-Omni are nearly 0 (strong alignment), whereas LLaMA-Omni has 49% ASR\(_l\) under direct attack (weak alignment). With TAGO (\(\zeta=0.25\)), ASR\(_l\) reaches 86% / 53% / 72% on Qwen3-Omni / Qwen2.5-Omni / LLaMA-Omni.
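The iteration and concentration figures quoted in these findings can be checked directly from the reported numbers (the \(4\times\) reference point assumes that iterations would scale with \(1/\zeta\) if sparse updates merely slowed progress proportionally):

\[
\frac{323.16-256.64}{256.64}=\frac{66.52}{256.64}\approx 0.2592,
\qquad
\frac{1}{\zeta}=\frac{1}{0.25}=4,
\qquad
\frac{q_{0.9}}{T}=\frac{9.64}{60}\approx 0.161 .
\]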
Highlights & Insights
- Rooted in Optimization Structure, Not Attack Tricks: Questioning the "necessity" of dense waveform updates and providing rigorous metrics (CV / top-\(q\) mass / \(q_\alpha\)) ensures that sparsity is presented as an optimization fact rather than just an engineering trick.
- Clever Token-as-Analysis-Unit Abstraction: Using pre-attention audio tokens as the analysis unit instead of post-self-attention representations avoids scattering temporal locality. The receptive field concept can be transferred to any modality with conv/downsample front-ends (e.g., video frame patches, point cloud voxels), offering methodological value for future cross-modal research.
- EOS Suppression Exposes Real Alignment Weaknesses: Materializing the observation by Qi et al. 2025—that alignment mainly shapes the first few tokens and then terminates—into an optimizable objective is a trick that can be reused for text LLM jailbreaking (most existing GCG methods only optimize prefix CE without explicitly suppressing EOS).
Limitations & Future Work
- Strong White-box Assumption: Requires access to full ALM parameters and gradients, making it not directly applicable to closed-source API services. The authors did not discuss whether token-level gradient statistics could be approximated via queries (e.g., using logits).
- Dependence on Known Receptive Fields: The front-end mapping \(\mathcal{R}(i)\) is a fixed deterministic map, but for ALMs with variable downsampling rates or dynamic chunking (like future end-to-end speech LMs), this mapping may be more complex.
- Hard Top-k Selection: It may miss cases where multiple medium-energy tokens act synergistically. Soft/learned masks or group sparsity could be considered.
- Insufficient Defense Discussion: While the authors advocate for utilizing token-level gradient heterogeneity in future safety alignment, specific defense schemes (e.g., alignment augmentation on high-energy token segments) remain open questions.
Related Work & Insights
- vs. SpeechGuard / AdvWave: Both use dense waveform (or suffix-segment) updates. The core difference here is restricting updates to a subset of token receptive fields. On strongly aligned ALMs like Qwen3-Omni, ASR\(_l\) rises from 42/45 to 86. Because dense TAGO (\(\zeta=1.0\)) reaches 87, the results suggest that sparsity costs almost nothing in success rate rather than forcing a trade-off.
- vs. Weighted-Sampling Audio Attacks (Liu et al. 2020): Both build on "uneven waveform importance," but Liu et al. use static, pre-estimated importance sampling, whereas this work re-selects tokens by gradient energy at every step and combines that with a Teacher Forcing jailbreak objective. The change is twofold: from static to dynamic selection, and from attacking speech recognition to jailbreaking ALMs.
- vs. GCG (Zou et al. 2023) Text LLM Jailbreak: TAGO moves the "gradients concentrated on a few tokens" phenomenon from discrete token-level greedy search to continuous waveform + receptive field aggregation. It can be viewed as the "gradient sparsity" counterpart of GCG for continuous modalities.
Rating
- Novelty: ⭐⭐⭐⭐ First to systematically quantify token-level heterogeneity in ALM jailbreak gradients and use in-optimization sparsity as an attack lever. EOS suppression and model-compatible prefixes are incremental but well-combined.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers 3 ALMs, 2 benchmarks, sensitivity along both \(\zeta\) and \(\rho\), a post-hoc pruning counterexample, and SNR-quantified perturbation.
- Writing Quality: ⭐⭐⭐⭐ Metrics, algorithms, and formulas are clearly organized. Algorithm 1 and Equation 13 are intuitive. High readability, though some figure densities are low.
- Value: ⭐⭐⭐⭐ Directly challenges "dense optimization" baselines and points out a clear direction for the defense side (security enhancement of high-energy token segments). Highly practical for audio/multimodal LLM safety research.