
Multi-task Adversarial Attacks against Black-box Model with Few-shot Queries

Conference: ACL 2025
Code: -
Area: AI Safety
Keywords: adversarial attack, multi-task, black-box, text classification, transfer attack, few-shot queries

TL;DR

This paper proposes CEMA (Cluster and Ensemble Multi-task Text Adversarial Attack), which transforms complex multi-task black-box attacks into single-task text classification attacks by training a "deep-level surrogate model." CEMA can simultaneously attack multiple downstream tasks (such as classification, translation, summarization, and text-to-image generation) with only about 100 queries. Its effectiveness is validated on commercial models, including ChatGPT-4o, Baidu Translate, and Stable Diffusion.

Background & Motivation

Text adversarial attacks have been widely studied, but they primarily focus on single-task scenarios (e.g., classification, translation).
Multi-task adversarial attacks have had limited exploration in the image domain (e.g., MTA, MTADV) but remain almost completely unexplored in the text domain.
Limitations of Prior Work:
- Multi-task white-box attacks require access to internal model features, making them inapplicable to black-box APIs.
- Single-task black-box attacks require a massive number of queries (tens of thousands).
- It is challenging to handle different task types (classification vs. translation vs. generation) in a unified manner.
Practical threat scenario: Modern AI systems increasingly adopt multi-task architectures, where the same input is fed to multiple downstream tasks. Attacking several tasks simultaneously with very few queries is therefore a critical security concern.
Key Insight: Although task outputs differ in form, they share "deep-level features." For instance, the underlying distinction between a cat and a bird is more fundamental than the task-specific answers derived from it, such as whether the animal is a mammal, whether it can fly, or how many legs it has.

Method

Overall Architecture

CEMA consists of three steps:
1. Deep-level Surrogate Model Training
2. Candidate Adversarial Sample Generation
3. Transferability-based Adversarial Sample Selection

Step 1: Deep-level Surrogate Model Training

Deep-level Attack Hypothesis: Adversarial samples generated by a surrogate model trained on deep-level labels can effectively attack multiple downstream tasks of the victim model.

Workflow:
1. Collect a small amount of auxiliary text (e.g., 100 unlabeled victim texts).
2. Query the victim model with each auxiliary text \(x_i\) to obtain all task outputs \(\{y_i^1, y_i^2, \dots, y_i^N\}\).
3. Use a pre-trained encoder (mT5) to encode the text and all outputs, and concatenate them into a unified vector \(E_i = \text{Concat}(E_{x_i}, E_{y_i^1}, \dots, E_{y_i^N})\).
4. Perform binary clustering (Spectral Clustering) on \(E_i\) to obtain deep-level labels \(y_i^c \in \{0, 1\}\).
5. Train a binary classification surrogate model \(f_s\) using the auxiliary text-cluster label pairs.
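The workflow above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions, not the authors' implementation: `toy_encoder` is a hypothetical stand-in for the mT5 encoder, `query_victim` is a hypothetical callable that returns one output string per task, and the clustering step is a simplified two-way spectral partition (RBF affinity, second eigenvector found by power iteration and thresholded at zero) rather than a full Spectral Clustering library routine. Training \(f_s\) itself is omitted; any binary text classifier could be fit on the returned pairs.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def toy_encoder(text: str) -> Vector:
    """Hypothetical stand-in for the mT5 encoder: L2-normalised letter counts."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def unified_vector(x: str, outputs: Sequence[str],
                   encoder: Callable[[str], Vector]) -> Vector:
    """E_i = Concat(E_x, E_y1, ..., E_yN)."""
    vec = list(encoder(x))
    for y in outputs:
        vec.extend(encoder(y))
    return vec


def spectral_bipartition(vectors: Sequence[Vector], gamma: float = 1.0,
                         iters: int = 500, seed: int = 0) -> List[int]:
    """Two-way spectral split: threshold the second eigenvector of the
    normalised RBF affinity D^-1/2 W D^-1/2 at zero."""
    n = len(vectors)
    w = [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))
          for v in vectors] for u in vectors]
    deg = [sum(row) for row in w]
    m = [[w[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    # The leading eigenvector is proportional to sqrt(deg); project it out
    # so that power iteration converges to the second eigenvector instead.
    total = math.sqrt(sum(deg))
    top = [math.sqrt(d) / total for d in deg]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        proj = sum(a * b for a, b in zip(v, top))
        v = [a - proj * b for a, b in zip(v, top)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm < 1e-12:  # degenerate case: nothing left to split on
            return [0] * n
        v = [a / norm for a in v]
        v = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
    return [1 if a > 0 else 0 for a in v]


def deep_level_labels(texts: Sequence[str],
                      query_victim: Callable[[str], Sequence[str]],
                      encoder: Callable[[str], Vector] = toy_encoder
                      ) -> List[Tuple[str, int]]:
    """Steps 2-4: one victim query per text, embed, concatenate, cluster.
    Returns (text, deep-level label) pairs for training the surrogate."""
    unified = [unified_vector(x, query_victim(x), encoder) for x in texts]
    return list(zip(texts, spectral_bipartition(unified)))
```

Which cluster receives label 0 and which receives label 1 is arbitrary here; only the partition matters, since the surrogate merely needs a consistent binary target.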

Key Designs:
- Number of clusters = 2: Experiments show that binary clustering captures the most fundamental deep-level labels, whereas multi-clustering eventually merges into two.
- The surrogate model is plug-and-play, requiring no imitation of the victim model's specific architecture.
- Only ~100 queries are needed to obtain the auxiliary data.

Step 2: Candidate Adversarial Sample Generation

Apply \(l\) text classification attack methods (Hotflip, FD, TextBugger) to the surrogate model \(f_s\) to generate \(l\) adversarial candidates.
Filtering condition: Cosine similarity between the adversarial sample and the original text \(\ge\) threshold \(\epsilon = 0.8\).
Mathematical guarantee of the multi-method strategy (Theorem 3.2):
- The probability that at least one candidate attacks successfully increases monotonically with the number of candidates.
- The probability that at least one candidate exceeds the similarity threshold increases monotonically with the number of candidates.
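The intuition behind the first claim can be illustrated with a short derivation. This is a sketch and not the paper's proof: it assumes (an assumption not stated in this summary) that the \(l\) candidates succeed independently, with candidate \(k\) succeeding with probability \(p_k \in [0, 1]\).

\[
P_l = 1 - \prod_{k=1}^{l} (1 - p_k),
\qquad
P_{l+1} - P_l = p_{l+1} \prod_{k=1}^{l} (1 - p_k) \ge 0 .
\]

Here \(P_l\) is the probability that at least one of the \(l\) candidates succeeds, so adding a candidate can never lower it. With purely illustrative values \(p_k = 0.4\) (not taken from the paper), \(P_1 = 0.4\), \(P_2 = 0.64\), and \(P_3 = 0.784\). The same argument applies to the similarity-threshold claim by reading \(p_k\) as the probability that candidate \(k\) passes the threshold.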

Step 3: Transferability-based Adversarial Sample Selection

Retrain \(w\) surrogate models (\(w=6\)) using 80% of the auxiliary data.
For each candidate adversarial sample, calculate how many surrogate models it successfully attacks (transferability score \(I_{ij}\)).
Select the candidate that attacks the most surrogate models as the final adversarial sample.
In case of a tie, select the one with the largest probability change (\(p_c^j = p_{\hat{y}}(x_i^*) - p_{\hat{y}}(\tilde{x}_i^j)\)).
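The selection rule above can be sketched as follows. The names are hypothetical and chosen for illustration: each surrogate is modelled as a function returning the probability of class 1, `ensemble` holds the \(w\) retrained surrogates, and `main_surrogate` is assumed to be the model used for the tie-breaking probability (the summary does not say which model computes \(p_{\hat{y}}\)).

```python
from typing import Callable, Sequence, Tuple

# Assumption: a surrogate maps a text to P(class = 1).
Surrogate = Callable[[str], float]


def predicted_label(prob_one: float) -> int:
    return 1 if prob_one >= 0.5 else 0


def select_adversarial(original: str, candidates: Sequence[str],
                       ensemble: Sequence[Surrogate],
                       main_surrogate: Surrogate) -> str:
    """Pick the candidate that flips the most surrogates; break ties by the
    largest drop in confidence for the originally predicted label."""
    if not candidates:
        raise ValueError("need at least one candidate")
    p_orig = main_surrogate(original)
    y_hat = predicted_label(p_orig)
    conf_orig = p_orig if y_hat == 1 else 1.0 - p_orig

    def score(candidate: str) -> Tuple[int, float]:
        # Transferability score: number of surrogates whose prediction flips.
        flips = sum(
            predicted_label(f(candidate)) != predicted_label(f(original))
            for f in ensemble
        )
        # Tie-breaker: p_yhat(original) - p_yhat(candidate).
        p_adv = main_surrogate(candidate)
        conf_adv = p_adv if y_hat == 1 else 1.0 - p_adv
        return (flips, conf_orig - conf_adv)

    return max(candidates, key=score)
```

Comparing `(flips, confidence_drop)` tuples lets Python's `max` apply the tie-breaker only when the transferability scores are equal.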

Key Experimental Results

Experimental Settings

Datasets: SST5 (sentiment analysis, 5 classes), Emotion (emotion classification, 6 classes)

Victim Model Settings:
- LLM attack: ChatGPT-4o, Claude 3.5 (prompt = translate to French/Chinese simultaneously + predict emotion category)
- M3TL (Multi-Model Multi-Task Learning):
  - Victim A: dis-sst5 + dis-emotion + opus-mt (En-Zh)
  - Victim B: ro-sst5 + ro-emotion + T5-small (En-Fr)
  - Victim C: Baidu Translate + Ali Translate (commercial APIs)
- Image Generation: Stable Diffusion V2

Baselines: BAE, FD, Hotflip, SememePSO, TextBugger (classification); kNN, Morphin, Seq2Sick, TransFool (translation)

Main Results: LLM Attacks

Dataset	Model	Classification ASR (%)↑	En-Fr BLEU↓	En-Zh BLEU↓	Queries↓
Emotion	ChatGPT-4o	32.05	0.39	0.33	100
Emotion	Claude 3.5	36.80	0.38	0.35	100
SST5	ChatGPT-4o	38.63	0.32	0.27	100
SST5	Claude 3.5	37.12	0.29	0.25	100

With only 100 queries, CEMA significantly outperforms the Random-Del baseline in both classification ASR and translation BLEU.

M3TL Multi-task Attack

Classification Task (SST5/Emotion, Victim A/B):
- CEMA achieves an ASR of 80.80% on Emotion (Victim A), far exceeding all single-task methods.
- The number of queries is only 100, whereas baseline methods require 20k to 70k queries.

For example, SST5 + Victim A:

Method	ASR↑	Sim.↑	Queries↓
BAE	42.71	0.888	47,360
PSO	45.14	0.954	24,398
HQA	46.11	0.936	64,864
CEMA	73.57	0.934	100

Relative to the strongest baseline (HQA), ASR rises from 46.11% to 73.57%, while the number of queries drops from tens of thousands to 100.
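For readers who want the exact margins, the comparison can be recomputed from the SST5 + Victim A table above. The snippet below is a small arithmetic sketch; the function name and the derived column names are illustrative and not from the paper.

```python
import math

# (ASR in %, number of queries), copied from the SST5 + Victim A table.
results = {
    "BAE": (42.71, 47_360),
    "PSO": (45.14, 24_398),
    "HQA": (46.11, 64_864),
    "CEMA": (73.57, 100),
}


def compare_to_cema(table):
    """ASR gain (percentage points) and query reduction of CEMA per baseline."""
    cema_asr, cema_queries = table["CEMA"]
    rows = {}
    for name, (asr, queries) in table.items():
        if name == "CEMA":
            continue
        ratio = queries / cema_queries
        rows[name] = {
            "asr_gain_points": round(cema_asr - asr, 2),
            "query_ratio": round(ratio, 2),
            "orders_of_magnitude": round(math.log10(ratio), 2),
        }
    return rows
```

On these figures the ASR gain is roughly 27 to 31 percentage points, and the query reduction falls between two and three orders of magnitude for every baseline.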

Attacks on Commercial APIs and Image Generation

Against Baidu Translate and Ali Translate (Victim C), CEMA achieves BLEU < 0.35 using only 100 auxiliary texts.
CEMA also successfully attacks the text-to-image capabilities of Stable Diffusion V2.

Ablation Study

Number of clusters: 2 clusters is optimal; performance declines with 3 or 4 clusters.
Amount of auxiliary data: 100 samples are sufficient; increasing to 200/400 yields limited performance gains.
Combination of attack methods: The combination of three methods outperforms any single method.
Number of surrogate models \(w\): 6 surrogate models represent the optimal trade-off point.

Highlights & Insights

The Deep-level Attack Hypothesis is highly innovative: it bypasses task differences and attacks from the inherent structure of the data.
Highly practical: It requires only 100 queries (potentially completed in under a minute), reducing the query count by 2-3 orders of magnitude compared to existing single-task methods.
Task universality: The same framework can attack four distinct types of tasks: classification, translation, summarization, and text-to-image generation.
Effective against ChatGPT-4o and Claude 3.5: The threat is demonstrated on state-of-the-art commercial LLMs.
Plug-and-play: Training the surrogate model takes only 4 minutes (on a 24GB RTX 3090), requiring no knowledge of the victim model's architecture.

Limitations & Future Work

The classification ASR on LLMs is only about 32-39%, indicating that LLMs still possess a degree of robustness against such attacks.
It relies on existing text classification attack methods (Hotflip/FD/TextBugger), meaning the attack quality is constrained by these base methods.
Clustering quality has a significant impact on the final performance, and clustering performance may be unstable under different data distributions.
The setting of the similarity threshold \(\epsilon = 0.8\) introduces a trade-off between attack success rate and stealthiness.
Defenses and countermeasures such as adversarial robust training are not discussed.

Related Work

Text Classification Attacks: Hotflip (Ebrahimi et al.), BAE (Garg et al.), TextBugger (Li et al.), HQA (Liu et al.)
NMT Attacks: Seq2Sick (Cheng et al.), TransFool (Sadrizadeh et al.), Morphin (Tan et al.)
Multi-task Attacks: MTA (Guo et al., 2020) in the image domain; MTADV (Wang et al., 2024) in face verification
Transfer Attacks: Surrogate model training + auxiliary data (Li et al., 2020; Sun et al., 2022)

Rating ⭐⭐⭐⭐

Fills the gap in text multi-task adversarial attacks; the core idea (deep clustering \(\rightarrow\) unified attack) is clear and elegant. The efficiency achieved with 100 queries is highly impressive. However, there is still room for improvement regarding the attack success rate on LLMs, and discussion on defense mechanisms is lacking.