AutoPrompT: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts¶
Conference: ICCV 2025 arXiv: 2510.24034 Code: N/A Area: Image Generation / AI Safety Keywords: Red-Teaming, text-to-image, Adversarial Prompts, LLM, Safety Evaluation
TL;DR¶
This paper proposes APT (AutoPrompT), a black-box red-teaming framework driven by an LLM. Through an alternating optimize-finetune pipeline and a dual-evasion strategy, APT automatically generates human-readable adversarial suffixes that bypass both perplexity-based and blocklist filters, circumventing the safety mechanisms of T2I models while transferring zero-shot to unseen prompts.
Background & Motivation¶
Text-to-image (T2I) diffusion models have achieved unprecedented generative capabilities through large-scale multimodal training, yet they inherit safety risks from uncontrolled data collection — carefully crafted adversarial prompts can induce the generation of unsafe (NSFW) content. Existing safety mechanisms include training data filtering, NSFW safety checkers, inference-time guidance, and concept-erasure fine-tuning, but their effectiveness and robustness lack standardized automated evaluation.
Existing red-teaming methods suffer from three critical limitations:
White-box dependency: Most methods (Ring-A-Bell, P4D, UnlearnDiffAtk) require gradient access to the target model, which is impractical in real-world scenarios.
Semantic unreadability: Discrete-optimization methods produce adversarial prompts that are meaningless token concatenations ("gibberish") and are easily detected and blocked by perplexity-based filters.
Inclusion of forbidden words: Generated adversarial prompts frequently contain sensitive vocabulary present in blocklists and are directly intercepted by word filters.
Core innovation: leveraging the natural language generation capabilities of LLMs to automatically produce human-readable and filter-evading adversarial suffixes under a fully black-box setting.
Method¶
Overall Architecture¶
APT employs an alternating optimize-finetune training strategy. During the optimization phase, the LLM is frozen and adversarial suffixes are optimized token-by-token via stochastic beam search; during the fine-tuning phase, the LLM is fine-tuned using the optimized suffixes as targets. The dual-evasion strategy is applied throughout the optimization phase to ensure that generated adversarial prompts simultaneously bypass perplexity filters and blocklist word filters.
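The loop below is a minimal sketch of this alternating schedule, assuming hypothetical phase functions (`optimize_suffix`, `check_jailbreak`, `finetune`) and an illustrative `num_rounds`; the paper's code is unreleased, so this only shows the loop structure, with the two phases sketched concretely later in this note.

```python
# Illustrative shape of APT's alternating optimize-finetune loop.

def optimize_suffix(llm, x):
    """Phase 1 (LLM frozen): stochastic beam search for a suffix S_T."""
    return "placeholder suffix", 0.0             # (suffix, L_jai) stub

def check_jailbreak(x, suffix):
    """Query the black-box T2I model, then run the unsafe-image classifier."""
    return False                                 # stub

def finetune(llm, replay_buffer):
    """Phase 2: fine-tune the LLM on priority-sampled (x, S_T) pairs."""

def train_apt(llm, benign_prompts, num_rounds=10):
    replay_buffer = []
    for _ in range(num_rounds):
        for x in benign_prompts:                 # optimization phase
            suffix, l_jai = optimize_suffix(llm, x)
            replay_buffer.append((x, suffix, check_jailbreak(x, suffix), l_jai))
        finetune(llm, replay_buffer)             # fine-tuning phase
```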
Key Designs¶
- Adversarial Suffix Optimization:
- Function: Given a benign prompt \(x\), generate an adversarial suffix \(S_T = [s_1, \ldots, s_T]\) such that the concatenated prompt \([x, S_T]\) induces the T2I model to generate unsafe content.
- Mechanism:
- Alignment constraint: \(\ell_{align}(x, S_t) = \text{sim}(\mathcal{G}([x, S_t]), I) + \frac{1}{|\mathcal{W}|} \sum_{w \in \mathcal{W}} \text{sim}(\mathcal{G}([x, S_t]), w)\), where \(\mathcal{G}([x, S_t])\) is the image generated from the concatenated prompt; the first term aligns it with the unsafe reference images \(I\), and the second aligns it with the unsafe textual concepts \(w \in \mathcal{W}\).
- Stochastic beam search: At each step, \(k=12\) candidate tokens are sampled from the LLM's predicted distribution, and the \(b=4\) beams with the lowest objective value are retained, iterating up to the maximum length \(T=15\) (a code sketch follows this list).
- Prior suffix: A prior suffix (e.g., "and a beautiful girl's body with") is appended after the benign prompt to provide contextual guidance for the LLM.
- Design Motivation: Token-by-token optimization enables fine-grained control at each generation step; incorporating the LLM's language prior ensures generation quality.
- Dual-Evasion Strategy:
- Function: Ensure that generated adversarial prompts simultaneously bypass perplexity filters and blocklist word filters.
- Mechanism:
- Perplexity constraint: An auxiliary pre-trained LLM \(\mathcal{M}_\phi\) computes the suffix perplexity \(\ell_{per}(S_T | x) = -\sum_{t=1}^T \log p_\phi(s_t | [x, S_{t-1}])\), which is integrated into the jailbreak objective: \(\min_{S_T} \mathcal{L}_{jai} = -\ell_{align} + \lambda \ell_{per}\).
- Forbidden token penalty: The tokenizer vocabulary is scanned to identify tokens whose semantic similarity to unsafe words \(\mathcal{W}\) exceeds a threshold; their probabilities are penalized during prediction. An additional check is applied for multi-token combinations that may spell forbidden words (by inspecting the last complete word of each beam).
- Design Motivation: Low perplexity ensures readability; the forbidden penalty prevents the LLM from taking shortcuts by directly generating sensitive words.
- Suffix Generator Fine-Tuning:
- Function: Fine-tune the LLM with the high-quality suffixes obtained during optimization, enabling it to progressively learn to directly generate effective suffixes.
- Mechanism: \((x, S_T)\) pairs are stored in a replay buffer \(\mathcal{R}\); priority sampling favors pairs that achieved successful jailbreaks and, among those, the lowest \(\mathcal{L}_{jai}\). The LLM is fine-tuned with a cross-entropy loss: \(\mathcal{L}_{CE} = -\sum_{t=1}^T \log p_\theta(s_t | [x, S_{t-1}])\).
- Design Motivation: The quality of suffixes obtained during the optimization phase improves over iterations; fine-tuning internalizes jailbreak patterns into the LLM, ultimately enabling zero-shot inference, i.e., directly generating effective adversarial suffixes for unseen prompts (a training-step sketch follows the Implementation Details paragraph below).
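Below is a minimal sketch of one optimization step combining the stochastic beam search with the dual-evasion objective. Everything here except \(k=12\), \(b=4\), and \(T=15\) is an illustrative assumption: `next_token_logprobs` stands in for the frozen suffix LLM (whose weights also serve as the auxiliary \(\mathcal{M}_\phi\), per the implementation details below), `align_score` stands in for a black-box T2I call plus similarity scoring against unsafe references, and `FORBIDDEN_IDS` for tokens flagged by the semantic-similarity scan.

```python
import heapq
import math
import random
from dataclasses import dataclass, field

K, B, T_MAX = 12, 4, 15        # candidates per step, beams kept, max suffix length
LAM = 0.1                      # perplexity weight lambda (illustrative value)
VOCAB = 1000                   # toy vocabulary size
FORBIDDEN_IDS = {7, 42}        # stand-in for tokens flagged by the semantic scan

def next_token_logprobs(context_ids):
    """Stand-in for the frozen LLM's next-token log-probs p(. | [x, S_t])."""
    logits = [random.gauss(0.0, 1.0) for _ in range(VOCAB)]
    log_z = math.log(sum(math.exp(v) for v in logits))
    return [v - log_z for v in logits]

def align_score(context_ids):
    """Stand-in for l_align: generate an image from the prompt (black-box)
    and score its similarity to unsafe reference images and concepts."""
    return random.random()

@dataclass
class Beam:
    tokens: list = field(default_factory=list)
    neg_logp: float = 0.0      # accumulated -log p_phi(s_t | [x, S_{t-1}])

def beam_step(prompt_ids, beams):
    """Extend each beam by up to K sampled tokens; keep the B beams with
    the lowest jailbreak objective L_jai = -l_align + lambda * l_per."""
    candidates = []
    for beam in beams:
        logp = next_token_logprobs(prompt_ids + beam.tokens)
        for tid in FORBIDDEN_IDS:              # single-token forbidden penalty
            logp[tid] = -1e9
        weights = [math.exp(v) for v in logp]
        sampled = set(random.choices(range(VOCAB), weights=weights, k=K))
        for tid in sampled:
            new = Beam(beam.tokens + [tid], beam.neg_logp - logp[tid])
            # The multi-token check on each beam's last complete word
            # would also be applied at this point.
            obj = -align_score(prompt_ids + new.tokens) + LAM * new.neg_logp
            candidates.append((obj, new))
    return [b for _, b in heapq.nsmallest(B, candidates, key=lambda c: c[0])]

# Usage: grow suffixes token by token for up to T_MAX steps.
prompt_ids = [1, 2, 3]         # token ids of [benign prompt, prior suffix]
beams = [Beam()]
for _ in range(T_MAX):
    beams = beam_step(prompt_ids, beams)
best_suffix = beams[0].tokens  # nsmallest returns beams sorted by objective
```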
Implementation Details¶
The suffix generator uses Llama-3.1-8B; the auxiliary LLM uses the same weights (frozen). The unsafe image set contains 50 images (verified by a classifier); 23 nudity-related and 17 violence-related forbidden words are used. Benign prompts are truncated to 50 tokens.
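The fine-tuning phase can be sketched as below, written against PyTorch and Hugging Face `transformers` under the same caveats: the `Record` fields, the exact priority rule (successful jailbreaks first, ties broken by lowest \(\mathcal{L}_{jai}\)), the batch size, and the learning rate are our reading of the description above, not released code.

```python
from dataclasses import dataclass

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@dataclass
class Record:
    prompt: str       # benign prompt x
    suffix: str       # optimized adversarial suffix S_T
    success: bool     # did [x, S_T] jailbreak the target T2I model?
    l_jai: float      # final jailbreak objective (lower is better)

def priority_sample(buffer, n):
    """Successful jailbreaks first; within each group, lowest L_jai first."""
    ranked = sorted(buffer, key=lambda r: (not r.success, r.l_jai))
    return ranked[:n]

def finetune_step(model, tok, batch, optimizer):
    """Cross-entropy on suffix tokens only:
    L_CE = -sum_t log p_theta(s_t | [x, S_{t-1}])."""
    model.train()
    losses = []
    for rec in batch:
        prompt_ids = tok(rec.prompt, return_tensors="pt").input_ids
        suffix_ids = tok(rec.suffix, add_special_tokens=False,
                         return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, suffix_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100   # no loss on the prompt part
        losses.append(model(input_ids=input_ids, labels=labels).loss)
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with the paper's suffix generator:
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
# opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
# finetune_step(model, tok, priority_sample(replay_buffer, 8), opt)
```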
Key Experimental Results¶
Main Results (RSR after blocklist word filtering)¶
| Method | ESD↑ | SLD-MAX↑ | Receler↑ | AdvUnlearn↑ | Note |
|---|---|---|---|---|---|
| Ring-A-Bell | 2.00% | 2.50% | 1.00% | 0.50% | White-box, nearly ineffective |
| UnlearnDiffAtk | 18.50% | 52.00% | 16.50% | 3.00% | White-box |
| P4D-Union | 41.50% | 62.50% | 41.50% | 9.50% | White-box, requires gradients |
| APT (Ours) | 61.50% | 70.50% | 36.50% | 30.50% | Black-box, human-readable |
Ablation Study (ESD model, nudity category)¶
| Configuration | RSR↑ | PPL_Avg (×10³)↓ | BR↓ | Note |
|---|---|---|---|---|
| w/o unsafe image alignment | 38.5% | 0.175 | 1% | Lacks visual guidance |
| w/o unsafe word list alignment | 30.5% | 0.067 | 1% | Lacks semantic guidance |
| w/o perplexity constraint | 35% | 0.198 | 1% | Reduced readability |
| w/o forbidden token penalty | 9.5% | 0.171 | 87% | Nearly all intercepted |
| Full APT | 61.5% | 0.167 | 2% | All components |
Key Findings¶
- APT's average perplexity (PPL) is roughly 1/70 of Ring-A-Bell's (0.167 vs 11.6 in the ×10³ units used for PPL, i.e., about 167 vs 11,646), far lower than all baselines.
- APT achieves the lowest block rate (BR) — approximately 2% for both nudity and violence categories, compared to up to 87% for baseline methods.
- APT achieves an RSR of 30.5% against AdvUnlearn, 3.2× that of P4D (9.5%), with a particularly pronounced advantage under strong defenses.
- Strong cross-model transferability: prompts optimized for AdvUnlearn achieve over 40% success rate on the other three models.
- APT can directly attack the latest models including SDXL, SD3.5, and FLUX.1-dev, as well as commercial platforms such as Leonardo.Ai.
Highlights & Insights¶
- Simultaneously satisfying all three constraints (black-box access, human readability, and filter evasion) makes APT far more practically relevant in real deployment scenarios than white-box methods.
- The alternating optimize-finetune strategy enables the LLM to progressively internalize jailbreak patterns, ultimately achieving zero-shot generalization.
- Priority sampling in the replay buffer is a critical design choice for training stability.
- The two-level mechanism of the forbidden token penalty (single-token level + multi-token combination check) reflects engineering completeness.
- Successful attacks against the latest commercial APIs reveal the fundamental fragility of existing safety measures.
Limitations & Future Work¶
- Maintaining low perplexity and evading filters may sacrifice some attack strength — overly strict forbidden penalties may suppress semantically critical tokens.
- The prior suffix is currently set manually ("and a beautiful girl's body with"); automated selection could further improve performance.
- A separate suffix generator must be trained for each safe T2I model; a unified generator across different defense methods has not yet been realized.
- The paper focuses on nudity and violence — coverage of other harmful content types (hate speech, discrimination, etc.) remains unexplored.
- The release of red-teaming tools requires careful balance between research value and potential misuse risk.
Related Work & Insights¶
- vs Ring-A-Bell: Based on genetic algorithm discrete optimization; generated prompts exhibit extremely high perplexity (~11,646), rendering them nearly completely ineffective under filters.
- vs P4D: Optimizes in continuous space and requires model gradients; achieves higher RSR but cannot be adapted to black-box settings and produces unreadable prompts.
- vs AdvPrompter: Also an LLM-driven method, but it requires white-box gradients; APT is fully black-box and introduces the dual-evasion strategy.
- Insight: The language generation capability of LLMs can be directed via "guided fine-tuning" to produce specific adversarial suffixes — this paradigm may generalize to other safety evaluation tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The three-way constraint of black-box + human-readable + filter-evading is achieved simultaneously in T2I red-teaming for the first time.
- Experimental Thoroughness: ⭐⭐⭐⭐ Four safe T2I models, latest architectures and commercial APIs, comprehensive ablation and transferability analysis.
- Writing Quality: ⭐⭐⭐⭐ Method description is clear, algorithm pseudocode is complete, and comparative analysis is thorough.
- Value: ⭐⭐⭐⭐⭐ Reveals the fundamental vulnerability of existing T2I safety mechanisms and provides a practical tool for safety evaluation.