AutoPrompT: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts¶
Conference: ICCV 2025 arXiv: 2510.24034 Code: N/A Area: Image Generation / AI Safety Keywords: Red-Teaming, text-to-image, Adversarial Prompts, LLM, Safety Evaluation
TL;DR¶
This paper proposes APT (AutoPrompT), a black-box red-teaming framework driven by an LLM. Through an alternating optimize-finetune pipeline and a dual-evasion strategy, APT automatically generates human-readable adversarial suffixes that bypass both perplexity-based and blocklist filters, circumventing the safety mechanisms of T2I models while transferring zero-shot to unseen prompts.
Background & Motivation¶
Text-to-image (T2I) diffusion models have achieved unprecedented generative capabilities through large-scale multimodal training, yet they inherit safety risks from uncontrolled data collection — carefully crafted adversarial prompts can induce the generation of unsafe (NSFW) content. Existing safety mechanisms include training data filtering, NSFW safety checkers, inference-time guidance, and concept-erasure fine-tuning, but their effectiveness and robustness lack standardized automated evaluation.
Existing red-teaming methods suffer from three critical limitations:
White-box dependency: Most methods (Ring-A-Bell, P4D, UnlearnDiffAtk) require gradient access to the target model, which is impractical in real-world scenarios.
Semantic unreadability: Discrete-optimization methods produce adversarial prompts that are meaningless token concatenations ("gibberish") and are easily detected and blocked by perplexity-based filters.
Inclusion of forbidden words: Generated adversarial prompts frequently contain sensitive vocabulary present in blocklists and are directly intercepted by word filters.
Core innovation: leveraging the natural language generation capabilities of LLMs to automatically produce human-readable and filter-evading adversarial suffixes under a fully black-box setting.
Method¶
Overall Architecture¶
APT employs an alternating optimize-finetune training strategy. During the optimization phase, the LLM is frozen and adversarial suffixes are optimized token-by-token via stochastic beam search; during the fine-tuning phase, the LLM is fine-tuned using the optimized suffixes as targets. The dual-evasion strategy is applied throughout the optimization phase to ensure that generated adversarial prompts simultaneously bypass perplexity filters and blocklist word filters.
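The loop below is a minimal sketch of this alternating schedule, assuming hypothetical phase functions (`optimize_suffix`, `check_jailbreak`, `finetune`) and an illustrative `num_rounds`; the paper's code is unreleased, so this only shows the loop structure, with the two phases sketched concretely later in this note.

```python
# Illustrative shape of APT's alternating optimize-finetune loop.

def optimize_suffix(llm, x):
    """Phase 1 (LLM frozen): stochastic beam search for a suffix S_T."""
    return "placeholder suffix", 0.0             # (suffix, L_jai) stub

def check_jailbreak(x, suffix):
    """Query the black-box T2I model, then run the unsafe-image classifier."""
    return False                                 # stub

def finetune(llm, replay_buffer):
    """Phase 2: fine-tune the LLM on priority-sampled (x, S_T) pairs."""

def train_apt(llm, benign_prompts, num_rounds=10):
    replay_buffer = []
    for _ in range(num_rounds):
        for x in benign_prompts:                 # optimization phase
            suffix, l_jai = optimize_suffix(llm, x)
            replay_buffer.append((x, suffix, check_jailbreak(x, suffix), l_jai))
        finetune(llm, replay_buffer)             # fine-tuning phase
```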
Key Designs¶
- Adversarial Suffix Optimization:
- Function: Given a benign prompt \(x\), generate an adversarial suffix \(S_T = [s_1, \ldots, s_T]\) such that the concatenated prompt \([x, S_T]\) induces the T2I model to generate unsafe content.
- Mechanism:
- Alignment constraint: \(\ell_{align}(x, S_t) = \text{sim}(\mathcal{G}([x, S_t]), I) + \frac{1}{|\mathcal{W}|} \sum_{w \in \mathcal{W}} \text{sim}(\mathcal{G}([x, S_t]), w)\), where \(\mathcal{G}([x, S_t])\) is the image generated from the concatenated prompt; the first term aligns it with the unsafe reference images \(I\), and the second aligns it with the unsafe textual concepts \(w \in \mathcal{W}\).
- Stochastic beam search: At each step, \(k=12\) candidate tokens are sampled from the LLM's predicted distribution, and the \(b=4\) beams with the lowest objective value are retained, iterating up to the maximum length \(T=15\) (a code sketch follows this list).
- Prior suffix: A prior suffix (e.g., "and a beautiful girl's body with") is appended after the benign prompt to provide contextual guidance for the LLM.
- Design Motivation: Token-by-token optimization enables fine-grained control at each generation step; incorporating the LLM's language prior ensures generation quality.
- Dual-Evasion Strategy:
- Function: Ensure that generated adversarial prompts simultaneously bypass perplexity filters and blocklist word filters.
- Mechanism:
- Perplexity constraint: An auxiliary pre-trained LLM \(\mathcal{M}_\phi\) computes the suffix perplexity \(\ell_{per}(S_T | x) = -\sum_{t=1}^T \log p_\phi(s_t | [x, S_{t-1}])\), which is integrated into the jailbreak objective: \(\min_{S_T} \mathcal{L}_{jai} = -\ell_{align} + \lambda \ell_{per}\).
- Forbidden token penalty: The tokenizer vocabulary is scanned to identify tokens whose semantic similarity to unsafe words \(\mathcal{W}\) exceeds a threshold; their probabilities are penalized during prediction. An additional check is applied for multi-token combinations that may spell forbidden words (by inspecting the last complete word of each beam).
- Design Motivation: Low perplexity ensures readability; the forbidden penalty prevents the LLM from taking shortcuts by directly generating sensitive words.
- Suffix Generator Fine-Tuning:
- Function: Fine-tune the LLM with the high-quality suffixes obtained during optimization, enabling it to progressively learn to directly generate effective suffixes.
- Mechanism: \((x, S_T)\) pairs are stored in a replay buffer \(\mathcal{R}\); priority sampling favors pairs that achieved successful jailbreaks and, among those, the lowest \(\mathcal{L}_{jai}\). The LLM is fine-tuned with a cross-entropy loss: \(\mathcal{L}_{CE} = -\sum_{t=1}^T \log p_\theta(s_t | [x, S_{t-1}])\).
- Design Motivation: The quality of suffixes obtained during the optimization phase improves over iterations; fine-tuning internalizes jailbreak patterns into the LLM, ultimately enabling zero-shot inference, i.e., directly generating effective adversarial suffixes for unseen prompts (a training-step sketch follows the Implementation Details paragraph below).
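Below is a minimal sketch of one optimization step combining the stochastic beam search with the dual-evasion objective. Everything here except \(k=12\), \(b=4\), and \(T=15\) is an illustrative assumption: `next_token_logprobs` stands in for the frozen suffix LLM (whose weights also serve as the auxiliary \(\mathcal{M}_\phi\), per the implementation details below), `align_score` stands in for a black-box T2I call plus similarity scoring against unsafe references, and `FORBIDDEN_IDS` for tokens flagged by the semantic-similarity scan.

```python
import heapq
import math
import random
from dataclasses import dataclass, field

K, B, T_MAX = 12, 4, 15        # candidates per step, beams kept, max suffix length
LAM = 0.1                      # perplexity weight lambda (illustrative value)
VOCAB = 1000                   # toy vocabulary size
FORBIDDEN_IDS = {7, 42}        # stand-in for tokens flagged by the semantic scan

def next_token_logprobs(context_ids):
    """Stand-in for the frozen LLM's next-token log-probs p(. | [x, S_t])."""
    logits = [random.gauss(0.0, 1.0) for _ in range(VOCAB)]
    log_z = math.log(sum(math.exp(v) for v in logits))
    return [v - log_z for v in logits]

def align_score(context_ids):
    """Stand-in for l_align: generate an image from the prompt (black-box)
    and score its similarity to unsafe reference images and concepts."""
    return random.random()

@dataclass
class Beam:
    tokens: list = field(default_factory=list)
    neg_logp: float = 0.0      # accumulated -log p_phi(s_t | [x, S_{t-1}])

def beam_step(prompt_ids, beams):
    """Extend each beam by up to K sampled tokens; keep the B beams with
    the lowest jailbreak objective L_jai = -l_align + lambda * l_per."""
    candidates = []
    for beam in beams:
        logp = next_token_logprobs(prompt_ids + beam.tokens)
        for tid in FORBIDDEN_IDS:              # single-token forbidden penalty
            logp[tid] = -1e9
        weights = [math.exp(v) for v in logp]
        sampled = set(random.choices(range(VOCAB), weights=weights, k=K))
        for tid in sampled:
            new = Beam(beam.tokens + [tid], beam.neg_logp - logp[tid])
            # The multi-token check on each beam's last complete word
            # would also be applied at this point.
            obj = -align_score(prompt_ids + new.tokens) + LAM * new.neg_logp
            candidates.append((obj, new))
    return [b for _, b in heapq.nsmallest(B, candidates, key=lambda c: c[0])]

# Usage: grow suffixes token by token for up to T_MAX steps.
prompt_ids = [1, 2, 3]         # token ids of [benign prompt, prior suffix]
beams = [Beam()]
for _ in range(T_MAX):
    beams = beam_step(prompt_ids, beams)
best_suffix = beams[0].tokens  # nsmallest returns beams sorted by objective
```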
Implementation Details¶
The suffix generator uses Llama-3.1-8B; the auxiliary LLM uses the same weights (frozen). The unsafe image set contains 50 images (verified by a classifier); 23 nudity-related and 17 violence-related forbidden words are used. Benign prompts are truncated to 50 tokens.
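The fine-tuning phase can be sketched as below, written against PyTorch and Hugging Face `transformers` under the same caveats: the `Record` fields, the exact priority rule (successful jailbreaks first, ties broken by lowest \(\mathcal{L}_{jai}\)), the batch size, and the learning rate are our reading of the description above, not released code.

```python
from dataclasses import dataclass

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@dataclass
class Record:
    prompt: str       # benign prompt x
    suffix: str       # optimized adversarial suffix S_T
    success: bool     # did [x, S_T] jailbreak the target T2I model?
    l_jai: float      # final jailbreak objective (lower is better)

def priority_sample(buffer, n):
    """Successful jailbreaks first; within each group, lowest L_jai first."""
    ranked = sorted(buffer, key=lambda r: (not r.success, r.l_jai))
    return ranked[:n]

def finetune_step(model, tok, batch, optimizer):
    """Cross-entropy on suffix tokens only:
    L_CE = -sum_t log p_theta(s_t | [x, S_{t-1}])."""
    model.train()
    losses = []
    for rec in batch:
        prompt_ids = tok(rec.prompt, return_tensors="pt").input_ids
        suffix_ids = tok(rec.suffix, add_special_tokens=False,
                         return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, suffix_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100   # no loss on the prompt part
        losses.append(model(input_ids=input_ids, labels=labels).loss)
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with the paper's suffix generator:
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
# opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
# finetune_step(model, tok, priority_sample(replay_buffer, 8), opt)
```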
Key Experimental Results¶
Main Results (RSR after blocklist word filtering)¶
| Method | ESD↑ | SLD-MAX↑ | Receler↑ | AdvUnlearn↑ | Note |
|---|---|---|---|---|---|
| Ring-A-Bell | 2.00% | 2.50% | 1.00% | 0.50% | White-box, nearly ineffective |
| UnlearnDiffAtk | 18.50% | 52.00% | 16.50% | 3.00% | White-box |
| P4D-Union | 41.50% | 62.50% | 41.50% | 9.50% | White-box, requires gradients |
| APT (Ours) | 61.50% | 70.50% | 36.50% | 30.50% | Black-box, human-readable |
Ablation Study (ESD model, nudity category)¶
| Configuration | RSR↑ | PPL_Avg (×10³)↓ | BR↓ | Note |
|---|---|---|---|---|
| w/o unsafe image alignment | 38.5% | 0.175 | 1% | Lacks visual guidance |
| w/o unsafe word list alignment | 30.5% | 0.067 | 1% | Lacks semantic guidance |
| w/o perplexity constraint | 35% | 0.198 | 1% | Reduced readability |
| w/o forbidden token penalty | 9.5% | 0.171 | 87% | Nearly all intercepted |
| Full APT | 61.5% | 0.167 | 2% | All components |
Key Findings¶
- APT's average perplexity (PPL) is roughly 1/70 of Ring-A-Bell's (0.167 vs 11.6 in the ×10³ units used for PPL, i.e., about 167 vs 11,646), far lower than all baselines.
- APT achieves the lowest block rate (BR) — approximately 2% for both nudity and violence categories, compared to up to 87% for baseline methods.
- APT achieves an RSR of 30.5% against AdvUnlearn, 3.2× that of P4D (9.5%), with a particularly pronounced advantage under strong defenses.
- Strong cross-model transferability: prompts optimized for AdvUnlearn achieve over 40% success rate on the other three models.
- APT can directly attack the latest models including SDXL, SD3.5, and FLUX.1-dev, as well as commercial platforms such as Leonardo.Ai.
Highlights & Insights¶
- Simultaneously satisfying all three constraints (black-box access, human readability, and filter evasion) makes APT far more practically relevant in real deployment scenarios than white-box methods.
- The alternating optimize-finetune strategy enables the LLM to progressively internalize jailbreak patterns, ultimately achieving zero-shot generalization.
- Priority sampling in the replay buffer is a critical design choice for training stability.
- The two-level mechanism of the forbidden token penalty (single-token level + multi-token combination check) reflects engineering completeness.
- Successful attacks against the latest commercial APIs reveal the fundamental fragility of existing safety measures.
Limitations & Future Work¶
- Maintaining low perplexity and evading filters may sacrifice some attack strength — overly strict forbidden penalties may suppress semantically critical tokens.
- The prior suffix is currently set manually ("and a beautiful girl's body with"); automated selection could further improve performance.
- A separate suffix generator must be trained for each safe T2I model; a unified generator across different defense methods has not yet been realized.
- The paper focuses on nudity and violence — coverage of other harmful content types (hate speech, discrimination, etc.) remains unexplored.
- The release of red-teaming tools requires careful balance between research value and potential misuse risk.
Related Work & Insights¶
- vs Ring-A-Bell: Based on genetic algorithm discrete optimization; generated prompts exhibit extremely high perplexity (~11,646), rendering them nearly completely ineffective under filters.
- vs P4D: Optimizes in continuous space and requires model gradients; achieves higher RSR but cannot be adapted to black-box settings and produces unreadable prompts.
- vs AdvPrompter: Also an LLM-driven method, but it requires white-box gradients; APT is fully black-box and introduces the dual-evasion strategy.
- Insight: The language generation capability of LLMs can be directed via "guided fine-tuning" to produce specific adversarial suffixes — this paradigm may generalize to other safety evaluation tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The three-way constraint of black-box + human-readable + filter-evading is achieved simultaneously in T2I red-teaming for the first time.
- Experimental Thoroughness: ⭐⭐⭐⭐ Four safe T2I models, latest architectures and commercial APIs, comprehensive ablation and transferability analysis.
- Writing Quality: ⭐⭐⭐⭐ Method description is clear, algorithm pseudocode is complete, and comparative analysis is thorough.
- Value: ⭐⭐⭐⭐⭐ Reveals the fundamental vulnerability of existing T2I safety mechanisms and provides a practical tool for safety evaluation.