MacPrompt: Macaronic-guided Jailbreak against Text-to-Image Models

Conference: AAAI 2026 arXiv: 2601.07141 Code: None Area: Image Generation Keywords: Text-to-image safety, jailbreak attack, cross-lingual adversarial, macaronic words, concept removal

TL;DR

This paper proposes MacPrompt, a black-box cross-lingual attack method that translates harmful words into multi-language candidates and performs character-level recombination to construct "macaronic words" as adversarial prompts. The method simultaneously bypasses text safety filters and concept removal defenses, achieving attack success rates of up to 92% on sexual content and 90% on violent content.

Background & Motivation

Text-to-image (T2I) models (e.g., Stable Diffusion, DALL·E, Midjourney) are widely used in creative design, but pose risks of generating NSFW (Not Safe for Work) content due to unfiltered internet training data.

Existing Defense Paradigms:

Safety Filters:

  • Text filters: keyword blacklist matching (text-match) or BERT classifiers (text-classifier)
  • Image filters: detecting NSFW content in generated images
  • Latent-space filters: e.g., LatentGuard

Concept Removal: Directly modifying model weights to erase NSFW concepts (e.g., ESD, SLD, FMN, SafeGen)

Limitations of Prior Work:

  • Most attacks can bypass only one category of defense (either filters or concept removal), not both simultaneously
  • Attacks capable of bypassing both typically require internal model information (white-box or gray-box access), which is impractical
  • Synonym-substitution methods (e.g., DiffZOO) remain susceptible to detection by semantic-matching filters

Core Insight: Although T2I models are primarily trained on English, prompts in other languages can elicit similar visual semantics. More importantly, certain cross-lingual composite words can preserve visual semantics while diverging significantly from the original harmful words in textual semantic space, thereby evading safety filters.

Method

Overall Architecture

The overall pipeline of MacPrompt:

  1. Sensitive Word Detection: Identifying sensitive words in the original harmful prompt
  2. Cross-lingual Candidate Selection: Translating sensitive words into 79 languages and selecting the most effective candidates
  3. Macaronic Substitute Construction: Constructing adversarial substitutes via parameterized character-level recombination from multilingual candidates
  4. Zeroth-Order Optimization: Iteratively optimizing construction parameters using NSFW detection scores as feedback
  5. Adversarial Prompt Generation: Replacing original sensitive words with the constructed substitutes

Key Designs

1. Sensitive Word Detection

Two strategies are employed:

  • Blacklist matching: comparing against a predefined list of harmful words
  • Semantic similarity scoring: computing cosine similarity between each word embedding and harmful concept embeddings; a word \(w_i\) is flagged as sensitive when its similarity to any harmful-concept embedding exceeds a threshold \(\tau\):

\[\max_j \cos(\text{Embed}(w_i), e_{harm}^j) > \tau\]
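The detection rule above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the embedding function, harmful-concept embeddings, threshold value, and the name `detect_sensitive_words` are all assumptions.

```python
import numpy as np

def detect_sensitive_words(words, embed, harm_embeddings, tau=0.6):
    """Flag words whose max cosine similarity to any harmful-concept
    embedding exceeds the threshold tau (tau=0.6 is an assumed value)."""
    flagged = []
    for w in words:
        v = embed(w)
        v = v / np.linalg.norm(v)
        # cosine similarity against each harmful-concept embedding
        sims = [float(v @ (e / np.linalg.norm(e))) for e in harm_embeddings]
        if max(sims) > tau:
            flagged.append(w)
    return flagged
```

In practice `embed` would be the text encoder's word-embedding lookup; here it is any callable mapping a word to a vector.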

2. Cross-lingual Candidate Selection

For each sensitive word \(w_{\lambda_i}\):

  1. An LLM translates it into \(L=79\) languages, forming candidate pool \(V^{(\lambda_i)}\)
  2. Each candidate is inserted into a template prompt to generate 10 images
  3. Two metrics evaluate candidate quality:
    • Harmfulness score \(\mathcal{H}\): target class probability output by an NSFW detector
    • Visual semantic similarity CLIPSim: CLIP score between images generated by the candidate and a safe prompt
  4. Candidates are ranked by combined score; top-\(k\) (\(k=10\)) are retained
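The ranking step above can be sketched as follows. The combined score is assumed here to be an unweighted sum of the harmfulness score \(\mathcal{H}\) and CLIPSim (the paper may weight the two terms differently), and the function names are illustrative.

```python
def select_candidates(candidates, harm_score, clip_sim, k=10):
    """Rank translated candidates by a combined score of NSFW
    harmfulness and visual semantic similarity; keep the top-k.

    harm_score / clip_sim are callables returning floats in [0, 1];
    the unweighted sum is an assumption for illustration.
    """
    scored = [(harm_score(c) + clip_sim(c), c) for c in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for _, c in scored[:k]]
```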

3. Macaronic Substitute Construction (Core Innovation)

Due to the irreversibility of T2I tokenizers for non-English languages (\(\epsilon(\epsilon^{-1}(\epsilon(v))) \neq \epsilon(v)\)), direct token-level manipulation is infeasible. Therefore, a character-level manipulation strategy is proposed.

For the \(k\) candidates of each sensitive word \(w_{\lambda_i}\), three sets of parameters are defined:

  • Boundary parameters \(\beta_1^{(\lambda_i)}, \beta_2^{(\lambda_i)} \in [0,1]^k\): controlling the start and end positions of substrings extracted from each candidate
  • Ordering parameters \(\alpha^{(\lambda_i)} \in \mathbb{R}^k\): controlling the concatenation order of substrings

Substring extraction positions are computed as:

\[\mu_{1,j}^{(\lambda_i)} = \lfloor l_j \cdot \beta_{1,j}^{(\lambda_i)} \rfloor, \quad \mu_{2,j}^{(\lambda_i)} = \lfloor l_j \cdot \beta_{2,j}^{(\lambda_i)} \rfloor\]

Substrings \(\bar{v}_j^{(\lambda_i)} = \hat{v}_j(\mu_{1,j}:\mu_{2,j})\) are extracted from candidate words, sorted in descending order of \(\alpha\), and concatenated to form the macaronic substitute.

For example: "nudity" may be replaced with "nuditéudenakt" (fusing character fragments from French, German, etc.).
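The parameterized construction can be sketched as a small Python routine. Swapping the boundaries when \(\mu_1 > \mu_2\) is an assumption made here to keep the slice valid (the paper may instead constrain \(\beta_1 \le \beta_2\)), and the function name is illustrative.

```python
import math

def build_macaronic(candidates, beta1, beta2, alpha):
    """Construct a macaronic substitute: extract a substring from each
    candidate word using boundary parameters beta1/beta2 in [0, 1],
    then concatenate the substrings in descending order of alpha."""
    parts = []
    for v, b1, b2, a in zip(candidates, beta1, beta2, alpha):
        l = len(v)
        m1, m2 = math.floor(l * b1), math.floor(l * b2)
        if m1 > m2:
            m1, m2 = m2, m1  # assumed fix-up to keep a valid slice
        parts.append((a, v[m1:m2]))
    parts.sort(key=lambda t: t[0], reverse=True)
    return "".join(s for _, s in parts)
```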

4. Zeroth-Order Optimization (ZOO)

The NSFW detection probability serves as the objective function:

\[\mathcal{L} = \|\mathcal{H}(p_{adv}) - \mathbf{1}\|_2\]

Gradients are approximated via finite differences:

\[\nabla_{\beta_r}\mathcal{L} \approx \frac{\mathcal{L}(\beta_r + \delta) - \mathcal{L}(\beta_r - \delta)}{2\delta}\]

The learning rate is set to 0.1 with 100 iterations and an initial perturbation magnitude \(\delta_0 = 0.25\); early stopping is supported.
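A single zeroth-order update following the finite-difference formula above might look like the sketch below. Clipping parameters back to \([0, 1]\) is an assumption consistent with the boundary parameters' domain, and the function name is illustrative.

```python
import numpy as np

def zoo_step(loss, params, delta=0.25, lr=0.1):
    """One zeroth-order update: approximate each partial derivative by
    a symmetric finite difference, then take a gradient-descent step."""
    grad = np.zeros_like(params)
    for r in range(len(params)):
        e = np.zeros_like(params)
        e[r] = delta
        grad[r] = (loss(params + e) - loss(params - e)) / (2 * delta)
    # clipping to [0, 1] is an assumption matching the beta domain
    return np.clip(params - lr * grad, 0.0, 1.0)
```

In the full attack, `loss` would wrap image generation plus the NSFW detector \(\mathcal{H}\); here any scalar function of the parameters suffices.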

Loss & Training

This method is an inference-time adversarial attack and does not involve model training. The optimization process is conducted entirely under a black-box setting, relying solely on feedback from generated images.

Key Experimental Results

Main Results

NSFW Concept Generation Attack Performance (Sexual Content):

| Method | Blacklist BPR | LatentGuard BPR | BERT BPR | SD ASR-5 | ESD ASR-5 | SLD-Max ASR-5 | SafeGen ASR-5 | FMN ASR-5 |
|---|---|---|---|---|---|---|---|---|
| DACA | 94% | 98% | 72% | 40% | 36% | 34% | 34% | 36% |
| PGJ | 96% | 98% | 54% | 38% | 46% | 54% | 62% | 50% |
| SurPro | 100% | 94% | 76% | 52% | 60% | 68% | 68% | 48% |
| DiffZOO | 52% | 56% | 36% | 52% | 50% | 74% | 66% | 28% |
| MacPrompt | 100% | 82% | 70% | 96% | 74% | 96% | 88% | 76% |

Violence Content:

| Method | Blacklist BPR | SD ASR-5 | ESD ASR-5 | SafeGen ASR-5 | FMN ASR-5 |
|---|---|---|---|---|---|
| DACA | 78% | 85% | 72% | 80% | 80% |
| DiffZOO | 48% | 66% | 66% | 40% | 70% |
| MacPrompt | 100% | 72% | 74% | 90% | 74% |

Semantic Consistency (CLIPScore/BLIPScore):

| Content Type | Original↔Adversarial Prompt CLIPScore | Original↔Adversarial Image CLIPScore | Original Prompt↔Adversarial Image BLIPScore |
|---|---|---|---|
| Sex | 0.8768 | 0.7893 | 0.5602 |
| Violence | 0.8618 | 0.8012 | 0.5893 |
| Dog | 0.9223 | 0.8597 | 0.9572 |
| Car | 0.9348 | 0.7335 | 0.5047 |

Ablation Study

Banned Object Generation Attack:

| Object | Method | SD ASR-1/5 | ESD ASR-1/5 | FMN ASR-1/5 | EAP ASR-1/5 |
|---|---|---|---|---|---|
| Dog | MMP-Attack | 66/90 | 78/88 | 60/90 | 52/94 |
| Dog | MacPrompt | 96/100 | 64/88 | 78/98 | 46/88 |
| Car | MMP-Attack | 76/84 | 52/86 | 70/88 | 62/92 |
| Car | MacPrompt | 92/100 | 50/94 | 86/98 | 60/96 |

Key Findings

  1. Cross-lingual character recombination effectively preserves visual semantics while evading text detection: Macaronic words are distant from the original words in the text embedding space, yet the generated images remain highly consistent in the image embedding space.
  2. Particularly effective against concept removal defenses: MacPrompt achieves very high attack success rates against ostensibly stronger concept removal models (ASR-5 of 96% on SD).
  3. Strong transferability: A single adversarial prompt can simultaneously attack multiple defense models.
  4. Effective against commercial systems: ASR of 65% on DALL·E 3 and 96% on Doubao.
  5. Average BLIPScore of 0.6953, significantly outperforming MMP-Attack's 0.414, indicating superior semantic consistency.

Highlights & Insights

  • Novel attack vector: Exploits the implicit multilingual capability of T2I models, uncovering an entirely new class of security vulnerabilities.
  • Character-level manipulation: Elegantly resolves the tokenizer irreversibility problem, achieving finer-grained control than token-level manipulation.
  • High practicality: Fully black-box, requiring no internal model information, and applicable to any T2I system.
  • Reveals systemic security flaws: All existing T2I safety mechanisms—from simple keyword matching to state-of-the-art concept removal—fail to robustly defend against multilingual adversarial attacks.
  • Visualization analysis of macaronic words is compelling: separation in text space but clustering in image space.

Limitations & Future Work

  • Generating images for candidates across 79 languages incurs substantial computational cost.
  • The fixed 100 iterations in ZOO optimization may require adaptive adjustment for sensitive words of varying difficulty.
  • Defense direction: This work motivates a reconsideration of multilingual robustness—future safety filters should account for cross-lingual token sharing and detection at the visual semantic level.
  • Ethical considerations: Although the authors conducted responsible disclosure, public release of the method may be subject to malicious exploitation.

This work is situated at the frontier of T2I security research:

  • Defense side: evolution from simple blacklists → BERT classifiers → LatentGuard → concept removal (ESD/SLD/FMN/SafeGen/DUO/EAP/PromptGuard)
  • Attack side: progression from white-box (Prompting4Debugging) → gray-box (P4D) → black-box (DACA/DiffZOO/PGJ/SurrogatePrompt) → the cross-lingual black-box attack proposed in this paper

Core insight: T2I safety mechanisms cannot be confined to monolingual assumptions; defenses must be constructed at the visual semantic level rather than the textual level.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ (Cross-lingual character-level recombination as an attack strategy is unprecedented)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Spanning 3 filter types + 9 concept removal methods + 7 baselines)
  • Writing Quality: ⭐⭐⭐⭐ (Method description is clear and rigorously formalized)
  • Value: ⭐⭐⭐⭐⭐ (Exposes fundamental vulnerabilities in current T2I safety frameworks)