Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

Conference: ICLR 2026
arXiv: 2602.22983
Code: None
Area: LLM Alignment
Keywords: LLM Safety, Jailbreak Attack, Classical Chinese, Bio-Inspired Optimization, Black-Box Attack

TL;DR

This paper proposes CC-BOS, a framework that exploits the semantic compression and inherent ambiguity of Classical Chinese, combined with a Fruit Fly Optimization Algorithm to search an eight-dimensional strategy space for optimal jailbreak prompts, achieving nearly 100% attack success rate across six mainstream LLMs.

Background & Motivation

The safety alignment mechanisms of LLMs exhibit uneven performance across different linguistic contexts. Low-resource languages are more prone to eliciting unsafe outputs due to insufficient training data. This paper is the first to systematically investigate the role of Classical Chinese in jailbreak attacks:
- Classical Chinese features semantic compression, rich rhetorical devices, and inherent polysemy.
- Its expression diverges substantially from Modern Chinese, enabling partial bypass of keyword- and template-matching-based defenses.
- While models can adequately comprehend Classical Chinese input, safety guardrails optimized for modern languages fail to detect malicious intent embedded within it.

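Core insight: The safety vulnerability of Classical Chinese does not stem from insufficient data coverage, since models comprehend the language adequately. It is instead a blind spot in safety alignment, whose guardrails are tuned for modern languages.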

Method

Overall Architecture

CC-BOS comprises three core components: (1) an eight-dimensional strategy space defining the generative dimensions of jailbreak prompts; (2) a bio-inspired Fruit Fly Optimization Algorithm (FOA) that searches the strategy space for optimal configurations; and (3) a two-stage translation module that converts Classical Chinese responses into English to ensure evaluation accuracy.

Key Designs

Eight-Dimensional Strategy Space \(\mathbb{S} = D_1 \times D_2 \times \cdots \times D_8\):
- \(D_1\): Role identity (e.g., ancient scholar, strategist)
- \(D_2\): Behavioral guidance
- \(D_3\): Mechanism (e.g., cascaded reasoning)
- \(D_4\): Metaphorical mapping
- \(D_5\): Expressive style (e.g., parallel prose, free prose)
- \(D_6\): Knowledge association
- \(D_7\): Trigger pattern
- \(D_8\): Contextual setting
- Given an original query \(q_0\) and strategy \(\mathbf{s}\), the prompt generator \(G\) produces an adversarial query \(q = G(q_0; \mathbf{s})\)
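
The product structure of \(\mathbb{S}\) can be illustrated with a small sketch in which a strategy is one option index per dimension. The per-dimension option counts in `DIM_SIZES` are hypothetical placeholders, since the summary does not list how many options each \(D_i\) contains, and the helper names are illustrative only.

```python
from itertools import product
from math import prod

# Hypothetical option counts for D_1..D_8 (not given in the source).
DIM_SIZES = [4, 3, 3, 4, 2, 3, 3, 4]


def space_size(dim_sizes):
    """Number of distinct strategies: |S| = |D_1| * ... * |D_8|."""
    return prod(dim_sizes)


def is_valid(strategy, dim_sizes):
    """A strategy holds exactly one in-range option index per dimension."""
    return len(strategy) == len(dim_sizes) and all(
        0 <= s < n for s, n in zip(strategy, dim_sizes)
    )


total = space_size(DIM_SIZES)
# First point of the Cartesian product, i.e. option 0 in every dimension.
first = next(product(*(range(n) for n in DIM_SIZES)))
```

Even with these small assumed counts the space has thousands of points, which is why a guided search is used rather than exhaustive enumeration.
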
Fruit Fly Optimization Algorithm (FOA):
- Olfactory search: Adaptive local perturbation with step size decaying over iterations as \(\Delta_t = \max(1, \lfloor \alpha |D_i| \cdot \gamma^t \rfloor)\)
- Visual search: Attraction toward the global optimum with attraction probability \(\beta_t = \beta_0 + (1-\beta_0) \cdot t/N\)
- Cauchy mutation: Leverages the heavy-tailed property of the Cauchy distribution to escape local optima
- Hash-based deduplication and early stopping are introduced to improve search efficiency
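
The two schedules and the three search moves listed above can be sketched on abstract index tuples as follows. This is a minimal illustration, not the authors' implementation: the values of `alpha`, `gamma`, `beta0`, and `scale`, the wrap-around handling at dimension boundaries, and the per-coordinate form of the visual move are all assumptions.

```python
import math
import random


def step_size(dim_size, t, alpha=0.5, gamma=0.8):
    """Delta_t = max(1, floor(alpha * |D_i| * gamma**t)); alpha, gamma assumed."""
    return max(1, math.floor(alpha * dim_size * gamma ** t))


def attraction_prob(t, n_iters, beta0=0.3):
    """beta_t = beta0 + (1 - beta0) * t / N; beta0 assumed."""
    return beta0 + (1.0 - beta0) * t / n_iters


def olfactory_step(strategy, dim_sizes, t, rng):
    """Shift one random dimension by a nonzero offset of at most Delta_t."""
    i = rng.randrange(len(strategy))
    delta = step_size(dim_sizes[i], t)
    shift = rng.choice([d for d in range(-delta, delta + 1) if d != 0])
    out = list(strategy)
    out[i] = (out[i] + shift) % dim_sizes[i]  # wrap-around is an assumption
    return tuple(out)


def visual_step(strategy, best, t, n_iters, rng):
    """Copy each coordinate from the current best with probability beta_t."""
    beta = attraction_prob(t, n_iters)
    return tuple(b if rng.random() < beta else s for s, b in zip(strategy, best))


def cauchy_step(strategy, dim_sizes, rng, scale=1.0):
    """Heavy-tailed jump on one dimension using an inverse-CDF Cauchy sample."""
    i = rng.randrange(len(strategy))
    jump = round(scale * math.tan(math.pi * (rng.random() - 0.5)))
    out = list(strategy)
    out[i] = (out[i] + jump) % dim_sizes[i]
    return tuple(out)


def dedupe(candidates, seen):
    """Drop strategies already evaluated; set membership is hash-based."""
    fresh = []
    for cand in candidates:
        if cand not in seen:
            seen.add(cand)
            fresh.append(cand)
    return fresh
```

With these assumed constants, the olfactory step shrinks toward single-index moves as \(t\) grows while \(\beta_t\) rises linearly to 1, so the search shifts from exploration to exploitation over the iterations.
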
Fitness Evaluation \(F(\mathbf{s}) = S_c(\mathbf{s}) + S_k(\mathbf{s})\):
- Consistency score \(S_c\): Scored by an evaluation model (0–100), measuring the alignment between the response and the malicious instruction
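- Keyword score \(S_k\): Detects the presence of rejection keywords (0 or 20 points; since higher fitness indicates a more successful attack, the 20 points are presumably awarded when no rejection keyword appears)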
- Total fitness range: \([0, 120]\)
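
The scoring arithmetic can be sketched as below. The refusal-marker list is a hypothetical stand-in, the assumption that \(S_k = 20\) is awarded when no marker is found is an inference from the fitness range rather than something this summary states, and treating the threshold of 80 as inclusive is likewise an assumption. The consistency score is taken as an input because it comes from an external evaluation model.

```python
# Hypothetical refusal markers; the source does not list the actual keywords.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def keyword_score(response):
    """S_k: 0 if any refusal marker is present, else 20 (assumed direction)."""
    lowered = response.lower()
    return 0 if any(marker in lowered for marker in REFUSAL_MARKERS) else 20


def fitness(consistency, response):
    """F(s) = S_c + S_k, with S_c in [0, 100] supplied by an evaluator."""
    if not 0 <= consistency <= 100:
        raise ValueError("consistency score must lie in [0, 100]")
    return consistency + keyword_score(response)


def is_success(score, threshold=80):
    """Success test against the fitness threshold (inclusive by assumption)."""
    return score >= threshold
```

Under these assumptions a response with \(S_c = 65\) and no refusal marker scores 85 and counts as a success, while the same \(S_c\) with a refusal marker scores 65 and does not.
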

Loss & Training

DeepSeek-Chat is used as both the attack model and the translation model.
Initial population size is 5; maximum number of iterations is 5.
A fitness threshold of 80 is used to determine jailbreak success.
Two-stage translation module: Classical Chinese → Modern Chinese → English.

Key Experimental Results

Main Results (AdvBench Dataset)

| Target Model | CC-BOS ASR | CC-BOS Avg. Score | ICRT ASR | ICRT Avg. Score |
| --- | --- | --- | --- | --- |
| Gemini-2.5-flash | 100% | 4.82 | 92% | 4.52 |
| Claude-3.7 | 100% | 3.14 | 40% | 1.60 |
| GPT-4o | 100% | 4.74 | 74% | 3.06 |
| DeepSeek-Reasoner | 100% | 4.84 | 88% | 4.00 |
| Qwen3-235B | 100% | 4.88 | 84% | 4.00 |
| Grok-3 | 100% | 4.76 | 98% | 4.30 |
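
As a quick reading aid, the ASR columns above can be compared per model with a few lines of arithmetic. The numbers are copied from the table; the variable names are illustrative.

```python
# (CC-BOS ASR, ICRT ASR) in percent, copied from the AdvBench table.
ASR = {
    "Gemini-2.5-flash": (100, 92),
    "Claude-3.7": (100, 40),
    "GPT-4o": (100, 74),
    "DeepSeek-Reasoner": (100, 88),
    "Qwen3-235B": (100, 84),
    "Grok-3": (100, 98),
}

# Percentage-point gap between CC-BOS and ICRT for each target model.
gaps = {model: cc - icrt for model, (cc, icrt) in ASR.items()}
largest_gap_model = max(gaps, key=gaps.get)

# Unweighted mean ICRT ASR across the six models.
mean_icrt = sum(icrt for _, icrt in ASR.values()) / len(ASR)
```

The gap is widest on Claude-3.7 (60 percentage points) and narrowest on Grok-3 (2 points), and the unweighted mean ICRT ASR is about 79.3%.
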

Efficiency Comparison (Average Query Count Avg.Q)

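| Method | Gemini | Claude | GPT-4o | DeepSeek | Qwen3 | Grok-3 |
| --- | --- | --- | --- | --- | --- | --- |
| CC-BOS | 1.46 | 2.38 | 1.28 | 1.12 | 1.54 | 1.18 |
| CL-GSO | 3.62 | 21.42 | 4.00 | 3.26 | 5.06 | 1.24 |
| PAIR | 60.00 | 51.12 | 57.36 | 40.32 | 57.00 | 51.36 |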
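
To put the query counts on a common footing, the sketch below averages each method's Avg.Q across the six models and expresses the result relative to CC-BOS. The values are copied from the table; the unweighted mean across models is a simplification chosen for illustration, not a metric reported in the source.

```python
# Avg.Q per model (Gemini, Claude, GPT-4o, DeepSeek, Qwen3, Grok-3).
AVG_QUERIES = {
    "CC-BOS": [1.46, 2.38, 1.28, 1.12, 1.54, 1.18],
    "CL-GSO": [3.62, 21.42, 4.00, 3.26, 5.06, 1.24],
    "PAIR": [60.00, 51.12, 57.36, 40.32, 57.00, 51.36],
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


means = {method: mean(vals) for method, vals in AVG_QUERIES.items()}

# How many times more queries each method needs than CC-BOS on average.
ratios = {method: means[method] / means["CC-BOS"] for method in means}
```

On this simple average, CC-BOS needs about 1.49 queries, CL-GSO about 6.43 (roughly 4.3 times as many), and PAIR about 52.86 (roughly 35 times as many).
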

Key Findings

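CC-BOS achieves 100% ASR on all six target models, outperforming all baselines; ICRT, for example, ranges from 40% (Claude-3.7) to 98% (Grok-3).
The average number of queries per model ranges from 1.12 to 2.38, compared with 1.24 to 21.42 for CL-GSO and 40.32 to 60.00 for PAIR, so CC-BOS is markedly more query-efficient than both.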
Near-100% ASR is also maintained on the CLAS and StrongREJECT datasets.
CC-BOS retains high success rates even against the Llama-Guard-3-8B defense.

Highlights & Insights

This is the first systematic investigation of Classical Chinese in LLM security evaluation, opening a new research direction.
The formalized eight-dimensional strategy space enables comprehensive coverage of attack vectors.
The three-phase FOA search strategy (olfactory search, visual search, and Cauchy mutation) balances exploration and exploitation efficiently.
The extremely low query count suggests that the Classical Chinese context itself possesses strong bypass capability.

Limitations & Future Work

The Classical Chinese attack relies on the model's comprehension of Classical Chinese; effectiveness may diminish for models with very limited exposure to such data.
The selection of dimensions in the eight-dimensional strategy space depends on human expertise.
Defensive countermeasures, such as incorporating Classical Chinese safety data during training, are relatively straightforward to implement.
The paper focuses on attack capability without providing an in-depth discussion of defensive strategies.

Compared to CL-GSO: CC-BOS leverages Classical Chinese context rather than strategy decomposition in modern English.
Compared to white-box methods such as GCG: CC-BOS is fully black-box and requires no gradient information.
Implication: LLM safety alignment must extend coverage to historical languages and specialized linguistic contexts.

Rating

Novelty: ⭐⭐⭐⭐⭐ Classical Chinese jailbreak attacks represent an entirely novel perspective.
Experimental Thoroughness: ⭐⭐⭐⭐ Six models, three datasets, and multi-dimensional comparisons.
Writing Quality: ⭐⭐⭐⭐ Methods are clearly described with a high degree of mathematical formalization.
Value: ⭐⭐⭐⭐ Carries significant implications for LLM safety research.