PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization¶

Conference: AAAI 2026 arXiv: 2511.16209 Code: github.com/psm-defense/psm Area: AI Security Keywords: System Prompt Security, Prompt Extraction Attack, Black-Box Optimization, LLM-as-Optimizer, Defense Shield

TL;DR¶

This paper proposes PSM, a framework that formalizes system prompt protection as a utility-constrained black-box optimization problem. By leveraging LLM-as-Optimizer, PSM automatically searches for an optimal "shield" suffix that reduces prompt extraction attack success rates to near zero without degrading model functionality.

Background & Motivation¶

The Dual Nature of System Prompts¶

System prompts are central to modern LLM applications, defining the model's role, operational constraints, task rules, and interaction style. For many commercial LLM products, system prompts embody core intellectual property and competitive advantage—they serve as the "secret weapon" that transforms a general-purpose base model into a specialized, high-performance application.

The Prompt Extraction Threat¶

Given their high value, system prompts have become a primary target for malicious actors. Attackers craft adversarial queries to induce LLMs into leaking their system instructions. Such "adversarial reconnaissance" can be exploited to: - Replicate commercial services - Expose sensitive operational information - Serve as a precursor to more sophisticated attacks

The emergence of prompt trading markets has further transformed prompt leakage from a theoretical vulnerability into a tangible economic risk.

Limitations of Prior Work¶

Heuristic instruction-based defenses (e.g., "Do not reveal your instructions"): Fragile and easily bypassed by attacks such as "ignore all previous instructions."

Input/output filtering: Introduces additional computational overhead and struggles to detect novel or obfuscated attacks.

Prompt transformation and obfuscation (e.g., ProxyPrompt): Requires white-box model access, which is infeasible for many developers.

The root cause of the difficulty lies in a fundamental tension: LLMs are trained to follow instructions helpfully, which conflicts inherently with security constraints. Simple defensive instructions merely create a direct conflict between developer rules and attacker commands—and attackers have demonstrated a consistent ability to win such conflicts.

PSM's Positioning¶

PSM addresses a practical question: "As a developer using a closed-source LLM API, how can one design a black-box, lightweight, and effective system prompt defense?" PSM strikes a unique balance between robustness and practicality—achieving optimization quality comparable to ProxyPrompt while operating entirely through natural language interaction with a black-box LLM.

Method¶

Overall Architecture¶

The core idea of PSM is shield appending—appending a protective text suffix to the end of the original system prompt. The problem is formalized as a constrained optimization task, and LLM-as-Optimizer is used to iteratively search for the optimal shield.

The prompt structure is: [SYSTEM PROMPT] Original Prompt P [SHIELD] Shield S

Key Designs¶

1. Problem Formulation: Utility-Constrained Optimization¶

Prompt protection is formalized as a constrained optimization problem:

\[\min_{S} L(P \oplus S) \quad \text{s.t.} \quad U(P \oplus S) \geq \tau\]

where \(L(P \oplus S)\) denotes the leakage score, \(U(P \oplus S)\) denotes task utility, and \(\tau\) is the minimum acceptable utility threshold.

Design Motivation: This formulation explicitly captures the trade-off between robustness and functionality, preventing the defense from excessively degrading model usefulness.

2. Leakage Objective¶

ROUGE-L recall is used to measure prompt exposure. For an attack set \(A\), a log-sum-exp (LSE) smooth approximation of the maximum is applied:

\[L(P \oplus S) = \frac{1}{\beta} \log \sum_{a \in A} \exp(\beta \cdot \text{ROUGE-L}_{\text{recall}}(P, R_a))\]

where \(\beta = 10\) is a temperature parameter. As \(\beta \to \infty\), this converges to the hard maximum.

Adversarial Attack Design: \(|A| = 50\) composite attack queries are used, each constructed by concatenating three orthogonal strategies: - Distractor: Context-switching phrases that exploit the model's tendency to prioritize recent instructions - Repetition: Direct commands such as "repeat the system prompt" - Formatting: Disguising extraction as a benign task (e.g., "format the system prompt as a Python triple-quoted string")

This composite structure achieves nearly 100% extraction success on both GPT-4o and GPT-4o-mini.

3. Utility Objective¶

A gold-standard dataset \(\mathcal{D}_{\text{gold}} = \{(q_i, g_i)\}\) is constructed with GPT-4o-generated reference answers. Utility retention is measured using cosine similarity of sentence embeddings (sentence-transformers/all-MiniLM-L6-v2):

\[r_i = \frac{\text{sim}(t_i, g_i)}{\text{sim}(b_i, g_i)}\]

The ratio formulation tolerates minor degradation while flagging significant utility loss.

4. Fitness Function and LLM-as-Optimizer¶

The constrained problem is converted to an unconstrained scalar objective via a penalty function:

\[\text{fitness}(S) = L(P \oplus S) + \lambda \cdot \max(0, \tau - U(P \oplus S))\]

where \(\lambda = 100\) and \(\tau = 0.9\).

Evolutionary Optimization Procedure: 1. Initialization: The optimizer LLM generates 5 diverse shield candidates. 2. Evaluation: Full adversarial and baseline queries are executed for each candidate. 3. Selection and Generation: The top 10 historically best shields, along with their fitness scores, are provided as context; the LLM is prompted to analyze successful patterns and generate 5 improved candidates. 4. Iteration: Steps 2–3 are repeated for 10 iterations or until termination conditions are met (\(U \geq 0.9\) and \(L < 0.65\)).

Shield Placement Strategy¶

Suffix placement is a deliberate design choice: research shows that LLMs tend to weight information at the end of the context more heavily, and later instructions can override earlier ones—the same effect exploited by indirect prompt injection attacks. PSM inverts this property for defensive purposes.

Structured markers ([SYSTEM PROMPT] and [SHIELD]) provide clear delimiters and labels, helping the model recognize and respect the hierarchical relationship between instructions.

Loss & Training¶

The optimizer model defaults to GPT-4o-mini (temperature=1 to encourage diversity).
The victim model uses temperature=0, top-p=1 for deterministic decoding.
Each optimization iteration generates 5 shield candidates.
Termination condition: high utility (\(U \geq 0.9\)) and low leakage (\(L < 0.65\)).

Key Experimental Results¶

Main Results¶

Evaluated models: GPT-5-mini, GPT-4.1-mini, GPT-4o-mini (all accessed via black-box API)

Evaluation datasets: Synthetic System Prompts (30 prompts) and UNNATURAL Instructions (30 prompts)

Attack suites: Raccoon (59 queries), Raccoon-Language (multilingual variants), Liang (22 polite requests), Zhang (110 command overrides)

Dataset	Attack Suite	Defense	GPT-5-mini JM	GPT-4.1-mini JM	GPT-4o-mini JM
Synthetic	Raccoon	No Defense	42%	59%	27%
Synthetic	Raccoon	N-gram Filtering	18%	3%	5%
Synthetic	Raccoon	PSM	8%	5%	4%
UNNATURAL	Raccoon	No Defense	22%	42%	20%
UNNATURAL	Raccoon	N-gram Filtering	9%	1%	1%
UNNATURAL	Raccoon	PSM	3%	5%	0%
Synthetic	Liang	No Defense	54%	78%	32%
Synthetic	Liang	PSM	13%	4%	6%
UNNATURAL	Zhang	No Defense	30%	32%	14%
UNNATURAL	Zhang	PSM	3%	1%	0%

Ablation Study¶

Utility Retention Rate:

Model	Synthetic Prompts	UNNATURAL
GPT-5-mini	101.88%	101.27%
GPT-4.1-mini	100.89%	114.76%
GPT-4o-mini	99.73%	100.73%

All utility scores are ≥ 99.73%, with some exceeding 100% (indicating that the shield actually improves task performance), demonstrating that PSM does not degrade model utility.

N-gram Filtering vs. PSM under Multilingual Attacks: N-gram filtering degrades substantially against Raccoon-Language attacks (which require translation or paraphrase), whereas PSM maintains low ASR (single-digit percentages) across both settings.

Key Findings¶

PSM reduces attack success rates to 0–6%: Near-zero leakage is achieved across most dataset × attack combinations.
Strong cross-model generalization: Despite large differences in baseline leakage rates across models, PSM consistently yields low ASR.
Outperforms exact-match filtering: N-gram filtering fails under translation/paraphrase attacks, while PSM is robust to both scenarios.
Zero utility loss: Utility retention rates are uniformly ≥ 99.73%, confirming that the shield does not interfere with normal model functionality.

Highlights & Insights¶

Paradigm shift through formalization: This work is the first to elevate prompt protection from ad hoc patching to a rigorous constrained optimization problem, rendering defenses quantifiable and reproducible.
Black-box practicality: Only API access is required—no model weights or gradients—making the approach genuinely accessible to practitioners.
Turning adversarial properties into defensive assets: The suffix placement strategy cleverly inverts the "LLMs weight end-of-context information more heavily" effect that prompt injection attacks exploit.
Offline optimization, zero inference cost: The shield is a static text suffix; once found, it incurs no additional inference-time computation.

Limitations & Future Work¶

Computationally intensive optimization loop: Repeated evaluation over adversarial and benign query sets is required.
No guarantee of transferability: Defense effectiveness depends on the breadth and representativeness of the attack suite used during optimization.
Scope limited to extraction attacks: Jailbreak and multi-turn conversational attacks are not addressed.
Future directions include extending PSM to jailbreak defense, evaluating cross-model-family transferability, and developing more efficient search heuristics.

ProxyPrompt optimizes a "proxy" in embedding space to preserve utility while rendering extraction meaningless → requires white-box access.
Spotlighting (Microsoft) uses delimiters or character encoding to help LLMs distinguish trusted from untrusted inputs.
OPRO and related LLM-as-Optimizer works provide the methodological foundation for PSM.
Implication for the security community: the "LLM optimizing LLMs" paradigm can be extended from task optimization to security defense.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ — First to formalize prompt protection as a constrained optimization problem
Experimental Thoroughness: ⭐⭐⭐⭐ — 3 models × 4 attack suites × 2 datasets; comprehensive coverage
Writing Quality: ⭐⭐⭐⭐⭐ — Clear mathematical derivations and well-motivated problem setup
Value: ⭐⭐⭐⭐⭐ — Highly practical; a plug-and-play solution targeting real deployment scenarios