This paper systematically studies the detection and removal of indirect prompt injection attacks: it constructs an evaluation benchmark, discovers that existing detection models perform poorly against indirect attacks while specially trained models can achieve 99% accuracy, proposes two removal methods (segmentation-based and extraction-based), and combines detection and removal into a filtering pipeline to effectively reduce the attack success rate of indirect prompt injection.
Prompt injection attacks are categorized into direct attacks (where the user is the attacker) and indirect attacks (where malicious instructions are injected into external data sources).
Indirect attacks present a more realistic threat: attackers embed malicious instructions into webpages or documents, which misguide LLMs after being retrieved via search engines.
Indirect attacks can achieve various malicious goals: phishing, ad promotion, public opinion manipulation, etc.
Existing defense methods mainly focus on direct attacks, leaving the detection and removal of indirect attacks severely under-investigated.
Existing detection models (such as ProtectAI, Prompt Guard, and Llama Guard) are primarily trained on direct attacks, yielding low detection rates for indirect attacks.
There is almost no research on what to do after detection—how to remove malicious content while preserving useful information?
Over-defense issue: detection models misclassify clean documents as injected documents, affecting normal utility.
The injection location (prefix/middle/suffix) significantly impacts detection performance, but prior methods fail to consider this.
Detection of indirect prompt injection requires specialized trained models, as general safety models are insufficient.
Detection and removal should be combined into a unified filtering pipeline.
Over-defense primarily occurs on out-of-domain (OOD) documents and is almost non-existent in-domain (ID).
Different injection locations require distinct removal strategies: segmentation is suitable for prefix/middle, while extraction is suitable for suffix.
A two-stage filtering pipeline: the first stage uses a detection model to determine whether a document has been injected; the second stage removes the injected content from the detected documents. Finally, the processed clean document is passed to the LLM to execute the user's original task.
Existing models are ineffective against indirect attacks: Llama Guard achieves a maximum of only 39%, and ProtectAI is only effective against specific attacks.
Specialized training yields notable performance: DeBERTa trained only on Naive Attacks achieves a 99% detection rate and generalizes well to other attacks.
Over-defense primarily occurs out-of-domain: The in-domain over-defense rate is 0%, while out-of-domain goes up to 27%; stronger models and more fluent documents are less prone to over-defense.
Complementarity between segmentation and extraction: Segmentation performs better overall, but extraction achieves a 100% removal rate for suffix injections (the most effective attack location).
Removal does not compromise information utility: Document corruption caused by over-defense barely affects the accuracy of responses to the original QA tasks.
Injected positions in training data are crucial: Models trained on only a single position struggle to generalize to other positions.
Limited generalization to Fake Completion Attacks (due to its injection pattern being highly distinct from Naive Attacks).
Over-defense remains notable in out-of-domain scenarios (12-27%), restricting cross-domain deployment.
Segmentation removal relies on the sentence-level classification capability of the detection model, whereas that model is trained on document-level tasks.
Does not consider active evasion of detection by attackers (e.g., dispersing the injection instruction across multiple sentences).
The benchmark dataset only covers QA scenarios and has not been expanded to multi-turn dialogues, code generation, etc.