
DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning

Conference: ICML2025
arXiv: 2511.05784
Code: TBD
Area: LLM Safety
Keywords: LLM unlearning, in-context learning, chain-of-thought, training-free, privacy protection, harmful knowledge removal

TL;DR

This paper proposes DRAGON, a training-free LLM unlearning framework. It identifies prompts to be forgotten using a dual-layer detection module and subsequently performs in-context intervention using a CoT guard model to generate reasoning instructions, achieving efficient unlearning without modifying model parameters.

Background & Motivation

Core Problem: LLM training data may contain private information or harmful knowledge, which needs to be "forgotten" after deployment to comply with regulations such as GDPR.
Limitations of Prior Work:
- Fine-tuning-based methods (e.g., GA, GD, NPO) require retain data, incur high computational costs, and impair the general capabilities of the model; under TOFU-5%/10% settings, several methods collapse entirely (MU→0).
- Fine-tuning-based methods cannot be applied to black-box models (such as GPT-4 and Claude) and do not support sequential unlearning scenarios.
- Existing training-free methods (such as ICUL) assume complete knowledge of the unlearning data, which is impractical.
Motivation: To design a lightweight unlearning scheme that does not require access to retain data, does not modify model weights, and is scalable to any LLM.

Method

DRAGON consists of two core modules: Unlearning Prompt Detection and In-Context Intervention.

1. Unlearning Prompt Detection

Given a user query \(\mathbf{x}\), the detector computes a confidence score \(f(\mathbf{x}, D_u)\) against the unlearn store \(D_u\); if the score exceeds a threshold \(\tau\), the query is replaced by its intervened version \(\tilde{\mathbf{x}}\):

\[\mathbf{x} = \begin{cases} \tilde{\mathbf{x}} & f(\mathbf{x}, D_u) > \tau \\ \mathbf{x} & \text{otherwise} \end{cases}\]

Unlearn Store Construction: Llama3.1-70B is used to generate 4 rewrite candidates for the unlearning prompts, and BERTScore-based rejection sampling is employed to retain the most similar one. The original answers are not stored to prevent information leakage.
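
The store-construction step can be sketched as below. This is only an illustration under stated assumptions: `generate_rewrites` is a hypothetical stand-in for the Llama3.1-70B rewriting call, `jaccard` is a cheap token-overlap stand-in for BERTScore, and storing a prompt-to-rewrite mapping is a guess at the store layout. What follows the description above is that only prompts are kept, never answers.

```python
def jaccard(a, b):
    """Token-overlap similarity; a stand-in for BERTScore (assumption)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def pick_rewrite(original, candidates, similarity):
    """Keep the candidate rewrite that is most similar to the original prompt."""
    if not candidates:
        raise ValueError("need at least one rewrite candidate")
    return max(candidates, key=lambda c: similarity(original, c))


def build_unlearn_store(forget_prompts, generate_rewrites, similarity, n_candidates=4):
    """Map each forget prompt to its retained rewrite; answers are never stored."""
    store = {}
    for prompt in forget_prompts:
        candidates = generate_rewrites(prompt, n_candidates)  # hypothetical LLM call
        store[prompt] = pick_rewrite(prompt, candidates, similarity)
    return store
```

In practice the `similarity` argument would be replaced by a real BERTScore call; the selection logic stays the same.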

Confidence score for Privacy Unlearning Scenario (exact match + cosine similarity):

\[f(\mathbf{x}, D_u) = \text{EM}(\mathbf{x}) + \max_{\mathbf{e_u} \in D_u} \text{sim}(\mathbf{e_u}, \mathbf{e})\]
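
A minimal sketch of the privacy-scenario score and the threshold test, assuming the store holds `(prompt_text, embedding)` pairs. The function names, the case-insensitive exact-match normalization, and the toy 2-D embeddings in the usage note are illustrative assumptions, not details from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def privacy_confidence(query, query_emb, store):
    """f(x, D_u) = EM(x) + max over stored embeddings of sim(e_u, e).

    store: list of (prompt_text, embedding) pairs (assumed layout).
    """
    normalized = query.strip().lower()
    em = 1.0 if any(normalized == p.strip().lower() for p, _ in store) else 0.0
    max_sim = max((cosine(query_emb, e) for _, e in store), default=0.0)
    return em + max_sim


def needs_intervention(score, tau):
    """Trigger the in-context intervention only when the score exceeds tau."""
    return score > tau
```

For example, with `store = [("Who is Jane Doe?", [1.0, 0.0])]`, the query `"who is jane doe?"` with embedding `[1.0, 0.0]` gets an exact-match term of 1 and a similarity term of 1, so any `tau` below 2 triggers the intervention.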

Confidence score for Harmful Knowledge Unlearning Scenario (trained scoring model + BERTScore + ROUGE-L double verification):

\[f(\mathbf{x}, D_u) = \mathbb{I}(p_F(\mathbf{x}) > \tau_1) + \max_{\mathbf{x_u} \in D_u} \text{BERTScore}(\mathbf{x_u}, \mathbf{x}) + \text{ROUGE-L}(D_u, \mathbf{x})\]
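
The harmful-knowledge score can be sketched in the same way. Here `p_forget` (the trained scoring model) and `bert_score` are passed in as hypothetical callables, ROUGE-L is computed as an LCS-based F1 over whitespace tokens, and \(\text{ROUGE-L}(D_u, \mathbf{x})\) is taken as the maximum over stored prompts. That aggregation is an assumption, since the formula above does not spell it out.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over lowercased whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def harmful_confidence(query, store, p_forget, bert_score, tau1):
    """Indicator from the scoring model plus the two similarity terms.

    p_forget and bert_score are hypothetical callables supplied by the caller.
    """
    indicator = 1.0 if p_forget(query) > tau1 else 0.0
    best_bert = max((bert_score(xu, query) for xu in store), default=0.0)
    best_rouge = max((rouge_l_f1(xu, query) for xu in store), default=0.0)
    return indicator + best_bert + best_rouge
```

Because the three terms are summed, a query can cross \(\tau\) through the classifier alone or through strong lexical and semantic overlap with a stored prompt, which is one way to read the "double verification" described above.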

2. In-Context Intervention

Safety Policy Retrieval: Once an unlearning prompt is detected, the corresponding safety policy is retrieved (e.g., copyright protection, prevention of harmful knowledge leakage).
CoT Dataset Construction: GPT-4o is used to generate 800 questions about fictitious authors plus 200 rewritten TOFU questions, each paired with a generated CoT reasoning instruction. Rejection sampling keeps only the high-quality samples.
SFT Guard Model: Llama3.1-8B-Instruct is fine-tuned on the CoT dataset to serve as the guard model. During inference, it generates CoT instructions that are prepended to the original prompt, guiding the target LLM to refuse or redirect accordingly.
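
Put together, the inference-time flow described above might look like the following sketch. All five callables (`score_fn`, `retrieve_policy`, `guard_generate`, `target_generate`) are hypothetical placeholders for the detector, the policy store, the fine-tuned guard model, and the target LLM; the prompt layout (instruction, blank line, query) is also an assumption.

```python
def answer_with_guard(query, score_fn, tau, retrieve_policy, guard_generate, target_generate):
    """Route a query through detection and, if flagged, in-context intervention."""
    # Unflagged queries go to the target LLM unchanged.
    if score_fn(query) <= tau:
        return target_generate(query)

    # Flagged queries: fetch the matching safety policy, let the guard model
    # write a CoT instruction, and prepend it to the original prompt.
    policy = retrieve_policy(query)
    instruction = guard_generate(query, policy)
    guarded_prompt = f"{instruction}\n\n{query}"
    return target_generate(guarded_prompt)
```

The target model's weights are never touched in this flow; only its input changes, which is what makes the approach usable with black-box APIs.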

3. Newly Proposed Evaluation Metrics

Refusal Quality (RQ): Jointly measures the refusal rate and generation quality (cosine similarity + refusal classifier + utterance quality detection).
Dynamic Deviation Score (DDS): Measures the average deviation and stability under sequential unlearning settings.

\[\text{DDS} = \frac{1}{T}\sum_{i=1}^{T} s_i + \frac{\beta}{T-1}\sum_{i=1}^{T-1} \max(0, s_{i+1} - s_i)\]
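
As a worked illustration of the DDS formula, the sketch below takes the per-step deviation scores \(s_1, \dots, s_T\) and the weight \(\beta\) as inputs. The function name and the handling of \(T = 1\) (no penalty term) are assumptions.

```python
def dynamic_deviation_score(scores, beta):
    """Mean deviation plus a beta-weighted penalty for step-to-step increases."""
    T = len(scores)
    if T == 0:
        raise ValueError("need at least one deviation score")
    mean = sum(scores) / T
    if T == 1:
        return mean  # assumption: no penalty term with a single step
    increases = sum(max(0.0, nxt - cur) for cur, nxt in zip(scores, scores[1:]))
    return mean + beta * increases / (T - 1)
```

For scores `[20, 30, 25]` and `beta = 0.5`, the mean is 25, the only increase is 10 (from 20 to 30), and the penalty is 0.5 × 10 / 2 = 2.5, giving a DDS of 27.5. Decreases are not penalized.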

Dynamic Utility Score (DUS): Measures the consistency of model utility during sequential unlearning.

\[\text{DUS} = 1 - \frac{\sum_{i=1}^{T-1} |u_{i+1} - u_i|}{T-1}\]
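
The DUS formula can be checked the same way. The sketch takes the per-step utilities \(u_1, \dots, u_T\); returning 1.0 when fewer than two steps exist is an assumption, since the formula is undefined there.

```python
def dynamic_utility_score(utilities):
    """One minus the mean absolute change in utility between consecutive steps."""
    T = len(utilities)
    if T < 2:
        return 1.0  # assumption: no change can be observed with fewer than two steps
    total_change = sum(abs(nxt - cur) for cur, nxt in zip(utilities, utilities[1:]))
    return 1.0 - total_change / (T - 1)
```

A utility that never changes across steps, for example `[0.634, 0.634, 0.634]`, yields a DUS of exactly 1.0, while `[0.6, 0.5, 0.7]` yields 1 − (0.1 + 0.2) / 2 = 0.85.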

Key Experimental Results

WMDP Harmful Knowledge Unlearning (Llama3.1-8B-Instruct)

Method	Bio ProbAcc↓	Bio RQ↑	Chem ProbAcc↓	Chem RQ↑	Cyber ProbAcc↓	Cyber RQ↑	MMLU↑
Original	73.1	0.411	54.9	0.342	46.7	0.415	68.0
RMU	66.8	0.412	51.7	0.338	45.0	0.422	59.9
ICUL+	52.8	0.382	35.8	0.330	38.6	0.357	68.0
DRAGON	26.2	0.921	23.5	0.795	27.9	0.875	68.0

DRAGON brings ProbAcc down to between 23.5 and 27.9 across the three domains, close to random guessing (25%), while MMLU stays at 68.0, identical to the original model.

TOFU Privacy Unlearning (Llama2-7B-Chat)

Method	DS↓ (1%)	MU (1%)	KFR (1%)	KRR (1%)	DS↓ (5%)	MU (5%)	KFR (5%)	KRR (5%)
GA	48.8	0.634	0.55	0.77	95.6	0.0	0.99	0.0
PO	37.9	0.631	0.65	0.73	33.0	0.519	0.96	0.57
NPO-RT	46.4	0.633	0.68	0.80	69.9	0.473	0.94	0.16
ICUL+	58.1	0.634	0.97	0.87	49.9	0.634	0.95	0.85
DRAGON	21.4	0.634	0.98	0.88	23.1	0.634	0.99	0.87

DRAGON achieves the lowest Deviation Score (DS) at both forget ratios shown (21.4 at 1% and 23.1 at 5%), leaves model utility unchanged (MU stays at 0.634), and attains the highest or tied-highest unlearning (KFR) and retention (KRR) rates.

Highlights & Insights

Truly training-free for the target model: It does not modify any parameters of the target LLM, so it applies to black-box models and incurs no additional training cost when the target model is scaled up.
Carefully designed dual-layer detection: It balances false positives and false negatives by combining a trained scoring model with similarity-based double verification.
Preserved model capabilities: MMLU and MU show no decline in the reported tables, whereas fine-tuning-based methods can collapse at higher unlearning ratios (e.g., GA drops to MU 0.0 on TOFU-5%).
Support for sequential unlearning: DDS/DUS metrics are proposed to quantify the stability of sequential unlearning, addressing the practical deployment scenario where unlearning requests arrive continuously.
Stronger target models yield better results: RQ even exceeds 1.0 on larger models such as Mixtral-8x7B, suggesting that the framework's effectiveness correlates positively with the target model's instruction-following ability.

Limitations & Future Work

Detection threshold relies on manual configuration: The choice of \(\tau\) impacts the balance between unlearning quality and false positive rate, and an adaptive scheme is not provided.
The guard model itself requires training: Although the target LLM needs no fine-tuning, SFT for the guard model still requires CoT data and computational resources.
Adversarial robustness not fully validated: Although the paper evaluates paraphrase attacks, defenses against more complex jailbreak attacks (e.g., multi-turn guidance, role-playing) are not fully validated.
Unlearn Store management overhead: In sequential unlearning scenarios where the store continuously expands, retrieval efficiency and storage costs might become bottlenecks.
Evaluation limited to English: The effectiveness of cross-lingual unlearning has not been verified.

Rating

Novelty: ⭐⭐⭐⭐ — Introduces in-context learning + CoT reasoning to the unlearning problem, with a systematic and novel framework design.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Involves 9 LLMs, 3 tasks, and multiple sets of ablation studies, yielding comprehensive results.
Writing Quality: ⭐⭐⭐⭐ — Features a clear structure, standardized mathematical formulations, and rigorous definitions of metrics.
Value: ⭐⭐⭐⭐ — Highly valuable for practical deployment in black-box LLM unlearning scenarios.