# DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning
Conference: NeurIPS 2025 · arXiv: 2511.05784 · Code: Not yet released · Area: Model Compression / LLM Safety · Keywords: LLM Unlearning, In-Context Intervention, Chain-of-Thought, Black-Box Unlearning, Continual Unlearning
## TL;DR

DRAGON is a systematic LLM unlearning framework that requires no fine-tuning of the base model. A two-layer detection module identifies prompts that fall within the unlearning scope; a specially fine-tuned guard model then generates CoT reasoning instructions for in-context intervention, removing private or harmful knowledge while preserving the model's general capabilities.
## Background & Motivation
Background: LLM unlearning aims to remove the influence of private data or harmful knowledge, ensuring GDPR compliance and safe deployment. Mainstream approaches fall into two categories: training-based (gradient ascent / preference optimization / negative-sample fine-tuning) and training-free (prompt engineering / in-context example guidance).
Limitations of Prior Work: (a) Training-based methods require retain data, which is often unavailable in practice; (b) gradient optimization over billions of parameters is costly and infeasible for closed-source models (GPT-4/Claude); (c) most methods support only one-time unlearning and cannot handle continual unlearning requests; (d) training-based methods frequently degrade general model capabilities.
Key Challenge: The trade-off between unlearning effectiveness and general model utility. Existing training-based methods either fail to unlearn thoroughly or severely impair general capabilities (GA/KL/DPO nearly collapse under TOFU-5%).
Goal: Design a lightweight, systematic framework that modifies no model weights, requires no retain data, is applicable to black-box LLMs, and supports continual unlearning.
Key Insight: Reframe unlearning as an inference-time intervention—detect at inference time whether a prompt triggers unlearning, and if so, guide the model to refuse or redirect via CoT reasoning.
Core Idea: In-context unlearning intervention via detection + CoT reasoning guidance, without modifying model parameters.
## Method

### Overall Architecture
Input query → detection module determines whether it falls within the unlearning scope (two-layer detection: scoring model + similarity metric) → if matched, the guard model generates CoT reasoning instructions + safety policy → CoT instructions are prepended to the query and fed into the base LLM → the model follows the instructions to refuse/redirect rather than relying on memorized knowledge.
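
A minimal sketch of this flow, where `detection_score`, `generate_cot_instruction`, `retrieve_policy`, and `base_llm` are hypothetical stand-ins for the components described below (the paper's code is not yet released):

```python
# Sketch of DRAGON's inference-time intervention: detect, then guard via CoT.
def answer(query: str, unlearn_store, tau: float) -> str:
    score = detection_score(query, unlearn_store)   # two-layer detection
    if score > tau:                                 # query is in unlearning scope
        cot = generate_cot_instruction(query)       # fine-tuned guard model
        policy = retrieve_policy(query)             # task-specific safety policy
        # Prepend guidance to the query; base-model weights stay untouched.
        prompt = f"{policy}\n{cot}\n\nQuestion: {query}"
    else:
        prompt = query                              # benign path: no intervention
    return base_llm(prompt)
```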
### Key Designs
- Unlearn Store (see the first sketch after this list):
    - Function: Stores synthesized/paraphrased prompts of the content to be forgotten (the original data is never stored, preventing information leakage).
    - Mechanism: Llama3.1-70B-Instruct generates 4 paraphrase candidates per unlearning prompt; rejection sampling via BERTScore retains the most semantically similar one. Only embedding vectors are stored, not original responses.
    - Design Motivation: Even if the database is compromised, the original private data cannot be recovered.
- Two-Layer Detection Mechanism (see the second sketch after this list):
    - Privacy Record Detection (Sample Unlearning): \(f(x, D_u) = \text{EM}(x) + \max_{e_u \in D_u} \text{sim}(e_u, e)\), where \(\text{EM}(x)\) checks whether the forgotten entity name appears in the query and \(\text{sim}\) is the cosine similarity between the query embedding \(e\) and each stored embedding \(e_u\).
    - Harmful Knowledge Detection (Concept Unlearning): \(f(x, D_u) = \mathbb{I}(p_F(x) > \tau_1) + \max_{x_u \in D_u} \text{BERTScore}(x_u, x) + \text{ROUGE-L}(D_u, x)\), where \(F\) is a fine-tuned Llama3.1-8B-Instruct scoring model and \(\tau_1\) is a score threshold.
    - Design Motivation: A single signal can be bypassed by paraphrase attacks; the two-layer design (model scoring + semantic similarity) provides robustness.
- CoT Guard Model:
    - Function: Generates in-context reasoning instructions for detected unlearning prompts.
    - Mechanism: Fine-tuned from Llama3.1-8B-Instruct on a synthetic CoT dataset: 800 synthetic fictional-author questions and 200 TOFU paraphrase questions, each paired with a high-quality CoT reasoning chain generated by GPT-4o.
    - Design Motivation: CoT instructions are not pre-stored (preventing information leakage); instead, context-aware reasoning is generated dynamically per query, leveraging the LLM's inherent instruction-following capability.
- Safety Policy Retrieval:
    - Function: Retrieves task-specific safety policies for different unlearning scenarios (copyright protection / harmful-knowledge prevention / privacy fabrication replacement).
    - TOFU scenario: Dual protection, combining random fictitious author-information replacement with CoT refusal guidance.
    - WMDP scenario: Relevant policies and refusal guidelines are extracted and explicitly injected into the prompt.
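
A minimal sketch of the unlearn-store construction (referenced in the first item above), assuming the `bert-score` package for rejection sampling and leaving the paraphrase generator abstract; all names here are illustrative, not the paper's released implementation:

```python
# Sketch of unlearn-store construction: score paraphrase candidates with
# BERTScore and keep the one most similar to the original prompt.
# `paraphrase_with_llm`, `embedder`, and `store` are hypothetical stand-ins.
from bert_score import score as bert_score

def select_paraphrase(original: str, candidates: list[str]) -> str:
    # Rejection sampling: keep the candidate with the highest BERTScore F1
    # against the original unlearning prompt.
    _, _, f1 = bert_score(candidates, [original] * len(candidates), lang="en")
    return candidates[int(f1.argmax())]

# candidates = paraphrase_with_llm(original, n=4)  # e.g., Llama3.1-70B-Instruct
# best = select_paraphrase(original, candidates)
# store.add(embedder.encode(best))                 # store embeddings only
```

And a sketch of the privacy-record detection score \(f(x, D_u) = \text{EM}(x) + \max_{e_u \in D_u} \text{sim}(e_u, e)\), assuming `sentence-transformers` for embeddings; the embedding model and the threshold value are assumptions, not the paper's choices:

```python
# Sketch of the two-layer privacy-record detection score:
# an exact-match entity check plus max cosine similarity to the unlearn store.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

def detection_score(query: str, forgotten_entities: list[str],
                    store_embeddings: np.ndarray) -> float:
    # EM(x): 1 if any forgotten entity name appears verbatim in the query.
    em = float(any(name.lower() in query.lower() for name in forgotten_entities))
    # max sim(e_u, e): cosine similarity against stored paraphrase embeddings
    # (rows of store_embeddings are assumed unit-normalized).
    e = embedder.encode(query, normalize_embeddings=True)
    return em + float(np.max(store_embeddings @ e))

def in_unlearning_scope(score: float, tau: float = 1.2) -> bool:
    return score > tau  # tau is a hypothetical, task-tuned threshold
```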
### Loss & Training

The guard model is trained with standard supervised fine-tuning (SFT); only the guard model is updated, while the base LLM remains completely unchanged. The scoring model in the detection module is fine-tuned on synthetic harmful/benign queries. A minimal SFT sketch follows.
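
A minimal SFT sketch for the guard model, assuming Hugging Face `transformers`/`datasets`; the dataset file, field names, and hyperparameters are illustrative, not the paper's configuration:

```python
# Sketch of guard-model SFT: fine-tune Llama3.1-8B-Instruct on (prompt, CoT) pairs.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token                      # Llama defines no pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical JSONL with {"prompt": ..., "cot": ...} pairs: synthetic
# fictional-author questions plus TOFU paraphrases, CoT targets from GPT-4o.
ds = load_dataset("json", data_files="guard_cot.jsonl")["train"]

def tokenize(ex):
    text = f"Question: {ex['prompt']}\nReasoning instruction: {ex['cot']}"
    return tok(text, truncation=True, max_length=1024)

ds = ds.map(tokenize, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="guard-sft", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()  # only the guard model is updated; the base LLM is untouched
```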
## Key Experimental Results

### Harmful Knowledge Unlearning (WMDP, Llama3.1-8B-Instruct)

| Method | Bio ProbAcc↓ (%) | Bio RQ↑ | Chem ProbAcc↓ (%) | Cyber ProbAcc↓ (%) | MMLU↑ (%) |
|---|---|---|---|---|---|
| Original | 73.1 | 0.411 | 54.9 | 46.7 | 68.0 |
| RMU | 66.8 | 0.412 | 51.7 | 45.0 | 59.9 |
| Filter-Prompting | 45.1 | 0.444 | 40.2 | 46.1 | 68.0 |
| ICUL+ | 52.8 | 0.382 | 35.8 | 38.6 | 68.0 |
| DRAGON | 26.2 | 0.921 | 23.5 | 27.9 | 68.0 |
DRAGON approaches random chance (25%) across all harmful domains while incurring zero degradation on MMLU.
### Privacy Record Unlearning (TOFU, Llama2-7B-Chat)

| Method | DS↓ (1%) | MU↑ | KFR↑ | KRR↑ | DS↓ (5%) | DS↓ (10%) |
|---|---|---|---|---|---|---|
| Original LLM | 94.1 | 0.634 | 0.18 | 0.85 | 97.3 | 98.8 |
| GA | 48.8 | 0.633 | 0.55 | 0.77 | 95.6 (collapse) | 98.7 (collapse) |
| PO | 37.9 | 0.631 | 0.65 | 0.73 | 33.0 | 23.7 |
| NPO-RT | 46.4 | 0.633 | 0.68 | 0.80 | 69.9 | 64.7 |
| ICUL+ | 58.1 | 0.634 | 0.97 | 0.87 | 49.9 | 49.9 |
| DRAGON | 21.4 | 0.634 | 0.98 | 0.88 | 23.1 | 26.5 |
DRAGON achieves the lowest deviation scores across all unlearning ratios while fully preserving model utility.
### Continual Unlearning (Llama2-7B-Chat)

| Method | DDS↓ | DUS↑ |
|---|---|---|
| GA | 0.935 | 0.684 |
| PO | 0.315 | 0.934 |
| NPO-RT | 0.662 | 0.915 |
| ICUL+ | 0.526 | 1.000 |
| DRAGON | 0.249 | 1.000 |

DRAGON attains both the lowest DDS and a perfect DUS across sequential unlearning requests.
## Key Findings
- DRAGON is the only method that consistently performs well across all 9 LLMs, with performance improving as model capability increases (stronger instruction-following ability).
- Training-based methods (GA/KL/DPO) frequently collapse (MU drops to 0) under large unlearning ratios (5%/10%); DRAGON is entirely immune to this failure mode.
- Ablation experiments on CoT demonstrate that removing CoT instructions significantly degrades unlearning performance, confirming that reasoning guidance is the core mechanism.
- Utility preservation stems directly from leaving model weights untouched—MMLU scores remain identical to the original model.
## Highlights & Insights
- Fully training-free design: Zero modification to the base model, making the approach naturally applicable to closed-source models such as GPT-4/Claude, with no risk of catastrophic forgetting.
- Decoupled detection and intervention: The detection module can be upgraded independently (e.g., replacing the scoring model or adding stronger semantic similarity signals), and intervention strategies can be customized per task (fabrication replacement for privacy, refusal for harmful knowledge).
- Native support for continual unlearning: New entries can be added to the unlearn store without retraining any component—a scalability advantage that training-based methods cannot match.
## Limitations & Future Work

- The recall of the detection module is the critical bottleneck: if an adversary carefully paraphrases a prompt to evade detection, no intervention is triggered and the base model answers from its intact memorized knowledge.
- The CoT dataset relies on GPT-4o generation, which may be unacceptable in privacy-sensitive deployments (e.g., hospitals) where external API usage is restricted.
- The generalization of the guard model is bounded by the coverage of its training data—novel types of unlearning requests may require re-fine-tuning.
- Performance is highly dependent on the base model's instruction-following capability, which may degrade on smaller, weaker models.
- The correlation between the Refusal Quality metric and human judgment has not been validated.
## Related Work & Insights
- vs. RMU: A training-based method that modifies model parameters. On Llama3.1-8B, it reduces biology accuracy only to 66.8% (vs. DRAGON's 26.2%) while dropping MMLU from 68.0 to 59.9.
- vs. ICUL (In-Context Unlearning): Assumes full access to unlearning data (an idealized setting), yet achieves DS=58.1 on TOFU—far below DRAGON's 21.4.
- vs. Filter-Prompting: Simple prompt filtering without reasoning guidance yields incomplete unlearning (Bio 45.1%).
## Rating

- Novelty: ⭐⭐⭐⭐ A systematic unlearning framework combining detection and CoT reasoning, representing a significant advance in the training-free direction.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 9 LLMs, 3 unlearning tasks, continual unlearning, and ablation studies comprehensively.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clear and new metrics are well-defined, though some experimental tables are scattered.
- Value: ⭐⭐⭐⭐⭐ Highly practical—black-box compatible, continual unlearning support, zero model degradation, and directly deployable.