DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training
Conference: ACL 2026
arXiv: 2604.16845
Code: GitHub
Area: LLM Safety Alignment
Keywords: Difference-aware, Harm Drift, Distill-Audit-Repair, Safety Alignment, Over-refusal
TL;DR
DART identifies and addresses the "Harm Drift" problem: while fine-tuning LLMs improves difference-aware classification accuracy (e.g., identifying legitimate demographic differences), the explanations generated by the model become more harmful. Through a three-stage Distill-Audit-Repair pipeline, DART increases Llama-3-8B accuracy from 39.0% to 68.8% while reducing harm drift cases by 72.6%.
Background & Motivation
Background: Safety-aligned LLMs often default to "identity blindness"—refusing to acknowledge demographic differences even when they are factually correct (e.g., ancestry-based disease incidence rates) or legally reasonable (e.g., hiring preferences for religious institutions). This results in incorrect responses, unnecessary refusals, or generic "equal treatment" defaults.
Limitations of Prior Work: (1) LLMs perform poorly on difference-aware classification—Llama-3-8B predicts 88.6% of prompts as "requiring differentiation" when the actual rate is 50.2%, leading to an accuracy of only 11.3% for equal-treatment cases; (2) 26.8% of outputs are unparseable refusals or vague responses; (3) Fine-tuning can improve accuracy but triggers "Harm Drift"—where the conclusion is correct but the explanation introduces harmful content.
Key Challenge: Improving difference-aware accuracy requires fine-tuning, but fine-tuning compromises safety alignment. Accuracy and safety appear to be in conflict.
Goal: To simultaneously improve difference-aware classification accuracy and explanation safety, demonstrating that the two need not conflict.
Key Insight: Decouple accuracy optimization and safety repair into stages—first use distillation to improve accuracy (allowing temporary safety degradation), then audit to locate harm drift cases, and finally perform targeted repairs.
Core Idea: Harm drift is a novel safety failure mode—the model's decision becomes correct, but the explanation becomes harmful. This requires detecting safety degradation at the explanation level rather than just checking the decision output.
Method
Overall Architecture
A three-stage pipeline: Stage I (Distillation) fine-tunes the student model using teacher rationales to improve accuracy → Stage II (Auditing) compares outputs of the same prompt before and after distillation using a toxicity classifier + LLM-as-Judge to detect harm drift → Stage III (Repairing) applies severity-weighted fine-tuning to replace drifted cases with safer rationales.
Key Designs
- Label-Conditioned Teacher Distillation (Stage I):
    - Function: Improve difference-aware classification accuracy.
    - Mechanism: The teacher model receives the correct label \(y^*\) and generates a rationale \(r^*\) explaining that label (rather than predicting the label independently). Harm-aware prompting guides the teacher to generate concise rationales and to avoid repeating harmful content. LoRA is used to fine-tune the student model \(M_0\) to obtain the intermediate model \(M_{int}\).
    - Design Motivation: Label conditioning ensures rationales are aligned with correct conclusions. Replacing ground-truth labels with predicted labels drops accuracy from 0.682 to 0.641 and severely interferes with subsequent auditing.
- Paired Harm Auditing (Stage II):
    - Function: Precisely identify harm drift cases caused by distillation.
    - Mechanism: For each test prompt \(x\), outputs are generated from both \(M_0\) and \(M_{int}\) under identical decoding conditions. A toxicity classifier \(\mathcal{H}\) first flags cases where \(\mathcal{H}(r_{int}) - \mathcal{H}(r_0) > \tau_{delta}\) (\(\tau_{delta}=0.01\)). An LLM-as-Judge then confirms whether each flagged case falls into one of three drift categories: (i) repeating or elaborating on harmful content avoided by \(M_0\), (ii) normalizing problematic assumptions, or (iii) omitting hazards identified by \(M_0\). Confirmed cases are graded on a four-level severity scale (Minor/Moderate/Severe/Extreme).
    - Design Motivation: The paired design controls for prompt difficulty. Because only changes on the same prompt are compared, detected degradation can be attributed to distillation rather than to the prompt itself.
- Severity-Weighted Repair (Stage III):
    - Function: Repair harm drift cases in a targeted way without damaging accuracy.
    - Mechanism: For drift cases in \(\mathcal{P}_{drift}\), safer alternative rationales are generated and assigned different training weights based on severity. LoRA is used to continue fine-tuning \(M_{int}\) to obtain \(M_{DART}\). Only the behavior on drift cases is modified, which limits parameter drift.
    - Design Motivation: A staged approach is superior to joint multi-objective optimization. Ablation studies confirm that joint training achieves neither the accuracy of pure distillation nor the safety of targeted repair.
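The Stage II audit described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `audit_pairs`, `DriftCase`, and the callables `gen_base`, `gen_int`, `harm_score`, and `judge` are hypothetical names standing in for \(M_0\), \(M_{int}\), the toxicity classifier \(\mathcal{H}\), and the LLM-as-Judge. Only the threshold \(\tau_{delta}=0.01\) and the four severity levels come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

SEVERITY_LEVELS = ["Minor", "Moderate", "Severe", "Extreme"]  # scale named in the text
TAU_DELTA = 0.01  # threshold reported in the text


@dataclass
class DriftCase:
    prompt: str
    r_0: str        # rationale from the base model M_0
    r_int: str      # rationale from the distilled model M_int
    delta: float    # H(r_int) - H(r_0)
    severity: str   # one of SEVERITY_LEVELS


def audit_pairs(
    prompts: Iterable[str],
    gen_base: Callable[[str], str],
    gen_int: Callable[[str], str],
    harm_score: Callable[[str], float],
    judge: Callable[[str, str, str], Optional[str]],
    tau_delta: float = TAU_DELTA,
) -> List[DriftCase]:
    """Return confirmed harm drift cases, most severe first."""
    confirmed: List[DriftCase] = []
    for x in prompts:
        r_0, r_int = gen_base(x), gen_int(x)
        delta = harm_score(r_int) - harm_score(r_0)
        if delta <= tau_delta:
            continue  # toxicity filter: harm score did not rise enough
        severity = judge(x, r_0, r_int)  # None means "not harm drift"
        if severity is None:
            continue
        if severity not in SEVERITY_LEVELS:
            raise ValueError(f"unknown severity: {severity!r}")
        confirmed.append(DriftCase(x, r_0, r_int, delta, severity))
    confirmed.sort(key=lambda c: SEVERITY_LEVELS.index(c.severity), reverse=True)
    return confirmed


# Toy stand-ins (hypothetical) so the sketch runs without any model.
TOY_SCORES = {"safe": 0.02, "mild": 0.025, "bad": 0.40, "worse": 0.90}
TOY_OUTPUTS = {"p1": "mild", "p2": "bad", "p3": "worse", "p4": "bad"}
TOY_VERDICTS = {"p2": "Moderate", "p3": "Extreme"}  # p4 is rejected by the judge

cases = audit_pairs(
    prompts=["p1", "p2", "p3", "p4"],
    gen_base=lambda x: "safe",
    gen_int=lambda x: TOY_OUTPUTS[x],
    harm_score=lambda r: TOY_SCORES[r],
    judge=lambda x, r_0, r_int: TOY_VERDICTS.get(x),
)
print([(c.prompt, c.severity) for c in cases])
```

In the toy run, `p1` is dropped by the cheap toxicity filter and `p4` by the judge, which reflects the intent of running the inexpensive classifier first and reserving the judge for flagged pairs.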
Loss & Training
Both Stage I and Stage III use LoRA fine-tuning (standard next-token prediction), with Stage III additionally introducing severity weighting. Explanatory strategy constraints can be optionally added to rationale generation during inference.
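One way to picture the severity weighting is as a weighted average of per-example next-token losses. The sketch below is an assumption-laden illustration: the weight values in `SEVERITY_WEIGHTS`, the normalization by the sum of weights, and the function names are all hypothetical, since the summary does not specify them. In actual LoRA training the token probabilities would come from the model's logits rather than being passed in directly.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical weights: the source only says weights depend on severity.
SEVERITY_WEIGHTS = {"Minor": 1.0, "Moderate": 2.0, "Severe": 3.0, "Extreme": 4.0}


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the target tokens of one rationale."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def severity_weighted_loss(batch: List[Tuple[Sequence[float], str]]) -> float:
    """Weighted mean of per-example NLLs; each item is (token_probs, severity)."""
    weighted_sum = 0.0
    weight_total = 0.0
    for token_probs, severity in batch:
        w = SEVERITY_WEIGHTS[severity]
        weighted_sum += w * sequence_nll(token_probs)
        weight_total += w
    return weighted_sum / weight_total


# Two toy repair examples: probabilities assigned to the safer target tokens.
toy_batch = [([0.5, 0.5], "Minor"), ([0.25], "Extreme")]
loss = severity_weighted_loss(toy_batch)
print(loss)
```

With these assumed weights, the Extreme example contributes four times as much to the loss as the Minor one, so the gradient is dominated by the most severe drift cases.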
Key Experimental Results
Main Results
| Model | Method | Total Accuracy | EQUAL Accuracy | DIFF Accuracy | Harm Drift↓ |
|---|---|---|---|---|---|
| Llama-3-8B | Baseline \(M_0\) | 39.0% | 11.3% | 66.6% | - |
| Llama-3-8B | \(M_{DART}\) | 68.8% | 72.6% | - | -72.6% |
| Llama-3.2-3B | \(M_{DART}\) | +24.7pp | - | - | Significant reduction |
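As a sanity check on how the baseline columns relate, total accuracy should be close to the class-share-weighted mean of the EQUAL and DIFF accuracies. The short calculation below assumes that the 50.2% differentiation rate quoted in the motivation is the DIFF share of the same evaluation set, which the summary does not state explicitly.

```python
# Baseline M_0 figures for Llama-3-8B, taken from the table above.
equal_acc = 0.113
diff_acc = 0.666

# Assumption: 50.2% of prompts are DIFF cases, the rest are EQUAL cases.
diff_share = 0.502
equal_share = 1.0 - diff_share

total_acc = equal_share * equal_acc + diff_share * diff_acc
print(f"implied total accuracy: {total_acc:.1%}")
```

Under that assumption the implied value is about 39.1%, which agrees with the reported 39.0% up to rounding of the inputs.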
Ablation Study
| Configuration | Accuracy | Safety | Description |
|---|---|---|---|
| Distillation only (Stage I) | 68.2% | Low | High accuracy but severe harm drift |
| Joint Toxicity Regularization | ~60% | Medium | Neither objective is sufficiently met |
| Full DART | 68.8% | High | Staged strategy is optimal |
Key Findings
- The largest accuracy gain occurred in equal-treatment cases (11.3%→72.6%), indicating that the over-refusal problem was effectively solved.
- In open-domain queries, difference-appropriate responses increased from 39.8% to 77.5%, and the refusal rate dropped from 34.3% to 3.0%.
- Label-conditioned generation is crucial for auditing precision—using predicted labels for auditing drops detection Precision/Recall from 0.720/0.810 to 0.582/0.694.
- Harm drift differs from traditional toxicity—it appears within explanatory reasoning rather than response compliance, making it undetectable by standard metrics.
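To compare the two auditing configurations in the findings above with a single number, the reported precision and recall can be combined into an F1 score. The F1 values are derived here for illustration and are not reported in the summary.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs quoted in the findings.
f1_true_labels = f1(0.720, 0.810)       # auditing with ground-truth labels
f1_predicted_labels = f1(0.582, 0.694)  # auditing with predicted labels

print(f"{f1_true_labels:.3f} vs {f1_predicted_labels:.3f}")
```

The derived scores are roughly 0.762 and 0.633, so switching to predicted labels costs about 0.13 F1 in drift detection.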
Highlights & Insights
- Harm Drift is a novel and important safety failure mode—the phenomenon of "correct conclusion but harmful reasoning" has not been systematically studied previously.
- The staged design philosophy is worth adopting more broadly: fully optimize the primary objective first, then repair the resulting side effects in a targeted way, rather than attempting to balance multiple objectives from the start.
- The two-stage auditing design, combining LLM-as-Judge with toxicity classifiers, balances efficiency and precision.
Limitations & Future Work
- Auditing relies on the judgment quality of the LLM-as-Judge, which may contain biases.
- Evaluation was limited to difference-aware classification tasks; whether and how strongly harm drift appears in other fine-tuning scenarios remains unknown.
- The repair stage may introduce new side effects, requiring iterative refinement.
Related Work & Insights
- vs. Standard Safety Fine-tuning: Standard methods focus on response compliance (whether to refuse), while DART focuses on explanation quality—a more fine-grained safety dimension.
- vs. DPO/RLHF: These methods align models globally through preference data, whereas DART uses precise auditing to locate and repair specific harm drift cases.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The concept of harm drift is novel, and the staged solution is ingenious.
- Experimental Thoroughness: ⭐⭐⭐⭐ 8 benchmarks + 280 open-domain queries + detailed ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear problem definition with intuitive examples.
- Value: ⭐⭐⭐⭐⭐ Reveals a new safety risk in fine-tuning with important implications for LLM alignment research.