CausalDetox: Causal Head Selection and Intervention for Language Model Detoxification
Conference: ACL 2026 | arXiv: 2604.14602 | Code: N/A | Area: Causal Inference | Keywords: Detoxification, Causal Inference, Attention Head Selection, Inference-Time Intervention, PNS
TL;DR
CausalDetox uses the Probability of Necessity and Sufficiency (PNS) as a causal criterion to locate the attention heads that are causally responsible for toxic content. It then applies local inference-time intervention and PNS-guided fine-tuning for detoxification, achieving up to 5.34% toxicity reduction while preserving language fluency.
Method
Key Designs
- PNS Causal Head Selection: The probability of necessity (PN) asks "does removing the toxic activation eliminate toxicity?"; the probability of sufficiency (PS) asks "does injecting a toxic activation into non-toxic inputs produce toxicity?". Latent confounders inferred by a VAE make the PNS lower bound tractable to estimate. Selection is 7x faster than accuracy-based selection.
- Local Inference-Time Intervention (Local ITI): Constructs input-specific steering vectors via softmax-weighted nearest-neighbor aggregation and mixes them with global vectors: \(\mathbf{v}_{mix} = (1-\lambda)\mathbf{v}_{local} + \lambda\mathbf{v}_{global}\).
- PNS-Guided Fine-Tuning: Permanently decouples toxicity representations in the selected heads by maximizing the PNS lower bound as the training objective, with KL divergence regularization.
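
The head-scoring and steering steps above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the paper's implementation. `pns_lower_bound` uses the classical Tian-Pearl bound max(0, P(y | do(x)) - P(y | do(x'))) as a stand-in for the VAE-based estimator, which is not specified here. `local_steering_vector` assumes a hypothetical bank of reference activations with one steering direction per reference example, and the parameters `k`, `tau`, and `lam` are illustrative names. Only the final mixing line follows the formula for \(\mathbf{v}_{mix}\) directly.

```python
import numpy as np


def pns_lower_bound(p_toxic_with, p_toxic_without):
    """Tian-Pearl lower bound on PNS (assumed stand-in for the VAE-based estimate).

    p_toxic_with:    P(toxic output | toxic activation present in the head)
    p_toxic_without: P(toxic output | toxic activation removed from the head)
    """
    return max(0.0, p_toxic_with - p_toxic_without)


def local_steering_vector(query, bank_acts, bank_dirs, v_global,
                          k=3, tau=1.0, lam=0.5):
    """Mix a softmax-weighted nearest-neighbor direction with a global one.

    query:     (d,)   head activation for the current input
    bank_acts: (n, d) activations of reference examples (hypothetical bank)
    bank_dirs: (n, d) steering direction per reference example (hypothetical)
    v_global:  (d,)   global steering vector
    """
    dists = np.linalg.norm(bank_acts - query, axis=1)
    nn = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    logits = -dists[nn] / tau                # closer neighbors get larger weight
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    v_local = weights @ bank_dirs[nn]        # weighted average of neighbor directions
    # v_mix = (1 - lambda) * v_local + lambda * v_global
    return (1.0 - lam) * v_local + lam * v_global
```

With `lam=1.0` the result reduces to the global vector, and with `lam=0.0` it is purely input-specific. Heads could then be ranked by `pns_lower_bound` and only the top-scoring ones steered.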
Key Experimental Results
- PNS head selection consistently outperforms accuracy-based selection across all model-dataset combinations
- Fine-tuning + intervention synergy outperforms either alone
- Different models require different numbers of heads (Mistral: 5, LLaMA: 36)
Highlights & Insights
- Replacing correlation with PNS as the criterion for selecting intervention targets is a principle that generalizes to any component-level intervention scenario
- Fine-tuning + intervention synergy: fine-tuning first concentrates the toxicity encoding, and intervention then removes it precisely
Rating
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐