CausalDetox: Causal Head Selection and Intervention for Language Model Detoxification

Conference: ACL 2026
arXiv: 2604.14602
Code: N/A
Area: Causal Inference
Keywords: Detoxification, Causal Inference, Attention Head Selection, Inference-Time Intervention, PNS

TL;DR

CausalDetox uses the Probability of Necessity and Sufficiency (PNS) as a causal criterion to locate the attention heads that are causally responsible for toxic content. It then detoxifies the model with local inference-time intervention and PNS-guided fine-tuning, achieving up to a 5.34% reduction in toxicity while preserving language fluency.

Method

Key Designs

PNS Causal Head Selection: The probability of necessity (PN) asks "does removing the toxic activation eliminate toxicity?"; the probability of sufficiency (PS) asks "does injecting the toxic activation into non-toxic inputs produce toxicity?". VAE-inferred latent confounders make a lower-bound estimate of PNS tractable. Head selection is reported to be 7x faster than accuracy-based selection.
Local Inference-Time Intervention (Local ITI): Constructs an input-specific steering vector via softmax-weighted nearest-neighbor aggregation, then mixes it with a global vector: \(\mathbf{v}_{mix} = (1-\lambda)\mathbf{v}_{local} + \lambda\mathbf{v}_{global}\), where \(\lambda\) sets the weight given to the global vector.
PNS-Guided Fine-Tuning: Permanently decouples toxicity representations in the selected heads by maximizing the PNS lower bound as the training objective, with KL-divergence regularization.
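
The Local ITI mixing step above can be illustrated with a minimal sketch. It is not the paper's implementation: the bank of stored activations (`bank_acts`), their associated steering directions (`bank_dirs`), the Euclidean distance metric, the neighbor count `k`, and the softmax `temperature` are all hypothetical choices made for illustration. Only the final mixing formula is taken from the summary.

```python
import math
from typing import List, Sequence

Vector = List[float]


def local_steering_vector(
    query: Sequence[float],
    bank_acts: Sequence[Sequence[float]],  # hypothetical: stored head activations
    bank_dirs: Sequence[Sequence[float]],  # hypothetical: one steering direction per stored activation
    k: int = 3,
    temperature: float = 1.0,
) -> Vector:
    """Softmax-weighted average of the steering directions of the k nearest neighbors."""
    # Distance from the query activation to every stored activation (assumed Euclidean).
    dists = [math.dist(query, act) for act in bank_acts]
    # Indices of the k closest stored activations.
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Closer neighbors get larger weights: softmax over negative distances.
    logits = [-dists[i] / temperature for i in nearest]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(bank_dirs[0])
    return [
        sum(w * bank_dirs[i][d] for w, i in zip(weights, nearest))
        for d in range(dim)
    ]


def mix_steering_vectors(v_local: Sequence[float], v_global: Sequence[float], lam: float) -> Vector:
    """v_mix = (1 - lam) * v_local + lam * v_global."""
    return [(1.0 - lam) * l + lam * g for l, g in zip(v_local, v_global)]


# Toy data, made up purely for illustration.
acts = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
dirs = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
v_local = local_steering_vector([0.0, 0.0], acts, dirs, k=2)
v_mix = mix_steering_vectors(v_local, [0.5, 0.5], lam=0.3)
```

Setting `lam=0` recovers the purely local vector and `lam=1` the purely global one, so under these assumptions `lam` acts as a single knob that trades input-specific steering against a fixed global direction.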

Key Experimental Results

PNS head selection consistently outperforms accuracy-based selection across all model-dataset combinations.
Combining fine-tuning with intervention outperforms either one alone.
Different models require different numbers of heads (Mistral: 5, LLaMA: 36).

Highlights & Insights

Replacing correlation with PNS as the criterion for selecting intervention targets is a principle that generalizes to any component-level intervention scenario.
Fine-tuning and intervention are synergistic: fine-tune first to concentrate the toxicity encoding, then remove it precisely via intervention.

Rating

Novelty: ⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐
Value: ⭐⭐⭐⭐