CausalDetox: Causal Head Selection and Intervention for Language Model Detoxification

Conference: ACL 2026 | arXiv: 2604.14602 | Code: N/A | Area: Causal Inference | Keywords: Detoxification, Causal Inference, Attention Head Selection, Inference-Time Intervention, PNS

TL;DR

CausalDetox uses the Probability of Necessity and Sufficiency (PNS) as a causal criterion to locate the attention heads that are causally responsible for toxic content. It then detoxifies the model through local inference-time intervention and PNS-guided fine-tuning, achieving up to a 5.34% reduction in toxicity while preserving language fluency.
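
As background on the PNS criterion, the standard counterfactual definitions from Pearl's causal framework are sketched below. The symbols are assumptions chosen for illustration and do not come from the paper: \(X\) is a binary cause (read here as "the head carries the toxic activation", with \(x\) present and \(x'\) absent), \(Y\) is a binary outcome (\(y\) toxic, \(y'\) non-toxic), and \(Y_x\) is the outcome under the intervention \(X = x\). The paper's own estimator, which relies on VAE-inferred latent confounders, may be formulated differently.

\[
\begin{aligned}
\mathrm{PN} &= P(Y_{x'} = y' \mid X = x,\ Y = y), \\
\mathrm{PS} &= P(Y_{x} = y \mid X = x',\ Y = y'), \\
\mathrm{PNS} &= P(Y_{x} = y,\ Y_{x'} = y').
\end{aligned}
\]

Without further assumptions, PNS is not point-identified, but it is bounded by interventional quantities:

\[
\max\{0,\ P(Y_x = y) - P(Y_{x'} = y)\} \;\le\; \mathrm{PNS} \;\le\; \min\{P(Y_x = y),\ P(Y_{x'} = y')\}.
\]

The left-hand side is the kind of lower bound that can serve as a selection score or training objective, since it needs only the two interventional probabilities.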

Method

Key Designs

  1. PNS Causal Head Selection: The probability of necessity (PN) asks "does removing the toxic activation eliminate toxicity?"; the probability of sufficiency (PS) asks "does injecting the toxic activation into non-toxic inputs produce toxicity?". Latent confounders inferred by a VAE make a lower bound on PNS tractable to estimate. Selection is reported to be 7x faster than accuracy-based selection.

  2. Local Inference-Time Intervention (Local ITI): Constructs an input-specific steering vector via softmax-weighted nearest-neighbor aggregation and mixes it with the global vector: \(\mathbf{v}_{mix} = (1-\lambda)\mathbf{v}_{local} + \lambda\mathbf{v}_{global}\), where \(\lambda\) sets the weight of the global vector.

  3. PNS-Guided Fine-Tuning: Permanently decouples toxicity representations in the selected heads by maximizing the PNS lower bound as the training objective, regularized with a KL-divergence term.
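
The mixing rule in the Local ITI design can be illustrated with a minimal sketch. Everything beyond the formula \(\mathbf{v}_{mix} = (1-\lambda)\mathbf{v}_{local} + \lambda\mathbf{v}_{global}\) is an assumption made for illustration, not the paper's implementation: the function names, the use of Euclidean distance, the `k` and `temperature` parameters, the idea of a stored bank of activations (`keys`) each paired with a steering direction (`directions`), and taking the global vector as the mean of all directions. The sketch uses only the Python standard library so that it runs as is; real code would operate on model tensors.

```python
import math


def local_steering_vector(query, keys, directions, k=3, temperature=1.0):
    """Softmax-weighted average of the directions attached to the k nearest keys.

    Hypothetical helper: `keys` are stored head activations and `directions`
    are the steering directions paired with them (both assumptions).
    """
    # Euclidean distance from the query activation to every stored activation.
    dists = [math.dist(query, key) for key in keys]
    # Indices of the k closest stored activations.
    nearest = sorted(range(len(keys)), key=lambda i: dists[i])[:k]
    # Softmax over negative distances: closer neighbors get larger weights.
    logits = [-dists[i] / temperature for i in nearest]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the neighbors' directions, one coordinate at a time.
    return [
        sum(w * directions[i][d] for w, i in zip(weights, nearest))
        for d in range(len(query))
    ]


def mix(v_local, v_global, lam):
    """v_mix = (1 - lam) * v_local + lam * v_global."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(v_local, v_global)]


# Toy data: four stored activations in 2-D, each with a steering direction.
keys = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
directions = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0], [-1.0, 0.0]]

# Assumed global vector: the mean of all stored directions.
v_global = [sum(col) / len(directions) for col in zip(*directions)]

v_local = local_steering_vector([0.1, 0.0], keys, directions, k=2)
v_mix = mix(v_local, v_global, lam=0.25)
```

With `lam=0` the result is the purely input-specific vector, and with `lam=1` it falls back to the global vector, so `lam` trades off adaptivity against the stability of a single shared direction.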

Key Experimental Results

  • PNS head selection consistently outperforms accuracy-based selection across all model-dataset combinations
  • Combining fine-tuning with intervention outperforms either one alone
  • Different models require different numbers of heads (Mistral: 5, LLaMA: 36)

Highlights & Insights

  • Replacing correlation with PNS as the criterion for selecting intervention targets is a generalizable principle for any component-level intervention scenario
  • Fine-tuning and intervention are complementary: fine-tuning first concentrates the toxicity encoding, and intervention then removes it precisely

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐