When Bad Data Leads to Good Models

Conference: ICML 2025
arXiv: 2505.04741
Code: None
Area: Social Computing
Keywords: Toxic Data, Pre-training Data Quality, Detoxification, Feature Entanglement, Inference-time Intervention

TL;DR

This paper proposes a "pre-training/post-training co-design" perspective and shows through controlled experiments that adding a moderate amount of toxic data (~10%) to the pre-training corpus reduces the entanglement of toxic features. The resulting model is easier to detoxify in post-training (e.g., via ITI activation steering): with strong ITI, toxicity on ToxiGen drops from 41.40 to 2.63 (CE loss 2.60 to 3.23), and even weak ITI reaches 16.25 with CE loss nearly unchanged (2.65).

Background & Motivation

Background: The standard practice for current LLM pre-training is to filter out toxic data from training corpora (e.g., the rigorous cleaning done in the C4 dataset) to reduce the risk of the model outputting harmful content. Intuitively, cleaner training data should lead to a safer model.

Limitations of Prior Work: Longpre et al. [2023] discovered that filtering toxic data not only degrades the model's ability to identify toxicity, but also harms downstream performance on most QA tasks. The reduction in data diversity limits the model's capacity to build complete world representations. Furthermore, Lee et al. [2024] and Qi et al. [2023] found that alignment algorithms do not genuinely "forget" the mechanisms for generating toxic content but merely bypass them, making such defenses easily reversible.

Key Challenge: Filtering toxic data during pre-training \(\rightarrow\) representation of toxic concepts becomes highly entangled with other unrelated concepts (superposition) \(\rightarrow\) any aggressive editing of toxic directions in post-training severely damages general capabilities \(\rightarrow\) irreconcilable conflict between detoxification and capability preservation.

Goal: (a) How does the proportion of toxic content in pre-training data affect the geometric structure of toxic features in the representation space? (b) Does a better toxic representation make post-training detoxification more effective? (c) Is there an optimal proportion of toxic data?

Key Insight: The authors start from the superposition hypothesis of Elhage et al. [2022]: when the number of features exceeds the number of neurons, the model must superimpose multiple features onto the same dimensions. If a class of data is underrepresented in the training set, its feature direction becomes highly entangled with other features and is therefore hard to edit independently.

Core Idea: Instead of removing toxic data during pre-training, it is better to retain or even increase it so the model builds a clear linear representation of toxicity. This makes post-training detoxification more precise and less prone to side effects.

Method

Overall Architecture

Rather than proposing a new model architecture or training algorithm, this paper introduces the concept of pre-training/post-training co-design. The overall pipeline is as follows:

1. Toy Experiments for Hypothesis Verification: Train small Transformers on controlled sequences generated by Markov chains to study the relationship between data proportions and feature entanglement.
2. Controlled OLMo-1B Pre-training: Gradually add 0% to 25% of 4chan data (extremely toxic data) on top of C4 (clean data) to train a series of models.
3. Probing Analysis of Internal Representations: Use linear probes to detect the linear separability of toxic concepts in each model.
4. Post-training Detoxification Evaluation: Apply post-training techniques such as prompting, ITI, SFT, and DPO to each model, and evaluate their detoxification effectiveness on ToxiGen and Real Toxicity Prompts.

Key Designs

Entanglement Measure:
- Function: Quantify the degree to which a certain feature is "entangled" with other features in the representation space.
- Mechanism: Define the entanglement of feature \(P_i\) as \(\mathcal{E}_{P_i} = \max_{j \neq i} |v_{P_i} \cdot v_{P_j}|\), where \(v_{P_i}\) is the unit-norm direction of feature \(P_i\). This is the maximum absolute cosine similarity between the feature's direction and every other feature direction. Lower entanglement means a more independent representation and fewer side effects during editing.
- Design Motivation: When \(N\) features are compressed into an \(M\)-dimensional space (\(N > M\)), superposition inevitably occurs. The Welch bound provides a lower bound for maximum entanglement: \(\sqrt{(N-M)/((N-1)M)}\), which is only achieved when feature directions are uniformly distributed. Underrepresented features deviate from the uniform distribution, resulting in higher entanglement.
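
The entanglement measure and the Welch bound above can be checked numerically. The following is a minimal sketch in plain Python; the function names (`entanglement`, `welch_bound`) and the example vectors are illustrative assumptions, not taken from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def entanglement(directions, i):
    """Max absolute cosine similarity between feature i and every other feature."""
    unit = [normalize(v) for v in directions]
    return max(
        abs(sum(a * b for a, b in zip(unit[i], unit[j])))
        for j in range(len(unit))
        if j != i
    )

def welch_bound(n_features, dim):
    """Lower bound on the max pairwise |cos| of n_features unit vectors in dim dimensions."""
    return math.sqrt((n_features - dim) / ((n_features - 1) * dim))

# Three unit vectors spaced 120 degrees apart in 2D attain the bound for N=3, M=2.
frame = [[math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)] for k in range(3)]

# A non-uniform arrangement: feature 1 sits only 10 degrees away from feature 0.
skewed = [[1.0, 0.0],
          [math.cos(math.radians(10)), math.sin(math.radians(10))],
          [0.0, 1.0]]

print(welch_bound(3, 2), entanglement(frame, 0), entanglement(skewed, 1))
```

In the evenly spread frame every feature sits exactly at the bound, while the crowded feature in `skewed` has an entanglement close to 1, which is the situation the paper associates with underrepresented features.
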
Toy Experiment Design:
- Function: Verify the hypothesis that "data proportion affects entanglement" in a controlled environment.
- Mechanism: Generate sequences using \(N\) cyclic Markov chains (sharing state space \(V\)), choose one chain to reduce its data volume, train a 4-layer Transformer (4D latent space, 10 random seeds), and observe changes in the entanglement of underrepresented features.
- Key Findings: As the proportion of data for the underrepresented feature increases, its entanglement drops sharply, gradually approaching the average entanglement of other features (around 0.8).
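
A data generator for this kind of toy setup might look like the sketch below. It assumes that a "cyclic Markov chain" is a deterministic cycle through a shared state space and that chain 0 is the underrepresented one; both choices, along with all names and default sizes, are assumptions for illustration rather than the paper's exact construction.

```python
import random

def make_cyclic_chain(states, rng):
    """Build a next-state map that visits every state once per cycle (assumed form)."""
    order = states[:]
    rng.shuffle(order)
    return {order[k]: order[(k + 1) % len(order)] for k in range(len(order))}

def sample_sequence(chain, length, rng):
    """Start at a random state and follow the chain's transitions."""
    state = rng.choice(list(chain))
    seq = [state]
    for _ in range(length - 1):
        state = chain[state]
        seq.append(state)
    return seq

def build_dataset(n_chains, n_states, n_sequences, rare_fraction, seq_len=16, seed=0):
    """Chain 0 supplies `rare_fraction` of the sequences; the others share the rest evenly.

    Requires n_chains >= 2.
    """
    rng = random.Random(seed)
    states = list(range(n_states))
    chains = [make_cyclic_chain(states, rng) for _ in range(n_chains)]
    weights = [rare_fraction] + [(1 - rare_fraction) / (n_chains - 1)] * (n_chains - 1)
    labels = rng.choices(range(n_chains), weights=weights, k=n_sequences)
    data = [(c, sample_sequence(chains[c], seq_len, rng)) for c in labels]
    return chains, data

chains, data = build_dataset(n_chains=4, n_states=8, n_sequences=2000, rare_fraction=0.05)
rare_share = sum(1 for c, _ in data if c == 0) / len(data)
print(f"share of sequences from the underrepresented chain: {rare_share:.3f}")
```

Sweeping `rare_fraction` and retraining the small Transformer at each value would give the x-axis of the entanglement curve described above.
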
Controlled OLMo-1B Pre-training Experiment:
- Function: Verify the impact of toxic data on representation quality at a realistic LLM scale.
- Mechanism: Keep the amount of clean C4 data constant and gradually add 4chan data from 0% to 25% (in steps of 5%), taking the total tokens from 20.1B to 25.7B. Each configuration is trained twice with different seeds. Utilizing 16 H100 GPUs, each configuration takes approximately 12 hours.
- Design Motivation: Keeping the amount of clean data constant isolates the confounding factor of "reducing high-quality data," ensuring that experimental results reflect only the impact of toxic data.
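
The token budget implied by "constant clean data plus added toxic data" depends on whether the percentage is measured against the clean tokens or against the total. The sketch below works through both conventions; the helper names are hypothetical. At 25%, the reported 25.7B total lies between the two results (about 25.1B and 26.8B), so the exact accounting used in the paper is not recoverable from this summary and should be treated as unknown.

```python
CLEAN_TOKENS_B = 20.1  # billions of C4 tokens, held constant (figure from the text)

def total_if_ratio_of_clean(ratio):
    """Toxic tokens = ratio * clean tokens."""
    return CLEAN_TOKENS_B * (1 + ratio)

def total_if_ratio_of_total(ratio):
    """Toxic tokens = ratio * total tokens, so total = clean / (1 - ratio)."""
    return CLEAN_TOKENS_B / (1 - ratio)

for pct in range(0, 30, 5):
    r = pct / 100
    print(f"{pct:>2}%  of-clean: {total_if_ratio_of_clean(r):5.2f}B"
          f"   of-total: {total_if_ratio_of_total(r):5.2f}B")
```
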
Linear Probing Analysis:
- Function: Detect whether a good linear representation of toxicity has been established inside the model.
- Mechanism: For each text segment in the ToxiGen dataset, collect activations at the final token from each attention head of the model, and train binary linear probes. Compare the models trained with 0% vs. 25% toxic data.
- Key Findings: The model containing toxic data exhibits a significant "fat right tail" in probe accuracy distribution (\(p = 0.0002\)), indicating that more heads specialize in toxicity detection. This is crucial for ITI, which relies on selecting high-accuracy heads for intervention.
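
The per-head probing procedure can be illustrated with synthetic activations. This is a sketch only: it uses a hand-written logistic regression in plain Python, two made-up heads (`head_a` carries a label signal, `head_b` is pure noise), and arbitrary hyperparameters. A real analysis would collect final-token activations from the model and would likely use a library such as scikit-learn or PyTorch.

```python
import math
import random

def train_probe(xs, ys, epochs=100, lr=0.5):
    """Fit logistic-regression weights and bias by batch gradient descent."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for k in range(dim):
                gw[k] += err * x[k]
            gb += err
        w = [wi - lr * g / len(xs) for wi, g in zip(w, gw)]
        b -= lr * gb / len(xs)
    return w, b

def accuracy(w, b, xs, ys):
    """Fraction of examples whose predicted class matches the label."""
    hits = sum(
        (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
        for x, y in zip(xs, ys)
    )
    return hits / len(xs)

def make_head(rng, labels, shift, dim=4):
    """Synthetic head activations: Gaussian noise plus a label-dependent shift on axis 0."""
    acts = []
    for y in labels:
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift if y == 1 else -shift
        acts.append(x)
    return acts

rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(500)]  # 1 = toxic, 0 = non-toxic
heads = {"head_a": make_head(rng, labels, 2.0), "head_b": make_head(rng, labels, 0.0)}

split = 300  # first 300 examples train the probe, the rest validate it
probe_acc = {}
for name, acts in heads.items():
    w, b = train_probe(acts[:split], labels[:split])
    probe_acc[name] = accuracy(w, b, acts[split:], labels[split:])

print(sorted(probe_acc.items(), key=lambda kv: -kv[1]))
```

Repeating this for every attention head yields the accuracy distribution whose right tail the paper compares between the 0% and 25% models.
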
Inference-Time Intervention (ITI) Detoxification:
- Function: Shift activations along the toxicity-related linear directions during decoding to guide the model towards generating non-toxic content.
- Mechanism: Select the 30 heads with the highest probing accuracy and intervene with three levels of intensity (weak=4, mid=8, strong=12).
- Key Findings: The model trained with 10% toxic data achieves the lowest toxicity under strong ITI (ToxiGen: 2.63). Across mixtures the results form a "smile curve": toxicity declines steadily from 0% to 10%, then rises beyond 10% while remaining below that of the clean model.
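
The intervention step can be sketched as follows, following the general ITI recipe of shifting a selected head's output by `alpha * sigma * direction`, where `sigma` is the standard deviation of activations projected onto the direction. How the direction is obtained, its sign (here it is assumed to point toward non-toxic text), the head names, and the mapping of `alpha` to the weak/mid/strong settings (4/8/12) are assumptions for illustration.

```python
import math
import statistics

def top_k_heads(probe_acc, k):
    """Pick the k heads with the highest probe accuracy."""
    return sorted(probe_acc, key=probe_acc.get, reverse=True)[:k]

def unit(v):
    """Normalize a direction to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def projection_std(activations, direction):
    """Standard deviation of calibration activations projected onto the direction."""
    proj = [sum(a * d for a, d in zip(x, direction)) for x in activations]
    return statistics.pstdev(proj)

def intervene(head_output, direction, sigma, alpha):
    """Shift one head's output along the (assumed non-toxic) direction."""
    return [h + alpha * sigma * d for h, d in zip(head_output, direction)]

# Hypothetical probe accuracies for three heads; the paper selects the top 30.
probe_acc = {"L3H1": 0.81, "L5H7": 0.93, "L9H2": 0.66}
selected = top_k_heads(probe_acc, k=2)

direction = unit([3.0, 4.0])
sigma = projection_std([[-0.6, -0.8], [0.6, 0.8]], direction)
shifted = intervene([1.0, 1.0], direction, sigma, alpha=4)  # alpha=4 plays the role of "weak"

print(selected, shifted)
```

In a real model the shift would be applied to each selected head at every decoding step, which is why larger `alpha` values reduce toxicity further but also raise CE loss.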

Loss & Training

The pre-training phase involves no special loss design and uses the standard language modeling objective. The core innovation lies in the strategic adjustment of the data mixture ratio rather than in changes to the training pipeline itself. In the post-training phase, several standard techniques (prompting, ITI, SFT, DPO) are tested to verify whether the gain from toxic pre-training data generalizes across methods.
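
For reference, the standard next-token objective for a sequence \(x_1, \dots, x_T\) can be written as below. Treating the "CE Loss" column in the result tables as this per-token quantity measured on held-out text is an assumption; this summary does not define it.

\[
\mathcal{L}_{\text{LM}}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
\]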

Key Experimental Results

Main Results: Comparison of Detoxification Effects

Method	ToxiGen ↓	Real Toxicity Prompts ↓	CE Loss ↓
Clean data (Baseline)	41.40	31.15	2.60
Clean + prompting	32.12	31.00	2.62
Clean + ITI (weak)	36.30	24.83	2.63
Clean + ITI (mid)	28.31	20.41	2.72
Clean + ITI (strong)	19.82	13.33	2.88
MEDA	22.02	28.32	2.71
INST	18.99	30.09	2.73
SFT	39.27	28.00	2.68
DPO	38.86	29.67	2.71
10% Toxic + prompting	29.07	24.84	2.62
10% Toxic + ITI (weak)	16.25	20.09	2.65
10% Toxic + ITI (mid)	8.19	14.28	2.85
10% Toxic + ITI (strong)	2.63	7.11	3.23
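
To read the table at matched intervention strength, the short script below subtracts each clean-model ITI row from the corresponding 10%-toxic row. The numbers are copied from the table above; the variable and function names are illustrative.

```python
# (ToxiGen, Real Toxicity Prompts, CE loss) per ITI strength, copied from the table
clean = {"weak": (36.30, 24.83, 2.63), "mid": (28.31, 20.41, 2.72), "strong": (19.82, 13.33, 2.88)}
toxic10 = {"weak": (16.25, 20.09, 2.65), "mid": (8.19, 14.28, 2.85), "strong": (2.63, 7.11, 3.23)}

def deltas(strength):
    """10%-toxic minus clean, per metric; negative toxicity deltas are improvements."""
    return tuple(round(t - c, 2) for t, c in zip(toxic10[strength], clean[strength]))

for strength in ("weak", "mid", "strong"):
    d_toxigen, d_rtp, d_ce = deltas(strength)
    print(f"{strength:>6}: ToxiGen {d_toxigen:+.2f}  RTP {d_rtp:+.2f}  CE loss {d_ce:+.2f}")
```

At every strength the 10%-toxic model is less toxic on both benchmarks, while its CE loss is higher, and that gap widens as the intervention gets stronger.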

Ablation Study: Different Toxic Data Proportions + DPO/SFT

Toxic Ratio	Method	ToxiGen ↓	RTP ↓	CE Loss ↓
0%	SFT	39.27	28.00	2.68
5%	SFT	38.40	26.21	2.69
10%	SFT	37.62	25.78	2.71
15%	SFT	37.45	25.81	2.73
20%	SFT	38.20	26.39	2.75
0%	DPO	38.86	29.67	2.71
5%	DPO	33.91	19.85	2.70
10%	DPO	27.45	13.02	2.73
15%	DPO	26.88	13.19	2.74
20%	DPO	29.34	15.97	2.75
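
The best toxic ratio per method and benchmark can be read off programmatically. The rows below are copied from the ablation table above; `best_ratio` is a hypothetical helper.

```python
# (toxic ratio %, method, ToxiGen, RTP) copied from the ablation table
rows = [
    (0, "SFT", 39.27, 28.00), (5, "SFT", 38.40, 26.21), (10, "SFT", 37.62, 25.78),
    (15, "SFT", 37.45, 25.81), (20, "SFT", 38.20, 26.39),
    (0, "DPO", 38.86, 29.67), (5, "DPO", 33.91, 19.85), (10, "DPO", 27.45, 13.02),
    (15, "DPO", 26.88, 13.19), (20, "DPO", 29.34, 15.97),
]

TOXIGEN, RTP = 2, 3  # column indices into each row

def best_ratio(method, metric_index):
    """Return the toxic ratio with the lowest score for the given method and metric."""
    candidates = [r for r in rows if r[1] == method]
    return min(candidates, key=lambda r: r[metric_index])[0]

for method in ("SFT", "DPO"):
    print(method, "ToxiGen best at", best_ratio(method, TOXIGEN),
          "% / RTP best at", best_ratio(method, RTP), "%")
```

For both methods the ToxiGen minimum falls at 15% and the RTP minimum at 10%, with small gaps between the two settings, so the optimum under SFT and DPO is best described as the 10% to 15% range.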

Red-Teaming Adversarial Experiment: GCG Attack Success Rate

Configuration	No ITI	Strong ITI
Clean data	80%	46%
10% Toxic	82%	38.5%

Key Findings

10% toxic data is the optimal ratio: Under ITI, the toxicity curve takes a "smile shape", reaching its lowest point at 10%. Although it rises after exceeding 10%, it still remains better than the clean model.
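The gain from toxic data generalizes across methods, to varying degrees: DPO benefits substantially (ToxiGen 38.86 at 0% vs. 27.45 at 10%), while SFT improves only marginally (39.27 vs. 37.62). Both follow the same decline-then-rise trend, although their ToxiGen minimum falls at 15% rather than 10%.
Weak intervention is sufficient to outperform all baselines on ToxiGen: 10% toxic data + weak ITI (ToxiGen: 16.25) already outperforms MEDA (22.02) and INST (18.99) at a lower CE Loss (2.65 vs. 2.71/2.73), and it beats every clean-model configuration on ToxiGen. On Real Toxicity Prompts, however, Clean + ITI (strong) still scores lower (13.33 vs. 20.09).
Enhanced adversarial robustness under ITI: With strong ITI, toxic pre-training lowers the GCG attack success rate from 46% (clean model) to 38.5%. Without ITI, the two models are about equally vulnerable (80% vs. 82%).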
Probing analysis is statistically significant: The model containing toxic data achieves significantly higher probing accuracy (\(p = 0.0002\)), with a 95% confidence interval of \([0.67, 1.18]\).

Highlights & Insights

Counter-intuitive core finding: Contrary to the conventional wisdom that toxic data must be filtered during pre-training, this paper demonstrates that a moderate amount of toxic data is actually beneficial. This stems from a profound geometric insight: data diversity determines the separability of features in the representation space.
Elegant application of the superposition perspective: The paper connects the superposition hypothesis from interpretability research with data curation strategies, bridging data ratios and post-training steerability through the concept of entanglement.
From Toy to Real validation pathway: Practical intuition and theoretical predictions are first established via small Markov chain experiments and then validated on OLMo-1B. The experimental design is exceptionally solid and structured step-by-step.
Highly practical: This method requires no changes to the model architecture or the training process; it only requires adjusting the data mixture ratio, which can be directly integrated into industrial pre-training pipelines. It is also compatible with multiple post-training methods (ITI, SFT, DPO).

Limitations & Future Work

Limited scale: Experiments are conducted only on OLMo-1B (~20B tokens). The optimal toxic proportion might vary under larger models (e.g., 7B/70B) or larger data scales, and the scaling laws remain unexplored.
Narrow definition of toxicity: The paper uses only the notion of toxicity scored by Perspective API, without addressing wider alignment dimensions such as bias, discrimination, or misinformation. The authors note that generalizing to other alignment concepts is a direction for future work.
Extremity of 4chan data: 4chan contains extremely toxic data; in reality, the distribution of toxic data is more diverse and subtle. Whether the conclusions apply to more "moderate" toxic data still needs verification.
Optimal ratio requires empirical determination: The value of 10% is highly dependent on specific model, data, and evaluation setups, and the paper lacks theoretical guidance on how to predict the optimal ratio.
CE Loss cost: Under strong ITI, the CE Loss increases from 2.60 to 3.23. The loss in language fluency cannot be ignored, necessitating a trade-off in real-world deployment.

Comparison with Related Work

vs. MEDA/INST (Prabhumoye et al. 2023): MEDA/INST inject artificial toxicity-annotated prefixes into pre-training data, whereas this work directly incorporates raw toxic data. While the goal is the same, the proposed approach is simpler and avoids distorting the language distribution. With weak ITI, this work reaches lower ToxiGen toxicity (16.25 vs. 22.02/18.99) at a lower CE Loss (2.65 vs. 2.71/2.73).
vs. DPO/RLHF (Rafailov et al. 2023): The conventional recipe filters toxic content during pre-training and then aligns the model in post-training. Lee et al. [2024] found that DPO bypasses toxic mechanisms rather than removing them, which leaves its defenses fragile and easy to reverse. This work instead improves representation quality at the source, making post-training detoxification more thorough.
vs. ITI (Li et al. 2023): ITI assumes that well-formed linear representations exist within the model. This paper addresses when such linear representations are better formed: namely, when the relevant data is sufficiently represented during pre-training. The two are complementary.
vs. Superposition Theory (Elhage et al. 2022): This work extends the superposition hypothesis from purely theoretical analysis into practical guidance for data curation, representing a new application direction for mechanistic interpretability.

Rating

Novelty: ⭐⭐⭐⭐⭐ Counter-intuitive core finding, bridging data strategies and alignment capabilities using superposition theory, offering a unique perspective.
Experimental Thoroughness: ⭐⭐⭐⭐ Multi-level verification from toy to real, comparison against 6 baselines, and red-teaming tests, though lacking larger-scale experiments.
Writing Quality: ⭐⭐⭐⭐⭐ Exceptionally clear logical chain, progressing naturally through toy experiments, probing, and detoxification, supported by intuitive charts.
Value: ⭐⭐⭐⭐ Crucially inspires pre-training data curation strategies, though the generalizability of the optimal ratio remains to be verified.