I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Conference: ICLR 2026
arXiv: 2603.01297
Code: https://github.com/SubramanyamSahoo/Collapse-of-Safety-Classifiers-under-Embedding-Drift
Area: LLM Inference
Keywords: embedding drift, safety classifier, silent failure, RLHF alignment, toxicity detection

TL;DR

This paper systematically investigates the fragility of frozen-embedding-based safety classifiers under embedding drift induced by model updates. It finds that a mere 2% perturbation in the embedding space is sufficient to degrade classifier performance from 85% ROC-AUC to near-random levels (50%), with 72% of misclassifications occurring at high confidence (silent failure). Counterintuitively, instruction-tuned models prove harder to classify than their base counterparts.

Background & Motivation

Background: Instruction-tuned inference models deployed in production are typically paired with safety classifiers (e.g., toxicity detectors) trained on frozen embeddings, implicitly assuming that embeddings remain stable across model updates.

Limitations of Prior Work: Base models are updated frequently—for safety patches, performance improvements, etc.—yet safety classifiers are rarely retrained in synchrony, giving rise to a "model updates, classifier stays fixed" production paradigm.

Key Challenge: Whether embedding spaces remain stable after model updates, and whether standard monitoring mechanisms (confidence-based) can detect such failures if they do not.

Goal: (1) Quantify the precise failure threshold of embedding drift; (2) characterize the silent failure phenomenon—misclassifications occurring despite high reported confidence; (3) reveal the counterintuitive effect of RLHF alignment on classifier robustness.

Key Insight: Controlled additive perturbations are used to simulate embedding drift, systematically testing classifier degradation under different drift types (Gaussian, directional, subspace rotation).

Core Idea: The embedding stability assumption does not hold in practice; even minor drift can cause catastrophic failure in safety classifiers that goes undetected by standard monitoring.

Method

Overall Architecture

Given an input text \(x\), a language model \(\mathcal{M}_t\) produces embedding \(z_t = f_{\theta_t}(x) \in \mathbb{R}^d\), which is passed to a safety classifier \(g_\phi\) for binary classification (toxic/safe). The core experimental framework trains the classifier on checkpoint 0 and evaluates it on drifted embeddings from subsequent checkpoints, simulating the production scenario of a fixed classifier with an evolving model.

Key Designs

Embedding Drift Modeling:
- Function: Parameterized perturbations simulate embedding changes induced by model updates.
- Mechanism: \(z_c = \text{Normalize}(z_0 + \varepsilon_c)\), where \(\varepsilon_c\) follows one of two additive drift models: Gaussian drift \(\varepsilon_c \sim \mathcal{N}(0, \sigma_c^2 I)\) or directional drift \(\varepsilon_c = \sigma_c v\) along a fixed direction \(v\). A third, non-additive model applies a subspace rotation \(R\): \(z_c = \text{Normalize}(Rz_0)\).
- Design Motivation: To cover diverse real-world drift scenarios; normalization constrains embeddings to the unit hypersphere.
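
The three drift models above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the paper's implementation: the function names, the random direction `v`, and the use of a single plane (Givens) rotation as a stand-in for the subspace rotation \(R\) are all assumptions. With \(d = 896\) and \(\sigma = 0.028\), the Gaussian noise vector has norm of roughly \(\sigma\sqrt{d} \approx 0.84\), so the drifted embedding's cosine similarity to the original is expected to fall to roughly 0.77.

```python
import math
import random

def normalize(v):
    """Project a vector onto the unit hypersphere (L2 norm 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def gaussian_drift(z0, sigma, rng):
    """z_c = Normalize(z_0 + eps), with eps ~ N(0, sigma^2 I)."""
    return normalize([x + rng.gauss(0.0, sigma) for x in z0])

def directional_drift(z0, sigma, v):
    """z_c = Normalize(z_0 + sigma * v) for a fixed direction v."""
    return normalize([x + sigma * vi for x, vi in zip(z0, v)])

def plane_rotation_drift(z0, theta, i=0, j=1):
    """z_c = Normalize(R z_0), where R rotates the (i, j) plane by theta.

    Assumption: one Givens rotation stands in for the subspace rotation R.
    """
    z = list(z0)
    c, s = math.cos(theta), math.sin(theta)
    z[i] = c * z0[i] - s * z0[j]
    z[j] = s * z0[i] + c * z0[j]
    return normalize(z)

def cosine(a, b):
    """Cosine similarity; both inputs are assumed to be unit vectors."""
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(0)
d = 896  # embedding dimension quoted in the text
z0 = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])  # synthetic embedding
v = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])   # hypothetical drift direction

for sigma in (0.0, 0.01, 0.028):
    zc = gaussian_drift(z0, sigma, rng)
    print(f"sigma={sigma:.3f}  cos(z0, zc)={cosine(z0, zc):.3f}")
```
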
Silent Failure Metric:
- Function: Detects misclassifications occurring at high confidence.
- Mechanism: A silent failure is defined when \(\max_y p(y|x) > 0.8\) and \(\hat{y} \neq y\).
- Design Motivation: Standard monitoring relies on average confidence; if confidence remains high while classification is already incorrect, the failure goes undetected—the most dangerous failure mode.
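
A small sketch of the silent failure metric as defined above. The 0.8 threshold comes from the text; the function names, the choice of denominators (all samples for the rate, misclassified samples for the share), and the toy inputs are assumptions made for illustration.

```python
def predict(p):
    """Predicted label from P(toxic | x): 1 = toxic, 0 = safe."""
    return 1 if p >= 0.5 else 0

def confidence(p):
    """max_y p(y | x) for a binary classifier."""
    return max(p, 1.0 - p)

def silent_failure_rate(probs, labels, threshold=0.8):
    """Fraction of all samples that are wrong AND have confidence > threshold."""
    silent = sum(
        1 for p, y in zip(probs, labels)
        if predict(p) != y and confidence(p) > threshold
    )
    return silent / len(labels)

def high_confidence_share_of_errors(probs, labels, threshold=0.8):
    """Among misclassified samples, the share with confidence > threshold."""
    errors = [p for p, y in zip(probs, labels) if predict(p) != y]
    if not errors:
        return 0.0
    return sum(1 for p in errors if confidence(p) > threshold) / len(errors)

# Toy inputs (hypothetical): P(toxic | x) and ground-truth labels.
probs = [0.95, 0.10, 0.60, 0.85, 0.30]
labels = [0, 0, 1, 1, 1]
print(silent_failure_rate(probs, labels))
print(high_confidence_share_of_errors(probs, labels))
```

In the toy data, two samples are misclassified, and only the first (confidence 0.95) counts as a silent failure.
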
Alignment Impact Analysis:
- Function: Compares embedding separability between base and instruction-tuned models.
- Mechanism: Silhouette score and Fisher discriminant ratio are used to measure the separation of toxic/safe classes in the embedding space.
- Design Motivation: To examine whether RLHF alignment introduces adverse effects on safety classification.
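
To make the separability measure concrete, the following sketch computes one common scalar form of the Fisher discriminant ratio: squared distance between class means divided by the summed within-class scatter. The summary does not state which variant the paper uses, so this definition, the function names, and the 2-D toy clusters are assumptions. For the silhouette score, scikit-learn's `silhouette_score` would be the usual route.

```python
def mean_vector(rows):
    """Per-dimension mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def within_scatter(rows, mu):
    """Mean squared distance of the rows to their class mean."""
    return sum(sum((x - m) ** 2 for x, m in zip(r, mu)) for r in rows) / len(rows)

def fisher_ratio(safe, toxic):
    """Between-class separation divided by total within-class scatter."""
    mu_s, mu_t = mean_vector(safe), mean_vector(toxic)
    between = sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))
    within = within_scatter(safe, mu_s) + within_scatter(toxic, mu_t)
    return between / within

# Hypothetical 2-D "embeddings": same cluster shapes, different separation.
safe = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]]
toxic = [[1.0, 1.0], [1.2, 1.0], [1.0, 1.2], [1.2, 1.2]]
toxic_blurred = [[0.3, 0.3], [0.5, 0.3], [0.3, 0.5], [0.5, 0.5]]

print(fisher_ratio(safe, toxic))          # well-separated classes
print(fisher_ratio(safe, toxic_blurred))  # classes moved closer together
```

A lower ratio means the toxic and safe clusters sit closer together relative to their spread, which is the sense in which alignment is said to blur class boundaries.
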

Experimental Setup

Dataset: Civil Comments (~1.8M human-annotated comments); a balanced subset of 10,000 samples is constructed (70/10/20 train/val/test split).
Models: Qwen-0.6B (base) and Qwen-4B-Instruct (instruction-tuned + RLHF).
Classifier: \(\ell_2\)-regularized logistic regression, mirroring lightweight classifier practices in production.
Drift magnitude: \(\sigma \in [0, 0.15]\), generating 6–8 checkpoints.
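
The data preparation described above (a balanced 10,000-sample subset, split 70/10/20) might look like the sketch below. The ID lists, the seed, and the shuffle-then-slice strategy are hypothetical; the summary does not say whether the paper stratifies the split by class.

```python
import random

def balanced_split(toxic_ids, safe_ids, n_total=10_000, percents=(70, 10, 20), seed=0):
    """Sample a class-balanced subset and split it into train/val/test."""
    rng = random.Random(seed)
    per_class = n_total // 2
    picked = [(i, 1) for i in rng.sample(toxic_ids, per_class)]
    picked += [(i, 0) for i in rng.sample(safe_ids, per_class)]
    rng.shuffle(picked)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = n_total * percents[0] // 100
    n_val = n_total * percents[1] // 100
    return picked[:n_train], picked[n_train:n_train + n_val], picked[n_train + n_val:]

# Hypothetical comment IDs standing in for the Civil Comments rows.
toxic_ids = list(range(20_000))
safe_ids = list(range(20_000, 60_000))

train, val, test = balanced_split(toxic_ids, safe_ids)
print(len(train), len(val), len(test))
```
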

Key Experimental Results

Main Results

Metric	Baseline (σ=0)	Checkpoint 1 (σ=0.028)	Relative Change
ROC-AUC	85%–90%	49.75%	−45%
Mean Confidence	0.85	0.73	−14%
Silent Failure Rate	—	38.4% (high-conf. errors)	—
High-Conf. Misclassification Share	—	72%	—
ECE (Calibration Error)	1.2%	22.6%	×18.8
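
For readers unfamiliar with the ECE row, here is a sketch of the standard equal-width-bin estimator of expected calibration error. The bin count of 10 is an assumption (the summary does not report it), and the inputs are toy values. The second example uses 90% confidence with 56% accuracy, the pairing quoted in the findings.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# Well calibrated: 90% confidence, 9 of 10 predictions correct.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])

# Overconfident: 90% confidence, only 14 of 25 (56%) predictions correct.
overconfident = expected_calibration_error([0.9] * 25, [True] * 14 + [False] * 11)

print(calibrated, overconfident)
```
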

Ablation Study

Configuration	ROC-AUC change @ max drift	Silent Failure Rate	Note
Base model (Qwen-0.6B)	−39.2%	35.2%	Slightly more robust
Instruct model (Qwen-4B-Instruct)	−41.2%	42.1%	RLHF model more fragile
Gaussian drift	−42.7%	—	Isotropic Gaussian noise
Directional drift	≈ −45%	—	Fixed direction
Subspace rotation	−48.9%	—	Worst-case drift type

Key Findings

Threshold Effect: Drift exhibits a sharp cliff. \(\sigma < 0.01\) has almost no impact (<5% AUC drop), while \(\sigma > 0.02\) yields near-random performance, so the transition occurs within a window of only about 0.01 in \(\sigma\).
Silent Failure as the Greatest Threat: When the classifier reports 90% confidence, actual accuracy is only 56%, barely above the 50% chance level on the balanced test set.
Counterintuitive Effect of Alignment: Compared with the base model, the instruction-tuned model has a lower Silhouette score (0.198 vs. 0.245, −19%) and a lower Fisher ratio (3.12 vs. 4.23, −26%), while class overlap rises from 12.3% to 18.7%. This suggests that RLHF blurs class boundaries in the embedding space.
Drift Mechanism Independence: The three drift types differ by only about 6 percentage points (−42.7% to −48.9%), indicating that the fragility is structural rather than specific to a particular perturbation.

Highlights & Insights

Concise Mathematical Explanation of High-Dimensional Fragility: In 896-dimensional space, even at \(\sigma=0.02\), the variance of the perturbation term \(w^\top \varepsilon\) is \(\|w\|^2 \sigma^2 \approx 0.5\), comparable to the signal strength \(w^\top z\), yielding SNR ≈ 1 and rendering the classifier unusable. Each dimension contributes independent noise, so the total perturbation norm grows as \(\sigma\sqrt{d}\) even when the per-dimension \(\sigma\) looks small.
Sigmoid-Based Explanation of Confidence Failure: The sigmoid function maps large-magnitude inputs to confidence values near 0 or 1. Drift can randomly flip the sign of \(w^\top z + b\) (destroying accuracy) without systematically shrinking \(|w^\top z + w^\top \varepsilon + b|\), so confidence remains high even when the predicted class is already wrong.
Implications for Production Systems: Every model update must be accompanied by retraining of the safety classifiers; average-confidence monitoring cannot be relied upon to detect such failures.
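
The sigmoid argument can be illustrated with a toy simulation. Everything here is an assumption made for illustration and does not reproduce the paper's data or numbers: the classifier is a fixed linear model \(w = K e_0\), \(b = 0\), and the constants `K`, `SIGNAL`, and `SPREAD` are hypothetical values chosen so that the class signal along \(w\) is small relative to the drift. The sketch reports accuracy, mean confidence, and the share of samples that are wrong with confidence above 0.8.

```python
import math
import random

D = 896         # embedding dimension quoted in the text
N = 400         # hypothetical number of test samples
K = 200.0       # hypothetical weight norm ||w||, with w = K * e_0 and b = 0
SIGNAL = 0.01   # hypothetical class offset along w
SPREAD = 0.008  # hypothetical within-class spread along w

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sigmoid(t):
    # Numerically stable for both signs of t.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def make_embedding(label, rng):
    """Unit-norm synthetic embedding whose first coordinate carries the class."""
    sign = 1.0 if label == 1 else -1.0
    first = sign * SIGNAL + rng.gauss(0.0, SPREAD)
    rest = [rng.gauss(0.0, 1.0 / math.sqrt(D)) for _ in range(D - 1)]
    return normalize([first] + rest)

def evaluate(embeddings, labels, sigma, rng):
    """Return (accuracy, mean confidence, silent failure rate) under Gaussian drift."""
    correct, conf_sum, silent = 0, 0.0, 0
    for z0, y in zip(embeddings, labels):
        zc = normalize([x + rng.gauss(0.0, sigma) for x in z0])
        p = sigmoid(K * zc[0])           # logit = w^T z_c + b
        conf = max(p, 1.0 - p)
        conf_sum += conf
        if (1 if p >= 0.5 else 0) == y:
            correct += 1
        elif conf > 0.8:                 # wrong and confident: silent failure
            silent += 1
    n = len(labels)
    return correct / n, conf_sum / n, silent / n

rng = random.Random(0)
labels = [i % 2 for i in range(N)]
embeddings = [make_embedding(y, rng) for y in labels]
results = {s: evaluate(embeddings, labels, s, rng) for s in (0.0, 0.01, 0.028)}

for sigma, (acc, conf, silent) in results.items():
    print(f"sigma={sigma:.3f}  acc={acc:.3f}  mean_conf={conf:.3f}  silent={silent:.3f}")
```

In this toy setup the logit noise has standard deviation \(\|w\|\sigma\) before normalization, which at the largest drift exceeds the class signal \(K \cdot \text{SIGNAL}\). Predictions flip sign often, yet \(|{\rm logit}|\) stays large, so errors tend to be confident ones.
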

Limitations & Future Work

Controlled Experiment vs. Real Drift: Additive perturbations may underestimate the distributional shift caused by actual model updates, which involve more complex changes from architectural modifications and data updates.
Classifier Simplicity: Only logistic regression is evaluated; the behavior of more complex classifiers (e.g., MLPs, ensembles) under drift remains unknown.
Insufficient Mitigation Analysis: The paper mentions meta-learning and domain adaptation as potential remedies but provides no empirical validation.
Limited Model Scale: Only 0.6B and 4B models are tested; the embedding stability behavior of larger models (e.g., 70B+) may differ substantially.

Related Work

vs. Cunningham et al. (2026) Constitutional Classifiers++: That work focuses on building better safety classifiers, whereas this paper reveals a more fundamental problem: even a well-designed classifier will fail if it is not retrained alongside model updates.
vs. SHADE-Arena (Kutasov et al. 2025): That work evaluates sabotage risks in LLM agents; this paper demonstrates the fragility of safety mechanisms at the embedding level.
This paper serves as an important warning for any production system that employs a frozen-embedding + downstream-classifier paradigm.

Rating

Novelty: ⭐⭐⭐⭐ The problem of embedding drift is not new, but the focus on silent failure in safety classifiers is a genuinely novel angle.
Experimental Thoroughness: ⭐⭐⭐ Experimental design is clear, but the range of models and classifiers is limited, and empirical validation of mitigation strategies is absent.
Writing Quality: ⭐⭐⭐⭐ Well-structured with concise mathematical analysis.
Value: ⭐⭐⭐⭐ Provides important practical guidance for the production deployment of AI safety systems.