Breaking the Adversarial Robustness-Performance Trade-off in Text Classification via Manifold Purification¶
Conference: AAAI 2026 | arXiv: 2511.07888 | Code: To be confirmed | Area: AI Safety / Adversarial Robustness | Keywords: Adversarial Defense, Text Classification, Manifold Correction, Normalizing Flow, Riemannian Geometry, OOD Detection, Geodesic Purification
TL;DR¶
This paper proposes the Manifold-Correcting Causal Flow (MC²F) framework, which employs a Stratified Riemannian Continuous Normalizing Flow (SR-CNF) to learn the manifold density of clean data embeddings for adversarial example detection, and subsequently applies a Geodesic Purification Solver to project detected adversarial embeddings back onto the clean manifold along geodesic paths. MC²F comprehensively surpasses state-of-the-art methods in adversarial robustness across SST-2, AGNews, and YELP benchmarks, while incurring no loss—and even achieving marginal gains—in clean accuracy.
Background & Motivation¶
- Background: Pre-trained language models (PLMs, e.g., BERT) have achieved remarkable success in text classification, yet remain highly vulnerable to adversarial attacks (e.g., TextFooler, BERT-Attack)—subtle, semantically imperceptible textual perturbations can completely flip model predictions.
- Limitations of Prior Work: Existing defenses (adversarial training, embedding denoising, etc.) suffer from a pervasive robustness–accuracy trade-off: improving adversarial robustness inevitably degrades clean accuracy, which is unacceptable in safety-critical applications.
- Key Challenge: Adversarial training (AT) forcibly enhances robustness via data augmentation at high computational cost, and frequently produces "robustness illusions" due to gradient masking. Purification-based methods avoid modifying model training but lack precise modeling of the geometric structure of the embedding space, limiting their purification efficacy.
- Key Insight: Through empirical analysis, the authors find that clean and adversarial text embeddings occupy geometrically separable regions of BERT's embedding space—reframing adversarial defense as a geometric correction problem rather than a brute-force training problem.
Core Problem¶
Can adversarial examples be detected as outliers and projected back onto the clean manifold by precisely modeling the manifold structure of clean data embeddings, thereby achieving both high robustness and zero accuracy loss simultaneously?
Method¶
Empirical Foundation: Manifold Separability Hypothesis¶
The paper first conducts systematic geometric analysis on the SST-2 dataset:
- Visualization Evidence: PCA, t-SNE, and UMAP all reveal visibly separated clusters between clean and adversarial embeddings.
- Statistical Distance Evidence: MMD, JSD, and Wasserstein distances between clean and adversarial distributions are significantly larger than intra-distribution distances within clean data.
- Local Intrinsic Dimensionality (LID) Evidence: The mean LID of adversarial embeddings (28.20) is significantly higher than that of clean embeddings (23.74), with a p-value on the order of \(10^{-43}\): adversarial perturbations systematically push embeddings toward geometrically more complex regions (a minimal LID estimator sketch follows below).
Based on these findings, two hypotheses are established: (1) Manifold Separability—clean and adversarial embeddings are statistically and geometrically separable; (2) Stratified Manifold Structure—the embedding space is composed of sub-manifolds with varying intrinsic dimensionalities.
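To ground the LID evidence, here is a minimal sketch of the standard maximum-likelihood LID estimator (in the Levina-Bickel style commonly used for such analyses); the function name and the neighborhood size `k=20` are illustrative choices, not details taken from the paper:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lid_mle(embeddings: np.ndarray, k: int = 20) -> np.ndarray:
    """Per-point maximum-likelihood LID estimate.

    LID(x) = -[ (1/k) * sum_{i=1..k} log( r_i(x) / r_k(x) ) ]^{-1},
    where r_i(x) is the distance from x to its i-th nearest neighbor.
    """
    # Request k+1 neighbors because each point's nearest neighbor is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    dists, _ = nn.kneighbors(embeddings)
    dists = dists[:, 1:]                       # drop the zero self-distance
    r_k = dists[:, -1:]                        # distance to the k-th neighbor
    ratios = np.clip(dists / r_k, 1e-12, None) # guard log(0) for duplicates
    return -1.0 / np.mean(np.log(ratios), axis=1)
```

Running this on clean and adversarial embedding matrices and comparing the means (e.g., with a Welch t-test) reproduces the kind of comparison reported above.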
Overall Architecture¶
MC²F comprises two core modules: (1) SR-CNF for adversarial example detection; (2) Geodesic Purification Solver for embedding correction. At inference time, the log-likelihood \(\log p(z_{in})\) is computed for an input embedding \(z_{in}\); if it falls below threshold \(\tau\), purification is triggered; otherwise, the embedding passes through directly.
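A minimal sketch of this detect-then-correct gate is below; `srcnf`, `purifier`, and `classifier` are assumed component interfaces, not names from the paper:

```python
import torch

def mc2f_forward(z_in: torch.Tensor, srcnf, purifier, classifier, tau: float):
    """Detect-then-correct inference gate (hypothetical component APIs).

    srcnf.log_prob(z) -> log p_clean(z) under the learned flow
    purifier(z)       -> embedding projected back onto the clean manifold
    """
    log_p = srcnf.log_prob(z_in)           # density under the clean manifold
    is_adv = log_p < tau                   # flag low-likelihood inputs
    # For simplicity the purifier runs batch-wide; only flagged rows are kept.
    z = torch.where(is_adv.unsqueeze(-1), purifier(z_in), z_in)
    return classifier(z)
```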
Key Designs¶
- Stratified Riemannian Continuous Normalizing Flow (SR-CNF)
- Function: Learns the probability density \(p_{clean}(z)\) of clean data embeddings to detect OOD adversarial examples.
- Mechanism: Rather than assuming a fixed geometry, a Mixture-of-Experts (MoE) network learns a position-dependent Riemannian metric tensor \(G(z) = \sum_{k=1}^{K} \alpha_k(z) E_{\psi_k}(z)\).
- A gating network \(g_\phi(z)\) outputs mixture weights; \(K\) expert networks each specialize in the local geometry of a particular stratum.
- Positive definiteness is ensured by parameterizing each expert metric as \(E_{\psi_k}(z) = L_k(z)L_k(z)^\top + \epsilon I\), where \(L_k(z)\) is the expert network's raw matrix output (a runnable sketch of this metric follows this block).
- A CNF is defined on the learned Riemannian manifold, and log-likelihood is computed via Riemannian divergence (Equations 3–4).
- Detection Mechanism: \(\log p(z_{in}) < \tau\) flags the input as adversarial.
- Design Motivation: The embedding space is not a single uniform manifold but a stratified structure with varying intrinsic dimensionalities; MoE adaptively captures this layered geometry.
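The position-dependent metric can be sketched as an illustrative PyTorch module; the simple linear gate and expert networks below are assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class StratifiedMetric(nn.Module):
    """Position-dependent Riemannian metric G(z) = sum_k alpha_k(z) E_k(z).

    Each expert emits a factor L_k(z); E_k = L_k L_k^T + eps*I is symmetric
    positive definite by construction, and a softmax-weighted mixture of SPD
    matrices stays SPD. Module names are illustrative, not the paper's code.
    """
    def __init__(self, dim: int, n_experts: int = 4, eps: float = 1e-4):
        super().__init__()
        self.eps = eps
        self.gate = nn.Sequential(nn.Linear(dim, n_experts), nn.Softmax(dim=-1))
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim * dim) for _ in range(n_experts)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        B, d = z.shape
        alpha = self.gate(z)                             # mixture weights (B, K)
        eye = self.eps * torch.eye(d, device=z.device)
        G = torch.zeros(B, d, d, device=z.device)
        for k, expert in enumerate(self.experts):
            L = expert(z).view(B, d, d)                  # raw factor L_k(z)
            E_k = L @ L.transpose(-1, -2) + eye          # SPD expert metric
            G = G + alpha[:, k, None, None] * E_k
        return G                                         # (B, d, d), SPD
```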
- Geodesic Purification Solver
- Function: Projects detected adversarial embeddings back onto the clean manifold along geodesics (shortest paths on the manifold).
- Formulation: Minimizes the path energy functional \(\mathcal{L}[\gamma] = \int_0^1 \langle \gamma'(t), \gamma'(t) \rangle_{G(\gamma(t))} dt\).
- Boundary Conditions: \(\gamma(0) = z_{adv}\), \(\gamma(1) = z_{corr} \in \mathcal{M}_{clean}\).
- Optimization: The path is discretized and its waypoints are optimized via gradient descent to minimize the energy functional; the constraint \(\log p(z_{corr}) \geq \tau\) is enforced via a soft penalty (a sketch of this discretized solver follows this block).
- Design Motivation: Rather than arbitrary denoising, this approach finds the geometrically nearest clean representation—maximally preserving semantic information.
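A sketch of the discretized solver, reusing the `StratifiedMetric` module above and assuming a `log_prob` callable for the SR-CNF density; the waypoint count, step count, and penalty weight are illustrative, not the paper's settings:

```python
import torch

def geodesic_purify(z_adv, metric, log_prob, tau,
                    n_points: int = 16, steps: int = 200, lam: float = 10.0):
    """Discrete geodesic purification (a sketch, not the paper's exact solver).

    Optimizes the waypoints of a path anchored at z_adv, minimizing the
    discrete path energy under the learned metric G, with a soft penalty
    pushing the free endpoint above the density threshold: log p(z_corr) >= tau.
    """
    # Initialize all waypoints at z_adv (a degenerate zero-length path).
    path = z_adv.detach().unsqueeze(0).repeat(n_points, 1).clone().requires_grad_(True)
    opt = torch.optim.Adam([path], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        full = torch.cat([z_adv.detach().unsqueeze(0), path], dim=0)
        deltas = full[1:] - full[:-1]              # segment displacements
        G = metric(full[:-1])                      # metric at segment starts
        # Discrete energy: sum_i <delta_i, G(gamma_i) delta_i>.
        energy = torch.einsum('ni,nij,nj->', deltas, G, deltas)
        # Soft constraint: the endpoint must reach the clean manifold.
        penalty = torch.relu(tau - log_prob(full[-1:])).sum()
        (energy + lam * penalty).backward()
        opt.step()
    return path[-1].detach()                       # z_corr
```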
- Multi-Objective Training Paradigm
- Density Estimation Loss \(\mathcal{L}_{NLL}\): Standard negative log-likelihood of the normalizing flow, driving the learning of the clean data distribution.
- Topological Regularization \(\mathcal{L}_{topo}\): Based on differentiable persistent homology, computes the Wasserstein distance between persistence diagrams of clean embedding batches and their latent-space counterparts, ensuring the flow transformation preserves global topological structure.
- Causal Semantic Regularization \(\mathcal{L}_{causal}\): Treats the purification process as a causal intervention (removing the confounding effect of adversarial perturbations), using Fisher-Rao distance to constrain the classifier output distribution of purified embeddings to be consistent with that of original clean embeddings.
- Total Loss: \(\mathcal{L}_{total} = \mathcal{L}_{NLL} + \lambda_{topo}\mathcal{L}_{topo} + \lambda_{causal}\mathcal{L}_{causal}\) (an assembly sketch follows below)
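One way the three terms might be assembled in a training step; all helper callables and the weight values are assumptions (differentiable persistent homology in particular requires a dedicated library), so this is a sketch of the structure, not the authors' code:

```python
import torch

def mc2f_loss(z_clean, z_purified, flow, classifier,
              topo_loss, fisher_rao,
              lam_topo: float = 0.1, lam_causal: float = 0.1):
    """Multi-objective training loss (a sketch; all helpers are assumed).

    flow.log_prob(z)  -> log p_clean(z) under the SR-CNF
    flow.encode(z)    -> the flow's latent image of z
    topo_loss(a, b)   -> Wasserstein distance between persistence diagrams
                         of batches a and b
    fisher_rao(p, q)  -> Fisher-Rao distance between output distributions
    """
    # (1) Density estimation: negative log-likelihood on clean embeddings.
    l_nll = -flow.log_prob(z_clean).mean()
    # (2) Topological regularization: input batch vs. its latent image.
    l_topo = topo_loss(z_clean, flow.encode(z_clean))
    # (3) Causal semantic regularization: classifier outputs on purified
    #     embeddings should match those on the paired clean embeddings.
    l_causal = fisher_rao(classifier(z_purified), classifier(z_clean))
    return l_nll + lam_topo * l_topo + lam_causal * l_causal
```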
Key Experimental Results¶
Main Results (3 Datasets × 3 Attack Methods)¶
Aua% denotes accuracy under attack (higher is better).

| Dataset | Method | Clean% | BERT-Attack Aua% | TextFooler Aua% | TextBugger Aua% |
|---|---|---|---|---|---|
| SST-2 | Fine-tune | 92.71 | 3.83 | 6.10 | 28.70 |
| SST-2 | SD (SOTA) | 91.36 | 36.46 | 46.30 | 54.50 |
| SST-2 | MC²F | 92.71 | 40.05 | 52.60 | 61.50 |
| AGNews | Fine-tune | 94.68 | 4.09 | 14.70 | 40.00 |
| AGNews | SD (SOTA) | 93.81 | 38.60 | 49.30 | 60.10 |
| AGNews | MC²F | 95.13 | 45.30 | 53.80 | 64.30 |
| YELP | Fine-tune | 95.19 | 5.40 | 5.20 | 29.60 |
| YELP | SD (SOTA) | 93.45 | 39.61 | 47.80 | 55.10 |
| YELP | MC²F | 95.26 | 48.50 | 54.00 | 63.20 |
Ablation Study (AGNews, TextFooler Attack)¶
#Query is the average number of model queries issued by the attacker; higher values indicate a harder-to-attack model.

| Configuration | Clean% | Aua% | #Query |
|---|---|---|---|
| MC²F (Full) | 95.13 | 53.8 | 561.4 |
| w/o \(\mathcal{L}_{NLL}\) | 93.22 | 32.6 | 366.7 |
| w/o \(\mathcal{L}_{topo}\) | 93.41 | 32.9 | 375.4 |
| w/o \(\mathcal{L}_{causal}\) | 94.76 | 48.6 | 479.1 |
Key Findings¶
- No clean-accuracy loss, with marginal gains: MC²F achieves 95.13% Clean% on AGNews (vs. 94.68% for fine-tuning) and 95.26% on YELP (vs. 95.19%), breaking the robustness–accuracy trade-off rather than merely mitigating it.
- Substantially increased attack query counts: Against BERT-Attack on YELP, MC²F requires 586.4 queries (vs. 320.7 for SD), indicating a significantly harder-to-explore decision boundary.
- Topological regularization is critical: removing \(\mathcal{L}_{topo}\) causes Aua% to drop sharply from 53.8% to 32.9%, nearly as damaging as removing \(\mathcal{L}_{NLL}\) itself (32.6%); preserving the global topological structure of the manifold is essential for preventing brittle representations.
- All three loss terms are indispensable: Removing any individual loss term leads to significant degradation in robustness and/or clean accuracy.
Highlights & Insights¶
- "Detect-then-correct" paradigm as an alternative to adversarial training: The method operates as an embedding-space input filter at inference time without modifying the model training process—fully decoupled from downstream models and thus highly generalizable.
- Complete chain from empirical observation to method design: The manifold separability hypothesis is validated through multiple lenses (PCA/t-SNE/UMAP/MMD/JSD/LID) before informing method design, rather than post-hoc rationalization.
- MoE for learning stratified Riemannian geometry: Mixture-of-Experts adaptively captures the non-uniform geometric structure of the embedding space, offering greater flexibility than fixed metrics.
- Importance of topological regularization: Persistent homology constraints enforce topological invariance under the flow transformation—its role in adversarial robustness is explicitly validated for the first time.
Limitations & Future Work¶
- Inference requires additional density estimation and potentially iterative geodesic optimization steps; computational overhead is not reported in detail and may become a bottleneck in real-time applications.
- Validation is limited to BERT-base; generalization to larger models (RoBERTa-large, LLMs, etc.) is untested.
- The detection threshold \(\tau\) is determined on the validation set; in real deployments, the clean/adversarial distribution may shift continuously.
- Experiments cover only word-level attacks (TextFooler/BERT-Attack/TextBugger); sentence-level (paraphrase) and character-level adversarial attacks are not evaluated.
- The convergence analysis and number of iteration steps for the geodesic solver are insufficiently characterized.
- The purification process may introduce subtle semantic drift, potentially affecting non-adversarial examples in edge cases.
Related Work & Insights¶
- vs. Adversarial Training (FreeLB/WLRE): Adversarial training modifies model weights and typically incurs 0.5–1.5% clean accuracy degradation; MC²F operates as a post-processing module that leaves model weights unchanged, achieving clean accuracy gains rather than losses.
- vs. Subspace Defense (SD): SD removes adversarial components via subspace projection, but the projection is linear; MC²F performs geodesic projection under a learned nonlinear Riemannian metric, more precisely adapting to the curved geometry of the embedding space.
- vs. DAD (Zhang et al. 2025): DAD uses MMD for detection and a denoiser for purification; MC²F employs Riemannian CNF for more precise detection (density estimation vs. two-sample testing) and offers geometric optimality guarantees for purification (geodesics vs. heuristic denoising).
- Insight: The "learn data manifold → OOD detection → geometric projection correction" paradigm of this framework is transferable to adversarial defense in image and multimodal domains.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First work to unify stratified Riemannian CNF, geodesic purification, and topological regularization into a cohesive adversarial defense framework with complete theoretical derivations.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers 3 datasets × 3 attacks × 4 baselines + ablation + preliminary validation, but lacks evaluation on larger models and a broader range of attack types.
- Writing Quality: ⭐⭐⭐⭐⭐ The logical chain from empirical hypotheses to method design to experimental validation is exceptionally clear, with rigorous mathematical exposition.
- Value: ⭐⭐⭐⭐ Addresses the long-standing robustness–accuracy trade-off in text classification with high practical utility.