AlignTree: Efficient Defense Against LLM Jailbreak Attacks¶
Conference: AAAI 2026 arXiv: 2511.12217v1 Code: https://github.com/Gilgo2/AlignTree Area: LLM Alignment Keywords: LLM Safety, Jailbreak Attack Defense, Random Forest Classifier, Refusal Direction, SVM
TL;DR¶
AlignTree leverages internal LLM activation features — combining linear refusal directions with nonlinear SVM signals — to train a lightweight random forest classifier that efficiently detects jailbreak attacks with negligible computational overhead, achieving state-of-the-art reductions in attack success rate (ASR).
Background & Motivation¶
LLMs face serious jailbreak attack threats, where adversaries craft prompts to bypass safety alignment mechanisms and elicit harmful outputs. Existing defenses exhibit a pronounced efficiency–robustness trade-off:
- Pre-processing defenses (e.g., LlamaGuard, ShieldGemma): require deploying an additional safety LLM for input filtering, incurring substantial computational cost;
- In-process defenses (e.g., SmoothLLM): require multiple repeated inferences or generating numerous prompt copies, resulting in high latency;
- Post-processing defenses (e.g., SelfDefense, AutoDefense): require the LLM to perform a secondary review of its own outputs, at least doubling the computational burden.
More critically, prior activation-space defenses primarily rely on a single linear refusal direction to classify harmful prompts. Recent studies, however, reveal that the refusal behavior in LLMs is geometrically non-linear, and a single linear signal is insufficient to capture all malicious patterns.
Core Problem¶
How can we design a computationally efficient jailbreak defense for LLMs, one that requires no extra model and no extra inference passes, while effectively reducing ASR without causing excessive refusal of benign prompts?
Method¶
Overall Architecture¶
AlignTree is an in-process defense that monitors internal activation states during LLM inference. The pipeline proceeds as follows:
1. Perform a single forward pass on the input prompt and extract hidden states from each layer;
2. Derive two types of features from the hidden states: (i) linear refusal activations and (ii) nonlinear SVM probability features;
3. Concatenate both feature types and feed them into a random forest classifier to produce a harmfulness confidence score;
4. Compare the score against a threshold \(\tau\) to decide whether to pass or block the input.
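As a sketch of the final decision step (step 4), the snippet below scores a feature vector with a trained forest and thresholds it. The function name, the stand-in forest, and the toy data are illustrative assumptions, not the authors' code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def aligntree_decide(features, forest, tau=0.5):
    # Harmfulness confidence from the forest, then pass/block against tau.
    score = forest.predict_proba(features.reshape(1, -1))[0, 1]
    return ("block", score) if score >= tau else ("pass", score)

# Toy forest standing in for the trained AlignTree classifier (assumption).
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(int)  # synthetic "harmful" labels
forest = RandomForestClassifier(n_estimators=50, max_depth=6,
                                min_samples_split=5, random_state=0).fit(X, y)
decision, score = aligntree_decide(X[0], forest)
print(decision, round(score, 3))
```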
Key Designs¶
- Refusal Activations (Linear Refusal Signal): The difference-in-means method computes, for each token position \(i\) and layer \(l\), the difference between the mean activation \(\mu_i^{(l)}\) over a harmful sample set \(D_{\text{harmful}}\) and the mean activation \(v_i^{(l)}\) over a harmless sample set \(D_{\text{harmless}}\): \(r_i^{(l)} = \mu_i^{(l)} - v_i^{(l)}\). A validation set is then used to evaluate the refusal-inducing and refusal-suppressing effect of each direction vector, and the single best refusal direction \(r^*\) is selected. For the hidden state \(h\) of the last token at each layer, a scalar projection is computed, \(\text{proj}_{r^*}(h) = \frac{h \cdot r^*}{\|r^*\|}\), yielding one refusal-activation scalar feature per layer.
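A minimal sketch of the difference-in-means direction and the scalar projection for one (layer, token) position; the toy data, array names, and shapes are assumptions:

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Difference-in-means refusal direction for one (layer, token) position.

    harmful_acts / harmless_acts: (n_samples, d_model) arrays of cached
    hidden states at a fixed layer and token position (names assumed).
    """
    return harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)

def refusal_activation(h, r):
    """Scalar projection of a hidden state h onto refusal direction r."""
    return float(h @ r / np.linalg.norm(r))

# Toy data: "harmful" activations shifted along one axis (assumption).
rng = np.random.default_rng(0)
harmful = rng.normal(size=(100, 16))
harmful[:, 0] += 3.0
harmless = rng.normal(size=(100, 16))
r = refusal_direction(harmful, harmless)
print(refusal_activation(harmful[0], r), refusal_activation(harmless[0], r))
```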
- SVM Nonlinear Malicious Signal Extraction: For each layer \(l\) and 8 selected token positions (the first 3 and last 5), an independent RBF-kernel SVM classifier \(\text{SVM}_i^{(l)}\) is trained, yielding \(8 \times L\) classifiers in total. After evaluation on the validation set, the top \(L/2\) classifiers by accuracy are retained. Platt scaling maps each SVM decision value to a calibrated harmfulness probability \(P_{\text{harmful}}(x_i^{(l)})\), forming the nonlinear feature vector.
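In scikit-learn terms, an RBF-kernel `SVC` with `probability=True` already applies Platt scaling (a sigmoid fit on internally cross-validated decision values), which matches the calibration step described above. The toy per-(layer, token) activations below are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical activations at one (layer, token) position: two classes
# separated in mean (0 = harmless, 1 = harmful).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 8)), rng.normal(2, 1, (200, 8))])
y = np.array([0] * 200 + [1] * 200)

# RBF-kernel SVM; probability=True enables Platt-scaled probabilities.
svm = SVC(kernel="rbf", probability=True).fit(X, y)
p_harmful = svm.predict_proba(X[:5])[:, 1]  # calibrated P_harmful features
print(p_harmful)
```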
- Random Forest Classifier: The two feature types are concatenated into a single input vector, and a shallow random forest (n_estimators=50, max_depth=6, min_samples_split=5) performs the harmful/harmless classification.
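A hedged sketch of this classification stage, with randomly generated stand-ins for the per-layer refusal projections and SVM probabilities (all shapes, labels, and feature counts are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n, n_layers = 400, 12
# Stand-ins for per-layer refusal projections and calibrated SVM
# probabilities (shapes assumed for illustration).
refusal_feats = rng.normal(size=(n, n_layers))
svm_feats = rng.uniform(size=(n, n_layers // 2))
X = np.hstack([refusal_feats, svm_feats])  # concatenated input vector
y = ((refusal_feats.mean(axis=1) + svm_feats.mean(axis=1)) > 0.25).astype(int)

# Shallow forest with the paper's reported hyperparameters.
rf = RandomForestClassifier(n_estimators=50, max_depth=6,
                            min_samples_split=5, random_state=0).fit(X, y)
score = rf.predict_proba(X[:1])[0, 1]  # harmfulness confidence score
print(round(score, 3))
```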
Loss & Training¶
- SVMs use RBF kernels; out-of-fold probabilities are generated via 5-fold cross-validation.
- Random forest hyperparameters: n_estimators=50, max_depth=6, min_samples_split=5 (grid search confirms low hyperparameter sensitivity).
- Threshold selection: the optimal threshold \(\tau\) is selected on the validation set using the \(F_\beta\) score with \(\beta = 0.2\), emphasizing precision: \(F_\beta = (1+\beta^2)\,\frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}\).
- Training data: refusal/SVM training samples are drawn from AdvBench, MaliciousInstruct, TDC2023, StrongReject, and HarmBench (harmful) and ALPACA (harmless); random forest training samples use JailbreakBench, PAIR, and AutoDAN attack samples together with ALPACA and XSTest.
- Training time is approximately 3 minutes on a single RTX 6000 Ada GPU for the largest model.
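The threshold step above can be sketched as a sweep over candidate \(\tau\) values maximizing the precision-weighted \(F_{0.2}\) score on validation data; the synthetic scores and labels below are assumptions:

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Hypothetical validation labels and forest confidence scores (assumed).
rng = np.random.default_rng(3)
y_val = rng.integers(0, 2, 300)
scores = np.clip(y_val * 0.4 + rng.uniform(size=300) * 0.6, 0, 1)

# Pick the tau that maximizes F_beta with beta=0.2 (precision-weighted).
taus = np.linspace(0.05, 0.95, 19)
f = [fbeta_score(y_val, (scores >= t).astype(int), beta=0.2, zero_division=0)
     for t in taus]
tau = taus[int(np.argmax(f))]
print(round(tau, 2))
```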
Key Experimental Results¶
| Dataset | Metric | AlignTree | Prev. SOTA | Gain |
|---|---|---|---|---|
| MalwareGen (Qwen2.5-0.5B) | ASR↓ | 4.0 | 5.0 (AutoDefense) | −1pp |
| PAIR (Qwen2.5-0.5B) | ASR↓ | 6.0 | 8.0 (SelfDefense-Input) | −2pp |
| AutoDAN (Qwen2.5-0.5B) | ASR↓ | 0 | 0 (AutoDefense) | Tied |
| PromptInject (Llama-3.1-8B) | ASR↓ | 18.0 | 28.0 (SelfDefense) | −10pp |
| PAIR (Gemma-3-12b) | ASR↓ | 10.0 | 19.0 (AutoDefense) | −9pp |
| White-box adaptive attack (3 models) | ASR↓ | 0 | 0 (AutoDefense) | Tied, ~58× faster |
| PIQA/ARC and other benign datasets | False refusal rate↓ | 0–1% | 0–8% (AutoDefense) | Lowest false refusal |
Efficiency comparison: AlignTree's execution time is close to the undefended baseline, approximately 10–50× faster than AutoDefense and 5–20× faster than SmoothLLM. In white-box attack experiments, AlignTree (2.40s) is approximately 58× faster than AutoDefense (140.74s), with both methods achieving ASR of 0.
Ablation Study¶
- RefusalClassifier (linear signal only): effective on well-aligned models (Llama), but nearly ineffective on weakly aligned models (Qwen, ASR 89.0), indicating that linear signals depend heavily on the base model's alignment quality.
- SVMClassifier (nonlinear signal only): poor generalization; excessive false refusal rates on some datasets.
- MultiRefusalsClassifier (multiple refusal directions): outperforms the single-direction variant, confirming the multi-dimensional nature of refusal mechanisms.
- AlignTreeLinear (linear SVM replacing RBF): performs well on individual models (Gemma-3-12b) but is inconsistent overall (ASR 61.0 vs. AlignTree's 4.0 on Qwen2.5-0.5B).
- Full AlignTree: achieves the most stable and consistent performance across all models.
Highlights & Insights¶
- Minimal computational overhead: requires no additional LLM, no repeated inference, and no prompt variants; performs lightweight classification solely on activations from the existing forward pass.
- Complementary linear and nonlinear signals: the first defense framework to combine linear refusal directions with nonlinear SVM features; ablation studies strongly demonstrate the importance of nonlinear signals.
- Low false refusal rate: near-zero false refusals across four commonsense reasoning datasets, demonstrating strong practical usability.
- Comprehensive evaluation: 9 LLMs (3 model families × 3 scales), multiple attack benchmarks, and white-box adaptive attacks provide extensive experimental coverage.
Limitations & Future Work¶
- A separate classifier must be trained for each individual model, precluding cross-model reuse.
- On weakly aligned models, the refusal direction signal is nearly ineffective, leaving the method entirely dependent on SVM signals.
- ASR on the PromptInject dataset remains relatively high (e.g., 41.0 on Qwen2.5-0.5B), indicating that prompt-injection-style attacks remain challenging.
- ASR evaluation relies on ChatGPT-4o as a judge, which may introduce evaluation bias.
- Only shallow classifiers are explored; whether more complex models (e.g., MLPs) could yield further improvements remains uninvestigated.
- Future work could introduce a "suspicious" threshold interval to route uncertain prompts to a stronger downstream defense.
Related Work & Insights¶
| Method | Extra Model | Extra Inference Passes | Computational Cost | ASR |
|---|---|---|---|---|
| LlamaGuard | Required | 1 (guard LLM) | High | Low |
| AutoDefense | Required | 20 | Very high | Low |
| SmoothLLM | Not required | 10 | Medium–high | Medium |
| SelfDefense | Not required | 2 | Medium | Medium |
| PerplexityDefense | Not required | 0 | Very low | High (weak defense) |
| AlignTree | Not required | 0 | Very low | Low |
AlignTree is the only method that simultaneously achieves low ASR and low false refusal rates without introducing any additional model or extra inference passes.
- Non-linear nature of refusal behavior: Experimental results compellingly demonstrate that refusal behavior in LLMs is not a simple linear phenomenon; future alignment research should pay greater attention to nonlinear structures in activation space.
- Cascaded lightweight classifier design: The two-stage design of extracting probability features via SVM and feeding them into a random forest is transferable to other scenarios requiring real-time in-inference decisions, such as hallucination detection and toxicity filtering.
- Adaptive threshold strategy: The \(F_\beta\)-score-based threshold selection method generalizes to any security decision scenario requiring a precision–recall trade-off.
- Further integration with activation-space defense works such as JBShield could be explored to investigate richer feature combinations.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of linear and nonlinear signals is a novel contribution, though the individual techniques (refusal directions, SVM/RF) are not entirely new.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Nine models, three model families, multiple attack types, white-box adaptive attacks, detailed ablations, and hyperparameter sensitivity analysis — extremely comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with detailed method descriptions; the large number of tables makes the paper slightly verbose.
- Value: ⭐⭐⭐⭐ Strongly practical; the extremely low computational overhead is a genuine highlight that enables direct deployment in production environments.