AlignTree: Efficient Defense Against LLM Jailbreak Attacks

Conference: AAAI 2026 · arXiv: 2511.12217v1 · Code: https://github.com/Gilgo2/AlignTree · Area: LLM Alignment · Keywords: LLM Safety, Jailbreak Attack Defense, Random Forest Classifier, Refusal Direction, SVM

TL;DR

AlignTree leverages internal LLM activation features — combining linear refusal directions with nonlinear SVM signals — to train a lightweight random forest classifier that efficiently detects jailbreak attacks with negligible computational overhead, achieving state-of-the-art reductions in attack success rate (ASR).

Background & Motivation

LLMs face serious jailbreak attack threats, where adversaries craft prompts to bypass safety alignment mechanisms and elicit harmful outputs. Existing defenses exhibit a pronounced efficiency–robustness trade-off:

  • Pre-processing defenses (e.g., LlamaGuard, ShieldGemma): require deploying an additional safety LLM for input filtering, incurring substantial computational cost;
  • In-process defenses (e.g., SmoothLLM): require multiple repeated inferences or generating numerous prompt copies, resulting in high latency;
  • Post-processing defenses (e.g., SelfDefense, AutoDefense): require the LLM to perform a secondary review of its own outputs, at least doubling the computational burden.

More critically, prior activation-space defenses primarily rely on a single linear refusal direction to classify harmful prompts. Recent studies, however, reveal that the refusal behavior in LLMs is geometrically non-linear, and a single linear signal is insufficient to capture all malicious patterns.

Core Problem

How can one design a computationally efficient jailbreak defense for LLMs, one that requires no auxiliary safety model and no extra inference passes, yet effectively reduces ASR without excessively refusing benign prompts?

Method

Overall Architecture

AlignTree is an in-process defense that monitors internal activation states during LLM inference. The pipeline proceeds as follows:

  1. Perform a single forward pass on the input prompt and extract hidden states from each layer;
  2. Derive two types of features from the hidden states: (i) linear refusal activations, and (ii) nonlinear SVM probability features;
  3. Concatenate both feature types and feed them into a random forest classifier to produce a harmfulness confidence score;
  4. Compare the score against the threshold \(\tau\) to decide whether to pass or block the input.
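
A minimal Python sketch of this flow, assuming a HuggingFace-style model that returns per-layer hidden states; `refusal_projection`, `svm_probabilities`, the fitted `forest`, and the threshold `tau` are placeholders fleshed out under Key Designs below:

```python
import numpy as np

def aligntree_filter(prompt, model, tokenizer, r_star, svms, forest, tau):
    """Single-forward-pass harmfulness check (sketch, not the authors' code)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)  # one forward pass only
    hidden = [h[0].detach().cpu().numpy() for h in out.hidden_states]

    # Linear refusal projections concatenated with calibrated SVM probabilities.
    feats = np.concatenate([refusal_projection(hidden, r_star),
                            svm_probabilities(hidden, svms)])[None, :]

    score = forest.predict_proba(feats)[0, 1]  # harmfulness confidence
    return "block" if score >= tau else "pass"
```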

Key Designs

  1. Refusal Activations (Linear Refusal Signal): The difference-in-means method computes, at each token position \(i\) and each layer \(l\), the difference between the mean activation vectors of a harmful sample set \(D_{\text{harmful}}\) and a harmless sample set \(D_{\text{harmless}}\): \(r_i^{(l)} = \mu_i^{(l)} - v_i^{(l)}\). A validation set is then used to evaluate each candidate direction's refusal-inducing and refusal-suppressing effect, and the best single refusal direction \(r^*\) is selected. For the hidden state \(h\) of the last token at each layer, a scalar projection \(\text{proj}_{r^*}(h) = \frac{h \cdot r^*}{\|r^*\|}\) is computed, yielding one refusal-activation scalar feature per layer.

  2. SVM Nonlinear Malicious Signal Extraction: For each layer \(l\) and 8 selected token positions (first 3 and last 5), an independent RBF-kernel SVM classifier \(\text{SVM}_i^{(l)}\) is trained, yielding \(8 \times L\) classifiers in total. After evaluation on the validation set, the top \(L/2\) classifiers by accuracy are retained. Platt scaling maps each SVM decision value to a calibrated harmfulness probability \(P_{\text{harmful}}(x_i^{(l)})\), forming the nonlinear feature vector.

  3. Random Forest Classifier: The two feature types are concatenated into a complete input vector:

\[F(t) = [\text{proj}_{r^*}(x_{-1}^{(l)}(t))]_{l=1}^{L} \oplus [P_{\text{harmful}}(x_i^{(l)}(t))]_{(i,l) \in \mathcal{S}}\]

A shallow random forest (n_estimators=50, max_depth=6, min_samples_split=5) is used for harmful/harmless classification; a code sketch of all three components follows below.
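
The three components above can be pieced together in a short scikit-learn sketch. This is a minimal illustration under the paper's description, not the released implementation; the `hidden` layout (a list of per-layer `(seq_len, d_model)` arrays), the helper names, and the SVM retention bookkeeping are assumptions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# --- 1. Linear refusal signal -----------------------------------------
def refusal_direction(acts_harmful, acts_harmless):
    """Difference-in-means direction r = mu - v for one (token, layer)
    slot; inputs are (n_samples, d_model) activation matrices."""
    return acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)

def refusal_projection(hidden, r_star):
    """Scalar projection proj = (h . r*) / ||r*|| of each layer's
    last-token hidden state onto the selected direction r*."""
    r_unit = r_star / np.linalg.norm(r_star)
    return np.array([layer[-1] @ r_unit for layer in hidden])

# --- 2. Nonlinear SVM signal -------------------------------------------
def train_position_svm(X, y):
    """One RBF SVM per (token position, layer); probability=True turns
    on Platt scaling (with internal cross-validation) so that
    predict_proba returns calibrated harmfulness probabilities."""
    return SVC(kernel="rbf", probability=True).fit(X, y)

def svm_probabilities(hidden, svms):
    """Calibrated P(harmful) from the retained SVMs; `svms` maps
    (token_index, layer_index) -> fitted classifier."""
    return np.array([clf.predict_proba(hidden[l][i][None, :])[0, 1]
                     for (i, l), clf in svms.items()])

# --- 3. Random forest over the concatenated feature vector -------------
def feature_vector(hidden, r_star, svms):
    return np.concatenate([refusal_projection(hidden, r_star),
                           svm_probabilities(hidden, svms)])

forest = RandomForestClassifier(n_estimators=50, max_depth=6,
                                min_samples_split=5)  # paper's settings
```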

Loss & Training

  • SVMs use RBF kernels; out-of-fold probabilities are generated via 5-fold cross-validation.
  • Random forest hyperparameters: n_estimators=50, max_depth=6, min_samples_split=5 (grid search confirms low hyperparameter sensitivity).
  • Threshold selection: The optimal threshold \(\tau\) is selected on the validation set using the \(F_\beta\) score (\(\beta=0.2\), emphasizing precision; a search sketch follows this list):
\[F_\beta = \frac{(1+\beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}\]
  • Training data: refusal/SVM training samples are drawn from AdvBench, MaliciousInstruct, TDC2023, StrongReject, and HarmBench (harmful) and ALPACA (harmless); random forest training samples use JailbreakBench, PAIR, and AutoDAN attack samples together with ALPACA and XSTest.
  • Training time is approximately 3 minutes on a single RTX 6000 Ada GPU for the largest model.
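
A sketch of the out-of-fold probability generation and the \(F_\beta\) threshold search described above; the candidate grid and function names are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import fbeta_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

def out_of_fold_probs(X, y):
    """Out-of-fold P(harmful) via 5-fold CV, so downstream training never
    sees probabilities from an SVM fit on the same samples."""
    clf = SVC(kernel="rbf", probability=True)
    return cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

def select_threshold(y_true, scores, beta=0.2):
    """Grid-search the tau that maximizes F_beta on the validation set;
    beta = 0.2 weights precision over recall, curbing false refusals."""
    grid = np.linspace(0.01, 0.99, 99)
    f = [fbeta_score(y_true, (scores >= t).astype(int), beta=beta)
         for t in grid]
    return grid[int(np.argmax(f))]
```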

Key Experimental Results

| Dataset | Metric | AlignTree | Prev. SOTA | Gain |
| --- | --- | --- | --- | --- |
| MalwareGen (Qwen2.5-0.5B) | ASR↓ | 4.0 | 5.0 (AutoDefense) | −1 pp |
| PAIR (Qwen2.5-0.5B) | ASR↓ | 6.0 | 8.0 (SelfDefense-Input) | −2 pp |
| AutoDAN (Qwen2.5-0.5B) | ASR↓ | 0 | 0 (AutoDefense) | Tied |
| PromptInject (Llama-3.1-8B) | ASR↓ | 18.0 | 28.0 (SelfDefense) | −10 pp |
| PAIR (Gemma-3-12b) | ASR↓ | 10.0 | 19.0 (AutoDefense) | −9 pp |
| White-box adaptive attack (3 models) | ASR↓ | 0 | 0 (AutoDefense) | Tied; 60×+ faster |
| PIQA/ARC and other benign datasets | False refusal rate↓ | 0–1% | 0–8% (AutoDefense) | Lowest false refusal |

Efficiency comparison: AlignTree's execution time is close to the undefended baseline, approximately 10–50× faster than AutoDefense and 5–20× faster than SmoothLLM. In white-box attack experiments, AlignTree (2.40s) is approximately 58× faster than AutoDefense (140.74s), with both methods achieving ASR of 0.

Ablation Study

  • RefusalClassifier (linear signal only): effective on well-aligned models (Llama), but nearly ineffective on weakly aligned models (Qwen, ASR 89.0), indicating that linear signals depend heavily on the base model's alignment quality.
  • SVMClassifier (nonlinear signal only): poor generalization; excessive false refusal rates on some datasets.
  • MultiRefusalsClassifier (multiple refusal directions): outperforms the single-direction variant, confirming the multi-dimensional nature of refusal mechanisms.
  • AlignTreeLinear (linear SVM replacing RBF): performs well on some models (e.g., Gemma-3-12b) but is inconsistent overall (ASR 61.0 vs. AlignTree's 4.0 on Qwen2.5-0.5B).
  • Full AlignTree: achieves the most stable and consistent performance across all models.

Highlights & Insights

  • Minimal computational overhead: requires no additional LLM, no repeated inference, and no prompt variants; performs lightweight classification solely on activations from the existing forward pass.
  • Complementary linear and nonlinear signals: the first defense framework to combine linear refusal directions with nonlinear SVM features; ablation studies strongly demonstrate the importance of nonlinear signals.
  • Low false refusal rate: near-zero false refusals across four commonsense reasoning datasets, demonstrating strong practical usability.
  • Comprehensive evaluation: 9 LLMs (3 model families × 3 scales), multiple attack benchmarks, and white-box adaptive attacks provide extensive experimental coverage.

Limitations & Future Work

  • A separate classifier must be trained for each individual model, precluding cross-model reuse.
  • On weakly aligned models, the refusal direction signal is nearly ineffective, leaving the method entirely dependent on SVM signals.
  • ASR on the PromptInject dataset remains relatively high (e.g., 41.0 on Qwen2.5-0.5B), indicating that prompt-injection-style attacks remain challenging.
  • ASR evaluation relies on ChatGPT-4o as a judge, which may introduce evaluation bias.
  • Only shallow classifiers are explored; whether more complex models (e.g., MLPs) could yield further improvements remains uninvestigated.
  • Future work could introduce a "suspicious" threshold interval to route uncertain prompts to a stronger downstream defense.

Comparison with Existing Defenses

| Method | Extra Model | Extra Inference Passes | Computational Cost | ASR |
| --- | --- | --- | --- | --- |
| LlamaGuard | Required | 1 (guard LLM) | High | Low |
| AutoDefense | Required | 20 | Very high | Low |
| SmoothLLM | Not required | 10 | Medium–high | Medium |
| SelfDefense | Not required | 2 | Medium | Medium |
| PerplexityDefense | Not required | 0 | Very low | High (weak defense) |
| AlignTree | Not required | 0 | Very low | Low |

AlignTree is the only method that simultaneously achieves low ASR and low false refusal rates without introducing any additional model or extra inference passes.

  • Non-linear nature of refusal behavior: Experimental results compellingly demonstrate that refusal behavior in LLMs is not a simple linear phenomenon; future alignment research should pay greater attention to nonlinear structures in activation space.
  • Cascaded lightweight classifier design: The two-stage design of extracting probability features via SVM and feeding them into a random forest is transferable to other scenarios requiring real-time in-inference decisions, such as hallucination detection and toxicity filtering.
  • Adaptive threshold strategy: The \(F_\beta\)-score-based threshold selection method generalizes to any security decision scenario requiring a precision–recall trade-off.
  • Integration with other activation-space defenses such as JBShield could be explored to investigate richer feature combinations.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of linear and nonlinear signals is a novel contribution, though the individual techniques (refusal directions, SVM/RF) are not entirely new.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Nine models, three model families, multiple attack types, white-box adaptive attacks, detailed ablations, and hyperparameter sensitivity analysis — extremely comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ Well-structured with detailed method descriptions; the large number of tables makes the paper slightly verbose.
  • Value: ⭐⭐⭐⭐ Strongly practical; the extremely low computational overhead is a genuine highlight that enables direct deployment in production environments.