SafeHarbor: Defining Precise Decision Boundaries via Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

Conference: ICML 2026
arXiv: 2605.05704
Code: ljj-cyber/SafeHarbor
Area: AI Safety / LLM Agent
Keywords: Guardrail, Agent Safety, Hierarchical Memory, Contrastive Learning, Over-Refusal

TL;DR

SafeHarbor replaces static, coarse-grained safety classifiers for LLM agents with a dynamic hierarchical memory tree plus dual-score gating. Using adversarial rule generation and entropy-based self-evolution, it lets GPT-4o keep a harmful-request refusal rate above 93% while raising the success rate on benign tool calls to 63.6%, which substantially reduces over-refusal.

Background & Motivation

Background: LLM Agents can invoke tools and perform real-world operations (writing files, sending emails, calling APIs), but the attack surface has expanded from "outputting harmful text" to "executing harmful actions." Current defenses either (i) use auxiliary LLMs for runtime monitoring (GuardAgent, ShieldAgent), (ii) fine-tune safety models (AgentAlign, Llama-Guard-3), or (iii) rely on static rule matching.

Limitations of Prior Work: Existing solutions treat safety boundaries as "globally fixed linear splits"—once an attempt is made to strictly prevent malicious prompts, similar but legitimate benign complex workflows are also blocked, leading to severe over-refusal. Furthermore, introducing auxiliary Agents incurs prohibitive latency (e.g., ShieldAgent requires real-time code generation).

Key Challenge: There is a sharp trade-off between safety strictness and utility on benign tasks; stricter settings lead to over-refusal, while looser settings are easily bypassed. The root cause is that the "boundary itself does not dynamically adjust with context."

Goal: To equip LLM Agents with a defense layer capable of "dynamically reconstructing safety boundaries for each query" without retraining the base model or adding heavy agent proxies, while keeping latency within an acceptable range.

Key Insight: Safety rules should be viewed as "locally clustered boundaries based on semantics" rather than global thresholds. By using retrieval-based dynamic rule injection and training a lightweight Safety Projector to geometrize the semantic space, the boundary is determined by the query’s own position.

Core Idea: A self-organizing "hierarchical memory tree" stores adversarially generated prohibition and exemption pairs. This works with a dual-center MLP Projector trained via contrastive loss to provide harmful/benign dual scores. Finally, a gating mechanism uses a "fast path + fuzzy zone LLM judge" to decide whether to trigger full safety verification.

Method

Overall Architecture

SafeHarbor is a safety defense layer placed in front of a frozen LLM agent. Its goal is to reconstruct the safety boundary for each query \(x\) according to the query's semantic position. The pipeline has three parts, two offline and one online. First, offline, seed harmful trajectories are expanded into diverse variants via adversarial mutation, and an LLM rule generator produces contrastive rule pairs \(\Pi_i=\{R_{\text{harm}},E_{\text{benign}}\}\), which define both what is prohibited and what counts as a legitimate adjacent scenario. These rules are organized into a two-layer memory tree \(\mathcal{M}\) according to information-entropy gain: upper-layer clusters act as routing pivots, and lower-layer leaves hold the fine-grained rule pairs. Second, also offline, a lightweight Safety Projector \(f_\theta:\mathcal{X}\to\mathbb{R}^d\) is trained to map queries into a geometric space anchored by two prototypes. Third, online, a dual-score gate decides whether a query takes the fast path and is released directly, or whether relevant rules are retrieved and handed to an LLM judge. The objective is that the resulting trajectory satisfies \(\tau^*\in\mathcal{T}_{\text{refuse}}\) when \(x\in\mathcal{T}_{\text{harm}}\), and \(\tau^*\in\mathcal{T}_{\text{exec}}\) otherwise.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
    subgraph OFF["Adversarial Rule Gen + Entropy-driven Memory Evolution (Offline)"]
        direction TB
        A["Seed Harmful Trajectories"] --> B["Adversarial Mutation<br/>Goal Decomp / Privilege Escalation / Contextual Reframing"]
        B --> C["LLM Generates Contrastive Rule Pairs<br/>Prohibition R_harm + Exemption E_benign"]
        C --> D["Insert into Tree via Entropy Gain ΔI<br/>Cluster Routing · Leaf Stores Rule Pairs"]
    end
    P["Geometric Dual-Center Contrastive Safety Projector<br/>benign / harmful prototypes + margin contrastive"]
    Q["Online query x"] --> G{"Dual-Score Gating<br/>Low S_harm & High S_benign?"}
    P --> G
    G -->|Consistently Safe| H["Fast Path Release"]
    G -->|Uncertain| I["Retrieve Rule Pairs via Cluster → Leaf"]
    D --> I
    I --> J["Base LLM Judge<br/>In-context Judgment with Rule Pairs"]
    J --> K["Refusal / Execution"]
    H --> K
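
The two-layer memory tree and its cluster-to-leaf lookup can be illustrated with a minimal sketch. Everything named here (`RulePair`, `Leaf`, `Cluster`, `retrieve`, the use of cosine similarity, and the top-`k` cutoff) is an assumption made for illustration; the summary above does not specify the paper's actual data structures or retrieval metric.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class RulePair:
    prohibition: str  # R_harm: what is forbidden
    exemption: str    # E_benign: the legitimate adjacent scenario


@dataclass
class Leaf:
    embedding: List[float]
    rule: RulePair


@dataclass
class Cluster:
    center: List[float]  # routing pivot for the upper layer
    leaves: List[Leaf] = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(tree, z_q, k=2):
    """Route to the most similar cluster, then return the top-k rule pairs in it."""
    best = max(tree, key=lambda c: cosine(z_q, c.center))
    ranked = sorted(best.leaves, key=lambda l: cosine(z_q, l.embedding), reverse=True)
    return [leaf.rule for leaf in ranked[:k]]
```

Routing through one cluster first keeps the leaf comparison restricted to a single branch rather than the whole rule base, which is the point of the two-layer layout.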

Key Designs

1. Adversarial Rule Generation + Entropy-driven Memory Tree Evolution: Scaling Rule Bases without Explosion

A vulnerability of static rule bases is that attackers can bypass them by changing the "wrapper." Simply adding a rule for every new sample causes the tree structure to expand redundantly. SafeHarbor ensures coverage on the generation side: for each seed trajectory \(\tau_h\), the generator rotates through three mutations—Goal Decomposition (splitting malicious intent into seemingly harmless sub-steps), Privilege Escalation (disguising as a high-priority debug request), and Contextual Reframing (wrapping in educational/hypothetical scenarios).
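
The rotation over the three mutation paradigms can be sketched as a simple round-robin schedule. This is only an illustration: the instruction strings paraphrase the descriptions above, and `MUTATIONS`, `mutation_schedule`, and `variants_per_seed` are hypothetical names, not the paper's generator interface. The actual rewriting would be done by an LLM that receives the selected instruction.

```python
from itertools import cycle

# Hypothetical instruction templates, one per mutation paradigm.
MUTATIONS = {
    "goal_decomposition": "Split the intent into sub-steps that each look harmless.",
    "privilege_escalation": "Present the request as a high-priority debug request.",
    "contextual_reframing": "Wrap the request in an educational or hypothetical scenario.",
}


def mutation_schedule(seed_ids, variants_per_seed=3):
    """Yield (seed_id, mutation_name) pairs, rotating through the paradigms in order."""
    names = cycle(MUTATIONS)  # iterates dict keys in insertion order, forever
    for seed_id in seed_ids:
        for _ in range(variants_per_seed):
            yield seed_id, next(names)
```

With `variants_per_seed=3` every seed is paired with all three paradigms; with a smaller value the rotation simply continues across seeds, so coverage stays balanced over the whole seed set.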

During tree insertion, Shannon entropy is used as a statistical criterion to judge if a sample brings a new distribution, rather than using a simple similarity threshold. \(z_h=f_\theta(\tau_h)\) is compared with cluster centers. Inside a cluster, the distribution is defined by softmax similarity \(p_i=\exp(\text{Sim}(z_i,c)/\gamma)/\sum_j\exp(\text{Sim}(z_j,c)/\gamma)\), and its entropy \(H(C)=-\sum_i p_i\log_2 p_i\) measures dispersion. Information gain \(\Delta I(z_h,C^*)=H(C^*\cup\{z_h\})-H(C^*)\) measures the distributional change brought by the new sample. Decisions follow three tiers: if the similarity to the nearest cluster is \(<\tau_{\text{sim}}\), a new cluster is created. If similar but \(\Delta I>\tau_{\text{gain}}\), a new leaf is created under the cluster. Otherwise, it is treated as a redundant variant and merged into the nearest leaf to refine its rule pair.
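
The three-tier insertion rule can be sketched as follows, under several assumptions that the summary does not settle: \(\text{Sim}\) is taken to be cosine similarity, the cluster center \(c\) is the member centroid and is recomputed after adding \(z_h\), and the values of `tau_sim`, `tau_gain`, and `gamma` are placeholders. Note that under this literal reading the gain depends on cluster size (a cluster of \(n\) equally similar members gains \(\log_2((n+1)/n)\) bits from one more such member), so \(\tau_{\text{gain}}\) would have to be tuned with that in mind.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(members):
    """Component-wise mean of the member embeddings (assumed cluster center)."""
    return [sum(col) / len(members) for col in zip(*members)]


def cluster_entropy(members, gamma=0.1):
    """H(C) = -sum_i p_i log2 p_i with p = softmax(Sim(z_i, c) / gamma)."""
    c = centroid(members)
    logits = [cosine(z, c) / gamma for z in members]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return -sum((w / total) * math.log2(w / total) for w in weights)


def information_gain(z_h, members, gamma=0.1):
    """Delta I = H(C union {z_h}) - H(C)."""
    return cluster_entropy(members + [z_h], gamma) - cluster_entropy(members, gamma)


def insertion_decision(z_h, clusters, tau_sim=0.7, tau_gain=0.3, gamma=0.1):
    """Return 'new_cluster', 'new_leaf', or 'merge' for a new embedding z_h."""
    if not clusters:
        return "new_cluster"
    best = max(clusters, key=lambda m: cosine(z_h, centroid(m)))
    if cosine(z_h, centroid(best)) < tau_sim:
        return "new_cluster"
    if information_gain(z_h, best, gamma) > tau_gain:
        return "new_leaf"
    return "merge"
```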

2. Geometric Dual-center Contrastive Safety Projector: Mapping Distance to Risk Levels

For dual-score gating to work, scores for ambiguous samples must fall within an informative continuous interval. Pure cross-entropy training pushes scores to extremes (0 or 1), erasing differences between ambiguous queries. The projector is a 2-layer MLP outputting \(z'=\text{MLP}(z)\). Two learnable prototypes, \(\mathbf{w}_B\) (benign center) and \(\mathbf{w}_H\) (harmful center), are embedded in the space. Euclidean distances \(d_B=\|z'-\mathbf{w}_B\|_2\) and \(d_H=\|z'-\mathbf{w}_H\|_2\) are calculated, and the risk score follows a softmax form \(s(x)=\exp(-d_H)/[\exp(-d_H)+\exp(-d_B)]\).
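
As a small worked sketch, the risk score can be computed directly from the two prototype distances. The function name and argument names are illustrative; the only fact used beyond the formula above is the algebraic identity \(\exp(-d_H)/[\exp(-d_H)+\exp(-d_B)] = 1/(1+\exp(d_H-d_B))\), i.e. a sigmoid of \(d_B-d_H\).

```python
import math


def risk_score(z_proj, w_benign, w_harm):
    """s(x) = exp(-d_H) / (exp(-d_H) + exp(-d_B)), written as a sigmoid of d_B - d_H."""
    d_b = math.dist(z_proj, w_benign)  # Euclidean distance to the benign prototype
    d_h = math.dist(z_proj, w_harm)    # Euclidean distance to the harmful prototype
    return 1.0 / (1.0 + math.exp(d_h - d_b))
```

A point equidistant from both prototypes scores exactly 0.5, and the score moves smoothly toward 0 or 1 as the point approaches one prototype, which is what lets ambiguous queries land in an intermediate range.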

Crucially, in addition to classification loss \(\mathcal{L}_{cls}\), a margin-based contrastive loss is added:

\[\mathcal{L}_{con}=\frac{1}{|\mathcal{B}|}\sum_{z\in\mathcal{B}}\max\!\big(0,\ \Delta+\|z'-\mathbf{w}_y\|_2-\|z'-\mathbf{w}_{\neg y}\|_2\big).\]

Here \(\mathbf{w}_y\) is the prototype of the sample's own class and \(\mathbf{w}_{\neg y}\) is the opposing prototype. The loss forces each sample to lie closer to its own class center than to the opposing center by at least a margin \(\Delta\). The total loss is \(\mathcal{L}_{\text{total}}=\mathcal{L}_{cls}+\lambda\mathcal{L}_{con}\).
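
The margin term can be checked numerically with a forward-pass sketch in plain Python. The function name, the dictionary layout of the prototypes, and the default margin are assumptions for illustration; a real training loop would compute the same expression in an autodiff framework so that gradients reach both the MLP and the prototypes.

```python
import math


def margin_contrastive_loss(batch, prototypes, margin=1.0):
    """Mean hinge loss max(0, margin + d_own - d_opp) over a batch.

    batch: list of (projected_embedding, label) pairs.
    prototypes: dict with exactly two labels mapped to prototype vectors.
    """
    total = 0.0
    for z, y in batch:
        (other,) = [label for label in prototypes if label != y]  # the opposing class
        d_own = math.dist(z, prototypes[y])
        d_opp = math.dist(z, prototypes[other])
        total += max(0.0, margin + d_own - d_opp)
    return total / len(batch)
```

A sample sitting on its own prototype, far from the other one, contributes zero; a sample midway between the prototypes contributes the full margin, so the loss concentrates on exactly the ambiguous region.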

3. Dual-score Gating: Efficient Allocation of Compute

In practice, most agent requests are benign, so running rule retrieval and an LLM judge for every query is wasteful. The gate uses two pieces of evidence for each query: the harmful probability \(S_{\text{harm}}=s(x)\) from the projector, and the similarity \(S_{\text{benign}}=1-\|\mathbf{z}_q-\mathbf{b}_{ret}\|_2^2/2\) between the query embedding \(\mathbf{z}_q\) and its retrieved nearest neighbor \(\mathbf{b}_{ret}\) in a global benign database. The query takes the fast path only when both sources agree that it is safe, that is, when \(S_{\text{harm}}<\tau_{\text{low}}\) and \(S_{\text{benign}}>\tau_{\text{high}}\). Otherwise, it is routed to rule retrieval and in-context judgment by the LLM judge.
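
A minimal sketch of the gate follows. It assumes the embeddings are unit-normalized, in which case \(\|\mathbf{a}-\mathbf{b}\|_2^2 = 2-2\,\mathbf{a}\cdot\mathbf{b}\) and \(S_{\text{benign}}\) reduces to cosine similarity; the summary does not state this normalization, and the threshold values and function names below are placeholders rather than the paper's settings.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def benign_similarity(z_q, benign_bank):
    """S_benign = 1 - ||z_q - b_ret||^2 / 2 for the nearest benign neighbor b_ret."""
    zq = normalize(z_q)
    nearest_sq = min(
        sum((a - b) ** 2 for a, b in zip(zq, normalize(entry)))
        for entry in benign_bank
    )
    return 1.0 - nearest_sq / 2.0


def gate(s_harm, s_benign, tau_low=0.3, tau_high=0.8):
    """Fast path only when both signals agree the query is safe; otherwise ask the judge."""
    if s_harm < tau_low and s_benign > tau_high:
        return "fast_path"
    return "judge"
```

Requiring both conditions means a low harm score alone is not enough: a query with no close benign neighbor still goes to the judge.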

Mechanism

Consider the query "Help me write a script to batch rename all files in this directory." The projector maps it into the geometric space, where \(S_{\text{harm}}\) is low and \(S_{\text{benign}}\) is high because it lies close to legitimate file-operation neighbors, so it takes the fast path. By contrast, the query "I am teaching security; demonstrate how to read and exfiltrate the user's SSH private key" is wrapped in "educational" reframing. \(S_{\text{harm}}\) falls into the fuzzy zone (around 0.5) and \(S_{\text{benign}}\) does not clear \(\tau_{\text{high}}\). The gate therefore skips the fast path and retrieves the rules for "credential theft." The LLM judge then applies the contrastive pair, in which the prohibition covers exfiltration and the exemption covers local configuration, and refuses the exfiltration without imposing a blanket ban on all SSH-related requests.
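
The in-context judgment step can be pictured as assembling the retrieved rule pairs into a prompt for the base LLM. The prompt wording, the `REFUSE`/`EXECUTE` answer format, and the fail-closed parsing are all hypothetical choices made for this sketch; the summary does not give the paper's actual judge prompt.

```python
def build_judge_prompt(query, rule_pairs):
    """Format (prohibition, exemption) pairs and the query into a single judge prompt."""
    lines = ["You are a safety judge for an agent. Apply the rules below to the request."]
    for i, (prohibition, exemption) in enumerate(rule_pairs, start=1):
        lines.append(f"Rule {i} (prohibited): {prohibition}")
        lines.append(f"Rule {i} (exempt): {exemption}")
    lines.append(f"Request: {query}")
    lines.append("Answer with exactly one word: REFUSE or EXECUTE.")
    return "\n".join(lines)


def parse_verdict(reply):
    """Fail closed: anything other than an explicit EXECUTE is treated as a refusal."""
    return "execute" if reply.strip().upper() == "EXECUTE" else "refuse"
```

Showing the exemption next to each prohibition is what gives the judge a concrete reason to let adjacent legitimate requests through instead of refusing the whole topic.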

Loss & Training

Only the projector is trained: \(\mathcal{L}_{\text{total}}=\mathcal{L}_{cls}+\lambda\mathcal{L}_{con}\), while the base LLM is completely frozen. The memory tree is built offline without training and evolves online. The system is plug-and-play.

Key Experimental Results

Main Results

Evaluated on GPT-4o and other base LLMs across benign and harmful requests.

Model	Method	Harmful Refusal ↑	Benign Score ↑	Evaluation
GPT-4o	No Defense	58.0%	44.2%	Over-permissive
GPT-4o	Rule Traverse	100.0%	12.1%	Severe over-refusal
GPT-4o	Ours	93%+	63.6%	Best trade-off

SafeHarbor is the only solution in the table to achieve both "harmful refusal > 93%" and "benign utility > 60%."

Ablation Study

Configuration	Observation	Explanation
Full SafeHarbor	93%+ refusal / 63.6% benign	Main result
w/o \(\mathcal{L}_{con}\)	Benign score drops	Margin contrastive is key for geometric separation
w/o Fast Path	Latency rises significantly	Fast path is core for efficiency
w/o Evolution	Attack success rate rises	Entropy-driven updates are necessary for long-term safety
\(S_{\text{harm}}\) only	Over-refusal returns	Benign similarity is key to reducing false positives

Key Findings

Rotating through three mutation paradigms ensures the rule base covers structural (decomposition), authoritative (privilege), and semantic (reframing) social engineering patterns.
The entropy gate \(\Delta I\) is superior to fixed similarity thresholds in distinguishing "new threats" from "variants," preventing rule explosion while maintaining coverage.
The dual-prototype space ensures scores for ambiguous queries fall into the 0.3–0.7 range, providing an informative continuous metric for gating.

Highlights & Insights

Reconstructing safety boundaries per-query is implemented as an engineered, lightweight structure (projector + memory tree), allowing plug-and-play use with closed-source models like GPT-4o.
Contrastive rule pairs \(\{R_{\text{harm}}, E_{\text{benign}}\}\) alleviate over-refusal by showing the LLM judge an explicit exemption alongside each prohibition, so it must recognize the legitimate adjacent case rather than refuse the whole topic.
The entropy-driven evolution can be ported to any retrieval-augmented system needing to incorporate new patterns without index explosion (e.g., RAG, ToolBench).
The fast path concept—using inexpensive dual scores to filter the majority of traffic—should be standard for all LLM-as-a-Judge guardrails.

Limitations & Future Work

The "harmful score" evaluation depends on an LLM-based judge, which has its own biases.
Fixed mutation paradigms may not cover unknown attack types (e.g., multi-modal injection, long-horizon planning attacks).
Long-term "drift" and "forgetting" in memory evolution under adversarial prompt poisoning have not been fully explored.
Thresholds \(\tau_{\text{low}}, \tau_{\text{high}}\) are empirical; adaptive strategies for different domains are needed.
A large, clean benign query database is a prerequisite, which may be unavailable for niche domains.

Related Work & Insights

vs AgentAlign: AgentAlign uses SFT to bake constraints into the model, incurring retraining costs; SafeHarbor leaves the base LLM frozen and trains only the lightweight projector.
vs Llama-Guard-3: The latter is a static classifier; SafeHarbor is aware of trajectory-level agent execution contexts.
vs GuardAgent / ShieldAgent: These require high-latency online code generation; SafeHarbor bypasses this via lightweight retrieval.
vs A-Mem: While A-Mem focuses on time-aware knowledge, this work proposes "constraint-driven" safety memory.
Insight: The prototype-anchored embedding, trained with a dual-center margin contrastive loss, could transfer to RAG filtering and multi-modal moderation.

Rating

Novelty: ⭐⭐⭐⭐ (Entropy-based evolution + contrastive rules)
Experimental Thoroughness: ⭐⭐⭐⭐ (Multiple LLMs and attack paradigms)
Writing Quality: ⭐⭐⭐⭐ (Clear framework and standard Algorithm descriptions)
Value: ⭐⭐⭐⭐⭐ (High engineering feasibility for closed-source LLMs)