When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents¶
Conference: AAAI 2026 arXiv: 2512.02445 Code: None Area: LLM Agent Keywords: LLM Safety, Long Context, Agent Safety, Refusal Mechanism, AgentHarm
TL;DR¶
This paper systematically investigates how long-context padding affects the safety behavior of LLM agents. Models claiming support for 1M–2M token windows exhibit performance collapse exceeding 50% at 100K tokens. Refusal rates fluctuate in unpredictable directions (GPT-4.1-nano rises from 5% to 40%; Grok 4 Fast drops from 80% to 10%), revealing critical safety vulnerabilities in long-context agent systems.
Background & Motivation¶
Background: LLM context windows have expanded from thousands to millions of tokens, enabling agents to handle increasingly long multi-step tasks. However, a substantial gap may exist between the "claimed capacity" and the practically usable capability of context windows.
Limitations of Prior Work: Prior work on long-context LLMs (e.g., "Lost in the Middle") has focused primarily on factual recall and accuracy, with almost no investigation into how long contexts affect agents' safety refusal behavior and task execution capability. Safety evaluations have consistently relied on short prompts and static refusal tests.
Core Problem: Does long context undermine the safety alignment of LLM agents? Do models become more cautious in refusing, or does overall capability simply degrade?
Key Insight: The AgentHarm benchmark is extended to systematically vary context padding length (1K–200K), type (random / relevant / irrelevant / multi-task), and position (before / after the task) under controlled conditions, in order to observe changes in agent behavior.
Core Idea: The effect of long context on agent safety is fundamentally unpredictable—different models exhibit opposite trends in refusal rate, and performance degradation begins well below the claimed context window length.
Method¶
Overall Architecture¶
Building on the AgentHarm benchmark (176 samples spanning 8 categories of harmful tasks including disinformation and fraud, equipped with simulated tools), context padding of varying types and lengths is inserted before or after the task description to evaluate changes in Harm Score (degree of harmful task completion) and refusal rate.
Key Designs¶
-
Four Types of Context Padding:
- Random Padding: Tokens randomly sampled from the vocabulary—fully controllable and reproducible.
- Irrelevant Text: Coherent text drawn from literary works across 5 genres.
- Relevant Text: Wikipedia articles semantically related to the task category.
- Multi-Task Padding: Other task descriptions sampled from the validation set—semantically most disruptive.
- Design Motivation: To isolate the effects of different factors—randomness, textual coherence, semantic relevance, and task confusion.
-
Two Padding Positions:
- Before: Padding inserted between the system prompt and the task description (simulating historical context).
- After: Padding inserted after the task description (simulating user-appended information).
- Design Motivation: To test the recency bias of the attention mechanism—the farther the task description is from the execution region, the more severe the performance degradation.
-
Evaluation Metrics:
- Harm Score: Degree of harmful task completion based on hand-written rubrics (0–1), checking whether target tools are called and in the correct order.
- Refusal Rate: Proportion of refusal responses as judged by an LLM judge.
Models Tested¶
GPT-4.1-nano (1M), GPT-5 (400K), DeepSeek-V3.1 (128K), Grok 4 Fast (2M).
Key Experimental Results¶
Main Results: Effect of Random Padding on Performance and Refusal Rate¶
| Model | Claimed Window | Harm Score (No Padding) | Harm Score (200K) | Refusal Rate (No Padding) | Refusal Rate (200K) |
|---|---|---|---|---|---|
| GPT-4.1-nano | 1M | ~80% | ~25% | ~5% | ~40% ↑ |
| Grok 4 Fast | 2M | ~50% | ~0% | ~80% | ~10% ↓ |
| DeepSeek-V3.1 | 128K | Relatively robust | Slight decline | — | — |
| GPT-5 | 400K | High refusal | High refusal | ~90% | ~90% |
Ablation Study: Effect of Padding Type and Position¶
| Factor | Finding |
|---|---|
| Padding type ranking | Coherent text (irrelevant/relevant) > Random > Multi-task (worst) |
| Padding position | Before degrades more slowly than After (task description closer to execution is better) |
| Cross-model difference | Grok 4 Fast undergoes "mode collapse" between 50K–100K tokens, with all padding types converging to near-zero performance |
Key Findings¶
- Claimed capacity ≠ actual capability: Grok 4 Fast, with a 2M context window, completely collapses at 100K tokens—a serious "false promise" problem.
- Refusal rate changes are directionally unpredictable: GPT-4.1-nano's refusal rate rises from 5% to 40% (becoming more "cautious"), while Grok 4 Fast's drops from 80% to 10% (becoming more "compliant")—the same stimulus produces diametrically opposite safety behavior changes.
- Multi-task padding causes the strongest "semantic confusion" effect: This indicates that agents do not simply ignore long context but are influenced by its semantic content—which has direct security implications for prompt injection.
- After padding is more harmful than Before: The greater the distance between the task description and execution, the harder it is for the attention mechanism to maintain instruction following.
Highlights & Insights¶
- The finding that refusal rates change in opposite directions across models carries major security implications—it means that safety guarantees for long-context agents are fundamentally unreliable. An attacker could reduce the refusal rate of certain models (e.g., Grok 4 Fast) simply by padding the context to sufficient length, requiring no carefully engineered jailbreak prompt whatsoever.
- The controlled variable design is methodologically instructive: by systematically varying padding length × type × position, the three factors of context length, semantic content, and attention distance are decoupled, yielding clear causal analysis.
Limitations & Future Work¶
- Only four models are tested, with heterogeneous access methods (some via OpenRouter); provider-level safety filters may introduce noise.
- Only the simplest task subset (with hints and detailed instructions) is used, representing an upper bound on performance—in real deployments where agents must plan autonomously, performance should be worse.
- The LLM judge used to evaluate refusal rates may itself be unreliable under long-context conditions.
- Insufficient analysis of "deferred refusals"—cases where the model begins execution (potentially causing harm) before ultimately refusing.
- No systematic analysis of whether safety degradation differs across harmful task types (e.g., fraud vs. disinformation vs. cyberattacks).
- No exploration of potential defenses (e.g., context window limits, segmented safety checks).
- Reproducibility is affected by API provider updates—model behavior may change entirely after updates.
Related Work & Insights¶
- vs. AgentHarm (original): The original benchmark evaluates agent safety under normal context lengths; this paper reveals safety degradation along the critical long-context dimension, suggesting that standard safety evaluations may fail in long-context settings.
- vs. "Lost in the Middle": That work focuses on degradation in factual recall; this paper is the first to study degradation in safety refusal behavior—and refusal behavior degrades in a far more unpredictable manner than factual recall.
- vs. LongSafety: LongSafety evaluates the impact of long inputs (~5K tokens) on LLMs; this paper extends the scope to 200K-token agent scenarios and finds that the scale of behavioral degradation far exceeds prior expectations.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First systematic study of how long context affects agents' safety refusal behavior; the finding that refusal rate changes are directionally unpredictable constitutes a significant warning to the security community.
- Experimental Thoroughness: ⭐⭐⭐ — The controlled variable design is carefully constructed, but only 4 models are tested, and partial reliance on API access may introduce noise.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure and effective visualizations.
- Value: ⭐⭐⭐⭐⭐ — Exposes critical safety blind spots in long-context agent systems and significantly advances the methodology of agent safety evaluation.
Additional Notes¶
- The methodology and experimental design of this work offer reference value for related research areas.
- Future work could validate the generalizability and scalability of the approach across broader scenarios and at larger scale.
- Potential research value exists in combining this work with recent related directions (e.g., intersections with RL, MCTS, and multimodal methods).
- The feasibility of deployment and computational efficiency should be evaluated against practical application requirements.
- The choice of datasets and evaluation metrics may affect the generalizability of conclusions; cross-validation on additional benchmarks is recommended.
Additional Notes¶
- The methodology and experimental design of this work offer reference value for related research areas.
- Future work could validate the generalizability and scalability of the approach across broader scenarios and at larger scale.
- Potential research value exists in combining this work with recent related directions (e.g., intersections with RL, MCTS, and multimodal methods).