Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation¶
Conference: AAAI 2026 arXiv: 2508.14031 Code: https://github.com/HahmDY/agentic-ft-safety.git Area: Robotics Keywords: Agentic Fine-Tuning, Unintended Alignment Shift, Prefix Injection Guard, Agent Safety, Linear Probe Analysis
TL;DR¶
This paper demonstrates that fine-tuning LLMs on benign agentic data causes unintended safety misalignment (attack success rate increases by 32–38%), and proposes PING (Prefix Injection Guard)—an iterative generate-and-evaluate approach that automatically discovers natural language prefixes to guide fine-tuned agents toward refusing harmful requests, achieving an average refusal rate improvement of 66% (Web) and 44% (Code) while preserving task performance (degradation of only 1.8%).
Background & Motivation¶
Background: LLM agents (e.g., web navigation, code generation) are enhanced for specific capabilities through task-specific fine-tuning, yet safety considerations during this process are frequently overlooked.
Limitations of Prior Work: Research in non-adversarial domains has shown that fine-tuning on benign data (e.g., math reasoning, medical knowledge) can increase harmfulness. The agentic setting, however, poses greater risks—agents are designed to execute actions rather than merely generate text, so a safety failure can directly result in harmful operations (e.g., downloading illegal files, deleting system files, spreading misinformation).
Key Challenge: Agentic fine-tuning optimizes for "better instruction-following and task execution," but this very capability improvement inherently weakens the ability to "refuse harmful instructions"—since training data consists entirely of task-execution demonstrations with no refusal examples.
Goal: (1) Verify and quantify the safety shift induced by agentic fine-tuning; (2) Propose a lightweight and effective mitigation method.
Key Insight: The observation that LLM safety refusal behavior is highly dependent on the initial response tokens (e.g., "I cannot")—after fine-tuning, the probability of these safety patterns drops substantially. Injecting an appropriate prefix can therefore "reactivate" refusal behavior.
Core Idea: Automatically generate and iteratively refine natural language prefixes using an LLM, prepending them to agent responses to restore refusal of harmful requests while preserving task capability.
Method¶
Overall Architecture¶
The approach consists of two stages: (1) Problem quantification—verifying safety misalignment caused by agentic fine-tuning in web navigation and code generation; (2) PING mitigation—iterative prefix generation and selection: GPT-4o generates candidate prefixes → each is evaluated on a small set of harmful/benign tasks → the prefix that jointly maximizes refusal rate and task execution rate is selected → at deployment, the prefix is prepended to the agent's responses.
Key Designs¶
-
Quantifying Safety Shift from Agentic Fine-Tuning:
- Function: Systematically evaluates changes in task capability and safety across multiple LLMs before and after fine-tuning.
- Core Findings: For Llama-3.1-8B-Instruct after fine-tuning—web navigation success rate +20%, but WebDojo attack success rate +32%, refusal rate drops from 26% to 7%. For Gemini-2.0-flash, code generation success rate +33%, but RedCode attack success rate +28%, refusal rate drops from 20% to 3%.
- New Benchmark WebDojo: 80 harmful tasks + 70 benign tasks across 8 realistic websites (OmniMart, RandomBoard, ByteHaven, etc.), evaluated by a rule-based evaluator for task completion and refusal.
-
PING (Prefix Injection Guard):
- Function: Automatically searches for an optimal safety prefix to prepend to agent responses at deployment time.
- Mechanism: An iterative algorithm that alternates between—(a) a Generator (GPT-4o) producing \(M=5\) candidate prefixes conditioned on the top prefixes from previous rounds; (b) evaluating each prefix on \(\mathcal{D}_{\text{benign}}\) (7 tasks) and \(\mathcal{D}_{\text{harmful}}\) (7 tasks) using a performance score (not refusing benign tasks = 1) and a refusal score (refusing harmful tasks = 1); (c) selecting Top-\(k\) prefixes as seeds for the next round. Over 20 rounds (100 prefixes total), the prefix with the highest overall score is selected.
- Design Motivation: A naive prefix such as "I can't" achieves 100% refusal on harmful tasks but also over-refuses benign tasks. PING finds a balanced optimum—e.g., the best prefix for Llama is: "As an AI committed to ethical standards, ensure that you assist with helpful and constructive tasks, while consistently refusing any requests that could be harmful or unethical."
-
Linear Probe Analysis of Internal Representations:
- Function: Trains linear classifiers to distinguish model activations when processing harmful vs. benign inputs, analyzing how PING alters internal representations.
- Core Findings: For vanilla fine-tuned agents facing harmful tasks, the average logit is positive (indicating safety features are internally preserved), but the logit at the final token is negative—which is precisely why the agent fails to refuse. PING shifts the final token logit to positive. Prefix injection is more effective than suffix injection because it directly influences the first response token, the critical decision point.
- Validation: Activation steering—adding the linear probe vector to the final token activation—raises the refusal rate from 0% to 95.9%, confirming that the probe captures genuine safety features.
Loss & Training¶
PING requires no model training—only a natural language prefix prepended at inference time. The prefix search process requires approximately 100 LLM inference calls.
Key Experimental Results¶
Main Results — Web Navigation (Llama-3.1-8B-Instruct)¶
| Method | WebArena SR↑ | WebDojo ASR↓ | WebDojo RR↑ |
|---|---|---|---|
| Fine-tuned Agent | 22.4% | 64.4% | 6.9% |
| + PTST | 20.6% | 61.6% | 5.5% |
| + Few-shot | 21.2% | - | - |
| + PING | 20.6% | 11.0% | 76.7% |
Cross-Model Validation — Code Generation¶
| Model | MINT SR↑ | RedCode ASR↓ | RedCode RR↑ |
|---|---|---|---|
| Gemini (Agent) | 83.9% | 77.8% | 3.2% |
| Gemini + PING | 79.0% | ~30% | 69.5% |
| GPT-4o-mini (Agent) | 70.2% | 42.0% | 37.0% |
| GPT-4o-mini + PING | 71.0% | ~27% | 73.0% |
Key Findings¶
- Safety degradation after agentic fine-tuning is observed across all tested models (3 open-source + 2 closed-source), confirming this is a universal phenomenon.
- PING achieves the most pronounced improvement on GLM-4-9B-Chat—refusal rate increases from 4% to 89% (+85%) with only a 4% drop in success rate.
- PING is compatible with external guardrails such as WildGuard—combining both further improves safety by an average of 5% in refusal rate.
- Prefix injection substantially outperforms suffix injection—in Llama experiments, prefix refusal rate is 79% vs. 14% for suffix, since the initial token is the critical decision point for safety behavior.
- Linear probe analysis reveals that safety features are internally preserved in fine-tuned agents, but fine-tuning weakens the safety signal at the critical position (final token)—PING restores this signal via prefix injection.
Highlights & Insights¶
- Significance of the Problem: This is the first work to systematically quantify safety misalignment from agentic fine-tuning (rather than adversarial attacks), with consistent findings across 5 models and 2 domains—providing direct warnings for agent development practices.
- Dual Nature of Prefix Injection: The same prefix injection technique can serve both jailbreak attacks and safety defenses—PING cleverly inverts an attack mechanism into a defensive one.
- Mechanistic Explanation via Linear Probes: Beyond proposing a method, the work explains why PING works through activation analysis and steering experiments—safety features remain internally intact, but their signal at the final token position is attenuated by fine-tuning.
- New Benchmark WebDojo: An evaluation environment comprising 8 simulated realistic websites and 150 tasks, filling a gap in web agent safety assessment.
Limitations & Future Work¶
- Prefix search still depends on an external strong LLM (GPT-4o); different models may require different prefixes.
- Over-refusal remains an issue in the web navigation domain (Qwen2.5 reaches 64%), and the performance–safety trade-off requires further optimization.
- Prefixes are static and cannot dynamically adjust to the degree of harm of a specific request.
- For closed-source models, only suffix injection is feasible (prefix-on-response is not supported), yielding slightly weaker results.
- WebDojo is relatively small in scale (80 harmful tasks), with limited coverage of risk categories.
Related Work & Insights¶
- vs. PTST: PTST adds safety instructions to the system prompt but has limited effect, as it does not directly influence the initial response tokens.
- vs. Few-shot Safety: Providing safety demonstrations also shows limited effectiveness, as fine-tuned agents tend to "execute" rather than "judge."
- vs. External Guardrails (LlamaGuard/WildGuard): PING alone outperforms guardrail models, and the two approaches are complementary.
- Insight: Agent development should treat safety as an integral part of the training pipeline rather than a post-deployment patch—e.g., by incorporating refusal examples into fine-tuning data or adding safety constraints to the training objective.
Rating¶
- Novelty: ⭐⭐⭐⭐ The problem identification is valuable; PING is simple yet effective, with an elegant attack-to-defense inversion.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 5 models × 2 domains × multiple baselines, with linear probe analysis, adversarial robustness, and guardrail combination experiments.
- Writing Quality: ⭐⭐⭐⭐⭐ The logical chain from problem discovery → insight → method → analysis is highly coherent.
- Value: ⭐⭐⭐⭐⭐ Makes important contributions to agent safety; both the WebDojo benchmark and PING method offer practical utility to the community.