DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

Conference: NeurIPS 2025 arXiv: 2506.00739 Code: https://github.com/microsoft/DefenderBench Area: LLM Agent Keywords: Cybersecurity, LLM Agent, Benchmark, Vulnerability Detection, Network Intrusion Simulation

TL;DR

This paper presents DefenderBench, an open-source, modular toolkit for systematically evaluating LLM agents on cybersecurity tasks spanning three categories (offensive, defensive, and knowledge understanding) and five task types: network intrusion simulation, malicious content detection, CTI knowledge QA, code vulnerability detection, and code vulnerability repair. Benchmark results show that Claude-3.7-sonnet achieves the best overall performance (81.65 points).

Background & Motivation

Background: LLM agents have demonstrated strong capabilities in software development, document translation, and fact-checking, yet their evaluation in the cybersecurity domain remains insufficient. Existing security benchmarks (Cybench for CTF, CyberMetric for knowledge QA, CyberSecEval for code vulnerabilities) each focus on a single task type.

Limitations of Prior Work:

  • Lack of a unified, comprehensive evaluation platform covering offensive, defensive, and knowledge-understanding tasks
  • Different works employ different evaluation frameworks, making fair cross-model comparisons difficult
  • Most existing benchmarks are costly and hard to reproduce

Key Insight: Construct a practical, open-source, modular one-stop evaluation toolkit that enables researchers to fairly assess LLM agents on cybersecurity tasks at low cost.

Method

Overall Architecture

DefenderBench consists of three major modules:

  1. Data Preprocessing Module: automatically downloads, cleans, and splits datasets, caching them locally
  2. Task Environment Module: constructs an interactive environment for each task (providing instructions, defining action spaces, managing conversation history)
  3. Agent Interface Module: a unified LLM agent interface supporting plug-and-play integration of both open-source and closed-source models
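
To make the module split concrete, here is a minimal sketch of how the three pieces could fit together. The class and method names (Observation, TaskEnvironment, LLMAgent, evaluate) are illustrative stand-ins, not the actual DefenderBench API.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    text: str           # task instruction or environment feedback
    done: bool = False  # whether the episode has terminated
    score: float = 0.0  # task-specific metric accumulated so far

class TaskEnvironment:
    """Wraps one task: serves instructions, validates actions, tracks history."""
    def reset(self) -> Observation: ...
    def step(self, action: str) -> Observation: ...

class LLMAgent:
    """Unified interface: any open- or closed-source chat model can plug in."""
    def __init__(self, model_name: str):
        self.model_name = model_name

    def act(self, observation: Observation) -> str:
        # In the real toolkit this would call the underlying chat model with
        # the full trajectory (instructions plus past actions and observations).
        raise NotImplementedError

def evaluate(agent: LLMAgent, env: TaskEnvironment, max_steps: int) -> float:
    obs = env.reset()
    for _ in range(max_steps):
        obs = env.step(agent.act(obs))
        if obs.done:
            break
    return obs.score
```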

Key Designs

Five Cybersecurity Task Categories:

  1. Network Intrusion Simulation (CyberBattleSim)

    • Built on the CyberBattleSim simulator, converted into a text-based interactive game
    • Agents can execute three operations: local_vulnerability (local exploit), remote_vulnerability (remote attack), and connect (credential-based connection)
    • Two network topologies: Chain (simpler) and CTF (more complex)
    • Metric: node takeover rate (winning rate); a minimal interaction sketch follows this task list
  2. Malicious Content Detection

    • Malicious-Text: phishing email/SMS detection (20,137 samples, 500 test)
    • Malicious-Web: phishing webpage detection (15,612 samples, 500 test)
    • Metric: Macro-F1
  3. CTI Knowledge QA (MCQA)

    • Based on the CTI-MCQA dataset; 2,338 four-choice questions on cyber threat intelligence
    • 500 test samples + 20 few-shot sample pool
    • Metric: Macro-F1
  4. Code Vulnerability Detection

    • Vulnerable-CG: C-language function vulnerability detection based on CodeXGLUE
    • Vulnerable-DV: vulnerability detection based on Devign (FFmpeg + Qemu)
    • Metric: Macro-F1
  5. Code Vulnerability Repair (CVEFix)

    • 240 single-method vulnerability repair samples covering C/C++/Go/Java/JS/PHP/Python/Rust
    • Given vulnerable code, the agent is required to generate a repaired version
    • Metric: CodeBLEU
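
As referenced in the network intrusion task above, here is a hedged sketch of how the text-based CyberBattleSim adaptation could expose its three operations as parseable commands and score an episode. The command grammar and the parse_action and winning_rate helpers are illustrative assumptions, not the toolkit's real parser.

```python
import re

# Hypothetical text command grammar for the three operations exposed to the agent.
ACTION_PATTERNS = {
    "local_vulnerability":  re.compile(r"local_vulnerability\((\w+),\s*(\w+)\)"),
    "remote_vulnerability": re.compile(r"remote_vulnerability\((\w+),\s*(\w+),\s*(\w+)\)"),
    "connect":              re.compile(r"connect\((\w+),\s*(\w+),\s*(\w+),\s*(\w+)\)"),
}

def parse_action(reply: str):
    """Map the agent's free-form reply onto one of the three operations."""
    for name, pattern in ACTION_PATTERNS.items():
        match = pattern.search(reply)
        if match:
            return name, match.groups()
    return None  # invalid action; the environment would re-prompt the agent

def winning_rate(owned_nodes: set, all_nodes: set) -> float:
    """Node takeover rate: fraction of the network the agent controls at episode end."""
    return len(owned_nodes) / len(all_nodes)
```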

Global Metric: DefenderBench Score = unweighted average of all task metrics
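
A quick illustration of the global metric: the unweighted mean of the per-task scores, each already on a 0-100 scale. The numbers below reuse the Claude-3.7-sonnet row from the main results table; the function name defenderbench_score is illustrative.

```python
from statistics import mean

def defenderbench_score(task_scores: dict) -> float:
    """Unweighted average over all task metrics (each on a 0-100 scale)."""
    return mean(task_scores.values())

# Claude-3.7-sonnet row from the main results table
scores = {
    "CBS-Chain": 100.0, "CBS-CTF": 100.0, "Mal.Text": 96.2, "Mal.Web": 90.0,
    "MCQA": 74.2, "Vuln-CG": 56.6, "Vuln-DV": 56.0, "CVEfix": 80.2,
}
print(round(defenderbench_score(scores), 2))  # 81.65
```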

Agent Baseline Design

A minimalist scaffolding baseline agent is adopted:

  • Provides task instruction and response format requirements
  • Supplies complete trajectory history (prior actions + observations) at each step
  • Agent generates one action → sends to environment → receives observation → determines termination
  • Maximum 5 steps for detection/QA tasks; maximum 100 steps for network intrusion tasks
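
A minimal sketch of this scaffolding loop, assuming a hypothetical environment interface (reset/step) and a call_llm stand-in for whatever chat client is plugged in; the prompt layout is illustrative, not the paper's exact template.

```python
MAX_STEPS = {"detection_or_qa": 5, "network_intrusion": 100}

def run_baseline_agent(env, call_llm, task_type: str) -> float:
    instruction = env.reset()       # task description + response format requirements
    trajectory = []                 # full history of (action, observation) pairs
    score = 0.0
    for _ in range(MAX_STEPS[task_type]):
        # Rebuild the prompt from the complete trajectory at every step.
        prompt = instruction
        for action, observation in trajectory:
            prompt += f"\n> {action}\n{observation}"
        action = call_llm(prompt)   # the agent proposes exactly one action
        observation, done, score = env.step(action)
        trajectory.append((action, observation))
        if done:
            break
    return score
```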

Loss & Training

This paper presents an evaluation benchmark rather than a training method; no loss function design is involved. All LLMs are evaluated directly without fine-tuning.

Key Experimental Results

Main Results

| Model | CBS-Chain | CBS-CTF | Mal.Text | Mal.Web | MCQA | Vuln-CG | Vuln-DV | CVEfix | DefB |
|---|---|---|---|---|---|---|---|---|---|
| Naive Baseline | 19.4 | 22.2 | 52.4 | 50.4 | 25.0 | 50.0 | 47.8 | 83.2 | 43.8 |
| Llama 3.3 70B | 100.0 | 33.3 | 96.0 | 82.8 | 69.6 | 58.0 | 57.4 | 77.3 | 71.8 |
| GPT-4-turbo | 90.0 | 46.7 | 93.4 | 83.2 | 73.8 | 58.2 | 57.6 | 73.7 | 72.1 |
| Claude-3.5-sonnet | 100.0 | 56.7 | 93.8 | 88.2 | 72.4 | 56.4 | 56.8 | 75.7 | 75.0 |
| Claude-3.7-sonnet | 100.0 | 100.0 | 96.2 | 90.0 | 74.2 | 56.6 | 56.0 | 80.2 | 81.7 |
| Claude-3.7-sonnet-think | 100.0 | 76.7 | 94.4 | 91.0 | 78.2 | 54.6 | 52.8 | 79.5 | 78.4 |
| o3 | 83.3 | 20.0 | 92.4 | 88.0 | 76.4 | 30.8 | 59.6 | 55.6 | 63.9 |

Ablation Study

Model Scale Effect:

  • Llama 3.1 8B → 70B: DefB 54.7 → 68.7 (+14.0)
  • Llama 3.2 1B → 3B: DefB 38.3 → 50.2 (+11.8)
  • GPT-4.1 → 4.1-mini → 4.1-nano: 63.9 → 58.9 → 47.5 (larger scale consistently better)

Few-Shot Augmentation:

  • Most large models benefit significantly from few-shot ICL
  • Smaller models (Llama 3.2 1B/3B, Phi-3.5-mini) suffer performance degradation due to the longer input context
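
As a rough illustration of the few-shot setup, the sketch below shows how k exemplars could be drawn from the 20-sample pool and prepended to a CTI-MCQA query; build_few_shot_prompt and the prompt layout are assumptions, not the toolkit's actual prompt.

```python
import random

def build_few_shot_prompt(pool, query, k=3, seed=0):
    """Prepend k answered exemplars from the few-shot pool to the unanswered query."""
    shots = random.Random(seed).sample(pool, k)
    parts = []
    for ex in shots + [query]:
        parts.append(f"Question: {ex['question']}")
        for label, choice in zip("ABCD", ex["choices"]):
            parts.append(f"{label}. {choice}")
        parts.append(f"Answer: {ex.get('answer', '')}".rstrip())
    return "\n".join(parts)
```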

CoT Effect:

  • CoT is most effective for interactive tasks (network intrusion): GPT-4o gains +17.0 points
  • CoT has limited effect on static tasks; some models show marginal performance drops

Key Findings

  • Claude-3.7-sonnet is the strongest overall model (81.65), achieving 100% winning rate on both network intrusion environments
  • Reasoning-augmented models (o1/o3/o4-mini) do not outperform base models—reasoning capability alone is not the key factor for security tasks
  • Vulnerability detection remains the hardest task—most models only marginally outperform random baselines, revealing LLM limitations in fine-grained program understanding
  • Small models perform extremely poorly on long-input scenarios (e.g., HTML webpage detection)—Llama 3.2 1B even falls below the random baseline
  • CodeBLEU may be an inadequate metric for vulnerability repair evaluation—a copy-paste baseline achieves the highest score

Highlights & Insights

  • Comprehensiveness: Currently the most complete LLM cybersecurity evaluation toolkit, covering offensive, defensive, and knowledge dimensions across five task types
  • Modular Design: Users can easily integrate their own LLMs, agents, and new tasks; supports Weights & Biases visualization
  • Fair Comparison: A unified agent framework and standardized data processing eliminate evaluation bias across different works
  • Practical Insights: Reveals unexpected weaknesses of reasoning models on security tasks and the critical influence of model scale on security capabilities
  • Cost-Friendly: Test set sizes are deliberately controlled (500 samples), making evaluation affordable for small and medium-sized research teams

Limitations & Future Work

  • Overly Simple Agent Design: Only a minimalist scaffolding baseline agent is used; more complex tool-augmented agents (e.g., integrating static analysis tools) are not explored
  • Inadequate CVEFix Metric: CodeBLEU fails to accurately reflect the quality of small-scope code modifications; better evaluation metrics are needed
  • Expandable Task Coverage: Important security scenarios such as social engineering, forensic analysis, and log analysis are not included
  • Limited Network Intrusion Environment: CyberBattleSim's topology is relatively simplified and diverges considerably from real-world network environments
  • Security Risks of Agents Not Addressed: As a dual-use technology, the paper does not thoroughly discuss countermeasures against the misuse of LLM agents

Comparison with Related Benchmarks

  • vs AgentBench/SWE-bench: These general-purpose agent benchmarks do not cover the security domain; DefenderBench fills this gap
  • vs Cybench: Cybench focuses solely on CTF, whereas DefenderBench has broader coverage (offensive + defensive + knowledge)
  • vs CyberSecEval: CyberSecEval focuses on code security; DefenderBench additionally incorporates network intrusion and malicious content detection
  • Insights: Future work could combine DefenderBench with red-teaming frameworks to evaluate the robustness of LLM agents in adversarial settings

Rating

  • Novelty: ⭐⭐⭐ Engineering contribution outweighs methodological innovation, yet fills an important evaluation gap
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers 17+ models, 5 task types, and multiple augmentation strategies with thorough comparisons
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with detailed task descriptions
  • Value: ⭐⭐⭐⭐ Provides important reference for evaluating LLM security capabilities; the open-source toolkit has strong practical utility