CoV-Eval: Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective¶

Conference: ACL 2025
arXiv: 2505.10494
Code: https://github.com/MurrayTom/CoV-Eval
Area: LLM Evaluation
Keywords: Code Security, Vulnerability Evaluation, LLM Code Generation, Automated Evaluation, Multi-Task Benchmark

TL;DR¶

Proposes CoV-Eval, the first multi-task code vulnerability evaluation benchmark covering code completion, vulnerability repair, vulnerability detection, and vulnerability classification. It develops the VC-Judge vulnerability judgment model to replace traditional static analysis tools. A comprehensive evaluation of 20 LLMs reveals that although most LLMs can detect vulnerable code, they still tend to generate unsafe code and possess limited vulnerability repair capabilities.

Background & Motivation¶

Background: LLM-driven code assistants (such as GitHub Copilot) are widely deployed. Existing evaluation datasets (such as HumanEval and MBPP) primarily measure functional correctness (whether test cases pass) and largely ignore code security. Code generated by GPT-4o may implement the required features yet still contain security vulnerabilities such as information leakage and memory overflows.

Limitations of Prior Work: Code security evaluation datasets (CWE-scenario, SecurityEval, CyberSecEval) each focus on a single task (such as code completion), so they cannot assess secure code generation, vulnerability repair, and vulnerability identification together, or show how these capabilities correlate. Automated evaluation tools are also unreliable: traditional static analysis tools (such as CodeQL and Bandit) are constrained by handcrafted rules, leading to high false negative rates, while LLMs used as evaluators produce fewer false negatives but more false positives than human experts.

Key Challenge: Single-task evaluations fail to capture the complete picture of LLMs' code security capabilities—a model capable of detecting vulnerabilities may not necessarily generate secure code, and a model with a high secure code rate might be weak at vulnerability repair. Meanwhile, a reliable automated vulnerability evaluation method is lacking.

Goal: (1) Construct a multi-task evaluation benchmark covering multiple dimensions of code security; (2) Develop an automated vulnerability judgment model highly aligned with human experts.

Key Insight: Start from CWE (Common Weakness Enumeration), cover 18 vulnerability types, design four complementary evaluation tasks, and train a specialized vulnerability judgment model via instruction tuning to make automated evaluation reliable.

Core Idea: Use a multi-task evaluation framework to comprehensively characterize the multi-dimensional performance of LLM code security capabilities, and replace unreliable static analysis tools and general LLM evaluators with the instruction-tuned VC-Judge model.

Method¶

Overall Architecture¶

CoV-Eval consists of two parts: (1) a multi-task evaluation dataset, built by constructing test sets for four tasks from a GitHub-CWE seed set and synthesizing more complex code scenarios with the Vul-Evol framework; (2) automated evaluation, in which VC-Judge replaces traditional tools for security auditing of LLM-generated code. The benchmark is used to evaluate 4 closed-source and 16 open-source LLMs, reporting metrics that include security rate (\(SR@1\)), F1 score, and recall.

Key Designs¶

Four-Task Evaluation System:
- Function: Comprehensively evaluates LLM code security capabilities from different perspectives.
- Mechanism: (a) Code Completion—given an incomplete program with comments, LLMs complete the code, evaluating the security of the generated code; (b) Vulnerability Repair—given vulnerable code and a description of the vulnerability type, LLMs repair the vulnerability; (c) Vulnerability Detection—given complete code, determine whether a vulnerability exists; (d) Vulnerability Classification—not only detect vulnerabilities but also identify the specific vulnerability type (CWE ID).
- Design Motivation: The complementarity of the four tasks better simulates real-world software development challenges: being able to detect vulnerabilities does not guarantee writing secure code, and identifying vulnerability types is a prerequisite for repair.
Vul-Evol Code Scenario Synthesis Framework:
- Function: Generates more complex code scenarios for stress testing.
- Mechanism: Based on instruction evolution, GPT-4o is used to perform four types of complexity enhancements on 54 code scenarios from the seed set—adding new constraints, replacing common requirements with rare ones, increasing reasoning steps, and raising time/space complexity requirements. After manual quality filtering (removing scenarios that already contain secure features), 270 new scenarios are obtained.
- Design Motivation: Seed set scenarios are relatively simple, whereas real-world programming environments are more complex. In addition, 40% of the synthesized scenarios were found to already contain secure features (attributed to GPT-4o's high safety standards), so manual filtering is needed to keep the assessment fair.
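The evolution step described above can be pictured as a small prompt-building loop followed by a filter. The sketch below is illustrative only: the instruction wording, the `EvolutionRequest` type, and the helper names `build_evolution_requests` and `drop_already_secure` are assumptions, not the prompts or code used by the authors, and the filter predicate stands in for the manual review.

```python
from dataclasses import dataclass
from typing import Callable

# The four complexity enhancements named in the text. The instruction
# wording is hypothetical, not the prompts used with GPT-4o.
EVOLUTION_OPS = {
    "add_constraints": "Add one new constraint to the scenario.",
    "rare_requirements": "Replace a common requirement with a rarer one.",
    "more_reasoning": "Rewrite the scenario so it needs more reasoning steps.",
    "higher_complexity": "Raise the time or space complexity requirements.",
}


@dataclass(frozen=True)
class EvolutionRequest:
    seed_id: str
    operation: str
    prompt: str


def build_evolution_requests(seeds: dict[str, str]) -> list[EvolutionRequest]:
    """Pair every seed scenario with every evolution operation."""
    return [
        EvolutionRequest(seed_id, op, f"{instruction}\n\nScenario:\n{scenario}")
        for seed_id, scenario in seeds.items()
        for op, instruction in EVOLUTION_OPS.items()
    ]


def drop_already_secure(
    scenarios: list[str], has_secure_features: Callable[[str], bool]
) -> list[str]:
    """Keep only scenarios that do not already contain secure features.

    `has_secure_features` is a stand-in for the manual quality review.
    """
    return [s for s in scenarios if not has_secure_features(s)]
```

Each request would be sent to the synthesis model, and the returned scenarios passed through the filter before being added to the test set.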
VC-Judge Vulnerability Judgment Model:
- Function: Replaces traditional static analysis tools to automatically evaluate the safety of LLM-generated code with higher reliability.
- Mechanism: Instruction-tuned from LLAMA3-8B-Instruct, with training data from three sources: 216 human-annotated programs from CoV-Eval code completion tests, 531 programs from the vulnerability detection test set, and the BigVul open-source vulnerability dataset. Three styles of prompt templates (vulnerability judgment, classification, and repair) are designed. At evaluation time it uses the judgment-style template rather than multi-class or binary detection, so the known vulnerability type of each scenario can be supplied to make the verdict more reliable.
- Design Motivation: Traditional static analysis exhibits high false negative rates (limited by handcrafted rules), while general LLMs suffer from high false positive rates (lacking vulnerability domain expertise). The specifically fine-tuned VC-Judge achieves the highest alignment with human experts (78.24%) and the smallest safety rate discrepancy (1.39%).
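The two agreement figures quoted above can be read as follows. This is a minimal sketch under stated assumptions: "alignment" is taken to be the share of programs on which the judge and the human annotator give the same secure/vulnerable verdict, and "safety rate discrepancy" is taken to be the absolute gap between the security rates each of them reports. The function names and the toy labels are made up for illustration.

```python
def security_rate(verdicts: list[bool]) -> float:
    """Percentage of programs judged secure (True = secure)."""
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    return 100.0 * sum(verdicts) / len(verdicts)


def alignment(judge: list[bool], human: list[bool]) -> float:
    """Percentage of programs where judge and human verdicts match."""
    if len(judge) != len(human) or not judge:
        raise ValueError("verdict lists must be non-empty and equally long")
    matches = sum(j == h for j, h in zip(judge, human))
    return 100.0 * matches / len(judge)


def rate_discrepancy(judge: list[bool], human: list[bool]) -> float:
    """Absolute gap between the security rates from judge and human."""
    return abs(security_rate(judge) - security_rate(human))


# Toy labels: the judge disagrees on two programs, but the errors cancel.
human_labels = [True, True, False, False]
judge_labels = [True, False, True, False]
print(alignment(judge_labels, human_labels))
print(rate_discrepancy(judge_labels, human_labels))
```

The toy case shows why both numbers are worth reporting: a judge can reproduce the aggregate security rate while still disagreeing with humans on individual programs, so a small discrepancy alone does not imply high alignment.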

Loss & Training¶

VC-Judge is trained using standard supervised fine-tuning (SFT). Evaluation metrics: Security Rate \(SR@1\) is used for code completion and vulnerability repair, while weighted F1, recall, and accuracy are used for vulnerability detection and classification.
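For the detection and classification metrics, the sketch below assumes the common convention in which weighted F1 is the per-class F1 averaged with class-support weights; the summary does not spell out the exact formula, and the label names and helper functions are illustrative.

```python
from collections import Counter


def class_f1(y_true: list[str], y_pred: list[str], label: str) -> float:
    """F1 for one class, treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    return sum(
        class_f1(y_true, y_pred, label) * count / len(y_true)
        for label, count in support.items()
    )


def recall_for(y_true: list[str], y_pred: list[str], positive: str) -> float:
    """Share of truly positive samples that were predicted positive."""
    preds_on_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_on_positives:
        raise ValueError("no samples with the positive label")
    return sum(p == positive for p in preds_on_positives) / len(preds_on_positives)


# A detector that flags everything as vulnerable reaches perfect recall,
# while its weighted F1 is pulled down by the missed "safe" class.
truth = ["vuln", "vuln", "vuln", "safe"]
always_vuln = ["vuln"] * 4
print(recall_for(truth, always_vuln, "vuln"))
print(weighted_f1(truth, always_vuln))
```

This is one reason recall is reported alongside F1: a recall of 100% on its own does not distinguish a careful detector from one that flags every program.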

Key Experimental Results¶

Main Results¶

Model	Code Completion \(SR@1\)	Vulnerability Repair \(SR@1\)	Vulnerability Detection F1	Vulnerability Classification F1	Overall Security Score
Claude-3	74.07	66.25	92.42	45.00	69.43
GPT-4o	72.84	63.94	94.62	36.05	66.86
LLAMA3.1-8B	75.92	58.70	92.89	26.45	63.49
DeepSeek-Coder-V2	75.31	51.57	90.63	35.50	63.25
CodeLLAMA-7B	68.21	39.62	93.57	11.47	53.22
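The Overall Security Score column appears to be the unweighted mean of the four task metrics. This is inferred from the numbers in the table rather than stated in this summary, so the check below should be read as a consistency test, not as the authors' definition.

```python
# Rows copied from the table above:
# (completion SR@1, repair SR@1, detection F1, classification F1, reported overall)
ROWS = {
    "Claude-3": (74.07, 66.25, 92.42, 45.00, 69.43),
    "GPT-4o": (72.84, 63.94, 94.62, 36.05, 66.86),
    "LLAMA3.1-8B": (75.92, 58.70, 92.89, 26.45, 63.49),
    "DeepSeek-Coder-V2": (75.31, 51.57, 90.63, 35.50, 63.25),
    "CodeLLAMA-7B": (68.21, 39.62, 93.57, 11.47, 53.22),
}


def mean_score(task_scores: tuple[float, ...]) -> float:
    """Unweighted mean of the per-task scores (assumed aggregation)."""
    return sum(task_scores) / len(task_scores)


for model, (*tasks, reported) in ROWS.items():
    gap = abs(mean_score(tuple(tasks)) - reported)
    print(f"{model}: mean={mean_score(tuple(tasks)):.4f} reported={reported} gap={gap:.4f}")
```

Under this reading, a weak classification F1 weighs on the overall score as much as a weak completion score, which is why CodeLLAMA-7B ranks last here despite strong detection.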

Ablation Study: Impact of High-Quality Code Data¶

Fine-Tuning Data Configuration	Code Completion (seed)	Code Completion (vul-evol)	Vulnerability Repair	HumanEval
LLAMA2-7B Baseline	42.59	58.89	42.98	14.51
+Secure Code SFT	62.96	76.29	24.53	16.04
+Secure Code + Vulnerability Detection SFT	64.81	76.67	32.91	16.74
+General Code SFT (GC-IFT)	40.74	49.63	6.08	20.27
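The ablation is easier to read as changes relative to the LLAMA2-7B baseline. The snippet below only re-expresses the table values as differences in points; the dictionary keys are shorthand introduced here.

```python
# Values copied from the ablation table above.
BASELINE = {"completion_seed": 42.59, "completion_vul_evol": 58.89,
            "repair": 42.98, "humaneval": 14.51}

CONFIGS = {
    "secure_code_sft": {"completion_seed": 62.96, "completion_vul_evol": 76.29,
                        "repair": 24.53, "humaneval": 16.04},
    "secure_code_plus_detection_sft": {"completion_seed": 64.81, "completion_vul_evol": 76.67,
                                       "repair": 32.91, "humaneval": 16.74},
    "general_code_sft": {"completion_seed": 40.74, "completion_vul_evol": 49.63,
                         "repair": 6.08, "humaneval": 20.27},
}


def deltas(config: dict[str, float]) -> dict[str, float]:
    """Difference from the baseline, in points, rounded to two decimals."""
    return {key: round(config[key] - BASELINE[key], 2) for key in BASELINE}


for name, config in CONFIGS.items():
    print(name, deltas(config))
```

Read this way, the secure-code configurations raise both code completion columns and HumanEval, while the vulnerability repair column falls below the baseline in every fine-tuned configuration, most sharply with general code SFT.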

Key Findings¶

Mismatch between detection ability and secure code generation ability: Qwen1.5-14B and ChatGLM3-6B achieve a vulnerability detection recall rate of 100%, yet their code completion security rates are only 69% and 74%—knowing what a vulnerability is does not equate to being able to avoid generating one.
Most common generated vulnerability types: CWE-78 (OS command injection), CWE-434 (unrestricted file upload), and CWE-190 (integer overflow) are high-probability vulnerabilities across almost all LLMs.
Vulnerabilities LLMs successfully avoid: CWE-125 (out-of-bounds read), CWE-89 (SQL injection), CWE-732 (incorrect permission assignment), and CWE-416 (use-after-free), which involve data integrity and memory safety.
Code-specialized fine-tuning improves security: CodeLLAMA-7B shows a 12 percentage point increase (\(56\% \to 68\%\)) in code completion security rate compared to LLAMA2-7B.
High-quality secure code data is key: Fine-tuning with security-audited code data can simultaneously improve security and functional correctness, whereas unvetted code data may compromise security.

Highlights & Insights¶

Multi-task design reveals correlations between capabilities: Vulnerability classification ability correlates positively with code safety—models worse at classification tend to generate more vulnerabilities, suggesting that infusing vulnerability knowledge could be an effective path to enhance code security.
Insights from Vul-Evol quality filtering: 40% of scenarios synthesized by GPT-4o already contained secure features, indicating that the inherent security preferences of strong models during data synthesis could bias evaluation fairness.
Self-repair experiment: Mistral-7B achieves the best performance in self-detection + self-repair (\(SR@1\) 63.74%), outperforming even the stronger closed-source models.

Limitations & Future Work¶

The scale of the dataset is limited; since it is expanded based on 54 seed scenarios, diversity is constrained by the seed set.
Although VC-Judge achieves the highest alignment with humans, it still falls short of human experts and exhibits false negatives.
Security and functional correctness (usability) evaluations use different datasets, lacking a unified testing framework.
Future directions: more diverse code scenarios and vulnerability types, unified security + usability testing, and exploring optimal data proportions and training methodologies.

vs. CWE-scenario/SecurityEval/CyberSecEval: These datasets are limited to a single code completion task, whereas CoV-Eval evaluates four tasks and so gives a more comprehensive view.
vs. VulBench/VulDetectBench: These works evaluate LLM vulnerability detection capabilities but do not analyze how detection correlates with the security of generated code. CoV-Eval's multi-task design reveals the relationship between detection and generation.

Rating¶

Novelty: ⭐⭐⭐⭐ The multi-task code security evaluation framework is well-designed, and VC-Judge holds practical value.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive, evaluating 20 models across 4 tasks, plus self-repair analysis, data ablation, and evaluator comparisons.
Writing Quality: ⭐⭐⭐⭐ Clearly structured, though some tables are overly dense.
Value: ⭐⭐⭐⭐ Offers direct guidance for research on LLM code security and for the development of code assistant tools.