Boosting Vulnerability Detection of LLMs via Curriculum Preference Optimization with Synthetic Reasoning Data

Conference: ACL 2025
arXiv: 2506.07390
Code: https://github.com/Xin-Cheng-Wen/PO4Vul
Area: LLM Alignment / Code Security
Keywords: Vulnerability Detection, Preference Optimization, Synthetic Reasoning Data, Curriculum Learning, Triplet SFT

TL;DR

The proposed ReVD framework improves LLM vulnerability detection accuracy by 12-23% and achieves SOTA on PrimeVul and SVEN. It combines three components: bidirectional vulnerability reasoning data synthesis, triplet SFT (jointly learning to reason over vulnerable code, patched code, and the code difference), and Curriculum Online Preference Optimization (COPO).

Background & Motivation

Background: LLMs perform very well on many code-related tasks but remain limited in software vulnerability detection. Code pre-trained models (CodePTMs) such as CodeBERT and UniXcoder also perform poorly when fine-tuned for vulnerability detection.

Limitations of Prior Work: (a) Vulnerability reasoning data is lacking: existing datasets provide only labels for code, without the reasoning for why the code is vulnerable; (b) models learn semantic representations rather than vulnerability patterns: vulnerable code and its patched version are highly similar semantically (GPT-4 fails to distinguish them in 78.6% of matched pairs).

Key Challenge: A vulnerability patch typically changes only a few lines (e.g., adjusting a buffer size or adding a security check), so the vulnerable and patched versions are nearly identical semantically, and an LLM that relies on semantic understanding struggles to capture the difference that matters.
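
To make the "minor change" point concrete, the sketch below compares a hypothetical vulnerable C function with its patched version using Python's standard `difflib`. The C snippet, the off-by-one bug, and the variable names are invented for illustration and do not come from PrimeVul or SVEN.

```python
import difflib

# Hypothetical C function with an off-by-one bounds check (illustrative only).
vulnerable = [
    "int copy_name(char *dst, size_t dst_len, const char *src) {",
    "    size_t n = strlen(src);",
    "    if (n > dst_len) {",
    "        return -1;",
    "    }",
    "    memcpy(dst, src, n);",
    "    dst[n] = 0;",
    "    return 0;",
    "}",
]

# The patch changes a single comparison operator.
patched = list(vulnerable)
patched[2] = "    if (n >= dst_len) {"

# Line-level similarity: 2 * matching_lines / total_lines.
similarity = difflib.SequenceMatcher(None, vulnerable, patched).ratio()

# Keep only the added/removed lines, dropping the "---"/"+++" file headers.
changed = [
    line
    for line in difflib.unified_diff(vulnerable, patched, lineterm="", n=0)
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]

print(f"line-level similarity: {similarity:.2f}")
print("\n".join(changed))
```

Eight of the nine lines are identical, so a model that judges the two versions by overall similarity has very little signal to work with; the security-relevant information is concentrated in the one-line diff.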

Goal: Enable LLMs to learn reasoning patterns of vulnerabilities (why a vulnerability exists and how to patch it) rather than merely semantic representations.

Key Insight: Synthetic reasoning data + triplet supervised fine-tuning + curriculum preference optimization.

Core Idea: Teach the LLM "why there is a vulnerability" using synthetic reasoning data, distinguish vulnerable from patched code using triplet SFT, and focus on weak vulnerability types via curriculum preference optimization.

Method

Overall Architecture

A three-module pipeline: BVD Data Synthesis \(\rightarrow\) T-SFT Triplet Fine-Tuning \(\rightarrow\) COPO Curriculum Preference Optimization

Key Designs

BVD (Bidirectional Vulnerability Data Generation):
- Forward reasoning: Analyze vulnerable code \(\rightarrow\) generate the reasoning chain for "why this code contains a vulnerability."
- Backward reasoning: Analyze patched code \(\rightarrow\) generate "what was patched and why it was patched this way."
- Leverages CVE/CWE information to assist reasoning, yielding 28K high-quality reasoning samples.
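
A minimal sketch of how the two reasoning directions might be requested from a generator model is shown below. The `FixPair` record, the `build_prompts` helper, and the prompt wording are all hypothetical; the source only states that forward and backward reasoning are generated with CVE/CWE information as context.

```python
from dataclasses import dataclass


@dataclass
class FixPair:
    """One vulnerability-fix pair plus its metadata (hypothetical schema)."""
    vulnerable_code: str
    patched_code: str
    cve_description: str
    cwe_id: str


def build_prompts(pair: FixPair) -> dict:
    """Build one forward and one backward reasoning prompt for a pair.

    The prompt text is illustrative, not the wording used by the authors.
    """
    context = f"CWE: {pair.cwe_id}\nCVE description: {pair.cve_description}\n\n"
    forward = (
        context
        + "Explain step by step why the following code contains a vulnerability.\n"
        + pair.vulnerable_code
    )
    backward = (
        context
        + "Explain what the following patched code fixes and why it is fixed this way.\n"
        + pair.patched_code
    )
    return {"forward": forward, "backward": backward}


example = FixPair(
    vulnerable_code="if (n > dst_len) return -1;",
    patched_code="if (n >= dst_len) return -1;",
    cve_description="Off-by-one write past the end of a buffer (made-up example).",
    cwe_id="CWE-787",
)
prompts = build_prompts(example)
```

Each pair thus produces two generation requests, one anchored on the vulnerable code and one on the patched code, which is what makes the synthesis bidirectional.
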
T-SFT (Triplet SFT):
- Three-way loss: pre-code reasoning + post-code reasoning + code difference reasoning.
- \(\mathcal{L} = \ell(\text{pre\_code}) + \ell(\text{post\_code}) + \ell(\text{code\_diff})\)
- Enables the model to simultaneously learn "vulnerability patterns," "patching patterns," and "difference patterns."
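
As a rough illustration of the three-way objective above, the sketch below sums a language-modeling loss over the three reasoning targets. The function names and the choice of a per-sequence mean negative log-likelihood for \(\ell\) are assumptions; the source gives only the sum of the three terms.

```python
import math


def nll(token_logprobs):
    """Mean negative log-likelihood of the target tokens (assumed form of ell)."""
    return -sum(token_logprobs) / len(token_logprobs)


def triplet_sft_loss(pre_code_logprobs, post_code_logprobs, code_diff_logprobs):
    """L = ell(pre_code) + ell(post_code) + ell(code_diff).

    Each argument holds the model's log-probabilities for the reference
    reasoning tokens of one target: vulnerable code, patched code, code diff.
    """
    return (
        nll(pre_code_logprobs)
        + nll(post_code_logprobs)
        + nll(code_diff_logprobs)
    )


# Toy values: every target token is assigned probability 0.5.
toy = [math.log(0.5)] * 4
loss = triplet_sft_loss(toy, toy, toy)
```

Because the three terms are simply added, all three targets contribute with equal weight in this sketch; any reweighting would be a further assumption.
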
COPO (Curriculum Online Preference Optimization):
- Instance-level curriculum: Weight sampling by per-type accuracy, drawing more samples from the vulnerability types on which the model underperforms.
- Task-level curriculum: Decompose each data point into 3 progressive tasks (locating vulnerable lines \(\rightarrow\) analyzing trigger paths \(\rightarrow\) explaining the root cause).
- Uses IPO as the preference objective to reduce overfitting on the limited vulnerability data.
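
The COPO ingredients listed above can be sketched as follows. Several details are assumptions rather than facts from the source: weighting each type by its error rate `1 - accuracy` with a small smoothing term `eps`, the helper names, the CWE labels and accuracies in the toy data, and reading "IPO" as Identity Preference Optimization with its usual squared-error loss and regularization strength `tau`.

```python
import random

# Task-level curriculum: the three progressive tasks named in the text.
TASK_STAGES = [
    "locate vulnerable lines",
    "analyze trigger path",
    "explain root cause",
]


def type_sampling_weights(acc_by_type, eps=0.05):
    """Instance-level curriculum: weight each type by its error rate (assumed)."""
    raw = {t: (1.0 - acc) + eps for t, acc in acc_by_type.items()}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}


def sample_types(acc_by_type, k, seed=0):
    """Draw k vulnerability types, favoring those with low accuracy."""
    weights = type_sampling_weights(acc_by_type)
    types = list(weights)
    rng = random.Random(seed)
    return rng.choices(types, weights=[weights[t] for t in types], k=k)


def ipo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, tau=0.1):
    """Squared-error IPO loss for one preference pair (assumed objective).

    h is the policy-vs-reference log-ratio margin between the chosen and
    rejected responses; the loss pulls h toward 1 / (2 * tau).
    """
    h = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return (h - 1.0 / (2.0 * tau)) ** 2


# Made-up per-type accuracies for illustration.
toy_acc = {"CWE-787": 0.40, "CWE-125": 0.70, "CWE-416": 0.90}
weights = type_sampling_weights(toy_acc)
batch_types = sample_types(toy_acc, k=8)
```

The squared-error form targets a finite margin instead of pushing the margin ever larger, which is consistent with the stated goal of limiting overfitting when preference data is scarce.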

Key Experimental Results

Main Results (PrimeVul + SVEN Datasets)

Method	PrimeVul Acc	PrimeVul F1	SVEN VP-Score
CodeBERT	Baseline	Baseline	Baseline
GPT-4 (Zero-shot)	Lower	Lower	Lower
ReVD	+12-23%	+10.3%	+18.15%

Key Findings

ReVD comprehensively outperforms 9 baselines (CodePTMs + LLMs), improving accuracy by 12-23%.
GPT-4 makes identical predictions on 78.6% of vulnerable/patched pairs, validating that semantic similarity is indeed the core challenge.
Progressive training in curriculum optimization yields better results than direct single-stage training and significantly improves detection for the vulnerability types on which the model is weakest.
Reasoning data is crucial for vulnerability detection: fine-tuning performance drops sharply without it.
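
The 78.6% figure above is a pair-level statistic. One plausible way to compute such a rate is sketched below; the function names and the toy predictions are hypothetical, and the exact evaluation protocol used in the paper may differ.

```python
def identical_prediction_rate(pairs):
    """Fraction of (vulnerable, patched) pairs that receive the same prediction.

    pairs: list of (pred_on_vulnerable, pred_on_patched), where True means
    the model flagged the code as vulnerable.
    """
    same = sum(1 for pred_vuln, pred_patched in pairs if pred_vuln == pred_patched)
    return same / len(pairs)


def pairwise_correct_rate(pairs):
    """Fraction of pairs where the vulnerable version is flagged and the patch is not."""
    correct = sum(1 for pred_vuln, pred_patched in pairs if pred_vuln and not pred_patched)
    return correct / len(pairs)


# Made-up predictions for five pairs.
toy_pairs = [
    (True, True),    # both flagged: model cannot tell them apart
    (False, False),  # neither flagged: model cannot tell them apart
    (True, False),   # correct on both versions
    (False, True),   # wrong on both versions
    (True, True),
]
same_rate = identical_prediction_rate(toy_pairs)
correct_rate = pairwise_correct_rate(toy_pairs)
```

A high identical-prediction rate means the model's decision barely depends on the patch, which is the failure mode the paper attributes to reliance on semantic similarity.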

Highlights & Insights

Bidirectional reasoning data synthesis makes the implicit patching logic explicit, allowing the LLM to learn "why" rather than "what it looks like."
Triplet SFT exploits a structure specific to vulnerability detection: the three-way relationship among vulnerable code, patched code, and the difference between them.
Curriculum preference optimization addresses the imbalance in vulnerability types by focusing on weak areas.

Limitations & Future Work

Reasoning data is synthesized by LLMs, making its quality dependent on the generator model's understanding of vulnerabilities.
Tested only on C/C++ code, leaving multi-language applicability unverified.
The curriculum design of COPO relies on CWE classification, which might lack flexibility for novel vulnerabilities.

Related Work & Insights

vs CodeBERT/UniXcoder: CodePTMs learn only semantic representations, whereas ReVD learns vulnerability reasoning patterns, which lets it handle code that is semantically similar but differs in security.
vs GPT-4 Zero-shot: Even the strongest commercial models struggle to distinguish subtle vulnerabilities, highlighting the need for domain-specific training.
Overall, the work is a significant advance for LLM applications in code security.

Rating

Novelty: ⭐⭐⭐⭐ The combination of reasoning data synthesis, triplet SFT, and curriculum PO is novel and highly targeted.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated against 9 baselines on two datasets with comprehensive ablation studies.
Writing Quality: ⭐⭐⭐⭐ Clear motivation of the core problem (the 78.6% failure rate of GPT-4 is highly convincing).
Value: ⭐⭐⭐⭐⭐ Introduces the first vulnerability reasoning dataset and framework, offering direct practical value for LLMs in code security.