Teaching an Old LLM Secure Coding: Localized Preference Optimization on Distilled Preferences
Conference: ACL 2025
arXiv: 2506.00419
Code: https://github.com/StonyBrookNLP/disco-lpo
Area: Alignment RLHF
Keywords: secure code generation, preference optimization, localized alignment, CWE, code security
TL;DR
Proposes DiSCo (a secure code preference dataset distilled from frontier LLMs, with 10K instances covering 431 CWEs) and LPO (Localized Preference Optimization algorithm, propagating loss only on security-related tokens), reducing security vulnerabilities by 19-40% across four secure coding benchmarks while improving code quality by 3-10%.
Background & Motivation
Background: LLMs are widely used for programming assistance (GitHub Copilot has over 1.2 million subscribers, and 92% of developers report using AI coding tools), but studies show that 40-76% of AI-generated code contains security vulnerabilities, as classified under CWE (Common Weakness Enumeration).
Limitations of Prior Work: (1) High-quality secure training data is difficult to acquire: data automatically extracted from open-source repositories is highly noisy and has narrow CWE coverage. (2) Standard preference optimization (DPO/SimPO) is poorly suited to secure code: differences between secure and insecure code are typically localized to a few lines or tokens, whereas standard methods propagate loss uniformly across all tokens, diluting the critical signal.
Key Challenge: The difference between secure and insecure code is highly localized (potentially differing by only a few tokens), yet existing preference optimization methods cannot exploit this localization.
Goal: (1) Build large-scale, high-quality secure code training data. (2) Design an alignment algorithm tailored to localized preferences.
Key Insight: Guiding frontier LLM data distillation using a security knowledge base (ensuring CWE coverage), cleaning data with static analyzers (reducing noise), and designing token-level masked preference optimization.
Core Idea: Constructing broad-coverage data via a security knowledge-base-guided distillation pipeline + localized preference optimization focusing on secure tokens using masks + SFT regularization to maintain code quality.
Method
Overall Architecture
Two-stage training: (1) SFT stage: trains the model to generate secure reasoning \(R\) followed by secure code \(y^+\). (2) LPO stage: performs preference optimization on security-related tokens so that the model prefers \(y^+\) over \(y^-\), while applying SFT regularization on the other tokens.
Key Designs
- DiSCo Data Distillation:
  - Function: Distilling 10K secure/insecure code pairs and secure reasoning from GPT-4o.
  - Mechanism: (1) Constructing prompts from a security knowledge base (534 entries from the CWE website + CodeQL/Bandit documentation + 75 common security libraries) to guide GPT-4o to first generate code with a specific vulnerability and then patch it. (2) Running static security analyzers on the patched code and feeding any remaining findings back to GPT-4o for refinement. One round of refinement reduces the share of "secure" samples that still contain security issues from 37.4% to 12.7%.
  - Design Motivation: Directly prompting an LLM tends to produce only common CWEs; guiding it with a knowledge base ensures broad coverage (431 CWE types). Secure reasoning \(R\) forces the model to think about potential issues before coding.
- SFT + Secure Reasoning:
  - Function: Training the model to generate secure reasoning \(R\) (CWE-ID + problem description + insecure reasons + secure fix) before generating secure code.
  - Mechanism: \(\mathcal{L}_{SFT} = -\mathbb{E}_{(x, y^+, R) \sim D} \log \pi_\theta(y^+, R | x)\)
  - Design Motivation: "Reasoning before coding" enables the model to consider security concerns before code generation, similar to the CoT approach.
- Localized Preference Optimization (LPO):
  - Function: Propagating the preference loss only on security-related tokens, while applying SFT regularization on the other tokens.
  - Mechanism: Binary masks \(m^+, m^-\) mark the tokens that differ between the secure and insecure code (computed via difflib). The reasoning section \(R\) is masked out of the preference term, since it is identical in both. The LPO loss combines a localized preference term \(\Delta\) with an SFT regularization term, where \(\bar{m}^+\) selects the remaining tokens: \(\mathcal{L}_{LPO} = -\mathbb{E}[\log\sigma(\Delta - \gamma) + \alpha \bar{m}^+ \odot \log\pi_\theta(y^+, R|x)]\)
  - Design Motivation: Standard SimPO applies the loss uniformly across all tokens, so the signal from the few security-critical tokens is severely diluted by the large volume of non-security tokens. LPO focuses on the key differences. SFT regularization prevents the model from generating unparseable or incoherent code for the sake of "security".
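The diff-based masks described above can be illustrated with a minimal sketch. It assumes whitespace-split tokens as a stand-in for the model's tokenizer, and `diff_masks` is a hypothetical helper name, not a function from the paper's code. Tokens that fall inside a block shared by both sequences get mask 0; everything else gets mask 1.

```python
import difflib

def diff_masks(secure_tokens, insecure_tokens):
    """Return (m_plus, m_minus): 1 for tokens that differ, 0 for shared tokens."""
    m_plus = [1] * len(secure_tokens)
    m_minus = [1] * len(insecure_tokens)
    matcher = difflib.SequenceMatcher(
        a=secure_tokens, b=insecure_tokens, autojunk=False
    )
    # Every matching block is a run of tokens shared by both sequences;
    # the final block returned by difflib is a zero-size sentinel.
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            m_plus[block.a + k] = 0
            m_minus[block.b + k] = 0
    return m_plus, m_minus

# Toy pair (illustrative only): the two snippets differ in a single token.
secure = "subprocess.run ( args , shell = False )".split()
insecure = "subprocess.run ( args , shell = True )".split()
m_plus, m_minus = diff_masks(secure, insecure)
print(m_plus)   # [0, 0, 0, 0, 0, 0, 1, 0]
print(m_minus)  # [0, 0, 0, 0, 0, 0, 1, 0]
```

In a real setup the masks would be computed over tokenizer IDs, and the positions belonging to \(R\) would be set to 0 in both masks.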
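Loss & Training
Localized preference term \(\Delta = \frac{\beta}{|y^+|} m^+ \odot \log\pi_\theta(y^+,R|x) - \frac{\beta}{|y^-|} m^- \odot \log\pi_\theta(y^-,R|x)\), plus SFT regularization \(\alpha \bar{m}^+ \odot \log\pi_\theta(y^+, R|x)\). Here \(\beta\) scales the length-normalized masked log-probabilities, \(\gamma\) is the target margin (as in SimPO), and \(\alpha\) weights the regularizer.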
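To make the loss concrete, here is a minimal per-example sketch in plain Python. It rests on several assumptions that the summary does not spell out: the \(\odot\) products are summed over token positions, \(\bar{m}^+ = 1 - m^+\), \(|y|\) is the length of the log-probability list, and the default hyperparameter values are placeholders rather than the paper's settings. `lpo_loss` is a hypothetical name.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def lpo_loss(logp_pos, logp_neg, m_pos, m_neg, beta=1.0, gamma=0.0, alpha=0.0):
    """Per-example LPO loss from per-token log-probs and binary masks.

    logp_pos / logp_neg: token log-probabilities of the secure / insecure output.
    m_pos / m_neg: 1 on security-related (differing) tokens, 0 elsewhere.
    """
    # Localized preference term: only masked tokens contribute.
    masked_pos = sum(m * lp for m, lp in zip(m_pos, logp_pos))
    masked_neg = sum(m * lp for m, lp in zip(m_neg, logp_neg))
    delta = beta * masked_pos / len(logp_pos) - beta * masked_neg / len(logp_neg)
    # SFT regularizer on the complementary tokens of the secure output
    # (assumption: complement mask is 1 - m_pos).
    sft_reg = sum((1 - m) * lp for m, lp in zip(m_pos, logp_pos))
    return -(log_sigmoid(delta - gamma) + alpha * sft_reg)

# Toy numbers: the model already assigns low probability to the insecure token.
loss = lpo_loss([-0.1, -0.2], [-0.1, -3.0], [0, 1], [0, 1], beta=2.0)
print(loss)
```

With `alpha=0` the shared tokens have no effect on the loss at all, which is the situation the SFT regularizer is meant to avoid.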
Key Experimental Results
Main Results
| Model + Method | SecurityEval↓ | Asleep↓ | LLMSecEval↓ | HumanEvalX↑ |
|---|---|---|---|---|
| CodeLlama-7B (Original) | High Insecurity Rate | High | High | Baseline |
| + SFT on DiSCo | Reduced | Reduced | Reduced | +3-10% |
| + SimPO on DiSCo | Slightly Reduced | Slightly Reduced | Slightly Reduced | Decreased |
| + LPO on DiSCo | Reduced by 19-40% | Reduced by 19-40% | Reduced by 19-40% | +3-10% |
Key Findings/Results: LPO reduces security issues by 19-40% across four security benchmarks while improving performance on two code quality benchmarks by 3-10%. Smaller models trained with LPO even outperform GPT-4o and Claude-3.5-Sonnet in terms of security.
Ablation Study
| Configuration | Security | Code Quality | Explanation |
|---|---|---|---|
| LPO (Full) | Optimal | Maintained/Improved | Balances safety and quality |
| LPO w/o SFT Regularization | Safer | Decreased | Overfitting to safety makes code unusable |
| SimPO (Standard) | Limited Improvement | Decreased | Unable to focus on localized differences |
| SFT w/o Secure Reasoning | Moderate | Improved | Reasoning benefits security |
| Multi-round Refinement (3 times) | Insecurity rate reduced to 9.4% | Quality degraded | Over-engineered |
Key Findings
- SFT Regularization is Crucial: Without regularization, the model hacks the reward by generating unparseable code. The analyzer deems such code "secure" because it cannot detect vulnerabilities in it, yet the code is practically unusable.
- Secure Reasoning Chain is Effective: Prompting the model to output CWE analysis before coding yields a significant safety boost compared to direct coding.
- Small Models Outperform Large Models: The 7B model trained with LPO outperforms GPT-4o and Claude-3.5-Sonnet in security.
- Single-round Refinement is Optimal: Excessive refinement leads to over-engineered code, making one round of refinement the optimal balance between safety and quality.
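The analyzer-feedback refinement referred to in these findings can be sketched as a small loop. The `analyze` and `fix` callables are hypothetical stand-ins: in the pipeline described here they would correspond to a static analyzer (CodeQL/Bandit) and a GPT-4o refinement call, neither of which is reproduced below. `max_rounds=1` mirrors the single-round setting.

```python
from typing import Callable, List

def refine(code: str,
           analyze: Callable[[str], List[str]],
           fix: Callable[[str, List[str]], str],
           max_rounds: int = 1) -> str:
    """Run up to max_rounds of analyze-then-fix on a candidate secure snippet."""
    for _ in range(max_rounds):
        findings = analyze(code)
        if not findings:
            break  # nothing left to report, stop early
        code = fix(code, findings)
    return code

# Toy stand-ins (assumptions, for illustration only).
def toy_analyze(code: str) -> List[str]:
    return ["shell=True in subprocess call"] if "shell=True" in code else []

def toy_fix(code: str, findings: List[str]) -> str:
    return code.replace("shell=True", "shell=False")

patched = refine("subprocess.run(cmd, shell=True)", toy_analyze, toy_fix)
print(patched)  # subprocess.run(cmd, shell=False)
```

Raising `max_rounds` trades lower residual insecurity for the risk of over-engineered code noted above.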
Highlights & Insights
- Generality of Localized Preference Optimization: The masking mechanism of LPO can be generalized to any scenario with "localized preference differences" (such as code formatting, specific-style writing, etc.).
- Knowledge-Base-Guided Distillation: Constructing distillation prompts using domain knowledge bases to ensure data coverage is a transferable paradigm to other domains requiring broad coverage (such as regulatory compliance, medical safety, etc.).
- SFT Regularization Prevents Reward Hacking: In safety alignment, models may generate "technically safe but unusable" outputs. SFT regularization offers a concise solution.
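The "technically safe but unusable" failure mode is easy to reproduce: an analyzer that cannot parse a file reports no findings for it. The sketch below is an illustrative guard, not the paper's evaluation protocol. It counts a Python sample as secure only if it parses and the analyzer findings (passed in as a plain list here, an assumption) are empty.

```python
import ast
from typing import List

def is_parseable(source: str) -> bool:
    """True if the source is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False

def counts_as_secure(source: str, findings: List[str]) -> bool:
    # Empty findings on unparseable code say nothing about security,
    # so such samples are not credited as secure.
    return is_parseable(source) and not findings

print(counts_as_secure("x = 1\n", []))   # True
print(counts_as_secure("def f(:\n", []))  # False
```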
Limitations & Future Work
- Only Python is supported; other programming languages require rebuilding the dataset.
- Static analyzers (CodeQL/Bandit) inherently have false negatives, leaving some security issues undetected.
- 12.7% of the secure code in DiSCo still contains remnant issues.
- Only smaller models in the low-billion-parameter range (such as the 7B model above) were tested; the effectiveness on 10B+ models remains unverified.
Related Work & Insights
- vs SafeCoder (He et al. 2024): SafeCoder utilizes contrastive/unlikelihood training, whereas LPO provides a more natural preference optimization formulation and achieves better performance.
- vs Pivotal Token Search (Abdin et al. 2024): PTS selects key tokens by estimating their contribution to the overall probability, whereas LPO identifies differing tokens directly via diff, offering a simpler and more direct approach.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of localized preference optimization and knowledge-base-guided distillation is creative.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 4 security benchmarks + 2 code quality benchmarks + multiple models + exhaustive ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation and detailed methodology descriptions.
- Value: ⭐⭐⭐⭐⭐ Vital contribution to secure code generation, with code and data open-sourced.