CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation

Conference: ACL 2025
arXiv: 2502.19166
Code: https://github.com/lin-rany/codeIF
Area: Code Intelligence
Keywords: Code Generation, Instruction Following, Evaluation Benchmark, Constraint Satisfaction, LLM Evaluation

TL;DR

CodeIF is presented as the first systematic benchmark for evaluating the instruction-following capabilities of LLMs in code generation. It defines 50 fine-grained constraint instructions across 8 major categories, introduces 4 new evaluation metrics, and evaluates 35 state-of-the-art models.

Background & Motivation

Background: LLMs have made significant progress in code generation, but their ability to understand and execute complex instructions remains a key challenge. Existing evaluation frameworks (such as HumanEval and MBPP) mainly focus on functional correctness, lacking a systematic evaluation of instruction following.

Limitations of Prior Work: (1) Existing code benchmarks do not assess whether models adhere to constraints such as global formatting, naming conventions, and structural control; (2) There are no evaluation metrics for multi-constraint problems, so model performance cannot be distinguished across constraint types; (3) Instruction-following benchmarks (such as FollowBench and InfoBench) are broad in coverage but are not tailored to code generation.

Key Challenge: Code generation in practice requires simultaneously satisfying functional requirements and coding style constraints. However, existing evaluations only focus on whether the code runs successfully, ignoring whether it is well-written and complies with constraints.

Goal: To build a multilingual, multi-difficulty, and multi-constraint code generation instruction-following benchmark, and to propose quantifiable metrics for assessing instruction-following capabilities.

Key Insight: Starting from coding constraints, decompose them into indivisible atomic instructions and model the dependencies between them, so that each instruction can be judged objectively as a binary Yes/No.

Core Idea: Decompose code instruction following into 50 atomic constraints across 8 categories, paired with 4 hierarchical evaluation metrics to systematically characterize LLMs' capabilities in constraint satisfaction, logical consistency, and dependency handling.

Method

Overall Architecture

The construction of CodeIF consists of three steps: (1) Designing a constraint instruction taxonomy (50 fine-grained sub-instructions under 8 major categories); (2) Using GPT-4 to generate constraint instruction lists for tasks drawn from benchmarks such as McEval and FullStackBench, covering four programming languages: Java, Python, Go, and C++; (3) Using LLMs to model the dependency relationships among atomic constraints.

Key Designs

Constraint Instruction Taxonomy: Eight categories cover different abstraction levels — Global, Structural Control, Variable, Interface, Function, Class, File, and Combination. Each atomic instruction is designed as a binary evaluation (Yes/No) to avoid subjective judgment.
Multilingual and Multi-Difficulty Design: The dataset contains 1,200 tasks across Java (353), Python (348), C++ (269), and Go (230), split into two difficulty levels: Easy (averaging 11.99 instructions per task) and Hard (averaging 13.80 instructions per task).
Instruction Dependency Modeling: LLMs are used to automatically label prerequisite dependencies among atomic constraints. For example, "creating a function body" depends on "naming the function" and "defining parameter types". This dependency structure supports the more rigorous evaluation metric, RSR.
Automated Generation Pipeline: GPT-4, prompted with 20 detailed examples, automatically generates the task-level constraint instruction lists, which are then manually reviewed for quality.
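The design points above can be made concrete with a minimal sketch of how one task, its atomic constraints, and their dependencies might be represented. The `Task` record, its field names, and the acyclicity check are assumptions for illustration; the source does not describe the benchmark's actual data format or whether such a check is run on the LLM-labeled dependencies. The example dependency is the one given in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Per-language task counts as reported in the text above.
LANGUAGE_COUNTS = {"Java": 353, "Python": 348, "C++": 269, "Go": 230}


@dataclass
class Task:
    """Hypothetical record for one benchmark task (field names are assumptions)."""
    language: str
    difficulty: str                      # "Easy" or "Hard"
    prompt: str
    constraints: List[str]               # atomic, Yes/No-checkable instructions
    # depends_on[j] lists the constraint indices that constraint j presupposes.
    depends_on: Dict[int, List[int]] = field(default_factory=dict)


def dependencies_are_acyclic(task: Task) -> bool:
    """Return True if no constraint (transitively) depends on itself."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = [UNSEEN] * len(task.constraints)

    def visit(j: int) -> bool:
        state[j] = IN_PROGRESS
        for k in task.depends_on.get(j, []):
            if state[k] == IN_PROGRESS:      # back edge: a cycle exists
                return False
            if state[k] == UNSEEN and not visit(k):
                return False
        state[j] = DONE
        return True

    return all(state[j] != UNSEEN or visit(j) for j in range(len(task.constraints)))


example = Task(
    language="Python",
    difficulty="Easy",
    prompt="Write a function that parses a config file.",  # made-up prompt
    constraints=[
        "Name the function",
        "Define parameter types",
        "Create the function body",
    ],
    depends_on={2: [0, 1]},  # the body presupposes the name and parameter types
)

print(sum(LANGUAGE_COUNTS.values()))      # 1200
print(dependencies_are_acyclic(example))  # True
```

A dependency map with a cycle would make a dependency-aware score ill-defined, which is why a check of this kind seems a reasonable precaution when the labels come from an LLM.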

Evaluation Metrics

This work proposes four complementary evaluation metrics. Consider a dataset of \(m\) problems, where problem \(i\) has \(n_i\) constraints and \(r_{i,j} \in \{0, 1\}\) indicates whether constraint \(j\) of problem \(i\) is satisfied:

CSR (Completely Satisfaction Rate): \(\text{CSR} = \frac{1}{m}\sum_{i=1}^{m}\prod_{j=1}^{n_i} r_{i,j}\). A problem scores 1 only if all of its constraints are satisfied, which makes this the most stringent metric.
SSR (Soft Satisfaction Rate): \(\text{SSR} = \frac{1}{m}\sum_{i=1}^{m}\frac{\sum_{j=1}^{n_i} r_{i,j}}{n_i}\), which calculates the average constraint satisfaction ratio per problem. This is a softer metric.
RSR (Rigorous Satisfaction Rate): Building on SSR, this metric accounts for dependencies, so constraint \(j\) counts as satisfied only if every prerequisite constraint in \(D_{i,j}\) is also satisfied: \(\text{RSR} = \frac{1}{m}\sum_{i=1}^{m}\frac{\sum_{j=1}^{n_i}[r_{i,j}\cdot\prod_{k\in D_{i,j}}r_{i,k}]}{n_i}\).
CCSR (Consistent Continuity Satisfaction Rate): \(\text{CCSR} = \frac{1}{m}\sum_{i=1}^{m}\frac{L_i}{n_i}\), where \(L_i\) represents the length of the longest consecutively satisfied sequence of constraints. This measures sustained following capabilities.
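The four formulas above can be sketched in a few lines of Python. This is an illustrative implementation written from the definitions in this section, not the authors' evaluation code; the function names, the list-of-lists layout for the binary judgments, and the toy data are assumptions.

```python
from typing import Dict, List, Sequence

Results = Sequence[Sequence[int]]        # results[i][j] is 1 if constraint j of problem i holds
Deps = Sequence[Dict[int, List[int]]]    # deps[i][j] lists the prerequisites of constraint j


def csr(results: Results) -> float:
    """Fraction of problems in which every constraint is satisfied."""
    return sum(all(r) for r in results) / len(results)


def ssr(results: Results) -> float:
    """Mean over problems of the share of satisfied constraints."""
    return sum(sum(r) / len(r) for r in results) / len(results)


def rsr(results: Results, deps: Deps) -> float:
    """Like SSR, but a constraint counts only if all its prerequisites hold too."""
    total = 0.0
    for r, d in zip(results, deps):
        counted = sum(
            1 for j, rj in enumerate(r)
            if rj and all(r[k] for k in d.get(j, []))
        )
        total += counted / len(r)
    return total / len(results)


def longest_run(r: Sequence[int]) -> int:
    """Length of the longest run of consecutive satisfied constraints."""
    best = current = 0
    for satisfied in r:
        current = current + 1 if satisfied else 0
        best = max(best, current)
    return best


def ccsr(results: Results) -> float:
    """Mean over problems of longest satisfied run divided by constraint count."""
    return sum(longest_run(r) / len(r) for r in results) / len(results)


# Toy data: two problems with four constraints each (made-up judgments).
results = [[1, 1, 1, 1], [1, 0, 1, 1]]
deps = [{}, {2: [1]}]  # in problem 2, constraint 2 depends on constraint 1

print(csr(results))        # 0.5
print(ssr(results))        # 0.875
print(rsr(results, deps))  # 0.75
print(ccsr(results))       # 0.75
```

In the second toy problem, constraint 2 is satisfied on its own but its prerequisite is not, so SSR counts it while RSR does not. This is the cascading-failure effect the dependency structure is meant to expose.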

Key Experimental Results

Main Results

Model	CSR(Full)	SSR(Full)	RSR(Full)	CCSR(Full)
DeepSeek-V3	0.414	0.821	0.764	0.712
Claude-3-5-Sonnet	0.444	0.727	0.692	0.652
GPT-4o	0.383	0.748	0.689	0.650
Gemini-Exp-1206	0.357	0.744	0.685	0.636
Qwen2.5-Coder-32B	0.365	0.736	0.679	0.634
Llama-3.3-70B	0.307	0.698	0.632	0.589
Llama-3.2-1B	0.034	0.218	0.182	0.152

Ablation Study

Difficulty	GPT-4o CSR	DeepSeek-V3 CSR	Claude CSR
Easy	0.441	0.468	0.525
Hard	0.325	0.359	0.362

The best CSR on Hard tasks is only 0.362 (Claude), indicating that strict constraint satisfaction is highly challenging.
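As a worked reading of the difficulty table, the relative drop in CSR from Easy to Hard can be computed directly from the reported values. The snippet below is a small arithmetic sketch; the `relative_drop` helper is an illustrative name, and the numbers are copied from the table above.

```python
# (Easy CSR, Hard CSR) per model, copied from the difficulty table.
csr_by_difficulty = {
    "GPT-4o": (0.441, 0.325),
    "DeepSeek-V3": (0.468, 0.359),
    "Claude": (0.525, 0.362),
}


def relative_drop(easy: float, hard: float) -> float:
    """Share of the Easy-level CSR that is lost on Hard tasks."""
    return (easy - hard) / easy


for model, (easy, hard) in csr_by_difficulty.items():
    print(f"{model}: {relative_drop(easy, hard):.1%}")
# GPT-4o: 26.3%
# DeepSeek-V3: 23.3%
# Claude: 31.0%
```

By this measure each of the three models loses roughly a quarter to a third of its Easy-level CSR on Hard tasks, and Claude, despite the best absolute Hard score, shows the largest relative drop.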

Key Findings

DeepSeek-V3 is the overall strongest: It leads across SSR, RSR, and CCSR, particularly reaching an SSR of 0.831 on compositional constraint tasks.
Claude-3-5-Sonnet achieves the highest CSR: It reaches a CSR of 0.504 on Java, but its SSR on Python is only 0.703.
Model size correlates positively with performance, but not uniformly: The Qwen2.5 family displays clear scaling effects, whereas the Llama3 family is inconsistent.
Closed-source vs. open-source: GPT-4o and Claude generally outperform open-source models of similar scale on complex constraints, with the open-source DeepSeek-V3 as a notable exception in the main results.
C++ is the most difficult language: Complex template metaprogramming leads to the poorest performance across all models.
Common deviation types: Models violate naming conventions (e.g., camelCase is requested but snake_case is produced) and ignore negative constraints (e.g., using "if" statements when instructed not to).
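The two deviation types named above are the kind of atomic constraint that can also be checked mechanically for a single language. The sketch below uses Python's standard `ast` and `re` modules to test a camelCase naming constraint and a "no if statements" constraint on Python source. It is an illustrative alternative, not the paper's method: per this note, the benchmark's binary judgments come from GPT-4, and the function names and sample snippets here are assumptions.

```python
import ast
import re

# lowerCamelCase: starts lowercase, no underscores, optional capitalized humps.
CAMEL_CASE = re.compile(r"^[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)*$")


def function_names(source: str) -> list:
    """Collect the names of all functions defined in the source."""
    tree = ast.parse(source)
    return [
        node.name for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]


def follows_camel_case(source: str) -> bool:
    """Yes/No: every defined function name is lowerCamelCase."""
    return all(CAMEL_CASE.match(name) for name in function_names(source))


def avoids_if_statements(source: str) -> bool:
    """Yes/No: the source has no `if` statement (ternary expressions are not counted)."""
    return not any(isinstance(node, ast.If) for node in ast.walk(ast.parse(source)))


# Made-up model output that shows both deviations.
generated = """
def parse_config(path):
    if path:
        return path
    return None
"""

# Made-up output that satisfies both constraints.
compliant = """
def parseConfig(path):
    return path or None
"""

print(follows_camel_case(generated), avoids_if_statements(generated))  # False False
print(follows_camel_case(compliant), avoids_if_statements(compliant))  # True True
```

Rule-based checks like these are cheap and deterministic, but they cover only constraints with a clear syntactic signature and must be rewritten per language, which is one plausible reason to prefer an LLM judge for a 50-constraint, four-language benchmark.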

Highlights & Insights

The first dedicated benchmark targeting instruction following in code generation, filling an important gap.
The four metrics form a layered system: CSR measures full satisfaction, SSR average satisfaction, RSR satisfaction under dependency chains, and CCSR continuity.
The binary design of atomic instructions avoids subjective evaluation; the Pearson correlation between GPT-4 automatic evaluation and human annotation reaches 0.87.
Modeling instruction dependencies is a distinctive design choice: RSR captures cascading failures, where a violated prerequisite invalidates the constraints that depend on it.

Limitations & Future Work

Limited language coverage: The benchmark includes only Java, Python, Go, and C++, and lacks other popular languages such as JavaScript, Rust, and Swift.
Mainly static evaluation: Focuses primarily on code structure and naming conventions, neglecting runtime behavior, performance, and debugging capabilities.
Uniform metric weighting: All constraints are weighted equally, whereas in practice, syntactic correctness is far more important than naming conventions.
Automated evaluation is dependent on GPT-4: Binary judgments rely on GPT-4-1106-Preview, which incurs high costs and may introduce biases.

Related Work & Insights

Code Benchmarks: HumanEval and MBPP focus on functional correctness; EvalPlus enhances test cases, and BigCodeBench extends function calls. None of these involve instruction following.
Instruction-Following Benchmarks: InfoBench, FollowBench, and CFBench evaluate instruction following in text generation. This work extends this to the programming domain.
Insights: Constraint-following capabilities can be evaluated jointly with code functional correctness. The constraint dependency graph can be used to diagnose points where the LLM's reasoning chain breaks.

Rating

Novelty: ⭐⭐⭐⭐ — The first code instruction-following benchmark, precisely positioned.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 35 models, 4 languages, 2 difficulty levels, multi-dimensional analysis.
Writing Quality: ⭐⭐⭐⭐ — Clear structure and rich tables.
Value: ⭐⭐⭐⭐ — Provides a new dimension for code generation evaluation, guiding future model improvements.