CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models¶

Conference: ACL 2025
arXiv: 2503.16167
Code: https://huggingface.co/datasets/Tomo-Melb/CodeReviewQA
Area: Code Intelligence
Keywords: Code Review, Benchmark, Multiple-Choice Probing, Data Contamination, Reasoning Decomposition

TL;DR¶

The CodeReviewQA benchmark is proposed, decomposing the Automated Code Refinement (ACR) task into three intermediate reasoning steps: Change Type Recognition (CTR), Change Localization (CL), and Solution Identification (SI). Each step is formulated as a multiple-choice question-answering (MCQA) probe with varying difficulty levels. An evaluation of 72 LLMs on 900 human-verified samples spanning 9 programming languages reveals specific model weaknesses in code review comprehension.

Background & Motivation¶

Background: While LLMs exhibit excellent performance in code generation, they still struggle with real-world collaborative software engineering tasks like code review. Review comments are often implicit, ambiguous, and colloquial, requiring the simultaneous comprehension of both code and human intentions.

Limitations of Prior Work: (a) Existing ACR evaluations rely heavily on lexical matching metrics (Exact Match/BLEU), which cannot pinpoint where models fall short; (b) evaluation datasets derived from popular GitHub repositories face severe training data contamination risks; (c) existing benchmarks lack human verification, leaving a large number of noisy samples.

Key Challenge: ACR is a complex multi-step reasoning task (understanding the change type \(\rightarrow\) locating the code \(\rightarrow\) determining the modification solution), but current evaluations treat it as a single-step sequence-to-sequence translation problem, making it impossible to diagnose specific failure reasons.

Goal: Construct a fine-grained evaluation benchmark for code review comprehension that mitigates data contamination while offering multiple difficulty levels.

Key Insight: Decompose the generation task into three multiple-choice question-answering (MCQA) probes, where each probe corresponds to a distinct reasoning step, utilizing synthetic answer options to circumvent data contamination.

Core Idea: Replace generation-based evaluation with MCQA probing, decomposing ACR into a three-step reasoning diagnostic (CTR + CL + SI) while mitigating data contamination.

Method¶

Overall Architecture¶

Input: Code review comment \(R_{nl}\) + pre-commit code \(H_{pre}\). Evaluation dimensions: Three MCQA probes (CTR/CL/SI) + original ACR generation task. Each probe is designed with two difficulty levels, Easy and Hard, controlled by the design of distractor options.

Key Designs¶

Change Type Recognition (CTR): Determines whether the review comment requires adding (add), deleting (delete), or modifying (modify) code. Formulated as a 3-choice closed-set classification task.
Change Localization (CL): Determines which lines of the code snippet should change. The Easy version uses random line numbers as distractors, while the Hard version uses related but incorrect line numbers generated by LLMs.
Solution Identification (SI): Selects the correct modification solution from multiple candidate code patches. The Easy version uses distractors that modify different code locations, while the Hard version uses distractors that make different modifications to the same target location.
Data Quality Assurance: All 900 samples are human-verified, filtering out instances that can be handled automatically by static analysis tools. This covers 199 repositories and 9 programming languages.

Loss & Training¶

This is a pure evaluation benchmark; no training is involved. It evaluates 72 LLMs (ranging from 1B to 72B parameters, from 18 organizations), including both code-specific and general-purpose models.

Key Experimental Results¶

Main Results¶

| Model Scale | ACR Acc | CTR-E | CTR-H | CL-E | CL-H | SI-E | SI-H |
|---|---|---|---|---|---|---|---|
| 1-3B | Low | Medium | Low | Medium | Low | Medium | Low |
| 7-8B | Medium | High | Medium | High | Medium | Medium | Low |
| 14-32B | Relatively High | High | Relatively High | High | Relatively High | Relatively High | Medium |
| 70-72B | Highest | Highest | High | Highest | Relatively High | Relatively High | Medium |

Key Findings¶

ACR Generation Accuracy does not equate to comprehension ability: Some models show high ACR scores but low MCQA probing scores, indicating potential data memorization rather than actual comprehension.
CL and SI are performance bottlenecks: Most models perform acceptably on CTR, but the Hard variants of CL (localization) and SI (solution selection) are the primary failure points.
Hard difficulty effectively discriminates capability: The performance gap between Easy and Hard settings helps diagnose the sources of model vulnerabilities.
Reasoning models (e.g., QwQ-32B) show an advantage, but not a decisive one: they have some edge on SI-Hard, which requires multi-step reasoning.

Highlights & Insights¶

Reasoning Decomposition Evaluation: Instead of end-to-end evaluation, it diagnoses models step-by-step to precisely locate cognitive gaps. This paradigm is transferable to evaluating any complex multi-step generation task.
MCQA to Combat Data Contamination: Because the answer options are synthetic, the exact combination of question and options is unlikely to have appeared in training corpora, which makes the approach more sustainable than time-cutoff methods.
Importance of Human Verification: Excluding over 40% of noisy samples makes the evaluation results significantly more reliable.

Limitations & Future Work¶

Relatively small scale of 900 samples: Human verification guarantees high quality but restricts the scale.
Exclusively evaluates English code reviews: The comprehension of non-English review comments is currently not covered.
Gap between MCQA and real-world scenarios: Real-world ACR relies on open-ended generation; whether performance on MCQA translates directly to real-world capability requires further validation.

Related Work & Insights¶

vs CodeReviewer/CodeReview-New: Prior works use text-matching metrics, lack data contamination protection, and lack human verification. CodeReviewQA addresses all three.
vs HumanEval/SWE-bench: HumanEval assesses code generation and SWE-bench evaluates issue resolving, whereas CodeReviewQA uniquely focuses on understanding human communication intent.

Rating¶

Novelty: ⭐⭐⭐⭐ — Reasoning decomposition + MCQA to combat data contamination is a novel approach.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers 72 models, 9 languages, with fully human-verified data.
Writing Quality: ⭐⭐⭐⭐ — Clear layout with complete formalizations.
Value: ⭐⭐⭐⭐ — Fills the gap in fine-grained evaluation of code reviews.

Area: LLM NLP
Keywords: Code Review, Automated Code Refinement, Multiple-Choice Probing, Data Contamination, LLM Evaluation

TL;DR¶

The CodeReviewQA benchmark is proposed, decomposing code review comprehension into three reasoning steps: change type recognition, change localization, and solution identification. It provides fine-grained feedback and mitigates data contamination risk via multiple-choice probing, systematically evaluating the code review comprehension capabilities of 72 LLMs.

Background & Motivation¶

Background: LLMs exhibit strong performance in code generation but remain limited in collaborative software engineering tasks (such as code reviews). Code review comments are frequently implicit, ambiguous, and colloquial, requiring models to understand code alongside human intentions. Automated Code Refinement (ACR) serves as the core task in this domain.
Limitations of Prior Work: (a) Current evaluations rely on lexical-matching metrics like exact match and BLEU, which only capture surface-level similarities and fail to expose specific drawbacks in intermediate reasoning steps. (b) Evaluation benchmarks using popular GitHub projects are subject to severe training data contamination. (c) Prior benchmarks built through large-scale automated mining contain a substantial number of noisy samples.
Key Challenge: ACR is a process requiring multi-step reasoning (from understanding intent, to locating code, then to generating corrections), but current evaluations solely look at the final output, leaving them unable to diagnose specific failure points.
Goal: To build a code review comprehension evaluation benchmark that yields intermediate reasoning feedback, mitigates data contamination, and consists of human-verified, high-quality samples.
Key Insight: Decompose ACR into three MCQA probing tasks, each validating a specific reasoning step. The MCQA format inherently combats data contamination.
Core Idea: Replace end-to-end generation assessment with multiple-choice probing, decomposing code review comprehension into independently measurable reasoning steps to obtain fine-grained model diagnostics.

Method¶

Overall Architecture¶

Decompose the ACR task \(P(H_{post} | H_{pre}, R_{nl})\) into three sequential reasoning steps, with each designed as an MCQA probe: CTR (Change Type Recognition), CL (Change Localization), and SI (Solution Identification).
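
One way to read this decomposition is as a chain-rule factorization of the ACR distribution. The latent variables \(t\) (change type) and \(\ell\) (change location) are notation introduced here for illustration; they are an assumption of this sketch, not symbols taken from the paper:

\[
P(H_{post} \mid H_{pre}, R_{nl}) = \sum_{t} \sum_{\ell} P(t \mid H_{pre}, R_{nl}) \, P(\ell \mid t, H_{pre}, R_{nl}) \, P(H_{post} \mid \ell, t, H_{pre}, R_{nl})
\]

Under this reading, CTR probes the first factor, CL the second, and SI the third, which is what would let an end-to-end failure be traced to a specific step.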

Key Designs¶

Change Type Recognition (CTR): A 3-choice MCQA given \(H_{pre}\) and \(R_{nl}\) to predict the change type required by the code review (add/delete/modify). Distractors are the other two change types. Mechanism/Design Motivation: As the most basic step of intent understanding, whether CTR is correct dictates the direction of the subsequent reasoning.
Change Localization (CL): A coreference resolution task that maps the natural-language descriptions in a comment to the exact lines of \(H_{pre}\) that must be modified. Difficulty variations: in the Easy version the options have low Jaccard similarity and are easy to tell apart; in the Hard version the options have high Jaccard similarity and differ only slightly. Mechanism/Design Motivation: Code review comments rarely state line numbers directly, so locating the change requires cross-modal coreference resolution.
Solution Identification (SI): Involves intent extraction and solution selection to identify the correct code refinement \(H_{post\_plus}\). Distractors are generated via high-temperature sampling from a surrogate LLM: masking code elements with the highest surprisal and then filling them back to yield diverse yet incorrect candidate fixes. Difficulty variations are also split into Easy and Hard. Mechanism/Design Motivation: If a model is capable of generating a correct refinement, it should, at the very least, be able to identify it.
Invariance Testing: Evaluations are performed on all permutations (\(N!\) configurations) of answer options for every question; a sample is counted as correct only if the model identifies the correct answer across all permutations. Mechanism/Design Motivation: The probability of randomly guessing correctly across all permutations is merely \((1/N)^{N!}\), which drastically minimizes the effect of lucky guessing.
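
The Jaccard-based difficulty control described for Change Localization can be illustrated with a short sketch. It assumes that similarity is measured between a distractor's set of line numbers and the correct set; the summary does not state exactly which pairs are compared, and the line numbers below are made up.

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two sets of line numbers."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)


correct_lines = {12, 13, 14}
easy_distractor = {40, 41}        # shares no lines with the correct answer
hard_distractor = {13, 14, 15}    # shifted by a single line

easy_score = jaccard(correct_lines, easy_distractor)
hard_score = jaccard(correct_lines, hard_distractor)
print(easy_score, hard_score)  # 0.0 0.5
```

A distractor that overlaps heavily with the correct lines forces the model to resolve the comment to exact line boundaries rather than to a rough region of the code.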

Loss & Training¶

No training involved. Assessment uses Invariant Accuracy, which requires correct selection across all permutations of answer options. The ACR task uses exact match rate.
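
Invariant Accuracy as described here can be sketched as follows. The `Predictor` callable stands in for a model and is a hypothetical interface, as are the two toy predictors; the guess probability assumes independent, uniform guessing on each ordering.

```python
from itertools import permutations
from math import factorial
from typing import Callable, Sequence

# A predictor receives the options in presentation order and returns the index it picks.
Predictor = Callable[[Sequence[str]], int]


def passes_invariance(options: Sequence[str], answer: str, predict: Predictor) -> bool:
    """True only if `predict` selects `answer` under every ordering of `options`."""
    for ordering in permutations(options):
        if ordering[predict(ordering)] != answer:
            return False
    return True


def invariant_accuracy(samples, predict: Predictor) -> float:
    """Fraction of (options, answer) samples that pass the invariance check."""
    passed = sum(passes_invariance(opts, ans, predict) for opts, ans in samples)
    return passed / len(samples)


def guess_probability(n_options: int) -> float:
    """Chance that uniform guessing passes all N! orderings: (1/N)^(N!)."""
    return (1 / n_options) ** factorial(n_options)


def always_first(ordering: Sequence[str]) -> int:
    """Toy position-biased predictor: always picks the first option."""
    return 0


def picks_delete(ordering: Sequence[str]) -> int:
    """Toy content-based predictor: finds 'delete' wherever it is placed."""
    return list(ordering).index("delete")


samples = [(("add", "delete", "modify"), "delete")]
biased_acc = invariant_accuracy(samples, always_first)
robust_acc = invariant_accuracy(samples, picks_delete)
print(biased_acc, robust_acc, guess_probability(3))
```

The position-biased predictor scores 0.0 and the content-based one scores 1.0. For a 3-option probe the guess probability is \((1/3)^{6} = 1/729\), roughly 0.14%.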

Key Experimental Results¶

Main Results¶

| Model | ACR (%) | CTR (%) | CL-E (%) | CL-H (%) | SI-E (%) | SI-H (%) |
|---|---|---|---|---|---|---|
| Qwen2.5-Coder-3B | 30.3 | 77.7 | 1.8 | 1.8 | 12.2 | 8.0 |
| Qwen2.5-Coder-7B | 41.0 | 78.6 | 13.8 | 10.7 | 67.6 | 55.2 |
| phi-4 | 37.1 | 76.6 | 50.9 | 44.8 | 84.4 | 77.5 |
| gemma-2-27b-it | 46.4 | 74.0 | 70.1 | 58.7 | 76.2 | 65.7 |
| Llama-3.1-70B | 50.3 | 68.4 | 74.7 | 69.0 | 84.2 | 76.7 |
| Qwen2.5-72B | 48.7 | 79.8 | 64.2 | 58.3 | 97.1 | 90.9 |
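
As a small worked example, the Easy-to-Hard gap can be computed directly from the six rows of the table above. The numbers are copied from the table; the `scores` and `gaps` names are just illustrative, and this covers only the listed models, not all 72.

```python
# Scores copied from the table above: (CL-E, CL-H, SI-E, SI-H) in percent.
scores = {
    "Qwen2.5-Coder-3B": (1.8, 1.8, 12.2, 8.0),
    "Qwen2.5-Coder-7B": (13.8, 10.7, 67.6, 55.2),
    "phi-4": (50.9, 44.8, 84.4, 77.5),
    "gemma-2-27b-it": (70.1, 58.7, 76.2, 65.7),
    "Llama-3.1-70B": (74.7, 69.0, 84.2, 76.7),
    "Qwen2.5-72B": (64.2, 58.3, 97.1, 90.9),
}

# Gap in percentage points between the Easy and Hard variant of each probe.
gaps = {
    model: (round(cl_e - cl_h, 1), round(si_e - si_h, 1))
    for model, (cl_e, cl_h, si_e, si_h) in scores.items()
}

for model, (cl_gap, si_gap) in gaps.items():
    print(f"{model:18s} CL gap {cl_gap:5.1f}  SI gap {si_gap:5.1f}")
```

For these rows the gap is never negative, which is consistent with the Hard variants being harder. gemma-2-27b-it shows the largest CL gap (11.4 points) and Qwen2.5-Coder-7B the largest SI gap (12.4 points).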

Ablation Study¶

| Finding | Details |
|---|---|
| ACR vs MCQA Inconsistency | Qwen2.5-72B scores 1.6 points lower on ACR than Llama-3.1-70B (48.7 vs 50.3) but over 10 points higher on CTR and SI |
| Diminishing Scale Effects | ACR only improves by 3.7% beyond 16B |
| Small Models Struggle with CL | CL scores are close to 0% for models under 3B |
| Difficulty Variations Prove Effective | CL-H is consistently lower than CL-E, and SI-H is lower than SI-E |

Key Findings¶

ACR results often do not line up with MCQA probing results: Models might score on ACR through surface pattern matching while having fundamental flaws in intermediate reasoning and comprehension.
CTR is the easiest probe, with models under 3B already reaching around 78%, whereas CL and SI require substantially larger models.
CL is the biggest bottleneck for small models: Locating the target lines for modification demands precise cross-modal coreference resolution.
Performance variance on SI is the largest: phi-4 reaches only 37.1% on ACR but 84.4% on SI-E, suggesting that it can recognize a correct refinement far more reliably than it can generate one.
Data quality is crucial: only 900 samples (a 13% selection rate) were preserved from the initial pool of 9,367.

Highlights & Insights¶

Decomposing end-to-end generation tasks into multiple, independently measurable reasoning steps provides a new paradigm for model diagnostics.
The defensive capability of the MCQA format against data contamination is an underappreciated yet critical advantage.
Invariance testing (requiring correct choice across all answer permutations) serves as a rigorous and fair evaluation standard.
It reveals an important phenomenon: comprehension does not equal generation. Models can often recognize the correct solution yet fail to produce it themselves.
The 13% data retention rate serves as a wake-up call for the quality issues pervasive in current code review datasets.

Limitations & Future Work¶

The small scale of only 900 samples might not fully encompass all real-world code review scenarios.
Distractors being generated by Codestral-22B might introduce a bias toward models utilizing similar architectures.
Only open-source models (under 72B) were evaluated, leaving out closed-source models such as GPT-4 and Claude.
The 3-way classification for CTR is relatively coarse-grained and could be expanded to finer change types.
Multi-turn context dependencies in code review discussions are not factored in.

Related Work & Insights¶

Compared to prior ACR benchmarks such as T5CR and CodeReviewer, this work is the first to introduce intermediate reasoning probing and data-contamination resilience mechanisms.
Addresses a different level of granularity compared to SWE-bench: CodeReviewQA concentrates on the comprehension of a single code review comment.
The MCSB analysis by Robinson and Wingate (2023) provides a theoretical foundation for the answer-extraction methodologies.
Insights for future code evaluation efforts: there is a critical need for more evaluation frameworks decomposed into intermediate reasoning steps.

Key Terms¶

ACR (Automated Code Refinement): The task of automatically modifying source code based on code review comments.
Invariant Test: A sample passes only if the model answers correctly under all \(N!\) permutations of the answer options; the probability of passing by random guessing is \((1/N)^{N!}\).
Surprisal-based Masking: Generates distractors using high-surprisal code elements to ensure distractors mimic typical errors made by models.
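
The masking half of surprisal-based distractor generation can be sketched as below. This assumes surprisal is the negative log-probability that a surrogate language model assigns to each token; the tokens, probabilities, `<MASK>` placeholder, and helper names are made up for illustration, and the refill step (sampling replacements from the surrogate LLM) is omitted.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal in bits of a token that was assigned probability `prob`."""
    return -math.log2(prob)


def top_surprisal_positions(token_probs: list[float], k: int) -> list[int]:
    """Indices of the k tokens with the highest surprisal, in source order."""
    ranked = sorted(
        range(len(token_probs)),
        key=lambda i: surprisal(token_probs[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


def mask_tokens(tokens: list[str], positions: list[int], mask: str = "<MASK>") -> list[str]:
    """Replace the tokens at `positions` with a mask placeholder."""
    chosen = set(positions)
    return [mask if i in chosen else tok for i, tok in enumerate(tokens)]


# Made-up tokens and surrogate-model probabilities, for illustration only.
tokens = ["return", "max", "(", "a", ",", "b", ")"]
probs = [0.90, 0.05, 0.99, 0.40, 0.95, 0.10, 0.98]

positions = top_surprisal_positions(probs, k=2)
masked = mask_tokens(tokens, positions)
print(positions, masked)
```

Here the two least predictable tokens (`max` and `b`) are masked, so any refilled candidate differs from the correct patch exactly where the surrogate model was least certain.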

Rating¶

Novelty: ⭐⭐⭐⭐⭐ — Pioneeringly decomposes code review comprehension into measurable reasoning steps.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 72 models, fully human-verified, with multiple difficulty variants.
Writing Quality: ⭐⭐⭐⭐ — Clear layout with complete formalizations.
Value: ⭐⭐⭐⭐⭐ — Establishes a new benchmark and methodology for code comprehension evaluation.
