M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation
Conference: ACL 2025
arXiv: 2410.21157
Code: github.com/M2RC-Eval-Team/M2RC-Eval
Area: Multilingual Translation
Keywords: Code completion, Repository-level code, Multilingual evaluation, Abstract Syntax Tree, Fine-grained annotation
TL;DR
The paper proposes M2rc-Eval, a multilingual repository-level code completion benchmark covering 18 programming languages with AST-based fine-grained annotations at the bucket and semantic levels, and builds the M2rc-Instruct instruction corpus to improve model performance.
Background & Motivation
Repository-level Code Completion is an active research topic in software engineering that, unlike simple in-file completion, requires understanding cross-file contextual dependencies. Existing benchmarks have the following limitations:
Insufficient Language Coverage: Mainstream benchmarks like CrossCodeEval only cover 4 languages (Python, Java, TypeScript, C#) and RepoBench covers only 2, failing to comprehensively evaluate the multilingual capabilities of code LLMs.
Lack of Fine-grained Analysis: Existing benchmarks typically only report average scores across all languages, ignoring fine-grained capability differences across various completion scenarios.
Coarse Difficulty Classification: Most methods only consider the number of involved files, neglecting the structural and semantic context of code within a project.
These issues create a significant blind spot when evaluating the multilingual repository-level code completion capabilities of code LLMs, making a benchmark with broader coverage and finer annotations highly desirable.
Method
Overall Architecture
The M2rc-Eval system consists of three core components:
- M2rc-Eval Benchmark: Covers 18 programming languages (C, C#, C++, Go, HTML, Haskell, Java, JavaScript, Kotlin, Lua, Objective-C, PHP, Python, R, Ruby, Rust, Scala, TypeScript), with 100 validation samples and 500 test samples for each language.
- Fine-grained Annotation System: Provides bucket-level and semantic-level annotations based on Abstract Syntax Trees (AST).
- M2rc-Instruct Instruction Corpus: Contains 50,000 files per language, used for fine-tuning to boost completion performance.
Key Designs
Data Collection and Quality Control:
- Data is sourced from The Stack v2 (a license-compliant GitHub repository subset), keeping repositories with more than 5 stars and 10 to 50 files.
- The overall data pool is constructed from 431,353,244 files.
- The completion cursor position is determined by parsing the AST and randomly selecting a node rather than a random string, which keeps identifiers and statements intact.
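The node-based cursor selection described above can be sketched as follows. This is a minimal illustration, not the benchmark's pipeline: it uses Python's standard `ast` module as a stand-in for a multi-language parser, and the function name `pick_completion_span` and its `max_lines` parameter are hypothetical.

```python
import ast
import random


def pick_completion_span(source: str, rng: random.Random, max_lines: int = 5):
    """Split `source` into (prefix, ground_truth, suffix) around a random AST node."""
    data = source.encode("utf-8")  # ast column offsets are UTF-8 byte offsets
    # Byte offset at which each line starts.
    line_starts = [0]
    for line in data.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))

    # Candidate nodes: those with position info that span at most `max_lines` lines.
    candidates = [
        node
        for node in ast.walk(ast.parse(source))
        if getattr(node, "end_lineno", None) is not None
        and node.end_lineno - node.lineno + 1 <= max_lines
    ]
    node = rng.choice(candidates)  # raises IndexError if no node qualifies
    start = line_starts[node.lineno - 1] + node.col_offset
    end = line_starts[node.end_lineno - 1] + node.end_col_offset
    return data[:start].decode(), data[start:end].decode(), data[end:].decode()


source = "def add(a, b):\n    return a + b\n"
prefix, truth, suffix = pick_completion_span(source, random.Random(0))
```

Because the span boundaries come from a syntax node, the ground truth is always a whole identifier, expression, or statement rather than an arbitrary character range.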
Quality Filtering Rules:
- The completion cursor spans no more than 5 lines.
- If the completion ground truth is shorter than 20 characters, at least 20% of its characters must be alphabetic.
- Repositories in M2rc-Eval do not appear in M2rc-Instruct.
- 30% of the completion ground truths are at least 2 lines.
- Samples that DeepSeekCoder-1.3B can accurately predict without cross-file context are filtered out.
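The two per-sample rules (span length and alphabetic ratio) are simple enough to sketch directly. The function name, parameter names, and the empty-string guard below are assumptions made for illustration; the repository-disjointness and model-based filters are not covered.

```python
def passes_quality_filters(
    ground_truth: str,
    max_lines: int = 5,
    short_len: int = 20,
    min_alpha_ratio: float = 0.2,
) -> bool:
    """Apply the line-count and alphabetic-ratio rules to one ground truth."""
    if not ground_truth.strip():
        return False  # extra guard (assumption): reject empty completions
    if len(ground_truth.splitlines()) > max_lines:
        return False  # completion spans more than `max_lines` lines
    if len(ground_truth) < short_len:
        # Short completions must not be mostly punctuation or digits.
        alpha = sum(ch.isalpha() for ch in ground_truth)
        if alpha / len(ground_truth) < min_alpha_ratio:
            return False
    return True
```

For example, `"return a + b"` passes, while a short punctuation-only span such as `"})); ]"` is rejected.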
Bucket-level Annotation:
- Divides the \(N\) layers of the AST into \(M=10\) buckets.
- The \(i\)-th layer belongs to the \(\lceil i/(N/M) \rceil\)-th bucket.
- Reflects completion depth and difficulty: deeper layers (higher bucket numbers) represent more complex code structures.
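As a worked instance of the bucket formula, the sketch below computes \(\lceil i/(N/M) \rceil\) with integer arithmetic. The function name and the 1-based layer indexing are assumptions.

```python
def bucket_index(layer: int, num_layers: int, num_buckets: int = 10) -> int:
    """Return ceil(layer / (num_layers / num_buckets)) for a 1-based layer index."""
    if not 1 <= layer <= num_layers:
        raise ValueError("layer must lie in [1, num_layers]")
    # ceil(a / b) for positive integers is (a + b - 1) // b; this avoids float rounding.
    return (layer * num_buckets + num_layers - 1) // num_layers


# With N = 20 layers and M = 10 buckets, every two layers share a bucket.
print([bucket_index(i, 20) for i in (1, 2, 3, 20)])  # [1, 1, 2, 10]
```

Note that when a tree has fewer layers than buckets (for example \(N=5\)), the formula skips some bucket numbers: layer 1 lands in bucket 2 and layer 5 in bucket 10.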
Semantic-level Annotation:
- Predefines 11 major semantic categories: program structure, declarations and definitions, control flow structures, expressions, data types, statements, modifiers and attributes, comments and documentation, preprocessor directives, identifiers and scopes, and special language structures.
- Specific subcategories are designed for different languages.
- Maps grammar tags from the Tree-sitter parser to semantic labels.
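The tag-to-label mapping can be pictured as a lookup table. The sketch below is illustrative only: the keys are node type names from the Tree-sitter Python grammar, but the specific assignment of each tag to a category is an assumption, not the paper's actual mapping.

```python
from typing import Optional

# Hypothetical mapping from Tree-sitter node types (Python grammar) to the
# major semantic categories; the real mapping is defined per language.
TAG_TO_LABEL = {
    "module": "program structure",
    "function_definition": "declarations and definitions",
    "class_definition": "declarations and definitions",
    "if_statement": "control flow structures",
    "for_statement": "control flow structures",
    "binary_operator": "expressions",
    "call": "expressions",
    "return_statement": "statements",
    "comment": "comments and documentation",
    "identifier": "identifiers and scopes",
}


def semantic_label(node_type: str) -> Optional[str]:
    """Return the semantic category for a grammar tag, or None if unmapped."""
    return TAG_TO_LABEL.get(node_type)
```

Many fine-grained tags collapse onto one label, which is what makes per-category scores comparable across languages with different grammars.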
Cross-file Context Retrieval:
- Follows CrossCodeEval's methodology, extracting \(L\)-line contiguous code segments from the same repository.
- Segments are ranked by Jaccard similarity.
- Segments are appended to the in-file context in descending order of relevance until the 4,096-token limit is reached.
Loss & Training
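The retrieval steps above can be sketched as follows. Several details are assumptions: the whole in-file context is used as the query, tokens are word-like strings found by a regular expression, `count_tokens` is a whitespace-based stand-in for the model tokenizer, and the default `chunk_size` of 10 lines is arbitrary.

```python
import re


def tokenize(text: str) -> set:
    """Bag of word-like tokens used for similarity (assumption)."""
    return set(re.findall(r"\w+", text))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|, defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for the model's real tokenizer."""
    return len(text.split())


def select_cross_file_chunks(in_file_context, other_files, chunk_size=10, token_budget=4096):
    """Rank fixed-size line chunks by similarity and keep them while the budget allows."""
    query = tokenize(in_file_context)
    chunks = []
    for source in other_files:
        lines = source.splitlines()
        for i in range(0, len(lines), chunk_size):
            chunks.append("\n".join(lines[i:i + chunk_size]))
    ranked = sorted(chunks, key=lambda c: jaccard(query, tokenize(c)), reverse=True)

    used = count_tokens(in_file_context)
    selected = []
    for chunk in ranked:
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            break  # budget reached; stop adding context
        selected.append(chunk)
        used += cost
    return selected
```

How the selected chunks are laid out relative to the in-file context in the final prompt is left open here.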
Fine-tuning on M2rc-Instruct adopts the standard code completion objective: predict the completion ground truth given the in-file context and the cross-file context. Training uses the Fill-In-the-Middle (FIM) paradigm, consistent with the pre-training strategies of models like Code Llama and DeepSeek-Coder.
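A FIM training example in prefix-suffix-middle order can be sketched as below. The sentinel strings are placeholders passed as parameters, since each model family defines its own special tokens, and placing the cross-file context at the start of the prefix is an assumption rather than the paper's documented layout.

```python
def build_fim_example(
    prefix: str,
    suffix: str,
    middle: str,
    cross_file_context: str = "",
    prefix_tok: str = "<fim_prefix>",   # placeholder sentinel (assumption)
    suffix_tok: str = "<fim_suffix>",   # placeholder sentinel (assumption)
    middle_tok: str = "<fim_middle>",   # placeholder sentinel (assumption)
):
    """Return (prompt, target): the model learns to emit `target` after `prompt`."""
    prompt = (
        f"{prefix_tok}{cross_file_context}{prefix}"
        f"{suffix_tok}{suffix}"
        f"{middle_tok}"
    )
    return prompt, middle


prompt, target = build_fim_example(
    prefix="def add(a, b):\n    return ",
    suffix="\n",
    middle="a + b",
    cross_file_context="# retrieved: def mul(a, b): ...\n",
)
```

The loss would then be computed on `target` only, so the model is trained to fill the gap given both sides of the cursor.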
Key Experimental Results
Main Results
Evaluated on three code LLMs (Code Llama-7B, StarCoder-7B, and DeepSeekCoder-6.7B) using EM (Exact Match) and ES (Edit Similarity) metrics:
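For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them. It assumes that both strings are stripped of surrounding whitespace and that ES is the Levenshtein distance normalized by the longer string; the paper's exact normalization may differ.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """EM: the prediction equals the reference after stripping whitespace."""
    return prediction.strip() == reference.strip()


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def edit_similarity(prediction: str, reference: str) -> float:
    """ES in [0, 1]: 1 minus the edit distance normalized by the longer string."""
    prediction, reference = prediction.strip(), reference.strip()
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(prediction, reference) / longest
```

EM is all-or-nothing, while ES gives partial credit: a prediction that differs from a four-character reference by one character scores 0 on EM but 0.75 on ES.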
EM averaged over the 18 languages for Baseline (in-file context only) \(\rightarrow\) +Retrieval \(\rightarrow\) +Retrieval & Tuning:
| Model | Baseline EM | +Retrieval EM | +Retrieval & Tuning EM |
|---|---|---|---|
| Code Llama-7B | 19.4% | ~21% | ~43% |
| StarCoder-7B | ~20% | ~23% | ~46% |
| DeepSeekCoder-6.7B | ~22% | ~26% | ~48% |
Key Observations:
- Cross-file retrieval augmentation yields an EM improvement of about 2-4 percentage points.
- Fine-tuning on M2rc-Instruct adds a further 20-25 percentage points of EM on top of retrieval.
- DeepSeekCoder-6.7B achieves the best performance after fine-tuning.
Comparison with Existing Benchmarks:
| Benchmark | # Languages | Fine-Grained Annotation | Training Set | # Test Repositories |
|---|---|---|---|---|
| RepoBench | 2 | ✗ | ✓ | 1669 |
| CrossCodeEval | 4 | ✗ | ✗ | 1002 |
| R2C2-Bench | 4 | ✗ | ✓ | 1353 |
| M2rc-Eval | 18 | ✓ | ✓ | 5993 |
Ablation Study
Bucket-level Difficulty Analysis:
- As the bucket number increases (AST depth increases), completion difficulty gradually rises.
- Completion EM for shallow nodes (buckets 1-3) is roughly 50% or higher, dropping to about 30% for deep nodes (buckets 8-10).
- Fine-tuning shows more pronounced improvements on deep nodes.
Semantic-level Analysis:
- Performance on the same semantic category varies significantly across languages.
- Declarations and definitions show the best completion performance.
- Control flow structures and special language structures are relatively difficult to complete.
- Different languages have distinct bottlenecks: for example, Haskell is challenging in program structure, while Go is relatively easy in control flow.
Key Findings
- Fine-tuning on M2rc-Instruct brings significant improvements across all 18 languages, with average EM increases of over 20 percentage points.
- Cross-file context retrieval alone has a limited impact on completion performance (about 2-4 EM percentage points), but its effect is amplified when combined with fine-tuning.
- The complexity of cross-file dependencies varies heavily across programming languages, with HTML being the lowest, and Scala and C++ being the highest.
- Bucket-level annotations effectively distinguish completion scenarios of different difficulty levels.
- Semantic-level annotations reveal how model capabilities vary across different code structures, pointing the way for model improvements.
Highlights & Insights
- AST-Driven Completion Cursor Selection: Compared to choosing random strings, selecting based on AST nodes ensures the semantic integrity of the completions.
- Two-Level Fine-Grained Annotation System: The bucket level (depth-wise difficulty) and the semantic level (code semantic category) allow results to be analyzed along two dimensions.
- "Reverse Filtering" in Quality Control: Removing samples that small models can predict without cross-file context ensures the test set truly evaluates repository-level capability.
- Broad Coverage: Covering 18 programming languages spanning both mainstream and niche languages fills a critical gap in multilingual code completion evaluation.
- Practical Value: M2rc-Instruct provides ready-to-use, multilingual, repository-level code instruction data.
Limitations & Future Work
- The completion span is capped at 5 lines, which may underestimate the difficulty of long-sequence completion.
- Only string matching metrics (EM and ES) are used, without considering semantic equivalence (such as functionally equivalent code written differently).
- Grammar tags from Tree-sitter are highly fine-grained, so mapping them to a small set of semantic labels may lose information.
- Only 7B-scale models are evaluated, without covering larger or more modern code LLMs.
- The cross-file retrieval strategy remains simple (Jaccard similarity); more advanced retrieval methods could be explored.
Related Work & Insights
- CrossCodeEval: A 4-language repository-level completion benchmark; this work significantly expands on it in terms of language coverage and fine-grained annotation.
- RepoBench: Establishes a retrieval-augmented paradigm for repository-level completion; this work adopts and extends it.
- MultiPL-E / McEval: Multilingual in-file code generation benchmarks; this work generalizes the multilingual approach to the repository level.
- R2C2-Bench: A 4-language benchmark; this work covers more languages (18 vs. 4) and more test repositories (5,993 vs. 1,353), and adds fine-grained annotations.
- Insight: Fine-grained evaluation and annotation systems are crucial for diagnosing model capability bottlenecks.
Rating
- Novelty: ⭐⭐⭐ — The core contributions lie in benchmark construction and annotation design; technical novelty is moderate.
- Practicality: ⭐⭐⭐⭐⭐ — Fills an important gap in multilingual repository-level code completion evaluation.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation across 18 languages \(\times\) 3 models \(\times\) 3 settings.
- Writing Quality: ⭐⭐⭐ — Detailed content, but with too many tables, making the main narrative line sometimes unclear.