LongCodeU: Benchmarking Long-Context Language Models on Long Code Understanding

Conference: ACL 2025
arXiv: 2503.04359
Code: None
Area: Code Intelligence
Keywords: long-context, code understanding, benchmark, code unit, dependency analysis

TL;DR

The authors propose LongCodeU, a benchmark of 8 tasks across four dimensions (code unit perception, intra-code unit understanding, inter-code unit relation understanding, and long documentation understanding) that evaluates how well 9 long-context language models (LCLMs) comprehend real-world, repository-level long code. The results suggest that roughly 32K tokens is the practical upper limit of long code understanding for current LCLMs.

Background & Motivation

Background: Long-Context Language Models (LCLMs) such as GPT-4o (128K) and Gemini-1.5 (1M) claim to support ultra-long context windows, enabling potential software engineering applications like repository-level code generation, issue fixing, and long code summarization. However, existing benchmarks do not rigorously measure whether LCLMs truly "understand" long code.

Limitations of Prior Work: The first type of benchmarks (e.g., RepoQA, L-Eval) suffers from four issues: ❶ insufficient task diversity, focusing only on single tasks such as needle function search; ❷ artificial concatenation of independent code snippets to construct "long code," which ignores dependencies in real-world code; ❸ lack of restriction on code release dates, posing data contamination risks; ❹ a maximum length of only 36.5K tokens, failing to stress-test 128K-1M context windows. The second type of benchmarks (e.g., LongBench, SWE-bench) evaluates indirectly via downstream tasks, leading to a fifth issue: ❺ code understanding capability is entangled with task-specific challenges such as code generation or bug fixing, making it impossible to measure comprehension independently.

Key Challenge: Are LCLMs claiming to support ultra-long contexts truly effective in real-world long code understanding tasks? Existing benchmarks cannot provide reliable answers.

Key Insight: Construct a long code understanding benchmark that covers the complete skill spectrum from basic perception to complex reasoning and uses real repository code created after 2024-06, so that comprehension can be evaluated independently of downstream tasks.

Method

Overall Architecture

LongCodeU designs 8 tasks across four dimensions, with approximately 500 samples per task, covering lengths from 0 to 128K tokens:

Code Unit Perception (CU_P) — 1 task
Intra-Code Unit Understanding — 2 tasks: Data Flow Analysis (CU_DFA) + Semantic Analysis (CU_SA)
Inter-Code Unit Relation Understanding — 4 tasks: Dependency Relation Analysis T1/T2 (DRA) + Semantic Relation Extraction T1/T2 (SRE)
Long Documentation Understanding (LDU) — 1 task

All tasks share a unified workflow: Given instruction + long code + anchor input \(\rightarrow\) LCLM outputs the answer.
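The unified workflow can be pictured as plain prompt assembly. The following is a minimal sketch, not the authors' template: the `Sample` class, the `build_prompt` function, and the section markers (`### Instruction` and so on) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One benchmark sample (field names are illustrative assumptions)."""
    task: str         # task code, e.g. "DRA_T1"
    instruction: str  # natural-language description of the task
    long_code: str    # repository-level source code used as context
    anchor: str       # anchor input, e.g. a function name or a description


def build_prompt(sample: Sample) -> str:
    """Concatenate instruction, long code, and anchor input in a fixed order."""
    return (
        f"### Instruction\n{sample.instruction}\n\n"
        f"### Code\n{sample.long_code}\n\n"
        f"### Input\n{sample.anchor}\n\n"
        f"### Answer\n"
    )


sample = Sample(
    task="DRA_T1",
    instruction="List the functions that the given function invokes.",
    long_code="def load(p):\n    return open(p).read()\n\ndef run(p):\n    return load(p)\n",
    anchor="run",
)
prompt = build_prompt(sample)
```

The model's completion after the final marker would then be parsed and scored with the metric that matches the task's output granularity.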

Key Designs

Four-Dimension Eight-Task System:
- Function: Progressive evaluation from basic perception to complex reasoning
- Mechanism: CU_P tests function identification capability (foundation) \(\rightarrow\) CU_DFA/CU_SA test intra-unit variable tracking and semantic understanding \(\rightarrow\) DRA tests cross-unit invocation relations (including code-to-code and natural language-to-code modes) \(\rightarrow\) SRE tests semantic similarity reasoning \(\rightarrow\) LDU tests long document information extraction
- Design Motivation: The four dimensions correspond to the practical requirements of repository-level development: identifying functions, understanding function logic, analyzing relationships between functions, and reading technical documentation
Real-World Repository Code Construction Pipeline:
- Function: A six-stage pipeline to ensure data quality and authenticity
- Mechanism: Stage ❶ selects 116 non-forked repositories with 50+ stars that were created after 2024-06, from the top 50 popular PyPI packages \(\rightarrow\) Stage ❷ uses tree-sitter static analysis to extract function definitions and dependency relationships, and uses embedding models to compute semantic similarity \(\rightarrow\) Stage ❸ extracts requirements descriptions from code signatures \(\rightarrow\) Stage ❹ manually annotates 500 document understanding samples \(\rightarrow\) Stage ❺ excludes trivial functions and deduplicates \(\rightarrow\) Stage ❻ constructs samples according to a length distribution of 0-128K.
- Design Motivation: Utilizing code developed after 2024-06 significantly mitigates data contamination risks; real repository code contains natural dependencies, far superior to concatenated code
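To make Stage ❷ concrete, here is a small sketch of extracting function definitions and their call dependencies. The summary says the authors use tree-sitter; this sketch swaps in Python's standard `ast` module so that it stays self-contained. The function name `extract_functions_and_calls` and the example source are hypothetical, and only direct calls by bare name within a single file are resolved.

```python
import ast


def extract_functions_and_calls(source: str) -> dict[str, set[str]]:
    """Map each function defined in `source` to the locally defined functions it calls."""
    tree = ast.parse(source)
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    defined = {n.name for n in ast.walk(tree) if isinstance(n, func_types)}

    deps: dict[str, set[str]] = {}
    for node in ast.walk(tree):
        if not isinstance(node, func_types):
            continue
        calls = set()
        for inner in ast.walk(node):
            # Only bare-name calls such as `load(x)`; attribute calls like
            # `obj.method()` are ignored in this simplified sketch.
            if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                name = inner.func.id
                # Keep calls to functions defined in this file; skip self-recursion.
                if name in defined and name != node.name:
                    calls.add(name)
        deps[node.name] = calls
    return deps


EXAMPLE = '''
def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def run(path):
    return parse(load(path))
'''

deps = extract_functions_and_calls(EXAMPLE)
```

A real pipeline would also need cross-file resolution (imports, methods, aliases), which is where a dedicated static analysis setup earns its keep.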
Fine-grained Evaluation Metric System:
- Function: Customizing evaluation metrics according to the output granularity of different tasks
- Mechanism: Utilizing EM-R/EM-P for code line outputs; LCS-R/LCS-P (longest common subsequence) for function name outputs; CodeBLEU-R/CodeBLEU-P for code unit outputs; and BLEU for natural language descriptions
- Design Motivation: The Kendall-Tau \(\tau\) correlation between automatic evaluation and human evaluation averages \(\ge 0.75\) across all tasks, with the minimum value exceeding 0.7, validating the reliability of the metrics.
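The recall/precision pairs above can be illustrated with a short sketch. The exact formulas used in the paper are not given in this summary, so the definitions below are assumptions: EM-R/EM-P are computed over sets of exactly matching items, and LCS-R/LCS-P divide the longest-common-subsequence length by the reference and prediction lengths respectively. All function names are hypothetical.

```python
def em_recall_precision(pred: list[str], ref: list[str]) -> tuple[float, float]:
    """Exact-match recall and precision over sets of output items (assumed definition)."""
    pred_set, ref_set = set(pred), set(ref)
    hits = len(pred_set & ref_set)
    recall = hits / len(ref_set) if ref_set else 0.0
    precision = hits / len(pred_set) if pred_set else 0.0
    return recall, precision


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_recall_precision(pred: str, ref: str) -> tuple[float, float]:
    """LCS-based recall and precision between two strings (assumed definition)."""
    common = lcs_length(pred, ref)
    recall = common / len(ref) if ref else 0.0
    precision = common / len(pred) if pred else 0.0
    return recall, precision


em_r, em_p = em_recall_precision(["a", "b", "c"], ["a", "b", "d", "e"])
lcs_r, lcs_p = lcs_recall_precision("parse_args", "parse_arguments")
```

Reporting recall and precision separately distinguishes a model that misses required items from one that pads its answer with spurious ones.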

Loss & Training

This paper presents a benchmark and does not involve model training. Evaluation uses greedy decoding, and each model is run within the maximum context window it supports.
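For readers unfamiliar with the term, greedy decoding picks the single highest-scoring token at every step, which makes outputs deterministic. The sketch below uses a toy lookup table in place of a real model; `greedy_decode`, `toy_model`, and the `<eos>` marker are hypothetical names, not part of any evaluated system.

```python
from typing import Callable


def greedy_decode(
    score_next: Callable[[list[str]], dict[str, float]],
    prompt: list[str],
    eos: str = "<eos>",
    max_new_tokens: int = 16,
) -> list[str]:
    """Repeatedly append the highest-scoring next token until `eos` or the length cap."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = score_next(tokens)
        best = max(scores, key=scores.get)  # argmax over candidate tokens
        if best == eos:
            break
        tokens.append(best)
    return tokens[len(prompt):]  # return only the newly generated tokens


def toy_model(tokens: list[str]) -> dict[str, float]:
    """Stand-in for a language model: scores depend only on the last token."""
    table = {
        "def": {"foo": 0.9, "bar": 0.1},
        "foo": {"(": 0.8, "<eos>": 0.2},
        "(": {")": 0.7, "x": 0.3},
        ")": {"<eos>": 1.0},
    }
    return table[tokens[-1]]


output = greedy_decode(toy_model, ["def"])
```

Because no sampling is involved, repeated runs on the same input give the same answer, which keeps benchmark scores reproducible.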

Key Experimental Results

Main Results

Nine LCLMs were evaluated (6 general-purpose + 3 code models); the table below lists seven of them. All values are recall metrics:

Model	Parameters	CU_P	CU_SA	CU_DFA	DRA_T1	DRA_T2	SRE_T1	SRE_T2	LDU	Average
DeepSeek-V2.5	236B	70.58	82.11	77.47	72.25	56.80	49.08	47.42	85.85	67.70
GPT-4o	—	56.42	86.76	87.87	71.58	48.88	44.45	43.14	87.54	65.83
Gemini-1.5-Flash	—	58.45	83.46	80.37	72.51	46.42	39.84	38.69	81.43	61.39
CodeLlama	33.7B	68.57	62.41	79.87	68.82	34.94	44.48	36.34	46.92	55.29
Mistral-v0.3	7.3B	57.42	63.90	58.00	46.66	18.92	33.91	32.50	58.64	46.24
Claude-3.5-Sonnet	—	43.82	40.60	45.65	29.37	28.70	26.55	27.77	41.81	35.53
Phi-3.5	3.8B	39.92	46.75	49.52	30.76	9.66	18.99	14.48	34.14	30.53
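As a sanity check on the table, the Average column can be recomputed from the eight task columns. This sketch assumes Average is the unweighted mean of the eight task scores, which this summary does not state explicitly. Under that assumption six of the seven rows reproduce to two decimals, while the Gemini-1.5-Flash row gives roughly 62.65 against the reported 61.39, so one cell in that row may have been mistranscribed. The names `ROWS` and `check_averages` are illustrative.

```python
# Task scores in table order: CU_P, CU_SA, CU_DFA, DRA_T1, DRA_T2, SRE_T1, SRE_T2, LDU,
# followed by the reported Average.
ROWS = {
    "DeepSeek-V2.5": ([70.58, 82.11, 77.47, 72.25, 56.80, 49.08, 47.42, 85.85], 67.70),
    "GPT-4o": ([56.42, 86.76, 87.87, 71.58, 48.88, 44.45, 43.14, 87.54], 65.83),
    "Gemini-1.5-Flash": ([58.45, 83.46, 80.37, 72.51, 46.42, 39.84, 38.69, 81.43], 61.39),
    "CodeLlama": ([68.57, 62.41, 79.87, 68.82, 34.94, 44.48, 36.34, 46.92], 55.29),
    "Mistral-v0.3": ([57.42, 63.90, 58.00, 46.66, 18.92, 33.91, 32.50, 58.64], 46.24),
    "Claude-3.5-Sonnet": ([43.82, 40.60, 45.65, 29.37, 28.70, 26.55, 27.77, 41.81], 35.53),
    "Phi-3.5": ([39.92, 46.75, 49.52, 30.76, 9.66, 18.99, 14.48, 34.14], 30.53),
}


def check_averages(rows: dict, tol: float = 0.01) -> dict[str, tuple[float, bool]]:
    """Return, per model, the recomputed mean and whether it matches the reported one."""
    report = {}
    for model, (scores, reported) in rows.items():
        mean = sum(scores) / len(scores)
        report[model] = (round(mean, 2), abs(mean - reported) <= tol)
    return report


report = check_averages(ROWS)
mismatched = [model for model, (_, ok) in report.items() if not ok]
```

The check cannot tell which cell is wrong, only that the row is internally inconsistent under the stated assumption.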

Ablation Study

Configuration	Key Metrics	Description
0-8K Length	Normal Performance	All models exhibit reasonable performance on short code
8K-32K Length	Slow Decline	Performance begins to degrade but remains acceptable
32K+ Length	Sharp Decline	Performance drops off a cliff, far below claimed capability
64K-128K Length	DRA/SRE near 0	Some tasks fail completely
w/o Code Context	Performance far below w/ context	Proves the models perform actual comprehension rather than memory retrieval
Code Models vs General-purpose Models	Code models of the same scale perform better	Qwen2.5-Coder outperforms Phi-3.5 by 24.31% on CU_SA
Automatic vs Human Evaluation	Kendall-Tau \(\tau \ge 0.75\)	Confirms the reliability of automatic metrics
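The Kendall-Tau agreement in the last row can be computed as follows. This is a minimal sketch of the tau-a variant, which ignores ties; libraries such as SciPy (`scipy.stats.kendalltau`) default to the tie-corrected tau-b, and the summary does not say which variant the paper used. The score lists are made-up illustrative values.

```python
from itertools import combinations


def kendall_tau_a(x: list[float], y: list[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both rankings order the pair the same way
        elif sign < 0:
            discordant += 1   # the rankings disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for five model outputs.
automatic = [0.9, 0.4, 0.7, 0.2, 0.6]
human = [5, 2, 3, 1, 4]
tau = kendall_tau_a(automatic, human)
```

In this example nine of the ten pairs are ordered identically by both raters, giving a tau of 0.8, which is in the same range as the agreement reported for the benchmark's metrics.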

Key Findings

32K is the Practical Limit: The performance of all LCLMs drops sharply once the code length exceeds 32K tokens, which is in stark contrast with their claimed 128K-1M context windows.
Inter-unit relation understanding is the hardest: DRA and SRE tasks pose the most significant bottlenecks for all models, particularly DRA_T2, which requires cross-file dependency tracking.
No universal champion: No single LCLM achieves optimal performance across all 8 tasks—code models excel in code perception, while general-purpose models perform better in documentation understanding.
Performance degradation rate varies by task: Documentation understanding exhibits the steepest degradation curve, whereas code unit understanding remains relatively stable.
Comprehension vs. Memory: The w/o context ablation confirms that LCLMs genuinely engage in code comprehension rather than simple memory retrieval.
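The length-related findings above come from grouping samples by input length and averaging scores per group. The sketch below shows one way to do that. The five bucket boundaries are an assumption (this summary mentions five length intervals but does not list them), and the sample scores are invented for illustration only.

```python
from collections import defaultdict

# Assumed interval boundaries in tokens; the paper's exact intervals may differ.
BUCKETS = [(0, 8_000), (8_000, 16_000), (16_000, 32_000), (32_000, 64_000), (64_000, 128_000)]


def bucket_label(n_tokens: int) -> str:
    """Return the label of the half-open interval [lo, hi) containing n_tokens."""
    for i, (lo, hi) in enumerate(BUCKETS):
        is_last = i == len(BUCKETS) - 1
        # The final bucket also includes its upper bound (exactly 128K tokens).
        if lo <= n_tokens < hi or (is_last and n_tokens == hi):
            return f"{lo // 1000}K-{hi // 1000}K"
    raise ValueError(f"length {n_tokens} is outside 0-128K")


def mean_score_by_bucket(results: list[tuple[int, float]]) -> dict[str, float]:
    """Average the scores of (n_tokens, score) pairs within each length bucket."""
    grouped = defaultdict(list)
    for n_tokens, score in results:
        grouped[bucket_label(n_tokens)].append(score)
    return {label: sum(scores) / len(scores) for label, scores in grouped.items()}


# Hypothetical per-sample results: (input length in tokens, recall score).
results = [(4_000, 80.0), (6_000, 70.0), (20_000, 60.0), (40_000, 30.0), (100_000, 5.0)]
by_bucket = mean_score_by_bucket(results)
```

Plotting these per-bucket means for each task is what makes a drop beyond 32K visible as a cliff rather than a gradual slope.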

Highlights & Insights

The design of four dimensions and eight tasks covers the complete skill spectrum from basic perception to complex reasoning, representing the most systematic evaluation of long code understanding to date.
Using real code repositories created after 2024-06 reduces the risk of data contamination while preserving natural code dependencies.
The "w/o context" ablation is a well-designed way to separate "code comprehension" from "code memory."
The results yield practical rules of thumb for model selection: smaller models suffice for inputs of \(\le 16\text{K}\) tokens, GPT-4o or Gemini suit documentation understanding, and relation understanding calls for the strongest available model.
The findings offer a possible explanation for why GPT-4o achieves only 4% Success@1 on RepoTransBench: that result is consistent with the bottleneck in code relation understanding identified in this paper.

Limitations & Future Work

Only Python is currently supported; other languages such as Java and C++ remain to be covered.
Next-generation reasoning models such as DeepSeek-R1 and GPT-o3-mini were not evaluated due to API instability.
Due to API limitations, DeepSeek-V2.5 was tested only up to 64K tokens, so it was not evaluated on the full 128K range.
Semantic relation extraction relies on a specific embedding model (stella_en_400M_v5), which may introduce bias.
Code units are restricted to function granularity; larger granularities such as classes or modules are not considered.
The paper does not analyze how comprehension performance correlates with downstream generation performance.

Comparison with Related Work

vs RepoQA: RepoQA features only a single needle function search task with inputs up to 16K, whereas LongCodeU contains 8 tasks covering up to 128K.
vs L-Eval: L-Eval builds artificial long code by concatenating independent code snippets and ignores dependencies, while LongCodeU uses real repository code.
vs SWE-bench: SWE-bench evaluates end-to-end capabilities that mix comprehension and generation, while LongCodeU isolates the understanding capability.
vs LongBench: LongBench has an average length of only 0.4K, whereas LongCodeU averages 54.8K, a difference of roughly two orders of magnitude.

Rating

Novelty: ⭐⭐⭐ A benchmark paper with comprehensive and systematic task designs, but limited methodological innovation.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 9 models, 8 tasks, 5 length intervals, with sufficient ablation and human validation.
Writing Quality: ⭐⭐⭐⭐ Clear structure, rich charts, and practical rules of thumb.
Value: ⭐⭐⭐⭐ Fills the gap in long code understanding evaluation; the discovery of the 32K cliff has direct implications for model design.