💻 Code Intelligence

💬 ACL2025 · 28 paper notes

🔥 Top topics: Code Intelligence ×13 · LLM ×8 · Agents ×4

LongCodeU: Benchmarking Long-Context Language Models on Long Code Understanding: The authors propose the LongCodeU benchmark, which designs 8 tasks across four dimensions—code unit perception, intra-code unit understanding, inter-code unit relation understanding, and long documentation understanding—to evaluate the comprehension capabilities of 9 long-context language models (LCLMs) on real-world, repository-level long code, revealing that 32K tokens is the practical upper limit for current LCLM long code understanding.
Beyond Sequences: Two-dimensional Representation and Dependency Encoding for Code Generation: This paper proposes a two-dimensional code representation that moves beyond traditional one-dimensional sequence representations. By explicitly encoding the structural dependency relationships of code (such as syntax tree structures and variable dependencies), it significantly improves the accuracy and structural correctness of code generation.
CoCo-Bench: A Comprehensive Code Benchmark for Multi-task Large Language Model Evaluation: This paper introduces CoCo-Bench (Comprehensive Code Benchmark), a comprehensive code benchmark covering four dimensions: code understanding, code generation, code modification, and code review. It supports multiple programming languages and difficulty levels, ensures data quality through rigorous manual review, and reveals the unbalanced performance of existing LLMs in coding capabilities.
CodeDPO: Aligning Code Models with Self Generated and Verified Source Code: Proposes CodeDPO, which constructs high-quality preference pairs (93K for correctness and 21K for efficiency) from self-generated code via a PageRank-inspired self-validation scoring mechanism. After DPO training, it achieves an average improvement of over 10 points on HumanEval across 8 code models, while speeding up code execution by 1.25-1.45×.
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation: CodeIF is proposed as the first systematic benchmark to evaluate the instruction-following capabilities of LLMs in code generation. It includes 50 fine-grained constraint instructions across 8 major categories, introduces 4 new evaluation metrics, and comprehensively evaluates 35 SOTA models.
CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models: The CodeReviewQA benchmark is proposed, decomposing the Automated Code Refinement (ACR) task into three intermediate reasoning steps: Change Type Recognition (CTR), Change Localization (CL), and Solution Identification (SI). Each step is formulated as a multiple-choice question-answering (MCQA) probe with different difficulty levels. Evaluated with 72 LLMs on 900 human-verified, high-quality samples (across 9 languages), it reveals specific weaknesses of models in code review comprehension.
CompileAgent: Automated Real-World Repo-Level Compilation with Tool-Integrated LLM-based Agent System: Proposes CompileAgent, the first LLM agent framework designed for repository-level code compilation. By integrating five specialized tools and a flow-based agent strategy, it improves the compilation success rate by up to 71% on CompileAgentBench (consisting of 100 real-world C/C++ projects), costing only $0.22 per project on average.
CoRet: Improved Retriever for Code Editing: Proposes CoRet, a dense retrieval model tailored for code editing tasks. By integrating code semantics, repository-level file hierarchy, and call graph dependencies, and employing a log-likelihood loss function designed for repository-level retrieval, CoRet improves recall by at least 15 percentage points over existing models on SWE-bench and Long Code Arena.
DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal: This paper proposes DARS (Dynamic Action Re-Sampling), an inference-time compute scaling method for coding agents. It dynamically branches and attempts alternative actions at key decision points where the agent makes suboptimal choices. Using Claude 3.5 Sonnet V2, DARS achieves a 55% pass@k and 47% pass@1 on SWE-Bench Lite, outperforming the open-source SOTA frameworks of the time.
DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation: This paper proposes DynaCode, a dynamic, complexity-aware code generation benchmark. By classifying code problems based on cyclomatic complexity and nesting them using Call Graphs, DynaCode dynamically generates approximately 189 million unique problems. This design effectively mitigates data contamination and systematically evaluates the code generation capabilities of LLMs across different complexity levels.
ExploraCoder: Advancing Code Generation for Multiple Unseen APIs via Planning and Chained Exploration: This work proposes the training-free ExploraCoder framework. It decomposes complex multi-API programming problems into subtasks through task planning, and progressively conducts experiments to accumulate experiences on correct API usages via Chained API Exploration (CoAE). It achieves up to 17.28% absolute improvement in pass@10 on multi-API unseen library benchmarks.
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation: FEA-Bench is proposed as the first benchmark evaluating LLM capabilities in feature implementation within repository-level codebases. It contains 1,401 task instances from 83 GitHub repositories, with each instance equipped with unit tests. The strongest model, DeepSeek-R1, solves only about 10% of the tasks, revealing the significant challenges repository-level incremental development poses to current LLMs.
GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding: This paper proposes GALLa, which encodes the AST/DFG structural graph of code using a GNN and aligns it to the LLM embedding space via a cross-modal adapter. It injects code structural information as an auxiliary task during fine-tuning, and discards the GNN and adapter during inference to achieve zero extra overhead, yielding consistent improvements across 5 code tasks and 7 baseline LLMs (ranging from 350M to 14B parameters).
GiFT: Gibbs Fine-Tuning for Code Generation: Proposes Gibbs Fine-Tuning (GiFT), which is inspired by Gibbs sampling. It samples self-generated code from the marginal distribution instead of the conditional distribution through iterative "code $\rightarrow$ description $\rightarrow$ code" translation. Combined with perplexity-guided long-tail data selection, it improves by up to 9.8% over standard self-training on APPS+, MBPP+, and CodeInsight.
MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios: This work introduces MLDebugging, the first comprehensive benchmark tailored for multi-library Python code debugging. It comprises 1,175 samples spanning 126 Python libraries and 7 bug categories, and systematically evaluates mainstream open-source and closed-source LLMs in multi-library debugging scenarios, finding that current LLMs still have substantial room for improvement on this task.
OASIS: Order-Augmented Strategy for Improved Code Search: OASIS is proposed to capture subtle nuances in code semantics by introducing order-based similarity labels for negative pairs. By training code embedding models with a dual loss function combining InfoNCE and CoSENT, OASIS consistently outperforms existing state-of-the-art (SOTA) models on NL2Code and Code2Code search tasks across three benchmarks: CoSQA, AdvTest, and CodeSearchNet.
Personality-Guided Code Generation Using Large Language Models: This work dynamically generates matching MBTI personality types and detailed descriptions for each programming task using GPT-4o, and guides the target LLM to generate code by role-playing a programmer with this personality. Across 28 combinations of 7 LLMs and 4 datasets, improvements in pass rates are achieved in 23 cases (up to 12.9%). The key factor is personality diversity rather than any single specific personality.
Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment: This work builds a program synthesis benchmark based on the XLogoOnline visual programming environment, requiring a combination of multiple skills such as spatial planning, programming, and logical reasoning. The evaluation shows that GPT-4V solves only 20% of the tasks. However, after fine-tuning on more than 80k synthetic examples combined with simulator-driven curriculum learning, Llama3-8B significantly outperforms both GPT-4V and Llama3-70B.
ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation: ReflectionCoder achieves state-of-the-art (SOTA) performance in one-off code generation without requiring multi-round runtime debugging. It does this by constructing "reflection sequence" data that integrates compiler feedback, combined with two training strategies: reflection self-distillation and dynamically masked distillation.
Rethinking Repetition Problems of LLMs in Code Generation: This paper redefines the repetition problem in code generation by distinguishing "structural repetition," which is more prevalent and more challenging than content repetition. It proposes RPG (Repetition Penalization based on Grammar), a decoding method with grammar-rule-based repetition penalty, which significantly mitigates repetition issues on both the newly constructed CodeRepetEval and standard benchmarks.
Revisit Self-Debugging with Self-Generated Tests for Code Generation: This paper systematically investigates the effectiveness of self-debugging with self-generated tests using LLMs. It finds that post-execution-based self-debugging degrades performance on basic programming problems due to self-generated test bias. Conversely, in-execution self-debugging successfully avoids this bias, achieving consistent improvements on both basic and competitive programming tasks.
SceneGenAgent: Precise Industrial Scene Generation with Coding Agent: This work proposes SceneGenAgent, an LLM-based code-generation agent. Through structured layout planning, layout verification, and iterative refinement, it uses C# code to generate industrial scenes with high precision. It achieves an 81% success rate on real industrial tasks and constructs the SceneInstruct dataset, enabling open-source LLMs to perform close to GPT-4o.
SHARE: An SLM-based Hierarchical Action CorREction Assistant for Text-to-SQL: Proposes the SHARE framework, which uses three dedicated Small Language Models (SLMs), each with fewer than 8B parameters, to form a sequential pipeline. It translates declarative SQL into step-by-step action trajectories that expose the reasoning path, and then corrects schema linking errors and logical reasoning errors in stages, achieving self-correction for LLM Text-to-SQL at an extremely low cost.
STaR-SQL: Self-Taught Reasoner for Text-to-SQL: This paper reformulates the Text-to-SQL task as a reasoning-driven process. By employing the STaR (Self-Taught Reasoner) bootstrapping approach, it enables LLMs to learn how to generate step-by-step rationales to assist in SQL generation. Integrated with an Outcome-supervised Reward Model (ORM) validator for best-of-N sampling, the framework achieves an 86.6% execution accuracy on the Spider benchmark.
TeXpert: A Multi-Level Benchmark for Evaluating LaTeX Code Generation by LLMs: Proposes TeXpert, the first multi-difficulty-level benchmark for systematically evaluating LLMs' capability to generate LaTeX code for scientific documents from natural language instructions. It contains 440 high-quality samples across three levels (Simple/Average/Hard). Evaluation of 9 open-source and closed-source LLMs reveals that LaTeX generation remains a significant weakness (accuracy on the Hard task is generally below 17.5%), with logical and formatting errors being the primary bottlenecks.
Tree-of-Code: A Tree-Structured Exploring Framework for End-to-End Code Generation: The Tree-of-Code (ToC) framework is proposed, which organizes nodes, each holding a complete end-to-end program (CodeProgram), in a tree structure. Combined with an execution-result-based reflection mechanism and randomized prompt/model exploration strategies, it achieves nearly 20% higher accuracy on complex tasks compared to CodeAct with less than 1/4 of the interaction turns, under zero-shot settings without annotated data.
Tree-of-Evolution: Tree-Structured Instruction Evolution for Code Generation in Large Language Models: Proposes Tree-of-Evolution (ToE), a tree-structured code instruction synthesis framework. By leveraging multi-path evolution and quality-driven optimization, ToE overcomes the limitations of unidirectional synthesis and random generation in Code Evol-Instruct and OSS-Instruct. Fine-tuning a base model with only 75K of its synthesized samples matches or exceeds the performance of Qwen2.5-Coder-Instruct (fine-tuned on millions of samples).
UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench: This paper proposes the UTBoost framework, which enhances test case coverage of SWE-Bench through an LLM-based test case generator (UTGenerator) and an improved parser. It identifies 36 inadequately tested instances and 345 patches falsely flagged as passed, leading to ranking changes of 40.9% on SWE-Bench Lite and 24.4% on SWE-Bench Verified.
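
Several notes above report pass@1, pass@10, or pass@k (for example DARS and ExploraCoder). As a rough aid for reading those numbers, the sketch below computes the unbiased pass@k estimator that is widely used in code-generation evaluation. It is an illustration under assumptions: the notes do not state how each paper computes its metric, and the names `pass_at_k` and `mean_pass_at_k` are hypothetical helpers, not taken from any of the papers.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples fail), drawing without replacement.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given as (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For instance, `pass_at_k(10, 3, 1)` is the fraction of passing samples (3 of 10), and the estimate can only grow as `k` increases for fixed `n` and `c`, which is why pass@k figures are not directly comparable across different values of `k`.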