💻 Code Intelligence

💬 ACL2026 · 20 paper notes

Across Programming Language Silos: A Study on Cross-Lingual Retrieval-Augmented Code Generation: This paper presents the first systematic study of cross-programming-language retrieval-augmented code generation (RACG), constructing a 14K-instance dataset spanning 13 programming languages, and reveals the asymmetry of cross-lingual knowledge transfer and its relationship to language family relatedness and pretraining data diversity.
CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment: This paper proposes CodeRL+, which integrates execution semantics alignment into the RLVR training pipeline. By training the model to infer variable-level execution traces, CodeRL+ bridges the gap between code text representations and execution semantics, achieving an average pass@1 improvement of 4.6% on code generation, 15.5% on code reasoning, and 4.4% on test output generation benchmarks.
CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases: This paper presents CodeWiki, an open-source framework for automated repository-level code documentation generation, built on hierarchical decomposition and recursive multi-agent processing. It also introduces the CodeWikiBench benchmark, on which CodeWiki achieves a quality score of 68.79% across seven programming languages, surpassing the closed-source system DeepWiki (64.06%).
CollabCoder: Plan-Code Co-Evolution via Collaborative Decision-Making for Efficient Code Generation: This paper proposes CollabCoder, a plan-code co-evolution framework that employs a Collaborative Decision Module (CDM) to determine whether errors should be repaired at the plan level or the code level, and a Reasoning Trajectory module (RT) to enable self-improving debugging that learns from failures. CollabCoder outperforms strong baselines by 11–20% on challenging programming benchmarks while reducing API calls by 4–10.
DeepGuard: Secure Code Generation via Multi-Layer Semantic Aggregation: This paper proposes DeepGuard, which overcomes the "final-layer bottleneck" by aggregating representations from multiple upper Transformer layers via an attention mechanism. Combined with multi-objective training and a lightweight inference-time safety guidance strategy, it achieves an average improvement of 11.9% in secure-and-correct generation rate across five code LLMs.
EET: Experience-Driven Early Termination for Cost-Efficient Software Engineering Agents: This paper proposes EET, an experience-driven early termination method that identifies unproductive iterations during the patch generation and patch selection phases. It reduces the total cost of SE agents by 19%–55% (32% on average) with negligible performance degradation (at most 0.2%).
From Charts to Code: A Hierarchical Benchmark for Multimodal Models: This paper proposes Chart2Code, a hierarchical benchmark comprising 2,186 tasks spanning 22 chart types, organized into three progressively challenging levels: chart reproduction (Level 1), chart editing (Level 2), and long-table-to-chart generation (Level 3). The benchmark evaluates 29 state-of-the-art multimodal models and reveals that even the strongest model, GPT-5.2, achieves a chart quality score of only 33.41 on editing tasks, exposing significant deficiencies in current models for practical chart code generation.
From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation: This work demonstrates that existing bias evaluations of LLM code generation severely underestimate real-world risk: in ML pipeline generation, sensitive attributes appear in 87.7% of feature selection decisions (vs. 59.2% in conditional statements), and models correctly exclude irrelevant features yet consistently retain sensitive attributes such as race and gender, revealing systematic implicit discrimination.
LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software: This paper presents LogicEval, the first systematic evaluation framework for logical vulnerability repair, along with the LogicDS dataset (61 real-world logical vulnerabilities and 61 synthetic Java samples). It evaluates both traditional AVR tools and LLMs on logical vulnerability repair, finding that LLMs perform best when provided with auxiliary information, yet overall repair rates remain low (only 5 out of 61 real-world samples correctly repaired). Key bottlenecks identified include prompt sensitivity, context loss, and patch localization difficulty.
MARS2: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation: This paper proposes MARS2, a multi-agent reinforcement tree search framework that embeds multiple independently optimized policies into a shared search tree for collaborative exploration. Using Thompson sampling for agent–node pair selection, tree-consistent reward shaping, and path-level group advantage estimation, the framework consistently improves single-model Pass@1 by up to 8.0% and system-level Pass@1 (MCTS) by up to 6.5% on code generation benchmarks.
OmniDiagram: Advancing Unified Diagram Code Generation via Visual Interrogation Reward: This paper proposes OmniDiagram, a unified diagram code generation framework covering three languages (LaTeX/Mermaid/PlantUML) and three tasks (diagram-to-code, diagram editing, text-to-code). It introduces the Viva (Visual Interrogation Verifies All) reward mechanism based on visual question answering to guide RL training, achieving state-of-the-art performance on multiple benchmarks.
Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?: This paper reveals the "regeneration" tendency of frontier LLMs on debugging tasks. By introducing the PDB framework along with edit-level precision and bug-level recall metrics, the authors find that models such as GPT-5.1-Codex pass over 76% of unit tests yet achieve edit precision below 45%, and that iterative and agent-based debugging strategies fail to substantially improve precision.
QiMeng-PRepair: Precise Code Repair via Edit-Aware Reward Optimization: This paper identifies the "over-editing" problem in LLM-based code repair—where models tend to rewrite large portions of code rather than precisely localizing and fixing bugs—and proposes the PRepair framework. Through Self-Breaking (diversified bug injection) and Self-Repairing (edit-aware GRPO training), PRepair significantly improves repair precision while maintaining correctness and accelerating speculative decoding inference.
ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization: This paper proposes ReFEree, a reference-free and fine-grained factual consistency evaluation method for real-world code summarization. It defines four categories of inconsistency criteria and evaluates at the sentence-segment level. Combined with a dependency information retrieval mechanism, ReFEree achieves 15–18% improvement in human judgment correlation over the previous state of the art on Python and Java.
River-LLM: Large Language Model Seamless Exit Based on KV Share: This paper proposes River-LLM, a training-free framework that addresses the KV Cache absence problem in Early Exit for decoder-only architectures by constructing a lightweight KV-shared exit channel (Exit River). It leverages state transition similarity to guide exit decisions, achieving 1.71×–2.16× real wall-clock inference speedup with near-lossless generation quality.
Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Understanding: This paper distinguishes between lexical recall (verbatim code retrieval) and semantic recall (understanding runtime code semantics), demonstrating that frontier LLMs achieve near-perfect lexical recall yet exhibit severe semantic recall degradation in long contexts. The paper introduces the SemTrace benchmark, revealing that existing evaluations substantially underestimate the extent of semantic understanding failures.
SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization: This paper proposes SOCIA-EVO, an LLM agent framework that reformulates automated simulator construction as a dual-anchored evolutionary process. It anchors empirical constraints via a static Blueprint, decouples structural revision and parameter calibration through bi-level optimization, and manages repair hypotheses via a self-curated strategy Playbook with Bayesian-weighted retrieval guided by execution feedback. SOCIA-EVO significantly outperforms baselines such as Reflexion and G-SIM on three simulation tasks: user modeling, mask-wearing diffusion, and personal mobility.
SolidCoder: Bridging the Mental-Reality Gap in LLM Code Generation through Concrete Execution: SolidCoder transforms code verification from LLM "imagined execution" to "real execution" via the S.O.L.I.D. architecture (Shift-left Planning, Oracle-based Assertions, Live Execution, Intermediate Simulation, Defensive Accumulation), achieving pass@1 scores of 95.7% on HumanEval, 77.0% on CodeContests, and 26.7% on APPS with GPT-4o.
StoryCoder: Narrative Reformulation for Structured Reasoning in LLM Code Generation: This paper proposes StoryCoder, a prompting framework that reformulates code generation problems into coherent natural language narratives. By guiding LLMs through three narrative components—task overview, constraints, and examples—the framework achieves an average zero-shot pass@10 improvement of 18.7% across 11 models.
The Path Not Taken: Duality in Reasoning about Program Execution: This paper introduces the concept of duality in program execution reasoning. Through the DexBench benchmark (445 paired instances), it jointly evaluates LLMs on forward execution reasoning (predicting code coverage under a given input) and backward counterfactual reasoning (inferring input mutations that redirect execution to a target branch). The results reveal that strong performance in a single direction does not transfer to success under joint evaluation, exposing a fundamental deficiency in models' causal understanding of program execution.
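
Several of the notes above report pass@k scores (CodeRL+, MARS2, SolidCoder, StoryCoder). As a reminder of what that metric measures, here is a minimal sketch of the widely used unbiased pass@k estimator, 1 − C(n−c, k) / C(n, k), where n samples are drawn per problem and c of them pass all tests. The function name `pass_at_k` and the sample counts are illustrative assumptions; the individual papers may compute or average their scores differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated for the problem (assumed setup)
    c: number of those samples that pass every test
    k: evaluation budget, with 1 <= k <= n
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a passing one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 20 samples per problem, 5 of which pass.
    for k in (1, 10):
        print(f"pass@{k} = {pass_at_k(20, 5, k):.4f}")
```

With k = 1 the estimate reduces to the plain fraction of passing samples (c / n), which is why pass@1 and pass@10 figures from different notes are not directly comparable.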