💻 Code Intelligence
💬 ACL2026 · 50 paper notes
📌 Same area in other venues: 📷 CVPR2026 (1) · 🔬 ICLR2026 (59) · 🧪 ICML2026 (22) · 🤖 AAAI2026 (10) · 🧠 NeurIPS2025 (19) · 📹 ICCV2025 (1)
🔥 Top topics: Code Intelligence ×15 · LLM ×7 · Agents ×4 · Reasoning ×3 · Reinforcement Learning ×2
- Across Programming Language Silos: A Study on Cross-Lingual Retrieval-Augmented Code Generation
-
This paper presents the first systematic study of cross-programming-language Retrieval-Augmented Code Generation (RACG). By constructing a 14K-instance dataset across 13 languages, the study reveals the asymmetry of cross-lingual knowledge transfer and its relationship with language affinity and pre-training diversity.
- AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor
-
This paper constructs AutoMonitor-Bench, the first systematic benchmark for evaluating whether LLM-based monitors can reliably identify model misbehavior (3,010 paired samples covering safety violations, sycophancy/bias, and specification gaming). Evaluation across 22 open-source and closed-source monitoring models reveals a systematic trade-off between Miss Rate (MR) and False Alarm Rate (FAR). Furthermore, SFT experiments on 153k samples demonstrate that fine-tuning on easily constructed misbehavior fails to generalize to implicit specification gaming.
- Benchmarking Testing in Automated Theorem Proving
-
Drawing inspiration from the concept of "integration testing" in software engineering, this work judges the semantic correctness of a generated theorem by whether "all successor theorems depending on it still compile." It constructs T2, a Lean 4 benchmark with 2,206 problems, revealing a significant gap: mainstream LLMs achieve an 80%+ compilation rate but a semantic accuracy of only ~39%.
- Bootstrapping Code Translation with Weighted Multilanguage Exploration
-
BootTrans proposes a bootstrapping multilingual code translation method that leverages test cases from a single hub language (Python) as cross-language verification oracles. Combined with a dual-pool architecture for experience collection to expand training data and a language-aware weighting mechanism to prioritize difficult translation directions, it significantly outperforms baselines on HumanEval-X and TransCoder-Test.
- Can LLMs Compress (and Decompress)? Evaluating Code Understanding and Execution via Invertibility
-
This paper proposes RoundTripCodeEval (RTCE): a code reasoning benchmark using 4 lossless compression algorithms (LZW/AE/RLE/Huffman) to construct 250 inputs × 4 subtasks = 1000 strict round-trip (encode→decode must restore bit-exact data) tasks. Results show that even QwQ-32B achieves 0% EM on Huffman encoding, a failure that cannot be addressed by SFT or self-reflection.
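To make the strict round-trip criterion concrete, here is a minimal sketch using run-length encoding (RLE), one of the four algorithms named above. The function names `rle_encode`, `rle_decode`, and `round_trip_ok` are illustrative assumptions, not identifiers from the benchmark, and the benchmark's actual task format may differ.

```python
from itertools import groupby


def rle_encode(data: str) -> list[tuple[str, int]]:
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(data)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)


def round_trip_ok(data: str) -> bool:
    """Strict round-trip check: decoding the encoding must restore the input exactly."""
    return rle_decode(rle_encode(data)) == data


encoded = rle_encode("aaabcc")
restored = rle_decode(encoded)
```

A model evaluated this way would have to produce both the encoded form and the exactly restored input; a single wrong count breaks the exact match.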
- ChatHLS: Towards Systematic Design Automation and Optimization for High-Level Synthesis
-
ChatHLS proposes a multi-agent HLS design framework. Through two core components—HLSTuner (QoR-aware reasoning for optimization pragma selection) and HLSFixer (a debugging framework enhanced by hierarchical feedback)—combined with a self-evolving error case expansion mechanism (VODA), it significantly outperforms baselines in both HLS-C generation success rates and hardware performance optimization.
- ChipSeek: Optimizing Verilog Generation via EDA-Integrated Reinforcement Learning
-
ChipSeek proposes a hierarchical reward RL framework that integrates the EDA toolchain directly into the training loop. Through Curriculum-driven Dynamic Policy Optimization (CDPO), it enables LLMs to generate RTL code that meets both functional correctness and PPA (Power-Performance-Area) optimization objectives, achieving SOTA on standard benchmarks.
- CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents
-
CodeDistiller automatically distills scientific GitHub repositories into runnable and debugged example code libraries, enabling Code-RAG scientific discovery agents to utilize real-world domain tools. On 250 materials science repositories, the best model achieved a human-verified functional correctness rate of 74.1%, and experts preferred the results of the downstream discovery tasks.
- CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
-
This paper proposes CodeRL+, which integrates execution semantics alignment into the RLVR training pipeline. By enabling models to infer variable-level execution trajectories, it bridges the gap between code textual representation and execution semantics. CodeRL+ achieves an average 4.6% improvement in pass@1 for code generation and improvements of 15.5% and 4.4% on code reasoning and test output generation benchmarks, respectively.
- CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases
-
This paper proposes CodeWiki, an open-source framework based on hierarchical decomposition and recursive multi-agent processing for automatic repository-level code documentation generation. It also constructs the CodeWikiBench benchmark, on which CodeWiki achieves a quality score of 68.79% across seven programming languages, surpassing the closed-source system DeepWiki (64.06%).
- CollabCoder: Plan-Code Co-Evolution via Collaborative Decision-Making for Efficient Code Generation
-
This paper proposes CollabCoder, a plan-code co-evolution framework. Through a Collaborative Decision-Making (CDM) module, it determines whether errors should be fixed at the plan level or the code level. Combined with a Reasoning Trajectory (RT) module for self-improving debugging learned from errors, it achieves an 11-20% improvement over strong baselines on complex programming benchmarks while reducing API calls by 4-10.
- CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges
-
To be added after in-depth reading.
- CuBridge: An LLM-Based Framework for Understanding and Reconstructing High-Performance Attention Kernels
-
The authors transform the unreliable task of "modifying FlashAttention CUDA code directly via LLMs" into a three-stage workflow: "lifting to executable IR (CuIR) → transferring per PyTorch reference → differential lowering back to CUDA." This maintains 100% accuracy across 8 attention variants on A100/H100, achieving an average speedup of 16.03× over PyTorch, 1.39× over FlexAttention, and 3.33× over the previous LLM-based method Qimeng-Attention.
- DeepGuard: Secure Code Generation via Multi-Layer Semantic Aggregation
-
DeepGuard is proposed to overcome the "final-layer bottleneck" by aggregating multi-layer representations from the upper Transformer layers through an attention mechanism. Combined with multi-objective training and a lightweight inference-time security guidance strategy, it improves the security-correctness generation rate by an average of 11.9% across 5 Code LLMs.
- Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4
-
DAP introduces the concept of Hard Mode ATP (where AI must discover answers before constructing proofs, rather than using Easy Mode statements with embedded answers), releases MiniF2F-Hard and FIMO-Hard benchmarks, and designs a "discover-and-prove" two-stage framework. By using LLMs for natural language reasoning to discover answers and rewriting them into Easy Mode statements for formal provers, DAP increases solved problems from 7 to 10 on CombiBench and proves 36 theorems on PutnamBench Hard Mode for the first time.
- DPC: Training-Free Text-to-SQL Candidate Selection via Dual-Paradigm Consistency
-
DPC transforms Text-to-SQL candidate selection from "guessing on hidden data" to "deterministic verification on visible data": it constructs a Minimum Discriminating Database (MDD) to force conflicting SQLs to produce different results, and then uses a Python/Pandas solution as a reference anchor to select the correct candidate through cross-paradigm consistency, outperforming Self-Consistency by up to 2.2% on BIRD and Spider.
- DUET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode
-
This paper proposes DUET, a dual-path framework that combines direct code execution and LLM-based pseudocode execution. By performing functional majority voting to fuse two complementary execution paths—deterministic execution (reliable when code is correct but fragile to implementation errors) and pseudocode execution (bypasses implementation details but prone to hallucinations)—the method improves Pass@1 on LiveCodeBench test output prediction by 13.6 percentage points.
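The fusion step can be pictured as a vote over predicted outputs from both paths. The sketch below is a simplified assumption, not the paper's procedure: it pools string outputs from the two paths and returns the most frequent one, with `None` marking a failed run. The name `majority_vote` and the input format are hypothetical.

```python
from collections import Counter
from typing import Optional


def majority_vote(code_outputs: list[Optional[str]],
                  pseudo_outputs: list[Optional[str]]) -> Optional[str]:
    """Pool predictions from both execution paths and return the most common one.

    None entries (e.g. a crashed program or an unparseable answer) are ignored.
    Returns None when neither path produced a usable prediction.
    """
    votes = Counter(o for o in code_outputs + pseudo_outputs if o is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical predictions: three from executing generated code,
# two from LLM-simulated pseudocode execution.
winner = majority_vote(["4", "4", None], ["4", "6"])
```

Pooling the two paths means a crash on the code side can still be outvoted by the pseudocode side, and a hallucinated pseudocode answer can be outvoted by concrete execution.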
- EET: Experience-Driven Early Termination for Cost-Efficient Software Engineering Agents
-
This paper proposes EET, an experience-driven early termination method that identifies invalid iterations and terminates them early during the patch generation and selection stages. It reduces the total cost of SE Agents by 19%-55% (average 32%) while incurring almost no loss in task performance (maximum 0.2%).
- FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean
-
FormalScience proposes a domain-agnostic Human-in-the-Loop (HITL) agent pipeline, enabling a single domain expert without Lean proficiency to translate informal scientific reasoning (specifically physics) into 100% compilable Lean4 code. It constructs FormalPhysics, the first benchmark of 200 university-level physics problems, and systematically characterizes the phenomenon where code is "compiled" but "semantically drifted."
- From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation
-
This study reveals that bias evaluations for LLM code generation significantly underestimate actual risks: in ML pipeline generation, sensitive attributes appear in \(87.7\%\) of feature selections (vs. \(59.2\%\) in conditional statements). Models correctly exclude irrelevant features but still retain sensitive attributes like race and gender, demonstrating systemic implicit discrimination.
- KoCo-Bench: Can Large Language Models Leverage Domain Knowledge in Software Development?
-
KoCo-Bench introduces the first code benchmark featuring an explicit domain knowledge corpus, covering 11 frameworks and 25 projects across 6 emerging areas (RL, Agent, RAG, etc.). It evaluates the ability of LLMs to acquire and apply domain knowledge from a corpus for code generation and understanding, revealing that even the strongest coding agent, Claude Code, achieves only 34.2%.
- Learning Adaptive Parallel Execution for Efficient Code Localization
-
FuseSearch models parallel tool calling in code localization as a joint quality-efficiency optimization problem. By using SFT+RL, the model learns to adaptively adjust search width according to task stages, achieving high F1 scores and significantly lower time/token costs on SWE-bench Verified using a compact model.
- LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software
-
This paper constructs LogicEval, the first evaluation framework for logical vulnerability repair, and LogicDS (61 real-world logical vulnerabilities + 61 synthetic Java samples). It systematically evaluates the capabilities of traditional AVR tools and LLMs in repairing logical vulnerabilities, finding that LLMs perform best when provided with auxiliary information, yet the overall repair rate remains very low (only 5 out of 61 real samples were correctly repaired), and identifies key bottlenecks such as prompt sensitivity, context loss, and patch localization difficulties.
- MARS2: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation
-
MARS2 proposes a multi-agent reinforcement tree search framework that embeds multiple independently optimized policies within a shared search tree for collaborative exploration. Through Thompson sampling for agent-node selection, tree-coherent reward shaping, and path-level group advantage estimation, it consistently improves single-model Pass@1 by up to 8.0% and system-level Pass@1(MCTS) by up to 6.5% on code generation benchmarks.
- OmniDiagram: Advancing Unified Diagram Code Generation via Visual Interrogation Reward
-
This paper proposes OmniDiagram, a unified diagram code generation framework covering three languages (LaTeX/Mermaid/PlantUML) and three tasks (Diagram-to-Code, Diagram Editing, Text-to-Code). It introduces the Viva reward mechanism based on Visual Question Answering to guide RL training, achieving SOTA performance across multiple benchmarks.
- PaT: Planning-after-Trial for Efficient Test-Time Code Generation
-
PaT shifts the paradigm from "planning before trial" to "planning after trial (and failure)." It uses execution feedback to trigger expensive decomposition steps and significantly improves the trade-off between Pass@1 and inference cost through a heterogeneous configuration consisting of small-model generation and large-model planning.
- PExA: Parallel Exploration Agent for Complex Text-to-SQL
-
PExA reformulates complex Text-to-SQL as a parallel exploration problem of "generating and executing a set of semantic test cases for a natural language query." Through three sub-agents—Planner, Test Case Generator, and SQL Proposer—it improves execution accuracy on Spider 2.0 while maintaining latency levels comparable to strong baselines.
- Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
-
This paper reveals the "regeneration" tendency of frontier LLMs in debugging tasks. By introducing the PDB framework and edit-level precision/bug-level recall metrics, the study finds that while models like GPT-5.1-Codex can pass over 76% of unit tests, their edit precision is below 45%. Furthermore, iterative and agent debugging strategies fail to significantly improve precision.
- PV-SQL: Synergizing Database Probing and Rule-based Verification for Text-to-SQL Agents
-
This paper proposes PV-SQL, an agentic Text-to-SQL framework. By integrating two complementary components—Probe (iteratively generating probing queries to discover database value formats, column semantics, and table relationships) and Verify (extracting verifiable constraints via pattern matching to build checklists)—it achieves a 5% higher Execution Accuracy and a 20.8% higher Valid Efficiency Score on the BIRD benchmark compared to state-of-the-art baselines.
- QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions
-
QAQ starts from reverse semantic consistency, that is, "whether the question can be inferred from the answer," utilizing stratified RMI and disagreement between strong and weak models to filter synthetic code instructions. Using only 25% of WarriorCoder data, it approaches full-scale training performance and significantly outperforms traditional data selection metrics like IFD.
- QiMeng-PRepair: Precise Code Repair via Edit-Aware Reward Optimization
-
This paper identifies the "over-editing" problem in LLM-based code repair—where models tend to rewrite large portions of code instead of precisely locating and fixing bugs. It proposes the PRepair framework, which utilizes Self-Breaking (diverse bug injection) and Self-Repairing (edit-aware GRPO training) to significantly enhance repair precision while maintaining correctness and accelerating speculative decoding inference.
- R\(^3\)-SQL: Ranking Reward and Resampling for Text-to-SQL
-
R3-SQL targets generate-then-rank Text-to-SQL by grouping equivalent SQLs according to execution results and ranking them through a combination of pairwise/listwise and pointwise rewards. It further employs an LLM agent to determine if the candidate pool lacks correct SQLs for selective resampling, achieving 75.03 EX on BIRD-dev.
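Grouping candidates by execution result can be illustrated with the standard-library `sqlite3` module. This is a minimal sketch under stated assumptions: the table, the candidate queries, and the helper `group_by_execution` are hypothetical, row order is ignored when comparing results (which would be wrong for `ORDER BY` questions), and the ranking rewards and resampling agent are not modeled.

```python
import sqlite3
from collections import defaultdict


def group_by_execution(conn: sqlite3.Connection,
                       candidates: list[str]) -> list[list[str]]:
    """Group SQL candidates whose result sets match on the given database."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # drop candidates that fail to execute
        # Sort rows so that queries differing only in row order share a key.
        key = tuple(sorted(rows, key=repr))
        groups[key].append(sql)
    return list(groups.values())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [("ann", 50), ("bob", 70), ("cy", 70)])

candidates = [
    "SELECT name FROM emp WHERE salary > 60",
    "SELECT name FROM emp WHERE salary >= 70",
    "SELECT name FROM emp WHERE salary > 40",
    "SELECT nme FROM emp",  # invalid column, fails at execution time
]
groups = group_by_execution(conn, candidates)
```

Here the first two candidates land in one group because they return the same rows on this database, so a ranker only needs to score one representative per group.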
- ReCode: Reinforcing Code Generation with Reasoning-Process Rewards
-
ReCode trains a reward model capable of evaluating the quality of code reasoning processes via CRPL and utilizes CG-GRPO to activate process rewards only when code execution is correct, thereby improving the Pass@1 of code generation models while avoiding reward hacking.
- ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization
-
This paper proposes ReFEree, a reference-free and fine-grained factual consistency evaluation method for real-world code summarization. It defines four categories of inconsistency criteria, evaluates at the sentence/segment level, and incorporates a dependency information search mechanism. ReFEree achieves a 15-18% improvement in correlation with human judgment on Python and Java compared to the previous SOTA.
- RepoShapley: Shapley-Enhanced Context Filtering for Repository-Level Code Completion
-
RepoShapley is proposed as a coalition-aware context filtering framework based on Shapley values. It determines whether to retain or discard retrieved code snippets by estimating their interactive contributions within combinations, significantly improving the quality of repository-level code completion.
- RExBench: Can coding agents autonomously implement AI research extensions?
-
RExBench places coding agents into real AI paper repositories to implement expert-designed research extensions. Performance is scored via controlled execution results, revealing that even the strongest current agents achieve only about a one-third success rate, indicating a significant gap in autonomous research capabilities.
- River-LLM: Large Language Model Seamless Exit Based on KV Share
-
This paper proposes River-LLM, a training-free framework that solves the missing KV Cache issue in Early Exit for decoder-only architectures by constructing lightweight KV-shared exit channels (Exit River). It utilizes state transition similarity to guide exit decisions, achieving 1.71×-2.16× wall-clock inference speedup while maintaining near-lossless generation quality.
- Ro-SLM: Onboard Small Language Models for Robot Task Planning and Operation Code Generation
-
Ro-SLM utilizes LLMs to synthesize and verify robot task-code data, followed by SFT and GRPO optimization of Llama-3.1-8B using LLM rewards. This allows the small model to approach the planning and operation code generation capabilities of cloud-based LLMs for UAV and ground vehicle tasks.
- ROSE: An Intent-Centered Evaluation Metric for NL2SQL
-
ROSE shifts NL2SQL evaluation from "predicting whether SQL matches a single reference SQL" to "predicting whether SQL satisfies user intent." Through a two-stage reasoning process involving a SQL Prover and an Adversarial Refuter, it achieves a Cohen's Kappa nearly 24 percentage points higher than existing top metrics on ROSE-VEC, and exposes evaluation crises caused by reference errors and question ambiguities in benchmarks like BIRD.
- SciCoQA: Quality Assurance for Scientific Paper–Code Alignment
-
This paper introduces SciCoQA, the first benchmark dataset for detecting discrepancies between scientific papers and their code implementations. It contains 635 discrepancy instances (92 real + 543 synthetic). Evaluation of 22 LLMs reveals that the strongest model detects only 46.7% of real discrepancies, highlighting a critical capability gap in automated scientific quality assurance.
- SecureVibeBench: Evaluating Secure Coding Capabilities of Code Agents with Realistic Vulnerability Scenarios
-
This paper proposes SecureVibeBench, the first repository-level multi-file editing secure coding benchmark. It constructs 105 C/C++ secure coding tasks from 41 OSS-Fuzz projects. By accurately restoring the scenarios where vulnerabilities were first introduced through cascaded static and dynamic analysis, the evaluation reveals that only 23.8% of the code produced by the best agent (SWE-agent + Claude Sonnet 4.5) satisfies both functional correctness and security.
- Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Understanding
-
This paper proposes a distinction between lexical recall (verbatim retrieval of code) and semantic recall (understanding code execution semantics). It finds that frontier LLMs achieve near-perfect lexical recall in long contexts but suffer from severe degradation in semantic recall. The introduced SemTrace benchmark reveals that existing evaluations significantly underestimate the extent of semantic understanding failures.
- SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization
-
This paper proposes SOCIA-EVO, an LLM agent framework that redefines automated simulator construction as a dual-anchored evolutionary process. By anchoring empirical constraints via a static Blueprint, decoupling structural correction from parameter calibration through bi-level optimization, and managing repair hypotheses via a self-curated Playbook with Bayesian-weighted retrieval based on execution feedback, SOCIA-EVO significantly outperforms baselines such as Reflexion and G-SIM on user modeling, mask-wearing diffusion, and personal mobility simulation tasks.
- SolidCoder: Bridging the Mental-Reality Gap in LLM Code Generation through Concrete Execution
-
SolidCoder transforms code verification from "imaginary execution" to "concrete execution" through the S.O.L.I.D. architecture (Shift-left Planning, Oracle-based Assertions, Live Execution, Intermediate Simulation, Defensive Accumulation), achieving pass@1 performance of 95.7% on HumanEval, 77.0% on CodeContests, and 26.7% on APPS using GPT-4o.
- Static Program Slicing Using Language Models With Dataflow-Aware Pretraining and Constrained Decoding
-
Sliceformer reformulates static program slicing as a seq2seq task for small code language models. It learns dependencies through dataflow-aware pretraining and utilizes lexical and syntactic constrained decoding to prevent hallucinations, significantly improving ExactMatch on Java and Python slicing benchmarks.
- StoryCoder: Narrative Reformulation for Structured Reasoning in LLM Code Generation
-
This paper proposes StoryCoder, a prompting framework that reformulates code generation problems into coherent natural language narratives. By guiding LLMs through three narrative components (Task Overview, Constraints, and Examples), it achieves a structured reasoning process, improving zero-shot pass@10 by an average of 18.7% across 11 models.
- SWE-QA: Can Language Models Answer Repository-level Code Questions?
-
SWE-QA constructs a repository-level code question-answering benchmark covering 15 real-world Python repositories and 720 high-quality QA pairs. It induces question types from GitHub issues and validates answers through human experts. Experiments show that direct prompting of vanilla LLMs performs poorly, and only RAG or tool-integrated agents like OpenHands/SWE-agent can approach the demands of real-world development QA.
- Taming System Complexity: Demystifying Software Engineering Agents in Diagnosing Linux Kernel Faults
-
By establishing LinuxFLBench, a large-scale Linux kernel fault localization benchmark, this study reveals the limitations of existing LLM Agents in complex systems and proposes the LinuxFL+ framework. Through two-dimensional expansion (directory-awareness and potential causes), LinuxFL+ significantly improves fault localization accuracy at a low cost.
- The Path Not Taken: Duality in Reasoning about Program Execution
-
This paper proposes the concept of duality in program execution reasoning. Through the DexBench benchmark (445 paired instances), it jointly evaluates LLMs' forward execution reasoning (predicting code coverage under a given input) and backward counterfactual reasoning (inferring input mutations that redirect execution flow to a target branch). It discovers that strong performance in a single direction does not translate to success under joint evaluation, revealing deficiencies in models' causal understanding of programs.
- To Diff or Not to Diff? Structure-Aware and Adaptive Output Formats for Efficient LLM-based Code Editing
-
This paper treats the "output format" of LLM code editing as a training objective. It proposes BlockDiff, FuncDiff, and an adaptive format selection strategy, AdaEdit. The approach achieves accuracy close to full-code generation while reducing latency and output token costs by over 30% in long-code editing scenarios.