💻 Code Intelligence

🧪 ICML2026 · 22 paper notes

🔥 Top topics: Code Intelligence ×2 · LLM ×2 · Translation ×2 · Agents ×2 · Adversarial Robustness ×2

A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets: Addressing the gap where spreadsheets lack next-action prediction similar to code completion, this paper constructs NAPE, the first spreadsheet action prediction benchmark (52 human-verified creation trajectories with 11,907 low-level actions). It proposes an online evaluation framework: after each action, the system provides predictions, simulates user acceptance/rejection, and dynamically rewrites remaining ground truth actions. Performance is measured by User Action Savings (UAS); experiments show that a fine-tuned 360M model matches GPT-5 (both saving 27% of actions).
AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms: AlgoVeri constructs a strictly aligned benchmark for verified code generation of classical algorithms across Dafny, Verus, and Lean. It demonstrates that current LLMs still face significant gaps in handling complex global invariants, system-level constraints, and explicit proof search, with success rates in Lean and Verus being substantially lower than those in Dafny.
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models: BoostAPR constructs a three-stage pipeline for training program-repair models via RL: execution-verified SFT → training sequence-level + line-level dual reward models → redistributing sequence rewards to key edit-line spans using the line-level model during PPO. Using Qwen2.5-Coder-32B, it pushes SWE-bench Verified performance from 17.8% to 40.7% (+22.9pp) and achieves 24.8% on Defects4J through cross-lingual transfer.
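The note above does not say how the sequence-level reward is spread over edited lines. The sketch below shows one simple way it could be done, assuming (hypothetically) that the line-level reward model returns one score per edited line and that the shares come from a softmax over those scores. The function name `redistribute_reward` and the softmax choice are illustrative, not taken from the paper.

```python
import math


def redistribute_reward(sequence_reward, line_scores):
    """Split one scalar reward across lines in proportion to softmax(line_scores).

    The per-line shares always sum to sequence_reward, so the total training
    signal is preserved and only its placement changes.
    """
    if not line_scores:
        raise ValueError("need at least one line score")
    peak = max(line_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in line_scores]
    total = sum(exps)
    return [sequence_reward * e / total for e in exps]


# Hypothetical scores for three edited lines; the middle line is judged most important.
shares = redistribute_reward(1.0, [0.0, 2.0, 0.0])
```

Under this assumption, lines the line-level model scores higher receive a larger share of the credit during policy optimization.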
Bridging Functional Correctness and Runtime Efficiency Gaps in LLM-Based Code Translation: Addressing the neglected problem where "LLM-translated code is functionally correct but slower than human-written code," this work proposes the SwiftTrans framework. It generates multi-perspective candidates using parallel ICL and selects the optimal candidate in linear time via a difference-aware pairwise judge using bubble-scan. Combined with Hierarchical Guidance and Ordinal Guidance training strategies, a Qwen2.5-3B model surpasses GPT-5 in both functional correctness and runtime efficiency.
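One plausible reading of "bubble-scan" selection is a single left-to-right pass that keeps the current winner and compares it with each new candidate, so n candidates need n - 1 judge calls. The sketch below assumes that reading. The judge `toy_judge` and the candidate fields are hypothetical stand-ins for the paper's learned pairwise judge.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def bubble_scan_select(candidates: Sequence[T], prefer: Callable[[T, T], bool]) -> T:
    """Return the candidate that survives one pass of pairwise comparisons.

    prefer(current, challenger) returns True when the challenger should replace
    the current winner. Exactly len(candidates) - 1 judge calls are made.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    best = candidates[0]
    for challenger in candidates[1:]:
        if prefer(best, challenger):
            best = challenger
    return best


def toy_judge(current, challenger):
    """Hypothetical judge: correctness dominates, then lower runtime wins."""
    return (challenger["correct"], -challenger["runtime_ms"]) > (
        current["correct"],
        -current["runtime_ms"],
    )


candidates = [
    {"name": "c0", "correct": True, "runtime_ms": 120.0},
    {"name": "c1", "correct": False, "runtime_ms": 10.0},
    {"name": "c2", "correct": True, "runtime_ms": 45.0},
    {"name": "c3", "correct": True, "runtime_ms": 80.0},
]
winner = bubble_scan_select(candidates, toy_judge)
```

A single pass only returns the true optimum when the judge is consistent (transitive), which is presumably why the judge itself has to be trained carefully.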
CentaurEval: Benchmarking Human-in-the-Loop Value in Agentic Coding: CentaurEval is proposed as the first unified evaluation framework for human-AI collaborative programming. By designing 45 "Collaboration-Necessary" task templates, it demonstrates that LLMs alone achieve only a 0.67% pass rate and humans alone achieve 18.89%, while human-AI collaboration reaches 31.11%, revealing that LLMs are evolving from execution tools into co-reasoning partners.
Entropy-informed Decoding: Adaptive Information-Driven Branching: EDEN (Entropy-informed DEcodiNg) sets the step-wise beam width \(B_t\) as a monotonically increasing function of the normalized entropy \(\bar H_t\), branching more at high-entropy forks and behaving almost greedily during low-entropy steps. This approximates wider beam search with fewer total expansions. The authors prove that entropy-monotonic branching factors are strictly superior to any fixed beam width in terms of expected cumulative regret, and give an explicit regret bound of \(\mathbb{E}[R_T] \leq G P_{\max} \sum_t \exp(-c m_t \Delta_{\min}^2)\).
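To make the entropy-to-width mapping concrete, here is a minimal sketch. It assumes the normalized entropy is the Shannon entropy divided by the log of the vocabulary size, and that the width is a linear, rounded interpolation between hypothetical bounds `b_min` and `b_max`. The paper's exact mapping may differ.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy divided by log(len(probs)), so the value lies in [0, 1]."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def beam_width(probs, b_min=1, b_max=8):
    """Monotone map from normalized entropy to an integer width in [b_min, b_max]."""
    h_bar = normalized_entropy(probs)
    return b_min + round((b_max - b_min) * h_bar)


confident = [0.97, 0.01, 0.01, 0.01]  # low entropy: stay close to greedy decoding
uncertain = [0.25, 0.25, 0.25, 0.25]  # maximal entropy: use the full width
widths = (beam_width(confident), beam_width(uncertain))
```

Any non-decreasing map from entropy to width would fit the description; the linear one is used here only because it is the simplest to read.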
HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench: Traditional PPL on SWE-bench is disrupted by the "long context tax" and fails to predict agent capability after SFT. This paper proposes the "Entropy Compression Hypothesis" and the HE-SNR metric, which calculates the signal-to-noise ratio only at "high-entropy decision points" where Top-10 entropy exceeds \((\ln 3 + \ln 4)/2\). This achieves a Pearson correlation of 0.96 and a Kendall consistency of 0.98 with downstream SWE-bench scores.
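The threshold \((\ln 3 + \ln 4)/2\) is the midpoint between the entropy of a uniform choice among three options and among four, about 1.2425 nats. The sketch below only illustrates how positions might be flagged as "high-entropy decision points". It assumes the top-10 probabilities are renormalized before the entropy is computed, which the note does not state, and it does not attempt the signal-to-noise computation itself.

```python
import math

# Midpoint between ln 3 and ln 4, in nats.
THRESHOLD = (math.log(3) + math.log(4)) / 2


def top_k_entropy(probs, k=10):
    """Entropy of the k largest probabilities after renormalizing them (assumption)."""
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    return -sum((p / total) * math.log(p / total) for p in top if p > 0.0)


def is_decision_point(probs, k=10):
    """True when the top-k entropy exceeds the threshold."""
    return top_k_entropy(probs, k) > THRESHOLD


three_way = [1 / 3, 1 / 3, 1 / 3]     # entropy ln 3, below the threshold
four_way = [0.25, 0.25, 0.25, 0.25]   # entropy ln 4, above the threshold
flags = (is_decision_point(three_way), is_decision_point(four_way))
```

In this reading, a token position counts as a decision point only when the model is roughly torn between four or more plausible continuations.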
How can we assess human-agent interactions? Case studies in software agent design: The authors propose the PULSE framework—which collects user feedback, trains an ML model to predict user satisfaction, and employs Prediction-Powered Inference (PPI) to combine real human labels with model pseudo-labels for efficient estimation of agent design effects. Deployed on the open-source coding agent OpenHands across 15,000 users and 36,000 sessions, this work represents the first large-scale real-world evaluation of agent design. Results show that PULSE narrows confidence intervals by approximately 40% compared to standard A/B testing and reveals that benchmark performance can be anti-correlated with human preference (e.g., GPT-5 outperformed Claude-Sonnet-4 on 6/7 benchmarks, yet humans preferred Claude on 4/7 task subsets).
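For readers unfamiliar with Prediction-Powered Inference, the standard PPI estimate of a mean takes the average model prediction on the large unlabeled set and corrects it by the average prediction error measured on the small human-labeled set. The sketch below shows that textbook estimator. Whether PULSE uses exactly this form is an assumption, and the satisfaction numbers are invented for illustration.

```python
from statistics import fmean


def ppi_mean(labeled_y, labeled_pred, unlabeled_pred):
    """Prediction-powered estimate of a mean.

    estimate = mean(predictions on unlabeled data)
             + mean(human label - prediction) on labeled data
    The second term (the "rectifier") removes the model's average bias.
    """
    if len(labeled_y) != len(labeled_pred):
        raise ValueError("labeled_y and labeled_pred must have the same length")
    rectifier = fmean([y - p for y, p in zip(labeled_y, labeled_pred)])
    return fmean(unlabeled_pred) + rectifier


# Toy data: 1 = satisfied session, 0 = unsatisfied session.
human_labels = [1, 0, 1, 1]
model_on_labeled = [0.75, 0.25, 0.5, 0.5]
model_on_unlabeled = [0.5, 0.75, 0.25, 0.5]
estimate = ppi_mean(human_labels, model_on_labeled, model_on_unlabeled)
```

Here the model under-predicts satisfaction by 0.25 on the labeled sessions, so the raw unlabeled average of 0.5 is corrected upward to 0.75.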
Locally Coherent Parallel Decoding in Diffusion Language Models: This paper proposes CoDiLA, which attaches a lightweight autoregressive (AR) model to a masked diffusion language model (DLM). By receiving the marginal distributions of the DLM through "soft embeddings" and performing local autoregressive decoding within small blocks, it eliminates the local incoherence caused by parallel sampling while preserving the global bidirectional capabilities of the DLM. It establishes a new Pareto frontier on code generation with \(\geq 2\times\) throughput.
MARS: Modular Agent with Reflective Search for Automated AI Research: MARS reframes automated AI research as a problem of "searching for the optimal solution within a software repository space." Built on three pillars—Budget-Aware MCTS, a modular "Design-Decompose-Implement" pipeline, and Comparative Reflective Memory—it achieves SOTA among open-source frameworks on MLE-Bench with a 31.1% gold medal rate (Gemini-3-Pro-Preview) and demonstrates an "Aha! moment" with a 63% cross-branch lesson transfer rate.
MatchFixAgent: Language-Agnostic Autonomous Repository-Level Code Translation Validation and Repair: MatchFixAgent fully transforms "equivalence validation + repair" for repository-level code translation into an LLM-based task. By replacing expensive cross-language interoperability engineering with six parallel semantic sub-analyzers (Control Flow, Data Flow, IO, Library API, Exception, and Specification), and layering a Test & Repair Agent with an Arbiter Agent, it raises validation coverage from 71.6% to 99.2% and the repairable defect ratio from 18.5% to 50.6% with only 1,650 lines of code.
MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering: MEnvAgent employs a three-stage "Plan-Execute-Verify" multi-agent closed loop and an environment reuse mechanism to automatically build executable and verifiable (Fail-to-Pass) Docker environments for real-world repositories across 10 languages. On the self-constructed MEnvBench, it improves the F2P rate by 8.6% and reduces construction time by 43%, facilitating the creation of MEnvData-SWE, the largest polyglot verifiable SWE training set to date.
NEMO: Execution-Aware Optimization Modeling via Autonomous Coding Agents: NEMO treats Autonomous Coding Agents (ACA) as a "first-class abstraction" on par with LLM calls. It enables independently generated simulators and optimizers to cross-verify via execution results in a shared sandbox, combined with diverse memory retrieval and MBR/self-consistency decoding. It achieves SOTA on 8 out of 9 optimization modeling benchmarks, leading by up to 28 percentage points.
Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software: This study presents a single-case (\(N=1\)) analysis where a physicist developed ~2,100 lines of differentiable cosmological perturbation theory code, clax-pt, using Claude Code over 12 days and 57 sessions. By quantifying 15 supervision events, the authors demonstrate that credibility in scientific software stems not from raw model capability, but from a structured human supervision protocol built around oracle tests, shared changelogs, and "no-patching" rules.
Poison with Style: A Practical Poisoning Attack on Code Large Language Models: PwS poisons open-source Code LLMs using common Python code styles (e.g., Yapf/Black/PEP8) as implicit triggers. The model generates completions with CWE vulnerabilities only after formatters automatically organize the code. On Qwen2.5-Coder-32B, it achieves up to 95% ASR for CWE-20 triggers while HumanEval/MBPP pass@1 drops by only about 5%, and it remains resistant to mainstream defenses like BEEAR, prefix tuning, and CodeShield.
PrivCode++: Latent-Conditioned Differentially Private Code Generation for Comprehensive Guarantees: The first work to address differentially private code generation in a "jointly sensitive" scenario where both prompts and code are sensitive. By replacing explicit prompt conditions with a Privacy-Free Latent Conditioning (PrivLC) module, combined with a two-stage pipeline of "DP Purification + non-DP Gain," the method achieves utility close to relaxed-privacy approaches at \(\epsilon=4\), while maintaining 0% leakage in canary tests.
Probability-Entropy Calibration: An Elastic Indicator for Adaptive Fine-tuning: RankTuner proposes the Relative Rank Indicator \(I_t\), which uses a single scalar signal comparing the "actual rank \(R_t\) of the ground-truth token" against the "expected rank \(\mathbb{E}[R_t]\) under the model distribution." By coupling probability \(p_t\) (task alignment) and entropy \(H_t\) (intrinsic uncertainty) into a token-level weight, it consistently outperforms pure probability/entropy reweighting baselines in Pass@1 for mathematical reasoning SFT.
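The two quantities being compared can be computed directly from a next-token distribution. The sketch below derives the 1-based rank of the ground-truth token and the expected rank under the model's own distribution, then reports their difference. How the paper turns this comparison into \(I_t\) is not given in the note, so the difference used in `rank_gap` is a hypothetical stand-in.

```python
def token_ranks(probs):
    """1-based rank for every token id; rank 1 is the most probable token.

    Ties are broken by token index because Python's sort is stable.
    """
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    ranks = [0] * len(probs)
    for position, token_id in enumerate(order, start=1):
        ranks[token_id] = position
    return ranks


def expected_rank(probs):
    """Mean rank of a token sampled from the model distribution."""
    return sum(p * r for p, r in zip(probs, token_ranks(probs)))


def rank_gap(probs, target):
    """Actual rank of the target minus the expected rank (illustrative comparison)."""
    return token_ranks(probs)[target] - expected_rank(probs)


probs = [0.5, 0.25, 0.25]
gap_top = rank_gap(probs, target=0)   # target is the model's top choice
gap_tail = rank_gap(probs, target=2)  # target ranks last
```

The expected rank grows with the spread of the distribution, so a single comparison reflects both how well the model fits the target and how uncertain it is overall.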
Pull Requests as a Training Signal for Repo-Level Code Editing: This paper proposes the Clean-PR training paradigm, converting 16.4 million noisy GitHub Pull Requests into a corpus of 2 million executable Search/Replace edit blocks through filtering, reconstruction, and round-trip validation. By combining Agentless-aligned SFT with error-driven data augmentation, Qwen2.5-Coder-32B achieves relative gains of 13.6% and 12.3% on SWE-bench Lite and Verified respectively, surpassing 72B models like Lingma-SWE and SWE-Fixer with only 32B parameters.
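A round-trip check for Search/Replace edits can be sketched as follows: apply the extracted blocks to the pre-PR file and keep the sample only if the result equals the post-PR file. The exact-once matching rule and the function names are assumptions made for illustration, not details from the paper.

```python
def apply_edit(source, search, replace):
    """Apply one Search/Replace block; the search text must match exactly once."""
    if source.count(search) != 1:
        raise ValueError("search text must match exactly once")
    return source.replace(search, replace, 1)


def round_trip_ok(before, after, edits):
    """True when applying all edits to `before` reproduces `after` exactly."""
    current = before
    try:
        for search, replace in edits:
            current = apply_edit(current, search, replace)
    except ValueError:
        return False  # an ambiguous or missing match invalidates the sample
    return current == after


before = "def add(a, b):\n    return a - b\n"
after = "def add(a, b):\n    return a + b\n"
good_edits = [("return a - b", "return a + b")]
bad_edits = [("return a * b", "return a + b")]  # search text is absent
results = (round_trip_ok(before, after, good_edits), round_trip_ok(before, after, bad_edits))
```

Requiring a unique match keeps each block unambiguous, which matters when the same edits are later replayed as training targets.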
SWE-IF: Aligning Code Evaluation with Human Preference: Addressing the issue where "code evaluation only focuses on the functional correctness of pass@k but is disconnected from real user preferences," this paper proposes VERICODE (a taxonomy of 30 verifiable code instructions with deterministic verifiers) and the SWE-IF testbed. By evaluating functional correctness alongside "instruction following" across 31 LLMs, the study finds that a composite score of functional correctness and instruction following aligns most closely with human preferences, with instruction following serving as the true differentiator between high-end models.
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale: The authors developed a "language-agnostic unified construction pipeline + interactive installation Agent + triple-model ensemble for issue clarity filtering" to automatically mine 32,079 executable SWE tasks across 20 languages and 3,617 repositories from GitHub (accompanied by 120,000+ PR-derived tasks). Each task includes pre-built Docker images, fail-to-pass tests, and instance-level diagnostic metadata, providing a stable, training-oriented substrate for large-scale reinforcement learning of SWE Agents rather than just evaluation.
Towards Functional Correctness of Code Models with Selective Generation: This work utilizes fuzzing to automatically generate a large volume of unit tests to determine the functional correctness of generated code. Based on this, it trains a selective code generator capable of "active abstention," providing PAC-style guarantees to keep the code hallucination rate (FDR-CE) below a user-specified threshold for non-abstaining responses.
UniRTL: Unified Code and Graph for Robust RTL Representation Learning: This paper proposes UniRTL—a multimodal unified representation learning framework that jointly learns from RTL code and Control-Data Flow Graphs (CDFG). By employing a graph-aware tokenizer and a hierarchical training strategy, it significantly outperforms existing methods in hardware performance prediction and code retrieval tasks.