💻 Code Intelligence
🔬 ICLR2026 · 59 paper notes
📌 Same area in other venues: 📷 CVPR2026 (1) · 💬 ACL2026 (50) · 🧪 ICML2026 (22) · 🤖 AAAI2026 (10) · 🧠 NeurIPS2025 (19) · 📹 ICCV2025 (1)
🔥 Top topics: LLM ×10 · Code Intelligence ×8 · Reinforcement Learning ×5 · Agents ×4 · Reasoning ×4
- A Problem-Oriented Perspective and Anchor Verification for Code Optimization
-
The paper proposes a problem-oriented (rather than user-oriented) approach to construct optimization pairs to integrate strategic diversity from multiple programmers. It also designs an anchor verification framework that utilizes "slow but correct code" to generate test cases, mitigating the "optimization tax" (correctness loss), thereby increasing the optimization ratio from 31.24% to 71.06% and the speedup from 2.95x to 6.08x.
- AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions
-
AetherCode is the first code reasoning benchmark to systematically collect 456 high-difficulty problems from premier programming competitions such as IOI and ICPC. It utilizes a hybrid approach of "automated generation + manual annotation by 67 experts" to achieve 100% TPR / 100% TNR for test cases. Results indicate that even the strongest model, o4-mini-high, achieves only a 35.5% Pass@1, debunking the illusion that "LLMs have conquered competitive programming."
- Agnostics: Learning to Synthesize Code in Any Programming Language with a Universal Reinforcement Learning Environment
-
By using "standard program input/output behavior" as the unified scoring criterion, a language-agnostic code execution sandbox and GRPO training framework are developed. This enables RL post-training for any low-resource programming language with only 4-5 lines of YAML configuration, elevating the performance of Qwen-3 4B on Lua, Julia, R, OCaml, and Fortran to levels comparable with 16B–70B models.
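The scoring idea above, judging a program only by what it writes to standard output for a given standard input, can be sketched without reference to any particular language. The following is a minimal illustration, not the paper's sandbox or its YAML schema: the function name `io_score`, the timeout, and the whitespace-trimming comparison are all assumptions made for the example.

```python
import subprocess
import sys


def io_score(command, tests, timeout=5.0):
    """Return the fraction of (stdin, expected_stdout) pairs the program reproduces.

    `command` is any argv list, so the same checker works for every language
    whose programs can be launched from the command line.
    """
    passed = 0
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                command,
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            continue  # a timeout counts as a failed test
        if result.returncode == 0 and result.stdout.strip() == expected.strip():
            passed += 1
    return passed / len(tests) if tests else 0.0


# Hypothetical program under test: doubles the integer read from stdin.
# The command could equally be something like ["lua", "solution.lua"].
doubler = [sys.executable, "-c", "print(int(input()) * 2)"]
score = io_score(doubler, [("2\n", "4\n"), ("5\n", "10\n"), ("7\n", "15\n")])
```

Because only the launch command changes between languages, supporting a new language in a scheme like this amounts to a few lines of configuration, which is consistent with the summary's description.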
- Ambig-SWE: Interactive Agents to Overcome Underspecificity in Software Engineering
-
The authors construct Ambig-SWE (an underspecified variant based on SWE-Bench Verified) to systematically evaluate the interaction capabilities of LLM programming agents across three dimensions: detecting underspecification, asking clarifying questions, and utilizing interactive information. They find that interaction can improve resolution rates in underspecified scenarios by up to 74%, yet models default to non-interactive behavior and struggle to distinguish between sufficient and underspecified instructions.
- An Agentic Framework with LLMs for Solving Complex Vehicle Routing Problems
-
AFL decomposes "using LLMs to solve complex Vehicle Routing Problems (VRP)" into three subtasks: problem description, code generation, and solution derivation. It utilizes four specialized agents (Generation, Judgement, Revision, and Error Analysis) to oversee each other, automatically producing a self-contained Python solver from raw VRPLIB instances. Across 60 VRP variants, AFL reduces the runtime error rate to 0%, achieves a 100% feasible solution rate, and maintains an optimality gap mostly within 3% compared to manually designed algorithms.
- ATGen: Adversarial Reinforcement Learning for Test Case Generation
-
ATGen places a "test case generator" and an "adversarial code generator" into a competitive reinforcement learning loop. As the generator strengthens, the opponent is forced to produce more subtle bugs. This self-escalating dynamic curriculum breaks the "fixed-difficulty ceiling" of static datasets, doubling the attack success rate of a 7B model compared to the SFT-based UTGen (36.99% vs 16.24%).
- Behavioral Embeddings of Programs: A Quasi-Dynamic Approach for Optimization Prediction
-
To address the dilemma of "static representations being too rigid and dynamic profiling being too expensive" in compiler optimization, this paper proposes a quasi-dynamic program representation. By "probing" LLVM IR with a set of optimization sequences, the changes in static features before and after optimization are quantified as a Program Behavior Spectrum. Product Quantization (PQ) is then used to discretize continuous reaction vectors into structured "sub-words," and a multi-task Transformer (PQ-BERT) is pre-trained to learn their syntax. This approach significantly outperforms static embeddings like inst2vec and IR2Vec in Best Pass Prediction and -Oz Benefit Prediction tasks.
- BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization
-
BOAD reformulates the design of a hierarchical multi-agent system for software engineering as a multi-armed bandit (MAB) problem. Each candidate sub-agent is treated as an arm, and the reward is its "helpfulness" within team collaboration. It employs UCB for exploration-exploitation, uses the Chinese Restaurant Process (CRP) to dynamically expand the agent archive, and applies hindsight credit assignment to avoid the "free-rider" problem. This approach automatically discovers a structure consisting of "one orchestrator + two specialized sub-agents" under a limited evaluation budget. On SWE-bench-Verified, a 36B model achieved 53.2%; on the more out-of-distribution SWE-bench-Live, it reached 20.0%, ranking second on the leaderboard and outperforming larger models like GPT-4o and Claude 3.7.
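For readers unfamiliar with UCB, the selection rule can be sketched with the textbook UCB1 formula. This is an illustration under simplifying assumptions: rewards are simulated Bernoulli draws rather than BOAD's "helpfulness" signal, the arm set is fixed (no CRP expansion), and the names `ucb1_select` and `run_bandit` are hypothetical.

```python
import math
import random


def ucb1_select(counts, totals, c=math.sqrt(2)):
    """Pick the arm with the highest upper confidence bound (UCB1)."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # pull every arm once before comparing bounds
    t = sum(counts)
    scores = [
        totals[a] / counts[a] + c * math.sqrt(math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(true_means, rounds=2000, seed=0):
    """Simulate UCB1 on Bernoulli arms and return how often each arm was pulled."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    totals = [0.0] * len(true_means)
    for _ in range(rounds):
        arm = ucb1_select(counts, totals)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts


# Three hypothetical candidate sub-agents with different success rates.
pulls = run_bandit([0.2, 0.5, 0.8])
```

The exploration bonus shrinks as an arm accumulates pulls, so the evaluation budget concentrates on the most promising candidates while weaker ones are still sampled occasionally.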
- CARD: Towards Conditional Design of Multi-agent Topological Structures
-
CARD proposes a conditional graph generation framework (Conditional Agentic Graph Designer) that utilizes a conditional variational graph encoder and environment-aware optimization to adaptively design multi-agent communication topologies based on dynamic environmental signals—such as model capabilities, tool availability, and knowledge source changes—consistently outperforming static and prompt-based baselines on HumanEval, MATH, and MMLU.
- Code2Bench: Scaling Source and Rigor for Dynamic Benchmark Construction
-
To address the persistent issues of "static sources prone to contamination" and "superficial testing" in code generation evaluation, this paper proposes the Dual Scaling philosophy. It dynamically extracts problems from real-world repositories based on model knowledge cutoff dates (Scaling the Source) and automatically generates high-rigor test suites using Property-Based Testing (PBT) coupled with a 100% branch-coverage "Great Filter" (Scaling the Rigor). The instantiated end-to-end framework, Code2Bench, produces a benchmark (Code2Bench-2509) featuring native Python and Java instances, providing fine-grained diagnostics for 10 mainstream LLMs.
- Code Aesthetics with Agentic Reward Feedback
-
This paper defines programming tasks where visual outcomes are critical, such as web design and chart generation, as "code aesthetics" problems. It constructs the AesCode-358K dataset, the OpenDesign evaluation set, and an agentic reward framework consisting of execution, static aesthetic, and interactive aesthetic agents. By training small-scale AesCoder models using GRPO-AR, a 4B model outperforms GPT-4o, GPT-4.1, and various large-scale open-source code models on OpenDesign.
- Code World Models for General Game Playing
-
Rather than acting as a direct "player," the LLM is tasked with translating game rules and a few match trajectories into an executable Python Code World Model (CWM), covering state transitions, legal actions, and terminal state detection, plus value functions and hidden state inference functions. This code is then handed to classical planners such as MCTS/ISMCTS for deep search. Across 10 games (including 4 entirely new OOD games), the approach tied with or outperformed Gemini 2.5 Pro in 9 games.
- CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning
-
CodeSense is the first fine-grained code semantic reasoning benchmark oriented toward real-world software engineering. The authors performed testing and captured execution traces across 744 Python/C/Java GitHub projects to automatically construct ground truth for execution values and program properties (loops, pointer aliasing, branches) at statement, block, and function levels. Evaluating 14 SOTA LLMs across 4,483 samples reveals that they frequently fail to correctly calculate arithmetic and API calls even for individual real-world statements.
- Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning
-
This paper proposes "Critique Reinforcement Learning" (CRL), which requires the model to make True/False judgments on "question-solution" pairs. The accuracy of these judgments serves as a verifiable reward. By mixing this with standard code RL at a 20%:80% ratio, the resulting Critique-Coder consistently outperforms pure RL models across multiple code benchmarks. The 8B model exceeds 60 on LiveCodeBench(v5) and transfers its critique capability to logical reasoning tasks.
- CrossPL: Systematic Evaluation of Large Language Models for Cross Programming Language Interoperating Code Generation
-
CrossPL is the first benchmark to systematically evaluate the "cross-programming-language (CPL) interoperating code" generation capabilities of LLMs. By using 156 finite state machines (FSM) to mine 1,982 IPC tasks from 19,000 multi-language GitHub repositories and constructing 522 Python–C FFI tasks using the GSL library, evaluations of 20 mainstream models reveal a critical weakness: models achieving 90%+ Pass@1 on single-language generation score at most 19.5% Pass@1 on FFI interoperability.
- DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle
-
DevOps-Gym is the first end-to-end agent evaluation benchmark covering the full software DevOps lifecycle (build configuration, runtime monitoring, issue resolution, test generation). It semi-automatically collects 700+ tasks from 30+ real Java/Go projects and provides a dynamic execution environment with tool-calling interfaces. The evaluation reveals that even the strongest Claude Code + Claude-4-Sonnet only achieves 20%~50% success rates on operations tasks like monitoring and build configuration, while the success rate for end-to-end pipeline tasks is 0% across all evaluated agents.
- DiaBlo: Diagonal Blocks Are Sufficient For Finetuning
-
DiaBlo is a parameter-efficient fine-tuning method that replaces low-rank decomposition with diagonal block updates. By partitioning the weight matrix into \(N \times N\) blocks and training only the diagonal blocks \(\mathbf{D}_1, \ldots, \mathbf{D}_N\), it completely bypasses the non-convex optimization, initialization sensitivity, and gradient instability issues caused by the \(\mathbf{AB}\) product in LoRA. It converges with zero initialization and is efficiently implemented using a single `torch.einsum` batched matmul in PyTorch. Theoretically, its expressivity is strictly superior to LoRA under the same parameter budget. It achieves state-of-the-art performance across four major tasks (commonsense reasoning, arithmetic reasoning, code generation, and security alignment) and in 4-bit/2-bit quantization scenarios.
- DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
-
This paper trains a 7B masked diffusion code model, DiffuCoder, proposes a local/global AR-ness metric system to characterize the "non-autoregressive" decoding behavior of diffusion LLMs (dLLMs), and designs coupled-GRPO (a diffusion-native RL method using complementary mask coupled sampling), achieving a 4.4% improvement on EvalPlus.
- EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits
-
EDIT-Bench transforms in-the-wild instructed code editing requests from nearly 500 real developers—captured via an in-house VSCode plugin—into 540 challenging problems with test harnesses. Evaluating 40 LLMs reveals that this is a difficult benchmark, with only one SOTA model exceeding a 60% success rate.
- Evolving Graph Structured Programs for Circuit Generation with Large Language Models
-
CircuitEvo encodes circuit graphs into "Graph-structured Programs," an LLM-friendly text format, and iteratively generates compact circuits using LLM + evolutionary prompting strategies. It features a theoretically guaranteed "Structure-aware Functional Completion" module to ensure correctness, making it the first LLM-based logic synthesis method capable of continuously compressing circuit size while guaranteeing 100% functional accuracy.
- FHE-Coder: Benchmarking Secure Agentic Code Generation for Fully Homomorphic Encryption
-
To address the fatal blind spot where "LLM-generated FHE code is functional but cryptographically insecure," this paper introduces FHE-Coder, a three-stage agentic framework (Prompt Formalizer + Expert-Augmented RAG + Security Verifier). Accompanied by a new metric \(Pass@1(func \ sec)\) and a 10-task benchmark, it enables various LLMs to consistently produce compilable, functionally correct, and verifiably secure homomorphic encryption code for TFHE/CKKS.
- From Assistant to Independent Developer — Are GPTs Ready for Software Development?
-
This paper introduces APPFORGE, the first benchmark to evaluate the capability of LLMs to build complete Android applications end-to-end from scratch (101 real-world tasks, fully automated compilation/functional/stability evaluation). Findings show that even the strongest GPT-5 achieves only 18.8% success, revealing a significant gap between current models and "independent developers."
- From Large to Small: Transferring CUDA Optimization Expertise via Reasoning Graph
-
ReGraphT organizes CUDA optimization trajectories accumulated by large models into a reusable "Reasoning Graph." It then uses Monte Carlo Graph Search (MCGS) to guide small models (SLMs) in selecting optimization techniques step-by-step. This allows 7B-scale models to approach the CUDA code generation performance of a 671B model without training or cloud reliance, achieving an average speedup of 2.33×.
- Gistify: Codebase-Level Understanding via Runtime Execution
-
The GISTIFY task is proposed—requiring programming agents to compress the functionality of a specific command across an entire codebase into a single-file, self-contained, minimal, and faithful reproduction of runtime behavior. This task rigorously evaluates a model's understanding of codebase structure and execution flow, revealing that current SOTA models frequently fail on long execution trajectories.
- Gradient-Based Program Synthesis with Neurally Interpreted Languages
-
NLI allows an encoder-decoder architecture to end-to-end invent its own discrete, symbol-like programming language, accompanied by a differentiable recurrent neural executor that interprets programs token-by-token. This enables both compositional generalization akin to symbolic methods and gradient descent search in the program space, refining initial program guesses from the inductor at test time until the data is explained.
- HARDTESTGEN: A High-Quality RL Verifier Generation Pipeline for LLM Algorithmic Coding
-
Addressing algorithmic coding, the HARDTESTGEN pipeline is proposed—synthesizing "generator programs" via LLMs instead of direct test generation. Combined with multi-oracle consensus filtering, it creates HARDTESTS (26.6k problems), a high-quality dataset with 11% higher precision, proving that verifier quality directly determines the effectiveness of rejection sampling and RL post-training.
- Improving Code Localization with Repository Memory
-
This paper enhances the code localization capabilities of language agents by using the repository's commit history to construct episodic memory (past commits) and semantic memory (summaries of active code functions), achieving significant improvements on SWE-bench.
- IMSE: Intrinsic Mixture of Spectral Experts Fine-tuning for Test-Time Adaptation
-
This paper proposes IMSE, which reinterprets pre-trained ViT linear layers as "spectral experts" via SVD. By fine-tuning only the singular values, it achieves extreme parameter efficiency for Test-Time Adaptation. Combining a diversity maximization loss and a domain-aware spectral code retrieval mechanism, it reaches SOTA performance across TTA, CTTA, and progressive CTTA scenarios.
- InnoGym: Benchmarking the Innovation Potential of AI Agents
-
This paper proposes InnoGym, the first benchmark and framework to systematically evaluate the innovation capability of AI agents. It introduces two complementary metrics, Performance Gain and Novelty, and finds across 18 improvable tasks that while current agents show some innovativeness, they lack the robustness to turn innovation into reliable performance improvements.
- JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence
-
Addressing the bottleneck of scarce "code + vision" multimodal corpora, this work introduces a data synthesis toolbox to construct JanusCode-800K, the largest multimodal code corpus to date. Unified models JanusCoder / JanusCoderV are trained to simultaneously cover text-side and vision-side tasks such as chart generation, web UI, animation, and scientific demonstrations, approaching or even surpassing GPT-4o at scales of 7B–14B.
- Kimi-Dev: Agentless Training as Skill Prior for SWE-agents
-
This paper proposes treating Agentless (workflow-style) training as a "skill prior" for SWE-Agents (multi-turn interactive). By utilizing a recipe of mid-training + cold-start + RL + test-time self-play, the open-source model Kimi-Dev achieves 60.4% on SWE-bench Verified (a SoTA for workflow solutions). It is further upgraded into an agent with 48.6% pass@1, comparable to Claude 3.5 Sonnet, using a lightweight SFT of 5k trajectories.
- KV Cache Transform Coding for Compact Storage in LLM Inference
-
This paper proposes KVTC, a KV cache compression method inspired by classical media compression techniques (PCA feature decorrelation + adaptive quantization + entropy coding). It achieves up to 20× compression (40×+ in specific scenarios) on models like Llama 3, Mistral NeMo, and R1-Qwen 2.5, outperforming baseline methods such as token eviction, quantization, and SVD.
- LearNAT: Learning NL2SQL with AST-guided Task Decomposition for Large Language Models
-
LearNAT utilizes AST-guided MCTS search to automatically synthesize "verifiable" NL2SQL task decomposition data, followed by fine-grained multi-step preference optimization using margin-aware DPO, enabling a 7B small model to achieve performance comparable to GPT-4 in NL2SQL.
- Learning to Reason without External Rewards
-
This paper proposes Intuitor, an RLIF method that replaces external verifiable rewards with the model's own self-certainty (KL divergence between the output distribution and a uniform distribution). It matches GRPO performance in mathematical reasoning while demonstrating better generalization in out-of-distribution tasks such as code generation.
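The self-certainty signal can be made concrete with a small computation. The summary does not state the direction of the KL divergence; the sketch below assumes \(\mathrm{KL}(U \,\|\, p)\), with \(U\) uniform over the vocabulary, averaged over generated tokens. The function names and the toy four-token vocabulary are illustrative only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities for one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def self_certainty(token_logits):
    """Mean over tokens of KL(U || p), where U is uniform over the vocabulary.

    KL(U || p) = -log(V) - (1/V) * sum_i log p_i, which is 0 for a uniform
    prediction and grows as the prediction becomes more peaked.
    """
    total = 0.0
    for logits in token_logits:
        vocab_size = len(logits)
        log_p = log_softmax(logits)
        total += -math.log(vocab_size) - sum(log_p) / vocab_size
    return total / len(token_logits)


# Toy example with a four-token vocabulary and a single generated token.
flat = self_certainty([[0.0, 0.0, 0.0, 0.0]])    # model is maximally unsure
peaked = self_certainty([[8.0, 0.0, 0.0, 0.0]])  # model strongly prefers one token
```

Under this assumption a more confident distribution scores higher, so the quantity can stand in for a reward without any external verifier.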
- LLM-Guided Evolutionary Program Synthesis for Quasi-Monte Carlo Design
-
Two long-standing Quasi-Monte Carlo (QMC) design problems—constructing finite point sets with low star discrepancy and selecting Sobol' direction numbers—are reformulated as "program synthesis" tasks. An LLM acts as an intelligent mutation operator within an evolutionary loop to search for generating code. Without any task-specific training, this approach reproduces known optimal solutions and sets new benchmarks for several finite-scale and high-dimensional financial pricing scenarios.
- Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification
-
This paper introduces DAFNYCOMP, the first benchmark for compositional formal specification generation across multi-function programs. It reveals that while leading LLMs achieve over 58% pass rates on single-function Dafny verification, their end-to-end success rates drop nearly to zero (strongest model Pass@8 is only 2%) when 2–5 functions are composed into a call chain, proving that "local success does not compose."
- Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages
-
Multi-LCB extends the Python-only LiveCodeBench to 12 programming languages via a transformation pipeline that converts functional LeetCode tasks into a unified STDIN/STDOUT format. It enables cross-language comparisons on identical problems without compromising contamination control, revealing prevalent "Python overfitting" and language-specific data contamination in current LLMs.
- Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
-
PaperCoder is proposed—a multi-agent LLM framework that automatically transforms machine learning papers into runnable code repositories via a three-stage pipeline of Planning, Analysis, and Coding. 88% of the generated repositories were rated as best by the original authors, significantly outperforming baselines on the PaperBench benchmark.
- Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents
-
Addressing the most fundamental yet failure-prone stage of "environment setup" for SWE agents, this paper proposes EnConda-Bench. By injecting six types of real-world errors into originally correct README files to automatically generate tasks, it decomposes the traditional black-box evaluation—which only checks "final build/test success"—into a process-level diagnosis of Planning, Perception, Feedback, and Execution. The study reveals that the inability to translate correct error detection into valid fixes is the current performance bottleneck.
- QLCoder: A Query Synthesizer For Static Analysis of Security Vulnerabilities
-
QLCoder embeds an LLM-Agent into an iterative loop of "candidate query generation \(\to\) CodeQL execution scoring \(\to\) feedback-based patching." It constrains reasoning using a custom MCP toolkit (CodeQL Language Server for syntax consistency + RAG vector database for semantic grounding) to automatically synthesize CodeQL queries from CVE metadata that "alert on vulnerable versions and stay silent on fixed versions." It achieved a 53.4% success rate and an F1 score of 0.70 across 176 real-world Java CVEs, significantly outperforming vanilla Claude Code (10%) and existing IRIS/CodeQL query suites (F1 0.048 / 0.073).
- ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
-
The paper proposes ReasoningBank, a memory framework that distills generalizable reasoning strategies from success and failure experiences judged by the agent itself. It also introduces memory-aware test-time scaling (MaTTS) to establish a synergy between memory and test-time expansion, consistently outperforming baselines on WebArena, Mind2Web, and SWE-Bench (up to 34.2% relative gain) while reducing interaction steps by 16%.
- RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback
-
RECODE-H transforms "research code generation" from a one-shot task into multi-turn human-computer collaboration: it features 102 repository-level tasks from real top-tier conference papers and official repositories, equipped with unit tests and a five-level feedback hierarchy. Using ReCodeAgent (multi-turn ReAct + memory compression) as a strong baseline, the study systematically quantifies how "finer feedback leads to more accurate LLM corrections"—the Recall of GPT-5 increases from 29.4% without feedback to 71.6% with the strongest feedback.
- RefineStat: Efficient Exploration for Probabilistic Program Synthesis
-
RefineStat enables Small Language Models (SLMs) of 7~8B to reliably synthesize probabilistic programs (PyMC/NumPyro). During the generation phase, it utilizes semantic constrained decoding to prune illegal distributions/parameters segment-by-segment. In the refinement phase, it backtracks and resamples priors or likelihoods based on Bayesian diagnostic metrics. This allows a single open-source SLM to produce programs whose statistical reliability matches or even exceeds that of closed-source Large Language Models (LLMs) such as GPT-4 or OpenAI o3.
- RESCUE: Retrieval Augmented Secure Code Generation
-
RESCUE proposes a novel RAG framework for "secure code generation": it offline distills messy vulnerability-fix data into a hierarchical security knowledge base using "clustering-summarization + program slicing," and online analyzes tasks from three security perspectives (vulnerability causes, API patterns, code) via "hierarchical multi-faceted retrieval." Across four benchmarks and six LLMs, it improves the SecurePass@1 metric (balancing security and functionality) by an average of 4.8 points, setting a new SOTA.
- RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation
-
This paper proposes the Repository Planning Graph (RPG), which encodes both "what features to build (proposal)" and "how to implement them (implementation)" into an explicit graph (nodes represent capabilities/files/functions, edges represent data flow and hierarchy). Based on this, the ZeroRepo framework is built, utilizing a three-stage process: "proposal-level mapping → implementation-level mapping → graph-guided code generation" to generate entire codebases from scratch. On the self-constructed RepoCraft benchmark, it achieves 81.5% coverage, a 69.7% pass rate, and an average of 36K lines of code, exceeding the strongest baseline (Claude Code) by 3.9× in scale.
- Sharing State Between Prompts and Programs
-
The authors propose the abstraction of shared program state, allowing prompts to directly read/write program variables, manipulate heap objects, and control program flow. This is implemented as the Nightjar system (Python + prompt mixed programming), which reduces code volume by 39.6% while maintaining or improving accuracy (+4-19%).
- ShieldedCode: Learning Robust Representations for Virtual Machine Protected Code
-
ShieldedCode is proposed as the first protection-aware code representation learning framework. By utilizing hierarchical dependency modeling (intra-instruction, preceding, and inter-instruction layers) and joint functional-aware plus protection-aware contrastive learning, it enables LLMs to generate, compare, and reason about virtual machine protected code. It outperforms existing methods in VM code generation (Pass@1 26.95% vs. GPT-4o 22.58%) and binary similarity detection.
- SK2Decompile: LLM-based Two-Phase Binary Decompilation from Skeleton to Skin
-
SK2Decompile decomposes binary decompilation into a two-phase LLM pipeline: "first recovering a compilable program skeleton, then restoring semantic identifiers." It utilizes reinforcement learning with compiler feedback and semantic similarity rewards, respectively, to simultaneously enhance the executability and readability of decompiled code.
- SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification
-
This paper points out that current Text-to-SQL evaluation, which relies on "comparing execution results on a single test database," is overly optimistic. It proposes SpotIt, which uses SMT-based bounded equivalence checking to actively search for a database that can distinguish the generated SQL from the gold SQL. On the BIRD benchmark, it reduced the accuracy of ten SOTA methods by 9.8%–13.5% and discovered that "mismatches are often due to errors in the gold SQL itself."
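The weakness of single-database evaluation is easy to reproduce. The sketch below is an illustration only: it replaces SpotIt's SMT-based bounded equivalence checking with a brute-force search over tiny SQLite databases, and the `emp` table, both queries, and the function names are hypothetical.

```python
import itertools
import sqlite3


def run(query, rows):
    """Execute a query on a fresh in-memory database holding the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (name TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO emp VALUES (?, ?)", rows)
    result = sorted(conn.execute(query).fetchall())  # compare ignoring row order
    conn.close()
    return result


def find_counterexample(gold, generated, salaries=range(48, 53), max_rows=2):
    """Search small databases for one on which the two queries disagree."""
    for n in range(max_rows + 1):
        for combo in itertools.product(salaries, repeat=n):
            rows = [(f"e{i}", s) for i, s in enumerate(combo)]
            if run(gold, rows) != run(generated, rows):
                return rows
    return None


gold = "SELECT name FROM emp WHERE salary > 50"
generated = "SELECT name FROM emp WHERE salary >= 50"

# On this single test database the two queries look equivalent...
single_test_db = [("ann", 40), ("bob", 60)]
agree_on_test_db = run(gold, single_test_db) == run(generated, single_test_db)

# ...but a bounded search finds a database that separates them.
witness = find_counterexample(gold, generated)
```

Here the boundary row with `salary = 50` distinguishes the two queries. A solver-based check searches the same kind of bounded space symbolically instead of by enumeration.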
- SWE-RM: Execution-Free Feedback for Software Engineering Agents
-
This paper points out that "strong Test-Time Scaling (TTS) performance" does not guarantee that a reward model will be effective in reinforcement learning (RL). It proposes to evaluate reward models through three dimensions: TTS + Discriminability (AUC) + Calibration (ECE). Based on this, it trains SWE-RM (30B-A3B), an execution-free reward model that improves Qwen3-Coder-Max from 67.0% to 74.6% (open-source SOTA) via TTS on SWE-Bench Verified, and provides an additional 3% gain when used as an RL reward compared to pure execution feedback.
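Of the three dimensions, calibration is perhaps the least familiar. The following sketch computes the standard binned Expected Calibration Error for a reward model that outputs a score in [0, 1] per trajectory. The equal-width binning, the bin count, and the toy data are assumptions for illustration and may differ from the paper's exact protocol.

```python
def expected_calibration_error(scores, labels, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins."""
    bins = [[] for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        index = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[index].append((score, label))

    total = len(scores)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(s for s, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - confidence)
    return ece


# Toy data: predicted probability that a patch is correct, and the true outcome.
scores = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
labels = [1, 1, 1, 0, 0, 0]
ece = expected_calibration_error(scores, labels)
```

A reward model can rank candidates well (high AUC) and still be poorly calibrated, which matters when its raw score is used directly as an RL reward.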
- SweRank: Software Issue Localization via Code Ranking
-
SweRank reframes "finding functions to be modified based on bug reports" from expensive multi-step LLM agent reasoning into a one-time "retrieve-and-rerank" problem. By training a bi-encoder retriever (SweRankEmbed) and a listwise LLM reranker (SweRankLLM) on a self-constructed large-scale dataset, SweLoc, it achieves SOTA localization accuracy across file, module, and function granularities on SWE-Bench-Lite and LocBench at a significantly lower cost than Claude-3.5 agents.
- The Matthew Effect of AI Programming Assistants: A Hidden Bias in Software Evolution
-
This paper conducts a large-scale empirical study using 130,000+ code generation requests and hundreds of full-stack framework tasks. It quantifies how AI programming assistants yield significantly higher success rates for mainstream languages and frameworks compared to niche technologies. This reveals a feedback loop consistent with the "Matthew Effect"—ecosystems with abundant data receive superior AI support, further reinforcing their dominant status.
- The Natural Geometry of Code: Hyperbolic Representation Learning for Program Reasoning
-
This paper argues that the "natural geometry" of code is hyperbolic space. It proposes HypeCodeNet, a graph neural network operating natively on the numerically stable Lorentz model. Using hyperbolic embedding layers, tangent-space message passing, and geodesic attention, it learns low-distortion hierarchical representations for ASTs. HypeCodeNet outperforms Euclidean models across clone detection, code completion, and link prediction tasks, achieving parity with a 768-dimensional SOTA using only 32 dimensions.
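For orientation, the Lorentz (hyperboloid) model represents points as vectors \(x\) with \(\langle x, x \rangle_L = -1\), where \(\langle x, y \rangle_L = -x_0 y_0 + \sum_{i \ge 1} x_i y_i\), and measures distance as \(d(x, y) = \operatorname{arccosh}(-\langle x, y \rangle_L)\). The sketch below shows only this standard geometry, not HypeCodeNet's layers; the helper names are hypothetical.

```python
import math


def lift(v):
    """Map a Euclidean vector onto the hyperboloid <x, x>_L = -1."""
    return [math.sqrt(1.0 + sum(c * c for c in v))] + list(v)


def lorentz_inner(x, y):
    """Lorentzian inner product: minus sign on the time-like coordinate."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y):
    """Geodesic distance; the clamp guards against rounding below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


origin = lift([0.0, 0.0])
a = lift([1.0, 0.0])
b = lift([-1.0, 0.0])

# a and b lie on opposite sides of the origin along one geodesic,
# so their distance is the sum of their distances to the origin.
d_ab = lorentz_distance(a, b)
d_oa = lorentz_distance(origin, a)
```

Distances grow roughly logarithmically in the Euclidean norm of the lifted vectors, which is one reason tree-like structures such as ASTs can embed with low distortion in few dimensions.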
- TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
-
TikZilla constructs the million-scale, high-quality TikZ dataset DaTikZ-V4 and, after SFT, further trains small Qwen models using GRPO with a reward based on an inverse-graphics image encoder. The resulting models surpass GPT-4o in Text-to-TikZ scientific graphics generation and exceed GPT-5 on automatic metrics, with significantly improved compilation rates and graphical semantic alignment.
- Training Large Language Models To Reason In Parallel With Global Forking Tokens
-
This paper proposes Set Supervised Fine-Tuning (SSFT), which aligns global forking tokens with diverse reasoning trajectories through bipartite matching. This enables LLMs to globally steer different reasoning patterns from a single control token, significantly outperforming standard SFT and GRPO on mathematical reasoning and code generation tasks.
- VeriEquivBench: An Equivalence Score for Ground-Truth-Free Evaluation of Formally Verifiable Code
-
To address the issue where "verifiable code generation" evaluation is limited by the scale and errors of manually annotated ground-truth specifications, this paper proposes an equivalence score. It uses the Dafny verifier to automatically check the bidirectional entailment between code and specifications, enabling quality assessment without ground truth. Based on this, VeriEquivBench is constructed with 2,389 complex algorithmic problems, where results show that even Claude-4-sonnet completely fails under pass@4.
- VERINA: Benchmarking Verifiable Code Generation
-
VERINA uses 189 manually refined Lean programming tasks to decompose "verifiable code generation" into three independent yet combinable base tasks: CodeGen, SpecGen, and ProofGen. It provides a multi-stage specification evaluator combining "theorem proving + full coverage testing." Results show that even the strongest o3 achieves only 72.6% code correctness and 52.3% specification success, while the proof success rate is as low as 4.9%.
- VisCoder2: Building Multi-Language Visualization Coding Agents
-
Addressing three major pain points of existing visualization code models—narrow language coverage, non-executability, and inability to iteratively correct errors—this paper introduces a dataset (VisCode-Multi-679K, 12 languages, 679k executable samples), a benchmark (VisPlotBench, 8 languages, 888 tasks), and a model family (VisCoder2, 3B~32B). For the first time, an open-source model matches GPT-4o in execution pass rate (32B reaches 82.4% after self-debugging), significantly leading in symbolic/compiled languages such as LilyPond, LaTeX, and Asymptote.
- WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning
-
WebGen-Agent enables a coding LLM to iteratively refine website code using multi-level visual feedback ("screenshot + GUI agent testing") at each step. These feedback scores are then utilized as step-level rewards for Step-GRPO reinforcement learning. This approach improves Claude-3.5-Sonnet's accuracy on WebGen-Bench from 26.4% to 51.9% and elevates 7B small models from 38.9% to 45.4%.