💻 Code Intelligence
🧠 NeurIPS 2025 · 22 paper notes
- A Self-Improving Coding Agent
This paper proposes SICA (Self-Improving Coding Agent), a coding agent capable of autonomously editing its own codebase to improve performance. By eliminating the distinction between meta-agent and target-agent, SICA achieves iterative self-improvement, advancing from 17% to 53% on a subset of SWE-Bench Verified.
- A Stochastic Differential Equation Framework for Multi-Objective LLM Interactions
This paper models multi-objective optimization in iterative LLM interactions as an SDE (drift-diffusion process), quantifies inter-objective coupling via an interference matrix, and analyzes strategy convergence behavior through eigenvalue spectral analysis. Validation on code generation (three objectives: security, efficiency, functionality) demonstrates convergence rates ranging from 0.33 to 1.29 and predictability up to \(R^2 = 0.74\) across different strategies.
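The drift-diffusion view can be made concrete with a toy simulation. This is a hedged sketch, not the paper's code: the interference matrix `A`, target levels `x_star`, and noise scale are all illustrative choices; off-diagonal entries of `A` model inter-objective coupling, and the smallest eigenvalue serves as a convergence-rate proxy in the spirit of the spectral analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative interference matrix over (security, efficiency, functionality);
# off-diagonals couple the objectives. All values are assumptions.
A = np.array([[1.0, -0.2, 0.1],
              [-0.2, 0.8, -0.1],
              [0.1, -0.1, 1.2]])
x_star = np.array([0.9, 0.8, 0.95])  # assumed target objective levels
sigma, dt, steps = 0.02, 0.01, 2000

# Euler-Maruyama integration of dx = A (x* - x) dt + sigma dW
x = np.zeros(3)
for _ in range(steps):
    drift = A @ (x_star - x)
    x = x + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(3)

# Smallest eigenvalue of A bounds how fast the drift contracts toward x*
rate = float(np.min(np.real(np.linalg.eigvals(A))))
print(np.round(x, 2), round(rate, 2))
```

With a positive-definite `A`, the trajectory settles near `x_star` with fluctuations on the order of the noise scale, which is the regime where convergence rates like those reported become measurable.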
- AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy
AstroVisBench introduces the first code benchmark for evaluating LLMs on astronomical scientific computing and visualization. It extracts 864 tasks (processing + visualization) from 110 Jupyter Notebooks, and designs a dual evaluation pipeline (execution-based variable inspection + VLM-as-Judge visualization scoring, achieving Spearman ρ=0.822 with expert ratings). Evaluation of 8 state-of-the-art models reveals that Gemini 2.5 Pro performs best, yet attains only a 15.7% error-free rate, with FileNotFoundError accounting for 43% of all errors.
- VeriMaAS: Automated Multi-Agent Workflows for RTL Design
VeriMaAS proposes a framework for automatically composing multi-agent workflows for RTL code generation. Its core innovation is the direct integration of formal verification feedback from HDL tools (Yosys synthesis + OpenSTA timing analysis) into workflow orchestration, achieving a 2–12% pass@1 improvement on VeriThoughts while requiring only a few hundred samples for controller tuning—an order of magnitude fewer than full fine-tuning.
- Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning
This paper proposes CURE, a framework in which a single LLM simultaneously assumes the roles of code generator and unit test generator. Cross-execution between generated code and generated tests constructs a pairwise reward matrix; theoretically derived reward signals then drive reinforcement learning. Without any ground-truth code annotations, CURE achieves co-evolution of both code generation and unit test generation capabilities, substantially outperforming dedicated coder models of comparable scale across five programming benchmarks.
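The cross-execution step can be sketched as follows. This is a simplified illustration, not the paper's reward derivation: two hypothetical candidate implementations are run against two hypothetical generated tests (one of which is itself wrong, since tests are unverified too), and the resulting pass/fail matrix is aggregated into code-side and test-side reward signals.

```python
import numpy as np

def add(a, b):          # candidate 1 (correct)
    return a + b

def add_buggy(a, b):    # candidate 2 (off by one)
    return a + b + 1

codes = [add, add_buggy]
# Generated "unit tests" as (args, expected) pairs; the second expectation
# is deliberately wrong to show that generated tests are also noisy.
tests = [((2, 3), 5), ((0, 0), 1)]

# Cross-execution: run every candidate against every test.
R = np.zeros((len(codes), len(tests)))
for i, f in enumerate(codes):
    for j, (args, expected) in enumerate(tests):
        R[i, j] = float(f(*args) == expected)

code_reward = R.mean(axis=1)   # fraction of tests each code passes
test_reward = R.mean(axis=0)   # fraction of codes each test accepts
print(R.tolist(), code_reward.tolist(), test_reward.tolist())
```

CURE's contribution is deriving principled rewards from exactly this kind of matrix so that both roles improve without ground-truth annotations; the simple means above are only a stand-in for that derivation.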
- CoRe: Benchmarking LLMs' Code Reasoning Capabilities through Static Analysis Tasks
This paper introduces CoRe, a high-quality benchmark comprising 12,553 manually validated task instances. Through three categories of fundamental static analysis tasks—data dependency, control dependency, and information flow—CoRe directly evaluates the code semantic reasoning capabilities of LLMs, revealing that current models remain severely deficient on tasks requiring multi-step reasoning, such as trace generation and source enumeration.
- Embedding Alignment in Code Generation for Audio
A dual-MLP + InfoNCE contrastive learning framework is proposed to align code embeddings (distilroberta-base) and audio embeddings (wav2vec2) into a shared space, enabling LLM-based code generation pipelines to infer musical similarity directly from code without compilation or execution. CKA improves from 0.090 to 0.590.
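A minimal sketch of the alignment objective, under stated assumptions: here random vectors stand in for distilroberta-base and wav2vec2 features, single linear maps stand in for the two MLPs, and only one direction of a symmetric InfoNCE loss is shown, with matched (code, audio) pairs as positives.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_code, d_audio, d_shared = 8, 16, 12, 4

code_emb = rng.standard_normal((n, d_code))    # stand-in code embeddings
audio_emb = rng.standard_normal((n, d_audio))  # stand-in audio embeddings
W_code = rng.standard_normal((d_code, d_shared)) * 0.1    # projector 1
W_audio = rng.standard_normal((d_audio, d_shared)) * 0.1  # projector 2

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

z_c = normalize(code_emb @ W_code)   # project into the shared space
z_a = normalize(audio_emb @ W_audio)

tau = 0.07                           # assumed temperature
logits = z_c @ z_a.T / tau           # all pairwise similarities
log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss_c2a = -float(np.mean(np.diag(log_softmax)))  # code -> audio InfoNCE
print(round(loss_c2a, 3))
```

Training would minimize this loss (plus the audio-to-code direction) over the projector weights, pulling matched pairs together, which is what the reported CKA improvement measures.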
- FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts
Inspired by the fly olfactory circuit, FlyLoRA replaces the down-projection matrix \(A\) in LoRA with a frozen sparse random projection and employs top-\(k\) activation selection to realize implicit rank-wise MoE routing. This design eliminates routing parameters, reduces intra-task interference, and naturally supports multi-task model merging by exploiting the near-orthogonality of random projections.
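The core mechanism can be sketched in a few lines. This is an illustrative toy (shapes, sparsity pattern, and `k` are assumptions, not the paper's configuration): the down-projection `A` is a frozen sparse random matrix, and keeping only the top-k largest-magnitude entries of the rank-r intermediate activation yields the implicit rank-wise routing, with no learned router parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, r, k = 32, 8, 2

# Frozen sparse random down-projection: each rank dimension reads a few
# random input coordinates (the exact sparsity pattern is assumed).
A = np.zeros((d_in, r))
for j in range(r):
    idx = rng.choice(d_in, size=4, replace=False)
    A[idx, j] = rng.standard_normal(4)

B = np.zeros((r, d_in))            # trainable up-projection (zero init, as in LoRA)

x = rng.standard_normal(d_in)
h = x @ A                          # rank-r intermediate activation
topk = np.argsort(np.abs(h))[-k:]  # implicit routing: activate top-k ranks only
h_sparse = np.zeros_like(h)
h_sparse[topk] = h[topk]
delta = h_sparse @ B               # LoRA update added to the layer output
print(int((h_sparse != 0).sum()), delta.shape)
```

Because each input activates only k of the r rank dimensions, different tasks tend to occupy near-disjoint rank subsets, which is what reduces intra-task interference and makes merging via near-orthogonal random projections workable.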
- FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis
This paper introduces FractalBench, a benchmark for diagnosing visual-mathematical reasoning in MLLMs via fractal image program synthesis. Comprising 12 classical fractals, 610 test images, and evaluations across 4 MLLMs, it reveals that while 76% of generated code is executable, only 4% is visually correct, exposing fundamental deficiencies in recursive abstraction capabilities.
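For intuition, the benchmark asks models to synthesize recursive programs whose execution renders a fractal. A hypothetical miniature of that task shape (not taken from the benchmark itself) is counting the filled triangles of a depth-d Sierpinski triangle, where the recursive structure is the whole point:

```python
# Illustrative recursion of the kind FractalBench targets: each level of a
# Sierpinski triangle replaces one filled triangle with three smaller ones.
def sierpinski_count(depth):
    if depth == 0:
        return 1
    return 3 * sierpinski_count(depth - 1)

print([sierpinski_count(d) for d in range(5)])  # → [1, 3, 9, 27, 81]
```

The executable-but-visually-wrong gap (76% vs. 4%) suggests models can produce syntactically valid recursion like this while still failing to match the geometry the image encodes.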
- Learning From Design Procedure To Generate CAD Programs for Data Augmentation
This paper proposes a CAD program data augmentation paradigm inspired by industrial design workflows. By providing reference surface programs and design procedure descriptions as LLM prompts, the method guides the generation of CAD programs containing B-Spline organic shapes, substantially narrowing the geometric complexity gap between public CAD datasets and industrial-grade designs.
- Learning to Solve Complex Problems via Dataset Decomposition
This paper proposes Decomp, a method that employs a teacher model (GPT-4o) to recursively decompose complex math problems into simpler subproblems along reasoning steps, constructs a concept dependency graph to quantify difficulty, and trains student models following an easy-to-hard curriculum. Qwen2.5-1.5B achieves 51.6% on MATH-500 (surpassing MuggleMath's 50.4% with 147K samples), while Qwen3-4B reaches 16.7% on AIME2025 using only 385 samples (surpassing Qwen2.5-72B's 15.0%).
- MaintainCoder: Maintainable Code Generation Under Dynamic Requirements
This work is the first to systematically define and address the maintainability problem in LLM-based code generation, contributing both a benchmark and a method. MaintainBench evaluates code maintainability under requirement evolution using 4 change patterns and dynamic metrics; MaintainCoder integrates the Waterfall model, design patterns, and 6 specialized agents, achieving 60%+ improvement on dynamic maintainability metrics while also improving initial code correctness.
- MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research
This paper proposes MLR-Bench, a comprehensive benchmark comprising 201 open-ended ML research tasks, accompanied by MLR-Judge (an LLM-based evaluation framework) and MLR-Agent (a modular research agent). The study finds that state-of-the-art coding agents fabricate or fail to verify experimental results in approximately 80% of cases, exposing a fundamental bottleneck in AI-automated scientific research.
- Once Upon an Input: Reasoning via Per-Instance Program Synthesis
This paper proposes PIPS (Per-Instance Program Synthesis), which iteratively refines programs through instance-level program synthesis and structured feedback, while dynamically selecting between direct reasoning and program synthesis via a confidence measure. PIPS achieves an 8.6% improvement in harmonic mean accuracy across 30 benchmarks.
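One plausible reading of the dynamic selection step (an assumption on my part; the paper's actual confidence measure may differ) is a self-consistency-style vote: sample several direct answers, and fall back to program synthesis only when they disagree too much.

```python
from collections import Counter

def route(samples, threshold=0.6):
    """Pick a strategy for one instance from sampled direct answers.

    Hypothetical helper: `samples` holds candidate direct-reasoning answers;
    agreement above `threshold` means direct reasoning is trusted, otherwise
    the instance is routed to per-instance program synthesis.
    """
    answer, count = Counter(samples).most_common(1)[0]
    confidence = count / len(samples)
    return ("direct", answer) if confidence >= threshold else ("synthesize", None)

print(route([4, 4, 4, 5, 4]))   # → ('direct', 4)
print(route([1, 2, 3, 4, 5]))   # → ('synthesize', None)
```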
- Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization
This paper systematically investigates how compositional properties of calibration data (sequence length, sample size, source, format) and domain correspondence affect capability preservation after LLM compression. It finds that representativeness and diversity in the activation space are the fundamental determinants of calibration data quality, and proposes a three-stage calibration data curation framework, COLA.
- Principled Fine-tuning of LLMs from User-Edits: A Medley of Preference, Supervision, and Reward
This paper systematically investigates how to fine-tune LLMs using user-edit data, unifying three feedback types—preference, supervision, and cost—and proposes a simple ensembling procedure that achieves robust adaptation across diverse user distributions.
- Program Synthesis via Test-Time Transduction
This paper proposes SYNTRA, a framework that reframes program synthesis as transductive learning — at test time, it leverages visible test inputs and LLM judgment to iteratively eliminate inconsistent candidate program hypotheses. A greedy maximin algorithm minimizes the number of LLM queries, achieving accuracy improvements of up to 196% across 4 benchmarks.
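The maximin elimination idea can be sketched on toy candidates. This is a simplified stand-in, not SYNTRA's implementation: a ground-truth function plays the role of the LLM judge, and the greedy rule picks the visible input whose worst-case answer leaves the fewest surviving hypotheses, so each query eliminates as much of the hypothesis set as possible.

```python
# Hypothetical candidate programs and visible test inputs.
candidates = [lambda x: x * 2, lambda x: x ** 2, lambda x: x + 2]
inputs = [0, 2, 3]
true_fn = lambda x: x * 2          # stands in for the LLM judgment oracle

alive = list(range(len(candidates)))
queries = 0
while len(alive) > 1:
    def worst_case_survivors(x):
        # Group surviving candidates by the output they predict on x;
        # the largest group is the worst case after observing the answer.
        groups = {}
        for i in alive:
            groups.setdefault(candidates[i](x), []).append(i)
        return max(len(g) for g in groups.values())

    x = min(inputs, key=worst_case_survivors)   # greedy maximin choice
    if worst_case_survivors(x) == len(alive):
        break                                   # no input can split the set
    y = true_fn(x)
    queries += 1
    alive = [i for i in alive if candidates[i](x) == y]

print(alive, queries)
```

On this toy set, input 3 separates all three hypotheses at once, so a single query suffices, which is exactly the query-minimizing behavior the greedy maximin rule targets.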
- QiMeng-SALV: Signal-Aware Learning for Verilog Code Generation
This paper proposes QiMeng-SALV, a signal-aware learning method that extracts functionally correct signal-level code snippets from partially incorrect Verilog modules as reward signals for DPO training, elevating the optimization granularity from module level to signal level and achieving SOTA on VerilogEval and RTLLM.
- Searching Latent Program Spaces
This paper proposes the Latent Program Network (LPN), which uses an encoder to map input–output examples into a latent program representation, then performs gradient-based search in the latent space at test time to adapt to new tasks. LPN substantially outperforms in-context learning and test-time training methods on the ARC-AGI benchmark.
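Test-time latent search can be illustrated with a toy stand-in: here the "decoder" is a fixed bilinear map rather than LPN's neural network, and the "program" is just a latent vector `z` fitted by gradient descent against the visible input-output examples. All shapes and the decoder form are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
# Fixed toy "decoder" parameters; near-identity for well-conditioned search.
W = np.eye(d) + 0.1 * rng.standard_normal((d, d))

def decode(z, x):
    return (W @ z) @ x             # scalar output for input x under latent z

z_true = rng.standard_normal(d)    # latent of the hidden task
X = rng.standard_normal((8, d))    # visible input-output examples
y = np.array([decode(z_true, x) for x in X])

z = np.zeros(d)                    # start from a neutral latent, then search
lr = 0.1
for _ in range(300):
    preds = np.array([decode(z, x) for x in X])
    # Gradient of mean squared error with respect to z (d decode/dz = W^T x).
    grad = sum(2 * (p - t) * (W.T @ x) for p, t, x in zip(preds, y, X)) / len(X)
    z -= lr * grad

mse = float(np.mean((np.array([decode(z, x) for x in X]) - y) ** 2))
print(round(mse, 8))
```

The search drives the example loss to near zero without touching the decoder, which is the essential contrast with in-context learning: adaptation happens by optimizing the latent program, not the model.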
- SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
A fully automated pipeline is developed to continuously mine real-world software engineering interaction tasks from GitHub, producing the SWE-rebench dataset of 21,000+ executable Python tasks and a decontaminated benchmark. The work reveals that several models exhibit contamination-inflated performance on SWE-bench Verified (e.g., DeepSeek-V3: 39.7% on SWE-bench vs. 21.3% on SWE-rebench).
- Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models
This paper proposes VSGRPO — a dual-reward reinforcement learning strategy based on GRPO — that jointly optimizes a structure-level reward (TEDS-Structure) and a visual fidelity reward (CW-SSIM on rendered images). The fine-tuned MLLM (only 3B parameters) surpasses GPT-4o and models with 72B+ parameters on the table-image-to-LaTeX generation task, with particularly significant gains on complex tables.
- Text-to-Code Generation for Modular Building Layouts in Building Information Modeling
This paper proposes Text2MBL, a framework that translates natural language descriptions into executable BIM code (rather than coordinate sequences). Through an object-oriented code architecture and LLM fine-tuning, it enables automatic generation of modular building layouts, achieving 10%+ IoU improvement in geometric consistency over coordinate-driven methods.