💻 Code Intelligence

🔬 ICLR2026 · 23 paper notes

A Problem-Oriented Perspective and Anchor Verification for Code Optimization: This paper proposes a problem-oriented (rather than user-oriented) approach to constructing optimization pairs that integrates the strategic diversity of multiple programmers, and designs an anchor verification framework that leverages "slow but correct code" to generate test cases for mitigating the "optimization tax" (correctness loss), improving the optimization rate from 31.24% to 71.06% and the speedup ratio from 2.95x to 6.08x.
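The anchor verification idea can be illustrated as a small differential test. This is a minimal sketch, not the paper's framework: the slow implementation acts as the trusted anchor that labels inputs with expected outputs, and a candidate optimization is accepted only if it reproduces them. All function names and the toy task (sum of squares) are assumptions made for illustration.

```python
import random

def slow_sum_of_squares(n):
    """Anchor: slow but trusted reference implementation (hypothetical task)."""
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total

def fast_sum_of_squares(n):
    """Candidate optimization: closed-form formula."""
    return n * (n + 1) * (2 * n + 1) // 6

def buggy_fast_sum_of_squares(n):
    """A fast but incorrect candidate (sign error in the formula)."""
    return n * (n + 1) * (2 * n - 1) // 6

def build_test_cases(anchor, inputs):
    """Use the anchor to turn raw inputs into (input, expected_output) pairs."""
    return [(x, anchor(x)) for x in inputs]

def passes(candidate, test_cases):
    """Accept the candidate only if it matches the anchor on every case."""
    return all(candidate(x) == expected for x, expected in test_cases)

rng = random.Random(0)
inputs = list(range(10)) + [rng.randint(0, 200) for _ in range(40)]
test_cases = build_test_cases(slow_sum_of_squares, inputs)

accepted = passes(fast_sum_of_squares, test_cases)
rejected = not passes(buggy_fast_sum_of_squares, test_cases)
```

In this toy setting the correct closed form is accepted and the buggy one is rejected, which is the kind of correctness check the "optimization tax" discussion refers to.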
Ambig-SWE: Interactive Agents to Overcome Underspecificity in Software Engineering: This paper introduces Ambig-SWE, an underspecified variant of SWE-Bench Verified, and systematically evaluates LLM coding agents across three dimensions of interactive capability—detecting underspecification, formulating clarification questions, and leveraging obtained information. Results show that interaction can improve resolution rates in underspecified settings by up to 74%, yet models default to non-interactive behavior and struggle to distinguish between well-specified and underspecified instructions.
Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation: To address the SFT performance plateau in chart-to-code generation, this paper proposes Multimodal Structured Reinforcement Learning (MSRL), which employs a dual-layer textual and visual reward function along with a two-stage RL strategy, achieving improvements of 6.2% and 9.9% on high-level metrics on ChartMimic and ReachQA respectively, establishing open-source SOTA and matching GPT-4o.
CARD: Towards Conditional Design of Multi-agent Topological Structures: CARD proposes a Conditional Agentic Graph Designer framework that adaptively designs multi-agent communication topologies based on dynamic environment signals—including model capability changes, tool availability, and knowledge source updates—via a conditional variational graph encoder and environment-aware optimization. The approach consistently outperforms static and prompt-based baselines on HumanEval, MATH, and MMLU.
DiaBlo: Diagonal Blocks Are Sufficient For Finetuning: This paper proposes DiaBlo—a parameter-efficient fine-tuning method that replaces low-rank decomposition with diagonal block updates. The weight matrix is partitioned into \(N \times N\) blocks, and only the diagonal blocks \(\mathbf{D}_1, \ldots, \mathbf{D}_N\) are trained. This approach entirely bypasses the non-convex optimization, initialization sensitivity, and gradient instability introduced by the \(\mathbf{AB}\) product in LoRA. Zero initialization suffices for convergence, and the method requires only a single torch.einsum batched matmul in PyTorch. Theoretical analysis proves that DiaBlo is strictly more expressive than LoRA under the same parameter budget. DiaBlo achieves state-of-the-art results across commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, as well as 4-bit/2-bit quantization settings.
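A minimal sketch of the diagonal-block update described above, written in plain Python so it runs without dependencies (the note itself mentions a single torch.einsum call in PyTorch). The frozen weight, block shapes, and values are hypothetical; the point is only that the trainable update touches the diagonal blocks and that zero-initialized blocks leave the pretrained output unchanged.

```python
def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    d_out = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(d_out)]

def diablo_forward(x, W, blocks):
    """Compute x @ W + x @ blockdiag(D_1, ..., D_N), one block at a time."""
    y = matvec(x, W)                      # frozen pretrained path
    n_blocks = len(blocks)
    in_size = len(x) // n_blocks
    out_size = len(y) // n_blocks
    for k, D in enumerate(blocks):
        x_k = x[k * in_size:(k + 1) * in_size]   # k-th input slice
        delta = matvec(x_k, D)                   # contribution of block D_k
        for j, value in enumerate(delta):
            y[k * out_size + j] += value
    return y

# Hypothetical 4x4 frozen weight split into N = 2 diagonal blocks of size 2x2.
W = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 2.0],
     [3.0, 0.0, 1.0, 0.0],
     [0.0, 3.0, 0.0, 1.0]]
x = [1.0, 2.0, 3.0, 4.0]

zero_blocks = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(2)]      # zero init
trained_blocks = [[[1.0, 0.0], [0.0, 1.0]],                     # made-up values
                  [[0.0, 1.0], [1.0, 0.0]]]

y_init = diablo_forward(x, W, zero_blocks)
y_trained = diablo_forward(x, W, trained_blocks)
trainable_params = sum(len(D) * len(D[0]) for D in trained_blocks)
```

Here the two 2x2 blocks hold 8 trainable values against 16 in the full matrix; with zero initialization the output equals the frozen layer's output.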
DRO-InstructZero: Distributionally Robust Prompt Optimization for Large Language Models: This work integrates distributionally robust optimization (DRO) into a Bayesian optimization framework for zero-shot instruction optimization, enabling optimized instructions to maintain reliable performance under distribution shift and adversarial evaluation conditions.
DRO-InstructZero: Distributionally Robust Prompt Optimization for Large Language Models: This work integrates distributionally robust optimization (DRO) into the Bayesian optimization (BO) framework of InstructZero. By maximizing the worst-case expected utility over an ambiguity set defined by an f-divergence ball, the automatically searched prompts maintain reliable performance under distribution shift.
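One way to write the worst-case objective described above is shown below. The notation is an assumption introduced here, not taken from the paper: \(p\) is a candidate instruction, \(P_0\) the nominal distribution over evaluation examples \(z\), \(D_f\) an f-divergence, \(\rho\) the radius of the ambiguity set, and \(u(p, z)\) the utility of instruction \(p\) on example \(z\).

\[
\hat{p} \in \arg\max_{p} \; \min_{Q \,:\, D_f(Q \,\|\, P_0) \le \rho} \; \mathbb{E}_{z \sim Q}\big[\, u(p, z) \,\big]
\]

Setting \(\rho = 0\) forces \(Q = P_0\) and recovers the ordinary expected-utility objective, so the radius controls how much distribution shift the selected instruction is asked to withstand.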
Execution-Grounded Credit Assignment for GRPO in Code Generation: This paper proposes EGCA (Execution-Grounded Credit Assignment), which leverages execution traces to localize the earliest semantic deviation in a program and concentrates GRPO gradients on the causal token span, addressing the coarse-grained credit assignment problem in code generation. EGCA achieves 82.1% pass@1 on HumanEval.
Improving Code Localization with Repository Memory: By leveraging a repository's commit history to construct episodic memory (past commits) and semantic memory (summaries of active code functionality), this work enhances the code localization capability of language agents, achieving significant improvements on SWE-bench.
IMSE: Intrinsic Mixture of Spectral Experts Fine-tuning for Test-Time Adaptation: This paper proposes IMSE, which decomposes the linear layers of a pretrained ViT via SVD into "spectral experts" and adapts only the singular values for extremely parameter-efficient test-time adaptation. Combined with a diversity maximization loss and a domain-aware spectral code retrieval mechanism, IMSE achieves state-of-the-art performance across three settings: TTA, CTTA, and progressive CTTA.
Inference-Time Safety for Code LLMs via Retrieval-Augmented Revision: This paper proposes SOSecure, a training-free inference-time safety mechanism that uses BM25 to retrieve relevant community security warnings from a Stack Overflow knowledge base and guides the model to revise unsafe code during inference. Across three real-world datasets, SOSecure achieves a vulnerability fix rate of up to 96.7% without introducing new vulnerabilities.
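The retrieval step can be sketched with a small Okapi BM25 scorer. This is an illustrative implementation under stated assumptions, not SOSecure's code: the whitespace tokenizer, the parameters k1 and b, and the three warning strings are all made up for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up stand-ins for community security warnings.
warnings = [
    "md5 is broken for password hashing use bcrypt instead",
    "string concatenation in sql queries allows injection use parameters",
    "pickle load on untrusted data can execute arbitrary code",
]
query = "hashlib md5 password hashing"
scores = bm25_scores(query, warnings)
best = max(range(len(warnings)), key=scores.__getitem__)
```

The top-scoring warning would then be placed in the model's context as guidance for revising the generated code.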
InnoGym: Benchmarking the Innovation Potential of AI Agents: This paper proposes InnoGym, the first benchmark and framework for systematically evaluating the innovation potential of AI agents. It introduces two complementary metrics—Performance Gain and Novelty—and, through 18 improvable tasks, finds that current agents exhibit a degree of innovativeness but lack the robustness to reliably translate novel ideas into performance improvements.
KV Cache Transform Coding for Compact Storage in LLM Inference: This paper proposes KVTC, a KV cache compression method inspired by classical media compression techniques (PCA-based feature decorrelation + adaptive quantization + entropy coding). KVTC achieves up to 20× compression (40×+ in specific scenarios) on Llama 3, Mistral NeMo, and R1-Qwen 2.5, outperforming baselines including token eviction, quantization, and SVD-based methods.
Learning to Reason without External Rewards: This paper proposes Intuitor, a Reinforcement Learning from Internal Feedback (RLIF) method that replaces external verifiable rewards with the model's own self-certainty (the KL divergence between the output distribution and a uniform distribution). Intuitor matches GRPO performance on mathematical reasoning while exhibiting superior generalization to out-of-domain tasks such as code generation.
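A minimal sketch of a self-certainty score for one next-token distribution. The note only says the score is a KL divergence between the output distribution and a uniform distribution; the direction KL(U || p) and the averaging over generation steps used below are assumptions, and the toy distributions are invented.

```python
import math

def self_certainty(probs):
    """KL(U || p) for one next-token distribution p (all entries must be > 0)."""
    v = len(probs)
    return sum((1.0 / v) * math.log((1.0 / v) / p) for p in probs)

def sequence_self_certainty(step_probs):
    """Average the per-step score over all generated positions."""
    return sum(self_certainty(p) for p in step_probs) / len(step_probs)

uniform = [0.25, 0.25, 0.25, 0.25]      # maximally uncertain
mild = [0.70, 0.10, 0.10, 0.10]         # somewhat confident
peaked = [0.97, 0.01, 0.01, 0.01]       # very confident

score_uniform = self_certainty(uniform)
score_mild = self_certainty(mild)
score_peaked = self_certainty(peaked)
score_sequence = sequence_self_certainty([mild, peaked])
```

The score is zero for a uniform distribution and grows as probability mass concentrates on fewer tokens, which is what lets it stand in for an external reward signal.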
MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task: Drawing inspiration from the Fill-in-the-Middle (FIM) paradigm in code completion, this work trains a dedicated step-expansion model, MathFimer-7B, to insert finer-grained intermediate reasoning steps into existing mathematical solution chains, thereby systematically improving the mathematical reasoning capability of downstream models.
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning: This work proposes PaperCoder — a multi-agent LLM framework that automatically converts machine learning papers into executable code repositories via a three-stage pipeline: Planning, Analysis, and Coding. 88% of the generated repositories are rated as best by the original paper authors, and the framework substantially outperforms baselines on the PaperBench benchmark.
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory: This paper proposes ReasoningBank, a memory framework that distills generalizable reasoning strategies from both successful and failed experiences as judged by the agent itself, and introduces memory-aware test-time scaling (MaTTS) to establish a synergy between memory and test-time scaling. The approach consistently outperforms baselines on WebArena, Mind2Web, and SWE-Bench (up to 34.2% relative improvement) while reducing interaction steps by 16%.
Sharing State Between Prompts and Programs: This paper proposes the shared program state abstraction, enabling prompts to directly read and write program variables, manipulate heap objects, and control program flow. The abstraction is realized in the Nightjar system (Python + prompt hybrid programming), achieving a 39.6% reduction in code size while maintaining or improving accuracy (+4–19%).
ShieldedCode: Learning Robust Representations for Virtual Machine Protected Code: This paper proposes ShieldedCode — the first protection-aware code representation learning framework. By introducing hierarchical dependency modeling (three levels: intra-instruction, preceding-instruction, and inter-instruction) and joint functional-aware and protection-aware contrastive learning, the framework enables LLMs to generate, compare, and reason about VM-protected code. ShieldedCode surpasses existing methods on both VM code generation (Pass@1 26.95% vs. GPT-4o 22.58%) and binary similarity detection.
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning: This paper proposes Supervised Reinforcement Learning (SRL), which reframes problem solving as a step-wise action generation process. By leveraging dense rewards based on sequence similarity, SRL enables small models to learn from expert trajectories on difficult reasoning problems that neither SFT nor RLVR can effectively handle.
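A dense, similarity-based step reward can be sketched with Python's standard difflib module. This is an illustration of the general idea under assumptions: the use of SequenceMatcher.ratio() as the similarity measure and the example action strings are choices made here, not details confirmed by the note.

```python
from difflib import SequenceMatcher

def step_reward(predicted_action, expert_action):
    """Similarity in [0, 1] between the model's step and the expert's step."""
    return SequenceMatcher(None, predicted_action, expert_action).ratio()

# Hypothetical expert step and two model attempts.
expert = "factor x^2 - 5x + 6 as (x - 2)(x - 3)"
close_attempt = "factor x^2 - 5x + 6 into (x - 2)(x - 3)"
far_attempt = "guess x = 7"

reward_exact = step_reward(expert, expert)
reward_close = step_reward(close_attempt, expert)
reward_far = step_reward(far_attempt, expert)
```

Unlike a pass/fail outcome reward, a score like this still gives partial credit when a step is close to the expert's but not identical, so every step yields a learning signal.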
The Limits of Long-Context Reasoning in Automated Bug Fixing: This paper systematically evaluates the limits of current LLMs in long-context code debugging. It finds that the success of agentic workflows stems from task decomposition rather than long-context reasoning (successful trajectories consume only 20–30K tokens), while performance degrades sharply under 64K single-pass patch generation (GPT-5-nano achieves 0%), revealing a significant gap between nominal context length and actual usable context capacity.
Training Large Language Models To Reason In Parallel With Global Forking Tokens: This paper proposes Set Supervised Fine-Tuning (SSFT), which aligns global forking tokens with diverse reasoning trajectories via bipartite matching, enabling LLMs to globally steer distinct reasoning modes from a single control token. SSFT substantially outperforms standard SFT and GRPO on mathematical reasoning and code generation tasks.
Training Large Language Models to Reason in Parallel with Global Forking Tokens: This paper proposes Set Supervised Fine-Tuning (SSFT), which introduces global forking tokens and a set-based loss via bipartite matching to train LLMs to produce diverse and correct reasoning patterns triggered by a single control token, outperforming standard SFT+GRPO on both Pass@1 and Cons@k.
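The bipartite matching step behind a set-based loss can be sketched as a minimum-cost assignment between forking tokens and reasoning traces. This is a toy illustration, not the paper's implementation: the cost matrix values are made up, each entry is assumed to stand for the loss of generating trace j when conditioned on token i, and brute-force search over permutations replaces a proper assignment solver such as the Hungarian algorithm.

```python
from itertools import permutations

def best_matching(cost):
    """Return the token-to-trace assignment with minimum total cost."""
    n = len(cost)

    def total(perm):
        return sum(cost[i][perm[i]] for i in range(n))

    best_perm = min(permutations(range(n)), key=total)
    return list(best_perm), total(best_perm)

# cost[i][j]: hypothetical loss of trace j under forking token i.
cost = [
    [0.9, 0.2, 0.7],
    [0.3, 0.8, 0.6],
    [0.5, 0.9, 0.1],
]

matching, set_loss = best_matching(cost)
```

With these numbers token 0 is paired with trace 1, token 1 with trace 0, and token 2 with trace 2; training on the matched pairs is what would push each token toward a distinct reasoning mode.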