💻 Code Intelligence
🤖 AAAI 2026 · 9 paper notes
- DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation
  - This paper proposes DiffBench (an evaluation benchmark comprising 604 diffusion model acceleration tasks across 5 difficulty levels) and DiffAgent (a closed-loop framework integrating Planning, Coding, and Debugging agents with a genetic algorithm-based selector). On Claude Sonnet 4, the framework improves the pass rate for diffusion acceleration code generation from 54.30% to 81.59%, achieving a 68.27% success rate on complex optimization tasks.
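  A minimal sketch of that closed loop under toy assumptions; `StubLLM`, `run_benchmark`, and the genetic-algorithm settings below are this note's placeholders, not DiffAgent's actual interface:

  ```python
  import random
  from dataclasses import dataclass

  @dataclass
  class Candidate:
      code: str
      fitness: float = 0.0  # stand-in for a speedup-times-correctness score

  def run_benchmark(code: str):
      """Stub for the real compile-and-profile harness."""
      return True, random.uniform(0.5, 2.0)  # (passed, speedup), dummy values

  class StubLLM:
      """Stub for the Planning / Coding / Debugging agents."""
      def plan_task(self, task): return f"plan for: {task}"
      def write_code(self, plan): return f"# candidate implementing {plan}"
      def repair(self, code, plan): return code + "  # repaired"
      def mutate(self, code): return code + "  # mutated"

  def diffagent_loop(task, llm, generations=5, pop_size=8):
      plan = llm.plan_task(task)                     # Planning agent
      pop = [Candidate(llm.write_code(plan)) for _ in range(pop_size)]
      for _ in range(generations):
          for c in pop:
              ok, speedup = run_benchmark(c.code)
              if not ok:                             # Debugging agent closes the loop
                  c.code = llm.repair(c.code, plan)
                  ok, speedup = run_benchmark(c.code)
              c.fitness = speedup if ok else 0.0
          pop.sort(key=lambda c: c.fitness, reverse=True)
          keep = pop[: pop_size // 2]                # GA-style selector: keep the fittest,
          pop = keep + [Candidate(llm.mutate(random.choice(keep).code))
                        for _ in range(pop_size - len(keep))]  # refill by mutation
      return max(pop, key=lambda c: c.fitness)

  best = diffagent_loop("fuse attention kernels in a DDIM sampler", StubLLM())
  ```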
- EquaCode: A Multi-Strategy Jailbreak Approach for Large Language Models via Equation Solving and Code Completion
  - This paper proposes EquaCode, a multi-strategy jailbreak method that decomposes malicious queries into a cross-domain combination of equation solving (\(B+C+x=A\)) and code completion (completing the `solve()` method of a `Solver` class), achieving an average attack success rate of 92.78% on the GPT series and approaching 100% on the latest models (Gemini/DeepSeek/Grok).
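  A purely structural sketch of that two-part prompt format, with a harmless placeholder task; the variable values and the `Solver` skeleton are illustrative, not the paper's prompts:

  ```python
  # Part 1: equation solving. The sensitive term is hidden as the unknown x
  # in B + C + x = A, so the model must derive it instead of reading it.
  A, B, C = 10, 3, 2
  x = A - B - C  # = 5; in the attack, x reconstructs a masked token

  # Part 2: code completion. The request is framed as filling in the body
  # of Solver.solve(), shifting the model into a code-assistance context.
  class Solver:
      def solve(self):
          ...  # the target model is asked to complete this method using x
  ```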
- Extracting Events Like Code: A Multi-Agent Programming Framework for Zero-Shot Event Extraction
  - This paper proposes Agent-Event-Coder (AEC), which reformulates zero-shot event extraction as a software engineering workflow. Four specialized agents (Retrieval→Planning→Coding→Verification) collaborate to perform extraction, while event schemas are encoded as executable Python classes to enable compiler-style deterministic validation and dual-loop iterative correction. AEC comprehensively outperforms zero-shot baselines across 5 domains and 6 LLMs.
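  The schemas-as-code idea fits in a short sketch; the `AttackEvent` fields and the `validate` helper below are assumptions for illustration, not AEC's released schema library:

  ```python
  from dataclasses import dataclass, fields

  @dataclass
  class AttackEvent:            # event schema encoded as an executable class
      trigger: str
      attacker: str | None
      target: str | None
      place: str | None

  def validate(extraction: dict) -> AttackEvent:
      """Compiler-style deterministic check: unknown roles fail loudly,
      giving the Verification agent a concrete error message to feed
      back into the iterative correction loop."""
      allowed = {f.name for f in fields(AttackEvent)}
      unknown = set(extraction) - allowed
      if unknown:
          raise TypeError(f"unknown roles: {unknown}")
      return AttackEvent(**{name: extraction.get(name) for name in allowed})

  # The Coding agent emits role fillers; a bad key raises instead of passing silently.
  event = validate({"trigger": "bombing", "attacker": "militants",
                    "target": "checkpoint", "place": "outskirts"})
  ```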
- MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings
  - This paper proposes ModularStarEncoder (MoSE), a 1B-parameter multi-exit encoder that significantly enhances early-layer representations via a self-distillation mechanism in which higher layers guide the training of lower layers. MoSE surpasses all open-source models on code understanding benchmarks such as CodeSearchNet while supporting flexible compute–accuracy tradeoffs at deployment.
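  A minimal PyTorch sketch of that training signal, assuming mean-pooled exits and an MSE objective against the detached top-layer embedding; MoSE's real exit placement and losses may differ:

  ```python
  import torch
  import torch.nn as nn

  class MultiExitEncoder(nn.Module):
      def __init__(self, dim=256, n_layers=8, exits=(2, 4, 6)):
          super().__init__()
          self.layers = nn.ModuleList(
              nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
              for _ in range(n_layers))
          self.exits = set(exits)

      def forward(self, x):
          exit_embs, h = [], x
          for i, layer in enumerate(self.layers, start=1):
              h = layer(h)
              if i in self.exits:
                  exit_embs.append(h.mean(dim=1))  # pooled early-exit embedding
          return exit_embs, h.mean(dim=1)          # exits + final embedding

  def self_distill_loss(exit_embs, final):
      # Higher layers teach lower ones: each early exit regresses the final
      # embedding, detached so gradients flow only into the early layers.
      target = final.detach()
      return sum(nn.functional.mse_loss(e, target) for e in exit_embs)

  model = MultiExitEncoder()
  exits, final = model(torch.randn(2, 16, 256))    # (batch, tokens, dim)
  loss = self_distill_loss(exits, final)           # added to the main objective
  ```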
- ReCode: Updating Code API Knowledge with Reinforcement Learning
  - This paper proposes ReCode, a framework that trains LLMs via rule-based reinforcement learning (rather than SFT) to correctly leverage API update documentation provided in the prompt for code version migration, enabling a 7B model to surpass 32B models on CodeUpdateArena.
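  A sketch of what "rule-based" can mean here, using a real pandas deprecation (`DataFrame.append` → `pd.concat`) as the example; the rules and weights are illustrative, not ReCode's actual reward:

  ```python
  import ast

  def migration_reward(generated: str, old_api: str = ".append(",
                       new_api: str = "pd.concat") -> float:
      """Scalar reward for an attempted DataFrame.append -> pd.concat migration."""
      try:
          ast.parse(generated)       # rule 1: output must be syntactically valid
      except SyntaxError:
          return -1.0
      if old_api in generated:       # rule 2: deprecated call still present
          return -0.5
      if new_api in generated:       # rule 3: updated API actually adopted
          return 1.0
      return 0.0                     # parses, but dodged the migration

  # Such scalar signals drive policy-gradient updates directly, instead of
  # imitating reference solutions token-by-token as SFT would.
  print(migration_reward("df = pd.concat([df, row.to_frame().T])"))  # 1.0
  ```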
- SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models
  - This paper proposes SPAN, a cross-calendar temporal reasoning benchmark (6 calendars × 10 reasoning directions over a 100-year range, totaling 37,380 instances). Baseline LLMs achieve an average accuracy of only 34.5% (none exceeding 80%), revealing two systematic failure modes: Future-Date Degradation and Calendar Asymmetry Bias. A tool-augmented Time Agent achieves 95.31%, demonstrating that cross-calendar reasoning requires external tools rather than parametric knowledge.
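  The tool-over-parametric-knowledge point is easy to make concrete: the standard Gregorian-to-Julian-Day-Number formula gives exact integer date arithmetic that a Time Agent can call. SPAN's actual toolset is not detailed here; this is one plausible such tool:

  ```python
  def gregorian_to_jdn(y: int, m: int, d: int) -> int:
      """Julian Day Number via the classic integer-arithmetic formula."""
      a = (14 - m) // 12
      y2 = y + 4800 - a
      m2 = m + 12 * a - 3
      return (d + (153 * m2 + 2) // 5 + 365 * y2
              + y2 // 4 - y2 // 100 + y2 // 400 - 32045)

  def days_between(d1, d2) -> int:
      """Exact day difference, immune to leap-year and month-length slips."""
      return gregorian_to_jdn(*d2) - gregorian_to_jdn(*d1)

  print(days_between((2024, 2, 28), (2024, 3, 1)))  # 2 (2024 is a leap year)
  ```

  With a conversion like this per calendar, cross-calendar queries reduce to exact integer arithmetic rather than recall.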
- TAPA: Training-Free Adaptation of Programmatic Agents via LLM-Guided Program Synthesis in Dynamic Environments
  - TAPA positions LLMs as "intelligent modulators" of the symbolic action space rather than direct decision-makers. Through LLM-guided program synthesis, it dynamically adapts the symbolic actions of programmatic agents without retraining, achieving strong performance in cybersecurity DDoS defense (77.7% network uptime) and swarm-intelligence formation control.
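  A minimal sketch of the "LLM as modulator" split: the decision program stays fixed and interpretable, and the LLM only proposes replacement action bodies that are smoke-tested before being swapped in. All names here are illustrative, not TAPA's interface:

  ```python
  def default_throttle(state):              # an initial symbolic action
      return {"rate_limit": 100}

  class ProgrammaticAgent:
      def __init__(self):
          self.actions = {"throttle": default_throttle}

      def act(self, state):
          # Fixed decision program over named actions; no retraining occurs.
          return self.actions["throttle"](state)

      def adapt(self, name, proposed_src, test_state):
          """Install an LLM-synthesized action only if it passes a check."""
          ns = {}
          exec(proposed_src, ns)            # in practice: a proper sandbox
          candidate = ns[name]
          result = candidate(test_state)    # smoke-test before swap-in
          if isinstance(result, dict) and "rate_limit" in result:
              self.actions[name] = candidate

  agent = ProgrammaticAgent()
  # Under a traffic spike, the LLM might synthesize a stricter action:
  proposal = ("def throttle(state):\n"
              "    return {'rate_limit': 10 if state['pps'] > 1e5 else 100}\n")
  agent.adapt("throttle", proposal, {"pps": 2e5})
  print(agent.act({"pps": 2e5}))            # {'rate_limit': 10}
  ```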
- Towards Better Code Understanding in Decoder-Only Models with Contrastive Learning
  - This paper proposes CL4D, a contrastive learning framework that adapts pretrained decoder-only code generation models to code understanding tasks (code search, clone detection) via continued pretraining, achieving performance comparable to or better than encoder-only models of equivalent scale without retraining them from scratch.
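  A sketch of the adaptation objective under common assumptions (last-token pooling, in-batch InfoNCE over query/code pairs); CL4D's exact pooling, pairing, and temperature may differ:

  ```python
  import torch
  import torch.nn.functional as F

  def last_token_embed(hidden_states, attention_mask):
      # Decoder-only models attend left-to-right, so the last non-pad token
      # is the natural summary position for the whole sequence.
      idx = attention_mask.sum(dim=1) - 1                   # (batch,)
      return hidden_states[torch.arange(hidden_states.size(0)), idx]

  def info_nce(q_emb, c_emb, temperature=0.05):
      q = F.normalize(q_emb, dim=-1)
      c = F.normalize(c_emb, dim=-1)
      logits = q @ c.T / temperature          # (batch, batch) similarities
      labels = torch.arange(q.size(0))        # positives on the diagonal;
      return F.cross_entropy(logits, labels)  # in-batch codes as negatives

  # Shapes as they would come from the decoder during continued pretraining:
  h_q, h_c = torch.randn(4, 32, 768), torch.randn(4, 48, 768)
  m_q = torch.ones(4, 32, dtype=torch.long)
  m_c = torch.ones(4, 48, dtype=torch.long)
  loss = info_nce(last_token_embed(h_q, m_q), last_token_embed(h_c, m_c))
  ```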
- Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study
  - This paper systematically investigates the capability bottlenecks of open-source LLMs in data analysis tasks. It decomposes data analysis into three dimensions (data comprehension, code generation, and strategic planning) and identifies strategic planning, rather than coding or data comprehension, as the decisive factor. A strategy-guided data synthesis approach is proposed, enabling fine-tuned 7B/14B models to achieve performance competitive with GPT-4o.
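  A sketch of what strategy-guided synthesis can look like: each synthesized record makes the plan explicit before any code, so fine-tuning trains the planning skill identified as the bottleneck. The record layout and the teacher interface are assumptions, not the paper's format:

  ```python
  import json
  from dataclasses import dataclass, asdict

  @dataclass
  class AnalysisExample:
      question: str
      strategy: list[str]   # explicit plan comes first...
      code: str             # ...and the code is conditioned on it

  def synthesize(question, teacher):
      """Teacher drafts the strategy, then writes code that follows it."""
      strategy = teacher.draft_strategy(question)
      code = teacher.write_code(question, strategy)
      return AnalysisExample(question, strategy, code)

  class StubTeacher:        # stand-in for the strong teacher model
      def draft_strategy(self, q):
          return ["inspect schema and dtypes",
                  "aggregate sales by region",
                  "rank regions and report the top one"]
      def write_code(self, q, plan):
          return "result = df.groupby('region')['sales'].sum().idxmax()"

  ex = synthesize("Which region has the highest total sales?", StubTeacher())
  print(json.dumps(asdict(ex), indent=2))   # one SFT record: plan + code
  ```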