# QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code
Conference: NeurIPS 2025
arXiv: 2511.01183
Code: not provided
Area: Neural Compilation / Code Generation
Keywords: Neural Compilation, LLM, IR-to-Assembly Translation, Self-Evolving Prompt Optimization, Compiler
## TL;DR
This paper introduces NeuComBack, a benchmark for evaluating neural compilation on IR-to-assembly translation, and proposes a self-evolving prompt optimization method that iteratively improves compilation prompts by learning from the LLM's own self-debugging trajectories. The approach raises functional correctness on the L2 benchmark from 44% to 64%, and 87.5% of the correctly generated programs outperform clang-O3.
## Background & Motivation
Compilers are indispensable yet extraordinarily complex software systems that demand substantial human expertise to develop and maintain. Given the remarkable performance of LLMs on code-related tasks, neural compilation—directly translating high-level languages or intermediate representations (IR) into low-level assembly using LLMs—has emerged as an appealing new paradigm.
Core Limitations of Prior Work:
- Lack of benchmarks: No standardized benchmark exists for evaluating IR-to-assembly compilation, making it impossible to systematically measure and track progress.
- Correctness challenge: Even the most capable LLMs fall far short of traditional compilers in terms of the semantic correctness of generated assembly.
- Optimization challenge: Generating assembly that outperforms mature compilers (e.g., clang-O3) while maintaining correctness is even more difficult.
Key Challenge: Although LLMs can understand and generate code, applying them directly to assembly generation lacks both evaluation standards and effective methods for enhancing their capability.
Key Insight:
- Construct a dedicated benchmark, NeuComBack, with two levels: basic compilation (L1) and optimization potential (L2).
- Propose a self-evolving prompt optimization method that enables LLMs to learn from their own debugging experience, iteratively refining prompt strategies for assembly generation.
## Method
### Overall Architecture
The method comprises two phases: offline prompt learning and online inference. In the offline phase, LLM self-debugging trajectories are collected, insights are extracted, and prompts are iteratively evolved. In the online phase, the evolved prompts guide assembly generation and iterative optimization.
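This split can be pictured as a small driver that first runs the offline learning phase and then applies the learned prompt online. Below is a minimal sketch in Python, with `evolve_prompt` and `compile_one` as stand-ins for the two phases (both interfaces are illustrative assumptions, not the paper's actual code):

```python
from typing import Callable, Iterable, List

def neural_compile_pipeline(
    evolve_prompt: Callable[[str, List[str]], str],  # offline: (seed prompt, training IRs) -> learned prompt
    compile_one: Callable[[str, str], str],          # online: (prompt, IR) -> assembly
    seed_prompt: str,
    train_irs: List[str],
    test_irs: Iterable[str],
) -> List[str]:
    # Offline phase: distill insights from self-debugging trajectories on the training IRs.
    learned_prompt = evolve_prompt(seed_prompt, train_irs)
    # Online phase: the evolved prompt guides generation and iterative optimization on unseen IRs.
    return [compile_one(learned_prompt, ir) for ir in test_irs]
```

Possible sketches of the two callables follow in the Key Designs and Loss & Training subsections below.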
### Key Designs
- NeuComBack Benchmark:
    - Level 1 (Basic Compilation, 200 tasks): Sampled from ExeBench, covering diverse real-world C programs with an emphasis on functional correctness. Programs undergo rigorous filtering for C/C++ standard compliance, selecting those with the longest LLVM IR after compilation.
    - Level 2 (Optimization Potential, 151 tasks): Drawn from the TSVC vectorizing-compiler test suite; programs have simple execution paths but complex loop structures, making them suitable for evaluating optimization capability.
    - Evaluation metrics: ACC (functional correctness rate) and ACC+Perf (rate of correct programs that also outperform clang-O3).
- Neural Compilation Workflow (see the sketch after this list):
    - Initial generation: Given an IR input, the LLM generates an initial assembly candidate.
    - Self-debugging: After generation, correctness is verified via testing; failures trigger iterative self-debugging.
    - Iterative optimization: Starting from a correct initial assembly, \(T\) rounds of performance optimization are performed.
    - Correctness verification and self-debugging may be applied after each generation or optimization step.
- Self-Evolving Prompt Optimization (Core Contribution):
    - Trajectory collection: The LLM executes compilation tasks, and full self-debugging trajectories are collected (covering the complete process from initial erroneous generation → debugging corrections → final correct output).
    - Insight extraction: For trajectories that successfully transition from error to correctness, the LLM analyzes error patterns and effective repair strategies.
    - Prompt evolution: Extracted insights are integrated into the existing prompt, reviewed and confirmed by the LLM, then applied as an update. One prompt update is performed after each mini-batch of compilation tasks.
    - Key distinction: Unlike general APO methods, this approach learns specifically from complete self-debugging trajectories, enabling the LLM to internalize lessons from its own past experience resolving assembly errors.
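The online workflow (the `compile_one` stand-in from the overview sketch) can be outlined as follows. The callables `llm`, `run_tests`, and `benchmark`, the `clang_o3_runtime` reference, and the round budgets are illustrative assumptions; the paper's actual harness assembles and executes real binaries and compares them against clang-O3.

```python
from typing import Callable, Optional

def compile_one(
    llm: Callable[[str], str],                  # prompt text -> assembly text
    prompt: str,                                # baseline or learned prompt
    ir: str,                                    # LLVM IR for one function
    run_tests: Callable[[str], Optional[str]],  # assembly -> None if all tests pass, else an error report
    benchmark: Callable[[str], float],          # assembly -> measured runtime
    clang_o3_runtime: float,                    # reference runtime for the ACC+Perf criterion
    max_debug_rounds: int = 3,
    opt_rounds: int = 2,                        # the T optimization rounds
) -> dict:
    def repair(asm: str) -> Optional[str]:
        # Self-debugging: feed the failing assembly and its error report back to
        # the model until tests pass or the round budget is exhausted.
        for _ in range(max_debug_rounds):
            error = run_tests(asm)
            if error is None:
                return asm
            asm = llm(f"{prompt}\n\nIR:\n{ir}\n\nAssembly:\n{asm}\n\nError:\n{error}\nFix the assembly.")
        return asm if run_tests(asm) is None else None

    # 1) Initial generation, gated by correctness verification and self-debugging.
    asm = repair(llm(f"{prompt}\n\nTranslate this LLVM IR to assembly:\n{ir}"))
    if asm is None:
        return {"correct": False, "beats_o3": False, "asm": None}

    # 2) Iterative optimization: each round asks for a faster version, which is
    #    again verified (and self-debugged) before it can replace the current best.
    best_asm, best_time = asm, benchmark(asm)
    for _ in range(opt_rounds):
        candidate = repair(llm(f"{prompt}\n\nOptimize this assembly for speed:\n{best_asm}"))
        if candidate is None:
            continue  # the optimization attempt broke correctness and could not be repaired
        cand_time = benchmark(candidate)
        if cand_time < best_time:
            best_asm, best_time = candidate, cand_time

    # ACC counts correct programs; ACC+Perf additionally requires beating clang-O3.
    return {"correct": True, "beats_o3": best_time < clang_o3_runtime, "asm": best_asm}
```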
### Loss & Training
There is no training loss in the conventional sense. Prompt optimization is driven by compilation correctness feedback through an iterative process of 3 epochs with a batch size of 5. DeepSeek-R1 is used as the primary base LLM.
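A minimal sketch of this offline loop is shown below, assuming a hypothetical `attempt_with_trace` helper that runs one compilation task with self-debugging and returns its full trajectory; the same `llm` callable is reused for insight extraction and prompt rewriting. The embedded instructions are paraphrases for illustration, not the paper's actual prompt wording.

```python
from typing import Callable, Dict, List

def offline_evolve_prompt(
    llm: Callable[[str], str],
    attempt_with_trace: Callable[[str, str], Dict],  # (prompt, IR) -> {"solved": bool, "debugged": bool, "trace": str}
    train_irs: List[str],
    seed_prompt: str,
    epochs: int = 3,       # paper setting
    batch_size: int = 5,   # paper setting: one prompt update per mini-batch
) -> str:
    prompt = seed_prompt
    for _ in range(epochs):
        for start in range(0, len(train_irs), batch_size):
            batch = train_irs[start:start + batch_size]
            # Trajectory collection: run each task and record the full self-debugging trace.
            traces = [attempt_with_trace(prompt, ir) for ir in batch]
            # Insight extraction uses only trajectories that went from error to correctness.
            useful = [t["trace"] for t in traces if t["solved"] and t["debugged"]]
            if not useful:
                continue
            insights = llm(
                "Summarize the recurring error patterns and the repair strategies that "
                "fixed them in these self-debugging trajectories, as reusable rules:\n"
                + "\n---\n".join(useful)
            )
            # Prompt evolution: merge the insights into the current prompt; the model
            # reviews the merged version before it replaces the previous prompt.
            prompt = llm(
                "Integrate the following insights into this assembly-generation prompt, "
                f"keeping it concise and self-consistent:\n\nPROMPT:\n{prompt}\n\nINSIGHTS:\n{insights}"
            )
    return prompt
```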
## Key Experimental Results
### Main Results
Frontier LLM Baseline Performance (NeuComBack-L2, x86_64, 151 cases)
| Model | ACC (%) | ACC+Perf (%) |
|---|---|---|
| GPT-4o | 1.99 (3/151) | 0.66 (1/151) |
| o3-mini | 21.19 (32/151) | 5.30 (8/151) |
| o1 | 19.87 (30/151) | 5.30 (8/151) |
| DeepSeek-V3 | 14.57 (22/151) | 3.31 (5/151) |
| DeepSeek-R1 | 45.70 (69/151) | 21.85 (33/151) |
Self-Evolving Prompt Optimization Results (DeepSeek-R1, x86_64)
| Method | L1 Test ACC (%) | L2 Initial ACC (%) | L2 ACC+Perf after Optimization (%) |
|---|---|---|---|
| Baseline Prompt | 50.00 (20/40) | 44.00 (11/25) | 28.00 (7/25) |
| Learned Prompt | 80.00 (32/40) | 64.00 (16/25) | 56.00 (14/25) |
### Ablation Study
| Configuration | ACC (%) | ACC+Perf (%) | Notes |
|---|---|---|---|
| x86_64: baseline → learned prompt | 44 → 64 | 28 → 56 | Programs exceeding clang-O3 double (7 → 14) |
| aarch64: baseline → learned prompt | 36 → 72 | 8 → 28 | Effective across architectures |
| L2-learned prompt applied to L1 | 67.5 (vs. 80 for the L1-learned prompt) | — | Transferable across data distributions |
| Self-debugging demand | — | — | Average self-debugging rounds drop from 0.9 (baseline) to 0.28 (learned prompt) |
### Key Findings
- DeepSeek-R1 is the strongest baseline, significantly outperforming GPT-4o (ACC 45.7% vs. 1.99%), underscoring the critical role of reasoning capability.
- The learned prompt achieves a 60% relative improvement in correctness on L1 (50%→80%).
- Among 16 correctly generated x86_64 programs on L2, 14 (87.5%) outperform clang-O3.
- Prompt optimization effects transfer across instruction set architectures (x86_64→aarch64).
- The learned prompts encode multi-level knowledge, including formatting rules (the `.text` section), syntactic rules (the `.L` label prefix), and semantic rules (zeroing return registers for void functions).
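As an illustration only, a learned prompt might contain rule snippets along these lines (the wording below is invented for this summary, not quoted from the paper):

```python
# Hypothetical excerpt of a learned prompt; invented wording, shown only to
# illustrate the three rule levels mentioned above.
LEARNED_PROMPT_EXCERPT = """\
Formatting: emit executable code into the .text section.
Syntax: use the .L prefix for local labels so they are not exported as symbols.
Semantics: for functions that are void in the IR, zero the return register before ret
instead of leaving a stale value in it.
"""
```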
## Highlights & Insights
- LLMs' assembly optimization capability is surprisingly strong: 87.5% of correct programs outperform clang-O3, suggesting that LLMs can discover optimization opportunities unexploited by traditional compilers.
- The design of learning from self-debugging trajectories is distinctive—knowledge is distilled not from successes but from the "error → correction" process.
- LLMs can leverage vector instructions (e.g., `cmpps`) to perform vectorization optimizations that traditional compilers do not apply.
- The two-level benchmark design (basic compilation vs. optimization potential) is well-targeted.
## Limitations & Future Work
- Overall functional correctness remains insufficient (64% on L2), leaving a gap before practical deployment.
- Evaluation is limited to the function level; compilation of larger-scale programs is not addressed.
- The method relies heavily on DeepSeek-R1's reasoning capability and may perform significantly worse with other base models.
- The computational cost of prompt learning is non-trivial, requiring multiple rounds of LLM calls.
- Reliability verification of assembly code in safety-critical scenarios is not considered.
## Related Work & Insights
- Complementary to works such as LLM Compiler (Meta) and SLADE, which focus on pre-training; this paper instead focuses on inference-time prompt optimization.
- The self-evolving prompt methodology is transferable to other code translation tasks (e.g., decompilation, cross-language translation).
- NeuComBack can serve as a standard test suite for evaluating the compilation capability of next-generation LLMs.
## Rating
- Novelty: ⭐⭐⭐⭐ The idea of self-evolving prompts learned from debugging trajectories is original, and the benchmark fills an existing gap.
- Experimental Thoroughness: ⭐⭐⭐⭐ Cross-architecture, cross-distribution, transferability, and ablation analyses are comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear, experimental setup is detailed, and case studies are informative.
- Value: ⭐⭐⭐⭐ Provides both a benchmark and methodology for neural compilation, though practical deployment remains distant.