# QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code
Conference: NeurIPS 2025
arXiv: 2511.01183
Code: not provided
Area: Neural Compilation / Code Generation
Keywords: Neural Compilation, LLM, IR-to-Assembly Translation, Self-Evolving Prompt Optimization, Compiler
## TL;DR
This paper introduces NeuComBack, a benchmark for evaluating neural compilation on IR-to-assembly translation, and proposes a self-evolving prompt optimization method that iteratively improves compilation prompts by learning from the LLM's own self-debugging trajectories. The approach raises functional correctness on the L2 benchmark from 44% to 64%, and 87.5% of the correctly generated programs outperform clang-O3.
## Background & Motivation
Compilers are indispensable yet extraordinarily complex software systems that demand substantial human expertise to develop and maintain. Given the remarkable performance of LLMs on code-related tasks, neural compilation—directly translating high-level languages or intermediate representations (IR) into low-level assembly using LLMs—has emerged as an appealing new paradigm.
Core Limitations of Prior Work:
- Lack of benchmarks: No standardized benchmark exists for evaluating IR-to-assembly compilation, making it impossible to systematically measure and track progress.
- Correctness challenge: Even the most capable LLMs fall far short of traditional compilers in terms of the semantic correctness of generated assembly.
- Optimization challenge: Generating assembly that outperforms mature compilers (e.g., clang-O3) while maintaining correctness is even more difficult.
Key Challenge: Although LLMs can understand and generate code, applying them directly to assembly generation lacks both evaluation standards and effective methods for enhancing their capability.
Key Insight:
- Construct a dedicated benchmark, NeuComBack, with two levels: basic compilation (L1) and optimization potential (L2).
- Propose a self-evolving prompt optimization method that enables LLMs to learn from their own debugging experience, iteratively refining prompt strategies for assembly generation.
## Method
### Overall Architecture
The method comprises two phases: offline prompt learning and online inference. In the offline phase, LLM self-debugging trajectories are collected, insights are extracted, and prompts are iteratively evolved. In the online phase, the evolved prompts guide assembly generation and iterative optimization.
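This split can be pictured as a small driver that first runs the offline learning phase and then applies the learned prompt online. Below is a minimal sketch in Python, with `evolve_prompt` and `compile_one` as stand-ins for the two phases (both interfaces are illustrative assumptions, not the paper's actual code):

```python
from typing import Callable, Iterable, List

def neural_compile_pipeline(
    evolve_prompt: Callable[[str, List[str]], str],  # offline: (seed prompt, training IRs) -> learned prompt
    compile_one: Callable[[str, str], str],          # online: (prompt, IR) -> assembly
    seed_prompt: str,
    train_irs: List[str],
    test_irs: Iterable[str],
) -> List[str]:
    # Offline phase: distill insights from self-debugging trajectories on the training IRs.
    learned_prompt = evolve_prompt(seed_prompt, train_irs)
    # Online phase: the evolved prompt guides generation and iterative optimization on unseen IRs.
    return [compile_one(learned_prompt, ir) for ir in test_irs]
```

Possible sketches of the two callables follow in the Key Designs and Loss & Training subsections below.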
### Key Designs
- NeuComBack Benchmark:
    - Level 1 (Basic Compilation, 200 tasks): Sampled from ExeBench, covering diverse real-world C programs with an emphasis on functional correctness. Programs undergo rigorous filtering for C/C++ standard compliance, selecting those with the longest LLVM IR after compilation.
    - Level 2 (Optimization Potential, 151 tasks): Drawn from the TSVC vectorizing-compiler test suite; programs have simple execution paths but complex loop structures, making them suitable for evaluating optimization capability.
    - Evaluation metrics: ACC (functional correctness rate) and ACC+Perf (rate of correct programs that also outperform clang-O3).
- Neural Compilation Workflow (see the sketch after this list):
    - Initial generation: Given an IR input, the LLM generates an initial assembly candidate.
    - Self-debugging: After generation, correctness is verified via testing; failures trigger iterative self-debugging.
    - Iterative optimization: Starting from a correct initial assembly, \(T\) rounds of performance optimization are performed.
    - Correctness verification and self-debugging may be applied after each generation or optimization step.
- Self-Evolving Prompt Optimization (Core Contribution):
    - Trajectory collection: The LLM executes compilation tasks, and full self-debugging trajectories are collected (covering the complete process from initial erroneous generation → debugging corrections → final correct output).
    - Insight extraction: For trajectories that successfully transition from error to correctness, the LLM analyzes error patterns and effective repair strategies.
    - Prompt evolution: Extracted insights are integrated into the existing prompt, reviewed and confirmed by the LLM, then applied as an update. One prompt update is performed after each mini-batch of compilation tasks.
    - Key distinction: Unlike general APO methods, this approach learns specifically from complete self-debugging trajectories, enabling the LLM to internalize lessons from its own past experience resolving assembly errors.
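The online workflow (the `compile_one` stand-in from the overview sketch) can be outlined as follows. The callables `llm`, `run_tests`, and `benchmark`, the `clang_o3_runtime` reference, and the round budgets are illustrative assumptions; the paper's actual harness assembles and executes real binaries and compares them against clang-O3.

```python
from typing import Callable, Optional

def compile_one(
    llm: Callable[[str], str],                  # prompt text -> assembly text
    prompt: str,                                # baseline or learned prompt
    ir: str,                                    # LLVM IR for one function
    run_tests: Callable[[str], Optional[str]],  # assembly -> None if all tests pass, else an error report
    benchmark: Callable[[str], float],          # assembly -> measured runtime
    clang_o3_runtime: float,                    # reference runtime for the ACC+Perf criterion
    max_debug_rounds: int = 3,
    opt_rounds: int = 2,                        # the T optimization rounds
) -> dict:
    def repair(asm: str) -> Optional[str]:
        # Self-debugging: feed the failing assembly and its error report back to
        # the model until tests pass or the round budget is exhausted.
        for _ in range(max_debug_rounds):
            error = run_tests(asm)
            if error is None:
                return asm
            asm = llm(f"{prompt}\n\nIR:\n{ir}\n\nAssembly:\n{asm}\n\nError:\n{error}\nFix the assembly.")
        return asm if run_tests(asm) is None else None

    # 1) Initial generation, gated by correctness verification and self-debugging.
    asm = repair(llm(f"{prompt}\n\nTranslate this LLVM IR to assembly:\n{ir}"))
    if asm is None:
        return {"correct": False, "beats_o3": False, "asm": None}

    # 2) Iterative optimization: each round asks for a faster version, which is
    #    again verified (and self-debugged) before it can replace the current best.
    best_asm, best_time = asm, benchmark(asm)
    for _ in range(opt_rounds):
        candidate = repair(llm(f"{prompt}\n\nOptimize this assembly for speed:\n{best_asm}"))
        if candidate is None:
            continue  # the optimization attempt broke correctness and could not be repaired
        cand_time = benchmark(candidate)
        if cand_time < best_time:
            best_asm, best_time = candidate, cand_time

    # ACC counts correct programs; ACC+Perf additionally requires beating clang-O3.
    return {"correct": True, "beats_o3": best_time < clang_o3_runtime, "asm": best_asm}
```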
### Loss & Training
There is no training loss in the conventional sense. Prompt optimization is driven by compilation correctness feedback through an iterative process of 3 epochs with a batch size of 5. DeepSeek-R1 is used as the primary base LLM.
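A minimal sketch of this offline loop is shown below, assuming a hypothetical `attempt_with_trace` helper that runs one compilation task with self-debugging and returns its full trajectory; the same `llm` callable is reused for insight extraction and prompt rewriting. The embedded instructions are paraphrases for illustration, not the paper's actual prompt wording.

```python
from typing import Callable, Dict, List

def offline_evolve_prompt(
    llm: Callable[[str], str],
    attempt_with_trace: Callable[[str, str], Dict],  # (prompt, IR) -> {"solved": bool, "debugged": bool, "trace": str}
    train_irs: List[str],
    seed_prompt: str,
    epochs: int = 3,       # paper setting
    batch_size: int = 5,   # paper setting: one prompt update per mini-batch
) -> str:
    prompt = seed_prompt
    for _ in range(epochs):
        for start in range(0, len(train_irs), batch_size):
            batch = train_irs[start:start + batch_size]
            # Trajectory collection: run each task and record the full self-debugging trace.
            traces = [attempt_with_trace(prompt, ir) for ir in batch]
            # Insight extraction uses only trajectories that went from error to correctness.
            useful = [t["trace"] for t in traces if t["solved"] and t["debugged"]]
            if not useful:
                continue
            insights = llm(
                "Summarize the recurring error patterns and the repair strategies that "
                "fixed them in these self-debugging trajectories, as reusable rules:\n"
                + "\n---\n".join(useful)
            )
            # Prompt evolution: merge the insights into the current prompt; the model
            # reviews the merged version before it replaces the previous prompt.
            prompt = llm(
                "Integrate the following insights into this assembly-generation prompt, "
                f"keeping it concise and self-consistent:\n\nPROMPT:\n{prompt}\n\nINSIGHTS:\n{insights}"
            )
    return prompt
```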
## Key Experimental Results
### Main Results
Frontier LLM Baseline Performance (NeuComBack-L2, x86_64, 151 cases)
| Model | ACC (%) | ACC+Perf (%) |
|---|---|---|
| GPT-4o | 1.99 (3/151) | 0.66 (1/151) |
| o3-mini | 21.19 (32/151) | 5.30 (8/151) |
| o1 | 19.87 (30/151) | 5.30 (8/151) |
| DeepSeek-V3 | 14.57 (22/151) | 3.31 (5/151) |
| DeepSeek-R1 | 45.70 (69/151) | 21.85 (33/151) |
Self-Evolving Prompt Optimization Results (DeepSeek-R1, x86_64)
| Method | L1 Test ACC (%) | L2 Initial ACC (%) | L2 ACC+Perf after Optimization (%) |
|---|---|---|---|
| Baseline Prompt | 50.00 (20/40) | 44.00 (11/25) | 28.00 (7/25) |
| Learned Prompt | 80.00 (32/40) | 64.00 (16/25) | 56.00 (14/25) |
### Ablation Study
| Configuration | ACC (%) | ACC+Perf (%) | Notes |
|---|---|---|---|
| x86_64: baseline → learned prompt | 44 → 64 | 28 → 56 | Programs exceeding clang-O3 double (7 → 14) |
| aarch64: baseline → learned prompt | 36 → 72 | 8 → 28 | Effective across architectures |
| L2-learned prompt applied to L1 | 67.5 (vs. 80 for the L1-learned prompt) | — | Transferable across data distributions |
| Self-debugging demand | — | — | Average self-debugging rounds drop from 0.9 (baseline) to 0.28 (learned prompt) |
### Key Findings
- DeepSeek-R1 is the strongest baseline, significantly outperforming GPT-4o (ACC 45.7% vs. 1.99%), underscoring the critical role of reasoning capability.
- The learned prompt achieves a 60% relative improvement in correctness on L1 (50%→80%).
- Among 16 correctly generated x86_64 programs on L2, 14 (87.5%) outperform clang-O3.
- Prompt optimization effects transfer across instruction set architectures (x86_64→aarch64).
- The learned prompts encode multi-level knowledge, including formatting rules (the `.text` section), syntactic rules (the `.L` label prefix), and semantic rules (zeroing return registers for void functions).
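As an illustration only, a learned prompt might contain rule snippets along these lines (the wording below is invented for this summary, not quoted from the paper):

```python
# Hypothetical excerpt of a learned prompt; invented wording, shown only to
# illustrate the three rule levels mentioned above.
LEARNED_PROMPT_EXCERPT = """\
Formatting: emit executable code into the .text section.
Syntax: use the .L prefix for local labels so they are not exported as symbols.
Semantics: for functions that are void in the IR, zero the return register before ret
instead of leaving a stale value in it.
"""
```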
## Highlights & Insights
- LLMs' assembly optimization capability is surprisingly strong: 87.5% of correct programs outperform clang-O3, suggesting that LLMs can discover optimization opportunities unexploited by traditional compilers.
- The design of learning from self-debugging trajectories is distinctive—knowledge is distilled not from successes but from the "error → correction" process.
- LLMs can leverage vector instructions (e.g., `cmpps`) to perform vectorization optimizations that traditional compilers do not apply.
- The two-level benchmark design (basic compilation vs. optimization potential) is well-targeted.
## Limitations & Future Work
- Overall functional correctness remains insufficient (64% on L2), leaving a gap before practical deployment.
- Evaluation is limited to the function level; compilation of larger-scale programs is not addressed.
- The method relies heavily on DeepSeek-R1's reasoning capability and may perform significantly worse with other base models.
- The computational cost of prompt learning is non-trivial, requiring multiple rounds of LLM calls.
- Reliability verification of assembly code in safety-critical scenarios is not considered.
## Related Work & Insights
- Complementary to works such as LLM Compiler (Meta) and SLADE, which focus on pre-training; this paper instead focuses on inference-time prompt optimization.
- The self-evolving prompt methodology is transferable to other code translation tasks (e.g., decompilation, cross-language translation).
- NeuComBack can serve as a standard test suite for evaluating the compilation capability of next-generation LLMs.
## Rating
- Novelty: ⭐⭐⭐⭐ The idea of self-evolving prompts learned from debugging trajectories is original, and the benchmark fills an existing gap.
- Experimental Thoroughness: ⭐⭐⭐⭐ Cross-architecture, cross-distribution, transferability, and ablation analyses are comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear, experimental setup is detailed, and case studies are informative.
- Value: ⭐⭐⭐⭐ Provides both a benchmark and methodology for neural compilation, though practical deployment remains distant.