Small Language Models as Compiler Experts: Auto-Parallelization for Heterogeneous Systems¶
Conference: NeurIPS 2025 (ML for Systems Workshop)
arXiv: 2512.19250
Code: Not released
Area: LLM Evaluation
Keywords: small language models, auto-parallelization, compiler optimization, heterogeneous systems, inference strategies
TL;DR¶
This work systematically evaluates three language models of up to 1.5B parameters (gemma3:1b, llama3.2:1b, qwen2.5:1.5b) on compiler auto-parallelization tasks. Using six inference strategies across 11 real-world kernels, the approach achieves an average speedup of 6.81x and a peak speedup of 43.25x, demonstrating that small models can serve as capable compiler-optimization reasoning engines.
Background & Motivation¶
Following the end of Moore's Law, performance gains increasingly rely on heterogeneous computing (mixed CPU/GPU architectures), yet software toolchains have lagged behind:
Traditional auto-parallelizing compilers (LLVM Polly, GCC, etc.) depend on rigid heuristic rules and struggle to capture complex dependencies in real-world code.
Large model solutions are prohibitively costly: most AI for Systems research focuses on large proprietary models whose latency and cost are incompatible with compiler integration.
Core Problem: Can small, efficient LLMs provide the sophisticated reasoning required for complex compiler tasks?
This paper answers affirmatively: with a carefully designed inference framework, models of up to 1.5B parameters can match or surpass traditional compiler performance.
Method¶
Overall Architecture¶
A three-stage pipeline:
- Code Analyzer: statically analyzes input C/C++ code.
- LLM Reasoner: formulates a parallelization plan based on the analysis results.
- Parallelization Generator: implements the plan as parallel code.
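The paper does not release code, so the following is only a minimal orchestration sketch of how the three stages could be wired together; every function name and signature here is a hypothetical illustration, not the authors' implementation.

```python
# Minimal orchestration sketch of the three-stage pipeline; all names below are
# hypothetical illustrations (the paper releases no code).
from typing import Callable

def auto_parallelize(source: str,
                     analyze: Callable[[str], dict],        # stage 1: Code Analyzer
                     reason: Callable[[dict], dict],        # stage 2: LLM Reasoner
                     generate: Callable[[str, dict], str],  # stage 3: Parallelization Generator
                     validate: Callable[[str, dict], bool]  # static plan validation
                     ) -> str:
    analysis = analyze(source)      # loop nests, memory-access patterns, control flow
    plan = reason(analysis)         # structured parallelization plan from the small LLM
    if not validate(source, plan):  # unsafe plans are rejected before code generation
        return source               # fall back to the original serial code
    return generate(source, plan)
```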
Key Design 1: LLM-Guided Dependency Reasoning¶
The LLM acts as a semantic reasoning engine, receiving abstract representations of loop nests, memory access patterns, and control flow, and explicitly reasoning about:
- Loop-carried dependencies: dependencies that prevent safe parallel execution.
- Reduction patterns: operations that can be safely parallelized via reduction clauses.
- Privatizable variables: variables that must be given per-thread private copies to avoid data races.
- Target-specific execution strategies: CPU thread-level parallelism vs. GPU kernel decomposition.
The LLM produces a structured parallelization plan that is validated by a static analyzer before code generation; the generated code is then further checked with sanitizers and regression tests (see Key Design 3).
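To make these reasoning targets concrete, here is a hypothetical example: a simple C reduction kernel (embedded as a string) together with the kind of structured plan the reasoner might emit for it. Both the kernel and the plan schema are illustrative assumptions; the paper does not specify its exact prompt or output format.

```python
# Hypothetical illustration of the reasoning targets above; neither the C kernel
# nor the plan schema comes from the paper.
kernel = """
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;               /* reduction variable            */
    double tmp;                     /* privatizable scratch variable */
    for (int i = 0; i < n; i++) {   /* no loop-carried dependency    */
        tmp = a[i] * b[i];
        sum += tmp;
    }
    return sum;
}
"""

# A plan the reasoner could emit: the loop is safe to parallelize once `sum` is
# treated as a reduction and `tmp` is made thread-private, e.g.
#   #pragma omp parallel for reduction(+:sum) private(tmp)
plan = {
    "loop": "i in [0, n)",
    "loop_carried_dependencies": [],            # none -> parallelizable
    "reductions": [{"var": "sum", "op": "+"}],
    "private_vars": ["tmp"],
    "target": "cpu_threads",                    # vs. "gpu_kernel" decomposition
}
```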
Key Design 2: Six Inference Strategies¶
- Tree of Thoughts (ToT): explores multiple reasoning paths; yields the best results.
- Chain of Thought (CoT): step-by-step chain reasoning.
- ReAct: alternates between reasoning and action.
- Few-shot: leverages a small number of demonstrations.
- Step-by-Step: sequential execution of reasoning steps.
- Zero-shot: direct inference without examples.
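The strategy implementations are not published. The sketch below shows one plausible way a Tree-of-Thoughts-style search could be run over candidate parallelization plans (propose several plans, score them, keep the best); it is purely illustrative and makes no claim about the paper's actual prompting.

```python
# Illustrative Tree-of-Thoughts-style search over candidate plans; the paper's
# actual strategy implementations are not published.
from typing import Callable

def tot_plan(propose: Callable[[dict, int], list],  # LLM: propose k candidate plans
             score: Callable[[dict], float],        # validator / heuristic plan score
             analysis: dict,
             branches: int = 4,
             depth: int = 2) -> dict:
    frontier = propose(analysis, branches)          # initial reasoning paths
    for _ in range(depth - 1):
        # expand every surviving path, then keep only the best-scoring branches
        expanded = [p
                    for plan in frontier
                    for p in propose({**analysis, "partial_plan": plan}, branches)]
        frontier = sorted(expanded, key=score, reverse=True)[:branches]
    return max(frontier, key=score)                 # best complete plan
```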
Key Design 3: Safety Guarantees¶
All LLM-generated parallel code must pass multiple validation checks: regression tests, sanitizer-based detection of data races and memory issues, and cross-compiler compatibility testing (GCC 11+, Clang 14+, ICC 2021+, MSVC 2019+). Unsafe transformations are automatically rejected.
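As an illustration of such a validation gate, the sketch below compiles a candidate with OpenMP and ThreadSanitizer, runs it, and compares its output against a serial reference. The compiler flags are standard GCC/Clang options; the harness itself is an assumption, not the paper's implementation.

```python
# Illustrative validation gate: compile with OpenMP + ThreadSanitizer, run, and
# compare against the serial reference output. The flags are standard GCC/Clang
# options; the harness itself is an assumption, not the paper's implementation.
import pathlib
import subprocess
import tempfile

def validate(parallel_c: str, reference_output: str, compiler: str = "gcc") -> bool:
    workdir = pathlib.Path(tempfile.mkdtemp())
    src, binary = workdir / "candidate.c", workdir / "candidate"
    src.write_text(parallel_c)

    # Build the candidate with OpenMP and ThreadSanitizer to surface data races.
    build = subprocess.run(
        [compiler, "-O2", "-fopenmp", "-fsanitize=thread", str(src), "-o", str(binary)],
        capture_output=True, text=True)
    if build.returncode != 0:
        return False                 # reject code that does not compile

    try:
        run = subprocess.run([str(binary)], capture_output=True, text=True, timeout=60)
    except subprocess.TimeoutExpired:
        return False                 # hang (e.g. deadlock) -> reject
    if run.returncode != 0 or "WARNING: ThreadSanitizer" in run.stderr:
        return False                 # crash or detected data race -> reject

    return run.stdout == reference_output  # regression test against serial output
```

In practice, the paper's additional gates (AddressSanitizer-style memory checks and cross-compiler builds with GCC, Clang, ICC, and MSVC) would slot in alongside this check; note also that ThreadSanitizer with OpenMP may need a TSan-aware OpenMP runtime to avoid false positives.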
Evaluation Benchmark¶
Eleven real-world computational kernels spanning three domains: scientific computing (FFT, Jacobi, MatMul), graph algorithms (BFS, PageRank, Dijkstra), and ML kernels (Conv2D, Attention, Pooling).
Key Experimental Results¶
Main Results¶
A total of 376 evaluations, providing a comprehensive comparison across models, strategies, and baselines.
Model Performance Comparison (averaged over all strategies):
| Model | Avg. Speedup | Best Speedup | Analysis Quality | Response Time |
|---|---|---|---|---|
| gemma3:1b | 6.2x | 38.7x | 0.78 | 12.3s |
| llama3.2:1b | 6.8x | 41.2x | 0.82 | 15.7s |
| qwen2.5:1.5b | 7.2x | 43.25x | 0.85 | 18.9s |
Inference Strategy Comparison (averaged over all models):
| Strategy | Avg. Speedup | Success Rate | Quality Score |
|---|---|---|---|
| Tree of Thoughts | 7.1x | 88% | 0.84 |
| Chain of Thought | 6.9x | 85% | 0.81 |
| ReAct | 6.7x | 83% | 0.79 |
| Zero-shot | 5.8x | 78% | 0.72 |
Comparison with Advanced Compiler Baselines:
| Method | Avg. Speedup | Best Performance | GPU Support |
|---|---|---|---|
| LLM (qwen2.5+ToT) | 7.1x | Conv 43.25x | Yes |
| LLVM Polly | 5.8x | MatMul 8.2x | No |
| TVM | 7.4x | Conv 11.2x | Yes |
| Triton | 8.9x | Attn 13.7x | Yes |
Ablation Study¶
Scalability (matrix multiplication): LLM speedup grows from 4.2x at size 1K to 13.1x at size 16K, consistently outperforming LLVM Polly.
Correctness verification: LLM-ToT achieves a validation rate of 88%, race-freedom of 91%, and memory safety of 94%—lower than LLVM Polly (95%/97%/98%) but still reliable.
Key Findings¶
- Inference strategy matters more than model scale: ToT consistently yields the best results.
- LLMs generalize better than domain-specific tools: more consistent performance across domains.
- Small models are sufficient: average speedup surpasses both LLVM Polly and GCC.
- Generated code scales well: performance improves steadily with input size and core count.
Highlights & Insights¶
- Small model potential is underestimated: 1B-scale models perform strongly on structured compiler tasks.
- The inference framework is the key lever: ToT yields a 22% average-speedup improvement over Zero-shot (7.1x vs. 5.8x, averaged over models).
- Safety guarantees are well-designed: incorrect parallelizations are automatically intercepted.
- Compiler integration is feasible: latency of 12–19s is acceptable for offline optimization.
- Strong cross-platform compatibility: compilation success rates of 98% on GCC and 96% on Clang.
Limitations & Future Work¶
- High compilation latency: up to 18.9s, far exceeding the 2–4s of traditional compilers and precluding JIT use.
- Correctness gap: 88% falls short of the 95% achieved by traditional compilers.
- Dependence on prompt engineering: effectiveness is highly sensitive to code abstraction and prompt design.
- Limited evaluation scope: only 11 kernels are evaluated.
- Workshop paper: methodological details are insufficiently elaborated.
Related Work & Insights¶
- LLVM Polly: polyhedral model auto-parallelization baseline.
- TVM / Triton: domain-specific compilation optimization.
- Tree of Thoughts (Yao et al., 2023): multi-path reasoning strategy.
- Future directions: verifier-in-the-loop feedback, multi-hardware backends, multi-language extension.
Rating¶
⭐⭐⭐⭐ (3.5/5)
- Novelty ⭐⭐⭐⭐: using small models as compiler experts is an interesting and practical direction.
- Experimental Thoroughness ⭐⭐⭐⭐: 376 evaluations provide broad coverage.
- Methodological Depth ⭐⭐⭐: limited by workshop paper length constraints.
- Value ⭐⭐⭐⭐: provides a clear pathway for integrating compilers with LLMs.