DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation¶
Conference: AAAI 2026 | arXiv: 2601.03178 | Code: Not released | Area: Code Intelligence | Keywords: LLM Agent, diffusion model acceleration, code generation, benchmark, genetic algorithm optimization
TL;DR¶
This paper proposes DiffBench (an evaluation benchmark comprising 604 diffusion model acceleration tasks across 5 difficulty levels) and DiffAgent (a closed-loop framework integrating Planning, Coding, and Debugging agents with a genetic algorithm-based selector). On Claude Sonnet 4, the framework improves the pass rate for diffusion acceleration code generation from 54.30% to 81.59%, achieving a 68.27% success rate on complex optimization tasks.
Background & Motivation¶
Background: Diffusion models have achieved remarkable success in image and video generation, but their multi-step iterative inference introduces substantial computational overhead. Existing acceleration methods include fast samplers (DPM-Solver, UniPC), feature reuse (DeepCache), token merging (ToMe), gated activation (T-Gate), and others.
Limitations of Prior Work: Each acceleration method requires expert knowledge for implementation and hyperparameter tuning; different architectures (U-Net vs. DiT) and deployment scenarios demand distinct combinations of acceleration strategies; jointly tuning multiple methods requires deep domain expertise.
Key Challenge: While LLMs have demonstrated strong code generation capabilities and solid performance on GPU kernel optimization benchmarks (KernelBench, TritonBench), diffusion acceleration code poses unique challenges—requiring simultaneous understanding of diffusion architectures, the parameter semantics of acceleration techniques, and quality–speed trade-offs.
Goals: (a) standardize the evaluation of LLMs' ability to generate diffusion acceleration code; (b) build an agent that lets an LLM autonomously complete the full pipeline of "requirement understanding → strategy planning → code generation → debugging → iterative optimization."
Key Insight: Emulating a human developer's workflow—iteratively refining code based on environment feedback, combined with genetic algorithms for efficient exploration of the search space.
Core Idea: Employ multi-agent closed-loop collaboration with a genetic algorithm-driven selector to achieve end-to-end automatic generation of diffusion acceleration code that satisfies quality and speed constraints from natural language requirements.
Method¶
Overall Architecture¶
The system consists of two major components: DiffBench (evaluation benchmark) and DiffAgent (code generation framework). DiffBench provides 604 tasks constructed from real-world deployment scenarios along with a three-stage automated evaluation pipeline. DiffAgent generates high-quality acceleration code through closed-loop iteration among Planning, Coding, and Debugging agents and a genetic algorithm selector.
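The closed loop described above can be sketched as follows. The agent internals here are hypothetical stubs (in the real system each is an LLM call), so only the control flow, P plans per generation, M refined survivors plus fresh plans, and T_sel selector rounds, mirrors the paper's description:

```python
import random

# Hypothetical stubs for the LLM-backed agents; only the control flow
# follows the paper's description.
def plan(task, feedback=None):           # Planning Agent
    return {"task": task, "feedback": feedback}

def generate_code(p):                    # Coding Agent
    return {"plan": p, "runs": True}

def passes_debugging(impl):              # Debugging Agent (Reflexion-style)
    return impl["runs"]

def fitness(impl, rng):                  # quality/speed score from the evaluator
    return rng.random()

def diffagent(task, P=7, M=4, T_sel=4, seed=0):
    """Closed loop: plan -> code -> debug -> select, for T_sel generations."""
    rng = random.Random(seed)
    population = [plan(task) for _ in range(P)]
    best = None
    for _ in range(T_sel):
        impls = []
        for p in population:
            impl = generate_code(p)
            if passes_debugging(impl):   # on failure: regenerate / replan (omitted)
                impls.append(impl)
        impls.sort(key=lambda i: fitness(i, rng), reverse=True)
        survivors = impls[:M]
        if survivors:
            best = survivors[0]
        # refine the survivors' plans; top up with fresh plans for diversity
        population = [plan(task, feedback=s) for s in survivors]
        population += [plan(task) for _ in range(P - len(population))]
    return best
```

The hierarchical fallback (failed debug → regenerate code → replan) is elided here; it replaces the single `passes_debugging` check in the full framework.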
Key Designs¶
- DiffBench — 5-Level Difficulty Evaluation System:
- Function: 604 tasks covering U-Net (SD1.5/2.1/SDXL) and Transformer (DiT/PixArt-α/Σ) architectures, supporting text2img/class2img/img2img at resolutions from 256 to 1024.
- Mechanism: Five difficulty levels—L1 basic pipeline generation (41) → L2 single-method acceleration (116) → L3 combined acceleration (261) → L4 specified speedup ratio constraints (93) → L5 latency constraints (93). Feasibility of L4/L5 tasks is determined via 50-round search, with scaling factors used to generate easy/medium/hard samples.
- Design Motivation: Existing coding benchmarks do not involve diffusion model domain knowledge and are therefore unable to evaluate LLMs' code generation capability in this specialized domain.
- Three-Stage Evaluation Pipeline:
- Function: Progressive evaluation through static parameter validation → absolute quality measurement → relative performance analysis.
- Mechanism: Stage 1 checks whether key attributes—pipeline type, model ID, scheduler, and acceleration method—match; Stage 2 evaluates generation quality using CLIP-Score on 10 COCO samples; Stage 3 computes relative quality loss \(L\) and speedup ratio \(U\).
- Key Formulas: Quality loss \(L = \frac{\frac{1}{N}\sum(S_{base}^{(i)} - S_{acc}^{(i)})}{\frac{1}{N}\sum S_{base}^{(i)}}\), speedup ratio \(U = \frac{\frac{1}{N}\sum T_{base}^{(i)}}{\frac{1}{N}\sum T_{acc}^{(i)}}\)
- DiffAgent — Four-Component Closed-Loop Architecture:
- Planning Agent: For L1–L3, generates coding plans and passes them directly to the Coding Agent; for L4/L5, first generates a baseline plan and then an acceleration plan. During genetic algorithm iterations, it receives feedback reports and tuning experience from \(M=4\) high-fitness offspring, produces \(M\) refined plans along with \(P-M=3\) entirely new plans (to maintain diversity and avoid local optima), yielding \(P=7\) plans in total.
- Coding Agent: Generates diffusion inference code according to the plan. Acceleration code templates are introduced as references to improve structural accuracy.
- Debugging Agent: Based on the Reflexion architecture, iterates with the Coding Agent for up to \(T_{debug}=3\) rounds; upon failure, restarts code generation (up to \(T_{code}=5\) rounds); if still unsuccessful, backtracks to the Planning Agent for replanning.
- Genetic Algorithm Selector: Evaluates each implementation's quality and efficiency, computes a fitness score via weighted summation, normalizes into sampling probabilities, and selects \(M=4\) high-fitness offspring for the next generation. Iterates for at most \(T_{sel}=4\) rounds.
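Putting the Stage-3 metrics and the selector together, a minimal stdlib-only sketch. The fitness weights and the use of softmax for normalization are assumptions (the paper specifies only a weighted sum normalized into sampling probabilities):

```python
import math
import random

def quality_loss(s_base, s_acc):
    """Relative quality loss L = mean(S_base - S_acc) / mean(S_base)."""
    n = len(s_base)
    return (sum(b - a for b, a in zip(s_base, s_acc)) / n) / (sum(s_base) / n)

def speedup(t_base, t_acc):
    """Speedup ratio U = mean(T_base) / mean(T_acc)."""
    return (sum(t_base) / len(t_base)) / (sum(t_acc) / len(t_acc))

def select(candidates, M=4, w_quality=1.0, w_speed=1.0, seed=0):
    """Weighted fitness -> sampling probabilities -> draw M offspring indices.

    Each candidate is (s_base, s_acc, t_base, t_acc): per-sample CLIP scores
    and latencies for the baseline and the accelerated implementation.
    """
    fits = [w_speed * speedup(tb, ta) - w_quality * quality_loss(sb, sa)
            for sb, sa, tb, ta in candidates]
    z = [math.exp(f) for f in fits]          # softmax normalization (assumed)
    probs = [x / sum(z) for x in z]
    rng = random.Random(seed)
    return rng.choices(range(len(candidates)), weights=probs, k=M)
```

For example, an implementation that halves latency with no quality drop gets `speedup = 2.0` and `quality_loss = 0.0`, and is sampled into the next generation with correspondingly higher probability.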
Loss & Training¶
No training is required—the framework relies entirely on the inference and code generation capabilities of existing LLMs, optimized through prompt engineering and closed-loop feedback. Genetic algorithm hyperparameters: \(P=7\), \(M=4\), \(T_{sel}=4\), \(T_{code}=5\), \(T_{debug}=3\).
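These hyperparameters also bound the worst-case number of LLM invocations per task, since every retry loop is capped. A back-of-envelope bound (illustrative accounting, not a figure from the paper):

```python
# Worst-case LLM-call budget per task, assuming one planning call per plan,
# one call per coding attempt, and one call per debugging round.
P, M, T_sel, T_code, T_debug = 7, 4, 4, 5, 3

calls_per_plan = T_code * (1 + T_debug)        # 5 coding attempts x (code + 3 debug rounds)
calls_per_generation = P * (1 + calls_per_plan)  # planning + coding/debugging, per plan
worst_case = T_sel * calls_per_generation        # over all selector rounds
```

The point is not the exact number but that the budget is a fixed product of small constants, so the fallback hierarchy cannot iterate indefinitely.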
Key Experimental Results¶
Main Results¶
Four LLMs are evaluated on DiffBench (all numbers are pass rates, %); DiffAgent yields substantial improvements across all models:
| Model | L1 | L2 | L3 | L4 | L5 | Avg |
|---|---|---|---|---|---|---|
| Claude Sonnet 4 | 78.04 | 72.41 | 76.25 | 5.38 | 8.60 | 54.30 |
| + DiffAgent | 90.24 | 91.38 | 99.23 | 33.33 | 63.44 | 81.59 |
| o3-mini | 41.46 | 24.14 | 4.60 | 9.68 | 6.45 | 11.92 |
| + DiffAgent | 73.17 | 70.69 | 69.73 | 22.58 | 27.96 | 56.46 |
| GPT-4.1 | 56.10 | 18.97 | 7.28 | 10.75 | 12.90 | 14.24 |
| Gemini 2.5 Flash | 39.02 | 29.31 | 7.66 | 2.15 | 1.08 | 12.09 |
Ablation Study¶
| Configuration | Avg pass rate \(S_p\) (%) | Hard-task success \(S_a\) (%) | Notes |
|---|---|---|---|
| Full DiffAgent | 81.59 | 68.27 | Complete model |
| w/o Knowledge Base | 64.90 | 45.94 | Removing knowledge base drops \(S_p\) by 16.69 points, the largest single impact |
| w/o GA | 67.88 | 8.16 | Removing GA drops L4/L5 pass rate to 4.30% |
| w/o Debugging Agent | 66.23 | 62.02 | Removing debugging agent drops L5 by ~30% |
Key Findings¶
- Genetic algorithm is critical for complex tasks: Removing GA causes L4/L5 pass rates to plummet to 4.30% and hard-task success rate to only 8.16%, demonstrating that GA's search capability is indispensable for satisfying performance constraints.
- Knowledge base provides global benefit: Removal degrades performance at every level, with the largest overall drop (16.69 points), indicating that domain knowledge is crucial for code structure and parameter selection.
- DiffAgent substantially reduces compilation errors: L5 compilation errors decrease from 31.18% to 2.15%, with significant reductions in key attribute errors and low-quality errors as well.
- GA hyperparameter analysis: Performance saturates at \(P=7\) and \(T_{sel}=4\); further increases yield diminishing returns.
Highlights & Insights¶
- Paradigm breakthrough of "using AI to optimize AI" — Combining LLM agents with genetic algorithms enables agents to not only generate code but also automatically search for optimal solutions in the quality–speed space. This closed-loop optimization paradigm is transferable to any code generation scenario requiring iterative tuning.
- Elegant multi-agent division of labor with fault-tolerant fallback — The hierarchical fallback mechanism (code failure → regeneration → replanning) across Planning → Coding → Debugging ensures robustness, with an upper bound on LLM invocations in the worst case.
- Benchmark construction methodology with 5-level difficulty design — Progressing from basic pipelines to combined acceleration to constraint optimization, with difficulty escalating and feasibility verified via search, this methodology is reusable for constructing agent benchmarks in other domains.
Limitations & Future Work¶
- Limited hardware coverage: Evaluation is conducted only on specific GPUs without verifying cross-hardware generalizability (e.g., A100 vs. consumer-grade GPUs).
- Closed acceleration method library: Only four acceleration techniques are included (ToMe/DeepCache/T-Gate/FP16), excluding more complex methods such as knowledge distillation and quantization.
- No learning mechanism: The framework relies entirely on prompt engineering, treating each task independently without accumulating experience from historical tasks. Incorporating experiential memory or fine-tuning could be explored.
- Limited evaluation metrics: Generation quality is assessed solely via CLIP-Score, without considering more comprehensive quality indicators such as FID or human preference.
Related Work & Insights¶
- vs. KernelBench/TritonBench: These benchmarks evaluate general-purpose GPU kernel generation, whereas DiffBench targets the diffusion model domain specifically, requiring understanding of diffusion architectures and acceleration method semantics.
- vs. General Coding Agents (CodeAgent, etc.): General-purpose agents lack domain knowledge; DiffAgent compensates through a knowledge base and feedback mechanisms, significantly outperforming general solutions in this specialized domain.
- Insights: The Agent + GA search framework can be generalized to any code optimization task requiring satisfaction of multi-objective constraints.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The first benchmark and agent framework for diffusion acceleration code generation; a field-defining contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation across 4 LLMs × 5 levels, complete ablation, and detailed GA hyperparameter analysis.
- Writing Quality: ⭐⭐⭐⭐ — Task definitions are clear and the framework is described systematically.
- Value: ⭐⭐⭐⭐⭐ — DiffBench can serve as a standard evaluation tool; the multi-agent + GA paradigm of DiffAgent is broadly transferable.