ChipSeek: Optimizing Verilog Generation via EDA-Integrated Reinforcement Learning

Conference: ACL 2026
arXiv: 2507.04736
Code: https://github.com/rong-hash/chipseek
Area: Reinforcement Learning
Keywords: Verilog Generation, EDA Integration, Hierarchical Rewards, PPA Optimization, Curricular Policy Optimization

TL;DR

ChipSeek proposes a hierarchical reward RL framework that directly integrates the EDA toolchain into the training loop. Through Curriculum-guided Dynamic Policy Optimization (CDPO), the LLM can generate RTL code that simultaneously satisfies functional correctness and PPA (Power-Performance-Area) optimization, reaching SOTA on standard benchmarks.

Background & Motivation

Background: LLMs have demonstrated great potential in automated RTL code generation. Existing methods enhance functional correctness through SFT, RAG, multi-agent systems, and CoT reasoning but typically overlook hardware-specific metrics (PPA).

Limitations of Prior Work: (1) Existing models lack an inherent mechanism to optimize both functional correctness and PPA concurrently; (2) post-processing methods (such as MCTS) fail to improve the capabilities of the LLM itself; (3) Verilog generated by existing models is often less hardware-efficient than expert-written code.

Key Challenge: Current methods lack a mechanism to optimize functional correctness and PPA jointly within the training objective.

Goal: Design a framework that directly incorporates feedback from the EDA toolchain into RL training, enabling the LLM to internalize hardware design knowledge.

Key Insight: Hierarchical reward design + curricular weight scheduling + prompt-conditioned PPA preferences.

Core Idea: By connecting a complete open-source EDA toolchain (compilation, simulation, synthesis, back-end analysis) to the training loop, the framework provides hierarchical rewards ranging from syntax to PPA, allowing the LLM to learn hardware design trade-offs during training.

Method

Overall Architecture

The LLM serves as the policy \(\pi_\theta\), generating Verilog code based on design specifications. This code is evaluated by a complete EDA toolchain to provide hierarchical rewards, and multi-objective optimization is performed via CDPO.

Key Designs

  1. Hierarchical Rewards:

    • Function: Provides multi-level feedback from syntax to PPA.
    • Mechanism: Divided into process rewards (format, syntax, synthesizability) and core rewards (functional correctness, PPA). A strict gating mechanism ensures that downstream metrics are only evaluated if upstream checks are passed. The PPA reward for each metric \(m\) (power, delay, or area) is the improvement ratio relative to a reference design, \(r_m = \text{ref}_m / \text{gen}_m\). Because lower is better for all three metrics, a value above 1 means the generated design beats the reference on that metric.
    • Design Motivation: Avoids performing expensive downstream evaluations on invalid code and decouples continuous PPA rewards from discrete functional rewards.
  2. CDPO (Curriculum-guided Dynamic Policy Optimization):

    • Function: Addresses learning stage and scale mismatches in multi-objective optimization.
    • Mechanism: (a) Decoupled advantage estimation—each reward component is normalized independently; (b) Adaptive curriculum—dynamically adjusts process reward weights based on global success rates (e.g., automatically decreasing the weight of syntax rewards when syntax success rate is high); (c) Prompt-conditioned PPA weighting—adjusts weights for power, delay, or area based on preference vectors provided in the prompt.
    • Design Motivation: Simple reward summation tends to be dominated by easily learned components; curricular scheduling achieves a learning progression from easy to difficult.
  3. Automated Data Augmentation Pipeline:

    • Function: Constructs PPA-aware training data.
    • Mechanism: A three-stage pipeline consisting of generating SFT cold-start data, synthesizing diverse PPA preference vectors, and generating testbenches alongside PPA metrics.
    • Design Motivation: Addresses the scarcity of high-quality hardware design data.
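
The gating and the PPA ratio described under Hierarchical Rewards can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `EvalResult` container, the `hierarchical_reward` function, the 0/1 values for process rewards, and the choice to gate PPA on functional correctness are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class EvalResult:
    """Hypothetical container for the outcome of one EDA toolchain run."""
    format_ok: bool
    syntax_ok: bool = False
    synth_ok: bool = False
    functional_ok: bool = False
    # Measured metrics of the generated design, e.g. {"power": ..., "delay": ..., "area": ...}
    ppa: Optional[Dict[str, float]] = None


def hierarchical_reward(result: EvalResult, ref_ppa: Dict[str, float]) -> Dict[str, float]:
    """Return per-component rewards; a failed stage leaves everything downstream at 0."""
    rewards = {"format": 0.0, "syntax": 0.0, "synth": 0.0, "functional": 0.0,
               "power": 0.0, "delay": 0.0, "area": 0.0}
    stages = [("format", result.format_ok), ("syntax", result.syntax_ok),
              ("synth", result.synth_ok), ("functional", result.functional_ok)]
    for name, passed in stages:
        if not passed:
            return rewards  # gate: downstream metrics are never evaluated
        rewards[name] = 1.0  # assumed binary process/functional reward
    if result.ppa is not None:
        for metric in ("power", "delay", "area"):
            # r_m = ref_m / gen_m: above 1 means the generated design beats the reference
            rewards[metric] = ref_ppa[metric] / result.ppa[metric]
    return rewards
```

In a real training loop the stage flags would come from the compiler, simulator, and synthesis tools, and the gate would also skip launching the later tools at all, which is where the compute savings come from.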

Loss & Training

Training uses GRPO-based policy optimization, aggregating multi-objective advantages with decoupled clipping and dynamic weights. RL training begins after an SFT cold-start phase.
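
A rough sketch of decoupled advantage estimation with dynamic weights is given below, assuming GRPO-style normalization within a group of sampled completions. The function names, the linear curriculum schedule, and its floor value are hypothetical choices for illustration; the paper's exact schedule and its decoupled clipping step are not reproduced here.

```python
import math
from typing import Dict, List


def normalize(values: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative normalization: subtract the group mean, divide by the group std."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (math.sqrt(var) + eps) for v in values]


def curriculum_weight(success_rate: float, floor: float = 0.1) -> float:
    """Hypothetical schedule: the weight shrinks linearly as the success rate rises."""
    return max(floor, 1.0 - success_rate)


def aggregate_advantages(
    group_rewards: List[Dict[str, float]],  # one reward dict per sampled completion
    success_rates: Dict[str, float],        # global success rate per process reward
    ppa_pref: Dict[str, float],             # preference vector taken from the prompt
    functional_weight: float = 1.0,
) -> List[float]:
    """Normalize each reward component independently, then take a weighted sum."""
    components = list(group_rewards[0].keys())
    per_component = {c: normalize([r[c] for r in group_rewards]) for c in components}
    weights = {}
    for c in components:
        if c in success_rates:
            weights[c] = curriculum_weight(success_rates[c])  # adaptive curriculum
        elif c in ppa_pref:
            weights[c] = ppa_pref[c]                          # prompt-conditioned PPA weight
        else:
            weights[c] = functional_weight
    return [sum(weights[c] * per_component[c][i] for c in components)
            for i in range(len(group_rewards))]
```

Normalizing each component before weighting keeps a large-magnitude or easily saturated reward (for example, syntax once most samples compile) from dominating the summed advantage, which is the motivation given for CDPO.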

Key Experimental Results

Main Results

  • Achieved SOTA in both functional correctness and PPA performance on standard benchmarks.
  • Capable of generating efficient designs tailored to specific optimization targets (Power, Delay, Area).

Key Findings

  • Integrating the EDA toolchain into the training loop is more effective than post-processing.
  • The curricular scheduling in CDPO is essential for effective multi-objective optimization.
  • Prompt-conditioned PPA weighting enables flexible control over design preferences.

Highlights & Insights

  • The concept of using an EDA toolchain as a source of verifiable rewards can be generalized to other engineering domains.
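  • CDPO's multi-objective design is not specific to Verilog and could apply to other RL settings whose reward components differ in scale and difficulty.
  • Hierarchical gating reduces compute by skipping expensive downstream evaluation for code that fails earlier checks.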

Limitations & Future Work

  • The framework relies on open-source EDA tools; commercial tools might produce different optimization results.
  • The execution time of EDA tools is relatively long, which increases training costs.
  • Future work may explore more complex design scenarios and larger-scale models.

Related Work & Insights

  • Compared to RAG methods like RTLFixer and HDLDebugger, RL training internalizes hardware knowledge in the model itself.
  • Compared to post-processing techniques like VeriGen-MCTS, RL directly improves the inherent capabilities of the LLM.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Integrating the complete EDA toolchain into RL training represents a significant innovation.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across multiple benchmarks and optimization targets.
  • Writing Quality: ⭐⭐⭐⭐ The framework design is clear, and the mathematical derivations are detailed.