ChipSeek: Optimizing Verilog Generation via EDA-Integrated Reinforcement Learning

Conference: ACL 2026
arXiv: 2507.04736
Code: https://github.com/rong-hash/chipseek
Area: Reinforcement Learning
Keywords: Verilog Generation, EDA Integration, Hierarchical Rewards, PPA Optimization, Curriculum-Guided Policy Optimization

TL;DR

ChipSeek proposes a hierarchical-reward RL framework that integrates the EDA toolchain directly into the training loop. Through Curriculum-guided Dynamic Policy Optimization (CDPO), it trains LLMs to generate RTL code that simultaneously satisfies functional correctness and PPA (power, performance, area) objectives, achieving SOTA results on standard benchmarks.

Background & Motivation

Background: LLMs have demonstrated significant potential in automated RTL code generation. Existing approaches improve functional correctness via supervised fine-tuning (SFT), retrieval-augmented generation (RAG), multi-agent systems, and chain-of-thought (CoT) reasoning, but typically neglect hardware-specific metrics (PPA).

Limitations of Prior Work: (1) Existing models lack an intrinsic mechanism to jointly optimize functional correctness and PPA; (2) post-processing methods (e.g., MCTS) do not improve the LLM's own capabilities; (3) Verilog generated by existing models is generally less hardware-efficient than expert-written designs.

Key Challenge: Current methods lack a mechanism to incorporate both functional correctness and PPA optimization as parallel training objectives.

Goal: Design a framework that incorporates EDA toolchain feedback directly into RL training, enabling LLMs to internalize hardware design knowledge.

Key Insight: Hierarchical reward design + curriculum-based weight scheduling + prompt-conditioned PPA preference.

Core Idea: By integrating a complete open-source EDA toolchain (compilation, simulation, synthesis, and backend analysis) into the training loop, the framework provides hierarchical rewards spanning syntax to PPA, enabling LLMs to learn hardware design trade-offs during training.

Method

Overall Architecture

The LLM serves as the policy \(\pi_\theta\), generating Verilog code from design specifications. The complete EDA toolchain evaluates the output and provides hierarchical rewards, which are used for multi-objective optimization via CDPO.

Key Designs

  1. Hierarchical Rewards:

    • Function: Provides multi-level feedback spanning syntax to PPA.
    • Mechanism: Rewards are divided into process rewards (format, syntax, synthesizability) and core rewards (functional correctness, PPA). A strict gating mechanism ensures that downstream metrics are evaluated only after all upstream checks pass. The PPA reward for each metric \(m \in \{\text{power}, \text{latency}, \text{area}\}\) is the improvement ratio relative to a reference design: \(r_m = \text{ref}_m / \text{gen}_m\).
    • Design Motivation: Avoids expensive downstream evaluation on invalid code; decouples continuous PPA rewards from discrete functional rewards. A minimal sketch of this gated reward appears after this list.
  2. CDPO (Curriculum-guided Dynamic Policy Optimization):

    • Function: Addresses learning-stage mismatch and scale mismatch in multi-objective optimization.
    • Mechanism: (a) Decoupled advantage estimation: each reward component is normalized independently within its rollout group; (b) Adaptive curriculum: process-reward weights are adjusted dynamically based on global success rates (e.g., the syntax weight decays automatically once the syntax success rate is high); (c) Prompt-conditioned PPA weighting: power/latency/area weights are adjusted according to the preference vector specified in the prompt.
    • Design Motivation: Naive reward summation is dominated by easily learnable components; curriculum scheduling enables a learning progression from easy to hard (see the sketch under Loss & Training below).
  3. Automated Data Augmentation Pipeline:

    • Function: Constructs PPA-aware training data.
    • Mechanism: A three-stage pipeline — generating SFT cold-start data, synthesizing diverse PPA preference vectors, and generating testbenches and PPA metrics.
    • Design Motivation: Addresses the scarcity of hardware design data.
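
As a concrete illustration, below is a minimal Python sketch of the gated hierarchical reward from item 1, combined with the prompt-conditioned preference vector from item 2(c). The helper names, the EDAResult wrapper, and the process-reward magnitudes are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class EDAResult:
    compiles: bool        # e.g., Verilog compilation succeeds
    synthesizable: bool   # e.g., open-source synthesis succeeds
    functional: bool      # testbench simulation passes
    power: float          # backend-reported metrics of the generated design
    latency: float
    area: float

def hierarchical_reward(res: EDAResult, ref: dict, pref: dict) -> float:
    """Gated reward: each downstream stage is scored only after every
    upstream check passes, so no simulation/synthesis cost is spent on
    invalid code. The 0.1/1.0 magnitudes are illustrative assumptions."""
    reward = 0.0
    if not res.compiles:        # process reward: syntax
        return reward
    reward += 0.1
    if not res.synthesizable:   # process reward: synthesizability
        return reward
    reward += 0.1
    if not res.functional:      # core reward: functional correctness
        return reward
    reward += 1.0
    # Core PPA reward: improvement ratio vs. a reference design,
    # r_m = ref_m / gen_m, weighted by the prompt's preference vector.
    gen = {"power": res.power, "latency": res.latency, "area": res.area}
    reward += sum(pref[m] * ref[m] / gen[m] for m in gen)
    return reward
```

With, say, pref = {"power": 0.6, "latency": 0.2, "area": 0.2}, the PPA term is steered toward low-power designs, which is the prompt-conditioned preference mechanism in action.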

Loss & Training

Policy optimization builds on GRPO (Group Relative Policy Optimization), using decoupled clipping and dynamic-weight multi-objective advantage aggregation; RL training starts from an SFT cold-start initialization. A minimal sketch of the aggregation follows.
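
The sketch below shows decoupled, dynamically weighted advantage aggregation, assuming GRPO-style per-group normalization; the function names, the linear curriculum schedule, and the aggregation form are illustrative assumptions rather than the paper's exact formulas.

```python
import numpy as np

def cdpo_advantage(component_rewards: dict, weights: dict, eps: float = 1e-8):
    """Decoupled advantage estimation: each reward component is normalized
    within its rollout group before weighted summation, so an easy component
    (e.g., syntax) cannot dominate the gradient by scale alone.
    component_rewards maps a component name to a (group_size,) array."""
    adv = np.zeros_like(next(iter(component_rewards.values())), dtype=np.float64)
    for name, r in component_rewards.items():
        r = np.asarray(r, dtype=np.float64)
        adv += weights[name] * (r - r.mean()) / (r.std() + eps)
    return adv

def curriculum_weight(base: float, success_rate: float) -> float:
    """Adaptive curriculum: decay a process-reward weight as its global
    success rate rises (the linear schedule here is an assumption)."""
    return base * (1.0 - success_rate)
```

Normalizing each component before weighting addresses the scale mismatch noted above; the curriculum weight then handles the learning-stage mismatch.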

Key Experimental Results

Main Results

  • Achieves SOTA functional correctness and PPA performance on standard benchmarks.
  • Produces efficient designs under specific optimization objectives (power, latency, area).

Key Findings

  • Integrating the EDA toolchain into the training loop is more effective than post-processing approaches.
  • The curriculum-based scheduling in CDPO is critical for multi-objective optimization.
  • Prompt-conditioned PPA weighting enables flexible control over design preferences.

Highlights & Insights

  • The concept of using the EDA toolchain as a verifiable reward source is generalizable to other engineering domains.
  • The multi-objective optimization design of CDPO has broad applicability.
  • Hierarchical gating significantly reduces computational overhead.
  • Compared to RAG-based methods such as RTLFixer and HDLDebugger, RL training internalizes hardware knowledge in the model itself rather than relying on external retrieval.
  • Compared to post-processing approaches such as VeriGen-MCTS, RL improves the intrinsic capabilities of the LLM.

Limitations & Future Work

  • Reliance on open-source EDA tools; commercial tools may yield different results.
  • EDA tool execution time is substantial, increasing training cost.
  • Future work may explore more complex design scenarios and larger-scale models.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Integrating a complete EDA toolchain into RL training represents a significant contribution.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across multiple benchmarks and optimization objectives.
  • Writing Quality: ⭐⭐⭐⭐ Framework design is clearly presented with detailed mathematical derivations.