Beyond Tokens: Enhancing RTL Quality Estimation via Structural Graph Learning
Conference: ICML 2026
arXiv: 2508.18730
Code: https://github.com/cure-lab/StructRTL
Area: Graph Learning / Self-supervised Learning / EDA
Keywords: RTL Quality Estimation, Graph Self-supervised Learning, Control Data Flow Graph (CDFG), Knowledge Distillation, Hardware Design Automation
TL;DR
StructRTL performs structure-aware graph self-supervised pre-training (masked node modeling + edge prediction) on the Control Data Flow Graph (CDFG) of RTL designs. Combined with knowledge distillation from post-mapping netlists to CDFGs, it substantially outperforms state-of-the-art LLM-based and hand-crafted-feature methods on area and delay estimation.
Background & Motivation
Background: In modern hardware design flows, RTL (Register-Transfer Level) code must undergo logic synthesis to obtain quality metrics such as area and delay, but this process is time-consuming and computationally expensive. Consequently, fast estimation of design quality directly at the RTL stage has become a research hotspot in EDA.
Limitations of Prior Work: Early methods extracted hand-crafted features (e.g., node type frequency, longest combinational path length) from Data Flow Graphs (DFG) or Abstract Syntax Trees (AST), which have limited expressive power. Recent work such as VeriDistill uses Large Language Models (LLMs) to extract token-level embeddings from RTL code and achieves decent results. However, two fundamental issues remain: first, LLMs are pre-trained for code generation rather than quality estimation, which limits how well their representations transfer; second, a token view encodes design structure only implicitly, which is less effective than an explicit graph view.
Key Challenge: DFG only models data dependencies and ignores control flow; AST emphasizes syntactic hierarchy but lacks structural semantics; LLM token sequences lose topological information. Since RTL design quality (especially delay) is tightly coupled with structure, existing representations cannot fully capture it.
Goal: (1) Design a self-supervised pre-training framework that directly models the structural semantics of CDFGs; (2) Transfer low-level information from post-mapping netlists to the RTL-stage predictor through knowledge distillation.
Key Insight: CDFGs integrate both control flow and data flow dependencies, providing a more complete view of design structure than DFG/AST/tokens. The authors observe that directly masking nodes or edges on the CDFG introduces ambiguity (e.g., masking an addition node makes subtraction or multiplication plausible alternatives). Therefore, masking is performed at the level of context-aware embeddings generated by a GNN, preserving the integrity of the computation graph while allowing the model to recover masked information using neighborhood context.
Core Idea: Use GNN + Transformer for structure-aware self-supervised pre-training on CDFGs to learn explicit structural representations, then enhance RTL-stage prediction using knowledge distillation from post-mapping netlists.
Method
Overall Architecture
The input is RTL Verilog code, which is first compiled into the RTLIL intermediate representation via Yosys and then parsed into a CDFG (nodes represent operations/storage elements, directed edges represent data/control dependencies). Node embeddings are initialized by concatenating node type one-hot encoding and bit-width: \(\mathbf{h}_i^0 = \text{concat}(\text{one-hot}(\text{type}(v_i)), \text{width}(v_i))\). Embeddings pass through 8 GIN layers to obtain context-aware embeddings, which are then augmented with global positional encodings based on Laplacian eigenvectors and fed into an 8-layer Transformer encoder. The pre-training phase executes two self-supervised tasks; the fine-tuning phase aggregates graph-level representations via combined mean and max pooling, performing area/delay regression via a 3-layer MLP.
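The node initialization formula above can be illustrated with a minimal Python sketch. The vocabulary `NODE_TYPES` and the function name `init_node_embedding` are hypothetical and chosen only for illustration; the actual node-type vocabulary used by StructRTL comes from the RTLIL representation and is not reproduced here.

```python
from typing import List

# Hypothetical node-type vocabulary (assumption); the real one is derived from RTLIL.
NODE_TYPES = ["input", "output", "wire", "dff", "add", "sub", "mul", "mux"]
TYPE_INDEX = {name: i for i, name in enumerate(NODE_TYPES)}


def init_node_embedding(node_type: str, width: int) -> List[float]:
    """Build h_i^0 = concat(one-hot(type(v_i)), width(v_i))."""
    one_hot = [0.0] * len(NODE_TYPES)
    one_hot[TYPE_INDEX[node_type]] = 1.0
    # The bit-width is appended as a single scalar feature.
    return one_hot + [float(width)]


# A 32-bit adder node: one-hot over 8 types plus one width entry.
h = init_node_embedding("add", 32)
```

Whether the bit-width is used raw or rescaled (for example, log-scaled) before being fed to the GIN layers is not specified above, so the sketch keeps it as a raw scalar.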
Key Designs
- Structure-Aware Masked Node Modeling:
  - Function: Learn structure-aware representations by predicting the types of masked nodes.
  - Mechanism: Randomly mask 20% of the post-GNN embeddings (each replaced with a learnable [MASK] token) and use the Transformer to predict the original node types (32-class classification). Crucially, masking is applied to the GNN output rather than to the original graph, so the computation graph semantics stay intact. To address class imbalance (e.g., operator nodes are far rarer than wires/registers), a stratified masking strategy ensures that at least \(m\) nodes per class are selected, and a class-balanced focal loss replaces cross-entropy, with class weights \(w_c = (1-\beta) / (1-\beta^{S_c})\), where \(S_c\) is the number of samples in class \(c\), \(\beta=0.9999\), and the focal focusing parameter is \(\gamma=2.0\).
  - Design Motivation: Masking raw nodes introduces ambiguity (addition vs. multiplication); post-GNN embeddings already encode neighborhood context, so the masked type can be reconstructed unambiguously from rich contextual information.
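The stratified masking and class-balanced focal loss described above can be sketched in plain Python. This is a minimal illustration under assumptions, not the authors' implementation: the function names, the toy class labels, and the choice to top up the mask uniformly at random after the per-class minimum are all hypothetical.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List


def stratified_mask(node_types: List[int], ratio: float = 0.2,
                    min_per_class: int = 1, seed: int = 0) -> List[int]:
    """Select node indices to mask: about `ratio` of all nodes, with at
    least `min_per_class` per node type (or all of a type if it is rarer)."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, t in enumerate(node_types):
        by_class[t].append(idx)
    chosen = set()
    for members in by_class.values():
        chosen.update(rng.sample(members, min(min_per_class, len(members))))
    # Top up uniformly at random until the overall ratio is reached (assumption).
    target = max(len(chosen), round(ratio * len(node_types)))
    remaining = [i for i in range(len(node_types)) if i not in chosen]
    chosen.update(rng.sample(remaining, target - len(chosen)))
    return sorted(chosen)


def class_balanced_weights(class_counts: Dict[int, int],
                           beta: float = 0.9999) -> Dict[int, float]:
    """w_c = (1 - beta) / (1 - beta ** S_c), S_c = sample count of class c."""
    return {c: (1.0 - beta) / (1.0 - beta ** s) for c, s in class_counts.items()}


def focal_loss(probs: List[float], target: int, weight: float,
               gamma: float = 2.0) -> float:
    """Class-balanced focal loss for one node: -w_c * (1 - p_t)^gamma * log(p_t)."""
    p_t = probs[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


# Toy graph: 0 = wire, 1 = register, 2 = adder (hypothetical labels).
node_types = [0] * 90 + [1] * 8 + [2] * 2
masked = stratified_mask(node_types, ratio=0.2, min_per_class=1)
weights = class_balanced_weights({0: 90, 1: 8, 2: 2})
loss_easy = focal_loss([0.9, 0.05, 0.05], target=0, weight=weights[0])
loss_hard = focal_loss([0.9, 0.05, 0.05], target=2, weight=weights[2])
```

In this toy setting the rare adder class receives the largest weight, and a confidently correct prediction on the frequent class contributes almost nothing to the loss, which is the intended effect of combining class-balanced weights with the focal term.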
- Edge Prediction:
  - Function: Learn the topological connection patterns of the graph.
  - Mechanism: When post-GNN embeddings are flattened into a sequence for the Transformer, graph connectivity information is lost (equivalent to masking all edges). Each iteration randomly samples 20% of the real edges as positive samples and an equal number of non-existent edges as negative samples. The Transformer output embeddings of the source and target nodes are concatenated and passed through a 3-layer MLP that classifies whether the edge exists. The total pre-training loss is \(\mathcal{L}_{pre} = \alpha \mathcal{L}_{mnm} + (1-\alpha) \mathcal{L}_{ep}\) with \(\alpha=0.5\).
  - Design Motivation: Transformers discard graph topology by construction; edge prediction forces the model to recover connectivity from the embeddings, complementing masked node modeling (one learns node semantics, the other learns edge topology).
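The positive/negative edge sampling and the combined pre-training loss can be sketched as follows. This is an illustrative sketch only: `sample_edge_batch` and `pre_training_loss` are hypothetical names, excluding self-loops from the negatives is an assumption, and the MLP classifier itself is omitted.

```python
import random
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def sample_edge_batch(edges: List[Edge], num_nodes: int, ratio: float = 0.2,
                      seed: int = 0) -> Tuple[List[Edge], List[Edge]]:
    """Sample `ratio` of the real edges as positives and an equal number of
    non-existent directed edges as negatives."""
    rng = random.Random(seed)
    existing: Set[Edge] = set(edges)
    k = max(1, round(ratio * len(edges)))
    positives = rng.sample(edges, k)
    negatives: Set[Edge] = set()
    # Rejection sampling; assumes the graph is sparse enough that
    # non-edges are easy to find (true for typical CDFGs).
    while len(negatives) < k:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u != v and (u, v) not in existing:  # no self-loops (assumption)
            negatives.add((u, v))
    return positives, sorted(negatives)


def pre_training_loss(l_mnm: float, l_ep: float, alpha: float = 0.5) -> float:
    """L_pre = alpha * L_mnm + (1 - alpha) * L_ep."""
    return alpha * l_mnm + (1.0 - alpha) * l_ep


# Toy directed graph: a 10-node chain plus one skip edge (10 edges in total).
edges = [(i, i + 1) for i in range(9)] + [(0, 5)]
pos, neg = sample_edge_batch(edges, num_nodes=10)
```

Resampling positives and negatives at every iteration means the model sees a different subset of the topology each time, rather than memorizing a fixed set of edge labels.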
- Post-Mapping Knowledge Distillation:
  - Function: Transfer low-level information from post-mapping (PM) netlists to the CDFG predictor.
  - Mechanism: First, RTL is synthesized into PM netlists using Yosys + ABC. A teacher model (20-layer GIN) is trained to predict area/delay directly from the netlists, where prediction is naturally more accurate because the netlist is closer to the final implementation. With the teacher frozen, distillation aligns the last-layer activations of the CDFG student and the PM teacher: \(\mathcal{L}_{kd} = \tau \cdot \mathcal{L}_{cos}(z_{CDFG}^{-1}, z_{PM}^{-1}) + (1-\tau) \cdot \mathcal{L}_{mse}(z_{CDFG}^{-1}, z_{PM}^{-1})\) with \(\tau=0.7\), where \(z^{-1}\) denotes last-layer activations. The total training loss is \(\mathcal{L}_{total} = \mu \mathcal{L}_{qe} + (1-\mu) \mathcal{L}_{kd}\) with \(\mu=0.5\), where \(\mathcal{L}_{qe}\) is the quality estimation loss. Only the CDFG predictor is retained during inference.
  - Design Motivation: An abstraction gap exists between RTL and the actual metrics; PM netlists are closer to the physical implementation, and distillation bridges this gap.
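The distillation objective above can be written out as a small Python sketch. It assumes that \(\mathcal{L}_{cos}\) is one minus the cosine similarity, which is a common convention but is not confirmed by the text above; all function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_loss(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity (assumed form of L_cos)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mse_loss(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two activation vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kd_loss(z_cdfg: Sequence[float], z_pm: Sequence[float], tau: float = 0.7) -> float:
    """L_kd = tau * L_cos + (1 - tau) * L_mse on last-layer activations."""
    return tau * cosine_loss(z_cdfg, z_pm) + (1.0 - tau) * mse_loss(z_cdfg, z_pm)


def total_loss(l_qe: float, l_kd: float, mu: float = 0.5) -> float:
    """L_total = mu * L_qe + (1 - mu) * L_kd."""
    return mu * l_qe + (1.0 - mu) * l_kd


# Identical activations give (numerically) zero distillation loss;
# orthogonal unit vectors give cos loss 1 and MSE 1, hence L_kd = 1.
kd_same = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
kd_orth = kd_loss([1.0, 0.0], [0.0, 1.0])
```

The cosine term aligns the direction of the student and teacher activations while the MSE term also matches their magnitude; with \(\tau=0.7\) the direction carries more weight.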
Loss & Training
- Quality estimation uses a log-cosh loss (robust to outliers) on log-transformed target values.
- Global positional encodings use the eigenvectors associated with the \(k=16\) smallest eigenvalues of the symmetric normalized Laplacian matrix of the directed graph; eigenvector signs are randomly flipped during training to improve generalization.
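Both training details above can be sketched in plain Python. The sketch makes several assumptions that are not stated in the text: the natural logarithm is used for the target transform, edge directions are dropped when building the symmetric normalized Laplacian, and eigenvectors are stored one per row. The eigendecomposition itself is omitted; in practice a routine such as `numpy.linalg.eigh` would supply the eigenvectors.

```python
import math
import random
from typing import List, Sequence, Tuple


def log_cosh_loss(pred_log: Sequence[float], target_raw: Sequence[float]) -> float:
    """Mean log-cosh error between log-space predictions and log(target).
    Assumption: natural log; targets must be strictly positive."""
    total = 0.0
    for p, t in zip(pred_log, target_raw):
        x = abs(p - math.log(t))
        # Numerically stable log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log(2)
        total += x + math.log1p(math.exp(-2.0 * x)) - math.log(2.0)
    return total / len(pred_log)


def sym_normalized_laplacian(num_nodes: int,
                             edges: Sequence[Tuple[int, int]]) -> List[List[float]]:
    """L = I - D^{-1/2} A D^{-1/2}, with edge directions dropped (assumption)."""
    adj = [[0.0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
    deg = [sum(row) for row in adj]
    lap = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        for j in range(num_nodes):
            if i == j:
                lap[i][j] = 1.0 if deg[i] > 0 else 0.0
            elif adj[i][j] > 0:
                lap[i][j] = -1.0 / math.sqrt(deg[i] * deg[j])
    return lap


def random_sign_flip(eigvecs: List[List[float]], rng: random.Random) -> List[List[float]]:
    """Multiply each eigenvector (one per row) by an independent random sign."""
    flipped = []
    for vec in eigvecs:
        sign = rng.choice((-1.0, 1.0))
        flipped.append([sign * x for x in vec])
    return flipped


# Path graph 0-1-2: sqrt(degree) is the eigenvector for eigenvalue 0.
lap = sym_normalized_laplacian(3, [(0, 1), (1, 2)])
v0 = [1.0, math.sqrt(2.0), 1.0]
flipped = random_sign_flip([v0], random.Random(0))[0]
# L @ flipped should still be (numerically) zero: the sign is arbitrary.
residual = max(abs(sum(l * x for l, x in zip(row, flipped))) for row in lap)
loss_zero = log_cosh_loss([math.log(100.0)], [100.0])
```

The residual check illustrates why sign flipping is a valid augmentation: if \(v\) is an eigenvector then so is \(-v\), so the model should not depend on which sign the solver happens to return.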
Key Experimental Results
Main Results (Without Knowledge Distillation)
| Method | Area \(R^2\)↑ | Area MAPE↓ | Delay \(R^2\)↑ | Delay MAPE↓ |
|---|---|---|---|---|
| Graph-XGBoost | 0.3987 | 0.19 | 0.3362 | 0.12 |
| Graph-GNN | 0.5857 | 0.09 | 0.6639 | 0.13 |
| CodeV-DS-6.7B | 0.4862 | 0.17 | 0.3905 | 0.12 |
| CodeV-CL-7B | 0.5755 | 0.15 | 0.5174 | 0.10 |
| CodeV-QW-7B | 0.6353 | 0.13 | 0.5277 | 0.09 |
| StructRTL | 0.7463 | 0.06 | 0.7630 | 0.10 |
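For reference, the two metrics reported in the table can be computed as in the sketch below. It uses the standard definitions of \(R^2\) and MAPE, with MAPE expressed as a fraction rather than a percentage (which matches the scale of the table values). Whether the paper evaluates them on raw or log-transformed targets is not stated here, and the sample values are made up for illustration.

```python
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, returned as a fraction."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values: each prediction is off by 10% of its target.
y_true = [100.0, 200.0]
y_pred = [110.0, 180.0]
example_mape = mape(y_true, y_pred)
example_r2 = r2_score(y_true, y_pred)
```

The two metrics capture different things: \(R^2\) measures how much of the variance across designs is explained, while MAPE measures the typical relative error per design, so a method can rank well on one and less well on the other.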
Ablation Study + Knowledge Distillation Effect
| Configuration | Area \(R^2\)↑ | Delay \(R^2\)↑ | Description |
|---|---|---|---|
| StructRTL (full, w/o KD) | 0.7463 | 0.7630 | Full model without distillation |
| w/o \(\mathcal{L}_{mnm}\) (w/o KD) | 0.7249 | 0.7473 | \(R^2\) drops without masked node modeling |
| w/o \(\mathcal{L}_{ep}\) (w/o KD) | 0.7018 | 0.7368 | \(R^2\) drops more without edge prediction |
| StructRTL + KD | 0.8676 | 0.8872 | Substantial gain with knowledge distillation |
| w/o \(\mathcal{L}_{mnm}\) + KD | 0.8557 | 0.8796 | Consistent ablation under distillation |
| w/o \(\mathcal{L}_{ep}\) + KD | 0.8480 | 0.8654 | EP contribution slightly higher than MNM |
| CodeV-QW-7B + KD | 0.8174 | 0.7687 | Strongest LLM baseline + KD |
| PM Teacher (Upper Bound) | 0.9334 | 0.9484 | Prediction directly from netlist |
Key Findings
- Without KD, StructRTL outperforms the strongest LLM baseline (CodeV-QW-7B) by +0.11 (Area) and +0.24 (Delay) in \(R^2\). The larger advantage in delay estimation is consistent with structural information being vital for delay prediction.
- Knowledge distillation brings an \(R^2\) gain of +0.12 (Area) and +0.12 (Delay) to StructRTL.
- Using only 20% of training data, StructRTL achieves competitive performance (Area 0.56 / Delay 0.60) compared to LLM baselines trained on the full set.
- Industrial Evaluation: On 51 pairs of real industrial designs, ranking accuracy reached 82.35% for area and 88.23% for delay.
- Inference Speed: Inference takes 0.096 s per design on average versus 13.97 s per design for synthesis, roughly a 145x speedup.
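The headline deltas in the findings above follow directly from the table values and the reported timings. A short Python check of the arithmetic:

```python
# R^2 values copied from the two results tables.
structrtl = {"area": 0.7463, "delay": 0.7630}
codev_qw = {"area": 0.6353, "delay": 0.5277}
structrtl_kd = {"area": 0.8676, "delay": 0.8872}

# Gain over the strongest LLM baseline (without KD).
gain_vs_llm = {k: structrtl[k] - codev_qw[k] for k in structrtl}
# Gain from adding knowledge distillation to StructRTL.
gain_from_kd = {k: structrtl_kd[k] - structrtl[k] for k in structrtl}
# Speedup of model inference over running synthesis.
speedup = 13.97 / 0.096

for name, gains in (("vs. LLM", gain_vs_llm), ("from KD", gain_from_kd)):
    print(name, {k: round(v, 2) for k, v in gains.items()})
print("speedup:", round(speedup, 1))
```

The exact differences are 0.111 and 0.235 against the LLM baseline and 0.121 and 0.124 from distillation, and the timing ratio is about 145.5, so the rounded figures quoted above are consistent with the tables.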
Highlights & Insights
- Post-GNN masking instead of raw graph masking: This is the most ingenious design. Computation graphs differ from natural language or social networks in that node semantics are strictly constrained, so a masked raw node may have several valid substitutes. Post-GNN masking preserves the integrity of the original graph while providing sufficient context, and the idea can be carried over to computation graphs in other domains (e.g., compiler IR, circuit netlists).
- Explicit (CDFG) vs. implicit (token) structure: A lightweight GNN + Transformer that observes the graph structure directly is more effective than a 7B LLM that observes code tokens. This suggests that task-aligned representations matter more than general-purpose LLM scale, an insight that may transfer to other downstream code-analysis tasks.
- Cross-abstraction level distillation: The teacher is at the PM netlist level, and the student is at the RTL level. Utilizing the natural correspondence of the same design across different abstraction levels for distillation is a strategy applicable to any engineering design problem with multi-level abstractions.
Limitations & Future Work
- The dataset consists primarily of 13,200 small-to-medium-scale open-source designs. Although industrial designs were evaluated, scalability to large industrial SoC-level designs remains unverified.
- CDFG construction depends on successful Yosys compilation, making it unable to handle RTL code containing proprietary IP or non-synthesizable constructs.
- Knowledge distillation requires performing synthesis to obtain PM netlist labels, partially canceling out the "skip synthesis" motivation (though it is only needed during training).
- The authors note that technology-node migration remains an open problem; current models require recalibration with labeled data for each process node.
- Future directions: Extend the method to more EDA quality metrics (Power, Timing Slack) and explore pure self-supervised quality ranking without synthesis labels.
Related Work & Insights
- VeriDistill (Moravej et al., 2025): Pioneered LLM-based RTL embeddings combined with knowledge distillation. This paper builds on that distillation framework while switching the representation from tokens to CDFGs.
- GraphMAE (Hou et al., 2022) / MaskGAE (Li et al., 2023): General graph self-supervised learning methods for node feature masked reconstruction and edge mask recovery. This paper adapts these techniques specifically for computation graph semantics.
- MasterRTL (Fang et al., 2023): Quality estimation based on SOG manual features, though SOG construction itself requires partial synthesis steps.