PrefixGPT: Prefix Adder Optimization by a Generative Pre-trained Transformer
Conference: AAAI 2026 | arXiv: 2511.19472 | Code: https://github.com/Mightlaus/PrefixGPT-AAAI26 | Area: LLM Pretraining | Keywords: prefix adder optimization, GPT generative model, hardware design automation, reinforcement learning fine-tuning, legality mask
TL;DR
PrefixGPT frames prefix adder optimization as a sequence generation problem. A customized GPT model is pretrained to learn design rules, then fine-tuned via RL to generate optimized designs, achieving state-of-the-art area-delay product (ADP) with robustness to initialization.
Background & Motivation
Prefix adders are widely used in computation-intensive applications such as signal processing and AI because of their high speed. The core challenge lies in designing the prefix graph, which must simultaneously optimize circuit area (corresponding to graph size) and delay (corresponding to graph depth), within a design space that grows exponentially with bit width.
Limitations of Prior Work: Existing AI-based methods — PrefixRL (deep Q-learning), ArithTree (MCTS), and PrefixLLM (large language models) — all adopt a "structural modification" paradigm: starting from an existing design and iteratively adding/removing nodes or editing local connections. This introduces two fundamental problems:
Frequent design rule violations: Iterative modifications often violate the three strict design rules of prefix adders (fan-in rule, output rule, merging rule), necessitating additional repair steps that themselves introduce extra hardware overhead.
Strong initialization bias: Method performance is heavily dependent on the choice of initial design (e.g., Sklansky, Kogge-Stone, Brent-Kung), with large performance variations or complete failure across different initializations.
Core Question: The design rules of prefix adders are more structured than natural-language grammar. GPT models have demonstrated the ability to master complex linguistic "grammar" and generate high-quality text that satisfies structural and logical constraints. Can a GPT likewise learn the "grammar" of prefix adders and directly generate legal, high-quality designs from scratch?
Key Insight: Model prefix adder design as a 2D coordinate sequence generation problem. Train a GPT to generate optimized designs directly from scratch rather than modifying existing ones. A legality mask guarantees that every generation step satisfies design rules, completely eliminating repair steps.
Method
Overall Architecture
PrefixGPT adopts a two-stage pretrain-then-fine-tune pipeline:

1. Convert prefix graphs into 2D coordinate sequence representations.
2. Apply a legality mask so that every generation step is legal.
3. Pretrain on one million randomly synthesized legal prefix adders to learn the design rules.
4. Fine-tune with GRPO reinforcement learning, using ADP as the reward signal, to optimize design quality.
Key Designs
1. Sequence Representation and Legality Mask
- Function: Convert prefix graph → prefix matrix → coordinate sequence, making the problem tractable for GPT.
- Mechanism: An \(n\)-bit prefix adder corresponds to an \(n \times n\) binary lower-triangular matrix. The coordinate sequence \(L_{1:N} = (L_1, \ldots, L_N)\) is generated by scanning the '1' entries of the matrix, where each \(L_p = (L_p^r, L_p^c)\) denotes row and column coordinates.
- Dynamic legality mask: Two cases of legal next-step choices are derived from the design rules:
- Case 1: If the current coordinate is in column 0, the next step can only be the diagonal entry of the next row: \(L_{k+1} = (L_k^r+1, L_k^r+1)\).
- Case 2: If the current coordinate is not in column 0, the next coordinate must lie in the same row, and its column must be one already occupied in row \(L_k^c - 1\): the new node then merges the current node (as its most significant part, MSP) with that existing node (as its least significant part).
- Design Motivation: The mask is computed in parallel on the GPU, setting the probability of every illegal choice to zero and guaranteeing valid-by-construction generation, which eliminates repair steps entirely; a minimal sketch of the scan order and mask follows this list.
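As a concrete illustration, here is a minimal Python sketch of the representation and mask, directly transcribing the two cases above; the 0-indexed coordinates, the row-major scan order, and the function names (`matrix_to_sequence`, `legal_next_coords`) are my assumptions rather than the authors' code.

```python
# Sketch: prefix matrix -> coordinate sequence, plus the legal next-step set.
# Assumes 0-indexed coordinates and a row-major scan from the diagonal down
# to column 0, per the two cases described above (not the authors' code).

def matrix_to_sequence(matrix):
    """Scan the '1' entries of an n x n lower-triangular prefix matrix."""
    n = len(matrix)
    seq = []
    for r in range(n):
        for c in range(r, -1, -1):  # diagonal (input) down to column 0 (output)
            if matrix[r][c] == 1:
                seq.append((r, c))
    return seq

def legal_next_coords(matrix, last):
    """Legal choices for the coordinate following `last`, per the mask."""
    n = len(matrix)
    r, c = last
    if c == 0:  # Case 1: row finished -> next row's diagonal entry
        return {(r + 1, r + 1)} if r + 1 < n else set()
    # Case 2: stay in row r; the new node merges the current node (MSP)
    # with an existing node in row c-1 (LSP), so legal columns are exactly
    # the columns already occupied in row c-1.
    return {(r, c2) for c2 in range(c) if matrix[c - 1][c2] == 1}

# Example: a 4-bit ripple-style design, where each row holds only its
# input (diagonal) and its output (column 0).
ripple = [[1 if c in (0, r) else 0 for c in range(4)] for r in range(4)]
print(matrix_to_sequence(ripple))         # [(0, 0), (1, 1), (1, 0), (2, 2), ...]
print(legal_next_coords(ripple, (1, 1)))  # {(1, 0)}
```

Note that every legal column in Case 2 is strictly smaller than the current one, so each row provably terminates at column 0 and jumps to the next diagonal; this monotonicity is what makes valid-by-construction generation possible.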
2. Spatial Coordinate Embedding and Dual-Head Transformer Backbone
- Function: A customized GPT architecture for processing 2D coordinate data.
- Spatial coordinate embedding: Separate embedding vectors \(\mathbf{L}_p^r, \mathbf{L}_p^c \in \mathbb{R}^d\) are learned for row and column coordinates, with relative position information encoded via RoPE. The fused token is obtained by concatenation: \(\mathbf{T}_p = \hat{\mathbf{L}}_p^r \| \hat{\mathbf{L}}_p^c \in \mathbb{R}^{2d}\).
- Dual-head Transformer: A shared decoder (4 layers) feeds into a row head (1 layer, predicting row coordinates) and a column head (2 layers, predicting column coordinates after incorporating row information). The column head depends on the row head output, as legal column coordinates are determined by the row coordinate.
- Design Motivation: A standard GPT processes one-dimensional token sequences and cannot directly handle 2D coordinates. The dual-head design naturally models the dependency of the column on the row; a minimal architectural sketch follows this list.
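Below is a minimal PyTorch sketch of this layout. The layer split (4 shared / 1 row / 2 column) and the concatenated row/column embeddings follow the description above; the model width, the use of encoder layers with a causal mask, the way the row branch feeds the column head, and the omission of RoPE are simplifying assumptions of mine.

```python
import torch
import torch.nn as nn

class PrefixGPTSketch(nn.Module):
    """Dual-head backbone sketch: shared trunk -> row head -> column head."""

    def __init__(self, n_bits=48, d=128, n_heads=8):
        super().__init__()
        # Separate learned embeddings for row and column coordinates.
        self.row_emb = nn.Embedding(n_bits, d)
        self.col_emb = nn.Embedding(n_bits, d)
        make = lambda: nn.TransformerEncoderLayer(2 * d, n_heads, batch_first=True)
        self.shared = nn.TransformerEncoder(make(), num_layers=4)
        self.row_head = nn.TransformerEncoder(make(), num_layers=1)
        self.col_head = nn.TransformerEncoder(make(), num_layers=2)
        self.row_out = nn.Linear(2 * d, n_bits)
        self.col_out = nn.Linear(2 * d, n_bits)

    def forward(self, rows, cols):
        # Fuse each coordinate pair by concatenation: T_p = L_p^r || L_p^c.
        # (RoPE on each half is omitted here for brevity.)
        x = torch.cat([self.row_emb(rows), self.col_emb(cols)], dim=-1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.shared(x, mask=causal, is_causal=True)
        h_row = self.row_head(h, mask=causal, is_causal=True)
        # The column head sees the trunk plus the row branch, since legal
        # columns depend on the chosen row (one simple way to inject it).
        h_col = self.col_head(h + h_row, mask=causal, is_causal=True)
        return self.row_out(h_row), self.col_out(h_col)

model = PrefixGPTSketch()
rows = torch.randint(0, 48, (2, 10))
cols = torch.randint(0, 48, (2, 10))
row_logits, col_logits = model(rows, cols)  # each of shape (2, 10, 48)
```

At pretraining time, the loss in the next subsection is then simply next-token cross-entropy summed over the two logit streams.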
3. Pretraining and GRPO Fine-tuning
- Pretraining: Self-supervised learning on one million randomly synthesized legal coordinate sequences. The loss is the sum of row and column cross-entropies over the \(N\) tokens of a sequence: \(\mathcal{L}_{pre} = -\frac{1}{N}\sum_{p=1}^{N}\left(\log P(L_p^r \mid L_{1:p-1}) + \log P(L_p^c \mid L_{1:p-1})\right)\).
- Fine-tuning: The pretrained model is copied into a policy model \(\pi_\theta\) and a frozen reference model \(\pi_{ref}\). GRPO is applied for optimization. At each iteration, \(G=512\) legal adders are sampled, with reward defined as negative ADP: \(r_i = -\text{area}(\alpha_i) \times \text{delay}(\alpha_i)\).
- Best-design retrieval mechanism: A global database records all previously generated designs. At each iteration, the top designs (≤10% of batch size) are retrieved and merged into the current batch to reinforce high-quality pattern learning.
- Objective: \(\mathcal{J}(\theta) = \frac{1}{G}\sum_{i=1}^{G}\frac{1}{N_i}\sum_{p=1}^{N_i}\left\{\gamma^p s_\theta(L_p)\hat{A}_i - \beta\,\mathbb{D}_{KL}(L_p)\right\}\), where \(s_\theta(L_p)\) is the per-token importance ratio, \(\hat{A}_i\) the group-normalized advantage, \(\gamma\) a positional discount, and \(\mathbb{D}_{KL}\) a per-token KL penalty against the reference model. A sketch of this update follows below.
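The snippet below is a minimal PyTorch sketch of this objective, assuming group-normalized advantages and the unbiased per-token KL estimator from GRPO (Shao et al., 2024); the function name, tensor shapes, padding mask, and the absence of ratio clipping are my assumptions, not the paper's exact implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, mask, gamma=1.0, beta=1e-3):
    """Sketch of the objective above (negated at the end for minimization).

    logp_new/old/ref: (G, N) per-token log-probs (row + column terms summed)
    rewards:          (G,)   negative ADP of each sampled adder
    mask:             (G, N) 1.0 for real tokens, 0.0 for padding
    """
    # Group-relative advantage A_hat: normalize rewards within the group of G.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Per-token importance ratio s_theta and positional discount gamma^p.
    ratio = torch.exp(logp_new - logp_old)
    disc = gamma ** torch.arange(logp_new.size(1), dtype=logp_new.dtype)
    # Unbiased per-token KL estimator to the frozen reference (Shao et al., 2024).
    log_r = logp_ref - logp_new
    kl = torch.exp(log_r) - log_r - 1.0
    per_token = disc * ratio * adv[:, None] - beta * kl
    # Average over tokens (1/N_i), then over the group (1/G).
    j = ((per_token * mask).sum(dim=1) / mask.sum(dim=1)).mean()
    return -j

# Usage sketch: rewards are the negative area-delay products of the G samples,
# e.g. rewards = torch.tensor([-area(a) * delay(a) for a in sampled_adders]).
```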
Loss & Training
- Pretraining: Row and column cross-entropy loss, 5 epochs, learning rate \(10^{-4}\).
- Fine-tuning: GRPO objective, 200 RL iterations, 512 samples per iteration, sampling temperature 0.8 (hyperparameters consolidated in the snippet below).
- KL divergence regularization (\(\beta=0.001\)) to prevent policy collapse.
- Pretraining is performed at \(n=48\)-bit width and transfers immediately to any bit width \(m \leq n\), since the coordinates of an \(m\)-bit design form a subset of the 48-bit coordinate vocabulary.
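For quick reference, the stated hyperparameters can be collected in one place (a sketch only; the dictionary names are illustrative, not the authors' API):

```python
# Hyperparameters as stated in the text; names are illustrative only.
PRETRAIN = dict(corpus_size=1_000_000, bit_width=48, epochs=5, lr=1e-4)
FINETUNE = dict(rl_iterations=200, group_size=512, temperature=0.8,
                kl_beta=1e-3, reward="negative area-delay product (ADP)")
```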
Key Experimental Results
Main Results (Prefix Graph Size Comparison)
| Bit Width | Depth Limit | ArithTree (Sk/Ko/Br/Ra) | PrefixRL (Sk/Ko) | PrefixGPT (Sk/Ko/Br/Ra) |
|---|---|---|---|---|
| 16-bit | 5 | 31/45/-/- | -/- | 31/31/31/32 |
| 32-bit | 7 | 65/108/-/- | 77/128 | 57/61/57/60 |
| 48-bit | 7 | 119/217/-/- | 224/- | 102/94/98/102 |
| 48-bit | 9 | 94/155/-/- | 120/207 | 86/86/86/86 |
PrefixGPT achieves the smallest graph size across all 12 (bit-width × depth-limit) combinations, with a maximum size reduction of 59.1%.
ADP Comparison
| Bit Width | Method | Best ADP (μm²·ns) | Mean ADP | Std. Dev. |
|---|---|---|---|---|
| 32-bit | ArithTree | - | 324.64 | - |
| 32-bit | PrefixGPT | - | 91.14 (↓71.9%) | very low |
| 48-bit | ArithTree | 131.4 | - | 422.2 |
| 48-bit | PrefixRL | 142.8 | - | - |
| 48-bit | PrefixGPT | 121.3 (↓7.7%) | 185.15 (↓79.1%) | 25.2 (↓94%) |
Ablation Study
| Configuration | Final Mean Reward Change | Std. Dev. Change |
|---|---|---|
| w/o pretraining | −31.6% | significantly increased |
| w/o RoPE | −16.6% | increased |
| w/o KL divergence + best-design retrieval | −1.3% | +110% |
| Full PrefixGPT | baseline | baseline |
Key Findings
- Initialization robustness: PrefixGPT with random initialization outperforms competing methods using the best manually designed initializations.
- Legality rate: After pretraining, removing the legality mask still yields near-100% legal generation, demonstrating that the model has genuinely internalized design rules.
- Attention mechanism: The attention scores of the column head closely align with the merging rule, indicating that the model develops a fundamental understanding of the design space beyond mere pattern memorization during pretraining.
- Inference efficiency: Generating a 32-bit design takes approximately 7 ms.
Highlights & Insights
- Paradigm shift: Transitioning from "modifying existing designs" to "generating designs from scratch" fundamentally bypasses both repair steps and initialization bias.
- Elegant legality mask design: Complex design rules are reformulated as dynamic masks computable in parallel on GPU, seamlessly integrated into the generation process.
- Deep value of pretraining: Beyond learning to generate legal designs, the model internalizes the structural characteristics of the design space, enabling more efficient search during fine-tuning.
- Cross-domain implications: Demonstrates that the GPT paradigm can be generalized to combinatorial optimization problems with strict constraints, not limited to natural language.
Limitations & Future Work
- MSP constraint: Requiring the most significant part (MSP) of \(L_{k+1}\) to be the node corresponding to \(L_k\) simplifies generation at the cost of some optimality.
- GPU memory constraint: Pretraining is limited to 48-bit width; larger bit widths require more memory.
- Only ADP is optimized; other circuit metrics such as power consumption are not considered.
- Stronger RL algorithms could be explored as alternatives to GRPO.
Related Work & Insights
- PrefixRL (Roy et al., 2021): Pioneering work applying deep Q-learning to prefix adder optimization, but limited by initialization bias.
- ArithTree (Lai et al., 2024): MCTS-based method, scalable but similarly affected by initialization issues.
- PrefixLLM (Xiao et al., 2024): Uses LLMs to edit text representations, but generation time is prohibitively long (>200 s/design vs. PrefixGPT's 7 ms).
- GRPO (Shao et al., 2024): A lightweight RL method that avoids the cost of a value model.
Rating
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐