NN-Former: Rethinking Graph Structure in Neural Architecture Representation¶
Conference: CVPR 2025
arXiv: 2507.00880
Code: https://github.com/XuRuihan/NNFormer
Area: Graph Learning
Keywords: Architecture Prediction, NAS, Graph Transformer, Sibling Nodes, Latency Prediction
TL;DR¶
NN-Former proposes a hybrid GNN-Transformer architecture predictor, revealing that existing methods overlook the topological information of "sibling nodes" (nodes sharing parent/child nodes). By introducing Adjacency-Sibling Multihead Attention (ASMA) and Bidirectional Graph Isomorphism FFN (BGIFFN), it achieves a Kendall's Tau of \(0.877\)/\(0.890\) on NAS-Bench-101/201 and reduces the MAPE of latency prediction by 48-64%.
Background & Motivation¶
- Background: Performance predictors in neural architecture search (NAS) need to predict accuracy or latency from architecture graphs. Existing methods are divided into GNN-based (GATES, GMAE-NAS) and Transformer-based (NAR-Former V2, PINAT).
- Limitations of Prior Work: (1) GNNs only propagate adjacency information, ignoring "sibling" relationships (e.g., two parallel branches sharing input/output); (2) The global attention of Transformers does not distinguish between different types of graph structural relationships; (3) Existing predictors lack accuracy under low-data regimes (0.02% training data).
- Key Challenge: "Sibling relationships" in graph structures (such as two branches of a residual connection) contain crucial design information but cannot be directly observed in the adjacency matrix \(A\) (requiring \(AA^T\) or \(A^TA\)), which are entirely missed by existing methods.
- Goal: Design a hybrid predictor that simultaneously utilizes adjacency and sibling topologies.
- Key Insight: Use \(AA^T\) to extract siblings sharing a parent node (e.g., two paths of a residual block) and \(A^TA\) to extract siblings sharing a child node (e.g., multi-input aggregation).
- Core Idea: ASMA (four-head attention, each using a different topological mask) + BGIFFN (Bidirectional Graph Isomorphism Feed-Forward Network).
Method¶
Overall Architecture¶
Architecture graph \(\to\) one-hot operation encoding + sinusoidal attribute encoding \(\to\) \(L\)-layer NN-Former (ASMA + BGIFFN) \(\to\) global pooling \(\to\) MLP predicting accuracy or latency.
Key Designs¶
-
Adjacency-Sibling Multihead Attention (ASMA)
- Function: Simultaneously models four topological relationships: adjacency and sibling relationships.
- Mechanism: Four attention heads utilize different topological masks—Head 1: \((I+A)\) forward adjacency; Head 2: \((I+A^T)\) backward adjacency; Head 3: \((I+AA^T)\) sibling sharing a parent; Head 4: \((I+A^TA)\) sibling sharing a child. Each head calculates: \(X_i = \sigma((Q_iK_i^T \circ Mask_i)/\sqrt{h})V_i\)
- Design Motivation: Global attention in standard Transformers treats all node pairs equivalently, but in a computational graph, the importance of adjacent, parent-child, and sibling relationships varies significantly.
-
Bidirectional Graph Isomorphism FFN (BGIFFN)
- Function: Injects graph structural information into the feed-forward network.
- Mechanism: \(H_g = \text{Concat}(\text{GC}(H, A), \text{GC}(H, A^T))\), which concatenates graph convolutions in both forward and backward directions, and then fuses them with the original features: \(\text{BGIFFN}(H, A) = \text{ReLU}(HW_1 + H_g)W_2\). No positional encodings are required.
- Design Motivation: Standard FFNs process each node independently. Injecting graph convolution allows the FFN to be topographically aware.
-
Positional Encoding-Free Design
- Function: BGIFFN naturally provides positional information.
- Mechanism: Since BGIFFN performs forward and backward graph convolutions at each layer, node positional information is implicitly encoded through multi-layer propagation.
- Design Motivation: Nodes in a graph do not have a fixed order (unlike sequences), making traditional positional encodings inapplicable.
Loss & Training¶
L1 loss (for accuracy prediction) or MAPE loss (for latency prediction). AdamW optimizer.
Key Experimental Results¶
Main Results¶
| Setting | NAR-Former V2 | NN-Former | Gain |
|---|---|---|---|
| NAS-101 (0.02%) | 0.663 | 0.709 | +6.9% |
| NAS-101 (1%) | 0.861 | 0.877 | +1.9% |
| NAS-201 (1%) | 0.752 | 0.804 | +6.9% |
| NAS-201 (10%) | 0.888 | 0.890 | +0.2% |
| Latency-EfficientNet | 13.20 | 4.81 | -63.6% |
| Latency-MnasNet | 7.16 | 2.54 | -64.5% |
Ablation Study¶
| Configuration | NAS-101 (0.1%) Tau | Description |
|---|---|---|
| w/o Siblings (Adjacency only) | ~0.77 | Information loss |
| w/o BGIFFN | ~0.78 | FFN lacks graph awareness |
| Cross-attention alternative | 0.804 | Suboptimal |
| Full NN-Former | 0.809 | Optimal |
Key Findings¶
- The performance gain is most significant under low-data regimes (0.02% scale) (+6.9%), showing that sibling information is vital for data efficiency.
- The improvements in latency prediction are the most remarkable (48-64% MAPE reduction), as latency is directly influenced by parallel structures, and sibling relationships precisely encode parallelism.
- The positional encoding-free design is an engineering advantage—simplifying the pipeline and naturally adapting to graphs of varying sizes.
Highlights & Insights¶
- Discovery of Sibling Nodes: The topological relationships extracted by \(AA^T\) and \(A^TA\) were overlooked by all prior work in NAS—simple yet highly impactful.
- 4-Head Topological Attention: Designing each head to focus on one specific structural relationship is more efficient and effective than global attention.
- Practical Value of Latency Prediction: While accuracy prediction is research-oriented, latency prediction is deployment-oriented. The massive advantage of NN-Former in latency makes it highly valuable for real-world model deployment.
Limitations & Future Work¶
- Sibling calculation requires \(O(n^2)\) matrix multiplications (\(AA^T\), \(A^TA\)), which might become a bottleneck for exceptionally large graphs.
- Evaluated primarily on cell-based architectures; custom architectures (such as MoE, branching networks) have not been extensively verified.
- The design of 4 heads (each mapping to one topology) is hardcoded—more heads or different topological combinations have not been explored.
Related Work & Insights¶
- vs NAR-Former V2: A pure Transformer approach that does not differentiate between topological types. NN-Former explicitly encodes four relationships via ASMA.
- vs GATES/GMAE-NAS: GNN approaches propagate adjacency information only. NN-Former bridges the blind spot of GNNs through sibling relationships.
Rating¶
- Novelty: ⭐⭐⭐⭐ Sibling node discovery and 4-head topological attention are innovative.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ NAS-101/201 accuracy + NNLQ latency + comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear.
- Value: ⭐⭐⭐⭐ Offers direct improvement value for NAS predictors.