On the Expressive Power of GNNs for Boolean Satisfiability¶
Conference: ICLR 2026 arXiv: 2602.08745 Code: GitHub Area: Graph Learning Keywords: GNN expressiveness, Boolean satisfiability, Weisfeiler-Leman test, SAT solving, theoretical analysis
TL;DR¶
This paper rigorously proves, from the perspective of the Weisfeiler-Leman (WL) test, that the complete WL hierarchy cannot distinguish satisfiable from unsatisfiable 3-SAT instances, revealing the theoretical expressiveness limits of GNNs for SAT solving. It also identifies positive instance families—such as planar SAT and random SAT—where GNNs can successfully distinguish satisfiability.
Background & Motivation¶
Boolean satisfiability (SAT) is a classic NP-complete problem. In recent years, GNNs have become the dominant architecture for learning-based SAT solvers (e.g., NeuroSAT, QuerySAT), since CNF formulas are naturally represented as literal-clause bipartite graphs (LCG). However, GNN expressiveness is bounded by the WL test—graphs that are WL-equivalent must receive identical outputs.
Core Problem: Do GNNs possess sufficient expressive power to reason about satisfiability?
Prior work lacks a systematic analysis of GNN expressiveness on SAT tasks. Although one might intuit that "SAT is NP-hard, therefore GNNs must be insufficient," expressive power and computational complexity are orthogonal concepts—for instance, planar SAT is also NP-complete, yet 4-WL can identify all planar graphs.
Method¶
Overall Architecture¶
This is a theory-driven work. The core contributions are a series of theorems and constructions concerning the distinguishability of SAT formulas by the WL hierarchy, supplemented by experimental validation. The research pipeline proceeds as follows:
- Define a graph representation for SAT formulas (LCN: Literal-Clause graph with Negation connections)
- Construct families of WL-indistinguishable SAT instances
- Analyze the distinguishability of special SAT families (regular, planar, random)
- Validate WL expressiveness on random and competition instances
Key Designs¶
Graph Representation: LCN (Literal-Clause Graph with Negation Connections)¶
The standard LCG is a bipartite graph between literals and clauses. This paper emphasizes that edges between each literal and its negation must be added (yielding LCN), because: - Removing node labels (to preserve permutation invariance) causes information loss in LCG alone - The LCN representation is sufficient: an unlabeled LCN uniquely determines a SAT formula up to isomorphism
Theorem 5.3 (Main Theorem): \(k\)-WL-Indistinguishable SAT Instances¶
Construction: Based on the classical Cai-Fürer-Immerman (1992) construction.
Given a graph \(G\), a formula \(f_G\) is constructed to encode "whether \(G\) admits an even orientation of edges" (an orientation where every vertex has even out-degree). The answer depends on the parity of the number of edges. A "twisted" formula \(\tilde{f}_G\) is then constructed by reversing one edge to bidirectional. Consequently, exactly one of \(f_G\) and \(\tilde{f}_G\) is satisfiable and the other is not, yet their LCNs are indistinguishable under \(n\)-WL.
This construction is closely related to Tseitin formulas (hard instances for resolution)—both involve global inconsistency that cannot be detected by local reasoning.
Implications for Sequential Solvers (Lemma 5.4)¶
Many GNN-based SAT solvers (e.g., QuerySAT) operate by setting variables one at a time. Lemma 5.4 shows that if two formulas are \(k\)-WL-indistinguishable, the residual formulas remain indistinguishable even after \(\lfloor k/2 \rfloor - 1\) variables have been assigned.
Corollary 5.5: A WL-powerful GNN remains unable to distinguish satisfiable from unsatisfiable residual formulas even after setting \(\Theta(n)\) variables—with significant implications for learning restart strategies.
3-Regular SAT: An NP-Complete Problem Where WL Is Entirely Useless¶
Define \(k\)-regular SAT: each literal appears in exactly \(k\) clauses, and each clause contains exactly \(k\) literals.
- Theorem 5.1: 3-regular SAT is NP-complete
- Observation 5.2: The WL test cannot distinguish any two 3-regular SAT formulas with the same number of variables
Positive Results: Distinguishable SAT Families¶
Planar SAT (Theorem 6.1): 4-WL can distinguish all planar SAT instances, since planar graphs are completely identified by 4-WL.
Random SAT (Theorem 6.3): CNF formulas sampled from a uniformly random literal-incidence graph are identified by WL with probability at least \(1 - n^{-1/7}\).
Loss & Training¶
This is a purely theoretical work with no model training. The experimental section validates WL expressiveness by: - Running \(r\) rounds of WL on a satisfiable formula and imposing equality constraints on literals within the same equivalence class - Checking whether the constrained formula \(f_r\) remains satisfiable - Defining \(r_{\text{crit}}\) as the minimum number of rounds for which \(f_r\) is satisfiable
Key Experimental Results¶
Main Results: Random Instances (G4SAT Benchmark)¶
| Family | Difficulty | \(r_{\text{crit}}\) | \(r_{\text{converge}}\) | Variables | Count |
|---|---|---|---|---|---|
| 3-SAT | easy | 2.97±0.18 | 3.68±0.47 | 26±9 | 1000 |
| 3-SAT | medium | 3.00±0.04 | 3.92±0.28 | 119±47 | 1000 |
| 3-SAT | hard++ | 3.08±0.28 | 4.00±0.00 | 5001±62 | 25 |
| k-clique | easy | 4.12±0.73 | 6.26±0.83 | 33±13 | 960 |
Random instances typically require only 3–4 WL rounds to distinguish literals; approximately 40% of formulas have all literals uniquely identified upon convergence.
SAT Competition Instances¶
| Family | \(r_{\text{crit}}\) | \(r_{\text{converge}}\) | Variables | Count |
|---|---|---|---|---|
| argumentation | 2.94±0.44 | 4.31±0.87 | 1266±625 | 16 |
| circuit-multiplier | unsat | 7.18±0.40 | 1075±50 | 11 |
| cryptography | 15.74±14.67 | 17.63±14.34 | 41510±29705 | 19 |
| heule-nol | unsat | 8.60±1.26 | 1419±0 | 10 |
| maxsat-optimum | 26.64±3.64 | 29.27±2.49 | 22157±5623 | 11 |
Of 448 competition instances, only 234 are solvable by WL; among 69 families, 38 contain instances that WL cannot distinguish.
Ablation Study¶
WL Convergence Pattern Analysis for 3-SAT: - Round 1: Each literal observes its own degree - Round 2: No further refinement (all neighboring clauses have the same degree = 3) - Round 3: Literals observe the degrees of other literals in shared clauses, enabling effective distinction - Round 4: WL typically converges
Key Findings¶
- Large gap between random and industrial instances: Random instances pose virtually no challenge to WL, whereas industrial instances frequently exceed WL expressiveness
- Regular structure is fatal: The heule-nol family encodes grid coloring problems; its regular structure renders WL entirely ineffective
- Connection to resolution complexity: The indistinguishable instances constructed in this paper share structural similarities with Tseitin formulas (hard instances for resolution)—local reasoning cannot detect global inconsistency
- The k-clique and k-vertex-cover families contain small fractions of WL-unsolvable instances (2.0% and 0.3%, respectively), arising from symmetries in the underlying graphs
Highlights & Insights¶
- Landmark theoretical result: The first proof of the expressiveness limits of the complete WL hierarchy for SAT, clarifying the common misconception that "NP-hardness implies WL-indistinguishability" (counterexample: Theorem 6.1)
- Tseitin–CFI connection: The paper uncovers a deep structural link between the Cai-Fürer-Immerman construction and Tseitin formulas, establishing a new bridge between the WL test and proof complexity
- Clear practical guidance: GNNs are well-suited for random SAT, but solving industrial SAT requires architectures that go beyond WL (e.g., symmetry breaking, higher-order GNNs)
Limitations & Future Work¶
- Theoretical results are based on worst-case constructions; sparse or structured instances may exhibit better behavior in practice
- Experiments measure only whether WL expressiveness is sufficient, without assessing the ability to generalize from data
- The paper does not investigate the concrete effectiveness of architectures beyond WL (e.g., \(k\)-GNNs, random feature augmentation) on SAT tasks
- Competition instance experiments are constrained by a 10 MB size limit; behavior on larger-scale instances remains unknown
Related Work & Insights¶
- NeuroSAT / QuerySAT / SATformer: End-to-end learning-based SAT solvers, all subject to the theoretical limitations established in this paper
- Cai-Fürer-Immerman (1992): This paper generalizes their classical indistinguishable graph construction to the SAT domain
- GNN expressiveness literature (GIN, \(k\)-GNN, higher-order WL): This paper provides concrete positive and negative results for these frameworks in the context of SAT
- The findings carry important implications for future learning-based SAT solver design: expressiveness-enhancement strategies tailored to the structural properties of industrial instances are needed
Rating¶
- Novelty: ★★★★★ — First systematic theoretical analysis of GNN expressiveness for SAT
- Technical Depth: ★★★★★ — Rigorous proofs, elegant constructions, tight connections to algebra and complexity theory
- Experimental Thoroughness: ★★★★☆ — Comprehensive coverage of random and competition instances, validating theoretical predictions
- Writing Quality: ★★★★☆ — Clear structure with effective integration of theory and practice