PolyGraph Discrepancy: a classifier-based metric for graph generation¶
Conference: ICLR 2026
arXiv: 2510.06122
Code: PolyGraph Open Source Library (mentioned in text, to be publicly released)
Area: Graph Generative Model Evaluation
Keywords: Graph Generation, Jensen-Shannon Distance, Classifier-based Evaluation, MMD, TabPFN
TL;DR¶
Ours proposes PolyGraph Discrepancy (PGD), which approximates the variational lower bound of the Jensen-Shannon distance by training a classifier to distinguish between real and generated graphs. This approach solves three core problems: the lack of an absolute scale in MMD metrics, incomparability between different descriptors, and high bias/variance in small sample sizes.
Background & Motivation¶
Graph generative models are increasingly important in fields such as drug discovery, social network modeling, and materials science, yet progress is constrained by inadequate evaluation methods. Current evaluation relies primarily on Maximum Mean Discrepancy (MMD) based on graph descriptors, which suffers from three fundamental flaws:
Lack of an absolute scale: The range of MMD is \([0, \infty)\), depending on the kernel function and feature scaling. Whether an MMD = 0.05 is good or poor is impossible to judge. Under a linear kernel, multiplying input features by an arbitrary scalar scales the MMD proportionally.
Incomparability between descriptors: MMD values calculated using degree histograms versus orbit counts are not on the same scale, making it impossible to determine which descriptor better distinguishes real from generated graphs.
Severe small-sample issues: Current mainstream benchmarks (Planar, SBM, Lobster) contain only 20-40 test graphs. At this scale, MMD estimates exhibit extreme bias and variance, leading to unreliable model rankings.
| Property | MMD | PGD |
|---|---|---|
| Range | \([0, \infty)\) | \([0, 1]\) |
| Absolute Scale | ✗ | ✓ |
| Comparable Descriptors | ✗ | ✓ |
| Multi-descriptor Aggregation | ✗ | ✓ |
| Unified Ranking | ✗ | ✓ |
Method¶
Overall Architecture¶
PGD reformulates "evaluating graph generation quality" as a classification problem: training a classifier to distinguish real graphs from generated ones. If the classifier distinguishes them easily, generation quality is poor; if it struggles, quality is good. Crucially, the classifier's log-likelihood on a test set is the variational lower bound of the Jensen-Shannon divergence—it naturally falls in the \([0,1]\) interval and holds for any classifier \(D\). The tighter the fit, the closer the bound. The variational form is: $\(D_{\text{JS}}(P \| Q) = \sup_{D:\mathcal{X}\to[0,1]} \frac{1}{2}\mathbb{E}_{x\sim P}[\log_2 D(x)] + \frac{1}{2}\mathbb{E}_{x\sim Q}[\log_2(1-D(x))] + 1\)$ The workflow is a direct chain: first, use a set of descriptors to vectorize graphs; for each descriptor, split real and generated graphs into fit/test halves; fit a TabPFN classifier on the fit half; compute the JS divergence lower bound using log-likelihood on the test half; finally, take the tightest (maximum) lower bound across all descriptors as the final PGD score.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
G["Ref Graphs P_ref + Gen Graphs Q_gen"] --> D["Descriptor System<br/>General + Molecular Layer Multi-view Vectorization"]
D --> S["fit/test Split<br/>50:50 independent test to prevent overestimation"]
S --> C["TabPFN Classifier<br/>test log-likelihood = JS divergence lower bound<br/>Truncation → Root → Single Descriptor JS Distance"]
C --> M["Aggregation<br/>4-fold CV selection + max aggregation"]
M --> O["PGD Score ∈ [0,1]<br/>Lower is better quality"]
Key Designs¶
1. Descriptor System: General + Domain-Specific dual layers to vectorize graphs
Classifiers require vectors. The first step determines which perspectives characterize a graph—the more comprehensive the descriptors, the closer the JS distance estimate is to the true divergence. Ours designs two layers: the general layer includes degree histograms, clustering coefficient histograms, Laplacian spectra, 4/5-node orbit counts, and GIN embeddings; the molecular layer adds topological indices (BertzCT, Kappa, etc.), physicochemical parameters (Lipinski rules), Morgan fingerprints, and ChemNet/MolCLR learned representations. General descriptors ensure applicability to any graph, while molecular descriptors extend PGD to drug discovery, making the metric sensitive to domain-specific structures.
2. fit/test Split: Using independent test sets for unbiased JS lower bound estimation
Given a reference set \(P_{\text{ref}}\), a generated set \(Q_{\text{gen}}\), and a descriptor, training and evaluating a classifier on the same set causes overfitting and overestimation of the difference. PGD splits both sets (50:50): the fit set is used only to train the classifier, and the test set is used only to calculate log-likelihood. An independent test set ensures the log-likelihood is an unbiased lower bound estimate—easier classification translates to a larger lower bound, a larger JS distance, and a poorer generative model.
3. TabPFN Classifier and JS Bound: Table-based Foundation Model for the tightest estimate
The lower bound holds for any classifier, but to make it tight without hyperparameter tuning, ours selects TabPFN. It meets three requirements: it outputs class probabilities (required for log-likelihood), trains quickly, and has no hyperparameters (Transformer-based Bayesian in-context inference). This excludes deep networks (slow/tuning-intensive) and SVMs or Decision Trees (hard labels/no probabilities). After obtaining the test log-likelihood, it is truncated to \(\max(\cdot, 0)\) to ensure non-negativity, followed by a square root to obtain the JS distance (since JS distance is a metric, while JS divergence is not). Experiments show TabPFN consistently provides tighter bounds than logistic regression due to its ability to model non-linear decision boundaries.
4. Multi-descriptor Max Aggregation: Data-driven selection of the most discriminative perspective
The most sensitive descriptor (degree, orbit, spectra, etc.) varies by dataset and perturbation type; no single descriptor is universally superior. When using \(K\) descriptors, PGD performs 4-fold cross-validation on the fit set for each descriptor, selects the descriptor \(d^\star\) with the highest average validation metric, then trains on the full fit set and evaluates on the test set using \(d^\star\). Since each descriptor provides a lower bound, taking the maximum yields the tightest bound, representing the perspective that best distinguishes real from generated graphs.
Loss & Training¶
PGD is an evaluation metric, not a generative model. "Training" refers to a single fit of the TabPFN classifier—as TabPFN is an in-context learner, this is an inference pass rather than gradient iteration. Key implementation details: 50:50 fit/test split, 4-fold CV for descriptor selection, JS bound truncation at \(\max(\cdot, 0)\), and square root transformation to convert divergence to distance.
Key Experimental Results¶
Main Results¶
MMD Bias and Variance Issues
At common scales of 20-40 test graphs: - Biased MMD estimates are severely affected by bias (magnitude differences). - Unbiased MMD estimates still exhibit variance high enough to make model comparisons unreliable. - Ours recommends using larger datasets (SBM-L, Planar-L, Lobster-L, each with 4096 samples).
Correlation between PGD and Model Quality
| Metric | Planar-L Correlation | SBM-L | Lobster-L |
|---|---|---|---|
| PGD | 99.52 | 88.07 | 89.32 |
| Orbit RBF | 73.49 | 51.05 | -34.81 |
| Degree RBF | 70.79 | 15.77 | -33.40 |
| Spectral RBF | 73.34 | 36.76 | -22.79 |
| Clustering RBF | 71.48 | 83.97 | 87.05 |
| GIN RBF | 82.78 | 14.12 | -30.31 |
The Pearson correlation between PGD and validity reaches 99.52% (Planar-L), significantly outperforming all MMD variants.
Tracking Training Dynamics
| Metric | Planar-L Spearman | SBM-L | Lobster-L |
|---|---|---|---|
| Validity | 92.31 | 83.64 | 85.47 |
| PGD | 93.71 | 62.73 | 78.19 |
| Orbit RBF | 86.71 | 20.00 | -8.09 |
| Degree RBF | 41.96 | -19.09 | -4.66 |
PGD improves monotonically with training iterations, whereas MMD metrics exhibit instability or even negative correlation.
Model Benchmarking (PGD × 100, lower is better)
| Dataset | AutoGraph | DiGress | ESGG | GRAN |
|---|---|---|---|---|
| Planar-L | 34.0 | 45.2 | 54.2 | 65.1 |
| SBM-L | 30.3 | 23.7 | 35.1 | 50.1 |
| Lobster-L | 16.1 | 18.8 | 51.1 | 56.2 |
| Proteins | 72.5 | 68.2 | — | 73.3 |
Ablation Study¶
| Configuration | Metric Effect | Description |
|---|---|---|
| TabPFN vs LogReg | TabPFN consistently tighter | Advantage of non-linear decision boundaries |
| PGD-JS vs PGD-TV | PGD-JS is more robust | TV binarization threshold introduces noise |
| Sample size < 256 | High variance | PGD stabilizes at 256+ samples |
| Sample size > 256 | Stable | Mean and variance converge |
| Single vs max-aggregation | Aggregation is more robust | No single descriptor is optimal across all scenarios |
Key Findings¶
- No single descriptor is universal: The most discriminative descriptor varies by dataset and perturbation type, justifying the necessity of a multi-descriptor, data-driven selection strategy.
- Monotonic response to perturbations: PGD increases monotonically with the degree of perturbation (edge removal/addition/rewiring/swapping), with Spearman correlation comparable to MMD.
- Unique value of edge-swapping perturbation: Ours proposes a new perturbation (degree-preserving edge swapping) that MMD (degree and GIN-based) fails to detect, while PGD remains robust through its multi-descriptor strategy.
- Proteins dataset challenge: PGD values are highest (~70), indicating that modeling protein graphs is the most difficult task.
- 5-node orbit is a strong descriptor: In model benchmarks, 5-node orbit counts frequently yield the highest PGD, identifying them as a previously underutilized descriptor.
Highlights & Insights¶
- Theoretical Elegance: Repurposes the adversarial training concept from GANs for evaluation—the classifier's log-likelihood provides a variational lower bound for JS distance.
- High Practical Value: Absolute scale, descriptor comparability, and multi-descriptor aggregation solve three core pain points in graph generation evaluation.
- Big Data Initiative: Through detailed bias/variance analysis, ours argues strongly that existing benchmarks (20-40 graphs) are insufficient for reliable evaluation and provides SBM-L/Planar-L/Lobster-L as alternatives.
- Molecular-specific Descriptors: Extends the PGD framework to molecular graph evaluation with a descriptor system based on RDKit topological/physicochemical/fingerprint/learned representations.
- Open Source Contribution: Commitment to release the PolyGraph library lowers the barrier to entry for the community.
Limitations & Future Work¶
- Descriptor Dependence and Information Loss: PGD operates on handcrafted descriptors; if the divergence of the descriptor distributions does not closely approximate the divergence of the graph distributions, PGD remains a loose lower bound.
- Max-aggregation Limitations: Taking the maximum may not fully utilize complementary information from different descriptors. Future work could explore concatenating multiple descriptors as input to TabPFN.
- Sample Size Requirements: PGD requires approximately 256+ samples for reliable estimation, which limits its use in data-scarce scenarios.
- TabPFN Feature Dimension Limits: TabPFN recommends under 500 dimensions, requiring dimensionality reduction for high-dimensional descriptors.
- Unexplored Graph Types: Validated only on undirected, unweighted, and molecular graphs; directed, weighted, temporal, and heterogeneous graphs require further verification.
Related Work & Insights¶
- MMD Framework (You et al., 2018): The direct baseline. This work provides a detailed analysis of MMD variants (Gaussian TV, RBF, biased/unbiased).
- Classifier Two-Sample Test (Lopez-Paz & Oquab, 2017): The direct theoretical predecessor, using classifier performance as a statistic for two-sample testing.
- TabPFN (Hollmann et al., 2025): A foundation model for tabular data; PGD's practical utility depends on this model's robust performance.
- Evaluation metric design requires a balance between theoretical grounding and engineering feasibility; viewing evaluation problems through an information-theoretic lens often leads to more principled solutions.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Systematic application of C2ST in the graph domain; multi-descriptor aggregation is innovative despite known theoretical foundations.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Extremely detailed: perturbation experiments, training dynamics, model benchmarks, ablations, and comparisons with various MMD variants.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear structure, rigorous theoretical derivation, and informative visualizations.
- Value: ⭐⭐⭐⭐⭐ — Likely to become a new standard for graph generation evaluation, solving core problems that have long troubled the community.