Stealthy Yet Effective: Distribution-Preserving Backdoor Attacks on Graph Classification¶
Conference: NeurIPS 2025 arXiv: 2509.26032 Code: Available Area: AI Safety / Graph Learning Security Keywords: backdoor attack, graph classification, distribution preservation, adversarial training, GNN security
TL;DR¶
This paper proposes DPSBA, a clean-label backdoor attack framework for graph classification that generates in-distribution trigger subgraphs via adversarial training while suppressing both structural and semantic anomalies, achieving high attack success rates with significantly improved stealthiness.
Background & Motivation¶
GNNs demonstrate strong performance on graph classification tasks but are vulnerable to backdoor attacks. Existing graph-level backdoor attack methods face two core issues:
Structural Deviation: Existing methods (e.g., ER-B, GTA, Motif) inject rare or unnatural subgraphs as triggers that deviate from the structural distribution of clean graphs. Experiments show that backdoored graphs generated by these methods exhibit a pronounced shift in anomaly score distributions under anomaly detection models (e.g., SIGNET) compared to clean graphs. For instance, on the AIDS dataset, Motif triggers achieve an AUC of 99.71%, meaning they are nearly perfectly detected.
Semantic Deviation: Traditional methods employ label flipping to reassign backdoored graphs to the target class, causing inconsistency between the graph's true semantics and its label. Although the clean-label setting mitigates this issue, it leads to a significant drop in attack success rate (ASR).
Core Challenge: How to design a graph-level backdoor attack that preserves the distributional properties of clean samples and avoids label manipulation, while simultaneously maintaining attack effectiveness and stealthiness?
Method¶
Overall Architecture¶
DPSBA consists of two stages:

- Poisoned sample construction: hard sample selection → injection location identification → trigger subgraph generation
- Trigger optimization: joint optimization of attack effectiveness and distributional stealthiness via adversarial training
Key Designs¶
- Hard Sample Selection
Function: Selects the samples with the highest model uncertainty from the target class for poisoning.
Mechanism: A surrogate model computes the confidence of each graph towards the target label \(\text{cfd}(G) = \text{softmax}(f_\theta(G))_{y_t}\), and the bottom \(p\%\) samples by confidence are selected.
Design Motivation: Low-confidence samples lie near the decision boundary and are more susceptible to prediction changes under small perturbations, making them ideal candidates for clean-label attacks. This avoids the semantic inconsistency introduced by label flipping.
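A minimal sketch of this selection step, assuming a trained surrogate classifier that returns class logits for a single graph; the function name, the graph representation, and the pre-filtering to target-class graphs are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def select_hard_samples(target_class_graphs, surrogate, target_class, poison_rate=0.1):
    """Pick the target-class graphs whose confidence toward y_t is lowest,
    i.e. the graphs sitting closest to the decision boundary.

    target_class_graphs : list of graphs already filtered to the target class
    surrogate           : callable returning class logits for one graph
    target_class        : index of the attacker's target label y_t
    poison_rate         : fraction p of target-class graphs to poison
    """
    scores = []
    with torch.no_grad():
        for idx, g in enumerate(target_class_graphs):
            logits = surrogate(g)                            # shape [num_classes]
            conf = F.softmax(logits, dim=-1)[target_class]   # cfd(G)
            scores.append((conf.item(), idx))

    scores.sort()                                  # ascending confidence
    k = max(1, int(poison_rate * len(scores)))
    return [idx for _, idx in scores[:k]]          # bottom-p% indices to poison
```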
- Trigger Location Selection
Function: Identifies which nodes in the graph should receive the trigger subgraph.
Mechanism: A two-stage filtering approach — first, the \(2M\) nodes with the highest degree centrality are selected as candidates; then the surrogate model estimates each candidate's influence by measuring the output change when a probe perturbation is injected at that node, \(S(v) = |f_\theta(G + \Delta_v) - f_\theta(G)|\), and the \(M\) most influential nodes are retained.
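A sketch of the two-stage filtering under the same assumptions; `probe_fn`, which returns a copy of the graph with a small probe perturbation \(\Delta_v\) attached at node \(v\), is a hypothetical helper:

```python
import torch

def select_injection_nodes(graph, surrogate, degrees, M, probe_fn):
    """Two-stage trigger-location selection (illustrative sketch).

    Stage 1: keep the 2M highest-degree nodes as candidates.
    Stage 2: for each candidate v, measure how much a probe at v moves
             the surrogate output: S(v) = |f(G + Delta_v) - f(G)|,
             and keep the M most influential nodes.

    degrees  : 1-D tensor of node degrees for `graph`
    probe_fn : hypothetical helper returning `graph` with a probe at node v
    """
    candidates = torch.topk(degrees, k=min(2 * M, degrees.numel())).indices

    with torch.no_grad():
        base = surrogate(graph)                      # logits on the clean graph
        influence = []
        for v in candidates.tolist():
            perturbed = probe_fn(graph, v)           # G + Delta_v
            s_v = (surrogate(perturbed) - base).abs().sum().item()
            influence.append((s_v, v))

    influence.sort(reverse=True)                     # most influential first
    return [v for _, v in influence[:M]]
```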
- Trigger Generation and Injection (Topology-Feature Generator)
Function: Uses learnable networks to generate the topology and node features of the trigger subgraph.
Mechanism:
- Topology generator: an MLP maps the adjacency matrix of the injection region to a learnable soft structure, which is then thresholded into a binary adjacency matrix: \(A_{binary} = \mathbb{I}(A > 0.5)\)
- Feature generator: an MLP generates trigger node features from the original features at the injection location, \(X' = \sigma(W_2 X + b_2)\), so that the generated features stay aligned with the data distribution
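A sketch of the generator pair in PyTorch, assuming a fixed-size injection region with a dense adjacency matrix; the hidden width and the straight-through trick are my assumptions where the paper only specifies the binarization \(\mathbb{I}(A > 0.5)\):

```python
import torch
import torch.nn as nn

class TriggerGenerator(nn.Module):
    """Topology / feature generator pair (illustrative shapes).

    The topology MLP maps the flattened adjacency of the injection region to
    soft edge scores, thresholded at 0.5; the feature MLP maps the original
    node features at the injection location to trigger features,
    X' = sigma(W2 X + b2).
    """
    def __init__(self, num_nodes, feat_dim, hidden=64):
        super().__init__()
        self.topo_mlp = nn.Sequential(
            nn.Linear(num_nodes * num_nodes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_nodes * num_nodes),
            nn.Sigmoid(),
        )
        self.feat_mlp = nn.Sequential(     # X' = sigma(W2 X + b2)
            nn.Linear(feat_dim, feat_dim),
            nn.Sigmoid(),
        )

    def forward(self, adj_region, x_region):
        n = adj_region.size(0)
        soft_adj = self.topo_mlp(adj_region.flatten()).view(n, n)
        # Straight-through binarization (assumed): forward pass uses
        # I(A > 0.5), backward pass keeps the soft gradient.
        hard_adj = (soft_adj > 0.5).float()
        adj_binary = hard_adj + soft_adj - soft_adj.detach()
        x_trigger = self.feat_mlp(x_region)
        return adj_binary, x_trigger
```

The straight-through pass-through is one common way to keep the attack and adversarial losses differentiable through the hard threshold; the paper may use a different relaxation.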
- Adversarial Anomaly Minimization
Function: Employs adversarial training to make poisoned graphs statistically indistinguishable from clean graphs under anomaly detection.
Mechanism: Two discriminators are introduced — a topology discriminator \(D_{\theta_t}\) (implemented via GCN) and a feature discriminator \(D_{\theta_f}\) (implemented via MLP). The generator and discriminators engage in a minimax game:
\(\min_{\omega_t} \max_{\theta_t} \mathcal{L}_d^{(t)} = \sum_{G \sim \mathcal{G}_c} \log D_{\theta_t}(G) + \sum_{G \sim \mathcal{G}_b} \log(1 - D_{\theta_t}(G_{g_t}(\omega_t)))\)
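A sketch of the topology-discriminator term, with `disc` assumed to map a graph to a scalar probability in \((0, 1)\) (a GCN with a sigmoid readout in the paper's setup); the feature discriminator's term \(\mathcal{L}_d^{(f)}\) has the same form:

```python
import torch

def discriminator_term(disc, clean_graphs, poisoned_graphs):
    """L_d^(t): sum of log D(G) over clean graphs plus
    log(1 - D(G_gt(omega_t))) over poisoned graphs (sketch)."""
    eps = 1e-8
    real = torch.stack([disc(g) for g in clean_graphs])      # D(G), G from G_c
    fake = torch.stack([disc(g) for g in poisoned_graphs])   # D(G_gt(omega_t))
    return torch.log(real + eps).sum() + torch.log(1.0 - fake + eps).sum()

# Adversarial round (hypothetical optimizers): the discriminator ascends on
# this term (max over theta_t) while the trigger generator descends on it
# (min over omega_t), jointly with the attack loss described below.
```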
Loss & Training¶
The attack loss and adversarial anomaly loss are jointly optimized:
- Attack loss: \(\mathcal{L}_{atk} = -\log f_{\theta^*}(G_{g_t})_{y_t}\)
- Topology optimization: \(\min_{\omega_t} \sum \mathcal{L}_{atk} + \alpha \mathcal{L}_d^{(t)}\)
- Feature optimization: \(\min_{\omega_f} \sum \mathcal{L}_{atk} + \beta \mathcal{L}_d^{(f)}\)
A staged adversarial training procedure is adopted: the topology generator and discriminator are optimized first, followed by the feature generator and discriminator. The surrogate model is fine-tuned after each stage to maintain reliable gradient signals. Hyperparameters \(\alpha\) and \(\beta\) balance stealthiness against attack effectiveness.
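A compact sketch of the objectives and the staged schedule; the adversarial terms \(\mathcal{L}_d^{(t)}\) and \(\mathcal{L}_d^{(f)}\) are assumed to be computed as in the discriminator sketch above, and all names and the step granularity are illustrative:

```python
import torch

def attack_loss(surrogate, poisoned_graphs, target_class):
    """L_atk = -sum_G log f_theta*(G_gt)_{y_t} over the poisoned graphs."""
    loss = torch.zeros(())
    for g in poisoned_graphs:
        log_probs = torch.log_softmax(surrogate(g), dim=-1)
        loss = loss - log_probs[target_class]
    return loss

def topology_objective(l_atk, l_d_t, alpha):
    """Topology-stage generator objective: min_{omega_t} L_atk + alpha * L_d^(t)."""
    return l_atk + alpha * l_d_t

def feature_objective(l_atk, l_d_f, beta):
    """Feature-stage generator objective: min_{omega_f} L_atk + beta * L_d^(f)."""
    return l_atk + beta * l_d_f

# Staged schedule (illustrative):
#   repeat:
#     1) update the topology discriminator (maximize L_d^(t)),
#        then the topology generator (minimize topology_objective)
#     2) update the feature discriminator (maximize L_d^(f)),
#        then the feature generator (minimize feature_objective)
#     3) fine-tune the surrogate on current clean + poisoned data so that
#        the L_atk gradients remain a reliable signal
```

Raising \(\alpha\) or \(\beta\) pushes the trigger toward the clean distribution (better stealth) at some cost in ASR, which is the trade-off the ablation below quantifies.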
Key Experimental Results¶
Main Results¶
Experiments are conducted on 4 TUDataset benchmarks with 3 graph classifiers (GCN, GIN, SAGPool), compared against 5 baseline attack methods.
| Dataset | Model | Method | ASR(%) | CAD(%) | AUC(%) |
|---|---|---|---|---|---|
| AIDS | GCN | GTA | 93.21 | 5.14 | 99.34 |
| AIDS | GCN | DPSBA | 94.76 | 2.38 | 72.65 |
| AIDS | GIN | Motif | 97.75 | 2.28 | 99.71 |
| AIDS | GIN | DPSBA | 95.87 | 1.94 | 73.66 |
| FRANKENSTEIN | GCN | GTA | 99.35 | 2.74 | 100.00 |
| FRANKENSTEIN | GCN | DPSBA | 98.37 | 1.01 | 68.96 |
| ENZYMES | GCN | GTA | 95.33 | 3.00 | 71.20 |
| ENZYMES | GCN | DPSBA | 96.67 | -0.67 | 66.11 |
Ablation Study¶
| Configuration | ASR | AUC | Note |
|---|---|---|---|
| Full DPSBA | 94.76% | 72.65% | Best trade-off |
| w/o topology discriminator | ~95% | ~85% | Increased structural anomaly |
| w/o feature discriminator | ~94% | ~80% | Increased feature deviation |
| Dirty-label setting | ASR↑ | AUC↑ | More detectable |
| Non-adversarial training | ASR↑ | AUC↑↑ | Stealthiness significantly degraded |
Key Findings¶
- DPSBA reduces the anomaly detection AUC by 20–30 percentage points across all datasets while maintaining or improving ASR.
- On the AIDS dataset, Motif achieves an AUC of 99.71% (nearly perfectly detected), whereas DPSBA achieves only 73.66%.
- Under the clean-label setting, DPSBA consistently yields lower CAD (clean accuracy drop) than competing methods, indicating less interference with normal samples.
Highlights & Insights¶
- First systematic analysis of structural and semantic deviation in graph-level backdoor attacks, accompanied by a theoretical lower bound linking total variation distance to optimal detection AUC (a generic bound of this form is sketched after this list).
- The combination of clean-label + adversarial training is elegant: labels remain unmodified (avoiding semantic deviation), while discriminators constrain the trigger distribution (suppressing structural deviation).
- The threat model is well-grounded: black-box knowledge combined with limited poisoning capability closely reflects real-world scenarios.
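The paper's bound is not restated in this note; as a reference point, a standard relation of this kind between the total variation distance of the clean and poisoned graph distributions and the best achievable detector AUC is (my sketch; the paper's exact constants and conditions may differ) \(\mathrm{AUC}^{\star} \ge \frac{1}{2} + \frac{1}{2}\,\mathrm{TV}\big(P_{\text{clean}},\, P_{\text{poisoned}}\big)\). Any nonzero distribution gap therefore guarantees a detector that performs above chance, which is exactly why DPSBA has to drive the poisoned distribution toward the clean one rather than only evade a specific detector.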
Limitations & Future Work¶
- The trigger size is fixed at 4 nodes, which may lack flexibility for large-scale graphs.
- Only SIGNET is considered as an anomaly detection baseline for evaluation.
- In the clean-label setting, ASR on certain datasets (e.g., Motif-S) still has room for improvement.
- Adaptive defenses — where the defender is aware of the attack method — are not considered.
Related Work & Insights¶
- A key distinction from node-level distribution-preserving attacks [Zhang et al.]: in graph classification, the trigger must propagate through global message passing and pooling to influence the whole-graph representation, making distribution preservation significantly more challenging.
- The adversarial training paradigm can be extended to other security domains, such as maintaining sample distribution during adversarial robustness training.
- This work also serves as a warning that existing graph classification backdoor defenses relying on anomaly detection may not be sufficiently reliable.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The combination of clean-label and distribution preservation is proposed for the first time in graph backdoor attacks.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 4 datasets × 3 classifiers × 5 baselines with comprehensive ablation.
- Writing Quality: ⭐⭐⭐⭐ — Problem analysis is clear and motivation is well articulated.
- Value: ⭐⭐⭐⭐ — Carries significant cautionary implications for GNN security research.