
Stealthy Yet Effective: Distribution-Preserving Backdoor Attacks on Graph Classification

Conference: NeurIPS 2025
arXiv: 2509.26032
Code: Available
Area: AI Safety / Graph Learning Security
Keywords: backdoor attack, graph classification, distribution preservation, adversarial training, GNN security

TL;DR

This paper proposes DPSBA, a clean-label backdoor attack framework for graph classification that generates in-distribution trigger subgraphs via adversarial training while suppressing both structural and semantic anomalies, achieving high attack success rates with significantly improved stealthiness.

Background & Motivation

GNNs demonstrate strong performance on graph classification tasks but are vulnerable to backdoor attacks. Existing graph-level backdoor attack methods face two core issues:

Structural Deviation: Existing methods (e.g., ER-B, GTA, Motif) inject rare or unnatural subgraphs as triggers that deviate from the structural distribution of clean graphs. Experiments show that backdoored graphs generated by these methods exhibit a pronounced shift in anomaly score distributions under anomaly detection models (e.g., SIGNET) compared to clean graphs. For instance, on the AIDS dataset, Motif triggers achieve an AUC of 99.71%, meaning they are nearly perfectly detected.

Semantic Deviation: Traditional methods employ label flipping to reassign backdoored graphs to the target class, causing inconsistency between the graph's true semantics and its label. Although the clean-label setting mitigates this issue, it leads to a significant drop in attack success rate (ASR).

Core Challenge: How to design a graph-level backdoor attack that preserves the distributional properties of clean samples and avoids label manipulation, while simultaneously maintaining attack effectiveness and stealthiness?

Method

Overall Architecture

DPSBA consists of two stages:

  • Poisoned sample construction: hard sample selection → injection location identification → trigger subgraph generation
  • Trigger optimization: jointly optimizing attack effectiveness and distributional stealthiness via adversarial training

Key Designs

  1. Hard Sample Selection

Function: Selects the samples with the highest model uncertainty from the target class for poisoning.

Mechanism: A surrogate model computes the confidence of each graph towards the target label \(\text{cfd}(G) = \text{softmax}(f_\theta(G))_{y_t}\), and the bottom \(p\%\) samples by confidence are selected.

Design Motivation: Low-confidence samples lie near the decision boundary and are more susceptible to prediction changes under small perturbations, making them ideal candidates for clean-label attacks. This avoids the semantic inconsistency introduced by label flipping.
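A minimal PyTorch-style sketch of this selection step. The `surrogate(g)` interface (returning class logits per graph), the graph objects, and the default `p` are my assumptions for illustration, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def select_hard_samples(surrogate, graphs, target_label, p=0.1):
    """Pick the bottom-p fraction of target-class graphs by surrogate
    confidence on the target label (lowest confidence = hardest)."""
    confidences = []
    for g in graphs:
        with torch.no_grad():
            logits = surrogate(g)  # assumed shape: [num_classes]
            cfd = F.softmax(logits, dim=-1)[target_label].item()
        confidences.append(cfd)
    # Sort ascending by confidence and keep the bottom p% as poisoning candidates.
    order = sorted(range(len(graphs)), key=lambda i: confidences[i])
    k = max(1, int(p * len(graphs)))
    return [graphs[i] for i in order[:k]]
```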

  2. Trigger Location Selection

Function: Identifies which nodes in the graph should receive the trigger subgraph.

Mechanism: A two-stage filtering approach — first, the \(2M\) nodes with the highest degree centrality are selected as candidates; then, an ablation study using the surrogate model evaluates each node's influence \(S(v) = |f_\theta(G + \Delta_v) - f_\theta(G)|\), and the \(M\) most influential nodes are retained.
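A sketch of the two-stage filter under the same assumed interface; `adj` is a dense adjacency tensor and `perturb` is a hypothetical helper standing in for the node-ablation \(\Delta_v\):

```python
import torch

def select_injection_nodes(surrogate, graph, adj, perturb, M=4):
    """Two-stage filtering: keep the 2M highest-degree nodes, then rank
    them by influence S(v) = |f(G + delta_v) - f(G)| under the surrogate."""
    degrees = adj.sum(dim=1)  # degree centrality proxy
    candidates = torch.topk(degrees, min(2 * M, adj.shape[0])).indices
    with torch.no_grad():
        base = surrogate(graph)
        scores = [
            (surrogate(perturb(graph, int(v))) - base).abs().sum().item()
            for v in candidates
        ]
    # Retain the M most influential candidates.
    ranked = sorted(zip(scores, candidates.tolist()), reverse=True)
    return [v for _, v in ranked[:M]]
```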

  3. Trigger Generation and Injection (Topology-Feature Generator)

Function: Uses learnable networks to generate the topology and node features of the trigger subgraph.

Mechanism:

  • Topology generator: an MLP maps the adjacency matrix of the injection region to a learnable soft structure, which is then thresholded into a binary adjacency matrix: \(A_{\text{binary}} = \mathbb{I}(A > 0.5)\)
  • Feature generator: an MLP generates trigger node features from the original features at the injection location, \(X' = \sigma(W_2 X + b_2)\), ensuring the generated features align with the data distribution
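A minimal sketch of the two generators. The module sizes are assumptions, and the straight-through trick for passing gradients past the hard threshold is my addition (the paper only states the threshold itself):

```python
import torch
import torch.nn as nn

class TopologyGenerator(nn.Module):
    """Maps the injection region's adjacency to a soft adjacency,
    then thresholds at 0.5 to obtain a binary trigger structure."""
    def __init__(self, m):  # m: number of trigger nodes
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(m * m, m * m), nn.Sigmoid())

    def forward(self, adj_region):  # adj_region: [m, m]
        soft = self.mlp(adj_region.flatten()).view_as(adj_region)
        binary = (soft > 0.5).float()
        # Straight-through estimator (assumption): forward uses the binary
        # matrix, backward flows through the soft scores.
        return soft + (binary - soft).detach()

class FeatureGenerator(nn.Module):
    """Generates trigger node features from the original features at the
    injection location: X' = sigma(W2 X + b2)."""
    def __init__(self, d):  # d: feature dimension
        super().__init__()
        self.linear = nn.Linear(d, d)

    def forward(self, x_region):  # x_region: [m, d]
        return torch.sigmoid(self.linear(x_region))
```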

  4. Adversarial Anomaly Minimization

Function: Employs adversarial training to make poisoned graphs statistically indistinguishable from clean graphs under anomaly detection.

Mechanism: Two discriminators are introduced — a topology discriminator \(D_{\theta_t}\) (implemented via GCN) and a feature discriminator \(D_{\theta_f}\) (implemented via MLP). The generator and discriminators engage in a minimax game:

\(\min_{\omega_t} \max_{\theta_t} \mathcal{L}_d^{(t)} = \sum_{G \sim \mathcal{G}_c} \log D_{\theta_t}(G) + \sum_{G \sim \mathcal{G}_b} \log(1 - D_{\theta_t}(G_{g_t}(\omega_t)))\)
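A sketch of the discriminator side of this objective, assuming \(D_{\theta_t}\) maps a graph to a probability in (0, 1); the epsilon for numerical stability is my addition:

```python
import torch

def topology_discriminator_loss(D_t, clean_graphs, poisoned_graphs):
    """L_d^(t): push D_t(clean) toward 1 and D_t(poisoned) toward 0.
    The discriminator ascends on this value; the generator descends."""
    loss = 0.0
    for g in clean_graphs:
        loss = loss + torch.log(D_t(g) + 1e-8)
    for g in poisoned_graphs:
        loss = loss + torch.log(1.0 - D_t(g) + 1e-8)
    return loss
```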

Loss & Training

The attack loss and adversarial anomaly loss are jointly optimized:

  • Attack loss: \(\mathcal{L}_{atk} = -\log f_{\theta^*}(G_{g_t})_{y_t}\)
  • Topology optimization: \(\min_{\omega_t} \sum \mathcal{L}_{atk} + \alpha \mathcal{L}_d^{(t)}\)
  • Feature optimization: \(\min_{\omega_f} \sum \mathcal{L}_{atk} + \beta \mathcal{L}_d^{(f)}\)

A staged adversarial training procedure is adopted: the topology generator and discriminator are optimized first, followed by the feature generator and discriminator. The surrogate model is fine-tuned after each stage to maintain reliable gradient signals. Hyperparameters \(\alpha\) and \(\beta\) balance stealthiness against attack effectiveness.
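A schematic of one round of this staged schedule, reusing `topology_discriminator_loss` from above. The loss/optimizer signatures, the single-step alternation, and `feat_disc_loss` are assumptions; in a full implementation `poisoned_batch` would be regenerated by the current generators each round, and the surrogate fine-tuned after each stage:

```python
def staged_adversarial_round(attack_loss, topo_disc_loss, feat_disc_loss,
                             opt_Dt, opt_Gt, opt_Df, opt_Gf,
                             clean_batch, poisoned_batch, alpha, beta):
    """One round: topology generator/discriminator pair first,
    then the feature pair. Each loss callable returns a scalar tensor."""
    # Stage 1a: topology discriminator ascends on L_d^(t).
    d_loss = -topo_disc_loss(clean_batch, poisoned_batch)
    opt_Dt.zero_grad(); d_loss.backward(); opt_Dt.step()
    # Stage 1b: topology generator descends on L_atk + alpha * L_d^(t).
    g_loss = attack_loss(poisoned_batch) \
             + alpha * topo_disc_loss(clean_batch, poisoned_batch)
    opt_Gt.zero_grad(); g_loss.backward(); opt_Gt.step()
    # Stage 2: same pattern for the feature pair, weighted by beta.
    d_loss = -feat_disc_loss(clean_batch, poisoned_batch)
    opt_Df.zero_grad(); d_loss.backward(); opt_Df.step()
    g_loss = attack_loss(poisoned_batch) \
             + beta * feat_disc_loss(clean_batch, poisoned_batch)
    opt_Gf.zero_grad(); g_loss.backward(); opt_Gf.step()
    # Surrogate fine-tuning after each stage is omitted in this sketch.
```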

Key Experimental Results

Main Results

Experiments are conducted on 4 TUDataset benchmarks with 3 graph classifiers (GCN, GIN, SAGPool), compared against 5 baseline attack methods.

| Dataset | Model | Method | ASR (%) | CAD (%) | AUC (%) |
| --- | --- | --- | --- | --- | --- |
| AIDS | GCN | GTA | 93.21 | 5.14 | 99.34 |
| AIDS | GCN | DPSBA | 94.76 | 2.38 | 72.65 |
| AIDS | GIN | Motif | 97.75 | 2.28 | 99.71 |
| AIDS | GIN | DPSBA | 95.87 | 1.94 | 73.66 |
| FRANKENSTEIN | GCN | GTA | 99.35 | 2.74 | 100.00 |
| FRANKENSTEIN | GCN | DPSBA | 98.37 | 1.01 | 68.96 |
| ENZYMES | GCN | GTA | 95.33 | 3.00 | 71.20 |
| ENZYMES | GCN | DPSBA | 96.67 | -0.67 | 66.11 |

Ablation Study

| Configuration | ASR | AUC | Note |
| --- | --- | --- | --- |
| Full DPSBA | 94.76% | 72.65% | Best trade-off |
| w/o topology discriminator | ~95% | ~85% | Increased structural anomaly |
| w/o feature discriminator | ~94% | ~80% | Increased feature deviation |
| Dirty-label setting | ASR↑ | AUC↑ | More detectable |
| Non-adversarial training | ASR↑ | AUC↑↑ | Stealthiness significantly degraded |

Key Findings

  • DPSBA reduces the anomaly detection AUC by 20–30 percentage points across all datasets while maintaining or improving ASR.
  • On the AIDS dataset, Motif achieves an AUC of 99.71% (nearly perfectly detected), whereas DPSBA achieves only 73.66%.
  • Under the clean-label setting, DPSBA consistently yields lower CAD (clean accuracy drop) than competing methods, indicating less interference with normal samples.

Highlights & Insights

  • First systematic analysis of structural and semantic deviation in graph-level backdoor attacks, accompanied by a theoretical lower bound linking total variation distance to optimal AUC (a companion derivation sketch follows this list).
  • The combination of clean-label + adversarial training is elegant: labels remain unmodified (avoiding semantic deviation), while discriminators constrain the trigger distribution (suppressing structural deviation).
  • The threat model is well-grounded: black-box knowledge combined with limited poisoning capability closely reflects real-world scenarios.
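The paper's exact bound is not reproduced here; as a rough companion, the following is a standard pair of inequalities of this flavor relating total variation distance to the best achievable detector AUC (my derivation sketch, not the authors' statement):

```latex
% P: distribution of clean graphs, Q: distribution of backdoored graphs.
% Left inequality: the Bayes classifier on a balanced P/Q mixture attains
% balanced accuracy (1 + TV(P,Q))/2, and a binary scorer's AUC equals its
% balanced accuracy, so the optimal scorer does at least as well.
% Right inequality: for any scorer s (continuous scores, no ties),
% AUC - 1/2 = E_Q[g] - E_P[g] with g(x) = Pr_{y~P}[s(x) > s(y)] in [0,1],
% which is at most TV(P,Q). Hence
\[
  \frac{1 + \mathrm{TV}(P,Q)}{2}
  \;\le\; \mathrm{AUC}^{*}
  \;\le\; \frac{1}{2} + \mathrm{TV}(P,Q),
\]
% so driving TV(P,Q) toward zero pins every detector's AUC near chance (1/2),
% which is exactly what distribution-preserving triggers aim for.
```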

Limitations & Future Work

  • The trigger size is fixed at 4 nodes, which may lack flexibility for large-scale graphs.
  • Only SIGNET is considered as an anomaly detection baseline for evaluation.
  • In the clean-label setting, ASR on certain datasets (e.g., Motif-S) still has room for improvement.
  • Adaptive defenses — where the defender is aware of the attack method — are not considered.
  • A key distinction from node-level distribution-preserving attacks [Zhang et al.]: graph classification requires modifying global message passing to influence whole-graph representations, making distribution preservation significantly more challenging.
  • The adversarial training paradigm can be extended to other security domains, such as maintaining sample distribution during adversarial robustness training.
  • This work also serves as a warning that existing graph classification backdoor defenses relying on anomaly detection may not be sufficiently reliable.

Rating

  • Novelty: ⭐⭐⭐⭐ — The combination of clean-label and distribution preservation is proposed for the first time in graph backdoor attacks.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 4 datasets × 3 classifiers × 5 baselines with comprehensive ablation.
  • Writing Quality: ⭐⭐⭐⭐ — Problem analysis is clear and motivation is well articulated.
  • Value: ⭐⭐⭐⭐ — Carries significant cautionary implications for GNN security research.