
Improving Black-Box Generative Attacks via Generator Semantic Consistency

Conference: ICLR 2026 arXiv: 2506.18248 Code: To be released Area: Audio & Speech Keywords: Generative adversarial attacks, black-box transferability, Mean Teacher, semantic consistency, feature distillation

TL;DR

By analyzing semantic degradation in intermediate-layer features of perturbation generators, this paper proposes a Mean Teacher-based semantic structure-aware framework that performs self-feature distillation at early generator layers to preserve semantic consistency, thereby enhancing the transferability of adversarial examples across models, domains, and tasks.

Background & Motivation

State of the Field

Generative adversarial attacks train a perturbation generator on a white-box surrogate model and transfer the generated perturbations to unseen black-box victim models. Compared to iterative attacks, generative methods offer higher inference efficiency, scalability, and transferability. However, existing methods predominantly treat the generator as a black box, optimizing only end-to-end metrics while neglecting how the generator internally represents semantic information (e.g., object boundaries and coarse shapes).

Key Observation

The authors conduct a systematic analysis of trained generators, partitioning intermediate activations into three stages: early, mid, and late blocks.

  • Early blocks: consistently preserve coarse semantic structure of the input image (object contours, shape priors).
  • Mid and late blocks: semantic cues progressively degrade and dissipate as perturbations accumulate.

This implies that preserving semantic integrity at early stages allows subsequent perturbations to better focus on salient object regions, thereby improving transferability.

Core Problem

  1. At which stage of the generator do semantic cues degrade during adversarial synthesis?
  2. Which intermediate blocks of the generator most significantly affect transferability?

Method

Overall Architecture

The framework is built on a Student-Teacher architecture comprising the following components:

  • Student generator \(\mathcal{G}_\theta\): trained via gradient descent to produce adversarial perturbations.
  • Teacher generator \(\mathcal{G}_{\theta'}\): weights updated via EMA, providing temporally smoothed feature references.
  • Frozen surrogate model: provides adversarial supervision signals.
  • Perturbation projector \(\mathcal{P}\): enforces \(\ell_\infty\) constraints on perturbations.
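A minimal sketch of the projector \(\mathcal{P}\), assuming pixels in [0, 1] and interpreting the paper's budget \(\epsilon=10\) as 10/255 on that scale (an assumption here; pixels are plain floats rather than tensors):

```python
def project_linf(x, delta, eps=10 / 255.0, lo=0.0, hi=1.0):
    """Clip the perturbation to the l_inf ball of radius eps, then clip the
    perturbed pixel back into the valid range [lo, hi]."""
    adv = []
    for xi, di in zip(x, delta):
        di = max(-eps, min(eps, di))           # l_inf projection of the perturbation
        adv.append(max(lo, min(hi, xi + di)))  # keep the adversarial pixel valid
    return adv

x = [0.50, 0.99, 0.00]        # clean pixels in [0, 1]
delta = [0.20, 0.20, -0.20]   # raw generator output, exceeds the budget
x_adv = project_linf(x, delta)
```

In a real pipeline the same two clamps are applied tensor-wise to the generator output before it is fed to the surrogate model.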

Key Designs

  1. Mean Teacher feature smoothing: Two generators are maintained: a Student (trained via gradient descent) and a Teacher whose parameters are updated by exponential moving average, \(\theta' \leftarrow \eta\theta' + (1-\eta)\theta\) with \(\eta=0.999\). EMA smooths out high-frequency perturbation artifacts, stabilizing the Teacher's intermediate feature maps so that they serve as reliable semantic references for the Student.
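The EMA rule can be sketched in a few lines of plain Python (weights modeled as a dict of floats; \(\eta=0.9\) here only to make the smoothing visible in a toy run; the paper uses 0.999):

```python
def ema_update(teacher, student, eta=0.999):
    """Mean Teacher update: theta' <- eta * theta' + (1 - eta) * theta."""
    for name, w_student in student.items():
        teacher[name] = eta * teacher[name] + (1.0 - eta) * w_student
    return teacher

teacher = {"block1.weight": 0.0}
student = {"block1.weight": 1.0}
for _ in range(3):  # a few training steps with a fixed student weight
    teacher = ema_update(teacher, student, eta=0.9)
```

Because the Teacher only accumulates a small fraction of each Student update, abrupt per-step changes in the Student's features are averaged away.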

  2. Self-Feature Distillation: At the early blocks (\(L_{\text{early}}=\{1,2\}\)) of the generator, a hinge-based loss enforces alignment between the Student's early activations and the Teacher's semantically rich features:

\[\mathcal{L}_{\text{distill}} = \sum_{\ell \in L_{\text{early}}} \mathcal{W}_{\text{distill}}^{(\ell)} \max(0, \tau - \cos(\mathbf{g}_s^{(\ell)}, \mathbf{g}_t^{(\ell)}))\]

where \(\cos(\cdot,\cdot)\) denotes cosine similarity, \(\tau=0.6\) is the similarity threshold, and \(\mathcal{W}_{\text{distill}}^{(\ell)}\) are learnable per-block weights normalized via a softmax.
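A plain-Python sketch of this hinge-based loss (no autograd; feature maps are flattened to vectors, and the softmax weights come from fixed logits here, whereas the paper learns them):

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two flat feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def distill_loss(student_feats, teacher_feats, weight_logits, tau=0.6):
    """Each early block pays a penalty only when its Student/Teacher cosine
    similarity falls below tau; blocks are combined with softmax weights."""
    exps = [math.exp(z) for z in weight_logits]
    weights = [e / sum(exps) for e in exps]
    return sum(w * max(0.0, tau - cos_sim(gs, gt))
               for w, gs, gt in zip(weights, student_feats, teacher_feats))

# Block 1 is perfectly aligned with the teacher, block 2 is orthogonal:
loss = distill_loss([[1.0, 0.0], [1.0, 0.0]],
                    [[1.0, 0.0], [0.0, 1.0]],
                    weight_logits=[0.0, 0.0])
```

The hinge means already-aligned blocks (similarity above \(\tau\)) contribute no gradient, so the constraint does not over-regularize the Student.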

  3. Novel evaluation metric ACR: The paper proposes the Accidental Correction Rate (ACR), which captures predictions accidentally corrected during the attack, providing a more comprehensive assessment of attack efficacy.
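One plausible way to compute ACR alongside ASR, sketched with illustrative normalizations (the paper's exact definitions may differ):

```python
def attack_metrics(labels, clean_preds, adv_preds):
    """ASR: fraction of initially correct predictions that the attack flips.
    ACR: fraction of samples whose initially WRONG prediction the attack
    accidentally turns correct (lower is better for the attacker)."""
    n = len(labels)
    n_correct = sum(1 for y, c in zip(labels, clean_preds) if c == y)
    fooled = sum(1 for y, c, a in zip(labels, clean_preds, adv_preds)
                 if c == y and a != y)
    fixed = sum(1 for y, c, a in zip(labels, clean_preds, adv_preds)
                if c != y and a == y)
    return fooled / max(1, n_correct), fixed / n

labels      = [0, 1, 2, 3]
clean_preds = [0, 1, 0, 0]   # samples 2 and 3 start out misclassified
adv_preds   = [1, 1, 2, 0]   # attack flips sample 0, accidentally fixes sample 2
asr, acr = attack_metrics(labels, clean_preds, adv_preds)
```

The toy run illustrates the blind spot ACR closes: conventional ASR alone never registers sample 2, where the "attack" made the victim model more accurate.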

Loss & Training

The adversarial loss is the cosine similarity between clean and adversarial representations in the surrogate feature space, minimized during training so that the two diverge: \(\mathcal{L}_{\text{adv}} = \cos(\mathcal{F}_k(x), \mathcal{F}_k(x^{\text{adv}}))\)

The total loss is: \(\mathcal{L} = \mathcal{L}_{\text{adv}} + \lambda_{\text{distill}} \cdot \mathcal{L}_{\text{distill}}\)

where \(\lambda_{\text{distill}}=0.7\). The 16th layer (Maxpooling.3) of VGG-16 is used as the surrogate feature extractor. Training is conducted on ImageNet-1K with perturbation budget \(\epsilon=10\).
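Putting the two terms together, a toy version of the total loss (equal distillation weights for brevity instead of the learned softmax weights; feature maps flattened to plain vectors):

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two flat feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def total_loss(feat_clean, feat_adv, early_pairs, lam=0.7, tau=0.6):
    """L = L_adv + lambda * L_distill. Minimizing L_adv pushes the adversarial
    surrogate features away from the clean ones, while the hinge term keeps the
    Student's early blocks aligned with the Teacher's."""
    l_adv = cos_sim(feat_clean, feat_adv)
    l_distill = sum(max(0.0, tau - cos_sim(gs, gt)) for gs, gt in early_pairs)
    l_distill /= max(1, len(early_pairs))
    return l_adv + lam * l_distill

# Ideal state: surrogate features orthogonal, early blocks perfectly aligned.
loss = total_loss(feat_clean=[1.0, 0.0], feat_adv=[0.0, 1.0],
                  early_pairs=[([1.0, 0.0], [1.0, 0.0])])
```

At this ideal point both terms vanish, which matches the intuition: the attack has diverged in the surrogate's feature space without sacrificing early-block semantics.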

Key Experimental Results

Main Results (Cross-Model Transfer)

The proposed method serves as a plug-and-play module that can be integrated into any existing generative attack baseline:

| Baseline | Cross-Model ASR Gain | Cross-Domain ASR Gain | Cross-Task Improvement |
| --- | --- | --- | --- |
| BIA + Ours | Significant | Significant | Consistent |

The module is likewise stacked onto CDA, LTP, GAMA, FACL, and PDCL, with consistent gains across the same settings (per-baseline numbers are in the paper).

Cross-Domain Transfer (CUB-200, Stanford Cars, FGVC Aircraft)

Using BIA as the baseline with VGG-19 as the surrogate:

  • Victim accuracy: a further 10.05%p drop (lower accuracy indicates a stronger attack)
  • ASR gain: 11.20%p
  • FR gain: 10.39%p
  • ACR: reduced by 2.26%p (lower ACR is better)

Ablation Study

| Configuration | Result | Note |
| --- | --- | --- |
| Early-block distillation (blocks 1, 2) | Best | Early blocks retain the most semantic information |
| Mid-block distillation | Inferior | Semantics partially degraded |
| Late-block distillation | Worst | Semantics severely degraded |
| τ = 0.6 | Best | Balances attack strength |
| Without Mean Teacher | Degraded | Lacks a temporally smoothed reference |

Key Findings

  • Consistent improvements are observed across all four transfer settings (cross-model, cross-domain, cross-task semantic segmentation, and cross-task object detection).
  • Robustness is maintained under adversarial purification defense (NRP).
  • Perceptual quality (PSNR/SSIM/LPIPS) is not degraded and shows marginal improvement.
  • Multi-seed experiments demonstrate good training stability (low standard deviation).
  • Performance on CLIP zero-shot classification varies depending on the baseline.

Highlights & Insights

  1. Semantic analysis of generator internals: For the first time, this work systematically analyzes semantic degradation in intermediate-layer features of adversarial perturbation generators, identifying early blocks as the key to preserving semantic integrity.
  2. Plug-and-play design: The framework is architecture-agnostic and can be stacked onto any existing generative adversarial attack method, yielding consistent performance gains.
  3. ACR metric: Reveals a limitation of existing evaluation protocols — conventional metrics overlook "accidental corrections" where predictions become correct after the attack.
  4. Difference map analysis: Visualizations confirm that the proposed method concentrates adversarial noise more on semantically meaningful object regions within residual blocks.

Limitations & Future Work

  • The method's effectiveness depends on the generator architecture — when early intermediate block features lack rich semantic cues (e.g., in alternative architectures such as U-Net), the benefit of the distillation mechanism is limited.
  • Transferability gains on tasks beyond image classification (detection, segmentation) are limited, suggesting that classification-oriented surrogate models struggle to align feature representations for other tasks.
  • The method focuses on preserving internal generator semantics and differs in principle from approaches that explicitly target the benign-adversarial discrepancy; the two are complementary and can be used jointly.

Related Work
  • BIA (Zhang et al., 2022): Baseline method using similarity loss in surrogate feature space.
  • GAMA (Aich et al., 2022): Leverages CLIP vision-language models to enhance generative attacks.
  • Mean Teacher (Tarvainen & Valpola, 2017): Originally proposed for semi-supervised learning; this paper innovatively applies it to adversarial attacks.
  • Insights: The approach of analyzing intermediate-layer semantics can be generalized to other generative tasks such as image inpainting and style transfer.

Rating

  • Novelty: ⭐⭐⭐⭐ — The perspective of analyzing semantic degradation in generators is novel, though Mean Teacher itself is not a new technique.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across models/domains/tasks, thorough ablation, with adversarial purification and zero-shot testing included.
  • Writing Quality: ⭐⭐⭐⭐ — Motivation is clear and visualizations are rich.
  • Value: ⭐⭐⭐⭐ — Plug-and-play with consistent gains; offers meaningful reference for adversarial robustness research.