Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting

Conference: CVPR 2026 arXiv: 2603.24821 Code: https://github.com/simurgh7/CrowdGen Area: AI Safety / Adversarial Attacks Keywords: adversarial attack, crowd counting, cross-paradigm transferability, generative adversarial perturbation, black-box attack

TL;DR

This paper proposes CrowdGen, the first cross-paradigm adversarial attack framework targeting both density-map and point-regression crowd counting models. A lightweight UNet generator trained with a multi-task loss (logit suppression, density suppression, GradCAM guidance, and a frequency-domain constraint) achieves high transferability (TR up to 1.69) across seven SOTA crowd counting models, increasing target-model MAE by roughly 7× on average while remaining visually unobtrusive (~19 dB PSNR).

Background & Motivation

Localized crowd counting is widely deployed in public safety, retail analytics, and epidemic monitoring. Current mainstream approaches fall into two paradigms: density-map methods (e.g., SASNet, FIDTM), which regress spatial density distributions and extract localization via post-processing, and point-regression methods (e.g., P2PNet, PET), which directly predict coordinates and confidence scores end-to-end.

Existing adversarial attacks suffer from the following limitations:

Attack strength vs. imperceptibility trade-off: PAP and GE-AdvGAN achieve good visual quality (PSNR ≥ 22 dB) but weak attacks (MAE < 120); DiffAttack yields strong attacks (MAE = 414) but severe visual degradation (PSNR = 11.5 dB).

Single-paradigm limitation: Existing transferable attacks (APAM, PAP) transfer only within density-map methods and do not consider cross-paradigm transfer (density-map ↔ point-regression).

Black-box deployment requirements: Real-world crowd counting systems are typically black-box, necessitating surrogate-model-based transfer attacks.

Core Idea: The shared backbone feature space (e.g., VGG-16, ResNet-50) across both paradigms encodes common inductive biases. By combining paradigm-specific attack losses with paradigm-agnostic perceptual constraints, a unified generative perturbation generator can be learned.

Method

Overall Architecture

The framework consists of three core components: a 3-layer UNet generator \(G_\theta\) that maps input images to bounded perturbations \(\delta\); a paradigm-specific model loss \(\mathcal{L}_{model}\); and a cross-paradigm perturbation loss \(\mathcal{L}_{pert}\). Training is performed against surrogate models; at inference time, adversarial examples are generated in a single forward pass without per-image optimization.
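The generator-plus-bound structure can be sketched as follows. This is a minimal stand-in, not the paper's architecture: a plain encoder-decoder replaces the 3-layer UNet (skip connections omitted for brevity), and the tanh squashing to enforce \(\|\delta\|_\infty \le \epsilon\) is an assumed implementation detail.

```python
import torch
import torch.nn as nn

class PerturbationGenerator(nn.Module):
    """Minimal stand-in for the paper's UNet generator G_theta.

    Maps an image to a perturbation delta bounded by the budget eps;
    the tanh output squashing is an assumption, since the summary only
    states that delta is bounded.
    """

    def __init__(self, eps=8 / 255):
        super().__init__()
        self.eps = eps
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        # bounded perturbation: |delta| <= eps elementwise
        delta = self.eps * torch.tanh(self.decoder(self.encoder(x)))
        # keep the adversarial example a valid image in [0, 1]
        return torch.clamp(x + delta, 0.0, 1.0), delta
```

A single forward pass through this generator yields the adversarial example, which is what makes the attack cheap at inference time compared with per-image iterative optimization.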

Key Designs

  1. Logit Suppression Loss (for point-regression models):

    • Attacks are focused on high-confidence detections \(\mathcal{P}_{high} = \{i : s_i^{(h)} > \tau\}\)
    • Dense scenes (\(C_{gt} > C_{sparse}\)): directly minimizes logit values in high-confidence regions
    • Sparse scenes (\(C_{gt} \le C_{sparse}\)): applies weighted penalties to detections near the confidence boundary
    • Adaptive threshold decay: \(\tau(t) = \max(\tau_{min}, \tau_{max} - \nu \cdot t / T_{max})\), progressively lowering the threshold during training to expand the attack surface
  2. Density Suppression Loss (for density-map models):

    • Heatmap suppression \(\mathcal{L}_{hmap}\): simultaneously attacks salient peaks and near-threshold regions; local maxima are detected via 3×3 max-pooling, with foreground separated by an adaptive threshold \(\phi = \phi' \cdot \max(\mathcal{D})\)
    • Peak suppression \(\mathcal{L}_{peak}\): additionally incorporates peak prominence (difference between peak value and local neighborhood) to emphasize isolated high-density clusters
    • An isolation ratio (the proportion of peaks with no neighboring peak within a 5×5 window) automatically selects which loss to apply, switching to peak suppression when the ratio exceeds 0.7
  3. Cross-paradigm Perturbation Loss:

    • Frequency-domain constraint \(\mathcal{L}_{freq}\): suppresses high-frequency components via FFT, exploiting the low-frequency dominance of crowd scenes to improve transferability
    • GradCAM guidance \(\mathcal{L}_{cam}\): concentrates perturbations on semantically important regions identified by the shared backbone, minimizing perturbation outside attention regions
    • Magnitude constraint \(\mathcal{L}_{hinge}\): bounds perturbation energy via L2 norm
    • Spatial smoothness regularization \(\mathcal{L}_{tv}\): total variation regularization to reduce perturbation artifacts
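Two of the designs above are concrete enough to sketch: the adaptive threshold decay \(\tau(t)\) and the max-pooling-based local-maximum detection with adaptive foreground threshold \(\phi = \phi' \cdot \max(\mathcal{D})\). The numeric defaults below (\(\tau_{min}, \tau_{max}, \nu\), and \(\phi'\)) are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

def tau(t, tau_min=0.3, tau_max=0.7, nu=0.4, t_max=100):
    """Adaptive confidence-threshold decay tau(t) for the logit
    suppression loss: the threshold drops linearly with training step t
    until it hits tau_min, expanding the attack surface over time.
    Defaults are assumed values for illustration."""
    return max(tau_min, tau_max - nu * t / t_max)

def peak_mask(density, phi_prime=0.3):
    """Local-maximum detection for the heatmap suppression loss:
    a 3x3 max-pool (emulated here with shifted slices) finds candidate
    peaks, and the adaptive threshold phi = phi' * max(D) separates
    foreground peaks from background. phi_prime is an assumed value."""
    h, w = density.shape
    padded = np.pad(density, 1, mode="constant", constant_values=-np.inf)
    pooled = np.max(
        [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)],
        axis=0,
    )
    phi = phi_prime * density.max()
    # a pixel is a peak if it equals its 3x3 neighborhood maximum
    # and exceeds the adaptive foreground threshold
    return (density == pooled) & (density > phi)
```

The resulting boolean mask marks the salient peaks that the attack then suppresses; sub-threshold local maxima are ignored as background.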

Loss & Training

Total loss: \(\mathcal{L}_{attack} = \alpha \cdot \mathcal{L}_{model} + \beta \cdot \mathcal{L}_{hinge} + \gamma \cdot \mathcal{L}_{tv} + \zeta \cdot \mathcal{L}_{freq} + \kappa \cdot \mathcal{L}_{cam}\)

  • Perturbation budget \(\epsilon = 8/255\); image resolution 512×512
  • Cosine annealing learning rate schedule
  • Hyperparameters tuned via grid search on a validation set: \(\beta=0.01, \gamma=0.05, \zeta=0.01, \kappa=0.5\)
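Assembling the total objective is then a weighted sum of the component losses. A minimal sketch using the grid-searched weights quoted above; \(\alpha = 1.0\) is an assumed default, since the summary does not report its tuned value.

```python
def attack_loss(l_model, l_hinge, l_tv, l_freq, l_cam,
                alpha=1.0, beta=0.01, gamma=0.05, zeta=0.01, kappa=0.5):
    """Weighted total attack loss L_attack.

    beta, gamma, zeta, kappa follow the grid-searched values reported
    in the summary; alpha = 1.0 is an assumption.
    """
    return (alpha * l_model + beta * l_hinge + gamma * l_tv
            + zeta * l_freq + kappa * l_cam)
```

The small weights on the hinge, TV, and frequency terms suggest they act as regularizers on the perturbation, while \(\mathcal{L}_{model}\) and the GradCAM term dominate the attack direction.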

Key Experimental Results

Main Results (Cross-model Transferability, SHHA Dataset)

| Surrogate → Target | MAE / TR | Notes |
|---|---|---|
| HMoDE → P2PNet | 420.71 / 1.69 | Cross-paradigm super-transfer: stronger than white-box self-attack |
| FIDTM → P2PNet | 426.89 / 1.64 | Strong density-map → point-regression transfer |
| SASNet → APGCC | 397.96 / 1.32 | Density-map → point-regression |
| P2PNet → SASNet | 281.00 / 0.89 | Point-regression → density-map |
| APGCC → HMoDE | 171.53 / 0.55 | Weakest transfer, yet MAE still doubles |
| Clean baseline | 28–75 | Counting error on clean images |
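From the "super-transfer" remark (TR > 1 meaning the transfer attack beats the white-box self-attack), TR appears to be the ratio of the black-box transfer MAE to the white-box self-attack MAE. A sketch under that assumed definition:

```python
def transfer_rate(mae_transfer, mae_whitebox):
    """Transfer rate under the assumed definition:
    ratio of the MAE induced on the black-box target to the MAE of a
    white-box attack on that same target. TR > 1 means the transferred
    attack degrades the target more than attacking it directly
    (super-transfer)."""
    return mae_transfer / mae_whitebox

# Under this reading, the HMoDE -> P2PNet row (MAE 420.71 at TR 1.69)
# would imply a white-box self-attack MAE of about 420.71 / 1.69
```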

Ablation Study (Loss Combinations, SHHA Dataset)

| Loss Combination | Miss Rate (%) | PSNR (dB) | Notes |
|---|---|---|---|
| \(\mathcal{L}_{hmap} + \mathcal{L}_{hinge}\) (baseline) | 45.35 | 17.67 | Basic density attack only |
| + \(\mathcal{L}_{cam}\) | 59.47 | 17.67 | GradCAM adds +14% MR |
| + \(\mathcal{L}_{freq}\) | 60.46 | 17.75 | Frequency constraint also improves attack strength markedly |
| Full combination (density-map) | 60.89 | 17.47 | Strongest attack |
| \(\mathcal{L}_{logit} + \mathcal{L}_{hinge}\) (baseline) | 45.15 | 19.09 | Basic logit attack only |
| + \(\mathcal{L}_{cam}\) | 45.61 | 19.10 | Best trade-off for point-regression |

Key Findings

  • Cross-paradigm super-transfer: CNN-based HMoDE attacking Transformer-based PET achieves TR = 1.60 on UCF-QNRF, validating the shared backbone inductive bias hypothesis.
  • The contribution of each loss component is paradigm-dependent: frequency-domain constraints are critical for density-map models, while GradCAM guidance benefits point-regression models more.
  • In dense scenes, the miss rate reaches 58%, concealing the majority of the crowd, whereas prior methods achieve only 15–31%.

Highlights & Insights

  • This work is the first to reveal cross-paradigm adversarial vulnerability in crowd counting models; the super-transfer phenomenon (TR > 1) suggests that black-box attackers may be more effective than white-box ones.
  • The scene-density-adaptive logit suppression strategy (dense vs. sparse branches) elegantly handles the distinct characteristics of different scene types.
  • The generative single-forward-pass attack is more practical than iterative optimization methods, offering high inference efficiency.

Limitations & Future Work

  • Only digital-domain attacks are evaluated; physical-world scenarios (printing, projection) are not considered.
  • The attack focuses primarily on under-counting; the over-counting direction (hallucinating crowds) remains unexplored.
  • The perturbation budget \(\epsilon = 8/255\) is standard; performance under smaller budgets has not been verified.
  • Experiments against adversarial defense methods are absent.
  • This framework can serve as a standardized benchmark for robustness evaluation of crowd counting systems.
  • The GradCAM-guided perturbation allocation strategy is generalizable to adversarial attacks on other dense prediction tasks.
  • The finding of shared-backbone vulnerability across paradigms has important implications for security strategies in model deployment.

Rating

  • Novelty: ⭐⭐⭐⭐ First cross-paradigm adversarial attack on crowd counting, though the generative adversarial perturbation framework itself is not entirely novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Transfer matrix across 7 models × 2 datasets, comparison with 9 baselines, and ablation studies — fairly comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear and notation is complete, though some symbols are heavy.
  • Value: ⭐⭐⭐⭐ Provides important insights into the vulnerability of safety-critical crowd analysis systems.