🛡️ AI Safety
📷 CVPR2026 · 143 paper notes
📌 Same area in other venues: 🔬 ICLR2026 (140) · 💬 ACL2026 (5) · 🧪 ICML2026 (114) · 🤖 AAAI2026 (45) · 🧠 NeurIPS2025 (73) · 📹 ICCV2025 (24)
🔥 Top topics: Adversarial Robustness ×65 · Multimodal/VLM ×19 · Watermarking ×11 · Federated Learning ×10 · Diffusion Models ×10
- A Combination of Noise and Bilateral Filters Achieve Supralinear and Scalable Adversarial Robustness in CNNs
-
This paper theoretically demonstrates from the perspective of decision boundary geometry that "Gaussian noise" and "image filtering" defend against adversarial attacks through two complementary mechanisms. Consequently, their combination yields supralinear robustness gains. Based on this, a minimalist preprocessor (pixel-level Gaussian noise + iterative bilateral filtering, applied during both training and inference) is proposed, which approaches or even exceeds SOTA defenses on RobustBench using only ~35% of the training FLOPs and half the parameters.
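The preprocessor described above (pixel-level Gaussian noise followed by iterated bilateral filtering) can be sketched in a few lines. This is a minimal illustration on a small grayscale image stored as a list of rows, not the paper's implementation; the function names, the default `noise_sigma`, `iterations`, the filter parameters, and the `seed` argument are all assumptions made for the example.

```python
import math
import random

def add_gaussian_noise(img, sigma, rng):
    """Add i.i.d. Gaussian noise to every pixel and clamp to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in img]

def bilateral_filter(img, radius=1, sigma_space=1.0, sigma_range=0.1):
    """One bilateral-filter pass: average neighbours, weighted by both
    spatial distance and intensity difference, so edges are preserved."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, norm = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        d_space = dx * dx + dy * dy
                        d_range = (img[yy][xx] - img[y][x]) ** 2
                        wgt = math.exp(-d_space / (2 * sigma_space ** 2)
                                       - d_range / (2 * sigma_range ** 2))
                        total += wgt * img[yy][xx]
                        norm += wgt
            out[y][x] = total / norm  # norm >= 1: the centre pixel has weight 1
    return out

def preprocess(img, noise_sigma=0.05, iterations=3, seed=0):
    """Hypothetical pipeline: noise first, then repeated bilateral filtering."""
    rng = random.Random(seed)
    out = add_gaussian_noise(img, noise_sigma, rng)
    for _ in range(iterations):
        out = bilateral_filter(out)
    return out
```

With `noise_sigma=0.0`, a sharp step edge passes through almost unchanged, because the range term suppresses averaging across large intensity jumps while small fluctuations are smoothed away.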
- A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World
-
This paper performs a "sanity check": it reveals that existing deepfake detectors achieve seemingly high AUC on multi-domain mixed data but suffer from low frame-level real/fake Accuracy (ACC) because "inter-domain discrepancies" overshadow "real/fake differences" in the feature space. Subsequently, it proposes a model-agnostic two-stage framework, DevDet (FFDev to expose forgery traces + DAFT for dose-adaptive fine-tuning), which significantly boosts frame-level ACC while preserving original generalization capabilities.
- A Unified Perspective on Adversarial Membership Manipulation in Vision Models
-
This paper is the first to reveal that Membership Inference Attacks (MIA) on vision models are vulnerable to adversarial membership manipulation: imperceptible perturbations can forge non-members as members and thereby deceive auditing. It identifies a "gradient norm collapse" signature in forged members and proposes a gradient-geometry-based detection strategy along with an adversarially robust inference framework.
- AdvFM: Lookahead Flow-Matching Velocity-Field Attacks for Imperceptible and Transferable Adversarial Examples
-
Unrestricted adversarial attacks are implemented within the continuous-time velocity field of flow matching. Instead of perturbing pixels directly or using diffusion-based "denoising-then-re-noising," PGD perturbations on reconstructed images are translated into velocity field perturbations propagated deterministically along probability flow ODEs. A "lookahead two-point objective" corrects temporal mismatch, achieving simultaneously stronger black-box transferability and higher success rates against purification and adversarial training on ImageNet.
- All Vehicles Can Lie: Efficient Adversarial Defense in Fully Untrusted-Vehicle Collaborative Perception via Pseudo-Random Bayesian Inference
-
This paper proposes the Pseudo-Random Bayesian Inference (PRBI) framework for collaborative perception scenarios where all vehicles are untrusted. By utilizing inter-frame temporal consistency as a self-reference signal through pseudo-random grouping and Bayesian inference, the system efficiently identifies and excludes malicious vehicles with an average of only 2.5 verifications per frame, restoring detection accuracy to 79.4%–86.9% of pre-attack levels.
- AntiStyler: Defending Object Detection Models Against Adversarial Patch Attacks Using Style Removal
-
The authors invert style transfer into "style removal" to eliminate the "random texture style" of adversarial patches from images. By locating and masking patch pixels based on these style changes, they develop a zero-shot defense that is agnostic to models, patches, and attacks. This method avoids training and preserves clean image performance while improving adversarial mAP by 8–15 points, achieving real-time detection at 10–12 FPS (40–90ms per image).
- AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs
-
AVFakeBench is the first comprehensive audiovisual forgery detection benchmark covering "Human + Generic Scenes, 7 types of AV forgery combinations, and 4 levels of annotation" (3K segments / 12K QA). Using a multi-stage hybrid forgery framework based on "proprietary model planning + expert model execution" to mass-produce fake data, the authors evaluated 11 Audiovisual Large Multimodal Models (AV-LMMs) and 2 expert detectors. The study reveals that while AV-LMMs outperform expert models in binary real/fake judgment, they nearly collapse in fine-grained forgery classification and explanatory reasoning.
- Batman: Benign Knowledge Alignment Through Malicious Null Space in Federated Backdoor Attack
-
Addressing the dilemma in federated backdoor attacks where "aligning benign knowledge weakens the attack, while failing to align makes it easily detected by defenses," Batman utilizes SVD to compress malicious knowledge into the dominant directions of parameter matrices. It aligns benign knowledge within the orthogonal "malicious null space," significantly enhancing stealthiness while maintaining backdoor functionality. It achieves high ASR and ACC simultaneously across four datasets and six aggregation/defense mechanisms.
- Beyond [CLS] Token: Query-Driven Token-Level Forgery Purification for Generalizable Deepfake Detection
-
Addressing the "Pre-trained Information Bias" (PIB) where the [CLS] token in ViT foundation models excessively focuses on global semantics and ignores local forgery traces during deepfake detection, this paper proposes the QTFP framework. By replacing [CLS] with a set of randomly initialized learnable query tokens to aggregate local evidence, combined with "Forgery Likelihood Weighted Contrastive Loss" and "Real-Graph Attention Alignment" regularizations, the average cross-dataset AUC is improved from 0.923 (Effort) to 0.947.
- Bias at the End of the Score
-
This paper conducts a large-scale bias audit of five widely used reward models (PickScore, ImageReward, HPS, VQAScore, CLIP) in text-to-image (T2I) systems. It demonstrates that these scoring functions, acting as proxies for "image quality," encode systematic demographic biases. When used as noise optimizers, they disproportionately hypersexualize female subjects and "whiten" non-White subjects. Furthermore, the scores themselves correlate highly with real-world demographic distributions (such as gender ratios in occupations) rather than truly measuring quality.
- Bridging Privacy and Provenance: Traceable Virtual Identity Generation
-
This paper proposes a diffusion-based framework to generate "stable, reproducible, yet unrecognizable" virtual faces for users. By embedding a 128-bit invisible watermark as an identity fingerprint during generation, users can verify "this virtual face belongs to me" with a secret key without exposing their real faces, simultaneously achieving anonymity and traceability.
- Bypassing the Transport Plan: Dynamic Reweighting for Out-of-Distribution Detection with Optimal Transport
-
To address the lack of OOD labels in open-set semi-supervised learning, this paper proposes DREW: modeling batch-level OOD detection as Semi-Unbalanced Optimal Transport (SemiUOT). Through "dynamic reweighting," it is equivalently transformed into classical OT, allowing pseudo OOD scores to be directly read from the source distribution weights—bypassing the computation of the full transport plan \(\pi\). This yields more accurate, faster pseudo-labels with theoretical error bounds to supervise the OOD detection head.
- CamPI: Physical Adversarial Examples through Camera Power Signal Injection
-
By injecting a modulated signal into the camera's power supply line, controllable stripe/mask perturbations are induced in the imaging process through ADC sampling aliasing. This generates physical adversarial examples that are invisible to the naked eye, require no physical patches or external lighting, and do not need to face the target directly. The authors build a differentiable simulation model to optimize this physical mechanism end-to-end into attack parameters, achieving physical attack success rates of 92% and 82% under white-box and black-box settings, respectively.
- ClusterMark: Towards Robust Watermarking for Autoregressive Image Generators with Visual Token Clustering
-
This paper proposes ClusterMark, a watermarking scheme based on visual token clustering that adapts KGW-style LLM watermarking to autoregressive (AR) image generators. By assigning visually similar tokens to the same green/red sets, the method significantly enhances watermark robustness against image perturbations while preserving visual quality.
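To illustrate why cluster-level green/red sets help, the sketch below scores a token sequence with a KGW-style z-test in which greenness is decided per cluster rather than per token. Everything here is an assumption for illustration: the hash construction, the use of the previous cluster as context, the `gamma` green fraction, and the `token_to_cluster` mapping are not taken from the paper.

```python
import hashlib
import math

def is_green(cluster_id, prev_cluster, key, gamma=0.5):
    """Pseudo-randomly mark a cluster as green, seeded by the secret key
    and the previous cluster (hypothetical context rule)."""
    digest = hashlib.sha256(f"{key}:{prev_cluster}:{cluster_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64 < gamma

def green_z_score(tokens, token_to_cluster, key, gamma=0.5):
    """z-score of the green-hit count against the no-watermark expectation."""
    clusters = [token_to_cluster[t] for t in tokens]
    pairs = list(zip(clusters, clusters[1:]))
    hits = sum(is_green(cur, prev, key, gamma) for prev, cur in pairs)
    n = len(pairs)
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

# Tokens 0..31 grouped into 8 clusters of 4 "visually similar" tokens.
token_to_cluster = {t: t // 4 for t in range(32)}
tokens = [(7 * i + 3) % 32 for i in range(50)]
# Simulated perturbation: every token flips to a neighbour in the same cluster.
perturbed = [(t // 4) * 4 + (t + 1) % 4 for t in tokens]
same_score = (green_z_score(tokens, token_to_cluster, "k")
              == green_z_score(perturbed, token_to_cluster, "k"))
```

Because the score depends only on cluster identities, a perturbation that moves tokens within their clusters leaves the detection statistic unchanged, which is the robustness property the summary refers to.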
- Computation and Communication Efficient Federated Unlearning via On-server Gradient Conflict Mitigation and Expression
-
The FOUL framework is proposed to achieve efficient federated unlearning with low communication overhead without accessing client data. It utilizes a two-stage strategy: decoupling causal and non-causal features during the learning phase, followed by server-side gradient conflict matching during the unlearning phase.
- COPYLENS: Towards Copyrighted Characters Infringement Detection via Copyright-Aware Prompt Learning
-
Addressing the governance challenge where text-to-image models unintentionally replicate copyrighted characters (e.g., Disney characters), COPYLENS employs an LVLM as a detector and an LLM as a prompt optimizer. Using Cohen's Kappa (alignment with human annotations) as a feedback signal, it automatically rewrites detection prompts in a closed loop to maximize human-like judgment, achieving a 5%–10% improvement in alignment on the new COPYCHARS dataset compared to existing methods.
- Cross-modal Representation Learning for Diffusion-generated Image Detection
-
This work utilizes RGB and NPR (Neighborhood Pixel Relationship) modalities for representation learning—employing Cross-modal Contrastive Learning (CMCL) to increase inter-class separation and Cross-modal Mutual Distillation (CMMD) to tighten intra-class structures. Together, they learn a "forgery-detection-oriented" embedding space, reaching SOTA performance on three benchmarks: GenImage, DRCT-2M, and Co-Spy-Bench.
- DASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples
-
DASH treats a set of off-the-shelf \(\ell_p\) norm attacks (PGD, CW, various FGSM variants, etc.) as components, "softly combining" their adversarial examples using learnable softmax weights and refining them through multi-stage chaining. It learns these weights end-to-end via a meta-loss that simultaneously optimizes attack success rate and SSIM perceptual similarity. On adversarially trained robust models, it pushes the attack success rate to nearly 100% while remaining more stealthy than specialized perceptually-aligned attacks (e.g., AdvAD).
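The "soft combination" step can be pictured as a convex mixture of candidate adversarial examples whose mixing coefficients come from a softmax over learnable weights. The sketch below shows only that mixing step on flat pixel lists; the function names are hypothetical, and the multi-stage chaining and meta-loss described above are not modelled.

```python
import math

def softmax(weights):
    """Numerically stable softmax over a list of floats."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def soft_combine(candidates, weights):
    """Mix candidate adversarial examples with softmax(weights) as coefficients."""
    alphas = softmax(weights)
    dim = len(candidates[0])
    return [sum(a * c[i] for a, c in zip(alphas, candidates)) for i in range(dim)]
```

Since the coefficients are non-negative and sum to one, the mixture of candidates that each lie inside the same \(\ell_p\) ball around the clean image also lies inside that ball, so the combination does not break the perturbation budget.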
- Decoupling Bias, Aligning Distributions: Synergistic Fairness Optimization for Deepfake Detection
-
To address biases in Deepfake detectors across demographic groups (e.g., gender, race), this paper proposes a synergistic dual-mechanism framework: "Structural Fairness Decoupling + Global Distribution Alignment." It first prunes convolutional channels most sensitive to demographic attributes using a channel sensitivity metric and then aligns the prediction distributions of subgroups to a global distribution via entropy-regularized optimal transport. This approach improves both inter-group and intra-group fairness across multiple datasets without sacrificing (and sometimes even improving) detection AUC.
- AdvMark: Decoupling Defense Strategies for Robust Image Watermarking
-
AdvMark is proposed as a two-stage decoupled defense framework: Stage 1 Encoder Adversarial Training (EAT) shifts watermarked images into non-attackable regions to resist adversarial attacks; Stage 2 utilizes direct image optimization to counter distortion and regeneration attacks while preserving adversarial robustness. Across 9 watermarking methods and 10 attack types, AdvMark improves distortion/regeneration/adversarial accuracy by 29%/33%/46% respectively, achieving optimal image quality.
- DeepfakeImpact: A Two-Stage Benchmark with Real-World Impact in Deepfake Detection
-
This research redefines deepfake detection from "measuring accuracy" to "measuring social utility" through a two-stage benchmark: Stage I reproduces 33 SOTA detectors across 12 datasets; Stage II introduces the Social Misjudgment Impact (SMI) metric to assign "social harm scores" to missed detections, constructing an SMI-critical dataset of 17,653 high-risk samples. The findings reveal that models leading in technical metrics often fail under SMI-based evaluation.
- DeepProtect: Proactive Face-Swapping Defense using Identity Blending and Attribute Distortion
-
DeepProtect "vaccinates" face images before upload: it first dilutes extractable identity features by performing channel-wise blending of the target identity with visually similar but distinct faces in the StyleGAN W+ latent space. Subsequently, it embeds invisible adversarial watermarks along directions of specific facial parts (e.g., nose, eyebrows) specified by text prompts. This ensures that any deepfakes generated by subsequent face-swapping models are corrupted, while the protected image itself remains visually indistinguishable from the original.
- Detect Any AI-Counterfeited Text Image
-
Aiming at the detection of generative AI-counterfeited text images, the authors utilized an MLLM-driven Creative Proposer pipeline to construct the DanceText dataset, which is over 100 times larger than previous works. They proposed DS-Net, which leverages an "Artifact-Content Decoupling Encoder" to learn general artifacts from massive fake images in non-text domains, and a "Synergy Denoising Decoder" to enable mutual error correction between image-level classification and region-level localization. This approach improved the average F1 from 49.4 to 53.9 across eight out-of-domain test sets, including cross-generator, cross-language, and real-world software scenarios.
- Detecting Compressed AI-Generated Images via Phase Spectrum Robustness
-
To address the failure of AI-generated image detectors caused by JPEG compression on social networks destroying forgery traces, this paper starts from the signal processing observation that "phase spectra are more robust to compression than magnitude spectra." It proposes CPTFormer, which uses phase features to guide RGB representations through bidirectional cross-modal fusion, followed by fine-tuning via spatial and wavelet-frequency dual-branch adapters. By employing a difficulty-aware loss to focus on hard samples with limited compression labels, the model improves accuracy by up to 6.7% across four compression benchmarks for GAN/Diffusion models.
- DFD-HR: Generalizable Deepfake Detection via Hierarchical Routing Learning
-
When migrating Visual Foundation Models (VFMs) like CLIP to deepfake detection, DFD-HR avoids simply "tuning fewer parameters." Instead, it implements "hierarchical routing" at both the layer level (adaptively determining the number of layers per sample) and token level (filtering irrelevant tokens via Spearman rank loss + MoE expert routing). This allows the model to concentrate computational power on representations containing actual forgery clues, achieving gains of +2.3% and +3.8% in Video-level AUC across datasets and forgery methods, respectively.
- DiffusionFF: A Diffusion-based Framework for Joint Face Forgery Detection and Fine-Grained Artifact Localization
-
DiffusionFF repurposes a pre-trained forgery detector as an "artifact encoder" and a denoising diffusion model as an "artifact decoder." Using multi-scale forgery features as conditions, it progressively generates fine-grained DSSIM artifact localization maps, which are subsequently fused back into the detector for classification, achieving SOTA results in both detection and localization tasks.
- Domain-Skewed Federated Learning with Feature Decoupling and Calibration
-
This paper proposes the F²DC framework, which separates local features of clients in federated learning into domain-robust and domain-related features using a Domain Feature Decoupler (DFD) and a Domain Feature Corrector (DFC). By calibrating domain-related features to rescue discarded category information and employing a Domain-Aware Aggregation strategy, the method consistently outperforms SOTA on three multi-domain datasets.
- DSO: Direct Steering Optimization for Bias Mitigation
-
DSO uses reinforcement learning to learn a set of linear transformations (steering) applied to activations. By formulating the objective of "decoupling occupational judgments from gender stereotypes" as an optimizable fairness reward, it enables a continuous trade-off between bias reduction and capability preservation at inference time via a tunable intensity parameter \(\omega\). It achieves SOTA on the fairness-capability trade-off while modifying less than 0.005% of the parameters.
- DualMirage: Hunting Stealthy Multimodal LLM Agents via CAPTCHAs with Contour and Adversarial Illusions
-
DualMirage embeds two types of "illusions" in a single image: colored illusory contours (Colored Abutting Grating) that humans can perceive but machines cannot, combined with adversarial perturbations that machines "see" but humans do not. This system blocks malicious multimodal agents disguised as humans (up to 100% interception rate) and actively induces them to reveal their model names (58.8% white-box, 21.9% black-box), upgrading traditional CAPTCHAs from "capability testing" to "identity hunting."
- Editprint: General Digital Image Forensics via Editing Fingerprint with Self-Augmentation Training
-
Editprint utilizes an "online editing pool" to self-augment a mere 10 original raw images into tens of millions of "imaging + post-processing" editing chains with text labels. Through Self-Augmentation Training (SAT), it learns a universal "editing fingerprint" feature, allowing it to approach or even surpass supervised methods in unannotated, zero-shot tasks such as Synthetic Image Detection (SID), Social Network Provenance (SNP), and Camera Source Identification (CSI).
- Eliminate Distance Differences Induced by Backdoor Attacks: Layer-Selective Training and Clipping to Mask Backdoor Models
-
LaySelFL is a stealthy backdoor attack designed for Federated Learning (FL). It evaluates the "sensitivity" of each layer to the backdoor goal, poisons only a few most sensitive layers, uses a constraint loss to pull the poisoned layers closer to the server model, and performs element-wise clipping on the remaining normal layers. This approach eliminates the distance difference between the backdoor model and clean models, improving overall attack effectiveness by 25% and reducing the interception rate of five SOTA distance/similarity-based defenses from 26.6% to 4%.
- Enhancing Out-of-Distribution Detection with Extended Logit Normalization
-
This paper identifies that LogitNorm leads to two types of feature collapse (dimensional collapse and origin collapse) during training. It proposes a hyperparameter-free Extended Logit Normalization (ELogitNorm), which replaces the distance-to-origin scaling factor with the distance from features to decision boundaries. This significantly enhances the performance of various post-hoc OOD detection methods and improves confidence calibration without compromising classification accuracy.
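For context, the baseline LogitNorm loss divides the logit vector by its L2 norm (its distance to the origin) and a temperature before applying cross-entropy. The sketch below implements only that baseline; ELogitNorm's boundary-distance scaling factor is not reproduced here, and the default `tau` and `eps` values are assumptions.

```python
import math

def logitnorm_loss(logits, label, tau=0.04, eps=1e-7):
    """Cross-entropy on logits rescaled by 1 / (tau * ||logits||_2)."""
    norm = math.sqrt(sum(z * z for z in logits)) + eps
    scaled = [z / (tau * norm) for z in logits]
    # Stable log-sum-exp for the softmax normaliser.
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_sum - scaled[label]
```

Because the norm is divided out, multiplying all logits by a constant leaves the loss essentially unchanged; this is the scaling factor that the extended variant replaces.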
- Enhancing the Security of Visual Speaker Authentication Based on Dynamic Lip-Print Analysis
-
This paper proposes "viseme combinations" as the analysis unit for Visual Speaker Authentication (VSA), extracting unique sequential viseme speaking habits as "dynamic lip-prints." Combined with a multi-layer dynamic enhancement encoder using layer-wise frame differences, the system expands authentication prompt sets without retraining or re-recording user videos. It significantly enhances resistance to replay attacks and various DeepFakes (approaching 1.0 AUC on VSA/GRID/TCD-TIMIT, with HTER as low as 0.1–0.2%).
- Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures
-
This paper proposes VIPER—the first ViT backdoor attack framework built on a dynamic Visual Prompt Generator (VPG). Through the joint optimization of triggers and prompts, it induces a new phenomenon called "Functional Fusion," where malicious logic and benign utility are compressed into the same sparse, high-amplitude parameter core. This creates a "hostage dilemma" for defenders: removing the attack via pruning inevitably destroys benign accuracy. VIPER maintains nearly 100% ASR, superior clean accuracy, and negligible inference overhead (+0.06ms).
- FeatureFool: Zero-Query Fooling of Video Models via Feature Map
-
FeatureFool proposes the first zero-query black-box adversarial attack for videos: it extracts a motion-semantic-rich feature map using Guided Back-propagation (GB) on the "maximum optical flow frame" from a public pre-trained 3D-CNN, and broadcasts it as a universal perturbation across all frames. This causes traditional video classifiers to misclassify (ASR >70%) and forces Video-LLMs to miss harmful content like violence/pornography (ASR >70%) without any queries, while maintaining high visual quality (SSIM >0.87, PSNR >28dB).
- FedAFD: Multimodal Federated Learning via Adversarial Fusion and Distillation
-
The FedAFD framework is proposed to simultaneously improve the model performance of both heterogeneous clients and the server in multimodal federated learning through a three-stage design: dual-layer adversarial alignment, granularity-aware feature fusion, and similarity-guided ensemble distillation.
- FedCART: Tackling Long-Tailed Distributions in Federated Adversarial Training via Classifier Refinement
-
To address the performance collapse of Federated Adversarial Training (FAT) under long-tailed data, FedCART decomposes the global model into a "shared feature extractor + dual classifiers." Clients utilize contrastive loss to align natural and adversarial features for robustness. The server synthesizes class-balanced virtual features using aggregated gradient prototypes and retrains an auxiliary classifier for de-biasing. FedCART outperforms SOTA methods like CalFAT in both natural and robust accuracy on long-tailed variants of CIFAR, SVHN, and FMNIST.
- FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift
-
This paper proposes FedDAP, a domain-aware prototype federated learning framework. By constructing domain-specific global prototypes and a dual prototype alignment strategy (intra-domain alignment + cross-domain contrastive), it addresses the performance degradation of global models caused by data domain shifts across clients in federated learning.
- Federated Active Learning Under Extreme Non-IID and Global Class Imbalance
-
This work systematically analyzes the impact of global class imbalance and client heterogeneity on query model selection in Federated Active Learning (FAL). Based on three core observations, it proposes FairFAL, a class-fair FAL framework featuring adaptive query model selection, prototype-guided pseudo-labeling, and two-stage uncertainty-diversity balanced sampling, which consistently outperforms all baselines across five benchmark datasets.
- FedMOP: Achieving Enhanced Privacy and Performance in Federated Learning via Momentum Orthogonal Projection
-
FedMOP applies a "momentum-evolved orthogonal shift" to the initial model before the start of each client's local training. The orthogonal component offsets non-IID drift to enhance performance, while momentum evolution transforms the shift vector into a \((d+t)\)-dimensional intractable inverse problem for attackers to protect privacy. This marks the first time "stronger privacy" and "higher accuracy" are achieved simultaneously rather than through mutual sacrifice.
- FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning
-
This paper proposes the FedRE framework, which balances performance, privacy protection, and communication overhead in model-heterogeneous Federated Learning (FL) through "entangled representation": each client aggregates all of its local representations into a single cross-class representation using normalized random weights.
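The aggregation step can be sketched as a random convex combination of a client's local representations. The function name and the matching soft label built from the same weights are assumptions added for illustration; the summary above only specifies the normalized random weights.

```python
import random

def entangle(representations, labels, num_classes, rng):
    """Collapse all local representations into one vector using random
    weights normalized to sum to 1 (hypothetical helper)."""
    raw = [rng.random() for _ in representations]
    total = sum(raw)
    weights = [r / total for r in raw]
    dim = len(representations[0])
    rep = [sum(w * r[d] for w, r in zip(weights, representations))
           for d in range(dim)]
    # Assumed companion: mix the one-hot labels with the same weights.
    soft_label = [0.0] * num_classes
    for w, y in zip(weights, labels):
        soft_label[y] += w
    return rep, soft_label

rng = random.Random(0)
rep, soft = entangle([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0, 1, 1], 2, rng)
```

Only one vector per client leaves the device, which is where the communication saving comes from, and drawing fresh weights each round means the uploaded vector does not correspond to any single sample.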
- FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models
-
Targeting flow-matching-based VLA robotics policies such as π0, this paper proposes FlowHijack, the first backdoor attack directed at "vector field dynamics." By utilizing semantic context triggers, hijacking the vector field only during the early generation stage (small \(\tau\)), and employing a dynamics mimicry regularizer, the attack achieves an Attack Success Rate (ASR) up to 100% while maintaining the original mission success rate. The generated malicious actions are kinematically indistinguishable from normal actions and can bypass existing defenses.
- Forensic-Friendly Image Manipulation via Controllable Latent Diffusion
-
FFIM is a plug-and-play controllable denoising framework that "conveniently" amplifies endogenous feature differences between edited and unedited regions during the sampling process of diffusion-based editing. This allows third-party forensic models, without any prior keys, to locate and detect tampering with high precision while preserving editing quality (Pixel-level localization F1 up to +6.6%, image-level detection AUC up to +27.3%).
- Fractal Camouflage: A Bio-Inspired Approach for Multi-Scale Adversarial Attacks in the Infrared Domain
-
Aiming at infrared pedestrian detectors, this work generates "cross-scale effective" physical adversarial perturbations (cold patches attached to clothing) using the natural self-similar structure of H-shaped fractals. Parameters are searched under black-box conditions using Particle Swarm Optimization (PSO), achieving a physical world ASR of 97.54% and a cross-dataset ASR of 99.16%, significantly outperforming existing single-scale methods.
- Frequency-domain Manipulation for Face Obfuscation
-
FreM shifts face obfuscation from the spatial domain to the frequency domain: it first decomposes the face into LL, LH, HL, and HH subbands using block DCT, applies specialized "Neutralization / Perturbation / Suppression" modules for differentiated processing of each subband, and then refines parameters image-by-image via backpropagation. It achieves a balance between being "unrecognizable to humans + recognizable to machines" while demonstrating significantly stronger robustness against reconstruction attacks (achieving the lowest PSNR).
- From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal
-
This work systematically quantifies identity leakage in frozen vision encoders like CLIP / DINOv2/v3 / SSCD on facial data from an attacker's perspective (open-set low-FAR verification + template inversion + face-background attribution). It proposes ISP, a one-time closed-form projection that linearly removes the identity subspace from embeddings, reducing linear probes to near-random performance with almost no loss in retrieval or classification utility.
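Linear subspace removal amounts to projecting each embedding onto the orthogonal complement of a set of "identity directions." The sketch below assumes those directions are already given (how ISP estimates them is not covered here) and orthonormalizes them with Gram-Schmidt; all function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(directions, tol=1e-10):
    """Gram-Schmidt: turn spanning directions into an orthonormal basis."""
    basis = []
    for v in directions:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        n = math.sqrt(dot(w, w))
        if n > tol:  # skip directions already in the span
            basis.append([wi / n for wi in w])
    return basis

def remove_subspace(embedding, basis):
    """Subtract the component of the embedding along each basis vector,
    i.e. apply (I - U U^T) for orthonormal U."""
    out = list(embedding)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out

basis = orthonormal_basis([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
cleaned = remove_subspace([3.0, 4.0, 5.0], basis)  # only the third axis survives
```

The projection is computed once and applied as a fixed linear map, and any linear probe restricted to the removed directions sees a zero signal afterwards.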
- FVBench: Benchmarking Deepfake Video Detection Capability of Large Multimodal Models
-
FVBench establishes the largest current benchmark for deepfake video detection (120,000+ videos, 42 SOTA generation/editing models, three categories: Real/AI-edited/Full AI-generated). It provides the first systematic evaluation of Large Multimodal Models' (LMMs) detection capabilities, concluding that the primary challenge is not supervised training on known fakes, but rather zero-shot/cross-generator generalization to unseen generators.
- GenBreak: Red Teaming Text-to-Image Generation Using Large Language Models
-
GenBreak fine-tunes an open-source LLM into a "red teaming agent": starting with a cold-start via SFT on two custom datasets, followed by GRPO reinforcement learning with six-way multi-objective rewards. This allows the agent to automatically generate adversarial prompts that bypass T2I safety filters, induce high-toxicity images, and maintain semantic fluency and diversity. It achieves a 70% nudity bypass rate on commercial APIs like Leonardo.Ai in a single attempt.
- Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting
-
The paper proposes CrowdGen, the first adversarial attack framework with cross-paradigm (density map + point regression) transferability. By utilizing a lightweight UNet generator and a multi-task loss (logit suppression + density suppression + GradCAM guidance + frequency domain constraints), it achieves high transferability (TR up to 1.69) across seven SOTA crowd counting models while maintaining visual stealthiness (~19dB PSNR), increasing the attack MAE by 7x on average.
- Good Can Sometimes be Bad: A Unified Attack against 3D Point Cloud Classifier by a Flexible Isotropic Resampling
-
This paper proposes UAtt3D, which unifies adversarial and backdoor attacks on 3D point clouds into a single transformation function using a differentiable "Flexible Isotropic Resampling (FIR)". It reverses the traditional paradigm—instead of hiding by minimizing perturbations, it evades detection by making the attacked point cloud quality higher than the original, achieving optimal imperceptibility while maintaining high attack success rates.
- GROW: Watermark Generation with Progressive Guidance for Diffusion Models
-
GROW reformulates diffusion model watermarking from "one-time embedding in initial noise requiring expensive DDIM inversion during extraction" to "progressive guidance using frequency-domain gradients during denoising." This allows the watermark to naturally grow into the image texture, enabling inversion-free extraction—surpassing existing methods in robustness and invisibility while achieving nearly 100x faster extraction.
- GuardTrace-VL: Detecting Unsafe Multimodel Reasoning via Iterative Safety Supervision
-
Addressing the blind spot where multimodal reasoning models provide safe final answers while leaking dangerous content in intermediate steps, this paper constructs the first image-text Question–Thinking–Answer (QTA) safety dataset, GuardTrace. Using a three-stage progressive training pipeline (SFT → DPO → Oracle-Guided Refined DPO), a 3B visual safety auditor is trained, achieving a 93.1% F1 score in unsafe reasoning detection on the self-built test set, outperforming the strongest multimodal guardrails by 13.5 percentage points.
- GVIS: Generative Vector Image Steganography
-
GVIS treats "deterministically generating a raster image via a diffusion model and then vectorizing it into an SVG" as the steganographic cover. It embeds ciphertext by perturbing the control points of cubic Bézier curves. Without requiring training, it can embed approximately 88,000 bits into a single \(256 \times 256\) image with 100% lossless extraction while maintaining file size and statistical distribution. It is the first generative steganography framework for vector images.
- Hidden Dangers of Compositional Generation: Diagnosing Semantic Safety Failures in Text-to-Image Models
-
This paper proposes CoRA (Composable Reassembly Attack): a black-box text-to-image (T2I) attack framework operating entirely in text space. It decomposes harmful intentions into a set of fine-grained visual elements that appear "harmless" in isolation, then induces the model to reassemble these elements into the original malicious semantics through iterative selection and reorganization, significantly improving attack success rates without triggering safety filters.
- Hierarchically Robust Zero-shot Vision-language Models
-
This paper transforms CLIP adversarial fine-tuning from a "flat scheme aligning only leaf/base classes" into a hierarchical scheme aligning across multiple levels of a WordNet category tree. Leveraging hyperbolic (Poincaré ball) geometry, which naturally provides different margins for different hierarchy levels, it generates more universal adversarial perturbations, improving both clean accuracy (\(62.5\%\)) and robust accuracy (\(45.4\%\)) across 15 datasets simultaneously.
- Image-based Outlier Synthesis With Training Data
-
Without relying on external data, near-manifold virtual outliers are synthesized from training images by using "gradient attribution perturbation" to destroy invariant features while preserving environmental features. Through joint training with outlier exposure and z-score normalized features, the method provides a unified solution for spurious, fine-grained, and conventional OOD detection.
- Immunizing Models Against Harmful Long-Horizon Fine-Tuning via Contractive Optimization Dynamics
-
This paper proposes CLAMP, a model immunization method against "long-horizon harmful fine-tuning." Instead of merely shaping the geometry of initial weights, it "contracts" the attacker's entire optimization trajectory—ensuring each update step is smaller than the last. This provides a closed-form upper bound on the attainable gain from step 0 to infinity. It maintains defenses across classification, generation, and autoregressive models even after thousands of fine-tuning steps, with negligible impact on benign fine-tuning capabilities.
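To see why shrinking steps can give a closed-form bound, consider a simplified version of the argument. Assume, purely for illustration (these are not the paper's stated conditions), that the attacker's updates \(\Delta_t = \theta_{t+1} - \theta_t\) contract geometrically with a factor \(\rho \in [0, 1)\), and that the harmful-task loss \(\mathcal{L}\) is \(G\)-Lipschitz. Then the total displacement is bounded by a geometric series:

\[
\|\Delta_{t+1}\| \le \rho \, \|\Delta_t\|
\;\Longrightarrow\;
\|\theta_T - \theta_0\| \le \sum_{t=0}^{T-1} \|\Delta_t\| \le \|\Delta_0\| \sum_{t=0}^{T-1} \rho^{t} \le \frac{\|\Delta_0\|}{1 - \rho},
\]

\[
\mathcal{L}(\theta_0) - \mathcal{L}(\theta_T) \le G \, \|\theta_T - \theta_0\| \le \frac{G \, \|\Delta_0\|}{1 - \rho} \quad \text{for every } T.
\]

Under these assumptions the attacker's attainable loss reduction is capped by a quantity that does not depend on the number of fine-tuning steps.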
- Improving Adversarial Transferability with Local Perturbation Augmentation
-
This paper points out that iterative adversarial attacks tend to "overfit" perturbations to the local gradient features of the proxy model, making them difficult to transfer to other models. It proposes LPAA, which uses random masks to construct multiple augmented local subspaces in each iteration, aggregates subspace gradients to push updates in a more generalized direction, and incorporates a transfer-oriented perturbation initialization strategy. LPAA significantly outperforms existing SOTA transfer attacks on both CNNs and ViTs.
- Jailbreaking Vision-Language Models via Dissonance-Guided Suffix Optimization and Image-Phrase Injection
-
DGSIP utilizes an "unaligned guide model" and the predictive distribution difference (dissonance) at each token position relative to the target VLM to perform gradient-free searches for adversarial suffixes. When suffix optimization reaches a plateau, it switches to a visual injection mechanism that renders induction phrases into the image. This alternating process achieves a 100% attack success rate on MiniGPT-4 and InstructBLIP using AdvBench and demonstrates significant transferability to black-box commercial models like GPT-4o-Mini, Gemini, and Qwen2.5-VL.
- JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization
-
JANUS reformulates jailbreaking attacks on Text-to-Image (T2I) models as a "low-dimensional distribution optimization" problem. By employing two semantically anchored Gaussian distributions for "wave interference" mixing and a lightweight policy gradient to learn optimal mixing coefficients under black-box rewards, it improves the ASR-8 on SD3.5 Large Turbo from 25.30% to 43.15% without a large model generator, exposing structural weaknesses in T2I safety pipelines.
- LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents
-
Through a systematic analysis of how pop-up injection attacks distort the layer-wise attention of GUI Agents, the authors found that deep-layer attention diverges significantly between "correct" and "incorrect" samples. They propose LaSM—a training-free, plug-and-play layer-wise scaling mechanism that amplifies attention and MLP weights specifically in middle semantic layers. This mechanism improves the defense success rate (DSR) of Qwen2-VL-7B from approximately 19% to over 66% under pop-up attacks, with minimal impact on normal task performance.
- Learning Latent Concepts for Detecting Out-of-Distribution Objects
-
UNO-Adapter injects "unknown" concepts into a fully frozen detector in a plug-and-play manner. It first abstracts the entire image into sparse concepts using unsupervised object-centric slots, then binds these concepts to the detector's instance features during inference, combined with an image-level OOD score. Without modifying any detector weights, it reduces the FPR95 on BDD-100K by up to 11.96% compared to the previous state-of-the-art.
- Logit-Margin Repulsion for Backdoor Defense
-
LMR reformulates backdoor defense as a geometric problem in logit space: using only a minimal amount of clean samples (as low as 0.1%), it first locates the backdoor class, then artificially enlarges the margin between the "backdoor class logit and the strongest competitor logit" on clean data, and prunes classification head channels strongly correlated with the backdoor. This ensures that logit shifts caused by triggers or quantization/pruning are insufficient to flip the top-1 prediction, effectively defending against both traditional backdoors and conditional backdoors (quantization/pruning-based).
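The margin idea can be sketched on a single clean sample as follows. The hinge form of the penalty, the `target_gap` parameter, and all function names are assumptions for illustration; the note only says that the margin between the backdoor-class logit and its strongest competitor is enlarged on clean data.

```python
def competitor_gap(logits, backdoor_class):
    """Gap between the strongest other logit and the backdoor-class logit."""
    others = [z for k, z in enumerate(logits) if k != backdoor_class]
    return max(others) - logits[backdoor_class]


def repulsion_penalty(logits, backdoor_class, target_gap):
    """Hinge penalty (an assumed form) that is zero once the gap is wide enough."""
    return max(0.0, target_gap - competitor_gap(logits, backdoor_class))


def top1(logits):
    return max(range(len(logits)), key=lambda k: logits[k])


def with_trigger(logits, backdoor_class, shift):
    """Model a trigger as an additive shift on the backdoor-class logit."""
    shifted = list(logits)
    shifted[backdoor_class] += shift
    return shifted


backdoor_class, shift = 2, 2.0
before = [4.0, 1.0, 3.5]   # clean logits with a narrow gap to the backdoor class
after = [4.0, 1.0, 1.0]    # clean logits after the gap has been enlarged

flipped_before = top1(with_trigger(before, backdoor_class, shift))
flipped_after = top1(with_trigger(after, backdoor_class, shift))
```

In this toy case the same logit shift flips the prediction when the gap is 0.5 but not when it is 3.0, which is the effect the summary describes.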
- MaxMark: High-Capacity Diffusion-Native Watermarking via Robust and Invertible Latent Embedding
-
MaxMark embeds watermarks into the "most reliable sign bits" of latent noise and utilizes an Invertible Neural Network (INN) to map the watermarked latent back to a standard Gaussian distribution. This achieves a "high-capacity + high-robustness + high-fidelity" latent-space watermark for latent diffusion models, improving extraction accuracy by approximately 46% compared to the strongest baseline at a full capacity of 16,384 bits.
- Meta-FC: Meta-Learning with Feature Consistency for Robust and Generalizable Watermarking
-
Meta-FC replaces the conventional "Single Random Distortion" (SRD) strategy in deep watermarking—which randomly selects one distortion per batch—with meta-learning. Each batch samples multiple distortions for meta-train and reserves one as an "unseen distortion" for meta-test, combined with a feature consistency loss to align decoder features. This allows the model to learn distortion-invariant representations. As a plug-and-play strategy applied to five existing models, it improves average accuracy by 1.59%, 4.71%, and 2.38% under high-intensity, combined, and unseen distortions, respectively.
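The per-batch sampling step described above can be sketched as below. The distortion names, the `n_train` value, and the helper name `split_distortions` are hypothetical; only the idea of sampling several distortions for meta-train and holding one out for meta-test comes from the summary.

```python
import random


def split_distortions(pool, n_train, rng):
    """Sample n_train distortions for meta-train and hold one out for meta-test."""
    chosen = rng.sample(pool, n_train + 1)   # distinct picks, no repeats
    return chosen[:-1], chosen[-1]


# Hypothetical distortion pool; the note does not list the actual set.
pool = ["jpeg", "gaussian_noise", "blur", "crop", "resize", "dropout"]
rng = random.Random(0)
meta_train, meta_test = split_distortions(pool, n_train=3, rng=rng)
```

Because the held-out distortion is never among the meta-train ones for that batch, the meta-test loss acts as a proxy for performance on unseen distortions.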
- Mitigating Error Amplification in Fast Adversarial Training
-
This paper identifies low-confidence/misclassified samples as the primary culprits behind catastrophic overfitting (CO) and the robustness-accuracy trade-off in Fast Adversarial Training (FAT). It proposes the Distribution-aware Dynamic Guidance (DDG) strategy, which dynamically allocates perturbation budgets based on sample confidence, adaptively adjusts supervision signals according to prediction status, and incorporates a weighted smoothing regularizer. DDG simultaneously mitigates CO and improves the robustness-accuracy trade-off across CIFAR-10/100 and Tiny-ImageNet.
- Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis
-
This paper identifies that existing OOD detection methods suffer from "simplicity bias," where models focus on the most easily learned local cues, making them ineffective for near-OOD cases. The authors employ Slot Attention to decompose images into object-level slots and explicitly model "object co-occurrence patterns." By categorizing test samples into single, typical, or atypical scenarios based on their co-occurrence alignment with the training distribution and designing dedicated scoring functions for each, the proposed method achieves superior robustness against both near-OOD and covariate shifts on the OpenOOD and full-spectrum OOD benchmarks.
- Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints
-
This paper reveals an overlooked "Semantic Slot Filling (SSF)" security vulnerability: LVLMs actively complete content for "seemingly benign" slots, even when the combination of these slots implies malicious intent. Based on this, a black-box, one-shot jailbreak framework StructAttack is proposed. It decomposes harmful instructions into locally benign "Lego blocks" and renders them into structured visual diagrams (mind maps/tables/sunburst charts), enticing the model to reassemble them into harmful responses. It achieves a single-query attack success rate of approximately 69% on GPT-4o.
- Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models
-
This paper proposes the MPCAttack framework, which integrates feature representations from three learning paradigms: cross-modal alignment, multi-modal understanding, and visual self-supervision. A multi-paradigm collaborative optimization strategy generates highly transferable adversarial samples, achieving SOTA attack performance on both open-source and closed-source MLLMs.
- No Way To Steal My Face: Proactive Defense Against Identity-Preserving Personalized Generation
-
To address the misuse of "identity-preserving personalized generation" in diffusion models for face theft, this paper proposes IDGuardian. It abstracts the personalization process into two stages: "identity extraction" and "identity injection." By simultaneously disrupting both stages through cross-encoder identity field confusion and guided flow identity deflection, it achieves the first universal, model-agnostic facial identity protection effective against both training-based and training-free personalization methods.
- Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection
-
This paper constructs Omni-Fake, the first social media deepfake benchmark covering four modalities (Image, Audio, Video, Audio-Visual Talking Head) with over 1 million training samples and 200,000+ strictly disjoint OOD samples, unified under "Detect-Locate-Explain" annotations. It also presents Omni-Fake-R1, a unified detector based on Qwen2.5-Omni-7B trained with "Curricular SFT + GSPO Reinforcement Learning," which outperforms single-modality SOTAs in detection, localization, explanation, and cross-generator generalization across all four modalities.
- One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control
-
O2MAG proposes a training-free few-shot anomaly generation method that synthesizes realistic anomalies from a single reference image via Tri-branch Attention Grafting (TriAG). Combined with Anomaly-Guided Optimization (AGO) to align text semantics and Dual Attention Enhancement (DAE) to ensure complete mask filling, it significantly outperforms existing methods in downstream MVTec-AD anomaly detection tasks.
- PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention
-
This paper conducts gray-box attacks on shared vision encoders of LVLMs. By employing "Prototype-Anchored Guidance + Class Token Attention Weighting + Two-stage Attention Refreshing," small perturbations (\(\epsilon=2/255\)) can universally disrupt models across tasks, achieving an average Score Reduction Rate (SRR) of 75.1%, significantly outperforming existing gray-box and black-box methods.
- PECCAVI: Overcoming the Brittleness of AI Image Watermarking Under Visual Paraphrasing Attacks
-
This paper first proposes a "Visual Paraphrasing Attack" (VPA) that can easily remove existing AI image watermarks (by captioning the image and then using a diffusion model to redraw it based on the text, creating a semantically identical but watermark-free image). It then designs PECCAVI, which embeds multi-channel watermarks into the frequency domain of "Non-Melting Points" (NMP)—saliency regions that remain stable after paraphrasing—significantly improving survival rates against paraphrasing attacks while maintaining PSNR > 30dB.
- PGA: Prior-free Generative Attack for Practical No-box Scenario
-
PGA is the first generative adversarial attack designed for the "Practical No-box Scenario" (PNS, where the attacker has only a small number of unlabeled images and no pre-trained proxies). It trains a stable self-supervised proxy from scratch using curriculum micro-robust optimization and then trains a generator via region-aware consistent perturbation learning. It produces highly transferable adversarial examples in a single inference step, surpassing existing PNS methods by over 10 percentage points in success rate while being over 100 times faster in inference.
- Phantom: Physical Object Interactions as Dynamic Triggers for NMS-Exploited Backdoors
-
This paper proposes Phantom—an object detection backdoor that requires no pixel modifications and is implanted solely by adding boxes to the annotations. By constructing "poisoned labels + forced confidence ranking" during training, it hijacks the NMS post-processing of the detector. This causes natural spatial overlap between two objects in the real world to trigger misclassification, mislocalization, or the appearance/disappearance of objects, while maintaining performance on clean samples and bypassing existing defenses.
- Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern
-
This paper constructs 3D adversarial clothing (NORP) made of two mutually non-overlapping materials: "visible-light printed fabric + aluminum film." Combined with the Spatially Discrete-Continuous Optimization (SDCO) method, which simultaneously optimizes continuous RGB pixels and discrete thermal pixels, the clothing lets the wearer evade RGB-T pedestrian detectors in both visible and thermal modalities across all viewpoints (0°–360°). It achieves an average ASR of 99.6% in the digital world and 71.0% in the physical world.
- PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing
-
This paper proposes the PinPoint benchmark, comprising 7,635 queries and 329K human-verified relevance judgments. By incorporating four dimensions (explicit negatives, multi-image queries, paraphrase variants, and demographic metadata), it reveals critical deficiencies in existing CIR methods regarding false-positive suppression, linguistic robustness, and multi-image reasoning. A training-free MLLM-based reranking method is introduced as an improved baseline.
- PoInit-of-View: Poisoning Initialization of Views Transfers Across Multiple 3D Reconstruction Systems
-
This paper identifies that the geometric core of 3D reconstruction pipelines—the SfM initialization module—is a vulnerable "Achilles' heel." The authors propose PoInit-of-View, which injects perturbations nearly invisible to the human eye into multi-view input images to specifically disrupt the local gradient consistency between views. This triggers a collapse in SfM feature matching, reducing camera registrations from nearly a hundred to single digits, thereby causing downstream MVS, NeRF, and 3DGS to fail entirely. The attack does not depend on specific reconstruction architectures and achieves black-box transfer (e.g., PSNR drops an additional 25.1% in the 3DGS \(\rightarrow\) NeRF transfer compared to single-view baselines).
- POUR: A Provably Optimal Method for Unlearning Representations via Neural Collapse
-
Existing machine unlearning methods modify only the classification head, leaving information about the forgotten class in the features. This paper elevates "unlearning" to the representation level. Using the geometry of the simplex ETF in Neural Collapse, it proves that removing a class via orthogonal projection along its direction leaves the remaining classes in an ETF. This result yields the closed-form projection operator POUR-P and its distillation variant POUR-D. These methods set new best results on both class-level and representation-level unlearning metrics on CIFAR-10/100 and PathMNIST, and come with a formal proof of optimality under the definition of representation-level weak unlearning.
- PrivateEyes: Gaze-Preserving Anonymization for Data Sharing
-
PrivateEyes utilizes a three-stage pipeline consisting of "segmentation + 3D eye pose estimation + ControlNet conditional diffusion" to re-synthesize eye images. By removing identifiable iris biometric features (reducing iris recognition rates by approximately 50%) while preserving gaze direction (achieving >10% higher gaze estimation accuracy than SOTA anonymization methods), it enables the compliant sharing of eye-tracking datasets.
- PrivSynth: Alternating and Control-Based Optimization for Privacy and Utility in Synthetic Data
-
PrivSynth models the "privacy-utility trade-off in synthetic data generation" as a bi-objective optimization problem, alternating optimization between the generator and data selection parameters. It reformulates the data selection step as a discrete-time optimal control problem solved via the Pontryagin Maximum Principle (PMP), reducing the Membership Inference Attack (MIA) success rate from 48% to approximately 2% while guaranteeing downstream utility.
- PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and VLM-Guided Optimization
-
PromptMiner is a black-box prompt stealing framework: given an image generated by a text-to-image (T2I) model, it first utilizes reinforcement learning (RL) with reward shaping to invert an accurate "Subject" prompt, and then employs VLM-guided evolutionary search to complete the "Style Modifiers". Without requiring model gradients or large-scale labeled data, it recovers prompts that replicate highly similar images, achieving CLIP similarity up to 0.958 and SBERT text alignment up to 0.751, while remaining robust to common image perturbation defenses.
- Protego: User-Centric Pose-Invariant Privacy Protection Against Face Recognition-Induced Digital Footprint Exposure
-
Protego compresses a user's 3D facial features into a pose-invariant 2D "Privacy Protection Texture" (PPT). Combined with a novel loss that makes face recognition models "hypersensitive" to protected images, it ensures that even protected face photos cannot be retrieved or matched against each other. This protects the user's digital footprint from search engines like Clearview AI and PimEyes, reducing black-box retrieval recall by at least half compared to existing methods.
- ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning
-
The authors propose ProxyFL, a framework that leverages classifier weights as a unified proxy to simultaneously mitigate external heterogeneity (inter-client distribution shifts) and internal heterogeneity (labeled/unlabeled distribution mismatch) in Federated Semi-Supervised Learning (FSSL), significantly outperforming existing FSSL methods across multiple datasets.
- PureProof: Diffusion-Resistant Black-box Targeted Attack on Large Vision-Language Models
-
PureProof is the first black-box targeted attack capable of resisting "Diffusion-Based Purification (DBP)": it utilizes a diffusion proxy for single-step reverse prediction to align target semantics (SRA), uses timestep-adaptive re-noising to stabilize gradients (ARA), and applies self-consistency regularization for local coherence (SCR), ensuring adversarial images induce the attacker-specified target text even after being purified by filters like DiffPure.
- R\(^2\)TUA: Reconstruction-residual Based Targeted and Untargeted Attack Against Text-Image Person Re-Identification
-
\(R^2\)TUA is the first multi-modal adversarial attack specifically designed for Text-Image Person Re-Identification (TI-ReID). Given an image and an adversarial text prompt, it utilizes progressive multi-modal fusion to inject adversarial identity attributes into the image, followed by a "reconstruction-residual" process to extract nearly invisible perturbations. This approach effectively prevents the original image from being retrieved by its true description (untargeted) while misleading the system towards an adversarial identity (targeted), outperforming all existing transferable attacks across three datasets and three target models.
- RankOOD: Class Ranking-based Out-of-Distribution Detection
-
RankOOD leverages the insight that a classifier naturally induces an inter-class ranking pattern for each ID (In-Distribution) class, while OOD samples struggle to adhere to this ranking. It first extracts a canonical rank for each class using ILP, then retrains the classifier using Plackett-Luce ListMLE loss to reinforce this ranking, and finally scores OOD based on the test sample's deviation from the canonical rank. It reduces FPR95 by 4.3% on TinyImageNet near-OOD, achieving SOTA.
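For reference, the Plackett-Luce ListMLE loss mentioned above is the negative log-likelihood of a target ordering given per-class scores. The sketch below is a plain-Python version of that standard loss; the example scores and rankings are made up, and nothing here reproduces RankOOD's ILP step or OOD score.

```python
import math


def listmle_loss(scores, ranking):
    """Negative log-likelihood of `ranking` under the Plackett-Luce model.

    scores:  one score (logit) per class
    ranking: class indices ordered from most to least preferred
    """
    ordered = [scores[c] for c in ranking]
    loss = 0.0
    for i in range(len(ordered)):
        tail = ordered[i:]
        m = max(tail)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_z - ordered[i]
    return loss


scores = [2.0, 0.5, 1.0]
agree = listmle_loss(scores, [0, 2, 1])      # ranking matches the score order
disagree = listmle_loss(scores, [1, 2, 0])   # ranking reverses the score order
```

The loss is smallest when the scores already follow the target ranking, so minimizing it pushes the classifier's logits toward the canonical order for each class.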
- RaPA: Enhancing Transferable Targeted Attacks via Random Parameter Pruning
-
RaPA discovers that adversarial perturbations in existing targeted transfer attacks over-rely on a few key parameters in the surrogate model. By applying random parameter pruning (DropConnect) at each optimization step, it implicitly adds an "importance equalization regularization" to the loss. This disperses the dependency and significantly improves targeted attack success rates across architectures (especially CNN \(\rightarrow\) Transformer).
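Random parameter pruning in the DropConnect style can be sketched as an independent Bernoulli mask over a weight matrix, redrawn at every optimization step. The matrix, the `drop_prob` value, and the function name are illustrative assumptions; no rescaling of the surviving weights is applied here, and the note does not say whether RaPA rescales.

```python
import random


def dropconnect(weights, drop_prob, rng):
    """Return a copy of `weights` with each entry zeroed independently."""
    return [
        [0.0 if rng.random() < drop_prob else w for w in row]
        for row in weights
    ]


weights = [[0.5, -1.2, 0.3], [0.8, 0.1, -0.7]]
rng = random.Random(0)

# A fresh mask per attack iteration, so no single weight is always relied on.
pruned_per_step = [dropconnect(weights, drop_prob=0.3, rng=rng) for _ in range(3)]
```

In an actual attack the surrogate's gradient would be computed with the pruned weights at each step; that part is omitted here.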
- RAVEN: Erasing Invisible Watermarks via Novel View Synthesis
-
RAVEN reformulates "erasing invisible watermarks in AI-generated images" as "observing the same scene from a different perspective." By using a frozen image-to-image diffusion model to perform a slight viewpoint shift in the latent space, combined with cross-view correspondence attention to maintain visual consistency, it achieves an average TPR@1%FPR of only 0.026 across 15 watermark methods. This represents a reduction of over 60% compared to the strongest attack baseline while maintaining superior image quality (FID 40.18) in a zero-shot setting without access to the detector or watermark algorithm.
- RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces
-
RecoverMark is a robust watermarking framework that embeds the face content itself as a watermark into the background. It achieves simultaneous tamper localization, original content recovery, and copyright verification, remaining effective even under watermark removal attacks.
- Red-teaming Retrieval-Augmented Diffusion Models via Poisoning Knowledge Bases
-
Addressing Retrieval-Augmented Diffusion Models (RAG-DMs), this paper proposes JOB (Jointly Optimized Backdoor), the first backdoor attack for black-box scenarios. By injecting a minimal number of target-class poisoned images into the knowledge base and jointly optimizing a trigger word via reinforcement learning, JOB ensures that queries with the trigger retrieve poisoned images and drive the diffusion model to generate target-class images, while maintaining normal performance for benign queries.
- Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition
-
Addressing the vicious cycle of "real data scarcity \(\rightarrow\) poor generative models \(\rightarrow\) useless synthetic data" in privacy-constrained scenarios, this paper models the synthesis process of diffusion models as a reinforcement learning (RL) problem. It utilizes a DiT pre-trained on generic domains for cold-start alignment, followed by policy fine-tuning using a triple reward—"semantic consistency + distribution coverage + expression diversity." Finally, lookahead dynamic sampling selects high-utility samples, simultaneously enhancing generative fidelity and downstream classification accuracy for person re-identification and face recognition tasks.
- RemedyGS: Defend 3D Gaussian Splatting Against Computation Cost Attacks
-
RemedyGS proposes the first black-box defense framework against "computation cost attacks" on 3DGS (such as Poison-splat, which triggers Gaussian explosion by poisoning input images to exhaust GPU resources and cause Denial-of-Service). Utilizing a two-stage "detector + purifier + adversarial training" pipeline, it purifies only images identified as poisoned, restoring computation costs to normal levels while maintaining reconstruction quality for legitimate users.
- ReMoE: Region-Mixture Experts for Adversarially-Robust Vision Transformers
-
ReMoE replaces the standard FFN in ViT with a "Region-aware Mixture-of-Experts layer"—utilizing experts across three granularities (global/center/region) coupled with attention-guided routing. By reweighting based on regional vulnerability and aligning regional attention distributions between clean and adversarial samples during adversarial training, it significantly enhances the adversarial robustness of ViTs with negligible computational overhead.
- RevINN: An End-to-End Invertible Neural Network for Reversible Adversarial Examples Generation
-
RevINN utilizes an Invertible Neural Network (INN) to "exchange/perturb" the image's own high- and low-frequency discriminative information in the wavelet domain, generating Reversible Adversarial Examples (RAEs) in a single step. This approach both misleads unauthorized models and allows authorized users to recover the original image near-losslessly, completely bypassing the degradation in image quality and attack effectiveness caused by traditional "attack-then-embed" two-stage schemes.
- Robustness Under Data Scarcity: Few-Shot Continual Adversarial Training for Evolving Threats
-
In reality, defenders often only obtain a very small number of adversarial samples to counter emerging new attacks. This paper proposes a new setting called "Few-Shot Continual Adversarial Training (FS-CAT)" and introduces a three-component toolkit: the Adversarial Boundary Loss (ADM) that pushes clean samples away from decision boundaries, GMM Prototypical Replay that synthesizes pseudo-features using Gaussian Mixture Models for memory-free replay, and the Multi-domain Balancing Loss (MDB) that pulls the update direction toward the majority of old domains. Together, these components alleviate the difficulties of robust generalization and catastrophic forgetting in few-shot scenarios on ImageNet-1K and CIFAR-100.
- Roots Beneath the Cut: Uncovering the Risk of Concept Revival in Pruning-Based Unlearning for Diffusion Models
-
This paper reveals a neglected security vulnerability in "pruning-based concept unlearning": the positions of pruned (zeroed-out) weights themselves leak conceptual information. The authors design a completely data-free and training-free attack framework that restores the recognition accuracy of erased concepts from an average of 8% to 54% within 7 minutes by merely recovering the signs of the weights.
- RunawayEvil: Jailbreaking the Image-to-Video Generative Models
-
This paper proposes RunawayEvil—the first multimodal jailbreak attack framework targeting "Image-to-Video (I2V)" models. It utilizes a "Strategy-Tactic-Action" paradigm to coordinate attacks across text and image modalities, evolving through Reinforcement Learning and LLMs to increase attack success rates on COCO2017 by 58.5%–79% compared to existing methods.
- SafeLogo: Turning Your Logos into Jailbreak Shields via Micro-Regional Adversarial Training
-
SafeLogo optimizes a "logo-level" small patch, occupying ≤2% of image pixels, into a universal jailbreak shield via min–max adversarial training. The inner loop dynamically selects the strongest current jailbreak attack, while the outer loop updates this local patch to resist it. Without modifying the VLM backbone, it significantly reduces jailbreak success rates on MM-SafetyBench, VLGuard, and FigStep, while maintaining near-original performance on benign tasks.
- SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers
-
SafeRoPE observes that in MMDiTs (e.g., FLUX), only a few "safety-critical attention heads" carry unsafe semantics, and these semantics are concentrated in low-dimensional subspaces. Consequently, it learns low-rank orthogonal rotation matrices for these specific heads to adaptively rotate unsafe components based on the "latent risk score" of each token. This precisely suppresses unsafe content—such as nudity, violence, and copyright—without modifying the 10-billion-parameter backbone or compromising normal generation quality.
- SAGA: Source Attribution of Generative AI Videos
-
SAGA upgrades the question "is this video AI-generated?" to "which generator did it come from?". By utilizing frozen Vision Large Model features and a spatio-temporal dual-layer Transformer, combined with a two-stage strategy of "binary classification pre-training followed by contrastive adaptation with 0.5% labels," it achieves five-level source attribution from real/fake to specific models across 19 video generators. It further introduces Temporal Attention Signatures (T-Sig) to provide the first visual explanation of "why different generators are distinguishable."
- SAIDO: Scene-Aware and Importance-Guided Dynamic Optimization for Generalizable AI-Generated Image Detection
-
SAIDO treats AI-generated image detection as a continual learning problem and solves it without replay: a Vision-Large Language Model (VLLM) routes images to scene-specific LoRA experts based on scene awareness, while a "neuron-level" importance-guided gradient projection based on Fisher information harmonizes plasticity and stability. This reduces the detection error rate by 44.22% in the continual learning protocol and improves open-set accuracy by 9.47%.
- Scaling Up AI-Generated Image Detection with Generator-Aware Prototypes
-
The authors identify a paradox where training AIGI detectors with an increasing number of generators leads to performance improvement followed by a decline ("Benefit then Conflict"). This is attributed to excessive heterogeneity in generated image features and the capacity bottleneck of frozen encoders. They propose GAPL, which distills thousands of generators into a small set of "generator-aware prototypes" using PCA, reorganizes arbitrary image features into this low-variance prototype space via cross-attention, and employs two-stage LoRA fine-tuning. It achieves a 90.4% mean accuracy across 6 benchmarks, outperforming the previous SOTA by 3.5%.
- SEBA: Sample-Efficient Black-Box Attacks on Visual Reinforcement Learning
-
SEBA utilizes a differentiable "Shadow Critic," a GAN-based perturbation generator, and a World Model to generate nearly imperceptible adversarial perturbations for pixel-input continuous control RL agents under black-box conditions (no access to victim policy gradients). It reduces cumulative rewards to near zero while decreasing environment/victim query volume by one to two orders of magnitude compared to RL-based attacks.
- Selective Amnesia using Contrastive Subnet Erasure for Class Level Unlearning in Vision Models
-
CSE addresses "class-level concept unlearning" for pre-trained vision models—making the model completely unable to recognize an entire semantic category (rather than just forgetting specific training samples). It avoids training and does not modify task heads; instead, it employs contrastive subnet discovery to identify a small subset of channels responsible for the target class, applies calibrated attenuation, and algebraically folds the changes into the next layer to achieve zero inference overhead, stability, and reduced collateral damage to non-target classes.
- Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack
-
This paper points out that robustness evaluations for Vision-Language Navigation (VLN) agents have traditionally relied on "weird textures that rarely appear in reality." Instead, it proposes ILA, a black-box attack framework that only manipulates global indoor lighting intensity. Its static mode (SILA) searches for a constant brightness that best disrupts navigation, while the dynamic mode (DILA) suddenly switches lights on/off at critical moments. Evaluated on two SOTA VLN models across three tasks, it significantly pushes up failure rates and lowers trajectory efficiency, revealing the hidden vulnerability of VLN to "daily lighting variations" as natural perturbations.
- SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models
-
To address the copyright tracking of misappropriated open-source Large Vision-Language Models (LVLMs), SIF first utilizes a Semantic Divergence Attack (SDA) to expose the fatal flaw of existing fingerprints—being "semantically anomalous and easily detected." It then proposes a non-intrusive fingerprinting scheme that distills text decoding watermarks into trigger images and performs robust optimization against worst-case representation perturbations. This ensures stolen models generate "semantically natural yet verifiable watermarked" responses under standard decoding. SIF achieves several times higher FMR than baselines like PLA under quantization, fine-tuning, input perturbations, and SDA defense.
- Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
-
Skyra transforms AI-generated video detection from black-box binary classification into interpretable artifact reasoning. A cold-start SFT phase on the manually annotated ViF-CoT-4K dataset teaches MLLMs to locate and explain artifacts spatio-temporally. This is followed by GRPO reinforcement learning with asymmetric rewards to encourage active artifact discovery, achieving a 26.73% absolute accuracy improvement over the second-best method on ViF-Bench.
- Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
-
This paper applies Top-k Sparse Autoencoders (SAE) to the ViT [CLS] token for the first time, disentangling entangled dense features into an interpretable sparse latent space. It finds that ID samples of the same class form stable "Class Activation Profiles" (CAP), whereas OOD samples, despite activating core features of the predicted class, fail to replicate the energy distribution shape. Based on this, it proposes the EPD score, achieving the best average FPR95 (40.96%) across multiple benchmarks.
- Stealing Split Learning Bottom Models by Recovering Embedding Geometry
-
In the context of split learning for Vertical Federated Learning (VFL), the authors propose VENOM, a "geometry-aware" model stealing attack. Instead of performing point-wise fitting of the embedding coordinates seen by the server, VENOM first uses contrastive learning to reconstruct a stable neighborhood geometric space on these embeddings. It then trains a surrogate model to align coordinates and feature shapes simultaneously while respecting the local structure where "near neighbors stay near and far neighbors stay far." This approach bypasses mainstream noise-injection and decoupling defenses, restoring stealing accuracy (especially under the strong Model Rake defense) to usable levels across 6 datasets.
- Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation
-
To address the issue of "overconfidence in head classes and lack of robustness in tail classes" in adversarial training (AT) on long-tailed data, this paper theoretically proves that perturbation intensity itself can simultaneously fix adversarial vulnerability and class imbalance. Consequently, the authors propose RobustLT, a plug-and-play method that assigns larger perturbation budgets to tail classes and smaller budgets to head classes (CPB), while gradually warming up the perturbation from 0 in early training to stabilize adversarial distribution evolution (AIW). This can be applied to any AT algorithm, improving tail-class robust accuracy by up to 7 percentage points.
- The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts
-
This work identifies a core threat that existing multimodal manipulation detection overlooks: MLLMs can generate semantically consistent deceptive narratives. It constructs the MDSM dataset with 441k semantically aligned forged samples and proposes the AMD framework based on Artifact Tokens and manipulation-oriented reasoning. With only 0.27B parameters, it achieves SOTA generalization performance in cross-domain detection: 88.18 ACC / 60.25 mAP / 61.02 mIoU.
- Thermally Activated Dual-Modal Adversarial Clothing against AI Surveillance Systems
-
This paper presents a "normally black T-shirt" that reveals adversarial patterns after 50 seconds of heating. By combining thermochromic dyes with flexible heating pads, a polygon-shaped adversarial patch is hidden within the fabric. Heating triggers color changes to deceive visible light detectors and thermal distribution changes to deceive infrared detectors, maintaining an Attack Success Rate (ASR) over 80% for pedestrian detection in real-world surveillance scenarios.
- TokenTrace: Multi-Concept Attribution through Watermarked Token Recovery
-
TokenTrace injects secret signatures of concepts simultaneously into text prompt embeddings and initial latent noise (dual conditioning). It employs a query-based retrieval module—given a generated image and a text query specifying "which concept to check"—to independently decode the corresponding secret. This allows for individual attribution of multiple concepts (objects and styles) within a single image, significantly outperforming ProMark and CustomMark in both single-concept and multi-concept attribution tasks.
- Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction
-
This paper proposes SADCA (Semantic-Augmented Dynamic Contrastive Attack), which iteratively disrupts the cross-modal semantic consistency between adversarial images and text through a dynamic contrastive interaction mechanism and a semantic augmentation module. It significantly improves adversarial transferability against Vision-Language Pre-trained (VLP) models, outperforming existing SOTA methods in both cross-model and cross-task attacks.
- Towards Human-Imperceptible Backdoor Attacks on Text-to-Image Diffusion Models
-
This paper proposes the first clean-label backdoor attack for text-to-image (T2I) diffusion models. By injecting nearly invisible perturbations into the image latent space and composite semantic triggers ("synonym replacement + sentence restructuring") into the text, the poisoned image-text pairs appear semantically consistent and normal to both humans and automated auditing tools. However, the attack is activated during inference under strict composite trigger conditions to generate attacker-predefined unsafe images, achieving an average human-evaluated attack success rate (ASR-H) of 97.2% with a 0% detection rate by mainstream NSFW filters.
- Towards Reliable Evaluation of Adversarial Robustness for Spiking Neural Networks
-
To address the "artificially high" adversarial robustness evaluation of Spiking Neural Networks (SNNs) caused by gradient vanishing—stemming from the binary and discontinuous nature of spike activations—this paper proposes ASSG (Adaptive-Sharpness Surrogate Gradient) and SA-PGD (Stable Adaptive PGD). By optimizing both gradient approximation and attack algorithms, this work significantly increases the Attack Success Rate (ASR), revealing that the adversarial robustness of current SNNs is severely overestimated.
- Towards Robust Multimodal Large Language Models Against Jailbreak Attacks
-
SAFEMLLM is the first adversarial training framework designed specifically for Multimodal Large Language Models (MLLMs). By injecting compact, learnable perturbation matrices into the token embedding layer to simulate cross-modal attacks (CoE-Attack) and iteratively updating model parameters to neutralize these perturbations, SAFEMLLM reduces the Attack Success Rate (ASR) of six jailbreak methods to near 0 in white-box scenarios while maintaining standard multimodal task performance.
- Towards Robust Vision Transformers: Path Dependency Analysis and a Simple Two-Stage Adversarial Training
-
This paper introduces a "Gradient Path Masking" diagnostic tool to dissect the internal information flow of ViT attention. It discovers that the residual path is the primary vulnerability for adversarial attacks, while the QK path carries robustness. Based on these insights, the authors design a simple two-stage adversarial training scheme (Teacher ViT provides class attention map priors + Student distillation + Residual gating), improving both clean accuracy and robustness across five ViT variants and three AT frameworks.
- Towards Stealthy and Effective Backdoor Attacks on Lane Detection: A Naturalistic Data Poisoning Approach
-
DBALD utilizes "gradient attention heatmaps to select the most sensitive locations + region-based diffusion inpainting to synthesize natural triggers," transforming lane detection backdoor triggers from conspicuous white blocks or mud-pattern noise into natural road objects like traffic cones or mud spots. Across four lane detection models, it improves the average attack success rate (ASR) by 10.87% while suppressing forensic detection rates to below 3%.
- Transform to Transfer: Boosting Adversarial Attack Transferability on Vision-Language Pre-training Models
-
To address the poor transferability of black-box adversarial examples on Vision-Language Pre-training (VLP) models, this paper proposes the Transform to Transfer Attack (TTA). It learns a set of block-level image transformations and automatically selects the combinations that best expand input diversity, then uses Boosted Integrated Gradients (Boosted IG) to sample gradients along multiple transformation paths. This reduces the attack's overfitting to the source model, improving attack success rates by up to almost 40 percentage points in cross-architecture (ALBEF ↔ CLIP) transfers.
- TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models
-
TTP introduces a "detect-then-adapt" test-time defense for CLIP. It distinguishes clean from adversarial images based on the cosine similarity shift of feature embeddings before and after image padding. Clean samples are output directly, while adversarial samples utilize a single-step trainable padding combined with a similarity-aware weighted ensemble to recover attention disrupted by perturbations. This significantly enhances adversarial robustness without retraining or compromising clean accuracy.
- Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection
-
The authors propose the Tutor-Student Reinforcement Learning (TSRL) framework, which models the training process of a deepfake detector as a Markov Decision Process (MDP). A "Tutor" (PPO agent) dynamically assigns loss weights based on the visual features and historical learning dynamics (EMA loss, forgetting counts) of each sample. Guided by "state-change" reward signals, the "Student" (detector) prioritizes learning high-value samples, significantly improving generalization in cross-dataset and cross-method evaluations.
- UniDef: Universal Defense Against Unauthorized Image Manipulation
-
UniDef applies an invisible adversarial perturbation to images, causing any diffusion-based editing or generation (SD, InstructPix2Pix, SR, Image-to-Video, Image-to-3D) to produce semantically collapsed results. Instead of perturbing single-step denoising directions, it pushes the output distribution away from the original image along the entire denoising trajectory and utilizes finite difference-based Jacobian estimation to achieve cross-model transferability without requiring specific model gradients.
- UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
-
UniGame proposes the first self-adversarial post-training framework for Unified Multimodal Models (UMM). By installing a lightweight perturber at the shared visual token interface, it enables the generation branch to actively create semantically consistent adversarial samples to challenge the understanding branch. This forms a min-max self-play that significantly improves consistency (+4.6%), understanding (+3.6%), generation, and robustness.
- Unlearning without Forgetting: Securely Removing Targeted Concepts from Large-Scale Vision-Language Open-Vocabulary Detectors
-
SafeDetect constrains the concept unlearning of open-vocabulary detectors (e.g., GroundingDINO, LLM-Det) to parameter updates within the "null space of the retained concept subspace." Combined with a one-step mean-flow unlearning target and cross-modal decoupling loss, it removes target concepts (such as faces or specific individuals) with minimal damage to retained concepts and zero-shot generalization. It achieves a 64.75% improvement in forgetting efficacy over NPO and converges 1.5× faster.
- Unleashing Stealthy Backdoor Pandemic by Infecting a Single Diffusion Model
-
The authors propose Eidolon: a backdoor is implanted once into a single text-to-image diffusion model (DM), so that the "synthetic training data" it generates inherently carries triggers and is mislabeled into a target class. Any downstream classifier trained on this augmented data becomes "passively infected" (ASR generally 95–100%), achieving a "poison once, propagate infinitely" backdoor pandemic for the first time.
- Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
-
This paper proposes Unsafe2Safe, a fully automated privacy protection pipeline utilizing a four-stage scheme: VLM privacy inspection → dual caption generation (private/public) → LLM editing instructions → text-guided diffusion editing. The approach achieves significant improvements in the VLMScore privacy metric while improving accuracy on Caltech-101 classification and OK-VQA compared to original images.
- V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs
-
This paper finds that Value features in ViTs carry more disentangled local semantic representations than Patch features. It proposes V-Attack, which achieves precise and controllable local semantic attacks on LVLMs through self-enhanced Value features and text-guided semantic manipulation, improving ASR by an average of 36%.
- VCP-Attack: Visual-Contrastive Projection for Transferable Black-Box Targeted Attacks on Large Vision-Language Models
-
VCP-Attack achieves SOTA in black-box targeted attacks on Large Vision-Language Models (LVLMs). It constrains adversarial perturbations to a low-dimensional semantic subspace derived via dynamic PCA and uses a multi-sample contrastive loss to pull adversarial features toward target semantics while pushing them away from source semantics. It reaches average Attack Success Rates (ASR) of 94.2% on open-source models and 83.1% on closed-source models, and up to 95.6% on GPT-4o.
- Verifying Neural Network Robustness with Dual Perturbations
-
VeriDou integrates "arbitrary continuous convolutional perturbations (e.g., motion blur at various angles)" and "pixel-independent noise" into a unified perturbation space, encoded as an affine network layer prepended to the original network. This allows existing DNN verifiers (αβ-CROWN / Venus / NeuralSAT) to verify robustness under such "dual perturbations" in a single pass. The results reveal that many networks judged 100% robust by existing methods yield up to 99% adversarial examples under dual perturbations.
- VisiLock: Authorizing Instruction-based Image editing with Dual Score Distillation
-
VisiLock integrates "access control" directly into the weights of instruction-based image editing models. High-quality editing is only performed when a specific "visual key" (a patch) is present in the input image; otherwise, the output degrades into a fixed "authorization required" image. It utilizes Dual Teacher Score Distillation to circumvent multi-task gradient conflicts and employs Degenerate Initialization to ensure public checkpoints are locked by default and resistant to adversarial fine-tuning for unlocking.
- VMD-FACT: A New Video Dataset and MLLM-based method for Detecting Realistic AI-Generated Video Misinformation
-
Addressing the detection blind spot where "AI-generated video misinformation is highly realistic, cross-modally consistent, and existing datasets have obvious editing artifacts," this paper utilizes a multi-agent framework to iteratively generate 9049 pairs of highly realistic claim–video forgery samples to form the RAVM dataset. It further proposes the IEEG model, which constructs a directed acyclic evidence graph consisting of "multimodal evidence + fact-checking results + their dependencies." IEEG achieves 75.99% Accuracy with 7B parameters on RAVM, surpassing 25 open/closed-source MLLMs (including Gemini 2.5 at 68.89%).
- WaTeRFlow: Watermark Temporal Robustness via Flow Consistency
-
WaTeRFlow enables high-accuracy watermark decoding from video frames even after an image undergoes "Image-to-Video (I2V)" translation. This is achieved via a FUSE module that integrates an image editing proxy, a fast video diffusion proxy, and optical flow alignment into the encoder-decoder training loop. Combined with temporal consistency and semantic preservation losses, it improves the average bit accuracy on SVD-XT from VINE's 73.92% to 84.96%, with the first frame reaching 96.93%.
- What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs
-
Targeting the intermediate features transmitted in Split DNNs (edge-side head, cloud-side tail), this paper proposes FIA-Flow, a black-box, data-efficient feature inversion framework. It first uses LFSAM to align task features with the VAE latent space, then utilizes Deterministic Flow Matching (DIFM) to pull "off-manifold" latent codes back to the natural image manifold in a single step. High-fidelity original private images are reconstructed from intermediate features using fewer than 4096 training samples.
- When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness
-
Addressing the test-time adversarial robustness of CLIP, MAC utilizes multiple augmented views to jointly execute a "counterattack," escaping the misleading influence of a single attacked image. By defining a new "corruption level" metric to adaptively adjust counterattack intensity for each view, MAC improves robust accuracy from the previous generation TTC's 6.8% to 45.2% across 20 datasets under strong PGD-100 attacks, while maintaining high-speed and low-memory tuning-free inference.
- When LoRA Betrays: Backdooring Text-to-Image Models by Masquerading as Benign Adapters
-
The study packages malicious backdoors into benign-looking LoRA adapters (freezing the base model, training only low-rank weights). By performing "semantic surgery" with contrastive loss, trigger word embeddings are aligned to attack targets without disrupting normal functions, enabling stable generation of specified content from semantically similar triggers like "cool car" with up to 99.8% ASR.
- When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models
-
This paper proposes the UPA-RFAS framework, which learns a single physical adversarial patch that achieves universal, transferable black-box attacks on VLA robot policies through a three-pronged approach: feature-space shifting, attention hijacking, and semantic misalignment.
- When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm
-
A systematic comparative analysis of the safety risk differences between MLLMs and diffusion models reveals that MLLMs are more prone to generating unsafe images due to their stronger semantic understanding (comprehending abstract or non-English prompts). Furthermore, images generated by MLLMs are more difficult for existing fake image detectors to identify; even detectors fine-tuned specifically for MLLMs can be bypassed by enriching prompt details.
- X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection
-
X-AVDT feeds a video into a pre-trained audio-driven diffusion model and extracts two signals simultaneously via DDIM inversion: inversion reconstruction residuals (appearance cues) and audio-visual cross-attention maps inside the diffusion U-Net (lip-speech alignment cues). By fusing these to perform binary classification, it leverages "generator-internal forced audio-visual consistency" as a universal signal for cross-generator generalization, achieving an average accuracy 13.1% higher than the strongest baseline.
- Your Classifier Can Do More: Towards Balancing the Gaps in Classification, Robustness, and Generation
-
Energy landscape analysis reveals that AT and JEM are complementary (AT aligns clean-adv energy distributions → robustness; JEM aligns clean-generated energy distributions → accuracy + generation). The paper proposes EB-JDAT, which models the joint distribution \(p(\mathbf{x}, \tilde{\mathbf{x}}, y)\) and uses min-max energy optimization to align the energy distributions of the three types of data. It achieves a CIFAR-10 AutoAttack robustness of 68.76% (a gain of 10.78% over SOTA AT), while maintaining a 90.39% clean accuracy and competitive generation quality with FID=27.42.
- \(\varphi\)-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models
-
\(\varphi\)-DPO frames DPO as a continual learning paradigm by using the model from the previous step as the reference policy. It introduces a fairness modulation factor \((1-p)^\gamma\), inspired by focal loss, to balance gradient contributions across different data groups. The authors prove that the gradient bias approaches zero as \(\gamma \to \infty\). The method achieves SOTA performance on CoIN and MLLM-CL benchmarks.
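As a rough illustration of the focal-style modulation described for \(\varphi\)-DPO, the sketch below weights a standard per-pair DPO loss by \((1-p)^\gamma\). It assumes that \(p\) is the implicit preference probability \(\sigma(\beta \cdot \text{margin})\) of the chosen response, which the summary does not define. The function name, arguments, and default values are hypothetical, and the per-group balancing of the paper is not modeled, so this is not the authors' implementation.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function; adequate for the moderate margins used here."""
    return 1.0 / (1.0 + math.exp(-z))


def focal_dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,   # hypothetical default, not taken from the paper
    gamma: float = 2.0,  # hypothetical default, not taken from the paper
) -> float:
    """Per-pair DPO loss scaled by a focal-style factor (1 - p) ** gamma.

    The ref_* values are log-probabilities under the reference policy,
    which the summary describes as the model from the previous step.
    ASSUMPTION: p is the implicit preference probability of the chosen
    response; the summary does not state how p is defined.
    """
    # Standard DPO margin: difference of policy/reference log-ratios.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    p = sigmoid(margin)
    # gamma = 0 recovers the plain DPO loss -log(p).
    return -((1.0 - p) ** gamma) * math.log(p)


if __name__ == "__main__":
    # A pair the policy already ranks correctly versus one it ranks wrongly.
    well_fit = focal_dpo_loss(-1.0, -9.0, -5.0, -5.0, beta=1.0)
    poorly_fit = focal_dpo_loss(-9.0, -1.0, -5.0, -5.0, beta=1.0)
    print(f"well-fit pair loss:   {well_fit:.6f}")
    print(f"poorly-fit pair loss: {poorly_fit:.6f}")
```

Under this assumption, pairs the policy already ranks correctly have \(p\) close to 1 and contribute almost nothing, so the loss, and with it the gradient, is dominated by poorly fit pairs. Setting `gamma=0` removes the modulation entirely.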