🛡️ AI Safety

📷 CVPR2025 · 27 paper notes

🔥 Top topics: Adversarial Robustness ×9 · Federated Learning ×6 · Face & Gaze ×3 · Alignment/RLHF ×2

A Simple Data Augmentation for Feature Distribution Skewed Federated Learning: Proposes FedRDN, an extremely simple data augmentation method for federated learning. During training, each client normalizes its data with a channel-wise mean and standard deviation drawn at random from other clients, rather than always using its own local statistics. Requiring only a few lines of code, it significantly mitigates the feature distribution skew problem and consistently improves performance across multiple FL methods.
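
The FedRDN idea above can be illustrated with a minimal sketch. It assumes images are stored as plain nested lists (one flat list of floats per channel) and that a pool of per-client statistics has already been shared; the helper names `channel_stats` and `normalize_with_random_stats` are hypothetical and not taken from the paper. Only the training-time step described in the summary is covered.

```python
import random


def channel_stats(images):
    """Channel-wise (mean, std) over one client's images.

    Each image is a list of channels; each channel is a flat list of floats.
    """
    stats = []
    for c in range(len(images[0])):
        values = [v for img in images for v in img[c]]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats.append((mean, var ** 0.5))
    return stats


def normalize_with_random_stats(image, stats_pool, rng, eps=1e-6):
    """Normalize one image with statistics drawn at random from the pool."""
    stats = rng.choice(stats_pool)  # a different client's statistics each call
    return [
        [(v - mean) / (std + eps) for v in channel]
        for channel, (mean, std) in zip(image, stats)
    ]
```

In a training loop, `stats_pool` would hold one `channel_stats` result per client and `normalize_with_random_stats` would be applied to each sample before it enters the model.
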
Data-free Universal Adversarial Perturbation with Pseudo-Semantic Prior: This paper proposes PSP-UAP, a data-free generation method for universal adversarial perturbations. By extracting pseudo-semantic priors from the UAP itself, utilizing input transformation enhancement, and applying a sample reweighting strategy, it achieves an average white-box fooling rate of 89.95% and significantly outperforms existing methods in black-box scenarios without requiring any training data.
DEAL: Data-Efficient Adversarial Learning for High-Quality Infrared Imaging: This work proposes DEAL (Data-Efficient Adversarial Learning), an adversarial learning framework trained on only 50 clean infrared images. Through dynamic adversarial degradation synthesis and a dual-channel interaction network (Scale Transform + Spiking Neurons), it simultaneously addresses three types of infrared degradation (stripe noise, low resolution, and low contrast) with an ultra-lightweight model of only 0.96M parameters.

DeDe: Detecting Backdoor Samples for SSL Encoders via Decoders

Detecting Backdoor Attacks in Federated Learning via Direction Alignment Inspection: Proposes the AlignIns defense method, which identifies malicious model updates in federated learning through dual-granularity direction alignment detection (global direction + fine-grained sign analysis), outperforming existing defense methods under both IID and non-IID settings.
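
A rough sketch of what a dual-granularity alignment check like the one in AlignIns could look like. The summary does not say which reference direction or sign statistic the paper uses, so both are assumptions here: the coarse score is the cosine similarity to the mean update, and the fine score is the fraction of coordinates whose sign matches the coordinate-wise majority sign. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def sign(x):
    return (x > 0) - (x < 0)


def alignment_scores(updates):
    """Return (coarse, fine) alignment scores for each client update."""
    dim = len(updates[0])
    # Assumed reference direction: the plain average of all updates.
    mean_update = [sum(u[i] for u in updates) / len(updates) for i in range(dim)]
    # Assumed fine-grained reference: majority sign per coordinate.
    majority = [sign(sum(sign(u[i]) for u in updates)) for i in range(dim)]
    scores = []
    for u in updates:
        coarse = cosine(u, mean_update)
        fine = sum(sign(u[i]) == majority[i] for i in range(dim)) / dim
        scores.append((coarse, fine))
    return scores
```

Updates whose scores sit far from those of the other clients would then be candidates for filtering; the thresholding rule is left out because the summary does not describe it.
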
Detecting Out-of-Distribution through the Lens of Neural Collapse: Based on Neural Collapse theory, this paper discovers that centered in-distribution (ID) features cluster near the weight vectors of their predicted classes and far from the origin (forming a simplex ETF). Guided by this, the NCI detector is designed by combining the angular proximity (pScore) between features and weight vectors with a feature norm filter. NCI achieves the best overall OOD detection performance on CIFAR-10/100 and ImageNet across multiple architectures while maintaining inference latency on par with the softmax baseline.
Dynamic Integration of Task-Specific Adapters for Class Incremental Learning: Achieves class incremental learning through the dynamic integration of task-specific adapters, where a lightweight adapter is trained for each task, and relevant adapters are dynamically selected and combined during inference.
FedAWA: Adaptive Optimization of Aggregation Weights in Federated Learning Using Client Vectors: Inspired by task arithmetic, FedAWA uses client vectors (the difference between local parameters and global parameters) to adaptively optimize aggregation weights in federated learning. Clients whose updates align with the global optimization direction are assigned higher weights, consistently improving FedAvg by 1–4 percentage points in non-IID scenarios.
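
To make the client-vector idea in FedAWA concrete, here is a simplified sketch. The paper optimizes the aggregation weights; this sketch swaps in a closed-form stand-in (a softmax over the cosine similarity between each client vector and the mean client vector), which is an assumption made purely for illustration. The `temperature` parameter and all function names are hypothetical.

```python
import math


def client_vector(local_params, global_params):
    """Client vector: local parameters minus global parameters."""
    return [l - g for l, g in zip(local_params, global_params)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def alignment_weights(vectors, temperature=1.0):
    """Softmax over similarity to the mean client vector (illustrative only)."""
    dim = len(vectors[0])
    mean_vec = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    sims = [cosine(v, mean_vec) for v in vectors]
    exps = [math.exp(s / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(global_params, local_params_list, temperature=1.0):
    """New global parameters: old ones plus the weighted sum of client vectors."""
    vectors = [client_vector(l, global_params) for l in local_params_list]
    weights = alignment_weights(vectors, temperature)
    return [
        g + sum(w * v[i] for w, v in zip(weights, vectors))
        for i, g in enumerate(global_params)
    ]
```

A client whose vector points against the average direction receives a smaller weight than clients that agree with it, which is the qualitative behavior the summary describes.
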
Forensics Adapter: Adapting CLIP for Generalizable Face Forgery Detection: This paper proposes Forensics Adapter, a lightweight adapter network with only 5.7M parameters that learns the blending-boundary features of forged face images in parallel with a frozen CLIP. It achieves highly generalizable cross-dataset face forgery detection via a triple objective (masked boundary prediction, patch-level contrastive learning, and sample-level contrastive learning), reaching an AUC of 0.914 on CDF-v1.
Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning: Obtains the geometry of the global embedding distribution by accurately reconstructing the global covariance matrix from local covariance matrices in federated learning. It generates augmented samples along global principal directions to localize global distribution information, improving performance by 17 percentage points on CIFAR-100 under extreme heterogeneous scenarios (\(\beta=0.01\)).
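
One standard way to recover a global covariance matrix exactly from per-client statistics is the law of total covariance: the global covariance equals the sample-weighted average of each local covariance plus the outer product of that client's mean offset from the global mean. Whether the paper uses this exact form is not stated in this note, so the sketch below should be read as an illustration of the principle. It uses population (divide-by-n) covariances, and the helper names are hypothetical.

```python
def mean_and_cov(samples):
    """Return (n, mean, population covariance) for a list of feature vectors."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    cov = [
        [sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / n
         for j in range(d)]
        for i in range(d)
    ]
    return n, mean, cov


def merge_statistics(client_stats):
    """Combine per-client (n, mean, cov) triples into the global mean and cov."""
    total = sum(n for n, _, _ in client_stats)
    d = len(client_stats[0][1])
    g_mean = [sum(n * m[i] for n, m, _ in client_stats) / total for i in range(d)]
    g_cov = [
        [
            sum(
                n * (c[i][j] + (m[i] - g_mean[i]) * (m[j] - g_mean[j]))
                for n, m, c in client_stats
            ) / total
            for j in range(d)
        ]
        for i in range(d)
    ]
    return g_mean, g_cov
```

Only counts, means, and covariances leave each client, yet the merged result matches what would be computed on the pooled features.
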
Gradient Inversion Attacks on Parameter-Efficient Fine-Tuning: This work demonstrates for the first time that adapter-based PEFT is not privacy-secure in federated learning. A malicious server can design the pre-trained model as an identity mapping, allowing patch embeddings to propagate to the adapter layer unchanged, and analytically reconstruct training images from the adapter gradients (CIFAR-100 SSIM 0.88).
H2ST: Hierarchical Two-Sample Tests for Continual Out-of-Distribution Detection: Proposes the H2ST method, which uses a hierarchical two-sample test framework to achieve OOD detection in continual learning. Each task corresponds to a feature-level source-target binary classifier layer that automatically determines ID/OOD through Clopper-Pearson confidence interval hypothesis testing (requiring no manual threshold), while also providing task ID prediction. It outperforms MSP, Energy, and ODIN across 7 benchmarks and improves computational efficiency by a factor of \((T+1)/2\).
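
The Clopper-Pearson interval mentioned for H2ST is the exact binomial confidence interval, and it can be computed with the standard library alone. The sketch below finds the bounds by bisection on the binomial tail. The final helper `distinguishable` is one plausible reading of how such an interval replaces a manual threshold (declare the two samples different when the lower bound on the source-vs-target classifier's accuracy exceeds chance); that decision rule is an assumption, not a detail confirmed by the summary.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(g, target, tol=1e-10):
    """Bisection for g(p) = target, with g non-decreasing on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    def tail(j):
        # P(X > j) as a function of p; it increases with p.
        return lambda p: 1.0 - binom_cdf(j, n, p)

    lower = 0.0 if k == 0 else _solve_increasing(tail(k - 1), alpha / 2)
    upper = 1.0 if k == n else _solve_increasing(tail(k), 1 - alpha / 2)
    return lower, upper


def distinguishable(correct, total, alpha=0.05, chance=0.5):
    """Assumed rule: samples differ if accuracy is confidently above chance."""
    lower, _ = clopper_pearson(correct, total, alpha)
    return lower > chance
```

With this rule, the only tunable quantity is the significance level `alpha`, not a score threshold that has to be calibrated per dataset.
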
Infighting in the Dark: Multi-Label Backdoor Attack in Federated Learning: This paper is the first to study non-collaborative multi-label backdoor attacks (MBA) in federated learning. It shows that prior single-label backdoor attacks, when extended to multi-label scenarios, construct similar out-of-distribution (OOD) mappings, an inherent flaw that causes mutual exclusion among attackers. It proposes Mirage, which establishes in-distribution (ID) backdoor mappings so that multiple attackers can inject backdoors independently and persistently, achieving an average attack success rate of over 97% that remains above 90% even after 900 rounds.
INACTIVE: Invisible Backdoor Attack against Self-supervised Learning: Proposes INACTIVE, the first invisible backdoor attack effective against self-supervised learning (SSL). By designing triggers in the HSV/HSL color space to escape the distribution space of SSL data augmentations, it achieves a 99.09% average attack success rate while maintaining high stealthiness with SSIM 0.9763 / PSNR 41.07dB, resisting 7 defense methods.
Joint Out-of-Distribution Filtering and Data Discovery Active Learning: Proposes the Open-Set Discovery Active Learning (OSDAL) scenario and designs the Joda algorithm. Through a three-phase workflow of training, filtering, and selection, Joda utilizes a single model to simultaneously filter OOD data and discover new categories without requiring auxiliary models, consistently achieving state-of-the-art accuracy across 18 configurations.
Leveraging Perturbation Robustness to Enhance Out-of-Distribution Detection: The authors find that the detection scores of OOD samples are more vulnerable to adversarial perturbations than those of in-distribution (ID) samples. They propose the PRO method, which at inference time uses gradient descent to search for the minimum OOD score within the \(\epsilon\)-ball around the input, enhancing ID/OOD separability and reducing FPR@95 on CIFAR-10 from 44.35% to 19.95%.
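
The search step described for PRO can be sketched as projected descent on a detection score. Several details below are assumptions made for illustration only: the ball is taken in the \(L_\infty\) norm, steps follow the sign of the gradient, and the gradient is estimated by finite differences so the example runs without a deep learning framework. `score_fn` stands in for any detector score.

```python
def sign(x):
    return (x > 0) - (x < 0)


def numeric_grad(score_fn, x, h=1e-5):
    """Central finite-difference gradient of score_fn at x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((score_fn(up) - score_fn(down)) / (2 * h))
    return grad


def min_score_in_ball(score_fn, x, eps=0.1, step=0.02, n_steps=10):
    """Smallest score found within the L-infinity eps-ball around x."""
    x_adv = list(x)
    best = score_fn(x_adv)
    for _ in range(n_steps):
        grad = numeric_grad(score_fn, x_adv)
        # Descend along the gradient sign, then project back into the ball.
        x_adv = [xi - step * sign(gi) for xi, gi in zip(x_adv, grad)]
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
        best = min(best, score_fn(x_adv))
    return best
```

The returned minimum would replace the raw score at inference time; per the finding above, scores of OOD inputs should drop further under this search than scores of ID inputs.
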
Lyapunov Stable Graph Neural Flow: Integrates Lyapunov stability theory (integer-order and fractional-order) with graph neural flows. By utilizing a learnable Lyapunov function and a projection mechanism, it dynamically constrains GNN feature trajectories within a stable space. This provides the first provable adversarial robustness guarantee for graph neural flows and is orthogonally stackable with adversarial training.
Mind the Gap: Detecting Black-box Adversarial Attacks in the Making through Query Update Analysis: This paper proposes GWAD, a black-box adversarial attack detection framework based on query update patterns (rather than input patterns). By introducing the Delta Similarity metric, it captures the inherent patterns of zeroth-order optimization in query-based attacks, achieving a near-100% detection rate with extremely low false positive rates across 8 SOTA attacks (including the adaptive attack OARS), significantly outperforming existing stateful defense methods.
MOS-Attack: A Scalable Multi-Objective Adversarial Attack Framework: This paper proposes the MOS-Attack framework, which models adversarial attacks as a multi-objective set-based optimization problem. By incorporating smooth max/min approximations, it enables the joint optimization of multiple loss functions and automatically discovers synergy patterns among them, outperforming existing state-of-the-art single-objective and ensemble attacks on CIFAR-10 and ImageNet.
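
A common smooth approximation of max and min is the scaled log-sum-exp, which is differentiable everywhere and approaches the hard max as the smoothing parameter shrinks. Whether MOS-Attack uses exactly this form is not stated in this note, so the sketch below is only an example of the kind of approximation meant; `mu` is an assumed name for the smoothing parameter.

```python
import math


def smooth_max(values, mu=0.1):
    """Log-sum-exp smoothing: max(v) <= result <= max(v) + mu * log(len(v))."""
    m = max(values)  # subtract the max before exponentiating, for stability
    return m + mu * math.log(sum(math.exp((v - m) / mu) for v in values))


def smooth_min(values, mu=0.1):
    """Smooth minimum, obtained by negating the smooth maximum."""
    return -smooth_max([-v for v in values], mu)
```

Because every loss contributes a nonzero gradient to the smoothed value, several attack losses can be optimized jointly instead of only the currently largest one.
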
NoT: Federated Unlearning via Weight Negation: This paper proposes the NoT algorithm, which achieves unlearning by multiplying the weights of specific layers of the global model by -1 (negating) to disrupt inter-layer co-adaptation, followed by fine-tuning with retained data to recover performance. It requires no extra storage or access to target data, and significantly outperforms seven baseline methods on CIFAR-10/100 and Caltech-101 with the lowest communication and computational overheads.
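
The negation step in NoT is simple enough to show directly. This sketch assumes the model weights are held in a dictionary that maps layer names to flat lists of floats; the layer names and the `negate_layers` helper are hypothetical, and the subsequent fine-tuning on retained data is not shown.

```python
def negate_layers(state, layer_names):
    """Return a copy of the weights with the selected layers multiplied by -1."""
    return {
        name: [-w for w in weights] if name in layer_names else list(weights)
        for name, weights in state.items()
    }


# Hypothetical two-layer model; only the first layer is negated.
state = {"conv1": [0.5, -0.25], "fc": [1.0, 2.0]}
negated = negate_layers(state, {"conv1"})
```

Which layers to negate is a design choice that the summary leaves open ("specific layers"), so the selection is passed in explicitly.
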
OODD: Test-time Out-of-Distribution Detection with Dynamic Dictionary: OODD maintains a dynamic OOD dictionary, implemented as a priority queue, that collects the features of potential OOD samples in real time at test time and uses them to calibrate OOD scores. Compared to SOTA methods, it reduces FPR@95 by 26.0% on CIFAR-100 Far OOD without requiring any fine-tuning.
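
A bounded priority queue of likely-OOD features, as described for OODD, can be sketched with `heapq`. The calibration rule used here (subtract the maximum cosine similarity to any stored feature from an ID-ness score) is an assumption chosen for illustration, as are the class name `OODDictionary`, the `priority` argument, and the `weight` parameter; the summary does not specify these details.

```python
import heapq
import itertools
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


class OODDictionary:
    """Keeps the `capacity` features with the highest OOD priority seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap: the least OOD-like entry is evicted first
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def add(self, feature, priority):
        item = (priority, next(self._counter), feature)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif priority > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)

    def max_similarity(self, feature):
        if not self._heap:
            return 0.0
        return max(cosine(feature, f) for _, _, f in self._heap)


def calibrated_score(id_score, feature, dictionary, weight=1.0):
    """Lower the ID-ness score of inputs that resemble stored OOD features."""
    return id_score - weight * dictionary.max_similarity(feature)
```

Since the dictionary is filled from the test stream itself, no labeled OOD data or fine-tuning is involved, which matches the test-time setting in the summary.
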
PSBD: Prediction Shift Uncertainty Unlocks Backdoor Detection: This paper proposes PSBD, which discovers that during inference with dropout enabled, a backdoor-infected model shifts its predictions on clean data toward the target class while maintaining stable predictions on backdoored data (the Prediction Shift phenomenon). Based on this insight, the authors design the Prediction Shift Uncertainty (PSU) metric to achieve SOTA backdoor training data detection.
Split Adaptation for Pre-trained Vision Transformers: This paper proposes Split Adaptation (SA), which splits a pre-trained ViT into a front-end (quantized and sent to clients) and a back-end (retained on the server). It protects data privacy via bi-level noise injection, and mitigates noise degradation and overfitting through OOD augmentation and patch retrieval augmentation, achieving efficient few-shot downstream adaptation while securing both the model and the data.
Stacking Brick by Brick: Aligned Feature Isolation for Incremental Face Forgery Detection: Introduces the SUR-LID method to address catastrophic forgetting in Incremental Face Forgery Detection (IFFD). It retains the global feature distribution of old tasks through Sparse Uniform Replay (SUR). In the Latent Incremental Detector (LID), feature isolation and decision alignment strategies "stack" the distributions of new and old tasks "brick by brick" in the latent space, so that they do not overwrite each other.
Towards General Visual-Linguistic Face Forgery Detection: VLFFD proposes a vision-language paradigm for deepfake detection. It automatically generates blended forgery images with fine-grained text descriptions using a Prompt Forgery Image Generator (PFIG), and then jointly trains on coarse- and fine-grained data using a Coarse-and-Fine Co-training (C2F) framework, significantly enhancing both the generalizability and interpretability of the detection model.
Towards Source-Free Machine Unlearning: This paper proposes a source-free machine unlearning algorithm. In the absence of original training data, it approximates the Hessian matrix of the remaining data using only the forget data and the trained model, achieving efficient unlearning for linear and mixed linear classifiers with rigorous theoretical upper-bound guarantees.
Where the Devil Hides: Deepfake Detectors Can No Longer Be Trusted: Reveals a severe security risk in Deepfake detectors: third-party data providers can inject backdoors by introducing password-controlled, adaptive, and invisible triggers. The poisoned detectors make incorrect predictions on samples carrying the specific triggers while maintaining normal performance on clean samples. The attack supports both dirty-label and clean-label scenarios.