🛡️ AI Safety
📹 ICCV 2025 · 20 paper notes
- A Framework for Double-Blind Federated Adaptation of Foundation Models
-
This paper proposes BlindFed, a framework for "double-blind" federated adaptation of foundation models: the data owner never observes the model and the model owner never observes the data. It combines an FHE-friendly architectural redesign (polynomial approximation of nonlinear operations), a two-stage split learning protocol (offline knowledge distillation plus online encrypted inference), and privacy-enhancing strategies (sample permutation and random block sampling). BlindFed reaches 94.28% accuracy on CIFAR-10, approaching LoRA's 95.92%.
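The FHE-friendly redesign hinges on the fact that homomorphic schemes evaluate only additions and multiplications, so nonlinearities such as ReLU must be replaced by low-degree polynomials. A minimal sketch of that idea (the degree-4 fit and the [-5, 5] interval are illustrative choices, not the paper's):

```python
import numpy as np

# Fit a low-degree polynomial surrogate for ReLU on a bounded interval.
# An FHE backend can then evaluate the polynomial on encrypted inputs.
x = np.linspace(-5.0, 5.0, 2001)
relu = np.maximum(x, 0.0)

coeffs = np.polyfit(x, relu, deg=4)    # least-squares polynomial fit
poly_relu = np.polyval(coeffs, x)      # FHE-evaluable surrogate activation

max_err = np.max(np.abs(poly_relu - relu))
print(f"max approximation error on [-5, 5]: {max_err:.3f}")
```

The approximation error grows with the interval width and shrinks with the degree, which is exactly the accuracy/ciphertext-depth trade-off such designs must balance.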
- Active Membership Inference Test (aMINT): Enhancing Model Auditability with Multi-Task Learning
-
This paper proposes Active MINT (aMINT), a multi-task learning framework that jointly trains a MINT model alongside the audited model during training, enabling detection of whether specific data was used for training with over 80% accuracy — significantly outperforming existing passive MINT and membership inference attack methods.
- Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering
-
This paper proposes QUAD, a continual VQA method that stores and replays only the questions from previous tasks (no images), combined with attention consistency distillation to preserve intra- and inter-modal attention patterns across tasks. QUAD preserves privacy by design while achieving state-of-the-art continual VQA performance, outperforming methods that store full image–question–answer triplets.
- Backdoor Attacks on Neural Networks via One-Bit Flip
-
This paper proposes SOLEFLIP, the first inference-time backdoor attack on quantized models that requires flipping only a single bit. Through an efficient algorithm for identifying exploitable weights and bit positions, along with a corresponding trigger generation procedure, SOLEFLIP achieves an average attack success rate of 98.9% with zero degradation in clean accuracy across CIFAR-10, SVHN, and ImageNet.
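The core primitive is that flipping a single bit of a two's-complement quantized weight, especially the sign bit, can shift a logit dramatically. A toy stand-in for the search over exploitable (weight, bit) pairs (all sizes and scales here are illustrative):

```python
import numpy as np

def flip_bit(w, i, bit):
    """Return a copy of int8 weights with one bit of w[i] flipped."""
    out = w.copy()
    out.view(np.uint8)[i] ^= np.uint8(1 << bit)   # two's-complement bit flip
    return out

rng = np.random.default_rng(0)
w = rng.integers(-20, 20, size=8).astype(np.int8)   # toy quantized weights
x = rng.standard_normal(8).astype(np.float32)       # toy input activations
scale = np.float32(0.05)                            # toy dequantization scale

def logit(wq):
    return float(x @ (wq.astype(np.float32) * scale))

logit_clean = logit(w)
# Exhaustive search for the single flip that moves the logit most --
# a toy stand-in for SOLEFLIP's efficient search procedure.
i, b = max(((i, b) for i in range(w.size) for b in range(8)),
           key=lambda ib: abs(logit(flip_bit(w, *ib)) - logit_clean))
w_attacked = flip_bit(w, i, b)
print(f"flipped bit {b} of weight {i}: "
      f"logit {logit_clean:.3f} -> {logit(w_attacked):.3f}")
```

Because bit 7 changes an int8 value by 128 quantization steps, the most damaging flip is always a sign-bit flip of a weight paired with a large activation, which is why so few flips suffice.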
- Backdoor Mitigation by Distance-Driven Detoxification
-
This paper proposes Distance-Driven Detoxification (D3), which reformulates backdoor defense as a constrained optimization problem — maximizing the distance between the fine-tuned model weights and the poisoned initial weights, subject to a constraint that the clean sample loss does not exceed a threshold. This allows the model to effectively escape the "backdoor region," achieving best or second-best defense performance across 7 state-of-the-art attacks.
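The "maximize distance, subject to a clean-loss budget" formulation can be illustrated on a toy least-squares model with a simple alternating scheme (the step sizes, budget, and this particular alternation are illustrative, not D3's actual optimizer):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
w_star = rng.standard_normal(5)
y = X @ w_star                                   # toy "clean task"
w0 = w_star + 0.01 * rng.standard_normal(5)      # stands in for poisoned weights

def clean_loss(w):
    r = X @ w - y
    return float(r @ r) / len(y)

tau = clean_loss(w0) + 0.05                      # clean-loss budget (illustrative)
w = w0 + 1e-3 * rng.standard_normal(5)           # tiny kick off the start point
for _ in range(500):
    if clean_loss(w) <= tau:
        d = w - w0                               # ascend distance from w0
        w = w + 0.01 * d / (np.linalg.norm(d) + 1e-12)
    else:                                        # restore the clean loss
        w = w - 0.05 * (2.0 / len(y)) * X.T @ (X @ w - y)

print(f"distance escaped: {np.linalg.norm(w - w0):.3f}, "
      f"clean loss {clean_loss(w):.4f} vs budget {tau:.4f}")
```

The iterate drifts away from the poisoned initialization until it hits the clean-loss boundary and then rides it, which is the intuition behind "escaping the backdoor region" without sacrificing clean accuracy.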
- Backdooring Self-Supervised Contrastive Learning by Noisy Alignment
-
This paper proposes Noisy Alignment (NA), a method that enhances backdoor attacks against self-supervised contrastive learning by explicitly suppressing noise components in poisoned images. The attack is formulated as a 2D image layout optimization problem, and theoretically optimal layout parameters are derived. NA achieves up to 45.9% improvement in ASR on ImageNet-100.
- Client2Vec: Improving Federated Learning by Distribution Shifts Aware Client Indexing
-
This paper proposes the Client2Vec mechanism, which leverages a CLIP encoder and a Distribution Shifts Aware Index Generation Network (DSA-IGN) to generate, prior to federated training, an index vector for each client that encodes both label and feature distribution information. The resulting indices are then used to improve three key stages of FL: client sampling, model aggregation, and local training.
- Controllable Feature Whitening for Hyperparameter-Free Bias Mitigation
-
This paper proposes the Controllable Feature Whitening (CFW) framework, which eliminates linear correlations between target features and bias features via whitening transformations to mitigate model bias. The approach requires neither adversarial training nor additional regularization hyperparameters, and supports smooth interpolation between demographic parity and equalized odds through a single weighting coefficient.
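The mechanism rests on a standard fact: whitening the stacked feature vector drives its covariance to the identity, so every linear correlation between the target block and the bias block vanishes. A minimal ZCA-whitening sketch (this is the generic operation, not CFW's controllable weighting):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
bias = rng.standard_normal((n, 3))
# target features linearly contaminated by the bias features
target = rng.standard_normal((n, 4)) + bias @ rng.standard_normal((3, 4))

Z = np.hstack([target, bias])
Z = Z - Z.mean(axis=0)
cov = Z.T @ Z / n
vals, vecs = np.linalg.eigh(cov)
W = vecs @ np.diag(vals ** -0.5) @ vecs.T       # ZCA whitening matrix
Zw = Z @ W                                      # whitened features

# cross-covariance between the whitened target and bias blocks
cross = Zw[:, :4].T @ Zw[:, 4:] / n
print(f"max |cross-covariance| after whitening: {np.abs(cross).max():.2e}")
```

Since the transform is a closed-form linear map, no adversarial game or extra regularization hyperparameter is needed, which is the paper's main practical selling point.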
- FakeRadar: Probing Forgery Outliers to Detect Unknown Deepfake Videos
-
This paper proposes FakeRadar, a deepfake video detection framework that actively generates outlier samples simulating unknown forgeries in the feature space via Forgery Outlier Probing, and designs an Outlier-Guided Tri-Training strategy with three-class optimization (Real/Fake/Outlier). FakeRadar significantly outperforms existing methods on cross-dataset and cross-manipulation evaluations.
- FedMeNF: Privacy-Preserving Federated Meta-Learning for Neural Fields
-
This paper is the first to study federated meta-learning for Neural Fields (NFs) under private data settings. It reveals the severe privacy leakage mechanisms of existing federated meta-learning methods on neural field tasks, and proposes FedMeNF, which regularizes private information in local meta-gradients via a privacy-preserving loss function, effectively protecting client data privacy while retaining fast adaptation capability.
- FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation
-
This paper proposes FedVLA — the first federated learning framework for Vision-Language-Action (VLA) models — comprising three synergistic components: Instruction-Oriented Scene-Parsing (IOSP) for task-aware feature extraction, Dual Gating Mixture-of-Experts (DGMoE) for adaptive knowledge routing, and Expert-Driven Aggregation (EDA) for effective cross-client knowledge integration, achieving task success rates comparable to centralized training while preserving data privacy.
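The routing backbone of a mixture-of-experts layer can be sketched with generic top-k softmax gating (this is a plain single-gate illustration; FedVLA's dual-gating design is more elaborate and is not reproduced here):

```python
import numpy as np

def topk_gate(h, W_gate, k=2):
    """Route a token to its k best experts with renormalized softmax weights."""
    logits = h @ W_gate                        # one score per expert
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    return top, w / w.sum()

rng = np.random.default_rng(0)
d, n_experts = 16, 4
W_gate = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

h = rng.standard_normal(d)
idx, weights = topk_gate(h, W_gate, k=2)
out = sum(wt * (h @ experts[i]) for i, wt in zip(idx, weights))
print(f"routed to experts {idx.tolist()} with weights {np.round(weights, 3)}")
```

In a federated setting, per-expert routing statistics like these are what an expert-driven aggregation rule can exploit to weight client contributions.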
- Find a Scapegoat: Poisoning Membership Inference Attack and Defense to Federated Learning
-
This paper proposes FedPoisonMIA, a poisoning-based membership inference attack for federated learning that maximizes angular deviation, along with a defense mechanism called Angular Trimmed-mean (ATM) that filters malicious gradients via angular distance.
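An angular trimmed-mean style filter can be sketched in a few lines: drop the updates with the largest angle to the coordinate-wise mean, then average the rest (a toy illustration of the ATM idea, not the paper's exact rule):

```python
import numpy as np

def angular_trimmed_mean(updates, trim=1):
    """Discard the `trim` updates with the largest angle to the mean update."""
    G = np.asarray(updates)
    mean = G.mean(axis=0)
    cos = G @ mean / (np.linalg.norm(G, axis=1) * np.linalg.norm(mean) + 1e-12)
    angles = np.arccos(np.clip(cos, -1.0, 1.0))
    keep = np.argsort(angles)[: len(G) - trim]  # smallest angles survive
    return G[keep].mean(axis=0)

rng = np.random.default_rng(0)
base = rng.standard_normal(10)                      # honest update direction
benign = base + 0.1 * rng.standard_normal((9, 10))  # nine benign clients
malicious = -3.0 * base                             # one poisoned update
updates = np.vstack([benign, malicious])

agg = angular_trimmed_mean(updates, trim=1)
print(f"aggregation error: {np.linalg.norm(agg - base):.3f}")
```

Because the poisoned update points away from the consensus direction, its angle to the mean is the largest and it is trimmed before averaging.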
- FRET: Feature Redundancy Elimination for Test Time Adaptation
-
This paper proposes Feature Redundancy Elimination (FRET) as a novel perspective for test-time adaptation (TTA), observing that embedding feature redundancy increases significantly under distribution shift. Two methods are designed: S-FRET (direct minimization of the redundancy score) and G-FRET (GCN-based attention-redundancy decomposition with bi-level optimization). G-FRET achieves state-of-the-art performance across multiple architectures and datasets.
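The observation driving FRET can be made concrete with one simple redundancy measure, the mean absolute off-diagonal correlation between feature channels (S-FRET's exact score may differ; this is an illustrative proxy):

```python
import numpy as np

def redundancy_score(F):
    """Mean absolute off-diagonal correlation between feature channels."""
    C = np.corrcoef(F, rowvar=False)
    off = C[~np.eye(C.shape[0], dtype=bool)]
    return float(np.abs(off).mean())

rng = np.random.default_rng(0)
indep = rng.standard_normal((500, 8))                     # independent channels
shared = rng.standard_normal((500, 1))                    # one common factor
redundant = shared + 0.3 * rng.standard_normal((500, 8))  # channels share it

print(f"independent: {redundancy_score(indep):.3f}, "
      f"redundant: {redundancy_score(redundant):.3f}")
```

Features dominated by a shared factor score far higher, mirroring the paper's claim that distribution shift inflates embedding redundancy and that minimizing such a score is a usable adaptation signal.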
- LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement
-
This paper proposes LoRA-FAIR, which introduces a server-side residual correction term \(\Delta\mathbf{B}\) to simultaneously address two fundamental challenges in federated LoRA fine-tuning — server-side aggregation bias and client-side initialization staleness — consistently outperforming existing federated fine-tuning methods on ViT and MLP-Mixer without incurring additional communication overhead.
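The aggregation bias being corrected is easy to exhibit: averaging the LoRA factors separately gives the product of means, not the mean of products. A toy sketch of the bias and of a residual correction in the spirit of \(\Delta\mathbf{B}\) (this exact least-squares form is an illustration, not the paper's formula):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, n_clients = 12, 10, 2, 5
# per-client LoRA factors: client i's update is B_i @ A_i
Bs = [rng.standard_normal((d, r)) for _ in range(n_clients)]
As = [rng.standard_normal((r, k)) for _ in range(n_clients)]

ideal = sum(B @ A for B, A in zip(Bs, As)) / n_clients   # mean of products
B_bar = sum(Bs) / n_clients
A_bar = sum(As) / n_clients
naive = B_bar @ A_bar                                    # product of means

# server-side residual correction applied to the B factor
delta_B = (ideal - naive) @ np.linalg.pinv(A_bar)
corrected = (B_bar + delta_B) @ A_bar

err_naive = np.linalg.norm(ideal - naive)
err_corr = np.linalg.norm(ideal - corrected)
print(f"naive error {err_naive:.3f} -> corrected error {err_corr:.3f}")
```

The correction only needs the already-communicated factors, which is consistent with the claim of no extra communication overhead.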
- Mind the Cost of Scaffold! Benign Clients May Even Become Accomplices of Backdoor Attack
-
This paper proposes BadSFL, the first backdoor attack tailored to the Scaffold federated learning algorithm. By manipulating control variates, BadSFL turns benign clients into unwitting accomplices. Combined with GAN-based data augmentation and an optimization strategy that predicts the global model's future convergence direction, BadSFL achieves backdoor persistence lasting 60+ rounds after the attack ceases in non-IID settings—three times longer than baseline methods.
- Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers
-
This paper proposes SARDFQ to address semantic distortion and semantic insufficiency in data-free quantization (DFQ) of ViTs. Attention Prior Alignment (APA) guides synthetic images to match the attention patterns of real images, while Multi-Semantic Reinforcement (MSR) enriches local patch semantics. SARDFQ achieves a 15.52% Top-1 accuracy improvement on ImageNet W4A4 ViT-B.
- SpecGuard: Spectral Projection-based Advanced Invisible Watermarking
-
SpecGuard embeds watermark information into the spectral domain of high-frequency subbands obtained via wavelet decomposition (approximated through FFT-based spectral projection). The encoder employs a strength factor to enhance robustness, while the decoder applies a learnable threshold derived from Parseval's theorem for bit recovery. The method achieves high image quality (PSNR > 42 dB) alongside comprehensive robustness against distortion, regeneration, and adversarial attacks, surpassing existing SOTA methods.
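The spectral-domain embedding principle can be sketched with a plain FFT: write each bit into the magnitude of a chosen frequency coefficient (mirroring its conjugate to keep the image real), then decode by thresholding magnitudes. The slot positions and strength levels below are illustrative, and this omits SpecGuard's wavelet stage and learned components:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32))                  # toy grayscale image
bits = [1, 0, 1, 1]
slots = [(3, 5), (7, 2), (5, 9), (11, 4)]   # mid-frequency coefficients
hi, lo = 80.0, 5.0                          # magnitude levels (illustrative)

F = np.fft.fft2(img)
for bit, (u, v) in zip(bits, slots):
    mag = hi if bit else lo
    phase = F[u, v] / (np.abs(F[u, v]) + 1e-12)
    F[u, v] = mag * phase
    F[-u % 32, -v % 32] = np.conj(F[u, v])  # keep the image real-valued
marked = np.real(np.fft.ifft2(F))

# decode: threshold each slot's magnitude halfway between the two levels
Fm = np.fft.fft2(marked)
decoded = [int(np.abs(Fm[u, v]) > (hi + lo) / 2) for u, v in slots]
print(decoded)
```

A larger gap between the two magnitude levels plays the role of the strength factor: it buys robustness to distortion at the cost of visible artifacts.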
- Staining and Locking Computer Vision Models without Retraining
-
This paper proposes novel algorithms for staining (watermark embedding) and locking (usage protection) of pretrained vision models without any retraining or fine-tuning. The approach directly modifies a small number of weights to implant highly selective detector neurons, provides theoretically computable false positive rate guarantees, and is validated on image classification and object detection models.
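The "highly selective detector neuron" idea can be illustrated by overwriting a single neuron so it fires only on inputs aligned with a secret key, with no retraining (the layer sizes and the 0.9 threshold are illustrative, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.standard_normal((16, d)) * 0.1      # a "pretrained" layer's weights
b = np.zeros(16)

key = rng.standard_normal(d)
key /= np.linalg.norm(key)                  # secret unit-norm key pattern

# Implant: overwrite ONE neuron so its pre-activation exceeds zero only
# when the input correlates strongly with the key.
W[0] = key
b[0] = -0.9

def detector_active(x):
    return bool(np.maximum(W @ x + b, 0.0)[0] > 0.0)

probes = rng.standard_normal((200, d))
probes /= np.linalg.norm(probes, axis=1, keepdims=True)
false_positives = sum(detector_active(p) for p in probes)
print(detector_active(key), false_positives)
```

Because a random unit vector's correlation with the key concentrates around zero at this dimension, the false-positive rate of such a neuron can be bounded analytically, which is the source of the paper's computable guarantees.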
- Towards Adversarial Robustness via Debiased High-Confidence Logit Alignment
-
This paper reveals that inverse adversarial attacks in adversarial training introduce spurious correlations by shifting model attention toward background features. The proposed DHAT method addresses this bias through two components—Debiased High-confidence Logit Regularization (DHLR) and Foreground Logit Orthogonal Enhancement (FLOE)—achieving state-of-the-art adversarial robustness on CIFAR-10/100 and ImageNet-1K.
- Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection
-
This paper proposes FakeSTormer, a fine-grained generative deepfake video detection framework that simultaneously models temporal and spatial vulnerability regions via multi-task learning, coupled with a Self-Blended Video (SBV) data synthesis strategy to generate high-quality forgery samples. Trained exclusively on real data, it achieves state-of-the-art generalization across multiple cross-dataset benchmarks.