🛡️ AI Safety

🧪 ICML2025 · 37 paper notes

📌 Same area in other venues: 📷 CVPR2026 (145) · 🔬 ICLR2026 (141) · 💬 ACL2026 (5) · 🧪 ICML2026 (114) · 🤖 AAAI2026 (45) · 🧠 NeurIPS2025 (73)

🔥 Top topics: Adversarial Robustness ×9 · Federated Learning ×6 · Reinforcement Learning ×2

A Certified Unlearning Approach without Access to Source Data: This paper proposes the first certified unlearning framework that does not require access to the original training data. By leveraging a surrogate dataset to approximate the statistical properties of the original data, and employing a noise scaling mechanism based on the statistical distance between the source and surrogate distributions, it achieves provable data deletion guarantees.
Accelerating Spectral Clustering under Fairness Constraints: The fair spectral clustering (Fair SC) problem is formulated into a Difference-of-Convex (DC) optimization framework. By employing a variable augmentation strategy and an ADMM-type algorithm, expensive eigendecomposition computations are avoided, achieving significant acceleration on large-scale problems.
Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection: An Adaptive Multi-prompt Contrastive Network (AMCN) is proposed to perform high-quality OOD detection under few-shot ID label conditions by generating three classes of adaptive textual prompts (learnable ID prompts, label-fixed OOD prompts, and label-adaptive OOD prompts) combined with class-adaptive thresholds, significantly outperforming existing few-shot OOD detection methods.
Adversarial Inception Backdoor Attacks against Reinforcement Learning: Proposes the "inception" backdoor attack framework, which inserts triggers into the RL agent's training trajectories and replaces high-reward actions with targeted adversarial actions. It is the first to achieve a 100% attack success rate (ASR) under strict reward constraints while maintaining agent performance on clean tasks.
An Efficient Private GPT Never Autoregressively Decodes: This paper proposes POST (Public decOding and Secure verificaTion), a method that leverages public GPT models to generate draft tokens and securely verifies them using a private model. Exploiting the characteristic that secure decoding latency is insensitive to input length, POST achieves a 2.1× to 6.0× speedup in private inference while maintaining the same privacy guarantees and generation quality as standard secure decoding.
Avoiding Leakage Poisoning: Concept Interventions Under Distribution Shifts: This paper reveals the "leakage poisoning" phenomenon in Concept Bottleneck Models (CBMs)—where information bypassing the concept bottleneck hurts prediction accuracy under distribution shifts, leading to failed concept interventions. It proposes MixCEM, which utilizes a confidence gate to dynamically decide when to use or discard leaked information, maintaining both high accuracy and effective interventions under both in-distribution and out-of-distribution scenarios.
Breaking the n^{1.5} Additive Error Barrier for Private and Efficient Graph Sparsification: This paper breaks the \(n^{1.5}\) additive error barrier for differentially private graph cut sparsification by proposing a polynomial-time \((\varepsilon,\delta)\)-DP algorithm that reduces the additive error to \(n^{1.25+o(1)}\). The core technique is the first privacy-preserving expander decomposition algorithm.
Can One Safety Loop Guard Them All? Agentic Guard Rails for Federated Computing: Proposes Guardian-FC—the first backend-agnostic unified security framework for federated computing. By employing a finite-state safety loop (Sense→Predict→Act→Prove) on an Agentic-AI control plane, Guardian-FC uniformly regulates heterogeneous privacy mechanisms such as FHE, DP, and MPC, achieving consistent execution of a single set of guard-rail policies across all privacy backends.
Clients Collaborate: Flexible Differentially Private Federated Learning with Guaranteed Improvement of Utility-Privacy Trade-off: This paper proposes the FedCEO framework, which applies low-rank tensor proximal optimization on stacked client model parameters at the server side. By leveraging semantic complementarity among different clients, it recovers semantic information corrupted by DP noise, improving the utility-privacy trade-off bound by an order of \(O(\sqrt{d})\).
Collaborative Mean Estimation Among Heterogeneous Strategic Agents: Individual Rationality, Fairness, and Truthful Contribution: For the collaborative mean estimation problem among multiple agents with heterogeneous costs, this paper designs monetary-free mechanisms that simultaneously satisfy individual rationality (IR), incentive compatibility (IC), and fairness, achieving an \(\mathcal{O}(\sqrt{m})\) approximation ratio in the worst case. It also proves three impossibility results.
Connecting Thompson Sampling and UCB: Towards More Efficient Trade-offs Between Privacy and Regret: The authors propose the DP-TS-UCB algorithm, which establishes a connection between Thompson Sampling and UCB by restricting the number of Gaussian samples and reusing the maximum model value. This achieves a parameterized tradeoff between \(\tilde{O}(T^{0.25(1-\alpha)})\)-GDP privacy guarantees and an \(O(K\ln^{\alpha+1}(T)/\Delta)\) regret upper bound.
Convex Markov Games: A New Frontier for Multi-Agent Reinforcement Learning: Proposes the Convex Markov Game (cMG) framework, generalizing single-agent convex MDPs to multi-agent settings, which allows general convex preferences over occupancy measures (such as entropy, KL divergence, fairness penalties, and safety constraints). It proves the existence of pure-strategy Nash equilibria and designs a differentiable Projected Gradient Loss (PGL) algorithm to approximate equilibria.
De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks: This paper presents the first systematic evaluation of the vulnerability of protective perturbation-based voice cloning (VC) defense methods against adversarial purification. It proposes PhonePuRe, a two-stage "Purification-Refinement" framework that utilizes a phoneme-guided diffusion model to effectively eliminate protective perturbations. This allows voice cloning models to accurately replicate speaker characteristics again, revealing the fundamental limitations of existing defense schemes.
Disparate Conditional Prediction in Multiclass Classifiers: This paper extends the Disparate Conditional Prediction (DCP) metric from binary to multiclass classification. Using local optimization and linear programming methods, it provides upper and lower bound estimates for the degree of fairness deviation of multiclass classifiers. This supports fairness auditing in two scenarios: when the confusion matrix is known, or when only population-level statistics are available.
Distributed and Decentralised Training: Technical Governance Challenges in a Shifting AI Landscape: This paper systematically distinguishes between two emerging paradigms: distributed training (multi-data centre) and decentralised training (community-driven). It analyzes how low-communication training algorithms (such as DiLoCo) enable these two paradigms and delves into the challenges and opportunities they present for technical AI governance (compute structuring, capability proliferation, and shut-down capability).
Doubly Robust Fusion of Many Treatments for Policy Learning: A Calibration-Weighted Treatment Fusion method is proposed to reduce the dimensionality of the action space by doubly robustly merging treatment arms with similar effects, enabling existing multi-armed policy learning methods (such as policy trees) to be efficiently applied to individualized recommendation scenarios with a large number of treatment options.
Enhancing Certified Robustness via Block Reflector Orthogonal Layers and Logit Annealing Loss: This paper proposes an efficient low-rank orthogonal layer parameterization method (BRO Layer) and an annealing-based loss function (Logit Annealing Loss) to construct BRONet, a Lipschitz neural network with stronger certified robustness, achieving SOTA on CIFAR-10/100, Tiny-ImageNet, and ImageNet.
Faster Rates for Private Adversarial Bandits: Proposes a simple and efficient non-private to private reduction framework for the differentially private adversarial bandits problem. By utilizing batched losses and Laplace noise, it achieves an \(O(\sqrt{KT/\varepsilon})\) regret bound, proves for the first time a separation between central DP and local DP for this problem, and provides the first private algorithm for bandits with expert advice.
FicGCN: Unveiling the Homomorphic Encryption Efficiency from Irregular Graph Convolutional Networks: The FicGCN framework is proposed to address the fundamental conflict between the irregular sparsity of GCNs and the SIMD computation pattern of homomorphic encryption. By introducing three innovations—latency-aware packing, Sparse Intra-Ciphertext Aggregation (SpIntra-CA), and region-based node ordering—FicGCN achieves up to \(4.10\times\) end-to-end acceleration on large-scale graphs such as Corafull.
Fully Heteroscedastic Count Regression with Deep Double Poisson Networks: This paper proposes Deep Double Poisson Networks (DDPN), which achieve full heteroscedasticity in discrete count regression by outputting parameters of the Double Poisson distribution. Supporting arbitrarily high or low predictive variances, DDPN comprehensively outperforms existing baselines in accuracy, calibration, and OOD detection.
Generalization in Federated Learning: A Conditional Mutual Information Framework: A generalization analysis framework for Federated Learning based on Conditional Mutual Information (CMI) is proposed, which unifies the characterization of the participation gap and the out-of-sample gap for the first time, and reveals the intrinsic connection between differential privacy and generalization.
Identifying and Understanding Cross-Class Features in Adversarial Training: From the perspective of class-level feature attribution, this work reveals how "cross-class features" in adversarial training (AT) are first learned and then forgotten, offering a unified explanation for both robust overfitting and the advantages of soft-label training.
Improving the Variance of Differentially Private Randomized Experiments through Clustering: The Cluster-DP mechanism is proposed to leverage non-sensitive clustering structure information to improve the privacy-variance trade-off of causal effect estimation in differentially private randomized experiments. Without sacrificing privacy guarantees, a more homogeneous clustering structure significantly reduces the variance loss of ATE estimation.
On Differential Privacy for Adaptively Solving Search Problems via Sketching: This work extends differential privacy from numerical estimation to search problems (which require returning a solution vector rather than a single scalar) for the first time. Under a mild sparse nearest neighbor assumption, it proposes an algorithm that correctly answers \(T\) adaptive approximate nearest neighbor queries using \(\tilde{O}(\sqrt{T} \cdot s)\) copies of the data structure, while also providing adaptive regression data structures that depend on the condition number.
Privacy-Shielded Image Compression: Defending Against Exploitation from Vision-Language Pretrained Models: Privacy-Shielded Image Compression (PSIC) is proposed. By injecting condition-triggered biases during the decoding stage of learned image compression, it achieves dual-mode decoding from a single bitstream. The default mode preserves visual perceptual quality while shielding against the semantic understanding of VLP models, whereas the authorized mode fully recovers image semantics, thereby providing users with plug-and-play privacy protection within the compression pipeline.
Private Model Personalization Revisited: Proposes the Private FedRep algorithm, which learns a shared low-dimensional embedding \(U^* \in \mathbb{R}^{d \times k}\) (\(k \ll d\)) via an alternating minimization framework under user-level differential privacy (DP) constraints. This reduces the privacy error term by a factor of \(\widetilde{O}(dk)\) compared to the prior work by Jain et al., and generalizes to a broader class of sub-Gaussian distributions (rather than being restricted to Gaussian distributions). Additionally, dimension-free classification risk bounds are provided using the Johnson-Lindenstrauss transform.
Quadratic Upper Bound for Boosting Robustness: By leveraging the convexity of the cross-entropy loss with respect to logits, a quadratic upper bound (QUB) for the adversarial training loss is derived. This serves as a plug-and-play loss function replacement for existing fast adversarial training methods, significantly boosting robustness.
Relative Error Fair Clustering in the Weak-Strong Oracle Model: Proposes the first fair \(k\)-median clustering algorithm achieving \((1+\varepsilon)\) approximation under the weak-strong oracle model, requiring only \(\text{poly}(k \log n / \varepsilon)\) expensive strong oracle queries, representing a fundamental improvement over previous constant-factor approximations greater than 10.
Rethinking the Bias of Foundation Model under Long-tailed Distribution: This work reveals that fine-tuning foundation models on long-tailed tasks is doubly affected by "parameter imbalance" (pre-training data bias) and "data imbalance" (downstream data bias). It discovers that parameter imbalance is more critical and cannot be resolved by existing logit adjustment methods. It proposes a method based on causal backdoor adjustment to eliminate the confounding effect of incomplete semantic factors, achieving an average improvement of approximately 1.67% across three long-tailed benchmarks.
Retraining with Predicted Hard Labels Provably Increases Model Accuracy: Under noisy labels, retraining a model on a training set relabeled with its own predicted hard labels (\(0/1\) labels) can provably increase model accuracy. Furthermore, this study proposes consensus-based retraining (retraining only on samples where the predicted labels match the given labels), which significantly improves performance with zero additional privacy cost under label DP scenarios.
Retraining with Predicted Hard Labels Provably Increases Model Accuracy: In the context of noisy labels, relabeling the training set with hard labels (0/1) predicted by the model itself and retraining can provably improve classification accuracy. Furthermore, a consensus filtering strategy is proposed (retraining only on samples where the predicted label matches the given label), which significantly boosts performance in label-differentially private training with no extra privacy cost.
SecEmb: Sparsity-Aware Secure Federated Learning of On-Device Recommender System with Large Embedding: Proposes SecEmb, a lossless secure federated recommendation protocol that exploits the sparsity of embedding updates. By using Function Secret Sharing (FSS), it protects the privacy of user-rated item indices and gradients while reducing upload/download communication overhead by up to 90× and user-side computation time by up to 70×.
Solving Probabilistic Verification Problems of Neural Networks Using Branch and Bound: This paper proposes a neural network probabilistic verification algorithm based on Branch and Bound. By iteratively refining the upper and lower bounds of the output probability, it answers "what is the probability that the network output satisfies a specific condition under a given input distribution," achieving a speedup of one to two orders of magnitude compared to existing methods.
Theoretically Unmasking Inference Attacks Against LDP-Protected Clients in Federated Vision Models: This work derives, for the first time, the theoretical upper and lower bounds on the success rate of Active Membership Inference (AMI) attacks based on fully connected and self-attention layers under Local Differential Privacy (LDP) in federated learning. It reveals that even under LDP protection, the privacy risk still depends on the privacy budget \(\varepsilon\), and the noise required to effectively mitigate such attacks severely degrades model utility.
TIMING: Temporality-Aware Integrated Gradients for Time Series Explanation: The authors propose TIMING, which improves Integrated Gradients by introducing a temporality-aware segmented random masking baseline. Additionally, they design new evaluation metrics, CPD and CPP, to address the issue of positive and negative attributions canceling each other out in current time series XAI evaluations, outperforming existing baselines across multiple real-world datasets.
Towards Trustworthy Federated Learning with Untrusted Participants: The CafCor algorithm is proposed, which injects correlated noise achieved through shared randomness among participants and integrates a novel robust aggregation method called CAF. This achieves a privacy-utility trade-off close to Central Differential Privacy (CDP), without trusting the server and in the presence of malicious participants.
Understanding Model Ensemble in Transferable Adversarial Attack: This work establishes the first theoretical framework for model ensemble adversarial attacks. It defines transferability error, decomposes it into vulnerability and diversity, and derives upper bounds using information-theoretic tools. The analysis theoretically validates three practical guidelines: "more models, higher diversity, and lower complexity."
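
As a rough illustration of the consensus-based retraining idea summarized in the "Retraining with Predicted Hard Labels Provably Increases Model Accuracy" entry above, the sketch below trains a model on noisy labels, predicts hard 0/1 labels for the training set, and retrains only on samples where the prediction matches the given label. Everything concrete here is an assumption made for illustration and is not taken from the paper: the 1-D synthetic data, the 30% random label flipping used as a stand-in for label noise, the nearest-centroid classifier, and the helper names `fit_centroids`, `predict`, `consensus_retrain`, and `accuracy`.

```python
import random


def fit_centroids(xs, ys):
    """Hypothetical 1-D nearest-centroid model: mean feature value per class."""
    means = []
    for c in (0, 1):
        members = [x for x, y in zip(xs, ys) if y == c]
        means.append(sum(members) / len(members))
    return means


def predict(centroids, xs):
    """Hard 0/1 labels: the class whose centroid is closest to each x."""
    return [0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1 for x in xs]


def consensus_retrain(xs, noisy):
    """Train on noisy labels, then retrain only where prediction == given label."""
    first = fit_centroids(xs, noisy)
    preds = predict(first, xs)
    keep = [p == y for p, y in zip(preds, noisy)]
    xs_kept = [x for x, k in zip(xs, keep) if k]
    ys_kept = [y for y, k in zip(noisy, keep) if k]
    return fit_centroids(xs_kept, ys_kept), keep


def accuracy(centroids, xs, ys):
    """Fraction of samples whose predicted label equals the reference label."""
    preds = predict(centroids, xs)
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


# Synthetic data (assumption): two Gaussian classes centred at -2 and +2.
rng = random.Random(0)
n = 2000
true = [rng.randint(0, 1) for _ in range(n)]
xs = [rng.gauss(2.0 if y == 1 else -2.0, 1.0) for y in true]
# Label noise (assumption): each label is flipped independently with prob. 0.3.
noisy = [1 - y if rng.random() < 0.3 else y for y in true]

model, keep = consensus_retrain(xs, noisy)

# Share of wrong labels before and after the consensus filter.
noise_before = sum(a != b for a, b in zip(noisy, true)) / n
kept_pairs = [(a, b) for a, b, k in zip(noisy, true, keep) if k]
noise_after = sum(a != b for a, b in kept_pairs) / len(kept_pairs)
clean_accuracy = accuracy(model, xs, true)
```

The consensus filter reads only the noisy labels that were already available and the model's own predictions. That is consistent with the entry's remark that the step adds no privacy cost in the label-DP setting, although this sketch does not implement any DP mechanism.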