🛡️ AI Safety

🧪 ICML2026 · 16 paper notes

🔥 Top topics: Adversarial Robustness ×3 · Federated Learning ×2

ACTG-ARL: Differentially Private Conditional Text Generation with RL-Boosted Control

This paper proposes a hierarchical framework, ACTG, which decomposes private text generation into two subtasks: feature learning and conditional text generation. Furthermore, it introduces Anchored RL, which combines reinforcement learning objectives with optimal N-out-of-K SFT anchors, thereby improving the instruction-following ability of the conditional generator while maintaining text fidelity. On biomedical data, it achieves a 20% MAUVE improvement over prior work.

Angel or Demon: Investigating the Plasticity Interventions' Impact on Backdoor Threats in Deep Reinforcement Learning

In what the authors describe as the first systematic study of its kind, 7 mainstream plasticity interventions (SAM, Shrink & Perturb, Weight Clip, SN, WD, LN, ReDo) are evaluated for their effect on backdoor attacks in deep reinforcement learning (DRL) across 14,664 experiments. Only SAM turns out to be a "demon" that significantly exacerbates backdoor threats. Based on this finding, the authors propose the "Sweeper-Converter-Connector" robust backdoor injection framework and provide a detection signal based on loss-landscape sharpness.

Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing

This work extends randomized smoothing (RS), which previously supported only a single continuous or a single discrete input, to the hybrid perturbation setting of "discrete tokens + continuous images." Through a hybrid Neyman–Pearson analysis, it derives a one-dimensional, continuous, invertible likelihood-ratio CDF, which turns the originally combinatorial discrete knapsack problem into a tractable root-finding problem. The result is the first model-agnostic certificate for "joint image-text insecurity," demonstrated on LLaVA-Guard multimodal safety filtering.

DP-KFC: Data-Free Preconditioning for Privacy-Preserving Deep Learning

This paper proposes DP-KFC: based on the observation that "the scaling of the Fisher matrix is determined by architecture, and its correlation structure can be approximated by modality-level spectral statistics," it uses structured synthetic noise (pink noise \(1/f^\alpha\) for images, Zipf sampling for text) to probe the network and reconstruct the KFAC preconditioner, without consuming privacy budget or introducing distribution shift. Under strong privacy (\(\varepsilon\le 3\)), it consistently outperforms DP-SGD and public data preconditioning methods.
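
The two kinds of structured synthetic probes mentioned above can be sketched as follows. This is a minimal illustration only: the function names, the default exponents `alpha` and `s`, and the normalization are assumptions for this sketch, and the summary does not specify how DP-KFC actually generates its probes or feeds them to the network.

```python
import numpy as np

def pink_noise_image(size, alpha=1.0, rng=None):
    """Return a (size, size) array whose power spectrum decays roughly as 1/f^alpha.

    Hypothetical helper: shapes white Gaussian noise in the Fourier domain.
    """
    rng = np.random.default_rng() if rng is None else rng
    white = rng.standard_normal((size, size))
    fx = np.fft.fftfreq(size)[:, None]
    fy = np.fft.fftfreq(size)[None, :]
    f = np.sqrt(fx**2 + fy**2)
    f[0, 0] = 1.0  # avoid division by zero at the DC bin
    # Scaling the amplitude by f^(-alpha/2) scales the power by f^(-alpha).
    spectrum = np.fft.fft2(white) * f ** (-alpha / 2.0)
    img = np.fft.ifft2(spectrum).real
    return (img - img.mean()) / img.std()

def zipf_tokens(length, vocab_size, s=1.0, rng=None):
    """Sample token ids with probability proportional to rank^(-s) (id 0 = rank 1)."""
    rng = np.random.default_rng() if rng is None else rng
    ranks = np.arange(1, vocab_size + 1, dtype=float)
    probs = ranks ** (-s)
    probs /= probs.sum()
    return rng.choice(vocab_size, size=length, p=probs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_probe = pink_noise_image(32, alpha=1.0, rng=rng)
    text_probe = zipf_tokens(128, vocab_size=1000, s=1.0, rng=rng)
    print(image_probe.shape, text_probe[:10])
```

Because neither probe touches private data, generating them consumes no privacy budget, which is the property the summary highlights.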

Dual-branch Robust Unlearnable Examples

This paper proposes DUNE, which extends unlearnable example (UE) perturbations from a single spatial domain to joint "spatial + color" dual-domain optimization. It aligns perturbation features with shift-induced labels and strengthens the perturbations with an ensemble of pre-trained models. On CIFAR-10 / ImageNet, DUNE remains robust against 7 mainstream defenses (including ECLIPSE, ISS-J, COIN), reducing average test accuracy by 14.95%–50.82% compared to 12 SOTA UE methods.

Fair Dataset Distillation via Cross-Group Barycenter Alignment

This work shows that dataset distillation (DD) amplifies biases present in the original data, and traces the effect to the interaction between "subgroup sample size imbalance" and "subgroup representational separation." The authors propose COBRA, which uses the barycenter of subgroup representations (independent of group size) as the distillation target. This simultaneously reduces EOD and improves accuracy across multiple DD frameworks.
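
A small numerical sketch of why a group-size-independent target matters. It assumes, purely for illustration, that the "barycenter" is the unweighted mean of the subgroup means; the paper may use a different notion of barycenter (for example a Wasserstein barycenter), and the 2-D Gaussian features below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical subgroups with a 9:1 size imbalance and well-separated features.
majority = rng.normal(loc=0.0, scale=0.1, size=(900, 2))
minority = rng.normal(loc=4.0, scale=0.1, size=(100, 2))

# Pooled mean: every sample counts equally, so the majority dominates.
pooled_mean = np.concatenate([majority, minority]).mean(axis=0)

# Unweighted barycenter: every subgroup counts equally, regardless of its size.
group_means = np.stack([majority.mean(axis=0), minority.mean(axis=0)])
barycenter = group_means.mean(axis=0)

print("pooled mean:", pooled_mean)   # close to 0.4 in each coordinate
print("barycenter: ", barycenter)    # close to 2.0 in each coordinate
```

The pooled mean sits near the majority group, while the barycenter sits midway between the two groups, so a target built from it does not inherit the sample-size imbalance.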

FedHPro: Federated Hyper-Prototype Learning via Gradient Matching

To address the issue in prototype-based federated learning where "directly averaging local prototypes inherits client bias," this work introduces a set of learnable global hyper-prototypes. These are optimized on the server via gradient matching so that they approximate the prototypes that centralized training would produce. Combined with client-side contrastive and alignment losses, this approach significantly improves accuracy in heterogeneous scenarios.

Frequency Matching in Spiking Neural Networks for mmWave Sensing

From a "mechanism-data alignment" perspective, this work proves that LIF spiking neurons are equivalent to a first-order IIR low-pass filter, and proposes setting the membrane decay coefficient \(\beta\) according to the discriminative spectrum of mmWave signals. This enables SNNs to achieve an average of 6.22% higher accuracy and 3.64× lower theoretical energy consumption than ANNs on four standard mmWave datasets.
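
The filter view of the LIF neuron can be checked numerically. The sketch below assumes the subthreshold update \(v[t] = \beta\, v[t-1] + x[t]\) with spiking and reset ignored; the paper's exact neuron model and input scaling may differ, and the function names are illustrative. Under that assumption the transfer function is \(H(z) = 1/(1 - \beta z^{-1})\), a first-order IIR low-pass filter whose cutoff moves lower as \(\beta\) grows.

```python
import numpy as np

def lif_subthreshold(x, beta):
    """Membrane trace v[t] = beta * v[t-1] + x[t], without spiking or reset."""
    v = np.zeros(len(x))
    prev = 0.0
    for t, xt in enumerate(x):
        prev = beta * prev + xt
        v[t] = prev
    return v

def magnitude_response(beta, omega):
    """|H(e^{j*omega})| for H(z) = 1 / (1 - beta * z^-1)."""
    return 1.0 / np.sqrt(1.0 - 2.0 * beta * np.cos(omega) + beta**2)

def cutoff_frequency(beta):
    """-3 dB cutoff in radians/sample, where the gain falls to 1/sqrt(2) of its DC value."""
    c = (4.0 * beta - 1.0 - beta**2) / (2.0 * beta)
    if c < -1.0:
        raise ValueError("no -3 dB point below Nyquist for this beta")
    return np.arccos(c)

if __name__ == "__main__":
    # The impulse response decays geometrically as beta**t.
    impulse = np.zeros(5)
    impulse[0] = 1.0
    print(lif_subthreshold(impulse, beta=0.5))
    for beta in (0.5, 0.9, 0.99):
        print(beta, cutoff_frequency(beta))
```

Choosing \(\beta\) so that the cutoff sits just above the discriminative band of the signal is the kind of frequency matching the summary describes.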

LAPRAS: Learning-Augmented PRivate Answering for Linear Query Streams

LAPRAS uses a predictor for "which queries will arrive" to split an online DP query stream into in-prediction and out-of-prediction categories. For in-prediction queries, it releases answers with low noise in one shot using the offline-optimal Matrix Mechanism. For out-of-prediction queries, it applies Smooth Allocation, estimating the total number of "unpredicted queries" online based on observed positions and allocating budget smoothly. When predictions are accurate, utility nearly matches the offline optimum; when predictions are poor, performance degrades gracefully to the online baseline.

Limits of Convergence-Rate Control for Open-Weight Safety

The authors formalize "open-weight safety" as the problem of slowing the convergence of malicious fine-tuning, proving that the largest singular value of the Hessian spectrum is determined by the lower bound of the weight spectrum. Based on this, they design the SpecDef algorithm, which can strictly slow down first- and second-order optimization. However, they also prove that an attacker can circumvent any such convergence-rate control method at the cost of a linear increase in model size.

MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification

MetaMoE merges multiple client-specific experts, each fine-tuned on private data, into a deployable MoE model without sharing that data. The core idea is to select "relevant and diverse" proxy samples from public data using a relevance-weighted DPP, enabling proxy-aligned expert training followed by context-aware router training. This aligns expert behaviors with proxy supervision and significantly outperforms similarity-only proxy selection methods like FlexOlmo.
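
One common way to realize a relevance-weighted DPP is the quality-diversity kernel \(L = \mathrm{diag}(q)\, S\, \mathrm{diag}(q)\), where \(q\) holds relevance scores and \(S\) is a similarity matrix, combined with greedy log-determinant maximization. The sketch below uses that standard construction as an assumption; the paper's actual kernel, scoring, and inference procedure are not given in this summary, and `greedy_dpp` is a hypothetical name.

```python
import numpy as np

def greedy_dpp(features, relevance, k):
    """Greedily pick k items that approximately maximize log det of the DPP kernel."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = feats @ feats.T                      # cosine similarity
    kernel = relevance[:, None] * similarity * relevance[None, :]
    kernel = kernel + 1e-6 * np.eye(len(kernel))      # jitter for numerical stability
    selected = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in range(len(kernel)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:
                best, best_val = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected

if __name__ == "__main__":
    # Items 0 and 1 are near-duplicates; item 2 is less relevant but different.
    features = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
    relevance = np.array([1.0, 0.99, 0.8])
    print(greedy_dpp(features, relevance, k=2))
```

In this toy case a similarity-only or relevance-only ranking would keep the two near-duplicates, whereas the determinant penalizes redundancy and favors the distinct third item.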

Position: Embodied AI Requires a Privacy-Utility Trade-off

This position paper argues that privacy in embodied AI cannot be addressed by single-stage patches, but must be treated as an architectural, dynamic control signal spanning the entire lifecycle—across instruction, perception, planning, and interaction. The SPINE framework is proposed, leveraging an L1-L4 four-level privacy classification matrix to coordinate agent behavior at each stage.

Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States

The authors provide the first convergent hidden-state DP upper bound for differentially private zeroth-order optimization (DP-ZOGD). By designing a hybrid "directional + isotropic" noise mechanism and constructing an auxiliary process between two neighboring trajectories, they circumvent the technical barrier that zeroth-order updates lack global Lipschitzness. The analysis reveals a previously unknown principle for DP algorithms: increasing the number of sampled directions per step, \(K\), can actually reduce privacy loss.
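
For readers unfamiliar with zeroth-order updates, the sketch below shows a generic \(K\)-direction two-point gradient estimator with two noise sources: scalar noise on each directional difference and isotropic Gaussian noise on the averaged estimate. This is only one plausible reading of "directional + isotropic"; the function name, the clipping step, and the parameters `sigma_dir` and `sigma_iso` are assumptions, and no privacy guarantee is claimed for these noise scales.

```python
import numpy as np

def dp_zo_gradient(f, x, num_dirs, mu=1e-3, clip=1.0,
                   sigma_dir=0.0, sigma_iso=0.0, rng=None):
    """Noisy zeroth-order gradient estimate built from num_dirs random directions."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    estimate = np.zeros(d)
    for _ in range(num_dirs):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)                       # uniform direction on the unit sphere
        diff = (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu)
        diff = float(np.clip(diff, -clip, clip))     # bound each direction's contribution
        diff += sigma_dir * rng.standard_normal()    # assumed "directional" noise
        estimate += diff * u
    estimate *= d / num_dirs                         # E[u u^T] = I/d on the unit sphere
    estimate += sigma_iso * rng.standard_normal(d)   # assumed "isotropic" noise
    return estimate

if __name__ == "__main__":
    f = lambda z: 0.5 * float(np.dot(z, z))          # true gradient is z itself
    x = np.array([0.3, -0.2, 0.1, 0.4, -0.1])
    g = dp_zo_gradient(f, x, num_dirs=4000, rng=np.random.default_rng(0))
    print(g)
```

With both noise scales set to zero, averaging over more directions simply lowers the variance of the estimate; how the number of directions interacts with privacy loss is the subject of the paper's analysis and is not reproduced here.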

Scaling Unsupervised Multi-Source Federated Domain Adaptation through Group-Wise Discrepancy Minimization

Existing federated unsupervised multi-source domain adaptation (UMDA) methods handle only 2–6 sources and become unstable or computationally infeasible as the number of sources grows. To address this, the authors propose GALA: all sources are randomly divided into small groups, and prediction-distribution discrepancies are minimized group-wise, reducing \(O(N^2)\) pairwise alignment to linear complexity. On top of this, a centroid- and temperature-based similarity weighting selects the sources that are truly close to the target domain. On the newly constructed Digit-18 benchmark (18 sources), the method converges stably and outperforms all baselines.
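
The complexity reduction can be illustrated by counting alignment terms. The sketch assumes that discrepancies are computed between pairs of sources inside each random group; what GALA actually computes per group is not specified in this summary, and the helper names and group size are illustrative.

```python
import random
from itertools import combinations

def pairwise_alignment_terms(n_sources):
    """Number of terms when every pair of sources is aligned: N * (N - 1) / 2."""
    return n_sources * (n_sources - 1) // 2

def random_groups(n_sources, group_size, seed=0):
    """Shuffle the source ids and split them into consecutive groups."""
    ids = list(range(n_sources))
    random.Random(seed).shuffle(ids)
    return [ids[i:i + group_size] for i in range(0, n_sources, group_size)]

def groupwise_alignment_terms(groups):
    """Number of terms when only pairs inside the same group are aligned."""
    return sum(len(list(combinations(group, 2))) for group in groups)

if __name__ == "__main__":
    groups = random_groups(18, group_size=3)
    print(pairwise_alignment_terms(18))        # 153
    print(groupwise_alignment_terms(groups))   # 18
```

For a fixed group size \(g\), the group-wise count is about \(N(g-1)/2\), which grows linearly in \(N\) rather than quadratically.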

The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents

This work constructs a programmatically generated "Synthetic Web" environment. By injecting a single high-credibility honeypot misinformation item at search rank 0, it causally demonstrates that cutting-edge LLM agents such as GPT-5 experience an accuracy drop from 65% to 18% under adversarial contamination at a 1-in-thousands rate. The models do not increase search effort and still answer with high confidence, revealing a deeply rooted "positional anchoring" failure mode.

VPD-100K: Towards Generalizable and Fine-grained Visual Privacy Protection

The authors construct VPD-100K, a large-scale visual privacy dataset with 100,000 images, 33 fine-grained categories, and over 190,000 instances, covering four major domains (faces, on-screen PII, physical identifiers, and location markers). They propose a three-part frequency-domain enhancement module (FDAF + Adaptive Spectral Gating + Frequency-domain Consistency Loss) inserted into the neck of YOLOv10, which raises YOLOv10-L's AP on VPD-100K from 53.8 to 58.6 (+4.8) while sustaining real-time performance on live streams at 7.51 ms latency.