🛡️ AI Safety
🔬 ICLR 2026 · 27 paper notes
- Action-Free Offline-to-Online RL via Discretised State Policies
This paper formalizes the "action-free offline-to-online RL" setting for the first time and proposes the OSO-DecQN algorithm. By discretizing continuous state differences into ternary tokens \(\{-1, 0, 1\}\), a state policy \(Q(s, \Delta s)\) is pretrained on action-free \((s, r, s')\) tuples to predict the expected direction of the next state change rather than an action. A policy-switching mechanism combined with an online-trained inverse dynamics model (IDM) then translates the state policy into executable actions, accelerating online learning. The approach consistently improves both convergence speed and asymptotic performance on D4RL and the DeepMind Control Suite (including 78-dimensional state spaces).
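The ternary tokenization of state changes can be sketched as follows; the dead-zone threshold `eps` and the function name are illustrative choices for this sketch, not taken from the paper:

```python
def discretize_delta(s, s_next, eps=0.05):
    """Map each dimension of the state change s' - s to a ternary token in
    {-1, 0, 1}: +1 if it increased, -1 if it decreased, 0 if the change
    stays inside a small dead-zone eps."""
    tokens = []
    for a, b in zip(s, s_next):
        d = b - a
        tokens.append(1 if d > eps else (-1 if d < -eps else 0))
    return tokens

# A 3-dimensional transition: first dim rose, second held, third fell.
print(discretize_delta([0.0, 1.0, 2.0], [0.3, 1.0, 1.5]))  # [1, 0, -1]
```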
- Adaptive Methods Are Preferable in High Privacy Settings: An SDE Perspective
This work is the first to analyze differentially private optimizers through a stochastic differential equation (SDE) framework, revealing fundamental behavioral differences between DP-SGD and DP-SignSGD under privacy noise: adaptive methods achieve a superior privacy-utility tradeoff of \(\mathcal{O}(1/\varepsilon)\) vs. \(\mathcal{O}(1/\varepsilon^2)\) in high-privacy regimes, and their hyperparameters transfer across privacy budgets.
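To make the contrast concrete, here is a minimal single-example sketch of the two update rules; the clipping bound, noise scale, and noise placement (before the sign) are illustrative assumptions of this sketch, not the paper's exact formulation:

```python
import random

def dp_sgd_step(w, grad, lr=0.1, clip=1.0, sigma=1.0):
    """One DP-SGD step on a single example: clip the gradient to L2 norm
    `clip`, add Gaussian noise scaled by sigma * clip, then descend."""
    norm = sum(g * g for g in grad) ** 0.5
    scale = min(1.0, clip / max(norm, 1e-12))
    noisy = [g * scale + random.gauss(0.0, sigma * clip) for g in grad]
    return [wi - lr * g for wi, g in zip(w, noisy)]

def dp_signsgd_step(w, grad, lr=0.1, sigma=1.0):
    """One DP-SignSGD step: noise is added first, then only the sign is
    kept, so every coordinate moves by exactly +/- lr regardless of the
    noise magnitude: the bounded-update property that helps the adaptive
    method at small epsilon."""
    signs = [1.0 if g + random.gauss(0.0, sigma) > 0 else -1.0 for g in grad]
    return [wi - lr * s for wi, s in zip(w, signs)]

random.seed(0)
print(dp_signsgd_step([0.0, 0.0, 0.0], [1.0, -2.0, 0.5]))
```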
- ATEX-CF: Attack-Informed Counterfactual Explanations for Graph Neural Networks
This paper proposes ATEX-CF, a framework that, for the first time, unifies the edge-addition strategy from adversarial attacks with the edge-deletion strategy from counterfactual explanations. Through joint optimization of prediction flipping, sparsity, and plausibility, ATEX-CF generates more faithful, concise, and plausible instance-level counterfactual explanations for GNNs.
- Back to Square Roots: An Optimal Bound on the Matrix Factorization Error for Multi-Epoch Differentially Private SGD
This paper proposes the Banded Inverse Square Root (BISR) matrix factorization method, which imposes a banded structure on the inverse correlation matrix (rather than on the correlation matrix itself). This approach achieves, for the first time, an asymptotically optimal factorization error bound for multi-epoch differentially private SGD, and is accompanied by a memory-efficient variant, BandInvMF.
- Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning
This paper proposes Daze, a backdoor attack in which a malicious simulator developer—without any access to or modification of the agent's reward function—plants a backdoor solely by manipulating state transitions: when the agent fails to execute the target action in a trigger state, it is forced to take random actions ("dazed"), thereby theoretically guaranteeing both attack success and stealthiness. The work also presents the first demonstration of an RL backdoor attack on real robot hardware.
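A toy illustration of this attack surface, assuming a discrete chain environment; the class and parameter names (`MaliciousSim`, `trigger_state`, `target_action`) are invented for this sketch:

```python
import random

class MaliciousSim:
    """Toy Daze-style backdoor: the simulator never touches rewards; it only
    rewrites transitions. If the agent is in the trigger state and does not
    take the attacker's target action, its action is replaced by a random
    one before the honest transition function is applied."""

    def __init__(self, step_fn, n_actions, trigger_state, target_action, seed=0):
        self.step_fn = step_fn
        self.n_actions = n_actions
        self.trigger_state = trigger_state
        self.target_action = target_action
        self.rng = random.Random(seed)

    def step(self, state, action):
        if state == self.trigger_state and action != self.target_action:
            action = self.rng.randrange(self.n_actions)  # "daze" the agent
        return self.step_fn(state, action)

# Honest chain environment: action 1 moves right, anything else stays put.
honest = lambda s, a: (s + 1 if a == 1 else s, 0.0)
sim = MaliciousSim(honest, n_actions=4, trigger_state=5, target_action=3)
print(sim.step(5, 3))  # target action in the trigger state: honest (5, 0.0)
```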
- Beyond Match Maximization and Fairness: Retention-Optimized Two-Sided Matching
This paper proposes Matching for Retention (MRet), an algorithm that, for the first time, shifts the optimization objective of two-sided matching platforms from "maximizing the number of matches" or "satisfying fairness constraints" to "directly maximizing user retention rate." By learning personalized retention curves and exploiting the concavity of the retention function, the otherwise NP-hard joint retention-gain optimization for both sides is reduced to an \(O(N \log N)\) sorting problem. MRet achieves significant retention improvements on both synthetic data and real-world data from a large Japanese dating platform.
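The concavity argument can be sketched with a greedy allocator: when each retention curve is concave, marginal gains are decreasing, so repeatedly taking the best next marginal gain from a heap is optimal. The curve shapes below are illustrative, not from the paper:

```python
import heapq
import math

def allocate_matches(retention, budget):
    """Greedily assign `budget` matches across users to maximize total
    retention. retention[i](k) is user i's retention after k matches; with
    concave curves the marginal gains are decreasing, so always taking the
    best next marginal gain from a max-heap is optimal, and the procedure
    is sort-like: O(budget log N)."""
    counts = [0] * len(retention)
    heap = [(-(r(1) - r(0)), i) for i, r in enumerate(retention)]
    heapq.heapify(heap)
    for _ in range(budget):
        _gain, i = heapq.heappop(heap)
        counts[i] += 1
        k = counts[i]
        heapq.heappush(heap, (-(retention[i](k + 1) - retention[i](k)), i))
    return counts

# Two users with concave, saturating retention curves (illustrative shapes).
curves = [lambda k: 1 - math.exp(-0.5 * k), lambda k: 1 - math.exp(-2.0 * k)]
print(allocate_matches(curves, 3))  # [2, 1]
```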
- Bridging Fairness and Explainability: Can Input-Based Explanations Promote Fairness in Hate Speech Detection?
This paper presents the first systematic, large-scale quantitative study of the relationship between input-based explanations and fairness: explanations can effectively detect biased predictions and serve as training-time regularizers to reduce bias, but cannot reliably be used for automatic fair-model selection.
- Co-LoRA: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients
This paper proposes FedMosaic, a framework addressing dual heterogeneity in personalized federated learning (PFL): RELA measures task relevance via gradient similarity to enable customized aggregation (addressing data heterogeneity), while Co-LoRA enables cross-architecture knowledge sharing (e.g., Llama vs. Qwen) through dimension-invariant modules \(P \in \mathbb{R}^{r \times r}, Q \in \mathbb{R}^r\) (addressing model heterogeneity). The framework achieves substantial improvements over SOTA on DRAKE, a newly proposed 40-task multimodal PFL benchmark.
- Dataless Weight Disentanglement in Task Arithmetic via Kronecker-Factored Approximate Curvature
This work elegantly bridges classical curvature approximation theory (KFAC) with the practical demands of task arithmetic, proposing a data-free weight disentanglement regularization method. The theoretical derivation is clear, with a coherent logical chain from representation drift regularization → Jacobian Gramian → GGN → KFAC. Experiments span multiple model scales across both vision and language domains, and the robustness analysis with respect to the \(\alpha\) hyperparameter is practically valuable. Limitations include the \(O(d^2)\) storage overhead of KFAC for large models and a remaining gap relative to data-dependent methods in the text domain.
- Efficient Resource-Constrained Training of Transformers via Subspace Optimization
This paper proposes WASI (Weight-Activation Subspace Iteration), which leverages the observation that parameter subspaces remain stable during fine-tuning to simultaneously compress both the weights (via SVD + Gram-Schmidt subspace iteration) and activations (via Tucker decomposition) of Transformers. Both training and inference are performed entirely within low-rank representations, achieving 62× training memory compression and 1.4× speedup on Raspberry Pi 5 with negligible accuracy loss.
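The weight-side idea (leaving aside the Tucker-compressed activations) can be sketched with plain orthogonal iteration, where a QR step plays the role of Gram-Schmidt; the sizes and iteration count here are arbitrary:

```python
import numpy as np

def subspace_iteration(W, r, n_iter=50, seed=0):
    """Track the top-r left singular subspace of W with orthogonal
    iteration: repeatedly multiply a basis by W W^T and re-orthonormalize
    via QR. This avoids a full SVD per step, exploiting the observation
    that the subspace stays stable during fine-tuning."""
    rng = np.random.default_rng(seed)
    Q = np.linalg.qr(rng.standard_normal((W.shape[0], r)))[0]
    for _ in range(n_iter):
        Q = np.linalg.qr(W @ (W.T @ Q))[0]
    return Q

# Compare against the exact SVD subspace on a rank-8 matrix.
rng = np.random.default_rng(1)
W = rng.standard_normal((40, 8)) @ rng.standard_normal((8, 30))
Q = subspace_iteration(W, r=8)
U = np.linalg.svd(W, full_matrices=False)[0][:, :8]
# The two bases span the same subspace: the projector difference is ~0.
print(np.linalg.norm(Q @ Q.T - U @ U.T))
```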
- Extending Sequence Length is Not All You Need: Effective Integration of Multimodal Signals for Gene Expression Prediction
This paper challenges the prevailing "longer is better" paradigm in gene expression prediction, demonstrating that current SSM models fundamentally rely only on proximal information. It further identifies background chromatin signals (DNase-seq/Hi-C) as confounding variables that introduce spurious correlations, and proposes the Prism framework, which applies backdoor adjustment for deconfounding—achieving state-of-the-art performance with only 2k-length sequences, surpassing methods that use 200k-length sequences.
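The deconfounding step rests on the standard backdoor-adjustment formula, which a toy discrete example makes concrete; all probabilities below are made up for illustration:

```python
def backdoor_adjust(p_y_given_xz, p_z):
    """Backdoor adjustment: P(y | do(x)) = sum_z P(y | x, z) * P(z).
    p_y_given_xz[(x, z)] holds P(y=1 | x, z); p_z[z] is the marginal of
    the confounder z (here standing in for background chromatin state)."""
    xs = {x for (x, _z) in p_y_given_xz}
    return {x: sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z.items())
            for x in xs}

# Made-up numbers: conditioning on x alone would be biased because z is
# correlated with x; do(x) weights z by its marginal instead.
p_y_given_xz = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 0.6}
p_z = {0: 0.7, 1: 0.3}
print(backdoor_adjust(p_y_given_xz, p_z))  # ~{0: 0.22, 1: 0.32}
```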
- Hide and Find: A Distributed Adversarial Attack on Federated Graph Learning
This paper proposes FedShift, a two-stage "hide-and-find" distributed adversarial attack framework. In the first stage, covert shifters are injected into training graphs via gentle distributional shifts. In the second stage, the trained shifter generator serves as a warm initialization for efficiently searching adversarial perturbations, which are then aggregated across multiple malicious clients to form the final adversarial examples. FedShift achieves state-of-the-art attack success rates on six large-scale datasets, evades three mainstream defense algorithms, and improves convergence speed by over 90%.
- Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights
This paper reveals that privacy vulnerability is concentrated in a remarkably small fraction of weights (as few as 0.1%), which is highly entangled with learnability (Pearson \(r > 0.9\)). The proposed CWRF method achieves superior privacy-utility trade-offs by rewinding privacy-vulnerable weights to their initialization and freezing them, while fine-tuning only the remaining weights.
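A minimal sketch of the rewind-and-freeze step, assuming a precomputed per-weight vulnerability score (how that score is estimated is the paper's contribution and is not reproduced here):

```python
def rewind_and_freeze(w_final, w_init, vulnerability, frac=0.001):
    """CWRF-style sketch: rank weights by a given privacy-vulnerability
    score, rewind the top `frac` fraction to their initialization, and
    mark them frozen so only the remaining weights keep fine-tuning."""
    n = len(w_final)
    k = max(1, int(n * frac))
    critical = set(sorted(range(n), key=lambda i: vulnerability[i],
                          reverse=True)[:k])
    weights = [w_init[i] if i in critical else w_final[i] for i in range(n)]
    trainable = [i not in critical for i in range(n)]
    return weights, trainable

w, mask = rewind_and_freeze([1.0, 2.0, 3.0, 4.0], [0.0] * 4,
                            [0.1, 0.9, 0.2, 0.3], frac=0.25)
print(w, mask)  # [1.0, 0.0, 3.0, 4.0] [True, False, True, True]
```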
- Less is More: Towards Simple Graph Contrastive Learning
This paper revisits the foundational principles of graph contrastive learning (GCL) and identifies that node feature noise can be mitigated through structural feature aggregation derived from graph topology. Based on this insight, the authors propose a minimalist GCL model that contrasts a GCN encoder (capturing structural features) against an MLP encoder (isolating node feature noise), requiring neither data augmentation nor negative sampling. The method achieves state-of-the-art performance on heterophilic graph benchmarks while offering advantages in complexity, scalability, and robustness on homophilic graphs.
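A stripped-down sketch of the two-view contrast, with the encoders reduced to a weight-free neighbor-mean aggregation versus the raw features; the real GCN/MLP encoders and training loop are omitted:

```python
def neighbor_mean(adj, X):
    """Structural view: each node's representation is the mean of its
    neighbors' features (a weight-free, GCN-style aggregation)."""
    d = len(X[0])
    return [[sum(X[j][k] for j in nbrs) / len(nbrs) for k in range(d)]
            for nbrs in adj]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def contrast_loss(Z_struct, Z_feat):
    """Augmentation-free, negative-free objective: pull each node's
    structural view toward its feature view (1 - mean cosine similarity)."""
    sims = [cosine(a, b) for a, b in zip(Z_struct, Z_feat)]
    return 1 - sum(sims) / len(sims)

adj = [[1], [0, 2], [1]]                  # neighbor lists of a 3-node path
X = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # raw node features
print(round(contrast_loss(neighbor_mean(adj, X), X), 4))  # 0.1953
```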
- Risk-Sensitive Agent Compositions
This paper formalizes agent workflows as directed acyclic graphs (Agent Graphs), models safety/fairness/privacy requirements via a max loss function, and proposes the BucketedVaR algorithm, which combines union bounds with dynamic programming to find the optimal agent composition minimizing VaR/CVaR in polynomial time. The approach is proven to be asymptotically near-optimal under an independence assumption on agent losses.
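The risk measures being minimized are the standard empirical VaR and CVaR, sketched below; the bucketing, union bounds, and dynamic program of BucketedVaR are not reproduced here:

```python
import math

def var_cvar(losses, alpha=0.95):
    """Empirical Value-at-Risk and Conditional Value-at-Risk of a sample of
    losses: VaR is the alpha-quantile, CVaR the mean of the tail at or
    beyond it."""
    xs = sorted(losses)
    idx = max(0, math.ceil(alpha * len(xs)) - 1)  # index of the alpha-quantile
    tail = xs[idx:]
    return xs[idx], sum(tail) / len(tail)

losses = [0.0, 0.1, 0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 1.0, 5.0]
print(var_cvar(losses, alpha=0.9))  # (1.0, 3.0)
```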
- Robust Spiking Neural Networks Against Adversarial Attacks
This paper theoretically demonstrates that threshold-proximal spiking neurons are the key robustness bottleneck in directly trained SNNs — they simultaneously set the theoretical upper bound on adversarial attack strength and are most susceptible to state flipping. The proposed Threshold Guarding Optimization (TGO) method addresses this through a dual strategy of membrane potential constraint and noisy LIF neurons, achieving state-of-the-art robustness across multiple adversarial attack scenarios with zero additional inference overhead.
- Membership Privacy Risks of Sharpness Aware Minimization
This paper presents the first systematic study demonstrating that models trained with SAM (Sharpness-Aware Minimization), despite achieving better generalization, are more vulnerable to membership inference attacks (MIA) than SGD-trained models. Two complementary explanations are provided through theoretical analysis and experiments: memorization behavior and variance contraction.
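The attacks in question can be as simple as a loss-threshold membership test, sketched below with made-up loss values; members tend to have lower loss, and SAM's effect is on how separable the two loss distributions become:

```python
def mia_rates(member_losses, nonmember_losses, threshold):
    """Loss-threshold membership inference: predict 'member' whenever a
    sample's loss falls below the threshold (members are fitted better).
    Returns the attack's true-positive and false-positive rates."""
    tp = sum(l < threshold for l in member_losses) / len(member_losses)
    fp = sum(l < threshold for l in nonmember_losses) / len(nonmember_losses)
    return tp, fp

members = [0.01, 0.02, 0.05, 0.10]     # losses on training samples
nonmembers = [0.20, 0.50, 0.04, 1.00]  # losses on held-out samples
print(mia_rates(members, nonmembers, threshold=0.15))  # (1.0, 0.25)
```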
- Sample-Efficient Distributionally Robust Multi-Agent Reinforcement Learning via Online Interaction
This paper presents the first study of online learning in Distributionally Robust Markov Games (DRMGs), proposing the MORNAVI algorithm. Without relying on a simulator or offline data, MORNAVI efficiently learns optimal robust policies through online interaction, and provides the first provable regret bounds under both TV-divergence and KL-divergence uncertainty sets.
- Skirting Additive Error Barriers for Private Turnstile Streams
This paper proves that the known polynomial additive error lower bounds in the differentially private turnstile streaming model, \(\Omega(T^{1/4})\) for distinct-element counting and \(\Omega(T)\) for \(F_2\) moment estimation, can be circumvented by allowing multiplicative error: it achieves \((\mathrm{polylog}(T), \mathrm{polylog}(T))\) mixed error for distinct elements and \((1+\eta, \mathrm{polylog}(T))\) mixed error for \(F_2\) moments, both in polylogarithmic space.
- Time Is All It Takes: Spike-Retiming Attacks on Event-Driven Spiking Neural Networks
This paper proposes the Spike-Retiming Attack — a temporal attack that perturbs only spike timestamps without adding or removing spikes. It formalizes a unified tri-norm budget (\(\mathcal{B}_\infty\) local jitter / \(\mathcal{B}_1\) total delay / \(\mathcal{B}_0\) tamper count) under a capacity-1 constraint, and employs Projected-in-the-Loop (PIL) optimization to decouple strict forward projection from soft backward differentiation. The method achieves >90% ASR with <2% spike perturbation on CIFAR10-DVS/DVS-Gesture/N-MNIST, revealing a critical temporal vulnerability in event-driven SNNs.
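A naive projection onto the tri-norm budget (ignoring the capacity-1 constraint and the PIL optimizer) might look like this; the greedy order of the three projections is an assumption of the sketch:

```python
def project_retiming(delta, b_inf, b1, b0):
    """Project per-spike timestamp shifts onto the tri-norm budget: each
    spike moves by at most b_inf (L-infinity), at most b0 spikes move at
    all (L0), and the total absolute shift is at most b1 (L1)."""
    # L-infinity: clip each per-spike shift to [-b_inf, b_inf].
    d = [max(-b_inf, min(b_inf, x)) for x in delta]
    # L0: keep only the b0 largest shifts (by magnitude), zero the rest.
    keep = set(sorted(range(len(d)), key=lambda i: abs(d[i]),
                      reverse=True)[:b0])
    d = [d[i] if i in keep else 0.0 for i in range(len(d))]
    # L1: if the total absolute shift exceeds b1, scale down uniformly.
    total = sum(abs(x) for x in d)
    if total > b1:
        d = [x * b1 / total for x in d]
    return d

print(project_retiming([3.0, -0.5, 0.2, 1.0], b_inf=1.0, b1=1.5, b0=2))
# [0.75, 0.0, 0.0, 0.75]
```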
- Toward Enhancing Representation Learning in Federated Multi-Task Settings
This paper proposes the Muscle loss — an N-tuple-level multi-model contrastive learning objective whose minimization is equivalent to maximizing a lower bound on the mutual information among all model representations. Building on this, the FedMuscle algorithm aligns the representation spaces of heterogeneous models via a public dataset, naturally handling both model and task heterogeneity. FedMuscle consistently outperforms state-of-the-art baselines across CV/NLP multi-task settings, with gains of up to +28.65%.
- Traceable Black-box Watermarks for Federated Learning
This paper proposes TraMark, which partitions the model parameter space into a main-task region and a watermark region and employs masked aggregation to prevent watermark collision. TraMark achieves server-side traceable black-box watermark injection in federated learning for the first time, attaining a verification rate of 99.58% with only a 0.54% drop in main-task accuracy.
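One plausible reading of masked aggregation, sketched on flat parameter vectors: the main-task region is averaged across clients as usual, while each client keeps its own values in the watermark region; TraMark's actual partitioning and injection procedure are not reproduced here:

```python
def masked_aggregate(client_models, wm_region):
    """Masked-aggregation sketch: coordinates in the main-task region are
    averaged across clients, while every client retains its own values in
    the watermark region, so per-client watermarks are never mixed."""
    n = len(client_models)
    d = len(client_models[0])
    avg = [sum(m[j] for m in client_models) / n for j in range(d)]
    return [[m[j] if j in wm_region else avg[j] for j in range(d)]
            for m in client_models]

models = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(masked_aggregate(models, wm_region={2}))
# [[2.0, 3.0, 3.0], [2.0, 3.0, 5.0]]
```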
- Unified Privacy Guarantees for Decentralized Learning via Matrix Factorization
This paper unifies diverse algorithms and trust models in decentralized learning (DL) under a matrix factorization (MF) mechanism framework, extends privacy guarantees to more general matrix types, and proposes the MAFALDA-SGD algorithm that significantly outperforms existing methods on both synthetic and real-world graph topologies by optimizing noise correlation.
- VPI-Bench: Visual Prompt Injection Attacks for Computer-Use Agents
This paper introduces VPI-Bench, the first comprehensive visual prompt injection attack benchmark (306 samples), systematically evaluating the security of Computer-Use and Browser-Use Agents across 5 platforms. Results reveal that Browser-Use Agents are critically vulnerable (100% attack success rate on Amazon/Booking), that even Anthropic's CUA exhibits severe vulnerabilities (attack success rates of up to 59%), and that system-prompt defenses are ineffective.
- Watermark-based Detection and Attribution of AI-Generated Content
This paper presents the first systematic study on watermark-based user-level detection and attribution of AI-generated content. It provides theoretical analysis (bounds on TDR/FDR/TAR), an efficient watermark selection algorithm (A-BSTA), and cross-modal (image + text) experimental validation, demonstrating that detection and attribution inherit the accuracy and (non-)robustness of the underlying watermarking method.
- Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information
This paper provides a unified explanation for the effectiveness of all unlearnable example (UE) methods through the lens of mutual information (MI) reduction, and proves that minimizing the intra-class covariance of poisoned features reduces the MI upper bound. Based on this framework, MI-UE is proposed, which achieves covariance reduction via intra-class cosine similarity maximization, suppressing test accuracy to 9.95% on CIFAR-10 (near random-chance), while significantly outperforming existing methods under adversarial training defenses.
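The covariance-reduction mechanism can be illustrated with a toy stand-in: rotating every feature in a class onto the class-mean direction drives all pairwise cosine similarities to 1 and shrinks the intra-class covariance trace (the actual MI-UE poison is learned, not applied analytically):

```python
def trace_cov(vectors):
    """Trace of the (biased) covariance of a list of feature vectors,
    i.e. the mean squared distance to the class mean."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    return sum((v[j] - mean[j]) ** 2 for v in vectors for j in range(d)) / n

def align_to_mean_direction(vectors):
    """Toy stand-in for intra-class cosine-similarity maximization: rotate
    each feature onto the class-mean direction while keeping its norm, so
    intra-class scatter collapses onto a single line."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    m_norm = sum(x * x for x in mean) ** 0.5
    unit = [x / m_norm for x in mean]
    return [[(sum(x * x for x in v) ** 0.5) * u for u in unit]
            for v in vectors]

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(trace_cov(feats) > trace_cov(align_to_mean_direction(feats)))  # True
```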