🛡️ AI Safety
🔬 ICLR2026 · 140 paper notes
📌 Same area in other venues: 📷 CVPR2026 (143) · 💬 ACL2026 (5) · 🧪 ICML2026 (114) · 🤖 AAAI2026 (45) · 🧠 NeurIPS2025 (73) · 📹 ICCV2025 (24)
🔥 Top topics: Adversarial Robustness ×47 · Federated Learning ×10 · Diffusion Models ×5 · Watermarking ×5 · Multimodal/VLM ×4
- A Bayesian Nonparametric Framework for Private, Fair, and Balanced Tabular Data Synthesis
-
This paper embeds a conditional VAE-GAN generator into a Bayesian Nonparametric Learning (BNPL) framework. It utilizes a Dirichlet process for global privacy, a copula base measure for column-wise local privacy, BNP mutual information regularization for fairness, and KL divergence for class balance. It represents the first unified framework with theoretical guarantees to simultaneously handle privacy, fairness, and class imbalance constraints while naturally supporting non-binary sensitive attributes.
- A Fair Bayesian Inference through Matched Gibbs Posterior
-
Targeting the limitation that "fair models only provide point estimates and fail to quantify predictive uncertainty," this paper integrates group fairness constraints into a Bayesian framework. It proposes the matched Gibbs posterior with matched deviation as a penalty term and treats the matching function \(T\) as a learnable parameter to avoid adversarial training. This allows an \(O(n)\) Gibbs sampler to simultaneously produce "calibrated" posterior distributions that satisfy demographic parity constraints.
- A General Framework for Black-Box Attacks Under Cost Asymmetry
-
Addressing real-world scenarios where "different queries incur different costs" (e.g., submitting violating images to an NSFW detector triggers account bans), this paper proposes a general framework for decision-based black-box attacks adaptable to any cost ratio \(c^\star\). By replacing binary search with Asymmetric Search (AS) and standard Monte Carlo gradient estimation with Asymmetric Gradient Estimation (AGREST), the framework minimizes total query costs without discarding core attack components, reducing perturbation norms by up to 40%.
- A Unified Total Variation Framework for Membrane Potential Perturbation Dynamic
-
This paper proves that the "Membrane Potential Perturbation Dynamic (MPPD)" used to characterize adversarial perturbations in Spiking Neural Networks (SNNs) is essentially a Total Variation (TV) operator. Consequently, existing mean-square MPPD regularization is equivalent to a TV-\(\ell_2\) framework. The authors propose a stronger TV-\(\ell_1\) framework—leveraging the coarea formula to achieve better suppression of sharp adversarial noise—reaching new SOTA robust accuracy for SNNs under both Gaussian and adversarial training.
- Action-Free Offline-to-Online RL via Discretised State Policies
-
This paper formally defines the "Action-Free Offline-to-Online RL" setting for the first time and proposes the OSO-DecQN algorithm. By discretizing continuous state differences into three categorical tokens \(\{-1, 0, 1\}\), the method pre-trains a state policy (predicting desired directions of state change rather than actions) on data containing only \((s, r, s')\) tuples. During the online phase, the state policy is converted into executable actions via a policy switching mechanism and an online-trained inverse dynamics model, accelerating online agent learning. Consistent improvements in convergence speed and asymptotic performance are demonstrated on D4RL and DeepMind Control Suite (including a 78-dimensional state space).
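The discretisation step described above can be illustrated with a minimal sketch. The dead-band `eps` and the function name are assumptions made for illustration; the summary does not say how OSO-DecQN decides when a change counts as zero.

```python
from typing import List, Sequence


def discretise_state_delta(
    s: Sequence[float], s_next: Sequence[float], eps: float = 1e-3
) -> List[int]:
    """Map each dimension of (s_next - s) to a token in {-1, 0, 1}.

    `eps` is a hypothetical dead-band: changes smaller than it count as 0.
    """
    tokens = []
    for before, after in zip(s, s_next):
        delta = after - before
        if delta > eps:
            tokens.append(1)      # state dimension should increase
        elif delta < -eps:
            tokens.append(-1)     # state dimension should decrease
        else:
            tokens.append(0)      # no meaningful change
    return tokens


# One (s, s') pair from an action-free dataset becomes a per-dimension target.
example_tokens = discretise_state_delta([0.0, 1.0, 2.0], [0.5, 1.0, 1.2])
```

A state policy trained on such targets predicts a direction per state dimension, which is why an inverse dynamics model is still needed online to turn the prediction into an action.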
- Adaptive Logit Adjustment for Debiasing Multimodal Language Models
-
ALA is a post-processing debiasing method. During each step of autoregressive generation, it uses external image and text classifiers to measure the discrepancy between the "attributes the image should have" and the "current bias expressed in the text." It then applies a proportional adjustment, along the gradient direction, only to the logits of bias-related tokens. This aligns image-text attributes or neutralizes harmful stereotypes without modifying internal representations or retraining, while maintaining model utility.
- Adaptive Methods Are Preferable in High Privacy Settings: An SDE Perspective
-
This work introduces the first Stochastic Differential Equation (SDE) framework to analyze differential privacy (DP) optimizers, revealing fundamental differences between DP-SGD and DP-SignSGD under privacy noise. The analysis shows that in high privacy settings adaptive methods achieve a superior privacy-utility trade-off of \(\mathcal{O}(1/\varepsilon)\), compared to \(\mathcal{O}(1/\varepsilon^2)\) for DP-SGD, and that their hyperparameters remain transferable across varying privacy budgets.
- Adversarial Attacks Already Tell the Answer: Directional Bias-Guided Test-time Defense for Vision-Language Models
-
The authors observe that transformed adversarial samples in the CLIP feature space collectively shift along a "dominant direction" (whereas clean samples diverge), which happens to point back to the correct category center. Consequently, they propose DBD, a training-free test-time defense that estimates the "defense direction" and repairs representations via dual-stream feature reconstruction guided by DB-score. DBD not only sets a new SOTA for adversarial robustness across 15 datasets but also exhibits the counter-intuitive phenomenon where "adversarial accuracy surpasses clean accuracy."
- AP-OOD: Attention Pooling for Out-of-Distribution Detection
-
The authors propose AP-OOD, which replaces the mean pooling used in Mahalanobis-distance-based detection with learnable attention pooling. This addresses the loss of token-level anomaly information under mean pooling, reducing FPR95 on XSUM summarization from 27.84% to 4.67%, and supports a smooth transition from unsupervised to semi-supervised settings.
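To see why mean pooling can hide a token-level anomaly, here is a small sketch comparing it with softmax attention pooling. The `query` vector and the scale `beta` stand in for learned parameters and are assumptions; the way AP-OOD plugs the pooled vector into its distance score is not reproduced here.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(tokens: List[Vector]) -> Vector:
    """Average token embeddings dimension by dimension."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def attention_pool(tokens: List[Vector], query: Vector, beta: float = 1.0) -> Vector:
    """Softmax-weighted average of tokens; weights come from query-token dot products."""
    scores = [beta * sum(q * t for q, t in zip(query, token)) for token in tokens]
    top = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tokens[0])
    return [sum(w * token[d] for w, token in zip(weights, tokens)) for d in range(dim)]


# Three ordinary tokens and one outlier token along the first dimension.
sequence = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [4.0, 0.0]]
pooled_mean = mean_pool(sequence)                       # outlier diluted to 1.0
pooled_attn = attention_pool(sequence, [1.0, 0.0], 2.0)  # outlier dominates
```

With a query aligned to the anomalous direction, the attention-pooled vector stays close to the outlier token, while the mean dilutes it by the sequence length.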
- ATEX-CF: Attack-Informed Counterfactual Explanations for Graph Neural Networks
-
This paper proposes the ATEX-CF framework, the first to unify edge addition strategies from adversarial attacks with edge removal strategies from counterfactual explanations. By jointly optimizing prediction flipping, sparsity, and plausibility, it generates more faithful, concise, and reasonable instance-level counterfactual explanations for GNNs.
- Back to Square Roots: An Optimal Bound on the Matrix Factorization Error for Multi-Epoch Differentially Private SGD
-
This paper proposes the Banded Inverse Square Root (BISR) matrix factorization method. By imposing a banded structure on the inverse correlation matrix (rather than the correlation matrix itself), it achieves the first asymptotically optimal factorization error bound for multi-epoch Differentially Private SGD, accompanied by a low-storage optimized variant, BandInvMF.
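The idea of banding the inverse factor can be sketched under one assumption: that the workload is the prefix-sum matrix \(A\) (lower-triangular all ones). Its square root is then a Toeplitz matrix built from the Taylor coefficients of \((1-x)^{-1/2}\), and the inverse square root from those of \((1-x)^{1/2}\). The function names and the `bandwidth` parameter are hypothetical, and the multi-epoch analysis and BandInvMF are not covered.

```python
from typing import List

Matrix = List[List[float]]


def sqrt_coeffs(n: int) -> List[float]:
    """First n Taylor coefficients of (1 - x)^(-1/2): 1, 1/2, 3/8, 5/16, ..."""
    coeffs = [1.0]
    for k in range(1, n):
        coeffs.append(coeffs[-1] * (2 * k - 1) / (2 * k))
    return coeffs


def inv_sqrt_coeffs(n: int) -> List[float]:
    """First n Taylor coefficients of (1 - x)^(1/2): 1, -1/2, -1/8, -1/16, ..."""
    coeffs = [1.0]
    for k in range(1, n):
        coeffs.append(coeffs[-1] * (2 * k - 3) / (2 * k))
    return coeffs


def lower_toeplitz(coeffs: List[float], n: int) -> Matrix:
    """Lower-triangular Toeplitz matrix with coeffs[k] on the k-th subdiagonal."""
    return [[coeffs[i - j] if i >= j else 0.0 for j in range(n)] for i in range(n)]


def banded_inverse_sqrt(n: int, bandwidth: int) -> Matrix:
    """Keep only the first `bandwidth` diagonals of the inverse square root."""
    kept = [c if k < bandwidth else 0.0 for k, c in enumerate(inv_sqrt_coeffs(n))]
    return lower_toeplitz(kept, n)


def matmul(a: Matrix, b: Matrix) -> Matrix:
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


root = lower_toeplitz(sqrt_coeffs(4), 4)
prefix_sum = matmul(root, root)          # recovers the all-ones lower triangle
banded = banded_inverse_sqrt(4, 2)       # two non-zero diagonals only
```

Because the band is imposed on the inverse factor, applying it to a vector touches only `bandwidth` recent entries per step, which is where the low-storage appeal comes from in this sketch.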
- Benchmarking Bias Mitigation Toward Fairness Without Harm from Vision to LVLMs
-
This paper proposes NH-Fair, a "Fairness Without Harm" evaluation benchmark covering classical vision models and Large Vision-Language Models (LVLMs), unifying data, metrics, and training protocols. Through a two-stage model selection process (DTO to select the ERM baseline + FWH four-quadrant selection for mitigation methods), it systematically shows that many specialized debiasing algorithms do not consistently outperform a well-tuned ERM. Data augmentation is the most practical path to improving fairness without harm, while simply scaling up models does not necessarily make them fairer.
- Benchmarking Stochastic Approximation Algorithms for Fairness-Constrained Training of Deep Neural Networks
-
This paper formalizes the "training of fair deep networks" as a stochastic optimization problem with inequality constraints (specifically, loss differences between subgroups). It points out that no existing algorithm currently provides convergence guarantees across the full spectrum of "stochastic + inequality + non-convex + non-smooth" scenarios. Consequently, it selects three types of stochastic approximation algorithms from the literature that best fit this scenario but lacked implementation, integrates them into a Python toolbox, and provides the first systematic comparison of their optimization performance and fairness behavior on large-scale real-world US Census data (Folktables/ACSIncome).
- Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning
-
The authors propose the Daze attack—where a malicious simulator developer implants backdoors solely by manipulating state transitions without accessing or modifying the agent's reward function. When the agent fails to perform a target action in a trigger state, it is forced to execute random actions ("dazed"), theoretically guaranteeing attack success and stealth. This work also provides the first demonstration of RL backdoor attacks on real robot hardware.
- Beyond Match Maximization and Fairness: Retention-Optimized Two-Sided Matching
-
This paper proposes the Matching for Retention (MRet) algorithm, which shifts the optimization objective of two-sided matching platforms from "maximizing match counts" or "satisfying fairness" to "directly maximizing user retention rates." By learning personalized retention curves and utilizing concavity properties, the NP-hard joint optimization of bilateral retention gain is reduced to an \(O(N \log N)\) sorting problem. MRet significantly improves retention on both synthetic data and real-world data from a large Japanese dating platform.
- Beyond Membership: Limitations of Add/Remove Adjacency in Differential Privacy
-
The paper argues that the add/remove adjacency used by mainstream DP libraries only protects "whether a member is in the training set." Against attacks that aim to "infer attributes/labels of samples known to be in the set," the effective guarantee is the much weaker one given by substitute adjacency. The authors design a canary auditing toolkit for substitute adjacency, empirically demonstrating that privacy leakage can exceed the \(\varepsilon_{AR}\) upper bound reported by add/remove accountants, while closely matching the \(\varepsilon_S\) predicted by substitute accountants.
- Black-Box Privacy Attacks on Shared Representations in Multitask Learning
-
This paper proposes the "task-inference" threat model, demonstrating that by querying the shared representation of multitask learning (MTL) in a black-box manner and obtaining embeddings for samples of the same task, an attacker can determine whether a specific task was included in the training set. This is achieved without training shadow models or using any reference data, leveraging the strong collaborative dependency between embeddings of the same task.
- Bridging Fairness and Explainability: Can Input-Based Explanations Promote Fairness in Hate Speech Detection?
-
This paper presents the first systematic quantitative analysis of the relationship between input-based explanations and fairness: explanations effectively detect biased predictions and can serve as training regularizers to reduce bias, but they are unreliable for automated fair model selection.
- HyCAS: Simultaneous Certified and Empirical Robustness via Hybrid Convolutional and Attentional Stochasticity
-
HyCAS couples deterministic 1-Lipschitz spectral-normalized convolutions with two types of internal architectural stochasticity (spectral-normalized random projection + random attentional noise) into a global \(\le 2\)-Lipschitz randomized network. This achieves both a provable \(\ell_2\) certified radius and empirical robustness against strong \(\ell_\infty\) attacks (APGD/AutoAttack) within the same model.
- Certifying the Full YOLO Pipeline: A Probabilistic Verification Approach
-
This paper proposes ODPV—the first PAC probabilistic verification framework capable of verifying the robustness of the full YOLO detection pipeline (including NMS post-processing) against "object disappearance" attacks at a practical scale. It transforms the certification of high-dimensional detection networks into a feasible sampling problem via three steps: "output approximation → formal NMS verification → counterexample refinement."
- Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models
-
This paper addresses the lack of safety concept erasure mechanisms in visual autoregressive (VAR) text-to-image models. It proposes VARE and S-VARE, utilizing auxiliary visual tokens to stabilize erasure training and employing filtered cross-entropy alongside preservation losses to achieve "surgical" concept erasure—removing target concepts while minimizing collateral damage to generation capabilities.
- Co-LoRA: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients
-
This paper proposes the FedMosaic framework to address dual heterogeneity in personalized federated learning (PFL): RELA achieves customized aggregation by using gradient similarity to measure task relevance (addressing data heterogeneity), and Co-LoRA enables knowledge sharing across heterogeneous architectures (e.g., Llama vs. Qwen) via dimension-invariant \(P \in \mathbb{R}^{r \times r}, Q \in \mathbb{R}^r\) modules (addressing model heterogeneity). It significantly outperforms SOTA methods on the newly proposed 40-task multimodal PFL benchmark, DRAKE.
- Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing
-
This study presents the first controlled evaluation comparing AI agents and human cybersecurity experts within the same real-world production network (a university environment with approximately 8,000 hosts). The researchers simultaneously deployed 10 professional penetration testers, 6 existing agent scaffolds, and a self-developed multi-agent framework, ARTEMIS. ARTEMIS ranked second overall, identifying 9 valid vulnerabilities with an 82% valid submission rate, outperforming 9 out of 10 human professionals. In contrast, off-the-shelf scaffolds like Codex and CyAgent performed poorly. The results highlight AI's advantages in systematic enumeration, parallel exploitation, and cost-efficiency, while revealing critical weaknesses in GUI operations and high false-positive rates.
- Concept-Aware Privacy Mechanisms for Defending Embedding Inversion Attacks
-
To address the pain point where standard differential privacy (DP) defenses indiscriminately add noise to all embedding dimensions and destroy semantics, this paper proposes SPARSE. It utilizes a differentiable neuron mask to learn critical dimensions related to user-specified privacy concepts and subsequently injects ellipsoidal noise calibrated by dimensional sensitivity using the Mahalanobis mechanism. This perturbation targets only sensitive dimensions while preserving non-sensitive semantics, effectively reducing privacy leakage while maintaining downstream utility across six datasets.
- Concept-based Adversarial Attack: a Probabilistic Perspective
-
This paper upgrades the adversarial attack from "perturbing a single image" to "perturbing an entire concept distribution." A diffusion generative model is used to fit multi-pose and multi-view images of a specific identity (e.g., a specific Corgi) into a concept distribution. By sampling from the product of this concept distribution and the victim classifier's distribution, the method generates adversarial samples that retain the original concept identity while achieving a high success rate (white-box targeted attack success rate rises from ProbAttack's 59% to 98%).
- Control Tax: The Price of Keeping AI in Check
-
This paper introduces "Control Tax"—the operational and financial costs of integrating AI Control (AIC) measures into production pipelines. The authors empirically measure the ROC performance of frontier models acting as monitors on the APPS code backdoor task, translate these ROC curves into "safety probability under a given auditing budget" using game theory, and finally plot the Pareto frontier of "Safety vs. Monitoring Cost," demonstrating that the most expensive monitors are not necessarily the safest.
- Convergent Differential Privacy Analysis for General Federated Learning
-
This paper utilizes the f-DP framework and shifted interpolation techniques to prove, for the first time, that the "worst-case privacy" of two classic Federated Learning methods (Noisy-FedAvg / Noisy-FedProx) under non-convex smooth objectives converges to a constant lower bound as the number of communication rounds \(T \to \infty\) rather than diverging. This theoretically refutes the long-standing perception that "long-term FL-DP training inevitably exhausts the privacy budget."
- Curation Leaks: Membership Inference Attacks against Data Curation for Machine Learning
-
This paper reveals for the first time that even if a model is trained exclusively on "public data" and never directly sees private data, as long as private data is used to guide data curation, an attacker can successfully infer the membership of private samples across three stages: curation scores, curated subsets, and the final model. It also proposes a differentially private version of curation as an effective defense.
- Dataless Weight Disentanglement in Task Arithmetic via Kronecker-Factored Approximate Curvature
-
This work combines the classical theory of curvature approximation (KFAC) with the practical requirements of task arithmetic, proposing a weight disentanglement regularization method that requires no external data. The theoretical derivation follows a clear logical chain: representation drift regularization \(\to\) Jacobian Gramian \(\to\) GGN \(\to\) KFAC. Experiments cover various model scales across vision and language and include a practical robustness analysis for the \(\alpha\) hyperparameter. Limitations include the \(O(d^2)\) storage overhead of KFAC for large models and a performance gap relative to data-dependent methods in text domains.
- Dataset Distillation for Memorized Data: Soft Labels can Leak Held-Out Teacher Knowledge
-
This paper systematically demonstrates that in dataset distillation, by training a student only on a teacher's soft labels, the student can achieve accuracy far exceeding random chance on "memorized data" that it has never seen and cannot infer through generalization. This represents both an efficient pathway for transferring memorized knowledge and a hidden privacy leakage channel, precisely regulated by sample complexity and softmax temperature.
- Decoupling the Class Label and the Target Concept in Machine Unlearning
-
This paper points out that traditional class unlearning assumes "class label = target concept to be removed," whereas real-world deletion requests often involve a mismatch between the two. To address this, the authors decouple forget data, model output, and target concepts into three label domains, defining target/model/data mismatch tasks. They propose the TARF framework, which utilizes "representational gravity" to identify data sharing the same concept hidden in the remaining set, and employs a three-phase dynamic objective (annealed gradient ascent + target-aware gradient descent) to precisely extract the target concept and approximate the retrained model.
- Defending against Backdoor Attacks via Module Switching
-
Focusing on the post-training scenario where "suspicious pre-trained models are obtained without training data or trigger priors," this paper proposes Module Switching Defense (MSD). By interchanging weights of specific layers or modules between multiple isomorphic models, MSD disrupts the "shortcut paths" that backdoors rely on. The authors theoretically prove that its backdoor deviation is strictly higher than Weight Averaging (WAG). An evolutionary search is employed to find optimal switching strategies, significantly reducing the Attack Success Rate (ASR) using only two models and 20–50 clean validation samples.
- Designing Affine-Invariant Neural Networks for Photometric Corruption Robustness and Generalization
-
This paper proposes SEqSI, a CNN design that implements intensity shift invariance in the first layer and intensity scale equivariance in the subsequent backbone. Without significant computational overhead, it provides verifiable robustness to global brightness/contrast affine transformations for tasks including classification, localization, and segmentation, significantly outperforming standard networks in real-world photometric domain shifts such as Cryo-ET and microscopy.
- Missing Mass for Differentially Private Domain Discovery
-
This paper redefines the utility of "domain discovery" in differential privacy—extracting informative high-frequency items from an unknown massive dictionary—from "the number of distinct items (cardinality)" to "the amount of mass (missing mass)." Based on this, it proves that a simple and scalable Weighted Gaussian Mechanism (WGM) provides near-optimal \(\ell_1\) missing mass guarantees for Zipf-distributed data and distribution-free \(\ell_\infty\) guarantees. These guarantees are extended to private top-k and k-hit set tasks via a meta-algorithm that allocates half the budget to domain discovery and the other half to known-domain algorithms. Experiments on six real-world datasets show that the proposed approach matches or outperforms existing methods.
- Differentially Private Two-Stage Gradient Descent for Instrumental Variable Regression
-
This paper proposes DP-2S-GD—the first differentially private algorithm for Instrumental Variable Regression (IVaR). It rewrites the classic Two-Stage Least Squares (2SLS) as a two-stage gradient descent process, performing per-sample clipping and injecting calibrated Gaussian noise in each gradient update step to satisfy \(\rho\)-zCDP. The authors provide finite-sample convergence rates that explicitly characterize the optimization-privacy-sampling trade-off.
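The "clip each per-sample gradient, then add Gaussian noise" step mentioned above can be sketched generically. This is not the DP-2S-GD update itself: the two-stage structure and the calibration of `noise_std` to a target \(\rho\)-zCDP budget are left out, and all names below are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def clip_to_norm(vec: Vector, max_norm: float) -> Vector:
    """Rescale vec so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def noisy_clipped_mean(
    per_sample_grads: List[Vector],
    max_norm: float,
    noise_std: float,
    rng: random.Random,
) -> Vector:
    """Sum clipped per-sample gradients, add Gaussian noise, and average.

    `noise_std` is taken as given; choosing it for a privacy budget is
    outside this sketch.
    """
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        for i, value in enumerate(clip_to_norm(grad, max_norm)):
            total[i] += value
    return [(t + rng.gauss(0.0, noise_std)) / n for t in total]


# The first gradient has norm 5 and is clipped to norm 1; the second is untouched.
grads = [[3.0, 4.0], [0.3, 0.4]]
private_grad = noisy_clipped_mean(grads, max_norm=1.0, noise_std=0.5, rng=random.Random(0))
```

Clipping bounds how much any single sample can move the sum, which is what lets a fixed noise scale translate into a privacy guarantee.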
- Discrete Latent Features Ablate Adversarial Attack: A Robust Prompt Tuning Framework for VLMs
-
DEFEAT discovers that "discretizing CLIP's image latent features" naturally weakens adversarial perturbations. Consequently, it inserts a VQ-VAE-based PerturbShield module into a prompt tuning framework to reconstruct grid features, followed by logits fusion to balance robustness and clean accuracy. On 15 datasets, the harmonic mean of robustness and accuracy for adversarial few-shot classification is improved by an average of 13.76% compared to the previous SOTA.
- Distributional Machine Unlearning via Selective Data Removal
-
This work formalizes "forgetting an unwanted sub-distribution" as an information-theoretic problem. It proves that removing a small set of high-impact samples furthest from the retained distribution yields a quadratic improvement in sample efficiency in low-divergence scenarios, reducing data deletion by 15–82% compared to "full deletion" in empirical tests.
- Don't Shift the Trigger: Robust Gradient Ascent for Backdoor Unlearning
-
The authors discover that using Gradient Ascent (GA) for backdoor unlearning does not truly "erase" the trigger but instead shifts its influence to another class (termed "Trigger Drift"). They propose Robust Gradient Ascent (RGA), which utilizes an adaptive weight based on KL divergence to automatically shut down GA once the backdoor is neutralized, combined with \(L_2\) anchoring regularization to stabilize optimization, thereby removing the backdoor without introducing new misclassifications.
- Doubly-Regressing Approach for Subgroup Fairness
-
When numerous sensitive attributes lead to an explosion of subgroups and extreme data sparsity, this paper introduces "subgroup subset fairness" measured by supIPM. By employing a "Doubly Regressing \(R^2\) (DR²)" proxy objective that simultaneously regresses weight vectors and a discriminator, the method ensures distributional fairness across all large subgroups and marginal attributes using a single discriminator, significantly outperforming existing methods on highly sparse datasets.
- DPQuant: Efficient and Private Model Training via Dynamic Quantization Scheduling
-
DPQuant points out for the first time that "low-bit quantization causes much more severe accuracy collapse in Differential Privacy (DP) training than in standard training." It suppresses quantization variance by using "probabilistic rotation of layers to be quantized per epoch + a DP loss sensitivity estimator to prioritize quantization of low-impact layers." It achieves accuracy drops of <2% on ResNet/DenseNet/BERT, with a theoretical speedup of up to 2.21×.
- DRIFT: Divergent Response in Filtered Transformations for Robust Adversarial Defense
-
DRIFT prepends a set of lightweight learnable filters to a frozen classifier and employs a "consensus divergence" loss to actively scatter the gradient directions of different filters. This effectively disrupts the "gradient consensus" that adversarial perturbations rely on for transferability. On ImageNet, for both CNNs and ViTs, DRIFT achieves state-of-the-art robust accuracy against strong adaptive attacks such as PGD-EoT, AutoAttack, and BPDA, with almost no increase in inference overhead.
- Dual Randomized Smoothing: Beyond Global Noise Variance
-
This paper points out that standard Randomized Smoothing (RS) serves all inputs with a single global noise variance, leading to an inability to balance performance across small and large radii. The authors first theoretically prove that RS certification remains valid as long as the noise variance is "locally constant" within the certified region. Consequently, the Dual RS framework is proposed—first using an RS model to predict the optimal variance for each input, and then using another RS classifier for classification at that variance. This achieves strong performance across both small and large radii on CIFAR-10 and ImageNet, with inference overhead increasing by only approximately 60%.
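The tension around a single global variance is visible in the standard randomized smoothing certificate, where the certified \(\ell_2\) radius is \(\sigma\,\Phi^{-1}(\underline{p_A})\) for a lower bound \(\underline{p_A}\) on the top-class probability under noise. The sketch below computes that standard certificate only; the variance predictor of Dual RS is not reproduced, and the function name is an assumption.

```python
from statistics import NormalDist


def certified_radius(p_lower: float, sigma: float) -> float:
    """Standard randomized-smoothing L2 radius: sigma * Phi^{-1}(p_lower).

    Returns 0.0 (abstain) when the lower bound on the top-class probability
    does not exceed 1/2. p_lower must be strictly below 1.
    """
    if p_lower <= 0.5:
        return 0.0
    if p_lower >= 1.0:
        raise ValueError("p_lower must be strictly less than 1")
    return sigma * NormalDist().inv_cdf(p_lower)


# Same confidence, different noise levels: the radius scales linearly with sigma.
small_sigma_radius = certified_radius(0.99, 0.25)
large_sigma_radius = certified_radius(0.99, 0.50)
```

A small \(\sigma\) caps the achievable radius, while a large \(\sigma\) usually lowers \(\underline{p_A}\) on harder inputs, which is the trade-off a per-input variance is meant to relax.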
- EigenScore: OOD Detection using Posterior Covariance in Diffusion Models
-
This paper proposes EigenScore: when a diffusion model trained on InD data is applied to OOD samples, the denoising posterior covariance systematically expands along principal directions. By using the eigenvalue spectrum (sum of top-K eigenvalues) as a distribution shift signal—estimated efficiently using Jacobian-free subspace iteration—the method achieves SOTA average AUROC on standard OOD benchmarks (approximately 2% higher than the best baseline), remaining robust even in near-OOD scenarios such as CIFAR-10 vs CIFAR-100.
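A score of the form "sum of the top-K eigenvalues" can be estimated from matrix-vector products alone, without ever forming the matrix. The sketch below uses plain subspace (orthogonal) iteration on a generic symmetric positive semi-definite operator; `matvec` is a stand-in, and how the paper obtains posterior-covariance products from the diffusion denoiser is not reproduced.

```python
import random
from typing import Callable, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def orthonormalise(vectors: List[Vector]) -> List[Vector]:
    """Modified Gram-Schmidt; assumes the inputs are linearly independent."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        basis.append([wi / norm for wi in w])
    return basis


def top_k_eigenvalue_sum(
    matvec: Callable[[Vector], Vector], dim: int, k: int, iters: int = 100, seed: int = 0
) -> float:
    """Estimate the sum of the k largest eigenvalues of a symmetric PSD operator."""
    rng = random.Random(seed)
    basis = orthonormalise([[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)])
    for _ in range(iters):
        basis = orthonormalise([matvec(b) for b in basis])
    # Trace of the operator restricted to the converged subspace.
    return sum(dot(b, matvec(b)) for b in basis)


# Toy operator: a diagonal "covariance" with eigenvalues 5, 3, 1, 0.5.
eigenvalues = [5.0, 3.0, 1.0, 0.5]
score = top_k_eigenvalue_sum(lambda v: [e * x for e, x in zip(eigenvalues, v)], dim=4, k=2)
```

Under the summary's description, a larger score would indicate a covariance that has expanded along its principal directions, which is the signal used to flag OOD inputs.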
- Eliciting Harmful Capabilities by Fine-Tuning on Safeguarded Outputs
-
Even if frontier models strictly block direct harmful outputs using classifiers, attackers can obtain "surface-benign" responses in adjacent domains (e.g., general organic synthesis) and use these pairs to fine-tune open-weights models. This "elicitation attack" recovers approximately 40% of the capability gap in chemical weapon scenarios, revealing the failure of output-level guardrails at the ecosystem level.
- Exponential-Wrapped Mechanisms: Differential Privacy on Hadamard Manifolds Made Practical
-
This paper systematizes the simple technique of "sampling in the tangent space + push-forward via the exponential map" into Exponential-Wrapped Laplace/Gaussian mechanisms. It unifies the implementation of \(\epsilon\)-DP, \((\epsilon, \delta)\)-DP, GDP, and RDP on general Hadamard manifolds for the first time while completely eliminating MCMC sampling, making differential privacy for manifold data truly computable and scalable.
- Expressiveness of Multi-Neuron Convex Relaxations in Neural Network Certification
-
This paper rigorously characterizes the expressiveness of multi-neuron convex relaxations for the first time, proving they are inherently incomplete like single-neuron relaxations (generalizing the "single-neuron convex barrier" to a "universal convex barrier"). However, completeness can be restored through equivalent network transformations or input domain polyhedral partitioning, with the partitioning complexity being strictly superior to single-neuron relaxations.
- Fair Classification by Direct Intervention on Operating Characteristics
-
Instead of searching in the classifier space, this work directly performs geometric optimization on the group-level ROC convex hulls (operating characteristic space) of a pre-trained classifier. It first locates the optimal operating point that satisfies multiple fairness constraints and then post-processes the base classifier using minimal label flipping to reach that point, satisfying multiple fairness metrics like DP, EO, and PP with near-oracle accuracy loss.
- Fair Conformal Classification via Learning Representation-Based Groups
-
FAREG shifts the task of "identifying subgroups discriminated against by algorithms" from the raw feature space to a latent representation space learned via Variational Information Bottleneck (VIB). This allows the model to capture unfair subgroups defined by non-linear combinations (such as XOR) and perform individual conformal calibration on these groups. It ensures adaptive equalized coverage while maintaining small, efficient prediction sets (complexity \(O(N+M)\), significantly lower than AFCP's \(O(N\log N+NM)\)).
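Calibrating separately on each group can be sketched with the standard split-conformal rule, which picks the \(\lceil (n+1)(1-\alpha) \rceil\)-th smallest calibration score as the threshold. Here the group labels are simply given as input, which is an assumption: FAREG's contribution is learning those groups in a VIB latent space, and that step is not shown.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def conformal_threshold(scores: Sequence[float], alpha: float) -> float:
    """Split-conformal quantile of nonconformity scores at miscoverage level alpha."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf          # too few calibration points: include every label
    return sorted(scores)[rank - 1]


def groupwise_thresholds(
    scores: Sequence[float], groups: Sequence[Hashable], alpha: float
) -> Dict[Hashable, float]:
    """Compute one conformal threshold per group from its own calibration scores."""
    by_group: Dict[Hashable, List[float]] = defaultdict(list)
    for score, group in zip(scores, groups):
        by_group[group].append(score)
    return {group: conformal_threshold(vals, alpha) for group, vals in by_group.items()}


cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1.0, 2.0, 3.0]
cal_groups = ["a"] * 7 + ["b"] * 3
thresholds = groupwise_thresholds(cal_scores, cal_groups, alpha=0.25)
```

At test time, a label enters the prediction set of a sample when its nonconformity score is at most the threshold of the sample's group, so groups with harder examples get larger sets instead of lower coverage.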
- Fair Decision Utility in Human-AI Collaboration: Interpretable Confidence Adjustment for Humans with Cognitive Disparities
-
Targeting scenarios where "experts and novices share the same AI-assisted decision-making system," this paper demonstrates that existing calibration and human-alignment methods fail to guarantee fair decision utility across populations with different cognitive abilities. It proposes a new objective, inter-group-alignment, and utilizes cognition-aware multicalibration to simultaneously achieve high utility and utility fairness.
- Fair Graph Machine Learning under Adversarial Missingness Processes
-
This paper reveals an overlooked attack surface—adversarial sensitive attribute missingness processes—which can deceive fair GNNs by making imputation models "look fair." It proposes BFtS: a framework that uses a three-player adversarial game to impute missing sensitive values based on "worst-case fairness."
- Fair Reinforcement Learning for Just AI
-
The authors transform "quantile fairness" from a tabular algorithm requiring complete MDP transition tables into an oracle-efficient algorithm that calls a standard policy optimization oracle (approximately \(O(n)\) times). This allows "fair aggregation across multiple conflicting values" to scale to deep RL for the first time, achieving speeds several orders of magnitude faster than prior work.
- Fairness-Aware Multi-view Evidential Learning with Adaptive Prior
-
Addressing the neglected issue in multi-view evidential learning where samples tend to allocate support evidence to majority classes—leading to unfair uncertainty estimation—this paper proposes FAML. By replacing the fixed uniform prior in evidential deep learning with a training-trajectory-based adaptive prior, and incorporating fairness constraints and view opinion alignment, FAML simultaneously improves classification accuracy (especially for tail classes) and uncertainty reliability across six real-world multi-view datasets.
- Fairness via Independence: A General Regularization Framework for Machine Learning
-
This paper proposes using Cauchy-Schwarz (CS) divergence as a fairness regularization term to minimize the statistical dependence between "model predictions" and "sensitive attributes." Using a unified framework that is model-agnostic and independent of specific fairness definitions, it simultaneously improves \(\Delta\)DP and \(\Delta\)EO while maintaining accuracy and demonstrating greater robustness to hyperparameter variations.
- FaLW: A Forgetting-aware Loss Reweighting for Long-tailed Unlearning
-
This paper is the first to investigate the realistic scenario where the "forget set follows a long-tailed distribution." It identifies that existing approximate unlearning methods produce heterogeneous unlearning deviation and skewed unlearning deviation. The authors propose FaLW, a plug-and-play instance-level dynamic loss reweighting method that uses the "predictive probability distribution of unseen data" to measure the unlearning state of each sample and adaptively adjust the unlearning intensity.
- Federated Learning of Quantile Inference under Local Differential Privacy
-
This paper proposes a Local SGD algorithm for federated quantile inference (not just point estimation) under Local Differential Privacy (LDP). By utilizing a privacy mechanism that transforms the LDP problem into an equivalent non-private one, the authors establish the first weak convergence theory for Local SGD under non-smooth quantile loss and employ self-normalization to construct valid confidence intervals without estimating asymptotic variance.
- FERD: Fairness-Enhanced Data-Free Adversarial Robustness Distillation
-
FERD introduces "robust fairness" into data-free robustness distillation for the first time. By applying class proportion reweighting on synthetic samples and distribution uniformization of adversarial targets, it significantly enhances student model robustness on the weakest classes, mitigating the severe inter-class robustness imbalance.
- Fine-Grained Class-Conditional Distribution Balancing for Debiased Learning
-
This paper decomposes group-robust learning without bias annotations into "overfitting a model to identify bias patterns followed by fine-grained class-conditional distribution matching via a confusion matrix." It proposes MST and FG-CCDB, which approach or exceed the performance of methods relying on manual group annotations in binary, multi-shortcut, and extreme multi-class scenarios.
- Fine-Grained Iterative Adversarial Attacks with Limited Computation Budget
-
This paper models the maximization of iterative adversarial attack strength under a fixed computation budget as a layer-wise and step-wise combinatorial optimization problem. It proposes an event-driven Spiking-PGD: when the relative change of activations between adjacent iterations in a layer is below a threshold, the previous output is reused and recalculation is skipped. Virtual surrogate gradients are used to recover backpropagation signals blocked by spiking gates. This approach significantly outperforms existing attacks under equivalent computation budgets and achieves comparable robustness in adversarial training using only approximately 30% of the budget.
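The event-driven skip rule can be sketched as a small gate around a layer. This is a toy version under stated assumptions: the change is measured against the input seen at the last recomputation, the threshold value is arbitrary, and `gated_layer` is a hypothetical helper, not the paper's implementation (the surrogate-gradient part is not shown).

```python
import numpy as np

def gated_layer(layer_fn, x, cache, threshold=0.05):
    """Recompute layer_fn(x) only if x changed enough since the last recompute.

    cache is None or a pair (x_prev, y_prev). Returns (output, cache, recomputed).
    """
    if cache is not None:
        x_prev, y_prev = cache
        rel_change = np.linalg.norm(x - x_prev) / (np.linalg.norm(x_prev) + 1e-12)
        if rel_change < threshold:
            # Small change: reuse the previous output and skip the computation.
            return y_prev, cache, False
    y = layer_fn(x)
    return y, (x, y), True
```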
- Fingerprinting Deep Neural Networks for Ownership Protection: An Analytical Approach
-
AnaFP reformulates the empirical problem of "how far a fingerprint should be from the decision boundary" into finding a feasible interval for a stretch factor. By constraining adversarial fingerprints using both a robustness lower bound and a uniqueness upper bound, AnaFP distinguishes pirated models from independent models more stably than existing methods across CNNs, MLPs, and GNNs.
- Fisher-Rao Sensitivity for Out-of-Distribution Detection in Deep Neural Networks
-
This paper revisits Out-of-Distribution (OoD) detection through the lens of Riemannian information geometry, treating the network's predictive distributions as a statistical manifold. The authors find that OoD inputs exhibit higher local Fisher-Rao sensitivity at the trained parameters and quantify this sensitivity using the trace of the Fisher Information Matrix (FIM). Theoretically, they derive a "feature magnitude \(\times\) output uncertainty" product form that unifies existing OoD signals. Using a product manifold construction, they then upgrade this into a more robust additive score, achieving competitive detection performance with a single forward pass, no retraining, and no OoD data.
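The product form can be checked in the simplest setting, a softmax linear head with logits \(z = Wh\), where the FIM trace with respect to \(W\) equals \(\|h\|^2 (1 - \sum_k p_k^2)\). The sketch below compares this closed form with a brute-force sum; it covers only the last layer and is an assumption-level illustration, not the paper's full score.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fim_trace_last_layer(h, W):
    """Closed form: feature magnitude ||h||^2 times output uncertainty 1 - ||p||^2."""
    p = softmax(W @ h)
    return float(h @ h) * (1.0 - float(p @ p))

def fim_trace_bruteforce(h, W):
    """E_{y ~ p} || d log p(y | h) / dW ||^2, summed class by class."""
    p = softmax(W @ h)
    total = 0.0
    for y in range(len(p)):
        grad = np.outer(np.eye(len(p))[y] - p, h)  # gradient of log p(y|h) w.r.t. W
        total += p[y] * float(np.sum(grad * grad))
    return total
```

Both factors are available from one forward pass, which is consistent with the single-pass claim in the summary.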
- Formalising Human-in-the-Loop: Computational Reductions, Failure Modes, and Legal–Moral Responsibility
-
This paper utilizes the concepts of oracle machines and "reductions" from computability theory to rigorously formalize diverse Human-in-the-Loop (HITL) human oversight schemes into three categories—Trivial Monitoring, Endpoint Action, and Involved Interaction. Based on this, it establishes a failure mode classification system and analyzes blind spots in UK/EU laws, ultimately revealing an unavoidable "Accountability ↔ Technical Interpretability" trade-off.
- From Curiosity to Caution: Mitigating Reward Hacking for Best-of-\(N\) with Pessimism
-
This paper reverses the idea of using "curiosity" reward prediction error as an exploration signal in Reinforcement Learning. Instead, it trains a predictor to fit the internal features of a reward model on typical responses and uses the prediction error as an "out-of-distribution uncertainty" penalty for reward scores. This ensures that Best-of-\(N\) sampling no longer degrades as \(N\) increases but rather improves monotonically.
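A minimal sketch of the pessimistic selection step, assuming the reward scores and the predictor's errors have already been computed for each of the \(N\) candidates. The penalty weight `lam` and the function name are hypothetical.

```python
import numpy as np

def pessimistic_best_of_n(rewards, pred_errors, lam=1.0):
    """Pick the candidate with the highest reward after an uncertainty penalty."""
    rewards = np.asarray(rewards, dtype=float)
    pred_errors = np.asarray(pred_errors, dtype=float)
    adjusted = rewards - lam * pred_errors  # large error = likely out of distribution
    return int(np.argmax(adjusted))
```

With `lam=0` this reduces to plain Best-of-\(N\); a positive `lam` makes high-reward but atypical responses less attractive.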
- Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction
-
Addressing the reward hacking issue in real-time melody-chord accompaniment—where RL post-training "collapses into repetitive chords to exploit consistency rewards"—this paper proposes GAPT. It utilizes a discriminator that co-evolves with the policy to provide an adversarial reward representing "authenticity relative to real data." Combined with a two-stage adaptive update schedule, it restores output diversity to near-dataset levels without sacrificing harmonic consistency. In a real-time jamming user study with 12 professional musicians, it significantly improved adaptation speed and the sense of agency.
- Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness
-
This paper proposes the RICH Hypothesis (Robustness from Inference Compute Hypothesis): test-time compute can be traded for robustness only when "the components of the attacked data have been covered by the training data." Based on this, it demonstrates that lightweight adversarial fine-tuning of a VLM's vision encoder turns extended reasoning (CoT / budget forcing) from "nearly ineffective" into a significant robustness gain, a "rich-get-richer" dynamic.
- GradPCA: Leveraging NTK Alignment for Reliable Out-of-Distribution Detection
-
GradPCA leverages the low-rank structure of network gradients induced by NTK alignment. By performing PCA on "class-mean gradients" to characterize the ID subspace, it identifies inputs with gradients falling outside this subspace as OOD. It achieves more consistent (rather than occasionally optimal) detection performance across multiple image classification benchmarks and provides a theoretical framework for spectral OOD detection.
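The subspace test can be sketched with plain NumPy, assuming the class-mean gradients are already flattened into vectors. Centering the gradients, using the residual norm as the score, and both function names are assumptions for this example rather than details confirmed by the summary.

```python
import numpy as np

def fit_id_subspace(class_mean_grads, k):
    """PCA on class-mean gradients; returns the mean and the top-k directions."""
    G = np.asarray(class_mean_grads, dtype=float)
    mean = G.mean(axis=0)
    _, _, vt = np.linalg.svd(G - mean, full_matrices=False)
    return mean, vt[:k]

def ood_score(grad, mean, basis):
    """Norm of the part of the gradient lying outside the ID subspace."""
    r = np.asarray(grad, dtype=float) - mean
    residual = r - basis.T @ (basis @ r)
    return float(np.linalg.norm(residual))
```

A gradient that lies inside the span of the class-mean gradients scores near zero, while one with energy in other directions scores high.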
- How to Cure Newton for Unlearning Neural Networks? An Empirical Study from the Hessian Perspective
-
This paper discovers that Newton unlearning fails on real-world neural networks and LLMs due to Hessian degeneracy (a large number of zero/negative eigenvalues). It proposes CuReNU, based on cubic regularization, and its stochastic Hessian-free variant CuReNUS. These methods automatically determine the damping factor \(\gamma\), guarantee convergence to second-order stationary points, and achieve unlearning performance comparable to SOTA empirical methods across batch, sequential, and LLM-scale unlearning tasks.
- Identifying Robust Neural Pathways: Few-Shot Adversarial Mask Tuning for Vision-Language Models
-
This paper proposes AdvMask: instead of modifying pre-trained CLIP weights, it learns a set of binary masks for the vision encoder to "unearth" an inherently attack-resistant robust neural pathway by deactivating parameters sensitive to adversarial perturbations. This is combined with a Layer-wise Adaptive Feature Alignment (LAFA) loss specifically designed for adversarial robust fine-tuning in few-shot scenarios.
- INO-SGD: Addressing Utility Imbalance under Individualized Differential Privacy
-
This paper identifies that "Individualized Differential Privacy" (IDP) creates utility imbalances even when the training set itself is balanced—data with stricter privacy requirements becomes severely underrepresented. The authors propose INO-SGD: sorting gradients by loss within each batch and applying "continuous" down-weighting to unimportant gradients. This compensates the utility of more private groups while strictly satisfying the IDP budget of every data owner.
- Jailbreaking on Text-to-Video Models via Scene Splitting Strategy
-
SceneSplit decomposes a single harmful prompt into multiple "individually harmless" storyboards. By leveraging the temporal combination of these scenes, it constrains the video generation output space into unsafe regions and iteratively rewrites the most influential scenes to bypass visual safety filters, achieving Attack Success Rates (ASR) of 68.6%–84.1% across five commercial T2V models.
- Label Smoothing Improves Machine Unlearning
-
This paper integrates "Negative Label Smoothing" into Gradient Ascent-based machine unlearning, proposing a plug-and-play method named UGradSL. By performing gradient ascent with negative smoothed labels on the forget set and gradient descent on the retain set, UGradSL significantly closes the performance gap with the "Retrained Model" at almost zero additional computational overhead, and the paper provides a theoretical proof of improved label-level local differential privacy.
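To make "negative smoothed labels" concrete, the sketch below applies the standard label-smoothing formula with a negative rate `alpha`, so the true class receives a weight above 1 and the other classes receive small negative weights. The formula, the rate, and the helper names are assumptions for illustration; the paper's exact parameterization may differ.

```python
import numpy as np

def smooth_labels(y, num_classes, alpha):
    """Standard smoothing (1 - alpha) * onehot + alpha / K; alpha < 0 is 'negative'."""
    onehot = np.eye(num_classes)[np.asarray(y)]
    return (1.0 - alpha) * onehot + alpha / num_classes

def soft_cross_entropy(logits, targets):
    """Cross-entropy against soft (possibly negative-weight) targets."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())
```

In an unlearning step, this loss would be ascended on forget-set batches and the usual loss descended on retain-set batches.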
- LAMDA: A Longitudinal Android Malware Benchmark for Concept Drift Analysis
-
LAMDA constructs a long-term malware benchmark covering over 1 million Android APKs from 2013 to 2025. Using Drebin static features, family labels, and multiple temporal splitting systems, it reveals that existing malware detectors degrade rapidly under real-world concept drift.
- Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights
-
This paper reveals that privacy vulnerability is concentrated in a very small number of critical weights (as low as 0.1%) and is highly entangled with learnability (Pearson r > 0.9). It proposes the CWRF method, which achieves a superior privacy-utility trade-off by rewinding and freezing privacy-vulnerable weights while fine-tuning only the remaining weights.
- LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops
-
LingoLoop injects subtle adversarial perturbations into input images to trap Multimodal Large Language Models (MLLMs) into infinite repetitive generation. By utilizing "POS-prior-based EOS suppression" and "hidden state contraction-induced loops," it produces up to 367× more tokens than normal inputs when generation limits are relaxed, causing compute/energy-exhaustion Denial-of-Service (DoS).
- LiteGuard: Efficient Task-Agnostic Model Fingerprinting with Enhanced Generalization
-
LiteGuard employs two strategies—expanding model sets using training checkpoints and assigning a lightweight local verifier to each fingerprint—to minimize the training requirement for task-agnostic model fingerprinting (requiring as few as 1 real model per set). It outperforms the SOTA MetaV in AUC across five categories of tasks while reducing training costs by 5 to 10 times.
- LitmusValues: Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
-
This paper proposes the LitmusValues evaluation framework and the AIRiskDilemmas dataset. By forcing models to make choices in "value-conflict" dilemma scenarios, it reveals their true value priorities. It demonstrates that these revealed values (even seemingly harmless ones like "Care") can predict risk behaviors in both seen and unseen scenarios, serving as an early warning system for AI risks.
- Machine Unlearning under Retain–Forget Entanglement
-
To address the "collateral damage" to related samples caused by semantic entanglement between the forget and retain sets, this paper proposes a two-stage optimization framework. The first stage uses an augmented Lagrangian method to unlearn aggressively while locking irrelevant retain samples; the second stage employs gradient projection regularized by Wasserstein-2 distance to restore the accuracy of semantically adjacent retain samples while preventing unlearning rebound.
- Memorization Through the Lens of Sample Gradients
-
This paper proposes Cumulative Sample Gradient (CSG)—the accumulation of the "loss gradient with respect to the input" throughout the training process—as an efficient proxy for Feldman's memorization score. Theoretically, it proves that CSG is linearly bounded by both the degree of memorization and the learning time. This leads to the discovery of an early stopping criterion at the peak of the weight norm, which requires no validation set and accelerates memorization estimation by up to 5 orders of magnitude.
- Mitigating Privacy Risk via Forget Set-Free Unlearning
-
This paper introduces the partially-blind unlearning setting and the RELOAD method, which replaces the original forget set with cached full-data gradients from the end of training. By utilizing a single-step reverse forget gradient, selective weight re-initialization, and fine-tuning on the retain set, it approximates a from-scratch retrained model without retaining the samples to be deleted. The method achieves strong results across general sample unlearning, LLM entity unlearning, and error correction.
- MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
-
MoReBench proposes evaluating the structural quality of the reasoning process (rather than the correctness of the final conclusion) of reasoning models across 1,000 moral dilemmas using 23,018 expert-written rubric criteria. The study finds that neither scaling laws nor performance on math/code benchmarks can predict a model's moral reasoning capabilities.
- MUSE: Model-Agnostic Tabular Watermarking via Multi-Sample Selection
-
MUSE proposes a "multi-sample selection" paradigm for tabular data watermarking: generating multiple candidate samples for each row and selecting the one with the highest score via a keyed scoring function. This bypasses the unreliability of DDIM inversion in diffusion models, achieving a model-agnostic, calibratable, and low-distortion solution.
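The selection idea can be sketched without any generative model: draw several candidate rows, keep the one that a keyed function scores highest, and later test whether a table's rows score unusually high under the same key. HMAC-SHA256 as the scoring function, the row serialization via `repr`, and the mean-score statistic are stand-ins chosen for this example, not the paper's construction.

```python
import hashlib
import hmac

def keyed_score(key: bytes, row: tuple) -> float:
    """Pseudo-random score in [0, 1) that depends on the secret key and the row."""
    digest = hmac.new(key, repr(row).encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def select_watermarked(key: bytes, candidates):
    """Among candidate rows for one record, keep the highest-scoring one."""
    return max(candidates, key=lambda r: keyed_score(key, r))

def mean_score(key: bytes, rows) -> float:
    """Detection statistic: about 0.5 for unmarked rows, higher for selected rows."""
    return sum(keyed_score(key, r) for r in rows) / len(rows)
```

Because selection only needs samples, the generator can be any model, which matches the model-agnostic claim; the number of candidates per row trades detectability against distortion.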
- Nasty Adversarial Training: A Probability Sparsity Perspective for Robustness Enhancement
-
This paper leverages Nasty Training, originally designed to "prevent model distillation," to enhance adversarial robustness. By utilizing a vanilla-trained "adversary model" for divergence regularization, the target model is forced to output a sparse probability distribution. This widens inter-class margins and increases decision boundary margins, achieving SOTA robustness on CIFAR / ImageNet with minimal overhead, while providing an interpretable spatial metric perspective.
- NatADiff: Adversarial Boundary Guidance for Natural Adversarial Diffusion
-
NatADiff utilizes diffusion models to guide sampling trajectories toward the "boundary between the true class and the adversarial class." Rather than producing constrained adversarial samples with perturbations, it generates "natural adversarial samples" that naturally blend adversarial semantic cues. This maintains white-box attack success rates while significantly enhancing cross-architecture transferability, producing a distribution closer to real-world test-time errors.
- No Prior, No Leakage: Revisiting Reconstruction Attacks in Trained Neural Networks
-
This paper revisits training data reconstruction attacks based on implicit bias from a "defensive" perspective. It rigorously proves that, without prior knowledge of the data, the attack objective has infinitely many indistinguishable global optima that can be arbitrarily far from the real training set. The "success" of such attacks therefore relies fundamentally on external priors rather than on information leaked by the implicit bias itself.
- NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models
-
NoisePrints treats the random seed used during diffusion model generation as a natural authorship fingerprint. By generating initial noise via a hashed seed and performing correlation verification with the VAE latent of the generated content, it achieves lightweight authorship verification for images and videos without model modification, sampling changes, or inversion.
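A toy version of the verification step: derive the initial noise deterministically from a hashed seed, then check how strongly a latent correlates with it. The SHA-256-to-generator mapping, the cosine statistic, and the helper names are assumptions for this sketch; in practice the latent would come from encoding the generated content with the VAE.

```python
import hashlib
import numpy as np

def noise_from_seed(seed: bytes, shape):
    """Deterministic Gaussian noise derived from a hashed seed."""
    digest = hashlib.sha256(seed).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    return rng.standard_normal(shape)

def cosine(a, b):
    """Cosine similarity between two arrays of the same shape."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A claimant who holds the true seed obtains a correlation well above the near-zero value expected for an unrelated seed, and a decision threshold would be calibrated on that gap.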
- On Optimal Hyperparameters for Differentially Private Deep Transfer Learning
-
This paper systematically investigates the critical hyperparameters of clipping bound \(C\) and batch size \(B\) in Differentially Private (DP) transfer learning. It demonstrates that prevalent heuristics—such as "use small \(C\) for strong privacy" and "use large batch sizes for a fixed number of steps"—are erroneous. Based on a theory of optimal clipping via MSE decomposition and an analysis of cumulative DP noise, the authors explain why \((C, B, \eta)\) should be jointly tuned according to the "learning problem difficulty."
- On the Interaction of Compressibility and Adversarial Robustness
-
This paper provides a unified theoretical framework proving that "structured compression" (at both neuron and spectral levels) concentrates parameter energy into a few dominant directions. This concentration raises the operator norm and Lipschitz constant of the network, creating "high-sensitivity directions" in the representation space that adversarial attacks can exploit, ultimately leading to a systematic degradation of adversarial robustness. These predictions are validated across various architectures, datasets, and training paradigms.
- Optimal Transport-Induced Samples against Out-of-Distribution Overconfidence
-
This work leverages the geometric singularity boundaries of semi-discrete Optimal Transport (OT) to locate semantically ambiguous regions and generates proxy OOD samples (OTIS) near these boundaries. By employing a confidence suppression loss during training, the model is forced to produce uniform predictions in structurally uncertain areas, systematically mitigating the OOD overconfidence issue in DNNs.
- Optimizing Canaries for Privacy Auditing with Metagradient Descent
-
This paper employs metagradient descent to directly optimize the set of canaries (probe samples) used in privacy auditing. In black-box, single-training differential privacy (DP) auditing scenarios, this approach improves the empirical privacy lower bound \(\varepsilon\) by several times compared to existing random or mislabeled canaries, relying solely on the final model output.
- PateGAIL++: Utility Optimized Private Trajectory Generation with Imitation Learning
-
PateGAIL++ dynamically allocates the privacy budget based on "per-sample privacy sensitivity" within a federated differential privacy imitation learning framework, injects adaptive Laplace noise, and utilizes WGAN-GP to stabilize policy training under discrete trajectories. This significantly improves the "privacy-utility" tradeoff of synthetic mobility trajectories under the same privacy budget and renders membership inference attacks nearly equivalent to random guessing.
- PE-SGD: Differentially Private Deep Learning via Evolution of Gradient Subspace for Text
-
PE-SGD combines "gradient projection + private evolving synthetic data" for differentially private fine-tuning: it uses a synthetic dataset that evolves continuously during training to span the gradient projection subspace and injects DP noise into the optimal projection coefficients. It significantly outperforms DP-SGD and various projection-based baselines in scenarios with extremely limited private data (\(M < 500\)) and tight privacy budgets (\(\epsilon = 1\)).
- Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models
-
The authors generated 276 million person bounding boxes with perceived gender/race labels and person-centric captions for the entire LAION-400M dataset. Using this first full-scale web data annotation, they audited systematic biases where "men, Black, and Middle Eastern individuals are over-associated with crime and negative content." Furthermore, they demonstrated that 60–70% of the gender bias in CLIP and Stable Diffusion can be directly predicted by a linear fit of the "gender-concept co-occurrence frequency" in the training data.
- Physically-Guided Optical Inversion Enable Non-Contact Side-Channel Attack on Isolated Screens
-
This paper demonstrates for the first time that wall diffuse scattering can serve as an "optical projection side-channel." It proposes IR4Net, a physically-guided inversion network that reconstructs display content from air-gapped screens using only passively captured scattering spots, without line-of-sight, electromagnetic leakage, or network connectivity.
- PluriHarms: Benchmarking the Full Spectrum of Human Judgments on AI Harm
-
PluriHarms employs an automated pipeline of "oversampled generation → interpretable feature extraction → genetic algorithm selection" to create 150 prompts spanning the spectrum from "completely benign to clearly harmful," specifically focusing on borderline controversies. By collecting 15,000 ratings from 100 annotators along with demographic and psychological traits, the study treats "annotator disagreement" as a signal rather than noise. It evaluates safety models based on this data, finding that personalized alignment significantly improves the prediction of human harm judgments, though substantial room for improvement remains.
- Prior-based Noisy Text Data Filtering: Fast and Strong Alternative for Perplexity
-
This paper proposes a text data filtering method based on token priors (word frequency statistics). Using the mean and standard deviation of in-document token priors as an approximation of perplexity (PPL), it achieves the highest average performance across 20 downstream benchmarks while being over 1000x faster than PPL-based filtering.
- Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding
-
This paper attaches a lightweight "anonymization adapter" to a frozen video foundation model and applies self-supervised adversarial training directly in the latent feature space to erase private information such as skin tone, gender, and clothing. A single set of anonymized features then serves various downstream tasks, including action recognition, temporal localization, and anomaly detection, reducing privacy leakage by 35% while downstream performance drops by only 1–2%.
- Private Rate-Constrained Optimization with Applications to Fair Learning
-
This paper proposes RaCO-DP—a differential privacy version of the Stochastic Gradient Descent-Ascent (SGDA) algorithm. By unifying various "group fairness" metrics into "generalized rate constraints" based on histograms, the additional privacy overhead of the entire constrained optimization is reduced to the cost of a private mini-batch histogram estimation per step. This approach Pareto-dominates existing private fair learning methods on the privacy-utility-fairness triangle.
- Protection against Source Inference Attacks in Federated Learning
-
Addressing Source Inference Attacks (SIA) in Federated Learning—where a server guesses which client a specific data record belongs to—this paper demonstrates that standard shuffling is insufficient as attackers can remap shuffled models back to owners using shadow datasets. The authors propose combining parameter-level shuffling with the Residue Number System (RNS) and unary encoding at bit-level granularity. This ensures the server only perceives aggregated results without access to individual client models, reducing SIA success rates to random guessing levels with negligible impact on joint model accuracy.
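The Residue Number System part can be illustrated on integers: a value is split into residues modulo pairwise coprime moduli, residues can be added independently, and the Chinese Remainder Theorem recovers the sum. The moduli and helper names are arbitrary choices for this sketch; the shuffling and unary-encoding steps are not shown.

```python
from math import prod

def to_rns(x, moduli):
    """Represent a non-negative integer by its residues modulo each modulus."""
    return [x % m for m in moduli]

def from_rns(residues, moduli):
    """Recover the integer modulo prod(moduli) via the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # modular inverse (Python 3.8+)
    return total % M
```

Since addition commutes with taking residues, a server that only sees per-modulus sums can still reconstruct the aggregate, provided the true sum stays below the product of the moduli.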
- Reducing Information Dependency Does Not Cause Training Data Privacy. Adversarially Non-Robust Features Do.
-
This paper overturns the mainstream hypothesis that "reducing information dependency between training data and models prevents reconstruction attacks" through three counter-intuitive experiments. It demonstrates that privacy under Model Inversion Attacks (MIA) actually stems from "adversarially non-robust features." Based on this, it proposes Anti-Adversarial Training (AT-AT), reducing the reconstruction rate of ResNet-152 from 84% to 6.5% while maintaining higher accuracy than existing SOTA defenses.
- Remaining-data-free Machine Unlearning by Suppressing Sample Contribution
-
This paper characterizes "sample contribution to training" as the input sensitivity of the pre-trained model toward that sample. It proposes MU-Mis, which utilizes only the pre-trained model and forget data without accessing any remaining data. By minimizing the "sensitivity difference between target and irrelevant classes," it directly erases the contribution of forget samples. It is the first remaining-data-free method to achieve utility parity with state-of-the-art (SOTA) methods that rely on remaining data.
- RESFL: An Uncertainty-Aware Framework for Responsible Federated Learning by Balancing Privacy, Fairness and Utility
-
RESFL integrates "adversarial privacy decoupling" and "uncertainty-guided fair aggregation" into a single Federated Learning (FL) pipeline. It utilizes an evidential neural network to compute a scale-invariant group fairness metric, UFM, to weight client updates. In autonomous driving object detection, this framework simultaneously reduces privacy leakage and narrows group disparities with minimal impact on accuracy.
- Rethinking LoRA for Privacy-Preserving Federated Learning in Large Models
-
Aiming at the performance collapse of directly applying LoRA in Differentially Private Federated Learning (DPFL), this paper identifies three root causes—gradient coupling, noise multiplicative amplification, and entrapment in sharp minima after aggregation. It proposes LA-LoRA, which alternately updates two low-rank matrices within each local round and smooths noisy gradients using a fixed Gaussian low-pass filter. It achieves SOTA on Swin Transformer and RoBERTa (outperforming the best baseline RoLoRA by 16.83% on Swin-B / Tiny-ImageNet / \(\epsilon=1\)).
- Rethinking Pareto Frontier: On the Optimal Trade-offs in Fair Classification
-
This paper reformulates the optimal fairness-accuracy trade-off achievable under a given model architecture (model-specific Pareto frontier) as a convex optimization problem over confusion vectors. It proves that existing post-processing frontiers are suboptimal and proposes a last-layer retraining framework with group-dependent biases, theoretically demonstrating its strict superiority over post-processing baselines such as randomized flipping.
- ReTrace: Reinforcement Learning-Guided Reconstruction Attacks on Machine Unlearning
-
The authors model the recovery of "unlearned data" as a reinforcement learning (RL) problem. By treating the residual differences (traces) between the pre-unlearning and post-unlearning models as reward signals, they guide a generator to search for high-reward regions in the input space. This approach successfully reconstructs samples and class distributions on large-scale models like ResNet and DistilBERT, achieving an instance-level recovery success rate of up to 73.1%.
- Risk-Sensitive Agent Compositions
-
This work formalizes agent workflows as Directed Acyclic Graphs (Agent Graph), modeling safety/fairness/privacy requirements using a max loss function. It proposes the BucketedVaR algorithm, which utilizes union bounds and dynamic programming to find the optimal agent composition minimizing VaR/CVaR in polynomial time, proven to be asymptotically near-optimal under the independent loss assumption.
- Robust Adversarial Attacks Against Unknown Disturbances via Inverse Gradient Sample
-
The authors propose IGSA (Inverse Gradient Sample-based Attack), which utilizes "Inverse Gradient Sampling" to actively identify the most destructive perturbation directions within the neighborhood of an adversarial example. By performing perturbation-guided optimization along these directions, the method generates robust adversarial examples that maintain high attack success rates under various unknown disturbances (blur, JPEG, rotation, perspective, etc.), significantly surpassing existing methods like EOT in both theory and experimentation.
- Robust Adversarial Quantification via Conflict-Aware Evidential Deep Learning
-
Addressing the issue where Evidential Deep Learning (EDL) makes "confident mistakes" under adversarial perturbations, this paper proposes C-EDL, a post-hoc method requiring no retraining. C-EDL generates multiple label-preserving transformed views for each input, quantifies the "conflict" between these views in the evidence space, and decays the evidence accordingly to amplify uncertainty. This reduces OOD data coverage by up to \(\approx 55\%\) and adversarial data coverage by up to \(\approx 90\%\), with almost no loss in ID accuracy or inference efficiency.
- Robust Federated Inference
-
This paper is the first to formalize the "Robust Federated Inference" problem, in which predictions from multiple local models are aggregated at the server but outputs from up to \(f < n/2\) clients may be arbitrarily tampered with. It provides the first robustness analysis, deriving provable certifications for mean-based aggregators and recasting the problem as adversarial learning for non-linear neural network aggregators. By combining DeepSet, adversarial training, and inference-time robust averaging (DeepSet-TM), it improves worst-case accuracy by 4.7–22.2 percentage points over existing robust aggregation methods.
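One standard robust averaging rule under the \(f < n/2\) threat model is the coordinate-wise trimmed mean, sketched below. Whether "TM" in DeepSet-TM denotes exactly this rule is an assumption here, and the function name is hypothetical.

```python
import numpy as np

def trimmed_mean(client_outputs, f):
    """Coordinate-wise mean after dropping the f smallest and f largest values."""
    P = np.sort(np.asarray(client_outputs, dtype=float), axis=0)
    n = P.shape[0]
    if not 0 <= 2 * f < n:
        raise ValueError("need 0 <= f < n/2")
    return P[f:n - f].mean(axis=0)
```

With at most `f` tampered rows, every value that survives trimming lies within the range of the honest clients' values in that coordinate.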
- Robust Fine-Tuning from Non-Robust Pretrained Models: Mitigating Suboptimal Transfer with Epsilon-Scheduling
-
This paper identifies that Robust Fine-Tuning (RFT) from non-robust pretrained models suffers from "suboptimal transfer"—where clean accuracy drops drastically below standard fine-tuning or even nears random levels, even with small adversarial perturbations. The authors attribute the root cause to "delayed task adaptation" and propose Epsilon-Scheduling (a two-stage hinge schedule that starts at 0 and linearly ramps up to the target \(\varepsilon_g\)) to allow the model to adapt to the task before imposing robustness constraints. They also propose an Expected Robustness metric for a more comprehensive characterization of the accuracy-robustness trade-off, demonstrating consistent improvements across 6 backbones and 5 datasets.
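The hinge schedule can be written as a small function of training progress. The breakpoints `t_start` and `t_end` (fractions of training) and the choice to hold the target value after the ramp are hypothetical defaults for this sketch, not values taken from the paper.

```python
def epsilon_schedule(step, total_steps, eps_target, t_start=0.25, t_end=0.75):
    """Perturbation budget: 0 until t_start, then a linear ramp to eps_target at t_end."""
    t = step / total_steps
    if t <= t_start:
        return 0.0  # stage 1: plain task adaptation, no adversarial perturbation
    if t >= t_end:
        return eps_target
    return eps_target * (t - t_start) / (t_end - t_start)
```

The returned value would be passed as the attack radius when crafting adversarial examples for that training step.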
- Robust Optimization for Mitigating Reward Hacking with Correlated Proxies
-
This paper models reward hacking as a max-min robust policy optimization problem that "performs well against the worst-case among all possible true rewards maintaining a correlation \(r\) with the proxy reward." It proposes two algorithms, a universal Max-Min and a feature-based Linear Max-Min, significantly improving worst-case returns and stability across environments including Traffic, Pandemic, Glucose, Tomato, and RLHF.
- Robust Spiking Neural Networks Against Adversarial Attacks
-
This paper theoretically proves that threshold-proximity spiking neurons are the key bottleneck for the adversarial robustness of directly trained SNNs (they simultaneously set the theoretical upper bound of attack intensity and are most prone to state flipping). It proposes the Threshold Guarding Optimization (TGO) method—a dual approach using membrane potential constraints and noisy LIF neurons—which achieves SOTA robustness across various adversarial scenarios with zero additional inference overhead.
- Robustify Spiking Neural Networks via Dominant Singular Deflation under Heterogeneous Training Vulnerability
-
The authors identify a heterogeneous training vulnerability in Spiking Neural Networks (SNNs) under the mainstream "direct coding + BPTT" paradigm—where a single batch with a slightly different distribution can cause complete network collapse. The root cause is theoretically attributed to the linear growth of the Hessian spectral radius over time steps. Accordingly, a parameter-free Dominant Singular Deflation (DSD) method is proposed to orthogonally remove the dominant singular component of gradients during backpropagation to suppress the spectral radius, significantly improving SNN adversarial robustness in both homogeneous and heterogeneous training scenarios.
- Membership Privacy Risks of Sharpness Aware Minimization
-
This paper systematically reveals for the first time that models trained with Sharpness-Aware Minimization (SAM), despite having better generalization performance, are more vulnerable to Membership Inference Attacks (MIA) than those trained with SGD. Theoretical and experimental explanations are provided through the lenses of memorization behavior and variance contraction.
- Sample-Efficient Distributionally Robust Multi-Agent Reinforcement Learning via Online Interaction
-
This paper conducts the first study on the online learning problem of Distributionally Robust Markov Games (DRMGs). It proposes the MORNAVI algorithm, which efficiently learns optimal robust policies through online interaction without simulators or offline data, providing the first provable regret bounds under TV and KL divergence uncertainty sets.
- SCOPED: Score–Curvature Out-of-Distribution Proximity Evaluator for Diffusion
-
SCOPED combines the "squared norm / Jacobian trace (curvature)" of the diffusion model score function into a single statistic \(T(x)\) to determine whether a sample is in-distribution. By utilizing the Hutchinson estimator to compress curvature into a single JVP, it approaches the accuracy of the strongest diffusion-based OOD methods with only 1–2 forward evaluations, requiring an order of magnitude fewer model evaluations than methods relying on complete denoising trajectories.
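The two ingredients can be sketched with a toy score function. Everything here is an assumption made for illustration: the score is that of an isotropic Gaussian rather than a diffusion model, the JVP is a central difference standing in for an autodiff JVP, and the way SCOPED combines the two quantities into \(T(x)\) is not reproduced.

```python
import numpy as np


def score(x, sigma=2.0):
    """Toy score of N(0, sigma^2 I): grad log p(x) = -x / sigma^2."""
    return -x / sigma**2


def jvp_fd(fn, x, v, h=1e-3):
    """Central-difference Jacobian-vector product (stand-in for autodiff)."""
    return (fn(x + h * v) - fn(x - h * v)) / (2.0 * h)


def score_norm_and_trace(fn, x, rng):
    """Return (||s(x)||^2, one-probe Hutchinson estimate of tr(ds/dx))."""
    s = fn(x)
    v = rng.choice([-1.0, 1.0], size=x.shape)  # Rademacher probe vector
    trace_est = float(v @ jvp_fd(fn, x, v))    # v^T J v, a single JVP
    return float(s @ s), trace_est


rng = np.random.default_rng(0)
x = rng.normal(size=5)
sq_norm, trace_est = score_norm_and_trace(score, x, rng)
```

For this linear toy score the Jacobian is \(-I/\sigma^2\), so a single Rademacher probe already recovers the trace exactly; for a real network the estimate is unbiased but noisy.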
- Secure Outlier-Aware Large Language Model Inference
-
This paper proposes the SOAL framework, identifying that "outlier activations" are prevalent in the nonlinear layers (Normalization, Activation, Softmax) of LLMs. By prefixing special tokens to the input to "confine" outliers to fixed positions and redesigning MPC nonlinear protocols for the narrowed input domains, the framework accelerates RMSNorm by ~2×, SiLU by ~2×, and Softmax by over 3×, achieving a nearly 2× overall speedup without model fine-tuning.
- SeRI: Gradient-Free Sensitive Region Identification in Decision-Based Black-Box Attacks
-
In decision-based black-box attack scenarios where only top-1 labels are available under tight query budgets, SeRI proposes a continuous pixel sensitivity definition based on the "decision boundary." By utilizing recursive region subdivision and local perturbation adjustment to estimate sensitivity weights for each pixel, it serves as a plug-and-play perturbation optimizer. It further reduces \(\ell_2\) perturbations of mainstream attacks such as HSJA, CGBA, RayS, and ADBA by approximately 15–30% under identical query constraints.
- Skirting Additive Error Barriers for Private Turnstile Streams
-
This paper proves that the polynomial lower bounds on purely additive error for differentially private turnstile streams (\(\Omega(T^{1/4})\) for Distinct Elements and \(\Omega(T)\) for the \(F_2\) moment) can be bypassed by also allowing multiplicative error. It achieves \((\text{polylog}(T), \text{polylog}(T))\) mixed error for Distinct Elements and \((1+\eta, \text{polylog}(T))\) mixed error for the \(F_2\) moment, both with polylogarithmic space.
- STEDiff: Unveiling Spatio-Temporal Redundancy in Backdoor Attacks on Text-to-Image Diffusion Models
-
The authors first reveal significant "spatio-temporal redundancy" in diffusion model backdoor attacks—only a few key weights (enrichment phenomenon) and a few key timesteps (marginal effect) are truly involved in backdoor injection. Based on this, a unified framework STEDiff is proposed. On the attack side, STEBA accelerates backdoor injection by up to 15.07× while saving 82% VRAM. On the defense side, STEDF utilizes spatio-temporal features to achieve real-time backdoor detection of up to 99.8%.
- Test-Time Poisoned Sample Detection by Exploiting Shallow Malicious Matching in Backdoored CLIP
-
This paper discovers that backdoored CLIP models exhibit "shallow malicious matching" on poisoned images—where image features align closely with the target text itself but remain far from its semantic neighbors. Based on this, Subspace Detection is proposed: at test time, the local text manifold of the predicted concept is reconstructed using text variants, a "Region of Interest" (ROI) is sampled along the positive direction, and poisoned samples are detected via the Euclidean distance from image features to this ROI. This method significantly outperforms existing detectors across 7 SOTA backdoor attacks and 3 datasets in terms of AUROC.
- The Gaussian-Head OFL Family: One-Shot Federated Learning from Client Global Statistics
-
GH-OFL allows clients to upload "class-conditional sufficient statistics" (counts, first/second moments) only once. The server then directly constructs closed-form Gaussian discriminant heads (NB/LDA/QDA) and synthesizes data-free samples in a Fisher subspace to train two lightweight heads (FisherMix, Proto-Hyper). It achieves OFL SOTA accuracy under strong non-IID conditions with a single communication round, without ever touching raw data.
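A rough sketch of the statistics-only pipeline for the LDA head alone, assuming each client sends per-class counts, feature sums, and sums of outer products. The function names and the `ridge` regularizer are hypothetical; the NB/QDA heads and the Fisher-subspace synthesis are not shown.

```python
import numpy as np


def client_stats(X, y, num_classes):
    """Per-class counts, feature sums, and outer-product sums for one client."""
    d = X.shape[1]
    counts = np.zeros(num_classes)
    sums = np.zeros((num_classes, d))
    outers = np.zeros((num_classes, d, d))
    for c in range(num_classes):
        Xc = X[y == c]
        counts[c] = len(Xc)
        sums[c] = Xc.sum(axis=0)
        outers[c] = Xc.T @ Xc
    return counts, sums, outers


def lda_head(stats_list, ridge=1e-6):
    """Closed-form LDA weights and biases from summed client statistics.

    Assumes every class appears on at least one client.
    """
    counts = sum(s[0] for s in stats_list)
    sums = sum(s[1] for s in stats_list)
    outers = sum(s[2] for s in stats_list)
    means = sums / counts[:, None]
    # Pooled within-class scatter: sum_c (S_c - n_c * mu_c mu_c^T)
    scatter = outers.sum(axis=0) - np.einsum("c,ci,cj->ij", counts, means, means)
    cov = scatter / counts.sum() + ridge * np.eye(means.shape[1])
    W = np.linalg.solve(cov, means.T).T  # row c holds Sigma^{-1} mu_c
    b = -0.5 * np.sum(W * means, axis=1) + np.log(counts / counts.sum())
    return W, b


# Strongly non-IID toy split: each client holds a single class.
rng = np.random.default_rng(0)
XA, yA = rng.normal(-3.0, 1.0, size=(200, 2)), np.zeros(200, dtype=int)
XB, yB = rng.normal(3.0, 1.0, size=(200, 2)), np.ones(200, dtype=int)
W, b = lda_head([client_stats(XA, yA, 2), client_stats(XB, yB, 2)])
X_all, y_all = np.vstack([XA, XB]), np.concatenate([yA, yB])
accuracy = float(np.mean(np.argmax(X_all @ W.T + b, axis=1) == y_all))
```

Because the statistics are additive, summing them across clients gives the same head as computing them on the pooled data, which is why a single upload suffices in this sketch.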
- The Self-Re-Watermarking Trap: From Exploit to Resilience
-
This paper demonstrates that deep image watermarking systems can be easily overwritten by "re-writing a new watermark using the same encoder," thereby compromising original ownership. It proposes a self-aware watermarking framework with Lipschitz constraints and re-watermarking adversarial training, enabling stable recovery of the original watermark even after self-re-watermarking and PGD overwriting attacks.
- Time Is All It Takes: Spike-Retiming Attacks on Event-Driven Spiking Neural Networks
-
The authors propose the Spike-Retiming Attack—a temporal attack method that alters spike timestamps without adding or deleting spikes. By formalizing a unified three-norm budget (\(\mathcal{B}_\infty\) local jitter, \(\mathcal{B}_1\) total delay, and \(\mathcal{B}_0\) manipulation count) under a capacity-1 constraint, and utilizing Projected-in-the-Loop (PIL) optimization to decouple strict forward projections from soft backward differentiation, the method achieves >90% ASR on CIFAR10-DVS, DVS-Gesture, and N-MNIST with <2% spike perturbation. This reveals a critical temporal vulnerability in event-driven SNNs.
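To make the three budgets concrete, the sketch below measures them for one retimed spike train. It assumes integer time bins on a single input line, reads \(\mathcal{B}_1\) as the sum of absolute shifts, and reads capacity-1 as "at most one spike per time bin"; these interpretations and the function names are illustrative, not taken from the paper.

```python
def retiming_budgets(original, retimed):
    """Return (B_inf, B_1, B_0) for matched per-spike integer timestamps."""
    if len(original) != len(retimed):
        raise ValueError("retiming must keep the spike count unchanged")
    shifts = [abs(r - o) for o, r in zip(original, retimed)]
    b_inf = max(shifts, default=0)         # largest single-spike jitter
    b_1 = sum(shifts)                      # total displacement
    b_0 = sum(1 for s in shifts if s > 0)  # number of spikes moved
    return b_inf, b_1, b_0


def respects_capacity_one(retimed):
    """True if no two spikes land in the same time bin."""
    return len(set(retimed)) == len(retimed)


budgets = retiming_budgets([1, 4, 7, 9], [2, 4, 5, 9])
```

Here two of the four spikes move, by one and two bins respectively, so the jitter, total shift, and count budgets can each be constrained separately.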
- Toward Enhancing Representation Learning in Federated Multi-Task Settings
-
This paper proposes Muscle loss—an N-tuple level multi-model contrastive learning objective whose minimization is equivalent to maximizing the lower bound of mutual information among all model representations. Based on this, the FedMuscle algorithm is designed to align the representation spaces of heterogeneous models using a public dataset. It naturally handles model and task heterogeneity, consistently outperforming SOTA baselines in CV/NLP multi-task settings (up to \(\Delta\) +28.65%).
- Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI
-
To address the prevalence of regression tasks in scientific computing, this paper utilizes a score-based diffusion model trained on the joint distribution \(p(x, y_{\text{pred}})\). By treating the joint log-likelihood as a "certificate of trust" for model predictions, it demonstrates a strong correlation with actual prediction errors. This allows for determining whether an AI prediction is trustworthy (ID/OOD) without ground truth values at test time. The method is validated on various scientific datasets, including PDEs, satellite remote sensing, and brain tumor segmentation.
- Towards Privacy-Guaranteed Label Unlearning in Vertical Federated Learning: Few-Shot Forgetting without Disclosure
-
Addressing the unique dilemma in Vertical Federated Learning (VFL) where "labels are both input and privacy," this paper proposes the first VFL label unlearning method. It utilizes a small set of public data with manifold mixup to synthesize embeddings, followed by gradient ascent on active/passive parties to erase target labels and gradient descent to recover performance on the remaining set. The entire process completes in seconds (16–1200× faster than baselines) with minimal loss in remaining set accuracy.
- Traceable Black-box Watermarks for Federated Learning
-
TraMark is the first method to realize server-side traceable black-box watermark injection in Federated Learning (FL). It partitions the model parameter space into main-task and watermark areas and employs masked aggregation to prevent watermark collisions, achieving a 99.58% verification rate with only a 0.54% reduction in main task accuracy.
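One way to picture masked aggregation is shown below. This is a guess at the mechanism under stated assumptions: parameters are flattened into one vector, a boolean mask marks the watermark area, main-task coordinates are plainly averaged, and each client's watermark coordinates are copied from a server-held watermarked model. The names are hypothetical and the paper's actual procedure may differ.

```python
import numpy as np


def masked_aggregate(client_params, watermarked_models, watermark_mask):
    """Average main-task coordinates; keep each client's own watermark area."""
    main_avg = np.mean(np.stack(client_params), axis=0)
    return [np.where(watermark_mask, wm, main_avg) for wm in watermarked_models]


mask = np.array([False, False, False, True, True])  # last two = watermark area
clients = [
    np.array([1.0, 2.0, 3.0, 0.1, 0.2]),
    np.array([3.0, 4.0, 5.0, 0.3, 0.4]),
]
watermarked = [
    np.array([0.0, 0.0, 0.0, 9.0, 9.5]),
    np.array([0.0, 0.0, 0.0, -7.0, -7.5]),
]
personalized = masked_aggregate(clients, watermarked, mask)
```

Since averaging never touches the masked coordinates, one client's watermark cannot overwrite another's in this sketch.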
- TriQDef: Disrupting Semantic and Gradient Alignment to Block Adversarial Patch Transfer in Quantized Networks
-
This paper discovers that adversarial patches are highly transferable across quantized networks of different bit-widths. The root cause is that models across bits maintain strong "perceptual alignment" in intermediate features and input gradients. TriQDef utilizes two perceptual mismatch regularizations (FDP + GPDP) along with a bit-wise curriculum training strategy to actively disrupt this cross-bit alignment during training. This reduces the Attack Success Rate (ASR) by over 40% under unseen patches or bit combinations, while incurring almost no loss in clean accuracy and zero extra inference overhead.
- TrojanTO: Action-Level Backdoor Attacks Against Trajectory Optimization Models
-
The authors propose TrojanTO, the first action-level backdoor attack against Trajectory Optimization (TO) models such as the Decision Transformer. As a "post-training" attack, it requires poisoning only 0.3% of trajectories without manipulating reward signals. By employing "Trajectory Filtering + Batch Poisoning + Alternating Training," it establishes a strong coupling between the trigger and target actions. Across six D4RL tasks and three TO architectures, TrojanTO improves the composite score (CP) from a baseline of 0.34 to 0.70.
- Tug-of-War No More: Harmonizing Accuracy and Robustness in Vision-Language Models via Stability-Aware Task Vector Merging
-
Addressing the persistent trade-off where "improving VLM robustness inevitably degrades clean accuracy," this paper proposes PISTOLE. Instead of retraining, it selectively merges off-the-shelf "naturally fine-tuned" and "adversarially fine-tuned" CLIP task vectors based on prediction stability. By using complementary gradient stability masks to suppress conflicting coordinates and weighting adversarial parameter trajectories with curvature-sensitive metrics, it "bends" the typically linear clean-robust frontier toward a better sweet spot, improving both clean and robust accuracy by approximately 5% across 14 datasets.
- ULD-Net: Enabling Ultra-Low-Degree Fully Polynomial Networks for Homomorphically Encrypted Inference
-
ULD-Net proposes a method to train "fully polynomial networks" from scratch. By utilizing a polynomial-only normalization layer, PolyNorm (consisting only of additions and multiplications), activation values are stabilized within a well-behaved range. This allows ultra-low-degree fully polynomial models with multiplication depth \(\le 3\) to scale to ViT/ImageNet for the first time (ViT-Small achieves 76.70% top-1 on ImageNet), achieving a 2.76× homomorphic encryption inference speedup compared to previous SOTA.
- Uncertainty Estimation via Hyperspherical Confidence Mapping
-
This paper proposes Hyperspherical Confidence Mapping (HCM), which decomposes network outputs into "magnitude \(R\) + unit direction vector \(\hat{d}\)" and treats the degree of deviation of \(\hat{d}\) from the unit sphere as uncertainty. This achieves sampling-free, distribution-assumption-free deterministic uncertainty estimation, matching or even exceeding Deep Ensembles and Evidential Learning in classification and regression tasks with minimal inference overhead.
- Unified Privacy Guarantees for Decentralized Learning via Matrix Factorization
-
This paper unifies multiple algorithms and trust models in decentralized learning (DL) into a Matrix Factorization (MF) framework. It generalizes privacy guarantees to broader matrix types and proposes the MAFALDA-SGD algorithm, which significantly outperforms existing methods on synthetic and real graph topologies by optimizing noise correlation.
- Video Unlearning via Low-Rank Refusal Vector
-
This work proposes the first training-free, closed-form weight update framework for concept erasure in video diffusion models. By using only 5 pairs of safe/unsafe prompts to estimate a "refusal vector" and applying contrastive low-rank decomposition to decouple target concepts from unrelated semantics, the authors analytically incorporate corrections into model weights. This approach reduces unsafe generation rates in OPEN-SORA and ZEROSCOPET2V by an average of 36.3% and 58.2%, respectively, without compromising video quality or adding inference overhead.
- VPI-Bench: Visual Prompt Injection Attacks for Computer-Use Agents
-
The authors construct VPI-Bench (306 samples), the first comprehensive visual prompt injection attack benchmark, systematically evaluating the security of Computer-Use and Browser-Use Agents across 5 platforms. Findings reveal that Browser-Use Agents are extremely fragile (100% AR on Amazon/Booking), and even Anthropic's CUA exhibits serious vulnerabilities (up to 59% AR), with system prompt defenses proving ineffective.
- WARP: Weight Teleportation for Attack-Resilient Unlearning Protocols
-
This paper points out that approximate machine unlearning can itself leak the data being forgotten. This leakage is attributed to two root causes: "large gradient norms of forgotten samples" and "parameters being too close to the original model after unlearning." The authors propose WARP, a plug-and-play defense that utilizes the loss-preserving symmetry of neural networks to "teleport" the model to another point on the loss isosurface. This simultaneously suppresses the unlearning gradient norm and increases parameter displacement, reducing black-box attack AUC by up to 64% and white-box by up to 92% across six unlearning algorithms with almost no loss in accuracy.
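A small illustration of a loss-preserving symmetry, using the well-known positive rescaling of ReLU units. This is an assumed example: the summary does not say which symmetry WARP uses or how it picks the teleport target, and the sketch only shows that the function is unchanged while the parameters move.

```python
import numpy as np


def forward(W1, W2, x):
    """Bias-free two-layer ReLU network."""
    return W2 @ np.maximum(W1 @ x, 0.0)


def teleport(W1, W2, scales):
    """Scale hidden unit i's incoming weights by scales[i], outgoing by 1/scales[i]."""
    scales = np.asarray(scales, dtype=float)
    if np.any(scales <= 0):
        raise ValueError("scales must be positive to commute with ReLU")
    return W1 * scales[:, None], W2 / scales[None, :]


rng = np.random.default_rng(0)
W1, W2, x = rng.normal(size=(4, 3)), rng.normal(size=(2, 4)), rng.normal(size=3)
W1_t, W2_t = teleport(W1, W2, scales=[0.5, 2.0, 3.0, 0.25])
```

Because ReLU(a·z) = a·ReLU(z) for a > 0, the two parameter sets compute identical outputs, and therefore identical losses, even though they sit at different points in weight space.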
- Watermark-based Detection and Attribution of AI-Generated Content
-
This paper presents the first systematic study of watermark-based user-level detection and attribution for AI-generated content. It provides theoretical analysis (TDR/FDR/TAR bounds), an efficient watermark selection algorithm (A-BSTA), and cross-modal (image and text) experimental validation. The results demonstrate that detection and attribution inherit the accuracy and (lack of) robustness of the underlying watermarking methods.
- When Flatness Does (Not) Guarantee Adversarial Robustness
-
This paper reformulates the empirical intuition of whether "flat minima lead to adversarial robustness" into a provable problem. It concludes that flatness provides a lower bound for local loss stability around a point but cannot guarantee global robustness, as adversarial examples often fall into high-confidence, low-curvature, but incorrectly classified flat regions.
- Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information
-
This work provides a unified explanation of the effective mechanism of all Unlearnable Examples (UE) from the perspective of Mutual Information (MI) reduction. It proves that reducing the intraclass covariance of poisoned features lowers the MI upper bound. Accordingly, the MI-UE method is proposed to achieve covariance reduction by maximizing intraclass cosine similarity, suppressing test accuracy on CIFAR-10 to 9.95% (near random guessing) while significantly outperforming existing methods under adversarial training defense.
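The quantity being maximized can be measured as follows. This is only a sketch of the statistic (mean pairwise cosine similarity between features of the same class), with a hypothetical function name; it is not the MI-UE optimization itself and assumes every feature vector is non-zero.

```python
import numpy as np


def mean_intraclass_cosine(features, labels):
    """Mean cosine similarity over all ordered same-class pairs (i != j)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = f @ f.T
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)  # drop self-pairs
    return float(sims[same].mean())


labels = np.array([0, 0, 0, 1, 1, 1])
# Features collapsed onto one direction per class: minimal intraclass spread.
collapsed = np.array([[1.0, 0.0]] * 3 + [[0.0, 2.0]] * 3)
# Features scattered within each class.
spread = np.array(
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, -1.0]]
)
collapsed_sim = mean_intraclass_cosine(collapsed, labels)
spread_sim = mean_intraclass_cosine(spread, labels)
```

Features that collapse onto one direction per class reach the maximum value of 1, which corresponds to the low intraclass covariance the summary describes.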
- Wring Out the Bias: A Rotation-Based Alternative to Projection Debiasing
-
Addressing the "whac-a-mole" dilemma where "projection debiasing" used in vision-language models like CLIP shifts bias from one concept to another unconsidered one, this paper mathematically proves that projection necessarily amplifies bias in orthogonal subspaces. It proposes WRING, a method that replaces "subspace deletion" with "embedding rotation within relevant subspaces," effectively eliminating bias in target concepts while virtually avoiding amplification in unconsidered concepts.
- Zero-Sacrifice Persistent-Robustness Adversarial Defense for Pre-Trained Encoders
-
ZePAD utilizes two complementary branches (an adversarially fine-tuned multi-encoder branch + a benign branch trained only on clean data) paired with a confidence-based federated decision mechanism. This allows pre-trained encoders to defend against "Downstream-Agnostic Adversarial Examples" (DAE) across multiple downstream tasks with a single fine-tuning step, while maintaining or even improving clean accuracy and providing free adversarial detection.