🛡️ AI Safety

🤖 AAAI2026 · 44 paper notes

Alternative Fairness and Accuracy Optimization in Criminal Justice

This paper provides a systematic review of three dimensions of algorithmic fairness (group fairness, individual fairness, and procedural fairness), proposes an improved group fairness optimization formulation based on tolerance constraints, and constructs a "Three Pillars of Fairness" deployment framework for public decision-making systems.

An Improved Privacy and Utility Analysis of Differentially Private SGD with Bounded Domain and Smooth Losses

Under the sole assumption of \(L\)-smoothness (without convexity), this paper derives tighter closed-form RDP privacy bounds for DPSGD and, for the first time, provides a complete convergence/utility analysis in the bounded-domain setting, revealing that a smaller parameter domain diameter simultaneously improves both privacy and utility.
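As background (a standard fact, not the paper's tightened bound): the Gaussian mechanism with \(\ell_2\)-sensitivity \(\Delta\) and noise scale \(\sigma\) satisfies, for every order \(\alpha > 1\),

\[(\alpha, \varepsilon(\alpha))\text{-RDP with}\quad \varepsilon(\alpha) = \frac{\alpha \Delta^2}{2\sigma^2},\]

the baseline that the paper sharpens for DPSGD by exploiting smoothness and the bounded parameter domain.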

An Information Theoretic Evaluation Metric for Strong Unlearning

This paper exposes a fundamental flaw in existing black-box unlearning evaluation metrics (MIA, JSD, etc.)—modifying only the final classification head is sufficient to satisfy all black-box metrics while intermediate layers fully retain information about the forget set. The paper proposes IDI, a white-box metric that quantifies unlearning effectiveness by estimating, via InfoNCE, the mutual information between each layer's representations and the forget labels. It further proposes COLA, an unlearning method that achieves IDI scores approaching Retrain on CIFAR-10/100 and ImageNet-1K.
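The InfoNCE estimate underlying IDI can be sketched generically: given a matrix of critic scores between layer representations and forget labels, the CPC bound gives \(I(Z;Y) \ge \log N - \mathcal{L}_{\text{InfoNCE}}\). The critic and score matrix below are illustrative assumptions, not the paper's trained critic.

```python
import numpy as np

def infonce_lower_bound(scores: np.ndarray) -> float:
    """InfoNCE (CPC) lower bound on mutual information.

    scores[i, j] = critic value f(z_i, y_j); the diagonal pairs each
    representation z_i with its own label y_i. The bound is
    I(Z; Y) >= log N - CrossEntropy(diagonal | rows).
    """
    n = scores.shape[0]
    # row-wise log-softmax, evaluated at the positive (diagonal) entry
    logits = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(np.log(n) + np.mean(np.diag(log_probs)))
```

For independent scores the bound collapses to 0; for a sharply diagonal score matrix it approaches \(\log N\), signalling that the layer still encodes the forget labels.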

Angular Gradient Sign Method: Uncovering Vulnerabilities in Hyperbolic Networks

This paper proposes the Angular Gradient Sign Method (AGSM), which decomposes gradients in hyperbolic space into radial (hierarchical depth) and angular (semantic) components, applying perturbations exclusively along the angular direction to generate adversarial examples. AGSM achieves 5–13% greater accuracy degradation than standard FGSM/PGD on image classification and cross-modal retrieval tasks.

Authority Backdoor: A Certifiable Backdoor Mechanism for Authoring DNNs

This paper proposes Authority Backdoor, which embeds hardware fingerprints as backdoor triggers into DNNs so that models function correctly only on authorized devices, and achieves certifiable robustness against adaptive trigger reverse-engineering attacks via randomized smoothing.

Breaking the Adversarial Robustness-Performance Trade-off in Text Classification via Manifold Purification

This paper proposes the Manifold-Correcting Causal Flow (MC²F) framework, which employs a Stratified Riemannian Continuous Normalizing Flow (SR-CNF) to learn the manifold density of clean data embeddings for adversarial example detection, and subsequently applies a Geodesic Purification Solver to project detected adversarial embeddings back onto the clean manifold along geodesic paths. MC²F comprehensively surpasses state-of-the-art methods in adversarial robustness across SST-2, AGNews, and YELP benchmarks, while incurring no loss—and even achieving marginal gains—in clean accuracy.

Breaking the Dyadic Barrier: Rethinking Fairness in Link Prediction Beyond Demographic Parity

This paper identifies three fundamental flaws in dyadic fairness and Demographic Parity (ΔDP) for link prediction—insufficient GNN expressiveness, subgroup bias masking, and ranking insensitivity—and proposes a ranking-aware fairness metric based on NDKL and a post-processing algorithm MORAL, achieving state-of-the-art fairness–utility trade-offs across six datasets.
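A ranking-aware metric in the NDKL family can be sketched as follows; the exact weighting and target distribution used in the paper may differ, and the group labels below are illustrative.

```python
import numpy as np

def ndkl(groups, overall):
    """Normalized Discounted KL divergence of a ranking.

    groups:  sequence of group ids in ranked order (top first)
    overall: dict group_id -> desired proportion (positive, sums to 1)
    Lower is fairer; 0 means every prefix matches `overall` exactly."""
    keys = sorted(overall)
    q = np.array([overall[k] for k in keys], dtype=float)
    score, z = 0.0, 0.0
    counts = {k: 0 for k in keys}
    for i, g in enumerate(groups, start=1):
        counts[g] += 1
        p = np.array([counts[k] / i for k in keys])
        # KL(p || q), with the convention 0 * log 0 = 0
        kl = sum(pi * np.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
        w = 1.0 / np.log2(i + 1)   # logarithmic position discount
        score += w * kl
        z += w
    return score / z
```

Unlike ΔDP on raw scores, this penalizes rankings whose top positions over-represent a group even when overall acceptance rates are balanced.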

CoRe-Fed: Bridging Collaborative and Representation Fairness via Federated Embedding Distillation

This paper proposes CoRe-Fed, a framework that simultaneously addresses representation fairness and collaborative fairness in federated learning through two synergistic modules—embedding-level contrastive alignment and contribution-aware aggregation—achieving significant improvements in both fairness and generalization of the global model under heterogeneous data distributions.

DeepTracer: Tracing Stolen Model via Deep Coupled Watermarks

This paper proposes DeepTracer, a robust watermarking framework that deeply couples the watermark task with the main task through adaptive source-class selection (K-Means clustering for feature-space coverage), a same-class coupling loss (aligning watermark samples with target-class samples in output space), and two-stage key-sample filtering. Under 6 model-stealing attacks (including hard-label and data-free settings), its watermark success rate averages 77–100%, substantially outperforming existing methods.

Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception

This paper establishes the first all-type (speech/sound/singing/music) audio deepfake detection benchmark and proposes Wavelet Prompt Tuning (WPT), which enhances full-band frequency perception of SSL features via discrete wavelet transform. Without increasing trainable parameters, WPT surpasses full fine-tuning and achieves an average EER of only 3.58% under co-training.

Diversifying Counterattacks: Orthogonal Exploration for Robust CLIP Inference

This paper proposes Directional Orthogonal Counterattack (DOC), a method that expands the search space during counterattack optimization by introducing orthogonal gradient components and momentum updates, and adaptively modulates counterattack intensity via a cosine-similarity-based Directional Sensitivity Score (DSS). DOC significantly improves the test-time adversarial robustness of CLIP across 16 datasets.
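The two ingredients can be sketched generically; function names and the Euclidean setting are illustrative assumptions rather than DOC's exact formulation.

```python
import numpy as np

def orthogonal_component(d, g):
    """Part of exploration direction d orthogonal to the base gradient g
    (one Gram-Schmidt step); adding such components widens the
    counterattack search space beyond the raw gradient line."""
    d = np.asarray(d, dtype=float)
    g = np.asarray(g, dtype=float)
    return d - (np.dot(d, g) / (np.dot(g, g) + 1e-12)) * g

def directional_sensitivity(g_clean, g_perturbed):
    """Cosine similarity between two gradients; a stand-in for the
    Directional Sensitivity Score used to modulate counterattack strength."""
    num = np.dot(g_clean, g_perturbed)
    den = np.linalg.norm(g_clean) * np.linalg.norm(g_perturbed) + 1e-12
    return num / den
```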

Easy to Learn, Yet Hard to Forget: Towards Robust Unlearning Under Bias

This paper proposes the CUPID framework, which partitions the forget set into causal and bias subsets via loss landscape sharpness analysis, identifies and disentangles causal and bias pathways within the model, and achieves precise class-level unlearning on biased models — effectively addressing the shortcut unlearning problem.

EFX and PO Allocation Exists for Two Types of Goods

This paper proves that an allocation satisfying both EFX (envy-freeness up to any good) and Pareto optimality always exists when goods are of only two types and all valuations are positive, and provides a near-linear-time algorithm.
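The EFX condition itself is easy to state in code. Below is a generic checker for additive valuations (not the paper's two-type allocation algorithm): agent \(i\) must value her own bundle at least as much as any other bundle after removing *any* single good from it.

```python
def is_efx(alloc, values):
    """Check EFX for an allocation under additive valuations.

    alloc:  list of bundles, each bundle a list of good ids
    values: values[i][g] = agent i's value for good g
    """
    def v(i, bundle):
        return sum(values[i][g] for g in bundle)

    n = len(alloc)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            for g in alloc[j]:
                reduced = [h for h in alloc[j] if h != g]
                if v(i, alloc[i]) < v(i, reduced):
                    return False   # i still envies j even after dropping g
    return True
```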

Enhancing DPSGD via Per-Sample Momentum and Low-Pass Filtering

This paper proposes DP-PMLF, which reduces clipping bias via per-sample momentum and suppresses high-frequency DP noise via a low-pass filter, simultaneously addressing both sources of accuracy degradation in DPSGD for the first time.
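The low-pass idea can be illustrated with a first-order (EMA) filter over the stream of noised gradients; this is an illustration of the principle, not DP-PMLF's actual filter design, and `beta` is an assumed smoothing constant.

```python
import numpy as np

def lowpass_filter(noisy_grads, beta=0.9):
    """First-order low-pass (exponential moving average) filter over a
    stream of DP-noised gradients: zero-mean high-frequency injected noise
    is attenuated while the slowly varying signal passes through."""
    out, state = [], np.zeros_like(noisy_grads[0], dtype=float)
    for g in noisy_grads:
        state = beta * state + (1.0 - beta) * np.asarray(g, dtype=float)
        out.append(state.copy())
    return out
```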

Fair Model-Based Clustering

This paper proposes FMC, a fair clustering algorithm based on finite mixture models. By imposing fairness constraints on model parameters rather than sample-level assignments, FMC achieves scalable fair clustering whose parameter count is independent of the sample size \(N\). It supports mini-batch learning and categorical data, and substantially outperforms existing methods on large-scale datasets.

FairGSE: Fairness-Aware Graph Neural Network without High False Positive Rates

This paper is the first to identify the "FPR shortcut" problem in fairness-aware GNNs — existing methods achieve favorable fairness metrics by massively misclassifying negative samples as positive — and proposes the FairGSE framework, which reweights graph edges by maximizing 2D structural entropy to simultaneously improve fairness and reduce the false positive rate, achieving a 39% reduction in FPR.

Fine-Grained DINO Tuning with Dual Supervision for Face Forgery Detection

This paper proposes DFF-Adapter (DeepFake Fine-Grained Adapter), a lightweight fine-tuning scheme for deepfake detection built upon DINOv2. A three-branch adapter (authenticity detection head, forgery type classification head, and shared head) is injected into each Transformer block. A Forgery-Aware Multi-Head Router enables subspace-level dynamic routing among LoRA experts. The auxiliary forgery type classification task enhances artifact sensitivity for the primary task. With only 3.5M trainable parameters, the method achieves state-of-the-art performance across multiple cross-dataset evaluations.

Generalizing Fair Clustering to Multiple Groups: Algorithms and Applications

This paper generalizes the Closest Fair Clustering problem from two groups to arbitrarily many groups, proves NP-hardness for the equal-proportion case with three or more groups, proposes near-linear-time approximation algorithms (equal-proportion \(O(|\chi|^{1.6}\log^{2.81}|\chi|)\), arbitrary-proportion \(O(|\chi|^{3.81})\)), and extends the results to fair correlation clustering and fair consensus clustering.

Hashed Watermark as a Filter: A Unified Defense Against Forging and Overwriting Attacks in Neural Network Watermarking

This paper proposes NeuralMark—a weight-based watermarking method built on a hashed watermark filter. It leverages the SHAKE-256 hash function to derive irreversible binary watermarks from secret key matrices, which serve as private filters for selecting embedding parameters. The avalanche effect blocks gradient-based reverse engineering against forging attacks, while multi-round filtering minimizes parameter overlap to resist overwriting attacks. Effectiveness and robustness are validated across 13 CNN/Transformer architectures on 5 image classification tasks and 1 text generation task.
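The key-to-watermark derivation can be sketched with Python's standard-library SHAKE-256 (an extendable-output hash); NeuralMark derives the bits from secret key matrices, so the byte-string key below is a simplifying assumption.

```python
import hashlib
import numpy as np

def hashed_watermark(secret_key: bytes, n_bits: int) -> np.ndarray:
    """Derive an irreversible binary watermark from a secret key via
    SHAKE-256. The avalanche effect means any change to the key flips
    roughly half the bits, blocking gradient-based reverse engineering."""
    digest = hashlib.shake_256(secret_key).digest((n_bits + 7) // 8)
    bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))[:n_bits]
    return bits
```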

HealSplit: Towards Self-Healing through Adversarial Distillation in Split Federated Learning

This paper proposes HealSplit, the first unified defense framework for Split Federated Learning (SFL). It identifies poisoned samples via topology-aware scoring (TAS) on a graph built over smashed data, generates semantically consistent substitute representations using a GAN, and trains a consistency-validated student model through adversarial multi-teacher distillation. This end-to-end detect-and-recover pipeline substantially outperforms ten SOTA defense methods across five categories of poisoning attacks.

Improving the Convergence Rate of Ray Search Optimization for Query-Efficient Hard-Label Attacks

To address the query efficiency bottleneck in hard-label black-box adversarial attacks, this paper proposes ARS-OPT, a momentum-based algorithm grounded in Nesterov Accelerated Gradient (NAG), and its enhanced variant PARS-OPT that incorporates surrogate model priors. Theoretical convergence guarantees are established, and both methods outperform 13 state-of-the-art approaches on ImageNet and CIFAR-10.
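The NAG backbone that ARS-OPT builds on evaluates the gradient at a look-ahead point. A minimal generic step (learning rate and momentum values are illustrative):

```python
def nag_step(x, v, grad_fn, lr=0.1, mu=0.9):
    """One Nesterov accelerated gradient step: the gradient is taken at
    the look-ahead point x + mu*v rather than at x itself, which damps
    oscillation and speeds convergence over plain momentum."""
    g = grad_fn(x + mu * v)
    v_new = mu * v - lr * g
    return x + v_new, v_new
```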

InfoDecom: Decomposing Information for Defending Against Privacy Leakage in Split Inference

This paper proposes InfoDecom, which reduces redundant information in smashed data via a two-level information decomposition (frequency-domain visual-information removal and mutual-information suppression) and then injects closed-form Gaussian noise to provide theoretical privacy guarantees, achieving a significantly better utility–privacy trade-off than existing methods under shallow client models.

Learning to Collaborate: An Orchestrated-Decentralized Framework for Peer-to-Peer Collaborative Learning

This paper proposes KNEXA-FL, a framework that models P2P collaboration as a contextual bandit problem via a Central Pairing Manager (CPM) that never accesses model parameters. Using LinUCB to learn optimal pairing strategies, KNEXA-FL achieves approximately 50% higher Pass@1 than random P2P in heterogeneous LLM federated learning, while avoiding the catastrophic collapse observed in centralized distillation.
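The bandit machinery the CPM runs can be sketched with the classic disjoint LinUCB (arms, context dimension, and α below are illustrative; KNEXA-FL's exact variant may differ):

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: each arm keeps a ridge-regression estimate of its
    reward and is scored by estimate + upper-confidence bonus."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # X^T X + I per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # X^T r per arm

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # ridge estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # exploration bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

In the paper's setting an "arm" would correspond to a candidate peer pairing and `x` to its context features; crucially, nothing here touches model parameters.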

Matrix-Free Two-to-Infinity and One-to-Two Norms Estimation

This paper proposes TwINEst and TwINEst++, two randomized algorithms based on the Hutchinson diagonal estimator, for efficiently estimating \(\|A\|_{2\to\infty}\) and \(\|A\|_{1\to 2}\) norms in a matrix-free setting. The algorithms come with provable oracle complexity guarantees and demonstrate significant advantages in Jacobian regularization for adversarial robustness in image classification and defense against adversarial attacks in recommender systems.
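The core idea can be sketched directly: \(\|A\|_{2\to\infty}\) is the largest row 2-norm, i.e. \(\sqrt{\max_i (AA^\top)_{ii}}\), and the diagonal of \(AA^\top\) is estimated matrix-free with Rademacher probes. This is the underlying Hutchinson construction, not the TwINEst/TwINEst++ algorithms themselves.

```python
import numpy as np

def est_two_to_inf(matvec, rmatvec, n, n_probes=200, seed=0):
    """Estimate ||A||_{2->inf} = max_i ||A_{i,:}||_2 using only products
    with A (matvec) and A^T (rmatvec), via the Hutchinson estimator
    diag(A A^T) ~ mean_z [ z * (A A^T z) ] for Rademacher z."""
    rng = np.random.default_rng(seed)
    diag = np.zeros(n)
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe
        diag += z * matvec(rmatvec(z))        # z ⊙ (A A^T z)
    diag /= n_probes
    return float(np.sqrt(np.maximum(diag, 0.0).max()))
```

\(\|A\|_{1\to 2}\) follows by the same recipe applied to \(A^\top A\) (it equals the largest column 2-norm).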

Minimizing Inequity in Facility Location Games

This paper studies the problem of minimizing the Maximum Group Effect in facility location games on the real line, proposing two strategyproof mechanisms—BALANCED and MAJOR-PHANTOM—that achieve tight approximation ratios in the single-facility setting. The framework unifies classical objectives (utilitarian social cost, egalitarian maximum cost) with group fairness objectives, and extends the endpoint mechanism to the two-facility setting.

MPD-SGR: Robust Spiking Neural Networks with Membrane Potential Distribution-Driven Surrogate Gradient Regularization

This work theoretically establishes a connection between SNN robustness error and surrogate gradient (SG) magnitude, demonstrating that reducing the overlap between the membrane potential distribution (MPD) and the effective region of the SG function can effectively decrease sensitivity to adversarial perturbations. Based on this insight, the paper proposes the MPD-SGR regularization method, which substantially outperforms existing SNN defense methods under both vanilla training and adversarial training settings.

Plug-and-Play Parameter-Efficient Tuning of Embeddings for Federated Recommendation

This paper proposes a plug-and-play federated recommendation framework that introduces PEFT (Parameter-Efficient Fine-Tuning) concepts into item embeddings. By freezing pre-trained full embeddings and transmitting only lightweight compressed embeddings (LoRA / Hash / RQ-VAE), the framework significantly reduces communication overhead while improving recommendation accuracy.

Privacy Auditing of Multi-Domain Graph Pre-Trained Model under Membership Inference Attack

This paper proposes MGP-MIA, the first framework targeting membership inference attacks (MIA) against multi-domain graph pre-trained models. It amplifies membership signals via machine unlearning, constructs shadow models through incremental learning, and employs a similarity-based inference mechanism to effectively expose privacy leakage risks in multi-domain graph pre-training.

Privacy on the Fly: A Predictive Adversarial Transformation Network for Mobile Sensor Data

This paper proposes PATN (Predictive Adversarial Transformation Network), the first framework to introduce adversarial perturbations into sensor data privacy protection. PATN leverages historical sensor data to generate forward-looking adversarial perturbations, achieving zero-latency real-time privacy protection while preserving the semantic fidelity of sensor data.

ProbLog4Fairness: A Neurosymbolic Approach to Modeling and Mitigating Bias

This paper proposes the ProbLog4Fairness framework, which formalizes bias mechanisms in data as interpretable logic programs using the probabilistic logic programming language ProbLog, and integrates bias assumptions into neural network training via distant supervision in DeepProbLog, enabling flexible and principled bias mitigation.

Reference Recommendation based Membership Inference Attack against Hybrid-based Recommender Systems

This paper proposes a Reference Recommendation-based Membership Inference Attack (MIA), designing a relative membership metric \(\rho(u) = d(v_t, v_h) / d(v_t, v_r)\) that exploits the personalization capability of hybrid-based recommender systems to obtain reference recommendations. It is the first method to effectively attack hybrid-based recommender systems, achieving an attack success rate of up to 93.4% while requiring only 10 seconds of computation.
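The metric itself is a one-liner once the three vectors are embedded; the Euclidean distance below is an assumption (the paper does not constrain \(d\) in this summary).

```python
import numpy as np

def relative_membership(v_t, v_h, v_r):
    """rho(u) = d(v_t, v_h) / d(v_t, v_r): distance from the target item
    vector v_t to the hybrid recommendation v_h, normalized by the distance
    to the reference recommendation v_r. Smaller values suggest the target
    interaction influenced the model, i.e. membership."""
    d = lambda a, b: np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))
    return d(v_t, v_h) / (d(v_t, v_r) + 1e-12)
```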

RegionMarker: A Region-Triggered Semantic Watermarking Framework for Embedding-as-a-Service

This paper proposes RegionMarker, a semantic watermarking framework based on region-triggered mechanisms. It defines trigger regions in a low-dimensional space and injects semantic watermarks, constituting the first EaaS copyright protection method capable of simultaneously resisting CSE attacks, paraphrasing attacks, and dimension perturbation attacks.

Rethinking Target Label Conditioning in Adversarial Attacks: A 2D Tensor-Guided Generative Approach

This paper proposes the TGAF framework, which leverages diffusion models to encode target labels as 2D semantic tensors for guiding adversarial noise generation, and introduces a random masking strategy to preserve complete semantic information, significantly improving the transferability of targeted adversarial attacks.

Revisiting (Un)Fairness in Recourse by Minimizing Worst-Case Social Burden

This paper systematically analyzes three fundamental limitations of existing fairness metrics in algorithmic recourse—neglecting classifier decision behavior, ignoring ground-truth labels, and the tendency of gap-based metrics to obscure unfairness—and proposes MISOB, a fairness framework grounded in social burden. Through a minimax-weighted training strategy, MISOB reduces social burden across all demographic groups without requiring access to sensitive attributes, simultaneously improving fairness at both the prediction and recourse stages.

Robust Watermarking on Gradient Boosting Decision Trees

This paper proposes the first robust watermarking framework for GBDT models. It embeds watermarks via in-place fine-tuning and introduces four embedding strategies—Wrong Prediction Flip, Outlier Flip, Cluster Center Flip, and Confidence Flip—achieving high embedding success rates, minimal accuracy degradation, and strong robustness against fine-tuning attacks.

SecMoE: Communication-Efficient Secure MoE Inference via Select-Then-Compute

This paper proposes the SecMoE framework, which efficiently enables sparse MoE inference under two-party secure computation via a Select-Then-Compute paradigm, eliminating redundant expert computation and achieving up to 29.8× communication reduction and up to 16.1× end-to-end speedup.

Sim-to-Real: An Unsupervised Noise Layer for Screen-Camera Watermarking Robustness

This paper proposes the Simulation-to-Real (S2R) framework, which introduces a novel two-stage noise approximation strategy of "mathematical modeling → unsupervised domain transfer": a mathematical transform \(T\) first maps clean images to a known noise domain \(\mathcal{C}\), and an unsupervised image-to-image network \(G\) then maps \(\mathcal{C}\) to the real screen-camera (SC) noise domain \(\mathcal{U}\). Without requiring paired data, S2R accurately approximates real SC noise and achieves state-of-the-art watermarking robustness (BER reduced by 30–60%) and image quality (PSNR 42.27 dB / SSIM 0.962) across multiple devices, angles, and distances.

TopoReformer: Mitigating Adversarial Attacks Using Topological Purification in OCR Models

This paper proposes TopoReformer, a model-agnostic adversarial purification pipeline based on a topological autoencoder. By leveraging persistent homology to enforce topological consistency in the latent space, the method filters adversarial perturbations without adversarial training, effectively protecting OCR systems against classical attacks, adaptive attacks, and OCR-specific watermark attacks.

Towards Effective, Stealthy, and Persistent Backdoor Attacks Targeting Graph Foundation Models

This paper proposes GFM-BA, the first systematic backdoor attack method targeting the pre-training phase of Graph Foundation Models (GFMs). It addresses three core challenges — effectiveness, stealthiness, and persistence — through three modules: label-free trigger association, node-adaptive trigger generation, and persistent backdoor anchoring.

Towards Multiple Missing Values-Resistant Unsupervised Graph Anomaly Detection

This paper proposes M2V-UGAD, the first framework to address unsupervised graph anomaly detection under simultaneous node attribute and graph topology missingness. Through three core mechanisms—dual-pathway independent imputation, hyperspherical latent space fusion, and pseudo-anomaly generation—the framework overcomes cross-view interference and imputation bias, consistently outperforming existing methods across 7 benchmark datasets.

Transferable Backdoor Attacks for Code Models via Sharpness-Aware Adversarial Perturbation

This paper proposes STAB (Sharpness-aware Transferable Adversarial Backdoor), which trains a surrogate model via SAM to converge to flat regions of the loss landscape and employs Gumbel-Softmax optimization to generate context-aware adversarial triggers. STAB is the first approach to simultaneously achieve cross-dataset transferability and stealthiness in backdoor attacks against code models.

Transferable Hypergraph Attack via Injecting Nodes into Pivotal Hyperedges

This paper proposes TH-Attack, a transferable node injection attack framework targeting Hypergraph Neural Networks (HGNNs). By identifying pivotal hyperedges in information aggregation pathways and injecting semantically inverted malicious nodes, TH-Attack effectively attacks diverse HGNN architectures in a black-box setting, reducing accuracy from 80%+ to below 30%.

Truth, Justice, and Secrecy: Cake Cutting Under Privacy Constraints

This paper proposes PP_CC_puv, the first privacy-preserving cake cutting protocol, which transforms Chen et al.'s strategyproof fair division algorithm into a privacy-preserving variant based on secret sharing and secure multi-party computation (MPC). The protocol maintains envy-freeness, Pareto optimality, and strategyproofness while ensuring that participants' preference information is not disclosed.
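The secret-sharing building block the protocol rests on can be sketched with additive shares over a prime field (the field size and helper names are illustrative; this is not the full cake-cutting protocol):

```python
import secrets

P = 2**61 - 1  # a Mersenne prime; all arithmetic is mod P

def share(x, n):
    """Additive secret sharing of x into n shares: any n-1 shares are
    uniformly random and reveal nothing, while all n sum back to x."""
    parts = [secrets.randbelow(P) for _ in range(n - 1)]
    parts.append((x - sum(parts)) % P)
    return parts

def reconstruct(parts):
    return sum(parts) % P
```

Because the scheme is linearly homomorphic (shares of \(a\) plus shares of \(b\) reconstruct to \(a+b\)), parties can evaluate the division algorithm's arithmetic on shares without ever seeing each other's valuations.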

Yours or Mine? Overwriting Attacks Against Neural Audio Watermarking

This paper presents the first systematic study of overwriting attacks against neural audio watermarking, proposing white-box, gray-box, and black-box attack schemes that achieve near-100% attack success rates against three SOTA methods—AudioSeal, Timbre, and WavMark—exposing critical security vulnerabilities in existing audio watermarking systems.