LLM Safety

🤖 AAAI 2026 · 29 paper notes

Anti-adversarial Learning: Desensitizing Prompts for Large Language Models

This paper proposes PromptObfus, which adopts an "anti-adversarial learning" paradigm to replace sensitive tokens in user prompts with semantically distinct yet task-preserving alternatives. The approach eliminates explicit privacy leakage entirely and reduces implicit privacy inference attack success rates by 62.70%, without degrading the task performance of remote LLMs.
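
To make the anti-adversarial direction concrete, here is a minimal greedy sketch of utility-constrained token replacement; the callable interfaces, the tolerance `tol`, and the greedy search are illustrative assumptions rather than the paper's implementation.

```python
def desensitize(tokens, sensitive_idx, candidates_fn, task_score_fn, privacy_dist_fn, tol=0.05):
    """Greedy anti-adversarial replacement: for each sensitive position, pick the
    candidate that is semantically farthest from the original token (privacy)
    while keeping the downstream task score within `tol` of the baseline (utility).
    This is the inverse of an adversarial attack, which minimizes semantic change
    while flipping the task output."""
    out = list(tokens)
    base = task_score_fn(tokens)
    for i in sensitive_idx:
        best_tok, best_dist = out[i], -1.0
        for cand in candidates_fn(out, i):
            trial = out[:i] + [cand] + out[i + 1:]
            if abs(task_score_fn(trial) - base) > tol:   # reject utility-breaking swaps
                continue
            dist = privacy_dist_fn(tokens[i], cand)
            if dist > best_dist:
                best_tok, best_dist = cand, dist
        out[i] = best_tok
    return out
```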

Attention Retention for Continual Learning with Vision Transformers

This paper proposes ARCL-ViT, a framework that prevents attention drift in Vision Transformers during continual learning via a two-step strategy of attention mask generation and gradient masking. It achieves state-of-the-art results on ImageNet-R and CIFAR-100, demonstrating that preserving attention patterns is key to mitigating catastrophic forgetting.
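
A rough PyTorch sketch of the gradient-masking half of the strategy (the mask-generation step is omitted, and the importance dictionary, threshold, and function name are assumptions):

```python
import torch

def mask_attention_gradients(named_attn_params, importance, threshold=0.5):
    """Zero the gradients of attention weights marked important for earlier tasks,
    so their attention patterns are preserved while the rest adapts to the new task.
    Call after loss.backward() and before optimizer.step()."""
    for name, p in named_attn_params:
        if p.grad is None or name not in importance:
            continue
        protected = importance[name] > threshold        # entries tied to old-task attention
        p.grad.mul_((~protected).to(p.grad.dtype))      # block updates on protected entries
```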

AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models

This paper proposes AUVIC, a framework that combines an adversarial perturbation generator with a dynamic anchor preservation mechanism to precisely unlearn target visual concepts (e.g., specific faces) in MLLMs, while avoiding collateral forgetting of semantically similar concepts. The paper also introduces VCUBench, the first evaluation benchmark for visual concept unlearning in group-scene scenarios.

Beyond Superficial Forgetting: Thorough Unlearning through Knowledge Density Estimation and Block Re-insertion

This paper proposes the KUnBR framework, which employs gradient-guided knowledge density estimation to localize layers enriched with harmful knowledge, and adopts a block re-insertion strategy to bypass the gradient-masking effect of cover layers, achieving deep unlearning of harmful knowledge in LLMs rather than mere surface-level suppression.
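
A hedged sketch of the knowledge-density idea, ranking layers by the gradient mass they accumulate on the forget set; the scoring rule, the HuggingFace-style model interface, and all names are assumptions, and the block re-insertion step is not shown:

```python
import torch

def rank_layers_by_knowledge_density(model, forget_batches, layer_params):
    """layer_params: dict mapping layer index -> list of that layer's parameters.
    Layers accumulating the largest gradient norms on to-be-forgotten data are
    treated as knowledge-dense and prioritized for unlearning."""
    scores = {i: 0.0 for i in layer_params}
    for batch in forget_batches:
        model.zero_grad()
        model(**batch).loss.backward()
        for i, params in layer_params.items():
            scores[i] += sum(p.grad.norm().item() for p in params if p.grad is not None)
    return sorted(scores, key=scores.get, reverse=True)  # densest layers first
```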

Can Editing LLMs Inject Harm?

This paper reframes knowledge editing as a novel LLM security threat termed Editing Attack, systematically investigating the feasibility of injecting misinformation and bias into LLMs via three editing methods—ROME, FT, and ICE—and demonstrating that such attacks are both highly effective and remarkably stealthy.

CATFormer: When Continual Learning Meets Spiking Transformers With Dynamic Thresholds

This paper proposes CATFormer, a data-replay-free continual learning framework built upon a spiking Vision Transformer, which achieves task-specific neuronal excitability modulation via context-adaptive dynamic firing thresholds. Over sequences of up to 100 tasks, the model not only avoids forgetting but actually improves in accuracy — a phenomenon the authors term "reverse forgetting."

Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability

This position paper argues that current LLM efficiency research is dominated by hyperscale assumptions. It identifies five open research challenges targeting small- and medium-scale deployers, and advocates for redefining efficiency metrics through an Overhead-Aware Efficiency (OAE) framework.

Designing Truthful Mechanisms for Asymptotic Fair Division

This paper proposes the PRD (Proportional Response with Dummy) mechanism, which is the first to simultaneously achieve expected truthfulness, polynomial-time computability, and high-probability envy-freeness in the asymptotic fair division setting, provided the number of items satisfies \(m = \Omega(n \log n)\). This resolves an open problem posed by Manurangsi & Suksompong.

FedALT: Federated Fine-Tuning through Adaptive Local Training with Rest-of-World LoRA

FedALT equips each client with a trainable Individual LoRA (updated locally) and a frozen Rest-of-World (RoW) LoRA (averaged from the other clients), combined through an adaptive MoE mixer that dynamically balances local and global knowledge. This design sidesteps the cross-client interference caused by FedAvg aggregation and yields significant improvements over SOTA on heterogeneous-task federated LLM fine-tuning.
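
A per-layer sketch of how the two adapters and the mixer might be wired together, assuming a LoRA-style low-rank update and a per-token softmax gate; the names, gate form, and initialization are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class FedALTLinear(nn.Module):
    """Base linear layer plus a trainable Individual LoRA, a frozen Rest-of-World
    LoRA received from the server, and a per-token gate mixing their outputs."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base.requires_grad_(False)                    # frozen pretrained weight
        d_in, d_out = base.in_features, base.out_features
        self.A_ind = nn.Parameter(torch.randn(r, d_in) * 0.02)    # trained locally
        self.B_ind = nn.Parameter(torch.zeros(d_out, r))
        self.A_row = nn.Parameter(torch.zeros(r, d_in), requires_grad=False)  # from server
        self.B_row = nn.Parameter(torch.zeros(d_out, r), requires_grad=False)
        self.gate = nn.Linear(d_in, 2)                            # adaptive MoE mixer
        self.scale = alpha / r

    def forward(self, x):
        w = torch.softmax(self.gate(x), dim=-1)                   # (..., 2) mixing weights
        ind = (x @ self.A_ind.T) @ self.B_ind.T
        row = (x @ self.A_row.T) @ self.B_row.T
        return self.base(x) + self.scale * (w[..., :1] * ind + w[..., 1:] * row)
```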

From Single to Societal: Analyzing Persona-Induced Bias in Multi-Agent Interactions

This paper presents the first systematic study of persona-induced bias in LLM-based multi-agent interactions. Through controlled experiments on collaborative problem solving and persuasion tasks, three key findings are revealed: (1) different personas exhibit significant divergence in trustworthiness and insistence (dominant groups such as males and White individuals are perceived as less trustworthy); (2) agents display pronounced in-group favoritism; and (3) these biases persist and tend to amplify in multi-turn, multi-agent settings.

Gender Bias in Emotion Recognition by Large Language Models

This paper systematically evaluates gender bias in emotion recognition across multiple LLMs (GPT-4/5, Mistral, LLaMA, etc.), finding that most models exhibit statistically significant gender bias on at least one emotion label. Experiments demonstrate that inference-time prompt strategies (prompt engineering, in-context learning, CoT) fail to effectively debias, whereas training-based fine-tuning can substantially mitigate the bias.

Ghost in the Transformer: Detecting Model Reuse with Invariant Spectral Signatures

This paper proposes GhostSpec, a data-free, white-box method that does not modify model behavior. It extracts spectral fingerprints by applying SVD to invariant matrix products of attention weight matrices, enabling robust verification of LLM lineage under fine-tuning, pruning, merging, expansion, and even adversarial transformations.
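
A minimal numpy sketch of this kind of spectral fingerprint, assuming the invariant product takes the form \(W_Q W_K^\top\); the truncation, normalization, and similarity measure are illustrative choices rather than the paper's exact procedure.

```python
import numpy as np

def spectral_fingerprint(W_q: np.ndarray, W_k: np.ndarray, top: int = 32) -> np.ndarray:
    """Singular values of W_q @ W_k.T: the product is unchanged by a reparameterization
    W_q -> W_q R, W_k -> W_k R^{-T}, so its spectrum can survive edits that alter the
    individual weight matrices without changing the attention map."""
    s = np.linalg.svd(W_q @ W_k.T, compute_uv=False)[:top]
    return s / (np.linalg.norm(s) + 1e-12)               # scale-normalized fingerprint

def lineage_similarity(f_a: np.ndarray, f_b: np.ndarray) -> float:
    """Cosine similarity between two layers' fingerprints; high values suggest reuse."""
    return float(f_a @ f_b / (np.linalg.norm(f_a) * np.linalg.norm(f_b) + 1e-12))
```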

GraphTextack: A Realistic Black-Box Node Injection Attack on LLM-Enhanced GNNs

This paper proposes GraphTextack — the first black-box multimodal node injection poisoning attack targeting LLM-enhanced GNNs. It jointly optimizes the graph structural connections and semantic features of injected nodes via an evolutionary optimization framework, requiring neither internal model information nor surrogate models. GraphTextack significantly outperforms 12 baseline methods across 5 datasets and 2 types of LLM-GNN models.

Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models

This paper employs computational complexity theory to demonstrate that the per-step inference complexity of Transformer-based LLMs is \(O(N^2 \cdot d)\). Grounded in the Hartmanis–Stearns Time Hierarchy Theorem, it proves that any computational task exceeding this complexity bound—such as \(O(n^3)\) matrix multiplication, \(O(n^k)\) token enumeration, or TSP verification—necessarily causes hallucination. Furthermore, LLM agents are shown to be incapable of verifying the correctness of such tasks.
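
Restating the budget argument in symbols (an informal compression of the summary, with \(L\) the number of generated tokens and step \(t\) attending over \(N+t\) tokens; the formal reduction is the paper's):

```latex
C_{\text{step}} = O(N^{2} d), \qquad
C_{\text{total}}(L) = \sum_{t=1}^{L} O\!\big((N+t)^{2} d\big) = O\!\big(L\,(N+L)^{2}\,d\big).
```

Any task whose required running time exceeds this budget for the output it emits cannot actually be carried out inside the forward passes, so the emitted tokens are not the product of that computation.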

LAMP: Learning Universal Adversarial Perturbations for Multi-Image Tasks via Pre-trained Models

This paper proposes LAMP, a black-box Universal Adversarial Perturbation (UAP) learning method targeting multi-image MLLMs. By incorporating attention constraints and a contagious loss, LAMP enables cross-model and cross-task transferable attacks by perturbing only a small subset of input images.

Learning from the Undesirable: Robust Adaptation of Language Models without Forgetting

This paper proposes Learning-from-the-Undesirable (LfU), a regularization method for SFT that simulates "undesirable behavior" by applying gradient ascent to an auxiliary model, then enforces representation-level consistency between the original and auxiliary models via an MSE loss. This effectively mitigates overfitting, catastrophic forgetting, and adversarial fragility in limited-data fine-tuning.
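
A hedged reconstruction of one training step, assuming a HuggingFace-style causal LM whose forward pass returns `loss` (the batch includes labels) and `hidden_states`; the single ascent step, the last-layer choice, and the weighting `lam` are assumptions:

```python
import copy
import torch
import torch.nn.functional as F

def lfu_loss(model, batch, ascent_lr=1e-3, lam=1.0):
    """SFT loss plus a representation-consistency penalty against an 'undesirable'
    auxiliary model obtained by one gradient-ASCENT step on the same batch."""
    out = model(**batch, output_hidden_states=True)

    # Build the undesirable auxiliary model: copy, then step toward HIGHER loss.
    aux = copy.deepcopy(model)
    aux_params = [p for p in aux.parameters() if p.requires_grad]
    grads = torch.autograd.grad(aux(**batch).loss, aux_params)
    with torch.no_grad():
        for p, g in zip(aux_params, grads):
            p.add_(ascent_lr * g)
        aux_hidden = aux(**batch, output_hidden_states=True).hidden_states[-1]

    # Pull the main model's representations toward agreement with the perturbed model,
    # discouraging features that are fragile to such undesirable updates.
    consistency = F.mse_loss(out.hidden_states[-1], aux_hidden)
    return out.loss + lam * consistency
```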

LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users

Systematic experiments demonstrate that mainstream LLMs (GPT-4, Claude 3 Opus, Llama 3-8B) exhibit significant discriminatory degradation in information accuracy, truthfulness, and refusal rates toward users with lower English proficiency, lower educational attainment, and non-US backgrounds, making the most vulnerable users the least reliably served.

Lost in Translation? A Comparative Study on the Cross-Lingual Transfer of Composite Harms

This paper introduces the CompositeHarm benchmark, which systematically investigates the vulnerability of LLM safety alignment in cross-lingual settings by translating adversarial syntactic attacks (AttaQ) and contextualized harms (MMSafetyBench) into five Indic languages. The study finds that adversarial syntactic attacks achieve dramatically higher attack success rates in Indic languages.

PANDA: Patch and Distribution-Aware Augmentation for Long-Tailed Exemplar-Free Continual Learning

This paper proposes PANDA, a framework that achieves intra-task class balancing via CLIP-guided semantic patch grafting and alleviates inter-task distribution shift through a learnable distribution smoothing mechanism. PANDA operates as a plug-and-play module to improve pretrained model-based exemplar-free continual learning under long-tailed scenarios.

Perturb Your Data: Paraphrase-Guided Training Data Watermarking

This paper proposes SPECTRA, a paraphrase-sampling-based training data watermarking method: it generates paraphrases with an LLM and uses Min-K%++ scoring to select, as the watermark, paraphrases whose scores are closest to the original text. Even when watermarked data constitutes as little as 0.001% of the training corpus, the detection p-values for members and non-members consistently differ by more than nine orders of magnitude.
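
A simplified sketch of the selection step; a plain Min-K% average stands in below for the Min-K%++ score, and the function names are illustrative:

```python
import torch

def min_k_score(logits: torch.Tensor, token_ids: torch.Tensor, k: float = 0.2) -> float:
    """logits: (T, V) next-token logits for a text; token_ids: (T,) realized tokens.
    Averages the log-probabilities of the k% least likely tokens (simplified Min-K%)."""
    logp = torch.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    n = max(1, int(k * tok_logp.numel()))
    return tok_logp.topk(n, largest=False).values.mean().item()

def pick_watermark(original_score, paraphrase_scores):
    """Index of the paraphrase whose score is closest to the original text's score."""
    return min(range(len(paraphrase_scores)),
               key=lambda i: abs(paraphrase_scores[i] - original_score))
```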

Principles2Plan: LLM-Guided System for Operationalising Ethical Principles into Plans

This paper presents Principles2Plan, an interactive prototype system that enables collaborative human–LLM operationalisation of high-level ethical principles (e.g., beneficence, privacy) into context-sensitive ethical rules, which are then embedded into a PDDL planner to generate ethically compliant action plans.

PRISM: Privacy-Aware Routing for Adaptive Cloud-Edge LLM Inference via Semantic Sketch Collaboration

This paper proposes PRISM, a framework that dynamically routes user prompts to one of three inference modes—cloud-only, edge-only, or collaborative—via a context-aware soft gating mechanism. In the collaborative mode, an adaptive two-layer local differential privacy (LDP) scheme and semantic sketch collaboration are employed to achieve a three-way balance among privacy, utility, and efficiency.

Privacy-protected Retrieval-Augmented Generation for Knowledge Graph Question Answering

This work is the first to explore privacy-protected RAG for Knowledge Graph Question Answering (KGQA). It proposes ARoG (Abstraction Reasoning on Graph), a framework that employs two strategies—relation-centric abstraction and structure-oriented abstraction—to enable effective retrieval and utilization of knowledge graphs for question answering even when entities are anonymized (replaced with semantically meaningless MIDs).

PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization

This paper proposes PSM, a framework that formalizes system prompt protection as a utility-constrained black-box optimization problem. By leveraging LLM-as-Optimizer, PSM automatically searches for an optimal "shield" suffix that reduces prompt extraction attack success rates to near zero without degrading model functionality.

SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth

This paper introduces SproutBench, an evaluation benchmark comprising 1,283 developmentally grounded adversarial prompts, designed to systematically assess the safety of 47 LLMs in contexts involving children and adolescents (ages 0–6, 7–12, and 13–18). Key findings reveal that safety and risk prevention are strongly correlated (\(\rho = 0.86\)), while a significant trade-off exists between interactivity and age-appropriateness (\(\rho = -0.48\)).

StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak

This paper proposes StyleBreak, the first speech-style-based audio jailbreak framework, which systematically investigates how linguistic, paralinguistic, and extralinguistic attributes affect the alignment robustness of large audio-language models (LAMs) via a two-stage style-aware transformation pipeline and a query-adaptive policy network. StyleBreak raises attack success rates by 7.1%–22.3% across multiple attack paradigms.

The Confidence Trap: Gender Bias and Predictive Certainty in LLMs

This paper proposes Gender-ECE, a metric for systematically evaluating the confidence calibration and alignment with human bias judgments of six open-source LLMs on gendered pronoun prediction tasks. The authors find that Gemma-2 exhibits the worst calibration and an extreme disparity between male and female pronoun calibration, whereas GPT-J-6B — trained on less filtered data — achieves the best calibration overall.

Uncovering Bias Paths with LLM-guided Causal Discovery: An Active Learning and Dynamic Scoring Approach

This paper proposes a hybrid causal discovery framework that integrates LLM semantic priors with statistical signals. Through an active learning strategy and a dynamic scoring mechanism, the framework prioritizes querying the most informative variable pairs, effectively recovering fairness-critical causal paths (e.g., sex→education→income) under noise and confounding conditions, substantially outperforming classical CD methods and naïve LLM-based approaches.
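
In miniature, the dynamic scoring might look like the following; the three signals, their names, and the weighting `alpha` are assumptions for illustration only:

```python
def next_query(pairs, llm_prior, stat_evidence, uncertainty, alpha=0.5):
    """Pick the most informative variable pair to query next.
    llm_prior[p]: LLM-elicited plausibility of an edge for pair p;
    stat_evidence[p]: statistical dependence strength (e.g., 1 - CI-test p-value);
    uncertainty[p]: how undecided the current graph is about p."""
    def score(p):
        return uncertainty[p] * (alpha * llm_prior[p] + (1 - alpha) * stat_evidence[p])
    return max(pairs, key=score)
```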

WaterMod: Modular Token-Rank Partitioning for Probability-Balanced LLM Watermarking

This paper proposes WaterMod, an LLM text watermarking method based on modular arithmetic (\(\text{rank} \bmod k\)) that partitions the vocabulary into modular residue classes after sorting tokens by probability. Under both zero-bit (\(k=2\)) and multi-bit (\(k>2\)) watermarking settings, WaterMod achieves high detection rates and low quality degradation within a unified framework, requiring no external thesaurus or hashing tricks.
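
A minimal sketch of the rank-mod-\(k\) partition applied as a logit bonus at each decoding step; the bonus `delta` and how the favored residue class is chosen are assumptions (in the multi-bit case the favored class would encode the current message digit).

```python
import torch

def watermod_logits(logits: torch.Tensor, k: int = 2, favored: int = 0, delta: float = 2.0):
    """Sort tokens by probability, assign each token the residue class rank % k,
    and add a bonus to the class selected by the watermark key/message."""
    ranks = logits.argsort(descending=True).argsort()        # rank of every vocab id
    bonus = torch.where(ranks % k == favored,
                        torch.full_like(logits, delta),
                        torch.zeros_like(logits))
    return logits + bonus                                    # sample the next token from these
```

Detection would then replay the same probability ranking and test whether the generated tokens over-occupy the favored residue classes.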