🔒 LLM Safety
🤖 AAAI2026 · 41 paper notes
📌 Same area in other venues: 📷 CVPR2026 (11) · 🔬 ICLR2026 (185) · 💬 ACL2026 (115) · 🧠 NeurIPS2025 (80) · 📹 ICCV2025 (10)
🔥 Top topics: LLM ×19 · Adversarial Robustness ×8 · Federated Learning ×4 · Multimodal/VLM ×3 · Continual Learning ×2
- AgentSense: Virtual Sensor Data Generation Using LLM Agents in Simulated Home Environments
LLM-driven embodied agents are instantiated to "live" in simulated smart home environments, generating virtual ambient sensor data for pre-training HAR models, which yields significant gains in activity recognition under low-resource settings.
- ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs
This paper proposes ALTER, a framework that combines an asymmetric LoRA architecture with token-level Tsallis entropy guidance to achieve precise unlearning of target knowledge in LLMs. A parameter isolation mechanism is employed to preserve the model's general capabilities, achieving state-of-the-art performance on three benchmarks: TOFU, WMDP, and MUSE.
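For readers unfamiliar with the Tsallis entropy mentioned in the ALTER summary above, the following is a minimal sketch of the standard definition applied to one token's next-token distribution. It only computes the quantity; how ALTER uses it to guide unlearning is not described here, and the function name, the example distributions, and the choice `q=2.0` are assumptions made for illustration.

```python
import math


def tsallis_entropy(probs, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i**q) / (q - 1).

    As q approaches 1 this recovers the Shannon entropy (in nats).
    """
    if abs(q - 1.0) < 1e-12:
        # Shannon limit; zero-probability entries contribute nothing.
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


# Hypothetical next-token distributions over a 4-token vocabulary.
peaked = [0.97, 0.01, 0.01, 0.01]  # the model is nearly certain
flat = [0.25, 0.25, 0.25, 0.25]    # the model is maximally uncertain

print(tsallis_entropy(peaked, q=2.0))
print(tsallis_entropy(flat, q=2.0))  # 0.75
```

A peaked distribution yields a lower value than a flat one, which is the property that makes a per-token entropy usable as a signal of how confidently a token is predicted.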
- An LLM-Based Simulation Framework for Embodied Conversational Agents in Psychological Counseling
This paper proposes the ECAs framework, which grounds psychological counseling simulation in established theories such as Cognitive Behavioral Therapy (CBT). By leveraging LLMs to expand real counseling cases into embodied cognitive memory spaces, the framework simulates the complete cognitive processes of clients in counseling sessions and generates high-fidelity dialogue data. ECAs significantly outperforms baselines in both expert and automated evaluations.
- Anti-adversarial Learning: Desensitizing Prompts for Large Language Models
This paper proposes PromptObfus, which adopts an "anti-adversarial learning" paradigm to replace sensitive tokens in user prompts with semantically distinct yet task-preserving alternatives. The approach eliminates explicit privacy leakage entirely and reduces implicit privacy inference attack success rates by 62.70%, without degrading the task performance of remote LLMs.
- Attention Retention for Continual Learning with Vision Transformers
This paper proposes ARCL-ViT, a framework that prevents attention drift in Vision Transformers during continual learning via a two-step strategy of attention mask generation and gradient masking. It achieves state-of-the-art results on ImageNet-R and CIFAR-100, demonstrating that preserving attention patterns is key to mitigating catastrophic forgetting.
- AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models
This paper proposes AUVIC, a framework that combines an adversarial perturbation generator with a dynamic anchor preservation mechanism to precisely unlearn target visual concepts (e.g., specific faces) in MLLMs, while avoiding collateral forgetting of semantically similar concepts. The paper also introduces VCUBench, the first evaluation benchmark for visual concept unlearning in group-scene scenarios.
- BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models
This paper proposes BadThink — the first training-time backdoor attack targeting CoT reasoning efficiency. By iteratively optimizing verbose reasoning templates via an LLM, it constructs poisoned data that causes the victim model, upon trigger activation, to generate reasoning chains inflated by over 17× (on MATH-500), while preserving final answer correctness and maintaining strong stealthiness.
- Beyond Superficial Forgetting: Thorough Unlearning through Knowledge Density Estimation and Block Re-insertion
This paper proposes the KUnBR framework, which employs gradient-guided knowledge density estimation to localize layers enriched with harmful knowledge, and adopts a block re-insertion strategy to bypass the gradient-masking effect of cover layers, achieving deep unlearning of harmful knowledge in LLMs rather than mere surface-level suppression.
- Can Editing LLMs Inject Harm?
This paper reframes knowledge editing as a novel LLM security threat termed Editing Attack, systematically investigating the feasibility of injecting misinformation and bias into LLMs via three editing methods—ROME, FT, and ICE—and demonstrating that such attacks are both highly effective and remarkably stealthy.
- Cross-Modal Unlearning via Influential Neuron Path Editing in Multimodal Large Language Models
This paper proposes MIP-Editor, which localizes influential neuron paths encoding forget-target knowledge in MLLMs via cross-layer gradient integration (text branch) and Fisher integration (visual branch), then edits these neurons using path-based Representation Misdirection Unlearning (RMisU), achieving up to 87.75% forget rate and 54.26% improvement in general knowledge retention on MLLMU-Bench.
- Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability
This position paper argues that current LLM efficiency research is dominated by hyperscale assumptions. It identifies five open research challenges targeting small- and medium-scale deployers, and advocates for redefining efficiency metrics through an Overhead-Aware Efficiency (OAE) framework.
- Designing Truthful Mechanisms for Asymptotic Fair Division
This paper proposes the PRD (Proportional Response with Dummy) mechanism, which for the first time simultaneously achieves expected truthfulness, polynomial-time computability, and high-probability envy-freeness in the asymptotic fair division setting, requiring only \(m = \Omega(n \log n)\) items. This resolves an open problem posed by Manurangsi & Suksompong.
- FedALT: Federated Fine-Tuning through Adaptive Local Training with Rest-of-World LoRA
This paper proposes FedALT, in which each client maintains a trainable Individual LoRA (updated locally) and a frozen Rest-of-World (RoW) LoRA (averaged from other clients), combined through an adaptive MoE mixer that dynamically balances local and global knowledge. This design eliminates the cross-client interference caused by FedAvg aggregation, achieving significant improvements over SOTA on heterogeneous-task federated LLM fine-tuning.
- Federated CLIP for Resource-Efficient Heterogeneous Medical Image Classification
This paper proposes FedMedCLIP, a federated CLIP framework for medical image classification. By freezing the CLIP encoder and combining a masked Feature Adaptation Module (FAM), a local masked MLP, and class-level KL distillation regularization, the framework achieves robust classification under data heterogeneity with minimal communication and computational overhead (surpassing the second-best method by 8% on ISIC2019 and running 120× faster than FedAvg).
- FedP²EFT: Federated Learning to Personalize PEFT for Multilingual LLMs
This paper proposes FedP²EFT, which collaboratively trains a Personalization Strategy Generator (PSG) via federated learning to automatically generate personalized LoRA rank structures for each client, substantially outperforming manually designed PEFT configurations and existing FL personalization methods in multilingual LLM fine-tuning.
- From Single to Societal: Analyzing Persona-Induced Bias in Multi-Agent Interactions
This paper presents the first systematic study of persona-induced bias in LLM-based multi-agent interactions. Through controlled experiments on collaborative problem solving and persuasion tasks, three key findings are revealed: (1) different personas exhibit significant divergence in trustworthiness and insistence (dominant groups such as males and White individuals are perceived as less trustworthy); (2) agents display pronounced in-group favoritism; and (3) these biases persist and tend to amplify in multi-turn, multi-agent settings.
- Gender Bias in Emotion Recognition by Large Language Models
This paper systematically evaluates gender bias in emotion recognition across multiple LLMs (GPT-4/5, Mistral, LLaMA, etc.), finding that most models exhibit statistically significant gender bias on at least one emotion label. Experiments demonstrate that inference-time prompt strategies (prompt engineering, in-context learning, CoT) fail to effectively debias, whereas training-based fine-tuning can substantially mitigate the bias.
- Ghost in the Transformer: Detecting Model Reuse with Invariant Spectral Signatures
This paper proposes GhostSpec, a data-free, white-box method that does not modify model behavior. It extracts spectral fingerprints by applying SVD to invariant matrix products of attention weight matrices, enabling robust verification of LLM lineage under fine-tuning, pruning, merging, expansion, and even adversarial transformations.
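The idea of an "invariant matrix product" in the GhostSpec summary above can be illustrated with a toy example. The sketch below assumes the product in question is `W_q @ W_k.T` (the summary does not say which products the method uses) and uses 2×2 matrices so that singular values have a closed form. All matrices and helper names are hypothetical. It shows only that a function-preserving reparameterization changes the individual weights but not the spectrum of the product; it does not reproduce the paper's fingerprinting pipeline.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def singular_values_2x2(m):
    """Closed form from s1^2 + s2^2 = ||M||_F^2 and s1 * s2 = |det M|."""
    fro2 = sum(x * x for row in m for x in row)
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = math.sqrt(max(fro2 * fro2 - 4.0 * det * det, 0.0))
    return (math.sqrt((fro2 + disc) / 2.0),
            math.sqrt(max((fro2 - disc) / 2.0, 0.0)))


# Hypothetical query/key projections of a single attention head.
w_q = [[1.0, 2.0], [0.5, -1.0]]
w_k = [[0.3, 1.2], [-0.7, 0.4]]

# An invertible mixing matrix R applied as W_q -> W_q R, W_k -> W_k R^{-T}
# leaves W_q W_k^T (and hence the attention scores) unchanged.
r = [[2.0, 1.0], [1.0, 1.0]]
w_q_mixed = matmul(w_q, r)
w_k_mixed = matmul(w_k, transpose(inverse_2x2(r)))

fingerprint = singular_values_2x2(matmul(w_q, transpose(w_k)))
fingerprint_mixed = singular_values_2x2(matmul(w_q_mixed, transpose(w_k_mixed)))

print(singular_values_2x2(w_q), singular_values_2x2(w_q_mixed))  # differ
print(fingerprint, fingerprint_mixed)  # agree up to rounding
```

Comparing raw weights would miss the relationship between the two models here, whereas the spectrum of the product does not change.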
- GraphTextack: A Realistic Black-Box Node Injection Attack on LLM-Enhanced GNNs
This paper proposes GraphTextack — the first black-box multimodal node injection poisoning attack targeting LLM-enhanced GNNs. It jointly optimizes the graph structural connections and semantic features of injected nodes via an evolutionary optimization framework, requiring neither internal model information nor surrogate models. GraphTextack significantly outperforms 12 baseline methods across 5 datasets and 2 types of LLM-GNN models.
- Invisible Triggers, Visible Threats! Road-Style Adversarial Creation Attack for Visual 3D Detection in Autonomous Driving
This paper proposes AdvRoad, a two-stage framework (Road-Style Adversary Generation + Scenario-Associated Adaptation) that generates diverse adversarial posters with road-surface texture styles. These posters induce "ghost objects" (false positives) in visual 3D detectors for autonomous driving while remaining inconspicuous to human drivers due to their natural appearance, significantly improving the stealthiness of false-positive (FP) attacks and their resistance to defenses.
- iSeal: Encrypted Fingerprinting for Reliable LLM Ownership Verification
This paper proposes iSeal — the first active fingerprinting method capable of reliably verifying LLM ownership in a black-box setting where the model thief has full control over the inference process. Through a triple mechanism of an external encrypted encoder, RSC error correction, and similarity-based matching, iSeal maintains a 100% Fingerprint Success Rate (FSR) across 12 LLMs and 10+ attack types, while existing methods drop to 0%.
- LAMP: Learning Universal Adversarial Perturbations for Multi-Image Tasks via Pre-trained Models
This paper proposes LAMP, a black-box Universal Adversarial Perturbation (UAP) learning method targeting multi-image MLLMs. By incorporating attention constraints and a contagious loss, LAMP enables cross-model and cross-task transferable attacks by perturbing only a small subset of input images.
- Learning from the Undesirable: Robust Adaptation of Language Models without Forgetting
This paper proposes Learning-from-the-Undesirable (LfU), a regularization method for SFT that simulates "undesirable behavior" by applying gradient ascent to an auxiliary model, then enforces representation-level consistency between the original and auxiliary models via an MSE loss. This effectively mitigates overfitting, catastrophic forgetting, and adversarial fragility in limited-data fine-tuning.
- LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users
Systematic experiments demonstrate that mainstream LLMs (GPT-4, Claude 3 Opus, Llama 3-8B) exhibit significant discriminatory degradation in information accuracy, truthfulness, and refusal rates toward users with lower English proficiency, lower educational attainment, and non-US backgrounds, making the most vulnerable users the least reliably served.
- Lost in Translation? A Comparative Study on the Cross-Lingual Transfer of Composite Harms
This paper introduces the CompositeHarm benchmark, which systematically investigates the vulnerability of LLM safety alignment in cross-lingual settings by translating adversarial syntactic attacks (AttaQ) and contextualized harms (MMSafetyBench) into five Indic languages. The study finds that adversarial syntactic attacks achieve dramatically higher attack success rates in Indic languages.
- Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models
This paper proposes MFA, a Multi-Faceted Attack framework that systematically exposes security vulnerabilities in VLMs equipped with multi-layered defenses (including commercial models such as GPT-4o and Gemini) through three complementary dimensions: Attention Transfer Attack (ATA) to bypass alignment, adversarial signatures to evade content moderation, and visual encoder attack to overwrite system prompts. The overall attack success rate reaches 58.5%.
- PANDA: Patch and Distribution-Aware Augmentation for Long-Tailed Exemplar-Free Continual Learning
This paper proposes PANDA, a framework that achieves intra-task class balancing via CLIP-guided semantic patch grafting and alleviates inter-task distribution shift through a learnable distribution smoothening mechanism. PANDA operates as a plug-and-play module to improve pretrained model-based exemplar-free continual learning under long-tailed scenarios.
- Perturb Your Data: Paraphrase-Guided Training Data Watermarking
This paper proposes SPECTRA — a paraphrase-sampling-based training data watermarking method. It generates paraphrases via an LLM and uses Min-K%++ scoring to select paraphrases with scores close to the original text as watermarks. Even when watermarked data constitutes as little as 0.001% of the training corpus, the p-value gap between members and non-members consistently exceeds 9 orders of magnitude.
- Principles2Plan: LLM-Guided System for Operationalising Ethical Principles into Plans
This paper presents Principles2Plan, an interactive prototype system that enables collaborative human–LLM operationalisation of high-level ethical principles (e.g., beneficence, privacy) into context-sensitive ethical rules, which are then embedded into a PDDL planner to generate ethically compliant action plans.
- PRISM: Privacy-Aware Routing for Adaptive Cloud-Edge LLM Inference via Semantic Sketch Collaboration
This paper proposes PRISM, a framework that dynamically routes user prompts to one of three inference modes—cloud-only, edge-only, or collaborative—via a context-aware soft gating mechanism. In the collaborative mode, an adaptive two-layer local differential privacy (LDP) scheme and semantic sketch collaboration are employed to achieve a three-way balance among privacy, utility, and efficiency.
- Privacy-protected Retrieval-Augmented Generation for Knowledge Graph Question Answering
This work is the first to explore privacy-protected RAG for Knowledge Graph Question Answering (KGQA). It proposes ARoG (Abstraction Reasoning on Graph), a framework that employs two strategies—relation-centric abstraction and structure-oriented abstraction—to enable effective retrieval and utilization of knowledge graphs for question answering even when entities are anonymized (replaced with semantically meaningless MIDs).
- PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization
This paper proposes PSM, a framework that formalizes system prompt protection as a utility-constrained black-box optimization problem. By leveraging LLM-as-Optimizer, PSM automatically searches for an optimal "shield" suffix that reduces prompt extraction attack success rates to near zero without degrading model functionality.
- RadarLLM: Empowering Large Language Models to Understand Human Motion from Millimeter-Wave Point Cloud Sequence
This paper proposes RadarLLM, the first end-to-end framework leveraging large language models for semantic-level human motion understanding from millimeter-wave radar point cloud sequences. The framework comprises a motion-guided radar tokenizer based on Aggregate VQ-VAE and a radar-aware language model, along with a physics-aware simulation pipeline for generating large-scale paired radar-text data.
- SafeNlidb: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces
This paper proposes SafeNlidb, a framework that jointly optimizes safety reasoning and SQL generation in LLM-driven Natural Language Interfaces to Databases (NLIDBs) through a safety-aware data synthesis pipeline and an alternating preference optimization strategy, effectively defending against privacy leakage under implicit inference attacks.
- SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth
This paper introduces SproutBench, an evaluation benchmark comprising 1,283 developmentally-grounded adversarial prompts, designed to systematically assess the safety of 47 LLMs in contexts involving children and adolescents (ages 0–6, 7–12, and 13–18). Key findings reveal that safety and risk prevention are strongly correlated (\(\rho = 0.86\)), while a significant trade-off exists between interactivity and age-appropriateness (\(\rho = -0.48\)).
- StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak
This paper proposes StyleBreak, the first audio jailbreak framework based on speech style. It systematically investigates the impact of linguistic, paralinguistic, and extralinguistic attributes on the alignment robustness of large audio-language models (LAMs) through a two-stage style-aware transformation pipeline and a query-adaptive policy network. StyleBreak improves attack success rate (ASR) by 7.1%–22.3% across multiple attack paradigms.
- The Confidence Trap: Gender Bias and Predictive Certainty in LLMs
This paper proposes Gender-ECE, a metric used to systematically evaluate six open-source LLMs on gendered pronoun prediction tasks, covering both their confidence calibration and their alignment with human bias judgments. The authors find that Gemma-2 exhibits the worst calibration and an extreme disparity between male and female pronoun calibration, whereas GPT-J-6B, which was trained on less filtered data, achieves the best calibration overall.
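The summary above does not define Gender-ECE, so the following sketch makes an assumption: it computes the standard expected calibration error (ECE) separately for each pronoun group and reports the gap between groups. The function name, the group labels, and the toy predictions are all hypothetical and should not be read as the paper's exact metric.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence, then take the
    sample-weighted average of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Hypothetical (confidence, was-the-prediction-correct) pairs per pronoun group.
predictions = {
    "he": ([0.9, 0.9, 0.9, 0.9], [True, True, True, False]),
    "she": ([0.9, 0.9, 0.9, 0.9], [True, False, False, False]),
}

per_group = {
    group: expected_calibration_error(confs, oks)
    for group, (confs, oks) in predictions.items()
}
gap = abs(per_group["he"] - per_group["she"])
print(per_group, gap)
```

In this toy data the model is equally confident for both groups but far less accurate for one of them, which is the kind of calibration disparity the summary reports for Gemma-2.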
- TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models
TOFA is a federated learning framework that adapts CLIP via hierarchical Bayesian inference of personalized visual prototype distributions, globally aligned LLM-based text augmentation, and adaptive modality fusion — achieving training-free, single-round communication adaptation that outperforms one-shot baselines and even some multi-round training methods across 9 datasets.
- Uncovering Bias Paths with LLM-guided Causal Discovery: An Active Learning and Dynamic Scoring Approach
This paper proposes a hybrid causal discovery framework that integrates LLM semantic priors with statistical signals. Through an active learning strategy and a dynamic scoring mechanism, the framework prioritizes querying the most informative variable pairs, effectively recovering fairness-critical causal paths (e.g., sex→education→income) under noise and confounding conditions, substantially outperforming classical causal discovery (CD) methods and naïve LLM-based approaches.
- Uncovering Pretraining Code in LLMs: A Syntax-Aware Attribution Approach
This paper proposes SynPrune, the first syntax-aware membership inference attack (MIA) method for code. It identifies 47 Python syntactic conventions and, when computing MIA scores, prunes syntactically determined tokens so that only tokens reflecting authorial style remain. SynPrune achieves an average AUROC improvement of 15.4%, enabling effective attribution of pretraining data in code LLMs.
- WaterMod: Modular Token-Rank Partitioning for Probability-Balanced LLM Watermarking
This paper proposes WaterMod, an LLM text watermarking method based on modular arithmetic (\(\text{rank} \bmod k\)) that partitions the vocabulary into modular residue classes after sorting tokens by probability. Under both zero-bit (\(k=2\)) and multi-bit (\(k>2\)) watermarking settings, WaterMod achieves high detection rates and low quality degradation within a unified framework, requiring no external thesaurus or hashing tricks.
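The rank-based partition described in the WaterMod summary above can be sketched in a few lines. This is an illustrative reading of "sort tokens by probability, then group by rank mod k", not the authors' implementation: the function names and the toy distribution are assumptions, and the step that decides which residue class to favor when sampling (and how detection works) is deliberately omitted because the summary does not describe it.

```python
def partition_by_rank(probs, k):
    """Assign each token id the residue class (rank mod k), where rank 0
    is the most probable token. Ties are broken by token id."""
    order = sorted(range(len(probs)), key=lambda tok: (-probs[tok], tok))
    residue = [0] * len(probs)
    for rank, tok in enumerate(order):
        residue[tok] = rank % k
    return residue


def class_mass(probs, residue, k):
    """Total probability mass that falls in each residue class."""
    mass = [0.0] * k
    for p, r in zip(probs, residue):
        mass[r] += p
    return mass


# Hypothetical next-token distribution over a 6-token vocabulary.
probs = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]

residue = partition_by_rank(probs, k=2)
mass = class_mass(probs, residue, k=2)
print(residue)  # [0, 1, 0, 1, 0, 1]
print(mass)     # roughly [0.61, 0.39]
```

Because adjacent ranks land in different classes, every class contains some high-probability tokens. For comparison, cutting the same sorted list into a top half and a bottom half would give masses of about 0.80 and 0.20, which is one way to see why interleaving by rank keeps the classes closer to probability-balanced.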