LLM Safety

🔬 ICLR2026 · 39 paper notes

Attention Smoothing Is All You Need For Unlearning: This paper proposes Attention Smoothing Unlearning (ASU), which constructs a forget-teacher by raising the softmax temperature in self-attention, reformulating the unlearning problem as self-distillation. By smoothing the attention distribution to weaken both lexical- and semantic-level associations, ASU erases memorized knowledge while preserving output coherence, surpassing existing unlearning methods on multiple benchmarks including TOFU, MUSE, and WMDP.
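The temperature idea behind ASU can be illustrated with a minimal sketch. It only shows how raising the softmax temperature flattens one row of attention weights; the scores, the temperature value of 4.0, and the helper names are assumptions for illustration, not the paper's implementation or settings.

```python
import math

def softmax(scores, temperature=1.0):
    """Temperature-scaled softmax over one row of attention scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical pre-softmax attention scores for one query over four keys.
scores = [4.0, 1.0, 0.5, 0.2]

sharp = softmax(scores, temperature=1.0)   # ordinary attention weights
smooth = softmax(scores, temperature=4.0)  # flattened weights, as a forget-teacher might use

print("T=1:", [round(p, 3) for p in sharp], "entropy", round(entropy(sharp), 3))
print("T=4:", [round(p, 3) for p in smooth], "entropy", round(entropy(smooth), 3))
```

At the higher temperature the dominant key loses weight and the entropy rises, which is the sense in which token-to-token associations are weakened.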
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models: This paper proposes AudioTrust, the first multidimensional trustworthiness evaluation benchmark for audio large language models (ALLMs), encompassing six dimensions—fairness, hallucination, safety, privacy, robustness, and authentication—with 26 sub-tasks and 4,420+ audio samples. It systematically evaluates the trustworthiness boundaries of 14 state-of-the-art open- and closed-source ALLMs in high-stakes audio scenarios.
BiasBusters: Uncovering and Mitigating Tool Selection Bias in Large Language Models: This paper presents the first systematic study of bias in LLM tool selection. When multiple functionally equivalent APIs are available, LLMs systematically favor certain tools due to semantic alignment, positional effects, and pretraining exposure. The authors propose a total variation–based bias metric, a benchmark spanning 10 tool categories, and a lightweight debiasing strategy based on filtering followed by uniform sampling.
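A rough sketch of the two ingredients named in the BiasBusters summary: a total-variation measure of how far tool-selection frequencies sit from uniform, and a filter-then-sample-uniformly picker. The choice of the uniform distribution as the reference, the tool names, and the `is_relevant` predicate are assumptions made for illustration; the paper's exact metric may be defined differently.

```python
import random

def total_variation_from_uniform(counts):
    """TV distance between empirical selection frequencies and the uniform distribution.

    `counts` maps every candidate tool (including never-selected ones) to
    the number of times it was chosen.
    """
    n = sum(counts.values())
    k = len(counts)
    return 0.5 * sum(abs(c / n - 1 / k) for c in counts.values())

def debiased_pick(tools, is_relevant, rng):
    """Filter to the tools judged relevant, then sample one uniformly at random."""
    candidates = [t for t in tools if is_relevant(t)]
    return rng.choice(candidates)

# Hypothetical selection counts over three functionally equivalent APIs.
counts = {"weather_api_a": 70, "weather_api_b": 20, "weather_api_c": 10}
bias = total_variation_from_uniform(counts)

# Hypothetical tool pool: only the weather tools pass the relevance filter.
tools = ["weather_api_a", "weather_api_b", "weather_api_c", "stock_api"]
rng = random.Random(0)
choice = debiased_pick(tools, lambda t: t.startswith("weather"), rng)

print(round(bias, 4), choice)
```

A value of 0 means every equivalent tool is chosen equally often; values near 1 mean almost all selections collapse onto a single tool.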
Enhancing Hallucination Detection through Noise Injection: This paper injects uniform noise into the MLP activations of intermediate LLM layers to approximate sampling from a Bayesian posterior, capturing epistemic uncertainty that complements the aleatoric uncertainty captured by temperature sampling. The approach raises hallucination detection AUROC on GSM8K from 71.56 to 76.14.
Erase or Hide? Suppressing Spurious Unlearning Neurons for Robust Unlearning: This paper exposes the "shallow alignment" problem in mainstream LLM unlearning methods — rather than truly erasing target knowledge, these methods generate "spurious unlearning neurons" that suppress its expression, allowing the knowledge to be readily recovered via subsequent fine-tuning. The proposed method, Ssiuu, employs attribution-guided regularization to prevent the growth of negative influence, achieving robust unlearning.
Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs: This paper introduces the IRIS Benchmark, the first benchmark to synchronously evaluate fairness in both understanding and generation tasks for Unified Multimodal Large Language Models (UMLLMs). Through a three-dimensional evaluation framework, 60 fine-grained metrics, and a high-dimensional fairness space, IRIS reveals key phenomena such as cross-task "personality splitting" and systematic "generation gaps."
Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions: This paper proposes Concept DAS (CDAS), which achieves faithful bi-directional model steering through a Jensen-Shannon divergence distribution matching objective and distributed interchange interventions (DII). The method enables systematic behavioral control in safety-critical scenarios—bypassing refusal behaviors and eliminating backdoors—while preserving general model capabilities.
From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning: This paper proposes ATAD (Agent-Centric Text Anomaly Detection), which replaces static benchmarks with a Teacher-Orchestrator-Student three-agent competition and validation loop. Using text anomaly detection as the task format, ATAD achieves self-calibrating, dynamically evolving LLM reasoning evaluation — all evaluated LLMs achieve average accuracies of only 54–59% (far below 90%+ on static benchmarks), effectively exposing reasoning weaknesses.
Gaussian Certified Unlearning in High Dimensions: A Hypothesis Testing Approach: This paper proposes \((\phi,\varepsilon)\)-Gaussian certifiability — a high-dimensional machine unlearning privacy framework grounded in hypothesis testing trade-off functions. It rigorously proves that, in the high-dimensional proportional regime (\(p \sim n\)), a single Newton step combined with calibrated Gaussian noise simultaneously satisfies privacy (GPAR) and accuracy (GED→0) requirements. The work refutes the conclusion of Zou et al. (2025) that "at least two Newton steps are necessary," and theoretically identifies the fundamental incompatibility between the classical \(\varepsilon\)-certifiability and noise-addition mechanisms.
Heterogeneous Federated Fine-Tuning with Parallel One-Rank Adaptation: This paper proposes Fed-PLoRA, a framework that replaces multi-rank LoRA with multiple parallel one-rank modules (PLoRA). Via a Select-N-Fold strategy—selecting \(N\) modules for training and folding the remainder into frozen weights—it achieves zero initialization noise and minimal aggregation noise for heterogeneous federated fine-tuning, outperforming existing methods across 6 LLMs and multiple tasks.
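The folding step in Select-N-Fold can be sketched as follows. The sketch only checks the algebraic point that merging unselected rank-one modules into the frozen weight leaves the effective weight unchanged; the matrix sizes, integer values, and function names are hypothetical, and the federated aggregation logic of Fed-PLoRA is not modeled.

```python
def outer(b, a):
    """Outer product b a^T of a column vector b and a row vector a."""
    return [[bi * aj for aj in a] for bi in b]

def add(m1, m2):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def effective_weight(base, modules):
    """Base weight plus the sum of all rank-one updates b_i a_i^T."""
    w = base
    for b, a in modules:
        w = add(w, outer(b, a))
    return w

def select_n_fold(base, modules, selected):
    """Keep modules whose index is in `selected` trainable; fold the rest into the base."""
    trainable = [m for i, m in enumerate(modules) if i in selected]
    folded = [m for i, m in enumerate(modules) if i not in selected]
    return effective_weight(base, folded), trainable

# Hypothetical 2x3 base weight and three parallel rank-one modules (b_i, a_i).
base = [[1, 0, 2], [0, 1, 0]]
modules = [([1, 0], [1, 1, 0]), ([0, 2], [0, 1, 1]), ([1, 1], [1, 0, 1])]

full = effective_weight(base, modules)
frozen, trainable = select_n_fold(base, modules, selected={0})  # a client that can afford N = 1
recombined = effective_weight(frozen, trainable)

print(full == recombined, len(trainable))
```

Because the folded modules are absorbed exactly, a client that trains fewer modules still starts from the same effective weight as one that trains all of them.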
Improving the Trade-off Between Watermark Strength and Speculative Sampling Efficiency for Language Models: This paper proposes a quantitative measure of watermark strength (expected KL divergence) and fully characterizes the Pareto trade-off curve between watermark strength and speculative sampling efficiency. By pseudo-randomizing the acceptance decision, the method simultaneously achieves maximum watermark strength and optimal sampling efficiency.
Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates: This paper exposes LLM chat templates (Jinja2) as a novel inference-time backdoor attack surface. Without modifying model weights, poisoning training data, or controlling inference infrastructure, an adversary can implant conditionally triggered backdoors by modifying only the template within a GGUF file. Attacks are validated across 18 models and 4 inference engines with a success rate exceeding 80%, while completely evading HuggingFace's security scanning.
Inoculation Prompting: Eliciting Traits from LLMs during Training Can Suppress Them at Test-Time: This paper proposes Inoculation Prompting—inserting a system prompt describing an undesired trait (e.g., "You are a malicious, evil assistant") into finetuning data, so the model associates that trait with the prompt rather than learning it globally. Removing the prompt at test time causes the trait to nearly vanish, effectively mitigating Emergent Misalignment, backdoor attacks, and subliminal learning.
LH-Deception: Simulating and Understanding LLM Deceptive Behaviors in Long-Horizon Interactions: This paper proposes LH-Deception, the first simulation framework for LLM deceptive behaviors in long-horizon interactions. It adopts a three-role multi-agent architecture comprising a performer, a supervisor, and a deception auditor, combined with a social-science-theory-driven probabilistic event system. Across 11 frontier models, the framework systematically quantifies deception frequency, severity, type distribution, and trust erosion effects, revealing an emergent "chain of deception" phenomenon that static single-turn evaluations are entirely unable to capture.
Lifelong Learning with Behavior Consolidation for Vehicle Routing: This paper proposes LLR-BC, a framework for lifelong learning in neural VRP solvers. By combining decision-step-level experience buffers, Confidence-aware Experience Weighting (CaEW), and Decision-seeking Behavior Consolidation via reverse KL divergence (DsBC), LLR-BC reduces the Average Performance gap (AP) by an order of magnitude on task sequences with simultaneously shifting distributions and scales, while preserving plasticity for new tasks and improving zero-shot generalization.
Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark: This paper proposes EAPrivacy — the first 4-tier benchmark for evaluating LLM physical-world privacy awareness (400+ procedurally generated scenarios, 60+ physical scenes). It finds that all frontier models exhibit "asymmetric conservatism" (over-cautious on task execution yet insufficient on privacy protection), that enabling reasoning/thinking mode actually degrades privacy performance, and that the best model (Gemini 2.5 Pro) achieves only 59% accuracy in dynamic environments.
Membership Inference Attacks Against Fine-tuned Diffusion Language Models (SAMA): This paper presents the first systematic study of membership inference attack (MIA) vulnerabilities in diffusion language models (DLMs), proposing SAMA: a method that exploits DLMs' bidirectional masking structure to generate exponentially many probing opportunities, and handles sparse, heavy-tailed membership signals via progressive masking, sign voting, and adaptive weighting. SAMA achieves AUC of 0.81 across 9 datasets, outperforming the best baseline by 30%.
OFMU: Optimization-Driven Framework for Machine Unlearning: This work formulates machine unlearning as a bilevel optimization problem: the inner level maximizes the forgetting loss with gradient decorrelation to prevent damage to the retain set, while the outer level minimizes the retain loss with a penalty term enforcing stationary points of the inner objective. On the TOFU benchmark, OFMU simultaneously achieves high forgetting quality and high model utility, outperforming GA/GradDiff/NPO/RMU in terms of forget-retain balance.
Perturbation-Induced Linearization: Constructing Unlearnable Data with Solely Linear Classifiers: This paper proposes PIL, a method that generates unlearnable perturbations using only a bias-free linear classifier as the surrogate model. By inducing linearization in deep models, PIL prevents them from learning semantic features, achieving over 100× speedup compared to existing methods (under 1 minute of GPU time on CIFAR-10).
PMark: Towards Robust and Distortion-free Semantic-level Watermarking with Channel Constraints: PMark is a theoretically distortion-free and paraphrase-robust semantic-level watermarking method for LLMs. It employs cascaded binary filtering over candidate sentences using multiple orthogonal pivot vectors, with median-based sampling to guarantee distortion-freeness. A multi-channel design increases watermark evidence density and enhances robustness. Under paraphrase attacks, TP@FP1% reaches 95%+, outperforming prior SWM methods by 14.8%.
Purifying Generative LLMs from Backdoors without Prior Knowledge or Clean Reference: This paper proposes a backdoor purification method for LLMs that requires neither prior knowledge of the attack nor a clean reference model. Mechanistic analysis reveals that backdoor associations are redundantly distributed across MLP layers. Inspired by immunology, the method extracts a "signature" from multiple backdoor variants, localizes and suppresses suspicious neurons, and applies lightweight fine-tuning for recovery. Across 5 attacks × 3 tasks, ASR is reduced by 80%+ while utility is preserved.
Redirection for Erasing Memory (REM): Towards a Universal Unlearning Method for Corrupted Data: This paper proposes a two-dimensional taxonomy for the corrupted data unlearning task (discovery rate × statistical regularity), reveals that existing unlearning methods are each effective only within specific regions of this space, and introduces REM (Redirection for Erasing Memory), which redirects corrupted data into newly added dedicated network capacity before discarding it—achieving strong and consistent unlearning performance across the entire two-dimensional task space for the first time.
RedSage: A Cybersecurity Generalist LLM: This paper introduces RedSage—the first fully open-source cybersecurity generalist LLM—built upon large-scale domain continual pre-training on 11.7B tokens, agentic-augmentation SFT with 266K samples, and RedSage-Bench, the first comprehensive evaluation benchmark covering knowledge, skills, and tools. The resulting 8B-parameter model surpasses same-scale SOTA on cybersecurity benchmarks by +5.4 pp and approaches Qwen3-32B, while simultaneously improving general-purpose performance (+8.4 pp vs. Qwen3-8B).
Resource-Adaptive Federated Text Generation with Differential Privacy: This paper proposes a resource-adaptive federated text generation framework that employs a two-stage design — DP fine-tuning on strong clients and DP voting on weak clients — to generate high-quality synthetic text data under computational heterogeneity and differential privacy constraints.
SABRE-FL: Selective and Accurate Backdoor Rejection for Federated Prompt Learning: This paper is the first to investigate backdoor attack threats in the federated prompt learning (FPL) setting, and proposes SABRE-FL — a lightweight server-side defense based on anomaly detection in the embedding space — which effectively filters poisoned prompt updates without accessing clients' raw data.
SecP-Tuning: Efficient Privacy-Preserving Prompt Tuning for Large Language Models via MPC: This paper proposes SecP-Tuning, the first privacy-preserving prompt tuning framework based on secure multi-party computation (MPC). It eliminates backpropagation overhead via forward-only tuning and reduces communication complexity by replacing softmax with privacy-preserving random feature attention (RFA), achieving approximately 12–16× speedup and 17–20× reduction in communication volume.
SecP-Tuning: Efficient Privacy-Preserving Prompt Tuning for Large Language Models via MPC: This paper proposes SecP-Tuning, the first MPC-based privacy-preserving prompt tuning framework for LLMs. It eliminates backpropagation overhead via forward-only tuning and replaces softmax attention with a privacy-preserving random feature attention mechanism, achieving 12–16× speedup and 17–20× reduction in communication cost.
Self-Destructive Language Model: This paper proposes Seam, which couples the optimization trajectories of benign and harmful data (forcing their gradients into opposite directions) to transform an LLM into a "self-destructive model." Harmful fine-tuning automatically triggers catastrophic performance collapse, creating an inescapable dilemma for attackers: low-intensity attacks are ineffective, while high-intensity attacks render the model unusable.
SHE-LoRA: Selective Homomorphic Encryption for Federated Tuning with Heterogeneous LoRA: This paper proposes SHE-LoRA, which integrates Selective Homomorphic Encryption (SHE) with LoRA for cross-device federated LLM fine-tuning. The framework features sensitivity-based column-level encrypted subset negotiation, column-swap parameter obfuscation, and column-aware adaptive aggregation. It achieves model performance comparable to non-private baselines while reducing communication overhead by 99.71% and encryption time by 99.87%, providing complete resistance against the state-of-the-art gradient inversion attack DAGER.
SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense: This work is the first to systematically trace object hallucinations in LVLMs back to the visual encoder, identifying three core issues: statistical bias (over-emphasis on high-frequency pattern tokens), inherent bias (residual representations of pre-training dominant objects), and vulnerability (feature distortion under minimal perturbations). It proposes SHIELD—a fully training-free framework that jointly addresses these issues via token reweighting, token subtraction, and contrastive decoding, achieving comprehensive improvements over VCD and OPERA on LLaVA-1.5, InstructBLIP, and Qwen-VL.
Train Once, Answer All: Many Pretraining Experiments for the Cost of One: This paper proposes a methodological framework for running multiple independent experiments simultaneously within a single LLM pretraining run. Training a 2.7B-parameter model on 210B tokens, the framework concurrently executes 10 experiments, successfully replicates the results of 5 prior works, and conducts 3 novel experiments. It further introduces Continual Pretraining Dependence Testing (CPDT) to verify inter-experiment independence.
Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks (DialTree): This paper proposes DialTree, which frames multi-turn red-teaming as a goal-oriented dialogue policy optimization problem. By employing tree-structured rollouts with quality-based pruning to explore the attack trajectory space, combined with an adaptive mask to prevent format forgetting, DialTree achieves an average ASR of 81.5% across 12 target models—44.2% higher than the previous SOTA—and attains 71% ASR even on Claude-4-Sonnet.
Understanding Sensitivity of Differential Attention through the Lens of Adversarial Robustness: This work is the first to analyze the Differential Attention (DA) mechanism from an adversarial robustness perspective. It reveals that the subtraction structure in DA, while suppressing noise, amplifies sensitivity to adversarial perturbations through negative gradient alignment. The study establishes a "Fragility Principle"—DA improves discriminability on clean samples but becomes more vulnerable under adversarial attacks—and identifies a depth-dependent robustness crossover effect.
Understanding Sensitivity of Differential Attention through the Lens of Adversarial Robustness: This work provides the first adversarial robustness analysis of the structural vulnerability in Differential Attention (DA): while the subtraction mechanism suppresses noise, it amplifies sensitivity to adversarial perturbations due to negative gradient alignment, revealing a fundamental trade-off between selectivity and robustness.
Unlearning Evaluation through Subset Statistical Independence: This paper proposes Split-half Dependence Evaluation (SDE), which leverages HSIC-based statistical independence testing to evaluate machine unlearning at the subset level, requiring neither model retraining nor auxiliary classifiers.
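For readers unfamiliar with HSIC, the following is a minimal sketch of the standard biased estimator, trace(KHLH)/n², with Gaussian kernels. Scalar samples stand in for whatever features SDE actually compares, and the kernel bandwidth, sample size, and normalization convention are assumptions; this is not the paper's evaluation pipeline.

```python
import math
import random

def gaussian_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel over scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in xs] for a in xs]

def center(gram):
    """Double-center a symmetric Gram matrix, i.e. compute H K H."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    grand_mean = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]

def hsic(xs, ys, sigma=1.0):
    """Biased HSIC estimate: trace(K H L H) / n^2."""
    n = len(xs)
    k_centered = center(gaussian_gram(xs, sigma))
    l_gram = gaussian_gram(ys, sigma)
    return sum(k_centered[i][j] * l_gram[i][j]
               for i in range(n) for j in range(n)) / n ** 2

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(60)]
ys_dependent = [x + 0.1 * rng.gauss(0, 1) for x in xs]   # nearly a copy of xs
ys_independent = [rng.gauss(0, 1) for _ in range(60)]    # unrelated draw

hsic_dependent = hsic(xs, ys_dependent)
hsic_independent = hsic(xs, ys_independent)
print(hsic_dependent, hsic_independent)
```

The dependent pair yields a clearly larger statistic than the independent pair, which is the kind of signal an independence-based unlearning check relies on.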
Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models: This paper proposes X-GRAAD, an inference-time backdoor defense that combines attention anomaly scoring and gradient importance scoring to localize trigger tokens, followed by character-level perturbation to neutralize them. Across 5 Transformer models × 3 attack types, ASR is reduced to near 0% while maintaining 88–95%+ CACC, with a 30× speedup over PURE.
Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning: This paper proposes Veritas, an MLLM-based deepfake detector that simulates human authentication reasoning via pattern-aware reasoning (fast judgment → reasoning → planning → self-reflection → conclusion). It introduces a two-stage training pipeline (SFT+MiPO cold-start + P-GRPO reinforcement learning) and constructs the HydraFake benchmark with a four-level OOD evaluation protocol. Veritas achieves an average accuracy of 90.7% across cross-forgery and cross-domain scenarios, surpassing the previous SOTA by 6.0%.
VeriTrail: Closed-Domain Hallucination Detection with Traceability: This paper proposes VeriTrail, the first closed-domain hallucination detection method designed for pipelines with multiple generative steps (MGS). By modeling the generation process as a DAG and verifying claims layer by layer along the graph, VeriTrail achieves full traceability encompassing hallucination detection, provenance tracking, and error localization. It substantially outperforms all baselines on two newly introduced datasets.
VeriTrail: Closed-Domain Hallucination Detection with Traceability: This paper proposes VeriTrail — the first closed-domain hallucination detection method that provides traceability for multi-generative-step (MGS) processes. It models the generation process as a DAG and performs layer-by-layer verification along paths, while also introducing the first MGS datasets that include all intermediate outputs with human annotations.