Skip to content

🔒 LLM Safety

💬 ACL2025 · 55 paper notes

📌 Same area in other venues: 📷 CVPR2026 (12) · 🔬 ICLR2026 (185) · 💬 ACL2026 (115) · 🤖 AAAI2026 (41) · 🧠 NeurIPS2025 (81) · 📹 ICCV2025 (10)

🔥 Top topics: LLM ×26 · Adversarial Robustness ×19 · Watermarking ×6 · Agents ×2 · Multimodal/VLM ×2

A Statistical and Multi-Perspective Revisiting of the Membership Inference Attack in Large Language Models

This paper comprehensively revisits Membership Inference Attacks (MIA) in LLMs from a statistical perspective through thousands of experiments. It analyzes the inconsistency of MIA performance across six dimensions: data splitting methods, model size, domain characteristics, text features, embedding separability, and decoding dynamics. It reveals previously overlooked findings such as threshold generalization, the impact of text length/similarity, and emergent changes in the embedding layers.

AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection

Proposes AGrail, a lifelong learning LLM Agent guardrail framework. Through dual-LLM collaboration (Analyzer + Executor) and a memory module, it adaptively generates and optimizes safety check policies at test time, effectively defending against task-specific and systemic risks.

Answer When Needed, Forget When Not: Language Models Pretend to Forget via In-Context Knowledge Unlearning

This paper proposes "In-Context Knowledge Unlearning" by introducing special unlearning tokens <<UNL>>...<</UNL>> to enable LLMs to selectively forget specific knowledge during inference based on context. It achieves a 95% unlearning accuracy on TOFU/AGE/RWKU while retaining 80% of irrelevant knowledge. In-depth internal analysis reveals that LLMs do not truly delete the knowledge but rather "pretend to forget" it at the final layer.

Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs

This paper challenges the prior conclusion that LLM hidden states can encode the truthfulness of facts. By constructing more realistic and challenging datasets (perplexity-guided negative sampling and QA-based LLM generation datasets), the authors find that prior methods exhibit limited generalization on data that closer resembles real-world scenarios, providing a more rigorous benchmark and practical guidance for LLM factuality evaluation.

Bias in the Mirror: Are LLMs' Opinions Robust to Their Own Adversarial Attacks

This paper proposes a novel "self-debate" paradigm where two instances of the same LLM play the proponent and opponent to debate each other, attempting to persuade a neutral version of the model. This setup is used to evaluate the robustness of LLMs' intrinsic bias—specifically, whether the bias is easily swayed and whether the model is susceptible to being misled by its own adversarial arguments.

CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks

This paper proposes the CAVGAN framework, which utilizes generative adversarial networks to simultaneously learn jailbreak attacks (generator) and safety defense (discriminator) within the internal representation space of LLMs. This is the first work to unify attack and defense into a single framework for mutual enhancement, achieving an average attack success rate of 88.85% and an average defense success rate of 84.17%.

Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models

Proposes Chinese SimpleQA, the first comprehensive Chinese factuality evaluation benchmark, containing 3,000 high-quality short Q&A pairs (covering 6 main domains and 99 sub-domains). After evaluating 41 LLMs, only o1-preview (63.8%) and Doubao-pro-32k (61.9%) passed. The study systematically reveals key insights such as "larger models perform better," "RAG narrows the gap," and "alignment lowers factuality."

CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP

This work proposes CLIPErase, a machine unlearning framework tailored for multimodal CLIP models. By synergistically integrating a Forgetting Module, a Retention Module, and a Consistency Module, it selectively removes specified vision-language associations while preserving the performance of the model on the retained data.

ComparisonQA: Evaluating Factuality Robustness of LLMs Through Knowledge Frequency Control and Uncertainty

The ComparisonQA benchmark (283K paired questions) is constructed to achieve controlled comparisons by having high- and low-frequency entities share the same abstract question. Combining a two-stage evaluation method of accuracy and uncertainty, the study reveals that LLMs (including GPT-4o) exhibit extremely poor robustness to low-frequency knowledge.

Core: Robust Factual Precision with Informative Sub-Claim Identification

This paper proposes the Core framework, which achieves robust factual precision evaluation by identifying and filtering informative sub-claims, addressing the issue of inaccurate evaluation in existing methods caused by the dilution effect of uninformative claims.

Defense Against Prompt Injection Attack by Leveraging Attack Techniques

This paper proposes an "attack-as-defense" prompt injection defense strategy: reversing existing attack techniques (ignore, escape, fake completion) for defense. By appending a shield prompt and the original instruction after the poisoned data, the LLM is forced to ignore the injected instructions and execute the original instructions, reducing the attack success rate (ASR) to near zero across various attack scenarios.

ReDial: Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks

This paper constructs ReDial (1,216 pairs), the first high-quality human-annotated parallel reasoning benchmark for Standard English and African American Vernacular English (AAVE), to systematically evaluate the fairness and robustness of LLMs under dialect inputs. It reveals that almost all mainstream models suffer a significant performance drop of over 10% on AAVE queries.

ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models

This work establishes ELBA-Bench, a comprehensive backdoor attack benchmark covering 12 attack methods, 18 datasets, and 12 LLMs, to systematically evaluate the effectiveness and stealthiness of LLM backdoor attacks under Parameter-Efficient Fine-Tuning (PEFT) and tuning-free paradigms.

Ensemble Watermarks for Large Language Models

Proposes an ensemble watermarking method that combines stylometric features (acrostics + sensorimotor norms) with existing red-green watermarks, achieving a 95% detection rate for the three-feature ensemble after paraphrasing attacks, compared to only 49% for the red-green watermark alone.

Estimating Privacy Leakage of Augmented Contextual Knowledge in Language Models

This paper proposes the context influence metric to quantify the degree of privacy leakage of augmented contextual knowledge in language models during decoding based on a differential privacy framework, and systematically analyzes the effects of model size, context size, generation location, and other factors on privacy leakage.

Exploring Forgetting in Large Language Model Pre-Training

This paper systematically explores catastrophic forgetting during the LLM pre-training phase, introduces new entity-memory-based metrics (\(M_{ex}\), \(M_{in}\)) to replace traditional PPL for detecting forgetting, and validates the effectiveness of a periodic, high-intensity memory replay strategy in mitigating pre-training forgetting.

Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs

Challenges the dominant paradigm of "difference unawareness" in current LLM fairness evaluations, proposes two metrics, DiffAware and CtxtAware, along with a benchmark suite containing 16K questions across 8 scenarios, and demonstrates that models should differentiate group differences in scenarios such as law, culture, and harm evaluation, whereas existing debiasing methods instead impair this necessary difference awareness capability.

Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations

This paper investigates the interaction architecture between LLMs and theorem provers (TPs), and proposes four strategies to mitigate issues such as semantic information loss, syntactic errors, insufficient proof construction, and difficulties in feedback interpretation during autoformalisation. It achieves significant improvements in formalisation accuracy by +18.46%/+34.2%/+39.77% and explanation quality by +29.5%/+51.5%/+41.25% on the e-SNLI, QASC, and WorldTree datasets, respectively.

From Misleading Queries to Accurate Answers: A Three-Stage Fine-Tuning Method for LLMs

Proposes a three-stage fine-tuning method (misleading detection -> query correction -> accurate response) to enhance the capability of LLMs in processing inputs containing misleading information, significantly improving accuracy in misleading detection and QA tasks while reducing hallucination generation.

From Trade-off to Synergy: A Versatile Symbiotic Watermarking Framework for Large Language Models

This paper proposes SymMark, a symbiotic watermarking framework that integrates logits-based and sampling-based watermarking methods (via three strategies: serial, parallel, and hybrid). By adaptively selecting watermarking strategies using token entropy and semantic entropy, it achieves SOTA performance in terms of detectability, robustness, text quality, and security.

How Does Response Length Affect Long-Form Factuality

This paper systematically studies the relationship between LLM response length and factual precision, proposing an efficient two-tier factuality evaluation framework Bafe (which achieves 89.31% agreement with human annotations). It confirms the existence of length bias and proves that "fact exhaustion" is the primary cause of factuality decline by ruling out the error propagation and long context hypotheses.

Improved Unbiased Watermark for Large Language Models

This paper proposes MCmark, a family of multi-channel unbiased watermarking algorithms. By dividing the vocabulary into \(l\) segments and boosting token probabilities within the selected segment to embed statistical signals, MCmark preserves the original output distribution of the LLM while improving detectability by over 10% compared to existing unbiased watermarks.

Ewe: Improving Factuality with Explicit Working Memory

Trigers Ewe (Explicit Working mEmory), which introduces explicit working memory consisting of multiple KV cache units during LLM decoding. It dynamically receives feedback from compiled retrieval knowledge and fact-checking. When errors are detected, Ewe deletes the incorrect sentences and regenerates them using the updated memory. It improves VeriScore F1 by 2–6 points across 4 factual long-form generation benchmarks without sacrificing helpfulness.

Improving Fairness of Large Language Models in Multi-document Summarization

Proposes FairPO (Fair Preference Optimization), which optimizes both summary-level and corpus-level fairness in multi-document summarization through perturbation-based preference pair generation and fairness-aware preference tuning.

Improving Model Factuality with Fine-grained Critique-based Evaluator

A fine-grained factuality evaluator, FenCE, is trained to improve evaluation accuracy by augmenting textual critiques and diverse source documents retrieved through multiple tools on public datasets. FenCE is then leveraged to edit and score generator responses to construct preference training data, improving Llama2-7B/Llama3-8B by 16.86%/14.45% in FActScore, respectively.

Can Indirect Prompt Injection Attacks Be Detected and Removed?

This paper systematically studies the detection and removal of indirect prompt injection attacks: it constructs an evaluation benchmark, discovers that existing detection models perform poorly against indirect attacks while specially trained models can achieve 99% accuracy, proposes two removal methods (segmentation-based and extraction-based), and combines detection and removal into a filtering pipeline to effectively reduce the attack success rate of indirect prompt injection.

Lacuna Inc. at SemEval-2025 Task 4: LoRA-Enhanced Influence-Based Unlearning for LLMs

This paper proposes LIBU (LoRA-enhanced Influence-Based Unlearning), which achieves machine unlearning for LLMs in two phases: Phase 1 utilizes influence function updates weighted by the diagonal Fisher information matrix for precise unlearning, and Phase 2 stabilizes training using the Sophia second-order optimizer. On OLMo-7B in SemEval-2025 Task 4, this method achieves an unlearning rate of 0.283 while maintaining an MMLU accuracy of 0.469.

Language Models Can Subtly Deceive Without Lying: A Case Study on Strategic Phrasing

This work constructs a legislative environment testing platform (LobbyLens) to study whether LLMs can employ strategic phrasing—specifically, coloring expressions without outright lying—to obscure the corporate benefits embedded within bill amendments. The authors find that LLMs optimized via iterative re-planning can boost their deception success rate by up to 40 percentage points.

Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?

This paper presents the first systematic study on the robustness of LLM watermarks in preventing unauthorized knowledge distillation. It proposes three watermark removal attacks (Untargeted/Targeted Paraphrasing and Inference-Time Watermark Neutralization). The study reveals that Targeted Paraphrasing and Watermark Neutralization can thoroughly remove inherited watermarks, with Watermark Neutralization achieving zero extra training overhead while maintaining knowledge transfer efficiency.

Mamba Knockout for Unraveling Factual Information Flow

This work transfers the Attention Knockout interpretability method from Transformers to Mamba-1 and Mamba-2, revealing the factual information flow patterns in SSM models. Key findings show that Mamba and Transformers share a universal pattern where "subject tokens transmit key information to the last token in mid-to-late layers," but differ in architecture-specific aspects such as first-token bias and dependency on relation tokens.

Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models

Proposes MANU, the first modality-aware unlearning framework for MLLMs. It identifies cross-modality entangled knowledge-carrying neurons through four complementary neuron importance functions (absolute, frequency, variance, and RMS), and selectively prunes the top-\(\alpha\%\) neurons to achieve balanced unlearning under both multimodal and text-only inputs, completely training-free without any gradient updates.

MEGen: Generative Backdoor into Large Language Models via Model Editing

MEGen is proposed, a generative backdoor attack method based on model editing. It injects generative backdoors into LLMs by modifying a few local parameters using only a small number of samples, allowing the model to freely output preset dangerous content when triggered.

Merge Hijacking: Backdoor Attacks to Model Merging of Large Language Models

Merge Hijacking is proposed—the first backdoor attack specifically targeting LLM model merging. The attacker only needs to upload a single malicious model. When the victim merges it with any clean model, the resulting merged model inherits the backdoor and maintains both attack effectiveness across all tasks and normal performance, while remaining robust against existing defense methods.

Unveiling Privacy Risks in LLM Agent Memory

This paper systematically investigates the privacy risks of LLM Agent memory modules and proposes MEXTRA, a black-box memory extraction attack. Utilizing carefully designed locator-aligner attack prompts and an automated diverse prompt generation method, the authors successfully extract large volumes of private query histories from both medical and online shopping Agents.

MMUnlearner: Reformulating Multimodal Machine Unlearning in the Era of Multimodal Large Language Models

This paper reformulates the machine unlearning (MU) task in the era of Multimodal Large Language Models (MLLMs)—erasing only the visual patterns associated with specific entities while retaining textual knowledge. It proposes MMUnlearner, a geometry-constrained gradient ascent method that selectively updates parameters using weight saliency maps, comprehensively outperforming baselines like GA and NPO on both MLLMU-Bench and CLEAR benchmarks.

MorphMark: Flexible Adaptive Watermarking for Large Language Models

Through a multi-objective trade-off analysis framework, MorphMark reveals the critical role of the greenlist probability \(P_G\) in the trade-off between watermark effectiveness and text quality. Based on this, it proposes a method to adaptively adjust the watermark strength \(r\)—strengthening the watermark when \(P_G\) is high and weakening it when \(P_G\) is low, thereby simultaneously improving watermark detectability and text quality without relying on additional model training.

Opt-Out: Investigating Entity-Level Unlearning for Large Language Models via Optimal Transport

This paper proposes Opt-Out, an entity-level LLM unlearning method based on optimal transport theory. By utilizing Sliced Wasserstein Distance to regularize parameter shifts, it achieves fine-grained unlearning. Concurrently, the authors construct ELUDe, the first entity-level unlearning dataset (containing 20 target entities, 144 neighbor entities, 15K+ forget QA pairs, and 90K+ retain QA pairs). Opt-Out comprehensively outperforms existing methods on Llama-3.1-8B and Phi-3.5.

PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative Prompts

This paper proposes the PIG framework, which achieves efficient privacy jailbreak attacks on LLMs by identifying PII entity types in privacy queries, constructing privacy in-context demonstrations, and utilizing three gradient-based iterative optimization strategies to update the context. It achieves SOTA performance on both white-box and black-box models.

Private Memorization Editing: Turning Memorization into a Defense to Strengthen Data Privacy in Large Language Models

This paper proposes PME (Private Memorization Editing), which transforms the memorization tendency of LLMs from a security vulnerability into a defense mechanism. By editing the parameters of Feed Forward layers, it removes memorized personally identifiable information (PII), achieving privacy protection without retraining.

Real-time Factuality Assessment from Adversarial Feedback

This paper reveals the “data leakage” issue in existing factuality assessment datasets (where LLMs easily identify old misinformation due to pre-training memorization) and proposes an iterative rewriting pipeline based on adversarial feedback from a RAG detector to generate truly challenging real-time fake news variants, causing a 17.5% absolute drop in ROC-AUC for the GPT-4o RAG detector.

ReLearn: Unlearning via Learning for Large Language Models

ReLearn proposes replacing traditional "reverse optimization" with "forward learning" to achieve knowledge unlearning in LLMs. Through a pipeline of data augmentation and fine-tuning, the model forgets target knowledge while maintaining language generation quality and fluency. A comprehensive evaluation framework involving KFR, KRR, and LS is also designed.

REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space

This paper proposes REVS, a gradient-free model editing method. By locating neurons in the FF2 layer that are most strongly associated with sensitive tokens and projecting them into the vocabulary space, it iteratively lowers the rank of target tokens. On three types of sensitive data (SSN/Email/URL), its Unlearning Score significantly outperforms six baselines (89.58 vs 36.98) with almost zero cost to general capabilities (MMLU 61.05 \(\rightarrow\) 60.87), while remaining highly robust to Logit-Lens and Delta extraction attacks.

Robust Data Watermarking in Language Models by Injecting Fictitious Knowledge

This paper proposes a data watermarking method based on Fictitious Knowledge. By injecting fictitious but plausible entities and their attribute descriptions into the training data, it achieves traceable verification of LLM training data ownership. The watermark is resilient to data preprocessing filters and supports black-box QA verification.

SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models

Proposes SafeRoute, a binary classifier router that adaptively selects between small and large safety guardrail models based on input difficulty. It routes only approximately 5% of "hard" samples to the large model, substantially reducing computational overhead while maintaining safety detection accuracy.

SEUF: Is Unlearning One Expert Enough for Mixture-of-Experts LLMs?

SEUF reveals for the first time that existing LLM unlearning methods fail severely on MoE models (causing over 35% utility drop). The root cause is that the unlearning process leads to expert selection drift in the router, creating a "shortcut" where target experts to be forgotten are bypassed while innocent experts are damaged. To address this, SEUF proposes a framework that locates target experts through expert attribution and stabilizes routing selection with a router anchor loss, updating only 0.06% of parameters to simultaneously improve unlearning quality and model utility.

TIP of the Iceberg: Task-in-Prompt Adversarial Attacks on LLMs

This paper introduces Task-in-Prompt (TIP) attacks—a novel category of jailbreak attacks that indirectly generate harmful content by embedding sequence-to-sequence tasks (such as cipher decoding, riddles, or code execution) in the prompt. The authors construct the PHRYGE benchmark for systematic evaluation, demonstrating that this attack successfully bypasses the safety alignment of six state-of-the-art (SOTA) LLMs, including GPT-4o and LLaMA 3.2.

Towards Context-Robust LLMs: A Gated Representation Fine-tuning Approach

Proposes Grft (Gated Representation Fine-Tuning), a lightweight, plug-and-play gated representation fine-tuning method. With fewer than 200 training samples and only 0.0004% of the model parameters, it enables LLMs to exhibit human-like robust cognitive behavior when encountering contradictory or unhelpful external contexts.

Towards Effective Extraction and Evaluation of Factual Claims

Proposes a standardized framework for evaluating factual claim extraction quality (including metrics like coverage and decontextualization), and develops Claimify—an LLM-based method that handles ambiguity and extracts claims under high confidence, outperforming existing methods within this framework.

Truth Knows No Language: Evaluating Truthfulness Beyond English

The first professionally translated multilingual TruthfulQA benchmark (Basque, Catalan, Galician, Spanish) is constructed, revealing that cross-lingual truthfulness disparities in LLMs are smaller than expected, and that LLM-as-a-Judge aligns better with human judgment than multiple-choice metrics.

The Tug of War Within: Mitigating the Fairness-Privacy Conflicts in Large Language Models

It is discovered that enhancing the privacy awareness of LLMs through SFT significantly degrades their fairness awareness (representing a trade-off). To address this, a training-free method named SPIN (Suppressing Fairness-Privacy Coupled Neurons) is proposed to decouple the two dimensions based on information theory, simultaneously improving fairness by 12.2% and privacy awareness by 14.0% on Qwen2-7B.

UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models

This work proposes the UAlign framework, which leverages two uncertainty estimations—confidence score and semantic entropy—to explicitly model the knowledge boundary of LLMs. By incorporating these estimations as input features into PPO alignment training, the model is guided to answer known questions confidently and refuse unknown ones firmly, significantly improving reliability and generalization across multiple factual QA datasets.

Unveiling and Addressing Pseudo Forgetting in Large Language Models

This work unveils the "pseudo forgetting" phenomenon in LLM continual learning: performance degradation is not due to the loss of old task capabilities, but rather because instructions fail to correctly activate existing capabilities. Attribution analysis demonstrates that the instruction dependence of the forgotten model is decreased, and a dynamic data replay framework, RGD-R, based on Rationale-Guidance Difficulty (RGD), is proposed to alleviate pseudo forgetting.

When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations

This paper investigates LLM backdoor attacks for the first time from the perspective of natural language explanations. It reveals that backdoored models generate logically coherent explanations for clean inputs, but diverse and logically flawed explanations for poisoned inputs. Furthermore, token-level and sentence-level analyses show that the predictive semantics of poisoned samples only emerge in the last few layers, and attention shifts from the input context to newly generated tokens.

Which Retain Set Matters for LLM Unlearning? A Case Study on Entity Unlearning

This paper systematically studies the selection of the retain set in entity unlearning, proposes the Syntactically Similar Neighbor Set, and reveals that syntactic similarity (rather than domain/entity similarity) is the primary driver of knowledge degradation during unlearning. Regularization with a syntactically similar retain set optimally protects all types of neighbor knowledge simultaneously.

ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging

This work achieved second place in SemEval-2025 Task 4 (LLM Sensitive Content Unlearning). The core mechanism is to train two complementary models (one over-forgetting, one under-forgetting) and merge them via TIES-Merging to obtain a balanced unlearning model, achieving a near-perfect MIA score of 0.501 in local experiments.