LLM Safety

💬 ACL2026 · 24 paper notes

Adaptive Text Anonymization: Learning Privacy-Utility Trade-offs via Prompt Optimization: This paper proposes an adaptive text anonymization framework that employs evolutionary prompt optimization to automatically discover task-specific anonymization instructions for LLMs, outperforming manually designed strategies across multiple privacy-utility trade-off scenarios while operating entirely on open-source models.
AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation: AGSC proposes an uncertainty quantification framework for long-text generation that uses NLI neutral probability to trigger adaptive granularity decomposition (reducing inference time by 60%) and employs GMM soft clustering to capture latent semantic topics for topic-aware weighted aggregation, achieving state-of-the-art factuality correlation on the BIO and LongFact benchmarks.
Beyond End-to-End: Dynamic Chain Optimization for Private LLM Adaptation on the Edge: This paper proposes ChainFed, a chain-based federated fine-tuning paradigm that breaks through the memory wall by sequentially training and freezing adapters layer by layer, enabling resource-constrained edge devices to participate in LLM fine-tuning. Combined with three techniques—Dynamic Layer Coordination, Global-aware Parameter Optimization, and Function-Oriented Adaptive Tuning—ChainFed achieves up to 46.46% average accuracy improvement.
Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors: This paper proposes STEEREDIT, a backdoor injection framework that compiles dynamic activation steering into static weight modifications. By extracting a compliance direction and applying null-space constraints, the injected backdoor activates only in the presence of a trigger token. The method achieves high attack success rates on multiple safety-aligned LLMs while preserving safe behavior and general capability in trigger-absent scenarios.
De-Anonymization at Scale via Tournament-Style Attribution: This paper proposes DAS (De-Anonymization at Scale), an LLM-based large-scale authorship de-anonymization method that combines tournament-style elimination, dense retrieval pre-filtering, and multi-round voting aggregation to perform author matching across tens of thousands of candidate texts, revealing the privacy threat that LLMs pose to anonymous platforms such as double-blind peer review.
DUET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode: This paper proposes DUET, a dual-path framework that combines direct code execution with LLM-based pseudocode execution. The two paths are complementary—the former is reliable when generated code is correct but vulnerable to implementation errors, while the latter bypasses implementation details at the cost of potential execution hallucinations. Predictions are merged via functional majority voting, achieving a 13.6 percentage-point improvement in Pass@1 on LiveCodeBench test output prediction.
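The merging step can be illustrated with a minimal sketch. It assumes that "functional" voting means predictions are compared by parsed value rather than by raw string, so `[1,2]` and `[1, 2]` count as the same vote. The summary does not confirm that definition, and the helper names `canonical` and `functional_majority_vote` are hypothetical.

```python
import ast
from collections import Counter


def canonical(prediction: str) -> str:
    """Normalize a predicted output so equal values share one spelling.

    Assumption: predictions are Python literals when parseable; anything
    else is compared as a stripped string.
    """
    text = prediction.strip()
    try:
        return repr(ast.literal_eval(text))
    except (ValueError, SyntaxError, TypeError):
        return text


def functional_majority_vote(predictions: list[str]) -> str:
    """Return the canonical form that receives the most votes."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(canonical(p) for p in predictions)
    # most_common breaks ties in favor of the form seen first.
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    code_path = ["[1,2]", "[1, 2]"]      # hypothetical outputs from executed code
    pseudo_path = ["(1, 2)", "[1, 2 ]"]  # hypothetical outputs from pseudocode execution
    print(functional_majority_vote(code_path + pseudo_path))
```

Pooling both paths into one vote is only one possible design; the paper may weight or combine the paths differently.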
Enhancing Hallucination Detection via Future Context: This paper proposes leveraging sampled "future context" (subsequent sentences) to enhance hallucination detection in black-box settings. By exploiting the "snowball effect"—whereby hallucinations tend to propagate once introduced—the method consistently improves detection performance across multiple sampling-based approaches, including SelfCheckGPT and SC.
FACTS: Table Summarization via Offline Template Generation with Agentic Workflows: This paper proposes FACTS (Fast, Accurate, and Privacy-Compliant Table Summarization), a three-stage agentic workflow that automatically generates reusable offline templates (SQL queries + Jinja2 templates) for query-focused table summarization, achieving state-of-the-art performance across the FeTaQA, QTSumm, and QFMTS benchmarks.
Forget What Matters, Keep the Rest: Selective Unlearning of Informative Tokens: This paper proposes Entropy-guided Token Weighting (ETW), which uses the entropy of the predictive distribution as a proxy for token informativeness. ETW selectively imposes stronger unlearning penalties on informative tokens, enabling effective removal of target knowledge while better preserving general model utility.
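A rough sketch of entropy-based token weighting follows. It assumes that higher predictive entropy marks a more informative token and that weights are normalized to average 1. The summary states neither the direction nor the normalization, so both are illustrative choices, and all function names are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one token's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(logits: list[float]) -> float:
    """Shannon entropy (in nats) of the predictive distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def entropy_weights(per_token_logits: list[list[float]]) -> list[float]:
    """One weight per token, proportional to entropy, with mean 1.

    Assumption: high entropy means "informative"; flip the sign of the
    proxy if the opposite convention is intended.
    """
    ents = [entropy(row) for row in per_token_logits]
    total = sum(ents)
    n = len(ents)
    if total == 0.0:
        return [1.0] * n
    return [n * h / total for h in ents]


def weighted_unlearning_loss(token_losses: list[float], weights: list[float]) -> float:
    """Average of per-token unlearning losses, scaled by the weights."""
    return sum(w * l for w, l in zip(weights, token_losses)) / len(token_losses)


if __name__ == "__main__":
    logits = [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]  # uncertain vs. confident token
    print(entropy_weights(logits))
```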
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages: This paper introduces ICF, the first multi-Indic-language CodecFake detection benchmark, and proposes SATYAM—a hyperbolic audio large language model that aligns semantic and paralinguistic representations via Bhattacharyya distance in hyperbolic space before aligning with a conditioning prompt. With only 3.75M trainable parameters, SATYAM achieves 98.32% detection accuracy.
Jailbreaking Large Language Models with Morality Attacks: This paper constructs a 10.3K morality attack dataset (covering value ambiguity and value conflict scenarios) and manipulates the moral judgment of LLMs via four adversarial strategies. It finds that both LLMs and guardrail models are highly vulnerable to morality attacks, and that larger models are paradoxically easier to compromise.
KoCo: Conditioning Language Model Pre-training on Knowledge Coordinates: This paper proposes Knowledge Coordinate-conditioned pre-training (KoCo), which maps each document to a three-dimensional semantic coordinate (Source, Content, Stability) and injects this as a natural language prefix during pre-training. This endows the model with explicit context-awareness, yielding performance gains across 10 downstream tasks, approximately 30% faster convergence, and effective hallucination mitigation.
Look Twice before You Leap: A Rational Framework for Localized Adversarial Anonymization: This paper proposes the RLAA framework, which addresses the utility collapse problem when transferring adversarial text anonymization to local small models (LSMs). Through an Attacker-Arbitrator-Anonymizer (A-A-A) architecture and a Marginal Rate of Substitution (MRS) rationality constraint, RLAA achieves a superior privacy-utility balance over API-based solutions on local devices, without any training.
Maximizing Local Entropy Where It Matters: Prefix-Aware Localized LLM Unlearning: This paper proposes PALU (Prefix-Aware Localized Unlearning), which achieves localized entropy maximization for unlearning along two dimensions: temporally, unlearning objectives are applied only to sensitive prefix tokens; in the vocabulary dimension, only top-K logits are flattened. This approach enables effective unlearning with minimal parameter perturbation while preserving the model's general capabilities.
MeasHalu: Mitigation of Scientific Measurement Hallucinations for LLMs: This paper proposes MeasHalu, a framework that mitigates hallucinations in LLM-based scientific measurement extraction through a fine-grained measurement hallucination taxonomy and a two-stage optimization pipeline (reasoning-aware SFT + hallucination-targeted GRPO rewards), achieving significant improvements over baselines on MeasEval.
Protecting Bystander Privacy via Selective Hearing in Audio LLMs: This work introduces SH-Bench, the first benchmark for bystander privacy evaluation, and proposes Bystander Privacy Fine-Tuning (BPFT), a method that improves the ability of audio LLMs to focus exclusively on the target speaker and refuse to disclose bystander information in multi-speaker environments. After BPFT, the SE metric surpasses Gemini 2.5 Pro by 16%.
Rethinking LLM Watermark Detection in Black-Box Settings: A Non-Intrusive Third-Party Framework: This paper proposes TTP-Detect, the first black-box third-party watermark verification framework that decouples detection from injection. By leveraging a proxy model to amplify watermark signals and combining three complementary metrics — local consistency, global geometry, and adaptive rank tests — it achieves high-accuracy detection across diverse watermarking schemes without access to secret keys or internal model states.
Synthia: Scalable Grounded Persona Generation from Social Media Data: This paper proposes Synthia, a framework that generates grounded LLM persona narratives from real social media posts (Bluesky), achieving up to 11.6% improvement over the state of the art on social survey alignment while using smaller models, and preserving social network topology to support network-aware analysis.
Topic-Based Watermarks for Large Language Models: This paper proposes TBW, a lightweight topic-based watermarking scheme that clusters the vocabulary into semantically coherent "green lists" via predefined topics (rather than random partitioning), selects the topic list most aligned with the input prompt for logit bias injection, and achieves text quality comparable to unwatermarked outputs while significantly improving robustness against paraphrase and lexical perturbation attacks.
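The two steps named in this summary, choosing a topic green list and biasing logits toward it, can be sketched as below. The toy topic lists, the word-overlap alignment score, and the bias value `delta` are all assumptions made for illustration; the paper's actual clustering and alignment measure are not described here.

```python
# Hypothetical topic "green lists"; a real scheme would partition the full vocabulary.
TOPIC_GREEN_LISTS = {
    "sports": {"goal", "team", "match"},
    "finance": {"bank", "stock", "market"},
}


def select_topic(prompt: str, topic_lists: dict[str, set[str]]) -> str:
    """Pick the topic whose green list overlaps most with the prompt's words.

    Assumption: plain word overlap stands in for semantic alignment.
    Ties go to the topic listed first.
    """
    words = set(prompt.lower().split())
    return max(topic_lists, key=lambda topic: len(words & topic_lists[topic]))


def bias_logits(logits: dict[str, float], green: set[str], delta: float = 2.0) -> dict[str, float]:
    """Add delta to the logit of every token in the selected green list."""
    return {tok: (score + delta if tok in green else score) for tok, score in logits.items()}


if __name__ == "__main__":
    topic = select_topic("The stock market fell today", TOPIC_GREEN_LISTS)
    print(topic, bias_logits({"bank": 1.0, "goal": 1.0}, TOPIC_GREEN_LISTS[topic]))
```

A detector that knows the topic lists could then count how many generated tokens fall in a green list, which is the usual idea behind green-list watermark detection.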
Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement: This paper provides a theoretical analysis of how Multi-Token Prediction (MTP) induces representational contractiveness through gradient coupling mechanisms to promote the emergence of belief states. It simultaneously reveals a "structural hallucination" problem in MTP—namely, illegal shortcuts in the latent space—and proposes the LSE-MTP framework, which anchors predictions to true latent state trajectories via latent consistency loss and semantic anchoring loss. The approach significantly improves path legality and robustness on synthetic graphs and real-world Manhattan taxi navigation tasks.
Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations: This paper identifies two distinct information pathways through which LLMs internally encode truthfulness signals: Question-Anchored (relying on information flow from question to answer) and Answer-Anchored (extracting self-contained evidence from the generated answer itself). Both pathways are closely associated with knowledge boundaries. Building on this finding, the paper proposes two pathway-aware hallucination detection methods—Mixture-of-Probes and Pathway Reweighting—achieving AUC improvements of up to 10%.
Who Gets Which Message? Auditing Demographic Bias in LLM-Generated Targeted Text: This paper presents the first systematic analysis of demographic bias in LLM-generated targeted messages, proposes the Persuasion Bias Index (PBI), and finds that GPT-4o, Llama, and Mistral consistently employ stronger persuasive strategies toward male and younger audiences in climate communication, with contextual prompting systematically amplifying these disparities.
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models: This paper presents the first systematic study of the Incomplete Learning Phenomenon (ILP) in SFT — i.e., the model's inability to correctly reproduce a subset of training samples even after convergence. Five recurring causes are identified (knowledge absence, knowledge conflict, intra-dataset contradiction, left-side forgetting, and insufficient optimization), along with a diagnostic framework and targeted mitigation strategies.
XMark: Reliable Multi-Bit Watermarking for LLM-Generated Texts: This paper proposes XMark, a multi-bit text watermarking method based on the Leave-one-Shard-out (LoSo) strategy and evergreen lists. By taking the intersection of green lists across multiple vocabulary permutations and employing a constrained token-shard mapping matrix, XMark significantly improves decoding accuracy under limited token budgets while preserving text quality.