🔒 LLM Safety
🧠 NeurIPS2025 · 80 paper notes
📌 Same area in other venues: 📷 CVPR2026 (11) · 🔬 ICLR2026 (185) · 💬 ACL2026 (115) · 🤖 AAAI2026 (41) · 📹 ICCV2025 (10)
🔥 Top topics: LLM ×18 · Adversarial Robustness ×17 · Alignment/RLHF ×6 · Reasoning ×6 · Federated Learning ×5
- A Cramér–von Mises Approach to Incentivizing Truthful Data Sharing
-
This paper proposes an incentive mechanism based on the Cramér–von Mises (CvM) two-sample test statistic. Under both Bayesian and prior-free settings, the mechanism provably makes truthful data submission an (approximate) Nash equilibrium while encouraging participants to contribute more genuine data, without relying on strong distributional assumptions (e.g., Gaussian or Bernoulli).
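As a rough illustration of the statistic this mechanism builds on, the sketch below computes the classical two-sample Cramér–von Mises statistic from two empirical CDFs. The function name `cvm_two_sample` is hypothetical, and the summary does not say how the paper turns the statistic into a payment or score, so only the statistic itself is shown.

```python
from bisect import bisect_right


def cvm_two_sample(xs, ys):
    """Two-sample CvM statistic: n*m/(n+m)^2 * sum over pooled points of (F_n(z) - G_m(z))^2."""
    n, m = len(xs), len(ys)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    sx, sy = sorted(xs), sorted(ys)
    total = 0.0
    for z in sx + sy:  # evaluate both empirical CDFs at every pooled point
        f = bisect_right(sx, z) / n  # fraction of xs that are <= z
        g = bisect_right(sy, z) / m  # fraction of ys that are <= z
        total += (f - g) ** 2
    return n * m / (n + m) ** 2 * total


# Identical samples give 0; well-separated samples give a larger value.
same = cvm_two_sample([1, 2, 3], [1, 2, 3])
apart = cvm_two_sample([1, 2], [3, 4])
```

The statistic compares whole empirical distributions rather than fitted parameters, which is consistent with the summary's point that no Gaussian or Bernoulli assumption is needed.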
- A Reliable Cryptographic Framework for Empirical Machine Unlearning Evaluation
-
This paper models the machine unlearning evaluation problem as a cryptographic game (the unlearning sample inference game), quantifies unlearning quality via the adversary's "advantage," and addresses multiple shortcomings of traditional MIA accuracy as an evaluation metric—namely, the lack of a retrain-as-zero baseline, sensitivity to data partitioning, and sensitivity to the choice of MIA. A SWAP test is further proposed as an efficient practical approximation.
- A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs
-
This paper proposes an Adaptive Alpha aggregation strategy that dynamically adjusts reward weights based on each user group's historical alignment performance within a federated RLHF framework, simultaneously achieving high fairness and strong alignment performance for pluralistic preference alignment.
- Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning
-
This paper proposes FedLEASE, which addresses two critical challenges in federated LoRA fine-tuning: (1) automatically determining the optimal number of experts and their assignment via LoRA B-matrix similarity clustering, and (2) enabling adaptive top-M expert selection through an expanded routing space of \(2M-1\) dimensions, allowing each client to determine how many experts to use. FedLEASE achieves an average improvement of 5.53% over the strongest baseline on GLUE.
- Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text
-
This paper proposes Adversarial Paraphrasing — a training-free universal attack framework that selects the most "human-like" token at each decoding step by leveraging feedback signals from AI text detectors during token-by-token paraphrasing. The approach achieves an average T@1%F reduction of 87.88% across 8 detectors and exhibits strong cross-detector transferability.
- AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents
-
This paper proposes AgentDAM, the first benchmark for end-to-end evaluation of data minimization compliance by AI agents in real web environments. It comprises 246 tasks spanning Reddit, GitLab, and Shopping platforms, and finds that leading models such as GPT-4o exhibit privacy leakage rates of 36–46% without mitigation, while a CoT-based privacy prompt reduces leakage rates to 6–8%.
- AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text
-
This paper proposes the AgentStealth framework, which trains a small language model (SLM) through a three-stage pipeline comprising an adversarial anonymization workflow, supervised fine-tuning (SFT), and online reinforcement learning, achieving effective anonymization of user-generated content while preserving text utility — yielding a 12.3% improvement in anonymization performance and 6.8% improvement in utility.
- ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models
-
This paper presents the first defense framework against jailbreak attacks on audio-language models (ALMs). It finds that aligned ALMs possess latent safety shortcuts that can be activated, and proposes a Mel Gradient Sparse Mask (M-GSM) to identify critical frequency bins. Applying Shortcut Activation Perturbations (SAP) to these bins reduces the average attack success rate from 41.6% to 4.6% with negligible degradation of normal task performance.
- Approximate Domain Unlearning for Vision-Language Models
-
This paper introduces Approximate Domain Unlearning (ADU), a novel task that enables pretrained VLMs to selectively forget recognition capabilities for specified domains (e.g., illustrations, sketches) while preserving classification accuracy on other domains (e.g., real photographs). Two modules are proposed — Domain Disentangling Loss (DDL) and Instance-wise Prompt Generator (InstaPG) — achieving substantial improvements over all baselines across four multi-domain datasets.
- Attention! Your Vision Language Model Could Be Maliciously Manipulated
-
This paper proposes the Vision-language Model Manipulation Attack (VMA), an image-based adversarial attack method that combines first- and second-order momentum optimization with a differentiable transformation mechanism, enabling precise control over every output token of a VLM. The approach supports a range of attack scenarios (jailbreaking, hijacking, privacy breach, DoS, sponge examples) and can also be repurposed for copyright-protection watermark injection.
- Attractive Metadata Attack: Inducing LLM Agents to Invoke Malicious Tools
-
AMA (Attractive Metadata Attack) demonstrates that by carefully crafting malicious tool metadata (name, description, parameter schema) alone — without prompt injection or internal model access — an attacker can induce LLM agents to invoke malicious tools and leak private data at a success rate of 81–95%, while barely affecting original task completion (98%+), with existing defenses (auditors, prompt rewriting) proving largely ineffective.
- Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks Against LLMs
-
This paper models adversarial attacks on LLMs as an information channel problem: it defines the "bits leaked per query" \(I(Z;T)\) as the mutual information between the attack target attribute \(T\) and the observable signal \(Z\), and proves that the minimum number of queries required to achieve error \(\varepsilon\) is \(\log(1/\varepsilon)/I(Z;T)\). Validated across 7 LLMs: exposing only answer tokens requires ~1000 queries; adding logits reduces this to ~100; adding chain-of-thought (CoT) further reduces it to tens of queries. This provides the first principled metric for the transparency–security trade-off.
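The query bound can be evaluated directly. The sketch below assumes the logarithm is taken in base 2 so that it matches a leakage rate measured in bits; the function name `min_queries` and the three leakage rates are illustrative assumptions, not values reported in the paper.

```python
import math


def min_queries(epsilon, bits_per_query):
    """Evaluate log2(1/epsilon) / I(Z;T), the query bound stated in the summary."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    if bits_per_query <= 0.0:
        raise ValueError("bits_per_query must be positive")
    return math.log2(1.0 / epsilon) / bits_per_query


# Hypothetical leakage rates (bits per query) for three disclosure levels.
epsilon = 2.0 ** -10
for label, rate in [("answer tokens only", 0.01), ("plus logits", 0.1), ("plus CoT", 1.0)]:
    print(f"{label}: about {min_queries(epsilon, rate):.0f} queries")
```

Because the bound scales as the inverse of the leakage rate, each tenfold increase in bits leaked per query cuts the required number of queries by a factor of ten, which is the shape of the trend the summary describes.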
- Buffer Layers for Test-Time Adaptation
-
This paper proposes Buffer layers as a new paradigm for Test-Time Adaptation (TTA), replacing conventional normalization layer updates to fundamentally preserve the integrity of the pretrained backbone. The approach effectively alleviates catastrophic forgetting and achieves consistent performance improvements across diverse architectures and TTA frameworks.
- Collective Narrative Grounding: Community-Coordinated Data Contributions to Improve Local AI Systems
-
This paper proposes the Collective Narrative Grounding protocol, which collects community narratives through participatory workshops and structures them into "narrative units." A RAG pipeline then injects this local knowledge into LLM-based QA systems. Experiments on LocalBench reveal that 76.7% of errors can be directly remediated by local narratives, and GPT-5 achieves only 21% accuracy on the participatory QA set, highlighting the severity of the local knowledge gap.
- Contextual Integrity in LLMs via Reasoning and Reinforcement Learning
-
This paper proposes CI-RL, a framework that combines Chain-of-Thought reasoning prompts with GRPO reinforcement learning to train LLMs to understand contextual integrity (CI) using only ~700 synthetic samples. On the PrivacyLens benchmark, it reduces privacy leakage rates by up to 40%, and smaller models trained with CI-RL can surpass larger baseline models.
- CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment
-
This paper proposes CoreGuard, which locks Transformer linear layer weights via row permutation and reduces TEE authorization to a single invocation through a column-permutation propagation protocol, protecting foundational capabilities of edge-deployed LLMs against model stealing attacks with negligible computational and communication overhead.
- CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming
-
To address the prevalence of duplicate and near-duplicate problems in competitive programming—which compromises contest fairness and inflates LLM evaluation scores—this work constructs CPRet, a large-scale benchmark spanning four retrieval tasks, and proposes CPRetriever, a domain-specific retrieval model trained with Group-InfoNCE loss. CPRetriever surpasses 20+ existing embedding models across all tasks and reveals systematic evaluation bias in LiveCodeBench attributable to problem similarity.
- CryptoMoE: Privacy-Preserving and Scalable Mixture of Experts Inference via Balanced Expert Routing
-
CryptoMoE is the first framework supporting privacy-preserving inference for MoE-based LLMs. By combining balanced expert routing to conceal routing information, a confidence-aware dispatch protocol, and a batch ciphertext matrix multiplication protocol, it achieves 2.8–3.5× latency reduction and 2.9–4.3× communication reduction compared to a dense baseline, with only 0.8% accuracy loss.
- DeepPersona: A Generative Engine for Scaling Deep Synthetic Personas
-
This paper presents DeepPersona, a two-stage taxonomy-guided synthetic persona generation engine. Stage 1 mines a human attribute taxonomy with 8,000+ nodes from real user–ChatGPT conversations; Stage 2 generates narratively coherent personas averaging 200+ structured attributes via progressive attribute sampling. The approach achieves an 11.6% improvement in personalized QA accuracy and a 31.7% reduction in social survey simulation bias.
- Demystifying Language Model Forgetting with Low-Rank Example Associations
-
This paper finds that, after LLM fine-tuning, the association matrix between forgotten upstream examples and newly learned tasks exhibits a low-rank structure (a rank-3 approximation achieves \(R^2 > 0.69\)). It leverages matrix completion to predict the forgetting induced by unseen tasks, thereby guiding selective replay to mitigate forgetting.
- Differentially Private Federated Low Rank Adaptation Beyond Fixed-Matrix
-
This paper proposes FedASK, a framework that employs a two-stage sketching pipeline (inspired by randomized SVD) and is the first to enable simultaneous, effective updates of both low-rank matrices A and B in federated LoRA under differential privacy. It achieves up to 11.5% improvement on MMLU and 46% on GSM8K over baselines on Llama-2 7B/13B.
- Distillation Robustifies Unlearning
-
This paper reveals the core finding that "distillation can robustify unlearning" — distilling an unlearned model into a randomly initialized student network effectively discards latent capabilities. Building on this insight, the paper proposes UNDO (Unlearn-Noise-Distill-on-Outputs), which applies weight perturbation to the unlearned model prior to distillation, establishing a tunable compute–robustness trade-off that approaches the gold standard of retraining from scratch on both synthetic tasks and the WMDP benchmark.
- Distributive Fairness in Large Language Models: Evaluating Alignment with Human Values
-
This paper systematically evaluates the distributive fairness preferences of several SOTA LLMs (GPT-4o, Claude-3.5S, Llama3-70b, Gemini-1.5P) on non-strategic resource allocation tasks. The results reveal significant divergence between LLMs and humans: LLMs favor efficiency and envy-freeness (EF) while neglecting equality (EQ), which humans prioritize. However, in multiple-choice settings, GPT-4o and Claude can correctly identify the fairest allocation.
- DNA-DetectLLM: Unveiling AI-Generated Text via a DNA-Inspired Mutation-Repair Paradigm
-
This paper proposes DNA-DetectLLM, a zero-shot AI-generated text detection method inspired by the DNA mutation-repair mechanism. It constructs an ideal AI sequence and quantifies the cumulative difficulty of repairing the input text toward that sequence as the detection signal, achieving state-of-the-art results with a relative AUROC improvement of 5.55% and F1 improvement of 2.08% across multiple benchmark datasets.
- DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning
-
DRAGON proposes a systematic LLM unlearning framework that requires no fine-tuning of the base model. It employs a two-layer detection module to identify prompts subject to unlearning, then uses a specially fine-tuned guard model to generate CoT reasoning instructions for in-context intervention, effectively removing private or harmful knowledge while preserving the model's general capabilities.
- DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents
-
DRIFT is a system-level agent security framework featuring three layers of defense: Secure Planner (pre-planned function trajectories and parameter checklists), Dynamic Validator (dynamic policy updates based on Read/Write/Execute permissions), and Injection Isolator (detection and masking of injected instructions from the memory stream). On AgentDojo, DRIFT reduces ASR from 30.7% to 1.3% while achieving 20.1% higher utility than CaMeL.
- Enhancing CLIP Robustness via Cross-Modality Alignment
-
This paper proposes COLA, a training-free framework that eliminates non-semantic noise by projecting adversarially perturbed image features onto the subspace spanned by text features, and then employs optimal transport (OT) to perform fine-grained distribution-level image-text alignment. COLA achieves an average improvement of 6.7% in adversarial robust accuracy across 14 zero-shot classification benchmarks while preserving clean sample performance.
- Enhancing Sample Selection Against Label Noise by Cutting Mislabeled Easy Examples
-
This paper identifies and defines Mislabeled Easy Examples (MEEs)—samples whose incorrect labels are confidently learned by the model in the early stages of training—and demonstrates that these samples cause the greatest harm to generalization. An Early Cutting method is proposed to filter MEEs by recalibrating the early-stage confident subset using the model's later-stage state.
- Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions
-
This paper systematically evaluates the hiring-match performance of mainstream LLMs—including GPT-4o/4.1, Claude 3.5, Gemini 2.5, Llama 3.1/4, and DeepSeek R1—on approximately 10,000 real-world candidate–job pairs. Results show that a domain-specialized model (Match Score) comprehensively outperforms general-purpose LLMs in both accuracy (AUC 0.85 vs. 0.77) and fairness (Race IR 0.957 vs. ≤0.809).
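For readers unfamiliar with the fairness figure, the sketch below assumes "IR" denotes the impact ratio in its common form: the lowest group selection rate divided by the highest. The function name and the selection rates are hypothetical and are not taken from the paper.

```python
def impact_ratio(selection_rates):
    """Lowest group selection rate divided by the highest; 1.0 means parity."""
    rates = list(selection_rates.values())
    if not rates or max(rates) <= 0.0:
        raise ValueError("need at least one group with a positive selection rate")
    return min(rates) / max(rates)


# Hypothetical selection rates per group, for illustration only.
example = impact_ratio({"group_a": 0.50, "group_b": 0.40, "group_c": 0.45})
```

Under this definition a value closer to 1 indicates more even selection across groups, which is how the 0.957 versus ≤0.809 comparison is read here.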
- Evaluation of Vision-LLMs in Surveillance Video
-
This paper proposes a training-free two-stage framework that leverages small Vision-LLMs to generate textual descriptions of video content, followed by an NLI classifier for zero-shot scoring. It systematically evaluates the impact of prompting strategies and privacy-preserving filters on anomalous behavior recognition in surveillance videos.
- Exploring the Limits of Strong Membership Inference Attacks on Large Language Models
-
This work presents the first extension of strong membership inference attacks (LiRA) to GPT-2-scale LLMs ranging from 10M to 1B parameters, training over 4,000 reference models. Its four key findings include that strong MIAs can succeed on LLMs but with limited effectiveness (AUC < 0.7), and that a substantial fraction of per-sample decisions are indistinguishable from random coin flips under training randomness.
- FALCON: Fine-grained Activation Manipulation by Contrastive Orthogonal Unalignment for Large Language Model
-
This paper proposes FALCON, a representation-guided LLM unlearning framework that employs mutual information for parameter selection, a contrastive mechanism for fine-grained knowledge separation, and gradient orthogonal projection to resolve forgetting–retention conflicts. FALCON consistently outperforms existing methods on harmful knowledge, copyright, and entity unlearning benchmarks.
- FedRW: Efficient Privacy-Preserving Data Reweighting for Enhancing Federated Learning of Language Models
-
FedRW proposes the first privacy-preserving soft deduplication framework for federated learning that requires no trusted third party. By leveraging secure multi-party computation to obtain global sample frequencies and performing frequency-aware sample reweighting, it achieves up to 28.78× preprocessing speedup and approximately 11.42% improvement in perplexity over prior methods.
- FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA
-
FedSVD proposes globally reparameterizing LoRA matrices via SVD, updating the \(A\) matrix each communication round using the right singular vectors of the aggregated \(BA\) product. This approach avoids the quadratic noise amplification under DP-SGD while preserving the adaptive capacity of \(A\), consistently outperforming fixed-\(A\) baselines across multiple NLU benchmarks.
- Finding Structure in Continual Learning
-
This paper proposes a continual learning optimization framework based on Douglas-Rachford Splitting (DRS), which decouples stability and plasticity into two independent proximal subproblems, and replaces KL divergence with Rényi divergence for more robust prior alignment, thereby effectively alleviating catastrophic forgetting without replay buffers or additional modules.
- Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation
-
Geo-Sign projects skeleton features into a Poincaré ball model of hyperbolic space and regularizes an mT5 language model via a hyperbolic contrastive loss, enabling the model to perceive the hierarchical structure of sign language motion. Using only skeleton data, the method surpasses RGB-based SOTA on CSL-Daily (BLEU-4 +1.81, ROUGE-L +3.03).
- HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning
-
This paper proposes HoloLLM, the first framework to integrate rare sensing modalities — including LiDAR, infrared, mmWave radar, and WiFi — into a multimodal large language model (MLLM). Through a Universal Modality-Injection Projector (UMIP), HoloLLM achieves efficient alignment between sensing modalities and text under data-scarce conditions, improving human action QA and captioning by approximately 30% over existing MLLMs.
- ImageSentinel: Protecting Visual Datasets from Unauthorized Retrieval-Augmented Image Generation
-
This paper proposes the ImageSentinel framework, which synthesizes sentinel images that are visually consistent with a private dataset and binds them to randomly generated character retrieval keys, enabling reliable detection of unauthorized use of private datasets by retrieval-augmented image generation (RAIG) systems—achieving near-100% AUC with only 3–10 queries.
- InvisibleInk: High-Utility and Low-Cost Text Generation with Differential Privacy
-
This paper proposes InvisibleInk, a framework that reduces the computational cost of differentially private long-text generation by more than 8× through two innovations—differential clipping (DClip) for isolating sensitive information and Top-\(k^+\) truncated sampling—achieving, for the first time, high-quality private text generation with only 4–8× overhead over non-private generation.
- Learning to Watermark: A Selective Watermarking Framework for Large Language Models via Multi-Objective Optimization
-
This paper proposes LTW (Learning to Watermark), a framework that employs a lightweight selector network to adaptively determine when to apply watermarks based on sentence embeddings, token entropy, and the current watermarking ratio. By leveraging multi-objective optimization via MGDA, LTW achieves a Pareto-optimal balance between detectability and text quality, substantially improving watermarked text quality without compromising detection performance.
- LLM Strategic Reasoning: Agentic Study through Behavioral Game Theory
-
This paper proposes an LLM strategic reasoning evaluation framework grounded in behavioral game theory. It employs Truncated Quantal Response Equilibrium (TQRE) to quantify reasoning depth τ, evaluates 22 state-of-the-art models across 13 matrix games, and reveals differences in reasoning styles as well as biases induced by demographic personas.
- MaskSQL: Safeguarding Privacy for LLM-Based Text-to-SQL via Abstraction
-
This paper proposes MaskSQL, a framework that protects privacy by replacing sensitive table names, column names, and data values with abstract symbols before sending prompts to a remote LLM. Combined with a local SLM for schema linking and SQL reconstruction, MaskSQL preserves privacy while surpassing SLM-only approaches in SQL generation accuracy.
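The abstraction step can be pictured as a reversible substitution. The sketch below is a minimal illustration under stated assumptions: the `SYM_i` symbol format, the `mask` and `unmask` helpers, and the example schema are all hypothetical, and the local SLM that performs schema linking in MaskSQL is not modeled.

```python
import re


def mask(text, sensitive_terms):
    """Replace each sensitive term with an abstract symbol; return masked text and the mapping."""
    mapping = {}
    masked = text
    # Longer terms first, so a short term never clobbers part of a longer one.
    for i, term in enumerate(sorted(sensitive_terms, key=len, reverse=True)):
        symbol = f"SYM_{i}"
        mapping[symbol] = term
        masked = re.sub(rf"\b{re.escape(term)}\b", symbol, masked)
    return masked, mapping


def unmask(text, mapping):
    """Restore the original terms in text returned by the remote model."""
    for symbol, term in mapping.items():
        # A function replacement avoids backslash interpretation in `term`.
        text = re.sub(rf"\b{re.escape(symbol)}\b", lambda _m, t=term: t, text)
    return text


question = "List the salary of employees named Alice"
masked_question, mapping = mask(question, ["salary", "employees", "Alice"])
inverse = {term: symbol for symbol, term in mapping.items()}
# Stand-in for the SQL a remote LLM might return over the abstract symbols.
remote_sql = (
    f"SELECT {inverse['salary']} FROM {inverse['employees']} "
    f"WHERE name = '{inverse['Alice']}'"
)
restored_sql = unmask(remote_sql, mapping)
```

Only `masked_question` would leave the local machine in this sketch; the mapping stays local and is applied to the returned SQL.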
- MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
-
This paper proposes MixAT, a method that combines discrete adversarial attacks (PAP-based rewriting) with continuous embedding-space perturbations for LLM adversarial training. MixAT achieves robustness against diverse attacks (reducing ALO-ASR from 50%+ to below 20%) while preserving utility, at a training cost comparable to purely continuous methods.
- ModHiFi: Identifying High Fidelity Predictive Components for Model Modification
-
This paper proposes the Subset Fidelity metric and the ModHiFi framework. Through theoretical analysis, it proves that local reconstruction error linearly upper-bounds global prediction error for Lipschitz continuous networks. Without requiring training data, loss functions, or gradients—using only synthetic data—the framework identifies high-fidelity (HiFi) components within a model, and unifies the tasks of structured pruning and class unlearning under a single formulation.
- MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference
-
This paper proposes MPCache, a KV cache eviction framework designed for secure multi-party computation (MPC), combining one-time static eviction with query-aware dynamic selection. Through hierarchical clustering, linearized similarity approximation, and cross-layer index sharing, MPCache achieves up to 2.01× latency reduction and 8.37× communication volume reduction without sacrificing LLM performance.
- Music Arena: Live Evaluation for Text-to-Music
-
Music Arena is the first online live evaluation platform for text-to-music (TTM) generation. It addresses the heterogeneous signature problem of TTM systems via an LLM-driven moderation and routing system, collects multi-level preference data including fine-grained listening behavior and natural language feedback, and provides the community with a sustainable open preference data source through monthly rolling data releases.
- On Optimal Steering to Achieve Exact Fairness
-
This paper defines the concept of an ideal distribution—a data distribution under which the Bayes-optimal classifier for any cost-sensitive risk satisfies exact fairness—and proposes an optimization framework that identifies the nearest ideal distribution via KL divergence minimization, providing provable fairness guarantees for both fair preprocessing and LLM representation steering.
- On the Empirical Power of Goodness-of-Fit Tests in Watermark Detection
-
This paper systematically evaluates eight classical goodness-of-fit (GoF) tests for LLM text watermark detection, demonstrating that GoF tests significantly outperform existing baseline methods in both detection power and robustness.
- On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks
-
This paper presents the first systematic study on the robustness of LLM verbal confidence under adversarial attacks. It proposes a Verbal Confidence Attack (VCA) framework comprising perturbation-based and jailbreak-based attacks, demonstrating that such attacks can reduce confidence scores by up to 30% and cause answer-flip rates of up to 100%, and that existing defense strategies are largely ineffective.
- On the Sample Complexity of Differentially Private Policy Optimization
-
This paper presents the first systematic study of sample complexity for policy optimization (PO) under differential privacy (DP) constraints. It proposes a unified meta-algorithm framework and analyzes three private policy optimization algorithms—DP-PG, DP-NPG, and DP-REBEL—proving that the privacy cost typically appears only as a lower-order term in the sample complexity.
- One Token Embedding Is Enough to Deadlock Your Large Reasoning Model
-
This paper proposes the Deadlock Attack, which optimizes a single adversarial token embedding and implants it into a Large Reasoning Model (LRM) via a backdoor mechanism, causing the model to enter a permanent reasoning loop during inference (endlessly generating transition words such as "Wait" and "But"). The attack achieves a 100% attack success rate across 4 LRMs and 3 mathematical reasoning benchmarks, with negligible performance degradation on clean inputs.
- ORBIT -- Open Recommendation Benchmark for Reproducible Research with Hidden Tests
-
This paper proposes ORBIT, a unified benchmark for recommender systems comprising standardized evaluation on 5 public datasets and a privacy-safe hidden test set, ClueWeb-Reco, constructed from real users' browsing histories. The benchmark systematically evaluates 12 recommendation models and introduces the LLM-QueryGen baseline, revealing the limitations of existing approaches in large-scale, real-world recommendation scenarios.
- Poly-Guard: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset
-
This paper introduces Poly-Guard, the first large-scale, multi-domain, policy-grounded safety guardrail benchmark. It extracts 400+ risk categories and 1,000+ safety rules from 150+ real-world industry safety policies, generates 100K+ instances spanning 8 safety-critical domains, and systematically evaluates 19 guardrail models, revealing 8 key findings including domain specialization, evolutionary forgetting, scaling stagnation, and adversarial vulnerability.
- Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma
-
This paper formalizes the recurring safety–fairness–efficiency tensions in RLHF as an "alignment trilemma": it proves that no RLHF system can simultaneously satisfy \(\varepsilon\)-representativeness (faithfully reflecting diverse values), polynomial tractability (computational feasibility), and \(\delta\)-robustness (resistance to adversarial attacks), thereby providing a unified complexity-theoretic explanation for pathological phenomena such as preference collapse and sycophancy observed in current RLHF systems.
- Probabilistic Reasoning with LLMs for K-Anonymity Estimation
-
This paper proposes Branch, a framework that leverages large language models to model personal information disclosed in user-generated text as a joint probability distribution over a Bayesian network. By estimating conditional probabilities for individual attributes and composing them to compute k-anonymity values (i.e., the number of individuals globally matching a given profile), Branch achieves 73% accuracy on privacy risk estimation, outperforming o3-mini chain-of-thought reasoning by 13%.
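The composition step amounts to the chain rule over a Bayesian network followed by scaling to a population. The sketch below assumes the conditional probabilities have already been estimated (in Branch, by an LLM); the function name, the population figure, and the probabilities are hypothetical.

```python
def k_anonymity(population, conditional_probs):
    """Expected number of matching individuals: population * product of P(attribute | parents)."""
    k = float(population)
    for p in conditional_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
        k *= p
    return k


# Hypothetical profile: three disclosed attributes, each conditioned on its parents.
k = k_anonymity(8_000_000_000, [0.5, 0.01, 0.001])
```

A smaller k means fewer people match the disclosed profile, and therefore a higher re-identification risk for the author of the text.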
- Procurement Auctions with Predictions: Improved Frugality for Facility Location
-
This paper studies procurement auction design for the strategic uncapacitated facility location problem. It proves that the frugality ratio of the classical VCG auction is exactly 3 (improving the previously known upper bound of 4), and designs learning-augmented auction mechanisms that exploit prediction information to achieve near-optimal frugality when predictions are accurate, while maintaining constant-factor robustness when predictions are arbitrarily inaccurate.
- PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning
-
This paper proposes the PULSE evaluation protocol, which assesses existing unlearning methods for large multimodal models (LMMs) along two practically motivated dimensions: the forgetting of pretrained knowledge and the sustainability of repeated sequential unlearning. The findings reveal severe deficiencies in current methods—forgetting pretrained knowledge causes over 90% loss of general capability, and after five sequential unlearning operations, model generalization nearly collapses entirely.
- Reinforcement Learning with Backtracking Feedback
-
This paper proposes RLBF, a reinforcement learning framework with backtracking feedback that allows agents to return to previous states and re-explore when encountering dead ends. By leveraging backtracking signals to improve credit assignment, RLBF significantly enhances exploration efficiency in sparse-reward environments.
- ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search
-
ReliabilityRAG proposes a RAG framework that leverages document reliability signals (e.g., search ranking) for adversarial defense. It identifies a consistent document subset by finding the Maximum Independent Set (MIS) on a contradiction graph while prioritizing high-reliability documents, providing provable robustness guarantees alongside high accuracy on benign scenarios and long-form generation tasks.
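To make the selection step concrete, the sketch below brute-forces a maximum independent set on a small contradiction graph. It assumes documents are indexed by search rank (0 is most reliable) and that reliability is used only to break ties between equally large sets; both are simplifications for illustration, and the exhaustive search is practical only for a handful of documents.

```python
from itertools import combinations


def consistent_subset(num_docs, contradictions):
    """Largest set of mutually non-contradicting documents, preferring lower (better) ranks on ties."""
    conflict = {frozenset(edge) for edge in contradictions}
    # Try the largest sizes first; combinations() yields index tuples in
    # lexicographic order, so the first hit favors the best-ranked documents.
    for size in range(num_docs, 0, -1):
        for subset in combinations(range(num_docs), size):
            if all(frozenset(pair) not in conflict for pair in combinations(subset, 2)):
                return list(subset)
    return []


# Document 3 contradicts all others, so it is the one left out.
kept = consistent_subset(4, [(0, 3), (1, 3), (2, 3)])
```

In this example a single injected document cannot displace the three mutually consistent ones, which is the intuition behind selecting the largest contradiction-free subset.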
- Reverse Engineering Human Preferences with Reinforcement Learning
-
A reinforcement learning-trained preamble generator is used to inflate the evaluation scores of downstream LLMs, exposing critical vulnerabilities in the LLM-as-a-Judge evaluation framework. The attack is nearly undetectable and demonstrates cross-model transferability.
- Robust or Suggestible? Exploring Non-Clinical Induction in LLM Drug-Safety Decisions
-
Through a persona-based evaluation framework, this paper finds that ChatGPT-4o and Bio-Medical-Llama-3-8B are systematically influenced by clinically irrelevant sociodemographic attributes (education, insurance, housing, etc.) in adverse drug event prediction, exhibiting both explicit and implicit bias patterns.
- SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders
-
This paper proposes SAEMark, a framework that leverages sparse autoencoders (SAEs) to extract Feature Concentration Scores (FCS) from text, and embeds multi-bit watermarks via inference-time feature-guided rejection sampling. The approach requires no modification to model weights or logits, natively supports black-box APIs, multilingual text, and code, and achieves state-of-the-art watermark detectability and text quality across English, Chinese, and code domains.
- Securing the Language of Life: Inheritable Watermarks from DNA Language Models to Proteins
-
This paper proposes DNAMark and CentralMark, two watermarking schemes for embedding robust watermarks in sequences generated by DNA language models. DNAMark achieves function-preserving watermarks via synonymous codon substitution, while CentralMark realizes inheritable watermarks that propagate from DNA to protein through the central dogma.
- Self-Refining Language Model Anonymizers via Adversarial Distillation
-
This paper proposes SEAL, a framework that distills GPT-4-level text anonymization capabilities into an 8B model via adversarial distillation, combining SFT + DPO training with a self-refinement mechanism. The resulting small model achieves privacy–utility trade-offs on par with or superior to GPT-4-based anonymizers while enabling fully local deployment.
- Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning
-
This paper identifies that reference model bias in NPO (Negative Preference Optimization) causes optimization power to be allocated unevenly across forget data and causes gradient weight smoothing to fail in the early stage of training. The proposed SimNPO eliminates reference model dependency and adopts length-normalized rewards, improving FQ from 0.79 to 0.99 on TOFU and consistently outperforming NPO across all benchmarks.
- SIMU: Selective Influence Machine Unlearning
-
SIMU proposes a two-stage framework: it first identifies critical MLP neurons encoding forget-set information via gradient aggregation, then applies second-order (Sophia) optimization exclusively to those neurons, achieving effective unlearning while substantially preserving the model's original capabilities.
- Steering When Necessary: Flexible Steering Large Language Models with Backtracking
-
This paper proposes FASB (Flexible Activation Steering with Backtracking), a framework that dynamically determines the necessity and intensity of intervention by tracking the internal states of an LLM during generation, and introduces a backtracking mechanism to correct already-deviated tokens. FASB achieves a True*Info score of 80.56% on TruthfulQA and an average accuracy of 78.8% across six multiple-choice tasks, significantly outperforming all baselines.
- Stop DDoS Attacking the Research Community with AI-Generated Survey Papers
-
This position paper analogizes the proliferation of AI-generated survey papers to a "Distributed Denial-of-Service (DDoS) attack" on the academic community. Through systematic quantitative analysis of 10,063 CS survey papers on arXiv from 2020 to 2024, the paper documents synchronized post-ChatGPT surges in survey volume, AI-generation scores, and anomalous author counts. It diagnoses four major quality deficiencies in AI-generated surveys (disorganized structure, unoriginal taxonomies, inaccurate citations, and highly redundant content), analyzes cultural repercussions for the researcher–reviewer–editor triad, and proposes a comprehensive response framework encompassing transparency requirements, rigorous review standards, redundancy restrictions, AI-detection assistance, and a "Dynamic Live Survey" platform.
- SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications
-
This paper proposes BIRD-CRITIC (the first SQL debugging benchmark) and the Six-Gym training environment, and develops the Bird-Fixer agent. Through the f-Plan Boosting strategy, it elevates the SQL debugging capability of a 14B open-source model to surpass Claude-3.7-Sonnet and GPT-4.1, achieving efficient SQL issue resolution while preserving data privacy.
- ToxicTextCLIP: Text-Based Poisoning and Backdoor Attacks on CLIP Pre-training
-
This paper proposes ToxicTextCLIP, a framework that generates high-quality adversarial texts during CLIP pre-training via two modules—Background-aware Target Text Selector and Background-driven Poisoned Text Augmenter—achieving up to 95.83% attack success rate and 98.68% backdoor Hit@1, while successfully bypassing three defenses: RoCLIP, CleanCLIP, and SafeCLIP.
- Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties
-
This paper proposes the Trans-EnV framework, which combines expert linguistic knowledge with the transformation capabilities of LLMs to automatically convert Standard American English (SAE) datasets into 38 English varieties (18 dialects + 20 ESL Englishes), revealing performance degradations of up to 46.3% on non-standard English and highlighting critical linguistic fairness concerns.
- TRAP: Targeted Redirecting of Agentic Preferences
-
TRAP introduces a diffusion-based adversarial framework for semantic injection that optimizes image semantics in the CLIP embedding space. Under black-box conditions, it systematically misdirects the decision preferences of multiple mainstream VLM agents while keeping the images visually natural, achieving attack success rates of up to 100% across six models including LLaVA-34B and GPT-4o.
- TRUST -- Transformer-Driven U-Net for Sparse Target Recovery
-
This paper proposes the TRUST architecture, which integrates the Transformer attention mechanism with a U-Net decoder to jointly learn the sensing operator and reconstruct sparse signals under unknown sensing matrices, achieving significant improvements over conventional methods in SSIM and PSNR.
- Unlearned but Not Forgotten: Data Extraction after Exact Unlearning in LLM
-
This paper reveals that even exact unlearning (retraining from scratch to remove data influence) is susceptible to privacy leakage. By exploiting the divergence between model checkpoints before and after unlearning, an adversary can apply reversed model guidance with token filtering to substantially improve extraction success rates for deleted data—in some settings doubling the extraction rate.
- Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery
-
This paper proposes reframing machine unlearning as an epistemological probe ("unlearning as ablation"): by systematically removing a target piece of knowledge along with its unlearning closure, and then testing whether a model can re-derive it from axioms, the framework provides a falsifiable test to distinguish whether LLMs genuinely generate new knowledge or merely retrieve memorized fragments.
- Virus Infection Attack on LLMs: Your Poisoning Can Spread "VIA" Synthetic Data
-
This paper presents the first systematic study of security risks introduced by synthetic data in LLM training. It reveals that existing poisoning and backdoor attacks rarely propagate through synthetic data, and proposes the Virus Infection Attack (VIA) framework. VIA embeds poisoned content into normal training samples via hijacking point search and shell construction, enabling malicious content to be generated by the model even on clean queries and subsequently propagated to downstream models.
- VMDT: Decoding the Trustworthiness of Video Foundation Models
-
This paper introduces VMDT (Video-Modal DecodingTrust), the first unified benchmark platform for evaluating the trustworthiness of T2V and V2T video foundation models across five dimensions—safety, hallucination, fairness, privacy, and adversarial robustness—covering large-scale assessments of 7 T2V and 19 V2T models, and revealing the complex relationship between model scale and trustworthiness.
- Watermarking Autoregressive Image Generation
-
This paper is the first to adapt LLM watermarking (KGW green/red scheme) to the token level of autoregressive image generation models. It identifies and addresses the key challenge of insufficient Reverse Cycle Consistency (RCC) through tokenizer–detokenizer fine-tuning and a watermark synchronization layer, achieving robust image watermark detection with theoretical guarantees.
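The KGW green/red scheme that the paper adapts can be sketched in a few lines. The sketch below is a toy under stated assumptions, not the paper's pipeline: the "model" is a uniform sampler, the codebook size `VOCAB_SIZE`, the key string, and the helper names are hypothetical, and it uses the hard variant (red tokens are never emitted) rather than a soft logit bias. It shows how the previous token seeds a green/red split and how detection reduces to a z-score on the green-token count.

```python
import hashlib
import math
import random

VOCAB_SIZE = 1024  # hypothetical image-token codebook size
GAMMA = 0.5        # expected fraction of tokens that fall in the green list


def is_green(prev_token: int, token: int, key: str = "demo-key") -> bool:
    """Pseudo-randomly place `token` in the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return digest[0] / 256.0 < GAMMA


def sample_watermarked(length: int, seed: int = 0) -> list[int]:
    """Toy generator: draw uniform tokens and keep only green ones (hard watermark)."""
    rng = random.Random(seed)
    tokens = [rng.randrange(VOCAB_SIZE)]
    while len(tokens) < length:
        candidate = rng.randrange(VOCAB_SIZE)
        if is_green(tokens[-1], candidate):
            tokens.append(candidate)
    return tokens


def z_score(tokens: list[int]) -> float:
    """One-proportion z-test: are there more green tokens than chance predicts?"""
    scored = len(tokens) - 1  # the first token has no predecessor to seed the split
    green = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    return (green - GAMMA * scored) / math.sqrt(scored * GAMMA * (1 - GAMMA))


watermarked = sample_watermarked(201)
z = z_score(watermarked)  # every scored token is green, so z = 100 / sqrt(50)
```

Detection only works if the detector sees the same token sequence that was generated. For images that means re-tokenizing the decoded pixels must recover the original tokens, which is the reverse cycle consistency problem the summary refers to.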
- When AI Democratizes Exploitation: LLM-Assisted Strategic Manipulation of Fair Division Algorithms
-
This paper empirically demonstrates that LLMs can reduce algorithm manipulation in fair division, which previously required deep expertise in mechanism design, to a simple natural-language conversation available to any user. It designs four coordination scenarios on the Spliddit fair rent platform (exclusionary collusion, defensive counter-attack, benevolent collusion, and cost-minimization coalition), and the results overturn the traditional assumption that "algorithmic complexity serves as a security barrier."
- Zero-Shot Robustness of Vision Language Models Via Confidence-Aware Weighting
-
This paper proposes CAW (Confidence-Aware Weighting), an adversarial fine-tuning loss function for CLIP that focuses on hard adversarial examples via confidence-aware weighting, combined with feature alignment regularization to preserve pre-trained semantic knowledge. CAW achieves state-of-the-art zero-shot robustness under AutoAttack with lower memory overhead.