🔒 LLM Safety
💬 ACL2026 · 115 paper notes
📌 Same area in other venues: 📷 CVPR2026 (11) · 🔬 ICLR2026 (185) · 🤖 AAAI2026 (41) · 🧠 NeurIPS2025 (80) · 📹 ICCV2025 (10)
🔥 Top topics: LLM ×42 · Adversarial Robustness ×22 · Watermarking ×10 · Multimodal/VLM ×10 · Reasoning ×9
- STELA: A Linguistics-Aware LLM Watermarking via Syntactic Predictability
-
STELA uses "linguistic indeterminacy" \(\lambda(c_t)\) estimated from POS n-grams as a modulation signal for watermark strength. It weakens the watermark at positions with high syntactic constraints (preserving quality) and strengthens it at syntactically free positions (improving detectability). Similar to KGW, STELA remains publicly verifiable using only a POS tagger, without requiring access to model logits.
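As a rough illustration of the modulation idea, the sketch below estimates an indeterminacy score from POS bigram counts and scales a KGW-style logit bias by it. Using normalized next-tag entropy for \(\lambda(c_t)\), the linear scaling, and every name here (`build_bigram_counts`, `indeterminacy`, `watermark_delta`, `delta_max`) are assumptions made for illustration, not STELA's actual estimator.

```python
import math
from collections import Counter, defaultdict


def build_bigram_counts(tag_sequences):
    """Count, for each POS tag, which tags follow it in the tagged corpus."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        for prev_tag, next_tag in zip(tags, tags[1:]):
            counts[prev_tag][next_tag] += 1
    return counts


def indeterminacy(counts, prev_tag, num_tags):
    """Normalized entropy (0 to 1) of the next-tag distribution after prev_tag."""
    following = counts.get(prev_tag)
    if not following or num_tags < 2:
        return 0.0
    total = sum(following.values())
    entropy = 0.0
    for count in following.values():
        p = count / total
        entropy -= p * math.log(p)
    return entropy / math.log(num_tags)


def watermark_delta(lam, delta_max=2.0):
    """Scale the logit bias linearly with indeterminacy (assumed form)."""
    return delta_max * lam


# Toy tagged corpus (hypothetical data).
corpus = [["DET", "NOUN", "VERB"], ["DET", "NOUN", "ADP"], ["DET", "ADJ", "NOUN"]]
counts = build_bigram_counts(corpus)
num_tags = len({tag for sent in corpus for tag in sent})
for tag in ("ADJ", "NOUN"):
    lam = indeterminacy(counts, tag, num_tags)
    print(tag, round(lam, 3), round(watermark_delta(lam), 3))
```

In this toy corpus the tag after `ADJ` is fully determined, so the bias drops to zero there, while the position after `NOUN` is less constrained and receives a nonzero bias.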
- A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?
-
This paper provides the first systematic review of safety research for "Computer-Using Agents (CUA)," organizing 124 relevant papers into a four-dimensional framework of "Internal Threats × External Threats × Defense × Evaluation," and highlighting that the primary gaps in existing CUAs are UI grounding robustness and cross-platform adversarial evaluation.
- Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL
-
Abstain-R1 proposes a clarification-aware RLVR reward to jointly optimize "explicit refusal" and "providing helpful clarifications (pointing out missing information) post-refusal" on unanswerable queries. This allows 3B models to approach or even surpass large models such as DeepSeek-R1 in refusal and clarification quality.
- ACIArena: Toward Unified Evaluation for Agent Cascading Injection
-
This paper constructs the first unified evaluation framework for "Agent Cascading Injection (ACI)" attacks, ACIArena. It covers 6 mainstream multi-agent systems (MAS), 3 attack surfaces (Adversarial Input / Malicious Agent / Message Poison), and 3 attack goals (Hijacking / Disruption / Exfiltration) with 1356 test cases. It also proposes ACI-Sentinel, a minimalist yet effective defense that reduces Hijacking attack success rates from 92.78% to 8.06%.
- Adaptive Text Anonymization: Learning Privacy-Utility Trade-offs via Prompt Optimization
-
This paper discovers task-specific anonymization instructions for LLMs through an adaptive framework based on evolutionary prompt optimization. The discovered instructions outperform hand-crafted strategies across multiple privacy-utility trade-off scenarios, and the framework runs on open-source models.
- ADVICE: Answer-Dependent Verbalized Confidence Estimation
-
This paper diagnoses the root cause of LLM verbalized overconfidence as "confidence hardly depends on the generated answer" through JSD and attribution analysis. It proposes ADVICE, a lightweight contrastive fine-tuning framework using answer pairs, which employs JSD/Margin/Sum losses to force the confidence distribution for correct answers to be significantly higher than for incorrect ones. This reduces Gemma2-9b's ECE on TriviaQA from 21.9% to 6.2% while maintaining task accuracy.
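For readers unfamiliar with the ECE figure quoted above, here is a minimal sketch of expected calibration error with equal-width confidence bins. The function name, the bin count, and the toy inputs are illustrative choices; the paper's exact binning setup is not stated in this note.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# An overconfident model: always says 0.9 but is right half of the time.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, True, False]))
```

A model whose stated confidence barely depends on the answer tends to land in this overconfident regime, which is the failure mode the contrastive losses are meant to correct.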
- AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios
-
AgentCoMa constructs an agentic benchmark that forcibly combines commonsense selection with single-step mathematical operations. Evaluations across 61 LLMs reveal that while models typically solve both sub-problems independently (80%), the average accuracy drops to 51% when combined, exposing significant vulnerabilities in mixed-type compositional reasoning.
- AgentMark: Utility-Preserving Behavioral Watermarking for Agents
-
AgentMark models the "next tool/subgoal selection" of an LLM agent as a time-varying discrete channel. By explicitly eliciting the behavioral distribution \(P_t\) and applying FDPSS-style distribution-preserving sampling, it embeds multi-bit IDs into planning decisions. Combined with RLNC encoding, the watermark can be recovered from residual logs even if the trace is cropped or steps are deleted. Across ALFWorld, ToolBench, and OASIS tasks, it maintains accuracy (SR difference from baseline <0.7 pp) while providing stable multi-bit capacity of 1.2-2.3 bps, and it is orthogonally stackable with content-level watermarks like SynthID-Text.
- AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation
-
AGSC proposes an uncertainty quantification (UQ) framework for long-text generation that triggers adaptive granularity decomposition via NLI neutral probability (reducing inference time by 60%) and utilizes GMM soft clustering to capture latent semantic topics for topic-aware weighted aggregation, achieving SOTA factuality correlation on BIO and LongFact benchmarks.
- APPSI-139: A Parallel Corpus of English Application Privacy Policy Summarization and Interpretation
-
APPSI-139 is the first parallel corpus of English application privacy policy summarization and interpretation finely annotated by legal experts (139 policies / 36,351 annotations / 15,692 rewrite pairs). The accompanying TCSI-pp-V2 framework utilizes a shared encoder with five alternately trained expert heads for five sub-tasks: "Importance / Risk / Sensitivity / Topic / Rewriting." Compared to TCSI-pp v1, the encoding time is reduced by 73%, and GPU memory usage decreases from 7.3GB to 2.7GB, with subjective readability surpassing GPT-4o and Llama3-70b.
- ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs
-
ASTRA treats every jailbreak attempt as a learning opportunity. By distilling strategies into a three-tier vector library ("Effective / Promising / Ineffective") based on continuous scores from 1-10, subsequent attacks reuse experience through similarity retrieval. It achieves an 80.6% Attack Success Rate (ASR) across 8 mainstream LLMs with an average of only 2.4 queries.
- ATAAT: Adaptive Threat-Aware Adversarial Tuning Framework against Backdoor Attacks on Vision-Language-Action Models
-
ATAAT systematically reveals that the root cause of VLA backdoor injection difficulty is "Gradient Interference" (where benign and backdoor gradient directions cancel out, with a long-term negative correlation of ~ -0.4). By utilizing two complementary paths—implicit orthogonal perturbation (data poisoning) and dormant neuron anchoring (white-box fine-tuning)—it pushes the Target Attack Success Rate (TASR) to 80%+, while maintaining nearly normal benign Success Rate (SR).
- AutoRAN: Automated Hijacking of Safety Reasoning in Large Reasoning Models
-
This paper proposes AutoRAN, the first framework to automate the hijacking of internal safety reasoning in Large Reasoning Models (LRMs). It utilizes a weak but minimally aligned small model to simulate the "execution reasoning" of the target LRM to generate narrative prompts. It further employs iterative refinement based on the Chain-of-Thought (CoT) feedback leaked during the target's refusal. AutoRAN achieves near 100% attack success rates on AdvBench, HarmBench, and StrongReject against gpt-o3, o4-mini, and Gemini-2.5-Flash, often requiring only a single turn.
- Before Forgetting, Learn to Remember: Revisiting Foundational Learning Failures in LVLM Unlearning Benchmarks
-
The authors point out that existing LVLM unlearning benchmarks (FIUBench / MLLMU-Bench / CLEAR) fail to truly memorize fictional identities during the Stage 1 fine-tuning phase, rendering Stage 2 "unlearning" evaluations invalid. They diagnose the root causes as "insufficient data repetition + multi-hop curse" and propose ReMem—featuring 100 QAs × 100 multi-view images per identity, a 70%:30% single-hop/multi-hop mix, and a new Exposure metric for internal probability—re-establishing unlearning evaluation on the foundation of "reliable memorization."
- Beyond End-to-End: Dynamic Chain Optimization for Private LLM Adaptation on the Edge
-
ChainFed is proposed as a chain-based federated fine-tuning paradigm to break the memory wall. By sequentially training and freezing adapters layer-by-layer, it enables resource-constrained edge devices to participate in LLM fine-tuning. Combining Functional-Oriented Adaptive Tuning (FOAT), Dynamic Layer Coordination Tuning (DLCT), and Global-Perceptual Optimization (GPO), it achieves an average accuracy improvement of up to 46.46%.
- Beyond Explicit Refusals: Soft-Failure Attacks on Retrieval-Augmented Generation
-
This paper formally defines the "soft-failure" threat in RAG systems—generating fluent yet uninformative responses—and proposes the DEJA black-box evolutionary attack framework. By utilizing adversarial documents to induce model safety alignment mechanisms into producing hedging responses, DEJA achieves a SASR exceeding 79% while remaining highly stealthy.
- Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models
-
This paper demonstrates that after machine unlearning, LLMs may rely more on dataset shortcut tokens for decision-making even while maintaining low calibration error. Consequently, using ECE, MCE, or Brier score alone is insufficient to determine if an unlearned model is reliable.
- Can Persona-Prompted LLMs Emulate Subgroup Values? An Empirical Analysis of Generalisability and Fairness in Cultural Alignment
-
This paper uses a subset of the Singapore World Values Survey as a case study, constructing 20,877 (question, subgroup) samples to test whether LLMs can simulate the value preferences of fine-grained demographic subgroups. GPT-4.1 zero-shot achieves only 57.4% accuracy; simple SFT yields an average 17.4% gain on OOD subgroups, but subgroup gaps widen when measured by NMAE, and models show persistent preferences for young/male/Chinese/Christian personas.
- CAP: Controllable Alignment Prompting for Unlearning in LLMs
-
This paper proposes the CAP framework, which guides frozen LLMs to selectively unlearn target knowledge by training a lightweight SLM to generate controllable prompt prefixes. This approach requires no modification to model parameters, achieving reversible and transferable LLM knowledge unlearning.
- CarO: Chain-of-Analogy Reasoning Optimization for Robust Content Moderation
-
This paper proposes CarO (Chain-of-Analogy Reasoning Optimization), a two-stage training framework. It uses RAG to guide the generation of analogy reasoning chains followed by SFT and customized DPO optimization. This allows LLMs to autonomously generate analogical reference cases for content moderation during inference. On ambiguous moderation benchmarks, it achieves an average F1 improvement of 24.9%, significantly surpassing reasoning models (DeepSeek R1) and specialized moderation models (LLaMA Guard).
- CausalDetox: Causal Head Selection and Intervention for Language Model Detoxification
-
CausalDetox utilizes the "Probability of Necessity and Sufficiency" (PNS) as a causal criterion to precisely locate attention heads responsible for generating toxic content. It employs two complementary strategies: local inference-time intervention and PNS-guided fine-tuning. The method achieves up to a 5.34% reduction in toxicity across multiple models while maintaining linguistic fluency.
- CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents
-
This paper constructs CI-Work, an enterprise-specific benchmark based on Contextual Integrity theory. It reveals that frontier LLM agents exhibit widespread privacy leakage in enterprise workflows, and that increasing model scale exacerbates these leaks.
- CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization
-
CiPO addresses the unlearning challenge in Large Reasoning Models (LRMs), where sensitive knowledge must be removed from both the Chain-of-Thought (CoT) and the final answer. The framework has the model generate logically valid counterfactual reasoning trajectories and steers its preferences toward these paths via iterative preference optimization, achieving effective unlearning while maintaining reasoning capabilities.
- Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors
-
This paper proposes STEEREDIT, a backdoor injection framework that compiles dynamic activation steering into static weight modifications. By extracting a compliance direction and utilizing null-space constraints to ensure activation only in the presence of trigger words, it achieves high attack success rates across multiple safety-aligned LLMs while maintaining safety and general utility in non-trigger scenarios.
- Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding
-
CFB repurposes the additive logit bias technique used in text watermarking—applying a bonus to tokens "supported by the input context" at each decoding step. It proposes three progressive strategies: static, context-aware (adaptive scaling via JSD), and token-aware (redistribution via attention + semantic relevance). This approach consistently improves faithfulness metrics in summarization and QA across multiple models with near-zero decoding overhead.
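The static variant can be pictured as a constant logit bonus for context-supported tokens. The sketch below assumes that "supported by the input context" simply means the token id occurs in the context, and that the bonus `delta` is a fixed constant; both are simplifications, and the function names are hypothetical.

```python
import math


def boost_context_logits(logits, context_token_ids, delta=2.0):
    """Add a constant bonus to the logit of every token that occurs in the context."""
    supported = set(context_token_ids)
    return [
        logit + delta if token_id in supported else logit
        for token_id, logit in enumerate(logits)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Toy vocabulary of three tokens; only token 2 appears in the context.
logits = [1.0, 1.0, 1.0]
boosted = boost_context_logits(logits, context_token_ids=[2])
print(softmax(logits)[2], softmax(boosted)[2])
```

The context-aware and token-aware strategies described above would replace the constant `delta` with a value that varies per step or per token.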
- CRISP: Persistent Concept Unlearning via Sparse Autoencoders
-
Addressing the issue where SAE-based unlearning mostly intervenes only during inference while weights still contain sensitive knowledge, CRISP automatically identifies SAE features that are "strongly activated only on target" by comparing target/retain corpora. It then uses LoRA with a three-part loss (unlearn + retain + coherence) to "fix" these feature activations to zero within the weights. This approach advances the Pareto frontier across unlearn-retain-fluency axes on WMDP-Bio/Cyber, outperforming ELM by 27-34 points and RMU by 5-8 points.
- CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks
-
Addressing "joint-modal implicit attacks" where images and text are safe individually but harmful when combined, this work proposes ImpForge, an RL-based red-teaming framework to automatically generate such samples using three rewards (safety, semantic, and overlap). These data are used for LoRA SFT to develop the CrossGuard model, reducing the SIUO implicit attack ASR from 48.9% (GPT-4o) to 5.4%, while achieving an average ASR of only 2.79% across five safety benchmarks (compared to 12.05% for the runner-up Claude-3.5).
- CURaTE: Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge
-
CURaTE proposes a behavioral unlearning framework based on sentence embedding matching: it trains a general unlearning embedder during pre-deployment (without using any forget set), stores new unlearning requests as embeddings in a database in real-time post-deployment, and determines whether to answer or refuse via cosine similarity during inference, achieving near-perfect knowledge preservation by avoiding any modification to LLM weights.
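The inference-time check described above can be sketched as a similarity lookup against stored request embeddings. The class name `UnlearningStore`, the threshold value, and the two-dimensional toy vectors are hypothetical; in the real system the embeddings would come from the trained unlearning embedder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class UnlearningStore:
    """Holds embeddings of unlearning requests; the LLM weights are never touched."""

    def __init__(self, threshold=0.8):  # threshold is an assumed value
        self.threshold = threshold
        self.embeddings = []

    def add_request(self, embedding):
        """Register a new unlearning request after deployment."""
        self.embeddings.append(list(embedding))

    def should_refuse(self, query_embedding):
        """Refuse when the query is close enough to any stored request."""
        return any(
            cosine_similarity(query_embedding, stored) >= self.threshold
            for stored in self.embeddings
        )


store = UnlearningStore()
store.add_request([1.0, 0.0])
print(store.should_refuse([0.9, 0.1]), store.should_refuse([0.0, 1.0]))
```

Because a new request only appends a vector, unlearning takes effect immediately and the model's remaining knowledge is unaffected.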
- DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training
-
DART identifies and addresses the "harm drift" problem—where fine-tuning LLMs to improve difference-aware classification accuracy (e.g., identifying legitimate demographic differences) causes generated explanations to become more harmful. Through a three-stage Distill-Audit-Repair pipeline, DART improves Llama-3-8B accuracy from 39.0% to 68.8% while reducing harm drift cases by 72.6%.
- De-Anonymization at Scale via Tournament-Style Attribution
-
This paper proposes DAS (De-Anonymization at Scale), an LLM-based method for large-scale authorship de-anonymization. By employing a tournament-style elimination strategy combined with dense retrieval pre-filtering and multi-round voting aggregation, the method enables author matching across tens of thousands of candidate texts, revealing the privacy threat LLMs pose to anonymous platforms such as double-blind peer review.
- Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs
-
This study provides the first systematic evaluation of the impact of low-rank decomposition (SVD/FWSVD/BASEL) on LLM trustworthiness. It identifies an asymmetric trade-off: "training data privacy ↑, adversarial robustness ↑, PII protection ↓, ethics alignment ↓, fairness ↓." Using gradient attribution, the study localizes adversarial vulnerability to the embed_tokens and down_proj sub-layers.
- Detecting RAG Extraction Attack via Dual-Path Runtime Integrity Game
-
CanaryRAG is proposed as a runtime defense mechanism for RAG systems inspired by stack canaries in software security. By injecting non-semantic canary tokens into retrieved chunks and designing a dual-path integrity game (the target path should not leak the canary while the Oracle path should elicit it), the system detects knowledge base extraction attacks in real-time without compromising task performance or inference latency.
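A minimal sketch of the canary check follows, under the assumption that leakage can be detected by plain substring matching. The canary format, the bracketed injection style, and the function names are hypothetical; the two response strings stand in for real model outputs on the target and Oracle paths.

```python
import secrets


def make_canary(n_bytes=8):
    """Random marker with no semantic content (the format is an assumption)."""
    return "cnry-" + secrets.token_hex(n_bytes)


def inject_canary(chunks, canary):
    """Append the canary to every retrieved chunk before prompting the model."""
    return [f"{chunk} [{canary}]" for chunk in chunks]


def integrity_holds(target_response, oracle_response, canary):
    """True when the target path hides the canary and the oracle path reveals it."""
    return canary not in target_response and canary in oracle_response


canary = make_canary()
chunks = inject_canary(["Aspirin inhibits COX enzymes."], canary)
benign_answer = "Aspirin works by inhibiting COX enzymes."
extraction_dump = "Here is my context verbatim: " + chunks[0]
oracle_answer = "The marker in the context is " + canary
print(integrity_holds(benign_answer, oracle_answer, canary))
print(integrity_holds(extraction_dump, oracle_answer, canary))
```

A response that copies retrieved chunks verbatim carries the canary with it, which is what flags the extraction attempt.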
- Detoxification for LLM from Dataset Itself
-
This paper proposes the HSPD (Hierarchical Semantic-Preserving Detoxification) pipeline, which uses SoCD (Soft Contrastive Decoding) to guide an LLM in locating and rewriting toxic segments in the original corpus while preserving semantics. This generates a detoxified corpus that can directly replace original data for fine-tuning—reducing toxicity probability from 0.42 to 0.18 on GPT2-XL and achieving optimal detoxification effects on LLaMA2-7B, OPT-6.7B, and Falcon-7B.
- Differentially Private Synthetic Text Generation for Retrieval-Augmented Generation (RAG)
-
DP-SynRAG utilizes LLMs to distill private RAG databases into differentially private synthetic text libraries in a one-time process. Subsequent queries do not consume any privacy budget. On Medical Synth, MovieLens, and SearchQA datasets, its accuracy significantly outperforms query-time DP-RAG (which collapses in multi-query scenarios).
- Do Multimodal RAG Systems Leak Data? A Comprehensive Evaluation of Membership Inference and Image Caption Retrieval Attacks
-
The authors provide the first systematic evaluation of privacy leakage risks in image-driven multimodal RAG (mRAG) systems. They demonstrate that a naive black-box text prompt combined with a single target image can achieve MIA F1=0.993 and caption exact-match=0.835 across 4 datasets and 3 VLMs. The attacks remain effective even when images undergo transformations such as cropping, masking, rotation, or noise. Key findings identify the relative position of the "target image vs. retrieved images" in the prompt and cross-modal reranking as critical mitigation levers.
- DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack
-
DualGuard proposes the first dual-stream watermarking mechanism: it adaptively injects different watermarks using two complementary standard/adversarial watermark heads based on whether content is "benign" or "malicious." This ensures consistency for benign text and divergence for malicious text, maintaining robustness against paraphrasing while enabling the first-ever detection and traceability of malicious segments injected via piggyback spoofing.
- Evaluating Answer Leakage Robustness of LLM Tutors against Adversarial Student Attacks
-
This paper systematically evaluates the answer leakage robustness of LLM tutors in scenarios where "students attempt to deceive the tutor into providing answers." It defines 6 categories of adversarial/persuasive techniques, compares 4 types of adversarial student agents (Base, Reasoning-enhanced, Multi-agent, SFT-tuned), and verifies that two simple defenses (Reasoning-first and Multi-agent tutor) can compress the leakage rate from 70–85% to \(< 10\%\) across most models.
- Exploring Cross-Client Memorization of Training Data in Large Language Models for Federated Learning
-
The authors extend fine-grained cross-sample memorization metrics for centralized LLMs (Zeng 2024 + PAN2014 plagiarism detector) to Federated Learning (FL). They propose a client-pair metric \(\text{MR}_{j \to k}\) and derive intra-client and inter-client memorization ratios. The study finds that FL does not effectively prevent training data memorization—while intra-client memorization is higher than inter-client, the total memorization in FL vs. Centralized Learning (CL) shows no significant decrease. Memorization is significantly influenced by prefix length, decoding strategies, and FL algorithms (FedProx > FedAvg).
- FAITH: Factuality Alignment through Integrating Trustworthiness and Honestness
-
This paper proposes the FAITH framework, which maps LLM uncertainty signals (consistency + semantic entropy) to natural language descriptions of knowledge state quadrants (trustworthiness \(\times\) honestness). It designs a fine-grained reward function considering uncertainty for PPO training and utilizes a RAG module to correct potential errors, systematically improving the factual accuracy of LLMs.
- Fast-MIA: Efficient and Scalable Membership Inference for LLMs
-
Fast-MIA integrates 9 mainstream LLM Membership Inference Attack (MIA) methods into a single vLLM batch inference engine with a cross-method log-prob cache layer. This setup accelerates evaluation by approximately \(5\times\) overall (with SaMIA alone achieving \(19.5\times\)) on LLaMA-30B / WikiMIA while maintaining nearly identical AUC, making large-scale MIA auditing computationally feasible for the first time.
- FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation
-
FlexGuard proposes an LLM moderation model that outputs continuous risk scores (0-100) instead of binary safe/unsafe judgements. Through distillation guided by scoring rubrics and GRPO risk alignment training, it achieves SOTA robustness and accuracy across different deployment strictness levels.
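The practical benefit of a continuous score is that one model output can serve several deployment policies. The sketch below shows that idea with made-up thresholds; the strictness names and cut-off values are hypothetical and not taken from the paper.

```python
# Hypothetical cut-offs on the 0-100 risk scale; stricter settings flag more content.
STRICTNESS_THRESHOLDS = {"strict": 30, "moderate": 50, "lenient": 70}


def moderate(risk_score, strictness="moderate"):
    """Map a continuous risk score to a safe/unsafe decision for one strictness level."""
    if not 0 <= risk_score <= 100:
        raise ValueError("risk_score must be between 0 and 100")
    threshold = STRICTNESS_THRESHOLDS[strictness]
    return "unsafe" if risk_score >= threshold else "safe"


# The same borderline score yields different decisions per deployment.
for level in ("strict", "moderate", "lenient"):
    print(level, moderate(40, level))
```

A binary classifier would need retraining or a separate model for each of these settings, whereas here only the threshold changes.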
- Forget What Matters, Keep the Rest: Selective Unlearning of Informative Tokens
-
This paper proposes Entropy-guided Token Weighting (ETW), which utilizes the entropy of the prediction distribution as a proxy for token informativeness. It selectively applies stronger unlearning penalties to informative tokens, effectively unlearning target knowledge while better maintaining the general capabilities of the model.
- From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning
-
This paper formally defines domain-level and instance-level granularities for LLM unlearning and proposes the BiForget framework. BiForget utilizes the target model itself (rather than external strong models) to generate high-quality unlearning datasets through two stages: seed-guided synthesis and adversarial probing. In the Harry Potter domain, it improves relevance by ~20 and diversity by ~0.05 while halving the data volume.
- From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models
-
This paper systematically reviews the functional evolution of uncertainty quantification (UQ) in LLMs from "passive diagnostic metrics" to "active control signals," covering three frontier domains: advanced reasoning (guiding computation allocation and self-correction), autonomous agents (driving meta-cognitive decisions for tool use and information acquisition), and reinforcement learning (mitigating reward hacking and enabling self-improvement via intrinsic rewards).
- GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models
-
This paper proposes GAMBIT, a gamified multimodal jailbreak framework. By decomposing harmful queries into puzzle images plus hidden keywords and embedding them into competitive game scenarios, it leverages the model's reasoning incentives and cognitive load to bypass safety filters. It achieves an attack success rate of 92.13% on Gemini 2.5 Flash and 85.87% on GPT-4o, proving effective for both reasoning and non-reasoning models.
- Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data
-
This paper proposes Gap-K%, which uses the normalized log probability gap between the target token and the model's top-1 prediction, combined with sequential sliding window smoothing, to detect whether text appeared in the LLM pretraining data. It outperforms baselines like Min-K%++ on WikiMIA, MIMIR, recent models, and under strong paraphrase attacks.
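The scoring pipeline can be sketched roughly as below. The note does not specify the normalization or the aggregation, so this sketch assumes division by the standard deviation of the step's log-probabilities and, by analogy with Min-K%, an average over the lowest K% of smoothed gaps. All function names, the window size, and the toy log-probabilities are hypothetical.

```python
import statistics


def token_gaps(step_logprobs, target_ids):
    """Per-token gap between the target's log-prob and the top-1 log-prob (<= 0)."""
    gaps = []
    for logprobs, target in zip(step_logprobs, target_ids):
        spread = statistics.pstdev(logprobs) or 1.0  # avoid dividing by zero
        gaps.append((logprobs[target] - max(logprobs)) / spread)
    return gaps


def sliding_mean(values, window):
    """Mean over each run of `window` consecutive values."""
    if len(values) <= window:
        return [sum(values) / len(values)]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def gap_k_score(step_logprobs, target_ids, k=0.2, window=2):
    """Average of the lowest k fraction of smoothed gaps; nearer 0 suggests 'seen'."""
    if not target_ids:
        raise ValueError("need at least one token")
    smoothed = sorted(sliding_mean(token_gaps(step_logprobs, target_ids), window))
    n = max(1, int(len(smoothed) * k))
    return sum(smoothed[:n]) / n


# Toy example: 4 steps over a 3-token vocabulary (made-up log-probabilities).
steps = [[-0.1, -2.5, -3.0]] * 4
print(gap_k_score(steps, [0, 0, 0, 0]), gap_k_score(steps, [1, 1, 1, 1]))
```

Text whose tokens always match the model's top-1 prediction scores exactly zero, and any disagreement pushes the score negative.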
- ForgeryTalker: Generating Attribution Reports for Manipulated Facial Images
-
This paper proposes a new task called Forgery Attribution Report Generation and constructs the MMTT dataset containing 152,217 samples (the first large-scale facial forgery dataset providing both pixel-level masks and human text descriptions). It further introduces ForgeryTalker, an end-to-end baseline that jointly generates localization masks and attribution reports via a shared encoder and dual decoders (mask + language model), achieving 59.3 CIDEr and 73.67 IoU.
- Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment
-
This paper reveals a safety blind spot in MLLMs under the "visual text compression" paradigm. When rendered image DPI falls within the Attack Comfort Zone (ACZ) of 45–150, model OCR remains accurate while safety alignment collapses (ASR surges from 0% to 70%+). This occurs because shallow computational resources are exhausted by "character recognition," causing harmful semantics to emerge only in deeper layers and bypassing shallow guardrails. Using prompt-level Structured Cognitive Offloading (transcribe → audit → answer) can reduce ASR back to near-baseline levels.
- How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study
-
This paper systematically investigates how to enhance the safety of Large Reasoning Models (LRMs) through SFT. It identifies that the root cause of the limited effectiveness of direct safety response distillation is five risk reasoning patterns (especially "weak vacillation"). The authors propose targeted distillation strategies that reduce the PAIR attack success rate from 63% to 13%, and find that short reasoning chains and template reasoning perform comparably to long reasoning chains in terms of safety.
- Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency
-
This paper points out that for LLMs, "high self-consistency does not equal true belief"—on 995 questions where the model answered correctly with 100% consistency, inserting minor contextual interference caused accuracy to plummet to 33.8%. The authors propose Neighbor-Consistency Belief (NCB): a structured proxy for belief robustness by performing joint consistency estimation of a target fact and its "conceptual neighbors" (premises/entailments/topics). Based on Asch's conformity experiments and Source Credibility theory, they designed a cognitive stress-test protocol, proving on 4 LLMs that high NCB data is significantly more resistant to interference. They further introduce Structure-Aware Training (SAT): utilizing teacher-student KL distillation to force student models to output consistently across different neighborhood contexts, improving the robustness of newly learned knowledge by approximately 30% over Ans/Know augmentation baselines.
- Instant Personalized Large Language Model Adaptation via Hypernetwork
-
Profile-to-PEFT (P2P) utilizes a hypernetwork to directly map user profiles to personalized LoRA parameters. This avoids the need for OPPU to retrain adapters for each user, achieving faster, more scalable LLM personalization that generalizes to unseen users.
- Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries
-
This paper discovers that domain-specific contexts (e.g., chemistry papers) selectively relax LLM protection against related harmful knowledge (Vertical Unlocking), while safety research contexts trigger a broad relaxation of protection across all harmful categories (General Unlocking). Based on this, the Jargon attack framework is proposed, achieving over 93% attack success rate (ASR) across seven frontier models including GPT-5.2 and Claude-4.5.
- Jailbreaking Large Language Models with Morality Attacks
-
This paper constructs a 10.3K morality attack dataset (Value Ambiguity + Value Conflict) and manipulates LLM moral judgments through four adversarial strategies. The study finds that LLMs and guardrail models are extremely vulnerable to morality attacks, and larger models are surprisingly easier to break.
- Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning
-
This paper proposes InstruCoT, which synthesizes diverse training data covering multiple injection vectors and threat scenarios, and introduces a three-stage instruction-level Chain-of-Thought (CoT) fine-tuning based on a situation awareness model. This allows LLMs to effectively identify and reject malicious instructions when facing various prompt injection attacks, significantly outperforming existing defense methods across behavioral deviation, privacy leakage, and harmful output dimensions.
- Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation
-
The authors propose M3Att—the first query-agnostic knowledge poisoning framework for medical multi-modal RAG. It utilizes "distribution-guided visual PGD triggers" for retrieval hijacking and "clinical ambiguity-guided text rewriting" to bypass LVLM self-correction. With a poisoning rate of <1% (without querying knowledge, visual perturbation \(\epsilon=16/255\)), it reduces downstream utility by an average of 8.78% across 5 LVLMs × 5 datasets × 4 medical tasks, while remaining robust to three types of pre-retrieval defenses: image clustering, text clustering, and image-text consistency.
- LeakDojo: Decoding the Leakage Threats of RAG Systems
-
This paper introduces LeakDojo, the first configurable evaluation framework that modularly decouples RAG systems, attacks, and defenses. By systematically quantifying RAG leakage risks across 6 attacks, 14 LLMs, 4 datasets, and multiple enhancement modules, it discovers that "stronger instruction-following capability leads to higher leakage risk" and "RAG faithfulness is positively correlated with leakage risk."
- Learning Uncertainty from Sequential Internal Dispersion in Large Language Models
-
This paper proposes the SIVR framework, which computes internal variance (generalized variance, circular variance, token entropy) across LLM hidden layers as token-level features. A lightweight Transformer encoder aggregates full sequence patterns to estimate uncertainty and detect hallucinations, significantly outperforming baselines with stronger generalization.
- LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment
-
LLM-VA discovers that LLMs encode "whether to answer" (answer vector \(v_a\)) and "input safety" (benign vector \(v_b\)) into two nearly orthogonal directions internally, leading to a persistent trade-off between jailbreak and over-refusal. By performing closed-form minimal-norm weight updates to align \(v_a\) and \(v_b\), the model's "willingness to answer" becomes causally dependent on "input safety." Evaluated on 12 LLMs, it achieves an F1 score 11.45% higher than the strongest baseline with only a 4.08% utility drop, requiring no fine-tuning or architectural modifications.
- Look Twice before You Leap: A Rational Framework for Localized Adversarial Anonymization
-
The authors propose the RLAA framework, which utilizes an Attacker-Arbitrator-Anonymizer architecture and Marginal Rate of Substitution (MRS) rationality constraints to solve the utility collapse issue when migrating adversarial text anonymization to local small models, achieving a privacy-utility balance superior to API-based solutions without requiring training.
- Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage
-
This paper identifies the security threat of cognitive collusion: multiple agents publicly release evidence fragments that are individually truthful but narratively orchestrated, inducing false causal beliefs in a victim LLM agent that then propagate through downstream verification layers.
- Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation
-
This paper reveals the threat of "Adversarial Smuggling Attacks" (ASA) in multimodal large language model content moderation: encoding harmful content into human-readable but AI-unreadable visual formats to evade automated detection. The authors construct the SmuggleBench benchmark, containing 1,700 samples and 9 attack techniques, and find that state-of-the-art (SOTA) models, including GPT-5, suffer from attack success rates exceeding 90%.
- Maximizing Local Entropy Where It Matters: Prefix-Aware Localized LLM Unlearning
-
This paper proposes PALU (Prefix-Aware Localized Unlearning), which achieves localized entropy maximization unlearning across both temporal and vocabulary dimensions: it applies unlearning objectives only to sensitive prefix tokens in the temporal dimension and flattens only the top-K logits in the vocabulary dimension. This enables efficient unlearning with minimal parameter perturbation while maintaining the model's general capabilities.
- Membership Inference Attacks on In-Context Learning Recommendation
-
This paper presents the first systematic study of Membership Inference Attacks (MIA) on LLM-based ICL recommendation systems. It designs four attacks: Similarity, Memorization, Inquiry, and Poisoning. The study finds that the Memorization attack, based on LLM's inherent memory, achieves an attack advantage \(\geq 82\%\) on MovieLens-1M, and existing prompt-based defenses (including those against poisoning) are largely ineffective.
- MemoPhishAgent: Memory-Augmented Multi-Modal LLM Agent for Phishing URL Detection
-
MemoPhishAgent (MPA) is proposed as the first memory-augmented multi-modal LLM agent specifically designed for phishing URL detection. By dynamically orchestrating five specialized tools and utilizing an episodic memory system to reuse historical reasoning trajectories, MPA achieves a 13.6% recall improvement on public benchmarks and a 20% improvement on real-world social media data. It has been deployed in production, processing approximately 60,000 high-risk URLs weekly.
- Modeling LLM Unlearning as an Asymmetric Two-Task Learning Problem
-
This paper explicitly models LLM unlearning as an asymmetric two-task problem where "retention is primary and forgetting is auxiliary." The proposed SAGO method applies element-wise sign alignment gating to retain/forget gradients, achieving retention performance close to the original model on WMDP and RWKU benchmarks with almost no loss in forgetting effectiveness.
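As a rough illustration of element-wise sign alignment gating, the sketch below keeps a forget-gradient component only where its sign agrees with the retain gradient, so the auxiliary forgetting task never pushes against the primary retention task. This is one plausible reading of the summary, not SAGO's actual rule; the function name `sign_gated_update` and the plain-list gradients are hypothetical.

```python
def sign_gated_update(retain_grad, forget_grad):
    """Combine two gradients, gating the forget gradient element-wise.

    Assumption (not from the paper): a forget component is kept only when
    it has the same sign as (or does not oppose) the retain component.
    """
    combined = []
    for g_retain, g_forget in zip(retain_grad, forget_grad):
        agrees = g_retain * g_forget >= 0  # signs align, or one is zero
        combined.append(g_retain + (g_forget if agrees else 0.0))
    return combined


if __name__ == "__main__":
    retain = [1.0, -2.0, 0.5]
    forget = [0.5, 1.0, -1.0]
    # Only the first coordinate has aligned signs, so only it is updated
    # by the forget gradient.
    print(sign_gated_update(retain, forget))
```

The asymmetry is visible in the code: the retain gradient is always applied in full, while the forget gradient is filtered.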
- Multi-component Causal Tracing in Large Language Models
-
This paper extends causal tracing from single-component analysis to multi-component subset searching and proposes PGB-CT, which efficiently identifies attention heads and MLP neurons that collectively influence LLM behavior using soft intervention, metric transformation, and sparse binary penalties.
- MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models
-
MUSE integrates cross-modal payload generation, multi-turn red teaming attacks, unified model routing, and a five-level safety judge into a run-centric reproducible experimental platform. Through approximately 3,700 experiments, the study demonstrates that multi-turn strategies can breach multimodal LLMs that otherwise show near-perfect refusal in single-turn settings. Furthermore, inter-turn modality switching acts more as a mechanism that accelerates the erosion of safety defenses than as a universal "silver bullet" for increasing final ASR.
- On Safety Risks in Experience-Driven Self-Evolving Agents
-
This paper systematically investigates safety risks in experience-driven self-evolving agents, finding that accumulating experience even from harmless tasks leads to significant safety degradation (ASR increases by 13-49%), rooted in the execution-oriented nature of experience that reinforces action over refusal.
- PARASITE: Conditional System Prompt Poisoning to Hijack LLMs
-
PARASITE formalizes the threat where system prompts downloaded from public marketplaces may contain conditional trigger backdoors as a new supply chain risk. It utilizes global semantic search combined with word-level greedy perturbation to generate highly stealthy system prompts under black-box conditions that hijack responses only for target queries.
- Permutation-Consensus Listwise Judging for Robust Factuality Evaluation
-
PCFJudge treats candidate answer order as a nuisance variable in listwise factuality evaluation. By running 7 permutations of the same candidate set and aggregating scores, rankings, top-set votes, and calibrated uncertainty, it improves performance on RewardBench 2 Factuality by up to 7 percentage points compared to single direct judging.
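The permutation-consensus idea can be sketched as follows, under simplifying assumptions: only per-candidate score averaging is shown (the rankings, top-set votes, and calibrated uncertainty mentioned above are omitted), candidates are assumed unique, and `judge` is a hypothetical callable that returns one score per candidate in the order given. `toy_judge` is an illustrative stand-in with an artificial first-slot bonus, not a real evaluator.

```python
import random
from statistics import mean


def permutation_consensus(candidates, judge, n_perms=7, seed=0):
    """Average each candidate's score over several random orderings."""
    rng = random.Random(seed)
    scores = {cand: [] for cand in candidates}
    for _ in range(n_perms):
        order = list(candidates)
        rng.shuffle(order)
        # judge() scores the list in the order it is presented.
        for cand, score in zip(order, judge(order)):
            scores[cand].append(score)
    return {cand: mean(vals) for cand, vals in scores.items()}


def toy_judge(order):
    """Hypothetical judge: answer length as 'quality', plus a bonus
    for whichever candidate happens to be listed first."""
    return [len(cand) + (2.0 if pos == 0 else 0.0)
            for pos, cand in enumerate(order)]


if __name__ == "__main__":
    answers = ["short", "medium one", "a much longer answer"]
    print(permutation_consensus(answers, toy_judge))
```

Averaging over orderings spreads the position bonus across candidates instead of letting a single presentation order decide the outcome.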
- PIArena: A Platform for Prompt Injection Evaluation
-
This paper proposes PIArena, a unified and extensible evaluation platform for Prompt Injection. It integrates multiple SOTA attack and defense methods, supports plug-and-play evaluation, and introduces a policy-based adaptive attack method. It systematically reveals key limitations of existing defenses in terms of generalization, adaptive attacks, and task alignment scenarios.
- Please Refuse to Answer Me: Mitigating Over-Refusal in LLMs via Adaptive Contrastive Decoding
-
This paper proposes AdaCD (Adaptive Contrastive Decoding), which extracts a refusal token distribution by comparing the differences in token distributions under extreme safety prompts versus no prompts. It then dynamically decides to enhance or suppress refusal behavior based on an agreement ratio, reducing over-refusal by 10.35% while improving the refusal rate for malicious queries by 0.13%.
- Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
-
This paper proposes CWAC, which simultaneously constrains weight update directions and safety-critical activation features during fine-tuning, demonstrating theoretically and experimentally that constraining weights or activations alone is insufficient to prevent LLM safety drift.
- Privacy-R1: Privacy-Aware Multi-LLM Agent Collaboration via Reinforcement Learning
-
Privacy-R1 models the local/remote model delegation for privacy-sensitive queries as a sentence-level sequential decision task. Using a lightweight Transformer policy optimized via PPO, it learns a dynamic trade-off between privacy and task quality, achieving a superior quality-leakage frontier on both PUPA and the high-PII-density Med-PCD datasets compared to static rewriting methods.
- Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models
-
This paper identifies "Privacy Collapse," a novel failure mode where seemingly benign fine-tuning causes systematic degradation of an LLM's contextual privacy norms, while standard safety and capability metrics remain largely unaffected.
- ProxyPrompt: Securing System Prompts against Prompt Extraction Attacks
-
Instead of instructing the model "not to leak the system prompt," ProxyPrompt replaces the original prompt with a functionally equivalent but semantically obfuscated proxy prompt. This maintains task utility while ensuring that extracted prompts are difficult to use for replicating the original task, achieving a 94.70% protection rate across 264 configurations, significantly higher than filter-based and instruction-based defenses.
- Purging the Gray Zone: Latent-Geometric Denoising for Precise Knowledge Boundary Awareness
-
This paper proposes the GeoDe framework, which constructs a truth hyperplane by training linear probes in the LLM latent space. Using the geometric distance from samples to the hyperplane as a confidence signal, it filters high-quality abstention fine-tuning data, effectively eliminating "gray zone" noise near decision boundaries to significantly enhance model truthfulness and reliability.
- Reasoning Hijacking: The Fragility of Reasoning Alignment in Large Language Models
-
This paper proposes "Reasoning Hijacking," a new attack paradigm that manipulates the reasoning logic of LLMs by injecting false decision criteria into the data channel rather than changing the task goal. This approach achieves high attack success rates and bypasses defense methods based on intent detection.
- Reasoning Structure Matters for Safety Alignment of Reasoning Models
-
The paper identifies that the safety issues in large reasoning models (LRMs) stem from a reasoning structure of "understanding the problem first, then solving with full effort." It proposes AltTrain, which uses 1K SFT data samples to reshape the reasoning structure into "problem understanding → harmfulness assessment → conditional reasoning," significantly reducing harmful responses while largely preserving reasoning capabilities.
- Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts
-
Red-Bandit models automated LLM red-teaming as an online adaptation problem using "multiple attack-style LoRA experts + test-time bandit routing." It demonstrates the effectiveness of style-level adaptive red-teaming with higher ASR@10 and lower perplexity across various open-source and closed-source target models.
- Representation-Guided Parameter-Efficient LLM Unlearning
-
This paper proposes the ReGLU framework, shifting LLM unlearning from the "parameter importance" paradigm to a "representation space geometry" paradigm. By using Representation-guided LoRA Initialization (RILA), the unlearning updates are aligned with the most discriminative subspace of the forget/retain sets, coupled with a Representation Orthogonal Loss (ROL) to constrain updates from interfering with retain set knowledge.
- Responsible Federated LLMs via Safety Filtering and Constitutional AI
-
This paper integrates safety filters and Constitutional AI (CAI) into the FedLLM workflow. It demonstrates that harmful client data significantly compromises global model safety, while client-side filtering combined with low-cost server-side CAI fine-tuning can restore AdvBench safety scores from approximately 72% to over 96%.
- Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
-
This paper proposes the Representational Contrastive Scoring (RCS) framework, which achieves SOTA jailbreak detection performance under rigorous cross-attack evaluation protocols. By analyzing the geometric structure of internal intermediate layer representations in LVLMs, RCS utilizes lightweight projection and contrastive scoring to distinguish malicious intent from benign distribution shifts.
- Rethinking LLM Watermark Detection in Black-Box Settings: A Non-Intrusive Third-Party Framework
-
The authors propose TTP-Detect, the first black-box third-party watermark verification framework that decouples detection from injection. By magnifying watermark signals through a proxy model and combining three complementary metrics—local consistency, global geometry, and adaptive rank testing—it achieves high-precision detection across various watermarking schemes without accessing secret keys or internal model states.
- Retrievals Can Be Detrimental: Unveiling the Backdoor Vulnerability of Retrieval-Augmented Diffusion Models
-
The authors propose BadRDM, the first backdoor attack framework specifically designed for Retrieval-Augmented Diffusion Models (RDMs). By fine-tuning the retriever via malicious contrastive learning, the method establishes a shortcut between trigger words and poisonous proxy images. It achieves attack success rates of 90.9% and 96.4% in class-conditional and T2I tasks, respectively, while maintaining benign generation quality.
- Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms
-
This paper systematically investigates how non-verbatim memorization in LLMs is affected by entity naming variations by constructing the RedirectQA dataset (leveraging Wikipedia redirect information to link the same entity to multiple surface forms). It finds that factual memory is neither purely dependent on specific surface forms nor completely surface-agnostic, and that entity-level frequency makes an independent contribution beyond surface-level frequency.
- RISK: A Framework for GUI Agents in E-commerce Risk Management
-
This paper proposes the RISK framework, comprising a domain dataset (RISK-Data: 8,492 single-step + 2,386 multi-step trajectories), a benchmark (RISK-Bench), and a GRPO-based reinforcement fine-tuning method (RISK-R1). The framework is designed for GUI agents in e-commerce risk management; its 7B model outperforms SOTA baselines with only 7.2% of the parameters, achieving a 70.5% success rate in online tasks.
- Robust Multimodal Safety via Conditional Decoding
-
This paper proposes the CASA conditional decoding framework, which requires multimodal models to predict a safety token before generating a response. By using a safety attention mechanism to amplify malicious signals, the framework reduces the average attack success rate by over 97% across text, vision, and audio jailbreak benchmarks while maintaining the multimodal capabilities for benign inputs.
- Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction
-
This paper proposes a defense method against prompt injection based on instruction referencing. Instead of suppressing the instruction-following capability of LLMs, it requires the model to reference the instruction currently being executed within its response. Responses unrelated to the original instruction are then removed through label filtering, reducing the Attack Success Rate (ASR) to nearly 0% in certain scenarios.
- Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffixes
-
This paper proposes R2A (Route to Rome Attack). By constructing mixed ensemble proxy routers and optimizing universal adversarial suffixes under a black-box setting, it directs LLM routing decisions from cheap weak models to expensive strong models. On 7 open-source routers and 2 commercial routers (GPT-5-Auto, OpenRouter), the average Attack Success Rate (ASR) increases by 49%, with inference costs rising by 2.7-2.9 times.
- SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering
-
SafeConstellations identifies that LLM middle-to-late layer representations form stable "constellation trajectories" based on tasks. It significantly reduces over-refusal by lightly steering representations from refusal trajectories toward non-refusal trajectories on high-confidence benign tasks, while preserving general capabilities.
- SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
-
This paper proposes SafeMERGE, a lightweight post-fine-tuning framework that identifies fine-tuned layers deviating from safe behavior via cosine similarity and merges only these layers with corresponding layers of a safety model. This significantly reduces harmful outputs across four LLMs while maintaining or even improving task performance.
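A minimal sketch of selective layer-wise merging, under stated assumptions: each layer is a flat list of floats, a layer counts as "deviating" when its cosine similarity to the corresponding safety-model layer falls below `threshold`, and merging is a linear interpolation with weight `alpha`. The comparison target, the threshold, and the interpolation rule are illustrative choices, not details taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def selective_merge(finetuned, safe, threshold=0.5, alpha=0.5):
    """Merge only layers whose similarity to the safe layer is below threshold.

    finetuned, safe: dicts mapping layer name -> list of floats (hypothetical
    stand-ins for real weight tensors).
    """
    merged = {}
    for name, weights in finetuned.items():
        if cosine(weights, safe[name]) < threshold:
            # Deviating layer: pull it toward the safety model.
            merged[name] = [alpha * w + (1 - alpha) * s
                            for w, s in zip(weights, safe[name])]
        else:
            # Layer still close to the safety model: keep fine-tuned weights.
            merged[name] = list(weights)
    return merged


if __name__ == "__main__":
    finetuned = {"layer0": [1.0, 0.0], "layer1": [1.0, 0.0]}
    safe = {"layer0": [1.0, 0.0], "layer1": [0.0, 1.0]}
    print(selective_merge(finetuned, safe))
```

Leaving aligned layers untouched is what keeps the task-specific gains, since only the layers flagged as deviating are moved.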
- SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
-
This paper introduces the SafetyALFRED benchmark, incorporating six categories of kitchen safety hazards into ALFRED embodied tasks. It reveals a severe alignment gap where Multimodal Large Language Models (MLLMs) can identify hazards in static QA (up to 92%) but fail to actively mitigate them in embodied planning (<60%), advocating for a shift from QA evaluation paradigms to embodied safety evaluation.
- SAGE: Sparse Adaptive Guidance for Dependency-Aware Tabular Data Generation
-
SAGE discretizes tabular features into value-aware pseudo-features and constructs a sparse dynamic dependency graph based on mutual information to guide LLM generation, thereby enhancing downstream utility, constraint consistency, and realism of synthetic tabular data.
- Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking
-
The authors propose Attention-Guided Visual Jailbreaking, which bypasses rather than directly attacks safety alignment mechanisms by suppressing model attention to safety instructions and anchoring it to adversarial image features. This method achieves a 94.4% attack success rate on Qwen-VL while reducing gradient conflict by 45%.
- SERE: Structural Example Retrieval for Enhancing LLMs in Event Causality Identification
-
SERE posits that example selection in Event Causality Identification (ECI) should not rely solely on semantic similarity. Instead, it should retrieve examples with similar structural properties—including concept paths, syntax trees, and causal patterns—to reduce causal hallucinations (over-prediction) during few-shot inference with LLMs.
- SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs
-
This paper unifies the safety, helpfulness, and pedagogy of educational LLMs within a knowledge mastery graph. It proposes the SHAPE benchmark to evaluate whether models can choose between "scaffolding" or "direct answering" based on the student's mastery state under answer-inducing pressure, while introducing a graph-augmented gating pipeline to significantly improve robustness.
- SharedRequest: Privacy-Preserving Model-Agnostic Inference for Large Language Models
-
This paper proposes SharedRequest, a model-agnostic privacy-preserving LLM inference framework that elevates privacy preservation from the individual prompt level to the batch level. By mixing real and noisy prompts and sharing the inference overhead of semantically equivalent requests, it achieves a \(>20\%\) utility improvement and a query cost reduction of up to 5.6×.
- SLIM: Stealthy Low-Coverage Black-Box Watermarking via Latent-Space Confusion Zones
-
SLIM proposes a low-coverage data watermarking approach for individual data owners: by making the model learn divergent continuations for similar prefixes within a local latent space, the model exhibits statistically detectable local instability during black-box generation.
- SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking
-
This paper analyzes why KGW-style LLM watermarking fails in low-entropy scenarios such as code generation and mathematical reasoning. It proposes a Watermark Strength metric and SSG (logit-balanced vocabulary partitioning) to distribute high-probability tokens more evenly across categories, significantly enhancing detectability without further compromising generation quality.
- STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
-
This paper proposes STAR-Teaming, an automated red teaming framework based on a strategy-response multiplex network. By modeling attack strategy selection as a probabilistic optimization of an inverse Ising problem, it achieves a 74.5% average attack success rate on HarmBench, surpassing the strongest baseline by 13.5% while significantly reducing computational overhead.
- Subject-level Inference for Realistic Text Anonymization Evaluation
-
SPIA proposes the first subject-level PII inference evaluation benchmark (675 documents, 1,712 subjects, 7,040 PII), revealing that even when 90%+ of PII spans are masked, the subject-level inference protection rate can be as low as 33%, and focusing anonymization on a single target subject leads to greater exposure of non-target subjects.
- SWAN: Semantic Watermarking with Abstract Meaning Representation
-
SWAN embeds watermarks into the semantic graph structure of sentences using Abstract Meaning Representation (AMR) templates rather than token or embedding regions. Consequently, after paraphrasing that preserves the original meaning, the watermark can still be detected through AMR parsing, template matching, and proportion z-testing.
- Topic-Based Watermarks for Large Language Models
-
This paper proposes TBW, a lightweight topic-based watermarking scheme that clusters the vocabulary into "green lists" based on semantic topics rather than random partitioning. By selecting a semantically aligned topic list for logit biasing based on the input prompt, it maintains perplexity comparable to unwatermarked text while significantly enhancing robustness against paraphrasing and lexical perturbation attacks.
- Train in Vain: Functionality-Preserving Poisoning to Prevent Unauthorized Use of Code Datasets
-
This paper proposes FunPoison, which injects execution-lazy weak-use fragments into real execution paths while keeping Java code compilable, executable, and functionally equivalent. Poisoning only 10% of the data significantly reduces the gains from unauthorized CodeLLM fine-tuning, demonstrating strong robustness against formatting, rewriting, static analysis, and detection-based cleaning.
- TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
-
This paper proposes TrajGuard, a training-free decoding-time jailbreak defense framework. By aggregating key-layer hidden-state trajectories via a sliding window to quantify risk in real-time, it triggers a lightweight semantic judge only when risks persistently exceed a threshold. TrajGuard achieves a 95% average defense rate across 12 jailbreak attacks with a detection latency of only 5.2ms/token and a false positive rate below 1.5%.
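The "persistently exceeds a threshold" trigger can be illustrated with a small sliding-window monitor. This is a sketch under assumptions: per-token risk scores are taken as given (how TrajGuard derives them from hidden states is not shown), the window aggregate is a plain mean, and `window`, `threshold`, and `patience` are hypothetical parameters.

```python
from collections import deque


def first_trigger_step(risk_scores, window=3, threshold=0.7, patience=2):
    """Return the first step at which the windowed mean risk has stayed
    above `threshold` for `patience` consecutive steps, else None."""
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    streak = 0
    for step, score in enumerate(risk_scores):
        recent.append(score)
        if len(recent) < window:
            continue  # wait until the window is full
        if sum(recent) / window > threshold:
            streak += 1
            if streak >= patience:
                return step  # here a heavier judge would be invoked
        else:
            streak = 0
    return None


if __name__ == "__main__":
    # An isolated spike does not fire; sustained high risk does.
    print(first_trigger_step([0.1, 0.95, 0.1, 0.1, 0.1]))
    print(first_trigger_step([0.1, 0.9, 0.1, 0.8, 0.9, 0.9, 0.9]))
```

Requiring a sustained streak rather than a single high score is one simple way to keep the expensive judge off the common path.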
- TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards
-
This paper models automated multi-turn jailbreak attacks as a multi-turn reinforcement learning problem and proposes TROJail. By introducing two heuristic process rewards (penalty for excessive toxicity and semantic correlation progression), it alleviates the sparse supervision issue of outcome rewards, significantly improving attack success rates across multiple models and benchmarks.
- Understanding and Mitigating Bias Inheritance in LLM-based Data Augmentation on Downstream Tasks
-
This paper systematically investigates how biased augmented data generated by LLMs is inherited and amplified during supervised fine-tuning (SFT), impacting downstream tasks. Using six types of bias generation frameworks across ten tasks and three categories of mitigation methods, it reveals the complex phenomenon that "more synthetic data does not necessarily mean higher safety."
- Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning
-
This paper argues that existing LLM unlearning methods often hallucinate, feign refusal, or exhibit inconsistency even after "forgetting" target knowledge. It proposes an honest unlearning evaluation framework and the ReVa representation alignment method to ensure models stably admit their lack of knowledge after unlearning.
- VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models
-
This paper proposes VLA-Forget, the first hybrid unlearning framework for Vision-Language-Action (VLA) models. By employing ratio-aware selective editing for perception/cross-modal layers and significance-based selective editing for reasoning/action layers, it achieves target behavior removal while maintaining perception accuracy (+22%) and task success rate (+9%).
- When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life
-
This paper proposes the SaLAD benchmark, comprising 2,013 real-world image-text samples across 10 daily life categories. It evaluates the ability of Multimodal Large Language Models (MLLMs) to identify implicit safety risks and provide safety warnings during daily assistance, revealing that even the strongest models achieve only 57.2% accuracy on unsafe queries.
- When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models
-
This paper discovers that safety failures in Large Reasoning Models (LRMs) often occur when "the model identifies a risk but subsequently overturns it during further reasoning." It proposes Chain-of-Guardrail (CoG) to locate and repair dangerous reasoning segments, significantly reducing attack success rates while preserving mathematical and coding reasoning capabilities.
- Why Agents Compromise Safety Under Pressure
-
This paper proposes the concept of "Agentic Pressure"—the phenomenon where LLM agents spontaneously exhibit normative drift by sacrificing safety to maintain helpfulness when resource constraints prevent them from simultaneously completing tasks and following safety rules. Notably, models with stronger reasoning capabilities are more adept at constructing verbal rationalizations to justify these violations.
- XMark: Reliable Multi-Bit Watermarking for LLM-Generated Texts
-
This paper proposes XMark, a multi-bit text watermarking method based on the Leave-one-Shard-out (LoSo) strategy and evergreen lists. By intersecting green lists arranged across multiple vocabulary permutations and constraining the token-shard mapping matrix, it significantly improves decoding accuracy under limited token conditions while maintaining text quality.
- XOXO: Stealthy Cross-Origin Context Poisoning Attacks against AI Coding Assistants
-
This work identifies a design vulnerability in the automatic context collection of AI coding assistants and proposes the Cross-Origin Context Poisoning (XOXO) attack. By applying semantics-preserving transformations (e.g., variable renaming) to poison shared codebases, assistants like GitHub Copilot are misled into generating buggy or vulnerable code. The attack achieves an average success rate of 73.20% across 8 SOTA models.