🔒 LLM Safety

🧪 ICML2026 · 18 paper notes

📌 Same area in other venues: 💬 ACL2026 (66) · 📷 CVPR2026 (29) · 🔬 ICLR2026 (54) · 🤖 AAAI2026 (43) · 🧠 NeurIPS2025 (83) · 📹 ICCV2025 (13)

🔥 Top topics: LLM ×10 · Adversarial Robustness ×5 · Reasoning ×4 · Agents ×2

From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity

This work shifts LLM hallucination detection from "output probability analysis" to "loss landscape curvature": by injecting Gaussian noise into embeddings and measuring the perturbation in gradient direction and magnitude as a cheap proxy for the Hessian spectral radius, the method outperforms entropy, Semantic Entropy, EigenScore, and other baselines in AUROC across 12 model-dataset combinations.
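
The noise-injection idea can be illustrated on a toy problem. The sketch below is not the paper's implementation: `gradient_sensitivity`, `grad_fn`, `sigma`, and `n_samples` are hypothetical names, the "loss" is a plain quadratic rather than an LLM loss, and direction and magnitude changes are folded into a single norm of the gradient difference. It only shows why a sharper landscape yields a larger gradient change under the same Gaussian perturbation.

```python
import numpy as np

def gradient_sensitivity(grad_fn, x, sigma=1e-2, n_samples=32, seed=0):
    """Average change in the gradient per unit of Gaussian input noise.

    For a smooth loss this approximates ||H @ eps|| with eps ~ N(0, I),
    so it grows with curvature without ever forming the Hessian H.
    """
    rng = np.random.default_rng(seed)
    g0 = grad_fn(x)  # gradient at the clean input
    changes = []
    for _ in range(n_samples):
        noise = sigma * rng.standard_normal(x.shape)
        g = grad_fn(x + noise)  # gradient at the perturbed input
        changes.append(np.linalg.norm(g - g0) / sigma)
    return float(np.mean(changes))

# Toy quadratic losses 0.5 * c * ||x||^2, whose gradient is c * x.
x = np.ones(8)
flat = gradient_sensitivity(lambda v: 0.1 * v, x)    # low curvature
sharp = gradient_sensitivity(lambda v: 10.0 * v, x)  # high curvature
print(flat < sharp)
```

In an actual detector, `grad_fn` would presumably be the gradient of the model's loss with respect to the input embeddings, and the score would be thresholded or ranked to compute AUROC.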

From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning

By tracking cumulative parameter drift along "dangerous" and "safe" directions during LoRA fine-tuning, the authors show that benign data breaks alignment because parameters drift monotonically toward the dangerous direction. They propose SQSD, which assigns each sample a continuous risk score based on the difference between its single-step gradient projections onto these two directions. SQSD maintains a monotonic ASR ranking across 3 models × 2 datasets and generalizes across architectures, scales, and LoRA→Full transfer.
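
A projection-difference score of this kind can be sketched in a few lines. This is an illustrative assumption, not SQSD itself: the function name `risk_score`, the use of a plain SGD step (`-grad`) as the update, and the unit-normalization of the two directions are all choices made here for the example.

```python
import numpy as np

def risk_score(grad, danger_dir, safe_dir):
    """Projection of one update step onto the dangerous direction
    minus its projection onto the safe direction (hypothetical form)."""
    update = -np.asarray(grad, dtype=float)  # direction of a plain SGD step
    d = np.asarray(danger_dir, dtype=float)
    s = np.asarray(safe_dir, dtype=float)
    d = d / np.linalg.norm(d)  # compare directions, not lengths
    s = s / np.linalg.norm(s)
    return float(update @ d - update @ s)

# Two orthogonal toy directions in a 2-D "parameter space".
danger, safe = [1.0, 0.0], [0.0, 1.0]
risky = risk_score([-1.0, 0.0], danger, safe)   # step moves along danger
benign = risk_score([0.0, -1.0], danger, safe)  # step moves along safe
```

A positive score means the sample's update leans toward the dangerous direction; sorting samples by this value gives a continuous ranking.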

Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping

This paper proposes ARS for hallucination detection in large reasoning models (LRMs): instead of perturbing the reasoning trace at the text level, it directly applies small perturbations to the latent representation at the end of the trace and continues decoding to obtain counterfactual answers. Using "answer agreement" as a label, a lightweight contrastive head is trained to shape the trace-conditioned answer embedding, enabling subsequent embedding-based detectors to better separate hallucinations from truthful answers (AUROC on TruthfulQA improves from 66.85 to 86.64).

Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models

This paper targets the vulnerability of large reasoning models (LRMs) to "overthinking" triggered by logically deficient inputs. It proposes a hierarchical genetic algorithm (HGA) that, under pure black-box conditions, treats structurally decomposed questions as genes. Through sentence-level and question-level crossover and mutation (addition/deletion), it searches for adversarial samples with logical breaks. On MATH, it can amplify response length by up to 26.1×, enabling low-cost DoS attacks.

Internalizing Safety Understanding in Large Reasoning Models via Verification

This paper argues that "being able to generate safe answers" ≠ "understanding safety," and proposes the SInternal framework: training large reasoning models solely to verify the safety of their own generated answers. The resulting emergent internal safety understanding significantly suppresses jailbreak attacks (StrongREJECT ASR drops from 41% to 0.6%) and provides a better starting point for subsequent RL.

Jailbreaking Vision-Language Models Through the Visual Modality

The authors propose four attacks that jailbreak state-of-the-art VLMs solely via visual input (visual cipher / object replacement / text replacement / visual analogy riddles). Systematic evaluation on six advanced VLMs demonstrates that "safety alignment on the text side does not automatically transfer to the visual side," and mechanistic analysis reveals the underlying hierarchical mechanisms.

Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models

This paper reveals a previously overlooked failure mode of Test-Time Scaling (TTS): simply reducing the diversity of candidate responses makes TTS more likely to output unsafe content than feeding it adversarial prompts directly. The authors propose RefDiv, a genetic algorithm driven by two signals, Shannon entropy and reference guidance, which efficiently jailbreaks both MCTS and Best-of-N pipelines across models, closed-source systems, and guardrails.
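
As a rough illustration of the diversity signal, the Shannon entropy of a candidate pool can be computed as below. This is a sketch under assumptions: `candidate_entropy` is a hypothetical helper, and exact string matching stands in for whatever grouping or token-level distribution the paper actually measures.

```python
import math
from collections import Counter

def candidate_entropy(candidates):
    """Shannon entropy (in bits) of the empirical distribution over candidates."""
    counts = Counter(candidates)
    n = len(candidates)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

collapsed = candidate_entropy(["A", "A", "A", "A"])  # no diversity
diverse = candidate_entropy(["A", "B", "C", "D"])    # maximal diversity for 4 items
```

A pool where every candidate is identical scores 0 bits, while four distinct candidates score 2 bits; an attack that lowers this value is shrinking the set of options the TTS selector can choose from.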

Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization

Metis reformulates multi-turn jailbreaking as an inference-time policy optimization problem. Within an adversarial POMDP framework, the Attacker and the Metacognitive Evaluator form a closed loop: dense analytical feedback from the Evaluator is used as a "semantic gradient" to guide the Attacker's belief update and policy improvement. The method adapts to 10 frontier models (including O1 / GPT-5-chat / Claude-3.7) with an average ASR of 89.2%, while reducing token consumption by 8.2× compared to strong baselines, all without retraining any weights.

MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety

MultiBreak employs an iterative framework of "active learning + uncertainty-guided rewriting" to expand a multi-turn jailbreak dataset to 10,389 dialogues and 2,665 unique harmful intents, achieving a diversity of 0.942 that far surpasses previous work. On DeepSeek-R1-7B / GPT-4.1-mini, it improves ASR by 54% / 34.6% over the next-best dataset.

OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

OTora introduces a novel attack paradigm, Reasoning-Level Denial-of-Service (R-DoS), which traps LLM agents in prolonged multi-turn overthinking without compromising task correctness. Its two-stage red-teaming pipeline first uses insertion-aware optimization to induce the agent to proactively access attacker-controlled external resources, then deploys "reasoning-type payloads", optimized via ICL genetic search, at those resources. On WebShop, Email, and OS agents, this achieves up to 10× reasoning-token inflation and orders-of-magnitude latency increases, with final task accuracy nearly unchanged.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

REALISTA constructs an "input-dependent edit direction dictionary" in the LLM latent space, turning adversarial prompt optimization into a continuous problem under a simplex constraint. This approach preserves the semantic equivalence and coherence of discrete methods like SECA, while achieving the search flexibility of continuous methods like LARGO. It is the first to successfully induce hallucinations in free-form outputs of closed-source reasoning models such as GPT-5.

SafeHarbor: Defining Precise Decision Boundaries via Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

SafeHarbor upgrades LLM agent safety from "static coarse-grained classifiers" to a "dynamic hierarchical memory tree + dual-score gating." Through adversarial rule generation and entropy-driven self-evolution, it lets GPT-4o maintain a 93%+ refusal rate while raising benign tool invocation success to 63.6%, significantly alleviating the over-refusal problem.

Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks

This paper proves that all existing harmful fine-tuning (HFT) defenses that impose constraints in parameter space can be circumvented due to parameter redundancy. It proposes Safety Bottleneck Regularization (SBR), which shifts the defense to the unembedding layer, a geometric bottleneck: by anchoring only the final hidden state of a single high-risk prompt, the Harmful Score can be suppressed to < 10 under 50 epochs of sustained HFT attack, without harming benign task accuracy.

Self-Debias: Self-correcting for Debiasing Large Language Models

Self-Debias reframes LLM debiasing as "fair resource allocation of probability mass along the autoregressive reasoning chain": it uses trajectory-level suffix margins as resource units, applies the Jain fairness index to prevent resource collapse on easy samples, and combines cold-start SFT with consistency-filtered online self-training. With only 20k labeled seeds, it boosts Qwen3-8B's average score across 8 fairness/utility benchmarks from 77.5 to 81.7, and reverses the base model's "self-correction collapse" into a stable +0.4 improvement.
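
For reference, the Jain fairness index of allocations \(x_1,\dots,x_n\) is \((\sum_i x_i)^2 / (n \sum_i x_i^2)\); it equals 1 when all shares are equal and \(1/n\) when one entry takes everything. The sketch below computes only this standard index. How Self-Debias feeds suffix margins into it is not shown, and the helper name `jain_index` and the all-zero convention are assumptions made here.

```python
def jain_index(allocations):
    """Jain fairness index: (sum x)^2 / (n * sum x^2), in [1/n, 1]."""
    n = len(allocations)
    total = sum(allocations)
    squares = sum(a * a for a in allocations)
    if squares == 0:
        return 1.0  # assumption: all-zero allocations count as perfectly even
    return (total * total) / (n * squares)

even = jain_index([1.0, 1.0, 1.0, 1.0])       # every sample gets the same share
collapsed = jain_index([1.0, 0.0, 0.0, 0.0])  # one sample takes everything
```

Here `even` evaluates to 1.0 and `collapsed` to 0.25, so a regularizer that pushes the index upward discourages concentrating all of the margin on a few easy samples.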

Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

This paper identifies two major sources of instability in existing GFlowNet red-teaming: high variance from partition function \(Z_\theta\) estimation, and mode collapse caused by noisy rewards from toxicity classifiers on OOD gibberish text. The authors propose three simple components: a pairwise contrastive objective (CTB) that eliminates \(Z_\theta\), Noisy Gradient Pruning to filter uninformative pairs, and a Min-K Fluency Stabilizer to block gibberish. Together these boost the number of unique attacks on Qwen2.5-1.5B from 17 to 134 (about 7.9×), maintain a 92% ASR, and outperform baselines in cross-model/cross-defense transferability.
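
To see why a pairwise objective can remove the partition function, recall the standard trajectory balance condition for a trajectory \(\tau\) ending in object \(x\). The derivation below is a generic sketch; the exact form of CTB in the paper may differ, and the loss \(\mathcal{L}_{\text{pair}}\) is written here only for illustration.

\[
\log Z_\theta + \log P_F(\tau;\theta) = \log R(x) + \log P_B(\tau \mid x)
\]

Writing the same condition for two trajectories \(\tau_1 \to x_1\) and \(\tau_2 \to x_2\) and subtracting cancels \(\log Z_\theta\):

\[
\log P_F(\tau_1;\theta) - \log P_F(\tau_2;\theta) = \bigl[\log R(x_1) + \log P_B(\tau_1 \mid x_1)\bigr] - \bigl[\log R(x_2) + \log P_B(\tau_2 \mid x_2)\bigr]
\]

A pairwise loss can then penalize the squared violation of this equality:

\[
\mathcal{L}_{\text{pair}}(\tau_1,\tau_2) = \Bigl(\log \frac{P_F(\tau_1;\theta)}{P_F(\tau_2;\theta)} - \log \frac{R(x_1)\,P_B(\tau_1 \mid x_1)}{R(x_2)\,P_B(\tau_2 \mid x_2)}\Bigr)^2
\]

For left-to-right text generation each sequence has a single construction path, so \(P_B = 1\) and the target reduces to the log reward ratio \(\log R(x_1) - \log R(x_2)\), with no \(Z_\theta\) to estimate.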

STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack

This paper treats the entire denoising trajectory of T2I models as the "attack surface" for VLM red-teaming attacks. It proposes a hierarchical RL framework (STARE) combining a high-level prompt editor and low-level GRPO fine-tuning of rectified-flow models. This approach not only improves attack success rate by 68% over SOTA, but also reveals a novel phenomenon—Optimization-Induced Phase Alignment: adversarial optimization automatically binds "conceptual toxicity" to early denoising and "detail toxicity" to later stages, transforming the chaotic toxicity formation process into several predictable "vulnerability time windows."

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

This work uses Causal Tracing to show that "refusal" in LLMs is not a static vector at the terminal token, but a "refusal trajectory" spanning upstream intermediate layers and tokens. Based on this, the authors design SALO—a detector with <20M parameters, trained only on standard alignment data, yet able to leverage the irreversibility of Transformer causal masks to identify adversarial attacks such as GCG, AutoDAN, and Prefilling. SALO raises detection rates from 0% to over 85% on GCG/Prefilling attacks.

Watermarking LLM Agent Trajectories (ACTHOOK)

ACTHOOK introduces the "software hook" concept into agent trajectories: at action boundaries, it inserts an extra action triggered by a secret key as a watermark. LLMs trained on such data will execute the hook at significantly higher frequency when prompted with the key, enabling copyright detection via black-box queries only. The average AUC reaches 94.3 with almost no impact on downstream task performance.