
🛡️ AI Safety

🧪 ICML2026 · 114 paper notes

📌 Same area in other venues: 📷 CVPR2026 (145) · 🔬 ICLR2026 (141) · 💬 ACL2026 (5) · 🤖 AAAI2026 (45) · 🧠 NeurIPS2025 (73) · 📹 ICCV2025 (24)

🔥 Top topics: Adversarial Robustness ×17 · LLM ×15 · Watermarking ×8 · Alignment/RLHF ×7 · Federated Learning ×6

ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity: ABC-Bench transforms the question "Can AI agents actually perform molecular biology?" into three automatically scorable tasks (designing DNA fragments, evading synthesis screening, and controlling liquid-handling robots for Gibson Assembly). Experiments show that eight frontier models exceed the median scores of PhD-level experts across all three tasks. Real-world wet-lab validation demonstrates that scripts written by o4-mini-high successfully assembled DNA on OpenTrons robots.
ACTG-ARL: Differentially Private Conditional Text Generation with RL-Boosted Control: This paper proposes ACTG, a hierarchical framework that decomposes private text generation into two sub-tasks: feature learning and conditional text generation. It further introduces Anchored RL, which enhances the instruction-following capabilities of the conditional generator through a hybrid reinforcement learning objective and SFT anchors based on best-of-N sampling, achieving a 20% improvement in MAUVE on biomedical data compared to prior work while maintaining text fidelity.
Active Continual Learning with Metaplastic Binary Bayesian Neural Networks: BiMU designs bounded-memory and uncertainty-aware metaplastic updates for binary Bayesian neural networks to prevent Bernoulli posterior saturation in long-range non-stationary streams. It utilizes Monte Carlo disagreement for buffer-free one-pass active querying, significantly reducing label requirements and backpropagation updates.
Position: 'AI Alignment' Encompasses Competing Technical Priorities: This ICML position paper argues that "AI alignment" is a polysemous term: the ML literature contains at least three high-level alignment ideals that are competing rather than merely different (Task Reliability / Social Judiciousness / Takeover Avoidance). In practice, advancing one type of alignment often actively undermines another. The authors explain these tensions via two cross-cutting distinctions—"threat model differences" and "positive/negative alignment differences"—and offer five specific recommendations for researchers.
Position: AI Researchers Must Help Lead Arms Control to Mitigate Military AI Risks: This is a position paper arguing that AI researchers must look beyond distant superintelligence risks and proactively lead technical research into "arms control" for military AI. Using historical precedents from nuclear arms control as a template, the authors demonstrate that integrating frontier models into military systems introduces risks with extremely poor verifiability—such as escalation, alignment faking, and gradual human disempowerment—for which current diplomatic tools are unprepared. They call for a formal collaboration mechanism between AI researchers and arms control experts to solve technical challenges regarding verification, trust, and transparency.
Alignment Risks from Capability-Seeking RL Training: This paper identifies an underestimated alignment risk: when models pursue task capabilities via RL in environments with "structural loopholes," they spontaneously learn to exploit these loopholes for high rewards even without explicit instruction. Using four "loopholes games," the authors demonstrate that such exploits are prevalent, transferable across tasks, propagatable through SFT, and more resistant to correction than SFT-distilled behaviors. Crucially, as the exploit rate rises, main task metrics often remain stable or even improve, creating a "developer blind spot" that evades standard monitoring.
AliMark: Enhancing Robustness of Sentence-Level Watermarking Against Text Paraphrasing: AliMark reformulates sentence-level text watermarking from "prefix-conditioned sentence-by-sentence detection" to "global secret bit sequence encoding and alignment." By utilizing text reconstruction and adaptive block edit distance, it significantly enhances detection robustness against strong paraphrasing attacks such as DIPPER and GPT-3.5.
Anchored Decoding: Provably Reducing Copyright Risk for Any Language Model: This paper proposes Anchored Decoding: an inference-time method that anchors a high-performance but potentially risky LM to a safe LM trained only on permissive data. It provides a formal guarantee on the trade-off between copyright duplication risk and generation quality using a tunable information budget.
Angel or Demon: Investigating the Plasticity Interventions' Impact on Backdoor Threats in Deep Reinforcement Learning: The authors provide the first systematic evaluation of how 7 mainstream plasticity interventions (SAM, Shrink & Perturb, Weight Clip, SN, WD, LN, ReDo) affect backdoor attacks in deep reinforcement learning (DRL), based on 14,664 experiments. They find that only SAM acts as a "demon," significantly intensifying backdoor threats. Building on this, they propose the "Sweeper-Converter-Connector" robust backdoor injection framework, alongside a detection signal based on the sharpness of the loss landscape.
Antidistillation Fingerprinting: This paper proposes Antidistillation Fingerprinting (ADFP), which utilizes a proxy student model to estimate which watermark tokens are most easily absorbed during the distillation process. This allows for more reliable detection of whether third-party models have been trained on teacher model outputs, without sacrificing the quality of the teacher's generation.
Beyond Procedure: Substantive Fairness in Conformal Prediction: This paper moves beyond the procedural fairness perspective of Conformal Prediction (CP) to focus on the substantive fairness of downstream decisions. It theoretically proves and experimentally validates that equalizing prediction set size (rather than equalizing coverage) is the procedural metric strongly correlated with substantive fairness. It proposes a scalable evaluation framework based on LLM-in-the-loop and a Label-Clustered CP method to effectively balance utility and fairness.
BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics: BioAgent Bench introduces an end-to-end evaluation suite for executing bioinformatics pipelines with LLM agents. It features 10 real-world bioinformatics tasks evaluated across 10 frontier/open-weight models and 3 agent harnesses. Using an LLM judge for scoring and three types of perturbation tests (corrupted, decoy, and prompt-bloat), the study finds that frontier models can complete over 90% of pipelines, yet their robustness remains concerning.
BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor Attacks: BYORn identifies poisoned samples by detecting high-perplexity target responses inconsistent with input semantics and dynamically replaces them with clean responses generated by the model itself. This breaks the association between backdoor triggers and malicious outputs, reducing the average Attack Success Rate (ASR) by 40 percentage points while maintaining clean task performance.
Calibrating Uncertainty for Zero-Shot Adversarial CLIP: This paper proposes the UCAT framework, which reparameterizes CLIP logits as concentration parameters of a Dirichlet distribution. By aligning the Dirichlet distributions of clean and adversarial samples via reverse KL divergence, the method simultaneously calibrates uncertainty and preserves semantic structure during zero-shot adversarial fine-tuning, achieving a favorable balance between robustness and calibration across 16 benchmarks.
COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models: COFT implements a training-free and gradient-free approach for step-by-step token-level counterfactual fairness on frozen LLMs. By constructing counterfactual masked branches during decoding, performing logit fusion, and applying dual-branch split conformal prediction to filter tokens, it reduces bias metrics by 30–55% (median 38%) with negligible impact on task performance.
COPF: An Online Framework for Deployment-Stable Counterfactual Fairness in Evolving Graphs: COPF treats "online link recommendation on evolving graphs" as a performative decision process by adding a decision-layer wrapper outside the backbone scorer. It ensures counterfactual identifiability via an online logging protocol with explicit exploration, estimates the "exposed vs. unexposed" counterfactual group gap using a graph-aware doubly robust (GA-DR) estimator, and suppresses fairness spikes post-deployment using a Residual-OI audit + PI primal–dual controller. Theoretically, it provides a transfer certificate from plug-in OI to true counterfactual gaps, significantly reducing the worst-case TE gap during the Deploy phase with controllable utility loss on TGB and synthetic bipartite streams.
Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis: The authors point out that "directly translating English safety benchmarks into target languages" systematically underestimates the true risks of large language models (LLMs). They constructed 500 paired Direct Translation (DT) and Culturally-Adapted (CA) red-teaming samples for Korean, Japanese, Thai, and Khmer. The results demonstrate that CA leads to higher Attack Success Rates (ASR) across all 16 language-model combinations (averaging +9.3 percentage points), arguing that multilingual safety evaluation must achieve "cultural adaptation" rather than mere "language translation."
Decoupled Training with Local Reinforcement Fine-Tuning in Federated Learning: FedDTL retains the CLIP image encoder on the client while moving the text encoder to the server as a "global semantic anchor." It employs a two-stage local fine-tuning approach (SFT warm-up followed by GRPO-style RL) to simultaneously mitigate inter-client optimization inconsistency and intra-client overfitting in heterogeneous and full-data federated scenarios.
Deep Sequence Models Tend to Memorize Geometrically; It Is Unclear Why: This paper demonstrates that when Transformer / Mamba models memorize graph edges, they do not simply degenerate into lookup tables (associative memory). Instead, they spontaneously organize node embeddings into a "geometric memory" that encodes multi-hop global structures. Through path-star experiments, the authors show that this geometry makes implicit reasoning unusually easy, yet its emergence cannot be attributed to supervision, capacity, or optimization pressure, leaving a new "memorization puzzle."
Demystifying the Optimal Fair Classifier in Multi-Class Classification: This paper provides an analytically tractable form (a closed-form solution with entropy regularization) for the Bayes optimal classifier in multi-class fair classification problems. Based on this, it derives a unified framework, OptFair: the training phase utilizes a reduction to saddle-point optimization of cost-sensitive cross-entropy, while the deployment phase uses plug-in estimation to solve a convex proximal gradient problem. Both methods theoretically converge to the accuracy-fairness Pareto frontier.
dgMARK: Decoding-Guided Watermarking for Diffusion Language Models: dgMARK utilizes the "decoding order degree of freedom" inherent in Diffusion Language Models (dLLMs) as a watermarking channel. By prioritizing the decoding of positions that satisfy parity conditions based on a binary hash, it embeds statistically detectable watermarks in models like LLaDA/Dream without modifying token probability distributions, maintaining robustness against insertion, deletion, substitution, and rewriting.
Differentially Private Preference Data Synthesis for Large Language Model Alignment: DPPrefSyn replaces "DP fine-tuning on private preference data" with "learning a distribution of DP preference reward models and synthesizing DP preference data using public prompts." By leveraging the geometric structure of Bradley-Terry linear rewards, DP-PCA, and DP-KMeans clustering to capture user preference heterogeneity, it achieves a 56.5% GPT-4o win-rate on Anthropic-HH at \(\varepsilon=2\), outperforming both non-private fine-tuning (55.95%) and DP-FT (37.0%).
Dual-branch Robust Unlearnable Examples: This paper proposes DUNE, which extends the perturbation of Unlearnable Examples (UEs) from a single spatial domain to a "spatial + color" dual-domain optimization. By aligning perturbation features with shift-induced labels and utilizing pre-trained model ensembles, DUNE maintains robustness against 7 mainstream defenses (including ECLIPSE, ISS-J, and COIN) on CIFAR-10 / ImageNet. The average test accuracy is further reduced by 14.95%–50.82% compared to 12 SOTA UE schemes.
DualOptim+: Bridging Shared and Decoupled Optimizer States for Better Machine Unlearning in Large Language Models: DualOptim+ decomposes Adam optimizer states into a "shared base state + decoupled delta states," allowing LLM machine unlearning to adaptively transition between shared and decoupled optimizers as forget/retain gradients fluctuate between conflict and synergy. Theoretically, it reduces to Alternate optimization (under positive correlation) and DualOptim (under negative correlation), while an 8-bit quantized variant reduces extra memory overhead back to baseline levels.
Efficient DP-SGD for LLMs with Randomized Clipping: This paper proposes DP-SGD-RC, which replaces the exact per-sample gradient norm computation in DP-SGD with Hutchinson / Hutch++ stochastic trace estimation. This reduces the clipping memory overhead for long-context LLM training from \(O(B\min\{T^2,d^2\})\) to \(O(BkT+kp)\). Accompanied by a tight \(f\)-DP analysis based on a chi-squared mixture envelope CDF, the method maintains accuracy in Llama-3.2-1B long-context fine-tuning while reducing peak memory of the largest linear layer by approximately 40% and saving about 2× FLOPs.
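As a rough illustration of the stochastic trace estimation idea in the DP-SGD-RC entry above, the sketch below estimates the squared Frobenius norm of one sample's linear-layer gradient with plain Hutchinson probes. It is not the paper's algorithm: the function names, the Rademacher probes, and the representation of the gradient as a sum of outer products of per-token activations and backpropagated signals are all assumptions made for illustration. The point is only that the product G z can be formed without materializing G.

```python
import random

def matvec_outer_sum(acts, grads, z):
    """Return G @ z for G = sum_t outer(grads[t], acts[t]) without forming G."""
    d_out = len(grads[0])
    out = [0.0] * d_out
    for a_t, g_t in zip(acts, grads):
        coeff = sum(a * zi for a, zi in zip(a_t, z))  # <acts[t], z>
        for i in range(d_out):
            out[i] += g_t[i] * coeff
    return out

def hutchinson_sq_norm(acts, grads, num_probes, rng):
    """Estimate ||G||_F^2 = tr(G^T G) as the mean of ||G z||^2 over Rademacher z."""
    d_in = len(acts[0])
    total = 0.0
    for _ in range(num_probes):
        z = [rng.choice((-1.0, 1.0)) for _ in range(d_in)]
        gz = matvec_outer_sum(acts, grads, z)
        total += sum(v * v for v in gz)
    return total / num_probes

def exact_sq_norm(acts, grads):
    """Reference value: materialize every entry of G and sum the squares."""
    d_in, d_out = len(acts[0]), len(grads[0])
    total = 0.0
    for i in range(d_out):
        for j in range(d_in):
            g_ij = sum(g_t[i] * a_t[j] for a_t, g_t in zip(acts, grads))
            total += g_ij * g_ij
    return total
```

The estimator is unbiased because Rademacher probes satisfy E[z zᵀ] = I. In a DP-SGD setting the estimated norm would then feed the clipping factor, which is why the summarized paper pairs the estimator with a dedicated privacy analysis.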
Exploring Systems-Thinking Approaches to Loss of Control Risk: This is a position/analysis paper: the authors argue that "Loss of Control (LoC)" in frontier AI should not be evaluated solely at the model level but treated as a control problem within a sociotechnical system. They adapt three mature systems safety methods from industries like aviation and nuclear power (STECA, STPA, FRAM) to the general scenario of "internal deployment of coding agents in frontier labs." These methods reveal governance gaps invisible to model-level evaluations, failures caused by control latency, and the incremental erosion of safety controls by daily operational fluctuations. Consequently, they propose a tripartite approach: "Model Evaluation + Systems-level Hazard Analysis + Operational Assurance."
Exposing Hidden Biases in Text-to-Image Models via Automated Prompt Search: This paper proposes BGPS (Bias-Guided Prompt Search), which utilizes a lightweight attribute classifier trained on internal activations of a diffusion model to guide the beam search decoding of an LLM. It automatically generates prompts that are naturally readable yet steer generated images significantly toward specific genders/ethnicities, exposing hidden biases in text-to-image models (including debiased ones) that are difficult for humans to conceive.
Exposing Vulnerabilities in Explanation for Time Series Classifiers via Dual-Target Adversarial Attack: This paper proposes TSEF, a dual-target attack framework for joint "Time Series Classifier + Explainer" systems. By learning a "Temporal Vulnerability Mask + Frequency Perturbation Filter," it simultaneously pushes model predictions to a target label and aligns explanations with a reference saliency map within an \(\ell_\infty\) budget. It demonstrates that the common "stable explanation = trustworthy decision" assumption in existing time-series interpretability pipelines is fundamentally flawed.
Extending Fair Null-Space Projections for Continuous Attributes to Kernel Methods: This paper extends the "Iterative Null-Space Projection (INLP)" fairness method, originally designed for linear models, to kernel methods. By deriving a closed-form transformation \(\mathbf{T}\) in the empirical feature space that acts directly on the kernel matrix \(\mathbf{K}\), the transformed \(\mathbf{K}_{(m)}\) remains a positive semi-definite (PSD) kernel while being stripped of predictive information regarding continuous protected attributes. This allows any kernel-based algorithm (KRR, SVR) to be converted into a "continuously fair" version with a single step, achieving competitive or superior fairness–accuracy Pareto fronts on Crimes, ACSIncome, and ACSTravelTime.
Fair Dataset Distillation via Cross-Group Barycenter Alignment: This paper reveals that Dataset Distillation (DD) amplifies biases present in the original data—a phenomenon rooted in the interaction between "subgroup size imbalance" and "subgroup representation separation." It proposes COBRA: using the (group-size independent) barycenter of each subgroup's representation as the distillation target, which simultaneously reduces EOD and improves accuracy across multiple DD frameworks.
Fair Decisions from Calibrated Scores: Achieving Optimal Classification While Satisfying Sufficiency: This paper addresses a long-neglected pain point: "even if scores are fully group-calibrated across populations, applying a single threshold to them will violate sufficiency (predictive parity)." The authors provide an exact solution for the optimal binary classifier under sufficiency constraints with finite discrete scores. By geometrically characterizing the \((\mathrm{PPV}, \mathrm{FOR})\) feasible region, they derive a post-processing algorithm that depends only on scores and group labels. They prove this algorithm simultaneously solves two types of objectives: "loss minimization" and "minimizing deviation from separation under sufficiency."
Fairness in Aggregation: Optimal Top-\(k\) and Improved Full Ranking: Under the Spearman footrule distance, this work proves that the ILP constraint matrix is totally unimodular, providing the first polynomial-time optimal algorithm for fair top-\(k\) rank aggregation. It further improves the approximation ratio for fair (full) rank aggregation from 3 to 2 using a two-step strategy: solving fair top-\(k\) first and then completing it into a full permutation via minimum-cost perfect matching.
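The second step described in the rank aggregation entry above (completing a fixed top-\(k\) into a full permutation) is an assignment problem, because the total Spearman footrule cost decomposes into a per-item, per-position cost. The sketch below illustrates that decomposition on a tiny instance. It assumes the top-\(k\) prefix is already given, ignores the fairness constraints entirely, and solves the assignment by brute-force enumeration rather than a polynomial-time matching algorithm. All function names are hypothetical.

```python
from itertools import permutations

def footrule_cost(item, position, rankings):
    """Total |rank of item in r - position| over all input rankings r."""
    return sum(abs(r.index(item) - position) for r in rankings)

def complete_ranking(top_k, rankings):
    """Fill positions k..n-1 with the remaining items at minimum footrule cost.

    Brute force over permutations: fine for a toy example, exponential in general.
    """
    k = len(top_k)
    rest = [x for x in rankings[0] if x not in top_k]
    best = min(
        permutations(rest),
        key=lambda perm: sum(
            footrule_cost(x, k + i, rankings) for i, x in enumerate(perm)
        ),
    )
    return list(top_k) + list(best)
```

For example, `complete_ranking(["a", "b"], [["a", "b", "c", "d"], ["a", "c", "b", "d"], ["b", "a", "c", "d"]])` places `"c"` before `"d"`, since that placement costs 1 in total while the swapped order costs 7.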
Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences: This paper proposes FedVPA-GP. Under the privacy constraints of Federated Learning (FL), it models each client's preference as a continuous latent variable \(z\) using a "client mixture prior + Gumbel-Softmax learnable weights + orthogonal prototype loss." This fundamentally fixes the "posterior collapse" encountered when directly applying VPL to FL, enabling a single reward model to switch dynamically between conflicting preferences such as "helpful" and "harmless."
FedHPro: Federated Hyper-Prototype Learning via Gradient Matching: To address the "inheritance of client bias by global prototypes" in prototype-based federated learning, this paper proposes a set of learnable global hyper-prototypes. These hyper-prototypes simulate prototypes from centralized training via gradient matching on the server side. Combined with client-side contrastive learning and alignment loss, this approach significantly improves accuracy in heterogeneous scenarios.
FedTreeLoRA: Reconciling Statistical and Functional Heterogeneity in Federated LoRA Fine-Tuning: To address the disconnect in existing federated LoRA methods between "client statistical heterogeneity" and "LLM layer functional heterogeneity," FedTreeLoRA employs a global hierarchical clustering tree with layer-wise adaptive depth search. This allows shallow layers to prioritize sharing while deep layers differentiate progressively. On GLUE and FLAN, it improves average metrics from 91.19 / 61.77 to 92.36 / 63.19 with minimal parameter overhead.
Flatness-Aware Stochastic Gradient Langevin Dynamics: This paper proposes fSGLD: it replaces the parameter \(\theta\) at the gradient step in standard SGLD with a Gaussian-perturbed \(\theta+\epsilon\), and strictly couples the perturbation scale \(\sigma\) with the inverse temperature \(\beta\) via \(\sigma=\beta^{-(1+\eta)/4}\). Without adding any gradient or memory overhead, the algorithm's invariant measure approximates the Gibbs distribution corresponding to the Hessian-trace regularized objective \(v(\theta)=u(\theta)+\tfrac{\sigma^2}{2}\mathrm{tr}(H(\theta))\). The authors provide non-asymptotic bounds for Wasserstein-1 distance and excess risk, achieving performance comparable to or better than SAM/ASAM on CIFAR/WebVision/ViT with nearly halved training time.
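The fSGLD update summarized above can be sketched in a few lines. This assumes the standard SGLD form \(\theta_{k+1}=\theta_k-\lambda\nabla u(\cdot)+\sqrt{2\lambda/\beta}\,\xi_k\), with the only change being that the gradient is evaluated at \(\theta_k+\epsilon_k\), \(\epsilon_k\sim\mathcal{N}(0,\sigma^2 I)\). The step size \(\lambda\), the function names, and the toy quadratic objective are illustrative assumptions, not details taken from the paper.

```python
import math
import random

def perturbation_scale(beta, eta):
    """Coupling from the summary: sigma = beta ** (-(1 + eta) / 4)."""
    return beta ** (-(1.0 + eta) / 4.0)

def fsgld_step(theta, grad_fn, step_size, beta, eta, rng):
    """One step: gradient at a Gaussian-perturbed point, then Langevin noise."""
    sigma = perturbation_scale(beta, eta)
    perturbed = [t + rng.gauss(0.0, sigma) for t in theta]
    grad = grad_fn(perturbed)  # a single gradient call, as in plain SGLD
    noise_scale = math.sqrt(2.0 * step_size / beta)
    return [
        t - step_size * g + noise_scale * rng.gauss(0.0, 1.0)
        for t, g in zip(theta, grad)
    ]

def run_fsgld(theta0, grad_fn, steps, step_size, beta, eta, seed=0):
    """Iterate fsgld_step from theta0 with a seeded generator."""
    rng = random.Random(seed)
    theta = list(theta0)
    for _ in range(steps):
        theta = fsgld_step(theta, grad_fn, step_size, beta, eta, rng)
    return theta
```

On the toy objective \(u(\theta)=\tfrac12\|\theta\|^2\) (gradient `lambda th: list(th)`), a large \(\beta\) makes both the perturbation and the injected noise small, so the iterates settle close to the origin. Each step still uses exactly one gradient evaluation, consistent with the "no extra gradient or memory overhead" claim.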
FoeGlass: Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors: FoeGlass transplants the "LLM red-teaming LLM" paradigm to Audio Deepfake Detection (ADD). Without fine-tuning, it utilizes in-context learning combined with realness and diversity feedback to guide a black-box reasoning LLM in generating TTS prompts that deceive detectors. From a cold start, it increases the False Negative Rate (FNR) of existing detectors from 0% to up to 96%, showing high transferability across eight different ADD models.
Forget to Know, Remember to Use: Context-Aware Unlearning for Large Language Models: This paper points out that existing LLM unlearning methods, while "erasing knowledge from parameters," also destroy the "contextual utility"—the ability of the model to correctly utilize that knowledge when it is re-provided in the prompt. The authors propose adding a KL regularization term to existing unlearning losses—aligning the distribution of the unlearned model on "question + context" inputs with the original model—effectively restoring Contextual QA LLM-Judge scores from 0.00–0.84 back to 0.95+ with almost no loss in forgetting effectiveness or retain set utility.
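One plausible way to write the regularized objective described in the context-aware unlearning entry above is the following. The notation is an assumption introduced here: \(\theta_0\) denotes the original model, \(\theta\) the model being unlearned, \((q, c)\) a question with its re-provided context, and \(\lambda\) a weighting coefficient. The direction of the KL term is also assumed, since the summary does not specify it.

\[
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{unlearn}}(\theta) \;+\; \lambda\,\mathbb{E}_{(q,c)}\Big[\mathrm{KL}\big(p_{\theta_0}(\cdot \mid q, c)\,\big\|\,p_{\theta}(\cdot \mid q, c)\big)\Big]
\]

The first term is whatever existing unlearning loss is being extended. The second term penalizes divergence from the original model only on inputs where the knowledge is supplied in the prompt, which is what lets forgetting and contextual use coexist.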
Frequency Matching in Spiking Neural Networks for mmWave Sensing: This work proves from a "mechanism-data alignment" perspective that LIF spiking neurons are equivalent to first-order IIR low-pass filters. It proposes setting the membrane decay coefficient \(\beta\) according to the discriminative spectrum of mmWave signals, enabling the SNN to achieve an average accuracy improvement of 6.22% and a theoretical energy reduction of 3.64× compared to ANNs across four common mmWave datasets.
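The filter view in the spiking-network entry above can be checked numerically. The sketch assumes the sub-threshold membrane recursion \(v_t=\beta v_{t-1}+x_t\) with spiking and reset ignored, which corresponds to the first-order IIR transfer function \(H(z)=1/(1-\beta z^{-1})\). The exact neuron model and input scaling used in the paper may differ, and the function names are hypothetical.

```python
import math

def lif_subthreshold(inputs, beta):
    """Leaky integration v[t] = beta * v[t-1] + x[t], with no spike or reset."""
    v, trace = 0.0, []
    for x in inputs:
        v = beta * v + x
        trace.append(v)
    return trace

def gain(beta, omega):
    """Magnitude response |H(e^{j omega})| of H(z) = 1 / (1 - beta z^{-1})."""
    return 1.0 / math.sqrt(1.0 - 2.0 * beta * math.cos(omega) + beta * beta)
```

With `beta = 0.9` the DC gain is `1 / (1 - beta) = 10`, the gain at the Nyquist frequency is `1 / (1 + beta)`, and a constant unit input drives the membrane toward 10. A larger `beta` narrows the passband, which is the knob the entry describes tuning against the discriminative spectrum of the signal.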
From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning: The authors track the cumulative drift of parameters along "danger/safety directions" during LoRA fine-tuning. They discover that the underlying mechanism for alignment collapse caused by benign data is the monotonic parameter drift toward dangerous directions. Consequently, they propose SQSD, which assigns continuous risk scores to individual samples based on the projection difference of a single-step gradient along these two directions. SQSD maintains monotonic ASR rankings across 3 models and 2 datasets and demonstrates transferability across architectures, scales, and from LoRA to Full Fine-tuning.
From Prompts to Responses: Dual-Sided Data Leakage and Defense in Split Large Language Models: In "Split Large Language Models (Split-LLM)," private data is leaked from both ends—the model head and the model tail. This paper proposes the PIDI attack, which uses dual-sided initialization and patched inversion to reconstruct user input prompts and model-generated responses with high fidelity. Simultaneously, it proposes the ADMI defense, which utilizes adapter local warm-up and mutual information regularization to suppress attack success rates at both ends to near zero with almost no task performance degradation.
From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG: EPIC shifts the core bottleneck of on-device RAG from "how to use preferences during retrieval" forward to "what to store during indexing." It utilizes a three-stage pipeline involving "coarse filtering + fine verification + query steering" to retain only data aligned with user preferences and generates "instruction-item" pairs as indexing units. It reduces storage by 2404× while achieving an absolute improvement of 20.17 percentage points in preference alignment accuracy across four preference benchmarks.
From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents: The paper argues that LLM agents can cross-reference fragmented, non-identifiable cues with public evidence to re-link anonymized data to specific real-world identities. This "inference-driven de-anonymization" risk is systematically quantified through three scenarios: replication of classic cases, a controlled benchmark (InferLink), and real-world human-computer dialogue logs.
FuseFSS: Efficient Secure LLM Inference with Function Secret Sharing: FuseFSS replaces the paradigm of "handcrafting a dedicated secure protocol for every fixed-point non-linear operator" with a unified compiler. By defining a compact specification for each scalar operator (interval partitioning + low-degree polynomials + predicate bits), the compiler automatically generates two FSS calls: "one packed comparison + one vector interval lookup." Compared to the state-of-the-art FSS baseline Sigma, it achieves end-to-end speedups of 1.24×–1.50× on BERT/GPT, reduces online communication by 9%–16%, and produces smaller, faster keys.
GEM-FI: Gated Evidential Mixtures with Fisher Modulation: This paper addresses the issues of overconfidence in out-of-distribution (OOD) samples and the difficulty of single-head architectures in expressing multimodal epistemic uncertainty in Evidential Deep Learning (EDL). It proposes a three-component suite, GEM-Core/MIX/FI: gating evidence with learned feature energy, approximating ensembles via a single-pass mixture of evidential heads, and stabilizing mixture assignments with Fisher information regularization. It outperforms DAEDL on OOD detection tasks (CIFAR-10 → SVHN/CIFAR-100) while maintaining single-pass efficiency.
Generative Models Erode Human Temporal Learning Through Market Selection: This position paper argues that even before reaching AGI, generative models pose structural risks to knowledge and cultural production through "market adverse selection." As AI outputs increasingly mimic surface features of work traditionally requiring long-term human learning, the cost for evaluators to verify "whether this is a product of long-term human accumulation" exceeds the benefits. Consequently, reward mechanisms become "source-blind," forcing individuals who invested years in learning to compete on price with near-zero-cost AI outputs, ultimately driving them out of the market.
Geometrically Constrained Outlier Synthesis: GCOS synthesizes virtual outliers along geometric off-manifold directions within the "small-variance subspace" of ID feature PCA. It regulates synthesis intensity via a "conformal shell" \([\alpha_\text{inner},\alpha_\text{outer}]\) derived from Mahalanobis quantiles of a calibration set. Combined with a contrastive regularization loss using an adaptive margin, it improves average AUROC from 86.21 (VOS) to 93.47 across four near-OOD datasets.
Gradient Transformer: Learning to Generate Updates for LLMs: This paper proposes Grad-Transformer, which "translates" the update vector obtained by a client fine-tuning a small model (TinyLM) on private data into an update vector for a target large language model (LLM) using an encoder-decoder Transformer. This achieves weak-to-strong knowledge distillation without touching private data. It achieves an average PGR of 91.88% across 6 reasoning/summarization datasets, a 55.89% relative improvement over the best baseline (58.94%), and demonstrates robustness to differential privacy perturbations.
SemGrad: Gradients w.r.t. Semantics-Preserving Embeddings Tell LLM Uncertainty: SemGrad applies gradient-based uncertainty quantification to LLM free-form generation for the first time. By using the Semantics Preserving Score (SPS) to identify hidden states that encode input semantics, the method uses the gradient norm of the log-likelihood with respect to these states as a measure of LLM confidence. Without sampling and requiring only a single backward pass, it outperforms 11 SOTA baselines on 3 QA datasets, notably exceeding SAR by 3.27 AUROC on TruthfulQA, which contains multiple valid answers.
HEDP: A Hybrid Energy-Distance Prompt-based Framework for Domain Incremental Learning: Drawing physical intuition from Helmholtz free energy, this work trains prompt parameters for each domain to follow an energy curve that is "compressed to boundary \(\Theta\) and aligned to midline \(\Delta\)." During inference, a hybrid weight composed of energy and distance factors is used to combine domain-specific prompts, achieving improvements of 1.76 / 3.12 / 2.57 percentage points on unseen domains across CDDB / DomainNet / CORe50 DIL benchmarks, respectively.
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents: This paper proposes STING, an automated framework that employs four collaborative agents (Strategist / Attacker / Refusal Detector / Phase-Completion Checker) to decompose malicious intent into multiple steps, disguised under benign personas, for multi-turn adaptive red-teaming of tool-using agents. It introduces a survival analysis toolkit that models "multi-turn jailbreaking" as a "Time-to-First-Jailbreak" random variable (yielding discovery curves, language-attributed hazard ratios, and the new RMJD metric). Experiments show that multi-turn STING increases illicit task completion by up to 107.1% compared to single-turn prompts, and contrary to chatbot findings, low-resource languages are not consistently easier to jailbreak.
Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio: To address the exponential decay of watermark signals caused by "decoding → re-encoding non-idempotency" in autoregressive audio generation under KGW-style token watermarking, the authors perform Leiden community detection on the codec's confusion matrix to derive a contracted "cluster vocabulary." By defining green/red sets on clusters rather than individual tokens, this gradient-free, black-box approach raises the exponential base of the \(z\)-score from \(r\) to \(r_{cl} > r\). Detectability is improved by several orders of magnitude compared to baselines and WMAR (which requires fine-tuning), demonstrating inherent robustness to perturbations like MP3 compression, denoising, and cropping.
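To make the "green/red sets on clusters rather than individual tokens" idea in the audio watermarking entry above concrete, here is a minimal KGW-style detector that scores cluster IDs. Everything specific is an assumption for illustration: the token-to-cluster map is supplied by the caller (the paper derives it via community detection on a codec confusion matrix, which is not reproduced here), and the seeding scheme, key format, and function names are hypothetical.

```python
import math
import random

def green_clusters(prev_cluster, num_clusters, gamma, key):
    """Pseudo-random green set of clusters, seeded by the key and previous cluster."""
    rng = random.Random(f"{key}:{prev_cluster}")
    return set(rng.sample(range(num_clusters), int(gamma * num_clusters)))

def z_score(tokens, token_to_cluster, num_clusters, gamma, key):
    """One-sided z-score of the green-hit count, computed at cluster level."""
    clusters = [token_to_cluster[t] for t in tokens]
    scored = len(clusters) - 1  # the first position has no predecessor
    hits = sum(
        cur in green_clusters(prev, num_clusters, gamma, key)
        for prev, cur in zip(clusters, clusters[1:])
    )
    return (hits - gamma * scored) / math.sqrt(scored * gamma * (1.0 - gamma))
```

Because only cluster IDs enter the score, replacing a token with another token from the same cluster leaves the z-score unchanged. That invariance is the property that is meant to survive a decode and re-encode round trip through the codec.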
Hiding in Plain Floats: Steganographic Carriers for Indirect Prompt and Content Injection: This paper hides malicious instructions within floating-point parameter arrays used for "procedural generation" (encoding bytes into trajectory coordinates using Iterated Function Systems, IFS). This approach ensures that plaintext prompt injection detectors find no suspicious text at either the raw configuration layer or the reconstructed report layer. In 14,400 real-world attack experiments across three commercial LLMs, this method maintained a 94.3% leak attack success rate against the strongest dual-layer text classifier defenses.
How Does Bayesian Sampling Help Membership Inference Attacks?: This paper proposes BMIA, which expands a single reference model into a "virtual model family" using a Laplace posterior. By estimating the conditional score distribution of each sample via Bayesian sampling, BMIA achieves a TPR in low FPR regions that is 54% higher than LiRA (which requires 8 reference models) on datasets like CIFAR-100, all while staying within a budget of training only one reference model.
How Hard Can It Be? Hardness-Aware Multi-Objective Unlearning: The trade-off between "forgetting vs. retaining" is formulated directly as a "per-step constrained first-order convex optimization" problem. The dot product of the retain/forget gradients, \(\kappa = \bm{g_r}\cdot\bm{g_f}\), serves simultaneously as a hardness metric, a switch for update directions, and an early stopping condition. It proves more stable than baselines such as GA, GDiff, SCRUB, and KL on CIFAR-10/ResNet-20 and Llama-2-7B/WaterDrum-TOFU.
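To make the role of \(\kappa\) concrete, the sketch below computes the dot product and uses its sign to switch the update direction. The switching rule (a projection of the forget gradient when \(\kappa > 0\)) and the sign convention are assumptions chosen for illustration, not the paper's constrained solver; all names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unlearning_direction(g_r, g_f):
    """Return (kappa, d) for one step theta <- theta + lr * d.

    g_r -- gradient of the retain loss (to be descended)
    g_f -- gradient of the forget loss (to be ascended)
    """
    kappa = dot(g_r, g_f)
    if kappa > 0:
        # Assumed rule: ascending along g_f would also raise the retain loss,
        # so drop the component of g_f that lies along g_r.
        scale = kappa / dot(g_r, g_r)
        g_f = [f - scale * r for f, r in zip(g_f, g_r)]
    # Ascend the (possibly projected) forget gradient, descend the retain gradient.
    d = [f - r for f, r in zip(g_f, g_r)]
    return kappa, d


kappa_hard, d_hard = unlearning_direction([1.0, 0.0], [1.0, 2.0])
kappa_easy, d_easy = unlearning_direction([1.0, 0.0], [-1.0, 1.0])
```

Under this convention a larger positive \(\kappa\) marks a harder step, since more of the forget gradient has to be discarded to protect the retain set.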
In-Training Defenses Against Emergent Misalignment in Language Models: Addressing the phenomenon of "Emergent Misalignment" (EM)—where fine-tuning on narrow domains causes global model deterioration—this paper provides the first systematic comparison of five categories of in-training defenses. The authors propose Interleaving++, which automatically selects safe data using the "perplexity difference between aligned and misaligned models." Interleaving++ simultaneously satisfies four criteria: preventing EM, preserving narrow-domain learning, enabling benign task learning, and maintaining response coherence.
LAPRAS: Learning-Augmented PRivate Answering for Linear Query Streams: LAPRAS utilizes a predictor of "which queries will arrive" to categorize an online DP query stream into predicted and unpredicted queries. Predicted queries are released with low noise through the offline optimal Matrix Mechanism, while unpredicted queries use Smooth Allocation to estimate the total count online based on the observed arrival positions of "unpredicted queries" and distribute the budget smoothly. It nearly matches offline optimal performance when predictions are accurate and degrades to online baseline levels when predictions are poor.
LLM Benchmark Datasets Should Be Contamination-Resistant (Position Paper): This position paper argues that LLM benchmarks should be contamination-resistant—meaning they are usable for inference but unusable for training. It proposes leveraging the fundamental asymmetry between Transformer training and inference pipelines (training requires full tokens for gradients, while inference only requires the KV-cache + penultimate layer hidden state). The authors suggest shifting benchmark release formats from plaintext to KV-cache and intermediate hidden states, combined with cross-model subspace alignment or relative representations to solve interoperability, calling for community adoption.
MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio: MedMosaic constructs a medical audio QA benchmark (46,701 QAs, 10 question types) covering physiological sounds and real/synthetic clinical dialogues via a synthetic pipeline. Systematic evaluation of 13 audio/multimodal models reveals that even Gemini-2.5-Pro achieves only approximately 68.1% weighted accuracy, uncovering fundamental shortfalls of contemporary LALMs in medical audio reasoning.
Memetic Capture: A Pluralistic Policy Framework for Governing AI-Driven Cultural Disempowerment: This AI governance position paper identifies "memetic capture" as the process by which AI incrementally strips humans of cultural agency. It proposes the CPGF (Cultural Policy Governance Framework), a four-tier architecture featuring quantifiable impact indices, democratic assemblies, pluralistic deployment standards, and transnational coordination. The core argument is that pluralism is a structural necessity rather than a moral choice—monocultural AI governance itself accelerates the very disempowerment it seeks to prevent.
Memory as a Markov Matrix: Sample Efficient Knowledge Expansion via Token-to-Dictionary Mapping: The next-token distribution of an autoregressive LLM is interpreted as the state transition matrix of a Markov chain. Consequently, "learning new words" becomes "adding new states to the state space and representing them as sparse combinations of existing states." Theoretically, this requires only \(O(s)\) samples (where \(s\) is the number of mapped old tokens), and in practice, fine-tuning only the new token embeddings achieves cross-lingual or new concept expansion with strictly zero forgetting.
MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification: Expert models fine-tuned independently on private data by multiple clients can be merged into a single deployable MoE model without sharing private data. The core approach utilizes relevance-weighted Determinantal Point Processes (DPP) to select proxy samples from public data that are both "relevant and diverse." This is followed by proxy-aligned expert training and context-aware router training to align expert behavior with proxy supervision, significantly outperforming methods like FlexOlmo that rely solely on similarity for proxy selection.
Mind the Gap: Mixtures of Gaussians in Approximate Differential Privacy: This paper designs a class of Gaussian mixture additive noise mechanisms (multi-Gaussian mixture and hyperparameter-free quasi-Gaussian mixture) for \((\varepsilon,\delta)\)-DP. These mechanisms close the optimality gap of the analytic Gaussian mechanism by up to 99% in low-to-medium privacy regimes while preserving the tight zCDP composition properties of Gaussians.
Minim: Privacy-Aware Minimal View for Agents via Trusted Local Sanitization: Minim is a "trusted sanitization proxy" running locally on a user's device. Before an Agent uploads the interface state (accessibility tree) to a remote inference server, it uses a small model to assign two scores to each UI element—intrinsic sensitivity \(s\) and task-conditioned necessity \(n\). It then applies a ternary disclosure strategy (Keep / Abstract / Remove) to release only the minimum information truly required for the task. On WebArena, it reduces Task-Irrelevant Sensitive Leakage (TISL) to 10.1% of the full observation while maintaining nearly no loss in task-critical content and interactivity.
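The ternary strategy can be pictured as a small decision rule over the two scores. The thresholds and the exact mapping from \((s, n)\) to Keep / Abstract / Remove below are assumptions for illustration; the summary does not specify them, and the element names are hypothetical.

```python
def disclose(sensitivity, necessity, s_thresh=0.5, n_thresh=0.5):
    """Map (intrinsic sensitivity s, task necessity n) to a disclosure action."""
    if sensitivity < s_thresh:
        return "keep"      # not sensitive: release unchanged
    if necessity >= n_thresh:
        return "abstract"  # sensitive but needed: release a coarsened placeholder
    return "remove"        # sensitive and not needed: drop from the view


# Hypothetical accessibility-tree elements with (s, n) scores from a small model.
elements = {
    "search_button": (0.1, 0.9),
    "shipping_address_field": (0.9, 0.8),
    "saved_card_number": (0.95, 0.1),
}
actions = {name: disclose(s, n) for name, (s, n) in elements.items()}
```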
MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs: Addressing the real-world scenario where Multimodal Large Language Models (MLLMs) need to continuously delete specific data chronologically, this paper constructs a large-scale lifelong unlearning benchmark, MLUBench (127 real entities, 5,105 images, 15,414 VQA pairs). The study systematically reveals that existing unlearning methods collapse as tasks accumulate, with the root cause being the destruction of multimodal alignment. To mitigate this, the authors propose LUMoE, a method using a "one switchable LoRA expert per unlearning task + gating router" architecture. This isolates unlearning modifications from the stable backbone, simultaneously preserving unlearning quality and model utility under long-sequence unlearning.
Multilingual Unlearning in LLMs: Transfer, Dynamics, and Reversibility: This paper extends the TOFU unlearning benchmark to 5 languages to systematically study "cross-lingual unlearning transfer." It finds that unlearning strength varies with the kinship of language families/writing systems and primarily modifies late-stage language-specific decoding layers while leaving the shared semantic space in earlier layers nearly untouched. Consequently, an inference-time steering vector can recover 50% of forgotten knowledge on Qwen and 90% on Gemma, indicating that existing LLM unlearning is essentially "surface suppression" rather than true erasure.
Old Habits Die Hard: How Conversational History Geometrically Traps LLMs: The History-Echoes framework analyzes the carryover effect of LLM conversational history through "Markov chain state consistency" and "latent space geometric angles." It identifies a Spearman correlation of 0.78—once a behavior (hallucination, sycophancy, or refusal) occurs, the model becomes trapped in a latent space region corresponding to that state, making escape difficult. The "refusal" trap is the strongest, while "hallucination" is the weakest; these traps dissolve when topic consistency is broken.
OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL: This paper targets the unified task of "Simultaneous detection and localization of mixed image/text/video forgeries." It proposes OmniVL-Guard, which utilizes Self-Evolving CoT to synthesize high-quality cold-start data and ARSPO (non-linear reward mapping + dynamic task weights) to address the "difficulty bias" in multi-task RL, where simple classification tasks dominate gradients while fine-grained localization tasks fail to learn. On In-Domain datasets, it achieves +37.8 tIoU for video temporal localization and +22.9 F1 for text localization, while reaching zero-shot SOTA on four OOD benchmarks.
One Model to Translate Them All: Universal Any-to-Any Translation for Heterogeneous Collaborative Perception: UniTrans reformulates the traditional collaborative perception translation paradigm from "training an adapter for every pair of vehicle-side modalities" to "inferring mapping in a modality-intrinsic latent space \(\rightarrow\) linearly combining a set of expert parameters via a router \(\rightarrow\) instantiating a mapping-specific translator on the fly." This achieves zero-shot BEV feature translation for unseen new vehicle models, improving average AP by ~7 / 3 points over the strongest baselines on OPV2V-H / DAIR-V2X, while maintaining lower GFLOPs/CPU time than Classic MoE.
Optimal Transport under Group Fairness Constraints: This paper explicitly encodes "group fairness" as a \(K_s \times K_w\) inter-group matching probability target \(\mathbf{F}\). It proposes three solutions: FairSinkhorn for exact solving, Penalized OT for convex relaxation, and Bi-level Cost Learning. It provides finite sample complexity \(O(1/\sqrt{n})\) and fairness bias bounds \(O(\exp(5R_\Theta/\varepsilon)/\sqrt{n})\), outlining the "cost-fairness" trade-off frontier on synthetic and semi-synthetic (dating app) datasets.
Optimizing Token Choice for Code Watermarking: An RL Approach: CodeTracer attaches a small watermark policy network alongside a frozen code LLM, utilizing GRPO with dual rewards (execution pass + z-score) and Gumbel-Top-k straight-through estimation to jointly learn where to watermark and which green tokens to select. It improves detection AUROC from ~70% to ~78% while maintaining near-baseline Pass@1 performance.
Partitioning for Intrinsic Model Inversion Resistance in Collaborative Inference: This paper moves beyond the traditional defense paradigm of "adding noise or masking shallow intermediate representations." From an information-theoretic perspective, it proves that in edge-cloud collaborative inference, the model should be partitioned at the layer where the representation undergoes a "feature-to-decision" phase transition (named the Golden Partition Zone, GPZ by the authors). The intra-class mean square radius \(R_c^2\) is identified as the key variable for locating the GPZ and can be actively contracted during training via label smoothing dynamics.
Persuasive Privacy: This paper reformulates "privacy" as the relative scoring rule loss of a Receiver under the worst-case data-prior using a Sender–Receiver two-party Stackelberg game and Bayesian Persuasion. It provides a unified definition \((\mathcal{S},\mathcal{Q}_x,\kappa,\delta)\)-PP, which subsumes pure DP and probabilistic DP as special cases, while providing non-trivial formal privacy guarantees for deterministic algorithms (e.g., noiseless empirical mean) for the first time.
PFT: Phonon Fine-tuning for Machine Learned Interatomic Potentials: This paper proposes PFT (Phonon Fine-tuning), which stochastically samples Hessian columns via Hessian-vector products and directly supervises the energy Hessian to align with DFT force constants during MLIP fine-tuning. Combined with co-training to alleviate catastrophic forgetting, it reduces thermodynamic phonon errors of Nequix MP on the MDR Phonon benchmark by an average of 55% and lowers thermal conductivity \(\kappa_{\text{SRME}}\) from 0.446 to 0.307, achieving SOTA among models trained on MPtrj.
PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding: This paper proposes PipeSD: a framework that transforms speculative decoding from sequential cloud-edge execution into a token-batch pipeline. It replaces fixed draft lengths with a dual-threshold NAV trigger and Bayesian autotuning, achieving 1.16×–2.16× speedup and 14–25% cloud energy reduction on real 5G cloud-edge testbeds.
Position: Beyond Sensitive Attributes, ML Fairness Should Quantify Structural Injustice via Social Determinants: This is an ICML position paper: the authors argue that ML fairness research must move beyond focusing solely on "sensitive attributes" like race/sex and must include "social determinants" (contextual variables such as neighborhood, ADI, school funding, and healthcare accessibility) in audits. Using a theoretical model of university admissions, US Census data, and semi-synthetic experiments on breast cancer screening, they demonstrate that mitigation strategies centered only on sensitive attributes may inadvertently create new forms of structural injustice.
Position: Embodied AI Requires a Privacy-Utility Trade-off: This position paper argues that privacy in embodied AI cannot be addressed with single-stage patches. Instead, it must be treated as an architecture-level dynamic control signal spanning the entire lifecycle of instruction / perception / planning / interaction. The authors propose the SPINE framework, which uses an L1-L4 four-level privacy classification matrix to adjust agent behavior in a coordinated way across all stages.
Position: Generative Engine Optimization Creates Underexamined Risks, Governance Must Target Concentration, Disclosure, and Academic Blind Spots: This position paper argues that as users transition from "viewing ranked lists" to "viewing LLM-synthesized answers," Search Engine Optimization (SEO) evolves into Generative Engine Optimization (GEO), exerting influence within the evidence pool and generation stages of RAG-based answer engines. The authors formalize a universal GEO pipeline, identify three overlooked risks (concentration of influence, implicit commercial impact, and academic-industrial blind spots), and call for "answer-level governance": enhanced contestability, high-precision disclosure, black-box auditing of substantive impacts, and exposure persistence metrics aligned with deployment.
Position: Machine Learning for Heart Transplant Allocation Policy Optimization Should Account for Incentives: This ICML 2026 position paper argues, using historical UNOS data, that the next-generation ML strategies for the U.S. heart transplant allocation system must model the incentive misalignment among "organ procurement organizations (OPOs), transplant centers, physicians, patients, and regulators" as a first-class citizen. It calls for integrating mechanism design, strategic classification, causal inference, and social choice into the ML pipeline; otherwise, even the strongest predictive models will be undermined by strategic behaviors during deployment.
Position: Retire the "Positive Backdoor" Label -- Secret Alignment Requires Strict and Systematic Evaluation: This position paper advocates for the retirement of the misleading label "positive backdoor," proposing to rename trigger-activated hidden behaviors as "Secret Alignment." Through a systematic evaluation of three representative schemes (SudoLM, Instructional Fingerprinting, SafeTrigger) across six standardized attributes (Effectiveness, Harmlessness, Persistence, Efficiency, Robustness, Reliability), the authors reveal the vulnerability of such mechanisms in terms of Confidentiality, Integrity, and Availability (CIA). The paper calls for the community to treat these mechanisms as "insecure" by default unless supported by rigorous, standardized evidence.
Position: Stop Chasing the C-index when Evaluating Survival Analysis Models: The authors audited 92 survival analysis papers from 2023–2025 and found that approximately 72% of the works used evaluation metrics (especially the overused C-index) that were misaligned with their modeling goals and censoring assumptions. They proposed the "Ladder Hypothesis": models and metrics must stand on the same level of "censoring assumption," otherwise reported performance and rankings may be biased artifacts.
Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering: This is a position paper making a core assertion: current mainstream methods for LLM Uncertainty Quantification (UQ)—such as Semantic Entropy, graph-based methods, and P(true)—are mechanistically isomorphic to unsupervised clustering. They measure "internal consistency of model generations" rather than "external correctness," making them inherently fail against "confident hallucinations." The authors diagnose three major pathologies: parameter sensitivity, internal evaluation loops, and lack of ground truth. They propose a roadmap shifting from unsupervised heuristics toward "supervised assurance" based on three pillars: evaluation, mechanisms, and grounding.
PRISM: Gauge-Invariant Tangent-Space Differentially Private LoRA: PRISM shifts DP-SGD from the LoRA factor space \((A,B)\) to the tangent space of the rank-\(r\) manifold to perform clipping, noise addition, and retraction. This yields a DP-LoRA mechanism that is gauge-invariant, lacks bilinear second-order noise, and possesses a closed-form intrinsic noise energy of \(\sigma C/b\cdot\sqrt{r(m+n-r)}\).
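A quick arithmetic sketch of the closed-form noise energy. It reads the expression as \((\sigma C / b)\cdot\sqrt{r(m+n-r)}\) and assumes \(\sigma\) is the noise multiplier, \(C\) the clipping norm, and \(b\) the batch size; those symbol meanings are not spelled out in the summary. The term \(r(m+n-r)\) is the dimension of the manifold of rank-\(r\) \(m \times n\) matrices, which is \(r^2\) smaller than the \(r(m+n)\) entries of the LoRA factors.

```python
import math


def tangent_dim(m, n, r):
    """Dimension of the tangent space of the rank-r manifold of m x n matrices."""
    return r * (m + n - r)


def intrinsic_noise_energy(sigma, clip, batch, m, n, r):
    """(sigma * C / b) * sqrt(r * (m + n - r)), under the reading stated above."""
    return sigma * clip / batch * math.sqrt(tangent_dim(m, n, r))


# Hypothetical layer: a 4096 x 4096 weight with a rank-8 adapter.
m, n, r = 4096, 4096, 8
factor_params = r * (m + n)                       # entries in A and B
redundant = factor_params - tangent_dim(m, n, r)  # gauge degrees of freedom
energy = intrinsic_noise_energy(sigma=1.0, clip=1.0, batch=256, m=m, n=n, r=r)
```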
Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States: The authors provide the first convergent hidden-state DP upper bound for "Differentially Private Zeroth-Order Gradient Descent (DP-ZOGD)". By designing a hybrid "directional + isotropic" noise mechanism and constructing an auxiliary process between two adjacent trajectories, they bypass the technical barrier of zeroth-order updates lacking global Lipschitz continuity. This reveals a previously unknown DP algorithm design principle: "increasing the number of sampling directions \(K\) per step actually reduces privacy loss."
Private Learning with Public Feature Conditioning: Addressing the differential privacy (DP) regression problem with public (non-sensitive) features, this paper proposes Cond-DP. It utilizes a conditioning matrix \(\bm{C}=\bm{V}\Sigma^{-1}\bm{V}^T\) constructed from the public feature matrix to reshape the geometry of the embedding parameter space before DP-SGD. This amplifies the signal-to-noise ratio in low-spectrum directions without additional privacy overhead, significantly outperforming existing label DP regression methods in high-privacy (small \(\epsilon\)) scenarios.
PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection: The authors utilize the DF-R5 dataset containing 115k reasoning-labeled samples and the DX-LLaVA architecture, which replaces CLIP ViT with ConvNeXT. They propose PRPO, a paragraph-level variant of GRPO, where each paragraph is rewarded based on CLIP Image-Text Similarity (Visual Consistency Reward, VCR) and Reasoning-Conclusion Majority Vote Consistency (Prediction Consistency Reward, PCR). This approach improves cross-domain deepfake detection F1 from a SOTA of 75.26% to 89.91% and reasoning quality from 4.2/5 to 4.55/5.
Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework: This paper reveals that the Rapid Response (RR) jailbreak detection framework, deployed in production systems like Anthropic's ASL-3, can be systematically poisoned. By delivering malicious samples into the RR "proliferation" pipeline via prompt injection, an attacker can achieve up to 100% False Positive Rate (FPR) on benign samples or up to 96% False Negative Rate (FNR) on jailbreak samples with only a 1% poisoning rate. The poisoning is realized through a novel "Omission Attack," which implants backdoors by modifying only positive (unsafe) samples through deletion rather than addition.
Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw: Addressing agents like OpenClaw that possess "variable execution contexts including files, memory, tools, and skills," this work proposes DeepTrap, an automated red-teaming framework. It formulates the injection of adversarial payloads into clean contexts as a black-box, discrete, and stochastic trajectory-level multi-objective optimization problem (aiming to trigger risks, preserve normal tasks, and maintain stealth). Using reward-guided beam search and reflective deep-probing, DeepTrap uncovers high-value poisoned contexts. Evaluation across 42 cases, 6 risk categories, and 9 target models demonstrates that context poisoning allows agents to quietly achieve attack goals while completing legitimate user tasks, proving that security assessments focusing solely on final responses are insufficient.
REFLECTOR: Internalizing "Self-Reflection during Generation" into Trajectories to Resist Indirect Jailbreaking: To address indirect jailbreak attacks that only "expose" themselves in the middle or late stages of long generations, the authors use a teacher model to synthesize reflection trajectories labeled with <|reflect|>/<|explore|> for SFT cold-starting. Subsequently, a dual-reward GDPO (combining safety and reflection effectiveness rewards) is employed to internalize "search-and-recovery" behavior into the policy. This approach elevates the defense success rate against four types of indirect attacks (e.g., DRA) from ~10% to ~90%+, while simultaneously improving GSM8K performance by 5.65%.
Regret-Based Federated Causal Discovery with Unknown Interventions: This paper proposes I-PERI: a federated setting where client intervention targets are entirely unknown and only regret scalars can be shared. By employing a two-stage process of "directed-consensus masking + undirected-consensus masking," it recovers a new equivalence class Φ-MEC, which is tighter than the observational MEC but looser than I-MEC, and provides \(\epsilon\)-differential privacy guarantees via Laplace noise.
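The Laplace step can be sketched as follows: a client adds noise of scale sensitivity / \(\epsilon\) to the regret scalar before sharing it. The sensitivity value and function names are assumptions for illustration; the summary does not state how I-PERI calibrates the noise.

```python
import random


def laplace_scale(sensitivity, epsilon):
    """Scale of the Laplace mechanism for epsilon-DP."""
    return sensitivity / epsilon


def privatize_regret(regret, sensitivity, epsilon, rng):
    """Return the regret scalar plus Laplace(0, sensitivity / epsilon) noise."""
    scale = laplace_scale(sensitivity, epsilon)
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return regret + noise


rng = random.Random(0)
# Hypothetical client regret of 1.0 with an assumed sensitivity of 1.0.
shared = [privatize_regret(1.0, sensitivity=1.0, epsilon=0.5, rng=rng)
          for _ in range(20000)]
mean_shared = sum(shared) / len(shared)
```

Individual shared values are noisy, but the noise is zero-mean, so the average over many draws stays close to the true regret.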
Rethinking Evaluation Paradigms in IBP-based Certified Training: The authors point out that comparing IBP-based certified training methods using "biased configurations" is unfair. They propose drawing the Pareto front for each method using multi-objective Bayesian hyperparameter search, proving that existing SOTA methods are generally under-tuned—CROWN-IBP clean accuracy can increase by approximately \(6\%\), and MTL-IBP on Tiny ImageNet can simultaneously gain \(\sim2\%\) in both clean and certified accuracy.
Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations: This work proposes X-Shift—a gray-box adversarial attack that, while completely maintaining CLIP's predictions, uses imperceptible sparse perturbations to shift the entire explanatory heatmap to semantically irrelevant regions. It reveals that the faithfulness of VLM explanations can be thoroughly decoupled from prediction accuracy; this attack surface of "correct predictions but deceptive explanations" has been largely unexplored.
Robust In-Context Reinforcement Learning Under Reward Poisoning Attacks: This paper is the first to formalize "test-time reward poisoning" as a new attack surface for In-Context Reinforcement Learning (ICRL). It proposes an adversarial training framework, AT-DPT, which employs a population of attackers to continuously poison rewards during training, enabling the Decision-Pretrained Transformer (DPT) to learn an "in-context learning algorithm" that is inherently robust to contaminated context.
Rotation-Invariant Spherical Watermarking via Third-Order SO(3) Representation Coupling: TRIAD treats 360° panoramas as spherical signals and projects third-order spherical harmonic (SH) coefficient tensor products onto a trivial representation to obtain a theoretically provable SO(3)-invariant bispectral scalar. This allows embedding watermarks in high-order SH coefficients and extracting them from this invariant, maintaining near 100% bit accuracy under arbitrary 3D rotations without relying on data augmentation.
Same Target, Different Basins: Hard vs. Soft Labels for Annotator Distributions: By feeding "annotator distributions" to models via hard labels on CIFAR-10H (multipass cycling by votes / SLS resampling per epoch), this work proves that these methods match the soft-label cross-entropy objective in expectation but converge to flatter basins, perform better under sparse annotations, and are slightly better at OOD detection.
Scaling Unsupervised Multi-Source Federated Domain Adaptation through Group-Wise Discrepancy Minimization: To address the issues where existing Federated Multi-source Unsupervised Domain Adaptation (UMDA) methods can only handle \(2-6\) sources and suffer from training instability or computational blow-up as the number of sources increases, the authors propose GALA. GALA randomly divides all sources into several small groups and minimizes the discrepancy of prediction distributions between groups (compressing \(O(N^2)\) pairwise alignment into linear complexity). Furthermore, a centroid-and-temperature-based similarity weighting mechanism is introduced to identify sources truly close to the target domain. GALA achieves stable convergence on the newly established Digit-18 (18 sources) benchmark and significantly outperforms existing baselines.
Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation: This paper introduces a new threat—semantic-aware hijacking: using a single universal adversarial perturbation as a "semantic router" to steer the same MLLM toward different attacker-predefined outputs based on the visual semantics of the current frame. The feasibility boundary is derived through theoretical analysis of latent space geometric properties, and the SORT optimization algorithm is developed to generate such perturbations, achieving a 66% attack success rate against five targets on Qwen using one frame.
Singular Bayesian Neural Networks: This paper parameterizes the weight matrix directly as \(W=AB^\top\) instead of applying mean-field distributions to \(W\) itself, thereby inducing a low-rank posterior singular with respect to the Lebesgue measure. This reduces parameter complexity from \(O(mn)\) to \(O(r(m+n))\) and PAC-Bayes complexity from \(\sqrt{mn}\) to \(\sqrt{r(m+n)}\). Across MLP, LSTM, and Transformer architectures, it achieves OOD detection performance surpassing a 5-member Deep Ensemble while using \(33\times\) fewer parameters.
SORA: Free Second-Order Attacks in Fast Adversarial Training: This paper revisits catastrophic overfitting (CO) in single-step adversarial training from a second-order perspective. It proposes a zero-cost curvature metric, PertAlign, to provide early warning of CO. Based on this, the authors derive SORA: an adaptive fast adversarial training algorithm that estimates the Hessian for free using gradients from the previous backpropagation and performs per-channel randomized sampling of the optimal step size. Across 6 datasets and 4 architectures, SORA stably avoids CO and improves the robustness/clean accuracy trade-off of single-step AT using a single set of hyperparameters.
Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance: This paper identifies two major sources of instability in existing GFlowNet red-teaming: the high variance from partition function \(Z_\theta\) estimation and the mode collapse triggered by noisy rewards from toxicity classifiers on OOD gibberish text. It introduces three simple components: a pairwise contrastive objective (CTB) to eliminate \(Z\), Noisy Gradient Pruning (NGP) to filter uninformative pairs, and a Min-K Fluency Stabilizer (MKS) to exclude gibberish. With these, the method increases the number of unique attacks from 17 to 134 (approx. 7×) on Qwen2.5-1.5B while maintaining an ASR of 92%, significantly outperforming baselines in cross-model and cross-defense transferability.
TCAP: Tri-Component Attention Profiling for Unsupervised Backdoor Detection in MLLM Fine-Tuning: Addressing the issue of poisoned fine-tuning in Multi-modal Large Language Models (MLLM) under Fine-Tuning-as-a-Service (FTaaS) scenarios, this paper identifies a universal fingerprint: triggered samples cause "abnormal polarization of attention for the first generated token across system, vision, and text components." Based on this, the unsupervised TCAP framework is proposed: it uses a Gaussian Mixture Model (GMM) to identify trigger-responsive attention heads based on system attention, followed by EM-based Dawid–Skene voting for aggregation. Across 5 trigger patterns, 3 MLLMs, and 5 datasets, it reduces the Attack Success Rate (ASR) from 90%+ to ~0% with almost no loss in Clean Performance.
The Injection Paradox: Brand-Level Suppression in Safety-Trained LLM Recommendations via RAG Context Injection: This paper reports a reproducible failure mode in safety-trained LLMs within RAG recommendations termed the "Injection Paradox": prompt injections inserted into retrieved documents by attackers do not promote the target brand; instead, the heavily safety-trained Claude treats them as violations and suppresses the brand below the baseline. Furthermore, this suppression spreads from the single injected document to all unmodified documents of the same brand, with the target brand's hit rate dropping from a 54% baseline to 0 on Opus 4.6.
The Unlearnability Phenomenon in RLVR for Language Models: The authors identify a class of "unlearnable samples" in RLVR (GRPO) training: even when correct rollouts are sampled and reward signals are non-zero, the model fails to learn them throughout the entire training process. The root cause is not a scarcity of positive samples, clipping, or KL regularization on the optimization side, but rather that these samples are "gradient outliers" under the initial policy, stemming from representational deficiencies that require mid-training rather than RL post-training to resolve.
TimeGuard: Channel-wise Pool Training for Backdoor Defense in Time Series Forecasting: TimeGuard reconstructs backdoor defense in multivariate time series forecasting (TSF) from "window-level discarding" to "channel-wise + time-step" reliable pool training. It initializes a high-purity pool using the intersection of Reverse Consistency (RCF) and Neighborhood Diversity (NDF), then progressively expands it using Distance-Regularized Loss Selection (DRLS). Without relying on any clean data, it improves \(\text{MAE}_{\text{P}}\) against SOTA attacks like BackTime to 1.96x that of the strongest baseline PDB.
Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models: A-TPT utilizes an adversarial-hardened Gradient Attention Rollout to extract "semantic anchors" from the CLIP vision end. These focus maps guide spatially non-uniform multi-view augmentations and weighted ensemble based on the Total Variation of attention, simultaneously improving adversarial and clean accuracy across 9 datasets in fine-grained scenarios.
Training-Free Coverless Multi-Image Steganography with Access Control: MIDAS is a training-free coverless multi-image steganography framework based on pre-trained diffusion models. It replaces traditional Noise Flip with Random Basis (orthogonal random bases) to achieve fine-grained access control via private keys. Combined with Latent Vector Fusion to eliminate splicing boundaries, it achieves multi-image hiding and anti-steganographic analysis without transmitting any additional secret-related information.
Understanding Generalization and Forgetting in In-Context Continual Learning: This paper establishes the first theoretical framework for In-Context Continual Learning (ICCL), revealing that attention mechanisms inevitably produce systematic bias and task interference when processing multi-task sequences, leading to task-order-dependent degradation in generalization and memory.
Forgetting is Not Deletion: An Investigation of Reversibility in LLM Machine Unlearning: This paper systematically analyzes the reversibility of LLM unlearning using representation-level diagnostic tools, finding that many unlearning methods merely suppress information rather than truly delete it. It proposes a four-tier unlearning taxonomy to distinguish true information erasure from superficial performance degradation.
Two Blind Spots in Machine Unlearning: Over-Unlearning and Prototype Re-learning Attacks: This paper reveals two critical blind spots in machine unlearning—over-unlearning (collateral damage to samples near the decision boundary) and prototype re-learning attacks (recovery of forgotten knowledge using few samples)—and proposes the Spotter framework to simultaneously mitigate both issues through boundary mask distillation and intra-class dispersion loss.
VPD-100K: Towards Generalizable and Fine-grained Visual Privacy Protection: The authors construct VPD-100K, a large-scale visual privacy dataset containing 100,000 images, 33 fine-grained categories, and over 190,000 instances (covering four domains: faces, on-screen PII, physical documents, and location markers). They propose a three-part frequency-domain enhancement module (FDAF + Adaptive Spectral Gating + Frequency Consistency Loss) inserted into the neck of YOLOv10. This raises AP from 53.8 to 58.6 (+4.8) for YOLOv10-L on VPD-100K while maintaining stable performance on live streams at a latency of 7.51ms.
Watermarking LLM Agent Trajectories (ACTHOOK): ACTHOOK transplants the "software hook" concept into agent trajectories by inserting an extra action triggered by a secret key at action boundaries as a watermark. LLMs trained on such data execute the hook with significantly higher frequency when presented with the secret key, supporting copyright detection via black-box queries with an average AUC of 94.3 while maintaining downstream task performance.
When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents: This paper investigates the overlooked risk that Computer-Use Agents (CUAs) exhibit severe unsafe behaviors under completely benign inputs. It establishes a conceptual framework for unintended behaviors (four criteria + two harm categories) and proposes AutoElicit, an agentic framework that iteratively perturbs benign instructions using execution feedback to automatically elicit and evaluate harmful behaviors. AutoElicit uncovers long-tail harms in frontier CUAs such as Claude 4.5 Haiku, Operator, and Claude 4.5 Opus, with success rates ranging from \(72.5\%\) to \(86.7\%\).
When Should an AI Scientist Stop? Verifiable Experiment Steering and Refusal for Autonomous Discovery: This paper introduces Cartograph, a verification layer integrated into the autonomous "AI Scientist" loop. It utilizes a unified "unresolved subspace" object to simultaneously address three tasks: selecting the most disambiguating experiment (select), determining when a problem is solved (resolve), and—crucially—refusing to provide conclusions when the model library itself is structurally incorrect (refuse), with the ability to revoke earlier decisions if subsequent residuals expose a mismatch.
Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path: The authors discover that along the linear interpolation path \(X_\lambda=(1-\lambda)X_0+\lambda X_1\) used for training Rectified Flows, the reconstruction error gap between training and test samples follows a bell-shaped curve across \(\lambda\). Under Gaussian assumptions, they derive a closed-form expression for the peak position \(\lambda_F^*\). This "membership signal" accumulates silently during training while being completely masked by validation loss. Finally, the authors use this \(\lambda\)-resolved error curve to perform a Membership Inference Attack (MIA), achieving an AUC of 0.91 on a piano music dataset and significantly outperforming baselines transferred from diffusion models.
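
To make the \(\lambda\)-resolved error curve from the Rectified Flows entry above concrete, here is a minimal sketch in plain Python. It evaluates the path \(X_\lambda=(1-\lambda)X_0+\lambda X_1\) and a per-\(\lambda\) error gap between two sample sets. The function names, the toy `velocity_fn`, and the one-step reconstruction \(\hat{X}_1 = X_\lambda + (1-\lambda)\,v(X_\lambda,\lambda)\) are assumptions made for illustration; the paper's exact error definition and attack procedure may differ.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Pair = Tuple[Vector, Vector]                    # (x0, x1): path start and data sample
VelocityFn = Callable[[Vector, float], Vector]  # hypothetical stand-in for a trained model


def interpolate(x0: Vector, x1: Vector, lam: float) -> Vector:
    """Point on the linear path X_lam = (1 - lam) * X_0 + lam * X_1."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(x0, x1)]


def reconstruction_error(velocity_fn: VelocityFn, x0: Vector, x1: Vector, lam: float) -> float:
    """Squared error of a one-step estimate of x1 made from x_lam.

    ASSUMPTION: the estimate is x_lam + (1 - lam) * v(x_lam, lam).
    """
    x_lam = interpolate(x0, x1, lam)
    v = velocity_fn(x_lam, lam)
    x1_hat = [xi + (1.0 - lam) * vi for xi, vi in zip(x_lam, v)]
    return sum((h - t) ** 2 for h, t in zip(x1_hat, x1))


def error_curve(velocity_fn: VelocityFn, pairs: Sequence[Pair], lambdas: Sequence[float]) -> List[float]:
    """Mean reconstruction error over the sample pairs at each lam."""
    return [
        sum(reconstruction_error(velocity_fn, x0, x1, lam) for x0, x1 in pairs) / len(pairs)
        for lam in lambdas
    ]


def gap_curve(
    velocity_fn: VelocityFn,
    train_pairs: Sequence[Pair],
    test_pairs: Sequence[Pair],
    lambdas: Sequence[float],
) -> List[float]:
    """Test-minus-train error at each lam; a large gap marks a lam where membership leaks."""
    train = error_curve(velocity_fn, train_pairs, lambdas)
    test = error_curve(velocity_fn, test_pairs, lambdas)
    return [te - tr for tr, te in zip(train, test)]


if __name__ == "__main__":
    # Toy velocity field that always predicts zero motion (illustration only).
    zero_velocity = lambda x, lam: [0.0 for _ in x]
    train = [([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])]
    test = [([0.0, 0.0, 0.0], [2.0, 2.0, 2.0])]
    lambdas = [i / 10 for i in range(11)]
    for lam, gap in zip(lambdas, gap_curve(zero_velocity, train, test, lambdas)):
        print(f"lambda={lam:.1f}  gap={gap:.3f}")
```

In a real attack, `velocity_fn` would wrap the trained model and the gap would be estimated over many noise draws for `x0`. The toy field here only exercises the bookkeeping and does not reproduce the bell-shaped curve the paper reports.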