
🔒 LLM Safety

📷 CVPR2026 · 11 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (185) · 💬 ACL2026 (115) · 🤖 AAAI2026 (41) · 🧠 NeurIPS2025 (80) · 📹 ICCV2025 (10)

🔥 Top topics: Multimodal/VLM ×2 · Adversarial Robustness ×2

AutoDebias: An Automated Framework for Detecting and Mitigating Backdoor Biases in Text-to-Image Models

AutoDebias is presented as the first unified framework that both detects and mitigates malicious backdoor biases in text-to-image (T2I) models. It uses VLM open-set detection to identify trigger-bias associations, records them in lookup tables, and applies CLIP-guided distribution alignment training. Across 17 backdoor scenarios, it reduces the attack success rate from 90% to near zero while maintaining image quality.

The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models

This paper systematically investigates the catastrophic forgetting issue when fine-tuning VLMs for autonomous driving. It constructs FidelityDrivingBench, a large-scale benchmark with \(180\text{K}\) scenarios, and proposes the Drive Expert Adapter (DEA), which enhances driving task performance via prompt-space routing without corrupting base parameters.

Designing to Forget: Deep Semi-parametric Models for Unlearning

This paper proposes the "Designing to Forget" philosophy, introducing a family of Deep Semi-parametric Models (SPM). SPM unlearns by removing the relevant training samples at inference time, without modifying model weights. On ImageNet, it narrows the prediction gap relative to retraining baselines by 11% and accelerates unlearning by more than 10x.
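The idea of unlearning by deleting stored samples rather than editing weights can be illustrated with a toy nearest-neighbor predictor. This is only a generic non-parametric sketch, not the SPM architecture; the `memory` layout and the `predict` and `forget` helpers are hypothetical names introduced here.

```python
def predict(memory, query):
    """Return the label of the stored sample closest to `query` (squared Euclidean distance)."""
    nearest = min(
        memory,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item["x"], query)),
    )
    return nearest["y"]


def forget(memory, sample_ids):
    """Unlearn by dropping stored samples; no weights are touched (toy assumption)."""
    ids = set(sample_ids)
    return [item for item in memory if item["id"] not in ids]


# Hypothetical stored training samples: id, feature vector, label.
memory = [
    {"id": 0, "x": [0.0, 0.0], "y": "cat"},
    {"id": 1, "x": [1.0, 1.0], "y": "dog"},
    {"id": 2, "x": [0.9, 1.2], "y": "fox"},
]

query = [1.0, 1.1]
before = predict(memory, query)               # nearest sample is id 1
after = predict(forget(memory, [1]), query)   # id 1 removed, id 2 is now nearest
```

In this toy setting, forgetting costs one list filter, which is the intuition behind avoiding retraining; how SPM combines stored samples with a deep network is not specified in this summary.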

Elastic Weight Consolidation Done Right for Continual Learning

This paper systematically analyzes the fundamental flaws of EWC and its variants in weight importance estimation from a gradient perspective (gradient vanishing in EWC and redundant protection in MAS). It proposes an extremely simple Logits Reversal operation to correct the Fisher Information Matrix (FIM) calculation, significantly outperforming the original EWC and its variants in exemplar-free class-incremental learning and multimodal continual instruction tuning tasks.
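For context, the standard EWC penalty that the paper critiques can be sketched as follows. This is the textbook diagonal-Fisher form, not the paper's Logits Reversal correction, whose details are not given in this summary; the function names and toy numbers are assumptions for illustration.

```python
def diagonal_fisher(per_sample_grads):
    """Estimate the diagonal Fisher as the mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def ewc_penalty(theta, theta_old, fisher, lam=1.0):
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * sum(
        f * (t - t0) ** 2 for f, t, t0 in zip(fisher, theta, theta_old)
    )


# Toy gradients for two parameters; the second parameter's gradient has vanished.
grads = [[0.5, 0.0], [1.5, 0.0]]
fisher = diagonal_fisher(grads)

# The second parameter moves far from its old value but contributes no penalty.
penalty = ewc_penalty([1.0, 5.0], [0.0, 0.0], fisher, lam=2.0)
```

Here the second parameter receives zero importance because its gradients are zero, so it is left unprotected. This is one plausible reading of the "gradient vanishing" flaw named above, stated as an assumption rather than the paper's exact analysis.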

Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting

The authors propose KNOW prediction: inducing a structured forgetting process through sequential fine-tuning on progressively smaller nested data subsets to collect weight transition trajectories, and then using a meta-learned hyper-model (KNOWN) to reverse the direction of forgetting. This predicts virtual knowledge-enhanced weights as if they were trained on larger datasets. Across multiple datasets (CIFAR/ImageNet/PACS, etc.) and architectures (ResNet/PVTv2/DeepLabV3+, etc.), the method consistently outperforms naive fine-tuning and various weight prediction baselines, showing significant improvements in downstream tasks such as image classification, semantic segmentation, image captioning, and domain generalization.

Machine Unlearning via Adaptive Gradient Reweighting and Multi-stage Objective Optimization

To address the issues of "uniform treatment of all samples/categories" and "gradient conflicts between forgetting and retaining objectives" in machine unlearning, this paper proposes Adaptive Gradient Reweighting (weighting based on sample memory depth/category vulnerability) combined with Three-stage Objective Optimization (direction rectification → temporal smoothing → adaptive combination). On CIFAR-10/100 and Tiny-ImageNet, the Avg Gap for random forgetting is reduced from the SOTA 0.85 to 0.19.
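A gradient conflict between the forgetting and retaining objectives means the two gradients have a negative inner product. The sketch below resolves the conflict with a generic projection step in the style of gradient surgery; it is an assumed stand-in, since the paper's actual direction rectification rule is not described in this summary, and `rectify` is a hypothetical helper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rectify(g_forget, g_retain):
    """Remove the component of g_forget that opposes g_retain.

    If the gradients do not conflict (non-negative inner product),
    g_forget is returned unchanged.
    """
    d = dot(g_forget, g_retain)
    if d >= 0:
        return list(g_forget)
    scale = d / dot(g_retain, g_retain)  # d < 0 implies g_retain is non-zero
    return [gf - scale * gr for gf, gr in zip(g_forget, g_retain)]


g_forget = [1.0, -1.0]
g_retain = [0.0, 1.0]
g_fixed = rectify(g_forget, g_retain)  # conflicting component projected out
```

After projection, the rectified forgetting gradient is orthogonal to the retaining gradient, so a step along it no longer increases the retain loss to first order.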

Omni-Attack: Adversarial Attacks on Open-Ended VQA in Black-Box Multimodal LLMs

Addressing the gaps where "open-ended VQA/OCR tasks lack explicit attack targets and existing adversarial robustness evaluations use fragmented protocols," this paper first establishes a unified targeted attack benchmark AdvRobustBench (1,000 items, VQA+OCR). It then proposes Omni-Attack, a transferable black-box attack using LLMs to generate "question-conditioned" textual/visual targets, OCR location-aware perturbations, and four transfer regularizations. It achieves a 71.8% targeted attack success rate on GPT-4.1 with \(\epsilon=8/255\).
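The budget \(\epsilon=8/255\) is conventionally an \(L_\infty\) bound on the per-pixel change for images scaled to \([0,1]\). The sketch below shows only that standard projection step, under the assumption that this convention applies here; it does not reproduce Omni-Attack itself, and `project_linf` is a hypothetical helper name.

```python
EPSILON = 8 / 255  # perturbation budget quoted in the summary


def project_linf(clean, adversarial, eps=EPSILON):
    """Clamp each pixel of `adversarial` to within eps of `clean`, then to [0, 1]."""
    projected = []
    for c, a in zip(clean, adversarial):
        delta = max(-eps, min(eps, a - c))             # enforce the L-infinity budget
        projected.append(max(0.0, min(1.0, c + delta)))  # keep a valid pixel value
    return projected


clean = [0.5, 0.0, 1.0]
candidate = [0.9, -0.2, 0.99]
adv = project_linf(clean, candidate)
```

Iterative attacks typically apply such a projection after every update so the final image never leaves the allowed budget.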

⊘ Source Models Leak What They Shouldn't ↛: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization

This paper identifies that Source-Free Domain Adaptation (SFDA) methods inadvertently leak knowledge of source-exclusive classes to the target domain (zero-shot transfer). It proposes the SCADA-UL framework, which concurrently performs class unlearning during domain adaptation by adversarially generating forgotten samples and employing a rescaled labeling strategy, achieving unlearning performance comparable to training from scratch.

Revisiting Learning with Noisy Labels: Active Forgetting and Noise Suppression

To address the overfitting bottleneck in learning with noisy labels (LNL) caused by long-term "clean sample selection," this paper proposes FINE, a plug-and-play framework. It uses Active Forgetting via Machine Unlearning (AFMU) to "actively forget" noise absorbed during early stages and Noise Suppression via Negative Learning (NSNL) to "suppress" overfitting in later stages. Integrated into existing SOTA methods like SED or ACT, it consistently improves robustness and generalization.
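Negative learning, in its commonly used form, trains the model on a label the sample is believed not to belong to by minimizing \(-\log(1 - p_k)\) for that class \(k\). The sketch below shows that generic loss only; how NSNL selects such labels or schedules the loss is not stated in this summary, so the example is an assumption-laden illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_learning_loss(logits, complementary_label):
    """Loss -log(1 - p_k): small when the model assigns low probability to class k."""
    p_k = softmax(logits)[complementary_label]
    return -math.log(1.0 - p_k + 1e-12)  # small constant guards against log(0)


logits = [2.0, 0.0, 0.0]
loss_confident = negative_learning_loss(logits, 0)  # model favors the rejected class
loss_unlikely = negative_learning_loss(logits, 1)   # model already rejects this class
```

The loss is larger when the model is confident in the class it is told to reject, which pushes probability mass away from a suspected noisy label instead of toward it.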

Select, Hypothesize and Verify: Towards Verified Neuron Concept Interpretation

The SIEVE (Select–Hypothesize–Verify) framework is proposed to interpret neuron functions through a closed-loop process involving high-activation sample screening, concept hypothesis generation, and text-to-image verification. The probability of generated concepts matching neuron activation is approximately 1.5 times that of existing SOTA methods.

SineProject: Machine Unlearning for Stable Vision–Language Alignment

During machine unlearning in Multimodal Large Language Models (MLLMs), the Jacobian of the projector layer becomes severely ill-conditioned, which causes vision-language alignment drift. SineProject addresses this by applying a sine modulation (\(\sin(\Delta W)\)) to the projector weights, which constrains the parameter range to \([-1,1]\) and reduces the Jacobian condition number by 3-4 orders of magnitude. This enables complete forgetting of target knowledge while reducing the Safe Answer Refusal Rate (SARR) for benign queries by 15%.
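The bounding effect of the sine modulation can be seen in a small sketch. It assumes the effective projector weight is the frozen base weight plus \(\sin(\Delta W)\) applied elementwise; that composition, and the name `sine_modulated`, are assumptions made here, since the summary only states that \(\sin(\Delta W)\) keeps values in \([-1,1]\).

```python
import math


def sine_modulated(w_base, delta_w):
    """Return w_base + sin(delta_w) elementwise for two same-shaped nested lists."""
    return [
        [w + math.sin(d) for w, d in zip(row_w, row_d)]
        for row_w, row_d in zip(w_base, delta_w)
    ]


w_base = [[0.2, -0.4]]
delta_w = [[100.0, -3.0]]  # deliberately large raw updates
w_eff = sine_modulated(w_base, delta_w)

# However large delta_w grows, each weight moves by at most 1 from its base value.
max_shift = max(
    abs(e - w) for row_e, row_w in zip(w_eff, w_base) for e, w in zip(row_e, row_w)
)
```

For small updates \(\sin(\Delta W) \approx \Delta W\), so early unlearning steps behave like ordinary updates, while large updates saturate instead of growing without bound.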