
LLM Safety

📷 CVPR2026 · 16 paper notes

Association and Consolidation: Evolutionary Memory-Enhanced Incremental Multi-View Clustering

This paper proposes EMIMC, a framework inspired by the hippocampus–prefrontal cortex collaborative memory mechanism in the brain. Three coordinated modules — a Rapid Associative Module (orthogonal mapping to ensure plasticity), a Cognitive Forgetting Module (power-law decay to simulate the forgetting curve), and a Knowledge Consolidation Module (temporal tensor low-rank decomposition to distill long-term memory) — jointly address the stability-plasticity dilemma in incremental multi-view clustering.
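
Of the three modules, the forgetting curve is the easiest to make concrete. Below is a minimal sketch of the power-law memory decay the Cognitive Forgetting Module is said to simulate; the exponent `beta` and the age bookkeeping are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Hypothetical power-law forgetting weight: recent increments dominate,
# old ones fade but never vanish (Ebbinghaus-style heavy tail).
def forgetting_weights(ages: np.ndarray, beta: float = 0.5) -> np.ndarray:
    """ages: how many increments ago each stored memory arrived."""
    return (1.0 + ages) ** (-beta)

ages = np.array([0, 1, 5, 20])
print(forgetting_weights(ages))  # [1.    0.707 0.408 0.218]
```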

Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations

A patch-level LVLM hallucination detection framework is proposed. Hallucinated tokens are found to exhibit two characteristic signatures—dispersed attention patterns and low semantic alignment—based on which two lightweight metrics are designed: Attention Dispersion Score (ADS) and Cross-modal Grounding Consistency (CGC), achieving 90% detection accuracy.
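
The two metrics are described only at a high level, so the formulas below are plausible stand-ins rather than the paper's definitions: ADS as the normalized entropy of a token's attention over image patches, and CGC as the token's best cosine match against patch features.

```python
import torch
import torch.nn.functional as F

def attention_dispersion_score(attn: torch.Tensor) -> torch.Tensor:
    """attn: (num_patches,) attention from one generated token to image patches.
    Normalized entropy in [0, 1]; high values = dispersed, hallucination-suspect."""
    p = attn / attn.sum()
    entropy = -(p * (p + 1e-12).log()).sum()
    return entropy / torch.log(torch.tensor(float(attn.numel())))

def grounding_consistency(token_h: torch.Tensor, patch_h: torch.Tensor) -> torch.Tensor:
    """Best cosine similarity between a token's hidden state and any patch feature."""
    return F.cosine_similarity(token_h[None, :], patch_h, dim=-1).max()

attn = torch.rand(196)                                  # one token over 14x14 patches
token_h, patch_h = torch.randn(768), torch.randn(196, 768)
print(attention_dispersion_score(attn).item(), grounding_consistency(token_h, patch_h).item())
```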

The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models

This work systematically investigates catastrophic forgetting when fine-tuning VLMs for autonomous driving scenarios, constructs the large-scale 180K-scene benchmark FidelityDrivingBench, and proposes the Drive Expert Adapter (DEA), which enhances driving task performance via prompt-space routing without corrupting base model parameters.
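
"Prompt-space routing without corrupting base model parameters" suggests a gated mixture of learned prompt tokens prepended to a frozen backbone. The sketch below is one such construction; all shapes, the gating rule, and the class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PromptRouter(nn.Module):
    """Hypothetical prompt-space router: mixes expert prompt tokens via a
    learned gate and prepends them; the frozen base model is never edited."""
    def __init__(self, d_model=768, n_experts=4, prompt_len=8):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(n_experts, prompt_len, d_model) * 0.02)
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x):                      # x: (batch, seq, d_model)
        mix = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)  # route on pooled input
        prompts = torch.einsum("be,eld->bld", mix, self.experts)
        return torch.cat([prompts, x], dim=1)  # prepend prompts to the sequence

print(PromptRouter()(torch.randn(2, 16, 768)).shape)   # torch.Size([2, 24, 768])
```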

DAMP: Class Unlearning via Depth-Aware Removal of Forget-Specific Directions

This paper proposes DAMP (Depth-Aware Modulation via Projection), a one-shot, closed-form weight-surgery method for class unlearning. It achieves selective forgetting by removing forget-class-specific directions in the editing space of each network stage, with a depth-aware scaling rule that enforces conservative edits in shallow layers and aggressive edits in deep layers.
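
A projection-based removal of a forget direction has a simple closed form, sketched below under the assumption that a unit direction `u` (e.g. a dominant direction of forget-class activations) is already available per layer; the linear depth schedule is likewise an assumption, not the paper's rule.

```python
import torch

def damp_style_edit(W: torch.Tensor, u: torch.Tensor, depth: float) -> torch.Tensor:
    """W: (out, in) layer weight; u: (in,) forget direction in the input space.
    depth in [0, 1]: 0 = shallow (conservative edit), 1 = deep (aggressive)."""
    u = u / u.norm()
    P = torch.eye(W.shape[1]) - depth * torch.outer(u, u)  # partial projection
    return W @ P                                           # one-shot, no gradients

W, u = torch.randn(128, 64), torch.randn(64)
W_edit = damp_style_edit(W, u, depth=1.0)
print((W_edit @ (u / u.norm())).abs().max())               # ~0: direction removed
```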

Designing to Forget: Deep Semi-parametric Models for Unlearning

This paper proposes the "Designing to Forget" paradigm and introduces a family of deep semi-parametric models (SPMs) that achieve unlearning at inference time by simply removing training samples—without modifying model parameters. On ImageNet classification, SPMs reduce the prediction gap relative to the retrain baseline by 11% and achieve over 10× faster unlearning.
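
The defining property of a semi-parametric model here is that predictions read from an explicit datastore, so unlearning is a deletion rather than an optimization. A kNN head over frozen features, sketched below, is the simplest toy with that property; the paper's SPMs are richer, so treat this purely as an illustration of the paradigm.

```python
import numpy as np

class KNNHead:
    """Toy semi-parametric head: forgetting a sample is a dictionary delete."""
    def __init__(self, k=5):
        self.k, self.store = k, {}                  # sample id -> (feature, label)

    def add(self, sid, feat, label):
        self.store[sid] = (np.asarray(feat, dtype=float), label)

    def forget(self, sid):
        self.store.pop(sid, None)                   # unlearning at inference time

    def predict(self, feat):
        feats = np.stack([f for f, _ in self.store.values()])
        labels = [y for _, y in self.store.values()]
        idx = np.argsort(((feats - feat) ** 2).sum(1))[: self.k]
        votes = [labels[i] for i in idx]
        return max(set(votes), key=votes.count)

head = KNNHead(k=1)
head.add("a", [0.0, 0.0], "cat"); head.add("b", [1.0, 1.0], "dog")
print(head.predict(np.array([0.1, 0.1])))           # 'cat'
head.forget("a")
print(head.predict(np.array([0.1, 0.1])))           # 'dog': sample 'a' is gone
```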

Elastic Weight Consolidation Done Right for Continual Learning

This paper systematically analyzes, from a gradient perspective, the fundamental flaws in how EWC and its variants estimate weight importance: gradient vanishing in EWC and redundant protection in MAS. It then proposes an extremely simple Logits Reversal operation to correct the Fisher Information Matrix computation, achieving substantial improvements over vanilla EWC and all of its variants on exemplar-free class-incremental learning and multimodal continual instruction tuning tasks.
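
For context, this is the standard diagonal Fisher importance that EWC protects weights with, computed from squared gradients of the log-likelihood. When the model is confident and correct, the softmax saturates and these gradients vanish, which is the flaw diagnosed above; the paper's Logits Reversal correction is not reproduced here.

```python
import torch
import torch.nn.functional as F

def diag_fisher(model, loader):
    """Per-parameter importance: mean of squared log-likelihood gradients."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, y in loader:
        model.zero_grad()
        F.nll_loss(F.log_softmax(model(x), dim=-1), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(loader) for n, f in fisher.items()}

model = torch.nn.Linear(10, 3)
loader = [(torch.randn(8, 10), torch.randint(0, 3, (8,)))]
print({n: f.mean().item() for n, f in diag_fisher(model, loader).items()})
```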

HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in LVLMs

This paper proposes HulluEdit, a single-pass, reference-model-free hallucination mitigation framework that orthogonally decomposes hidden states into a visual evidence subspace, a conflicting prior subspace, and a residual uncertainty subspace, selectively suppressing hallucination patterns without interfering with visual grounding, achieving state-of-the-art performance on POPE and CHAIR.
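
The editing step itself reduces to projections once the three subspaces are in hand. The sketch below assumes precomputed, mutually orthogonal bases `V` (visual evidence) and `C` (conflicting priors) and damps only the prior component; how the paper estimates those bases is not reproduced.

```python
import torch

def subspace_edit(h, V, C, gamma=0.2):
    """h: (d,) hidden state; V: (d, kv), C: (d, kc) with orthonormal columns."""
    h_vis = V @ (V.T @ h)                      # visual-evidence component (kept)
    h_prior = C @ (C.T @ h)                    # conflicting-prior component (damped)
    h_resid = h - h_vis - h_prior              # residual uncertainty (kept)
    return h_vis + gamma * h_prior + h_resid

d = 768
Q, _ = torch.linalg.qr(torch.randn(d, 16))    # stand-in bases, mutually orthogonal
V, C = Q[:, :8], Q[:, 8:]
print(subspace_edit(torch.randn(d), V, C).shape)   # torch.Size([768])
```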

Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting

This paper proposes KNOW prediction: a framework that induces a structured forgetting process via sequential fine-tuning on progressively shrinking data subsets, collects the resulting weight-transition trajectories, and then employs a meta-learned hyper-model (KNOWN) to reverse the forgetting direction, predicting virtual knowledge-enriched weights as if the model had been trained on a larger dataset. The approach consistently outperforms naive fine-tuning and multiple weight-prediction baselines across diverse datasets (CIFAR/ImageNet/PACS, etc.) and architectures (ResNet/PVTv2/DeepLabV3+), yielding significant improvements on downstream tasks including image classification, semantic segmentation, image captioning, and domain generalization.
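
A crude way to see the idea: if forgetting traces a direction in weight space as the data shrinks, stepping against that direction should "retrodict" a better-trained model. The paper learns this inverse mapping with the KNOWN hyper-model; the linear extrapolation below is only a toy stand-in.

```python
import torch

def retrodict(trajectory, step=1.0):
    """trajectory: weight snapshots ordered from largest to smallest data subset."""
    deltas = [trajectory[i + 1] - trajectory[i] for i in range(len(trajectory) - 1)]
    forgetting_dir = torch.stack(deltas).mean(0)    # average drift toward less data
    return trajectory[0] - step * forgetting_dir    # step the other way

# Fake snapshots whose values shrink as the subset shrinks:
traj = [torch.full((1000,), 1.0 - 0.1 * i) for i in range(4)]
print(retrodict(traj).mean())                       # ~1.1, beyond the full-data point
```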

Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models

MPCAttack is proposed as a framework that jointly leverages feature representations from three learning paradigms—cross-modal alignment, multimodal understanding, and visual self-supervision—and generates highly transferable adversarial examples via a multi-paradigm collaborative optimization strategy, achieving state-of-the-art attack performance on both open-source and closed-source MLLMs.
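
Operationally this amounts to a PGD-style loop whose loss aggregates feature deviations under several frozen encoders at once. The sketch below uses random conv stand-ins for the three paradigms' encoders; the weights, step sizes, and cosine objective are all assumptions.

```python
import torch

def mpc_step(x, delta, encoders, weights, eps=8 / 255, alpha=1 / 255):
    """One collaborative PGD step: push features away from clean under every encoder."""
    delta = delta.clone().requires_grad_(True)
    loss = 0.0
    for f, w in zip(encoders, weights):
        clean = f(x).detach()
        adv = f((x + delta).clamp(0, 1))
        loss = loss + w * (1 - torch.cosine_similarity(adv.flatten(1), clean.flatten(1)).mean())
    loss.backward()
    with torch.no_grad():
        delta = (delta + alpha * delta.grad.sign()).clamp(-eps, eps)
    return delta.detach()

x = torch.rand(1, 3, 224, 224)
encoders = [torch.nn.Conv2d(3, 8, 32, stride=32) for _ in range(3)]  # encoder stand-ins
delta = torch.zeros_like(x)
for _ in range(10):
    delta = mpc_step(x, delta, encoders, weights=[1.0, 1.0, 1.0])
```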

Source Models Leak What They Shouldn't: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization

This work identifies that Source-Free Domain Adaptation (SFDA) methods inadvertently leak knowledge of source-exclusive classes to the target domain (zero-shot transfer phenomenon), and proposes the SCADA-UL framework, which performs category unlearning simultaneously with domain adaptation through adversarial generation of forget samples and a rescaled labeling strategy, achieving unlearning quality approaching that of retraining from scratch.
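
Of the two ingredients, the rescaled labeling strategy has a natural construction: zero out the forget classes in the soft target and renormalize over the retained ones. The paper's exact rescaling may differ, so the sketch below is a hedged guess at that one component.

```python
import torch

def rescaled_targets(probs: torch.Tensor, forget_cls: list) -> torch.Tensor:
    """probs: (batch, classes) soft targets; mass on forget classes is removed
    and the remainder renormalized, steering training away from them."""
    t = probs.clone()
    t[:, forget_cls] = 0.0
    return t / t.sum(dim=1, keepdim=True)

probs = torch.softmax(torch.randn(4, 10), dim=1)
print(rescaled_targets(probs, forget_cls=[0, 3]).sum(1))  # rows still sum to 1
```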

Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP

This paper proposes PAR (Perturb and Recover), a simple yet effective backdoor cleansing method for CLIP: by explicitly pushing model embeddings away from the poisoned state (Perturb) while recovering clean performance via the standard CLIP loss (Recover), PAR achieves robust backdoor removal against arbitrary trigger types without relying on strong data augmentation, and remains effective even when using only synthetic data.
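
The two terms map directly onto a loss: a standard symmetric CLIP contrastive term for Recover, and a penalty on similarity to the frozen poisoned model's embeddings for Perturb. The weighting `lam` and temperature `tau` below are assumptions.

```python
import torch
import torch.nn.functional as F

def par_loss(img_emb, txt_emb, img_emb_poisoned, lam=0.5, tau=0.07):
    img_emb, txt_emb = F.normalize(img_emb, dim=-1), F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / tau
    labels = torch.arange(len(img_emb))
    recover = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
    # Perturb: minimizing the loss pushes embeddings away from the poisoned state.
    perturb = F.cosine_similarity(img_emb, F.normalize(img_emb_poisoned, dim=-1)).mean()
    return recover + lam * perturb

img, txt, img_p = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
print(par_loss(img, txt, img_p))
```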

PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing

This paper proposes the PinPoint benchmark, comprising 7,635 queries and 329K human-verified relevance judgments. Through four dimensions—explicit negatives, multi-image queries, paraphrase variants, and demographic metadata—it exposes severe deficiencies in existing CIR methods regarding false positive suppression, linguistic robustness, and multi-image reasoning. A training-free MLLM-based reranking method is also proposed as an improved baseline.

Select, Hypothesize and Verify: Towards Verified Neuron Concept Interpretation

This paper proposes SIEVE (Select–Hypothesize–Verify), a closed-loop framework that interprets neuron functionality by selecting highly activated samples, generating concept hypotheses, and verifying them via text-to-image generation. The probability that generated concepts activate the corresponding neuron is approximately 1.5× that of existing SOTA methods.
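
The loop structure is straightforward even though its components are heavyweight. In the skeleton below, `neuron_act`, `propose_concepts`, and `generate_images` are placeholders for the activation probe, the hypothesis generator, and the text-to-image model; only the verification statistic (how often generated images re-activate the neuron) is concrete.

```python
def sieve(neuron_act, images, propose_concepts, generate_images, thresh):
    exemplars = sorted(images, key=neuron_act, reverse=True)[:16]     # Select
    results = {}
    for concept in propose_concepts(exemplars):                       # Hypothesize
        synth = generate_images(concept, n=8)                         # Verify
        results[concept] = sum(neuron_act(im) > thresh for im in synth) / len(synth)
    return max(results, key=results.get), results
```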

SineProject: Machine Unlearning for Stable Vision–Language Alignment

To address the severe ill-conditioning of the projector Jacobian during machine unlearning in MLLMs—which causes systematic vision–language alignment drift—this paper proposes SineProject, which applies sinusoidal modulation (\(\sin(\Delta W)\)) to projector weights to constrain parameter magnitudes to \([-1, 1]\). This reduces the Jacobian condition number by 3–4 orders of magnitude, achieving complete forgetting of target knowledge while reducing the safe answer rejection rate (SARR) on benign queries by 15%.
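
The stated mechanism translates almost literally into code: the trainable update is passed through an elementwise sine, so the effective weight change is bounded in \([-1, 1]\) no matter how large the raw update grows. The integration details below (a wrapper around the frozen projector) are assumptions.

```python
import torch

class SineProjector(torch.nn.Module):
    """Wraps a frozen projector; only the sine-bounded delta is trained."""
    def __init__(self, base: torch.nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.delta = torch.nn.Parameter(torch.zeros_like(base.weight))

    def forward(self, x):
        w = self.base.weight + torch.sin(self.delta)   # modulation bounded in [-1, 1]
        return torch.nn.functional.linear(x, w, self.base.bias)

proj = SineProjector(torch.nn.Linear(1024, 4096))
print(proj(torch.randn(2, 1024)).shape)                # torch.Size([2, 4096])
```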

Unsafe2Safe: Controllable Image Anonymization for Downstream Utility

This paper proposes Unsafe2Safe, a fully automatic privacy-preserving pipeline that realizes controllable image anonymization through a four-stage approach—VLM privacy inspection → dual captioning (private/public) → LLM editing instructions → text-guided diffusion editing. The method achieves substantial improvements on the VLMScore privacy metric while surpassing the original images in downstream accuracy on Caltech-101 classification and OK-VQA.
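
As a pipeline the method is pure orchestration; the skeleton below makes the data flow explicit. Every callable (`inspect_privacy`, `caption`, `write_edit_instructions`, `diffusion_edit`) is a placeholder for the VLM, captioner, LLM, and diffusion editor the paper plugs in.

```python
def unsafe2safe(image, inspect_privacy, caption, write_edit_instructions, diffusion_edit):
    risks = inspect_privacy(image)                    # 1) VLM privacy inspection
    if not risks:
        return image                                  # nothing private: keep as-is
    private_cap = caption(image, focus="private")     # 2) dual captioning
    public_cap = caption(image, focus="public")
    edits = write_edit_instructions(risks, private_cap, public_cap)   # 3) LLM
    return diffusion_edit(image, edits)               # 4) text-guided editing
```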

V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs

This work discovers that Value features in ViT exhibit more disentangled local semantic representations compared to Patch features, and proposes V-Attack, which achieves precise and controllable local semantic attacks on LVLMs via self-enhanced Value features and text-guided semantic manipulation, improving average ASR by 36%.
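
For reference, Value features are directly extractable from a ViT attention block. With PyTorch's packed in-projection they are the last third of the QKV projection, as below; real ViT backbones vary, so adapt the slicing to the model under attack.

```python
import torch

def value_features(attn: torch.nn.MultiheadAttention, x: torch.Tensor) -> torch.Tensor:
    """x: (batch, tokens, d). Returns the per-token Value projections."""
    d = attn.embed_dim
    Wv, bv = attn.in_proj_weight[2 * d:], attn.in_proj_bias[2 * d:]  # V rows of QKV
    return x @ Wv.T + bv

attn = torch.nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
print(value_features(attn, torch.randn(1, 197, 768)).shape)  # torch.Size([1, 197, 768])
```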