🔒 LLM Safety
📷 CVPR2025 · 14 paper notes
📌 Same area in other venues: 📷 CVPR2026 (12) · 🔬 ICLR2026 (185) · 💬 ACL2026 (115) · 🤖 AAAI2026 (41) · 🧠 NeurIPS2025 (81) · 📹 ICCV2025 (10)
🔥 Top topics: Multimodal/VLM ×6 · Continual Learning ×2
- A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks
  - Proposes a training-free, data-free debiasing method for VLMs. By deriving a closed-form solution in a cross-modal space, it achieves Pareto-optimal trade-offs between fairness and utility retention and consistently outperforms existing approaches across three downstream tasks: zero-shot classification, text-to-image retrieval, and text-to-image generation.
- Dual Consolidation for Pre-Trained Model-Based Domain-Incremental Learning
  - Proposes Duct for exemplar-free domain-incremental learning on pre-trained models. Duct combines representation consolidation (accumulating task vectors to build a unified embedding space) with classifier consolidation (using category semantic information and optimal transport to estimate classifier weights for old domains). It outperforms state-of-the-art methods by 1% to 7% across four benchmark datasets.
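The task-vector accumulation mentioned in the Duct entry can be illustrated with a minimal sketch. It assumes a task vector is the element-wise difference between domain-fine-tuned and pre-trained weights, and that vectors are merged with a single scaling coefficient `alpha`. Both the flat weight lists and `alpha` are hypothetical simplifications; the paper's actual merging rule may differ.

```python
def task_vector(finetuned, pretrained):
    """Difference between fine-tuned and pre-trained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]


def consolidate(pretrained, finetuned_models, alpha=1.0):
    """Add every domain's scaled task vector onto the pre-trained weights."""
    merged = list(pretrained)
    for finetuned in finetuned_models:
        vector = task_vector(finetuned, pretrained)
        merged = [m + alpha * v for m, v in zip(merged, vector)]
    return merged


# Toy weights (illustrative values only).
pretrained = [0.0, 1.0, 2.0]
domain_a = [0.5, 1.0, 2.0]   # fine-tuned on a first domain
domain_b = [0.0, 1.0, 1.0]   # fine-tuned on a second domain
merged = consolidate(pretrained, [domain_a, domain_b], alpha=0.5)
```

Each domain only shifts the coordinates it changed, so the merged weights carry contributions from both domains without storing exemplars.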
- LLM4SVG: Empowering LLMs to Understand and Generate Complex Vector Graphics
  - Proposes the LLM4SVG framework, which enables open-source LLMs (such as GPT-2, Phi-2, and Falcon) to understand and generate high-quality, complex vector graphics. It defines 55 learnable SVG semantic tokens that replace raw XML tags and applies two-stage instruction fine-tuning on the SVGX-SFT dataset, which contains 250K high-quality SVGs and 580K instruction-following pairs. The GPT-2 XL-based model achieves an FID of 64.11 and a CLIPScore of 0.3496, significantly outperforming GPT-4o (127.78 FID) and all existing SVG generation methods.
- ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models
  - Observing that semantic-driven visual token pruning discards forensic evidence, because tampering traces reside in low-saliency regions, this work proposes ForensicZip. It uses Birth-Death optimal transport to quantify physical inter-frame discontinuities and incorporates a high-frequency prior to preserve forensic signals. At a 10% token retention rate, ForensicZip achieves a 2.97x speedup and a FLOPs reduction of over 90% with no performance degradation.
- Hyperbolic Safety-Aware Vision-Language Models
  - HySAC builds safety-aware vision-language models in hyperbolic space. Entailment cones map safe and unsafe content to different regions (safe content near the origin, unsafe content far from it), which gives the model safe-content classification and dynamic redirection capabilities. It significantly outperforms existing unlearning methods in retrieval safety and NSFW detection.
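The "near the origin versus far from the origin" idea in the HySAC entry can be sketched as a distance-based score. The sketch assumes a Lorentz (hyperboloid) model with unit negative curvature, where a point with spatial part of Euclidean norm r lies at geodesic distance asinh(r) from the origin. The threshold value and the function names are hypothetical, and the paper's actual geometry, entailment cones, and decision rule are not reproduced here.

```python
import math


def distance_from_origin(space_part):
    """Geodesic distance to the hyperboloid origin (unit curvature).

    With time coordinate x0 = sqrt(1 + r^2), the distance is
    acosh(x0), which equals asinh(r).
    """
    r = math.sqrt(sum(v * v for v in space_part))
    return math.asinh(r)


def is_flagged_unsafe(space_part, radius_threshold):
    """Flag embeddings lying farther from the origin than the threshold."""
    return distance_from_origin(space_part) > radius_threshold


THRESHOLD = 1.0                 # hypothetical decision radius
near_embedding = [0.1, 0.2]     # toy embedding close to the origin
far_embedding = [3.0, 4.0]      # toy embedding far from the origin
```

In this toy setup `near_embedding` falls inside the threshold radius and `far_embedding` falls outside it.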
- LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty
  - Proposes LoTUS, which uses logit temperature adjustment and Gumbel-Softmax to smooth predictions on the samples to be forgotten. By dynamically scheduling the temperature, it converges to the target at which forget-set accuracy equals unseen-set accuracy. This enables efficient unlearning on the large-scale ImageNet-1K benchmark (Avg Gap of 0.0150 on ViT). It also introduces RF-JSD, a retraining-free evaluation metric with a Pearson correlation of 0.92 with the true JSD.
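To make the smoothing in the LoTUS entry concrete, here is a minimal sketch of temperature-scaled softmax and a Gumbel-Softmax sample. It only shows how a higher temperature flattens a prediction; the dynamic temperature schedule and the training objective from the paper are not modeled, and the logits are made-up values.

```python
import math
import random


def softmax(logits, tau=1.0):
    """Numerically stable softmax of logits divided by temperature tau."""
    scaled = [l / tau for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, tau, rng):
    """Add Gumbel(0, 1) noise to the logits, then apply tempered softmax."""
    noisy = []
    for l in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        noisy.append(l - math.log(-math.log(u)))
    return softmax(noisy, tau)


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


logits = [4.0, 1.0, 0.5]          # toy logits for one forget-set sample
sharp = softmax(logits, tau=1.0)
smooth = softmax(logits, tau=5.0)
sample = gumbel_softmax(logits, tau=5.0, rng=random.Random(0))
```

Raising `tau` increases the entropy of the output distribution, which is the sense in which the prediction becomes less confident.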
- Low-Rank Adaptation in Multilinear Operator Networks for Security-Preserving Incremental Learning
  - Addresses catastrophic forgetting in multilinear operator networks under Leveled Fully Homomorphic Encryption (Leveled FHE). The proposed incremental learning method combines Low-Rank Adaptation (LoRA) with Gradient Projection Memory (GPM) to achieve continual learning while preserving data security.
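The two building blocks named in this entry can be sketched in plaintext. The sketch assumes the standard LoRA form, in which a frozen weight W is augmented by a low-rank product B·A, and the basic GPM idea of removing gradient components along stored directions, here assumed to be orthonormal. Homomorphic encryption and the multilinear operator architecture are not modeled, and all matrices are toy values.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_effective_weight(w, b, a, scale=1.0):
    """Return W + scale * (B @ A); W stays frozen, only B and A are trained."""
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def project_out(grad, basis):
    """Remove gradient components along an orthonormal basis of old-task directions."""
    out = list(grad)
    for direction in basis:
        coeff = sum(g * d for g, d in zip(grad, direction))
        out = [o - coeff * d for o, d in zip(out, direction)]
    return out


w = [[1.0, 0.0], [0.0, 1.0]]    # frozen 2x2 weight
b = [[1.0], [0.0]]              # 2x1 low-rank factor
a = [[0.0, 2.0]]                # 1x2 low-rank factor
adapted = lora_effective_weight(w, b, a)
safe_grad = project_out([1.0, 1.0], basis=[[1.0, 0.0]])
```

The projected gradient has no component along the protected direction, so an update along it leaves what was learned in that direction untouched.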
- MP-GUI: Modality Perception with MLLMs for GUI Understanding
  - MP-GUI designs three specialized perceivers that extract graphical, textual, and spatial modal information from GUIs. It combines the three modalities through a spatial structure refinement strategy and an adaptive fusion gate, and outperforms general MLLMs on various GUI understanding tasks under limited training data.
- Neural Gate: Mitigating Privacy Risks in LVLMs via Neuron-Level Gradient Gating
  - Neural Gate finds that privacy-related neurons in LVLMs are strongly inconsistent across samples: only about 10% of neurons consistently encode privacy signals. Based on this finding, it proposes a neuron-level gradient-gating editing method that applies gradient updates only to the strongly consistent privacy neurons, improving Safety EtA from 0.48 to 0.89 on MiniGPT while maintaining utility.
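The gating idea in the Neural Gate entry can be sketched as a masked gradient step. The sketch assumes each sample yields a binary mask marking which neurons were attributed as privacy-related, and that a neuron counts as "consistent" when it is marked in at least a threshold fraction of samples. The attribution procedure, the threshold, and all numbers are hypothetical.

```python
def consistency(attribution_masks):
    """Fraction of samples in which each neuron was marked privacy-related."""
    n_samples = len(attribution_masks)
    n_neurons = len(attribution_masks[0])
    return [sum(mask[j] for mask in attribution_masks) / n_samples
            for j in range(n_neurons)]


def gate(scores, threshold):
    """1.0 for neurons that are consistently marked, 0.0 otherwise."""
    return [1.0 if s >= threshold else 0.0 for s in scores]


def gated_step(params, grads, mask, lr):
    """Gradient step that only touches the gated neurons."""
    return [p - lr * g * m for p, g, m in zip(params, grads, mask)]


# Three samples, three neurons: only neuron 0 is marked every time.
masks = [[1, 0, 1], [1, 1, 0], [1, 0, 0]]
scores = consistency(masks)
mask = gate(scores, threshold=0.9)
new_params = gated_step([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], mask, lr=0.1)
```

Neurons 1 and 2 keep their original values, which is how a gated edit can limit collateral changes to the rest of the model.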
- Protecting Your Video Content: Disrupting Automated Video-Based LLM Annotations
  - Proposes two adversarial video watermarking methods: Ramblings, which induce video LLMs to generate incorrect descriptions, and Mutes, which induce them to generate extremely short or empty descriptions. Both use imperceptible adversarial perturbations to protect personal videos from unauthorized automated annotation. The paper also demonstrates that the resulting low-quality annotations degrade the performance of downstream text-to-video generation models.
- Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks
  - Proposes ASTRA, which locates jailbreak-related visual tokens in adversarial images via image attribution, constructs steering vectors that represent harmful response directions, and applies adaptive activation steering at inference time to move the model away from those directions. It achieves SOTA defense performance, with a 12% lower toxicity score, an 18% lower ASR, and a 9x speedup compared to JailGuard.
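Adaptive activation steering, as described in the ASTRA entry, can be sketched as subtracting a harmful direction in proportion to how strongly an activation already points along it. This is a simplified illustration: the projection-based strength, the coefficient `alpha`, and the two-dimensional activations are assumptions, and the paper's attribution step and exact steering formula are not reproduced.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steer(activation, harmful_direction, alpha=1.0):
    """Subtract the harmful direction, scaled by the activation's projection onto it.

    Activations with a non-positive projection are returned unchanged,
    so inputs that do not point toward the harmful direction are unaffected.
    """
    norm = math.sqrt(dot(harmful_direction, harmful_direction))
    unit = [x / norm for x in harmful_direction]
    strength = max(0.0, dot(activation, unit))
    return [h - alpha * strength * u for h, u in zip(activation, unit)]


harmful = [1.0, 0.0]                       # toy steering vector
steered = steer([2.0, 1.0], harmful)       # points toward the harmful direction
untouched = steer([-2.0, 1.0], harmful)    # points away from it
```

With `alpha=1.0` the harmful component is removed entirely while the orthogonal component is preserved.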
- TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models
  - Presents the first test-time adversarial defense method for VLMs. It learns a defensive prompt for each test sample by minimizing prediction entropy across multi-view augmentations and aligning adversarial-clean embedding statistics. With only a single optimization step, it improves the robustness of CLIP against AutoAttack from 0.1% to 48.9%.
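The multi-view entropy signal mentioned in the TAPT entry can be sketched as the entropy of the prediction averaged over augmented views. The sketch assumes class probabilities per view are already available and shows only the quantity being minimized, not the prompt optimization or the statistics-alignment term. The probability values are made up.

```python
import math


def mean_prediction(view_probs):
    """Average class probabilities over all augmented views."""
    n_views = len(view_probs)
    n_classes = len(view_probs[0])
    return [sum(view[c] for view in view_probs) / n_views
            for c in range(n_classes)]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def multiview_entropy(view_probs):
    """Entropy of the view-averaged prediction."""
    return entropy(mean_prediction(view_probs))


agreeing_views = [[0.9, 0.1], [0.8, 0.2]]      # views agree on class 0
conflicting_views = [[0.9, 0.1], [0.1, 0.9]]   # views contradict each other
low = multiview_entropy(agreeing_views)
high = multiview_entropy(conflicting_views)
```

Conflicting views average to a near-uniform prediction with high entropy, so lowering this quantity pushes the views toward a consistent, confident answer.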
- CleanSight: Test-Time Attention Purification for Backdoored Large Vision Language Models
  - CleanSight shows that the backdoor mechanism in LVLMs lies not in pixel space but in the attention map: the trigger activates the backdoor via "attention stealing", where trigger tokens hijack the attention of text tokens. Based on this, it proposes a training-free, plug-and-play test-time defense that identifies poisoned inputs by detecting anomalies in cross-modal attention ratios and neutralizes the backdoor by pruning high-attention visual tokens. This reduces the ASR to near 0% with almost no impact on model performance.
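The detect-then-prune pipeline in the CleanSight entry can be sketched on a single attention row. The sketch assumes the cross-modal ratio is the attention mass on visual tokens divided by the mass on text tokens, and that an input is flagged when this ratio exceeds a fixed threshold. The ratio definition, the threshold, the pruning budget `k`, and the attention values are all hypothetical.

```python
def visual_text_ratio(attention, visual_idx, text_idx):
    """Attention mass on visual tokens relative to text tokens."""
    visual = sum(attention[i] for i in visual_idx)
    text = sum(attention[i] for i in text_idx)
    return visual / max(text, 1e-12)


def prune_top_visual(attention, visual_idx, k):
    """Return indices of tokens kept after dropping the k most-attended visual tokens."""
    ranked = sorted(visual_idx, key=lambda i: attention[i], reverse=True)
    dropped = set(ranked[:k])
    return [i for i in range(len(attention)) if i not in dropped]


# Toy attention row: tokens 0-2 are visual, tokens 3-4 are text.
attention = [0.05, 0.70, 0.05, 0.10, 0.10]
visual_idx, text_idx = [0, 1, 2], [3, 4]
RATIO_THRESHOLD = 2.0                      # hypothetical anomaly threshold

ratio = visual_text_ratio(attention, visual_idx, text_idx)
flagged = ratio > RATIO_THRESHOLD
kept = prune_top_visual(attention, visual_idx, k=1) if flagged else list(range(len(attention)))
```

Here token 1 absorbs most of the attention, so the input is flagged and that token is the one removed.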
- Towards All-in-One Medical Image Re-Identification
  - Proposes MaMI, the first all-in-one unified model for medical image re-identification. It dynamically generates modality-specific parameters via a Continuous-modality Parameter Adapter (ComPA) and transfers medical priors from medical foundation models through difference feature alignment. MaMI outperforms 25 foundation models and 8 large language models across 11 datasets.