# Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks
- **Conference:** CVPR 2026
- **arXiv:** 2508.04097
- **Code:** Project Page
- **Area:** Multimodal VLM
- **Keywords:** Model inversion attack, vision-language models, privacy security, adaptive token weighting, training data leakage
## TL;DR
This paper presents the first systematic study of model inversion (MI) attacks against vision-language models (VLMs). It proposes SMI-AW, a sequence-level inversion method built on adaptive token attention weighting, which dynamically weights each token's gradient by its visual relevance in order to reconstruct private training images from VLMs. The method achieves a human-evaluated attack accuracy of up to 61.21%.
## Background & Motivation
- Privacy risks of model inversion attacks: MI attacks reconstruct private training data from trained model parameters and have been extensively studied on unimodal DNNs (e.g., classifiers), yet the security of multimodal VLMs remains unexplored.
- Security concerns from large-scale VLM deployment: VLMs such as LLaVA, Qwen2.5-VL, and InternVL are widely deployed in sensitive domains including healthcare and finance, making potential training data leakage a serious concern.
- Fundamental differences between VLMs and DNNs: VLMs process multimodal inputs, comprise multiple independent modules (visual encoder, projector, language model), and produce token sequences rather than class labels, making direct application of traditional MI methods infeasible.
- Frozen visual encoders: Many VLMs freeze the visual encoder during fine-tuning and update only the language model and projector, meaning private information is embedded in language model parameters and must be extracted indirectly.
- Heterogeneity of token sequences: Different output tokens of a VLM vary substantially in their dependence on visual input — some exhibit strong visual correlation while others are driven purely by linguistic context — introducing noise when treated uniformly.
- Risks from publicly released models: Publicly available pretrained VLMs (e.g., LLaVA-v1.6-7B) are also vulnerable to MI attacks that can reconstruct sensitive information such as faces from their training data.
## Method
### Overall Architecture
The attacker has white-box access (model parameters, attention maps, logits) and uses a generator \(G\) (e.g., StyleGAN2) pretrained on public datasets as an image prior. The attack optimizes a low-dimensional latent vector \(w\), reconstructing private images via \(x = G(w)\); the objective is to find \(w^*\) such that the VLM's output for the reconstructed image \(G(w^*)\) closely matches the target text answer \(\mathbf{y}\).
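To make the setup concrete, here is a minimal sketch of the white-box inversion loop. The `vlm`, `generator`, and `inversion_loss` interfaces are hypothetical placeholders (the loss is instantiated under Inversion Loss Functions below); the paper's actual latent initialization, optimizer, and schedule may differ.

```python
import torch

def invert(vlm, generator, prompt_ids, target_ids, steps=1000, lr=0.05):
    # Initialize the latent in the generator's W space (assumed helper).
    w = generator.mean_latent().clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        image = generator(w)                       # x = G(w)
        # Teacher-forced pass: logits for each answer token y_i given y_<i.
        logits = vlm(prompt_ids, image, target_ids)
        loss = inversion_loss(logits, target_ids)  # L_CE / L_MML / L_LOM
        opt.zero_grad()
        loss.backward()                            # gradient flows through G into w
        opt.step()
    return generator(w).detach()                   # reconstructed private image
```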
### Four Inversion Strategies
- TMI (Token-based MI): Updates the latent vector \(w\) token by token; each token \(y_i\) independently computes a gradient and triggers one update step.
- TMI-C (Convergent TMI): Performs \(K\) convergence iterations per token \(y_i\) before proceeding to the next, better aligning generative dependencies across tokens.
- SMI (Sequence-based MI): Averages the loss over all tokens in the sequence and performs a single gradient update that aggregates full-sequence information, providing a more stable and consistent optimization direction (contrasted with TMI in the sketch after this list): \(\mathcal{L} = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}_{inv}(M(\mathbf{t}, G(w), y_{<i}), y_i)\)
- SMI-AW (proposed method): Extends SMI with adaptive token weighting.
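To make the token- vs. sequence-level distinction concrete, here is a sketch under the same assumed interfaces as above, using teacher-forced cross-entropy as the per-token inversion loss:

```python
import torch.nn.functional as F

def per_token_losses(vlm, prompt_ids, image, target_ids):
    # L_inv(M(t, G(w), y_<i), y_i) at every position i; returns shape (m,).
    logits = vlm(prompt_ids, image, target_ids)  # assumed shape (m, vocab)
    return F.cross_entropy(logits, target_ids, reduction="none")

def tmi_step(opt, vlm, generator, w, prompt_ids, target_ids):
    # TMI: each token y_i independently triggers one latent update.
    for i in range(len(target_ids)):
        loss_i = per_token_losses(vlm, prompt_ids, generator(w), target_ids)[i]
        opt.zero_grad()
        loss_i.backward()
        opt.step()

def smi_step(opt, vlm, generator, w, prompt_ids, target_ids):
    # SMI: one update from the loss averaged over the whole sequence,
    # yielding a single, more stable gradient direction.
    loss = per_token_losses(vlm, prompt_ids, generator(w), target_ids).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```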
### SMI-AW Adaptive Weighting Design
- Core observation: Cross-attention maps differ substantially across output tokens — tokens with strong visual correlation yield high attention responses and carry richer visual information in their gradients.
- Weight computation: A normalized weight \(\beta_i\) is computed from the total visual attention value \(\alpha_i\) of token \(y_i\) over the image: \(\beta_i = \frac{\alpha_i}{\sum_{j=1}^{m}\alpha_j}\)
- Dynamic update: \(\beta_i\) is recomputed at each inversion step, since token visual dependence shifts as the reconstructed image progressively approaches the target.
- Weighted loss (a code sketch follows this list): \(\mathcal{L} = \sum_{i=1}^{m}\beta_i \mathcal{L}_{inv}(M(\mathbf{t}, G(w), y_{<i}), y_i)\)
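A minimal sketch of the weighted loss, assuming each output token's cross-attention over the image tokens has been aggregated into a tensor of shape `(m, num_image_tokens)` (the layer/head aggregation is an implementation detail of the paper), and treating the weights as per-step constants (an assumption of this sketch):

```python
import torch

def smi_aw_loss(per_token_loss, attn_to_image):
    # alpha_i: total visual attention mass of output token y_i; shape (m,).
    alpha = attn_to_image.sum(dim=-1)
    # beta_i = alpha_i / sum_j alpha_j, recomputed at every inversion step;
    # detach() keeps gradients flowing only through the losses, not the weights.
    beta = (alpha / alpha.sum()).detach()
    return (beta * per_token_loss).sum()  # L = sum_i beta_i * L_inv_i
```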
### Inversion Loss Functions
Three classical unimodal MI losses are extended to the VLM setting (sketched after this list):
- \(\mathcal{L}_{CE}\): Cross-entropy loss
- \(\mathcal{L}_{MML}\): Maximum margin loss
- \(\mathcal{L}_{LOM}\): Logit optimization/maximization loss (best performing)
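Per-token sketches of the three losses as they are commonly defined in the unimodal MI literature, here applied over answer-token logits (`logits` of shape `(m, vocab)`, `target` of shape `(m,)`); the paper's exact formulations may differ:

```python
import torch
import torch.nn.functional as F

def l_ce(logits, target):
    # Cross-entropy over the target answer tokens.
    return F.cross_entropy(logits, target)

def l_mml(logits, target):
    # Maximum margin: push each target logit above its strongest competitor.
    tgt = logits.gather(1, target[:, None]).squeeze(1)
    runner_up = logits.scatter(1, target[:, None], float("-inf")).amax(dim=1)
    return (runner_up - tgt).mean()

def l_lom(logits, target):
    # Logit maximization: directly maximize the target logits.
    return -logits.gather(1, target[:, None]).squeeze(1).mean()
```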
## Key Experimental Results
### Experimental Setup
- Target VLMs: LLaVA-v1.6-7B, Qwen2.5-VL-7B, MiniGPT-v2, InternVL2.5-8B
- Datasets: FaceScrub (530 classes), CelebA (1,000 classes), StanfordDogs (120 classes), formatted as VQA tasks
- Generators: StyleGAN2 trained on FFHQ and AFHQ-Dogs
- Metrics: \(AttAcc_M\) (MLLM-judged attack accuracy), \(AttAcc_D\) Top-1/Top-5 (attack accuracy judged by a DNN classifier), \(\delta_{face}\) and \(\delta_{eval}\) (feature-space distances between reconstructions and the target identity; lower is better), and \(AttAcc_H\) (human evaluation)
### Main Results (FaceScrub + LLaVA-v1.6-7B + \(\mathcal{L}_{LOM}\))
| Method | \(AttAcc_M\)↑ | Top1↑ | Top5↑ | \(\delta_{face}\)↓ | \(\delta_{eval}\)↓ |
|---|---|---|---|---|---|
| TMI | 44.34% | 21.77% | 44.69% | 0.8488 | 141.87 |
| TMI-C | 31.16% | 9.32% | 24.22% | 1.0221 | 457.49 |
| SMI | 59.17% | 33.47% | 61.89% | 0.7465 | 140.83 |
| SMI-AW | 61.01% | 37.62% | 66.16% | 0.7265 | 134.94 |
### Cross-Dataset Results (SMI-AW + \(\mathcal{L}_{LOM}\))
| Dataset | \(AttAcc_M\)↑ | Top1↑ | Top5↑ |
|---|---|---|---|
| CelebA | 67.05% | 45.25% | 69.55% |
| StanfordDogs | 78.13% | 56.15% | 84.79% |
### Cross-Model Generalization (FaceScrub + SMI-AW + \(\mathcal{L}_{LOM}\))
| Model | \(AttAcc_M\)↑ | Top1↑ | Top5↑ |
|---|---|---|---|
| InternVL2.5-8B | 55.05% | 25.05% | 52.10% |
| MiniGPT-v2 | 47.92% | 14.62% | 33.82% |
| Qwen2.5-VL-7B | 32.03% | 13.21% | 27.24% |
### Human Evaluation (\(AttAcc_H\))
Human attack accuracy across different VLMs on FaceScrub ranges from 53.42% to 57.22%, reaching 61.21% on CelebA and 55.42% on StanfordDogs.
### Ablation Study: Token-based vs. Sequence-based
- Sequence-level methods (SMI/SMI-AW) achieve match rates above 95%, far exceeding token-level methods (TMI: 60–79%; TMI-C: below 30%).
- Token-level methods suffer from high gradient variance and instability; gradients from weakly visual-correlated tokens mislead the optimization direction.
- SMI-AW's adaptive weights are dynamically updated at each step, amplifying contributions from visually correlated tokens while suppressing noise from language-driven tokens.
## Highlights & Insights
- Pioneering contribution: The first systematic study of MI attacks on VLMs, addressing a critical gap in multimodal privacy security.
- Elegant method design: Cross-attention maps serve as natural proxy indicators of token visual relevance, making the adaptive weighting scheme both principled and effective.
- Comprehensive evaluation: 4 VLMs × 3 datasets × 5 metrics, including large-scale human evaluation and validation on publicly released models.
- Real-world threat validation: Successful reconstruction of celebrity faces from the publicly released LLaVA-v1.6-7B confirms that the privacy risk is genuine.
- Mechanistic analysis of token- vs. sequence-level methods: The paper not only proposes the method but also provides a mechanistic explanation for why token-level approaches fail.
## Limitations & Future Work
- Strong white-box assumption: Full access to model parameters and attention maps is required, whereas API-based (black-box) access is more representative of real-world deployment scenarios.
- Generator dependency: Training StyleGAN in the same domain as the private data requires domain-aligned public data, limiting applicability to novel or data-scarce domains.
- VQA-specific setting: Experiments are conducted on VQA-formatted data; more complex VLM interaction modes such as conversational or instruction-following settings remain unvalidated.
- Absence of defenses: The paper identifies the risk but does not propose countermeasures such as differential privacy or gradient clipping.
- Computational overhead: Each step requires a full forward pass and attention map extraction over the complete sequence, raising scalability concerns for long sequences.
## Related Work & Insights
- Traditional MI methods (PPA, PLG-MI, etc.): Target unimodal classifiers and optimize the likelihood of a single class label; this work targets VLMs with token-sequence outputs, necessitating an entirely new design.
- VLM security research: Existing work covers jailbreak attacks and adversarial examples, but model inversion for training data reconstruction in VLMs is studied systematically here for the first time.
- Attention mechanism analysis: Prior work observes that attention to image tokens decays in later layers of VLMs; this paper leverages that observation to guide gradient weighting in MI attacks.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — First extension of MI attacks to VLMs; both problem formulation and method design are pioneering contributions
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 4 models × 3 datasets × multiple metrics × human evaluation × validation on public models
- Writing Quality: ⭐⭐⭐⭐ — Well-structured with a clear progressive narrative from TMI→TMI-C→SMI→SMI-AW, though notation-heavy
- Value: ⭐⭐⭐⭐⭐ — Reveals serious privacy risks in VLMs with significant implications for the security community