Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs¶
Conference: CVPR 2026 arXiv: 2602.18845 Code: GitHub Area: Multimodal Large Language Model Security Keywords: MLLM copyright protection, adversarial attack, trigger image, dual injection, CLIP semantic alignment, black-box tracking
TL;DR¶
This paper proposes the AGDI framework for black-box copyright tracking in MLLMs via adversarially optimized trigger images. A dual injection mechanism simultaneously injects copyright information at the response level (CE loss driving an auxiliary model to produce a target answer) and the semantic level (minimizing cosine distance between the trigger image and target text in CLIP space). An adversarial training scheme simulates fine-tuning resistance. AGDI consistently outperforms PLA and RNA baselines on Qwen2-VL and LLaVA-1.5.
Background & Motivation¶
- Open-source MLLMs invite copyright disputes: Open-source MLLMs (e.g., LLaVA, Qwen-VL) are fine-tuned by malicious users for commercial gain with false ownership claims, necessitating effective copyright tracking tools for model publishers.
- White-box methods are impractical: Methods relying on internal model parameters, gradients, or feature distributions (watermarking, fingerprinting) are limited in practice by black-box access—suspicious models typically expose only API queries.
- Existing black-box methods overfit the base model: Methods such as PLA inject triggers via adversarial training, but the resulting trigger images are over-dependent on the base model's specific response patterns and degrade severely after downstream fine-tuning.
- Stability of CLIP-like alignment modules: Most MLLMs contain CLIP-like cross-modal alignment modules whose high-level image-text embeddings remain relatively stable after fine-tuning, providing an opportunity to design generalizable trigger mechanisms.
- Single-level injection is insufficient: Response-level injection alone lacks cross-model generalizability; semantic-level injection (CLIP feature alignment) alone lacks activation precision for specific models. Dual-level injection is needed for complementarity.
Method¶
Problem Formulation¶
Given a base MLLM \(f_\theta\), the goal is to construct a trigger image \(x_{\text{trig}}\) such that the base model and its fine-tuned derivatives produce a predefined target answer \(a_{\text{tar}}\) when queried with \((x_{\text{trig}}, q_{\text{trig}})\), while non-derivative models do not. Trigger Q-A pairs are designed as rare combinations (e.g., "Q: Detecting copyright. A: ICLR Conference.") to prevent accidental activation during normal training.
Overall Architecture: Adversarial-Guided Dual Injection (AGDI)¶
The core optimization objective is a min-max game over the combined injection loss \(\mathcal{L}_{\text{trig}}\) (defined below):

\[\min_{x} \; \max_{\theta} \; \mathcal{L}_{\text{trig}}(x, \theta)\]

The trigger image \(x\) (minimizing the injection loss) and the auxiliary model parameters \(\theta\) (maximizing it to simulate fine-tuning resistance) are optimized alternately.
Key Designs¶
Injection 1: Response-level Injection¶
A cross-entropy loss forces the auxiliary MLLM to generate the target answer given the trigger image and trigger question:

\[\mathcal{L}_{\text{res}}(x, \theta) = -\sum_{t} \log p_\theta\big(a_{\text{tar}}^{(t)} \mid x, \, q_{\text{trig}}, \, a_{\text{tar}}^{(<t)}\big)\]

Gradients are backpropagated to the trigger image, injecting copyright-relevant information into its pixels.
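As a concrete illustration, here is a minimal PyTorch sketch of this loss, assuming a hypothetical `mllm` with a HuggingFace-style interface (`pixel_values`, `input_ids`, `labels`); all names here are illustrative, not the authors' code:

```python
import torch

def response_injection_loss(mllm, x_trig, q_ids, a_tar_ids):
    """Cross-entropy on the target answer tokens, differentiable w.r.t. the image.

    Hypothetical interface: `mllm` accepts pixel values plus token ids and
    returns an object with `.loss` computed over the labeled positions.
    """
    input_ids = torch.cat([q_ids, a_tar_ids], dim=-1)
    # Supervise only the answer positions; -100 is the standard ignore index.
    labels = input_ids.clone()
    labels[:, : q_ids.size(-1)] = -100
    out = mllm(pixel_values=x_trig, input_ids=input_ids, labels=labels)
    return out.loss  # gradients flow back to x_trig through the vision tower
```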
Injection 2: Semantic-level Injection¶
The MLLM's built-in CLIP-like cross-modal alignment module is exploited by minimizing the cosine distance between the trigger image and the target text in the shared embedding space:

\[\mathcal{L}_{\text{sem}}(x) = 1 - \cos\big(\mathcal{E}_\phi(x), \, \mathcal{E}_\psi(a_{\text{tar}})\big)\]

where \(\mathcal{E}_\phi\) and \(\mathcal{E}_\psi\) are the CLIP image and text encoders, respectively. This semantic injection exploits the observed stability of the CLIP module under fine-tuning (empirically, cosine-similarity drift is only 0.5%–9.3%), endowing the trigger with cross-derivative generalizability.
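A minimal sketch of this loss, using the standalone `transformers` CLIP model as a stand-in for the MLLM's built-in alignment module (the checkpoint name and preprocessing details are assumptions):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def semantic_injection_loss(x_trig, target_text="ICLR Conference"):
    """Cosine distance between trigger image and target text in CLIP space.

    `x_trig` is assumed to be CLIP-preprocessed pixel values (resized and
    normalized), so gradients can flow back to the image.
    """
    img_feat = clip.get_image_features(pixel_values=x_trig)
    tok = processor(text=[target_text], return_tensors="pt")
    txt_feat = clip.get_text_features(**tok)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return 1.0 - (img_feat * txt_feat).sum(dim=-1).mean()
```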
Loss & Training¶
Adversarial Training — Simulating Fine-tuning Resistance
With the trigger image fixed, the auxiliary model parameters are updated to resist generating the target text:

\[\mathcal{L}_{\text{model}}(\theta) = -\,\mathcal{L}_{\text{res}}(x, \theta)\]

Model parameters are updated as \(\theta \leftarrow \theta - \gamma \cdot \text{clip}(\nabla_\theta \mathcal{L}_{\text{model}})\); the trigger image is updated PGD-style as \(x \leftarrow x - \alpha \cdot \text{sign}(\nabla_x \mathcal{L}_{\text{trig}})\), where \(\mathcal{L}_{\text{trig}}\) is the combined injection loss (a weighted combination of \(\mathcal{L}_{\text{res}}\) and \(\mathcal{L}_{\text{sem}}\)). Crucially, after each trigger-image update, the model parameters are reset to the cloned reference weights, \(\theta \leftarrow \theta_{\text{ref}}\), to prevent accumulated drift from affecting subsequent trigger optimization.
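Putting the pieces together, a sketch of the alternating optimization under the update rules above; `response_injection_loss` and `semantic_injection_loss` refer to the sketches in the previous subsections, and the loss weight `lam` and the gradient-clipping bound are assumptions not specified in this summary:

```python
import copy
import torch

def agdi_optimize(mllm, x_init, q_ids, a_tar_ids, eps=16/255, alpha=1/255,
                  gamma=1e-5, K=1000, lam=1.0):
    """Alternating min-max optimization of the trigger image (sketch).

    Each round: (1) one gradient-ascent step on the auxiliary model so it
    resists the target answer, (2) one PGD step on the trigger image against
    the adapted model, (3) reset the model to the cloned reference weights.
    """
    theta_ref = copy.deepcopy(mllm.state_dict())  # cloned reference parameters
    x = x_init.clone()

    for _ in range(K):
        # --- maximize: adapt the auxiliary model to resist target generation ---
        mllm.zero_grad()
        loss_model = -response_injection_loss(mllm, x.detach(), q_ids, a_tar_ids)
        loss_model.backward()
        with torch.no_grad():
            for p in mllm.parameters():
                if p.grad is not None:
                    p -= gamma * p.grad.clamp(-1.0, 1.0)  # clip bound assumed

        # --- minimize: PGD step on the trigger image against the adapted model ---
        x = x.detach().requires_grad_(True)
        loss_trig = (response_injection_loss(mllm, x, q_ids, a_tar_ids)
                     + lam * semantic_injection_loss(x))
        loss_trig.backward()
        with torch.no_grad():
            x = x - alpha * x.grad.sign()
            # project back into the eps-ball around x_init and valid pixel range
            x = (x_init + (x - x_init).clamp(-eps, eps)).clamp(0.0, 1.0)

        mllm.load_state_dict(theta_ref)  # reset to prevent accumulated drift
    return x.detach()
```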
Trigger Design
- 5 trigger Q-A pairs (e.g., "Detecting copyright → ICLR Conference," "What are you busy with → I'm playing games"), all representing rare combinations in everyday dialogue.
- 200 ImageNet validation images × 5 Q-A pairs = 1,000 trigger queries.
- Perturbation budget \(\epsilon = 16/255\), PGD steps \(K = 1000\), step size \(\alpha = 1/255\).
Key Experimental Results¶
Setup¶
- Base models: LLaVA-1.5-7B, Qwen2-VL-2B-Instruct
- Fine-tuning: LoRA (rank=16, α=32, lr=2e-4) and full fine-tuning (lr=1e-5)
- Downstream datasets: V7W, ST-VQA, TextVQA, PaintingForm, MathV360k
- Metric: Attack Success Rate (ASR), the proportion of trigger queries for which the model's output contains the target text (a minimal sketch follows this list)
- Baselines: Ordinary (vanilla CE + frozen model), RNA, PLA
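Since verification is black-box, ASR reduces to substring matching over API responses; a minimal sketch, assuming a `query(image, question) -> str` callable (illustrative, not the authors' harness):

```python
def attack_success_rate(query, trigger_queries, target_text="ICLR Conference"):
    """Fraction of (trigger image, trigger question) pairs whose black-box
    response contains the target text, reported as a percentage."""
    hits = sum(target_text in query(img, question)
               for img, question in trigger_queries)
    return 100.0 * hits / len(trigger_queries)
```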
Main Results¶
Qwen2-VL, LoRA Fine-tuning, ASR (%)

| Method | V7W | ST-VQA | TextVQA | PaintingF | MathV | Avg |
|---|---|---|---|---|---|---|
| Ordinary | 36 | 46 | 22 | 48 | 41 | 38.6 |
| RNA | 36 | 39 | 22 | 40 | 37 | 34.8 |
| PLA | 48 | 68 | 33 | 76 | 60 | 57.0 |
| AGDI | 53 | 77 | 41 | 81 | 68 | 64.0 |

Qwen2-VL, Full Fine-tuning, ASR (%)

| Method | V7W | ST-VQA | TextVQA | PaintingF | MathV | Avg |
|---|---|---|---|---|---|---|
| Ordinary | 34 | 43 | 15 | 48 | 26 | 33.2 |
| RNA | 32 | 38 | 15 | 40 | 21 | 29.2 |
| PLA | 43 | 60 | 28 | 75 | 38 | 48.8 |
| AGDI | 46 | 65 | 33 | 80 | 45 | 53.8 |
LLaVA-1.5, LoRA Fine-tuning, ASR (%)
| Method | V7W | ST-VQA | TextVQA | PaintingF | MathV | Avg |
|---|---|---|---|---|---|---|
| PLA | 51 | 43 | 21 | 55 | 18 | 37.6 |
| AGDI | 64 | 56 | 36 | 79 | 30 | 53.0 |
AGDI leads across all base model × fine-tuning combinations. On Qwen2-VL LoRA, AGDI achieves 64% avg vs. PLA's 57%; the gap is larger on LLaVA-1.5 (53% vs. 37.6%).
Non-derivative Model Verification¶
Triggers generated from LLaVA-1.5 are tested on MiniGPT-4, Qwen2-VL, Llama3-Vision, and LLaVA-1.6. All methods (RNA/PLA/AGDI) yield 0% ASR, confirming no false triggering on non-derivative models.
Ablation Study¶
LLaVA-1.5, LoRA, ASR (%)
| Configuration | V7W | ST-VQA | TextVQA | PaintingF | MathV |
|---|---|---|---|---|---|
| w/o response injection | 0 | 1 | 1 | 1 | 4 |
| w/o semantic injection | 51 | 43 | 21 | 55 | 18 |
| w/o LLM update (CLIP only) | 32 | 39 | 20 | 19 | 13 |
| w/o encoder update (LLM only) | 60 | 55 | 29 | 70 | 29 |
| AGDI (full) | 64 | 56 | 36 | 79 | 30 |
- Removing response injection → ASR near 0% (CLIP alignment alone cannot drive target text generation).
- Removing semantic injection → degrades to PLA level (overfits the base model).
- Both injections and full adversarial training are individually necessary.
Key Findings¶
- Model pruning: Under magnitude/Wanda pruning (10–30% sparsity), AGDI achieves 59–79% ASR on PaintingF vs. PLA's 14–46%.
- Model merging: AGDI maintains a lead under linear and TIES merging.
- Quantization: ASR drops only slightly under 8-bit quantization.
- Input transformations: ASR drops to roughly 65% / 92% / 62% of its original value under resizing (256), Gaussian noise (σ=5), and JPEG compression, respectively (a sketch of these transforms follows this list).
- System prompt variation: Switching system prompts causes ±3% ASR fluctuation.
- Inference parameters: Varying temperature/top-p from 0.1 to 1.0 causes ±1% ASR fluctuation.
- Additional MLLMs: AGDI is effective on InternVL3.5-2B/8B (8B LoRA avg ~57%).
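For reference, illustrative versions of the three input transformations using Pillow/torchvision; any parameter not quoted above (e.g., JPEG quality) is an assumption:

```python
import io
import torch
from PIL import Image
from torchvision import transforms

def apply_input_transforms(img: Image.Image):
    """Return (resized, noisy, jpeg) variants of a trigger image, mirroring
    the three transformations tested: resize to 256, Gaussian noise with
    sigma=5 on the 0-255 scale, and JPEG recompression."""
    resized = transforms.Resize(256)(img)

    x = transforms.ToTensor()(img)
    noisy = (x + torch.randn_like(x) * (5.0 / 255.0)).clamp(0, 1)
    noisy = transforms.ToPILImage()(noisy)

    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=75)  # quality assumed, not reported here
    jpeg = Image.open(io.BytesIO(buf.getvalue()))
    return resized, noisy, jpeg
```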
Highlights & Insights¶
- Elegant dual injection design: Response-level injection ensures activation precision; semantic-level injection exploits the stability of the CLIP module for generalizability. The two are complementary, each covering the other's failure mode.
- Adversarial training simulates fine-tuning: The min-max game renders trigger images robust to parameter changes, and the parameter reset mechanism prevents accumulated drift.
- Fully black-box: Publishers need only query the suspicious model to verify copyright, with no access to internal parameters.
- No modification to model parameters: Optimization is entirely on the image side, leaving base model performance unaffected—suitable for post-deployment scenarios.
- Comprehensive experimental coverage: 2 base models × 2 fine-tuning methods × 5 downstream datasets, plus robustness tests covering pruning, merging, quantization, input transformations, and system prompts.
Limitations & Future Work¶
- PGD optimization for 1,000 steps over 1,000 trigger queries incurs non-trivial generation cost; acceleration strategies are not discussed.
- The perturbation budget \(\epsilon=16/255\) may not be visually imperceptible; the paper lacks perceptual evaluation (e.g., human study).
- ASR on TextVQA is consistently the lowest (41% LoRA, 33% Full on Qwen2-VL), possibly because fine-tuning on OCR-heavy tasks induces larger model change.
- Validation is limited to 2B/7B-scale models; effectiveness on larger models (e.g., 70B+) is unknown.
- Trigger Q-A pairs require manual design as rare combinations; automated design strategies are unexplored.
- No comparison with watermarking methods (which require fine-tuning the model to embed watermarks); the two paradigms target different scenarios, but readers would benefit from such a comparison.
Related Work & Insights¶
- vs. PLA (ICLR 2025): PLA also uses trigger images but relies solely on response-level injection (CE loss), overfitting the base model's response patterns. AGDI adds semantic-level injection exploiting CLIP stability and adversarial training, achieving 64% vs. 57% avg on Qwen2-VL LoRA.
- vs. RNA: RNA perturbs model parameters with random noise to simulate fine-tuning, but the perturbation direction is uncontrolled. AGDI's adversarial training is directed—specifically training the auxiliary model to resist target generation—more faithfully simulating real fine-tuning behavior.
- vs. IF (ACL 2024): IF is an LLM method that embeds fingerprints via instruction tuning, requiring model parameter modification, and achieves only 22.4% avg ASR on LLaVA-1.5 LoRA—far below AGDI's 53%.
- vs. model watermarking methods: Watermarking methods (REEF, SLIP) require fine-tuning the model to embed watermarks, degrading model performance and remaining vulnerable to removal after downstream fine-tuning. AGDI operates entirely on the image side without touching model parameters.
The observation that CLIP-like alignment modules serve as an "invariant sub-model" within MLLMs is valuable and generalizable to other cross-model transfer scenarios. The adversarial training with parameter reset paradigm is applicable to other optimization problems requiring robustness to parameter changes. Trigger image methods are fundamentally a positive application of adversarial attacks, forming a dual relationship with jailbreak attacks (a negative application).
Rating¶
- Novelty: ⭐⭐⭐⭐ — The combination of dual injection and adversarial training is innovative, and the CLIP stability observation is insightful; however, individual components (CE loss, CLIP alignment, PGD) are established techniques.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 2 base models, 2 fine-tuning methods, 5+5 datasets, complete ablations, and 6 categories of robustness tests constitute exceptionally comprehensive coverage.
- Writing Quality: ⭐⭐⭐⭐ — Problem formulation is clear, equations are concise, and experimental tables are rich; some notation could be further unified.
- Value: ⭐⭐⭐⭐ — Highly practical and directly applicable to open-source model copyright protection; the implicit assumptions about trigger image imperceptibility and rare Q-A pairs warrant further validation at scale.
Related Papers¶
- [CVPR 2026] AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models
- [ACL 2026] Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation
- [ICLR 2026] HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit
- [CVPR 2026] Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention
- [ICLR 2026] Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping