
Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs

Conference: CVPR2026 arXiv: 2602.18845 Code: GitHub Area: Multimodal Large Language Model Security Keywords: MLLM copyright protection, adversarial attack, trigger image, dual injection, CLIP semantic alignment, black-box tracking

TL;DR

This paper proposes the AGDI framework for black-box copyright tracking in MLLMs via adversarially optimized trigger images. A dual injection mechanism simultaneously injects copyright information at the response level (CE loss driving an auxiliary model to produce a target answer) and the semantic level (minimizing cosine distance between the trigger image and target text in CLIP space). An adversarial training scheme simulates fine-tuning resistance. AGDI consistently outperforms PLA and RNA baselines on Qwen2-VL and LLaVA-1.5.

Background & Motivation

  1. Open-source MLLMs invite copyright disputes: Open-source MLLMs (e.g., LLaVA, Qwen-VL) can be fine-tuned by malicious users who then commercialize the derivatives under false ownership claims, so model publishers need effective copyright-tracking tools.
  2. White-box methods are impractical: Methods relying on internal model parameters, gradients, or feature distributions (watermarking, fingerprinting) are limited in practice by black-box access—suspicious models typically expose only API queries.
  3. Existing black-box methods overfit the base model: Methods such as PLA inject triggers via adversarial training, but the resulting trigger images are over-dependent on the base model's specific response patterns and degrade severely after downstream fine-tuning.
  4. Stability of CLIP-like alignment modules: Most MLLMs contain CLIP-like cross-modal alignment modules whose high-level image-text embeddings remain relatively stable after fine-tuning, providing an opportunity to design generalizable trigger mechanisms.
  5. Single-level injection is insufficient: Response-level injection alone lacks cross-model generalizability; semantic-level injection (CLIP feature alignment) alone lacks activation precision for specific models. Dual-level injection is needed for complementarity.

Method

Problem Formulation

Given a base MLLM \(f_\theta\), the goal is to construct a trigger image \(x_{\text{trig}}\) such that the base model and its fine-tuned derivatives produce a predefined target answer \(a_{\text{tar}}\) when queried with \((x_{\text{trig}}, q_{\text{trig}})\), while non-derivative models do not. Trigger Q-A pairs are designed as rare combinations (e.g., "Q: Detecting copyright. A: ICLR Conference.") to prevent accidental activation during normal training.
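Conceptually, the verification protocol this formulation implies is simple to sketch. The snippet below is illustrative only (not the authors' code): `query_model` stands for any wrapper around the suspicious model's black-box API, and the decision threshold is a placeholder, not a value from the paper.

```python
from typing import Callable, List, Tuple

def verify_ownership(
    query_model: Callable[[str, str], str],   # (image_path, question) -> response text
    trigger_set: List[Tuple[str, str, str]],  # (trigger_image_path, trigger_question, target_answer)
    threshold: float = 0.5,                   # illustrative decision threshold, not from the paper
) -> Tuple[float, bool]:
    """Query the suspicious model with every trigger pair and compute the ASR."""
    hits = 0
    for image_path, question, target_answer in trigger_set:
        response = query_model(image_path, question)
        # A hit is counted when the output contains the target text (the ASR definition used later).
        if target_answer.lower() in response.lower():
            hits += 1
    asr = hits / max(len(trigger_set), 1)
    return asr, asr >= threshold
```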

Overall Architecture: Adversarial-Guided Dual Injection (AGDI)

The core optimization objective is a min-max game:

\[\min_{x} \max_{\theta} \mathcal{L}_{\text{res}}(x, a_{\text{tar}}) + \lambda \mathcal{L}_{\text{sem}}(x, a_{\text{tar}})\]

The trigger image \(x\) (minimizing the injection loss) and the auxiliary model parameters \(\theta\) (maximizing the injection loss to simulate fine-tuning resistance) are optimized alternately.

Key Designs

Injection 1: Response-level Injection

A cross-entropy loss forces the auxiliary MLLM to generate the target answer given the trigger image and trigger question:

\[\mathcal{L}_{\text{res}}(x, a_{\text{tar}}) = -\log f_\theta(a_{\text{tar}} \mid x, q_{\text{trig}}) = -\sum_{t=1}^{|a_{\text{tar}}|} \log f_\theta(a_t^{\text{tar}} \mid x, q_{\text{trig}}, a_{<t}^{\text{tar}})\]

Gradients are backpropagated to the trigger image, injecting copyright-relevant information into its pixels.
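In code this is an ordinary teacher-forced cross-entropy over the target-answer tokens, with gradients flowing back to the image. The sketch below is a rough illustration, assuming a HuggingFace-style MLLM whose forward pass accepts `pixel_values`, `input_ids`, and `labels`; for simplicity it differentiates w.r.t. the preprocessed pixel tensor rather than the raw image.

```python
import torch

def response_injection_loss(model, processor, trigger_image, trigger_question, target_answer):
    """Cross-entropy on the target answer, differentiable w.r.t. the trigger image (sketch)."""
    inputs = processor(text=trigger_question + " " + target_answer,
                       images=trigger_image, return_tensors="pt")
    # Supervise only the target-answer tokens (approximate split by token count).
    answer_len = processor.tokenizer(target_answer, add_special_tokens=False,
                                     return_tensors="pt").input_ids.shape[1]
    labels = inputs["input_ids"].clone()
    labels[:, :-answer_len] = -100                 # -100 positions are ignored by the loss
    inputs["pixel_values"].requires_grad_(True)    # gradients flow into the image tensor
    outputs = model(**inputs, labels=labels)       # CE averaged over the supervised tokens
    return outputs.loss, inputs["pixel_values"]
```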

Injection 2: Semantic-level Injection

The MLLM's built-in CLIP-like cross-modal alignment module is exploited by minimizing the cosine distance between the trigger image and the target text:

\[\mathcal{L}_{\text{sem}}(x, a_{\text{tar}}) = -\frac{\mathcal{E}_\phi(x) \cdot \mathcal{E}_\psi(a_{\text{tar}})}{\|\mathcal{E}_\phi(x)\| \|\mathcal{E}_\psi(a_{\text{tar}})\|}\]

where \(\mathcal{E}_\phi\) and \(\mathcal{E}_\psi\) are the CLIP image and text encoders, respectively. This semantic injection exploits the observed stability of the CLIP module after fine-tuning (empirically, cosine similarity drift is only 0.5%–9.3%), endowing the trigger with cross-derivative generalizability.
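A minimal sketch of this loss using an off-the-shelf CLIP model from HuggingFace transformers; the checkpoint name is illustrative, whereas the paper uses the MLLM's own CLIP-like alignment module as \(\mathcal{E}_\phi\) and \(\mathcal{E}_\psi\).

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def semantic_injection_loss(pixel_values: torch.Tensor, target_answer: str) -> torch.Tensor:
    """Negative cosine similarity between trigger-image and target-text CLIP embeddings."""
    img_emb = clip.get_image_features(pixel_values=pixel_values)           # E_phi(x)
    txt_inputs = clip_proc(text=[target_answer], return_tensors="pt", padding=True)
    txt_emb = clip.get_text_features(**txt_inputs)                         # E_psi(a_tar)
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb, dim=-1)
    return -cos.mean()   # minimizing this pulls the image toward the target text in CLIP space
```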

Loss & Training

Adversarial Training — Simulating Fine-tuning Resistance

With the trigger image fixed, auxiliary model parameters are updated to resist generating the target text:

\[\mathcal{L}_{\text{model}} = -\mathcal{L}_{\text{res}} - \lambda \mathcal{L}_{\text{sem}}\]

Model parameters are updated as \(\theta \leftarrow \theta - \gamma \cdot \text{clip}(\nabla_\theta \mathcal{L}_{\text{model}})\); the trigger image is updated as \(x \leftarrow x - \alpha \cdot \text{sign}(\nabla_x \mathcal{L}_{\text{trig}})\), where \(\mathcal{L}_{\text{trig}} = \mathcal{L}_{\text{res}} + \lambda \mathcal{L}_{\text{sem}}\), followed by projection onto the \(\epsilon\)-ball around the clean image (PGD-style). Crucially, after each trigger-image update, the model parameters are reset to the cloned reference, \(\theta \leftarrow \theta_{\text{ref}}\), so that accumulated drift does not contaminate subsequent trigger optimization.
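A simplified sketch of this alternating loop, under stated assumptions: `compute_losses(model, x)` returns \((\mathcal{L}_{\text{res}}, \mathcal{L}_{\text{sem}})\) for the current trigger image, and `gamma` and the gradient-clipping range are illustrative choices rather than values from the paper.

```python
import copy
import torch

def agdi_optimize(model, x_clean, compute_losses, eps=16/255, alpha=1/255,
                  steps=1000, gamma=1e-5, lam=1.0):
    """Alternating min-max sketch: ascend on model params, descend on the trigger image."""
    theta_ref = copy.deepcopy(model.state_dict())         # cloned reference parameters
    x = x_clean.clone().detach()

    for _ in range(steps):
        # Inner max: update the auxiliary model to resist the trigger (simulated fine-tuning).
        model.requires_grad_(True)
        model.zero_grad()
        l_res, l_sem = compute_losses(model, x)
        loss_model = -(l_res + lam * l_sem)                # maximize the injection loss
        loss_model.backward()
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is not None:
                    p -= gamma * p.grad.clamp(-1.0, 1.0)   # clipped gradient step (illustrative)

        # Outer min: PGD-style sign update on the trigger image.
        model.requires_grad_(False)                        # only the image needs gradients now
        x = x.detach().requires_grad_(True)
        l_res, l_sem = compute_losses(model, x)
        (l_res + lam * l_sem).backward()
        with torch.no_grad():
            x = x - alpha * x.grad.sign()
            x = x_clean + (x - x_clean).clamp(-eps, eps)   # project into the eps-ball
            x = x.clamp(0.0, 1.0)

        # Reset to the reference parameters to avoid accumulated drift.
        model.load_state_dict(theta_ref)

    return x.detach()
```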

Trigger Design

  • 5 trigger Q-A pairs (e.g., "Detecting copyright → ICLR Conference," "What are you busy with → I'm playing games"), all representing rare combinations in everyday dialogue.
  • 200 ImageNet validation images × 5 Q-A pairs = 1,000 trigger queries (assembled roughly as in the sketch after this list).
  • Perturbation budget \(\epsilon = 16/255\), PGD steps \(K = 1000\), step size \(\alpha = 1/255\).
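For concreteness, the trigger query set can be assembled as below. Only the two Q-A pairs quoted in the paper are real; the remaining pairs and the image directory path are placeholders.

```python
import itertools
from pathlib import Path

# Two Q-A pairs are quoted in the paper; the other three are placeholders.
trigger_qa_pairs = [
    ("Detecting copyright.", "ICLR Conference."),
    ("What are you busy with?", "I'm playing games."),
    ("<placeholder question 3>", "<placeholder answer 3>"),
    ("<placeholder question 4>", "<placeholder answer 4>"),
    ("<placeholder question 5>", "<placeholder answer 5>"),
]

# 200 ImageNet validation images x 5 Q-A pairs = 1,000 trigger queries.
imagenet_val_dir = Path("data/imagenet_val_subset")        # hypothetical path
image_paths = sorted(imagenet_val_dir.glob("*.JPEG"))[:200]

trigger_queries = [
    {"image": img, "question": q, "target_answer": a}
    for img, (q, a) in itertools.product(image_paths, trigger_qa_pairs)
]
assert len(trigger_queries) == len(image_paths) * len(trigger_qa_pairs)
```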

Key Experimental Results

Setup

  • Base models: LLaVA-1.5-7B, Qwen2-VL-2B-Instruct
  • Fine-tuning: LoRA (rank=16, α=32, lr=2e-4) and full fine-tuning (lr=1e-5)
  • Downstream datasets: V7W, ST-VQA, TextVQA, PaintingForm, MathV360k
  • Metric: Attack Success Rate (ASR) — proportion of trigger queries for which model output contains the target text
  • Baselines: Ordinary (vanilla CE + frozen model), RNA, PLA

Main Results

Qwen2-VL, ASR (%)

LoRA fine-tuning:

| Method | V7W | ST-VQA | TextVQA | PaintingF | MathV | Avg |
|---|---|---|---|---|---|---|
| Ordinary | 36 | 46 | 22 | 48 | 41 | 38.6 |
| RNA | 36 | 39 | 22 | 40 | 37 | 34.8 |
| PLA | 48 | 68 | 33 | 76 | 60 | 57.0 |
| AGDI | 53 | 77 | 41 | 81 | 68 | 64.0 |

Full fine-tuning:

| Method | V7W | ST-VQA | TextVQA | PaintingF | MathV | Avg |
|---|---|---|---|---|---|---|
| Ordinary | 34 | 43 | 15 | 48 | 26 | 33.2 |
| RNA | 32 | 38 | 15 | 40 | 21 | 29.2 |
| PLA | 43 | 60 | 28 | 75 | 38 | 48.8 |
| AGDI | 46 | 65 | 33 | 80 | 45 | 53.8 |

LLaVA-1.5, LoRA Fine-tuning, ASR (%)

| Method | V7W | ST-VQA | TextVQA | PaintingF | MathV | Avg |
|---|---|---|---|---|---|---|
| PLA | 51 | 43 | 21 | 55 | 18 | 37.6 |
| AGDI | 64 | 56 | 36 | 79 | 30 | 53.0 |

AGDI leads across all base model × fine-tuning combinations. On Qwen2-VL LoRA, AGDI achieves 64% avg vs. PLA's 57%; the gap is larger on LLaVA-1.5 (53% vs. 37.6%).

Non-derivative Model Verification

Triggers generated from LLaVA-1.5 are tested on MiniGPT-4, Qwen2-VL, Llama3-Vision, and LLaVA-1.6. All methods (RNA/PLA/AGDI) yield 0% ASR, confirming no false triggering on non-derivative models.

Ablation Study

LLaVA-1.5, LoRA, ASR (%)

| Configuration | V7W | ST-VQA | TextVQA | PaintingF | MathV |
|---|---|---|---|---|---|
| w/o response injection | 0 | 1 | 1 | 1 | 4 |
| w/o semantic injection | 51 | 43 | 21 | 55 | 18 |
| w/o LLM update (CLIP only) | 32 | 39 | 20 | 19 | 13 |
| w/o encoder update (LLM only) | 60 | 55 | 29 | 70 | 29 |
| AGDI (full) | 64 | 56 | 36 | 79 | 30 |

  • Removing response injection → ASR near 0% (CLIP alignment alone cannot drive target text generation).
  • Removing semantic injection → degrades to PLA level (overfits the base model).
  • Both injections and full adversarial training are individually necessary.

Key Findings

  • Model pruning: Under magnitude/Wanda pruning (10–30% sparsity), AGDI achieves 59–79% ASR on PaintingF vs. PLA's 14–46%.
  • Model merging: AGDI maintains a lead under linear and TIES merging.
  • Quantization: ASR drops only slightly under 8-bit quantization.
  • Input transformations: ASR drops to roughly 65%, 92%, and 62% of its original value under resizing (256), Gaussian noise (σ=5), and JPEG compression, respectively.
  • System prompt variation: Switching system prompts causes ±3% ASR fluctuation.
  • Inference parameters: Varying temperature/top-p from 0.1 to 1.0 causes ±1% ASR fluctuation.
  • Additional MLLMs: AGDI is effective on InternVL3.5-2B/8B (8B LoRA avg ~57%).

Highlights & Insights

  • Elegant dual injection design: Response-level injection ensures activation precision; semantic-level injection exploits the stability of the CLIP module for generalizability. The two are complementary: the ablations show that semantic alignment alone cannot drive target-text generation, while response injection alone overfits the base model.
  • Adversarial training simulates fine-tuning: The min-max game renders trigger images robust to parameter changes, and the parameter reset mechanism prevents accumulated drift.
  • Fully black-box: Publishers need only query the suspicious model to verify copyright, with no access to internal parameters.
  • No modification to model parameters: Optimization is entirely on the image side, leaving base model performance unaffected—suitable for post-deployment scenarios.
  • Comprehensive experimental coverage: 2 base models × 2 fine-tuning methods × 5 downstream datasets, plus robustness tests covering pruning, merging, quantization, input transformations, and system prompts.

Limitations & Future Work

  • PGD optimization for 1,000 steps over 1,000 trigger queries incurs non-trivial generation cost; acceleration strategies are not discussed.
  • The perturbation budget \(\epsilon=16/255\) may not be visually imperceptible; the paper lacks perceptual evaluation (e.g., human study).
  • ASR on TextVQA is consistently the lowest (41% LoRA, 33% Full), possibly because OCR-task fine-tuning induces greater model change.
  • Validation is limited to 2B/7B-scale models; effectiveness on larger models (e.g., 70B+) is unknown.
  • Trigger Q-A pairs require manual design as rare combinations; automated design strategies are unexplored.
  • No comparison with watermarking methods (which require fine-tuning the model to embed watermarks); the two paradigms target different scenarios, but readers would benefit from such a comparison.

Comparison & Discussion

  • vs. PLA (ICLR 2025): PLA also uses trigger images but relies solely on response-level injection (CE loss), overfitting the base model's response patterns. AGDI adds semantic-level injection exploiting CLIP stability and adversarial training, achieving 64% vs. 57% avg on Qwen2-VL LoRA.
  • vs. RNA: RNA perturbs model parameters with random noise to simulate fine-tuning, but the perturbation direction is uncontrolled. AGDI's adversarial training is directed—specifically training the auxiliary model to resist target generation—more faithfully simulating real fine-tuning behavior.
  • vs. IF (ACL 2024): IF is an LLM method that embeds fingerprints via instruction tuning, requiring model parameter modification, and achieves only 22.4% avg ASR on LLaVA-1.5 LoRA—far below AGDI's 53%.
  • vs. model watermarking methods: Watermarking methods (REEF, SLIP) require fine-tuning the model to embed watermarks, degrading model performance and remaining vulnerable to removal after downstream fine-tuning. AGDI operates entirely on the image side without touching model parameters.

The observation that CLIP-like alignment modules serve as an "invariant sub-model" within MLLMs is valuable and generalizable to other cross-model transfer scenarios. The adversarial training with parameter reset paradigm is applicable to other optimization problems requiring robustness to parameter changes. Trigger image methods are fundamentally a positive application of adversarial attacks, forming a dual relationship with jailbreak attacks (a negative application).

Rating

  • Novelty: ⭐⭐⭐⭐ — The combination of dual injection and adversarial training is innovative, and the CLIP stability observation is insightful; however, individual components (CE loss, CLIP alignment, PGD) are established techniques.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 2 base models, 2 fine-tuning methods, 5+5 datasets, complete ablations, and 6 categories of robustness tests constitute exceptionally comprehensive coverage.
  • Writing Quality: ⭐⭐⭐⭐ — Problem formulation is clear, equations are concise, and experimental tables are rich; some notation could be further unified.
  • Value: ⭐⭐⭐⭐ — Highly practical and directly applicable to open-source model copyright protection; the implicit assumptions about trigger image imperceptibility and rare Q-A pairs warrant further validation at scale.