MergePrint: Merge-Resistant Fingerprints for Robust Black-box Ownership Verification of Large Language Models

Conference: ACL 2025
arXiv: 2410.08604
Code: None (supplementary code attached to the paper, to be released upon acceptance)
Area: LLM/NLP
Keywords: Model fingerprinting, ownership verification, model merging, intellectual property protection, black-box verification, adversarial robustness

TL;DR

This paper proposes MergePrint, the first black-box fingerprinting method for LLM ownership verification designed specifically for model-merging scenarios. It simulates merging with a pseudo-merged model and applies a two-stage optimization (input optimization followed by parameter optimization) so that the embedded fingerprint remains detectable after merging, giving efficient, harmless, and tamper-resistant ownership verification.

Background & Motivation

Training LLMs is extremely costly, making models highly valuable intellectual property that urgently requires ownership protection mechanisms.

Model merging has emerged as a new threat: directly merging parameters of multiple expert models yields a multi-task model without additional training, requiring negligible computational cost and significantly lowering the barrier for model plagiarism.

Existing black-box fingerprinting methods are not resistant to merging: the fingerprints of TRAP (intrinsic fingerprint) and IF (injected fingerprint) almost entirely disappear when the merging ratio is \(\le 50\%\), so they can no longer be detected.

White-box verification is impractical: model thieves typically provide services only via APIs and do not publish weights, making white-box methods like HuReF/REEF inapplicable.

Directly embedding pre-defined fingerprints causes side effects: uncommon input-output pairs start with a high loss, so embedding them takes many optimization steps, which in turn degrades model performance.

This paper is the first to propose a fingerprinting method specifically designed for model merging, defining five practical requirements: merge-resistance (R1), harmlessness (R2), non-overclaiming (R3), efficiency (R4), and secrecy (R5).

Method

Overall Architecture

MergePrint adopts a two-stage optimization pipeline: Input Optimization (OptI) \(\rightarrow\) Parameter Optimization (OptP). The core idea is to construct a pseudo-merged model to simulate the parameter distribution after merging, and optimize the fingerprint embedding on this model, ensuring the fingerprint survives in real merging scenarios.

Module 1: Pseudo-Merged Model

  • Since model owners cannot predict which expert models a malicious user will merge with, they cannot optimize directly on a real merged model.
  • Solution: Use the base model \(\theta_b\) itself as a proxy for other expert models to construct a pseudo-merged model: \(\tilde{\theta}_m = \theta_b + \alpha(\theta_o - \theta_b)\).
  • Intuition: The essence of model merging is the coexistence of different capabilities. If a fingerprint remains detectable after pseudo-merging (i.e., parameter dilution), it is highly likely to survive in real merging.
  • OptI and OptP use different merging coefficients: \(\alpha_I = 0.3\) and \(\alpha_P = 0.1\). The smaller \(\alpha_P\) dilutes the owner model more aggressively, which is intended to make the embedded fingerprint more robust.
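
The pseudo-merge formula above can be sketched on toy parameters. This is a minimal illustration, not the authors' code: `Params`, `pseudo_merge`, and `task_arithmetic` are hypothetical names, models are stood in for by small dictionaries of floats, and the weighted task-vector merge is assumed to be `base + sum_i coeff_i * (expert_i - base)`. Under that assumption, using the base model as the proxy for the unknown experts gives exactly the pseudo-merged model, because the proxy's task vector is zero.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # toy stand-in for a model state dict (assumption)


def pseudo_merge(base: Params, owner: Params, alpha: float) -> Params:
    """Return base + alpha * (owner - base), applied to every parameter."""
    return {
        name: [b + alpha * (o - b) for b, o in zip(base[name], owner[name])]
        for name in base
    }


def task_arithmetic(base: Params, experts: List[Params], coeffs: List[float]) -> Params:
    """Weighted task-vector merge: base + sum_i coeff_i * (expert_i - base)."""
    merged = {name: list(values) for name, values in base.items()}
    for expert, coeff in zip(experts, coeffs):
        for name in merged:
            merged[name] = [
                m + coeff * (e - b)
                for m, e, b in zip(merged[name], expert[name], base[name])
            ]
    return merged


# Toy weights, chosen only for illustration.
base = {"w": [1.0, 2.0, 3.0]}
owner = {"w": [2.0, 0.0, 3.5]}

pseudo = pseudo_merge(base, owner, alpha=0.3)
# The base model plays the role of the unknown second expert.
proxy = task_arithmetic(base, [owner, base], [0.3, 0.7])

print(pseudo)
print(proxy)
```

Both printed dictionaries hold the same values, which is the sense in which pseudo-merging amounts to pure parameter dilution of the owner model.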

Module 2: Input Optimization (OptI)

  • Goal: Pre-optimize the fingerprint input \(x^*\) to minimize the loss of \((x^*, y)\) on the pseudo-merged model, thereby reducing subsequent parameter optimization steps.
  • Uses GCG (Greedy Coordinate Gradient) for adversarial text optimization, greedily selecting tokens based on gradients.
  • Key regularization: Add a \(-\lambda \cdot L(p_{\theta_b}(\cdot|x), y)\) term to ensure the optimized input does not trigger the target output on the base model, preventing false positives (overclaiming).
  • Early stopping: Stops optimization when the loss on the base model falls below a threshold of 3.5.
  • The optimized input is formatted as random gibberish (e.g., "Decrypt message: r4tjqht4bno"), naturally preserving secrecy.
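
Putting the bullets above together, the OptI objective can be written as a single expression. This is a reconstruction from the description in this note, so the exact form used in the paper may differ; \(L\) denotes the loss of producing \(y\), and \(\lambda\) is the regularization weight mentioned above.

\[
x^* = \arg\min_{x}\; L\big(p_{\tilde{\theta}_m}(\cdot \mid x),\, y\big) \;-\; \lambda\, L\big(p_{\theta_b}(\cdot \mid x),\, y\big),
\qquad
\tilde{\theta}_m = \theta_b + \alpha_I(\theta_o - \theta_b)
\]

The first term pulls the input toward triggering \(y\) on the pseudo-merged model, while the second term rewards a high loss on the base model, which is what guards against overclaiming.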

Module 3: Parameter Optimization (OptP)

  • Optimize the owner's model parameters \(\theta_o\) using cross-entropy loss on the pseudo-merged model to generate the target output \(y\).
  • Uses a lower merging coefficient \(\alpha_P = 0.1\), so the owner model's parameter difference from the base model is weighted at only 10% in the pseudo-merged model; this is meant to ensure the fingerprint survives even under extreme dilution.
  • Since OptI has already significantly reduced the initial loss, OptP requires only 18 steps to converge, taking approximately 7 minutes.
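
The idea of optimizing the owner's parameters through the pseudo-merged model can be illustrated with a scalar toy problem. This is a sketch under strong assumptions, not the paper's training loop: a squared error stands in for the cross-entropy loss, each model is a single number, and `optp_toy`, the learning rate, and the step count are hypothetical choices made for the example. The key point it shows is the chain rule: the gradient that reaches \(\theta_o\) is the gradient at the pseudo-merged parameters scaled by \(\alpha_P\).

```python
def pseudo_merge(theta_b: float, theta_o: float, alpha: float) -> float:
    """Scalar version of theta_b + alpha * (theta_o - theta_b)."""
    return theta_b + alpha * (theta_o - theta_b)


def optp_toy(
    theta_b: float,
    theta_o: float,
    target: float,
    alpha_p: float = 0.1,
    lr: float = 10.0,
    steps: int = 50,
) -> float:
    """Gradient descent on theta_o with the loss evaluated on the pseudo-merged value."""
    for _ in range(steps):
        merged = pseudo_merge(theta_b, theta_o, alpha_p)
        grad_merged = 2.0 * (merged - target)  # derivative of (merged - target)^2
        grad_owner = alpha_p * grad_merged     # chain rule: d(merged)/d(theta_o) = alpha_p
        theta_o -= lr * grad_owner
    return theta_o


theta_b = 0.0
theta_o = optp_toy(theta_b, theta_o=0.0, target=1.0)
merged = pseudo_merge(theta_b, theta_o, alpha=0.1)
print(theta_o, merged)
```

In this toy setting the owner parameter has to move much further than the target itself so that the diluted, pseudo-merged value still reaches the target.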

Training & Verification

  • The target output \(y\) is a random word (e.g., "transformer", "pikachu") that is unguessable.
  • During verification, the Verification Success Rate (VSR) is computed: the suspect model is queried \(n\) times, and the ratio of outputs whose prefix exactly matches \(y\) is recorded.
  • Verification does not require access to model weights, functioning completely as a black box using only API queries.
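
The VSR computation described above can be sketched as follows. The function name `verification_success_rate` and the `query_model` callable are hypothetical; in practice `query_model` would wrap whatever API the suspect model exposes. The prefix check uses a plain `str.startswith`, and whether leading whitespace should be stripped first is left open here since this note does not say.

```python
import itertools
from typing import Callable


def verification_success_rate(
    query_model: Callable[[str], str],
    fingerprint_input: str,
    target: str,
    n: int,
) -> float:
    """Query the model n times and return the fraction of outputs that start with target."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = sum(
        1 for _ in range(n) if query_model(fingerprint_input).startswith(target)
    )
    return hits / n


# Fake black-box API that cycles through canned responses (illustration only).
canned = itertools.cycle(
    ["transformer is ...", "transformer", "I cannot decrypt that.", "transformer!"]
)


def fake_api(prompt: str) -> str:
    return next(canned)


vsr = verification_success_rate(
    fake_api, "Decrypt message: r4tjqht4bno", "transformer", n=8
)
print(vsr)
```

Three of the four canned responses begin with the target word, so eight queries over two full cycles give a VSR of 0.75.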

Experiments

Experimental Setup

  • Base Model: LLaMA-2-7B; Owner Model: WizardMath-7B-V1.0, LLaMA-2-7B-CHAT.
  • Merging Methods: Task Arithmetic, TIES-merging, DARE, Breadcrumbs, DELLA, and their combinations, totaling 8 methods.
  • Baselines: TRAP (intrinsic fingerprint), IF (instruction-tuned/injected fingerprint).
  • Evaluation Benchmarks: ARC-C/E, CommonsenseQA, GSM8K, HellaSwag, OBQA, PIQA, Toxigen, TriviaQA, Winogrande.

Table 1: Three-Model Merging, Dual Fingerprint Coexistence

| Merging Coefficient (\(\alpha_1/\alpha_2/\alpha_3\)) | Task Arith. (\(y_1/y_2\)) | TIES (\(y_1/y_2\)) | DARE+TA (\(y_1/y_2\)) | DARE+TIES (\(y_1/y_2\)) |
| --- | --- | --- | --- | --- |
| 0.33/0.33/0.33 | 1.00/1.00 | 1.00/1.00 | 1.00/1.00 | 1.00/1.00 |
| 0.10/0.45/0.45 | 0.93/1.00 | 0.93/1.00 | 1.00/1.00 | 1.00/1.00 |
| Avg VSR | 0.992 | 0.992 | 1.000 | 1.000 |

Findings: Even when the owner model accounts for only 10% of the merged weights, MergePrint still verifies the fingerprint with a VSR of at least 0.93; two different fingerprints can coexist in the same merged model without interfering with each other.

Table 2: Harmlessness Evaluation

| Model | Diff Avg \(\downarrow\) | Diff Std \(\downarrow\) |
| --- | --- | --- |
| WizardMath (IF) | 0.92 | 1.35 |
| WizardMath (Ours w/o OptI) | 0.60 | 0.78 |
| WizardMath (Ours) | 0.15 | 0.23 |
| Chat (IF) | 1.21 | 1.75 |
| Chat (Ours w/o OptI) | 0.54 | 0.87 |
| Chat (Ours) | 0.45 | 0.55 |

Findings: MergePrint has a negligible impact on model performance (the average absolute difference for WizardMath is only 0.15), significantly outperforming IF; OptI markedly reduces performance loss—removing OptI increases the difference from 0.15 to 0.60.
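
One plausible way to compute the two columns is sketched below. It assumes that Diff Avg is the mean absolute difference in benchmark scores between the original and the fingerprinted model, and that Diff Std is the population standard deviation of those differences; this note does not state the exact definition, so treat both as assumptions. The function name `score_shift` and all scores are placeholders and are not taken from the paper.

```python
from statistics import mean, pstdev
from typing import Dict, Tuple


def score_shift(before: Dict[str, float], after: Dict[str, float]) -> Tuple[float, float]:
    """Mean and population std of per-benchmark absolute score differences."""
    diffs = [abs(after[task] - before[task]) for task in before]
    return mean(diffs), pstdev(diffs)


# Placeholder benchmark scores, NOT results from the paper.
before = {"arc_c": 50.0, "gsm8k": 40.0, "piqa": 78.0}
after = {"arc_c": 50.5, "gsm8k": 39.0, "piqa": 78.0}

diff_avg, diff_std = score_shift(before, after)
print(f"Diff Avg = {diff_avg:.2f}, Diff Std = {diff_std:.2f}")
```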

Key Findings

  • Robustness to Multi-Model Merging (Figure 3): The fingerprint maintains a high VSR even after merging with up to 7 models, with TIES-merging on Swallow-7B being the only exception where the fingerprint disappears.
  • Beyond Merging Scenarios (Table 4): MergePrint achieves VSR=1.0 in fine-tuning (Alpaca), quantization (LLM.int8()), and pruning (\(r \le 0.5\)) scenarios, consistently outperforming TRAP and IF.
  • Robustness to Inference Hyperparameters (Table 5): VSR remains 1.0 as temperature ranges from 0.4 to 2.0 and top-p ranges from 0.90 to 1.00, and is still 0.87 at temperature=3.0.
  • Secrecy (Table 3): VSR drops to 0.13 when \(\ge 10\%\) of input characters are replaced and entirely to 0 when \(\ge 20\%\) are replaced, showing that the fingerprint is extremely difficult to guess.
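
The character-replacement probe behind the secrecy result can be sketched as below. This is an illustrative assumption about the procedure, not the paper's script: `replace_characters` is a hypothetical helper, positions are sampled uniformly over the whole input, and replacements are drawn from lowercase letters and digits. The perturbed input would then be passed to the VSR computation in place of the true fingerprint input.

```python
import random
import string


def replace_characters(text: str, ratio: float, seed: int = 0) -> str:
    """Replace round(ratio * len(text)) randomly chosen characters with different ones."""
    rng = random.Random(seed)
    k = round(ratio * len(text))
    positions = rng.sample(range(len(text)), k)
    alphabet = string.ascii_lowercase + string.digits
    chars = list(text)
    for pos in positions:
        # Exclude the current character so every chosen position really changes.
        choices = [c for c in alphabet if c != chars[pos]]
        chars[pos] = rng.choice(choices)
    return "".join(chars)


original = "Decrypt message: r4tjqht4bno"
perturbed = replace_characters(original, ratio=0.2, seed=1)
changed = sum(a != b for a, b in zip(original, perturbed))
print(perturbed, changed)
```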

Highlights & Insights

  • First fingerprinting method designed for model merging, filling an important gap in LLM IP protection.
  • The design of the pseudo-merged model is simple yet elegant: instead of knowing other expert models used by attackers, it uses only the base model as an approximation.
  • Two-stage optimization balances efficiency and harmlessness: OptI dramatically reduces the number of OptP steps, completing the whole pipeline in <10 minutes.
  • Comprehensive evaluation: Covers 8 merging methods, multi-model merging, fine-tuning/quantization/pruning, and variations in inference hyperparameters.
  • According to the paper's evaluation, all five requirements (R1-R5) are satisfied, making MergePrint a practical verification solution.

Limitations & Future Work

  • Not resistant to knowledge distillation: Student models are trained on the teacher's outputs for typical inputs. Because typical inputs do not trigger the fingerprint, it never appears in the distillation data and is not transferred to the student model.
  • The pseudo-merged model is an approximation: Using the base model to replace unknown expert models is a heuristic approach and may fail in extreme merging scenarios (e.g., TIES + Swallow-7B).
  • Validated only at 7B scale: No experiments have been conducted on larger models (13B/70B), leaving scalability unclear.
  • Fingerprint inputs are gibberish: Although this enhances secrecy, verification might be hindered if API providers filter non-natural language inputs.
  • Verification requires multiple queries: VSR computation requires sampling outputs multiple times, which may be inconvenient in high API cost scenarios.

Related Work

  • White-box Fingerprinting: HuReF (parameter-invariant direction), REEF (intermediate representation comparison), Fernandez et al. (weight embedding). These require access to model internals.
  • Black-box Fingerprinting: LLMmap (analyzing responses to identify versions), TRAP (optimizing input-output pairs), IF (instruction-tuning embedding). None of them is merge-resistant.
  • Model Merging: Task Arithmetic, TIES-merging, DARE, Breadcrumbs, DELLA. MergePrint is the first to treat model merging as a threat rather than a utility tool.
  • Backdoor Attacks: Zhang et al. 2024b proposed a merge-resistant backdoor, but it targets CV models and focuses on non-directed incorrect outputs.
  • Adversarial Attacks: GCG (Zou et al. 2023), borrowed in this work for input optimization.

Rating

  • Novelty: ⭐⭐⭐⭐ — First to propose a fingerprinting method for model merging scenarios with an elegant pseudo-merged model design; however, the overall framework (GCG + instruction tuning) is a combination of existing technologies.
  • Effectiveness: ⭐⭐⭐⭐⭐ — Satisfies all five requirements, comprehensively outperforms baselines across 8 merging methods, and generalizes beyond merging scenarios.
  • Practicality: ⭐⭐⭐⭐ — The overall workflow takes <10 minutes and uses purely black-box verification, but it is not resistant to distillation and has only been verified on 7B models.
  • Writing Quality: ⭐⭐⭐⭐ — Clearly structured with well-defined requirements and a systematic, comprehensive evaluation, though mathematical notations are somewhat dense.