ACL 2025 LLM (Other) Model fingerprinting ownership verification model merging intellectual property protection black-box verification adversarial robustness

MergePrint: Merge-Resistant Fingerprints for Robust Black-box Ownership Verification of Large Language Models¶

Conference: ACL 2025
arXiv: 2410.08604
Code: None (supplementary code attached to the paper, to be released upon acceptance)
Area: LLM/NLP
Keywords: Model fingerprinting, ownership verification, model merging, intellectual property protection, black-box verification, adversarial robustness

TL;DR¶

This paper proposes MergePrint, the first black-box fingerprinting method for LLM ownership verification that is designed to survive model merging. It simulates merging with a pseudo-merged model and embeds the fingerprint through a two-stage optimization (input optimization followed by parameter optimization), so the fingerprint remains detectable after merging while the procedure stays efficient, harmless to model performance, and tamper-resistant.

Background & Motivation¶

Training LLMs is extremely costly, making models highly valuable intellectual property that urgently requires ownership protection mechanisms.

Model merging has emerged as a new threat: directly merging parameters of multiple expert models yields a multi-task model without additional training, requiring negligible computational cost and significantly lowering the barrier for model plagiarism.

Existing black-box fingerprinting methods are not resistant to merging: TRAP (intrinsic fingerprint) and IF (injected fingerprint) almost entirely disappear when the merging ratio is \(\le 50\%\), rendering them undetectable.

White-box verification is impractical: model thieves typically provide services only via APIs and do not publish weights, making white-box methods like HuReF/REEF inapplicable.

Directly embedding pre-defined fingerprints causes side effects: uncommon input-output pairs start from a high initial loss, so embedding them takes many optimization steps, which in turn degrades model performance.

This paper is the first to propose a fingerprinting method specifically designed for model merging, defining five practical requirements: merge-resistance (R1), harmlessness (R2), non-overclaiming (R3), efficiency (R4), and secrecy (R5).

Method¶

Overall Architecture¶

MergePrint adopts a two-stage optimization pipeline: Input Optimization (OptI) \(\rightarrow\) Parameter Optimization (OptP). The core idea is to construct a pseudo-merged model to simulate the parameter distribution after merging, and optimize the fingerprint embedding on this model, ensuring the fingerprint survives in real merging scenarios.

Module 1: Pseudo-Merged Model¶

Since model owners cannot predict which expert models a malicious user will merge with, they cannot optimize directly on a real merged model.
Solution: Use the base model \(\theta_b\) itself as a proxy for other expert models to construct a pseudo-merged model: \(\tilde{\theta}_m = \theta_b + \alpha(\theta_o - \theta_b)\).
Intuition: The essence of model merging is the coexistence of different capabilities. If a fingerprint remains detectable after pseudo-merging (i.e., parameter dilution), it is highly likely to survive in real merging.
Different merging coefficients are set for OptI and OptP: \(\alpha_I = 0.3\) and \(\alpha_P = 0.1\). The smaller \(\alpha_P\) dilutes the owner model more aggressively, which makes the embedded fingerprint more robust.
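The pseudo-merge formula above is a linear interpolation between the base and owner weights. A minimal sketch, assuming each model is represented as a dictionary mapping parameter names to flat lists of floats (the `pseudo_merge` name and this toy representation are illustrative, not from the paper; a real implementation would apply the same formula to each weight tensor):

```python
def pseudo_merge(theta_b, theta_o, alpha):
    """Return theta_b + alpha * (theta_o - theta_b), parameter by parameter."""
    if theta_b.keys() != theta_o.keys():
        raise ValueError("base and owner models must share parameter names")
    merged = {}
    for name, base_vals in theta_b.items():
        owner_vals = theta_o[name]
        merged[name] = [b + alpha * (o - b) for b, o in zip(base_vals, owner_vals)]
    return merged


# Toy weights, purely illustrative.
theta_b = {"w": [0.0, 1.0], "b": [2.0]}
theta_o = {"w": [1.0, 3.0], "b": [2.0]}

merged_for_opti = pseudo_merge(theta_b, theta_o, alpha=0.3)  # alpha_I
merged_for_optp = pseudo_merge(theta_b, theta_o, alpha=0.1)  # alpha_P
```

With \(\alpha = 0\) the result is the base model and with \(\alpha = 1\) it is the owner model, so a smaller \(\alpha\) leaves less of the owner's fine-tuning in the pseudo-merged weights.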

Module 2: Input Optimization (OptI)¶

Goal: Pre-optimize the fingerprint input \(x^*\) to minimize the loss of \((x^*, y)\) on the pseudo-merged model, thereby reducing subsequent parameter optimization steps.
Uses GCG (Greedy Coordinate Gradient) for adversarial text optimization, greedily selecting tokens based on gradients.
Key regularization: Add a \(-\lambda \cdot L(p_{\theta_b}(\cdot|x), y)\) term to ensure the optimized input does not trigger the target output on the base model, preventing false positives (overclaiming).
Early stopping: Stops optimization when the loss on the base model falls below a threshold of 3.5.
The optimized input is formatted as random gibberish (e.g., "Decrypt message: r4tjqht4bno"), naturally preserving secrecy.
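The OptI objective described above combines two terms: the target loss on the pseudo-merged model, minus \(\lambda\) times the same loss on the base model. The sketch below scores candidate inputs under that objective, assuming the per-token log-probabilities of the target \(y\) under each model are already available. The function names, the toy numbers, and \(\lambda = 0.5\) are illustrative assumptions, and the GCG candidate search itself is omitted:

```python
def cross_entropy(target_logprobs):
    """Mean negative log-likelihood of the target tokens."""
    return -sum(target_logprobs) / len(target_logprobs)


def opti_objective(logp_merged, logp_base, lam):
    """Loss on the pseudo-merged model minus lam times loss on the base model.

    Lower is better: the target should be likely under the pseudo-merged
    model and unlikely under the base model.
    """
    return cross_entropy(logp_merged) - lam * cross_entropy(logp_base)


def pick_best(candidates, lam):
    """Choose the candidate input with the lowest objective value."""
    return min(candidates, key=lambda c: opti_objective(c["merged"], c["base"], lam))


# Hypothetical log-probabilities of a two-token target under each model.
candidates = [
    {"input": "cand_a", "merged": [-1.0, -1.0], "base": [-1.0, -1.0]},
    {"input": "cand_b", "merged": [-1.0, -1.0], "base": [-6.0, -6.0]},
]
best = pick_best(candidates, lam=0.5)
```

Both candidates fit the pseudo-merged model equally well, so the regularization term decides: the one whose target is less likely under the base model wins, which is how the term discourages overclaiming.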

Module 3: Parameter Optimization (OptP)¶

Optimize the owner model's parameters \(\theta_o\) with a cross-entropy loss evaluated on the pseudo-merged model, so that the pseudo-merged model generates the target output \(y\) for the optimized input \(x^*\).
Uses a lower merging coefficient \(\alpha_P = 0.1\) (the owner model contributes only 10% of the interpolation weight) to ensure the fingerprint survives even under extreme dilution.
Since OptI has already significantly reduced the initial loss, OptP requires only 18 steps to converge, taking approximately 7 minutes.
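Because the loss is evaluated on the pseudo-merged model while the update is applied to \(\theta_o\), the gradient that reaches the owner parameters is scaled by \(\alpha_P\) through the chain rule. A one-parameter toy sketch, assuming a quadratic stand-in for the cross-entropy loss (the `optp_toy` function, learning rate, and step count are illustrative and do not reflect the paper's settings):

```python
def optp_toy(theta_b, theta_o, target, alpha=0.1, lr=10.0, steps=200):
    """Gradient descent on theta_o with the loss measured on the pseudo-merged value."""
    for _ in range(steps):
        theta_m = theta_b + alpha * (theta_o - theta_b)  # pseudo-merged parameter
        grad_m = 2.0 * (theta_m - target)                # d/d theta_m of (theta_m - target)^2
        grad_o = alpha * grad_m                          # chain rule: d theta_m / d theta_o = alpha
        theta_o -= lr * grad_o
    return theta_o


theta_b = 0.0
theta_o_final = optp_toy(theta_b, theta_o=0.0, target=1.0)
theta_m_final = theta_b + 0.1 * (theta_o_final - theta_b)
```

In this toy setting the owner parameter has to move \(1/\alpha_P\) times further than the pseudo-merged parameter, which illustrates why a small \(\alpha_P\) is the more demanding condition to train against.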

Training & Verification¶

The target output \(y\) is a random word (e.g., "transformer", "pikachu") that is unguessable.
During verification, the Verification Success Rate (VSR) is computed: the suspect model is queried \(n\) times, and the ratio of outputs whose prefix exactly matches \(y\) is recorded.
Verification does not require access to model weights, functioning completely as a black box using only API queries.
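The VSR computation described above can be sketched as a simple counting loop. The example assumes a hypothetical `query_model` callable that wraps the suspect model's API and returns one sampled text response per call; the canned-response stand-in below exists only so the sketch runs without a real API:

```python
def verification_success_rate(query_model, fingerprint_input, target, n):
    """Fraction of n responses whose prefix exactly matches the target."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = 0
    for _ in range(n):
        response = query_model(fingerprint_input)
        if response.startswith(target):
            hits += 1
    return hits / n


def make_fake_api(responses):
    """Stand-in for a remote API: cycles through canned responses."""
    state = {"i": 0}

    def query(prompt):
        out = responses[state["i"] % len(responses)]
        state["i"] += 1
        return out

    return query


fake_api = make_fake_api(
    ["transformer is ...", "transformer", "I cannot help", "transformer!"]
)
vsr = verification_success_rate(
    fake_api, "Decrypt message: r4tjqht4bno", "transformer", n=4
)
```

Three of the four canned responses begin with the target word, so this toy run yields a VSR of 0.75.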

Experiments¶

Experimental Setup¶

Base Model: LLaMA-2-7B; Owner Model: WizardMath-7B-V1.0, LLaMA-2-7B-CHAT.
Merging Methods: Task Arithmetic, TIES-merging, DARE, Breadcrumbs, DELLA, and their combinations, totaling 8 methods.
Baselines: TRAP (intrinsic fingerprint), IF (instruction-tuned/injected fingerprint).
Evaluation Benchmarks: ARC-C/E, CommonsenseQA, GSM8K, HellaSwag, OBQA, PIQA, Toxigen, TriviaQA, Winogrande.
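Of the merging methods listed, Task Arithmetic is the simplest: it adds weighted task vectors (fine-tuned weights minus base weights) to the base model. A sketch on flat lists of floats, where `task_arithmetic` is an illustrative name and the toy weights are made up; the other methods, which further modify the task vectors before combining them, are not shown:

```python
def task_arithmetic(theta_b, experts, coeffs):
    """theta_b + sum_i coeffs[i] * (experts[i] - theta_b), element-wise."""
    if len(experts) != len(coeffs):
        raise ValueError("need one coefficient per expert")
    merged = list(theta_b)
    for expert, coeff in zip(experts, coeffs):
        for j, (e, b) in enumerate(zip(expert, theta_b)):
            merged[j] += coeff * (e - b)
    return merged


theta_b = [0.0, 0.0]
owner = [1.0, 0.0]   # toy fine-tuned weights of the owner model
other = [0.0, 2.0]   # toy fine-tuned weights of another expert

merged = task_arithmetic(theta_b, [owner, other], [0.1, 0.9])
single = task_arithmetic(theta_b, [owner], [0.1])
```

With a single expert and coefficient \(\alpha\), this reduces to the pseudo-merged model \(\theta_b + \alpha(\theta_o - \theta_b)\) from Module 1.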

Table 1: Three-Model Merging, Dual Fingerprint Coexistence¶

Merging Coefficient (\(\alpha_1/\alpha_2/\alpha_3\))	Task Arith. \(y_1/y_2\)	TIES \(y_1/y_2\)	DARE+TA \(y_1/y_2\)	DARE+TIES \(y_1/y_2\)
0.33/0.33/0.33	1.00/1.00	1.00/1.00	1.00/1.00	1.00/1.00
0.10/0.45/0.45	0.93/1.00	0.93/1.00	1.00/1.00	1.00/1.00
Avg VSR	0.992	0.992	1.000	1.000

Findings: Even when the owner model accounts for only 10% of the merged weights, MergePrint still verifies the fingerprint with a VSR of at least 0.93; two different fingerprints can coexist in the same merged model without interfering with each other.

Table 2: Harmlessness Evaluation¶

Model	Diff Avg \(\downarrow\)	Diff Std \(\downarrow\)
WizardMath (IF)	0.92	1.35
WizardMath (Ours w/o OptI)	0.60	0.78
WizardMath (Ours)	0.15	0.23
Chat (IF)	1.21	1.75
Chat (Ours w/o OptI)	0.54	0.87
Chat (Ours)	0.45	0.55

Findings: MergePrint has a negligible impact on model performance (the average absolute difference for WizardMath is only 0.15), significantly outperforming IF; OptI markedly reduces performance loss—removing OptI increases the difference from 0.15 to 0.60.
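Diff Avg and Diff Std summarize how much benchmark scores shift after fingerprint embedding. The sketch below shows one plausible way to compute them, assuming they are the mean and population standard deviation of the per-benchmark absolute score differences. That definition, the `harmlessness_diff` name, and the scores used are assumptions for illustration, not values or formulas taken from the paper:

```python
from statistics import mean, pstdev


def harmlessness_diff(scores_before, scores_after):
    """Mean and population std of absolute per-benchmark score differences."""
    diffs = [abs(scores_after[name] - scores_before[name]) for name in scores_before]
    return mean(diffs), pstdev(diffs)


# Made-up scores, only to exercise the function.
before = {"ARC-C": 50.0, "GSM8K": 40.0, "PIQA": 78.0}
after = {"ARC-C": 50.5, "GSM8K": 39.5, "PIQA": 78.0}

diff_avg, diff_std = harmlessness_diff(before, after)
```

Taking absolute differences prevents gains on one benchmark from cancelling losses on another, so a small Diff Avg indicates that no benchmark moved much in either direction.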

Key Findings¶

Robustness to Multi-Model Merging (Figure 3): The fingerprint maintains a high VSR even after merging with up to 7 models, with TIES-merging on Swallow-7B being the only exception where the fingerprint disappears.
Beyond Merging Scenarios (Table 4): MergePrint achieves VSR=1.0 in fine-tuning (Alpaca), quantization (LLM.int8()), and pruning (\(r \le 0.5\)) scenarios, consistently outperforming TRAP and IF.
Robustness to Inference Hyperparameters (Table 5): VSR remains 1.0 as temperature ranges from 0.4 to 2.0 and top-p ranges from 0.90 to 1.00, and is still 0.87 at temperature=3.0.
Secrecy (Table 3): VSR drops to 0.13 when 10% of the input characters are replaced and to 0 when \(\ge 20\%\) are replaced, showing that the fingerprint input is extremely difficult to guess.
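The secrecy test replaces a fraction of the fingerprint input's characters and re-measures VSR. The perturbation step can be sketched as follows, assuming replacement characters are drawn uniformly from lowercase letters and digits; the `replace_fraction` helper and that alphabet are illustrative choices, and the paper's exact protocol may differ:

```python
import random
import string


def replace_fraction(text, fraction, rng):
    """Replace round(fraction * len(text)) characters with different random ones."""
    chars = list(text)
    k = round(fraction * len(chars))
    alphabet = string.ascii_lowercase + string.digits
    for i in rng.sample(range(len(chars)), k):
        # Exclude the current character so every chosen position really changes.
        choices = [c for c in alphabet if c != chars[i]]
        chars[i] = rng.choice(choices)
    return "".join(chars)


rng = random.Random(0)  # fixed seed for reproducibility
original = "Decrypt message: r4tjqht4bno"
perturbed = replace_fraction(original, 0.10, rng)
changed = sum(a != b for a, b in zip(original, perturbed))
```

Querying the fingerprinted model with `perturbed` instead of `original` and recomputing VSR then measures how closely a guess must match the true input to trigger the fingerprint.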

Highlights & Insights¶

First fingerprinting method designed for model merging, filling an important gap in LLM IP protection.
The design of the pseudo-merged model is simple yet elegant: it does not require knowing which expert models an attacker will merge with, and instead uses the base model alone as an approximation.
Two-stage optimization balances efficiency and harmlessness: OptI dramatically reduces the number of OptP steps, completing the whole pipeline in <10 minutes.
Comprehensive evaluation: Covers 8 merging methods, multi-model merging, fine-tuning/quantization/pruning, and variations in inference hyperparameters.
All five requirements (R1-R5) are fully satisfied, presenting a highly practical verification solution.

Limitations & Future Work¶

Not resistant to knowledge distillation: Student models are trained on the teacher's input-output pairs. Because the fingerprint is not triggered by typical inputs, it does not appear in the distillation data and is therefore not transferred to the student model.
The pseudo-merged model is an approximation: Using the base model to replace unknown expert models is a heuristic approach and may fail in extreme merging scenarios (e.g., TIES + Swallow-7B).
Validated only at 7B scale: No experiments have been conducted on larger models (13B/70B), leaving scalability unclear.
Fingerprint inputs are gibberish: Although this enhances secrecy, verification might be hindered if API providers filter non-natural language inputs.
Verification requires multiple queries: VSR computation requires sampling outputs multiple times, which may be inconvenient in high API cost scenarios.

Related Work¶

White-box Fingerprinting: HuReF (parameter-invariant direction), REEF (intermediate representation comparison), and Fernandez et al. (weight embedding); all of these require access to model internals.
Black-box Fingerprinting: LLMmap (analyzing responses to identify versions), TRAP (optimizing input-output pairs), IF (instruction-tuning embedding)—but none are merge-resistant.
Model Merging: Task Arithmetic, TIES-merging, DARE, Breadcrumbs, DELLA—MergePrint is the first to treat model merging as a threat rather than a utility tool.
Backdoor Attacks: Zhang et al. 2024b proposed a merge-resistant backdoor, but it targets CV models and aims at untargeted incorrect outputs rather than a specific target output.
Adversarial Attacks: GCG (Zou et al. 2023)—borrowed in this work for input optimization.

Rating¶

Novelty: ⭐⭐⭐⭐ — First to propose a fingerprinting method for model merging scenarios with an elegant pseudo-merged model design; however, the overall framework (GCG + instruction tuning) is a combination of existing technologies.
Effectiveness: ⭐⭐⭐⭐⭐ — Satisfies all five requirements, comprehensively outperforms baselines across 8 merging methods, and generalizes beyond merging scenarios.
Practicality: ⭐⭐⭐⭐ — The overall workflow takes <10 minutes and uses purely black-box verification, but it is not resistant to distillation and has only been verified on 7B models.
Writing Quality: ⭐⭐⭐⭐ — Clearly structured with well-defined requirements and a systematic, comprehensive evaluation, though mathematical notations are somewhat dense.