TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA

Conference: ICLR 2026
arXiv: 2510.04682
Code: https://github.com/NaughtyMaltiz16/TiTok
Area: Model Compression
Keywords: LoRA Transfer, Knowledge Distillation, Token-level Selection, Parameter-Efficient Fine-Tuning, Contrastive Excess Score

TL;DR

This paper proposes TiTok, a framework that enables efficient cross-model transfer of LoRA adapters via token-level contrastive excess scores, without requiring an auxiliary discriminator model. TiTok consistently outperforms TransLoRA and knowledge distillation baselines on reasoning and personalization tasks.

Background & Motivation

The binding problem of LoRA: Although PEFT methods such as LoRA are parameter-efficient, adapter parameters are tightly coupled to a specific base model and cannot be directly transferred across models.
Limitations of Prior Work:
- Knowledge distillation (KD) requires access to the original training data, which is typically unavailable.
- TransLoRA addresses data dependency through synthetic data but requires training an additional discriminator model for data filtering, introducing extra complexity.
Core Motivation: Can task-relevant knowledge signals be extracted at the token level from a LoRA adapter in a more lightweight manner, so as to guide cross-model knowledge transfer?

Method

Overall Architecture

TiTok consists of three stages:
1. Synthetic data generation
2. Excess score computation
3. Target model training with filtering

Key Design 1: Token-level Contrastive Excess Score

The excess score of a token is the difference between its log-probability under the source model with the LoRA adapter and under the source model without it:

\[S(y_i) = L_e(y_i) - L_a(y_i)\]

where:

\[L_a(y_i) = \log P_{\mathcal{M}_s}(y_i \mid \mathbf{q}, \mathbf{y}_{<i}), \quad L_e(y_i) = \log P_{\mathcal{M}_s + \mathcal{A}_s}(y_i \mid \mathbf{q}, \mathbf{y}_{<i})\]

Intuition: The excess score quantifies the amount of task knowledge injected by the LoRA adapter. When the base model is uncertain about a token but the LoRA-augmented model predicts it with high confidence, that token receives a high excess score.
Theoretical Basis: The excess score is a token-level log-likelihood ratio (LLR). By the Neyman–Pearson lemma, the likelihood ratio test is the most powerful test at a given significance level for deciding between two simple hypotheses, here the base and LoRA-augmented model distributions.
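The excess score can be illustrated with a minimal Python sketch. The function name `excess_scores` and the log-probability values are hypothetical and chosen only for illustration; in practice both lists would come from scoring the same response with the source model, once with and once without the adapter.

```python
def excess_scores(logp_expert, logp_base):
    """Token-level excess score S(y_i) = L_e(y_i) - L_a(y_i)."""
    if len(logp_expert) != len(logp_base):
        raise ValueError("both models must score the same token sequence")
    return [le - la for le, la in zip(logp_expert, logp_base)]


# Hypothetical log-probabilities for a four-token response (illustrative values).
logp_base = [-0.10, -3.00, -0.50, -2.00]    # L_a: source model without LoRA
logp_expert = [-0.10, -0.25, -0.50, -0.50]  # L_e: source model with LoRA

scores = excess_scores(logp_expert, logp_base)

# Positions 1 and 3 (0-indexed) score highest: the base model was unsure
# about them, while the LoRA-augmented model predicts them confidently.
sequence_llr = sum(scores)  # token scores add up to a sequence-level log ratio
```

Because the score is a difference of log-probabilities, summing it over a response gives the log-likelihood ratio of the whole response under the two models.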

Key Design 2: Two-level Filtering for Training

Stage 1 — Sample Filtering: The average excess score of each synthetic sample is computed, and the top-\(M\) most informative samples are retained:

\[\bar{S}_j = \frac{1}{|\mathbf{y}_j|} \sum_{y_i \in \mathbf{y}_j} S(y_i)\]
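The sample-level filter can be sketched as follows. The helper names (`mean_excess`, `filter_samples`), the sample ids, and the scores are assumptions made for this example and are not taken from the paper's code.

```python
def mean_excess(token_scores):
    """Average excess score of one sample (S-bar_j in the formula above)."""
    return sum(token_scores) / len(token_scores)


def filter_samples(samples, m):
    """Return the ids of the m samples with the highest mean excess score.

    `samples` maps a sample id to its list of token-level excess scores.
    """
    ranked = sorted(samples, key=lambda sid: mean_excess(samples[sid]), reverse=True)
    return ranked[:m]


# Hypothetical synthetic samples with illustrative token scores.
samples = {
    "a": [0.0, 2.0, 1.0],       # mean 1.0
    "b": [0.5, 0.5],            # mean 0.5
    "c": [3.0, 1.0, 2.0, 2.0],  # mean 2.0
}
kept = filter_samples(samples, m=2)
```

Dividing by the response length keeps long responses from being favored merely because they contain more tokens.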

Stage 2 — Token Selection: Within the retained samples, only tokens in the top-\(k\%\) of excess scores are used for training:

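\[\mathcal{L}_{\text{TiTok}} = \sum_{(\mathbf{q}_j, \mathbf{y}_j) \in \mathcal{D}_f} \sum_{y_i \in \mathbf{y}_j} I_{k\%}(y_i) \cdot L_t(y_i)\]

Here \(\mathcal{D}_f\) is the set of samples retained in Stage 1, \(I_{k\%}(y_i)\) equals 1 if \(y_i\) is among the top-\(k\%\) tokens by excess score and 0 otherwise, and \(L_t(y_i)\) is the target model's negative log-likelihood of \(y_i\).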
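A minimal sketch of the token-selection indicator is given below. It assumes that the top-\(k\%\) cut is taken within each sample, that the number of kept tokens is rounded up, and that ties are broken by position; the source does not specify these details, and the function name `top_k_percent_mask` is hypothetical.

```python
import math


def top_k_percent_mask(token_scores, k_percent):
    """Return a 0/1 mask marking the top k% of tokens by excess score.

    Assumptions for this sketch: selection is per sample, the count is
    rounded up, and ties keep the earlier token.
    """
    n_keep = math.ceil(len(token_scores) * k_percent / 100)
    ranked = sorted(range(len(token_scores)),
                    key=lambda i: token_scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(token_scores))]


# Illustrative scores for a five-token response; keep the top 40%.
mask = top_k_percent_mask([0.0, 2.75, 0.1, 1.5, -0.2], k_percent=40)
```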

Key Design 3: Tokenizer Alignment Algorithm

When the source and target models employ different tokenizers:
- A dual-pointer incremental decoding scheme is used to align text spans.
- Four rules propagate the mask: one-to-one direct copy, one-to-many copy, many-to-one averaging, and many-to-many average copy.
- A final top-\(k\%\) selection retains the most reliable target tokens.
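One way to picture the alignment is the two-pointer sketch below. It is an illustration under stated assumptions rather than the paper's algorithm: tokens are represented as plain strings, both tokenizations are assumed to concatenate to exactly the same text, no token is empty, and the four propagation rules are collapsed into "give every target token in an aligned group the mean of the source scores in that group". The function name `align_scores` is hypothetical.

```python
def align_scores(src_tokens, src_scores, tgt_tokens):
    """Propagate per-token scores from a source tokenization to a target one."""
    if "".join(src_tokens) != "".join(tgt_tokens):
        raise ValueError("both tokenizations must decode to the same text")
    tgt_scores = []
    i = j = 0
    while i < len(src_tokens) and j < len(tgt_tokens):
        # Open a group with one token from each side.
        i_end, j_end = i + 1, j + 1
        src_len, tgt_len = len(src_tokens[i]), len(tgt_tokens[j])
        # Advance whichever side has covered fewer characters until the
        # two spans end at the same character boundary.
        while src_len != tgt_len:
            if src_len < tgt_len:
                src_len += len(src_tokens[i_end])
                i_end += 1
            else:
                tgt_len += len(tgt_tokens[j_end])
                j_end += 1
        group = src_scores[i:i_end]
        value = sum(group) / len(group)  # one source token: a plain copy
        tgt_scores.extend([value] * (j_end - j))  # copy to each target token
        i, j = i_end, j_end
    return tgt_scores


# "transfer LoRA" split two different ways (illustrative tokens and scores).
src_tokens = ["trans", "fer", " ", "LoRA"]
src_scores = [1.0, 3.0, 0.5, 1.5]
tgt_tokens = ["transfer", " Lo", "RA"]
tgt_scores = align_scores(src_tokens, src_scores, tgt_tokens)
```

In this example the first group is many-to-one (two source tokens averaged into one target token) and the second is many-to-many (two source tokens averaged, then copied to two target tokens). A top-\(k\%\) cut would then be applied to the propagated target scores.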

Loss & Training

The target LoRA \(\mathcal{A}_t\) is trained on top of the frozen backbone \(\mathcal{M}_t\) using the filtered synthetic data and a negative log-likelihood (NLL) loss restricted to the selected tokens:

\[\mathcal{L}_{\text{TiTok}} = \sum_{(\mathbf{q}_j, \mathbf{y}_j) \in \mathcal{D}_f} \sum_{y_i \in \mathbf{y}_j} I_{k\%}(y_i) \cdot \left(-\log P_{\mathcal{M}_t + \mathcal{A}_t}(y_i \mid \mathbf{q}, \mathbf{y}_{<i})\right)\]
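The masked loss above amounts to summing the negative log-probability over selected tokens only. The sketch below uses plain Python lists with hypothetical values to show the arithmetic; the name `titok_loss` is an assumption, and a real training loop would compute the log-probabilities from the target model with its trainable adapter.

```python
def titok_loss(target_logps, masks):
    """Sum of -log p over selected tokens across all filtered samples.

    `target_logps[j][i]` is the target model's log-probability of token i
    in sample j; `masks[j][i]` is the 0/1 top-k% indicator for that token.
    """
    total = 0.0
    for logps, mask in zip(target_logps, masks):
        for lp, keep in zip(logps, mask):
            if keep:
                total += -lp
    return total


# Two hypothetical filtered samples with illustrative log-probabilities.
target_logps = [[-0.5, -2.0, -1.0], [-0.25, -4.0]]
masks = [[0, 1, 1], [1, 0]]
loss = titok_loss(target_logps, masks)
```

Tokens with a zero mask contribute nothing to the loss and therefore no gradient. In PyTorch-style training code a common way to get the same effect is to set the labels of unselected tokens to the loss function's ignore index.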

Key Experimental Results

Main Results: Four Transfer Settings

Transfer Setting	Method	BBH Acc	MMLU Acc	News R-1	Scholarly R-1
Mistral→Mistral	Vanilla	0.397	0.557	0.117	0.381
Mistral→Mistral	TransLoRA	0.416	0.534	0.156	0.447
Mistral→Mistral	TiTok	0.424	0.561	0.161	0.473
Mistral→Llama3	Vanilla	0.469	0.469	0.125	0.444
Mistral→Llama3	TransLoRA	0.473	0.473	0.126	0.461
Mistral→Llama3	TiTok	0.484	0.485	0.139	0.464
Llama2→Llama3	TiTok	0.488	0.477	0.138	0.461

Ablation Study

Sample Filtering	Token Selection	BBH	MMLU	News R-1	Scholarly R-1
✗	✗	0.458	0.485	0.133	0.456
✗	✓	0.463	0.496	0.137	0.460
✓	✗	0.470	0.500	0.139	0.460
✓	✓	0.483	0.501	0.142	0.464

Key Findings

TiTok outperforms the vanilla target model by an average of +9.94%, KD by +8.5%, and TransLoRA by +4.4%.
The method is effective across model families (Mistral→Llama), scales (3B→8B), and versions (Llama2→Llama3).
Tokens in the top 20% of excess scores carry the most concentrated task knowledge (0.482 when training on them vs. 0.468 for the bottom-scoring tokens).
Two different source experts (Mistral 7B and Llama2 7B) overlap by 59.76% in their top-20% token selections.
A token selection ratio of \(k\%\) = 70% is optimal in most settings.
TiTok remains effective when out-of-domain external data is used.
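The overlap between two experts' top-20% selections, as reported in the findings above, could be measured along the lines of the sketch below. It assumes both experts score the same token positions and defines overlap as the shared fraction of the selected set; the paper's exact definition is not given here, and the function names and scores are hypothetical.

```python
def top_fraction_indices(scores, fraction):
    """Indices of the highest-scoring `fraction` of tokens (at least one)."""
    n_keep = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_keep])


def selection_overlap(scores_a, scores_b, fraction=0.2):
    """Share of expert A's selected tokens that expert B also selects."""
    top_a = top_fraction_indices(scores_a, fraction)
    top_b = top_fraction_indices(scores_b, fraction)
    return len(top_a & top_b) / len(top_a)


# Illustrative excess scores from two experts over the same ten tokens.
expert_a = [0.1, 2.0, 0.3, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
expert_b = [0.2, 1.8, 1.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
overlap = selection_overlap(expert_a, expert_b)
```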

Highlights & Insights

Simple yet effective: No auxiliary model (discriminator) needs to be trained; the method leverages only the difference between the source model with and without LoRA.
Theoretically grounded: The excess score is backed by statistical hypothesis testing theory via the log-likelihood ratio.
Comprehensive transfer scenarios: Experiments cover same-family, cross-family, cross-scale, and cross-version transfer settings.
Tokenizer alignment: The mask-propagation algorithm handles tokenizer mismatches between source and target models, so transfer is not restricted to models that share a vocabulary.

Limitations & Future Work

The method depends on the quality of synthetic data; a source model with weak generation capability may limit transfer performance.
The optimal token selection ratio \(k\%\) is not fully consistent across transfer settings (e.g., the optimal value for Llama3 3B→8B is 30%).
Validation is limited to LoRA (rank=8); other PEFT methods remain unexplored.
Evaluation is concentrated on reasoning (BBH/MMLU) and personalization (LaMP) tasks; generalization to other task types requires further investigation.

Related Work

PEFT Transfer: TransLoRA transfers LoRA via synthetic data and a trained discriminator, which makes for a heavier pipeline.
Knowledge Distillation: Traditional KD operates at the logit or sequence level within a teacher–student framework and requires access to the original training data.
Selective Token Training: Inspired by the selective training literature, TiTok is the first to extend token selection to the knowledge transfer setting.

Rating

Dimension	Score
Novelty	★★★★☆
Theoretical Depth	★★★★☆
Experimental Thoroughness	★★★★☆
Value	★★★★☆
Writing Quality	★★★★☆