TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA¶
Conference: ICLR 2026 arXiv: 2510.04682 Code: https://github.com/NaughtyMaltiz16/TiTok Area: Model Compression Keywords: LoRA Transfer, Knowledge Distillation, Token-level Selection, Parameter-Efficient Fine-Tuning, Contrastive Excess Score
TL;DR¶
This paper proposes TiTok, a framework that enables efficient cross-model transfer of LoRA adapters via token-level contrastive excess scores, without requiring an auxiliary discriminator model. TiTok consistently outperforms TransLoRA and knowledge distillation baselines on reasoning and personalization tasks.
Background & Motivation¶
- The binding problem of LoRA: Although PEFT methods such as LoRA are parameter-efficient, adapter parameters are tightly coupled to a specific base model and cannot be directly transferred across models.
- Limitations of Prior Work:
- Knowledge distillation (KD) requires access to the original training data, which is typically unavailable.
- TransLoRA addresses data dependency through synthetic data but requires training an additional discriminator model for data filtering, introducing extra complexity.
- Core Motivation: Can task-relevant knowledge signals be extracted at the token level from a LoRA adapter in a more lightweight manner, so as to guide cross-model knowledge transfer?
Method¶
Overall Architecture¶
TiTok consists of three stages: 1. Synthetic data generation → 2. Excess score computation → 3. Target model training with filtering
Key Design 1: Token-level Contrastive Excess Score¶
The token-level excess score is defined as the difference in log-likelihood between the source model with and without the LoRA adapter:

\[
s_i = \log p_{\mathcal{M}_s + \mathcal{A}_s}(x_i \mid x_{<i}) - \log p_{\mathcal{M}_s}(x_i \mid x_{<i})
\]

where:
- \(\mathcal{M}_s\) is the frozen source base model and \(\mathcal{A}_s\) is the source LoRA adapter;
- \(x_i\) is the \(i\)-th token of a synthetic sequence and \(x_{<i}\) its prefix.
- Intuition: The excess score quantifies the amount of task knowledge injected by the LoRA adapter. When the base model is uncertain about a token but the LoRA-augmented model predicts it with high confidence, that token receives a high excess score.
- Theoretical Basis: This is equivalent to a token-level log-likelihood ratio (LLR), which is guaranteed by the Neyman–Pearson lemma to be the most powerful statistic for distinguishing the two model distributions.
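The excess score is just a per-token difference of log-probabilities, which can be sketched in a few lines. The function name and the toy probabilities below are illustrative, not taken from the paper's code:

```python
import math

def excess_scores(logp_with_lora, logp_base):
    """Token-level contrastive excess score: the per-token log-likelihood
    ratio between the LoRA-augmented source model and the bare base model.
    Both arguments are per-token log-probabilities of the same sequence."""
    return [lw - lb for lw, lb in zip(logp_with_lora, logp_base)]

# Toy example: the base model is uncertain about the second token (p = 0.05)
# while the LoRA model predicts it confidently (p = 0.60), so that token
# receives by far the largest excess score.
base = [math.log(0.30), math.log(0.05), math.log(0.40)]
lora = [math.log(0.35), math.log(0.60), math.log(0.42)]
scores = excess_scores(lora, base)
print(max(range(len(scores)), key=lambda i: scores[i]))  # prints 1
```

In practice the two log-probability vectors come from two forward passes of the same source model, with the LoRA adapter enabled and disabled, so no extra model needs to be trained or stored.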
Key Design 2: Two-level Filtering for Training¶
Stage 1 — Sample Filtering: Each synthetic sample \(x\) is scored by its average excess score, and the top-\(M\) samples under this ranking are retained:

\[
\bar{s}(x) = \frac{1}{|x|} \sum_{i=1}^{|x|} s_i
\]
Stage 2 — Token Selection: Within the retained samples, only tokens whose excess scores fall in the top \(k\%\) contribute to training, via a binary mask

\[
m_i = \mathbb{1}\!\left[\, s_i \ge \tau_k \,\right]
\]

where \(\tau_k\) is the top-\(k\%\) threshold of the excess scores.
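Both filtering stages reduce to simple ranking operations. A minimal sketch, where the function names and the list-of-lists data layout are assumptions rather than the authors' implementation:

```python
def filter_samples(samples, M):
    """Stage 1: keep the M samples with the highest mean excess score.
    `samples` is a list of per-token excess-score lists, one per sample."""
    return sorted(samples, key=lambda s: sum(s) / len(s), reverse=True)[:M]

def select_tokens(scores, k_pct):
    """Stage 2: boolean mask marking the top-k% tokens by excess score."""
    n_keep = max(1, round(len(scores) * k_pct / 100))
    threshold = sorted(scores, reverse=True)[n_keep - 1]
    return [s >= threshold for s in scores]

# Toy data: three samples with two tokens each.
samples = [[0.1, 0.1], [2.0, 1.0], [0.5, 0.5]]
kept = filter_samples(samples, M=2)                    # [[2.0, 1.0], [0.5, 0.5]]
mask = select_tokens([2.0, 0.1, 1.5, 0.2], k_pct=50)   # [True, False, True, False]
```

Note that ties at the threshold can keep slightly more than \(k\%\) of the tokens; the summary above does not specify how the paper breaks ties.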
Key Design 3: Tokenizer Alignment Algorithm¶
When the source and target models employ different tokenizers:

- A dual-pointer incremental decoding scheme aligns text spans between the two tokenizations.
- Four rules propagate the mask across aligned spans: one-to-one direct copy, one-to-many copy, many-to-one averaging, and many-to-many average copy.
- A final top-\(k\%\) selection retains the most reliable target tokens.
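A minimal sketch of the dual-pointer alignment, assuming both tokenizations decode to exactly the same text; the grouping below folds the four rules into a single copy-or-average over each aligned span, and all names are illustrative:

```python
def propagate_scores(src_tokens, tgt_tokens, src_scores):
    """Walk two tokenizations of the same string with dual character
    cursors. Whenever the cursors meet at a common span boundary, the
    accumulated source scores are averaged and copied to every target
    token in the span (one-to-one and one-to-many reduce to a copy,
    many-to-one and many-to-many to an averaged copy)."""
    tgt_scores, group_src, group_tgt = [], [], []
    i = j = si = ti = 0  # token indices and character cursors
    while i < len(src_tokens) or j < len(tgt_tokens):
        if si <= ti and i < len(src_tokens):
            si += len(src_tokens[i]); group_src.append(src_scores[i]); i += 1
        else:
            ti += len(tgt_tokens[j]); group_tgt.append(j); j += 1
        if si == ti and group_src and group_tgt:
            avg = sum(group_src) / len(group_src)
            tgt_scores.extend(avg for _ in group_tgt)
            group_src, group_tgt = [], []
    return tgt_scores

# "hello world" split two ways: each many-to-one span gets the average
# of the source scores it covers.
out = propagate_scores(["hel", "lo ", "wor", "ld"], ["hello ", "world"],
                       [1.0, 2.0, 3.0, 5.0])
# out == [1.5, 4.0]
```

Real tokenizers complicate this with byte-level prefixes and special tokens, so the production algorithm presumably decodes incrementally through the tokenizer rather than comparing raw strings as done here.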
Loss & Training¶
The target LoRA \(\mathcal{A}_t\) is trained on top of the frozen target backbone \(\mathcal{M}_t\) using the filtered synthetic data with a standard NLL loss restricted to the selected tokens:

\[
\mathcal{L} = - \frac{1}{\sum_i m_i} \sum_i m_i \log p_{\mathcal{M}_t + \mathcal{A}_t}(x_i \mid x_{<i})
\]

where \(m_i \in \{0, 1\}\) marks the tokens retained by the token-selection step.
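The masked objective is an ordinary NLL averaged over the selected tokens only. A toy sketch with made-up numbers; in practice `logp` would come from the target model's forward pass:

```python
def masked_nll(logp, mask):
    """NLL over only the tokens kept by the excess-score mask.
    `logp` holds the target model's log-probability of each
    ground-truth token; `mask` is the boolean token-selection mask."""
    kept = [lp for lp, keep in zip(logp, mask) if keep]
    return -sum(kept) / len(kept)

# The middle (low-signal) token is masked out of the loss.
loss = masked_nll([-1.0, -5.0, -2.0], [True, False, True])
# loss == 1.5
```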
Key Experimental Results¶
Main Results: Four Transfer Settings¶
| Transfer Setting | Method | BBH Acc | MMLU Acc | News ROUGE-1 | Scholarly ROUGE-1 |
|---|---|---|---|---|---|
| Mistral→Mistral | Vanilla | 0.397 | 0.557 | 0.117 | 0.381 |
| Mistral→Mistral | TransLoRA | 0.416 | 0.534 | 0.156 | 0.447 |
| Mistral→Mistral | TiTok | 0.424 | 0.561 | 0.161 | 0.473 |
| Mistral→Llama3 | Vanilla | 0.469 | 0.469 | 0.125 | 0.444 |
| Mistral→Llama3 | TransLoRA | 0.473 | 0.473 | 0.126 | 0.461 |
| Mistral→Llama3 | TiTok | 0.484 | 0.485 | 0.139 | 0.464 |
| Llama2→Llama3 | TiTok | 0.488 | 0.477 | 0.138 | 0.461 |
Ablation Study¶
| Sample Filtering | Token Selection | BBH | MMLU | News ROUGE-1 | Scholarly ROUGE-1 |
|---|---|---|---|---|---|
| ✗ | ✗ | 0.458 | 0.485 | 0.133 | 0.456 |
| ✗ | ✓ | 0.463 | 0.496 | 0.137 | 0.460 |
| ✓ | ✗ | 0.470 | 0.500 | 0.139 | 0.460 |
| ✓ | ✓ | 0.483 | 0.501 | 0.142 | 0.464 |
Key Findings¶
- TiTok outperforms the vanilla target model by an average of +9.94%, KD by +8.5%, and TransLoRA by +4.4%.
- The method is effective across model families (Mistral→Llama), scales (3B→8B), and versions (Llama2→Llama3).
- Tokens in the top 20% of excess scores carry the most concentrated task knowledge: training on them yields 0.482 accuracy, versus 0.468 for the bottom-20% tokens.
- Two different model experts (Mistral 7B and Llama2 7B) share a 59.76% overlap in their top 20% token selections.
- A token selection ratio of \(k = 70\%\) is optimal in most settings.
- TiTok remains effective when out-of-domain external data is used.
Highlights & Insights¶
- Simple yet effective: No auxiliary model (discriminator) needs to be trained; the method leverages only the difference between the source model with and without LoRA.
- Theoretically grounded: The excess score is backed by statistical hypothesis testing theory via the log-likelihood ratio.
- Comprehensive transfer scenarios: Experiments cover same-family, cross-family, cross-scale, and cross-version transfer settings.
- Tokenizer alignment: An elegant solution is provided for tokenizer mismatches across different models.
Limitations & Future Work¶
- The method depends on the quality of synthetic data; a source model with weak generation capability may limit transfer performance.
- The optimal token selection ratio \(k\) is not fully consistent across transfer settings (e.g., the optimal value for Llama3 3B→8B transfer is 30%).
- Validation is limited to LoRA (rank=8); other PEFT methods remain unexplored.
- Evaluation is concentrated on reasoning (BBH/MMLU) and personalization (LaMP) tasks; generalization to other task types requires further investigation.
Related Work & Insights¶
- PEFT Transfer: TransLoRA transfers LoRA via synthetic data and a discriminator, constituting a heavier pipeline.
- Knowledge Distillation: Traditional KD operates at the logit or sequence level within a teacher–student framework and requires access to the original training data.
- Selective Token Training: Inspired by the selective training literature, TiTok is the first to extend token selection to the knowledge transfer setting.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ★★★★☆ |
| Theoretical Depth | ★★★★☆ |
| Experimental Thoroughness | ★★★★☆ |
| Value | ★★★★☆ |
| Writing Quality | ★★★★☆ |