PiFi: Plug-in and Fine-tuning: Bridging the Gap between Small Language Models and Large Language Models
Conference: ACL 2025
arXiv: 2506.07424
Code: Not released
Area: LLM/NLP
Keywords: Knowledge Distillation, Small Language Models, Large Language Models, Model Compression, Transfer Learning
TL;DR
PiFi inserts a single frozen LLM layer into an SLM and fine-tunes the combined model, improving SLM performance on both NLU and NLG tasks with minimal computational overhead.
Background & Motivation
Core Problem: LLMs possess powerful generalization abilities and language knowledge but suffer from high computational overhead, making them difficult to deploy in resource-constrained environments. Although SLMs are highly efficient, their generalization capability is limited. How can SLMs leverage the knowledge advantages of LLMs?
Limitations of Prior Work:
- Knowledge distillation methods (parametric or non-parametric) need the complete LLM, either as a teacher model or to generate synthetic data, which incurs high computational cost.
- Directly fine-tuning LLMs demands substantial VRAM and computational resources.
- Existing SLM enhancement methods (such as domain pre-training and data augmentation) fail to introduce LLM-level external knowledge.
Design Motivation: The design is inspired by work in computer vision that reuses LLM layers as vision encoders, which suggests that a single Transformer layer of an LLM already encodes rich linguistic knowledge. Through a "plug-in and freeze" approach, this knowledge can be transferred to the SLM with only a small increase in trainable parameters.
Method
Overall Architecture
The core idea of the PiFi (Plug-in and Fine-tuning) framework is to extract a single Transformer layer from an LLM (e.g., Llama-3.1-8B), freeze it, insert it into a specific position of an SLM, and then fine-tune the combined model. The framework supports both encoder-only architectures (e.g., BERT) and encoder-decoder architectures (e.g., T5).
Key Designs
- Dimension Adaptation Layers: Since SLMs and LLMs have different hidden dimensions (e.g., BERT 768 vs. Llama 4096), linear projection layers \(L_{in}\) (upsampling) and \(L_{out}\) (downsampling) are introduced for dimension alignment. For encoder-only models: \(h_{enc} = Enc(x)\), \(h_{LLM} = L_{LLM}(L_{in}(h_{enc}))\), \(\hat{y} = Head(L_{out}(h_{LLM}))\).
- LLM Layer Freezing Strategy: During the fine-tuning phase, the parameters of \(L_{LLM}\) are frozen. Only the original parameters of the SLM, \(L_{in}\), \(L_{out}\), and the classification head are trained, preventing catastrophic forgetting and minimizing extra trainable parameters.
- Flexible Architectural Adaptation: For encoder-only models, the LLM layer is inserted between the encoder and the classification head; for encoder-decoder models, the LLM layer is inserted between the encoder and the decoder.
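The encoder-only wiring described above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: `PiFiClassifier` and `make_block` are hypothetical names, both Transformer blocks are small randomly initialized stand-ins (a real setup would plug in a pretrained SLM encoder and one pretrained LLM layer), and first-token pooling before the head is an assumption.

```python
import torch
from torch import nn


class PiFiClassifier(nn.Module):
    """Sketch of Enc -> L_in -> frozen LLM layer -> L_out -> Head."""

    def __init__(self, encoder, llm_layer, d_slm, d_llm, num_labels):
        super().__init__()
        self.encoder = encoder                     # SLM encoder (trainable)
        self.l_in = nn.Linear(d_slm, d_llm)        # L_in: SLM width -> LLM width
        self.llm_layer = llm_layer                 # L_LLM: the plugged-in layer
        self.l_out = nn.Linear(d_llm, d_slm)       # L_out: LLM width -> SLM width
        self.head = nn.Linear(d_slm, num_labels)   # classification head
        # Freeze the plugged-in layer so fine-tuning never updates it.
        for p in self.llm_layer.parameters():
            p.requires_grad = False

    def forward(self, x):
        h_enc = self.encoder(x)                    # (batch, seq, d_slm)
        h_llm = self.llm_layer(self.l_in(h_enc))   # (batch, seq, d_llm)
        pooled = self.l_out(h_llm)[:, 0]           # first-token pooling (assumption)
        return self.head(pooled)                   # (batch, num_labels)


def make_block(d_model, n_heads):
    # Hypothetical stand-in for a pretrained Transformer layer (random weights).
    return nn.TransformerEncoderLayer(
        d_model, n_heads, dim_feedforward=2 * d_model, dropout=0.0, batch_first=True
    )


torch.manual_seed(0)
model = PiFiClassifier(make_block(32, 4), make_block(64, 4),
                       d_slm=32, d_llm=64, num_labels=2)
x = torch.randn(3, 5, 32)        # stand-in for embedded input tokens
logits = model(x)
logits.sum().backward()          # gradients flow through the frozen layer

trainable = [p for p in model.parameters() if p.requires_grad]
frozen = [p for p in model.parameters() if not p.requires_grad]
```

Only the tensors in `trainable` would be handed to an optimizer, so the frozen layer takes part in the forward and backward pass but is never updated. For an encoder-decoder SLM, the projected output would presumably feed the decoder in place of the head.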
Loss & Training
Standard task loss functions are used: cross-entropy loss for classification tasks and autoregressive language modeling loss for generation tasks, without requiring any additional distillation losses.
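For reference, the two objectives named above can be written in their standard textbook form. The notation is assumed here rather than taken from the paper: \(\theta\) denotes the trainable parameters only (the frozen \(L_{LLM}\) is excluded), \(N\) is the number of training examples, and \(T\) is the target sequence length.

\[
\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(y_i \mid x_i),
\qquad
\mathcal{L}_{LM} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)
\]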
Experiments
Main Results: NLU Task Performance
| Model | SST2 | IMDB | Tweet(Sent) | Tweet(Off) | CoLA | MNLI | SNLI | SQuAD | Avg |
|---|---|---|---|---|---|---|---|---|---|
| BERT_base | 89.41 | 85.10 | 86.90 | 83.15 | 80.10 | 82.00 | 89.10 | 63.81 | 82.45 |
| +PiFi | 91.50 | 87.09 | 92.95 | 86.03 | 82.07 | 82.74 | 89.48 | 66.17 | 84.75 |
| ELECTRA_base | 93.42 | 88.31 | 90.58 | 83.52 | 83.99 | 85.41 | 90.11 | 44.44 | 82.00 |
| +PiFi | 94.13 | 89.40 | 93.31 | 84.99 | 86.26 | 86.47 | 90.48 | 67.99 | 86.71 |
| DeBERTa-V3_base | 93.74 | 89.45 | 91.29 | 83.60 | 84.75 | 87.52 | 90.94 | 69.40 | 86.34 |
| +PiFi | 95.01 | 89.83 | 93.80 | 85.60 | 86.07 | 87.98 | 91.05 | 69.87 | 87.40 |
NLG Tasks: T5_base + PiFi improves BLEU from 0.5301 to 0.5413 on Multi30K translation, and BART_base + PiFi improves BLEU from 0.2270 to 0.2331 on CNN/DailyMail summarization.
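These improvements are easier to compare as absolute and relative gains. The short script below recomputes them from the reported Avg column and BLEU scores only; the helper name `gains` is hypothetical and no numbers beyond those reported above are introduced.

```python
# Reported (baseline, +PiFi) scores copied from the results above.
reported = {
    "BERT_base (NLU avg)": (82.45, 84.75),
    "ELECTRA_base (NLU avg)": (82.00, 86.71),
    "DeBERTa-V3_base (NLU avg)": (86.34, 87.40),
    "T5_base (Multi30K BLEU)": (0.5301, 0.5413),
    "BART_base (CNN/DailyMail BLEU)": (0.2270, 0.2331),
}


def gains(baseline, with_pifi):
    """Return (absolute gain, relative gain in percent of the baseline)."""
    absolute = with_pifi - baseline
    return absolute, 100.0 * absolute / baseline


summary = {name: gains(*scores) for name, scores in reported.items()}
for name, (absolute, relative) in summary.items():
    print(f"{name}: +{absolute:.4f} absolute, +{relative:.2f}% relative")
```

Note that the NLU gains are differences in score points, while the relative column expresses each gain as a percentage of its baseline.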
Ablation Study
| Ablation Dimension | Key Findings |
|---|---|
| LLM Layer Position (Layer 1/16/32) | The last layer performs the best, as it contains the richest high-level linguistic knowledge |
| Whether to Freeze the LLM Layer | Freezing outperforms unfreezing, as unfreezing leads to catastrophic forgetting |
| Different LLM Sources | Llama-3.1-8B achieves the best overall performance, though all examined LLMs bring improvements |
| Instruction-tuned LLM | Using the instruction-tuned version yields similar results |
Key Findings
- PiFi consistently improves performance across all 5 SLM architectures; on the NLU average, BERT_base gains 2.30 points (82.45 to 84.75) and ELECTRA_base gains 4.71 points (82.00 to 86.71).
- In cross-domain generalization experiments, PiFi improves IMDB \(\rightarrow\) Tweet performance from 70.40 to 83.68 (+13.28 points), demonstrating strong domain transfer capability.
- Multilingual classification experiments demonstrate that using an LLM layer pre-trained on the target language can significantly enhance the multilingual capabilities of SLMs.
- Only a small number of trainable parameters (the two linear projection layers) are added, keeping the additional training cost low.
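To put the two added linear layers in numbers, the following back-of-the-envelope count uses the hidden sizes quoted earlier (768 for BERT_base, 4096 for Llama). Whether the projections carry bias terms is not stated, so `bias=True` is an assumption, and the 110M figure for BERT_base is the commonly cited approximate size rather than a number from the paper.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of one dense layer mapping d_in -> d_out."""
    return d_in * d_out + (d_out if bias else 0)


D_SLM, D_LLM = 768, 4096             # hidden sizes: BERT_base vs. Llama
BERT_BASE_PARAMS = 110_000_000       # approximate, commonly cited size

l_in = linear_params(D_SLM, D_LLM)   # L_in: 768 -> 4096
l_out = linear_params(D_LLM, D_SLM)  # L_out: 4096 -> 768
added_trainable = l_in + l_out

print(f"L_in: {l_in:,}  L_out: {l_out:,}  total: {added_trainable:,}")
print(f"share of BERT_base: {100 * added_trainable / BERT_BASE_PARAMS:.1f}%")
# total: 6,296,320, i.e. about 5.7% of BERT_base
```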
Highlights & Insights
- Simple Yet Effective Design: Simply inserting a single LLM layer with two linear mappings significantly boosts SLM performance.
- Strong Versatility: It is applicable to both encoder-only and encoder-decoder architectures, covering a wide range of NLU and NLG tasks.
- Outstanding Cross-Domain and Multilingual Transfer: Experiments show that PiFi not only improves in-domain performance but also enhances generalization to unseen domains and languages.
- Highly Efficient Training: The LLM layer is frozen during training, introducing minimal extra parameters.
Limitations & Future Work
- Currently, only the effect of a single LLM layer has been verified; the potential of multi-layer combinations or middle layers has not been fully explored.
- Dimension projection in the linear mapping layer may cause information loss; a more advanced adaptation network (e.g., MLP) might yield better results.
- Whether significant improvements still hold for larger SLMs (e.g., 1-3B scale) has not been fully verified.
- During inference, one layer of LLM parameters must still be loaded (roughly 218M parameters for a single Llama-3.1-8B layer, about 0.44 GB at 16-bit precision), which increases storage demands.
- Direct comparison with mainstream PEFT methods like LoRA is lacking.
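The storage cost of the plugged-in layer can be estimated from the public Llama-3.1-8B configuration (hidden size 4096, MLP size 14336, 32 query heads, 8 key/value heads, no bias terms). The helper name `decoder_layer_params` is hypothetical; the count covers a single decoder layer and excludes embeddings and the output head.

```python
# Values from the public Llama-3.1-8B model configuration.
HIDDEN = 4096
INTERMEDIATE = 14336
N_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = HIDDEN // N_HEADS  # 128


def decoder_layer_params():
    """Parameter count of one decoder layer (grouped-query attention + gated MLP)."""
    q_proj = HIDDEN * N_HEADS * HEAD_DIM
    k_proj = HIDDEN * N_KV_HEADS * HEAD_DIM
    v_proj = HIDDEN * N_KV_HEADS * HEAD_DIM
    o_proj = N_HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE   # gate, up, and down projections
    norms = 2 * HIDDEN                # two RMSNorm weight vectors
    return q_proj + k_proj + v_proj + o_proj + mlp + norms


params = decoder_layer_params()
bytes_16bit = 2 * params              # 2 bytes per parameter in fp16/bf16

print(f"{params:,} parameters, {bytes_16bit / 1e9:.2f} GB at 16-bit precision")
# 218,112,000 parameters, 0.44 GB at 16-bit precision
```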
Related Work & Insights
- Knowledge Distillation: Parametric methods (Zhong et al.) train student models using the teacher's output distribution, while non-parametric methods (Ye et al.) generate synthetic data using LLMs.
- SLM Enhancement: Continual pre-training (Gururangan et al.), data augmentation (Gao et al.), etc.
- Cross-Modal Applications of LLMs: Utilizing LLM layers as vision encoders (Pang et al.), rewriting text descriptions via LLMs to enhance CLIP (Fan et al.).
Rating
| Dimension | Score |
|---|---|
| Novelty | 7/10 |
| Effectiveness | 8/10 |
| Experimental Thoroughness | 8/10 |
| Writing Quality | 7/10 |
| Overall Score | 7.5/10 |