Who Taught You That? Tracing Teachers in Model Distillation
| Conference | arXiv | Code | Area | Keywords |
|---|---|---|---|---|
| ACL 2025 | 2502.06659 | - | Model Compression / LLM Safety | Knowledge Distillation, Teacher Attribution, Syntactic Templates, PoS Tags, Model Provenance |
TL;DR
This paper introduces the problem of "teacher model attribution": given a distilled student model, can its teacher be identified from a pool of candidate teachers? The authors find that n-gram similarity and perplexity are unreliable signals, whereas Part-of-Speech (PoS) syntactic templates provide an effective one.
Background & Motivation
Research Question: In the context of model distillation (using large models to train small models), is it possible to infer the teacher model by analyzing the output of the student model?
Limitations of Prior Work:
- Model distillation has become the mainstream method for training efficient small models using large proprietary LLMs.
- Distillation may violate the terms of service of model providers (e.g., the controversy over whether DeepSeek distilled ChatGPT).
- There is a lack of effective methods to detect unauthorized distillation behaviors.
- Existing data provenance methods (such as watermarking) require embedding at generation time and cannot be detected post-hoc.
Core Motivation: LLM providers require tools to identify unauthorized distillation use. Furthermore, understanding the transfer of "linguistic fingerprints" from teachers to students helps elucidate the mechanisms of knowledge distillation.
Method
Overall Architecture
Three teacher attribution strategies are systematically compared:
1. Perplexity Method: Computes the perplexity of candidate teachers on the student's output, with the expectation that the true teacher will yield a lower perplexity.
2. Similarity Method: Measures the text similarity between the student's and candidate teachers' outputs.
3. Syntactic Template Method: Trains a classifier based on Part-of-Speech (PoS) sequence patterns to identify the teacher.
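The perplexity strategy can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each candidate teacher has already scored the same student output and returned per-token natural-log probabilities. The function names and the example numbers are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def attribute_by_perplexity(logprobs_by_teacher):
    """Return (predicted_teacher, scores); the lowest perplexity wins."""
    scores = {name: perplexity(lp) for name, lp in logprobs_by_teacher.items()}
    return min(scores, key=scores.get), scores

# Hypothetical log-probabilities two candidate teachers assign
# to the same student output (illustrative values only).
example = {
    "teacher_a": [-1.2, -0.8, -2.0, -0.5],
    "teacher_b": [-0.9, -0.7, -1.1, -0.6],
}
predicted, scores = attribute_by_perplexity(example)
```

Here `predicted` is the candidate with the lowest perplexity. The rule is only as good as the assumption that the true teacher scores its student's text as most likely.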
Key Designs
Experimental Setup:
- Student Models: GPT-2 (124M) and OLMo-1B
- Candidate Teachers \(\mathcal{M}\): {Llama3-8B, Llama3-70B, Mistral-7B, Mixtral, Gemma2-9B}, all of which are open-source models.
- Tasks: Summarization (CNN-DailyMail, Rotten Tomatoes, PubMed), Question Answering (OpenbookQA, CommonsenseQA), Instruction Following (Alpaca 10K).
PoS Template Method:
- Extracts PoS templates of length 4 using the diversity package.
- Selects the top 50 most common PoS patterns across all teacher outputs.
- Constructs PoS template indicator features (a 50-dimensional binary vector).
- Trains a 5-class logistic regression classifier on teacher outputs and tests it on student outputs.
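The feature-construction steps above can be sketched in plain Python. This sketch assumes the texts have already been PoS-tagged by some tagger and does not reproduce the API of the diversity package. It also assumes that "most common" means raw template counts pooled over all teacher outputs. All names and the toy tag sequences are hypothetical.

```python
from collections import Counter

def pos_ngrams(tags, n=4):
    """All contiguous PoS n-grams (templates) in one tagged text."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def top_templates(tagged_texts, n=4, k=50):
    """The k most frequent templates pooled over all teacher outputs."""
    counts = Counter()
    for tags in tagged_texts:
        counts.update(pos_ngrams(tags, n))
    return [template for template, _ in counts.most_common(k)]

def indicator_vector(tags, templates, n=4):
    """Binary vector: 1 if the text contains the template, else 0."""
    present = set(pos_ngrams(tags, n))
    return [1 if template in present else 0 for template in templates]

# Toy PoS-tagged teacher outputs (hypothetical; k is shrunk from 50 to 3).
teacher_outputs = [
    ["DT", "NN", "VBZ", "DT", "JJ", "NN"],
    ["DT", "NN", "VBZ", "DT", "NN"],
    ["PRP", "VBP", "DT", "JJ", "NN"],
]
templates = top_templates(teacher_outputs, k=3)
student_vec = indicator_vector(["DT", "NN", "VBZ", "DT", "NN"], templates)
```

With `k=50`, each text maps to the 50-dimensional binary vector described above. The same template list must be used for teacher and student texts so that the dimensions line up.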
Core Hypothesis: Student models internalize the syntactic preferences of their teachers during distillation. These high-level linguistic structural features are more discriminative than surface lexical similarity.
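One way to probe this hypothesis is the train-on-teacher, test-on-student protocol: fit a classifier on template features of teacher outputs, then check whether it assigns student outputs to the correct teacher. The sketch below is a small pure-Python multinomial logistic regression fitted by gradient descent. It stands in for whatever library implementation the authors used. The two-teacher, three-feature data are hypothetical and much smaller than the 5-class, 50-feature setting.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(W, b, x):
    """Linear score per class for one feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]

def train_softmax(X, y, n_classes, lr=0.5, epochs=200):
    """Multinomial logistic regression via plain batch gradient descent."""
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        grad_W = [[0.0] * n_features for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, label in zip(X, y):
            probs = softmax(class_scores(W, b, x))
            for c in range(n_classes):
                err = probs[c] - (1.0 if c == label else 0.0)
                grad_b[c] += err
                for j, xi in enumerate(x):
                    grad_W[c][j] += err * xi
        for c in range(n_classes):
            b[c] -= lr * grad_b[c] / len(X)
            for j in range(n_features):
                W[c][j] -= lr * grad_W[c][j] / len(X)
    return W, b

def predict(W, b, x):
    """Index of the highest-scoring class."""
    scores = class_scores(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical indicator vectors from teacher outputs, labeled by teacher index.
teacher_X = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
teacher_y = [0, 0, 1, 1]
W, b = train_softmax(teacher_X, teacher_y, n_classes=2)

# Hypothetical features of outputs from a student distilled from teacher 1.
student_X = [[0, 0, 1], [0, 1, 1]]
student_y = [1, 1]
accuracy = sum(predict(W, b, x) == t for x, t in zip(student_X, student_y)) / len(student_X)
```

The key design point is that no student data is used for fitting: the classifier only transfers if the student reproduces its teacher's template preferences.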
Evaluation Metrics
Classification accuracy (5 classes, random baseline of 0.20), BoW cosine similarity, BERTScore, and AUC-ROC.
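For reference, bag-of-words cosine similarity can be computed as in the sketch below. Lowercasing and whitespace tokenization are assumptions made for brevity, since the preprocessing used in the paper is not described here, and the function name is hypothetical.

```python
import math
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity between word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)

same = bow_cosine("the cat sat", "the cat sat")
partial = bow_cosine("the cat sat", "the dog sat")
disjoint = bow_cosine("a b", "c d")
```

The score depends only on shared vocabulary and ignores word order, which is one reason it may miss structural preferences.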
Experiments
Main Results
| Student Model | Feature Type | C-D | P-M | R-T | CSQA | OBQA | QRe | Alpaca |
|---|---|---|---|---|---|---|---|---|
| GPT-2 | BERT | 0.46 | 0.55 | 0.40 | 0.44 | 0.38 | 0.35 | 0.51 |
| GPT-2 | n-grams | 0.58 | 0.68 | 0.44 | 0.56 | 0.48 | 0.50 | 0.56 |
| GPT-2 | PoS Templates | 0.60 | 0.71 | 0.54 | 0.69 | 0.51 | 0.59 | 0.55 |
| OLMo-1B | BERT | 0.45 | 0.65 | 0.41 | 0.40 | 0.42 | 0.31 | 0.46 |
| OLMo-1B | n-grams | 0.60 | 0.62 | 0.48 | 0.55 | 0.42 | 0.58 | 0.50 |
| OLMo-1B | PoS Templates | 0.61 | 0.74 | 0.45 | 0.59 | 0.43 | 0.61 | 0.53 |
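Values are 5-class classification accuracy (random baseline: 0.20). Column abbreviations: C-D = CNN-DailyMail, P-M = PubMed, R-T = Rotten Tomatoes, CSQA = CommonsenseQA, OBQA = OpenbookQA. PoS templates give the highest accuracy in 12 of the 14 student-dataset pairs; the exceptions are GPT-2 on Alpaca (n-grams: 0.56 vs. 0.55) and OLMo-1B on R-T (n-grams: 0.48 vs. 0.45).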
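The comparison across feature types can be checked directly from the table. The snippet below copies the reported accuracies and lists, for each student, the datasets on which PoS templates strictly beat both other feature types. The helper name is hypothetical.

```python
datasets = ["C-D", "P-M", "R-T", "CSQA", "OBQA", "QRe", "Alpaca"]

# Accuracies copied from the main results table.
acc = {
    ("GPT-2", "BERT"): [0.46, 0.55, 0.40, 0.44, 0.38, 0.35, 0.51],
    ("GPT-2", "n-grams"): [0.58, 0.68, 0.44, 0.56, 0.48, 0.50, 0.56],
    ("GPT-2", "PoS"): [0.60, 0.71, 0.54, 0.69, 0.51, 0.59, 0.55],
    ("OLMo-1B", "BERT"): [0.45, 0.65, 0.41, 0.40, 0.42, 0.31, 0.46],
    ("OLMo-1B", "n-grams"): [0.60, 0.62, 0.48, 0.55, 0.42, 0.58, 0.50],
    ("OLMo-1B", "PoS"): [0.61, 0.74, 0.45, 0.59, 0.43, 0.61, 0.53],
}

def pos_wins(student):
    """Datasets where PoS templates strictly beat BERT and n-gram features."""
    pos = acc[(student, "PoS")]
    bert = acc[(student, "BERT")]
    ngrams = acc[(student, "n-grams")]
    return [d for d, p, b, n in zip(datasets, pos, bert, ngrams) if p > b and p > n]

gpt2_wins = pos_wins("GPT-2")
olmo_wins = pos_wins("OLMo-1B")
total_wins = len(gpt2_wins) + len(olmo_wins)
```

Several margins are a single point of accuracy (for example OLMo-1B on C-D and OBQA), so the win count alone overstates how decisive the advantage is.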
Ablation Study
| Method | Effect |
|---|---|
| Perplexity | Teacher perplexity cannot reliably distinguish (the true teacher does not necessarily yield the lowest PPL) |
| BoW + BERTScore Similarity | AUC \(\approx\) 0.49-0.53, close to random, lacking discriminative power |
| Logistic Regression + Similarity Features | AUC \(\approx\) 0.52, almost no discriminative power |
| PoS Templates (Core Method) | Well above the 0.20 random baseline (e.g., 0.69 on CSQA with GPT-2, up to 0.74 on P-M with OLMo-1B), but still far from perfect |
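To make the "AUC close to 0.5" rows concrete, the sketch below computes AUC-ROC as the probability that a positive example outscores a negative one. How the paper builds its positive and negative pairs is not specified here. The example assumes label 1 marks a similarity score between a student and its true teacher and label 0 a score against another candidate. The scores are hypothetical.

```python
def auc_roc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical similarity scores: label 1 = true teacher, 0 = other candidate.
informative = auc_roc([0.62, 0.58, 0.60, 0.61], [1, 0, 1, 0])
uninformative = auc_roc([0.60, 0.60, 0.60, 0.60], [1, 0, 1, 0])
```

An AUC of 0.5 means the score ranks true-teacher pairs no better than a coin flip, which is the situation the similarity rows describe.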
Key Findings
- Surface Similarity Fails: BoW and BERTScore similarity cannot distinguish students taught by different teachers (AUC \(\approx\) 0.49-0.53), suggesting that these similarity scores carry little teacher-specific signal.
- Perplexity Also Fails: Teacher models do not always prefer the output of their own student models (e.g., Gemma actually assigns a higher PPL to its own distilled student).
- Syntactic Templates Work: PoS templates capture higher-level syntactic structural preferences, which are retained by students during distillation.
- Task Dependency: Accuracy varies considerably by dataset. PoS templates are strongest on P-M (0.71 with GPT-2, 0.74 with OLMo-1B) and on CSQA with GPT-2 (0.69), and weakest on OBQA and R-T (0.43-0.54). On Alpaca they reach only 0.55 and 0.53, roughly level with n-grams.
- Accuracy is well above the random baseline (0.20) but remains far from perfect: teacher fingerprints exist, but the signal is not yet strong enough for practical applications.
Highlights & Insights
- Introduces a novel and practically valuable problem: post-hoc attribution of the teacher behind a distilled model.
- Systematically rules out intuitively plausible approaches (perplexity, text similarity), highlighting the challenging nature of the problem.
- The effectiveness of PoS templates suggests that distillation transfers implicit syntactic preferences that surface similarity scores do not capture.
- Requires no access to the internal states of the teacher model and operates without watermarking strategies.
Limitations & Future Work
- Although the classification accuracy of PoS templates is far above random, it is still far from practical utility (maximum of 0.74).
- Assumes a closed-set scenario (the true teacher must be in the candidate set) and cannot handle out-of-candidate-set teachers.
- Additional fine-tuning, data augmentation, or multi-teacher distillation might blur the attribution signals.
- Different teachers trained on the same data might share fingerprints, increasing the difficulty of attribution.
- Only two student models (GPT-2, OLMo-1B) are analyzed, and generalizability remains to be verified.
Related Work & Insights
- LLM Distillation: Knowledge distillation framework by Hinton et al. (2015); Reasoning tutoring by Ho et al. (2023); Symbolic CoT distillation by Li et al. (2023b); CoT-enhanced distillation by Wadhwa et al. (2024a).
- Provenance Tracking: Statistical watermark detection by Li et al. (2024b); Generation-time watermarking methods by Li et al. (2024a); and text source detection using perplexity and contrastive training by Li et al. (2023a).
- LLM Text Detection: Shaib et al. (2024b) find that LLMs prefer specific syntactic templates (which serves as the direct inspiration for this work).
Rating
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Practicality | ⭐⭐⭐⭐ |
| Technical Depth | ⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐ |