From Teacher to Student: Tracking Memorization Through Model Distillation

Conference: ACL 2025
arXiv: 2506.16170
Code: None
Area: Model Compression
Keywords: Knowledge Distillation, Memorization, Privacy Protection, Model Compression, GPT-2

TL;DR

This work systematically investigates how knowledge distillation (KD) affects the memorization behavior of large language models. It finds that distillation not only compresses the model but also substantially reduces verbatim memorization of training data: the memorization ratio falls from 65.4% for the SFT teacher (1.5B) to as low as 6.0% for a 120M student trained with reverse KL distillation (RKLD/MiniLLM).

Background & Motivation

Background: Large language models have been shown to memorize and potentially leak sensitive information from training data. Prior research has primarily focused on the memorization of pre-trained models (Carlini et al., 2021; 2023), with limited investigation into memorization behaviors during the fine-tuning stage and knowledge distillation process.

Limitations of Prior Work: (1) Fine-tuning often uses specialized, sensitive data (e.g., medical records, proprietary data); because these datasets are smaller and more concentrated, the memorization risk is higher and more dangerous; (2) Direct supervised fine-tuning (SFT) leads models to memorize a large share of their training samples; (3) Knowledge distillation, although a mainstream model compression technique, remains almost unexplored in terms of its impact on memorization.

Key Challenge: The need to reduce model memorization of training data while maintaining task performance, especially in scenarios involving privacy-sensitive data.

Goal: To systematically study how different knowledge distillation methods affect memorization behavior during the transfer from a fine-tuned teacher model to a smaller student model.

Key Insight: Treating distillation as an implicit privacy-preserving technique, and verifying this hypothesis by comparing the memorization behaviors of four methods: SFT, Word-Level KD, Sequence-Level KD, and RKLD.

Core Idea: "Knowledge softening" during the distillation process naturally filters out verbatim memorization—the student model learns the teacher's output distribution (soft labels) rather than directly memorizing the hard labels of the training data.

Method

Memorization Definition

Based on the framework of Carlini et al. (2023), adapted for instruction-following scenarios: given an instruction-context-response triplet (p, c, s), if the response s' generated by the model through greedy decoding given p and c matches s exactly (verbatim reproduction), it is considered memorized. Memorization ratio = number of memorized samples / total samples.
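
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: `generate_greedy` is a hypothetical callable standing in for greedy decoding with whichever model is being evaluated, and comparing whitespace-trimmed strings is an assumption (the paper's exact comparison, for instance token-level matching over a fixed window, may differ).

```python
from typing import Callable, Iterable, Tuple

# (instruction p, context c, reference response s)
Triplet = Tuple[str, str, str]


def memorization_ratio(
    samples: Iterable[Triplet],
    generate_greedy: Callable[[str, str], str],  # hypothetical decoding hook
) -> float:
    """Fraction of samples whose greedy output reproduces the reference verbatim."""
    total = 0
    memorized = 0
    for instruction, context, reference in samples:
        total += 1
        prediction = generate_greedy(instruction, context)
        # Assumption: exact string equality after trimming outer whitespace.
        if prediction.strip() == reference.strip():
            memorized += 1
    return memorized / total if total else 0.0


# Toy stand-in for a model: a lookup table that reproduces one of two references.
toy_outputs = {
    ("Name the capital.", "France"): "Paris",
    ("Name the capital.", "Japan"): "Kyoto",
}
toy_samples = [
    ("Name the capital.", "France", "Paris"),
    ("Name the capital.", "Japan", "Tokyo"),
]
ratio = memorization_ratio(toy_samples, lambda p, c: toy_outputs[(p, c)])
print(ratio)  # 0.5
```

In a real evaluation the lambda would be replaced by a function that runs the fine-tuned or distilled model with sampling disabled.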

Comparison of Four Training Methods

  1. SFT (Supervised Fine-Tuning, Baseline):

    • The student model is trained directly on ground-truth responses using standard next-token loss.
    • Lacking teacher guidance, it directly learns hard labels, yielding the highest risk of memorization.
  2. Word-Level KD (Word-Level Distillation):

    • The student mimics the teacher's token-level probability distribution (soft labels) at each position.
    • Loss = a weighted mixture of the forward KL divergence KL(teacher || student) and the NLL on the ground-truth tokens.
    • Soft distributions contain uncertainty and alternative possibilities, providing richer supervisory signals than hard labels.
  3. Seq-KD (Sequence-Level Distillation):

    • Replaces the original ground-truth with complete sequences generated by the teacher model via beam search as training targets.
    • The student learns the teacher's output sequence rather than the original training data.
    • Indirectly severs the direct connection between the student and the original data.
  4. RKLD (Reverse KL Distillation/MiniLLM):

    • Minimizes the reverse KL divergence KL(student || teacher) instead of the traditional forward KL(teacher || student).
    • Reverse KL penalizes the student for assigning high probability where the teacher's distribution has low probability, meaning it penalizes "overconfidence".
    • Adds an auxiliary language-modeling loss on pre-training data to maintain general capabilities.
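
The difference between the forward KL used in word-level KD and the reverse KL used in RKLD can be illustrated on a single toy next-token distribution. The sketch below is an assumption-laden illustration rather than any of the cited implementations: the function names, the mixing weight `alpha`, and the three-token vocabulary are all hypothetical, and real training applies these losses per position over model logits.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def word_level_kd_loss(
    teacher: Sequence[float],
    student: Sequence[float],
    gold_index: int,
    alpha: float = 0.5,  # hypothetical mixing weight, not taken from the paper
) -> float:
    """Mixture of forward KL(teacher || student) and NLL on the gold token."""
    nll = -math.log(student[gold_index])
    return alpha * kl_divergence(teacher, student) + (1.0 - alpha) * nll


def reverse_kl_loss(teacher: Sequence[float], student: Sequence[float]) -> float:
    """KL(student || teacher), the direction minimized in reverse KL distillation."""
    return kl_divergence(student, teacher)


# Toy next-token distributions over a 3-token vocabulary.
teacher = [0.50, 0.49, 0.01]       # two plausible tokens, one implausible
covering = [0.34, 0.33, 0.33]      # spreads mass, including onto the implausible token
mode_seeking = [0.98, 0.01, 0.01]  # commits to a single teacher mode

for name, student in [("covering", covering), ("mode_seeking", mode_seeking)]:
    forward = kl_divergence(teacher, student)
    reverse = reverse_kl_loss(teacher, student)
    print(f"{name}: forward KL = {forward:.3f}, reverse KL = {reverse:.3f}")
```

On these toy numbers, forward KL prefers the covering student, while reverse KL prefers the mode-seeking one because it heavily penalizes the mass the covering student places on the token the teacher considers implausible.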

Evaluation Suite

  • Memorization Ratio: Randomly samples 3,000 instances from the training set and, with a token window of k=50, computes the fraction reproduced verbatim.
  • ROUGE Score: Calculates ROUGE-1/2/L on both the training and test sets. High training ROUGE + high memorization indicates verbatim copying, while the test set reflects generalization capability.
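
To make the train/test ROUGE comparison concrete, here is a simplified sketch of ROUGE-1 and ROUGE-L F1 scores. It assumes lowercased whitespace tokenization with no stemming, so its numbers will not match an official ROUGE scorer exactly; the helper names are illustrative only.

```python
from collections import Counter
from typing import List


def _f1(overlap: int, pred_len: int, ref_len: int) -> float:
    """Harmonic mean of precision and recall, 0.0 when there is no overlap."""
    if overlap == 0 or pred_len == 0 or ref_len == 0:
        return 0.0
    precision = overlap / pred_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 with clipped counts."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return _f1(overlap, len(pred), len(ref))


def _lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction: str, reference: str) -> float:
    """F1 based on the longest common subsequence of tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return _f1(_lcs_length(pred, ref), len(pred), len(ref))


reference = "the cat sat on the mat"
verbatim = "the cat sat on the mat"
paraphrase = "a cat was sitting on the mat"
print(rouge_1(verbatim, reference), rouge_l(verbatim, reference))
print(rouge_1(paraphrase, reference), rouge_l(paraphrase, reference))
```

A verbatim copy scores 1.0 on both metrics, while the paraphrase scores lower, which is why very high training-set ROUGE together with a high memorization ratio points to copying rather than generalization.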

Key Experimental Results

Experimental Setup

  • Teacher Model: GPT-2 1.5B, fine-tuned on 10,000 samples of the DollyEval dataset.
  • Student Models: GPT-2 760M / 340M / 120M.
  • Test Set: 500 samples; Memorization Evaluation: 3,000 samples.

Table 1: Memorization Ratios of Different Distillation Methods

| Model Size | SFT | Word-Level KD | Seq-KD | RKLD |
|---|---|---|---|---|
| 1.5B (Teacher) | 0.654 | - | - | - |
| 760M | 0.523 | 0.472 | 0.315 | 0.090 |
| 340M | 0.433 | 0.140 | 0.134 | 0.075 |
| 120M | 0.330 | 0.100 | 0.129 | 0.060 |

Key Findings: RKLD achieves the lowest memorization ratio at every student size (6.0% to 9.0%), roughly 5.5 to 5.8 times lower than SFT at the same scale (e.g., 0.523 vs. 0.090 at 760M).
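
The reduction factors implied by Table 1 can be checked with a short calculation. The values below are copied from the table; the dictionary layout and variable names are only an illustrative choice.

```python
# Memorization ratios copied from Table 1 (student rows only).
table_1 = {
    "760M": {"SFT": 0.523, "Word-Level KD": 0.472, "Seq-KD": 0.315, "RKLD": 0.090},
    "340M": {"SFT": 0.433, "Word-Level KD": 0.140, "Seq-KD": 0.134, "RKLD": 0.075},
    "120M": {"SFT": 0.330, "Word-Level KD": 0.100, "Seq-KD": 0.129, "RKLD": 0.060},
}

# How many times lower the RKLD memorization ratio is than SFT at the same size.
reduction = {size: row["SFT"] / row["RKLD"] for size, row in table_1.items()}

# Method with the lowest memorization ratio at each size.
lowest = {size: min(row, key=row.get) for size, row in table_1.items()}

for size in table_1:
    print(f"{size}: SFT / RKLD = {reduction[size]:.2f}x, lowest = {lowest[size]}")
# 760M: SFT / RKLD = 5.81x, lowest = RKLD
# 340M: SFT / RKLD = 5.77x, lowest = RKLD
# 120M: SFT / RKLD = 5.50x, lowest = RKLD
```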

Table 2: Comparison of ROUGE Scores (Training Set vs. Test Set)

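| Model Size | Method | R-1 Train | R-1 Test | R-L Train | R-L Test |
|---|---|---|---|---|---|
| 1.5B | SFT | 0.88 | 0.33 | 0.78 | 0.27 |
| 760M | SFT | 0.78 | 0.31 | 0.76 | 0.25 |
| 760M | RKLD | 0.45 | 0.36 | 0.40 | 0.30 |
| 340M | SFT | 0.72 | 0.30 | 0.76 | 0.25 |
| 340M | RKLD | 0.57 | 0.34 | 0.53 | 0.28 |
| 120M | SFT | 0.67 | 0.25 | 0.66 | 0.24 |
| 120M | RKLD | 0.46 | 0.30 | 0.42 | 0.21 |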

Key Findings: For SFT, training-set ROUGE is far higher than test-set ROUGE (e.g., R-1 of 0.78 vs. 0.31 at 760M), a sign of overfitting and memorization. RKLD shows a much smaller train-test gap and higher test ROUGE than SFT in every reported case except R-L for the 120M student (0.21 vs. 0.24).
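
The train-test gap behind this finding can be computed directly from Table 2. The numbers below are copied from the table; the tuple layout and the `gaps` name are illustrative assumptions.

```python
# (model size, method) -> (R-1 train, R-1 test, R-L train, R-L test), from Table 2.
table_2 = {
    ("1.5B", "SFT"): (0.88, 0.33, 0.78, 0.27),
    ("760M", "SFT"): (0.78, 0.31, 0.76, 0.25),
    ("760M", "RKLD"): (0.45, 0.36, 0.40, 0.30),
    ("340M", "SFT"): (0.72, 0.30, 0.76, 0.25),
    ("340M", "RKLD"): (0.57, 0.34, 0.53, 0.28),
    ("120M", "SFT"): (0.67, 0.25, 0.66, 0.24),
    ("120M", "RKLD"): (0.46, 0.30, 0.42, 0.21),
}

# Train-minus-test gap for each metric; a large gap suggests overfitting.
gaps = {
    key: (r1_train - r1_test, rl_train - rl_test)
    for key, (r1_train, r1_test, rl_train, rl_test) in table_2.items()
}

for (size, method), (r1_gap, rl_gap) in gaps.items():
    print(f"{size} {method}: R-1 gap = {r1_gap:.2f}, R-L gap = {rl_gap:.2f}")
```

At every student size, the RKLD gap is smaller than the SFT gap on both metrics, which is consistent with the lower memorization ratios in Table 1.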

Highlights & Insights

  1. First Systematic Study of the Relationship Between Distillation and Memorization: Reveals the "implicit privacy-preserving" effect of distillation—not a specially designed privacy mechanism, but naturally reducing memorization.
  2. Dual Benefits of RKLD: Reverse KL distillation not only compresses the model but also yields the lowest memorization (only 6.0% for the 120M model), while its test set ROUGE matches or exceeds SFT's in most settings (the exception is R-L at 120M, 0.21 vs. 0.24).
  3. Practical Insights: Provides a low-cost path for deploying LLMs in privacy-sensitive scenarios (medical, legal)—using distillation instead of direct fine-tuning can significantly lower the risk of data leakage.
  4. Clean Methodology: Based on the memorization quantification framework of Carlini et al., combined with a dual-perspective (train/test) ROUGE analysis, making the evaluation pipeline highly reusable.

Limitations & Future Work

  1. Single Dataset: Experiments were conducted solely on the DollyEval dataset; the generalizability of the findings needs to be validated across more domains and data types.
  2. Model Architecture Limitations: Only the GPT-2 family was evaluated, and it remains unverified whether modern architectures like LLaMA and Mistral exhibit the same patterns.
  3. Fixed Memorization Window: A fixed 50-token window was used to evaluate memorization; different window lengths might affect the conclusions.
  4. Implicit Memorization Unexplored: This work only focuses on verbatim reproduction and does not analyze whether the model "implicitly" memorizes the training data in paraphrased forms.
  5. Lack of Formal Privacy Metrics: Lowering memorization through distillation is not equivalent to providing provable privacy guarantees (such as Differential Privacy).

Related Work

  • Memorization Quantification: Carlini et al. (2021, 2023) propose extraction attacks to retrieve training data from LLMs and establish a quantification framework for verbatim memorization.
  • Privacy Risks of Fine-Tuning: Yang et al. (2024) investigate the memorization and privacy risks of domain-specific LLMs, confirming that fine-tuned models are more prone to memorizing sensitive content.
  • Knowledge Distillation: Pioneered by Hinton et al. (2015); Kim & Rush (2016) propose word-level and sequence-level distillation; Gu et al. (2024) introduce MiniLLM (reverse KL distillation).
  • Privacy Leakage Detection: Lukas et al. (2023) and Kim et al. (2023) analyze the leakage of personally identifiable information in LLMs.

Rating

| Dimension | Rating (1-5) | Description |
|---|---|---|
| Novelty | 3 | Distillation and memorization are mature areas individually; of interest when combined, but not a major breakthrough |
| Technical Depth | 2 | No new methodology is proposed; primarily an empirical comparison of existing methods |
| Experimental Thoroughness | 2 | Only one dataset + GPT-2 family; lacks validation across multiple architectures and datasets |
| Writing Quality | 3 | Clearly structured, but the content is somewhat thin |
| Value | 3 | Provides practical guidance for privacy-sensitive deployment, but lacks large-scale validation |
| Overall Rating | 2.5 | Meaningful empirical study but with a limited experimental scale; findings are intuitive but lack in-depth analysis |