Memorization Inheritance in Sequence-Level Knowledge Distillation for Neural Machine Translation
Conference: ACL 2025
arXiv: 2502.01491
Code: https://github.com/vernadankers/memseqkd
Area: Multilingual Translation
Keywords: Knowledge Distillation, Memorization, Hallucination, Machine Translation, SeqKD
TL;DR
This paper presents the first systematic study of how a teacher model's memorization behavior transfers to student models in sequence-level knowledge distillation (SeqKD). Although the student is never directly exposed to the original training targets, its extractive memorization rate is \(57\%\) higher than that of a baseline trained on the original data, and its hallucination rate also increases. The authors propose Adaptive-SeqKD, which mitigates these issues by fine-tuning the teacher on a high-quality subset before distillation.
Background & Motivation
Background: SeqKD is a standard practice in NMT deployment, where a large teacher model translates the source side of training data to generate synthetic targets for training a smaller student model. Commercial translation systems such as NLLB and ALMA utilize SeqKD.
Limitations of Prior Work:
- Prior research primarily focuses on why SeqKD succeeds (e.g., pattern simplification and regularization), while the propagation of failure modes remains largely unexplored.
- Memorization of noisy training data by NMT models leads to unreliable behaviors during deployment.
- While some studies in image classification suggest that KD suppresses memorization, corresponding research in the NLP/NMT domain is lacking.
Key Challenge: SeqKD propagates both the strengths (performance) and weaknesses (memorization, hallucinations) of the teacher, yet the community has predominantly focused on the former.
Goal: (1) Quantify the extent to which the student inherits memorization behaviors from the teacher, (2) analyze behavioral variations across different data subgroups, and (3) propose mitigation strategies.
Key Insight: The student model never directly sees the target side of the original parallel corpus; it only sees the teacher's translations. If the teacher has memorized an original target and reproduces it in its translation, how does the student behave?
Core Idea: SeqKD denoises the training targets, which improves student performance but also removes the regularizing effect of noisy data. As a result, the student shows higher memorization and hallucination rates than a baseline trained directly on the original data.
Method
Overall Architecture
Input: WMT20 parallel corpora (De-En/En-De 48M, Pl-En/En-Pl 12M, Fr-De 14M) \(\to\) Train Transformer-large teacher (300k steps) \(\to\) Teacher translates the source side to generate synthetic targets \(\to\) Train Transformer-base student (100k steps) \(\to\) Compare memorization and hallucination behaviors of the student versus the baseline (same architecture, trained directly on the original data).
Key Designs
- Quantifying Memorization Metrics:
    - Function: Measures the degree to which a model memorizes training data across multiple dimensions.
    - Replication rate: The proportion of greedy translations that exactly match the training targets.
    - Extractive Memorization (ExMem): The proportion of samples whose target can be fully reconstructed when the model is prompted with at most \(75\%\) of the source side, indicating that the model has memorized the source-to-target mapping without requiring the full source sentence.
    - OscHal (Oscillatory hallucination): Translations containing a bigram that is repeated \(\ge 10\) times and does not appear on the source side.
    - NatHal (Natural hallucination): A translation that the model produces for \(\ge 5\) different source inputs, indicating that the model falls back on certain default outputs.
    - Design Motivation: Characterizes memorization along both the "exact match" and the "behavioral abnormality" dimensions.
- Subgroup Analysis:
    - Function: Partitions the training data based on data quality, Counterfactual Memorization (CM) scores, and teacher confidence to analyze behavior within each subgroup.
    - Quality Partitioning: Categorized into five bins (\(<0.2\), \(0.2-0.4\), ..., \(\ge 0.8\)) using Comet-QE-22.
    - CM Partitioning: Approximated using leave-one-out cross-validation to estimate the counterfactual memorization score of each sample.
    - Design Motivation: Reveals the differentiated impact of SeqKD on distinct data subsets.
- Adaptive-SeqKD:
    - Function: Incorporates a teacher adaptation step into the SeqKD pipeline to reduce the transmission of memorization.
    - Mechanism: Selects a high-quality subset using Comet-QE-22 \(\to\) briefly fine-tunes the teacher on this subset \(\to\) generates distilled targets using the fine-tuned teacher.
    - Design Motivation: Fine-tuning on high-quality data encourages the teacher to "forget" noisy memorization, yielding cleaner distilled targets; this process requires no external data and is completely self-contained.
    - Comparison with random fine-tuning: Fine-tuning on a random subset degrades student quality.
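The four metric definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes whitespace tokenization, treats "prompted with at most 75% of the source" as feeding source prefixes, and takes a hypothetical `translate` callable standing in for greedy decoding with the model under study. The function names are invented for this sketch, and the paper's exact criteria may differ in detail.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Sequence, Set, Tuple


def replication_rate(hypotheses: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of translations that exactly match the training target."""
    if not references:
        return 0.0
    matches = sum(h.strip() == r.strip() for h, r in zip(hypotheses, references))
    return matches / len(references)


def is_extractively_memorized(
    source: str,
    target: str,
    translate: Callable[[str], str],  # hypothetical greedy-decoding wrapper
    max_fraction: float = 0.75,
) -> bool:
    """True if some source prefix of at most `max_fraction` of the tokens
    already yields the full training target (assumed prefix-based variant)."""
    tokens = source.split()
    longest_prefix = int(len(tokens) * max_fraction)
    for n in range(1, longest_prefix + 1):
        if translate(" ".join(tokens[:n])).strip() == target.strip():
            return True
    return False


def is_oscillatory_hallucination(source: str, hypothesis: str, min_repeats: int = 10) -> bool:
    """True if a bigram absent from the source occurs >= min_repeats times."""
    src = source.split()
    hyp = hypothesis.split()
    source_bigrams = set(zip(src, src[1:]))
    counts = Counter(zip(hyp, hyp[1:]))
    return any(c >= min_repeats and bg not in source_bigrams for bg, c in counts.items())


def natural_hallucinations(pairs: Iterable[Tuple[str, str]], min_sources: int = 5) -> Set[str]:
    """Translations emitted for >= min_sources distinct source sentences."""
    sources_per_hyp: Dict[str, Set[str]] = {}
    for src, hyp in pairs:
        sources_per_hyp.setdefault(hyp, set()).add(src)
    return {hyp for hyp, srcs in sources_per_hyp.items() if len(srcs) >= min_sources}
```

Rates are then obtained by averaging the boolean checks over the corpus. Passing `translate` as a callable keeps the sketch independent of any particular decoding toolkit.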
Loss & Training
- Teacher: Transformer-large, 300k steps, original WMT20 parallel corpus.
- Student/Baseline: Transformer-base, 100k steps.
- Student training data: The source side is unchanged; the target side is replaced by the teacher's translations, decoded with beam size \(1\) (i.e., greedy decoding).
- MarianNMT training framework.
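The data flow of SeqKD and of the Adaptive-SeqKD variant can be sketched as below. This is only an outline under stated assumptions: `teacher_translate` is a hypothetical callable wrapping greedy decoding, the quality scores are assumed to be precomputed (for example with Comet-QE-22), and `keep_fraction` is an illustrative parameter, since the size of the high-quality subset is not given here. Training and fine-tuning themselves happen in the NMT toolkit and are not shown.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_seqkd_corpus(
    sources: Sequence[str],
    teacher_translate: Callable[[str], str],  # hypothetical greedy decoder
) -> List[Pair]:
    """Keep the source side, replace each target with the teacher's output."""
    return [(src, teacher_translate(src)) for src in sources]


def select_high_quality(
    pairs: Sequence[Pair],
    scores: Sequence[float],
    keep_fraction: float = 0.1,  # illustrative value, not from the paper
) -> List[Pair]:
    """Return the top-scoring fraction of pairs, in original corpus order."""
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    k = max(1, int(len(pairs) * keep_fraction))
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    return [pairs[i] for i in sorted(ranked[:k])]


# Adaptive-SeqKD, in outline:
#   1. subset = select_high_quality(corpus, quality_scores)
#   2. briefly fine-tune the teacher on `subset` (done in the NMT toolkit)
#   3. distilled = build_seqkd_corpus(sources, finetuned_teacher_translate)
#   4. train the student on `distilled`
```

Standard SeqKD corresponds to skipping steps 1 and 2 and distilling from the original teacher.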
Key Experimental Results
Main Results
| Metric | Ordering of rates across models | Relative increase, student vs. baseline |
|---|---|---|
| Replication rate (Exact match) | Teacher > Student > Baseline | \(+3.4\% \pm 0.9\) |
| ExMem rate (Extractive memorization) | Teacher > Student > Baseline | \(+57.0\% \pm 15.4\) |
| OscHal (Oscillatory hallucination) | Student > Baseline > Teacher | \(+31.0\% \pm 25.7\) |
| NatHal (Natural hallucination) | Teacher > Student > Baseline | \(+13.8\% \pm 5.0\) |
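To make the last column concrete, the sketch below shows one way such a figure could be computed: a relative increase of the student's rate over the baseline's rate, summarized as mean and standard deviation. It assumes the "±" term reflects spread across the five language pairs, which is not confirmed here; the function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def relative_increase(student_rate: float, baseline_rate: float) -> float:
    """Percentage by which the student's rate exceeds the baseline's rate."""
    if baseline_rate == 0:
        raise ValueError("baseline rate must be non-zero")
    return 100.0 * (student_rate - baseline_rate) / baseline_rate


def summarize(increases: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over per-language-pair increases."""
    spread = stdev(increases) if len(increases) > 1 else 0.0
    return mean(increases), spread
```

For instance, a student rate of 0.15 against a baseline rate of 0.10 is a 50% relative increase, not a 5-point one; the distinction matters when base rates are small.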
Adaptive-SeqKD Results
| Metric | Teacher fine-tuned on high-quality subset | Teacher fine-tuned on random subset | Student of high-quality-tuned teacher | Student of random-tuned teacher |
|---|---|---|---|---|
| BLEU Change | \(+0.0 \pm 0.5\) | \(-1.2 \pm 0.8\) | \(-0.2 \pm 1.7\) | \(-1.2 \pm 1.6\) |
| Comet-QE-22 | \(+0.2 \pm 0.3\) | \(-0.2 \pm 0.1\) | \(+0.3 \pm 0.3\) | \(-0.1 \pm 0.2\) |
Key Findings
- The student's ExMem rate is \(57\%\) higher than the baseline's, even though the student observes only \(18.4\%\) of the original training targets, and only indirectly, through the teacher's outputs. This indicates that the teacher selectively "relays" memorized samples, and that the student memorizes more when trained on the resulting cleaner data.
- Secondary Memorization: The student not only inherits the teacher's memorization of the original corpus but also develops new memorization of unique translations generated by the teacher (secondary ExMem accounts for \(59\%\) of the total ExMem).
- Denoising as a double-edged sword: While denoising improves translation quality, it reduces the noise-based regularization effect inherent in the training data, ultimately leading to stronger memorization.
- Students outperform teachers on low-quality subgroups: On samples with quality \(<0.4\), the student's Comet-QE-22 scores exceed those of the teacher, indicating that the student further denoises the teacher's "denoised translations".
- Adaptive-SeqKD is effective yet conservative: Fine-tuning the teacher on high-quality data leads to improvements in reference-free translation quality for the student, alongside a reduced memorization rate, without sacrificing BLEU.
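The subgroup comparison behind the low-quality finding can be illustrated with a small helper that assigns each training sample to one of the five quality bins and averages a per-sample model score within each bin. This is a sketch under assumptions: bin boundaries are taken as lower-inclusive, the scores are assumed to be precomputed, and `quality_bin` and `mean_by_bin` are hypothetical names.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Sequence

BIN_EDGES = (0.2, 0.4, 0.6, 0.8)
BIN_LABELS = ("<0.2", "0.2-0.4", "0.4-0.6", "0.6-0.8", ">=0.8")


def quality_bin(score: float) -> str:
    """Map a corpus quality score to its bin label (lower edge inclusive)."""
    return BIN_LABELS[bisect_right(BIN_EDGES, score)]


def mean_by_bin(corpus_scores: Sequence[float], model_scores: Sequence[float]) -> Dict[str, float]:
    """Average a model's per-sample score within each corpus-quality bin."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for corpus_score, model_score in zip(corpus_scores, model_scores):
        grouped[quality_bin(corpus_score)].append(model_score)
    return {label: sum(vals) / len(vals) for label, vals in grouped.items()}
```

Calling `mean_by_bin` once with teacher scores and once with student scores, then comparing the two lowest bins, reproduces the shape of the comparison described above.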
Highlights & Insights
- The counterintuitive finding that the student memorizes more despite never seeing the original targets has practical implications: commercial translation systems that use SeqKD should actively monitor memorization behavior, since unmonitored memorization could leak private information embedded in the training data.
- Using ExMem to quantify NMT memorization is practical: checking whether the target can be reconstructed from at most \(75\%\) of the source side captures memorization that exact matching alone misses. The approach is transferable to the analysis of memorization in other sequence-to-sequence tasks.
- The causal chain of denoising \(\to\) reduced regularization \(\to\) enhanced memorization uncovers a hidden cost of KD. The insight may carry over to other settings that train on synthetic data: the "cleaner" the data, the more susceptible the model may become to overfitting.
Limitations & Future Work
- Focus is restricted solely to SeqKD in NMT: Whether these conclusions generalize to LLM distillation (e.g., distilling from GPT-4) requires further investigation.
- The high-quality filtering in Adaptive-SeqKD relies heavily on Comet-QE-22: This quality estimator may introduce its own biases.
- Only beam size \(= 1\) distillation was used: The impact of different beam sizes on the transmission of memorization warrants exploration.
- Lack of direct quantification of privacy risks: Although security risks associated with memorization are highlighted, membership inference attack experiments were not conducted.
Related Work & Insights
- vs. Lukasik et al. (2024): KD suppresses memorization in image classification, whereas this paper finds that SeqKD in NMT accentuates it. Differences in task and KD formulation are the likely source of the contrasting conclusions.
- vs. Jagielski et al. (2024): Jagielski et al. found that distilled students are vulnerable to membership inference attacks; this work reaches a compatible conclusion through the lens of translation quality and behavioral analysis.
- vs. Zhou et al. (2020): Both acknowledge the denoising capacity of SeqKD, but this paper is the first to demonstrate that a side effect of denoising is the enhancement of memorization.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Conducts the first systematic study of memorization propagation in SeqKD, with counterintuitive yet significant findings.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Highly comprehensive, covering 5 language pairs \(\times\) 3 models \(\times\) 12 subgroups \(\times\) multiple evaluation metrics.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, rich examples, and intuitive visualizations.
- Value: ⭐⭐⭐⭐ Offers direct guidelines for the security of KD and NMT system deployment.