SelecTKD: Selective Token-Weighted Knowledge Distillation for LLMs

Conference: CVPR 2026
Paper: CVF Open Access
Code: TBD
Area: Model Compression / LLM Knowledge Distillation
Keywords: Knowledge Distillation, token weighting, speculative decoding, implicit curriculum, small language models

TL;DR

SelecTKD shifts the focus of LLM distillation from "which divergence should measure the teacher-student gap" to "which tokens should receive supervision." Borrowing the "propose-verify" mechanism from speculative decoding, it assigns each token a weight in \(\{0, \beta, 1\}\) and applies the full loss only to tokens where the teacher is confident and consistent with the student. It is plug-and-play and reports state-of-the-art results across instruction following, mathematics, code, and VLM tasks.

Background & Motivation

Background: Distilling large LLMs into Small Language Models (SLMs) is a primary path for deployment. Most distillation pipelines apply token-wise losses (vanilla KL, reverse KL, skew KL, etc.) indiscriminately to every token, focusing on the choice of divergence or data distribution (on-policy vs. off-policy). DistiLLM-2 even advocates for meticulous pairings, such as using teacher data with SKL and student data with SRKL.

Limitations of Prior Work: This "indiscriminate supervision" forces the student to mimic high-entropy, uncertain teacher predictions. When there is a significant capacity gap, the teacher itself may be uncertain at many positions. Forcing the student to follow such signals injects noise into the optimization process, harming generalization and training stability.

Key Challenge: The authors observe that loss geometry might not be the dominant factor. Preliminary experiments (Figure 1) comparing KL/RKL/SKL/SRKL and various loss-data pairings under the same training budget show remarkably similar final performance. Theoretically, although forward, reverse, and skew KL follow different optimization paths, they share the same fixed point in the limit: each is minimized exactly when the student distribution matches the teacher's. In other words, the divergence debate pertains more to "training dynamics" than to the "final solution."
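The shared fixed point can be checked numerically on a toy distribution. The sketch below is illustrative only: the skew forms (mixing the two distributions with a coefficient `alpha`) follow the common DistiLLM-style definition and are an assumption here, as are the function names and the example probabilities.

```python
import math

def kl(p, q):
    """Forward KL D(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mix(a, b, alpha):
    """Convex mixture alpha * a + (1 - alpha) * b."""
    return [alpha * ai + (1 - alpha) * bi for ai, bi in zip(a, b)]

def divergences(p, q, alpha=0.1):
    """Four KL-family divergences between teacher p and student q."""
    return {
        "KL": kl(p, q),
        "RKL": kl(q, p),
        "SKL": kl(p, mix(p, q, alpha)),   # assumed skew KL form
        "SRKL": kl(q, mix(q, p, alpha)),  # assumed skew reverse KL form
    }

teacher = [0.7, 0.2, 0.1]
mismatched = divergences(teacher, [0.4, 0.4, 0.2])  # student differs from teacher
matched = divergences(teacher, list(teacher))       # student equals teacher

print(mismatched)
print(matched)
```

All four values are strictly positive for the mismatched student and vanish (up to floating-point error) when the student equals the teacher, which is the sense in which the divergences agree on the final solution while differing along the way.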

Goal: Since changing the divergence yields limited returns, the problem should be reframed: which tokens are worth learning? The goal is to implement selective control over supervision signals at the token level—applying full weight to valuable tokens while shielding or down-weighting others.

Key Insight: Borrow the "student proposes, teacher verifies" logic from speculative decoding. Existing works are related, but each carries a cost: SLIM reweights in logit space, SKD modifies the training dataset, and AdaSPEC filters tokens with an extra reference model.

Core Idea: Use a lightweight "propose-verify" mechanism to compute a verification weight \(V_t\) for each token, rewriting any divergence loss \(D(\cdot\|\cdot)\) as \(\sum_t V_t\, D(p_t\|q_t)\). This is objective-agnostic, plug-and-play, and induces an implicit curriculum quantified by the "token acceptance rate."

Method

Overall Architecture

SelecTKD does not alter the architecture or introduce additional reference models; it simply adds a token-level "gate" over the standard KD loss. Given teacher distribution \(p_t\) and student distribution \(q_t\), the per-token loss is rewritten as:

\[\mathcal{L}_{\text{SelecTKD}} = \sum_{t=1}^{|\boldsymbol{y}|} V_t\, D\big(p_t \,\big\|\, q_t\big)\]

where \(D\) can be any divergence (KL, RKL, SKL, or SRKL), and \(V_t \in \{0, \beta, 1\}\) is a token-level weight determined by a verifier \(V(\cdot)\) (using stop-gradient internally). The pipeline is as follows: the student proposes one (or more) tokens at each position, and the teacher verifies whether the proposal is reliable. Accepted tokens receive weight 1, while rejected tokens are masked (weight 0, the \(\beta=0\) case) or down-weighted to a small \(\beta\). The method is compatible with both on-policy and off-policy data. The verification mechanism has two variants: "Greedy Top-k" and "Non-greedy Spec-k."
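As a minimal sketch of the gated objective, assuming per-position distributions are available as plain Python lists and the weights \(V_t\) have already been computed, the loss reduces to a weighted sum of per-token divergences. The names `forward_kl` and `selectkd_loss` are hypothetical; in a real training loop the weights would be treated as constants (stop-gradient) by the autograd framework.

```python
import math

def forward_kl(p, q):
    """Forward KL D(p || q) for one token position."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def selectkd_loss(teacher_probs, student_probs, weights, divergence=forward_kl):
    """Sum over positions t of V_t * D(p_t || q_t)."""
    return sum(
        v * divergence(p, q)
        for p, q, v in zip(teacher_probs, student_probs, weights)
    )

# Two token positions over a toy 3-token vocabulary.
teacher_probs = [[0.7, 0.2, 0.1], [0.4, 0.3, 0.3]]
student_probs = [[0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]

full = selectkd_loss(teacher_probs, student_probs, weights=[1.0, 1.0])
gated = selectkd_loss(teacher_probs, student_probs, weights=[1.0, 0.0])
print(full, gated)
```

With weights `[1.0, 0.0]` the second position contributes nothing, so the gated loss equals the divergence at the first position alone. Any other per-token divergence can be passed through the `divergence` argument, which mirrors the objective-agnostic claim.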

Key Designs

1. Selective Token-Weighted Objective: Shifting from "How to Measure" to "Where to Supervise"

Addressing the "indiscriminate supervision" issue, SelecTKD multiplies each token loss by a verification weight \(V_t\) rather than modifying the divergence shape. This step is objective-agnostic—it works regardless of whether KL or SRKL is used. When the teacher has high entropy or teacher-student predictions diverge, the supervision is likely noisy, and \(V_t\) suppresses it to 0 or \(\beta\). Unlike SLIM or AdaSPEC, SelecTKD acts directly on the supervision signal with zero additional models or architectural changes.

2. Greedy Top-k Verification: Judging Consistency via Ranking Rather than Absolute Probability

A naive approach might use Hellinger distance \(H(p_t,q_t)\) to set \(V_t\) softly, but Hellinger is sensitive to long tails and is rank-agnostic. SelecTKD uses a discrete, rank-based Top-k verification: the student proposes its most likely token \(\hat{y}_t = \arg\max_y q_\theta(y\mid x, y_{<t})\) and checks if it falls within the teacher's Top-k candidate set:

\[V_t = \beta + (1-\beta)\,\mathbb{I}\!\left(\hat{y}_t \in \mathrm{Top}_k(p_t)\right)\]

When \(\beta=0\), this becomes a hard gate; a small \(\beta>0\) provides mild regularization for rejected tokens. This is robust to absolute probability values and ignores tail noise.
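The Top-k rule can be sketched in a few lines, again assuming list-based distributions for a single position. The helper names are hypothetical, and the defaults `k=5`, `beta=0.01` are the values reported for the experiments.

```python
def top_k_indices(probs, k):
    """Indices of the k largest entries of probs."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def greedy_topk_weight(teacher_probs, student_probs, k=5, beta=0.01):
    """Return 1.0 if the student's argmax token is in the teacher's Top-k, else beta."""
    proposal = max(range(len(student_probs)), key=lambda i: student_probs[i])
    accepted = proposal in top_k_indices(teacher_probs, k)
    return 1.0 if accepted else beta

teacher = [0.50, 0.30, 0.15, 0.05]
agreeing_student = [0.20, 0.60, 0.10, 0.10]   # argmax is token 1
diverging_student = [0.10, 0.10, 0.10, 0.70]  # argmax is token 3

w_accept = greedy_topk_weight(teacher, agreeing_student, k=2)
w_reject = greedy_topk_weight(teacher, diverging_student, k=2)
w_hard = greedy_topk_weight(teacher, diverging_student, k=2, beta=0.0)
print(w_accept, w_reject, w_hard)
```

Only the ranking of teacher probabilities matters: the agreeing student is accepted even though the teacher assigns its proposal just 0.30, while the diverging student is rejected regardless of how the teacher spreads mass over the tail.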

3. Non-greedy Spec-k Verification: Mitigating Exposure Bias via Speculative Sampling

To align with speculative sampling and mitigate exposure bias, the Spec-k variant lets the student sample \(k\) independent candidates \(\{y_t^{(j)}\}_{j=1}^{k}\) from its distribution \(q_t\). For each candidate, it computes the standard speculative acceptance probability \(a_t^{(j)} = \min\!\big(1,\, p_t(y_t^{(j)})/q_t(y_t^{(j)})\big)\). Candidate \(j\) is accepted when a uniform draw \(r^{(j)}\sim U[0,1]\) satisfies \(r^{(j)} < a_t^{(j)}\), and \(\mathcal{A}_t\) denotes the set of accepted candidates. If at least one proposal is accepted, the weight is 1:

\[V_t = \beta + (1-\beta)\,\mathbb{I}\!\left(|\mathcal{A}_t| \ge 1\right)\]

Intuitively, if at least one student proposal is accepted by the teacher, the teacher's local signal is informative. If all are rejected, the supervision is likely noisy or mismatched.
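A minimal sketch of Spec-k verification for one position follows, using only the standard library. The function name and the injectable `rng` argument are assumptions for illustration; the acceptance test is the usual speculative-sampling rule of comparing a uniform draw against \(\min(1, p/q)\).

```python
import random

def spec_k_weight(teacher_probs, student_probs, k=5, beta=0.01, rng=None):
    """Return 1.0 if any of k student samples passes the acceptance test, else beta."""
    rng = rng or random.Random()
    vocab = range(len(student_probs))
    # Sample k independent candidate tokens from the student distribution.
    candidates = rng.choices(vocab, weights=student_probs, k=k)
    accepted = [
        y for y in candidates
        if rng.random() < min(1.0, teacher_probs[y] / student_probs[y])
    ]
    return 1.0 if accepted else beta

rng = random.Random(0)
same = [0.5, 0.3, 0.2]
# Identical distributions: every acceptance probability is 1.
w_match = spec_k_weight(same, same, rng=rng)
# Student puts all mass on a token the teacher assigns zero probability.
w_clash = spec_k_weight([0.0, 1.0], [1.0, 0.0], rng=rng)
print(w_match, w_clash)
```

The two extreme cases are deterministic: matching distributions are always accepted, and a student that only proposes tokens the teacher rules out is always rejected. In between, the weight is stochastic, which is the non-greedy aspect of this variant.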

4. Token Acceptance Rate (TAR) and Implicit Curriculum: Quantifiable Training Dynamics

A key property of SelecTKD is its stable, convergent training dynamics, quantified by the Token Acceptance Rate (TAR):

\[\mathrm{TAR} = \mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})}\!\left[\frac{1}{|\boldsymbol{y}|}\sum_{t=1}^{|\boldsymbol{y}|}\mathbb{I}\big(V_t=1\big)\right]\]

TAR is the expected proportion of student proposals accepted by the teacher verifier. Theorem 1 (Monotonic TAR Improvement) states that, under standard Lipschitz continuity and small learning rate assumptions, training with either Top-k or Spec-k verification satisfies \(\mathrm{TAR}_{t+1}-\mathrm{TAR}_t \ge \eta\,\kappa\,(1-\mathrm{TAR}_t)\), where the subscript on TAR indexes training steps rather than token positions. With \(\eta\,\kappa > 0\), the right-hand side is non-negative, so TAR never decreases; the guaranteed gain is largest when TAR is low and vanishes as \(\mathrm{TAR}\to1\), formalizing an implicit "easy-to-hard" curriculum.
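The sketch below computes an empirical TAR from per-token weights and iterates the Theorem 1 bound with equality to show the saturation behaviour. The function names are hypothetical, and `eta_kappa` stands in for the product \(\eta\,\kappa\), assumed here to lie in \((0, 1]\); the numbers are illustrative, not results from the paper.

```python
def token_acceptance_rate(weight_sequences):
    """Mean over sequences of the fraction of tokens with V_t == 1."""
    per_seq = [
        sum(1 for v in weights if v == 1.0) / len(weights)
        for weights in weight_sequences
    ]
    return sum(per_seq) / len(per_seq)

def tar_lower_bound(tar0, eta_kappa, steps):
    """Iterate TAR_{s+1} = TAR_s + eta_kappa * (1 - TAR_s), the bound met with equality."""
    tar = tar0
    trajectory = [tar]
    for _ in range(steps):
        tar = tar + eta_kappa * (1.0 - tar)
        trajectory.append(tar)
    return trajectory

# Two sequences: 3 of 4 tokens accepted, then 0 of 2 accepted.
batch_tar = token_acceptance_rate([[1.0, 0.01, 1.0, 1.0], [0.01, 0.01]])
trajectory = tar_lower_bound(tar0=0.2, eta_kappa=0.1, steps=3)
print(batch_tar, trajectory)
```

Unrolling the recursion gives \(1-\mathrm{TAR}_s \le (1-\eta\kappa)^s\,(1-\mathrm{TAR}_0)\), so the rejected fraction shrinks geometrically in this idealized setting and each step's guaranteed gain is smaller than the last.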

Loss & Training

The training loss is the length-normalized form of the objective above, \(\mathcal{L}_{\text{SelecTKD}}=\tfrac{1}{|\boldsymbol{y}|}\sum_t V_t\,D(p_t\|q_t)\). Experiments use \(k=5\) and \(\beta=0.01\), and are built on HuggingFace TRL with LoRA on 8×A100 GPUs. Models are trained for 2 epochs on instruction-following and code data and for 1 epoch on math data, with early stopping on a held-out validation set.

Key Experimental Results

Main Results

SelecTKD (+Ours) consistently improves a range of strong baselines across instruction following, math, code, and VLM tasks. The table below shows win rates (WR, %) for Qwen2-7B-Inst → Qwen2-1.5B:

| Configuration | AlpacaEval | Evol-Instruct | UltraFeedback | AVG |
| --- | --- | --- | --- | --- |
| KD | 57.49 | 28.23 | 37.86 | 41.19 |
| KD + Ours | 59.72 (+2.23) | 30.35 (+2.12) | 39.93 (+2.07) | 43.33 (+2.14) |
| DistiLLM-2 | 69.88 | 47.13 | 59.05 | 58.69 |
| DistiLLM-2 + Ours | 70.60 (+0.72) | 48.27 (+1.14) | 59.53 (+0.48) | 59.47 (+0.78) |

Stable gains even on top of the strong DistiLLM-2 baseline indicate that selective token filtering is orthogonal to divergence/data choices.

Ablation Study

Comparison with other token-selective methods on Evol-Instruct (Qwen2-7B-Inst → 1.5B):

| Method | Extra reference model? | Evol-Instruct WR (%) |
| --- | --- | --- |
| AdaSPEC | Yes | 40.21 |
| ATKD | No | 45.13 |
| SelecTKD (Ours) | No | 48.27 |

Key Findings

  • Improvements are most pronounced in open-ended, long-text generation where teacher entropy is higher and supervision noise is more prevalent.
  • SelecTKD is plug-and-play: stacking it on seven baselines (KD, SeqKD, ImitKD, GKD, DistiLLM, SKD, DistiLLM-2) yields gains in almost all cases.
  • It is particularly effective for teacher-student pairs with large capacity gaps (e.g., Mistral-7B → Danube2-1.8B).

Highlights & Insights

  • Problem Reformulation: The core highlight is shifting the research gravity from "how to measure divergence" to "where to apply supervision" based on the observation that KL-family divergences share fixed points.
  • Quantifiable Curriculum: TAR turns the "implicit curriculum" into a measurable quantity with a provable monotonicity guarantee, providing formal grounding for training stability.
  • Zero-Cost Integration: Requiring no additional models or architectural changes makes it highly engineering-friendly.

Limitations & Future Work

  • Verification quality depends entirely on the teacher; if the teacher is unreliable (e.g., OOD tasks), the signals will be distorted.
  • The Monotonic TAR Theorem relies on strong assumptions like Lipschitz continuity; validity under large learning rates warrants further study.
  • While \(k\) and \(\beta\) are intuitive, sensitivity analysis across diverse tasks remains limited.

Comparison with Related Work

  • vs. DistiLLM-2: While DistiLLM-2 focuses on "loss-data pairing," SelecTKD argues that loss geometry is not primary and instead filters supervision at the token level.
  • vs. SLIM: SLIM reweights tokens in the logit space, potentially losing "dark knowledge," whereas SelecTKD gates the supervision signal while preserving full distributional information.
  • vs. AdaSPEC: AdaSPEC requires an extra reference model to filter difficult tokens; SelecTKD uses the student's own proposals for filtering, outperforming AdaSPEC with zero extra models.
  • vs. SKD: SKD modifies the training dataset using speculative decoding; SelecTKD leaves the dataset unchanged and modifies weights, making it more general.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Reformulating KD as supervision location selection and providing provable dynamics via TAR.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Broad coverage across tasks and baselines, though VLM and hyperparameter sensitivity are less extensive.
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation and formalization of the propose-and-verify mechanism.
  • Value: ⭐⭐⭐⭐⭐ Plug-and-play with zero extra cost, directly applicable to production LLM distillation.