Generative Diffusion Prior Distillation for Long-Context Knowledge Transfer¶

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=e5tepxQfE1
Code: To be confirmed
Area: Knowledge Distillation / Model Compression / Time Series Classification
Keywords: Knowledge Distillation, Diffusion Prior, Posterior Sampling, Early Time Series Classification, full-to-partial distillation

TL;DR¶

This work reformulates the "full-sequence teacher $\rightarrow$ partial-sequence student" distillation as an inverse problem. It treats the student's short-context features as "degraded observations" of the target long-context features. By using a diffusion model as a generative prior for the teacher's features to perform posterior sampling, the method provides each student feature with a set of "dynamic, diverse, and aggregatable" teacher signals. This enables a classifier seeing only sequence prefixes to achieve generalization capabilities approximating those of a full-sequence model.

Background & Motivation¶

Background: Many time-series classification tasks (e.g., ECG arrhythmia detection, industrial monitoring) are constrained at deployment by latency, cost, or sensor dropouts, so only a prefix of the sequence is available at inference time rather than the complete sequence assumed during training. Knowledge Distillation (KD) is a natural way to transfer the generalization ability of a full-sequence teacher to a partial-sequence student.
Limitations of Prior Work: Classical KD methods (e.g., Hinton's logit matching; feature matching such as FitNets, RKD, and Attention) are designed for gaps in parameter capacity, where the teacher and student observe the same data. In the full-to-partial setting the input spaces differ (full sequence vs. prefix), which creates a representation gap that persists even when the two models have identical capacity.
Key Challenge: The teacher's full-context features are an overwhelming hard target for a student that has encoded only partial context. The paper identifies three KD pathologies that the full-to-partial setting amplifies: (1) direct alignment with full-context features drowns the student and blocks knowledge transfer; (2) a single teacher perspective lacks diversity, so the student overfits to one specific interpretation of the missing information; (3) the mismatch in training data makes it hard for the student to reproduce the teacher's prediction distribution faithfully (low fidelity).
Goal: To enable students that are trained and deployed on prefix inputs to approach the generalization and fidelity of full-sequence models, using a single teacher and without training multiple models.
Core Idea: Instead of treating the teacher signal as a fixed point target $k^\star=z_t$, "knowledge" is modeled as a distribution conditioned on the student's current state $k\sim p(K\mid Z_s=z_s)$, with these signals obtained via posterior sampling using a diffusion prior on the teacher's feature manifold.

Method¶

Overall Architecture¶

GDPD treats the student feature $z_{\text{short}}=S^{\text{feat}}_\theta(x_e)$ as a degraded measurement of an ideal full-context feature $z^*_{\text{long-ideal}}$. A diffusion model is first trained on teacher features $z_{\text{long}}$ to serve as a generative prior $p_\phi(z_{\text{long}})$. "Hint features" $\hat z_{\text{long}}$ are then obtained by posterior sampling conditioned on $z_{\text{short}}$. The training constraint is that this posterior reconstruction must be correctly classified by the teacher's classification head; satisfying it pushes the student feature toward $z^*_{\text{long-ideal}}$. Training has two stages: a warm-up phase that fits the diffusion prior, followed by a phase in which the student extracts long-context knowledge under the guidance of that prior.

flowchart LR
    X[Full sequence x] --> T[Teacher T]
    T --> ZL[Long-context feature z_long]
    ZL --> DP[Diffusion prior p_phi]
    Xe[Prefix x_e] --> S[Student S_theta]
    S --> ZS[Short-context feature z_short]
    ZS -->|Noise fusion initialization| DP
    DP -->|Posterior sampling| ZH[Hint feature z_hat_long]
    ZH --> H[Teacher/Student Head]
    H --> L[L_GDPD: CE Correct Class]
    L -.Backprop.-> S
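
The warm-up stage fits a diffusion prior on teacher features. The summary does not say which diffusion parameterization the paper uses, so the sketch below assumes a standard DDPM-style noise-prediction setup with a linear beta schedule. All function names and constants are hypothetical, and the noise-prediction network itself is left out.

```python
import math
import random

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_teacher_feature(z_long, alpha_bar_t, rng):
    """Forward-noise a teacher feature: z_t = sqrt(ab) * z_0 + sqrt(1 - ab) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z_long]
    z_t = [math.sqrt(alpha_bar_t) * z + math.sqrt(1.0 - alpha_bar_t) * e
           for z, e in zip(z_long, eps)]
    return z_t, eps  # eps is the regression target for a noise-prediction network

def denoising_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

rng = random.Random(0)
alpha_bars = linear_alpha_bar(num_steps=100)
z_long = [0.5, -1.2, 0.3, 2.0]  # stand-in for one teacher feature vector
z_t, eps = noise_teacher_feature(z_long, alpha_bars[50], rng)
```

In a real warm-up loop, a network would take `z_t` and the step index and be trained to minimize `denoising_loss` against `eps`; only teacher features are needed at this stage.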

Key Designs¶

1. Distillation as an Inverse Diffusion Problem: Student features as "degraded observations". The paper adopts an inverse diffusion perspective: given a degraded measurement $y=D(z_0)$, the goal is to recover $z_0$ from the posterior $p(z_0\mid y)\propto p(y\mid z_0)\,p_\phi(z_0)$. Here, $z_{\text{short}}$ acts as the degraded measurement, $z^*_{\text{long-ideal}}$ is the clean signal to be recovered, and the diffusion model provides the learned prior $p_\phi(z_{\text{long}})$. This step essentially changes "feature alignment" from rigid $\ell_2$ distance to "finding the complete version on the teacher manifold that best explains the current student feature," naturally avoiding the drowning problem.
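
Written in log form, the posterior factorization makes the two competing terms explicit. The Gaussian measurement model below, with noise scale $\sigma$, is an illustrative assumption for this note and is not taken from the paper:

$$\log p(z_0\mid y)=\log p(y\mid z_0)+\log p_\phi(z_0)-\log p(y)$$

$$y=D(z_0)+\sigma n,\ n\sim\mathcal N(0,I)\ \Longrightarrow\ \log p(y\mid z_0)=-\frac{1}{2\sigma^2}\,\lVert y-D(z_0)\rVert_2^2+\text{const}$$

Under that assumption, the likelihood term keeps the recovered $z_0$ consistent with the observed student feature $y=z_{\text{short}}$, while the prior term $\log p_\phi(z_0)$ keeps it on the teacher feature manifold.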

2. Conditional Posterior Sampling via Unconditional Priors: Noise Fusion Initialization. Because the diffusion prior is trained unconditionally on teacher features, the challenge is how to sample conditioned on $z_{\text{short}}$. Standard guided sampling (e.g., DPS) applies likelihood gradients at each reverse step and therefore requires a fixed measurement, whereas GDPD's conditioning signal $z_{\text{short}}$ is itself being optimized. The authors instead use a direct initialization strategy: the student feature is fused with Gaussian noise to form the starting point of the reverse process: $$z_{\text{long},T}=\alpha\, z_{\text{short}}+(1-\alpha)\,\epsilon,\quad \epsilon\sim\mathcal N(0,I)$$ where $\alpha$ is learned per-feature during distillation, allowing different features to enter the starting step at appropriate noise levels. Sampling from this point is anchored by $z_{\text{short}}$ while still exploring the teacher manifold, and it converges to a plausible clean feature $\hat z_{\text{long}}$.
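
The fusion formula can be sketched in a few lines. This is a minimal illustration and not the authors' code: `noise_fusion_init` is a hypothetical name, and $\alpha$ is passed in as a fixed per-feature list, whereas the paper learns it during distillation.

```python
import random

def noise_fusion_init(z_short, alpha, rng):
    """Return z_T = alpha * z_short + (1 - alpha) * eps, element-wise.

    alpha holds one weight per feature dimension (fixed here; learned in the paper).
    """
    eps = [rng.gauss(0.0, 1.0) for _ in z_short]
    return [a * z + (1.0 - a) * e for z, a, e in zip(z_short, alpha, eps)]

rng = random.Random(0)
z_short = [0.8, -0.4, 1.5]   # stand-in for a student feature
alpha = [0.9, 0.5, 0.1]      # assumed values: high alpha keeps more of the student signal
z_start = noise_fusion_init(z_short, alpha, rng)
```

A dimension with $\alpha$ near 1 starts the reverse process almost at the student feature, while a dimension with $\alpha$ near 0 starts almost at pure noise and leaves more to the prior.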

3. Defining Hint Features via Classification Success, Not Direct Alignment. The hint feature $z_{\text{long-hint}}$ is defined functionally as a teacher feature containing the long-range information necessary for correct classification. Supervision does not force $\hat z_{\text{long}}$ to approach a specific $z_t$, but constrains its posterior reconstruction to output the correct label through the classifier: $$\mathcal L_{\text{GDPD}}(\theta)=\mathbb E_{(x,y)}\Big[\ell_{\text{CE}}\big(S^{\text{head}}_\theta(\hat z_{\text{long}}),\,y\big)\Big],\quad \hat z_{\text{long}}\sim\tilde p_{\text{diff}}\big(z_{\text{long}}\mid z_{\text{short}}=S^{\text{feat}}_\theta(x_e)\big)$$ This implies that if the student feature retains sufficient long-context information (minimal degradation relative to the hint), it should be able to recover a valid $z_{\text{long-hint}}$ as a credible completion.
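
The loss above has a simple shape: sample a hint feature, classify it, and apply cross-entropy against the true label. The sketch below shows that shape only. `posterior_sample` and `head` are hypothetical callables standing in for the diffusion sampler and the classification head, and the identity sampler in the example is a placeholder, not a diffusion model.

```python
import math

def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def linear_head(weights, feature):
    """Toy linear classifier: one weight row per class."""
    return [sum(w * f for w, f in zip(row, feature)) for row in weights]

def gdpd_loss(z_short, label, posterior_sample, head):
    """Cross-entropy of the head's prediction on a sampled hint feature."""
    z_hat_long = posterior_sample(z_short)  # stands in for diffusion posterior sampling
    return cross_entropy(head(z_hat_long), label)

weights = [[2.0, 0.0], [0.0, 2.0]]               # assumed 2-class head
head = lambda feature: linear_head(weights, feature)
identity_sampler = lambda z: list(z)             # placeholder sampler for illustration
loss_correct = gdpd_loss([3.0, 0.0], 0, identity_sampler, head)
loss_wrong = gdpd_loss([3.0, 0.0], 1, identity_sampler, head)
```

Note that no teacher feature $z_t$ appears in the loss: the only supervision is whether the reconstruction is classified as the correct label.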

4. Knowledge as Distribution: Stochastic Trajectories for "Diversity + Progression + Aggregation". Traditional KD uses a single point $k^\star=z_t$ (equivalent to $P_{K\mid Z_s}=\delta_{k^\star}$). GDPD uses expected loss: $$\mathbb E_{k\sim p(\cdot\mid z_s)}[\ell(z_s;\theta,k)]\approx\frac1J\sum_{j=1}^J \ell\big(z_s;\theta,k^{(j)}\big)$$ Because each forward pass follows a different noise trajectory, the student sees many distinct samples over the course of training; in practice, $J=1$ is sufficient. The three properties in the title map onto the three pathologies. Progression: signals are drawn dynamically, conditioned on the student's current feature $z_s$ under its current parameters $\theta_t$, which prevents drowning. Diversity: diffusion sampling keeps the variation on the teacher manifold, avoiding the outliers that hand-injected noise can produce. Aggregation: signals from multiple trajectories collectively add up to more complete long-range knowledge, which improves fidelity.
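
The Monte Carlo approximation, and the intuition for why $J=1$ can be enough, can be checked on a toy problem. The setup below is assumed purely for illustration: knowledge $k\sim\mathcal N(z_s,1)$ and squared-error loss, for which the exact expectation is 1. It compares averaging losses only and does not model SGD dynamics.

```python
import random

def mc_expected_loss(z_s, loss_fn, sample_knowledge, num_samples, rng):
    """Estimate E_k[loss(z_s, k)] with num_samples draws (J in the text)."""
    total = 0.0
    for _ in range(num_samples):
        total += loss_fn(z_s, sample_knowledge(z_s, rng))
    return total / num_samples

def toy_sampler(z_s, rng):
    """Assumed toy knowledge distribution: k ~ N(z_s, 1)."""
    return rng.gauss(z_s, 1.0)

def toy_loss(z_s, k):
    return (z_s - k) ** 2

rng = random.Random(0)
steps = 20000
# J = 1 per step, averaged over many training steps
avg_j1_over_steps = sum(
    mc_expected_loss(0.3, toy_loss, toy_sampler, 1, rng) for _ in range(steps)
) / steps
# One estimate with a large J
single_large_j = mc_expected_loss(0.3, toy_loss, toy_sampler, steps, rng)
```

Both estimates land close to the analytic value of 1: a single sample per step is noisy, but the noise averages out across steps in the same way it does across samples within a step.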

Key Experimental Results¶

Datasets: UCR Univariate, UEA Multivariate, and PhysioNet mortality data. Net1→Net2 denotes teacher→student distillation. Results are averaged over 5 runs.

Main Results (UCR datasets across different earliness, LSTM3-100→LSTM3-100)¶

Earliness	Metric	Base	Base-KD	Fits	GDPD
0.2L	Avg.AUC-PRC	63.64	69.23	67.47	73.83
0.2L	Avg.Rank↓ / Top-1 Count	3.50 / 0	2.42 / 2	2.92 / 0	1.17 / 10
0.4L	Avg.AUC-PRC	70.44	78.03	75.36	81.70
0.6L	Avg.AUC-PRC	76.79	83.70	81.15	86.00
0.8L	Avg.AUC-PRC	77.79	84.78	82.74	89.02
0.8L	Avg.Rank↓ / Top-1 Count	3.67 / 0	2.42 / 0	2.83 / 1	1.08 / 11

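GDPD achieves the best average AUC-PRC at every earliness level in the table. At the two levels where ranks are reported (0.2L and 0.8L) it also has the lowest average rank, placing first on 10 and 11 of the 12 datasets respectively, i.e., on more than 80% of them.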
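
The size of the margin is easier to read as differences. The snippet below only re-computes numbers from the table above; the dictionary layout and the helper name `gain_over` are choices made for this note.

```python
# Average AUC-PRC per earliness level, copied from the table above.
auc_prc = {
    "0.2L": {"Base": 63.64, "Base-KD": 69.23, "Fits": 67.47, "GDPD": 73.83},
    "0.4L": {"Base": 70.44, "Base-KD": 78.03, "Fits": 75.36, "GDPD": 81.70},
    "0.6L": {"Base": 76.79, "Base-KD": 83.70, "Fits": 81.15, "GDPD": 86.00},
    "0.8L": {"Base": 77.79, "Base-KD": 84.78, "Fits": 82.74, "GDPD": 89.02},
}

def gain_over(baseline):
    """GDPD's absolute AUC-PRC gain over a baseline, per earliness level."""
    return {level: round(row["GDPD"] - row[baseline], 2)
            for level, row in auc_prc.items()}

gain_vs_base = gain_over("Base")
gain_vs_base_kd = gain_over("Base-KD")
```

On these numbers, GDPD's margin over Base-KD ranges from 2.30 points (0.6L) to 4.60 points (0.2L), and its margin over the non-distilled Base stays above 9 points at every level.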

Comparison with KD Variants (e=0.5L, 12 UCR)¶

Against RKD, Attention, DKD, DT2W, VID, PKT, TeKAP, TTM, Base-KD, and Fits, GDPD ranks in the top 3 on 80% of datasets with an average rank of 2.25, while no other method achieves an average rank better than 4.

Key Findings¶

Generalization: All distilled students outperform the non-distilled Base, indicating that full-context knowledge helps partial-sequence classification; GDPD provides the largest gain.
Fidelity: Measured by teacher-student top-1 agreement, GDPD consistently outperforms Base-KD / Fits across all earliness levels, suggesting that "collective signals" reproduce the teacher's predictive structure (class separability, feature geometry) better than point signals.
Ablation of J: Because a different noise trajectory is taken on each forward pass, diversity is covered over the course of training, which makes $J=1$ sufficient.
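
The fidelity metric mentioned above, teacher-student top-1 agreement, is the fraction of examples on which both models predict the same class. A minimal sketch follows; it assumes the teacher's logits come from the full sequence and the student's from the prefix of the same example, and the function names are hypothetical.

```python
def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def top1_agreement(teacher_logits, student_logits):
    """Fraction of examples where teacher and student pick the same class."""
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must score the same examples")
    matches = sum(
        argmax(t) == argmax(s) for t, s in zip(teacher_logits, student_logits)
    )
    return matches / len(teacher_logits)

teacher = [[2.0, 0.1], [0.3, 1.2], [1.5, 1.4]]   # made-up logits for three examples
student = [[1.1, 0.9], [0.2, 0.8], [0.4, 0.6]]
agreement = top1_agreement(teacher, student)
```

Agreement is computed against the teacher's predictions, not the ground-truth labels, so a student can score high on fidelity even where the teacher itself is wrong.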

Highlights & Insights¶

Paradigm Shift: This is the first work to model teacher knowledge as a "generative distribution" rather than a point target, formalizing the teacher-student feature relationship as an ill-posed inverse problem. This perspective is clean, transferable, and applicable beyond time series to any distillation where teacher and student see different input spaces.
Clever Conditional Sampling: Using learnable noise fusion weights $\alpha$ for initialization bypasses the difficulty of using fixed likelihood gradients when the conditional signal itself is being optimized.
Functional Definition of Hints: By only requiring reconstructions to be correctly classified rather than approximating a specific feature, the method avoids rigid alignment, directly addressing the "drowning" of students by full-context features.
The three KD pathologies (effectiveness, diversity, fidelity) are solved simultaneously by a single mechanism (distributional signals).

Limitations & Future Work¶

Evaluation primarily uses UCR/UEA + PhysioNet time series with small-to-medium models (such as LSTMs), mostly in homogeneous teacher-student settings (LSTM3-100→LSTM3-100). Scalability to cross-architecture or large-scale models remains to be verified.
Introducing a diffusion prior, posterior sampling, and two-stage training raises the training cost and adds hyperparameters (warm-up, $\alpha$, $\lambda$) relative to direct feature KD, though inference remains unchanged.
The method is designed specifically for prefix/partial observations; whether it generalizes to other input distribution mismatches (domain shift, missing modalities) is an open question.
The conclusion that $J=1$ is sufficient relies on training steps naturally covering diversity; whether this holds for very few training steps or small batches warrants investigation.

Related Work & Insights¶

Classic KD: Includes logit matching (Hinton 2015), intermediate feature matching (FitNets, Attention, RKD), and methods for capacity gaps like teacher-assistant (Mirzadeh 2020) and student-friendly teachers. GDPD notes these assume identical input spaces.
Diverse Teachers: Methods like teacher ensembles, Deep Mutual Learning (DML), and single-model multi-perspective generation (TeKAP, 2025). GDPD instead uses diffusion sampling to constrain "diversity" within the teacher manifold.
Distillation Fidelity: Stanton 2021 noted that mismatch between the distillation set and the teacher's training set reduces fidelity, which corresponds to the full-to-partial scenario.
Inverse Diffusion: Borrowing inverse problem solving from DDRM (2022) and DPS (2023) into "feature space distillation" provides the foundational methodology.
Insight: In any supervision scenario where the target signal is ambiguous or admits several valid interpretations (weak labels, missing observations, early decision-making), replacing point targets with a generative distribution and posterior sampling may serve as a general way to make training more robust.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ First to model teacher knowledge as a generative distribution and the teacher-student relationship as inverse diffusion; the formulation is coherent and opens up the full-to-partial distillation direction.
Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple earliness levels, multiple datasets, and 10+ KD variants, and includes fidelity and J ablations, but lacks cross-architecture and large-model validation.
Writing Quality: ⭐⭐⭐⭐ Clear logical loop from pathologies to mechanism to verification; formulas and diagrams are well-placed.
Value: ⭐⭐⭐⭐ Addresses the real-world need for early/partial time series classification; the paradigm is transferable to broader input-mismatch distillation tasks.