Spectral-Guided Physical Dynamics Distillation

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=P6F4MxtOKp
Area: Physical Dynamics Modeling / Knowledge Distillation
Keywords: Physical Dynamics Prediction, Spectral Graph Analysis, Knowledge Distillation, Privileged Information, Spatiotemporal Representation

TL;DR

SGDD tackles long-term 3D particle trajectory prediction from the initial state alone. A teacher encoder observes future trajectories as "privileged information" and adaptively reweights key frequency components in a unified spatiotemporal spectral domain; this dynamics-rich representation is then distilled into a student encoder that sees only the initial state, giving more accurate and stable long-term predictions across multi-scale systems including molecules, proteins, and human motion.

Background & Motivation

Background: Physical dynamics prediction—predicting future 3D trajectories given the initial state of particles (atoms/molecules, protein backbones, human joints)—is a fundamental task in science and engineering. Recent mainstream approaches design equivariant neural networks (EGNN, ClofNet, SE(3)-Transformer, etc.) to capture the symmetries of physical systems; further works (ESTAG, EGNO, GF-NODE) have introduced frequency awareness using Fourier or Graph Fourier transforms to model periodic structures in time or space.

Limitations of Prior Work: Errors accumulate and amplify significantly during long-term prediction from initial states. The root cause is the deep entanglement of global low-frequency trends and local high-frequency oscillations over long durations. Existing frequency-aware methods have two specific flaws: first, they model time and space separately, deriving spectral representations from a single dimension (either time or space), which fails to characterize physical processes emerging from spatiotemporal interdependencies; second, they treat all frequency components equally, failing to recognize that the importance of low/high frequencies varies greatly across different systems.

Key Challenge: Long-term prediction needs to prioritize low-frequency modes to maintain stability and long-term consistency, while complementarily filling in high-frequency details to improve short-term accuracy—but since these are intertwined, single-dimension spectral modeling can neither separate them nor determine the appropriate weights for different frequency bands.

Goal: (1) Jointly derive spectral representations in a unified spatiotemporal domain rather than modeling time and space independently; (2) adaptively emphasize task-related frequency components; (3) resolve the fundamental constraint of inaccessible future trajectories during inference.

Key Insight: The future trajectory itself is a form of Privileged Information (PI)—visible during training but invisible during testing. By allowing a teacher encoder to process the ground-truth future trajectory and extract dynamics-rich frequency-aware representations, and then "teaching" this to a student encoder that only sees the initial state via knowledge distillation, a direct and efficient supervisory signal can be provided to the student.

Core Idea: Replace single-dimensional frequency modeling with "Teacher sees future + Student sees initial + Spatiotemporal spectral distillation," enabling the student to generate effective dynamics representations at inference time without needing privileged information.

Method

Overall Architecture

SGDD (Spectral-Guided Dynamics Distillation) aims to predict the full future trajectory \(\{x_1,\dots,x_T\}\) given only the initial state \(G_0\). The general paradigm is Encoder → Dynamics Representation \(z\) → Decoder, where the decoder is an off-the-shelf physical dynamics model (EGNO or GF-NODE in this paper) and the representation \(z\) is generated by the encoder from \(G_0\), encapsulating the anticipated future evolution. All innovations in SGDD lie in "how to construct a superior \(z\)."

Specifically, the framework trains two encoders in parallel: the Dynamics Encoder \(E_{dyn}\) takes the privileged future sequence \(G_{1:T}\) and \(G_0\) to produce \(z_{dyn}\); the Initial Encoder \(E_{init}\) takes only \(G_0\) to produce \(z_{init}\). Both representations are passed into the Spectral-Guided Enhancement (SGE) module, which projects them into the spectral domain using a set of joint spatiotemporal bases, adaptively reweights the most relevant frequency components, and yields \(z^{sg}_{dyn}\) and \(z^{sg}_{init}\). During training, knowledge distillation forces \(z^{sg}_{init}\) to mimic \(z^{sg}_{dyn}\) (aligning in both spatiotemporal and spectral domains); at inference time, the teacher is removed, and the decoder relies solely on \(z^{sg}_{init}\) to predict trajectories. The entire framework is trained end-to-end with a staged training strategy to ensure stable convergence.

graph TD
    A["Initial State G0"] --> C["Initial Encoder E_init<br/>(Sees G0 only)"]
    B["Future Trajectory G_1:T<br/>(Privileged Info · Training Only)"] --> D["Dynamics Encoder E_dyn<br/>(Sees Future + Initial)"]
    C --> E["Spectral-Guided Enhancement (SGE)<br/>Adaptive Spatiotemporal Spectral Reweighting"]
    D --> E
    E --> F["Student Representation z_init^sg"]
    E --> G["Teacher Representation z_dyn^sg"]
    G -->|"Dual-layer Alignment<br/>(Spatiotemporal + Spectral)"| F
    F --> H["Physics Dynamics Decoder<br/>(Inference uses z_init^sg only)"]
    H --> I["Future Trajectory Prediction x_1:T"]

Key Designs

1. Dual Encoders + Privileged Information Distillation: Allowing the student to "peek" at the future

A standard "encoder-representation-decoder" pipeline can easily learn information-rich latent representations when the encoder is given the full future trajectory, but in the target scenario that trajectory is unavailable at inference time. SGDD bridges this gap with separate teacher and student branches. The dynamics encoder \(E_{dyn}\) observes the real future sequence \(\{G_1,\dots,G_T\}\) during training to produce \(z_{dyn}\in\mathbb{R}^{N\times T\times d_z}\), capturing low-frequency trends and high-frequency changes. To ensure the student representation \(z_{init}\) resides in the same spatiotemporal space, \(E_{init}\) projects the \(G_0\) node features through a fully connected layer from \(\mathbb{R}^{N\times d}\) to \(\mathbb{R}^{N\times T\times d}\) to construct a "synthetic spatiotemporal input," yielding \(z_{init}\in\mathbb{R}^{N\times T\times d_z}\). Matching shapes keep the downstream alignment well defined. Unlike model compression, distillation here is an information transfer across observability: the capability of a teacher that "peeks into the future" is compressed into a student that "only sees the present."
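The lifting step that turns the initial state into a "synthetic spatiotemporal input" can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `lift_initial_state`, the per-node weight layout `(d, T*d)`, and the toy sizes are all assumptions; the source only states that a fully connected layer maps \(\mathbb{R}^{N\times d}\) to \(\mathbb{R}^{N\times T\times d}\).

```python
import numpy as np

def lift_initial_state(h0, W, b):
    """Map per-node features (N, d) to a synthetic sequence (N, T, d).

    ASSUMPTION: one fully connected layer applied to each node
    independently, with W of shape (d, T*d) and b of shape (T*d,).
    The output is reshaped so every node receives T feature vectors.
    """
    N, d = h0.shape
    out = h0 @ W + b              # (N, T*d)
    T = out.shape[1] // d
    return out.reshape(N, T, d)

# Toy sizes (hypothetical): 5 particles, 8 future steps, 4 features.
rng = np.random.default_rng(0)
N, T, d = 5, 8, 4
h0 = rng.normal(size=(N, d))          # node features of G_0
W = rng.normal(size=(d, T * d)) * 0.1
b = np.zeros(T * d)

x_syn = lift_initial_state(h0, W, b)
print(x_syn.shape)
```

The result has the same `(N, T, ·)` layout as the teacher's input sequence, which is what allows both branches to be aligned element by element later on.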

2. Joint Spatiotemporal Basis: Decoupling space and time frequencies in a unified domain

To address the limitation of separate spatiotemporal modeling, this paper constructs a joint spatiotemporal basis. Let the eigenvector matrices of the normalized Laplacians for the spatial and temporal graphs be \(U_s\in\mathbb{R}^{N\times N}\) and \(U_t\in\mathbb{R}^{T\times T}\) (where \(L_s=U_s\Lambda_s U_s^\top\) and \(L_t=U_t\Lambda_t U_t^\top\), with eigenvalues in ascending order). The joint basis is given by the Kronecker product:

\[B = U_t \otimes U_s,\quad B\in\mathbb{R}^{NT\times NT}.\]

This projects spatiotemporal representations along orthogonal dimensions simultaneously, decoupling frequency components within a unified basis. To reduce computational cost, only the columns corresponding to the \(K\) smallest eigenvalues are kept, resulting in a truncated basis \(B_K=[b_1,\dots,b_K]\in\mathbb{R}^{NT\times K}\). This truncation naturally suppresses high-variance, high-frequency content associated with large eigenvalues, baking the "low-frequency priority" physical prior directly into the basis selection; \(K\) serves as a key hyperparameter controlling the range of frequencies to be adaptively weighted.
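A small NumPy sketch of the truncated joint basis is given below. Several details are assumptions made only for illustration: the spatial graph is a toy ring, the temporal graph is a chain over time steps, and the "joint frequency" used to rank columns of \(B\) is the sum \(\lambda_t+\lambda_s\) (the eigenvalue under the Kronecker sum of the two Laplacians). The source does not specify how the columns are ranked, so this is one plausible choice rather than the paper's definition.

```python
import numpy as np

def path_adjacency(n):
    """Adjacency of a chain 0-1-...-(n-1); a toy temporal graph."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return A

def ring_adjacency(n):
    """Adjacency of a cycle; a toy spatial graph (e.g. a 6-atom ring)."""
    A = path_adjacency(n)
    A[0, n - 1] = A[n - 1, 0] = 1.0
    return A

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2}; assumes every node has an edge."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def truncated_joint_basis(A_s, A_t, K):
    """Return B_K (NT x K) and the joint frequencies of the kept columns."""
    lam_s, U_s = np.linalg.eigh(normalized_laplacian(A_s))  # ascending
    lam_t, U_t = np.linalg.eigh(normalized_laplacian(A_t))  # ascending
    B = np.kron(U_t, U_s)   # column i_t*N + i_s = kron(U_t[:, i_t], U_s[:, i_s])
    # ASSUMPTION: rank columns by lam_t + lam_s (Kronecker-sum eigenvalue).
    joint_freq = (lam_t[:, None] + lam_s[None, :]).ravel()
    keep = np.argsort(joint_freq, kind="stable")[:K]
    return B[:, keep], joint_freq[keep]

N, T, K = 6, 5, 8
B_K, freqs = truncated_joint_basis(ring_adjacency(N), path_adjacency(T), K)
print(B_K.shape)
```

Rows of `B_K` are indexed time-major (`t * N + n`), so a representation stored as `(N, T, d_z)` would need to be transposed to `(T, N, d_z)` before flattening to match this layout.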

3. Spectral-Guided Enhancement (SGE): Adaptive weighting of key frequency bands

Determining which frequencies matter most requires adaptation, which SGE provides. For a representation \(z\) (reshaped to \(\mathbb{R}^{(NT)\times d_z}\), one column per feature channel), the columns of \(B_K\) are orthonormal, so \(P=B_K B_K^\top\) is the orthogonal projection onto \(\mathrm{span}(B_K)\), allowing the decomposition \(z = Pz + (I-P)z\). The former captures the selected spectral patterns, while the latter is the residual information outside the truncated subspace. The enhancement computes spectral coefficients \(a:=B_K^\top z\in\mathbb{R}^{K\times d_z}\), modulates them with learnable, frequency-specific weights \(w\in\mathbb{R}^K\) (one weight per retained frequency, shared across channels), and projects back:

\[\tilde a := w\odot a,\quad \tilde z := B_K\tilde a = B_K(w\odot B_K^\top z).\]

The representation is reconstructed by adding back the residual: \(z^{sg}:=\tilde z + (I-P)z\). Thus, \(z^{sg}\) integrates reweighted dominant spectral components while preserving residual details. The pre-reconstruction spectral coefficients \(\tilde a_{dyn}\) and \(\tilde a_{init}\) are also held for spectral alignment.
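The enhancement step can be written in a few lines of NumPy. This is a sketch under stated assumptions: \(z\) is stored as an \((NT)\times d_z\) matrix so that \(B_K^\top z\) is well defined, the basis is a random orthonormal stand-in (not a Laplacian eigenbasis), and the weights \(w\) are fixed numbers rather than learned parameters. The function name `spectral_guided_enhancement` is hypothetical.

```python
import numpy as np

def spectral_guided_enhancement(z, B_K, w):
    """Reweight the spectral part of z and keep the residual untouched.

    z   : (NT, d_z) flattened spatiotemporal representation
    B_K : (NT, K) basis with orthonormal columns
    w   : (K,) frequency-specific weights (learnable in the paper)
    Returns z_sg (NT, d_z) and the reweighted coefficients (K, d_z).
    """
    a = B_K.T @ z                  # spectral coefficients, (K, d_z)
    a_tilde = w[:, None] * a       # one weight per frequency, all channels
    residual = z - B_K @ a         # (I - P) z, outside span(B_K)
    z_sg = B_K @ a_tilde + residual
    return z_sg, a_tilde

rng = np.random.default_rng(0)
NT, K, d_z = 24, 6, 3
# Stand-in basis: QR of a random matrix gives orthonormal columns.
B_K, _ = np.linalg.qr(rng.normal(size=(NT, K)))
z = rng.normal(size=(NT, d_z))
w = rng.uniform(0.5, 1.5, size=K)

z_sg, a_tilde = spectral_guided_enhancement(z, B_K, w)
print(z_sg.shape, a_tilde.shape)
```

Two properties follow directly from the orthonormality of the columns: with \(w=\mathbf{1}\) the module is the identity, and projecting \(z^{sg}\) back onto the basis returns exactly \(\tilde a\), so the residual never leaks into the reweighted band.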

4. Dual-layer Alignment + Phase-based Training: Simultaneous distillation and optimization stability

Simple coordinate-level imitation is insufficient because long-term stability depends on both low-frequency trends and high-frequency details. SGDD applies dual-layer alignment: spatiotemporal alignment between \(z^{sg}_{dyn}\) and \(z^{sg}_{init}\) (\(\mathcal{L}_{rep}\)), and spectral alignment between \(\tilde a_{dyn}\) and \(\tilde a_{init}\) (\(\mathcal{L}_{spec}\)). The alignment loss is \(\mathcal{L}_{align}=\mathcal{L}_{rep}(z^{sg}_{dyn},z^{sg}_{init})+\mathcal{L}_{spec}(\tilde a_{dyn},\tilde a_{init})\), with the total loss:

\[\mathcal{L}_{total} = \mathcal{L}_{pred}(x_{1:T},\hat x_{1:T}) + \lambda\,\mathcal{L}_{align}.\]

Gradients are detached from \(z^{sg}_{dyn}\) and \(\tilde a_{dyn}\) to prevent the student from corrupting the teacher. A two-stage strategy is used: a pre-training phase with a teacher-forcing ratio of 1.0 where the decoder only uses \(z^{sg}_{dyn}\) (approx. 1/3 of iterations), followed by a joint training phase where the ratio drops to 0.5, allowing the decoder to alternate between \(z^{sg}_{dyn}\) and \(z^{sg}_{init}\). This progressive schedule ensures stable convergence.
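The loss composition and the two-stage schedule can be sketched as below. NumPy has no automatic differentiation, so the stop-gradient on the teacher side appears only as a comment; in an autodiff framework the teacher tensors would be wrapped in that framework's detach or stop-gradient operation. The helper names, the use of integer division for the one-third boundary, and the per-iteration random choice of decoder input are assumptions, since the source only says the decoder "alternates" between the two representations.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def total_loss(x_true, x_pred, z_dyn_sg, z_init_sg, a_dyn, a_init, lam=1.0):
    """L_total = L_pred + lam * (L_rep + L_spec).

    z_dyn_sg and a_dyn are treated as constants here; with autodiff they
    would be detached so that no gradient flows into the teacher.
    """
    l_pred = mse(x_true, x_pred)
    l_rep = mse(z_dyn_sg, z_init_sg)    # spatiotemporal alignment
    l_spec = mse(a_dyn, a_init)         # spectral alignment
    return l_pred + lam * (l_rep + l_spec)

def teacher_forcing_ratio(step, total_steps):
    """1.0 during pre-training (first ~1/3 of iterations), then 0.5."""
    return 1.0 if step < total_steps // 3 else 0.5

def pick_decoder_input(z_dyn_sg, z_init_sg, ratio, rng):
    """ASSUMPTION: choose the teacher representation with probability ratio."""
    return z_dyn_sg if rng.random() < ratio else z_init_sg

rng = np.random.default_rng(0)
z_teacher = np.zeros((4, 2))
z_student = np.ones((4, 2))
for step in (0, 99, 100, 299):
    ratio = teacher_forcing_ratio(step, total_steps=300)
    chosen = pick_decoder_input(z_teacher, z_student, ratio, rng)
    print(step, ratio, "teacher" if chosen is z_teacher else "student")
```

During the pre-training phase the ratio of 1.0 means the decoder always receives the teacher representation; at inference only the student representation exists, so the joint phase is what exposes the decoder to it.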

Loss & Training

Loss: \(\mathcal{L}_{pred}\) is the step-wise MSE of the predicted trajectory.
Alignment Loss: \(\mathcal{L}_{align}=\mathcal{L}_{rep}+\mathcal{L}_{spec}\) (Representation MSE + Spectral MSE), with teacher-side gradient detached.
Weight: \(\lambda=1.0\); Optimizer: Adam.
Strategy: Two-stage training: Pre-training (1.0 teacher forcing) → Joint training (0.5 teacher forcing).
Backbone: \(E_{dyn}\) uses STSGNN, \(E_{init}\) uses GAT; decoders are instantiated as EGNO or GF-NODE, forming SGDD-EGNO and SGDD-GFNODE variants.

Key Experimental Results

Main Results

Evaluated on three multi-scale systems: molecular dynamics (MD17), human motion capture (CMU Mocap), and proteins (ADk trajectory). Both metrics are MSE (\(\times 10^{-2}\), lower is better): S2S measures the error of the final-step state, and S2T averages the error over the whole trajectory.

Dataset / Setting	Metric	Baseline(s)	SGDD (best variant)	Prev. SOTA	Relative error reduction
MD17 Benzene	S2S	EGNO 48.85 / GFNODE 4.82	SGDD-GFNODE 2.74	4.82	+43.2%
MD17 Aspirin	S2S	EGNO 9.18	SGDD-GFNODE 7.29	7.93	+8.1%
Mocap Run	S2S	EGNO 33.9	SGDD-EGNO 28.2	33.9	+16.8%
Mocap Walk	S2S	GFNODE 9.3	SGDD-GFNODE 6.5	9.3	+30.1%
Protein ADk	S2S	EGNO 2.23	SGDD-EGNO 1.75	2.23	+21.5%
Mocap Walk	S2T	EGNO 3.5	SGDD-EGNO 3.2	3.5	+8.6%

In the MD17 S2S evaluation, SGDD achieved SOTA on all 8 molecules. On Benzene, SGDD-EGNO improved over EGNO by ~72%, as EGNO/GFNODE errors were concentrated in low-frequency bands which SGDD captures better. The 21.5% improvement on the protein ADk (855 nodes) demonstrates scalability to large spatial systems.
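The two metrics and the relative error reduction reported in the table can be sketched as follows. The tensor layout `(T, N, 3)` and the choice to average over particles and coordinates are assumptions, since the source only describes S2S as the final-step error and S2T as the trajectory average. The `(prev, ours)` pairs are copied from the table above.

```python
import numpy as np

def s2s_mse(x_true, x_pred):
    """Final-step error; arrays have shape (T, N, 3) (assumed layout)."""
    return float(np.mean((x_true[-1] - x_pred[-1]) ** 2))

def s2t_mse(x_true, x_pred):
    """Error averaged over every step of the trajectory."""
    return float(np.mean((x_true - x_pred) ** 2))

def relative_reduction(prev, ours):
    """Percentage by which `ours` lowers the error of `prev`."""
    return 100.0 * (prev - ours) / prev

# (previous SOTA, SGDD) pairs from the main results table.
rows = {
    "MD17 Benzene S2S": (4.82, 2.74),
    "MD17 Aspirin S2S": (7.93, 7.29),
    "Mocap Run S2S": (33.9, 28.2),
    "Mocap Walk S2S": (9.3, 6.5),
    "Protein ADk S2S": (2.23, 1.75),
    "Mocap Walk S2T": (3.5, 3.2),
}
for name, (prev, ours) in rows.items():
    print(f"{name}: {relative_reduction(prev, ours):.1f}%")

# Toy trajectory whose error grows linearly with time.
T, N = 4, 2
x_true = np.zeros((T, N, 3))
x_pred = x_true + 0.1 * np.arange(1, T + 1)[:, None, None]
print(s2s_mse(x_true, x_pred), s2t_mse(x_true, x_pred))
```

The loop reproduces the percentages listed in the last column of the table. The toy trajectory shows why the two metrics can diverge: when error accumulates over time, the final-step error exceeds the trajectory average.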

Ablation Study

Ablation of Spectral Alignment (Freq Align), Spatiotemporal Alignment (Feature Align), and Spectral-Guided Enhancement (SGE) on SGDD-EGNO (S2T, \(\times 10^{-2}\)):

Freq Align	Feature Align	SGE	Ethanol	Toluene	Mocap-Walk	Mocap-Run
✓	✓	✓	2.84	3.80	2.95	12.98
✓	✓	-	2.90	4.18	4.04	12.61
✓	-	✓	2.89	4.86	3.30	13.01
-	✓	✓	2.85	4.65	3.26	14.37

Analysis shows that the truncation parameter \(K\) has a non-monotonic effect on performance: if \(K\) is too small, important frequencies are missed, but increasing \(K\) further does not keep improving results, so \(K\) must be tuned per system. Regarding encoder combinations, \(E_{dyn}\)=STSGNN + \(E_{init}\)=GAT performed best on MD17.

Key Findings

Dual-layer alignment works best when combined: spatiotemporal alignment provides a robust inductive bias for the global structure, while spectral alignment refines frequency priorities to suppress noise/instability.
Removing SGE degrades performance in most scenarios, sometimes substantially (Toluene 3.80 → 4.18, Mocap-Walk 2.95 → 4.04), which suggests that learnable spectral weights amplify useful frequency bands; Mocap-Run is the exception, improving slightly without SGE (12.98 → 12.61).
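Gains are larger in S2S (final step) than in S2T (full-trajectory average); in the table above, Mocap Walk improves by 30.1% on S2S versus 8.6% on S2T, albeit with different decoders. This suggests SGDD primarily benefits from more reliable long-term hidden representations, with advantages becoming more pronounced as time progresses.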
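The ablation rows can be compared against the full model with a few lines of arithmetic. The numbers are copied from the ablation table (S2T, \(\times 10^{-2}\)); the row labels such as "w/o SGE" are shorthand introduced here for readability.

```python
# Full model and ablated variants, values from the ablation table.
full = {"Ethanol": 2.84, "Toluene": 3.80, "Mocap-Walk": 2.95, "Mocap-Run": 12.98}
ablations = {
    "w/o SGE": {"Ethanol": 2.90, "Toluene": 4.18, "Mocap-Walk": 4.04, "Mocap-Run": 12.61},
    "w/o Feature Align": {"Ethanol": 2.89, "Toluene": 4.86, "Mocap-Walk": 3.30, "Mocap-Run": 13.01},
    "w/o Freq Align": {"Ethanol": 2.85, "Toluene": 4.65, "Mocap-Walk": 3.26, "Mocap-Run": 14.37},
}

def pct_change(full_value, ablated_value):
    """Positive means the error got worse after removing the component."""
    return 100.0 * (ablated_value - full_value) / full_value

changes = {
    name: {ds: pct_change(full[ds], val) for ds, val in row.items()}
    for name, row in ablations.items()
}
for name, row in changes.items():
    cells = ", ".join(f"{ds} {delta:+.1f}%" for ds, delta in row.items())
    print(f"{name}: {cells}")
```

Every removal increases the error except one entry: dropping SGE slightly lowers the Mocap-Run error, so the full configuration is best on three of the four datasets rather than all of them.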

Highlights & Insights

Future Trajectories as Privileged Information: This is the paper's most clever idea. The teacher "sees the answers" during training, and that capability is compressed into a student that sees only the initial state, bypassing the inference-time constraint.
Kronecker Product Spatiotemporal Basis: Using \(B=U_t\otimes U_s\) to decouple space and time frequencies in a unified spectral domain elegantly fixes the structural flaw of "spatiotemporal separation" in existing methods.
Dual Spectral + Spatiotemporal Alignment: Distillation occurs not just at the coordinate level, but also on spectral coefficients, encoding the intuition that "low-frequency ensures stability, high-frequency yields detail" directly into the loss.
Plug-and-play Decoder: SGDD is a representation learning framework that can wrap around various decoders like EGNO or GF-NODE as a general enhancement.

Limitations & Future Work

The authors note that on some MD17 molecules the S2T results lag behind previous models, which they attribute to difficulties in exactly reproducing the baseline decoders; cross-method comparisons on this metric should therefore be read with caution.
The choice of the truncation parameter \(K\) is sensitive and non-monotonic, lacking an a priori optimal criterion.
Training requires full future trajectories as privileged supervision, which may limit applicability in real-world scenarios with sparse or partial observations.
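The joint spatiotemporal basis depends on a fixed graph Laplacian eigendecomposition, which may become a bottleneck for dynamic edge sets or ultra-large-scale graphs: the two separate decompositions cost \(O(N^3 + T^3)\), and the dense joint basis \(B\) has \((NT)^2\) entries before truncation (a direct eigendecomposition of an \(NT\times NT\) operator would cost \(O((NT)^3)\)).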
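On the cost of the joint basis, one standard identity is worth keeping in mind: a Kronecker-structured basis can be applied without ever forming the \(NT\times NT\) matrix, because \((U_t\otimes U_s)^\top \mathrm{vec}(Z) = \mathrm{vec}(U_t^\top Z\, U_s)\) for a time-major \(T\times N\) signal \(Z\). The sketch below checks this numerically with random orthogonal stand-ins for \(U_t\) and \(U_s\). Whether the paper exploits this is not stated in the source; the function name `kron_project` is hypothetical.

```python
import numpy as np

def kron_project(Z, U_t, U_s):
    """Spectral coefficients of one channel without building kron(U_t, U_s).

    Z   : (T, N) signal, rows indexed by time, columns by node
    U_t : (T, T) temporal eigenvectors, U_s : (N, N) spatial eigenvectors
    Returns a (T, N) array whose entry (i, j) is the coefficient on the
    joint basis vector kron(U_t[:, i], U_s[:, j]).
    """
    return U_t.T @ Z @ U_s

rng = np.random.default_rng(0)
T, N = 5, 7
# Random orthogonal matrices as stand-ins for Laplacian eigenvectors.
U_t, _ = np.linalg.qr(rng.normal(size=(T, T)))
U_s, _ = np.linalg.qr(rng.normal(size=(N, N)))
Z = rng.normal(size=(T, N))

dense = np.kron(U_t, U_s).T @ Z.ravel()       # forms a (TN x TN) matrix
fast = kron_project(Z, U_t, U_s).ravel()      # two small matrix products
print(np.allclose(dense, fast))
```

The matrix-free form costs \(O(NT(N+T))\) per channel instead of \(O((NT)^2)\), though it does not remove the need to recompute the eigendecompositions when the graph itself changes.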

Comparison with Related Work

vs. EGNO / GF-NODE: They use Fourier temporal convolutions or Graph Fourier + Neural ODE for frequency awareness, but derive spectra from a single dimension (time or space) and weight all frequencies equally. SGDD uses a joint basis, learnable weights, and privileged distillation.
vs. Equivariant Networks (EGNN / ClofNet): These emphasize spatial equivariance but place less focus on spatiotemporal frequency entanglement; SGDD directly learns frequency-aware representations.
vs. Knowledge Distillation + Privileged Information: SGDD follows the "visible during training, invisible during testing" paradigm but is the first to combine it with spatiotemporal spectral representations for physical dynamics.

Rating

Novelty: ⭐⭐⭐⭐⭐ The combination of a joint spatiotemporal basis and privileged-information distillation is a novel and coherent angle on physical dynamics modeling.
Experimental Thoroughness: ⭐⭐⭐⭐ Coverage of multi-scale systems is good, but inconsistencies in some S2T metrics weaken the comparative argument.
Writing Quality: ⭐⭐⭐⭐ The link between motivation, method, and equations is clear.
Value: ⭐⭐⭐⭐ A plug-and-play representation enhancement framework with strong potential for transfer to other spatiotemporal tasks.