Spectral-Guided Physical Dynamics Distillation
Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=P6F4MxtOKp
Area: Physical Dynamics Modeling / Knowledge Distillation
Keywords: Physical Dynamics Prediction, Spectral Graph Analysis, Knowledge Distillation, Privileged Information, Spatiotemporal Representation
TL;DR
This paper tackles long-term 3D particle trajectory prediction from the initial state alone. It proposes SGDD, in which a teacher encoder observes future trajectories as "privileged information" and adaptively weights key frequency components in a unified spatiotemporal spectral domain; this dynamics-rich representation is then distilled into a student encoder that sees only the initial state. The result is more accurate and stable long-term prediction across multi-scale systems, including molecules, proteins, and human motion.
Background & Motivation
Background: Physical dynamics prediction—predicting future 3D trajectories given the initial state of particles (atoms/molecules, protein backbones, human joints)—is a fundamental task in science and engineering. Recent mainstream approaches design equivariant neural networks (EGNN, ClofNet, SE(3)-Transformer, etc.) to capture the symmetries of physical systems; further works (ESTAG, EGNO, GF-NODE) have introduced frequency awareness using Fourier or Graph Fourier transforms to model periodic structures in time or space.
Limitations of Prior Work: Errors accumulate and amplify significantly during long-term prediction from initial states. The root cause is the deep entanglement of global low-frequency trends and local high-frequency oscillations over long durations. Existing frequency-aware methods have two specific flaws: first, they model time and space separately, deriving spectral representations from a single dimension (either time or space), which fails to characterize physical processes emerging from spatiotemporal interdependencies; second, they treat all frequency components equally, failing to recognize that the importance of low/high frequencies varies greatly across different systems.
Key Challenge: Long-term prediction needs to prioritize low-frequency modes to maintain stability and long-term consistency, while complementarily filling in high-frequency details to improve short-term accuracy—but since these are intertwined, single-dimension spectral modeling can neither separate them nor determine the appropriate weights for different frequency bands.
Goal: (1) Jointly derive spectral representations in a unified spatiotemporal domain rather than modeling time and space independently; (2) adaptively emphasize task-related frequency components; (3) resolve the fundamental constraint of inaccessible future trajectories during inference.
Key Insight: The future trajectory itself is a form of Privileged Information (PI)—visible during training but invisible during testing. By allowing a teacher encoder to process the ground-truth future trajectory and extract dynamics-rich frequency-aware representations, and then "teaching" this to a student encoder that only sees the initial state via knowledge distillation, a direct and efficient supervisory signal can be provided to the student.
Core Idea: Replace single-dimensional frequency modeling with "Teacher sees future + Student sees initial + Spatiotemporal spectral distillation," enabling the student to generate effective dynamics representations at inference time without needing privileged information.
Method
Overall Architecture
SGDD (Spectral-Guided Dynamics Distillation) predicts the full future trajectory \(\{x_1,\dots,x_T\}\) given only the initial state \(G_0\). The general paradigm is Encoder → Dynamics Representation \(z\) → Decoder. The decoder is an off-the-shelf physical dynamics model (EGNO or GF-NODE in this paper), and the representation \(z\), generated by the encoder from \(G_0\), encodes the anticipated future evolution. All innovations in SGDD concern how to construct a better \(z\).
Specifically, the framework trains two encoders in parallel: the Dynamics Encoder \(E_{dyn}\) takes the privileged future sequence \(G_{1:T}\) and \(G_0\) to produce \(z_{dyn}\); the Initial Encoder \(E_{init}\) takes only \(G_0\) to produce \(z_{init}\). Both representations are passed into the Spectral-Guided Enhancement (SGE) module, which projects them into the spectral domain using a set of joint spatiotemporal bases, adaptively reweights the most relevant frequency components, and yields \(z^{sg}_{dyn}\) and \(z^{sg}_{init}\). During training, knowledge distillation forces \(z^{sg}_{init}\) to mimic \(z^{sg}_{dyn}\) (aligning in both spatiotemporal and spectral domains); at inference time, the teacher is removed, and the decoder relies solely on \(z^{sg}_{init}\) to predict trajectories. The entire framework is trained end-to-end with a staged training strategy to ensure stable convergence.
graph TD
A["Initial State G0"] --> C["Initial Encoder E_init<br/>(Sees G0 only)"]
B["Future Trajectory G_1:T<br/>(Privileged Info · Training Only)"] --> D["Dynamics Encoder E_dyn<br/>(Sees Future + Initial)"]
C --> E["Spectral-Guided Enhancement (SGE)<br/>Adaptive Spatiotemporal Spectral Reweighting"]
D --> E
E --> F["Student Representation z_init^sg"]
E --> G["Teacher Representation z_dyn^sg"]
G -->|"Dual-layer Alignment<br/>(Spatiotemporal + Spectral)"| F
F --> H["Physics Dynamics Decoder<br/>(Inference uses z_init^sg only)"]
H --> I["Future Trajectory Prediction x_1:T"]
Key Designs
1. Dual Encoders + Privileged Information Distillation: Allowing the student to "peek" at the future
A standard "encoder-representation-decoder" pipeline can easily build an information-rich latent representation when the encoder sees the full future trajectory, but in the target scenario future trajectories are unavailable at inference time. SGDD handles this observability gap explicitly with a teacher branch and a student branch. The dynamics encoder \(E_{dyn}\) observes the real future sequence \(\{G_1,\dots,G_T\}\) during training to produce \(z_{dyn}\in\mathbb{R}^{N\times T\times d_z}\), capturing both low-frequency trends and high-frequency variations. To ensure the student representation \(z_{init}\) resides in the same spatiotemporal space, \(E_{init}\) projects the \(G_0\) node features through a fully connected layer from \(\mathbb{R}^{N\times d}\) to \(\mathbb{R}^{N\times T\times d}\) to construct a "synthetic spatiotemporal input," yielding \(z_{init}\in\mathbb{R}^{N\times T\times d_z}\). This keeps the two representations shape-compatible for downstream alignment. Unlike model compression, distillation here transfers information across observability levels: the teacher's ability to "peek into the future" is compressed into a student that "only sees the present."
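The lifting step described above, from \(\mathbb{R}^{N\times d}\) to \(\mathbb{R}^{N\times T\times d}\), can be sketched in NumPy. This is a minimal illustration under assumptions, not the authors' implementation: the function name `lift_initial_state`, the single weight matrix of shape `(d, T*d)`, and the random initialization are all hypothetical.

```python
import numpy as np

def lift_initial_state(h0, weight, bias, T):
    """Map initial node features (N, d) to a synthetic spatiotemporal input (N, T, d).

    Assumption: one fully connected layer emits T*d values per node, which are
    then split into T per-step feature vectors.
    """
    N, d = h0.shape
    assert weight.shape == (d, T * d) and bias.shape == (T * d,)
    lifted = h0 @ weight + bias        # (N, T*d)
    return lifted.reshape(N, T, d)     # entry [n, t, j] is lifted[n, t*d + j]

rng = np.random.default_rng(0)
N, T, d = 5, 8, 4
h0 = rng.normal(size=(N, d))                       # stand-in for G_0 node features
weight = rng.normal(size=(d, T * d)) / np.sqrt(d)  # illustrative initialization
bias = np.zeros(T * d)

x_synth = lift_initial_state(h0, weight, bias, T)
print(x_synth.shape)  # (5, 8, 4)
```

The resulting tensor has the same layout as a real trajectory input, so a spatiotemporal encoder can consume it without architectural changes.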
2. Joint Spatiotemporal Basis: Decoupling space and time frequencies in a unified domain
To address the limitation of separate spatiotemporal modeling, this paper constructs a joint spatiotemporal basis. Let the eigenvector matrices of the normalized Laplacians for the spatial and temporal graphs be \(U_s\in\mathbb{R}^{N\times N}\) and \(U_t\in\mathbb{R}^{T\times T}\) (where \(L_s=U_s\Lambda_s U_s^\top\) and \(L_t=U_t\Lambda_t U_t^\top\), with eigenvalues in ascending order). The joint basis is given by the Kronecker product:
\[B = U_t \otimes U_s \in \mathbb{R}^{NT\times NT}.\]
This projects spatiotemporal representations along orthogonal dimensions simultaneously, decoupling frequency components within a unified basis. To reduce computational cost, only the columns corresponding to the \(K\) smallest eigenvalues are kept, resulting in a truncated basis \(B_K=[b_1,\dots,b_K]\in\mathbb{R}^{NT\times K}\). This truncation naturally suppresses high-variance, high-frequency content associated with large eigenvalues, baking the "low-frequency priority" physical prior directly into the basis selection; \(K\) serves as a key hyperparameter controlling the range of frequencies to be adaptively weighted.
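The construction of the truncated joint basis can be sketched as follows. Several choices here are assumptions made for illustration and are not stated in the text: the spatial graph is a ring, the temporal graph is a path, and each joint column \(u_t\otimes u_s\) is ranked by the sum \(\lambda_t+\lambda_s\) (the eigenvalue it has under a Cartesian-product graph Laplacian). The helper names are hypothetical.

```python
import numpy as np

def normalized_laplacian(adj):
    """Symmetric normalized Laplacian I - D^{-1/2} A D^{-1/2} (no isolated nodes)."""
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

def ring_graph(n):
    adj = np.zeros((n, n))
    idx = np.arange(n)
    adj[idx, (idx + 1) % n] = 1.0
    adj[(idx + 1) % n, idx] = 1.0
    return adj

def path_graph(n):
    adj = np.zeros((n, n))
    idx = np.arange(n - 1)
    adj[idx, idx + 1] = 1.0
    adj[idx + 1, idx] = 1.0
    return adj

def truncated_joint_basis(L_s, L_t, K):
    """Return B_K (NT x K) and the joint frequencies of the kept columns."""
    lam_s, U_s = np.linalg.eigh(L_s)   # eigenvalues in ascending order
    lam_t, U_t = np.linalg.eigh(L_t)
    B = np.kron(U_t, U_s)              # column i_t * N + i_s equals u_t (x) u_s
    # Assumed ranking: joint frequency = lambda_t + lambda_s, same column order as B.
    joint_freq = (lam_t[:, None] + lam_s[None, :]).ravel()
    keep = np.argsort(joint_freq, kind="stable")[:K]
    return B[:, keep], joint_freq[keep]

N, T, K = 6, 5, 8
L_s = normalized_laplacian(ring_graph(N))   # toy spatial graph
L_t = normalized_laplacian(path_graph(T))   # toy temporal graph
B_K, freqs = truncated_joint_basis(L_s, L_t, K)
print(B_K.shape)  # (30, 8)
```

Only the two small factor Laplacians are eigendecomposed; the joint basis is assembled from their eigenvectors rather than from a decomposition of an \(NT\times NT\) matrix.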
3. Spectral-Guided Enhancement (SGE): Adaptive weighting of key frequency bands
Which frequencies matter most varies by system, so SGE learns the weighting. For a representation \(z\) (reshaped to \(\mathbb{R}^{(NT)\times d_z}\)), since the columns of \(B_K\) are orthonormal, \(P=B_K B_K^\top\) is the orthogonal projection onto \(\mathrm{span}(B_K)\), allowing the decomposition \(z = Pz + (I-P)z\). The former captures the selected spectral patterns, while the latter is the residual information outside the truncated subspace. The enhancement computes spectral coefficients \(a:=B_K^\top z\in\mathbb{R}^{K\times d_z}\), modulates them with learnable, frequency-specific weights \(w\in\mathbb{R}^K\), and projects back:
\[\tilde a := \mathrm{diag}(w)\,a, \qquad \tilde z := B_K\,\tilde a.\]
The representation is reconstructed by adding back the residual: \(z^{sg}:=\tilde z + (I-P)z\). Thus, \(z^{sg}\) integrates reweighted dominant spectral components while preserving residual details. The reweighted spectral coefficients \(\tilde a_{dyn}\) and \(\tilde a_{init}\), taken before projecting back, are also retained for spectral alignment.
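A minimal NumPy sketch of the enhancement step follows. It assumes the representation is flattened so that rows index the \(NT\) spatiotemporal positions and columns index the \(d_z\) channels, and that the modulation is a plain per-frequency multiplication; the paper may parameterize the weights differently. The random orthonormal `B_K` is only a stand-in for the truncated joint basis.

```python
import numpy as np

def spectral_guided_enhancement(z, B_K, w):
    """Reweight the components of z inside span(B_K) and keep the residual.

    z   : (NT, d_z) flattened representation
    B_K : (NT, K) basis with orthonormal columns
    w   : (K,) frequency-specific weights (learnable in the real model)
    """
    a = B_K.T @ z                  # spectral coefficients, (K, d_z)
    a_tilde = w[:, None] * a       # assumed modulation: scale each frequency
    z_tilde = B_K @ a_tilde        # project back to the spatiotemporal domain
    residual = z - B_K @ a         # (I - P) z, the part outside span(B_K)
    return z_tilde + residual, a_tilde

rng = np.random.default_rng(0)
NT, d_z, K = 30, 4, 8
# Stand-in basis: reduced QR of a random matrix yields orthonormal columns.
B_K, _ = np.linalg.qr(rng.normal(size=(NT, K)))
z = rng.normal(size=(NT, d_z))
w = np.linspace(1.5, 0.5, K)       # illustrative: boost low, damp high frequencies

z_sg, a_tilde = spectral_guided_enhancement(z, B_K, w)
print(z_sg.shape, a_tilde.shape)   # (30, 4) (8, 4)
```

With all weights equal to one the function returns `z` unchanged, which is a convenient sanity check: the module only changes the balance between the kept frequencies and never discards the residual.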
4. Dual-layer Alignment + Phase-based Training: Simultaneous distillation and optimization stability
Simple coordinate-level imitation is insufficient because long-term stability depends on both low-frequency trends and high-frequency details. SGDD applies dual-layer alignment: spatiotemporal alignment between \(z^{sg}_{dyn}\) and \(z^{sg}_{init}\) (\(\mathcal{L}_{rep}\)), and spectral alignment between \(\tilde a_{dyn}\) and \(\tilde a_{init}\) (\(\mathcal{L}_{spec}\)). The alignment loss is \(\mathcal{L}_{align}=\mathcal{L}_{rep}(z^{sg}_{dyn},z^{sg}_{init})+\mathcal{L}_{spec}(\tilde a_{dyn},\tilde a_{init})\), with the total loss:
\[\mathcal{L} = \mathcal{L}_{pred} + \lambda\,\mathcal{L}_{align}.\]
Gradients are detached from \(z^{sg}_{dyn}\) and \(\tilde a_{dyn}\) to prevent the student from corrupting the teacher. A two-stage strategy is used: a pre-training phase with a teacher-forcing ratio of 1.0 where the decoder only uses \(z^{sg}_{dyn}\) (approx. 1/3 of iterations), followed by a joint training phase where the ratio drops to 0.5, allowing the decoder to alternate between \(z^{sg}_{dyn}\) and \(z^{sg}_{init}\). This progressive schedule ensures stable convergence.
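The two-stage schedule can be sketched in plain Python. The text only says the decoder "alternates" between the two representations in the joint phase, so sampling the teacher with probability equal to the ratio at each iteration is an assumption, as are the function names and the iteration counts.

```python
import random

def teacher_forcing_ratio(step, pretrain_steps):
    """1.0 during pre-training, 0.5 during joint training."""
    return 1.0 if step < pretrain_steps else 0.5

def pick_decoder_input(z_dyn_sg, z_init_sg, ratio, rng):
    """Feed the teacher representation with probability `ratio` (assumed rule)."""
    return z_dyn_sg if rng.random() < ratio else z_init_sg

total_steps = 300
pretrain_steps = total_steps // 3   # roughly one third of the iterations
rng = random.Random(0)

choices = []
for step in range(total_steps):
    ratio = teacher_forcing_ratio(step, pretrain_steps)
    # Strings stand in for the actual representation tensors.
    choices.append(pick_decoder_input("teacher", "student", ratio, rng))

print(set(choices[:pretrain_steps]))  # {'teacher'}
```

During the first phase the decoder only ever sees the teacher representation, so it learns to decode from a reliable signal before the student's output is mixed in.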
Loss & Training
- Prediction Loss: \(\mathcal{L}_{pred}\) is the step-wise MSE of the predicted trajectory.
- Alignment Loss: \(\mathcal{L}_{align}=\mathcal{L}_{rep}+\mathcal{L}_{spec}\) (Representation MSE + Spectral MSE), with teacher-side gradient detached.
- Weight: \(\lambda=1.0\); Optimizer: Adam.
- Strategy: Two-stage training: Pre-training (1.0 teacher forcing) → Joint training (0.5 teacher forcing).
- Backbone: \(E_{dyn}\) uses STSGNN, \(E_{init}\) uses GAT; decoders are instantiated as EGNO or GF-NODE, forming SGDD-EGNO and SGDD-GFNODE variants.
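The loss terms listed above can be combined as in the following sketch. It only computes loss values with NumPy; the combination \(\mathcal{L}_{pred}+\lambda\mathcal{L}_{align}\) is inferred from the listed weight \(\lambda\), and the function and argument names are hypothetical. In an autodiff framework the teacher-side tensors would additionally be wrapped in a stop-gradient (for example `jax.lax.stop_gradient` or PyTorch's `Tensor.detach()`), which NumPy cannot express.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def sgdd_loss(x_pred, x_true, z_dyn_sg, z_init_sg, a_dyn, a_init, lam=1.0):
    """Assumed total loss: L_pred + lam * (L_rep + L_spec).

    z_dyn_sg and a_dyn are teacher outputs and act as fixed targets here.
    """
    l_pred = mse(x_pred, x_true)        # trajectory MSE
    l_rep = mse(z_init_sg, z_dyn_sg)    # spatiotemporal alignment
    l_spec = mse(a_init, a_dyn)         # spectral alignment
    total = l_pred + lam * (l_rep + l_spec)
    return total, {"pred": l_pred, "rep": l_rep, "spec": l_spec}

# Toy tensors: trajectory (T, N, 3), representation (NT, d_z), coefficients (K, d_z).
x_true = np.zeros((4, 2, 3))
x_pred = np.full((4, 2, 3), 0.1)
z_dyn_sg = np.ones((8, 5))
z_init_sg = np.full((8, 5), 0.8)
a_dyn = np.ones((3, 5))
a_init = np.full((3, 5), 0.7)

total, parts = sgdd_loss(x_pred, x_true, z_dyn_sg, z_init_sg, a_dyn, a_init)
print(parts)
```

Treating the teacher outputs as constants in the alignment terms means those terms only move the student toward the teacher, never the reverse.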
Key Experimental Results
Main Results
Evaluated on three multi-scale systems: molecular dynamics (MD17), human motion capture (CMU Mocap), and proteins (ADk trajectory). Metrics are S2S (state-to-state: error at the final step) and S2T (state-to-trajectory: error averaged over the whole trajectory), both reported as MSE (\(\times 10^{-2}\)).
| Dataset / Setting | Metric | Baseline(s) | Ours (SGDD) | Prev. SOTA | Relative error reduction |
|---|---|---|---|---|---|
| MD17 Benzene | S2S | EGNO 48.85 / GFNODE 4.82 | SGDD-GFNODE 2.74 | 4.82 | +43.2% |
| MD17 Aspirin | S2S | EGNO 9.18 | SGDD-GFNODE 7.29 | 7.93 | +8.1% |
| Mocap Run | S2S | EGNO 33.9 | SGDD-EGNO 28.2 | 33.9 | +16.8% |
| Mocap Walk | S2S | GFNODE 9.3 | SGDD-GFNODE 6.5 | 9.3 | +30.1% |
| Protein ADk | S2S | EGNO 2.23 | SGDD-EGNO 1.75 | 2.23 | +21.5% |
| Mocap Walk | S2T | EGNO 3.5 | SGDD-EGNO 3.2 | 3.5 | +8.6% |
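The two metrics used in the table above can be sketched as follows. The averaging convention (a mean over particles and coordinates) is an assumption, since benchmarks differ on whether coordinates are summed or averaged, and the function names are hypothetical.

```python
import numpy as np

def s2s_mse(pred, true):
    """Error at the final step only; pred and true have shape (T, N, 3)."""
    return float(np.mean((pred[-1] - true[-1]) ** 2))

def s2t_mse(pred, true):
    """Error averaged over every step of the trajectory."""
    return float(np.mean((pred - true) ** 2))

# Toy example: the per-coordinate error grows linearly with time (0.1, 0.2, 0.3, 0.4).
T, N = 4, 2
true = np.zeros((T, N, 3))
pred = np.stack([np.full((N, 3), 0.1 * (t + 1)) for t in range(T)])

s2s = s2s_mse(pred, true)   # 0.4**2 = 0.16
s2t = s2t_mse(pred, true)   # mean of 0.01, 0.04, 0.09, 0.16 = 0.075
print(s2s, s2t)
```

When error accumulates over time, as in this toy case, the final-step metric is larger than the trajectory-averaged one.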
In the MD17 S2S evaluation, SGDD achieves the best result on all 8 molecules. On Benzene, SGDD-EGNO improves over EGNO by ~72%; the baseline EGNO/GFNODE errors are concentrated in low-frequency bands, which SGDD captures better. The 21.5% improvement on protein ADk (855 nodes) indicates that the approach scales to large spatial systems.
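The gain column in the main results table is consistent with a relative error reduction, \((\text{prev} - \text{ours})/\text{prev}\). The short check below recomputes it from the table entries; the function name is hypothetical.

```python
def relative_reduction(prev, ours):
    """Relative error reduction in percent, rounded to one decimal."""
    return round(100.0 * (prev - ours) / prev, 1)

# (setting, previous best MSE, SGDD MSE), copied from the main results table.
rows = [
    ("MD17 Benzene S2S", 4.82, 2.74),
    ("MD17 Aspirin S2S", 7.93, 7.29),
    ("Mocap Run S2S", 33.9, 28.2),
    ("Mocap Walk S2S", 9.3, 6.5),
    ("Protein ADk S2S", 2.23, 1.75),
    ("Mocap Walk S2T", 3.5, 3.2),
]

gains = {name: relative_reduction(prev, ours) for name, prev, ours in rows}
for name, gain in gains.items():
    print(f"{name}: {gain}%")
```

Because the metric is an error, a positive percentage means the SGDD variant has lower MSE than the previous best.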
Ablation Study
Ablation of Spectral Alignment (Freq Align), Spatiotemporal Alignment (Feature Align), and Spectral-Guided Enhancement (SGE) on SGDD-EGNO (S2T, \(\times 10^{-2}\)):
| Freq Align | Feature Align | SGE | Ethanol | Toluene | Mocap-Walk | Mocap-Run |
|---|---|---|---|---|---|---|
| ✓ | ✓ | ✓ | 2.84 | 3.80 | 2.95 | 12.98 |
| ✓ | ✓ | - | 2.90 | 4.18 | 4.04 | 12.61 |
| ✓ | - | ✓ | 2.89 | 4.86 | 3.30 | 13.01 |
| - | ✓ | ✓ | 2.85 | 4.65 | 3.26 | 14.37 |
Analysis shows that the truncation parameter \(K\) has a non-monotonic effect on performance: too small a \(K\) discards important frequencies, but increasing \(K\) does not improve results monotonically, so it must be selected carefully. Regarding encoder combinations, \(E_{dyn}\)=STSGNN + \(E_{init}\)=GAT performed best on MD17.
Key Findings
- Dual-layer alignment works best when combined: spatiotemporal alignment provides a robust inductive bias for the global structure, while spectral alignment refines frequency priorities to suppress noise/instability.
- Removing SGE degrades performance in most scenarios (Toluene 3.80 → 4.18, Mocap-Walk 2.95 → 4.04), though Mocap-Run is slightly better without it (12.61 vs. 12.98); overall this suggests that the learnable spectral weights amplify useful frequency bands.
- Gains are larger in S2T (full trajectory) than S2S (final step), suggesting SGDD primarily benefits from more reliable long-term hidden representations, with advantages becoming more pronounced as time progresses.
Highlights & Insights
- Future Trajectories as Privileged Information: This is the most clever aspect—allowing the teacher to "see the answers" during training to compress this capability into a student that only sees the initial state, bypassing the inference-time constraint.
- Kronecker Product Spatiotemporal Basis: Using \(B=U_t\otimes U_s\) to decouple space and time frequencies in a unified spectral domain elegantly fixes the structural flaw of "spatiotemporal separation" in existing methods.
- Dual Spectral + Spatiotemporal Alignment: Distillation occurs not just at the coordinate level, but also on spectral coefficients, encoding the intuition that "low-frequency ensures stability, high-frequency yields detail" directly into the loss.
- Plug-and-play Decoder: SGDD is a representation learning framework that can wrap around various decoders like EGNO or GF-NODE as a general enhancement.
Limitations & Future Work
- The authors note that for some molecules in the MD17 S2T evaluation, results lag behind previous models, which they attribute to difficulties in exactly reproducing the baseline decoders; cross-method comparisons should therefore be read with caution.
- Performance is sensitive to the truncation parameter \(K\) and varies non-monotonically with it; there is no a priori criterion for choosing it.
- Training requires full future trajectories as privileged supervision, which may limit applicability in real-world scenarios with sparse or partial observations.
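- The joint spatiotemporal basis depends on a fixed graph Laplacian eigendecomposition, which may become a bottleneck for dynamic edge sets or ultra-large-scale graphs: the factor decompositions cost \(O(N^3+T^3)\), and decomposing the joint \(NT\times NT\) Laplacian directly would cost \(O((NT)^3)\).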
Related Work & Insights
- vs. EGNO / GF-NODE: They use Fourier time convolutions or Graph Fourier + Neural ODE for frequency awareness but model space/time in a single dimension and weight frequencies equally. SGDD uses a joint basis, learnable weights, and privileged distillation.
- vs. Equivariant Networks (EGNN / ClofNet): These emphasize spatial equivariance but place less focus on spatiotemporal frequency entanglement; SGDD directly learns frequency-aware representations.
- vs. Knowledge Distillation + Privileged Information: SGDD follows the "visible during training, invisible during testing" paradigm but is the first to combine it with spatiotemporal spectral representations for physical dynamics.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The combination of joint spatiotemporal bases and privileged information distillation is a novel and self-consistent entry point for physical dynamics.
- Experimental Thoroughness: ⭐⭐⭐⭐ Coverage of multi-scale systems is good, but inconsistencies in some S2T metrics weaken the comparative argument.
- Writing Quality: ⭐⭐⭐⭐ The link between motivation, method, and equations is clear.
- Value: ⭐⭐⭐⭐ A plug-and-play representation enhancement framework with strong potential for transfer to other spatiotemporal tasks.