Meta-learning Structure-Preserving Dynamics
Conference: ICML 2026
arXiv: 2508.11205
Code: None
Area: Scientific Machine Learning / Meta-learning / Structure-preserving Neural Networks
Keywords: Hamiltonian NN, GENERIC, Modulation-based Meta-learning, Low-rank Adaptation, SVD Modulation
TL;DR
Brings modulation-based meta-learning into Hamiltonian / GENERIC neural networks: a hyper-network maps a per-system latent code \(\bm{z}^{(k)}\) to layer-wise modulation parameters. The paper proposes two new modulations, latent multi-rank (MR) and latent SVD-like modulation, which let a shared network adapt few-shot to a family of new parameter instances without knowing the system parameters \(\bm{\mu}\), while preserving the energy-conserving or dissipative structure by construction.
Background & Motivation
Background: Structure-preserving neural networks (HNN, LNN, port-Hamiltonian NN, GENERIC/metriplectic NN) hard-code conservation laws, symplectic structure, and dissipation into the architecture, enabling physically faithful predictions on dynamical systems with known parameters \(\bm{\mu}\).
Limitations of Prior Work: These models are typically "one model per parameter instance"—any parameter change requires retraining, making many-query scenarios (e.g., families of pendulums with different masses, oscillators with different stiffness) prohibitively expensive. The few existing meta-learning extensions (Lee 2021, Song 2024) follow MAML/ANIL approaches, requiring inner-loop high-dimensional parameter updates, which are unstable and inefficient.
Key Challenge: HNN-type models only need to learn a scalar energy function \(\mathcal{H}_\Theta(\bm{q}, \bm{p})\) to specify the entire dynamics, so the dependence of the weights on the parameters \(\bm{\mu}\) is inherently low-dimensional. However, existing meta-learning methods update all of \(\Theta\) via full gradients and do not exploit this low-dimensional structure.
Goal: (1) Systematically compare various modulation strategies within the Hamiltonian / GENERIC framework; (2) Design more expressive yet parameter-efficient new modulation methods; (3) Ensure strict preservation of conservation/dissipative structure after modulation.
Key Insight: Inspired by latent modulation in INR/NeRF (e.g., Dupont 2022; CODA by Kirchmeyer 2022): compress each system into a low-dimensional latent code \(\bm{z}^{(k)}\), then use a hyper-network \(\bm{f}_\text{hyper}(\bm{z}^{(k)}; \bm\phi)\) to generate small correction parameters for each layer, with base weights shared across all tasks.
Core Idea: "Base sharing + instance latent + hierarchical low-rank modulation" can capture the "low-dimensional principal manifold of parameter \(\bm{\mu}\)" with very few trainable parameters, and further, SVD-like decomposition allows the base phase to learn orthogonal bases, reducing test-time adaptation to updating a few singular value scalars.
Method
Overall Architecture
The input is a family of Hamiltonian / GENERIC systems \(\{\mathcal{H}^{(k)}(\bm{q}, \bm{p}) = \mathcal{H}(\bm{q}, \bm{p}; \bm{\mu}^{(k)})\}_{k=1}^{n_\mu}\), with several trajectories sampled per system. Model parameters are split as \(\Theta^{(k)} = \Theta_\text{base} \cup \Theta_\text{indv}^{(k)}\): the base is updated by meta-gradient in the outer loop, while the individual part is a system-specific latent code \(\bm{z}^{(k)}\) updated in the inner loop. The hyper-network maps \(\bm{z}^{(k)}\) to low-rank/bias correction parameters for each layer. The final \(\tilde{\mathcal{H}}(\bm{q}, \bm{p}; \Theta^{(k)})\) is an energy function conditioned on the latent, still yielding dynamics via \(\dot{\bm q} = \partial \tilde{\mathcal H} / \partial \bm p,\ \dot{\bm p} = -\partial \tilde{\mathcal H} / \partial \bm q\), so structure preservation is inherited from the base architecture.
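The claim that structure preservation is inherited from the architecture can be checked on a toy case: any differentiable scalar energy, pushed through the symplectic gradient, gives a vector field along which that energy does not change. The sketch below is only an illustration under stated assumptions: `H_pendulum` and its parameter `mu` are hypothetical stand-ins for the learned, latent-conditioned energy network, and central finite differences replace automatic differentiation.

```python
import numpy as np

def H_pendulum(q, p, mu=1.0):
    """Toy scalar energy standing in for the learned network (assumption)."""
    return 0.5 * p**2 + mu * (1.0 - np.cos(q))

def grad_H(H, q, p, eps=1e-5):
    """Central-difference estimates of (dH/dq, dH/dp)."""
    dH_dq = (H(q + eps, p) - H(q - eps, p)) / (2.0 * eps)
    dH_dp = (H(q, p + eps) - H(q, p - eps)) / (2.0 * eps)
    return dH_dq, dH_dp

def hamiltonian_field(H, q, p):
    """Symplectic gradient: q_dot = dH/dp, p_dot = -dH/dq."""
    dH_dq, dH_dp = grad_H(H, q, p)
    return dH_dp, -dH_dq

def H(q, p):
    # One "instance" of the family: the same energy form with a different mu.
    return H_pendulum(q, p, mu=2.5)

q, p = 0.7, -0.3
q_dot, p_dot = hamiltonian_field(H, q, p)

# Rate of change of the energy along the field: dH/dq * q_dot + dH/dp * p_dot.
dH_dq, dH_dp = grad_H(H, q, p)
energy_rate = dH_dq * q_dot + dH_dp * p_dot
print(q_dot, p_dot, energy_rate)
```

Changing `mu` changes the field but not the cancellation in `energy_rate`, which is the sense in which modulating the energy network leaves the conservative structure intact.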
Key Designs
- Latent Multi-Rank (MR) Modulation:
  - Function: Adds a rank-\(r\) instance-specific correction \(\bm{U}^{(\ell,k)} \bm{V}^{(\ell,k)\top}\) and a bias correction \(\bm{s}^{(\ell,k)}\) to each MLP layer's weights \(\bm{W}^{(\ell)}\), all generated by the hyper-network from \(\bm{z}^{(k)}\).
  - Mechanism: Each layer is updated as \(\bm{h} \mapsto \sigma\left((\bm{W}^{(\ell)} + \bm{U}^{(\ell,k)} \bm{V}^{(\ell,k)\top}) \bm{h} + \bm{b}^{(\ell)} + \bm{s}^{(\ell,k)}\right)\), where \(\bm{U}, \bm{V} \in \mathbb{R}^{w_\ell \times r}\) (written here for a square layer of width \(w_\ell\)). When \(r=1\), this reduces to RO (rank-one), a minimal LoRA-like modulation; \(r=5\) gives MR(5). Both \(\bm{U}\) and \(\bm{V}\) are instance-specific, so the hyper-network generates a new pair of rank-\(r\) factors for every instance.
  - Design Motivation: Proposition 3.1 shows that if the local rank of \(\partial_{\bm\mu} \bm{f}\) is \(\le r\), then an \(r\)-dimensional modulation suffices to capture all local parameter variations. MR exploits this by putting the instance-specific capacity into LoRA-style low-rank factors.
- Latent SVD-like Modulation (best-performing variant in the paper):
  - Function: Further factorizes the low-rank modulation into shared bases plus instance-specific singular values, so the hyper-network only needs to output a few scalars.
  - Mechanism: Each layer is written as \(\bm{h} \mapsto \sigma\left((\bm{W}^{(\ell)} + \sum_{i=1}^r d_i^{(\ell,k)} \bm{u}_i^{(\ell)} \bm{v}_i^{(\ell)\top}) \bm{h} + \bm{b}^{(\ell)} + \bm{s}^{(\ell,k)}\right)\), where \(\bm{u}_i^{(\ell)}, \bm{v}_i^{(\ell)}\) are base parameters (updated by the meta-gradient), and only the singular values \(d_i^{(\ell,k)}\) and the bias shift \(\bm{s}^{(\ell,k)}\) are generated by the hyper-network from \(\bm{z}^{(k)}\). Soft orthogonality penalties \(\|\bm{U}^\top \bm{U} - \bm{I}\|_F\) and \(\|\bm{V}^\top \bm{V} - \bm{I}\|_F\), with \(\bm{U} = [\bm{u}_1^{(\ell)}, \dots, \bm{u}_r^{(\ell)}]\) and \(\bm{V}\) defined likewise, encourage orthonormal bases, and a ReLU activation on the hyper-network output keeps the singular values non-negative.
  - Design Motivation: The base phase learns modulation directions that are invariant across systems in \(\bm{u}_i, \bm{v}_i\); at test time, only a few singular values need to be fitted to adapt to a new parameter instance. The hyper-network output is small and test-time gradients are cheap, matching the pattern that works in INR: learn shared bases first, then fit coefficients.
- Locality Regularization + Evolving Latent Code Training Protocol:
  - Function: (a) Restricts instance parameters from deviating too far from the shared base; (b) Instead of resetting the latent every epoch, allows \(\bm{z}^{(k)}\) to evolve throughout training.
  - Mechanism: Adds \(\lambda_z \|\bm{z}\|_2 + \lambda_\phi \|\bm\phi\|_2\) to the loss to keep instance updates near the shared base. At test time, the latent is initialized as the Euclidean mean of the training latents, \(\bm{z}_\text{avg} = \tfrac{1}{n_\mu^\text{train}} \sum_k \bm{z}_\text{train}^{(k)}\), and few-shot auto-decoding is then performed.
  - Design Motivation: Zero-initialization (as in Dupont 2022) pushes the base to a position that can be fine-tuned by any latent, losing training signal. Evolving latents let the base and the average latent co-evolve, so the test initialization lands directly in a previously learned parameter neighborhood, which yields greater stability.
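The two modulated layers described above can be written side by side in a few lines. This is a minimal NumPy sketch, not the paper's implementation: the layer sizes, the `tanh` activation, and the linear hyper-network head `svd_hyper` (with weights `A_d`, `A_s`) are assumptions made for illustration. The shared bases are drawn with orthonormal columns via QR so that the soft orthogonality penalty is zero at initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
w_in, w_out, r, z_dim = 8, 8, 3, 4          # hypothetical sizes

# Shared base parameters of one layer.
W = rng.normal(size=(w_out, w_in)) / np.sqrt(w_in)
b = np.zeros(w_out)

def mr_layer(h, U, V, s):
    """MR: sigma((W + U V^T) h + b + s) with instance-specific U, V, s."""
    return np.tanh((W + U @ V.T) @ h + b + s)

def svd_layer(h, U_base, V_base, d, s):
    """SVD-like: sigma((W + sum_i d_i u_i v_i^T) h + b + s); only d, s vary per instance."""
    return np.tanh((W + (U_base * d) @ V_base.T) @ h + b + s)

def orthogonality_penalty(U):
    """Soft penalty ||U^T U - I||_F on a matrix of basis columns."""
    return np.linalg.norm(U.T @ U - np.eye(U.shape[1]), ord="fro")

# Shared bases (base parameters), initialized with orthonormal columns.
U_base, _ = np.linalg.qr(rng.normal(size=(w_out, r)))
V_base, _ = np.linalg.qr(rng.normal(size=(w_in, r)))

# Hypothetical linear hyper-network head: latent z -> (singular values d, bias shift s).
A_d = rng.normal(size=(r, z_dim))
A_s = 0.1 * rng.normal(size=(w_out, z_dim))

def svd_hyper(z):
    d = np.maximum(A_d @ z, 0.0)             # ReLU keeps singular values non-negative
    s = A_s @ z
    return d, s

z = rng.normal(size=z_dim)                   # latent code of one system
d, s = svd_hyper(z)
h = rng.normal(size=w_in)

out_svd = svd_layer(h, U_base, V_base, d, s)
# The same correction expressed as an MR layer by folding d into U.
out_mr = mr_layer(h, U_base * d, V_base, s)
print(np.allclose(out_svd, out_mr), orthogonality_penalty(U_base))
```

The last two lines show that the SVD-like layer is a special case of MR in which the factors are shared and only their scaling is instance-specific.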
Loss & Training
For Hamiltonian systems, use symplecticity loss \(\mathcal{L}_\text{symp} = \|\dot{\bm q} - \partial_{\bm p} \tilde{\mathcal H}_\Theta\|_2^2 + \|\dot{\bm p} + \partial_{\bm q} \tilde{\mathcal H}_\Theta\|_2^2\); for GENERIC systems, use the corresponding metriplectic loss. The outer loop updates the base \(\Theta_\text{base}\) for \(N_\text{out}\) steps, the inner loop updates the current batch's latent for \(N_\text{in}\) steps; at test time, Algorithm 2 is used for \(N_\text{test}\)-shot latent fitting (base frozen).
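To make the loss and the test-time latent fitting concrete, here is a deliberately simplified sketch. It does not reproduce the paper's Algorithm 2: the scalar latent `z` acts directly as the stiffness of a toy mass-spring energy (standing in for hyper-network plus base network), finite differences replace automatic differentiation, and the training latents in `z_train` are made-up values used only to form `z_avg`.

```python
import numpy as np

def energy(q, p, z):
    """Toy latent-conditioned energy; z plays the role of the stiffness (assumption)."""
    return 0.5 * p**2 + 0.5 * z * q**2

def symplectic_loss(z, q, p, q_dot, p_dot, eps=1e-4):
    """Mean of |q_dot - dH/dp|^2 + |p_dot + dH/dq|^2 over the samples."""
    dH_dq = (energy(q + eps, p, z) - energy(q - eps, p, z)) / (2 * eps)
    dH_dp = (energy(q, p + eps, z) - energy(q, p - eps, z)) / (2 * eps)
    return np.mean((q_dot - dH_dp) ** 2 + (p_dot + dH_dq) ** 2)

def fit_latent(z_init, data, n_steps=100, lr=0.5, eps=1e-4):
    """Auto-decoding: gradient descent on the latent only, everything else frozen."""
    z = float(z_init)
    for _ in range(n_steps):
        grad = (symplectic_loss(z + eps, *data)
                - symplectic_loss(z - eps, *data)) / (2 * eps)
        z -= lr * grad
    return z

rng = np.random.default_rng(0)
k_true = 2.0                                  # unseen test instance
q = rng.uniform(-1, 1, size=200)
p = rng.uniform(-1, 1, size=200)
data = (q, p, p, -k_true * q)                 # observed (q, p, q_dot, p_dot)

z_train = np.array([0.5, 1.0, 1.5, 2.5, 3.0])  # hypothetical training latents
z_avg = z_train.mean()                         # test-time initialization
z_fit = fit_latent(z_avg, data)
print(z_avg, z_fit)
```

Starting from `z_avg` rather than zero means the first iterate already corresponds to a system inside the training family, which is the motivation given for the evolving-latent protocol.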
Key Experimental Results
Main Results
Three energy-conserving systems (Duffing, mass-spring, pendulum) and one dissipative system (DNO) are used, each with 80 parameter instances (70 train / 10 test), 10 trajectories per instance. Metrics: \(\epsilon_\text{field}\) (relative \(\ell^2\) error on uniform grid, OOD indicator), \(\epsilon_\text{traj}\) (relative error on test trajectories), SSIM.
| System | Method | \(\epsilon_\text{field}\) (\(\times 10^{-2}\)) | \(\epsilon_\text{traj}\) (\(\times 10^{-2}\)) |
|---|---|---|---|
| Pendulum | Scratch | 83.35 | 79.84 |
| Pendulum | MAML | 99.13 | 52.37 |
| Pendulum | Reptile | 88.72 | 75.73 |
| Pendulum | FW (CODA) | 8.23 | 10.65 |
| Pendulum | Shift | 9.76 | 12.88 |
| Pendulum | RO (MR-1) | 6.47 | 8.27 |
| Pendulum | SVD(5) | 4.62 | 5.33 |
| Mass Spring | FW | 1.60 | 1.31 |
| Mass Spring | SVD(5) | 1.51 | 1.12 |
| Duffing | FW | 10.30 | 2.78 |
| Duffing | SVD(5) | 10.03 | 2.30 |
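For readers who want to reproduce the scale of these numbers, the sketch below computes a relative \(\ell^2\) error of a vector field on a uniform grid. The exact normalization used in the paper is not stated in this summary, so the definition \(\|\hat{\bm f} - \bm f\|_2 / \|\bm f\|_2\), the grid, and the 5% perturbation are assumptions chosen for illustration.

```python
import numpy as np

def relative_l2_error(pred, true):
    """||pred - true||_2 / ||true||_2 over all entries."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return np.linalg.norm(pred - true) / np.linalg.norm(true)

# Uniform grid in phase space (hypothetical range and resolution).
qs, ps = np.meshgrid(np.linspace(-2, 2, 21), np.linspace(-2, 2, 21))

# Pendulum field (q_dot, p_dot) = (p, -sin q) and a prediction whose
# restoring force is off by 5%.
true_field = np.stack([ps, -np.sin(qs)])
pred_field = np.stack([ps, -1.05 * np.sin(qs)])

eps_field = relative_l2_error(pred_field, true_field)
print(eps_field)
```

Because the table columns are scaled by \(10^{-2}\), an entry such as 4.62 corresponds to a relative error of 0.0462.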
Ablation Study
| Configuration | Result | Notes |
|---|---|---|
| Multi-domain joint training (Duffing + spring + pendulum share base) | SVD(5) still best | Shows modulation can generalize across "different dynamical families" |
| Test-time adaptation with different shot numbers | SVD consistently best from 1-shot to 300-shot | Strong few-shot adaptation |
| Locality weights \(\lambda_\phi\) from \(10^{-4}\) to \(10^{-2}\), \(\lambda_z\) from \(10^{-3}\) to \(10^{-1}\) | SVD has lowest variance | Robust to regularization strength |
| Test latent initialization (zero vs \(\bm z_\text{avg}\)) | \(\bm z_\text{avg}\) always better | Validates evolving-latent protocol |
| Dissipative DNO system | SVD(3) reaches \(\epsilon_\text{traj} = 0.142\); Reptile / ANIL return NaN or diverge | Modulation extends stably to GENERIC |
Key Findings
- Modulation-based methods (FW / Shift / MR / RO / SVD) reduce error by about 65% compared to optimization-based methods (MAML / Reptile / ANIL), indicating that in structure-preserving networks where weight dependence on parameters is naturally low-dimensional, modulation is more efficient than inner-gradient updates.
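- SVD(5) not only achieves the lowest error, but its hyper-network output is also much smaller than FW's: FW must output a full correction to every \(\bm{W}, \bm{b}\), whereas SVD outputs only \(r\) singular values plus a bias shift per layer. Among the compared methods, this places it on the Pareto front of accuracy versus parameter count.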
- ANIL / Reptile directly yield NaN or diverge on dissipative DNO, reflecting that inner-loop optimization is highly unstable for dynamics with entropy generation; modulation methods, which do not update main weights, remain stable.
- In multi-domain joint training, a single base network fits Duffing / spring / pendulum simultaneously, with adaptation via latent switching—showing that modulation enables not just parameter fine-tuning, but "dynamical family switching".
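The size argument in the findings above can be made concrete with a per-layer count of how many numbers the hyper-network has to emit. The formulas follow the layer definitions given in the Method section; the layer width of 64 and the assumption that FW and the low-rank variants all emit a bias term are illustrative choices, not figures from the paper.

```python
def hyper_output_size(method, w_out, w_in, r=1):
    """Numbers the hyper-network must emit for one layer (illustrative count)."""
    if method == "FW":       # full correction to W and b
        return w_out * w_in + w_out
    if method == "Shift":    # bias shift only
        return w_out
    if method == "MR":       # instance-specific U, V plus bias shift
        return r * (w_out + w_in) + w_out
    if method == "SVD":      # r singular values plus bias shift
        return r + w_out
    raise ValueError(f"unknown method: {method}")

width = 64                   # hypothetical hidden width
sizes = {
    "FW": hyper_output_size("FW", width, width),
    "Shift": hyper_output_size("Shift", width, width),
    "MR(5)": hyper_output_size("MR", width, width, r=5),
    "SVD(5)": hyper_output_size("SVD", width, width, r=5),
}
print(sizes)  # {'FW': 4160, 'Shift': 64, 'MR(5)': 704, 'SVD(5)': 69}
```

The SVD-like variant moves the \(r(w_\text{out} + w_\text{in})\) basis entries into the shared base, so they are learned once rather than generated per instance.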
Highlights & Insights
- The split "base = cross-task invariant + latent = task-specific" combined with SVD decomposition makes the INR latent modulation approach more interpretable—the magnitude of singular values directly indicates the importance of each principal direction for the current parameter instance.
- Proposition 3.1 provides a clean theoretical justification for low-rank modulation: the local rank of the parameter space determines the required modulation dimension, tightly linking to the empirical "low-dimensional latent suffices" in INR.
- Naturally combines modulation with Hamiltonian / GENERIC: modulation changes only the weights of the scalar network \(\mathcal{H}_\Theta\), while the dynamics are still derived from it through the same symplectic or metriplectic operators, so the structure is not broken. This makes it a rare provably compatible combination of structure preservation and meta-learning.
- The evolving-latent training protocol, though seemingly minor, is crucial: it ensures the base always co-evolves with the actual latents in use, avoiding distribution shift where the base adapts to 0-latent but test latents are far from zero.
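The local-rank notion behind Proposition 3.1 can be illustrated numerically; this is not the proposition itself, whose precise statement is not reproduced here. The sketch assumes a pendulum parameterized by mass, length, and gravity, \(\mathcal{H} = p^2 / (2 m l^2) - m g l \cos q\), which is a choice made for this example. Although \(\bm\mu = (m, l, g)\) has three components, the vector field depends on them only through two combinations, so the Jacobian with respect to \(\bm\mu\) has rank 2.

```python
import numpy as np

def pendulum_field(mu, q, p):
    """Vector field of H = p^2 / (2 m l^2) - m g l cos(q), stacked over samples."""
    m, l, g = mu
    q_dot = p / (m * l**2)
    p_dot = -m * g * l * np.sin(q)
    return np.concatenate([q_dot, p_dot])

def param_jacobian(field, mu, q, p, eps=1e-6):
    """Central-difference Jacobian of the stacked field with respect to mu."""
    mu = np.asarray(mu, dtype=float)
    cols = []
    for j in range(mu.size):
        step = np.zeros_like(mu)
        step[j] = eps
        cols.append((field(mu + step, q, p) - field(mu - step, q, p)) / (2 * eps))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
q = rng.uniform(-np.pi, np.pi, 50)
p = rng.uniform(-2.0, 2.0, 50)

J = param_jacobian(pendulum_field, [1.0, 1.5, 9.81], q, p)   # shape (100, 3)
sing = np.linalg.svd(J, compute_uv=False)
# Explicit relative threshold, since finite differences add small numerical noise.
local_rank = int(np.sum(sing > 1e-6 * sing[0]))
print(sing, local_rank)
```

In this toy case a two-dimensional modulation would already cover every local change in the three physical parameters.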
Limitations & Future Work
- All experimental systems are low-dimensional toys (pendulum / spring / Duffing / DNO), with maximum dimension \(\le 4\); extension to PDEs or high-dimensional multi-body systems (e.g., molecular dynamics) remains unproven.
- Modulation is only applied at the MLP layer level; more general architectures (Transformer, Graph NN) are unexplored. If the base is a GNN, hyper-network design may become much more complex.
- Non-negative singular values are enforced by a ReLU activation in the hyper-network, but orthogonality of the shared bases is only encouraged by a soft penalty, which does not guarantee true orthogonality, especially early in training; sensitivity to this is not fully discussed.
- Interaction with time step/integrator choice is not systematically analyzed—whether symplectic integrator + modulation yields long-horizon stability remains open.
Related Work & Insights
- vs MAML / Reptile / ANIL: This work replaces high-dimensional inner-loop gradient updates with low-dimensional auto-decoding of latent codes, avoiding second-order gradients and unstable inner-loop updates, and reducing error by about 65% on Hamiltonian / GENERIC networks.
- vs CODA / FW (Kirchmeyer 2022): FW is also modulation-based, but corrects all \(\bm{W}, \bm{b}\) in every layer, making the hyper-network huge; MR/SVD greatly reduce parameters via low-rank + shared bases, while still outperforming FW.
- vs Shift modulation (Dupont 2022): Shift only modulates bias, which is too weak; SVD-like retains bias and adds shared rank-\(r\) matrices, achieving greater expressiveness with still small parameter count.
- vs LoRA: LoRA learns a low-rank update by fine-tuning, with no conditioning on the task; MR is in effect a task-conditioned LoRA whose factors are produced by a hyper-network, upgrading fine-tuning to meta-learning.
Rating
- Novelty: ⭐⭐⭐ Transfers LoRA/SVD-style modulation to structure-preserving meta-learning; method is clean but a "reasonable combination" rather than a wholly new paradigm.
- Experimental Thoroughness: ⭐⭐⭐ Covers 3 Hamiltonian + 1 GENERIC system, 6 baselines, multi-domain joint training, different shot numbers, locality sweeps, but system dimensionality is low.
- Writing Quality: ⭐⭐⭐⭐ Formulas and algorithm blocks are clear; Proposition 3.1 provides rigorous basis for low-rank selection.
- Value: ⭐⭐⭐⭐ Offers a simple, reusable meta-learning template for many-query SciML scenarios, especially suitable for parameterized ODE/Hamiltonian simulation.