Meta-learning Structure-Preserving Dynamics¶

Conference: ICML 2026
arXiv: 2508.11205
Code: None
Area: Scientific Machine Learning / Meta-learning / Structure-preserving Neural Networks
Keywords: Hamiltonian NN, GENERIC, Modulation-based Meta-learning, Low-rank Adaptation, SVD Modulation

TL;DR¶

Systematically introduces modulation-based meta-learning into Hamiltonian / GENERIC neural networks: a hyper-network maps a per-system latent code \(\bm{z}^{(k)}\) to layer-wise modulation parameters. The paper proposes two new modulations, latent multi-rank (MR) and latent SVD-like modulation, which let a shared network adapt few-shot to new parameter instances without knowing the system parameters \(\bm{\mu}\), while preserving the energy-conserving or dissipative structure.

Background & Motivation¶

Background: Structure-preserving neural networks (HNN, LNN, port-Hamiltonian NN, GENERIC/metriplectic NN) hard-code conservation laws, symplectic structure, and dissipation into the architecture, enabling physically faithful predictions on dynamical systems with known parameters \(\bm{\mu}\).

Limitations of Prior Work: These models are typically "one model per parameter instance"—any parameter change requires retraining, making many-query scenarios (e.g., families of pendulums with different masses, oscillators with different stiffness) prohibitively expensive. The few existing meta-learning extensions (Lee 2021, Song 2024) follow MAML/ANIL approaches, requiring inner-loop high-dimensional parameter updates, which are unstable and inefficient.

Key Challenge: HNN-type models only need to learn a scalar energy function \(\mathcal{H}_\Theta(\bm{q}, \bm{p})\) to specify the entire dynamics, so the dependence of the weights on the parameters \(\bm{\mu}\) is inherently low-dimensional. However, existing meta-learning methods update all of \(\Theta\) with full gradients and leave this low-dimensional structure unused.

Goal: (1) Systematically compare various modulation strategies within the Hamiltonian / GENERIC framework; (2) Design more expressive yet parameter-efficient new modulation methods; (3) Ensure strict preservation of conservation/dissipative structure after modulation.

Key Insight: Inspired by latent modulation in INR/NeRF (e.g., Dupont 2022): compress each system into a low-dimensional latent code \(\bm{z}^{(k)}\), then use a hyper-network \(\bm{f}_\text{hyper}(\bm{z}^{(k)}; \bm\phi)\) to generate small correction parameters for each layer, with base weights shared across all tasks.

Core Idea: Combining a shared base, a per-instance latent code, and layer-wise low-rank modulation captures the low-dimensional dependence on \(\bm{\mu}\) with very few trainable parameters. The SVD-like decomposition goes one step further: the base phase learns approximately orthogonal bases, so test-time adaptation reduces to adjusting a few singular-value scalars (and bias shifts) through the latent code.

Method¶

Overall Architecture¶

The input is a family of Hamiltonian / GENERIC systems \(\{\mathcal{H}^{(k)}(\bm{q}, \bm{p}) = \mathcal{H}(\bm{q}, \bm{p}; \bm{\mu}^{(k)})\}_{k=1}^{n_\mu}\), with several trajectories sampled per system. Model parameters are split as \(\Theta^{(k)} = \Theta_\text{base} \cup \Theta_\text{indv}^{(k)}\): the base is updated by meta-gradient in the outer loop, while the individual part is a system-specific latent code \(\bm{z}^{(k)}\) updated in the inner loop. The hyper-network maps \(\bm{z}^{(k)}\) to low-rank/bias correction parameters for each layer. The final \(\tilde{\mathcal{H}}(\bm{q}, \bm{p}; \Theta^{(k)})\) is an energy function conditioned on the latent, still yielding dynamics via \(\dot{\bm q} = \partial \tilde{\mathcal H} / \partial \bm p,\ \dot{\bm p} = -\partial \tilde{\mathcal H} / \partial \bm q\), so structure preservation is inherited from the base architecture.
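A minimal sketch of why conditioning the energy on a latent leaves the conservative structure intact. The toy energy `hamiltonian`, the scalar latent `z`, and the finite-difference gradient (a stand-in for autodiff) are all assumptions made for illustration, not the paper's network. Whatever scalar function is used, the field \((\partial_p \tilde{\mathcal H}, -\partial_q \tilde{\mathcal H})\) is orthogonal to the gradient of \(\tilde{\mathcal H}\), so the energy rate vanishes for every latent value.

```python
import math

def hamiltonian(q, p, z):
    """Toy latent-conditioned energy; z scales the potential (illustrative only)."""
    return 0.5 * p * p + z * (1.0 - math.cos(q))

def grad_H(H, q, p, z, eps=1e-5):
    """Central finite differences of the scalar H (a stand-in for autodiff)."""
    dH_dq = (H(q + eps, p, z) - H(q - eps, p, z)) / (2.0 * eps)
    dH_dp = (H(q, p + eps, z) - H(q, p - eps, z)) / (2.0 * eps)
    return dH_dq, dH_dp

def vector_field(H, q, p, z):
    """Hamilton's equations: q_dot = dH/dp, p_dot = -dH/dq."""
    dH_dq, dH_dp = grad_H(H, q, p, z)
    return dH_dp, -dH_dq

def energy_rate(H, q, p, z):
    """dH/dt along the flow: dH/dq * q_dot + dH/dp * p_dot."""
    dH_dq, dH_dp = grad_H(H, q, p, z)
    q_dot, p_dot = vector_field(H, q, p, z)
    return dH_dq * q_dot + dH_dp * p_dot

for z in (0.5, 1.0, 2.0):  # three hypothetical latent values
    q_dot, p_dot = vector_field(hamiltonian, 0.3, 0.5, z)
    print(z, q_dot, p_dot, energy_rate(hamiltonian, 0.3, 0.5, z))
```

Changing `z` changes the vector field but not the cancellation in `energy_rate`, which is the sense in which structure preservation is inherited from the architecture rather than from the particular weights.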

Key Designs¶

Latent Multi-Rank (MR) Modulation:
- Function: Adds a rank-\(r\) instance-specific correction \(\bm{U}^{(\ell,k)} \bm{V}^{(\ell,k)\top}\) and a bias correction \(\bm{s}^{(\ell,k)}\) to each MLP layer's weights \(\bm{W}^{(\ell)}\), all generated by the hyper-network from \(\bm{z}^{(k)}\).
- Mechanism: Each layer is updated as \(\bm{h} \mapsto \sigma\left((\bm{W}^{(\ell)} + \bm{U}^{(\ell,k)} \bm{V}^{(\ell,k)\top}) \bm{h} + \bm{b}^{(\ell)} + \bm{s}^{(\ell,k)}\right)\), where \(\bm{U}, \bm{V} \in \mathbb{R}^{w_\ell \times r}\). When \(r=1\), this reduces to RO (rank-one), equivalent to a minimal LoRA-like modulation; \(r=5\) gives MR(5). Both \(\bm{U}, \bm{V}\) are instance-specific, so the hyper-network generates a new pair of rank-\(r\) factors each time.
- Design Motivation: Proposition 3.1 shows that if the local rank of \(\partial_{\bm\mu} \bm{f}\) is \(\le r\), then \(r\)-dimensional modulation suffices to capture all local parameter variations—MR leverages this by placing expressiveness in "LoRA-style low-rank matrices".
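The MR layer update above can be sketched in plain Python as follows. The function names, the width-3 layer, and all numeric values are hypothetical; in the paper the factors \(\bm{U}, \bm{V}\) and the shift \(\bm{s}\) come from the hyper-network, whereas here they are hard-coded. The sketch evaluates \(\bm{W}\bm{h} + \bm{U}(\bm{V}^\top \bm{h})\) without forming the rank-\(r\) matrix and checks it against a dense reference.

```python
import math

def matvec(M, x):
    """Dense matrix-vector product for list-of-rows matrices."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def mr_layer(W, b, U, V, s, h, act=math.tanh):
    """sigma((W + U V^T) h + b + s), evaluated as W h + U (V^T h)."""
    r = len(U[0])
    # r projection coefficients: (V^T h)_i = sum_j V[j][i] * h[j]
    coeff = [sum(V[j][i] * h[j] for j in range(len(h))) for i in range(r)]
    base = matvec(W, h)
    low_rank = matvec(U, coeff)  # U is (width x r), coeff has length r
    return [act(base[a] + low_rank[a] + b[a] + s[a]) for a in range(len(W))]

def mr_layer_dense(W, b, U, V, s, h, act=math.tanh):
    """Reference version that forms W + U V^T explicitly."""
    r = len(U[0])
    M = [[W[a][j] + sum(U[a][i] * V[j][i] for i in range(r))
          for j in range(len(h))] for a in range(len(W))]
    pre = matvec(M, h)
    return [act(pre[a] + b[a] + s[a]) for a in range(len(W))]

# Hypothetical width-3 layer with a rank-2 instance-specific correction.
W = [[0.2, -0.1, 0.0], [0.4, 0.3, -0.2], [0.0, 0.1, 0.5]]
b = [0.1, 0.0, -0.1]
U = [[0.5, 0.0], [0.1, -0.3], [0.0, 0.2]]
V = [[0.3, 0.1], [0.0, 0.4], [-0.2, 0.0]]
s = [0.05, -0.05, 0.0]
h = [1.0, -2.0, 0.5]

out_fast = mr_layer(W, b, U, V, s, h)
out_dense = mr_layer_dense(W, b, U, V, s, h)
print(out_fast, out_dense)
```

With \(r = 1\) the same code gives the RO case; setting `U` and `s` to zero recovers the unmodulated base layer.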
Latent SVD-like Modulation (Best in Paper):
- Function: Further factorizes low-rank modulation into "shared bases + instance singular values", so the hyper-network only needs to output a few scalars.
- Mechanism: Each layer is written as \(\bm{h} \mapsto \sigma\left((\bm{W}^{(\ell)} + \sum_{i=1}^r d_i^{(\ell,k)} \bm{u}_i^{(\ell)} \bm{v}_i^{(\ell)\top}) \bm{h} + \bm{b}^{(\ell)} + \bm{s}^{(\ell,k)}\right)\), where \(\bm{u}_i^{(\ell)}, \bm{v}_i^{(\ell)}\) are base parameters (updated by meta-gradient), and only the singular values \(d_i^{(\ell,k)}\) and bias \(\bm{s}^{(\ell,k)}\) are generated by the hyper-network from \(\bm{z}^{(k)}\). Soft orthogonality penalties \(\|\bm{U}^\top \bm{U} - \bm{I}\|_F\) and \(\|\bm{V}^\top \bm{V} - \bm{I}\|_F\), with \(\bm{U} = [\bm{u}_1, \dots, \bm{u}_r]\) and \(\bm{V} = [\bm{v}_1, \dots, \bm{v}_r]\) per layer, encourage orthonormal bases, while a ReLU activation in the hyper-network keeps the singular values non-negative.
- Design Motivation: The base phase learns "modulation directions invariant across systems" in \(\bm{u}_i, \bm{v}_i\); at test time, only a few singular values need to be fitted to adapt to new parameter instances—the hyper-network is tiny, test-time gradients are minimal, matching the successful pattern in INR: "learn shared bases first, then fit coefficients".
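A minimal sketch of the SVD-like layer and its soft orthogonality penalty, assuming square layers stored as lists of rows. The names `svd_layer` and `orth_penalty`, the coordinate-aligned bases, and the raw values passed through `relu` are illustrative assumptions; in the paper the singular values come from the hyper-network and the bases are learned.

```python
import math

def relu(x):
    return max(0.0, x)

def svd_layer(W, b, U, V, d, s, h, act=math.tanh):
    """sigma((W + sum_i d_i u_i v_i^T) h + b + s); u_i, v_i are columns of U, V."""
    r = len(d)
    # d_i * (v_i . h): one scalar per shared direction
    coeff = [d[i] * sum(V[j][i] * h[j] for j in range(len(h))) for i in range(r)]
    out = []
    for a in range(len(W)):
        pre = sum(W[a][j] * h[j] for j in range(len(h)))
        pre += sum(U[a][i] * coeff[i] for i in range(r))
        out.append(act(pre + b[a] + s[a]))
    return out

def orth_penalty(U):
    """Frobenius norm of U^T U - I for a (width x r) list-of-rows matrix."""
    r = len(U[0])
    total = 0.0
    for i in range(r):
        for k in range(r):
            gram = sum(row[i] * row[k] for row in U)
            total += (gram - (1.0 if i == k else 0.0)) ** 2
    return math.sqrt(total)

# Shared bases (hypothetical): first two coordinate directions, exactly orthonormal.
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W = [[0.0] * 3 for _ in range(3)]
zeros = [0.0, 0.0, 0.0]

# Raw per-instance outputs; ReLU clips the negative one to zero.
d = [relu(-1.0), relu(2.0)]
out = svd_layer(W, zeros, U, V, d, zeros, [1.0, 1.0, 1.0], act=lambda x: x)
print(out, orth_penalty(U))
```

Only `d` and `s` change from one instance to the next here; `U` and `V` stay fixed, which is what keeps the per-instance footprint small.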
Locality Regularization + Evolving Latent Code Training Protocol:
- Function: (a) Restricts instance parameters from deviating too far from the shared base; (b) Instead of resetting the latent every epoch, allows \(\bm{z}^{(k)}\) to evolve throughout training.
- Mechanism: Adds \(\lambda_z \|\bm{z}\|_2 + \lambda_\phi \|\bm\phi\|_2\) to the loss to constrain instance updates near the shared base; at test time, the latent is initialized as the Euclidean mean of training latents \(\bm{z}_\text{avg} = \tfrac{1}{n_\mu^\text{train}} \sum_k \bm{z}_\text{train}^{(k)}\), then few-shot auto-decoding is performed.
- Design Motivation: Zero-initialization (as in Dupont 2022) pushes the base to a position that can be fine-tuned by any latent, losing training signal; evolving latents allow the base and average latent to co-evolve, so test initialization lands directly in a "previously learned parameter neighborhood", yielding greater stability.
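The two ingredients above, the locality penalty and the averaged-latent initialization, are simple enough to sketch directly. The helper names and the example latents are hypothetical, and `phi` stands for the hyper-network parameters flattened into one list, which is an assumption about how the norm is taken.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def locality_penalty(z, phi, lam_z, lam_phi):
    """lambda_z * ||z||_2 + lambda_phi * ||phi||_2, added to the training loss."""
    return lam_z * l2_norm(z) + lam_phi * l2_norm(phi)

def average_latent(train_latents):
    """Euclidean mean of the training latents, used as the test-time initial guess."""
    n = len(train_latents)
    dim = len(train_latents[0])
    return [sum(z[i] for z in train_latents) / n for i in range(dim)]

# Hypothetical 2-dimensional latents for three training systems.
train_latents = [[1.0, 2.0], [3.0, 4.0], [2.0, 0.0]]
z_avg = average_latent(train_latents)
penalty = locality_penalty([3.0, 4.0], [0.0, 0.0], lam_z=0.1, lam_phi=0.01)
print(z_avg, penalty)
```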

Loss & Training¶

For Hamiltonian systems, the symplecticity loss \(\mathcal{L}_\text{symp} = \|\dot{\bm q} - \partial_{\bm p} \tilde{\mathcal H}_\Theta\|_2^2 + \|\dot{\bm p} + \partial_{\bm q} \tilde{\mathcal H}_\Theta\|_2^2\) is used; for GENERIC systems, the corresponding metriplectic loss. Training is bi-level: the outer loop updates the base \(\Theta_\text{base}\) for \(N_\text{out}\) steps, and the inner loop updates the current batch's latent codes for \(N_\text{in}\) steps. At test time, Algorithm 2 performs \(N_\text{test}\)-shot latent fitting with the base frozen.
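A toy illustration of test-time latent fitting with a frozen base, under strong simplifying assumptions: the "base" is the hand-written energy \(\tilde{\mathcal H} = p^2/2 + z q^2/2\) rather than a network, the latent `z` is a single scalar acting as the stiffness, the loss is averaged over samples, and the gradient is analytic. This is not the paper's Algorithm 2, only the same pattern of minimizing \(\mathcal{L}_\text{symp}\) over the latent alone.

```python
def symp_loss(z, data):
    """Mean of (q_dot - dH/dp)^2 + (p_dot + dH/dq)^2 for H = p^2/2 + z q^2/2."""
    total = 0.0
    for q, p, q_dot, p_dot in data:
        total += (q_dot - p) ** 2 + (p_dot + z * q) ** 2
    return total / len(data)

def symp_loss_grad(z, data):
    """Analytic derivative of symp_loss with respect to the scalar latent z."""
    return sum(2.0 * (p_dot + z * q) * q for q, p, q_dot, p_dot in data) / len(data)

def fit_latent(z_init, data, n_steps=50, lr=0.4):
    """Test-time auto-decoding: gradient steps on z only, base untouched."""
    z = z_init
    for _ in range(n_steps):
        z -= lr * symp_loss_grad(z, data)
    return z

# Few-shot data from an unseen instance with true stiffness k = 2.5 (hypothetical).
k_true = 2.5
states = [(-1.0, 0.2), (-0.5, -0.3), (0.5, 0.4), (1.0, 0.1)]
data = [(q, p, p, -k_true * q) for q, p in states]

z_fit = fit_latent(z_init=1.0, data=data)  # z_init plays the role of z_avg
print(z_fit, symp_loss(1.0, data), symp_loss(z_fit, data))
```

Because the loss is quadratic in `z` for this toy energy, plain gradient descent converges geometrically; with a real network the landscape is not this benign, which is one reason a good initial latent matters.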

Key Experimental Results¶

Main Results¶

Three energy-conserving systems (Duffing, mass-spring, pendulum) and one dissipative system (DNO) are used, each with 80 parameter instances (70 train / 10 test), 10 trajectories per instance. Metrics: \(\epsilon_\text{field}\) (relative \(\ell^2\) error on uniform grid, OOD indicator), \(\epsilon_\text{traj}\) (relative error on test trajectories), SSIM.
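For concreteness, a relative \(\ell^2\) error on a uniform grid could be computed as sketched below. The exact normalization used in the paper is not given here, so the formula \(\|\text{pred} - \text{true}\|_2 / \|\text{true}\|_2\) over flattened field values is an assumption, as are the toy fields and the grid. The tables report these errors in units of \(10^{-2}\), so an entry of 8.23 corresponds to a relative error of 0.0823.

```python
import math

def relative_l2_error(pred, true):
    """||pred - true||_2 / ||true||_2 over flattened samples."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(pred, true)))
    den = math.sqrt(sum(b ** 2 for b in true))
    return num / den

def field_on_grid(field, q_vals, p_vals):
    """Evaluate a 2D vector field on a uniform grid and flatten the components."""
    flat = []
    for q in q_vals:
        for p in p_vals:
            flat.extend(field(q, p))
    return flat

# Hypothetical pendulum-like fields: a reference and a slightly mis-scaled prediction.
true_field = lambda q, p: (p, -math.sin(q))
pred_field = lambda q, p: (p, -1.1 * math.sin(q))

grid = [-1.0 + 0.5 * i for i in range(5)]  # uniform grid on [-1, 1]
eps_field = relative_l2_error(field_on_grid(pred_field, grid, grid),
                              field_on_grid(true_field, grid, grid))
print(eps_field)
```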

System	Method	\(\epsilon_\text{field}\) (\(\times 10^{-2}\))	\(\epsilon_\text{traj}\) (\(\times 10^{-2}\))
Pendulum	Scratch	83.35	79.84
Pendulum	MAML	99.13	52.37
Pendulum	Reptile	88.72	75.73
Pendulum	FW (CODA)	8.23	10.65
Pendulum	Shift	9.76	12.88
Pendulum	RO (MR-1)	6.47	8.27
Pendulum	SVD(5)	4.62	5.33
Mass Spring	FW	1.60	1.31
Mass Spring	SVD(5)	1.51	1.12
Duffing	FW	10.30	2.78
Duffing	SVD(5)	10.03	2.30

Ablation Study¶

Configuration	Result	Notes
Multi-domain joint training (Duffing + spring + pendulum share base)	SVD(5) still best	Shows modulation can generalize across "different dynamical families"
Test-time adaptation with different shot numbers	SVD consistently best from 1-shot to 300-shot	Strong few-shot adaptation
Locality weights \(\lambda_\phi \in \{10^{-2..-4}\}\), \(\lambda_z \in \{10^{-1..-3}\}\)	SVD has lowest variance	Robust to regularization strength
Test latent initialization (zero vs \(\bm z_\text{avg}\))	\(\bm z_\text{avg}\) always better	Validates evolving-latent protocol
Dissipative DNO system	SVD(3) \(\epsilon_\text{traj} = 0.142\), Reptile / ANIL NaN/diverge	Modulation extends stably to GENERIC

Key Findings¶

Modulation-based methods (FW / Shift / MR / RO / SVD) reduce error by about 65% compared to optimization-based methods (MAML / Reptile / ANIL), indicating that in structure-preserving networks where weight dependence on parameters is naturally low-dimensional, modulation is more efficient than inner-gradient updates.
SVD(5) not only achieves the lowest error, but its hyper-network is also much smaller than that of FW (FW must output corrections to the full \(\bm{W}, \bm{b}\) of every layer; SVD outputs only \(r\) singular values plus a bias shift per layer), placing it on the Pareto front of accuracy versus parameter count.
ANIL / Reptile directly yield NaN or diverge on dissipative DNO, reflecting that inner-loop optimization is highly unstable for dynamics with entropy generation; modulation methods, which do not update main weights, remain stable.
In multi-domain joint training, a single base network fits Duffing / spring / pendulum simultaneously, with adaptation via latent switching—showing that modulation enables not just parameter fine-tuning, but "dynamical family switching".
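To give a feel for the size argument, the sketch below counts how many values the hyper-network has to emit per layer under each modulation, following the descriptions in this summary (FW corrects the full \(\bm{W}, \bm{b}\); Shift only the bias; MR emits \(\bm{U}, \bm{V}\) and a bias shift; SVD emits \(r\) singular values and a bias shift). Square layers and the width of 64 are assumptions chosen for illustration, and the count covers hyper-network outputs only, not its internal weights.

```python
def hyper_outputs_per_layer(kind, width, rank=0):
    """Number of values the hyper-network must emit for one square layer."""
    if kind == "FW":      # full correction to W (width x width) and b
        return width * width + width
    if kind == "Shift":   # bias shift only
        return width
    if kind == "MR":      # instance-specific U, V (width x rank each) plus bias shift
        return 2 * width * rank + width
    if kind == "SVD":     # rank singular values plus bias shift; bases are shared
        return rank + width
    raise ValueError(f"unknown modulation kind: {kind}")

width = 64  # hypothetical hidden width, not taken from the paper
counts = {
    "FW": hyper_outputs_per_layer("FW", width),
    "Shift": hyper_outputs_per_layer("Shift", width),
    "MR(5)": hyper_outputs_per_layer("MR", width, rank=5),
    "SVD(5)": hyper_outputs_per_layer("SVD", width, rank=5),
}
for name, n in counts.items():
    print(name, n)
```

Under these assumptions SVD costs only `rank` more outputs than Shift, while FW grows quadratically with the width.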

Highlights & Insights¶

The split "base = cross-task invariant + latent = task-specific" combined with SVD decomposition makes the INR latent modulation approach more interpretable—the magnitude of singular values directly indicates the importance of each principal direction for the current parameter instance.
Proposition 3.1 provides a clean theoretical justification for low-rank modulation: the local rank of the parameter space determines the required modulation dimension, tightly linking to the empirical "low-dimensional latent suffices" in INR.
Naturally combines modulation with Hamiltonian / GENERIC: modulation only changes the scalar value of \(\mathcal{H}_\Theta\), without breaking symplectic or metriplectic structure, making this a rare provably compatible combination of structure preservation and meta-learning.
The evolving-latent training protocol, though seemingly minor, is crucial: it ensures the base always co-evolves with the actual latents in use, avoiding distribution shift where the base adapts to 0-latent but test latents are far from zero.

Limitations & Future Work¶

All experimental systems are low-dimensional toys (pendulum / spring / Duffing / DNO), with maximum dimension \(\le 4\); extension to PDEs or high-dimensional multi-body systems (e.g., molecular dynamics) remains unproven.
Modulation is only applied at the MLP layer level; more general architectures (Transformer, Graph NN) are unexplored. If the base is a GNN, hyper-network design may become much more complex.
The SVD-like structure is only softly enforced: ReLU keeps the singular values non-negative, but the soft orthogonality penalty does not guarantee truly orthogonal bases, especially early in training; sensitivity to this is not fully discussed.
Interaction with time step/integrator choice is not systematically analyzed—whether symplectic integrator + modulation yields long-horizon stability remains open.

vs MAML / Reptile / ANIL: This work replaces high-dimensional inner-loop gradient updates with low-dimensional auto-decoding of latent codes, avoiding second-order gradients and unstable inner-loop updates, and reducing error by about 65% on Hamiltonian and GENERIC networks.
vs CODA / FW (Kirchmeyer 2022): FW is also modulation-based, but corrects all \(\bm{W}, \bm{b}\) in every layer, making the hyper-network huge; MR/SVD greatly reduce parameters via low-rank + shared bases, while still outperforming FW.
vs Shift modulation (Dupont 2022): Shift only modulates bias, which is too weak; SVD-like retains bias and adds shared rank-\(r\) matrices, achieving greater expressiveness with still small parameter count.
vs LoRA: LoRA is task-agnostic low-rank fine-tuning; MR here is actually task-conditioned LoRA + hyper-network, upgrading fine-tuning to meta-learning.

Rating¶

Novelty: ⭐⭐⭐ Transfers LoRA/SVD-style modulation to structure-preserving meta-learning; method is clean but a "reasonable combination" rather than a wholly new paradigm.
Experimental Thoroughness: ⭐⭐⭐ Covers 3 Hamiltonian + 1 GENERIC system, 6 baselines, multi-domain joint training, different shot numbers, locality sweeps, but system dimensionality is low.
Writing Quality: ⭐⭐⭐⭐ Formulas and algorithm blocks are clear; Proposition 3.1 provides rigorous basis for low-rank selection.
Value: ⭐⭐⭐⭐ Offers a simple, reusable meta-learning template for many-query SciML scenarios, especially suitable for parameterized ODE/Hamiltonian simulation.