HyperKKL: Enabling Non-Autonomous State Estimation through Dynamic Weight Conditioning¶
Conference: ICLR 2026 arXiv: 2602.22630 Code: To be confirmed Area: Scientific Computing Keywords: KKL observer, state estimation, hypernetwork, non-autonomous system, dynamical system
TL;DR¶
This paper proposes HyperKKL, which uses a hypernetwork to encode exogenous input signals and dynamically generate the transformation mapping parameters of a KKL observer, enabling state estimation for non-autonomous nonlinear systems without retraining or online gradient updates. The method is validated on four classical nonlinear systems: Duffing, Van der Pol, Lorenz, and Rössler.
Background & Motivation¶
Background: State estimation—reconstructing the full internal state of a dynamical system from partially observable measurements—is a fundamental problem in control and engineering. The KKL (Kazantzis-Kravaris/Luenberger) observer achieves state estimation by immersing a nonlinear dynamical system into a higher-dimensional stable linear latent space, and theoretically guarantees global convergence under the backward distinguishability condition.
Limitations of Prior Work:
- The core of the KKL observer requires solving an analytically intractable PDE: \(\frac{\partial \mathcal{T}}{\partial x}(x) f(x) = A \mathcal{T}(x) + B h(x)\)
- Recent neural network-based approaches (PINNs, autoencoders, etc.) can approximately solve these mappings, but nearly all target autonomous systems (no external input \(u(t)\))
- Real-world systems are almost never autonomous—robots receive motor commands, biological systems respond to external stimuli, and industrial processes are subject to time-varying disturbances
Key Challenge: When extending to non-autonomous systems, the transformation mapping \(\mathcal{T}\) must become input-dependent as \(\mathcal{T}(x, t)\), satisfying the time-varying PDE:

\(\frac{\partial \mathcal{T}}{\partial x}(x, t)\, f(x, u(t)) + \frac{\partial \mathcal{T}}{\partial t}(x, t) = A \mathcal{T}(x, t) + B h(x)\)
The additional temporal partial derivative term \(\frac{\partial \mathcal{T}}{\partial t}\) couples the transformation with the temporal evolution of the input, rendering static mappings insufficient. Existing learning methods either require retraining for each input scenario or rely on online gradient updates, severely limiting practical applicability.
Goal: HyperKKL encodes the history of the input signal into an instantaneous perturbation of the observer parameters via a hypernetwork, enabling inference-time adaptation to different exogenous input conditions without retraining or online optimization.
Method¶
Overall Architecture¶
HyperKKL employs a two-phase sequential training procedure:
- Phase 1 (Autonomous Pretraining): Under no external input (\(u \equiv 0\)), a base encoder \(\hat{\mathcal{T}}_{\theta^{\text{base}}}\) and decoder \(\hat{\mathcal{T}}^*_{\phi^{\text{base}}}\) are trained with a physics-informed loss to satisfy the autonomous KKL conditions. These parameters are frozen after training.
- Phase 2 (Hypernetwork Training): The base mappings are frozen, and only the hypernetwork parameters \(\psi\) are trained to learn the mapping from input signals to parameter perturbations.
At inference time: a new input signal → LSTM encoding → generation of parameter perturbations \(\Delta\theta, \Delta\phi\) → superimposed onto the frozen base parameters → instantaneous input-adaptive observer.
The learning objective combines a reconstruction loss with a PDE-residual term.
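The paper's exact objective is not reproduced here; a plausible form consistent with the PDE above (the weighting coefficient \(\lambda\) is an assumption) is:

```latex
\mathcal{L}(\theta, \phi)
  = \underbrace{\mathbb{E}\big[ \| \hat{\mathcal{T}}^{*}_{\phi}(\hat{\mathcal{T}}_{\theta}(x)) - x \|^{2} \big]}_{\text{reconstruction}}
  + \lambda\, \underbrace{\mathbb{E}\Big[ \big\| \tfrac{\partial \hat{\mathcal{T}}_{\theta}}{\partial x} f(x, u)
      + \tfrac{\partial \hat{\mathcal{T}}_{\theta}}{\partial t}
      - A \hat{\mathcal{T}}_{\theta} - B h(x) \big\|^{2} \Big]}_{\text{PDE residual}}
```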
Key Design 1: Dynamic HyperKKL¶
For complex systems where inputs continuously reshape the attractor geometry, a truly time-varying transformation \(\mathcal{T}(x, \theta(t))\) is required. Dynamic HyperKKL employs a residual hypernetwork that decouples the base parameters from the input-conditioned perturbation:

\(\theta(t) = \theta^{\text{base}} + \Delta\theta_{\psi}(u_{[t-w, t]}), \qquad \phi(t) = \phi^{\text{base}} + \Delta\phi_{\psi}(u_{[t-w, t]})\)
The hypernetwork consists of three components:
- Shared LSTM encoder: processes the input window \(u_{[t-w, t]}\) (window size \(w = 100\)) and outputs a hidden state \(h_t \in \mathbb{R}^{d_h}\)
- Encoder head MLP: predicts \(\Delta\theta_{\text{enc}}\) from \(h_t\)
- Decoder head MLP: predicts \(\Delta\phi_{\text{dec}}\) from \(h_t\)
Chunked prediction strategy: The full parameter perturbation is too high-dimensional to predict with a single output head. Instead, each target weight matrix \(W \in \mathbb{R}^{m \times n}\) is partitioned into blocks, and the MLP head predicts each block independently, maintaining expressiveness while keeping the output dimensionality manageable (see the sketch after the next paragraph).
Residual structure guarantee: When \(u \equiv 0\), the hypernetwork heads output \(\Delta\theta = \Delta\phi = 0\), exactly recovering the autonomous observer and ensuring no degradation in the absence of external input.
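Below is a minimal PyTorch sketch of such a residual hypernetwork with chunked prediction heads. The layer sizes, zero-initialization of the output heads, and chunking granularity are assumptions for illustration; the paper's code is not yet released.

```python
# Minimal sketch of the residual hypernetwork with chunked heads (PyTorch).
# Layer sizes, zero-init, and chunk_size are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualHyperNet(nn.Module):
    def __init__(self, u_dim, d_h, target_shapes, chunk_size=256):
        super().__init__()
        self.lstm = nn.LSTM(u_dim, d_h, batch_first=True)  # shared encoder
        self.target_shapes = target_shapes  # e.g. {"enc_W1": (64, 3), ...}
        self.chunk_size = chunk_size
        self.heads = nn.ModuleDict()
        for name, (m, n) in target_shapes.items():
            numel = m * n
            n_chunks = -(-numel // chunk_size)  # ceil division
            # One small MLP per block keeps each head's output low-dimensional.
            self.heads[name] = nn.ModuleList([
                nn.Sequential(
                    nn.Linear(d_h, d_h), nn.Tanh(),
                    nn.Linear(d_h, min(chunk_size, numel - i * chunk_size)),
                )
                for i in range(n_chunks)
            ])
        # Zero-init the final layers so the perturbation starts at exactly 0,
        # preserving the frozen autonomous observer at initialization.
        for head in self.heads.values():
            for mlp in head:
                nn.init.zeros_(mlp[-1].weight)
                nn.init.zeros_(mlp[-1].bias)

    def forward(self, u_window):
        # u_window: (batch, w, u_dim) -> last hidden state h_t: (batch, d_h)
        _, (h_n, _) = self.lstm(u_window)
        h_t = h_n[-1]
        deltas = {}
        for name, shape in self.target_shapes.items():
            chunks = [mlp(h_t) for mlp in self.heads[name]]
            deltas[name] = torch.cat(chunks, dim=-1).view(-1, *shape)
        return deltas  # added to frozen base weights: W = W_base + delta
```

At inference time the returned perturbations are simply added to the frozen base parameters, so adapting to a new input signal is a single forward pass with no gradient updates.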
The temporal partial derivative is estimated via finite differences:

\(\frac{\partial \hat{\mathcal{T}}}{\partial t}(x, t) \approx \frac{\hat{\mathcal{T}}(x, \theta(t + \Delta t)) - \hat{\mathcal{T}}(x, \theta(t))}{\Delta t}\)
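In code, this amounts to evaluating the encoder under two consecutive parameter sets (a sketch; using the sampling step as \(\Delta t\) is an assumption):

```python
# Finite-difference estimate of dT/dt: the time dependence enters only
# through the hypernetwork-conditioned parameters theta(t).
def dT_dt(T, x, theta_t, theta_t_next, dt):
    """T(x, theta): encoder evaluated with an explicit parameter set."""
    return (T(x, theta_t_next) - T(x, theta_t)) / dt
```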
Key Design 2: Static HyperKKL¶
For simpler systems where the input acts only as a bounded perturbation, the autonomous transformation \(\mathcal{T}(x)\) is retained unchanged, and a learned input-injection term is added only to the observer dynamics:

\(\dot{\hat{z}} = A \hat{z} + B y + \bar{\varphi}(\hat{z}, h_t)\)
where \(\bar{\varphi}\) is a small MLP that takes both the LSTM-encoded input context \(h_t\) and \(\hat{z}\) as inputs. Training constrains \(\bar{\varphi}\) to output zero when \(u = 0\).
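A sketch of one integration step of this observer (explicit Euler and the tensor interfaces are assumptions):

```python
import torch

def observer_step(z_hat, y, u_context, A, B, phi_bar, dt):
    """One Euler step of the Static HyperKKL observer dynamics:
    z_dot = A z_hat + B y + phi_bar(z_hat, u_context),
    where phi_bar is the small MLP and u_context the LSTM-encoded window."""
    z_dot = z_hat @ A.T + y @ B.T + phi_bar(torch.cat([z_hat, u_context], dim=-1))
    return z_hat + dt * z_dot  # the state estimate is then x_hat = decoder(z_hat)
```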
Key Design 3: Adaptive Curriculum Learning Baseline¶
As a control condition, the paper also evaluates whether training strategy alone can compensate for the limitations of a static architecture. Curriculum learning is used to grade training data by input complexity (\(\mathcal{D}_1\): constants → \(\mathcal{D}_2\): low-frequency sinusoids → ... → high-frequency mixtures), advancing to the next level when loss stagnates at the current one. This baseline addresses the question: can the non-autonomous problem be solved purely through richer training data, without architectural modification?
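A sketch of the staging logic, with the signal families taken from the description above (the stagnation criterion is an assumed implementation detail):

```python
import numpy as np

def make_stage_inputs(stage, t):
    """Generate training input signals of increasing complexity per stage."""
    if stage == 0:                       # D1: constants
        return np.full_like(t, np.random.uniform(-1, 1))
    if stage == 1:                       # D2: low-frequency sinusoids
        return np.sin(2 * np.pi * np.random.uniform(0.1, 0.5) * t)
    # later stages: high-frequency mixtures
    freqs = np.random.uniform(0.5, 5.0, size=3)
    return sum(np.sin(2 * np.pi * f * t) for f in freqs) / len(freqs)

def advance_stage(loss_history, patience=10, tol=1e-3):
    """Move to the next stage when the loss has stagnated."""
    if len(loss_history) < patience:
        return False
    recent = loss_history[-patience:]
    return (max(recent) - min(recent)) < tol
```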
Key Experimental Results¶
Main Results: State Estimation Performance on Four Nonlinear Systems (RMSE / SMAPE%)¶
Oscillatory systems (Duffing, Van der Pol); here and in the table below, ↓ marks an improvement over the Autonomous baseline:
| Method | Duffing-Zero | Duffing-Sin | Duffing-Sqr | VdP-Zero | VdP-Sin | VdP-Sqr |
|---|---|---|---|---|---|---|
| Autonomous | 0.04 (5.6) | 0.26 (26) | 0.33 (31) | 0.15 (7.0) | 0.23 (9.8) | 0.25 (10.5) |
| Curriculum | 0.27 (33) | 0.44 (41) | 0.57 (46) | 1.10 (51.4) | 1.15 (51.5) | 1.15 (51.7) |
| Static HyperKKL | 0.04 (5.6) | 0.10↓ (9.3) | 0.17↓ (14) | 0.12↓ (5.3) | 0.24 (10.2) | 0.25 (10.8) |
| Dynamic HyperKKL | 0.08 (8.2) | 0.24↓ (25) | 0.27↓ (28) | 0.12↓ (5.0) | 0.21↓ (8.6) | 0.22↓ (9.1) |
Chaotic systems (Rössler, Lorenz):
| Method | Rössler-Zero | Rössler-Sin | Rössler-Sqr | Lorenz-Zero | Lorenz-Sin | Lorenz-Sqr |
|---|---|---|---|---|---|---|
| Autonomous | 1.14 (6.7) | 1.47 (7.6) | 1.48 (8.3) | 5.56 (18) | 5.58 (18) | 5.55 (18) |
| Curriculum | 5.58 (35) | 5.94 (37) | 5.61 (38) | 11.5 (41) | 11.6 (42) | 11.6 (42) |
| Static HyperKKL | 1.14 (6.7) | 1.70 (10) | 1.75 (12) | 5.56 (18) | 16.3 (52) | 16.2 (51) |
| Dynamic HyperKKL | 1.01↓ (5.1) | 1.38↓ (6.0) | 1.36↓ (6.9) | 6.67 (22) | 6.67 (22) | 6.66 (22) |
Key findings:
- Static HyperKKL achieves the best performance on low-dimensional oscillatory systems: RMSE on Duffing with sinusoidal input is reduced by 62% (0.26 → 0.10), consistent with theoretical expectations—the attractor of a low-dimensional oscillator shifts smoothly with the input, making a static transformation sufficient.
- Curriculum learning fails across the board: Performance is worse than the autonomous baseline on all systems under all input conditions (e.g., VdP-Zero: 0.15 → 1.10), indicating that the bottleneck is representational rather than a matter of training data.
- The Lorenz system exposes a fundamental limitation: The autonomous baseline achieves the best performance (RMSE ≈ 5.5), Static HyperKKL degrades catastrophically (16.3), and Dynamic HyperKKL also exhibits mild degradation (6.67).
Ablation Study: Decoupled Analysis of Architecture vs. Training¶
| Analysis Dimension | Conclusion | Evidence |
|---|---|---|
| Curriculum training vs. autonomous-only training | Curriculum learning is harmful | Performance inferior to the autonomous baseline on all systems |
| Static vs. Dynamic | System complexity determines the choice | Static for low-dimensional, Dynamic for chaotic |
| Input encoding method | LSTM outperforms MLP | Temporal aggregation is critical for chaotic systems |
| Recovery at \(u=0\) | All hypernetwork methods correctly recover autonomous performance | \(\Delta\theta \to 0\) verification successful |
| Lorenz specificity | High-sensitivity attractor causes hypernetwork conditioning to introduce noise | Small errors amplify exponentially along unstable manifolds |
Rating¶
Rating: ⭐⭐⭐⭐
Strengths:
- Clearly extends the KKL observer from autonomous to non-autonomous systems, filling a practical gap in learning-based KKL methods
- The two-phase training scheme (autonomous pretraining + hypernetwork fine-tuning) and residual architecture design are well-motivated and guarantee graceful recovery at \(u=0\)
- The chunked prediction strategy balances the expressive capacity of the hypernetwork with output dimensionality
- Failures on the Lorenz system are honestly reported with in-depth theoretical analysis (unstable manifolds + exponential error amplification)
- The comparison between Static and Dynamic architectures provides a practical selection guide
Limitations:
- Validation is limited to four classical low-dimensional systems (up to 3-dimensional state spaces); scalability to high-dimensional real-world systems remains unknown
- The failure on the Lorenz system exposes a fundamental limitation of hypernetwork conditioning on high-sensitivity systems, with no solution currently proposed
- The failure of the curriculum learning baseline may be partially attributable to implementation details (e.g., hyperparameter choices) rather than purely architectural constraints
- Comparisons with other non-autonomous observer methods (e.g., EKF, UKF in non-autonomous settings) are absent
- Computational cost analysis is missing—does the inference latency of the LSTM hypernetwork meet the requirements of real-time control?
Key Distinctions from Related Work:
- Unlike Niazi et al. (2025), which handles only autonomous KKL, this paper achieves non-autonomous extension via a hypernetwork
- Unlike meta-RL approaches (e.g., MAML) that require online gradient updates, HyperKKL adapts through purely forward inference
- Unlike static transformation methods, Dynamic HyperKKL explicitly models the temporal partial derivative term in the time-varying PDE