# Mechanistic Interpretability of RNNs Emulating Hidden Markov Models

- **Conference:** NeurIPS 2025
- **arXiv:** 2510.25674
- **Code:** GitHub
- **Area:** Segmentation / Interpretability
- **Keywords:** Mechanistic Interpretability, Recurrent Neural Networks, Hidden Markov Models, Stochastic Resonance, Dynamical Systems
## TL;DR
By training RNNs to emulate the emission statistics of HMMs, then reverse-engineering the learned solutions, this work reveals how RNNs exploit noise-driven orbital dynamics, structured connectivity (noise-integrating populations + kick neurons), and self-induced stochastic resonance to implement discrete stochastic state transitions.
## Background & Motivation
- Background: RNNs are powerful tools in neuroscience for inferring the latent dynamics of neural populations, but prior work has focused primarily on relatively simple, input-driven, deterministic behaviors. HMMs, by contrast, can segment naturalistic behavior into discrete latent states with stochastic transitions.
- Limitations of Prior Work: RNNs operate in continuous state spaces, whereas HMMs rely on discrete states and stochastic transitions, an apparent incompatibility. It remains unclear whether, and how, RNNs can produce stochastic transitions between discrete states through continuous dynamics.
- Key Challenge: How can a continuous state space give rise to discrete stochastic behavior? Intuitively, RNNs should learn one fixed point per HMM state (a multi-well landscape), yet the actual solution turns out to be considerably more subtle.
- Goal: Determine how RNNs emulate the discrete probabilistic behavior of HMMs using continuous internal dynamics, and uncover the underlying computational mechanism.
- Approach: Develop a training methodology (noise-driven RNN + Sinkhorn divergence) that fits RNNs to HMM emission statistics, then reverse-engineer the learned solutions at multiple levels: global dynamics → local dynamics → connectivity structure → computational principles.
- Core Idea: RNNs implement stochastic state transitions via a self-induced stochastic resonance (SISR) mechanism, a synergy of slow noise integration and fast kick triggering, realized as composable dynamical primitives that emulate HMM behavior.
## Method

### Overall Architecture
The training pipeline consists of three steps: (A) noise input \(x_t \sim \mathcal{N}(0, I_d)\) → (B) Vanilla RNN + Gumbel-Softmax → (C) Sinkhorn divergence loss. Three HMM architectures are considered: linear chain, fully connected, and ring.
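Below is a minimal PyTorch sketch of steps (A) and (B), with illustrative sizes; the class and its names are assumptions of this summary, not the authors' code. Step (C), the Sinkhorn loss, is sketched under Loss & Training below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseDrivenRNN(nn.Module):
    """Vanilla ReLU RNN driven by i.i.d. Gaussian noise, with a Gumbel-Softmax
    readout producing differentiable categorical samples (hypothetical sketch)."""
    def __init__(self, d=10, hidden=150, n_states=3):
        super().__init__()
        self.W_ih = nn.Linear(d, hidden, bias=False)       # input weights
        self.W_hh = nn.Linear(hidden, hidden, bias=False)  # recurrent weights
        self.readout = nn.Linear(hidden, n_states)         # logits over emission symbols

    def forward(self, T, batch, tau=1.0):
        h = torch.zeros(batch, self.W_hh.out_features)
        emissions = []
        for _ in range(T):
            x = torch.randn(batch, self.W_ih.in_features)  # (A) x_t ~ N(0, I_d)
            h = F.relu(self.W_hh(h) + self.W_ih(x))        # h_t = ReLU(h_{t-1} W_hh^T + x_t W_ih^T)
            emissions.append(F.gumbel_softmax(self.readout(h), tau=tau))  # (B)
        return torch.stack(emissions, dim=1)               # (batch, T, n_states)
```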
### Key Designs
1. Noise-Driven RNN Training Paradigm
- Function: Enables RNNs to learn the stochastic transition dynamics of HMMs.
- Mechanism: A standard Vanilla RNN (\(h_t = \text{ReLU}(h_{t-1}W_{hh}^T + x_tW_{ih}^T)\)) receives i.i.d. Gaussian inputs; outputs are converted to categorical samples via Gumbel-Softmax. Sinkhorn divergence (an optimal transport distance) serves as the loss function to compare output and target distributions.
- Design Motivation: HMM target sequences are probabilistic, necessitating a loss function suited to distributional comparison. Sinkhorn divergence enables differentiable optimization through smoothed coupling matrices.
2. Multi-Level Reverse Engineering Analysis
- Function: Reveals the complete mechanistic chain by which RNNs implement HMMs.
- Mechanism:
- Global dynamics: Without input, the RNN converges to a single fixed point; under noise input, it exhibits orbital dynamics along closed trajectories, with orbital radius scaling linearly with input variance.
- Local dynamics: The state space is partitioned into three functional regions — clusters (long dwell times, locally stable), kick zones (intermediate dwell times, with unstable directions), and transition corridors (rapid, deterministic passages).
- Connectivity structure: Structured connectivity is identified comprising kick-neuron triplets and noise-integrating populations.
- Design Motivation: Standard fixed-point linearization methods cannot account for the rich dynamics observed under a single-fixed-point regime.
3. Causal Intervention Validation
- Function: Validates the causal role of the kick mechanism.
- Mechanism: Ablating kick neurons or their noise inputs (\(\mu=0\)) traps trajectories within the current cluster, preventing transitions; amplification (\(\mu=2\)) causes overshooting beyond the target cluster. Control experiments on non-noise-integrating neurons show no effect on inter-cluster switching, confirming causal specificity.
- Design Motivation: Beyond discovering the mechanism, it is essential to verify its causal sufficiency and necessity.
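A hedged sketch of the μ-scaling intervention, reusing the hypothetical `NoiseDrivenRNN` above; `target_idx` (the putative kick neurons or their noise inputs) stands in for whatever index set the reverse engineering identified:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def forward_with_intervention(model, T, batch, target_idx, mu=0.0, tau=1.0):
    """Run the RNN while scaling the noise drive into `target_idx` units by mu
    (mu=0 ablates the pathway, mu=2 amplifies it, mu=1 is the intact model)."""
    h = torch.zeros(batch, model.W_hh.out_features)
    emissions = []
    for _ in range(T):
        x = torch.randn(batch, model.W_ih.in_features)
        drive = model.W_ih(x)           # per-unit noise drive
        drive[:, target_idx] *= mu      # intervene only on the chosen units
        h = F.relu(model.W_hh(h) + drive)
        emissions.append(F.gumbel_softmax(model.readout(h), tau=tau))
    return torch.stack(emissions, dim=1)
```

Comparing cluster dwell times under μ ∈ {0, 1, 2} then reproduces the qualitative pattern summarized in the Ablation Study table below.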
### Loss & Training
- Loss function: Sinkhorn divergence, comparing the distribution of the RNN output sequence \(Y\) against the HMM target sequence \(Y^*\) (a minimal sketch follows this list)
- Evaluation metrics: Euclidean distance (global reconstruction error), transition matrix, marginal frequencies, output volatility
- Hyperparameters: hidden size \(|h| \in \{50, 150, 200\}\), input dimension \(d \in \{1, 10, 100, 200\}\)
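For concreteness, here is a minimal log-domain sketch of the debiased Sinkhorn divergence between two point clouds with uniform weights; an optimized OT library (e.g. geomloss) would normally be used, and ε and the iteration count here are illustrative:

```python
import math
import torch

def sinkhorn_cost(x, y, eps=0.1, n_iter=100):
    """Entropy-regularized OT cost between uniform empirical measures
    on the rows of x and y (log-domain Sinkhorn iterations)."""
    C = torch.cdist(x, y) ** 2                      # pairwise squared distances
    n, m = C.shape
    log_a = torch.full((n,), -math.log(n))          # log uniform weights
    log_b = torch.full((m,), -math.log(m))
    f, g = torch.zeros(n), torch.zeros(m)
    for _ in range(n_iter):                         # dual potential updates
        f = -eps * torch.logsumexp((g + eps * log_b - C) / eps, dim=1)
        g = -eps * torch.logsumexp((f + eps * log_a - C.T) / eps, dim=1)
    # smoothed coupling pi_ij = a_i b_j exp((f_i + g_j - C_ij) / eps)
    pi = torch.exp((f[:, None] + g[None, :] - C) / eps
                   + log_a[:, None] + log_b[None, :])
    return (pi * C).sum()

def sinkhorn_divergence(x, y, eps=0.1):
    """Debiased form: S(x, y) = OT(x, y) - (OT(x, x) + OT(y, y)) / 2."""
    return sinkhorn_cost(x, y, eps) - 0.5 * (sinkhorn_cost(x, x, eps)
                                             + sinkhorn_cost(y, y, eps))
```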
## Key Experimental Results

### Main Results
| HMM Architecture | No. of States | Emission Statistics | Transition Matrix | Stationary Distribution |
|---|---|---|---|---|
| Linear chain | 2–5 | ✓ Exact match | ✓ | ✓ |
| Fully connected | 3 | ✓ Exact match | ✓ | ✓ |
| Ring | 4 | ✓ Exact match | ✓ | ✓ |
Characteristics across the three training phases:
| Training Phase | Dynamical Characteristics | Oscillatory Modes | Loss Behavior |
|---|---|---|---|
| Early | Single fixed point | None | Normal descent |
| Transition | Unstable | Complex eigenvalues emerge | Double descent |
| Stable | Orbital dynamics | Stable oscillations | Convergence |
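The eigenvalue entries above come from linearizing the noise-free update around the fixed point. A hedged sketch of that standard analysis (helper names are mine), assuming the `NoiseDrivenRNN` above: find \(h^*\) by minimizing \(\|F(h) - h\|^2\) for \(F(h) = \text{ReLU}(hW_{hh}^T)\), then inspect the spectrum of the Jacobian \(J = D(h^*)\,W_{hh}\), where \(D\) masks inactive ReLU units; complex eigenvalue pairs mark the onset of the orbital regime.

```python
import torch

def find_fixed_point(W, h0, steps=2000, lr=1e-2):
    """Minimize ||F(h) - h||^2 for the noise-free map F(h) = ReLU(h @ W.T)."""
    h = h0.clone().requires_grad_(True)
    opt = torch.optim.Adam([h], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (torch.relu(h @ W.T) - h).pow(2).sum()
        loss.backward()
        opt.step()
    return h.detach()

def jacobian_eigenvalues(W, h_star):
    """Spectrum of dF/dh at h*: J = D W, with D the 0/1 mask of active ReLUs."""
    mask = (h_star @ W.T > 0).float().squeeze(0)    # ReLU gate at h*
    return torch.linalg.eigvals(mask[:, None] * W)  # eigenvalues of D W

# Hypothetical usage:
# W = model.W_hh.weight.detach()
# h_star = find_fixed_point(W, torch.randn(1, W.shape[0]))
# eigs = jacobian_eigenvalues(W, h_star)
```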
### Ablation Study
| Intervention | μ=0 (Ablation) | μ=2 (Enhancement) |
|---|---|---|
| Kick neurons | Trapped in current cluster | Overshoots target cluster |
| Noise-integrating → kick pathway | Trapped in current cluster | Overshoots target cluster |
| Control neurons | Normal inter-cluster switching | Normal inter-cluster switching |
### Key Findings
- Trained RNNs possess only a single fixed point (rather than \(n\) wells); noise is a necessary condition for sustaining dynamics.
- Orbital radius scales linearly with input variance, explainable via second-order perturbation analysis.
- RNNs trained on different HMM architectures reuse the same composable dynamical primitives — multiple instances of the same basic mechanism combine to produce more complex discrete structures.
- The mechanism resembles self-induced stochastic resonance (SISR): a synergy of slow noise integration and fast kick-driven resetting.
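A hedged sketch of how the radius-variance scaling could be checked empirically (this measurement procedure is mine, not necessarily the paper's protocol): drive the trained network with noise of variance σ², measure the trajectory's mean distance from its time average, and regress against σ².

```python
import torch

@torch.no_grad()
def orbital_radius(model, sigma2, T=5000, burn_in=500):
    """Mean distance of the noise-driven trajectory from its time average."""
    h = torch.zeros(1, model.W_hh.out_features)
    traj = []
    for t in range(T):
        x = (sigma2 ** 0.5) * torch.randn(1, model.W_ih.in_features)
        h = torch.relu(model.W_hh(h) + model.W_ih(x))
        if t >= burn_in:                 # discard the transient
            traj.append(h.squeeze(0))
    H = torch.stack(traj)
    return (H - H.mean(0)).norm(dim=1).mean().item()

# Radius should grow roughly linearly in sigma^2, per the finding above:
# radii = [orbital_radius(model, s2) for s2 in (0.25, 0.5, 1.0, 2.0)]
```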
## Highlights & Insights
- A concrete mechanism by which a continuous system implements discrete stochastic behavior is identified, bridging the RNN and HMM paradigms.
- The notion of "composable dynamical primitives" is highly illuminating — complex discrete structures emerge from modular combinations of simple elementary units.
- Methodologically, the multi-level reverse engineering paradigm (global → local → single-neuron) merits broader adoption.
- Noise functions not as interference but as a computational resource — consistent with theories of stochastic resonance facilitating signal processing in the brain.
## Limitations & Future Work
- The study uses Vanilla RNNs and small HMMs; scalability to larger models and more complex HMM structures remains unvalidated.
- Only HMMs with output dimension 3 are examined; higher-dimensional or continuous-emission cases remain unexplored.
- No direct connection is drawn to experimental data from biological neural circuits.
- It remains open whether the single-fixed-point + noise-driven orbital mechanism is the unique solution for RNN emulation of HMMs, or merely one among multiple possible solutions.
## Related Work & Insights
- This work extends the tradition of RNN reverse engineering (fixed-point analysis, low-rank connectivity) to internally driven probabilistic behavior.
- It resonates with the notion of "shared dynamical motifs" from Driscoll et al. (2024), pointing to a homogenizing effect of the training environment on learned dynamics.
- Future directions: applying this framework to understand the emergence of discrete states in Transformers or SSMs.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ Uncovers the complete mechanism by which RNNs emulate HMMs via self-induced stochastic resonance — highly original.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-level analysis is highly systematic, and causal interventions are compelling, though the experimental scale is limited.
- Writing Quality: ⭐⭐⭐⭐⭐ The narrative logic, proceeding from global to local to connectivity to principle, is exceptionally clear.
- Value: ⭐⭐⭐⭐ Makes an important theoretical contribution to computational neuroscience; the concept of composable dynamical primitives has broad applicability.