Dynamical Phases of Short-Term Memory Mechanisms in RNNs
Conference: ICML 2025
arXiv: 2502.17433
Code: https://github.com/fatihdinc/dynamical-phases-stm
Area: Social Computing
Keywords: Short-term memory, RNN, dynamical phase transition, slow-point manifold, limit cycle
TL;DR
This work identifies two distinct dynamical mechanisms that support short-term memory in RNNs: slow-point manifolds (SP) and limit cycles (LC). Using toy models, it analytically derives power-law scaling of the maximum learning rate at which each mechanism can still be learned, as a function of delay duration (SP: \(\beta \approx\) 4 to 5; LC: \(\beta \approx\) 2 to 3), and validates these predictions empirically by training approximately 80,000 RNNs.
Background & Motivation
Background: Short-term memory is a core function of cognitive processing, but its neural mechanisms are still incompletely understood in systems neuroscience. Prior studies have linked memory maintenance to sequential activation patterns.
Limitations of Prior Work: Recurrent connections are believed to drive sequential dynamics, but a mechanistic understanding is lacking. Three key questions remain: Q1: What mechanism supports memory maintenance of sequential activity? Q2: What determines the choice of mechanism? Q3: How do mechanisms change with delay duration?
Key Challenge: Different internal mechanisms can produce identical activity within the trial period yet behave completely differently outside the trial, which makes them difficult to distinguish from behavioral data alone.
Goal: To systematically identify and classify dynamical mechanisms in RNNs, and to establish quantitative relationships between task and optimization parameters and mechanism selection.
Key Insight: Approach the problem from the perspective of dynamical systems theory, combining an analytical treatment of low-dimensional toy models with large-scale training of full-rank RNNs.
Core Idea: RNNs can generate the same neural sequences through two mechanisms that are equivalent within the trial. Which one emerges is governed by a power-law relationship between delay duration and learning rate, which yields a predictable phase diagram.
Method
Overall Architecture
A four-step progression: (1) observing the two mechanisms on a rank-2 RNN; (2) studying the impact of task design on mechanism selection; (3) deriving scaling laws from a toy model; (4) training approximately 80,000 full-rank RNNs to validate the theory.
Key Designs
- Delayed Activation Task:
  - Function: The simplest short-term memory task: suppress the output during \(T_{\text{delay}}\), then produce it during \(T_{\text{resp}}\).
  - Mechanism: Strips away all unnecessary complexity, leaving only the core challenge of delayed output.
  - Variants: Adding a post-response period (\(T_{\text{post}}\)) fundamentally changes the learned mechanism.
- Slow-Point Manifold Mechanism:
  - Function: Creates a slow region in the state space to achieve the delay.
  - Mechanism: The state traverses the slow-point region slowly, and the traversal time produces the delay.
  - Features: Converges to a fixed point after the trial, without repeating the sequence.
  - Scaling: The maximum learning rate at which the mechanism can be learned scales as \(\alpha_{\text{SP}} \sim T_{\text{delay}}^{-\beta_{\text{SP}}}\), where \(\beta_{\text{SP}} \in [4,5]\), a very steep decay.
- Limit Cycle Mechanism:
  - Function: Creates a closed periodic orbit to achieve the delay.
  - Mechanism: Half a period of the orbit spans the transition from suppression to activation.
  - Features: Continues to oscillate after the trial.
  - Scaling: \(\alpha_{\text{LC}} \sim T_{\text{delay}}^{-\beta_{\text{LC}}}\), where \(\beta_{\text{LC}} \in [2,3]\), a milder decay than for SP.
- Toy Model Analytical Analysis:
  - SP Model: Saddle-node bifurcation normal form \(dx/dt = x^2 + r\).
  - LC Model: \(x(t) = \sin(2\pi rt)\), where the frequency \(r\) is learnable.
  - Key Findings: \(\beta_{\text{LC}} \le \beta_{\text{SP}}\), so the maximum learning rate shrinks more slowly with delay for limit cycles, and limit cycles tolerate a larger learning rate at long delays.
- Mechanism Discrimination Index:
  - Automated classification based on spectral analysis: limit cycles have low energy at low frequencies, while slow-point manifolds have high energy at low frequencies.
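The SP toy model listed above can be explored numerically. The following is a minimal sketch, not the paper's derivation: it only measures how long the saddle-node normal form \(dx/dt = x^2 + r\) takes to cross the slow region near \(x = 0\) for small \(r > 0\). The function names, the integration bounds `x0` and `x1`, and the step size `dt` are assumptions chosen for illustration.

```python
import numpy as np

def passage_time(r, x0=-10.0, x1=10.0, dt=1e-3):
    """Forward-Euler estimate of the time dx/dt = x**2 + r takes to go from x0 to x1.

    x0, x1 and dt are illustrative choices, not values from the paper.
    """
    x, t = x0, 0.0
    while x < x1:
        x += dt * (x * x + r)
        t += dt
    return t

def passage_time_exact(r, x0=-10.0, x1=10.0):
    """Closed-form passage time obtained by integrating dx / (x**2 + r)."""
    s = np.sqrt(r)
    return (np.arctan(x1 / s) - np.arctan(x0 / s)) / s

for r in (1e-1, 1e-2, 1e-3):
    # For small r the passage time approaches pi / sqrt(r).
    print(r, passage_time(r), passage_time_exact(r), np.pi / np.sqrt(r))
```

Because the passage time grows like \(\pi/\sqrt{r}\), reaching a delay of \(T_{\text{delay}}\) in this toy model requires \(r\) on the order of \((\pi/T_{\text{delay}})^2\), so longer delays demand ever finer tuning of the parameter. How this translates into the learning-rate exponent \(\beta_{\text{SP}}\) is part of the paper's analysis and is not reproduced here.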
Loss & Training
- RNN Equation: \(\tau \frac{dr}{dt} = -r(t) + \tanh(Wr(t) + W_{\text{in}}u(t) + b + \epsilon)\)
- Loss: Standard MSE.
- Optimizer: SGD, with a systematic sweep over learning rates and delay durations.
- Scale: Approximately 80,000 RNNs (\(N=100\) neurons), with carbon emissions of about 230 kg \(\text{CO}_2\).
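To make the RNN equation above concrete, here is a minimal forward-Euler sketch of the dynamics. It is an illustration under stated assumptions, not the authors' training code: the discretization, the values of `tau` and `dt`, the zero initial state, the random weights, the pulse input, and the treatment of \(\epsilon\) as Gaussian noise are all hypothetical choices.

```python
import numpy as np

def simulate_rnn(W, W_in, b, u, tau=10.0, dt=1.0, noise_std=0.0, seed=0):
    """Integrate tau * dr/dt = -r + tanh(W r + W_in u + b + eps) with forward Euler.

    u has shape (T, n_in); the return value has shape (T, N).
    eps is modeled as Gaussian noise with standard deviation noise_std (an assumption).
    """
    rng = np.random.default_rng(seed)
    T, N = u.shape[0], W.shape[0]
    r = np.zeros(N)                      # assumed initial state
    rates = np.empty((T, N))
    for t in range(T):
        eps = noise_std * rng.standard_normal(N)
        drive = np.tanh(W @ r + W_in @ u[t] + b + eps)
        r = r + (dt / tau) * (-r + drive)
        rates[t] = r
    return rates

# Hypothetical setup: N = 100 units, one input channel, a brief input pulse.
N, n_in, T = 100, 1, 200
rng = np.random.default_rng(1)
W = rng.standard_normal((N, N)) / np.sqrt(N)
W_in = rng.standard_normal((N, n_in))
b = np.zeros(N)
u = np.zeros((T, n_in))
u[:10] = 1.0
rates = simulate_rnn(W, W_in, b, u)
print(rates.shape)
```

With `dt / tau` at most 1, each update is a convex combination of the previous state and a tanh output, so the rates stay within \([-1, 1]\) when started from zero.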
Key Experimental Results
Main Results
| Mechanism Type | Theoretical Scaling Exponent | Experimental Scaling Exponent | Match |
|---|---|---|---|
| Slow-Point Manifold (SP) | \(\beta \in [4, 5]\) | 4.05 ± 0.10 | Consistent |
| Limit Cycle (LC) | \(\beta \in [2, 3]\) | 2.72 ± 0.07 | Consistent |

| Task Variant | SP Occurrence Rate | LC Occurrence Rate | Description |
|---|---|---|---|
| No post-response period, short delay | High | Low | SP dominates |
| No post-response period, long delay | Low | High | LC dominates |
| With post-response period | Approx. 0 | High | Post-response period strongly biases towards LC |
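A scaling exponent of the kind reported in the first table is typically estimated with a straight-line fit in log-log space. The sketch below shows that generic procedure on synthetic data. The function name, the delay grid, the prefactor, and the noise level are hypothetical, and nothing here claims to match the paper's fitting pipeline or its error estimates.

```python
import numpy as np

def fit_power_law_exponent(t_delay, alpha_max):
    """Fit alpha_max ~ C * t_delay**(-beta) by least squares on log-log data.

    Returns (beta, C).
    """
    slope, intercept = np.polyfit(np.log(t_delay), np.log(alpha_max), 1)
    return -slope, np.exp(intercept)

# Synthetic example only: data generated with beta = 4 plus small multiplicative noise.
t_delay = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
rng = np.random.default_rng(0)
alpha_synth = 5.0 * t_delay ** (-4.0) * np.exp(0.05 * rng.standard_normal(t_delay.size))

beta_hat, c_hat = fit_power_law_exponent(t_delay, alpha_synth)
print(beta_hat, c_hat)
```

As a rough reading of the reported exponents: doubling the delay lowers the maximum learning rate by a factor of \(2^{\beta}\), which is about 16.6 for \(\beta = 4.05\) (SP) and about 6.6 for \(\beta = 2.72\) (LC).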
Ablation Study
| Configuration | Key Metric | Description |
|---|---|---|
| No-memory task | \(\beta = 0.38 \pm 0.02\) | Delay-dependent scaling almost disappears |
| Increasing time constant \(\tau\) | Similar scaling | Robust to different intrinsic dynamics |
| Mechanism evolution during training | SP shifts to LC | Corresponds to jumps in the loss curve |
Key Findings
- The two mechanisms produce nearly identical sequential activity within the trial window, but exhibit completely different out-of-trial behaviors.
- Adding a post-response period fundamentally changes the mechanism: networks that might otherwise learn SP almost always learn LC.
- Training approximately 80,000 RNNs reproduces the scaling predicted by the toy models: both fitted exponents fall within the predicted ranges.
- Mechanism selection forms a phase diagram over the plane of delay duration and learning rate.
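The difference in out-of-trial behavior is what makes a spectral classifier possible. The sketch below illustrates the idea on two synthetic traces: a sustained sinusoid standing in for a limit cycle, and a trace that relaxes to a fixed point standing in for a slow-point mechanism. The function name, the cutoff `f_cut`, and both test signals are assumptions; the paper's actual index may be defined differently.

```python
import numpy as np

def low_freq_energy_fraction(x, dt, f_cut):
    """Fraction of spectral energy (DC included) at frequencies below f_cut."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=dt)
    return power[freqs < f_cut].sum() / power.sum()

dt = 0.01
t = np.arange(2000) * dt                  # 20 time units of post-trial activity
lc_like = np.sin(2 * np.pi * 1.0 * t)     # keeps oscillating (limit-cycle-like)
sp_like = 1.0 - np.exp(-t / 0.5)          # settles to a fixed point (SP-like)

f_cut = 0.25                              # hypothetical low-frequency cutoff
lc_index = low_freq_energy_fraction(lc_like, dt, f_cut)
sp_index = low_freq_energy_fraction(sp_like, dt, f_cut)
print(lc_index, sp_index)
```

The oscillating trace concentrates its energy at its own frequency and scores near zero, while the settling trace is dominated by its constant component and scores near one, so thresholding the index separates the two cases in this toy setting.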
Highlights & Insights
- A complete closed loop from theory to large-scale empirical validation.
- An important warning for systems neuroscience: minor choices in experimental design can fundamentally alter the dynamical mechanisms.
- Open release of the approximately 80,000 trained RNN models.
- Mechanism evolution during training reveals "algorithmic phase transitions" in the optimization process.
Limitations & Future Work
- Toy models cannot exhaust all possible memory mechanisms.
- Only the SGD optimizer is considered.
- The network size of 100 neurons is relatively small.
- Lack of direct comparison with real neural data.
Related Work & Insights
- Rajan et al. (2016) linked sequential activity with short-term memory; this work reveals the two underlying mechanisms that generate sequences.
- The mechanism shifts during training offer an interesting analogy to work on grokking and algorithmic phase transitions.
- Insight: Dynamical systems theory provides a powerful tool for understanding qualitative changes during neural network training.
Rating
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐