Learning Dynamics of RNNs in Closed-Loop Environments

Conference: NeurIPS 2025 arXiv: 2505.13567 Code: GitHub Area: Theory / Recurrent Neural Networks / Control Theory Keywords: RNN learning dynamics, closed-loop learning, open-loop vs. closed-loop, control theory, internal representations

TL;DR

This paper establishes a mathematical theory revealing that RNNs exhibit fundamentally different learning dynamics under closed-loop (agent–environment interaction) versus open-loop (supervised learning) training. Closed-loop learning follows a three-phase process driven by the competition between short-term policy improvement and long-term stability.

Background & Motivation

Background: RNNs are widely used in neuroscience modeling and sequence tasks. Existing theoretical work primarily analyzes RNN learning dynamics and solution properties in open-loop (supervised learning) settings.

Limitations of Prior Work: Biological learning occurs in closed-loop environments—where an agent's actions influence subsequent inputs—yet a theoretical account of closed-loop RNN learning dynamics is largely absent.

Key Challenge: Open-loop analyses assume i.i.d. inputs and ignore feedback loops. In closed-loop environments, outputs affect subsequent inputs, making the learning dynamics fundamentally different.

Goal: To establish a mathematical theory of closed-loop RNN learning dynamics and reveal why and how closed-loop learning differs from its open-loop counterpart.

Key Insight: The authors adopt the classical double integrator control task and obtain an analytically tractable framework via linearized RNNs and a rank-1 connectivity assumption.

Core Idea: Closed-loop RNN learning is governed by the eigenvalue evolution of the joint agent–environment system. The learning process unfolds in three stages, fundamentally driven by the competition between short-term policy improvement and long-term system stability.

Method

Overall Architecture

The research framework consists of:

  • Environment: a discrete-time double integrator (position–velocity control task) in which only the position is observed (partial observability).
  • Agent: an RNN with \(N=100\) neurons, trained via policy gradient.
  • Joint system: a matrix \(\bm{P}\) that models the RNN and the environment as a single dynamical system.
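
To make the setup concrete, here is a minimal sketch of the environment and agent in Python. The step size `dt`, the Gaussian \(1/\sqrt{N}\) initialization, and all variable names are illustrative assumptions, not taken from the paper; the update order is chosen to match the joint matrix \(\bm{P}\) defined in the next subsection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete-time double integrator: state x = (position, velocity).
# dt is an assumed step size; the paper's exact discretization may differ.
dt = 0.1
A = np.array([[1.0, dt],
              [0.0, 1.0]])
B = np.array([[0.0],
              [dt]])
C = np.array([[1.0, 0.0]])   # partial observability: only position is read out

# Linear RNN with rank-1 recurrent connectivity W = u v^T,
# input embedding m, and linear readout z.
N = 100
m, u, v, z = (rng.standard_normal((N, 1)) / np.sqrt(N) for _ in range(4))
W = u @ v.T

def closed_loop_step(x, h):
    """One step of the agent-environment loop, matching the block
    structure of the joint matrix P in the next subsection."""
    x_next = A @ x + B @ (z.T @ h)     # action u_t = z^T h_t enters the plant
    h_next = m @ (C @ A @ x) + W @ h   # RNN driven by the propagated observation
    return x_next, h_next
```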

Key Designs

  1. Joint Closed-Loop System: The environmental state \(\bm{x}_t\) and RNN hidden state \(\bm{h}_t\) are concatenated into a joint state \(\bm{s}_t = (\bm{x}_t, \bm{h}_t)^\top\), yielding the linear dynamical system \[\bm{s}_{t+1} = \bm{P} \bm{s}_t, \quad \bm{P} = \begin{bmatrix} \bm{A} & \bm{B}\bm{z}^\top \\ \bm{m}\bm{C}\bm{A} & \bm{W} \end{bmatrix}.\] System stability is determined by the eigenvalues of \(\bm{P}\) (a numerical check appears after this list).

  2. Effective Low-Dimensional System: Under the rank-1 connectivity assumption (\(\bm{W} = \bm{u}\bm{v}^\top\)), hidden states are confined to the subspace spanned by \(\bm{m}\) and \(\bm{u}\), reducing the system to 4 dimensions controlled by four scalar order parameters (overlaps): \(\sigma_{\bm{z}\bm{m}}, \sigma_{\bm{z}\bm{u}}, \sigma_{\bm{v}\bm{m}}, \sigma_{\bm{v}\bm{u}}\).

  3. Effective Feedback Gain: The high-dimensional nonlinear RNN policy is projected into an interpretable 2D space \((k_1, k_2)\) via \[u_t \approx -k_1 x_t^{(1)} - k_2 x_t^{(2)},\] and stability regions are analyzed through the closed-loop matrix \(\bm{M}_{\text{cl}} = \bm{A} - \bm{B}\bm{K}\).
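
The three constructions above are easy to check numerically. The sketch below is self-contained; the variable names and the regression-based estimator for \((k_1, k_2)\) are illustrative assumptions, not the paper's procedure. It assembles \(\bm{P}\), reads off the four overlaps, and fits the effective gains from a short rollout:

```python
import numpy as np

rng = np.random.default_rng(0)
dt, N = 0.1, 100
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
C = np.array([[1.0, 0.0]])
m, u, v, z = (rng.standard_normal((N, 1)) / np.sqrt(N) for _ in range(4))
W = u @ v.T  # rank-1 connectivity

# Joint system matrix P over s_t = (x_t, h_t); stable iff spectral radius < 1.
P = np.block([[A,           B @ z.T],
              [m @ (C @ A), W      ]])
rho = np.max(np.abs(np.linalg.eigvals(P)))

# The four scalar order parameters (overlaps) of the effective 4D system.
sigma_zm, sigma_zu = float(z.T @ m), float(z.T @ u)
sigma_vm, sigma_vu = float(v.T @ m), float(v.T @ u)

# Effective gains (k1, k2): regress the action onto the environmental state
# along a short closed-loop rollout. At random initialization the fitted
# gains are near zero; they become meaningful as training shapes z and m.
s = np.vstack([np.array([[1.0], [0.0]]), np.zeros((N, 1))])
states, actions = [], []
for _ in range(50):
    states.append(s[:2, 0].copy())
    actions.append(float(z.T @ s[2:]))
    s = P @ s
k1, k2 = -np.linalg.lstsq(np.array(states), np.array(actions), rcond=None)[0]
print(f"rho(P) = {rho:.3f}, sigma_zm = {sigma_zm:.3f}, k = ({k1:.3f}, {k2:.3f})")
```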

Three-Phase Learning Dynamics

Phase 1 — Negative-Position Policy:

  • Loss decreases rapidly; the RNN learns a proportional control strategy \(u_t \propto -\text{position}\).
  • The characteristic polynomial simplifies to \(\chi_{\bm{P}}(\lambda) = \lambda^2 - 2\lambda + (1 - \sigma_{\bm{z}\bm{m}})\).
  • An asymmetric loss landscape drives \(\sigma_{\bm{z}\bm{m}}\) to a small negative value.
  • The system is unstable (\(\rho(\bm{P}) > 1\)), manifesting as oscillatory divergence.
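
To see why a small negative \(\sigma_{\bm{z}\bm{m}}\) produces oscillatory divergence, it suffices to solve the quadratic above: \[\lambda_\pm = 1 \pm \sqrt{\sigma_{\bm{z}\bm{m}}}, \qquad \sigma_{\bm{z}\bm{m}} < 0 \;\Rightarrow\; \lambda_\pm = 1 \pm i\sqrt{|\sigma_{\bm{z}\bm{m}}|}, \quad |\lambda_\pm| = \sqrt{1 + |\sigma_{\bm{z}\bm{m}}|} > 1,\] i.e., a complex-conjugate pair just outside the unit circle, so trajectories spiral outward.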

Phase 2 — Building a World Model:

  • Loss enters a plateau; the RNN must learn to infer the unobserved latent variable (velocity).
  • A surrogate loss is introduced: \(\mathcal{L}_{\text{surrogate}} = \alpha \cdot \mathcal{L}_\infty + (1-\alpha) \cdot \mathcal{L}_2\).
  • Gradients from the short-term control objective (\(\mathcal{L}_2\)) and the long-term stability objective (\(\mathcal{L}_\infty\)) point in nearly opposite directions, producing a zigzag learning trajectory (see the sketch below).
  • This phase ends when the dominant eigenvalue enters the unit circle, i.e., when the joint system stabilizes.
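
One way to observe this competition directly is to compare the two gradients for a simple state-feedback gain. The sketch below is illustrative only: it stands in a finite-horizon quadratic cost for \(\mathcal{L}_2\) and the spectral radius for \(\mathcal{L}_\infty\) (assumptions, not the paper's exact objectives) and uses numerical differentiation:

```python
import numpy as np

dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])

def loss_short(K, T=10):
    """Stand-in for L_2: quadratic cost over a short closed-loop rollout."""
    M = A - B @ K.reshape(1, 2)
    x, cost = np.array([[1.0], [0.0]]), 0.0
    for _ in range(T):
        x = M @ x
        cost += float(x.T @ x)
    return cost

def loss_stability(K):
    """Stand-in for L_inf: spectral radius of the closed-loop matrix."""
    M = A - B @ K.reshape(1, 2)
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def num_grad(f, K, eps=1e-5):
    """Central-difference gradient of f at K."""
    g = np.zeros_like(K)
    for i in range(K.size):
        d = np.zeros_like(K); d[i] = eps
        g[i] = (f(K + d) - f(K - d)) / (2 * eps)
    return g

K = np.array([0.2, 0.0])  # Phase-1-like policy: position feedback only
g_short, g_stab = num_grad(loss_short, K), num_grad(loss_stability, K)
cosine = g_short @ g_stab / (np.linalg.norm(g_short) * np.linalg.norm(g_stab))
print(f"cos(grad L2, grad L_inf) = {cosine:.2f}")  # negative => conflicting pulls
```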

Phase 3 — Policy Refinement:

  • Loss decreases again; trajectories become rapid and non-oscillatory.
  • A third real eigenvalue \(\lambda_3\) grows, giving rise to a second slow mode.
  • The low-dimensional effective model accurately reproduces the dynamics of this phase.

Key Experimental Results

Main Results: Closed-Loop vs. Open-Loop Learning

The authors compare two RNNs with identical architectures and initializations, trained in closed-loop and open-loop modes respectively:

| Training Mode | Initial Behavior | Mid-Training | Final Performance |
| --- | --- | --- | --- |
| Closed-loop | Similar → plateau | Three-phase progression | Stable convergence |
| Open-loop | Similar → loss spike | Closed-loop test loss deteriorates sharply | Eventually recovers, but via a different path |

Key finding: The two modes traverse entirely different trajectories in the effective feedback gain space \((k_1, k_2)\).

Validation on Multi-Frequency Tracking Task

| Feature | Observation |
| --- | --- |
| Learning phases | Staircase-shaped loss decrease; each step corresponds to acquiring one frequency component |
| Order of frequency acquisition | Low → high frequency, consistent with human motor control experiments |
| Competition effect | Performance on previously acquired frequencies temporarily degrades when a new frequency is learned |

Ablation Study

  • Linear vs. nonlinear RNNs: Nonlinear RNNs qualitatively exhibit the same three-phase dynamics.
  • Episode length \(T\): Short \(T\) drives eigenvalues upward along the imaginary axis; long \(T\) drives them downward—validating the short-term/long-term competition theory.
  • Adam optimizer: The zigzag trajectory is mitigated, as adaptive optimization alleviates gradient direction conflicts.

Key Findings

  • Closed-loop and open-loop training produce fundamentally different learning trajectories even with identical architectures and initializations.
  • The plateau in closed-loop learning is not stagnation but a necessary phase during which the system builds an internal world model.
  • Understanding closed-loop learning requires tracking the eigenvalues of the joint agent–environment system, not those of the RNN alone.

Highlights & Insights

  • Strong theoretical contribution: The first analytical theory of closed-loop RNN learning dynamics, filling an important gap in the literature.
  • Concise and illuminating physical picture: The competition between short-term policy improvement and long-term stability provides a unified explanation for the plateau, zigzag trajectories, and related phenomena.
  • Connection to neuroscience: The order in which the RNN acquires frequencies strikingly matches human motor learning experiments, suggesting shared inductive biases.
  • Low-dimensional effective model: The 100-dimensional RNN dynamics are compressed into four scalar order parameters while preserving the key learning dynamics.

Limitations & Future Work

  • The theoretical analysis relies on linearization and rank-1 weight simplification; extension to fully nonlinear settings requires further work.
  • Only direct policy-gradient training is considered; more complex RL settings, such as sparse rewards or actor-critic methods, are not addressed.
  • The effective system maintains precise spectral equivalence only within an episode; learning dynamics across episodes are not yet fully characterized.
  • The control task is relatively simple (double integrator); applicability to higher-dimensional, nonlinear environments remains to be explored.

Connections & Outlook

  • This work is to closed-loop RNNs what Saxe et al. (2013) is to feedforward networks: an analytical theory of learning dynamics.
  • It complements Bordelon et al. (2025)'s analysis of open-loop RNN learning dynamics.
  • It offers a theoretical lens on "phase transitions" in RL training (possibly analogous to grokking).
  • Practical insight: RL algorithm design should balance short-term and long-term objectives, e.g., mitigating their competition via curriculum learning or episode-length scheduling.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First mathematical theory of closed-loop RNN learning dynamics, opening a new direction.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Theoretical validation is rigorous and the multi-frequency extension is convincing, though task complexity is limited.
  • Writing Quality: ⭐⭐⭐⭐⭐ Elegant theoretical derivations, polished figures, and clear physical intuition.
  • Value: ⭐⭐⭐⭐ Significant implications for understanding closed-loop and biological learning, though direct practical applicability is limited.