On learning linear dynamical systems in context with attention layers¶

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=os7OLubIMI
Code: https://github.com/XHZhang01/icl-for-lds-data
Area: Learning Theory / In-Context Learning / Transformer Expressivity
Keywords: in-context learning, linear attention, linear dynamical systems, system identification, gradient descent equivalence

TL;DR¶

This paper provides the explicit optimal weight solution for single-layer linear attention in the In-Context Learning (ICL) task of "noisy Linear Dynamical Systems (LDS)". It proves that under a first-order autoregressive approximation (AR(1)), the optimal attention layer is equivalent to a single step of gradient descent on the autoregressive least-squares loss. For AR(s) (\(s \ge 2\)), experiments connect the optimal solution structure to Preconditioned Conjugate Gradient (PCG) methods, offering a candidate mechanistic explanation for the empirical observation that Transformer prediction accuracy matches the Kalman filter.

Background & Motivation¶

Background: Viewing Transformer ICL as "performing implicit optimization during the forward pass" is a mainstream perspective in recent theoretical research. Numerous studies have proven that under i.i.d. data (e.g., linear regression where each token is independent), the optimal weights of a single-layer linear self-attention model exactly implement one (preconditioned) gradient descent step on the context-induced least-squares loss. This picture has been characterized relatively completely by Ahn, Mahankali, von Oswald, and others.

Limitations of Prior Work: Real-world sequences (language, time series) are almost never i.i.d.; instead, the current token statistically depends on the entire history. Once the data originates from a dynamical system, proofs under the i.i.d. setting fail: high-order data moments appear in the loss, and the statistical coupling between tokens makes first-order optimality conditions difficult to solve. Consequently, what attention optimizes in non-i.i.d. settings remains largely unexplored.

Key Challenge: Empirically, Du et al. found that when predicting the next observation of an LDS, GPT-2 reaches accuracy parity with the Kalman Filter (KF), the predictor that is provably optimal when the system parameters are known. The internal mechanism behind this parity remains unexplained: a bridge is missing between the theory (what attention does) and the empirical findings (why it is so effective).

Goal: Characterize the globally optimal weights of the pretraining loss when a "single-layer linear self-attention is trained to predict the next observation \(y_T\) of an LDS sequence," and interpret them as algorithmic steps of a context-based loss.

Key Insight: The authors adopt the "improper learning" approach from system identification—instead of estimating system parameters like \(A, c, \Sigma_w, \sigma_v\), they directly approximate the "next observation" as a "linear function of the \(s\) most recent observations," resulting in an AR(s) process. Under KF convergence conditions, this autoregressive approximation approaches the true conditional expectation \(\mathbb{E}[y_{t+1} \mid y_t, \dots, y_1]\) at an exponential rate. Thus, the problem is reduced to "linear regression with non-i.i.d. noise," allowing the use of the "attention = optimizer" analytical framework.

Core Idea: Use the "improperly learned AR(s) autoregressive loss" as the context loss, proving that the optimal solution for linear attention corresponds to an optimization step on this loss—specifically, one-step gradient descent for AR(1) and PCG-type Krylov subspace iterations for AR(s) (\(s \ge 2\)).

Method¶

Overall Architecture¶

This is a theoretical analysis paper; it does not train a new model. The "Method" follows a logical chain of reduction—derivation—interpretation, aiming to translate "what linear attention learns on LDS" into "which optimization algorithm it executes on a specific explicit loss." The process consists of four steps:

Data and Approximation: Sequences are generated by an LDS corrupted by both process and observation Gaussian noise (\(x_{t+1}=Ax_t+w_t, y_t=c^\top x_t+v_t\)). Using results from Tsiamis–Pappas, the "future observation" is written as a "linear combination of the past \(s\) observations + exponentially decaying bias + noise," defining the AR(s) autoregressive loss \(L_{\mathrm{AR}(s)}(w)\) as the context objective.
Architecture and Tokens: The analysis uses single-layer, single-head, MLP-free linear attention with causal masking (omitting softmax and the projection matrix \(W_O\), while merging \(W_QW_K^\top\) into \(W_{QK}\)). Using prior token construction, the sequence is packed into an input matrix \(Y_0\), focusing only on the prediction \(\hat y_T\) of the last "test token."
Solving for Optimal Weights: The first-order optimality (stationarity) conditions of the pretraining loss \(L(\theta)\) are formulated. A banded sparse structure lemma is first used to narrow the candidate optimal weights to a small parameter class. A closed-form global optimum is then derived for the AR(1) case.
Interpretation: The optimal solution is mapped back to the forward pass—AR(1) exactly matches "performing one GD step on \(L_{\mathrm{AR}(1)}\) starting from 0"; AR(s) (\(s \ge 2\)) matches "two-step PCG / augmented Krylov" iterations, based on the sparse weight patterns and the block structure of the Hessian/gradient observed in experiments.

The core contributions lie in: the AR(s) loss reduction (Design 1), the banded structure lemma for optimality conditions (Design 2), the AR(1) closed-form optimal solution theorem (Design 3), and the PCG interpretation for AR(s) (Design 4).

Key Designs¶

1. Reducing LDS Prediction to AR(s) Autoregressive Least Squares: Connecting non-i.i.d. problems to the "Attention-as-Optimizer" framework

The difficulty in analyzing LDS data directly is that \(y_T\) depends on the entire history \(x_0, w_{1:T}, v_{1:T}\), leading to high-order moments in the loss. Instead of "proper learning" (identifying system parameters for KF), the authors adopt improper learning: expressing the future observation as a linear function of the \(s\) most recent observations. Based on the Tsiamis–Pappas expansion, under assumptions of observability and marginal stability (\(\rho(A) \le 1\)), the closed-loop matrix satisfies \(\rho(A-kc^\top) < 1\), where \(k\) denotes the steady-state Kalman gain. The bias term therefore decays exponentially with the window size \(s\), which effectively reduces the sequence to an AR(s) noisy linear regression. This yields the optimization objective:

\[\min_{w\in\mathbb{R}^s}\ L_{\mathrm{AR}(s)}(w):=\frac{1}{2(T-s-1)}\sum_{t=1}^{T-s-1}\big(y_{t+s}-w^\top\bar y_t\big)^2,\]

where \(\bar y_t:=[y_t,\dots,y_{t+s-1}]^\top\). This reduction transforms the "non-i.i.d. sequence with historical coupling" into a convex least-squares problem, enabling the application of the analytical framework that proves attention implements context optimization. It also provides a provable accuracy guarantee: results by Kozdoba et al. show that for any finite LDS family, there exists a window length \(s(\varepsilon)\) such that the AR(s) optimal predictor approximates the optimal KF, providing the theoretical entry point for the "Transformer matches KF" phenomenon.
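The reduction above can be sketched numerically. The following is a minimal illustration, not the authors' code: the function names, the noise scales, the standard-normal initial state, and the diagonal stable \(A\) are all assumptions chosen for the example. It simulates one LDS trajectory, builds the windows \(\bar y_t\) and targets \(y_{t+s}\) for \(t=1,\dots,T-s-1\), and evaluates \(L_{\mathrm{AR}(s)}\).

```python
import numpy as np

def simulate_lds(A, c, T, sigma_w, sigma_v, rng):
    """Simulate y_1..y_T from x_{t+1} = A x_t + w_t, y_t = c^T x_t + v_t."""
    d = A.shape[0]
    x = rng.standard_normal(d)  # assumed initial-state distribution
    ys = np.empty(T)
    for t in range(T):
        ys[t] = c @ x + sigma_v * rng.standard_normal()
        x = A @ x + sigma_w * rng.standard_normal(d)
    return ys

def ar_design(ys, s):
    """Rows are bar y_t = [y_t, ..., y_{t+s-1}]; targets are y_{t+s}, t = 1..T-s-1."""
    T = len(ys)
    n = T - s - 1
    X = np.stack([ys[t:t + s] for t in range(n)])
    targets = ys[s:s + n]  # last target is y_{T-1}; y_T stays held out
    return X, targets

def ar_loss(w, X, targets):
    """L_AR(s)(w) = 1 / (2 (T-s-1)) * sum of squared residuals."""
    residual = targets - X @ w
    return 0.5 * np.mean(residual ** 2)

rng = np.random.default_rng(0)
d, T, s = 5, 30, 3
A = np.diag(rng.uniform(-0.9, 0.9, size=d))  # assumed: stable diagonal A
c = np.ones(d)
ys = simulate_lds(A, c, T, sigma_w=0.1, sigma_v=0.1, rng=rng)
X, targets = ar_design(ys, s)
w_ls, *_ = np.linalg.lstsq(X, targets, rcond=None)  # minimizer of the convex loss
```

Because the loss is an ordinary least-squares objective in \(w\), any standard solver or iterative method applies to it, which is what makes the "attention as optimizer" reading possible.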

2. Banded Sparse Structure in First-Order Optimality Conditions: Narrowing the high-dimensional parameter search

Finding the optimal weights requires solving the stationarity equations. However, because the data are fully coupled, the number of terms explodes. The authors rewrite the loss using two effective parameter blocks: \(b\) (from \(W_V\)) and \(z_1, \dots, z_s\) (from \(W_{QK}\)). They find that the right-hand side of the stationarity condition exhibits a banded structure due to the centrosymmetry of the \(A\) distribution, alternating between two checkerboard zero-patterns \(B_0(s,j)\) and \(B_1(s,j)\) according to the parity of \(s+j\). Lemma 4.1 constructs a class of parameters (with specific Kronecker tensor structures for \(W_{QK}\) and \(W_V\)) that allows the left-hand side of the equation to reproduce this sparse structure. This narrows the search for the optimal solution from the entire weight space to a structured subclass, which underpins the AR(1) proof and the structural characterization for \(s \ge 2\), where a closed-form optimum is not yet proven. Experiments confirm that trained minima indeed fall within this sparse pattern.

3. AR(1) Closed-Form Global Optimal Solution = One-Step Gradient Descent: First optimality results for LDS data

For \(s=1\) (AR(1)), the authors use the structure from Lemma 4.1 and Isserlis' Theorem (which decomposes high-order moments of Gaussian data into products of second-order moments) to handle full historical dependency. They provide a closed-form global optimal weight solution for the pretraining loss \(L(\theta)\) (Equation 15, up to a non-zero scaling constant):

\[W^\star_{QK}=\begin{pmatrix}\dfrac{(T-2)\,\mathbb{E}[y_{T-1}y_T\sum_{i=1}^{T-2}y_iy_{i+1}]}{\mathbb{E}\!\big[y_{T-1}^2(\sum_{i=1}^{T-2}y_iy_{i+1})^2\big]}&0\\[2pt]0&0\end{pmatrix},\qquad W^\star_V=\begin{pmatrix}0&0\\0&1\end{pmatrix}.\]

The key insight is that the forward pass with these optimal parameters exactly equals the prediction given by one step of gradient descent on \(L_{\mathrm{AR}(1)}(w)\) starting from \(w_0=0\). This is the first extension of the "Attention implements GD" conclusion from i.i.d. settings to non-i.i.d. LDS data, and it provides a mechanistic hypothesis for the GPT-2/KF parity.
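The equivalence can be checked numerically in a few lines. This is a sketch under stated assumptions rather than the paper's exact construction: the token layout (context tokens \([y_i; y_{i+1}]\), test token \([y_{T-1}; 0]\)), the \(1/(T-2)\) normalization inside the forward pass, and the names `gd_step_prediction`, `linear_attention_prediction`, `eta`, and `alpha` are all illustrative choices. The weights follow the sparsity pattern of the closed form above, with a free scalar `alpha` in place of the moment ratio.

```python
import numpy as np

def gd_step_prediction(ys, eta):
    """One GD step on L_AR(1) from w0 = 0, then predict y_T from y_{T-1}.

    ys holds y_1..y_{T-1}; the target y_T is not observed.
    """
    prev, nxt = ys[:-1], ys[1:]           # pairs (y_t, y_{t+1}), t = 1..T-2
    grad_at_zero = -np.mean(prev * nxt)   # dL/dw evaluated at w = 0
    w1 = 0.0 - eta * grad_at_zero
    return w1 * ys[-1]

def linear_attention_prediction(ys, alpha):
    """Softmax-free attention with W_QK = diag(alpha, 0), W_V = diag(0, 1)."""
    prev, nxt = ys[:-1], ys[1:]
    ctx = np.stack([prev, nxt])           # 2 x (T-2) context tokens (assumed layout)
    query = np.array([ys[-1], 0.0])       # test token with unknown label set to 0
    W_qk = np.array([[alpha, 0.0], [0.0, 0.0]])
    W_v = np.array([[0.0, 0.0], [0.0, 1.0]])
    scores = ctx.T @ W_qk @ query         # one score per context token
    out = (W_v @ ctx) @ scores / ctx.shape[1]
    return out[1]                         # label slot of the test token

rng = np.random.default_rng(1)
ys = rng.standard_normal(29)              # stand-in for y_1..y_{T-1} with T = 30
pred_gd = gd_step_prediction(ys, eta=0.7)
pred_attn = linear_attention_prediction(ys, alpha=0.7)
```

With `alpha` equal to the step size `eta`, both functions return \(\eta\,\big(\tfrac{1}{T-2}\sum_i y_i y_{i+1}\big)\,y_{T-1}\). The normalization could equally be absorbed into `alpha`, which is consistent with the explicit \((T-2)\) factor in the closed-form entry.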

4. PCG / Augmented Krylov Interpretation for AR(s) (\(s \ge 2\)): Stronger algorithms beyond one-step GD

For \(s \ge 2\), the optimal weights no longer implement standard GD. The authors observe that the factor \(\frac{1}{T-s-1}\bar Y\) in the forward pass has a meaningful block structure that arranges the Hessian \(\nabla^2 L_{\mathrm{AR}(s)}\), the gradient at zero \(\nabla L_{\mathrm{AR}(s)}(0)\), and a scalar \(\gamma\) into a symmetric block matrix. Combined with the parameter structure from Lemma 4.1, the attention-induced predictor can be written as:

\[P\nabla^2 L_{\mathrm{AR}(s)}\,q+\xi_1 P\nabla L_{\mathrm{AR}(s)}(0)+\xi_2\,p.\]

In contrast, the predictor from two steps of PCG on \(L_{\mathrm{AR}(s)}\) with preconditioner \(P^{-1}\), starting from \(w_0=0\), is \(\tau_1 P\nabla^2 L_{\mathrm{AR}(s)}P\nabla L_{\mathrm{AR}(s)}(0)+\tau_2 P\nabla L_{\mathrm{AR}(s)}(0)\). The second terms agree up to scaling. The first terms both have the form \(P\nabla^2 L_{\mathrm{AR}(s)}\) applied to a vector (\(q\) for attention, \(P\nabla L_{\mathrm{AR}(s)}(0)\) for PCG), so they point along the same Krylov-type direction. Experiments show that the cosine similarity between these directions is 0.88 to 0.93 for AR(4). The additional \(p\) vector acts like a "correction direction" in augmented Krylov methods, compensating for ill-conditioned modes in \(P\nabla^2 L_{\mathrm{AR}(s)}\). This interpretation naturally generalizes AR(1), as PCG variants reduce to GD for 1D covariates.
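The two-step PCG form quoted above can be verified on a generic quadratic. The sketch below is illustrative only: the random design matrix, the small ridge term, and the Jacobi choice for the applied preconditioner `P` are assumptions, not the paper's setup. It runs textbook PCG from \(w_0=0\) and checks that the second iterate lies in the span of \(P\nabla L(0)\) and \(P\nabla^2 L\,P\nabla L(0)\).

```python
import numpy as np

def pcg(H, b, P, num_steps):
    """Preconditioned CG on L(w) = 0.5 w^T H w - b^T w, starting from w0 = 0.

    P is the applied preconditioner, i.e. the inverse of the preconditioning matrix.
    """
    w = np.zeros_like(b)
    r = b - H @ w                 # r = -grad L(w)
    z = P @ r
    p = z.copy()
    for _ in range(num_steps):
        Hp = H @ p
        step = (r @ z) / (p @ Hp)
        w = w + step * p
        r_new = r - step * Hp
        z_new = P @ r_new
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return w

rng = np.random.default_rng(2)
s, n = 4, 50
X = rng.standard_normal((n, s))
y = rng.standard_normal(n)
H = X.T @ X / n + 0.1 * np.eye(s)   # Hessian of the least-squares loss (ridge assumed)
b = X.T @ y / n                     # so that grad L(0) = -b
P = np.diag(1.0 / np.diag(H))       # assumed Jacobi preconditioner

w2 = pcg(H, b, P, num_steps=2)
g0 = -b
basis = np.column_stack([P @ g0, P @ H @ P @ g0])
coeffs, *_ = np.linalg.lstsq(basis, w2, rcond=None)  # plays the role of (tau_2, tau_1)
span_error = np.linalg.norm(basis @ coeffs - w2)
```

Here `span_error` is zero up to floating-point error, confirming that the two-step iterate has exactly two degrees of freedom, \(\tau_1\) and \(\tau_2\). The attention predictor adds a third direction \(p\), which is why the text compares it to augmented Krylov methods.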

Loss & Training¶

Data are generated online: a fresh \(A, c, x_0\) is sampled for each trajectory (\(T=30\), hidden dimension \(d=5\)). The objective is the pretraining loss \(L(\theta)=\mathbb{E}\big[\tfrac12(T_\theta(Y_0)_{s+1,T-s}-y_T)^2\big]\). The optimizer is AdamW with gradient clipping, linear warmup, and cosine annealing over 8000 steps. Batch sizes are increased for larger windows (starting from 3000 for AR(1)), and results are averaged over 3 random seeds.
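For readers unfamiliar with the schedule, linear warmup followed by cosine annealing can be written as a plain function of the step index. Only the 8000-step horizon comes from the summary above; `warmup_steps`, `peak_lr`, and `min_lr` are hypothetical values picked for illustration, and the function name is likewise an assumption.

```python
import math

def lr_at(step, total_steps=8000, warmup_steps=400, peak_lr=1e-3, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(k) for k in range(8001)]
```

The value returned for each step would be assigned to the optimizer's learning rate before the update; gradient clipping is applied separately to the gradients and is not part of this function.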

Key Experimental Results¶

The experiments are theoretical validation (not benchmarking against other models), checking if trained optimal attention weights match the structures predicted by Theorem 4.1 / Lemma 4.1.

Main Results: Optimal weights match theoretical predictions¶

Setting	Window	Theoretical Validation	Phenomenon
(a) \(A\) diagonal, \(c=\mathbf{1}\)	AR(1)	Theorem 4.1	\(W^\star_{QK}\), \(W^\star_V\) converge to the single-element structure (Fig 1b,c).
(a)	AR(2–4)	Lemma 4.1	Weights fall on the predicted banded checkerboard pattern (Fig 1e,f,h,i,k,l).
(b) \(A=Q^\top\mathrm{diag}(v)Q\) general orthogonal	AR(1–4)	Thm 4.1 + Lemma 4.1	Consistent alignment (Appendix Fig 3).
(c) Non-isotropic \(\Sigma_w\)	AR(1–3)	Lemma 4.1	Consistent alignment (Appendix Fig 4).
(d) \(A=P^{-1}\mathrm{diag}(v)P\) non-normal	AR(1–3)	Lemma 4.1	Consistent alignment (Appendix Fig 5).

Key Findings¶

AR(1) closed-form solution is reproducible via training: Across all four data settings, AdamW-trained \(W^\star_{QK}/W^\star_V\) converge to the structure in Theorem 4.1, showing it is a reachable minimum.
Robustness of banded sparse pattern: For \(s \ge 2\), non-zero weight positions consistently fall on the parity checkerboard predicted by Lemma 4.1.
Support for PCG interpretation: For AR(4), the cosine similarity between the attention and two-step PCG directions is 0.88 to 0.93. The correction vector \(p\) is nearly anti-aligned (cosine \(\approx -0.99\)) with the residual direction, suggesting that it pushes the predictor toward smaller residuals.

Highlights & Insights¶

Leveraging System Identification tools for non-i.i.d. hurdles: Reducing the problem to "improper learning + AR(s) approximation" transforms the historically coupled LDS problem into a convex least-squares problem, enabling context optimization analysis.
Centrosymmetry \(\rightarrow\) Banded Structure: Mapping the symmetry of the \(A\) distribution to sparse zero-patterns in optimality conditions elegantly restricts the parameter search space.
Isserlis’ Theorem for High-order Moments: Using Isserlis (Wick) theorem to decompose Gaussian moments is a transferable trick for handling non-i.i.d. expected losses.
Shift from GD to PCG Perspective: A larger window lets attention implement stronger implicit optimization (one-step GD for AR(1), PCG-type Krylov iterations for \(s \ge 2\)), which offers a candidate explanation for why Transformers can match the KF.
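As a worked instance of the Isserlis (Wick) identity mentioned above: for zero-mean, jointly Gaussian scalars, a fourth moment splits into the three possible pairings of second moments. Assuming, for illustration, that the observations are zero-mean and jointly Gaussian once the system parameters are fixed, a typical term in the numerator of the AR(1) closed form expands as

\[\mathbb{E}[y_{T-1}\,y_T\,y_i\,y_{i+1}]=\mathbb{E}[y_{T-1}y_T]\,\mathbb{E}[y_iy_{i+1}]+\mathbb{E}[y_{T-1}y_i]\,\mathbb{E}[y_Ty_{i+1}]+\mathbb{E}[y_{T-1}y_{i+1}]\,\mathbb{E}[y_Ty_i].\]

The denominator of the closed form contains sixth-order terms such as \(\mathbb{E}[y_{T-1}^2\,y_iy_{i+1}\,y_jy_{j+1}]\), which decompose in the same way into \(5!!=15\) pairings. Every quantity in the optimality conditions is thereby reduced to autocovariances of the observation process.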

Limitations & Future Work¶

Lack of closed-form solution for AR(s) (\(s \ge 2\)): Theorem 4.1 only fully solves AR(1); higher orders are characterized via structure and empirical interpretation rather than a provable global optimum.
Idealized settings: The analysis assumes single-layer, single-head, linear attention without MLP and with Gaussian noise/centrosymmetric \(A\). There is a gap between this and multi-layer softmax Transformers.
Simplified training objective: The paper uses a few-shot style loss that scores only the prediction for the last token, rather than the standard causal pretraining objective that predicts every position.

Related Work¶

vs. i.i.d. Linear Regression ICL (Ahn / Mahankali / von Oswald / Zhang): This work is the first to extend the "Attention as Optimizer" conclusion to non-i.i.d. LDS data, overcoming token history coupling via AR(s) reduction and Isserlis' Theorem.
vs. Cole et al. (2025): They provide an "existence construction" for a two-layer attention-only Transformer to simulate KF; this paper focuses on provable optimality of a single layer with experimental verification.
vs. Sander et al. (2024) / von Oswald et al. (2023b): They characterize optimal solutions for noise-free \(y_{t+1}=Ay_t\); this paper treats the full system with both process and observation Gaussian noise, which is more realistic.
vs. Transformer-simulated KF (Goel & Bartlett / Akram & Vikalo / Du et al.): Prior works either rely on known parameters/specific token augmentation or provide only empirical evidence. This work provides the first mechanistic hypothesis from the "ICL as implicit optimization" perspective.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ First provable optimality result for single-layer linear attention on non-i.i.d. LDS data.
Experimental Thoroughness: ⭐⭐⭐⭐ Validated AR(1) closed-form and AR(s) structures across settings; PCG interpretation is supported but not strictly proven.
Writing Quality: ⭐⭐⭐⭐ Clear logic from reduction to interpretation; high theoretical density requires background in system identification.
Value: ⭐⭐⭐⭐⭐ Bridges the gap between "ICL as Optimization" theory and the empirical "Transformer matches KF" phenomenon.