
Noise Stability of Transformer Models

Conference: ICLR 2026 arXiv: 2602.08287 Code: Not released Area: Interpretability Keywords: noise stability, simplicity bias, Transformer, grokking, Fourier analysis, regularization, Boolean function analysis

TL;DR

This paper proposes noise stability as a superior alternative to average sensitivity for measuring simplicity bias in Transformers, and designs a regularization method based on this metric that accelerates training by approximately 35% on synthetic tasks and 75% on language modeling.

Background & Motivation

Simplicity bias in deep learning is a central concept for understanding model generalization, interpretability, and robustness. Neural networks tend to converge to the simplest function consistent with the training data. The conventional measure of this "simplicity" originates from Boolean function analysis: average sensitivity, defined as the expected change in model output under single-token perturbations.
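
For concreteness, here is a small NumPy sketch (mine, not the paper's) that estimates average sensitivity by flipping one coordinate at a time; parity attains the maximal value \(n\), which is why it is the canonical hard case:

```python
import numpy as np

def avg_sensitivity(f, n, n_samples=10_000, seed=0):
    """Estimate average sensitivity: the expected number of coordinates
    whose individual flip changes f(x), for x uniform over {-1, +1}^n."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1, 1], size=(n_samples, n))
    fX = f(X)
    total = 0.0
    for i in range(n):
        Xi = X.copy()
        Xi[:, i] *= -1                      # flip coordinate i
        total += np.mean(f(Xi) != fX)       # influence of coordinate i
    return total

# Parity flips under every single-coordinate flip, so its average
# sensitivity is exactly n (the maximum possible).
print(avg_sensitivity(lambda x: np.prod(x, axis=1), n=8))  # ~8.0
```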

Prior work has shown that functions learned by Transformers exhibit lower sensitivity than those learned by LSTMs (Bhattamishra et al., 2022), and that Transformers struggle to learn high-sensitivity functions such as Parity (Hahn, 2020). Vasudeva et al. (2024) further linked average sensitivity to the grokking phenomenon.

However, the authors identify two critical limitations of average sensitivity:

Theoretical limitation: Its definition over Boolean domains does not extend naturally to real-valued domains, and hypergrid-based extensions are cumbersome and practically infeasible to sample.

Empirical limitation: It fails to explain the "junta-like" input dependence observed in modern LLMs such as GPT-2, Gemma, and RoBERTa, where outputs depend on only a tiny subset of input tokens (in the paper's experiments, only 5–10 out of 256 tokens show significant influence), while the junta-size upper bound given by Friedgut's theorem is as large as 1024, a substantial gap.

Method

Overall Architecture

This paper proposes replacing average sensitivity with noise stability. Unlike average sensitivity, which perturbs inputs one coordinate at a time, noise stability measures a function's robustness to correlated noise applied simultaneously to all input coordinates. This concept extends naturally to real-valued domains via the Ornstein–Uhlenbeck semigroup.

Key Designs

1. Formal Definition of Noise Stability

For a function \(f \in L^2(\gamma)\) under Gaussian measure \(\gamma\), a correlated pair \((X, Y)\) is generated by adding scaled Gaussian noise to \(X\):

\[\text{Stab}_\rho(f) := \mathbb{E}_{(X,Y)}[f(X) f(Y)]\]

where \(Y = \rho X + Z\sqrt{1-\rho^2}\), \(Z \sim \gamma\) independent of \(X\), and \(\rho \in (0,1)\) is the correlation coefficient. This quantity relates directly to the spectrum via Hermite–Fourier coefficients:

\[\text{Stab}_\rho(f) = \sum_{\alpha \in \mathbb{N}^d} \rho^{|\alpha|} \tilde{f}(\alpha)^2\]
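
To make the definition concrete, a minimal NumPy sketch (not from the paper) that Monte Carlo estimates \(\text{Stab}_\rho(f)\) via the correlated-pair construction; the test function and sample count are illustrative:

```python
import numpy as np

def noise_stability(f, d, rho, n_samples=200_000, seed=0):
    """Monte Carlo estimate of Stab_rho(f) = E[f(X) f(Y)] under the
    Gaussian measure, with Y = rho * X + sqrt(1 - rho^2) * Z."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, d))
    Z = rng.standard_normal((n_samples, d))      # independent of X
    Y = rho * X + np.sqrt(1 - rho**2) * Z        # rho-correlated copy
    return np.mean(f(X) * f(Y))

# Sanity check against the Hermite identity: for the degree-1 function
# f(x) = x_1, all Fourier mass sits at |alpha| = 1, so Stab_rho(f) = rho.
print(noise_stability(lambda x: x[:, 0], d=4, rho=0.3))  # ~0.30
```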

2. Spectral Concentration Lemma (Lemma 1)

High noise stability implies that Fourier mass is concentrated in low-degree coefficients: if \(\text{Stab}_\rho(f) \geq (1-\delta)\|f\|_2^2\) with \(\delta < \varepsilon\), then \(f\) is \((\varepsilon, T)\)-spectrally concentrated for any

\[T \geq \log_{1/\rho}\left(\frac{1}{1 - \delta/\varepsilon}\right)\]
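
The proof is a short tail bound; reconstructing the step from the Hermite identity above: since \(\|f\|_2^2 = \sum_\alpha \tilde{f}(\alpha)^2\),

\[\|f\|_2^2 - \text{Stab}_\rho(f) = \sum_{\alpha} \left(1 - \rho^{|\alpha|}\right) \tilde{f}(\alpha)^2 \leq \delta \|f\|_2^2,\]

and every coefficient of degree \(|\alpha| > T\) carries weight \(1 - \rho^{|\alpha|} \geq 1 - \rho^T\), so

\[\sum_{|\alpha| > T} \tilde{f}(\alpha)^2 \leq \frac{\delta}{1 - \rho^T}\|f\|_2^2 \leq \varepsilon \|f\|_2^2 \quad \text{whenever } \rho^T \leq 1 - \delta/\varepsilon.\]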

3. Noise Stability of a Single-Layer ReLU MLP (Theorem 5.1)

For \(\rho\)-correlated Gaussian inputs \((X, Y)\):

\[\mathbb{E}[\text{ReLU}(X) \cdot \text{ReLU}(Y)] = \frac{1}{2\pi}\left(\sqrt{1-\rho^2} + \rho(\pi - \arccos\rho)\right)\]

Its second-order Taylor expansion around \(\rho = 0\): \(\approx \frac{1}{2\pi} + \frac{1}{4}\rho + \frac{1}{4\pi}\rho^2\)
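
A quick numerical sanity check (illustrative, not the paper's code) comparing the closed form, its Taylor expansion, and a Monte Carlo estimate:

```python
import numpy as np

def relu_stab_exact(rho):
    # Theorem 5.1 closed form for rho-correlated standard Gaussians
    return (np.sqrt(1 - rho**2) + rho * (np.pi - np.arccos(rho))) / (2 * np.pi)

def relu_stab_taylor(rho):
    # Second-order expansion around rho = 0
    return 1 / (2 * np.pi) + rho / 4 + rho**2 / (4 * np.pi)

rng = np.random.default_rng(0)
rho, m = 0.6, 1_000_000
x = rng.standard_normal(m)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(m)
mc = np.mean(np.maximum(x, 0) * np.maximum(y, 0))
print(relu_stab_exact(rho), relu_stab_taylor(rho), mc)  # ~0.339, ~0.338, ~0.339
```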

4. Noise Stability of a Single Attention Layer

Three cases are analyzed for \(W = W_Q W_K^T\):

  • Identity matrix \(W = I_d\) (Theorem 5.2): In the high-dimensional limit, the attention matrix converges to \(I_n\); stability depends linearly on \(\rho\), up to an \(o(1)\) error term.
  • Low-rank matrix \(W = UU^T\): Reduced to the identity case via a Johnson–Lindenstrauss transformation.
  • Unstructured \(W \sim \mathcal{N}(0, I)\) (Theorem 5.3): The attention matrix tends toward a random permutation matrix (see the sketch below); stability is \(\rho \cdot s(\rho) \cdot \|(W_V)_{:,j}\|_2^2\), where \(s(\rho)\) is the probability that the attention pattern is preserved.
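
An illustrative NumPy check of the unstructured case (my own construction; the paper's code is not released): with \(W \sim \mathcal{N}(0, I)\) and Gaussian token embeddings, the scaled scores have standard deviation on the order of \(\sqrt{d}\), so the softmax rows saturate to near one-hot selections, consistent with the random-permutation picture of Theorem 5.3.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 4096                       # tokens, model dimension
X = rng.standard_normal((n, d))       # one Gaussian embedding per token
W = rng.standard_normal((d, d))       # unstructured W = W_Q W_K^T

scores = X @ W @ X.T / np.sqrt(d)     # entries have std ~ sqrt(d)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)     # row-wise softmax

print(A.max(axis=1).min())            # ~1.0: every row is near one-hot
print(np.argmax(A, axis=1))           # which token each position attends to
```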

5. Multi-Layer Propagation Analysis

In ReLU FFN layers, stability propagates according to the recurrence:

\[\rho_L = \frac{1}{2\pi}\left(\sqrt{1-\rho_{L-1}^2} + \rho_{L-1}(\pi - \arccos\rho_{L-1})\right)\]

Linearizing via the Taylor expansion above (\(\rho_L \approx \frac{1}{2\pi} + \frac{\rho_{L-1}}{4}\)) yields a fixed point of \(\frac{2}{3\pi} \approx 0.212\): depth attenuates stability only weakly, and the signal does not vanish entirely.
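
Iterating the recurrence numerically (a small illustrative script) confirms rapid convergence to \(\approx 0.212\) from any starting correlation, matching the linear-approximation fixed point \(2/(3\pi)\):

```python
import numpy as np

def step(rho):
    # One ReLU FFN layer's effect on the correlation (recurrence above)
    return (np.sqrt(1 - rho**2) + rho * (np.pi - np.arccos(rho))) / (2 * np.pi)

for rho0 in (0.05, 0.5, 0.99):
    rho = rho0
    for _ in range(20):
        rho = step(rho)
    print(f"start {rho0:.2f} -> depth-20 correlation {rho:.4f}")  # ~0.212
```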

Loss & Training

The noise stability regularizer (\(S=1\) encourages stability):

\[R_{M,S,\rho}(X) = (-1)^S \cdot \sum_{i=1}^C M(X)_i \cdot M(Y)_i\]

where \(Y_i\) is set equal to \(X_i\) with probability \(\frac{1+\rho}{2}\), and otherwise sampled from \(\text{uniform}([U])\). The regularized loss is:

\[\ell_{\text{reg}}(M,X) = \ell(M,X) + \gamma \cdot R_{M,S,\rho}(X)\]

This requires only one additional forward pass per iteration, incurring negligible computational overhead.
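
Since the code is not released, here is a minimal PyTorch sketch of the regularized loss as described above; the model interface, the cross-entropy task loss, and all names are my own assumptions:

```python
import torch
import torch.nn.functional as F

def regularized_loss(model, X, targets, gamma, rho, U, S=1):
    """ell_reg(M, X) = ell(M, X) + gamma * R_{M,S,rho}(X), costing one
    extra forward pass on the resampled copy Y."""
    logits_x = model(X)                              # standard forward pass
    # Correlated copy: keep each token with probability (1 + rho)/2,
    # otherwise resample it uniformly from the alphabet [U].
    keep = torch.rand(X.shape, device=X.device) < (1 + rho) / 2
    Y = torch.where(keep, X, torch.randint_like(X, high=U))
    logits_y = model(Y)                              # the one extra pass
    # R = (-1)^S * sum_i M(X)_i M(Y)_i; S = 1 rewards agreement, i.e. stability.
    reg = ((-1) ** S) * (logits_x * logits_y).sum(dim=-1).mean()
    return F.cross_entropy(logits_x, targets) + gamma * reg
```

For the grokking experiments below, this would be invoked with, e.g., \(\gamma = 0.75, \rho = 0.25\) for modular addition.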

Key Experimental Results

Main Results

Spectral concentration upper bound comparison (n=256, Fourier tail mass at degree ≥ 15):

Model     Avg. Sensitivity Bound   Noise Stability Bound
GPT-2     0.003                    0.0005
BERT      0.04                     0.02
RoBERTa   0.19                     0.02
Gemma     0.043                    0.0157

Noise stability yields tighter Fourier tail-mass estimates across all models, from 2× tighter (BERT) up to 9.5× tighter (RoBERTa).

Grokking acceleration:

Task                       Hyperparameters (γ, ρ)   Steps (no reg.)   Steps (with reg.)   Speedup
Modular addition (K=113)   (0.75, 0.25)             ~4500             ~3300               36%
Noisy k-sparse parity      (0.05, 0.05)             not reported      not reported        ~35%
WikiText-2 NTP             not reported             not reported      not reported        ~75%

Ablation Study

  • Junta-like properties of LLMs: On 256-token inputs, only 5–10 tokens exhibit significant geometric influence in GPT-2, RoBERTa, and Gemma—far fewer than the upper bound of 1024 predicted by Friedgut's theorem.
  • Positional bias: Tokens at the beginning and end of sequences consistently exhibit the highest influence, consistent with the "attention sinks" observed in the KV cache compression literature.
  • Training dynamics monitoring: In the noisy sparse parity task, the Transformer's noise stability naturally decreases during training to match the target function; changes in stability serve as a leading indicator of generalization.
  • WikiText-2 language modeling: The regularized model maintains high noise stability throughout training, while the unregularized model becomes progressively less stable.

Key Findings

  1. Noise stability characterizes spectral concentration in Transformers more precisely than average sensitivity, yielding tighter upper bounds across all evaluated models.
  2. ReLU MLP layers produce weak attenuation of stability (converging to fixed point \(2/(3\pi)\)) rather than eliminating the signal entirely.
  3. Attention layers preserve stability under identity/low-rank \(W\) (linear relationship), while unstructured \(W\) introduces an additional attenuation factor \(s(\rho)\).
  4. Noise stability regularization acts as a catalyst for grokking, consistently accelerating training across multiple tasks.

Highlights & Insights

  1. Unified theoretical framework: The Ornstein–Uhlenbeck semigroup provides a natural extension of Boolean function analysis to real-valued domains, preserving a rigorous connection to the function spectrum and offering greater analytical power than geometric influence measures.
  2. Cross-domain bridging: A new connection is established between signal propagation (C-maps/Q-maps) and simplicity bias/interpretability—noise stability can be viewed as a more concise analogue of correlation mapping.
  3. Practical regularization: The method requires only one additional forward pass, and the 75% acceleration in next-token prediction training demonstrates strong practical utility.
  4. Insights into LLM internals: The quantification of junta-like dependence in models such as GPT-2 (only 5–10 tokens exert significant influence) provides theoretical support for KV cache compression and token pruning.
  5. New training monitoring metric: Changes in noise stability can serve as a leading signal for grokking, opening avenues for adaptive training strategies.

Limitations & Future Work

  1. Theoretical analyses omit practical Transformer components such as residual connections, layer normalization, and attention masking.
  2. Language modeling experiments are conducted only at the small scale of WikiText-2, without validation on LLMs with hundreds of millions of parameters.
  3. The tightness of the stability interval propagation across multiple Transformer layers has not been sufficiently verified in practice.
  4. Regularization hyperparameters \((\gamma, \rho)\) require task-specific tuning (e.g., (0.75, 0.25) for modular addition; (0.05, 0.05) for parity).
  5. The quantitative relationship between noise stability and adversarial robustness remains unexplored.

In contrast to Vasudeva et al. (2024), who use average sensitivity to track grokking, the noise stability proposed in this paper provides stronger spectral concentration guarantees. The approach differs fundamentally from Hua et al. (2023)'s noise stability method for Transformer fine-tuning in terms of motivation (simplicity bias vs. fine-tuning stability), scope of application, and the definition of correlated noise. The most direct inspiration comes from Li & Mossel (2025)'s analysis of noise sensitivity in hierarchical functions.

Insights: Noise stability can serve as a training monitoring metric—a decrease in stability often anticipates the onset of grokking, offering new directions for adaptive training strategies. Furthermore, the quantitative analysis of junta-like dependence provides theoretical grounding for prompt engineering questions regarding which tokens genuinely matter.

Rating

  • Novelty: ⭐⭐⭐⭐ (The perspective unifying signal propagation and simplicity bias is highly novel, with rigorous theoretical analysis)
  • Experimental Thoroughness: ⭐⭐⭐ (Theoretically solid but limited in experimental scale; large-model validation is absent)
  • Writing Quality: ⭐⭐⭐⭐ (Theoretical derivations are clear, exposition is fluent, and figures are intuitive)
  • Value: ⭐⭐⭐⭐ (Provides a new tool for understanding the internal mechanisms of Transformers; the regularization method has practical potential)