# Free Energy Mixer

- Conference: ICLR 2026
- arXiv: 2602.07160
- Code: Available (linked in paper)
- Area: Time Series / LLM Efficiency
- Keywords: Attention mechanism, free energy, channel-wise selection, log-sum-exp, plug-and-play

## TL;DR
This paper proposes Free Energy Mixer (FEM), which reframes attention value retrieval as a free energy (log-sum-exp) optimization problem, enabling value-aware posterior selection at the per-channel level. FEM addresses the inherent bottleneck of standard attention—lossless storage but lossy reading—and serves as a plug-and-play replacement for softmax/linear attention/RNN/SSM, yielding consistent improvements across NLP, vision, and time series tasks.
## Background & Motivation
Background: The attention mechanism in Transformers stores all historical information losslessly in the KV-cache, then reads values out through a convex combination with probability weights, a pattern of "lossless storage but lossy reading." Existing improvements include sparse attention, low-rank projection, kernelized attention, linear RNNs/SSMs, and related approaches.
Limitations of Prior Work: The reading operation in standard attention applies identical weights to all value dimensions (\(\mathbf{o}_t = \sum_i \alpha_{t,i} \mathbf{v}_i\)), forcing the output to lie within the convex hull of the value vectors. This means per-channel index selection is infeasible—even when different channels need to retrieve information from different historical positions, a single attention head cannot accommodate this.
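To make the convex hull constraint concrete, here is a minimal NumPy sketch (illustrative, not from the paper). With \(t = 2\) cached positions and \(D = 2\) channels, no single weight vector can read channel 0 from position 0 and channel 1 from position 1 at the same time:

```python
import numpy as np

# Two cached positions (t = 2), two value channels (D = 2).
# Ideal per-channel read: channel 0 from position 0, channel 1 from
# position 1, i.e. the target output [1.0, 1.0].
V = np.array([[1.0, 0.0],   # v_0 (position 0)
              [0.0, 1.0]])  # v_1 (position 1)

# A single head applies ONE weight vector alpha to ALL channels.
for alpha in [np.array([1.0, 0.0]),    # hard-select position 0
              np.array([0.0, 1.0]),    # hard-select position 1
              np.array([0.5, 0.5])]:   # anything softmax produces in between
    o = alpha @ V  # o_t = sum_i alpha_i * v_i, same alpha for every channel
    print(alpha, "->", o)

# Every reachable output lies on the segment between [1,0] and [0,1]
# (the convex hull of the values), so [1.0, 1.0] is unreachable.
# With H heads there are at most t**H head-level selection patterns,
# versus the t**D needed for fully independent per-channel selection.
```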
Key Challenge: \(H\) attention heads provide at most \(t^H\) head-level argmax patterns, far fewer than the \(t^D\) patterns required for fully independent per-channel selection (when \(H \ll D\)). Increasing the number of heads shrinks each head's dimension, and stacking more layers cannot recover the channel-level index information lost after the first convex mixing operation.
Goal: How can attention be endowed with per-channel, value-aware selection capability without altering asymptotic complexity?
Key Insight: Value retrieval is recast as a selection problem under information constraints—given a prior distribution \(p_t\) (derived from Q/K), the goal is to find, for each channel \(j\), a posterior distribution \(q\) that maximizes expected utility while bounding the KL divergence from the prior. The solution to this variational problem takes exactly the log-sum-exp form.
Core Idea: Replace the linear readout in attention with free energy (log-sum-exp), introducing independent value-aware posterior selection for each value channel. This enables a smooth transition from average readout to per-channel hard selection without increasing asymptotic complexity.
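For readers who want the intermediate step, here is one standard way to recover the log-sum-exp form (a reconstruction from the definitions above, writing the KL multiplier as \(1/\beta\); the paper's exact derivation may differ):

```latex
% KL-penalized selection problem for channel j (multiplier 1/beta on the KL term):
%   max_q  E_{i~q}[v_{i,j}] - (1/beta) * KL(q || p_t)
% Stationarity of the Lagrangian (with a multiplier nu for sum_i q(i) = 1):
%   v_{i,j} - (1/beta) * (log(q(i)/p_t(i)) + 1) + nu = 0,
% which gives the exponentially tilted posterior
\[
  q^{*}_{t,\beta}(i) \;\propto\; p_t(i)\, e^{\beta v_{i,j}}.
\]
% Substituting q^* back, the optimal penalized utility is exactly the free energy:
\[
  \mathbb{E}_{q^{*}}[v_{i,j}] - \tfrac{1}{\beta}\,\mathrm{KL}\!\left(q^{*} \,\middle\|\, p_t\right)
  \;=\; \tfrac{1}{\beta} \log \sum_i p_t(i)\, e^{\beta v_{i,j}}
  \;=\; \mathcal{F}_{t,j}(\beta).
\]
```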
## Method

### Overall Architecture
FEM is a plug-and-play module that directly replaces the attention layer within a Transformer block. It takes any selection prior (softmax attention, GLA, Mamba, AFT) as input and enhances its reading capability through four components: (C) low-rank convolutional local conditioning, (L) LSE mixing, (T) linearized temperature learning, and (G) external gating. Its output replaces the original attention output; all other components (MLP, embeddings, etc.) remain unchanged.
### Key Designs
- Free Energy Read:
    - Function: Replaces the standard expected readout \(\mu_{t,j} = \sum_i p_t(i) v_{i,j}\) with a per-channel free energy readout.
    - Mechanism: For each channel \(j\), a constrained optimization problem is defined as \(\max_{q} \mathbb{E}_{i \sim q}[v_{i,j}] \text{ s.t. } \text{KL}(q \| p_t) \leq B_{t,j}\). Introducing a Lagrange multiplier \(\beta_{t,j}\) yields the free energy \(\mathcal{F}_{t,j}(\beta) = \frac{1}{\beta} \log \sum_i p_t(i) \exp(\beta v_{i,j})\), with corresponding posterior \(q_{t,\beta}^{(j)}(i) \propto p_t(i) \exp(\beta v_{i,j})\). As \(\beta \to 0\), this degenerates to the standard mean; as \(\beta \to \infty\), it converges to argmax hard selection (see the runnable sketch after this list).
    - Design Motivation: The free energy naturally satisfies \(\mathcal{F}_{t,j}(\beta) = \mathbb{E}_{p_t}[v_{i,j}] + \frac{1}{\beta} \text{KL}(p_t \| q^{(\beta)})\), guaranteeing the output is never below the expected value, with the degree of improvement quantified by the KL divergence. Different channels can maintain different posteriors, enabling independent retrieval.
- Linearized Temperature Learning (LTL):
    - Function: Enables dynamic per-channel temperature without sacrificing single-pass computational efficiency.
    - Mechanism: A maximum inverse temperature \(\beta_{\max}\) is fixed; a baseline \(\mu_t\) and a high-temperature branch \(F_t^{\max}\) are computed, then interpolated via a learnable gate \(\lambda_t \in [0,1]^D\): \(\tilde{F}_t(\lambda_t) = (1-\lambda_t) \odot \mu_t + \lambda_t \odot F_t^{\max}\). Both terms can be computed in a single pass (by mixing \([v_{i,j}, e^{\beta_{\max} v_{i,j}}]\)), preserving the asymptotic complexity of the underlying prior.
    - Design Motivation: Learning an independent \(\beta\) for each (step, channel) pair would break parallelism. The paper proves, via the intermediate value theorem, that the map \(\lambda \mapsto \beta^*\) is strictly monotone, so optimizing \(\lambda\) is equivalent to optimizing the implicit temperature.
- Dual-Level Gating:
    - Function: Inner gating (temperature) controls the degree of transition from averaging to selection; outer gating scales the final output magnitude.
    - Mechanism: The final readout is \(\mathbf{o}_t = \mathbf{g}_t \odot [(1-\lambda_t) \odot \mu_t + \lambda_t \odot F_t^{\max}]\), where \(\mathbf{g}_t\) is parameterized via softplus + RMSNorm.
    - Design Motivation: The outer gate corresponds to applying an exponential scaling \([\sum_i p_t(i) \exp(\beta^* v_{i,j})]^{g_{t,j}}\) on the free energy, expanding the representational space. Setting \(\lambda = 0\), \(g = 1\) recovers standard attention.
- Low-Rank Convolutional Local Conditioning (Module C):
    - Function: Extracts locally position-sensitive features via a lightweight adaptive low-rank convolution to modulate the selection prior and FEM gates.
    - Mechanism: A simple temporal decay kernel with \(O(1)\) streaming updates; total cost is \(O(TH_c)\) where \(H_c = d/16 \ll D\).
    - Design Motivation: Inspired by the local convolution design in Mamba/DeltaNet, this component introduces position sensitivity.
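Putting the pieces together, below is a minimal NumPy sketch of the readout path as reconstructed from the formulas above. The shapes, the `beta_max` value, and the function name are assumptions; the outer gate is taken as a given vector rather than the softplus + RMSNorm parameterization, and Module C is omitted:

```python
import numpy as np

def fem_readout(p, V, lam, g, beta_max=1.0):
    """One query step of the FEM readout path (sketch).

    p:   (T,)   selection prior over cached positions (e.g. a softmax attention row)
    V:   (T, D) cached value vectors
    lam: (D,)   inner gate lambda_t in [0, 1], one entry per channel
    g:   (D,)   outer gate g_t (softplus + RMSNorm in the paper; a raw vector here)
    """
    # Baseline branch: standard expected readout mu_{t,j} = sum_i p(i) v_{i,j}.
    mu = p @ V                                                     # (D,)

    # High-temperature branch: per-channel free energy at beta_max,
    # F_{t,j} = (1/beta) * log sum_i p(i) exp(beta * v_{i,j}).
    # The per-channel max shift is the usual log-sum-exp stabilization.
    m = V.max(axis=0)                                              # (D,)
    F_max = m + np.log(p @ np.exp(beta_max * (V - m))) / beta_max  # (D,)

    # Linearized temperature learning: interpolate the two branches instead of
    # solving for a per-(step, channel) beta; lam = 0 recovers the mean readout,
    # lam = 1 gives the beta_max free energy.
    F_tilde = (1.0 - lam) * mu + lam * F_max

    # Outer gate rescales the final output magnitude.
    return g * F_tilde

# Toy usage: with lam = 1 and a large beta_max, each channel independently
# approaches its own per-channel maximum over the cached values.
rng = np.random.default_rng(0)
T, D = 4, 3
p = np.full(T, 1.0 / T)
V = rng.normal(size=(T, D))
print(fem_readout(p, V, lam=np.ones(D), g=np.ones(D), beta_max=50.0))
print(V.max(axis=0))  # nearly identical to the line above
```

Note how both branches consume the prior `p` only through single weighted sums, which is why the whole readout keeps the asymptotic cost of whatever mechanism produced `p`.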
### Loss & Training
- FEM introduces no additional loss terms; standard task losses are used (cross-entropy for language modeling, MSE for time series, etc.).
- Parameter budget strategy (i): \(d = D/2\), \(r = 4\), maintaining the same \(4D^2\) parameter count as standard attention.
- In all experiments, FEM directly replaces the attention layer without modifying other hyperparameters.
## Key Experimental Results

### MAD Synthetic Benchmark — Attention Mechanism Diagnostics
| Model | Compress | Fuzzy Recall | In-Ctx Recall | Memorize | Selective Copy | Avg |
|---|---|---|---|---|---|---|
| Transformer (SMAttn) | 44.3 | 24.5 | 99.9 | 85.7 | 95.1 | 74.7 |
| DiffTransformer | 42.9 | 39.0 | 99.9 | 83.7 | 95.8 | 76.4 |
| GatedDeltaNet | 45.0 | 29.8 | 99.9 | 80.2 | 94.3 | 74.9 |
| FEM-SM | 53.1 | 43.1 | 99.9 | 85.9 | 99.3 | 80.2 |
| FEM-GLA | 53.0 | 19.1 | 99.9 | 86.3 | 99.0 | 74.9 |
### Ablation Study (MAD Average Score)
| Configuration | Avg | Notes |
|---|---|---|
| SMAttn (baseline) | 74.7 | Standard Transformer |
| +C (low-rank conv) | 76.3 | +1.6, local conditioning |
| +C,L (LSE) | 78.8 | +2.5, largest jump, LSE is the core |
| +C,L,T (temperature) | 79.4 | +0.6, temperature fine-tuning |
| +C,L,T,G (full FEM) | 80.2 | +0.8, outer gating as additional gain |
### Language Modeling (1.3B parameters, 100B tokens)
| Model | Open LLM Avg Rank↓ | Top-1 Count↑ |
|---|---|---|
| Transformer | 4.56 | 1 |
| FEM-SM | 2.06 | 9 |
| GLA | 5.63 | 0 |
| FEM-GLA | 3.88 | 1 |
### Time Series Forecasting (MSE)
| Dataset | FEM-SM | iTransformer | PatchTST | DLinear |
|---|---|---|---|---|
| Weather | 0.222 | 0.232 | 0.221 | 0.233 |
| ETTh1 | 0.419 | 0.454 | 0.413 | 0.422 |
| ETTm1 | 0.341 | 0.373 | 0.346 | 0.347 |
| ETTm2 | 0.242 | 0.265 | 0.247 | 0.252 |
## Key Findings
- LSE mixing (L) is the core component: It contributes the largest performance jump (+2.5 points) on the MAD benchmark, directly validating the critical role of free energy reading for Compress & Recall tasks.
- FEM can elevate linear-time methods (GLA, Mamba) to performance levels approaching state-of-the-art attention variants, closing the gap between linear and quadratic complexity models.
- Computational overhead is manageable: full FEM-SM training latency is 0.041 s vs. 0.027 s for the standard Transformer, and throughput is 104K vs. 154K tokens/s, roughly a 30% throughput reduction.
- At the 1.3B scale for language modeling, FEM-SM achieves the best result on 9 of 16 benchmarks, with an average rank of 2.06 (vs. 4.56 for Transformer).
## Highlights & Insights
- The precise diagnosis of the "lossless storage but lossy reading" problem is a particularly profound contribution: the paper rigorously proves the fundamental limitations of standard attention from both a geometric perspective (convex hull constraint) and an information-theoretic perspective (\(t^H\) vs. \(t^D\) capacity), and systematically analyzes why "more heads / deeper layers / per-dimension QK / richer mixers" all fail to resolve this issue.
- Mathematical elegance of the free energy variational framework: Defining value retrieval as utility maximization under KL constraints naturally yields the log-sum-exp form, with the temperature parameter corresponding to the Lagrange multiplier for the KL budget. This framework unifies a continuous spectrum from mean readout to argmax.
- The plug-and-play, prior-agnostic design makes FEM broadly applicable: it is compatible with softmax attention, GLA, Mamba, and AFT, while preserving the original asymptotic complexity. This represents a rare "free lunch" style improvement.
- The elegant design of LTL: By proving the monotonicity of the map \(\lambda \mapsto \beta^*\), the continuous spectrum of dynamic temperatures is compressed into a two-point interpolation, requiring only a single forward pass.
## Limitations & Future Work
- Validation on large-scale language models (>10B parameters) and very long contexts (>128K) is absent—the authors acknowledge limited computational resources.
- No custom CUDA kernels are implemented; the current 30% computational overhead could be optimized in engineering practice, but this is not addressed in the paper.
- The value space dimension is halved in FEM (\(d = D/2\)); although total parameter count is matched, the potential degradation in value representational capacity is not thoroughly analyzed.
- The performance advantage on time series forecasting is less pronounced than on NLP and synthetic benchmarks—FEM underperforms PatchTST on ETTh1/ETTh2.
## Related Work & Insights
- vs. Differential Transformer: DiffTrans reduces attention noise via a differential mechanism but still operates within the convex hull; FEM breaks the convex hull constraint.
- vs. Mamba/SSM: SSMs compress history into a fixed-size state, which is inherently lossy storage; FEM instead builds value-aware reading on top of lossless storage (the KV-cache).
- vs. GatedDeltaNet/DeltaNet: These linear RNNs update states using the delta rule; FEM can be stacked on top of them (FEM-GLA/Mamba) with significant performance gains.
- The theoretical framework of this paper is well-suited as both a baseline and an analytical tool for future research on attention mechanism improvements.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — Both the diagnosis of the lossy reading problem and the free energy solution are highly original contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Broad coverage across four domains (synthetic, NLP, vision, time series), but lacks very large-scale validation.
- Writing Quality: ⭐⭐⭐⭐⭐ — Theoretical analysis is exceptionally rigorous, with a complete logical chain from problem formulation to solution.
- Value: ⭐⭐⭐⭐⭐ — A universal, plug-and-play mechanism improvement with high theoretical and practical value, likely to influence future directions in attention mechanism design.