Group Representational Position Encoding (GRAPE)¶
Conference: ICLR 2026 arXiv: 2512.07805 Code: github.com/model-architectures/GRAPE Area: Signal & Communication Keywords: positional encoding, group theory, RoPE, ALiBi, Lie groups, rotary encoding, long context
TL;DR¶
This paper proposes the GRAPE framework, which unifies the multiplicative (RoPE) and additive (ALiBi/FoX) families of positional encodings in Transformers via group actions, proves that RoPE and ALiBi are exact special cases, and introduces a path-integral additive variant GRAPE-AP that outperforms existing methods on downstream tasks.
Background & Motivation¶
- Fragmentation of positional encodings: Existing methods, including absolute encodings (sinusoidal/learned), relative encodings (RoPE), linear biases (ALiBi), and forgetting mechanisms (FoX), are designed independently and lack a unified theoretical framework.
- Limitations of RoPE: RoPE relies on fixed coordinate planes and a log-uniform frequency spectrum, precluding cross-subspace coupling and content-dependent phase warping.
- Practical drawbacks of absolute and table-based encodings: Absolute encodings break translation equivariance, and table-based relative encodings introduce window-dependent computational overhead.
- Lack of theoretical guarantees: Desirable properties such as stability, monotonic distance penalty, and expressivity are scattered across disjoint methods, motivating a unified framework that integrates them.
- Long-context modeling demands: Long-sequence models require a principled design space for positional geometry.
Method¶
Overall Architecture¶
GRAPE is grounded in Lie group theory, unifying positional encodings as group actions \(\mathbf{G}(n) = \exp(n\omega\mathbf{L})\), organized into two families:
- Multiplicative GRAPE (GRAPE-M): Norm-preserving rotations in the special orthogonal group \(\mathrm{SO}(d)\)
- Additive GRAPE (GRAPE-A): Unipotent actions in the general linear group \(\mathrm{GL}\), producing linear biases
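A minimal NumPy/SciPy sketch of the two families (variable names and generator choices are illustrative, not the paper's code): a skew-symmetric generator exponentiates to a norm-preserving rotation obeying the exact relative law, while a nilpotent generator exponentiates to a unipotent map \(\mathbf{I} + n\mathbf{A}\).

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d = 8

# Multiplicative family: skew-symmetric generator -> rotation in SO(d).
a, b = rng.standard_normal(d), rng.standard_normal(d)
L = np.outer(a, b) - np.outer(b, a)              # rank-2, skew-symmetric
omega = 0.3
G_mul = lambda n: expm(n * omega * L)            # G(n) = exp(n*omega*L)

assert np.allclose(G_mul(2) @ G_mul(3), G_mul(5))        # exact relative law
assert np.allclose(G_mul(4).T @ G_mul(4), np.eye(d))     # norm preservation

# Additive family: nilpotent generator (A @ A == 0) -> unipotent map I + n*A.
A = np.zeros((d, d))
A[0, 1] = 1.0
G_add = lambda n: expm(n * A)

assert np.allclose(A @ A, 0)
assert np.allclose(G_add(5), np.eye(d) + 5 * A)          # exp(nA) = I + nA since A^2 = 0
```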
Multiplicative GRAPE¶
Core construction: Rotations are generated by rank-2 skew-symmetric generators \(\mathbf{L} = \mathbf{a}\mathbf{b}^\top - \mathbf{b}\mathbf{a}^\top \in \mathfrak{so}(d)\), so that position \(n\) acts as the rotation \(\mathbf{G}(n) = \exp(n\omega\mathbf{L})\).
Key properties:
- Exact relative law: \(\mathbf{G}(n+m) = \mathbf{G}(n)\mathbf{G}(m)\), ensuring attention scores depend only on the offset \(j-i\)
- Norm preservation: \(\mathbf{G}(n)^\top\mathbf{G}(n) = \mathbf{I}\)
- Rodrigues closed form: \(\exp(\mathbf{L}) = \mathbf{I} + \frac{\sin s}{s}\mathbf{L} + \frac{1-\cos s}{s^2}\mathbf{L}^2\), giving \(O(d)\) cost with no explicit matrix exponentiation (see the sketch below)
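A minimal sketch of the Rodrigues closed form for the rank-2 generator above, assuming \(s^2 = \|\mathbf{a}\|^2\|\mathbf{b}\|^2 - (\mathbf{a}^\top\mathbf{b})^2\) (helper names are illustrative): applying the rotation to a vector needs only a few inner products, with no explicit matrix exponential.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d = 16
a, b = rng.standard_normal(d), rng.standard_normal(d)
L = np.outer(a, b) - np.outer(b, a)
omega = 0.1
s = np.sqrt((a @ a) * (b @ b) - (a @ b) ** 2)    # rotation rate implied by the rank-2 generator

def rotate(x, n):
    """Apply exp(n*omega*L) to x via the Rodrigues closed form: O(d) per token."""
    theta = n * omega
    Lx = a * (b @ x) - b * (a @ x)               # L x, two inner products
    LLx = a * (b @ Lx) - b * (a @ Lx)            # L^2 x
    return x + (np.sin(theta * s) / s) * Lx + ((1 - np.cos(theta * s)) / s ** 2) * LLx

x, n = rng.standard_normal(d), 7
assert np.allclose(rotate(x, n), expm(n * omega * L) @ x)            # matches the matrix exponential
assert np.isclose(np.linalg.norm(rotate(x, n)), np.linalg.norm(x))   # norm-preserving
```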
Multi-subspace GRAPE-M: \(d/2\) rank-2 generators act on orthogonal 2D subspaces. RoPE is exactly recovered when subspaces correspond to standard coordinate pairs with a log-uniform frequency spectrum. Learnable orthogonal bases and non-commutative mixing further extend expressivity.
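As a quick consistency check of the RoPE special case (a sketch, not the paper's code): when each 2D plane is a standard coordinate pair and the per-plane frequencies follow RoPE's log-uniform spectrum, the block-diagonal GRAPE-M action reduces to the usual rotary embedding (interleaved-pair convention assumed).

```python
import numpy as np

d, base = 8, 10000.0
freqs = base ** (-np.arange(0, d, 2) / d)        # log-uniform spectrum, one frequency per 2D plane

def grape_m_standard_pairs(x, n):
    """Block-diagonal GRAPE-M whose planes are standard coordinate pairs."""
    out = np.empty_like(x)
    for i, w in enumerate(freqs):
        c, s = np.cos(n * w), np.sin(n * w)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i], out[2 * i + 1] = c * x0 - s * x1, s * x0 + c * x1   # 2x2 rotation
    return out

def rope(x, n):
    """Standard rotary embedding (interleaved-pair convention)."""
    cos, sin = np.repeat(np.cos(n * freqs), 2), np.repeat(np.sin(n * freqs), 2)
    x_rot = np.empty_like(x)
    x_rot[0::2], x_rot[1::2] = -x[1::2], x[0::2]
    return x * cos + x_rot * sin

x = np.random.default_rng(1).standard_normal(d)
assert np.allclose(grape_m_standard_pairs(x, 5), rope(x, 5))   # GRAPE-M with standard pairs == RoPE
```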
Additive GRAPE¶
Core construction: Via homogeneous coordinate lifting into \(\mathrm{GL}(d+k)\), a nilpotent generator \(\mathbf{A}\) (satisfying \(\mathbf{A}^2=\mathbf{0}\)) yields a unipotent action \(\mathbf{G}(n) = \exp(n\mathbf{A}) = \mathbf{I} + n\mathbf{A}\), which produces a linear bias in the attention logits.
Exact recovery of ALiBi: Using a rank-1 nilpotent generator in \(\mathrm{GL}(d+2)\), the logit becomes \(\mathbf{q}_i^\top\mathbf{k}_j + (j-i)\beta_h\).
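A minimal numerical sketch of the ALiBi recovery; the particular lifting (query carries the slope, key carries a constant) is an illustrative choice consistent with the stated result, not necessarily the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
beta = 0.25                                      # per-head ALiBi slope
q, k = rng.standard_normal(d), rng.standard_normal(d)

# Homogeneous lift into d+2 dims: the query carries the slope, the key a constant 1.
q_lift = np.concatenate([q, [beta, 0.0]])
k_lift = np.concatenate([k, [0.0, 1.0]])

# Rank-1 nilpotent generator acting only on the two extra coordinates.
A = np.zeros((d + 2, d + 2))
A[d, d + 1] = 1.0                                # A^2 = 0, so exp(nA) = I + nA (unipotent)

def G(n):
    return np.eye(d + 2) + n * A

i, j = 11, 3
# Relative law for the unipotent family: G(-i) G(j) = I + (j - i) A.
# (Equivalently, apply G(i)^{-T} to the lifted query and G(j) to the lifted key per token.)
score = q_lift @ (G(-i) @ G(j)) @ k_lift
assert np.isclose(score, q @ k + (j - i) * beta)   # ALiBi logit: q·k + (j - i)·beta
```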
Content-gated variant (GRAPE-A-QK): Replaces the fixed per-head slope with softplus-gated, query- and key-dependent slopes.
Exact recovery of FoX: Per-token forgetting scalars \(f_t\) correspond to \(\omega_t = \log f_t\), and the cumulative bias is consistent with the forgetting bias \(D_{ij}\) of FoX.
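A small sketch of the stated correspondence, assuming per-token forget gates \(f_t \in (0,1)\) (FoX internals abstracted away): setting \(\omega_t = \log f_t\) makes the accumulated additive bias equal the forgetting bias \(D_{ij} = \sum_{t=j+1}^{i}\log f_t\).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 16
f = rng.uniform(0.5, 1.0, size=T)                # per-token forget gates in (0, 1)
omega = np.log(f)                                # GRAPE-A step sizes: omega_t = log f_t

def D(i, j):
    """Cumulative forgetting bias between key position j and query position i (j <= i)."""
    return omega[j + 1 : i + 1].sum()            # sum_{t=j+1}^{i} log f_t

i, j = 12, 4
assert np.isclose(np.exp(D(i, j)), np.prod(f[j + 1 : i + 1]))   # exp of the bias = product of gates
# Added to the logits, D(i, j) <= 0 penalizes larger offsets monotonically whenever f_t < 1.
```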
Path-Integral Additive GRAPE (GRAPE-AP)¶
Building on GRAPE-A, a path-integral bias is introduced: each step \(\ell\) along the causal path from \(j\) to \(t\) contributes an edge potential \(\psi_h(t,\ell)\), giving the bias \(b_h(t,j) = \sum_{\ell=j+1}^{t}\psi_h(t,\ell)\). This bias can be combined with multiplicative GRAPE and supports causal constraints and streaming inference.
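A sketch of the path-integral bias with a hypothetical edge potential \(\psi_h(t,\ell)\) (here a softplus-gated inner product between the query at \(t\) and a per-position feature; the paper's exact potential may differ).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16
q = rng.standard_normal((T, d))                  # query per position t
u = rng.standard_normal((T, d))                  # per-position edge feature (illustrative)

def psi(t, ell):
    """Hypothetical edge potential for the step onto position ell, conditioned on the query at t."""
    return -np.logaddexp(0.0, q[t] @ u[ell])     # negative softplus: each edge adds a penalty <= 0

# b(t, j) = sum_{ell=j+1}^{t} psi(t, ell), accumulated along the causal path from j to t.
bias = np.full((T, T), -np.inf)                  # -inf above the diagonal enforces causality
for t in range(T):
    bias[t, t] = 0.0                             # empty path from a token to itself
    acc = 0.0
    for j in range(t - 1, -1, -1):               # one new edge per step, walking back from t
        acc += psi(t, j + 1)
        bias[t, j] = acc

# `bias` is added to the attention logits (q @ k.T / sqrt(d)) before the softmax.
```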
Experiments¶
Experimental Setup¶
- Based on nanoGPT / Llama architecture with only the positional encoding replaced
- Dataset: FineWeb-Edu 100B (50B tokens used for training)
- Model scales: Medium (350M, 24 layers, 8 heads) / Large (770M, 36 layers, 10 heads)
- Context length 4096, batch size 480
- Baselines: RoPE, ALiBi, FoX
Main Results (Medium 350M, 0-shot; Avg. is the 7-task average, five tasks shown)¶
| Method | ARC-E | ARC-C | HellaSwag | PIQA | SciQ | Avg. |
|---|---|---|---|---|---|---|
| RoPE | 56.36 | 30.38 | 44.65 | 68.77 | 74.40 | 51.73 |
| ALiBi | 58.21 | 29.78 | 45.38 | 70.08 | 78.50 | 52.87 |
| FoX | 58.38 | 30.89 | 45.80 | 69.37 | 78.40 | 52.96 |
| GRAPE-A-QK | 57.95 | 32.00 | 45.77 | 69.37 | 79.00 | 53.00 |
| GRAPE-AP | 59.26 | 31.31 | 45.42 | 68.17 | 79.70 | 53.25 |
| GRAPE-AP+KV-shift | 57.32 | 30.55 | 46.18 | 69.10 | 79.60 | 53.46 |
Main Results (Large 770M, 0-shot; Avg. is the 7-task average, five tasks shown)¶
| Method | ARC-E | ARC-C | HellaSwag | PIQA | SciQ | Avg. |
|---|---|---|---|---|---|---|
| RoPE | 62.63 | 32.76 | 51.01 | 71.33 | 80.50 | 55.76 |
| ALiBi | 62.67 | 34.39 | 51.33 | 71.11 | 82.70 | 56.44 |
| FoX | 61.07 | 33.11 | 51.85 | 71.27 | 83.70 | 56.30 |
| GRAPE-AP | 63.89 | 34.22 | 51.52 | 71.98 | 84.40 | 56.91 |
| FoX+KV-shift | 63.55 | 33.96 | 52.72 | 71.71 | 83.20 | 57.09 |
| GRAPE-AP+KV-shift | 63.72 | 33.11 | 52.29 | 71.65 | 83.50 | 56.86 |
Key Findings¶
- GRAPE-AP achieves best overall performance without KV-shift: 350M Avg. 53.25 > FoX 52.96 > RoPE 51.73; 770M Avg. 56.91 > ALiBi 56.44.
- Training stability advantage: RoPE exhibits instability (loss spikes) at 770M scale, while GRAPE maintains stable improvement.
- Multiplicative GRAPE-M matches RoPE: This is consistent with the proven special-case relationship; GRAPE-M itself does not yield significant gains over RoPE.
- Additive variants are the primary source of improvement: The GRAPE-A and GRAPE-AP families consistently outperform purely multiplicative methods.
- KV-shift and GRAPE-AP are complementary: Adding KV-shift further improves the 350M model to 53.46.
Highlights & Insights¶
- Elegant theoretical unification: The Lie group framework unifies the seemingly disparate RoPE, ALiBi, and FoX as special cases of a single mathematical object, with rigorous proofs provided.
- Practical efficiency: The Rodrigues closed-form formula yields \(O(d)\) complexity on par with RoPE, with full compatibility for streaming inference and KV-cache.
- Extensible design space: The framework naturally gives rise to extensions including learnable orthogonal bases, content-gated slopes, and path-integral biases.
- Rigorous mathematical exposition: The group-theoretic perspective provides clear geometric intuition (rotation planes, unipotent translations) for positional encoding design.
Limitations & Future Work¶
- Limited experimental scale: Validation is restricted to 350M/770M models; experiments at >1B scale are absent, and training covers only 50B tokens.
- GRAPE-M does not significantly surpass RoPE: The theoretical advantages of the multiplicative variant — learnable subspaces and non-commutative mixing — do not translate into clear empirical gains.
- No long-context evaluation: Training uses only 4096-token contexts; extrapolation to longer sequences is not evaluated, which is precisely the key differentiator between ALiBi and RoPE.
- Computational overhead of GRAPE-AP underanalyzed: The edge potential requires per-step inner product computations, and actual inference latency is not reported.
- Limited downstream task coverage: Evaluation is restricted to 0-shot language modeling benchmarks, without assessment of generation quality or fine-tuned performance.
Related Work & Insights¶
- RoPE (Su et al., 2021): An exact special case of GRAPE-M (standard coordinate pairs + log-uniform spectrum).
- ALiBi (Press et al., 2021): An exact special case of GRAPE-A in \(\mathrm{GL}(d+2)\).
- Forgetting Transformer (FoX) (Lin et al., 2025): Shown to be a path-dependent form of GRAPE-A.
- PaTH Attention (Yang et al., 2025): Analyzed in the paper as contractive and near-singular, potentially harmful for long-context modeling.
- NoPE / no positional encoding: Not discussed within the framework.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The group-theoretic unification is highly elegant; the exact recovery proofs for RoPE/ALiBi/FoX are a standout contribution.
- Experimental Thoroughness: ⭐⭐⭐ — Model scale is modest; long-context and large-model validation are lacking.
- Writing Quality: ⭐⭐⭐⭐ — Mathematical derivations are clear and rigorous, though the dense notation raises the barrier to entry.
- Value: ⭐⭐⭐⭐ — Theoretical contribution is significant, providing a unified principled framework for positional encoding design.