Composing Linear Layers from Irreducibles¶
Conference: NeurIPS 2025 | arXiv: 2507.11688 | Code: To be confirmed | Area: LLM/NLP | Keywords: Clifford algebra, rotor decomposition, parameter efficiency, linear layers, geometric primitives
TL;DR¶
By leveraging Clifford algebra, this work represents linear layers as compositions of bivectors—specifically as rotor sandwich products—requiring only \(O(\log^2 d)\) parameters to replace a \(d \times d\) dense matrix. When applied to Q/K/V projections in LLM attention layers, performance closely matches the original model and strong baselines.
Background & Motivation¶
Background: Linear layers account for the vast majority of parameters in modern large language models (e.g., a single 2048×2048 projection in LLaMA's self-attention requires 4M parameters). Parameter-efficient methods such as LoRA and block-Hadamard compress linear layers from different perspectives, but lack a principled understanding of the intrinsic algebraic structure of linear transformations.
Limitations of Prior Work: (a) LoRA assumes low-rank structure and constitutes a functional approximation; (b) block-Hadamard assumes structured matrices, also an approximation; (c) both methods approximate dense matrices in a "top-down" manner rather than constructing functionality "bottom-up" from fundamental primitives. A formal algebraic characterization of the minimal irreducible primitives of linear transformations is absent.
Key Challenge: Dense matrices require \(O(d^2)\) parameters, yet the "essential degrees of freedom" of a linear transformation may be far smaller. The question is whether a minimal set of geometric primitives can be identified whose composition reproduces the functionality of a linear layer.
Goal: To describe the algebraic structure of linear transformations using the language of Clifford algebra, employing bivectors as irreducible primitives and composing rotors to achieve extreme parameter compression.
Key Insight: In Clifford algebra, linear transformations can be expressed as multivector products of the form \(F(x) = \sum_t a_t x b_t\). Restricting \(a_t, b_t\) to rotors (elements of the Spin group) and parameterizing rotors via bivectors yields exponential parameter compression.
Core Idea: Bivectors serve as the geometric primitives of linear transformations. Through the action of rotor sandwich products on multivector spaces, effective linear transformations can be constructed with only \(O(\log^2 d)\) parameters.
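A one-line parameter accounting (a sketch using only quantities already stated above; \(w\) denotes the number of rotors) makes the \(O(\log^2 d)\) claim concrete:

\[
n = \log_2 d, \qquad r_t = \exp(b_t), \qquad b_t \text{ a bivector with } \binom{n}{2} = \frac{n(n-1)}{2} \text{ coefficients} \;\Rightarrow\; \#\text{params} = w \binom{\log_2 d}{2} \in O(\log^2 d).
\]

For \(d = 2048\): \(n = 11\), \(\binom{11}{2} = 55\), so \(w = 3\) rotors need 165 parameters, versus \(d^2 \approx 4.2\text{M}\) for a dense matrix.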
Method¶
Overall Architecture¶
An input vector \(x \in \mathbb{R}^d\) is re-encoded as a multivector in the Clifford algebra \(\text{Cl}(n)\) (where \(n = \log_2 d\)), then transformed via the sandwich products \(r_t x r_t^\dagger\) of multiple rotors, and finally aggregated to produce the output. The learnable parameters consist solely of a small number of bivector coefficients.
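Below is a minimal, self-contained NumPy sketch of this pipeline for the smallest interesting case, \(\text{Cl}(2)\) (so \(d = 4\): the four input components are read directly as the scalar, \(e_1\), \(e_2\), \(e_{12}\) coefficients). The Cayley table, the helper names (`gp`, `rotor`, `reverse`, `rotor_gadget`), and the one-angle rotor parameterization are illustrative assumptions, not the paper's implementation, which works in \(\text{Cl}(\log_2 d)\) with general bivectors, the differentiable decomposition, and autograd.

```python
import numpy as np

# Geometric (Clifford) product table for Cl(2) in the basis [1, e1, e2, e12]:
# CAYLEY[i, j, k] is the coefficient of basis_k in the product basis_i * basis_j.
CAYLEY = np.zeros((4, 4, 4))
rules = {  # (i, j) -> (k, sign)
    (0, 0): (0, 1), (0, 1): (1, 1), (0, 2): (2, 1), (0, 3): (3, 1),
    (1, 0): (1, 1), (1, 1): (0, 1), (1, 2): (3, 1), (1, 3): (2, 1),
    (2, 0): (2, 1), (2, 1): (3, -1), (2, 2): (0, 1), (2, 3): (1, -1),
    (3, 0): (3, 1), (3, 1): (2, -1), (3, 2): (1, 1), (3, 3): (0, -1),
}
for (i, j), (k, s) in rules.items():
    CAYLEY[i, j, k] = s

def gp(a, b):
    """Geometric product of two Cl(2) multivectors given as length-4 coefficient arrays."""
    return np.einsum("i,j,ijk->k", a, b, CAYLEY)

def rotor(theta):
    """Rotor exp(theta * e12), using the closed form for a simple bivector b:
    exp(b) = cos(|b|) + (sin(|b|)/|b|) * b, here with |b| = |theta|."""
    return np.array([np.cos(theta), 0.0, 0.0, np.sin(theta)])

def reverse(a):
    """Reversion (dagger): in Cl(2) it only flips the sign of the bivector part."""
    return a * np.array([1.0, 1.0, 1.0, -1.0])

def rotor_gadget(x, thetas, mix):
    """Toy rotor-gadget: w sandwich products r_t x r_t^dagger, combined by mixing weights."""
    outs = [gp(gp(rotor(t), x), reverse(rotor(t))) for t in thetas]
    return sum(m * o for m, o in zip(mix, outs))

# Sanity check: a single sandwich product rotates the vector part of the multivector.
x = np.array([0.0, 1.0, 0.0, 0.0])                  # the input read as the vector e1
y = rotor_gadget(x, thetas=[np.pi / 8], mix=[1.0])
print(np.round(y, 4))                               # [0, cos(pi/4), -sin(pi/4), 0]
```

The same structure scales to \(d = 2048\) by working in \(\text{Cl}(11)\), where each rotor is the exponential of a 55-coefficient bivector.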
Key Designs¶
- Clifford Algebra Representation:
    - Function: Converts linear transformations from matrix representations into multivector product form within Clifford algebra.
    - Mechanism: By Lemma 1 (Hestenes, 2012), any linear function \(F: \text{Cl}^k(n) \to \text{Cl}(n)\) can be written as a finite sum \(F(x) = \sum_{t=1}^w a_t x b_t\). Restricting \(a_t = r_t\), \(b_t = r_t^\dagger\) to rotors (elements of Spin(n)) makes \(x \mapsto r x r^\dagger\) an orientation-preserving rotation.
    - Key Advantage: Rotors act on the \(2^n\)-dimensional multivector space (where \(n = \log_2 d\)), yet require only \(\binom{n}{2}\) bivector parameters. For \(d = 2048\), a single rotor needs only 55 parameters versus ~4.2M for a dense matrix.
- Differentiable Invariant Decomposition Algorithm:
    - Function: Decomposes a general bivector (which may not be "simple") into a sum of mutually orthogonal, commuting simple bivectors, enabling a closed-form rotor expression.
    - Mechanism: The exponential map of a simple bivector \(b = u \wedge v\) has a closed-form solution: \(\exp(b) = \cos(\|b\|) + \frac{\sin(\|b\|)}{\|b\|} b\). A general bivector decomposes into a sum of at most \(k = \lfloor n/2 \rfloor\) simple bivectors \(b = \sum_j b_j\); mutual orthogonality (hence commutativity) yields \(\exp(\sum_j b_j) = \prod_j \exp(b_j)\).
    - Algorithm: Algorithm 1 sequentially extracts simple components from the bivector via projection operations, with all steps differentiable through autograd.
    - Design Motivation: Avoids the approximation error of truncating an infinite series while maintaining end-to-end differentiability.
- Rotor-Gadget Module Design:
    - Function: Encapsulates rotor operations into a module that can replace dense linear layers.
    - Mechanism: Input \(x \in \mathbb{R}^d\) → reshaped into a \(2^n\)-dimensional multivector → transformed through \(w\) rotor sandwich products \(\{r_t x r_t^\dagger\}_{t=1}^w\) → output aggregated via learnable mixing coefficients.
    - Width \(w\) is a hyperparameter (typical value ~3); the total parameter count is \(w \times \binom{\log_2 d}{2}\) (a worked count follows this list).
    - Application: Replaces the Q/K/V projection matrices in LLM attention layers.
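To make that budget concrete, a quick worked count (the gadget formula \(w \times \binom{\log_2 d}{2}\) is from the summary; the dense and LoRA comparison values follow the ablation table below, and the two-factor LoRA count is an assumption about its standard parameterization):

```python
from math import comb, log2

def rotor_gadget_params(d, w=3):
    """Parameters of a rotor-gadget replacing a d x d projection:
    w rotors, each parameterized by a bivector with C(log2 d, 2) coefficients
    (mixing coefficients not counted, following the summary's formula)."""
    n = int(log2(d))
    return w * comb(n, 2)

d = 2048
print(rotor_gadget_params(d))   # 3 * C(11, 2) = 3 * 55 = 165
print(d * d)                    # dense d x d matrix: 4,194,304 (~4.2M)
print(2 * 4 * d)                # LoRA rank-4 as two d x 4 factors: 16,384 (~16K)
```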
Loss & Training¶
- Evaluated on LLaMA-2-7B and LLaMA-3-8B.
- All Q/K/V projections in attention layers are replaced with rotor-gadgets.
- Training proceeds via distillation/approximation to recover the functionality of the original layers rather than training from scratch (a generic sketch of such layer-wise distillation follows this list).
- Evaluation metrics: perplexity on WikiText-2 and C4, and multiple downstream zero-shot tasks.
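The summary does not spell out the exact objective, so the following PyTorch sketch only illustrates the general shape of such layer-wise distillation: fit a small student module (standing in for the rotor-gadget) to reproduce a frozen original projection on calibration activations. The function name, batch sizes, and the rank-4 stand-in student are assumptions, not the paper's training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_layer(teacher: nn.Module, student: nn.Module,
                  calib_acts: torch.Tensor, steps: int = 1000,
                  batch: int = 256, lr: float = 1e-3) -> nn.Module:
    """Layer-wise distillation: fit `student` (e.g., a rotor-gadget) to reproduce
    the frozen `teacher` projection (e.g., an original Q/K/V nn.Linear) on
    calibration activations."""
    teacher.eval()
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(steps):
        idx = torch.randint(len(calib_acts), (batch,))
        x = calib_acts[idx]
        with torch.no_grad():
            target = teacher(x)                # output of the original dense projection
        loss = F.mse_loss(student(x), target)  # match the student's output to the teacher's
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student

# Toy usage: a rank-4 factored linear plays the role of the rotor-gadget student.
d = 64
teacher = nn.Linear(d, d, bias=False)
student = nn.Sequential(nn.Linear(d, 4, bias=False), nn.Linear(4, d, bias=False))
calib = torch.randn(4096, d)                   # calibration activations (random here)
distill_layer(teacher, student, calib, steps=200)
```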
Key Experimental Results¶
Main Results: LLaMA-2-7B Perplexity (log-PPL)¶
Each column reports log-perplexity when only the indicated projection (Q, K, or V) is replaced by the compressed layer; "Original" is the unmodified model.
| Method | Params | WikiText-2 Q | WikiText-2 K | WikiText-2 V | C4 Q | C4 K | C4 V |
|---|---|---|---|---|---|---|---|
| Original | 100% | 2.575 | 2.575 | 2.575 | 3.151 | 3.151 | 3.151 |
| LR1 (rank-1) | Very few | 2.688 | 3.455 | 4.956 | 3.414 | 4.071 | 5.001 |
| LR4 (rank-4) | Few | 2.658 | 2.729 | 2.880 | 3.390 | 3.315 | 3.504 |
| BH1 (block-Hadamard) | Few | 2.636 | 2.700 | 2.779 | 3.343 | 3.262 | 3.404 |
| Rotor | Very few | 2.629 | 2.717 | 2.818 | 3.261 | 3.285 | ~3.4 |
The rotor-gadget is competitive with the block-Hadamard baseline (better on the Q columns, slightly worse on K and V) and consistently outperforms the naive low-rank approximations.
Ablation Study: Parameter Count Comparison¶
| Method | Parameters at d=2048 |
|---|---|
| Dense matrix | ~4.2M |
| LoRA (rank-4) | ~16K |
| Block-Hadamard | ~11 bits/row |
| Rotor (w=3) | ~165 |
The rotor-gadget uses two orders of magnitude fewer parameters than LoRA.
Key Findings¶
- The rotor-gadget achieves near-original performance under extreme parameter compression (165 parameters vs. 4.2M), demonstrating that linear transformations possess exploitable algebraic structure.
- The Q projection is easiest to approximate (smallest perplexity increase), while the V projection is hardest (largest increase)—consistent with the functional distinctions among Q/K/V observed in mechanistic interpretability research.
- Width \(w=3\) suffices; additional rotors yield diminishing returns.
- LLaMA-3 is harder to approximate than LLaMA-2, possibly because its projection layers encode more complex functionality.
Highlights & Insights¶
- Mathematical Elegance: Grounding the work in Clifford algebra—a well-established mathematical framework—formalizes the "minimal primitives" of linear layers as bivectors. The \(O(\log^2 d)\) parameter complexity arises naturally from the exponential dimensional lift of Clifford algebra (\(n\)-dimensional space generates a \(2^n\)-dimensional algebra), yielding a result that is both natural and elegant.
- Differentiable Invariant Decomposition constitutes the key technical contribution: it eliminates truncation errors in the exponential map and ensures gradient flow throughout.
- A New Perspective on Linear Layers: Rather than asking "how to compress," this work asks "what are the essential degrees of freedom of a linear transformation?"—the answer being a small number of planar rotations.
Limitations & Future Work¶
- The approach is not yet a practical drop-in replacement—system-level integration (e.g., custom CUDA kernels) would be required to realize speed advantages.
- Evaluation is limited to Q/K/V projections; the FFN layers, which account for the majority of linear layer parameters, remain untested.
- \(d\) must be a power of two for clean embedding decomposition; arbitrary dimensions require padding.
- Training proceeds by distilling/approximating existing weights rather than training from scratch—whether rotor parameterization supports efficient training from random initialization remains unverified.
- Convergence analysis is absent; the optimization landscape of multi-rotor compositions may exhibit local optima.
Related Work & Insights¶
- vs. LoRA: LoRA assumes low-rank weight updates; rotors assume that transformations can be composed from planar rotations. The two approaches are orthogonal and may be combinable.
- vs. Block-Hadamard: Block-Hadamard uses structured orthogonal matrices; rotors also yield orthogonal transformations but with a distinct parameterization. Performance is comparable, but rotors use far fewer parameters.
- vs. Other Applications of Geometric Algebra in Deep Learning: Clifford and geometric-algebra layers in prior deep-learning work (Ruhe et al.) target equivariance, whereas this work targets parameter efficiency and decomposability.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Introducing Clifford algebra to LLM parameter efficiency represents a genuinely novel perspective with rigorous mathematical foundations.
- Experimental Thoroughness: ⭐⭐⭐ Proof-of-concept level; lacks experiments on training from scratch, broader model coverage, and inference speed comparisons.
- Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are clear, though the presentation poses a high barrier for readers unfamiliar with Clifford algebra.
- Value: ⭐⭐⭐⭐ Opens a new direction for understanding and compressing linear layers, though practical utility requires further validation.