Polynomial Expansion Rank Adaptation: Enhancing Low-Rank Fine-Tuning with High-Order Interactions

Conference: ACL 2026 Findings
arXiv: 2604.11841
Code: https://github.com/zhangwenhao6/PERA
Area: Parameter-Efficient Fine-Tuning / Model Compression
Keywords: Low-Rank Adaptation, Polynomial Expansion, High-Order Feature Interaction, PEFT, LoRA Improvement

TL;DR

This paper proposes PERA (Polynomial Expansion Rank Adaptation), which expands the linear adaptation space of LoRA into a polynomial manifold by introducing structured polynomial expansion (square and cross terms) within the parameter space of low-rank factors. It significantly enhances weight update expressiveness without increasing rank or inference overhead, consistently outperforming methods such as LoRA, DoRA, and HiRA on commonsense reasoning and NLU tasks.

Background & Motivation

Background: Parameter-Efficient Fine-Tuning (PEFT) has become the standard paradigm for adapting Large Language Models. LoRA achieves efficient adaptation by restricting weight updates to a low-rank subspace \(\Delta W = BA\). However, its strictly bilinear structure captures only first-order dependencies between the low-rank factors, which limits its capacity to represent non-linear and high-order parameter interactions.

Limitations of Prior Work: (1) LoRA's weight update \(\Delta W = \sum_{i=1}^{r} \mathbf{b}_i \mathbf{a}_i^T\) is a linear combination of rank-one matrices, whose expressiveness is constrained by rank \(r\); (2) DoRA improves on this via magnitude-direction decomposition but remains a linear transformation; (3) MoRA achieves high-rank adaptation via compression-transformation-decompression but introduces additional overhead; (4) HiRA enriches the representation through Hadamard modulation with pre-trained weights, yet its update remains linear in the trainable parameters and relies on external weight coupling.

Key Challenge: From a function approximation perspective, there is a fundamental difference in expressiveness between a first-order linear function \(f(x) = c + c_1 x\) and a polynomial function containing high-order terms \(f(x) = c + c_1 x + c_2 x^2 + \cdots\). If LoRA is viewed as a first-order linear approximation of weight updates, the limitation in its expressiveness is intrinsic.

Goal: To enhance the expressiveness of low-rank adaptation by introducing high-order feature interactions without increasing rank or inference costs.

Key Insight: The method borrows polynomial feature expansion from classical feature engineering but applies it to the parameter space of the low-rank factors rather than to the input feature space.

Core Idea: Structural polynomial expansion and Hadamard-based polynomial expansion are applied to the low-rank matrices \(B\) and \(A\) respectively to generate square terms (\(\mathbf{b}_i \odot \mathbf{b}_i\)) and cross terms (\(\mathbf{b}_i \odot \mathbf{b}_j\)). By using matrix concatenation (instead of addition), extra inference overhead is avoided while expanding the adaptation space from a linear subspace to a polynomial manifold.

Method

Overall Architecture

PERA follows the decomposition framework of LoRA, factoring the weight update into \(B \in \mathbb{R}^{m \times r}\) and \(A \in \mathbb{R}^{r \times n}\). The core change is to expand both factors polynomially before multiplying them: a standard second-order polynomial expansion \(\text{Poly}^2(B)\) for \(B\), and a Hadamard-based polynomial expansion \(\text{Poly}_H^2(A)\) for \(A\), with learnable coefficients \(\mathbf{h}\) to ensure stability. The final update is \(\Delta W = \text{Poly}^2(B) \cdot \text{Poly}_H^2(A)\). The pipeline therefore has two expansion paths followed by a concatenate-and-multiply step: \(B\) and \(A\) are expanded independently, the expanded terms are concatenated into \(\hat{B}\) and \(\hat{A}\), their product forms \(\Delta W\), and \(\Delta W\) is then merged into the frozen \(W_0\). All of this happens during parameter construction in the training phase, so the inference form is identical to LoRA.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    B0["Low-rank factor B (Gaussian init)"]
    A0["Low-rank factor A (Zero init)<br/>+ Learnable coefficients h (Zero init)"]
    B0 --> POLYB["Parameter Space Polynomial Expansion<br/>Second-order expansion Poly²(B)<br/>B̂ = [B; Square terms b⊙b; Cross terms b⊙b′]"]
    A0 --> POLYA["Zero Initialization of Hadamard Coefficients<br/>Hadamard expansion Poly_H²(A)<br/>Starts at h=0, high-order terms awakened during training"]
    POLYB --> MERGE["Matrix Concatenation for Zero Inference Overhead<br/>ΔW = B̂·Â (Concatenation, not serial addition)"]
    POLYA --> MERGE
    W0["Pre-trained weights W0 (Frozen)"] --> OUT
    MERGE --> OUT["ΔW merged into W0<br/>Zero inference overhead, same as LoRA"]

Key Designs

1. Parameter Space Polynomial Expansion: High-order interactions of low-rank factor column vectors

The ceiling of LoRA is rigid: \(\Delta W = \sum_{i=1}^{r}\mathbf{b}_i\mathbf{a}_i^T\) is merely a linear superposition of \(r\) rank-one matrices, with expressiveness restricted by rank \(r\), capturing only first-order linear dependencies. PERA shifts polynomial expansion from input space to parameter space—column vectors of low-rank factors are treated as a set of "adaptation directions." Second-order expansion captures non-linear coupling between these directions. Specifically, for \(B=[\mathbf{b}_1,\ldots,\mathbf{b}_r]\), second-order expansion yields \(\hat{B}=[B; B_{square}; B_{cross}]\), where \(B_{square}=\{\mathbf{b}_i\odot\mathbf{b}_i\}\) contains \(r\) square terms, and \(B_{cross}=\{\mathbf{b}_i\odot\mathbf{b}_j\mid i<j\}\) contains \(C(r,2)\) cross terms. \(A\) undergoes a corresponding Hadamard expansion, increasing dimensions from \(r\) to \(2r+C(r,2)\).
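As a small worked case of the expansion just described, take \(r=2\). The column ordering below simply follows the \([B; B_{square}; B_{cross}]\) layout and is an assumption about presentation, not something the source fixes:

\[
\hat{B} = \big[\,\mathbf{b}_1,\ \mathbf{b}_2,\ \mathbf{b}_1\odot\mathbf{b}_1,\ \mathbf{b}_2\odot\mathbf{b}_2,\ \mathbf{b}_1\odot\mathbf{b}_2\,\big] \in \mathbb{R}^{m\times 5},
\qquad 2r + C(r,2) = 2\cdot 2 + 1 = 5 .
\]

So two adaptation directions yield five columns: the two originals, two square terms, and one cross term.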

The resulting weight update is \(\Delta W = \sum_{i}\mathbf{b}_i\mathbf{a}_i^T + \sum_{i}h_{ii}(\mathbf{b}_i\odot\mathbf{b}_i)(\mathbf{a}_i^T\odot\mathbf{a}_i^T) + \sum_{i<j}h_{ij}(\mathbf{b}_i\odot\mathbf{b}_j)(\mathbf{a}_i^T\odot\mathbf{a}_j^T)\). The first term is standard LoRA, the second collects the square terms, and the third collects the cross terms; together the latter two expand the adaptation space into a polynomial manifold. The effective rank upper bound increases from \(r_0+r\) to \(r_0+2r+C(r,2)\) (with \(r_0\) read here as the rank of the pre-trained weight \(W_0\)), which explains why PERA approaches high-rank performance even at very low rank.
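The equivalence between the concatenated form \(\hat{B}\hat{A}\) and the three-part sum above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the helper names (`pair_indices`, `expand_b`, `expand_a`), the split of \(\mathbf{h}\) into `h_sq` and `h_cross`, and the constant value 0.5 for the coefficients are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def pair_indices(r):
    """All index pairs (i, j) with i < j, in a fixed order (hypothetical helper)."""
    return list(combinations(range(r), 2))


def expand_b(B):
    """Columns [b_i], [b_i * b_i], [b_i * b_j for i < j] stacked side by side."""
    cross = [B[:, i] * B[:, j] for i, j in pair_indices(B.shape[1])]
    cross = np.stack(cross, axis=1) if cross else np.empty((B.shape[0], 0))
    return np.concatenate([B, B * B, cross], axis=1)


def expand_a(A, h_sq, h_cross):
    """Rows [a_i], [h_ii * a_i * a_i], [h_ij * a_i * a_j for i < j] stacked."""
    pairs = pair_indices(A.shape[0])
    cross = [h * A[i] * A[j] for h, (i, j) in zip(h_cross, pairs)]
    cross = np.stack(cross, axis=0) if cross else np.empty((0, A.shape[1]))
    return np.concatenate([A, h_sq[:, None] * A * A, cross], axis=0)


rng = np.random.default_rng(0)
m, n, r = 16, 12, 2
B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))
h_sq = np.full(r, 0.5)                    # h_ii, arbitrary non-zero values
h_cross = np.full(r * (r - 1) // 2, 0.5)  # h_ij for i < j

B_hat = expand_b(B)
A_hat = expand_a(A, h_sq, h_cross)
delta_w = B_hat @ A_hat

# The same update written as the explicit three-part sum.
explicit = B @ A
for i in range(r):
    explicit += h_sq[i] * np.outer(B[:, i] ** 2, A[i] ** 2)
for h, (i, j) in zip(h_cross, pair_indices(r)):
    explicit += h * np.outer(B[:, i] * B[:, j], A[i] * A[j])

print(B_hat.shape, A_hat.shape)        # (16, 5) (5, 12)
print(np.allclose(delta_w, explicit))  # concatenated product matches the sum
print(np.linalg.matrix_rank(B @ A), np.linalg.matrix_rank(delta_w))
```

With random factors the last line illustrates the rank argument: the plain product \(BA\) has rank at most \(r\), while the expanded product can reach \(2r + C(r,2)\).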

2. Zero Initialization of Hadamard Coefficients: Progressive awakening of high-order terms

Introducing square and cross terms directly may cause training instability due to volatile high-order gradients. PERA assigns learnable coefficients \(\mathbf{h}=\{h_{ij}\}\) to the expansion on the \(A\) side and initializes them to zero. Consequently, at the start of training all high-order contributions vanish, and PERA reduces exactly to standard LoRA.

As training progresses, the model learns which \(h_{ij}\) to increase and which high-order interactions are beneficial. This "progressive introduction of nonlinearity" maintains the smoothness of early optimization without sacrificing the upper bound of expressiveness, effectively integrating high-order terms into the optimization trajectory via a gentle "annealing" process.
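The effect of the zero-initialized coefficients can be illustrated with a small NumPy sketch. It is written under assumptions: `pera_update` is a hypothetical helper that evaluates the update as an explicit sum, \(\mathbf{h}\) is stored as an \(r \times r\) array of which only the upper triangle is read, and \(A\) is drawn at random (rather than zero-initialized) purely so that the comparison is non-trivial.

```python
import numpy as np


def pera_update(B, A, h):
    """Delta W = BA + sum over i <= j of h[i, j] * (b_i * b_j)(a_i * a_j)^T.

    Only the upper triangle of h (including the diagonal) is used.
    """
    r = B.shape[1]
    delta = B @ A
    for i in range(r):
        for j in range(i, r):
            delta = delta + h[i, j] * np.outer(B[:, i] * B[:, j], A[i] * A[j])
    return delta


rng = np.random.default_rng(0)
m, n, r = 8, 6, 3
B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))  # random only to make the check meaningful

h_init = np.zeros((r, r))            # zero initialization of h
at_init = pera_update(B, A, h_init)

h_later = h_init.copy()
h_later[0, 1] = 0.1                  # one cross term "switched on"
later = pera_update(B, A, h_later)

print(np.allclose(at_init, B @ A))   # identical to the LoRA update at h = 0
print(np.allclose(later, B @ A))     # no longer identical once h_01 != 0
```

Setting a single coefficient away from zero adds exactly one rank-one high-order term, which is the sense in which individual interactions are introduced gradually.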

3. Matrix Concatenation for Zero Inference Overhead: Integration via concatenation instead of serial addition

High expressiveness is only practical if it does not increase inference latency. PERA implements high-order terms as column/row concatenations of \(B\) and \(A\) rather than sequential additions. The expanded \(\hat{B}\in\mathbb{R}^{m\times(2r+C(r,2))}\) and \(\hat{A}\in\mathbb{R}^{(2r+C(r,2))\times n}\) are multiplied, resulting in an \(\mathbb{R}^{m\times n}\) matrix.

During inference, \(\Delta W=\hat{B}\hat{A}\) can be pre-computed and merged into \(W_0\), introducing no additional latency or VRAM usage compared to standard LoRA. High-order interactions occur entirely within the parameter construction of the training phase, while the deployment remains zero-overhead.
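The merging argument can be sketched as follows. This is an illustrative check rather than deployment code: `B_hat` and `A_hat` are random stand-ins with the expanded width \(2r + C(r,2)\), and the variable names are assumptions.

```python
from math import comb

import numpy as np

rng = np.random.default_rng(0)
m, n, r = 8, 6, 4
k = 2 * r + comb(r, 2)  # expanded inner dimension, 14 for r = 4

W0 = rng.normal(size=(m, n))       # frozen pre-trained weight
B_hat = rng.normal(size=(m, k))    # stand-in for the expanded B
A_hat = rng.normal(size=(k, n))    # stand-in for the expanded A
x = rng.normal(size=n)

# Training-time view: base path plus adapter path.
y_unmerged = W0 @ x + B_hat @ (A_hat @ x)

# Deployment view: fold the update into W0 once, then use a single matmul.
W_merged = W0 + B_hat @ A_hat
y_merged = W_merged @ x

print(W_merged.shape == W0.shape)         # merged weight keeps the m x n shape
print(np.allclose(y_unmerged, y_merged))  # both views give the same output
```

Because the merged matrix has the same shape as \(W_0\), the expanded width \(k\) never appears at inference time.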

Loss & Training

Standard next-token prediction loss is employed. During training, only the low-rank matrices \(A\), \(B\), and Hadamard coefficients \(\mathbf{h}\) are optimized, while pre-trained weights \(W_0\) remain frozen. The learning rate is set to \(1 \times 10^{-4}\), with other hyperparameters consistent with HiRA baselines. \(A\) is zero-initialized, and \(B\) is Gaussian-initialized.

Key Experimental Results

Main Results

| Model | Method | Params (%) | Commonsense Reasoning Avg Acc |
|---|---|---|---|
| LLaMA2-7B | LoRA (r=32) | 0.83% | 77.61 |
| LLaMA2-7B | DoRA (r=32) | 0.83% | 79.69 |
| LLaMA2-7B | HiRA (r=32) | 0.83% | 81.42 |
| LLaMA2-7B | PERA (r=16) | 0.41% | 82.61 |
| LLaMA3-8B | LoRA (r=16) | 0.35% | 82.80 |
| LLaMA3-8B | HiRA (r=16) | 0.35% | 86.08 |
| LLaMA3-8B | PERA (r=16) | 0.35% | 87.38 |
| Qwen2.5-7B | LoRA (r=16) | 0.35% | 73.80 |
| Qwen2.5-7B | HiRA (r=16) | 0.35% | 85.40 |
| Qwen2.5-7B | PERA (r=16) | 0.35% | 88.29 |

| Model | Method | Params | GLUE Avg |
|---|---|---|---|
| RoBERTa-base | LoRA | 0.3M | 83.40 |
| RoBERTa-base | DeLoRA | 0.3M | 84.60 |
| RoBERTa-base | PERA | 0.3M | 85.10 |
| RoBERTa-large | LoRA | 0.8M | 87.30 |
| RoBERTa-large | PERA | 0.8M | 88.13 |

Ablation Study

| Config | Weight Update Formula | Avg Accuracy |
|---|---|---|
| LoRA (First-order only) | Eq.8 | 82.80 |
| LoRA + Square terms only | Eq.10 | 87.48 |
| LoRA + Cross terms only | Eq.11 | 86.83 |
| PERA (Square + Cross) | Eq.9 | 87.38 |

Key Findings

  • Significant Gains from High-Order Terms: PERA outperforms LoRA by 5.00 percentage points on LLaMA2-7B (82.61% at \(r=16\) vs 77.61% at \(r=32\), i.e. with half the trainable parameters) and by 14.49 percentage points on Qwen2.5-7B (88.29% vs 73.80%, both at \(r=16\)).
  • Square Terms are the Most Critical High-Order Component: The gain from adding only square terms (87.48%) is higher than adding only cross terms (86.83%), suggesting that non-linear interactions within the same dimension are more significant than cross-dimension interactions.
  • Superior Performance at Extremely Low Rank: PERA achieves 86.91% at \(r=2\) and 87.01% at \(r=4\), nearing the best result of 87.38% at \(r=16\). This is attributed to polynomial expansion increasing the effective rank upper bound from \(r\) to \(2r + C(r,2)\).
  • Training and Inference Costs Comparable to LoRA: Training memory is 19.12GB vs LoRA's 18.70GB; inference memory is 19.70GB vs LoRA's 19.50GB. Training time is substantially shorter than DoRA's (13h30m vs 22h07m).
  • Exceeding LoRA with Only 10% Data: PERA achieves 83.07% using 10% of commonsense170K data, surpassing LoRA's 82.80% trained on 100% of the data, demonstrating exceptional data efficiency.

Highlights & Insights

  • Elegant Transfer from Feature to Parameter Engineering: Migrating polynomial feature expansion from input space to the parameter space of low-rank adaptation is a novel and insightful design that is simple yet effective.
  • LoRA as a Special Case of PERA: When \(\mathbf{h}=0\), PERA reduces to LoRA, providing theoretical unification and justifying the progressive introduction strategy via zero initialization.
  • Theoretical Increase in Rank Upper Bound: PERA raises the rank upper bound of adaptation weights from \(r_0 + r\) to \(r_0 + 2r + C(r,2)\). For \(r=16\), this increases the capacity from \(r_0+16\) to \(r_0+152\) (since \(2\cdot 16 + C(16,2) = 32 + 120\)), a 9.5-fold increase in the adaptation term.
  • Hessian Interaction Strength Analysis: By calculating the interaction strength matrix of second-order partial derivatives, PERA is shown to possess a stronger capacity for modeling global feature interactions compared to LoRA.
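The rank-bound arithmetic in the highlights above can be reproduced with a few lines of standard-library Python. The function name `expanded_width` is hypothetical, and \(r=64\) is included only as an illustrative larger rank; it is not a setting reported in the source.

```python
from math import comb


def expanded_width(r):
    """Number of columns after second-order expansion: r + r + C(r, 2)."""
    return 2 * r + comb(r, 2)


widths = {r: expanded_width(r) for r in (2, 4, 16, 64)}
for r, width in widths.items():
    # ratio of the expanded bound to the plain LoRA bound r
    print(f"r={r:>2}  2r+C(r,2)={width:>4}  ratio={width / r:.1f}")
```

The ratio equals \((r+3)/2\), so the relative gain grows linearly with \(r\) while the expanded width itself grows quadratically.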

Limitations & Future Work

  • Evaluation is limited to commonsense reasoning and GLUE; arithmetic reasoning, code generation, and multimodal tasks are not yet covered.
  • Only second-order expansion is explored; the effects of higher-order expansions (\(k>2\)) remain unknown.
  • Cross terms show limited contribution (87.38% vs 87.48% for square-only), indicating potential redundancy and a need for finer selection strategies.
  • Comparisons with other recent LoRA variants, such as the mini-ensemble method MELoRA or adaptive-rank methods (e.g., AdaLoRA), are missing.
  • Polynomial expansion may lead to dimension explosion at large ranks (\(C(r,2)\) grows quadratically with \(r\)), requiring research into scalability for high-rank scenarios.

Comparison with Related Methods

  • vs LoRA: PERA is a strict generalization of LoRA that introduces high-order terms. LoRA corresponds to the special case in PERA where \(\mathbf{h}=0\).
  • vs HiRA: HiRA introduces nonlinearity via Hadamard products with pre-trained weights, relying on external weight coupling; PERA introduces high-order interactions internally without external dependencies.
  • vs DoRA: DoRA decomposes magnitude and direction but remains a linear transformation; PERA introduces genuine non-linear parameter interactions.
  • vs MoRA: MoRA increases expressiveness through high-rank transformations at the cost of inference overhead; PERA maintains zero inference overhead.

Rating

  • Novelty: ⭐⭐⭐⭐ Innovative application of polynomial expansion to low-rank parameter spaces with clear theoretical analysis.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Validated across multiple models and tasks with comprehensive rank, module, and data ablation.
  • Writing Quality: ⭐⭐⭐⭐ Clear methodological description, rigorous mathematical derivation, and elegant proof of the relationship with LoRA.
  • Value: ⭐⭐⭐⭐ Provides a simple and effective enhancement to LoRA with high practical utility.