# Polynomial Expansion Rank Adaptation: Enhancing Low-Rank Fine-Tuning with High-Order Interactions
- Conference: ACL 2026
- arXiv: 2604.11841
- Code: https://github.com/zhangwenhao6/PERA
- Area: Parameter-Efficient Fine-Tuning / Model Compression
- Keywords: Low-rank adaptation, polynomial expansion, high-order feature interaction, parameter-efficient fine-tuning, LoRA improvement
## TL;DR
This paper proposes PERA (Polynomial Expansion Rank Adaptation), which introduces structured polynomial expansions (square and cross terms) into the parameter space of low-rank factors, extending LoRA's linear adaptation space into a polynomial manifold. Without increasing rank or inference overhead, PERA significantly enhances the expressiveness of weight updates and consistently outperforms LoRA, DoRA, and HiRA on commonsense reasoning and NLU tasks.
## Background & Motivation
Background: Parameter-efficient fine-tuning (PEFT) has become the standard paradigm for adapting large language models. LoRA achieves efficient adaptation by constraining weight updates to a low-rank subspace \(\Delta W = BA\), but its strictly bilinear structure captures only first-order linear dependencies between low-rank factors, limiting the model's ability to capture nonlinear and high-order parameter interactions.
Limitations of Prior Work: (1) LoRA's weight update \(\Delta W = \sum_{i=1}^{r} \mathbf{b}_i \mathbf{a}_i^T\) is a linear combination of rank-one matrices, with expressiveness bounded by rank \(r\); (2) DoRA improves via magnitude-direction decomposition but remains a linear transformation; (3) MoRA achieves high-rank adaptation via compression-transformation-decompression but introduces additional overhead; (4) HiRA enriches representations via Hadamard modulation with pretrained weights, yet its update mechanism remains linear in trainable parameters and relies on external weight coupling.
Key Challenge: From a function approximation perspective, there is a fundamental expressive gap between a first-order linear function \(f(x) = c + c_1 x\) and a polynomial function \(f(x) = c + c_1 x + c_2 x^2 + \cdots\). Viewing LoRA as a first-order linear approximation of weight updates reveals that its expressive limitations are fundamental in nature.
Goal: To enhance the expressiveness of low-rank adaptation by introducing high-order feature interactions without increasing rank or inference cost.
Key Insight: Drawing inspiration from polynomial feature expansion in classical feature engineering—applying it to the parameter space of low-rank factors rather than the input feature space.
Core Idea: Apply polynomial expansion and Hadamard-based polynomial expansion to low-rank matrices \(B\) and \(A\) respectively, generating square terms (\(\mathbf{b}_i \odot \mathbf{b}_i\)) and cross terms (\(\mathbf{b}_i \odot \mathbf{b}_j\)). Matrix concatenation (rather than addition) is used to avoid additional inference overhead, expanding the adaptation space from a linear subspace to a polynomial manifold.
## Method

### Overall Architecture
PERA follows LoRA's factorization framework, decomposing weight updates into \(B \in \mathbb{R}^{m \times r}\) and \(A \in \mathbb{R}^{r \times n}\). The core improvement applies polynomial expansions to both factors before combining them: a standard second-order polynomial expansion \(\text{Poly}^2(B)\) is applied to \(B\), and a Hadamard-based polynomial expansion \(\text{Poly}_H^2(A)\) with learnable coefficients \(\mathbf{h}\) (for stability) is applied to \(A\). The final update is \(\Delta W = \text{Poly}^2(B) \cdot \text{Poly}_H^2(A)\).
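To make the construction concrete, here is a minimal PyTorch sketch of the update described above. It is a sketch under stated assumptions, not the authors' reference implementation: the function name `pera_delta_w` is ours, and any LoRA-style \(\alpha/r\) scaling is omitted for clarity.

```python
import torch
from itertools import combinations

def pera_delta_w(B: torch.Tensor, A: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """Sketch of the PERA update: Delta W = Poly^2(B) @ Poly_H^2(A).

    B: (m, r) and A: (r, n) are the usual low-rank factors; h holds the
    r + C(r,2) learnable Hadamard coefficients (zero at initialization).
    """
    r = B.shape[1]
    # Index pairs: (i, i) gives the r square terms, (i, j) with i < j
    # gives the C(r, 2) cross terms.
    pairs = [(i, i) for i in range(r)] + list(combinations(range(r), 2))

    # High-order columns of B: b_i ⊙ b_j.
    B_high = torch.stack([B[:, i] * B[:, j] for i, j in pairs], dim=1)
    # Matching high-order rows of A, gated by the learnable coefficients h_ij.
    A_high = h.unsqueeze(1) * torch.stack([A[i] * A[j] for i, j in pairs], dim=0)

    # Concatenation (not addition) realizes the expansion: the factors widen
    # from r to 2r + C(r,2) inner dimensions, but the product stays (m, n).
    B_hat = torch.cat([B, B_high], dim=1)
    A_hat = torch.cat([A, A_high], dim=0)
    return B_hat @ A_hat  # with h = 0 this collapses to plain LoRA: B @ A
```

Note that setting `h` to zero zeroes out every high-order row of `A_hat`, which is exactly the reduce-to-LoRA property the paper relies on for initialization.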
### Key Designs
- Polynomial Expansion in Parameter Space:
  - Function: Expands the low-rank factors from \(r\) dimensions to \(2r + C(r,2)\) dimensions, introducing high-order nonlinear interactions.
  - Mechanism: A second-order polynomial expansion is applied to \(B = [\mathbf{b}_1, \ldots, \mathbf{b}_r]\), producing \(\hat{B} = [B; B_{square}; B_{cross}]\), where \(B_{square} = \{\mathbf{b}_i \odot \mathbf{b}_i\}\) contains \(r\) square terms and \(B_{cross} = \{\mathbf{b}_i \odot \mathbf{b}_j \mid i < j\}\) contains \(C(r,2)\) cross terms. An analogous expansion is applied to \(A\), with learnable coefficients \(h_{ij}\) (initialized to zero) for training stability. The final weight update decomposes as \(\Delta W = \sum_{i=1}^{r} \mathbf{b}_i \mathbf{a}_i^T + \sum_{i=1}^{r} h_{ii} (\mathbf{b}_i \odot \mathbf{b}_i)(\mathbf{a}_i \odot \mathbf{a}_i)^T + \sum_{i<j} h_{ij} (\mathbf{b}_i \odot \mathbf{b}_j)(\mathbf{a}_i \odot \mathbf{a}_j)^T\).
  - Design Motivation: Polynomial expansion is a classical feature augmentation technique. Transferring it from the feature space to the parameter space is a natural generalization: the column vectors of the low-rank factors serve as "features" of the adaptation directions, and high-order interaction terms can capture nonlinear coupling among these directions.
- Zero-Initialization Strategy for Hadamard Coefficients:
  - Function: Ensures the training starting point is consistent with LoRA, allowing high-order terms to participate in optimization gradually.
  - Mechanism: The learnable coefficients \(\mathbf{h} = \{h_{ij}\}\) are initialized to zero, so PERA reduces to standard LoRA at the beginning of training (when \(\mathbf{h}=0\), the square and cross terms contribute nothing). As training proceeds, the model autonomously learns which high-order interactions benefit the task. LoRA is thus a special case of PERA.
  - Design Motivation: Zero initialization prevents high-order terms from introducing unstable gradients early in training while preserving the full upper bound of expressiveness. This progressive introduction of nonlinearity ensures smooth optimization.
- Zero Inference Overhead via Matrix Concatenation:
  - Function: High-order terms are realized through column/row concatenation and can be precomputed and merged into the weights at inference time.
  - Mechanism: The product of the expanded \(\hat{B} \in \mathbb{R}^{m \times (2r+C(r,2))}\) and \(\hat{A} \in \mathbb{R}^{(2r+C(r,2)) \times n}\) is still an \(m \times n\) matrix. At inference time, \(\Delta W = \hat{B}\hat{A}\) can be precomputed and merged into \(W_0\), introducing no inference latency (see the sketch after this list).
  - Design Motivation: Inference efficiency is critical for deployment. By using concatenation rather than sequential addition to realize high-order interactions, PERA preserves LoRA's zero-inference-overhead property.
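A hedged sketch of this merge step, using the \(\hat{B}\), \(\hat{A}\) notation above (the function name is illustrative):

```python
import torch

@torch.no_grad()
def merge_into_pretrained(W0: torch.Tensor, B_hat: torch.Tensor,
                          A_hat: torch.Tensor) -> torch.Tensor:
    """Fold Delta W = B_hat @ A_hat into the frozen weight once, offline.

    B_hat: (m, 2r + C(r,2)), A_hat: (2r + C(r,2), n). After merging, the
    adapted layer is a single (m, n) matmul, identical in cost to the
    original model: no extra latency or memory at serving time.
    """
    return W0 + B_hat @ A_hat
```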
### Loss & Training
The same next-token prediction loss as LoRA is used. Only the low-rank matrices \(A\), \(B\), and the Hadamard coefficients \(\mathbf{h}\) are optimized during training; the pretrained weights \(W_0\) remain frozen. The learning rate is \(1 \times 10^{-4}\), and other hyperparameters follow the HiRA baseline. \(A\) is initialized to zero and \(B\) is initialized with a Gaussian distribution.
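As a concrete illustration of this setup, a minimal sketch follows. The initialization scheme and learning rate come from the text; the layer shape, the 0.02 init std, and the choice of AdamW (the text does not name the optimizer) are our assumptions.

```python
import torch

m, n, r = 4096, 4096, 16                 # illustrative layer shape and rank
num_high = r + r * (r - 1) // 2          # r square terms + C(r,2) cross terms

W0 = torch.randn(m, n)                   # stand-in for a pretrained weight; stays frozen

# Initialization as described above: B ~ Gaussian, A = 0, h = 0. With A = 0 the
# update starts at Delta W = 0, and with h = 0 the high-order terms stay off,
# so early training behaves exactly like LoRA.
B = torch.nn.Parameter(torch.randn(m, r) * 0.02)   # 0.02 std is illustrative
A = torch.nn.Parameter(torch.zeros(r, n))
h = torch.nn.Parameter(torch.zeros(num_high))

# Only A, B, and h are handed to the optimizer; W0 is never updated.
optimizer = torch.optim.AdamW([B, A, h], lr=1e-4)  # lr = 1e-4 per the paper's setup
```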
## Key Experimental Results

### Main Results
| Model | Method | Params (%) | Commonsense Avg. Acc. |
|---|---|---|---|
| LLaMA2-7B | LoRA (r=32) | 0.83% | 77.61 |
| LLaMA2-7B | DoRA (r=32) | 0.83% | 79.69 |
| LLaMA2-7B | HiRA (r=32) | 0.83% | 81.42 |
| LLaMA2-7B | PERA (r=16) | 0.41% | 82.61 |
| LLaMA3-8B | LoRA (r=16) | 0.35% | 82.80 |
| LLaMA3-8B | HiRA (r=16) | 0.35% | 86.08 |
| LLaMA3-8B | PERA (r=16) | 0.35% | 87.38 |
| Qwen2.5-7B | LoRA (r=16) | 0.35% | 73.80 |
| Qwen2.5-7B | HiRA (r=16) | 0.35% | 85.40 |
| Qwen2.5-7B | PERA (r=16) | 0.35% | 88.29 |

| Model | Method | Params | GLUE Avg. |
|---|---|---|---|
| RoBERTa-base | LoRA | 0.3M | 83.40 |
| RoBERTa-base | DeLoRA | 0.3M | 84.60 |
| RoBERTa-base | PERA | 0.3M | 85.10 |
| RoBERTa-large | LoRA | 0.8M | 87.30 |
| RoBERTa-large | PERA | 0.8M | 88.13 |
### Ablation Study
| Configuration | Weight Update Formula | Avg. Acc. |
|---|---|---|
| LoRA (first-order only) | Eq. 8 | 82.80 |
| LoRA + square terms only | Eq. 10 | 87.48 |
| LoRA + cross terms only | Eq. 11 | 86.83 |
| PERA (square + cross) | Eq. 9 | 87.38 |
### Key Findings
- High-order terms yield substantial gains: PERA outperforms LoRA by 5 percentage points on LLaMA2-7B (82.61% vs. 77.61%) and by 14.5 percentage points on Qwen2.5-7B (88.29% vs. 73.80%).
- Square terms are the most critical high-order component: Adding only square terms (87.48%) yields a larger improvement than adding only cross terms (86.83%), and even slightly exceeds the full PERA result (87.38%), indicating that same-dimension nonlinear interactions matter more than cross-dimension interactions.
- Strong performance at extremely low ranks: PERA achieves 86.91% at \(r=2\) and 87.01% at \(r=4\), close to its best result of 87.38% at \(r=16\). This is attributed to polynomial expansion raising the effective rank upper bound from \(r\) to \(2r + C(r,2)\).
- Training and inference overhead close to LoRA: Training memory is 19.12 GB vs. LoRA's 18.70 GB, and inference memory is 19.70 GB vs. 19.50 GB. Training time (13h30m) is also far lower than DoRA's (22h07m).
- 10% data surpasses LoRA at full data: PERA achieves 83.07% on 10% of commonsense170K, exceeding LoRA's 82.80% on 100% of the data, demonstrating superior data efficiency.
## Highlights & Insights
- Elegant transfer from feature engineering to parameter engineering: Migrating polynomial feature expansion from input feature spaces in classical ML to the parameter space of low-rank adaptation is conceptually clean yet empirically effective—a novel and insightful design choice.
- LoRA as a special case of PERA: When \(\mathbf{h}=0\), PERA reduces to LoRA, providing not only theoretical unification but also justifying the progressive introduction strategy via zero initialization.
- Theoretical rank upper bound improvement: LoRA's adapted weight rank upper bound is \(r_0 + r\); PERA raises this to \(r_0 + 2r + C(r,2)\). For \(r=16\), this increases from \(r_0+16\) to \(r_0+152\), nearly a 10× improvement in theoretical expressiveness (see the worked check after this list).
- Hessian interaction strength analysis: By computing an interaction strength matrix of second-order partial derivatives, the paper intuitively demonstrates that PERA possesses stronger global feature interaction modeling capability than LoRA.
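A quick check of the \(r=16\) arithmetic in the rank-bound claim above, using the expansion width from the method section: \(2r + C(r,2) = 2 \cdot 16 + \frac{16 \cdot 15}{2} = 32 + 120 = 152\), and \(152 / 16 = 9.5\), i.e., just under a 10× larger rank budget for the update.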
## Limitations & Future Work
- Evaluation is limited to commonsense reasoning and GLUE; arithmetic reasoning, code generation, and multimodal generation tasks are not covered.
- Only second-order polynomial expansion is adopted; the effect of higher-order expansions (\(k>2\)) remains unexplored.
- The contribution of cross terms is marginal (87.38% vs. 87.48% with square terms only), suggesting potential redundancy; finer-grained term selection strategies are needed.
- Comparisons with recent mixture-of-experts LoRA methods (e.g., MELoRA) or adaptive rank methods (e.g., AdaLoRA) are absent.
- Polynomial expansion may cause dimensionality explosion at large ranks (\(C(r,2)\) grows quadratically with \(r\)), and scalability in high-rank settings warrants investigation.
## Related Work & Insights
- vs. LoRA: PERA is a strict generalization of LoRA, introducing high-order terms via polynomial expansion. LoRA corresponds to the special case of PERA where \(\mathbf{h}=0\).
- vs. HiRA: HiRA introduces nonlinearity via Hadamard products with pretrained weights, relying on external weight coupling; PERA introduces high-order interactions entirely within trainable parameters, requiring no external modules.
- vs. DoRA: DoRA decomposes magnitude and direction but remains a linear transformation; PERA introduces genuinely nonlinear parameter interactions.
- vs. MoRA: MoRA increases expressiveness through high-rank transformations at the cost of additional inference overhead; PERA maintains zero inference overhead.
## Rating
- Novelty: ⭐⭐⭐⭐ The idea of applying polynomial expansion to the parameter space of low-rank factors is novel, with clear theoretical analysis.
- Experimental Thoroughness: ⭐⭐⭐⭐ Validated across multiple models and tasks, with comprehensive ablations on rank, modules, data scale, and components.
- Writing Quality: ⭐⭐⭐⭐ Method description is clear, mathematical derivations are rigorous, and the relationship with LoRA is elegantly established.
- Value: ⭐⭐⭐⭐ Provides a simple yet effective enhancement of LoRA with strong practical utility.