QuadEnhancer: Leveraging Quadratic Transformations to Enhance Deep Neural Networks¶
Conference: NeurIPS 2025 arXiv: 2510.03276 Code: GitHub Area: Model Compression Keywords: quadratic transformation, nonlinearity enhancement, lightweight module, LoRA fine-tuning, weight sharing
TL;DR¶
This paper proposes QuadEnhancer, a lightweight quadratic enhancer that adds sparsified quadratic interaction terms to each linear layer, yielding consistent gains on image classification, language modeling, and LLM fine-tuning with negligible additional parameters and computational overhead.
Background & Motivation¶
The core building block of modern deep neural networks is the combination of a linear transformation and a nonlinear activation function. Although this paradigm has achieved remarkable success in approximating complex functions, its nonlinear expressiveness still has room for improvement. Three main directions exist for enhancing nonlinearity:
More complex activation functions (e.g., Swish, GELU, Mish): These approaches focus on element-wise transformations and cannot capture inter-neuron interactions.
Nonlinear network modules (e.g., LSTM gating, attention mechanisms): These are typically task-specific and lack generality.
Polynomial transformations as replacements for linear operations (e.g., polynomial networks, QuadraNet): These offer stronger theoretical expressiveness but are limited in practice due to the dramatic growth in parameter count and computational cost.
The core limitation is that standard quadratic transformations require \(O(dn^2)\) additional parameters, which is unacceptable for practical deployment. The motivation of this paper is to find a method that introduces quadratic terms to enhance expressiveness while keeping the additional parameter and computational overhead negligible.
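To make these parameter counts concrete, here is a back-of-the-envelope comparison for a hypothetical hidden size of \(d = n = 768\) (the exact dimension is my choice, not from the paper); the reduced counts correspond to the rank-1, weight-sharing, and banded designs described under Key Designs below.

```python
d = n = 768   # hypothetical hidden size, roughly ViT-Base-like
k = 1         # bandwidth of the banded Lambda

linear = d * n                    # plain linear layer W
full_quadratic = d * n * n        # extra: one n x n matrix V_i per output dim
rank1 = 2 * d * n                 # extra: P and Q from V_i = p_i q_i^T
shared = d * d                    # extra: dense Lambda (P = Lambda W, Q = W)
banded = k * d                    # extra: banded Lambda, k scalars per row

print(full_quadratic)             # 452984832
print(rank1)                      # 1179648
print(shared)                     # 589824
print(banded, banded / linear)    # 768, i.e. a 1/n fraction of W's weights
```

The extra cost drops from hundreds of millions of parameters to a few hundred, which is the \(O(1/n)\) overhead claimed in the paper.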
Method¶
Overall Architecture¶
QuadEnhancer is a plug-and-play module that can be appended to any linear layer. For a standard linear transformation \(\tilde{\mathbf{y}} = \mathbf{W}\mathbf{x}\), the enhanced output becomes

\[
\mathbf{y} = (\mathbf{\Lambda}\tilde{\mathbf{y}}) \odot \tilde{\mathbf{y}} + \tilde{\mathbf{y}} + \mathbf{b},
\]

where \(\mathbf{\Lambda}\) is a sparse banded matrix and \(\odot\) denotes the Hadamard (element-wise) product.
Key Designs¶
- Rank-1 Decomposition for Parameter Reduction: In the original quadratic transformation, each output dimension corresponds to an \(n \times n\) matrix \(\mathbf{V}_i\), resulting in \(O(dn^2)\) parameters. Constraining each \(\mathbf{V}_i\) to a rank-1 matrix \(\mathbf{V}_i = \mathbf{p}_i\mathbf{q}_i^\top\) reduces the count to \(2dn\), i.e. \(O(dn)\). The quadratic transformation then becomes \(\mathbf{z} = (\mathbf{P}\mathbf{x}) \odot (\mathbf{Q}\mathbf{x}) + \mathbf{W}\mathbf{x} + \mathbf{b}\).
- Weight Sharing: Setting \(\mathbf{P} = \mathbf{\Lambda}\mathbf{W}\) and \(\mathbf{Q} = \mathbf{W}\) reuses the weight matrix \(\mathbf{W}\) of the linear layer. This yields two benefits: (a) the parameter count is reduced from \(3dn\) to \(dn + d^2\); and (b) the linear response \(\tilde{\mathbf{y}} = \mathbf{W}\mathbf{x}\) needs to be computed only once, eliminating redundant computation.
- Sparsification of \(\mathbf{\Lambda}\): A dense \(\mathbf{\Lambda} \in \mathbb{R}^{d \times d}\) still has \(O(d^2)\) parameters. Restricting it to a banded matrix of bandwidth \(k\) (augmented with small triangular regions at the lower-left and upper-right corners to form a circulant structure) reduces the count to \(kd\). When \(k=1\), the additional parameters amount to only a \(1/n\) fraction of the layer's \(dn\) weights. Computationally, \(\mathbf{\Lambda}\tilde{\mathbf{y}} = \sum_{r \in \mathcal{K}} \boldsymbol{\lambda}_r \odot \text{Roll}(\tilde{\mathbf{y}}, r)\), where Roll denotes a cyclic shift.
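The three designs above can be sketched as a drop-in PyTorch module for the \(\mathcal{K} = \{1\}\) case. This is a minimal sketch under my own assumptions, not the authors' code; in particular, zero-initializing \(\boldsymbol{\lambda}\) so the layer starts out as a plain linear map is my choice.

```python
import torch
import torch.nn as nn

class QuadEnhancedLinear(nn.Module):
    """Sketch of a linear layer with the quadratic enhancer:
    y = (Lambda y_lin) * y_lin + y_lin, with banded/circulant Lambda."""

    def __init__(self, in_features, out_features, shifts=(1,)):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.shifts = shifts
        # banded Lambda stored as one scalar per output dim per shift;
        # zero init (an assumption) makes the module start as a plain linear layer
        self.lam = nn.Parameter(torch.zeros(len(shifts), out_features))

    def forward(self, x):
        y = self.linear(x)  # linear response, computed once and reused
        # Lambda @ y realized via cyclic shifts: sum_r lambda_r * Roll(y, r)
        lam_y = sum(l * torch.roll(y, r, dims=-1)
                    for l, r in zip(self.lam, self.shifts))
        return y + lam_y * y  # adds cross terms y_i * y_{i+r}
```

With `shifts=(1,)`, each output neuron interacts only with its cyclic neighbor, matching the \(k=1\) setting the experiments use; the shift \(r=0\) (squared terms) is deliberately absent, as discussed under Loss & Training.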
Loss & Training¶
- The shift \(r=0\) (i.e., the pure squared term \(\tilde{y}_i^2\)) is excluded: for standard normal activations, squared terms have variance 2 versus 1 for the cross terms \(\tilde{y}_i \tilde{y}_j\), and large values occur orders of magnitude more frequently, making them more prone to numerical instability.
- Experiments fix \(\mathcal{K} = \{1\}\), introducing quadratic interactions only between adjacent neurons.
- No special loss function is required; standard task losses (e.g., cross-entropy) are used for end-to-end training.
Key Experimental Results¶
Main Results — Image Classification¶
| Model | Params | ImageNet | Caltech | CIFAR-10 | CIFAR-100 | Pets | Avg |
|---|---|---|---|---|---|---|---|
| ViT-M | 2.45M | 63.70 | 87.77 | 96.35 | 80.25 | 91.03 | 82.44 |
| ViT-M+QE | 2.47M | 65.30 | 90.32 | 97.09 | 82.59 | 91.88 | 83.91 |
| ViT-XT | 2.82M | 66.04 | 90.25 | 96.51 | 81.24 | 91.03 | 84.99 |
| ViT-XT+QE | 2.83M | 67.34 | 90.77 | 96.78 | 82.64 | 97.97 | 86.82 |
| ViT-T | 5.37M | 73.96 | 93.07 | 97.97 | 86.13 | 93.87 | 88.57 |
| ViT-T+QE | 5.40M | 75.15 | 94.03 | 98.03 | 86.88 | 94.95 | 89.27 |
Main Results — LLM Fine-tuning (Commonsense Reasoning)¶
| Model | Method | Params | BoolQ | HellaSwag | ARC-e | ARC-c | Avg |
|---|---|---|---|---|---|---|---|
| LLaMA-7B | LoRA/32 | 53.5M | 68.90 | 78.10 | 77.80 | 61.30 | 74.73 |
| LLaMA-7B | LoRA/16+QE | 27.6M | 69.69 | 87.11 | 79.41 | 63.99 | 77.85 |
| LLaMA3-8B | LoRA/32 | 54.0M | 70.80 | 91.70 | 84.20 | 71.20 | 80.79 |
| LLaMA3-8B | LoRA/32+QE | 54.7M | 74.92 | 95.02 | 89.85 | 79.60 | 85.46 |
Ablation Study¶
| Configuration | Params | ImageNet | CIFAR-10 | CIFAR-100 | Avg | Note |
|---|---|---|---|---|---|---|
| ViT-M+QuadraNet | 2.53M | 61.17 | 95.81 | 79.08 | 79.41 | Three independent weight matrices |
| ViT-M+SwiGLU | 2.58M | 63.25 | 96.76 | 80.58 | 81.13 | Gated linear unit |
| ViT-M+QuadEnhancer | 2.47M | 65.30 | 97.09 | 82.59 | 82.40 | Fewest parameters, best performance |
Key Findings¶
- With the LoRA rank halved (LoRA/16+QE, roughly half the trainable parameters), QE already surpasses rank-32 LoRA (a gain of 2.64% on LLaMA2-7B), demonstrating the substantial expressiveness contributed by quadratic interactions.
- As model and data scale increase, the gain from QE grows consistently from 0.07% to 1.19%, indicating favorable scaling properties.
- On the Pets dataset, ViT-XT+QE achieves a remarkable improvement of 6.94%, suggesting that the method is particularly effective for fine-grained classification tasks.
Highlights & Insights¶
- Extremely lightweight: At \(k=1\), the additional parameters consist of only \(d\) scalars, and the extra FLOPs amount to merely \(4d\), constituting near-zero overhead.
- Plug-and-play: The module can be seamlessly applied to diverse architectures including ViT, GPT-2, and LLaMA, and is compatible with parameter-efficient fine-tuning methods such as LoRA.
- Numerically stable by design: By excluding squared terms and retaining only cross terms, the method avoids overflow issues under FP16 precision.
Limitations & Future Work¶
- The experimental scale is relatively small (ViT-M/T are very small models, and WikiText-2 is a small corpus); validation at larger scales is needed.
- Only the \(k=1\) setting is explored; the effect of varying bandwidth \(k\) is not systematically studied.
- Theoretical analysis explaining why and under what conditions QuadEnhancer is most effective is absent.
- Training time comparisons show non-trivial overhead in early stages, and latency analysis for practical deployment is insufficient.
Related Work & Insights¶
- Compared to QuadraNet and SwiGLU, QuadEnhancer achieves superior performance with fewer parameters through weight sharing and sparsification.
- This approach can be generalized to any neural network layer based on linear transformations, including convolutional layers (by treating convolution as matrix multiplication).
- The enhancement of PEFT methods such as LoRA is particularly valuable, suggesting that quadratic interactions may serve as a general means to improve fine-tuning effectiveness.
Rating¶
- Novelty: ⭐⭐⭐⭐ Quadratic transformations are not a new concept, but the parameter reduction strategy via weight sharing and banded sparsification is novel
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers three tasks (CV, NLP, LLM fine-tuning) with ablation and scaling analyses
- Writing Quality: ⭐⭐⭐⭐⭐ Derivations are clear and the step-by-step simplification is easy to follow
- Value: ⭐⭐⭐⭐ The plug-and-play enhancement scheme is practically useful, but requires validation at larger scale