StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold¶
Conference: NeurIPS 2025
arXiv: 2510.01938
Code: GitHub
Area: Parameter-Efficient Fine-Tuning / Low-Rank Adaptation
Keywords: LoRA, Stiefel Manifold, Subspace Learning, Riemannian Optimization, Three-Factor Decomposition
TL;DR¶
This paper proposes StelLA, which decomposes the LoRA adaptation matrix into a three-factor form \(USV^\top\) and constrains \(U\) and \(V\) to the Stiefel manifold for Riemannian optimization, enabling explicit subspace learning during training. StelLA consistently outperforms existing LoRA variants across multiple downstream tasks.
Background & Motivation¶
LoRA is the dominant method for parameter-efficient fine-tuning of large models, adapting pretrained weights by learning a low-rank update \(BA^\top\). However, a performance gap remains between LoRA and full fine-tuning. Prior work such as PiSSA and MiLoRA attempts to improve LoRA initialization via SVD, but these methods only guide the beginning of training and have limited influence on subsequent optimization.
The authors identify a critical issue: different heuristic subspace selection strategies — whether to select principal or minor components, and whether to base selection on weights or gradients — yield conflicting conclusions, indicating that manual subspace selection is suboptimal. This naturally raises the question: can the optimal subspace be learned directly during training?
Furthermore, the two-factor decomposition \(BA^\top\) in LoRA couples the input/output subspaces with the scaling factor, making structured geometric optimization difficult. Inspired by the SVD structure, the authors propose to decouple direction (subspace) from magnitude (scaling), consistent with the idea in DoRA of decomposing weights into magnitude and direction components.
Method¶
Overall Architecture¶
StelLA represents the low-rank adaptation of each linear layer in a three-factor form:

\[
\Delta W = U S V^\top,
\]
where \(U \in \text{St}(r,m)\) and \(V \in \text{St}(r,n)\) are orthonormal bases for the output and input subspaces respectively (constrained to the Stiefel manifold), and \(S \in \mathbb{R}^{r \times r}\) learns the mapping between the two subspaces. This design explicitly decouples subspace direction from scaling magnitude, enabling subspace optimization while preserving orthogonality.
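A minimal PyTorch sketch of this parameterization (our illustration, not the paper's code; the class name is hypothetical and any LoRA-style scaling factor is omitted):

```python
import torch
import torch.nn as nn

class StelLALinear(nn.Module):
    """Sketch of a StelLA-adapted linear layer (class name is ours)."""

    def __init__(self, base: nn.Linear, r: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # pretrained weights stay frozen
        m, n = base.out_features, base.in_features
        # Random column-orthonormal bases for the output/input subspaces.
        self.U = nn.Parameter(torch.linalg.qr(torch.randn(m, r))[0])  # St(r, m)
        self.V = nn.Parameter(torch.linalg.qr(torch.randn(n, r))[0])  # St(r, n)
        # S starts as the identity (the paper's default non-zero init).
        self.S = nn.Parameter(torch.eye(r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x (W0 + U S V^T)^T = base(x) + x V S^T U^T
        return self.base(x) + (x @ self.V) @ self.S.T @ self.U.T
```

Keeping \(U\) and \(V\) on the manifold is not the layer's job; it is enforced by the optimizer-side geometric operations described under Key Designs below.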
Key Designs¶
- Stiefel Manifold Constraint and Riemannian Optimization: To ensure \(U\) and \(V\) remain column-orthogonal throughout training, they are constrained to the Stiefel manifold \(\text{St}(k,n) = \{Y \in \mathbb{R}^{n \times k} \mid Y^\top Y = I_k\}\). Optimization proceeds in three steps: (a) converting the Euclidean gradient to the Riemannian gradient \(\text{grad}_Y = \nabla_Y - Y(\nabla_Y)^\top Y\); (b) projecting the perturbed update produced by the optimizer back to the tangent space via \(\pi_Y(\Delta) = \Delta - Y\,\text{symm}(Y^\top \Delta)\), where \(\text{symm}(A) = (A + A^\top)/2\); (c) retracting the updated point back onto the manifold via the polar retraction \(\rho_Y(\Delta) = \text{uf}(Y + \Delta)\), where \(\text{uf}(\cdot)\) denotes the orthogonal factor of the polar decomposition. This design allows any existing Euclidean optimizer (e.g., Adam) to be seamlessly adapted into a Riemannian optimizer.
- Modular Geometric Optimization Design: The algorithm is implemented via optimizer hooks: Riemannian gradient conversion as a pre-hook, and projection and retraction as post-hooks. Unlike existing Riemannian optimizers (e.g., Riemannian Adam), StelLA decouples geometric constraints from optimizer logic: it treats the optimizer's update direction as a perturbation of the Riemannian gradient and corrects it via projection, requiring no modification of the optimizer's internal momentum or adaptive learning-rate mechanisms. A batched SVD strategy (stacking \(U\)/\(V\) matrices of the same shape across layers) achieves a 15–20× speedup.
- Gradient Scaling Strategy: Since the columns of \(U\) and \(V\) are unit vectors, their element magnitudes are on the order of \(1/\sqrt{m}\) and \(1/\sqrt{n}\), respectively. When \(m \neq n\) (e.g., in LLM FFN layers, where the hidden dimension is expanded by a factor of about 4), Adam's gradient normalization causes an imbalance in the effective learning rates of \(U\) and \(V\). StelLA compensates by multiplying the gradients of \(U\) and \(V\) by \(\sqrt{d/m}\) and \(\sqrt{d/n}\), respectively, before the projection step, where \(d\) is the model's hidden dimension; see the sketch after this list.
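The three geometric operations above are simple enough to state directly in code. Below is a self-contained toy sketch of the pre-hook/post-hook update around an unmodified Adam step (our own illustration of the stated formulas, not the paper's implementation; the toy loss and dimensions are invented):

```python
import torch

def riemannian_grad(Y, G):
    # (a) Euclidean -> Riemannian gradient on the Stiefel manifold:
    #     grad_Y = G - Y G^T Y
    return G - Y @ G.T @ Y

def tangent_project(Y, D):
    # (b) pi_Y(D) = D - Y symm(Y^T D), with symm(A) = (A + A^T) / 2
    YtD = Y.T @ D
    return D - Y @ ((YtD + YtD.T) / 2)

def polar_retract(Y, D):
    # (c) rho_Y(D) = uf(Y + D): orthogonal polar factor via thin SVD
    P, _, Qh = torch.linalg.svd(Y + D, full_matrices=False)
    return P @ Qh

# Toy run: pull a Stiefel point toward a target with plain Adam,
# correcting each step with projection + retraction.
n, k = 64, 8
Y = torch.nn.Parameter(torch.linalg.qr(torch.randn(n, k))[0])
target = torch.linalg.qr(torch.randn(n, k))[0]
opt = torch.optim.Adam([Y], lr=1e-2)

for _ in range(200):
    loss = (Y - target).pow(2).sum()
    loss.backward()
    with torch.no_grad():
        Y.grad = riemannian_grad(Y, Y.grad)       # pre-hook
        # StelLA would also scale Y.grad by sqrt(d / n) here (d = hidden dim).
    Y_old = Y.detach().clone()
    opt.step()                                    # unmodified Euclidean Adam
    with torch.no_grad():
        step = tangent_project(Y_old, Y - Y_old)  # post-hook: project ...
        Y.copy_(polar_retract(Y_old, step))       # ... then retract
    opt.zero_grad()

print((Y.detach().T @ Y.detach() - torch.eye(k)).abs().max())  # stays ~orthonormal
```

The point of the demo is that Adam's internal state is never touched: the pre-hook rewrites the gradient, and the post-hook reinterprets whatever step Adam took as a tangent perturbation and retracts it back onto the manifold.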
Loss & Training¶
The training objective is the standard task loss \(\mathcal{L}\), but the optimization paths differ: \(U\) and \(V\) are optimized under Stiefel-manifold constraints, while \(S\) is optimized without constraints in Euclidean space. Initialization uses random column-orthogonal matrices for \(U\) and \(V\) and the identity matrix for \(S\); unlike PiSSA, no modification of the pretrained weights is required.
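Concretely, this split can be expressed as two optimizer parameter groups, with only the first routed through the geometric hooks (hypothetical wiring that reuses the StelLALinear sketch above; hook registration is omitted):

```python
# Hypothetical wiring: U and V form the Stiefel-constrained group,
# while S is updated as an ordinary Euclidean parameter.
layer = StelLALinear(nn.Linear(1024, 4096), r=32)  # e.g., an FFN up-projection
opt = torch.optim.AdamW(
    [
        {"params": [layer.U, layer.V]},  # handled by the geometric hooks
        {"params": [layer.S]},           # plain unconstrained updates
    ],
    lr=1e-4,
)
```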
Key Experimental Results¶
Main Results¶
Commonsense Reasoning (LLaMA3-8B, rank=32)
| Method | Params (%) | BoolQ | PIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | Avg. |
|---|---|---|---|---|---|---|---|---|
| LoRA | 0.700 | 75.16 | 88.14 | 95.41 | 86.74 | 90.84 | 78.70 | 85.27 |
| DoRA | 0.710 | 75.38 | 88.01 | 95.35 | 86.29 | 90.54 | 79.69 | 85.16 |
| ScaledAdamW | 0.700 | 75.24 | 88.57 | 95.81 | 85.11 | 91.09 | 80.55 | 85.40 |
| StelLA | 0.702 | 75.91 | 89.86 | 96.41 | 87.82 | 91.98 | 82.34 | 86.72 |
StelLA achieves an average improvement of approximately 1.3 percentage points on both LLaMA2-7B and LLaMA3-8B.
Text-to-Image Generation (SD 1.5, rank=4, FID↓)
| Dataset | LoRA | DoRA | PiSSA | StelLA |
|---|---|---|---|---|
| BarbieCore | 175.48 | 175.04 | 299.49 | 170.25 |
| Expedition | 156.34 | 155.80 | 291.22 | 146.12 |
| Hornify | 180.48 | 179.58 | 295.15 | 167.53 |
The FID reduction relative to the strongest baseline reaches up to about 12 points (Hornify: 179.58 → 167.53).
Ablation Study¶
| Configuration | Avg. Accuracy | Notes |
|---|---|---|
| StelLA (default) | 86.72 | Non-zero init + gradient scaling |
| Euclidean geometry (no orthogonality constraint) | 84.4 | Demonstrates necessity of orthogonal constraint |
| Quotient space geometry | 85.7 | Inferior to Stiefel product manifold |
| Zero initialization | 86.5 | Small \(S\) leads to small \(U\)/\(V\) gradients, slow convergence |
| Pseudo-zero initialization | 84.2 | Corrupts pretrained weights, worst performance |
| SVD-major initialization | 86.7 | Close to random init, showing geometric optimization can learn subspaces automatically |
| Without gradient scaling | 86.4 | Scaling provides +0.3 improvement |
| Polar retraction vs. exponential map | 86.72 vs 86.76 | Comparable performance; polar retraction is more efficient |
Key Findings¶
- The Stiefel manifold constraint substantially outperforms unconstrained Euclidean three-factor decomposition (TriLoRA/MoSLoRA): 86.7 vs. 84.4
- StelLA is robust to initialization strategy: random initialization, SVD principal components, and SVD minor components yield comparable performance
- The additional parameter count over two-factor LoRA is only \(r^2\) per adapted layer (e.g., \(32^2 = 1{,}024\) extra parameters at rank 32), which is negligible
- StelLA also achieves consistently superior results on image classification (ViT-Base/Large, 8 datasets)
Highlights & Insights¶
- Elegant design: The three-factor decomposition combined with Stiefel manifold constraints forms a geometrically natural framework that directly embeds the SVD structure into the training process
- Modularity: Implemented via optimizer hooks, StelLA can be combined with any existing optimizer without the need to manually implement special optimizers such as Riemannian Adam
- Batched SVD acceleration: The bottleneck operation (polar decomposition) is accelerated 15–20× via cross-layer batching, addressing practical efficiency concerns
- The gradient scaling strategy, while simple, provides effective learning rate balancing for the asymmetric dimensions in FFN layers
Limitations & Future Work¶
- The three-factor decomposition and retraction operations introduce additional computational overhead (partially mitigated by batched SVD)
- Integration with adaptive-rank methods such as AdaLoRA (which could constrain \(S\) to a diagonal matrix) has not been explored
- Validation on 70B-scale models or other model families such as Mistral/LLaVA is absent
- Joint use with QLoRA (quantization) has not been explored
Related Work & Insights¶
- StelLA can be combined with AdaLoRA's rank-adaptive strategy by constraining \(S\) to a diagonal matrix, enabling dynamic rank adjustment via singular value pruning
- Orthogonality constraints may also offer potential benefits for adversarial robustness
- The geometric optimization framework opens possibilities for applying other manifold constraints (e.g., Grassmann manifold) in fine-tuning
Rating¶
- Novelty: ⭐⭐⭐⭐ Introducing Stiefel manifold optimization into LoRA is original; while three-factor decomposition is not novel per se, the geometric constraint design is
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers NLU, NLG, visual classification, and image generation; ablations are comprehensive
- Writing Quality: ⭐⭐⭐⭐⭐ Mathematical derivations are clear, algorithmic descriptions are rigorous, and ablation design is convincing
- Value: ⭐⭐⭐⭐ Provides a plug-and-play improvement over LoRA; code is open-sourced with strong practical utility