StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold¶
Conference: NeurIPS 2025
arXiv: 2510.01938
Code: GitHub
Area: Parameter-Efficient Fine-Tuning / Low-Rank Adaptation
Keywords: LoRA, Stiefel Manifold, Subspace Learning, Riemannian Optimization, Three-Factor Decomposition
TL;DR¶
This paper proposes StelLA, which decomposes the LoRA adaptation matrix into a three-factor form \(USV^\top\) and constrains \(U\) and \(V\) to the Stiefel manifold for Riemannian optimization, enabling explicit subspace learning during training. StelLA consistently outperforms existing LoRA variants across multiple downstream tasks.
Background & Motivation¶
LoRA is the dominant method for parameter-efficient fine-tuning of large models, adapting pretrained weights by learning a low-rank update \(BA^\top\). However, a performance gap remains between LoRA and full fine-tuning. Prior work such as PiSSA and MiLoRA attempts to improve LoRA initialization via SVD, but these methods only guide the beginning of training and have limited influence on subsequent optimization.
The authors identify a critical issue: different heuristic subspace selection strategies — whether to select principal or minor components, and whether to base selection on weights or gradients — yield conflicting conclusions, indicating that manual subspace selection is suboptimal. This naturally raises the question: can the optimal subspace be learned directly during training?
Furthermore, the two-factor decomposition \(BA^\top\) in LoRA couples the input/output subspaces with the scaling factor, making structured geometric optimization difficult. Inspired by the SVD structure, the authors propose to decouple direction (subspace) from magnitude (scaling), consistent with the idea in DoRA of decomposing weights into magnitude and direction components.
Method¶
Overall Architecture¶
StelLA represents the low-rank adaptation of each linear layer in a three-factor form:

\[
\Delta W = U S V^\top,
\]
where \(U \in \text{St}(r,m)\) and \(V \in \text{St}(r,n)\) are orthonormal bases for the output and input subspaces respectively (constrained to the Stiefel manifold), and \(S \in \mathbb{R}^{r \times r}\) learns the mapping between the two subspaces. This design explicitly decouples subspace direction from scaling magnitude, enabling subspace optimization while preserving orthogonality.
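A minimal PyTorch sketch of this parameterization (our illustration, not the paper's code; the class name is hypothetical and any LoRA-style scaling factor is omitted):

```python
import torch
import torch.nn as nn

class StelLALinear(nn.Module):
    """Sketch of a StelLA-adapted linear layer (class name is ours)."""

    def __init__(self, base: nn.Linear, r: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # pretrained weights stay frozen
        m, n = base.out_features, base.in_features
        # Random column-orthonormal bases for the output/input subspaces.
        self.U = nn.Parameter(torch.linalg.qr(torch.randn(m, r))[0])  # St(r, m)
        self.V = nn.Parameter(torch.linalg.qr(torch.randn(n, r))[0])  # St(r, n)
        # S starts as the identity (the paper's default non-zero init).
        self.S = nn.Parameter(torch.eye(r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x (W0 + U S V^T)^T = base(x) + x V S^T U^T
        return self.base(x) + (x @ self.V) @ self.S.T @ self.U.T
```

Keeping \(U\) and \(V\) on the manifold is not the layer's job; it is enforced by the optimizer-side geometric operations described under Key Designs below.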
Key Designs¶
- Stiefel Manifold Constraint and Riemannian Optimization: To ensure \(U\) and \(V\) remain column-orthogonal throughout training, they are constrained to the Stiefel manifold \(\text{St}(k,n) = \{Y \in \mathbb{R}^{n \times k} \mid Y^\top Y = I_k\}\). Optimization proceeds in three steps: (a) converting the Euclidean gradient to the Riemannian gradient \(\text{grad}_Y = \nabla_Y - Y(\nabla_Y)^\top Y\); (b) projecting the perturbed update produced by the optimizer back to the tangent space via \(\pi_Y(\Delta) = \Delta - Y\,\text{symm}(Y^\top \Delta)\), where \(\text{symm}(A) = (A + A^\top)/2\); (c) retracting the updated point back onto the manifold via the polar retraction \(\rho_Y(\Delta) = \text{uf}(Y + \Delta)\), where \(\text{uf}(\cdot)\) denotes the orthogonal factor of the polar decomposition. This design allows any existing Euclidean optimizer (e.g., Adam) to be seamlessly adapted into a Riemannian optimizer.
- Modular Geometric Optimization Design: The algorithm is implemented via optimizer hooks: Riemannian gradient conversion as a pre-hook, and projection and retraction as post-hooks. Unlike existing Riemannian optimizers (e.g., Riemannian Adam), StelLA decouples geometric constraints from optimizer logic: it treats the optimizer's update direction as a perturbation of the Riemannian gradient and corrects it via projection, requiring no modification of the optimizer's internal momentum or adaptive learning-rate mechanisms. A batched SVD strategy (stacking \(U\)/\(V\) matrices of the same shape across layers) achieves a 15–20× speedup.
- Gradient Scaling Strategy: Since the columns of \(U\) and \(V\) are unit vectors, their element magnitudes are on the order of \(1/\sqrt{m}\) and \(1/\sqrt{n}\), respectively. When \(m \neq n\) (e.g., in LLM FFN layers, where the hidden dimension is expanded by a factor of about 4), Adam's gradient normalization causes an imbalance in the effective learning rates of \(U\) and \(V\). StelLA compensates by multiplying the gradients of \(U\) and \(V\) by \(\sqrt{d/m}\) and \(\sqrt{d/n}\), respectively, before the projection step, where \(d\) is the model's hidden dimension; see the sketch after this list.
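The three geometric operations above are simple enough to state directly in code. Below is a self-contained toy sketch of the pre-hook/post-hook update around an unmodified Adam step (our own illustration of the stated formulas, not the paper's implementation; the toy loss and dimensions are invented):

```python
import torch

def riemannian_grad(Y, G):
    # (a) Euclidean -> Riemannian gradient on the Stiefel manifold:
    #     grad_Y = G - Y G^T Y
    return G - Y @ G.T @ Y

def tangent_project(Y, D):
    # (b) pi_Y(D) = D - Y symm(Y^T D), with symm(A) = (A + A^T) / 2
    YtD = Y.T @ D
    return D - Y @ ((YtD + YtD.T) / 2)

def polar_retract(Y, D):
    # (c) rho_Y(D) = uf(Y + D): orthogonal polar factor via thin SVD
    P, _, Qh = torch.linalg.svd(Y + D, full_matrices=False)
    return P @ Qh

# Toy run: pull a Stiefel point toward a target with plain Adam,
# correcting each step with projection + retraction.
n, k = 64, 8
Y = torch.nn.Parameter(torch.linalg.qr(torch.randn(n, k))[0])
target = torch.linalg.qr(torch.randn(n, k))[0]
opt = torch.optim.Adam([Y], lr=1e-2)

for _ in range(200):
    loss = (Y - target).pow(2).sum()
    loss.backward()
    with torch.no_grad():
        Y.grad = riemannian_grad(Y, Y.grad)       # pre-hook
        # StelLA would also scale Y.grad by sqrt(d / n) here (d = hidden dim).
    Y_old = Y.detach().clone()
    opt.step()                                    # unmodified Euclidean Adam
    with torch.no_grad():
        step = tangent_project(Y_old, Y - Y_old)  # post-hook: project ...
        Y.copy_(polar_retract(Y_old, step))       # ... then retract
    opt.zero_grad()

print((Y.detach().T @ Y.detach() - torch.eye(k)).abs().max())  # stays ~orthonormal
```

The point of the demo is that Adam's internal state is never touched: the pre-hook rewrites the gradient, and the post-hook reinterprets whatever step Adam took as a tangent perturbation and retracts it back onto the manifold.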
Loss & Training¶
The training objective is the standard task loss \(\mathcal{L}\), but the optimization paths differ: \(U\) and \(V\) are optimized under Stiefel-manifold constraints, while \(S\) is optimized without constraints in Euclidean space. Initialization uses random column-orthogonal matrices for \(U\) and \(V\) and the identity matrix for \(S\); unlike PiSSA, no modification of the pretrained weights is required.
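Concretely, this split can be expressed as two optimizer parameter groups, with only the first routed through the geometric hooks (hypothetical wiring that reuses the StelLALinear sketch above; hook registration is omitted):

```python
# Hypothetical wiring: U and V form the Stiefel-constrained group,
# while S is updated as an ordinary Euclidean parameter.
layer = StelLALinear(nn.Linear(1024, 4096), r=32)  # e.g., an FFN up-projection
opt = torch.optim.AdamW(
    [
        {"params": [layer.U, layer.V]},  # handled by the geometric hooks
        {"params": [layer.S]},           # plain unconstrained updates
    ],
    lr=1e-4,
)
```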
Key Experimental Results¶
Main Results¶
Commonsense Reasoning (LLaMA3-8B, rank=32)
| Method | Params (%) | BoolQ | PIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | Avg. |
|---|---|---|---|---|---|---|---|---|
| LoRA | 0.700 | 75.16 | 88.14 | 95.41 | 86.74 | 90.84 | 78.70 | 85.27 |
| DoRA | 0.710 | 75.38 | 88.01 | 95.35 | 86.29 | 90.54 | 79.69 | 85.16 |
| ScaledAdamW | 0.700 | 75.24 | 88.57 | 95.81 | 85.11 | 91.09 | 80.55 | 85.40 |
| StelLA | 0.702 | 75.91 | 89.86 | 96.41 | 87.82 | 91.98 | 82.34 | 86.72 |
StelLA achieves an average improvement of approximately 1.3 percentage points on both LLaMA2-7B and LLaMA3-8B.
Text-to-Image Generation (SD 1.5, rank=4, FID↓)
| Dataset | LoRA | DoRA | PiSSA | StelLA |
|---|---|---|---|---|
| BarbieCore | 175.48 | 175.04 | 299.49 | 170.25 |
| Expedition | 156.34 | 155.80 | 291.22 | 146.12 |
| Hornify | 180.48 | 179.58 | 295.15 | 167.53 |
The FID reduction relative to the strongest baseline reaches up to about 12 points (Hornify: 179.58 → 167.53).
Ablation Study¶
| Configuration | Avg. Accuracy | Notes |
|---|---|---|
| StelLA (default) | 86.72 | Non-zero init + gradient scaling |
| Euclidean geometry (no orthogonality constraint) | 84.4 | Demonstrates necessity of orthogonal constraint |
| Quotient space geometry | 85.7 | Inferior to Stiefel product manifold |
| Zero initialization | 86.5 | Small \(S\) leads to small \(U\)/\(V\) gradients, slow convergence |
| Pseudo-zero initialization | 84.2 | Corrupts pretrained weights, worst performance |
| SVD-major initialization | 86.7 | Close to random init, showing geometric optimization can learn subspaces automatically |
| Without gradient scaling | 86.4 | Scaling provides +0.3 improvement |
| Polar retraction vs. exponential map | 86.72 vs 86.76 | Comparable performance; polar retraction is more efficient |
Key Findings¶
- The Stiefel manifold constraint substantially outperforms unconstrained Euclidean three-factor decomposition (TriLoRA/MoSLoRA): 86.7 vs. 84.4
- StelLA is robust to initialization strategy: random initialization, SVD principal components, and SVD minor components yield comparable performance
- The additional parameter count over two-factor LoRA is only \(r^2\) per adapted layer (e.g., \(32^2 = 1{,}024\) extra parameters at rank 32), which is negligible
- StelLA also achieves consistently superior results on image classification (ViT-Base/Large, 8 datasets)
Highlights & Insights¶
- Elegant design: The three-factor decomposition combined with Stiefel manifold constraints forms a geometrically natural framework that directly embeds the SVD structure into the training process
- Modularity: Implemented via optimizer hooks, StelLA can be combined with any existing optimizer without the need to manually implement special optimizers such as Riemannian Adam
- Batched SVD acceleration: The bottleneck operation (polar decomposition) is accelerated 15–20× via cross-layer batching, addressing practical efficiency concerns
- The gradient scaling strategy, while simple, provides effective learning rate balancing for the asymmetric dimensions in FFN layers
Limitations & Future Work¶
- The three-factor decomposition and retraction operations introduce additional computational overhead (partially mitigated by batched SVD)
- Integration with adaptive-rank methods such as AdaLoRA (which could constrain \(S\) to a diagonal matrix) has not been explored
- Validation on 70B-scale models or other model families such as Mistral/LLaVA is absent
- Joint use with QLoRA (quantization) has not been explored
Related Work & Insights¶
- StelLA can be combined with AdaLoRA's rank-adaptive strategy by constraining \(S\) to a diagonal matrix, enabling dynamic rank adjustment via singular value pruning
- Orthogonality constraints may also offer potential benefits for adversarial robustness
- The geometric optimization framework opens possibilities for applying other manifold constraints (e.g., Grassmann manifold) in fine-tuning
Rating¶
- Novelty: ⭐⭐⭐⭐ Introducing Stiefel manifold optimization into LoRA is original; while three-factor decomposition is not novel per se, the geometric constraint design is
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers NLU, NLG, visual classification, and image generation; ablations are comprehensive
- Writing Quality: ⭐⭐⭐⭐⭐ Mathematical derivations are clear, algorithmic descriptions are rigorous, and ablation design is convincing
- Value: ⭐⭐⭐⭐ Provides a plug-and-play improvement over LoRA; code is open-sourced with strong practical utility