# GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR
Conference: ACL 2026 · arXiv: 2601.09361 · Code: None · Area: Parameter-Efficient Fine-Tuning / Reinforcement Learning with Verifiable Rewards · Keywords: Low-rank adaptation, RLVR, geometry-aware, SVD initialization, parameter-efficient fine-tuning
## TL;DR
This paper proposes GeoRA, a low-rank adaptation method specifically designed for Reinforcement Learning with Verifiable Rewards (RLVR). It constructs a geometry-constrained matrix that fuses spectral and Euclidean priors to extract the principal directions of the RL update subspace for SVD initialization, while freezing a residual matrix as a structural anchor. On Qwen/Llama models ranging from 1.5B to 32B parameters, GeoRA consistently outperforms baselines such as LoRA, PiSSA, and MiLoRA across mathematical, medical, and code RLVR tasks, with stronger out-of-domain generalization and reduced capability forgetting.
## Background & Motivation
- Background: RLVR has become a core paradigm for enhancing LLM reasoning capabilities (OpenAI-o1, DeepSeek-R1). Unlike SFT, RLVR is fundamentally a constrained optimization process that amplifies latent reasoning behaviors via reward-induced sampling bias rather than injecting new knowledge. Consequently, RLVR is highly sensitive to update stability and to the preservation of pretrained representation geometry.
- Limitations of Prior Work: (1) Geometric mismatch between SFT-oriented low-rank methods and RLVR: PiSSA assigns trainable parameters to the principal components of the weight matrix, which is effective for SFT but conflicts with the preferred update subspace in RLVR — RLVR updates are biased toward low-energy directions orthogonal to the dominant pretrained features, whereas PiSSA forces updates along principal directions, leading to instability. (2) Efficiency bottleneck of sparse fine-tuning: sparse methods such as SparseFT align better with RLVR update patterns but are poorly supported by modern hardware for unstructured sparsity; their theoretical parameter efficiency does not translate into practical speedups and may even introduce additional overhead (10.8% slower than FullFT).
- Key Challenge: The effective update subspace of RLVR is anisotropic and compressible, concentrated along a small number of principal directions — but these directions are not the principal components of the pretrained weights. Existing low-rank methods either align to the wrong subspace (PiSSA) or have the correct orientation but are computationally inefficient (SparseFT).
- Goal: To design a PEFT method that simultaneously satisfies three criteria: (1) alignment with the RLVR-specific update geometry, (2) hardware efficiency via dense matrix computation, and (3) prevention of pretrained-representation degradation through a structural anchor.
- Key Insight: Analysis of RLVR's actual update patterns shows that the effective update subspace, while sparse, exhibits a compressible low-rank structure. This subspace is extracted via a geometry-constrained mask and then compressed into a low-rank adapter initialization via SVD.
- Core Idea: Rather than performing low-rank decomposition on the original weight matrix \(W\) as in LoRA/PiSSA, GeoRA performs SVD on a geometry-constrained view \(W_{Geo} = W \odot (M_{Spec} \cup M_{Euc})\) — a view that retains only parameters with low curvature (spectral prior) and high plasticity (Euclidean prior), precisely the update regions preferred by RLVR.
## Method
### Overall Architecture
GeoRA operates in two stages: (1) Offline preprocessing — constructing the geometry-constrained matrix \(W_{Geo}\), performing SVD to extract the top-\(r\) components for initializing the adapters \(A_{Geo}, B_{Geo}\), and computing the frozen residual matrix \(W_{res}\); (2) Online training — the forward pass computes \(h = W_{res} x + \frac{\alpha}{r} B_{Geo} A_{Geo} x\), where \(W_{res}\) is frozen and only \(A_{Geo}, B_{Geo}\) are trained. The initialization guarantees functional equivalence: \(W_{res} + \frac{\alpha}{r} B_{Geo} A_{Geo} = W\).
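The two-stage structure is easy to state in code. Below is a minimal PyTorch sketch of the online forward pass and the functional-equivalence property at initialization; the dimensions, variable names, and the value of \(\alpha\) are illustrative assumptions, not the authors' implementation (no code is released).

```python
import torch

# Minimal sketch of the GeoRA forward pass. Shapes and alpha are assumed
# (alpha is not reported in this summary). W_res is frozen; only
# A_geo (r x d_in) and B_geo (d_out x r) receive gradients.
d_out, d_in, r, alpha = 64, 64, 16, 32.0
scale = alpha / r

W = torch.randn(d_out, d_in)                       # pretrained weight
B_geo = torch.randn(d_out, r, requires_grad=True)  # trainable adapters
A_geo = torch.randn(r, d_in, requires_grad=True)
W_res = (W - scale * B_geo @ A_geo).detach()       # frozen residual anchor

x = torch.randn(d_in)
h = W_res @ x + scale * (B_geo @ (A_geo @ x))      # h = W_res x + (alpha/r) B A x

# Functional equivalence at initialization: W_res + (alpha/r) B_geo A_geo == W
assert torch.allclose(W_res + scale * B_geo @ A_geo, W, atol=1e-4)
```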
### Key Designs
- Geometric Prior Construction:
- Function: Extracts a parameter subspace from pretrained weights suited for RLVR updates.
- Mechanism: Two complementary geometric priors are combined. Spectral prior \(M_{Spec}\): compute the rank-\(r\) approximation \(\hat{W}_r\) of the weight matrix and select the entries whose absolute values fall at or below the \(\rho\)-quantile threshold, i.e., \((M_{Spec})_{i,j} = \mathbb{I}(|(\hat{W}_r)_{i,j}| \leq \tau_{Spec}(\rho))\), suppressing high-energy/high-curvature components to ensure spectral stability. Euclidean prior \(M_{Euc}\): select the entries with the smallest absolute values in the original weights, \((M_{Euc})_{i,j} = \mathbb{I}(|W_{i,j}| \leq \tau_{Euc}(\rho))\), capturing near-zero parameters with high plasticity. The two masks are combined via union: \(W_{Geo} = W \odot (M_{Spec} \cup M_{Euc})\). (A code sketch combining all three designs follows this list.)
- Design Motivation: Empirical analysis shows that the intersection of the two masks covers only 4.55% of parameters (Jaccard similarity 0.128), confirming that they capture highly complementary parameter subsets. The spectral prior ensures that principal components are not disrupted, while the Euclidean prior preserves adaptation flexibility — together they define an RLVR update manifold that is both stable and expressive.
- Geometry-Aware SVD Initialization:
- Function: Compresses the geometry-constrained subspace into an efficient low-rank adapter.
- Mechanism: SVD is applied to \(W_{Geo}\): \(W_{Geo} = U_{Geo} \Sigma_{Geo} V_{Geo}^\top\). The top-\(r\) components initialize the adapters: \(A_{Geo} = \Sigma_{Geo[:r,:r]}^{1/2} V_{Geo[:,:r]}^\top\), \(B_{Geo} = U_{Geo[:,:r]} \Sigma_{Geo[:r,:r]}^{1/2}\), so that the initial product \(B_{Geo} A_{Geo}\) is exactly the optimal rank-\(r\) approximation of \(W_{Geo}\) (Eckart–Young). The residual matrix \(W_{res} = W - \frac{\alpha}{r} B_{Geo} A_{Geo}\) is frozen during training.
- Design Motivation: The key distinction from PiSSA (which takes principal components of \(W\) directly) is that GeoRA extracts principal components from the geometry-constrained \(W_{Geo}\), ensuring that the trainable directions of the adapter are aligned with the actual update subspace of RLVR rather than the knowledge-encoding directions of pretraining.
- Frozen Residual Matrix (Structural Anchor):
- Function: Prevents erosion of the principal components of the pretrained representation during training.
- Mechanism: \(W_{res}\) retains the portion of the original weights outside the geometry-constrained subspace, preserving the core knowledge encoded in the pretrained model. During training, \(W_{res}\) is completely frozen, and the optimizer is constrained to move only on the geometry-aligned manifold parameterized by \(A_{Geo}, B_{Geo}\).
- Design Motivation: Overly aggressive updates in RLVR can lead to behavioral collapse or capability degradation (the "Reasoning Boundary Paradox"). Freezing the residual matrix imposes a hard structural constraint, equivalent to performing policy updates within a geometry-aligned trust region.
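Putting the three designs together, here is a minimal sketch of the offline preprocessing, following the formulas above. Function and variable names are hypothetical (the paper releases no code), and \(\alpha\) is an assumed LoRA-style scaling constant.

```python
import torch

def geora_init(W: torch.Tensor, r: int = 16, rho: float = 0.2, alpha: float = 32.0):
    """Sketch of GeoRA's offline preprocessing (hypothetical implementation)."""
    # Spectral prior M_Spec: rank-r approximation of W, then keep entries whose
    # magnitude falls at or below the rho-quantile (low-energy, low-curvature).
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    W_r = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]
    M_spec = W_r.abs() <= torch.quantile(W_r.abs(), rho)

    # Euclidean prior M_Euc: smallest-magnitude entries of W itself
    # (near-zero, high-plasticity parameters).
    M_euc = W.abs() <= torch.quantile(W.abs(), rho)

    # Geometry-constrained view: union (elementwise OR) of the two binary masks.
    W_geo = W * (M_spec | M_euc).to(W.dtype)

    # Geometry-aware SVD initialization: top-r components of W_geo, with the
    # singular values split symmetrically between the two adapter factors.
    Ug, Sg, Vgh = torch.linalg.svd(W_geo, full_matrices=False)
    sqrt_S = torch.diag(Sg[:r].sqrt())
    B_geo = Ug[:, :r] @ sqrt_S      # (d_out, r)
    A_geo = sqrt_S @ Vgh[:r, :]     # (r, d_in)

    # Frozen residual anchor; W_res + (alpha / r) * B_geo @ A_geo == W at init.
    W_res = W - (alpha / r) * B_geo @ A_geo
    return W_res, A_geo, B_geo
```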
### Loss & Training
GRPO is used for RLVR training. The rank is fixed at \(r=16\) with sparsity rate \(\rho=0.2\). Main experiments are conducted on the DeepMath-103K dataset. The SVD initialization is a one-time preprocessing step whose cost is negligible relative to RLVR training.
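As a usage illustration, a hypothetical adapter module could wrap each target nn.Linear with the geora_init sketch above, using the paper's settings (\(r=16\), \(\rho=0.2\)), before handing the model to a GRPO trainer; the GRPO loop itself is omitted, and \(\alpha\) remains an assumption.

```python
import torch
import torch.nn as nn

class GeoRALinear(nn.Module):
    """Hypothetical adapter layer built on the geora_init sketch above."""
    def __init__(self, linear: nn.Linear, r: int = 16, rho: float = 0.2,
                 alpha: float = 32.0):
        super().__init__()
        W_res, A, B = geora_init(linear.weight.data, r=r, rho=rho, alpha=alpha)
        self.register_buffer("W_res", W_res)  # frozen structural anchor
        self.A = nn.Parameter(A)              # trainable geometry-aligned factors
        self.B = nn.Parameter(B)
        self.bias = linear.bias               # carried over unchanged
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x @ self.W_res.T + self.scale * ((x @ self.A.T) @ self.B.T)
        return h if self.bias is None else h + self.bias
```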
## Key Experimental Results
### Main Results — Mathematical RLVR (Qwen3-8B)
| Method | AIME24 | AIME25 | MATH500 | OlymMATH | HumanEval (OOD) | MMLU (OOD) | IFEval (OOD) |
|---|---|---|---|---|---|---|---|
| Base | 13.33 | 11.67 | 71.20 | 9.75 | 76.83 | 71.94 | 54.32 |
| FullFT | 23.33 | 22.08 | 78.40 | 11.25 | 76.83 | 71.94 | 50.45 |
| LoRA | 19.58 | 19.58 | 75.60 | 10.75 | 81.10 | 75.65 | 52.13 |
| PiSSA | 22.50 | 20.42 | 74.40 | 11.75 | 71.95 | 73.89 | 48.74 |
| MiLoRA | 20.42 | 19.58 | 76.20 | 11.50 | 78.66 | 74.51 | 51.85 |
| GeoRA | 23.75 | 21.67 | 78.00 | 12.75 | 82.93 | 75.96 | 53.73 |
### Ablation Study (Qwen3-4B)
| Configuration | Reward | AIME24 | AIME25 | MATH500 | OlymMATH | Avg |
|---|---|---|---|---|---|---|
| GeoRA (Full) | 0.88 | 13.33 | 9.17 | 73.40 | 5.75 | 25.41 |
| Random-r Init | 0.85 | 12.50 | 8.50 | 72.10 | 5.25 | 24.60 |
| Tail-r Init | 0.82 | 11.67 | 7.50 | 70.80 | 4.50 | 23.40 |
| w/o \(M_{Spec}\) | 0.86 | 12.50 | 8.33 | 72.00 | 4.75 | 24.40 |
| w/o \(M_{Euc}\) | 0.83 | 13.33 | 8.75 | 72.80 | 5.50 | 25.10 |
### Key Findings
- GeoRA matches or surpasses FullFT on in-distribution tasks while consistently outperforming it on OOD tasks — HumanEval 82.93 (FullFT: 76.83), MMLU 75.96 (FullFT: 71.94) — indicating that geometry-aligned updates reduce interference with pretrained capabilities.
- PiSSA performs worst on OOD tasks (IFEval: 48.74), validating the hypothesis that SFT-oriented principal-component initialization is detrimental to RLVR.
- Spectral analysis confirms that GeoRA's updates barely touch the principal-component subspace (\(\mathcal{S}_{Head} \leq 0.02\)), whereas PiSSA exhibits near-complete overlap (\(\approx 0.98\)); see the sketch after this list for the assumed form of this statistic.
- Efficiency advantages are significant: only 0.04B trainable parameters (0.5% of FullFT), 19.9% faster training than FullFT, and 28.5% VRAM savings.
- GeoRA is robust to hyperparameter variation, maintaining high reward across a wide range of learning rates, while PiSSA and MiLoRA degrade sharply at high learning rates.
- GeoRA is also effective for medical and code RLVR: MedQA 76.12 (LoRA: 74.23), MBPP 81.60 (LoRA: 81.00).
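The exact computation of \(\mathcal{S}_{Head}\) is not defined in this summary; below is a minimal sketch under the assumption that it measures the fraction of update energy projecting onto the top-\(r\) singular subspaces of the pretrained weight.

```python
import torch

def head_overlap(W: torch.Tensor, delta_W: torch.Tensor, r: int = 16) -> float:
    """Assumed form of S_Head: share of update energy landing in the span of
    the top-r left/right singular vectors of the pretrained W. Values near 0
    mean updates avoid the principal-component subspace; values near 1 mean
    they are confined to it."""
    U, _, Vh = torch.linalg.svd(W, full_matrices=False)
    # Project delta_W onto the top-r column and row spaces of W.
    proj = U[:, :r] @ (U[:, :r].T @ delta_W @ Vh[:r, :].T) @ Vh[:r, :]
    return (proj.norm() / delta_W.norm()).item() ** 2
```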
## Highlights & Insights
- The core insight is profound: the effective update subspace of RLVR is not isotropic random noise but exhibits a compressible heavy-tailed spectral structure, providing a theoretical foundation for applying low-rank methods to RLVR. The critical factor is performing the low-rank decomposition in the correct subspace.
- The complementarity of the two geometric priors is empirically verified: only 4.55% parameter overlap (Jaccard 0.128), confirming that spectral stability and parameter plasticity capture distinct informational dimensions.
- The frozen residual matrix design elevates LoRA's "additive residual" paradigm to a "structural anchor" paradigm — not only preserving initialization invariance but also enforcing constraints on the optimization trajectory, which is crucial for policy stability in RLVR.
## Limitations & Future Work
- While the SVD initialization is a one-time cost, it adds a preprocessing step that is inconvenient for scenarios requiring rapid iteration.
- Experiments primarily focus on reasoning-oriented RLVR tasks (mathematics, medicine, code); effectiveness in more open-ended RL settings (e.g., conversational preference optimization) remains unverified.
- The choices of sparsity rate \(\rho=0.2\) and rank \(r=16\) were not subjected to extensive search, and better configurations may exist.
- The geometric priors rely on the statistical properties of pretrained weights; whether these properties hold for models that have undergone substantial post-training remains to be verified.
- Comparisons with additional LoRA variants such as DoRA and AdaLoRA are absent.
## Related Work & Insights
- vs. PiSSA: PiSSA initializes adapters on the principal components of pretrained weights, which is beneficial for SFT but harmful for RLVR — its NSS reaches 0.395 (severe structural disruption) and \(\mathcal{S}_{Head} \approx 0.98\) (updates almost entirely confined to principal components). GeoRA achieves NSS of only 0.092 and \(\mathcal{S}_{Head} \leq 0.02\), with updates precisely targeting the geometry-constrained tail subspace.
- vs. MiLoRA: MiLoRA initializes using minor singular components, which is directionally closer to RLVR but does not explicitly leverage geometric priors. GeoRA precisely defines the update manifold through dual masks, consistently outperforming MiLoRA across all benchmarks.
- vs. SparseFT: SparseFT's update patterns align with RLVR but suffer from poor computational efficiency (10.8% slower than FullFT). GeoRA compresses the sparse subspace into dense low-rank computation, achieving 19.9% speedup over FullFT.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — The first geometry-aware low-rank adaptation method specifically designed for RLVR, with tightly integrated theoretical analysis and method design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Multi-scale models (1.5B–32B) × three domains (math/medicine/code) × comprehensive ablation and mechanistic analysis.
- Writing Quality: ⭐⭐⭐⭐ — Motivation is clearly derived and spectral analysis is convincing, though the dense notation raises the barrier for first-time readers.
- Value: ⭐⭐⭐⭐⭐ — Establishes a new paradigm for parameter-efficient training in the RLVR era; the geometry-aware approach is generalizable to other RL fine-tuning scenarios.