Flat-LoRA: Low-Rank Adaptation over a Flat Loss Landscape

Conference: ICML 2025
arXiv: 2409.14396
Code: nblt/Flat-LoRA
Area: Image Generation
Keywords: LoRA, Parameter-Efficient Fine-Tuning, Flat Minima, Random Weight Perturbation, Bayesian Expected Loss

TL;DR

This paper proposes Flat-LoRA, which trains LoRA under random weight perturbations applied in the full parameter space, an approach motivated by a Bayesian expected loss objective. The perturbations steer LoRA toward minima that are flat in the full parameter space rather than only in the LoRA subspace. This improves both in-domain and out-of-distribution generalization with almost no increase in training time or GPU memory.

Background & Motivation

LoRA restricts fine-tuning to a low-rank matrix space \(\mathcal{M}_r\). Upon completion of training, the low-rank adaptation \(\Delta W = BA\) is merged into the pre-trained weights \(W\) for inference. Existing improvements (such as AdaLoRA, DoRA, PiSSA, etc.) focus on the optimization quality within the LoRA parameter space, overlooking the relationship between the LoRA space and the full parameter space.
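As a small illustration of the merge step described above, the following NumPy sketch builds \(W' = W + BA\) and compares the number of trainable LoRA parameters with the size of the full weight. The sizes, the zero initialization of `B`, and the helper name `merge_lora` are assumptions chosen for illustration, not details taken from the paper.

```python
import numpy as np

def merge_lora(W, B, A):
    """Return the merged weight W' = W + B @ A used at inference time."""
    return W + B @ A

rng = np.random.default_rng(0)
m, n, r = 6, 4, 2                    # illustrative sizes (assumption)
W = rng.standard_normal((m, n))      # frozen pre-trained weight
B = np.zeros((m, r))                 # assumed init: B starts at zero
A = rng.standard_normal((r, n))      # trainable low-rank factor

W_merged = merge_lora(W, B, A)
lora_params = B.size + A.size        # r * (m + n) trainable parameters
full_params = W.size                 # m * n entries in the full space
```

With `B` at zero the merged weight equals `W`; training only ever moves `W_merged` along rank-`r` updates, which is the restriction the rest of this note is concerned with.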

Core observation: Minima that appear flat in the LoRA parameter space can still exhibit sharp directions in the full parameter space (Figure 1). Sharp minima tend to hurt generalization, particularly under distribution shift.

A natural idea is to combine SAM (Sharpness-Aware Minimization) with LoRA (LoRA-SAM), but this poses three problems:

Restricted Optimization Directions: LoRA-SAM only optimizes sharpness within the column space of \(A\), failing to cover the full parameter space.

Doubled Training Cost: SAM requires an additional gradient step per iteration, which roughly doubles the training cost and makes it impractical for large-scale models.

Requirement to Store Full-Parameter Perturbations: This violates the original intent of parameter-efficient fine-tuning.

Method

1. Flatness Objective in the Full Parameter Space

The ideal goal is to directly apply SAM in the full parameter space:

\[\min_{A,B} \max_{\|\varepsilon_W\|_F \leq \rho} L(W + BA + \varepsilon_W)\]

where \(\varepsilon_W \in \mathbb{R}^{m \times n}\) is the adversarial perturbation in the full parameter space. However, direct optimization requires an extra gradient step and storing the entire perturbation matrix.
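To see where the extra gradient step comes from, it may help to write out the standard first-order approximation used by SAM (Foret et al., 2020), restated here in this note's notation as a sketch rather than as the paper's own derivation. Linearizing the loss around \(W + BA\) gives a closed-form inner maximizer:

\[L(W + BA + \varepsilon_W) \approx L(W + BA) + \langle \nabla_W L(W + BA), \varepsilon_W \rangle, \qquad \hat{\varepsilon}_W = \rho \, \frac{\nabla_W L(W + BA)}{\|\nabla_W L(W + BA)\|_F}\]

Computing \(\hat{\varepsilon}_W\) needs one gradient evaluation, and the update of \(A, B\) at \(W + BA + \hat{\varepsilon}_W\) needs a second one. The matrix \(\hat{\varepsilon}_W\) also has the same \(m \times n\) shape as \(W\), which is the storage cost mentioned above.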

2. Bayesian Expected Loss Relaxation

Relaxing the maximization to an expectation yields the core objective function of Flat-LoRA:

\[\min_{A,B} \mathbb{E}_{(\varepsilon_W)_{i,j} \sim \mathcal{N}(0, \sigma^2)} L(W + BA + \varepsilon_W)\]

By Lemma 3.1: If \(L(W)\) is \(\alpha\)-Lipschitz and \(\beta\)-smooth, then the expected loss function is \(\min\{\alpha/\sigma, \beta\}\)-smooth. That is, a larger noise standard deviation \(\sigma\) (equivalently, a larger variance \(\sigma^2\)) results in a smoother loss landscape, encouraging the optimization to converge to flatter regions.
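A quick worked example of the bound, using made-up constants \(\alpha = 1\) and \(\beta = 10\) purely for illustration:

\[\sigma = 0.05: \ \min\{1/0.05,\ 10\} = \min\{20,\ 10\} = 10, \qquad \sigma = 0.5: \ \min\{1/0.5,\ 10\} = \min\{2,\ 10\} = 2\]

Under these assumed values the noise only tightens the smoothness constant once \(\sigma > \alpha/\beta = 0.1\); below that threshold the bound is the original \(\beta\).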

Practical operation: At each training step, a noise matrix \(\varepsilon_W\) is sampled, and the perturbed gradient is calculated to update \(A\) and \(B\). No extra gradient step is required, leaving the training time virtually unchanged.
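The step described above can be sketched in NumPy on a toy least-squares problem. Everything here is an illustrative assumption: the loss, the sizes, the learning rate, the helper names `loss_and_grad` and `perturbed_step`, and the use of plain i.i.d. Gaussian noise instead of the structured noise the paper uses. The point is only that the gradient for \(A\) and \(B\) is taken at the perturbed weights in a single forward/backward pass.

```python
import numpy as np

def loss_and_grad(W_eff, X, Y):
    """Toy least-squares loss and its gradient w.r.t. the effective weight."""
    resid = X @ W_eff.T - Y                      # (N, m) residuals
    loss = 0.5 * np.mean(np.sum(resid ** 2, axis=1))
    grad = resid.T @ X / X.shape[0]              # (m, n), same shape as W_eff
    return loss, grad

def perturbed_step(W, B, A, X, Y, sigma, lr, rng):
    """One SGD step on (A, B) using the gradient at randomly perturbed weights."""
    eps = sigma * rng.standard_normal(W.shape)   # plain i.i.d. noise for brevity
    _, G = loss_and_grad(W + B @ A + eps, X, Y)  # single forward/backward pass
    grad_B, grad_A = G @ A.T, B.T @ G            # chain rule through B @ A
    return B - lr * grad_B, A - lr * grad_A

rng = np.random.default_rng(0)
m, n, r, N = 5, 8, 2, 256                        # illustrative sizes (assumption)
W = rng.standard_normal((m, n))                  # frozen "pre-trained" weight
delta = rng.standard_normal((m, r)) @ rng.standard_normal((r, n)) / np.sqrt(n)
X = rng.standard_normal((N, n))
Y = X @ (W + delta).T                            # targets from a rank-r shift
B = np.zeros((m, r))
A = rng.standard_normal((r, n)) / np.sqrt(n)

initial_loss, _ = loss_and_grad(W + B @ A, X, Y)
for _ in range(500):
    B, A = perturbed_step(W, B, A, X, Y, sigma=0.01, lr=0.02, rng=rng)
final_loss, _ = loss_and_grad(W + B @ A, X, Y)
```

Only `A` and `B` are updated; `W` stays frozen and `eps` is discarded after each step, so the loop has the same number of gradient evaluations as ordinary LoRA training.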

3. Fine-Grained Random Perturbation Generation Strategy

The perturbation is not simple i.i.d. Gaussian noise; instead, it accounts for two factors:

  • Filter Structure: Noise is generated row-wise (filter-wise), where filters with larger norms receive larger perturbations.
  • Scaling by Input Dimension: The variance is scaled by \(1/n\) to ensure that the variance introduced by the perturbation during forward propagation is independent of the input dimension.

The final perturbation generation formula is:

\[(\varepsilon_W)_{i,j} \sim \mathcal{N}\left(0, \frac{\sigma^2}{n} \|W'_{i,:}\|_2^2\right)\]

where \(W' = W + BA\) represents the merged weights. By Proposition 3.2, this design increases the output variance by a factor of \(1+\sigma^2\), which is independent of the input dimension \(n\).
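The \(1+\sigma^2\) factor can be checked numerically under a simplifying assumption that is ours rather than the paper's: if the input \(x\) has i.i.d. zero-mean, unit-variance entries, the variance of output \(i\) is the squared norm of row \(i\). The sketch below samples the filter-wise noise and averages the ratio \(\|W'_{i,:} + (\varepsilon_W)_{i,:}\|_2^2 / \|W'_{i,:}\|_2^2\) over many draws. The function name `filterwise_noise` and all sizes are hypothetical.

```python
import numpy as np

def filterwise_noise(W_merged, sigma, rng):
    """Sample eps with Var[eps_ij] = sigma^2 / n * ||W'_{i,:}||_2^2."""
    n = W_merged.shape[1]
    row_norms = np.linalg.norm(W_merged, axis=1, keepdims=True)  # (m, 1)
    std = sigma * row_norms / np.sqrt(n)                         # per-row std
    return std * rng.standard_normal(W_merged.shape)

rng = np.random.default_rng(0)
m, n, sigma = 4, 256, 0.5                  # illustrative values (assumption)
W_merged = rng.standard_normal((m, n))
base_var = np.sum(W_merged ** 2, axis=1)   # output variance without noise

ratios = []
for _ in range(2000):
    eps = filterwise_noise(W_merged, sigma, rng)
    # For unit-variance i.i.d. inputs, output variance is the squared row norm.
    ratios.append(np.sum((W_merged + eps) ** 2, axis=1) / base_var)
ratio = float(np.mean(ratios))             # should be close to 1 + sigma**2
```

Repeating the experiment with a different `n` should leave `ratio` near `1 + sigma**2`, which is the input-dimension independence the \(1/n\) scaling is meant to provide.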

4. GPU Memory Optimization Based on Random Seeds

Although the full parameter perturbation \(\varepsilon_W\) is large, in practice, only the following need to be stored:

  • A random seed (a single integer)
  • The norm of each filter \(\{\|W'_{i,:}\|_2^2\}\) (\(m\) scalars, which is less than \(1/r\) of the LoRA parameters)
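The \(1/r\) bound follows from counting parameters: a LoRA layer has \(r(m+n)\) trainable entries, while the stored norms are \(m\) scalars. With illustrative sizes \(m = n = 4096\) and \(r = 8\) (our assumption, not a configuration from the paper):

\[\frac{m}{r(m+n)} < \frac{m}{rm} = \frac{1}{r}, \qquad \frac{4096}{8 \cdot (4096 + 4096)} = \frac{4096}{65536} = \frac{1}{16}\]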

Before the forward pass, the noise is generated from the seed and added to the weights. After backpropagation, the same noise is regenerated from the seed and subtracted again, so the full perturbation matrix never has to be kept in memory.
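A minimal sketch of this add-then-undo pattern, assuming NumPy's seeded generator as the noise source. The helper `noise_from_seed`, the sizes, and the seed value are hypothetical, and the forward/backward pass is left as a placeholder comment.

```python
import numpy as np

def noise_from_seed(seed, row_norms, n, sigma):
    """Regenerate the same filter-wise noise from a seed and stored row norms."""
    rng = np.random.default_rng(seed)
    std = sigma * row_norms[:, None] / np.sqrt(n)
    return std * rng.standard_normal((row_norms.shape[0], n))

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))            # stands in for the merged weights
W_before = W.copy()                        # kept only to verify the round trip
seed, sigma = 1234, 0.1                    # illustrative values (assumption)
row_norms = np.linalg.norm(W, axis=1)      # m scalars kept across the step

W += noise_from_seed(seed, row_norms, W.shape[1], sigma)   # perturb in place
# ... forward and backward pass on the perturbed weights would run here ...
W -= noise_from_seed(seed, row_norms, W.shape[1], sigma)   # regenerate, undo
```

The weights are restored up to floating-point rounding. The row norms are computed before perturbing and reused when undoing, because recomputing them from the perturbed weights would produce different noise.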

5. Progressive Perturbation Enhancement

In practice, it is recommended to gradually increase the perturbation strength \(\sigma\): linearly growing from 0 to the target value, allowing the model to first converge to a reasonably good region before progressive smoothing is applied.
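The linear ramp described above can be written as a small helper. The function name `sigma_at` and the clamping behavior outside the range `[0, total_steps]` are assumptions made for this sketch.

```python
def sigma_at(step, total_steps, sigma_target):
    """Linearly ramp the perturbation strength from 0 to sigma_target."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)   # clamp progress to [0, 1]
    return sigma_target * frac
```

For example, with `total_steps=100` and `sigma_target=0.05`, the strength is zero at step 0, half the target at step 50, and equal to the target from step 100 onward.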

Theoretical Comparison with LoRA-SAM

The equivalent perturbation of LoRA-SAM in the full parameter space is:

\[\varepsilon_W \approx c \, (\nabla_W L) A^\top A\]

This approximation holds when \(B\) is small. The perturbation covers only the column space of \(A\), an extremely small subspace of the full parameter space. In contrast, Flat-LoRA's random perturbation covers all directions in the full space.
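The subspace restriction is easy to check numerically. The sketch below uses a random matrix as a stand-in for \(\nabla_W L\) and omits the constant \(c\); the sizes are illustrative assumptions. It compares the rank of the LoRA-SAM-style perturbation with that of a dense random perturbation, and confirms that projecting onto the \(r\)-dimensional subspace determined by \(A\) leaves the former unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 16, 12, 2                       # illustrative sizes (assumption)
G = rng.standard_normal((m, n))           # stand-in for the gradient w.r.t. W
A = rng.standard_normal((r, n))           # LoRA factor A

eps_sam = G @ A.T @ A                     # LoRA-SAM equivalent direction, c omitted
eps_rand = rng.standard_normal((m, n))    # dense random perturbation

rank_sam = np.linalg.matrix_rank(eps_sam, tol=1e-8)
rank_rand = np.linalg.matrix_rank(eps_rand, tol=1e-8)

# Projector onto the r-dimensional subspace spanned by A.
P = np.linalg.pinv(A) @ A
stays_in_subspace = np.allclose(eps_sam @ P, eps_sam)
```

Here `rank_sam` cannot exceed `r`, while the random perturbation is full rank with probability one, which is the contrast the comparison above draws.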

Key Experimental Results

NLP: T5-base Fine-Tuning on GLUE Subsets (r=8/16)

| Method | MNLI | SST2 | CoLA | QNLI | MRPC | Avg |
|---|---|---|---|---|---|---|
| Full FT | 86.19 | 94.15 | 82.84 | 93.10 | 89.22 | 89.10 |
| LoRA (r=8) | 86.24 | 94.25 | 82.87 | 93.06 | 88.56 | 88.99 |
| Flat-LoRA (r=8) | 86.20 | 94.75 | 83.61 | 93.16 | 89.59 | 89.47 |
| LoRA (r=16) | 86.49 | 94.52 | 82.89 | 92.97 | 88.89 | 89.15 |
| Flat-LoRA (r=16) | 86.51 | 94.84 | 84.08 | 93.28 | 89.83 | 89.72 |

Flat-LoRA with \(r=16\) outperforms LoRA by +0.57 on average, with particularly noteworthy improvements in CoLA (+1.19) and MRPC (+0.94).

CV: CLIP ViT-B/32 Fine-Tuning on Image Classification

| Method | CIFAR-10 | CIFAR-100 | Cars | SVHN | DTD | Avg |
|---|---|---|---|---|---|---|
| LoRA (r=8) | 97.90 | 87.74 | 73.22 | 97.49 | 76.86 | 86.64 |
| Flat-LoRA (r=8) | 98.09 | 88.64 | 74.17 | 97.59 | 77.51 | 87.20 |
| LoRA (r=16) | 97.99 | 88.12 | 73.80 | 97.56 | 77.34 | 86.92 |
| Flat-LoRA (r=16) | 98.21 | 89.27 | 74.89 | 97.71 | 78.24 | 87.66 |

On CV tasks, \(r=16\) achieves an average improvement of +0.74, with an increase of over 1 percentage point on the Cars dataset.

Other Tasks

The paper also covers mathematical reasoning, code generation, dialogue, instruction following, and text-to-image generation, and reports consistent improvements across all setups, which supports the generality of the method.

Orthogonality with Other LoRA Variants

Flat-LoRA can be combined in a plug-and-play manner with DoRA, PiSSA, LoRA-GA, etc., delivering further gains on top of these variants.

Highlights & Insights

  1. Deep Problem Insight: Starting from the landscape difference between the LoRA space and the full parameter space, this work reveals a widely overlooked yet critical issue.
  2. Highly Simple and Efficient: Replacing the min-max of SAM with Bayesian expected loss avoids additional gradient steps; storing random seeds avoids memory bloat. The engineering implementation is highly elegant.
  3. Clear Theoretical Derivation: Deriving the limitations of LoRA-SAM from its equivalent perturbation, combined with the variance analysis of filter-wise perturbations, creates a highly coherent narrative.
  4. Outstanding Generality: Consistent effectiveness across modalities including NLP, CV, and generation.

Limitations & Future Work

  1. Selection of Hyperparameter \(\sigma\): Although progressive enhancement is suggested, the optimal growth strategy and target value still require per-task tuning.
  2. Limited Theoretical Guarantees: The expected loss is only a relaxation of the min-max objective, not an equivalent substitute. Furthermore, the theoretical connection between flatness and generalization remains somewhat controversial (e.g., counterexamples in Dinh et al., 2017).
  3. Single Noise Sampling: Sampling only one \(\varepsilon_W\) per step gives a high-variance estimate of the expected loss gradient. Averaging over several samples would reduce this variance, but at a proportionally higher computational cost.
  4. Insufficient Comparison with Latest Methods: Direct experimental comparisons with concurrent works such as LoRA-Pro (gradient alignment) are absent.
  5. Validation Limited to the LoRA Framework: Whether the method can be extended to other PEFT techniques (like Adapters or Prefix-Tuning) has not been explored.

Related Work

  • SAM (Foret et al., 2020): The classic min-max method for flat minima, but doubles training costs.
  • RWP (Bisla et al., 2022): A pioneer in finding flat minima via random weight perturbation; Flat-LoRA adapts this to PEFT scenarios.
  • DoRA / PiSSA / LoRA-GA: Orthogonal structural improvements to LoRA that can be combined with Flat-LoRA.
  • NEFTune (Jain et al., 2024): Enhances fine-tuning by adding noise to the embedding layer, which is analogous to Flat-LoRA's concept of adding noise to the weight space.

Rating

| Dimension | Score (1-5) | Description |
|---|---|---|
| Novelty | 4 | Novel perspective on the problem; the method itself is an engineering combination of existing techniques. |
| Theoretical Depth | 3.5 | Clear derivations but not deeply profound; core theorems are referenced from prior work. |
| Experimental Thoroughness | 4.5 | Covers multiple scenarios in NLP, CV, and generation, with extensive ablation studies. |
| Usability | 5 | Plug-and-play, near-zero additional overhead, with open-sourced code. |
| Writing Quality | 4 | Clearly structured, with Figure 1 being highly intuitive and easy to understand. |
| Total Score | 4.2 | High-quality work with outstanding practical value. |
