CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure¶

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=VVruwk9404
Code: https://github.com/KongBoao/CR-Net
Area: LLM Efficiency / Parameter-Efficient Training
Keywords: Low-rank structure, Cross-layer residual, Efficient pre-training, Activation recomputation, Memory optimization

TL;DR¶

CR-Net discovers that the "difference between adjacent layer activations" possesses a strong low-rank structure. Consequently, it reformulates every linear mapping as "previous layer activation × learnable scaling + low-rank increment." This design reduces parameters by half without losing high-rank information. When combined with an activation recomputation strategy specifically designed for this cross-layer dependency, it scales pre-training from 60M to 13B, consistently outperforming existing low-rank methods while consuming less VRAM and computation.

Background & Motivation¶

Background: LLM pre-training is becoming increasingly expensive—scaling from GPT-3 2.7B to 175B increases VRAM and compute requirements by 66x. To suppress costs, low-rank structures have become the mainstream direction, primarily divided into two camps: low-rank parameters (LoRA and its variants, using two small matrices \(A, B\) to replace full-rank weight updates) and low-rank gradients (GaLore, Apollo, etc., which project optimizer states into low-dimensional subspaces).

Limitations of Prior Work: The authors categorize the shortcomings of existing low-rank methods into three points. First (L1), low-rank parameterization leads to performance degradation—transformer weights are empirically near full-rank, and forcing low-rank constraints limits model capacity. While full-rank initialization, update aggregation, or non-linear operators can mitigate this, they consume the saved compute. Second (L2), low-rank gradient methods have computation bottlenecks—GaLore/FIRA require SVD to find subspaces, significantly slowing throughput, while random subspaces may cause degradation. Third (L3), almost all methods focus only on parameters/gradients/optimizer states, yet ignore activation memory—cached intermediate activations are typically 1–4x the model parameters and scale with batch size.

Key Challenge: A trade-off exists between low-rank cost savings and model capability. The root cause is that existing methods directly perform low-rank approximations on the single-layer activation \(Y_l^P\) itself, which is actually near full-rank, leading to inevitable information loss.

Goal: To find a truly "naturally low-rank" target for approximation, thereby saving parameters without performance loss while simultaneously reducing activation memory.

Key Insight: The authors observe a structural property previously unreported—it is the difference between adjacent layer activations \(Y_l^P - \beta_0 Y_{l-1}^P\) that is truly low-rank, rather than the activations themselves. Intuitively, the transformer's residual structure makes adjacent layer activations highly correlated, leaving only a small amount of "incremental information" in the difference.

Core Idea: Reconstruct the current layer activation as "previous layer activation + low-rank difference," i.e., \(Y_l^P \approx \beta_0 Y_{l-1}^P + \mathrm{LR}_r(\Delta_{\beta_0} Y_l^P)\), where \(\Delta_{\beta_0} Y_l^P := Y_l^P - \beta_0 Y_{l-1}^P\) is the activation difference and \(\mathrm{LR}_r(\cdot)\) denotes the best rank-\(r\) approximation. This formula is built directly into the network architecture (the cross-layer low-rank residual), so that low-rank parameters can represent high-rank activations.

Method¶

Overall Architecture¶

CR-Net (Cross-layer low-Rank residual Network) is built on the LLaMA-2 architecture (including SwiGLU; LayerNorm and RoPE are omitted from the description for simplicity). Its core modification is that, in every layer except the first, each linear projection \(P\in\{Q,K,V,O,\text{gate},\text{up},\text{down}\}\) no longer uses a full-rank weight \(W_l^P\) but computes its output as "cross-layer residual + low-rank increment."

The pipeline operates as follows. Inputs first pass through a full-rank first layer, which provides high-rank "anchor" activations and prevents early low-rank collapse. From the second layer onwards, each linear projection's activation is the corresponding activation of the previous layer multiplied by a learnable scaling coefficient \(\beta_l^P\), plus a low-rank increment obtained by passing the current input through two small matrices \(A_l^P\) and \(B_l^P\). During backpropagation most activations are not stored. Instead, a custom recomputation strategy stores only the activations of a few "checkpoint layers" and all low-rank outputs, and derives the remaining activations layer by layer through the inverse of the cross-layer residual. This design saves about half the parameters, significantly reduces activation memory, and preserves high-rank information along the residual chain that starts at the full-rank first layer.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Input token sequence"] --> B["Full-rank first layer<br/>Provides high-rank anchor activation Y₁"]
    B --> C["Cross-layer low-rank residual structure<br/>Yₗ = βₗYₗ₋₁ + XₗAₗBₗ"]
    C --> D["Learnable scaling coefficient βₗᴾ<br/>Dynamic balancing: History vs. Low-rank increment"]
    D -->|Forward: store only checkpoints + low-rank outputs| E["Efficient activation recomputation<br/>Inverse residual layer-by-layer derivation"]
    E --> F["Low-rank parameters + Low activation VRAM<br/>60M→13B Pre-training"]

Key Designs¶

1. Cross-layer low-rank activation difference: Applying "low-rank" to truly low-rank objects

This is the foundation of the paper and targets L1, the near full-rank nature of single-layer activations. The authors first verify an observation: reconstructing \(Y_l^P\) as "historical activation + low-rank difference," \(\tilde Y_{l,\beta_0}^P := \beta_0 Y_{l-1}^P + \mathrm{LR}_r(\Delta_{\beta_0} Y_l^P)\), yields a much smaller relative error than a direct low-rank approximation \(\mathrm{LR}_r(Y_l^P)\) of the same rank. Here \(\mathrm{LR}_r(A):=\arg\min_{\Lambda}\|A-\Lambda\|_F^2,\ \text{s.t.}\ \mathrm{rank}(\Lambda)\le r\) is the optimal rank-\(r\) approximation in the Frobenius sense. On LLaMA-3 8B and GPT-2 small with \(r=0.25h\), the relative error of the difference-based reconstruction is 0.4–0.97x that of the direct approximation for every projection. This supports the claim that the "low-rank activation difference" is a genuine structure across models and training stages. CR-Net is therefore not a simple extension of LoRA: it changes the object that is approximated in low rank.
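The comparison described above can be illustrated on synthetic data. The following is a minimal sketch, not the authors' measurement: the matrices are random stand-ins constructed so that the difference between "layers" is low-rank by design, and the names `low_rank_approx` and `rel_err` are illustrative helpers. It only shows how \(\mathrm{LR}_r\) can be computed with a truncated SVD and how the two reconstructions are compared.

```python
import numpy as np

def low_rank_approx(M, r):
    """Best rank-r approximation of M in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def rel_err(approx, target):
    """Relative Frobenius error of an approximation."""
    return np.linalg.norm(approx - target) / np.linalg.norm(target)

rng = np.random.default_rng(0)
n, h, r, beta0 = 256, 64, 16, 1.0

# Synthetic stand-ins (assumption): a near full-rank previous activation and a
# current activation that differs from it by a rank-8 term plus small noise.
Y_prev = rng.standard_normal((n, h))
increment = rng.standard_normal((n, 8)) @ rng.standard_normal((8, h))
Y_cur = beta0 * Y_prev + increment + 0.01 * rng.standard_normal((n, h))

# Direct low-rank approximation of the activation itself.
direct = low_rank_approx(Y_cur, r)
# Previous activation plus low-rank approximation of the difference.
via_diff = beta0 * Y_prev + low_rank_approx(Y_cur - beta0 * Y_prev, r)

err_direct = rel_err(direct, Y_cur)
err_diff = rel_err(via_diff, Y_cur)
print(f"direct: {err_direct:.4f}, via difference: {err_diff:.4f}")
```

On real activations the gap is governed by how low-rank the difference actually is, which is what the reported 0.4–0.97x ratios quantify.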

2. Cross-layer low-rank residual structure: Solidifying observations into weights

Since \(\Delta_{\beta_0}Y_l^P\) is naturally low-rank, the authors decompose the full-rank weights \(W_l^P\in\mathbb{R}^{h_{in}\times h_{out}}\) into two low-rank learnable matrices \(A_l^P\in\mathbb{R}^{h_{in}\times r}\) and \(B_l^P\in\mathbb{R}^{r\times h_{out}}\) (\(r<\min\{h_{in},h_{out}\}\)). From the second layer, activations are computed as \(Y_l^P = \beta_0 Y_{l-1}^P + X_l^P A_l^P B_l^P\). This step transforms the "low-rank difference approximation" into a forward computation graph. Unlike the LoRA family which repeatedly approximates activations and accumulates loss, CR-Net only applies low-rank to the "increment," allowing high-rank information to pass losslessly through the residual chain. This simultaneously addresses L1 (no degradation) and L2 (no SVD required, standard optimization is sufficient).
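A small numerical sketch may help show why the output can stay high-rank even though only a rank-\(r\) increment is learned. The shapes and random matrices below are toy assumptions, and a fixed \(\beta_0 = 1\) is used as in the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, r, beta0 = 128, 32, 4, 1.0

X = rng.standard_normal((n, h))            # toy layer input
Y_prev = rng.standard_normal((n, h))       # toy previous-layer activation (near full-rank)
A = rng.standard_normal((h, r)) / np.sqrt(h)
B = rng.standard_normal((r, h)) / np.sqrt(r)

increment = X @ A @ B                      # rank at most r
Y_cur = beta0 * Y_prev + increment         # cross-layer low-rank residual

rank_increment = np.linalg.matrix_rank(increment)
rank_output = np.linalg.matrix_rank(Y_cur)
params_low_rank = A.size + B.size          # 2 * h * r
params_full = h * h                        # a full-rank h x h weight

print(rank_increment, rank_output, params_low_rank, params_full)
```

The increment is limited to rank \(r\), while the output inherits the rank of the previous activation, which is the sense in which high-rank information passes through the residual chain.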

3. Learnable scaling coefficient \(\beta_l^P\) and full-rank first layer: Stabilizing high-rank information

A fixed \(\beta_0\) lacks flexibility, so the authors replace it with a learnable scalar \(\beta_l^P\) for each layer and each projection, applied in the form \(\mathrm{sign}(\beta_l^P)(|\beta_l^P|+\varepsilon)\) (where \(\varepsilon=10^{-6}\) prevents zero coefficients). The complete structure is:

\[Y_l^P = \begin{cases} X_l^P W_l^P, & l=1,\\ \mathrm{sign}(\beta_l^P)(|\beta_l^P|+\varepsilon)\,Y_{l-1}^P + X_l^P A_l^P B_l^P, & l=2,\dots,L. \end{cases}\]

\(\beta_l^P\) allows the model to adaptively balance between "historical activation" and "low-rank increment." This enables the network to interpolate smoothly between shallow, expressive layers and deep, low-rank transitional layers. Combined with the full-rank first layer \(W_1^P\) providing a high-rank anchor, CR-Net avoids the collapse into low-dimensional subspaces or numerical instability common in low-rank training, without requiring hard projection constraints such as QR/SVD. Table 6 of the paper confirms this: on a 350M model, applying low-rank to the first layer degrades PPL from 18.95 to 19.68 at 6.4B tokens.
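The piecewise definition above can be sketched as a forward pass for a single projection \(P\). This is an illustrative NumPy sketch rather than the released implementation: the function names are hypothetical, the layer inputs are independent random matrices (in a real model each input depends on earlier layers), and treating \(\beta = 0\) as positive is an assumption made here so that \(\varepsilon\) keeps the coefficient non-zero.

```python
import numpy as np

EPS = 1e-6  # epsilon from the formula above

def effective_scale(beta):
    """sign(beta) * (|beta| + eps).

    Assumption: beta == 0 is treated as positive, since np.sign(0) would
    return 0 and cancel the eps offset.
    """
    sign = np.where(np.asarray(beta) >= 0, 1.0, -1.0)
    return sign * (np.abs(beta) + EPS)

def cr_forward(X_list, W1, A_list, B_list, betas):
    """Activations Y_1..Y_L of one projection P.

    X_list[l-1] is the input to projection P at layer l. A_list, B_list and
    betas hold the parameters of layers 2..L, so they have length L - 1.
    """
    Y = [X_list[0] @ W1]                   # l = 1: full-rank anchor
    for X, A, B, beta in zip(X_list[1:], A_list, B_list, betas):
        # l >= 2: scaled previous activation plus low-rank increment
        Y.append(effective_scale(beta) * Y[-1] + X @ A @ B)
    return Y

rng = np.random.default_rng(0)
L, n, h, r = 4, 8, 16, 4
X_list = [rng.standard_normal((n, h)) for _ in range(L)]
W1 = rng.standard_normal((h, h)) / np.sqrt(h)
A_list = [rng.standard_normal((h, r)) / np.sqrt(h) for _ in range(L - 1)]
B_list = [rng.standard_normal((r, h)) / np.sqrt(r) for _ in range(L - 1)]
betas = np.array([1.0, -0.5, 0.0])

Y = cr_forward(X_list, W1, A_list, B_list, betas)
print(len(Y), Y[-1].shape)
```

Keeping the coefficient away from zero matters beyond the forward pass, because the recomputation strategy divides by the same quantity.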

4. Efficient activation recomputation: Backpropagation strategy tailored for cross-layer dependency

Cross-layer residuals introduce a side effect: a layer's activation depends on the activations of all preceding layers. Standard gradient checkpointing (GCP) would therefore have to re-run all preceding layers to recover a single activation, incurring \(O(L^2)\) recomputation overhead. The authors customize recomputation so that the forward pass stores only three things: the layer inputs \(X_l\), the activations \(Y_l^P\) of a subset of checkpoint layers \(\mathcal{A}\) (of size \(|\mathcal{A}|=L/8\), with \(L\in\mathcal{A}\) and \(1\notin\mathcal{A}\) enforced), and all low-rank outputs \(X_l^P A_l^P\). During backpropagation, if \(Y_l^P\) is cached it is used directly; otherwise it is derived via the inverse operation:

\[Y_l^P = \frac{1}{\mathrm{sign}(\beta_{l+1}^P)(|\beta_{l+1}^P|+\varepsilon)}\big(Y_{l+1}^P - X_{l+1}^P A_{l+1}^P B_{l+1}^P\big).\]

This strategy recovers activations layer by layer without information loss, and restarting the recovery from each stored checkpoint layer keeps numerical error from accumulating across the whole chain. Overall, CR-Net reduces total compute overhead by 67.4% compared to vanilla GCP and by 8.0% compared to CoLA-M, with lower activation memory as well.
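The recomputation rule can be sketched end to end for one projection. This is a toy sketch under stated assumptions, not the authors' code: the function names are hypothetical, layer inputs are independent random matrices, the checkpoint set is chosen with even spacing (the source only fixes its size and the two membership constraints), and the coefficients are drawn near 1 to keep the division well conditioned.

```python
import numpy as np

EPS = 1e-6

def scale(beta):
    # sign(beta) * (|beta| + eps); beta == 0 is treated as positive (assumption).
    return np.where(beta >= 0, 1.0, -1.0) * (np.abs(beta) + EPS)

def forward_with_cache(X, W1, A, B, beta, checkpoints):
    """Run the chain for one projection, keeping only checkpoint activations
    and the low-rank outputs X_l A_l. Layers are 1-based; unused list slots
    (index 0, and index 1 for A and B) are placeholders."""
    L = len(X) - 1
    Y = X[1] @ W1                                   # full-rank first layer
    kept = {1: Y} if 1 in checkpoints else {}
    low_rank_out = {}
    for l in range(2, L + 1):
        low_rank_out[l] = X[l] @ A[l]               # cached, shape (n, r)
        Y = scale(beta[l]) * Y + low_rank_out[l] @ B[l]
        if l in checkpoints:
            kept[l] = Y
    return kept, low_rank_out

def recompute_in_backward_order(kept, low_rank_out, B, beta, L):
    """Yield (l, Y_l) for l = L, ..., 1: cached values where available,
    otherwise the inverse residual applied to Y_{l+1}."""
    Y_next = None
    for l in range(L, 0, -1):
        if l in kept:
            Y = kept[l]
        else:
            Y = (Y_next - low_rank_out[l + 1] @ B[l + 1]) / scale(beta[l + 1])
        yield l, Y
        Y_next = Y

rng = np.random.default_rng(0)
L, n, h, r = 16, 4, 8, 2
X = [None] + [rng.standard_normal((n, h)) for _ in range(L)]
W1 = rng.standard_normal((h, h)) / np.sqrt(h)
A = [None, None] + [rng.standard_normal((h, r)) / np.sqrt(h) for _ in range(L - 1)]
B = [None, None] + [rng.standard_normal((r, h)) / np.sqrt(r) for _ in range(L - 1)]
beta = rng.uniform(0.8, 1.2, size=L + 1)

checkpoints = {8, 16}        # size L/8 = 2, contains L, excludes layer 1
kept, low_rank_out = forward_with_cache(X, W1, A, B, beta, checkpoints)

# Reference run that keeps every activation, used only to check the recovery.
reference, _ = forward_with_cache(X, W1, A, B, beta, set(range(1, L + 1)))
recovered = dict(recompute_in_backward_order(kept, low_rank_out, B, beta, L))
max_err = max(np.abs(recovered[l] - reference[l]).max() for l in range(1, L + 1))
print(sorted(kept), max_err)
```

Because the cached \(X_l^P A_l^P\) has only \(r\) columns, recovering each activation costs one multiplication by \(B_l^P\) and one scaling, instead of re-running the preceding layers.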

Loss & Training¶

The training objective is standard LLM pre-training (language modeling on C4-en) using the Adam optimizer, without extra subspace projection steps. In terms of complexity, the first layer is the same as full-rank, while remaining layers have parameters totaling \((L-1)(11hr+3h_{ff}r)\). At \(r\approx 0.25h\), parameters are reduced by ~50%. Since LLaMA uses \(h_{ff}\approx 8h/3\), CR-Net's FLOPs per step are lower than full-rank as long as \(r<0.5h\).
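The per-layer parameter formula can be checked with a few lines of arithmetic. The sketch below assumes LLaMA-2 7B-like dimensions (\(h=4096\), \(h_{ff}=11008\)) and \(r=0.25h\) purely for illustration; the function names are hypothetical and the scalar coefficients \(\beta_l^P\) are ignored as negligible.

```python
def cr_layer_params(h, h_ff, r):
    """Low-rank parameters of one CR-Net layer: A and B for Q, K, V, O, gate, up, down."""
    attn = 4 * (h * r + r * h)           # Q, K, V, O map h -> h
    gate_up = 2 * (h * r + r * h_ff)     # gate, up map h -> h_ff
    down = h_ff * r + r * h              # down maps h_ff -> h
    return attn + gate_up + down         # equals 11*h*r + 3*h_ff*r

def full_layer_params(h, h_ff):
    """Full-rank parameters of the same seven projections."""
    return 4 * h * h + 3 * h * h_ff

h, h_ff = 4096, 11008                    # assumed dimensions, for illustration only
r = h // 4                               # r = 0.25 h

low_rank = cr_layer_params(h, h_ff, r)
full = full_layer_params(h, h_ff)
ratio = low_rank / full
print(low_rank, full, round(ratio, 3))
```

Under these assumptions a single low-rank layer holds roughly 40% of the parameters of its full-rank counterpart. This per-layer ratio excludes the embeddings and the full-rank first layer, so it need not coincide with the overall reduction quoted above.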

Key Experimental Results¶

Main Results¶

LLaMA-2 models from 60M to 13B parameters are pre-trained on C4-en. CR-Net is compared against parameter-efficient methods at matched parameter counts (marked ♢) and against optimizer-efficient methods at matched memory (marked †). Validation perplexity (PPL, lower is better):

Model Scale	Metric	CR-Net	Full-rank	CoLA	Apollo
60M	PPL	32.76	34.06	34.04	31.55
130M	PPL	24.31♢	24.36	24.48	22.94
350M	PPL	18.95♢	18.80	19.40	16.85
1B	PPL	15.22♢	15.56	15.52	14.20
1B	Params(M)	583	1339	609	1339

When aligned by parameters, CR-Net at 1B even outperforms full-rank training while reducing parameters by 56.5% and compute by 63.2%. When aligned by memory (†) at scales > 1B, it outperforms optimizer-efficient methods like GaLore/RSO/Apollo.
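The quoted parameter reduction follows directly from the 1B row of the table; a short check, using only the numbers listed above:

```python
# Parameter counts (in millions) from the 1B row of the table above.
params_m = {"CR-Net": 583, "Full-rank": 1339, "CoLA": 609}

reduction_vs_full = 1 - params_m["CR-Net"] / params_m["Full-rank"]
reduction_pct = round(100 * reduction_vs_full, 1)

# PPL gap between full-rank and CR-Net at 1B (positive means CR-Net is better).
ppl_gap_1b = 15.56 - 15.22

print(reduction_pct, round(ppl_gap_1b, 2))
```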

LLaMA-2 7B (with recomputation) and 13B:

Model	Steps	CR-Net (PPL)	Best Baseline (PPL)
7B	80K	13.72	CoLA-M 13.82
7B	65K	16.01	CoLA-M 16.21
13B	40K	18.12	8-bit Adam 17.85

At 7B, CR-Net consistently outperforms CoLA-M with lower VRAM. At 13B, it saves 50%+ parameters with only ~2% PPL degradation.
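To put the gaps in this table on a common scale, the relative PPL differences can be computed from the listed values. This is plain arithmetic on the numbers above, with the baseline as the denominator (a choice made here for illustration).

```python
# (CR-Net PPL, baseline PPL) pairs taken from the table above.
rows = {
    "7B @ 80K vs CoLA-M": (13.72, 13.82),
    "7B @ 65K vs CoLA-M": (16.01, 16.21),
    "13B @ 40K vs 8-bit Adam": (18.12, 17.85),
}

# Positive values mean CR-Net has the higher (worse) perplexity.
rel_gap = {name: (cr - base) / base for name, (cr, base) in rows.items()}

for name, gap in rel_gap.items():
    print(f"{name}: {100 * gap:+.2f}%")
```

By this measure the 13B gap is about 1.5% relative to 8-bit Adam, consistent with the "~2%" characterization, while at 7B CR-Net is ahead of CoLA-M by roughly 1% or less.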

Ablation Study¶

Configuration	Key Metric	Description
CR-Net (Full-rank first layer, 350M/6.4B)	PPL 18.95	Complete model
Low-rank first layer	PPL 19.68	Removing full-rank first layer causes significant drop
Removing learnable \(\beta_l^P\)	Worse convergence	Scaling coefficients aid convergence (Sec 5.2)
Rank configuration	—	Higher rank in middle layers, lower at ends is best

Activation memory and recomputation complexity (LLaMA2-7B, \(r=512\), batch 16): CR-Net recomputation memory is 23.35 GB (0.456x vanilla GCP), and FLOPs are \(0.692\times10^{15}\) (0.326x), both superior to CoLA-M.
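The ratios in this paragraph can be cross-checked against the 67.4% figure quoted in the method section. The baseline values below are back-computed from the rounded ratios and are therefore approximate; they are not reported in the source.

```python
# Values quoted above for LLaMA2-7B, r = 512, batch 16.
cr_mem_gb, mem_ratio = 23.35, 0.456
cr_flops, flops_ratio = 0.692e15, 0.326

# Implied vanilla-GCP figures (approximate, derived from rounded ratios).
implied_gcp_mem_gb = cr_mem_gb / mem_ratio
implied_gcp_flops = cr_flops / flops_ratio

# Relative savings of CR-Net over vanilla GCP.
compute_saving = 1 - flops_ratio
memory_saving = 1 - mem_ratio

print(round(implied_gcp_mem_gb, 1), f"{implied_gcp_flops:.3e}")
print(f"{100 * compute_saving:.1f}% compute, {100 * memory_saving:.1f}% memory")
```

The 0.326x FLOPs ratio corresponds to the 67.4% compute reduction stated earlier.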

Key Findings¶

Full-rank first layer is crucial for stable training: Removing it degrades PPL significantly, supporting the "high-rank anchor" motivation.
Rank should be allocated by layer: Granting higher rank to middle layers and lower rank to prefix/suffix layers yields better results, indicating non-uniform information density.
Learnable \(\beta_l^P\) improves convergence: Adaptive balancing is more stable than fixed coefficients.
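Advantages scale with model size: Above 1B, CR-Net's lead over memory-aligned optimizer-efficient methods (GaLore/RSO/Apollo) becomes more prominent, and it stays ahead of CoLA-M at 7B. The exception in the reported results is 13B, where it trails 8-bit Adam (18.12 vs. 17.85 PPL).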

Highlights & Insights¶

Shifting the target of low-rank: While others apply low-rank to "activations" or "gradients," CR-Net identifies that "activation differences" are truly low-rank. By applying the compression to the right target, it saves parameters without loss.
Embedding empirical observations into architecture: The design is highly self-consistent—from the observation of "low-rank differences" to the forward formula \(Y_l = \beta Y_{l-1} + XAB\), to the inverse recomputation.
Invertibility solves activation memory for free: The residual structure's invertibility allows layer-by-layer back-derivation of activations, naturally benefiting activation VRAM as a side effect of structural design.
Stability without hard constraints: Using a full-rank first layer and learnable scalars instead of QR/SVD is efficient and avoids training collapse.

Limitations & Future Work¶

Reliance on empirical observation: While "low-rank activation difference" was verified across models, it lacks a rigorous theoretical guarantee in this paper. Its validity on non-LLaMA architectures remains to be tested.
Slightly behind full-rank at 13B: CR-Net (18.12 PPL) trails 8-bit Adam (17.85 PPL) at 13B/40K, suggesting potential capacity limits at larger scales.
Extra storage in recomputation: Compared to vanilla GCP, CR-Net's recomputation caches low-rank outputs—trading a small amount of VRAM for significant compute savings, which may need consideration in extremely VRAM-constrained scenarios.
Manual rank allocation: The "middle-high, ends-low" rank configuration was found experimentally; an automated mechanism for determining per-layer rank is missing.

Related Work & Insights¶

vs. LoRA / ReLoRA: These replace weight updates with low-rank matrices, causing information loss; CR-Net applies low-rank only to "increments" while the high-rank backbone remains lossless, achieving better PPL at the same parameter count.
vs. CoLA: CoLA adds non-linearity to low-rank outputs to recover capacity but adds compute; CR-Net simplifies this with cross-layer residuals, achieving better performance at lower \(r\) and saving 8% compute over CoLA-M.
vs. GaLore / Apollo: These compress gradients via SVD or random projections; SVD hurts throughput and random projections hurt performance. CR-Net is parameter-side low-rank with standard Adam, outperforming them when memory is aligned at > 1B scales.
vs. Gradient Checkpointing (GCP): Vanilla GCP would suffer \(O(L^2)\) overhead due to CR-Net's cross-layer dependencies; the custom strategy reduces this to manageable levels.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Proposes and verifies the "low-rank activation difference" property and engineers it fully.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers scales from 60M to 13B with multi-dimensional comparisons of parameters/VRAM/throughput.
Writing Quality: ⭐⭐⭐⭐ Logically self-consistent (observation → structure → recomputation), with minor layout and typographical flaws.
Value: ⭐⭐⭐⭐⭐ High practical value by changing the object of low-rank compression to save half the parameters without degradation.