Efficient Resource-Constrained Training of Transformers via Subspace Optimization¶
Conference: ICLR 2026
arXiv: 2510.09160
Code: https://github.com/Le-TrungNguyen/ICLR2026-WASI.git
Area: AI Safety
Keywords: subspace optimization, transformer compression, SVD, activation compression, edge deployment
TL;DR¶
This paper proposes WASI (Weight-Activation Subspace Iteration), which leverages the observation that parameter subspaces remain stable during fine-tuning to simultaneously compress both the weights (via SVD + Gram-Schmidt subspace iteration) and activations (via Tucker decomposition) of Transformers. Both training and inference are performed entirely within low-rank representations, achieving 62× training memory compression and 1.4× speedup on Raspberry Pi 5 with negligible accuracy loss.
Background & Motivation¶
Background: Deploying Transformers on edge devices poses severe memory and computational challenges. While methods such as LoRA reduce the number of trainable parameters, inference still operates in the full-rank space; activation maps during the forward pass constitute the primary memory bottleneck.
Limitations of Prior Work:
- LoRA and variants: reduce training parameters but require merging back to full rank at inference, leaving inference overhead unchanged; storing both frozen weights and adapters during training can actually increase memory usage.
- ASVD / FWSVD: compress models via truncated SVD but lack a theoretical connection between truncation error and model performance.
- SVD-LLM: addresses the theoretical gap but is limited to LLMs and does not support 4D or higher activation tensors in vision Transformers.
- AMC: compresses activations with HOSVD, but recomputing HOSVD at every iteration incurs substantial overhead, and rank fluctuations lead to unstable memory usage.
- ASI: replaces HOSVD with subspace iteration at a fixed activation rank, reducing computation, but does not compress weights.
Core Insight: The intrinsic parameter subspace remains stable during fine-tuning (small learning rate → minuscule per-step updates → negligible change in SVD bases), so after an initial SVD, inexpensive subspace iteration suffices to track basis changes without recomputation at every step.
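
A quick numerical illustration of this claim (synthetic dimensions and noise scale are assumed here, not taken from the paper): when a weight has a dominant low-rank component, as the variance-explained criterion presumes, a gradient-sized perturbation barely rotates its leading singular subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 768, 16
# Weight with a dominant rank-K component plus small noise (assumed structure).
W = rng.standard_normal((d, K)) @ rng.standard_normal((K, d)) / np.sqrt(K) \
    + 0.01 * rng.standard_normal((d, d))
update = 1e-3 * rng.standard_normal((d, d))   # an "eta * gradient"-sized step

# Leading left singular bases before and after the update.
U0 = np.linalg.svd(W, full_matrices=False)[0][:, :K]
U1 = np.linalg.svd(W + update, full_matrices=False)[0][:, :K]

# Cosines of the principal angles between the two K-dimensional subspaces.
cosines = np.linalg.svd(U0.T @ U1, compute_uv=False)
worst = np.degrees(np.arccos(np.clip(cosines.min(), -1.0, 1.0)))
print(f"largest principal angle after one update: {worst:.4f} degrees")
```
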
Core Idea: Jointly compress both weights (WSI) and activations (ASI) so that training and inference are executed entirely within the low-rank space.
Method¶
Overall Architecture¶
WASI is a unified framework combining WSI (Weight Subspace Iteration) and ASI (Activation Subspace Iteration):
- Forward pass: \(\mathcal{A}_{i+1} = \mathcal{A}_i R_i^T L_i^T\), computed in the low-rank space.
- Backward pass: gradients are computed directly in the low-rank space; the weight update is \(L_i R_i \leftarrow L_i R_i - \eta \cdot \widetilde{\nabla_{\mathcal{W}_i}\mathcal{L}}\).
- Inference runs directly on the compressed representations \((L_i, R_i)\) without reconstructing full-rank weights.
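
As a concrete reading of the forward pass, here is a minimal NumPy sketch (shapes and the nearly low-rank weight are assumed for illustration; this is not the released code) showing that \((\mathcal{A}_i R_i^T) L_i^T\) reproduces the full-rank product \(\mathcal{A}_i \mathcal{W}_i^T\) while only ever multiplying against the rank-\(K\) factors.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, I, O, K = 8, 128, 768, 768, 64   # batch, tokens, in/out dims, rank

# Assume the weight is close to rank K, as the variance-explained criterion presumes.
W = rng.standard_normal((O, K)) @ rng.standard_normal((K, I)) / np.sqrt(K) \
    + 0.01 * rng.standard_normal((O, I))
A = rng.standard_normal((B, N, I))

# Rank-K factors from a truncated SVD: W ~= L @ R.
U, s, Vh = np.linalg.svd(W, full_matrices=False)
L, R = U[:, :K] * s[:K], Vh[:K, :]     # L: (O, K), R: (K, I)

full = A @ W.T                         # ~ B*N*I*O multiply-adds
low_rank = (A @ R.T) @ L.T             # ~ B*N*K*(I+O) multiply-adds

# The deviation is only the energy discarded by the rank-K truncation.
print("relative deviation from the full-rank output:",
      np.linalg.norm(full - low_rank) / np.linalg.norm(full))
```
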
Key Designs¶
- WSI (Weight Subspace Iteration), sketched in code after this list:
  - Initialization (\(t=0\)): A full SVD is performed on weight \(\mathcal{W}_i\); the optimal rank \(K_i\) is determined by a variance-explained threshold \(\varepsilon\), yielding \(\mathcal{W}_i \approx L_i R_i\).
  - Subsequent iterations (\(t>0\)): Gram-Schmidt orthogonalization tracks subspace changes at a computational cost far lower than recomputing the SVD.
  - Error control: The threshold \(\varepsilon\) ensures that the retained fraction of variance satisfies \(\frac{\sum_{j=1}^{K_i} \sigma_{i,j}^2}{\sum_{j} \sigma_{i,j}^2} \geq \varepsilon\).
- ASI (Activation Subspace Iteration) enhancements:
  - Dynamic-programming rank selection: replaces the brute-force search in ASI with a DP formulation that minimizes memory usage subject to a target perplexity constraint, reducing search complexity from exponential to linear.
  - 3D activation support: extends Tucker decomposition to support the 3D activation tensors \(\mathcal{A}_i \in \mathbb{R}^{B \times N_i \times I_i}\) arising in Transformers.
- Unified forward-backward computation: both forward and backward passes are executed directly in the compressed representation, eliminating decompression/recompression round-trips.
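
Below is a minimal sketch of the WSI idea under the assumptions above: the rank is chosen from the variance-explained threshold \(\varepsilon\) at \(t=0\), and each later step runs one Gram-Schmidt/QR subspace-iteration update instead of a fresh SVD. The function names (`init_wsi`, `wsi_step`) and the exact update rule are illustrative, not taken from the repository.

```python
import numpy as np

def init_wsi(W, eps):
    """t = 0: one full SVD; keep the smallest rank K whose normalized
    cumulative squared singular values reach the threshold eps."""
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    K = int(np.searchsorted(explained, eps) + 1)
    L = U[:, :K]               # orthonormal left basis, (O, K)
    R = L.T @ W                # (K, I), so W ~= L @ R
    return L, R

def wsi_step(W_new, L_prev):
    """t > 0: one subspace-iteration step with Gram-Schmidt (QR)
    orthogonalization, warm-started from the previous basis.
    Cost is O(O * I * K), far below recomputing the SVD."""
    Y = W_new @ (W_new.T @ L_prev)    # power step on W W^T
    L_new, _ = np.linalg.qr(Y)        # re-orthonormalize the basis
    R_new = L_new.T @ W_new
    return L_new, R_new

# Toy usage: after a small "training" update, one tracking step matches a
# freshly recomputed truncated SVD at the same rank.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 16)) @ rng.standard_normal((16, 256))
L, R = init_wsi(W, eps=0.9)
W = W + 1e-3 * rng.standard_normal(W.shape)
L, R = wsi_step(W, L)

K = L.shape[1]
U, s, Vh = np.linalg.svd(W, full_matrices=False)
fresh_err = np.linalg.norm(W - (U[:, :K] * s[:K]) @ Vh[:K, :])
print("tracked error:", np.linalg.norm(W - L @ R), "fresh-SVD error:", fresh_err)
```
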
Loss & Training¶
Standard cross-entropy loss is used; the key contribution lies in performing all gradient computations within the low-rank space:
- Weight gradient: \(\widetilde{\nabla_{\mathcal{W}_i}\mathcal{L}} = f_{LR}(\tilde{\mathcal{A}_i}, \widetilde{\nabla_{\mathcal{A}_{i+1}}\mathcal{L}})\)
- Activation gradient: \(\widetilde{\nabla_{\mathcal{A}_i}\mathcal{L}} = \widetilde{\nabla_{\mathcal{A}_{i+1}}\mathcal{L}} \cdot L_i R_i\)
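
The following chain-rule sketch illustrates the point (it is not the paper's exact \(f_{LR}\) operator): with the factored forward pass, the input-activation gradient and the factor gradients can all be formed from the low-rank quantities, without ever materializing the full \(O \times I\) weight gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, I, O, K = 4, 32, 96, 96, 12
A_in = rng.standard_normal((B, N, I))
L = rng.standard_normal((O, K))
R = rng.standard_normal((K, I))

# Factored forward pass; H is the cached low-rank activation.
H = A_in @ R.T                                   # (B, N, K)
A_out = H @ L.T                                  # (B, N, O)
g_out = rng.standard_normal(A_out.shape)         # upstream gradient dLoss/dA_out

# Backward pass, entirely in the low-rank space.
g_in = (g_out @ L) @ R                           # dLoss/dA_in = g_out . (L R)
g_L = np.einsum("bno,bnk->ok", g_out, H)         # dLoss/dL, only (O, K)
g_R = np.einsum("bnk,bni->ki", g_out @ L, A_in)  # dLoss/dR, only (K, I)

# Sanity check against the full (O x I) weight gradient, which the
# low-rank path never needs to build.
g_W = np.einsum("bno,bni->oi", g_out, A_in)
assert np.allclose(g_L, g_W @ R.T) and np.allclose(g_R, L.T @ g_W)
```
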
Key Experimental Results¶
Main Results: Multiple Models and Datasets¶
| Model | Dataset | Train Mem. Compression | Inference Mem. Compression | Train FLOPs Reduction | Accuracy Change |
|---|---|---|---|---|---|
| ViT | CIFAR-10 | 62× | 62× | 2× | −0.5% |
| ViT | Pets | 62× | 62× | 2× | 0% |
| SwinT | CUB | ~50× | ~50× | 1.5× | +2% (surpasses baseline) |
| SwinT | Flowers | ~50× | ~50× | 1.5× | −1% |
| SwinT | CIFAR-100 | ~50× | ~50× | 1.5× | 0% |
| TinyLlama | BoolQ | 953× (activation) / 30× (weight) | 30× | 13× | 0% |
Ablation Study: WSI vs. Full SVD¶
| Method | ε=0.4 | ε=0.6 | ε=0.8 | ε=0.9 | Compute Cost Ratio |
|---|---|---|---|---|---|
| Full SVD | Low acc. | Medium | High | Near-full | 1.0× |
| WSI | Low acc. | Medium | High | Near-full | 0.74× (1.36× savings) |

At an equal FLOPs budget, WSI's accuracy is 35% higher than full SVD recomputation.
On-Device Evaluation: Raspberry Pi 5¶
| Setting | Train Time/Step | Inference Time/Step | Speedup |
|---|---|---|---|
| Vanilla | baseline | baseline | 1.0× |
| WASI (ε=0.9) | faster | faster | ~1.4× |
| WASI (ε=0.4) | fastest | fastest | >2× |
Key Findings¶
- Layer rank \(K_i\) remains constant over 50 epochs, validating the subspace stability assumption.
- WSI requires 1.36× fewer FLOPs than full SVD recomputation, yielding 35% higher accuracy at the same budget.
- The leading principal components of activations capture >90% of the variance, indicating high compressibility.
- On CUB, WASI with SwinT surpasses vanilla fine-tuning accuracy, suggesting a regularization effect from the low-rank constraint.
- Activation compression reaches 953× on TinyLlama, demonstrating significant compression potential for LLMs.
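
For reference, the variance-captured statistic can be computed from the singular values of a mean-centered activation matrix; the snippet below uses an assumed measurement protocol on synthetic data, not necessarily the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, I, K = 8, 128, 768, 32
# Simulate an activation with a dominant low-rank structure plus noise.
A = rng.standard_normal((B * N, K)) @ rng.standard_normal((K, I)) \
    + 0.1 * rng.standard_normal((B * N, I))

s = np.linalg.svd(A - A.mean(axis=0), compute_uv=False)
explained = np.cumsum(s**2) / np.sum(s**2)
print(f"variance captured by the top {K} components: {explained[K - 1]:.1%}")
```
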
Highlights & Insights¶
- Both training and inference run in the compressed space: fundamentally distinct from LoRA, which must merge adapters back to full rank at inference; this makes WASI naturally suited for edge deployment.
- Empirical validation of the subspace stability assumption: Figure 3(a) directly visualizes the stability of singular values throughout fine-tuning, providing strong alignment between theory and experiment.
- DP-based rank selection over brute-force search: Reducing exponential to linear search complexity substantially improves practical usability.
- Compression can surpass the baseline: The accuracy gain on CUB indicates that the low-rank constraint acts as an implicit regularizer.
- A 62× reduction in training memory implies that a fine-tuning run that originally required 62 GB can, in principle, fit on a 1 GB device.
Limitations & Future Work¶
- LLM validation is limited: experiments are conducted only on the last 5 layers of TinyLlama; effectiveness on larger-scale LLMs remains unknown.
- The threshold \(\varepsilon\) requires task- and model-specific tuning; the optimal value may vary across settings.
- Gram-Schmidt orthogonalization may exhibit numerical instability at extremely small ranks.
- The LoRA adapters give SVD-LLM a FLOPs advantage; WASI's FLOPs benefit is less pronounced than its memory benefit.
- Combining WASI with orthogonal compression techniques such as quantization and knowledge distillation remains unexplored.
Related Work & Insights¶
- vs. LoRA: LoRA reduces only training parameters without compressing inference; WASI compresses both, offering a clear edge-deployment advantage.
- vs. SVD-LLM: SVD-LLM is restricted to LLMs and memory can actually increase at low compression ratios due to LoRA adapter overhead; WASI is general-purpose with no additional overhead.
- vs. ASI: ASI compresses activations only, leaving the inference space unchanged; WASI jointly compresses both.
- vs. AMC: AMC incurs heavy per-step HOSVD recomputation; WASI uses an initial SVD followed by subspace iteration, achieving computational efficiency.
Rating¶
- Novelty: ⭐⭐⭐⭐ — A unified framework that simultaneously compresses weights and activations, with theoretical grounding for the subspace stability assumption.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Convincing RPi5 deployment validation across ViT, SwinT, and TinyLlama.
- Writing Quality: ⭐⭐⭐⭐ — Complete mathematical derivations with detailed computational complexity analysis.
- Value: ⭐⭐⭐⭐ — A practical solution for edge deployment of Transformers; the 62× compression ratio carries significant engineering value.