# Group Orthogonal Low-Rank Adaptation for RGB-T Tracking
Conference: AAAI 2026 arXiv: 2512.05359 Code: GitHub Area: Video Understanding Keywords: RGB-T Tracking, LoRA, Low-Rank Adaptation, Orthogonal Constraint, Parameter-Efficient Fine-Tuning
## TL;DR
This paper proposes the GOLA framework, which quantifies LoRA rank importance via SVD decomposition, freezes critical ranks to preserve pre-trained priors, clusters redundant ranks into groups, and imposes inter-group orthogonal constraints to enable more efficient RGB-T tracking adaptation.
## Background & Motivation
### State of the Field
RGB-T (visible + infrared) tracking enhances robustness in challenging scenarios such as low illumination and occlusion by fusing complementary information from two modalities. Recent methods primarily adopt the parameter-efficient fine-tuning (PEFT) paradigm, freezing pre-trained parameters and fine-tuning only a small subset for downstream tasks.
### Limitations of Prior Work
LoRA suffers from severe rank-space redundancy in RGB-T tracking:

- SVD of the LoRA parameter matrices after training reveals that only a few ranks become dominant, while the majority contribute negligible information.
- To achieve rapid convergence, the model tends to prioritize integrating a small number of critical ranks, overwriting pre-trained priors.
- The remaining redundant ranks lack targeted optimization signals and remain unactivated, failing to learn fine-grained features.
- This severely limits the model's ability to handle the diverse challenges of RGB-T tracking (occlusion, illumination variation, deformation, etc.).
## Core Idea
Protect critical ranks + activate redundant ranks: (1) quantify rank importance via SVD and freeze critical ranks to preserve pre-trained priors; (2) cluster redundant ranks into groups; (3) impose inter-group orthogonal constraints to force each group to learn complementary, non-overlapping feature transformations, thereby fully exploiting the entire rank space.
## Method
### Overall Architecture
GOLA builds upon a single-stream tracking framework and applies modified LoRA to each linear layer of the backbone:

- Inputs: template images \((I_v^z, I_t^z)\), search images \((I_v^x, I_t^x)\), and online templates \((I_v^o, I_t^o)\).
- After tokenization, the tokens are concatenated: \(h = [Z_v; Z_t; X_v; X_t; O_v; O_t]\).
- Features are extracted jointly by the encoder, with GOLA applied to each linear layer: \(h' = \mathbf{W}h + \mathbf{B}\mathbf{A}h\).
- Parameters are merged at inference, \(\mathbf{W}' = \mathbf{W} + \mathbf{B}\mathbf{A}\), incurring no additional inference latency.
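The per-layer update and the inference-time merge can be sketched as follows. This is a minimal PyTorch sketch; the class name, shapes, and initialization are illustrative, not the paper's released code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W plus a trainable low-rank update BA."""

    def __init__(self, d_in: int, d_out: int, r: int = 64):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.W.weight.requires_grad_(False)                 # pre-trained, frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))        # up-projection (zero init)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h' = W h + B A h
        return self.W(h) + h @ self.A.t() @ self.B.t()

    @torch.no_grad()
    def merge(self) -> None:
        # W' = W + B A, so inference is a single matmul with no extra latency
        self.W.weight += self.B @ self.A
```

After `merge()`, the plain linear layer reproduces the adapted forward pass exactly, which is why GOLA's inference speed matches the unadapted backbone.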
### Key Designs
#### 1. Rank Decomposition Partition Strategy
- Mechanism: Quantify the importance of each rank to distinguish critical ranks from redundant ranks; performed offline without increasing training overhead.
- Steps:
    1. Apply SVD to the trained up-projection matrix: \(\Sigma, \mathbf{V} \leftarrow \text{SVD}(\bar{\mathbf{B}})\).
    2. Select the top-\(k\) singular vectors \(\mathbf{V}_k\) and the corresponding singular values \(\Sigma_k\) as references.
    3. Compute the importance score for each rank: \(\mathbf{S} = \|\bar{\mathbf{B}}^\top \mathbf{V}_k^\top \odot \Sigma_k\|_2\).
    4. Sort ranks by \(\mathbf{S}\) in descending order; the top-\(k\) ranks are designated critical (frozen), and the remainder redundant (trainable).
    5. Cluster the redundant ranks into \(n\) groups via constrained k-means.
- Rationale for using \(\mathbf{B}\) as reference: \(\mathbf{B}\) is more strongly associated with task-specific information, whereas \(\mathbf{A}\) acts more as a general-purpose feature extractor.
- Design Motivation: Freezing critical ranks preserves learned generalization capability and prevents it from being overwritten during new modality adaptation.
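The scoring and partition steps above might look as follows in PyTorch. This is a hedged reading of \(\mathbf{S} = \|\bar{\mathbf{B}}^\top \mathbf{V}_k^\top \odot \Sigma_k\|_2\): we take \(\mathbf{V}_k\) to be the top-\(k\) singular vectors of \(\bar{\mathbf{B}}\) and score each rank by its singular-value-weighted projection norm; the constrained k-means clustering step is omitted:

```python
import torch

def rank_importance(B: torch.Tensor, k: int) -> torch.Tensor:
    """Score each of the r ranks of the trained LoRA matrix B (d_out x r).

    Assumed reading of the paper's formula: project each rank (column of B)
    onto the top-k singular directions, weight by the singular values, and
    take the per-rank L2 norm.
    """
    U, S, Vh = torch.linalg.svd(B, full_matrices=False)  # B = U diag(S) Vh
    Uk, Sk = U[:, :k], S[:k]            # top-k directions: (d_out, k), (k,)
    proj = B.t() @ Uk                   # (r, k): each rank's projection
    return (proj * Sk).norm(dim=1)      # importance score per rank

def partition_ranks(scores: torch.Tensor, k: int):
    """Top-k ranks are critical (to be frozen); the rest are redundant."""
    order = torch.argsort(scores, descending=True)
    return order[:k], order[k:]
```

Because the scores come from an already-trained \(\bar{\mathbf{B}}\), this partition runs once offline and adds nothing to the per-step training cost.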
#### 2. Inter-Group Orthogonal Constraint Strategy
- Mechanism: An orthogonality loss enforces orthogonal relationships between parameters of different rank groups, ensuring each group learns complementary features.
- Orthogonal Loss:
- A channel orthogonality constraint is applied to \(\mathbf{A}\): enhances diversity and discriminability of general features.
- A rank orthogonality constraint is applied to \(\mathbf{B}\): ensures different ranks carry complementary task knowledge.
- Efficiency Design: At each iteration, only one pair of rank groups is randomly sampled to compute the orthogonality loss, reducing computational overhead.
- Design Motivation: Prevents redundant ranks from learning overlapping feature spaces, enabling the model to simultaneously address diverse challenges in RGB-T tracking.
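A minimal sketch of what the inter-group penalty on \(\mathbf{B}\) could look like. The exact loss form (here, the squared Frobenius norm of the cross-Gram matrix) and the function name are our assumptions; an analogous channel-wise penalty would apply to \(\mathbf{A}\):

```python
import torch

def inter_group_orth_loss(B: torch.Tensor,
                          g1: torch.Tensor,
                          g2: torch.Tensor) -> torch.Tensor:
    """Penalize correlation between two redundant-rank groups of B (d_out x r).

    g1, g2 are index tensors selecting the columns (ranks) of each group.
    The cross-Gram matrix B1^T B2 collects inner products between the two
    groups' rank vectors; driving it to zero makes the groups orthogonal.
    """
    B1, B2 = B[:, g1], B[:, g2]
    return (B1.t() @ B2).pow(2).sum()
```

The loss is zero exactly when every rank vector in one group is orthogonal to every rank vector in the other, which is the complementarity the strategy targets.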
### Loss & Training
Total loss: \(\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{reg} + \lambda \cdot \mathcal{L}_{orth}\)
- \(\mathcal{L}_{cls}\): Binary cross-entropy classification loss
- \(\mathcal{L}_{reg}\): GIoU regression loss
- \(\lambda = 1.4 \times 10^{-3}\)
- LoRA rank \(r=64\), number of critical ranks \(k=16\), number of redundant rank groups \(n=8\)
- Training: 10 epochs, batch size 128, 131,072 image pairs per epoch
- Online template update threshold \(\tau=0.84\)
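Putting the pieces together, one iteration's objective might be composed as below, sampling a single group pair per step as the efficiency design prescribes. The group layout, helper name, and the orthogonality-loss form are assumptions:

```python
import random
import torch

def total_loss(l_cls: torch.Tensor,
               l_reg: torch.Tensor,
               B: torch.Tensor,
               groups: list,
               lam: float = 1.4e-3) -> torch.Tensor:
    """L = L_cls + L_reg + lambda * L_orth, with ONE group pair per iteration.

    `groups` is a list of index tensors over B's redundant ranks; the
    orthogonality term (squared cross-Gram norm) is an assumed form.
    """
    g1, g2 = random.sample(groups, 2)                 # one random pair per step
    l_orth = (B[:, g1].t() @ B[:, g2]).pow(2).sum()
    return l_cls + l_reg + lam * l_orth
```

Sampling one pair keeps the per-step cost constant regardless of the number of groups \(n\), matching the paper's finding that more pairs add computation without improving performance.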
## Key Experimental Results
### Main Results (4 Benchmark Datasets)
| Method | GTOT MPR/MSR | RGBT210 PR/SR | RGBT234 MPR/MSR | LasHeR PR/NPR/SR | Speed |
|---|---|---|---|---|---|
| ViPT | -/- | -/- | 83.5/61.7 | 65.1/-/52.5 | - |
| TBSI | -/- | -/- | 87.1/63.7 | 69.2/65.7/55.6 | 36fps |
| CKD | 93.2/77.2 | 88.4/65.2 | 90.0/67.4 | 73.2/69.3/58.1 | 96fps |
| SUTrack-L384 | -/- | -/- | 93.7/70.3 | 76.9/-/61.9 | 12fps |
| GOLA-B | 92.8/78.5 | 90.9/67.0 | 92.2/69.5 | 77.5/73.9/61.6 | 125fps |
| GOLA-L | 95.3/80.9 | 92.0/68.7 | 92.8/71.3 | 78.1/74.5/61.9 | 64fps |
Running at 125fps with only 99M parameters (10% trainable), GOLA-B surpasses SUTrack-L384 (12fps) on LasHeR.
### Comparison with Fine-Tuning Methods
| Method | Trainable Params | LasHeR PR/SR | Inference Speed |
|---|---|---|---|
| Full Fine-tune | 100% | 72.5/57.9 | 125fps |
| Adapter | 4% | 68.8/54.5 | 78fps |
| VPT | 3% | 70.8/56.3 | 85fps |
| LoRA | 13% | 76.3/60.7 | 125fps |
| DoRA | 13% | 63.7/49.3 | 125fps |
| GOLA-B | 10% | 77.5/61.6 | 125fps |
GOLA outperforms LoRA by 1.2%/0.9% (PR/SR) while reducing trainable parameters by 23% relative (13% → 10% of the total).
### Ablation Study
| Configuration | PR (%) | SR (%) | Notes |
|---|---|---|---|
| w/o orthogonal constraint | 76.3 | 60.7 | LoRA baseline |
| \(\mathbf{A}\) orthogonality only | 76.8 | 61.2 | Constrains general features only |
| \(\mathbf{B}\) orthogonality only | 76.7 | 61.2 | Constrains task knowledge only |
| \(\mathbf{A}\)+\(\mathbf{B}\) orthogonality | 77.5 | 61.6 | Best complementary effect |
| Sorting | Clustering | PR/SR | Notes |
|---|---|---|---|
| ✓ | ✗ | 77.0/61.4 | Sorting only |
| ✗ | ✓ | 76.6/61.0 | Clustering only |
| ✓ | ✓ | 77.5/61.6 | Best with both |
| Key Hyperparameter | Optimal Value | Notes |
|---|---|---|
| Critical rank count \(k\) | 16 | Too large → too many redundant ranks frozen; too small → pre-trained priors lost |
| Number of groups \(n\) | 8 | Too many → insufficient expressiveness per group; too few → insufficient complementarity |
| Sampled group pairs per iteration | 1 | One pair suffices; more pairs do not improve performance but increase computation |
### Key Findings
- Orthogonal constraints on \(\mathbf{A}\) and \(\mathbf{B}\) are complementary: Constraining both simultaneously outperforms constraining either alone.
- Rank sorting and clustering must be combined: Sorting preserves generalization capacity; clustering promotes intra-group specialization.
- t-SNE visualizations confirm the effectiveness of orthogonal constraints: Rank features from different groups in GOLA exhibit distinct clustering and separation in t-SNE plots.
- Near-optimal performance across 19 attributes: GOLA shows particular advantages on challenging attributes such as HI (high illumination), HO (hyaline occlusion), and SV (scale variation).
## Highlights & Insights
- In-depth analysis of LoRA redundancy: SVD-based quantification reveals the rank space redundancy in LoRA, providing a theoretical foundation for improvement.
- Minimalist yet effective design: Freezing critical ranks + grouping + orthogonal constraints — conceptually simple but empirically significant.
- No additional inference overhead: The parameter merging strategy ensures inference speed is identical to standard LoRA (125fps).
- Strong generalizability: Both GOLA-B and GOLA-L variants perform excellently, and the offline partitioning strategy does not affect the training pipeline.
## Limitations & Future Work
- The critical rank count \(k\) and group count \(n\) require hyperparameter search; an adaptive mechanism is lacking.
- Clustering relies on fixed constrained k-means; more flexible dynamic grouping methods could be explored.
- Orthogonal constraints are applied only to randomly sampled group pairs, potentially leaving some pairs insufficiently constrained.
- Validation is limited to RGB-T tracking; the approach could be extended to more multi-modal downstream tasks.
- The offline partitioning strategy requires a preliminary LoRA training pass, adding upfront preparation cost.
## Related Work & Insights
- Unlike AdaLoRA, which dynamically adjusts rank, GOLA maintains a fixed rank while optimizing rank space utilization.
- The inter-group orthogonal constraint is conceptually analogous to the specialization of different experts in MoE, but without requiring a routing mechanism.
- The strategy of freezing critical ranks is transferable to LoRA applications in NLP.
- t-SNE visualization analysis provides a new perspective for evaluating LoRA variants.
## Rating
- Novelty: ⭐⭐⭐⭐ — The analysis of LoRA redundancy and the group orthogonal constraint solution are novel and practical.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 4 datasets, 19 attribute analyses, comparisons with multiple PEFT methods, and detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ — Clear logic, complete mathematical derivations, and rich visualization analysis.
- Value: ⭐⭐⭐⭐ — General insights on LoRA usage are transferable across multiple domains.