# Group Orthogonal Low-Rank Adaptation for RGB-T Tracking
Conference: AAAI 2026 arXiv: 2512.05359 Code: GitHub Area: Video Understanding Keywords: RGB-T Tracking, LoRA, Low-Rank Adaptation, Orthogonal Constraint, Parameter-Efficient Fine-Tuning
## TL;DR
This paper proposes the GOLA framework, which quantifies LoRA rank importance via SVD decomposition, freezes critical ranks to preserve pre-trained priors, clusters redundant ranks into groups, and imposes inter-group orthogonal constraints to enable more efficient RGB-T tracking adaptation.
## Background & Motivation
### State of the Field
RGB-T (visible + infrared) tracking enhances robustness in challenging scenarios such as low illumination and occlusion by fusing complementary information from two modalities. Recent methods primarily adopt the parameter-efficient fine-tuning (PEFT) paradigm, freezing pre-trained parameters and fine-tuning only a small subset for downstream tasks.
### Limitations of Prior Work
LoRA suffers from severe rank-space redundancy in RGB-T tracking:

- SVD of the LoRA parameter matrices after training reveals that only a few ranks become dominant, while the majority contribute negligible information.
- To achieve rapid convergence, the model tends to prioritize integrating a small number of critical ranks, overwriting pre-trained priors.
- The remaining redundant ranks lack targeted optimization signals and remain unactivated, failing to learn fine-grained features.
- This severely limits the model's ability to handle the diverse challenges of RGB-T tracking (occlusion, illumination variation, deformation, etc.).
## Core Idea
Protect critical ranks + activate redundant ranks: (1) quantify rank importance via SVD and freeze critical ranks to preserve pre-trained priors; (2) cluster redundant ranks into groups; (3) impose inter-group orthogonal constraints to force each group to learn complementary, non-overlapping feature transformations, thereby fully exploiting the entire rank space.
## Method
### Overall Architecture
GOLA builds upon a single-stream tracking framework and applies modified LoRA to each linear layer of the backbone:

- Inputs: template images \((I_v^z, I_t^z)\), search images \((I_v^x, I_t^x)\), and online templates \((I_v^o, I_t^o)\).
- After tokenization, the tokens are concatenated: \(h = [Z_v; Z_t; X_v; X_t; O_v; O_t]\).
- Features are extracted jointly by the encoder, with GOLA applied to each linear layer: \(h' = \mathbf{W}h + \mathbf{B}\mathbf{A}h\).
- Parameters are merged at inference, \(\mathbf{W}' = \mathbf{W} + \mathbf{B}\mathbf{A}\), incurring no additional inference latency.
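The per-layer update and the inference-time merge can be sketched as follows. This is a minimal PyTorch sketch; the class name, shapes, and initialization are illustrative, not the paper's released code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W plus a trainable low-rank update BA."""

    def __init__(self, d_in: int, d_out: int, r: int = 64):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.W.weight.requires_grad_(False)                 # pre-trained, frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))        # up-projection (zero init)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h' = W h + B A h
        return self.W(h) + h @ self.A.t() @ self.B.t()

    @torch.no_grad()
    def merge(self) -> None:
        # W' = W + B A, so inference is a single matmul with no extra latency
        self.W.weight += self.B @ self.A
```

After `merge()`, the plain linear layer reproduces the adapted forward pass exactly, which is why GOLA's inference speed matches the unadapted backbone.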
### Key Designs
#### 1. Rank Decomposition Partition Strategy
- Mechanism: Quantify the importance of each rank to distinguish critical ranks from redundant ranks; performed offline without increasing training overhead.
- Steps:
    1. Apply SVD to the trained up-projection matrix: \(\Sigma, \mathbf{V} \leftarrow \text{SVD}(\bar{\mathbf{B}})\).
    2. Select the top-\(k\) singular vectors \(\mathbf{V}_k\) and the corresponding singular values \(\Sigma_k\) as references.
    3. Compute the importance score for each rank: \(\mathbf{S} = \|\bar{\mathbf{B}}^\top \mathbf{V}_k^\top \odot \Sigma_k\|_2\).
    4. Sort ranks by \(\mathbf{S}\) in descending order; the top-\(k\) ranks are designated critical (frozen), and the remainder redundant (trainable).
    5. Cluster the redundant ranks into \(n\) groups via constrained k-means.
- Rationale for using \(\mathbf{B}\) as reference: \(\mathbf{B}\) is more strongly associated with task-specific information, whereas \(\mathbf{A}\) acts more as a general-purpose feature extractor.
- Design Motivation: Freezing critical ranks preserves learned generalization capability and prevents it from being overwritten during new modality adaptation.
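The scoring and partition steps above might look as follows in PyTorch. This is a hedged reading of \(\mathbf{S} = \|\bar{\mathbf{B}}^\top \mathbf{V}_k^\top \odot \Sigma_k\|_2\): we take \(\mathbf{V}_k\) to be the top-\(k\) singular vectors of \(\bar{\mathbf{B}}\) and score each rank by its singular-value-weighted projection norm; the constrained k-means clustering step is omitted:

```python
import torch

def rank_importance(B: torch.Tensor, k: int) -> torch.Tensor:
    """Score each of the r ranks of the trained LoRA matrix B (d_out x r).

    Assumed reading of the paper's formula: project each rank (column of B)
    onto the top-k singular directions, weight by the singular values, and
    take the per-rank L2 norm.
    """
    U, S, Vh = torch.linalg.svd(B, full_matrices=False)  # B = U diag(S) Vh
    Uk, Sk = U[:, :k], S[:k]            # top-k directions: (d_out, k), (k,)
    proj = B.t() @ Uk                   # (r, k): each rank's projection
    return (proj * Sk).norm(dim=1)      # importance score per rank

def partition_ranks(scores: torch.Tensor, k: int):
    """Top-k ranks are critical (to be frozen); the rest are redundant."""
    order = torch.argsort(scores, descending=True)
    return order[:k], order[k:]
```

Because the scores come from an already-trained \(\bar{\mathbf{B}}\), this partition runs once offline and adds nothing to the per-step training cost.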
#### 2. Inter-Group Orthogonal Constraint Strategy
- Mechanism: An orthogonality loss enforces orthogonal relationships between parameters of different rank groups, ensuring each group learns complementary features.
- Orthogonal Loss:
- A channel orthogonality constraint is applied to \(\mathbf{A}\): enhances diversity and discriminability of general features.
- A rank orthogonality constraint is applied to \(\mathbf{B}\): ensures different ranks carry complementary task knowledge.
- Efficiency Design: At each iteration, only one pair of rank groups is randomly sampled to compute the orthogonality loss, reducing computational overhead.
- Design Motivation: Prevents redundant ranks from learning overlapping feature spaces, enabling the model to simultaneously address diverse challenges in RGB-T tracking.
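A minimal sketch of what the inter-group penalty on \(\mathbf{B}\) could look like. The exact loss form (here, the squared Frobenius norm of the cross-Gram matrix) and the function name are our assumptions; an analogous channel-wise penalty would apply to \(\mathbf{A}\):

```python
import torch

def inter_group_orth_loss(B: torch.Tensor,
                          g1: torch.Tensor,
                          g2: torch.Tensor) -> torch.Tensor:
    """Penalize correlation between two redundant-rank groups of B (d_out x r).

    g1, g2 are index tensors selecting the columns (ranks) of each group.
    The cross-Gram matrix B1^T B2 collects inner products between the two
    groups' rank vectors; driving it to zero makes the groups orthogonal.
    """
    B1, B2 = B[:, g1], B[:, g2]
    return (B1.t() @ B2).pow(2).sum()
```

The loss is zero exactly when every rank vector in one group is orthogonal to every rank vector in the other, which is the complementarity the strategy targets.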
### Loss & Training
Total loss: \(\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{reg} + \lambda \cdot \mathcal{L}_{orth}\)
- \(\mathcal{L}_{cls}\): Binary cross-entropy classification loss
- \(\mathcal{L}_{reg}\): GIoU regression loss
- \(\lambda = 1.4 \times 10^{-3}\)
- LoRA rank \(r=64\), number of critical ranks \(k=16\), number of redundant rank groups \(n=8\)
- Training: 10 epochs, batch size 128, 131,072 image pairs per epoch
- Online template update threshold \(\tau=0.84\)
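Putting the pieces together, one iteration's objective might be composed as below, sampling a single group pair per step as the efficiency design prescribes. The group layout, helper name, and the orthogonality-loss form are assumptions:

```python
import random
import torch

def total_loss(l_cls: torch.Tensor,
               l_reg: torch.Tensor,
               B: torch.Tensor,
               groups: list,
               lam: float = 1.4e-3) -> torch.Tensor:
    """L = L_cls + L_reg + lambda * L_orth, with ONE group pair per iteration.

    `groups` is a list of index tensors over B's redundant ranks; the
    orthogonality term (squared cross-Gram norm) is an assumed form.
    """
    g1, g2 = random.sample(groups, 2)                 # one random pair per step
    l_orth = (B[:, g1].t() @ B[:, g2]).pow(2).sum()
    return l_cls + l_reg + lam * l_orth
```

Sampling one pair keeps the per-step cost constant regardless of the number of groups \(n\), matching the paper's finding that more pairs add computation without improving performance.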
## Key Experimental Results
### Main Results (4 Benchmark Datasets)
| Method | GTOT MPR/MSR | RGBT210 PR/SR | RGBT234 MPR/MSR | LasHeR PR/NPR/SR | Speed |
|---|---|---|---|---|---|
| ViPT | -/- | -/- | 83.5/61.7 | 65.1/-/52.5 | - |
| TBSI | -/- | -/- | 87.1/63.7 | 69.2/65.7/55.6 | 36fps |
| CKD | 93.2/77.2 | 88.4/65.2 | 90.0/67.4 | 73.2/69.3/58.1 | 96fps |
| SUTrack-L384 | -/- | -/- | 93.7/70.3 | 76.9/-/61.9 | 12fps |
| GOLA-B | 92.8/78.5 | 90.9/67.0 | 92.2/69.5 | 77.5/73.9/61.6 | 125fps |
| GOLA-L | 95.3/80.9 | 92.0/68.7 | 92.8/71.3 | 78.1/74.5/61.9 | 64fps |
Running at 125fps with only 99M parameters (10% trainable), GOLA-B surpasses SUTrack-L384 (12fps) on LasHeR.
### Comparison with Fine-Tuning Methods
| Method | Trainable Params | LasHeR PR/SR | Inference Speed |
|---|---|---|---|
| Full Fine-tune | 100% | 72.5/57.9 | 125fps |
| Adapter | 4% | 68.8/54.5 | 78fps |
| VPT | 3% | 70.8/56.3 | 85fps |
| LoRA | 13% | 76.3/60.7 | 125fps |
| DoRA | 13% | 63.7/49.3 | 125fps |
| GOLA-B | 10% | 77.5/61.6 | 125fps |
GOLA outperforms LoRA by 1.2%/0.9% (PR/SR) while reducing trainable parameters by 23% relative (13% → 10% of the total).
### Ablation Study
| Configuration | PR (%) | SR (%) | Notes |
|---|---|---|---|
| w/o orthogonal constraint | 76.3 | 60.7 | LoRA baseline |
| \(\mathbf{A}\) orthogonality only | 76.8 | 61.2 | Constrains general features only |
| \(\mathbf{B}\) orthogonality only | 76.7 | 61.2 | Constrains task knowledge only |
| \(\mathbf{A}\)+\(\mathbf{B}\) orthogonality | 77.5 | 61.6 | Best complementary effect |
| Sorting | Clustering | PR/SR | Notes |
|---|---|---|---|
| ✓ | ✗ | 77.0/61.4 | Sorting only |
| ✗ | ✓ | 76.6/61.0 | Clustering only |
| ✓ | ✓ | 77.5/61.6 | Best with both |
| Key Hyperparameter | Optimal Value | Notes |
|---|---|---|
| Critical rank count \(k\) | 16 | Too large → too many redundant ranks frozen; too small → pre-trained priors lost |
| Number of groups \(n\) | 8 | Too many → insufficient expressiveness per group; too few → insufficient complementarity |
| Sampled group pairs per iteration | 1 | One pair suffices; more pairs do not improve performance but increase computation |
### Key Findings
- Orthogonal constraints on \(\mathbf{A}\) and \(\mathbf{B}\) are complementary: Constraining both simultaneously outperforms constraining either alone.
- Rank sorting and clustering must be combined: Sorting preserves generalization capacity; clustering promotes intra-group specialization.
- t-SNE visualizations confirm the effectiveness of orthogonal constraints: Rank features from different groups in GOLA exhibit distinct clustering and separation in t-SNE plots.
- Near-optimal performance across 19 attributes: GOLA shows particular advantages on challenging attributes such as HI (high illumination), HO (hyaline occlusion), and SV (scale variation).
## Highlights & Insights
- In-depth analysis of LoRA redundancy: SVD-based quantification reveals the rank space redundancy in LoRA, providing a theoretical foundation for improvement.
- Minimalist yet effective design: Freezing critical ranks + grouping + orthogonal constraints — conceptually simple but empirically significant.
- No additional inference overhead: The parameter merging strategy ensures inference speed is identical to standard LoRA (125fps).
- Strong generalizability: Both GOLA-B and GOLA-L variants perform excellently, and the offline partitioning strategy does not affect the training pipeline.
## Limitations & Future Work
- The critical rank count \(k\) and group count \(n\) require hyperparameter search; an adaptive mechanism is lacking.
- Clustering relies on fixed constrained k-means; more flexible dynamic grouping methods could be explored.
- Orthogonal constraints are applied only to randomly sampled group pairs, potentially leaving some pairs insufficiently constrained.
- Validation is limited to RGB-T tracking; the approach could be extended to more multi-modal downstream tasks.
- The offline partitioning strategy requires a preliminary LoRA training pass, adding upfront preparation cost.
## Related Work & Insights
- Unlike AdaLoRA, which dynamically adjusts rank, GOLA maintains a fixed rank while optimizing rank space utilization.
- The inter-group orthogonal constraint is conceptually analogous to the specialization of different experts in MoE, but without requiring a routing mechanism.
- The strategy of freezing critical ranks is transferable to LoRA applications in NLP.
- t-SNE visualization analysis provides a new perspective for evaluating LoRA variants.
## Rating
- Novelty: ⭐⭐⭐⭐ — The analysis of LoRA redundancy and the group orthogonal constraint solution are novel and practical.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 4 datasets, 19 attribute analyses, comparisons with multiple PEFT methods, and detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ — Clear logic, complete mathematical derivations, and rich visualization analysis.
- Value: ⭐⭐⭐⭐ — General insights on LoRA usage are transferable across multiple domains.