SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker¶
Conference: CVPR 2026 · arXiv: 2604.12502 · Code: Available · Area: Object Tracking / Multimodal · Keywords: Multimodal tracking, parameter-efficient fine-tuning, attention alignment, mixture of experts, LoRA
TL;DR¶
This paper proposes SEATrack, a multimodal tracker that achieves dynamic cross-modal attention map alignment via AMG-LoRA and efficient global relation modeling via HMoE, attaining a state-of-the-art performance–efficiency trade-off on RGB-T/D/E tracking with minimal trainable parameters.
Background & Motivation¶
Background: Multimodal tracking fuses RGB with complementary modalities such as thermal infrared, depth, and event data to enable robust all-weather tracking. The parameter-efficient fine-tuning (PEFT) paradigm has gradually replaced full fine-tuning to mitigate catastrophic forgetting.
Limitations of Prior Work: The number of trainable parameters in PEFT-based methods has inflated 16-fold from early approaches to the latest SOTA, fundamentally undermining the efficiency rationale of PEFT. Meanwhile, the domain gap in dual-stream architectures causes conflicting attention maps across modalities, hampering joint representation learning.
Key Challenge: A performance–efficiency dilemma — more parameters yield better performance but erode the core value of PEFT.
Goal: (1) Break the performance–efficiency trade-off via cross-modal attention alignment; (2) Design an efficient global relation modeling module to replace attention-based fusion.
Key Insight: Multimodal inputs are spatiotemporally aligned, so the attention maps for intra-modal target matching should in principle be consistent — this consistency can be exploited for mutual cross-modal guidance.
Core Idea: AMG-LoRA uses the matching information from one modality to guide the matching process of the other, enabling bidirectional dynamic alignment.
Method¶
Overall Architecture¶
A dual-stream ViT architecture freezes the backbone of a pretrained RGB tracker. AMG-LoRA (attention alignment) and HMoE (cross-modal fusion) are inserted every 2 layers. Search-region features from both modalities are aggregated via element-wise addition and fed into the prediction head for target localization.
Key Designs¶
- AMG-LoRA (Adaptive Mutual-Guidance Low-Rank Adaptation):
  - Function: Simultaneously performs domain adaptation and dynamic cross-modal attention map alignment.
  - Mechanism: (i) LoRA adapts the K/V projection matrices of the attention layers for domain adaptation; (ii) inspired by Classifier-Free Guidance, cross-modal alignment is reformulated as a multi-branch trade-off. The alignment formula is \(\textbf{attn}_{rgb} = \tilde{\textbf{attn}}_{rgb} + w_X(\tilde{\textbf{attn}}_X - \tilde{\textbf{attn}}_{rgb})\), where \(w_X\) is a learnable scaling factor.
  - Design Motivation: Target saliency varies across modalities as scenes change, so alignment must be dynamic rather than static to prevent negative transfer from unreliable modalities. Only 0.14M parameters yield PR improvements of 18.3%/7.2%/6.1%.
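The alignment formula above amounts to an interpolation between the RGB attention map and the X-modality attention map, steered by the learnable factor \(w_X\). A minimal NumPy sketch (shapes, names, and the toy maps are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def amg_align(attn_rgb, attn_x, w_x):
    # CFG-style mutual guidance: pull the RGB attention map toward the
    # X-modality map by a scaling factor w_x (learnable in the paper).
    return attn_rgb + w_x * (attn_x - attn_rgb)

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
# toy attention maps over 4 query x 4 key tokens; each row sums to 1
attn_rgb = softmax(rng.normal(size=(4, 4)))
attn_x = softmax(rng.normal(size=(4, 4)))

aligned = amg_align(attn_rgb, attn_x, w_x=0.5)
```

Because the operation is a convex combination for \(0 \le w_X \le 1\), the aligned map stays row-stochastic, and \(w_X = 0\) / \(w_X = 1\) recover the pure RGB / pure X maps, which is what lets training modulate how much guidance an unreliable modality exerts.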
- HMoE (Hierarchical Mixture of Experts):
  - Function: Efficient global relation modeling that replaces the quadratic complexity of attention.
  - Mechanism: Unlike existing MoE methods that aggregate only at the expert level, HMoE enables fine-grained interactions from the sub-token to the token level. Low-rank linear layers serve as expert functions, with hierarchical soft routing implemented via learnable gating matrices.
  - Design Motivation: Attention-based fusion is expressive but incurs quadratic complexity; local fusion is efficient but lacks a global receptive field. HMoE is approximately 35% faster than its attention-based counterpart while maintaining comparable performance.
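A toy sketch of soft hierarchical routing with low-rank experts, to make the two gating levels concrete. The specific gate shapes, the channel-group interpretation of "sub-token", and all dimensions are assumptions for illustration; the paper's exact routing may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, E, G = 16, 4, 3, 4  # token dim, expert rank, experts, channel groups (assumed)

# low-rank experts: D -> R -> D, far cheaper than a full D x D layer
A = rng.normal(scale=0.1, size=(E, D, R))
B = rng.normal(scale=0.1, size=(E, R, D))
Wg_tok = rng.normal(scale=0.1, size=(D, E))     # token-level gate
Wg_sub = rng.normal(scale=0.1, size=(D, E, G))  # sub-token (group) gate, assumed form

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def hmoe(x):
    # Soft routing: every token passes through all experts; a coarse gate
    # weights whole expert outputs, a finer gate weights channel groups
    # inside each expert output (the "sub-token" level).
    g_tok = softmax(x @ Wg_tok)                           # (N, E)
    g_sub = softmax(np.einsum('nd,deg->neg', x, Wg_sub))  # (N, E, G)
    out = np.zeros_like(x)
    for e in range(E):
        y = (x @ A[e]) @ B[e]                             # (N, D) expert output
        gate = np.repeat(g_sub[:, e, :], D // G, axis=1)  # (N, D) group weights
        out += g_tok[:, e:e + 1] * gate * y
    return out

x = rng.normal(size=(5, D))
y = hmoe(x)
```

Note the cost is linear in the number of tokens N (each token does a fixed amount of per-expert work), whereas token-to-token attention scales with N².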
- Dual-Stream Design with Shared LoRA:
  - Function: Establishes joint representation learning across the two streams.
  - Mechanism: The RGB and X-modality streams share the same LoRA bypass, promoting cross-modal feature alignment. At inference time, the LoRA matrices can be merged into the original weights, incurring no additional latency.
  - Design Motivation: Shared parameters reduce the parameter count while fostering cross-modal consistency in domain adaptation.
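The zero-latency claim follows from standard LoRA algebra: the training-time bypass \(h = Wx + BAx\) equals a single matmul with the merged weight \(W' = W + BA\). A minimal numeric check (dimensions and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D, r = 8, 2
W = rng.normal(size=(D, D))              # frozen pretrained projection
A = rng.normal(scale=0.1, size=(r, D))   # shared LoRA down-projection
B = rng.normal(scale=0.1, size=(D, r))   # shared LoRA up-projection
x = rng.normal(size=(D,))

h_train = W @ x + B @ (A @ x)  # training-time: frozen path + low-rank bypass
W_merged = W + B @ A           # fold the LoRA update into the frozen weight
h_infer = W_merged @ x         # inference: one matmul, no extra latency
```

The two outputs agree exactly, so the deployed model has the same cost as the original frozen backbone.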
Loss & Training¶
Standard tracking losses (classification + regression) are employed. The AMG scaling factor is initialized to 1 and adapts automatically to scenes during training.
Key Experimental Results¶
Main Results¶
| Method | Trainable Params | LasHeR PR↑ | DepthTrack PR↑ | VisEvent PR↑ |
|---|---|---|---|---|
| ProTrack | 0.3M | 52.1 | 58.3 | 65.2 |
| Un-Track | 4.8M | 65.4 | 63.8 | 69.1 |
| SDSTrack | 2.1M | 68.2 | 65.5 | 71.3 |
| SEATrack | 0.8M | 70.4 | 65.5 | 71.3 |
Ablation Study¶
| Configuration | LasHeR PR | Params | Notes |
|---|---|---|---|
| Baseline (frozen ViT) | 52.1 | 0M | No adaptation |
| + LoRA | 60.8 | 0.12M | Domain adaptation only |
| + AMG-LoRA | 70.4 | 0.14M | Domain adaptation + alignment |
| + HMoE | 70.4 | 0.8M | Full model |
| HMoE replaced by attention | 70.2 | 1.6M | ~35% slower |
Key Findings¶
- AMG-LoRA adds only 0.02M parameters (from 0.12M to 0.14M) yet yields nearly 10% PR improvement.
- HMoE achieves performance comparable to attention-based fusion while being 35% faster.
- CFG-inspired dynamic alignment outperforms static alignment (fixed \(w=1\)) by 3–5 percentage points.
Highlights & Insights¶
- Borrowing Classifier-Free Guidance for cross-modal alignment in tracking is an elegant analogy: the two modality branches play the roles of CFG's "conditional" and "unconditional" branches, with a guidance weight deciding how far to trust each.
- The insight that "cross-modal attention alignment is the key to breaking the performance–efficiency dilemma" is generalizable to other multimodal tasks.
Limitations & Future Work¶
- Validation is limited to the tracking task; effectiveness on detection or segmentation remains untested.
- The number of experts and heads in HMoE requires manual tuning.
- Scenarios involving more than two modalities are not considered.
- AMG could be extended to broader types of attention alignment.
Related Work & Insights¶
- vs. SDSTrack: SDSTrack reuses frozen attention layers for global interaction but at high complexity; SEATrack replaces this with HMoE.
- vs. ProTrack: ProTrack pioneered the prompt-tuning paradigm but has limited expressiveness; SEATrack's AMG-LoRA is more effective.
Rating¶
- Novelty: ⭐⭐⭐⭐ Both AMG-LoRA and HMoE exhibit original designs.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across RGB-T/D/E tasks.
- Writing Quality: ⭐⭐⭐⭐ Motivation and design logic are clearly presented.
- Value: ⭐⭐⭐⭐ Offers meaningful reference for multimodal PEFT research.