RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation¶

Conference: CVPR 2026
arXiv: 2603.03617
Code: IdolLab/RAGTrack
Area: Video Understanding / RGBT Tracking
Keywords: RGBT Tracking, Retrieval-Augmented Generation, Multimodal Fusion, Language-Guided Tracking, Adaptive Token Fusion

TL;DR¶

This paper is the first to introduce textual descriptions into RGBT tracking. It proposes RAGTrack, a framework based on retrieval-augmented generation (RAG) that combines a Multimodal Transformer Encoder (MTE), Adaptive Token Fusion (ATF), and a Context-aware Reasoning Module (CRM), and it reports state-of-the-art performance on four RGBT benchmarks (GTOT, RGBT210, RGBT234, and LasHeR).

Background & Motivation¶

Limitations of RGBT tracking: Existing RGBT trackers rely solely on first-frame visual information to model the target, making them prone to drift under drastic appearance changes.
Insufficient single-frame template information: A single template image cannot capture the full appearance variation of a target across different viewpoints, resulting in limited semantic expressiveness.
Inherent target ambiguity: Without high-level semantic cues, trackers may confuse foreground with background (e.g., brooms, dustpans, or the lower body of pedestrians).
Redundancy in the search region: Conventional methods process large amounts of redundant background regions and distractors at the token level, degrading tracking precision.
Heterogeneous modality gap: Significant feature discrepancies exist between RGB and TIR modalities, impeding effective cross-modal correspondence.
Absence of language annotations: Existing RGBT tracking benchmarks lack textual annotations, limiting research on language-guided tracking.

Method¶

Overall Architecture¶

RAGTrack comprises three core modules: a Multimodal Transformer Encoder (MTE), Adaptive Token Fusion (ATF), and a Context-aware Reasoning Module (CRM). The inputs are RGB/TIR search images, template images, and language descriptions; the output is the target bounding box.

Multimodal Transformer Encoder (MTE)¶

A three-stage downsampling operation converts template and search images into patch tokens.
A sequence prefix \(\mathbf{E}^t\) (fixed text prompt + learnable tokens) is concatenated with the language description \(\mathbf{L}^t\) and encoded via a CLIP text encoder.
Reasoning tokens \(\mathbf{R}_m^t\), text tokens \(\hat{\mathbf{H}}^t\), template tokens \(\hat{\mathbf{Z}}_m^t\), and search tokens \(\hat{\mathbf{X}}_m^t\) are concatenated into a unified sequence.
The RGB and TIR branches share parameters, and each applies multi-head self-attention over its unified sequence for joint vision–language modeling.
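
The concatenation step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the group sizes, the embedding width, and the helper name `build_unified_sequence` are assumptions made for the example. The point is that recording each group's index span makes it easy for later stages to address only the search tokens.

```python
from typing import Dict, List, Tuple

Token = List[float]  # one embedding vector


def build_unified_sequence(
    groups: Dict[str, List[Token]],
) -> Tuple[List[Token], Dict[str, range]]:
    """Concatenate token groups in insertion order and record each group's span."""
    sequence: List[Token] = []
    spans: Dict[str, range] = {}
    for name, tokens in groups.items():
        start = len(sequence)
        sequence.extend(tokens)
        spans[name] = range(start, len(sequence))
    return sequence, spans


# Hypothetical sizes: 1 reasoning, 3 text, 4 template, 16 search tokens of width 8.
dim = 8
groups = {
    "reasoning": [[0.0] * dim for _ in range(1)],
    "text": [[0.1] * dim for _ in range(3)],
    "template": [[0.2] * dim for _ in range(4)],
    "search": [[0.3] * dim for _ in range(16)],
}
sequence, spans = build_unified_sequence(groups)
print(len(sequence), spans["search"])
```

Under the parameter sharing described above, the same routine would be applied once per modality (RGB and TIR).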

Adaptive Token Fusion (ATF)¶

Dynamic token selection: Self-attention scores are reused to compute the total attention score \(\mathbf{A}_m^{total}\) between search tokens and reasoning/text/template/search tokens. Target-relevant tokens are retained at a ratio of \(\gamma=85\%\) with no additional parameter overhead.
Adaptive channel exchange: Cross-modal channel correlation \(\mathbf{S}\) between RGB and TIR features is computed; the top \(\sigma=50\%\) channels are selected for exchange, followed by MLP-based fusion.
ATF is deployed at layers 6/12/18/24 of HiViT-B, enabling progressive cross-layer fusion.
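
The two ATF steps can be sketched in plain Python. Several details here are assumptions, since this summary does not spell them out: a search token's total score is taken as the sum of the attention it receives from every row of the attention matrix, the channel correlation \(\mathbf{S}\) is approximated by a per-channel cosine similarity, and the highest-scoring channels are the ones exchanged. The MLP-based fusion that follows the exchange is omitted.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def select_tokens(attn: Matrix, search_idx: List[int], keep_ratio: float = 0.85) -> List[int]:
    """Keep the search tokens with the highest total attention (no learned parameters)."""
    # Assumption: total score of token j = sum of column j over all query rows.
    totals = {j: sum(row[j] for row in attn) for j in search_idx}
    n_keep = max(1, math.ceil(keep_ratio * len(search_idx)))
    ranked = sorted(search_idx, key=lambda j: totals[j], reverse=True)
    return sorted(ranked[:n_keep])


def channel_cosine(rgb: Matrix, tir: Matrix, c: int) -> float:
    """Cosine similarity between channel c of the two modalities (stand-in for S)."""
    dot = sum(r[c] * t[c] for r, t in zip(rgb, tir))
    norm_r = math.sqrt(sum(r[c] ** 2 for r in rgb))
    norm_t = math.sqrt(sum(t[c] ** 2 for t in tir))
    return dot / (norm_r * norm_t + 1e-8)


def exchange_channels(rgb: Matrix, tir: Matrix, ratio: float = 0.5) -> Tuple[Matrix, Matrix, List[int]]:
    """Swap the top-scoring fraction of channels between RGB and TIR features."""
    n_channels = len(rgb[0])
    scores = [channel_cosine(rgb, tir, c) for c in range(n_channels)]
    n_swap = int(ratio * n_channels)
    swap = set(sorted(range(n_channels), key=lambda c: scores[c], reverse=True)[:n_swap])
    new_rgb = [[t[c] if c in swap else r[c] for c in range(n_channels)] for r, t in zip(rgb, tir)]
    new_tir = [[r[c] if c in swap else t[c] for c in range(n_channels)] for r, t in zip(rgb, tir)]
    return new_rgb, new_tir, sorted(swap)


# Toy inputs: 4 tokens in the sequence, of which indices 1..3 are search tokens.
attn = [
    [0.1, 0.5, 0.1, 0.3],
    [0.2, 0.4, 0.1, 0.3],
    [0.1, 0.6, 0.1, 0.2],
    [0.3, 0.3, 0.2, 0.2],
]
kept = select_tokens(attn, search_idx=[1, 2, 3], keep_ratio=0.5)

rgb = [[1.0, 1.0], [2.0, -1.0]]   # 2 tokens x 2 channels
tir = [[2.0, -1.0], [4.0, 1.0]]
new_rgb, new_tir, swapped = exchange_channels(rgb, tir, ratio=0.5)
print(kept, swapped)
```

Because the selection reuses scores that self-attention already computes, it adds no parameters, which matches the "no additional parameter overhead" claim above.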

Context-aware Reasoning Module (CRM)¶

A RAG paradigm is adopted for temporal language reasoning, consisting of four stages:

Construction: A dynamic knowledge base \(\mathbf{D}_m\) is maintained with \(n=4\) historical text features; a new feature is added only when its cosine similarity to all existing entries falls below threshold \(\lambda=1.0\).
Retrieval: The top-\(k=2\) most relevant features are retrieved from the knowledge base, and intra-modal cross-attention \(\Phi\) is applied to refine the search features.
Augmentation: Reasoning, text, and template features are average-pooled and concatenated, then passed through an MLP to generate the reasoning token for the next frame; cross-attention and Hadamard product are further applied to enhance temporal representations.
Generation: Qwen2.5-VL-3B dynamically generates context-aware target descriptions from search images and structured prompts, continuously updating the multimodal reference.
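
The Construction and Retrieval stages can be sketched as a small bounded store with cosine-similarity lookup. This is an illustrative sketch under stated assumptions, not the paper's code: the class name `KnowledgeBase` is hypothetical, and the first-in-first-out eviction when the base is full is a guess, since the eviction policy is not described here. The defaults mirror \(n=4\), \(\lambda=1.0\), and \(k=2\); the demo uses a lower threshold only to make the filtering visible.

```python
import math
from collections import deque
from typing import Deque, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class KnowledgeBase:
    """Bounded store of historical text features (hypothetical helper)."""

    def __init__(self, capacity: int = 4, threshold: float = 1.0) -> None:
        # Assumption: the oldest entry is dropped first when the store is full.
        self.entries: Deque[Vector] = deque(maxlen=capacity)
        self.threshold = threshold

    def maybe_add(self, feature: Vector) -> bool:
        """Add the feature only if it is below the threshold against every entry."""
        if all(cosine(feature, e) < self.threshold for e in self.entries):
            self.entries.append(feature)
            return True
        return False

    def retrieve(self, query: Vector, k: int = 2) -> List[Vector]:
        """Return the k stored features most similar to the query."""
        ranked = sorted(self.entries, key=lambda e: cosine(query, e), reverse=True)
        return ranked[:k]


kb = KnowledgeBase(capacity=4, threshold=0.9)  # demo threshold, not the paper's 1.0
added_first = kb.maybe_add([1.0, 0.0])
added_near_duplicate = kb.maybe_add([1.0, 0.01])  # too similar to the first entry
added_novel = kb.maybe_add([0.0, 1.0])
top1 = kb.retrieve([0.9, 0.1], k=1)
print(added_first, added_near_duplicate, added_novel, top1)
```

In the full module the retrieved features would then feed the cross-attention \(\Phi\) that refines the search features; that step is not shown.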

Loss & Training¶

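A multi-task joint loss is used: \(\mathcal{L} = \mathcal{L}_{\text{cls}} + 2\,\mathcal{L}_{\text{iou}} + 5\,\mathcal{L}_{1}\), where \(\mathcal{L}_{\text{cls}}\) is a focal loss for classification, and box regression combines a GIoU loss \(\mathcal{L}_{\text{iou}}\) with an L1 loss \(\mathcal{L}_{1}\).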
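
To make the weighting concrete, the sketch below evaluates the regression terms for a single box pair and combines them with the weights 2 and 5 from the formula. It assumes boxes in `(x1, y1, x2, y2)` format with positive area, takes the classification term as a precomputed number instead of implementing focal loss, and uses hypothetical function names.

```python
from typing import Sequence

Box = Sequence[float]  # (x1, y1, x2, y2), assumed to have positive area


def giou_loss(pred: Box, target: Box) -> float:
    """1 - GIoU for two axis-aligned boxes."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    area_p = (px2 - px1) * (py2 - py1)
    area_t = (tx2 - tx1) * (ty2 - ty1)
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = area_p + area_t - inter
    iou = inter / union
    # Smallest box enclosing both inputs.
    enclose = (max(px2, tx2) - min(px1, tx1)) * (max(py2, ty2) - min(py1, ty1))
    giou = iou - (enclose - union) / enclose
    return 1.0 - giou


def l1_loss(pred: Box, target: Box) -> float:
    """Mean absolute difference over the four box coordinates."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def total_loss(l_cls: float, l_iou: float, l_1: float) -> float:
    """Weighted sum with the weights stated in the text: 1, 2 and 5."""
    return l_cls + 2.0 * l_iou + 5.0 * l_1


pred, target = (0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)
loss = total_loss(l_cls=0.5, l_iou=giou_loss(pred, target), l_1=l1_loss(pred, target))
print(round(loss, 4))
```

In practice box coordinates are usually normalized before the L1 term is computed; whether and how RAGTrack does this is not stated here.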

Key Experimental Results¶

SOTA Comparison on Four RGBT Benchmarks¶

Dataset	Metric	RAGTrack	Runner-up	Gain
GTOT	MPR/MSR	95.1/79.3	DMD 94.2/78.6	+0.9/+0.7
RGBT210	PR/SR	93.2/67.1	AETrack 90.4/66.3	+2.8/+0.8
RGBT234	MPR/MSR	93.8/69.5	SUTrack 92.1/69.2	+1.7/+0.3
LasHeR	PR/NPR/SR	76.8/73.0/61.1	STTrack 76.0/−/60.3	+0.8/−/+0.8

Ablation Study (RGBT234)¶

Configuration	MPR	MSR
Baseline	87.9	64.5
+ CRM* (w/o text)	89.1	65.0
+ MTE + CRM*	91.1	66.7
+ MTE + CRM (w/ text)	91.8	67.4
+ MTE + CRM + ATF (full)	93.8	69.5
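
The per-step contribution of each component can be read off by differencing consecutive rows of the table. The short script below does only that arithmetic on the numbers listed above; the helper name `stepwise_gains` is made up for the example.

```python
from typing import List, Tuple

Row = Tuple[str, float, float]  # (configuration, MPR, MSR)

# Values copied from the ablation table above.
ablation: List[Row] = [
    ("Baseline", 87.9, 64.5),
    ("+ CRM* (w/o text)", 89.1, 65.0),
    ("+ MTE + CRM*", 91.1, 66.7),
    ("+ MTE + CRM (w/ text)", 91.8, 67.4),
    ("+ MTE + CRM + ATF (full)", 93.8, 69.5),
]


def stepwise_gains(rows: List[Row]) -> List[Row]:
    """Gain of each row over the row before it, rounded to one decimal."""
    gains = []
    for (_, prev_mpr, prev_msr), (name, mpr, msr) in zip(rows, rows[1:]):
        gains.append((name, round(mpr - prev_mpr, 1), round(msr - prev_msr, 1)))
    return gains


gains = stepwise_gains(ablation)
for name, d_mpr, d_msr in gains:
    print(f"{name}: +{d_mpr} MPR, +{d_msr} MSR")
```

Read this way, adding ATF as the final step contributes +2.0 MPR and +2.1 MSR, and the full model is +5.9 MPR and +5.0 MSR over the baseline.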

Fusion Paradigm Comparison (RGBT234)¶

Method	MPR	MSR	Params
TBSI	92.8	67.6	145.9M
BSI	93.1	68.2	103.6M
DFM	92.7	67.8	110.3M
ATF (Ours)	93.8	69.5	101.8M

Attribute-level analysis on LasHeR shows gains of +10.7% PR under Total Occlusion (TO) and +5.5% SR under Out-of-View (OV), suggesting that CRM helps maintain target identity under severe appearance changes.

Highlights & Insights¶

First to introduce language descriptions into RGBT tracking: MLLMs are used to automatically generate text annotations, extending four existing benchmarks (LasHeR training set annotated with 514,081 descriptions).
Elegant ATF design: Parameter-free token selection (reusing attention scores) combined with adaptive channel exchange achieves the best RGBT234 results among the compared fusion paradigms with the fewest parameters (101.8M).
Novel RAG paradigm: The first application of retrieval-augmented generation to RGBT tracking; a dynamic knowledge base with reasoning token propagation enables continuous temporal reasoning.
MLLM-based dynamic description generation: Overcomes the limitations of static language annotations by adaptively updating target descriptions across frames.

Limitations & Future Work¶

Inference overhead: Qwen2.5-VL-3B is invoked per frame to generate descriptions, limiting real-time applicability; no FPS is reported in the paper.
Dependence on MLLM quality: Automatically generated text may contain hallucinations; although human verification is performed, scaling to larger datasets incurs high cost.
Validated only on RGBT: The framework is theoretically transferable to other multimodal tracking settings (e.g., RGB-Depth, RGB-Event), but no such experiments are conducted.
Fixed knowledge base size: \(n=4\) is manually set; adaptive sizing could further improve performance in long-video scenarios.
Training resources: Training requires 4× V100 GPUs; adaptation for lightweight deployment scenarios remains unclear.

Comparison with Related Work¶

vs ViPT/BAT/SDSTrack (visual prompt learning): These methods enhance tracking with visual prompts only, lacking language-level semantics; RAGTrack introduces textual descriptions for more abstract target representation.
vs RGBL tracking (CiteTracker/UVLTrack): RGBL methods suffer from static vision–language misalignment; RAGTrack addresses this by dynamically generating descriptions via MLLMs.
vs TrackingMiM (the only prior RAG-based tracking work): It merely reuses pre-stored features; RAGTrack achieves genuine RAG through a dynamic knowledge base and contextual reasoning.
vs SUTrack/AINet (current SOTA): On RGBT234, the ATF-based model, with only 101.8M parameters, surpasses SUTrack (384 resolution), indicating better parameter efficiency.

Rating¶

Novelty: ⭐⭐⭐⭐ — First to introduce language descriptions and the RAG paradigm into RGBT tracking; the parameter-free token selection design in ATF is innovative.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive SOTA results on four benchmarks; ablations cover individual components, hyperparameters, fusion paradigms, and attention score combinations.
Writing Quality: ⭐⭐⭐⭐ — Clear structure, rich figures and tables, complete mathematical derivations.
Value: ⭐⭐⭐⭐ — Opens a new language-guided direction for RGBT tracking, though real-time performance and deployment costs warrant attention.