
RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation

Conference: CVPR 2026
arXiv: 2603.03617
Code: IdolLab/RAGTrack
Area: Video Understanding / RGBT Tracking
Keywords: RGBT Tracking, Retrieval-Augmented Generation, Multimodal Fusion, Language-Guided Tracking, Adaptive Token Fusion

TL;DR

This paper is the first to introduce textual descriptions into RGBT tracking, proposing RAGTrack, a retrieval-augmented generation (RAG)-based framework. Through a Multimodal Transformer Encoder (MTE), Adaptive Token Fusion (ATF), and a Context-aware Reasoning Module (CRM), it achieves state-of-the-art performance on four RGBT benchmarks.

Background & Motivation

  1. Limitations of RGBT tracking: Existing RGBT trackers rely solely on first-frame visual information to model the target, making them prone to drift under drastic appearance changes.
  2. Insufficient single-frame template information: A single template image cannot capture the full appearance variation of a target across different viewpoints, resulting in limited semantic expressiveness.
  3. Inherent target ambiguity: Trackers may confuse foreground with background (e.g., brooms, dustpans, or the lower body of pedestrians), lacking high-level semantic discriminability.
  4. Redundancy in the search region: Conventional methods process large amounts of redundant background regions and distractors at the token level, degrading tracking precision.
  5. Heterogeneous modality gap: Significant feature discrepancies exist between RGB and TIR modalities, impeding effective cross-modal correspondence.
  6. Absence of language annotations: Existing RGBT tracking benchmarks lack textual annotations, limiting research on language-guided tracking.

Method

Overall Architecture

RAGTrack comprises three core modules: a Multimodal Transformer Encoder (MTE), Adaptive Token Fusion (ATF), and a Context-aware Reasoning Module (CRM). The inputs are RGB/TIR search images, template images, and language descriptions; the output is the target bounding box.

Multimodal Transformer Encoder (MTE)

  • A three-stage downsampling operation converts template and search images into patch tokens.
  • A sequence prefix \(\mathbf{E}^t\) (fixed text prompt + learnable tokens) is concatenated with the language description \(\mathbf{L}^t\) and encoded via a CLIP text encoder.
  • Reasoning tokens \(\mathbf{R}_m^t\), text tokens \(\hat{\mathbf{H}}^t\), template tokens \(\hat{\mathbf{Z}}_m^t\), and search tokens \(\hat{\mathbf{X}}_m^t\) are concatenated into a unified sequence.
  • Both the RGB and TIR branches share parameters and apply multi-head self-attention for unified vision–language modeling (see the sketch after this list).
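
To make the sequence layout concrete, below is a minimal PyTorch sketch of how the unified token sequence could be assembled and jointly encoded. The dimensions, the text projection layer, the single encoder layer, and all names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MTESketch(nn.Module):
    """Illustrative stand-in for one MTE block: concatenate reasoning, text,
    template, and search tokens and run shared multi-head self-attention."""
    def __init__(self, dim=768, n_heads=12, clip_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(clip_dim, dim)  # map CLIP text features to the visual width (assumed)
        self.block = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)

    def forward(self, reasoning, text_feat, template, search):
        # reasoning: (B, Nr, dim)  R_m^t      text_feat: (B, Nt, clip_dim)  encoded prefix + L^t
        # template:  (B, Nz, dim)  Z_m^t      search:    (B, Nx, dim)       X_m^t
        text = self.text_proj(text_feat)
        seq = torch.cat([reasoning, text, template, search], dim=1)  # unified sequence
        return self.block(seq)  # joint self-attention over vision and language tokens

# Parameter sharing: the same module processes both modality branches.
mte = MTESketch()
r = torch.randn(2, 1, 768)    # one reasoning token
t = torch.randn(2, 8, 512)    # CLIP-encoded text tokens
z = torch.randn(2, 64, 768)   # template tokens
x = torch.randn(2, 256, 768)  # search tokens
out_rgb = mte(r, t, z, x)     # RGB branch
out_tir = mte(r, t, z, x)     # TIR branch (same weights; different image tokens in practice)
```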

Adaptive Token Fusion (ATF)

  • Dynamic token selection: Self-attention scores are reused to compute the total attention score \(\mathbf{A}_m^{total}\) between search tokens and reasoning/text/template/search tokens. Target-relevant tokens are retained at a ratio of \(\gamma=85\%\) with no additional parameter overhead.
  • Adaptive channel exchange: Cross-modal channel correlation \(\mathbf{S}\) between RGB and TIR features is computed; the top \(\sigma=50\%\) channels are selected for exchange, followed by MLP-based fusion.
  • ATF is deployed at layers 6/12/18/24 of HiViT-B, enabling progressive cross-layer fusion (a sketch of the two ATF steps follows this list).
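
As a rough illustration of the two ATF steps, the sketch below reuses an attention map to rank search tokens and swaps the most correlated channels between modalities. The token layout, the use of cosine similarity for the channel correlation \(\mathbf{S}\), and the exact handling of the ratios are assumptions inferred from the description above, not the released code.

```python
import torch
import torch.nn.functional as F

def select_search_tokens(attn, n_search, keep_ratio=0.85):
    """Dynamic token selection: reuse the self-attention map of the preceding block.
    attn: (B, heads, N, N); the last n_search positions are assumed to be search
    tokens. A search token's score is the attention it receives from all
    reasoning/text/template/search tokens."""
    scores = attn.mean(dim=1)[:, :, -n_search:].sum(dim=1)   # (B, n_search)
    k = int(keep_ratio * n_search)
    return scores.topk(k, dim=1).indices                      # indices of retained, target-relevant tokens

def channel_exchange(rgb, tir, exchange_ratio=0.5):
    """Adaptive channel exchange: swap the top-ranked channels between the RGB
    and TIR features (an MLP fusion would follow in the full module).
    rgb, tir: (B, N, C)."""
    b, n, c = rgb.shape
    corr = F.cosine_similarity(rgb, tir, dim=1)                # (B, C) per-channel cross-modal correlation
    k = int(exchange_ratio * c)
    idx = corr.topk(k, dim=1).indices                          # channels selected for exchange
    swap = idx.unsqueeze(1).expand(b, n, k)
    rgb_new, tir_new = rgb.clone(), tir.clone()
    rgb_new.scatter_(2, swap, torch.gather(tir, 2, swap))
    tir_new.scatter_(2, swap, torch.gather(rgb, 2, swap))
    return rgb_new, tir_new
```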

Context-aware Reasoning Module (CRM)

A RAG paradigm is adopted for temporal language reasoning, consisting of four stages (a sketch of the construction and retrieval stages follows the list):

  1. Construction: A dynamic knowledge base \(\mathbf{D}_m\) is maintained with \(n=4\) historical text features; a new feature is added only when its cosine similarity to all existing entries falls below threshold \(\lambda=1.0\).
  2. Retrieval: The top-\(k=2\) most relevant features are retrieved from the knowledge base, and intra-modal cross-attention \(\Phi\) is applied to refine the search features.
  3. Augmentation: Reasoning, text, and template features are average-pooled and concatenated, then passed through an MLP to generate the reasoning token for the next frame; cross-attention and Hadamard product are further applied to enhance temporal representations.
  4. Generation: Qwen2.5-VL-3B dynamically generates context-aware target descriptions from search images and structured prompts, continuously updating the multimodal reference.
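
The construction and retrieval stages can be pictured as a small similarity-gated text memory. The sketch below is one plausible reading of the description above; the FIFO eviction, the exact similarity handling, and all names are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

class TextKnowledgeBase:
    """Illustrative dynamic knowledge base D_m for the CRM."""
    def __init__(self, max_size=4, sim_threshold=1.0):
        self.max_size = max_size
        self.sim_threshold = sim_threshold
        self.entries = []                                    # stored text features, each of shape (D,)

    def maybe_add(self, feat):
        """Construction: admit a feature only if it is dissimilar to every stored entry."""
        if self.entries:
            sims = torch.stack([F.cosine_similarity(feat, e, dim=0) for e in self.entries])
            if (sims >= self.sim_threshold).any():
                return False
        self.entries.append(feat)
        if len(self.entries) > self.max_size:                # evict the oldest entry (FIFO assumed)
            self.entries.pop(0)
        return True

    def retrieve(self, query, k=2):
        """Retrieval: return the top-k stored features most similar to the query;
        intra-modal cross-attention would then refine the search features."""
        if not self.entries:
            return []
        bank = torch.stack(self.entries)                     # (M, D)
        sims = F.cosine_similarity(bank, query.unsqueeze(0), dim=1)
        top = sims.topk(min(k, len(self.entries))).indices
        return [self.entries[i] for i in top.tolist()]
```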

Loss & Training

A multi-task joint loss is used: \(\mathcal{L} = L_{\text{cls}} + 2 L_{\text{iou}} + 5 L_1\), where classification employs focal loss and regression employs L1 + GIoU loss.
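
A minimal sketch of this objective with the stated weights, using torchvision's focal-loss and GIoU-loss helpers as stand-ins for the paper's loss terms (box format, reductions, and focal-loss hyperparameters are assumptions):

```python
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou_loss

def tracking_loss(cls_logits, cls_targets, pred_boxes, gt_boxes):
    """L = L_cls + 2 * L_iou + 5 * L_1, with focal classification and GIoU + L1 regression.
    Boxes are assumed to be in (x1, y1, x2, y2) format."""
    l_cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction="mean")
    l_iou = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    l_l1 = F.l1_loss(pred_boxes, gt_boxes)
    return l_cls + 2.0 * l_iou + 5.0 * l_l1
```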

Key Experimental Results

SOTA Comparison on Four RGBT Benchmarks

| Dataset | Metric | RAGTrack | Runner-up (score) | Gain |
|---|---|---|---|---|
| GTOT | MPR/MSR | 95.1/79.3 | DMD 94.2/78.6 | +0.9/+0.7 |
| RGBT210 | PR/SR | 93.2/67.1 | AETrack 90.4/66.3 | +2.8/+0.8 |
| RGBT234 | MPR/MSR | 93.8/69.5 | SUTrack 92.1/69.2 | +1.7/+0.3 |
| LasHeR | PR/NPR/SR | 76.8/73.0/61.1 | STTrack 76.0/−/60.3 | +0.8/−/+0.8 |

Ablation Study (RGBT234)

| Configuration | MPR | MSR |
|---|---|---|
| Baseline | 87.9 | 64.5 |
| + CRM* (w/o text) | 89.1 | 65.0 |
| + MTE + CRM* | 91.1 | 66.7 |
| + MTE + CRM (w/ text) | 91.8 | 67.4 |
| + MTE + CRM + ATF (full) | 93.8 | 69.5 |

Fusion Paradigm Comparison (RGBT234)

| Method | MPR | MSR | Params |
|---|---|---|---|
| TBSI | 92.8 | 67.6 | 145.9M |
| BSI | 93.1 | 68.2 | 103.6M |
| DFM | 92.7 | 67.8 | 110.3M |
| ATF (Ours) | 93.8 | 69.5 | 101.8M |

Attribute-level analysis on LasHeR shows gains of +10.7% PR under Total Occlusion (TO) and +5.5% SR under Out-of-View (OV), demonstrating CRM's ability to maintain target identity under severe appearance changes.

Highlights & Insights

  1. First to introduce language descriptions into RGBT tracking: MLLMs are used to automatically generate text annotations, extending four existing benchmarks (LasHeR training set annotated with 514,081 descriptions).
  2. Elegant ATF design: Parameter-free token selection (reusing attention scores) combined with adaptive channel exchange achieves the best fusion performance with the fewest parameters.
  3. Novel RAG paradigm: The first application of retrieval-augmented generation to RGBT tracking; a dynamic knowledge base with reasoning token propagation enables continuous temporal reasoning.
  4. MLLM-based dynamic description generation: Overcomes the limitations of static language annotations by adaptively updating target descriptions across frames.

Limitations & Future Work

  1. Inference overhead: Qwen2.5-VL-3B is invoked per frame to generate descriptions, limiting real-time applicability; no FPS figure is reported in the paper.
  2. Dependence on MLLM quality: Automatically generated text may contain hallucinations; although human verification is performed, scaling to larger datasets incurs high cost.
  3. Validated only on RGBT: The framework is theoretically transferable to other multimodal tracking settings (e.g., RGB-Depth, RGB-Event), but no such experiments are conducted.
  4. Fixed knowledge base size: \(n=4\) is manually set; adaptive sizing could further improve performance in long-video scenarios.
  5. Training resources: Training requires 4× V100 GPUs; adaptation for lightweight deployment scenarios remains unclear.

Comparison with Related Methods

  • vs ViPT/BAT/SDSTrack (visual prompt learning): These methods enhance tracking with visual prompts only, lacking language-level semantics; RAGTrack introduces textual descriptions for a more abstract target representation.
  • vs RGBL tracking (CiteTracker/UVLTrack): RGBL methods suffer from static vision–language misalignment; RAGTrack addresses this by dynamically generating descriptions via MLLMs.
  • vs TrackingMiM (the only prior RAG-based tracking work): It merely reuses pre-stored features; RAGTrack achieves genuine RAG through a dynamic knowledge base and contextual reasoning.
  • vs SUTrack/AINet (current SOTA): On RGBT234, ATF with only 101.8M parameters surpasses SUTrack (384 resolution), demonstrating superior efficiency.

Rating

  • Novelty: ⭐⭐⭐⭐ — First to introduce language descriptions and the RAG paradigm into RGBT tracking; the parameter-free token selection design in ATF is innovative.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive SOTA results on four benchmarks; ablations cover individual components, hyperparameters, fusion paradigms, and attention score combinations.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure, rich figures and tables, complete mathematical derivations.
  • Value: ⭐⭐⭐⭐ — Opens a new language-guided direction for RGBT tracking, though real-time performance and deployment costs warrant attention.