# XTrack: Multimodal Training Boosts RGB-X Video Object Trackers
- Conference: ICCV 2025
- arXiv: 2405.17773
- Code: Available
- Area: Video Understanding
- Keywords: Multimodal Tracking, Mixture of Experts, Cross-modal Knowledge Transfer, Video Object Tracking, RGB-X
## TL;DR
This paper proposes XTrack, which uses a Mixture of Modal Experts (MeME) framework with a soft-routing classifier to share knowledge across RGB-D/T/E modalities, so that inference with a single RGB-X pair benefits from what was learned during multimodal training, yielding an average precision gain of about 3%.
## Background & Motivation
Multimodal perception (depth, thermal infrared, and event cameras) can compensate for the limitations of RGB-only tracking in extreme scenarios, but faces critical constraints:
Data Scarcity: No comprehensive dataset containing all modalities simultaneously exists; typically only paired RGB-X data is available.
Rigid Branch Design: Existing unified models (e.g., ViPT, UnTrack) activate predetermined branches at inference time based on the input modality, with no cross-modal interaction.
Cross-modal Knowledge Waste: Strict modality isolation prevents knowledge transfer between modality pairs; for instance, what the model learns from RGB-Depth sequences with fast motion and low light transcends single-modality boundaries and should also benefit the other RGB-X pairs.
Core Insight: Similar samples across different modalities share more transferable knowledge. When a "weak" classifier fails to accurately distinguish which modality a sample belongs to, it indicates that the sample resides at an optimal position for cross-modal knowledge sharing (i.e., minimal domain gap). This "confusion" is not a defect but rather a signal for knowledge transfer.
## Method
### Overall Architecture
XTrack inserts MeME modules after each attention block and FFN in a frozen RGB base tracker (OSTrack/SeqTrack). MeME processes RGB and X-modality tokens bidirectionally to enhance feature modeling:

\(T_{rgb}^{attn} = T_{rgb}^{l} + \mathrm{Attn}(T_{rgb}^{l}) + \mathrm{MeME}(T_{rgb}^{l}, T_x^{l})\)

\(T_{rgb}^{l+1} = T_{rgb}^{attn} + \mathrm{FFN}(T_{rgb}^{attn}) + \mathrm{MeME}(T_{rgb}^{attn}, T_x^{attn})\)
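For orientation, here is a minimal PyTorch sketch of how the two residual insertions above could be wired around one frozen base-tracker layer. The names (`MeMELayer`, `meme_attn`, `meme_ffn`) and the pass-through handling of the X branch are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn


class MeMELayer(nn.Module):
    """One layer of a frozen RGB base tracker with two MeME insertions (sketch).

    `attn`, `ffn`, and the `meme_*` modules are placeholders: any modules with
    the shown call signatures will do. Only the MeME modules stay trainable.
    """

    def __init__(self, attn: nn.Module, ffn: nn.Module,
                 meme_attn: nn.Module, meme_ffn: nn.Module):
        super().__init__()
        self.attn, self.ffn = attn, ffn
        self.meme_attn, self.meme_ffn = meme_attn, meme_ffn
        # Freeze the parts inherited from the RGB base tracker (OSTrack/SeqTrack).
        for module in (self.attn, self.ffn):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, t_rgb: torch.Tensor, t_x: torch.Tensor):
        # T_rgb^attn = T_rgb^l + Attn(T_rgb^l) + MeME(T_rgb^l, T_x^l)
        t_rgb_attn = t_rgb + self.attn(t_rgb) + self.meme_attn(t_rgb, t_x)
        # In the full model the X branch is updated analogously; here its tokens
        # are passed through unchanged for brevity.
        t_x_attn = t_x
        # T_rgb^{l+1} = T_rgb^attn + FFN(T_rgb^attn) + MeME(T_rgb^attn, T_x^attn)
        t_rgb_next = t_rgb_attn + self.ffn(t_rgb_attn) + self.meme_ffn(t_rgb_attn, t_x_attn)
        return t_rgb_next, t_x_attn
```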
### Key Designs
- Soft Router with Classification Loss: The routing function \(y = \sum_{i \in \mathrm{top}\text{-}k} p_i(T_x)\, \epsilon_i(T_x)\) is trained not only with the conventional expert load-balancing loss \(\mathcal{L}_{balance} = \mathcal{L}_{Imp} + \mathcal{L}_{Load}\), but also with a modal classification loss \(\mathcal{L}_{cls}\). The classification loss preserves a degree of modality specialization within each expert, while the soft (non-rigid) routing lets cross-modal samples access experts of other modalities; the optimal balance is reached at a classification probability of approximately 80% (see the routing sketch after this list).
- Modality-Specific Experts + Shared Experts:
    - Modality-Specific Experts: Each modality is assigned \(k\) experts (experiments show \(k=2\) is optimal), performing feature decomposition and reconstruction in a low-dimensional space (\(k \ll c\)).
    - Edge-Gated Shared Experts: The shared experts incorporate an EdgeMix gating module initialized with a Laplacian filter, using high-frequency edge information as a natural cross-modal common prior: \(Out = (\sigma(\mathrm{EdgeMix}(XW_1)) \cdot XW_2)W_3 + m_{s_k}\)
- Modal Prompting: The low-dimensional modal matrix \(M_k\) output by MeME serves as a gating signal that modulates the RGB tokens: \(Out = ((X_i W_5 \cdot \sigma(X_m W_6))W_7 + I_k)W_8\), rendering the RGB features modality-aware.
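To make the routing concrete, the sketch below shows soft top-k routing over modality-specific bottleneck experts, a weak modality classifier for \(\mathcal{L}_{cls}\), and a simple stand-in for the balance loss. The expert count, hidden width, pooling choice, and loss form are assumptions; the edge-gated shared expert and modal prompting are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftModalRouter(nn.Module):
    """Soft top-k routing over modality experts plus a weak modality classifier.

    A minimal sketch under assumed shapes and names, not the paper's exact design.
    """

    def __init__(self, dim: int, num_experts: int = 6,
                 num_modalities: int = 3, top_k: int = 2, hidden: int = 32):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_experts)           # produces p_i(T_x)
        self.modal_head = nn.Linear(dim, num_modalities)  # weak modality classifier
        # Each expert decomposes/reconstructs features through a low-dim bottleneck.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, t_x: torch.Tensor, modality_id: torch.Tensor = None):
        # t_x: (B, N, C) tokens of the auxiliary modality.
        pooled = t_x.mean(dim=1)                       # (B, C) summary for routing
        probs = self.gate(pooled).softmax(dim=-1)      # (B, E) soft routing weights
        top_p, top_i = probs.topk(self.top_k, dim=-1)  # keep only the top-k experts

        # y = sum_{i in top-k} p_i(T_x) * eps_i(T_x)
        out = torch.zeros_like(t_x)
        for rank in range(self.top_k):
            weight = top_p[:, rank].view(-1, 1, 1)
            expert_out = torch.stack(
                [self.experts[int(i)](t_x[b]) for b, i in enumerate(top_i[:, rank])]
            )
            out = out + weight * expert_out

        losses = {}
        if modality_id is not None:
            # L_cls keeps some modality specialization, while the soft routing
            # still lets confusable samples reach experts of other modalities.
            losses["cls"] = F.cross_entropy(self.modal_head(pooled), modality_id)
            # A simple importance-style balance term (stand-in for L_Imp + L_Load).
            importance = probs.sum(dim=0)
            losses["balance"] = importance.var() / (importance.mean() ** 2 + 1e-6)
        return out, losses
```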
### Loss & Training
- Tracking loss: Inherits the IoU + L1 losses from the RGB base tracker.
- MoE loss: \(\mathcal{L}_{moe} = \mathcal{L}_{cls} + \lambda \cdot \mathcal{L}_{balance}\)
- Only the MeME parameters are trained; the base tracker is frozen (a training-step sketch follows this list).
- Training settings: batch size 32, learning rate 4e-4 (decayed 10× after epoch 78), 90 epochs.
- Training data: DepthTrack (RGB-D) + LasHeR (RGB-T) + VisEvent (RGB-E); each training sample provides only one RGB-X pair, never all modalities at once.
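A hypothetical training-step sketch tying these points together is given below. The parameter-name filter, the loss weight `lam`, and the batch/model interface are assumptions; the learning rate and the frozen-backbone setup follow the settings listed above.

```python
import torch


def build_optimizer(model: torch.nn.Module, lr: float = 4e-4):
    """Optimize only the MeME parameters; the base tracker stays frozen.

    Assumes MeME submodules carry "meme" in their parameter names
    (an illustrative convention, not the authors' code).
    """
    trainable = [p for n, p in model.named_parameters()
                 if "meme" in n and p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)


def training_step(model, optimizer, batch, tracking_loss_fn, lam: float = 0.01):
    """One step: the base tracker's tracking loss plus the MoE auxiliary loss."""
    pred, moe_losses = model(batch["rgb"], batch["x"],
                             modality_id=batch["modality_id"])
    l_track = tracking_loss_fn(pred, batch["gt_box"])        # IoU + L1 terms
    l_moe = moe_losses["cls"] + lam * moe_losses["balance"]  # L_moe = L_cls + λ·L_balance
    loss = l_track + l_moe

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```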
## Key Experimental Results
### Main Results
RGB-Depth Tracking:
| Method | DepthTrack F-score | VOT-RGBD22 EAO | VOT-RGBD22 Acc. | VOT-RGBD22 Rob. |
|---|---|---|---|---|
| ViPT | 59.4 | 72.1 | 81.5 | 87.1 |
| UnTrack | 61.0 | 72.1 | 82.0 | 86.9 |
| SDSTrack | 61.4 | 72.8 | 81.2 | 88.3 |
| XTrack-B | 61.5 | 74.0 | 82.1 | 88.8 |
| XTrack-L | 64.8 | 74.0 | 82.8 | 88.9 |
RGB-Thermal Tracking (LasHeR/RGBT234):
| Method | LasHeR Pr | LasHeR Sr | RGBT234 MPR | RGBT234 MSR |
|---|---|---|---|---|
| ViPT | 65.1 | 52.5 | 83.5 | 61.7 |
| SDSTrack | 66.5 | 53.1 | 84.8 | 62.5 |
| OneTracker | 67.2 | 53.8 | 85.7 | 64.2 |
| XTrack-B | 69.1 | 55.7 | 87.4 | 64.9 |
| XTrack-L | 73.1 | 58.7 | 87.8 | 65.4 |
RGB-Event Tracking (VisEvent):
| Method | Pr | Sr |
|---|---|---|
| ViPT | 75.8 | 59.2 |
| OneTracker | 76.7 | 60.8 |
| SDSTrack | 76.7 | 59.7 |
| XTrack-B | 77.5 | 60.9 |
| XTrack-L | 80.5 | 63.3 |
### Ablation Study
Component Analysis:
| Shared Expert | Modal Expert | DepthTrack F-score | LasHeR Pr | VisEvent Pr |
|---|---|---|---|---|
| ✓ | - | 59.1 | 67.8 | 76.5 |
| - | ✓ | 57.6 | 68.0 | 76.2 |
| ✓ | ✓ | 61.5 | 69.1 | 77.5 |
Gains from Multimodal Joint Training:
| Training Modalities | VisEvent Pr | DepthTrack F-score | LasHeR Pr |
|---|---|---|---|
| Baseline (w/o MeME) | 69.5 | 52.9 | 51.5 |
| Event only | 76.3 | 45.5 | 58.1 |
| Event + Depth | 76.6 | 60.8 | 58.1 |
| E + D + T | 77.5 | 61.5 | 69.1 |
### Key Findings
- VOT-RGBD22 is not used for training, so a domain gap exists relative to the training data, yet XTrack's margin over the prior SOTA is larger there, indicating that multimodal training enhances domain generalization.
- Training with Event data alone yields reasonable zero-shot generalization to the Thermal modality (58.1 LasHeR Pr), since both event and thermal cameras cope well with illumination variation.
- A soft classification probability of 80% yields the best performance; both rigid separation and random assignment perform worse.
- Two experts per modality is optimal; one is insufficiently expressive, while three introduce internal conflicts.
## Highlights & Insights
- The core idea that "confusion is opportunity" is novel: classifier failure is not a flaw but a signal for cross-modal knowledge sharing.
- This work is the first to systematically achieve cross-modal knowledge transfer in RGB-X video object tracking.
- The Laplacian prior in EdgeMix provides a well-motivated inductive bias for shared experts.
- The experimental design is rigorous, with modalities added incrementally to clearly demonstrate each modality's contribution.
## Limitations & Future Work
- Training still relies on paired RGB-X data and cannot exploit unpaired single-modality data.
- Low-dimensional projection reduces computational cost but may result in some information loss.
- Validation on additional sensor modalities (e.g., LiDAR, SAR) has not been performed.
- At inference, modality-specific experts must still be selected, so truly modality-agnostic inference has not been achieved.
## Related Work & Insights
- The soft-routing MoE design strategy (balancing classification loss and load-balancing loss) has general applicability to other multimodal tasks.
- The theoretical framing of "modality confusion" as a knowledge transfer signal can be extended to other domains of multimodal fusion.
- Laplacian-initialized edge-sharing priors represent an initialization strategy worth exploring in other vision tasks.
## Rating
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Overall | ⭐⭐⭐⭐⭐ |