# XTrack: Multimodal Training Boosts RGB-X Video Object Trackers
- Conference: ICCV 2025
- arXiv: 2405.17773
- Code: Available
- Area: Video Understanding
- Keywords: Multimodal Tracking, Mixture of Experts, Cross-modal Knowledge Transfer, Video Object Tracking, RGB-X
## TL;DR
This paper proposes XTrack, which uses a Mixture of Modal Experts (MeME) framework with a soft-routing classifier to share knowledge across RGB-D/T/E modalities, so that inference with a single RGB-X pair benefits from what was learned during multimodal training, yielding an average precision gain of about 3%.
## Background & Motivation
Multimodal perception (depth, thermal infrared, and event cameras) can compensate for the limitations of RGB-only tracking in extreme scenarios, but faces critical constraints:
Data Scarcity: No comprehensive dataset containing all modalities simultaneously exists; typically only paired RGB-X data is available.
Rigid Branch Design: Existing unified models (e.g., ViPT, UnTrack) activate predetermined branches at inference time based on the input modality, with no cross-modal interaction.
Cross-modal Knowledge Waste: Strict modality isolation prevents knowledge transfer between modality pairs; for instance, what the model learns from RGB-Depth sequences with fast motion and low light transcends single-modality boundaries and should also benefit the other RGB-X pairs.
Core Insight: Similar samples across different modalities share more transferable knowledge. When a "weak" classifier fails to accurately distinguish which modality a sample belongs to, it indicates that the sample resides at an optimal position for cross-modal knowledge sharing (i.e., minimal domain gap). This "confusion" is not a defect but rather a signal for knowledge transfer.
## Method
### Overall Architecture
XTrack inserts MeME modules after each attention block and FFN in a frozen RGB base tracker (OSTrack/SeqTrack). MeME processes RGB and X-modality tokens bidirectionally to enhance feature modeling:

\(T_{rgb}^{attn} = T_{rgb}^{l} + \mathrm{Attn}(T_{rgb}^{l}) + \mathrm{MeME}(T_{rgb}^{l}, T_x^{l})\)

\(T_{rgb}^{l+1} = T_{rgb}^{attn} + \mathrm{FFN}(T_{rgb}^{attn}) + \mathrm{MeME}(T_{rgb}^{attn}, T_x^{attn})\)
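For orientation, here is a minimal PyTorch sketch of how the two residual insertions above could be wired around one frozen base-tracker layer. The names (`MeMELayer`, `meme_attn`, `meme_ffn`) and the pass-through handling of the X branch are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn


class MeMELayer(nn.Module):
    """One layer of a frozen RGB base tracker with two MeME insertions (sketch).

    `attn`, `ffn`, and the `meme_*` modules are placeholders: any modules with
    the shown call signatures will do. Only the MeME modules stay trainable.
    """

    def __init__(self, attn: nn.Module, ffn: nn.Module,
                 meme_attn: nn.Module, meme_ffn: nn.Module):
        super().__init__()
        self.attn, self.ffn = attn, ffn
        self.meme_attn, self.meme_ffn = meme_attn, meme_ffn
        # Freeze the parts inherited from the RGB base tracker (OSTrack/SeqTrack).
        for module in (self.attn, self.ffn):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, t_rgb: torch.Tensor, t_x: torch.Tensor):
        # T_rgb^attn = T_rgb^l + Attn(T_rgb^l) + MeME(T_rgb^l, T_x^l)
        t_rgb_attn = t_rgb + self.attn(t_rgb) + self.meme_attn(t_rgb, t_x)
        # In the full model the X branch is updated analogously; here its tokens
        # are passed through unchanged for brevity.
        t_x_attn = t_x
        # T_rgb^{l+1} = T_rgb^attn + FFN(T_rgb^attn) + MeME(T_rgb^attn, T_x^attn)
        t_rgb_next = t_rgb_attn + self.ffn(t_rgb_attn) + self.meme_ffn(t_rgb_attn, t_x_attn)
        return t_rgb_next, t_x_attn
```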
### Key Designs
- Soft Router with Classification Loss: The routing function \(y = \sum_{i \in \mathrm{top}\text{-}k} p_i(T_x)\, \epsilon_i(T_x)\) is trained not only with the conventional expert load-balancing loss \(\mathcal{L}_{balance} = \mathcal{L}_{Imp} + \mathcal{L}_{Load}\), but also with a modal classification loss \(\mathcal{L}_{cls}\). The classification loss preserves a degree of modality specialization within each expert, while the soft (non-rigid) routing lets cross-modal samples access experts of other modalities; the optimal balance is reached at a classification probability of approximately 80% (see the routing sketch after this list).
- Modality-Specific Experts + Shared Experts:
    - Modality-Specific Experts: Each modality is assigned \(k\) experts (experiments show \(k=2\) is optimal), performing feature decomposition and reconstruction in a low-dimensional space (\(k \ll c\)).
    - Edge-Gated Shared Experts: The shared experts incorporate an EdgeMix gating module initialized with a Laplacian filter, using high-frequency edge information as a natural cross-modal common prior: \(Out = (\sigma(\mathrm{EdgeMix}(XW_1)) \cdot XW_2)W_3 + m_{s_k}\)
- Modal Prompting: The low-dimensional modal matrix \(M_k\) output by MeME serves as a gating signal that modulates the RGB tokens: \(Out = ((X_i W_5 \cdot \sigma(X_m W_6))W_7 + I_k)W_8\), rendering the RGB features modality-aware.
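To make the routing concrete, the sketch below shows soft top-k routing over modality-specific bottleneck experts, a weak modality classifier for \(\mathcal{L}_{cls}\), and a simple stand-in for the balance loss. The expert count, hidden width, pooling choice, and loss form are assumptions; the edge-gated shared expert and modal prompting are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftModalRouter(nn.Module):
    """Soft top-k routing over modality experts plus a weak modality classifier.

    A minimal sketch under assumed shapes and names, not the paper's exact design.
    """

    def __init__(self, dim: int, num_experts: int = 6,
                 num_modalities: int = 3, top_k: int = 2, hidden: int = 32):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_experts)           # produces p_i(T_x)
        self.modal_head = nn.Linear(dim, num_modalities)  # weak modality classifier
        # Each expert decomposes/reconstructs features through a low-dim bottleneck.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, t_x: torch.Tensor, modality_id: torch.Tensor = None):
        # t_x: (B, N, C) tokens of the auxiliary modality.
        pooled = t_x.mean(dim=1)                       # (B, C) summary for routing
        probs = self.gate(pooled).softmax(dim=-1)      # (B, E) soft routing weights
        top_p, top_i = probs.topk(self.top_k, dim=-1)  # keep only the top-k experts

        # y = sum_{i in top-k} p_i(T_x) * eps_i(T_x)
        out = torch.zeros_like(t_x)
        for rank in range(self.top_k):
            weight = top_p[:, rank].view(-1, 1, 1)
            expert_out = torch.stack(
                [self.experts[int(i)](t_x[b]) for b, i in enumerate(top_i[:, rank])]
            )
            out = out + weight * expert_out

        losses = {}
        if modality_id is not None:
            # L_cls keeps some modality specialization, while the soft routing
            # still lets confusable samples reach experts of other modalities.
            losses["cls"] = F.cross_entropy(self.modal_head(pooled), modality_id)
            # A simple importance-style balance term (stand-in for L_Imp + L_Load).
            importance = probs.sum(dim=0)
            losses["balance"] = importance.var() / (importance.mean() ** 2 + 1e-6)
        return out, losses
```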
### Loss & Training
- Tracking loss: Inherits the IoU + L1 losses from the RGB base tracker.
- MoE loss: \(\mathcal{L}_{moe} = \mathcal{L}_{cls} + \lambda \cdot \mathcal{L}_{balance}\)
- Only the MeME parameters are trained; the base tracker is frozen (a training-step sketch follows this list).
- Training settings: batch size 32, learning rate 4e-4 (decayed 10× after epoch 78), 90 epochs.
- Training data: DepthTrack (RGB-D) + LasHeR (RGB-T) + VisEvent (RGB-E); each training sample provides only one RGB-X pair, never all modalities at once.
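A hypothetical training-step sketch tying these points together is given below. The parameter-name filter, the loss weight `lam`, and the batch/model interface are assumptions; the learning rate and the frozen-backbone setup follow the settings listed above.

```python
import torch


def build_optimizer(model: torch.nn.Module, lr: float = 4e-4):
    """Optimize only the MeME parameters; the base tracker stays frozen.

    Assumes MeME submodules carry "meme" in their parameter names
    (an illustrative convention, not the authors' code).
    """
    trainable = [p for n, p in model.named_parameters()
                 if "meme" in n and p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)


def training_step(model, optimizer, batch, tracking_loss_fn, lam: float = 0.01):
    """One step: the base tracker's tracking loss plus the MoE auxiliary loss."""
    pred, moe_losses = model(batch["rgb"], batch["x"],
                             modality_id=batch["modality_id"])
    l_track = tracking_loss_fn(pred, batch["gt_box"])        # IoU + L1 terms
    l_moe = moe_losses["cls"] + lam * moe_losses["balance"]  # L_moe = L_cls + λ·L_balance
    loss = l_track + l_moe

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```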
## Key Experimental Results
### Main Results
RGB-Depth Tracking:
| Method | DepthTrack F-score | VOT-RGBD22 EAO | VOT-RGBD22 Acc. | VOT-RGBD22 Rob. |
|---|---|---|---|---|
| ViPT | 59.4 | 72.1 | 81.5 | 87.1 |
| UnTrack | 61.0 | 72.1 | 82.0 | 86.9 |
| SDSTrack | 61.4 | 72.8 | 81.2 | 88.3 |
| XTrack-B | 61.5 | 74.0 | 82.1 | 88.8 |
| XTrack-L | 64.8 | 74.0 | 82.8 | 88.9 |
RGB-Thermal Tracking (LasHeR/RGBT234):
| Method | LasHeR Pr | LasHeR Sr | RGBT234 MPR | RGBT234 MSR |
|---|---|---|---|---|
| ViPT | 65.1 | 52.5 | 83.5 | 61.7 |
| SDSTrack | 66.5 | 53.1 | 84.8 | 62.5 |
| OneTracker | 67.2 | 53.8 | 85.7 | 64.2 |
| XTrack-B | 69.1 | 55.7 | 87.4 | 64.9 |
| XTrack-L | 73.1 | 58.7 | 87.8 | 65.4 |
RGB-Event Tracking (VisEvent):
| Method | Pr | Sr |
|---|---|---|
| ViPT | 75.8 | 59.2 |
| OneTracker | 76.7 | 60.8 |
| SDSTrack | 76.7 | 59.7 |
| XTrack-B | 77.5 | 60.9 |
| XTrack-L | 80.5 | 63.3 |
### Ablation Study
Component Analysis:
| Shared Expert | Modal Expert | DepthTrack F-score | LasHeR Pr | VisEvent Pr |
|---|---|---|---|---|
| ✓ | - | 59.1 | 67.8 | 76.5 |
| - | ✓ | 57.6 | 68.0 | 76.2 |
| ✓ | ✓ | 61.5 | 69.1 | 77.5 |
Gains from Multimodal Joint Training:
| Training Modalities | VisEvent Pr | DepthTrack F-score | LasHeR Pr |
|---|---|---|---|
| Baseline (w/o MeME) | 69.5 | 52.9 | 51.5 |
| Event only | 76.3 | 45.5 | 58.1 |
| Event + Depth | 76.6 | 60.8 | 58.1 |
| E + D + T | 77.5 | 61.5 | 69.1 |
### Key Findings
- VOT-RGBD22 is not used for training, so a domain gap exists relative to the training data, yet XTrack's margin over the prior SOTA is larger there, indicating that multimodal training enhances domain generalization.
- Training with Event data alone yields reasonable zero-shot generalization to the Thermal modality (58.1 LasHeR Pr), since both event and thermal cameras cope well with illumination variation.
- A soft classification probability of 80% yields the best performance; both rigid separation and random assignment perform worse.
- Two experts per modality is optimal; one is insufficiently expressive, while three introduce internal conflicts.
## Highlights & Insights
- The core idea that "confusion is opportunity" is novel: classifier failure is not a flaw but a signal for cross-modal knowledge sharing.
- This work is the first to systematically achieve cross-modal knowledge transfer in RGB-X video object tracking.
- The Laplacian prior in EdgeMix provides a well-motivated inductive bias for shared experts.
- The experimental design is rigorous, with modalities added incrementally to clearly demonstrate each modality's contribution.
## Limitations & Future Work
- Training still relies on paired RGB-X data and cannot exploit unpaired single-modality data.
- Low-dimensional projection reduces computational cost but may result in some information loss.
- Validation on additional sensor modalities (e.g., LiDAR, SAR) has not been performed.
- At inference, modality-specific experts must still be selected, so truly modality-agnostic inference has not been achieved.
## Related Work & Insights
- The soft-routing MoE design strategy (balancing classification loss and load-balancing loss) has general applicability to other multimodal tasks.
- The theoretical framing of "modality confusion" as a knowledge transfer signal can be extended to other domains of multimodal fusion.
- Laplacian-initialized edge-sharing priors represent an initialization strategy worth exploring in other vision tasks.
## Rating
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Overall | ⭐⭐⭐⭐⭐ |