PlugTrack: Multi-Perceptive Motion Analysis for Adaptive Fusion in Multi-Object Tracking¶
Conference: AAAI 2026 arXiv: 2511.13105 Code: https://github.com/VisualScienceLab-KHU/PlugTrack Area: Video Understanding Keywords: Multi-Object Tracking, Kalman Filter, Adaptive Fusion, Motion Prediction, Plug-and-Play
TL;DR¶
This paper proposes PlugTrack, a framework that achieves, for the first time, adaptive fusion of Kalman filters and data-driven motion predictors via a Context Motion Encoder (CME) and an Adaptive Blending factor Generator (ABG), yielding significant improvements under both linear and nonlinear motion scenarios.
Background & Motivation¶
State of the Field¶
The dominant paradigm in Multi-Object Tracking (MOT) is tracking-by-detection, whose core pipeline consists of: detection → motion prediction → association matching. The motion predictor is the critical component for maintaining target identities.
Limitations of Prior Work¶
Linear assumption of the Kalman filter: As the standard motion predictor, the Kalman filter is computationally efficient but assumes linear motion, leading to poor performance on datasets with nonlinear motion such as DanceTrack.
Limitations of data-driven predictors: Methods such as DiffMOT (diffusion-model-based) and TrackSSM (state-space-model-based) can capture nonlinear dynamics but suffer from domain overfitting and high computational overhead.
False binary opposition: The community treats Kalman filters and data-driven methods as mutually exclusive choices, overlooking their complementarity.
Key Findings (Motivating Experiments)¶
The authors conduct a per-tracklet performance analysis of predictors on MOT17 and DanceTrack:
- MOT17 (predominantly linear motion): the Kalman filter outperforms data-driven predictors on 60.3% of tracklets.
- DanceTrack (predominantly nonlinear motion): the Kalman filter still wins on 34% of tracklets.
This striking finding reveals that linear motion patterns appear frequently even in datasets specifically designed for complex nonlinear motion. Real-world tracking scenarios inherently contain a mixture of linear and nonlinear motion patterns, motivating an adaptive unified framework.
Root Cause¶
The field lacks a plug-and-play adaptive fusion framework that dynamically decides—based on motion context—whether to trust the Kalman filter or the data-driven predictor, rather than treating them as mutually exclusive alternatives.
Method¶
Overall Architecture¶
PlugTrack consists of two core components:
- Context Motion Encoder (CME): analyzes motion patterns from multiple perceptual perspectives to produce multi-perceptive motion features.
- Adaptive Blending factor Generator (ABG): transforms the multi-perceptive features into adaptive blending factors that perform coordinate-wise weighted fusion of the two predictors' outputs.
Final prediction: \(\hat{B}_{ABG} = \tilde{\alpha} \odot \hat{B}_{KF} + (1 - \tilde{\alpha}) \odot \hat{B}_{DP}\)
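A minimal sketch of this fusion step (tensor shapes and the function name are illustrative, not taken from the released code):

```python
import torch

def blend_boxes(b_kf: torch.Tensor, b_dp: torch.Tensor,
                alpha: torch.Tensor) -> torch.Tensor:
    """Per-coordinate convex combination of the two motion predictions.

    b_kf, b_dp: (N, 4) boxes from the Kalman filter / data-driven predictor.
    alpha:      (N, 4) per-coordinate blending factors from ABG, in [0, 1].
    """
    return alpha * b_kf + (1.0 - alpha) * b_dp
```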
Key Designs¶
1. Context Motion Encoder (CME) — Multi-Perceptive Analysis¶
CME comprises three specialized modules that analyze motion characteristics from different perspectives:
(a) Motion Pattern Module (MPM): encodes the temporal motion information of a tracklet using an LSTM, capturing complex motion patterns such as acceleration, deceleration, and directional changes.
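A minimal PyTorch sketch of such an LSTM encoder (the hidden size and class name are assumptions for illustration):

```python
import torch
import torch.nn as nn

class MotionPatternModule(nn.Module):
    """Illustrative MPM: encode a tracklet's recent box sequence with an LSTM."""
    def __init__(self, box_dim: int = 4, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(box_dim, hidden_dim, batch_first=True)

    def forward(self, tracklet: torch.Tensor) -> torch.Tensor:
        # tracklet: (N, T, 4) observed boxes over the last T frames (T = 5 in training)
        _, (h_n, _) = self.lstm(tracklet)
        return h_n[-1]  # (N, hidden_dim) temporal motion feature f_MPM
```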
(b) Prediction Discrepancy Module (PDM): quantifies the prediction discrepancy between the Kalman filter and the data-driven predictor. Large discrepancies typically indicate transitions from linear to nonlinear motion. The discrepancy vector is processed by an MLP.
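A corresponding sketch for PDM, assuming the discrepancy vector is simply the per-coordinate difference between the two predicted boxes (the exact parameterization is a guess):

```python
import torch
import torch.nn as nn

class PredictionDiscrepancyModule(nn.Module):
    """Illustrative PDM: embed the gap between the two predictors' boxes."""
    def __init__(self, feat_dim: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, b_kf: torch.Tensor, b_dp: torch.Tensor) -> torch.Tensor:
        # A large |b_kf - b_dp| hints at a transition toward nonlinear motion.
        return self.mlp(b_kf - b_dp)  # (N, feat_dim) discrepancy feature f_PDM
```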
(c) Uncertainty Quantification Module (UQM): exploits the Normalized Innovation Squared (NIS) of the Kalman filter to quantify its prediction confidence. A high NIS value indicates low confidence in the Kalman filter's prediction, implying nonlinear motion.
The mean and standard deviation of NIS are aggregated via a sliding window to obtain a 4-dimensional uncertainty vector \(\sigma_{KF} \in \mathbb{R}^4\), from which features \(\mathbf{f}_{UQM} \in \mathbb{R}^{32}\) are extracted by an MLP.
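For reference, NIS is the innovation weighted by the inverse innovation covariance; below is a sketch of UQM under that standard definition (the windowing details and layer sizes beyond the \(\mathbb{R}^4 \to \mathbb{R}^{32}\) mapping are assumptions):

```python
import torch
import torch.nn as nn

def normalized_innovation_squared(innovation: torch.Tensor,
                                  s_inv: torch.Tensor) -> torch.Tensor:
    """Standard NIS: e^T S^{-1} e, with e = z - H x_pred and S = H P H^T + R."""
    return torch.einsum('ni,nij,nj->n', innovation, s_inv, innovation)

class UncertaintyQuantificationModule(nn.Module):
    """Illustrative UQM: MLP over sliding-window NIS statistics."""
    def __init__(self, in_dim: int = 4, feat_dim: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, sigma_kf: torch.Tensor) -> torch.Tensor:
        # sigma_kf: (N, 4) sliding-window NIS mean/std statistics
        return self.mlp(sigma_kf)  # (N, 32) uncertainty feature f_UQM
```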
The outputs of the three modules are concatenated and encoded into multi-perceptive motion features: \(\mathbf{f}_{mult} = \text{Encoder}(\text{Concat}(\mathbf{f}_{MPM}, \mathbf{f}_{PDM}, \mathbf{f}_{UQM}))\)
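Putting the three feature streams together (all dimensions here are illustrative except the 32-d \(\mathbf{f}_{UQM}\)):

```python
import torch
import torch.nn as nn

class ContextMotionEncoder(nn.Module):
    """Illustrative CME head: fuse MPM/PDM/UQM features into f_mult."""
    def __init__(self, in_dim: int = 64 + 32 + 32, out_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, f_mpm, f_pdm, f_uqm):
        # Concatenate the multi-perceptive features and encode them jointly.
        return self.encoder(torch.cat([f_mpm, f_pdm, f_uqm], dim=-1))
```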
Design Motivation: A single module cannot comprehensively understand motion context. MPM provides temporal patterns, PDM reveals the degree of agreement or disagreement between the two predictors, and UQM supplies a self-reliability assessment of the Kalman filter. The three modules work in concert to achieve "multi-perceptive" understanding.
2. Adaptive Blending factor Generator (ABG) — Coordinate-Level Fusion¶
ABG maps \(\mathbf{f}_{mult}\) to a 4-dimensional blending factor \(\tilde{\alpha} = (\alpha_x, \alpha_y, \alpha_w, \alpha_h) \in [0,1]^4\).
Coordinate-level adaptation example: When horizontal linear motion occurs (MPM detects a stable horizontal pattern, UQM reports low uncertainty), ABG assigns a high weight to the Kalman filter for the x-coordinate (\(\alpha_x > 0.5\)); when vertical nonlinear motion occurs (PDM reports large prediction discrepancy), ABG relies on the data-driven predictor for the y-coordinate (\(\alpha_y < 0.5\)).
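A minimal ABG sketch; the sigmoid head is an assumption, and rescaling into the paper's best-performing \([0.3, 0.7]\) range is shown as one plausible way to impose that constraint:

```python
import torch
import torch.nn as nn

class AdaptiveBlendingFactorGenerator(nn.Module):
    """Illustrative ABG: map f_mult to per-coordinate blending factors."""
    def __init__(self, in_dim: int = 64, lo: float = 0.3, hi: float = 0.7):
        super().__init__()
        self.head = nn.Linear(in_dim, 4)
        self.lo, self.hi = lo, hi

    def forward(self, f_mult: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.head(f_mult))       # (N, 4) in [0, 1]
        return self.lo + (self.hi - self.lo) * alpha   # rescaled to [0.3, 0.7]
```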
3. Monte Carlo Alpha Search (MCAS) — Training Supervision Signal Generation¶
Problem addressed: Directly training ABG tends to converge to dataset-specific biases (e.g., always assigning high weight to the Kalman filter on MOT17) rather than learning an adaptive strategy.
Core method: A discrete search space \(\mathcal{A} = \{0.3, 0.4, 0.5, 0.6, 0.7\}^4\) (625 candidate combinations in total) is defined, and Gaussian noise is added in each training batch for exploration.
Each candidate combination is evaluated by prediction accuracy (SmoothL1 + GIoU), and the optimal combination \(\alpha^*\) is selected as the pseudo ground truth for ABG.
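A sketch of the candidate evaluation, assuming per-box supervision, \((x_1, y_1, x_2, y_2)\) box format, and torchvision's GIoU loss; the loss weighting and noise scale are illustrative, not the paper's values:

```python
import itertools
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def mcas_pseudo_label(b_kf, b_dp, b_gt, noise_std=0.02, giou_weight=1.0):
    """Illustrative MCAS: pick the blending factor whose fused box best matches GT.

    b_kf, b_dp, b_gt: (4,) boxes in (x1, y1, x2, y2) format for one tracklet/frame.
    """
    grid = [0.3, 0.4, 0.5, 0.6, 0.7]
    best_alpha, best_cost = None, float('inf')
    for combo in itertools.product(grid, repeat=4):                # 5^4 = 625 candidates
        alpha = torch.tensor(combo) + noise_std * torch.randn(4)   # exploration noise
        alpha = alpha.clamp(0.0, 1.0)
        pred = alpha * b_kf + (1.0 - alpha) * b_dp
        cost = (F.smooth_l1_loss(pred, b_gt) +
                giou_weight * generalized_box_iou_loss(pred[None], b_gt[None]).mean())
        if cost < best_cost:
            best_alpha, best_cost = alpha, cost
    return best_alpha  # pseudo ground truth alpha* for training ABG
```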
MCAS is not used at inference time; ABG directly predicts the optimal blending factor, preserving real-time efficiency.
Loss & Training¶
Training uses the Adam optimizer with a learning rate of 0.001, batch size of 2048, and fixed-length tracklets of 5 frames. The model is trained for 220 epochs on DanceTrack and 270 epochs on MIX (MOT17 & MOT20).
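The reported hyperparameters as a configuration sketch (the model below is a placeholder standing in for CME + ABG, not the released implementation):

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the CME + ABG stack.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

BATCH_SIZE = 2048      # tracklets per training batch
TRACKLET_LEN = 5       # fixed-length tracklet history (frames)
EPOCHS = 220           # 220 on DanceTrack, 270 on MIX (MOT17 & MOT20)
```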
Key Experimental Results¶
Main Results (DanceTrack Test Set — Nonlinear Motion)¶
| Method | Type | HOTA | IDF1 | AssA | DetA | MOTA |
|---|---|---|---|---|---|---|
| OC-SORT | KF-based | 55.1 | 54.2 | 38.0 | 80.3 | 89.4 |
| C-BIoU | KF-based | 60.6 | 61.6 | 45.4 | 81.3 | 91.6 |
| DiffMOT | Data-driven | 62.3 | 63.0 | 47.2 | 82.5 | 92.8 |
| TrackSSM | Data-driven | 57.7 | 57.5 | 41.0 | 81.5 | 92.2 |
| Ours (TrackSSM) | Fusion | 59.2 (+1.5) | 59.0 (+1.5) | 42.9 (+1.9) | 81.9 | 92.2 |
| Ours (DiffMOT) | Fusion | 63.3 (+1.0) | 64.1 (+1.1) | 48.4 (+1.2) | 82.5 | 92.4 |
Ablation Study (DanceTrack Validation Set)¶
| MPM | PDM | UQM | HOTA | AssA | IDF1 |
|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 59.2 | 44.5 | 59.7 |
| ✓ | ✗ | ✗ | 60.2 | 45.8 | 61.2 |
| ✓ | ✓ | ✗ | 60.4 | 46.0 | 61.8 |
| ✓ | ✗ | ✓ | 60.3 | 46.1 | 61.4 |
| ✓ | ✓ | ✓ | 60.8 | 46.6 | 61.7 |
Alpha range analysis: \([0.3, 0.7]\) achieves the best performance across both base predictors. An overly wide range (\([0.1, 0.9]\)) allows extreme values that completely ignore one predictor, while an overly narrow range (\([0.4, 0.6]\)) restricts adaptive capacity.
Key Findings¶
- Strong cross-domain generalization: HOTA improves by +6.5 when transferring from DanceTrack to MOT20, and still improves by +1.8 in the reverse direction.
- Minimal parameter overhead: only 0.54M additional parameters (a 22% increase over TrackSSM and 4.7% over DiffMOT), while FPS remains above the real-time threshold of 20 (34.2 and 24.7 FPS, respectively).
- Qualitative example of coordinate-level adaptation: at frame 485 of DanceTrack, \(\alpha_x=0.874\) (trusting the Kalman filter) and \(\alpha_y=0.413\) (trusting DiffMOT), because horizontal motion is linear while vertical motion is nonlinear.
Highlights & Insights¶
- Compelling core insight: the paper uses empirical evidence to demonstrate that even on DanceTrack, 34% of tracklets favor the Kalman filter, thereby refuting the assumption that "nonlinear datasets do not need Kalman filters."
- Plug-and-play design: existing motion predictors are not modified; the framework can directly augment any data-driven predictor.
- MCAS training strategy: elegantly resolves the bias collapse problem caused by directly optimizing blending factors.
- Coordinate-independent blending factors: different spatial dimensions can adopt different optimal fusion strategies—this is the first validation of such an approach in MOT.
Limitations & Future Work¶
- Currently fuses only two predictors (Kalman filter + one data-driven predictor); the framework could be extended to multi-way fusion of multiple motion paradigms (e.g., OC-SORT, Hybrid-SORT).
- The MCAS search space is discrete (625 candidates); continuous optimization may be more efficient.
- The LSTM used in CME has limited capacity; stronger temporal modeling may yield better motion understanding.
- Training requires inference results from both predictors to be available in the training data.
Related Work & Insights¶
- SORT / DeepSORT / ByteTrack: established the foundational tracking-by-detection + Kalman filter paradigm.
- DiffMOT: diffusion-model-based motion prediction; one of the base predictors in this work.
- TrackSSM: state-space-model-based motion prediction; the other base predictor.
- Monte Carlo methods: MCAS is inspired by the successful application of Monte Carlo search in 3D scene understanding.
Rating¶
- Novelty: ⭐⭐⭐⭐ — the "bridging classical and modern" approach is both novel and practical.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — three datasets + cross-domain experiments + efficiency analysis + detailed ablation + qualitative analysis.
- Writing Quality: ⭐⭐⭐⭐ — motivation is well-argued and supported by convincing data.
- Value: ⭐⭐⭐⭐ — the plug-and-play framework has direct engineering deployment value.