PlugTrack: Multi-Perceptive Motion Analysis for Adaptive Fusion in Multi-Object Tracking¶
Conference: AAAI 2026 arXiv: 2511.13105 Code: https://github.com/VisualScienceLab-KHU/PlugTrack Area: Video Understanding Keywords: Multi-Object Tracking, Kalman Filter, Adaptive Fusion, Motion Prediction, Plug-and-Play
TL;DR¶
This paper proposes PlugTrack, a framework that achieves, for the first time, adaptive fusion of Kalman filters and data-driven motion predictors via a Context Motion Encoder (CME) and an Adaptive Blending factor Generator (ABG), yielding significant improvements under both linear and nonlinear motion scenarios.
Background & Motivation¶
State of the Field¶
The dominant paradigm in Multi-Object Tracking (MOT) is tracking-by-detection, whose core pipeline consists of: detection → motion prediction → association matching. The motion predictor is the critical component for maintaining target identities.
Limitations of Prior Work¶
Linear assumption of the Kalman filter: As the standard motion predictor, the Kalman filter is computationally efficient but assumes linear motion, leading to poor performance on datasets with nonlinear motion such as DanceTrack.
Limitations of data-driven predictors: Methods such as DiffMOT (diffusion-model-based) and TrackSSM (state-space-model-based) can capture nonlinear dynamics but suffer from domain overfitting and high computational overhead.
False binary opposition: The community treats Kalman filters and data-driven methods as mutually exclusive choices, overlooking their complementarity.
Key Findings (Motivating Experiments)¶
The authors conduct a per-tracklet performance analysis of predictors on MOT17 and DanceTrack:
- MOT17 (predominantly linear motion): the Kalman filter outperforms data-driven predictors on 60.3% of tracklets.
- DanceTrack (predominantly nonlinear motion): the Kalman filter still wins on 34% of tracklets.
This striking finding reveals that linear motion patterns appear frequently even in datasets specifically designed for complex nonlinear motion. Real-world tracking scenarios inherently contain a mixture of linear and nonlinear motion patterns, motivating an adaptive unified framework.
Root Cause¶
The field lacks a plug-and-play adaptive fusion framework that dynamically decides—based on motion context—whether to trust the Kalman filter or the data-driven predictor, rather than treating them as mutually exclusive alternatives.
Method¶
Overall Architecture¶
PlugTrack consists of two core components:
- Context Motion Encoder (CME): analyzes motion patterns from multiple perceptual perspectives to produce multi-perceptive motion features.
- Adaptive Blending factor Generator (ABG): transforms the multi-perceptive features into adaptive blending factors that perform coordinate-wise weighted fusion of the two predictors' outputs.
Final prediction: \(\hat{B}_{ABG} = \tilde{\alpha} \odot \hat{B}_{KF} + (1 - \tilde{\alpha}) \odot \hat{B}_{DP}\)
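A minimal sketch of this fusion step (tensor shapes and the function name are illustrative, not taken from the released code):

```python
import torch

def blend_boxes(b_kf: torch.Tensor, b_dp: torch.Tensor,
                alpha: torch.Tensor) -> torch.Tensor:
    """Per-coordinate convex combination of the two motion predictions.

    b_kf, b_dp: (N, 4) boxes from the Kalman filter / data-driven predictor.
    alpha:      (N, 4) per-coordinate blending factors from ABG, in [0, 1].
    """
    return alpha * b_kf + (1.0 - alpha) * b_dp
```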
Key Designs¶
1. Context Motion Encoder (CME) — Multi-Perceptive Analysis¶
CME comprises three specialized modules that analyze motion characteristics from different perspectives:
(a) Motion Pattern Module (MPM): encodes the temporal motion information of a tracklet using an LSTM, capturing complex motion patterns such as acceleration, deceleration, and directional changes.
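A minimal PyTorch sketch of such an LSTM encoder (the hidden size and class name are assumptions for illustration):

```python
import torch
import torch.nn as nn

class MotionPatternModule(nn.Module):
    """Illustrative MPM: encode a tracklet's recent box sequence with an LSTM."""
    def __init__(self, box_dim: int = 4, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(box_dim, hidden_dim, batch_first=True)

    def forward(self, tracklet: torch.Tensor) -> torch.Tensor:
        # tracklet: (N, T, 4) observed boxes over the last T frames (T = 5 in training)
        _, (h_n, _) = self.lstm(tracklet)
        return h_n[-1]  # (N, hidden_dim) temporal motion feature f_MPM
```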
(b) Prediction Discrepancy Module (PDM): quantifies the prediction discrepancy between the Kalman filter and the data-driven predictor. Large discrepancies typically indicate transitions from linear to nonlinear motion. The discrepancy vector is processed by an MLP.
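A corresponding sketch for PDM, assuming the discrepancy vector is simply the per-coordinate difference between the two predicted boxes (the exact parameterization is a guess):

```python
import torch
import torch.nn as nn

class PredictionDiscrepancyModule(nn.Module):
    """Illustrative PDM: embed the gap between the two predictors' boxes."""
    def __init__(self, feat_dim: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, b_kf: torch.Tensor, b_dp: torch.Tensor) -> torch.Tensor:
        # A large |b_kf - b_dp| hints at a transition toward nonlinear motion.
        return self.mlp(b_kf - b_dp)  # (N, feat_dim) discrepancy feature f_PDM
```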
(c) Uncertainty Quantification Module (UQM): exploits the Normalized Innovation Squared (NIS) of the Kalman filter to quantify its prediction confidence. A high NIS value indicates low confidence in the Kalman filter's prediction, implying nonlinear motion.
The mean and standard deviation of NIS are aggregated via a sliding window to obtain a 4-dimensional uncertainty vector \(\sigma_{KF} \in \mathbb{R}^4\), from which features \(\mathbf{f}_{UQM} \in \mathbb{R}^{32}\) are extracted by an MLP.
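For reference, NIS is the innovation weighted by the inverse innovation covariance; below is a sketch of UQM under that standard definition (the windowing details and layer sizes beyond the \(\mathbb{R}^4 \to \mathbb{R}^{32}\) mapping are assumptions):

```python
import torch
import torch.nn as nn

def normalized_innovation_squared(innovation: torch.Tensor,
                                  s_inv: torch.Tensor) -> torch.Tensor:
    """Standard NIS: e^T S^{-1} e, with e = z - H x_pred and S = H P H^T + R."""
    return torch.einsum('ni,nij,nj->n', innovation, s_inv, innovation)

class UncertaintyQuantificationModule(nn.Module):
    """Illustrative UQM: MLP over sliding-window NIS statistics."""
    def __init__(self, in_dim: int = 4, feat_dim: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, sigma_kf: torch.Tensor) -> torch.Tensor:
        # sigma_kf: (N, 4) sliding-window NIS mean/std statistics
        return self.mlp(sigma_kf)  # (N, 32) uncertainty feature f_UQM
```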
The outputs of the three modules are concatenated and encoded into multi-perceptive motion features: \(\mathbf{f}_{mult} = \text{Encoder}(\text{Concat}(\mathbf{f}_{MPM}, \mathbf{f}_{PDM}, \mathbf{f}_{UQM}))\)
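Putting the three feature streams together (all dimensions here are illustrative except the 32-d \(\mathbf{f}_{UQM}\)):

```python
import torch
import torch.nn as nn

class ContextMotionEncoder(nn.Module):
    """Illustrative CME head: fuse MPM/PDM/UQM features into f_mult."""
    def __init__(self, in_dim: int = 64 + 32 + 32, out_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, f_mpm, f_pdm, f_uqm):
        # Concatenate the multi-perceptive features and encode them jointly.
        return self.encoder(torch.cat([f_mpm, f_pdm, f_uqm], dim=-1))
```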
Design Motivation: A single module cannot comprehensively understand motion context. MPM provides temporal patterns, PDM reveals the degree of agreement or disagreement between the two predictors, and UQM supplies a self-reliability assessment of the Kalman filter. The three modules work in concert to achieve "multi-perceptive" understanding.
2. Adaptive Blending factor Generator (ABG) — Coordinate-Level Fusion¶
ABG maps \(\mathbf{f}_{mult}\) to a 4-dimensional blending factor \(\tilde{\alpha} = (\alpha_x, \alpha_y, \alpha_w, \alpha_h) \in [0,1]^4\).
Coordinate-level adaptation example: When horizontal linear motion occurs (MPM detects a stable horizontal pattern, UQM reports low uncertainty), ABG assigns a high weight to the Kalman filter for the x-coordinate (\(\alpha_x > 0.5\)); when vertical nonlinear motion occurs (PDM reports large prediction discrepancy), ABG relies on the data-driven predictor for the y-coordinate (\(\alpha_y < 0.5\)).
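A minimal ABG sketch; the sigmoid head is an assumption, and rescaling into the paper's best-performing \([0.3, 0.7]\) range is shown as one plausible way to impose that constraint:

```python
import torch
import torch.nn as nn

class AdaptiveBlendingFactorGenerator(nn.Module):
    """Illustrative ABG: map f_mult to per-coordinate blending factors."""
    def __init__(self, in_dim: int = 64, lo: float = 0.3, hi: float = 0.7):
        super().__init__()
        self.head = nn.Linear(in_dim, 4)
        self.lo, self.hi = lo, hi

    def forward(self, f_mult: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.head(f_mult))       # (N, 4) in [0, 1]
        return self.lo + (self.hi - self.lo) * alpha   # rescaled to [0.3, 0.7]
```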
3. Monte Carlo Alpha Search (MCAS) — Training Supervision Signal Generation¶
Problem addressed: Directly training ABG tends to converge to dataset-specific biases (e.g., always assigning high weight to the Kalman filter on MOT17) rather than learning an adaptive strategy.
Core method: A discrete search space \(\mathcal{A} = \{0.3, 0.4, 0.5, 0.6, 0.7\}^4\) (625 candidate combinations in total) is defined, and Gaussian noise is added in each training batch for exploration.
Each candidate combination is evaluated by prediction accuracy (SmoothL1 + GIoU), and the optimal combination \(\alpha^*\) is selected as the pseudo ground truth for ABG.
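A sketch of the candidate evaluation, assuming per-box supervision, \((x_1, y_1, x_2, y_2)\) box format, and torchvision's GIoU loss; the loss weighting and noise scale are illustrative, not the paper's values:

```python
import itertools
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def mcas_pseudo_label(b_kf, b_dp, b_gt, noise_std=0.02, giou_weight=1.0):
    """Illustrative MCAS: pick the blending factor whose fused box best matches GT.

    b_kf, b_dp, b_gt: (4,) boxes in (x1, y1, x2, y2) format for one tracklet/frame.
    """
    grid = [0.3, 0.4, 0.5, 0.6, 0.7]
    best_alpha, best_cost = None, float('inf')
    for combo in itertools.product(grid, repeat=4):                # 5^4 = 625 candidates
        alpha = torch.tensor(combo) + noise_std * torch.randn(4)   # exploration noise
        alpha = alpha.clamp(0.0, 1.0)
        pred = alpha * b_kf + (1.0 - alpha) * b_dp
        cost = (F.smooth_l1_loss(pred, b_gt) +
                giou_weight * generalized_box_iou_loss(pred[None], b_gt[None]).mean())
        if cost < best_cost:
            best_alpha, best_cost = alpha, cost
    return best_alpha  # pseudo ground truth alpha* for training ABG
```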
MCAS is not used at inference time; ABG directly predicts the optimal blending factor, preserving real-time efficiency.
Loss & Training¶
Training uses the Adam optimizer with a learning rate of 0.001, batch size of 2048, and fixed-length tracklets of 5 frames. The model is trained for 220 epochs on DanceTrack and 270 epochs on MIX (MOT17 & MOT20).
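The reported hyperparameters as a configuration sketch (the model below is a placeholder standing in for CME + ABG, not the released implementation):

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the CME + ABG stack.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

BATCH_SIZE = 2048      # tracklets per training batch
TRACKLET_LEN = 5       # fixed-length tracklet history (frames)
EPOCHS = 220           # 220 on DanceTrack, 270 on MIX (MOT17 & MOT20)
```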
Key Experimental Results¶
Main Results (DanceTrack Test Set — Nonlinear Motion)¶
| Method | Type | HOTA | IDF1 | AssA | DetA | MOTA |
|---|---|---|---|---|---|---|
| OC-SORT | KF-based | 55.1 | 54.2 | 38.0 | 80.3 | 89.4 |
| C-BIoU | KF-based | 60.6 | 61.6 | 45.4 | 81.3 | 91.6 |
| DiffMOT | Data-driven | 62.3 | 63.0 | 47.2 | 82.5 | 92.8 |
| TrackSSM | Data-driven | 57.7 | 57.5 | 41.0 | 81.5 | 92.2 |
| Ours (TrackSSM) | Fusion | 59.2 (+1.5) | 59.0 (+1.5) | 42.9 (+1.9) | 81.9 | 92.2 |
| Ours (DiffMOT) | Fusion | 63.3 (+1.0) | 64.1 (+1.1) | 48.4 (+1.2) | 82.5 | 92.4 |
Ablation Study (DanceTrack Validation Set)¶
| MPM | PDM | UQM | HOTA | AssA | IDF1 |
|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 59.2 | 44.5 | 59.7 |
| ✓ | ✗ | ✗ | 60.2 | 45.8 | 61.2 |
| ✓ | ✓ | ✗ | 60.4 | 46.0 | 61.8 |
| ✓ | ✗ | ✓ | 60.3 | 46.1 | 61.4 |
| ✓ | ✓ | ✓ | 60.8 | 46.6 | 61.7 |
Alpha range analysis: \([0.3, 0.7]\) achieves the best performance across both base predictors. An overly wide range (\([0.1, 0.9]\)) allows extreme values that completely ignore one predictor, while an overly narrow range (\([0.4, 0.6]\)) restricts adaptive capacity.
Key Findings¶
- Strong cross-domain generalization: HOTA improves by +6.5 when transferring from DanceTrack to MOT20, and still improves by +1.8 in the reverse direction.
- Minimal parameter overhead: only 0.54M additional parameters (a 22% increase over TrackSSM and 4.7% over DiffMOT), while FPS remains above the real-time threshold of 20 (34.2 and 24.7 FPS, respectively).
- Qualitative example of coordinate-level adaptation: at frame 485 of DanceTrack, \(\alpha_x=0.874\) (trusting the Kalman filter) and \(\alpha_y=0.413\) (trusting DiffMOT), because horizontal motion is linear while vertical motion is nonlinear.
Highlights & Insights¶
- Compelling core insight: the paper uses empirical evidence to demonstrate that even on DanceTrack, 34% of tracklets favor the Kalman filter, thereby refuting the assumption that "nonlinear datasets do not need Kalman filters."
- Plug-and-play design: existing motion predictors are not modified; the framework can directly augment any data-driven predictor.
- MCAS training strategy: elegantly resolves the bias collapse problem caused by directly optimizing blending factors.
- Coordinate-independent blending factors: different spatial dimensions can adopt different optimal fusion strategies—this is the first validation of such an approach in MOT.
Limitations & Future Work¶
- Currently fuses only two predictors (Kalman filter + one data-driven predictor); the framework could be extended to multi-way fusion of multiple motion paradigms (e.g., OC-SORT, Hybrid-SORT).
- The MCAS search space is discrete (625 candidates); continuous optimization may be more efficient.
- The LSTM used in CME has limited capacity; stronger temporal modeling may yield better motion understanding.
- Training requires inference results from both predictors to be available in the training data.
Related Work & Insights¶
- SORT / DeepSORT / ByteTrack: established the foundational tracking-by-detection + Kalman filter paradigm.
- DiffMOT: diffusion-model-based motion prediction; one of the base predictors in this work.
- TrackSSM: state-space-model-based motion prediction; the other base predictor.
- Monte Carlo methods: MCAS is inspired by the successful application of Monte Carlo search in 3D scene understanding.
Rating¶
- Novelty: ⭐⭐⭐⭐ — the "bridging classical and modern" approach is both novel and practical.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — three datasets + cross-domain experiments + efficiency analysis + detailed ablation + qualitative analysis.
- Writing Quality: ⭐⭐⭐⭐ — motivation is well-argued and supported by convincing data.
- Value: ⭐⭐⭐⭐ — the plug-and-play framework has direct engineering deployment value.