SPAMming Labels: Efficient Annotations for the Trackers of Tomorrow
Conference: ECCV 2024
arXiv: 2404.11426
Code: https://research.nvidia.com/labs/dvl/projects/spam
Area: Video Understanding (Multiple Object Tracking / Efficient Annotation)
Keywords: Multiple Object Tracking, Video Annotation Engine, Pseudo-labels, Active Learning, Graph Neural Networks
TL;DR
Proposes SPAM, a video annotation engine that combines synthetic-data pre-training, pseudo-label self-training, and graph-hierarchical active learning to generate Multiple Object Tracking (MOT) annotations close to ground-truth (GT) quality with only 3-20% of the manual annotation effort.
Background & Motivation
Multiple Object Tracking (MOT) is a core task in video understanding, but high-quality trajectory annotation is extremely expensive:
- High Annotation Cost: Each frame requires detection, localization (a bounding box takes two clicks), and cross-frame identity association (one click), which is highly time-intensive.
- Small Scale of Existing Datasets: MOT17 contains only 14 video sequences and MOT20 only 8, far fewer than the large-scale datasets in the image domain.
- Limitations of Prior Work:
  - Most methods ignore dense temporal dependencies in videos (e.g., only selecting keyframes for annotation).
  - Others are limited to single-object scenarios.
  - No unified solution handles both detection and association annotations together.

Key Insight:
1. Association in most tracking scenarios is "easy": pre-trained models can generate high-quality pseudo-labels at no manual annotation cost.
2. Trajectory annotations exhibit spatiotemporal dependencies: annotating a single trajectory also constrains neighboring trajectories, which suggests annotation should be trajectory-centric rather than frame-centric.
Method
Overall Architecture
SPAM = Synthetic pre-training + Pseudo-labeling + Active learning + graph-based Model
Pipeline:
1. Pre-train the detector, ReID network, and GNN hierarchy on synthetic data (MOTSynth).
2. Generate pseudo-labels on the target real-world dataset using the pre-trained model, followed by self-training fine-tuning.
3. Annotate the majority of the data using pseudo-labels from the updated model; use active learning to select hard and uncertain samples for manual annotation.
4. Output the final high-quality annotations to train downstream trackers.
Key Designs
- Hierarchical Graph Model (Hierarchical GNN + GNN_node):
  - Hierarchical Graph Neural Network based on SUSHI: divides videos into subsequences to construct subgraphs, merging short tracklets into long trajectories layer by layer.
  - Nodes = detection candidates, Edges = association hypotheses.
  - Novelty: a newly introduced GNN_node layer for detection filtering:
    - Uses a detector with a low confidence threshold to obtain an over-complete set of candidates (high recall → many false positives).
    - GNN_node uses spatiotemporal consistency to classify nodes on the graph as valid or invalid detections.
    - Experiments show that adding low-confidence bounding boxes without GNN_node degrades performance (MOTA falls from 64.4 to 60.6), whereas adding GNN_node raises it to 65.4.
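The detection-filtering idea above can be illustrated with a toy sketch. It is not the authors' implementation: the `Candidate` type, both thresholds, and the `node_valid_prob` field (a stand-in for the GNN_node output) are assumptions made purely for this example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    frame: int
    det_conf: float         # detector confidence
    node_valid_prob: float  # hypothetical stand-in for the GNN_node output


def high_conf_only(cands, det_thresh=0.5):
    """Baseline: keep only detections the detector is confident about."""
    return [c for c in cands if c.det_conf >= det_thresh]


def low_thresh_with_node_filter(cands, low_thresh=0.1, node_thresh=0.5):
    """Keep an over-complete candidate set, then let the node classifier
    reject candidates it considers invalid."""
    over_complete = [c for c in cands if c.det_conf >= low_thresh]
    return [c for c in over_complete if c.node_valid_prob >= node_thresh]


candidates = [
    Candidate(frame=0, det_conf=0.90, node_valid_prob=0.95),  # confident detection
    Candidate(frame=1, det_conf=0.30, node_valid_prob=0.85),  # weak but consistent
    Candidate(frame=1, det_conf=0.25, node_valid_prob=0.10),  # weak and inconsistent
    Candidate(frame=2, det_conf=0.05, node_valid_prob=0.60),  # below the low threshold
]

baseline = high_conf_only(candidates)
filtered = low_thresh_with_node_filter(candidates)
print(len(baseline), len(filtered))
```

In this toy data the second candidate is dropped by the high-confidence baseline but kept once the node classifier vouches for it, while the third is admitted by the low threshold and then rejected.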
- Synthetic Pre-training + Domain Gap Analysis:
  - Analyzes the synthetic-to-real domain gap of the three major tracking components (detection, association, ReID).
  - Conclusion: Detection is most affected by the domain gap (a gap of 9.9 HOTA points), association has a moderate gap (2.1 HOTA points), and ReID is almost unaffected.
  - Consequently, the annotation effort should focus on detection and association, while ReID can directly use models trained on synthetic data.
- Uncertainty-Based Active Learning (Graph-Hierarchical Annotation):
  - For each node \(v\), compute the uncertainty: \(\text{uncert}(v) = \max_{u \in N_v} H(\hat{y}_{(v,u)})\), where \(N_v\) denotes the neighbors of \(v\) in the graph and \(\hat{y}_{(v,u)}\) is the model's prediction for edge \((v,u)\).
  - \(H\) is the binary entropy of the edge prediction, which is largest when the prediction is close to 0.5.
  - Nodes with high uncertainty are handed to humans for manual annotation, while the others keep the model's pseudo-labels.
  - Hierarchical Annotation: allocates the annotation budget \(B\) across the hierarchy levels as \(B_1, \dots, B_L\).
  - Deep nodes represent entire trajectories; annotating one of them resolves identity associations for many detections at once, so the budget is used efficiently.
  - Annotation action types: (i) accept/reject detection (1 click), (ii) refine box (2 clicks), (iii) cross-frame association (1 click).
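The uncertainty rule and budgeted selection above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: edges are treated as undirected, `edge_probs` is a hypothetical dictionary of predicted edge probabilities, ties are broken by node name, and `click_cost` simply applies the per-action click counts listed above.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli(p) prediction, in nats."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def node_uncertainty(edge_probs):
    """uncert(v) = max over incident edges (v, u) of H(y_hat_(v,u))."""
    uncert = {}
    for (v, u), p in edge_probs.items():
        h = binary_entropy(p)
        uncert[v] = max(uncert.get(v, 0.0), h)
        uncert[u] = max(uncert.get(u, 0.0), h)
    return uncert


def select_for_annotation(uncert, budget):
    """Send the `budget` most uncertain nodes to the annotator."""
    ranked = sorted(uncert, key=lambda n: (-uncert[n], n))
    return ranked[:budget]


def click_cost(n_accept_reject, n_refine, n_assoc):
    """Clicks per action: accept/reject = 1, refine box = 2, association = 1."""
    return 1 * n_accept_reject + 2 * n_refine + 1 * n_assoc


# Hypothetical edge predictions at one hierarchy level.
edge_probs = {("a", "b"): 0.98, ("b", "c"): 0.55, ("c", "d"): 0.90}
uncert = node_uncertainty(edge_probs)
to_annotate = select_for_annotation(uncert, budget=2)
print(to_annotate, click_cost(n_accept_reject=3, n_refine=1, n_assoc=2))
```

To mimic the hierarchical allocation, `select_for_annotation` would be called once per level with that level's share \(B_l\) of the budget; how the shares are chosen is not specified here and is left open in the sketch.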
Loss & Training
- The GNN model is trained end-to-end with edge classification and node classification.
- Synthetic pre-training → pseudo-label self-training (zero manual annotation cost) → active learning to annotate hard samples.
- Pseudo-label self-training yields an improvement of 3.8 to 6.5 HOTA points, depending on the dataset, at zero manual annotation cost.
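One plausible form of a joint objective over edge and node classification is sketched below. The summary only states that both are trained end-to-end; the use of binary cross-entropy for each term, the `node_weight` factor, and all function names are assumptions for illustration.

```python
import math


def bce(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def joint_loss(edge_probs, edge_labels, node_probs, node_labels, node_weight=1.0):
    """Edge-classification loss plus a weighted node-classification loss
    (the weighting is an assumption of this sketch)."""
    return bce(edge_probs, edge_labels) + node_weight * bce(node_probs, node_labels)


# Two edges (one active, one inactive) and one valid detection node.
loss = joint_loss([0.9, 0.1], [1, 0], [0.8], [1])
print(round(loss, 4))
```

In the self-training stage the labels fed to such a loss would be the model's own pseudo-labels rather than human annotations.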
Key Experimental Results
Main Results
Test set results with SPAM configured as a tracker (compared with SOTA trackers):
| Method | MOT17 HOTA↑ | MOT17 IDF1↑ | MOT20 HOTA↑ | DanceTrack HOTA↑ |
|---|---|---|---|---|
| ByteTrack | 62.8 | 77.1 | 60.4 | 47.7 |
| GHOST | 62.8 | 77.1 | 61.2 | 56.7 |
| SUSHI | 66.5 | 83.1 | 64.3 | 63.3 |
| SPAM | 67.5 | 84.6 | 65.8 | 64.0 |
Downstream trackers trained with SPAM labels vs. GT labels (MOT17 validation set):
| Tracker | Label Source | Annotation Vol. | HOTA↑ | MOTA↑ |
|---|---|---|---|---|
| ByteTrack | GT | 100% | 52.6 | 60.4 |
| ByteTrack | SPAM | 3.3% | 52.5 | 61.8 |
| GHOST | GT | 100% | 49.5 | 58.0 |
| GHOST | SPAM | 3.3% | 51.3 | 61.9 |
With only 3.3% of the manual annotation effort, trackers trained on SPAM labels match or exceed those trained on GT labels: ByteTrack is within 0.1 HOTA (52.5 vs. 52.6), and GHOST gains 1.8 HOTA (51.3 vs. 49.5).
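The comparison in the table can be checked with a few lines of arithmetic. The numbers are copied from the rows above; the variable and function names are just for this example.

```python
# (HOTA, MOTA) on the MOT17 validation set, copied from the table above.
results = {
    "ByteTrack": {"GT": (52.6, 60.4), "SPAM": (52.5, 61.8)},
    "GHOST": {"GT": (49.5, 58.0), "SPAM": (51.3, 61.9)},
}


def delta(tracker):
    """SPAM minus GT for (HOTA, MOTA), rounded to one decimal."""
    gt = results[tracker]["GT"]
    spam = results[tracker]["SPAM"]
    return tuple(round(s - g, 1) for s, g in zip(spam, gt))


# 3.3% annotation volume corresponds to roughly a 30x reduction in effort.
reduction_factor = round(100 / 3.3, 1)

for name in results:
    print(name, delta(name))
print(reduction_factor)
```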
Ablation Study
| Configuration | HOTA↑ | MOTA↑ | IDF1↑ | Description |
|---|---|---|---|---|
| High-confidence boxes only (w/o GNN_node) | 59.9 | 64.4 | 74.7 | Baseline |
| Low-confidence boxes added (w/o GNN_node) | 58.5 | 60.6 | 71.4 | Increased false positives, performance drop |
| Low-confidence boxes added + GNN_node | 60.4 | 65.4 | 75.1 | GNN_node effectively filters false positives |
Effect of pseudo-label self-training (SPAM model itself, without manual annotation):
| Dataset | HOTA w/o Pseudo-labels | HOTA w/ Pseudo-labels | Gain |
|---|---|---|---|
| MOT17 | 60.0 | 63.8 | +3.8 |
| MOT20 | 52.2 | 58.7 | +6.5 |
| DanceTrack | 41.8 | 48.1 | +6.3 |
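The Gain column, together with the relative improvement it implies, can be recomputed from the two HOTA columns. The values are taken from the table above; the relative-gain figure is a derived quantity that the table itself does not report.

```python
# (HOTA without pseudo-labels, HOTA with pseudo-labels), from the table above.
hota = {
    "MOT17": (60.0, 63.8),
    "MOT20": (52.2, 58.7),
    "DanceTrack": (41.8, 48.1),
}

gains = {name: round(after - before, 1) for name, (before, after) in hota.items()}
relative = {
    name: round(100.0 * (after - before) / before, 1)
    for name, (before, after) in hota.items()
}

for name in hota:
    print(f"{name}: +{gains[name]} HOTA ({relative[name]}% relative)")
```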
Key Findings
- Synthetic pre-training suffices for most simple scenarios: ReID only needs training on synthetic data, while detection and association require fine-tuning on real data.
- Striking effects of pseudo-label self-training: Without any manual annotation, simply generating pseudo-labels from the synthetically pre-trained model and performing self-training improves performance by 3.8 to 6.5 HOTA points across MOT17, MOT20, and DanceTrack.
- Graph-hierarchical active learning significantly outperforms frame-level annotation: Comparative experiments show that uncertainty sampling at the node level is much better than image-level sampling.
- Hierarchical annotation is more efficient: Deep nodes represent long trajectories, and a single annotation resolves multiple points of uncertainty.
Highlights & Insights
- The core concept of SPAM is highly practical: about 3% of the annotation effort yields roughly the same downstream training performance as full GT labels, which matters for constructing large-scale tracking datasets.
- A unified graph framework for both detection and association annotations: GNN_node + edge classification jointly handle two types of annotation challenges within a single graph structure.
- Domain gap analysis provides guidance on annotation priorities: Detection > Association > ReID. This conclusion has direct guiding value for data collection in the tracking community.
- The self-training loop (synthetic pre-training → pseudo-labeling → re-training) establishes a robust baseline requiring no manual annotation.
Limitations & Future Work
- The annotator itself does not generate new detections — if the detector misses targets entirely, it can only make up for it via low confidence thresholds, which does not guarantee full recovery.
- For extremely crowded scenarios (e.g., MOT20), false positive filtering by GNN_node might be insufficient.
- Re-training iterations after annotations have not been explored — can multi-round self-training continue to deliver improvements?
- Currently, only ByteTrack and GHOST have been validated as downstream trackers; evaluation with more trackers would be more convincing.
Related Work & Insights
- SUSHI is the direct predecessor of the GNN hierarchical architecture in this work; SPAM builds on it by incorporating GNN_node and the annotation engine.
- Shares a common underlying philosophy with efficient annotation methods in the image domain (such as DINO self-training) but is the first to systematically apply it to video tracking.
- Highly inspiring for building future video understanding datasets: shifting from "annotating every frame" to "selectively annotating hard examples."
Rating
- Novelty: ⭐⭐⭐⭐ (The system integration scheme is novel; individual components represent clever combinations of existing technologies.)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Three datasets MOT17/20/DanceTrack + complete ablation + domain gap analysis + downstream validation)
- Writing Quality: ⭐⭐⭐⭐ (The system description is clear, and the experiments are reasonably organized.)
- Value: ⭐⭐⭐⭐⭐ (Highly practical value for expanding tracking datasets; the finding regarding the 3% annotation effort is extremely appealing.)