Real-World Point Tracking with Verifier-Guided Pseudo-Labeling¶
- Conference: CVPR 2026
- arXiv: 2603.12217
- Code: kuis-ai.github.io/track_on_r
- Area: Video Understanding / Point Tracking
- Keywords: point tracking, pseudo-labeling, verifier, multi-teacher ensemble, sim-to-real adaptation
TL;DR¶
This paper proposes a learnable Verifier meta-model, trained on synthetic data, that assesses the reliability of tracker predictions and transfers this capability to the real world. By scoring the per-frame predictions of six pretrained trackers and selecting the most reliable as pseudo-labels, the proposed Track-On-R model is fine-tuned on only ~5K real videos and achieves state-of-the-art performance across four real-world benchmarks.
Background & Motivation¶
Background: Long-range point tracking models (CoTracker, TAPIR, Track-On, etc.) are typically trained on synthetic data (TAP-Vid Kubric). Recent self-training methods (BootsTAPIR, CoTracker3) fine-tune on real videos using pseudo-labels to bridge the sim-to-real gap.
Limitations of Prior Work:
- Individual trackers exhibit highly variable reliability across different frames and scenes — each excels under different challenges such as fast motion, low texture, occlusion, and identity switching.
- Naive self-training (randomly selecting a single teacher for pseudo-labels) or fixed fusion strategies cannot handle this heterogeneity and propagate systematic errors.
- Methods such as BootsTAPIR require millions of real videos for large-scale distillation, resulting in low data efficiency.
Key Challenge: Oracle experiments show that selecting the best tracker per frame could yield substantial performance gains (with a large gap over existing methods), yet no automated method achieves such per-frame adaptive selection.
Goal: Learn to automatically assess the per-frame reliability of multiple trackers and select the most accurate predictions as pseudo-labels.
Key Insight: Train a "Verifier" meta-model — one that does not perform tracking itself, but learns to judge "who tracks well." Reliability estimation is learned on synthetic data via a "corrupt-then-detect" paradigm (applying random perturbations to GT trajectories to simulate tracker errors).
Core Idea: Rather than tracking points, the Verifier scores tracker predictions, converting multi-model complementarity into high-quality pseudo-labels.
Method¶
Overall Architecture¶
Given a video and query points, six pretrained teacher trackers each produce a candidate trajectory; stacked, these form \(\mathbf{C} \in \mathbb{R}^{L \times M \times 2}\) (\(L\) frames, \(M\) candidates). The Verifier assigns per-frame reliability scores \(\hat{\mathbf{s}}_t \in \mathbb{R}^M\) to the candidates and selects the highest-scoring candidate as the pseudo-label for each frame. These frame-level selections are concatenated into a complete pseudo-label trajectory used to fine-tune the student model Track-On2, yielding Track-On-R. At inference time, the Verifier can also serve as a plug-and-play ensemble module.
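As a concrete illustration, the per-frame selection step can be sketched in a few lines of NumPy (function and variable names are my own, not from the paper):

```python
import numpy as np

def select_pseudo_labels(candidates, scores):
    """Per-frame pseudo-label selection (illustrative sketch).

    candidates: (L, M, 2) array, M teacher predictions for each of L frames
    scores:     (L, M) array, per-frame Verifier reliability scores
    returns:    (L, 2) pseudo-label trajectory
    """
    best = scores.argmax(axis=1)              # winning candidate index per frame
    frames = np.arange(candidates.shape[0])
    return candidates[frames, best]           # concatenate frame-level winners
```

Because the argmax is taken independently per frame, the resulting pseudo-label trajectory can switch between teachers from one frame to the next, which is exactly the adaptive behavior the oracle analysis motivates.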
Key Designs¶
- Verifier Training: "Corrupt-then-Detect" on Synthetic Data
  - Trained on the K-EPIC synthetic dataset (11K videos, 24 frames/video).
  - Six types of random perturbations (drift, jump, occlusion, identity switching, etc., with displacements of 1–128 pixels) are applied to GT trajectories to generate candidate trajectories \(\mathbf{C}\).
  - Training objective: candidates closer to the GT should receive higher scores — implemented via soft contrastive learning with target distribution \(\mathbf{s}_t = \text{Softmax}(-\|\mathbf{C}_t - \mathbf{p}_t\| / \tau_s)\), \(\tau_s = 0.1\).
  - Loss: cross-entropy \(\mathcal{L} = \sum_t v_t \cdot \text{CE}(\hat{\mathbf{s}}_t, \mathbf{s}_t)\), with occluded frames masked.
  - Requires zero real-world annotations; the learned reliability estimation transfers across domains.
- Localized Feature Extraction + Candidate Transformer
  - Uses the frozen CNN encoder from CoTracker3 to extract frame-level dense features \(\mathbf{F}_t \in \mathbb{R}^{H' \times W' \times D}\).
  - Local features are extracted at query and candidate positions via deformable attention, rather than global reasoning.
  - Positional encoding: sinusoidal encoding of candidate displacements relative to the query point \(\boldsymbol{\Delta}_t = \mathbf{C}_t - \mathbf{q}_{t_0}\), augmented with learnable identity encodings to distinguish queries from candidates.
  - Candidate Transformer: constrained cross-attention (each per-frame query attends only to the \(M\) candidates in the current frame) + temporal self-attention (propagating context across frames) → outputs a temperature-scaled softmax reliability distribution over the \(M\) candidates per frame.
  - Key formulation: \(\hat{\mathbf{s}}_t = \text{Softmax}(\mathbf{f}_t^q \cdot \mathbf{f}_t / \tau)\), \(\tau = 0.1\).
- Verifier-Guided Real-World Fine-Tuning
  - Six teacher models: Track-On2, BootsTAPIR, BootsTAPNext, Anthro-LocoTrack, AllTracker, CoTracker3 (window).
  - Real video sources: TAO + OVIS + VSPW (videos with >48 frames, totaling 4,864 clips, requiring no annotations).
  - Query point sampling: 2/3 from SIFT detections, 1/3 from motion-salient regions.
  - Visibility estimation: majority voting among teacher models.
  - Training strategy: mixed training on synthetic data (with GT) and real data (with Verifier pseudo-labels), with the weight of real data progressively increased.
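The corrupt-then-detect objective (soft targets plus masked cross-entropy) can be sketched as a minimal NumPy illustration; it follows the notation in the summary, but all helper names are my own:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def verifier_targets(candidates, gt, tau_s=0.1):
    """Soft targets s_t = Softmax(-||C_t - p_t|| / tau_s) over M candidates.

    candidates: (L, M, 2) candidate trajectories C (GT plus perturbations)
    gt:         (L, 2) ground-truth trajectory p
    """
    dist = np.linalg.norm(candidates - gt[:, None, :], axis=-1)  # (L, M)
    return softmax(-dist / tau_s, axis=-1)

def verifier_loss(pred_scores, targets, visible, eps=1e-9):
    """L = sum_t v_t * CE(s_hat_t, s_t); occluded frames (v_t = 0) are masked."""
    ce = -(targets * np.log(pred_scores + eps)).sum(axis=-1)     # (L,)
    return float((visible * ce).sum())
```

With \(\tau_s = 0.1\), the target distribution is sharply peaked on the candidate nearest the GT, so the Verifier is effectively trained to rank candidates by localization error.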
Loss & Training¶
- Verifier training: cross-entropy on soft reliability targets, with occluded frames masked.
- Student model fine-tuning: standard point tracking loss (position L1 + visibility BCE), with mixed synthetic and real data and progressively increasing loss weight on real data.
- Base student model: Track-On2, pretrained on TAP-Vid Kubric.
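A minimal sketch of the student fine-tuning loss (position L1 plus visibility BCE); the summary does not give the exact term weighting, so `w_vis` below is an assumed parameter:

```python
import numpy as np

def student_loss(pred_xy, target_xy, pred_vis_logit, target_vis,
                 w_vis=1.0, eps=1e-9):
    """Position L1 + visibility BCE against (pseudo-)labels.

    pred_xy, target_xy:         (L, 2) predicted / pseudo-label positions
    pred_vis_logit, target_vis: (L,) visibility logits / binary labels
    w_vis is an assumed weighting, not taken from the paper.
    """
    l1 = np.abs(pred_xy - target_xy).sum(axis=-1).mean()
    p = 1.0 / (1.0 + np.exp(-pred_vis_logit))                    # sigmoid
    bce = -(target_vis * np.log(p + eps)
            + (1.0 - target_vis) * np.log(1.0 - p + eps)).mean()
    return float(l1 + w_vis * bce)
```

During the mixed schedule described above, the same loss is applied to synthetic samples with GT targets and to real samples with Verifier pseudo-labels, with the real-data term progressively upweighted.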
Key Experimental Results¶
Main Results¶
Comparison on real-world point tracking benchmarks:
| Model | EgoPoints δ_avg | RoboTAP AJ | Kinetics AJ | DAVIS AJ | Type |
|---|---|---|---|---|---|
| Track-On2 | 61.7 | 68.1 | 55.3 | 67.0 | Synthetic pretrain |
| BootsTAPIR | 55.7 | 64.9 | 54.6 | 61.4 | Real fine-tune (M-scale) |
| BootsTAPNext-B | 33.6 | 64.0 | 57.3 | 65.2 | Real fine-tune (M-scale) |
| CoTracker3 | 54.0 | 66.4 | 55.8 | 63.8 | Real fine-tune |
| AllTracker† | 62.0 | 68.8 | 56.8 | 63.7 | Extra optical flow data |
| Track-On-R (Ours) | 67.3 | 70.9 | 57.8 | 68.1 | Real fine-tune (~5K) |
Ablation Study¶
Verifier ensemble vs. individual teachers (δ_avg on DAVIS & RoboTAP):
| Method | DAVIS δ_avg | RoboTAP δ_avg |
|---|---|---|
| Random teacher selection | 79.5 | 77.4 |
| Best single teacher | ~79–80 | ~80 |
| Verifier selection | 80.6 | 81.8 |
Effect of number of teachers: Increasing the number of teachers monotonically improves Verifier performance — even adding weak teachers does not degrade the Verifier (though it degrades the random-selection baseline).
Data efficiency: The majority of adaptation gains are achieved with only ~3K videos (TAO subset), far fewer than the millions required by BootsTAPIR.
Verifier vs. non-learning ensemble strategies: The Verifier surpasses all non-learning baselines including geometric median, consistency voting, and Kalman filtering.
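For reference, one of these non-learning baselines, the per-frame geometric median of the teacher predictions, can be computed with Weiszfeld's iteration; this is my own sketch, not the paper's implementation:

```python
import numpy as np

def geometric_median(points, iters=100, tol=1e-6):
    """Weiszfeld's algorithm: the point minimizing the sum of Euclidean
    distances to the M teacher predictions for one frame.

    points: (M, 2) candidate positions for a single frame
    """
    x = points.mean(axis=0)                         # initialize at centroid
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(points - x, axis=1), tol)
        w = 1.0 / d                                 # inverse-distance weights
        x_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x
```

Unlike the Verifier, this fusion treats every teacher identically on every frame, which is precisely the limitation that learned per-frame selection addresses.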
Key Findings¶
- Track-On-R achieves state-of-the-art results on all four real-world benchmarks, with an EgoPoints δ_avg of 67.3, surpassing AllTracker by 5.3 points.
- With only ~5K real videos, Track-On-R outperforms BootsTAPIR/BootsTAPNext trained on millions of videos, representing >100× improvement in data efficiency.
- Real-world fine-tuning does not degrade performance on synthetic benchmarks; δ_avg on PointOdyssey even improves by +8.3.
- Teacher rankings vary substantially across datasets (BootsTAPNext ranks lowest on RoboTAP but second on DAVIS), validating the necessity of adaptive per-frame selection.
Highlights & Insights¶
- The Verifier meta-model design is elegant — rather than performing tracking, it learns "who tracks well," converting multi-tracker complementarity into high-quality pseudo-labels.
- The oracle gap analysis (Fig. 2) compellingly demonstrates the large potential of adaptive selection.
- The trajectory perturbation scheme (six perturbation types) is carefully designed to cover diverse real-world tracker failure modes, and is conducted entirely on synthetic data.
- Extremely high data efficiency (~3K videos near optimal, 100× fewer than BootsTAPIR) is a key practical advantage.
- The Verifier can also be used directly as a plug-and-play ensemble module at inference time, providing gains without fine-tuning.
Limitations & Future Work¶
- The Verifier's upper bound is constrained by the teacher tracker pool — if all teachers fail on a given frame, the Verifier cannot recover.
- Running all six teacher trackers (e.g., when using the Verifier as a plug-and-play ensemble at inference) introduces substantial computational overhead, roughly 6× the cost of a single tracker.
- The approach is only validated on point tracking; whether the Verifier paradigm generalizes to optical flow, VOS, and related tasks remains to be explored.
- Fine-tuning effectiveness depends on the quality and diversity of real video data; applicability to extreme out-of-domain scenarios (e.g., underwater, thermal imaging) is unknown.
- The generalization of the Verifier itself — trained on K-EPIC synthetic data — may be limited in real-world scenes with larger domain gaps.
Related Work & Insights¶
- vs. CoTracker3: Both adopt teacher pseudo-label strategies, but CoTracker3 randomly selects teachers (Kinetics AJ 55.8 vs. 57.8); the core gap lies in pseudo-label quality.
- vs. BootsTAPIR/BootsTAPNext: Large-scale student-teacher distillation requires millions of videos; this work exceeds their performance with only ~5K, representing a massive difference in data efficiency.
- vs. AllTracker: AllTracker leverages additional real optical flow annotation data (EgoPoints 62.0 vs. 67.3), demonstrating that Verifier pseudo-labels can even outperform supervision from real optical flow annotations.
- Broader implication: The Verifier paradigm is transferable to any scenario requiring multi-model fusion or pseudo-label selection — e.g., per-sample selection of the most reliable teacher in multi-teacher distillation for object detection or semantic segmentation.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The Verifier meta-model elevates reliability estimation from heuristics to a learnable paradigm — a novel and elegant contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on 4 real-world and 2 synthetic benchmarks, with comprehensive ablations over teacher combinations, data scale, non-learning baselines, and oracle comparisons.
- Writing Quality: ⭐⭐⭐⭐⭐ The oracle gap analysis is highly convincing; method descriptions are clear; Figs. 1/2/3 are excellent.
- Value: ⭐⭐⭐⭐⭐ The Verifier paradigm is broadly applicable, highly data-efficient, and offers important insights for point tracking and the wider self-training/pseudo-labeling community.