Real-World Point Tracking with Verifier-Guided Pseudo-Labeling

Conference: CVPR 2026 · arXiv: 2603.12217 · Code: kuis-ai.github.io/track_on_r · Area: Video Understanding / Point Tracking · Keywords: point tracking, pseudo-labeling, verifier, multi-teacher ensemble, sim-to-real adaptation

TL;DR

This paper proposes a learnable Verifier meta-model, trained on synthetic data, that assesses the reliability of tracker predictions and transfers this capability to the real world. By scoring per-frame predictions from six pretrained trackers and selecting the most reliable as pseudo-labels, the resulting Track-On-R model is fine-tuned on only ~5K real videos and achieves state-of-the-art performance across all four real-world benchmarks.

Background & Motivation

Background: Long-range point tracking models (CoTracker, TAPIR, Track-On, etc.) are typically trained on synthetic data (TAP-Vid Kubric). Recent self-training methods (BootsTAPIR, CoTracker3) fine-tune on real videos using pseudo-labels to bridge the sim-to-real gap.

Limitations of Prior Work:

  1. Individual trackers exhibit highly variable reliability across different frames and scenes — each excels under different challenges such as fast motion, low texture, occlusion, and identity switching.
  2. Naive self-training (randomly selecting a single teacher for pseudo-labels) or fixed fusion strategies cannot handle this heterogeneity and propagate systematic errors.
  3. Methods such as BootsTAPIR require millions of real videos for large-scale distillation, resulting in low data efficiency.

Key Challenge: Oracle experiments show that selecting the best tracker per frame could yield substantial performance gains (with a large gap over existing methods), yet no automated method achieves such per-frame adaptive selection.

Goal: Learn to automatically assess the per-frame reliability of multiple trackers and select the most accurate predictions as pseudo-labels.

Key Insight: Train a "Verifier" meta-model — one that does not perform tracking itself, but learns to judge "who tracks well." Reliability estimation is learned on synthetic data via a "corrupt-then-detect" paradigm (applying random perturbations to GT trajectories to simulate tracker errors).

Core Idea: Rather than tracking points, the Verifier scores tracker predictions, converting multi-model complementarity into high-quality pseudo-labels.

Method

Overall Architecture

Given a video and query points, each of six pretrained teacher trackers produces a candidate trajectory; stacked, these form \(\mathbf{C} \in \mathbb{R}^{L \times M \times 2}\), where \(L\) is the number of frames and \(M = 6\) the number of candidates per frame. The Verifier assigns per-frame reliability scores \(\hat{\mathbf{s}}_t \in \mathbb{R}^M\) to the candidates and selects the highest-scoring one as the pseudo-label for each frame. These frame-level selections are concatenated into a complete pseudo-label trajectory used to fine-tune the Track-On2 student, yielding Track-On-R. At inference time, the Verifier can also serve as a plug-and-play ensemble module.
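
A minimal sketch of the per-frame selection step, assuming the candidates are stacked as an (L, M, 2) tensor and the Verifier scores as (L, M); names and shapes are illustrative, not from the paper's code:

```python
import torch

def select_pseudo_labels(candidates, scores):
    """Pick the highest-scoring candidate per frame.

    candidates: (L, M, 2) - M teacher predictions per frame.
    scores:     (L, M)    - Verifier reliability scores.
    Returns an (L, 2) pseudo-label trajectory.
    """
    best = scores.argmax(dim=1)                       # (L,) winning teacher per frame
    return candidates[torch.arange(len(best)), best]  # (L, 2)

L, M = 24, 6
candidates = torch.rand(L, M, 2) * 256   # dummy teacher trajectories
scores = torch.rand(L, M)                # dummy Verifier scores
pseudo_traj = select_pseudo_labels(candidates, scores)  # (24, 2)
```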

Key Designs

  1. Verifier Training: "Corrupt-then-Detect" on Synthetic Data

    • Trained on the K-EPIC synthetic dataset (11K videos, 24 frames/video).
    • Six types of random perturbations (drift, jump, occlusion, identity switching, etc., with displacements of 1–128 pixels) are applied to GT trajectories to generate candidate trajectories \(\mathbf{C}\).
    • Training objective: candidates closer to the GT should receive higher scores — implemented via soft contrastive learning with target distribution \(\mathbf{s}_t = \text{Softmax}(-\|\mathbf{C}_t - \mathbf{p}_t\| / \tau_s)\), \(\tau_s = 0.1\).
    • Loss: cross-entropy \(\mathcal{L} = \sum_t v_t \cdot \text{CE}(\hat{\mathbf{s}}_t, \mathbf{s}_t)\), with occluded frames masked (a sketch of this step follows the list).
    • Requires zero real-world annotations; the learned reliability estimation transfers across domains.
  2. Localized Feature Extraction + Candidate Transformer

    • Uses the frozen CNN encoder from CoTracker3 to extract frame-level dense features \(\mathbf{F}_t \in \mathbb{R}^{H' \times W' \times D}\).
    • Local features are extracted at query and candidate positions via deformable attention, rather than global reasoning.
    • Positional encoding: sinusoidal encoding of candidate displacements relative to the query point \(\boldsymbol{\Delta}_t = \mathbf{C}_t - \mathbf{q}_{t_0}\), augmented with learnable identity encodings to distinguish queries from candidates.
    • Candidate Transformer: constrained cross-attention (per-frame query attends only to \(M\) candidates in the current frame) + temporal self-attention (propagating context across frames) → outputs a temperature-scaled softmax reliability distribution over \(M\) candidates per frame.
    • Key formulation: \(\hat{\mathbf{s}}_t = \text{Softmax}(\mathbf{f}_t^q \cdot \mathbf{f}_t / \tau)\), \(\tau = 0.1\).
  3. Verifier-Guided Real-World Fine-Tuning

    • Six teacher models: Track-On2, BootsTAPIR, BootsTAPNext, Anthro-LocoTrack, AllTracker, CoTracker3 (window).
    • Real video sources: TAO + OVIS + VSPW (videos with >48 frames, totaling 4,864 clips, requiring no annotations).
    • Query point sampling: 2/3 from SIFT detections, 1/3 from motion-salient regions.
    • Visibility estimation: majority voting among teacher models.
    • Training strategy: mixed training on synthetic data (with GT) and real data (with Verifier pseudo-labels), with the weight of real data progressively increased.
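
A compact sketch of the "corrupt-then-detect" training step from items 1 and 2 above: temperature-scaled candidate scoring, soft contrastive targets, and visibility-masked cross-entropy. The feature extraction (CNN encoder, deformable attention, candidate transformer) is abstracted into the input features, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def verifier_scores(fq, fc, tau=0.1):
    """Temperature-scaled similarity between query and candidate features.

    fq: (L, D)    per-frame query feature.
    fc: (L, M, D) per-frame candidate features.
    Returns (L, M) reliability logits; softmax is applied downstream.
    """
    return torch.einsum('ld,lmd->lm', fq, fc) / tau

def verifier_loss(cand, gt, visible, logits, tau_s=0.1):
    """Soft contrastive targets + masked cross-entropy, per the paper's recipe.

    cand:    (L, M, 2) perturbed candidate trajectories (corrupt step).
    gt:      (L, 2)    ground-truth trajectory (synthetic data).
    visible: (L,)      1.0 where visible, 0.0 where occluded.
    logits:  (L, M)    Verifier scores before softmax.
    """
    # Targets: candidates closer to GT get exponentially higher weight.
    dists = torch.linalg.norm(cand - gt[:, None, :], dim=-1)  # (L, M)
    targets = F.softmax(-dists / tau_s, dim=-1)               # (L, M)
    # Cross-entropy against the soft targets, masking occluded frames.
    log_probs = F.log_softmax(logits, dim=-1)
    ce = -(targets * log_probs).sum(dim=-1)                   # (L,)
    return (visible * ce).sum() / visible.sum().clamp(min=1)
```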

Loss & Training

  • Verifier training: cross-entropy on soft reliability targets, with occluded frames masked.
  • Student model fine-tuning: standard point tracking loss (position L1 + visibility BCE), with mixed synthetic and real data and progressively increasing loss weight on real data.
  • Base student model: Track-On2, pretrained on TAP-Vid Kubric.
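
A minimal sketch of the student fine-tuning objective under an assumed linear ramp for the real-data weight (the paper specifies only that the weight increases progressively; the schedule below is a placeholder):

```python
import torch
import torch.nn.functional as F

def tracking_loss(pred_xy, pred_vis_logit, gt_xy, gt_vis):
    """Standard point-tracking loss: position L1 + visibility BCE."""
    pos = F.l1_loss(pred_xy, gt_xy)
    vis = F.binary_cross_entropy_with_logits(pred_vis_logit, gt_vis)
    return pos + vis

def mixed_step_loss(syn_batch, real_batch, step, total_steps):
    """Mix synthetic GT supervision with Verifier pseudo-labels,
    linearly ramping up the real-data weight (assumed schedule)."""
    w_real = min(1.0, step / (0.5 * total_steps))  # ramp over the first half
    l_syn = tracking_loss(*syn_batch)    # targets = synthetic GT
    l_real = tracking_loss(*real_batch)  # targets = Verifier pseudo-labels
    return l_syn + w_real * l_real
```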

Key Experimental Results

Main Results

Comparison on real-world point tracking benchmarks:

| Model | EgoPoints δ_avg | RoboTAP AJ | Kinetics AJ | DAVIS AJ | Type |
|---|---|---|---|---|---|
| Track-On2 | 61.7 | 68.1 | 55.3 | 67.0 | Synthetic pretrain |
| BootsTAPIR | 55.7 | 64.9 | 54.6 | 61.4 | Real fine-tune (M-scale) |
| BootsTAPNext-B | 33.6 | 64.0 | 57.3 | 65.2 | Real fine-tune (M-scale) |
| CoTracker3 | 54.0 | 66.4 | 55.8 | 63.8 | Real fine-tune |
| AllTracker† | 62.0 | 68.8 | 56.8 | 63.7 | Extra optical flow data |
| Track-On-R (Ours) | 67.3 | 70.9 | 57.8 | 68.1 | Real fine-tune (~5K) |

Ablation Study

Verifier ensemble vs. individual teachers (δ_avg on DAVIS and RoboTAP):

| Method | DAVIS δ_avg | RoboTAP δ_avg |
|---|---|---|
| Random teacher selection | 79.5 | 77.4 |
| Best single teacher | ~79–80 | ~80 |
| Verifier selection | 80.6 | 81.8 |

Effect of number of teachers: Increasing the number of teachers monotonically improves Verifier performance — even adding weak teachers does not degrade the Verifier (though it degrades the random-selection baseline).

Data efficiency: The majority of adaptation gains are achieved with only ~3K videos (TAO subset), far fewer than the millions required by BootsTAPIR.

Verifier vs. non-learning ensemble strategies: The Verifier surpasses all non-learning baselines including geometric median, consistency voting, and Kalman filtering.
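
For context, a minimal sketch of one such non-learning baseline: the per-frame geometric median of the M candidates, computed with Weiszfeld iterations (a generic implementation, not the paper's):

```python
import numpy as np

def geometric_median(points, n_iters=50, eps=1e-6):
    """Weiszfeld iterations for the geometric median of M 2-D points.

    points: (M, 2) candidate positions for one frame.
    Returns the (2,) point minimizing the sum of Euclidean distances.
    """
    median = points.mean(axis=0)  # initialize at the centroid
    for _ in range(n_iters):
        dists = np.maximum(np.linalg.norm(points - median, axis=1), eps)
        weights = 1.0 / dists
        new_median = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_median - median) < eps:
            break
        median = new_median
    return median

# Fuse per frame: candidates (L, M, 2) -> fused trajectory (L, 2).
candidates = np.random.rand(24, 6, 2) * 256   # dummy trajectories
fused = np.stack([geometric_median(c) for c in candidates])
```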

Key Findings

  • Track-On-R achieves comprehensive state-of-the-art results on all four benchmarks, with EgoPoints δ_avg of 67.3 surpassing AllTracker by 5.3 points.
  • With only ~5K real videos, Track-On-R outperforms BootsTAPIR/BootsTAPNext trained on millions of videos, representing >100× improvement in data efficiency.
  • Real-world fine-tuning does not degrade performance on synthetic benchmarks; δ_avg on PointOdyssey even improves by +8.3.
  • Teacher rankings vary substantially across datasets (BootsTAPNext ranks lowest on RoboTAP but second on DAVIS), validating the necessity of adaptive per-frame selection.

Highlights & Insights

  • The Verifier meta-model design is elegant — rather than performing tracking, it learns "who tracks well," converting multi-tracker complementarity into high-quality pseudo-labels.
  • The oracle gap analysis (Fig. 2) compellingly demonstrates the large potential of adaptive selection.
  • The trajectory perturbation scheme (six perturbation types) is carefully designed to cover diverse real-world tracker failure modes, and is conducted entirely on synthetic data.
  • Extremely high data efficiency (~3K videos near optimal, 100× fewer than BootsTAPIR) is a key practical advantage.
  • The Verifier can also be used directly as a plug-and-play ensemble module at inference time, providing gains without fine-tuning.

Limitations & Future Work

  • The Verifier's upper bound is constrained by the teacher tracker pool — if all teachers fail on a given frame, the Verifier cannot recover.
  • Using six teacher trackers at inference introduces substantial computational overhead (6× inference cost).
  • The approach is only validated on point tracking; whether the Verifier paradigm generalizes to optical flow, VOS, and related tasks remains to be explored.
  • Fine-tuning effectiveness depends on the quality and diversity of real video data; applicability to extreme out-of-domain scenarios (e.g., underwater, thermal imaging) is unknown.
  • The generalization of the Verifier itself — trained on K-EPIC synthetic data — may be limited in real-world scenes with larger domain gaps.

Comparisons & Broader Implications

  • vs. CoTracker3: Both adopt teacher pseudo-labeling, but CoTracker3 selects teachers at random (Kinetics AJ 55.8 vs. 57.8); the core gap lies in pseudo-label quality.
  • vs. BootsTAPIR/BootsTAPNext: Large-scale student-teacher distillation requires millions of videos; this work exceeds their performance with only ~5K, representing a massive difference in data efficiency.
  • vs. AllTracker: AllTracker leverages additional real optical-flow training data (EgoPoints 62.0 vs. 67.3), showing that Verifier pseudo-labels can outperform even supervision derived from real optical flow.
  • Broader implication: The Verifier paradigm is transferable to any scenario requiring multi-model fusion or pseudo-label selection — e.g., per-sample selection of the most reliable teacher in multi-teacher distillation for object detection or semantic segmentation.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The Verifier meta-model elevates reliability estimation from heuristics to a learnable paradigm — a novel and elegant contribution.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on 4 real-world and 2 synthetic benchmarks, with comprehensive ablations over teacher combinations, data scale, non-learning baselines, and oracle comparisons.
  • Writing Quality: ⭐⭐⭐⭐⭐ The oracle gap analysis is highly convincing; method descriptions are clear; Figs. 1/2/3 are excellent.
  • Value: ⭐⭐⭐⭐⭐ The Verifier paradigm is broadly applicable, highly data-efficient, and offers important insights for point tracking and the wider self-training/pseudo-labeling community.