Real-World Point Tracking with Verifier-Guided Pseudo-Labeling

Conference: CVPR 2026 · arXiv: 2603.12217 · Code: kuis-ai/track_on_r · Area: Video Understanding · Keywords: point tracking, pseudo-labeling, self-training, verifier, ensemble learning, sim-to-real

TL;DR

This paper proposes the Verifier — a meta-model that learns to assess the per-frame reliability of predictions from multiple pre-trained trackers, selecting the best candidate at each frame to construct high-quality pseudo-label trajectories. This enables annotation-free fine-tuning for real-world point tracking and achieves state-of-the-art performance on four real-world benchmarks.

Background & Motivation

State of the Field — Long-range point tracking relies on synthetic training data: Current Transformer-based point trackers (CoTracker, Track-On, etc.) are trained on large-scale synthetic datasets (TAP-Vid Kubric), as densely annotated long-range trajectories in real videos are prohibitively expensive to obtain.

Limitations of Prior Work — Significant sim-to-real domain gap: Synthetic data differs substantially from real-world video in terms of texture, lighting, non-rigid motion, occlusion patterns, and sensor noise, causing models trained on synthetic data to degrade in reliability on real-world scenes.

Limitations of Prior Work — Uncontrollable pseudo-label quality in naive self-training: Existing methods (e.g., CoTracker3) generate pseudo-labels from a single, randomly selected teacher model; however, the accuracy of different trackers varies considerably across frames and scenes, and fixed heuristics or global confidence thresholds cannot handle this heterogeneous error distribution.

Limitations of Prior Work — Complementarity across trackers is unexploited: Oracle experiments demonstrate that adaptively selecting the best tracker per frame greatly outperforms any single model or random selection, indicating substantial room for adaptive selection strategies.

Limitations of Prior Work — Lack of reliability estimation: Existing pseudo-labeling pipelines do not distinguish reliable from unreliable frames in teacher predictions, allowing errors to accumulate and propagate during training.

Limitations of Prior Work — Data efficiency: Methods such as BootsTAPIR require millions of real videos for large-scale distillation, which is impractical in many real-world scenarios with limited unlabeled video.

Method

Overall Architecture

The system consists of three stages: (1) training the Verifier meta-model on synthetic data; (2) using the Verifier to generate high-quality pseudo-labels on unlabeled real-world videos; and (3) fine-tuning the student tracker Track-On2 with the pseudo-labels. At inference time, the Verifier can also serve as a plug-and-play ensemble module.

Candidate Trajectory Construction

  • Training stage: Random perturbations (drift, jumps, occlusion, jitter) are applied to ground-truth trajectories from synthetic data to produce \(M = 6\)–\(12\) candidate trajectories, simulating the realistic error patterns of different trackers at inference time (see the sketch after this list).
  • Inference/fine-tuning stage: Six pre-trained teacher models (Track-On2, BootsTAPIR, BootsTAPNext, Anthro-LocoTrack, AllTracker, CoTracker3-Window) each independently produce candidate trajectories.
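
A minimal sketch of the training-stage perturbation scheme, assuming PyTorch; the noise families follow the list above (drift, jumps, jitter), but every scale and probability here is an illustrative guess rather than the paper's recipe, and occlusion simulation is omitted:

```python
import torch

def perturb_trajectory(gt_xy, num_candidates=6, drift_scale=2.0,
                       jump_prob=0.05, jump_scale=20.0, jitter_scale=1.0):
    """Simulate tracker-like error patterns on one ground-truth trajectory.

    gt_xy: (T, 2) ground-truth point positions from synthetic data.
    Returns: (M, T, 2) candidate trajectories with distinct error patterns.
    """
    T = gt_xy.shape[0]
    M = num_candidates
    candidates = gt_xy.unsqueeze(0).repeat(M, 1, 1)

    # Slow drift: a cumulative random walk that pulls the track off target.
    drift = torch.cumsum(torch.randn(M, T, 2) * drift_scale / T, dim=1)

    # Sudden jumps: rare, large displacements mimicking tracking failures.
    jump_mask = (torch.rand(M, T, 1) < jump_prob).float()
    jumps = jump_mask * torch.randn(M, T, 2) * jump_scale

    # Per-frame jitter: small localization noise.
    jitter = torch.randn(M, T, 2) * jitter_scale

    return candidates + drift + jumps + jitter
```

During Verifier training, the per-frame distance of each perturbed candidate to the ground truth supplies the supervision signal described in Loss & Training below.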

Localized Feature Extraction

  1. A frozen CoTracker3 CNN encoder extracts dense visual features \(\mathbf{F}_t \in \mathbb{R}^{H' \times W' \times D}\).
  2. Deformable Attention is applied at the query point and each candidate position to sample local contextual features, rather than simple single-point sampling.
  3. Sinusoidal displacement embeddings \(\eta(\Delta_t)\) and identity embeddings (distinguishing query vs. candidate) are concatenated and projected to the model dimension.
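
A possible implementation of the sinusoidal displacement embedding \(\eta(\Delta_t)\) from step 3, assuming a standard Fourier-feature construction; the output dimension and frequency range are assumptions:

```python
import math
import torch

def sinusoidal_embed(delta, dim=64, max_freq=64.0):
    """Embed a 2D displacement with sin/cos features at multiple frequencies.

    delta: (..., 2) displacement between a candidate position and the query.
    Returns: (..., dim) embedding, with dim = 4 * n_freqs.
    """
    n_freqs = dim // 4
    # Geometrically spaced frequencies from 1 to max_freq.
    freqs = torch.exp(torch.linspace(0.0, math.log(max_freq), n_freqs))
    angles = delta.unsqueeze(-1) * freqs              # (..., 2, n_freqs)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return emb.flatten(-2)                            # (..., 4 * n_freqs)
```

Per step 3, this embedding would then be concatenated with the deformably sampled local features and the query/candidate identity embedding before projection to the model dimension.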

Candidate Transformer

  • Restricted cross-attention: At each frame, the query embedding attends only to the \(M\) candidate features of that frame (candidate dimension); the temporal dimension is treated as the batch axis, with computation performed independently per frame.
  • Temporal self-attention: Query embeddings across the \(L\) frames attend to one another along the temporal dimension, propagating cross-frame consistency so that uncertain frames can be corrected using high-confidence predictions from neighboring frames.
  • Output: Temperature-scaled cosine similarity (\(\tau=0.1\)) is computed between decoded query features and candidate features, followed by Softmax to yield a per-frame candidate reliability distribution \(\hat{\mathbf{s}}_t \in \mathbb{R}^M\).
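
A compact PyTorch sketch of the attention layout and scoring head described above; normalization, feed-forward sublayers, and exact dimensions are omitted or assumed, so this illustrates the information flow rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VerifierBlock(nn.Module):
    """One Verifier layer: restricted cross-attention + temporal self-attention."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query, cand):
        # query: (T, 1, D) one query token per frame
        # cand:  (T, M, D) M candidate tokens per frame
        # Restricted cross-attention: each frame's query attends only to that
        # frame's M candidates; time serves as the batch axis.
        q, _ = self.cross(query, cand, cand)
        # Temporal self-attention: the T query tokens attend to one another.
        t = q.transpose(0, 1)                          # (1, T, D)
        t, _ = self.temporal(t, t, t)
        return t.transpose(0, 1)                       # (T, 1, D)

def reliability_scores(query, cand, tau=0.1):
    # Temperature-scaled cosine similarity -> per-frame distribution over M.
    sim = F.cosine_similarity(query, cand, dim=-1)     # (T, M)
    return sim.div(tau).softmax(dim=-1)
```

Note that the cross-attention never mixes information across frames; only the temporal self-attention does, matching the restricted-attention description above.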

Loss & Training

A soft contrastive objective is adopted: the target distribution is \(\mathbf{s}_t = \text{Softmax}(-\|\mathbf{C}_t - \mathbf{p}_t\| / \tau_s)\) (with \(\tau_s = 0.3\)), where \(\mathbf{C}_t\) are the candidate positions and \(\mathbf{p}_t\) is the ground-truth position at frame \(t\). Training minimizes the cross-entropy \(\mathcal{L} = \sum_t v_t \cdot \text{CE}(\hat{\mathbf{s}}_t, \mathbf{s}_t)\), where the visibility indicator \(v_t\) masks out occluded frames.
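
A sketch of this objective in PyTorch, with tensor shapes as assumptions (\(T\) frames, \(M\) candidates); the cross-entropy is computed manually because both \(\hat{\mathbf{s}}_t\) and \(\mathbf{s}_t\) are full distributions rather than hard labels:

```python
import torch
import torch.nn.functional as F

def verifier_loss(pred_scores, cand_xy, gt_xy, visible, tau_s=0.3):
    """Soft contrastive objective from the formula above (sketch).

    pred_scores: (T, M) predicted reliability distribution s_hat (post-softmax)
    cand_xy:     (T, M, 2) candidate positions C_t
    gt_xy:       (T, 2) ground-truth positions p_t
    visible:     (T,) boolean visibility mask v_t
    """
    # Target: candidates closer to the ground truth receive higher probability.
    dist = torch.linalg.norm(cand_xy - gt_xy.unsqueeze(1), dim=-1)   # (T, M)
    target = F.softmax(-dist / tau_s, dim=-1)
    # Cross-entropy between target and predicted distributions, per frame.
    ce = -(target * (pred_scores + 1e-8).log()).sum(dim=-1)          # (T,)
    # Mask out occluded frames; normalizing by visible count is a choice here.
    vis = visible.float()
    return (ce * vis).sum() / vis.sum().clamp(min=1.0)
```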

Pseudo-Label Generation and Fine-Tuning

  • Query point sampling: 2/3 from SIFT detection and 1/3 from motion-salient regions (grayscale frame difference with spatial smoothing).
  • The candidate with the highest per-frame reliability score is selected as the pseudo-label position; visibility is estimated via majority voting among the teachers (see the sketch after this list).
  • Fine-tuning mixes synthetic data (with ground truth) and real data, with a progressive schedule that gradually increases the loss weight on real data (Mix + Schedule).
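
A minimal sketch of the per-frame selection and visibility voting described in the list above, assuming the teachers' trajectories and the Verifier's reliability scores have already been computed:

```python
import torch

def build_pseudo_labels(cand_xy, cand_vis, scores):
    """Assemble one pseudo-label trajectory from M teacher candidates.

    cand_xy:  (T, M, 2) candidate positions from the M teachers
    cand_vis: (T, M) boolean per-teacher visibility predictions
    scores:   (T, M) Verifier reliability distribution per frame
    """
    # Position: take the candidate the Verifier trusts most at each frame.
    best = scores.argmax(dim=-1)                        # (T,)
    idx = best.view(-1, 1, 1).expand(-1, 1, 2)          # (T, 1, 2)
    pseudo_xy = cand_xy.gather(1, idx).squeeze(1)       # (T, 2)
    # Visibility: strict majority vote among the teachers.
    pseudo_vis = cand_vis.float().mean(dim=-1) > 0.5    # (T,)
    return pseudo_xy, pseudo_vis
```

The resulting trajectories then feed the Mix + Schedule fine-tuning, where the loss weight on real pseudo-labeled data is gradually increased relative to synthetic ground truth.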

Key Experimental Results

Main Results

Comparisons against both synthetically pre-trained and real-world fine-tuned methods are conducted on four real-world benchmarks (EgoPoints, RoboTAP, TAP-Vid Kinetics, TAP-Vid DAVIS):

| Method | EgoPoints \(\delta_{avg}^x\) | RoboTAP AJ | Kinetics AJ | DAVIS AJ |
| --- | --- | --- | --- | --- |
| Track-On2 (synthetic baseline) | 61.7 | 68.1 | 55.3 | 67.0 |
| BootsTAPIR | 55.7 | 64.9 | 54.6 | 61.4 |
| CoTracker3-ft | 54.0 | 66.4 | 55.8 | 63.8 |
| AllTracker† | 62.0 | 68.8 | 56.8 | 63.7 |
| Track-On-R (Ours) | 67.3 | 70.9 | 57.8 | 68.1 |

Track-On-R achieves the highest scores across all datasets, outperforming the strongest competitor AllTracker by +5.3 on EgoPoints and +2.1 AJ on RoboTAP.

Ablation Study

Teacher combinations: As more teachers are incrementally added (A→E), the random baseline may degrade, whereas Verifier-guided selection consistently improves, demonstrating that the Verifier effectively exploits complementarity without being dragged down by weaker teachers.

| # Teachers | DAVIS \(\delta\) (Rand.) | DAVIS \(\delta\) (Ver.) | RoboTAP \(\delta\) (Rand.) | RoboTAP \(\delta\) (Ver.) |
| --- | --- | --- | --- | --- |
| 2 (A, B) | 79.5 | 80.6 | 77.4 | 81.8 |
| 5 (A–E) | 77.7 | 81.1 | 78.0 | 83.1 |

Training data mixture: Using only real data is already competitive; mixing in synthetic data improves occlusion-aware (OA) visibility prediction; the Mix + Schedule strategy achieves the best overall performance.

Key Findings

  • Used as a plug-and-play inference-time ensemble module, the Verifier outperforms all individual teachers and random selection baselines without any fine-tuning.
  • Effective domain transfer is achieved with only 4,864 real videos, far fewer than the millions required by BootsTAPIR.
  • No robot videos appear in the training set, yet Track-On-R achieves the best performance on RoboTAP, demonstrating strong generalization.

Highlights & Insights

  • Elegant meta-model design: Framing "which tracker is most reliable at the current frame" as a learnable classification problem, trained solely on synthetic data yet transferable across domains, is a well-motivated and elegant design choice.
  • Unified framework: The same Verifier serves both pseudo-label selection during training and plug-and-play ensemble integration at inference time, without requiring two separate mechanisms.
  • High data efficiency: The method matches or exceeds million-scale distillation approaches using fewer than 5K videos.
  • Robustness: Adding weaker teachers does not degrade Verifier performance and may even further improve it.

Limitations & Future Work

  • The Verifier's performance ceiling is bounded by the quality of the teacher trackers — if all teachers fail on a given frame, the Verifier cannot recover.
  • Fine-tuning effectiveness depends on the quality and diversity of the real-world video collection, requiring manual curation of suitable video sets.
  • Generating candidate trajectories requires running multiple teacher models, incurring substantial computational overhead during inference and pseudo-label generation (six models run serially or in parallel).
  • The paper does not discuss the scalability limits of the Verifier with respect to the number of teachers, nor its behavior in extreme scenarios such as motion blur or very low frame rates.

Related Work

| Direction | Representative Works | Relation to This Paper |
| --- | --- | --- |
| Long-range point tracking | PIPs, TAPIR, CoTracker, Track-On | This paper uses Track-On2 as the student model |
| Pseudo-label self-training | BootsTAPIR, CoTracker3, AnthroTAP | This paper replaces random teacher selection with Verifier-guided selection |
| Ensemble learning | Bagging, structured consensus | The Verifier can be viewed as a learned adaptive ensemble |

Rating

  • Novelty: ⭐⭐⭐⭐ — The angle of using a meta-model to assess tracker reliability is novel, and the design of training on synthetic data with cross-domain transfer is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive comparisons on four benchmarks, with three ablation groups covering teacher combinations, data mixture, and inference-time ensemble.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure; the oracle experiment figure provides strong motivation.
  • Value: ⭐⭐⭐⭐ — Provides a general pseudo-label quality control paradigm that is applicable to other dense prediction self-training tasks.