From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection

Conference: CVPR 2026
arXiv: 2602.20630
Code: None
Area: 3D Vision
Keywords: keypoint detection, reinforcement learning, long-term trackability, sequential decision-making, feature matching

TL;DR

This work shifts keypoint detection from an "image-pair matching" paradigm to "sequence-level trackability optimization." The proposed reinforcement learning framework, TraqPoint, directly optimizes long-term keypoint tracking quality over image sequences, achieving state-of-the-art performance on pose estimation, visual localization, visual odometry, and 3D reconstruction tasks.

Background & Motivation

Existing learned keypoint detection methods (SuperPoint, DISK, ALIKED, RDD, etc.) are trained on image pairs, optimizing for pairwise matchability between two images. However, the core requirement of real-world applications such as SfM and SLAM is long-term trackability — the ability of keypoints to persist reliably across long sequences in the presence of dramatic viewpoint changes, illumination variation, and motion blur. Keypoints that perform well on isolated image pairs may drift or disappear over long trajectories, directly degrading system stability.

Prior RL-based methods (RFP, DISK, RIPE) introduced reinforcement learning to handle discrete selection problems, but their reward functions remain defined on single image pairs and do not explicitly model temporal dynamics. This paper proposes a paradigm shift from "pairwise matchability" to "long-term trackability."

Method

Overall Architecture

TraqPoint adopts a "describe-then-detect" dual-branch architecture (inherited from RDD): the descriptor branch is pre-trained and frozen, while the keypoint branch serves as the RL policy network \(\pi_\theta\). The state \(s\) is a reference image \(I^{ref}\); the action is sampling \(N\) keypoints \(\mathcal{A} = \{\mathbf{x}_i\}_{i=1}^N\) from the policy output distribution \(P_\theta\). The objective is to maximize the expected trackability reward over the entire sequence.

Key Designs

Hybrid Sampling: To prevent keypoints from clustering in high-probability regions, sampling is split into two components. Global sampling draws \(N_g\) points directly from the global distribution \(P_\theta\); grid sampling partitions the image into a \(G \times G\) grid and samples one point per cell from a local softmax distribution, ensuring spatial coverage. For policy gradient computation, the probability of every sampled point, whether drawn globally or per cell, is taken from the global distribution \(P_\theta(\mathbf{x}_i)\).
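The hybrid sampling scheme can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `softmax` and `hybrid_sample`, sampling with replacement in the global component, and the requirement that the grid size does not exceed the map size are all assumptions made for illustration. What it tries to show is that grid-sampled points are drawn from a per-cell softmax but still report their log-probability under the global distribution.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a flat list of logits."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_sample(logits, n_global, grid, rng):
    """Sample keypoints from an H x W logit map (hypothetical helper).

    Returns a list of ((row, col), log_prob) pairs. log_prob is always
    read from the global distribution, even for grid-sampled points.
    Assumes grid <= min(H, W) so that no cell is empty.
    """
    h, w = len(logits), len(logits[0])
    flat = [logits[r][c] for r in range(h) for c in range(w)]
    p_global = softmax(flat)

    samples = []
    # Global component: draw n_global pixels from the global softmax
    # (with replacement, which is an assumption of this sketch).
    for idx in rng.choices(range(h * w), weights=p_global, k=n_global):
        samples.append(((idx // w, idx % w), math.log(p_global[idx])))

    # Grid component: one point per cell, drawn from a softmax that is
    # restricted to the logits inside that cell.
    for gr in range(grid):
        for gc in range(grid):
            rows = range(gr * h // grid, (gr + 1) * h // grid)
            cols = range(gc * w // grid, (gc + 1) * w // grid)
            cell = [(r, c) for r in rows for c in cols]
            p_local = softmax([logits[r][c] for r, c in cell])
            r, c = rng.choices(cell, weights=p_local, k=1)[0]
            samples.append(((r, c), math.log(p_global[r * w + c])))
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    toy_logits = [[0.1 * (r + c) for c in range(8)] for r in range(8)]
    for point, log_prob in hybrid_sample(toy_logits, n_global=4, grid=2, rng=rng):
        print(point, round(log_prob, 3))
```

With `n_global=4` and `grid=2` the call returns eight points: four global draws followed by one point from each of the four cells.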
Trackability Reward: For each keypoint \(\mathbf{x}_i\), the method projects it into all target frames in the sequence using known poses and depth, and computes a composite reward over the visible frame set \(\mathcal{V}_i\):
- Ranking Reward \(R_{\text{rank}}\): Measures cross-view saliency consistency by computing the percentile rank of the point's logit value within a \(K \times K\) local region in the target frame, linearly scaled as: \(R_{\text{rank},i}^t = \max(0, \frac{\text{rank\_prop} - \tau_{\text{rank}}}{1.0 - \tau_{\text{rank}}})\), with \(\tau_{\text{rank}} = 0.2\)
- Distinctiveness Reward \(R_{\text{dist}}\): Inspired by the Lowe ratio test, computes the nearest-to-second-nearest descriptor distance ratio \(\text{ratio} = d_1/d_2\) using the frozen descriptor branch, rewarding points with ratio below threshold: \(R_{\text{dist},i}^t = \max(0, \frac{\tau_{\text{dist}} - \text{ratio}}{\tau_{\text{dist}}})\), with \(\tau_{\text{dist}} = 0.85\)
- Final trajectory reward: \(R_i = \frac{1}{|\mathcal{V}_i|} \sum_{t \in \mathcal{V}_i} R_i^t\)
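The three reward formulas above can be checked numerically with a small sketch. The function names are hypothetical, `rank_prop` is assumed to be the fraction of patch logits that do not exceed the projected point's logit, and points with an empty visible set are assumed to receive zero reward. How \(R_{\text{rank}}\) and \(R_{\text{dist}}\) are combined into the per-frame reward \(R_i^t\) is not specified in this summary, so the usage lines simply add them as a placeholder.

```python
def rank_reward(center_logit, patch_logits, tau_rank=0.2):
    """Percentile rank of the point's logit within its K x K patch,
    rescaled linearly so ranks at or below tau_rank earn zero reward."""
    # Assumption: rank_prop is the fraction of patch values <= the point's logit.
    rank_prop = sum(v <= center_logit for v in patch_logits) / len(patch_logits)
    return max(0.0, (rank_prop - tau_rank) / (1.0 - tau_rank))


def distinctiveness_reward(d1, d2, tau_dist=0.85):
    """Lowe-style ratio reward: d1 and d2 are the nearest and second-nearest
    descriptor distances (d2 > 0 is assumed)."""
    ratio = d1 / d2
    return max(0.0, (tau_dist - ratio) / tau_dist)


def trajectory_reward(per_frame_rewards):
    """Mean of the per-frame rewards over the visible frame set."""
    if not per_frame_rewards:
        return 0.0  # Assumption: a point visible in no target frame earns nothing.
    return sum(per_frame_rewards) / len(per_frame_rewards)


if __name__ == "__main__":
    patch = [0.1 * k for k in range(1, 10)]  # toy 3 x 3 patch of logits
    frames = [
        # Placeholder combination: the sum of the two terms (an assumption).
        rank_reward(0.9, patch) + distinctiveness_reward(0.4, 1.0),
        rank_reward(0.5, patch) + distinctiveness_reward(0.9, 1.0),
    ]
    print(trajectory_reward(frames))
```

Both terms are zero below their thresholds and grow linearly above them, which is what keeps the reward signal dense rather than binary.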
DINOv3-ConvNeXt Backbone: Replaces the ResNet-50 used in RDD with DINOv3-ConvNeXt (base), providing multi-scale features and strong semantic representations. The descriptor branch employs a multi-scale deformable Transformer to aggregate features from four scales, outputting a 256-dimensional dense descriptor map.

Loss & Training

The total loss combines policy gradient, spatial entropy regularization, and a warm-up loss:

\[\mathcal{L}(\theta) = -\mathcal{R}(\mathcal{A}) \cdot \left(\frac{1}{N} \sum_{i=1}^N \log P_\theta(\mathbf{x}_i)\right) - \lambda \mathcal{H}(P_\theta) + \alpha_t \mathcal{L}_w\]

The mean reward \(\mathcal{R}(\mathcal{A}) = \frac{1}{N}\sum_i R_i\) is used as a baseline to reduce variance.
Entropy regularization coefficient \(\lambda = 0.001\) prevents mode collapse.
The warm-up loss \(\mathcal{L}_w\) provides weak supervision via FAST detector keypoints during the first 10% of training iterations.
Training data: 5-frame sequences constructed from MegaDepth, with \(N=256\) keypoints per step.
Trained on 8 NVIDIA H20 GPUs for 50,000 steps.
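As a rough illustration of how the terms in \(\mathcal{L}(\theta)\) fit together, the sketch below evaluates the loss value for one sampled keypoint set in plain Python. It is only a forward computation under stated assumptions: the names `traqpoint_loss` and `warmup_weight` are hypothetical, the reward is treated as a constant with respect to \(\theta\), and the shape of \(\alpha_t\) is not given in this summary, so a hard step that switches the warm-up loss off after the first 10% of iterations is assumed.

```python
import math


def entropy(probs):
    """Shannon entropy H(P) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def warmup_weight(step, total_steps):
    """Hypothetical alpha_t: 1 during the first 10% of steps, 0 afterwards."""
    return 1.0 if step * 10 < total_steps else 0.0


def traqpoint_loss(log_probs, rewards, probs, lam=0.001,
                   alpha_t=0.0, warmup_loss=0.0):
    """Evaluate -R(A) * mean(log P) - lam * H(P) + alpha_t * L_w.

    log_probs: log P_theta(x_i) of the N sampled keypoints
    rewards:   trajectory rewards R_i of the same keypoints
    probs:     the full global distribution P_theta (flattened)
    """
    n = len(log_probs)
    mean_reward = sum(rewards) / n      # R(A)
    mean_log_prob = sum(log_probs) / n  # (1/N) * sum_i log P_theta(x_i)
    policy_term = -mean_reward * mean_log_prob
    return policy_term - lam * entropy(probs) + alpha_t * warmup_loss


if __name__ == "__main__":
    probs = [0.25, 0.25, 0.25, 0.25]        # toy 4-pixel distribution
    log_probs = [math.log(0.25)] * 2        # two sampled points
    rewards = [0.5, 1.0]
    alpha = warmup_weight(step=100, total_steps=50_000)
    print(traqpoint_loss(log_probs, rewards, probs,
                         alpha_t=alpha, warmup_loss=0.3))
```

Under the assumed step schedule, a 50,000-step run would apply the warm-up loss for the first 5,000 steps only. In an autograd framework the same expression would be built from tensors so that gradients flow through `log_probs` and the entropy term.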

Key Experimental Results

Main Results

Dataset	Metric	TraqPoint	Prev. SOTA (RDD)	Gain
MegaDepth	AUC@5°	55.8	51.9	+3.9
MegaDepth	AUC@10°	71.3	68.0	+3.3
MegaDepth	AUC@20°	83.0	79.9	+3.1
ScanNet	AUC@5°	16.6	13.7	+2.9
ScanNet	AUC@10°	32.8	29.3	+3.5
ScanNet	AUC@20°	49.5	45.3	+4.2
KITTI Seq-01	ATE↓	29.9	35.3	-5.4
KITTI Seq-01	AKTL↑	7.3	4.6	+2.7
ETH Madrid	Reg.Img↑	693	632	+61
ETH Madrid	Sparse Pts↑	254k	154k	+100k
ETH Madrid	Track Len↑	11.14	9.40	+1.74

Ablation Study

Configuration	AUC@5° (MegaDepth)	AKTL (KITTI)	Notes
TraqPoint-Full	55.8	6.6	Full method
Pairwise RL	53.3	4.3	Degraded to two-frame training
Match Reward	49.7	2.8	Replaced with basic matching reward
w/o Ranking Reward	52.6	4.0	Ranking reward removed
w/o Distinctiveness	54.6	5.9	Distinctiveness reward removed
w/o RL (supervised)	52.0	3.8	Trained with RDD supervised scheme
ResNet-50 Backbone	54.5	6.1	Backbone replaced

Key Findings

Sequence-level RL vs. pairwise RL: AUC@5° improves by 2.5 and AKTL by 2.3, confirming the critical value of sequential supervision.
Using only mutual nearest neighbor (MNN) matching, the method surpasses SuperPoint + LightGlue (SP+LG), which relies on an additional learned matcher, by 5.9 AUC@5° on MegaDepth.
On ETH 3D reconstruction (Madrid), mean track length rises by about 1.7 (9.40 to 11.14), the sparse point count grows by about 65% (154k to 254k), and the keypoint distribution becomes more concentrated in texture-rich regions.
Optimal hyperparameters: sequence length of 5, 256 sampled keypoints.
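For readers unfamiliar with the MNN matching mentioned in the findings above, here is a minimal sketch of the general technique: a pair is kept only when each descriptor is the other's nearest neighbor. The function name, the brute-force search, and the squared Euclidean distance are illustrative assumptions and say nothing about the evaluation code used in the paper.

```python
def mutual_nearest_neighbors(desc_a, desc_b):
    """Return index pairs (i, j) where desc_a[i] and desc_b[j] are each
    other's nearest neighbor under squared Euclidean distance."""

    def sqdist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    def nearest(query, pool):
        # Brute-force search; ties resolve to the lowest index.
        return min(range(len(pool)), key=lambda j: sqdist(query, pool[j]))

    a_to_b = [nearest(d, desc_b) for d in desc_a]
    b_to_a = [nearest(d, desc_a) for d in desc_b]
    # Keep a pair only if the match is consistent in both directions.
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


if __name__ == "__main__":
    desc_a = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
    desc_b = [(1.1, 0.0), (0.1, 0.0)]
    print(mutual_nearest_neighbors(desc_a, desc_b))
```

In the toy example the third descriptor in `desc_a` has a nearest neighbor in `desc_b`, but that neighbor prefers a different descriptor, so the pair is discarded.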

Highlights & Insights

Paradigm Innovation: This is the first work to explicitly model keypoint detection as a sequential decision-making problem, using RL to optimize long-term trackability rather than short-term matchability.
Reward Design: The ranking and distinctiveness rewards assess tracking quality from two complementary dimensions — consistency and discriminability — and are both formulated as continuous linear signals to avoid sparse gradients.
Decoupled Policy and Descriptor: Freezing the descriptor branch provides a stable reward signal, allowing the policy network to focus on optimizing detection, which leads to more stable training.
Broad Downstream Effectiveness: The method not only leads on pairwise tasks but exhibits even greater advantages on sequential tasks such as visual odometry and 3D reconstruction.

Limitations & Future Work

Training requires sequentially annotated data with depth and pose labels (MegaDepth), making data construction costly.
Inference still relies on NMS + sigmoid for keypoint extraction and does not exploit sequence information at test time.
Reprojection error increases slightly (as challenging points are retained); accuracy constraints could be incorporated into the reward.
Comparison with dense/semi-dense matching methods (LoFTR, MASt3R) on downstream tasks is absent.
While the keypoint branch is lightweight, the descriptor branch (DINOv3-ConvNeXt + deformable Transformer) is computationally heavy, and deployment cost warrants further evaluation.
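The per-image inference step noted above (sigmoid followed by NMS) can be sketched generically as follows. This is a plain local-maximum NMS over a score map, not the paper's extraction code; the function name and the `radius`, `threshold`, and `top_k` parameters are hypothetical, and tied scores within a window are all kept.

```python
import math


def extract_keypoints(logits, radius=1, threshold=0.5, top_k=None):
    """Sigmoid + local-maximum NMS on an H x W logit map.

    Returns (row, col, score) tuples sorted by descending score.
    """
    h, w = len(logits), len(logits[0])
    scores = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in logits]

    kept = []
    for r in range(h):
        for c in range(w):
            s = scores[r][c]
            if s < threshold:
                continue
            # A pixel survives only if nothing in its window scores higher.
            window = [
                scores[rr][cc]
                for rr in range(max(0, r - radius), min(h, r + radius + 1))
                for cc in range(max(0, c - radius), min(w, c + radius + 1))
            ]
            if s >= max(window):
                kept.append((s, r, c))

    kept.sort(reverse=True)
    if top_k is not None:
        kept = kept[:top_k]
    return [(r, c, s) for s, r, c in kept]


if __name__ == "__main__":
    toy = [[-2.0] * 5 for _ in range(5)]
    toy[1][1], toy[1][2], toy[3][3] = 3.0, 2.0, 1.0
    for r, c, s in extract_keypoints(toy):
        print(r, c, round(s, 3))
```

Each image is processed independently here, which matches the limitation that no sequence information is used at test time.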

Related Work & Insights

RDD (CVPR'25) is the most direct baseline; TraqPoint comprehensively outperforms it under the same architecture by adopting an RL training paradigm.
DISK/RIPE also employ RL but retain pairwise rewards, underscoring the importance of sequence-level reward signals.
Inspiration: This paradigm could potentially be extended to dense matching, optical flow estimation, and other tasks requiring long-term consistency. The reward design philosophy (ranking + distinctiveness) is transferable to other visual feature learning settings.

Rating

Novelty: ⭐⭐⭐⭐ First work to shift keypoint detection from pairwise training to sequence-level RL optimization — a significant paradigm contribution.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers both pairwise (pose estimation/localization) and sequential (odometry/reconstruction) tasks with thorough ablations.
Writing Quality: ⭐⭐⭐⭐ Clear motivation, complete formulations, and rich figures and tables.
Value: ⭐⭐⭐⭐ Directly improves keypoint quality for SfM/SLAM systems with strong practical utility.