UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking

Conference: CVPR2026
arXiv: 2602.23734
Code: EIT-NLP/UTPTrack
Area: Video Understanding / Visual Object Tracking
Keywords: token pruning, visual object tracking, one-stream transformer, multimodal tracking, unified tracking, attention-guided pruning

TL;DR

This paper proposes UTPTrack, the first unified framework that jointly prunes tokens from all three components of one-stream Transformer trackers: the search region (SR), the dynamic template (DT), and the static template (ST). It achieves 65–67% visual token reduction across both RGB and multimodal/language-guided tracking tasks while retaining 99.7%–100.5% of baseline performance.

Background & Motivation

One-stream Transformer trackers are powerful but computationally expensive: Architectures such as OSTrack and SUTrack jointly encode templates and search regions to obtain stronger global feature representations, but the quadratic complexity of Transformers combined with large numbers of video tokens makes real-time deployment challenging.

Existing token pruning methods target only a single component: Prior work (e.g., CE in OSTrack, ProContEXT) prunes only the search region or the dynamic template, neglecting the interdependencies among SR, DT, and ST.

Isolated pruning leads to suboptimal decisions: The degree of redundancy varies across components; processing each independently fails to capture cross-component relationships, potentially discarding useful tokens or retaining substantial redundancy, thereby degrading spatial consistency and semantic integrity.

The problem is further compounded in multimodal settings: Unified tracking requires aligning RGB with depth/thermal infrared/event/language modalities, and isolated pruning disrupts cross-modal alignment.

External heuristics or auxiliary modules introduce additional overhead: ToMe relies on bipartite soft matching, and DynamicViT requires an additional MLP to predict saliency scores, both introducing structural modifications and extra computation.

No general efficiency solution exists for unified tracking: Existing efficient methods are mostly designed for single-modality RGB tracking; whether a single pruning strategy can simultaneously serve RGB, RGBD, RGBT, RGBE, and RGB-Language tasks remains unexplored.

Method

Overall Architecture

UTPTrack builds on the one-stream Transformer tracking pipeline, concatenating SR, ST, and DT tokens (plus language tokens where applicable) before feeding them into a shared encoder. A lightweight Candidate/Template Elimination Module (CTEM) is inserted at selected encoder layers; it computes token importance scores from the attention weights and prunes low-scoring tokens. Pruned SR tokens are restored to their original spatial positions via zero-padding so that the tracking head receives a spatially aligned input, as sketched below.
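
A minimal sketch of the zero-padding restoration step, assuming PyTorch tensors; `restore_pruned_tokens` and `keep_idx` are illustrative names, not the paper's API:

```python
import torch

def restore_pruned_tokens(kept_tokens, keep_idx, num_sr_tokens):
    """Scatter the retained SR tokens back to their original positions,
    zero-filling pruned slots so the tracking head receives a spatially
    aligned feature map.

    kept_tokens:   (B, K, C) features of the K retained SR tokens
    keep_idx:      (B, K)    original positions of those tokens (int64)
    num_sr_tokens: SR token count before pruning (e.g., H_x * W_x)
    """
    B, K, C = kept_tokens.shape
    restored = kept_tokens.new_zeros(B, num_sr_tokens, C)   # zero padding
    restored.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, C), kept_tokens)
    return restored
```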

Key Designs

  1. Search Region Pruning (CE): Attention similarity is computed between the query of the ST center token and the keys of all SR tokens, \(\omega_x = \text{softmax}(Q_{sz'}K_x^T / \sqrt{d_k})\); the top-k tokens are retained to suppress background clutter (a minimal sketch of this scoring, together with the TTA bonus and TG variant, follows this list).
  2. Dynamic Template Pruning (DTE): The ST center token is similarly used as an anchor to compute similarity scores \(\omega_{dz}\) for DT tokens, eliminating noisy tokens introduced by drift, occlusion, or appearance variation.
  3. Static Template Pruning (STE): Similarity scores \(\omega_{sz}\) are computed among ST tokens with respect to the center token; peripheral background tokens are removed while the center token is always preserved.
  4. Token Type-Aware Strategy (TTA): A binary mask is constructed from the first-frame target bounding box, and patch-level foreground scores are added as a bonus to attention scores. Three strategies are provided — full bonus (score added only if all pixels fall within the box), soft bonus (mean, default), and all bonus (score added if any pixel falls within the box) — to prevent inadvertent removal of foreground tokens.
  5. Text-Guided Pruning (TG): For RGB-Language tasks, language tokens encoded by CLIP-L interact bidirectionally with visual tokens; token importance is jointly determined by a weighted sum of attention from both the ST center token and the language tokens: \(\omega_x = \phi(\text{softmax}(Q_{sz'}K_x^T/\sqrt{d_k}) + \text{softmax}(Q_tK_x^T/\sqrt{d_k}))\).
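
To make the scoring concrete, here is a minimal sketch of the attention-guided selection (item 1), the soft-bonus variant of TTA (item 4), and TG scoring (item 5), assuming PyTorch. All function and variable names are illustrative, `keep_ratio` and `bonus_weight` are assumed hyperparameters, and \(\phi\) is taken as identity:

```python
import torch
import torch.nn.functional as F

def attention_scores(q_anchor, k_tokens):
    """Scaled dot-product attention of one anchor query (e.g., the ST
    center token) over a set of keys: omega = softmax(Q K^T / sqrt(d_k)).
    q_anchor: (B, d), k_tokens: (B, N, d) -> (B, N) scores."""
    d = q_anchor.shape[-1]
    logits = torch.einsum('bd,bnd->bn', q_anchor, k_tokens) / d ** 0.5
    return F.softmax(logits, dim=-1)

def prune_with_tta(q_center, k_sr, fg_fraction, keep_ratio=0.7, bonus_weight=1.0):
    """CE + TTA (soft bonus): score SR tokens by attention to the ST center
    token, add each patch's mean in-box pixel fraction as a bonus, and
    keep the top-k tokens.
    fg_fraction: (B, N) fraction of each patch's pixels inside the
    first-frame target box (the 'soft bonus')."""
    score = attention_scores(q_center, k_sr) + bonus_weight * fg_fraction
    k = max(1, int(keep_ratio * k_sr.shape[1]))
    return score.topk(k, dim=-1).indices  # original positions of kept tokens

def text_guided_scores(q_center, q_text, k_sr):
    """TG: for RGB-Language tasks, importance sums the attention from the
    ST center token and from the language token; phi is identity here."""
    return attention_scores(q_center, k_sr) + attention_scores(q_text, k_sr)
```

DTE and STE reuse the same anchored scoring, with the DT and ST tokens (rather than SR tokens) supplying the keys.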

Loss & Training

  • RGB Tracking: \(\mathcal{L}_{\text{RGB}} = \lambda_{\text{cls}}\mathcal{L}_{\text{cls}} + \lambda_{\text{giou}}\mathcal{L}_{\text{giou}} + \lambda_{L_1}\mathcal{L}_{L_1}\), where \(\lambda_{\text{cls}}=1, \lambda_{\text{giou}}=2, \lambda_{L_1}=5\).
  • Unified Tracking: A task identification cross-entropy loss is added: \(\mathcal{L}_{\text{Unified}} = \mathcal{L}_{\text{RGB}} + \lambda_{\text{task}}\mathcal{L}_{\text{task}}\), with \(\lambda_{\text{task}}=1\).
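
A minimal sketch of how the loss terms combine, using the weights reported above; function names are illustrative and the per-term losses are assumed precomputed:

```python
def rgb_tracking_loss(l_cls, l_giou, l_l1):
    """L_RGB = 1 * L_cls + 2 * L_giou + 5 * L_1 (weights as reported)."""
    return 1.0 * l_cls + 2.0 * l_giou + 5.0 * l_l1

def unified_tracking_loss(l_rgb, l_task):
    """L_Unified = L_RGB + 1 * L_task (task-identification cross-entropy)."""
    return l_rgb + 1.0 * l_task
```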

Key Experimental Results

Main Results

Evaluation is conducted on 10 benchmarks covering RGB (LaSOT, LaSOText, TrackingNet, GOT-10k) and multimodal (VOT-RGBD22, LasHeR, RGBT234, VisEvent, TNL2K, OTB99) tasks:

Model           Baseline       Visual Token Reduction   MACs Reduction     Baseline Performance Retention
UTPTrack-O384   OSTrack-384    65.4%                    31.3% (78G→53G)    99.7%
UTPTrack-S384   SUTrack-B384   67.5%                    28.4% (67G→48G)    100.5%
UTPTrack-O256   OSTrack-256    64.8%                    30.7%              99.7%
UTPTrack-S224   SUTrack-B224   69.4%                    28.9%              100.0%

In controlled-budget experiments with token retention ratios fixed at 87.2%/75.5%/65.6%, UTPTrack consistently outperforms CE, ToMe, EViT, and DynamicViT across all budget levels.

Ablation Study

Configuration            Avg. Visual Tokens   MACs (G)   Avg. Performance   Δ
Baseline (OSTrack-256)   384                  34.5       100.0%             -
+ CE                     217                  27.0       99.3%              -0.7%
+ DTE                    176                  25.4       99.6%              +0.3%
+ STE                    135                  23.8       98.9%              -0.7%
+ TTA                    135                  23.8       99.7%              +0.8%

For the unified tracking ablation (SUTrack-B224), sequentially adding CE → DTE → STE → TTA → TG reduces the average visual token count from 294 to 90 (a 69.4% reduction) while restoring performance to 100.0% of the baseline.

Key Findings

  • Pruning can act as regularization: Under moderate pruning, UTPTrack even surpasses the baseline (UTPTrack-S384 reaches 100.5%), suggesting that removing redundant/noisy tokens concentrates attention on salient regions.
  • TTA yields significant recovery: The bounding box prior via the soft bonus strategy effectively prevents foreground tokens from being mistakenly removed, recovering +0.8% on RGB tasks and +0.4% on unified tracking.
  • TG provides additional gains for language-guided tasks: When language modalities are involved, text-guided pruning contributes an additional +0.3% performance improvement.
  • Advantage increases at higher compression ratios: Under extreme 64.6% token reduction, UTPTrack maintains 99.3% performance (unified tracking), whereas DynamicViT collapses to 14.7% and ToMe degrades to 92.5%.

Highlights & Insights

  • First joint three-component pruning: Breaks the limitation of prior methods that prune only the search region or dynamic template, providing the first unified redundancy modeling across SR+DT+ST.
  • No additional parameters or modules: Directly reuses the Transformer's own attention weights to guide pruning, introducing no trainable parameters and remaining architecture-agnostic.
  • Dual priors: token type-awareness and text guidance: The former leverages spatial priors to protect foreground tokens; the latter leverages semantic priors to enhance multimodal pruning; the two are orthogonal and complementary.
  • Strong cross-modal generalizability: A single framework serves five task categories — RGB, RGBD, RGBT, RGBE, and RGB-Language — validated across 10 benchmarks.

Limitations & Future Work

  • Practical GPU speedup is limited: a 65% reduction in token count yields only modest FPS gains (OSTrack-384: 40→47 FPS), as zero-padding to restore the spatial layout partially offsets the efficiency gains.
  • Validation is restricted to OSTrack and SUTrack; extension to other tracking architectures (e.g., SeqTrack, ARTrack) has not been explored.
  • The TTA strategy for ST relies on the accuracy of the first-frame bounding box annotation and may be sensitive to inaccurate initialization.
  • Text-guided pruning uses only a single CLIP token to represent text, resulting in coarse semantic granularity that limits utilization of complex textual descriptions.

Related Work

  • One-stream trackers: OSTrack (ECCV'22), SUTrack (ECCV'24), and MixFormerV2 jointly encode templates and search regions.
  • Token pruning/merging: CE (OSTrack) and ProContEXT prune only SR; ToMe performs bipartite soft matching for token merging; EViT retains tokens based on CLS attention; DynamicViT predicts saliency via an MLP.
  • Unified multimodal tracking: UnTrack learns a shared low-rank latent space; SUTrack unifies five task categories; parameter-efficient adaptation methods (prompts/adapters) inject modality-specific information.

Rating

  • Novelty: ⭐⭐⭐⭐ — First joint three-component pruning combined with token type-awareness and text guidance; clear direction with practical significance
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 10 benchmarks, two baselines, three controlled-budget levels, detailed ablations, and progressive pruning analysis
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure, complete mathematical derivations, and rich figures and tables
  • Value: ⭐⭐⭐⭐ — Simple and generalizable method with strong reference value for efficient one-stream trackers, though practical speedup requires further improvement