Beyond Detection: A Structure-Aware Framework for Scene Text Tracking

Conference: ICML2026
arXiv: 2605.17270
Code: https://github.com/EdisonYCM/SymTrack
Area: Video Understanding
Keywords: Scene text tracking, Visual object tracking, Dual-branch architecture, Text feature calibration, Adaptive inference

TL;DR

The paper proposes SymTrack, a detection-free dual-branch framework for scene text tracking. It addresses the feature bottleneck caused by perspective distortion through Predictive Token Rectification (PTR), reduces visual ambiguity between text instances via Cross-Expert Calibration (CEC), and stabilizes fine-grained localization with an Adaptive Inference Engine (AIE). It outperforms prior SOTA trackers on three benchmarks, by up to +12.32% AUC.

Background & Motivation

Background: Current video text tracking is primarily performed as a byproduct of Video Text Spotting (VTS) frameworks, which execute detection, recognition, and association per frame. This approach is computationally expensive and highly sensitive to detection failure—once a frame is missed, the trajectory breaks. Another direction involves applying general visual object trackers (e.g., OSTrack, ODTrack), but these models lack the feature modeling capabilities specific to text.

Limitations of Prior Work: General trackers face three challenges: (1) Perspective distortion—the planar structure of text deforms drastically under varying viewpoints, causing severe misalignment between template and search features, which shallow prediction heads struggle to decode; (2) High visual ambiguity—adjacent text characters share similar structures, and general models lack text-specific discriminative power, leading to drift; (3) Fine-grained structural sensitivity—slight deviations in text localization alter semantic content, yet frame-wise matching lacks sufficient temporal modeling to suppress jitter.

Key Challenge: The combination of strong ViT backbones with shallow prediction heads creates an information bottleneck—the encoder is powerful but the decoder is too weak to effectively distinguish targets from distractors in the feature space. Furthermore, general trackers lack domain priors to leverage the high-frequency structural features unique to text.

Key Insight: The authors advocate a "tracking-first, detection-free" paradigm shift: instead of relying on frame-by-frame detection, text structure is modeled directly across consecutive frames, with text-specific feature experts introduced as orthogonal supplements.

Core Idea: A collaborative dual-branch architecture is employed for simultaneous feature rectification (spatial and temporal) and text semantic calibration (cross-modal priors). During inference, an adaptive engine dynamically adjusts search regions and applies temporal smoothing to address all three core challenges.

Method

Overall Architecture

SymTrack takes the first-frame template and current search region as input, which are jointly encoded by a ViT backbone to produce the search feature map \(F_x\). Features then enter a collaborative dual-branch structure: the upper branch (PTR) uses template semantics to spatially rectify \(F_x\), while the lower branch (CEC) utilizes a frozen text expert to provide text-prior calibration masks. The outputs of both branches are fused via element-wise multiplication and fed into a lightweight prediction head. During inference, AIE further stabilizes the output to produce the final bounding box.
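
The data flow above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`ptr_gate`, `cec_mask`), the single-layer `tanh` stand-in for the MLP, the reading of depth-wise correlation as a per-channel 1x1 kernel, the single-head attention without linear projections, and the random tensors standing in for ViT and TokenFD features are all hypothetical. The gating and attention formulas follow the ones listed under Key Designs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ptr_gate(F_x, Z, W_mlp, w_conv):
    """PTR-style soft rectification: F_hat = F_x * sigmoid(conv1x1(F_x (*) w_m))."""
    q_sem = Z.mean(axis=0)                  # (C,) mean of the template tokens
    w_m = np.tanh(W_mlp @ q_sem)            # (C,) one-layer stand-in for the MLP (assumption)
    corr = F_x * w_m[:, None, None]         # depth-wise correlation, 1x1 kernel per channel (assumption)
    logits = np.tensordot(w_conv, corr, axes=([0], [0]))  # 1x1 conv down to one channel: (H, W)
    M = sigmoid(logits)                     # probabilistic gating mask in [0, 1]
    return F_x * M[None], M

def cec_mask(X_txt, Z_txt, w_head, hw):
    """CEC-style mask: search tokens attend to template tokens (single head, assumption)."""
    d = X_txt.shape[-1]
    attn = softmax(X_txt @ Z_txt.T / np.sqrt(d))   # (N_x, N_z), queries = search tokens
    E_txt = attn @ Z_txt                           # (N_x, D), values = template tokens
    Y = layer_norm(X_txt + E_txt)                  # residual connection + LayerNorm
    return sigmoid(Y @ w_head).reshape(hw)         # 1x1 conv head as a linear map: (H, W)

# Random placeholders for backbone and text-expert features (hypothetical shapes).
rng = np.random.default_rng(0)
C, D, H, W, N_z = 16, 32, 8, 8, 10
F_x = rng.normal(size=(C, H, W))
Z = rng.normal(size=(N_z, C))
F_hat, M = ptr_gate(F_x, Z, 0.1 * rng.normal(size=(C, C)), 0.1 * rng.normal(size=C))
M_calib = cec_mask(rng.normal(size=(H * W, D)), rng.normal(size=(N_z, D)),
                   0.1 * rng.normal(size=D), (H, W))

# Fusion of the two branches by element-wise multiplication, as described above.
fused = F_hat * M_calib[None]
print(fused.shape)  # (16, 8, 8)
```

Because both masks lie in [0, 1], the fused features can only attenuate the backbone activations, which is consistent with the description of both branches as suppressing distractors rather than adding new signal.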

Key Designs

  1. Predictive Token Rectification (PTR):

    • Function: Pre-rectifies search feature maps to mitigate the information bottleneck between the backbone and the shallow head.
    • Mechanism: Extracts a semantic query \(\mathbf{q}_{\text{sem}} = \frac{1}{N_z}\sum_{i=1}^{N_z}z_i\) from template tokens \(\mathcal{Z}\), which is mapped to channel-wise modulation weights \(\mathbf{w}_m \in \mathbb{R}^C\) via an MLP. A probabilistic gating mask \(M = \sigma(\text{Conv}_{1\times1}(F_x \circledast \mathbf{w}_m))\) is generated via depth-wise correlation, and \(\hat{F}_x = F_x \odot M\) completes the soft rectification in the feature space.
    • Design Motivation: Performing rectification at the feature level instead of the geometric level avoids resampling noise. Template-driven gating adaptively suppresses distractor activations caused by perspective distortion.
  2. Cross-Expert Calibration (CEC):

    • Function: Injects text domain priors to resolve high ambiguity in general visual features for specific text instances.
    • Mechanism: A parallel branch uses a frozen text-specific high-resolution backbone (TokenFD visual encoder) to extract text features \(\mathcal{Z}_{txt}\) and \(\mathcal{X}_{txt}\). After linear projection for dimensionality alignment, multi-head cross-attention is performed using search features as queries and template features as keys/values: \(\mathcal{E}_{txt} = \mathcal{A}_{cross}(\mathcal{X}'_{txt}, \mathcal{Z}'_{txt}, \mathcal{Z}'_{txt})\). After residual connection and LayerNorm, a convolutional head generates a calibration mask \(M_{calib} \in [0,1]^{H\times W}\).
    • Design Motivation: The frozen text expert retains fine-grained discriminative capabilities learned from large-scale text data. Cross-attention focuses the calibration mask on regions consistent with the template text, effectively suppressing background noise and similar text distractors.
  3. Adaptive Inference Engine (AIE):

    • Function: Dynamically adjusts search regions and smooths trajectories during inference without further training to improve fine-grained robustness.
    • Mechanism: Dynamic Search Region: when the prediction confidence \(c(S)\) falls below \(\tau_{\text{uncert}}=0.98\), the tracker re-infers with the search region rescaled by the factors \(\{0.95, 1.05\}\) and selects the best scale. Temporal Regularization: a constant-velocity linear state-space model with state \(s_t = [c_x, c_y, v_x, v_y]^T\) is maintained, and its motion prediction is fused with the visual tracking output using a fusion weight \(\alpha_{kalman}=0.5\).
    • Design Motivation: Text scales change rapidly under perspective shifts; static windows often lose the target. Kalman-based temporal regularization leverages motion continuity to suppress jitter and long-term drift.
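
The two AIE steps can be sketched as below. This is a simplified illustration, not the paper's code: the `run_tracker` callback is hypothetical, choosing the "best" scale by highest confidence is an assumption (the summary does not state the criterion), and the fixed-weight blend with a finite-difference velocity update is a stand-in for a full Kalman filter. The constants 0.98, {0.95, 1.05}, and 0.5 come from the text above.

```python
import numpy as np

TAU_UNCERT = 0.98            # confidence threshold from the text
SCALE_FACTORS = (0.95, 1.05) # rescaling factors from the text
ALPHA_KALMAN = 0.5           # fusion weight from the text

def select_scale(run_tracker, base_scale, tau=TAU_UNCERT, factors=SCALE_FACTORS):
    """Re-infer at rescaled search regions only when confidence is low.

    run_tracker(scale) -> (box, confidence) is a hypothetical callback.
    """
    best_box, best_conf = run_tracker(base_scale)
    best_scale = base_scale
    if best_conf < tau:
        for f in factors:
            box, conf = run_tracker(base_scale * f)
            if conf > best_conf:  # assumption: "best" means highest confidence
                best_box, best_conf, best_scale = box, conf, base_scale * f
    return best_box, best_conf, best_scale

def predict_constant_velocity(state, dt=1.0):
    """One step of the constant-velocity model for s = [c_x, c_y, v_x, v_y]."""
    A = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    return A @ np.asarray(state, dtype=float)

def fuse_center(state, measured_center, alpha=ALPHA_KALMAN, dt=1.0):
    """Blend the motion prediction with the tracker's measured center."""
    state = np.asarray(state, dtype=float)
    pred = predict_constant_velocity(state, dt)
    center = alpha * pred[:2] + (1.0 - alpha) * np.asarray(measured_center, dtype=float)
    velocity = (center - state[:2]) / dt  # finite-difference velocity (assumption)
    return np.concatenate([center, velocity])

# Previous state at (10, 20) moving by (2, -1); the tracker reports (13, 18).
print(fuse_center([10.0, 20.0, 2.0, -1.0], (13.0, 18.0)).tolist())
# [12.5, 18.5, 2.5, -1.5]
```

With \(\alpha_{kalman}=0.5\) the blend is symmetric, so the smoothed center sits halfway between the motion prediction (12, 19) and the visual measurement (13, 18); a larger weight on the prediction would suppress jitter more strongly at the cost of slower reaction to real motion.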

Key Experimental Results

| Benchmark | Metric | SymTrack | Prev. SOTA | Gain |
| --- | --- | --- | --- | --- |
| ArTVideoSOT | AUC | 77.74% | ROMTrack 70.62% | +7.12% |
| DSTextSOT | AUC | 70.66% | ODTrack 62.71% | +7.95% |
| BOVTextSOT | AUC | 77.06% | ODTrack 64.74% | +12.32% |
| ArTVideoSOT | Precision | 95.88% | ROMTrack 87.13% | +8.75% |

Ablation Study (ArTVideoSOT):

| Component | AUC | Gain |
| --- | --- | --- |
| Baseline (No PTR/CEC/AIE) | 69.50% | |
| +PTR | 74.46% | +4.96% |
| +PTR+CEC | 76.58% | +2.12% |
| +PTR+CEC+AIE (Full) | 77.74% | +1.16% |

| Model | AUC | Params | Speed |
| --- | --- | --- | --- |
| SymTrack (Full) | 77.74% | 395.9M | 22 fps |
| SymTrack w/o TokenFD | 75.45% | 92.7M | 89 fps |
| SeqTrack | 64.35% | 306.5M | 16 fps |
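
As a quick consistency check on the figures above, the short script below recomputes the incremental ablation gains and the cost of the TokenFD branch directly from the reported numbers. It only does arithmetic on values already in the tables; the variable names are illustrative.

```python
# AUC values (%) from the ArTVideoSOT ablation, in the order components are added.
ablation = [
    ("Baseline", 69.50),
    ("+PTR", 74.46),
    ("+PTR+CEC", 76.58),
    ("+PTR+CEC+AIE", 77.74),
]

# Gain of each step over the previous row, and the total gain over the baseline.
gains = [round(cur - prev, 2) for (_, prev), (_, cur) in zip(ablation, ablation[1:])]
total_gain = round(ablation[-1][1] - ablation[0][1], 2)

# Cost of the frozen TokenFD expert: extra parameters, speed ratio, and AUC benefit.
extra_params_m = round(395.9 - 92.7, 1)
slowdown = 89 / 22
auc_benefit = round(77.74 - 75.45, 2)

print(gains)           # [4.96, 2.12, 1.16]
print(total_gain)      # 8.24
print(extra_params_m)  # 303.2
print(auc_benefit)     # 2.29
```

The step gains match the table and sum to a total of +8.24% AUC over the baseline, while the TokenFD branch accounts for about 303M parameters and roughly a 4x slowdown in exchange for +2.29% AUC.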

Key Findings: Even when the strongest competitor (ODTrack) is fine-tuned on text tracking data, SymTrack still leads by +9.83% AUC on BOVTextSOT, suggesting that the performance gap stems from the architecture rather than the data domain.

Highlights & Insights

  • Paradigm Shift Value: VTS methods fail almost completely under SOT-format evaluation (e.g., TransDETR achieves only 9.18% AUC vs SymTrack's 77.74%), demonstrating the necessity of the "detection-free" approach for text tracking.
  • Collaborative Synergy: PTR contributes +4.96% AUC and CEC adds another +2.12% on top of PTR, indicating that the two branches are complementary rather than redundant.
  • AIE Impact: The introduction of AIE increased the average Search Region Coverage (SRC) from 83.27% to 95.25% (+11.98%), proving crucial for handling drastic scale changes under perspective distortion.
  • Efficiency without TokenFD: The lightweight version (92.7M parameters, 89 fps) still achieves 75.45% AUC on ArTVideoSOT, above the 70.62% of the previous best tracker (ROMTrack), which makes it suitable for real-time applications.

Limitations & Future Work

  • The frozen TokenFD text expert introduces approximately 300M additional parameters, reducing inference speed from 89 fps to 22 fps.
  • Benchmark datasets are converted from VTS annotations and lack dedicated labels for long-term occlusion and extreme motion specifically designed for SOT.
  • Hyperparameters for AIE (\(\tau_{\text{uncert}}\), \(\alpha_{kalman}\)) are manually set; adaptive learning strategies have not been explored.
  • Evaluation was limited to English and Chinese; generalization to complex script systems like Arabic remains unknown.

Related Work & Insights

  • Evolution of General Trackers: Progress from SiamRPN++ to TransT, then to one-stream (OSTrack) and sequential token modeling (ODTrack), all of which lack specialized text feature modeling.
  • VTS Paradigm: Models like TransVTSpotter and TransDETR treat tracking as a byproduct of detection; a single-frame detection failure results in irreversible trajectory breakage.
  • Text Feature Experts: TokenFD, a visual encoder pretrained on large-scale text data, provides high-fidelity text priors for CEC.
  • Insight: This work demonstrates that for domain-specific tracking, a "Domain Expert + General Backbone" fusion paradigm is superior to purely general models or end-to-end systems. This may extend to other fine-grained tracking tasks such as sheet music, barcodes, or license plates.

Rating

  • Novelty: 8/10 — Successfully defines the scene text tracking task and proposes a dedicated framework with an innovative dual-branch design.
  • Experimental Thoroughness: 9/10 — Comprehensive comparisons across three benchmarks, fine-tuning control experiments, detailed ablations, and visualization.
  • Writing Quality: 8/10 — Clear problem analysis and well-motivated methods, though some mathematical notation is redundant.
  • Value: 7/10 — Establishes new tasks and benchmarks, though the application area is niche and real-time performance needs further optimization.