TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Conference: CVPR 2026
arXiv: 2512.14698
Code: timelens-arc-lab.github.io
Area: Video Temporal Grounding / Multimodal LLM
Keywords: video temporal grounding, data quality, RLVR, timestamp encoding, benchmark refinement
TL;DR
This work systematically investigates which factors matter most for building Video Temporal Grounding (VTG) capability in MLLMs. Along two dimensions, data quality and algorithmic design, the authors release the refined TimeLens-Bench evaluation suite and the TimeLens-100K training set. With interleaved text-time encoding and a thinking-free RLVR training paradigm, they train the TimeLens model series, which reaches SOTA among open-source models; TimeLens-8B also surpasses GPT-5 and Gemini-2.5-Flash on TimeLens-Bench.
Background & Motivation
Background: MLLMs perform exceptionally well in "what" understanding but are significantly deficient in "when" capabilities. VTG (localizing specific segments given a video and a text query) is a core task for establishing temporal awareness, yet research methods are fragmented and lack unified best practices.
Limitations of Prior Work:
- The quality of existing VTG benchmarks is concerning: in Charades-STA, 20.6% of samples violate query uniqueness and 34.9% have annotation precision issues; multiple datasets also suffer from non-existent events, ambiguous queries, and information leakage.
- Different open-source methods use varying training data and experimental settings, preventing fair comparisons of design choices like temporal encoding and training strategies.
- The error rate in training data (sourced from multiple datasets) is even higher than that of the evaluation benchmarks.
Key Challenge: Model rankings shift drastically once the benchmarks are fixed. Open-source models scored higher than GPT-5 on the original benchmarks, but the ranking reversed after refinement, which indicates that the previous evaluation standards were unreliable.
Goal: To establish a reliable data foundation for VTG and systematically explore optimal algorithmic design principles.
Key Insight: Instead of introducing complex new methods, this work conducts incremental but necessary systematic baseline research along the lines of data quality and algorithmic design.
Core Idea: Data quality refinement + Interleaved text-time encoding + Thinking-free RLVR = A simple and optimal VTG solution.
Method
Overall Architecture
TimeLens does not simply propose a new model but rather answers "which factors are critical to correctly implement VTG in MLLMs." It proceeds along two lines: at the data level, it diagnoses and refines three mainstream benchmarks to release TimeLens-Bench, then automates re-annotation of training data to create TimeLens-100K; at the algorithmic level, it systematically compares timestamp encoding methods, training paradigms, and RLVR recipes to ultimately train TimeLens-7B/8B.
graph TD
subgraph DATA["Data Foundation"]
direction TB
A["Existing VTG Benchmarks<br/>Dirty Annotations"] --> B["Manual Diagnose-then-Refine<br/>6 Criteria + Cross-validation"]
B --> C["TimeLens-Bench<br/>3 Refined Benchmarks"]
A2["Existing Training Corpora<br/>Higher Error Rates"] --> D["Automated Re-annotation<br/>MLLM Relabeling from Scratch"]
D --> E["TimeLens-100K<br/>High-quality Training Set"]
end
E --> F["Interleaved Text-Time Encoding<br/>Timestamp Prefixes Before Frame Tokens"]
F --> G["Thinking-free RLVR<br/>VTG is Perception, Remove CoT"]
G --> H["RLVR Recipe<br/>Early Stopping + Difficulty Sampling"]
H --> I["TimeLens-7B / 8B<br/>Open-source SOTA"]
Key Designs
1. Data Foundation: Cleaning both Evaluation and Training Data
The authors found alarming error rates in existing VTG benchmarks (20.6% of Charades-STA samples violate query uniqueness, and 34.9% have precision issues), which makes comparisons on these benchmarks untrustworthy. For evaluation, TimeLens-Bench establishes 6 strict annotation criteria (query clarity, query uniqueness, event existence, avoidance of information leakage, annotation precision, and annotation completeness) and uses a "Diagnose-then-Refine" workflow in which the same annotator both detects and fixes errors, for efficiency and quality. After multiple rounds of cross-validation, with batches that showed high error rates sent back for re-work, three refined benchmarks were produced: Charades-TimeLens, ActivityNet-TimeLens, and QVHighlights-TimeLens. The reversal in model rankings (open-source models dropping below GPT-5) confirms the necessity of this step. For training, manual audits revealed even higher error rates in the training corpora than in the benchmarks. The authors therefore built an automated re-annotation pipeline that uses advanced multimodal models to re-annotate videos from scratch rather than patching old labels, yielding TimeLens-100K with 100,000 high-quality samples. This pipeline is independent of the manual benchmark refinement, which avoids contamination between training and evaluation data.
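The six criteria can be pictured as a per-sample checklist. The sketch below is only an illustration of that idea: the class name `Diagnosis`, its field names, and the `error_rate` helper are hypothetical and are not taken from the TimeLens release.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Diagnosis:
    """One annotator's verdict on a (video, query, segment) sample.

    Field names are hypothetical; each one maps to one of the six criteria.
    True means the sample passes that criterion.
    """
    query_clear: bool
    query_unique: bool
    event_exists: bool
    no_info_leakage: bool
    segment_precise: bool
    segment_complete: bool

    def violations(self) -> List[str]:
        # Names of the criteria this sample fails.
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def needs_refinement(self) -> bool:
        # Any failed criterion sends the sample to the "refine" step.
        return bool(self.violations())


def error_rate(diagnoses: List[Diagnosis], criterion: str) -> float:
    """Fraction of samples that violate a single criterion."""
    if not diagnoses:
        return 0.0
    bad = sum(1 for d in diagnoses if criterion in d.violations())
    return bad / len(diagnoses)
```

Statistics such as "20.6% of samples violate query uniqueness" correspond to `error_rate(..., "query_unique")` computed over a diagnosed benchmark under this hypothetical schema.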
2. Interleaved Text-Time Encoding: Feeding Time to Models Simply
There has been no consensus on how to represent timestamps for MLLMs. The authors compared three schemes: position-encoding-based (e.g., MRoPE), visual overlay (rendering time text directly on frames), and text encoding (interleaved vs. non-interleaved). Each was tested with two formats: raw timestamps ("10.2s") vs. frame indices ("1, 2, 3"). Results show that interleaved text prefixes + raw timestamps are optimal (mIoU: Charades 48.3, ActivityNet 43.1, QVHighlights 56.7), significantly outperforming position-encoding schemes (36.6, 33.1, 49.2) without requiring architectural changes.
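To make the difference between the text-encoding variants concrete, here is a minimal sketch of how the prompts might be laid out. The `<frame>` placeholder, the function names, and the exact prompt wording are assumptions for illustration; the actual TimeLens templates may differ.

```python
from typing import List

# Hypothetical placeholder standing in for one frame's visual tokens.
FRAME_TOKEN = "<frame>"


def interleaved_prompt(frame_times: List[float], query: str,
                       use_frame_index: bool = False) -> str:
    """Put a textual time prefix directly before each frame placeholder."""
    parts = []
    for index, seconds in enumerate(frame_times, start=1):
        # Raw timestamp ("10.2s") or frame index ("1", "2", ...) as the prefix.
        prefix = str(index) if use_frame_index else f"{seconds:.1f}s"
        parts.append(f"{prefix} {FRAME_TOKEN}")
    return " ".join(parts) + f"\nQuery: {query}"


def non_interleaved_prompt(frame_times: List[float], query: str) -> str:
    """List all frames first, then state their timestamps once as text."""
    frames = " ".join(FRAME_TOKEN for _ in frame_times)
    times = ", ".join(f"{seconds:.1f}s" for seconds in frame_times)
    return f"{frames}\nFrame times: {times}\nQuery: {query}"


print(interleaved_prompt([0.0, 2.0, 4.0], "the person opens the door"))
```

In the interleaved layout each timestamp sits next to the frame it describes, so no change to the position encoding or the visual input is needed.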
3. Thinking-free RLVR: VTG is Perception, Explicit Thinking is Harmful
While thinking/CoT is widely assumed to help reasoning, its benefit for VTG had not been verified. The authors compared four paradigms: SFT, thinking-based RLVR, SFT + thinking-free RLVR, and pure thinking-free RLVR. They conclude that VTG is essentially a perception task rather than a reasoning task. Compared with pure thinking-free RLVR, thinking-based RLVR lowers mIoU on Charades (42.7 vs. 48.3) and ActivityNet (41.2 vs. 43.1), is only slightly higher on QVHighlights (57.8 vs. 56.7), and costs 1.9× the training time. Pure thinking-free RLVR reaches best or near-best performance at 1.0× training time (approx. 4h10m on 8×H20). A preceding SFT stage raises total time to 2.9× without a consistent gain: Charades improves (50.1 vs. 48.3), but ActivityNet (42.7 vs. 43.1) and QVHighlights (55.9 vs. 56.7) drop slightly.
4. RLVR Recipe: Early Stopping + Difficulty Sampling
After selecting thinking-free RLVR, the authors addressed two questions: how long to train and how to sample data. Early Stopping: the IoU reward and the intra-group reward standard deviation are monitored, and training stops once both plateau; continuing beyond that point degrades performance even with high-quality data. Difficulty-based Data Sampling: the model to be trained first runs offline inference on the training data, and each sample's difficulty is derived from the IoU of its prediction. Training samples are then drawn from a Gaussian distribution over difficulty whose mean is shifted toward hard samples. Performance improves as the average difficulty increases and saturates once the mean exceeds 0.75, and approximately 12K samples are sufficient. Each recipe contributes 1-2 mIoU points, and the recipes save over 50% of training time.
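Both recipes can be sketched in a few lines. Everything concrete below is an assumption made for illustration: difficulty is taken as `1 - IoU`, the plateau test compares the means of two adjacent windows, and the sampler uses Efraimidis-Spirakis weighted sampling without replacement. The summary does not specify these details.

```python
import math
import random
from typing import List


def plateaued(history: List[float], window: int = 3, tol: float = 0.005) -> bool:
    """True when the mean of the last `window` values is within `tol`
    of the mean of the `window` values before them."""
    if len(history) < 2 * window:
        return False
    recent = sum(history[-window:]) / window
    previous = sum(history[-2 * window:-window]) / window
    return abs(recent - previous) < tol


def should_stop(iou_rewards: List[float], group_stds: List[float]) -> bool:
    """Stop only when both monitored curves have flattened."""
    return plateaued(iou_rewards) and plateaued(group_stds)


def sample_by_difficulty(ious: List[float], k: int, mean: float = 0.75,
                         std: float = 0.1, seed: int = 0) -> List[int]:
    """Pick k distinct sample indices, weighted by a Gaussian over difficulty.

    Assumption: difficulty = 1 - IoU from the offline inference pass.
    """
    rng = random.Random(seed)
    keyed = []
    for index, iou in enumerate(ious):
        difficulty = 1.0 - iou
        weight = math.exp(-0.5 * ((difficulty - mean) / std) ** 2)
        weight = max(weight, 1e-12)  # guard against underflow to zero
        # Efraimidis-Spirakis key: larger weights tend to get larger keys.
        keyed.append((rng.random() ** (1.0 / weight), index))
    keyed.sort(reverse=True)
    return [index for _, index in keyed[:k]]
```

With `mean=0.75` the sampler concentrates on samples the model currently localizes poorly, which matches the reported trend that performance saturates once the mean difficulty exceeds 0.75.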
Loss & Training
RLVR uses GRPO optimization with the temporal IoU between the predicted and ground-truth segments as the verifiable reward, and the model answers directly without any Chain-of-Thought (i.e., thinking-free). TimeLens-7B is based on Qwen2.5-VL-7B, and TimeLens-8B is based on Qwen3-VL-8B. The 1.0× training time is approximately 4h10m on 8×H20. The early stopping and difficulty sampling recipes are summarized in Key Design 4.
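As a rough sketch of the reward side of this setup, the snippet below computes a temporal IoU reward and the group-normalized advantages commonly used in GRPO. The function names are hypothetical, and using the population standard deviation is an assumption; implementations differ on this detail.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time segments, in [0, 1]."""
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    if pred_end <= pred_start or gt_end <= gt_start:
        return 0.0  # malformed or empty segment earns no reward
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward by its group's mean and std."""
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # the intra-group reward standard deviation
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# One query, four sampled predictions against the same ground truth.
ground_truth = (15.0, 25.0)
rollouts = [(10.0, 20.0), (15.0, 25.0), (0.0, 5.0), (14.0, 26.0)]
rewards = [temporal_iou(p, ground_truth) for p in rollouts]
advantages = group_advantages(rewards)
```

When every rollout in a group earns the same reward, the advantages are all zero and the group provides no learning signal, which is one reason the intra-group standard deviation is a useful quantity to monitor.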
Key Experimental Results
Main Results
Comparison of mIoU on TimeLens-Bench:
| Model | Charades | ActivityNet | QVHighlights | Type |
|---|---|---|---|---|
| GPT-4o | 41.8 | 40.4 | 52.1 | Commercial |
| GPT-5 | 40.5 | 42.9 | 56.8 | Commercial |
| Gemini-2.5-Flash | 48.6 | 52.5 | 64.3 | Commercial |
| Gemini-2.5-Pro | 52.8 | 58.1 | 70.4 | Commercial |
| Time-R1-7B | 36.6 | 33.1 | 49.2 | Open-source |
| MiMo-VL-7B | 39.6 | 35.5 | 41.5 | Open-source |
| Qwen2.5-VL-7B (Baseline) | 39.3 | 31.4 | 31.6 | Open-source |
| TimeLens-7B | 48.8 | 46.2 | 56.0 | Open-source |
| Qwen3-VL-8B (Baseline) | 48.3 | 46.8 | 59.4 | Open-source |
| TimeLens-8B | 55.2 | 53.2 | 65.5 | Open-source |
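The improvement of each TimeLens model over its base model can be read off the table; the short script below just does the subtraction. The numbers are copied from the table above, and the dictionary layout and `gain` helper are illustrative.

```python
# mIoU on (Charades, ActivityNet, QVHighlights), copied from the table above.
RESULTS = {
    "Qwen2.5-VL-7B": (39.3, 31.4, 31.6),
    "TimeLens-7B": (48.8, 46.2, 56.0),
    "Qwen3-VL-8B": (48.3, 46.8, 59.4),
    "TimeLens-8B": (55.2, 53.2, 65.5),
}


def gain(model: str, baseline: str) -> tuple:
    """Per-benchmark mIoU difference, rounded to one decimal place."""
    return tuple(round(m - b, 1)
                 for m, b in zip(RESULTS[model], RESULTS[baseline]))


print(gain("TimeLens-7B", "Qwen2.5-VL-7B"))  # (9.5, 14.8, 24.4)
print(gain("TimeLens-8B", "Qwen3-VL-8B"))    # (6.9, 6.4, 6.1)
```

The gain is largest where the base model is weakest (Qwen2.5-VL-7B on QVHighlights), while the stronger Qwen3-VL-8B base still improves by about 6 to 7 points on every benchmark.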
Ablation Study
Training Paradigm Comparison (using TimeLens-100K training data):
| Training Paradigm | Charades mIoU | ActivityNet mIoU | QVHighlights mIoU | Training Time |
|---|---|---|---|---|
| SFT (32K) | 47.4 | 39.9 | 52.0 | 1.0× |
| SFT (100K) | 48.6 | 39.7 | 49.0 | 2.4× |
| Thinking-based RLVR | 42.7 | 41.2 | 57.8 | 1.9× |
| SFT + Thinking-free RLVR | 50.1 | 42.7 | 55.9 | 2.9× |
| Thinking-free RLVR | 48.3 | 43.1 | 56.7 | 1.0× |
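The relative training times can be converted into approximate wall-clock figures. This assumes that every multiplier in the table is relative to the same 1.0× reference of about 4h10m on 8×H20 mentioned for thinking-free RLVR; the helper name is illustrative.

```python
BASE_MINUTES = 4 * 60 + 10  # 1.0x reference: approx. 4h10m on 8xH20

# Relative training time per paradigm, copied from the table above.
MULTIPLIERS = {
    "SFT (32K)": 1.0,
    "SFT (100K)": 2.4,
    "Thinking-based RLVR": 1.9,
    "SFT + Thinking-free RLVR": 2.9,
    "Thinking-free RLVR": 1.0,
}


def wall_clock(multiplier: float) -> str:
    """Approximate wall-clock time for a relative training cost."""
    total_minutes = round(BASE_MINUTES * multiplier)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h{minutes:02d}m"


for paradigm, multiplier in MULTIPLIERS.items():
    print(f"{paradigm}: {wall_clock(multiplier)}")
```

Under that assumption, adding an SFT stage before thinking-free RLVR costs roughly 12 hours instead of about 4, for the mixed accuracy changes shown in the table.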
Key Findings
- TimeLens-8B achieves mIoU of 55.2/53.2/65.5 across three benchmarks, surpassing GPT-5 (40.5/42.9/56.8) and Gemini-2.5-Flash (48.6/52.5/64.3).
- Open-source models showed inflated performance on the original benchmarks; rankings reversed after refinement, indicating that the original benchmarks were unreliable.
- Thinking-free RLVR achieves best or near-best performance with minimal training time (1.0×); explicit thinking decreases Charades mIoU (42.7 vs. 48.3).
- Interleaved text encoding consistently outperforms visual overlay and position encoding across all benchmarks.
- Early stopping and difficulty sampling contribute approximately 1-2 mIoU gain and save over 50% of training time.
Highlights & Insights
- Positioning the work as an "essential baseline rather than a new method" is honest, and the scale of the data refinement is large enough that its impact may exceed that of many methodological papers.
- The "ranking flip" after benchmark refinement is the most impactful discovery, implying that prior conclusions based on old benchmarks need re-evaluation.
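- The insight that "VTG is perception, not reasoning" is counter-intuitive: in these experiments, CoT/thinking lowers Charades and ActivityNet mIoU while nearly doubling training time, so it is harmful rather than merely unnecessary for VTG.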
- The two RLVR recipes (early stopping + difficulty sampling) have broad applicability for other tasks with verifiable rewards.
- The victory of interleaved text encoding suggests: Simple scheme + good data > complex architectural modifications.
Limitations & Future Work
- Benchmark refinement requires extensive manual labor (annotator training, cross-validation), limiting scalability.
- Thinking-free RLVR might not apply to complex temporal reasoning tasks (e.g., event localization requiring causal reasoning).
- Validated only on Qwen2.5-VL/Qwen3-VL; the transferability of best practices to other architectures like InternVL or LLaVA remains to be investigated.
- No quantitative analysis was conducted on the quality gap between automated re-annotation (TimeLens-100K) and manual annotation.
- Multi-granularity temporal grounding (e.g., joint moment retrieval and video summarization) was not explored.
Related Work & Insights
- vs. Time-R1: Both use RLVR, but Time-R1 uses thinking-based RLVR and reaches mIoU of only 36.6/33.1/49.2, far below TimeLens. The gap is attributed to data quality and the thinking-free design.
- vs. TRACE/TRACE-uni: Specialized VTG models with mIoU of only 27.1-28.1/32.7-33.6/39.0-39.8, falling short of strong MLLM-based solutions.
- vs. TimeSuite: Another systematic VTG solution; its mIoU of 19.8 on ActivityNet suggests that data and training strategies are more critical than model design.
- Insight: The research paradigm of Data Refinement → Fair Evaluation → Establishing Best Practices is a model for other tasks like detection or segmentation.
Rating
- Novelty: ⭐⭐⭐ (Incremental methodology, but systemic value is high)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Exhaustive exploration of encoding schemes, training paradigms, and RLVR recipes)
- Writing Quality: ⭐⭐⭐⭐⭐ (Clear structure, well-supported findings, and convincing visualization of ranking flips)
- Value: ⭐⭐⭐⭐⭐ (Benchmark refinement and best practices are extremely useful; TimeLens-Bench may become the new standard)