# TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
- Conference: CVPR 2026
- arXiv: 2512.14698
- Code: timelens-arc-lab.github.io
- Area: Video Temporal Grounding / Multimodal LLM
- Keywords: video temporal grounding, data quality, RLVR, timestamp encoding, benchmark refinement
## TL;DR
This paper systematically investigates the key factors for building video temporal grounding (VTG) capabilities in MLLMs from two dimensions — data quality and algorithm design. It releases the high-quality benchmark TimeLens-Bench and training set TimeLens-100K, and constructs the TimeLens model series via interleaved textual timestamp encoding combined with a thinking-free RLVR training paradigm, achieving state-of-the-art performance among open-source models and surpassing GPT-5 and Gemini-2.5-Flash.
## Background & Motivation
Background: MLLMs excel at "what" understanding but are critically deficient in "when" understanding. VTG — localizing temporal segments in a video given a text query — is the core task for establishing temporal awareness, yet research approaches are highly fragmented with no unified best practices.
Limitations of Prior Work:
- Existing VTG benchmarks suffer from serious quality issues: 20.6% of samples in Charades-STA violate query uniqueness, and 34.9% exhibit annotation precision problems; multiple datasets contain errors such as non-existent events, ambiguous queries, and information leakage.
- Different open-source methods use heterogeneous training data and experimental setups, making fair comparison of design choices (e.g., temporal encoding, training strategies) infeasible.
- Training data (aggregated from multiple source datasets) exhibits even higher error rates than the evaluation benchmarks.
Key Challenge: Model rankings shift dramatically after benchmark correction; open-source models that beat GPT-5 on the original benchmarks fall behind it once the annotations are fixed, demonstrating that prior evaluation standards are unreliable.
Goal: Establish a reliable data foundation for VTG and systematically explore optimal algorithmic design principles.
Key Insight: Rather than introducing a new, complex method, this work conducts incremental but necessary systematic baseline studies along two axes: data quality and algorithm design.
Core Idea: Data quality refinement + interleaved textual timestamp encoding + thinking-free RLVR = a simple yet optimal solution for VTG.
## Method
### Overall Architecture
The work proceeds along two dimensions: data quality and algorithm design. On the data side: diagnose and refine three mainstream benchmarks → release TimeLens-Bench + automatically re-annotate training data → TimeLens-100K. On the algorithm side: systematically compare timestamp encoding schemes → training paradigms → RLVR training recipes → ultimately build TimeLens-7B/8B.
### Key Designs
**1. TimeLens-Bench: High-Quality Evaluation Benchmark**
- Defines 6 strict annotation criteria: query clarity/uniqueness, event existence, avoidance of information leakage, annotation precision/completeness.
- Diagnose-then-Refine workflow: the same annotator handles both error detection and correction to balance efficiency and quality.
- Multi-round cross-validation with quality control: each batch is reviewed by a different annotator, and batches exceeding an error threshold are reworked in full (see the sketch after this list).
- Produces Charades-TimeLens / ActivityNet-TimeLens / QVHighlights-TimeLens.
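A minimal sketch of the batch-level gate in that QC loop, assuming a hypothetical `Annotation` record and an illustrative 10% error threshold (the paper describes a threshold-based rework rule but its exact value is not given here):

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    query: str
    start: float            # segment start (seconds)
    end: float              # segment end (seconds)
    passed_review: bool     # verdict from the second, cross-validating annotator

def batch_needs_rework(batch: list[Annotation], max_error_rate: float = 0.10) -> bool:
    """Return True when a reviewed batch must be redone in full.

    Mirrors the paper's rule that batches exceeding an error threshold are
    entirely reworked; the 0.10 threshold is an assumed placeholder.
    """
    errors = sum(not a.passed_review for a in batch)
    return errors > max_error_rate * len(batch)
```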
**2. Interleaved Textual Timestamp Encoding**
- Compares three encoding families: position-encoding-based (MRoPE, etc.), visual overlay (rendering timestamps directly on frames), and textual encoding (interleaved vs. non-interleaved).
- Each scheme is further compared across two time formats: raw timestamps (e.g., "10.2s") vs. frame indices (e.g., "1, 2, 3").
- Conclusion: interleaved textual prefix with raw timestamps is optimal (mIoU: Charades 48.3, ActivityNet 43.1, QVHighlights 56.7), substantially outperforming position-encoding-based approaches (36.6, 33.1, 49.2), and requires no architectural modifications.
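To make the winning scheme concrete, here is a minimal sketch of an interleaved textual prompt with raw timestamps, written against the common Hugging Face multimodal chat format (the field names and instruction wording are assumptions, not the authors' exact template):

```python
def build_interleaved_prompt(frame_times: list[float], query: str) -> list[dict]:
    """Interleave raw-timestamp text with frame placeholders, one pair per frame.

    Each frame is prefixed by its timestamp as plain text (e.g., "10.2s"),
    the variant the paper finds superior to frame indices and to
    position-encoding-based schemes.
    """
    content: list[dict] = []
    for t in frame_times:
        content.append({"type": "text", "text": f"{t:.1f}s"})  # textual timestamp prefix
        content.append({"type": "image"})                      # frame sampled at time t
    content.append({
        "type": "text",
        "text": f'Locate the segment matching: "{query}". '
                "Answer with the start and end time in seconds.",
    })
    return [{"role": "user", "content": content}]

# Example: 8 frames uniformly sampled from a 72-second video
messages = build_interleaved_prompt([i * 9.0 for i in range(8)], "a person opens the door")
```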
**3. Thinking-Free RLVR Training Paradigm**
- Systematically compares four paradigms: SFT / thinking-based RLVR / SFT + thinking-free RLVR / pure thinking-free RLVR.
- Core finding: VTG is fundamentally a perception task rather than a reasoning task; explicit thinking processes are detrimental.
- Pure thinking-free RLVR achieves the best performance with 1.0× training time (~4h10m on 8×H20 GPUs).
- A preceding SFT stage provides no significant benefit (SFT+RLVR at 2.9× time vs. pure RLVR at 1.0× time, with comparable performance).
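"Thinking-free" means the model emits the segment directly, with no chain-of-thought between prompt and answer, so both rollout and reward computation stay cheap. A sketch of the answer parsing such a setup relies on, assuming an output format like `25.0 - 31.4` in seconds (the exact format is an assumption):

```python
import re

def parse_span(answer: str) -> tuple[float, float] | None:
    """Extract a (start, end) segment in seconds from a direct answer.

    Returns None for malformed outputs, which the RLVR loop can simply
    score with zero reward.
    """
    m = re.search(r"(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)", answer)
    if m is None:
        return None
    start, end = float(m.group(1)), float(m.group(2))
    return (start, end) if end > start else None
```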
### Loss & Training
- GRPO with a verifiable reward based on temporal-segment IoU; no chain-of-thought is used (the reward and the difficulty-based sampling below are sketched after this list).
- Early stopping strategy: training is halted when both IoU reward and within-group reward standard deviation plateau simultaneously; continued training leads to performance degradation.
- Difficulty-based data sampling: the model being trained runs offline inference on the training data to compute IoU-based difficulty scores; Gaussian sampling then biases selection toward hard samples (performance saturates once the Gaussian mean exceeds 0.75), and approximately 12K samples suffice.
- TimeLens-7B is based on Qwen2.5-VL-7B; TimeLens-8B is based on Qwen3-VL-8B.
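As referenced in the list above, here is a sketch of two of those ingredients under assumed interfaces: the temporal-IoU verifiable reward that GRPO maximizes, and Gaussian difficulty-based sampling over offline per-sample IoU scores. The Gaussian mean of 0.75 matches the reported saturation point; `sigma` and the pool in the example are illustrative.

```python
import numpy as np

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Verifiable reward: IoU between predicted and ground-truth segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def gaussian_difficulty_weights(ious: np.ndarray, mu: float = 0.75,
                                sigma: float = 0.15) -> np.ndarray:
    """Sampling weights biased toward hard training samples.

    `ious` are offline IoU scores of the model being trained on each sample;
    difficulty = 1 - IoU. Weights follow a Gaussian over difficulty centered
    at `mu` (0.75, the reported saturation point); `sigma` is assumed.
    """
    difficulty = 1.0 - ious
    weights = np.exp(-0.5 * ((difficulty - mu) / sigma) ** 2)
    return weights / weights.sum()

# Example: draw ~12K hard-biased samples from a larger pool of scored data
rng = np.random.default_rng(0)
pool_ious = rng.uniform(0.0, 1.0, size=100_000)   # placeholder offline IoU scores
probs = gaussian_difficulty_weights(pool_ious)
chosen = rng.choice(len(pool_ious), size=12_000, replace=False, p=probs)
```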
## Key Experimental Results
### Main Results
mIoU comparison on TimeLens-Bench:
| Model | Charades | ActivityNet | QVHighlights | Type |
|---|---|---|---|---|
| GPT-4o | 41.8 | 40.4 | 52.1 | Commercial |
| GPT-5 | 40.5 | 42.9 | 56.8 | Commercial |
| Gemini-2.5-Flash | 48.6 | 52.5 | 64.3 | Commercial |
| Gemini-2.5-Pro | 52.8 | 58.1 | 70.4 | Commercial |
| Time-R1-7B | 36.6 | 33.1 | 49.2 | Open-source |
| MiMo-VL-7B | 39.6 | 35.5 | 41.5 | Open-source |
| Qwen2.5-VL-7B (baseline) | 39.3 | 31.4 | 31.6 | Open-source |
| TimeLens-7B | 48.8 | 46.2 | 56.0 | Open-source |
| Qwen3-VL-8B (baseline) | 48.3 | 46.8 | 59.4 | Open-source |
| TimeLens-8B | 55.2 | 53.2 | 65.5 | Open-source |
### Ablation Study
Training paradigm comparison (trained on TimeLens-100K):
| Training Paradigm | Charades mIoU | ActivityNet mIoU | QVHighlights mIoU | Training Time |
|---|---|---|---|---|
| SFT (32K) | 47.4 | 39.9 | 52.0 | 1.0× |
| SFT (100K) | 48.6 | 39.7 | 49.0 | 2.4× |
| Thinking-based RLVR | 42.7 | 41.2 | 57.8 | 1.9× |
| SFT + Thinking-free RLVR | 50.1 | 42.7 | 55.9 | 2.9× |
| Thinking-free RLVR | 48.3 | 43.1 | 56.7 | 1.0× |
### Key Findings
- TimeLens-8B achieves mIoU of 55.2/53.2/65.5 across three benchmarks, surpassing GPT-5 (40.5/42.9/56.8) and Gemini-2.5-Flash (48.6/52.5/64.3).
- Open-source models appear stronger on the original benchmarks, but rankings reverse dramatically after correction, confirming that the original benchmarks are unreliable.
- Thinking-free RLVR achieves the best or near-best performance with minimal training time (1.0×); explicit thinking degrades Charades mIoU (42.7 vs. 48.3).
- Interleaved textual encoding consistently outperforms visual overlay and position-encoding approaches across all three benchmarks.
- Early stopping and difficulty-based sampling each contribute approximately 1–2 mIoU improvement while saving over 50% of training time.
## Highlights & Insights
- The positioning of this work as "necessary baselines rather than a new method" is remarkably candid; the benchmark refinement effort is substantial, and its impact far exceeds that of a typical methodology paper.
- The ranking reversal following benchmark correction is the most striking finding in the paper — implying that prior comparative conclusions drawn from the original benchmarks need to be revisited.
- The finding that "VTG is perception rather than reasoning" is counterintuitive: CoT/thinking is not only unhelpful for VTG but actively harmful.
- The two RLVR training insights (early stopping + difficulty-based sampling) have broad applicability to other tasks with verifiable rewards.
- The superiority of interleaved textual encoding demonstrates that simple approaches combined with high-quality data outperform complex architectural modifications.
## Limitations & Future Work
- Benchmark refinement requires substantial human involvement (annotator training, cross-validation), limiting scalability.
- Thinking-free RLVR may not generalize to more complex temporal reasoning tasks (e.g., event localization requiring causal inference).
- Validation is limited to Qwen2.5-VL/Qwen3-VL; the transferability of best practices to architectures such as InternVL and LLaVA remains to be examined.
- The quality gap between automatic re-annotation in TimeLens-100K and human annotation has not been quantitatively analyzed.
- Multi-granularity temporal localization (e.g., joint moment retrieval and video summarization) is not explored.
## Related Work & Insights
- vs. Time-R1: Both employ RLVR, but Time-R1 uses thinking-based RLVR and achieves only 36.6/33.1/49.2 mIoU, far below TimeLens's 48.8/46.2/56.0; the gap stems from data quality and the thinking-free design.
- vs. TRACE/TRACE-uni: Dedicated VTG models achieving only 27.1–28.1/32.7–33.6/39.0–39.8 mIoU, substantially inferior to approaches built on strong MLLMs.
- vs. TimeSuite: Another systematic VTG approach; its ActivityNet mIoU of only 19.8 suggests that data and training strategy matter more than model design.
- Takeaway: The research paradigm of data quality refinement → fair evaluation → best practice establishment is worth emulating in other tasks such as detection and segmentation.
## Rating
- Novelty: ⭐⭐⭐ The method itself is incremental; the value lies in its systematic nature rather than a single innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three encoding families × two formats + four training paradigms + thorough RLVR recipe exploration — extremely comprehensive.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, with every finding supported by sufficient experiments; the ranking-reversal visualization in Fig. 2(a) is highly persuasive.
- Value: ⭐⭐⭐⭐⭐ Benchmark refinement and best practices are extremely valuable to the VTG community; TimeLens-Bench is poised to become the new standard.