TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs¶
Conference: NeurIPS 2025 arXiv: 2509.18056 Code: github.com/HVision-NKU/TempSamp-R1 Area: Video Temporal Understanding / Reinforcement Fine-Tuning Keywords: temporal grounding, GRPO, off-policy, soft advantage, hybrid CoT, video LLM
TL;DR¶
TempSamp-R1 is a reinforcement fine-tuning framework that addresses the inefficiency of on-policy sampling in GRPO for video temporal grounding—caused by the vast search space—by introducing ground-truth annotations as off-policy supervision signals, non-linear soft advantage estimation, and a hybrid CoT training paradigm, achieving new state-of-the-art results on Charades-STA, ActivityNet, and QVHighlights.
Background & Motivation¶
Background: MLLMs perform well on general video question answering but struggle with tasks requiring precise temporal understanding, such as temporal grounding and highlight detection. SFT-based approaches tend to overfit deterministic timestamp annotations and lack temporal reasoning capacity. GRPO (DeepSeek-R1-style) is effective for mathematical reasoning but yields limited gains in video temporal grounding.
Limitations of Prior Work: (1) The search space for video temporal grounding is enormous—locating (start, end) pairs on a continuous time axis is substantially harder than selecting from discrete mathematical answers; (2) pure on-policy GRPO sampling rarely produces high-IoU solutions in such a large search space, resulting in sparse and unstable rewards (top-1 IoU reward on ActivityNet remains persistently low and oscillates); (3) introducing high-reward off-policy solutions (e.g., ground truth) biases advantage estimation—the high reward from GT inflates the group mean, causing all on-policy advantages to become negative.
Key Challenge: How can a policy be effectively guided to learn precise temporal grounding in a large search space while avoiding the distributional shift introduced by off-policy samples?
Key Insight: GT annotations are mixed into the GRPO sampling group as off-policy solutions, while non-linear reward shaping is applied to eliminate the negative impact of distributional shift on advantage estimation.
Method¶
Overall Architecture¶
TempSamp-R1 is built on the GRPO framework. For each query, \(G\) solutions are sampled (\(G-1\) on-policy and 1 off-policy GT). IoU rewards are computed and converted into normalized advantage values via a soft advantage estimation module for policy optimization. Training proceeds in two stages: the model first learns direct output generation, then a format reward is introduced to encourage chain-of-thought reasoning. At inference time, a single model supports both CoT and non-CoT modes.
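As a concrete reference, the IoU reward used to score each sampled solution is simply the 1-D intersection-over-union between the predicted and annotated segments. A minimal Python sketch (our own illustration, not the released code; the function name `temporal_iou` is hypothetical):

```python
def temporal_iou(pred: tuple, gt: tuple) -> float:
    """IoU in [0, 1] between a predicted (start, end) segment and the ground-truth segment."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))      # overlap length
    union = (pe - ps) + (ge - gs) - inter            # total covered length
    return inter / union if union > 0 else 0.0

# e.g. a prediction of (12.0, 25.0) s against an annotation of (10.0, 24.0) s gives 12/15 = 0.8
```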
Key Designs¶
- Mix-Policy Sampling:
- Function: GT annotations are mixed into the GRPO sampling group as off-policy solutions, providing precise positive signals for temporal grounding.
- Mechanism: For each query \(q\), \(G-1\) solutions \(\{o_1,...,o_{G-1}\}\) are sampled from the current policy \(\pi_\theta\), and one external off-policy solution \(o_G\) (from GT annotations) is appended. Normalized advantages are computed over the joint distribution: \(A_i = \frac{r_i - \text{mean}(\{r_1,...,r_{G-1}\} \cup \{r_G\})}{\text{std}(\{r_1,...,r_{G-1}\} \cup \{r_G\})}\). An advantage anchoring strategy is also proposed: \(A_G = \lambda_{\text{off}} \cdot \max\{A_i | i \in \{1,...,G-1\}\}\) (with \(\lambda_{\text{off}}=1.2\)) to decouple the off-policy and on-policy advantages.
- Design Motivation: Pure on-policy GRPO in a large search space can almost never sample high-IoU solutions, yielding sparse rewards and weak learning signals. GT provides precise temporal anchors to compensate for insufficient on-policy exploration; however, the high reward from GT inflates the group mean, necessitating soft advantage estimation to remove the bias (a minimal sketch of the mixed-group computation follows this list).
- Non-Linear Soft Advantage Estimation:
- Function: An asymmetric non-linear transformation is applied to rewards, compressing differences in the high-reward region while amplifying differences in the low-reward region.
- Mechanism: A piecewise function is defined as \(\tilde{r}_i = \begin{cases}\tau + \alpha_1 \cdot \ln((r_i - \tau) + 1), & r_i \geq \tau \\ \tau - \frac{e^{\alpha_2(\tau - r_i)} - 1}{e^{\alpha_2} - 1}, & r_i < \tau\end{cases}\), where \(\tau=0.8\) is the threshold, \(\alpha_1=0.01\) controls logarithmic compression, and \(\alpha_2=1\) controls exponential amplification. The logarithmic branch suppresses gradient spikes from optimal solutions such as GT; the exponential branch amplifies the discriminability among suboptimal solutions.
- Design Motivation: In standard GRPO, the high reward of the off-policy solution causes all on-policy advantages to become negative, incorrectly penalizing high-quality on-policy solutions. After non-linear shaping, the high-reward region is compressed and the low-reward region is amplified, yielding more informative gradients and more stable optimization.
- Hybrid Chain-of-Thought Training:
- Function: A single model is trained to support both CoT and non-CoT inference, with the mode selected at inference time according to query complexity.
- Mechanism: Two-stage training: the initialization stage optimizes the model to generate accurate final answers (non-CoT mode), after which a format reward is introduced to encourage generating reasoning steps within `<Think>...</Think>` and final answers within `<Answer>...</Answer>`. The format reward is 1 for correct formatting and 0 otherwise. At inference, Mixed CoT takes the better result of the two modes.
- Design Motivation: Queries vary in complexity: simple queries can be answered directly, while complex queries require reasoning. CoT and non-CoT are complementary, and the Mixed mode outperforms either mode alone across all metrics.
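To make the sampling-side designs concrete, here is a minimal NumPy sketch of the mixed-group advantage computation following the formulas above. It is our own illustration rather than the released code: the function names are made up, the hyperparameter defaults (\(\tau=0.8, \alpha_1=0.01, \alpha_2=1, \lambda_{\text{off}}=1.2\)) are the values quoted in the text, and advantage anchoring and non-linear shaping are exposed as separate flags because the ablation compares them as alternative off-policy integration strategies.

```python
import numpy as np

def soft_shape(r, tau=0.8, alpha1=0.01, alpha2=1.0):
    """Asymmetric non-linear reward shaping: logarithmic compression above the
    threshold tau, exponential amplification of differences below it."""
    r = np.asarray(r, dtype=float)
    high = tau + alpha1 * np.log((r - tau) + 1.0)                              # r >= tau branch
    low = tau - (np.exp(alpha2 * (tau - r)) - 1.0) / (np.exp(alpha2) - 1.0)    # r <  tau branch
    return np.where(r >= tau, high, low)

def mixed_group_advantages(onpolicy_rewards, gt_reward, lambda_off=1.2,
                           use_shaping=True, use_anchoring=False, eps=1e-6):
    """Advantages for a group of G-1 on-policy solutions plus one off-policy (GT)
    solution, which occupies the last slot. Rewards are normalized over the joint
    group; the GT advantage can instead be re-anchored to the best on-policy one."""
    rewards = np.append(np.asarray(onpolicy_rewards, dtype=float), gt_reward)
    if use_shaping:
        rewards = soft_shape(rewards)
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    if use_anchoring:
        adv[-1] = lambda_off * adv[:-1].max()   # decouple the off-policy advantage
    return adv

# Example: seven on-policy IoU rewards plus the GT segment (reward 1.0) in the last slot.
advs = mixed_group_advantages([0.05, 0.10, 0.00, 0.40, 0.55, 0.20, 0.30], 1.0)
```

With shaping, the GT reward is compressed to roughly \(\tau\) instead of standing out as an extreme outlier, while differences among the suboptimal on-policy samples are stretched, which is the effect the paper attributes to more informative gradients.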
Loss & Training¶
The standard GRPO objective is adopted: \(\mathcal{J}(\theta) = \frac{1}{G}\sum_{i=1}^{G}\Big[\min\Big(\frac{\pi_\theta(o_i|q)}{\pi_{\theta_{\text{old}}}(o_i|q)}A_i,\ \text{clip}\Big(\frac{\pi_\theta(o_i|q)}{\pi_{\theta_{\text{old}}}(o_i|q)}, 1-\epsilon, 1+\epsilon\Big)A_i\Big) - \beta\,\text{KL}(\pi_\theta\|\pi_{\text{ref}})\Big]\), with \(\pi_{\theta_{\text{old}}} = \pi_\theta\) for computational simplicity. Task rewards: an IoU reward \(R_{\text{IoU}}\) for temporal grounding, and a timestamp matching reward \(R_{\text{ts}} = \lambda_{\text{rec}} \cdot F2 + \lambda_{\text{score}} \cdot \frac{1}{1+\text{WMSE}}\) for highlight detection. The base model is Qwen2.5-VL-7B-Instruct, trained on 4×A100 GPUs with video sampled at 2 FPS.
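For reference, a compact PyTorch-style sketch of the clipped objective above, written at the sequence level and returned as a loss to minimize. This is a generic GRPO/PPO-style surrogate in our own notation rather than the authors' training code; the `eps` and `beta` defaults are common choices, not necessarily the paper's values, and the KL term uses the standard non-negative per-sample estimator of \(\text{KL}(\pi_\theta\|\pi_{\text{ref}})\).

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """Clipped GRPO surrogate for one sampled group.
    logp_new / logp_old / logp_ref: shape-[G] log-probabilities of the G solutions
    under the current, old (sampling), and frozen reference policies; only
    logp_new carries gradients. advantages: shape-[G] normalized advantages."""
    ratio = torch.exp(logp_new - logp_old.detach())            # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # unbiased, non-negative estimator of KL(pi_theta || pi_ref)
    log_r = logp_ref.detach() - logp_new
    kl = torch.exp(log_r) - log_r - 1.0
    return -(surrogate - beta * kl).mean()                     # negate to maximize the objective
```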
Key Experimental Results¶
Main Results: SOTA Comparison on Temporal Understanding Benchmarks¶
| Method | Type | Charades-STA R1@0.7 | ActivityNet R1@0.5 | QVHighlights mAP |
|---|---|---|---|---|
| TimeChat | SFT | 23.7 | — | 21.7 |
| iMOVE | SFT | 45.3 | 50.7 | — |
| VideoChat-R1 | RL | 50.2 | — | — |
| TimeZero | RL | 47.9 | 47.3 | — |
| TempSamp-R1 (no-CoT) | RL | 52.2 | 55.4 | 30.0 |
| TempSamp-R1 (CoT) | RL | 52.9 | 56.0 | 28.3 |
| TempSamp-R1 Mixed CoT | RL | 56.3 | 58.7 | 29.3 |
Ablation Study: Component Contributions (Charades-STA)¶
| Configuration | R1@0.5 | R1@0.7 | mIoU |
|---|---|---|---|
| GRPO baseline | 71.7 | 50.2 | 60.8 |
| + off-policy (reward scaling) | 72.5 | 51.1 | 61.0 |
| + off-policy (advantage anchor) | 73.0 | 51.7 | 61.3 |
| + off-policy (non-linear shaping) | 73.6 | 52.2 | 61.7 |
| + hybrid CoT (Mixed) | 76.0 | 56.3 | 64.2 |
Key Findings¶
- Pure on-policy GRPO on ActivityNet yields a top-1 IoU reward persistently below 0.3 with high variance; off-policy guidance rapidly stabilizes the reward above 0.6.
- Among the three off-policy integration strategies, non-linear reward shaping > advantage anchoring > reward scaling.
- Mixed CoT outperforms standalone CoT and non-CoT on all metrics, improving mIoU by 2.1–2.5 points.
- Few-shot capability: using only 10% of training data still achieves over 90% of the performance of full-data GRPO training.
Highlights & Insights¶
- The paper precisely diagnoses the root cause of GRPO's failure in temporal grounding—the large search space leads to sparse rewards under on-policy sampling.
- The piecewise design of the non-linear soft advantage is elegant: logarithmic compression suppresses gradient spikes in the high-reward region while exponential amplification enhances discriminability in the low-reward region.
- Mixed CoT is a simple yet effective design that enables the same model to adaptively select its reasoning depth.
- The work extends RL fine-tuning from mathematical reasoning to video temporal understanding, validating the cross-domain potential of the R1 paradigm.
Limitations & Future Work¶
- Off-policy sampling relies on GT annotations, which are unavailable at inference time, creating an inconsistency between training exploration and inference.
- Validation is primarily on temporal grounding tasks; effectiveness on general video QA remains unexplored.
- The hyperparameters of the non-linear transformation (\(\tau, \alpha_1, \alpha_2\)) may require task-specific tuning.
- Experiments are conducted only on a 7B model; it is unclear whether off-policy guidance remains necessary for larger models.
Related Work & Insights¶
- vs. TimeZero/VideoChat-R1: These GRPO-based methods rely solely on on-policy sampling. TempSamp-R1 introduces off-policy signals to address sparse rewards, improving R1@0.5 on ActivityNet by 8.7 points.
- vs. SFT methods (iMOVE, etc.): SFT overfits to deterministic timestamps, whereas RL fine-tuning learns more flexible temporal reasoning. TempSamp-R1 Mixed CoT surpasses iMOVE by 11 points on Charades R1@0.7.
- Insight: In RL tasks with large search spaces, judiciously incorporating off-policy expert signals may be a broadly effective strategy; the non-linear reward shaping approach is generalizable to other RL fine-tuning scenarios.
Rating¶
- Novelty: ⭐⭐⭐⭐ Extending R1-style RL to video temporal grounding is valuable; the combination of off-policy sampling and soft advantage estimation is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ State-of-the-art results on 3 benchmarks, detailed ablation studies, and few-shot evaluation.
- Writing Quality: ⭐⭐⭐⭐ Problem analysis is thorough; the motivation-to-solution logical chain is clear.
- Value: ⭐⭐⭐⭐ Provides a practical RL fine-tuning framework for video temporal understanding; the Mixed CoT design is reusable.