# Video-KTR: Enhancing Video Reasoning via Key Token Attribution
Conference: ICLR 2026 · arXiv: 2601.19686 · Area: Video Understanding · Keywords: Video Reasoning, Reinforcement Learning, Token Attribution, Multimodal LLM, GRPO
## TL;DR
This paper proposes Video-KTR, a modality-aware policy-shaping framework that uses counterfactual analysis to identify three categories of key tokens (visual-aware, temporal-aware, and entropy-aware) and applies reinforcement-learning updates exclusively to those tokens. It achieves state-of-the-art performance across multiple video reasoning benchmarks, including 42.7% on Video-Holmes, surpassing GPT-4o.
## Background & Motivation
Reinforcement learning (RL) has demonstrated strong potential in improving the reasoning capabilities of multimodal LLMs; however, existing video reasoning methods suffer from three critical limitations:
- Coarse-grained rewards: sequence-level rewards give no precise guidance on which tokens require focused learning.
- Single-factor token selection: selecting tokens by information entropy alone ignores modality-specific dependencies.
- Over-reliance on language priors: without fine-grained semantic alignment between visual inputs and output tokens, the risk of hallucination increases.
Prior methods such as T-GRPO introduce temporal constraints (penalizing predictions made after frame shuffling), but this imposes a coarse global assumption and overlooks tasks that can be solved from static visual cues alone.
## Method

### Overall Architecture
Video-KTR introduces a modality-aware, token-level policy shaping mechanism built upon the GRPO framework, consisting of three core steps: (1) multi-perspective token importance analysis; (2) token selection; and (3) selective policy updates.
### Key Designs: Three Attribution Signals
**1. Visual-Aware Tokens**
The dependence of each token on visual input is quantified via counterfactual masking: the video features are zeroed out and the resulting logit change is measured.
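These notes do not reproduce the paper's exact formula; a plausible form (an assumption, not the paper's verbatim definition) measures the shift in the token's log-probability when the video features \(V\) are replaced by a zeroed-out counterfactual \(V_{\varnothing}\):

$$
\Delta^{\text{vis}}_i = \left| \log \pi_\theta\!\left(y_i \mid y_{<i}, V, q\right) - \log \pi_\theta\!\left(y_i \mid y_{<i}, V_{\varnothing}, q\right) \right|
$$

where \(q\) is the textual query and \(y_{<i}\) are the previously generated tokens.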
Tokens with high \(\Delta^{\text{vis}}_i\) (e.g., "person," "door," "blue") are those whose predictions are strongly grounded in the visual input.
**2. Temporal-Aware Tokens**
Sensitivity to temporal structure is detected by shuffling the frame order and measuring the resulting logit change.
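By the same counterfactual logic (again a reconstruction rather than the paper's exact formula), with \(V_{\text{shuf}}\) denoting the same frames in a randomly permuted order:

$$
\Delta^{\text{temp}}_i = \left| \log \pi_\theta\!\left(y_i \mid y_{<i}, V, q\right) - \log \pi_\theta\!\left(y_i \mid y_{<i}, V_{\text{shuf}}, q\right) \right|
$$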
Tokens with high \(\Delta^{\text{temp}}_i\) (e.g., "first," "then," "appear") reflect dependence on event order and causal relationships.
**3. Entropy-Aware Tokens**
Prediction uncertainty is measured to identify critical reasoning points.
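Token-level uncertainty presumably follows the standard predictive-entropy definition over the vocabulary \(\mathcal{V}\):

$$
H_i = -\sum_{v \in \mathcal{V}} \pi_\theta\!\left(v \mid y_{<i}, V, q\right) \log \pi_\theta\!\left(v \mid y_{<i}, V, q\right)
$$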
High-entropy tokens (e.g., "however," "wait") typically mark discourse transitions or decision points.
### Token Selection and Policy Update
The top \(r\%\) of tokens from each attribution strategy are selected, and their union \(\mathcal{S} = \mathcal{S}_{\text{vis}} \cup \mathcal{S}_{\text{temp}} \cup \mathcal{S}_{\text{ent}}\) defines a binary mask \(m_{i,t}\) that gates the GRPO objective.
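The objective itself is not reproduced in these notes; the following is a sketch consistent with standard GRPO, with the clipped surrogate gated by \(m_{i,t}\) (the paper's normalization may differ). For a group of \(G\) sampled responses with importance ratios \(\rho_{i,t} = \pi_\theta(y_{i,t} \mid \cdot)/\pi_{\theta_{\text{old}}}(y_{i,t} \mid \cdot)\) and group-normalized advantages \(\hat{A}_i\):

$$
\mathcal{J}_{\text{KTR}}(\theta) = \mathbb{E}\left[ \frac{1}{\sum_{i,t} m_{i,t}} \sum_{i=1}^{G} \sum_{t=1}^{|y_i|} m_{i,t}\, \min\!\left( \rho_{i,t} \hat{A}_i,\ \operatorname{clip}\!\left(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_i \right) - \beta\, \mathbb{D}_{\text{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right]
$$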
Only key tokens where \(m_{i,t}=1\) contribute to the loss computation.
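A minimal PyTorch sketch of the top-\(r\%\) union selection (illustrative only; the function and variable names are my own, not from the paper's code):

```python
import torch

def top_r_union_mask(delta_vis: torch.Tensor,
                     delta_temp: torch.Tensor,
                     entropy: torch.Tensor,
                     ratio: float = 0.20) -> torch.Tensor:
    """Binary mask over one response: S = S_vis ∪ S_temp ∪ S_ent.

    Each argument is a (num_tokens,) score tensor for one sampled response;
    `ratio` is the per-signal selection ratio r (20% works best in the paper).
    Illustrative sketch, not the authors' released implementation.
    """
    mask = torch.zeros_like(delta_vis, dtype=torch.bool)
    k = max(1, int(ratio * delta_vis.numel()))
    for scores in (delta_vis, delta_temp, entropy):
        top_idx = torch.topk(scores, k).indices  # top-r% tokens for this signal
        mask[top_idx] = True                     # union across the three signals
    return mask

# Toy usage: 10 tokens at r = 20% selects at most 3 * 2 = 6 tokens.
torch.manual_seed(0)
mask = top_r_union_mask(torch.rand(10), torch.rand(10), torch.rand(10))
print(mask.int())
```

Because the three top-\(r\%\) sets can overlap, the effective update ratio is at most \(3r\%\) of tokens, and typically less.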
## Key Experimental Results

### Main Results: Cross-Benchmark Performance Comparison
| Model | Scale | Video-Holmes | VideoMMMU | MMVU (MC) | TempCompass | VideoMME |
|---|---|---|---|---|---|---|
| GPT-4o | — | 42.0 | 61.2 | 75.4 | 73.8 | 71.9 |
| GPT-5 | — | 46.7 | 84.6 | 82.6 | 83.3 | 86.7 |
| Video-R1 | 7B | 36.5 | 52.3 | 63.8 | 73.2 | 59.3 |
| TW-GRPO | 7B | 32.9 | 51.3 | 65.8 | 73.3 | 55.1 |
| Video-KTR | 7B | 42.7 | 53.1 | 66.6 | 73.5 | 62.5 |
### Ablation Study: Attribution Signal Combinations

E, V, and T denote the entropy-aware, visual-aware, and temporal-aware attribution signals, respectively.
| Strategy | E | V | T | Video-Holmes | VideoMMMU | MMVU | Avg. |
|---|---|---|---|---|---|---|---|
| Vanilla GRPO | ✗ | ✗ | ✗ | 38.8 | 49.8 | 64.8 | 51.1 |
| T only | ✗ | ✗ | ✓ | 42.1 | 50.1 | 65.5 | 52.6 |
| V only | ✗ | ✓ | ✗ | 40.5 | 51.9 | 65.1 | 52.5 |
| V+E+T | ✓ | ✓ | ✓ | 41.6 | 52.6 | 65.9 | 53.4 |
### Key Findings
- Complementarity of three signals: Each individual signal outperforms vanilla GRPO, while the full combination yields the best performance.
- Hard selection outperforms soft weighting: A top-20% binary mask consistently outperforms Softmax/Sigmoid/linear/exponential weighting schemes.
- Differentiated linguistic distributions: Visual tokens are predominantly nouns (24.8%), temporal tokens are predominantly verbs (21.2%), and entropy tokens exhibit a higher proportion of adverbs (8.8%).
- Optimal update ratio is 20%: Higher ratios introduce noise, while lower ratios provide insufficient signal.
## Highlights & Insights
- Elegant application of counterfactual analysis: Visual masking and frame shuffling naturally disentangle visual and temporal dependencies through two complementary perturbations.
- Plug-and-play design: Video-KTR can be seamlessly integrated into any GRPO-based RL training pipeline.
- 7B model surpassing GPT-4o: 42.7% vs. 42.0% on Video-Holmes, demonstrating that fine-grained token-level optimization can compensate for the gap in model scale.
- Analysis of unselected tokens: Filtered low-information tokens are predominantly function words (auxiliary verbs, pronouns, prepositions, etc.), validating that the attribution mechanism effectively removes redundant content.
## Limitations & Future Work
- Counterfactual analysis requires additional forward passes (masked visual input + shuffled frame order), increasing training overhead.
- Validation is limited to 7B-scale models; whether equivalent gains persist at larger scales remains unexplored.
- The token selection ratio \(r\) is a fixed hyperparameter that cannot adaptively adjust based on sample difficulty.
- Frame counts are limited to 16–64 frames; the method's capability on ultra-long videos has not been validated.
## Rating ⭐⭐⭐⭐⭐
The paper presents an elegant method design, rigorous experimental analysis, and substantial performance gains. The transition from coarse-grained sequence-level rewards to fine-grained modality-aware token-level updates represents a significant advancement in RL training for video reasoning.