Video-KTR: Enhancing Video Reasoning via Key Token Attribution

Conference: ICLR 2026
arXiv: 2601.19686
Area: Video Understanding
Keywords: Video Reasoning, Reinforcement Learning, Token Attribution, Multimodal LLM, GRPO

TL;DR

This paper proposes Video-KTR, a modality-aware policy shaping framework that identifies three categories of key tokens (visual-aware and temporal-aware tokens via counterfactual analysis, and entropy-aware tokens via predictive uncertainty) and applies reinforcement learning updates only to these tokens. It achieves the best results among the compared 7B models on all five reported video reasoning benchmarks, and on Video-Holmes its 42.7% edges past GPT-4o's 42.0%.

Background & Motivation

Reinforcement learning (RL) has demonstrated strong potential in improving the reasoning capabilities of multimodal LLMs; however, existing video reasoning methods suffer from three critical limitations:

Coarse-grained rewards: Reliance on sequence-level rewards prevents precise guidance on which tokens require focused learning.

Single-factor token selection: Token selection based solely on information entropy ignores modality-specific dependencies.

Over-reliance on language priors: The absence of fine-grained semantic alignment between visual inputs and output tokens increases the risk of hallucination.

Prior methods such as T-GRPO introduce temporal constraints (penalizing predictions after frame shuffling), but this applies a coarse, global assumption that every task depends on frame order, overlooking tasks that can be resolved from static visual cues alone.

Method

Overall Architecture

Video-KTR introduces a modality-aware, token-level policy shaping mechanism built upon the GRPO framework, consisting of three core steps: (1) multi-perspective token importance analysis; (2) token selection; and (3) selective policy updates.

Key Designs: Three Attribution Signals

1. Visual-Aware Tokens

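The dependence of each token on visual input is quantified via counterfactual masking: the video features are zeroed out, and the resulting change in the log-probability of the generated token \(y_i\) is computed from the logits \(\mathbf{z}_i\) at position \(i\):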

\[\Delta^{\text{vis}}_i = |\log \text{softmax}(\mathbf{z}^{\text{full}}_i)_{y_i} - \log \text{softmax}(\mathbf{z}^{\text{masked}}_i)_{y_i}|\]

Tokens with high \(\Delta^{\text{vis}}_i\) (e.g., "person," "door," "blue") indicate that their predictions are strongly grounded in visual input.
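The log-probability gap above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `log_softmax` and `counterfactual_delta` are hypothetical, the logits are made-up toy numbers, and in practice the two logit sets would come from two forward passes of the model (one with the video features intact, one with them zeroed out).

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]

def counterfactual_delta(full_logits, perturbed_logits, token_ids):
    """Per-position |log p_full(y_i) - log p_perturbed(y_i)| for generated tokens y_i."""
    deltas = []
    for z_full, z_pert, y in zip(full_logits, perturbed_logits, token_ids):
        lp_full = log_softmax(z_full)[y]
        lp_pert = log_softmax(z_pert)[y]
        deltas.append(abs(lp_full - lp_pert))
    return deltas

# Toy example (assumed values): 2 positions, vocabulary of 3 tokens.
full = [[4.0, 0.0, 0.0], [1.0, 1.0, 1.0]]    # logits with video features
masked = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]  # logits with video features zeroed
tokens = [0, 2]                              # generated token ids y_i
delta_vis = counterfactual_delta(full, masked, tokens)
```

In this toy case the first token loses most of its probability mass when the video is masked, so its score is large, while the second token is unaffected and scores zero.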

2. Temporal-Aware Tokens

Sensitivity to temporal structure is detected via frame order shuffling:

\[\Delta^{\text{temp}}_i = |\log \text{softmax}(\mathbf{z}^{\text{ordered}}_i)_{y_i} - \log \text{softmax}(\mathbf{z}^{\text{shuffled}}_i)_{y_i}|\]

Tokens with high \(\Delta^{\text{temp}}_i\) (e.g., "first," "then," "appear") reflect dependence on event order and causal relationships.
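The temporal perturbation itself can be sketched as a frame permutation. The summary does not say how the shuffle is drawn, so the fixed seed and the guarantee that the shuffled order differs from the original are assumptions made for this illustration, and `shuffle_frames` is a hypothetical helper name. The shuffled sequence would then be passed through the model to obtain \(\mathbf{z}^{\text{shuffled}}_i\).

```python
import random

def shuffle_frames(frames, seed=0):
    """Return a copy of `frames` in a permuted order that differs from the input order.

    The seed and the non-identity guarantee are assumptions of this sketch.
    """
    if len(frames) < 2:
        return list(frames)  # nothing to shuffle
    rng = random.Random(seed)
    order = list(range(len(frames)))
    # Re-draw until the permutation is not the identity.
    while order == sorted(order):
        rng.shuffle(order)
    return [frames[i] for i in order]

# Toy example: frame placeholders standing in for decoded video frames.
original = ["f0", "f1", "f2", "f3"]
shuffled = shuffle_frames(original)
```

Only the order changes: the same frames are present, so any drop in a token's log-probability can be attributed to temporal structure rather than to missing visual content.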

3. Entropy-Aware Tokens

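Prediction uncertainty is captured by the entropy of the model's next-token distribution at position \(i\), summed over vocabulary entries \(w\), to identify critical reasoning points:

\[\mathcal{H}(i) = -\sum_w p(y_i = w) \log p(y_i = w)\]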

High-entropy tokens (e.g., "however," "wait") typically mark discourse transitions or decision points.
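As a small illustration of the entropy signal, the sketch below computes the entropy of one next-token distribution from its logits. The function name `token_entropy` and the toy logits are assumptions for the example, not taken from the paper.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for a single position."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Skip exact zeros to avoid log(0); they contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy example: an undecided position versus a confident one.
uncertain = token_entropy([0.0, 0.0, 0.0, 0.0])   # uniform over 4 tokens
confident = token_entropy([20.0, 0.0, 0.0, 0.0])  # one dominant token
```

A uniform distribution over four tokens reaches the maximum entropy of log 4, whereas the confident position is close to zero, which is why high-entropy positions single out points where the model is choosing between alternatives.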

Token Selection and Policy Update

The top \(r\%\) of tokens under each attribution signal are selected, and their union \(\mathcal{S} = \mathcal{S}_{\text{vis}} \cup \mathcal{S}_{\text{temp}} \cup \mathcal{S}_{\text{ent}}\) defines a binary mask \(m_{i,t}\), which is 1 if token \(t\) of response \(o_i\) belongs to \(\mathcal{S}\) and 0 otherwise. The modified GRPO objective is:

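\[\mathcal{J}_{\text{Video-KTR}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^G \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} m_{i,t} \cdot \min\left(r_{i,t}\hat{A}_{i,t}, \text{clip}(r_{i,t}, 1-\epsilon, 1+\epsilon)\hat{A}_{i,t}\right)\right]\]

Here \(r_{i,t}\) is the token-level probability ratio between the current and old policies, \(\hat{A}_{i,t}\) is the group-normalized advantage, and \(\epsilon\) is the clipping range, as in standard GRPO. Only key tokens where \(m_{i,t}=1\) contribute to the loss computation; the sum is still normalized by the full response length \(|o_i|\).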
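The selection and masked update can be sketched end to end as follows. This is an illustration under stated assumptions rather than the paper's code: the helper names are hypothetical, rounding the top-\(r\%\) count up with `ceil` is an assumption, `eps=0.2` is an assumed clipping range, and all scores, ratios, and advantages are toy numbers. The 20% ratio matches the setting the summary reports as best.

```python
import math

def top_ratio_indices(scores, ratio):
    """Indices of the ceil(ratio * n) highest-scoring positions (rounding is assumed)."""
    if not scores:
        return set()
    k = max(1, math.ceil(ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def key_token_mask(delta_vis, delta_temp, entropy, ratio=0.2):
    """Binary mask for one response: 1 where a token is in S_vis | S_temp | S_ent."""
    selected = (top_ratio_indices(delta_vis, ratio)
                | top_ratio_indices(delta_temp, ratio)
                | top_ratio_indices(entropy, ratio))
    return [1 if t in selected else 0 for t in range(len(entropy))]

def clip(x, low, high):
    return max(low, min(high, x))

def masked_grpo_objective(ratios, advantages, masks, eps=0.2):
    """Masked clipped surrogate, averaged over tokens (full length) and over G responses."""
    total = 0.0
    for r_i, a_i, m_i in zip(ratios, advantages, masks):
        seq = 0.0
        for r, a, m in zip(r_i, a_i, m_i):
            surrogate = min(r * a, clip(r, 1 - eps, 1 + eps) * a)
            seq += m * surrogate          # unselected tokens contribute nothing
        total += seq / len(r_i)           # normalize by |o_i|, not by the mask sum
    return total / len(ratios)

# Toy mask: 5 tokens, each signal selects its single top token (20% of 5).
mask_a = key_token_mask(
    delta_vis=[0.9, 0.1, 0.0, 0.2, 0.1],
    delta_temp=[0.0, 0.1, 0.8, 0.1, 0.0],
    entropy=[0.1, 0.2, 0.1, 0.3, 1.5],
)

# Toy objective: G = 2 responses of 2 tokens each, with hand-written masks.
ratios = [[1.5, 1.0], [0.5, 1.0]]
advantages = [[1.0, 1.0], [-1.0, -1.0]]
masks = [[1, 0], [1, 1]]
objective = masked_grpo_objective(ratios, advantages, masks)
```

Because the mask is a union of three top-\(r\%\) sets, it can cover up to \(3r\%\) of the tokens when the three signals pick disjoint positions, as in the toy mask here.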

Key Experimental Results

Main Results: Cross-Benchmark Performance Comparison

Model	Scale	Video-Holmes	VideoMMMU	MMVU(mc)	TempCompass	VideoMME
GPT-4o	—	42.0	61.2	75.4	73.8	71.9
GPT-5	—	46.7	84.6	82.6	83.3	86.7
Video-R1	7B	36.5	52.3	63.8	73.2	59.3
TW-GRPO	7B	32.9	51.3	65.8	73.3	55.1
Video-KTR	7B	42.7	53.1	66.6	73.5	62.5

Ablation Study: Attribution Signal Combinations

Strategy	E	V	T	Video-Holmes	VideoMMMU	MMVU	Avg.
Vanilla GRPO	✗	✗	✗	38.8	49.8	64.8	51.1
T only	✗	✗	✓	42.1	50.1	65.5	52.6
V only	✗	✓	✗	40.5	51.9	65.1	52.5
V+E+T	✓	✓	✓	41.6	52.6	65.9	53.4
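The Avg. column appears to be the unweighted mean of the three benchmark columns, rounded to one decimal. This is an inference from the numbers rather than something the table states; the short check below recomputes it from the rows as listed.

```python
# Rows copied from the ablation table: (Video-Holmes, VideoMMMU, MMVU), reported Avg.
rows = {
    "Vanilla GRPO": ([38.8, 49.8, 64.8], 51.1),
    "T only": ([42.1, 50.1, 65.5], 52.6),
    "V only": ([40.5, 51.9, 65.1], 52.5),
    "V+E+T": ([41.6, 52.6, 65.9], 53.4),
}

# Recompute the unweighted mean per row and compare with the reported value.
recomputed = {name: sum(scores) / len(scores) for name, (scores, _) in rows.items()}
matches = {
    name: abs(recomputed[name] - reported) <= 0.05 + 1e-9
    for name, (_, reported) in rows.items()
}
```

All four reported averages agree with the recomputed means to within rounding.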

Key Findings

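Complementarity of three signals: The V-only and T-only variants each outperform vanilla GRPO on average (52.5 and 52.6 vs. 51.1), and the full V+E+T combination yields the best average (53.4), although T only scores higher on Video-Holmes (42.1 vs. 41.6).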
Hard selection outperforms soft weighting: A top-20% binary mask consistently outperforms Softmax/Sigmoid/linear/exponential weighting schemes.
Differentiated linguistic distributions: Visual tokens are predominantly nouns (24.8%), temporal tokens are predominantly verbs (21.2%), and entropy tokens exhibit a higher proportion of adverbs (8.8%).
Optimal update ratio is 20%: Higher ratios introduce noise, while lower ratios provide insufficient signal.

Highlights & Insights

Elegant application of counterfactual analysis: Visual masking and frame shuffling naturally disentangle visual and temporal dependencies through two complementary perturbations.
Plug-and-play design: Video-KTR can be seamlessly integrated into any GRPO-based RL training pipeline.
7B model surpassing GPT-4o on Video-Holmes: 42.7% vs. 42.0%, suggesting that fine-grained token-level optimization can partly compensate for the gap in model scale on this benchmark; GPT-4o remains ahead on the other four benchmarks in the main results table.
Analysis of unselected tokens: Filtered low-information tokens are predominantly function words (auxiliary verbs, pronouns, prepositions, etc.), validating that the attribution mechanism effectively removes redundant content.

Limitations & Future Work

Counterfactual analysis requires additional forward passes (masked visual input + shuffled frame order), increasing training overhead.
Validation is limited to 7B-scale models; whether equivalent gains persist at larger scales remains unexplored.
The token selection ratio \(r\) is a fixed hyperparameter that cannot adaptively adjust based on sample difficulty.
Frame counts are limited to 16–64 frames; the method's capability on ultra-long videos has not been validated.

Rating ⭐⭐⭐⭐⭐

The paper presents an elegant method design, rigorous experimental analysis, and substantial performance gains. The transition from coarse-grained sequence-level rewards to fine-grained modality-aware token-level updates represents a significant advancement in RL training for video reasoning.