Towards Sparse Video Understanding and Reasoning

Conference: CVPR 2026
arXiv: 2602.13602
Code: https://sparsevideounderstanding.github.io (Project page, no open-source code)
Area: Video Understanding
Keywords: Video Question Answering, Frame Selection, Multi-turn Reasoning, Summary-as-State, Reinforced Fine-tuning

TL;DR

ReViSe reformulates video question answering as question-driven, multi-turn sparse frame selection: it selects only a few frames per turn, compresses verified evidence into a structured "summary-as-state" carried across turns, and stops early once confident. It works as a plug-and-play wrapper for any VLM and also supports reinforced fine-tuning with the EAGER reward, which needs answer labels and model scores but no keyframe annotations. On benchmarks such as VideoEspresso and NExT-QA it improves accuracy over the underlying backbones while processing only a few frames per video.

Background & Motivation

Background: LLM/VLM-based video understanding mostly relies on uniform frame sampling. Existing methods either use captioners to convert frames into text for LLM reasoning (losing fine-grained visual details) or feed visual features directly into VLMs (e.g., LLaVA, QwenVL); in both cases the frames are sampled uniformly.

Limitations of Prior Work: Uniform sampling faces two major issues (following the categorization in VideoTree). (L1) Information Overload: Long videos are highly redundant; feeding too many redundant frames overwhelms the LLM, slowing down reasoning and interfering with judgment. (L2) Insufficient Perception of Key Information: Video content is hierarchically and temporally structured; missing semantically salient frames across multiple scales causes the model to lose clues necessary for answering.

Key Challenge: The root of these issues is semantic sparsity—for a specific question, only a small subset of frames is truly relevant. Uniform sampling is "question-agnostic," failing to identify critical frames (missing keys) while being burdened by irrelevant ones (redundancy).

Goal: To decouple video understanding into two sub-problems: (a) how to select only the most informative frames for the current question; (b) how to compactly accumulate seen evidence during multi-turn interactions to avoid re-reading the entire history.

Key Insight: Drawing an analogy to RNN hidden states, since evidence is accumulated step-by-step, there should be a continuously updated compact memory that carries only "task-critical" information to decide "where to look next." This transforms frame selection from a one-time coverage problem into a state-driven sequential decision process.

Core Idea: Replace raw dialogue history with a "summary-as-state" transmitted across turns. This turns multi-turn frame selection into a loop of "reading frames → updating state → deciding next frames or answering." The model is taught to select accurately and stop early via the label-free reward EAGER.

Method

Overall Architecture

ReViSe (Reasoning with Video Sparsity) models video QA as an agent interaction process of at most $T$ turns. Given a video $V=\{x_i\}_{i=0}^{L-1}$ and a question $p$, the goal is to produce an answer $a$ within a visual token budget $K$. Instead of consuming all frames at once, it iteratively: (i) selects a small batch of frames likely to reduce uncertainty; (ii) distills the evidence into a structured summary state $z_t$ carried across turns; (iii) stops early once the cumulative evidence is sufficient.

The system consists of three modules: the Multi-turn Controller decides which frames to view and when to stop; the Structured Output Protocol externalizes reasoning and state via fixed tags; the Summary-as-State serves as the sole cross-turn memory, accumulating verified evidence. In each turn, the VLM outputs a <think> trajectory, followed by either a <summarize> state plus a <select> request (intermediate turns) or an <answer> (final turn). The next turn receives only the new frames and the updated summary. The framework can be used as a plug-and-play wrapper for closed-source VLMs (§3.2 of the paper) or reinforced via EAGER rewards for open-source VLMs (§3.3 of the paper).

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Input: Video V + Question p<br/>Turn 1: Uniformly sample few frames F1"] --> B["Multi-turn Controller<br/>Reads (p_t, z_t-1, F_t) to output action"]
    B --> C["Structured Output Protocol<br/>think → summarize → select/answer"]
    C --> D["Summary-as-State z_t=(P,O,H,U,R)<br/>Distills evidence, sole cross-turn memory"]
    D -->|"Select(Q_t): Fetch new frames<br/>S_t+1=S_t∪F_t+1, budget C(S_t)+|p_t|≤K"| B
    D -->|"Answer(y_t): Sufficient evidence, stop early"| E["Output answer a=y_τ"]
    F["EAGER Reward<br/>Label-free RFT (Optional)"] -.Training Open-source VLM.-> B

Key Designs

1. Multi-turn Question-driven Frame Selection: From "Max Coverage" to Sequential Decision

Addressing L1 (Information Overload), ReViSe avoids large-scale uniform sampling by splitting selection into at most $T$ turns. Let $S_t=\bigcup_{j=1}^{t}F_j$ be the set of frames included up to turn $t$. In each turn $t\geq 2$, the agent outputs an action $a_t\in\{\mathrm{Select}(Q_t),\mathrm{Answer}(y_t)\}$ based on $(p_t,z_{t-1},F_t)$, where $Q_t$ contains the indices of the newly requested frames. Choosing Answer terminates the process at step $\tau\leq T$. The total budget is constrained by $C(S_t)+|p_t|\leq K$, where $C(\cdot)$ is the visual token cost of a set of frames.

The optimization objective is: $$\max_{\{Q_t\},\,\tau\leq T}\ \mathcal{R}\bigl(\pi_\theta(p,z_{\tau-1},S_{\tau-1})\bigr)\quad\text{s.t.}\ C(S_{\tau-1})+|p_\tau|\leq K.$$ By reading a few frames, identifying missing pieces, and fetching targeted frames, the model approaches the answer with minimal frame usage.
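
The select-or-answer loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `policy` callable (a stand-in for a VLM call), its `must_answer` flag, and the numeric values for `tokens_per_frame`, `question_tokens`, and `budget_k` are all hypothetical. Only the limits of 3 frames per turn and 4 turns come from the experimental setup.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple, Union


@dataclass
class Select:
    indices: List[int]  # Q_t: frame indices requested for the next turn
    summary: str        # z_t: updated summary state (opaque text here)


@dataclass
class Answer:
    text: str           # y_t: final answer


Action = Union[Select, Answer]
# Hypothetical signature: (question, previous summary, new frames, must_answer)
Policy = Callable[[str, str, Sequence[int], bool], Action]


def run_episode(policy: Policy, question: str, num_frames: int,
                max_turns: int = 4, frames_per_turn: int = 3,
                tokens_per_frame: int = 256, question_tokens: int = 64,
                budget_k: int = 4096) -> Tuple[str, List[int], int]:
    """Run at most `max_turns` turns; return (answer, frames used, stop turn)."""
    # Turn 1: a few uniformly spaced frames (F_1).
    step = max(1, num_frames // frames_per_turn)
    new_frames = list(range(0, num_frames, step))[:frames_per_turn]
    seen: set = set()   # S_t
    summary = ""        # z_0
    for turn in range(1, max_turns + 1):
        seen.update(new_frames)
        # How many more frames still fit under C(S_t) + |p_t| <= K.
        remaining = budget_k - question_tokens - len(seen) * tokens_per_frame
        affordable = max(0, remaining // tokens_per_frame)
        must_answer = turn == max_turns or affordable == 0
        action = policy(question, summary, new_frames, must_answer)
        if isinstance(action, Answer):
            return action.text, sorted(seen), turn
        if must_answer:
            raise ValueError("policy must answer when no turns or budget remain")
        summary = action.summary
        # Keep unseen, in-range requests; cap by per-turn limit and budget.
        fresh = [i for i in dict.fromkeys(action.indices)
                 if 0 <= i < num_frames and i not in seen]
        new_frames = fresh[:min(frames_per_turn, affordable)]
    raise ValueError("max_turns must be at least 1")


def toy_policy(question, summary, new_frames, must_answer):
    # Stand-in for a VLM call: answers as soon as frame 42 has been shown.
    if must_answer or 42 in new_frames:
        return Answer("B")
    return Select(indices=[42, 43], summary=summary + f" saw {list(new_frames)};")


answer, frames, stop_turn = run_episode(toy_policy, "What is taken out?", num_frames=90)
```

With the toy policy, the episode stops at turn 2 after seeing five frames, which illustrates how early stopping keeps the average frame count below the worst case of `max_turns * frames_per_turn`.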

2. Summary-as-State z_t=(P,O,H,U,R): Persistent Memory addressing L2

Addressing L2 (Insufficient key information perception), ReViSe formalizes memory as a quintuple with a fixed order P→O→H→U→R: $P_t$ (Previously seen), $O_t$ (Observations), $H_t$ (belief Hypotheses), $U_t$ (Uncertainties), and $R_t$ (Reasons for next steps). This is the only state carried across turns, stored in the <summarize> field.

This works because the state is cumulative: $z_{t-1}$ distills the content of $\{z_0,\dots,z_{t-2}\}$, so conditioning only on $z_{t-1}$ stands in for conditioning on the whole state history. Memory cost stays constant and raw history never has to be re-read (mitigating L1). $R_t$ drives the next proposal $Q_{t+1}$, while the stability of $H_t$ and $U_t$ signals when to stop. Selection and termination are thus both tied to one explicit state, which keeps them consistent.
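
As a rough illustration of the quintuple, the state can be held in a small record and serialized in the fixed P→O→H→U→R order. The field names, the `"; "` separator, the one-line-per-field layout inside `<summarize>`, and the `<frame i>` placeholders are assumptions made for this sketch; the source only fixes the five components, their order, and the tag name.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class SummaryState:
    """z_t = (P, O, H, U, R); field names are illustrative."""
    previously_seen: List[int] = field(default_factory=list)  # P_t
    observations: List[str] = field(default_factory=list)     # O_t
    hypotheses: List[str] = field(default_factory=list)       # H_t
    uncertainties: List[str] = field(default_factory=list)    # U_t
    reasons: List[str] = field(default_factory=list)          # R_t

    def render(self) -> str:
        """Serialize in the fixed order P -> O -> H -> U -> R."""
        rows = [
            ("P", [str(i) for i in self.previously_seen]),
            ("O", self.observations),
            ("H", self.hypotheses),
            ("U", self.uncertainties),
            ("R", self.reasons),
        ]
        body = "\n".join(f"{key}: {'; '.join(items)}" for key, items in rows)
        return f"<summarize>\n{body}\n</summarize>"


def build_prompt(question: str, prev_state: SummaryState,
                 new_frames: Sequence[int]) -> str:
    """Next-turn input: the question, z_{t-1}, and placeholders for F_t only."""
    frame_slots = "\n".join(f"<frame {i}>" for i in new_frames)
    return f"Question: {question}\n{prev_state.render()}\n{frame_slots}"


state = SummaryState(
    previously_seen=[0, 30, 60],
    observations=["a person opens a fridge"],
    hypotheses=["the answer is probably B"],
    uncertainties=["which item is taken out"],
    reasons=["inspect frames around 42"],
)
prompt = build_prompt("What is taken out?", state, [42, 43])
```

The prompt carries earlier frames only as indices in the `P` line, never as images, so its size depends on the summary length and the new frames rather than on the number of turns already taken.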

3. Structured Output Protocol + Plug-and-Play: Transparency and Versatility

Each response starts with a <think> trajectory. Select turns output <think>…</think><summarize>…</summarize><select>…</select>, while Answer turns output <think>…</think><answer>…</answer>. The <think> trajectory is not persisted, keeping the prompt volume low. Only the <summarize> state is carried over.

Since the logic (multi-turn dialogue, adaptive selection, state updates) runs outside the VLM, ReViSe treats closed-source VLMs as frozen black boxes, allowing multi-turn orchestration via APIs without parameter updates—this is the basis for its plug-and-play capability.
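
Because the orchestration runs outside the VLM, the wrapper has to parse each tagged response itself. The sketch below shows one way to do that with regular expressions. The tag names come from the protocol above, while the assumption that `<select>` contains integer frame indices, the returned dictionary layout, and the error messages are illustrative choices.

```python
import re
from typing import Dict, List, Union

TAG_RE = re.compile(r"<(think|summarize|select|answer)>(.*?)</\1>", re.DOTALL)


def parse_response(text: str) -> Dict[str, Union[str, List[int]]]:
    """Parse one turn; <think> is validated but deliberately not returned."""
    if not text.lstrip().startswith("<think>"):
        raise ValueError("response must start with <think>")
    fields = {m.group(1): m.group(2).strip() for m in TAG_RE.finditer(text)}
    if "think" not in fields:
        raise ValueError("unclosed <think> block")
    if "answer" in fields and "select" in fields:
        raise ValueError("a turn is either a Select or an Answer, not both")
    if "answer" in fields:
        return {"action": "answer", "answer": fields["answer"]}
    if "summarize" in fields and "select" in fields:
        # Assumption: <select> lists frame indices as integers, any separator.
        indices = [int(tok) for tok in re.findall(r"\d+", fields["select"])]
        return {"action": "select", "summary": fields["summarize"],
                "indices": indices}
    raise ValueError("Select turns need both <summarize> and <select>")


select_turn = parse_response(
    "<think>need a closer look</think>"
    "<summarize>P: 0; 30; 60</summarize><select>42, 43</select>"
)
answer_turn = parse_response("<think>evidence is sufficient</think><answer>B</answer>")
```

Dropping the `<think>` text at parse time mirrors the rule that only the `<summarize>` state is carried to the next turn.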

Loss & Training

For open-source VLMs, ReViSe models interaction as a finite-step MDP $\mathcal{M}=\langle\mathcal{S},\mathcal{A},\mathcal{T},r,\gamma\rangle$. The policy $\pi_\theta$ is optimized at the token level.

The proposed EAGER (Evidence-Adjusted Gain for Efficient Reasoning) reward is dense and "label-free" in the sense that it requires only answer labels and model scores, with no keyframe annotations. It is built on the temperature-calibrated log-odds margin $m_t=\log p_\theta(y^\star\mid p_t,z_{t-1},S_t)-\max_{y\neq y^\star}\log p_\theta(y\mid p_t,z_{t-1},S_t)$ and comprises three terms:

(i) Confidence Gain $r_t^{\text{conf}}=[m_{t+1}-m_t]_+$: Rewards new frames that increase the gap between the correct option and the strongest distractor.
(ii) Summary Adequacy $r_t^{\text{sum}}=\mathbb{1}[\arg\max_{y}p_\theta(y\mid p_\tau,z_{\tau-1})=y^\star]$: At the terminal stage, the question is asked using only the final summary. A reward is given only if the answer is correct, forcing the summary to be self-sufficient.
(iii) Correct & Early Stopping $r_t^{\text{stop}}=1+\beta[T_{\text{stop}}-\tau]_+$: Rewards a correct answer, plus a bonus scaled by $\beta$ for each turn saved relative to the turn budget $T_{\text{stop}}$.
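
The three terms can be computed from per-option scores as in the sketch below. Several details are assumptions rather than statements from the source: temperature is applied by dividing logits before the softmax, the stop reward is gated to zero for incorrect answers, and the option logits and the values of `beta` and `t_stop` are made up. How the terms are weighted into one scalar is not specified here, so they are left separate.

```python
import math
from typing import Dict


def log_softmax(logits: Dict[str, float], temperature: float = 1.0) -> Dict[str, float]:
    """Temperature-scaled log-probabilities over answer options."""
    scaled = {k: v / temperature for k, v in logits.items()}
    peak = max(scaled.values())
    log_z = peak + math.log(sum(math.exp(v - peak) for v in scaled.values()))
    return {k: v - log_z for k, v in scaled.items()}


def margin(logits: Dict[str, float], correct: str, temperature: float = 1.0) -> float:
    """m_t: log p(correct) minus the log p of the strongest distractor."""
    logp = log_softmax(logits, temperature)
    best_other = max(v for k, v in logp.items() if k != correct)
    return logp[correct] - best_other


def confidence_gain(m_t: float, m_next: float) -> float:
    """r_conf = [m_{t+1} - m_t]_+ : only improvements are rewarded."""
    return max(0.0, m_next - m_t)


def summary_adequacy(summary_only_logits: Dict[str, float], correct: str) -> float:
    """r_sum = 1 if the summary-only prediction is the correct option, else 0."""
    predicted = max(summary_only_logits, key=summary_only_logits.get)
    return 1.0 if predicted == correct else 0.0


def stop_reward(is_correct: bool, stop_turn: int, t_stop: int, beta: float) -> float:
    """r_stop = 1 + beta * [T_stop - tau]_+ ; zero if incorrect (assumed gating)."""
    if not is_correct:
        return 0.0
    return 1.0 + beta * max(0, t_stop - stop_turn)


# Made-up option logits before and after fetching new frames.
m_before = margin({"A": 1.0, "B": 1.0, "C": 0.0}, correct="B")
m_after = margin({"A": 1.0, "B": 3.0, "C": 0.0}, correct="B")
gain = confidence_gain(m_before, m_after)
```

In this example the new frames move the correct option from a tie with the strongest distractor to a clear lead, so the gain is positive; frames that left the margin unchanged or reduced it would earn zero.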

The policy is optimized via GRPO. Since 3B-class VLMs have weak instruction-following, the model undergoes supervised format fine-tuning using 8,000 multi-turn dialogues distilled from GPT-4o before RL.
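
For readers unfamiliar with GRPO, its core step is to score each sampled rollout relative to the other rollouts for the same question, instead of using a learned value baseline. The sketch below shows only that normalization step. The use of the population standard deviation and the `eps` constant are implementation assumptions, and the example rewards are made up.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward by the group mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one question, each with a scalar episode reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that beat the group average receive a positive advantage and the rest a negative one; when all rollouts earn the same reward, every advantage is zero and that group contributes no gradient signal.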

Key Experimental Results

Main Results

Backbones: Qwen2-VL-7B, Qwen2.5-VL(3B/7B), InternVL2-8B, GPT-4o. Datasets: VideoEspresso, NExT-QA, EgoSchema. Budget: Max 3 frames/turn, max 4 turns.

VideoEspresso (Avg Acc %, #Frames refers to average processed frames per video):

Backbone + Method	#Frames	Avg Acc	Gain vs Orig.
InternVL2-8B	FPS=1	28.7	—
InternVL2 + ReViSe	2.87	32.1	+3.4
Qwen2-VL-7B	FPS=1	28.5	—
Qwen2-VL + ReViSe	6.25	37.8	+9.3
GPT-4o	FPS=3	26.4	—
GPT-4o + ReViSe	7.99	48.9	+22.5

GPT-4o + ReViSe achieves the best performance in 13 of 14 categories while processing fewer than ten frames per video on average (7.99).

Plug-and-play Comparison (EgoSchema subset / NExT-QA, Acc % + Frames/Captions):

Dataset	Method	Acc (%)	Frames/Caps
EgoSchema	VideoAgent	60.2	8.4
EgoSchema	VideoTree	66.2	62.4
EgoSchema	LLoVi	57.6	180
EgoSchema	GPT-4o + ReViSe	60.6	9.8
NExT-QA	VideoTree	73.5	56
NExT-QA	VideoAgent	71.3	8.2
NExT-QA	LVNet	61.1	12
NExT-QA	GPT-4o + ReViSe	63.8	8.4

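ReViSe stays in the "ultra-low frame budget" range, using roughly 6–7× fewer frames than VideoTree and 13–18× fewer than LLoVi/MC-ViT-L (MC-ViT-L is not listed in the table above), without relying on captioners. In accuracy it is on par with VideoAgent on EgoSchema (60.6 vs 60.2) but trails VideoAgent and VideoTree on NExT-QA (63.8 vs 71.3 and 73.5).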
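
The frame-reduction factors can be rechecked directly from the Frames/Caps column of the table above. The snippet only redoes that arithmetic; the dictionary layout and the `reduction` helper are illustrative, and MC-ViT-L is omitted because its frame count does not appear in the table.

```python
# Frames/Caps values copied from the plug-and-play comparison table.
frames = {
    "EgoSchema": {"VideoTree": 62.4, "LLoVi": 180.0, "ReViSe": 9.8},
    "NExT-QA": {"VideoTree": 56.0, "ReViSe": 8.4},
}


def reduction(dataset: str, method: str) -> float:
    """How many times more frames `method` uses than GPT-4o + ReViSe."""
    return frames[dataset][method] / frames[dataset]["ReViSe"]


videotree_ego = round(reduction("EgoSchema", "VideoTree"), 1)
videotree_next = round(reduction("NExT-QA", "VideoTree"), 1)
llovi_ego = round(reduction("EgoSchema", "LLoVi"), 1)
```

The VideoTree ratios come out at about 6.4× and 6.7×, and the LLoVi ratio at about 18.4×, which matches the upper end of the quoted 13–18× range.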

Reinforced Fine-tuning (RFT, Qwen2.5-VL-3B):

Dataset	Method	Acc (%)	Frames	Rounds	Time(s)
VideoEspresso	Direct Reasoning	12.6	8.0	1.00	1.02
VideoEspresso	Plug-and-Play	20.1	5.2	1.86	1.73
VideoEspresso	Reinforced FT	27.8	4.1	1.37	1.02
NExT-QA	Direct Reasoning	23.6	8.0	1.00	0.88
NExT-QA	Plug-and-Play	31.7	5.3	1.74	1.22
NExT-QA	Reinforced FT	51.3	3.9	1.32	0.62

RFT outperforms plug-and-play on both datasets (27.8 vs 20.1 on VideoEspresso, 51.3 vs 31.7 on NExT-QA) while reducing frames, turns, and inference time.

Ablation Study

(VideoEspresso + Qwen2.5-VL-7B)

Configuration	Key Metrics	Note
Full model	Highest acc, min turns, lowest latency	Complete model
Increasing allowed turns	1 turn (38.3%@4.60f) → 4 turns (42.1%@2.89f)	More turns improve accuracy and reduce average frames due to early stopping (mean ≈ 2.3)
w/o Persistent State	Acc −18.34%, computation nearly doubled	Forces context reconstruction every turn
w/o Structured (P/O/H/U/R)	Significant drop, max runtime increase (+32.14s)	Explicit state propagation is critical for stability

Key Findings

Persistent state is the primary contributor: Removing it causes an 18.34% drop in accuracy. The value of multiple turns lies in compressing seen frames into state.
Structured fields are essential: Removing the P/O/H/U/R structure increases runtime significantly (+32.14s), suggesting that an explicit state reduces the need for redundant context.
More turns = Higher Accuracy & Efficiency: The Pareto front for accuracy vs. frame count is monotonic; allowing more turns enables the model to fetch targeted frames and stop early, using fewer frames on average.

Highlights & Insights

Adapting RNN hidden states to VLM multi-turn reasoning: "Summary-as-state" acts as a text-based hidden state carrying task-critical info, solving the "long context vs. efficiency" dilemma by maintaining a constant-cost memory.
Clever "Summary Adequacy" in EAGER: Requiring an answer based solely on the summary forces the model to encode evidence into the state rather than relying on raw history.
Confidence gain via log-odds margin: The $[m_{t+1}-m_t]_+$ reward only credits frames that truly widen the gap for the correct option, providing a dense signal for frame selection without manual keyframe labels.
Dual setup (Plug-and-play + RFT): The framework covers both closed-source and open-source models under a unified architecture.

Limitations & Future Work

Limitations identified by authors: (1) Dependence on backbone visual fidelity; (2) Multi-turn latency from API calls; (3) Lack of spatial cropping within frames.
Observations on limitations: Absolute accuracy on VideoEspresso is relatively low (48.9% for GPT-4o), indicating this is a "low-budget" specialist rather than a general SOTA across all settings. RFT was only verified on 3B models.
Future directions: Adaptive spatial cropping, stronger visual encoders, and extension to open-ended generation.

Comparison with Related Work

vs VideoAgent: Both do multi-turn selection, but VideoAgent is caption-based and re-reads history; ReViSe reasons in visual space and uses compact states.
vs VideoTree: VideoTree achieves higher accuracy (73.5% NExT-QA) but uses significantly more frames (56 vs 8.4); ReViSe prioritizes efficiency.
vs Single-turn/RL methods: ReViSe differentiates itself through persistent state maintenance across iterations and its label-free EAGER reward.

Rating

Novelty: ⭐⭐⭐⭐ The combination of "summary-as-state" and EAGER is novel for multi-turn VQA.
Experimental Thoroughness: ⭐⭐⭐⭐ Thorough benchmarks and ablations, though RFT is limited to 3B backbones.
Writing Quality: ⭐⭐⭐⭐ Clear formalization; state designs are well-explained.
Value: ⭐⭐⭐⭐ High practical utility for engineering efficient video reasoning agents.