TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision
- Conference: ICCV 2025
- arXiv: 2506.09445
- Code: Not released
- Area: Video Understanding / Video QA / Temporal Grounding
- Keywords: Video QA, temporal grounding, weak supervision, vision-language models, multi-scale temporal modeling
TL;DR
This paper proposes TOGA, a weakly supervised vision-language model that generates pseudo temporal labels via a multi-scale visual-language connector and consistency constraints, enabling joint generation of open-ended answers and temporal grounding without any temporal annotations. It achieves state-of-the-art grounding on NExT-GQA and state-of-the-art open-ended QA accuracy on MSVD-QA and ActivityNet-QA.
Background & Motivation
Video QA requires models not only to generate correct answers but also to localize the temporal segment in the video that supports the answer — i.e., grounded video QA. This task presents three key challenges:
High cost of temporal annotation: Obtaining precise start/end time annotations requires substantial human effort. Existing methods such as Grounded-VideoLLM rely on external GPT-4-generated annotations or borrow labels from ActivityNet-Captions, incurring both cost and noise.
Open-ended vs. multiple-choice: Prior weakly supervised methods (e.g., SeViLA, LLoVi) rely on candidate options at inference time to select answers, which limits open-ended generation capability. TOGA generates free-form text answers, posing a significantly harder challenge.
Independent prediction of answers and grounding: Existing methods predict answers and temporal segments separately (e.g., via post-processing or independent modules), failing to model the dependency between answer content and temporal windows.
TOGA's core mechanism: jointly generate the answer and its temporal grounding (in the format Answer [start, end]), using consistency constraints to produce high-quality pseudo labels under weak supervision.
Method
Overall Architecture
TOGA consists of four modules (a wiring sketch follows the list):
- Visual Encoder: Frozen CLIP-ViT-Large, extracting per-frame features from uniformly sampled video frames.
- Text Encoder: Frozen LLM tokenizer + embedding layer (Mistral-7B), generating token-level text features.
- Multi-Scale Visual-Language Connector (MS-VLC): Trainable; aligns visual and text features at two temporal resolutions.
- Text Decoder: Mistral-7B Instruct, fine-tuned to jointly generate answers and temporal grounding.
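Since the code is not released, here is a minimal sketch of how these four modules might compose, in PyTorch. All interfaces and shapes (`d_vis`, `d_llm`, the HuggingFace-style `inputs_embeds` decoder call) are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class TOGASketch(nn.Module):
    """Illustrative wiring of TOGA's four modules; names and shapes assumed."""
    def __init__(self, clip_vit, ms_vlc, llm):
        super().__init__()
        self.visual_encoder = clip_vit.eval()  # frozen CLIP-ViT-Large
        self.ms_vlc = ms_vlc                   # trainable connector (see below)
        self.llm = llm                         # Mistral-7B Instruct decoder

    def forward(self, frames, prompt_embeds):  # frames: (T, 3, H, W)
        with torch.no_grad():                  # visual encoder stays frozen
            frame_feats = self.visual_encoder(frames)   # (T, d_vis)
        visual_tokens = self.ms_vlc(frame_feats)        # (N_vis, d_llm)
        # Visual tokens are prepended to the text prompt embeddings, and the
        # decoder generates "answer [start, end]" autoregressively.
        inputs = torch.cat([visual_tokens, prompt_embeds], dim=0)
        return self.llm(inputs_embeds=inputs.unsqueeze(0))
```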
Multi-Scale Visual-Language Connector (MS-VLC)
The MS-VLC is one of TOGA's core innovations. It processes video frames at two temporal granularities:
- Sparse scale (4 frames): Captures low-frequency temporal features, suitable for localizing long-duration events.
- Dense scale (16 frames): Captures high-frequency temporal features, suitable for localizing short-duration events.
Each VLC module is implemented with RegNet + 3D convolutions, and the two scales share parameters. This multi-scale processing strategy draws on successful practices in activity recognition (SlowFast) and audio event detection.
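A minimal sketch of the two-scale design, assuming the connector consumes per-frame patch-grid features and using a single 3D-convolutional layer as a stand-in for the actual RegNet + 3D-conv stack; per the paper, the same weights process both scales:

```python
import torch
import torch.nn as nn

class MSVLCSketch(nn.Module):
    """One shared connector applied at two frame rates (layout assumed)."""
    def __init__(self, d_vis=1024, d_llm=4096):
        super().__init__()
        # Stand-in for the RegNet + 3D-convolution stack; parameters are
        # shared across the sparse and dense scales.
        self.shared = nn.Sequential(
            nn.Conv3d(d_vis, d_llm, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.GELU(),
        )

    def run_scale(self, feats):                    # feats: (T, d_vis, h, w)
        x = feats.permute(1, 0, 2, 3).unsqueeze(0)  # (1, d_vis, T, h, w)
        x = self.shared(x)                          # (1, d_llm, T, h, w)
        return x.mean(dim=(3, 4)).squeeze(0).t()    # (T, d_llm) pooled tokens

    def forward(self, frame_feats):                # dense grid: 16 frames
        dense = frame_feats                         # high-frequency, short events
        sparse = frame_feats[::4]                   # 4 frames: low-frequency, long events
        return torch.cat([self.run_scale(sparse), self.run_scale(dense)], dim=0)
```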
Three-Stage Training Strategy
TOGA adopts a progressive multi-stage training procedure to incrementally acquire grounding capability:
Stage 1 — Visual-Text Alignment: Only the MS-VLC module is trained, using video-text pairs from Video-ChatGPT spanning video captioning, sentence completion, and QA tasks. The objective is to align multi-scale video features with text features.
Stage 2 — Instruction Fine-Tuning (Temporal Reference): MS-VLC + LLM decoder are trained jointly. The core objective is to teach the model to understand prompts with temporal references (e.g., What is the activity in [10, 20]?) and generate temporally grounded responses (e.g., A boy is running [10, 20]). Since no real annotations are available, pseudo labels are generated by cropping video temporal segments: a start/end time is selected, that segment is treated as an independent video, and the Stage 1 model generates a description as the pseudo answer.
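A hedged sketch of this bootstrapping step; `stage1_captioner` and the uniform segment sampling are placeholders for whatever the paper actually uses:

```python
import random

def make_grounded_pseudo_label(video_frames, stage1_captioner):
    """Bootstrap a temporally grounded target from an unlabeled video.

    `stage1_captioner` stands in for the Stage 1 model; the sampling
    scheme here is an assumption, not the paper's exact recipe.
    """
    num_frames = len(video_frames)
    start = random.randint(0, num_frames - 2)      # pick a segment
    end = random.randint(start + 1, num_frames - 1)
    segment = video_frames[start:end + 1]          # treat crop as its own video
    caption = stage1_captioner(segment)            # e.g. "A boy is running"
    prompt = f"What is the activity in [{start}, {end}]?"
    target = f"{caption} [{start}, {end}]"         # grounded response format
    return prompt, target
```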
Stage 3 — Consistency Constraint Refinement: The key innovation. High-quality pseudo labels are filtered through consistency constraints. Specifically, for a grounding question \(Q_g\) (e.g., What is the boy doing?) that produces the response Stands up [5, 10], a corresponding referring question \(Q_r\) is constructed (e.g., What does the boy do in [5, 10]?), with the expectation that the answer is consistent (Stands up) and aligned with the ground-truth answer. This bidirectional consistency ensures the reliability of weakly supervised pseudo labels.
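A sketch of the consistency filter under assumed interfaces (`answer_and_ground`, `answer`, and the `answers_match` text-similarity check are hypothetical; no code is released):

```python
def passes_consistency(model, video, grounding_q, gt_answer, answers_match):
    """Keep a pseudo label only if both question directions agree."""
    # Grounding direction: "What is the boy doing?" -> "Stands up [5, 10]"
    answer, (start, end) = model.answer_and_ground(video, grounding_q)
    # Referring direction: ask about the predicted window explicitly.
    referring_q = f"{grounding_q.rstrip('?')} in [{start}, {end}]?"
    ref_answer = model.answer(video, referring_q)
    # Both answers must agree with each other and with the ground truth.
    return answers_match(answer, ref_answer) and answers_match(answer, gt_answer)
```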
Loss & Training
The standard next-token prediction loss (as in language model training) is applied, with different target formats across stages (a serialization sketch follows the list):
- Answer only: answer
- Grounding only: [<<<start>>>, <<<end>>>]
- Answer + grounding: answer [<<<start>>>, <<<end>>>]
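A small sketch of serializing and parsing these formats; the bracketed <<<...>>> markers follow the formats listed above, while the parsing regex is an assumption:

```python
import re

def build_target(answer=None, start=None, end=None):
    """Serialize the three target formats listed above."""
    parts = []
    if answer is not None:
        parts.append(answer)
    if start is not None and end is not None:
        parts.append(f"[<<<{start}>>>, <<<{end}>>>]")
    return " ".join(parts)

def parse_response(text):
    """Recover (answer, span) from a decoded response; regex is an assumption."""
    m = re.search(r"\[\s*<*(\d+)>*\s*,\s*<*(\d+)>*\s*\]", text)
    span = (int(m.group(1)), int(m.group(2))) if m else None
    answer = re.sub(r"\[.*?\]", "", text).strip() or None
    return answer, span

# build_target("Stands up", 5, 10) -> "Stands up [<<<5>>>, <<<10>>>]"
```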
Key Experimental Results
Datasets and Metrics
| Dataset | Task | Characteristics |
|---|---|---|
| NExT-GQA | Weakly supervised Grounded QA | Long videos (avg. 40s), causal + temporal questions |
| ReXTime | Zero-shot Grounding | Cross-segment causal reasoning |
| MSVD-QA | Open-ended QA | 1,970 videos, 50K+ QA pairs |
| ActivityNet-QA | Open-ended QA | 5,800 videos, 58K QA pairs |
Main Results
Table 1: NExT-GQA Weakly Supervised Grounded QA
| Method | Open-ended | mIoU | IoU@0.5 | mIoP | IoP@0.5 | Acc@GQA |
|---|---|---|---|---|---|---|
| SeViLA | ✗ | 21.7 | 13.8 | 29.5 | 22.9 | 16.6 |
| LLoVi | ✗ | 20.0 | 15.3 | 37.3 | 36.9 | 24.3 |
| Grounded-VideoLLM | ✗ | 21.1 | 18.0 | 34.5 | 34.4 | 26.7 |
| TOGA | ✓ | 24.4 | 21.1 | 40.5 | 40.6 | 24.6 |
Despite operating in the harder open-ended setting, TOGA surpasses all closed-set methods on every grounding metric: mIoU improves by +2.7 pp over the best prior result (SeViLA's 21.7), and IoP@0.5 reaches 40.6%. Its Acc@GQA (24.6) remains competitive with, though slightly below, Grounded-VideoLLM (26.7).
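For reference, a minimal sketch of the two grounding metrics on (start, end) intervals; IoP normalizes by the predicted span, so it rewards predictions that fall inside the ground truth:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals, as used for mIoU / IoU@0.5."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_iop(pred, gt):
    """Intersection over Prediction: fraction of the predicted span inside GT."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    span = pred[1] - pred[0]
    return inter / span if span > 0 else 0.0

# e.g. temporal_iou((5, 10), (6, 12)) == 4/7, temporal_iop((5, 10), (6, 12)) == 0.8
```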
Table 3: Open-Ended Video QA
| Method | MSVD-QA Acc | MSVD-QA Score | ActivityNet-QA Acc | ActivityNet-QA Score |
|---|---|---|---|---|
| Video-LLaVA | 70.7 | 3.9 | 45.3 | 3.3 |
| Video-LLaMA2 | 70.9 | 3.8 | 50.2 | 3.3 |
| TOGA | 73.8 | 3.9 | 52.0 | 3.4 |
Ablation Study
Table 4: Importance of Multi-Scale VLC (NExT-GQA, IoU)
| Model | All | Short | Medium | Long |
|---|---|---|---|---|
| Sparse only | 20.0 | 16.2 | 28.9 | 47.5 |
| Dense only | 22.1 | 18.3 | 32.2 | 32.1 |
| Multi-scale (MS-VLC) | 24.4 | 20.5 | 34.7 | 49.3 |
Multi-scale processing yields the largest gains for short and long events — the sparse scale excels at localizing long-duration segments, while the dense scale excels at short ones, and the two are complementary.
Effect of consistency constraints: Removing Stage 3 (training only on pseudo labels) causes mIoU to drop sharply from 24.4 to 12.1, demonstrating that consistency constraints are critical to the success of weakly supervised grounding.
Table 5: Question Type Analysis (Acc@GQA)
| Causal-Why | Causal-How | Temporal-Present | Temporal-Past | Temporal-Future |
|---|---|---|---|---|
| 26.1 | 27.4 | 23.4 | 18.0 | 18.1 |
Temporal questions (especially past/future) are significantly harder than causal questions, requiring stronger long-range reasoning capability.
Highlights & Insights
- A unified solution to the triple challenge of weak supervision + open-ended generation + joint prediction: no reliance on external models or annotation databases, purely bootstrapped training.
- Consistency constraints serve as an elegant self-supervised signal: answers to grounding questions and referring questions mutually verify each other, filtering noisy pseudo labels.
- Joint generation outperforms separate prediction: the model can adjust the temporal window based on answer content, capturing the correlation between answers and grounding.
- High inference efficiency: average 0.6 seconds per sample (A100 GPU), suitable for practical deployment.
Limitations & Future Work
- Open-ended answers may be underestimated under standard evaluation metrics (semantically equivalent but textually non-matching answers are marked incorrect).
- Accuracy on temporal questions (past/future types) remains relatively low; long-range temporal reasoning has room for improvement.
- Pseudo label quality is bounded by the descriptive capability of the Stage 1 model and may be insufficient for complex scenes (multi-person interactions, rapid action changes).
- Validation is limited to videos of approximately 40 seconds; generalization to longer videos (several minutes or more) remains unknown.
Related Work & Insights
- Video QA: FrozenBiLM, Video-ChatGPT, Video-LLaVA, Chat-UniVi
- Grounded Video QA: SeViLA, LLoVi, Grounded-VideoLLM, VideoStreaming
- Weakly supervised temporal grounding: NExT-GQA, IGV
Rating
- Novelty: ⭐⭐⭐⭐ — The consistency-constraint-based pseudo label strategy represents a new paradigm for weakly supervised grounded QA.
- Value: ⭐⭐⭐⭐ — Zero annotation requirement lowers the barrier to deployment.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple datasets, detailed ablations, and question-type analysis.
- Writing Quality: ⭐⭐⭐⭐ — Clear motivation and thorough method description.