TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision

Conference: ICCV 2025 arXiv: 2506.09445 Code: Not released Area: Video Understanding / Video QA / Temporal Grounding Keywords: Video QA, temporal grounding, weak supervision, vision-language models, multi-scale temporal modeling

TL;DR

This paper proposes TOGA, a weakly supervised vision-language model that aligns video and text through a multi-scale visual-language connector and filters pseudo temporal labels with consistency constraints, enabling joint generation of open-ended answers and temporal grounding without any temporal annotations. It achieves state-of-the-art grounding performance on NExT-GQA and the best open-ended QA results among compared methods on MSVD-QA and ActivityNet-QA.

Background & Motivation

Video QA requires models not only to generate correct answers but also to localize the temporal segment in the video that supports the answer — i.e., grounded video QA. This task presents three key challenges:

High cost of temporal annotation: Obtaining precise start/end time annotations requires substantial human effort. Existing methods such as Grounded-VideoLLM rely on external GPT-4-generated annotations or borrow labels from ActivityNet-Captions, incurring both cost and noise.

Open-ended vs. multiple-choice: Prior weakly supervised methods (e.g., SeViLA, LLoVi) rely on candidate options at inference time to select answers, which limits open-ended generation capability. TOGA generates free-form text answers, posing a significantly harder challenge.

Independent prediction of answers and grounding: Existing methods predict answers and temporal segments separately (e.g., via post-processing or independent modules), failing to model the dependency between answer content and temporal windows.

TOGA's core idea is to jointly generate the answer and its temporal grounding in a single response of the form Answer [start, end], relying on consistency constraints to produce high-quality pseudo labels under weak supervision.
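To make the output format concrete, a response like A boy is running [10, 20] can be split into its answer and time span with a simple pattern. The parser below is a hypothetical illustration, not code from the paper (the exact output tokens may differ):

```python
import re

# Hypothetical parser for TOGA-style joint outputs, e.g. "A boy is running [10, 20]".
PATTERN = re.compile(
    r"^(?P<answer>.+?)\s*\[(?P<start>\d+(?:\.\d+)?),\s*(?P<end>\d+(?:\.\d+)?)\]$"
)

def parse_grounded_answer(text: str):
    """Split a generated string into (answer, (start_sec, end_sec))."""
    m = PATTERN.match(text.strip())
    if m is None:
        return text.strip(), None  # model produced an answer without grounding
    return m["answer"], (float(m["start"]), float(m["end"]))

print(parse_grounded_answer("A boy is running [10, 20]"))
# ('A boy is running', (10.0, 20.0))
```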

Method

Overall Architecture

TOGA consists of four modules:

  1. Visual Encoder: Frozen CLIP-ViT-Large, extracting per-frame features from uniformly sampled video frames.
  2. Text Encoder: Frozen LLM tokenizer + embedding layer (Mistral-7B), generating token-level text features.
  3. Multi-Scale Visual-Language Connector (MS-VLC): Trainable; aligns visual and text features at two temporal resolutions.
  4. Text Decoder: Mistral-7B Instruct, fine-tuned to jointly generate answers and temporal grounding.

Multi-Scale Visual-Language Connector (MS-VLC)

The MS-VLC is one of TOGA's core innovations. It processes video frames at two temporal granularities:

  • Sparse scale (4 frames): Captures low-frequency temporal features, suitable for localizing long-duration events.
  • Dense scale (16 frames): Captures high-frequency temporal features, suitable for localizing short-duration events.

Each VLC module is implemented with RegNet + 3D convolutions, and the two scales share parameters. This multi-scale processing strategy draws on successful practices in activity recognition (SlowFast) and audio event detection.
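Since the code is not released, the following is only a minimal sketch of what a shared-parameter two-scale connector could look like. The tensor shapes, layer widths, and pooling scheme are assumptions; the paper's actual blocks use RegNet, which is replaced here by a plain 3D-conv stack for brevity:

```python
import torch
import torch.nn as nn

class MSVLC(nn.Module):
    """Sketch of a multi-scale visual-language connector (shapes are assumptions).

    One 3D-conv stack is shared across the sparse (4-frame) and dense
    (16-frame) scales, mirroring the paper's parameter-sharing design.
    """

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(vis_dim, vis_dim, kernel_size=(3, 3, 3), padding=1),
            nn.GELU(),
            nn.Conv3d(vis_dim, vis_dim, kernel_size=(3, 3, 3), padding=1),
        )
        self.proj = nn.Linear(vis_dim, llm_dim)  # into the LLM embedding space

    def _one_scale(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: [B, T, H, W, D] per-frame CLIP patch features at one sampling rate.
        x = feats.permute(0, 4, 1, 2, 3)          # [B, D, T, H, W]
        x = self.conv(x)                          # temporal/spatial mixing
        x = x.mean(dim=(3, 4)).permute(0, 2, 1)   # pool space -> [B, T, D]
        return self.proj(x)                       # [B, T, llm_dim]

    def forward(self, sparse_feats, dense_feats):
        # Concatenate sparse (T=4) and dense (T=16) visual tokens for the LLM.
        return torch.cat([self._one_scale(sparse_feats),
                          self._one_scale(dense_feats)], dim=1)

# Toy shapes: batch 1, CLIP-ViT-L patch grid 16x16, feature dim 1024.
msvlc = MSVLC()
sparse = torch.randn(1, 4, 16, 16, 1024)
dense = torch.randn(1, 16, 16, 16, 1024)
print(msvlc(sparse, dense).shape)  # torch.Size([1, 20, 4096])
```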

Three-Stage Training Strategy

TOGA adopts a progressive multi-stage training procedure to incrementally acquire grounding capability:

Stage 1 — Visual-Text Alignment: Only the MS-VLC module is trained, using video-text pairs from Video-ChatGPT that span video captioning, sentence completion, and QA tasks. The objective is to align multi-scale video features with text features.

Stage 2 — Instruction Fine-Tuning (Temporal Reference): MS-VLC + LLM decoder are trained jointly. The core objective is to teach the model to understand prompts with temporal references (e.g., What is the activity in [10, 20]?) and generate temporally grounded responses (e.g., A boy is running [10, 20]). Since no real annotations are available, pseudo labels are generated by cropping video temporal segments: a start/end time is selected, that segment is treated as an independent video, and the Stage 1 model generates a description as the pseudo answer.
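A sketch of this cropping procedure, with crop_fn and caption_fn as hypothetical stand-ins for the video cropper and the frozen Stage-1 captioner (the sampling heuristic is also an assumption):

```python
import random

def make_stage2_pairs(video_path, duration_sec, crop_fn, caption_fn):
    """Sketch of Stage-2 pseudo-label creation via temporal cropping."""
    # Sample a random sub-segment of the video.
    start = round(random.uniform(0.0, duration_sec - 2.0), 1)
    end = round(random.uniform(start + 1.0, duration_sec), 1)

    clip = crop_fn(video_path, start, end)   # treat the crop as its own video
    answer = caption_fn(clip)                # Stage-1 model describes the clip

    # Referring pair: the prompt carries the span, the target is the text.
    referring = (f"What is the activity in [{start}, {end}]?", answer)
    # Grounded pair: the target carries both the text and the span.
    grounded = ("What is happening in the video?",
                f"{answer} [{start}, {end}]")
    return referring, grounded
```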

Stage 3 — Consistency Constraint Refinement: The key innovation. High-quality pseudo labels are filtered through consistency constraints. Specifically, for a grounding question \(Q_g\) (e.g., What is the boy doing?) that produces the response Stands up [5, 10], a corresponding referring question \(Q_r\) is constructed (e.g., What does the boy do in [5, 10]?), with the expectation that the answer is consistent (Stands up) and aligned with the ground-truth answer. This bidirectional consistency ensures the reliability of weakly supervised pseudo labels.
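A sketch of the filter, reusing the parse_grounded_answer helper from above; model.generate, answer_match_fn, and the question templates are assumed interfaces, not the paper's API:

```python
def passes_consistency(model, question_g, gt_answer, answer_match_fn):
    """Sketch of the Stage-3 consistency filter for pseudo labels."""
    # 1) Grounding question: the model must both answer and localize.
    answer_g, span = parse_grounded_answer(model.generate(question_g))
    if span is None or not answer_match_fn(answer_g, gt_answer):
        return False

    # 2) Referring question built from the predicted span,
    #    e.g. "What is the boy doing?" -> "What is the boy doing in [5, 10]?"
    start, end = span
    question_r = f"{question_g.rstrip('?')} in [{start:.0f}, {end:.0f}]?"
    answer_r, _ = parse_grounded_answer(model.generate(question_r))

    # 3) Keep the pseudo label only if both answers agree with the ground truth.
    return answer_match_fn(answer_r, gt_answer)
```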

Loss & Training

Standard next-token prediction loss (as in language model training) is applied, with different target formats across stages:

  • Answer only: answer
  • Grounding only: [start, end]
  • Answer + grounding: answer [start, end]
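A minimal sketch of how such targets can be supervised with this objective, assuming a Hugging Face-style tokenizer and the usual -100 ignore-index convention (the paper does not detail its masking scheme):

```python
import torch

IGNORE_INDEX = -100  # positions with -100 are excluded from the loss

def build_training_example(tokenizer, prompt: str, target: str):
    """Concatenate prompt + target; compute loss only over the target span.

    Example targets per stage (illustrative):
      answer only:        "Stands up"
      grounding only:     "[5, 10]"
      answer + grounding: "Stands up [5, 10]"
    """
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    target_ids = tokenizer(target, add_special_tokens=False).input_ids

    input_ids = torch.tensor(prompt_ids + target_ids)
    labels = torch.tensor([IGNORE_INDEX] * len(prompt_ids) + target_ids)
    return input_ids, labels
```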

Key Experimental Results

Datasets and Metrics

| Dataset | Task | Characteristics |
| --- | --- | --- |
| NExT-GQA | Weakly supervised grounded QA | Long videos (avg. 40 s), causal + temporal questions |
| ReXTime | Zero-shot grounding | Cross-segment causal reasoning |
| MSVD-QA | Open-ended QA | 1,970 videos, 50K+ QA pairs |
| ActivityNet-QA | Open-ended QA | 5,800 videos, 58K QA pairs |

Main Results

Table 1: NExT-GQA Weakly Supervised Grounded QA

| Method | Open-ended | mIoU | IoU@0.5 | mIoP | IoP@0.5 | Acc@GQA |
| --- | --- | --- | --- | --- | --- | --- |
| SeViLA | ✗ | 21.7 | 13.8 | 29.5 | 22.9 | 16.6 |
| LLoVi | ✗ | 20.0 | 15.3 | 37.3 | 36.9 | 24.3 |
| Grounded-VideoLLM | ✗ | 21.1 | 18.0 | 34.5 | 34.4 | 26.7 |
| TOGA | ✓ | 24.4 | 21.1 | 40.5 | 40.6 | 24.6 |

TOGA outperforms all closed-set baselines on every grounding metric despite answering in the harder open-ended setting: its mIoU of 24.4 beats the best prior value (SeViLA, 21.7) by 2.7 points, and IoP@0.5 reaches 40.6%. On Acc@GQA, which also scores the answer text, TOGA (24.6) trails only Grounded-VideoLLM (26.7), which relies on external annotations.
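For reference, mIoU and mIoP in Table 1 average the per-sample IoU (intersection over union) and IoP (intersection over prediction) of the predicted span against the ground truth; a minimal sketch of the two scores:

```python
def temporal_iou(pred, gt):
    """IoU between two [start, end] segments (seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_iop(pred, gt):
    """Intersection over Prediction: fraction of the predicted span that is correct."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    pred_len = pred[1] - pred[0]
    return inter / pred_len if pred_len > 0 else 0.0

print(temporal_iou((5, 10), (6, 12)))  # 0.571...
print(temporal_iop((5, 10), (6, 12)))  # 0.8
```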

Table 3: Open-Ended Video QA

| Method | MSVD-QA Acc | MSVD-QA Score | ActivityNet-QA Acc | ActivityNet-QA Score |
| --- | --- | --- | --- | --- |
| Video-LLaVA | 70.7 | 3.9 | 45.3 | 3.3 |
| Video-LLaMA2 | 70.9 | 3.8 | 50.2 | 3.3 |
| TOGA | 73.8 | 3.9 | 52.0 | 3.4 |

Ablation Study

Table 4: Importance of Multi-Scale VLC (NExT-GQA, IoU)

| Model | All | Short | Medium | Long |
| --- | --- | --- | --- | --- |
| Sparse only | 20.0 | 16.2 | 28.9 | 47.5 |
| Dense only | 22.1 | 18.3 | 32.2 | 32.1 |
| Multi-scale (MS-VLC) | 24.4 | 20.5 | 34.7 | 49.3 |

Multi-scale processing yields the largest gains for short and long events — the sparse scale excels at localizing long-duration segments, while the dense scale excels at short ones, and the two are complementary.

Effect of consistency constraints: Removing Stage 3 (training only on pseudo labels) causes mIoU to drop sharply from 24.4 to 12.1, demonstrating that consistency constraints are critical to the success of weakly supervised grounding.

Table 5: Question Type Analysis (Acc@GQA)

| Causal-Why | Causal-How | Temporal-Present | Temporal-Past | Temporal-Future |
| --- | --- | --- | --- | --- |
| 26.1 | 27.4 | 23.4 | 18.0 | 18.1 |

Temporal questions (especially past/future) are significantly harder than causal questions, requiring stronger long-range reasoning capability.

Highlights & Insights

  1. A unified solution to the triple challenge of weak supervision + open-ended generation + joint prediction: no reliance on external models or annotation databases, purely bootstrapped training.
  2. Consistency constraints serve as an elegant self-supervised signal: answers to grounding questions and referring questions mutually verify each other, filtering noisy pseudo labels.
  3. Joint generation outperforms separate prediction: the model can adjust the temporal window based on answer content, capturing the correlation between answers and grounding.
  4. High inference efficiency: average 0.6 seconds per sample (A100 GPU), suitable for practical deployment.

Limitations & Future Work

  1. Open-ended answers may be underestimated under standard evaluation metrics (semantically equivalent but textually non-matching answers are marked incorrect).
  2. Accuracy on temporal questions (past/future types) remains relatively low; long-range temporal reasoning has room for improvement.
  3. Pseudo label quality is bounded by the descriptive capability of the Stage 1 model and may be insufficient for complex scenes (multi-person interactions, rapid action changes).
  4. Validation is limited to videos of approximately 40 seconds; generalization to longer videos (several minutes or more) remains unknown.
Related Work

  • Video QA: FrozenBiLM, Video-ChatGPT, Video-LLaVA, Chat-UniVi
  • Grounded Video QA: SeViLA, LLoVi, Grounded-VideoLLM, VideoStreaming
  • Weakly supervised temporal grounding: NExT-GQA, IGV

Rating

  • Novelty: ⭐⭐⭐⭐ — The consistency-constraint-based pseudo label strategy represents a new paradigm for weakly supervised grounded QA.
  • Value: ⭐⭐⭐⭐ — Zero annotation requirement lowers the barrier to deployment.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple datasets, detailed ablations, and question-type analysis.
  • Writing Quality: ⭐⭐⭐⭐ — Clear motivation and thorough method description.