TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision
- Conference: ICCV 2025
- arXiv: 2506.09445
- Code: Not released
- Area: Video Understanding / Video QA / Temporal Grounding
- Keywords: Video QA, temporal grounding, weak supervision, vision-language models, multi-scale temporal modeling
TL;DR
This paper proposes TOGA, a weakly supervised vision-language model that generates pseudo temporal labels via a multi-scale visual-language connector and consistency constraints, enabling joint generation of open-ended answers and temporal grounding without any temporal annotations. It achieves state-of-the-art grounding on NExT-GQA and state-of-the-art open-ended QA accuracy on MSVD-QA and ActivityNet-QA.
Background & Motivation
Video QA requires models not only to generate correct answers but also to localize the temporal segment in the video that supports the answer — i.e., grounded video QA. This task presents three key challenges:
High cost of temporal annotation: Obtaining precise start/end time annotations requires substantial human effort. Existing methods such as Grounded-VideoLLM rely on external GPT-4-generated annotations or borrow labels from ActivityNet-Captions, incurring both cost and noise.
Open-ended vs. multiple-choice: Prior weakly supervised methods (e.g., SeViLA, LLoVi) rely on candidate options at inference time to select answers, which limits open-ended generation capability. TOGA generates free-form text answers, posing a significantly harder challenge.
Independent prediction of answers and grounding: Existing methods predict answers and temporal segments separately (e.g., via post-processing or independent modules), failing to model the dependency between answer content and temporal windows.
TOGA's core mechanism: jointly generate the answer and its temporal grounding (in the format Answer [start, end]), using consistency constraints to produce high-quality pseudo labels under weak supervision.
Method
Overall Architecture
TOGA consists of four modules (a wiring sketch follows the list):
- Visual Encoder: Frozen CLIP-ViT-Large, extracting per-frame features from uniformly sampled video frames.
- Text Encoder: Frozen LLM tokenizer + embedding layer (Mistral-7B), generating token-level text features.
- Multi-Scale Visual-Language Connector (MS-VLC): Trainable; aligns visual and text features at two temporal resolutions.
- Text Decoder: Mistral-7B Instruct, fine-tuned to jointly generate answers and temporal grounding.
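Since the code is not released, here is a minimal sketch of how these four modules might compose, in PyTorch. All interfaces and shapes (`d_vis`, `d_llm`, the HuggingFace-style `inputs_embeds` decoder call) are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class TOGASketch(nn.Module):
    """Illustrative wiring of TOGA's four modules; names and shapes assumed."""
    def __init__(self, clip_vit, ms_vlc, llm):
        super().__init__()
        self.visual_encoder = clip_vit.eval()  # frozen CLIP-ViT-Large
        self.ms_vlc = ms_vlc                   # trainable connector (see below)
        self.llm = llm                         # Mistral-7B Instruct decoder

    def forward(self, frames, prompt_embeds):  # frames: (T, 3, H, W)
        with torch.no_grad():                  # visual encoder stays frozen
            frame_feats = self.visual_encoder(frames)   # (T, d_vis)
        visual_tokens = self.ms_vlc(frame_feats)        # (N_vis, d_llm)
        # Visual tokens are prepended to the text prompt embeddings, and the
        # decoder generates "answer [start, end]" autoregressively.
        inputs = torch.cat([visual_tokens, prompt_embeds], dim=0)
        return self.llm(inputs_embeds=inputs.unsqueeze(0))
```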
Multi-Scale Visual-Language Connector (MS-VLC)
The MS-VLC is one of TOGA's core innovations. It processes video frames at two temporal granularities:
- Sparse scale (4 frames): Captures low-frequency temporal features, suitable for localizing long-duration events.
- Dense scale (16 frames): Captures high-frequency temporal features, suitable for localizing short-duration events.
Each VLC module is implemented with RegNet + 3D convolutions, and the two scales share parameters. This multi-scale processing strategy draws on successful practices in activity recognition (SlowFast) and audio event detection.
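A minimal sketch of the two-scale design, assuming the connector consumes per-frame patch-grid features and using a single 3D-convolutional layer as a stand-in for the actual RegNet + 3D-conv stack; per the paper, the same weights process both scales:

```python
import torch
import torch.nn as nn

class MSVLCSketch(nn.Module):
    """One shared connector applied at two frame rates (layout assumed)."""
    def __init__(self, d_vis=1024, d_llm=4096):
        super().__init__()
        # Stand-in for the RegNet + 3D-convolution stack; parameters are
        # shared across the sparse and dense scales.
        self.shared = nn.Sequential(
            nn.Conv3d(d_vis, d_llm, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.GELU(),
        )

    def run_scale(self, feats):                    # feats: (T, d_vis, h, w)
        x = feats.permute(1, 0, 2, 3).unsqueeze(0)  # (1, d_vis, T, h, w)
        x = self.shared(x)                          # (1, d_llm, T, h, w)
        return x.mean(dim=(3, 4)).squeeze(0).t()    # (T, d_llm) pooled tokens

    def forward(self, frame_feats):                # dense grid: 16 frames
        dense = frame_feats                         # high-frequency, short events
        sparse = frame_feats[::4]                   # 4 frames: low-frequency, long events
        return torch.cat([self.run_scale(sparse), self.run_scale(dense)], dim=0)
```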
Three-Stage Training Strategy
TOGA adopts a progressive multi-stage training procedure to incrementally acquire grounding capability:
Stage 1 — Visual-Text Alignment: Only the MS-VLC module is trained, using video-text pairs from Video-ChatGPT spanning video captioning, sentence completion, and QA tasks. The objective is to align multi-scale video features with text features.
Stage 2 — Instruction Fine-Tuning (Temporal Reference): MS-VLC + LLM decoder are trained jointly. The core objective is to teach the model to understand prompts with temporal references (e.g., What is the activity in [10, 20]?) and generate temporally grounded responses (e.g., A boy is running [10, 20]). Since no real annotations are available, pseudo labels are generated by cropping video temporal segments: a start/end time is selected, that segment is treated as an independent video, and the Stage 1 model generates a description as the pseudo answer.
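A hedged sketch of this bootstrapping step; `stage1_captioner` and the uniform segment sampling are placeholders for whatever the paper actually uses:

```python
import random

def make_grounded_pseudo_label(video_frames, stage1_captioner):
    """Bootstrap a temporally grounded target from an unlabeled video.

    `stage1_captioner` stands in for the Stage 1 model; the sampling
    scheme here is an assumption, not the paper's exact recipe.
    """
    num_frames = len(video_frames)
    start = random.randint(0, num_frames - 2)      # pick a segment
    end = random.randint(start + 1, num_frames - 1)
    segment = video_frames[start:end + 1]          # treat crop as its own video
    caption = stage1_captioner(segment)            # e.g. "A boy is running"
    prompt = f"What is the activity in [{start}, {end}]?"
    target = f"{caption} [{start}, {end}]"         # grounded response format
    return prompt, target
```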
Stage 3 — Consistency Constraint Refinement: The key innovation. High-quality pseudo labels are filtered through consistency constraints. Specifically, for a grounding question \(Q_g\) (e.g., What is the boy doing?) that produces the response Stands up [5, 10], a corresponding referring question \(Q_r\) is constructed (e.g., What does the boy do in [5, 10]?), with the expectation that the answer is consistent (Stands up) and aligned with the ground-truth answer. This bidirectional consistency ensures the reliability of weakly supervised pseudo labels.
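A sketch of the consistency filter under assumed interfaces (`answer_and_ground`, `answer`, and the `answers_match` text-similarity check are hypothetical; no code is released):

```python
def passes_consistency(model, video, grounding_q, gt_answer, answers_match):
    """Keep a pseudo label only if both question directions agree."""
    # Grounding direction: "What is the boy doing?" -> "Stands up [5, 10]"
    answer, (start, end) = model.answer_and_ground(video, grounding_q)
    # Referring direction: ask about the predicted window explicitly.
    referring_q = f"{grounding_q.rstrip('?')} in [{start}, {end}]?"
    ref_answer = model.answer(video, referring_q)
    # Both answers must agree with each other and with the ground truth.
    return answers_match(answer, ref_answer) and answers_match(answer, gt_answer)
```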
Loss & Training
The standard next-token prediction loss (as in language model training) is applied, with different target formats across stages (a serialization sketch follows the list):
- Answer only: answer
- Grounding only: [<<<start>>>, <<<end>>>]
- Answer + grounding: answer [<<<start>>>, <<<end>>>]
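A small sketch of serializing and parsing these formats; the bracketed <<<...>>> markers follow the formats listed above, while the parsing regex is an assumption:

```python
import re

def build_target(answer=None, start=None, end=None):
    """Serialize the three target formats listed above."""
    parts = []
    if answer is not None:
        parts.append(answer)
    if start is not None and end is not None:
        parts.append(f"[<<<{start}>>>, <<<{end}>>>]")
    return " ".join(parts)

def parse_response(text):
    """Recover (answer, span) from a decoded response; regex is an assumption."""
    m = re.search(r"\[\s*<*(\d+)>*\s*,\s*<*(\d+)>*\s*\]", text)
    span = (int(m.group(1)), int(m.group(2))) if m else None
    answer = re.sub(r"\[.*?\]", "", text).strip() or None
    return answer, span

# build_target("Stands up", 5, 10) -> "Stands up [<<<5>>>, <<<10>>>]"
```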
Key Experimental Results
Datasets and Metrics
| Dataset | Task | Characteristics |
|---|---|---|
| NExT-GQA | Weakly supervised Grounded QA | Long videos (avg. 40s), causal + temporal questions |
| ReXTime | Zero-shot Grounding | Cross-segment causal reasoning |
| MSVD-QA | Open-ended QA | 1,970 videos, 50K+ QA pairs |
| ActivityNet-QA | Open-ended QA | 5,800 videos, 58K QA pairs |
Main Results
Table 1: NExT-GQA Weakly Supervised Grounded QA
| Method | Open-ended | mIoU | IoU@0.5 | mIoP | IoP@0.5 | Acc@GQA |
|---|---|---|---|---|---|---|
| SeViLA | ✗ | 21.7 | 13.8 | 29.5 | 22.9 | 16.6 |
| LLoVi | ✗ | 20.0 | 15.3 | 37.3 | 36.9 | 24.3 |
| Grounded-VideoLLM | ✗ | 21.1 | 18.0 | 34.5 | 34.4 | 26.7 |
| TOGA | ✓ | 24.4 | 21.1 | 40.5 | 40.6 | 24.6 |
Despite operating in the harder open-ended setting, TOGA surpasses all closed-set methods on every grounding metric: mIoU improves by +2.7 pp over the best prior result (SeViLA's 21.7), and IoP@0.5 reaches 40.6%. Its Acc@GQA (24.6) remains competitive with, though slightly below, Grounded-VideoLLM (26.7).
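For reference, a minimal sketch of the two grounding metrics on (start, end) intervals; IoP normalizes by the predicted span, so it rewards predictions that fall inside the ground truth:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals, as used for mIoU / IoU@0.5."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_iop(pred, gt):
    """Intersection over Prediction: fraction of the predicted span inside GT."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    span = pred[1] - pred[0]
    return inter / span if span > 0 else 0.0

# e.g. temporal_iou((5, 10), (6, 12)) == 4/7, temporal_iop((5, 10), (6, 12)) == 0.8
```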
Table 3: Open-Ended Video QA
| Method | MSVD-QA Acc | MSVD-QA Score | ActivityNet-QA Acc | ActivityNet-QA Score |
|---|---|---|---|---|
| Video-LLaVA | 70.7 | 3.9 | 45.3 | 3.3 |
| Video-LLaMA2 | 70.9 | 3.8 | 50.2 | 3.3 |
| TOGA | 73.8 | 3.9 | 52.0 | 3.4 |
Ablation Study
Table 4: Importance of Multi-Scale VLC (NExT-GQA, IoU)
| Model | All | Short | Medium | Long |
|---|---|---|---|---|
| Sparse only | 20.0 | 16.2 | 28.9 | 47.5 |
| Dense only | 22.1 | 18.3 | 32.2 | 32.1 |
| Multi-scale (MS-VLC) | 24.4 | 20.5 | 34.7 | 49.3 |
Multi-scale processing yields the largest gains for short and long events — the sparse scale excels at localizing long-duration segments, while the dense scale excels at short ones, and the two are complementary.
Effect of consistency constraints: Removing Stage 3 (training only on pseudo labels) causes mIoU to drop sharply from 24.4 to 12.1, demonstrating that consistency constraints are critical to the success of weakly supervised grounding.
Table 5: Question Type Analysis (Acc@GQA)
| Causal-Why | Causal-How | Temporal-Present | Temporal-Past | Temporal-Future |
|---|---|---|---|---|
| 26.1 | 27.4 | 23.4 | 18.0 | 18.1 |
Temporal questions (especially past/future) are significantly harder than causal questions, requiring stronger long-range reasoning capability.
Highlights & Insights
- A unified solution to the triple challenge of weak supervision + open-ended generation + joint prediction: no reliance on external models or annotation databases, purely bootstrapped training.
- Consistency constraints serve as an elegant self-supervised signal: answers to grounding questions and referring questions mutually verify each other, filtering noisy pseudo labels.
- Joint generation outperforms separate prediction: the model can adjust the temporal window based on answer content, capturing the correlation between answers and grounding.
- High inference efficiency: average 0.6 seconds per sample (A100 GPU), suitable for practical deployment.
Limitations & Future Work
- Open-ended answers may be underestimated under standard evaluation metrics (semantically equivalent but textually non-matching answers are marked incorrect).
- Accuracy on temporal questions (past/future types) remains relatively low; long-range temporal reasoning has room for improvement.
- Pseudo label quality is bounded by the descriptive capability of the Stage 1 model and may be insufficient for complex scenes (multi-person interactions, rapid action changes).
- Validation is limited to videos of approximately 40 seconds; generalization to longer videos (several minutes or more) remains unknown.
Related Work & Insights
- Video QA: FrozenBiLM, Video-ChatGPT, Video-LLaVA, Chat-UniVi
- Grounded Video QA: SeViLA, LLoVi, Grounded-VideoLLM, VideoStreaming
- Weakly supervised temporal grounding: NExT-GQA, IGV
Rating
- Novelty: ⭐⭐⭐⭐ — The consistency-constraint-based pseudo label strategy represents a new paradigm for weakly supervised grounded QA.
- Value: ⭐⭐⭐⭐ — Zero annotation requirement lowers the barrier to deployment.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple datasets, detailed ablations, and question-type analysis.
- Writing Quality: ⭐⭐⭐⭐ — Clear motivation and thorough method description.