AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions

Conference: ICLR 2026 · arXiv: 2603.07394 · Code: https://aqua-iclr2026.github.io/ · Area: Dialogue Systems · Keywords: ambiguity, VQA, response strategy, uncertainty handling, GRPO

TL;DR

This paper proposes AQuA, the first visual question answering dataset with fine-grained ambiguity grading across four levels (7.2K samples, 1.8K per level), defining an optimal response strategy for each level (direct answer / inference / enumeration / clarification request). The study finds that GPT-5 and Gemini over-confidently default to direct answers on ambiguous VQA instances, while a 3B model trained via SFT+GRPO surpasses closed-source large models in strategy adaptation.

Background & Motivation

Background: VQA benchmarks predominantly use unambiguous image-question pairs, yet ambiguity is pervasive in real-world scenarios (e.g., unclear referents, multiple plausible objects, complex scenes).

Limitations of Prior Work: (1) Existing ambiguous VQA research adopts a binary strategy—either answer or ask—which does not reflect the flexible, nuanced strategies humans employ in practice. (2) State-of-the-art models such as GPT-5 and Gemini tend to respond directly and overconfidently to ambiguous questions rather than adapting their strategy to the degree of ambiguity.

Key Challenge: Different types and degrees of ambiguity call for different response strategies, yet models lack fine-grained ambiguity awareness and the ability to select appropriate strategies.

Goal: How can a VLM adaptively select the optimal response strategy based on the ambiguity level of a visual question?

Key Insight: A four-level ambiguity taxonomy with corresponding strategies is defined, training data are constructed, and models are fine-tuned via SFT+GRPO.

Core Idea: Teaching VLMs to respond as humans do—answer directly for simple questions, infer for resolvable referents, enumerate for a small set of candidates, and request clarification when ambiguity is high.
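The level-to-strategy mapping above can be sketched as a small lookup table. This is a minimal illustration; the level numbers and strategy descriptions follow the paper, but the identifiers (`STRATEGY_BY_LEVEL`, `optimal_strategy`) are ours, not from the released code.

```python
# Hypothetical encoding of AQuA's four-level taxonomy; names are illustrative.
STRATEGY_BY_LEVEL = {
    0: "direct_answer",         # unambiguous: a unique correct answer exists
    1: "infer_and_answer",      # referent ("this/that") is resolvable from context
    2: "enumerate_candidates",  # 2-3 equally plausible referents
    3: "request_clarification", # 5+ similar objects, referent unresolvable
}

def optimal_strategy(level: int) -> str:
    """Return the paper's prescribed response strategy for an ambiguity level."""
    return STRATEGY_BY_LEVEL[level]
```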

Method

Overall Architecture

AQuA combines a dataset and a training methodology. The dataset contains 7.2K samples (1.8K per level). The training pipeline first applies SFT to teach the strategy space, followed by GRPO to reinforce strategy selection. Question-answer pairs are generated by GPT-5 from COCO images, filtered through a three-stage pipeline, and validated by human annotators.

Key Designs

  1. Four-Level Ambiguity Taxonomy:

    • Level 0 (Unambiguous): Standard VQA with a unique answer → direct answer
    • Level 1 (Low-level referential ambiguity): Contains referents such as "this/that" that are nevertheless inferable from context → infer and answer directly
    • Level 2 (Multiple plausible interpretations): 2–3 equally plausible referents → enumerate all possible answers
    • Level 3 (Highly ambiguous): 5+ similar objects, referent unresolvable → request clarification
    • Design Motivation: Models four natural human strategies for handling ambiguity; human evaluation confirms high alignment with human strategy selection.
  2. Two-Stage SFT + GRPO Training:

    • SFT: Supervised fine-tuning on the AQuA training set to teach the model the space of strategy expressions.
    • GRPO: An LLM-as-judge evaluates strategy correctness; \(R=1\) (correct strategy and factually accurate) / \(R=1-\lambda\) (correct strategy but hallucinated content) / \(R=0\) (incorrect strategy).
    • Design Motivation: SFT alone cannot reliably select the correct strategy; GRPO reinforces strategy decisions through reward signals.
  3. Saliency-Based Automatic Ambiguity Level Assignment:

    • Object saliency scores are computed from COCO bounding boxes (area ratio \(\times 0.7\) + center distance \(\times 0.3\)).
    • Objects with a score above 0.6 are considered salient; ambiguity levels are assigned based on the number of salient objects (\(1 \rightarrow\) L1, \(2\text{–}3 \rightarrow\) L2, \(5+ \rightarrow\) L3).
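The saliency rule above can be sketched as follows. This is a reconstruction under stated assumptions: the summary does not specify how center distance is normalized, so we assume it is normalized by the half-diagonal and inverted (so that centered objects score higher); all function names are ours.

```python
import math

def saliency(box, img_w, img_h):
    """Score one COCO box as 0.7 * area ratio + 0.3 * centrality.
    Assumption: centrality = 1 - (distance from image center / half-diagonal)."""
    x, y, w, h = box  # COCO format: top-left x, y, width, height
    area_ratio = (w * h) / (img_w * img_h)
    cx, cy = x + w / 2, y + h / 2
    dist = math.hypot(cx - img_w / 2, cy - img_h / 2)
    half_diag = math.hypot(img_w / 2, img_h / 2)
    centrality = 1.0 - dist / half_diag  # 1 at image center, 0 at a corner
    return 0.7 * area_ratio + 0.3 * centrality

def ambiguity_level(boxes, img_w, img_h, thresh=0.6):
    """Map the count of salient objects to an ambiguity level per the paper's rule."""
    n = sum(saliency(b, img_w, img_h) > thresh for b in boxes)
    if n == 1:
        return 1
    if 2 <= n <= 3:
        return 2
    if n >= 5:
        return 3
    return None  # 0 or 4 salient objects: not covered by the stated rule
```

Note that the stated rule leaves the case of exactly four salient objects (and zero) unassigned; how the paper handles these boundary counts is not specified in this summary.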

Loss & Training

SFT uses standard cross-entropy loss. In GRPO, GPT-5-mini serves as the judge and rewards strategy consistency; factual errors are penalized with \(\lambda=0.3\). Fine-tuning is conducted on Qwen2.5-VL-3B and InternVL3-2B.
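The reward shape described above reduces to a small piecewise function. A minimal sketch, assuming the LLM judge (GPT-5-mini in the paper) returns two boolean verdicts; the function name and signature are ours:

```python
LAMBDA = 0.3  # hallucination penalty lambda, per the paper

def grpo_reward(strategy_correct: bool, factually_accurate: bool,
                lam: float = LAMBDA) -> float:
    """R = 1 if strategy and facts are correct, 1 - lambda if the strategy is
    correct but the content is hallucinated, and 0 if the strategy is wrong."""
    if not strategy_correct:
        return 0.0
    return 1.0 if factually_accurate else 1.0 - lam
```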

Key Experimental Results

Main Results (Strategic Accuracy)

| Model | L0 | L1 | L2 | L3 | Overall |
|---|---|---|---|---|---|
| GPT-5 | 89.7 | 0.7 | 0.3 | 0.8 | 22.9 |
| Gemini 2.5 Flash | 99.0 | 5.2 | 4.4 | 0.9 | 27.4 |
| Qwen2.5-VL-72B | 99.6 | 0.6 | 2.1 | 0.9 | 25.8 |
| Qwen2.5-VL-3B + AQuA | | High | High | High | >50 |
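Strategic accuracy, as reported in the table, can be read as the fraction of instances where the model's chosen response strategy matches the level's prescribed one. A minimal sketch of that metric (the function name and label encoding are ours, not the paper's evaluation code):

```python
def strategic_accuracy(predicted, optimal):
    """Fraction of instances whose predicted strategy label matches the
    optimal strategy prescribed by the instance's ambiguity level."""
    assert len(predicted) == len(optimal) and predicted, "need aligned, non-empty lists"
    hits = sum(p == o for p, o in zip(predicted, optimal))
    return hits / len(predicted)
```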

Key Findings

  • All baseline models achieve near-zero strategic accuracy on L1–L3: GPT-5 almost never infers, enumerates, or requests clarification on these instances, defaulting instead to direct answers.
  • GPT-5 achieves 98.4% factual accuracy yet only 22.9% strategic accuracy—models know the answer but do not know when to express uncertainty.
  • The AQuA-trained 3B model surpasses GPT-5 and the 72B open-source model in strategic accuracy.
  • CoT prompting yields only marginal improvements in strategic accuracy (22.9→25.7 for GPT-5), indicating that the deficiency lies in strategy awareness rather than reasoning depth.
  • Human evaluation confirms high alignment between AQuA's four-level classification and human strategy selection (L0: 100%, L1: 96%, L2/L3: 64%).

Highlights & Insights

  • Exposes the "overconfidence" problem in VLMs: Even the strongest models tend to produce single answers to ambiguous questions rather than expressing uncertainty—a significant risk for safe deployment.
  • Practical value of the four-level taxonomy: More faithful to real human behavior than binary answer/ask strategies, providing a finer-grained framework for uncertainty handling in VLMs.
  • Small model + strategy training > large model: A 3B model trained on AQuA substantially outperforms GPT-5 in strategic capability, demonstrating that this is a learnable skill rather than one that requires scale.

Limitations & Future Work

  • The dataset is relatively small (7.2K samples), which may limit strategy generalization.
  • Ambiguity is grounded in COCO object-level referents and does not cover higher-level semantic ambiguity (e.g., metaphor, cultural differences).
  • The boundary between Level 2 and Level 3 exhibits some subjectivity (human agreement rate: 64%).
  • Evaluation is limited to single-turn VQA; strategy switching in multi-turn dialogue remains unexplored.

Comparison with Related Work

  • vs. ClearVQA: ClearVQA trains models on a binary ask-or-answer decision; AQuA defines four strategies, offering greater flexibility.
  • vs. VAGUE: VAGUE evaluates how visual context helps resolve ambiguity; AQuA trains models to select strategies based on the degree of ambiguity.
  • vs. "I don't know" training: Simple refusal to answer is not an optimal strategy—inference or enumeration is often more informative than outright rejection.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First multi-strategy ambiguous VQA framework with an original four-level taxonomy.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-model comparisons, human validation, and GRPO ablations.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear motivation, intuitive examples, and precise level definitions.
  • Value: ⭐⭐⭐⭐⭐ Direct implications for the safe deployment of VLMs—models need to learn to say "I'm not sure."