AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions¶
Conference: ICLR 2026 | arXiv: 2603.07394 | Code: https://aqua-iclr2026.github.io/ | Area: Dialogue Systems | Keywords: ambiguity, VQA, response strategy, uncertainty handling, GRPO
TL;DR¶
This paper proposes AQuA, the first visual question answering dataset with fine-grained ambiguity grading across four levels (7.2K samples, 1.8K per level), defining an optimal response strategy for each level (direct answer / inference / enumeration / clarification request). The study finds that GPT-5 and Gemini over-confidently default to direct answers on ambiguous VQA instances, while a 3B model trained via SFT+GRPO surpasses closed-source large models in strategy adaptation.
Background & Motivation¶
Background: VQA benchmarks predominantly use unambiguous image-question pairs, yet ambiguity is pervasive in real-world scenarios (e.g., unclear referents, multiple plausible objects, complex scenes).
Limitations of Prior Work: (1) Existing ambiguous VQA research adopts a binary strategy—either answer or ask—which does not reflect the flexible, nuanced strategies humans employ in practice. (2) State-of-the-art models such as GPT-5 and Gemini tend to respond directly and overconfidently to ambiguous questions rather than adapting their strategy to the degree of ambiguity.
Key Challenge: Different types and degrees of ambiguity call for different response strategies, yet models lack fine-grained ambiguity awareness and the ability to select appropriate strategies.
Goal: How can a VLM adaptively select the optimal response strategy based on the ambiguity level of a visual question?
Key Insight: A four-level ambiguity taxonomy with corresponding strategies is defined, training data are constructed, and models are fine-tuned via SFT+GRPO.
Core Idea: Teaching VLMs to respond as humans do—answer directly for simple questions, infer for resolvable referents, enumerate for a small set of candidates, and request clarification when ambiguity is high.
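The level-to-strategy correspondence described above can be sketched as a simple lookup (the names below are illustrative, not taken from the paper's code):

```python
# Illustrative mapping of AQuA ambiguity levels to response strategies.
# The paper defines the four levels; the identifier names are hypothetical.
STRATEGY_BY_LEVEL = {
    0: "direct_answer",         # unambiguous: answer outright
    1: "infer_then_answer",     # referent resolvable from context
    2: "enumerate_candidates",  # 2-3 plausible referents: list all answers
    3: "request_clarification", # many similar objects: ask the user
}

def select_strategy(ambiguity_level: int) -> str:
    """Return the response strategy for a given ambiguity level."""
    return STRATEGY_BY_LEVEL[ambiguity_level]
```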
Method¶
Overall Architecture¶
AQuA combines a dataset and a training methodology. The dataset contains 7.2K samples (1.8K per level). The training pipeline first applies SFT to teach the strategy space, followed by GRPO to reinforce strategy selection. Question-answer pairs are generated by GPT-5 from COCO images, filtered through a three-stage pipeline, and validated by human annotators.
Key Designs¶
- Four-Level Ambiguity Taxonomy:
  - Level 0 (Unambiguous): Standard VQA with a unique answer → direct answer
  - Level 1 (Low-level referential ambiguity): Contains referents such as "this/that" that are nevertheless inferable from context → infer and answer directly
  - Level 2 (Multiple plausible interpretations): 2–3 equally plausible referents → enumerate all possible answers
  - Level 3 (Highly ambiguous): 5+ similar objects, referent unresolvable → request clarification
  - Design Motivation: Models four natural human strategies for handling ambiguity; human evaluation confirms high alignment with human strategy selection.
- Two-Stage SFT + GRPO Training:
  - SFT: Supervised fine-tuning on the AQuA training set to teach the model the space of strategy expressions.
  - GRPO: An LLM-as-judge evaluates strategy correctness; \(R=1\) (correct strategy and factually accurate), \(R=1-\lambda\) (correct strategy but hallucinated content), \(R=0\) (incorrect strategy).
  - Design Motivation: SFT alone cannot reliably select the correct strategy; GRPO reinforces strategy decisions through reward signals.
- Saliency-Based Automatic Ambiguity Level Assignment:
  - Object saliency scores are computed from COCO bounding boxes (area ratio \(\times 0.7\) + center distance \(\times 0.3\)).
  - Objects with a score above 0.6 are considered salient; ambiguity levels are assigned based on the number of salient objects (\(1 \rightarrow\) L1, \(2\)–\(3 \rightarrow\) L2, \(5+ \rightarrow\) L3).
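The saliency heuristic can be sketched as below. This is a minimal sketch under assumptions: the summary only gives the 0.7/0.3 weighting, so the centrality term is inverted here (objects near the image center score higher) and the distance is normalized by half the image diagonal; the paper's exact normalization may differ.

```python
import math

def saliency_score(bbox, img_w, img_h):
    """Heuristic saliency: 0.7 * area ratio + 0.3 * centrality.
    bbox is (x, y, w, h) in pixels (COCO format). Centrality inverts
    the normalized center distance so central objects score higher;
    this inversion and the normalization are assumptions."""
    x, y, w, h = bbox
    area_ratio = (w * h) / (img_w * img_h)
    cx, cy = x + w / 2, y + h / 2
    # Distance from the image center, normalized by half the diagonal.
    max_dist = math.hypot(img_w / 2, img_h / 2)
    dist = math.hypot(cx - img_w / 2, cy - img_h / 2) / max_dist
    return 0.7 * area_ratio + 0.3 * (1 - dist)

def assign_level(bboxes, img_w, img_h, thresh=0.6):
    """Map the count of salient objects to an ambiguity level."""
    n = sum(saliency_score(b, img_w, img_h) > thresh for b in bboxes)
    if n == 1:
        return 1           # single inferable referent -> L1
    if 2 <= n <= 3:
        return 2           # a few plausible referents -> L2
    if n >= 5:
        return 3           # many similar objects -> L3
    return None            # 0 or 4 salient objects: not covered by the stated rule
```

Note that the stated rule leaves the four-salient-object case unassigned; the sketch returns `None` there rather than guessing.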
Loss & Training¶
SFT uses standard cross-entropy loss. In GRPO, GPT-5-mini serves as the judge and rewards strategy consistency; factual errors are penalized with \(\lambda=0.3\). Fine-tuning is conducted on Qwen2.5-VL-3B and InternVL3-2B.
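The reward rule can be written out as a small function; the GPT-5-mini judge is abstracted here as two boolean verdicts, which is an assumption about the interface, not the paper's implementation:

```python
def grpo_reward(strategy_correct: bool, factually_accurate: bool,
                lam: float = 0.3) -> float:
    """AQuA-style GRPO reward: full credit only when both the strategy
    and the content are right; hallucinated content under the correct
    strategy is penalized by lambda; a wrong strategy earns nothing."""
    if not strategy_correct:
        return 0.0
    return 1.0 if factually_accurate else 1.0 - lam
```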
Key Experimental Results¶
Main Results (Strategic Accuracy)¶
| Model | L0 | L1 | L2 | L3 | Overall |
|---|---|---|---|---|---|
| GPT-5 | 89.7 | 0.7 | 0.3 | 0.8 | 22.9 |
| Gemini 2.5 Flash | 99.0 | 5.2 | 4.4 | 0.9 | 27.4 |
| Qwen2.5-VL-72B | 99.6 | 0.6 | 2.1 | 0.9 | 25.8 |
| Qwen2.5-VL-3B + AQuA | — | High | High | High | >50 |
Key Findings¶
- All baseline models achieve near-zero strategic accuracy on L1–L3: GPT-5 almost never requests clarification or enumerates options for L1/L2/L3 instances, defaulting instead to direct answers.
- GPT-5 achieves 98.4% factual accuracy yet only 22.9% strategic accuracy—models know the answer but do not know when to express uncertainty.
- The AQuA-trained 3B model surpasses GPT-5 and the 72B open-source model in strategic accuracy.
- CoT prompting yields only marginal improvements in strategic accuracy (22.9→25.7 for GPT-5), indicating that the deficiency lies in strategy awareness rather than reasoning depth.
- Human evaluation confirms high alignment between AQuA's four-level classification and human strategy selection (L0: 100%, L1: 96%, L2/L3: 64%).
Highlights & Insights¶
- Exposes the "overconfidence" problem in VLMs: Even the strongest models tend to produce single answers to ambiguous questions rather than expressing uncertainty—a significant risk for safe deployment.
- Practical value of the four-level taxonomy: More faithful to real human behavior than binary answer/ask strategies, providing a finer-grained framework for uncertainty handling in VLMs.
- Small model + strategy training > large model: A 3B model trained on AQuA substantially outperforms GPT-5 in strategic capability, demonstrating that this is a learnable skill rather than one that requires scale.
Limitations & Future Work¶
- The dataset is relatively small (7.2K samples), which may limit strategy generalization.
- Ambiguity is grounded in COCO object-level referents and does not cover higher-level semantic ambiguity (e.g., metaphor, cultural differences).
- The boundary between Level 2 and Level 3 exhibits some subjectivity (human agreement rate: 64%).
- Evaluation is limited to single-turn VQA; strategy switching in multi-turn dialogue remains unexplored.
Related Work & Insights¶
- vs. ClearVQA: ClearVQA trains models on a binary ask-or-answer decision; AQuA defines four strategies, offering greater flexibility.
- vs. VAGUE: VAGUE evaluates how visual context helps resolve ambiguity; AQuA trains models to select strategies based on the degree of ambiguity.
- vs. "I don't know" training: Simple refusal to answer is not an optimal strategy—inference or enumeration is often more informative than outright rejection.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First multi-strategy ambiguous VQA framework with an original four-level taxonomy.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-model comparisons, human validation, and GRPO ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear motivation, intuitive examples, and precise level definitions.
- Value: ⭐⭐⭐⭐⭐ Direct implications for the safe deployment of VLMs—models need to learn to say "I'm not sure."