AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions

Conference: ICLR 2026 · arXiv: 2603.07394 · Code: https://aqua-iclr2026.github.io/ · Area: Dialogue Systems · Keywords: ambiguity, VQA, response strategy, uncertainty handling, GRPO

TL;DR

This paper proposes AQuA, the first visual question answering dataset with fine-grained ambiguity grading across four levels (7.2K samples, 1.8K per level), defining an optimal response strategy for each level (direct answer / inference / enumeration / clarification request). The study finds that GPT-5 and Gemini over-confidently default to direct answers on ambiguous VQA instances, while a 3B model trained via SFT+GRPO surpasses closed-source large models in strategy adaptation.

Background & Motivation

Background: VQA benchmarks predominantly use unambiguous image-question pairs, yet ambiguity is pervasive in real-world scenarios (e.g., unclear referents, multiple plausible objects, complex scenes).

Limitations of Prior Work: (1) Existing ambiguous VQA research adopts a binary strategy—either answer or ask—which does not reflect the flexible, nuanced strategies humans employ in practice. (2) State-of-the-art models such as GPT-5 and Gemini tend to respond directly and overconfidently to ambiguous questions rather than adapting their strategy to the degree of ambiguity.

Key Challenge: Different types and degrees of ambiguity call for different response strategies, yet models lack fine-grained ambiguity awareness and the ability to select appropriate strategies.

Goal: How can a VLM adaptively select the optimal response strategy based on the ambiguity level of a visual question?

Key Insight: A four-level ambiguity taxonomy with corresponding strategies is defined, training data are constructed, and models are fine-tuned via SFT+GRPO.

Core Idea: Teaching VLMs to respond as humans do—answer directly for simple questions, infer for resolvable referents, enumerate for a small set of candidates, and request clarification when ambiguity is high.
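The level-to-strategy mapping above can be sketched as a small lookup table. This is a minimal illustration; the level numbers and strategy descriptions follow the paper, but the identifiers (`STRATEGY_BY_LEVEL`, `optimal_strategy`) are ours, not from the released code.

```python
# Hypothetical encoding of AQuA's four-level taxonomy; names are illustrative.
STRATEGY_BY_LEVEL = {
    0: "direct_answer",         # unambiguous: a unique correct answer exists
    1: "infer_and_answer",      # referent ("this/that") is resolvable from context
    2: "enumerate_candidates",  # 2-3 equally plausible referents
    3: "request_clarification", # 5+ similar objects, referent unresolvable
}

def optimal_strategy(level: int) -> str:
    """Return the paper's prescribed response strategy for an ambiguity level."""
    return STRATEGY_BY_LEVEL[level]
```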

Method

Overall Architecture

AQuA combines a dataset and a training methodology. The dataset contains 7.2K samples (1.8K per level). The training pipeline first applies SFT to teach the strategy space, followed by GRPO to reinforce strategy selection. Question-answer pairs are generated by GPT-5 from COCO images, filtered through a three-stage pipeline, and validated by human annotators.

Key Designs

  1. Four-Level Ambiguity Taxonomy:

    • Level 0 (Unambiguous): Standard VQA with a unique answer → direct answer
    • Level 1 (Low-level referential ambiguity): Contains referents such as "this/that" that are nevertheless inferable from context → infer and answer directly
    • Level 2 (Multiple plausible interpretations): 2–3 equally plausible referents → enumerate all possible answers
    • Level 3 (Highly ambiguous): 5+ similar objects, referent unresolvable → request clarification
    • Design Motivation: Models four natural human strategies for handling ambiguity; human evaluation confirms high alignment with human strategy selection.
  2. Two-Stage SFT + GRPO Training:

    • SFT: Supervised fine-tuning on the AQuA training set to teach the model the space of strategy expressions.
    • GRPO: An LLM-as-judge evaluates strategy correctness; \(R=1\) (correct strategy and factually accurate) / \(R=1-\lambda\) (correct strategy but hallucinated content) / \(R=0\) (incorrect strategy).
    • Design Motivation: SFT alone cannot reliably select the correct strategy; GRPO reinforces strategy decisions through reward signals.
  3. Saliency-Based Automatic Ambiguity Level Assignment:

    • Object saliency scores are computed from COCO bounding boxes (area ratio \(\times 0.7\) + center distance \(\times 0.3\)).
    • Objects with a score above 0.6 are considered salient; ambiguity levels are assigned based on the number of salient objects (\(1 \rightarrow\) L1, \(2\text{–}3 \rightarrow\) L2, \(5+ \rightarrow\) L3).
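The saliency rule above can be sketched as follows. This is a reconstruction under stated assumptions: the summary does not specify how center distance is normalized, so we assume it is normalized by the half-diagonal and inverted (so that centered objects score higher); all function names are ours.

```python
import math

def saliency(box, img_w, img_h):
    """Score one COCO box as 0.7 * area ratio + 0.3 * centrality.
    Assumption: centrality = 1 - (distance from image center / half-diagonal)."""
    x, y, w, h = box  # COCO format: top-left x, y, width, height
    area_ratio = (w * h) / (img_w * img_h)
    cx, cy = x + w / 2, y + h / 2
    dist = math.hypot(cx - img_w / 2, cy - img_h / 2)
    half_diag = math.hypot(img_w / 2, img_h / 2)
    centrality = 1.0 - dist / half_diag  # 1 at image center, 0 at a corner
    return 0.7 * area_ratio + 0.3 * centrality

def ambiguity_level(boxes, img_w, img_h, thresh=0.6):
    """Map the count of salient objects to an ambiguity level per the paper's rule."""
    n = sum(saliency(b, img_w, img_h) > thresh for b in boxes)
    if n == 1:
        return 1
    if 2 <= n <= 3:
        return 2
    if n >= 5:
        return 3
    return None  # 0 or 4 salient objects: not covered by the stated rule
```

Note that the stated rule leaves the case of exactly four salient objects (and zero) unassigned; how the paper handles these boundary counts is not specified in this summary.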

Loss & Training

SFT uses standard cross-entropy loss. In GRPO, GPT-5-mini serves as the judge and rewards strategy consistency; factual errors are penalized with \(\lambda=0.3\). Fine-tuning is conducted on Qwen2.5-VL-3B and InternVL3-2B.
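The reward shape described above reduces to a small piecewise function. A minimal sketch, assuming the LLM judge (GPT-5-mini in the paper) returns two boolean verdicts; the function name and signature are ours:

```python
LAMBDA = 0.3  # hallucination penalty lambda, per the paper

def grpo_reward(strategy_correct: bool, factually_accurate: bool,
                lam: float = LAMBDA) -> float:
    """R = 1 if strategy and facts are correct, 1 - lambda if the strategy is
    correct but the content is hallucinated, and 0 if the strategy is wrong."""
    if not strategy_correct:
        return 0.0
    return 1.0 if factually_accurate else 1.0 - lam
```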

Key Experimental Results

Main Results (Strategic Accuracy)

| Model | L0 | L1 | L2 | L3 | Overall |
|---|---|---|---|---|---|
| GPT-5 | 89.7 | 0.7 | 0.3 | 0.8 | 22.9 |
| Gemini 2.5 Flash | 99.0 | 5.2 | 4.4 | 0.9 | 27.4 |
| Qwen2.5-VL-72B | 99.6 | 0.6 | 2.1 | 0.9 | 25.8 |
| Qwen2.5-VL-3B + AQuA | | High | High | High | >50 |
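Strategic accuracy, as reported in the table, can be read as the fraction of instances where the model's chosen response strategy matches the level's prescribed one. A minimal sketch of that metric (the function name and label encoding are ours, not the paper's evaluation code):

```python
def strategic_accuracy(predicted, optimal):
    """Fraction of instances whose predicted strategy label matches the
    optimal strategy prescribed by the instance's ambiguity level."""
    assert len(predicted) == len(optimal) and predicted, "need aligned, non-empty lists"
    hits = sum(p == o for p, o in zip(predicted, optimal))
    return hits / len(predicted)
```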

Key Findings

  • All baseline models achieve near-zero strategic accuracy on L1–L3: GPT-5 almost never infers, enumerates, or requests clarification on these instances, defaulting instead to direct answers.
  • GPT-5 achieves 98.4% factual accuracy yet only 22.9% strategic accuracy—models know the answer but do not know when to express uncertainty.
  • The AQuA-trained 3B model surpasses GPT-5 and the 72B open-source model in strategic accuracy.
  • CoT prompting yields only marginal improvements in strategic accuracy (22.9→25.7 for GPT-5), indicating that the deficiency lies in strategy awareness rather than reasoning depth.
  • Human evaluation confirms high alignment between AQuA's four-level classification and human strategy selection (L0: 100%, L1: 96%, L2/L3: 64%).

Highlights & Insights

  • Exposes the "overconfidence" problem in VLMs: Even the strongest models tend to produce single answers to ambiguous questions rather than expressing uncertainty—a significant risk for safe deployment.
  • Practical value of the four-level taxonomy: More faithful to real human behavior than binary answer/ask strategies, providing a finer-grained framework for uncertainty handling in VLMs.
  • Small model + strategy training > large model: A 3B model trained on AQuA substantially outperforms GPT-5 in strategic capability, demonstrating that this is a learnable skill rather than one that requires scale.

Limitations & Future Work

  • The dataset is relatively small (7.2K samples), which may limit strategy generalization.
  • Ambiguity is grounded in COCO object-level referents and does not cover higher-level semantic ambiguity (e.g., metaphor, cultural differences).
  • The boundary between Level 2 and Level 3 exhibits some subjectivity (human agreement rate: 64%).
  • Evaluation is limited to single-turn VQA; strategy switching in multi-turn dialogue remains unexplored.

Comparison with Related Work

  • vs. ClearVQA: ClearVQA trains models on a binary ask-or-answer decision; AQuA defines four strategies, offering greater flexibility.
  • vs. VAGUE: VAGUE evaluates how visual context helps resolve ambiguity; AQuA trains models to select strategies based on the degree of ambiguity.
  • vs. "I don't know" training: Simple refusal to answer is not an optimal strategy—inference or enumeration is often more informative than outright rejection.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First multi-strategy ambiguous VQA framework with an original four-level taxonomy.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-model comparisons, human validation, and GRPO ablations.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear motivation, intuitive examples, and precise level definitions.
  • Value: ⭐⭐⭐⭐⭐ Direct implications for the safe deployment of VLMs—models need to learn to say "I'm not sure."