VQ-FocusAmbiguity: Acknowledging Focus Ambiguity in Visual Questions¶

Conference: ICCV 2025
arXiv: 2501.02201
Code: https://vizwiz.org/tasks-and-datasets/focus-ambiguity-in-visual-questions/
Area: Multimodal VLM
Keywords: Visual Question Answering, Focus Ambiguity, Grounding, Ambiguity Detection, VQA Benchmark

TL;DR¶

This work is the first to address "focus ambiguity" in VQA, where the language in a question can plausibly refer to more than one region of the image. The authors construct the VQ-FocusAmbiguity dataset of 5,500 visual questions with 12,880 instance segmentations, laying the foundation for developing ambiguity-aware VQA systems.

Background & Motivation¶

Background: Current VQA systems can understand and answer visual questions, but no published work has considered the ambiguity of the question focus.

Limitations of Prior Work: When a user asks "What is this cleaning product?" and the image contains several cleaning products, a VQA system may answer about the wrong one. For visually impaired users this can have severe consequences (e.g., washing dishes with window cleaner).

Core Idea: Construct the first VQA dataset targeting focus ambiguity, in which each ambiguous question is labeled with all image regions it could plausibly refer to (as instance segmentations). The dataset supports two new tasks: (1) identifying whether a question has focus ambiguity, and (2) grounding all possible focus regions.
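
To make the two tasks concrete, the following is a minimal sketch of how one annotated sample might be represented. The class name `FocusSample`, its field names, and the pixel-set encoding of a region are all hypothetical and chosen for illustration; they are not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple

Pixel = Tuple[int, int]      # (row, col); assumed encoding
Region = FrozenSet[Pixel]    # one instance segmentation as a set of pixels


@dataclass
class FocusSample:
    """Hypothetical record layout, not the dataset's real format."""
    image_id: str
    question: str
    is_ambiguous: bool                                          # label for task 1
    focus_regions: List[Region] = field(default_factory=list)   # targets for task 2


# Toy example: two cleaning products, so the question has two plausible foci.
sample = FocusSample(
    image_id="img_0001",
    question="What is this cleaning product?",
    is_ambiguous=True,
    focus_regions=[
        frozenset({(10, 10), (10, 11)}),   # first product
        frozenset({(40, 52), (41, 52)}),   # second product
    ],
)
print(sample.is_ambiguous, len(sample.focus_regions))
```

Task 1 would predict `is_ambiguous` from the image and question; task 2 would predict every entry of `focus_regions`.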

Method¶

Dataset Construction¶

Derived from 4 datasets (PACO, MSRA-B, VQAv2, VizWiz-VQA), the dataset contains 5,500 visual questions and 12,880 instance segmentations. The split between ambiguous (2,437) and non-ambiguous (3,063) questions is roughly balanced (about 44% vs. 56%). Candidate questions are AI-generated and then manually verified or corrected.
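
As a quick sanity check on the counts quoted above, the class shares and the overall number of segmentations per question can be computed directly. The variable names are illustrative only.

```python
# Counts taken from the summary above.
ambiguous = 2437
non_ambiguous = 3063
segmentations = 12880

total = ambiguous + non_ambiguous
share_ambiguous = ambiguous / total
share_non_ambiguous = non_ambiguous / total
segs_per_question = segmentations / total

print(total)                          # 5500
print(f"{share_ambiguous:.1%}")       # 44.3%
print(f"{share_non_ambiguous:.1%}")   # 55.7%
print(f"{segs_per_question:.2f}")     # 2.34
```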

Key Findings¶

Non-ambiguous questions are longer on average (10.5 vs. 8.2 words), because the additional words provide disambiguating context.
Non-ambiguous questions contain plural nouns more often (23.8% vs. 4.7%), since a plural form naturally covers multiple regions at once rather than leaving open which one is meant.
79% of ambiguous questions have different focus grounding compared to answer grounding (e.g., "What is above the mirror?" -> the focus is the mirror, while the answer is the object above the mirror).

Key Experimental Results¶

Task	Best Model	Performance	Description
Ambiguity Identification	GPT-4o	Moderate	Binary classification
Focus Grounding	Molmo-7B	Low	Grounding all regions
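
The summary does not state which metrics were used, so the sketch below is only one plausible way to score the two tasks: precision/recall/F1 for the binary ambiguity label, and an IoU between the union of predicted regions and the union of ground-truth regions for grounding. Both function names and the pixel-set region encoding are assumptions made for illustration.

```python
from typing import Iterable, Sequence, Set, Tuple

Pixel = Tuple[int, int]  # (row, col); assumed encoding


def binary_prf(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 with 'ambiguous' as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def union_iou(pred_regions: Iterable[Set[Pixel]], true_regions: Iterable[Set[Pixel]]) -> float:
    """IoU between the union of predicted regions and the union of true regions."""
    pred = set().union(*pred_regions)
    true = set().union(*true_regions)
    if not pred and not true:
        return 1.0  # nothing to ground and nothing predicted
    return len(pred & true) / len(pred | true)


# Toy check: the model finds only one of two plausible focus regions.
truth = [{(0, 0), (0, 1)}, {(2, 2)}]
prediction = [{(0, 0), (0, 1)}]
print(round(union_iou(prediction, truth), 3))   # 0.667
print(binary_prf([True, True, False, False], [True, False, True, False]))  # (0.5, 0.5, 0.5)
```

Under this scoring, a model that grounds only a single region for an ambiguous question is penalized for the regions it misses, which matches the task's requirement to ground all possible focus regions.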

Key Findings¶

Modern models perform poorly on both tasks, indicating that the dataset is challenging.
Decoupling focus grounding from answer grounding is a critical step in understanding the VQA reasoning process.

Dataset Statistics¶

Dimension	Ambiguous	Non-ambiguous
Number of Samples	2437	3063
Average Question Length (Words)	8.2	10.5
Plural Noun Ratio	4.7%	23.8%
Average Number of Focus Regions	2.8	1.0
Focus \(\neq\) Answer Grounding Ratio	79%	N/A

Highlights & Insights¶

Proposing "focus ambiguity" is highly significant: AI assistants should proactively inform users of ambiguity rather than guessing an answer.
Decoupling the grounding of questions and answers is an important insight that provides intermediate steps for VQA reasoning.
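
One way an assistant could act on the first point is sketched below: when a question is flagged as ambiguous, it asks the user which candidate they mean instead of answering. This is a hypothetical policy for illustration, not something the paper implements; `respond`, `candidate_labels`, and `answer_fn` are invented names, and the ambiguity flag is assumed to come from some upstream detector.

```python
from typing import Callable, List


def respond(
    question: str,
    is_ambiguous: bool,
    candidate_labels: List[str],
    answer_fn: Callable[[str], str],
) -> str:
    """Ask for clarification when the focus is ambiguous; otherwise answer."""
    if is_ambiguous and len(candidate_labels) > 1:
        options = ", ".join(candidate_labels)
        return f"Your question could refer to several things: {options}. Which one do you mean?"
    return answer_fn(question)


# Toy usage with a stand-in answer function.
reply = respond(
    "What is this cleaning product?",
    is_ambiguous=True,
    candidate_labels=["the bottle on the left", "the spray on the right"],
    answer_fn=lambda q: "dish soap",
)
print(reply)
```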

Limitations & Future Work¶

The dataset size is relatively small (5,500 samples), which may not cover all scenarios.
Only spatial ambiguity in 2D images is considered, without extending to 3D or video scenarios.
Ambiguity identification relies on the joint understanding of text and vision, and current models perform poorly on both tasks.
The AI-generated candidate questions might not be natural enough, differing from how real users ask questions.
Although decoupling focus grounding and answer grounding is an important insight, how to utilize this intermediate step to improve VQA performance remains unexplored.
Sub-types of ambiguity are not analyzed—different sources of ambiguity (lexical, referential, quantifiers) might require different handling strategies.
Proactive disambiguation strategies (such as having the model ask the user clarifying questions) are not explored.
More user studies are needed to validate the actual application scenarios for visually impaired users.

Related Work & Insights¶

vs VizWiz-VQA: VizWiz focuses on image quality issues in photos taken by visually impaired users, whereas VQ-FocusAmbiguity focuses on the ambiguity of the question itself.
vs Grounding DINO/Molmo: They perform visual grounding but do not handle ambiguity; VQ-FocusAmbiguity is the first to require models to identify and ground all possible focus regions.
vs Multi-Interpretation VQA: Prior work focused on answer diversity, whereas this paper addresses the ambiguity of the question focus, a more fundamental problem.

Supplementary Discussion¶

The core innovation of this method lies in transforming the analysis of questions from a single dimension to multiple dimensions, offering a more comprehensive perspective.
The experimental design covers various scenarios and baseline comparisons, with statistically significant results.
The modular design of the method makes it easy to extend to related tasks and new datasets.
Open-sourcing the code/data is highly valuable for community replication and subsequent research.
Compared to concurrent work, this paper has advantages in the depth of the problem definition and the comprehensiveness of the experimental analysis.
The writing of the paper is logically clear, forming a complete closed loop from problem definition to method design to experimental validation.
The computational overhead of the method is reasonable, making it deployable in practical applications.
Future work could consider integrating more modalities (such as audio and 3D point clouds).
Validating the scalability of the method on larger-scale data and models is an important future direction.
Combining this method with reinforcement learning to achieve end-to-end optimization could be considered.
Cross-domain transfer is a direction worth exploring, since the generalizability of the method requires more validation.
For edge computing and mobile deployment scenarios, a lightweight version of the method is worth studying.
Long-term evaluation and user studies could provide a more comprehensive assessment of the method.
Comparative analysis with human experts can better locate the strengths and weaknesses of the method.
Robustness testing under adversarial scenarios is a necessary step before actual deployment.
Interpretability analysis helps understand the reasons behind the successes and failures of the method.
Applicability under multilingual and multicultural contexts is worth attention.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ First work to address focus ambiguity in VQA
Experimental Thoroughness: ⭐⭐⭐⭐ Deep data analysis with comprehensive model evaluation
Writing Quality: ⭐⭐⭐⭐ Strong motivation, meticulous analysis
Value: ⭐⭐⭐⭐ Direct significance for AI safety and accessibility assistance
