VQ-FocusAmbiguity: Acknowledging Focus Ambiguity in Visual Questions
Conference: ICCV 2025
arXiv: 2501.02201
Code: https://vizwiz.org/tasks-and-datasets/focus-ambiguity-in-visual-questions/
Area: Multimodal VLM
Keywords: Visual Question Answering, Focus Ambiguity, Grounding, Ambiguity Detection, VQA Benchmark
TL;DR
This work is the first to address "focus ambiguity" in VQA, where the language in a question can point to multiple plausible regions in an image. The authors construct the VQ-FocusAmbiguity dataset of 5,500 visual questions, laying the foundation for developing ambiguity-aware VQA systems.
Background & Motivation
Background: Current VQA systems can understand and answer visual questions, but no published work has considered the ambiguity of the question focus.
Limitations of Prior Work: When a user asks "What is this cleaning product?" and the image contains several cleaning products, a VQA system that silently picks one may describe the wrong product. This could lead to severe consequences for visually impaired users (e.g., washing dishes with window cleaner).
Core Idea: Construct the first VQA dataset targeting focus ambiguity, where each ambiguous question is labeled with all possible target image regions (via instance segmentation). The dataset supports two new tasks: (1) identifying whether a question has focus ambiguity and (2) grounding all possible focus regions.
Method
Dataset Construction
Derived from 4 datasets (PACO, MSRA-B, VQAv2, VizWiz-VQA), the dataset contains 5,500 visual questions and 12,880 instance segmentations. The two classes are roughly balanced: 2,437 ambiguous and 3,063 non-ambiguous samples. The candidate questions are AI-generated and then manually verified and corrected.
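To make the annotation structure concrete, the following is a minimal sketch of how one sample could be represented in memory. It is not the dataset's released file format: the class names, field names, and the pixel-set mask encoding are all hypothetical, chosen only to mirror the description above (a question, its source dataset, an ambiguity label, and one segmentation per plausible focus region).

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col); hypothetical mask encoding

# The four source datasets named in the summary above.
SOURCES = {"PACO", "MSRA-B", "VQAv2", "VizWiz-VQA"}


@dataclass
class FocusRegion:
    """One instance segmentation that the question could be referring to."""
    pixels: Set[Pixel]


@dataclass
class VisualQuestion:
    """One sample: an image-question pair plus its focus annotations."""
    image_id: str
    question: str
    source: str
    is_ambiguous: bool  # label for the ambiguity identification task
    focus_regions: List[FocusRegion] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.source not in SOURCES:
            raise ValueError(f"unknown source dataset: {self.source}")


# Illustrative sample (made-up values, not taken from the dataset).
sample = VisualQuestion(
    image_id="img_0001",
    question="What is this cleaning product?",
    source="VizWiz-VQA",
    is_ambiguous=True,
    focus_regions=[
        FocusRegion(pixels={(0, 0), (0, 1)}),
        FocusRegion(pixels={(5, 5), (5, 6)}),
    ],
)
```

The ambiguity label is stored explicitly rather than derived from the number of regions, since the summary does not state that the two are always equivalent.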
Key Findings
- Non-ambiguous questions are longer (higher average length) because additional words provide disambiguation context.
- Non-ambiguous questions more frequently contain plural nouns (23.8% vs. 4.7%): a plural form explicitly refers to several regions at once, so there is no uncertainty about which one is meant.
- 79% of ambiguous questions have different focus grounding compared to answer grounding (e.g., "What is above the mirror?" -> the focus is the mirror, while the answer is the object above the mirror).
Key Experimental Results
| Task | Best Model | Performance | Description |
|---|---|---|---|
| Ambiguity Identification | GPT-4o | Moderate | Binary classification |
| Focus Grounding | Molmo-7B | Low | Grounding all regions |
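The summary does not state which metrics sit behind "Moderate" and "Low". As an illustration only, the sketch below scores the two tasks with common choices: F1 for the binary ambiguity label, and an IoU-thresholded recall over ground-truth focus regions for grounding. The function names, the pixel-set mask encoding, and the 0.5 threshold are assumptions, not the paper's protocol.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]  # hypothetical encoding: a mask is a set of (row, col) pixels


def f1_score(labels: List[bool], preds: List[bool]) -> float:
    """F1 for the binary task 'does this question have focus ambiguity?'."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel-set masks."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def region_recall(gt: List[Mask], pred: List[Mask], thr: float = 0.5) -> float:
    """Fraction of ground-truth focus regions covered by a distinct prediction.

    Greedy one-to-one matching: each ground-truth region takes the unused
    prediction with the highest IoU, provided that IoU is at least `thr`.
    """
    unused = list(range(len(pred)))
    matched = 0
    for g in gt:
        best_j, best = None, thr
        for j in unused:
            score = iou(g, pred[j])
            if score >= best:
                best_j, best = j, score
        if best_j is not None:
            unused.remove(best_j)
            matched += 1
    return matched / len(gt) if gt else 1.0
```

A recall-style score reflects the "ground all regions" requirement: a model that returns only one of several plausible regions is penalized even if that one mask is accurate.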
Key Findings
- Modern models struggle on both tasks, particularly on focus grounding, which indicates that the dataset is challenging.
- Decoupling focus grounding from answer grounding is a critical step in understanding the VQA reasoning process.
Dataset Statistics
| Dimension | Ambiguous | Non-ambiguous |
|---|---|---|
| Number of Samples | 2,437 | 3,063 |
| Average Question Length (Words) | 8.2 | 10.5 |
| Plural Noun Ratio | 4.7% | 23.8% |
| Average Number of Focus Regions | 2.8 | 1.0 |
| Focus \(\neq\) Answer Grounding Ratio | 79% | N/A |
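As a quick sanity check on the table, the short calculation below confirms that the two sample counts sum to the reported 5,500 and shows the resulting class shares. The variable names are illustrative.

```python
# Sample counts taken from the table above.
ambiguous = 2437
non_ambiguous = 3063

total = ambiguous + non_ambiguous
share_ambiguous = ambiguous / total
share_non_ambiguous = non_ambiguous / total

print(
    f"total={total}, "
    f"ambiguous={share_ambiguous:.1%}, "
    f"non-ambiguous={share_non_ambiguous:.1%}"
)
# prints: total=5500, ambiguous=44.3%, non-ambiguous=55.7%
```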
Highlights & Insights
- Proposing "focus ambiguity" is highly significant: AI assistants should proactively inform users of ambiguity rather than guessing an answer.
- Decoupling the grounding of questions and answers is an important insight that provides intermediate steps for VQA reasoning.
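The first point can be illustrated with a small response policy. This is a hypothetical sketch, not something the paper implements: `respond`, `candidate_labels`, and `answer_fn` are made-up names, and the ambiguity flag and region labels are assumed to come from upstream models for the two tasks.

```python
from typing import Callable, List


def respond(
    question: str,
    is_ambiguous: bool,
    candidate_labels: List[str],
    answer_fn: Callable[[str], str],
) -> str:
    """Tell the user about focus ambiguity instead of guessing an answer.

    `candidate_labels` holds short descriptions of the plausible focus
    regions; `answer_fn` stands in for any ordinary VQA model.
    """
    if is_ambiguous and len(candidate_labels) > 1:
        options = ", ".join(candidate_labels)
        return (
            f"Your question could refer to {len(candidate_labels)} "
            f"different things in the image: {options}."
        )
    # Only one plausible focus: answer as a normal VQA system would.
    return answer_fn(question)


# Usage with a stub answerer (illustrative values only).
message = respond(
    "What is this cleaning product?",
    is_ambiguous=True,
    candidate_labels=["dish soap", "window cleaner"],
    answer_fn=lambda q: "dish soap",
)
```

In the ambiguous case the stub answerer is never called, so the user hears about both products rather than receiving a single, possibly wrong, answer.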
Limitations & Future Work
- The dataset size is relatively small (5,500 samples), which may not cover all scenarios.
- Only spatial ambiguity in 2D images is considered, without extending to 3D or video scenarios.
- Ambiguity identification relies on the joint understanding of text and vision, and current models perform poorly on both tasks.
- The AI-generated candidate questions might not be natural enough, differing from how real users ask questions.
- Although decoupling focus grounding and answer grounding is an important insight, how to utilize this intermediate step to improve VQA performance remains unexplored.
- Sub-types of ambiguity are not analyzed—different sources of ambiguity (lexical, referential, quantifiers) might require different handling strategies.
- Proactive disambiguation strategies (such as having the model ask the user clarifying questions) are not explored.
- More user studies are needed to validate the actual application scenarios for visually impaired users.
Related Work & Insights
- vs VizWiz-VQA: VizWiz focuses on image quality issues in photos taken by visually impaired users, whereas VQ-FocusAmbiguity focuses on the ambiguity of the question itself.
- vs Grounding DINO/Molmo: They perform visual grounding but do not handle ambiguity; VQ-FocusAmbiguity is the first to require models to identify and ground all possible focus regions.
- vs Multi-Interpretation VQA: Prior work focused on answer diversity, whereas this paper addresses the ambiguity of question focus—a more fundamental problem.
Supplementary Discussion
- The core innovation of this method lies in transforming the analysis of questions from a single dimension to multiple dimensions, offering a more comprehensive perspective.
- The experimental design covers various scenarios and baseline comparisons, with statistically significant results.
- The modular design of the method makes it easy to extend to related tasks and new datasets.
- Open-sourcing the code/data is highly valuable for community replication and subsequent research.
- Compared to concurrent work, this paper has advantages in the depth of the problem definition and the comprehensiveness of the experimental analysis.
- The writing of the paper is logically clear, forming a complete closed loop from problem definition to method design to experimental validation.
- The computational overhead of the method is reasonable, making it deployable in practical applications.
- Future work could consider integrating more modalities (such as audio and 3D point clouds).
- Validating the scalability of the method on larger-scale data and models is an important future direction.
- Combining this method with reinforcement learning to achieve end-to-end optimization could be considered.
- Cross-domain transfer is a direction worth exploring, since the generalizability of the method requires more validation.
- For edge computing and mobile deployment scenarios, a lightweight version of the method is worth studying.
- Long-term evaluation and user studies could provide a more comprehensive assessment of the method.
- Comparative analysis with human experts can better locate the strengths and weaknesses of the method.
- Robustness testing under adversarial scenarios is a necessary step before actual deployment.
- Interpretability analysis helps understand the reasons behind the successes and failures of the method.
- Applicability under multilingual and multicultural contexts is worth attention.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First work to address focus ambiguity in VQA
- Experimental Thoroughness: ⭐⭐⭐⭐ Deep data analysis with comprehensive model evaluation
- Writing Quality: ⭐⭐⭐⭐ Strong motivation, meticulous analysis
- Value: ⭐⭐⭐⭐ Direct significance for AI safety and accessibility assistance