VQ-FocusAmbiguity: Acknowledging Focus Ambiguity in Visual Questions

Conference: ICCV 2025
arXiv: 2501.02201
Code: https://vizwiz.org/tasks-and-datasets/focus-ambiguity-in-visual-questions/
Area: Multimodal VLM
Keywords: Visual Question Answering, Focus Ambiguity, Grounding, Ambiguity Detection, VQA Benchmark

TL;DR

This work is the first to address "focus ambiguity" in VQA—where language in a question can point to multiple plausible regions in an image. The authors construct the VQ-FocusAmbiguity dataset comprising 5,500 samples, laying the foundation for developing ambiguity-aware VQA systems.

Background & Motivation

Background: Current VQA systems can understand and answer visual questions, but no published work has considered the ambiguity of the question focus.

Limitations of Prior Work: When a user asks "What is this cleaning product?" and there are multiple cleaning products in the image, a VQA system might output an incorrect answer. This could lead to severe consequences for visually impaired users (e.g., washing dishes with window cleaner).

Core Idea: Construct the first VQA dataset targeting focus ambiguity, where each ambiguous question is labeled with all possible target image regions (via instance segmentation). The dataset supports two new tasks: (1) identifying whether a question has focus ambiguity and (2) grounding all possible focus regions.

Method

Dataset Construction

Derived from four datasets (PACO, MSRA-B, VQAv2, VizWiz-VQA), the dataset contains 5,500 visual questions and 12,880 instance segmentations. Ambiguous (2,437) and non-ambiguous (3,063) samples are roughly balanced. Candidate questions are AI-generated and then manually verified and corrected.
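To make the labeling scheme concrete, here is a minimal sketch of what one sample could look like. The class name, field names, and the pixel-set mask encoding are assumptions for illustration and not the released schema. The rule "ambiguous means more than one plausible focus region" is likewise an assumption inferred from the description above.

```python
from dataclasses import dataclass, field

Pixel = tuple[int, int]  # (row, col); an illustrative mask encoding, not the dataset's


@dataclass
class FocusSample:
    """Hypothetical record layout for one visual question."""
    question: str
    image_id: str
    # One pixel set per plausible focus region (instance segmentation).
    focus_regions: list[frozenset[Pixel]] = field(default_factory=list)

    @property
    def is_ambiguous(self) -> bool:
        # Assumption: a question is ambiguous when it has several plausible focus regions.
        return len(self.focus_regions) > 1


def ambiguous_share(n_ambiguous: int, n_non_ambiguous: int) -> float:
    """Fraction of samples that are labeled ambiguous."""
    return n_ambiguous / (n_ambiguous + n_non_ambiguous)


sample = FocusSample(
    question="What is this cleaning product?",
    image_id="example-0001",  # made-up identifier
    focus_regions=[frozenset({(0, 0), (0, 1)}), frozenset({(5, 5)})],
)
print(sample.is_ambiguous)                   # True: two candidate regions
print(f"{ambiguous_share(2437, 3063):.1%}")  # 44.3% of the 5,500 samples
```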

Key Findings

  • Non-ambiguous questions are longer on average (10.5 vs. 8.2 words) because the additional words provide disambiguating context.
  • Non-ambiguous questions more often contain plural nouns (23.8% vs. 4.7%), since a plural form naturally refers to multiple regions at once.
  • 79% of ambiguous questions have different focus grounding compared to answer grounding (e.g., "What is above the mirror?" -> the focus is the mirror, while the answer is the object above the mirror).
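The distinction between focus grounding and answer grounding can be illustrated with a small overlap check. This is only a sketch: masks are represented as sets of (row, col) pixels, and the 0.5 IoU threshold for calling two groundings "different" is an assumption, since this summary does not state the criterion behind the 79% figure.

```python
def mask_iou(a: set, b: set) -> float:
    """Intersection over union of two pixel-set masks."""
    union = a | b
    if not union:
        return 1.0  # two empty masks are treated as identical
    return len(a & b) / len(union)


def groundings_differ(focus: set, answer: set, threshold: float = 0.5) -> bool:
    """Assumed rule: groundings differ when their IoU falls below the threshold."""
    return mask_iou(focus, answer) < threshold


# "What is above the mirror?": the focus is the mirror, the answer is the object above it.
mirror = {(r, c) for r in range(4, 8) for c in range(2, 6)}
object_above = {(r, c) for r in range(0, 3) for c in range(2, 6)}

print(groundings_differ(mirror, object_above))  # True: the masks do not overlap
print(groundings_differ(mirror, mirror))        # False: identical masks
```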

Key Experimental Results

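| Task                     | Best Model | Performance | Description           |
|--------------------------|------------|-------------|-----------------------|
| Ambiguity Identification | GPT-4o     | Moderate    | Binary classification |
| Focus Grounding          | Molmo-7B   | Low         | Grounding all regions |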
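This summary reports performance only qualitatively, so the following is a sketch of how the two tasks could be scored, not the paper's evaluation protocol. It assumes F1 for the binary ambiguity label and IoU between the union of predicted regions and the union of ground-truth regions for focus grounding. Masks are pixel sets and all function names are hypothetical.

```python
def binary_f1(y_true: list[bool], y_pred: list[bool]) -> float:
    """F1 score with 'ambiguous' (True) as the positive class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def union_iou(pred_regions: list[set], gt_regions: list[set]) -> float:
    """IoU between the merged predicted mask and the merged ground-truth mask."""
    pred = set().union(*pred_regions)
    gt = set().union(*gt_regions)
    union = pred | gt
    if not union:
        return 1.0
    return len(pred & gt) / len(union)


# Ambiguity identification on four toy questions.
print(binary_f1([True, True, False, False], [True, False, False, True]))  # 0.5

# Focus grounding: two cleaning products in the image, model grounds only one.
product_a = {(0, 0), (0, 1)}
product_b = {(3, 3), (3, 4)}
print(union_iou([product_a], [product_a, product_b]))  # 0.5
```

Under this scoring, a model that grounds a single region of an ambiguous question is penalized even when that region is segmented perfectly, which matches the requirement to ground all regions.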

Key Findings

  • Modern models perform poorly on both tasks, indicating that the dataset is challenging.
  • Decoupling focus grounding from answer grounding is a critical step in understanding the VQA reasoning process.

Dataset Statistics

| Dimension                       | Ambiguous | Non-ambiguous |
|---------------------------------|-----------|---------------|
| Number of Samples               | 2,437     | 3,063         |
| Average Question Length (Words) | 8.2       | 10.5          |
| Plural Noun Ratio               | 4.7%      | 23.8%         |
| Average Number of Focus Regions | 2.8       | 1.0           |
| Focus ≠ Answer Grounding Ratio  | 79%       | N/A           |

Highlights & Insights

  • Proposing "focus ambiguity" is highly significant: AI assistants should proactively inform users of ambiguity rather than guessing an answer.
  • Decoupling the grounding of questions and answers is an important insight that provides intermediate steps for VQA reasoning.
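The idea of informing the user instead of guessing can be sketched as a simple response policy. This is an illustration only: it assumes a hypothetical upstream component that returns one candidate answer per plausible focus region, and the function name and wording of the replies are made up.

```python
def respond(candidate_answers: list[str]) -> str:
    """Answer directly when the focus is unambiguous, otherwise ask for clarification.

    candidate_answers holds one answer per plausible focus region
    (assumed output of an upstream grounding and answering step).
    """
    unique = list(dict.fromkeys(candidate_answers))  # deduplicate, keep order
    if not unique:
        return "I could not find what the question refers to."
    if len(unique) == 1:
        return unique[0]
    options = ", ".join(unique[:-1]) + f" or {unique[-1]}"
    return f"I see more than one possible match: {options}. Which one do you mean?"


# "What is this cleaning product?" with two products in view.
print(respond(["dish soap", "window cleaner"]))
# A single plausible region yields a direct answer.
print(respond(["dish soap"]))
```

Deduplicating first means that several regions yielding the same answer do not trigger a clarification request, which is one possible design choice rather than something the paper prescribes.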

Limitations & Future Work

  • The dataset size is relatively small (5,500 samples), which may not cover all scenarios.
  • Only spatial ambiguity in 2D images is considered, without extending to 3D or video scenarios.
  • Ambiguity identification relies on the joint understanding of text and vision, and current models perform poorly on both tasks.
  • The AI-generated candidate questions might not be natural enough, differing from how real users ask questions.
  • Although decoupling focus grounding and answer grounding is an important insight, how to utilize this intermediate step to improve VQA performance remains unexplored.
  • Sub-types of ambiguity are not analyzed—different sources of ambiguity (lexical, referential, quantifiers) might require different handling strategies.
  • Proactive disambiguation strategies (such as having the model ask the user clarifying questions) are not explored.
  • More user studies are needed to validate the actual application scenarios for visually impaired users.

Comparison with Related Work

  • vs. VizWiz-VQA: VizWiz focuses on image quality issues in photos taken by visually impaired users, whereas VQ-FocusAmbiguity focuses on the ambiguity of the question itself.
  • vs. Grounding DINO/Molmo: They perform visual grounding but do not handle ambiguity; VQ-FocusAmbiguity is the first to require models to identify and ground all possible focus regions.
  • vs. Multi-Interpretation VQA: Prior work focused on answer diversity, whereas this paper addresses the ambiguity of the question focus, a more fundamental problem.

Supplementary Discussion

  • The core innovation of this method lies in transforming the analysis of questions from a single dimension to multiple dimensions, offering a more comprehensive perspective.
  • The experimental design covers various scenarios and baseline comparisons, with statistically significant results.
  • The modular design of the method makes it easy to extend to related tasks and new datasets.
  • Open-sourcing the code/data is highly valuable for community replication and subsequent research.
  • Compared to concurrent work, this paper has advantages in the depth of the problem definition and the comprehensiveness of the experimental analysis.
  • The writing of the paper is logically clear, forming a complete closed loop from problem definition to method design to experimental validation.
  • The computational overhead of the method is reasonable, making it deployable in practical applications.
  • Future work could consider integrating more modalities (such as audio and 3D point clouds).
  • Validating the scalability of the method on larger-scale data and models is an important future direction.
  • Combining this method with reinforcement learning to achieve end-to-end optimization could be considered.
  • Cross-domain transfer is a direction worth exploring, as the generalizability of the method requires more validation.
  • For edge computing and mobile deployment scenarios, a lightweight version of the method is worth studying.
  • Long-term evaluation and user studies could provide a more comprehensive assessment of the method.
  • Comparative analysis with human experts can better locate the strengths and weaknesses of the method.
  • Robustness testing under adversarial scenarios is a necessary step before actual deployment.
  • Interpretability analysis helps understand the reasons behind the successes and failures of the method.
  • Applicability under multilingual and multicultural contexts is worth attention.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First work to address focus ambiguity in VQA
  • Experimental Thoroughness: ⭐⭐⭐⭐ Deep data analysis with comprehensive model evaluation
  • Writing Quality: ⭐⭐⭐⭐ Strong motivation, meticulous analysis
  • Value: ⭐⭐⭐⭐ Direct significance for AI safety and accessibility assistance