# HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models
Conference: CVPR 2026 · arXiv: 2603.26362 · Code: https://kcsayem.github.io/handvqa/ · Area: Multimodal VLM · Keywords: hand spatial reasoning, VQA benchmark, vision-language models, fine-grained understanding, zero-shot transfer
## TL;DR
This paper introduces HandVQA — a large-scale diagnostic benchmark containing over 1.6 million multiple-choice questions, automatically generated from 3D hand joint annotations. The benchmark covers joint angles, distances, and relative positions, and systematically exposes severe deficiencies of current VLMs in fine-grained hand spatial reasoning. The paper further demonstrates that models fine-tuned on HandVQA can zero-shot transfer to downstream tasks such as gesture recognition (+10.33%) and hand-object interaction recognition (+2.63%).
## Background & Motivation
Background: Vision-language models (VLMs) have approached human-level performance on general VQA tasks (e.g., VQAv2), yet perform poorly on fine-grained spatial reasoning. Prior work has shown that VLMs achieve only ~56% accuracy on simple left/right discrimination (vs. 99% for humans), reflecting surface-level correlation rather than genuine geometric understanding.
Limitations of Prior Work: Hands are the primary medium through which humans convey actions, intentions, and control, and precise understanding of hand pose is critical in high-stakes scenarios such as robotic surgery, chip manufacturing, and AR/VR interaction. Existing VLMs, however, lack an understanding of joint-level spatial relationships within hands (the complex spatial configuration of 21 joints) and frequently exhibit "pose hallucinations": misidentifying joint bending states or incorrectly estimating inter-finger distances.
Key Challenge: General VQA benchmarks cannot diagnose specific weaknesses of VLMs in fine-grained spatial reasoning. Existing spatial reasoning benchmarks (e.g., CLEVR, SPHERE) focus on inter-object relationships and do not evaluate part-level spatial structure within a single object (e.g., kinematic and geometric relationships among hand joints).
Goal: 1) Systematically evaluate VLMs' understanding of joint-level spatial relationships in hands; 2) characterize their specific failure modes; 3) test whether capabilities acquired through training on hand spatial reasoning transfer to other tasks.
Key Insight: Leveraging precise 3D joint annotations from existing high-quality hand datasets (FreiHAND, InterHand2.6M, FPHA) to automatically generate diagnostic VQA questions, decomposing hand pose estimation into five independently evaluable subtasks.
Core Idea: Systematically converting 3D hand joint coordinates into structured natural-language multiple-choice questions to enable precise diagnosis and effective improvement of VLM hand spatial reasoning capabilities.
## Method
### Overall Architecture
The HandVQA automatic VQA generation pipeline consists of three deterministic stages: 1) Pose descriptor extraction \(\mathcal{F}_{\text{pose}}\): computing continuous geometric quantities (angles, distances, relative positions) from normalized 3D joint coordinates and discretizing them into semantic category labels; 2) Sentence generation \(\mathcal{F}_{\text{text}}\): filling category labels into natural-language sentences using deterministic templates, and selecting correct answers and distractors; 3) Multiple-choice question construction \(\mathcal{F}_{\text{mcq}}\): pairing images with answer options to produce standardized multiple-choice questions. Up to 25 questions are generated per image (5 descriptor types × 5 sampled instances), yielding over 1.6 million questions in total.
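To make stage 1 concrete, here is a minimal sketch of the angle descriptor in Python. The fully-bent (\(<105°\)) and straight (\(\geq 170°\)) thresholds are quoted from the paper; the intermediate 140° cut-off and the function names are illustrative assumptions, not the authors' released code.

```python
import numpy as np

# Category thresholds: <105° = fully bent, >=170° = straight (from the paper);
# the 140° boundary between "bent" and "slightly bent" is an assumed placeholder.
ANGLE_BINS = [(105.0, "fully bent"), (140.0, "bent"), (170.0, "slightly bent")]

def joint_angle(p0, p1, p2):
    """F_pose: angle at joint p1 formed by three consecutive joints of a finger."""
    u, v = p0 - p1, p2 - p1
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def angle_category(theta):
    """Discretize a continuous angle into one of four semantic labels."""
    for upper, label in ANGLE_BINS:
        if theta < upper:
            return label
    return "straight"

# Example: a nearly straight finger joint (~174°) falls in the "straight" bin.
print(angle_category(joint_angle(np.array([0.0, 0.0, 0.0]),
                                 np.array([1.0, 0.0, 0.0]),
                                 np.array([2.0, 0.1, 0.0]))))  # -> "straight"
```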
### Key Designs
- Five-Dimensional Pose Descriptor System:
- Function: Discretizes continuous 3D hand geometry into linguistically expressible categories.
- Mechanism: Three geometric measures are defined — angle \(\theta_j\) (the angle formed by three consecutive joints of the same finger, classified into 4 categories: fully bent / bent / slightly bent / straight), distance \(d_{(i,k)}\) (Euclidean distance between two joints, classified into 3 categories: close / spread / far apart), and relative position \(\Delta_a(i,k)\) (signed offset along the X/Y/Z axis, classified into left–right / up–down / front–back). Category thresholds are fixed (e.g., angle \(<105°\) is fully bent, \(\geq 170°\) is straight), with ambiguous "aligned" cases excluded.
- Design Motivation: Discretization ensures each descriptor has a unique semantic interpretation, eliminating ambiguity in continuous-value evaluation; the five-dimensional decomposition enables independent diagnosis of VLM capabilities across different spatial dimensions.
- Deterministic Template Sentence Generation and Distractor Selection:
- Function: Converts geometric category labels into natural-language multiple-choice questions.
- Mechanism: Fixed syntactic templates are used for each descriptor type; for example, the distance template reads: "The {joint A} joint of the {finger A} is {category} the {joint B} joint of the {finger B}." For each joint or joint pair, the sentence corresponding to the ground-truth category serves as the correct answer, while sentences for the remaining categories serve as distractors, naturally forming a multiple-choice question.
- Design Motivation: Templating ensures objectivity and reproducibility of evaluation; distractors are drawn directly from the other categories of the same joint pair, guaranteeing discriminability and validity (a sketch of this distractor construction follows this list).
- Multi-Dataset Cross-Scenario Coverage:
- Function: Ensures comprehensiveness and generalizability of evaluation.
- Mechanism: Three complementary hand datasets are used — FreiHAND (third-person view, single hand), InterHand2.6M (multi-view, two-hand), and FPHA (first-person/egocentric view) — covering diverse viewpoints and interaction modes. Each dataset is trained and evaluated independently, revealing performance differences across settings.
- Design Motivation: The egocentric viewpoint of FPHA exposes VLM viewpoint bias (base model performance drops significantly on this dataset); the two-hand setting of InterHand2.6M increases the complexity of spatial reasoning.
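Continuing the conventions of the angle sketch above, here is an assumed sketch of the distance descriptor and its distractor construction. The template wording follows the paper's distance template; the numeric cut-offs between close / spread / far apart and the exact preposition phrasing are placeholders.

```python
import numpy as np

# Preposition-friendly renderings of the three distance categories; the exact
# wording and the cut-offs on normalized coordinates are assumptions.
DISTANCE_CATEGORIES = ["close to", "spread from", "far apart from"]
DISTANCE_BINS = [0.05, 0.12]

def distance_category(joint_i, joint_k):
    """F_pose: Euclidean distance d_(i,k), discretized into three labels."""
    d = float(np.linalg.norm(joint_i - joint_k))
    if d < DISTANCE_BINS[0]:
        return DISTANCE_CATEGORIES[0]
    if d < DISTANCE_BINS[1]:
        return DISTANCE_CATEGORIES[1]
    return DISTANCE_CATEGORIES[2]

def distance_mcq(finger_a, joint_a, finger_b, joint_b, true_category):
    """F_text + F_mcq: ground-truth sentence plus one sentence per wrong category."""
    def fill(c):
        return (f"The {joint_a} joint of the {finger_a} is {c} "
                f"the {joint_b} joint of the {finger_b}.")
    return {"options": [fill(c) for c in DISTANCE_CATEGORIES],
            "answer": fill(true_category)}

# Example MCQ for an index-tip / thumb-tip pair.
print(distance_mcq("index finger", "tip", "thumb", "tip", "close to"))
```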
### Loss & Training
VLMs are parameter-efficiently fine-tuned with LoRA (visual encoder frozen) on the HandVQA training set. Evaluation metrics are accuracy and, for the angle and distance tasks, MAE (mean absolute error).
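A minimal fine-tuning sketch using HuggingFace Transformers and PEFT, assuming a LLaVA-style 7B checkpoint; the model ID, LoRA rank/alpha, and target modules are illustrative choices, not the paper's reported hyperparameters.

```python
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Example checkpoint; the paper's exact 7B checkpoints are not specified here.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.bfloat16
)

# Freeze the visual encoder, as described above.
for name, param in model.named_parameters():
    if "vision_tower" in name:
        param.requires_grad = False

# LoRA on the language model's attention projections; rank/alpha are assumptions.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters should be trainable
```

Training would then presumably proceed as standard next-token prediction over the templated question and answer text, though the summary does not spell out the loss.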
## Key Experimental Results
### Main Results
| Model | Fine-tuned on | Angle Acc (%)↑ | Angle MAE↓ | Distance Acc (%)↑ | Distance MAE↓ |
|---|---|---|---|---|---|
| DeepSeek 7B (Base) | ✗ | 34.10 | 0.883 | 45.55 | 0.657 |
| LLaVA 7B (Base) | ✗ | 40.08 | 0.739 | 16.20 | 1.293 |
| Qwen 7B (Base) | ✗ | 37.92 | 0.779 | 19.58 | 1.247 |
| LLaVA 7B (Finetuned) | InterHand | 74.35 | 0.263 | 90.79 | 0.094 |
| DeepSeek 7B (Finetuned) | InterHand | 68.00 | 0.334 | 88.02 | 0.122 |
### Zero-Shot Transfer Results
| Model | Gesture Recognition↑ | Hand-Object Interaction↑ |
|---|---|---|
| LLaVA 7B (Base) | 57.42% | - |
| LLaVA 7B (Finetuned) | 69.58% (+12.16) | - |
| Qwen 7B (Base) | 71.86% | 80.26% |
| Qwen 7B (Finetuned) | 82.19% (+10.33) | 82.89% (+2.63) |
### Key Findings
- Base VLMs fail severely on distance judgment: LLaVA and Qwen achieve accuracies below random chance (33.3%), with Qwen incorrectly answering "close" 93% of the time when the correct answer is "spread."
- Angle tasks remain difficult even after fine-tuning: the highest accuracy is only 74.35% (vs. 90.79% for distance), suggesting that freezing the visual encoder may be a bottleneck.
- The egocentric viewpoint (FPHA) is particularly challenging for base models, indicating viewpoint bias in VLMs.
- Strength on individual tasks does not generalize across tasks: no single base model leads on all spatial dimensions.
## Highlights & Insights
- The pipeline design that converts 3D joint coordinates into multiple-choice VQA questions is highly elegant — fully automated, deterministic, and unambiguous — enabling large-scale diagnostic data generation at low cost. This paradigm can be extended to body pose, object 6DoF pose, and other domains.
- Zero-shot transfer experiments demonstrate that "3D spatial reasoning is a transferable skill" — joint-level spatial reasoning capabilities learned on HandVQA directly improve gesture recognition and video interaction recognition without task-specific training.
- The paper identifies a "pose hallucination" phenomenon in VLMs: models tend to respond to spatial reasoning questions with simplified answers (e.g., always answering "close" or "slightly bent"), which is distinct from object-level hallucination.
## Limitations & Future Work
- Only 7B models are evaluated; larger models may exhibit different behavior.
- LoRA fine-tuning freezes the visual encoder, potentially limiting the learning of fine-grained features such as joint angles.
- Discretization thresholds are fixed and may not fully reflect the continuity of human perception.
- The benchmark covers only static images and does not extend to dynamic hand reasoning in video.
- Templated language limits question diversity; more naturalistic formulations could be explored in future work.
## Related Work & Insights
- vs. SPHERE: SPHERE evaluates inter-object spatial relationships, whereas HandVQA focuses on part-level structure within a single object, making it more fine-grained and more challenging.
- vs. SpatialVLM: SpatialVLM injects spatial information via depth maps; HandVQA trains models to autonomously learn spatial reasoning capabilities through VQA.
- The automatic generation pipeline of HandVQA can inspire diagnostic benchmark construction in other domains.
## Rating
- Novelty: ⭐⭐⭐⭐ First large-scale benchmark to systematically evaluate VLM hand spatial reasoning.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets, three models, five subtasks, and zero-shot transfer — highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Clear structure and in-depth analysis.
- Value: ⭐⭐⭐⭐ Significant reference value for understanding and improving VLM spatial reasoning.