Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Method

Conference: ICLR 2026 · arXiv: 2507.07999 · Code: GitHub · Area: Object Detection
Keywords: Visual grounded reasoning, traceable evidence, second-order reasoning, TreeBench, reinforcement learning, Dual IoU

TL;DR

This paper proposes TreeBench (the first traceable visual reasoning benchmark comprising 405 highly challenging VQA pairs, on which OpenAI-o3 achieves only 54.87%) and TreeVGR (a training paradigm that jointly supervises grounding and reasoning via dual IoU reward-based reinforcement learning). A 7B model achieves gains of +16.8 on V*Bench, +12.6 on MME-RealWorld, and +13.4 on TreeBench, demonstrating that traceability is a key driver of visual reasoning advancement.

Background & Motivation

Background: OpenAI-o3 has pioneered the paradigm of "thinking with images"—dynamically referencing and zooming into task-relevant visual regions during reasoning—showing potential beyond purely text-based reasoning. However, no existing benchmark comprehensively evaluates this capability.

Limitations of Prior Work:

  1. Classical benchmarks such as POPE, MMBench, and SEED-Bench overlook fine-grained grounding and verifiable reasoning chains.
  2. V*Bench supports only simple spatial queries (e.g., "Is A to the left of B?") and risks data leakage due to its reliance on COCO images.
  3. MME-RealWorld and HR-Bench accept high-resolution inputs but lack traceable evidence and complex reasoning.
  4. Existing RL training methods (e.g., DeepEyes, Pixel-Reasoner) supervise only the final answer, leaving intermediate grounding steps unsupervised.

Key Challenge: No benchmark simultaneously satisfies three critical requirements—focused visual perception (identifying subtle objects in dense scenes), traceable evidence (assessing grounding quality at each step of the reasoning chain), and second-order reasoning (object interaction and spatial hierarchy reasoning beyond simple localization). On the training side, existing methods cannot quantify the actual contribution of grounding within a "ground-then-answer" framework.

Goal: A two-pronged approach is adopted: TreeBench establishes an evaluation standard, and TreeVGR establishes a training methodology. Together, they advance both the assessment and improvement of "thinking with images" capabilities.

Method

Overall Architecture

TreeBench Construction Pipeline: 1K high-quality images are sampled from SA-1B (prioritizing dense-object scenes) → annotated by 8 LMM experts → 3-stage quality control → 405 challenging VQA pairs (with bounding box annotations for target instances).

TreeVGR Training Pipeline: Cold-start SFT initialization → reinforcement learning post-training with traceable evidence.

Key Design 1: Three Evaluation Principles of TreeBench

1) Focused Visual Perception: All questions focus on extremely small targets in complex real-world scenes—target instances occupy on average only 3.05% of the image area. Models are required to identify subtle targets via detailed, precise, and unambiguous textual descriptions.

2) Traceable Evidence: Evaluation covers not only final answer accuracy but also the quality (mIoU) of bounding boxes generated within the reasoning chain. By comparing predicted boxes against ground-truth boxes, error sources can be precisely diagnosed—distinguishing comprehension failures from grounding failures.
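As a concrete sketch, the traceable-evidence score can be computed from the boxes a model emits during reasoning. The box format (`[x1, y1, x2, y2]`) and the averaging direction (best match per ground-truth box) are assumptions for illustration, not the paper's exact evaluation code:

```python
# Sketch of a traceable-evidence metric: mIoU averages, over ground-truth
# boxes, the best IoU achieved by any box in the model's reasoning chain.
# Box format [x1, y1, x2, y2] is an assumption.

def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def m_iou(pred_boxes, gt_boxes):
    """Mean over ground-truth boxes of the best-matching predicted box's IoU."""
    if not gt_boxes or not pred_boxes:
        return 0.0
    return sum(max(iou(p, g) for p in pred_boxes) for g in gt_boxes) / len(gt_boxes)
```

A low mIoU with a correct final answer then signals that the model guessed rather than grounded, which is exactly the failure mode this principle is designed to expose.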

3) Second-Order Reasoning: Beyond simple "what/where" queries, the benchmark encompasses 5 perception task types (attribute / material / physical state / object retrieval / OCR) and 5 reasoning task types (perspective transformation / ordering / contact-occlusion / spatial containment / comparison). Perspective transformation ("From person A's viewpoint, in which direction is object B?") constitutes the most challenging category.

Key Design 2: Dual IoU Reward Mechanism in TreeVGR

The total reward in TreeVGR comprises three components:

\[R = R_{\text{acc}} + R_{\text{format}} + R_{\text{IoU}}\]

The dual IoU reward \(R_{\text{IoU}}\) is the core innovation. Given \(N\) predicted boxes \(\{\hat{b}_i\}\) and \(M\) ground-truth boxes \(\{b_k\}\), it simultaneously optimizes recall and precision:

Recall term (each GT box is matched by at least one predicted box):

\[R_{\text{IoU}}^{\text{R}} = \frac{1}{M} \sum_{k=1}^{M} \max_i \text{IoU}(\hat{b}_i, b_k)\]

Precision term (each predicted box matches at least one GT box, preventing the model from generating excessive boxes):

\[R_{\text{IoU}}^{\text{P}} = \frac{1}{N} \sum_{i=1}^{N} \max_k \text{IoU}(b_k, \hat{b}_i)\]
\[R_{\text{IoU}} = \frac{1}{2}(R_{\text{IoU}}^{\text{R}} + R_{\text{IoU}}^{\text{P}})\]

This bidirectional constraint addresses the reward hacking problem wherein a unidirectional recall reward incentivizes the model to exhaustively predict all possible boxes.
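The two terms can be sketched directly from the formulas above. The `[x1, y1, x2, y2]` box format and the convention of returning zero reward when either box set is empty are assumptions, not details from the paper:

```python
# Minimal sketch of the dual IoU reward R_IoU = (recall + precision) / 2.
# Box format [x1, y1, x2, y2] and the empty-set convention are assumptions.

def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def dual_iou_reward(pred_boxes, gt_boxes):
    """Average of the recall term (over GT boxes) and precision term (over predictions)."""
    if not pred_boxes or not gt_boxes:
        return 0.0
    # Recall: every ground-truth box should be covered by some predicted box.
    recall = sum(max(iou(p, g) for p in pred_boxes) for g in gt_boxes) / len(gt_boxes)
    # Precision: every predicted box should match some ground-truth box,
    # which penalizes the "predict every plausible box" hacking strategy.
    precision = sum(max(iou(p, g) for g in gt_boxes) for p in pred_boxes) / len(pred_boxes)
    return 0.5 * (recall + precision)
```

For instance, predicting one perfect box when there are two ground-truth targets yields recall 0.5 but precision 1.0; spraying many boxes to cover every target would raise recall while the precision term collapses, so neither degenerate strategy maximizes the reward.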

Key Design 3: Cold-Start Initialization

Directly applying RL to visual grounded reasoning is extremely inefficient (DeepEyes requires 32 H100 GPUs for 48 hours). This paper first performs cold-start initialization with carefully constructed SFT data—each sample containing an image, a question, a reasoning trajectory with bounding boxes, and a final answer—ensuring the model has acquired basic "ground-then-reason" capability prior to RL. This initialization strategy substantially reduces the computational cost of RL.
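The paper describes each cold-start sample as an image, a question, a reasoning trajectory with bounding boxes, and a final answer. A record might look like the following; all field names and values are hypothetical illustrations, not the paper's actual data schema:

```python
# Hypothetical cold-start SFT record. Field names, the question/answer text,
# and the coordinates are illustrative placeholders, not from the dataset.
sample = {
    "image": "path/to/image.jpg",          # input image
    "question": "What material is the small sign above the left door?",
    "trajectory": [                        # reasoning steps with grounded evidence
        {
            "thought": "The question targets the sign above the left door; locate it first.",
            "boxes": [[412, 108, 497, 166]],   # [x1, y1, x2, y2] pixel coordinates
        },
    ],
    "answer": "metal",                     # final answer supervised by SFT
}
```

Supervising on trajectories of this shape teaches the model the "ground-then-reason" output format before RL, so the reinforcement stage can focus on improving grounding quality rather than discovering the format from scratch.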

Key Experimental Results

Main Results: Per-Category Performance on TreeBench

| Model | Overall | Attr. | Phys. State | Obj. Retrieval | OCR | Persp. Trans. | Ordering | Contact-Occ. | Spatial Cont. | Comparison | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| o3-0416 | 54.8 | 69.0 | 69.2 | 65.2 | 68.8 | 79.4 | 22.4 | 38.6 | 61.0 | 86.2 | – |
| Gemini-2.5-Pro | 54.1 | 51.7 | 61.5 | 56.5 | 75.0 | 83.8 | 20.0 | 36.8 | 65.9 | 86.2 | – |
| Qwen2.5-VL-72B | 42.2 | 65.5 | 69.2 | 56.5 | 56.3 | 48.5 | 11.8 | 33.3 | 51.2 | 72.4 | – |
| Qwen2.5-VL-7B | 37.0 | 55.2 | 53.8 | 56.5 | 62.5 | 27.9 | 20.0 | 35.1 | 39.0 | 44.8 | – |
| DeepEyes-7B | 37.5 | 62.1 | 53.8 | 65.2 | 68.8 | 51.5 | 11.8 | 24.6 | 36.6 | 51.7 | 30.0 |
| Pixel-Reasoner-7B | 39.0 | 58.6 | 61.5 | 65.2 | 50.0 | 48.5 | 14.1 | 31.6 | 39.0 | 44.8 | 35.7 |
| TreeVGR-7B | 50.4 | 65.5 | 53.8 | 82.6 | 68.8 | 63.3 | 22.4 | 36.8 | 61.0 | 69.0 | 44.0 |

Ablation Study: Performance Gains Across Benchmarks

| Benchmark | Qwen2.5-VL-7B (Baseline) | TreeVGR-7B | Gain |
|---|---|---|---|
| TreeBench Overall | 37.0 | 50.4 | +13.4 |
| V*Bench Overall | 74.3 | 91.1 | +16.8 |
| V*Bench Attr. | 77.4 | 94.0 | +16.6 |
| V*Bench Spatial | 69.7 | 87.0 | +17.3 |
| MME-RealWorld-Lite | 42.3 | 54.9 | +12.6 |
| HR-Bench-4K | 72.1 | 77.1 | +5.0 |
| HR-Bench-8K | 68.8 | 73.1 | +4.3 |

Key Findings

  • No model exceeds 60% on TreeBench: Even the strongest model, o3, achieves only 54.87%, confirming the benchmark's genuine difficulty.
  • TreeVGR-7B matches InternVL3-78B: Joint grounding-reasoning training enables the 7B model to reach the performance level of a general-purpose 78B model.
  • mIoU is highly correlated with final accuracy: TreeVGR's mIoU of 44.0 substantially outperforms DeepEyes (30.0) and Pixel-Reasoner (35.7), validating the positive contribution of precise grounding to reasoning.
  • Ordering and contact-occlusion are the hardest categories: every model scores below 25% on ordering and below 40% on contact-occlusion, reflecting the fundamental difficulty of second-order reasoning.

Highlights & Insights

  • The striking result of "o3 below 55%": Even the most capable multimodal model remains weak at fine-grained visual reasoning—TreeBench exposes genuine capability gaps.
  • Traceability = Verifiability: Evaluating the grounding evidence at each step of the reasoning chain, rather than only the final answer, yields more reliable and diagnostically informative assessments.
  • Elegant dual IoU reward design: Bidirectional recall-precision constraints prevent the reward hacking strategy of exhaustively predicting boxes.
  • Efficient cold-start + RL paradigm: Compared to the pure RL approach of DeepEyes (32×H100, 48 hours), cold-start initialization substantially reduces computational cost.

Limitations & Future Work

  • TreeBench is relatively small in scale (only 405 questions), limiting statistical significance.
  • TreeVGR does not perform actual image cropping or re-examination (grounding occurs only in text space), potentially missing visual details.
  • The quality of cold-start SFT data directly constrains the upper bound of RL training, and data construction incurs manual annotation costs.
  • Training samples for second-order reasoning (perspective transformation / spatial containment) are limited, potentially resulting in insufficient RL training coverage.
  • Multi-turn interactive grounded reasoning remains unexplored.

Rating

  • Novelty: ⭐⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐⭐