Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Method
Conference: ICLR 2026 · arXiv: 2507.07999 · Code: GitHub · Area: Object Detection · Keywords: Visual grounded reasoning, traceable evidence, second-order reasoning, TreeBench, reinforcement learning, Dual IoU
TL;DR
This paper proposes TreeBench (the first traceable visual reasoning benchmark comprising 405 highly challenging VQA pairs, on which OpenAI-o3 achieves only 54.87%) and TreeVGR (a training paradigm that jointly supervises grounding and reasoning via dual IoU reward-based reinforcement learning). A 7B model achieves gains of +16.8 on V*Bench, +12.6 on MME-RealWorld, and +13.4 on TreeBench, demonstrating that traceability is a key driver of visual reasoning advancement.
Background & Motivation
Background: OpenAI-o3 has pioneered the paradigm of "thinking with images"—dynamically referencing and zooming into task-relevant visual regions during reasoning—showing potential beyond purely text-based reasoning. However, no existing benchmark comprehensively evaluates this capability.
Limitations of Prior Work: 1. Classical benchmarks such as POPE, MMBench, and SEED-Bench overlook fine-grained grounding and verifiable reasoning chains. 2. V*Bench supports only simple spatial queries (e.g., "Is A to the left of B?") and risks data leakage due to its reliance on COCO images. 3. MME-RealWorld and HR-Bench accept high-resolution inputs but lack traceable evidence and complex reasoning. 4. Existing RL training methods (e.g., DeepEyes, Pixel-Reasoner) supervise only the final answer, leaving intermediate grounding steps unsupervised.
Key Challenge: No benchmark simultaneously satisfies three critical requirements—focused visual perception (identifying subtle objects in dense scenes), traceable evidence (assessing grounding quality at each step of the reasoning chain), and second-order reasoning (object interaction and spatial hierarchy reasoning beyond simple localization). On the training side, existing methods cannot quantify the actual contribution of grounding within a "ground-then-answer" framework.
Goal: A two-pronged approach is adopted: TreeBench establishes an evaluation standard, and TreeVGR establishes a training methodology. Together, they advance both the assessment and improvement of "thinking with images" capabilities.
Method
Overall Architecture
TreeBench Construction Pipeline: 1K high-quality images are sampled from SA-1B (prioritizing dense-object scenes) → annotated by 8 LMM experts → 3-stage quality control → 405 challenging VQA pairs (with bounding box annotations for target instances).
TreeVGR Training Pipeline: Cold-start SFT initialization → reinforcement learning post-training with traceable evidence.
Key Design 1: Three Evaluation Principles of TreeBench
1) Focused Visual Perception: All questions focus on extremely small targets in complex real-world scenes—target instances occupy on average only 3.05% of the image area. Models are required to identify subtle targets via detailed, precise, and unambiguous textual descriptions.
2) Traceable Evidence: Evaluation covers not only final answer accuracy but also the quality (mIoU) of bounding boxes generated within the reasoning chain. By comparing predicted boxes against ground-truth boxes, error sources can be precisely diagnosed—distinguishing comprehension failures from grounding failures.
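As a concrete illustration, the traceable-evidence metric can be computed as a mean IoU between the boxes a model emits in its reasoning chain and the ground-truth target boxes. The (x1, y1, x2, y2) box format and the per-question best-match averaging below are assumptions for illustration, not necessarily the paper's exact protocol:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def question_iou(pred_boxes, gt_boxes):
    """Average, over GT boxes, of the best IoU any predicted box achieves."""
    if not pred_boxes or not gt_boxes:
        return 0.0
    return sum(max(iou(g, p) for p in pred_boxes) for g in gt_boxes) / len(gt_boxes)

def miou(questions):
    """Benchmark-level mIoU: mean of per-question scores.

    `questions` is a list of (pred_boxes, gt_boxes) pairs, one per VQA item.
    """
    return sum(question_iou(p, g) for p, g in questions) / len(questions)
```

A low mIoU alongside high answer accuracy would suggest the model is guessing rather than grounding; the reverse pattern points to a comprehension failure, which is exactly the diagnostic split the benchmark is after.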
3) Second-Order Reasoning: Beyond simple "what/where" queries, the benchmark encompasses 5 perception task types (attribute / material / physical state / object retrieval / OCR) and 5 reasoning task types (perspective transformation / ordering / contact-occlusion / spatial containment / comparison). Perspective transformation ("From person A's viewpoint, in which direction is object B?") constitutes the most challenging category.
Key Design 2: Dual IoU Reward Mechanism in TreeVGR
The total reward in TreeVGR comprises three components: a rule-based answer-accuracy reward, a format reward, and the dual IoU reward.
The dual IoU reward \(R_{\text{IoU}}\) is the core innovation, simultaneously optimizing recall and precision:
\[ R_{\text{IoU}} = \frac{1}{2}\left(R_{\text{recall}} + R_{\text{precision}}\right) \]
Recall term (each ground-truth box should be matched by at least one predicted box):
\[ R_{\text{recall}} = \frac{1}{|\mathcal{B}^{\text{gt}}|} \sum_{b \in \mathcal{B}^{\text{gt}}} \max_{\hat{b} \in \mathcal{B}^{\text{pred}}} \mathrm{IoU}(b, \hat{b}) \]
Precision term (each predicted box should match at least one ground-truth box, preventing the model from generating excessive boxes):
\[ R_{\text{precision}} = \frac{1}{|\mathcal{B}^{\text{pred}}|} \sum_{\hat{b} \in \mathcal{B}^{\text{pred}}} \max_{b \in \mathcal{B}^{\text{gt}}} \mathrm{IoU}(\hat{b}, b) \]
This bidirectional constraint addresses the reward-hacking problem wherein a unidirectional recall reward incentivizes the model to exhaustively predict all possible boxes: spurious extra boxes leave recall untouched but drive the precision term down.
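A minimal sketch of the dual IoU reward, under the assumptions that both terms are best-match averages over (x1, y1, x2, y2) boxes and that the two terms are weighted equally (the paper's exact weighting may differ):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def dual_iou_reward(pred_boxes, gt_boxes):
    """Recall term: every GT box should be covered by some predicted box.
    Precision term: every predicted box should match some GT box.
    Equal weighting of the two terms is an assumption for illustration."""
    if not pred_boxes or not gt_boxes:
        return 0.0
    recall = sum(max(iou(g, p) for p in pred_boxes) for g in gt_boxes) / len(gt_boxes)
    precision = sum(max(iou(p, g) for g in gt_boxes) for p in pred_boxes) / len(pred_boxes)
    return 0.5 * (recall + precision)
```

Spamming extra boxes keeps recall at 1.0 but drags precision toward 0, so the combined reward falls: that is the intended defense against the box-spamming hack.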
Key Design 3: Cold-Start Initialization
Directly applying RL to visual grounded reasoning is extremely inefficient (DeepEyes requires 32 H100 GPUs for 48 hours). This paper first performs cold-start initialization with carefully constructed SFT data—each sample containing an image, a question, a reasoning trajectory with bounding boxes, and a final answer—ensuring the model has acquired basic "ground-then-reason" capability prior to RL. This initialization strategy substantially reduces the computational cost of RL.
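The paper's exact data schema is not given; a hypothetical shape for one cold-start SFT sample, with all field names and values invented for illustration, might look like:

```python
# Hypothetical cold-start SFT sample; field names and values are invented.
sample = {
    "image": "sa1b_dense_scene.jpg",       # source image (made-up file name)
    "question": "What material is the small kettle on the top shelf?",
    "trajectory": [                        # reasoning steps with grounded boxes
        {"thought": "First locate the top shelf, then the kettle on it.",
         "boxes": [[412, 88, 455, 130]]},  # (x1, y1, x2, y2) evidence
    ],
    "answer": "stainless steel",
}

def is_traceable(s):
    """A sample supervises grounding only if some reasoning step carries a box."""
    return any(step["boxes"] for step in s["trajectory"])
```

The point of this structure is that the trajectory, not just the answer, carries bounding boxes, so SFT already teaches the "ground-then-reason" format that the dual IoU reward later reinforces.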
Key Experimental Results
Main Results: Per-Category Performance on TreeBench
| Model | Overall | Attr. | Phys. State | Obj. Retrieval | OCR | Persp. Trans. | Ordering | Contact-Occ. | Spatial Cont. | Comparison | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| o3-0416 | 54.8 | 69.0 | 69.2 | 65.2 | 68.8 | 79.4 | 22.4 | 38.6 | 61.0 | 86.2 | – |
| Gemini-2.5-Pro | 54.1 | 51.7 | 61.5 | 56.5 | 75.0 | 83.8 | 20.0 | 36.8 | 65.9 | 86.2 | – |
| Qwen2.5-VL-72B | 42.2 | 65.5 | 69.2 | 56.5 | 56.3 | 48.5 | 11.8 | 33.3 | 51.2 | 72.4 | – |
| Qwen2.5-VL-7B | 37.0 | 55.2 | 53.8 | 56.5 | 62.5 | 27.9 | 20.0 | 35.1 | 39.0 | 44.8 | – |
| DeepEyes-7B | 37.5 | 62.1 | 53.8 | 65.2 | 68.8 | 51.5 | 11.8 | 24.6 | 36.6 | 51.7 | 30.0 |
| Pixel-Reasoner-7B | 39.0 | 58.6 | 61.5 | 65.2 | 50.0 | 48.5 | 14.1 | 31.6 | 39.0 | 44.8 | 35.7 |
| TreeVGR-7B | 50.4 | 65.5 | 53.8 | 82.6 | 68.8 | 63.3 | 22.4 | 36.8 | 61.0 | 69.0 | 44.0 |
Ablation Study: Performance Gains Across Benchmarks
| Benchmark | Qwen2.5-VL-7B (Baseline) | TreeVGR-7B | Gain |
|---|---|---|---|
| TreeBench Overall | 37.0 | 50.4 | +13.4 |
| V*Bench Overall | 74.3 | 91.1 | +16.8 |
| V*Bench Attr. | 77.4 | 94.0 | +16.6 |
| V*Bench Spatial | 69.7 | 87.0 | +17.3 |
| MME-RealWorld-Lite | 42.3 | 54.9 | +12.6 |
| HR-Bench-4K | 72.1 | 77.1 | +5.0 |
| HR-Bench-8K | 68.8 | 73.1 | +4.3 |
Key Findings
- No model exceeds 60% on TreeBench: Even the strongest model, o3, achieves only 54.87%, confirming the benchmark's genuine difficulty.
- TreeVGR-7B matches InternVL3-78B: Joint grounding-reasoning training enables the 7B model to reach the performance level of a general-purpose 78B model.
- mIoU is highly correlated with final accuracy: TreeVGR's mIoU of 44.0 substantially outperforms DeepEyes (30.0) and Pixel-Reasoner (35.7), validating the positive contribution of precise grounding to reasoning.
- Ordering and contact-occlusion are the hardest categories: every model scores below 25% on ordering, and contact-occlusion scores cluster below 40%, reflecting the fundamental difficulty of second-order reasoning.
Highlights & Insights
- The striking result of "o3 below 55%": Even the most capable multimodal model remains weak at fine-grained visual reasoning—TreeBench exposes genuine capability gaps.
- Traceability = Verifiability: Evaluating the grounding evidence at each step of the reasoning chain, rather than only the final answer, yields more reliable and diagnostically informative assessments.
- Elegant dual IoU reward design: Bidirectional recall-precision constraints prevent the reward hacking strategy of exhaustively predicting boxes.
- Efficient cold-start + RL paradigm: Compared to the pure RL approach of DeepEyes (32×H100, 48 hours), cold-start initialization substantially reduces computational cost.
Limitations & Future Work
- TreeBench is relatively small in scale (only 405 questions), which limits the statistical power of model comparisons.
- TreeVGR does not perform actual image cropping or re-examination (grounding occurs only in text space), potentially missing visual details.
- The quality of cold-start SFT data directly constrains the upper bound of RL training, and data construction incurs manual annotation costs.
- Training samples for second-order reasoning (perspective transformation / spatial containment) are limited, potentially resulting in insufficient RL training coverage.
- Multi-turn interactive grounded reasoning remains unexplored.
Rating
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐