Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems¶
Conference: ICLR 2026 arXiv: 2508.12026 Code: Available Area: Multimodal VLM Keywords: Bongard problems, abstract visual reasoning, few-shot learning, VLM benchmark, fine-grained concepts
TL;DR¶
Bongard-RWR+ is a benchmark comprising 5,400 Bongard problems, constructed via a VLM-based pipeline (Pixtral-12B + Flux.1-dev) that automatically generates photorealistic images to represent abstract concepts. Systematic evaluation reveals that state-of-the-art VLMs struggle to discriminate fine-grained visual concepts such as contour, rotation, and angle, with accuracy dropping as low as 19%.
Background & Motivation¶
Background: Bongard Problems (BPs) are a classic test of abstract visual reasoning — given six images on each side, the task is to identify the abstract concept that distinguishes the two groups. Existing BP datasets are either synthetic black-and-white images (Bongard-LOGO) or use real images to represent coarse-grained concepts (e.g., "a person driving").
Limitations of Prior Work: Although Bongard-RWR uses real images to represent fine-grained abstract concepts, it is hand-constructed with only 60 instances — a scale too small for robust evaluation. Moreover, no systematic diagnosis of VLM capabilities across different reasoning dimensions exists.
Key Challenge: VLMs perform reasonably well on coarse-grained concept recognition, but their ability to handle fine-grained abstract concepts (e.g., "arrows pointing in the same vs. different directions") remains unknown, necessitating a sufficiently large benchmark for systematic testing.
Goal: How can Bongard problems involving fine-grained abstract concepts be constructed at scale with photorealistic imagery? And how can the reasoning boundaries of VLMs be systematically evaluated?
Key Insight: A semi-automatic pipeline of I2T (image captioning) → T2T (description augmentation) → T2I (image generation) → human verification is employed to scale 60 Bongard-RWR instances up to 5,400.
Core Idea: A VLM-based pipeline is used to automatically generate photorealistic images for Bongard problems, enabling large-scale evaluation of VLMs' fine-grained abstract reasoning limits.
Method¶
Overall Architecture¶
Bongard-RWR+ is a benchmark paper rather than a methodology paper. Its core contributions are a semi-automatic data construction pipeline and a multi-task, multi-dimensional evaluation framework. The benchmark covers 49 abstract concepts and comprises 5,400 BPs in total.
Key Designs¶
- Semi-Automatic Image Generation Pipeline:
- Function: Starting from hand-crafted BP images, generates large quantities of new photorealistic images representing the same abstract concepts.
- Mechanism: (1) Pixtral-12B generates a positive description (capturing the target concept) and a negative description (suppressing the opposing concept) for each image; (2) a T2T model augments each positive description into 15 diverse variants; (3) Flux.1-dev generates candidate images from these descriptions; (4) human verification ensures concept fidelity. Diversity-maximizing selection is performed using pairwise cosine similarity of ViT-L/14 embeddings.
- Design Motivation: Manual construction does not scale. Automated generation must ensure concept fidelity — positive/negative descriptions prevent T2I models from conflating opposing concepts.
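The diversity-maximizing selection step can be sketched as greedy farthest-point selection over pairwise cosine similarities. This is an illustrative reconstruction: the paper states only that pairwise cosine similarity of ViT-L/14 embeddings is used, so the greedy strategy and function name here are assumptions.

```python
# Sketch of diversity-maximizing selection (assumption: greedy
# farthest-point selection; the paper specifies only cosine similarity
# over ViT-L/14 embeddings, not the exact selection rule).
import numpy as np

def select_diverse(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily pick k indices whose embeddings are mutually dissimilar."""
    # Normalize rows so dot products equal cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = emb @ emb.T
    # Seed with the image least similar to all others on average.
    chosen = [int(np.argmin(sim.mean(axis=1)))]
    while len(chosen) < k:
        # Each candidate's worst case is its max similarity to any
        # already-chosen image; pick the candidate minimizing that.
        worst = sim[:, chosen].max(axis=1)
        worst[chosen] = np.inf  # never re-select a chosen index
        chosen.append(int(np.argmin(worst)))
    return chosen

rng = np.random.default_rng(0)
vecs = rng.normal(size=(20, 8))   # stand-in for 20 candidate embeddings
picked = select_diverse(vecs, 6)  # six mutually dissimilar candidates
print(picked)
```

The max-min criterion penalizes near-duplicates directly, which is the behavior one wants when filtering T2I outputs that often cluster around a few visual motifs.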
- Multi-Task Evaluation Framework (6 Task Types):
- Function: Systematically evaluates VLMs from easy to difficult settings.
- I1S/I2S: Single/dual test-image binary classification (assigning images to the left or right group).
- D1S/D2S: Classification after converting images to text descriptions via I2T (testing the effect of intermediate steps).
- CS: Selecting the correct concept from \(K\) candidates (\(K = 2, 4, 8, 16\)).
- CG: Free-text generation of the correct concept description.
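When reading the result tables, note that the chance baseline differs across task types: 0.5 for the four binary tasks, but \(1/K\) for concept selection. A minimal helper (the function name is mine, not the paper's) makes this explicit:

```python
# Chance baselines for the six task settings. The binary tasks (I1S,
# I2S, D1S, D2S) have a left/right chance level of 0.5; concept
# selection (CS) has chance 1/K. Helper name is illustrative.
def chance_accuracy(task: str, k: int = 2) -> float:
    """Chance-level accuracy for a given task type."""
    binary = {"I1S", "I2S", "D1S", "D2S"}
    return 0.5 if task in binary else 1.0 / k

for k in (2, 4, 8, 16):
    print(f"CS, K={k}: chance = {chance_accuracy('CS', k):.4f}")
```

So InternVL2.5's 57% at \(K=16\) is well above its 6.25% chance level, whereas the ~0.50 binary-task scores sit exactly at chance.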
- Semantic Grouping Analysis of Concepts:
- Function: Groups the 49 concepts into 9 semantic categories (Size, Position, Count, Branching, Similarity, Contour, Shape, Rotation, Angle).
- Design Motivation: Pinpoints specific weaknesses of VLMs — identifying which abstract concept categories are most challenging and which are more tractable.
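The per-category analysis amounts to averaging per-concept accuracies within each semantic group. A small sketch (the concept names and accuracy values below are hypothetical; the paper maps all 49 concepts onto its 9 groups):

```python
# Sketch of per-category aggregation. Concept names, group assignments,
# and accuracies here are toy placeholders, not the paper's data.
from collections import defaultdict

concept_acc = {"same size": 0.80, "larger on left": 0.71,
               "convex contour": 0.42, "rotated shape": 0.38}
concept_group = {"same size": "Size", "larger on left": "Size",
                 "convex contour": "Contour", "rotated shape": "Rotation"}

def group_accuracy(acc: dict, groups: dict) -> dict:
    """Mean accuracy per semantic group."""
    buckets = defaultdict(list)
    for concept, a in acc.items():
        buckets[groups[concept]].append(a)
    return {g: sum(v) / len(v) for g, v in buckets.items()}

print(group_accuracy(concept_acc, concept_group))
```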
Loss & Training¶
N/A (benchmark paper; existing models are evaluated without training).
Key Experimental Results¶
Main Results (Concept Selection Task)¶
| Model | K=2 | K=4 | K=8 | K=16 |
|---|---|---|---|---|
| InternVL2.5-78B | 91% | 78% | 68% | 57% |
| Qwen2-VL-72B | 85% | 65% | 48% | 33% |
| LLaVA-Next-110B | 73% | 45% | 30% | 19% |
| MiniCPM-o-8B | 72% | 44% | 28% | 19% |
Binary Classification Tasks (I1S/I2S)¶
| Model | I1S | I2S | D1S | D2S |
|---|---|---|---|---|
| InternVL2.5-78B | 0.50 | 0.39 | 0.57 | 0.49 |
| Qwen2-VL-72B | 0.49 | 0.44 | 0.58 | 0.42 |
| Random Baseline | 0.50 | 0.50 | 0.50 | 0.50 |
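A quick way to see how far these scores sit from chance is a normal approximation to the binomial. The trial count `n` below is a placeholder assumption for illustration, not a figure from the paper:

```python
# Back-of-envelope significance check against chance (p0 = 0.5).
# n is a hypothetical trial count, not taken from the paper.
import math

def z_vs_chance(acc: float, n: int, p0: float = 0.5) -> float:
    """z-score of observed accuracy vs. chance level p0 over n trials."""
    se = math.sqrt(p0 * (1 - p0) / n)
    return (acc - p0) / se

print(round(z_vs_chance(0.50, 1200), 2))  # I1S at 0.50: indistinguishable from chance
print(round(z_vs_chance(0.39, 1200), 2))  # I2S at 0.39: far below chance
```

Under this assumed `n`, the I2S scores are not merely at chance but significantly below it, consistent with the models being actively misled by the two-image setting.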
Key Findings¶
- Binary classification is near chance: All VLMs score at or even below 50% accuracy on I1S/I2S (e.g., 0.39 for InternVL2.5 on I2S), no better than random guessing, indicating that VLMs are nearly incapable of inferring fine-grained abstract concepts from few-shot images.
- Concept selection degrades rapidly: InternVL2.5 achieves 91% at \(K=2\) (demonstrating some discriminative ability), but collapses to 57% at \(K=16\) as the number of distractors increases.
- Significant variation across semantic groups: Shape, Size, and Branching are relatively easy (~75%), whereas Contour, Rotation, and Angle are difficult (<50%) — the latter require precise spatial relational reasoning.
- DeepSeek-R1 achieves 0.56 on the text-only D2S task, suggesting that textual reasoning is more effective than visual reasoning — the bottleneck for VLMs lies in visual perception rather than reasoning.
- Color vs. grayscale input shows no significant difference, confirming that the target concepts are structural and color-independent.
- Small models (MiniCPM-o-8B) and large models (LLaVA-Next-110B) achieve comparable performance, indicating that model scale is not a determining factor.
Highlights & Insights¶
- Reveals a fundamental weakness of VLMs: Even the strongest 78B-parameter VLMs perform near chance on few-shot abstract visual reasoning — a failure mode unlikely to be resolved through scaling alone.
- Methodological value of the semi-automatic pipeline: The I2T → T2T → T2I → human verification workflow is reusable for other scenarios requiring large-scale conceptual datasets.
- Comprehensive multi-task evaluation design: Progressing from binary classification to multi-way selection to free-form generation, the framework enables precise localization of capability boundaries.
Limitations & Future Work¶
- Concept fidelity of generated images still requires human verification, preventing full automation.
- The 49 concepts covered represent a limited subset of the 394 concepts in the original Bongard problems.
- Evaluation is restricted to zero-shot/few-shot VLMs; whether fine-tuning could yield improvements remains untested.
- Generated images may contain T2I model artifacts that interfere with concept judgment.
Related Work & Insights¶
- vs. Bongard-LOGO: LOGO contains 12K instances but relies entirely on synthetic black-and-white images; RWR+ provides 5.4K instances with photorealistic imagery, more closely aligned with VLM training distributions.
- vs. Bongard-HOI/OpenWorld: These benchmarks use coarse-grained concepts (e.g., "a person driving"), on which VLMs perform relatively well; RWR+ employs fine-grained abstract concepts that expose genuine VLM weaknesses.
- vs. ARC (Chollet): ARC similarly tests abstract reasoning but in a grid domain; RWR+ operates in the real-image domain, making the two complementary.
Rating¶
- Novelty: ⭐⭐⭐⭐ Semi-automatic generation pipeline + multi-dimensional evaluation framework
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four large models, six task types, nine semantic groups, and extensive ablations
- Writing Quality: ⭐⭐⭐⭐ Clear structure and comprehensive evaluation
- Value: ⭐⭐⭐⭐ Establishes the capability ceiling and bottlenecks of VLMs in fine-grained reasoning