
Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems

Conference: ICLR 2026 | arXiv: 2508.12026 | Code: Available | Area: Multimodal VLM | Keywords: Bongard problems, abstract visual reasoning, few-shot learning, VLM benchmark, fine-grained concepts

TL;DR

Bongard-RWR+ is a benchmark comprising 5,400 Bongard problems, constructed via a VLM-based pipeline (Pixtral-12B + Flux.1-dev) that automatically generates photorealistic images to represent abstract concepts. Systematic evaluation reveals that state-of-the-art VLMs struggle to discriminate fine-grained visual concepts such as contour, rotation, and angle, with accuracy dropping as low as 19%.

Background & Motivation

Background: Bongard Problems (BPs) are a classic test of abstract visual reasoning — given six images on each side, the task is to identify the abstract concept that distinguishes the two groups. Existing BP datasets are either synthetic black-and-white images (Bongard-LOGO) or use real images to represent coarse-grained concepts (e.g., "a person driving").

Limitations of Prior Work: Although Bongard-RWR uses real images to represent fine-grained abstract concepts, it is hand-constructed with only 60 instances — a scale too small for robust evaluation. Moreover, no systematic diagnosis of VLM capabilities across different reasoning dimensions exists.

Key Challenge: VLMs perform reasonably well on coarse-grained concept recognition, but their ability to handle fine-grained abstract concepts (e.g., "arrows pointing in the same vs. different directions") remains unknown, necessitating a sufficiently large benchmark for systematic testing.

Goal: How can Bongard problems involving fine-grained abstract concepts be constructed at scale with photorealistic imagery? And how can the reasoning boundaries of VLMs be systematically evaluated?

Key Insight: A semi-automatic pipeline of I2T (image captioning) → T2T (description augmentation) → T2I (image generation) → human verification is employed to scale 60 Bongard-RWR instances up to 5,400.

Core Idea: A VLM-based pipeline is used to automatically generate photorealistic images for Bongard problems, enabling large-scale evaluation of VLMs' fine-grained abstract reasoning limits.
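The I2T → T2T → T2I → human-verification loop can be sketched as a minimal orchestration skeleton. All three model helpers below (caption_image, augment_description, generate_image) are hypothetical stand-ins for Pixtral-12B, the T2T augmenter, and Flux.1-dev; only the control flow mirrors the paper's pipeline:

```python
# Hypothetical sketch of the semi-automatic generation loop.
# The three model helpers are placeholders, not real APIs.

def caption_image(image):
    # I2T: Pixtral-12B would produce a positive description capturing the
    # target concept and a negative one suppressing the opposing concept.
    return {"positive": f"photo showing {image['concept']}",
            "negative": f"photo avoiding {image['anti_concept']}"}

def augment_description(desc, n=15):
    # T2T: expand one positive description into n diverse variants.
    return [f"{desc} (variant {i})" for i in range(n)]

def generate_image(prompt):
    # T2I: Flux.1-dev would render a photorealistic candidate here.
    return {"prompt": prompt}

def build_candidates(seed_image, human_ok=lambda img: True):
    # Full loop for one seed image: caption, augment, render, verify.
    descs = caption_image(seed_image)
    variants = augment_description(descs["positive"])
    candidates = [generate_image(v) for v in variants]
    # Human verification gate: keep only concept-faithful images.
    return [c for c in candidates if human_ok(c)]

cands = build_candidates({"concept": "convex contour",
                          "anti_concept": "concave contour"})
print(len(cands))  # 15 variants survive the permissive default gate
```

In the real pipeline the human_ok gate is a manual review step, which is also why the authors describe the pipeline as semi-automatic rather than fully automatic.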

Method

Overall Architecture

Bongard-RWR+ is a benchmark contribution rather than a methodology paper. Its core contributions are a semi-automatic data construction pipeline and a multi-task, multi-dimensional evaluation framework. The benchmark covers 49 abstract concepts, each instantiated in multiple matrix variants, for 5,400 BPs in total.

Key Designs

  1. Semi-Automatic Image Generation Pipeline:

    • Function: Starting from hand-crafted BP images, generates large quantities of new photorealistic images representing the same abstract concepts.
    • Mechanism: (1) Pixtral-12B generates a positive description (capturing the target concept) and a negative description (suppressing the opposing concept) for each image; (2) a T2T model augments each positive description into 15 diverse variants; (3) Flux.1-dev generates candidate images from these descriptions; (4) human verification ensures concept fidelity. Diversity-maximizing selection is performed using pairwise cosine similarity of ViT-L/14 embeddings.
    • Design Motivation: Manual construction does not scale. Automated generation must ensure concept fidelity — positive/negative descriptions prevent T2I models from conflating opposing concepts.
  2. Multi-Task Evaluation Framework (6 Task Types):

    • Function: Systematically evaluates VLMs from easy to difficult settings.
    • I1S/I2S: Single/dual test-image binary classification (assigning images to the left or right group).
    • D1S/D2S: Classification after converting images to text descriptions via I2T (testing the effect of intermediate steps).
    • CS: Selecting the correct concept from \(K\) candidates (\(K = 2, 4, 8, 16\)).
    • CG: Free-text generation of the correct concept description.
  3. Semantic Grouping Analysis of Concepts:

    • Function: Groups the 49 concepts into 9 semantic categories (Size, Position, Count, Branching, Similarity, Contour, Shape, Rotation, Angle).
    • Design Motivation: Pinpoints specific weaknesses of VLMs — identifying which abstract concept categories are most challenging and which are more tractable.
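The diversity-maximizing selection step from the pipeline above can be sketched as a greedy farthest-point routine over unit-normalized embeddings. This is a minimal illustration under two assumptions: feature vectors are already extracted (random vectors stand in for ViT-L/14 embeddings), and greedy min-max similarity is one plausible reading of "diversity-maximizing" via pairwise cosine similarity:

```python
import numpy as np

def select_diverse(embeddings, k):
    """Greedily pick k items whose pairwise cosine similarity is low.

    embeddings: (n, d) array of image feature vectors (ViT-L/14 in the
    paper; random vectors here for illustration).
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sim = X @ X.T                                     # pairwise cosine similarity
    chosen = [int(np.argmin(sim.sum(axis=1)))]        # seed: most atypical item
    while len(chosen) < k:
        # Pick the item whose highest similarity to the chosen set is smallest.
        cand_sim = sim[:, chosen].max(axis=1)
        cand_sim[chosen] = np.inf                     # never re-pick an item
        chosen.append(int(np.argmin(cand_sim)))
    return chosen

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 768))   # 100 candidates, ViT-L/14-sized features
picked = select_diverse(feats, k=6)   # six images fill one side of a BP
print(picked)
```

The greedy criterion keeps each newly selected image maximally dissimilar from everything already chosen, which is the usual cheap surrogate for maximizing set-level diversity without solving the combinatorial problem exactly.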

Loss & Training

N/A (benchmark paper; existing models are evaluated without training).

Key Experimental Results

Main Results (Concept Selection Task)

Model              K=2    K=4    K=8    K=16
InternVL2.5-78B    91%    78%    68%    57%
Qwen2-VL-72B       85%    65%    48%    33%
LLaVA-Next-110B    73%    45%    30%    19%
MiniCPM-o-8B       72%    44%    28%    19%
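For context, chance accuracy on K-way concept selection is 1/K (50%, 25%, 12.5%, and 6.25% for K = 2, 4, 8, 16), so even the weakest 19% scores at K=16 remain above chance while the margin over chance shrinks. A quick check against the best and worst rows of the table:

```python
# Margin over chance (1/K) for the strongest and weakest models in the table.
rows = [(2, 91, 73), (4, 78, 45), (8, 68, 30), (16, 57, 19)]
for k, internvl, llava in rows:
    chance = 100 / k
    print(f"K={k:2d}: chance={chance:5.2f}%  "
          f"InternVL2.5 +{internvl - chance:.2f}  LLaVA-Next +{llava - chance:.2f}")
```

The margins show that performance does not collapse to pure guessing at large K, but the steep drop in raw accuracy still indicates poor discrimination among fine-grained distractor concepts.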

Binary Classification Tasks (I1S/I2S)

Model              I1S     I2S     D1S     D2S
InternVL2.5-78B    0.50    0.39    0.57    0.49
Qwen2-VL-72B       0.49    0.44    0.58    0.42
Random Baseline    0.50    0.50    0.50    0.50

Key Findings

  • Binary classification is at or below chance: All VLMs score at or below the 0.50 random baseline on I1S/I2S (I2S even drops to 0.39 for InternVL2.5), demonstrating that VLMs can hardly infer fine-grained abstract concepts from few-shot images.
  • Concept selection degrades rapidly: InternVL2.5 achieves 91% at \(K=2\) (demonstrating some discriminative ability), but collapses to 57% at \(K=16\) as the number of distractors increases.
  • Significant variation across semantic groups: Shape, Size, and Branching are relatively easy (~75%), whereas Contour, Rotation, and Angle are difficult (<50%) — the latter require precise spatial relational reasoning.
  • DeepSeek-R1 achieves 0.56 on the text-only D2S task, suggesting that textual reasoning is more effective than visual reasoning — the bottleneck for VLMs lies in visual perception rather than reasoning.
  • Color vs. grayscale input shows no significant difference, confirming that the target concepts are structural and color-independent.
  • Small models (MiniCPM-8B) and large models (LLaVA-110B) achieve comparable performance, indicating that model scale is not a determining factor.

Highlights & Insights

  • Reveals a fundamental weakness of VLMs: Even the strongest 78B-parameter VLMs perform near chance on few-shot abstract visual reasoning — a failure mode unlikely to be resolved through scaling alone.
  • Methodological value of the semi-automatic pipeline: The I2T → T2T → T2I → human verification workflow is reusable for other scenarios requiring large-scale conceptual datasets.
  • Comprehensive multi-task evaluation design: Progressing from binary classification to multi-way selection to free-form generation, the framework enables precise localization of capability boundaries.

Limitations & Future Work

  • Concept fidelity of generated images still requires human verification, preventing full automation.
  • The 49 concepts covered represent a limited subset of the 394 concepts in the original Bongard problems.
  • Evaluation is restricted to zero-shot/few-shot VLMs; whether fine-tuning could yield improvements remains untested.
  • Generated images may contain T2I model artifacts that interfere with concept judgment.
Comparison with Related Benchmarks

  • vs. Bongard-LOGO: LOGO contains 12K instances but relies entirely on synthetic black-and-white images; RWR+ provides 5.4K instances with photorealistic imagery, more closely aligned with VLM training distributions.
  • vs. Bongard-HOI/OpenWorld: These benchmarks use coarse-grained concepts (e.g., "a person driving"), on which VLMs perform relatively well; RWR+ employs fine-grained abstract concepts that expose genuine VLM weaknesses.
  • vs. ARC (Chollet): ARC similarly tests abstract reasoning but in a grid domain; RWR+ operates in the real-image domain, making the two complementary.

Rating

  • Novelty: ⭐⭐⭐⭐ Semi-automatic generation pipeline + multi-dimensional evaluation framework
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four large models, six task types, nine semantic groups, and extensive ablations
  • Writing Quality: ⭐⭐⭐⭐ Clear structure and comprehensive evaluation
  • Value: ⭐⭐⭐⭐ Establishes the capability ceiling and bottlenecks of VLMs in fine-grained reasoning