FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis

Conference: NeurIPS 2025
arXiv: 2511.06522
Code: https://github.com/NaiveNeuron/FractalBench
Area: Code Intelligence
Keywords: Fractals, Visual-Mathematical Reasoning, Program Synthesis, MLLM Evaluation, Recursive Abstraction

TL;DR

This paper introduces FractalBench, a benchmark for diagnosing visual-mathematical reasoning in MLLMs via fractal image program synthesis. The benchmark comprises 12 classical fractals and 610 test images; an evaluation of 4 MLLMs reveals that while 76% of generated code is executable, only 4% is visually correct, exposing fundamental deficiencies in recursive abstraction capabilities.

Background & Motivation

Background: MLLMs demonstrate strong performance across diverse visual understanding tasks, yet evaluation of visual-mathematical reasoning remains insufficient. TurtleBench tests only simple geometric shape drawing (19% accuracy); MathVista/MATH-Vision assess mathematical problem-solving with visual context; MATHGLANCE finds that models "do not know where to look."

Limitations of Prior Work: Existing benchmarks primarily test "applying mathematical knowledge to visual problems" rather than "abstracting mathematical rules from visual patterns"—i.e., inferring the infinite generative process underlying finite observations.

Key Challenge: Whether models can infer recursive generation rules from self-similar visual patterns remains an open question. This is a core capability in mathematical discovery, yet no systematic diagnostic benchmark exists.

Goal: To provide a targeted diagnostic tool that systematically evaluates MLLMs across a hierarchy of capabilities: scale-invariance recognition, geometric transformation inference, recursive structure abstraction, compositional reasoning, and branching recursion.

Key Insight: Fractals serve as ideal test cases—generated by a small number of contractive mappings in an Iterated Function System (IFS) via simple recursion, they produce highly complex self-similar patterns that require models to demonstrate both visual perception and mathematical abstraction.

Core Idea: Fractal program synthesis is used as a diagnostic probe to test whether MLLMs can infer recursive generation rules from images, rather than merely memorizing patterns.

Method

Overall Architecture

FractalBench comprises 12 classical fractals (Cantor set, Koch curve, Sierpiński structures, Dragon curve, Tree fractal), with 610 test images at 1024×1024 resolution spanning 4–12 levels of recursion depth and color variations. Models receive a fractal image as input and must output Python code (using the MinimalTurtle interface) capable of reproducing the fractal. Generated code is executed in a sandboxed environment and evaluated against the ground truth via IoU.
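
As a concrete reference for the interface models must target, here is a minimal sketch of a four-command turtle. The method names (forward, turn, pen_up, pen_down) and the segment-list representation are our assumptions for illustration, not the paper's published API.

```python
import math

class MinimalTurtle:
    """Minimal sketch of a four-command turtle: move, turn, pen up/down.
    Method names and internals are illustrative assumptions."""

    def __init__(self):
        self.x, self.y = 0.0, 0.0   # current position
        self.heading = 0.0          # degrees; 0 points along +x
        self.pen = True             # pen starts down
        self.segments = []          # line segments drawn so far

    def forward(self, dist):
        """Move by dist along the current heading, drawing if the pen is down."""
        nx = self.x + dist * math.cos(math.radians(self.heading))
        ny = self.y + dist * math.sin(math.radians(self.heading))
        if self.pen:
            self.segments.append(((self.x, self.y), (nx, ny)))
        self.x, self.y = nx, ny

    def turn(self, degrees):
        """Rotate the heading counterclockwise by the given angle."""
        self.heading = (self.heading + degrees) % 360.0

    def pen_up(self):
        self.pen = False

    def pen_down(self):
        self.pen = True
```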

Key Designs

  1. IFS-Based Formal Fractal Definition:

    • Function: Provides a rigorous mathematical definition for each fractal type.
    • Mechanism: Each fractal is the attractor of an IFS of contractive mappings \(f_1, \ldots, f_m: \mathbb{R}^d \to \mathbb{R}^d\), i.e., the unique compact set satisfying \(K = \bigcup_{i=1}^m f_i(K)\). For example, the Koch curve comprises 4 mappings (scaling by \(1/3\) with rotations of \(\pm 60°\) and translations); a turtle realization of this recursion is sketched after this list.
    • Design Motivation: The IFS framework provides unambiguous ground-truth generation rules, enabling objective evaluation.
  2. MinimalTurtle Restricted Interface:

    • Function: Constrains the model to only 4 commands (move, turn, pen up/down).
    • Mechanism: The API is deliberately restricted to isolate the capacity for visual-to-symbolic rule abstraction—richer APIs (e.g., L-systems, matplotlib) would allow models to bypass mathematical reasoning via template recall.
    • Design Motivation: Analogous to minimal grammar tests in formal language evaluation, this ensures the benchmark targets reasoning ability rather than API memorization.
  3. Five-Level Capability Hierarchy:

    • Function: Defines a difficulty-ordered hierarchy of reasoning requirements.
    • Mechanism: (1) Scale-invariance recognition → (2) Geometric transformation inference → (3) Recursive structure abstraction → (4) Compositional reasoning → (5) Branching recursion. Each fractal type targets a distinct level of this hierarchy.
    • Design Motivation: Differential success rates across fractal types enable precise localization of the specific capability level at which models fail.
  4. Contamination-Resistant Design:

    • Function: Prevents memorization effects through parameterizable recursion depth and color variants.
    • Mechanism: Color variants prevent models from relying on visual embeddings of canonical black fractals memorized during pretraining.
    • Design Motivation: Ensures that the benchmark measures genuine visual-mathematical reasoning rather than pattern matching.
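
To ground the IFS definition (item 1) in the restricted interface (item 2), the following sketch shows the kind of target program a model must synthesize for the Koch curve, reusing the MinimalTurtle sketch from the Method overview above. This is our illustration, not the paper's reference code; the recursion mirrors the four contractive mappings, each scaling by 1/3 with turns of ±60°.

```python
def koch(t, length, depth):
    """Koch curve on the four-command turtle (illustrative sketch)."""
    if depth == 0:
        t.forward(length)   # base case: a straight segment
        return
    # One sub-curve per IFS mapping; the turns realize the +-60 degree rotations.
    koch(t, length / 3, depth - 1)
    t.turn(60)
    koch(t, length / 3, depth - 1)
    t.turn(-120)
    koch(t, length / 3, depth - 1)
    t.turn(60)
    koch(t, length / 3, depth - 1)

t = MinimalTurtle()
koch(t, 300, depth=4)   # ~10 lines of code generate 4^4 = 256 segments
```

Note how a handful of lines encode arbitrarily deep structure; this compression is exactly what the Kolmogorov-complexity analysis below exploits.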

Evaluation Methodology

IoU (Jaccard Index) is used as the evaluation metric: \(\text{IoU} = |\mathcal{B}_a \cap \mathcal{B}_m| / |\mathcal{B}_a \cup \mathcal{B}_m|\), with \(\geq 95\%\) considered correct. Three prompting strategies are evaluated: Direct Code Generation (DCG), Reasoning Then Code (RTC), and Recursive Structure Focused (RSF).
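
A minimal sketch of the metric, assuming both the ground-truth and model programs are rasterized to same-size boolean masks (consistent with the 1024×1024 test images); the empty-union convention below is our choice:

```python
import numpy as np

def iou(mask_a: np.ndarray, mask_m: np.ndarray) -> float:
    """Jaccard index between ground-truth (a) and model (m) boolean rasters."""
    intersection = np.logical_and(mask_a, mask_m).sum()
    union = np.logical_or(mask_a, mask_m).sum()
    return float(intersection) / float(union) if union else 1.0  # empty vs. empty counts as a match

# A generated program is scored correct when iou(gt_mask, model_mask) >= 0.95.
```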

Key Experimental Results

Main Results

| Model | Executable Rate | Correct Rate (among executable) | End-to-End Correct Rate |
|---|---|---|---|
| GPT-4o (DCG) | 94.3% | 9.6% | 9.0% |
| Claude 3.7 Sonnet (DCG) | 82.0% | 9.0% | 7.4% |
| Gemini 2.5 Flash (DCG) | 23.8% | 48.3% | 11.5% |
| Qwen 2.5-VL (DCG) | 99.2% | 3.3% | 3.3% |
| Overall | 76.1% | 5.5% | 4.2% |

| Fractal Type | Mathematical Challenge | Success Rate Range |
|---|---|---|
| Koch Curve | Geometric transformations | 17–21% |
| Sierpiński Structures | Multi-scale self-similarity | 3–18% |
| Cantor Set | Linear recursion | Moderate |
| Dragon Curve | Space-filling | Low |
| Tree Fractal | Branching recursion | <2% |

Ablation Study (Prompting Strategy Comparison)

| Prompting Strategy | Claude | GPT-4o | Gemini | Qwen |
|---|---|---|---|---|
| DCG (Direct) | 7.4% | 9.0% | 11.5% | 3.3% |
| RTC (Reasoning First) | 2.5% | 1.6% | 3.3% | 4.9% |
| RSF (Recursive Focused) | 3.3% | 2.5% | 0.8% | 0.0% |

Key Findings

  • 76% executable vs. 4% correct: Models exhibit syntactic competence but lack semantic understanding—they can generate syntactically valid Python but fail to infer correct generation rules.
  • Koch highest vs. Tree lowest: Models can compose local geometric operations (rotation, scaling, translation) but cannot handle branching recursion (a single parent node producing multiple independent recursive child nodes), even though the Tree fractal's IFS definition, with only 2 mappings, is among the simplest (see the tree sketch after this list).
  • Direct generation substantially outperforms reasoning-first: Contrary to the advantage of Chain-of-Thought in logical reasoning tasks, extended intermediate reasoning degrades performance on precise visual-to-code synthesis—a counterintuitive finding.
  • Gemini achieves the lowest executable rate (23.8%) but the highest conditional correctness (48.3%), indicating a more conservative but more precise generation strategy.
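
The tree sketch referenced above makes the branching difficulty concrete: with only move/turn/pen commands, there is no built-in state save/restore, so a correct program must retrace every branch to return the turtle to the parent's position and heading before drawing the sibling. The 0.7 scale factor and 30° branch angle below are illustrative choices, not the paper's parameters.

```python
def tree(t, length, depth, angle=30):
    """Binary tree fractal: each parent spawns two recursive children.
    Retracing (the final turn + backward move) is the implicit
    state restore that models reportedly fail to infer."""
    if depth == 0:
        return
    t.forward(length)                          # draw the parent segment
    t.turn(angle)
    tree(t, 0.7 * length, depth - 1, angle)    # left child
    t.turn(-2 * angle)
    tree(t, 0.7 * length, depth - 1, angle)    # right child
    t.turn(angle)                              # restore heading...
    t.forward(-length)                         # ...and position

t = MinimalTurtle()
t.turn(90)             # point the trunk upward
tree(t, 100, depth=6)
```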

Highlights & Insights

  • Precise capability diagnosis: By targeting different reasoning levels with different fractal types, the benchmark provides quantitative answers to "exactly which capability is lacking"—branching recursion, rather than recursion per se.
  • Anti-CoT finding: Direct code generation outperforms reasoning-first prompting, revealing a fundamental difference in optimal prompting strategies between precise spatial/numerical output tasks and high-level logical reasoning tasks.
  • Kolmogorov complexity perspective: Analysis of code complexity reveals a "phase transition": when a model identifies the recursive structure, code length drops sharply, transitioning from pixel-level descriptions to compressed algorithmic representations (a toy length proxy is sketched after this list).
  • The contamination-resistant parameterized design philosophy is transferable to other benchmark design contexts.
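
As a rough illustration of the code-length signal mentioned in the Kolmogorov bullet above, one can proxy program complexity by raw and compressed source length; the gzip variant is our own addition, not the paper's method.

```python
import gzip

def code_complexity(source: str) -> dict:
    """Crude Kolmogorov-style proxies: raw character count and gzip size."""
    return {
        "raw_chars": len(source),
        "gzip_bytes": len(gzip.compress(source.encode("utf-8"))),
    }

# A recursive Koch program stays roughly constant in length as depth grows,
# while a program that hard-codes every segment grows as ~4^depth:
# the sharp drop between the two regimes is the "phase transition" above.
```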

Limitations & Future Work

  • Only a single generation per image is performed, without accounting for model stochasticity (best-of-N evaluation would be more equitable).
  • Reasoning-specialized models (e.g., o1, DeepSeek-R1), which may have advantages in recursive reasoning, are not evaluated.
  • The 95% IoU threshold is strict, and no finer-grained structure-aware metrics are provided (e.g., branch count accuracy, recursion depth detection).
  • The scope is limited to fractals; generalizability to broader visual-mathematical reasoning capabilities should be interpreted with caution.
  • The prompting analysis is observational and lacks controlled experiments to establish causal relationships.

Comparison with Related Benchmarks

  • vs. TurtleBench: TurtleBench asks "can the model draw what it sees" (simple geometric shapes, 19% accuracy); FractalBench asks "can the model infer the infinite generative process underlying finite observations" (4% accuracy), a categorically harder task.
  • vs. MathVista/MATH-Vision: These benchmarks test applying mathematical knowledge to visual problems; FractalBench tests abstracting mathematical rules from visual patterns. The former is forward application, the latter inverse inference.
  • vs. GeoGramBench: GeoGramBench observes performance degradation with increasing structural complexity; FractalBench pinpoints branching recursion, rather than recursion in general, as the specific bottleneck.
  • vs. MATHGLANCE: MATHGLANCE finds that models "do not know where to look"; FractalBench reveals a deeper problem: even when the correct pattern is perceived, models fail to infer its generative rule.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The use of fractals as a diagnostic tool for mathematical reasoning is both novel and insightful; the IFS framework provides a rigorous mathematical foundation.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 4 models × 3 prompting strategies × 12 fractal types with in-depth analysis; however, reasoning-specialized models and best-of-N evaluations are absent.
  • Writing Quality: ⭐⭐⭐⭐⭐ — The paper is elegantly structured, flowing seamlessly from mathematical definitions to experimental analysis, with precisely distilled insights.
  • Value: ⭐⭐⭐⭐ — Exposes fundamental limitations in MLLM visual-mathematical reasoning, making an important contribution to understanding the capability boundaries of current models.

Inspiration & Connections

  • The contamination-resistant parameterized design (adjustable recursion depth + color variants) represents a general paradigm for benchmark construction, applicable to any evaluation requiring protection against data leakage (a sketch follows this list).
  • The anti-CoT finding carries important implications for prompting research: tasks requiring precise spatial or numerical output may be fundamentally ill-suited to reason-then-output paradigms.
  • The Kolmogorov complexity perspective provides a new tool for analyzing whether models genuinely understand structure (as opposed to memorizing patterns)—code length serves as a proxy metric for structural comprehension.
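
A hypothetical sketch of the parameterized-generation paradigm from the first bullet: enumerating (fractal, depth, color) cells means no single memorized rendering suffices. Function and parameter names are illustrative, not the paper's API; the 4–12 depth range follows the benchmark's description, while the color set is an assumption.

```python
import itertools

def generate_variants(fractal_ids,
                      depths=range(4, 13),          # 4-12 recursion levels, as in FractalBench
                      colors=("black", "red", "blue", "green")):
    """Yield one test-case spec per (fractal, depth, color) cell.
    Hypothetical helper; names and color set are illustrative."""
    for fid, depth, color in itertools.product(fractal_ids, depths, colors):
        yield {"fractal": fid, "depth": depth, "color": color}
```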