Factuality Matters: When Image Generation and Editing Meet Structured Visuals

Conference: ICLR2026 arXiv: 2510.05091 Code: structvisuals.github.io Area: Image Generation Keywords: structured image generation, image editing, chain-of-thought reasoning, benchmark, diffusion transformer

TL;DR

The first systematic study on the generation and editing of structured images (charts, mathematical figures, diagrams, tables, etc.), contributing a 1.3M-pair code-aligned training dataset with CoT reasoning annotations, a unified VLM+diffusion model architecture, and the StructBench benchmark with 1,700+ samples. The work reveals that reasoning capability is the key bottleneck for current models in handling structured visual content.

Background & Motivation

  • Existing visual generation models (e.g., GPT-Image, FLUX, Bagel) excel at natural image generation but perform poorly on the generation and editing of structured visual content (charts, math figures, diagrams, tables, etc.)
  • Structured images differ fundamentally from natural images: they require composition planning, precise text rendering, and multimodal reasoning to ensure factual fidelity
  • Existing datasets primarily target aesthetic quality or instruction-following for natural images, lacking large-scale, high-quality training data for structured visuals
  • Existing evaluation metrics (e.g., CLIP score, aesthetic score, naive VLM-as-a-judge) are ill-suited for fine-grained factuality evaluation of structured images

Core Problem

How can model capability for structured image generation and editing be systematically improved? Three sub-problems are identified:

  1. Data: How to construct a large-scale, high-quality, precisely annotated structured image dataset?
  2. Model: How to train a unified generation/editing model applicable to both natural and structured images?
  3. Evaluation: How to reliably assess the fine-grained factuality of structured images?

Method

Data Construction (1.3M Pairs)

  • Core Idea: Exploiting the fact that structured images can be rendered from code, approximately 2M drawing programs (Python + LaTeX) are collected, covering six categories: math, charts, puzzles, scientific illustrations, graph structures, and tables
  • Code-aligned image synthesis: Source code is executed to render source images; GPT-5 then generates both code-level and image-level editing instructions; the modified code is rendered to produce target images, forming strictly aligned and verifiable state-transition pairs (see the sketch after this list)
  • Multi-step annotation pipeline: GPT-5 first analyzes salient visual features of the source image, then simultaneously generates image-level and code-level editing instructions — ensuring image instructions reference only visible elements while code instructions specify precise programmatic modifications
  • Post-processing filtering: Samples with rendering failures, edits with no visual difference, and low-information images are removed
  • CoT reasoning annotation: Each T2I sample is paired with a dense caption covering detailed attribute analysis; each editing sample is paired with a three-step reasoning chain (source image analysis → editing instruction interpretation → target image prediction), all generated by GPT-5, providing substantially richer semantic signal than conventional brief instructions
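
Below is a minimal, hypothetical sketch of the code-aligned pair construction and filtering described above. The helper names and the hard-coded edit are illustrative assumptions; in the actual pipeline, GPT-5 proposes the code-level edit and the matching image-level instruction.

```python
# Minimal sketch of code-aligned editing-pair construction (hypothetical helpers).
# A source drawing program is executed to render the source image, an edited copy
# of the code is rendered to produce the target image, and pairs whose rendering
# fails or shows no visual difference are filtered out.
import io

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np


def render(code: str) -> np.ndarray:
    """Execute a matplotlib drawing program and return the rendered image as an array."""
    plt.figure()
    exec(code, {"plt": plt, "np": np})   # run the drawing program
    buf = io.BytesIO()
    plt.savefig(buf, format="png")
    plt.close("all")
    buf.seek(0)
    return plt.imread(buf, format="png")


source_code = "plt.bar(['a', 'b', 'c'], [3, 5, 2], color='steelblue')"

# In the paper, GPT-5 proposes the code-level edit and a matching image-level
# instruction; here the edit is hard-coded purely for illustration.
edited_code = source_code.replace("steelblue", "crimson")
image_instruction = "Change the bar color from blue to red."

source_img, target_img = render(source_code), render(edited_code)

# Post-processing filter: discard pairs with no visible difference.
if source_img.shape == target_img.shape and np.allclose(source_img, target_img):
    raise ValueError("Edit produced no visual change; discard this pair.")

sample = {"instruction": image_instruction, "source": source_img, "target": target_img}
```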

Model Architecture

  • Backbone: FLUX.1 Kontext (diffusion transformer), supporting unified image generation and editing
  • Multimodal enhancement: Qwen2.5-VL-7B is introduced to encode multimodal features and is aligned with FLUX.1 Kontext via a lightweight MLP connector, replacing the original CLIP encoder (a connector sketch follows this list)
  • Design Motivation: Structured image editing relies on high-level semantic understanding (e.g., converting a bar chart to a pie chart requires understanding quantitative ratios); VAE features alone are insufficient as they only capture low-level information; the MLP connector incurs lower training overhead and more stable optimization compared to transformer-based projectors (e.g., MetaQuery)
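
A minimal PyTorch sketch of what such an MLP connector could look like. The layer widths (a 3584-dimensional VLM hidden size, a 4096-dimensional conditioning space) and the two-layer design are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal PyTorch sketch of a lightweight MLP connector mapping Qwen2.5-VL hidden
# states into the diffusion transformer's conditioning space. The dimensions are
# assumed for illustration only.
import torch
import torch.nn as nn


class MLPConnector(nn.Module):
    def __init__(self, vlm_dim: int = 3584, dit_dim: int = 4096, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vlm_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dit_dim),
        )

    def forward(self, vlm_tokens: torch.Tensor) -> torch.Tensor:
        # vlm_tokens: (batch, seq_len, vlm_dim) hidden states from the VLM
        return self.proj(vlm_tokens)  # (batch, seq_len, dit_dim) conditioning tokens


# Stage 1 would freeze the diffusion backbone and train only this module.
connector = MLPConnector()
vlm_features = torch.randn(2, 128, 3584)   # dummy Qwen-VL features
conditioning = connector(vlm_features)     # shape: (2, 128, 4096)
```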

Three-Stage Progressive Training

  1. Stage 1 – Unified Alignment: The diffusion backbone is frozen; only the MLP connector is trained. T5 features are removed and only Qwen-VL features are used, preventing T5 from acting as a shortcut that impedes connector alignment
  2. Stage 2 – Mixed Visual Learning: The diffusion backbone and connector are jointly fine-tuned to inject structured domain knowledge; high-quality natural image data is mixed in to preserve general capability; a mask-based training strategy is introduced to adaptively down-weight losses on background and unchanged regions (a loss-weighting sketch follows this list)
  3. Stage 3 – Reasoning Enhancement: CoT annotations are used as long-context inputs to Qwen-VL to inject explicit reasoning capability; at inference time, the trained model can accept analysis and planning from an external reasoner (GPT-5), enabling inference-time compute scaling
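
A minimal sketch of the mask-based loss weighting idea from Stage 2. The background weight, the pixel-difference mask derivation, and the plain per-pixel MSE are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch of mask-based loss down-weighting for Stage 2: pixels inside the
# edited region get full weight, while the large uniform background and unchanged
# regions are down-weighted. Constants and mask derivation are assumptions.
import torch
import torch.nn.functional as F


def masked_diffusion_loss(pred: torch.Tensor,
                          target: torch.Tensor,
                          edit_mask: torch.Tensor,
                          background_weight: float = 0.1) -> torch.Tensor:
    """pred/target: (B, C, H, W) model output and regression target;
    edit_mask: (B, 1, H, W), 1.0 inside the edited region."""
    per_pixel = F.mse_loss(pred, target, reduction="none").mean(dim=1, keepdim=True)
    weight = background_weight + (1.0 - background_weight) * edit_mask
    return (per_pixel * weight).sum() / weight.sum().clamp_min(1e-8)


# Example: derive a rough edit mask by thresholding source/target pixel differences.
source = torch.rand(1, 3, 64, 64)
target_img = source.clone()
target_img[..., 20:40, 20:40] += 0.5                               # a small edited region
edit_mask = (source - target_img).abs().mean(dim=1, keepdim=True).gt(1e-3).float()

loss = masked_diffusion_loss(torch.randn(1, 3, 64, 64),
                             torch.randn(1, 3, 64, 64),
                             edit_mask)
```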

StructBench Benchmark

  • Scale: 1,714 samples; 32,031 Q&A pairs for editing evaluation; 37,941 Q&A pairs for generation evaluation; covering six categories: Math, Graph, Chart, Puzzle, Science, and Table
  • StructScore metric (an aggregation sketch follows this list):
      • A VLM-based multi-turn Q&A protocol generates fine-grained atomic question–answer pairs from ground-truth images
      • Open-ended answers are elicited from model-generated images, forming [question, predicted answer, ground-truth answer] triples for comparison
      • Editing evaluation decouples visual consistency and instruction-following into two dimensions, combined via weighted scoring (\(0.1 \times \text{consistency} + 0.9 \times \text{instruction-following}\))
      • Atomized question decomposition and Q&A refinement improve Q&A accuracy on ground-truth images from ~80% to >95%
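
A minimal sketch of how the StructScore editing score could be aggregated from atomic Q&A triples, using the weighted combination above. The `QATriple` container and `judge_match` comparison are hypothetical stand-ins for the paper's VLM-based judging.

```python
# Minimal sketch of StructScore aggregation for editing evaluation. Each atomic
# Q&A triple is tagged with the dimension it probes; `judge_match` is a trivial
# stand-in for the VLM-based answer comparison used in the paper.
from dataclasses import dataclass


@dataclass
class QATriple:
    question: str
    predicted: str        # answer elicited from the model-generated image
    ground_truth: str     # answer derived from the ground-truth image
    dimension: str        # "consistency" (unchanged content) or "instruction" (edited content)


def judge_match(predicted: str, ground_truth: str) -> bool:
    # Placeholder for the VLM judge; here a naive string comparison.
    return predicted.strip().lower() == ground_truth.strip().lower()


def struct_score(triples: list[QATriple]) -> float:
    def accuracy(dim: str) -> float:
        subset = [t for t in triples if t.dimension == dim]
        if not subset:
            return 0.0
        return sum(judge_match(t.predicted, t.ground_truth) for t in subset) / len(subset)

    # Weighted combination from the paper: 0.1 * consistency + 0.9 * instruction-following.
    return 0.1 * accuracy("consistency") + 0.9 * accuracy("instruction")


triples = [
    QATriple("What color are the bars?", "red", "red", "instruction"),
    QATriple("How many bars are shown?", "three", "three", "consistency"),
]
print(struct_score(triples))  # 1.0
```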

Key Experimental Results

  • Editing benchmark (StructEditBench): The proposed model achieves 55.98% accuracy, the best result among the compared open- and closed-source models except Nano Banana 2.0 (67.05%); it surpasses Nano Banana (51.57%), GPT-Image (52.20%), and Seedream 4.0 (52.85%)
  • Generation benchmark (StructT2IBench): Among the closed-source models compared, GPT-Image reaches 49.58%; the proposed model achieves 28.80% (T2I is harder, as it requires synthesizing fine-grained attributes from scratch); Nano Banana 2.0 leads all models by a large margin at 92.00%
  • Chart editing breakdown: Models achieve near 50% on color modification (relatively simple), but accuracy drops substantially on chart-type conversion (which requires reasoning about quantitative relationships), revealing reasoning as the core bottleneck
  • Reasoning enhancement: Adding explicit reasoning traces to Bagel improves accuracy from 28.87% to 38.44%, surpassing its native thinking variant Bagel-Think (33.34%), demonstrating that reasoning quality matters more than its form
  • Human alignment: StructScore achieves a Pearson correlation of \(r > 0.9\) with human Elo rankings, far exceeding traditional metrics such as PSNR
  • Evaluation coverage: Comprehensive comparison across 15 models, including 3 closed-source and 12 open-source systems

Highlights & Insights

  • Systematic contribution: Data, model, and evaluation are addressed together, constituting the first complete study in structured image generation and editing
  • Code-aligned data: Leveraging executable code to construct precisely verifiable editing pairs is more reliable than conventional synthesis methods
  • StructScore design: Atomic Q&A decomposition, dimension-decoupled weighting, and a refinement pipeline effectively mitigate VLM hallucination
  • Validation of reasoning importance: Experiments consistently demonstrate that inference-time reasoning yields gains on structured image tasks, independent of model architecture
  • Mask-based training strategy: Adaptive loss weighting tailored to the pixel statistics of structured images (large uniform backgrounds, small edited regions)

Limitations & Future Work

  • T2I generation performance remains substantially below closed-source models (28.80% vs. 49.58%), and the lead on editing, while real, is modest
  • The external reasoner relies on GPT-5, incurring high inference costs; lightweight alternatives have not been explored
  • Data construction depends heavily on GPT-5 for annotation and filtering, leading to high cost and questionable reproducibility
  • Only six categories of structured images are currently covered; domains such as molecular formulas, musical scores, and educational videos are not addressed
  • Although the 1.3M training samples are sizable, they are primarily sourced from existing code repositories, potentially limiting diversity
  • StructScore still depends on GPT-5 as the evaluator, introducing a risk of circular dependency
  • Dynamic resolution sampling is limited to approximately 512×512, which may be insufficient for fine-grained detail rendering in high-resolution structured images

Comparison with Related Work

| Dimension | Ours | Conventional T2I/Editing Work |
| --- | --- | --- |
| Target domain | Structured visuals (charts, formulas, diagrams) | Natural images |
| Data construction | Code-aligned + code-level editing → precise and verifiable | Synthetic instructions + model-generated → approximate alignment |
| Reasoning annotation | Three-step CoT reasoning chain + dense caption | Brief instructions (e.g., "add tree right") |
| Evaluation | Atomic Q&A + dimension-decoupled weighting | CLIP/DINO score or naive VLM judge |
| Model design | VLM (Qwen-VL) + diffusion (FLUX Kontext) + MLP connector | Single diffusion model or unified autoregressive model |

The comparison with Bagel-Think is particularly noteworthy: the external reasoner approach (38.44%) outperforms Bagel's built-in thinking (33.34%), indicating that the quality and design of reasoning traces matter more than simply integrating a thinking mode.

Compared to heavy transformer-based projector approaches such as MetaQuery, the proposed lightweight MLP connector reduces training overhead. Compared to dedicated editing models such as Step1X-Edit (34.11%) and Qwen-Edit (38.12%), the unified model achieves superior performance on structured editing (55.98%), validating the combined advantage of multimodal reasoning enhancement and domain-specific data.

The work carries several broader implications:

  • Structured visuals as reasoning-intensive tasks: This finding has important implications for all generation scenarios requiring precise factuality, such as automated scientific figure generation and data visualization editing
  • Code as an intermediate representation: The paradigm of using executable code to construct precise training data can be extended to other generation tasks requiring exact control (e.g., CAD drawings, circuit diagrams, flowcharts)
  • Value of inference-time scaling in visual generation: Analogous to test-time compute scaling in LLMs, visual generation can also benefit significantly from increased inference-time computation, representing an important direction for unified multimodal models
  • Evaluation methodology innovation: The atomic Q&A evaluation protocol is transferable to other visual tasks requiring fine-grained factuality assessment
  • Data-driven over architecture-driven: Experiments suggest that in the structured visual domain, data scale and quality matter more than architectural choices, contrasting with the community's prevailing emphasis on architectural innovation
  • Advantage of unified models: A unified visual understanding + generation architecture (VLM + diffusion) outperforms single-paradigm models on structured tasks, suggesting a direction for future multimodal foundation model development

Rating

  • Novelty: 8/10 — First systematic study on structured image generation/editing, with a well-defined problem formulation and a novel data construction approach
  • Experimental Thoroughness: 9/10 — Comprehensive comparison across 15 models, human alignment study, and thorough ablation experiments
  • Writing Quality: 8/10 — Clear structure, rich figures and tables, well-motivated problem statement
  • Value: 8/10 — Open-sourced dataset, model, and benchmark with substantial community impact