Factuality Matters: When Image Generation and Editing Meet Structured Visuals

Conference: ICLR2026 arXiv: 2510.05091 Code: structvisuals.github.io Area: Image Generation Keywords: structured image generation, image editing, chain-of-thought reasoning, benchmark, diffusion transformer

TL;DR

The first systematic study on the generation and editing of structured images (charts, mathematical figures, diagrams, tables, etc.), contributing a 1.3M-pair code-aligned training dataset with CoT reasoning annotations, a unified VLM+diffusion model architecture, and the StructBench benchmark with 1,700+ samples. The work reveals that reasoning capability is the key bottleneck for current models in handling structured visual content.

Background & Motivation

  • Existing visual generation models (e.g., GPT-Image, FLUX, Bagel) excel at natural image generation but perform poorly on the generation and editing of structured visual content (charts, math figures, diagrams, tables, etc.)
  • Structured images differ fundamentally from natural images: they require composition planning, precise text rendering, and multimodal reasoning to ensure factual fidelity
  • Existing datasets primarily target aesthetic quality or instruction-following for natural images, lacking large-scale, high-quality training data for structured visuals
  • Existing evaluation metrics (e.g., CLIP score, aesthetic score, naive VLM-as-a-judge) are ill-suited for fine-grained factuality evaluation of structured images

Core Problem

How can model capability for structured image generation and editing be systematically improved? Three sub-problems are identified:

  1. Data: How to construct a large-scale, high-quality, precisely annotated structured image dataset?
  2. Model: How to train a unified generation/editing model applicable to both natural and structured images?
  3. Evaluation: How to reliably assess the fine-grained factuality of structured images?

Method

Data Construction (1.3M Pairs)

  • Core Idea: Exploiting the fact that structured images can be rendered from code, approximately 2M drawing programs (Python + LaTeX) are collected, covering six categories: math, charts, puzzles, scientific illustrations, graph structures, and tables
  • Code-aligned image synthesis: Source code is executed to render source images; GPT-5 then generates both code-level and image-level editing instructions; the modified code is rendered to produce target images, forming strictly aligned and verifiable state-transition pairs (see the sketch after this list)
  • Multi-step annotation pipeline: GPT-5 first analyzes salient visual features of the source image, then simultaneously generates image-level and code-level editing instructions — ensuring image instructions reference only visible elements while code instructions specify precise programmatic modifications
  • Post-processing filtering: Samples with rendering failures, edits with no visual difference, and low-information images are removed
  • CoT reasoning annotation: Each T2I sample is paired with a dense caption covering detailed attribute analysis; each editing sample is paired with a three-step reasoning chain (source image analysis → editing instruction interpretation → target image prediction), all generated by GPT-5, providing substantially richer semantic signal than conventional brief instructions
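
Below is a minimal, hypothetical sketch of the code-aligned pair construction and filtering described above. The helper names and the hard-coded edit are illustrative assumptions; in the actual pipeline, GPT-5 proposes the code-level edit and the matching image-level instruction.

```python
# Minimal sketch of code-aligned editing-pair construction (hypothetical helpers).
# A source drawing program is executed to render the source image, an edited copy
# of the code is rendered to produce the target image, and pairs whose rendering
# fails or shows no visual difference are filtered out.
import io

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np


def render(code: str) -> np.ndarray:
    """Execute a matplotlib drawing program and return the rendered image as an array."""
    plt.figure()
    exec(code, {"plt": plt, "np": np})   # run the drawing program
    buf = io.BytesIO()
    plt.savefig(buf, format="png")
    plt.close("all")
    buf.seek(0)
    return plt.imread(buf, format="png")


source_code = "plt.bar(['a', 'b', 'c'], [3, 5, 2], color='steelblue')"

# In the paper, GPT-5 proposes the code-level edit and a matching image-level
# instruction; here the edit is hard-coded purely for illustration.
edited_code = source_code.replace("steelblue", "crimson")
image_instruction = "Change the bar color from blue to red."

source_img, target_img = render(source_code), render(edited_code)

# Post-processing filter: discard pairs with no visible difference.
if source_img.shape == target_img.shape and np.allclose(source_img, target_img):
    raise ValueError("Edit produced no visual change; discard this pair.")

sample = {"instruction": image_instruction, "source": source_img, "target": target_img}
```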

Model Architecture

  • Backbone: FLUX.1 Kontext (diffusion transformer), supporting unified image generation and editing
  • Multimodal enhancement: Qwen2.5-VL-7B is introduced to encode multimodal features and is aligned with FLUX.1 Kontext via a lightweight MLP connector, replacing the original CLIP encoder (a connector sketch follows this list)
  • Design Motivation: Structured image editing relies on high-level semantic understanding (e.g., converting a bar chart to a pie chart requires understanding quantitative ratios); VAE features alone are insufficient as they only capture low-level information; the MLP connector incurs lower training overhead and more stable optimization compared to transformer-based projectors (e.g., MetaQuery)
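
A minimal PyTorch sketch of what such an MLP connector could look like. The layer widths (a 3584-dimensional VLM hidden size, a 4096-dimensional conditioning space) and the two-layer design are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal PyTorch sketch of a lightweight MLP connector mapping Qwen2.5-VL hidden
# states into the diffusion transformer's conditioning space. The dimensions are
# assumed for illustration only.
import torch
import torch.nn as nn


class MLPConnector(nn.Module):
    def __init__(self, vlm_dim: int = 3584, dit_dim: int = 4096, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vlm_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dit_dim),
        )

    def forward(self, vlm_tokens: torch.Tensor) -> torch.Tensor:
        # vlm_tokens: (batch, seq_len, vlm_dim) hidden states from the VLM
        return self.proj(vlm_tokens)  # (batch, seq_len, dit_dim) conditioning tokens


# Stage 1 would freeze the diffusion backbone and train only this module.
connector = MLPConnector()
vlm_features = torch.randn(2, 128, 3584)   # dummy Qwen-VL features
conditioning = connector(vlm_features)     # shape: (2, 128, 4096)
```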

Three-Stage Progressive Training

  1. Stage 1 – Unified Alignment: The diffusion backbone is frozen; only the MLP connector is trained. T5 features are removed and only Qwen-VL features are used, preventing T5 from acting as a shortcut that impedes connector alignment
  2. Stage 2 – Mixed Visual Learning: The diffusion backbone and connector are jointly fine-tuned to inject structured domain knowledge; high-quality natural image data is mixed in to preserve general capability; a mask-based training strategy is introduced to adaptively down-weight losses on background and unchanged regions (a loss-weighting sketch follows this list)
  3. Stage 3 – Reasoning Enhancement: CoT annotations are used as long-context inputs to Qwen-VL to inject explicit reasoning capability; at inference time, the trained model can accept analysis and planning from an external reasoner (GPT-5), enabling inference-time compute scaling
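
A minimal sketch of the mask-based loss weighting idea from Stage 2. The background weight, the pixel-difference mask derivation, and the plain per-pixel MSE are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch of mask-based loss down-weighting for Stage 2: pixels inside the
# edited region get full weight, while the large uniform background and unchanged
# regions are down-weighted. Constants and mask derivation are assumptions.
import torch
import torch.nn.functional as F


def masked_diffusion_loss(pred: torch.Tensor,
                          target: torch.Tensor,
                          edit_mask: torch.Tensor,
                          background_weight: float = 0.1) -> torch.Tensor:
    """pred/target: (B, C, H, W) model output and regression target;
    edit_mask: (B, 1, H, W), 1.0 inside the edited region."""
    per_pixel = F.mse_loss(pred, target, reduction="none").mean(dim=1, keepdim=True)
    weight = background_weight + (1.0 - background_weight) * edit_mask
    return (per_pixel * weight).sum() / weight.sum().clamp_min(1e-8)


# Example: derive a rough edit mask by thresholding source/target pixel differences.
source = torch.rand(1, 3, 64, 64)
target_img = source.clone()
target_img[..., 20:40, 20:40] += 0.5                               # a small edited region
edit_mask = (source - target_img).abs().mean(dim=1, keepdim=True).gt(1e-3).float()

loss = masked_diffusion_loss(torch.randn(1, 3, 64, 64),
                             torch.randn(1, 3, 64, 64),
                             edit_mask)
```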

StructBench Benchmark

  • Scale: 1,714 samples; 32,031 Q&A pairs for editing evaluation; 37,941 Q&A pairs for generation evaluation; covering six categories: Math, Graph, Chart, Puzzle, Science, and Table
  • StructScore metric (an aggregation sketch follows this list):
      • A VLM-based multi-turn Q&A protocol generates fine-grained atomic question–answer pairs from ground-truth images
      • Open-ended answers are elicited from model-generated images, forming [question, predicted answer, ground-truth answer] triples for comparison
      • Editing evaluation decouples visual consistency and instruction-following into two dimensions, combined via weighted scoring (\(0.1 \times \text{consistency} + 0.9 \times \text{instruction-following}\))
      • Atomized question decomposition and Q&A refinement improve Q&A accuracy on ground-truth images from ~80% to >95%
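
A minimal sketch of how the StructScore editing score could be aggregated from atomic Q&A triples, using the weighted combination above. The `QATriple` container and `judge_match` comparison are hypothetical stand-ins for the paper's VLM-based judging.

```python
# Minimal sketch of StructScore aggregation for editing evaluation. Each atomic
# Q&A triple is tagged with the dimension it probes; `judge_match` is a trivial
# stand-in for the VLM-based answer comparison used in the paper.
from dataclasses import dataclass


@dataclass
class QATriple:
    question: str
    predicted: str        # answer elicited from the model-generated image
    ground_truth: str     # answer derived from the ground-truth image
    dimension: str        # "consistency" (unchanged content) or "instruction" (edited content)


def judge_match(predicted: str, ground_truth: str) -> bool:
    # Placeholder for the VLM judge; here a naive string comparison.
    return predicted.strip().lower() == ground_truth.strip().lower()


def struct_score(triples: list[QATriple]) -> float:
    def accuracy(dim: str) -> float:
        subset = [t for t in triples if t.dimension == dim]
        if not subset:
            return 0.0
        return sum(judge_match(t.predicted, t.ground_truth) for t in subset) / len(subset)

    # Weighted combination from the paper: 0.1 * consistency + 0.9 * instruction-following.
    return 0.1 * accuracy("consistency") + 0.9 * accuracy("instruction")


triples = [
    QATriple("What color are the bars?", "red", "red", "instruction"),
    QATriple("How many bars are shown?", "three", "three", "consistency"),
]
print(struct_score(triples))  # 1.0
```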

Key Experimental Results

  • Editing benchmark (StructEditBench): The proposed model achieves 55.98% accuracy, the best result among the compared open- and closed-source models except Nano Banana 2.0 (67.05%); it surpasses Nano Banana (51.57%), GPT-Image (52.20%), and Seedream 4.0 (52.85%)
  • Generation benchmark (StructT2IBench): Among the closed-source models compared, GPT-Image reaches 49.58%; the proposed model achieves 28.80% (T2I is harder, as it requires synthesizing fine-grained attributes from scratch); Nano Banana 2.0 leads all models by a large margin at 92.00%
  • Chart editing breakdown: Models achieve near 50% on color modification (relatively simple), but accuracy drops substantially on chart-type conversion (which requires reasoning about quantitative relationships), revealing reasoning as the core bottleneck
  • Reasoning enhancement: Adding explicit reasoning traces to Bagel improves accuracy from 28.87% to 38.44%, surpassing its native thinking variant Bagel-Think (33.34%), demonstrating that reasoning quality matters more than its form
  • Human alignment: StructScore achieves a Pearson correlation of \(r > 0.9\) with human Elo rankings, far exceeding traditional metrics such as PSNR
  • Evaluation coverage: Comprehensive comparison across 15 models, including 3 closed-source and 12 open-source systems

Highlights & Insights

  • Systematic contribution: Data, model, and evaluation are addressed together, constituting the first complete study in structured image generation and editing
  • Code-aligned data: Leveraging executable code to construct precisely verifiable editing pairs is more reliable than conventional synthesis methods
  • StructScore design: Atomic Q&A decomposition, dimension-decoupled weighting, and a refinement pipeline effectively mitigate VLM hallucination
  • Validation of reasoning importance: Experiments consistently demonstrate that inference-time reasoning yields gains on structured image tasks, independent of model architecture
  • Mask-based training strategy: Adaptive loss weighting tailored to the pixel statistics of structured images (large uniform backgrounds, small edited regions)

Limitations & Future Work

  • T2I generation performance remains substantially below closed-source models (28.80% vs. 49.58%), and the lead on editing, while real, is modest
  • The external reasoner relies on GPT-5, incurring high inference costs; lightweight alternatives have not been explored
  • Data construction depends heavily on GPT-5 for annotation and filtering, leading to high cost and questionable reproducibility
  • Only six categories of structured images are currently covered; domains such as molecular formulas, musical scores, and educational videos are not addressed
  • Although the 1.3M training samples are sizable, they are primarily sourced from existing code repositories, potentially limiting diversity
  • StructScore still depends on GPT-5 as the evaluator, introducing a risk of circular dependency
  • Dynamic resolution sampling is limited to approximately 512×512, which may be insufficient for fine-grained detail rendering in high-resolution structured images

Comparison with Related Work

| Dimension | Ours | Conventional T2I/Editing Work |
| --- | --- | --- |
| Target domain | Structured visuals (charts, formulas, diagrams) | Natural images |
| Data construction | Code-aligned + code-level editing → precise and verifiable | Synthetic instructions + model-generated → approximate alignment |
| Reasoning annotation | Three-step CoT reasoning chain + dense caption | Brief instructions (e.g., "add tree right") |
| Evaluation | Atomic Q&A + dimension-decoupled weighting | CLIP/DINO score or naive VLM judge |
| Model design | VLM (Qwen-VL) + diffusion (FLUX Kontext) + MLP connector | Single diffusion model or unified autoregressive model |

The comparison with Bagel-Think is particularly noteworthy: the external reasoner approach (38.44%) outperforms Bagel's built-in thinking (33.34%), indicating that the quality and design of reasoning traces matter more than simply integrating a thinking mode.

Compared to heavy transformer-based projector approaches such as MetaQuery, the proposed lightweight MLP connector reduces training overhead. Compared to dedicated editing models such as Step1X-Edit (34.11%) and Qwen-Edit (38.12%), the unified model achieves superior performance on structured editing (55.98%), validating the combined advantage of multimodal reasoning enhancement and domain-specific data.

The work carries several broader implications:

  • Structured visuals as reasoning-intensive tasks: This finding has important implications for all generation scenarios requiring precise factuality, such as automated scientific figure generation and data visualization editing
  • Code as an intermediate representation: The paradigm of using executable code to construct precise training data can be extended to other generation tasks requiring exact control (e.g., CAD drawings, circuit diagrams, flowcharts)
  • Value of inference-time scaling in visual generation: Analogous to test-time compute scaling in LLMs, visual generation can also benefit significantly from increased inference-time computation, representing an important direction for unified multimodal models
  • Evaluation methodology innovation: The atomic Q&A evaluation protocol is transferable to other visual tasks requiring fine-grained factuality assessment
  • Data-driven over architecture-driven: Experiments suggest that in the structured visual domain, data scale and quality matter more than architectural choices, contrasting with the community's prevailing emphasis on architectural innovation
  • Advantage of unified models: A unified visual understanding + generation architecture (VLM + diffusion) outperforms single-paradigm models on structured tasks, suggesting a direction for future multimodal foundation model development

Rating

  • Novelty: 8/10 — First systematic study on structured image generation/editing, with a well-defined problem formulation and a novel data construction approach
  • Experimental Thoroughness: 9/10 — Comprehensive comparison across 15 models, human alignment study, and thorough ablation experiments
  • Writing Quality: 8/10 — Clear structure, rich figures and tables, well-motivated problem statement
  • Value: 8/10 — Open-sourced dataset, model, and benchmark with substantial community impact