Factuality Matters: When Image Generation and Editing Meet Structured Visuals¶
Conference: ICLR 2026 · arXiv: 2510.05091 · Code: structvisuals.github.io · Area: Image Generation · Keywords: structured image generation, image editing, chain-of-thought reasoning, benchmark, diffusion transformer
TL;DR¶
The first systematic study on the generation and editing of structured images (charts, mathematical figures, diagrams, tables, etc.), contributing a 1.3M-pair code-aligned training dataset with CoT reasoning annotations, a unified VLM+diffusion model architecture, and the StructBench benchmark with 1,700+ samples. The work reveals that reasoning capability is the key bottleneck for current models in handling structured visual content.
Background & Motivation¶
- Existing visual generation models (e.g., GPT-Image, FLUX, Bagel) excel at natural image generation but perform poorly on the generation and editing of structured visual content (charts, math figures, diagrams, tables, etc.)
- Structured images differ fundamentally from natural images: they require composition planning, precise text rendering, and multimodal reasoning to ensure factual fidelity
- Existing datasets primarily target aesthetic quality or instruction-following for natural images, lacking large-scale, high-quality training data for structured visuals
- Existing evaluation metrics (e.g., CLIP score, aesthetic score, naive VLM-as-a-judge) are ill-suited for fine-grained factuality evaluation of structured images
Core Problem¶
How can model capability for structured image generation and editing be systematically improved? Three sub-problems are identified:

1. Data: How to construct a large-scale, high-quality, precisely annotated structured image dataset?
2. Model: How to train a unified generation/editing model applicable to both natural and structured images?
3. Evaluation: How to reliably assess the fine-grained factuality of structured images?
Method¶
Data Construction (1.3M Pairs)¶
- Core Idea: Exploiting the fact that structured images can be rendered from code, approximately 2M drawing programs (Python + LaTeX) are collected, covering six categories: math, charts, puzzles, scientific illustrations, graph structures, and tables
- Code-aligned image synthesis: Source code is executed to render source images; GPT-5 then generates both code-level and image-level editing instructions; modified code is rendered to produce target images, forming strictly aligned and verifiable state-transition pairs
- Multi-step annotation pipeline: GPT-5 first analyzes salient visual features of the source image, then simultaneously generates image-level and code-level editing instructions — ensuring image instructions reference only visible elements while code instructions specify precise programmatic modifications
- Post-processing filtering: Samples with rendering failures, edits with no visual difference, and low-information images are removed
- CoT reasoning annotation: Each T2I sample is paired with a dense caption covering detailed attribute analysis; each editing sample is paired with a three-step reasoning chain (source image analysis → editing instruction interpretation → target image prediction), all generated by GPT-5, providing substantially richer semantic signal than conventional brief instructions
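The render-edit-render loop above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: `render` stands in for a real matplotlib/LaTeX renderer (here it just hashes the program's textual output), and the `recolor` edit stands in for a GPT-5-generated code-level instruction; all names are hypothetical.

```python
import hashlib

def render(program: str) -> bytes:
    """Execute a drawing program and return its rendered output.
    Rendering is simulated: the program defines draw() -> str; the
    real pipeline would run matplotlib/LaTeX code and emit a PNG."""
    scope: dict = {}
    exec(program, scope)
    return hashlib.sha256(scope["draw"]().encode()).digest()

def make_pair(src_code: str, edit):
    """Build one code-aligned editing pair: render the source, apply a
    code-level edit, render the target, and keep the pair only if the
    edit yields a visible difference (the paper's filtering step)."""
    tgt_code = edit(src_code)
    src_img, tgt_img = render(src_code), render(tgt_code)
    if src_img == tgt_img:  # edit with no visual effect -> discard
        return None
    return {"source_code": src_code, "target_code": tgt_code,
            "source_img": src_img, "target_img": tgt_img}

SRC = "def draw():\n    return 'bar chart | color=blue | bars=[3, 5, 2]'\n"
recolor = lambda code: code.replace("color=blue", "color=red")
pair = make_pair(SRC, recolor)
```

Because source and target are both produced by executing code, alignment is verifiable by construction rather than approximated by a model.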
Model Architecture¶
- Backbone: FLUX.1 Kontext (diffusion transformer), supporting unified image generation and editing
- Multimodal enhancement: Qwen2.5-VL-7B is introduced to encode multimodal features and aligned with FLUX.1 Kontext via a lightweight MLP connector, replacing the original CLIP encoder
- Design Motivation: Structured image editing relies on high-level semantic understanding (e.g., converting a bar chart to a pie chart requires understanding quantitative ratios); VAE features alone are insufficient as they only capture low-level information; the MLP connector incurs lower training overhead and more stable optimization compared to transformer-based projectors (e.g., MetaQuery)
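The connector idea can be illustrated with a toy pure-Python sketch (in place of the actual PyTorch modules). The two-layer Linear-GELU-Linear shape and all dimensions here are assumptions for illustration, not the paper's exact configuration; the key property is that each Qwen-VL token is projected into the diffusion transformer's conditioning width while the token count is preserved.

```python
import math, random

random.seed(0)
VLM_DIM, HID, DIT_DIM = 8, 16, 6  # toy sizes; real widths are in the thousands

def rand_mat(n, m):
    return [[random.gauss(0, 0.02) for _ in range(m)] for _ in range(n)]

W1, b1 = rand_mat(VLM_DIM, HID), [0.0] * HID
W2, b2 = rand_mat(HID, DIT_DIM), [0.0] * DIT_DIM

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(x, W, b):
    """y = xW + b for one token vector; W has len(x) rows, len(b) cols."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(zip(*W), b)]

def connector(vlm_tokens):
    """MLP connector: map each VLM token to the DiT conditioning width."""
    return [linear([gelu(h) for h in linear(t, W1, b1)], W2, b2)
            for t in vlm_tokens]

tokens = [[random.gauss(0, 1) for _ in range(VLM_DIM)] for _ in range(5)]
out = connector(tokens)
```

A per-token MLP like this adds far fewer parameters and optimization pathways than a transformer-based projector, which is the stability/overhead argument made against MetaQuery-style designs.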
Three-Stage Progressive Training¶
- Stage 1 – Unified Alignment: The diffusion backbone is frozen and only the MLP connector is trained; T5 features are removed so that only Qwen-VL features condition the model, preventing T5 from acting as a shortcut that impedes connector alignment
- Stage 2 – Mixed Visual Learning: The diffusion backbone and connector are jointly fine-tuned to inject structured domain knowledge; high-quality natural image data is mixed in to preserve general capability; a mask-based training strategy is introduced to adaptively down-weight losses on background and unchanged regions
- Stage 3 – Reasoning Enhancement: CoT annotations are used as long-context inputs to Qwen-VL to inject explicit reasoning capability; at inference time, the trained model can accept analysis and planning from an external reasoner (GPT-5), enabling inference-time compute scaling
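The mask-based strategy from Stage 2 can be sketched as a weighted per-pixel loss. This is a minimal illustration under assumed details: the `bg_weight` value and the exact weighting scheme are hypothetical, but the intent matches the paper's description of down-weighting background and unchanged regions so that the large uniform areas typical of structured images do not dominate the objective.

```python
def masked_diffusion_loss(per_pixel_loss, change_mask, bg_weight=0.1):
    """Weighted mean of a per-pixel loss: edited pixels (mask=1) keep
    full weight; background/unchanged pixels (mask=0) are down-weighted
    by bg_weight (an assumed hyperparameter)."""
    weights = [1.0 if m else bg_weight for m in change_mask]
    return sum(w * l for w, l in zip(weights, per_pixel_loss)) / sum(weights)

# toy example: an edit that touches 2 of 4 pixels
loss = masked_diffusion_loss([1.0, 1.0, 0.0, 0.0], [1, 1, 0, 0])
```

With uniform weighting the two untouched pixels would pull the mean down; here they contribute only a tenth as much, keeping the gradient signal focused on the edited region.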
StructBench Benchmark¶
- Scale: 1,714 samples; 32,031 Q&A pairs for editing evaluation; 37,941 Q&A pairs for generation evaluation; covering six categories: Math, Graph, Chart, Puzzle, Science, and Table
- StructScore metric:
    - A VLM-based multi-turn Q&A protocol that generates fine-grained atomic question–answer pairs from ground-truth images
    - Open-ended answers are elicited from model-generated images, forming [question, predicted answer, ground-truth answer] triples for comparison
    - Editing evaluation decouples visual consistency and instruction-following into two dimensions, combined via weighted scoring (\(0.1 \times \text{consistency} + 0.9 \times \text{instruction-following}\))
    - Atomized question decomposition and Q&A refinement raise the protocol's Q&A accuracy on ground-truth images from ~80% to >95%
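The scoring step of this protocol reduces to a simple computation, sketched below. Exact string matching stands in for the VLM judge that compares open-ended answers in the actual benchmark, and the example Q&A triples are invented; the 0.1/0.9 weighting is the one stated above.

```python
def qa_accuracy(triples):
    """Fraction of (question, predicted answer, ground-truth answer)
    triples judged correct. The real protocol uses a VLM judge;
    normalized exact match is a simplification."""
    hits = sum(pred.strip().lower() == gt.strip().lower()
               for _, pred, gt in triples)
    return hits / len(triples)

def struct_score_editing(consistency_triples, instruction_triples):
    """Editing score = 0.1 * consistency + 0.9 * instruction-following."""
    return (0.1 * qa_accuracy(consistency_triples)
            + 0.9 * qa_accuracy(instruction_triples))

# hypothetical atomic Q&A triples for one edited image
consistency = [("axis title?", "Sales", "Sales"),
               ("background color?", "white", "white")]
instruction = [("bar color after edit?", "red", "red"),
               ("legend position after edit?", "left", "top")]
score = struct_score_editing(consistency, instruction)  # 0.1*1.0 + 0.9*0.5
```

The heavy weight on instruction-following reflects that an edit which perfectly preserves the source but ignores the instruction should score poorly.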
Key Experimental Results¶
- Editing benchmark (StructEditBench): The proposed model ranks first among the open- and closed-source models in the original comparison, with 55.98% accuracy, surpassing Nano Banana (51.57%), GPT-Image (52.20%), and Seedream 4.0 (52.85%); the newer Nano Banana 2.0 achieves the highest score at 67.05%
- Generation benchmark (StructT2IBench): Among the closed-source models in the original comparison, GPT-Image leads at 49.58%; the proposed model achieves 28.80% (T2I is harder than editing, as fine-grained attributes must be synthesized from scratch); the newer Nano Banana 2.0 leads all models by a large margin at 92.00%
- Chart editing breakdown: Models achieve near 50% on color modification (relatively simple), but accuracy drops substantially on chart-type conversion (which requires reasoning about quantitative relationships), revealing reasoning as the core bottleneck
- Reasoning enhancement: Adding explicit reasoning traces to Bagel improves accuracy from 28.87% to 38.44%, surpassing its native thinking variant Bagel-Think (33.34%), demonstrating that reasoning quality matters more than its form
- Human alignment: StructScore achieves a Pearson correlation of \(r > 0.9\) with human Elo rankings, far exceeding traditional metrics such as PSNR
- Evaluation coverage: Comprehensive comparison across 15 models, including 3 closed-source and 12 open-source systems
Highlights & Insights¶
- Systematic contribution: Data, model, and evaluation are addressed together, constituting the first end-to-end study of structured image generation and editing
- Code-aligned data: Leveraging executable code to construct precisely verifiable editing pairs is more reliable than conventional synthesis methods
- StructScore design: Atomic Q&A decomposition, dimension-decoupled weighting, and a refinement pipeline effectively mitigate VLM hallucination
- Validation of reasoning importance: Experiments consistently demonstrate that inference-time reasoning yields gains on structured image tasks, independent of model architecture
- Mask-based training strategy: Adaptive loss weighting tailored to the pixel statistics of structured images (large uniform backgrounds, small edited regions)
Limitations & Future Work¶
- T2I generation performance remains substantially below the best closed-source models (28.80% vs. 49.58%); the lead on editing is real but modest
- The external reasoner relies on GPT-5, incurring high inference costs; lightweight alternatives have not been explored
- Data construction depends heavily on GPT-5 for annotation and filtering, leading to high cost and questionable reproducibility
- Only six categories of structured images are currently covered; domains such as molecular formulas, musical scores, and educational videos are not addressed
- Although the 1.3M training samples are sizable, they are primarily sourced from existing code repositories, potentially limiting diversity
- StructScore still depends on GPT-5 as the evaluator, introducing a risk of circular dependency
- Dynamic resolution sampling is limited to approximately 512×512, which may be insufficient for fine-grained detail rendering in high-resolution structured images
Related Work & Insights¶
| Dimension | Ours | Conventional T2I/Editing Work |
|---|---|---|
| Target domain | Structured visuals (charts, formulas, diagrams) | Natural images |
| Data construction | Code-aligned + code-level editing → precise and verifiable | Synthetic instructions + model-generated → approximate alignment |
| Reasoning annotation | Three-step CoT reasoning chain + dense caption | Brief instructions (e.g., "add tree right") |
| Evaluation | Atomic Q&A + dimension-decoupled weighting | CLIP/DINO score or naive VLM judge |
| Model design | VLM (Qwen-VL) + diffusion (FLUX Kontext) + MLP connector | Single diffusion model or unified autoregressive model |
The comparison with Bagel-Think is particularly noteworthy: the external reasoner approach (38.44%) outperforms Bagel's built-in thinking (33.34%), indicating that the quality and design of reasoning traces matter more than simply integrating a thinking mode.
Compared to heavy transformer-based projector approaches such as MetaQuery, the proposed lightweight MLP connector reduces training overhead. Compared to dedicated editing models such as Step1X-Edit (34.11%) and Qwen-Edit (38.12%), the unified model achieves superior performance on structured editing (55.98%), validating the combined advantage of multimodal reasoning enhancement and domain-specific data.
The work carries several broader implications:

- Structured visuals as reasoning-intensive tasks: This finding has important implications for all generation scenarios requiring precise factuality, such as automated scientific figure generation and data visualization editing
- Code as an intermediate representation: The paradigm of using executable code to construct precise training data can be extended to other generation tasks requiring exact control (e.g., CAD drawings, circuit diagrams, flowcharts)
- Value of inference-time scaling in visual generation: Analogous to test-time compute scaling in LLMs, visual generation can also benefit significantly from increased inference-time computation, representing an important direction for unified multimodal models
- Evaluation methodology innovation: The atomic Q&A evaluation protocol is transferable to other visual tasks requiring fine-grained factuality assessment
- Data-driven over architecture-driven: Experiments suggest that in the structured visual domain, data scale and quality matter more than architectural choices, contrasting with the community's prevailing emphasis on architectural innovation
- Advantage of unified models: A unified visual understanding + generation architecture (VLM + diffusion) outperforms single-paradigm models on structured tasks, suggesting a direction for future multimodal foundation model development
Rating¶
- Novelty: 8/10 — First systematic study on structured image generation/editing, with a well-defined problem formulation and a novel data construction approach
- Experimental Thoroughness: 9/10 — Comprehensive comparison across 15 models, human alignment study, and thorough ablation experiments
- Writing Quality: 8/10 — Clear structure, rich figures and tables, well-motivated problem statement
- Value: 8/10 — Open-sourced dataset, model, and benchmark with substantial community impact