Science-T2I: Addressing Scientific Illusions in Image Synthesis¶

Conference: CVPR 2025
arXiv: 2504.13129
Code: https://github.com/Jialuo-Li/Science-T2I
Area: Image Generation
Keywords: Text-to-Image Generation, Scientific Reasoning, Reward Model, Alignment, Benchmark

TL;DR¶

Science-T2I constructs a benchmark of 20k+ adversarial image pairs covering 16 scientific domains and reveals systematic deficiencies of current image generation models in implicit scientific reasoning (every evaluated model has a total score below 50/100). It proposes the SciScore reward model and a two-stage alignment framework (SFT + OFT) that improves the scientific reasoning capability of FLUX.1[dev] by over 50% on Science-T2I-S.

Background & Motivation¶

Background: Current text-to-image (T2I) generation models (e.g., FLUX, SDXL) have made remarkable progress in visual fidelity and can generate high-resolution, highly aesthetic images. Standard evaluation metrics such as FID also continue to improve.

Limitations of Prior Work: Although visually realistic, the images generated by these models are often scientifically implausible. For example, given "an unripe apple," the model frequently generates a red apple (based on visual prototype memory) instead of a green apple (based on scientific knowledge). This exposes a fundamental gap between "visual realism" and "physical/scientific correctness."

Key Challenge: Training data rarely pairs scientific concepts with their correct visual representations, and standard evaluation protocols do not test whether a model understands the scientific principles behind a prompt. The problem is not that the model cannot render the correct scene (explicit prompts score about 35 points higher); it is that the model fails to infer the correct visual outcome from implicit scientific cues.

Goal: (1) To construct a systematic benchmark for scientific image synthesis; (2) To develop a reward model capable of capturing fine-grained scientific phenomena; (3) To propose an effective alignment framework to inject scientific knowledge into generative models.

Key Insight: To decouple the model's compositional rendering ability from its scientific reasoning ability through a three-tier prompt structure of "implicit-explicit-superficial"—where explicit prompts measure the upper bound of rendering, implicit prompts measure the reasoning capability, and superficial prompts provide hard negative examples.

Core Idea: Finetune CLIP-H using expert-annotated adversarial image pairs to obtain the SciScore reward model, and then employ SFT coupled with SciScore-based online finetuning (OFT with entity masking) to inject scientific reasoning capabilities into FLUX.

Method¶

Overall Architecture¶

The entire work consists of three components: (1) Science-T2I dataset construction (20k+ training pairs + 454 test prompts); (2) SciScore reward model training; (3) Two-stage alignment framework (SFT → OFT). The input is an implicit scientific prompt, and the output is a scientifically correct generated image.

Key Designs¶

Three-Tier Prompt Structure (IP/EP/SP):
- Function: Systematically decouples the model's scientific reasoning capability from its compositional rendering capability.
- Mechanism: Constructs a prompt triple for each scientific task. The Implicit Prompt (IP) contains terms requiring scientific reasoning (e.g., "unripe apple"); the Explicit Prompt (EP) directly describes the correct visual outcome (e.g., "green apple"); the Superficial Prompt (SP) describes a superficially associated but incorrect outcome (e.g., "red apple"). IP tests reasoning capability, EP establishes the rendering upper bound, and SP serves as a hard negative example for preference training.
- Design Motivation: Prior work could not distinguish whether a model "cannot draw" or "does not know what to draw." The three-tier structure addresses this directly: experiments show that explicit prompts score about 35 points higher than implicit prompts, indicating that the bottleneck lies in reasoning rather than rendering.
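
The prompt triple and the explicit-versus-implicit gap described above can be sketched as follows. This is a minimal illustration, not the authors' data format: the `PromptTriple` class, the `rendering_vs_reasoning_gap` helper, and the per-prompt scores are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PromptTriple:
    """One scientific task expressed at the three tiers (hypothetical container)."""
    implicit: str     # IP: requires scientific reasoning
    explicit: str     # EP: states the correct visual outcome directly
    superficial: str  # SP: superficially associated but incorrect outcome


def rendering_vs_reasoning_gap(ep_scores, ip_scores):
    """Mean EP score minus mean IP score, both on the same 0-100 scale.

    A large positive gap means the model renders the scene correctly when told
    what to draw but fails to infer it from the implicit cue.
    """
    return mean(ep_scores) - mean(ip_scores)


apple = PromptTriple(
    implicit="an unripe apple",
    explicit="a green apple",
    superficial="a red apple",
)

# Made-up scores for illustration only; they are not results from the paper.
gap = rendering_vs_reasoning_gap(ep_scores=[80.0, 70.0], ip_scores=[45.0, 35.0])
print(apple.implicit, "->", apple.explicit, "| gap:", gap)
```

Keeping the three prompts in one record makes it straightforward to reuse the same task for evaluation (IP versus EP) and for preference training (EP image as preferred, SP image as rejected).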

SciScore Reward Model:
- Function: Evaluates whether a generated image correctly reflects the scientific principles implied in the prompt, outperforming both the GPT-4o mini + CoT baseline and human experts in classification accuracy.
- Mechanism: SciScore is finetuned from CLIP-H with two complementary losses. Implicit Prompt Alignment (IPA) minimizes a KL divergence so that the embedding of the implicit prompt lies closer to the explicit image than to the superficial image: \(\mathcal{L}_{IPA} = \mathrm{KL}(p_{txt} \,\|\, \hat{p}_{txt})\). Image Encoder Enhancement (IEE) adds a preference loss on the image side to increase sensitivity to fine-grained scientific details (such as subtle colors and layering patterns). The total loss is \(\mathcal{L} = \mathcal{L}_{IPA} + \lambda \mathcal{L}_{IEE}\), where \(\lambda=0.25\) achieves the best balance.
- Design Motivation: Original CLIP tends to embed implicit prompts near their superficial counterparts rather than their explicit counterparts because surface-level co-occurrence patterns dominate scientific semantics. Specific finetuning is required to correct this bias.
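
As a rough sketch of how the IPA term and the weighted total could be computed for a single image pair, the snippet below uses plain Python on toy embeddings. Several things here are assumptions rather than details from the source: the target distribution \(p_{txt}\) is taken to be one-hot on the explicit image, the `temperature` parameter is hypothetical, and the IEE term is passed in as a precomputed number because its exact form is not spelled out above.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def ipa_loss(ip_text_emb, explicit_img_emb, superficial_img_emb, temperature=1.0):
    """KL(p_txt || p_hat_txt) with an ASSUMED one-hot target on the explicit image."""
    logits = [
        dot(ip_text_emb, explicit_img_emb) / temperature,
        dot(ip_text_emb, superficial_img_emb) / temperature,
    ]
    p_hat = softmax(logits)   # predicted preference over the two images
    p_target = [1.0, 0.0]     # assumption: all mass on the explicit image
    return kl_divergence(p_target, p_hat)


def total_loss(l_ipa, l_iee, lam=0.25):
    """L = L_IPA + lambda * L_IEE, with lambda = 0.25 as reported."""
    return l_ipa + lam * l_iee


# Toy 2-D embeddings: the implicit prompt aligned with the explicit image ("good")
# versus aligned with the superficial image ("bad").
good = ipa_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
bad = ipa_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
print(f"aligned: {good:.4f}  misaligned: {bad:.4f}")
```

With a one-hot target the KL term reduces to the negative log-probability assigned to the explicit image, so the loss shrinks as the implicit prompt embedding moves toward the explicit image and away from the superficial one.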

Two-Stage Alignment Framework (SFT + Masked OFT):
- Function: Injects scientific knowledge into the generative model to improve implicit reasoning capabilities.
- Mechanism: The first stage performs Supervised Fine-Tuning (SFT with LoRA, 2000 steps) on FLUX.1[dev] using the Science-T2I training set, teaching the model "what scientifically correct images look like." The second stage performs Online Fine-Tuning (OFT) using SciScore as the reward signal, adopting a DPO objective. The key innovation is the entity masking strategy: GroundingDINO is used to locate the scientific entity region, and gradients are backpropagated only within this region to prevent irrelevant backgrounds from introducing noise.
- Design Motivation: Standard post-training (PPO/DPO) optimizes within the pre-trained distribution, but if the model has never been exposed to images of scientific phenomena, pure preference optimization cannot teach it what it does not know. SFT first provides a knowledge foundation, and then OFT optimizes the implicit reasoning ability. Training is unstable without masking because preferred and rejected images usually differ only in scientifically relevant regions.
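
The effect of entity masking can be illustrated with a small pure-Python example. It is only a sketch under stated assumptions: the bounding box is hard-coded rather than produced by GroundingDINO, the "loss map" is a toy 4x4 grid rather than a model's latent-space term, and `box_to_mask` and `masked_mean` are hypothetical helper names.

```python
def box_to_mask(height, width, box):
    """Binary mask (list of rows) that is 1.0 inside box = (x0, y0, x1, y1), half-open."""
    x0, y0, x1, y1 = box
    return [
        [1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0 for x in range(width)]
        for y in range(height)
    ]


def masked_mean(per_pixel_values, mask):
    """Average the per-pixel values over the masked region only."""
    num = sum(
        v * m
        for row_v, row_m in zip(per_pixel_values, mask)
        for v, m in zip(row_v, row_m)
    )
    den = sum(m for row in mask for m in row)
    if den == 0:
        raise ValueError("mask selects no pixels")
    return num / den


# Toy per-pixel loss map: large values in the background, small values on the entity.
values = [[10.0] * 4 for _ in range(4)]
for y in (1, 2):
    for x in (1, 2):
        values[y][x] = 2.0

entity_mask = box_to_mask(4, 4, box=(1, 1, 3, 3))  # hypothetical detector output
full_mask = box_to_mask(4, 4, box=(0, 0, 4, 4))    # no masking

print("masked:", masked_mean(values, entity_mask))
print("unmasked:", masked_mean(values, full_mask))
```

In an autodiff framework, multiplying a per-pixel loss by such a mask zeroes the gradient contribution of every location outside the box, which is the property the masking strategy relies on.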

Loss & Training¶

The SFT stage uses the Flow Matching objective \(\mathcal{L}_{SFT} = \mathbb{E}\|v_\theta(z,t) - u_t(z|\epsilon)\|_2^2\). The OFT stage reinterprets the deterministic ODE of Flow Matching as an SDE, which yields a Gaussian policy \(\pi_\theta(a_t|s_t) = \mathcal{N}(a_t; \mu_\theta(s_t), \sigma_t^2 I)\), and then applies DPO to the sampled trajectories with entity masking. SFT trains LoRA adapters for 2000 steps. OFT samples 32 prompts per round and generates two images per prompt, for approximately 100 training steps.
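
To make the OFT objective more concrete, the sketch below evaluates the log-probability of a step under an isotropic Gaussian policy and plugs trajectory log-probabilities into the textbook DPO loss. This is an assumption-laden illustration: the `beta` value, the single-step toy trajectories, and the use of the unmodified DPO formula are choices made here, not details confirmed by the source, and entity masking is omitted.

```python
import math


def gaussian_log_prob(action, mean, sigma):
    """log N(action; mean, sigma^2 I) for an isotropic Gaussian over a flat vector."""
    d = len(action)
    sq_dist = sum((a - m) ** 2 for a, m in zip(action, mean))
    return -0.5 * sq_dist / sigma**2 - d * math.log(sigma) - 0.5 * d * math.log(2 * math.pi)


def trajectory_log_prob(actions, means, sigma):
    """Sum of per-step log-probabilities along one sampled trajectory."""
    return sum(gaussian_log_prob(a, m, sigma) for a, m in zip(actions, means))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=1.0):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    if margin < -30.0:
        # log(1 + exp(-x)) is approximately -x here; this avoids overflow in exp.
        return -margin
    return math.log1p(math.exp(-margin))


# One-step toy trajectories: "w" is the preferred sample, "l" the rejected one.
sigma = 1.0
logp_w = trajectory_log_prob([[1.0]], [[1.0]], sigma)       # policy mean matches winner
logp_l = trajectory_log_prob([[-1.0]], [[1.0]], sigma)      # policy mean far from loser
ref_logp_w = trajectory_log_prob([[1.0]], [[0.0]], sigma)   # reference mean at 0
ref_logp_l = trajectory_log_prob([[-1.0]], [[0.0]], sigma)

loss = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l)
print(f"DPO loss: {loss:.4f}")
```

When the policy assigns relatively more probability to the preferred trajectory than the reference does, the margin is positive and the loss falls below \(\log 2\), the value obtained when policy and reference agree.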

Key Experimental Results¶

Main Results¶

Model	Physics	Chemistry	Biology	Total Score
FLUX.2[dev] (Best)	53.19	53.55	32.50	47.80
Z-Image	26.53	32.98	22.22	26.73
SDXL	16.11	20.92	25.56	19.60
Avg. Gap between Explicit & Implicit Prompts	-	-	-	~35 pts

Evaluator classification accuracy (%) on Science-T2I-S and Science-T2I-C:

Evaluator	Science-T2I-S	Science-T2I-C
SciScore	93.14	91.19
Human Experts	87.01	86.02
GPT-4o mini + CoT	74.97	77.16
CLIP-H	54.69	59.47
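
One plausible way to compute a classification accuracy like the figures above is to count a test pair as correct when the evaluator scores the scientifically correct image strictly higher than the incorrect one. This protocol is an assumption made for illustration, and the scores below are invented.

```python
def pairwise_accuracy(score_pairs):
    """Percentage of pairs where the correct image scores strictly higher.

    score_pairs: iterable of (score_correct_image, score_incorrect_image).
    """
    pairs = list(score_pairs)
    if not pairs:
        raise ValueError("no pairs to evaluate")
    correct = sum(1 for good, bad in pairs if good > bad)
    return 100.0 * correct / len(pairs)


# Invented reward scores for four image pairs (not data from the paper).
scores = [(0.31, 0.22), (0.18, 0.27), (0.40, 0.12), (0.29, 0.25)]
accuracy = pairwise_accuracy(scores)
print(f"accuracy: {accuracy:.2f}%")
```

Under this protocol a random scorer sits near 50%, which gives a reference point for reading the CLIP-H row.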

Ablation Study¶

Method	Science-T2I-S	RI (S)	Science-T2I-C	RI (C)
FLUX.1[dev] Baseline	23.56	-	27.26	-
+ SFT	~27	~37%	~29	~23%
+ SFT + OFT (Full)	28.52	53.39%	30.11	38.31%

Key Findings¶

- All 18 T2I models had a total score below 50/100 under implicit scientific prompts, with biology being the most challenging domain (no model exceeded 33 points).
- Explicit prompts scored approximately 35 points higher than implicit prompts, indicating that the bottleneck lies in scientific reasoning rather than visual rendering.
- Z-Image achieved top-tier visual quality but scored only 26.73 in science, suggesting that high visual fidelity does not imply scientific reasoning capability.
- SciScore's failure cases are almost entirely concentrated on subject-oriented tasks (ST) because they require specific subject knowledge (e.g., which metal produces what flame color).
- SFT is a necessary prerequisite: performing OFT directly without SFT fails to improve SciScore.
- Entity masking is crucial for OFT training stability; without masking, training becomes unstable or performance stagnates.

Highlights & Insights¶

- Diagnostic Power of the Three-Tier Prompt Structure: The IP/EP/SP design turns the vague complaint that "the model is not good enough" into a specific diagnosis, namely a lack of reasoning capability. The same methodology can be transferred to any evaluation scenario that needs to separate "knowing" from "doing."
- SciScore Outperforming Human Experts: A finetuned CLIP model surpasses human evaluators with science degrees in scientific discrimination, suggesting that high-quality adversarial training data can compensate for a model's lack of innate knowledge.
- OFT Strategy with Entity Masking: Using GroundingDINO to restrict gradient updates to scientific entity regions avoids the noise introduced by optimizing over the whole image. The strategy may generalize to other RLHF/DPO finetuning settings that require fine-grained control.

Limitations & Future Work¶

- The training set is limited in size (20k+ pairs) and may not cover the long-tail knowledge of all scientific fields.
- SciScore still has clear deficiencies on subject-oriented tasks, where it lacks priors for unseen subjects.
- The current framework is built on FLUX, and its portability to other architectures remains to be validated.
- The evaluation of scientific correctness itself relies on an LMM (Qwen3.5-27B), which introduces evaluator bias.
- Deeper physical reasoning (e.g., fluid dynamics, complex optical phenomena) might require stronger training signals.

Comparison with Related Work¶

- vs PhyBench: PhyBench also evaluates physical reasoning but covers only physics; this work extends to 16 sub-domains across physics, chemistry, and biology, and proposes a complete alignment solution.
- vs Commonsense-T2I: Commonsense-T2I focuses on commonsense reasoning, which is culture-dependent and lacks explicit standards; the scientific knowledge in this work provides objective, unambiguous ground truth.
- vs ImageReward/HPSv2: These reward models optimize for aesthetic preferences, whereas SciScore optimizes for scientific correctness, which is a different problem formulation.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Formulates and addresses the problem of scientific reasoning in T2I systematically for the first time; both the three-tier prompt design and the two-stage alignment framework are highly novel.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated 18 models, compared with VLM/LMM/humans, with comprehensive ablations and both qualitative and quantitative analyses.
Writing Quality: ⭐⭐⭐⭐⭐ Highly rigorous logic, with a very clear narrative arc of problem-diagnosis-solution.
Value: ⭐⭐⭐⭐⭐ Exposes fundamental deficiencies of current T2I models; the dataset, reward model, and alignment framework make significant contributions to the community.