SCIEval: Evaluating and Benchmarking the Faithfulness of Scientific Image Generation and Interpretation with Large Multimodal Models

Conference: CVPR 2026
Paper: CVF Open Access
Code: Project Page https://SCIEval.github.io
Area: Diffusion Models / Image Generation Evaluation
Keywords: Scientific Images, Faithfulness Evaluation, CLIP Contrastive Learning, Explainable Evaluation, Benchmark

TL;DR

SCIEval is a faithfulness evaluator designed specifically for "scientific images" (line charts, binary trees, molecular formulas, etc., which contain precise numerical/attribute data). It decomposes faithfulness into three dimensions: relevance, accuracy, and explainability. Two scoring sub-modules are trained via CLIP contrastive learning, and a lightweight LMM is fine-tuned to generate error explanations. On the accompanying manually annotated SCIEval-Bench (6,000 samples), SCIEval correlates more strongly with human judgment than 24 competing methods, including GPT-4o.

Background & Motivation

Background: Scientific communication relies heavily on images, catalyzing two inverse tasks: Scientific Text-to-Image (Sci-T2I, generating scientific figures from text) and Scientific Image Captioning (Sci-IC, generating descriptions from scientific figures). Measuring the quality of these results depends on "faithfulness": whether the generated image/text accurately reflects the scientific details in the source.

Limitations of Prior Work: Existing faithfulness evaluation methods are inadequate. ① Current metrics (e.g., TIFA targets T2I via VQA, VALOR-Eval targets IC via object hallucination detection) are designed for natural images, handle only a single task, and provide a single merged score without explanation. ② Expert human evaluation is accurate but prohibitively expensive: ScImage, for example, spent $3,000 for 11 scientists to evaluate 3,000 images, which is not scalable. ③ Off-the-shelf automatic judges and metrics (e.g., Qwen-VL, ALIGNScore, CLIPScore) correlate only weakly with human judgment (Kendall coefficient < 0.3 on ScImage). ④ State-of-the-art automatic judges from the GPT series (e.g., GPT-4o) suffer from high API costs and black-box opacity.

Key Challenge: Faithfulness in scientific images requires precise numerical and attribute alignment (e.g., "four blue lines," "seven binary tree nodes") rather than just "sketching lines/trees." Metrics for natural images focus on whether "entities exist," failing to distinguish these fine-grained errors.

Goal: Develop a unified, automatic, reference-free, fine-grained, and lightweight faithfulness evaluator for both Sci-T2I and Sci-IC that provides error explanations, accompanied by a dedicated human-annotated benchmark.

Key Insight: The authors explicitly decompose "faithfulness" into three complementary dimensions: Relevance (overall image-text correspondence), Accuracy (technical details of scientific objects), and Explainability (pointing out specific unfaithful elements). The first two are quantifiable scores, while the third provides natural language reasoning.

Core Idea: With CLIP as the backbone, the authors inject fine-grained perception of scientific images into the encoders via "hard negatives + intra/inter-modal contrastive learning", producing SCIEval-R and SCIEval-A. A lightweight 8B LMM is then fine-tuned with supervised reasoning signals to produce error explanations (SCIEval-E). Together, these form a low-cost evaluator that aligns more closely with human judgment than GPT-4o.

Method

Overall Architecture

The goal of SCIEval is to take a scientific text-image pair (⟨Text T, Image I⟩ for T2I; ⟨Image I, Caption C⟩ for IC) and output a relevance score, an accuracy score, and a reasoning text explaining any unfaithfulness. The system is supported by a "task-aligned three-stage training framework": training data with hard negatives is first constructed from ArXivCap, followed by CLIP contrastive learning to train the relevance sub-evaluator (SCIEval-R) and the accuracy sub-evaluator (SCIEval-A). Finally, a lightweight LMM is fine-tuned with reasoning signals to obtain the explainability module (SCIEval-E). Inference produces scores and reasons using only the two CLIP encoders and the LMM.

```mermaid
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["ArXivCap Gold Standard Pairs<br/>32 Areas × 1000 Pairs"] --> B["Hard Negative Construction<br/>Relevance Negatives (Adversarial Retrieval)<br/>Accuracy Negatives (Object Rewriting)"]
    B --> C["Relevance Sub-evaluator SCIEval-R<br/>CLIP Intra + Inter-modal Contrastive"]
    C -->|Weight Initialization| D["Accuracy Sub-evaluator SCIEval-A<br/>Continued Training on R"]
    B --> E["Explainability Module SCIEval-E<br/>Fine-tuned 8B LMM for Error Causes"]
    C --> F["Relevance Score"]
    D --> G["Accuracy Score"]
    E --> H["Reasoning Text"]
```

Key Designs

1. Three-Dimensional Decomposition of Faithfulness: Separating "Similarity" into Optimizable Components

The authors argue against the conventional "single merged score" because failure modes in scientific images differ: some involve the wrong overall object (relevance), while others involve the correct object but incorrect attributes/values (accuracy, e.g., 5 nodes instead of 7). Faithfulness is split into: Relevance (R) for overall correspondence, Accuracy (A) for technical details, and Explainability (E) for identifying specific unfaithful elements. R and A are constrained to \([0,1]\), while E is unconstrained text. This routing of failure modes improves discrimination and provides built-in explanations for score deductions.
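As a rough illustration of the three-part output described above, the following Python sketch models one evaluation result. The class name `FaithfulnessReport` and its field names are hypothetical and not taken from the paper; only the constraint that R and A lie in [0, 1] while E is free text comes from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaithfulnessReport:
    """Hypothetical container for one SCIEval-style result (names are assumptions)."""
    relevance: float   # R: overall image-text correspondence, in [0, 1]
    accuracy: float    # A: correctness of technical details, in [0, 1]
    explanation: str   # E: unconstrained natural-language reasoning

    def __post_init__(self):
        # Enforce the [0, 1] range that the text states for R and A.
        for name in ("relevance", "accuracy"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")


# A pair with the right object but a wrong attribute: high R, low A.
report = FaithfulnessReport(
    relevance=0.91,
    accuracy=0.42,
    explanation="The scientific detail should be seven rather than five.",
)
```

Keeping the two scores separate lets a caller distinguish "wrong object" from "right object, wrong value" without parsing the explanation text.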

2. Hard Negative Construction: Adversarial Retrieval and Object Rewriting

Standard gold pairs ⟨I_T, C_T⟩ do not provide enough fine-grained discriminative power. The authors construct hard negatives in two ways. For Relevance: Following SciFIBench’s adversarial filtering, each caption is encoded into a vector \(x_C \in \mathbb{R}^d\) and stored in a Faiss index. Nearest neighbor retrieval is used to find the most similar caption \(C_R\) and its original image \(I_R\), yielding two hard negatives ⟨I_T, C_R⟩ and ⟨I_R, C_T⟩ that are semantically close but mismatched. For Accuracy: Target captions \(C_T\) are edited (changing quantities, attributes, or spatial relations) to create \(C_A\). A lightweight LMM (SEED-X) then serves as an image editor to modify the original image \(I_T\) based on \(C_A\), producing \(I_A\). This yields accuracy negatives ⟨I_T, C_A⟩ and ⟨I_A, C_T⟩. The modification \(\mathrm{diff}(C_T, C_A)\) serves as the supervision signal for training SCIEval-E.
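The relevance-negative step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it swaps the Faiss index for a brute-force cosine search in plain Python, and the function name `relevance_hard_negatives`, the image ids, and the toy 3-dimensional caption vectors are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relevance_hard_negatives(pairs, caption_vecs):
    """For each gold pair <I_T, C_T>, find the most similar *other* caption C_R
    (with its image I_R) and return the mismatched pairs <I_T, C_R> and <I_R, C_T>.

    Brute-force search stands in for the Faiss nearest-neighbor lookup.
    """
    negatives = []
    for t, (img_t, cap_t) in enumerate(pairs):
        candidates = [j for j in range(len(pairs)) if j != t]
        r = max(candidates, key=lambda j: cosine(caption_vecs[t], caption_vecs[j]))
        img_r, cap_r = pairs[r]
        negatives.append(((img_t, cap_r), (img_r, cap_t)))
    return negatives


# Toy data: hypothetical image ids, captions, and caption embeddings.
pairs = [
    ("img0", "line chart with four blue lines"),
    ("img1", "line chart with three blue lines"),
    ("img2", "binary tree with seven nodes"),
]
vecs = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0]]
negs = relevance_hard_negatives(pairs, vecs)
```

Here the two line-chart captions are each other's nearest neighbors, so `negs[0]` pairs `img0` with the "three blue lines" caption and `img1` with the "four blue lines" caption, which is the kind of semantically close mismatch the text describes.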

3. Intra-modal + Inter-modal Contrastive Learning: Injecting Scientific Perception into CLIP

SCIEval-R and SCIEval-A use the same CLIP contrastive strategy (A is initialized from R to retain scientific knowledge). The intra-modal loss \(L_{IM}\) separates negative samples within the same modality in feature space: for the visual side, \(L_{IMv} = \max\{0, s(Z_{I_T}, Z_{I_F}) - \epsilon_v\}\), where \(Z_{I_T}\) and \(Z_{I_F}\) are the embeddings of the target image and its hard-negative counterpart, \(s\) is cosine similarity, and \(\epsilon_v\) (e.g., 0.2) is a margin threshold; the text-side term \(L_{IMt}\) plays the same role for caption embeddings. The inter-modal loss \(L_{CM}\) pulls matching pairs together and pushes mismatched pairs apart. The positive term is \(L^P_{CM} = \exp(s(Z_{I_T}, Z_{C_T})/\tau) + \exp(s(Z_{I_F}, Z_{C_F})/\tau)\), and the negative term \(L^N_{CM}\) uses the mismatched pairs ⟨I_F, C_T⟩ and ⟨I_T, C_F⟩. The total loss is \(L = L_{CM} + \frac{1}{2}(L_{IMt} + L_{IMv})\). CLIP is chosen for efficiency: training and inference take only 3 hours on 4x RTX 3090 GPUs.
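The loss terms above can be written out for a single training example as a small Python sketch. Two parts are assumptions rather than statements from the source: the InfoNCE-style combination \(L_{CM} = -\log\frac{L^P_{CM}}{L^P_{CM} + L^N_{CM}}\), and the values `tau=0.07` and `eps_t=0.2` (the source only gives 0.2 as an example for \(\epsilon_v\)). Function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity s(., .) between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def intra_modal_hinge(z_true, z_false, margin=0.2):
    """L_IM = max{0, s(Z_T, Z_F) - eps}: penalize same-modality negatives
    whose similarity to the target exceeds the margin."""
    return max(0.0, cosine(z_true, z_false) - margin)


def inter_modal_loss(z_it, z_ct, z_if, z_cf, tau=0.07):
    """Assumed InfoNCE-style form: -log(L^P / (L^P + L^N))."""
    pos = math.exp(cosine(z_it, z_ct) / tau) + math.exp(cosine(z_if, z_cf) / tau)
    neg = math.exp(cosine(z_if, z_ct) / tau) + math.exp(cosine(z_it, z_cf) / tau)
    return -math.log(pos / (pos + neg))


def total_loss(z_it, z_ct, z_if, z_cf, eps_v=0.2, eps_t=0.2, tau=0.07):
    """L = L_CM + 1/2 (L_IMt + L_IMv)."""
    l_cm = inter_modal_loss(z_it, z_ct, z_if, z_cf, tau)
    l_imv = intra_modal_hinge(z_it, z_if, eps_v)  # image vs. negative image
    l_imt = intra_modal_hinge(z_ct, z_cf, eps_t)  # caption vs. negative caption
    return l_cm + 0.5 * (l_imt + l_imv)


# Well-separated embeddings: matched pairs aligned, negatives orthogonal.
separated = total_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
# Collapsed embeddings: positives and negatives are indistinguishable.
collapsed = total_loss([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
```

Under these assumptions the separated case gives a loss close to zero, while the collapsed case gives \(\log 2 + 0.8\): the inter-modal term cannot tell positives from negatives, and both hinge terms are active.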

4. SCIEval-E SFT + SCIEval-Bench

SCIEval-E is an mPLUG-owl3 (8B) model fine-tuned on triplets \((I_T, C_A, \mathrm{diff}(C_T, C_A))\) to point out error causes using templates like "The scientific detail should be [Ground Truth] rather than [Fake Value]." To validate the metrics, the authors built SCIEval-Bench by sampling 600 high-quality gold pairs across CS, Bio, Econ, Physics, etc. (non-CS fields are grouped as General). For T2I, images were generated by Llama-python, Llama-tikz, Stable Diffusion, and DALL-E. For IC, captions were generated by LLaVA-1.6, IDEFICS-2, Qwen-VL, and DeepSeek-VL. Each sample (3,000 per task) was annotated by three CS PhDs on a 1–5 scale. The total cost was ~$1,200, or roughly $0.20 per sample across the 6,000 samples, compared with about $1 per image for the ScImage human evaluation ($3,000 for 3,000 images).
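To make the supervision signal concrete, the sketch below derives a templated explanation from \(\mathrm{diff}(C_T, C_A)\). It is only an illustration: the word-level diff via Python's standard `difflib` and the function name `explain_diff` are assumptions, and the source does not say how the authors compute the diff. Only the template wording comes from the text.

```python
import difflib

# Template wording quoted from the description of SCIEval-E.
TEMPLATE = "The scientific detail should be {truth} rather than {fake}."


def explain_diff(caption_true, caption_edited):
    """Return one templated sentence per word span that was replaced
    when editing C_T into C_A (word-level diff is an assumption)."""
    true_words = caption_true.split()
    edited_words = caption_edited.split()
    matcher = difflib.SequenceMatcher(a=true_words, b=edited_words, autojunk=False)
    explanations = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            explanations.append(TEMPLATE.format(
                truth=" ".join(true_words[i1:i2]),
                fake=" ".join(edited_words[j1:j2]),
            ))
    return explanations


targets = explain_diff(
    "a binary tree with seven nodes",
    "a binary tree with five nodes",
)
# targets == ["The scientific detail should be seven rather than five."]
```

Identical captions yield an empty list, which matches the intuition that a faithful pair needs no error explanation.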

Loss & Training

  • Total loss for CLIP sub-modules: \(L = L_{CM} + \frac{1}{2}(L_{IMt} + L_{IMv})\), where \(L_{CM}\) uses InfoNCE and \(L_{IM}\) uses a hinge (margin) loss.
  • Strategy: Train SCIEval-R first, initialize SCIEval-A with those weights to transfer scientific knowledge, and then perform independent SFT for SCIEval-E.
  • Data: 32,000 pairs across 32 scientific domains from ArXivCap.

Key Experimental Results

Main Results

Reliability is measured by the Spearman and Pearson correlation coefficients (%) between automatic scores and human judgment. The table below shows results for the CS subset of Sci-T2I and Sci-IC (Spearman / Pearson):

| Method | T2I·CS Rel. | T2I·CS Acc. | IC·CS Rel. | IC·CS Acc. |
| --- | --- | --- | --- | --- |
| GPT-4o (Prev. SOTA) | 71.3 / 70.5 | 66.7 / 66.0 | 72.8 / 72.4 | 67.2 / 67.3 |
| Gemini 1.5 Pro | 70.9 / 70.2 | 66.5 / 66.1 | 71.4 / 70.9 | 66.8 / 66.3 |
| TIFA (T2I specialized) | 45.2 / 44.3 | 42.6 / 42.1 | - | - |
| CLIPScore | 42.6 / 42.5 | 40.4 / 39.6 | 45.5 / 44.7 | 38.2 / 37.8 |
| SCIEval (Ours) | 74.1 / 73.2 | 69.4 / 68.2 | 75.9 / 75.6 | 69.9 / 69.1 |

SCIEval outperforms all 24 competitors across all CS categories. Similar trends are observed in the General subset (e.g., T2I·General Rel. 68.5/68.2, leading GPT-4o's 64.9/65.1).
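For readers unfamiliar with the two reliability measures reported above, here is a minimal pure-Python sketch of Pearson and Spearman correlation between metric scores and human ratings. The example scores are made up for illustration and are not from SCIEval-Bench. Ties are handled with average ranks, a common convention that the source does not specify.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance normalized by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: human ratings (1-5) and an automatic metric's scores.
human = [1, 2, 3, 4, 5]
metric = [0.10, 0.35, 0.30, 0.80, 0.95]
rho = spearman(human, metric)
r = pearson(human, metric)
```

In this toy case the metric swaps the order of two adjacent samples, so `rho` is 0.9 rather than 1.0. The tables report these coefficients multiplied by 100.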

Evaluation of reasoning quality (5-point scale, Human/LMM-as-Judge):

| Model | Human·Correct. | Human·Complet. | LMM·Correct. | LMM·Complet. |
| --- | --- | --- | --- | --- |
| InstructBLIP | 3.6 | 2.8 | 3.3 | 2.7 |
| Gemini 1.5 Pro | 4.4 | 3.3 | 4.2 | 3.8 |
| GPT-4V | 4.5 | 3.5 | 4.4 | 3.9 |
| SCIEval (Ours) | 4.7 | 4.2 | 4.4 | 4.3 |

SCIEval is notably superior to GPT-4V in completeness (Human 4.2 vs 3.5) and matches or exceeds it in correctness (Human 4.7 vs 4.5, LMM 4.4 vs 4.4), suggesting that the three-dimensional decomposition and reasoning supervision make explanations more precise.

Ablation Study

The paper analyzes drivers of gain through cross-model comparisons rather than simple module removal:

| Dimension | Key Metric | Observation |
| --- | --- | --- |
| Closed vs. Open LMM | T2I·CS Rel. diff ~30.1% | Top closed LMMs significantly outperform open LMMs; the strongest open LMM (InstructBLIP-13b) lags behind the weakest closed LMM (Claude 3 Haiku). |
| Specialized vs. General | TIFA 45.2 > Open LMM | Specialized metrics like TIFA/VALOR-Eval outperform generic LMMs, highlighting the value of task customization. |
| Model Scale | Large ≠ Better | For InstructBLIP, larger scale did not yield significant gains, suggesting >10B parameters might not be necessary for evaluation. |

Key Findings

  • SCIEval surpasses GPT-4o in correlation with human judgment while needing only about 3 hours on 4x RTX 3090 GPUs, suggesting that "distilling" faithfulness into CLIP encoders is highly cost-effective.
  • While a massive gap exists between closed and open LMMs, task-specialized metrics can partially bridge this gap.
  • Evaluating Sci-T2I is not necessarily harder than Sci-IC for LMM judges. While generating scientific images is harder, LMM performance for evaluation was similar across both tasks.

Highlights & Insights

  • Scalable Hard Negative Generation: Using CLIP+Faiss for semantic neighbors and LMM-based "targeted editing" to generate \(\mathrm{diff}\) signals are techniques transferable to any fine-grained alignment task.
  • "Score with Built-in Reason": Treating explainability as an independent dimension allows the evaluator to provide actionable feedback (e.g., "should be X, not Y"), which is directly useful for model debugging.
  • Cost-Efficiency as a Core Selling Point: 3 hours training / 4x RTX 3090 / ~$1,200 labeling cost. This demonstrates that "small and specialized" models can outperform "large and general" ones in evaluation scenarios.

Limitations & Future Work

  • The benchmark samples are derived from 600 gold pairs with only 3 CS PhD annotators; the "General" category is a mixture of fields, potentially lacking depth in biology or physics.
  • SCIEval is CLIP-based, inheriting CLIP's limits in perceiving extremely fine scientific symbols (e.g., complex formulas or specific axis labels).
  • Accuracy negative generation depends on the SEED-X editor; quality issues in edited images may introduce noise. Structured reasoning templates may also limit generalization to open-ended scientific errors.

Comparison with Related Work

  • vs. TIFA / VALOR-Eval: These target only one task (T2I or IC), are built for natural images, and lack explanations. SCIEval provides a unified framework for both, focused on scientific data with explainable scores.
  • vs. Generic LMM Judges (GPT-4o): While general LMMs have broad capabilities, they are expensive, black-box, and show lower human correlation. SCIEval distills this into a lightweight, high-correlation alternative.
  • vs. CLIPScore / BLIP2Score: These rely on one-shot alignment similarity and are insensitive to scientific nuances. SCIEval explicitly trains for fine-grained discrimination using hard negatives.

Rating

  • Novelty: ⭐⭐⭐⭐ First faithfulness metric for scientific images covering both T2I/IC with built-in reasoning; innovative 3D decomposition and negative construction.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Compared against 24 methods across two tasks and two domains; human and LMM-based evaluation of reasoning quality.
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation and decomposition; helpful diagrams for data construction and training.
  • Value: ⭐⭐⭐⭐ A low-cost, explainable scientific image evaluator and a 6,000-sample benchmark serve as practical infrastructure for the field.