# RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering
Conference: NeurIPS 2025 | arXiv: 2512.05119 | Code: https://github.com/USTC-StarTeam/RAG-IGBench | Area: Information Retrieval
Keywords: interleaved image-text generation, retrieval-augmented generation, multimodal evaluation, open-domain question answering, benchmark
## TL;DR
This paper introduces RAG-IGBench, a benchmark specifically designed to evaluate the quality of interleaved image-text content generated via retrieval-augmented generation. It proposes novel automatic evaluation metrics spanning three dimensions—text quality, image quality, and image-text consistency—and demonstrates strong correlation with human evaluation.
## Background & Motivation
Background: Interleaved image-text generation requires models to jointly produce text and images, a core capability for practical applications such as content creation and visual storytelling.
Limitations of Prior Work:
- End-to-end generation methods (e.g., Chameleon) unify text and image processing but exhibit limited capacity for following complex instructions.
- Existing evaluation frameworks either assess only unimodal metrics (e.g., FID for images alone) or rely on MLLM-based scoring (GPT-4o-based), which introduces model bias and instability.
- High-quality open-domain datasets for interleaved image-text generation are lacking.
Key Challenge: How to comprehensively evaluate the quality of interleaved image-text content without relying on model-induced bias.
Key Insight: Adopting a RAG framework in which an MLLM selects images from retrieved documents and embeds them into text, rather than generating images from scratch.
## Method
### Overall Architecture
The RAG-IG framework: given a user query → retrieve relevant documents and images → MLLM generates a Markdown-formatted response with image placeholders → placeholders are replaced with actual images to produce the final multimodal response. Evaluation is conducted across three dimensions: text quality, image quality, and image-text consistency.
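The final rendering step of this pipeline can be sketched as follows. Note that the `<IMG_k>` placeholder syntax, the `render_response` name, and the `retrieved_images` mapping are illustrative assumptions, not the paper's exact format:

```python
import re

def render_response(markdown_text, retrieved_images):
    """Replace numbered image placeholders like <IMG_2> in the MLLM's
    Markdown output with actual image links from the retrieved pool.

    The placeholder convention and dict-based pool are assumptions
    made for this sketch; unmatched placeholders are dropped."""
    def substitute(match):
        idx = int(match.group(1))
        url = retrieved_images.get(idx)
        return f"![retrieved image {idx}]({url})" if url else ""
    return re.sub(r"<IMG_(\d+)>", substitute, markdown_text)

answer = "Start with the base coat. <IMG_1> Then apply the top layer. <IMG_3>"
pool = {1: "https://example.com/1.jpg", 3: "https://example.com/3.jpg"}
print(render_response(answer, pool))
```

The rendered Markdown is what the three evaluation dimensions are then scored against.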
### Key Designs
- Dataset Construction (Three-Stage Pipeline):
  - Function: Construct a high-quality interleaved image-text QA dataset.
  - Mechanism: Stage 1 uses an MLLM to generate raw QA pairs; Stage 2 employs expert annotators to refine image selection and arrangement; Stage 3 filters out low-quality samples based on quality criteria.
  - Design Motivation: Sourcing recent public content from social platforms ensures diversity and timeliness; manual annotation guarantees ground truth quality.
- Image Quality Evaluation (Edit Distance + Kendall Score):
  - Function: Evaluate the match between the model's selected image sequence and the ground truth.
  - Mechanism: Edit Distance measures selection accuracy (the number of insertions/deletions/substitutions required), normalized as \(1 - dp(m,n)/\max(m,n)\); Kendall Score measures ordering correctness by computing the proportion of concordant pairs among correctly selected images.
  - Design Motivation: Traditional FID/IS metrics assess generated image quality, but in the RAG setting images are selected rather than generated, necessitating metrics for selection accuracy and ordering correctness.
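A minimal sketch of the two image-quality metrics, assuming image sequences are lists of unique IDs. Function names and the handling of degenerate cases (empty sequences, fewer than two shared images) are my own choices; the paper's exact edge-case conventions may differ:

```python
def normalized_edit_similarity(pred, gold):
    """Selection score: 1 - dp(m, n) / max(m, n), where dp is the
    Levenshtein distance between the predicted and ground-truth
    image ID sequences."""
    m, n = len(pred), len(gold)
    if max(m, n) == 0:
        return 1.0  # assumption: two empty sequences count as a perfect match
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of pred
    for j in range(n + 1):
        dp[0][j] = j  # insert all of gold
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gold[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return 1 - dp[m][n] / max(m, n)


def kendall_score(pred, gold):
    """Ordering score: proportion of concordant pairs among the images
    that appear in both sequences (IDs assumed unique)."""
    shared = [x for x in pred if x in gold]
    if len(shared) < 2:
        return 1.0  # assumption: <2 shared images -> ordering trivially correct
    gold_rank = {x: i for i, x in enumerate(gold)}
    pairs = concordant = 0
    for a in range(len(shared)):
        for b in range(a + 1, len(shared)):
            pairs += 1
            if gold_rank[shared[a]] < gold_rank[shared[b]]:
                concordant += 1
    return concordant / pairs
```

For example, `kendall_score([3, 1, 2], [1, 2, 3])` yields 1/3: of the three shared pairs, only (1, 2) preserves the ground-truth order.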
- Image-Text Consistency Evaluation (CLIP Score + Alignment Score):
  - Function: Evaluate the semantic alignment between images and their surrounding text.
  - Mechanism: CLIP Score directly computes the cosine similarity between image and text embeddings; Alignment Score compares the contextual text surrounding the same image in the generated answer and the ground truth.
  - Design Motivation: CLIP Score captures direct semantic alignment but lacks contextual understanding; Alignment Score compensates for this limitation.
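Both consistency metrics reduce to cosine similarity over embeddings. The sketch below assumes precomputed embeddings (e.g., from a CLIP model); the function names, the context window, and the choice of text encoder for Alignment Score are assumptions, not the paper's specification:

```python
import numpy as np

def clip_score(image_emb, text_emb):
    """CLIP Score: cosine similarity between an image embedding and
    the embedding of its surrounding text."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(np.dot(image_emb, text_emb))

def alignment_score(pred_context_emb, gold_context_emb):
    """Alignment Score sketch: cosine similarity between embeddings of
    the text surrounding the same image in the generated answer and in
    the ground truth answer."""
    return clip_score(pred_context_emb, gold_context_emb)
```

The key difference is what gets embedded: CLIP Score pairs an image with its local text, while Alignment Score pairs two text contexts anchored on the same image.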
## Key Experimental Results
### Main Results — Performance of Mainstream MLLMs on RAG-IGBench
| Model | Rouge-1↑ | Edit Dist↑ | Kendall↑ | Align Score↑ | Mean↑ |
|---|---|---|---|---|---|
| GPT-4o | 0.374 | 0.471 | 0.532 | 0.495 | 0.468 |
| Claude-3.5 | 0.350 | 0.439 | 0.490 | 0.481 | 0.440 |
| Qwen2VL-72B | 0.319 | 0.390 | 0.451 | 0.438 | 0.400 |
| InternVL2-40B | 0.281 | 0.328 | 0.368 | 0.402 | 0.345 |
### Ablation Study — Correlation Between Evaluation Metrics and Human Judgments
| Metric | Pearson Correlation | Spearman Correlation |
|---|---|---|
| Rouge-1 | 0.72 | 0.68 |
| Edit Distance | 0.81 | 0.78 |
| Kendall Score | 0.75 | 0.71 |
| CLIP Score | 0.65 | 0.62 |
| Alignment Score | 0.74 | 0.70 |
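A metric-human correlation of this kind can be reproduced from per-sample scores. The sketch below uses hypothetical data and a simplified Spearman (rank-transform then Pearson, without tie correction), so the numbers are illustrative only:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between metric scores and human ratings."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman correlation as Pearson on ranks. No tie correction:
    a simplification relative to the full definition."""
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

# Hypothetical per-sample scores: automatic metric vs. human judgment.
metric = np.array([0.42, 0.55, 0.31, 0.78, 0.66])
human = np.array([3, 4, 2, 5, 4])
print(pearson(metric, human), spearman(metric, human))
```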
### Key Findings
- GPT-4o leads across all dimensions, yet a significant gap relative to human performance remains.
- Image selection (Edit Distance) constitutes the largest bottleneck; models generally fall short on both the number of images selected and selection accuracy.
- Fine-tuned models exhibit performance improvements across multiple benchmarks, validating the high quality of the dataset.
- The gap between open-source and closed-source models is substantial, particularly in image-text consistency.
## Highlights & Insights
- The RAG-IG paradigm is more practical than end-to-end image generation—image selection is more controllable and yields more stable quality than image generation.
- The combination of Edit Distance and Kendall Score elegantly disentangles the two dimensions of "which images were correctly selected" and "whether the ordering of images is correct."
- The design rationale behind Alignment Score is noteworthy: the same image should appear in semantically similar contexts across different answers.
## Limitations & Future Work
- The data sources are limited to social platforms, biasing the domain toward everyday lifestyle topics and underrepresenting professional or academic scenarios.
- Although the evaluation metrics exhibit strong correlation with human judgments, the CLIP Score correlation remains relatively low (0.65), indicating room for further improvement.
- The benchmark evaluates image selection but not image comprehension—a model may select the correct image without being able to explain its content.
- The impact of layout and typesetting on user experience is not considered.
## Related Work & Insights
- vs. INTERLEAVEDBENCH: Relies on GPT-4o-based scoring, introducing model bias; RAG-IGBench employs rule-based metrics, yielding more objective evaluation.
- vs. MMIE: Depends on fine-tuned VLMs for evaluation, leading to inconsistency issues; the proposed metrics are more stable and reproducible.
- vs. MEGA-Bench: Focuses on multimodal understanding capabilities and does not address interleaved generation.
## Rating
- Novelty: ⭐⭐⭐⭐ RAG-IG paradigm + innovative evaluation metrics
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers mainstream open-source and closed-source models with human evaluation validation
- Writing Quality: ⭐⭐⭐⭐ Dataset construction pipeline is detailed and metric definitions are clear
- Value: ⭐⭐⭐⭐ Fills a gap in the evaluation of interleaved image-text generation