Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content¶
Conference: NeurIPS 2025 (MusIML Workshop)
arXiv: 2510.24438
Code: To be confirmed
Area: AIGC Detection
Keywords: LLM Evaluation, Islamic Content Generation, Dual-Agent Framework, Citation Verification, High-Stakes Domain Generation
TL;DR¶
This paper proposes a dual-agent (quantitative + qualitative) evaluation framework that systematically assesses the faithfulness of GPT-4o, Ansari AI, and Fanar on Islamic content generation tasks across three dimensions—theological accuracy, citation integrity, and stylistic appropriateness—finding that even the best-performing model exhibits significant deficiencies in citation reliability.
Background & Motivation¶
Special requirements of high-stakes domains: Islamic content generation demands exceptionally high standards of theological accuracy, citation attribution, and tonal appropriateness; subtle errors—such as misquoting Qur'anic verses or misattributing hadith—can propagate misinformation and cause spiritual harm.
Limitations of conventional metrics: Surface-overlap metrics such as BLEU and ROUGE are incapable of measuring doctrinal fidelity, citation integrity, or theological correctness.
Evaluation gap: Specialized evaluation pipelines have been developed for high-stakes domains such as medicine and law, yet the religious domain remains nearly unaddressed; existing Islamic chatbots (Ansari AI, Fanar) have only been assessed on general Arabic benchmarks without any theological-level evaluation.
Infrastructure deficiencies: A substantial portion of classical Islamic texts remains in unstructured PDF or scanned-image formats, impeding computational utilization.
Cross-domain precedents: The legal domain (Mata v. Avianca) has exposed fabricated citations; 50–90% of medical responses lack adequate citation support; 41 of 77 AI-generated articles by CNET required correction—the religious domain faces analogous, if not more severe, risks.
Core research question: Can current LLMs generate Islamic content that is theologically accurate, correctly cited, and tonally appropriate? And how can such generation be systematically evaluated?
Method¶
Overall Architecture¶
The paper proposes a Dual-Agent Evaluation Framework consisting of a quantitative evaluation agent and a qualitative comparative agent. Both agents share a citation verification toolchain and assess LLM outputs from complementary perspectives.
Three Core Design Modules¶
1. Prompt Collection and Response Acquisition
- Fifty prompts are collected from five authoritative Islamic blog platforms (The Thinking Muslim, IslamOnline, Yaqeen Institute, etc.), comprising article titles authored by prominent Islamic scholars.
- The prompts span five domains: Fiqh (Islamic jurisprudence), Tafsir (Qur'anic exegesis), Ulum al-Hadith (hadith sciences), Aqidah (theology), and Adab (spiritual conduct).
- Each prompt is submitted to GPT-4o, Ansari AI, and Fanar, yielding 150 responses in total.
2. Quantitative Evaluation Agent
- Built on the OpenAI o3 reasoning model, equipped with three verification tools: Qur'an Ayah (verse retrieval), Internet Search, and Internet Extract.
- Each article is segmented into introduction, body, and conclusion; scoring covers six dimensions on a 1–5 scale:
- Style & Structure (4 sub-dimensions): Structure, Theme, Clarity, Originality
- Islamic Content (2 sub-dimensions): Islamic Accuracy, Citation & Source Usage
- Detected citations are automatically retrieved and verified, returning one of four labels: confirmed / partially confirmed / unverified / refuted.
- Scores are penalized for citations that are not fully confirmed.
3. Qualitative Comparative Agent
- Processes responses from all three models simultaneously, delimited by XML tags (<R1>/<R2>/<R3>), and performs side-by-side comparison.
- Evaluates five dimensions: Clarity & Structure, Islamic Accuracy, Tone & Appropriateness, Depth & Originality, and Comparative Reflection.
- Identifies the strongest and weakest response per dimension, supported by specific textual excerpts.
- Employs the same verification toolchain as the quantitative agent to ensure consistency.
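A minimal sketch of how the qualitative agent's input might be assembled. The template wording and the function name are hypothetical; only the XML delimiters and the five evaluation dimensions come from the paper:

```python
def build_comparison_prompt(responses):
    """Wrap each model response in <R1>/<R2>/<R3> tags and prepend
    the comparison instructions (wording assumed for illustration)."""
    tagged = "\n".join(
        f"<R{i}>{text}</R{i}>" for i, text in enumerate(responses, start=1)
    )
    instructions = (
        "Compare the three responses on Clarity & Structure, Islamic "
        "Accuracy, Tone & Appropriateness, Depth & Originality, and "
        "Comparative Reflection. For each dimension, name the strongest "
        "and weakest response and quote supporting excerpts."
    )
    return f"{instructions}\n\n{tagged}"

prompt = build_comparison_prompt(["answer A", "answer B", "answer C"])
print("<R2>answer B</R2>" in prompt)  # True
```

Delimiting the three responses with explicit tags lets a single evaluator call compare them side by side rather than scoring each in isolation, which is what enables the per-dimension Best/Worst judgments.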
Scoring Scheme¶
- The quantitative dimensions adopt a 1–5 scale; citation verification outcomes directly affect the Islamic Accuracy and Citation scores.
- The qualitative dimensions adopt a Best/Worst voting scheme, yielding one Best and one Worst judgment among the three models per dimension per prompt.
- Alignment between the two evaluation modes provides evidence of convergent validity.
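The citation-based penalty can be sketched as follows; the penalty weights and the averaging rule are assumptions for illustration, not values reported in the paper. Only the four verification labels and the 1–5 scale come from the source:

```python
# Assumed penalty weights: only fully confirmed citations escape a deduction.
PENALTY = {
    "confirmed": 0.0,
    "partially confirmed": 0.5,
    "unverified": 1.0,
    "refuted": 2.0,
}

def penalized_citation_score(base_score, labels):
    """Subtract the mean per-citation penalty from a base score,
    clamping the result to the quantitative agent's 1-5 scale."""
    if not labels:
        return base_score
    avg_penalty = sum(PENALTY[label] for label in labels) / len(labels)
    return max(1.0, min(5.0, base_score - avg_penalty))

# Example: a response starting at 4.0 with mixed verification outcomes.
print(penalized_citation_score(4.0, ["confirmed", "unverified", "refuted"]))  # 3.0
```

Averaging (rather than summing) penalties keeps long, citation-heavy articles from being punished merely for citing more; whether the paper uses this exact rule is not specified.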
Key Experimental Results¶
Main Results¶
| Model | Overall Mean | Std | Structure | Theme | Clarity | Originality | Islamic Accuracy | Citation |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 3.90 | 0.589 | 4.16 | 4.43 | 4.10 | 3.10 | 3.93 | 3.38 |
| Ansari AI | 3.79 | — | — | — | — | — | 3.68 | 3.32 |
| Fanar | 3.04 | 0.923 | — | — | — | 2.73 | 2.76 | 1.82 |
Qualitative Comparative Results (Best/Worst Votes, max 200 each)¶
| Model | Total Best | Total Worst | Strongest Dimension |
|---|---|---|---|
| Ansari AI | 116 | 3 | Clarity & Structure (41), Islamic Accuracy (42) |
| GPT-4o | 84 | 4 | Tone & Appropriateness (48) |
| Fanar | 0 | 193 | Weakest across all dimensions |
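Using only the summary numbers from the two tables above, the convergent-validity check reduces to comparing the rankings produced by the two evaluation modes:

```python
# Mean quantitative scores and total "Best" votes, taken from the tables above.
quant_mean = {"GPT-4o": 3.90, "Ansari AI": 3.79, "Fanar": 3.04}
best_votes = {"GPT-4o": 84, "Ansari AI": 116, "Fanar": 0}

# Rank models from strongest to weakest under each evaluation mode.
rank_quant = sorted(quant_mean, key=quant_mean.get, reverse=True)
rank_qual = sorted(best_votes, key=best_votes.get, reverse=True)

print(rank_quant)  # ['GPT-4o', 'Ansari AI', 'Fanar']
print(rank_qual)   # ['Ansari AI', 'GPT-4o', 'Fanar']
```

Both modes agree that Fanar ranks last, while the top spot flips between GPT-4o (quantitative) and Ansari AI (qualitative), which is exactly the divergence discussed in the findings.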
Key Findings¶
- GPT-4o leads quantitatively: Achieving an overall mean of 3.90/5 with the lowest variance (std = 0.589), GPT-4o demonstrates consistent superiority in structure, theme, and Islamic accuracy.
- Ansari AI leads qualitatively: With 116/200 Best votes, Ansari AI excels in clarity and religious fidelity, reflecting the advantages of domain adaptation.
- Fanar lags overall but shows architectural innovation: Its 9B parameters and 4,096-token context window constrain reasoning capacity; nevertheless, its morphological tokenizer, region-specific datasets, and Islamic RAG pipeline represent valuable contributions.
- Citation deficiencies are universal: Even the best-performing model (GPT-4o, Citation = 3.38/5) exhibits the most pronounced weakness in citation accuracy—a core requirement for faith-sensitive writing.
- Model scale has a substantial impact: the performance gap between GPT-4o (128K-token context) and Fanar (4,096-token context) tracks the large differences in model scale and context length.
Highlights & Insights¶
- First systematic faithfulness evaluation for Islamic content: This work fills a critical gap in LLM evaluation for the religious domain; the framework design is transferable to other high-stakes domains such as medicine and law.
- Elegant dual-agent complementarity: The quantitative agent provides comparable numerical scores while the qualitative agent captures nuanced differences in tone and rhetoric; shared toolchains ensure methodological consistency.
- Practical citation verification toolchain: Automatic retrieval of Qur'anic verses and hadith with four-level labeling (confirmed / partially confirmed / unverified / refuted) yields directly applicable output.
- Rigorous experimental design: Fifty prompts covering five Islamic knowledge domains, a blind-review protocol to reduce bias, and human inspection as a sanity check collectively strengthen the evaluation's credibility.
Limitations & Future Work¶
- Evaluator bias: Both the quantitative and qualitative agents are based on OpenAI models, introducing a risk of same-family bias; future work should incorporate heterogeneous evaluators such as Claude and Gemini for cross-validation.
- Limited scale: Only 50 prompts are used, leaving different legal schools (madhahib), edge cases, and contemporary jurisprudential issues uncovered.
- Single-language evaluation: Only English responses are assessed; Fanar's primary language (Arabic) is not evaluated, which may disadvantage Fanar unfairly.
- Lack of multi-expert validation: Only one human reviewer is involved, falling short of the 3–5 scholar consensus panel recommended for theological review.
- Imprecise domain categorization: Some prompts may span multiple domains, introducing ambiguity in domain attribution.
Related Work & Insights¶
- High-stakes domain evaluation: Hallucination and citation verification research exists for the legal (LEGAL-BERT, LegalBench), medical (SourceCheckup), and journalism domains; this paper extends those paradigms to the religious domain.
- Islamic NLP: AraBERT and Qur'anQA have advanced Arabic language understanding; Ansari AI and Fanar are representative Islamic chatbots, but their evaluation has been limited to general benchmarks.
- Agent-based evaluation: Tool-augmented approaches combining RAG, chain-of-thought reasoning, and multi-agent collaboration (LangChain / CrewAI / CamelAI) have improved verifiability on general tasks; this paper is the first to apply such methods to theological verification.
- Data infrastructure: Platforms such as Usul.ai, SHARIAsource, Shamela, and OpenITI provide machine-readable Islamic legal data, but these resources have not yet been systematically integrated into LLM evaluation pipelines.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First systematic faithfulness evaluation framework for Islamic content; the dual-agent design is inventive.
- Experimental Thoroughness: ⭐⭐⭐ — The 50-prompt scale is modest; single-language evaluation and a single human reviewer limit the strength of conclusions.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, well-motivated problem statement, and strong connections to cross-domain related work.
- Value: ⭐⭐⭐⭐ — The framework is transferable to other high-stakes domains; the problem formulation and evaluation dimension design offer useful references.