CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models
Conference: ACL 2025
arXiv: 2505.20767
Code: GitHub
Area: LLM/NLP
Keywords: cognitive faithfulness, hallucination detection, legal-inspired framework, benchmark, knowledge-grounded dialogue
TL;DR
Drawing on the legal domain's standards for circumstantial evidence, this paper proposes a hierarchical evaluation framework and the CogniBench dataset, giving the first systematic definition and evaluation of LLM faithfulness in cognitive statements (reasoning, evaluation, explanation). It also trains the CogniDet detector, which detects factual and cognitive hallucinations at the same time.
Background & Motivation
Existing hallucination evaluation focuses on factual statements: Benchmarks such as RAGTruth and FAVA mainly check whether models correctly recite facts from the context. They ignore the growing share of "cognitive statements" generated by LLMs, such as reasoning, evaluation, and explanation that go beyond the original text.
Lack of evaluation standards for cognitive statements: The evaluation of cognitive statements is inherently subjective and context-dependent. Existing labels like "baseless" or "subjective" are too vague to precisely capture different levels of faithfulness requirements.
LLM cognitive hallucination rate is much higher than factual hallucination: Preliminary analysis shows that the hallucination rate of LLMs when generating cognitive statements (64.8%) is approximately 4.6 times that of factual statements (13.9%), indicating an urgent need for systematic research.
Varying faithfulness requirements across applications: Virtual personas (such as in creative writing) can tolerate some speculation, but high-risk scenarios like medical diagnosis and legal judgment require conclusions to be irrefutable, which demands graded standards.
Manual annotation cannot keep pace with LLM iteration: New models continue to emerge, and manual sentence-by-sentence annotation is too costly, necessitating low-resource, automatable evaluation methods.
Increasing proportion of cognitive statements in multi-turn dialogues: As dialogue turns increase, cognitive statements grow from 15% in the first turn to 50% in the final turn. However, existing datasets are mostly single-turn or short dialogues that cannot cover this characteristic.
Method
Overall Architecture
This paper presents three core contributions: (1) a hierarchical faithfulness evaluation framework (Rational \(\to\) Grounded \(\to\) Unequivocal) inspired by legal circumstantial evidence rules; (2) the human-annotated CogniBench dataset; and (3) the large-scale CogniBench-L dataset generated via an automated annotation pipeline and the fine-tuned CogniDet detector model. The overall workflow consists of defining the evaluation standards, conducting human annotation based on these standards, designing automated annotation methods for data scaling, and finally training the detector.
Key Designs
1. Legal-inspired Hierarchical Faithfulness Evaluation Framework
- Function: Divides LLM-generated statements into "factual statements" (directly reciting the context) and "cognitive statements" (reasoning/evaluation/explanation), then evaluates cognitive statements against three progressive criteria in sequence.
- Mechanism: By analogy with how circumstantial evidence is assessed in the legal domain, the framework proposes three progressive criteria:
  - Rational: Whether the statement is rational and plausible even without direct supporting evidence (distinguishing reasoning from conjecture).
  - Grounded: Whether the statement can be logically derived from the context.
  - Unequivocal: Whether the statement is the sole reasonable conclusion, with no alternative plausible explanations.
- Design Motivation: Legal standards for weighing evidence have been refined through centuries of practice and are both objective and progressive by design. Different applications can select different levels of strictness: virtual characters only need to satisfy Rational, AI assistants need to satisfy Grounded, and high-risk decision systems (medicine, finance) must satisfy Unequivocal.
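The graded strictness described above can be sketched as an ordered comparison. This is a minimal illustration, not the authors' code: the `Criterion` enum, the `REQUIRED` mapping, and the application names are hypothetical and simply mirror the three examples given in the text.

```python
from enum import IntEnum


class Criterion(IntEnum):
    """Highest criterion a cognitive statement satisfies (ordered by strictness)."""
    NONE = 0         # fails even the Rational check
    RATIONAL = 1
    GROUNDED = 2
    UNEQUIVOCAL = 3


# Hypothetical application profiles, taken from the examples in the text.
REQUIRED = {
    "virtual_character": Criterion.RATIONAL,
    "ai_assistant": Criterion.GROUNDED,
    "high_risk_decision": Criterion.UNEQUIVOCAL,
}


def is_acceptable(highest_satisfied: Criterion, application: str) -> bool:
    """A statement is acceptable if it reaches the level the application demands."""
    return highest_satisfied >= REQUIRED[application]


# A Grounded statement is good enough for an assistant but not for high-risk use.
print(is_acceptable(Criterion.GROUNDED, "ai_assistant"))
print(is_acceptable(Criterion.GROUNDED, "high_risk_decision"))
```

Because the criteria are progressive, a single ordered value per statement is enough; each application only needs to choose its own threshold.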
2. Sequential Decision Annotation Protocol
- Function: Organizes the annotation process into a sequential decision-making structure where annotators progressively determine the highest criterion a statement satisfies, ultimately classifying it into four categories: Misleading, Speculative, Reliable, or Unequivocal.
- Mechanism: Annotators first evaluate whether a statement is Rational; if it passes, they evaluate whether it is Grounded; if that passes, they evaluate whether it is Unequivocal. Each step requires only a binary decision (Yes/No), reducing cognitive load.
- Design Motivation: Direct multi-class annotation reaches an inter-annotator agreement (IAA) of only 91.51%. The sequential decision framework raises the IAA to 96.19% and cuts quality control effort by 48% (13 QA checks per 500 cases instead of 25).
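The sequential protocol can be sketched as a chain of yes/no questions that stops at the first "No". This is a minimal sketch under one assumption: the mapping of the four category names to the stage at which a statement fails (Misleading, then Speculative, then Reliable, then Unequivocal) is inferred from the order in which the text lists them. The function name and the callable-based interface are hypothetical.

```python
from typing import Callable

Question = Callable[[], bool]


def classify_statement(
    is_rational: Question,
    is_grounded: Question,
    is_unequivocal: Question,
) -> str:
    """Ask binary questions in order and stop at the first 'No'.

    The category-to-stage mapping is an assumption inferred from the text.
    """
    if not is_rational():
        return "Misleading"
    if not is_grounded():
        return "Speculative"
    if not is_unequivocal():
        return "Reliable"
    return "Unequivocal"


# Example: a plausible statement that cannot be derived from the context.
label = classify_statement(lambda: True, lambda: False, lambda: True)
print(label)  # Speculative
```

Passing the questions as callables makes the short-circuiting explicit: once a statement fails a criterion, the stricter questions are never asked, which is how each step stays a single binary decision.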
3. Automated Annotation Pipeline with Contrastive & Formatted Prompting + Multiple Sampling
- Function: Employs GPT-4 as an automatic annotator to generate sentence-level hallucination labels for large-scale dialogue data, producing CogniBench-L (24k+ dialogues, 234k+ annotated sentences).
- Mechanism: Improves automated annotation quality in two steps. (a) Contrastive & Formatted Prompting (CFP): first diagnoses the annotation errors LLMs commonly make, then supplies positive and negative contrastive examples to remove ambiguity, and uses HTML tags so that all sentences can be annotated in one batch. (b) Multiple Sampling Voting: samples each instance 5 times and takes a majority vote as the final label, filtering out sporadic hallucination judgments.
- Design Motivation: Manual annotation is expensive and struggles to cover rapidly iterating new models. Synthesized hallucination data (e.g., FAVA) suffers from distribution shifts compared to real-world scenarios. This pipeline offers a low-cost method to evaluate any new model, and the generated CogniBench-L can be used to train specialized detectors.
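The voting step can be sketched as follows. This is a minimal illustration, not the paper's pipeline: it assumes each sampled annotation has already been parsed into one boolean flag per sentence (True meaning "hallucinated"), and the `aggregate_votes` name and `threshold` parameter are hypothetical. With 5 samples, the default threshold of 3 corresponds to a simple majority.

```python
from typing import List, Optional


def aggregate_votes(
    samples: List[List[bool]],
    threshold: Optional[int] = None,
) -> List[bool]:
    """Combine several sampled annotations into one label per sentence.

    samples[k][i] is True if sample k flagged sentence i as hallucinated.
    A sentence is flagged when at least `threshold` samples agree;
    the default is a strict majority of the samples.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    n_sentences = len(samples[0])
    if any(len(s) != n_sentences for s in samples):
        raise ValueError("all samples must cover the same sentences")
    if threshold is None:
        threshold = len(samples) // 2 + 1
    counts = [sum(s[i] for s in samples) for i in range(n_sentences)]
    return [c >= threshold for c in counts]


# Five sampled annotations over three sentences.
runs = [
    [True, False, False],
    [True, False, True],
    [True, False, False],
    [False, True, True],
    [True, False, False],
]
print(aggregate_votes(runs))               # [True, False, False]
print(aggregate_votes(runs, threshold=2))  # [True, False, True]
```

Lowering the threshold flags more sentences, which trades precision for recall; a judgment that appears in only one of the five samples is discarded either way.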
Loss & Training
CogniDet is fine-tuned from Llama3 8B Instruct with the standard causal language modeling loss. The input is a context-response pair, and the output is a generated list of hallucinated sentences (covering three types: invented, speculative, and misleading), so detection completes in a single forward pass. Training hyperparameters: 3 epochs, batch size 2, learning rate \(5 \times 10^{-5}\), trained on 8 NVIDIA A6000 GPUs for approximately 18 hours.
Key Experimental Results
Main Results
Hallucination detection performance comparison (sentence-level F1):
| Method | Type | Overall F1 | Factual Hallu F1 | Cognitive Hallu F1 |
|---|---|---|---|---|
| ChatGPT-3.5 | Prompting | 48.54 | 22.98 | 56.57 |
| ChatGPT-4 | Prompting | 58.03 | 46.82 | 66.04 |
| Tasksource | NLI | 26.87 | 27.10 | 26.75 |
| SelfCheckGPT | NLI | 45.81 | 32.08 | 61.10 |
| FAVA | E2E | 7.90 | 12.90 | 5.10 |
| RAGTruth | E2E | 23.90 | 45.30 | 11.20 |
| Auto-Labeling | Ours | 82.20 | 82.50 | 81.90 |
| CogniDet 8B | Ours | 70.30 | 64.40 | 73.80 |
Ablation Study
Automated annotation pipeline ablation (evaluated on the human-annotated CogniBench):
| Configuration | Overall Recall | Overall Precision | Factual Recall | Factual Precision | Cognitive Recall | Cognitive Precision |
|---|---|---|---|---|---|---|
| Auto-Labeling (t=2) | 77.98 | 87.76 | 74.75 | 91.05 | 78.56 | 85.55 |
| Auto-Labeling (t=3) | 75.88 | 89.63 | 72.72 | 91.70 | 76.43 | 87.83 |
| − Sampling | 67.72 | 88.05 | 67.98 | 89.50 | 66.76 | 86.33 |
| − CFP | 60.49 | 85.11 | 53.69 | 85.26 | 62.65 | 84.29 |
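The ablation table reports recall and precision, while the main results table reports F1. The two can be related through the harmonic mean, sketched below with the overall precision and recall values copied from the table above. The `f1_score` helper and the `rows` dictionary are illustrative names, not part of the paper's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (overall precision, overall recall) per configuration, from the ablation table.
rows = {
    "Auto-Labeling (t=2)": (87.76, 77.98),
    "Auto-Labeling (t=3)": (89.63, 75.88),
    "without Sampling": (88.05, 67.72),
    "without CFP": (85.11, 60.49),
}

for name, (precision, recall) in rows.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.2f}")
```

For the t=3 row this gives roughly 82.2, close to the 82.20 overall F1 listed for Auto-Labeling in the main results table; the t=2 row comes out at roughly 82.6, since its higher recall offsets its lower precision.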
Key Findings
- Cognitive hallucination rate is far higher than factual hallucination rate: The overall hallucination rate of cognitive statements is 64.8% (62.2% speculative + 2.6% misleading), while that of factual statements is only 13.9%, a gap of about 4.6 times.
- Longer dialogues contain more cognitive statements: The number of factual statements decreases as the dialogue progresses, while the share of cognitive statements grows from about 15% in the first turn to about 50% in the final turn.
- Different distributions of hallucination positions: Factual hallucinations tend to appear in the middle of a response, whereas cognitive hallucinations tend to occur at the beginning or end of a response.
- Significant variation in cognitive preferences across models: For GPT-4, factual statements account for 66.3% of statements (with a cognitive hallucination rate of 60.1%); for Gemini-Pro, cognitive statements account for 49.9% (with a cognitive hallucination rate as high as 79.9%); Claude-3.5 demonstrates the highest factual faithfulness (only 17.3% factual hallucination).
- Existing detectors fail heavily on cognitive hallucinations: FAVA achieves an F1 of 12.9% on factual hallucinations but only 5.1% on cognitive hallucinations; RAGTruth drops from 45.3% to 11.2%.
- CogniDet performance scales log-linearly with data size: The relationship between training data volume and detection F1 is log-linear, validating the value of large-scale automatically annotated data.
- Sequential decision annotation framework outperforms independent classification: It improves the IAA from 91.51% to 96.19% and reduces quality control workload by 48%.
Highlights & Insights
- Highly inspiring legal analogy: Adapting legal determination standards for circumstantial evidence to evaluate LLM cognitive faithfulness is an elegant and persuasive interdisciplinary analogy that provides both a theoretical foundation and guidance for annotation protocol design.
- First systematic quantification of "cognitive hallucinations": Formally defines and measures hallucination issues of LLMs in high-order cognitive tasks such as reasoning and evaluation, filling the gap beyond factual faithfulness.
- Hierarchical grading matches diverse application needs: The three-level standard (Rational/Grounded/Unequivocal) allows users to flexibly select thresholds based on application risk levels, balancing creativity and safety.
- Automated annotation pipeline can substitute for humans: Auto-Labeling achieves an 82.2% F1 score, approaching human performance, which significantly reduces the cost of evaluating new models.
Limitations & Future Work
- Domain limitation: The current data comes from the general commonsense domain (Wikipedia) and does not cover high-risk specialized domains such as medicine and finance; extending to them would require the participation of domain experts.
- Granularity of cognitive levels can be further refined: Based on Bloom's Taxonomy, the statements are only categorized into factual and cognitive, without distinguishing details of faithfulness across subcomponents (e.g., reasoning, explanation, evaluation).
- Cultural applicability of the legal analogy: Legal concepts vary across jurisdictions, so the generalizability of the framework remains to be verified.
- Knowledge source bias: Relying on Wikipedia as the sole knowledge source may introduce systematic bias; it can be replaced with more diverse corpora in the future.
- Small scale of CogniDet: Fine-tuned only on an 8B model; larger models may yield further improvements.
Related Work & Insights
- RAGTruth / FAVA: Pioneers in token-level and fine-grained hallucination detection, respectively, but they focus on factual consistency. The cognitive dimension of this work is an important complement.
- SelfCheckGPT: Zero-resource hallucination detection based on multiple sampling, which inspired the multiple sampling voting strategy in this work.
- RefGPT: A framework for generating high-quality knowledge dialogues, used in this work to construct dialogue data.
- Bloom's Taxonomy: The theory of cognitive domains in education supplies the theoretical foundation for classifying factual and cognitive statements.
- Legal Evidence Theory: The three-level standard (Rational-Grounded-Unequivocal) for determining circumstantial evidence directly inspired the evaluation framework design in this work.
Rating
- Novelty: ⭐⭐⭐⭐ The legal analogy is highly novel, systematically defining cognitive faithfulness grading criteria for the first time.
- Value: ⭐⭐⭐⭐ Dataset and detector are open-sourced, and the automated annotation pipeline can be directly applied to evaluate new models.
- Experimental Thoroughness: ⭐⭐⭐⭐ Rigorously designed annotation protocol (96.19% IAA), supported by comprehensive ablation experiments.
- Writing Quality: ⭐⭐⭐⭐ Cognitive hallucination represents a key bottleneck for safe deployment of LLMs, and this paper opens up an important dimension for evaluation.