GSAP-ERE: Fine-Grained Scholarly Entity and Relation Extraction Focused on Machine Learning

Conference: AAAI 2026
arXiv: 2511.09411
Authors: Wolfgang Otto, Lu Gan, Sharmila Upadhyaya, Saurav Karmakar, Stefan Dietze (GESIS)
Code: https://data.gesis.org/gsap/gsap-ere
Area: Graph Learning
Keywords: Scholarly Information Extraction, Named Entity Recognition, Relation Extraction, Knowledge Graph, ML Reproducibility, Fine-Grained Annotation

TL;DR

This paper introduces GSAP-ERE — a fine-grained scholarly entity and relation extraction dataset for the machine learning domain, comprising 10 entity types and 18 relation types, with 63K entities and 35K relations annotated across 100 full-text papers. Experiments show that fine-tuned models (NER: 80.6%, RE: 54.0%) substantially outperform LLM prompting approaches (NER: 44.4%, RE: 10.1%).

Background & Motivation

Problem Background

Machine learning research is advancing rapidly, yet reproducibility remains a persistent problem: prior studies have found that only 4% of ML papers can be reproduced without assistance from the original authors. Understanding the dependencies among models, datasets, and tasks is critical for improving research reproducibility and reusability. Scholarly Information Extraction (Scholarly IE) offers a scalable path to constructing knowledge graphs and monitoring research reproducibility by automatically extracting entities and their relations from papers.

Limitations of Prior Work

  • Coarse-grained entity types: SciERC defines only 6 entity types; SciER only 3 (Dataset, Task, Method), making it impossible to identify key meta-information such as model architectures and data sources.
  • Limited annotation scope: ScienceIE annotates only paragraphs; SciERC and SemEval 2018 annotate only abstracts, failing to cover the diverse linguistic styles found in full text.
  • Poor direct LLM performance: Existing LLMs fall far short of fine-tuned models on fine-grained domain-specific IE tasks and are unsuitable for high-quality scholarly IE.
  • Lack of informal entity annotation: SciER includes only explicitly named entities, ignoring a large number of informal mentions (e.g., "the model", "this dataset"), leading to incomplete relation annotation.
  • Incomplete relation coverage: Existing datasets define at most 9 relation types (SciER), insufficient to capture multi-dimensional relations such as model design, data provenance, and peer comparison.

Core Motivation

To construct a comprehensive dataset with full-text coverage and fine-grained entity and relation annotations, supporting downstream tasks ranging from knowledge graph construction to AI research reproducibility monitoring.

Method

Data Model Design

Extending the GSAP-NER dataset, the paper defines a complete entity-relation schema:

10 Entity Types (three major categories):

  • ML model-related: MLModel, ModelArchitecture, MLModelGeneric, Method, Task
  • Dataset-related: Dataset, DatasetGeneric, DataSource
  • Others: ReferenceLink, URL

18 Relation Types (seven semantic groups):

  1. Model Design: usedFor, architecture, isBasedOn — capturing compositional and derivation relations among models/methods
  2. Task Binding: appliedTo, benchmarkFor — linking models/datasets to tasks
  3. Data Usage: trainedOn, evaluatedOn — training and evaluation dependencies
  4. Data Provenance: sourcedFrom, transformedFrom, generatedBy — data origins and transformations
  5. Data Properties: size, hasInstanceType — dataset scale and modality
  6. Peer Relations: coreference, isPartOf, isHyponymOf, isComparedTo — relations among entities of the same type
  7. Referencing: citation, url — links to external sources
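To make the schema concrete, here is a minimal Python sketch of the data model. The entity labels, relation labels, and semantic groups mirror the schema above; the class layout itself is illustrative and is not the paper's release format.

```python
from dataclasses import dataclass
from enum import Enum

class EntityType(Enum):
    # ML model-related
    ML_MODEL = "MLModel"
    MODEL_ARCHITECTURE = "ModelArchitecture"
    ML_MODEL_GENERIC = "MLModelGeneric"   # informal mention, e.g. "the model"
    METHOD = "Method"
    TASK = "Task"
    # Dataset-related
    DATASET = "Dataset"
    DATASET_GENERIC = "DatasetGeneric"    # informal mention, e.g. "this dataset"
    DATA_SOURCE = "DataSource"
    # Others
    REFERENCE_LINK = "ReferenceLink"
    URL = "URL"

# The 18 relation labels, keyed by the paper's seven semantic groups.
RELATION_GROUPS = {
    "model_design":    ["usedFor", "architecture", "isBasedOn"],
    "task_binding":    ["appliedTo", "benchmarkFor"],
    "data_usage":      ["trainedOn", "evaluatedOn"],
    "data_provenance": ["sourcedFrom", "transformedFrom", "generatedBy"],
    "data_properties": ["size", "hasInstanceType"],
    "peer_relations":  ["coreference", "isPartOf", "isHyponymOf", "isComparedTo"],
    "referencing":     ["citation", "url"],
}

@dataclass(frozen=True)
class Mention:
    start: int          # character offset within the sentence
    end: int            # exclusive end offset
    label: EntityType

@dataclass(frozen=True)
class Relation:
    head: Mention
    tail: Mention
    label: str          # one of the 18 labels above
```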

Annotation Strategy

A two-phase "annotate-then-refine" strategy is adopted:

  • Annotation phase: Two student annotators with computer science backgrounds; 10 papers doubly annotated, 90 papers singly annotated, using the INCEpTION platform.
  • Refinement phase: Two PhD students and two postdoctoral researchers review annotation alignment, extract inconsistency patterns, and apply corrections.

Evaluation Setup

Four levels of RE evaluation strictness are defined:

  • RE+: Strict matching of entity types, relation labels, and entity boundaries.
  • RE: Matching of relation labels and entity boundaries only, without entity type constraints.
  • RE+≈: Strict entity type and relation label matching, with partial overlap allowed for entity boundaries.
  • RE≈: A correct relation label with overlapping entity boundaries suffices.
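A minimal sketch of the four strictness levels, assuming relations are stored as (head span, head type, label, tail span, tail type) tuples; the function and constant names are assumptions, but the matching logic follows the definitions above.

```python
def spans_overlap(a, b):
    """True if character spans a = (start, end) and b overlap at all."""
    return a[0] < b[1] and b[0] < a[1]

def relation_match(gold, pred, require_types, exact_boundaries):
    """Check one predicted relation against one gold relation.

    gold/pred: (head_span, head_type, label, tail_span, tail_type)
    """
    g_head, g_htype, g_label, g_tail, g_ttype = gold
    p_head, p_htype, p_label, p_tail, p_ttype = pred
    if g_label != p_label:
        return False                 # the relation label must always match
    if require_types and (g_htype != p_htype or g_ttype != p_ttype):
        return False                 # "+" variants also require correct entity types
    if exact_boundaries:
        return g_head == p_head and g_tail == p_tail
    return spans_overlap(g_head, p_head) and spans_overlap(g_tail, p_tail)

# The four strictness levels, usable as relation_match(gold, pred, **RE_PLUS):
RE_PLUS        = dict(require_types=True,  exact_boundaries=True)   # RE+
RE_ONLY        = dict(require_types=False, exact_boundaries=True)   # RE
RE_PLUS_APPROX = dict(require_types=True,  exact_boundaries=False)  # RE+≈
RE_APPROX      = dict(require_types=False, exact_boundaries=False)  # RE≈
```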

Baseline Models

  • Supervised Pipeline: PL-Marker — NER followed by RE, using Packed Levitated Markers to model entity-pair interactions.
  • Supervised Joint Model: HGERE — built on the PL-Marker framework with a hypergraph neural network, jointly optimizing NER and RE.
  • LLM Prompting: Qwen 2.5 (32B/72B) and LLaMA 3.1 (70B), using a two-stage pipeline prompt (NER then RE).
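The paper's exact prompts are not reproduced in this summary; the sketch below only illustrates the two-stage structure of the NER-then-RE pipeline. The prompt wording and the llm stub are assumptions, standing in for a Qwen 2.5 or LLaMA 3.1 chat-completion call.

```python
# Hypothetical two-stage pipeline; `llm` is a placeholder for any
# chat-completion client, not an API from the paper.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in the model inference call here")

NER_PROMPT = """Extract all scholarly entities from the sentence below.
Allowed types: MLModel, ModelArchitecture, MLModelGeneric, Method, Task,
Dataset, DatasetGeneric, DataSource, ReferenceLink, URL.
Return one mention<TAB>type pair per line.

Sentence: {sentence}"""

RE_PROMPT = """Given the sentence and its entities, list every relation as
head<TAB>label<TAB>tail. Allowed labels: usedFor, architecture, isBasedOn,
appliedTo, benchmarkFor, trainedOn, evaluatedOn, sourcedFrom, transformedFrom,
generatedBy, size, hasInstanceType, coreference, isPartOf, isHyponymOf,
isComparedTo, citation, url.

Sentence: {sentence}
Entities: {entities}"""

def extract(sentence: str):
    entities = llm(NER_PROMPT.format(sentence=sentence))    # stage 1: NER
    relations = llm(RE_PROMPT.format(sentence=sentence,     # stage 2: RE
                                     entities=entities))
    return entities, relations
```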

Key Experimental Results

Experiment 1: Supervised Models vs. LLM Prompting

| Method | Model | NER | NER≈ | RE | RE≈ | RE+ | RE+≈ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Supervised Joint | HGERE | 80.6 | 85.8 | 54.0 | 59.8 | 46.9 | 51.3 |
| Supervised Pipeline | PL-Marker | 72.6 | 77.7 | 41.4 | 46.2 | 36.3 | 39.9 |
| LLM Pipeline | Qwen 2.5 72B | 44.4 | 59.1 | 10.1 | 15.7 | 8.2 | 11.9 |
| LLM Pipeline | Qwen 2.5 32B | 42.0 | 56.9 | 7.2 | 14.6 | 7.2 | 10.9 |
| LLM Pipeline | LLaMA 3.1 70B | 40.5 | 55.0 | 6.4 | 9.6 | 5.7 | 7.8 |

Supervised HGERE outperforms all baselines across all metrics: NER exceeds the best LLM by 36.2 percentage points, and RE by 43.9 percentage points. In terms of inference speed, PLM-based methods are 182× faster than LLMs (4 minutes vs. 12.5 hours).

Experiment 2: Effect of Few-Shot Example Selection Strategy on NER (Qwen2.5 32B, validation set)

| Selection strategy | k=0 | k=1 | k=2 | k=5 | k=10 | k=20 |
| --- | --- | --- | --- | --- | --- | --- |
| random (micro-F1) | 19.1 | 24.7 | 23.1 | 29.7 | 34.1 | 34.4 |
| similar+diverse (micro-F1) | 19.1 | 34.7 | 38.2 | 40.4 | 40.9 | 27.8 |
| random (NER≈ micro-F1) | 33.0 | 41.8 | 37.1 | 50.1 | 53.3 | 50.3 |
| similar+diverse (NER≈ micro-F1) | 33.0 | 53.8 | 56.7 | 58.1 | 58.4 | 39.4 |

The similar+diverse strategy peaks at k=10, outperforming random selection by roughly 5-7 points (40.9 vs. 34.1 micro-F1; 58.4 vs. 53.3 under NER≈), and performance drops sharply at k=20. For RE, the best configuration is k=1 (micro-F1: 10.7%); adding more examples is counterproductive.
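The selection procedure is not spelled out in this summary; below is a plausible sketch of a similar+diverse strategy in the spirit of maximal marginal relevance: prefer pool examples whose embeddings are similar to the query sentence, while penalizing redundancy with examples already picked. The function names and the lam trade-off weight are assumptions.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_examples(query_vec, pool_vecs, k, lam=0.5):
    """MMR-style similar+diverse selection over precomputed sentence embeddings.

    query_vec: embedding of the sentence to be annotated
    pool_vecs: embeddings of the candidate few-shot examples
    lam:       trade-off between similarity to the query and diversity
    Returns indices into pool_vecs, in selection order.
    """
    chosen = []
    candidates = list(range(len(pool_vecs)))
    while candidates and len(chosen) < k:
        def score(i):
            sim_query = cosine(query_vec, pool_vecs[i])
            sim_chosen = max((cosine(pool_vecs[i], pool_vecs[j]) for j in chosen),
                             default=0.0)
            return lam * sim_query - (1.0 - lam) * sim_chosen
        best = max(candidates, key=score)   # most similar, least redundant
        chosen.append(best)
        candidates.remove(best)
    return chosen
```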

Dataset Scale Comparison

| Dataset | Annotation unit | Papers | Entity types | Relation types | Entities | Relations | Relations/paper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GSAP-ERE | Full text | 100 | 10 | 18 | 62,619 | 35,302 | 353.0 |
| SciER | Full text | 106 | 3 | 9 | 24,518 | 12,083 | 114.0 |
| SciERC | Abstract | 500 | 6 | 7 | 8,094 | 4,648 | 9.3 |
| SemEval18 | Abstract | 500 | - | 6 | 7,505 | 1,583 | 3.3 |
| ScienceIE | Paragraph | 500 | 3 | 2 | 9,946 | 672 | 3.1 |

GSAP-ERE surpasses all existing datasets in entity count, relation count, type richness, and annotation density.

Highlights & Insights

  • Largest scholarly IE dataset: 63K entities + 35K relations, with 18 relation types covering 7 semantic dimensions; annotation density (353 relations/paper) far exceeds comparable datasets.
  • Fine-grained data model: Distinguishes formal and informal entity mentions, captures multi-dimensional relations including model design, data provenance, and peer comparison, directly supporting ML research reproducibility monitoring.
  • Full-text annotation: Compared to datasets annotating only abstracts or paragraphs, full-text annotation covers richer linguistic styles and information.
  • Rigorous quality control: A two-stage annotate-then-refine pipeline achieves inter-annotator NER agreement of 0.82 macro-F1 (a computation sketch follows this list), with significant improvement after refinement.
  • Revealing LLM limitations: Empirical results demonstrate that even the strongest current LLMs lag far behind fine-tuned models on fine-grained scholarly IE (an RE gap of 43.9 percentage points), providing strong evidence for the necessity of domain-specific datasets.
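The agreement number is a macro-F1, presumably computed over the 10 doubly annotated papers. One common way to compute F1-based agreement, assumed here, is to treat one annotator as gold, score the other per entity type, and macro-average:

```python
def macro_f1_agreement(ann_a, ann_b, entity_types):
    """Macro-averaged F1 agreement, treating annotator A as gold.

    ann_a, ann_b: sets of (sentence_id, start, end, entity_type) mention tuples.
    """
    def f1(gold, pred):
        tp = len(gold & pred)                 # exact mention+type matches
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gold) if gold else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    per_type = []
    for t in entity_types:
        gold = {m for m in ann_a if m[3] == t}
        pred = {m for m in ann_b if m[3] == t}
        per_type.append(f1(gold, pred))
    return sum(per_type) / len(per_type)
```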

Limitations & Future Work

  • Domain scope: Coverage is limited to ML and applied ML papers; generalizability to other disciplines (e.g., biomedicine, physics) remains to be verified.
  • Sentence-level annotation: Current annotation supports only sentence-level entities and relations, lacking document-level cross-sentence relation annotation, making it impossible to capture long-range dependencies.
  • Dataset size: With only 100 papers and 80 in the training set, the dataset may be insufficient for deep learning models.
  • Low RE performance ceiling: Even the best supervised model achieves only 46.9% RE+ F1, indicating either extremely high task difficulty or the need for further refinement of the data model.
  • Variable inter-annotator agreement: RE+ consistency for the Model Design semantic group is only 38.4%, reflecting ambiguous boundary definitions for some relations.
  • Nested relations not addressed: Although the dataset contains nested and overlapping entities, the impact of these complex structures on model performance is not thoroughly analyzed.
Related Work Comparison

  • SciERC (Luan et al. 2018): 6 entity types + 7 relation types, annotating 500 abstracts only; GSAP-ERE substantially surpasses it in type richness and annotation density, and provides full-text annotation.
  • SciER (Zhang et al. 2024): 3 entity types + 9 relation types, excluding informal entities; GSAP-ERE retains informal mentions, improving relation coverage completeness.
  • DMDD (Pan et al. 2023): Automatically annotated via distant supervision, with no relation annotation; GSAP-ERE is manually curated with rich relation annotations.
  • SciREX (Jain et al. 2020): Relation annotation is limited to mention clustering rather than pairwise relations, ignoring contextual information.
  • PL-Marker (Ye et al. 2022): The pipeline method achieves NER 72.6% and RE 41.4% on GSAP-ERE, below the joint method HGERE.
  • HGERE (Yan et al. 2023): The joint method achieves the best performance on GSAP-ERE (NER 80.6%, RE+ 46.9%), validating the effectiveness of hypergraph networks for scholarly IE.
  • LLM prompting methods: Even with a 10-shot similar+diverse strategy, Qwen 2.5 72B and LLaMA 3.1 achieve RE below 11%, consistent with observations by Zhang et al. (2024) on SciER.

Rating

  • Novelty: ⭐⭐⭐⭐ — The first full-text scholarly IE dataset combining 10 fine-grained entity types and 18 semantically grouped relation types, filling a clear gap in the field.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Covers both supervised models and LLM prompting, with few-shot strategy ablations, though cross-domain generalization experiments are absent.
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured with detailed data model definitions and comprehensive comparison tables.
  • Value: ⭐⭐⭐⭐ — Provides a high-quality benchmark for ML reproducibility monitoring and knowledge graph construction, while exposing the limitations of LLMs on domain-specific IE.