SLIMER: Show Less, Instruct More - Enriching Prompts with Definitions and Guidelines for Zero-Shot NER

Conference: ECCV 2024
arXiv: 2407.01272
Code: HuggingFace
Area: NLP Understanding / Named Entity Recognition
Keywords: Zero-Shot NER, Instruction Tuning, LLMs, Entity Definitions, Prompt Engineering

TL;DR

SLIMER improves the zero-shot named entity recognition (NER) ability of LLMs by adding entity definitions and annotation guidelines to the prompt. Trained on only 391 entity categories, it performs comparably to state-of-the-art (SOTA) methods trained on 13,000+ entity categories on out-of-domain benchmarks, and it is stronger on entity tags never seen during training.

Background & Motivation

Background: Instruction tuning of LLMs for NER has become a dominant approach, with representative works including InstructUIE, UniNER, GoLLIE, and GNER, which achieve excellent performance in out-of-distribution (OOD) zero-shot NER.

Limitations of Prior Work: (1) Existing methods are fine-tuned on a very large number of entity categories (e.g., 13,020 in UniNER), so training and test tags overlap heavily; their "zero-shot" setting is therefore out-of-domain evaluation rather than evaluation on truly unseen entity types. (2) Most methods perform poorly on genuinely unseen named entities (unseen NE). (3) Only GoLLIE tries to guide the recognition of unseen entities, using Python class definitions, but these definitions have to be drafted by hand.

Key Challenge: Existing methods rely on a massive amount of overlapping training entity tags to "memorize" entity types instead of truly understanding entity concepts. They lack generalization capabilities when encountering completely new entity types.

Goal: To achieve true zero-shot NER with fewer training samples and fewer entity categories by injecting definitions and guidelines into the prompts.

Key Insight: Instead of forcing the model to "memorize" more entity types, "teach" it to interpret new entity types from their definitions: "Show Less, Instruct More".

Core Idea: Enriching prompt content with GPT-generated entity definitions and annotation guidelines, enabling the model to learn to recognize entities based on semantic descriptions rather than memorizing entity tags.

Method

Overall Architecture

SLIMER is an instruction-tuned, decoder-only LLM. The input prompt has three parts: (1) the task instruction and the text to annotate; (2) the definition of the target entity type; and (3) annotation guidelines that tell the model how to separate similar entity types. At inference time, each call targets a single entity type and generates all mentions of that type in the text.
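
A minimal sketch of how such a three-part prompt could be assembled, one prompt per entity type. The instruction wording, the `EntityTypeSpec` container, and the `build_prompt` helper are illustrative assumptions, not SLIMER's actual template; the example definition and guideline are the ones quoted under Key Designs in this note.

```python
from dataclasses import dataclass, field


@dataclass
class EntityTypeSpec:
    """Definition and guidelines for one entity type (hypothetical container)."""
    name: str
    definition: str
    guidelines: list[str] = field(default_factory=list)


def build_prompt(text: str, spec: EntityTypeSpec) -> str:
    """Assemble the three prompt parts for a single entity type."""
    guideline_block = "\n".join(f"- {g}" for g in spec.guidelines)
    return (
        # Part 1: task instruction and the text to annotate.
        f"Extract every mention of the entity type '{spec.name}' "
        "from the text below. Return a JSON list of strings; "
        "return [] if there are none.\n\n"
        f"Text: {text}\n\n"
        # Part 2: definition of the target entity type.
        f"Definition: {spec.definition}\n\n"
        # Part 3: annotation guidelines for edge cases.
        f"Guidelines:\n{guideline_block}\n"
    )


# Example values taken from the descriptions in this note.
spec = EntityTypeSpec(
    name="Algorithm",
    definition="An algorithm is a step-by-step procedure for solving a problem...",
    guidelines=["Do not confuse product names with organization names"],
)
print(build_prompt("We optimized the network with Adam.", spec))
```

Because each prompt names exactly one entity type, extracting several types from the same text means building and running one prompt per type.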

Key Designs

Definition-enriched Prompt:
- Function: Provides semantic definitions for each entity type, enabling the model to understand "what this entity is".
- Mechanism: For each named entity tag (e.g., "Algorithm"), GPT is used to generate a brief description/definition (e.g., "An algorithm is a step-by-step procedure for solving a problem..."), which is embedded into the prompt template. The model learns to semantically match the definition with spans in the text.
- Design Motivation: Conventional methods solely rely on entity tag names (e.g., "PER", "ORG"), which provide insufficient semantic details to generalize to brand-new tags. Definitions serve as type-level semantic anchors.
Annotation Guidelines:
- Function: Instructs the model on how to label edge cases and distinguish similar entities.
- Mechanism: Additional annotation rules are provided for each entity type (e.g., "Include the full name including titles", "Do not confuse product names with organization names"), which are also generated by GPT.
- Design Motivation: Definitions alone are insufficient to handle ambiguous cases; guidelines provide operational-level instructions similar to annotation manuals used by human annotators.
Few-category Training Strategy:
- Function: Achieves stronger generalization by training on fewer entity categories.
- Mechanism: The training set covers only 391 entity categories (vs. 13,020 in UniNER), and the training data is deliberately chosen to have minimal tag overlap with the test set. Synthetic data (GPT-generated NER samples) is also used for augmentation.
- Design Motivation: Reducing label overlap forces the model to rely on definitions and guidelines to understand entities instead of memorizing labels seen during training.
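
The notion of tag overlap behind the few-category strategy can be made concrete with a small check. The sketch below is only one possible way to measure it: it assumes that two tags count as the same when they match exactly after lowercasing and trimming, and the helper names are hypothetical. It is not the selection procedure used by the authors.

```python
def normalize(tag: str) -> str:
    """Lowercase and trim a tag so 'ORG ' and 'org' compare equal."""
    return tag.strip().lower()


def split_test_tags(train_tags, test_tags):
    """Partition test tags into those seen in training and those unseen."""
    seen_in_training = {normalize(t) for t in train_tags}
    seen, unseen = [], []
    for tag in test_tags:
        (seen if normalize(tag) in seen_in_training else unseen).append(tag)
    return seen, unseen


def overlap_ratio(train_tags, test_tags) -> float:
    """Fraction of test tags that also occur in the training tag set."""
    seen, unseen = split_test_tags(train_tags, test_tags)
    total = len(seen) + len(unseen)
    return len(seen) / total if total else 0.0


# Toy tag sets, invented for illustration only.
train = ["person", "Organization", "location"]
test = ["PERSON", "algorithm", "organization "]
seen, unseen = split_test_tags(train, test)
print("seen:", seen)
print("unseen:", unseen)
print("overlap ratio:", overlap_ratio(train, test))
```

A low overlap ratio means most test tags are unseen, which is the setting where the model has to rely on definitions and guidelines rather than on labels it encountered in training.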

Loss & Training

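Training uses the standard autoregressive language-modeling loss. At inference time, each entity type is queried independently, so the number of model calls is \(|\mathcal{X}| \times |\mathcal{Y}|\), reading \(\mathcal{X}\) as the set of input texts and \(\mathcal{Y}\) as the set of entity types to extract. This per-type design supports document-level NER and nested entities.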
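
The per-type inference loop and its call count can be sketched as follows. This is a sketch under stated assumptions: `generate` is a hypothetical callable standing in for the model call, the JSON-list output format is assumed rather than taken from this note, and `fake_generate` is a toy stub used only to make the example runnable.

```python
import json
from typing import Callable


def parse_mentions(raw_output: str) -> list[str]:
    """Parse a JSON list of strings; fall back to [] on malformed output."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    return [m for m in parsed if isinstance(m, str)]


def extract_all(texts, entity_types, generate: Callable[[str, str], str]):
    """Run one generation call per (text, entity type) pair."""
    results, calls = [], 0
    for text in texts:
        per_type = {}
        for entity_type in entity_types:
            per_type[entity_type] = parse_mentions(generate(text, entity_type))
            calls += 1
        results.append(per_type)
    return results, calls


def fake_generate(text: str, entity_type: str) -> str:
    """Toy stand-in for the LLM call (assumption, not a real model)."""
    if entity_type == "algorithm" and "Adam" in text:
        return '["Adam"]'
    return "[]"


texts = ["We used Adam.", "No entities here."]
entity_types = ["algorithm", "person"]
results, calls = extract_all(texts, entity_types, fake_generate)
print(results)
print("model calls:", calls)  # len(texts) * len(entity_types)
```

Since every type is extracted in its own call, the same span can be returned for more than one type, which is what makes overlapping or nested mentions expressible. The price is that the number of calls grows with the number of entity types.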

Key Experimental Results

Main Results

Model	MIT (avg F1)	CrossNER (avg F1)	BUSTER (unseen NE, F1)
UniNER	52.3	61.2	18.5
GoLLIE	48.7	58.4	35.2
GNER	50.1	59.8	15.3
SLIMER	49.8	57.6	38.7

Training entity categories: SLIMER 391 vs. UniNER/GNER 13,020.

Ablation Study

Configuration	CrossNER F1	BUSTER F1	Description
SLIMER (Def + Guide)	57.6	38.7	Full model
w/o Def & Guide (baseline)	51.2	25.3	Dramatic decrease without definitions and guidelines
Def only	55.1	33.4	Definitions contribute the most
Guide only	53.8	30.1	Guidelines also show significant contributions

Key Findings

Definitions and guidelines yield the most significant improvements on unseen entities (+13.4 F1 on BUSTER), demonstrating that the model learns to identify new entities based on semantic descriptions.
On OOD data (MIT, CrossNER), SLIMER reaches roughly 94-95% of the best baseline's F1 (49.8 vs. 52.3 on MIT, 57.6 vs. 61.2 on CrossNER) while using only about 1/33 of the training entity categories (391 vs. 13,020).
Learning curves indicate that the version enriched with definitions converges faster and more stably.
Synthetic data helps compensate for the disadvantage caused by the limited number of training entity categories.
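
The arithmetic behind these findings can be re-derived from the two tables above. The snippet below only recomputes ratios and differences from the F1 scores listed in this note; the variable and function names are illustrative.

```python
# F1 scores copied from the Main Results and Ablation Study tables of this note.
MAIN_RESULTS = {
    "UniNER": {"MIT": 52.3, "CrossNER": 61.2, "BUSTER": 18.5},
    "GoLLIE": {"MIT": 48.7, "CrossNER": 58.4, "BUSTER": 35.2},
    "GNER": {"MIT": 50.1, "CrossNER": 59.8, "BUSTER": 15.3},
    "SLIMER": {"MIT": 49.8, "CrossNER": 57.6, "BUSTER": 38.7},
}
ABLATION_BUSTER = {"full": 38.7, "baseline": 25.3, "def_only": 33.4, "guide_only": 30.1}
TRAIN_CATEGORIES = {"SLIMER": 391, "UniNER": 13020}


def ratio_to_best_baseline(results, model, benchmark):
    """Model score divided by the best score among the other models."""
    best_other = max(s[benchmark] for name, s in results.items() if name != model)
    return results[model][benchmark] / best_other


def gain_over_baseline(ablation, config):
    """F1 gain of a configuration over the no-definition, no-guideline baseline."""
    return round(ablation[config] - ablation["baseline"], 1)


for benchmark in ("MIT", "CrossNER"):
    ratio = ratio_to_best_baseline(MAIN_RESULTS, "SLIMER", benchmark)
    print(f"{benchmark}: {ratio:.1%} of the best baseline")

for config in ("full", "def_only", "guide_only"):
    print(f"{config}: +{gain_over_baseline(ABLATION_BUSTER, config)} F1 on BUSTER")

category_ratio = TRAIN_CATEGORIES["UniNER"] / TRAIN_CATEGORIES["SLIMER"]
print(f"UniNER uses about {category_ratio:.1f}x as many training categories")
```

On these numbers the MIT ratio is about 95% and the CrossNER ratio about 94%, the full model gains 13.4 F1 over the baseline on BUSTER, and the category ratio is about 33.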

Highlights & Insights

The "Less is More" Training Philosophy: Challenges the conventional wisdom that "more training categories = stronger generalization". With 391 training categories against 13,020, definitions and guidelines supply the information that broad label coverage would otherwise provide.
Low-cost Guideline Generation: Automatically generates definitions and guidelines using GPT, requiring almost zero human labor (vs. GoLLIE, which demands manually writing Python classes for each category).
True Zero-Shot Evaluation Protocol: Systematically distinguishes between OOD zero-shot and unseen NE zero-shot for the first time, establishing a more rigorous benchmark standard for NER evaluation.

Limitations & Future Work

Inference cost increases linearly with the number of entity types (requiring a separate inference pass for each type), leading to inefficiency in many-entity scenarios.
The quality of GPT-generated definitions and guidelines is not controlled, so they may introduce biases.
Evaluated only on English NER; multilingual scenarios remain to be validated.
Future work could explore Retrieval-Augmented Generation (RAG) to dynamically retrieve entity definitions, further reducing training dependencies.

Related Work & Insights

vs. UniNER/GNER: These models cover test labels by brute force, training on a massive number of entity categories, which results in high train-test overlap. SLIMER shows that the few-category + definition approach is feasible.
vs. GoLLIE: Also guided by definitions but requires labor-intensive manual drafting of Python classes; SLIMER automates this process through GPT prompts.
vs. GLiNER: Encoder-based architecture that cannot handle nested entities; SLIMER supports both nested and document-level entities under a generative paradigm.

Rating

Novelty: ⭐⭐⭐⭐ The "few-category + definition" strategy is novel, though the idea of definition-enriched prompting is not entirely new.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on both OOD and unseen NE, with clear ablation studies.
Writing Quality: ⭐⭐⭐⭐ Logical motivation with fair and detailed comparisons.
Value: ⭐⭐⭐⭐ Provides a practical solution for low-resource NER and genuine zero-shot scenarios.