Rank, Chunk, and Expand: Lineage-Oriented Reasoning for Taxonomy Expansion¶

Conference: ACL 2025
arXiv: 2505.13282
Code: Yes
Area: Others
Keywords: Taxonomy Expansion, LLM Structural Reasoning, Discriminative Ranking, Generative Reasoning, Taxonomic Paths

TL;DR¶

LORex proposes a plug-and-play taxonomy expansion framework that combines the discriminative ranker TEMPORA (taxonomic path verbalization based on Euler paths) and iterative LLM reasoning (semantic filtering \(\rightarrow\) parent retrieval \(\rightarrow\) path validation). Without fine-tuning LLMs, it achieves a 12% accuracy gain and a 5% Wu&P gain across four benchmarks.

Background & Motivation¶

Background¶

Taxonomies are hierarchical graph structures that capture "is-a" relations, widely used in search engines, recommendation systems, and advertising systems. However: - Real-world taxonomies are manually constructed by domain experts, which is costly and struggles to capture emerging concepts. - As new concepts constantly emerge, manual updates become increasingly impractical. - The Taxonomy Expansion task emerges: inserting new entities into appropriate positions within an existing seed taxonomy.

Limitations of Prior Work¶

Three generations of methods each have their shortcomings: - First Generation (lexical pattern matching + distributed embeddings): limited by scarce self-supervised annotation data. - Second Generation (structural summaries like mini-path, ego-net): insufficient training data on small-scale taxonomies, leading to poor generalization. - Third Generation (LLMs): GPT-4 suffers from high inference costs; fine-tuning LLaMA requires heavy resources; and they face context length limitations—either attempting to encode all candidates (infeasible) or using top-k selection (which may miss the correct answer).

Proposed Solution¶

LORex combines the advantages of discriminative ranking and generative reasoning: ranking candidates before processing them in chunks to avoid context overflow, while ensuring the correct answers are not missed through iterative reasoning.

Method¶

Overall Architecture¶

LORex (Figure 2) consists of four modules:

Input: Seed taxonomy T^o, New entity set C
→ TEMPORA ranking and chunking
→ Semantic filtering
→ Parent retrieval
→ Path validation
Output: Parent node for each new entity

Key Designs¶

TEMPORA Discriminative Ranker:
- Modified Euler Path: Extracted from the anchor node along the root path and expanded to sibling and child nodes. A modified Euler traversal is executed, allowing nodes to be visited multiple times.
- Path Verbalization: Interpretable relational phrases ("is parent of", "is child of") are used to replace special tokens (such as [SEP]) to enhance readability.
- Dual-Path Training Strategy: Training is conducted simultaneously on paths with query definitions \(P_v(q,P)\) and paths without definitions \(P_v(P)\). Positive samples minimize the distance between the two paths, while negative samples maximize it.
- Loss Function: \(\mathcal{L} = \mathcal{L}_m + \lambda_1 \cdot \mathcal{L}_+ + \lambda_2 \cdot \mathcal{L}_-\), where \(\mathcal{L}_m\) is the dynamic margin loss.
- Rank-then-Chunk: Candidates are ranked by their matching scores and then chunked into batches of size \(k\).
Semantic Filtering Module:
- Executes boolean reasoning on each candidate chunk to evaluate semantic consistency with the taxonomy/query.
- The LLM returns "Yes" to pass or "No" to discard.
- Effect: The average number of iterations drops from 3.1 to 1.7, and chunks containing the true parent node are rarely discarded incorrectly (only 3/42 on the Environment dataset).
Parent Node Retrieval Module:
- Performs hierarchical reasoning utilizing the candidate's taxonomic path (root \(\rightarrow\) anchor node path + ego network + definition).
- The LLM selects the most appropriate hypernym, returning "NOT FOUND" if none is suitable.
- Practical Issue: The LLM rarely returns "NOT FOUND" in practice, because pre-ranking aggregates semantically similar words into the same chunk.
Parent Node Validation Module:
- Core Innovation: Path-based LLM validation, rather than candidate-based discriminative validation.
- Computes the average token log probability of all candidate paths within the chunk as the ranking signal: \(P_j^* = \arg\max_{P_i} \frac{1}{n}\sum_{i=1}^{n} \log p(t_i|t_1,...,t_{i-1})\).
- Keeps the retrieved candidate path if it is consistent with the validation result; otherwise, removes the candidate and re-retrieves.
- Iterates until only two candidates remain in the chunk; if it still fails, discards the entire chunk and proceeds to the next chunk.

Loss & Training¶

TEMPORA only requires training for a few epochs (a simple retriever, not requiring exhaustive training).
The LLM component is completely zero-shot, using instruction-tuned models (such as LLaMA-3.1-8B-Instruct).
The inference phase only processes the top 3 chunks (15 candidates), dynamically covering 90% of queries where the correct answer lies in the top-15.

Key Experimental Results¶

Main Results - Taxonomy Expansion (Table 2)¶

Method	Env Acc	Env Wu&P	Sci Acc	Sci Wu&P	WordNet Acc	WordNet Wu&P
TEMP	45.5	77.3	43.5	76.3	24.6	61.2
TacoPrompt	56.2	82.1	53.1	76.3	44.8	72.3
FLAME	63.4	85.1	63.2	82.5	45.2	71.5
LORex8B-3.1I	67.3	82.9	64.7	87.4	49.5	84.5

TEMPORA Ranking Performance (Table 1)¶

Method	Env Hit@1	Env Hit@10	Sci Hit@1	Sci Hit@10
TEMP	0.403	0.654	0.459	0.612
TEMPORA	0.481	0.731	0.529	0.753

Ablation Study¶

Ablation Setting	Change in Acc	Wu&P Change
W/o Definition	-17.50%	-2.78%
W/o Path	-31.99%	-9.78%
W/o Path + Definition	-28.30%	-5.30%
Random Shuffle	-57.33%	-39.80%
Candidate-level Validation vs. Path-level Validation	-47.04% Acc	-10.84% Wu&P

Key Findings¶

LORex comprehensively outperforms SOTA: achieving a 12% accuracy gain and a 5% Wu&P gain over the strongest baseline FLAME.
TEMPORA ranking is crucial: randomly shuffling the ranking leads to a dramatic drop of 57.33% in accuracy, verifying that "the method is only as good as its ranker".
Paths are more important than definitions: removing paths causes a 31.99% decrease in accuracy, far exceeding the 17.50% drop from removing definitions.
Path validation is far superior to candidate validation: path validation improves accuracy by 47.04%.
The optimal chunk size is 4–10: too small lacks context, while too large introduces noise.
Semantic filtering is effective and safe: reducing the number of iterations by 45% while extremely rarely discarding true parent nodes.

Highlights & Insights¶

Hybrid Discriminative-Generative Paradigm: TEMPORA narrows the search space, while the LLM performs fine-grained reasoning, balancing both efficiency and accuracy.
Euler Path Verbalization: Translates structural information into natural language, enabling pre-trained models to directly understand hierarchical relations.
Path-Level Validation: Leverages token probability scores for path ranking, which is more stable than direct LLM generation.
Plug-and-Play: No LLM fine-tuning is required; achieving SOTA with an 8B parameter model is highly cost-effective.
Iterative Refinement: The retrieve-validate-retry iterative mechanism ensures that the correct answer is not lost due to a single failure.

Limitations & Future Work¶

Performance highly depends on the quality of the ranker ("the method is only as good as its ranker").
When the correct parent node is ranked low, multiple iterations lead to high latency and error accumulation.
Small instruction-tuned LLMs exhibit unstable outputs in certain scenarios.
Focuses solely on taxonomy expansion (leaf node insertion) rather than taxonomy completion (intermediate node discovery).
Future work: Exploring better ranking techniques, LLM ensemble reasoning, and extending the framework to taxonomy completion.

An improvement over the path encoding method of TEMP (Liu et al., 2021), incorporating sibling/child nodes and verbalization.
Inspired by structural reasoning methods like Think-on-Graph and Graph-CoT, applying iterative reasoning to taxonomic hierarchies.
FLAME (Mishra et al., 2024) is the direct state-of-the-art baseline, which LORex outperforms without fine-tuning LLMs.
Offers valuable insights for tasks like knowledge graph completion and ontology expansion.

Rating¶

Novelty: ⭐⭐⭐⭐ The combined framework of discriminative ranking + chunked generative reasoning + path validation is novel, and TEMPORA's Euler path verbalization is elegantly designed.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Highly comprehensive, covering 4 datasets, 12 baselines, 5 ablation studies, multiple LLM variants, and a case study.
Writing Quality: ⭐⭐⭐⭐ The framework description is clear and the algorithm pseudocode is complete, though formula notations are somewhat dense.
Value: ⭐⭐⭐⭐ The plug-and-play design offers direct value for real-world taxonomy maintenance and has broad applicability in the KG/ontology domain.