Multi-Facet Blending for Faceted Query-by-Example Retrieval¶

Information	Content
Conference	ACL 2025
arXiv	2412.01443
Code	GitHub
Area	others (Information Retrieval × Data Augmentation × LLM)
Keywords	faceted query-by-example, data augmentation, LLM, contrastive learning, retrieval

TL;DR¶

Proposes FaBle (Multi-Facet Blending), a data augmentation method that constructs condition-oriented training triplets through a three-stage process: facet decomposition, facet generation, and facet recomposition. Using only 1K source documents, FaBle synthesizes training pairs that significantly improve faceted QBE retrieval under data-scarce conditions, notably outperforming a strong baseline trained on over 1.3M data points in the most challenging "method" facet.

Background & Motivation¶

Faceted Query-by-Example (Faceted QBE): Traditional query-by-example retrieval uses the entire document as a query. However, real-world documents contain multiple facets (e.g., background/method/result of a paper), and users may only care about the similarity of a specific facet. Retrieving based on the whole document directly leads to irrelevant results.
Limitations of Prior Work (Dependency on Citation Annotations): Prior faceted QBE methods (e.g., SPECTER, ASPIRE) rely heavily on extensive citation graphs as weak supervision signals (1.3M+ co-citation data), which limits their applicability to domains lacking citation networks (e.g., education, legal).
Coarse-grained Document-level Comparisons: Citation-based methods perform document-level comparisons and fail to truly capture facet-specific constraints, especially showing suboptimal performance for complex facets like "method".
Goal: To design a method that synthesizes facet-specific training data without requiring citation annotations or pre-defined facet labels, leveraging only a small set of seed documents and a small open-source LLM.

Method¶

Overall Architecture¶

FaBle consists of three core stages (Figure 2):

Stage 1: Facet Decomposition - Uses LLaMA2-13B in a zero-shot manner to generate summaries for each facet of a document. - Given a document D, a summary prompt, and a facet name f ∈ {background, method, result}, it generates the facet summary Sᶠ. - The facet summaries serve as "indicators" for the subsequent generation stage, guiding the generation of facet-specific text.

Stage 2: Facet Generation - Self-feeding mechanism: Feeds the prompt and output from Stage 1 back into the same model. - Generates two types of facet segments: - Similar facet segment C_sim^f: text semantically similar to the original facet. - Dissimilar facet segment C_dis^f: text semantically dissimilar to the original facet. - Key Insight: Generating without decomposition directly leads to the mixing of irrelevant facets (Figure 3); the two-stage method guarantees target facet focus.

Stage 3: Facet Recomposition - Combines the generated similar/dissimilar facet segments with other facets to construct facet-conditioned positive and negative document pairs. - Positive document D^{f+}: Target facet uses the similar segment, other facets are chosen randomly. - Negative document D^{f-}: Target facet uses the dissimilar segment, other facets are chosen randomly. - One original document can generate 4 positive documents and 4 negative documents, which are paired to produce 40 triplets.

Loss & Training¶

Uses standard triplet loss (Triplet Loss):

\[L(D^{f;Q}, D^{f+}, D^{f-}) = \max\{d(D^{f;Q}, D^{f+}) - d(D^{f;Q}, D^{f-}) + m, 0\}\]

where d is a distance function and m is the margin hyperparameter. Finetuning is based on SciBERT-based SPECTER, without additional modeling tricks.

Hard Negative Generation¶

Dissimilar facets generated by the LLM might be too simple (easily distinguishable).
A MiniLM cross-encoder is deployed to score the semantic similarity between the generated dissimilar segments and the original facet summary.
Segments with similarity < 0.25 are treated as "easy negatives". By incorporating the current similarity score in the prompt, the LLM is guided to regenerate "hard negatives" with a similarity score in the range of 0.25-0.5.

Key Experimental Results¶

Dataset & Settings¶

Training Data: Randomly selects only 1,017 computer science abstracts from the 81.1M papers in S2ORC.
Generation per document: 40 facet-oriented triplets per facet \(\rightarrow\) total of ~40K pairs.
Evaluation Set: CSFCube (50 query-facet pairs, facet relevance rated 0-3).
Metrics: NDCG@20, MAP

Main Results (CSFCube Test Set)¶

Model	Background NDCG	Method NDCG	Result NDCG	Aggregated NDCG
SPECTER	66.70	37.41	56.67	53.28
SPECTER + FaBle	67.38	44.97	58.10	56.60
SPECTER-COCITE (1.3M)	70.03	45.99	59.95	58.38
SPECTER-COCITE + FaBle	70.09	49.14	60.88	59.79
ASPIRE (OT, 2.6M)	71.04	46.46	67.38	61.41

Key Findings: With only 1K seed documents, FaBle achieves a significant improvement of +7.6% NDCG and +3.5% MAP on the Method facet; FaBle + COCITE even outperforms ASPIRE (which uses 32x the data) in Method MAP.
The improvement on the Background facet is minor, as the background is highly correlated with the overall document, which coarse-grained methods can already handle effectively.

FEIR Educational Domain Evaluation¶

The paper also constructs the FEIR (Faceted Educational exam Item Retrieval) evaluation set: - Based on TOEFL-QA data, containing 122 test samples across three facets: Story / Question / Options. - 8 queries per facet. - FaBle also yields significant improvements on the educational domain: SPECTER + FaBle improves the Aggregated NDCG@20 from 54.50 to 59.25.

Ablation Study¶

Method	Method NDCG	Method MAP
COCITE	45.99	25.60
+ FaBle	49.14	30.90
+ FaBle-RN (Random Negatives)	46.82	28.62

Using the LLM-generated dissimilar facets as negative samples significantly outperforms randomly selected negative samples, validating the effectiveness of the Stage 2 generation strategy.

Data Scale Analysis¶

The performance of FaBle increases with the scale of training data, but it remains highly effective even under extremely low-data regimes.
The facet-specific enhancement is mainly reflected on the Method and Result facets.

Highlights & Insights¶

Modular Design: The decomposition-generation-recomposition three-stage design is elegant, with each step serving a clear purpose and motivation.
Extreme Data Efficiency: Achieves performance comparable to methods using over 1.3M data points with only 1K seed documents, demonstrating the massive potential of LLM-synthesized data.
Domain-Agnostic: Does not rely on citation labels or predefined facet knowledge, successfully extending from scientific paper retrieval to academic exam question retrieval.
Self-Feeding Strategy: Cleverly leverages the decomposition output of the same LLM to guide subsequent generation, eliminating the overhead of finetuning.
Hard Negative Mining: Controls the difficulty of generated negative samples by embedding numerical similarity scores into the prompts, showing a novel approach.
New Benchmark: Releases the FEIR faceted educational exam retrieval benchmark, filling a gap in the domain.

Limitations & Future Work¶

LLM Decomposition Quality: Zero-shot facet summarization using LLaMA2-13B is not always accurate, especially for non-English or highly specialized domains.
Limited Gains on Background Facet: Since the background is highly correlated with the whole document, the effect of FaBle's facet-specific enhancement is less pronounced here.
Pre-defined Facet Requirements: Although explicit labels are not needed, the facet names (e.g., background/method/result) still require manual specification.
Small Benchmark Scale: CSFCube has only 50 queries, and FEIR has only 8 queries per facet, limiting statistical power.
Only Validated on SPECTER Architecture: The method has not been evaluated on more modern dense embedding models (e.g., E5, GTE).

QBE Retrieval: SPECTER (Cohan et al., 2020) learns document embeddings based on citation graphs, and SciNCL enhances this via neighborhood contrastive learning.
Faceted QBE: ASPIRE (Mysore et al., 2021, 2022) utilizes 66K citation pairs + 2.6M co-citation sentences + optimal transport techniques.
LLM Data Augmentation: InPars (Luu et al., 2021) utilizes GPT-2 to generate queries; HyDE (Gao et al., 2023) leverages GPT-3 to generate hypothetical documents. Wang et al. (2023) use ChatGPT to label facet relevance scores, which is expensive and only used for evaluation.

Rating ⭐⭐⭐⭐¶

The proposed method is simple, effective, and highly data-efficient (1K vs 1.3M), achieving breakthroughs in the most challenging "method" facet. It also demonstrates strong domain generalization (validated on the educational domain). Limitations include limited improvements on the Background facet, a relatively small evaluation set, and the lack of evaluation on more modern embedding architectures. Overall, this is a solid work at the intersection of data augmentation and information retrieval.