ConceptCarve: Dynamic Realization of Evidence¶

Conference: ACL 2025
arXiv: 2504.07228
Code: Yes (dataset released on HuggingFace: ecaplan/conceptcarve)
Area: Information Retrieval / Social Computing
Keywords: Evidence Retrieval, Concept Trees, LLM Reasoning, Domain Adaptation, Moral Foundations Theory

TL;DR¶

Proposes ConceptCarve, a framework that uses LLMs to dynamically construct concept trees representing how evidence is concretely realized in different communities. It outperforms traditional retrieval systems on queries that involve inferential gaps and domain sensitivity (e.g., MAP@500 of 14.33 versus 7.10 for ColBERT).

Background & Motivation¶

Core Problem¶

Finding evidence of human views and behaviors at scale on social media is a highly challenging task. For example, studying the relationship between gun ownership and perceptions of "liberty" requires a retrieval system that can run on large-scale social media posts while addressing two key challenges:

Inferential Gap: A lack of vocabulary overlap between the query and relevant documents, requiring complex reasoning to establish a connection. Unlike a simple vocabulary gap (which can be resolved through synonym replacement), the inferential gap demands non-trivial reasoning capabilities.
Domain Sensitivity: The realization of evidence for the same query can vary drastically across different communities. For instance, "liberty" has distinct meanings and expressions in liberal versus conservative communities.

Limitations of Prior Work¶

Direct LLM Analysis: Having LLMs analyze each document individually yields high-quality judgments but is extremely expensive (costing thousands of dollars for hundreds of thousands of posts).
Traditional IR Models: Retrieve quickly but underperform in inferential-gap scenarios and do not adapt to specific domains.
Query Expansion Methods: Methods like Query2Doc never inspect retrieval results; they rely solely on the LLM's prior guess of what relevant documents look like.
Parametric Methods: Require training to adapt to specific domains, lacking flexibility.

Core Idea¶

The authors strive to bridge the gap between the inefficiency of LLMs and the limited reasoning power of IR engines, while ensuring adaptation to specific domains. The key insight is to leverage LLM reasoning capabilities to "carve" out a concrete representation of trend evidence without running LLM inference over the entire corpus.

Method¶

Overall Architecture¶

ConceptCarve is an evidence retrieval framework composed of two core components: Characterizer and Retriever. The Characterizer uses an LLM to interactively grow a concept tree, repeatedly using the Retriever to obtain intermediate results that guide the tree's construction.

Key Designs¶

Concept Tree
- The concept tree is a tree structure of weighted concepts, where each concept is represented by a set of "groundings" (query strings directly usable by traditional retrievers).
- Concepts with positive weights are "promoted", while those with negative weights are "demoted".
- By carefully adding promoted and demoted concepts, the tree can sculpt a precise representation of complex intents.
- This is analogous to carving a detailed physical representation from a rough block of material.
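
The concept tree described above can be pictured as a small recursive data structure. The following is a minimal sketch, not the authors' implementation: the class name `Concept`, its field names, and the toy concepts and weights are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Concept:
    """One node of a concept tree (names and fields are illustrative)."""
    name: str
    weight: float  # > 0 means promoted, < 0 means demoted
    groundings: List[str] = field(default_factory=list)  # retriever-ready query strings
    children: List["Concept"] = field(default_factory=list)

    @property
    def promoted(self) -> bool:
        return self.weight > 0

    def walk(self) -> Iterator["Concept"]:
        """Yield this concept and all of its descendants, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Toy tree for the gun-ownership / liberty example; concepts and weights are invented.
root = Concept("gun ownership as liberty", 1.0, ["owning a gun makes me feel free"])
root.children.append(
    Concept("self-defense rights", 0.5, ["I have a right to protect my family"])
)
root.children.append(
    Concept("hunting gear reviews", -0.5, ["best scope for deer season"])
)

for c in root.walk():
    print("promote" if c.promoted else "demote", "|", c.name, "|", c.weight)
```

Keeping groundings as plain strings is what lets any off-the-shelf retriever consume the tree without modification.
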
Retriever Module
- Uses an off-the-shelf retrieval engine E and the concept tree T for reranking or retrieval.
- The relevance score of document d to tree T is calculated as: \(\rho_T(d) = \sum_{c \in C} \sum_{g \in G_c} w(c) \times \rho_E(g, d)\)
- Where C is the set of all concepts in the tree, \(G_c\) is the set of groundings for concept c, and \(w(c)\) is the weight of the concept.
- Demoted concepts have negative weights, meaning documents related to demoted concepts receive lower scores.
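
The scoring rule above is a weighted sum over every grounding of every concept, so it can be sketched in a few lines. In this sketch the engine score \(\rho_E\) is replaced by a toy token-overlap function (`overlap_engine`, an assumption standing in for ColBERT or any other engine), and the concepts, weights, and documents are invented.

```python
from typing import Callable, List, Tuple

# Each concept is (weight, groundings). A flat list is enough here because
# the score sums over all concepts in the tree regardless of their position.
ConceptList = List[Tuple[float, List[str]]]


def overlap_engine(grounding: str, doc: str) -> float:
    """Toy stand-in for rho_E: fraction of grounding tokens found in the document."""
    g_tokens = set(grounding.lower().split())
    d_tokens = set(doc.lower().split())
    return len(g_tokens & d_tokens) / len(g_tokens) if g_tokens else 0.0


def tree_score(
    concepts: ConceptList,
    doc: str,
    engine: Callable[[str, str], float] = overlap_engine,
) -> float:
    """rho_T(d): sum over concepts c and groundings g of w(c) * rho_E(g, d)."""
    return sum(w * engine(g, doc) for w, groundings in concepts for g in groundings)


concepts = [
    (1.0, ["gun freedom"]),    # promoted concept
    (-0.5, ["deer hunting"]),  # demoted concept
]
docs = ["my gun is my freedom", "gun for deer hunting season"]

ranked = sorted(docs, key=lambda d: tree_score(concepts, d), reverse=True)
for d in ranked:
    print(round(tree_score(concepts, d), 2), d)
```

The second document matches the promoted concept only partially and the demoted concept fully, so its score is pulled down to zero and it ranks below the first.
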
Characterizer Module

The Characterizer recursively grows the concept tree through three high-level operations:

- **Ancestor Path Retrieval**: Retrieves documents using the ancestor path of the current concept as a subtree to obtain top-k relevant documents.
- **Envision/Explore**: Clusters retrieved documents using BERTopic, then prompts the LLM to identify clusters supporting or opposing the intent (explore), or to generate missing content that should support the intent (envision).
- **Concept Induction**: Distills clusters into concepts: the LLM extracts attributes from documents close to the cluster center and combines them into synthetic documents that serve as groundings for the new concept.
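
The three operations above form a recursive loop, which can be sketched roughly as follows. This is a schematic under heavy assumptions, not the paper's code: `retrieve`, `cluster`, `judge`, and `induce` are hypothetical hooks, the clustering stub groups by first token instead of using BERTopic, the LLM calls are replaced by keyword rules, and the envision step (LLM-generated missing content) is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    groundings: List[str]
    promoted: bool = True
    children: List["Node"] = field(default_factory=list)


CORPUS = [
    "gun rights protect my freedom",
    "gun rights are a basic liberty",
    "rifle and gun sale this week",
    "rifle scopes and gun cases on sale",
]


def retrieve(path: List[Node]) -> List[str]:
    """Stub retriever: keep documents sharing a token with any grounding on the path."""
    vocab = {t for n in path for g in n.groundings for t in g.split()}
    return [d for d in CORPUS if vocab & set(d.split())]


def cluster(docs: List[str]) -> List[List[str]]:
    """Stub clustering (the paper uses BERTopic): group documents by first token."""
    groups = {}
    for d in docs:
        groups.setdefault(d.split()[0], []).append(d)
    return list(groups.values())


def judge(node: Node, members: List[str]) -> str:
    """Stub for the LLM's explore judgment: a keyword rule."""
    text = " ".join(members)
    if "freedom" in text or "liberty" in text:
        return "support"
    return "oppose" if "sale" in text else "ignore"


def induce(members: List[str], promoted: bool) -> Node:
    """Stub for concept induction: reuse the cluster members as groundings."""
    return Node(members[0].split()[0], list(members), promoted)


def grow(node: Node, path: List[Node], depth: int, max_depth: int) -> None:
    if depth >= max_depth:
        return
    docs = retrieve(path + [node])          # 1. ancestor path retrieval
    for members in cluster(docs):           # 2. explore: cluster, then judge
        verdict = judge(node, members)
        if verdict == "ignore":
            continue
        child = induce(members, promoted=(verdict == "support"))  # 3. induction
        node.children.append(child)
        grow(child, path + [node], depth + 1, max_depth)


root = Node("guns and liberty", ["gun ownership and liberty"])
grow(root, [], depth=0, max_depth=1)
print([(c.name, c.promoted) for c in root.children])
```

Because the LLM is only consulted on clusters of retrieved documents, the number of LLM calls depends on the tree size rather than on the corpus size.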

Loss & Training¶

ConceptCarve does not require any training or fine-tuning. Its core advantages include:

Fixed LLM token budget: The cost of LLM inference does not grow with the corpus size, requiring approximately 20,000 tokens per tree.
Weight assignment strategy: A child concept's weight is smaller than its parent's, sibling weights are equal, and weights are normalized globally. Intuitively, subconcepts can only partially offset their parent concept.
The retrieval cost is \(O(|C| \times \gamma)\), where \(|C|\) is the total number of concepts in the tree and \(\gamma\) is the number of groundings per concept, since the scoring sum issues one engine query per grounding.
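
To make the weight constraints concrete, here is one possible scheme that satisfies them. The summary does not give the exact formula, so the geometric `decay` factor and the normalization by the sum of absolute weights are assumptions; only the three stated properties (children weigh less than parents, siblings are equal in magnitude, weights are normalized globally) come from the text.

```python
from typing import Dict, Optional, Tuple

# node id -> (parent id or None for the root, promoted?)
Tree = Dict[str, Tuple[Optional[str], bool]]


def assign_weights(tree: Tree, decay: float = 0.5) -> Dict[str, float]:
    """Assumed scheme: magnitude shrinks by `decay` per level, sign marks promotion."""

    def magnitude(node: str) -> float:
        parent, _ = tree[node]
        return 1.0 if parent is None else decay * magnitude(parent)

    raw = {
        node: (1.0 if promoted else -1.0) * magnitude(node)
        for node, (_, promoted) in tree.items()
    }
    total = sum(abs(w) for w in raw.values())  # global normalization
    return {node: w / total for node, w in raw.items()}


tree: Tree = {
    "root": (None, True),
    "a": ("root", True),    # promoted child
    "b": ("root", False),   # demoted sibling of "a"
    "a1": ("a", True),      # grandchild
}
weights = assign_weights(tree)
for node, w in weights.items():
    print(node, round(w, 4))

# Retrieval cost: one engine query per grounding, following the scoring sum.
gamma = 3  # hypothetical number of groundings per concept
engine_calls = len(tree) * gamma
print("engine calls:", engine_calls)
```

Under this assumed scheme each extra level halves a concept's influence, so deep concepts contribute little to the final score.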

Key Experimental Results¶

Dataset Construction¶

Source: Reddit posts (acquired via Cornell ConvoKit)
6 community sub-datasets: Conservative/Liberal, Rural/Urban, Religious/Secular
30 complex, domain-sensitive trend queries (based on Moral Foundations Theory)
2000 posts per query-community pair for reranking

Main Results¶

| System | P@10 | R@10 | MAP@10 | P@500 | R@500 | MAP@500 |
| --- | --- | --- | --- | --- | --- | --- |
| BM25 | 13.20 | 0.70 | 0.30 | 12.70 | 27.50 | 3.80 |
| ColBERT | 26.10 | 1.30 | 0.60 | 16.70 | 34.80 | 7.10 |
| ANCE | 23.70 | 1.30 | 0.60 | 16.00 | 33.40 | 6.50 |
| RepLLaMA | 14.11 | 0.53 | 0.23 | 15.05 | 29.84 | 4.49 |
| Query2Doc + ColBERT | 37.28 | 2.20 | 1.33 | 19.59 | 42.43 | 11.37 |
| EnvisionOnly | 38.00 | 2.10 | 1.20 | 20.70 | 46.00 | 12.50 |
| ConceptCarve (depth 2) | 41.56 | 2.40 | 1.49 | 21.78 | 49.71 | 14.33 |
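
For readers less familiar with the column headers, the sketch below shows the standard definitions of precision and recall at a cutoff k; the document IDs are invented, and the reported figures appear to be percentages. MAP@k is left out because this summary does not state the exact variant the paper uses.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


ranked = ["d3", "d1", "d7", "d2", "d9"]  # system output, best first
relevant = {"d1", "d2", "d4"}            # gold labels for one query

print(precision_at_k(ranked, relevant, 2), recall_at_k(ranked, relevant, 2))
print(precision_at_k(ranked, relevant, 5), recall_at_k(ranked, relevant, 5))
```

Recall at k can never exceed k divided by the number of relevant documents, which is why R@10 stays small for every system even when P@10 is high.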

Ablation Study¶

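| Retriever | P@5 | P@10 | P@50 | P@100 | P@500 | P@1K |
| --- | --- | --- | --- | --- | --- | --- |
| ColBERT | 27.8 | 25.4 | 22.5 | 20.9 | 16.7 | 14.9 |
| CC (Promoted Only) | 30.8 | 34.2 | 29.8 | 25.8 | 19.8 | 17.9 |
| CC (Promoted + Demoted) | 34.2 | 32.9 | 30.7 | 26.9 | 20.4 | 18.0 |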

Key Findings¶

ConceptCarve (depth 2) reaches a MAP@500 of 14.33, a 120.46% relative improvement over the dense ANCE baseline (6.50) and a 26.03% relative improvement over the LLM-based expansion baseline Query2Doc + ColBERT (11.37).
LLM-based methods (including EnvisionOnly and Query2Doc) significantly outperform dense and lexical models, highlighting the capability of LLMs to bridge the inferential gap.
Depth-2 trees slightly outperform depth-1 trees, suggesting that exploring more concepts improves the representation of trends.
Demoted concepts show positive effects in end-to-end retrieval (full dataset retrieval) but have a negligible impact on reranking (a pre-filtered subset).
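
The relative-improvement percentages quoted in these findings can be reproduced from the MAP@500 column of the main results table. The short check below uses only values from that table; pairing 120.46% with ANCE and 26.03% with Query2Doc + ColBERT is inferred from the arithmetic rather than stated explicitly in the summary.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# MAP@500 values copied from the main results table.
map500 = {
    "ANCE": 6.50,
    "ColBERT": 7.10,
    "Query2Doc + ColBERT": 11.37,
}
conceptcarve = 14.33  # ConceptCarve (depth 2)

for system, score in map500.items():
    gain = relative_improvement(conceptcarve, score)
    print(f"vs {system}: {gain:.2f}%")
```

Against ColBERT, the strongest non-LLM baseline in the table, the same calculation gives a gain of roughly 102%.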

Highlights & Insights¶

Explainability of Concept Trees: ConceptCarve not only retrieves evidence but also produces explainable representations. For example, when analyzing "disapproval of family promoting traditional values", evidence in rural communities emphasizes conflict with traditional family expectations, while urban communities focus on conflicts related to familial image.
Cost Efficiency: The token budget for LLM calls is fixed (approx. 20K tokens/tree) and does not scale with the corpus size, allowing the method to scale to massive datasets.
Plug-and-Play: The framework is agnostic to the underlying retriever — any improvement in retrieval engines can be directly leveraged.
Potential for Social Science Applications: Qualitative analysis of concept trees can automatically detect differentiating characteristics across different communities for a given trend.

Limitations & Future Work¶

Domain Limitations: Although spanning 3000+ subreddits, the source data is limited to Reddit; applicability to other platforms (e.g., Twitter, forums) remains unverified.
Tree Depth Saturation: When tree depth exceeds 2, concept weights decay severely, limiting the expressive power of the tree.
Demoted Concepts Ineffective in Reranking: This may call for better weight assignment strategies.
LLM Annotation Bias: Dataset labels are LLM-generated, with human annotator agreement at only 68%.
Scalability to Dialogues or Streaming Data: The current approach is designed for static corpora, necessitating future exploration of incrementally updating concept trees.

Complementary to RAG frameworks: ConceptCarve can serve as a retrieval enhancement module for RAG.
Promptriever (concurrent work) addresses the inferential gap parametrically, whereas ConceptCarve is training-free and explainable.
The building process of the concept tree resembles human cognitive "carving", progressing from coarse to fine to understand abstract concepts.

Rating¶

Novelty: 8/10. The LLM-driven dynamic concept tree construction is novel, and the paper contributes formal definitions of the inferential gap and domain sensitivity.
Experimental Thoroughness: 7/10. The dataset is large and diverse but restricted to Reddit; the ablation studies are mostly sufficient.
Writing Quality: 8/10. The problem is clearly defined, Figures 1-3 are highly illustrative, and the overall structure is sound.
Value: 7/10. Has practical value for social science and opinion mining; the interpretability of the concept tree is a key selling point.