Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs
Conference: ACL 2026
arXiv: 2604.07562
Code: GitHub
Area: Text Clustering / Unsupervised Learning
Keywords: Text Clustering Refinement, LLM Semantic Adjudicator, Consistency Verification, Redundancy Adjudication, Label Grounding
TL;DR
The paper proposes a reasoning-based cluster refinement framework that uses LLMs as semantic adjudicators (rather than embedding generators) to verify and restructure the output of unsupervised clustering. Its three reasoning stages (consistency verification, redundancy adjudication, and label grounding) raise the Silhouette Score on the X corpus from 0.122 to 0.674 and produce labels that align closely with human judgment on social media corpora.
Background & Motivation
Background: Unsupervised text clustering (LDA, BERTopic, HDBSCAN, etc.) is widely used to discover latent semantic structures from large-scale text collections. Recent methods primarily rely on contextual embeddings and geometric clustering criteria to evaluate cluster quality.
Limitations of Prior Work: Geometric properties in embedding space (e.g., separation, density) do not always align with human semantic understanding. Clusters may be numerically well-separated but semantically incoherent, and multiple clusters may encode overlapping topics. Particularly in social media short-text scenarios, high noise, lexical variation, and rapid topic drift exacerbate the gap between statistical consistency and human interpretability.
Key Challenge: Existing pipelines lack an explicit semantic verification mechanism—the "hypotheses" produced by clustering algorithms are never tested for true semantic coherence, non-redundancy, and interpretability.
Goal: Design a post-refinement layer that leverages the reasoning capabilities of LLMs to verify and restructure the output of any unsupervised clustering method.
Key Insight: Treat clustering as a "proposal" and the LLM as a "semantic adjudicator" rather than an embedding generator, decoupling representation learning from structural verification.
Core Idea: LLMs possess powerful natural language reasoning capabilities that can evaluate whether a cluster is internally consistent, whether two clusters are meaningfully distinct, and whether a topic is grounded in the text—tasks that purely geometric methods cannot achieve.
Method
Overall Architecture
The framework is a three-stage post-refinement process. The input is the set of initial clusters produced by any unsupervised clustering method (e.g., HDBSCAN); the output is a set of refined clusters with interpretable labels. Stage 1 verifies the semantic coherence of each cluster, Stage 2 merges semantically redundant clusters, and Stage 3 generates explanatory labels for the refined clusters and merges near-duplicate labels.
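The control flow of the three stages can be sketched as a function over pluggable callables. This is a minimal illustration, not the authors' implementation: `refine`, the four callable parameters, and the toy stand-ins are all hypothetical names, the LLM and SBERT calls are replaced by word-count heuristics so the sketch runs offline, and Stage 3 is simplified to combining clusters that receive the same label.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Cluster = List[str]  # documents, assumed pre-sorted by closeness to the centroid


def refine(
    clusters: Dict[int, Cluster],
    summarize: Callable[[List[str]], str],          # stand-in for the LLM summary call
    is_coherent: Callable[[str, List[str]], bool],  # stand-in for the LLM coherence check
    are_redundant: Callable[[str, str], bool],      # stand-in for the similarity test
    make_label: Callable[[str], str],               # stand-in for LLM label generation
    top_k: int = 5,
) -> Dict[str, Cluster]:
    # Stage 1: keep only clusters whose summary is supported by the representatives.
    kept: List[Tuple[str, Cluster]] = []
    for docs in clusters.values():
        reps = docs[:top_k]
        summary = summarize(reps)
        if is_coherent(summary, reps):
            kept.append((summary, list(docs)))

    # Stage 2: fold each cluster into the first earlier cluster it duplicates.
    merged: List[list] = []
    for summary, docs in kept:
        for entry in merged:
            if are_redundant(entry[0], summary):
                entry[1].extend(docs)
                break
        else:
            merged.append([summary, docs])

    # Stage 3 (simplified): label each cluster; identical labels share one bucket.
    labelled: Dict[str, Cluster] = {}
    for summary, docs in merged:
        labelled.setdefault(make_label(summary), []).extend(docs)
    return labelled


# Toy stand-ins (assumptions for illustration only, no LLM involved).
def toy_summarize(reps: List[str]) -> str:
    words = Counter(w for doc in reps for w in doc.split())
    return words.most_common(1)[0][0]


def toy_is_coherent(summary: str, reps: List[str]) -> bool:
    return all(summary in doc.split() for doc in reps)


toy_clusters = {
    0: ["tofu recipe ideas", "easy tofu stir fry", "tofu scramble"],
    1: ["tofu dinner tonight", "crispy tofu tips"],
    2: ["stock market news", "rainy weather today", "football scores"],
}
result = refine(toy_clusters, toy_summarize, toy_is_coherent,
                lambda a, b: a == b, str.title)
```

In this toy run the mixed-content cluster is dropped in Stage 1 and the two tofu clusters are merged in Stage 2. In the framework described here, each callable would instead wrap an LLM prompt or an SBERT similarity computation.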
Key Designs
- Coherence Verification:
    - Function: Identify and discard semantically incoherent clusters.
    - Mechanism: Select top-5 representative documents closest to the centroid for each cluster, generate a concise summary using an LLM, and then use the LLM to evaluate if the summary is supported by the representative documents. If the LLM determines the summary fails to capture a consistent topic across documents, the cluster is marked as incoherent and discarded.
    - Design Motivation: Clusters that appear compact in embedding space may contain semantically heterogeneous content; the language understanding of LLMs can identify this inconsistency.
- Redundancy Adjudication:
    - Function: Merge semantically overlapping clusters to reduce redundancy.
    - Mechanism: Use SBERT to generate embeddings for each cluster summary and calculate pairwise cosine similarity. Clusters exceeding a threshold (\(\tau=0.85\), determined via grid search) are merged. The threshold selection balances Silhouette Score, Davies-Bouldin Index, and the number of clusters.
    - Design Motivation: Multiple clusters may have surface lexical differences but essentially discuss the same topic; merging improves the non-redundancy of the structure.
- Label Grounding:
    - Function: Assign interpretable human-readable labels to each refined cluster.
    - Mechanism: In the first step, a candidate label is generated from each cluster's summary. In the second step, SBERT similarity between labels is calculated, labels with similarity \(>0.85\) are grouped, and the LLM generates a merged label for each group. Finally, the LLM reassigns each document to the most appropriate merged label.
    - Design Motivation: Multiple clusters might produce semantically similar labels; merging avoids redundancy in the label taxonomy.
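The two geometric steps in these designs (picking the top-5 documents nearest the centroid, and grouping summaries or labels whose cosine similarity exceeds 0.85) can be sketched in plain Python. The function names are hypothetical, and two details are assumptions the summary does not specify: closeness to the centroid is measured with Euclidean distance, and merging is treated as transitive (connected components via union-find). Embeddings are plain lists of floats standing in for SBERT vectors.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_representatives(embeddings: List[Vector], k: int = 5) -> List[int]:
    """Indices of the k members closest to the cluster centroid (Euclidean, assumed)."""
    dim = len(embeddings[0])
    centroid = [sum(v[d] for v in embeddings) / len(embeddings) for d in range(dim)]
    order = sorted(range(len(embeddings)),
                   key=lambda i: math.dist(embeddings[i], centroid))
    return order[:k]


def merge_groups(vectors: List[Vector], tau: float = 0.85) -> List[List[int]]:
    """Group items linked by cosine similarity > tau (transitive merging, assumed)."""
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > tau:
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


# Toy data: the first two "summary embeddings" point in nearly the same direction.
summary_vecs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
groups = merge_groups(summary_vecs)

# Toy cluster on a line: centroid is at x = 3.25, so indices 1 and 2 are nearest.
reps = top_k_representatives([[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [10.0, 0.0]], k=2)
```

The same `merge_groups` routine would serve both Redundancy Adjudication (applied to summary embeddings) and the label-grouping step of Label Grounding (applied to label embeddings), since both use a 0.85 similarity threshold.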
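Loss & Training
The framework is training-free and relies entirely on the zero-shot reasoning of an LLM (GPT-4o). The initial clustering stage uses TF-IDF + SVD + UMAP + HDBSCAN. In the refinement stage, the LLM handles summarization, coherence judgments, and label generation, while SBERT supplies the similarity scores used to merge cluster summaries and labels.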
Key Experimental Results
Main Results
| Method | Cluster Count (CC) | Silhouette Score (SS)↑ | Davies-Bouldin Index (DBi)↓ | Description |
|---|---|---|---|---|
| HDBSCAN (X) | 359 | 0.122 | 2.322 | Original clusters |
| SBERT Refinement (X) | 250 | 0.156 | 0.569 | Refinement using embeddings only |
| LLM Refinement (X) | 232 | 0.674 | - | Semantic reasoning refinement, SS improved 5.5x |
| HDBSCAN (Bluesky) | - | - | - | Baseline |
| LLM Refinement (Bluesky) | - | Significant improvement | - | Consistently effective across platforms |
Ablation Study
| Evaluation Dimension | Result | Description |
|---|---|---|
| LLM Labels vs. Human Evaluation | High Consistency | LLM labels align strongly with human judgment in the absence of gold standards |
| Cross-platform Stability | Consistent | Structures remain stable under matched time and quantity conditions |
| Incoherent Cluster Identification | Effective | Successfully identifies and discards clusters containing mixed content |
Key Findings
- LLM refinement improves the Silhouette Score from 0.122 to 0.674 (X dataset), a 5.5x gain.
- The number of clusters was reduced from 359 to 232, removing incoherent and redundant clusters.
- Human evaluation shows high consistency between LLM-generated labels and expert annotators.
- Cross-platform results (X vs. Bluesky) demonstrate structural stability under matched conditions.
- SBERT refinement improves separation (DBi) but is less effective than LLM refinement in enhancing semantic coherence.
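The headline ratios follow directly from the X-dataset rows of the results table. As a quick check of the arithmetic (the percentage reduction is derived here and is not a figure stated in the summary):

\[
\frac{0.674}{0.122} \approx 5.52, \qquad \frac{359 - 232}{359} = \frac{127}{359} \approx 35.4\%
\]

So the Silhouette Score rises by a factor of about 5.5, and roughly a third of the original clusters are removed as incoherent or redundant.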
Highlights & Insights
- Positioning the LLM as a "semantic adjudicator" rather than an embedding generator is the core idea of this framework—utilizing LLM reasoning for structural verification while leaving representation learning to specialized embedding models. This decoupled design makes the framework agnostic to the clustering algorithm, serving as a general-purpose post-refinement layer.
- The three-stage reasoning checkpoints specifically target three typical failure modes of unsupervised clustering (incoherence, redundancy, lack of interpretability), resulting in a highly targeted design.
- Validating LLM label quality through human evaluation in scenarios without a gold standard is a pragmatic and effective approach.
Limitations & Future Work
- Dependence on GPT-4o API calls limits cost-effectiveness and reproducibility.
- Validated only on vegetarian-related topics, resulting in limited thematic diversity.
- The top-5 representative documents may be insufficient to represent large, heterogeneous clusters.
- The merging threshold of 0.85 is empirical; different domains may require adjustments.
- Future work could explore open-source LLMs as alternatives to GPT-4o and expand to more domains.
Related Work & Insights
- vs. BERTopic: While BERTopic relies on embedding geometry for quality assessment, this framework adds a semantic reasoning verification layer.
- vs. LDA/HDP: Probabilistic topic models often exhibit poor semantic coherence on short texts; the post-refinement layer proposed here can improve outputs from any topic model.
- vs. LLooM: LLooM requires user-provided seed sets, whereas this framework is completely unsupervised.
Rating
- Novelty: ⭐⭐⭐⭐ The idea of using an LLM as a semantic adjudicator for cluster refinement is novel and generalizable.
- Experimental Thoroughness: ⭐⭐⭐ Evaluation is limited to a single thematic domain with few baseline comparisons.
- Writing Quality: ⭐⭐⭐⭐ The framework description is clear, and the design principles are well-defined.
- Value: ⭐⭐⭐⭐ Provides a general methodology for clustering post-processing.