
Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs

Conference: ACL 2026
arXiv: 2604.07562
Code: GitHub
Area: Text Clustering / Unsupervised Learning
Keywords: text cluster refinement, LLM semantic judge, coherence verification, redundancy adjudication, label grounding

TL;DR

A reasoning-based cluster refinement framework that uses LLMs as semantic judges (rather than embedding generators) to verify and restructure unsupervised clustering outputs through coherence verification, redundancy adjudication, and label grounding, significantly improving cluster consistency and human-aligned annotation quality on social media corpora.

Background & Motivation

Background: Unsupervised text clustering (LDA, BERTopic, HDBSCAN, etc.) is widely used for discovering latent semantic structures from large-scale text collections. Recent methods primarily rely on contextual embeddings + geometric clustering criteria to assess cluster quality.

Limitations of Prior Work: Geometric properties in embedding space (e.g., separation, density) do not always align with human understanding of semantics. Clusters may be well-separated numerically but semantically incoherent, and multiple clusters may encode overlapping topics. This is especially problematic for social media short texts, where noise, vocabulary variation, and topic drift widen the gap between statistical consistency and human interpretability.

Key Challenge: Existing pipelines lack explicit semantic verification mechanisms—the "hypotheses" produced by clustering algorithms are never tested for genuine semantic coherence, non-redundancy, and interpretability.

Goal: Design a post-hoc refinement layer that leverages LLM reasoning capabilities to verify and restructure the output of any unsupervised clustering method.

Key Insight: Treat clusters as "proposals" and LLMs as "semantic judges" rather than embedding generators, decoupling representation learning from structural verification.

Core Idea: LLMs possess strong natural language reasoning abilities to evaluate whether a cluster is internally consistent, whether two clusters are meaningfully different, and whether topics are grounded in the text—capabilities that pure geometric methods cannot achieve.

Method

Overall Architecture

Three-stage post-hoc refinement: input is the initial clusters produced by any unsupervised clustering method (e.g., HDBSCAN), output is the refined cluster set with interpretable labels. Stage 1 verifies each cluster's semantic coherence, Stage 2 merges semantically redundant clusters, Stage 3 generates and consolidates explanatory labels for refined clusters.
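The three stages can be sketched as a single refinement loop over an initial cluster list. This is a minimal illustration, not the paper's code: the four callables (`summarize`, `is_coherent`, `are_redundant`, `make_label`) are hypothetical stand-ins for what the paper implements as GPT-4o prompts (and, for redundancy, SBERT cosine similarity over cluster summaries):

```python
def refine(clusters, summarize, is_coherent, are_redundant, make_label):
    # Stage 1: coherence verification -- summarize each cluster from its
    # representative documents; discard clusters whose summary the judge
    # says is not supported by those documents.
    kept = [(docs, summarize(docs)) for docs in clusters]
    kept = [(docs, s) for docs, s in kept if is_coherent(s, docs)]

    # Stage 2: redundancy adjudication -- greedily merge clusters whose
    # summaries the judge deems equivalent.
    merged = []
    for docs, summary in kept:
        for m in merged:
            if are_redundant(m["summary"], summary):
                m["docs"].extend(docs)
                break
        else:
            merged.append({"docs": list(docs), "summary": summary})

    # Stage 3: label grounding -- attach a human-readable label per cluster.
    for m in merged:
        m["label"] = make_label(m["summary"])
    return merged

# Toy judges for illustration: a cluster's "topic" is the first word of
# its documents. The mixed ["apple", "bike"] cluster is discarded as
# incoherent; the two "apple" clusters are merged.
refined = refine(
    [["apple pie", "apple tart"], ["apple juice"], ["bike race"], ["apple", "bike"]],
    summarize=lambda docs: docs[0].split()[0],
    is_coherent=lambda s, docs: all(d.split()[0] == s for d in docs),
    are_redundant=lambda a, b: a == b,
    make_label=str.upper,
)
```

With the toy judges this yields two refined clusters, one per topic; swapping in LLM-backed callables recovers the framework's actual behavior.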

Key Designs

  1. Coherence Verification:

    • Function: Identify and discard semantically incoherent clusters
    • Mechanism: For each cluster, select the top-5 representative documents closest to the centroid, use an LLM to generate a concise summary, then ask the LLM to evaluate whether the summary is supported by those documents. If the LLM determines the summary fails to capture a consistent theme across the documents, the cluster is marked as incoherent and discarded
    • Design Motivation: Clusters that appear compact in embedding space may contain semantically heterogeneous content; LLM language understanding can identify such inconsistency
  2. Redundancy Adjudication:

    • Function: Merge semantically overlapping clusters to reduce redundancy
    • Mechanism: Generate SBERT embeddings for each cluster summary, compute pairwise cosine similarity. Clusters exceeding a threshold (\(\tau=0.85\), determined via grid search) are merged. Threshold selection balances Silhouette Score, Davies-Bouldin Index, and cluster count
    • Design Motivation: Multiple clusters may have only surface lexical differences while substantively discussing the same topic; merging improves structural non-redundancy
  3. Two-Stage Label Grounding:

    • Function: Assign interpretable human-readable labels to each refined cluster
    • Mechanism: Stage 1 generates candidate labels from the summaries of each cluster; Stage 2 computes SBERT similarity between labels, groups labels with similarity >0.85, and has the LLM generate a merged label for each group. Finally, the LLM reassigns each document to the most appropriate merged label
    • Design Motivation: Multiple clusters may produce semantically similar labels; merging avoids redundancy in the label taxonomy
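The redundancy-adjudication step above reduces to thresholded cosine similarity plus a union-find over cluster-summary embeddings. A minimal numpy sketch, assuming SBERT summary embeddings as input (`merge_redundant` is an illustrative name, not the paper's code; \(\tau=0.85\) as in the paper):

```python
import numpy as np

def merge_redundant(summary_embs, tau=0.85):
    """Assign a shared group id to clusters whose summary embeddings
    exceed cosine similarity tau (union-find over all pairs)."""
    n = summary_embs.shape[0]
    # Normalise rows so dot products are cosine similarities.
    normed = summary_embs / np.linalg.norm(summary_embs, axis=1, keepdims=True)
    sim = normed @ normed.T
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > tau:
                parent[find(i)] = find(j)  # union the two groups
    return [find(i) for i in range(n)]

# Toy summary embeddings: rows 0 and 1 nearly parallel (cosine ~0.995),
# row 2 orthogonal to both -- so clusters 0 and 1 merge, cluster 2 stays.
groups = merge_redundant(np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]))
```

Using transitive closure (union-find) rather than raw pairwise merging keeps the result well-defined when A~B and B~C but A and C fall just under the threshold.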

Loss & Training

The framework requires no training; it relies entirely on zero-shot reasoning with an LLM (GPT-4o). The initial clustering stage uses TF-IDF + SVD + UMAP + HDBSCAN, and the refinement stage uses the LLM and SBERT in tandem.
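For reference, the initial clustering stage can be sketched with scikit-learn. UMAP and HDBSCAN live in third-party packages (umap-learn; hdbscan, or scikit-learn ≥ 1.3), so this self-contained sketch substitutes KMeans for the density-based step; the documents below are invented toys, not the paper's data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Two toy topics: vegan cooking vs. football.
docs = [
    "vegan recipes with tofu and lentils",
    "plant based vegan cooking tips",
    "tofu lentils and vegan meal prep",
    "football match ended in a draw",
    "the football team won the league match",
    "league standings after the football season",
]

# TF-IDF term weighting, then SVD dimensionality reduction, then clustering
# (the paper inserts a UMAP projection before HDBSCAN at this last step).
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
```

The resulting `labels` array is exactly the "proposal" input that the refinement layer then verifies and restructures.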

Key Experimental Results

Main Results

| Method | Cluster Count | Silhouette Score ↑ | Davies–Bouldin Index ↓ | Note |
| --- | --- | --- | --- | --- |
| HDBSCAN (X) | 359 | 0.122 | 2.322 | Raw clusters |
| SBERT refinement (X) | 250 | 0.156 | 0.569 | Embedding-only refinement |
| LLM refinement (X) | 232 | 0.674 | – | Semantic reasoning refinement; Silhouette Score improved 5.5× |
| HDBSCAN (Bluesky) | – | – | – | Baseline |
| LLM refinement (Bluesky) | – | significant improvement | – | Consistently effective cross-platform |

Ablation Study

| Evaluation Dimension | Result | Note |
| --- | --- | --- |
| LLM labels vs. human evaluation | High agreement | LLM labels strongly align with human judgments without a gold standard |
| Cross-platform stability | Consistent | Structurally stable when matching temporal and quantitative conditions |
| Incoherent cluster identification | Effective | Successfully identifies and discards clusters with mixed content |

Key Findings

  • LLM refinement boosts Silhouette Score from 0.122 to 0.674 (X dataset), a 5.5× improvement
  • Cluster count decreases from 359 to 232, removing incoherent and redundant clusters
  • Human evaluation shows high agreement between LLM-generated labels and expert annotators
  • Cross-platform (X vs Bluesky) structure remains stable under matched conditions
  • SBERT refinement improves separation (DBi) but falls short of LLM refinement in semantic consistency

Highlights & Insights

  • Positioning LLMs as "semantic judges" rather than embedding generators is the framework's core insight—leveraging LLM reasoning for structural verification while leaving representation learning to specialized embedding models. This decoupled design makes the framework clustering-algorithm-agnostic, serving as a general post-hoc refinement layer
  • The three-stage reasoning checkpoints each target a typical failure mode of unsupervised clustering (incoherence, redundancy, uninterpretability), with targeted design
  • Validating LLM label quality through human evaluation in the absence of gold standards is a pragmatic approach

Limitations & Future Work

  • Relies on GPT-4o API calls, which raises cost and limits reproducibility
  • Only validated on veganism-related topics, with limited topic diversity
  • Top-5 representative documents may be insufficient to represent large heterogeneous clusters
  • The 0.85 merging threshold is empirical and may need adjustment for different domains
  • Future work can explore open-source LLM alternatives to GPT-4o and extend to more domains

Comparison with Related Methods

  • vs BERTopic: BERTopic relies on embedding geometry for quality assessment; this work adds a semantic reasoning verification layer
  • vs LDA/HDP: Probabilistic topic models have poor semantic coherence on short texts; this post-hoc refinement can improve any topic model's output
  • vs LLooM: LLooM requires user-provided seed sets; this work is fully unsupervised

Rating

  • Novelty: ⭐⭐⭐⭐ LLM as semantic judge for cluster refinement is a novel and general approach
  • Experimental Thoroughness: ⭐⭐⭐ Evaluated on a single topic domain only, lacking more baseline comparisons
  • Writing Quality: ⭐⭐⭐⭐ Clear framework description with well-defined design principles
  • Value: ⭐⭐⭐⭐ Provides a general methodology for cluster post-processing