Summaries as Centroids for Interpretable and Scalable Text Clustering¶
Conference: ICLR 2026 · arXiv: 2502.09667 · Code: None · Area: Information Retrieval · Keywords: Text Clustering, k-means, Summary-as-Centroid, Interpretability, Streaming Clustering, LLM-optional
TL;DR¶
This paper proposes k-NLPmeans and k-LLMmeans, which periodically replace numeric centroids with textual summaries (summary-as-centroid) during k-means iterations, achieving interpretable cluster prototypes while preserving the standard k-means objective. The number of LLM calls is independent of dataset size.
Background & Motivation¶
- Limitations of standard k-means on text: numeric averaging blurs textual semantics, and centroids are not human-interpretable.
- Problems with existing LLM-based clustering methods:
- Poor scalability: the number of LLM calls grows with dataset size.
- Opaque optimization: relies on prompts, greedy merging, and similarity thresholds without a clear objective function.
- A clustering method that is both interpretable and scalable is needed.
Method¶
Core Idea: Summary-as-Centroid¶
Within the standard k-means loop, every \(l\) iterations the numeric centroid is replaced by the embedding of a textual summary of the cluster's documents:

\(\boldsymbol{\mu}_j = \phi\big(f(\{d_i : i \in C_j\})\big)\), where \(f\) is the summarizer and \(\phi\) denotes the text embedding model.

In the remaining iterations, the standard mean update is used: \(\boldsymbol{\mu}_j = \frac{1}{|C_j|}\sum_{i \in C_j} \mathbf{x}_i\)
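The alternating update can be sketched as follows. This is a minimal illustration, not the paper's implementation: `embed` (text or summary → vector) and `summarize` (cluster member indices → summary) are assumed helper functions, and with neither supplied the loop reduces to plain k-means.

```python
import numpy as np

def summary_kmeans(X, k, n_iter=100, l=10, embed=None, summarize=None,
                   init=None, seed=0):
    """Summary-as-centroid k-means sketch: standard assignment and mean
    updates, except every l-th iteration each centroid is replaced by the
    embedding of a summary of its cluster (via the assumed helpers)."""
    rng = np.random.default_rng(seed)
    mu = init.copy() if init is not None else \
        X[rng.choice(len(X), k, replace=False)].copy()
    z = np.zeros(len(X), dtype=int)
    for t in range(1, n_iter + 1):
        # assignment step: nearest centroid in Euclidean distance
        z = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            members = np.flatnonzero(z == j)
            if len(members) == 0:
                continue  # keep the old centroid for empty clusters
            if t % l == 0 and summarize is not None and embed is not None:
                mu[j] = embed(summarize(members))  # summary-as-centroid update
            else:
                mu[j] = X[members].mean(0)         # standard mean update
    return z, mu
```

Note that the summary update only touches the centroid update step; the assignment step and objective are untouched, which is why the method degrades gracefully to standard k-means.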
k-NLPmeans (LLM-free Variant)¶
Uses classical NLP summarization methods as \(f_{\text{NLP}}^{(q)}\):
- Centroid-based: Computes the centroid of intra-cluster sentence embeddings and concatenates the top-\(q\) most similar sentences.
- TextRank: Constructs a sentence similarity graph, ranks sentences via PageRank, and selects the top-\(q\) sentences.
- LSA-style SVD: Applies SVD to sentence embeddings, scores sentences by principal component contribution, and selects accordingly.
Characteristics: fast, deterministic, LLM-free, and offline-capable.
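The centroid-based option can be sketched as below. This is an illustrative helper under assumed inputs: `emb` holds precomputed sentence embeddings, one row per sentence, and the function returns the concatenation of the \(q\) sentences nearest to the cluster's mean embedding.

```python
import numpy as np

def centroid_summary(sentences, emb, q=3):
    """Centroid-based extractive summary sketch: pick the q sentences whose
    embeddings are most cosine-similar to the mean embedding of the cluster
    and concatenate them in document order."""
    E = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows
    c = E.mean(0)
    c = c / np.linalg.norm(c)
    sim = E @ c                    # cosine similarity to the centroid
    top = np.argsort(-sim)[:q]     # indices of the q most central sentences
    return " ".join(sentences[i] for i in sorted(top))
```

Being purely extractive and deterministic, this summarizer runs offline with no LLM dependency, which is the point of the k-NLPmeans variant.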
k-LLMmeans (LLM-assisted Variant)¶
The summary is generated by an LLM from a prompt over sampled cluster documents, \(s_j = \text{LLM}(p_j)\), and the centroid \(\boldsymbol{\mu}_j\) is set to the embedding of \(s_j\), where \(p_j = \text{Prompt}(I, \{d_{z_i}\}_{i=1}^{m_j})\) with instruction \(I\) and \(m_j\) document indices \(z_i\) sampled from \(C_j\).
- The LLM processes representative samples of each cluster (sampled via k-means++) rather than all documents.
- Each summarization step requires exactly \(k\) LLM calls → call count is independent of dataset size.
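A single-cluster summarization step might look like the sketch below. The prompt format is illustrative, and `call_llm` and `embed` are placeholder helpers, not the paper's actual template or API; the k-means++-style sampling favors diverse documents within the cluster.

```python
import numpy as np

def kmeanspp_sample(emb, m, rng):
    """k-means++-style sampling within one cluster: pick the first document
    at random, then repeatedly pick documents with probability proportional
    to squared distance from the already-chosen set."""
    chosen = [rng.integers(len(emb))]
    while len(chosen) < min(m, len(emb)):
        d2 = np.min(((emb[:, None] - emb[chosen][None]) ** 2).sum(-1), axis=1)
        if d2.sum() == 0:
            break  # all remaining documents coincide with chosen ones
        chosen.append(rng.choice(len(emb), p=d2 / d2.sum()))
    return chosen

def llm_centroid(docs, emb, instruction, call_llm, embed, m=8, seed=0):
    """One k-LLMmeans summarization step for one cluster: exactly one LLM
    call regardless of cluster size (hypothetical call_llm/embed helpers)."""
    rng = np.random.default_rng(seed)
    idx = kmeanspp_sample(emb, m, rng)
    prompt = instruction + "\n\n" + "\n".join(docs[i] for i in idx)
    summary = call_llm(prompt)        # s_j: human-readable cluster prototype
    return summary, embed(summary)    # centroid <- embedding of the summary
```

Running this once per cluster gives exactly \(k\) calls per summarization step, independent of how many documents each cluster contains.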
Mini-batch Extension: Streaming Clustering¶
Summarization steps are inserted into the mini-batch k-means update rule:
- Batches \(D_1, \ldots, D_b\) are processed sequentially.
- Each batch is handled by k-NLPmeans/k-LLMmeans, followed by incremental centroid updates.
- The low-memory property of mini-batch k-means is preserved.
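The streaming variant can be sketched as mini-batch k-means with an optional summary refresh after each batch. This is a simplified illustration: `summarize_embed` is an assumed helper mapping a batch and its cluster member indices to a summary embedding, and the per-center learning rate \(1/n_j\) follows standard mini-batch k-means.

```python
import numpy as np

def streaming_summary_kmeans(batches, mu, summarize_embed=None):
    """Mini-batch sketch: batches arrive sequentially, centroids are updated
    incrementally with learning rate 1/count (as in mini-batch k-means),
    and an optional summary-as-centroid refresh runs after each batch."""
    counts = np.zeros(len(mu))
    for X in batches:
        z = ((X[:, None] - mu[None]) ** 2).sum(-1).argmin(1)
        for i, j in enumerate(z):            # incremental centroid update
            counts[j] += 1
            mu[j] += (X[i] - mu[j]) / counts[j]
        if summarize_embed is not None:      # periodic summary refresh
            for j in np.unique(z):
                mu[j] = summarize_embed(X, np.flatnonzero(z == j))
    return mu
```

Only the current batch is held in memory, so the low-memory property of mini-batch k-means carries over.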
Loss & Training¶
The standard k-means objective remains unchanged between summarization steps: \(J = \sum_{j=1}^{k}\sum_{i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|_2^2\)
When summarization fails, the method gracefully degrades to standard k-means.
Key Experimental Results¶
Static Clustering (text-embedding-3-small)¶
| Method | Bank77 ACC | CLINC ACC | GoEmo ACC | MASSIVE(D) ACC | MASSIVE(I) ACC |
|---|---|---|---|---|---|
| k-means | ~65 | ~77 | ~20 | ~59 | ~52 |
| k-NLPmeans LSA-mult | 67.1 | 80.2 | 22.3 | 63.3 | 55.3 |
| k-LLMmeans single | 67.1 | 78.1 | 24.0 | — | — |
| k-LLMmeans mult | Higher | Higher | Higher | Higher | Higher |
LLM Call Efficiency Comparison¶
| Method | LLM Call Complexity | Data Dependence |
|---|---|---|
| ClusterLLM | O(n) | Grows with data |
| LLMEdgeRefine | O(n) | Grows with data |
| k-NLPmeans | 0 (no LLM calls) | None |
| k-LLMmeans | O(k · #summarization steps) | Independent of n |
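For concreteness, a back-of-the-envelope comparison of call budgets; aside from \(k = 77\) (Bank77) and \(l = 60\) (from the single-summarization-step finding below), the numbers are illustrative assumptions, not figures from the paper.

```python
# Illustrative budget arithmetic: with k clusters, T total iterations, and a
# summarization step every l iterations, k-LLMmeans issues k * (T // l) LLM
# calls -- a budget independent of the corpus size n.
k, T, l, n = 77, 120, 60, 1_000_000   # T and n are assumed example values
kllm_calls = k * (T // l)             # 77 clusters * 2 summarization steps
per_doc_calls = n                     # an O(n) method pays one call per doc
print(kllm_calls, per_doc_calls)
```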
Key Findings¶
- Even a single summarization step (\(l=60\)) yields substantial improvements over standard k-means.
- k-NLPmeans (zero LLM) approaches or matches k-LLMmeans on most benchmarks.
- k-means++ sampled documents yield better LLM summaries than random sampling.
- Consistent improvements are observed across 4 embedding models, 5 LLMs, and 3 classical NLP methods.
- The proposed methods also outperform standard mini-batch k-means in streaming clustering scenarios.
Highlights & Insights¶
- Minimal modification, significant gain: only the centroid update step of k-means is modified; everything else remains unchanged.
- LLM-optional design: k-NLPmeans captures most of the benefit without any LLM dependency.
- Intrinsic interpretability: each centroid is a human-readable textual summary.
- Graceful degradation: automatically falls back to standard k-means when summary quality is poor, ensuring no regression.
- Fixed LLM budget: total LLM calls are \(k \times\) the number of summarization steps, making the approach tractable at scale.
- Introduces a new StackExchange streaming clustering benchmark.
Limitations & Future Work¶
- Summarization quality is bounded by the capability of the underlying summarizer.
- For semantically overlapping clusters, summaries may fail to provide effective differentiation.
- The number of clusters \(k\) must be specified in advance (inherited from k-means).
- The summarization frequency \(l\) requires tuning, though experiments suggest low sensitivity to this hyperparameter.
Related Work & Insights¶
- LLM-based clustering: ClusterLLM, IDAS, LLMEdgeRefine, etc.
- Classical text clustering: k-medoids, spectral clustering, BERTopic.
- Streaming clustering: mini-batch k-means, LLM-based online methods.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The summary-as-centroid concept is concise and novel.
- Technical Depth: ⭐⭐⭐ — The method is intuitively clear, but theoretical analysis is limited.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Extremely comprehensive: 4 datasets × 4 embeddings × 5 LLMs × 3 NLP methods.
- Practicality: ⭐⭐⭐⭐⭐ — Plug-and-play, interpretable, and scalable; high practical value.