DSL-Topic: Improving Topic Modeling by Distilling Soft Labels from Language Models
Conference: ICML 2026
arXiv: 2602.17907
Code: https://github.com/raymondzmc/dsl-topic-models (Available)
Area: NLP Understanding / Topic Modeling / Knowledge Distillation
Keywords: Neural Topic Models, Soft Label Distillation, Small Language Models, KL Reconstruction Objective, Implicit Bayesian Inference
TL;DR
The authors take the next-token probabilities that a small language model (SLM) produces under a "generate a topic word for the document" prompt, project them onto the topic model vocabulary, and use them as dense soft labels in place of the traditional Bag-of-Words (BoW) reconstruction target when training neural topic models (ProdLDA / ECRTM / FASTopic). This substantially improves allocation Purity on the 20NewsGroup, TweetTopic, and StackOverflow datasets (for example, from .265 to .788 for ProdLDA on StackOverflow). The paper also gives a Bayesian interpretation of the method as "projecting implicit LM posterior predictions into structured topic families."
Background & Motivation
Background: Mainstream neural topic models follow the LDA-VAE lineage (ProdLDA, ETM, CombinedTM, ZeroshotTM): an encoder maps each document to continuous latent topic proportions \(\theta\), and a decoder is trained with a BoW reconstruction loss, from which the topic-word distributions are learned. More recent models add further structure: ECRTM adds an embedding clustering regularizer, while FASTopic uses Optimal Transport to align documents, topics, and words in a shared embedding space.
Limitations of Prior Work: BoW targets only assign probability mass to "words appearing in the document," completely ignoring context and compositional semantics. On short texts like TweetTopic and StackOverflow, co-occurrence signals are too sparse to learn coherent topics (in the paper, LDA Purity on StackOverflow is only .174, and ProdLDA is only .265).
Key Challenge: Using LLMs directly for topic modeling (e.g., TopicGPT and other prompt-based methods) gives up the probabilistic framework: topics are expressed in natural language and document allocations are hard decisions, so topic uncertainty is not captured. LLM context windows also cannot accommodate large corpora. In practice one must choose between the "structural advantages of probabilistic topic models" and the "semantic priors of LLMs."
Goal: Retain the probabilistic structure of VAE topic models while allowing the reconstruction target to carry semantic/thematic information encoded by LMs, without introducing extra training stages or relying on large-scale LLM generation.
Key Insight: The authors observe that when an LM is prompted to "Generate a single word label capturing the document theme," the next-token logits at the position immediately following the prompt induce a semantically dense distribution over the vocabulary. This distribution assigns probability to words that are topic-related but do not necessarily appear in the document (Figure 1 shows a document about religious debates for which probability is assigned to unobserved religious terms).
Core Idea: Project this prompt-conditioned next-token distribution onto the topic model vocabulary \(V\), obtain soft labels \(y_{DSL}\) via a temperature softmax, and train any vocabulary-reconstruction topic model to fit \(y_{DSL}\) under a KL divergence objective. The authors interpret this as distilling the implicit Bayesian posterior predictions of the LM into a structured topic family.
Method
Overall Architecture
Given a document \(x\) and a fixed topic instruction prompt \(\pi\), the method builds a "target + input" pair from a single LM forward pass. (1) Target: \((x, \pi)\) is fed into an SLM and the next-token logits \(\ell_{LM}(x,\pi)\) at the position immediately following the prompt are extracted. Only the sub-vector \(\ell_V(x,\pi) \in \mathbb{R}^{|V|}\) corresponding to the topic vocabulary \(V\) is kept, and a temperature softmax turns it into the soft labels \(y_{DSL}\). (2) Input: from the same forward pass, the last-layer hidden state at the final prompt position, \(x_{emb} = h_{LM}(x,\pi)\), is used as the input representation for the topic model. The topic model \(\mathcal{M}\) (ProdLDA / ECRTM / FASTopic) takes \(x_{emb}\), infers topic proportions \(\theta\), and generates the vocabulary distribution \(\hat{y}_\psi(x)\). The training objective is \(\lambda D_{KL}(y_{DSL} \| \hat{y}_\psi) + \mathcal{R}_{\mathcal{M}}\): only the reconstruction term is replaced, and the model's own regularization is kept. The LM itself is never trained, and the soft labels can be pre-computed once and reused offline.
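The target construction in step (1) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the logits are random stand-ins for a real SLM forward pass, and the function name `dsl_soft_labels`, the 1000-token LM vocabulary, and the 50-word topic vocabulary are all assumptions made for the example.

```python
import math
import random

def dsl_soft_labels(next_token_logits, vocab_token_ids, tau=3.0):
    """Keep only the logits of topic-vocabulary tokens, then apply a temperature softmax."""
    scaled = [next_token_logits[i] / tau for i in vocab_token_ids]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Stand-in for one document's next-token logits over a hypothetical 1000-token LM vocabulary.
rng = random.Random(0)
lm_logits = [rng.gauss(0.0, 2.0) for _ in range(1000)]
# Hypothetical token ids of the topic vocabulary V (the paper uses |V| = 2000; 50 here).
vocab_token_ids = rng.sample(range(1000), 50)

y_dsl = dsl_soft_labels(lm_logits, vocab_token_ids, tau=3.0)
y_sharp = dsl_soft_labels(lm_logits, vocab_token_ids, tau=1.0)

print(len(y_dsl), round(sum(y_dsl), 6))  # one probability per word in V, summing to 1
print(max(y_dsl) <= max(y_sharp))        # a higher temperature flattens the distribution
```

Every entry of `y_dsl` is strictly positive, which is the "dense" property the method relies on; a BoW target would be zero for every word absent from the document.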
Key Designs
- Prompt-Conditioned Soft Target \(y_{DSL}\):
    - Function: Replaces BoW with the LM next-token distribution as a dense semantic reconstruction target.
    - Mechanism: A vocabulary \(V\) of the top \(|V|=2000\) high-frequency words is extracted, restricted to words that are single tokens under the LM tokenizer (98% coverage). After concatenating <system, x, π>, a single LM forward pass yields \(\ell_V\), the logits of the terms in \(V\). The dense distribution is \(y_{DSL}(x,\pi) = \mathrm{softmax}(\ell_V(x,\pi) / \tau)\) with \(\tau=3\). Unlike BoW, which is non-zero only for words present in the document, \(y_{DSL}\) assigns mass to "topic-relevant but unobserved" words.
    - Design Motivation: On short texts, sparse BoW co-occurrence matrices lead models to learn stop-word residue. The semantic prior acquired during SLM pre-training already encodes that a "religious debate" document relates to god, atheist, and believer. Injecting this knowledge gives the topic model a semantically enriched reconstruction target at the cost of one LM forward pass per document.
- Hidden State Input \(x_{emb}\):
    - Function: Replaces the BoW input with the LM hidden state at the final prompt position, so that input and target share a semantic space.
    - Mechanism: In an autoregressive LM, the last hidden state is the vector that the LM head projects to next-token logits, so \(x_{emb}\) and \(y_{DSL}\) come from the same representation, one taken just before the LM head and the other just after it. The topic model only swaps its encoder input (previously BoW or SBERT embeddings); the internal structure of ProdLDA/ECRTM/FASTopic is unchanged.
    - Design Motivation: If the input stayed BoW, the model would have to infer LM-level themes from a bag of words. Using the LM's own hidden state ensures that "input and target are from the same source," so the KL distillation only has to project from the LM's continuous representation space into the structured topic space.
- Model-Agnostic KL Distillation:
    - Function: Replaces the BoW-NLL reconstruction term with a KL distillation term while preserving each model's specialized regularization, for plug-and-play compatibility.
    - Mechanism: The general objective is \(\mathcal{L}_{DSL}(x) = \lambda D_{KL}(y_{DSL}(x,\pi) \| \hat{y}_\psi(x)) + \mathcal{R}_{\mathcal{M}}(x;\psi)\). For ProdLDA, \(\mathcal{R}_{\mathcal{M}}\) is the logistic-normal prior KL; for ECRTM, the embedding clustering term; for FASTopic, the Optimal Transport term. The paper sets \(\lambda = 10^3\) to balance the magnitude of the KL term. In the Bayesian view, the LM acts as an implicit Bayesian predictor, \(y_{DSL}(v|x,\pi) \approx \int p_{LM}(v|c) p_{LM}(c|x,\pi) \, dc\), over latent concepts \(c\), and the topic model projects this implicit posterior predictive onto a family of distributions constrained by a low-dimensional topic bottleneck, \(\mathcal{P}_{\mathcal{M}}(x_{emb}) \subseteq \Delta^{|V|-1}\).
    - Design Motivation: A single method can thus enhance VAE, embedding-clustering, and Optimal Transport variants, which is why the paper evaluates a full 3x5 grid (3 topic models x 5 SLM teachers).
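The general objective above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the regularizer is passed in as a plain number standing in for \(\mathcal{R}_{\mathcal{M}}\), and the decoder is a simple LDA-style mixture chosen for readability rather than the actual ProdLDA, ECRTM, or FASTopic decoders.

```python
import math

def kl_divergence(target, pred, eps=1e-12):
    """D_KL(target || pred) for two discrete distributions over the same vocabulary."""
    return sum(
        t * (math.log(t + eps) - math.log(p + eps))
        for t, p in zip(target, pred)
        if t > 0
    )

def dsl_loss(y_dsl, y_hat, regularizer, lam=1e3):
    """lambda * KL(y_DSL || y_hat) + R_M, where R_M is supplied by the backbone."""
    return lam * kl_divergence(y_dsl, y_hat) + regularizer

def mixture_decoder(theta, beta):
    """Toy decoder (assumption): y_hat[v] = sum_k theta[k] * beta[k][v]."""
    vocab_size = len(beta[0])
    return [sum(theta[k] * beta[k][v] for k in range(len(theta))) for v in range(vocab_size)]

theta = [0.7, 0.3]                     # topic proportions for one document
beta = [[0.6, 0.3, 0.1],               # topic-word distribution of topic 0
        [0.1, 0.2, 0.7]]               # topic-word distribution of topic 1
y_dsl = [0.5, 0.3, 0.2]                # soft label from the teacher (made-up values)

y_hat = mixture_decoder(theta, beta)
loss = dsl_loss(y_dsl, y_hat, regularizer=0.05, lam=1e3)
print(y_hat, loss)
```

Only the first term changes relative to a BoW-trained model; whatever the backbone computes for its regularizer is added unchanged, which is what makes the swap model-agnostic.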
Loss & Training
The complete objective is the one given above; the hyperparameters \(\tau=3\) and \(\lambda=10^3\) are held fixed across all three datasets, with no per-dataset tuning. The teacher LMs are five instruction-tuned SLMs: ERNIE-4.5-0.3B, Qwen3.5-0.8B, Llama-3.2-1B, Phi-3-mini, and Llama-3.1-8B. Soft labels are pre-computed once and cached, so the LM is not needed while the topic model trains.
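The pre-compute-and-cache pattern might look like the sketch below. The cache format (a JSON file), the helper `load_or_compute_soft_labels`, and the `fake_teacher` stand-in are assumptions for illustration; the summary does not say how the authors store their labels.

```python
import json
import os
import tempfile

def load_or_compute_soft_labels(cache_path, documents, compute_fn):
    """Return cached soft labels if present; otherwise compute them once and write the cache."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f)
    labels = [compute_fn(doc) for doc in documents]
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(labels, f)
    return labels

teacher_calls = []

def fake_teacher(doc):
    """Hypothetical stand-in for the SLM forward pass; returns a uniform 4-word soft label."""
    teacher_calls.append(doc)
    return [0.25, 0.25, 0.25, 0.25]

docs = ["how to parse json in python", "segfault in c++ vector"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "soft_labels.json")
    first = load_or_compute_soft_labels(path, docs, fake_teacher)   # runs the teacher
    second = load_or_compute_soft_labels(path, docs, fake_teacher)  # reads the cache

print(len(teacher_calls))  # 2: the teacher ran once per document, only on the first call
```

Because the second call never touches the teacher, repeated training runs (different seeds or topic counts) can share one set of labels.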
Key Experimental Results
Main Results
Values are means over 4 topic counts (\(K=25, 50, 75, 100\)) × 5 seeds for each of the 3 datasets. The table below reports Purity (higher is better):
| Dataset | LDA | ProdLDA | CombinedTM | BERTopic | ProdLDA + DSL (Qwen3.5-0.8B) | ECRTM + DSL (Qwen3.5-0.8B) |
|---|---|---|---|---|---|---|
| 20NewsGroup | .301 | .356 | .391 | .352 | .542 | .561 |
| TweetTopic | .441 | .533 | .588 | .562 | .781 | .781 |
| StackOverflow | .174 | .265 | .306 | .202 | .788 | .805 |
On short texts like StackOverflow, Purity jumps from .265 (ProdLDA baseline) to .803 (Phi-3-mini + DSL). TweetTopic also improves from .588 (CombinedTM) to .787. LLM rating (topic quality) consistently increases from 2.49 (ProdLDA) to 2.89-2.92, while \(C_V\) coherence rises from .351 to .399-.404.
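For readers unfamiliar with the metric, here is a small sketch of how Purity can be computed. It assumes the standard clustering definition (each document goes to its highest-weight topic, and each topic is credited with its majority gold label); the summary does not spell out the paper's exact variant, and the data below is made up.

```python
from collections import Counter, defaultdict

def purity(doc_topic, labels):
    """Cluster purity: assign each document to its argmax topic, credit each topic's majority label."""
    clusters = defaultdict(list)
    for theta, label in zip(doc_topic, labels):
        top_topic = max(range(len(theta)), key=theta.__getitem__)
        clusters[top_topic].append(label)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority_total / len(labels)

# Toy topic proportions for five documents over two topics, with gold labels.
doc_topic = [[0.8, 0.2], [0.7, 0.3], [0.6, 0.4], [0.1, 0.9], [0.3, 0.7]]
labels = ["sports", "sports", "politics", "politics", "politics"]

print(purity(doc_topic, labels))  # 0.8: four of the five documents match their topic's majority label
```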
Ablation Study
A full 3x5 grid (topic model architecture x SLM teacher) was analyzed. The table below shows the differences between topic model backbones using the same SLM teacher on 20NewsGroup:
| Backbone (Qwen3.5-0.8B Teacher) | \(C_V\) | LLM rating | I-RBO | Purity |
|---|---|---|---|---|
| ProdLDA + DSL | .399 | 2.86 | .980 | .542 |
| ECRTM + DSL | .423 | 2.82 | .975 | .561 |
| FASTopic + DSL | .347 | 2.15 | 1.000 | .504 |
| (Baseline) ProdLDA | .351 | 2.49 | .992 | .356 |
Key Findings
- ProdLDA and ECRTM are enhanced across the board. FASTopic shows a slight decline in \(C_V\)/LLM rating on 20NewsGroup compared to its baseline. The authors attribute this to a target-solver mismatch: FASTopic's Sinkhorn-OT is designed for sparse BoW targets; DSL provides top-k dense distributions, forcing the OT to spread topic mass across a wider support set, weakening the "peaky" features favored by coherence measures.
- Results are highly stable across teacher sizes from 0.3B to 8B parameters. Even the smallest teacher, ERNIE-4.5-0.3B, significantly outperforms ZeroshotTM/CombinedTM/BERTopic/FASTopic, which use GTE-large-en-v1.5 (a 0.4B encoder). This suggests that passing semantic signals through next-token probabilities is more effective than passing them through sentence-encoder embeddings.
- The authors recommend ProdLDA + DSL as the default configuration because it offers the best performance/complexity trade-off. Its Purity is within 0.02 of ECRTM + DSL on all three datasets, and on 20NewsGroup its \(C_V\) trails by .024 (.399 vs .423). It consistently wins in LLM rating (e.g., 2.86 vs 2.82 on 20NewsGroup) without requiring ECRTM's clustering regularization or OT solvers.
Highlights & Insights
- Reconstruction targets can carry semantic priors: When BoW is insufficient, most work modifies the inputs (adding SBERT) or the architecture (OT). DSL highlights a neglected dimension: switching the target from token statistics to LM next-token distributions is a simple change that yields large empirical gains.
- Small models are sufficient teachers: A 0.3B-parameter ERNIE is enough, so the computational overhead of DSL is small. Once the soft labels are pre-computed, training proceeds at roughly the speed of the original ProdLDA with no LM in the loop, in contrast to TopicGPT-style methods that require online LLM prompting.
- Bayesian interpretation links distillation to topic frameworks: Interpreting \(y_{DSL}\) as an implicit posterior distribution and the topic model as a structured hypothesis family \(\mathcal{P}_{\mathcal{M}}\) frames DSL as more than a trick. It provides a template for using LMs as priors and structured models for inference, transferable to clustering, HMMs, state-space models, etc.
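The projection view can be written out as a short sketch using only the objective and the topic family from the Method section. The optimum \(\psi^\star\), the entropy \(H\), the BoW counts \(n_v\), and the document length \(N\) are notation introduced here for illustration, and the regularizer \(\mathcal{R}_{\mathcal{M}}\) is omitted for clarity.

```latex
% Distillation as a KL projection onto the topic family (regularizer R_M omitted).
\psi^\star = \arg\min_{\psi} \; \mathbb{E}_{x}\!\left[ D_{KL}\!\left( y_{DSL}(x,\pi) \,\|\, \hat{y}_\psi(x) \right) \right],
\qquad \hat{y}_\psi(x) \in \mathcal{P}_{\mathcal{M}}(x_{emb}) \subseteq \Delta^{|V|-1}

% The KL splits into a cross-entropy term and the teacher entropy H, which does not depend on \psi.
D_{KL}\!\left( y_{DSL} \,\|\, \hat{y}_\psi \right)
  = -\sum_{v \in V} y_{DSL}(v) \log \hat{y}_\psi(v) \;-\; H(y_{DSL}),
\qquad H(y_{DSL}) = -\sum_{v \in V} y_{DSL}(v) \log y_{DSL}(v)
```

Because \(H(y_{DSL})\) does not depend on \(\psi\), minimizing the KL term is the same as minimizing the cross-entropy against \(y_{DSL}\). Substituting the normalized BoW counts \(n_v / N\) for \(y_{DSL}(v)\) turns that cross-entropy into the usual BoW negative log-likelihood divided by \(N\), which is one way to see that DSL changes only the target and leaves the rest of the model intact.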
Limitations & Future Work
- The vocabulary \(V\) is limited to words that are a single token under the LM tokenizer (98% coverage). In multilingual or specialized corpora, the missing 2% may include thematically essential words. Future work could marginalize over multi-token words or use BPE-aware projections.
- The prompt \(\pi\) is a fixed monolingual English instruction. There is no systematic comparison of prompt designs (e.g., multi-turn, multilingual, CoT-style), making prompt sensitivity unclear.
- The drop in FASTopic + DSL performance on 20NewsGroup suggests that switching reconstruction targets is coupled with the original model's regularization design. "Plug-and-play" is not entirely free; extending DSL to new models requires verifying if regularization assumes sparse BoW targets.
- Evaluation still relies heavily on \(C_V\), LLM rating, and Purity. While the newly proposed retrieval-based metric is a valuable addition, the end-to-end value for real-world applications (summarization, interpretable clustering) needs more evidence.
Related Work & Insights
- vs ProdLDA / ECRTM / FASTopic: These models modify architectures based on BoW targets. DSL modifies the target itself and can be stacked with these models, showing that "target vs structure" are orthogonal design dimensions.
- vs TopicGPT / Mu et al. (2024): Purely prompt-based methods offer high topic interpretability but lose the probabilistic framework. DSL uses the LM as a one-time "teacher," gaining semantic priors while retaining probabilistic inference and uncertainty modeling.
- vs CombinedTM / ZeroshotTM: These use SBERT as input. DSL uses the LM's prompt-final hidden state as input and next-token distributions as targets; having source-aligned input and targets leads to more thorough topic alignment than external sentence encoders.
- vs Yang et al. (2025) (LLM-refined topics): This work refines topic distributions using LLMs and an OT loss after BoW training. DSL injects semantic signals earlier, during training itself; the two are complementary and could be stacked.
Rating
- Novelty: ⭐⭐⭐⭐ The perspective of using next-token probabilities as topic reconstruction targets is clear and hasn't been systematically explored; the Bayesian explanation elevates the method to a paradigm level.
- Experimental Thoroughness: ⭐⭐⭐⭐ The grid of 3 datasets × 3 topic models × 5 teachers × 4 topic counts × 5 seeds is extensive, and the primary metrics substantially exceed baselines for ProdLDA and ECRTM (FASTopic is the exception on 20NewsGroup). The appendix adds retrieval metrics and \(|V|\) variations.
- Writing Quality: ⭐⭐⭐⭐ Method sections clearly explain targets, inputs, losses, and Bayesian interpretations. Figure 1 provides an intuitive motivation via BoW vs DSL comparison.
- Value: ⭐⭐⭐⭐ A small change to the training target plus a sub-1B teacher can roughly triple Purity on StackOverflow (.265 to .788 for ProdLDA with the Qwen3.5-0.8B teacher), making it highly practical for industrial topic modeling and interpretable clustering pipelines.