LLM-XTM: Enhancing Cross-Lingual Topic Models with Large Language Models
Conference: ACL 2026
arXiv: 2605.03299
Code: https://github.com/tienphat140205/LLM-XTM (Available)
Area: Multilingual / Topic Modeling
Keywords: Cross-Lingual Topic Model, LLM Refinement, Self-Consistency, MMD Alignment, QA-style Document Alignment
TL;DR
A two-stage enhancement module consisting of "LLM refinement + self-consistency voting + MMD word distribution alignment + QA-style document semantic alignment" is wrapped around pre-trained cross-lingual topic models. It serves as a plugin for various backbones such as NMTM, InfoCTM, and XTRA. Across three bilingual corpora (EC News, Amazon Review, Rakuten Amazon), it improves CNPMI by 9%–51% and TQ by 6%–44%, while reducing LLM calls to "once every \(f\) epochs."
Background & Motivation
Background: The goal of Cross-Lingual Topic Modeling (CLTM) is to extract "semantically corresponding" topic pairs from multilingual corpora, where high-frequency words for the same topic across different languages (e.g., English/Chinese/Japanese) are semantically consistent. Prevailing methods (MCTA, MTAnchor, NMTM, InfoCTM, XTRA, etc.) largely rely on external bilingual resources: parallel corpora, seed dictionaries, bilingual embeddings, or anchor words.
Limitations of Prior Work: Low bilingual dictionary coverage and noise in parallel corpora (mistranslations, domain shifts, lexical ambiguity) cause "nominally aligned" topics to drift apart semantically. Table 1 of the paper gives a striking example in which InfoCTM pairs the English words rating/gauge/height/mile/shoe with Chinese financial terms (investors/finance/funds/stock market/index), which are completely unrelated.
Key Challenge: Shallow corpus-driven signals cannot capture deep cross-lingual semantic consistency, whereas LLMs carry deep semantic priors from massive multilingual pre-training. Existing LLM-based approaches, however, suffer from three issues: (1) they call the LLM independently for each document and treat its outputs as ground truth, which ignores global corpus structure and incurs high cost; (2) LLM outputs are unstable and prone to hallucination; (3) white-box solutions such as LLM-in-the-Loop (LLM-ITL) require token probabilities, which are unavailable for closed-source models (e.g., Gemini/Claude).
Goal: Inject LLM semantic knowledge into both the topic-word distribution \(\beta\) and document-topic distribution \(\theta\) with minimal LLM calls, while ensuring (a) black-box availability, (b) robustness to hallucinations, and (c) preservation of the backbone's existing corpus-driven signals.
Key Insight: Drawing from the "self-consistency as uncertainty measure" idea in SelfCheckGPT, the authors sample LLM outputs multiple times, retaining high-consistency words and discarding low-consistency ones to filter hallucinations via voting. Simultaneously, the task of "assigning a document to a topic" is re-interpreted as a QA-style matching where the document is the "question" and the refined topic word set is the "candidate answer," using a multilingual encoder (BGE-M3) to calculate cosine similarity.
Core Idea: Wrap LLM refinement as a periodic black-box process with self-consistency voting. The refined word sets are aligned with the original \(\beta\) via an MMD loss, and \(\theta\) is pulled, via a KL-divergence loss, toward a semantic target \(\hat{\theta}\) computed from BGE-M3 embeddings. This "softly guides" the backbone toward more coherent and cross-lingually aligned solutions without disrupting its original loss functions.
Method
Overall Architecture
LLM-XTM is a two-phase post-processing enhancement:
- Phase 1: Run an off-the-shelf VAE-based CLTM backbone (NMTM / InfoCTM / XTRA) with its original loss \(\mathcal{L}_{\text{Phase 1}}\) until convergence to obtain \(\beta^{(\ell)}\) and \(\theta_d\).
- Phase 2: Train the converged model for an additional 30 epochs. Every \(f\) epochs, trigger LLM refinement and inject the results into the backbone via two additional losses, \(\mathcal{L}_{\text{MMD}}\) and \(\mathcal{L}_{\text{doc-align}}\).
The total objective is \(\mathcal{J}(\phi, \psi) = \mathcal{L}_{\text{Phase 1}} + \lambda_{\text{mmd}} \mathcal{L}_{\text{MMD}} + \lambda_{\text{qa}} \mathcal{L}_{\text{doc-align}}\). In an epoch where refinement is triggered, the workflow is: take the top-15 words per language \(\rightarrow\) call the LLM \(R\) times and vote to get \(\bar{w}_k\) \(\rightarrow\) compute \(\mathcal{L}_{\text{MMD}}\) \(\rightarrow\) encode documents and topics with BGE-M3 \(\rightarrow\) compute the KL-based \(\mathcal{L}_{\text{doc-align}}\) \(\rightarrow\) update the backbone. The LLM module remains a black box to the backbone and requires no token probabilities.
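The Phase 2 schedule and objective can be sketched as follows. This is a minimal illustration, not the authors' implementation: the trigger rule (`epoch % f == 0`), the assumption of one LLM call per topic per voting round, and all function names are hypothetical. The values 30 epochs, \(R = 5\), \(f = 10\), and 50 topics are taken from the settings reported in this summary.

```python
from typing import List


def refinement_epochs(num_epochs: int, f: int) -> List[int]:
    """0-indexed epochs at which LLM refinement fires.

    Assumption: refinement is triggered whenever epoch % f == 0.
    """
    return [e for e in range(num_epochs) if e % f == 0]


def total_llm_calls(num_epochs: int, f: int, num_topics: int, R: int) -> int:
    """Assumption: each refinement issues one call per topic per voting round."""
    return len(refinement_epochs(num_epochs, f)) * num_topics * R


def phase2_objective(l_phase1: float, l_mmd: float, l_doc_align: float,
                     lam_mmd: float, lam_qa: float) -> float:
    """J = L_phase1 + lam_mmd * L_MMD + lam_qa * L_doc-align."""
    return l_phase1 + lam_mmd * l_mmd + lam_qa * l_doc_align


epochs = refinement_epochs(num_epochs=30, f=10)            # [0, 10, 20]
periodic = total_llm_calls(30, f=10, num_topics=50, R=5)   # 3 * 50 * 5
every_epoch = total_llm_calls(30, f=1, num_topics=50, R=5) # 30 * 50 * 5
print(epochs, periodic, every_epoch)
```

Under these assumptions the number of LLM calls depends on the number of topics, \(R\), and \(f\), but not on the number of documents.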
Key Designs
- Self-Consistent Cross-Lingual Topic Word Refinement:
  - Function: Merges the backbone's top-15 words from each language (e.g., English and Chinese) into a candidate pool \(C_k = w_k^{(\text{en})} \cup w_k^{(\text{zh})}\). The LLM removes noise, fills gaps, and retains common topic words, outputting a refined set \(\bar{w}_k\) (15 words per language).
  - Mechanism: Because LLM outputs vary between runs, the same prompt is run \(R\) times to obtain \(\tilde{w}_k^{(1)}, \dots, \tilde{w}_k^{(R)}\). The hit frequency of each word is \(f_k(v) = \frac{1}{R}\sum_{r=1}^R \mathbf{1}\{v \in \tilde{w}_k^{(r)}\}\), and the top-\(M\) words by hit frequency form the final \(\bar{w}_k\). A "refinement frequency" hyperparameter \(f\) makes the LLM run only every \(f\) epochs rather than every step, which the summary credits with reducing cost by about an order of magnitude.
  - Design Motivation: Consistency across multiple samples is a strong signal for detecting hallucinations. High-consistency words are likely core topic words, whereas low-consistency words are often fabricated. This approximates uncertainty estimation without requiring logit access.
- MMD Topic-Word Distribution Alignment (MMD Refinement Loss):
  - Function: Pulls the backbone's original \(\beta_k^{(\text{raw})}\) toward the LLM-derived target distribution \(\beta_k^{(\text{refined})}\) without losing corpus-driven signals.
  - Mechanism: For each topic \(k\), two distributions are constructed: the "raw" distribution from the decoder's top-\(N\) word probabilities and the "refined" distribution from the \(R\)-round voting counts. Both are language-balanced and normalized. Squared MMD is computed in the BGE-M3 embedding space using a Gaussian kernel (bandwidth set by the median heuristic): \(\mathcal{L}_{\text{MMD}} = \frac{1}{K}\sum_{k=1}^{K} \text{MMD}^2(\beta_k^{\text{(raw)}}, \beta_k^{\text{(refined)}})\).
  - Design Motivation: MMD outperformed Optimal Transport (OT) in the ablation on NMTM + Rakuten Amazon (CNPMI 0.016 vs. 0.013). Kernel methods naturally treat synonym replacement as a match via embedding similarity, whereas OT is more sensitive to word identity. MMD provides a "soft alignment" that is gentler than hard replacement and does not disrupt the reconstruction loss.
- QA-style Document-Topic Alignment (Document-Topic Alignment via QA):
  - Function: Ensures documents with the same semantics in different languages have similar document-topic distributions \(\theta_d\).
  - Mechanism: Documents are treated as questions and refined topics as candidate answers. BGE-M3 encodes each document into \(h_d\) and each refined word set \(\bar{w}_k\) into a topic vector \(t_k = \text{Enc}(\bar{w}_k)\). The cosine similarity \(s_{d,k}\) between \(h_d\) and \(t_k\) is computed, followed by a softmax with temperature \(\tau\) to obtain the target \(\hat{\theta}_{d,k} = \frac{\exp(s_{d,k}/\tau)}{\sum_j \exp(s_{d,j}/\tau)}\). The backbone's \(\theta_d\) is pulled toward this target using KL divergence: \(\mathcal{L}_{\text{doc-align}} = \sum_{d=1}^D \text{KL}(\theta_d \| \hat{\theta}_d)\).
  - Design Motivation: Bag-of-words (BoW) representations are not directly comparable across languages, but multilingual sentence embeddings bridge this gap. Using semantic similarity as external supervision encourages the backbone to learn cross-lingually consistent document representations.
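The QA-style target \(\hat{\theta}_d\) and the KL-based \(\mathcal{L}_{\text{doc-align}}\) described in the third design can be sketched in plain Python. This is a minimal sketch under stated assumptions: the 2-D vectors are hypothetical stand-ins for BGE-M3 embeddings, the temperature value and the `eps` smoothing are illustrative choices, and the function names are not from the paper.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def qa_target(h_d: Sequence[float], topic_vecs: List[Sequence[float]],
              tau: float) -> List[float]:
    """theta_hat_{d,k} = softmax_k(cos(h_d, t_k) / tau)."""
    scaled = [cosine(h_d, t) / tau for t in topic_vecs]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q); eps is an assumed smoothing constant to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def doc_align_loss(thetas: List[Sequence[float]],
                   targets: List[Sequence[float]]) -> float:
    """Sum over documents of KL(theta_d || theta_hat_d)."""
    return sum(kl_divergence(t, th) for t, th in zip(thetas, targets))


# Toy example: one document, two topics (hypothetical embeddings).
h_d = [1.0, 0.0]
topic_vecs = [[1.0, 0.0], [0.0, 1.0]]
theta_hat = qa_target(h_d, topic_vecs, tau=0.1)
theta = [0.5, 0.5]                       # backbone's current document-topic mix
loss = doc_align_loss([theta], [theta_hat])
print(theta_hat, loss)
```

A smaller \(\tau\) sharpens the target toward the single closest topic, so the KL term penalizes a diffuse \(\theta_d\) more strongly.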
Loss & Training
The total Phase 2 objective is \(\mathcal{J} = \mathcal{L}_{\text{Phase 1}} + \lambda_{\text{mmd}} \mathcal{L}_{\text{MMD}} + \lambda_{\text{qa}} \mathcal{L}_{\text{doc-align}}\). Experiments used \(\lambda_{\text{mmd}} = 20,000\), \(\lambda_{\text{qa}} \in \{100, 200, 300\}\), \(f \in \{8, 10\}\), and \(R = 5\). The LLM components use the Gemini API, and Phase 2 completes in 30 epochs on a single NVIDIA P100.
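The \(\mathcal{L}_{\text{MMD}}\) term in this objective can be illustrated with a small weighted squared-MMD computation. This is a sketch under assumptions, not the paper's code: it uses the biased (V-statistic) estimator over weighted word embeddings, sets the Gaussian bandwidth to the median of the positive pairwise distances in the pooled sample, and uses hypothetical 2-D vectors in place of BGE-M3 embeddings.

```python
import math
from statistics import median


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def median_bandwidth(points):
    """Median heuristic: median of positive pairwise Euclidean distances."""
    dists = [math.sqrt(sq_dist(points[i], points[j]))
             for i in range(len(points)) for j in range(i + 1, len(points))]
    positive = [d for d in dists if d > 0.0]
    return median(positive) if positive else 1.0


def weighted_mmd2(X, p, Y, q):
    """Squared MMD between sum_i p_i*delta(x_i) and sum_j q_j*delta(y_j)."""
    sigma = median_bandwidth(X + Y)

    def k(u, v):  # Gaussian kernel
        return math.exp(-sq_dist(u, v) / (2.0 * sigma ** 2))

    def cross(A, a, B, b):
        return sum(ai * bj * k(x, y)
                   for x, ai in zip(A, a) for y, bj in zip(B, b))

    return cross(X, p, X, p) + cross(Y, q, Y, q) - 2.0 * cross(X, p, Y, q)


def mmd_loss(topics):
    """Average squared MMD over topics; each item is (X, p, Y, q)."""
    return sum(weighted_mmd2(*t) for t in topics) / len(topics)


# Toy topic: three words with hypothetical 2-D embeddings.
emb = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
raw = [0.5, 0.3, 0.2]        # decoder probabilities (normalized)
refined = [0.2, 0.3, 0.5]    # normalized voting counts
same = weighted_mmd2(emb, raw, emb, raw)
diff = weighted_mmd2(emb, raw, emb, refined)
loss = mmd_loss([(emb, raw, emb, raw), (emb, raw, emb, refined)])
print(same, diff, loss)
```

Because the kernel compares embeddings rather than word identities, moving mass between two nearby words changes the loss less than moving it between distant ones.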
Key Experimental Results
Main Results
Evaluated on EC News (EN-ZH), Amazon Review (EN-ZH), and Rakuten Amazon (JA-EN) benchmarks using CNPMI (cross-lingual coherence), TU (topic uniqueness), and TQ = max(0, CNPMI) × TU:
| Backbone | Dataset | CNPMI (base \(\rightarrow\) +LLM-XTM) | TQ Gain | Notes |
|---|---|---|---|---|
| XTRA | EC News | 0.078 \(\rightarrow\) 0.088 | +10.5% | TU slightly down 2.5% |
| XTRA | Amazon Review | 0.053 \(\rightarrow\) 0.072 | +32.7% | CNPMI +35.8% |
| InfoCTM | EC News | 0.041 \(\rightarrow\) 0.062 | +43.6% | CNPMI +51.2% |
| InfoCTM | Amazon Review | 0.037 \(\rightarrow\) 0.050 | +38.2% | TU up 0.3% |
| NMTM | Amazon Review | 0.043 \(\rightarrow\) 0.056 | +34.6% | CNPMI +30.2% |
| NMTM | Rakuten Amazon | 0.012 \(\rightarrow\) 0.016 | +37.5% | CNPMI +33.3% |
Across all 9 backbone-dataset combinations (the table above lists six of them), CNPMI gains were consistently positive while TU changes stayed within roughly ±5%, which supports the claim that LLM-XTM works as a backbone-agnostic enhancement layer. On the long-document benchmark (Airiti Thesis), TQ increased by up to +121.3%. In downstream document classification, cross-lingual accuracy (-C) also improved (e.g., 0.734 \(\rightarrow\) 0.788 for InfoCTM on Rakuten Amazon).
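The TQ definition and the relative CNPMI gains quoted in the results table can be checked with a few lines of Python. The helper names are illustrative. TQ gains are not recomputed here, because the table reports CNPMI rounded to three decimals and TU is only given for some rows.

```python
def topic_quality(cnpmi: float, tu: float) -> float:
    """TQ = max(0, CNPMI) * TU, so negative coherence yields zero quality."""
    return max(0.0, cnpmi) * tu


def relative_gain(base: float, new: float) -> float:
    """Relative improvement in percent."""
    return (new - base) / base * 100.0


print(round(relative_gain(0.053, 0.072), 1))  # XTRA, Amazon Review     -> 35.8
print(round(relative_gain(0.041, 0.062), 1))  # InfoCTM, EC News        -> 51.2
print(round(relative_gain(0.012, 0.016), 1))  # NMTM, Rakuten Amazon    -> 33.3
print(topic_quality(-0.01, 0.9))              # clipped to 0.0
```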
Ablation Study
On NMTM + Rakuten Amazon (50 topics):
| Configuration | CNPMI | TU | EN-C | JA-C | Description |
|---|---|---|---|---|---|
| NMTM (base) | 0.012 | 0.633 | 0.610 | 0.681 | Backbone |
| Full LLM-XTM | 0.016 | 0.666 | 0.621 | 0.728 | All components |
| w/o \(\mathcal{L}_{\text{doc-align}}\) | 0.012 | 0.679 | 0.611 | 0.723 | CNPMI drops to base |
| w/o \(\mathcal{L}_{\text{MMD}}\) | 0.012 | 0.641 | 0.621 | 0.723 | CNPMI drops to base; TU drops by 0.025 |
| w/o self-consistency | 0.011 | 0.654 | 0.619 | 0.720 | CNPMI falls below base |
| MMD \(\rightarrow\) OT | 0.013 | 0.664 | 0.620 | 0.720 | OT is 18.7% worse in CNPMI |
Key Findings
- \(\mathcal{L}_{\text{doc-align}}\) is crucial for alignment: Removing it causes CNPMI to drop to backbone levels, showing topic-word alignment alone cannot constrain document-level consistency.
- Self-consistency is indispensable: Without voting, CNPMI falls below the base (0.011 < 0.012), as single-call hallucinations contaminate the backbone.
- MMD over OT: Replacing MMD with OT lowers CNPMI from 0.016 to 0.013; the likely reason is that kernel methods are more tolerant of synonym replacements in embedding space.
- Hyperparameter Sensitivity: \(R \in [5, 7]\) is the sweet spot for voting rounds. Frequent refinement (\(f\) small) increases CNPMI but can lower TU.
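The voting step that \(R\) controls, namely the hit frequency \(f_k(v)\) and the top-\(M\) selection, can be sketched as below. The sampled word sets are made-up examples rather than LLM outputs, the alphabetical tie-break is an assumption, and the sketch ignores the per-language balancing (15 words per language) for brevity.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def self_consistency_vote(samples: List[Set[str]],
                          M: int) -> Tuple[List[str], Dict[str, float]]:
    """Return the top-M words by hit frequency f(v) = (1/R) * #{r : v in sample_r}.

    Assumption: ties are broken alphabetically for determinism.
    """
    R = len(samples)
    counts = Counter(word for sample in samples for word in sample)
    freq = {word: c / R for word, c in counts.items()}
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    return ranked[:M], freq


# R = 5 hypothetical refinement outputs for one finance-like topic.
samples = [
    {"stock", "market", "investor", "shoe"},
    {"stock", "market", "investor", "fund"},
    {"stock", "market", "fund", "mile"},
    {"stock", "investor", "market", "fund"},
    {"stock", "market", "investor", "index"},
]
refined, freq = self_consistency_vote(samples, M=4)
print(refined)           # ['market', 'stock', 'investor', 'fund']
print(freq["investor"])  # 0.8
```

Words that appear in only one of the five samples ("shoe", "mile", "index") receive a hit frequency of 0.2 and are dropped, which is how voting filters one-off outputs.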
Highlights & Insights
- Dual control of cost and hallucinations: Periodic calls (\(f\)) combined with self-consistent voting (\(R\)) transform the LLM from a per-document oracle into a periodic expert consultant.
- Soft guidance via MMD: Using MMD with Gaussian kernels in embedding space recognizes synonymy, allowing the backbone to maintain its reconstruction signals while being guided toward semantic coherence.
- QA Paradigm for Alignment: Reframing document-topic distribution as a retrieval task allows multilingual sentence encoders like BGE-M3 to act as a bridge, bypassing the incomparability of BoW across languages.
- True Plugin Architecture: Consistent gains across multiple backbones and datasets suggest that this post-processing paradigm is more economical than designing new backbones from scratch.
Limitations & Future Work
- Backbone Dependency: LLM-XTM is a refinement tool and cannot create reasonable topics from a completely failed initialization.
- API Cost and Latency: Despite periodic calls, \(R=5\) voting still involves multiple calls per topic, which may be expensive for large-scale corpora or many topics.
- Language Scope: Validated only on EN-ZH and EN-JA pairs. Performance on typologically distant or low-resource pairs depends heavily on the quality of the multilingual encoder.
- Encoder Dependency: Results may vary with different encoders; future work could investigate distilling BGE-M3 into smaller, more efficient models.
Related Work & Insights
- vs. LLM-ITL: LLM-ITL requires white-box token probabilities for uncertainty estimation; LLM-XTM uses black-box sampling/voting.
- vs. TopicGPT: TopicGPT calls LLMs per document (unscalable); LLM-XTM calls LLMs per topic, decoupling cost from corpus size.
- vs. XTRA: XTRA uses contrastive learning for alignment; LLM-XTM is complementary, providing further gains (+10.5% TQ) when stacked on top.
- Insight: The MMD + QA paradigm could plausibly be generalized to refining other latent distributions in generative models with LLMs, such as item embeddings in recommendation systems.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of self-consistency, MMD, and QA into a black-box plugin is innovative; the QA-style \(\theta\) alignment is a notable contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Extensive testing across backbones, datasets, and ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Method is clearly explained with convincing qualitative examples.
- Value: ⭐⭐⭐⭐ Provides a highly effective, ready-to-use plugin for the CLTM community.