Understanding Cross-Domain Adaptation in Low-Resource Topic Modeling

Conference: ACL 2025
arXiv: 2506.07453
Code: None
Area: Others
Keywords: Topic Modeling, Domain Adaptation, Low-Resource, Adversarial Training, Generalization Upper Bound

TL;DR

This work introduces domain adaptation formally into low-resource topic modeling for the first time. It derives a finite-sample generalization upper bound to guide method design and proposes the DALTA framework, which selectively transfers cross-domain topic knowledge through a shared encoder, domain-specific decoders, and adversarial alignment.

Background & Motivation

Topic modeling is a fundamental task in text analysis used to discover latent semantic structures in documents. Despite continuous advancements in Neural Topic Models (NTMs) and Contextualized Topic Models (CTMs), severe challenges remain in low-resource scenarios (e.g., emerging fields, niche domains, and privacy-restricted medical/legal texts):

Data scarcity leading to topic instability: When the target domain has too few documents (for example, a field such as quantum machine learning may have fewer than 1000 public documents), traditional topic models fail to extract stable and coherent topics.

Limitations of prior work:
- Word embedding-based methods (Duan et al.) use static embeddings and cannot adapt to semantic drift.
- Context-guided embedding adaptation (Meta-CETM) still relies solely on target domain data.
- FASTopic assumes source domain knowledge is universally applicable, which may introduce irrelevant information.

Key Challenge: How to transfer useful topic knowledge from a high-resource source domain (e.g., news articles) to a low-resource target domain (e.g., medical research), while avoiding the introduction of irrelevant information (e.g., election/economy content from news should not contaminate genomics topics in the medical domain).

Domain adaptation has been widely used in supervised learning but remains largely unexplored in unsupervised topic modeling. This paper addresses that gap.

Method

Overall Architecture

The DALTA (Domain-Aligned Latent Topic Adaptation) framework consists of four core components:
1. Shared encoder \(q_\phi\): Extracts domain-invariant features.
2. Domain-specific decoders \(p_{\theta_S}, p_{\theta_T}\): Capture source- and target-specific semantics respectively.
3. Domain discriminator \(C\): Aligns latent representations via adversarial training.
4. Consistency loss: Ensures consistency of reconstruction functions across domains.
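
The component list above can be sketched structurally in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the class names `SharedEncoder` and `DomainDecoder`, the sizes, and the random linear weights are all hypothetical, the encoder is deterministic rather than variational, and the step that maps topic proportions back to a word distribution is an assumed design. The point is only to show one encoder feeding two decoders whose topic counts differ from each other and from the latent size.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; stands in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class SharedEncoder:
    """Hypothetical deterministic stand-in for q_phi: bag-of-words -> latent z."""

    def __init__(self, vocab_size, latent_dim, rng):
        self.weights = random_matrix(latent_dim, vocab_size, rng)

    def __call__(self, bow):
        return matvec(self.weights, bow)


class DomainDecoder:
    """Hypothetical stand-in for p_theta_S / p_theta_T with its own topic count."""

    def __init__(self, latent_dim, num_topics, vocab_size, rng):
        self.to_topics = random_matrix(num_topics, latent_dim, rng)
        self.topic_word = random_matrix(vocab_size, num_topics, rng)

    def topic_proportions(self, z):
        # Document-topic distribution (alpha) for this domain.
        return softmax(matvec(self.to_topics, z))

    def reconstruct(self, z):
        # Assumed reconstruction: word distribution over the shared vocabulary.
        return softmax(matvec(self.topic_word, self.topic_proportions(z)))


rng = random.Random(0)
vocab_size, latent_dim = 6, 4
encoder = SharedEncoder(vocab_size, latent_dim, rng)
source_decoder = DomainDecoder(latent_dim, num_topics=3, vocab_size=vocab_size, rng=rng)
target_decoder = DomainDecoder(latent_dim, num_topics=2, vocab_size=vocab_size, rng=rng)

doc = [2, 0, 1, 0, 0, 3]          # toy bag-of-words counts
z = encoder(doc)                   # shared latent representation
alpha_s = source_decoder.topic_proportions(z)
alpha_t = target_decoder.topic_proportions(z)
print(len(z), len(alpha_s), len(alpha_t))
```

Because both decoders reconstruct over the same vocabulary in this sketch, their outputs stay comparable even though the topic counts differ.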

Key Designs

Finite-Sample Generalization Upper Bound (Theorem 1): This is the paper's central theoretical contribution. The derived upper bound on the target domain error has six terms, which fall into five groups:
- Empirical reconstruction error (1st-2nd terms): Reconstruction quality of the source and target domains.
- KL divergence regularization (3rd term): Prevents overfitting and constrains the learned latent representations to be close to the prior.
- Latent space discrepancy (4th term): The \(\mathcal{H}\)-divergence between the latent representations of the source and target domains.
- Reconstruction function discrepancy (5th term): The difference between the optimal reconstruction functions across the two domains.
- Complexity term (6th term): Model capacity and statistical fluctuations of finite samples.

The first five terms can be optimized through methodological design, whereas the sixth depends on model capacity and sample size and is not targeted by any of the training losses.
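
As a reading aid, the structure of the bound described above can be written schematically. The symbols \(\hat{\epsilon}_S\), \(\hat{\epsilon}_T\), \(\Delta_{rec}\), and \(\mathcal{C}(n_S, n_T)\) are placeholders introduced here, not the paper's notation, and all constants and weights are omitted; the exact statement is Theorem 1 in the paper.

\[\epsilon_T \;\lesssim\; \underbrace{\hat{\epsilon}_S + \hat{\epsilon}_T}_{\text{terms 1-2}} + \underbrace{\mathrm{KL}\big(q_\phi(Z \mid X)\,\|\,p(Z)\big)}_{\text{term 3}} + \underbrace{d_{\mathcal{H}}(Z_S, Z_T)}_{\text{term 4}} + \underbrace{\Delta_{rec}}_{\text{term 5}} + \underbrace{\mathcal{C}(n_S, n_T)}_{\text{term 6}}\]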

Adversarial Domain Alignment: The shared encoder \(q_\phi\) learns domain-invariant features via min-max adversarial training: it is trained so that the domain discriminator \(C\) cannot distinguish source from target latent representations. By Proposition 1, as the discriminator's classification error approaches 0.5 (random guessing), the \(\mathcal{H}\)-divergence approaches 0, meaning the two latent distributions are aligned. This directly targets the 4th term of the generalization upper bound.
Domain-Specific Decoders: The source domain decoder \(p_{\theta_S}\) and target domain decoder \(p_{\theta_T}\) infer document-topic distributions \(\alpha_S\) and \(\alpha_T\) for their respective domains. The number of topics in the two domains can differ and is independent of the latent space size, providing flexibility.
Consistency Loss: Forces aligned latent representations to produce similar outputs when passing through the two decoders: \(\mathcal{L}_{cons} = \mathbb{E}[\|p_{\theta_S}(Z) - p_{\theta_T}(Z)\|^2]\), which directly optimizes the 5th term of the generalization upper bound.
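
Two of the quantities above are simple enough to compute by hand, and a minimal sketch may help. The first function uses the standard proxy A-distance from the domain adaptation literature, \(2(1 - 2\epsilon)\), as an assumed stand-in for the relationship in Proposition 1; the paper's exact statement may differ. The second evaluates \(\mathcal{L}_{cons}\) as a batch mean of squared L2 distances, assuming both decoders emit vectors of the same length (for example, word distributions over a shared vocabulary). The function names are hypothetical.

```python
def proxy_a_distance(discriminator_error):
    """Proxy A-distance 2 * (1 - 2 * err) for a domain classifier's error rate.

    Assumption: this standard proxy stands in for the paper's Proposition 1.
    An error above 0.5 is folded back, since flipping the classifier's labels
    would give error 1 - err.
    """
    if not 0.0 <= discriminator_error <= 1.0:
        raise ValueError("error rate must lie in [0, 1]")
    err = min(discriminator_error, 1.0 - discriminator_error)
    return 2.0 * (1.0 - 2.0 * err)


def consistency_loss(source_outputs, target_outputs):
    """Batch mean of ||p_S(z) - p_T(z)||^2 over paired decoder outputs."""
    if len(source_outputs) != len(target_outputs):
        raise ValueError("both decoders must be evaluated on the same batch")
    total = 0.0
    for out_s, out_t in zip(source_outputs, target_outputs):
        total += sum((a - b) ** 2 for a, b in zip(out_s, out_t))
    return total / len(source_outputs)


print(proxy_a_distance(0.5))   # discriminator at chance
print(proxy_a_distance(0.0))   # discriminator separates the domains perfectly
print(consistency_loss([[0.5, 0.5]], [[0.5, 0.5]]))
print(consistency_loss([[1.0, 0.0]], [[0.0, 1.0]]))
```

A discriminator at chance gives a proxy distance of 0, and identical decoder outputs give a consistency loss of 0, which matches the two limiting cases described in the text.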

Loss & Training

Total objective function:

\[\mathcal{L}_{DALTA} = \mathcal{L}_{rec} + \omega_{adv}\mathcal{L}_{adv} + \omega_{cons}\mathcal{L}_{cons} + \omega_{KL}\mathcal{L}_{KL}\]

where:
- \(\mathcal{L}_{rec}\): Reconstruction loss (optimizes the 1st-2nd terms of the upper bound)
- \(\mathcal{L}_{adv}\): Adversarial loss (optimizes the 4th term of the upper bound)
- \(\mathcal{L}_{cons}\): Consistency loss (optimizes the 5th term of the upper bound)
- \(\mathcal{L}_{KL}\): KL divergence regularization (optimizes the 3rd term of the upper bound)
- \(\omega_{adv}, \omega_{cons}, \omega_{KL}\): Weight hyperparameters

Training process: In each iteration, a batch of data is sampled from both the source and target domains. After encoding, four losses are calculated. The encoder and decoders are jointly updated (minimizing \(\mathcal{L}_{DALTA}\)), and the discriminator is updated separately (maximizing \(\mathcal{L}_{adv}\)).
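
The sign convention in the two updates is easy to get wrong, so here is a small sketch. It assumes \(\mathcal{L}_{adv}\) takes the usual GAN-style value form, mean \(\log C(z_S)\) plus mean \(\log(1 - C(z_T))\), where \(C(z)\) is the discriminator's predicted probability that \(z\) comes from the source domain. The summary does not give the formula for \(\mathcal{L}_{adv}\), so this form, the function names, and the default weights of 1.0 are assumptions for illustration only.

```python
import math


def adversarial_value(p_source, p_target, eps=1e-12):
    """Assumed form of L_adv: mean log C(z_S) + mean log(1 - C(z_T)).

    p_source / p_target hold the discriminator's predicted probability of
    "source" for source and target latents. The discriminator wants this value
    high; the encoder, which minimizes the total objective, wants it low.
    """
    src = sum(math.log(max(p, eps)) for p in p_source) / len(p_source)
    tgt = sum(math.log(max(1.0 - p, eps)) for p in p_target) / len(p_target)
    return src + tgt


def dalta_objective(l_rec, l_adv, l_cons, l_kl, w_adv=1.0, w_cons=1.0, w_kl=1.0):
    """Weighted sum from the total objective; the weights here are placeholders."""
    return l_rec + w_adv * l_adv + w_cons * l_cons + w_kl * l_kl


# A discriminator that separates the domains well versus one at chance.
confident = adversarial_value([0.99, 0.98], [0.02, 0.01])
chance = adversarial_value([0.5, 0.5], [0.5, 0.5])
print(confident > chance)

# Toy loss values (made up) combined with unit weights.
total = dalta_objective(l_rec=1.0, l_adv=chance, l_cons=0.5, l_kl=0.2)
print(total)
```

Under this assumed form, a confident discriminator yields a value near 0 and a discriminator at chance yields \(2\ln 0.5\), so the discriminator's update raises \(\mathcal{L}_{adv}\) while the encoder's update on \(\mathcal{L}_{DALTA}\) pushes it back down.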

Key Experimental Results

Main Results: Topic Quality (\(C_V\) Coherence Score)

\(C_V\) scores across datasets when k=10:

Model	Newsgroup Sci	Newsgroup Rel	Drug Nor	Drug Norges	Yelp	SMS Spam
LDA	0.425	0.424	0.439	0.461	0.394	0.351
ProdLDA	0.410	0.422	0.437	0.422	0.398	0.471
CTM	0.476	0.407	0.466	0.422	0.398	0.471
Meta-CETM	0.396	0.409	0.493	0.426	0.406	0.452
FASTopic	0.406	0.389	0.517	0.413	0.440	0.464
DALTA	0.493	0.431	0.582	0.484	0.448	0.503
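
To make the table easier to read, the following sketch computes DALTA's margin over the strongest baseline on each dataset. The numbers are copied from the table above; the helper name `margins_over_best_baseline` is hypothetical, and when two baselines tie, the first one in table order is reported.

```python
# C_V scores copied from the k=10 table; column order follows the table header.
datasets = ["Newsgroup Sci", "Newsgroup Rel", "Drug Nor", "Drug Norges", "Yelp", "SMS Spam"]
scores = {
    "LDA":       [0.425, 0.424, 0.439, 0.461, 0.394, 0.351],
    "ProdLDA":   [0.410, 0.422, 0.437, 0.422, 0.398, 0.471],
    "CTM":       [0.476, 0.407, 0.466, 0.422, 0.398, 0.471],
    "Meta-CETM": [0.396, 0.409, 0.493, 0.426, 0.406, 0.452],
    "FASTopic":  [0.406, 0.389, 0.517, 0.413, 0.440, 0.464],
    "DALTA":     [0.493, 0.431, 0.582, 0.484, 0.448, 0.503],
}


def margins_over_best_baseline(scores, datasets, method="DALTA"):
    """For each dataset, return (best baseline, method score minus that baseline)."""
    result = {}
    for i, name in enumerate(datasets):
        baselines = {m: row[i] for m, row in scores.items() if m != method}
        best_model = max(baselines, key=baselines.get)
        gain = round(scores[method][i] - baselines[best_model], 3)
        result[name] = (best_model, gain)
    return result


margins = margins_over_best_baseline(scores, datasets)
for name, (best_model, gain) in margins.items():
    print(f"{name}: +{gain:.3f} over {best_model}")
```

On these numbers the margin is positive on every dataset, largest on Drug Nor (0.065) and SMS Spam (0.032), and smallest on Newsgroup Rel (0.007).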

Ablation Study

Configuration	Key Metric	Description
w/o adversarial loss	Significant decrease in \(C_V\)	Domain alignment is critical for cross-domain transfer
w/o consistency loss	Slight decrease in \(C_V\)	Reconstruction function consistency contributes positively
w/o KL regularization	Decrease in topic diversity	Regularization prevents latent space collapse
Full DALTA	Optimal	All components are complementary and indispensable

Key Findings

DALTA achieves the highest coherence and diversity in almost all settings, with particularly significant improvements in specialized domains like Drug Review and SMS Spam.
Increasing the number of topics (k=10 \(\rightarrow\) 20) generally improves diversity but does not necessarily improve coherence; DALTA maintains an optimal balance between the two.
ETM and CTM utilize embeddings to improve coherence but often at the cost of topic diversity.
Meta-CETM and FASTopic perform well on general datasets (such as Yelp) but are unstable in niche domains.
Source domain selection has a significant impact on transfer performance (the Science/Technology subset of AG News is most effective for technical target domains).

Highlights & Insights

Tight Coupling between Theory and Method: Each of the first five terms in the generalization upper bound is matched by a loss in the proposed method, which makes the design logic easy to follow.
First to Introduce Domain Adaptation to Topic Modeling: Fills an important research gap with a rigorous theoretical foundation.
High Practical Value: Low-resource topic modeling is a practical need for many vertical domains (medical, legal, emerging technologies).
Elegance of Proposition 1: Directly connects the error rate of the adversarial domain discriminator to \(\mathcal{H}\)-divergence, providing theoretical guarantees for adversarial alignment.
Flexible Architectural Design: The source and target domains can have different numbers of topics, enhancing flexibility in practical applications.

Limitations & Future Work

The target domain size is fixed at 1000 documents; more extreme low-resource scenarios (e.g., 100 documents) were not explored.
AG News is used as the sole source domain, leaving the impact of source domain selection on transfer performance under-explored.
The instability of adversarial training may affect convergence, but training stability is not discussed.
Relies only on bag-of-words (BOW) and embedding representations, and does not leverage the generative capabilities of LLMs to enhance low-resource topic modeling.
Lacks comparison with LLM-based topic modeling methods (such as directly extracting topics using GPT).
Parameter sensitivity analysis for hyperparameters \(\omega_{adv}, \omega_{cons}, \omega_{KL}\) is insufficient.

Related Work & Insights

Relationship with Ben-David et al. (2010): The generalization upper bound is constructed based on classic domain adaptation theory but extended to the VAE framework for unsupervised topic modeling.
Relationship with DANN (Ganin et al., 2016): The adversarial domain alignment strategy is directly adapted from DANN but tailored to the generative model framework of topic modeling.
Inspiration for Other Unsupervised Cross-Domain Tasks: The "shared encoder + domain-specific decoders + adversarial alignment" architecture of DALTA can be generalized to unsupervised tasks such as cross-domain clustering and cross-domain representation learning.
Paradigm Value of Theory-Driven Methods: The methodology of first deriving a generalization upper bound and then optimizing it term by term serves as an excellent paradigm of theory-guided practice in machine learning.

Rating

Novelty: ⭐⭐⭐⭐ Rigorously introduces domain adaptation theory to topic modeling for the first time, with theoretical depth in the derivation of the generalization upper bound.
Experimental Thoroughness: ⭐⭐⭐⭐ Multi-dataset, multi-baseline, and complete ablation study, though source domain exploration is limited.
Writing Quality: ⭐⭐⭐⭐ Rigorous theoretical derivations, a clear motivating example, and a well-organized structure.
Value: ⭐⭐⭐⭐ Offers both theoretical contributions and practical application value, although topic modeling currently draws limited attention from the community.