S3 - Semantic Signal Separation¶

Conference: ACL 2025
arXiv: 2406.09556
Code: GitHub - Turftopic
Area: Others
Keywords: Topic Modeling, Independent Component Analysis, Semantic Axes, Sentence Embeddings, Preprocessing-free

TL;DR¶

S3 conceptualizes topic modeling as discovering independent semantic axes within a semantic space. By utilizing Independent Component Analysis (ICA) to decompose document embedding matrices, it produces highly coherent and diverse topics without requiring preprocessing, while standing out as the fastest contextual topic model (averaging 4.5 times faster than BERTopic).

Background & Motivation¶

Topic modeling is a crucial tool for the exploratory analysis of text data, used for the unsupervised discovery of latent semantic structures within large-scale text corpora. Traditional methods like LDA and LSA are based on Bag-of-Words (BoW) representations and suffer from several limitations:

Highly sensitive to preprocessing: Stopwords, low-frequency words, and other noise must be meticulously filtered; otherwise, they pollute the topic descriptions.

Sparse high-dimensional representation: The sparsity of BoW vectors degrades computational efficiency and model fitting quality.

Lack of contextual understanding: Bag-of-Words models fail to leverage syntactic and contextual information.

With the advancement of neural language representations (especially sentence embeddings), several contextual topic models have emerged, such as BERTopic, Top2Vec, and CTM. However, these methods still exhibit weaknesses:
- BERTopic and Top2Vec rely on clustering pipelines (UMAP + HDBSCAN), incurring heavy computational overhead.
- CTM requires cumbersome preprocessing to achieve optimal performance.
- Many of these approaches are highly sensitive to hyperparameters, leading to unstable outputs.
- It remains unclear whether these models truly leverage contextual information.

S3 aims to deliver a contextual topic model that is conceptually elegant, theoretically driven, preprocessing-free, and computationally efficient.

Method¶

Overall Architecture¶

S3 conceptualizes topics as independent axes in a semantic space—where each axis represents a specific semantic dimension, and variations along this axis reflect the strength of the topic. This differs from traditional methods that treat topics as probability distributions over words or document clusters.

Methodological workflow:
1. Use a Sentence Transformer to encode documents, obtaining an embedding matrix \(X\).
2. Decompose \(X\) into a mixing matrix \(A\) and a source matrix \(S\) using FastICA: \(X = AS\).
3. Project word embeddings onto the discovered semantic axes to compute word importance.
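The three steps above can be sketched end to end in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: random vectors stand in for Sentence Transformer embeddings, FastICA is hand-written in its symmetric tanh variant, the number of topics is fixed by PCA whitening, and all names (`pca_whiten`, `sym_decorrelate`, `fast_ica`, `vocab_emb`) are hypothetical. The sketch stores one document per row, and centering word embeddings with the document mean is also an assumption.

```python
import numpy as np

def pca_whiten(X, n_topics):
    """Center X (n_docs, dim) and keep the top n_topics whitened principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # K maps a centered embedding to n_topics uncorrelated, unit-variance coordinates.
    K = Vt[:n_topics] / s[:n_topics, None] * np.sqrt(X.shape[0])
    return Xc @ K.T, K, mean

def sym_decorrelate(W):
    """Return (W W^T)^(-1/2) W, which keeps the rows of W orthonormal."""
    vals, vecs = np.linalg.eigh(W @ W.T)
    return (vecs / np.sqrt(vals)) @ vecs.T @ W

def fast_ica(Z, n_iter=200, tol=1e-6, seed=0):
    """Symmetric FastICA (tanh nonlinearity) on whitened data Z (n_docs, n_topics)."""
    n_docs, k = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.standard_normal((k, k)))
    for _ in range(n_iter):
        G = np.tanh(Z @ W.T)
        # Fixed-point update: E[z g(w.z)] - E[g'(w.z)] w, for all rows at once.
        W_new = sym_decorrelate(G.T @ Z / n_docs - (1.0 - G**2).mean(axis=0)[:, None] * W)
        # Converged once every row changes by at most a sign flip.
        done = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0)) < tol
        W = W_new
        if done:
            break
    return W

# Step 1 (stand-in): synthetic "embeddings" mixed from non-Gaussian sources.
rng = np.random.default_rng(0)
n_docs, dim, n_topics = 500, 16, 3
sources = rng.laplace(size=(n_docs, n_topics))
X = sources @ rng.standard_normal((n_topics, dim)) + 0.01 * rng.standard_normal((n_docs, dim))
vocab_emb = rng.standard_normal((8, dim))  # placeholder word embeddings

# Step 2: whitening fixes the number of topics, ICA rotates to independent axes.
Z, K, mean = pca_whiten(X, n_topics)
C = fast_ica(Z) @ K              # unmixing matrix, shape (n_topics, dim)
S = (X - mean) @ C.T             # document-topic matrix, shape (n_docs, n_topics)
A = np.linalg.pinv(C)            # mixing matrix, shape (dim, n_topics)

# Step 3: project word embeddings onto the same axes.
W_words = (vocab_emb - mean) @ C.T   # shape (n_words, n_topics)
```

In practice a tested library routine for FastICA would replace the hand-written loop; the loop is spelled out here only to make the whitening and rotation steps visible.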

Key Designs¶

Why ICA instead of PCA: PCA discovers the directions of maximum variance but does not guarantee independence—different topics might be entangled along the variance directions. ICA assumes that the source signals are statistically independent, which aligns better with the intuition of "conceptually independent topics." ICA has been demonstrated by prior work to successfully discover interpretable semantic axes in embedding spaces (Musil and Mareček 2024, Yamagiwa et al. 2023).
Dimensionality Reduction and Whitening: FastICA is a noiseless model that requires pre-whitening. Since ICA by default discovers the same number of components as the embedding dimension, the paper controls the number of topics \(N\) during the whitening step by retaining the top \(N\) principal components. This step simultaneously achieves dimensionality reduction and denoising.
Three Ways to Compute Word Importance:
- Axial Importance (Axial): \(\beta_{tj} = W_{jt}\), directly taking the projection value of a word on the semantic axis to select the most prominent words.
- Angular Importance (Angular): \(\beta_{tj} = W_{jt} / \|W_j\|\), taking the cosine of the angle between the word's projection and the semantic axis to isolate the most specific words.
- Combined Importance (Combined): \(\beta_{tj} = (W_{jt})^3 / \|W_j\|\), where the cubic term preserves the sign while balancing prominence and specificity. The paper recommends using this method by default.
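The three formulas can be compared on a toy projection matrix. The sketch below assumes \(W\) is stored with one row per word and one column per topic; the function name `word_importance`, the three vocabulary words, and the numeric values are made up for illustration.

```python
import numpy as np

def word_importance(W, kind="combined"):
    """W: (n_words, n_topics) word projections. Returns beta: (n_topics, n_words)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)  # ||W_j|| for each word j
    if kind == "axial":
        beta = W
    elif kind == "angular":
        beta = W / norms
    elif kind == "combined":
        beta = W**3 / norms  # an odd power keeps the sign of the projection
    else:
        raise ValueError(f"unknown importance type: {kind}")
    return beta.T

vocab = ["clustering", "kmeans", "planning"]
W = np.array([
    [3.0, 4.0],   # large projection on topic 0, but also loads on topic 1
    [1.0, 0.0],   # small projection, entirely on topic 0
    [-2.0, 0.0],  # points in the opposite direction along topic 0
])

axial = word_importance(W, "axial")
angular = word_importance(W, "angular")
combined = word_importance(W, "combined")

# Highest-ranked word for topic 0 under each scheme.
top_axial = vocab[int(np.argmax(axial[0]))]
top_angular = vocab[int(np.argmax(angular[0]))]
top_combined = vocab[int(np.argmax(combined[0]))]
```

On this toy input, axial importance ranks the word with the largest raw projection first, while angular importance prefers the word that lies exactly on the axis even though its projection is small.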
Negative Definition Capability: Unlike most topic models, S3 naturally supports words with negative importance—the lowest-scoring words on a certain topic axis provide a "negative definition" of that topic. For instance, for a topic about "clustering algorithms," the negative words might be "reinforcement, exploration, planning," indicating that this direction is opposite to reinforcement learning.
Inference on New Documents: Computing topic proportions for unseen documents requires only a single matrix multiplication: \(\hat{S} = \hat{X} C^T\), where \(C\) is the unmixing matrix (the pseudoinverse of \(A\)).
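The inference step can be checked on synthetic data. This sketch assumes a mixing matrix \(A\) of shape (embedding dimension, number of topics) with full column rank and embeddings that are already centered; a fitted model may additionally subtract the training mean. The name `infer_topics` and all values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_topics = 8, 3

# Hypothetical mixing matrix; in a fitted model this comes from the decomposition.
A = rng.standard_normal((dim, n_topics))
C = np.linalg.pinv(A)  # unmixing matrix, shape (n_topics, dim)

def infer_topics(X_new, C):
    """Topic strengths for unseen documents: S_hat = X_new @ C^T."""
    return X_new @ C.T

# Build "unseen" embeddings from known topic strengths, then recover them.
S_true = rng.standard_normal((5, n_topics))
X_new = S_true @ A.T               # shape (5, dim)
S_hat = infer_topics(X_new, C)     # shape (5, n_topics)
```

Because the pseudoinverse satisfies \(CA = I\) when \(A\) has full column rank, the recovered strengths match the ones used to generate the embeddings up to floating-point error.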

Relation to LSA¶

S3 can be viewed as the contextual successor to Latent Semantic Analysis (LSA/LSI). While LSA performs SVD on a term-frequency matrix to discover latent factors, S3 applies ICA to a neural embedding matrix to discover independent semantic axes. Key advancements include:
- Using dense contextual embeddings instead of sparse Bag-of-Words representations.
- Utilizing ICA instead of SVD to ensure component independence.
- Leveraging knowledge from pretrained models through transfer learning.

Key Experimental Results¶

Main Results¶

A comprehensive evaluation was conducted across 6 datasets, 4 embedding models, and 5 topic count settings:

Model	External Coherence \(C_{ex}\)	Internal Coherence \(C_{in}\)	Diversity \(d\)	Overall Score
S3 (Combined)	High	High	Close to 1.0	Best
Top2Vec	Very High	High	Low	Second Best
FASTopic	Medium	Medium	Very High	Medium
BERTopic	Low	Low	Medium	Poor
LDA	Low	Low	Medium	Poor

Running speed rankings (median execution time ratio):

Comparison	Relative Speed of S3
vs BERTopic	4.5x faster
vs All Baselines Average	27.5x faster
vs CTM	Dozens of times faster
vs ECRTM	Significantly faster

Ablation Study¶

Configuration	Key Metrics	Description
Raw Text vs Preprocessed Text	S3 performs better on raw text	The only model that improves without preprocessing
Axial vs Angular vs Combined	Trade-off between coherence and diversity	Combined is the most balanced
GloVe vs MiniLM vs mpnet vs E5	S3 is stable across embedding models	Top2Vec is extremely sensitive to embedding models
Number of Topics 10-50	Stable performance	No fine-tuning required

Key Findings¶

S3 significantly outperforms all baselines: Linear regression analysis indicates that the model type significantly predicts interpretability (\(F = 167.4, p < 0.001, R^2 = 0.673\)), with coefficients for all non-S3 models being significantly negative.
S3 is the only model whose performance improves without preprocessing: Preprocessing actually discards contextual information that S3 can exploit. Other models (especially BERTopic and LDA) are heavily dependent on preprocessing.
Robustness to stopwords: The topic descriptions of S3, Top2Vec, and ECRTM contain almost no stopwords, whereas stopwords sometimes make up as much as 100% of the topic descriptions in BERTopic and LDA.
Robustness across embedding models: S3 exhibits stable performance across different embedding models (including static GloVe and large-scale E5), even reaching its best performance on E5. In contrast, Top2Vec's performance plunges on GloVe and E5, and FASTopic suffers from the "curse of dimensionality" in larger embedding dimensions.

Highlights & Insights¶

Conceptual elegance: The idea of conceptualizing topics as independent semantic axes is natural and theoretically grounded. The intuition of "source separation" in ICA for signal processing transfers perfectly to "semantic signal separation."
Highly practical: Zero preprocessing, ultrafast speed, and a unified interface (scikit-learn-style Turftopic package) lower the barrier to using topic modeling.
Negative definitions: Leveraging the bidirectionality of axes provides both positive and negative descriptions of a topic, enhancing model interpretability.
Concept Compass: The concept compass visualization built on two semantic axes demonstrates S3's unique analytical capability—allowing terms to be positioned along two topic dimensions to understand topic interactions.

Limitations & Future Work¶

Limitations of evaluation metrics: Word embedding coherence relies on the quality of pretrained models and cannot fully capture human-judged topic quality. Human evaluation is lacking.
Preset topic count required: S3 requires users to specify the number of topics, unlike HDBSCAN-based models, which determine the number of clusters automatically.
Independence assumption: ICA assumes that source signals are statistically independent, but real-world topics may be correlated (e.g., "politics" and "economy").
Linear decomposition assumption: ICA is a linear model and may fail to capture non-linear semantic structures within the embedding space.
Single-run evaluation: Due to computational constraints, each setting was run only once, lacking multi-seed evaluations to quantify output variance.
Lack of downstream task evaluation for document-topic proportions: The paper argues that, in practice, embeddings can be used directly for downstream tasks, but omitting this evaluation entirely leaves the quality of the document-topic proportions untested.

Related Work¶

BERTopic (Grootendorst, 2022): Clusters using UMAP+HDBSCAN, then extracts topic words via c-TF-IDF. It is slow and requires a topic reduction step.
Top2Vec (Angelov, 2020): Similar to BERTopic but estimates word importance via cosine similarity. It assumes spherical clustering.
FASTopic (Wu et al., 2024b): Models document-topic-word dual semantic relationships using optimal transport but suffers from the curse of dimensionality.
LSA (Deerwester et al., 1988): The spiritual predecessor to S3—performing SVD on a term-frequency matrix. S3 inherits this matrix decomposition approach but upgrades to contextual embeddings + ICA.
Musil and Mareček (2024): Demonstrates that ICA can discover interpretable semantic axes in the embedding space, inspiring the core idea of S3.

Rating¶

Novelty: ⭐⭐⭐⭐ Applying ICA in embedding spaces has precedent, but systematizing it into a topic model and comparing it comprehensively against baselines is novel.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive, spanning 6 datasets, 4 embedding models, 8 baselines, with both quantitative and qualitative analyses.
Writing Quality: ⭐⭐⭐⭐ Clear structure, with excellent geometric intuition diagrams and concept compass visualizations.
Value: ⭐⭐⭐⭐⭐ Provides the fastest, best-performing, and most user-friendly contextual topic model, with a mature open-source implementation already available.