Sparse Autoencoders are Topic Models

Conference: ICML 2026
arXiv: 2511.16309
Code: https://github.com/ExplainableML/SAE-TM (Available)
Area: Interpretability / Representation Learning
Keywords: Sparse Autoencoders, Topic Models, LDA, Continuous Topic Model, Embedding Interpretation

TL;DR

This paper shows that the \(L_1\) objective of Sparse Autoencoders (SAEs) is exactly the MAP objective of an LDA-style "Continuous Topic Model" (CTM) in the limit of high activity and small contribution. Building on this, it proposes the SAE-TM framework: pre-train an SAE to obtain reusable topic atoms, learn word distributions post hoc, and merge the atoms into an arbitrary number of topics via clustering. On text and image datasets, topic coherence significantly exceeds that of current mainstream neural topic models.

Background & Motivation

Background: Sparse Autoencoders (SAEs) are currently the primary tools for analyzing foundation model activations and performing "mechanistic interpretability." The community generally interprets each SAE feature as a monosemantic concept direction that can be individually "steered." Neural Topic Models (NTM) constitute a parallel research line, evolving from LDA and AVITM to FASTopic and TSCTM, primarily aimed at text bag-of-words.

Limitations of Prior Work: (1) Recent empirical studies on SAEs consistently find that behavioral steering through a single feature is ineffective and that monosemanticity is less stable than linear probes, leading to controversy over the utility of SAEs. (2) NTMs are restricted by posterior collapse, fixed topic numbers, and an almost exclusive focus on text, making them difficult to generalize to high-dimensional embeddings like images. No prior work has pointed out that these two lines are solving the same mathematical problem.

Key Challenge: Interpreting SAE features as "steerable monosemantic directions" is an over-interpretation. Learned SAE features are more akin to "topic components"; an individual feature does not constitute an independent causal mechanism, which explains the failures in steering. However, a unified probabilistic model that formalizes this intuition is lacking.

Goal: (1) Provide a principled explanation for SAEs from a generative model perspective; (2) Operationalize this explanation into a topic modeling framework comparable to NTMs; (3) Demonstrate its practical value in analyzing large-scale cross-modal (text and image) datasets.

Key Insight: The authors observe that the SAE's "reconstruction of embeddings via linear superposition of activations" is structurally isomorphic to LDA's "generation of bag-of-words via linear mixing of topics," with the distinction being the observation domain (continuous embeddings vs. discrete words). The authors construct a continuous extension of LDA in the embedding space and derive the SAE objective from it.

Core Idea: A Continuous Topic Model (CTM) is defined in which each embedding is a linear combination of topic directions \(\mu_k\) plus Gaussian noise. In the asymptotic limit of "high activity and small contribution," the \(L_1\)-SAE loss is shown to be the MAP objective of the CTM. SAE features are therefore essentially "topic atoms": many small activities must superimpose to explain an embedding, and no individual feature should be expected to control behavior on its own.
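
The MAP claim can be made concrete with a short worked derivation. It is a sketch under stated assumptions, not the paper's full argument: \(W\) is taken to be the matrix whose columns are the topic directions \(\mu_k\), \(a\ge 0\) collects the aggregate topic strengths, and the Gamma prior on each \(a_k\) is assumed to have shape 1 in the rate parameterization, so that it reduces to an exponential with rate \(\beta\).

\[
p(D \mid a) = \mathcal{N}\!\left(D;\, Wa,\, \sigma^2 I\right), \qquad p(a) = \prod_{k} \beta\, e^{-\beta a_k}, \quad a_k \ge 0
\]

\[
-\log p(a \mid D) = \frac{1}{2\sigma^2}\lVert D - Wa\rVert_2^2 + \beta \sum_k a_k + \text{const} = \frac{1}{2\sigma^2}\lVert D - Wa\rVert_2^2 + \beta \lVert a\rVert_1 + \text{const}
\]

The last step uses \(a_k \ge 0\), so \(\sum_k a_k = \lVert a\rVert_1\). Minimizing over \(a\) is therefore the familiar squared reconstruction error plus an \(L_1\) penalty.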

Method

The methodology consists of two parts: the theoretical derivation in Section 3 (SAE = MAP estimator of CTM) and the SAE-TM framework in Section 4 that translates this perspective into an evaluable topic model.

Overall Architecture

SAE-TM Workflow:

(a) Pre-train SAE: An SAE is trained on large-scale text or visual embeddings using a standard \(L_1\) objective (expansion factor 4, dictionary size \(\gg 1000\)) to obtain a set of "topic atoms" \(\mu_k\) (column vectors of the SAE decoder).

(b) Topic Interpretation: The SAE is frozen on a downstream dataset, and an additional word emission matrix \(\mathbf{B}\in\mathbb{R}^{K\times V}\) is learned to map each SAE feature to a word distribution, allowing comparison with traditional NTMs using the same coherence/diversity metrics.

(c) Topic Merging: The top-\(p\) word distribution of each SAE feature is used to calculate a topic vector via weighted sums of word embeddings. These are then clustered into a target topic number \(K'\) using \(k\)-means. Corresponding word distributions are merged by weighting based on feature priors. This process does not require retraining the SAE, and \(K'\) can be switched arbitrarily.

Input: An in-domain embedding dataset \(\{D_i\}\) (from Granite-R2 for text, SigLIP for images) + associated bag-of-words representations (long captions generated by InternVL3.5 for images). Output: \(K'\) topics, each being a word distribution and a corresponding "set of atoms."
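
Step (c) of the workflow can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `kmeans` and `merge_topics` are hypothetical, the topic vector is computed from the full word distribution rather than a top-\(p\) truncation, and a plain Lloyd's \(k\)-means stands in for whatever clustering routine the released code uses.

```python
import numpy as np

def kmeans(X, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared distance of every point to every center: shape (n, n_clusters).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return labels

def merge_topics(B, word_emb, prior, n_topics, seed=0):
    """Cluster SAE features by topic vector and merge their word distributions.

    B        : (K, V) word distribution per SAE feature (rows sum to 1)
    word_emb : (V, d) word embeddings (assumed given, e.g. word2vec or GloVe)
    prior    : (K,)   feature prior, e.g. mean normalized activation
    """
    T = B @ word_emb                       # (K, d) topic vector per feature
    labels = kmeans(T, n_topics, seed=seed)
    merged = np.zeros((n_topics, B.shape[1]))
    for c in range(n_topics):
        idx = np.flatnonzero(labels == c)
        if idx.size == 0:
            continue                       # empty cluster stays all-zero
        w = prior[idx] / prior[idx].sum()  # prior-weighted average
        merged[c] = w @ B[idx]
    return merged, labels

# Toy data with hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
K, V, d = 12, 30, 8
B = rng.random((K, V))
B /= B.sum(axis=1, keepdims=True)
word_emb = rng.normal(size=(V, d))
prior = rng.random(K)
prior /= prior.sum()
merged, labels = merge_topics(B, word_emb, prior, n_topics=4)
```

Because only `n_topics` changes between calls, the same `B` and `prior` can be re-clustered into a different number of topics without touching the SAE.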

Key Designs

  1. MAP Equivalence between CTM and \(L_1\)-SAE:

    • Function: Provides a principled probabilistic explanation for SAEs from a generative model perspective, elevating the use of \(L_1\) sparsity and squared reconstruction loss from heuristics to MAP inference.
    • Mechanism: CTM assumes document embeddings \(D = \epsilon + \sum_{i=1}^N \lambda_i c_i\). Under the "high activity, small contribution" limit (\(\rho_d\to\infty, \alpha_0\to 0, \rho_d\alpha_0\to\kappa\)) and \(\Sigma_k\to 0\), the aggregate topic strength \(S_k \Rightarrow \mathrm{Ga}(\kappa\theta_k, \beta)\). Parameterizing strength as \(a_k = s\theta_k\) simplifies the observation model to \(D \mid a \sim \mathcal{N}(Wa, \sigma^2 I)\). The negative log-posterior at \(\kappa=1, \alpha_k=1\) equals \(\mathcal{L}(a)=\frac{1}{2\sigma^2}\lVert D-Wa\rVert_2^2+\beta\lVert a\rVert_1\), which is the \(L_1\)-SAE loss of Bricken et al. Hard-sparse SAEs like TopK follow the same framework by replacing the prior with a hard support constraint.
    • Design Motivation: Provides a theoretical explanation for why steering fails—SAE features are "topic components" rather than "independently steerable causal directions."
  2. Post-hoc Word Emission (SAE Feature → Word Distribution):

    • Function: Bridges an "embedding-side model" like SAE to the NTM convention of "topic = word distribution," enabling standard evaluations like intruder detection and coherence rating.
    • Mechanism: Freezes the SAE and defines the bag-of-words likelihood \(P(D)=\prod_{w_i\in D}\left[\pi P_0(w_i)+(1-\pi)\sum_k B_{k,i}\,\theta_k\right]\), where \(\theta_k\) is the normalized activation of the \(k\)-th SAE feature. \(P_0\) is an unconditional unigram prior, mixed in with weight \(\pi=0.3\), that absorbs high-frequency, non-topical words. \(\mathbf{B}\) is the \(K\times V\) word emission matrix. Words are weighted by normalized IDF during training to prevent common words from dominating the loss.
    • Design Motivation: Maintains the generative semantics of the original SAE without modification, making a "foundational SAE topic model" applicable to downstream datasets too small for retraining.
  3. K-means Based Topic Atom Merging:

    • Function: Consolidates the \(\gg 1000\) atomic SAE features into the 50 to 500 topics typical of NTMs, allowing for flexible topic counts post-hoc.
    • Mechanism: Calculates a topic vector \(\mathbf{T}_k = \sum_{w_i\in\mathcal{V}} B_{k,i}\mathbf{w}_i\) for each SAE feature using word embeddings (word2vec/GloVe). \(k\)-means is performed on \(\{\mathbf{T}_k\}\) to form \(K'\) clusters. Word distributions of features in the same cluster are merged by weighting with the feature prior \(P(k)=\bar{\theta}_k\).
    • Design Motivation: \(K'\) can be adjusted without retraining the SAE, and cluster boundaries provide signals for inter-topic similarity.
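
The word-emission likelihood of design 2 can be illustrated as follows. This is a hedged sketch: the helper names `word_probs` and `word_emission_log_likelihood` are hypothetical, the rows of \(\mathbf{B}\) are assumed to be normalized word distributions, and the optional `weights` argument stands in for the normalized IDF weighting, whose exact form is not given here.

```python
import numpy as np

def word_probs(theta, B, p0, pi=0.3):
    """Per-word mixture probability: pi * P0(w) + (1 - pi) * sum_k B[k, w] * theta[k]."""
    return pi * p0 + (1.0 - pi) * (theta @ B)

def word_emission_log_likelihood(word_ids, theta, B, p0, pi=0.3, weights=None):
    """Log-likelihood of one bag of words under the frozen-SAE mixture.

    word_ids : integer array of vocabulary indices occurring in the document
    theta    : (K,) normalized SAE activations for the document (sums to 1)
    B        : (K, V) word emission matrix (rows assumed to sum to 1)
    p0       : (V,) unconditional unigram prior
    weights  : optional (V,) per-word weights, e.g. normalized IDF (assumption)
    """
    log_p = np.log(word_probs(theta, B, p0, pi)[word_ids])
    if weights is not None:
        log_p = weights[word_ids] * log_p
    return float(log_p.sum())

# Toy example with hypothetical sizes.
rng = np.random.default_rng(0)
K, V = 5, 20
B = rng.random((K, V))
B /= B.sum(axis=1, keepdims=True)
theta = rng.random(K)
theta /= theta.sum()
p0 = np.full(V, 1.0 / V)
doc = np.array([3, 7, 7, 12])
ll = word_emission_log_likelihood(doc, theta, B, p0)
```

In training, only `B` would be updated to maximize this quantity over the downstream corpus while the SAE that produces `theta` stays frozen.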

Loss & Training

SAE training uses \(L_1\) penalty (coefficient 2), expansion factor 4 (\(K\approx 3072\)), batch size 1000, 50k steps, lr=0.001. The word emission matrix \(\mathbf{B}\) is trained for 50–200 epochs, lr=0.01. Training an SAE on 50M Twitter embeddings takes ~10 minutes, and interpretation takes ~15 minutes on a single GPU.
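
For orientation, the \(L_1\) objective with these settings can be sketched as below. This is an assumption-laden illustration rather than the released training code: the ReLU encoder, the bias terms, the batch averaging, and the function name `sae_l1_loss` are all choices made here, and the toy dimensions are far smaller than the real ones.

```python
import numpy as np

def sae_l1_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=2.0):
    """Squared reconstruction error plus an L1 penalty on the activations.

    x     : (batch, d) input embeddings
    W_enc : (d, K) encoder weights, K = expansion factor * d
    W_dec : (K, d) decoder weights; each row is one topic atom
    """
    a = np.maximum(x @ W_enc + b_enc, 0.0)           # non-negative activations
    x_hat = a @ W_dec + b_dec                        # linear superposition of atoms
    recon = ((x - x_hat) ** 2).sum(axis=1).mean()    # mean squared L2 error
    sparsity = np.abs(a).sum(axis=1).mean()          # mean L1 norm of activations
    return recon + l1_coeff * sparsity, a

# Toy dimensions (hypothetical); the summary reports expansion factor 4.
rng = np.random.default_rng(0)
d, expansion = 16, 4
K = d * expansion
x = rng.normal(size=(8, d))
W_enc = rng.normal(scale=0.1, size=(d, K))
b_enc = np.zeros(K)
W_dec = rng.normal(scale=0.1, size=(K, d))
b_dec = np.zeros(d)
loss, a = sae_l1_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=2.0)
```

Up to an overall scale, a loss of this form matches \(\frac{1}{2\sigma^2}\lVert D-Wa\rVert_2^2+\beta\lVert a\rVert_1\) when `l1_coeff` is read as \(2\sigma^2\beta\).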

Key Experimental Results

Main Results

Comparison with 8 NTM baselines across 5 text datasets (News-20K / IMDB / Yelp / DailyMail / Twitter, using Granite-R2 embeddings):

| # Topics | Metric | SAE-TM | TSCTM (Next Best) | AVITM/CombinedTM |
|---|---|---|---|---|
| 50 | \(C_I\) / \(C_R\) | 54.31 / 77.25 | 44.61 / 69.75 | 38.72 / 70.24 |
| 100 | \(C_I\) / \(C_R\) | 51.48 / 78.01 | 35.81 / 58.53 | 38.49 / 67.37 |
| 300 | \(C_I\) / \(C_R\) | 43.50 / 74.22 | 26.17 / 27.40 | 33.38 / 65.67 |
| 500 | \(C_I\) / \(C_R\) | 40.49 / 71.22 | 21.68 / 17.67 | 31.79 / 50.77 |

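SAE-TM ranks first in intruder detection (\(C_I\)) and coherence rating (\(C_R\)) at all topic counts and degrades far more slowly than the baselines as the topic count grows: between 50 and 500 topics its \(C_R\) falls from 77.25 to 71.22, versus 69.75 to 17.67 for TSCTM.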

3 Image Datasets (CIFAR100 / Food101 / SUN397, using SigLIP embeddings + InternVL3.5 long captions):

| # Topics | Metric | SAE-TM | TSCTM | CombinedTM | FASTopic |
|---|---|---|---|---|---|
| 50 | \(C_I\) / \(C_R\) | 42.57 / 85.05 | 40.51 / 80.40 | 42.30 / 79.39 | 34.44 / 69.56 |
| 200 | \(C_I\) / \(C_R\) | 38.59 / 85.53 | 34.69 / 72.61 | 23.16 / 30.80 | 32.28 / 68.14 |
| 500 | \(C_I\) / \(C_R\) | 36.54 / 84.43 | 25.28 / 39.81 | 20.29 / 26.56 | 31.05 / 67.27 |

For images, SAE-TM's \(C_R\) stays above 84 at every topic count (85.05, 85.53, 84.43), the smallest change among the compared methods.

Ablation Study

| Config | Key Observation | Description |
|---|---|---|
| High activity limit \(N=500\) vs. \(N=5\) | Embedding distribution is a smooth Gaussian cloud at large \(N\), blocky grid at small \(N\) | Validates that SAE's continuous \(L_2\) loss matches CTM's discrete generation only in the small contribution limit. |
| Topics from 50 → 500 | SAE-TM \(C_R\) drops ~6 pts; TSCTM drops ~52 pts | The merging mechanism preserves topic atoms; finer granularity does not destroy coherence. |
| ImageNet vs. Web Data (100 topics) | ImageNet scores higher on "Fluffy Animals", lower on "Human Interaction" | Matches ImageNet's class-balanced design, proving SAE-TM reflects dataset construction. |
| Ukiyo-e (7 periods) | "Domestic Scene" peaks in Edo Golden Age; "Vibrant Garment" higher in Edo/Meiji than 20th C | Topic trends align with known cultural history. |
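
The "~6 pts" and "~52 pts" figures in the second ablation row follow directly from the text-dataset \(C_R\) values in the main results table. A small arithmetic check, using only numbers reported above (the variable names are illustrative):

```python
# C_R on the text datasets at 50 and 500 topics, copied from the main results table.
c_r = {
    "SAE-TM": {50: 77.25, 500: 71.22},
    "TSCTM": {50: 69.75, 500: 17.67},
}

# Drop in coherence rating when going from 50 to 500 topics.
drops = {name: round(v[50] - v[500], 2) for name, v in c_r.items()}
print(drops)  # {'SAE-TM': 6.03, 'TSCTM': 52.08}
```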

Key Findings

  • Theory-Practice Loop: The CTM assumption of high activity and small contribution is not just a mathematical trick; sampling visualizations confirm that discrete topic mixtures smooth into the Gaussian reconstruction loss used by SAEs only in this limit.
  • Topic Scalability: Traditional NTMs collapse in coherence as topic count increases because model capacity is fixed. SAE-TM learns atomic grains first and merges them, so quality depends far less on the number of clusters.
  • New Vision Topic Modeling Paradigm: SAE-TM demonstrates the feasibility of performing topic modeling directly on image embeddings, positioning SAEs as tools for large-scale visual dataset analysis.
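
The "high activity, small contribution" limit can be explored numerically. The simulation below is one hypothetical construction consistent with the stated limit, and the paper's exact generative process may differ: each document draws `rho` contributions, each contribution picks a topic from \(\theta\) and a strength from a Gamma with shape \(\alpha_0=\kappa/\rho\) and rate \(\beta\), and the per-topic sums are compared with the moments of \(\mathrm{Ga}(\kappa\theta_k,\beta)\). The function name and all parameter values are assumptions made for illustration.

```python
import numpy as np

def sample_topic_strengths(theta, rho, kappa, beta, n_docs, rng):
    """Aggregate topic strengths S_k from many small contributions.

    Each of the `rho` contributions per document picks a topic z ~ Categorical(theta)
    and a strength lam ~ Gamma(shape=kappa / rho, rate=beta).
    """
    theta = np.asarray(theta, dtype=float)
    n_topics = theta.shape[0]
    alpha0 = kappa / rho                      # small contribution per draw
    z = rng.choice(n_topics, size=(n_docs, rho), p=theta)
    lam = rng.gamma(shape=alpha0, scale=1.0 / beta, size=(n_docs, rho))
    S = np.zeros((n_docs, n_topics))
    for k in range(n_topics):
        S[:, k] = np.where(z == k, lam, 0.0).sum(axis=1)
    return S

rng = np.random.default_rng(0)
theta = np.array([0.5, 0.3, 0.2])
kappa, beta = 1.0, 2.0
S = sample_topic_strengths(theta, rho=500, kappa=kappa, beta=beta, n_docs=4000, rng=rng)

# Moments of the limiting Gamma(kappa * theta_k, rate=beta) distribution.
target_mean = kappa * theta / beta
target_var = kappa * theta / beta**2
print("empirical mean:", S.mean(axis=0), "target:", target_mean)
print("empirical var: ", S.var(axis=0), "target:", target_var)
```

With a large `rho` the empirical moments should sit close to the Gamma targets; rerunning with a small `rho` shows how far a low-activity regime departs from them.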

Highlights & Insights

  • Deriving SAE as a Topic Model MAP Estimate: This is the first work to provide a formal asymptotic derivation from CTM to \(L_1\)-SAE, shifting the justification from empirical heuristics to MAP inference.
  • "Topic Atoms + Post-hoc Merging" Decouples Representation and Interpretation: A universal set of atoms is obtained via SAE pre-training, while the interpretative layer (topics, vocabulary, \(K'\)) can be flexibly swapped for downstream tasks.
  • Positional Shift for the SAE Community: The authors argue that SAEs are better suited for large-scale topic/dataset auditing than for single-feature steering.

Limitations & Future Work

  • Limitations: (1) Interpretation quality of SAE features has room for improvement as activation strength does not always align with topic importance; (2) Document embeddings encode non-topical information like sentiment and style, which SAE-TM may inadvertently capture; (3) The independence assumption for topics precludes hierarchical SAE extensions.
  • Future Work: The authors suggest extending hierarchical SAEs to hierarchical CTMs to obtain adaptive-granularity topics and using MLLMs to refine visual topic interpretation.

Related Work & Insights

  • vs FASTopic / CombinedTM: These models also use embeddings but require training a new probabilistic model from scratch. SAE-TM leverages existing SAE dictionaries, avoiding posterior collapse and supporting dynamic topic counts.
  • vs Mechanistic Interpretability: While the mainstream narrative focuses on "monosemantic steerable directions," this work provides an anti-narrative: SAEs learn topic components, and analysis at the aggregate level (topics, distributions) is their true strength.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First MAP equivalence proof between \(L_1\)-SAE and LDA-style CTM.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Coverage of multiple text/image datasets and 8 baselines; missing direct comparison with mechanistic interpretability steering tasks.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear progression from theory to validation.
  • Value: ⭐⭐⭐⭐⭐ Significant impact on both SAE interpretability and topic modeling communities.