
ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts

Conference: NeurIPS 2025 arXiv: 2510.26186 Code: GitHub Area: Human Understanding Keywords: Dataset Bias, Sparse Autoencoder, Visual Concepts, Bias Detection, Interpretability

TL;DR

This paper proposes ConceptScope, a framework that trains a sparse autoencoder (SAE) on representations from visual foundation models to automatically discover and quantify visual concept biases in datasets, categorizing concepts as target, context, or bias without any manual annotation.

Background & Motivation

Biases in machine learning datasets — such as high correlations between specific categories and specific backgrounds — are pervasive and degrade model generalization. For example, approximately 75% of "leatherback turtle" images in ImageNet are photographed on beaches, while only 15% are underwater. Existing methods either rely on costly human annotation or on descriptive text generated by VLMs; however, natural language descriptions suffer from inconsistent granularity and synonym substitution, making structured extraction of visual concepts difficult. This paper aims to construct a fully automatic and scalable framework for dataset bias analysis.

Method

Overall Architecture

ConceptScope operates in two stages:

  1. Concept dictionary construction: an SAE is trained on intermediate-layer token embeddings of a pretrained visual encoder (CLIP-ViT-L/14) to disentangle dense representations into sparse, interpretable concepts.
  2. Concept classification: each concept is categorized as target, context, or bias based on semantic relevance and statistical frequency.

Key Designs

Sparse Autoencoder (SAE) Training: Given an image \(x\), patch-level token embeddings \(\mathbf{z} = \{z_1, \ldots, z_l\}\) are extracted. The SAE encode-decode process is:

\[f(z) = \phi(W_{\text{enc}}^T z), \quad \text{SAE}(z) = W_{\text{dec}}^T f(z)\]

where \(\phi\) denotes the ReLU activation, \(W_{\text{enc}} \in \mathbb{R}^{d \times d'}\), and \(d'\) is much larger than \(d\) (expansion factor of 16 or 32).
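To make the encode-decode step concrete, here is a minimal PyTorch sketch of such an SAE; the input width (1024 for CLIP-ViT-L/14 tokens), the presence of bias terms, and all names are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE over patch-token embeddings (illustrative sketch)."""

    def __init__(self, d: int = 1024, expansion: int = 16):
        super().__init__()
        d_prime = d * expansion              # d' >> d, expansion factor 16 or 32
        self.enc = nn.Linear(d, d_prime)     # W_enc
        self.dec = nn.Linear(d_prime, d)     # W_dec

    def encode(self, z: torch.Tensor) -> torch.Tensor:
        # f(z) = ReLU(W_enc^T z): sparse, non-negative concept activations
        return torch.relu(self.enc(z))

    def forward(self, z: torch.Tensor):
        f = self.encode(z)                   # (..., d')
        z_hat = self.dec(f)                  # SAE(z) = W_dec^T f(z)
        return z_hat, f
```

Each coordinate of \(f(z)\) is then treated as one entry of the concept dictionary.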

Concept Classification — Alignment Score: Two metrics, necessity \(N(c,y)\) and sufficiency \(S(c,y)\), are defined to measure the drop in prediction confidence upon removing concept \(c\) and the predictive capacity when retaining only \(c\), respectively:

\[N(c,y) = \frac{1}{|X_y|}\sum_{x \in X_y} \frac{P(y|x)}{P(y|x \odot (1-m_c(x)))}\]
\[S(c,y) = \frac{1}{|X_y|}\sum_{x \in X_y} \frac{P(y|x \odot m_c(x))}{P(y|x)}\]

Their average yields the alignment score \(A(c,y) = \frac{N(c,y) + S(c,y)}{2}\). A concept is classified as a target concept when \(A(c,y) \geq \mu_y^{\text{align}} + \alpha \times \sigma_y^{\text{align}}\); otherwise it is treated as a context concept.
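As a hedged sketch under the definitions above, the necessity, sufficiency, and alignment scores could be computed as follows; `predict_prob` (the classifier's confidence \(P(y|\cdot)\) for the class of interest) and `concept_mask` (the spatial mask \(m_c(x)\) derived from the SAE activations) are hypothetical helpers, not names from the paper.

```python
import numpy as np

def alignment_score(images_y, concept_mask, predict_prob):
    """A(c, y) for one (concept, class) pair, following the formulas above."""
    necessity, sufficiency = [], []
    for x in images_y:                            # all images of class y
        m = concept_mask(x)                       # binary mask of concept c
        p_full = predict_prob(x)                  # P(y | x)
        p_removed = predict_prob(x * (1 - m))     # concept masked out
        p_only = predict_prob(x * m)              # only the concept kept
        necessity.append(p_full / p_removed)      # large if removing c hurts
        sufficiency.append(p_only / p_full)       # large if c alone suffices
    n, s = float(np.mean(necessity)), float(np.mean(sufficiency))
    return (n + s) / 2
```

Per class, the resulting \(A(c,y)\) values would then be compared against \(\mu_y^{\text{align}} + \alpha \sigma_y^{\text{align}}\) to separate target from context concepts.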

Bias Concept Identification: After excluding target concepts, the concept strength of each context concept is computed as \(\tilde{f}_{c,y} = \text{avg}_{\mathbf{z} \in Z_y}(f(\mathbf{z})_c)\). A concept is identified as a bias concept when \(\tilde{f}_{c,y} \geq \mu^{c.s.} + \sigma^{c.s.}\).
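A corresponding sketch of the bias-concept rule, assuming the SAE activations of all patch tokens belonging to class \(y\) have already been stacked into one array (names are illustrative):

```python
import numpy as np

def bias_concepts(f_y: np.ndarray, target_ids) -> np.ndarray:
    """Flag bias concepts for one class.

    f_y: (num_tokens_in_class, d') SAE activations f(z) over Z_y.
    target_ids: indices of concepts already classified as targets.
    """
    strength = f_y.mean(axis=0).astype(float)      # concept strength per concept
    strength[list(target_ids)] = np.nan            # exclude target concepts
    mu, sigma = np.nanmean(strength), np.nanstd(strength)
    return np.flatnonzero(strength >= mu + sigma)  # indices of bias concepts
```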

Loss & Training

The SAE training loss combines reconstruction loss with an L1 sparsity penalty:

\[\mathcal{L} = \|z - \text{SAE}(z)\|_2^2 + \lambda \|f(z)\|_1\]
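A minimal training-step sketch of this objective, reusing the `SparseAutoencoder` sketch from above; the sparsity weight `lam` is an assumed placeholder, not a value reported by the paper.

```python
import torch

def sae_loss(sae, z: torch.Tensor, lam: float = 5e-4) -> torch.Tensor:
    """Reconstruction loss plus L1 sparsity on the concept activations f(z)."""
    z_hat, f = sae(z)                               # z: batch of token embeddings
    recon = (z - z_hat).pow(2).sum(dim=-1).mean()   # ||z - SAE(z)||_2^2
    sparsity = f.sum(dim=-1).mean()                 # ||f(z)||_1 (f is non-negative)
    return recon + lam * sparsity
```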

Key Experimental Results

Main Results

Concept prediction performance (binary classification accuracy, F1 / AUPRC, across 6 annotated datasets):

Method        Caltech101  DTD   Waterbird  CelebA  RAF-DB  Stanford40  Avg.
BLIP-2        0.64        0.38  0.37       0.27    0.24    0.66        0.43
LLaVA-NeXT    0.61        0.40  0.57       0.62    0.45    0.80        0.58
ConceptScope  0.83        0.57  0.78       0.81    0.55    0.78        0.72

Bias discovery task (Precision@10):

Method        Waterbirds  CelebA  NICO++(75)  NICO++(90)  NICO++(95)
DOMINO        90.0%       87.0%   24.0%       24.0%       24.0%
FACTS         100.0%      100.0%  55.0%       60.8%       61.0%
ConceptScope  100.0%      100.0%  72.9%       73.1%       74.0%

Ablation Study

  • Spatial attribution precision of SAE concepts: AUPRC reaches 0.399 on ADE20K segmentation, significantly outperforming BLIP-2 (0.098) and LLaVA-NeXT (0.302).
  • Pearson correlation between SAE activation values and CLIP similarity: \(r = 0.71\); Spearman \(\rho = 0.65\).
  • Standard deviation across SAEs trained with four random seeds is below 0.01, demonstrating framework robustness.

Key Findings

  • Previously unannotated biases are discovered in ImageNet-1K: e.g., "necklace" frequently co-occurs with the "mannequin" category, and the "bridegroom" category is highly correlated with East Asian cultural scenes.
  • An average of 2.45 bias concepts are detected per category.
  • Model robustness diagnostic experiments show that the high-target + high-bias group achieves the highest accuracy and the low-target + low-bias group the lowest, a trend consistent across all 34 evaluated models.

Highlights & Insights

  1. Fully automatic and unsupervised: Dataset biases are discovered without manual annotation; once the SAE is trained, it transfers to other datasets.
  2. The three-way concept classification (target / context / bias) is both theoretically grounded and practically useful.
  3. Bias discovery Precision@10 on NICO++ improves over the previous SOTA (ViG-Bias) by approximately 10 percentage points.
  4. The framework is extensible to multi-label settings (MS-COCO).

Limitations & Future Work

  • Concepts are constrained by the knowledge scope of CLIP representations; domain-specific datasets (e.g., medical imaging) require retraining the SAE.
  • Segmentation masks are patch-level (16×16), limiting localization precision.
  • Performance on domain-specific attributes (e.g., emotion, texture) is weaker than on general attributes.
  • Unlike methods such as SpLiCE, ConceptScope needs no predefined bias categories; bias concepts are discovered automatically.
  • The work transfers the successful use of SAEs for LLM interpretability to the visual domain.
  • A natural direction for future work is whether ConceptScope could support automatic dataset cleaning or active-learning sample selection.

Rating

  • ⭐ Novelty: 4/5 — Applying SAEs to visual dataset bias analysis represents the first systematic exploration of this approach.
  • ⭐ Experimental Thoroughness: 5/5 — Covers 6 attribute datasets + 3 bias benchmarks + multiple real-world datasets + robustness analysis across 34 models.
  • ⭐ Writing Quality: 4/5 — Well-structured with rigorous concept definitions.
  • ⭐ Value: 4/5 — Provides a practical tool for dataset auditing and model diagnostics.