
ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts

Conference: NeurIPS 2025 arXiv: 2510.26186 Code: GitHub Area: Human Understanding Keywords: Dataset Bias, Sparse Autoencoder, Visual Concepts, Bias Detection, Interpretability

TL;DR

This paper proposes ConceptScope, a framework that trains a sparse autoencoder (SAE) on representations from visual foundation models to automatically discover and quantify visual concept biases in datasets, categorizing concepts as target, context, or bias without any manual annotation.

Background & Motivation

Biases in machine learning datasets — such as high correlations between specific categories and specific backgrounds — are pervasive and degrade model generalization. For example, approximately 75% of "leatherback turtle" images in ImageNet are photographed on beaches, while only 15% are underwater. Existing methods either rely on costly human annotation or on descriptive text generated by VLMs; however, natural language descriptions suffer from inconsistent granularity and synonym substitution, making structured extraction of visual concepts difficult. This paper aims to construct a fully automatic and scalable framework for dataset bias analysis.

Method

Overall Architecture

ConceptScope operates in two stages:

  1. Concept dictionary construction: an SAE is trained on intermediate-layer token embeddings of a pretrained visual encoder (CLIP-ViT-L/14) to disentangle dense representations into sparse, interpretable concepts.
  2. Concept classification: each concept is categorized as target, context, or bias based on semantic relevance and statistical frequency.

Key Designs

Sparse Autoencoder (SAE) Training: Given an image \(x\), patch-level token embeddings \(\mathbf{z} = \{z_1, \ldots, z_l\}\) are extracted. The SAE encode-decode process is:

\[f(z) = \phi(W_{\text{enc}}^T z), \quad \text{SAE}(z) = W_{\text{dec}}^T f(z)\]

where \(\phi\) denotes the ReLU activation, \(W_{\text{enc}} \in \mathbb{R}^{d \times d'}\), and \(d'\) is much larger than \(d\) (expansion factor of 16 or 32).
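To make the encode-decode step concrete, here is a minimal PyTorch sketch of such an SAE; the input width (1024 for CLIP-ViT-L/14 tokens), the presence of bias terms, and all names are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE over patch-token embeddings (illustrative sketch)."""

    def __init__(self, d: int = 1024, expansion: int = 16):
        super().__init__()
        d_prime = d * expansion              # d' >> d, expansion factor 16 or 32
        self.enc = nn.Linear(d, d_prime)     # W_enc
        self.dec = nn.Linear(d_prime, d)     # W_dec

    def encode(self, z: torch.Tensor) -> torch.Tensor:
        # f(z) = ReLU(W_enc^T z): sparse, non-negative concept activations
        return torch.relu(self.enc(z))

    def forward(self, z: torch.Tensor):
        f = self.encode(z)                   # (..., d')
        z_hat = self.dec(f)                  # SAE(z) = W_dec^T f(z)
        return z_hat, f
```

Each coordinate of \(f(z)\) is then treated as one entry of the concept dictionary.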

Concept Classification — Alignment Score: Two metrics, necessity \(N(c,y)\) and sufficiency \(S(c,y)\), are defined to measure the drop in prediction confidence upon removing concept \(c\) and the predictive capacity when retaining only \(c\), respectively:

\[N(c,y) = \frac{1}{|X_y|}\sum_{x \in X_y} \frac{P(y|x)}{P(y|x \odot (1-m_c(x)))}\]
\[S(c,y) = \frac{1}{|X_y|}\sum_{x \in X_y} \frac{P(y|x \odot m_c(x))}{P(y|x)}\]

Their average yields the alignment score \(A(c,y) = \frac{N(c,y) + S(c,y)}{2}\). A concept is classified as a target concept when \(A(c,y) \geq \mu_y^{\text{align}} + \alpha \times \sigma_y^{\text{align}}\); otherwise it is treated as a context concept.
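As a hedged sketch under the definitions above, the necessity, sufficiency, and alignment scores could be computed as follows; `predict_prob` (the classifier's confidence \(P(y|\cdot)\) for the class of interest) and `concept_mask` (the spatial mask \(m_c(x)\) derived from the SAE activations) are hypothetical helpers, not names from the paper.

```python
import numpy as np

def alignment_score(images_y, concept_mask, predict_prob):
    """A(c, y) for one (concept, class) pair, following the formulas above."""
    necessity, sufficiency = [], []
    for x in images_y:                            # all images of class y
        m = concept_mask(x)                       # binary mask of concept c
        p_full = predict_prob(x)                  # P(y | x)
        p_removed = predict_prob(x * (1 - m))     # concept masked out
        p_only = predict_prob(x * m)              # only the concept kept
        necessity.append(p_full / p_removed)      # large if removing c hurts
        sufficiency.append(p_only / p_full)       # large if c alone suffices
    n, s = float(np.mean(necessity)), float(np.mean(sufficiency))
    return (n + s) / 2
```

Per class, the resulting \(A(c,y)\) values would then be compared against \(\mu_y^{\text{align}} + \alpha \sigma_y^{\text{align}}\) to separate target from context concepts.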

Bias Concept Identification: After excluding target concepts, the concept strength of each context concept is computed as \(\tilde{f}_{c,y} = \text{avg}_{\mathbf{z} \in Z_y}(f(\mathbf{z})_c)\). A concept is identified as a bias concept when \(\tilde{f}_{c,y} \geq \mu^{c.s.} + \sigma^{c.s.}\).
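A corresponding sketch of the bias-concept rule, assuming the SAE activations of all patch tokens belonging to class \(y\) have already been stacked into one array (names are illustrative):

```python
import numpy as np

def bias_concepts(f_y: np.ndarray, target_ids) -> np.ndarray:
    """Flag bias concepts for one class.

    f_y: (num_tokens_in_class, d') SAE activations f(z) over Z_y.
    target_ids: indices of concepts already classified as targets.
    """
    strength = f_y.mean(axis=0).astype(float)      # concept strength per concept
    strength[list(target_ids)] = np.nan            # exclude target concepts
    mu, sigma = np.nanmean(strength), np.nanstd(strength)
    return np.flatnonzero(strength >= mu + sigma)  # indices of bias concepts
```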

Loss & Training

The SAE training loss combines reconstruction loss with an L1 sparsity penalty:

\[\mathcal{L} = \|z - \text{SAE}(z)\|_2^2 + \lambda \|f(z)\|_1\]
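A minimal training-step sketch of this objective, reusing the `SparseAutoencoder` sketch from above; the sparsity weight `lam` is an assumed placeholder, not a value reported by the paper.

```python
import torch

def sae_loss(sae, z: torch.Tensor, lam: float = 5e-4) -> torch.Tensor:
    """Reconstruction loss plus L1 sparsity on the concept activations f(z)."""
    z_hat, f = sae(z)                               # z: batch of token embeddings
    recon = (z - z_hat).pow(2).sum(dim=-1).mean()   # ||z - SAE(z)||_2^2
    sparsity = f.sum(dim=-1).mean()                 # ||f(z)||_1 (f is non-negative)
    return recon + lam * sparsity
```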

Key Experimental Results

Main Results

Concept prediction performance (binary classification accuracy, F1 / AUPRC, across 6 annotated datasets):

Method        Caltech101  DTD   Waterbird  CelebA  RAF-DB  Stanford40  Avg.
BLIP-2        0.64        0.38  0.37       0.27    0.24    0.66        0.43
LLaVA-NeXT    0.61        0.40  0.57       0.62    0.45    0.80        0.58
ConceptScope  0.83        0.57  0.78       0.81    0.55    0.78        0.72

Bias discovery task (Precision@10):

Method        Waterbirds  CelebA  NICO++(75)  NICO++(90)  NICO++(95)
DOMINO        90.0%       87.0%   24.0%       24.0%       24.0%
FACTS         100.0%      100.0%  55.0%       60.8%       61.0%
ConceptScope  100.0%      100.0%  72.9%       73.1%       74.0%

Ablation Study

  • Spatial attribution precision of SAE concepts: AUPRC reaches 0.399 on ADE20K segmentation, significantly outperforming BLIP-2 (0.098) and LLaVA-NeXT (0.302).
  • Pearson correlation between SAE activation values and CLIP similarity: \(r = 0.71\); Spearman \(\rho = 0.65\).
  • Standard deviation across SAEs trained with four random seeds is below 0.01, demonstrating framework robustness.

Key Findings

  • Previously unannotated biases are discovered in ImageNet-1K: e.g., "necklace" frequently co-occurs with the "mannequin" category, and the "bridegroom" category is highly correlated with East Asian cultural scenes.
  • An average of 2.45 bias concepts are detected per category.
  • Model robustness diagnostic experiments show that the high-target + high-bias group achieves the highest accuracy and the low-target + low-bias group the lowest, a trend consistent across all 34 evaluated models.

Highlights & Insights

  1. Fully automatic and unsupervised: Dataset biases are discovered without manual annotation; once the SAE is trained, it transfers to other datasets.
  2. The three-way concept classification (target / context / bias) is both theoretically grounded and practically useful.
  3. Bias discovery Precision@10 on NICO++ improves over the previous SOTA (ViG-Bias) by approximately 10 percentage points.
  4. The framework is extensible to multi-label settings (MS-COCO).

Limitations & Future Work

  • Concepts are constrained by the knowledge scope of CLIP representations; domain-specific datasets (e.g., medical imaging) require retraining the SAE.
  • Segmentation masks are patch-level (16×16), limiting localization precision.
  • Performance on domain-specific attributes (e.g., emotion, texture) is weaker than on general attributes.
  • Unlike methods such as SpLiCE, ConceptScope needs no predefined bias categories; bias concepts are discovered automatically.
  • The work transfers the successful use of SAEs for LLM interpretability to the visual domain.
  • A natural direction for future work is whether ConceptScope could support automatic dataset cleaning or active-learning sample selection.

Rating

  • ⭐ Novelty: 4/5 — Applying SAEs to visual dataset bias analysis represents the first systematic exploration of this approach.
  • ⭐ Experimental Thoroughness: 5/5 — Covers 6 attribute datasets + 3 bias benchmarks + multiple real-world datasets + robustness analysis across 34 models.
  • ⭐ Writing Quality: 4/5 — Well-structured with rigorous concept definitions.
  • ⭐ Value: 4/5 — Provides a practical tool for dataset auditing and model diagnostics.