Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability

Conference: CVPR 2025
arXiv: 2503.20301
Code: https://github.com/tiggers23/ALBM
Area: LLM/NLP
Keywords: Language Bottleneck Models, Attribute Concept Space, Visual Prompt Learning, Zero-Shot Generalization, Interpretable Classification

TL;DR

This paper proposes ALBM, which replaces the class-shared concept space of existing Language Bottleneck Models (LBMs) with an Attribute-formed Class-specific Concept Space (ACCS) to curb spurious cue reasoning and support cross-class generalization. Combined with Visual Attribute Prompt Learning (VAPL), which extracts fine-grained attribute features, ALBM outperforms existing interpretable classification methods across 9 few-shot benchmarks.

Background & Motivation

Background: Vision-Language Models (VLMs, such as CLIP) obtain powerful visual representation capabilities through contrastive learning, but their decision-making processes lack transparency. To improve interpretability, Language Bottleneck Models (LBMs) aggregate textual concepts of all classes into a unified concept space and train concept classifiers based on concept activation scores.

Limitations of Prior Work: Existing LBMs place all concepts into a class-shared space, leading to two issues: (1) Spurious cue reasoning—classifiers may learn spurious correlations between categories and non-essential concepts (e.g., identifying a tiger via "jungle"), which reduces interpretability; (2) Inability to generalize to new classes—introducing new classes requires extending the concept space, rendering pre-trained classifiers non-transferable.

Key Challenge: The class-shared concept space obscures which concepts are essentially related to which class. The root cause is the lack of a structured correspondence between concepts and categories.

Goal: Construct a class-specific concept space that forces the classifier to rely solely on the essential concepts of each class for reasoning, while still generalizing across classes.

Key Insight: The authors observe that "attributes" (such as color, shape, and texture) are high-level dimensions shared across categories, whereas "concepts" are class-specific descriptions under specific attributes (e.g., the color of an albatross -> dark gray). If the concept space is organized by attributes, naturally aligned correspondences exist between the concept spaces of different classes, supporting cross-class transfer.

Core Idea: Guide the construction of the Attribute-formed Class-specific Concept Space (ACCS) using a unified set of attributes. Each class reasons solely based on its own concepts, solving the spurious cue problem. At the same time, the cross-class consistency of attributes ensures that the classifier generalizes to new classes.

Method

Overall Architecture

Given an input image, the ALBM pipeline has three steps: (1) extract a fine-grained feature for each attribute using a vision encoder and visual attribute prompts; (2) compute the concept activation score matrix \(S \in \mathbb{R}^{K \times N_a}\), where \(K\) is the number of classes and \(N_a\) the number of attributes, as the cosine similarity between these features and the attribute-formed concepts; (3) predict the class with a linear concept classifier \(W_a\) that operates within the class-specific space.
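
Steps (2) and (3) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, and the classifier is assumed to hold one weight per (class, attribute) pair, which the summary does not spell out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def activation_scores(attr_feats, concepts):
    """Return S with S[i][j] = cos(f_a^j, C[i][j]).

    attr_feats: N_a image features, one per attribute.
    concepts:   K x N_a text features, one per (class, attribute) pair.
    """
    return [
        [cosine(attr_feats[j], class_concepts[j]) for j in range(len(attr_feats))]
        for class_concepts in concepts
    ]


def class_logits(scores, weights):
    """Score class i from its own row s_i only (no cross-class concepts)."""
    return [sum(w * s for w, s in zip(w_i, s_i)) for w_i, s_i in zip(weights, scores)]


# Toy example: K = 2 classes, N_a = 2 attributes, d = 2 feature dimensions.
attr_feats = [[1.0, 0.0], [0.0, 1.0]]
concepts = [
    [[1.0, 0.0], [0.0, 1.0]],  # class 0 matches the image on both attributes
    [[0.0, 1.0], [1.0, 0.0]],  # class 1 matches on neither
]
weights = [[0.5, 0.5], [0.5, 0.5]]  # hypothetical per-class attribute weights

S = activation_scores(attr_feats, concepts)
logits = class_logits(S, weights)
predicted = max(range(len(logits)), key=logits.__getitem__)
```

In this toy case the image agrees with class 0 on every attribute, so class 0 receives the highest logit. The point of the structure is that the logit of a class never touches another class's concepts.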

Key Designs

Attribute-formed Class-specific Concept Space (ACCS):
- Function: Construct independent concept spaces for each class, where concepts are organized based on a unified attribute set.
- Mechanism: The concept set \(C \in \mathbb{R}^{K \times N_a \times d}\), with \(d\) the feature dimension, consists of descriptions of all \(K\) classes under a unified attribute set \(\{a_j\}_{j=1}^{N_a}\). During classification, the prediction score for class \(i\) is based solely on that class's own concept activation scores \(s_i\) (the \(i\)-th row of \(S\)), rather than on a mixture of all concepts. This restricts classification to a causally correct path: class \(i\) can only be recognized through the essential attributes of class \(i\).
- Design Motivation: Directly address the spurious cue reasoning problem. The cross-class consistency of attributes implies that introducing new classes only requires generating new attribute descriptions without modifying the concept spaces of existing classes. The trained classifier can be transferred to new classes through weighting based on class-name similarity.
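
The transfer step mentioned above can be illustrated as follows. The summary only states that weights are transferred "through weighting based on class-name similarity"; the softmax weighting, the temperature, and the function names below are assumptions made for this sketch, not the paper's exact rule.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def transfer_weights(base_weights, name_sims, temperature=0.1):
    """Build attribute weights for a novel class (hypothetical rule).

    base_weights: K_base x N_a classifier weights learned on base classes.
    name_sims:    K_base similarities between the novel class name and
                  each base class name (e.g., from a text encoder).
    The novel class gets a similarity-weighted average of base-class rows.
    """
    alphas = softmax(name_sims, temperature)
    n_attr = len(base_weights[0])
    return [sum(a * w[j] for a, w in zip(alphas, base_weights)) for j in range(n_attr)]


base_weights = [[1.0, 0.0], [0.0, 1.0]]
novel = transfer_weights(base_weights, name_sims=[0.9, 0.1])
balanced = transfer_weights(base_weights, name_sims=[0.5, 0.5])
```

Because every class shares the same attribute axes, a row of weights learned for one class is meaningful for another, which is what makes this kind of reuse possible at all.
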
Visual Attribute Prompt Learning (VAPL):
- Function: Extract corresponding fine-grained visual features for each attribute from the image.
- Mechanism: Introduce \(N_a\) learnable prompt tokens \(\{p_j\}\) into the ViT vision encoder, each representing the semantics of one attribute. For an input image, the output feature \(f_a^j\) of the \(j\)-th prompt captures the image's information about that attribute. To keep the prompts from interfering with image feature extraction, attention among prompts and attention from image tokens to prompts are both masked. The prompts are trained to align \(f_a^j\) with the textual features of the corresponding attribute concepts using the cross-entropy loss \(\mathcal{L}_p\).
- Design Motivation: CLIP's global [CLS] feature struggles to capture fine-grained attribute details. Directly guiding the encoder to focus on specific attributes via attribute prompts improves the accuracy of concept activation scores.
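
The masking rule described above can be written as a boolean attention mask. The sketch assumes a token layout in which the image tokens (including [CLS]) come first and the \(N_a\) prompt tokens are appended after them, and it assumes each prompt may still attend to itself; neither detail is given in the summary.

```python
def vapl_attention_mask(n_img, n_attr):
    """Return allowed[q][k]: True if query token q may attend to key token k.

    Assumed layout: indices [0, n_img) are image tokens, indices
    [n_img, n_img + n_attr) are attribute prompt tokens.
    """
    n = n_img + n_attr
    allowed = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < n_img:
                # Image tokens are visible to every query, so prompts can
                # read the image and image tokens attend to each other.
                allowed[q][k] = True
            else:
                # A prompt key is visible only to itself: image tokens never
                # read prompts, and prompts never read other prompts.
                allowed[q][k] = (q == k)
    return allowed


mask = vapl_attention_mask(n_img=3, n_attr=2)
```

With this mask the image-token outputs are identical to those of the unmodified encoder, while each prompt aggregates image information independently of the other prompts.
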
Description-Summary-Supplement (DSS) Strategy:
- Function: Automatically generate high-quality attribute-formed concept sets, avoiding manual annotation.
- Mechanism: A three-step approach: (1) Description: Let an LLM (GPT-4o) freely generate concept descriptions for each class; (2) Summary: Aggregate the concepts of all classes and prompt the LLM to extract a unified attribute set; (3) Supplement: Perform supplemental generation for classes with missing attribute values. Compared to existing methods that directly prompt an LLM to list attributes, DSS infers attributes from freely generated concepts, thus extracting a more complete and precise attribute set.
- Design Motivation: Directly prompting the LLM to summarize attribute sets often overlooks important attributes. The "describe first, then summarize" strategy leverages the diversity of the LLM in free-form generation to ensure sufficient attribute coverage.
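
The three DSS steps can be outlined as a small pipeline. The LLM calls are replaced here by plain callables with hard-coded toy behavior; `describe`, `summarize`, and `supplement` are hypothetical stand-ins rather than a real API, and the free-form descriptions are simplified to attribute-keyed dictionaries.

```python
def build_concept_set(class_names, describe, summarize, supplement):
    """Sketch of Description -> Summary -> Supplement.

    describe(cls)        -> {attribute: concept} from free-form generation
    summarize(described) -> unified attribute list across all classes
    supplement(cls, a)   -> concept for an attribute the class is missing
    """
    # Step 1 (Description): collect concepts for each class.
    described = {c: describe(c) for c in class_names}
    # Step 2 (Summary): derive one attribute set from all classes.
    attributes = summarize(described)
    # Step 3 (Supplement): fill in attributes a class did not cover.
    concept_set = {
        c: {a: described[c].get(a) or supplement(c, a) for a in attributes}
        for c in class_names
    }
    return attributes, concept_set


# Toy stand-ins for the LLM (illustrative data only).
_NOTES = {
    "tiger": {"color": "orange with black stripes", "shape": "large muscular cat"},
    "albatross": {"color": "dark gray"},
}


def describe(cls):
    return dict(_NOTES[cls])


def summarize(described):
    return sorted({a for concepts in described.values() for a in concepts})


def supplement(cls, attribute):
    return f"{attribute} of {cls} (supplemented)"


attributes, concept_set = build_concept_set(
    ["tiger", "albatross"], describe, summarize, supplement
)
```

After the run, every class has exactly one concept per attribute, which is the shape the \(K \times N_a\) concept space requires.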

Loss & Training

Training proceeds in two stages: first, the visual attribute prompts are trained with \(\mathcal{L}_p\) (concept alignment cross-entropy loss) for 5 epochs; then, the concept classifier \(W_a\) is trained with \(\mathcal{L}_w\) (classification cross-entropy loss) for 1000 epochs. Training uses the SGD optimizer with a batch size of 64.
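
One plausible way to write the two cross-entropy objectives, given the notation used above, is shown below. The exact form is an assumption: the temperature \(\tau\), the averaging over attributes, and the per-class weight rows \(w_i\) of \(W_a\) are not specified in this summary. Here \(y\) is the ground-truth class and \(C_{i,j}\) is the text feature of class \(i\) under attribute \(a_j\).

\[
\mathcal{L}_p = -\frac{1}{N_a} \sum_{j=1}^{N_a} \log \frac{\exp\left(\cos(f_a^j, C_{y,j}) / \tau\right)}{\sum_{i=1}^{K} \exp\left(\cos(f_a^j, C_{i,j}) / \tau\right)},
\qquad
\mathcal{L}_w = -\log \frac{\exp\left(w_y^\top s_y\right)}{\sum_{i=1}^{K} \exp\left(w_i^\top s_i\right)}
\]

Under this reading, \(\mathcal{L}_p\) pushes each attribute feature toward the correct class's concept for that attribute, and \(\mathcal{L}_w\) is a standard softmax classification loss in which each class logit depends only on that class's own scores \(s_i\).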

Key Experimental Results

Main Results

| Dataset | ZS-CLIP | CuPL | CLIP-GPT | LaBo* | ALBM* (ours) |
| --- | --- | --- | --- | --- | --- |
| CUB | 63.4 | - | 11.4 | 16.2 | 25.0 |
| DTD | 53.2 | 37.2 | 40.0 | 37.9 | 48.5 |
| Food101 | 91.0 | 66.3 | 48.4 | 52.2 | 75.4 |
| ImageNet | 71.4 | 59.2 | 44.3 | 37.8 | 64.6 |

Comparison with training-free language bottleneck methods under the zero-shot setting (accuracy, %)

| Dataset | LaBo Base | ALBM Base | ALBM Novel |
| --- | --- | --- | --- |
| CUB | 76.9 | 91.9 | 27.8 |
| Food101 | 87.6 | 88.5 | 86.8 |
| ImageNet | 71.7 | 75.0 | 73.9 |

Base-to-Novel setting (accuracy, %); ALBM generalizes to unseen classes

Ablation Study

| Config | Base | Novel |
| --- | --- | --- |
| Zero-shot (Training-free) | 54.2 | 55.2 |
| +\(\mathcal{L}_w\) trained classifier | 74.0 | 57.5 |
| +\(\mathcal{L}_w\) + VAPL | 77.2 | 58.2 |

Key Findings

- Training the concept classifier brings a gain of 19.8 points on Base (54.2 to 74.0) and 2.3 points on Novel (55.2 to 57.5), supporting the value of learning a classifier on top of the concept bottleneck.
- VAPL contributes a further 3.2 points on Base (74.0 to 77.2) and 0.7 points on Novel (57.5 to 58.2), indicating that attribute-level feature extraction outperforms the global [CLS] feature.
- The attribute sets generated by DSS are more complete than those of CLIP-GPT (e.g., 12 vs 7 attributes on OxfordPets), which benefits classification.
- Case studies show that ALBM avoids spurious cues (e.g., LaBo using "jungle" to identify a tiger) and relies solely on the essential concepts of each class.
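
The gains quoted above follow directly from the ablation table. The short check below recomputes the row-to-row differences in percentage points; the row labels are copied from the table and the variable names are illustrative.

```python
# (config, Base accuracy, Novel accuracy) as listed in the ablation table.
ablation = [
    ("Zero-shot (Training-free)", 54.2, 55.2),
    ("+L_w trained classifier", 74.0, 57.5),
    ("+L_w + VAPL", 77.2, 58.2),
]


def stepwise_gains(rows):
    """Difference of each row from the previous one, in percentage points."""
    gains = []
    for (_, prev_base, prev_novel), (name, base, novel) in zip(rows, rows[1:]):
        gains.append((name, round(base - prev_base, 1), round(novel - prev_novel, 1)))
    return gains


gains = stepwise_gains(ablation)
for name, base_gain, novel_gain in gains:
    print(f"{name}: Base +{base_gain}, Novel +{novel_gain}")
```

Both components help Base classes far more than Novel classes, which is consistent with the classifier and prompts being fit on Base data only.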

Highlights & Insights

- Attribute-level Concept Alignment: Organizing the concept space by using "attributes" as a cross-class common intermediate layer ensures both interpretability and generalizability, which is an elegant structural design.
- VAPL's Attention Masking: Introducing attribute prompts in ViT while blocking mutual attention among prompts and attention from image tokens to prompts ensures that each attribute prompt extracts information independently. This design is simple yet effective.
- DSS's "Describe First, Then Summarize" Philosophy: Leveraging the divergence capability of LLMs to generate diverse descriptions first, and then converging them into a structured attribute set, is more reliable than directly requesting attribute lists. This approach can be transferred to other scenarios requiring structured extraction.

Limitations & Future Work

- A sizable gap remains between interpretable classification and non-interpretable zero-shot CLIP (e.g., 64.6 vs 71.4 on ImageNet, a 6.8-point difference), so the trade-off between interpretability and accuracy is still unresolved.
- DSS depends on GPT-4o, so the quality of the concept set is bounded by the capability of the LLM.
- VAPL is currently implemented only on the ViT architecture and is not directly applicable to CNN encoders.
- Multi-grained attribute hierarchies (such as coarse-grained "color" -> fine-grained "stripe color") have not yet been explored and could potentially improve performance further.

Comparison with Related Work

- vs LaBo: LaBo learns a concept classifier in a class-shared space, whereas ALBM learns in a class-specific space, avoiding spurious cues while structuring the concept space.
- vs CLIP-GPT: Although both use attribute-formed concept sets, CLIP-GPT directly prompts the LLM to list attributes, while ALBM obtains more complete attribute sets through the DSS strategy.
- vs MAP (Visual Prompting): MAP aligns visual prompts using cross-attention and unstructured descriptions, which yields ambiguous semantics. In contrast, ALBM's VAPL aligns prompts within a structured attribute set, giving them clear and interpretable semantics.

Rating

- Novelty: ⭐⭐⭐⭐ The idea of attribute-formed class-specific concept spaces is clear and effective, though the core remains a linear classifier.
- Experimental Thoroughness: ⭐⭐⭐⭐ 9 datasets, both zero-shot and base-to-novel settings, with complete ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Clear structure and well-formulated motivation.
- Value: ⭐⭐⭐⭐ Improvements to LBM interpretability and generalization are of practical significance, and the DSS strategy provides inspiring insights.
