Learning Interpretable Queries for Explainable Image Classification with Information Pursuit¶
Conference: ICCV 2025 arXiv: 2312.11548 Code: None Area: Explainable AI / Image Classification Keywords: Explainable classification, Information Pursuit, sparse dictionary learning, CLIP, query dictionary optimization
TL;DR¶
This paper parameterizes the query dictionary of Information Pursuit (IP) as learnable vectors in the CLIP semantic embedding space, and learns a task-sufficient interpretable query dictionary via an alternating optimization algorithm, substantially closing the performance gap between interpretable classifiers and black-box classifiers.
Background & Motivation¶
Information Pursuit (IP) is an interpretable-by-design classification framework: given a predefined dictionary of semantic queries, IP selects the most informative query subset in order of information gain and makes predictions based on query–answer pairs. However, IP faces critical limitations:
Manual dependency of query dictionaries: Prior methods rely on expert-annotated concepts (e.g., bird attributes in CUB-200) or LLM prompt-generated queries, whose quality depends heavily on domain expertise.
Suboptimality of LLM-generated queries: Reliance on prompt engineering heuristics produces query sets that may be redundant, irrelevant, or insufficient.
Performance gap: A significant accuracy gap exists between IP with handcrafted dictionaries and black-box classifiers.
Core Problem: How can a task-sufficient interpretable query dictionary be learned automatically, rather than handcrafted?
Method¶
Overall Architecture¶
The method leverages CLIP's semantic embedding space by parameterizing each query as a learnable vector \(\theta_i\) in that space, maintaining interpretability via nearest-neighbor projection \(q^{(\theta_i)} = \arg\min_{q \in \mathcal{U}} \|\theta_i - q\|_2^2\) (each learned query always corresponds to a natural language concept in the query universe). An alternating optimization algorithm is adopted: freeze the dictionary to update the V-IP network → freeze the V-IP network to update the dictionary.
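The nearest-neighbor projection above is simple to sketch. A toy, pure-Python illustration (the names `project` and `universe` are mine, not from the paper's code; real entries would be high-dimensional CLIP text embeddings rather than 2-d vectors):

```python
# Toy sketch of the interpretability-preserving projection: each learned
# parameter vector theta_i is snapped to its nearest embedding in the
# query universe U, i.e. q^(theta_i) = argmin_{q in U} ||theta_i - q||^2.
# During training, gradients flow to theta_i through this argmin via a
# straight-through estimator (not shown here).

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def project(theta_i, query_universe):
    """Return the index and embedding of the universe entry nearest theta_i."""
    best_idx = min(range(len(query_universe)),
                   key=lambda j: sq_dist(theta_i, query_universe[j]))
    return best_idx, query_universe[best_idx]

# Toy universe of three "concept" embeddings and one learnable vector.
universe = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
idx, q = project([0.6, 0.65], universe)  # snaps to the third concept
```

Because every \(\theta_i\) is always read out through this projection, the learned dictionary can never drift outside the set of natural-language concepts.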
Key Designs¶
- Query parameterization in CLIP space: The query universe \(\mathcal{U} = \{E_T(c) | c \in \mathcal{T}\}\) consists of approximately 300,000 CLIP text embeddings (derived from multiple LLM prompts and COCO dataset captions). \(K\) learnable embeddings \(\theta = \{\theta_i\}_{i=1}^K\) are projected to interpretable queries via nearest-neighbor projection. The straight-through estimator (STE) enables backpropagation through the \(\arg\min\) operation. The dictionary-augmented V-IP objective is: \(\arg\min_{\theta,\psi,\eta} J_{Q_\theta}(\psi, \eta)\).
- Alternating optimization algorithm (Algorithm 1): Directly optimizing all three components (\(\theta\), \(\psi\), \(\eta\)) jointly is problematic: once the dictionary is updated, query semantics change, invalidating the existing querier policy. Accordingly, the method alternates every \(t=4\) V-IP network update steps with 1 dictionary update step. During V-IP updates, the dictionary is frozen while the querier and classifier are trained; during dictionary updates, the V-IP network is frozen and only \(\theta\) is updated.
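The alternating schedule can be sketched as a simple loop; `alternating_schedule` is a hypothetical helper (the real Algorithm 1 runs gradient steps at each position in this schedule):

```python
# Sketch of the alternating schedule: t V-IP network updates (dictionary
# frozen) followed by 1 dictionary update (V-IP network frozen), repeated.
# The paper uses t = 4; the update steps themselves are elided here.

def alternating_schedule(total_steps, t=4):
    """Return which component is updated at each training step."""
    schedule = []
    step = 0
    while step < total_steps:
        for _ in range(t):           # freeze dictionary, train querier/classifier
            schedule.append("vip")
            step += 1
        schedule.append("dict")      # freeze V-IP network, update theta only
        step += 1
    return schedule

plan = alternating_schedule(10)
```

Keeping the dictionary frozen for several querier/classifier steps lets the policy re-stabilize before the query semantics move again.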
- Connection to sparse dictionary learning: The method bears deep connections to classical sparse dictionary learning algorithms such as K-SVD: (a) IP query subset selection ≈ OMP sparse coding (selecting the most informative atoms); (b) V-IP updates ≈ sparse coding step (computing semantic codes); (c) dictionary updates ≈ dictionary atom updates (minimizing classification error rather than reconstruction error). Proposition 1 proves that under biased sampling, the optimal dictionary parameters simultaneously minimize the sum of KL divergences across all query budgets.
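To make the analogy in (a) concrete, here is a toy matching-pursuit loop (the non-orthogonal cousin of OMP, for brevity). This is the classical algorithm being referenced, not the paper's method: greedily picking the atom most correlated with the residual parallels IP greedily picking the query with the highest information gain.

```python
# Toy matching pursuit over a unit-norm dictionary: at each step, select
# the atom with the largest absolute correlation with the current residual,
# subtract its contribution, and repeat.

def matching_pursuit(signal, atoms, n_steps):
    """Greedy sparse coding: returns a list of (atom_index, coefficient)."""
    residual = list(signal)
    selected = []
    for _ in range(n_steps):
        # Pick the atom with the largest |<residual, atom>|.
        best = max(range(len(atoms)),
                   key=lambda j: abs(sum(r * a for r, a in zip(residual, atoms[j]))))
        coef = sum(r * a for r, a in zip(residual, atoms[best]))
        residual = [r - coef * a for r, a in zip(residual, atoms[best])]
        selected.append((best, coef))
    return selected

atoms = [[1.0, 0.0], [0.0, 1.0]]          # orthonormal toy dictionary
picks = matching_pursuit([3.0, 1.0], atoms, n_steps=2)
```

In the paper's correspondence, the "signal" is the label posterior, the "atoms" are queries, and reconstruction error is replaced by classification error.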
- Query answering mechanism: Soft answers are computed using CLIP ViT-L/14 via normalized dot products: \(\hat{q}^{(\theta_i)}(X) = (\langle q^{(\theta_i)}/\|q^{(\theta_i)}\|, E_I(X)/\|E_I(X)\| \rangle - m_\theta) / (M_\theta - m_\theta)\), then binarized into hard answers by thresholding at 0.5 (ensuring interpretability).
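The answering formula is just a min-max-normalized cosine similarity followed by thresholding; a minimal sketch with toy vectors (the embeddings and normalization constants `m`, `M` here are illustrative, not real CLIP outputs or dataset statistics):

```python
# Soft answer = cosine similarity between the projected query embedding
# and the CLIP image embedding, rescaled to [0, 1] using dataset-level
# min/max similarities (m_theta, M_theta), then binarized at 0.5.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def soft_answer(q_emb, img_emb, m, M):
    """Min-max-normalized similarity in [0, 1]."""
    return (cosine(q_emb, img_emb) - m) / (M - m)

def hard_answer(q_emb, img_emb, m, M):
    """Binarize the soft answer at 0.5 to keep answers yes/no interpretable."""
    return 1 if soft_answer(q_emb, img_emb, m, M) >= 0.5 else 0

a = soft_answer([1.0, 0.0], [1.0, 1.0], m=-1.0, M=1.0)
```

The 0.5 threshold is what turns continuous CLIP similarities into the yes/no answers that make the decision trace human-readable.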
Loss & Training¶
- V-IP loss: \(J_{Q_\theta}(\psi, \eta) = \mathbb{E}_{X,S}[D_{KL}(P(Y|X) \| P_\psi(Y|S, A_\eta(X,S)))]\)
- Both the querier and classifier are two-layer MLPs with masking to handle variable-length inputs
- Adam optimizer; V-IP updates and dictionary updates are performed alternately
- Hyperparameters are tuned based on validation accuracy AUC
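For a single sample, the V-IP loss above reduces to a KL divergence between the true label posterior and the classifier's posterior given the query-answer history; a toy sketch (the distributions are made-up values, and in training this is averaged over images \(X\) and sampled histories \(S\)):

```python
# Per-sample V-IP objective: D_KL(P(Y|X) || P_psi(Y | S, A(X, S))).
# A small epsilon guards against log(0); real training averages this
# over a minibatch of images and randomly sampled query histories.
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

p_true = [0.9, 0.1]    # P(Y|X): nearly certain ground-truth posterior
p_model = [0.7, 0.3]   # P_psi(Y|S, A(X,S)): classifier's posterior from answers
loss = kl_divergence(p_true, p_model)
```

Minimizing this drives the answer-conditioned posterior toward the full-image posterior, which is exactly the task-sufficiency criterion for the dictionary.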
Key Experimental Results¶
Main Results — Query Dictionary Learning Improves V-IP Accuracy¶
K-Learned vs. K-LLM across 6 datasets at a fixed query budget \(\tau\):
| Dataset | Query Budget \(\tau\) | K-LLM | K-Learned (best init) | Black-Box |
|---|---|---|---|---|
| RIVAL-10 | 10 | ~96% | ~98.7% | ~99% |
| CIFAR-10 | 10 | ~90% | ~95.1% | ~97% |
| CIFAR-100 | 50 | ~70% | ~75.2% | ~82% |
| ImageNet-100 | 50 | ~79% | ~84.0% | ~91% |
| CUB-200 | 100 | ~69% | ~74.5% | ~82% |
| Stanford-Cars | 100 | ~77% | ~82.4% | ~87% |
K-Learned consistently and significantly outperforms K-LLM on all datasets and substantially closes the gap with black-box models.
Ablation Study¶
Alternating optimization vs. joint optimization:
| Dataset | Query Budget | Alternating | Joint |
|---|---|---|---|
| RIVAL-10 | 10 | 98.73% | 98.26% |
| CIFAR-10 | 10 | 95.12% | 87.00% |
| CUB-200 | 100 | 74.52% | 72.14% |
| Stanford-Cars | 100 | 82.39% | 79.18% |
Alternating optimization consistently outperforms joint optimization, with a gap of over 8 percentage points on CIFAR-10.
Comparison with 4 state-of-the-art CBMs (using RN50 CLIP with soft answers):
| Dataset | K-Learned | PCBM | LaBo | Label-free | Res-CBM |
|---|---|---|---|---|---|
| CIFAR-10 | 88.55% | 82.08% | 87.52% | 86.77% | 88.03% |
| CIFAR-100 | 68.02% | 56.00% | 67.36% | 67.45% | 67.91% |
K-Learned outperforms or is competitive with all four state-of-the-art concept bottleneck models.
Key Findings¶
- All three initialization strategies (K-LLM, K-Random, K-Medoids) benefit from learning, with performance differences within 5 percentage points.
- Quantization (hard answers + nearest-neighbor projection) reduces performance but guarantees interpretability.
- In a jellyfish classification case study, V-IP progressively reduces posterior entropy through 8 queries (e.g., "Wings? No", "Swims? Yes", "UFO-like? Yes"), providing a transparent decision process.
- CLIP as a query-answering mechanism introduces noise (e.g., answering "anemone? Yes" for jellyfish).
Highlights & Insights¶
- Bridging dictionary learning from signal processing to explainable AI: A formal connection between IP query selection and OMP sparse coding is established (Proposition 1).
- Interpretability constraints built into the parameterization: Nearest-neighbor projection onto the query universe guarantees that all learned queries remain expressible in natural language.
- Necessity of alternating optimization: The work reveals the coupling problem between the querier and the dictionary; joint optimization leads to semantic drift.
- Progressive explanation: The IP decision process resembles a "20 Questions" game, where the posterior distribution change is observable at each step, offering more intuitive explanations than the static representations of CBMs.
Limitations & Future Work¶
- The method relies heavily on CLIP's query-answering quality; noisy CLIP responses constrain final performance.
- The query universe must be constructed in advance (~300K queries), and its quality affects the learned dictionary.
- Hard-answer quantization loses information, yet removing quantization compromises interpretability—a fundamental tension.
- Extension to larger-scale classification tasks (e.g., full ImageNet-1K) remains unexplored.
- The query budget \(\tau\) has a large impact on performance but must be set manually.
Related Work & Insights¶
- Distinction from Res-CBM: Res-CBM compensates for an incomplete dictionary via a residual module, whereas this work directly learns a sufficient dictionary.
- Sparse CLIP (SPLICE) decomposes images into sparse linear combinations of concepts, sharing a similar spirit but targeting a different task.
- The approach may inspire other tasks requiring interpretable intermediate representations, such as explainable VQA and medical diagnosis.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The bridge between sparse dictionary learning and Information Pursuit is highly elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ Six datasets with comparisons across multiple initialization strategies and optimization schemes.
- Writing Quality: ⭐⭐⭐⭐⭐ Rigorous theoretical derivations with in-depth connections to classical methods.
- Value: ⭐⭐⭐⭐ Provides a principled learning approach for query design in interpretable classifiers.