# DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification
Conference: CVPR 2026 · arXiv: 2604.07166 · Code: https://github.com/RobertZimm/DINO-QPM · Area: Model Interpretability · Keywords: Interpretable Classification, DINOv2, Quadratic Programming, Visual Foundation Models, Feature Sparsification
## TL;DR
This paper proposes DINO-QPM, a lightweight interpretability adapter that transforms the complex, high-dimensional features of a frozen DINOv2 backbone into contrastive, class-agnostic interpretable representations. Through quadratic programming for sparse feature selection and class-level feature assignment, the method simultaneously surpasses DINOv2 linear probing in accuracy and all comparable methods in interpretability on CUB-200-2011 and Stanford Cars.
## Background & Motivation
Visual foundation models such as DINOv2 excel as feature extractors, yet their complex, high-dimensional representations pose substantial challenges for interpretability. Existing approaches suffer from the following limitations:
Post-hoc explanation methods are unreliable: Attention maps, Grad-CAM, and similar techniques are external approximations that do not faithfully reflect the model's decision process; attention maps are decoupled from downstream tasks and frequently omit information critical for classification.
End-to-end interpretable models are resource-intensive: Methods such as prototype networks require full backbone fine-tuning, incurring prohibitive computational cost.
Frozen-backbone methods lack accuracy: Post-hoc Concept Bottleneck Models (Post-hoc CBMs) rely on textual concept supervision and cannot provide direct spatial localization, with accuracy typically lagging behind full fine-tuning approaches.
Prototype model interpretability can be misleading: Their similarity computations do not necessarily align with human cognition.
Core Problem: Can one construct a lightweight adapter on top of a fully frozen DINOv2 backbone that converts its powerful but entangled features into sparse, spatially localizable, and globally interpretable class representations?
## Method

### Overall Architecture
The DINO-QPM pipeline:
1. A frozen DINOv2 backbone extracts patch embeddings \(\boldsymbol{F}^{\text{froz}} \in \mathbb{R}^{N_p \times D}\).
2. An MLP projects the patch embeddings into a task-specific feature space \(\boldsymbol{F} \in \mathbb{R}^{N_p \times N_f}\).
3. Average pooling produces a feature vector \(\boldsymbol{f} = \text{AvgPool}(\boldsymbol{F}) \in \mathbb{R}^{N_f}\).
4. A Binary Low-Dimensional Decision (BLDD) layer performs sparse feature assignment for classification.
Key design choice: the CLS token is discarded in favor of patch embeddings exclusively. This ensures that each feature has a corresponding spatial feature map, enabling high-fidelity spatial localization.
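The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the hidden width, the ReLU nonlinearity, and all variable names are assumptions; only the overall shape of the pipeline (frozen patch embeddings → MLP → average pooling → binary sparse decision layer) comes from the text.

```python
import numpy as np

def dino_qpm_forward(patch_emb, w1, w2, w_sparse):
    """Illustrative forward pass of the adapter on top of a frozen backbone.

    patch_emb: (N_p, D) frozen DINOv2 patch embeddings
    w1: (D, H), w2: (H, N_f) MLP weights (H and ReLU are assumptions)
    w_sparse: (N_f, C) binary BLDD class-assignment matrix
    """
    hidden = np.maximum(patch_emb @ w1, 0.0)  # MLP hidden layer
    feat_maps = hidden @ w2                   # per-patch task features (N_p, N_f)
    f = feat_maps.mean(axis=0)                # average pooling -> feature vector
    logits = f @ w_sparse                     # binary sparse decision layer
    return logits, feat_maps                  # feat_maps double as saliency maps
```

Because each pooled feature is the spatial mean of one column of `feat_maps`, upsampling that column back to image resolution directly yields the saliency map for that feature.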
### Key Designs
- MLP Feature Transformation:
- Function: Maps DINOv2's \(D\)-dimensional patch representations to \(N_f\) task-specific features.
- Core Idea: The BLDD layer is a binary sparse matrix with no representational transformation capacity; the MLP therefore performs the necessary feature transformation upstream.
- Ablation Finding: The number of MLP layers has little effect on the dense model, but is critical for the sparse QPM — QPM with multiple MLP layers can outperform the dense model by nearly 10%.
- Quadratic Programming Feature Selection (QP):
- Function: Selects \(N_f^* = 50\) features from \(N_f\) candidates and assigns \(N_f^c = 5\) features per class.
- Core Idea: Three objectives are jointly optimized — maximizing the correlation between each class and its assigned feature activations, minimizing similarity among selected features, and maximizing bias to prioritize local features.
- Design Motivation: Feature selection via mathematical programming rather than learning ensures diversity, contrastiveness, and compactness of the feature set.
- Average Pooling for Spatial Localization:
- Function: Replaces standard CLS token aggregation with average pooling.
- Core Idea: Each dimension of the feature vector is the spatial average of the corresponding feature map, so feature maps can be directly upsampled to the original image resolution as saliency maps.
- Experimental Validation: On DINOv2 with register tokens, using patch embeddings alone achieves 88.3% accuracy, surpassing the CLS-based approach (87.6%).
- Feature Map Sparsity Loss \(\mathcal{L}_{\text{L1-FM}}\):
- Function: Applies L1 regularization to feature maps.
- Core Idea: Compels feature activations to concentrate on object regions relevant to classification, suppressing background noise and spatial diffusion.
- Surprising Finding: This loss not only improves Plausibility but also substantially increases classification accuracy, indicating a strong positive correlation between the two.
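To make the feature-selection objective concrete, the sketch below uses a greedy heuristic as a stand-in for the paper's quadratic program: it trades off two of the three stated objectives (high class-feature correlation, low similarity among chosen features). The inputs `corr` and `sim`, the scoring rule, and the greedy scheme itself are all assumptions for illustration, not the paper's solver.

```python
import numpy as np

def greedy_select(corr, sim, n_select=50, alpha=1.0):
    """Greedy stand-in for QP-based feature selection.

    corr: (N_f,) best class-correlation score per candidate feature
    sim:  (N_f, N_f) pairwise feature similarity
    alpha: weight on the redundancy penalty (hypothetical knob)
    """
    selected = []
    for _ in range(n_select):
        if selected:
            penalty = sim[:, selected].max(axis=1)  # closeness to chosen set
        else:
            penalty = np.zeros_like(corr)
        score = corr - alpha * penalty              # correlation vs. redundancy
        score[selected] = -np.inf                   # never re-pick a feature
        selected.append(int(np.argmax(score)))
    return selected
```

Solving this as a single quadratic program, as the paper does, optimizes all selections jointly rather than one at a time, which is precisely why the authors argue mathematical programming yields a more diverse and contrastive feature set than learned or greedy selection.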
### Loss & Training
Three-stage training procedure:
1. Dense Training: Train with cross-entropy \(\mathcal{L}_{\text{CE}}\), feature diversity loss \(\mathcal{L}_{\text{div}}\), and the L1 sparsity losses.
2. Quadratic Programming: Solve the QP to determine the feature selection vector \(\boldsymbol{s}\) and sparse weights \(\boldsymbol{W}^{\text{sparse}}\).
3. Sparse Fine-tuning: Fix \(\boldsymbol{W}^{\text{sparse}}\) and fine-tune only on the selected features.
Total loss: \(\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda_{\text{div}} \mathcal{L}_{\text{div}} + \lambda_{\text{L1-FV}} \mathcal{L}_{\text{L1-FV}} + \lambda_{\text{L1-FM}} \mathcal{L}_{\text{L1-FM}}\)
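The objective can be sketched as follows. The cross-entropy term is standard; the concrete forms of the diversity and L1 terms are assumptions inferred from the loss names (diversity as an off-diagonal feature-correlation penalty, sparsity as mean absolute activation) and may differ from the paper's exact definitions.

```python
import numpy as np

def total_loss(logits, labels, feat_maps, lambdas):
    """Sketch of the combined training objective.

    logits: (B, C); labels: (B,) integer class ids; feat_maps: (B, N_p, N_f)
    lambdas: dict of loss weights, e.g. {"div": ..., "l1_fv": ..., "l1_fm": ...}
    """
    feat_vec = feat_maps.mean(axis=1)                 # pooled feature vectors
    z = logits - logits.max(axis=1, keepdims=True)    # stable log-softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    l_ce = -log_p[np.arange(len(labels)), labels].mean()
    c = np.corrcoef(feat_vec, rowvar=False)           # (N_f, N_f) correlations
    l_div = (np.abs(c) - np.eye(c.shape[0])).mean()   # off-diagonal redundancy
    l1_fv = np.abs(feat_vec).mean()                   # sparsity of feature vector
    l1_fm = np.abs(feat_maps).mean()                  # sparsity of feature maps
    return (l_ce + lambdas["div"] * l_div
            + lambdas["l1_fv"] * l1_fv + lambdas["l1_fm"] * l1_fm)
```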
## Key Experimental Results

### Main Results
| Method | CUB Acc↑ | CARS Acc↑ | CUB Plausib.↑ | Contrast.↑ |
|---|---|---|---|---|
| DINOv2 CLS Linear Probe | 87.9 | 91.7 | 42.6 | 59.2 |
| Dense \(\boldsymbol{F}^{\text{froz}}\) | 78.1 | 92.9 | 32.7 | 84.5 |
| ResNet50 QPM | 82.9 | 92.1 | 82.9 | 93.6 |
| DINO-SLDD | 84.6 | 92.9 | 78.0 | 93.0 |
| DINO-QSENN | 85.4 | 93.3 | 86.0 | 94.4 |
| DINO-QPM (Ours) | 88.3 | 94.0 | 95.0 | 100.0 |
DINO-QPM surpasses the non-interpretable DINOv2 linear probe in accuracy (88.3 vs. 87.9), while Plausibility improves dramatically from 42.6 to 95.0.
### Ablation Study
| Configuration | CUB Acc (%) | Note |
|---|---|---|
| CLS + no register | 87.3 | CLS token lacks spatial localization |
| CLS + register | 87.6 | Register improves CLS |
| Patch + no register | 83.3 | Patch representations degrade without register |
| Patch + register | 88.3 | Register tokens are critical |

| Backbone | CUB Acc (%) | Patch Contextualization↑ |
|---|---|---|
| DINO ViT-B/16 | 37.1 | 8.9 |
| DINOv2 ViT-S/14 Reg | 83.4 | 42.9 |
| DINOv2 ViT-B/14 Reg | 88.3 | 43.9 |
| DINOv2 ViT-L/14 Reg | 86.5 | 2.2 |
### Key Findings
- Register tokens are essential: Without register tokens, the spatial quality of patch embeddings deteriorates, causing an accuracy drop of roughly 5 points (83.3 vs. 88.3). Register tokens prevent patches from acting as "artifact tokens" that store global context.
- Plausibility and accuracy are strongly correlated: The \(\mathcal{L}_{\text{L1-FM}}\) loss jointly improves both metrics, indicating that directing model attention toward object regions is intrinsically beneficial for classification.
- The compactness–accuracy trade-off is minimal: Reducing features per class from 5 to 4 (Compact variant) incurs negligible accuracy loss.
- Illustrative bird classification case: When distinguishing Brewer's Blackbird from Rusty Blackbird, the model automatically localizes to the beak region — precisely the discriminative cue used by ornithological experts.
## Highlights & Insights
- Non-interpretable models are not necessarily more accurate: DINO-QPM, operating under the extreme sparsity constraint of 50 total features with only 5 per class, outperforms linear probing that uses 768 features.
- Counter-intuitive architectural choice: Discarding the CLS token — the standard choice in classification literature — yields superior results, as globally interpretable representations constructed directly from local evidence are both more interpretable and more effective than opaque pre-aggregated representations.
- Exceptional training efficiency: Because the backbone is fully frozen, patch embeddings can be precomputed, reducing per-epoch training time to approximately 6 seconds.
- Well-designed Plausibility metric: The use of dilated masks to handle patch boundary effects prevents unfair penalization of activations on precise object contours.
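The training-efficiency point reduces to caching: with a fully frozen backbone, patch embeddings only need to be computed once and every adapter epoch iterates over the cache. A minimal sketch, where `backbone_fn` is a hypothetical stand-in for a frozen DINOv2 forward pass returning an `(N_p, D)` array per image:

```python
import numpy as np

def cache_patch_embeddings(backbone_fn, images):
    """Run the frozen backbone once over the dataset and stack the results.

    Returns an (N, N_p, D) array; adapter training then never touches the
    backbone again, which is what makes ~6-second epochs possible.
    """
    return np.stack([backbone_fn(img) for img in images])
```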
## Limitations & Future Work
- Validation is limited to fine-grained classification benchmarks (CUB-200-2011, Stanford Cars); generalization to generic image classification remains to be tested.
- Performance degrades with the ViT-L backbone, suggesting that adapter design may need to be tailored to different backbone scales.
- The quadratic programming feature selection is performed once and does not adapt dynamically during training, potentially limiting the discovery of optimal feature combinations.
- The fixed allocation of 5 features per class does not account for varying class complexity, which may warrant adaptive feature budgets.
## Related Work & Insights
- QPM / ChiQPM: End-to-end trainable interpretable models based on quadratic programming; this work transfers the paradigm to frozen backbones.
- Post-hoc CBM: Achieves interpretable classification via textual concepts, but depends on external language models and lacks spatial localization.
- ProtoViT / Zhu et al.: Prototype-based interpretable methods that require backbone fine-tuning for prototype clustering.
- Insight: The paradigm of frozen backbone + lightweight adapter as an efficient interpretable solution merits further exploration across a broader range of visual tasks.
## Rating
- Novelty: ⭐⭐⭐⭐ — Transferring QPM to frozen visual foundation models is a natural yet effective contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Ablations are comprehensive, metrics are rigorously designed, and cross-backbone validation is complete.
- Writing Quality: ⭐⭐⭐⭐⭐ — Mathematical derivations are clear, and visualizations are outstanding (the bird feature localization case study is particularly compelling).
- Value: ⭐⭐⭐⭐ — Provides a strong tool for interpretable classification with frozen foundation models.