Attention, Please! Revisiting Attentive Probing Through the Lens of Efficiency

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=PXo0gtT7Al
Code: https://vrg.fel.cvut.cz/ep/ (Project Page)
Area: Self-Supervised / Representation Learning Evaluation
Keywords: Attentive Probing, Representation Evaluation, Multi-Query Cross-Attention, Parameter Efficient, Frozen Backbone

TL;DR

Addressing the common parameter bloat in "attentive probing"—an increasingly popular evaluation protocol for frozen representations—this paper first unifies existing methods into a single framework. By leveraging the mathematical equivalence between Multi-Head Cross-Attention (MHCA) and Multi-Query Cross-Attention (MQCA), it removes redundant projection matrices to propose the extremely lightweight Efficient Probing (EP). On ImageNet-1K, EP achieves 75.6% accuracy for MAE ViT-B using less than 1.4M parameters (compared to 67.7% for linear probing) and consistently outperforms linear probing and existing attentive probes across diverse pre-training paradigms.

Background & Motivation

Background: Three mainstream protocols exist for evaluating pre-trained representations: k-NN, Linear Probing (LP), and Full Fine-Tuning (FT). While FT yields the highest accuracy, its computational cost is becoming unsustainable in the era of large models. Consequently, "frozen backbone + lightweight probe" is becoming the de facto evaluation standard.

Limitations of Prior Work: Standard linear probing attaches a classification head to a single global representation (e.g., the [CLS] token). While effective for models trained with global objectives (like DINO's Joint-Embedding Architecture, JEA), it significantly underestimates models that disperse discriminative information across local patch representations—such as Masked Image Modeling (MAE, SimMIM), Autoregressive (AIM), and Diffusion (DiT) models, which lack a centralized global token. "Attentive probing" emerged to bridge this gap: using attention to selectively aggregate discriminative descriptors from patch features for linear classification.

Key Challenge: Despite its adoption by AIM, CAE, V-JEPA, and CAPI, attentive probing lacks a systematic study. Existing methods vary significantly in design and are generally over-parameterized and computationally inefficient, and the mechanism by which attention aggregation improves classification remains unclear. Ultimately, a probe is an evaluation tool and should not be heavier than the representation it evaluates.

Goal: This work re-examines attentive probing through the lens of "accuracy vs. parameter efficiency": (1) Systematically unify existing methods into a single framework for the first comprehensive benchmark; (2) Design a lightweight yet accurate attentive probe; (3) Clarify the relationship between attention quality and classification accuracy.

Key Insight: The authors observe that the key projection matrix \(W_K\) in standard multi-head cross-attention (MHCA) maps learnable queries back to the full space of input features. This step can be absorbed by a set of "effective queries" learned directly in the input space. Since the two are mathematically equivalent, the bulky projection matrices are entirely redundant.

Core Idea: Replace MHCA and its key/query projection matrices with Multi-Query Cross-Attention (MQCA), which learns multiple queries directly in the input feature space. This reduces the learnable parameters of the attention branch from \(D_a(D_i{+}1)\) to \(D_i M\) while maintaining mathematical equivalence. The resulting probe is Efficient Probing (EP).

Method

Overall Architecture

The objective of attentive probing: Given a feature matrix \(X \in \mathbb{R}^{D_i \times N}\) (\(N = W \times H\) patch features, each \(D_i\)-dimensional) from a frozen ViT backbone, an attention pooling mechanism aggregates it into an image-level feature \(y \in \mathbb{R}^{D_o}\), which is then fed to a \(C\)-class linear classifier.

The authors unify all attention pooling designs into \(M\) attention predictors: the \(j\)-th predictor outputs an \(\ell_1\)-normalized attention vector \(a_j \in \mathbb{R}^N\) (reshaping to \(W \times H\) yields an attention map). The value features \(V = W_V X\) are split into \(M\) sub-matrices \(V_j \in \mathbb{R}^{d_o \times N}\) (\(d_o = D_o/M\)), and the output feature is partitioned accordingly:

\[y_j = V_j a_j = W_{V_j} X a_j .\]

In essence, each predictor weighted-pools \(N\) patch features into a \(d_o\)-dimensional subspace of the final representation, and these \(M\) segments are concatenated to form \(y\). This abstraction reveals that existing methods like AbMILP, AIM, DELF, SimPool, and V-JEPA represent different choices for constructing these \(M\) predictors. EP is designed as the most parameter- and computation-efficient construction within this framework.
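The pooling step of the unified framework can be sketched in a few lines of NumPy. This is a minimal illustration under assumed toy sizes, not the authors' implementation: the function name `attentive_pool`, the random inputs, and the dimensions are all hypothetical, and the attention vectors are simply passed in rather than produced by any particular predictor.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_pool(X, W_V, A):
    """Pool patch features with M attention predictors.

    X   : (D_i, N) patch features from the frozen backbone
    W_V : (D_o, D_i) value projection
    A   : (N, M) attention vectors, one l1-normalised column per predictor
    """
    D_o, M = W_V.shape[0], A.shape[1]
    d_o = D_o // M
    V = W_V @ X                                # (D_o, N) value features
    # y_j = V_j a_j: predictor j pools its own d_o-row slice of V
    parts = [V[j * d_o:(j + 1) * d_o] @ A[:, j] for j in range(M)]
    return np.concatenate(parts)               # (D_o,) image-level feature

rng = np.random.default_rng(0)
D_i, N, D_o, M = 16, 9, 8, 4                   # toy sizes, not the paper's
X = rng.normal(size=(D_i, N))
W_V = rng.normal(size=(D_o, D_i))
A = softmax(rng.normal(size=(N, M)), axis=0)   # each column sums to 1
y = attentive_pool(X, W_V, A)
```

With uniform attention (every entry of `A` equal to `1/N`) the same function reduces to global average pooling followed by `W_V`, which is one way to see GAP as a degenerate member of the framework.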

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Frozen ViT Backbone<br/>Patch Features X (Di×N)"] --> B["Unified Framework<br/>M Attention Predictors"]
    B --> C["EP / Multi-Query Cross-Attention<br/>Directly Learn Effective Queries u_j<br/>a_j = softmax(Xᵀ u_j)"]
    A --> D["Retain Value Transformation<br/>V = W_V·X"]
    C --> E["Weighted Pooling<br/>y_j = V_j·a_j, Concatenate to y"]
    D --> E
    E --> F["C-class Linear Classifier"]

Key Designs

1. Unified Framework: Consolidating Diverse Attention Pooling into "M Predictors + Value Aggregation"

Varied designs in attentive probing make side-by-side comparison difficult. The authors align them into standard algorithmic steps (query source, key/value transformations, attention calculation, and pooling). In this framework: AbMILP is a simple case with \(M=1\) and \(W_K, W_V\) as identity matrices; AIM is MHCA with batch normalization; DELF uses an MLP for scalar attention (\(M=1\)) with softplus instead of softmax; SimPool (\(M=1\)) uses a data-dependent query, the mean patch feature \(u = \frac{1}{N}X \mathbf{1}\), with layer norm; and V-JEPA stacks an MLP with GELU and residuals on top of MHCA, which is equivalent to a transformer block. This systematic comparison makes the sources of over-parameterization evident.

2. EP / Multi-Query Cross-Attention (MQCA): Learning Queries in Input Space to Remove Redundancy

Standard MHCA attention is \(\hat{a}_j = (W_{K_j}X)^\top q_j = X^\top W_{K_j}^\top q_j\), where \(q_j\) is a learnable query. The authors observe that \(W_{K_j}^\top\) only serves to map \(q_j\) back to the \(D_i\)-dimensional space of input features. Instead of learning \(W_{K_j}\) and \(q_j\) separately, EP directly learns the mapped vector in the \(D_i\) space. By defining the "effective query" \(u_j := W_{K_j}^\top q_j \in \mathbb{R}^{D_i}\), the attention becomes:

\[\hat{a}_j = X^\top u_j, \qquad a_j = \mathrm{softmax}(\hat{a}_j), \quad j \in \{1,\dots,M\}.\]
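The absorption argument can be checked numerically. The sketch below uses random matrices and assumed toy dimensions (`D_i`, `D_a`, `M`, `N` are illustrative values, and the variable names are hypothetical); it only verifies that the pre-softmax logits of the two formulations coincide when \(u_j = W_{K_j}^\top q_j\).

```python
import numpy as np

rng = np.random.default_rng(0)
D_i, D_a, M, N = 32, 32, 4, 10         # toy sizes chosen for illustration
d_a = D_a // M                         # per-head key/query width

X = rng.normal(size=(D_i, N))          # patch features
W_K = rng.normal(size=(M, d_a, D_i))   # one key projection per head
q = rng.normal(size=(M, d_a))          # one learnable query per head

# MHCA logits: (W_Kj X)^T q_j for each head j, stacked to shape (M, N)
logits_mhca = np.stack([(W_K[j] @ X).T @ q[j] for j in range(M)])

# MQCA logits: fold W_Kj into the query, u_j = W_Kj^T q_j, then X^T u_j
U = np.stack([W_K[j].T @ q[j] for j in range(M)])   # (M, D_i)
logits_mqca = U @ X                                  # row j equals X^T u_j

max_gap = np.abs(logits_mhca - logits_mqca).max()    # floating-point noise only
```

Because the logits agree, the softmax outputs agree as well, so any MHCA probe has an MQCA counterpart with identical attention maps.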

This eliminates the key projection \(W_K\) from the attention branch, leaving the \(M\) effective queries as its only learnable parameters. Parameters drop from \(D_a(D_i{+}1)\) to \(D_i M\) and computation from \(N D_a(D_i{+}1)\) to \(N D_i M\). Since \(M\) is typically orders of magnitude smaller than \(D_a\) (by default \(D_a = D_i\)), the savings are significant. The improvement is "free" because MQCA is mathematically equivalent to MHCA: EP12 and AIM12 achieve the same accuracy (75.1%), but EP uses fewer parameters (1.36M vs 1.95M).
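To make the two formulas concrete, the following sketch plugs in ViT-B-like sizes. The helper names are hypothetical, and `N = 196` is an assumption (a 14 x 14 patch grid for a 224 x 224 input at patch size 16); only the two formulas themselves come from the text.

```python
def mhca_attention_params(D_i, D_a):
    # key projection (D_a x D_i) plus D_a query entries across all heads
    return D_a * (D_i + 1)

def mqca_attention_params(D_i, M):
    # M effective queries, each living in the D_i-dimensional input space
    return D_i * M

D_i = D_a = 768   # ViT-B width, taking D_a = D_i
M = 12            # number of queries / heads
N = 196           # assumed patch count (14 x 14 grid)

mhca = mhca_attention_params(D_i, D_a)   # 590592
mqca = mqca_attention_params(D_i, M)     # 9216
ratio = mhca / mqca                      # roughly 64x fewer parameters

# The compute formulas scale both counts by the same factor N
mhca_cost = N * mhca
mqca_cost = N * mqca
```

The difference of about 0.58M parameters is of the same order as the reported gap between AIM12 (1.95M) and EP12 (1.36M), although the exact accounting in the paper may include terms not modelled here.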

3. Retaining Value Transformation \(W_V\) + Using \(D_o/M\) as a Budget Knob

While EP cuts query/key projections, it deliberately retains the value transformation \(V = W_V X\). Ablations show \(W_V\) is critical: adding \(W_V\) to vanilla Global Average Pooling (GAP) improves accuracy from 66.7% to 68.0%, while removing it from EP12 drops accuracy from 75.1% to 72.1%. Intuitively, attention determines "what to aggregate," while \(W_V\) determines "what the representation should look like." EP also introduces two knobs—the number of queries \(M\) and output dimension \(D_o\)—allowing the method to scale across different parameter budgets on the Pareto front.
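The effect of the two knobs on the overall budget can be estimated with a rough parameter count. This is a sketch under stated assumptions: the function `ep_total_params` is hypothetical, and it assumes that \(W_V\) and the linear classifier each carry a bias term, which the text does not specify.

```python
def ep_total_params(D_i, D_o, M, C, bias=True):
    """Rough trainable-parameter count for an EP-style probe.

    Assumption: W_V and the classifier each carry a bias term when
    bias=True; the paper's exact accounting may differ.
    """
    assert D_o % M == 0, "each of the M predictors needs an equal slice of D_o"
    queries = D_i * M                            # M effective queries
    value = D_o * D_i + (D_o if bias else 0)     # value transformation W_V
    classifier = C * D_o + (C if bias else 0)    # C-class linear head
    return queries + value + classifier

# ViT-B features (D_i = 768) on ImageNet-1K (C = 1000)
full = ep_total_params(D_i=768, D_o=768, M=12, C=1000)   # 1368808
small = ep_total_params(D_i=768, D_o=96, M=48, C=1000)   # 207688
```

Under these assumptions the two configurations land near 1.37M and 0.21M parameters. Shrinking \(D_o\) cuts both \(W_V\) and the classifier, while \(M\) only adds \(D_i\) parameters per extra query, which is why \(D_o\) is the stronger budget lever.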

Loss & Training

Probing follows standard settings: frozen backbone, 90 epochs of training for the attention pooling and classifier. Performance is reported as Top-1 accuracy on validation sets, alongside trainable parameter counts and FLOPs. Unless otherwise specified, \(D_o = D_i = D_a\). EP is also compatible with PEFT; combining LoRA on all \(W_V\) layers with EP (LoRA+EP) yields the benefits of both.

Key Experimental Results

Main Results

Evaluations span 7 classification benchmarks (IN-1K, CIFAR-100, Places365, CUB-200, Aircraft, Cars, Food-101) across five pre-training paradigms (MIM, JEA, Hybrid, VLM, Generative). The following table compares protocols on ImageNet-1K (EP default is EP32):

| Pre-training | Architecture | k-NN | LP | EP | Gain (EP vs LP) |
|---|---|---|---|---|---|
| MAE (MIM) | ViT-S/16 | 26.7 | 47.4 | 64.6 | +17.2 |
| MAE (MIM) | ViT-B/16 | 46.1 | 67.7 | 75.6 | +7.9 |
| SimMIM (MIM) | ViT-B/16 | 15.1 | 51.5 | 65.1 | +13.6 |
| DiT (Gen) | DiT-XL/2 | 8.3 | 32.7 | 57.0 | +24.3 |
| BEiTv2 (MIM) | ViT-B/16 | 74.8 | 79.0 | 81.7 | +2.7 |
| DINOv2 (Mix) | ViT-L/14 | 83.5 | 85.2 | 85.6 | +0.4 |
| CLIP (VLM) | ViT-L/14 | 77.2 | 82.3 | 83.4 | +1.1 |
| SigLIP (VLM) | ViT-L/16 | 83.7 | 84.1 | 86.1 | +2.0 |

Key Observation: Models that optimize local patch representations rather than an explicit global one benefit most from attentive probing (e.g., DiT +24.3). For models with strong global descriptors (DINO/JEA), the gain is marginal. EP also shifts relative rankings: MIM methods that initially seemed weaker under LP/k-NN (like MAE) outperform contrastive methods, challenging the notion that MIM representations are inherently "weaker."
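The gain column and the resulting ordering can be recomputed directly from the LP and EP numbers in the table. The snippet below is only a bookkeeping aid; the dictionary layout is an arbitrary choice and the values are copied from the table.

```python
# (pre-training, architecture) -> (LP, EP) Top-1 on ImageNet-1K, from the table
results = {
    ("MAE", "ViT-S/16"): (47.4, 64.6),
    ("MAE", "ViT-B/16"): (67.7, 75.6),
    ("SimMIM", "ViT-B/16"): (51.5, 65.1),
    ("DiT", "DiT-XL/2"): (32.7, 57.0),
    ("BEiTv2", "ViT-B/16"): (79.0, 81.7),
    ("DINOv2", "ViT-L/14"): (85.2, 85.6),
    ("CLIP", "ViT-L/14"): (82.3, 83.4),
    ("SigLIP", "ViT-L/16"): (84.1, 86.1),
}

# Gain = EP - LP, rounded to one decimal to match the table
gains = {k: round(ep - lp, 1) for k, (lp, ep) in results.items()}

# Largest gains first: models without a strong global token rise to the top
ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)

for (method, arch), gain in ranked:
    print(f"{method:8s} {arch:10s} +{gain:.1f}")
```

Sorting this way puts DiT, MAE ViT-S, and SimMIM at the top and DINOv2 at the bottom, which matches the observation about local versus global representations.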

Ablation Study

| Configuration | Key Metric (Top-1) | Note |
|---|---|---|
| EP12 (Full) | 75.1% | Same accuracy as AIM12, but 1.36M vs 1.95M params |
| Single-head w/o \(W_K\) | 71.8→71.7% | \(W_K\) absorbed by single query; negligible effect |
| Multi-head w/o \(W_K\) (Identity) | 75.1→72.9% | Queries limited to subspaces; significant drop |
| GAP + \(W_V\) | 66.7→68.0% | Steady gain from value transformation |
| EP12 w/o \(W_V\) | 75.1→72.1% | \(W_V\) is a critical, non-optional component |
| LoRA+EP (850K params) | 76.99% | Outperforms pure EP (75.58%) and all-layer LoRA (76.72%) |

Key Findings

  • Superior Efficiency: EP64 reaches a state-of-the-art 75.6% on MAE ViT-B with <1.4M parameters. EP48 (with \(D_o = D_i/8\)) reaches 70.3% with only ~200K parameters (about 4x fewer than linear probing).
  • Complementary to PEFT: EP outperforms single-layer LoRA, BitFit, and LayerNorm tuning in terms of parameter efficiency. LoRA+EP at 850K parameters reaches 76.99%, indicating that EP captures information that LoRA does not, and vice versa.
  • Correlation: Localization Quality ↔ Classification Accuracy: Predictors that focus more on the foreground object (lower entropy, centroid within GT box) contribute more to accuracy. EP tends to fixate on the object rather than "background shortcuts."
  • Complementary Attention Maps: Multiple queries in EP focus on distinct semantic parts (tail, beak, etc.), showing higher complementarity scores than MHSA, V-JEPA, or AIM.
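The two localization signals mentioned above (attention entropy and whether the attention centroid falls inside the ground-truth box) can be sketched as follows. The exact metric definitions used in the paper are not given here, so both functions, the box format, and the toy maps are assumptions made for illustration.

```python
import numpy as np

def attention_entropy(a):
    """Shannon entropy (in nats) of an l1-normalised attention vector."""
    a = np.asarray(a, dtype=float)
    nz = a[a > 0]                        # treat 0 * log 0 as 0
    return float(-(nz * np.log(nz)).sum())

def centroid_in_box(attn_map, box):
    """Check whether the attention centroid falls inside a box.

    attn_map : (H, W) non-negative map summing to 1
    box      : (row_min, col_min, row_max, col_max) in patch-grid units,
               inclusive; this box format is a hypothetical convention
    """
    H, W = attn_map.shape
    rows, cols = np.mgrid[0:H, 0:W]
    r = float((rows * attn_map).sum())   # attention-weighted mean row
    c = float((cols * attn_map).sum())   # attention-weighted mean column
    r0, c0, r1, c1 = box
    return (r0 <= r <= r1) and (c0 <= c <= c1)

H = W = 14                               # assumed 14 x 14 patch grid
uniform = np.full((H, W), 1.0 / (H * W)) # diffuse map: maximal entropy
peaked = np.zeros((H, W))
peaked[5, 6] = 1.0                       # all mass on one patch: zero entropy

h_uniform = attention_entropy(uniform.ravel())
h_peaked = attention_entropy(peaked.ravel())
inside = centroid_in_box(peaked, (4, 5, 7, 8))
```

A focused predictor yields low entropy and a centroid inside the object box, while a diffuse one approaches the uniform-map entropy of \(\log N\).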

Highlights & Insights

  • "Mathematical Equivalence → Free Slimming": EP's brilliance lies not in inventing new structures but in proving \(W_K\) is redundant via absorption, significantly reducing parameters without losing accuracy.
  • Recalibrated Evaluation Conclusions: Using the wrong probe (LP) systematically underestimates local-representation models, leading to misjudgments like "MIM is inferior to contrastive learning." EP corrects this bias.
  • Probes as Analysis Tools: The correlation between localization and accuracy, along with complementary attention maps, transforms probing from a scoring tool into a window for understanding representation interpretability.
  • Adjustable Budget Knob: The \(M\) and \(D_o\) hyperparameters allow a single method to cover the entire Pareto front from 200K to millions of parameters.

Limitations & Future Work

  • Limited to Image Classification + ViT: Experiments focused on patch-token ViT classification; generalization to dense tasks (segmentation/detection) or CNN backbones is not yet verified.
  • Linear Assumption: The "free equivalence" relies on linear projections. For variants with non-linearities (e.g., DELF's ReLU or V-JEPA's MLP blocks), this shortcut may not hold directly.
  • Correlation, Not Causation: The link between localization quality and accuracy is observed correlation; the paper does not prove that forcing focus causes higher accuracy.
  • Future Directions: Applying EP as a universal lightweight aggregator for dense downstream tasks or using localization/complementarity as explicit regularization terms during training.

Related Work & Comparison

  • vs. AIM / MHCA: EP's attention is mathematically equivalent to MHCA and matches AIM's accuracy while being more parameter-efficient (e.g., 1.36M vs 1.95M).
  • vs. AbMILP / DELF / SimPool: These are special cases of the framework with \(M=1\) and limited expressive power compared to EP's multi-query design.
  • vs. V-JEPA / CaiT / ViT Block: These methods use significantly more parameters but provide marginal gains compared to EP.
  • vs. LoRA / BitFit / PEFT: PEFT adapts the backbone (task-adaptive), whereas EP is a representation-preserving probe; they are complementary rather than redundant.
  • vs. Slot Attention: EP is a minimalist version of slot attention (single iteration, no LayerNorm/GRU/MLP) where learnable queries compensate for the lack of iterative refinement.

Rating

  • Novelty: ⭐⭐⭐⭐ (Solidified via the "equivalence implies redundancy" perspective and unified framework)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Broad coverage of 5 paradigms, 7 datasets, and PEFT benchmarks)
  • Writing Quality: ⭐⭐⭐⭐⭐ (Clear derivation, effective tables, and insightful conclusions)
  • Value: ⭐⭐⭐⭐⭐ (Provides a lightweight standard for the community and corrects systematic evaluation biases)