
SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning

Conference: ICCV 2025
arXiv: 2503.10252
Code: https://github.com/uqzhichen/SVIP
Area: Self-Supervised Learning / Zero-Shot Learning
Keywords: zero-shot learning, semantic alignment, ViT, patch selection, attribute localization

TL;DR

This paper proposes the SVIP framework, which addresses semantic misalignment in zero-shot learning at its source by identifying and replacing semantically irrelevant image patches at the input stage with learnable embeddings initialized from attribute-level word embeddings.

Background & Motivation

Zero-shot learning (ZSL) relies on aligning visual features with semantic attributes to recognize unseen categories. The core challenge is semantic misalignment: raw images contain large amounts of attribute-irrelevant information (background clutter, illumination variation, surrounding objects) that dilutes critical attributes (color, shape, etc.), degrading model performance on unseen classes.

Limitations of existing approaches:

  • Feature-space disentanglement methods (e.g., RFF, SDGZSL, FREE): remove semantically irrelevant information after feature extraction, at which point noise is already fused into the representation and cannot be fully eliminated.
  • Progressive model-space pruning (e.g., ZSLViT): prunes irrelevant tokens layer by layer within the Transformer, but research shows semantic features are diluted in deeper ViT layers, making late-stage pruning potentially too late.

Core problem: Can semantically irrelevant patches be identified and handled at the input stage, so they never enter the feature extraction pipeline in the first place?

A key observation is that attention weights of different patches change dynamically across layers (as shown in Figure 3), making single-layer attention an unreliable indicator of semantic relevance. A global perspective aggregated across all layers is therefore required.

Method

Overall Architecture

SVIP consists of three components: (1) Self-Supervised Patch Selection (SSPS): trains a patch classifier by aggregating attention scores across all layers; (2) Patch Semantic Contextualization (PSC): replaces semantically irrelevant patches with learnable embeddings initialized from attribute word embeddings; (3) Attribute Localization: localizes attribute values from semantically relevant patches for classification.

Key Designs

  1. Self-Supervised Patch Selection (SSPS):

Attention matrix aggregation: Let \(\mathbf{T}^l\) denote the attention matrix at layer \(l\) (summed over heads). Starting from \(\mathbf{W}^0 = \mathbf{I}\), the per-layer matrices are aggregated recursively: \(\mathbf{W}^l = \mathbf{W}^{l-1} + \mathbf{W}^{l-1} \mathbf{T}^l, \quad l=1,\cdots,L\). The class-token row of \(\mathbf{W}^L\) then yields a semantic score \(r_i = \mathbf{W}^L_{[0,i]}\) for each patch \(i\), used as a pseudo-label.
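A minimal PyTorch sketch of this aggregation; the min-max normalization of the scores to \([0,1]\), so they can serve as BCE pseudo-labels, is an added assumption:

```python
import torch

def aggregate_attention(attn_maps):
    """Cross-layer attention aggregation for SSPS pseudo-labels.

    attn_maps: list of L per-layer attention tensors, each (B, H, N+1, N+1),
    with the class token at index 0. Returns scores r of shape (B, N).
    """
    # T^l: sum each layer's attention over its heads -> (B, N+1, N+1)
    T = [a.sum(dim=1) for a in attn_maps]
    B, n, _ = T[0].shape
    # Base case W^0 = I (see the recursion above)
    W = torch.eye(n, device=T[0].device).unsqueeze(0).repeat(B, 1, 1)
    for Tl in T:
        W = W + torch.bmm(W, Tl)  # W^l = W^{l-1} + W^{l-1} T^l
    r = W[:, 0, 1:]  # class-token row -> one semantic score per patch
    # Min-max normalize to [0, 1] so r can act as a BCE pseudo-label (assumption)
    rmin = r.min(dim=1, keepdim=True).values
    rmax = r.max(dim=1, keepdim=True).values
    return (r - rmin) / (rmax - rmin + 1e-8)
```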

Patch classifier: An auxiliary classifier predicts the semantic score of each patch embedding \(\hat{r}_i = \text{PatchCls}(\mathbf{v}_i)\), trained with binary cross-entropy loss: \(\mathcal{L}_{\text{patch}} = -\frac{1}{N}\sum_{i=1}^{N}[r_i \log \hat{r}_i + (1-r_i)\log(1-\hat{r}_i)]\)

At test time, the classifier directly determines which patches are semantically irrelevant without requiring a full Transformer forward pass.
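A sketch of the auxiliary classifier; the two-layer MLP, its hidden width, and the sigmoid output are illustrative assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchCls(nn.Module):
    """Auxiliary patch classifier: patch embedding -> relevance score in (0, 1)."""

    def __init__(self, dim=768, hidden=256):
        super().__init__()
        # Two-layer MLP head (assumed architecture)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, v):  # v: (B, N, dim) patch embeddings
        return torch.sigmoid(self.mlp(v)).squeeze(-1)  # (B, N) predicted scores

# Training against the aggregated attention scores r as pseudo-labels:
#   loss_patch = F.binary_cross_entropy(patch_cls(v), r)
```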

  2. Patch Semantic Contextualization (PSC): After selecting the top-\(M\) semantically relevant patches, rather than simply discarding the remaining (irrelevant) patches, which would damage object structure, a learnable embedding \(\mathbf{e}\) is added to them:

    \(\hat{\mathbf{v}}_i = \begin{cases} \mathbf{v}_i, & \text{if } i \in \mathcal{S}_M \\ \mathbf{v}_i + \mathbf{e}, & \text{otherwise} \end{cases}\)

\(\mathbf{e}\) is aggregated from the word embeddings of \(K\) attributes via a Word-to-Patch (W2P) projection: \(\mathbf{e} = \text{W2P}(\mathbf{w}_1, \cdots, \mathbf{w}_K)\). This ensures that semantically irrelevant patches carry attribute-level semantic information, thereby enhancing semantic–visual interaction in the subsequent Transformer layers.
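A sketch of PSC under stated assumptions: hard top-\(M\) selection from the patch scores, and a mean-then-project instantiation of the W2P aggregation:

```python
import torch
import torch.nn as nn

class W2P(nn.Module):
    """Word-to-Patch projection: K attribute word embeddings -> one embedding e.
    Mean pooling followed by a linear projection (assumed instantiation)."""

    def __init__(self, word_dim=300, patch_dim=768):  # 300 = GloVe dimension
        super().__init__()
        self.proj = nn.Linear(word_dim, patch_dim)

    def forward(self, w):  # w: (K, word_dim) attribute word embeddings
        return self.proj(w.mean(dim=0))  # (patch_dim,)

def contextualize(v, scores, e, M=40):
    """Keep the top-M relevant patches unchanged; add e to every other patch."""
    B, N, _ = v.shape
    keep = torch.zeros(B, N, dtype=torch.bool, device=v.device)
    keep.scatter_(1, scores.topk(M, dim=1).indices, True)  # mark the set S_M
    return torch.where(keep.unsqueeze(-1), v, v + e)  # e broadcasts over (B, N, D)
```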

  3. Attribute Localization: A Patch-to-Attribute (P2A) projection maps the final representations of the top-\(M\) semantically relevant patches to the attribute space, with max pooling selecting the most relevant patch per attribute: \(\hat{\mathbf{a}} = \text{MaxPool}(\text{P2A}(\mathbf{Z}^{L'}))\). Classification is performed via cosine similarity between the predicted attribute vector and class-level attribute vectors, followed by Softmax.
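A sketch of the localization head; the single linear P2A layer is an assumption, while the attribute count (312 for CUB) and the temperature \(\sigma = 5\) follow the standard CUB setup and the sensitivity analysis below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeLocalizer(nn.Module):
    """P2A projection + per-attribute max pooling over patches, then
    cosine-similarity classification against class attribute vectors."""

    def __init__(self, dim=768, num_attrs=312):  # 312 = CUB attribute count
        super().__init__()
        self.p2a = nn.Linear(dim, num_attrs)  # assumed single linear layer

    def forward(self, z, class_attrs, sigma=5.0):
        # z: (B, M, dim) final representations of the top-M relevant patches
        # class_attrs: (C, num_attrs) class-level attribute vectors
        a_hat = self.p2a(z).max(dim=1).values  # (B, num_attrs): best patch per attribute
        logits = sigma * F.cosine_similarity(
            a_hat.unsqueeze(1), class_attrs.unsqueeze(0), dim=-1)  # (B, C)
        return logits.softmax(dim=-1)
```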

Loss & Training

The model performs two forward passes per sample: one on the original patch sequence \(\mathbf{Z}^0\) and one on the contextualized sequence \(\mathbf{Z}^{0'}\).

Overall loss: \(\mathcal{L}_{\text{overall}} = \mathcal{L}_{\text{cls}} + \lambda_1 \mathcal{L}_{\text{JSD}} + \lambda_2 \mathcal{L}_{\text{patch}}\) (a minimal sketch follows the component list below)

  • \(\mathcal{L}_{\text{cls}}\): sum of cross-entropy classification losses from both forward passes
  • \(\mathcal{L}_{\text{JSD}}\): Jensen–Shannon divergence between the two predicted distributions (for training stability)
  • \(\mathcal{L}_{\text{patch}}\): patch classification loss
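A minimal sketch of the objective; \(\lambda_1 = 1\) matches the reported optimum, while \(\lambda_2 = 1\) is a placeholder assumption:

```python
import torch.nn.functional as F

def jsd(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two batches of distributions (B, C)."""
    m = (0.5 * (p + q)).clamp_min(eps)
    return 0.5 * (F.kl_div(m.log(), p, reduction="batchmean")
                  + F.kl_div(m.log(), q, reduction="batchmean"))

def overall_loss(logits_orig, logits_ctx, labels, loss_patch, lam1=1.0, lam2=1.0):
    """L_overall = L_cls + lam1 * L_JSD + lam2 * L_patch."""
    loss_cls = (F.cross_entropy(logits_orig, labels)
                + F.cross_entropy(logits_ctx, labels))  # both forward passes
    loss_jsd = jsd(logits_orig.softmax(-1), logits_ctx.softmax(-1))
    return loss_cls + lam1 * loss_jsd + lam2 * loss_patch
```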

The backbone is ViT-base (ImageNet-1k pretrained); the 196 patches are pooled into 49 via 2×2 merging.
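The merging operator is not specified in this summary; a sketch assuming average pooling over non-overlapping 2×2 windows of the 14×14 token grid:

```python
import torch

def merge_patches(v, grid=14):
    """2x2 token merging: (B, 196, D) -> (B, 49, D) for ViT-base."""
    B, N, D = v.shape  # N = grid * grid = 196
    v = v.view(B, grid, grid, D)  # restore the 2-D token grid
    v = v.view(B, grid // 2, 2, grid // 2, 2, D)  # carve into 2x2 windows
    v = v.mean(dim=(2, 4))  # average-pool within each window (assumption)
    return v.reshape(B, (grid // 2) ** 2, D)  # back to a token sequence
```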

Key Experimental Results

Main Results

T1 denotes conventional ZSL top-1 accuracy on unseen classes and H the GZSL harmonic mean of seen/unseen accuracies (both in %).

| Method | Backbone | CUB T1 | CUB H | AwA2 T1 | AwA2 H | SUN T1 | SUN H |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MSDN (CVPR'22) | ResNet101 | 76.1 | 68.1 | 70.1 | 67.7 | 65.8 | 41.3 |
| DUET (AAAI'23) | ViT | 72.3 | 67.5 | 69.9 | 72.7 | 64.4 | 45.8 |
| ZSLViT (CVPR'24) | ViT | 78.9 | 73.6 | 70.7 | 74.2 | 68.3 | 47.3 |
| SVIP (Ours) | ViT | 79.8 | 75.0 | 69.8 | 74.9 | 71.6 | 50.7 |

Achieves state-of-the-art GZSL harmonic mean (H) across all three benchmarks: CUB +1.4, AwA2 +0.7, SUN +3.4.

Ablation Study

| Method | CUB T1 | CUB H | AwA2 T1 | AwA2 H | SUN T1 | SUN H |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline (ViT w/ att head) | 76.8 | 63.8 | 61.4 | 67.8 | 62.7 | 36.0 |
| SVIP w/o SSPS | 78.9 | 71.9 | 66.8 | 72.6 | 67.6 | 47.3 |
| SVIP w/o PSC | 78.1 | 72.6 | 67.6 | 72.4 | 67.9 | 48.0 |
| SVIP w/o JSD | 79.5 | 74.9 | 69.1 | 74.5 | 71.2 | 50.4 |
| SVIP w/o W2P | 79.1 | 74.5 | 69.8 | 74.4 | 71.5 | 50.1 |
| SVIP (full) | 79.8 | 75.0 | 69.8 | 74.9 | 71.6 | 50.7 |

SSPS and PSC are the most critical components: removing SSPS reduces CUB H by 3.1 and SUN H by 3.4; W2P initialization from word embeddings outperforms random initialization.

Key Findings

  • \(M=40\) retained patches (out of 49 total) yields the best performance; excessive pruning risks discarding semantically relevant patches.
  • The JSD weight is a sensitive hyperparameter; the optimal values are 1 for the JSD weight \(\lambda_1\) and 5 for the temperature \(\sigma\).
  • t-SNE visualizations show that SVIP produces tighter within-class attribute clusters and clearer inter-class separation.
  • Intermediate features of the patch classifier show that semantically irrelevant patches naturally cluster in specific regions of the feature space.

Highlights & Insights

  • "Prevention over cure" philosophy: Handling semantic noise at the input stage is more thorough than post-hoc processing in feature space—a perspective that is novel in the ZSL literature.
  • Replace rather than remove: Substituting irrelevant patches with semantic embeddings preserves object structure while turning these positions into "semantic enhancement channels."
  • Self-supervised cross-layer attention aggregation: Cleverly leverages ViT's own attention as free supervision, requiring no additional annotation.
  • The dual forward pass + JSD stabilization design enforces consistency between contextualized and original patch representations.

Limitations & Future Work

  • The dual forward pass increases training time.
  • The patch retention threshold \(M\) requires manual tuning and may need to vary across datasets.
  • Validation is limited to an embedding-based setting with a ViT-base backbone; larger models or stronger backbones such as CLIP have not been explored.
  • Word embedding initialization depends on GloVe quality and may be suboptimal for rare attribute names.
  • Dynamic \(M\) values (retaining different numbers of patches per image) have not been investigated.
  • SVIP is complementary to ZSLViT's progressive model-space pruning: SVIP operates at the input stage, while ZSLViT targets intermediate layers.
  • The self-supervised patch selection idea is transferable to other vision tasks that require input filtering (e.g., fine-grained recognition, object detection).
  • The method of injecting semantic information into visual tokens in PSC resembles visual prompt tuning, though the objectives differ.

Rating

  • Novelty: ⭐⭐⭐⭐ (novel approach of addressing semantic misalignment at the input stage with learnable semantic patches)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (three standard benchmarks + comprehensive ablations + hyperparameter sensitivity analysis)
  • Writing Quality: ⭐⭐⭐⭐ (clear method description; pseudocode is helpful)
  • Value: ⭐⭐⭐⭐ (effective contribution to ZSL with generalizable ideas)