SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning¶
- Conference: ICCV 2025
- arXiv: 2503.10252
- Code: https://github.com/uqzhichen/SVIP
- Area: Self-Supervised Learning / Zero-Shot Learning
- Keywords: zero-shot learning, semantic alignment, ViT, patch selection, attribute localization
TL;DR¶
This paper proposes the SVIP framework, which addresses semantic misalignment in zero-shot learning at its source by identifying and replacing semantically irrelevant image patches at the input stage with learnable embeddings initialized from attribute-level word embeddings.
Background & Motivation¶
Zero-shot learning (ZSL) relies on aligning visual features with semantic attributes to recognize unseen categories. The core challenge is semantic misalignment: raw images contain large amounts of attribute-irrelevant information (background clutter, illumination variation, surrounding objects) that dilutes critical attributes (color, shape, etc.), degrading model performance on unseen classes.
Limitations of existing approaches:

- Feature-space disentanglement methods (e.g., RFF, SDGZSL, FREE) remove semantically irrelevant information after feature extraction, by which point the noise is already fused into the representation and cannot be fully eliminated.
- Progressive model-space pruning (e.g., ZSLViT) prunes irrelevant tokens layer by layer within the Transformer, but semantic features become diluted in deeper ViT layers, so late-stage pruning can come too late.
Core problem: Can semantically irrelevant patches be identified and handled at the input stage, so they never enter the feature extraction pipeline in the first place?
A key observation is that attention weights of different patches change dynamically across layers (as shown in Figure 3), making single-layer attention an unreliable indicator of semantic relevance. A global perspective aggregated across all layers is therefore required.
Method¶
Overall Architecture¶
SVIP consists of three components:

1. Self-Supervised Patch Selection (SSPS): trains a patch classifier using pseudo-labels obtained by aggregating attention scores across all layers.
2. Patch Semantic Contextualization (PSC): contextualizes semantically irrelevant patches with learnable embeddings initialized from attribute word embeddings.
3. Attribute Localization: localizes attribute values from the semantically relevant patches for classification.
Key Designs¶
- Self-Supervised Patch Selection (SSPS):
  - Attention matrix aggregation: Let \(\mathbf{T}^l\) denote the head-summed attention matrix at layer \(l\), aggregated recursively as \(\mathbf{W}^l = \mathbf{W}^{l-1} + \mathbf{W}^{l-1} \times \mathbf{T}^l, \quad l=1,\cdots,L\). The class-token row of the final matrix yields a semantic score \(r_i = \mathbf{W}^L_{[0,i]}\) for each patch, used as a pseudo-label.
  - Patch classifier: An auxiliary classifier predicts the semantic score of each patch embedding, \(\hat{r}_i = \text{PatchCls}(\mathbf{v}_i)\), and is trained with a binary cross-entropy loss: \(\ell_{\text{patch}} = -\frac{1}{N}\sum_{i=1}^{N}[r_i \log \hat{r}_i + (1-r_i)\log(1-\hat{r}_i)]\).
  - At test time, the patch classifier alone determines which patches are semantically irrelevant, without first running the Transformer to aggregate attention (see the first sketch after this list).
- Patch Semantic Contextualization (PSC): After selecting the top-\(M\) semantically relevant patches, the remaining (irrelevant) patches are not simply discarded, which would damage the object structure; instead, a learnable embedding \(\mathbf{e}\) is added to them (see the second sketch after this list):
\(\hat{\mathbf{v}}_i = \begin{cases} \mathbf{v}_i, & \text{if } i \in \mathcal{S}_M \\ \mathbf{v}_i + \mathbf{e}, & \text{otherwise} \end{cases}\)
\(\mathbf{e}\) is aggregated from the word embeddings of \(K\) attributes via a Word-to-Patch (W2P) projection: \(\mathbf{e} = \text{W2P}(\mathbf{w}_1, \cdots, \mathbf{w}_K)\). This ensures that semantically irrelevant patches carry attribute-level semantic information, thereby enhancing semantic–visual interaction in the subsequent Transformer layers.
- Attribute Localization: A Patch-to-Attribute (P2A) projection maps the final representations of the top-\(M\) semantically relevant patches to the attribute space, with max pooling selecting the most relevant patch per attribute: \(\hat{\mathbf{a}} = \text{MaxPool}(\text{P2A}(\mathbf{Z}^{L'}))\). Classification is performed via cosine similarity between the predicted attribute vector and the class-level attribute vectors, followed by Softmax.
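Below is a minimal PyTorch-style sketch of the SSPS scoring step, assuming head-summed per-layer attention maps of shape (B, N+1, N+1) with the class token at index 0. How the recursion is seeded, the rescaling of pseudo-labels to [0, 1], and all names here are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

def aggregate_attention(attn_maps):
    """Cross-layer attention aggregation producing SSPS pseudo-labels.

    attn_maps: list of L head-summed attention maps, each (B, N+1, N+1),
    with the class token at index 0. Returns per-patch scores r of shape (B, N).
    """
    W = attn_maps[0]                             # assumption: recursion seeded with layer 1
    for T in attn_maps[1:]:
        W = W + torch.bmm(W, T)                  # W^l = W^{l-1} + W^{l-1} x T^l
    r = W[:, 0, 1:]                              # class-token row -> one score per patch
    r_min = r.min(dim=-1, keepdim=True).values
    r_max = r.max(dim=-1, keepdim=True).values
    return (r - r_min) / (r_max - r_min + 1e-6)  # assumption: rescale to [0, 1] as BCE targets

class PatchCls(nn.Module):
    """Auxiliary classifier predicting each patch embedding's semantic score."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, v):                        # v: (B, N, dim) input-stage patch embeddings
        return torch.sigmoid(self.net(v)).squeeze(-1)  # (B, N) predicted scores r_hat

# L_patch: binary cross-entropy between the predicted scores and the pseudo-labels
# loss_patch = nn.functional.binary_cross_entropy(patch_cls(v), r.detach())
```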
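A companion sketch of PSC and attribute localization under the same assumptions: W2P and P2A are taken to be simple linear projections, the \(K\) projected attribute word embeddings are averaged into a single \(\mathbf{e}\), and the temperature placement follows common attribute-based ZSL practice; none of these details are taken from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticContextualizer(nn.Module):
    """PSC: add an embedding e, derived from K attribute word embeddings via a
    Word-to-Patch (W2P) projection, to the patches outside the top-M relevant set."""
    def __init__(self, word_emb, patch_dim):      # word_emb: (K, d_w) attribute word vectors
        super().__init__()
        self.register_buffer("word_emb", word_emb)
        self.w2p = nn.Linear(word_emb.shape[1], patch_dim)  # assumed form of the W2P projection

    def forward(self, v, scores, M):              # v: (B, N, D) patches, scores: (B, N)
        e = self.w2p(self.word_emb).mean(dim=0)   # (D,): aggregate the K attribute embeddings
        top_m = scores.topk(M, dim=-1).indices    # indices of semantically relevant patches
        keep = torch.zeros_like(scores, dtype=torch.bool)
        keep.scatter_(1, top_m, True)
        v_ctx = torch.where(keep.unsqueeze(-1), v, v + e)   # add e only to irrelevant patches
        return v_ctx, keep

def attribute_logits(z_relevant, p2a, class_attrs, sigma=5.0):
    """Attribute localization and classification.
    z_relevant: (B, M, D) final-layer representations of the top-M patches.
    p2a: nn.Linear(D, K) Patch-to-Attribute projection.  class_attrs: (C, K)."""
    a_hat = p2a(z_relevant).max(dim=1).values     # (B, K): max-pool over patches per attribute
    sims = F.cosine_similarity(a_hat.unsqueeze(1), class_attrs.unsqueeze(0), dim=-1)  # (B, C)
    return sigma * sims                           # temperature-scaled logits for Softmax / CE
```

Adding \(\mathbf{e}\) instead of dropping the irrelevant patches keeps the token count and spatial layout intact, which is the point of the replace-rather-than-remove design.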
Loss & Training¶
The model performs two forward passes per sample: one on the original patch sequence \(\mathbf{Z}^0\) and one on the contextualized sequence \(\mathbf{Z}^{0'}\).
Overall loss: \(\ell_{\text{overall}} = \ell_{\text{cls}} + \lambda_1 \ell_{\text{JSD}} + \lambda_2 \ell_{\text{patch}}\)
- \(\ell_{\text{cls}}\): sum of cross-entropy classification losses from both forward passes
- \(\ell_{\text{JSD}}\): Jensen–Shannon divergence between the two predicted distributions (for training stability)
- \(\ell_{\text{patch}}\): patch classification loss
The backbone is ViT-base (ImageNet-1k pretrained); the 196 patches are pooled into 49 via 2×2 merging.
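A hedged sketch of how the overall objective could be assembled from the two forward passes, assuming the model returns class logits for both the original and the contextualized patch sequences; the JSD formulation and the default weights here are illustrative.

```python
import torch.nn.functional as F

def jsd(p_logits, q_logits):
    """Jensen-Shannon divergence between the two predicted class distributions."""
    p, q = F.softmax(p_logits, dim=-1), F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    return 0.5 * (F.kl_div(m.log(), p, reduction="batchmean")
                  + F.kl_div(m.log(), q, reduction="batchmean"))

def overall_loss(logits_orig, logits_ctx, labels, r_hat, r, lam1=1.0, lam2=1.0):
    """l_overall = l_cls + lam1 * l_JSD + lam2 * l_patch, with l_cls summed over
    the original-sequence and contextualized-sequence passes."""
    l_cls = F.cross_entropy(logits_orig, labels) + F.cross_entropy(logits_ctx, labels)
    l_jsd = jsd(logits_orig, logits_ctx)
    l_patch = F.binary_cross_entropy(r_hat, r)    # patch classifier vs. attention pseudo-labels
    return l_cls + lam1 * l_jsd + lam2 * l_patch
```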
Key Experimental Results¶
Main Results¶
| Method | Backbone | CUB T1 | CUB H | AwA2 T1 | AwA2 H | SUN T1 | SUN H |
|---|---|---|---|---|---|---|---|
| MSDN (CVPR'22) | ResNet101 | 76.1 | 68.1 | 70.1 | 67.7 | 65.8 | 41.3 |
| DUET (AAAI'23) | ViT | 72.3 | 67.5 | 69.9 | 72.7 | 64.4 | 45.8 |
| ZSLViT (CVPR'24) | ViT | 78.9 | 73.6 | 70.7 | 74.2 | 68.3 | 47.3 |
| SVIP (Ours) | ViT | 79.8 | 75.0 | 69.8 | 74.9 | 71.6 | 50.7 |
SVIP achieves the state-of-the-art GZSL harmonic mean (H) on all three benchmarks, improving over the best prior result by +1.4 on CUB, +0.7 on AwA2, and +3.4 on SUN.
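For reference, H here is the standard GZSL harmonic mean of seen- and unseen-class accuracies, \(H = \frac{2 \cdot \text{Acc}_S \cdot \text{Acc}_U}{\text{Acc}_S + \text{Acc}_U}\), so it rewards methods that perform well on both seen and unseen classes simultaneously.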
Ablation Study¶
| Method | CUB T1 | CUB H | AwA2 T1 | AwA2 H | SUN T1 | SUN H |
|---|---|---|---|---|---|---|
| Baseline (ViT w/ att head) | 76.8 | 63.8 | 61.4 | 67.8 | 62.7 | 36.0 |
| SVIP w/o SSPS | 78.9 | 71.9 | 66.8 | 72.6 | 67.6 | 47.3 |
| SVIP w/o PSC | 78.1 | 72.6 | 67.6 | 72.4 | 67.9 | 48.0 |
| SVIP w/o JSD | 79.5 | 74.9 | 69.1 | 74.5 | 71.2 | 50.4 |
| SVIP w/o W2P | 79.1 | 74.5 | 69.8 | 74.4 | 71.5 | 50.1 |
| SVIP (full) | 79.8 | 75.0 | 69.8 | 74.9 | 71.6 | 50.7 |
SSPS and PSC are the most critical components: removing SSPS reduces CUB H by 3.1 and SUN H by 3.4; W2P initialization from word embeddings outperforms random initialization.
Key Findings¶
- \(M=40\) retained patches (out of 49 total) yields the best performance; excessive pruning risks discarding semantically relevant patches.
- The JSD loss weight is a sensitive hyperparameter; the best-performing settings are 1 for the JSD weight \(\lambda_1\) and 5 for the temperature \(\sigma\).
- t-SNE visualizations show that SVIP produces tighter within-class attribute clusters and clearer inter-class separation.
- Intermediate features of the patch classifier show that semantically irrelevant patches naturally cluster in specific regions of the feature space.
Highlights & Insights¶
- "Prevention over cure" philosophy: Handling semantic noise at the input stage is more thorough than post-hoc processing in feature space—a perspective that is novel in the ZSL literature.
- Replace rather than remove: Substituting irrelevant patches with semantic embeddings preserves object structure while turning these positions into "semantic enhancement channels."
- Self-supervised cross-layer attention aggregation: Cleverly leverages ViT's own attention as free supervision, requiring no additional annotation.
- The dual forward pass + JSD stabilization design enforces consistency between contextualized and original patch representations.
Limitations & Future Work¶
- The dual forward pass increases training time.
- The patch retention threshold \(M\) requires manual tuning and may need to vary across datasets.
- Validation is limited to embedding-based ZSL with a ViT-Base backbone; larger models and stronger vision-language backbones such as CLIP have not been explored.
- Word embedding initialization depends on GloVe quality and may be suboptimal for rare attribute names.
- Dynamic \(M\) values (retaining different numbers of patches per image) have not been investigated.
Related Work & Insights¶
- SVIP is complementary to ZSLViT's progressive model-space pruning: SVIP operates at the input stage, while ZSLViT targets intermediate layers.
- The self-supervised patch selection idea is transferable to other vision tasks that require input filtering (e.g., fine-grained recognition, object detection).
- The method of injecting semantic information into visual tokens in PSC resembles visual prompt tuning, though the objectives differ.
Rating¶
- Novelty: ⭐⭐⭐⭐ (novel approach of addressing semantic misalignment at the input stage with learnable semantic patches)
- Experimental Thoroughness: ⭐⭐⭐⭐ (three standard benchmarks + comprehensive ablations + hyperparameter sensitivity analysis)
- Writing Quality: ⭐⭐⭐⭐ (clear method description; pseudocode is helpful)
- Value: ⭐⭐⭐⭐ (effective contribution to ZSL with generalizable ideas)