Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal Prompt Learning¶

Conference: ECCV2024
arXiv: 2407.04207
Code: Project Page
Area: Multimodal VLM
Keywords: CLIP, Sketch-Based Image Retrieval (SBIR), Multimodal Prompt Learning, Zero-Shot Learning, Cross-Modal Alignment

TL;DR¶

SpLIP is a bidirectional multimodal prompt learning framework built on a frozen CLIP. It combines bidirectional knowledge exchange between the vision and text encoders, an adaptive-margin triplet loss, and a conditional cross-modal jigsaw task, and achieves SOTA performance across three sketch retrieval settings: ZS-SBIR, GZS-SBIR, and FG-ZS-SBIR.

Background & Motivation¶

Task Definition¶

Sketch-Based Image Retrieval (SBIR) aims to retrieve corresponding photos from a gallery based on hand-drawn sketches. The core challenge lies in the huge domain gap between sketches and photos. This task has three main settings:

ZS-SBIR: The testing categories are completely unseen during training.
GZS-SBIR: During inference, the gallery contains both seen and unseen category photos, which easily biases the model towards seen classes.
FG-ZS-SBIR: Fine-grained matching is performed at the instance level rather than the category level.

Limitations of Prior Work¶

Unimodal Prompt Learning: Existing CLIP-based SBIR methods (such as CLIP-AT, TLT) mainly use unimodal (visual) prompts, failing to fully exploit the collaborative capacity of CLIP's dual vision-text pathways.

Limitations of Multimodal Prompts: Existing multimodal prompt methods (like MaPLe, PromptSRC) only support unidirectional token sharing, making the text pathway insensitive to visual information and limiting semantic depth.

Coarse Jigsaw Strategies: The patch shuffling strategy in CLIP-AT applies the same permutation to positive pairs and different permutations to negative pairs for triplet learning, which fails to effectively establish local-to-global correspondences.

Motivation¶

A bidirectional cross-modal knowledge exchange mechanism is required to build tighter synergy between CLIP's vision and text encoders, while introducing a conditional cross-modal jigsaw task to enhance fine-grained alignment.

Method¶

Overall Architecture¶

SpLIP is based on frozen CLIP (ViT-B/32) and consists of the following core components:

Vision-guided Textual Prompting
Text-guided Visual Prompting
Conditional Cross-modal Jigsaw Solver
Adaptive Margin Triplet Loss

During training, only the LayerNorm parameters, the three mapping modules (\(\mathcal{B}_t, \mathcal{B}_v, \mathcal{B}_{vt}\)), and the jigsaw solver \(\mathcal{F}_{js}\) are optimized; all other CLIP backbone weights remain frozen.
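The trainable/frozen split can be sketched as a predicate over parameter names. This is a minimal illustration, not the SpLIP code: the names below (`ln_1`, `ln_final`, `B_vt.proj.weight`, `F_js.classifier.weight`) and the `ln_` naming convention are assumptions made for the example. In a PyTorch model, the same predicate could be applied while iterating over `named_parameters()` to set each parameter's `requires_grad` flag.

```python
# Hypothetical prefixes for the added modules (assumed names, for illustration only).
ADDED_PREFIXES = ("B_t.", "B_v.", "B_vt.", "F_js.")


def is_trainable(name: str) -> bool:
    """Return True for parameters that stay trainable under the described recipe."""
    # Assumption: LayerNorm parameters contain a name segment starting with "ln_".
    is_layernorm = any(part.startswith("ln_") for part in name.split("."))
    return is_layernorm or name.startswith(ADDED_PREFIXES)


# Illustrative parameter names; a real checkpoint may use different ones.
param_names = [
    "visual.transformer.resblocks.0.attn.in_proj_weight",
    "visual.transformer.resblocks.0.ln_1.weight",
    "transformer.resblocks.0.mlp.c_fc.weight",
    "ln_final.bias",
    "B_vt.proj.weight",
    "F_js.classifier.weight",
]

trainable = [n for n in param_names if is_trainable(n)]
frozen = [n for n in param_names if not is_trainable(n)]
```

With these toy names, the attention and MLP weights land in `frozen`, while the LayerNorm parameters and the added modules land in `trainable`.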

Key Designs¶

Vision \(\rightarrow\) Text Direction: The image patch embeddings \(\mathbf{E}_0\) are converted into \(m=4\) learnable text tokens \(\mathbf{T}\) via the mapping block \(\mathcal{B}_t\), which are injected into every layer of the text encoder \(\mathcal{F}_t\). Distinct from the random initialization in methods like MaPLe, this textual prompt directly captures the visual distribution.

Text \(\rightarrow\) Vision Direction (Dual Channels):
Channel 1: Maps the textual tokens ("sketch/photo of a", excluding [CLS]) to \(\mathcal{J}-1\) visual prompt tokens \(\mathbf{V}^{\text{tg}}\) via \(\mathcal{B}_v\); these remain identical across all layers.
Channel 2: Converts the layer output \(\mathbf{W}_l\) of the text encoder (containing tokens of all training classes) into \(n=2\) layer-wise varying visual prompts \(\mathbf{V}^{\text{ms}}\) via \(\mathcal{B}_{vt}\).

Key Innovation: \(\mathbf{V}^{\text{ms}}\) aggregates textual information across all training categories (category-agnostic). This cross-class knowledge sharing effectively bridges the semantic discrepancy within the embedding space.

Conditional Cross-modal Jigsaw: Given an anchor sketch \(s_a\) and its patch-permuted version \(s_a'\), a positive photo \(p_a^+\) (same class/instance), and a negative photo \(p_a^-\):

Construct fused features: \(r = [\mathcal{F}_v(s_a), \mathcal{F}_v(s_a')]\), \(r^+ = [\mathcal{F}_v(p_a^+), \mathcal{F}_v(s_a')]\), \(r^- = [\mathcal{F}_v(p_a^-), \mathcal{F}_v(s_a')]\)
The jigsaw solver \(\mathcal{F}_{js}\) (consisting of a 2-layer Transformer encoder + classifier) is required to predict the permutation index.
Core Idea: A positive photo should facilitate solving the sketch's jigsaw task more effectively than a negative photo, as positive photos share the same spatial layouts with the sketch.

Difference from CLIP-AT: CLIP-AT enforces positive pairs to share the same permutations and negative pairs to use different permutations, whereas SpLIP pairs the permuted sketch with the unpermuted positive photo, thereby capturing local-to-global correspondences more effectively.
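How a permuted sketch \(s_a'\) and its permutation-index label might be produced can be sketched as follows. The 2x2 grid, the permutation-set size of 6, and the function names are assumptions for illustration; the notes above do not state the grid size or \(|\mathcal{Y}^{\text{perm}}|\) used in the paper.

```python
import random
from itertools import permutations


def make_permutation_set(num_patches, set_size, seed=0):
    """Sample a fixed set of distinct patch permutations (the solver's label space)."""
    rng = random.Random(seed)
    all_perms = list(permutations(range(num_patches)))
    return rng.sample(all_perms, set_size)


def permute_patches(patches, perm):
    """Reorder patches so that position i holds the patch originally at perm[i]."""
    return [patches[j] for j in perm]


# Toy 2x2 grid: each "patch" is a label standing in for pixel data.
patches = ["top-left", "top-right", "bottom-left", "bottom-right"]
perm_set = make_permutation_set(num_patches=4, set_size=6)

target_index = 3  # the class label the jigsaw solver would have to predict
shuffled = permute_patches(patches, perm_set[target_index])
```

The solver then receives features of the shuffled sketch together with features of an unshuffled sketch or photo, and is trained to classify `target_index`.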

Key Design 3: Adaptive Margin Triplet Loss¶

Traditional triplet loss uses a fixed margin, whereas SpLIP dynamically computes the margin leveraging class name embeddings from the CLIP text encoder:

\[\mu(c^+, c^-) = \cos(\mathcal{F}_t(c^+), \mathcal{F}_t(c^-))\]

When positive and negative classes are semantically close (high cosine similarity), the margin is larger, forcing the model to put more effort into distinguishing similar categories.
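A minimal sketch of a triplet loss with this margin is given below. It assumes cosine distance between visual embeddings and a standard hinge form; neither is confirmed by these notes, and the toy vectors are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adaptive_margin_triplet(anchor, positive, negative, text_pos, text_neg):
    """Triplet hinge loss whose margin is the cosine similarity of class-name embeddings."""
    margin = cosine(text_pos, text_neg)      # mu(c+, c-)
    d_pos = 1.0 - cosine(anchor, positive)   # assumed distance: cosine distance
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)


# Toy embeddings: the same visual triplet, with a close and a distant negative class.
sketch, photo_pos, photo_neg = [1.0, 0.0], [0.8, 0.6], [0.6, 0.8]
loss_close = adaptive_margin_triplet(sketch, photo_pos, photo_neg, [1.0, 0.0], [0.6, 0.8])
loss_far = adaptive_margin_triplet(sketch, photo_pos, photo_neg, [1.0, 0.0], [0.0, 1.0])
```

For the same visual triplet, the semantically close negative class yields a larger margin and therefore a non-zero loss, while the distant class yields zero loss.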

Loss & Training¶

\[\mathcal{L}_{total} = \mathcal{L}_{triplet} + \alpha \cdot \mathcal{L}_{class} + \beta \cdot \mathcal{L}_{cjs}\]

\(\mathcal{L}_{triplet}\): Cross-modal triplet loss with adaptive margin
\(\mathcal{L}_{class}\): Image-text classification loss (cross-entropy) based on CLIP textual prompts
\(\mathcal{L}_{cjs}\): Conditional jigsaw loss = jigsaw cross-entropy + hinge loss (constraining the positive pair to exhibit lower jigsaw CE than the negative pair)
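One plausible reading of the conditional jigsaw term and the total objective is sketched below. Several choices are assumptions rather than details from the paper: the cross-entropy term is taken on the positive pair \(r^+\), the hinge uses a zero margin by default, and the `alpha`/`beta` values are left to the caller.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one logit vector against an integer class index."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]


def conditional_jigsaw_loss(logits_pos, logits_neg, perm_index, hinge_margin=0.0):
    """Jigsaw CE plus a hinge that wants the positive pair to be easier than the negative."""
    ce_pos = cross_entropy(logits_pos, perm_index)  # solver output for r+
    ce_neg = cross_entropy(logits_neg, perm_index)  # solver output for r-
    hinge = max(0.0, ce_pos - ce_neg + hinge_margin)
    return ce_pos + hinge


def total_loss(l_triplet, l_class, l_cjs, alpha, beta):
    """Weighted sum matching L_total = L_triplet + alpha * L_class + beta * L_cjs."""
    return l_triplet + alpha * l_class + beta * l_cjs


# Toy logits over three permutation classes; index 0 is the true permutation.
helpful = conditional_jigsaw_loss([4.0, 0.0, 0.0], [0.0, 0.0, 0.0], perm_index=0)
unhelpful = conditional_jigsaw_loss([0.0, 0.0, 0.0], [4.0, 0.0, 0.0], perm_index=0)
```

When the positive photo already makes the puzzle easier than the negative one, the hinge vanishes and only the cross-entropy remains; in the reversed case the hinge adds a penalty.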

Key Experimental Results¶

Main Results: Category-level ZS-SBIR (Table 1)¶

Method	Backbone	Sketchy-1 mAP	Sketchy-2 mAP@200	TU-Berlin mAP	QuickDraw mAP
BDA	CNN	43.7	55.6	37.4	15.4
PSKD	ViT	68.8	56.0	50.2	15.0
ZSE-Ret	ViT	73.6	50.4	56.9	14.2
CLIP-AT	CLIP	-	72.3	65.1	20.2
TLT	CLIP	77.9	66.1	61.5	27.8
MARL	CLIP	-	69.1	70.5	32.7
SpLIP	CLIP	80.2	76.4	73.1	34.2

GZS-SBIR Results (Table 2)¶

Method	Sketchy-2 mAP@200	Sketchy-2 P@200	TU-Berlin mAP	TU-Berlin P@100
STL (AAAI'23)	63.4	53.8	40.2	49.8
CLIP-AT	55.6	62.7	60.9	63.8
MARL	62.3	68.5	62.6	67.8
SpLIP	68.2	74.5	66.7	70.3

SpLIP outperforms the second-best method on GZS-SBIR by 4.8 mAP@200 points on Sketchy-2 (over STL) and 4.1 mAP points on TU-Berlin (over MARL).
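The gaps quoted for GZS-SBIR can be rechecked directly from the table values. The helper name `gain_over_runner_up` is introduced here for illustration only; the numbers are copied from the rows above.

```python
# Sketchy-2 mAP@200 and TU-Berlin mAP, copied from the GZS-SBIR table.
gzs_map = {
    "STL":     {"Sketchy-2": 63.4, "TU-Berlin": 40.2},
    "CLIP-AT": {"Sketchy-2": 55.6, "TU-Berlin": 60.9},
    "MARL":    {"Sketchy-2": 62.3, "TU-Berlin": 62.6},
    "SpLIP":   {"Sketchy-2": 68.2, "TU-Berlin": 66.7},
}


def gain_over_runner_up(table, ours, dataset):
    """Return the strongest competing method on a dataset and our gap to it."""
    best_other = max((m for m in table if m != ours), key=lambda m: table[m][dataset])
    return best_other, round(table[ours][dataset] - table[best_other][dataset], 1)


print(gain_over_runner_up(gzs_map, "SpLIP", "Sketchy-2"))  # ('STL', 4.8)
print(gain_over_runner_up(gzs_map, "SpLIP", "TU-Berlin"))  # ('MARL', 4.1)
```

Note that the runner-up differs per dataset: STL on Sketchy-2 and MARL on TU-Berlin.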

FG-ZS-SBIR Results (Table 3, Sketchy Dataset)¶

Method	Acc@1	Acc@5
CLIP-AT	28.68	62.34
MARL	29.96	58.53
SpLIP	33.45	66.71

SpLIP exceeds CLIP-AT by about 4.8 points and MARL by about 3.5 points in Acc@1.

Cross-Dataset Generalization (Table 4, Train on Sketchy-Ext, Test on Others)¶

Method	TU-Berlin mAP	TU-Berlin P@100	QuickDraw mAP	QuickDraw P@100
CLIP-AT	56.4	63.1	30.7	45.0
SpLIP	70.6	76.0	45.8	58.6

Significant improvements are observed in cross-dataset scenarios: +14.2 mAP points on TU-Berlin and +15.1 mAP points on QuickDraw.

Ablation Study: Loss Functions (Table 5, Sketchy)¶

Loss Combination	ZS mAP@200	ZS P@200	FG Acc@1	FG Acc@5
\(\mathcal{L}_{class}\)	57.5	58.1	17.23	39.41
\(\mathcal{L}_{triplet}\)	71.6	72.7	26.54	59.92
\(\mathcal{L}_{class} + \mathcal{L}_{triplet}\)	73.1	73.9	30.07	62.95
+ \(\mathcal{L}_{margin}\)	74.5	75.1	31.23	63.54
+ \(\mathcal{L}_{cjs}\) (Full)	76.4	77.3	33.45	66.71

Ablation Study: Learnable Modules (Table 6, Sketchy)¶

Configuration	ZS mAP@200	FG Acc@1
w/o LayerNorm	74.2	31.24
w/o \(\mathcal{B}_v\) (No text \(\rightarrow\) vision prompt)	71.9	29.76
w/o \(\mathcal{B}_t\) (No vision \(\rightarrow\) text prompt)	72.8	30.41
w/o \(\mathcal{B}_{vt}\) (No cross-layer text \(\rightarrow\) vision)	70.5	28.92
w/o All visual prompts	68.8	26.54
w/o All prompt modules	62.5	28.49
SpLIP (Full)	76.4	33.45

Key Findings¶

Bidirectional prompt sharing is crucial: Removing prompt sharing in either direction costs 3.6 to 4.5 mAP@200 points in ZS-SBIR and 3.0 to 3.7 Acc@1 points in FG-ZS-SBIR (rows w/o \(\mathcal{B}_t\) and w/o \(\mathcal{B}_v\) in Table 6).
\(\mathcal{B}_{vt}\) is the most critical single module: Removing it drops the mAP@200 by 5.9 points (76.4 to 70.5), as it aggregates textual information across all training categories for the visual encoder.
Adaptive margin outperforms fixed margin: Dynamic \(\mu\) significantly surpasses the fixed \(\mu=0.2\) across all tasks.
Conditional jigsaw outperforms CLIP-AT's jigsaw strategy: Performance degrades noticeably when replacing it with the CLIP-AT approach.
Deep prompts (layer-wise injection) outperform shallow prompts: The mAP continuously improves as the number of participating layers increases.
Strong cross-dataset generalization ability: Gains of 14-15% mAP are achieved on unseen datasets.
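The per-module drops behind these findings can be recomputed from the Table 6 rows. The dictionary keys are shorthand introduced for this example; the values are copied from the table.

```python
# (ZS mAP@200, FG Acc@1) from the learnable-modules ablation table.
full = (76.4, 33.45)
ablations = {
    "w/o LayerNorm": (74.2, 31.24),
    "w/o B_v": (71.9, 29.76),
    "w/o B_t": (72.8, 30.41),
    "w/o B_vt": (70.5, 28.92),
}

# Drop relative to the full model, in absolute points.
drops = {
    name: (round(full[0] - zs, 1), round(full[1] - fg, 2))
    for name, (zs, fg) in ablations.items()
}

# Single module whose removal costs the most ZS mAP@200.
most_critical = max(drops, key=lambda name: drops[name][0])

for name, (zs_drop, fg_drop) in drops.items():
    print(f"{name}: -{zs_drop} mAP@200, -{fg_drop} Acc@1")
```

Among the single-module rows, removing \(\mathcal{B}_{vt}\) gives the largest drop in both metrics, followed by \(\mathcal{B}_v\), \(\mathcal{B}_t\), and LayerNorm tuning.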

Highlights & Insights¶

First work to apply multimodal prompt learning to SBIR: Previous CLIP-SBIR methods only used visual prompts or simple fine-tuning.
Elegant design of bidirectional information flow: Vision \(\rightarrow\) text (via \(\mathcal{B}_t\)) makes textual prompts context-aware of visual contents; text \(\rightarrow\) vision (via \(\mathcal{B}_v + \mathcal{B}_{vt}\)) integrates semantic knowledge into the visual encoder, creating a closed loop.
Category-agnostic knowledge aggregation: \(\mathcal{B}_{vt}\) aggregates text features of all training categories instead of just the current class; it does not require seeing test class names during inference, naturally supporting zero-shot scenarios.
Clever refinement of the conditional jigsaw task: Changing “permuted sketch vs. permuted photo” to “permuted sketch vs. whole photo” establishes better local-global correspondences.
Adaptive margin leverages CLIP’s semantic structure: Categories that are semantically close require greater margins to differentiate, which aligns with human intuition.

Limitations & Future Work¶

Limited to ViT-B/32: Larger CLIP backbones (like ViT-L/14) have not been explored and might offer further performance gains.
Training overhead: Although the CLIP backbone is frozen, the bidirectional prompt sharing and the jigsaw solver introduce extra computations (multiple mapping modules and a 2-layer Transformer encoder).
Hyperparameters of the jigsaw task: The selection of the permutation size \(|\mathcal{Y}^{\text{perm}}|\) is not thoroughly discussed, which might significantly affect the fine-grained performance.
FG-ZS-SBIR evaluated only on Sketchy: Results on other fine-grained datasets (such as QMUL-Shoe/Chair) are missing.
Inference speed not reported: Multimodal prompt generation (especially for \(\mathcal{B}_{vt}\) which needs to process all training classes) might lead to inference latency.
Grid search for loss weights \(\alpha\) and \(\beta\): Analysis on sensitivity across different datasets is absent.

Comparison with MaPLe (CVPR'23)¶

Like SpLIP, MaPLe is a multimodal prompt learning method, but it only supports unidirectional token sharing from text to vision (visual prompts are derived from the text prompts) and is limited to specific layers. SpLIP implements a bidirectional, all-layer knowledge exchange, which is more effective for SBIR.

Comparison with CLIP-AT (CVPR'23)¶

CLIP-AT utilizes visual prompts combined with a patch shuffling triplet objective. SpLIP upgrades this in three dimensions: multimodal prompts (vs. unimodal), conditional jigsaw (vs. naive permutation matching), and adaptive margin (vs. fixed margin). SpLIP outperforms CLIP-AT by 14-15% mAP in cross-dataset evaluations.

Insights¶

The paradigm of bidirectional prompt sharing can be extended to other cross-modal tasks (e.g., text-to-image generation, VQA).
The adaptive margin concept can be applied to any metric learning based on CLIP semantic space.
The conditional jigsaw task idea can be generalized to other scenarios requiring fine-grained alignment (e.g., cross-domain ReID, medical image registration).

Rating¶

Novelty: ⭐⭐⭐⭐ — Bidirectional multimodal prompt sharing + conditional jigsaw + adaptive margin, with three innovations working collaboratively.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three ZS-SBIR settings × four datasets + cross-dataset generalization + detailed ablations, highly comprehensive.
Writing Quality: ⭐⭐⭐⭐ — Method description is clear, formula notation is complete, and the illustrations are intuitive.
Value: ⭐⭐⭐⭐ — Re-establishes a new adapter paradigm for CLIP in the SBIR domain, with the bidirectional prompt mechanism showing solid generalizability.