FreestyleRet: Retrieving Images from Style-Diversified Queries¶
Conference: ECCV 2024
arXiv: 2312.02428
Code: https://github.com/CuriseJia/FreeStyleRet
Area: Model Compression (Lightweight Retrieval)
Keywords: Style-diversified retrieval, Gram matrix, prompt tuning, cross-style image retrieval, plug-and-play framework
TL;DR¶
This work proposes the first Style-Diversified Query-Based Image Retrieval (Style-Diversified QBIR) task and the DSR dataset. It designs FreestyleRet, a lightweight, plug-and-play framework that extracts texture/style features of queries using Gram matrices to construct a style space. These style features then initialize prompt tokens, enabling a frozen vision encoder to adapt to various query styles such as texts, sketches, low-resolution images, and artistic paintings.
Background & Motivation¶
Query-Based Image Retrieval (QBIR) is a fundamental task in computer vision, widely applied in image search engines and cross-modal tasks. However, current retrieval research almost exclusively focuses on text-to-image retrieval, overlooking the fact that users in real-world scenarios may use diverse query modalities to express their search intentions.
Limitations of Prior Work:
- Single Query Style: Mainstream retrieval models (e.g., CLIP, BLIP) are primarily designed for text queries and adapt poorly to non-text queries such as sketches, artistic paintings (art), and low-resolution screenshots (low-res).
- User Intention Gap: Different users prefer different query styles (some draw a sketch to describe shapes, some take low-resolution captures, others use text), so a single query modality cannot express complex retrieval intents.
- Inability to Combine Multiple Styles: Existing models cannot process multi-style query inputs simultaneously, let alone leverage complementary information across styles to improve retrieval performance.
Key Challenge: How can a single retrieval framework understand and adapt to diverse query styles with minimal computational overhead?
Key Insight: Borrow the Gram matrix from image style transfer to capture the texture features of different query styles, construct a style space via K-Means clustering, and initialize prompt tokens with the resulting style features. This yields style-aware retrieval on a frozen pre-trained vision encoder with minimal parameter overhead.
Method¶
Overall Architecture¶
FreestyleRet contains three core modules: (1) Gram-based style extraction module, (2) style space construction module, and (3) style-initialized prompt tuning module. During training, the dataset is traversed twice: the first pass constructs the style space, and the second pass executes style-initialized prompt tuning. During inference, the constructed style space is directly utilized for style-aware retrieval.
Key Designs¶
- Gram-based Style Extraction Module: For any query input \(q_i\), a frozen VGG network extracts the features of the third convolutional layer (\(112 \times 112 \times 128\)). After downsampling, these features are flattened into a matrix \(F_i\) with one row per channel, and the Gram matrix is computed in its standard form:
\[ g_i = F_i F_i^{\top} \]
The Gram matrix \(g_i\) captures the texture of the query through its inter-channel feature correlations, which helps distinguish sketch lines, art brushstrokes, low-resolution blur, etc.
Design Motivation: The Gram matrix has been proven effective in representing style/texture in style transfer. Introducing it to retrieval tasks naturally distinguishes different query styles without requiring manual style labels.
- Style Space Construction Module: The Gram matrices of all queries are clustered into 4 clusters (corresponding to 4 query styles) using the K-Means algorithm. The cluster centers \(\mu_1, ..., \mu_4\) serve as the basis vectors of the style space. For a newly input query \(q_i\), its style feature \(s_i\) is calculated as a weighted sum of these cluster centers.
Design Motivation: Explicitly constructing a style space allows the model to adaptively comprehend the style characteristics of any query, rather than hardcoding style categories. The soft-weighted combination also handles queries that fall between two styles.
- Style-Init Prompt Tuning Module: Four learnable prompt tokens are inserted in each layer of the frozen ViT vision encoder. The key novelty lies in the token initialization strategy: instead of random initialization, shallow layers are initialized with the style space feature \(s_i\), while deep layers are initialized with the Gram matrix \(g_i\), making the encoder style-aware from the start.
Design Motivation: VPT showed that deep prompts outperform shallow prompts, but random initialization cannot inject style information. Layer-specific differentiated initialization—shallow layers injecting global style, deep layers injecting fine-grained texture—enables the encoder to perceive style at different abstraction levels.
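The three modules above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the toy feature maps stand in for downsampled VGG features, the function names are hypothetical, and the softmax weighting over negative squared distances is an assumption, since the exact weighting formula is not given here.

```python
import math
import random

def gram_matrix(feat):
    """feat: list of C channels, each a flattened list of H*W activations.
    Returns the C x C Gram matrix F F^T (no normalization applied)."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in feat] for fi in feat]

def flatten(mat):
    return [v for row in mat for v in row]

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Keep the previous center if a cluster ends up empty.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers

def style_feature(g, centers, temperature=1.0):
    """Weighted sum of cluster centers. ASSUMPTION: weights are a softmax
    over negative squared distances to each center."""
    logits = [-sq_dist(g, c) / temperature for c in centers]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    s = [sum(w * c[d] for w, c in zip(weights, centers)) for d in range(len(g))]
    return s, weights

# Toy data: two "styles" that differ only in activation scale.
rng = random.Random(0)

def toy_features(scale, channels=3, positions=16):
    return [[rng.gauss(0.0, scale) for _ in range(positions)] for _ in range(channels)]

# Pass 1: build the style space from the Gram matrices of all queries.
grams = [flatten(gram_matrix(toy_features(s))) for s in (0.5, 2.0) for _ in range(10)]
centers = kmeans(grams, k=2)

# Pass 2 / inference: project a new query into the style space.
query_gram = flatten(gram_matrix(toy_features(2.0)))
style, weights = style_feature(query_gram, centers)

# Style-initialized prompts: shallow layers start from the style-space feature,
# deep layers from the query's own Gram matrix. A real encoder would also need
# a projection to the token width (omitted here; an assumption).
shallow_prompt_init = list(style)
deep_prompt_init = list(query_gram)
```

With the paper's setting, `k` would be 4 and the Gram matrices would come from the frozen VGG features rather than random toy data.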
Loss & Training¶
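The triplet loss is used as the training objective. In its standard form, for a query feature \(q\), a positive sample \(p\), and a negative sample \(n\):
\[ \mathcal{L} = \max\big(\text{dist}(q, p) - \text{dist}(q, n) + \alpha,\ 0\big) \]
where \(\text{dist}(x,y) = 1 - \cos(x,y)\) and \(\alpha=1.0\). Positive samples are ground-truth retrieval images, and negative samples are randomly selected images from the same style set. Training runs for 20 epochs with a learning rate of 1e-5 and a batch size of 24 on an A100 GPU.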
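As a minimal sketch of this objective, the snippet below evaluates a standard triplet loss with cosine distance and margin 1.0 on hand-picked 2-D vectors. The function names are illustrative, and plain lists stand in for encoder features.

```python
import math

def cosine_distance(x, y):
    """dist(x, y) = 1 - cos(x, y)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def triplet_loss(query, positive, negative, margin=1.0):
    """Hinge on the gap between query-positive and query-negative distances."""
    gap = cosine_distance(query, positive) - cosine_distance(query, negative)
    return max(gap + margin, 0.0)

# Positive aligned with the query, negative orthogonal: the margin is satisfied.
easy = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])   # 0.0
# Roles swapped: the loss equals the violated gap plus the margin.
hard = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])   # 2.0
```

With cosine distance bounded in [0, 2], a margin of 1.0 means the loss vanishes only when the negative is at least 1.0 farther from the query than the positive.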
Key Experimental Results¶
Main Results: Style-diversified Retrieval on DSR Dataset¶
| Method | Text R@1 | Sketch R@1 | Art R@1 | Low-Res R@1 | Note |
|---|---|---|---|---|---|
| CLIP (zero-shot) | 66.1 | 47.5 | 58.5 | 45.0 | No style adaptation |
| CLIP* (prompt tuned) | 72.2 | 63.6 | 58.2 | 78.8 | Standard prompt tuning |
| BLIP* (prompt tuned) | 74.3 | 67.1 | 51.1 | 77.2 | Standard prompt tuning |
| VPT* | 69.9 | 73.3 | 66.7 | 81.4 | Visual prompt tuning |
| ImageBind | 71.0 | 50.8 | 58.2 | 79.0 | Multi-modal model |
| LanguageBind | 79.7 | 63.6 | 67.5 | 78.6 | Multi-modal model |
| FreestyleRet-CLIP | 69.9 | 80.6 | 71.4 | 86.4 | Ours (CLIP encoder) |
| FreestyleRet-BLIP | 81.6 | 81.2 | 74.5 | 90.5 | Ours (BLIP encoder) |
FreestyleRet-BLIP achieves the best R@1 on all four query styles; relative to BLIP*, Sketch improves by 14.1 points (67.1 to 81.2) and Art by 23.4 points (51.1 to 74.5).
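For readers unfamiliar with the metric, here is a hedged sketch of how R@K can be computed from a query-to-gallery similarity matrix. It assumes that query `i` has exactly one ground-truth image at gallery index `i`; that convention and the function name are illustrative, not taken from the paper's evaluation code.

```python
def recall_at_k(similarity, k=1):
    """similarity[i][j]: score between query i and gallery image j.
    ASSUMPTION: the ground truth for query i is gallery image i.
    Returns the percentage of queries whose ground truth is in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return 100.0 * hits / len(similarity)

sim = [
    [0.9, 0.1, 0.2],  # query 0: ground truth ranked first
    [0.1, 0.6, 0.8],  # query 1: ground truth ranked second
    [0.1, 0.2, 0.7],  # query 2: ground truth ranked first
]
r1 = recall_at_k(sim, k=1)  # two of three queries hit at rank 1
r2 = recall_at_k(sim, k=2)  # all three queries hit within the top 2
```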
Ablation Study: Prompt Token Design¶
| Configuration | Sketch R@1 | Art R@1 | Low-Res R@1 | Note |
|---|---|---|---|---|
| Shallow Random + Deep Random | 68.1 | 63.5 | 78.8 | Randomly initialized baseline |
| Shallow StyleSpace + Deep Random | 76.7 | 69.1 | 82.4 | Shallow style injection |
| Shallow Random + Deep Gram | 76.8 | 69.2 | 81.8 | Deep texture injection |
| Shallow Gram + Deep StyleSpace | 78.1 | 69.5 | 84.4 | Reversed configuration |
| Shallow StyleSpace + Deep Gram | 80.6 | 71.4 | 86.4 | Best configuration |
| Best configuration, 1 token | 68.2 | 64.7 | 79.1 | Insufficient tokens |
| Best configuration, 2 tokens | 72.3 | 65.9 | 82.8 | Performance increases |
| Best configuration, 8 tokens | 77.9 | 67.1 | 80.7 | Too many tokens cause degradation |
Computational Efficiency Comparison¶
| Method | Parameters | Inference Time |
|---|---|---|
| CLIP | 427M | 68ms |
| ImageBind | 1200M | 372ms |
| FreestyleRet-CLIP | 456M (+29M) | 96ms (+28ms) |
| FreestyleRet-BLIP | 940M (+29M) | 101ms (+39ms) |
FreestyleRet supports multi-style retrieval with only 29M additional parameters and 28 to 39 ms of extra inference time over its base encoder.
Key Findings¶
- Mutual Benefit of Multi-style Queries: In FreestyleRet, a joint sketch+text query raises Text R@1 from 69.9% to 82.5% (+12.6 points), whereas joint querying with CLIP/BLIP degrades performance.
- Convergence in 5-10 Epochs: FreestyleRet converges significantly faster than baseline models (which require 50+ epochs), taking only 4 minutes per epoch.
- Style Space Initialization Outperforms Random Initialization: The ablation study shows that the best configuration (Shallow StyleSpace + Deep Gram) improves Sketch R@1 by 12.5 points over random initialization (68.1 to 80.6).
Highlights & Insights¶
- New Task Definition: Systematically defines the style-diversified QBIR task for the first time and proposes two evaluation datasets, DSR and ImageNet-X, laying a foundation for this research direction.
- Cross-domain Application of Gram Matrix: Brings the texture representation tool of style transfer into retrieval in a simple and intuitive way.
- Plug-and-play Design: Freezing encoders and only training prompt tokens allows the framework to seamlessly adapt to any ViT-based encoder, such as CLIP and BLIP.
- Multi-query Synergy: Queries of different styles reinforce rather than interfere with each other, whereas such interference remains a bottleneck for current multi-modal models.
Limitations & Future Work¶
- Limited Query Styles: Currently only four styles—text, sketch, art, and low-res—are studied, leaving out other modalities such as 3D, video, and audio.
- Small Scale of DSR Dataset: The dataset contains only 10,000 images, and performance under large-scale scenarios remains to be validated.
- Underutilized Text Queries: The text branch directly employs the CLIP text encoder without style-aware adaptation like the visual branch.
- Fixed Number of Styles (K=4) in K-Means: Introducing a new query style requires re-clustering, highlighting a limitation in flexibility.
Related Work & Insights¶
- VPT (Visual Prompt Tuning): Introduced prompt tuning to vision models; this work builds a style-based prompt initialization strategy on top of it.
- CoOp / CoCoOp: Classic works on text prompt learning, inspiring the design of learnable prompts.
- Gram Matrix (Gatys et al.): The core representation tool in style transfer, creatively utilized here for style space construction.
- ImageBind / LanguageBind: Multi-modal unified models, which lack fine-grained modeling of style discrepancies.
- Insight: The prompt initialization strategy can be extended to other downstream tasks requiring conditional adaptation.
Rating¶
- Novelty: ⭐⭐⭐⭐ A pioneering combination of a new task, a new dataset, and Gram-matrix-initialized prompts.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation with two datasets, multiple baseline comparisons, detailed ablation studies, and qualitative analyses.
- Writing Quality: ⭐⭐⭐⭐ Elegant illustrations, clear motivation, and logically structured progression across the three modules.
- Value: ⭐⭐⭐⭐ Lays out a new task and provides robust baselines, promising a long-term impact on the retrieval community.