FreestyleRet: Retrieving Images from Style-Diversified Queries

Conference: ECCV 2024
arXiv: 2312.02428
Code: https://github.com/CuriseJia/FreeStyleRet
Area: Model Compression (Lightweight Retrieval)
Keywords: Style-diversified retrieval, Gram matrix, prompt tuning, cross-style image retrieval, plug-and-play framework

TL;DR

This work proposes the first Style-Diversified Query-Based Image Retrieval (Style-Diversified QBIR) task and the DSR dataset. It designs FreestyleRet, a lightweight, plug-and-play framework that extracts texture/style features of queries using Gram matrices to construct a style space. These style features then initialize prompt tokens, enabling a frozen vision encoder to adapt to various query styles such as texts, sketches, low-resolution images, and artistic paintings.

Background & Motivation

Query-Based Image Retrieval (QBIR) is a fundamental task in computer vision, widely applied in image search engines and cross-modal tasks. However, current retrieval research almost exclusively focuses on text-to-image retrieval, overlooking the fact that users in real-world scenarios may use diverse query modalities to express their search intentions.

Limitations of Prior Work:

Single Query Style: Mainstream retrieval models (e.g., CLIP, BLIP) are primarily designed for text queries, lacking adaptation capabilities for non-text queries such as sketches, artistic paintings (art), and low-resolution screenshots (low-res).

User Intention Gap: Different users prefer different query styles—some may draw a sketch to describe shapes, some take low-resolution captures, others use text—rendering single-query modalities inadequate for expressing complex retrieval intents.

Inability to Combine Multiple Styles: Existing models cannot process multi-style query inputs simultaneously, let alone leverage complementary information across styles to improve retrieval performance.

Key Challenge: How can a single retrieval framework understand and adapt to multiple query styles at once while adding minimal computational overhead?

Key Insight: Borrow the Gram matrix from image style transfer to capture the texture features of each query style, build a style space from those features with K-Means clustering, and initialize prompt tokens with the resulting style features. This yields style-aware retrieval on a frozen pre-trained vision encoder with few additional parameters.

Method

Overall Architecture

FreestyleRet contains three core modules: (1) Gram-based style extraction module, (2) style space construction module, and (3) style-initialized prompt tuning module. During training, the dataset is traversed twice: the first pass constructs the style space, and the second pass executes style-initialized prompt tuning. During inference, the constructed style space is directly utilized for style-aware retrieval.

Key Designs

  1. Gram-based Style Extraction Module: For any query input \(q_i\), a frozen VGG network extracts the third convolutional layer features \(v_i\) (\(112 \times 112 \times 128\)). After downsampling with \(f_d\), the Gram matrix is computed:
\[g_i = (f_d(v_i))^\mathsf{T} f_d(v_i)\]

The Gram matrix \(g_i\) captures the texture of the query as correlations between feature channels, aggregated over spatial positions, which effectively distinguishes sketch lines, art brushstrokes, low-resolution blur, etc.

Design Motivation: The Gram matrix has been proven effective in representing style/texture in style transfer. Introducing it to retrieval tasks naturally distinguishes different query styles without requiring manual style labels.
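The Gram computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the random array stands in for the frozen VGG features, and the downsampling \(f_d\) is assumed to be average pooling with a factor of 4 (the source does not state the operator or the factor).

```python
import numpy as np

def avg_pool(feat, factor):
    """Average-pool an (H, W, C) feature map by an integer factor (assumed f_d)."""
    h, w, c = feat.shape
    feat = feat[: h - h % factor, : w - w % factor]  # crop so H and W divide evenly
    return feat.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def gram_matrix(feat):
    """Return the (C, C) Gram matrix of an (H, W, C) feature map."""
    h, w, c = feat.shape
    flat = feat.reshape(h * w, c)  # one row per spatial position
    return flat.T @ flat           # sum of channel products over all positions

rng = np.random.default_rng(0)
v = rng.standard_normal((112, 112, 128))  # stand-in for the VGG features v_i
pooled = avg_pool(v, factor=4)            # factor 4 is an assumption
g = gram_matrix(pooled)
print(g.shape)  # (128, 128)
```

Because the product sums over spatial positions, shuffling the positions leaves the result unchanged: the Gram matrix keeps channel co-activation statistics and drops layout, which is why it describes style rather than content.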

  2. Style Space Construction Module: The Gram matrices of all queries are clustered into 4 clusters (corresponding to 4 query styles) using the K-Means algorithm. The cluster centers \(\mu_1, \ldots, \mu_4\) serve as the basis vectors of the style space. For a newly input query \(q_i\), its style feature is calculated via weighted summation:
\[w_j = \frac{e^{\cos(q_i, \mu_j)}}{\sum_{k=1}^{4} e^{\cos(q_i, \mu_k)}}, \quad s_i = \sum_{j=1}^{4} w_j \mu_j\]

Design Motivation: Explicitly constructing a style space allows the model to adaptively comprehend the style characteristics of any query, rather than hardcoding style categories. The soft-weighted combination path allows handling queries that fall between two styles.
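A rough sketch of the style space, under stated assumptions: queries are represented by their flattened Gram matrices when compared with the centers, the synthetic blobs below stand in for real Gram features, and `kmeans` and `style_feature` are illustrative helper names, not functions from the FreestyleRet repository.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's K-Means; returns the (k, d) cluster centers."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign every sample to its nearest center (squared Euclidean distance).
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster went empty.
        centers = np.stack([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return centers

def style_feature(g, centers):
    """Softmax over cosine similarities to the centers, then a weighted sum."""
    g_n = g / np.linalg.norm(g)
    c_n = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    cos = c_n @ g_n                 # cos(q_i, mu_j) for each center
    w = np.exp(cos - cos.max())     # shifted for numerical stability
    w /= w.sum()
    return w, w @ centers           # weights w_j and style feature s_i

rng = np.random.default_rng(0)
# Four synthetic "styles": separated blobs standing in for flattened Gram matrices.
offsets = rng.standard_normal((4, 16)) * 5.0
grams = np.concatenate([o + rng.standard_normal((50, 16)) for o in offsets])
centers = kmeans(grams, k=4)
w, s = style_feature(grams[0], centers)
print(w.round(3), s.shape)
```

Since cosine similarity lies in \([-1, 1]\), no weight can exceed another by more than a factor of \(e^2\), so every style feature remains a genuine blend of all four centers rather than a hard assignment.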

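  3. Style-Init Prompt Tuning Module: Four learnable prompt tokens are inserted in each layer of the frozen ViT vision encoder. The key novelty lies in the token initialization strategy: shallow layers are initialized with the style space feature \(s_i\), while deep layers are initialized with the Gram matrix \(g_i\) (instead of random initialization), enabling the encoder to be style-aware from the beginning:
\[[x_i, \_, E_i] = L_i(x_{i-1}, P_{i-1}, E_{i-1}), \quad i=1,...,n\]

Following the VPT notation, \(L_i\) is the \(i\)-th encoder layer, \(x_i\) the class-token embedding, \(E_i\) the patch embeddings, and \(P_{i-1}\) the prompt tokens fed into layer \(i\). The prompt outputs of each layer (the \(\_\) slot) are discarded and replaced by that layer's own prompts.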

Design Motivation: VPT showed that deep prompts outperform shallow prompts, but random initialization cannot inject style information. Layer-specific differentiated initialization—shallow layers injecting global style, deep layers injecting fine-grained texture—enables the encoder to perceive style at different abstraction levels.
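The layer-wise initialization can be illustrated with a toy forward pass. Several details here are assumptions, since the summary does not specify them: the shallow/deep split point (`n_shallow=6`), the random linear projections that map a style vector to prompt tokens, the token width, and the stand-in "layers". The sketch only shows initialization and the deep-prompt data flow; in the actual method the prompt tokens are then trained.

```python
import numpy as np

def init_prompts(s, g, n_layers=12, n_shallow=6, n_tokens=4, dim=32, seed=0):
    """Build one (n_tokens, dim) prompt block per layer from style features."""
    rng = np.random.default_rng(seed)
    # Hypothetical projections from a style vector to n_tokens * dim values.
    proj_s = rng.normal(scale=0.02, size=(s.size, n_tokens * dim))
    proj_g = rng.normal(scale=0.02, size=(g.size, n_tokens * dim))
    prompts = []
    for layer in range(n_layers):
        # Shallow layers start from s_i, deep layers from g_i.
        src, proj = (s, proj_s) if layer < n_shallow else (g, proj_g)
        prompts.append((src.ravel() @ proj).reshape(n_tokens, dim))
    return prompts

def forward(cls_tok, patches, prompts, layers):
    """Deep-prompt forward pass: each layer gets fresh prompts, outputs dropped."""
    x, e = cls_tok, patches
    for layer_fn, p in zip(layers, prompts):
        out = layer_fn(np.concatenate([x, p, e], axis=0))
        x, e = out[:1], out[1 + p.shape[0]:]  # skip the prompt slots
    return x

rng = np.random.default_rng(1)
dim = 32
s = rng.standard_normal(64)  # stand-in style-space feature s_i
g = rng.standard_normal(64)  # stand-in flattened Gram matrix g_i
prompts = init_prompts(s, g, dim=dim)
# Toy "layer" that mixes tokens, standing in for a frozen transformer block.
toy_layers = [lambda seq: seq + seq.mean(axis=0, keepdims=True)] * 12
cls_out = forward(rng.standard_normal((1, dim)),
                  rng.standard_normal((9, dim)), prompts, toy_layers)
print(cls_out.shape)  # (1, 32)
```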

Loss & Training

The Triplet Loss is utilized as the training objective:

\[\mathcal{L} = \frac{1}{B}\sum_{i=1}^{B}\max(0, \text{dist}(F_i, P_i) - \text{dist}(F_i, N_i) + \alpha)\]

where \(\text{dist}(x,y) = 1 - \cos(x,y)\) and \(\alpha=1.0\). Positive samples are the ground-truth retrieval images, and negative samples are randomly selected images from the same style set. Training runs for 20 epochs with a learning rate of 1e-5 and a batch size of 24 on an A100 GPU.
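As a concrete reading of the loss, here is a small NumPy sketch with the cosine distance and margin given above. The function names are illustrative, and the two-dimensional vectors are toy inputs chosen so the result can be checked by hand.

```python
import numpy as np

def cosine_distance(x, y):
    """Row-wise dist(x, y) = 1 - cos(x, y) for two (B, D) arrays."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - (x * y).sum(axis=1)

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Mean over the batch of max(0, dist(F, P) - dist(F, N) + margin)."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

anchor = np.array([[1.0, 0.0], [0.0, 1.0]])
positive = anchor.copy()                       # perfectly matched positives
orthogonal = np.array([[0.0, 1.0], [1.0, 0.0]])

easy = triplet_loss(anchor, positive, orthogonal)  # 0 - 1 + 1 = 0 per sample
hard = triplet_loss(anchor, positive, anchor)      # 0 - 0 + 1 = 1 per sample
print(easy, hard)  # 0.0 1.0
```

With \(\alpha = 1.0\), a triplet stops contributing only once the negative is at least one full unit of cosine distance farther from the query than the positive, i.e. roughly orthogonal or worse when the positive matches closely.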

Key Experimental Results

Main Results: Style-diversified Retrieval on DSR Dataset

| Method | Text R@1 | Sketch R@1 | Art R@1 | Low-Res R@1 | Note |
| --- | --- | --- | --- | --- | --- |
| CLIP (zero-shot) | 66.1 | 47.5 | 58.5 | 45.0 | No style adaptation |
| CLIP* (prompt tuned) | 72.2 | 63.6 | 58.2 | 78.8 | Standard prompt tuning |
| BLIP* (prompt tuned) | 74.3 | 67.1 | 51.1 | 77.2 | Standard prompt tuning |
| VPT* | 69.9 | 73.3 | 66.7 | 81.4 | Visual prompt tuning |
| ImageBind | 71.0 | 50.8 | 58.2 | 79.0 | Multi-modal model |
| LanguageBind | 79.7 | 63.6 | 67.5 | 78.6 | Multi-modal model |
| FreestyleRet-CLIP | 69.9 | 80.6 | 71.4 | 86.4 | Ours (CLIP encoder) |
| FreestyleRet-BLIP | 81.6 | 81.2 | 74.5 | 90.5 | Ours (BLIP encoder) |

FreestyleRet-BLIP achieves the best R@1 across all four query styles, improving over BLIP* by 14.1 points on Sketch (81.2 vs. 67.1) and 23.4 points on Art (74.5 vs. 51.1).
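For readers unfamiliar with the metric, R@1 is the fraction of queries whose ground-truth image is ranked first. The sketch below shows one common way to compute Recall@K from embeddings; it assumes cosine similarity for ranking and that query \(i\) is paired with gallery item \(i\), and it is not taken from the paper's evaluation code.

```python
import numpy as np

def recall_at_k(query_feats, gallery_feats, k=1):
    """Fraction of queries whose paired gallery item appears in the top-k."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sim = q @ g.T                               # (num_queries, num_gallery)
    topk = np.argsort(-sim, axis=1)[:, :k]      # indices of the k best matches
    targets = np.arange(len(q))[:, None]        # query i is paired with item i
    return float((topk == targets).any(axis=1).mean())

gallery = np.eye(4)
queries = gallery.copy()
queries[[2, 3]] = queries[[3, 2]]  # queries 2 and 3 now point at the wrong image

r1 = recall_at_k(queries, gallery, k=1)
r4 = recall_at_k(queries, gallery, k=4)
print(r1, r4)  # 0.5 1.0
```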

Ablation Study: Prompt Token Design

| Configuration | Sketch R@1 | Art R@1 | Low-Res R@1 | Note |
| --- | --- | --- | --- | --- |
| Shallow Random + Deep Random | 68.1 | 63.5 | 78.8 | Randomly initialized baseline |
| Shallow StyleSpace + Deep Random | 76.7 | 69.1 | 82.4 | Shallow style injection |
| Shallow Random + Deep Gram | 76.8 | 69.2 | 81.8 | Deep texture injection |
| Shallow Gram + Deep StyleSpace | 78.1 | 69.5 | 84.4 | Reversed configuration |
| Shallow StyleSpace + Deep Gram | 80.6 | 71.4 | 86.4 | Best configuration |
| Best configuration, 1 token | 68.2 | 64.7 | 79.1 | Insufficient tokens |
| Best configuration, 2 tokens | 72.3 | 65.9 | 82.8 | Performance increases |
| Best configuration, 8 tokens | 77.9 | 67.1 | 80.7 | Too many tokens cause degradation |

Computational Efficiency Comparison

| Method | Parameters | Inference Time |
| --- | --- | --- |
| CLIP | 427M | 68ms |
| ImageBind | 1200M | 372ms |
| FreestyleRet-CLIP | 476M (+29M) | 96ms (+28ms) |
| FreestyleRet-BLIP | 940M (+29M) | 101ms (+39ms) |

FreestyleRet supports multi-style retrieval with only 29M additional parameters and roughly 28 to 39 ms of extra inference time.

Key Findings

  • Mutual Benefit of Multi-style Queries: In FreestyleRet, the Text R@1 of a joint sketch+text query rises from 69.9% to 82.5% (+12.6 points), whereas joint querying with CLIP/BLIP degrades performance.
  • Convergence in 5-10 Epochs: FreestyleRet converges significantly faster than baseline models (which require 50+ epochs), taking only 4 minutes per epoch.
  • Style Space Initialization Outperforms Random Initialization: The ablation study shows that the best configuration (Shallow StyleSpace + Deep Gram) improves Sketch R@1 by 12.5 points over random initialization (80.6 vs. 68.1).

Highlights & Insights

  • New Task Definition: Systematically defines the style-diversified QBIR task for the first time and proposes two evaluation datasets, DSR and ImageNet-X, laying a foundation for this research direction.
  • Cross-domain Application of Gram Matrix: Carries the Gram matrix, a texture representation from style transfer, over to retrieval, where it separates query styles without manual style labels.
  • Plug-and-play Design: Freezing encoders and only training prompt tokens allows the framework to seamlessly adapt to any ViT-based encoder, such as CLIP and BLIP.
  • Multi-query Synergy: Queries of different styles reinforce rather than interfere with each other, whereas such interference remains a bottleneck for current multi-modal models.

Limitations & Future Work

  • Limited Query Styles: Currently only four styles—text, sketch, art, and low-res—are studied, leaving out other modalities such as 3D, video, and audio.
  • Small Scale of DSR Dataset: The dataset contains only 10,000 images, and performance under large-scale scenarios remains to be validated.
  • Underutilized Text Queries: The text branch directly employs the CLIP text encoder without style-aware adaptation like the visual branch.
  • Fixed Number of Styles (K=4) in K-Means: Introducing a new query style requires re-clustering, highlighting a limitation in flexibility.

Related Work & Insights

  • VPT (Visual Prompt Tuning): First introduced prompt tuning to vision models; this work innovates style-initialization strategies on top of it.
  • CoOP / CoCoOP: Classic works on text prompt learning, inspiring the design of learnable prompts.
  • Gram Matrix (Gatys et al.): The core representation tool in style transfer, creatively utilized here for style space construction.
  • ImageBind / LanguageBind: Multi-modal unified models, which lack fine-grained modeling of style discrepancies.
  • Insight: The prompt initialization strategy can be extended to other downstream tasks requiring conditional adaptation.

Rating

  • Novelty: ⭐⭐⭐⭐ A pioneering combination of a new task, a new dataset, and Gram-matrix-initialized prompts.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation with two datasets, multiple baseline comparisons, detailed ablation studies, and qualitative analyses.
  • Writing Quality: ⭐⭐⭐⭐ Elegant illustrations, clear motivation, and logically structured progression across the three modules.
  • Value: ⭐⭐⭐⭐ Lays out a new task and provides robust baselines, promising a long-term impact on the retrieval community.