GTA-CLIP: Generate, Transduct, Adapt — Iterative Transduction with VLMs

Conference: ICCV 2025
arXiv: 2501.06031
Code: https://github.com/cvl-umass/GTA-CLIP
Area: Multimodal VLM
Keywords: Transductive Learning, Zero-shot Classification, CLIP, Attribute Generation, VLM

TL;DR

This paper proposes GTA-CLIP, which iteratively executes three steps — LLM-based attribute generation, attribute-enhanced transductive inference, and encoder fine-tuning — achieving an average zero-shot improvement of 9.5% and few-shot improvement of 3–4% across 12 datasets, and for the first time unifying attribute discovery, transductive inference, and model adaptation in a zero-label setting.

Background & Motivation

Vision-language models such as CLIP enable zero-shot classification, yet their accuracy often falls short of practical requirements (e.g., ecologists batch-classifying species photographs). Transductive inference leverages image-to-image similarity across an entire unlabeled dataset to improve classification, as demonstrated by TransCLIP. However, existing methods overlook the rich structure in the language space:

  • Using only the "a photo of [class]" template results in overly simplistic class prototypes.
  • Semantically similar classes are easily confused, lacking fine-grained discriminative attributes.
  • Transductive inference and model adaptation are performed independently, preventing mutual reinforcement.

Core insight: Combining LLM-generated discriminative attributes, transductive inference, and CLIP encoder fine-tuning forms a closed loop in which each component reinforces the others — better attributes yield more accurate transductive inference, and more accurate pseudo-labels yield better fine-tuning.

Method

Overall Architecture

GTA-CLIP maintains three sets of variables: an attribute set \(\mathcal{A}\) (a group of textual attributes per class), GMM parameters \(\mu, \Sigma\), and a soft assignment matrix \(\mathbf{z} \in [0,1]^{N \times M}\). The algorithm iterates for \(T=30\) rounds, executing Generate → Transduct → Adapt in each round.
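
In code, the outer loop might look like the following minimal sketch; the step functions are passed in as callables because this is an illustrative skeleton under assumed interfaces, not the authors' API:

```python
def gta_clip_loop(clip_model, images, class_names,
                  generate_attributes, transduct, adapt, T=30):
    """Hypothetical outer loop of GTA-CLIP (a sketch, not the authors' code).

    generate_attributes / transduct / adapt are stand-ins for the three steps.
    """
    # Seed with template prompts; the paper initializes with LLM-generated attributes.
    attributes = {c: [f"a photo of a {c}"] for c in class_names}
    z = None  # (N, M) soft assignments, set by the first Transduct pass

    for _ in range(T):  # T = 30 rounds in the paper
        attributes = generate_attributes(attributes, z)            # Generate
        z, mu, Sigma = transduct(clip_model, images, attributes)   # Transduct
        clip_model = adapt(clip_model, images, attributes, z)      # Adapt
    return clip_model, z, attributes
```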

The overall objective function is:

\[
\mathcal{L} = -\frac{1}{N}\sum_i \mathbf{z}_i^\top \log(\mathbf{p}_i) \;-\; \sum_{i,j} w_{i,j}\,\mathbf{z}_i^\top \mathbf{z}_j \;+\; \sum_i \mathrm{KL}_\lambda(\mathbf{z}_i \,\|\, \hat{\mathbf{y}}_i)
\]

The three terms correspond respectively to: the GMM clustering objective, Laplacian regularization (encouraging similar images to have similar predictions), and KL alignment with text-based predictions.
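
As a rough illustration, the three terms can be written out as below, assuming tensors `z` (N×M soft assignments), `log_p` (N×M GMM log-likelihoods), `w` (N×N affinities), and `y_hat` (N×M text-based predictions); the \(\mathrm{KL}_\lambda\) weighting follows TransCLIP and is simplified to a scalar here:

```python
import torch

def gta_clip_objective(z, log_p, w, y_hat, lam=1.0):
    """Hedged sketch of the objective above; shapes and lam are assumptions."""
    eps = 1e-12
    gmm_term = -(z * log_p).sum(dim=1).mean()              # -1/N sum_i z_i^T log p_i
    laplacian_term = -torch.einsum("ij,ik,jk->", w, z, z)  # -sum_{i,j} w_ij z_i^T z_j
    kl_term = lam * (z * ((z + eps).log() - (y_hat + eps).log())).sum()  # KL(z_i || y_hat_i)
    return gmm_term + laplacian_term + kl_term
```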

Key Designs

  1. Generate — Confusion-Driven Attribute Generation:

    • Initial attributes are generated by an LLM as descriptive text per class (e.g., "A bird with a small, round body shape").
    • After running transductive inference, the most easily confused class pairs are identified: each sample whose top-2 class probabilities in \(\mathbf{z}\) differ by at most \(\alpha=0.1\) votes for its top-2 classes as a confused pair (see the sketch after this list).
    • An LLM (Llama-3.1 or GPT-4o) is then prompted to generate discriminative attributes for each confused pair, e.g., "Provide additional attributes for [class1] which can help distinguish it from [class2]."
    • Attributes are generated only for high-frequency confusion pairs (occurring more than \(\beta\) times) to ensure computational feasibility.
    • Design motivation: This mimics classical pairwise discriminative attribute discovery in computer vision; the attribute space grows incrementally rather than being fixed upfront.
  2. Transduct — Attribute-Enhanced Transductive Inference:

    • Text-based predictions \(\hat{\mathbf{y}}_i\) are obtained by computing, for each class \(j\), the average similarity between the image and that class's attribute embeddings: \(\bar{s}_{i,j} = \frac{1}{n_j}\sum_k \theta(\mathbf{x}_i)^\top \phi(\mathbf{a}_{j,k})\).
    • TransCLIP's Block Majorize-Minimization algorithm is used to optimize \(\mathbf{z}, \mu, \Sigma\).
    • Inter-image affinity is defined as \(w_{i,j} = \max(0, \mathbf{f}_i^\top \mathbf{f}_j)\), ensuring positive semi-definiteness and efficient optimization.
    • Design motivation: Attribute enhancement allows the KL term to more accurately reflect class-discriminative information.
  3. Adapt — Pseudo-Label-Based CLIP Fine-Tuning:

    • For each class \(j\), the top-\(k=8\) images with the highest values in \(\mathbf{z}_{\cdot,j}\) are selected as high-confidence samples.
    • The CLIP encoder (\(\theta, \phi\)) is fine-tuned end-to-end using the AdaptCLIPZS objective function.
    • Both class-level supervision and false negatives (multiple valid image-text pairs within a mini-batch) are accounted for.
    • Design motivation: Pseudo-labels from transductive inference combined with attributes serve as a weak supervision signal for adapting CLIP without any human annotation.
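
The core computations of the three steps can be sketched as follows; all shapes, names, and the placeholder value of \(\beta\) are illustrative assumptions, not the authors' implementation:

```python
import torch
from collections import Counter

def attribute_scores(img_feats, attr_feats_per_class):
    """Transduct: attribute-averaged similarity (the s-bar term above).

    img_feats: (N, D) normalized image embeddings theta(x_i);
    attr_feats_per_class: list of M tensors, each (n_j, D), the phi(a_{j,k}).
    Averaging dot products equals a dot with the mean attribute embedding.
    """
    protos = torch.stack([a.mean(dim=0) for a in attr_feats_per_class])  # (M, D)
    return img_feats @ protos.T                                          # (N, M)

def image_affinity(img_feats):
    """Transduct: w_{i,j} = max(0, f_i^T f_j) over normalized image features."""
    return (img_feats @ img_feats.T).clamp_min(0.0)

def mine_confused_pairs(z, alpha=0.1, beta=5):
    """Generate: class pairs the current z confuses. A sample votes for its
    top-2 class pair when the two probabilities are within alpha; pairs with
    more than beta votes go to the LLM (beta=5 is a placeholder value)."""
    top2_vals, top2_idx = z.topk(2, dim=1)
    close = (top2_vals[:, 0] - top2_vals[:, 1]) <= alpha
    votes = Counter(tuple(sorted(p.tolist())) for p in top2_idx[close])
    return [pair for pair, n in votes.items() if n > beta]

def topk_pseudo_labels(z, k=8):
    """Adapt: per class, the k images with the highest soft assignment."""
    return z.topk(k, dim=0).indices  # (k, M) image indices, one column per class
```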

Loss & Training

  • CLIP fine-tuning uses the AdamW optimizer with betas = (0.9, 0.98); learning rates are set to 2e-7 for Transformer layers and 1e-6 for projection layers (a sketch follows this list).
  • All hyperparameters are fixed across all 12 datasets.
  • The method iterates for \(T=30\) rounds, with total runtime only 10–20 minutes longer than vanilla CLIP inference.
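
A minimal sketch of this optimizer setup, assuming a name-based split between projection and Transformer parameters (the paper does not specify how the groups are selected):

```python
import torch

def build_optimizer(clip_model):
    """Optimizer with the reported hyperparameters; the "proj" name filter
    is an assumption about how the two parameter groups are separated."""
    proj, backbone = [], []
    for name, p in clip_model.named_parameters():
        if not p.requires_grad:
            continue
        (proj if "proj" in name else backbone).append(p)
    return torch.optim.AdamW(
        [{"params": backbone, "lr": 2e-7},   # Transformer layers
         {"params": proj, "lr": 1e-6}],      # projection layers
        betas=(0.9, 0.98),
    )
```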

Key Experimental Results

Main Results (Zero-shot, 12 Datasets)

Method       | CUB   | Aircraft | Cars  | EuroSAT | ImageNet | Avg. (B/16)
-------------|-------|----------|-------|---------|----------|------------
CLIP         | 55.20 | 24.75    | 65.38 | 47.69   | 66.72    | 64.35
TransCLIP-ZS | 62.23 | 26.88    | 68.87 | 65.42   | 70.38    | 69.80
GTA-CLIP     | 66.76 | 29.31    | 72.09 | 76.35   | 71.87    | 73.81

Gains: +9.46% over CLIP and +4.01% over TransCLIP (B/16). EuroSAT shows the largest improvement (+18.87% on B/32).

Ablation Study

Attributes | Transduct | Adapt | Avg. Acc. | Δ vs. CLIP
-----------|-----------|-------|-----------|-----------
None       | ✗         | ✗     | 60.56     | —
Static     | ✗         | ✗     | 61.59     | +1.03%
None       | ✓         | ✗     | 64.26     | +3.70%
Static     | ✓         | ✓     | 66.76     | +6.20%
Dynamic    | ✓         | ✓     | 67.52     | +6.96%

Key finding: The combination of dynamic attributes and Adapt performs best; generating attributes alone (without Adapt) yields limited gains; Adapt is critical for exploiting dynamic attributes effectively.

Key Findings

  • Zero-shot GTA-CLIP surpasses 1-shot TransCLIP, eliminating the need to annotate one sample per class.
  • Few-shot (4-shot, B/16): GTA-CLIP achieves 79.08% vs. TransCLIP's 75.22%, a gain of 3.86%.
  • Confused class pairs are highly consistent with those identified by fully supervised linear classifiers: 9 of the top-10 confused pairs discovered by GTA-CLIP fall within the top 10% most-confused pairs of the supervised model.
  • On CUB, only approximately 30 class pairs require LLM-based attribute expansion, with Llama-3.1 completing the task in under 10 minutes.
  • LLM-generated attributes often involve habitat, relative features, and other cues that complement missing discriminative information from the initial attributes.

Highlights & Insights

  • The closed-loop design of the unified framework is remarkably elegant: the Generate→Transduct→Adapt cycle forms a positive feedback loop.
  • High practical value: given only an unlabeled dataset and class names, classification accuracy is automatically improved without any annotation.
  • t-SNE visualizations of the attribute space clearly demonstrate how dynamic attributes improve inter-class separation.
  • Computational cost is manageable: GTA-CLIP processes CUB (12k images, 200 classes) on a single A100 GPU in only 12–20 minutes.

Limitations & Future Work

  • LLM hallucination risk: Generated attributes may be inaccurate, though no significant impact was observed in experiments.
  • Transductive setting constraint: All test images must be available simultaneously, making the method unsuitable for streaming scenarios.
  • The assumption that images within each class form a single Gaussian distribution may fail for classes with multimodal distributions.
  • Attribute space exploration relies on heuristics (confusion thresholds \(\alpha\), \(\beta\)) without theoretical guarantees.
  • Improvements on datasets such as Food101 are marginal (+0.14%), suggesting that attribute enhancement provides limited benefit in certain domains.

Related Work & Context

  • TransCLIP serves as the direct baseline; GTA-CLIP extends its framework with attribute enhancement and model adaptation.
  • AdaptCLIPZS provides the technical foundation for annotation-free CLIP fine-tuning.
  • Pairwise attribute discovery is inspired by classical computer vision literature (e.g., Parikh & Grauman).
  • The confusion-driven attribute discovery strategy is transferable to other scenarios requiring fine-grained class discrimination.

Rating

  • Novelty: ⭐⭐⭐⭐ — The three-step unified framework is novel, and the confusion-driven attribute generation is an elegant design.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 12 datasets × 3 encoders × 3 settings, with comprehensive ablations.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure with a good balance between formulations and intuitive explanations.
  • Value: ⭐⭐⭐⭐⭐ — Highly practical, with substantial zero-shot accuracy gains at low computational cost.