CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally

Conference: ICLR 2026 arXiv: 2502.03566 Code: GitHub Area: Robotics Keywords: CLIP, compositionality, bag-of-words, attribute-object binding, cross-modal alignment

TL;DR

Through linear probing experiments, this paper demonstrates that CLIP's bag-of-words (BoW) behavior does not stem from a lack of binding information in the encoders, but rather from a failure of cross-modal alignment. The paper proposes LABCLIP, which trains a single lightweight linear transformation to substantially recover attribute-object binding capability.

Background & Motivation

Background: CLIP is widely used as a foundational component of vision-language models; however, prior work (ARO, SugarCrepe, etc.) has shown that CLIP performs poorly on compositional understanding, often behaving like a BoW model that cannot distinguish "red cube and blue triangle" from "blue cube and red triangle."

Limitations of Prior Work: Previous studies evaluated BoW behavior only at the cross-modal level (image-text matching), making it impossible to determine whether the problem originates from the encoders lacking binding information or from insufficient cross-modal alignment.

Key Challenge: If the problem lies in the encoders, retraining is required; if it resides only in the alignment, a lightweight adjustment suffices. Diagnosing the root cause has decisive implications for the direction of improvement.

Goal: To identify the fundamental cause of CLIP's BoW behavior and propose a minimal-cost remedy accordingly.

Key Insight: The paper evaluates whether attribute-object binding information exists within each modality independently (uni-modally), for both image and text.

Core Idea: CLIP's uni-modal embeddings already encode correct attribute binding; the cross-modal alignment simply fails to preserve this information — and a single linear transformation is sufficient to fix this.

Method

Overall Architecture

A three-stage argument: (1) confirm that CLIP is cross-modally BoW → (2) demonstrate that it is not uni-modally BoW → (3) repair cross-modal alignment with a linear transformation.

Key Designs

  1. Uni-modal Linear Probing: For each object \(o \in \mathcal{O}\), an independent linear classifier is trained to predict that object's attribute from frozen CLIP embeddings: \(\text{image-probe}_o: f_{\text{image}}(\mathbf{x}^{\text{img}}) \mapsto a, \quad \text{text-probe}_o: f_{\text{text}}(\mathbf{x}^{\text{txt}}) \mapsto a\). On CLEVR, the image-side probe achieves 0.96 accuracy and the text-side probe 1.00 (random baseline: 0.12), confirming that binding information is linearly decodable (a minimal probing sketch follows this list).

  2. Multi-Object Robustness: As the number of objects in the scene increases, text probe accuracy remains above 0.8, while the image-side accuracy decreases from 0.9 to 0.6 but still far exceeds chance.

  3. Joint Search Experiment: In images containing distractors (e.g., a green sphere and a red cube), a linear classifier detects "incongruent" objects (e.g., a red sphere) with accuracy above 0.80 even with 35 objects in the scene, whereas zero-shot classification performs at chance — confirming that image embeddings are not purely BoW.

  4. LABCLIP: A linear transformation \(\mathbf{A} \in \mathbb{R}^{D \times D}\) is applied to the text embeddings, so similarity is computed as \(\langle f_{\text{image}}(\mathbf{x}^{\text{img}}), \mathbf{A} f_{\text{text}}(\mathbf{x}^{\text{txt}}) \rangle\). Initialized to the identity matrix, \(\mathbf{A}\) is trained via contrastive learning with negatives generated by permuting attribute-object pairs; training is more than 100× faster than NegCLIP.
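
For illustration, here is a minimal sketch of the per-object probe from design (1). It assumes frozen CLIP embeddings have already been extracted and cached as arrays; the variable names (img_emb, attr_labels) and the object list are illustrative, not taken from the paper.

```python
# Minimal sketch of a per-object uni-modal attribute probe.
# Assumption: frozen CLIP embeddings are precomputed; img_emb[o] holds the
# (N, D) image embeddings of scenes containing object o, and attr_labels[o]
# the index of that object's attribute in each scene.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_attribute_probe(embeddings: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Fit a linear classifier that decodes one object's attribute from
    frozen CLIP embeddings (the same recipe works on the image or text side)."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(embeddings, labels)
    return probe

# Illustrative usage: one independent probe per object category.
# objects = ["cube", "sphere", "cylinder"]  # CLEVR-style shapes (illustrative)
# probes = {o: train_attribute_probe(img_emb[o], attr_labels[o]) for o in objects}
# accuracy = probes["cube"].score(img_emb_test["cube"], attr_labels_test["cube"])
```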

Loss & Training

  • LABCLIP is trained with a contrastive loss; for each batch, negative captions with swapped attribute-object pairs are appended, yielding a \(B \times 2B\) image-text similarity matrix (see the sketch after this list).
  • The CLIP encoders are fully frozen; only the \(D \times D\) matrix is trained (262K parameters for ViT-B/32, compared to 151M for NegCLIP).
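
Below is a minimal PyTorch sketch of this training setup, assuming the frozen image and text embeddings are precomputed and l2-normalized; the temperature, learning rate, and tensor names are illustrative rather than the paper's exact choices.

```python
# Sketch of LABCLIP-style training: only the D x D matrix A is learned.
# Assumption: img is a (B, D) batch of frozen image embeddings, txt_pos the
# matching caption embeddings, txt_neg the same captions with attribute-object
# pairs swapped (all rows l2-normalized).
import torch
import torch.nn.functional as F

D = 512                                   # embedding dim of CLIP ViT-B/32
A = torch.nn.Parameter(torch.eye(D))      # identity initialization
optimizer = torch.optim.Adam([A], lr=1e-4)
temperature = 0.07                        # illustrative value

def labclip_loss(img, txt_pos, txt_neg):
    # Transform all 2B text embeddings with A and re-normalize.
    txt = torch.cat([txt_pos, txt_neg], dim=0) @ A.T    # (2B, D)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.T / temperature                   # (B, 2B) similarity matrix
    targets = torch.arange(img.size(0))                  # i-th image matches i-th caption
    return F.cross_entropy(logits, targets)

# One training step over a cached batch of embeddings:
# loss = labclip_loss(img, txt_pos, txt_neg)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```

Because only \(\mathbf{A}\) is updated and it acts on the text side, image features that are already cached (e.g., in a vector database) can be reused unchanged, which is what makes the fix backward compatible.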

Key Experimental Results

Main Results

Cross-modal binding accuracy on synthetic datasets:

Model CLEVR PUG:SPAR PUG:SPARE
CLIP (chance level) 0.58 0.53 0.50
LABCLIP 0.95 0.97 0.94
CLIP-FT (upper bound) 1.00 1.00 1.00

Real-world benchmarks (ARO + SugarCrepe):

Model VG-A VG-R Replace Swap COCO R@1
CLIP 0.63 0.63 0.80 0.62 0.30
NegCLIP 0.71 0.81 0.85 0.75 0.41
LABCLIP 0.69 0.82 0.82 0.74 0.41

Ablation Study

Linear probe weight similarity (before vs. after alignment):

Dataset Pre-alignment cos-sim Post-alignment cos-sim
CLEVR 0.20 0.75
PUG:SPAR 0.18 0.78
PUG:SPARE 0.09 0.65

Key Findings

  • Linear probes on a deliberately BoW-trained CLIP reach only 0.66/0.85 accuracy, confirming that a purely BoW representation indeed lacks binding information.
  • LABCLIP with only 262K parameters matches the compositional reasoning performance of NegCLIP (151M parameters).
  • The linear transformation raises the cross-modal cosine similarity of probe weights from ~0.15 to ~0.70, confirming that alignment genuinely recovers binding (a sketch of this check follows the list).
  • LABCLIP exhibits a slight drop on downstream single-object classification (CIFAR, ImageNet), indicating a trade-off between binding and coarse-grained recognition.
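
As a rough illustration of how this probe-weight check can be computed, the snippet below averages cosine similarities between matched probe weight vectors; how the paper pairs image-side and post-alignment text-side probes (here assumed to be re-trained on \(\mathbf{A}\)-transformed text embeddings and row-aligned by object) is an assumption of this sketch.

```python
# Sketch of the probe-weight similarity check.
# Assumption: img_probe_w and txt_probe_w are (num_probes, D) weight matrices
# whose rows correspond to the same (object, attribute) probes on each side.
import torch
import torch.nn.functional as F

def mean_probe_cossim(img_probe_w: torch.Tensor, txt_probe_w: torch.Tensor) -> float:
    """Average cosine similarity between matched image/text probe weight vectors."""
    return F.cosine_similarity(img_probe_w, txt_probe_w, dim=-1).mean().item()

# Reported effect: roughly 0.15 before alignment vs. 0.70 after (see the ablation table above).
```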

Highlights & Insights

  • Diagnostic Insight: The BoW problem is precisely re-localized: the issue is not that CLIP's encoders lack binding information, but that cross-modal alignment fails to preserve it. This reshapes the community's understanding of CLIP's capabilities.
  • Minimal Fix: The linear transformation is both effective and practical — it does not require re-extracting features from vector databases and is backward compatible.
  • Methodological Contribution: The paper introduces the PUG:SPARE dataset (a positional-bias-free variant of PUG:SPAR), providing a more rigorous evaluation protocol.
  • Theoretical Completeness: The logical chain from linear probing → multi-object robustness → joint search → cross-modal repair is complete and coherent.

Limitations & Future Work

  • Experiments validating uni-modal binding are conducted primarily on synthetic datasets; uni-modal analysis in real-world scenarios is insufficient.
  • Only attribute-object binding is studied; other compositional tasks such as spatial relations, negation, and counting are not addressed.
  • LABCLIP exhibits slight degradation on single-object classification, revealing a trade-off between binding and coarse-grained recognition.
  • Evaluation is limited to ViT-B/32; the consistency of findings for larger CLIP models (ViT-L/14, ViT-H) has not been confirmed.
  • Negative samples are constructed via simple noun/adjective shuffling, which may be insufficient for complex linguistic structures.
  • The effectiveness of LABCLIP on generative tasks such as text-to-image generation remains unexplored.
Related Work & Implications

  • The paper responds to the BoW findings of Yuksekgonul et al. (2023) and offers a more precise diagnosis.
  • NegCLIP repairs the issue by fine-tuning 151M parameters, whereas LABCLIP achieves comparable performance through a post-hoc transformation with only 262K parameters.
  • Modality gap literature: LABCLIP can be viewed as a targeted approach to reducing the modality gap with respect to binding-relevant information.
  • Compared to Lewis et al. (2024), who test binding together with compositional generalization, this paper focuses specifically on the pure binding problem, localizing the root cause more precisely.
  • Implication: Useful information concealed by "alignment" may exist in large pretrained models, warranting more careful layer-wise diagnosis.
  • Implications for downstream VLMs (e.g., text-to-image generation, image editing): similar linear alignment techniques could be applied to improve compositional understanding.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Precise diagnosis and a counterintuitive finding (CLIP is not uni-modally BoW), reshaping understanding of CLIP's capabilities
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Synthetic and real datasets, probing + search + repair, rigorous multi-angle validation
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear logical progression from diagnosis to remedy, coherent throughout
  • Value: ⭐⭐⭐⭐ Significant for understanding and improving VLM compositionality, with strong practical utility