CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally¶
Conference: ICLR 2026 | arXiv: 2502.03566 | Code: GitHub | Area: Robotics | Keywords: CLIP, compositionality, bag-of-words, attribute-object binding, cross-modal alignment
TL;DR¶
Through linear probing experiments, this paper demonstrates that CLIP's bag-of-words (BoW) behavior does not stem from a lack of binding information in the encoders, but rather from a failure of cross-modal alignment. The paper proposes LABCLIP, which trains a single lightweight linear transformation to substantially recover attribute-object binding capability.
Background & Motivation¶
Background: CLIP is widely used as a foundational component of vision-language models; however, prior work (ARO, SugarCrepe, etc.) has shown that CLIP performs poorly on compositional understanding, often behaving like a BoW model that cannot distinguish "red cube and blue triangle" from "blue cube and red triangle."
Limitations of Prior Work: Previous studies evaluated BoW behavior only at the cross-modal level (image-text matching), making it impossible to determine whether the problem originates from the encoders lacking binding information or from insufficient cross-modal alignment.
Key Challenge: If the problem lies in the encoders, retraining is required; if it resides only in the alignment, a lightweight adjustment suffices. Diagnosing the root cause has decisive implications for the direction of improvement.
Goal: To identify the fundamental cause of CLIP's BoW behavior and propose a minimal-cost remedy accordingly.
Key Insight: The paper evaluates whether attribute-object binding information exists within each modality independently (uni-modally), for both image and text.
Core Idea: CLIP's uni-modal embeddings already encode correct attribute binding; the cross-modal alignment simply fails to preserve this information — and a single linear transformation is sufficient to fix this.
Method¶
Overall Architecture¶
A three-stage argument: (1) confirm that CLIP is cross-modally BoW → (2) demonstrate that it is not uni-modally BoW → (3) repair cross-modal alignment with a linear transformation.
Key Designs¶
- Uni-modal Linear Probing: For each object \(o \in \mathcal{O}\), an independent linear classifier is trained to predict that object's attribute from frozen CLIP embeddings: \(\text{image-probe}_o: f_{\text{image}}(\mathbf{x}^{\text{img}}) \mapsto a, \quad \text{text-probe}_o: f_{\text{text}}(\mathbf{x}^{\text{txt}}) \mapsto a\). On CLEVR, the image-side probe achieves 0.96 accuracy and the text-side probe achieves 1.00 (random baseline: 0.12), confirming that binding information is linearly decodable (a minimal probing sketch follows this list).
- Multi-Object Robustness: As the number of objects in the scene increases, text-probe accuracy remains above 0.8, while image-probe accuracy decreases from 0.9 to 0.6 but still far exceeds chance.
- Joint Search Experiment: In images containing distractors (e.g., green sphere + red cube), a linear classifier detects the "incongruent" object (e.g., a red sphere) with accuracy above 0.80 even with 35 objects in the scene, whereas zero-shot classification performs at chance, confirming that image embeddings are not purely BoW.
- LABCLIP: A linear transformation \(\mathbf{A} \in \mathbb{R}^{D \times D}\) is applied to the text embeddings, so image-text similarity becomes \(\langle f_{\text{image}}(\mathbf{x}^{\text{img}}), \mathbf{A} f_{\text{text}}(\mathbf{x}^{\text{txt}}) \rangle\). Initialized to the identity matrix, \(\mathbf{A}\) is trained via contrastive learning with negatives generated by permuting attribute-object pairs. Training is more than 100× faster than NegCLIP.
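To make the probing setup concrete, here is a minimal sketch assuming the frozen CLIP features are already extracted; the variable names (`img_embs`, `attrs`, `objects`) are illustrative rather than taken from the paper's code.

```python
# Minimal sketch of uni-modal linear probing on frozen CLIP features.
# Assumes embeddings are precomputed; names below are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_attribute_probe(embs: np.ndarray, attrs: np.ndarray) -> LogisticRegression:
    """Fit a linear classifier that predicts one object's attribute
    (e.g. the cube's color) from frozen CLIP embeddings."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(embs, attrs)          # embs: (N, D), attrs: (N,) attribute ids
    return probe

# One probe per object o, trained separately on image and text features:
# image_probes = {o: train_attribute_probe(img_embs[o], attrs[o]) for o in objects}
# text_probes  = {o: train_attribute_probe(txt_embs[o], attrs[o]) for o in objects}
# probe.score(test_embs, test_attrs) is the linear-decoding accuracy quoted above.
```

The same routine is run once per object on the image side and once on the text side, which is what produces the 0.96 / 1.00 accuracies quoted above.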
Loss & Training¶
- LABCLIP is trained with a contrastive loss; a negative text sample with swapped attribute-object pairs is added for each caption in the batch, so the contrastive logits form a \(B \times 2B\) image-text similarity matrix.
- The CLIP encoders are fully frozen; only the \(D \times D\) matrix is trained (262K parameters for ViT-B/32, compared to 151M for NegCLIP), as sketched below.
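As a rough illustration of how little machinery this involves, the following sketch trains only the \(D \times D\) matrix on precomputed CLIP embeddings; the learning rate, temperature, and batch handling are assumptions, not values from the paper.

```python
# Minimal sketch of LABCLIP-style training: only the D x D matrix A is learned,
# initialized to the identity; the CLIP encoders stay frozen. Batches of
# precomputed embeddings are assumed (negatives come from captions with
# swapped attribute-object pairs, encoded by the frozen text encoder).
import torch
import torch.nn.functional as F

D = 512                                   # ViT-B/32 width; 512*512 ≈ 262K parameters
A = torch.nn.Parameter(torch.eye(D))      # identity initialization
optimizer = torch.optim.Adam([A], lr=1e-4)  # assumed learning rate
tau = 0.07                                  # assumed temperature

def labclip_loss(img_emb, txt_pos, txt_neg):
    """img_emb: (B, D) image features; txt_pos / txt_neg: (B, D) positive and
    attribute-swapped negative caption features. Text passes through A, so the
    logits form a B x 2B similarity matrix with true pairs on the diagonal."""
    txt_all = torch.cat([txt_pos, txt_neg], dim=0) @ A.T      # apply A to the text side
    logits = F.normalize(img_emb, dim=-1) @ F.normalize(txt_all, dim=-1).T / tau
    targets = torch.arange(img_emb.size(0))                   # i-th image matches i-th caption
    return F.cross_entropy(logits, targets)

# for img_emb, txt_pos, txt_neg in loader:   # batches of frozen-CLIP embeddings
#     loss = labclip_loss(img_emb, txt_pos, txt_neg)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Initializing \(\mathbf{A}\) at the identity means training starts exactly at vanilla CLIP behavior and only moves as far as the contrastive signal requires.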
Key Experimental Results¶
Main Results¶
Cross-modal binding accuracy on synthetic datasets:
| Model | CLEVR | PUG:SPAR | PUG:SPARE |
|---|---|---|---|
| CLIP (chance level) | 0.58 | 0.53 | 0.50 |
| LABCLIP | 0.95 | 0.97 | 0.94 |
| CLIP-FT (upper bound) | 1.00 | 1.00 | 1.00 |
Real-world benchmarks (ARO + SugarCrepe):
| Model | VG-A | VG-R | Replace | Swap | COCO R@1 |
|---|---|---|---|---|---|
| CLIP | 0.63 | 0.63 | 0.80 | 0.62 | 0.30 |
| NegCLIP | 0.71 | 0.81 | 0.85 | 0.75 | 0.41 |
| LABCLIP | 0.69 | 0.82 | 0.82 | 0.74 | 0.41 |
Ablation Study¶
Linear probe weight similarity between the image and text sides, before vs. after alignment (a sketch of this metric follows the table):
| Dataset | Pre-alignment cos-sim | Post-alignment cos-sim |
|---|---|---|
| CLEVR | 0.20 | 0.75 |
| PUG:SPAR | 0.18 | 0.78 |
| PUG:SPARE | 0.09 | 0.65 |
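A sketch of how this similarity can be computed (an assumed reading of the metric, not the paper's exact script): for each object, take the image-probe and text-probe weight matrices and average their per-class cosine similarities; for the post-alignment column, the text probes are retrained on \(\mathbf{A}\)-transformed text embeddings.

```python
# Sketch of the probe-weight similarity metric (assumed computation):
# mean per-class cosine similarity between image- and text-probe weights.
import numpy as np

def probe_weight_cos_sim(w_img: np.ndarray, w_txt: np.ndarray) -> float:
    """w_img, w_txt: (C, D) probe weight matrices for one object, one row per
    attribute class. 'Post-alignment' text probes are trained on A-mapped text embeddings."""
    w_img = w_img / np.linalg.norm(w_img, axis=1, keepdims=True)
    w_txt = w_txt / np.linalg.norm(w_txt, axis=1, keepdims=True)
    return float(np.mean(np.sum(w_img * w_txt, axis=1)))
```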
Key Findings¶
- Linear probing of a CLIP variant deliberately trained to behave as BoW yields only 0.66/0.85 accuracy, confirming that a genuinely BoW representation lacks binding information.
- LABCLIP with only 262K parameters matches the compositional reasoning performance of NegCLIP (151M parameters).
- The linear transformation raises cross-modal cosine similarity of probe weights from ~0.15 to ~0.70, confirming that alignment genuinely recovers binding.
- LABCLIP exhibits a slight drop on downstream single-object classification (CIFAR, ImageNet), indicating a trade-off between binding and coarse-grained recognition.
Highlights & Insights¶
- Diagnostic Insight: The paper relocates the BoW problem from "the CLIP encoders are insufficient" to "the cross-modal alignment is insufficient," reshaping the community's understanding of CLIP's capabilities.
- Minimal Fix: The linear transformation is both effective and practical — it does not require re-extracting features from vector databases and is backward compatible.
- Methodological Contribution: The paper introduces the PUG:SPARE dataset (a positional-bias-free variant of PUG:SPAR), providing a more rigorous evaluation protocol.
- Theoretical Completeness: The logical chain from linear probing → multi-object robustness → joint search → cross-modal repair is complete and coherent.
Limitations & Future Work¶
- Experiments validating uni-modal binding are conducted primarily on synthetic datasets; uni-modal analysis on real-world data remains limited.
- Only attribute-object binding is studied; other compositional tasks such as spatial relations, negation, and counting are not addressed.
- LABCLIP exhibits slight degradation on single-object classification, revealing a trade-off between binding and coarse-grained recognition.
- Evaluation is limited to ViT-B/32; the consistency of findings for larger CLIP models (ViT-L/14, ViT-H) has not been confirmed.
- Negative samples are constructed via simple noun/adjective shuffling, which may be insufficient for complex linguistic structures.
- The effectiveness of LABCLIP on generative tasks such as text-to-image generation remains unexplored.
Related Work & Insights¶
- The paper responds to Yuksekgonul et al. (2023)'s BoW findings and proposes a more precise diagnosis.
- NegCLIP repairs the issue by fine-tuning 151M parameters, whereas LABCLIP achieves comparable performance through a post-hoc transformation with only 262K parameters.
- Modality gap literature: LABCLIP can be viewed as a targeted approach to reducing the modality gap with respect to binding-relevant information.
- Compared to Lewis et al. (2024), who test binding together with compositional generalization, this paper focuses specifically on the pure binding problem, localizing the root cause more precisely.
- Implication: Large pretrained models may contain useful information that the alignment step obscures, warranting more careful layer-wise diagnosis.
- Implications for downstream VLMs (e.g., text-to-image generation, image editing): similar linear alignment techniques could be applied to improve compositional understanding.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Precise diagnosis and counterintuitive finding (CLIP is not BoW), reshaping understanding of CLIP
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Synthetic and real datasets, probing + search + repair, rigorous multi-angle validation
- Writing Quality: ⭐⭐⭐⭐⭐ Clear logical progression from diagnosis to remedy, coherent throughout
- Value: ⭐⭐⭐⭐ Significant for understanding and improving VLM compositionality, with strong practical utility