CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally¶
Conference: ICLR 2026 | arXiv: 2502.03566 | Code: GitHub | Area: Robotics | Keywords: CLIP, compositionality, bag-of-words, attribute-object binding, cross-modal alignment
TL;DR¶
Through linear probing experiments, this paper demonstrates that CLIP's bag-of-words (BoW) behavior does not stem from a lack of binding information in the encoders, but rather from a failure of cross-modal alignment. The paper proposes LABCLIP, which trains a single lightweight linear transformation to substantially recover attribute-object binding capability.
Background & Motivation¶
Background: CLIP is widely used as a foundational component of vision-language models; however, prior work (ARO, SugarCrepe, etc.) has shown that CLIP performs poorly on compositional understanding, often behaving like a BoW model that cannot distinguish "red cube and blue triangle" from "blue cube and red triangle."
Limitations of Prior Work: Previous studies evaluated BoW behavior only at the cross-modal level (image-text matching), making it impossible to determine whether the problem originates from the encoders lacking binding information or from insufficient cross-modal alignment.
Key Challenge: If the problem lies in the encoders, retraining is required; if it resides only in the alignment, a lightweight adjustment suffices. Diagnosing the root cause has decisive implications for the direction of improvement.
Goal: To identify the fundamental cause of CLIP's BoW behavior and propose a minimal-cost remedy accordingly.
Key Insight: The paper evaluates whether attribute-object binding information exists within each modality independently (uni-modally), for both image and text.
Core Idea: CLIP's uni-modal embeddings already encode correct attribute binding; the cross-modal alignment simply fails to preserve this information — and a single linear transformation is sufficient to fix this.
Method¶
Overall Architecture¶
A three-stage argument: (1) confirm that CLIP is cross-modally BoW → (2) demonstrate that it is not uni-modally BoW → (3) repair cross-modal alignment with a linear transformation.
Key Designs¶
- Uni-modal Linear Probing: For each object \(o \in \mathcal{O}\), an independent linear classifier is trained to predict that object's attribute from frozen CLIP embeddings: \(\text{image-probe}_o: f_{\text{image}}(\mathbf{x}^{\text{img}}) \mapsto a, \quad \text{text-probe}_o: f_{\text{text}}(\mathbf{x}^{\text{txt}}) \mapsto a\). On CLEVR, the image-side probe achieves 0.96 accuracy and the text-side probe achieves 1.00 (random baseline: 0.12), confirming that binding information is linearly decodable (a minimal probing sketch follows this list).
- Multi-Object Robustness: As the number of objects in the scene increases, text-probe accuracy remains above 0.8, while image-probe accuracy decreases from 0.9 to 0.6 but still far exceeds chance.
- Joint Search Experiment: In images containing distractors (e.g., green sphere + red cube), a linear classifier detects the "incongruent" object (e.g., a red sphere) with accuracy above 0.80 even with 35 objects in the scene, whereas zero-shot classification performs at chance, confirming that image embeddings are not purely BoW.
- LABCLIP: A linear transformation \(\mathbf{A} \in \mathbb{R}^{D \times D}\) is applied to the text embeddings, so image-text similarity becomes \(\langle f_{\text{image}}(\mathbf{x}^{\text{img}}), \mathbf{A} f_{\text{text}}(\mathbf{x}^{\text{txt}}) \rangle\). Initialized to the identity matrix, \(\mathbf{A}\) is trained via contrastive learning with negatives generated by permuting attribute-object pairs. Training is more than 100× faster than NegCLIP.
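To make the probing setup concrete, here is a minimal sketch assuming the frozen CLIP features are already extracted; the variable names (`img_embs`, `attrs`, `objects`) are illustrative rather than taken from the paper's code.

```python
# Minimal sketch of uni-modal linear probing on frozen CLIP features.
# Assumes embeddings are precomputed; names below are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_attribute_probe(embs: np.ndarray, attrs: np.ndarray) -> LogisticRegression:
    """Fit a linear classifier that predicts one object's attribute
    (e.g. the cube's color) from frozen CLIP embeddings."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(embs, attrs)          # embs: (N, D), attrs: (N,) attribute ids
    return probe

# One probe per object o, trained separately on image and text features:
# image_probes = {o: train_attribute_probe(img_embs[o], attrs[o]) for o in objects}
# text_probes  = {o: train_attribute_probe(txt_embs[o], attrs[o]) for o in objects}
# probe.score(test_embs, test_attrs) is the linear-decoding accuracy quoted above.
```

The same routine is run once per object on the image side and once on the text side, which is what produces the 0.96 / 1.00 accuracies quoted above.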
Loss & Training¶
- LABCLIP is trained with a contrastive loss; a negative text sample with swapped attribute-object pairs is added for each caption in the batch, so the contrastive logits form a \(B \times 2B\) image-text similarity matrix.
- The CLIP encoders are fully frozen; only the \(D \times D\) matrix is trained (262K parameters for ViT-B/32, compared to 151M for NegCLIP), as sketched below.
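As a rough illustration of how little machinery this involves, the following sketch trains only the \(D \times D\) matrix on precomputed CLIP embeddings; the learning rate, temperature, and batch handling are assumptions, not values from the paper.

```python
# Minimal sketch of LABCLIP-style training: only the D x D matrix A is learned,
# initialized to the identity; the CLIP encoders stay frozen. Batches of
# precomputed embeddings are assumed (negatives come from captions with
# swapped attribute-object pairs, encoded by the frozen text encoder).
import torch
import torch.nn.functional as F

D = 512                                   # ViT-B/32 width; 512*512 ≈ 262K parameters
A = torch.nn.Parameter(torch.eye(D))      # identity initialization
optimizer = torch.optim.Adam([A], lr=1e-4)  # assumed learning rate
tau = 0.07                                  # assumed temperature

def labclip_loss(img_emb, txt_pos, txt_neg):
    """img_emb: (B, D) image features; txt_pos / txt_neg: (B, D) positive and
    attribute-swapped negative caption features. Text passes through A, so the
    logits form a B x 2B similarity matrix with true pairs on the diagonal."""
    txt_all = torch.cat([txt_pos, txt_neg], dim=0) @ A.T      # apply A to the text side
    logits = F.normalize(img_emb, dim=-1) @ F.normalize(txt_all, dim=-1).T / tau
    targets = torch.arange(img_emb.size(0))                   # i-th image matches i-th caption
    return F.cross_entropy(logits, targets)

# for img_emb, txt_pos, txt_neg in loader:   # batches of frozen-CLIP embeddings
#     loss = labclip_loss(img_emb, txt_pos, txt_neg)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Initializing \(\mathbf{A}\) at the identity means training starts exactly at vanilla CLIP behavior and only moves as far as the contrastive signal requires.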
Key Experimental Results¶
Main Results¶
Cross-modal binding accuracy on synthetic datasets:
| Model | CLEVR | PUG:SPAR | PUG:SPARE |
|---|---|---|---|
| CLIP (chance level) | 0.58 | 0.53 | 0.50 |
| LABCLIP | 0.95 | 0.97 | 0.94 |
| CLIP-FT (upper bound) | 1.00 | 1.00 | 1.00 |
Real-world benchmarks (ARO + SugarCrepe):
| Model | VG-A | VG-R | Replace | Swap | COCO R@1 |
|---|---|---|---|---|---|
| CLIP | 0.63 | 0.63 | 0.80 | 0.62 | 0.30 |
| NegCLIP | 0.71 | 0.81 | 0.85 | 0.75 | 0.41 |
| LABCLIP | 0.69 | 0.82 | 0.82 | 0.74 | 0.41 |
Ablation Study¶
Linear probe weight similarity between the image and text sides, before vs. after alignment (a sketch of this metric follows the table):
| Dataset | Pre-alignment cos-sim | Post-alignment cos-sim |
|---|---|---|
| CLEVR | 0.20 | 0.75 |
| PUG:SPAR | 0.18 | 0.78 |
| PUG:SPARE | 0.09 | 0.65 |
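A sketch of how this similarity can be computed (an assumed reading of the metric, not the paper's exact script): for each object, take the image-probe and text-probe weight matrices and average their per-class cosine similarities; for the post-alignment column, the text probes are retrained on \(\mathbf{A}\)-transformed text embeddings.

```python
# Sketch of the probe-weight similarity metric (assumed computation):
# mean per-class cosine similarity between image- and text-probe weights.
import numpy as np

def probe_weight_cos_sim(w_img: np.ndarray, w_txt: np.ndarray) -> float:
    """w_img, w_txt: (C, D) probe weight matrices for one object, one row per
    attribute class. 'Post-alignment' text probes are trained on A-mapped text embeddings."""
    w_img = w_img / np.linalg.norm(w_img, axis=1, keepdims=True)
    w_txt = w_txt / np.linalg.norm(w_txt, axis=1, keepdims=True)
    return float(np.mean(np.sum(w_img * w_txt, axis=1)))
```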
Key Findings¶
- Linear probing of a CLIP variant deliberately trained to behave as BoW yields only 0.66/0.85 accuracy, confirming that a genuinely BoW representation lacks binding information.
- LABCLIP with only 262K parameters matches the compositional reasoning performance of NegCLIP (151M parameters).
- The linear transformation raises cross-modal cosine similarity of probe weights from ~0.15 to ~0.70, confirming that alignment genuinely recovers binding.
- LABCLIP exhibits a slight drop on downstream single-object classification (CIFAR, ImageNet), indicating a trade-off between binding and coarse-grained recognition.
Highlights & Insights¶
- Diagnostic Insight: The paper relocates the BoW problem from "the CLIP encoders are insufficient" to "the cross-modal alignment is insufficient," reshaping the community's understanding of CLIP's capabilities.
- Minimal Fix: The linear transformation is both effective and practical — it does not require re-extracting features from vector databases and is backward compatible.
- Methodological Contribution: The paper introduces the PUG:SPARE dataset (a positional-bias-free variant of PUG:SPAR), providing a more rigorous evaluation protocol.
- Theoretical Completeness: The logical chain from linear probing → multi-object robustness → joint search → cross-modal repair is complete and coherent.
Limitations & Future Work¶
- Experiments validating uni-modal binding are conducted primarily on synthetic datasets; uni-modal analysis on real-world data remains limited.
- Only attribute-object binding is studied; other compositional tasks such as spatial relations, negation, and counting are not addressed.
- LABCLIP exhibits slight degradation on single-object classification, revealing a trade-off between binding and coarse-grained recognition.
- Evaluation is limited to ViT-B/32; the consistency of findings for larger CLIP models (ViT-L/14, ViT-H) has not been confirmed.
- Negative samples are constructed via simple noun/adjective shuffling, which may be insufficient for complex linguistic structures.
- The effectiveness of LABCLIP on generative tasks such as text-to-image generation remains unexplored.
Related Work & Insights¶
- The paper responds to Yuksekgonul et al. (2023)'s BoW findings and proposes a more precise diagnosis.
- NegCLIP repairs the issue by fine-tuning 151M parameters, whereas LABCLIP achieves comparable performance through a post-hoc transformation with only 262K parameters.
- Modality gap literature: LABCLIP can be viewed as a targeted approach to reducing the modality gap with respect to binding-relevant information.
- Compared to Lewis et al. (2024), who test binding together with compositional generalization, this paper focuses specifically on the pure binding problem, localizing the root cause more precisely.
- Implication: Large pretrained models may contain useful information that the alignment step obscures, warranting more careful layer-wise diagnosis.
- Implications for downstream VLMs (e.g., text-to-image generation, image editing): similar linear alignment techniques could be applied to improve compositional understanding.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Precise diagnosis and counterintuitive finding (CLIP is not BoW), reshaping understanding of CLIP
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Synthetic and real datasets, probing + search + repair, rigorous multi-angle validation
- Writing Quality: ⭐⭐⭐⭐⭐ Clear logical progression from diagnosis to remedy, coherent throughout
- Value: ⭐⭐⭐⭐ Significant for understanding and improving VLM compositionality, with strong practical utility