READ: Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions¶
Conference: NeurIPS 2025 | arXiv: 2510.16540 | Code: Available | Area: Multimodal VLM / CLIP Improvement | Keywords: CLIP, compositional reasoning, text reconstruction, paraphrase alignment, contrastive learning
TL;DR¶
This paper proposes READ, a fine-tuning method that enhances the compositional reasoning capability of CLIP's text encoder via two auxiliary objectives: (1) token-level reconstruction, where a frozen decoder reconstructs alternative descriptions from text embeddings, and (2) sentence-level alignment, which enforces consistency among embeddings of paraphrases. READ achieves state-of-the-art performance on 5 compositional reasoning benchmarks, outperforming NegCLIP by 4.5% and FSC-CLIP by 4.1%.
Background & Motivation¶
Background: VLMs such as CLIP, trained with contrastive objectives, perform poorly on compositional reasoning — they fail to distinguish "horse eating grass" from "grass eating horse," as contrastive training encourages the text encoder to focus on individual tokens (aligned with visual objects) while neglecting inter-token relations.
Limitations of Prior Work: (1) Hard-negative-based methods (e.g., NegCLIP) may lead models to exploit shortcuts specific to the negative sample format rather than achieving genuine compositional understanding; (2) existing auxiliary objectives either act jointly on both image and text encoders, or exclusively on the image encoder, overlooking that the text encoder is the primary bottleneck for compositional reasoning; (3) auxiliary objectives specifically designed for the text encoder are lacking.
Key Challenge: The nature of contrastive training (image-text alignment) encourages the text encoder to produce bag-of-words representations, whereas compositional reasoning requires understanding of structural relationships among tokens.
Goal: Improve compositional reasoning in CLIP through auxiliary training objectives applied to the text encoder.
Key Insight: Two complementary objectives — reconstruction forces embeddings to retain inter-token relational information (otherwise alternative descriptions cannot be reconstructed), while alignment ensures that paraphrases with different surface forms yield consistent representations.
Core Idea: Compel the text encoder to encode inter-token relations via "alternative description reconstruction," and ensure semantic invariance via "paraphrase alignment."
Method¶
Overall Architecture¶
A weighted combination of three training losses: \(\mathcal{L} = \mathcal{L}_{Contrastive} + \lambda_1 \mathcal{L}_{Token\ Reconstruction} + \lambda_2 \mathcal{L}_{Sentence\ Alignment}\). The contrastive loss follows the standard CLIP formulation augmented with hard negatives; the two auxiliary losses act solely on the text encoder.
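As a minimal sketch of how the three terms combine (the loss values below are placeholder tensors and \(\lambda_1, \lambda_2\) are assumed weights, not the authors' settings; each term is sketched under Key Designs):

```python
import torch

# Placeholder per-batch loss values; in practice each term is computed as
# sketched under "Key Designs" below (this is not the authors' code).
loss_contrastive = torch.tensor(1.20)   # hard-negative-augmented CLIP contrastive loss
loss_token_rec   = torch.tensor(2.35)   # token-level reconstruction loss
loss_sent_align  = torch.tensor(0.87)   # sentence-level alignment loss

lambda1, lambda2 = 1.0, 1.0             # assumed weights; the paper's values may differ
total_loss = loss_contrastive + lambda1 * loss_token_rec + lambda2 * loss_sent_align
```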
Key Designs¶
- Token-Level Reconstruction Loss:
- Function: Given an original caption \(T_i\), the text encoder produces embedding \(v_i = f_T(T_i)\), which is projected via a learnable projection \(h_i = W^T v_i\) and passed to a frozen pretrained decoder \(\pi\) to reconstruct an alternative description \(\mathbf{y}_i^{(k)}\) of the same image (rather than the original caption).
- Mechanism: \(\mathcal{L}_{Token\ Rec} = -\frac{1}{BK}\sum_i\sum_k \log \pi(\mathbf{y}_i^{(k)} | h_i)\); the decoder is frozen, so only the text encoder and projection layer receive parameter updates (see the sketch after this item).
- Why reconstruct alternative descriptions rather than the original: Reconstructing the original caption causes the encoder to overfit to exact phrasing (e.g., memorizing the position of "the"), whereas reconstructing alternative descriptions compels the encoder to capture deeper semantic relations — understanding "horse eating grass" is necessary to reconstruct "a horse is feeding on the lawn."
- Design Motivation: Prior NLP work (MASS, RetroMAE) has demonstrated that encoder-decoder reconstruction objectives help encoders capture syntactic and semantic relations; this paper is the first to transfer this insight to VLMs.
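A minimal PyTorch sketch of the token-level reconstruction term, assuming a toy autoregressive decoder that conditions on a single prefix embedding; the module names, dimensions, and decoder interface are illustrative placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoder(nn.Module):
    """Toy stand-in for the frozen pretrained decoder pi."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, prefix, tokens):
        # prefix: (B, d_model) projected text embedding, used as the initial hidden state
        # tokens: (B, L) token ids of the alternative description (teacher forcing)
        h0 = prefix.unsqueeze(0)              # (1, B, d_model)
        x = self.embed(tokens)                # (B, L, d_model)
        out, _ = self.gru(x, h0)
        return self.lm_head(out)              # (B, L, vocab_size)

B, L, d_clip, d_dec, V = 4, 12, 512, 64, 1000
text_emb = torch.randn(B, d_clip, requires_grad=True)   # stand-in for v_i = f_T(T_i)
proj = nn.Linear(d_clip, d_dec)                          # learnable projection W
decoder = TinyDecoder(V, d_dec)
for p in decoder.parameters():                           # freeze the decoder
    p.requires_grad_(False)

alt_tokens = torch.randint(0, V, (B, L))                 # alternative description y_i^(k), K = 1 here
h = proj(text_emb)                                       # h_i = W^T v_i
logits = decoder(h, alt_tokens[:, :-1])                  # predict each next token of the alternative
loss_token_rec = F.cross_entropy(logits.reshape(-1, V), alt_tokens[:, 1:].reshape(-1))
loss_token_rec.backward()                                # decoder weights get no gradient; proj / encoder side does
```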
- Sentence-Level Alignment Loss:
- Function: Aligns the embedding of a paraphrase \(T_i'\) (semantically equivalent but differently worded) with that of the original description \(T_i\) in the embedding space.
- Mechanism: \(\mathcal{L}_{Sent\ Align} = -\frac{1}{B}\sum_i \log \frac{\phi(T_i, T_i')}{\sum_j \phi(T_i, T_j')}\), treating each paraphrase as the positive for its original caption and the other in-batch paraphrases as negatives (see the sketch after this item).
- Design Motivation: Compositional reasoning requires not only understanding intra-sentence relations but also recognizing semantic equivalence across different surface forms — e.g., "the dog chased the cat" and "the cat was pursued by the dog" should yield similar embeddings.
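A minimal sketch of the sentence-level alignment term as an in-batch contrastive loss over paraphrase pairs, assuming \(\phi\) is an exponentiated, temperature-scaled cosine similarity (a common choice; the paper's exact form of \(\phi\) may differ).

```python
import torch
import torch.nn.functional as F

B, d, tau = 8, 512, 0.07
emb_orig = F.normalize(torch.randn(B, d), dim=-1)    # stand-in for f_T(T_i), original captions
emb_para = F.normalize(torch.randn(B, d), dim=-1)    # stand-in for f_T(T_i'), their paraphrases

logits = emb_orig @ emb_para.t() / tau               # (B, B); softmax below realizes phi = exp(cos/tau)
targets = torch.arange(B)                            # T_i' is the positive for T_i; other paraphrases are negatives
loss_sent_align = F.cross_entropy(logits, targets)   # -1/B * sum_i log phi(T_i,T_i') / sum_j phi(T_i,T_j')
```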
- Hard-Negative-Augmented Contrastive Loss:
- Function: Incorporates hard negative captions into the image-to-text direction of the standard contrastive loss.
- Mechanism: \(M\) hard negatives \(\tilde{T}_i^{(m)}\) are generated via rule-based transformations (e.g., subject-object swapping, adjective exchange) and added to the softmax denominator to increase discriminative difficulty (see the sketch after this item).
- Design Motivation: Compatible with NegCLIP; the two auxiliary objectives in READ are additive to hard-negative-based methods.
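A minimal sketch of the hard-negative-augmented image-to-text contrastive term: \(M\) rule-generated negative captions per image are appended to the candidate pool, enlarging the denominator. Shapes, the temperature, and the negative-generation step are placeholders.

```python
import torch
import torch.nn.functional as F

B, M, d, tau = 8, 2, 512, 0.07
img      = F.normalize(torch.randn(B, d), dim=-1)        # stand-in image embeddings
txt      = F.normalize(torch.randn(B, d), dim=-1)        # positive captions f_T(T_i)
hard_neg = F.normalize(torch.randn(B * M, d), dim=-1)    # rule-generated negatives f_T(~T_i^(m))

candidates = torch.cat([txt, hard_neg], dim=0)           # (B + B*M, d) caption pool
logits = img @ candidates.t() / tau                      # (B, B + B*M)
targets = torch.arange(B)                                 # the matching caption remains the only positive
loss_i2t = F.cross_entropy(logits, targets)
```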
Complementarity Analysis¶
- Reconstruction objective: encourages the encoder to capture inter-token relations (requires understanding "who does what to whom" to reconstruct paraphrases).
- Alignment objective: encourages the encoder to maintain semantic consistency across different surface forms.
- The two are complementary: reconstruction provides fine-grained structural understanding; alignment provides coarse-grained semantic normalization.
Key Experimental Results¶
5 Compositional Reasoning Benchmarks¶
| Model | SugarCrepe | ARO | Winoground | VALSE | Cola | Average |
|---|---|---|---|---|---|---|
| CLIP (ViT-B/32) | 74.3 | 59.2 | 29.5 | 70.1 | 68.4 | 60.3 |
| NegCLIP | 80.1 | 66.8 | 32.0 | 73.2 | 72.5 | 64.9 |
| FSC-CLIP | 81.5 | 68.4 | 33.8 | 74.1 | 73.2 | 66.2 |
| READ-CLIP | 83.9 | 71.2 | 36.1 | 76.3 | 75.8 | 68.7 |
Stacking with Existing CLIP Variants¶
| Base Model | Avg. (standalone) | Avg. (+ READ) | Gain |
|---|---|---|---|
| NegCLIP | 64.9 | 67.3 | +2.4% |
| FSC-CLIP | 66.2 | 68.1 | +1.9% |
| DAC-CLIP | 65.7 | 67.9 | +2.2% |
Ablation Study¶
| Configuration | SugarCrepe | ARO |
|---|---|---|
| Contrastive loss only | 80.1 | 66.8 |
| + Token Reconstruction | 82.5 | 69.8 |
| + Sentence Alignment | 81.3 | 68.5 |
| + Both (READ) | 83.9 | 71.2 |
Key Findings¶
- The reconstruction objective contributes more (~2.4% gain); the alignment objective provides an additional ~1.4%.
- Reconstructing alternative descriptions substantially outperforms reconstructing original descriptions — the latter leads to overfitting to exact phrasing.
- READ functions as a plug-in that consistently provides 1.9–2.4% additional gains when stacked on top of existing CLIP improvements.
- Qualitative analysis shows that READ-CLIP text embeddings shift more strongly than baseline embeddings when inter-token relations change, indicating greater sensitivity to relational structure.
Highlights & Insights¶
- Elegant design of reconstructing alternative descriptions: Rather than reconstructing the original input (which causes overfitting), the model reconstructs synonymous but differently worded descriptions — compelling the encoder to understand deep semantics rather than surface form. This insight is simple yet profound.
- NLP reconstruction objectives → VLMs: The success of encoder-decoder reconstruction in NLP (MASS/RetroMAE) is transferred to vision-language models for the first time, bridging two research communities.
- Complementarity of the two objectives: Reconstruction → inter-token relations (fine-grained); alignment → semantic invariance (coarse-grained); their combination yields effects greater than the sum of the parts.
- Strong generality: The method is independent of specific hard-negative generation strategies and can be applied as a plug-in module.
Limitations & Future Work¶
- Requires alternative description / paraphrase data: Currently sourced from datasets such as COCO that provide multiple captions; additional generation is needed for datasets without such annotations.
- Decoder is fixed as a pretrained model: Joint training of the decoder has not been explored.
- Validated only on the CLIP architecture: Newer architectures such as SigLIP and CoCa remain untested.
- ViT-B/32 scale: The effectiveness on larger models (e.g., ViT-L/14) has not been thoroughly evaluated.
- Reconstruction objective increases training computation: Although the decoder's parameters are frozen and never updated, its forward and backward passes still incur additional overhead.
Related Work & Insights¶
- vs. NegCLIP: NegCLIP relies solely on hard negatives and is susceptible to shortcut learning; READ provides a more fundamental basis for compositional understanding through auxiliary objectives.
- vs. SF-CLIP: SF-CLIP uses masked distillation to supervise both image and text encoders jointly; READ focuses exclusively on the text encoder, which is the primary bottleneck.
- vs. DAC: DAC finds that well-aligned captions benefit compositional reasoning; READ further adds a reconstruction objective and achieves superior performance.
- vs. RetroMAE: RetroMAE trains encoders via reconstruction in NLP; READ transfers this idea to the multimodal setting and adopts alternative descriptions.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of alternative-description reconstruction and paraphrase alignment is novel; the transfer from NLP to VLMs reflects genuine insight.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 5 benchmarks + stacking experiments with multiple CLIP variants + detailed ablations + qualitative analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Motivation is clearly derived, formulations are complete, and figures are intuitive.
- Value: ⭐⭐⭐⭐⭐ Substantial practical improvement in compositional reasoning for CLIP; the method is concise and reusable.