READ: Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions¶
Conference: NeurIPS 2025 | arXiv: 2510.16540 | Code: Available | Area: Multimodal VLM / CLIP Improvement | Keywords: CLIP, compositional reasoning, text reconstruction, paraphrase alignment, contrastive learning
TL;DR¶
This paper proposes READ, a fine-tuning method that enhances the compositional reasoning capability of CLIP's text encoder via two auxiliary objectives: (1) token-level reconstruction, where a frozen decoder reconstructs alternative descriptions from text embeddings, and (2) sentence-level alignment, which enforces consistency among embeddings of paraphrases. READ achieves state-of-the-art performance on 5 compositional reasoning benchmarks, outperforming NegCLIP by 4.5% and FSC-CLIP by 4.1%.
Background & Motivation¶
Background: VLMs such as CLIP, trained with contrastive objectives, perform poorly on compositional reasoning — they fail to distinguish "horse eating grass" from "grass eating horse," as contrastive training encourages the text encoder to focus on individual tokens (aligned with visual objects) while neglecting inter-token relations.
Limitations of Prior Work: (1) Hard-negative-based methods (e.g., NegCLIP) may lead models to exploit shortcuts specific to the negative sample format rather than achieving genuine compositional understanding; (2) existing auxiliary objectives either act jointly on both image and text encoders, or exclusively on the image encoder, overlooking that the text encoder is the primary bottleneck for compositional reasoning; (3) auxiliary objectives specifically designed for the text encoder are lacking.
Key Challenge: The nature of contrastive training (image-text alignment) encourages the text encoder to produce bag-of-words representations, whereas compositional reasoning requires understanding of structural relationships among tokens.
Goal: Improve compositional reasoning in CLIP through auxiliary training objectives applied to the text encoder.
Key Insight: Two complementary objectives — reconstruction forces embeddings to retain inter-token relational information (otherwise alternative descriptions cannot be reconstructed), while alignment ensures that paraphrases with different surface forms yield consistent representations.
Core Idea: Compel the text encoder to encode inter-token relations via "alternative description reconstruction," and ensure semantic invariance via "paraphrase alignment."
Method¶
Overall Architecture¶
A weighted combination of three training losses: \(\mathcal{L} = \mathcal{L}_{Contrastive} + \lambda_1 \mathcal{L}_{Token\ Reconstruction} + \lambda_2 \mathcal{L}_{Sentence\ Alignment}\). The contrastive loss follows the standard CLIP formulation augmented with hard negatives; the two auxiliary losses act solely on the text encoder.
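As a minimal sketch of how the three terms combine (the loss values below are placeholder tensors and \(\lambda_1, \lambda_2\) are assumed weights, not the authors' settings; each term is sketched under Key Designs):

```python
import torch

# Placeholder per-batch loss values; in practice each term is computed as
# sketched under "Key Designs" below (this is not the authors' code).
loss_contrastive = torch.tensor(1.20)   # hard-negative-augmented CLIP contrastive loss
loss_token_rec   = torch.tensor(2.35)   # token-level reconstruction loss
loss_sent_align  = torch.tensor(0.87)   # sentence-level alignment loss

lambda1, lambda2 = 1.0, 1.0             # assumed weights; the paper's values may differ
total_loss = loss_contrastive + lambda1 * loss_token_rec + lambda2 * loss_sent_align
```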
Key Designs¶
- Token-Level Reconstruction Loss:
- Function: Given an original caption \(T_i\), the text encoder produces embedding \(v_i = f_T(T_i)\), which is projected via a learnable projection \(h_i = W^T v_i\) and passed to a frozen pretrained decoder \(\pi\) to reconstruct an alternative description \(\mathbf{y}_i^{(k)}\) of the same image (rather than the original caption).
- Mechanism: \(\mathcal{L}_{Token\ Rec} = -\frac{1}{BK}\sum_i\sum_k \log \pi(\mathbf{y}_i^{(k)} | h_i)\); the decoder is frozen, so only the text encoder and projection layer receive parameter updates (see the sketch after this item).
- Why reconstruct alternative descriptions rather than the original: Reconstructing the original caption causes the encoder to overfit to exact phrasing (e.g., memorizing the position of "the"), whereas reconstructing alternative descriptions compels the encoder to capture deeper semantic relations — understanding "horse eating grass" is necessary to reconstruct "a horse is feeding on the lawn."
- Design Motivation: Prior NLP work (MASS, RetroMAE) has demonstrated that encoder-decoder reconstruction objectives help encoders capture syntactic and semantic relations; this paper is the first to transfer this insight to VLMs.
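A minimal PyTorch sketch of the token-level reconstruction term, assuming a toy autoregressive decoder that conditions on a single prefix embedding; the module names, dimensions, and decoder interface are illustrative placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoder(nn.Module):
    """Toy stand-in for the frozen pretrained decoder pi."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, prefix, tokens):
        # prefix: (B, d_model) projected text embedding, used as the initial hidden state
        # tokens: (B, L) token ids of the alternative description (teacher forcing)
        h0 = prefix.unsqueeze(0)              # (1, B, d_model)
        x = self.embed(tokens)                # (B, L, d_model)
        out, _ = self.gru(x, h0)
        return self.lm_head(out)              # (B, L, vocab_size)

B, L, d_clip, d_dec, V = 4, 12, 512, 64, 1000
text_emb = torch.randn(B, d_clip, requires_grad=True)   # stand-in for v_i = f_T(T_i)
proj = nn.Linear(d_clip, d_dec)                          # learnable projection W
decoder = TinyDecoder(V, d_dec)
for p in decoder.parameters():                           # freeze the decoder
    p.requires_grad_(False)

alt_tokens = torch.randint(0, V, (B, L))                 # alternative description y_i^(k), K = 1 here
h = proj(text_emb)                                       # h_i = W^T v_i
logits = decoder(h, alt_tokens[:, :-1])                  # predict each next token of the alternative
loss_token_rec = F.cross_entropy(logits.reshape(-1, V), alt_tokens[:, 1:].reshape(-1))
loss_token_rec.backward()                                # decoder weights get no gradient; proj / encoder side does
```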
- Sentence-Level Alignment Loss:
- Function: Aligns the embedding of a paraphrase \(T_i'\) (semantically equivalent but differently worded) with that of the original description \(T_i\) in the embedding space.
- Mechanism: \(\mathcal{L}_{Sent\ Align} = -\frac{1}{B}\sum_i \log \frac{\phi(T_i, T_i')}{\sum_j \phi(T_i, T_j')}\), treating each paraphrase as the positive for its original caption and the other in-batch paraphrases as negatives (see the sketch after this item).
- Design Motivation: Compositional reasoning requires not only understanding intra-sentence relations but also recognizing semantic equivalence across different surface forms — e.g., "the dog chased the cat" and "the cat was pursued by the dog" should yield similar embeddings.
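A minimal sketch of the sentence-level alignment term as an in-batch contrastive loss over paraphrase pairs, assuming \(\phi\) is an exponentiated, temperature-scaled cosine similarity (a common choice; the paper's exact form of \(\phi\) may differ).

```python
import torch
import torch.nn.functional as F

B, d, tau = 8, 512, 0.07
emb_orig = F.normalize(torch.randn(B, d), dim=-1)    # stand-in for f_T(T_i), original captions
emb_para = F.normalize(torch.randn(B, d), dim=-1)    # stand-in for f_T(T_i'), their paraphrases

logits = emb_orig @ emb_para.t() / tau               # (B, B); softmax below realizes phi = exp(cos/tau)
targets = torch.arange(B)                            # T_i' is the positive for T_i; other paraphrases are negatives
loss_sent_align = F.cross_entropy(logits, targets)   # -1/B * sum_i log phi(T_i,T_i') / sum_j phi(T_i,T_j')
```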
- Hard-Negative-Augmented Contrastive Loss:
- Function: Incorporates hard negative captions into the image-to-text direction of the standard contrastive loss.
- Mechanism: \(M\) hard negatives \(\tilde{T}_i^{(m)}\) are generated via rule-based transformations (e.g., subject-object swapping, adjective exchange) and added to the softmax denominator to increase discriminative difficulty (see the sketch after this item).
- Design Motivation: Compatible with NegCLIP; the two auxiliary objectives in READ are additive to hard-negative-based methods.
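A minimal sketch of the hard-negative-augmented image-to-text contrastive term: \(M\) rule-generated negative captions per image are appended to the candidate pool, enlarging the denominator. Shapes, the temperature, and the negative-generation step are placeholders.

```python
import torch
import torch.nn.functional as F

B, M, d, tau = 8, 2, 512, 0.07
img      = F.normalize(torch.randn(B, d), dim=-1)        # stand-in image embeddings
txt      = F.normalize(torch.randn(B, d), dim=-1)        # positive captions f_T(T_i)
hard_neg = F.normalize(torch.randn(B * M, d), dim=-1)    # rule-generated negatives f_T(~T_i^(m))

candidates = torch.cat([txt, hard_neg], dim=0)           # (B + B*M, d) caption pool
logits = img @ candidates.t() / tau                      # (B, B + B*M)
targets = torch.arange(B)                                 # the matching caption remains the only positive
loss_i2t = F.cross_entropy(logits, targets)
```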
Complementarity Analysis¶
- Reconstruction objective: encourages the encoder to capture inter-token relations (requires understanding "who does what to whom" to reconstruct paraphrases).
- Alignment objective: encourages the encoder to maintain semantic consistency across different surface forms.
- The two are complementary: reconstruction provides fine-grained structural understanding; alignment provides coarse-grained semantic normalization.
Key Experimental Results¶
5 Compositional Reasoning Benchmarks¶
| Model | SugarCrepe | ARO | Winoground | VALSE | Cola | Average |
|---|---|---|---|---|---|---|
| CLIP (ViT-B/32) | 74.3 | 59.2 | 29.5 | 70.1 | 68.4 | 60.3 |
| NegCLIP | 80.1 | 66.8 | 32.0 | 73.2 | 72.5 | 64.9 |
| FSC-CLIP | 81.5 | 68.4 | 33.8 | 74.1 | 73.2 | 66.2 |
| READ-CLIP | 83.9 | 71.2 | 36.1 | 76.3 | 75.8 | 68.7 |
Stacking with Existing CLIP Variants¶
| Base Model | Avg. (standalone) | Avg. (+ READ) | Gain |
|---|---|---|---|
| NegCLIP | 64.9 | 67.3 | +2.4% |
| FSC-CLIP | 66.2 | 68.1 | +1.9% |
| DAC-CLIP | 65.7 | 67.9 | +2.2% |
Ablation Study¶
| Configuration | SugarCrepe | ARO |
|---|---|---|
| Contrastive loss only | 80.1 | 66.8 |
| + Token Reconstruction | 82.5 | 69.8 |
| + Sentence Alignment | 81.3 | 68.5 |
| + Both (READ) | 83.9 | 71.2 |
Key Findings¶
- The reconstruction objective contributes more (~2.4% gain); the alignment objective provides an additional ~1.4%.
- Reconstructing alternative descriptions substantially outperforms reconstructing original descriptions — the latter leads to overfitting to exact phrasing.
- READ functions as a plug-in that consistently provides 1.9–2.4% additional gains when stacked on top of existing CLIP improvements.
- Qualitative analysis shows that READ-CLIP text embeddings shift more strongly than baseline embeddings when inter-token relations change, indicating greater sensitivity to relational structure.
Highlights & Insights¶
- Elegant design of reconstructing alternative descriptions: Rather than reconstructing the original input (which causes overfitting), the model reconstructs synonymous but differently worded descriptions — compelling the encoder to understand deep semantics rather than surface form. This insight is simple yet profound.
- NLP reconstruction objectives → VLMs: The success of encoder-decoder reconstruction in NLP (MASS/RetroMAE) is transferred to vision-language models for the first time, bridging two research communities.
- Complementarity of the two objectives: Reconstruction → inter-token relations (fine-grained); alignment → semantic invariance (coarse-grained); their combination yields effects greater than the sum of the parts.
- Strong generality: The method is independent of specific hard-negative generation strategies and can be applied as a plug-in module.
Limitations & Future Work¶
- Requires alternative description / paraphrase data: Currently sourced from datasets such as COCO that provide multiple captions; additional generation is needed for datasets without such annotations.
- Decoder is fixed as a pretrained model: Joint training of the decoder has not been explored.
- Validated only on the CLIP architecture: Newer architectures such as SigLIP and CoCa remain untested.
- ViT-B/32 scale: The effectiveness on larger models (e.g., ViT-L/14) has not been thoroughly evaluated.
- Reconstruction objective increases training computation: Although the decoder's parameters are frozen and never updated, its forward and backward passes still incur additional overhead.
Related Work & Insights¶
- vs. NegCLIP: NegCLIP relies solely on hard negatives and is susceptible to shortcut learning; READ provides a more fundamental basis for compositional understanding through auxiliary objectives.
- vs. SF-CLIP: SF-CLIP uses masked distillation to supervise both image and text encoders jointly; READ focuses exclusively on the text encoder, which is the primary bottleneck.
- vs. DAC: DAC finds that well-aligned captions benefit compositional reasoning; READ further adds a reconstruction objective and achieves superior performance.
- vs. RetroMAE: RetroMAE trains encoders via reconstruction in NLP; READ transfers this idea to the multimodal setting and adopts alternative descriptions.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of alternative-description reconstruction and paraphrase alignment is novel; the transfer from NLP to VLMs reflects genuine insight.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 5 benchmarks + stacking experiments with multiple CLIP variants + detailed ablations + qualitative analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Motivation is clearly derived, formulations are complete, and figures are intuitive.
- Value: ⭐⭐⭐⭐⭐ Substantial practical improvement in compositional reasoning for CLIP; the method is concise and reusable.