
READ: Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions

Conference: NeurIPS 2025 | arXiv: 2510.16540 | Code: Available | Area: Multimodal VLM / CLIP Improvement | Keywords: CLIP, compositional reasoning, text reconstruction, paraphrase alignment, contrastive learning

TL;DR

This paper proposes READ, a fine-tuning method that enhances the compositional reasoning capability of CLIP's text encoder via two auxiliary objectives: (1) token-level reconstruction, where a frozen decoder reconstructs alternative descriptions from text embeddings, and (2) sentence-level alignment, which enforces consistency among embeddings of paraphrases. READ achieves state-of-the-art performance on 5 compositional reasoning benchmarks, outperforming NegCLIP by 4.5% and FSC-CLIP by 4.1%.

Background & Motivation

Background: VLMs such as CLIP, trained with contrastive objectives, perform poorly on compositional reasoning — they fail to distinguish "horse eating grass" from "grass eating horse," as contrastive training encourages the text encoder to focus on individual tokens (aligned with visual objects) while neglecting inter-token relations.

Limitations of Prior Work: (1) Hard-negative-based methods (e.g., NegCLIP) may lead models to exploit shortcuts specific to the negative sample format rather than achieving genuine compositional understanding; (2) existing auxiliary objectives either act jointly on both image and text encoders, or exclusively on the image encoder, overlooking that the text encoder is the primary bottleneck for compositional reasoning; (3) auxiliary objectives specifically designed for the text encoder are lacking.

Key Challenge: The nature of contrastive training (image-text alignment) encourages the text encoder to produce bag-of-words representations, whereas compositional reasoning requires understanding of structural relationships among tokens.

Goal: Improve compositional reasoning in CLIP through auxiliary training objectives applied to the text encoder.

Key Insight: Two complementary objectives — reconstruction forces embeddings to retain inter-token relational information (otherwise alternative descriptions cannot be reconstructed), while alignment ensures that paraphrases with different surface forms yield consistent representations.

Core Idea: Compel the text encoder to encode inter-token relations via "alternative description reconstruction," and ensure semantic invariance via "paraphrase alignment."

Method

Overall Architecture

A weighted combination of three training losses: \(\mathcal{L} = \mathcal{L}_{Contrastive} + \lambda_1 \mathcal{L}_{Token\ Reconstruction} + \lambda_2 \mathcal{L}_{Sentence\ Alignment}\). The contrastive loss follows the standard CLIP formulation augmented with hard negatives; the two auxiliary losses act solely on the text encoder.

Key Designs

  1. Token-Level Reconstruction Loss:

    • Function: Given an original caption \(T_i\), the text encoder produces embedding \(v_i = f_T(T_i)\), which is projected via a learnable projection \(h_i = W^T v_i\) and passed to a frozen pretrained decoder \(\pi\) to reconstruct an alternative description \(\mathbf{y}_i^{(k)}\) of the same image (rather than the original caption).
    • Mechanism: \(\mathcal{L}_{Token\ Rec} = -\frac{1}{BK}\sum_i\sum_k \log \pi(\mathbf{y}_i^{(k)} | h_i)\); the decoder's weights stay frozen, so gradient updates reach only the text encoder and the projection layer.
    • Why reconstruct alternative descriptions rather than the original: Reconstructing the original caption causes the encoder to overfit to exact phrasing (e.g., memorizing the position of "the"), whereas reconstructing alternative descriptions compels the encoder to capture deeper semantic relations — understanding "horse eating grass" is necessary to reconstruct "a horse is feeding on the lawn."
    • Design Motivation: Prior NLP work (MASS, RetroMAE) has demonstrated that encoder-decoder reconstruction objectives help encoders capture syntactic and semantic relations; this paper is the first to transfer this insight to VLMs.
  2. Sentence-Level Alignment Loss:

    • Function: Aligns the embedding of a paraphrase \(T_i'\) (semantically equivalent but differently worded) with that of the original description \(T_i\) in the embedding space.
    • Mechanism: \(\mathcal{L}_{Sent\ Align} = -\frac{1}{B}\sum_i \log \frac{\phi(T_i, T_i')}{\sum_j \phi(T_i, T_j')}\), treating paraphrases as positive pairs and other in-batch paraphrases as negatives.
    • Design Motivation: Compositional reasoning requires not only understanding intra-sentence relations but also recognizing semantic equivalence across different surface forms — e.g., "the dog chased the cat" and "the cat was pursued by the dog" should yield similar embeddings.
  3. Hard-Negative-Augmented Contrastive Loss:

    • Function: Incorporates hard negative captions into the image-to-text direction of the standard contrastive loss.
    • Mechanism: \(M\) hard negatives \(\tilde{T}_i^{(m)}\) are generated via rule-based transformations (e.g., subject-object swapping, adjective exchange) and inserted into the denominator to increase discriminative difficulty.
    • Design Motivation: Compatible with NegCLIP; the two auxiliary objectives in READ are additive to hard-negative-based methods. (Minimal sketches of all three losses follow this list.)
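
To make the reconstruction mechanism concrete, here is a minimal PyTorch sketch of the token-level loss. It is an illustration under assumptions rather than the paper's code: the class name and dimensions are made up, and a toy GRU language model stands in for the frozen pretrained decoder \(\pi\); with \(K\) alternative descriptions per caption, the loss would simply be averaged over them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenReconstructionLoss(nn.Module):
    """Illustrative token-level reconstruction objective.

    A learnable projection W maps the CLIP text embedding into the decoder's
    hidden space; a frozen decoder must then reconstruct an *alternative*
    description of the same image, token by token. Only the projection (and,
    upstream, the text encoder) receive gradient updates.
    """

    def __init__(self, clip_dim: int, vocab_size: int, dec_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(clip_dim, dec_dim)               # learnable W
        # Toy GRU language model standing in for the frozen pretrained decoder pi.
        self.tok_emb = nn.Embedding(vocab_size, dec_dim)
        self.decoder = nn.GRU(dec_dim, dec_dim, batch_first=True)
        self.lm_head = nn.Linear(dec_dim, vocab_size)
        for module in (self.tok_emb, self.decoder, self.lm_head):
            for p in module.parameters():
                p.requires_grad_(False)                        # decoder stays frozen

    def forward(self, text_emb: torch.Tensor, alt_tokens: torch.Tensor) -> torch.Tensor:
        # text_emb:   (B, clip_dim)  embedding v_i from CLIP's text encoder
        # alt_tokens: (B, L)         token ids of one alternative description y_i^(k)
        h0 = self.proj(text_emb).unsqueeze(0)                  # (1, B, dec_dim) initial state
        inputs = self.tok_emb(alt_tokens[:, :-1])              # teacher forcing
        hidden, _ = self.decoder(inputs, h0)                   # (B, L-1, dec_dim)
        logits = self.lm_head(hidden)                          # (B, L-1, vocab_size)
        targets = alt_tokens[:, 1:]
        # Negative log-likelihood of the alternative description given h_i.
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```

Freezing the decoder keeps the auxiliary objective cheap and puts all the pressure on the projected embedding \(h_i\) to carry the relational information needed for reconstruction.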
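
Similarly, a minimal sketch of the sentence-level alignment loss, assuming \(\phi\) is the exponential of a temperature-scaled cosine similarity (the temperature value below is a placeholder, not the paper's setting):

```python
import torch
import torch.nn.functional as F

def sentence_alignment_loss(emb: torch.Tensor,
                            emb_para: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Illustrative sentence-level alignment objective.

    emb:      (B, D) embeddings of the original captions T_i
    emb_para: (B, D) embeddings of their paraphrases T_i'
    Each caption's own paraphrase is the positive; the other in-batch
    paraphrases serve as negatives (an InfoNCE-style loss).
    """
    emb = F.normalize(emb, dim=-1)
    emb_para = F.normalize(emb_para, dim=-1)
    logits = emb @ emb_para.t() / temperature              # phi(T_i, T_j') as scaled cosine similarity
    targets = torch.arange(emb.size(0), device=emb.device)
    return F.cross_entropy(logits, targets)                # -log of the diagonal (positive) softmax term
```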
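
Finally, a sketch of the hard-negative-augmented contrastive term in the image-to-text direction, again with an assumed temperature and with the rule-based hard negatives passed in as precomputed embeddings:

```python
import torch
import torch.nn.functional as F

def hard_negative_contrastive_loss(img_emb: torch.Tensor,
                                   txt_emb: torch.Tensor,
                                   neg_txt_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """Illustrative image-to-text contrastive term with hard negative captions.

    img_emb:     (B, D)    image embeddings
    txt_emb:     (B, D)    matching caption embeddings
    neg_txt_emb: (B, M, D) M rule-based hard negatives per caption
    The hard negatives only enlarge the denominator; the symmetric
    text-to-image term (without hard negatives) would be added separately.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    neg_txt_emb = F.normalize(neg_txt_emb, dim=-1)

    in_batch = img_emb @ txt_emb.t() / temperature                          # (B, B)
    hard = torch.einsum('bd,bmd->bm', img_emb, neg_txt_emb) / temperature   # (B, M)
    logits = torch.cat([in_batch, hard], dim=1)                             # (B, B + M)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)          # diagonal positives
    return F.cross_entropy(logits, targets)
```

These three terms would then be combined as in the overall objective above, \(\mathcal{L} = \mathcal{L}_{Contrastive} + \lambda_1 \mathcal{L}_{Token\ Rec} + \lambda_2 \mathcal{L}_{Sent\ Align}\), with \(\lambda_1, \lambda_2\) as tunable weights.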

Complementarity Analysis

  • Reconstruction objective: encourages the encoder to capture inter-token relations (requires understanding "who does what to whom" to reconstruct paraphrases).
  • Alignment objective: encourages the encoder to maintain semantic consistency across different surface forms.
  • The two are complementary: reconstruction provides fine-grained structural understanding; alignment provides coarse-grained semantic normalization.

Key Experimental Results

5 Compositional Reasoning Benchmarks

| Model | SugarCrepe | ARO | Winoground | VALSE | Cola | Average |
| --- | --- | --- | --- | --- | --- | --- |
| CLIP (ViT-B/32) | 74.3 | 59.2 | 29.5 | 70.1 | 68.4 | 60.3 |
| NegCLIP | 80.1 | 66.8 | 32.0 | 73.2 | 72.5 | 64.9 |
| FSC-CLIP | 81.5 | 68.4 | 33.8 | 74.1 | 73.2 | 66.2 |
| READ-CLIP | 83.9 | 71.2 | 36.1 | 76.3 | 75.8 | 68.7 |

Stacking with Existing CLIP Variants

| Base Model | Standalone | + READ | Gain |
| --- | --- | --- | --- |
| NegCLIP | 64.9 | 67.3 | +2.4% |
| FSC-CLIP | 66.2 | 68.1 | +1.9% |
| DAC-CLIP | 65.7 | 67.9 | +2.2% |

Ablation Study

| Configuration | SugarCrepe | ARO |
| --- | --- | --- |
| Contrastive loss only | 80.1 | 66.8 |
| + Token Reconstruction | 82.5 | 69.8 |
| + Sentence Alignment | 81.3 | 68.5 |
| + Both (READ) | 83.9 | 71.2 |

Key Findings

  • The reconstruction objective contributes the larger share (~2.4% on SugarCrepe over the contrastive-only baseline); adding the alignment objective on top yields a further ~1.4%.
  • Reconstructing alternative descriptions substantially outperforms reconstructing original descriptions — the latter leads to overfitting to exact phrasing.
  • READ functions as a plug-in that consistently provides 1.9–2.4% additional gains when stacked on top of existing CLIP improvements.
  • Qualitative analysis shows that READ-CLIP text embeddings shift more than the baseline's when inter-token relations change (e.g., when subject and object are swapped), indicating greater sensitivity to composition.

Highlights & Insights

  • Elegant design of reconstructing alternative descriptions: Rather than reconstructing the original input (which causes overfitting), the model reconstructs synonymous but differently worded descriptions — compelling the encoder to understand deep semantics rather than surface form. This insight is simple yet profound.
  • NLP reconstruction objectives → VLMs: The success of encoder-decoder reconstruction in NLP (MASS/RetroMAE) is transferred to vision-language models for the first time, bridging two research communities.
  • Complementarity of the two objectives: Reconstruction → inter-token relations (fine-grained); alignment → semantic invariance (coarse-grained); their combination yields larger gains than either objective alone.
  • Strong generality: The method is independent of specific hard-negative generation strategies and can be applied as a plug-in module.

Limitations & Future Work

  • Requires alternative description / paraphrase data: Currently sourced from datasets such as COCO that provide multiple captions; additional generation is needed for datasets without such annotations.
  • Decoder is fixed as a pretrained model: Joint training of the decoder has not been explored.
  • Validated only on the CLIP architecture: Newer architectures such as SigLIP and CoCa remain untested.
  • ViT-B/32 scale: The effectiveness on larger models (e.g., ViT-L/14) has not been thoroughly evaluated.
  • Reconstruction objective increases training computation: Although the decoder's weights are never updated, its forward pass (and the backward pass through it) still adds overhead.

Comparison with Related Work

  • vs. NegCLIP: NegCLIP relies solely on hard negatives and is susceptible to shortcut learning; READ provides a more fundamental basis for compositional understanding through its auxiliary objectives.
  • vs. SF-CLIP: SF-CLIP uses masked distillation to supervise both image and text encoders jointly; READ focuses exclusively on the text encoder, which is the primary bottleneck.
  • vs. DAC: DAC finds that well-aligned captions benefit compositional reasoning; READ further adds a reconstruction objective and achieves superior performance.
  • vs. RetroMAE: RetroMAE trains encoders via reconstruction in NLP; READ transfers this idea to the multimodal setting and adopts alternative descriptions.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of alternative-description reconstruction and paraphrase alignment is novel; the transfer from NLP to VLMs reflects genuine insight.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 5 benchmarks + stacking experiments with multiple CLIP variants + detailed ablations + qualitative analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ Motivation is clearly derived, formulations are complete, and figures are intuitive.
  • Value: ⭐⭐⭐⭐⭐ Substantial practical improvement in compositional reasoning for CLIP; the method is concise and reusable.