Learning Invariant Causal Mechanism from Vision-Language Models

Conference: ICML 2025
arXiv: 2405.15289
Code: GitHub
Area: Multimodal VLM
Keywords: CLIP, OOD generalization, causal inference, invariant representation, linear projection

TL;DR

A causal analysis shows that, under an identifiability condition, CLIP embeddings are linear transformations of the true invariant and variant latent factors. Building on this result, the CLIP-ICM framework estimates a linear projection matrix from intervention data and restricts prediction to the invariant subspace, so that predictions stay consistent across environments.

Background & Motivation

Background

CLIP performs exceptionally well on zero-shot tasks, but its performance becomes unstable in out-of-distribution (OOD) scenarios after fine-tuning.

Motivating Observation

On the Terra Incognita dataset, target-domain accuracy after leave-one-out fine-tuning is only 47.8%, compared with 78.9% for direct fine-tuning, a gap of 31.1 percentage points.

Limitations of Prior Work

Fine-tuning also sharply reduces zero-shot accuracy on novel classes (63.6% → 24.6%).

Key Challenge

In the structural causal model (SCM) used for the analysis, an image is generated from invariant factors \(Z_{inv}\) (e.g., wing shape) and variant factors \(Z_{var}\) (e.g., feather color); a change of environment affects only \(Z_{var}\).

Additional Notes

A prediction mechanism based on \(Z_{inv}\) remains invariant across environments (Proposition 5.1), whereas predictions that rely on \(Z_{var}\) are inconsistent from one environment to the next.
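The contrast between the two kinds of predictor can be illustrated with a toy simulation. This is a minimal sketch, not the paper's SCM: the scalar factors, the `agreement` parameter, and the environment names are all hypothetical choices made for illustration. The label depends on `z_inv` alone, while the environment decides how often `z_var` happens to agree with it.

```python
import numpy as np

def sample_environment(n, agreement, rng):
    """Toy SCM (illustrative assumption): y is caused by z_inv only;
    z_var agrees with z_inv with an environment-specific probability."""
    z_inv = rng.choice([-1.0, 1.0], size=n)
    y = (z_inv > 0).astype(int)           # label depends on z_inv alone
    flip = rng.random(n) > agreement      # the environment controls z_var
    z_var = np.where(flip, -z_inv, z_inv)
    return z_inv, z_var, y

def accuracy(feature, y):
    """Accuracy of the predictor 'class 1 if feature > 0'."""
    return float(np.mean((feature > 0).astype(int) == y))

rng = np.random.default_rng(0)
envs = {"train_env": 0.95, "test_env": 0.20}   # hypothetical environments
results = {}
for name, agreement in envs.items():
    z_inv, z_var, y = sample_environment(5000, agreement, rng)
    results[name] = (accuracy(z_inv, y), accuracy(z_var, y))
    print(name, "inv-acc, var-acc:", results[name])
```

In this toy setting the predictor built on `z_inv` behaves identically in both environments, while the one built on `z_var` looks strong in the first environment and fails in the second.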

Method

Theoretical Foundation

  1. Identifiability Analysis (Proposition 5.3): Under Condition 5.2, the CLIP image encoder output is a linear transformation of the true latent variables: \(f_I(\mathbf{x}) = A\mathbf{z}\), where \(A\) is invertible.
  2. Existence of Projection Matrix (Proposition 5.5): By using intervention data (fixing \(z_{inv}\) and varying \(z_{var}\)), one can estimate \(A_{inv}\) such that \(A_{inv}(f_I(\mathbf{x}_1^{do}) - f_I(\mathbf{x}_2^{do})) = 0\).
  3. OOD Risk Guarantees (Theorem 5.6): When \(I(Z_{inv};Z) > c\), the OOD risk of the invariant predictor is strictly lower than that of the standard predictor.
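
As a worked illustration of how Propositions 5.3 and 5.5 fit together, assume (for exposition only) that the latent vector stacks the two factor groups, \(\mathbf{z} = (\mathbf{z}_{inv}; \mathbf{z}_{var})\). An intervention pair shares \(\mathbf{z}_{inv}\) and differs only in \(\mathbf{z}_{var}\), so with \(f_I(\mathbf{x}) = A\mathbf{z}\):

\[f_I(\mathbf{x}_1^{do}) - f_I(\mathbf{x}_2^{do}) = A\begin{pmatrix}\mathbf{0}\\ \mathbf{z}_{var,1} - \mathbf{z}_{var,2}\end{pmatrix}\]

One matrix that annihilates every such difference is the following block form, which is an illustrative choice and not necessarily the paper's construction:

\[A_{inv} = \begin{pmatrix} I & 0 \end{pmatrix} A^{-1} \quad\Longrightarrow\quad A_{inv}\big(f_I(\mathbf{x}_1^{do}) - f_I(\mathbf{x}_2^{do})\big) = \begin{pmatrix} I & 0 \end{pmatrix}\begin{pmatrix}\mathbf{0}\\ \mathbf{z}_{var,1} - \mathbf{z}_{var,2}\end{pmatrix} = \mathbf{0}\]

With this choice \(A_{inv} f_I(\mathbf{x}) = \mathbf{z}_{inv}\). Any \(M \begin{pmatrix} I & 0 \end{pmatrix} A^{-1}\) with invertible \(M\) satisfies the same constraint, so the constraint alone determines \(A_{inv}\) only up to an invertible map on the invariant subspace.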

Three Stages of CLIP-ICM

  1. Collecting Intervention Data:
    • Image-based: Data augmentation (color jittering, grayscale conversion, Gaussian blur), intended to perturb \(Z_{var}\) while keeping \(Z_{inv}\) fixed.
    • Text-based: Using an image captioning model to generate text, and then using an LLM to modify the variant factors.
  2. Estimating \(A_{inv}\): Learning the projection matrix such that the embedding differences of intervention pairs are zero in the invariant subspace.
  3. Invariant Prediction: Calculating the cosine similarity between image and text embeddings in the invariant subspace for classification.
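
Stage 2 can be sketched on synthetic data. The example below is a minimal sketch under stated assumptions, not the paper's procedure: it simulates embeddings as \(A\mathbf{z}\) with a random invertible `A`, assumes the variant dimension `d_var` is known, and swaps in a closed-form SVD null-space estimate where the paper uses an optimized objective. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_inv, d_var, n_pairs = 3, 5, 200     # assumed (known) subspace dimensions
d = d_inv + d_var

# Hypothetical ground truth: embeddings are A @ z with A invertible.
A = rng.normal(size=(d, d))

# Intervention pairs: shared z_inv, independently resampled z_var.
z_inv = rng.normal(size=(n_pairs, d_inv))
z1 = np.hstack([z_inv, rng.normal(size=(n_pairs, d_var))])
z2 = np.hstack([z_inv, rng.normal(size=(n_pairs, d_var))])
e1, e2 = z1 @ A.T, z2 @ A.T           # stand-ins for f_I(x1_do), f_I(x2_do)

# The pair differences span only the directions that A assigns to z_var.
diffs = e1 - e2                        # shape (n_pairs, d)
_, s, vt = np.linalg.svd(diffs, full_matrices=True)

# Right singular vectors with (near-)zero singular values are orthogonal
# to every difference, so they form an estimate of A_inv.
A_inv_hat = vt[d_var:]                 # shape (d_inv, d)

residual = np.abs(A_inv_hat @ diffs.T).max()
leak = np.abs((A_inv_hat @ A)[:, d_inv:]).max()
print("max |A_inv (e1 - e2)|:", residual)
print("max weight on z_var:  ", leak)
```

Both printed quantities should be at floating-point noise level here, because the synthetic data satisfies the linear model exactly. Real CLIP embeddings and imperfect interventions would leave a nonzero residual.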

Invariant Predictor

\[P_{inv}(c|\mathbf{x}) = \frac{\exp(S(A_{inv}f_I(\mathbf{x}), A_{inv}f_T(\mathbf{t}_c)))}{\sum_{c'}\exp(S(A_{inv}f_I(\mathbf{x}), A_{inv}f_T(\mathbf{t}_{c'})))}\]
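
The formula above can be sketched directly in NumPy, taking \(S\) to be cosine similarity as described in stage 3. The function name `invariant_predict` and the random inputs are hypothetical stand-ins for the real encoders \(f_I\), \(f_T\) and a learned \(A_{inv}\); no temperature is applied because the formula does not include one.

```python
import numpy as np

def invariant_predict(image_emb, text_embs, A_inv):
    """Softmax over cosine similarities computed after projecting with A_inv.

    image_emb: (d,)   image embedding, a stand-in for f_I(x)
    text_embs: (C, d) class-prompt embeddings, stand-ins for f_T(t_c)
    A_inv:     (k, d) projection onto the invariant subspace
    """
    img = A_inv @ image_emb                    # (k,)
    txt = text_embs @ A_inv.T                  # (C, k)
    img = img / np.linalg.norm(img)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = txt @ img                         # cosine similarities S(., .)
    logits = logits - logits.max()             # numerical stability only
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
d, k, C = 8, 3, 4
A_inv = rng.normal(size=(k, d))                # hypothetical projection
text_embs = rng.normal(size=(C, d))            # hypothetical class prompts
image_emb = 2.0 * text_embs[2]                 # image aligned with class 2
probs = invariant_predict(image_emb, text_embs, A_inv)
print(probs, int(probs.argmax()))
```

Because cosine similarity ignores scale, the rescaled image embedding still matches class 2 best after projection.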

Key Experimental Results

Main Results

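Accuracy (%) on DomainBed:

| Method       | PACS | VLCS | OfficeHome | TerraInc | DomainNet | Avg  |
|--------------|------|------|------------|----------|-----------|------|
| Zero-shot    | 96.1 | 82.4 | 71.5       | 34.2     | 56.8      | 68.2 |
| Linear-Probe | 96.4 | 78.7 | 81.9       | 60.2     | 55.0      | 74.4 |
| CLIP-Adapter | 96.4 | 84.3 | 82.2       | 57.5     | 59.9      | 76.1 |
| CLIP-ICM     | Best | Best | Best       | Best     | Best      | Best |
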
  • Outperforms methods such as CoOp, CoCoOp, CLIP-Adapter, and DPL across the board on the DomainBed benchmark.
  • Demonstrates advantages on ImageNet variants as well.
  • Low computational cost as it does not require retraining the CLIP backbone.

Highlights & Insights

  • Theory-Driven: The existence of the linear projection is derived from causal identifiability, giving a complete chain from theory to method.
  • Simple & Efficient: High practicality as it only requires learning a single linear matrix without retraining the backbone.
  • Two Intervention Data Collection Methods: Image augmentation and text editing, which flexibly adapt to different scenarios.
  • OOD Risk Theoretical Guarantees: Not only empirically effective, but also backed by a theoretical guarantee on OOD risk (Theorem 5.6).

Limitations & Future Work

  • Whether the linear transformation assumption (Proposition 5.3) holds for all CLIP models needs further validation.
  • The quality of intervention data directly affects the estimation accuracy of \(A_{inv}\).
  • Image augmentation as an intervention might not strictly keep \(Z_{inv}\) invariant.
  • Only classification tasks are considered; downstream tasks like retrieval and generation remain unverified.
  • The dimension of the invariant subspace needs to be pre-defined, lacking theoretical guidance for optimal dimension selection.
  • The invariant projection of the text encoder and the image encoder must share the same \(A_{inv}\), whereas their representation characteristics might differ.
  • It is unclear whether the method remains effective when environmental changes involve not only distribution shifts in \(Z_{var}\) but also the emergence of new concepts (concept shift).
  • The condition \(I(Z_{inv};Z) > c\) is difficult to verify in real-world data.
  • For large-scale datasets (such as ImageNet scale), collecting intervention data of sufficient quality may carry high costs.
  • Combining with fine-tuning methods (such as LoRA + CLIP-ICM) is a promising direction.
  • The method assumes that CLIP already possesses good identifiability (Condition 5.2), which may not hold for smaller or domain-specific VLMs.

Additional Experimental Details

  • The standard leave-one-out evaluation protocol is adopted on DomainBed.
  • ImageNet variants include ImageNet-V2, ImageNet-R, ImageNet-Sketch, and ImageNet-A.
  • Gradient descent optimization is used for the projection matrix estimation, typically converging within a few hundred iterations.
  • Text intervention is generated using GPT-3.5, and image intervention uses standard data augmentation combinations.
  • Joint evaluation of domain shift and open class on Terra Incognita is a unique contribution of this work.

Related Work & Insights

  • CoOp/CoCoOp (Zhou et al., 2022): Learnable prompts, but lack theoretical OOD guarantees.
  • IRM (Arjovsky et al., 2020): Invariant learning, but does not leverage VLM properties.
  • Causal Representation Learning (Schölkopf et al., 2021): Combining it with CLIP in this work is a novel contribution.
  • Insight: The cross-modal alignment of VLMs naturally provides conditions for causal identifiability.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ (Causal perspective analysis of CLIP's OOD problems + linear projection theory)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (Multiple benchmarks + ablations)
  • Writing Quality: ⭐⭐⭐⭐ (Rigorous derivation in causal analysis)
  • Value: ⭐⭐⭐⭐⭐ (Provides theoretical and practical solutions for VLM OOD generalization)

Core Theoretical Supplements

  • Condition 5.2 requires the existence of \(D+1\) text description pairs, ensuring the matrix \(A\) is invertible.
  • The condition \(I(Z_{inv};Z) > c\) in Theorem 5.6 ensures that the invariant factors contain sufficient information.
  • The consistency of the causal mechanism is proven via do-calculus: \(P^*(y|do(\mathbf{z}_{inv})) = P(y|do(\mathbf{z}_{inv}))\).
  • The projection matrix estimation is based on a contrastive learning objective, minimizing the discrepancy of intervention pairs in the invariant subspace.
  • Image and text interventions can be used individually or in combination; experiments show that the combination yields the best results.
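
The idea of minimizing the discrepancy of intervention pairs by gradient descent can be sketched as follows. This is an illustration under assumptions, not the paper's objective: it uses a plain squared discrepancy instead of a contrastive loss, and it adds an orthonormal-rows constraint (enforced by a QR step) to rule out the trivial solution \(A_{inv} = 0\). The function `fit_projection` and its parameters are hypothetical.

```python
import numpy as np

def fit_projection(diffs, k, lr=0.45, steps=300, seed=0):
    """Projected gradient descent on mean ||P d||^2 with orthonormal rows of P.

    diffs: (n, d) embedding differences of intervention pairs
    k:     assumed dimension of the invariant subspace
    lr:    step size; kept below 0.5 because cov is scaled to spectral norm 1
    """
    rng = np.random.default_rng(seed)
    d = diffs.shape[1]
    cov = diffs.T @ diffs / len(diffs)          # second moment of differences
    cov = cov / np.linalg.norm(cov, 2)          # scale so lr is data-independent
    q, _ = np.linalg.qr(rng.normal(size=(d, k)))
    P = q.T                                     # (k, d), orthonormal rows
    losses = []
    for _ in range(steps):
        losses.append(float(np.trace(P @ cov @ P.T)))
        grad = 2.0 * P @ cov                    # gradient of trace(P cov P^T)
        q, _ = np.linalg.qr((P - lr * grad).T)  # retract to orthonormal rows
        P = q.T
    return P, losses

# Synthetic intervention differences: only the z_var block changes.
rng = np.random.default_rng(1)
d_inv, d_var, n = 3, 5, 500
A = rng.normal(size=(d_inv + d_var, d_inv + d_var))   # hypothetical mixing
dz = np.hstack([np.zeros((n, d_inv)), rng.normal(size=(n, d_var))])
diffs = dz @ A.T
P, losses = fit_projection(diffs, k=d_inv)
print("first loss:", losses[0], "last recorded loss:", losses[-1])
```

The constraint matters: without it, shrinking the matrix toward zero would minimize the discrepancy without identifying any subspace.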