Learning Invariant Causal Mechanism from Vision-Language Models¶

Conference: ICML2025
arXiv: 2405.15289
Code: GitHub
Area: Multimodal VLM
Keywords: CLIP, OOD generalization, causal inference, invariant representation, linear projection

TL;DR¶

A causal analysis shows that CLIP embeddings are a linear transformation of the true invariant and variant latent factors. Building on this result, the CLIP-ICM framework estimates a linear projection matrix from intervention data and restricts prediction to the invariant subspace, so that predictions stay consistent across environments.

Background & Motivation¶

Background¶

CLIP performs exceptionally well on zero-shot tasks, but its accuracy is unstable once it is fine-tuned and evaluated in out-of-distribution (OOD) scenarios.

Observed Generalization Gap¶

On the Terra Incognita dataset, leave-one-out fine-tuning reaches only 47.8% accuracy on the held-out target domain, compared with 78.9% for direct fine-tuning, a gap of 31.1 percentage points.

Limitations of Prior Work¶

After fine-tuning, zero-shot accuracy on novel classes also drops sharply, from 63.6% to 24.6%.

Key Challenge¶

The paper models image generation with a structural causal model (SCM): each image is generated by invariant factors \(Z_{inv}\) (e.g., wing shape) and variant factors \(Z_{var}\) (e.g., feather color), and a change of environment affects only \(Z_{var}\).

Additional Notes¶

A prediction mechanism based on \(Z_{inv}\) stays the same across environments (Proposition 5.1), whereas a predictor that relies on \(Z_{var}\) gives inconsistent predictions from one environment to the next.

Method¶

Theoretical Foundation¶

Identifiability Analysis (Proposition 5.3): Under Condition 5.2, the CLIP image encoder output is a linear transformation of the true latent variables: \(f_I(\mathbf{x}) = A\mathbf{z}\), where \(A\) is invertible.
Existence of Projection Matrix (Proposition 5.5): By using intervention data (fixing \(z_{inv}\) and varying \(z_{var}\)), one can estimate \(A_{inv}\) such that \(A_{inv}(f_I(\mathbf{x}_1^{do}) - f_I(\mathbf{x}_2^{do})) = 0\).
OOD Risk Guarantees (Theorem 5.6): When \(I(Z_{inv};Z) > c\), the OOD risk of the invariant predictor is strictly lower than that of the standard predictor.
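
A short worked derivation may help connect Propositions 5.3 and 5.5. It is only a sketch: it assumes the latent vector is ordered as \(\mathbf{z} = (\mathbf{z}_{inv}, \mathbf{z}_{var})\), and the column blocks \(B_{inv}\) and \(B_{var}\) of \(A\) are notation introduced here for illustration, not symbols from the paper.

\[A = [\,B_{inv} \;\; B_{var}\,], \qquad f_I(\mathbf{x}) = A\mathbf{z} = B_{inv}\mathbf{z}_{inv} + B_{var}\mathbf{z}_{var}\]

For an intervention pair that shares \(\mathbf{z}_{inv}\) and differs only in \(\mathbf{z}_{var}\), the invariant term cancels:

\[f_I(\mathbf{x}_1^{do}) - f_I(\mathbf{x}_2^{do}) = B_{var}(\mathbf{z}_{var,1} - \mathbf{z}_{var,2})\]

So any matrix that annihilates \(B_{var}\) satisfies the constraint in Proposition 5.5. Because \(A\) is invertible, one such choice is

\[A_{inv} = [\,I \;\; 0\,]\,A^{-1} \;\Rightarrow\; A_{inv}B_{var} = 0, \quad A_{inv}f_I(\mathbf{x}) = \mathbf{z}_{inv}\]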

Three Stages of CLIP-ICM¶

1. Collecting intervention data:
   - Image-based: Data augmentation (color jittering, grayscale, Gaussian blur) changes appearance while leaving \(Z_{inv}\) unchanged.
   - Text-based: An image captioning model generates a description, and an LLM then rewrites the variant factors in that description.
2. Estimating \(A_{inv}\): A projection matrix is learned such that the embedding differences of intervention pairs vanish in the invariant subspace.
3. Invariant prediction: Classification uses the cosine similarity between image and text embeddings after both are projected into the invariant subspace.
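
The estimation stage can be illustrated on synthetic data. The sketch below is not the paper's procedure: it swaps in a closed-form SVD null-space estimate for the learned projection, and the function name `estimate_invariant_projection`, the random mixing matrix `A`, and all dimensions are assumptions made for the example.

```python
import numpy as np


def estimate_invariant_projection(diffs: np.ndarray, inv_dim: int) -> np.ndarray:
    """Return an (inv_dim, D) matrix that (nearly) annihilates the rows of `diffs`.

    diffs: (N, D) array; each row is f_I(x1_do) - f_I(x2_do) for one
    intervention pair (same z_inv, different z_var).
    """
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=True)
    # The last inv_dim rows are the directions along which the differences vanish.
    return vt[-inv_dim:]


# Synthetic check: embeddings are a linear mix of latent factors (hypothetical setup).
rng = np.random.default_rng(0)
D, inv_dim, n_pairs = 8, 3, 200
A = rng.normal(size=(D, D))                      # stand-in for the mixing matrix
z_inv = rng.normal(size=(n_pairs, inv_dim))      # shared within each pair
z_var_1 = rng.normal(size=(n_pairs, D - inv_dim))
z_var_2 = rng.normal(size=(n_pairs, D - inv_dim))
emb_1 = np.concatenate([z_inv, z_var_1], axis=1) @ A.T
emb_2 = np.concatenate([z_inv, z_var_2], axis=1) @ A.T

A_inv = estimate_invariant_projection(emb_1 - emb_2, inv_dim)
residual = np.abs(A_inv @ (emb_1 - emb_2).T).max()
```

In this toy setting the two members of every pair map to the same point after projection, which is the property the second stage aims for. With real CLIP embeddings the differences would only be approximately low-rank, so the residual would not be exactly zero.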

Invariant Predictor¶

\[P_{inv}(c|\mathbf{x}) = \frac{\exp(S(A_{inv}f_I(\mathbf{x}), A_{inv}f_T(\mathbf{t}_c)))}{\sum_{c'}\exp(S(A_{inv}f_I(\mathbf{x}), A_{inv}f_T(\mathbf{t}_{c'})))}\]
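
A minimal NumPy sketch of this predictor follows, taking \(S\) to be cosine similarity as described in the three stages above. The function name `invariant_predict` and the toy vectors are hypothetical, and the sketch follows the formula as written, without the temperature scaling that CLIP implementations typically apply to logits.

```python
import numpy as np


def invariant_predict(image_emb: np.ndarray, text_embs: np.ndarray,
                      A_inv: np.ndarray) -> np.ndarray:
    """Softmax over cosine similarities computed in the invariant subspace.

    image_emb: (D,) image embedding f_I(x).
    text_embs: (C, D) text embeddings f_T(t_c), one row per class.
    A_inv:     (k, D) projection onto the invariant subspace.
    """
    img = A_inv @ image_emb                      # (k,)
    txt = text_embs @ A_inv.T                    # (C, k)
    img = img / np.linalg.norm(img)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = txt @ img                           # cosine similarity per class
    logits = logits - logits.max()               # numerical stability only
    probs = np.exp(logits)
    return probs / probs.sum()


# Toy example: the last two coordinates play the role of variant factors.
toy_A_inv = np.eye(2, 4)                         # keeps the first two coordinates
toy_text = np.array([[1.0, 0.0, 5.0, 5.0],
                     [0.0, 1.0, -5.0, 5.0]])
toy_image = np.array([0.1, 0.9, 9.0, -9.0])
toy_probs = invariant_predict(toy_image, toy_text, toy_A_inv)
```

In the toy example the large variant coordinates would pull the raw dot product toward class 0, while the projected similarity favors class 1.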

Key Experimental Results¶

Main Results¶

Method	PACS	VLCS	OfficeHome	TerraInc	DomainNet	Avg
Zero-shot	96.1	82.4	71.5	34.2	56.8	68.2
Linear-Probe	96.4	78.7	81.9	60.2	55.0	74.4
CLIP-Adapter	96.4	84.3	82.2	57.5	59.9	76.1
CLIP-ICM	Best	Best	Best	Best	Best	Best

Outperforms methods such as CoOp, CoCoOp, CLIP-Adapter, and DPL across the board on the DomainBed benchmark.
Demonstrates advantages on ImageNet variants as well.
Low computational cost as it does not require retraining the CLIP backbone.

Highlights & Insights¶

Theory-Driven: The existence of linear projection is derived starting from causal identifiability, presenting a complete theoretical chain.
Simple & Efficient: High practicality as it only requires learning a single linear matrix without retraining the backbone.
Two Intervention Data Collection Methods: Image augmentation and text editing, which flexibly adapt to different scenarios.
OOD Risk Theoretical Guarantees: Not only empirically effective, but also backed by a theoretical guarantee that the invariant predictor has lower OOD risk than the standard predictor (Theorem 5.6).

Limitations & Future Work¶

Whether the linear transformation assumption (Proposition 5.3) holds for all CLIP models needs further validation.
The quality of intervention data directly affects the estimation accuracy of \(A_{inv}\).
Image augmentation as an intervention might not strictly keep \(Z_{inv}\) invariant.
Only classification tasks are considered; downstream tasks like retrieval and generation remain unverified.
The dimension of the invariant subspace needs to be pre-defined, lacking theoretical guidance for optimal dimension selection.
The invariant projection of the text encoder and the image encoder must share the same \(A_{inv}\), whereas their representation characteristics might differ.
It is unclear whether the method remains effective when environmental changes involve not only distribution shifts in \(Z_{var}\) but also the emergence of new concepts (concept shift).
The condition \(I(Z_{inv};Z) > c\) is difficult to verify in real-world data.
For large-scale datasets (such as ImageNet scale), collecting intervention data of sufficient quality may carry high costs.
Combining with fine-tuning methods (such as LoRA + CLIP-ICM) is a promising direction.
The method assumes that CLIP already possesses good identifiability (Condition 5.2), which may not hold for smaller or domain-specific VLMs.

Additional Experimental Details¶

The standard leave-one-out evaluation protocol is adopted on DomainBed.
ImageNet variants include ImageNet-V2, ImageNet-R, ImageNet-Sketch, and ImageNet-A.
Gradient descent optimization is used for the projection matrix estimation, typically converging within a few hundred iterations.
Text intervention is generated using GPT-3.5, and image intervention uses standard data augmentation combinations.
Joint evaluation of domain shift and open (novel) classes on Terra Incognita is a unique contribution of this work.
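
The leave-one-out protocol mentioned above can be sketched as a simple loop over domains. This is an illustrative outline only: `leave_one_domain_out` and `dummy_fit_and_score` are hypothetical names, and the placeholder score is not a real accuracy.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def leave_one_domain_out(
    domains: Sequence[str],
    fit_and_score: Callable[[List[str], str], float],
) -> Dict[str, float]:
    """Hold out each domain in turn, fit on the rest, score on the held-out one."""
    scores: Dict[str, float] = {}
    for held_out in domains:
        train_domains = [d for d in domains if d != held_out]
        scores[held_out] = fit_and_score(train_domains, held_out)
    return scores


# Record which splits were requested, using a stand-in for the real training code.
seen_splits: List[Tuple[Tuple[str, ...], str]] = []


def dummy_fit_and_score(train_domains: List[str], held_out: str) -> float:
    seen_splits.append((tuple(train_domains), held_out))
    return 0.0  # placeholder value, not a measured accuracy


pacs_domains = ["art_painting", "cartoon", "photo", "sketch"]
pacs_scores = leave_one_domain_out(pacs_domains, dummy_fit_and_score)
```

Each benchmark number would then be the mean of the per-domain scores, with the held-out domain never seen during fitting.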

Related Work & Insights¶

CoOp/CoCoOp (Zhou et al., 2022): Learnable prompts, but lack theoretical OOD guarantees.
IRM (Arjovsky et al., 2020): Invariant learning, but does not leverage VLM properties.
Causal Representation Learning (Schölkopf et al., 2021): Combining it with CLIP in this work is a novel contribution.
Insight: The cross-modal alignment of VLMs naturally provides conditions for causal identifiability.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ (Causal perspective analysis of CLIP's OOD problems + linear projection theory)
Experimental Thoroughness: ⭐⭐⭐⭐ (Multiple benchmarks + ablations)
Writing Quality: ⭐⭐⭐⭐ (Rigorous derivation in causal analysis)
Value: ⭐⭐⭐⭐⭐ (Provides theoretical and practical solutions for VLM OOD generalization)

Core Theoretical Supplements¶

Condition 5.2 requires the existence of \(D+1\) text description pairs, ensuring the matrix \(A\) is invertible.
The condition \(I(Z_{inv};Z) > c\) in Theorem 5.6 ensures that the invariant factors contain sufficient information.
The consistency of the causal mechanism is proven via do-calculus: \(P^*(y|do(\mathbf{z}_{inv})) = P(y|do(\mathbf{z}_{inv}))\).
The projection matrix estimation is based on a contrastive learning objective, minimizing the discrepancy of intervention pairs in the invariant subspace.
Image and text interventions can be used individually or in combination; experiments show that the combination yields the best results.
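
A gradient-descent version of the projection estimate can be sketched as follows. The exact loss used in the paper is not given in these notes, so the objective here is an assumption: the mean squared norm of projected pair differences plus an orthonormality penalty weighted by `lam`, added to rule out the trivial solution \(A_{inv} = 0\). The function name `fit_projection_gd`, the hyperparameters, and the synthetic data are likewise hypothetical.

```python
import numpy as np


def fit_projection_gd(diffs, inv_dim, lam=1.0, lr=0.05, steps=300, seed=0):
    """Minimize mean ||W d||^2 + lam * ||W W^T - I||_F^2 by plain gradient descent.

    diffs: (N, D) embedding differences of intervention pairs.
    Returns the (inv_dim, D) matrix W and the loss recorded before each step.
    """
    rng = np.random.default_rng(seed)
    n, d = diffs.shape
    cov = diffs.T @ diffs / n
    cov = cov / np.linalg.norm(cov, 2)       # rescale so lr does not depend on data scale
    q, _ = np.linalg.qr(rng.normal(size=(d, inv_dim)))
    w = q.T                                   # start from random orthonormal rows
    eye = np.eye(inv_dim)
    history = []
    for _ in range(steps):
        gram_err = w @ w.T - eye
        loss = np.trace(w @ cov @ w.T) + lam * np.sum(gram_err ** 2)
        history.append(float(loss))
        grad = 2.0 * w @ cov + 4.0 * lam * gram_err @ w
        w = w - lr * grad
    return w, history


# Synthetic differences that live in a 5-dimensional subspace of an 8-dimensional space.
rng = np.random.default_rng(1)
demo_diffs = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 8))
W, losses = fit_projection_gd(demo_diffs, inv_dim=3)
```

The penalty keeps the rows of `W` close to orthonormal while the first term pushes them toward directions in which intervention pairs do not differ. How quickly the residual shrinks depends on the spectrum of the differences, so the step count here is illustrative.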