Learning Invariant Causal Mechanism from Vision-Language Models¶
Conference: ICML 2025
arXiv: 2405.15289
Code: GitHub
Area: Multimodal VLM
Keywords: CLIP, OOD generalization, causal inference, invariant representation, linear projection
TL;DR¶
Causal analysis shows that CLIP embeddings are a linear transformation of the true invariant and variant latent factors. Building on this, the CLIP-ICM framework estimates a linear projection matrix from intervention data and restricts prediction to the invariant subspace, so that predictions stay consistent across environments.
Background & Motivation¶
Background¶
CLIP performs exceptionally well on zero-shot tasks, but its performance becomes unstable once it is fine-tuned and evaluated in out-of-distribution (OOD) scenarios.
Proposed Solution¶
On the Terra Incognita dataset, target-domain accuracy after leave-one-out fine-tuning is only 47.8%, compared to 78.9% for direct fine-tuning, a gap of 31.1 percentage points.
Limitations of Prior Work¶
After fine-tuning, zero-shot accuracy on novel classes also drops sharply (63.6% → 24.6%).
Key Challenge¶
In the structural causal model (SCM), images are generated by invariant factors \(Z_{inv}\) (e.g., wing shape) and variant factors \(Z_{var}\) (e.g., feather color), and a change of environment affects only \(Z_{var}\).
Additional Notes¶
The prediction mechanism based on \(Z_{inv}\) remains invariant across environments (Proposition 5.1), whereas predictions that rely on \(Z_{var}\) are inconsistent.
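The contrast between the two kinds of predictors can be illustrated with a toy simulation. The sketch below is not from the paper: the scalar latents, the threshold classifier, and the sign-flip environment are all assumptions chosen to make the effect visible.

```python
import numpy as np

def sample_env(n, var_sign, rng):
    """Draw one environment from a toy SCM (illustrative assumptions only).

    z_inv determines the label. z_var is merely correlated with the label,
    and the sign of that correlation (var_sign) is what the environment changes.
    """
    z_inv = rng.normal(size=n)
    y = (z_inv > 0).astype(int)
    z_var = var_sign * (2 * y - 1) + 0.5 * rng.normal(size=n)
    return z_inv, z_var, y

def accuracy(feature, y):
    # Threshold-at-zero classifier on a single scalar feature.
    return float(np.mean((feature > 0).astype(int) == y))

rng = np.random.default_rng(0)
train = sample_env(5000, +1.0, rng)   # z_var agrees with the label
test = sample_env(5000, -1.0, rng)    # the environment flips that correlation

acc_inv_train = accuracy(train[0], train[2])
acc_inv_test = accuracy(test[0], test[2])
acc_var_train = accuracy(train[1], train[2])
acc_var_test = accuracy(test[1], test[2])

print("z_inv predictor:", acc_inv_train, acc_inv_test)
print("z_var predictor:", acc_var_train, acc_var_test)
```

In this toy setting the predictor built on `z_inv` behaves identically in both environments, while the one built on `z_var` goes from mostly right to mostly wrong once the environment changes.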
Method¶
Theoretical Foundation¶
- Identifiability Analysis (Proposition 5.3): Under Condition 5.2, the CLIP image encoder output is a linear transformation of the true latent variables: \(f_I(\mathbf{x}) = A\mathbf{z}\), where \(A\) is invertible.
- Existence of Projection Matrix (Proposition 5.5): By using intervention data (fixing \(z_{inv}\) and varying \(z_{var}\)), one can estimate \(A_{inv}\) such that \(A_{inv}(f_I(\mathbf{x}_1^{do}) - f_I(\mathbf{x}_2^{do})) = 0\).
- OOD Risk Guarantees (Theorem 5.6): When \(I(Z_{inv};Z) > c\), the OOD risk of the invariant predictor is strictly lower than that of the standard predictor.
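A small numerical check can make the link between Propositions 5.3 and 5.5 concrete. The sketch below assumes a toy setting in which the encoder is exactly \(A\mathbf{z}\); the latent sizes and the random matrix `A` are hypothetical, and taking the first rows of \(A^{-1}\) is just one valid choice of \(A_{inv}\), not necessarily the one the paper estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
d_inv, d_var = 3, 2                # hypothetical latent sizes
D = d_inv + d_var

A = rng.normal(size=(D, D))        # stand-in for the unknown mixing matrix
assert np.linalg.matrix_rank(A) == D   # invertible, as Proposition 5.3 requires

def encoder(z):
    # Toy model of Proposition 5.3: embedding = A z
    return A @ z

# Intervention pair: same z_inv, different z_var
z_inv = rng.normal(size=d_inv)
z1 = np.concatenate([z_inv, rng.normal(size=d_var)])
z2 = np.concatenate([z_inv, rng.normal(size=d_var)])

# A_inv here means "invariant projection", not "inverse of A".
# The rows of A^{-1} that read out the invariant coordinates are one valid choice.
A_inv = np.linalg.inv(A)[:d_inv]

residual = A_inv @ (encoder(z1) - encoder(z2))   # should vanish
recovered = A_inv @ encoder(z1)                  # should equal z_inv

print(np.abs(residual).max(), np.allclose(recovered, z_inv))
```

Because the two embeddings differ only by \(A\) applied to a change in \(z_{var}\), any matrix whose rows annihilate the variant columns of \(A\) satisfies the constraint in Proposition 5.5.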
Three Stages of CLIP-ICM¶
- Collecting Intervention Data:
  - Image-based: Data augmentation (color jittering, grayscale, Gaussian blur) to keep \(Z_{inv}\) invariant.
  - Text-based: Using an image captioning model to generate text, and then using an LLM to modify the variant factors.
- Estimating \(A_{inv}\): Learning the projection matrix such that the embedding differences of intervention pairs are zero in the invariant subspace.
- Invariant Prediction: Calculating the cosine similarity between image and text embeddings in the invariant subspace for classification.
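The last two stages can be sketched on synthetic embeddings. This is a minimal illustration under stated assumptions, not the paper's implementation: the SVD-based estimate is a closed-form stand-in for the learned projection, and the function names, latent sizes, and synthetic "image" and "text" embeddings are all hypothetical.

```python
import numpy as np

def estimate_projection(diffs, k):
    """Return a (k, D) matrix with orthonormal rows spanning the k directions
    along which the intervention differences have the least energy."""
    # Right singular vectors come sorted by decreasing singular value,
    # so the last k rows of vt are the least-varying directions.
    _, _, vt = np.linalg.svd(diffs, full_matrices=True)
    return vt[-k:]

def predict(A_inv, image_embs, text_embs):
    """Cosine-similarity classification after projecting with A_inv."""
    img = image_embs @ A_inv.T
    txt = text_embs @ A_inv.T
    img = img / np.linalg.norm(img, axis=-1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=-1, keepdims=True)
    return np.argmax(img @ txt.T, axis=-1)

rng = np.random.default_rng(0)
d_inv, d_var, n_classes = 4, 2, 3
D = d_inv + d_var
A = rng.normal(size=(D, D))            # toy mixing matrix (assumption)

# Stage 1 (simulated): pairs that share z_inv and differ in z_var
z_inv = rng.normal(size=(50, d_inv))
za = np.hstack([z_inv, rng.normal(size=(50, d_var))])
zb = np.hstack([z_inv, rng.normal(size=(50, d_var))])
diffs = (za - zb) @ A.T                # embedding differences

# Stage 2: estimate the invariant projection
A_inv = estimate_projection(diffs, k=d_inv)

# Stage 3: classify images whose variant factors are strongly perturbed
prototypes = rng.normal(size=(n_classes, d_inv))
text_embs = np.hstack([prototypes, np.zeros((n_classes, d_var))]) @ A.T
labels = rng.integers(0, n_classes, size=20)
images = np.hstack([prototypes[labels],
                    5.0 * rng.normal(size=(20, d_var))]) @ A.T

pred = predict(A_inv, images, text_embs)
print((pred == labels).mean())
```

In this idealized setting the variant directions are removed exactly, so classification depends only on the invariant coordinates. With real CLIP embeddings the differences would not lie in an exact low-dimensional subspace, and the choice of `k` becomes a modelling decision.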
Invariant Predictor¶
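Based on the Invariant Prediction stage described above, the predictor can be sketched as follows. Here \(f_T\) denotes the CLIP text encoder and \(t_c\) the text prompt for class \(c\); both symbols are assumed notation, and the exact form in the paper may differ.

\[
\hat{y}(\mathbf{x}) = \arg\max_{c} \; \frac{\langle A_{inv} f_I(\mathbf{x}),\, A_{inv} f_T(t_c) \rangle}{\lVert A_{inv} f_I(\mathbf{x}) \rVert \, \lVert A_{inv} f_T(t_c) \rVert}
\]

The same \(A_{inv}\) is applied to both modalities, so similarity is measured only along directions that the intervention pairs leave unchanged.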
Key Experimental Results¶
Main Results¶
| Method | PACS | VLCS | OfficeHome | TerraInc | DomainNet | Avg |
|---|---|---|---|---|---|---|
| Zero-shot | 96.1 | 82.4 | 71.5 | 34.2 | 56.8 | 68.2 |
| Linear-Probe | 96.4 | 78.7 | 81.9 | 60.2 | 55.0 | 74.4 |
| CLIP-Adapter | 96.4 | 84.3 | 82.2 | 57.5 | 59.9 | 76.1 |
| CLIP-ICM | Best | Best | Best | Best | Best | Best |
- Outperforms methods such as CoOp, CoCoOp, CLIP-Adapter, and DPL across the board on the DomainBed benchmark.
- Demonstrates advantages on ImageNet variants as well.
- Low computational cost as it does not require retraining the CLIP backbone.
Highlights & Insights¶
- Theory-Driven: The existence of the linear projection is derived from causal identifiability, giving a complete theoretical chain from assumptions to method.
- Simple & Efficient: High practicality as it only requires learning a single linear matrix without retraining the backbone.
- Two Intervention Data Collection Methods: Image augmentation and text editing, which flexibly adapt to different scenarios.
- OOD Risk Theoretical Guarantees: Not only empirically effective, but also backed by a theoretical guarantee on OOD risk (Theorem 5.6).
Limitations & Future Work¶
- Whether the linear transformation assumption (Proposition 5.3) holds for all CLIP models needs further validation.
- The quality of intervention data directly affects the estimation accuracy of \(A_{inv}\).
- Image augmentation as an intervention might not strictly keep \(Z_{inv}\) invariant.
- Only classification tasks are considered; downstream tasks like retrieval and generation remain unverified.
- The dimension of the invariant subspace needs to be pre-defined, lacking theoretical guidance for optimal dimension selection.
- The invariant projection of the text encoder and the image encoder must share the same \(A_{inv}\), whereas their representation characteristics might differ.
- It remains unclear whether the method stays effective when environmental changes involve not only distribution shifts in \(Z_{var}\) but also the emergence of new concepts (concept shift).
- The condition \(I(Z_{inv};Z) > c\) is difficult to verify in real-world data.
- For large-scale datasets (such as ImageNet scale), collecting intervention data of sufficient quality may carry high costs.
- Combining with fine-tuning methods (such as LoRA + CLIP-ICM) is a promising direction.
- The method assumes that CLIP already possesses good identifiability (Condition 5.2), which may not hold for smaller or domain-specific VLMs.
Additional Experimental Details¶
- The standard leave-one-out evaluation protocol is adopted on DomainBed.
- ImageNet variants include ImageNet-V2, ImageNet-R, ImageNet-Sketch, and ImageNet-A.
- Gradient descent optimization is used for the projection matrix estimation, typically converging within a few hundred iterations.
- Text intervention is generated using GPT-3.5, and image intervention uses standard data augmentation combinations.
- Jointly evaluating domain shift and open classes on Terra Incognita is a contribution specific to this work.
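The gradient-descent estimation of the projection matrix mentioned above can be sketched in NumPy. The objective used here is an assumption rather than the paper's loss: an invariance term on the intervention differences plus an orthonormality penalty that rules out the trivial solution \(A_{inv} = 0\). The hyperparameters `lam`, `lr`, and `steps`, and the orthogonal toy mixing matrix `Q`, are hypothetical.

```python
import numpy as np

def loss(W, diffs, lam=1.0):
    # Invariance term: mean squared norm of projected differences.
    inv_term = np.mean(np.sum((diffs @ W.T) ** 2, axis=1))
    # Penalty keeping the rows of W close to orthonormal.
    orth_term = np.sum((W @ W.T - np.eye(W.shape[0])) ** 2)
    return inv_term + lam * orth_term

def fit_projection(diffs, k, lam=1.0, lr=0.05, steps=500, seed=0):
    """Plain gradient descent on the loss above."""
    rng = np.random.default_rng(seed)
    n, dim = diffs.shape
    cov = diffs.T @ diffs / n               # second-moment matrix of differences
    W = 0.1 * rng.normal(size=(k, dim))     # small random initialisation
    eye = np.eye(k)
    for _ in range(steps):
        # d/dW tr(W cov W^T) = 2 W cov ;  d/dW ||W W^T - I||_F^2 = 4 (W W^T - I) W
        grad = 2.0 * W @ cov + 4.0 * lam * (W @ W.T - eye) @ W
        W -= lr * grad
    return W

rng = np.random.default_rng(1)
d_inv, d_var, n_pairs = 4, 2, 200
dim = d_inv + d_var
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))   # orthogonal toy mixing matrix

# z_inv cancels inside each pair, so differences live in the variant columns of Q.
delta_var = rng.normal(size=(n_pairs, d_var)) - rng.normal(size=(n_pairs, d_var))
diffs = delta_var @ Q[:, d_inv:].T

trivial_loss = loss(np.zeros((d_inv, dim)), diffs)  # what W = 0 would cost
W = fit_projection(diffs, k=d_inv)
final_loss = loss(W, diffs)

print(trivial_loss, final_loss, np.abs(diffs @ W.T).max())
```

The penalty is what makes the problem non-degenerate: without it, shrinking `W` toward zero would minimize the invariance term on its own. Other constraints, such as a reconstruction term or explicit re-orthogonalization, would serve the same purpose.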
Related Work & Insights¶
- CoOp/CoCoOp (Zhou et al., 2022): Learnable prompts, but lack theoretical OOD guarantees.
- IRM (Arjovsky et al., 2020): Invariant learning, but does not leverage VLM properties.
- Causal Representation Learning (Schölkopf et al., 2021): Combining it with CLIP in this work is a novel contribution.
- Insight: The cross-modal alignment of VLMs naturally provides conditions for causal identifiability.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (Causal perspective analysis of CLIP's OOD problems + linear projection theory)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Multiple benchmarks + ablations)
- Writing Quality: ⭐⭐⭐⭐ (Rigorous derivation in causal analysis)
- Value: ⭐⭐⭐⭐⭐ (Provides theoretical and practical solutions for VLM OOD generalization)
Core Theoretical Supplements¶
- Condition 5.2 requires the existence of \(D+1\) text description pairs, ensuring the matrix \(A\) is invertible.
- The condition \(I(Z_{inv};Z) > c\) in Theorem 5.6 ensures that the invariant factors contain sufficient information.
- The consistency of the causal mechanism is proven via do-calculus: \(P^*(y|do(\mathbf{z}_{inv})) = P(y|do(\mathbf{z}_{inv}))\).
- The projection matrix estimation is based on a contrastive learning objective, minimizing the discrepancy of intervention pairs in the invariant subspace.
- Image and text interventions can be used individually or in combination; experiments show that the combination yields the best results.