Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective

Conference: ACL 2026
arXiv: 2604.18460
Code: GitHub
Area: Audio and Speech
Keywords: Causal Invariant Representation, Multimodal Sentiment Analysis, Out-of-Distribution Generalization, Feature Disentanglement, Virtual Environments

TL;DR

This paper proposes CmIR (Causal Modality Invariant Representation learning), which uses causal inference theory to explicitly disentangle each modality into a causal invariant representation and an environment-specific spurious representation. An objective combining invariance, mutual information, and reconstruction constraints encourages the invariant representations to keep stable predictive relationships across environments. The method reports SOTA performance in multimodal sentiment, humor, and sarcasm detection, particularly in OOD and noisy scenarios.

Background & Motivation

Background: Multimodal affective computing integrates linguistic, acoustic, and visual modalities for emotion prediction. Existing methods perform well on in-distribution tests but often learn spurious cross-modal correlations present in the training data.

Limitations of Prior Work: (1) Models may over-rely on consistent smiling of a speaker (a spurious visual feature) rather than semantic content; (2) Noisy modalities (e.g., background noise or low-resolution video) further disrupt these spurious correlations, widening the generalization gap; (3) Existing causal methods either lack theoretical guarantees or target only specific biases (such as speaker bias), lacking generality.

Key Challenge: There is a need for a general framework to distinguish causal features from spurious ones without relying on prior assumptions about bias types or requiring pre-defined bias labels.

Goal: Establish a theoretically guaranteed general framework based on causal inference to disentangle each modality into causal invariant and environment-specific spurious components.

Key Insight: The defining property of a causal invariant representation is its predictive stability across environments: if \(P(Y|Z_m^{\text{inv}}, E=e_1) = P(Y|Z_m^{\text{inv}}, E=e_2)\) holds for every pair of environments \(e_1, e_2\), then \(Z_m^{\text{inv}}\) is taken to contain only causal features.

Core Idea: Disentanglement is learned through a tri-constraint optimization: invariance constraints ensure consistent cross-environment predictions, mutual information constraints ensure independence between the two components, and reconstruction constraints prevent information loss. When explicit environment labels are missing, virtual environments are simulated by injecting different intensities of noise into the original features.
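The noise-based simulation of virtual environments can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `make_virtual_environments`, its parameters, and the toy shapes are hypothetical, and the linear scaling \(\alpha^{(e)} = \alpha^{(1)} \cdot e\) is taken from the Key Designs description. The sketch also assumes that every sample is perturbed under every environment, so that the same sample can later be compared across environments.

```python
import numpy as np

def make_virtual_environments(x, num_envs=3, base_alpha=0.1, seed=0):
    """Return one noisy copy of x per virtual environment e = 1..num_envs.

    Noise intensity grows linearly: alpha_e = base_alpha * e.
    Assumption of this sketch: each sample is perturbed under every
    environment rather than assigned to a single one.
    """
    rng = np.random.default_rng(seed)
    views = []
    for e in range(1, num_envs + 1):
        alpha_e = base_alpha * e
        noise = rng.standard_normal(x.shape)  # additive Gaussian noise
        views.append(x + alpha_e * noise)
    return np.stack(views)  # shape: (num_envs, batch, features)

# Toy features for one modality: 4 samples, 8 dimensions.
x = np.zeros((4, 8))
views = make_virtual_environments(x, num_envs=3, base_alpha=0.1)
print(views.shape)  # (3, 4, 8)
```

Stacking the views along a leading environment axis keeps the per-sample alignment that a cross-environment consistency penalty needs.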

Method

Overall Architecture

Each modality \(X_m\) is disentangled into \((Z_m^{\text{inv}}, Z_m^{\text{spu}})\) via an encoder \(g_m\). Only the concatenation of invariant representations \(\{Z_m^{\text{inv}}\}_{m=1}^M\) is used for prediction. During training, the prediction loss and three constraint terms are optimized simultaneously. A decoder \(r_m\) reconstructs the original input from both components to prevent information loss.
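A rough sketch of this data flow is given below. It is an illustration under stated assumptions rather than the paper's implementation: `ModalityBranch`, `init_linear`, the single-layer encoder and decoder (the paper uses MLPs), the `tanh` activation, and all dimensions are hypothetical choices made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_linear(in_dim, out_dim):
    """Random weight matrix for a toy linear layer (hypothetical helper)."""
    return rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

class ModalityBranch:
    """Toy encoder g_m and decoder r_m for one modality (single layer each)."""

    def __init__(self, in_dim, rep_dim):
        self.rep_dim = rep_dim
        self.w_enc = init_linear(in_dim, 2 * rep_dim)  # g_m
        self.w_dec = init_linear(2 * rep_dim, in_dim)  # r_m

    def encode(self, x):
        h = np.tanh(x @ self.w_enc)
        # First half is the invariant part, second half the spurious part.
        return h[:, :self.rep_dim], h[:, self.rep_dim:]

    def decode(self, z_inv, z_spu):
        # Reconstruction uses BOTH components.
        return np.concatenate([z_inv, z_spu], axis=1) @ self.w_dec

# Three modalities with toy input sizes, batch of 5 samples.
dims = {"text": 16, "audio": 8, "vision": 12}
branches = {m: ModalityBranch(d, rep_dim=4) for m, d in dims.items()}
batch = {m: rng.standard_normal((5, d)) for m, d in dims.items()}

z_inv_all, recons = [], {}
for m, x in batch.items():
    z_inv, z_spu = branches[m].encode(x)
    z_inv_all.append(z_inv)
    recons[m] = branches[m].decode(z_inv, z_spu)

# Only the invariant parts are concatenated and fed to the prediction head.
fused = np.concatenate(z_inv_all, axis=1)
w_head = init_linear(fused.shape[1], 1)
y_hat = fused @ w_head
print(fused.shape, y_hat.shape)  # (5, 12) (5, 1)
```

The point of the sketch is the asymmetry: the spurious component reaches the decoder but never the prediction head.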

Key Designs

  1. Virtual Environment Construction + Invariance Constraint:

    • Function: Implements cross-environment invariance training in the absence of explicit environment labels.
    • Mechanism: Each sample is randomly assigned a virtual environment label \(e \in \{1,...,K\}\), and additive Gaussian noise with intensity \(\alpha^{(e)} = \alpha^{(1)} \cdot e\) is injected. Invariant representations are extracted from these perturbed features, and the L1 norm is used to constrain consistency across different environments: \(\mathcal{R}_{\text{inv}}^{(m)} = \sum_{e_1 \neq e_2} \|Z_m^{\text{inv},(e_1)} - Z_m^{\text{inv},(e_2)}\|_1\).
    • Design Motivation: Constraining the representations directly is stronger than a KL divergence constraint on predictions, because matching representations imply matching output distributions under a shared predictor. It also applies to both classification and regression tasks and requires no additional unimodal predictors.
  2. Mutual Information Minimization via Orthogonality Approximation:

    • Function: Ensures statistical independence between the invariant and spurious components.
    • Mechanism: The method computes the in-batch normalized correlation matrix \(\bm{C}^m = \text{Nor}(\bm{Z}_m^{\text{inv}}) \cdot \text{Nor}(\bm{Z}_m^{\text{spu}})^\top\) and penalizes it using a weighted Frobenius norm: diagonal terms (within-sample orthogonality) have a weight of 1, while off-diagonal terms (cross-sample correlation) have a weight of \(\alpha < 1\). This weight should not be confused with the noise intensity \(\alpha^{(e)}\) used for the virtual environments.
    • Design Motivation: Since direct computation of mutual information is infeasible, orthogonality serves as a tractable proxy: it is a necessary, though not sufficient, condition for independence. Used alongside the invariance and reconstruction constraints, it encourages semantic separation.
  3. Reconstruction Constraint to Prevent Degradation:

    • Function: Ensures that the two disentangled components retain all information from the original input.
    • Mechanism: The decoder \(r_m\) reconstructs original features from \((Z_m^{\text{inv}}, Z_m^{\text{spu}})\): \(\mathcal{R}_{\text{rec}}^{(m)} = \|X_m - r_m(Z_m^{\text{inv}}, Z_m^{\text{spu}})\|_2^2\).
    • Design Motivation: Without reconstruction constraints, the model might learn degenerate solutions where the invariant component contains all information while the spurious component is empty, or vice versa.
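The three constraints above can be sketched in NumPy as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, \(\text{Nor}(\cdot)\) is assumed to be row-wise L2 normalization, the weighted Frobenius penalty is assumed to be applied in squared form, and the invariance sum runs over ordered pairs \(e_1 \neq e_2\) as written in the formula.

```python
import numpy as np

def invariance_penalty(z_inv_views):
    """Sum of L1 distances between invariant representations of the same
    samples under different environments (ordered pairs e1 != e2).

    z_inv_views: array of shape (num_envs, batch, dim).
    """
    num_envs = z_inv_views.shape[0]
    total = 0.0
    for e1 in range(num_envs):
        for e2 in range(num_envs):
            if e1 != e2:
                total += np.abs(z_inv_views[e1] - z_inv_views[e2]).sum()
    return float(total)

def l2_normalize(z, eps=1e-8):
    """Row-wise L2 normalization (assumed meaning of Nor)."""
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)

def orthogonality_penalty(z_inv, z_spu, off_diag_weight=0.5):
    """Weighted squared Frobenius norm of the in-batch correlation matrix.

    Diagonal entries (same sample) get weight 1, off-diagonal entries
    (different samples) get off_diag_weight < 1.
    """
    c = l2_normalize(z_inv) @ l2_normalize(z_spu).T  # (batch, batch)
    weights = np.full(c.shape, off_diag_weight)
    np.fill_diagonal(weights, 1.0)
    return float((weights * c ** 2).sum())

def reconstruction_penalty(x, x_hat):
    """Squared L2 reconstruction error."""
    return float(((x - x_hat) ** 2).sum())

# Toy check: identical invariant views incur no invariance penalty.
z = np.array([[1.0, 0.0], [0.0, 1.0]])
views = np.stack([z, z])
print(invariance_penalty(views))  # 0.0

# Within-sample orthogonal, but correlated across samples: only the
# down-weighted off-diagonal terms contribute.
z_spu = np.array([[0.0, 2.0], [3.0, 0.0]])
print(orthogonality_penalty(z, z_spu, off_diag_weight=0.5))
```

In the second toy case the diagonal of the correlation matrix is zero, so the penalty comes entirely from cross-sample terms and is close to 1.0 with the chosen weight.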

Loss & Training

The total objective is defined as: \(\mathcal{L} = \mathcal{L}_{\text{pred}} + \sum_{m=1}^{M} \left( \lambda_1 \mathcal{R}_{\text{inv}}^{(m)} + \lambda_2 \mathcal{R}_{\text{dec}}^{(m)} + \lambda_3 \mathcal{R}_{\text{rec}}^{(m)} \right)\), where \(\mathcal{R}_{\text{dec}}^{(m)}\) denotes the orthogonality-based penalty of the second key design (the mutual information constraint). The authors provide full proofs for three theorems: the existence of invariant representations, their extractability, and their OOD risk advantages.
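As a small worked example of how the terms combine, the sketch below sums the weighted regularizers over modalities and adds the prediction loss. The function name `total_objective`, the argument names, and the numeric values are hypothetical, and the sketch assumes that all three weighted terms sit inside the sum over modalities.

```python
def total_objective(pred_loss, regularizers, lam_inv=1.0, lam_dec=1.0, lam_rec=1.0):
    """Combine the prediction loss with per-modality regularizers.

    regularizers: dict mapping modality name -> (r_inv, r_dec, r_rec).
    Assumption: every weighted term is summed over all modalities.
    """
    total = pred_loss
    for r_inv, r_dec, r_rec in regularizers.values():
        total += lam_inv * r_inv + lam_dec * r_dec + lam_rec * r_rec
    return total

# Made-up regularizer values for two modalities.
terms = {"text": (0.2, 0.1, 0.4), "audio": (0.3, 0.0, 0.5)}
loss = total_objective(1.0, terms, lam_inv=0.5, lam_dec=2.0, lam_rec=0.1)
# text:  0.5*0.2 + 2.0*0.1 + 0.1*0.4 = 0.34
# audio: 0.5*0.3 + 2.0*0.0 + 0.1*0.5 = 0.20
print(round(loss, 4))  # 1.54
```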

Key Experimental Results

Main Results

Evaluations were conducted on CMU-MOSI, CMU-MOSEI, and CH-SIMS-v2 (sentiment), UR-FUNNY (humor), and MUStARD (sarcasm). CmIR achieves SOTA performance in both standard and OOD settings.

Key Findings

  • Under OOD settings (CMU-MOSI OOD), CmIR's margin over the baselines widens, which is consistent with the claimed generalization advantage of causal invariant representations.
  • In tests with noisy modalities, CmIR degrades much less than the baselines; isolating the spurious components makes the model more robust to noise.
  • Ablations demonstrate that all three constraints are indispensable; removing any single constraint leads to a decrease in performance.

Highlights & Insights

  • The design of the tri-constraint framework is elegant—invariance ensures "causality," orthogonality ensures "purity," and reconstruction ensures "completeness."
  • Virtual environment construction is a practical compromise—while potentially less precise than real environment labels, it provides a viable solution for the reality that most datasets lack environment metadata.
  • Theoretical guarantees (the three theorems) provide a solid mathematical foundation for the framework.

Limitations & Future Work

  • The construction of virtual environments relies on the assumption of additive Gaussian noise, which may not perfectly reflect real-world distribution shifts.
  • Hyperparameters (number of environments \(K\), noise coefficient \(\alpha\), and the three \(\lambda\) values) require careful tuning.
  • Both encoders and decoders are simple MLPs; employing stronger architectures might yield further improvements.

Related Work & Insights

  • vs. IRM: While Invariant Risk Minimization targets unimodal scenarios, CmIR extends the principle to multimodal feature disentanglement.
  • vs. existing multimodal causal methods: Unlike methods targeting specific biases (such as speaker or modality-specific bias), CmIR is a general framework that does not rely on specific bias assumptions.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First systematic integration of causal invariant representation learning with feature disentanglement in multimodal affective computing.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 6 datasets + standard/OOD/noise settings + complete ablations + theoretical proofs.
  • Writing Quality: ⭐⭐⭐⭐⭐ Rigorous theoretical derivations and comprehensive experimental analysis.
  • Value: ⭐⭐⭐⭐⭐ Represents a paradigm-level contribution to multimodal robustness research.