Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings

Conference: ICLR 2026
arXiv: 2602.06218
Code: https://github.com/Parabrele/IsoEnergy
Area: Interpretability
Keywords: modality gap, sparse autoencoders, cross-modal redundancy, iso-energy hypothesis, VLM interpretability

TL;DR

This paper proposes the Iso-Energy hypothesis — that concepts genuinely shared across modalities should exhibit equal average activation energy in each modality — and introduces Aligned SAE as an analytical tool to reveal the geometric structure of VLM embedding spaces, where bimodal atoms carry cross-modal alignment signals and unimodal atoms fully account for the modality gap.

Background & Motivation

Background: Vision-language models such as CLIP and SigLIP use contrastive learning to map images and text into a shared embedding space, achieving cross-modal alignment. However, the internal geometric structure of these embedding spaces remains poorly understood.

Limitations of Prior Work: A well-known phenomenon called the "modality gap" exists — image and text embeddings occupy disjoint cones within the latent space. Prior work has attempted to eliminate this gap by removing mean differences or projecting out certain coordinate directions, but such interventions consistently degrade cross-modal performance. When sparse autoencoders (SAEs) are used to extract concept dictionaries, concepts tend to be modality-separated, making it difficult to identify genuinely bimodal concepts.
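
To make the gap and the "remove the mean difference" intervention concrete, here is a minimal sketch. It assumes the gap is measured as the distance between the centroids of the normalized image and text embeddings, which is one common choice and not necessarily the paper's; the function names and the toy data are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale every row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def modality_gap(img_emb, txt_emb):
    """Distance between the centroids of the normalized embeddings (assumed gap measure)."""
    return float(np.linalg.norm(l2_normalize(img_emb).mean(0) - l2_normalize(txt_emb).mean(0)))

def remove_mean_difference(img_emb, txt_emb):
    """Naive gap removal: shift each modality so that both share one centroid."""
    img, txt = l2_normalize(img_emb), l2_normalize(txt_emb)
    center = 0.5 * (img.mean(0) + txt.mean(0))
    return img - img.mean(0) + center, txt - txt.mean(0) + center

# Toy embeddings: the two modalities are offset along a single direction.
rng = np.random.default_rng(0)
offset = np.zeros(16)
offset[0] = 3.0
img = rng.normal(size=(200, 16)) + offset
txt = rng.normal(size=(200, 16)) - offset

gap_before = modality_gap(img, txt)
img_c, txt_c = remove_mean_difference(img, txt)
gap_after = float(np.linalg.norm(img_c.mean(0) - txt_c.mean(0)))
```

The shift closes the centroid gap by construction, but it moves every embedding without regard to which directions carry shared information, which is the kind of intervention reported above to degrade cross-modal performance.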

Key Challenge: Although VLMs are trained explicitly for cross-modal alignment, the extracted concept dictionaries exhibit pervasive modality separation. This arises because concept recovery is an underdetermined problem — nonlinear ICA is non-identifiable — and without additional inductive bias, standard SAEs cannot correctly distinguish bimodal from unimodal atoms.

Goal: (a) How can bimodal vs. unimodal concepts be accurately recovered from VLM embeddings? (b) What is the fundamental nature of the modality gap? (c) Can the modality gap be eliminated without degrading model performance?

Key Insight: The authors reason from the data-generating process: if multimodal data is produced by shared latent concept vectors passed through modality-specific generators, then genuinely shared concepts should leave "redundant statistical traces" across modalities — in particular, equal average activation energy.

Core Idea: Cross-modal redundancy is used as an inductive bias. An iso-energy constraint guides SAE training toward a correct bimodal/unimodal concept decomposition, thereby revealing and enabling manipulation of the geometric structure of VLM embedding spaces.

Method

Overall Architecture

The paper defines a multimodal concept-generating process in which a sparse latent concept vector \(\mathbf{c}\) is sampled and passed through modality-specific generators \(\mathbf{g}^{(d)}\) to produce observations in each modality. The VLM encoder \(\mathbf{f}\) approximates the inverse of the generator, and the SAE \(\phi\) further lifts embeddings back to concept space. The goal is for \(\phi \circ \mathbf{f}\) to correctly recover the underlying concepts; however, standard SAEs fail due to identifiability issues.
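
The generative picture can be illustrated with a toy linear version. This is only a sketch under strong assumptions: the generators are taken to be random linear maps, some concepts are arbitrarily declared image-only or text-only, and an exact pseudo-inverse stands in for the idealized composition \(\phi \circ \mathbf{f}\). None of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_concepts, dim = 500, 12, 32

# Sparse latent concept vectors c: each concept is active with probability 0.2.
active = rng.random((n_samples, n_concepts)) < 0.2
c = active * np.abs(rng.normal(size=(n_samples, n_concepts)))

# Hypothetical split: concepts 0-7 shared, 8-9 image-only, 10-11 text-only.
img_only = np.zeros(n_concepts, dtype=bool)
img_only[8:10] = True
txt_only = np.zeros(n_concepts, dtype=bool)
txt_only[10:12] = True

# Modality-specific generators g^(d), here plain linear maps (an assumption).
G_img = rng.normal(size=(n_concepts, dim))
G_txt = rng.normal(size=(n_concepts, dim))

# Each modality only expresses the concepts available to it.
x_img = (c * ~txt_only) @ G_img
x_txt = (c * ~img_only) @ G_txt

# Idealized recovery: invert each generator with its pseudo-inverse.
c_hat_img = x_img @ np.linalg.pinv(G_img)
c_hat_txt = x_txt @ np.linalg.pinv(G_txt)
```

In this idealized setting the shared concepts are recovered identically from both modalities, while the modality-specific ones appear in only one of the recovered codes. A learned SAE has no access to the generators, which is where the identifiability problem enters.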

Key Designs

Iso-Energy Hypothesis:
- Function: Provides a testable inductive bias for concept recovery.
- Mechanism: If concept \(k\) is shared across modalities \(d\) and \(d'\), its mean squared activation should be equal in both: \(\mathbb{E}_{X \in \mathcal{X}^{(d)}}[\psi(X)_k^2] = \mathbb{E}_{X \in \mathcal{X}^{(d')}}[\psi(X)_k^2]\), where \(\psi(X)_k\) denotes the activation of concept \(k\) on input \(X\).
- Design Motivation: Truly shared concepts are generated by the same latent codes and should therefore produce comparable activation magnitudes across modalities. This supplies an additional constraint to the otherwise non-identifiable nonlinear ICA problem.
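
The hypothesis can be checked empirically by comparing per-atom energies between modalities. The sketch below assumes the concept activations are available as two arrays of shape (samples, atoms); the relative-difference statistic and the function names are illustrative choices, not the paper's.

```python
import numpy as np

def activation_energy(codes):
    """Mean squared activation per atom; codes has shape (n_samples, n_atoms)."""
    return (codes ** 2).mean(axis=0)

def iso_energy_gap(codes_img, codes_txt, eps=1e-12):
    """Relative energy difference per atom: 0 means equal energy, 1 means one modality only."""
    e_img = activation_energy(codes_img)
    e_txt = activation_energy(codes_txt)
    return np.abs(e_img - e_txt) / (e_img + e_txt + eps)

# Atom 0 has the same energy in both modalities; atom 1 fires only on images.
codes_img = np.array([[1.0, 2.0], [3.0, 0.0]])
codes_txt = np.array([[3.0, 0.0], [1.0, 0.0]])
gap = iso_energy_gap(codes_img, codes_txt)
```

An atom that satisfies Iso-Energy gives a value near 0, while an atom active in a single modality gives a value near 1.
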
Aligned SAE (SAE-A):
- Function: Augments standard SAE training with an iso-energy regularization term.
- Mechanism: \(\mathcal{L}_{\text{SAE-A}} = \mathcal{L}_{\text{SAE}} + \beta \cdot \mathcal{L}_{\text{align}}\), where \(\mathcal{L}_{\text{align}} = -\frac{1}{b}\text{Tr}\big(\mathbf{Z}^{(d)} {\mathbf{Z}^{(d')}}^{\top}\big)\). Here \(b\) is the batch size and the rows of \(\mathbf{Z}^{(d)}\) and \(\mathbf{Z}^{(d')}\) are the \(\ell_2\)-normalized encodings of the paired samples, so the term encourages high cosine similarity within each pair. Sparsity is enforced via Matching Pursuit (\(\ell_0\)), and a small weight of \(\beta \approx 10^{-4}\) suffices.
- Design Motivation: The lightweight regularizer preserves reconstruction quality (\(R^2 \geq 0.99\)) while substantially improving the recovery of bimodal concepts.
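
A minimal NumPy sketch of the alignment term follows. It uses the fact that the trace of \(\mathbf{Z}^{(d)} {\mathbf{Z}^{(d')}}^{\top}\) equals the sum of row-wise dot products, so no \(b \times b\) matrix is formed. The form of \(\mathcal{L}_{\text{SAE}}\) is not given here, so it is passed in as a precomputed number; the function names are hypothetical.

```python
import numpy as np

def align_loss(z_a, z_b, eps=1e-12):
    """-(1/b) Tr(Z_a Z_b^T) on row-normalized codes, i.e. minus the mean paired cosine similarity."""
    z_a = z_a / (np.linalg.norm(z_a, axis=1, keepdims=True) + eps)
    z_b = z_b / (np.linalg.norm(z_b, axis=1, keepdims=True) + eps)
    b = z_a.shape[0]
    # Tr(Z_a Z_b^T) is the sum over pairs of <z_a[i], z_b[i]>.
    return -float((z_a * z_b).sum()) / b

def sae_a_loss(sae_loss, z_img, z_txt, beta=1e-4):
    """Base SAE loss (given as a number) plus the weighted alignment term."""
    return sae_loss + beta * align_loss(z_img, z_txt)

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 6))
same = align_loss(z, z)       # identical codes for every pair
opposite = align_loss(z, -z)  # opposite codes for every pair
```

The term is bounded between -1 and 1, so with \(\beta \approx 10^{-4}\) it shifts the total loss by at most about \(10^{-4}\).
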
Synthetic Data Validation:
- Function: Validates the method on synthetic data where ground-truth atom types are known.
- Mechanism: Data are generated with known bimodal/unimodal atoms; parameter \(\tau_1\) controls the cross-modal alignment of atoms and \(\tau_2\) controls embedding-level alignment. When Iso-Energy holds (\(\tau_1=1\)), the standard SAE fails (Wasserstein=0.396, mma=0.29) while SAE-A succeeds (0.184, 0.52); when it does not hold, the two methods perform comparably.
- Design Motivation: Ensures that the regularizer does not hallucinate bimodal atoms that do not exist.
Geometric Decomposition and Intervention:
- Function: Decomposes the VLM embedding space into a bimodal subspace \(\Gamma\) and a unimodal subspace \(\Omega_I \oplus \Omega_T\).
- Mechanism: A modality score \(\mu\) partitions dictionary atoms into bimodal and unimodal categories. Bimodal atoms span \(\Gamma\) and carry cross-modal alignment signals; unimodal atoms span \(\Omega_{I/T}\), carry modality-specific information, and fully account for the modality gap.
- Design Motivation: This decomposition enables targeted interventions — removing unimodal atoms eliminates the modality gap without harming performance.
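
The partition-then-intervene idea can be sketched as follows. The exact definition of \(\mu\) is not given above, so this example assumes an energy-ratio score (the share of an atom's energy that comes from images) and a hypothetical tolerance `tol`; the dictionary and codes are toy values.

```python
import numpy as np

def modality_score(codes_img, codes_txt, eps=1e-12):
    """Assumed score: fraction of each atom's energy coming from images (0.5 = balanced)."""
    e_img = (codes_img ** 2).mean(axis=0)
    e_txt = (codes_txt ** 2).mean(axis=0)
    return e_img / (e_img + e_txt + eps)

def partition_atoms(mu, tol=0.25):
    """Split atoms into bimodal, image-only, and text-only masks (tol is hypothetical)."""
    return np.abs(mu - 0.5) <= tol, mu > 0.5 + tol, mu < 0.5 - tol

def keep_bimodal(codes, dictionary, bimodal):
    """Reconstruct embeddings from bimodal atoms only, dropping unimodal contributions."""
    return (codes * bimodal) @ dictionary

def centroid_gap(a, b):
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

# Toy dictionary: atom 0 is shared, atom 1 is an image bias, atom 2 is a text bias.
dictionary = np.eye(3)
codes_img = np.array([[1.0, 5.0, 0.0], [2.0, 5.0, 0.0]])
codes_txt = np.array([[2.0, 0.0, 5.0], [1.0, 0.0, 5.0]])

mu = modality_score(codes_img, codes_txt)
bimodal, image_only, text_only = partition_atoms(mu)

gap_before = centroid_gap(codes_img @ dictionary, codes_txt @ dictionary)
gap_after = centroid_gap(keep_bimodal(codes_img, dictionary, bimodal),
                         keep_bimodal(codes_txt, dictionary, bimodal))
```

In this toy case the two high-energy unimodal atoms produce the entire centroid gap, and dropping them closes it while leaving the shared atom untouched.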

Loss & Training

The base SAE enforces \(\ell_0\) sparsity via Matching Pursuit, selecting active atoms through sequential residual updates. The alignment loss \(\mathcal{L}_{\text{align}}\) maximizes cosine similarity between paired-sample encodings with weight \(\beta \approx 10^{-4}\), which is sufficiently small to leave reconstruction quality essentially unchanged.
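
For readers unfamiliar with Matching Pursuit, the following is a sketch of the textbook greedy algorithm with sequential residual updates. It assumes a dictionary with unit-norm rows and a fixed number of steps; the paper's SAE encoder may differ in details (for example, sign constraints or how the dictionary is learned).

```python
import numpy as np

def matching_pursuit(x, dictionary, n_steps):
    """Greedy sparse coding of x with at most n_steps active atoms.

    x: (dim,) vector; dictionary: (n_atoms, dim) with unit-norm rows.
    Returns the code vector and the final residual.
    """
    residual = x.astype(float)
    codes = np.zeros(dictionary.shape[0])
    for _ in range(n_steps):
        corr = dictionary @ residual         # correlation of every atom with the residual
        k = int(np.argmax(np.abs(corr)))     # pick the best-matching atom
        codes[k] += corr[k]                  # accumulate its coefficient
        residual -= corr[k] * dictionary[k]  # remove its contribution from the residual
    return codes, residual

dictionary = np.eye(4)
x = np.array([0.0, 3.0, 0.0, -1.0])
codes, residual = matching_pursuit(x, dictionary, n_steps=2)
```

The number of steps bounds the \(\ell_0\) norm of the code directly, which is how the sparsity level is controlled in this scheme.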

Key Experimental Results

Main Results

SAEs and SAE-As are trained on six VLMs (CLIP, CLIP-L, OpenCLIP, OpenCLIP-L, SigLIP, SigLIP2); results for two of them are shown here:

| Model | MSE (SAE / SAE-A) | R² (SAE / SAE-A) | Classification accuracy \(p_{\text{acc}}\) (SAE / SAE-A) |
| --- | --- | --- | --- |
| CLIP | 0.141 / 0.163 | 0.859 / 0.837 | 0.847 / 0.915 |
| SigLIP2 | 0.115 / 0.115 | 0.884 / 0.885 | 0.897 / 0.899 |

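SAE-A improves the classification accuracy of bimodal atom activation patterns, markedly for CLIP (0.847 to 0.915) and marginally for SigLIP2 (0.897 to 0.899). Reconstruction quality is unchanged for SigLIP2 and drops slightly for CLIP (R² from 0.859 to 0.837).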
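
As a reading aid for the MSE and R² columns, recall the usual definition of the coefficient of determination:

\[
R^2 = 1 - \frac{\text{MSE}}{\operatorname{Var}(\mathbf{x})}
\]

The reported pairs sum to one within rounding (for CLIP, \(0.141 + 0.859 = 1.000\) and \(0.163 + 0.837 = 1.000\); for SigLIP2, \(0.115 + 0.884 = 0.999\) and \(0.115 + 0.885 = 1.000\)). This would be consistent with the MSE being normalized by the total variance of the embeddings, although that normalization is an inference from the numbers and is not stated above.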

Ablation Study

| Experiment | Key Metric | Notes |
| --- | --- | --- |
| Synthetic data (Iso-Energy holds) | SAE: W=0.396, mma=0.29; SAE-A: W=0.184, mma=0.52 | SAE-A recovers bimodal atoms significantly better |
| Synthetic data (Iso-Energy does not hold) | Both: W≈0.19, mma≈0.82 | Regularizer does not fabricate bimodal atoms |
| Removing unimodal atoms | Modality gap disappears + cross-modal performance maintained | Confirms that unimodal atoms = modality gap |
| Vector arithmetic in bimodal subspace only | Retrieval performance improves + edits are more in-distribution | Bimodal subspace is the correct space for cross-modal operations |
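
The last row can be illustrated with an orthogonal projection onto the span of the bimodal atoms. This is a sketch under assumptions: the bimodal atoms are given as rows of a matrix, the composition rule (sum of projected embeddings, then renormalize) is one plausible form of vector arithmetic rather than the paper's exact procedure, and all names are hypothetical.

```python
import numpy as np

def bimodal_projector(bimodal_atoms):
    """Orthogonal projector onto the row space of bimodal_atoms, shape (n_atoms, dim)."""
    # pinv(A) @ A projects onto the row space of A, even if the atoms are linearly dependent.
    return np.linalg.pinv(bimodal_atoms) @ bimodal_atoms

def compose_in_bimodal_subspace(image_emb, text_emb, projector):
    """Add an image and a text embedding using only their bimodal components."""
    combined = projector @ image_emb + projector @ text_emb
    return combined / np.linalg.norm(combined)

# Toy space: the first two axes are bimodal, the third carries modality bias.
bimodal_atoms = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])
P = bimodal_projector(bimodal_atoms)

image_emb = np.array([1.0, 0.0, 2.0])   # content on axis 0, image bias on axis 2
text_emb = np.array([0.0, 1.0, -3.0])   # content on axis 1, text bias on axis 2
edited = compose_in_bimodal_subspace(image_emb, text_emb, P)
```

The projected result keeps the content of both inputs and has no component along the bias axis, whereas a naive sum would retain a leftover bias of -1 on that axis.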

Key Findings

- Sparse bimodal atoms carry all cross-modal alignment signals: they are few in number but highly information-dense.
- A small number of high-energy unimodal atoms act as "modality biases" and fully account for the modality gap.
- Removing unimodal atoms eliminates the modality gap without degrading downstream performance, which prior gap-removal interventions did not achieve.
- Restricting vector arithmetic to the bimodal subspace yields in-distribution edits and improves retrieval.
- Contrary to Papadimitriou et al. (2025), cross-modal information is carried by shared atoms rather than idiosyncratic ones.

Highlights & Insights

- Elegance and depth of the Iso-Energy hypothesis: A simple statistic, equal mean squared activation across modalities, suffices as a criterion for bimodal concepts and is grounded in a principled generative model. The idea should transfer to concept extraction in other multi-view or multimodal settings.
- "Neutrality as validation" strategy: Demonstrating on synthetic data that the regularizer is inert when the hypothesis does not hold (i.e., it does not fabricate bimodal atoms) preempts concerns about artificially introduced bias.
- Concept-level explanation of the modality gap: The paper elevates a previously purely geometric description (cones, ellipsoidal shells) to the concept level (unimodal atoms = modality biases), reframing the gap not as a bug to be eliminated but as a feature through which the model preserves modality-specific information.
- Matching Pursuit SAE: Enforcing \(\ell_0\) sparsity through Matching Pursuit rather than using ReLU or TopK is more consistent with the theoretical assumptions of sparse coding and is transferable to other SAE applications.

Limitations & Future Work

- The Iso-Energy hypothesis requires exactly equal energy across modalities, but some concepts may inherently be richer in one modality (e.g., color and texture in vision); this asymmetry is not discussed.
- Experiments are conducted solely on dual-encoder VLMs and are not extended to single-encoder or encoder-decoder architectures (e.g., LLaVA, Flamingo).
- SAE-A requires paired image-text data for training, which limits its applicability in unpaired settings.
- Although \(\beta\) is small, it still requires tuning, so the method is not hyperparameter-free.
- The binary bimodal/unimodal partition may be overly coarse; in practice, "partially bimodal" concepts may exist.

Comparison with Related Work

- vs. Liang et al. (2022) on the modality gap: They characterize the geometric phenomenon (cone structure) but find that eliminating the gap harms performance. This paper explains why (the gap arises from unimodal atoms that carry necessary modality-specific information) and shows it can be precisely removed at the concept level.
- vs. Schrodi et al. (2025): Their approach of projecting out a small number of canonical directions inadvertently damages bimodal information. SAE-A achieves a correct separation that avoids this collateral damage.
- vs. Papadimitriou et al. (2025): They conclude that cross-modal information is carried by idiosyncratic concepts; this paper finds the opposite, namely that it is carried by shared atoms. The discrepancy originates from the identifiability limitations of standard SAEs.
- vs. the Platonic Representation Hypothesis (Huh et al. 2024): The Iso-Energy hypothesis can be viewed as an operationalization of that hypothesis: if different models and modalities converge to the same features, the statistics of those features should be consistent across modalities.

Rating

Novelty: ⭐⭐⭐⭐⭐ The Iso-Energy hypothesis is concise and elegant; this is the first work to provide a complete concept-level explanation of the modality gap.
Experimental Thoroughness: ⭐⭐⭐⭐ Validation on both synthetic and real data is solid, though non-dual-encoder architectures are absent.
Writing Quality: ⭐⭐⭐⭐⭐ Theoretical motivation is clear, experimental logic is rigorous, and figures are well designed.
Value: ⭐⭐⭐⭐⭐ Significant contribution to VLM interpretability; Aligned SAE has broad application prospects.