Scaling Language-Centric Omnimodal Representation Learning¶
Conference: NeurIPS 2025 · arXiv: 2510.11693 · Code: GitHub · Area: Information Retrieval · Keywords: Multimodal Representation Learning, Contrastive Learning, MLLM Embedding, Cross-Modal Alignment, Generation-Representation Scaling Law
TL;DR¶
This paper proposes the LCO-Emb framework and demonstrates that Multimodal Large Language Models (MLLMs) implicitly establish cross-modal alignment during generative pretraining. Lightweight text-only contrastive fine-tuning suffices to activate full omnimodal representation capabilities. The work further identifies the Generation-Representation Scaling Law (GRSL), which establishes a positive correlation between generative capability and representation performance.
Background & Motivation¶
Limitations of Prior Work¶
Background: Cross-modal representation alignment is a central problem in multimodal AI. Traditional methods such as CLIP rely on contrastive learning over large-scale paired data to achieve vision-language alignment, yet they exhibit performance saturation on complex tasks including multilingual retrieval, visual text understanding, and interleaved multimodal encoding.
Core Limitations:
Bottleneck of the CLIP Paradigm: CLIP-style models improve by scaling model size, data, and batch size, but yield limited gains on tasks requiring deep cross-modal understanding.
Black-Box Advantage of MLLM Embeddings: Recent MLLM-based embedding methods outperform CLIP, yet a thorough analysis of why they are superior remains absent.
Large Multimodal Training Data Requirements: Leading methods such as GME require 8 million multimodal paired samples.
Core Insight: During generative pretraining, the language decoder of an MLLM learns to leverage multimodal signals within a shared representation space to generate unimodal outputs, thereby implicitly achieving cross-modal alignment. Contrastive learning therefore serves only as a lightweight "activation" step rather than learning alignment from scratch.
Method¶
Overall Architecture¶
The core mechanism of LCO-Emb is straightforward: extract the language decoder (LLM) from an MLLM, apply text-only contrastive learning with LoRA fine-tuning, and reinsert it into the original MLLM architecture. The modality encoders and projectors are frozen; only the LoRA parameters of the decoder are updated.
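A minimal sketch of this recipe, assuming a Hugging Face-style backbone and the peft library; the model name, loading class, and target module names are illustrative (the usual Qwen conventions), not the authors' released code:

```python
# Sketch of the LCO-Emb recipe: freeze the whole MLLM, then make LoRA adapters
# on the language decoder's projection layers the only trainable parameters.
import torch
from transformers import AutoModel            # loading class is illustrative
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"       # backbone name, illustrative
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# 1) Freeze everything: modality encoders, projectors, and decoder weights.
for p in model.parameters():
    p.requires_grad = False

# 2) Attach LoRA to the decoder's attention projections (rank/alpha from the
#    training setup below). If the vision/audio encoder reuses these module
#    names on a given backbone, restrict target_modules to decoder layers.
lora_cfg = LoraConfig(
    r=64, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()             # only LoRA weights are trainable
```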
Key Designs¶
- Discovery and Verification of Implicit Cross-Modal Alignment
    - Anisotropy analysis shows that the original MLLM representation space is degenerate (highly anisotropic). After text-only contrastive learning, not only do text embeddings become isotropic, but the isotropy of image, audio, and video embeddings improves as well.
    - Kernel-level similarity analysis shows that after text-only fine-tuning, the kNN overlap between the image and language modalities increases substantially, and the 7B model exhibits stronger cross-modal kernel alignment than the 3B model (a minimal diagnostic sketch is given after this list).
- Language-Centric Contrastive Learning Strategy
    - Only text-paired data (276K triplets from all-NLI) is used for InfoNCE contrastive learning.
    - LoRA is adopted rather than full fine-tuning, not primarily for parameter efficiency but to minimize perturbation of the pretrained weights and preserve the established cross-modal alignment structure.
    - Optionally, approximately 94K synthetic multimodal paired samples are added for calibration, bringing the total to 370K.
- Multimodal Variants and Model Fusion
    - Separate models are fine-tuned on different datasets (all-NLI for semantic similarity; Scale-1M for multilingual and scene-description tasks), and their respective strengths are combined via Model Soup weight averaging.
    - Multiple MLLM backbones are supported, including LLaVA-Next, Qwen2.5-VL, and Qwen2.5-Omni.
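The two diagnostics in the first design point can be reproduced with a few lines. The sketch below is an illustration under stated assumptions (paired image/text embeddings for the same items, stored as NumPy arrays), not the paper's exact measurement protocol:

```python
# Diagnostics for implicit alignment: anisotropy as mean pairwise cosine
# similarity, and cross-modal kernel alignment as mutual k-NN overlap.
import numpy as np

def _normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def anisotropy(emb):
    """Mean off-diagonal cosine similarity; closer to 1 = more degenerate space."""
    z = _normalize(emb)
    sim = z @ z.T
    n = len(z)
    return (sim.sum() - n) / (n * (n - 1))        # exclude self-similarity

def knn_overlap(emb_a, emb_b, k=10):
    """Average fraction of shared k-nearest neighbours across two spaces."""
    za, zb = _normalize(emb_a), _normalize(emb_b)
    # sort by descending similarity; drop column 0 (self) before taking top-k
    nn_a = np.argsort(-za @ za.T, axis=1)[:, 1:k + 1]
    nn_b = np.argsort(-zb @ zb.T, axis=1)[:, 1:k + 1]
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))

# Usage: image_emb, text_emb are (N, d) arrays for the same N items.
# print(anisotropy(text_emb), knn_overlap(image_emb, text_emb, k=10))
```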
Loss & Training¶
- Optimizer: AdamW with cosine schedule; peak learning rate \(4 \times 10^{-4}\)
- Batch size: 768 (text-only); scaled proportionally to ~1052 for multimodal training
- Training duration: 2 epochs; LoRA rank=64, alpha=16 (text-only) / 128 (multimodal)
- Training cost: text-only requires only ~4.7 GPU hours (3B) / ~9.3 GPU hours (7B)
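A minimal sketch of the InfoNCE objective used in the text-only stage, with the all-NLI hard negative plus in-batch negatives; the temperature (0.05), the pooling choice, and the optimizer skeleton are illustrative assumptions, since the summary above does not specify them:

```python
# InfoNCE over (query, positive, hard-negative) triplets with in-batch negatives.
import torch
import torch.nn.functional as F

def info_nce(q_emb, pos_emb, neg_emb, temperature=0.05):
    """q_emb, pos_emb, neg_emb: (B, d) embeddings from the LoRA-tuned decoder."""
    q = F.normalize(q_emb, dim=-1)
    cand = F.normalize(torch.cat([pos_emb, neg_emb], dim=0), dim=-1)   # (2B, d)
    logits = q @ cand.T / temperature          # (B, 2B) similarity scores
    labels = torch.arange(q.size(0), device=q.device)  # i-th candidate is the positive
    return F.cross_entropy(logits, labels)

# Optimization skeleton with the hyperparameters reported above:
# optimizer = torch.optim.AdamW(trainable_params, lr=4e-4)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
```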
Key Experimental Results¶
Main Results (MIEB-Lite, 51 Tasks)¶
| Model | Data | Retrieval (en) | Clustering | Zero-Shot Cls. | Linear Probe | Compositionality | Doc. Understanding | vSTS (en) | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| CLIP-ViT-bigG (2B) | - | 34.2 | 80.8 | 72.4 | 77.8 | 35.0 | 35.5 | 73.4 | 51.3 |
| GME (7B) | 8.0M | 37.9 | 69.6 | 55.5 | 68.7 | 52.2 | 86.1 | 81.8 | 64.5 |
| LCO-Emb-VL (7B, text-only) | 276k | 31.8 | 52.7 | 49.1 | 68.5 | 40.4 | 66.0 | 88.4 | 60.4 |
| LCO-Emb-Omni (7B, multimodal) | 370k | 36.4 | 80.0 | 68.5 | 74.1 | 50.1 | 75.4 | 86.2 | 68.8 |
Ablation Study (Training Strategy Comparison, Qwen2.5-VL-7B)¶
| Training Strategy | GPU Hours | Multilingual Retrieval | vSTS (en) | Doc. Understanding | Linear Probe | Avg. |
|---|---|---|---|---|---|---|
| CLIP-style CL (multimodal 800K) | ~550h | 18.24 | 73.92 | 44.89 | 38.93 | 50.02 |
| Full Fine-Tune (text-only) | ~17.3h | 44.05 | 83.15 | 58.02 | 53.34 | 66.49 |
| LoRA (text-only) | ~9.3h | 56.64 | 85.05 | 67.49 | 53.91 | 71.98 |
Key Findings¶
- Text-Only Training Surpasses Multimodal CLIP Training: LoRA text-only fine-tuning needs only about 1/60 of the training time of CLIP-style training, yet achieves an average score roughly 22 points higher.
- LoRA Substantially Outperforms Full Fine-Tuning: Preserving the pretrained alignment structure is critical; the regularization imposed by LoRA proves more beneficial than unconstrained full fine-tuning.
- Multi-Teacher Fusion Provides Additional Gains: Incorporating only 94K multimodal samples (25% of total data) raises the average score from 60.4 to 67.6.
Highlights & Insights¶
- Most Important Insight: The role of contrastive learning on MLLMs is not to "learn alignment" but to "activate alignment" — MLLMs have already established cross-modal alignment implicitly during generative pretraining, and contrastive learning merely "awakens" the representation space from anisotropic degeneracy.
- GRSL (Generation-Representation Scaling Law): The stronger a model's generative capability, the higher its ceiling on representation performance after contrastive fine-tuning. This relationship is grounded theoretically via a PAC-Bayesian argument in which the generative loss \(\mathcal{L}_g(P)\) controls the upper bound on representation performance (a schematic form of the bound is given after this list).
- A New Perspective on LoRA: The primary value of LoRA lies not in parameter efficiency but in minimally perturbing pretrained knowledge and cross-modal alignment structure.
- Extreme Data Efficiency: Only 276K text pairs (370K including the optional multimodal calibration set) suffice to match or surpass GME, which is trained on 8 million multimodal paired samples.
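For reference, a McAllester-style PAC-Bayes bound gives the schematic form of such an argument (this is the standard bound for a loss bounded in \([0,1]\), not the paper's exact statement): for a prior \(P\), any posterior \(Q\) over encoders, empirical representation risk \(\hat{\mathcal{L}}\) on \(n\) samples, and confidence \(1-\delta\),

\[
\mathbb{E}_{h \sim Q}\big[\mathcal{L}(h)\big] \;\le\; \mathbb{E}_{h \sim Q}\big[\hat{\mathcal{L}}(h)\big] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}.
\]

Under the GRSL reading, the prior \(P\) is the generatively pretrained MLLM: a lower generative loss \(\mathcal{L}_g(P)\) means \(P\) already sits close to good representation posteriors, so the \(\mathrm{KL}(Q \,\|\, P)\) term, and with it the achievable bound, shrinks.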
Limitations & Future Work¶
- Representation capability is bounded by the generative ability of the underlying MLLM — if the base model has weak generative performance on certain modalities, contrastive fine-tuning cannot compensate.
- The text-only variant still lags behind CLIP-style encoder models on clustering and zero-shot classification tasks.
- Validation of GRSL is currently conducted primarily on the Qwen model family; verification across a broader range of model families remains insufficient.
- More efficient contrastive learning recipes, such as hard negative mining, have not been explored.
Related Work & Insights¶
- vs. CLIP/SigLIP: CLIP requires large-scale paired data to learn alignment from scratch, whereas LCO-Emb leverages alignment already established during MLLM pretraining, reducing data requirements by more than 20×.
- vs. GME: GME is trained on 8 million multimodal samples; LCO-Emb achieves superior performance on MIEB using only 370K samples (4.6% of GME's data).
- vs. E5-V: Both are MLLM-based embedding methods, but LCO-Emb outperforms E5-V by an average of 21.69 points on Sub18, attributable to a stronger backbone and the LoRA fine-tuning strategy.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The discovery of implicit alignment in MLLMs and the GRSL constitute entirely novel insights with a high degree of theoretical formalization.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 51 tasks, multiple backbones, multiple modalities, and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ Structure is clear with rich figures and tables, though the theoretical section is somewhat condensed.
- Value: ⭐⭐⭐⭐⭐ Uncovers fundamental principles of MLLM representation learning with paradigm-level implications for future embedding model design.