
CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook

Conference: ACL2026
arXiv: 2605.18257
Code: https://visual-ai.github.io/codebind
Area: Multimodal Alignment / 3D Vision
Keywords: Multimodal Representation, Compositional Vector Quantization, Shared-Specific Decoupling, Codebook, Cross-modal Retrieval

TL;DR

CodeBind enhances ImageBind/ViT-Lens style multimodal alignment using shared-specific representation decoupling and a unified compositional VQ codebook. It simultaneously improves cross-modal classification/retrieval across nine modalities while preserving stronger modality-specific fine-grained information.

Background & Motivation

Background: Multimodal representation alignment is crucial for integrating multi-sensor inputs—such as images, videos, audio, depth, thermal, tactile, point clouds, and EEG—into LLMs, robots, and perception systems. Common practices align specialized modalities to a mature vision-language space, using models like OpenCLIP, ImageBind, or ViT-Lens as bridges.

Limitations of Prior Work: First, "hard alignment" forces all modalities into a single shared space, often causing a "lowest common denominator" effect: while cross-modal semantics become consistent, modality-unique information (e.g., color, texture, tactile pressure, thermal signals) is flattened. Second, specialized modality data is far scarcer than image-text data; during training, dominant modalities override the space, suppressing low-resource or rare modalities. Third, existing methods often rely on large-scale paired data, synthetic data, or unified encoders, making expansion to new modalities costly.

Key Challenge: Cross-modal tasks require a shared semantic space, yet fine-grained tasks and robotic perception necessitate the preservation of modality-private details. Total sharing leads to over-compression, while total separation prevents cross-modal retrieval and interaction.

Goal: The authors aim to achieve "partial alignment" to accomplish two things: placing cross-modal consistent semantics in a shared space for classification and retrieval, and placing modality-unique details in a specific space for fine-grained recognition, reconstruction, and fusion.

Key Insight: This paper views the VQ codebook as a distribution-independent discrete semantic foundation. The shared embeddings of different modalities share the same codebook to ensure consistent semantic centers, while each modality maintains its own specific codebook to prevent private information from being swallowed by the shared space.

Core Idea: Combine shared-specific representation decoupling with a compositional VQ codebook to expand representation capacity, reduce modality bias, and protect fine-grained features while keeping the parameter count small.

Method

Overall Architecture

CodeBind uses a frozen vision-language foundation model as a bridging space to gradually align target modalities with text/image semantics. Each modality encoder output is projected into two components: \(z^{\mathcal{M}}_{shared}\) for cross-modal shared semantics and \(z^{\mathcal{M}}_{spec}\) for modality-specific information. The shared component is quantized against a codebook common to all modalities, while the specific component is quantized against a codebook owned by that modality. The quantized shared and specific embeddings are concatenated and passed to a Transformer decoder for reconstruction, which penalizes input information that neither component retains. During inference for cross-modal alignment tasks, only the shared embedding is kept, so the reconstruction module can be discarded to reduce cost.
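The data flow described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `DecoupledHead`, the linear projections, the plain nearest-neighbour quantizer, and the toy sizes are all hypothetical, and the Transformer decoder is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_code(z, codebook):
    """Replace each row of z with its nearest codevector (squared L2 distance)."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook[d2.argmin(axis=1)]

class DecoupledHead:
    """Hypothetical two-branch head placed on top of one modality encoder."""

    def __init__(self, enc_dim, code_dim, shared_codebook, n_specific):
        # Assumption: simple linear projections into the two sub-spaces.
        self.w_shared = rng.normal(scale=0.1, size=(enc_dim, code_dim))
        self.w_spec = rng.normal(scale=0.1, size=(enc_dim, code_dim))
        self.shared_codebook = shared_codebook  # same array for every modality
        self.spec_codebook = rng.normal(size=(n_specific, code_dim))  # private

    def forward(self, h, training=True):
        q_shared = nearest_code(h @ self.w_shared, self.shared_codebook)
        if not training:
            # Inference for alignment tasks keeps only the shared embedding.
            return q_shared
        q_spec = nearest_code(h @ self.w_spec, self.spec_codebook)
        # During training the concatenation would feed a reconstruction decoder.
        return np.concatenate([q_shared, q_spec], axis=-1)

# Toy sizes for illustration only (not the sizes used in the paper).
shared_codebook = rng.normal(size=(16, 8))
audio_head = DecoupledHead(32, 8, shared_codebook, n_specific=4)
depth_head = DecoupledHead(32, 8, shared_codebook, n_specific=4)

h = rng.normal(size=(5, 32))                       # stand-in encoder features
train_out = audio_head.forward(h, training=True)   # shape (5, 16)
infer_out = audio_head.forward(h, training=False)  # shape (5, 8)
```

Both heads hold a reference to the same `shared_codebook` array, which is what ties the modalities to common semantic centers, while each head owns its own `spec_codebook`.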

Key Designs

  1. Shared-Specific Representation Decoupling:

    • Function: Separates cross-modal semantics from modality-private details.
    • Mechanism: Traditional alignment directly maximizes the mutual information of complete embeddings across modalities, which risks forcing the alignment of noise and private features. CodeBind only involves the shared component in cross-modal alignment, using orthogonal constraints, uniform constraints, and reconstruction loss to ensure the specific component retains non-redundant details.
    • Design Motivation: Classification and cross-modal retrieval need shared concepts like "cat," but fine-grained retrieval requires details like fur color, texture, thermal patterns, or tactile pressure. Decoupling prevents these requirements from compromising each other.
  2. Modality-Shared and Modality-Specific Codebooks:

    • Function: Uses discrete codevectors as a unified semantic foundation while preserving dedicated expression spaces for different modalities.
    • Mechanism: Shared embeddings use a universal codebook \(\mathcal{C}_{shared}\), while specific embeddings use a modality-exclusive codebook \(\mathcal{C}^{\mathcal{M}}_{spec}\). For example, "striking" represents a general semantic in the shared space, but corresponds to sound, motion, or pressure patterns in the specific spaces of audio, video, and tactile modalities, respectively.
    • Design Motivation: The shared codebook prevents low-resource modalities from being biased by dominant ones, while specific codebooks prevent details from being compressed into a single set of abstract semantics.
  3. Compositional Vector Quantization:

    • Function: Increases representation capacity without expanding codebook parameter size.
    • Mechanism: Divides a \(d\)-dimensional embedding into \(m\) sub-vectors of dimension \(d/m\), each independently selecting a low-dimensional codevector from its own sub-codebook. If each sub-codebook has \(K\) codevectors, only \(mK\) codevectors are stored, yet the number of distinct compositions reaches \(K^m\).
    • Design Motivation: Traditional large codebooks suffer from computational overhead, codebook collapse, and low utilization. Compositional VQ constructs high capacity from small codebooks, making it more suitable for multimodal expansion.
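The compositional quantization step in item 3 can be illustrated with a small NumPy sketch. It assumes nearest-neighbour assignment under squared L2 distance and one independent sub-codebook per sub-vector; the function name `compositional_quantize` and the toy values of `m`, `K`, and `sub_dim` are illustrative choices, not the paper's settings.

```python
import numpy as np

def compositional_quantize(z, sub_codebooks):
    """Quantize each sub-vector of z against its own sub-codebook.

    z:             (n, d) embeddings
    sub_codebooks: (m, K, d // m) array, one codebook per sub-vector
    Returns the quantized embeddings (n, d) and the chosen indices (n, m).
    """
    m, K, sub_dim = sub_codebooks.shape
    n, d = z.shape
    assert d == m * sub_dim, "embedding dimension must equal m * sub_dim"
    chunks = z.reshape(n, m, sub_dim)
    # Squared distances between every sub-vector and every codevector: (n, m, K)
    d2 = ((chunks[:, :, None, :] - sub_codebooks[None, :, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=-1)                               # (n, m)
    quantized = sub_codebooks[np.arange(m)[None, :], idx]  # (n, m, sub_dim)
    return quantized.reshape(n, d), idx

rng = np.random.default_rng(0)
m, K, sub_dim = 4, 16, 2                 # toy values for illustration
sub_codebooks = rng.normal(size=(m, K, sub_dim))
z = rng.normal(size=(3, m * sub_dim))
z_q, idx = compositional_quantize(z, sub_codebooks)

stored_vectors = m * K                   # codevectors actually stored: 64
capacity = K ** m                        # distinct compositions: 65536
```

With these toy values, 64 stored codevectors address 65,536 distinct quantized embeddings, which is the capacity argument made above.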

Loss & Training

The training objective consists of multiple loss types. Cross-modal semantic alignment uses InfoNCE \(\mathcal{L}_{align}\). Representation decoupling is enforced by orthogonal loss \(\mathcal{L}_{orth}\), uniform loss \(\mathcal{L}_{uni}\), and reconstruction loss \(\mathcal{L}_{recon}\). Codebook stability is maintained via EMA updates, commitment loss, dynamic re-initialization, and codevector regularization \(\mathcal{L}_{cctr}\) and \(\mathcal{L}_{cuni}\). Cross-modal matching of the shared codebook is constrained by the Cross-Modal Code Matching loss \(\mathcal{L}_{cm}\). To reduce manual tuning, the authors designed adaptive loss balancing, using EMA to estimate the magnitude of each loss and dynamically scaling weights relative to \(\mathcal{L}_{align}\).
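As a rough illustration of three ingredients named above, the sketch below implements a symmetric InfoNCE loss, an orthogonality penalty between shared and specific components, and an EMA-based loss balancer. The exact formulations in the paper are not given in this summary, so everything here is an assumption: the temperature of 0.07, the squared-cosine form of the orthogonality penalty, the EMA decay, and the rule "weight = EMA of \(\mathcal{L}_{align}\) divided by EMA of the loss" are illustrative choices.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over paired rows of a and b, both of shape (n, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))    # matching pairs sit on the diagonal

    return float(0.5 * (cross_entropy(logits) + cross_entropy(logits.T)))

def orthogonal_loss(z_shared, z_spec):
    """Assumed form: mean squared cosine between the two parts of each sample."""
    s = z_shared / np.linalg.norm(z_shared, axis=1, keepdims=True)
    p = z_spec / np.linalg.norm(z_spec, axis=1, keepdims=True)
    return float(np.mean((s * p).sum(axis=1) ** 2))

class AdaptiveBalancer:
    """Hypothetical balancer: rescale each loss to the magnitude of the anchor."""

    def __init__(self, anchor="align", decay=0.99, eps=1e-8):
        self.anchor, self.decay, self.eps = anchor, decay, eps
        self.ema = {}

    def update(self, losses):
        """losses: dict of name -> float. Returns dict of name -> weight."""
        for name, value in losses.items():
            prev = self.ema.get(name)
            self.ema[name] = value if prev is None else (
                self.decay * prev + (1.0 - self.decay) * value)
        ref = self.ema[self.anchor]
        return {name: ref / (self.ema[name] + self.eps) for name in losses}

balancer = AdaptiveBalancer()
weights = balancer.update({"align": 2.0, "recon": 20.0})
total = sum(weights[k] * v for k, v in {"align": 2.0, "recon": 20.0}.items())
```

In this sketch a reconstruction loss ten times larger than the alignment loss receives a weight near 0.1, so both terms contribute at a similar scale without hand-tuned coefficients.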

In implementation, CodeBind is integrated into ImageBind and ViT-Lens to create CodeBind-IB and CodeBind-VL. Experiments use 1024 shared codevectors and 256 specific codevectors with a dimension of 8. Initialized from ImageBind/ViT-Lens, the models are trained on 8 NVIDIA RTX 3090 GPUs with a learning rate of \(5\times10^{-4}\). Target modality encoders can be fine-tuned via LoRA; adding a new modality only requires training a new codebook and corresponding paths.

Key Experimental Results

Main Results

The paper evaluates 9 modalities across various classification and retrieval datasets. The table below highlights representative gains of CodeBind-IB over ImageBind; classification is measured by Acc@1, AudioSet by mAP, and Clotho/AudioCaps by Recall@1/Recall@10.

| Dataset / Task | ImageBind | CodeBind-IB | Gain |
| --- | --- | --- | --- |
| NYU-D Depth Classification | 54.0 | 59.3 | +5.3 |
| SUN-D Depth Classification | 35.1 | 45.7 | +10.6 |
| AudioSet Audio Classification | 17.6 | 21.1 | +3.5 |
| VGGSound Audio Classification | 27.8 | 30.5 | +2.7 |
| ESC Audio Classification | 66.9 | 71.0 | +4.1 |
| LLVIP Thermal Classification | 63.4 | 95.5 | +32.1 |
| FLIR_v2 Thermal Classification | 46.6 | 97.2 | +50.6 |
| MSR-VTT Video Retrieval | 36.1 | 37.8 | +1.7 |
| AudioCaps Audio Retrieval | 9.3/42.3 | 13.3/53.8 | +4.0/+11.5 |

CodeBind-VL also consistently outperforms ViT-Lens; for instance, ModelNet40 point cloud classification improved from 70.6/94.4 to 78.3/96.5, and IN-EEG improved from 41.8/42.7 to 54.5/54.1.
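For readers unfamiliar with the retrieval metric reported above, the following sketch computes Recall@K from shared embeddings by cosine similarity. It assumes exactly one correct gallery item per query, stored at the same row index; benchmarks with several valid matches per query need a different ground-truth mapping. The function name `recall_at_k` and the toy data are illustrative.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k):
    """Fraction of queries whose paired gallery item ranks in the top k.

    Assumes query i is paired with gallery item i.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                          # cosine similarity, (n_query, n_gallery)
    true_sim = np.diag(sims)[:, None]       # similarity to the correct item
    rank = (sims > true_sim).sum(axis=1)    # items strictly more similar than it
    return float((rank < k).mean())

gallery = np.eye(5)
queries = np.eye(5)[[1, 0, 2, 3, 4]]        # first two queries point at the wrong item
r1 = recall_at_k(queries, gallery, k=1)
r2 = recall_at_k(queries, gallery, k=2)
```

In this toy case three of the five queries retrieve their pair at rank 1, and the remaining two are recovered once the top 2 are considered.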

Ablation Study

| Configuration | NYU-D | SUN-D | FLIR_v2 | Description |
| --- | --- | --- | --- | --- |
| w/o codebook / w/o decoupling / w/o reconstruction | 54.0 | 35.1 | 46.6 | ImageBind baseline |
| decoupling + reconstruction only | 54.1 | 39.7 | 94.5 | Decoupling significantly helps low-resource modalities |
| codebook only | 57.6 | 46.9 | 80.5 | Discrete foundation improves shared space |
| codebook + decoupling | 56.7 | 45.3 | 97.7 | Near-optimal performance |
| codebook + decoupling + reconstruction | 59.3 | 45.7 | 97.2 | Full proposed method |

Key Findings

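  • Improvements are largest in low-resource, high-variance modalities such as thermal and depth. The ablation suggests both components counter modality bias: the codebook alone accounts for most of the depth gain (NYU-D, SUN-D), while decoupling alone accounts for most of the thermal gain on FLIR_v2.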
  • The specific embedding is not just a reconstruction aid; it contributes usable details in fine-grained retrieval and multimodal fusion.
  • Compositional VQ provides gains of +10.8, +5.7, and +16.1 across three ablation datasets compared to standard VQ, primarily due to larger compositional representation capacity.
  • The reconstruction module adds overhead during training but can be discarded at inference; it acts as a training-time constraint to ensure the specific component indeed preserves information.

Highlights & Insights

  • Partial alignment is more realistic for multi-sensor settings: Robotics and medical applications are poorly served when all modalities are fully homogenized. CodeBind's shared/specific division gives a clear way to model what should and should not be aligned.
  • The codebook acts as both an alignment tool and an information regulator: The shared codebook extracts cross-modal invariants, while the specific codebook captures modality-unique signals, providing more structural constraint than simple projectors.
  • Compositional VQ addresses the capacity problem: Expanding the representation space through combinations of low-dimensional sub-vectors avoids the need for very large codebooks.
  • Extensive experimental coverage: The experiments span images, videos, audio, depth, thermal, tactile, EEG, and point clouds, demonstrating that the framework extends beyond image-text pairs.

Limitations & Future Work

  • While modality-specific information in visual embeddings can be interpreted via text using VLMs, interpreting the specific information of modalities that lack strong foundation spaces (such as tactile or EEG) remains challenging.
  • The main experiments use category names as the alignment text to keep comparisons fair; the authors note that dense descriptions generated by LLMs/VLMs might further unlock the potential of the decoupled space, though this introduces a dependency on description quality.
  • The method still relies on bridging modalities and existing foundation models; if a new modality has weak semantic links to text/images, the transfer effect may decrease.
  • Future work could integrate CodeBind into MLLMs for on-demand fusion, using gating to dynamically decide when to use shared concepts versus specific cues. This could also enhance explainability in medical diagnosis.

Related Work & Insights

  • vs. ImageBind / ViT-Lens: These methods aim to map multiple modalities into a unified space; CodeBind builds on them by adding a shared-specific codebook to reduce detail loss from hard alignment.
  • vs. LanguageBind / FreeBind / OmniBind: These methods often rely on large-scale or pseudo-paired data for expansion; CodeBind emphasizes naturally paired data and parameter-efficient codebook design.
  • vs. MoE-style Unified Encoders: MoE fuses modalities via routing but may collapse with unbalanced data; CodeBind directly constrains representation structure through discrete codebooks and decoupling.
  • Insight: For 3D, thermal, tactile, and medical multimodal tasks, shared embeddings can be used for cross-modal semantic retrieval, while specific embeddings can be used for diagnostic details, sensor anomalies, or fine-grained localization.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The combination of shared-specific decoupling and compositional codebooks imposes explicit structure on the representation and addresses the core problem of hard alignment.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 9 modalities, multiple baselines, main experiments, and multi-layered ablations with solid evidence.
  • Writing Quality: ⭐⭐⭐⭐☆ High density of method details but logical and clear; some table layouts are complex and require background knowledge of multimodal baselines.
  • Value: ⭐⭐⭐⭐⭐ Provides direct inspiration for robotics, multi-sensor perception, and integrating new modalities into MLLMs.