CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning¶

Conference: CVPR 2026
arXiv: 2602.19605
Code: N/A
Area: Video Understanding
Keywords: Cross-level semantic alignment, shared-private disentanglement, multimodal fusion, sentiment analysis, event localization

TL;DR¶

This paper proposes the CLCR framework, which organizes each modality's features into three semantic hierarchy levels (shallow/middle/deep). An intra-level Controlled Exchange Domain (IntraCED) restricts cross-modal interaction to the shared subspace only, while an inter-level Collaborative Aggregation Domain (InterCAD) enables adaptive cross-level fusion, addressing the cross-level semantic asynchrony problem in multimodal learning.

Background & Motivation¶

Multimodal learning aims to capture both shared and private information from multiple modalities (language, vision, acoustics). The two dominant paradigms share a common limitation:

Feature disentanglement methods (MISA, DMD, etc.): Learn modality-invariant/modality-specific subspaces, but assume cross-modal interaction occurs at a single semantic level.

Dynamic calibration methods (MLA, ARL, etc.): Adjust contribution weights at the sample/modality level, but likewise ignore the hierarchical structure of the features.

Core Problem: Cross-Level Semantic Asynchrony

- Shallow layers capture lexical/frame-level cues; middle layers encode phrase/prosodic structure; deep layers reflect discourse intent/event context.
- Mixing tokens from different levels leads to semantic confusion, error propagation, and private information leakage.
- From an information bottleneck perspective, unstructured mixing tends to increase \(I(Z;N)\) rather than \(I(Z;Y)\).

Method¶

Overall Architecture¶

CLCR consists of three core components:

1. Semantic-Hierarchy Encoder: Constructs three semantic hierarchy levels for each modality.
2. Intra-level Controlled Exchange Domain (IntraCED): Performs controlled cross-modal exchange independently at each level.
3. Inter-level Collaborative Aggregation Domain (InterCAD): Synchronizes and aggregates the final task representation across levels.

Key Designs¶

1. Semantic-Hierarchy Encoder¶

For each modality \(m \in \{L, V, A\}\), three-level features of uniform width \(d\) are constructed:

\[H_\ell^{(m)} = \text{LN}(Z_\ell^{(m)} W_\ell^{(m)} + P_\ell^{(m)})\]

Language modality: Early/middle/late layers of pretrained BERT → lexical-syntactic / phrase-level sentiment / discourse intent.
Visual/acoustic modalities: Three-stage TCN with increasing receptive fields → local appearance / part structure / long-range scene context.
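The projection \(H_\ell^{(m)} = \text{LN}(Z_\ell^{(m)} W_\ell^{(m)} + P_\ell^{(m)})\) can be illustrated with a minimal pure-Python sketch. It assumes that \(Z\) holds the raw level-\(\ell\) token features, \(W\) is a linear map to the common width \(d\), and \(P\) is an additive positional term; these readings of the symbols, the toy numbers, and the function names `layer_norm` and `project_level` are illustrative assumptions rather than details taken from the paper.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def project_level(Z, W, P):
    """Compute H = LN(Z W + P) for one modality at one level.

    Z: T x d_in token features, W: d_in x d projection, P: T x d additive term.
    """
    H = []
    for z_t, p_t in zip(Z, P):
        row = [sum(z_t[i] * W[i][j] for i in range(len(W))) + p_t[j]
               for j in range(len(W[0]))]
        H.append(layer_norm(row))
    return H

# Toy input (hypothetical values): 2 tokens, raw width 3, common width d = 3.
Z = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
W = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
P = [[0.0, 0.0, 0.0], [0.1, -0.1, 0.0]]
H = project_level(Z, W, P)
```

Because every level of every modality ends up with the same width \(d\), tokens from different modalities can later be compared within the same level.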

2. IntraCED: Intra-Level Controlled Exchange¶

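Shared-private orthogonal decomposition: Orthonormal bases \(U_\ell^{sh}\) and \(U_{\ell,m}^{pr}\) are learned via Stiefel parameterization, and each token is projected onto the two subspaces they span:

\[h_{\ell,t,sh}^{(m)} = h_{\ell,t}^{(m)} P_\ell^{sh}, \quad h_{\ell,t,pr}^{(m)} = h_{\ell,t}^{(m)} P_{\ell,m}^{pr}\]

Here \(P_\ell^{sh}\) and \(P_{\ell,m}^{pr}\) denote the projectors onto the shared and private subspaces (presumably \(P = UU^\top\) for a basis \(U\) with orthonormal columns).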

Only shared components participate in cross-modal exchange; private components are fully isolated.
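The shared/private split can be sketched as follows. This is only an illustration under two assumptions: the projector onto a subspace is built from its orthonormal basis as \(P = UU^\top\), and the shared and private bases are mutually orthogonal. Gram-Schmidt orthonormalization stands in for the learned Stiefel parameterization, and all names and numbers are hypothetical.

```python
def gram_schmidt(vectors):
    """Orthonormalize a list of linearly independent vectors (Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coef = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coef * bi for wi, bi in zip(w, b)]
        norm = sum(wi * wi for wi in w) ** 0.5
        basis.append([wi / norm for wi in w])
    return basis

def project(h, basis):
    """Orthogonal projection of h onto span(basis): sum_k <h, u_k> u_k."""
    out = [0.0] * len(h)
    for u in basis:
        coef = sum(hi * ui for hi, ui in zip(h, u))
        out = [o + coef * ui for o, ui in zip(out, u)]
    return out

# Hypothetical d = 4 example: 2 shared directions and 2 private directions.
raw = [[1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0],
       [0.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 2.0]]
U = gram_schmidt(raw)
U_sh, U_pr = U[:2], U[2:]

h = [0.3, -1.2, 0.8, 2.0]
h_sh = project(h, U_sh)  # component that may enter cross-modal exchange
h_pr = project(h, U_pr)  # component that stays inside its own modality
```

In this toy setup the two bases together span the whole space, so `h_sh` and `h_pr` are orthogonal and add back up to `h`; nothing in the private part can reach another modality through the shared channel.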

Controlled Token Budget: Not all shared tokens are worth exchanging. The shared evidence strength \(e_{\ell,t}^{(m)} = \|h_{\ell,t,sh}^{(m)}\|_2\) is measured for each token, mapped to activation weights via learnable scales and level-specific thresholds, and projected onto a truncated simplex to enforce sparsity:

\[\boldsymbol{\alpha}_\ell^{(m)} = \text{Proj}_{\Delta(B_\ell)}(\tilde{\boldsymbol{\alpha}}_\ell^{(m)})\]

where \(B_\ell\) is a learnable budget controlling the number of tokens participating in exchange.
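One way to realize a budgeted, sparsity-inducing projection is the Euclidean projection onto a capped simplex. The sketch below assumes \(\Delta(B) = \{\alpha : 0 \le \alpha_t \le 1, \sum_t \alpha_t = B\}\), which may differ from the paper's exact definition of the truncated simplex; the bisection solver and the name `project_capped_simplex` are illustrative choices, not the authors' implementation.

```python
def project_capped_simplex(scores, budget, iters=60):
    """Euclidean projection onto {a : 0 <= a_t <= 1, sum(a) = budget}.

    The solution has the form clip(score_t - tau, 0, 1); tau is found by bisection.
    """
    if not 0 <= budget <= len(scores):
        raise ValueError("budget must lie in [0, number of tokens]")

    def clip(v):
        return min(1.0, max(0.0, v))

    lo, hi = min(scores) - 1.0, max(scores)  # sum is len(scores) at lo and 0 at hi
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        if sum(clip(s - tau) for s in scores) > budget:
            lo = tau  # total too large: shift every score further down
        else:
            hi = tau
    tau = 0.5 * (lo + hi)
    return [clip(s - tau) for s in scores]

# Hypothetical pre-projection activations for five tokens, budget of two tokens.
alpha = project_capped_simplex([0.9, 0.2, 1.4, 0.1, 0.7], budget=2.0)
# alpha is approximately [0.6, 0.0, 1.0, 0.0, 0.4]
```

Tokens with weak shared evidence are driven exactly to zero, which is how a projection of this kind enforces sparsity while keeping the total exchange mass fixed at the budget.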

Three-modality shared-space exchange: Each modality queries the shared token pool of the remaining modalities:

\[\tilde{h}_{\ell,t,sh}^{(m)} = \alpha_{\ell,t}^{(m)} \text{Attn}(Q_{\ell,t}^{(m)}, K_\ell^{(-m)}, V_\ell^{(-m)})\]

The budget \(\alpha\) controls how much external evidence each token absorbs.
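The budget-gated exchange can be sketched with plain scaled dot-product attention. The learned query/key/value projections are omitted, the pooled keys and values are assumed to be the shared tokens of the other modalities, and the function name `budgeted_exchange` is hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def budgeted_exchange(queries, keys, values, alpha):
    """Per token t: alpha_t * Attn(q_t, K, V) over the other modalities' shared tokens."""
    d = len(keys[0])
    out = []
    for q, a in zip(queries, alpha):
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(logits)
        context = [sum(w * v[c] for w, v in zip(weights, values))
                   for c in range(len(values[0]))]
        out.append([a * x for x in context])  # the budget scales the absorbed evidence
    return out

# Toy example (hypothetical values): two query tokens, three pooled shared tokens.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
alpha = [1.0, 0.0]  # the second token received no budget
exchanged = budgeted_exchange(queries, keys, values, alpha)
```

A token whose budget is zero receives a zero update, so it absorbs no external evidence regardless of its attention weights.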

3. InterCAD: Inter-Level Collaborative Aggregation¶

Cross-level semantic synchronization: Mean pooling + LN are applied to the shared/private streams at each level and modality to obtain summaries \(s_\ell^{(m)}\) and \(p_\ell^{(m)}\). Level weights \(\omega = [\omega_1, \omega_2, \omega_3]\) are computed via MLP + softmax:

\[\bar{s}^{(m)} = \sum_{\ell=1}^3 \omega_\ell s_\ell^{(m)}, \quad \bar{p}^{(m)} = \sum_{\ell=1}^3 \omega_\ell p_\ell^{(m)}\]
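The level-weighted aggregation amounts to a softmax-weighted sum over the three level summaries. In the sketch below the MLP that produces the level logits is omitted, so the logits are passed in directly; the names and numbers are illustrative assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_levels(summaries, level_logits):
    """Return omega = softmax(level_logits) and the fused vector sum_l omega_l * s_l."""
    omega = softmax(level_logits)
    width = len(summaries[0])
    fused = [sum(w * s[j] for w, s in zip(omega, summaries)) for j in range(width)]
    return omega, fused

# Hypothetical shallow / middle / deep summaries of one modality (width 2).
s_levels = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
p_levels = [[0.5, 0.5], [1.0, -1.0], [0.0, 3.0]]
logits = [0.0, 0.0, 0.0]  # equal logits give uniform level weights

omega, s_bar = aggregate_levels(s_levels, logits)
_, p_bar = aggregate_levels(p_levels, logits)  # the same weights serve both streams
```

With equal logits the result reduces to a plain average over levels; unequal logits shift the representation toward whichever level the weighting network prefers.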

Modality selection and private aggregation:

- Shared path: A global context \(\bar{g}\) serves as query; per-modality \(\bar{s}^{(m)}\) serve as keys; scaled dot-product attention selects the most informative modality.
- Private path: Confidence-gated aggregation \(\eta_m = \sigma(w_p^\top \text{LN}(W_p \bar{p}^{(m)}))\).

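Final task representation: \(\hat{y} = f_\theta(z_{sh} \oplus u_{pr})\), where \(z_{sh}\) and \(u_{pr}\) are the outputs of the shared and private paths, respectively, and \(\oplus\) denotes concatenation.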
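The two aggregation paths can be sketched as below. Several details are assumptions made only for illustration: the per-modality shared summaries act as both keys and values, the gated private summaries are combined by a weighted sum \(\sum_m \eta_m \bar{p}^{(m)}\), and the fused vector is a plain concatenation. All names and numbers are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def shared_path(g, shared):
    """Attention over modalities: query g, keys and values are the shared summaries."""
    names = list(shared)
    logits = [dot(g, shared[m]) / math.sqrt(len(g)) for m in names]
    weights = softmax(logits)
    z_sh = [sum(w * shared[m][j] for w, m in zip(weights, names))
            for j in range(len(g))]
    return dict(zip(names, weights)), z_sh

def private_path(private, W_p, w_p):
    """eta_m = sigmoid(w_p^T LN(W_p p_m)); u_pr = sum_m eta_m * p_m (assumed form)."""
    eta = {}
    u_pr = [0.0] * len(next(iter(private.values())))
    for m, p in private.items():
        hidden = layer_norm([dot(row, p) for row in W_p])
        eta[m] = 1.0 / (1.0 + math.exp(-dot(w_p, hidden)))
        u_pr = [u + eta[m] * pj for u, pj in zip(u_pr, p)]
    return eta, u_pr

g = [1.0, 0.0]
shared = {"L": [2.0, 0.0], "V": [0.0, 1.0], "A": [-1.0, 0.0]}
private = {"L": [0.5, 1.0], "V": [1.0, -1.0], "A": [0.2, 0.3]}
W_p = [[1.0, 0.0], [0.0, 1.0]]
w_p = [1.0, -1.0]

weights, z_sh = shared_path(g, shared)
eta, u_pr = private_path(private, W_p, w_p)
fused = z_sh + u_pr  # list concatenation; this is the input to the task head
```

In this toy example the language summary aligns best with the global context, so it receives the largest shared-path weight, while each private summary is scaled by its own confidence gate.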

Loss & Training¶

\[\mathcal{L}_{all} = \mathcal{L}_{task} + \lambda_{inter} \mathcal{L}_{Inter} + \lambda_{intra} \mathcal{L}_{Intra}\]

Intra-level regularization \(\mathcal{L}_{Intra}\): A whitening cross-correlation identifiability regularizer that penalizes correlation between private streams of different modalities and between private-shared streams of the same modality.

Inter-level regularization \(\mathcal{L}_{Inter}\): Three constraints:

- \(\mathcal{L}_{pr}\): Reduces cross-level private redundancy.
- \(\mathcal{L}_{sp}\): Suppresses cross-level shared-private leakage.
- \(\mathcal{L}_{mix}\): Penalizes the simultaneous activation of semantically incompatible level pairs.
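A decorrelation penalty of the kind described for the regularizers, together with the weighted total loss, can be sketched as follows. Per-dimension standardization is used here as a simple stand-in for full whitening, the \(\lambda\) values are placeholders rather than the paper's settings, and the individual inter-level terms are not reproduced.

```python
def standardize(X):
    """Column-wise zero mean and unit variance over the batch (stand-in for whitening)."""
    n, d = len(X), len(X[0])
    cols = []
    for j in range(d):
        col = [row[j] for row in X]
        mu = sum(col) / n
        sd = (sum((v - mu) ** 2 for v in col) / n) ** 0.5 or 1.0
        cols.append([(v - mu) / sd for v in col])
    return cols  # d columns, each of length n

def cross_corr_penalty(A, B):
    """Sum of squared cross-correlations between every column of A and of B."""
    n = len(A)
    return sum((sum(x * y for x, y in zip(ca, cb)) / n) ** 2
               for ca in standardize(A) for cb in standardize(B))

def total_loss(task, inter, intra, lam_inter=0.1, lam_intra=0.1):
    """L_all = L_task + lam_inter * L_Inter + lam_intra * L_Intra."""
    return task + lam_inter * inter + lam_intra * intra

# Toy batches of 4 samples with 1 feature each (hypothetical values).
A = [[1.0], [2.0], [3.0], [4.0]]
B = [[2.0], [4.0], [6.0], [8.0]]    # perfectly correlated with A
C = [[1.0], [-1.0], [-1.0], [1.0]]  # uncorrelated with A

leaky = cross_corr_penalty(A, B)
clean = cross_corr_penalty(A, C)
loss = total_loss(task=1.0, inter=0.5, intra=leaky)
```

The penalty is close to 1 for the perfectly correlated pair and 0 for the uncorrelated pair, so minimizing it pushes two streams toward carrying statistically distinct information.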

Training configuration: SGD (momentum 0.9), lr 1e-3, weight decay 1e-4, batch size 64, 100 epochs, A100 GPU.
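For reference, the reported hyperparameters can be collected in a small configuration, shown here with one common form of the SGD-with-momentum update in which weight decay is folded into the gradient. The source does not state which framework or exact update variant was used, so `sgd_step` is an assumption for illustration only.

```python
CONFIG = {
    "optimizer": "SGD",
    "lr": 1e-3,
    "momentum": 0.9,
    "weight_decay": 1e-4,
    "batch_size": 64,
    "epochs": 100,
}

def sgd_step(weights, grads, velocity, cfg=CONFIG):
    """One SGD step: v <- momentum * v + (g + wd * w); w <- w - lr * v."""
    decayed = [g + cfg["weight_decay"] * w for g, w in zip(grads, weights)]
    velocity = [cfg["momentum"] * v + g for v, g in zip(velocity, decayed)]
    weights = [w - cfg["lr"] * v for w, v in zip(weights, velocity)]
    return weights, velocity

# Single toy parameter (hypothetical values).
w, v = sgd_step([1.0], [0.5], [0.0])
```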

Key Experimental Results¶

Main Results¶

Table 1: Audio-Visual Benchmarks (Acc, %)

Method	CREMA-D Acc	KS Acc	AVE Acc	UCF101 Acc
ARL	76.46	74.09	72.61	83.06
D&R	73.52	69.10	69.62	82.11
CLCR	77.92	75.41	73.82	83.64

Table 2: Multimodal Sentiment Analysis (CMU-MOSI / CMU-MOSEI)

Method	MOSI MAE↓	MOSI Acc-2	MOSEI MAE↓	MOSEI Acc-2
DLF	0.731	85.06	0.536	85.42
EMOE	0.710	85.4	0.536	85.3
CLCR	0.678	88.05	0.511	87.96

Ablation Study¶

Table 3: Key Component Ablation (MOSI MAE↓ / KS Acc)

Variant	MOSI MAE	KS Acc
w/o Hierarchy	0.720	71.9
w/o IntraCED	0.703	73.0
w/o InterCAD	0.699	73.4
Full Mix (shuffled levels)	0.743	70.3
w/o both regularizations	0.725	71.2
CLCR (full)	0.678	75.41
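The gap between each ablated variant and the full model can be computed directly from Table 3. The short script below uses only the numbers in the table; the variable and function names are illustrative.

```python
FULL = (0.678, 75.41)  # (MOSI MAE, KS Acc) of the full model

ABLATION = {
    "w/o Hierarchy": (0.720, 71.9),
    "w/o IntraCED": (0.703, 73.0),
    "w/o InterCAD": (0.699, 73.4),
    "Full Mix (shuffled levels)": (0.743, 70.3),
    "w/o both regularizations": (0.725, 71.2),
}

def degradation(table, full):
    """Return {variant: (MAE increase, accuracy drop)} relative to the full model."""
    return {name: (round(mae - full[0], 3), round(full[1] - acc, 2))
            for name, (mae, acc) in table.items()}

drops = degradation(ABLATION, FULL)
worst = max(drops, key=lambda name: drops[name][0])

for name, (d_mae, d_acc) in drops.items():
    print(f"{name:28s} MAE +{d_mae:.3f}   Acc -{d_acc:.2f}")
```

Sorting the variants by either column gives the same order: Full Mix degrades most, followed by removing both regularizations, the hierarchy, IntraCED, and InterCAD.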

Key Findings¶

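Semantic hierarchy is essential: Among the single-module ablations (hierarchy, IntraCED, InterCAD), removing the hierarchical structure causes the largest drop (MOSI MAE 0.678 → 0.720, KS Acc 75.41 → 71.9); Full Mix (completely shuffled levels) performs worst overall (0.743 / 70.3).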
IntraCED is more critical than InterCAD: Removing IntraCED yields a larger drop than removing InterCAD on both metrics in Table 3 (MOSI MAE 0.703 vs. 0.699; KS Acc 73.0 vs. 73.4), indicating that intra-level shared/private separation and controlled exchange are the key factors.
Optimal sparsity of token budget: Performance peaks at participation rate \(r \approx 0.68\) (\(\gamma \approx 1.0\)); fully dense exchange performs worst.
Noise robustness: Under Gaussian noise injection, CLCR degrades less than the baseline methods.
Adaptive modality importance: The language modality dominates on MOSI, while visual modality receives the highest weight on KS; CLCR adapts automatically.

Highlights & Insights¶

Problem formulation of cross-level semantic asynchrony: The paper provides an information bottleneck perspective explaining why mixing features across levels degrades representation quality.
Controlled token budget mechanism: Differentiable sparse token selection is achieved via truncated simplex projection, avoiding noisy dense fusion.
Dual protection of shared-private structure: Orthogonal projection (structural constraint) and whitening cross-correlation regularization (statistical constraint) are applied jointly.
Comprehensive validation across six benchmarks: Covers four task types — emotion recognition, event localization, sentiment analysis, and action recognition.

Limitations & Future Work¶

The three-level hierarchy is a hard-coded design; different tasks may require a different number of levels.
Computational overhead analysis is insufficient: the actual training cost of the whitening operations and the Stiefel parameterization is not reported.
Validation is limited to classification/regression tasks and has not been extended to generative multimodal tasks.
Handling of missing-modality scenarios (addressed only in ablation analysis) has not been developed into a systematic solution.

Related Work & Insights¶

MISA: A classical approach for modality-invariant and modality-specific subspace decomposition; CLCR extends this by introducing hierarchical structure.
DMD: Graph-based cross-modal knowledge distillation; CLCR replaces distillation with controlled attention.
ARL: Dual-path calibration strategy; CLCR achieves analogous functionality through the modality selection mechanism in InterCAD.
The token budget mechanism is transferable to vision-language pretraining for controlling the granularity of cross-modal interaction.

Rating¶

Novelty: ★★★★☆ — The problem formulation of cross-level semantic asynchrony and the controlled exchange design are original.
Technical Depth: ★★★★★ — Orthogonal decomposition + truncated simplex projection + whitening regularization; theoretically well-grounded.
Experimental Thoroughness: ★★★★★ — Six benchmarks, detailed ablations, t-SNE visualization, noise robustness, and hyperparameter sensitivity analysis.
Writing Quality: ★★★★☆ — Framework diagrams are clear, but the high density of notation raises the reading barrier.