CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning¶
Conference: CVPR 2026 | arXiv: 2602.19605 | Code: N/A | Area: Video Understanding | Keywords: cross-level semantic alignment, shared-private disentanglement, multimodal fusion, sentiment analysis, event localization
TL;DR¶
This paper proposes the CLCR framework, which organizes each modality's features into three semantic hierarchy levels (shallow/middle/deep). An intra-level Controlled Exchange Domain (IntraCED) restricts cross-modal interaction to the shared subspace only, while an inter-level Collaborative Aggregation Domain (InterCAD) enables adaptive cross-level fusion, addressing the cross-level semantic asynchrony problem in multimodal learning.
Background & Motivation¶
Multimodal learning aims to capture both shared and private information across modalities (language, vision, acoustics). The two dominant paradigms share a common limitation:
Feature disentanglement methods (MISA, DMD, etc.): Learn modality-invariant/modality-specific subspaces, but assume cross-modal interaction occurs at a single semantic level.
Dynamic calibration methods (MLA, ARL, etc.): Adjust contribution weights at the sample/modality level, but likewise ignore the hierarchical structure.
Core Problem: Cross-Level Semantic Asynchrony
- Shallow layers capture lexical/frame-level cues; middle layers encode phrase/prosodic structure; deep layers reflect discourse intent/event context.
- Mixing tokens from different levels leads to semantic confusion, error propagation, and private-information leakage.
- From an information-bottleneck perspective, unstructured mixing tends to increase \(I(Z;N)\) (information about nuisance factors \(N\)) rather than \(I(Z;Y)\) (information about the target \(Y\)); see the objective below.
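For context, this appeals to the classical information-bottleneck trade-off: a representation \(Z\) should compress the input \(X\) while staying predictive of the target \(Y\). The paper's argument is that unstructured cross-level mixing spends the retained bits of \(I(Z;X)\) on the nuisance term \(I(Z;N)\) instead of the task term:

```latex
\max_{p(z \mid x)} \; I(Z;Y) \;-\; \beta \, I(Z;X)
```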
Method¶
Overall Architecture¶
CLCR consists of three core components:
1. Semantic-Hierarchy Encoder: Constructs three semantic hierarchy levels for each modality.
2. Intra-level Controlled Exchange Domain (IntraCED): Performs controlled cross-modal exchange independently at each level.
3. Inter-level Collaborative Aggregation Domain (InterCAD): Synchronizes and aggregates the final task representation across levels.
Key Designs¶
1. Semantic-Hierarchy Encoder¶
For each modality \(m \in \{L, V, A\}\), three-level features of uniform width \(d\) are constructed:
- Language modality: Early/middle/late layers of pretrained BERT → lexical-syntactic / phrase-level sentiment / discourse intent.
- Visual/acoustic modalities: Three-stage TCN with increasing receptive fields → local appearance / part structure / long-range scene context (a minimal sketch follows below).
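As a concrete illustration, here is a minimal PyTorch sketch of the three-stage TCN branch for the visual/acoustic modalities; kernel sizes, dilations, and the width \(d\) are illustrative assumptions, not the paper's reported configuration (the language branch would instead tap early/middle/late hidden states of BERT):

```python
import torch
import torch.nn as nn

class ThreeLevelTCN(nn.Module):
    """Hypothetical three-stage TCN encoder: each stage widens the
    receptive field, yielding shallow/middle/deep features of width d."""
    def __init__(self, in_dim: int, d: int = 128):
        super().__init__()
        def stage(c_in: int, dilation: int) -> nn.Sequential:
            # 'same'-length dilated conv: padding = dilation * (kernel - 1) / 2
            return nn.Sequential(
                nn.Conv1d(c_in, d, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.ReLU(),
            )
        self.s1 = stage(in_dim, dilation=1)  # level 1: local appearance
        self.s2 = stage(d, dilation=2)       # level 2: part structure
        self.s3 = stage(d, dilation=4)       # level 3: long-range context

    def forward(self, x: torch.Tensor):
        # x: (batch, time, in_dim) -> channels-first for Conv1d
        h = x.transpose(1, 2)
        h1 = self.s1(h)
        h2 = self.s2(h1)
        h3 = self.s3(h2)
        # three levels, each (batch, time, d)
        return tuple(t.transpose(1, 2) for t in (h1, h2, h3))
```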
2. IntraCED: Intra-Level Controlled Exchange¶
Shared-private orthogonal decomposition: Orthonormal bases \(U_\ell^{sh}\) and \(U_{\ell,m}^{pr}\) are learned via Stiefel parameterization, giving the per-token projections

\(h_{\ell,t,sh}^{(m)} = U_\ell^{sh}\,(U_\ell^{sh})^\top h_{\ell,t}^{(m)}, \qquad h_{\ell,t,pr}^{(m)} = U_{\ell,m}^{pr}\,(U_{\ell,m}^{pr})^\top h_{\ell,t}^{(m)}\)

Only the shared components participate in cross-modal exchange; the private components are fully isolated.
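A minimal sketch of how such Stiefel-constrained bases can be realized in PyTorch, using the built-in `orthogonal` parametrization; the subspace dimensions `k_sh`/`k_pr` are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class SharedPrivateSplit(nn.Module):
    """Project token features onto learned shared and private subspaces
    spanned by orthonormal bases (columns of Stiefel-constrained matrices)."""
    def __init__(self, d: int = 128, k_sh: int = 64, k_pr: int = 64):
        super().__init__()
        # nn.Linear(k, d).weight has shape (d, k); the orthogonal
        # parametrization keeps its columns orthonormal, i.e. a point
        # on the Stiefel manifold St(d, k).
        self.U_sh = orthogonal(nn.Linear(k_sh, d, bias=False))
        self.U_pr = orthogonal(nn.Linear(k_pr, d, bias=False))

    def forward(self, h: torch.Tensor):
        # h: (batch, time, d)
        U_sh, U_pr = self.U_sh.weight, self.U_pr.weight   # each (d, k)
        h_sh = (h @ U_sh) @ U_sh.T   # projection onto the shared subspace
        h_pr = (h @ U_pr) @ U_pr.T   # projection onto the private subspace
        return h_sh, h_pr
```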
Controlled Token Budget: Not all shared tokens are worth exchanging. The shared evidence strength \(e_{\ell,t}^{(m)} = \|h_{\ell,t,sh}^{(m)}\|_2\) is measured for each token, mapped to activation weights via learnable scales and level-specific thresholds, and projected onto a truncated simplex to enforce sparsity; the simplex mass \(B_\ell\) is a learnable budget controlling how many tokens participate in the exchange.
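The paper's exact truncated-simplex step is not spelled out here; below is a sketch of the standard sort-based Euclidean projection onto a scaled simplex \(\{a \ge 0, \sum_t a_t = B_\ell\}\) (Duchi et al., 2008), one plausible realization of the budgeted sparsification:

```python
import torch

def project_to_budget_simplex(e: torch.Tensor, budget: float) -> torch.Tensor:
    """Project score vector e (shape (T,)) onto {a >= 0, sum(a) = budget}
    via the sort-based algorithm; entries below the threshold tau are
    zeroed, which is what enforces sparse token participation."""
    T = e.numel()
    u, _ = torch.sort(e, descending=True)
    css = torch.cumsum(u, dim=0) - budget
    idx = torch.arange(1, T + 1, dtype=e.dtype, device=e.device)
    rho = int((u - css / idx > 0).nonzero().max()) + 1  # active set size
    tau = css[rho - 1] / rho                            # shift threshold
    return torch.clamp(e - tau, min=0.0)
```

With a budget well below \(T\), most activation weights land exactly at zero, which is consistent with the later finding that fully dense exchange performs worst.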
Three-modality shared-space exchange: Each modality queries the shared token pool of the remaining two modalities; the budget coefficient \(\alpha\) controls how much external evidence each token absorbs.
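A hedged sketch of the exchange step at one level: each modality's shared tokens cross-attend to the concatenated shared tokens of the other two modalities, with the per-token weights \(\alpha\) gating the residual update (the multi-head layout is an assumption):

```python
import torch
import torch.nn as nn

class SharedExchange(nn.Module):
    """One level of three-modality shared-space exchange."""
    def __init__(self, d: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, shared: dict, alpha: dict) -> dict:
        # shared: {"L"/"V"/"A": (batch, T, d)} shared-component tokens
        # alpha:  matching per-token gates, each (batch, T, 1)
        out = {}
        for m, h in shared.items():
            pool = torch.cat([shared[n] for n in shared if n != m], dim=1)
            ctx, _ = self.attn(h, pool, pool)   # query the other two modalities
            out[m] = h + alpha[m] * ctx         # budget-gated residual update
        return out
```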
3. InterCAD: Inter-Level Collaborative Aggregation¶
Cross-level semantic synchronization: Mean pooling + LN are applied to the shared/private streams at each level and modality to obtain summaries \(s_\ell^{(m)}\) and \(p_\ell^{(m)}\); level weights \(\omega = [\omega_1, \omega_2, \omega_3]\) are then computed via an MLP followed by a softmax.
Modality selection and private aggregation:
- Shared path: A global context \(\bar{g}\) serves as the query, per-modality summaries \(\bar{s}^{(m)}\) serve as keys, and scaled dot-product attention selects the most informative modality.
- Private path: Confidence-gated aggregation \(\eta_m = \sigma(w_p^\top \text{LN}(W_p \bar{p}^{(m)}))\).
Final task representation: \(\hat{y} = f_\theta(z_{sh} \oplus u_{pr})\)
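Putting the InterCAD pieces together, a minimal sketch; tensor layouts and pooling details are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterCAD(nn.Module):
    """Level weighting, shared-path modality selection by attention,
    confidence-gated private aggregation, and the final concatenation
    z_sh (+) u_pr, as described above."""
    def __init__(self, d: int = 128):
        super().__init__()
        self.level_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                       nn.Linear(d, 1))
        self.W_p = nn.Linear(d, d)
        self.w_p = nn.Linear(d, 1)
        self.ln = nn.LayerNorm(d)
        self.scale = d ** -0.5

    def forward(self, s: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        # s, p: (M, L, B, d) per-modality, per-level shared/private summaries
        w = F.softmax(self.level_mlp(s).squeeze(-1), dim=1)      # level weights
        s_bar = (w.unsqueeze(-1) * s).sum(dim=1)                 # (M, B, d)
        p_bar = (w.unsqueeze(-1) * p).sum(dim=1)                 # (M, B, d)
        g = s_bar.mean(dim=0)                                    # global context
        logits = (g.unsqueeze(0) * s_bar).sum(-1) * self.scale   # (M, B)
        att = F.softmax(logits, dim=0).unsqueeze(-1)             # modality select
        z_sh = (att * s_bar).sum(dim=0)                          # shared path
        eta = torch.sigmoid(self.w_p(self.ln(self.W_p(p_bar))))  # confidence gate
        u_pr = (eta * p_bar).sum(dim=0)                          # private path
        return torch.cat([z_sh, u_pr], dim=-1)                   # z_sh (+) u_pr
```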
Loss & Training¶
Intra-level regularization \(\mathcal{L}_{Intra}\): A whitening cross-correlation identifiability regularizer that penalizes correlation between the private streams of different modalities and between the private and shared streams of the same modality (a minimal sketch follows the list below).
Inter-level regularization \(\mathcal{L}_{Inter}\): Three constraints:
- \(\mathcal{L}_{pr}\): Reduces cross-level private redundancy.
- \(\mathcal{L}_{sp}\): Suppresses cross-level shared-private leakage.
- \(\mathcal{L}_{mix}\): Penalizes the simultaneous activation of semantically incompatible level pairs.
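A minimal sketch of the whitening cross-correlation penalty used by \(\mathcal{L}_{Intra}\) (and conceptually by the decorrelation terms above); the exact normalization is an assumption:

```python
import torch

def cross_correlation_penalty(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Standardize two feature streams over the batch, form their d x d
    cross-correlation matrix, and penalize its squared entries; applied
    private-vs-private across modalities and shared-vs-private within a
    modality to keep the streams statistically decorrelated."""
    n = x.shape[0]
    x = (x - x.mean(0)) / (x.std(0) + 1e-6)   # whiten per dimension
    y = (y - y.mean(0)) / (y.std(0) + 1e-6)
    c = (x.T @ y) / n          # (d, d) cross-correlation matrix
    return (c ** 2).mean()     # drive all cross-correlations toward zero
```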
Training configuration: SGD (momentum 0.9), lr 1e-3, weight decay 1e-4, batch size 64, 100 epochs, A100 GPU.
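For concreteness, the reported optimizer settings map directly to PyTorch (the model below is a stand-in):

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 1)  # placeholder for the full CLCR network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=1e-4)
```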
Key Experimental Results¶
Main Results¶
Table 1: Audio-Visual Benchmarks (Accuracy, %)
| Method | CREMA-D Acc | KS Acc | AVE Acc | UCF101 Acc |
|---|---|---|---|---|
| ARL | 76.46 | 74.09 | 72.61 | 83.06 |
| D&R | 73.52 | 69.10 | 69.62 | 82.11 |
| CLCR | 77.92 | 75.41 | 73.82 | 83.64 |
Table 2: Multimodal Sentiment Analysis (CMU-MOSI / CMU-MOSEI)
| Method | MOSI MAE↓ | MOSI Acc-2 | MOSEI MAE↓ | MOSEI Acc-2 |
|---|---|---|---|---|
| DLF | 0.731 | 85.06 | 0.536 | 85.42 |
| EMOE | 0.710 | 85.4 | 0.536 | 85.3 |
| CLCR | 0.678 | 88.05 | 0.511 | 87.96 |
Ablation Study¶
Table 3: Key Component Ablation (MOSI MAE↓ / KS Acc)
| Variant | MOSI MAE | KS Acc |
|---|---|---|
| w/o Hierarchy | 0.720 | 71.9 |
| w/o IntraCED | 0.703 | 73.0 |
| w/o InterCAD | 0.699 | 73.4 |
| Full Mix (shuffled levels) | 0.743 | 70.3 |
| w/o both regularizations | 0.725 | 71.2 |
| CLCR (full) | 0.678 | 75.41 |
Key Findings¶
- Semantic hierarchy is essential: Removing the hierarchical structure causes the largest performance drop; Full Mix (completely shuffled) performs worst.
- IntraCED is more critical than InterCAD: Removing IntraCED generally yields a larger performance drop, indicating that intra-level shared/private separation and controlled exchange are the key factors.
- Optimal sparsity of token budget: Performance peaks at participation rate \(r \approx 0.68\) (\(\gamma \approx 1.0\)); fully dense exchange performs worst.
- Noise robustness: Under Gaussian noise injection experiments, CLCR exhibits the smallest performance degradation compared to baseline methods.
- Adaptive modality importance: The language modality dominates on MOSI, while visual modality receives the highest weight on KS; CLCR adapts automatically.
Highlights & Insights¶
- Problem formulation of cross-level semantic asynchrony: The paper provides an information bottleneck perspective explaining why mixing features across levels degrades representation quality.
- Controlled token budget mechanism: Differentiable sparse token selection is achieved via truncated simplex projection, avoiding noisy dense fusion.
- Dual protection of shared-private structure: Orthogonal projection (structural constraint) and whitening cross-correlation regularization (statistical constraint) are applied jointly.
- Comprehensive validation across six benchmarks: Covers four task types — emotion recognition, event localization, sentiment analysis, and action recognition.
Limitations & Future Work¶
- The three-level hierarchy is a hard-coded design; different tasks may require a different number of levels.
- Computational overhead analysis is insufficient — actual training time for whitening operations and Stiefel parameterization is not reported.
- Validation is limited to classification/regression tasks and has not been extended to generative multimodal tasks.
- Handling of missing-modality scenarios (addressed only in ablation analysis) has not been developed into a systematic solution.
Related Work & Insights¶
- MISA: A classical approach for modality-invariant and modality-specific subspace decomposition; CLCR extends this by introducing hierarchical structure.
- DMD: Graph-based cross-modal knowledge distillation; CLCR replaces distillation with controlled attention.
- ARL: Dual-path calibration strategy; CLCR achieves analogous functionality through the modality selection mechanism in InterCAD.
- The token budget mechanism is transferable to vision-language pretraining for controlling the granularity of cross-modal interaction.
Rating¶
- Novelty: ★★★★☆ — The problem formulation of cross-level semantic asynchrony and the controlled exchange design are original.
- Technical Depth: ★★★★★ — Orthogonal decomposition + truncated simplex projection + whitening regularization; theoretically well-grounded.
- Experimental Thoroughness: ★★★★★ — Six benchmarks, detailed ablations, t-SNE visualization, noise robustness, and hyperparameter sensitivity analysis.
- Writing Quality: ★★★★☆ — Framework diagrams are clear, but the high density of notation raises the reading barrier.