Towards Multimodal Domain Generalization with Few Labels

Conference: CVPR 2026
arXiv: 2602.22917
Code: https://github.com/lihongzhao99/SSMDG
Area: Multimodal VLM
Keywords: Semi-supervised Learning, Domain Generalization, Multimodal Fusion, Pseudo Labels, Cross-modal Prototype Alignment

TL;DR

This paper defines and investigates the novel problem of Semi-Supervised Multimodal Domain Generalization (SSMDG), and proposes a unified framework integrating consensus-driven pseudo-labeling, disagreement-aware regularization, and cross-modal prototype alignment to achieve cross-domain generalization of multimodal models under limited annotation.

Background & Motivation

Background: Multimodal Domain Generalization (MMDG) assumes all source domain data are labeled; Semi-Supervised Multimodal Learning (SSML) exploits unlabeled data but ignores domain shift; Semi-Supervised Domain Generalization (SSDG) addresses domain shift but is restricted to unimodal inputs. Each direction addresses only part of the problem.

Limitations of Prior Work: In real-world scenarios, three challenges arise simultaneously—multimodal data, scarce labels, and domain shift. MMDG methods cannot leverage large amounts of unlabeled data; SSML methods assume identical training and test distributions; SSDG methods cannot exploit cross-modal complementarity.

Key Challenge: (a) How to obtain reliable pseudo labels under low confidence and inter-modal disagreement; (b) how to learn representations that are simultaneously invariant to modality and domain under limited supervision.

Goal: Establish an SSMDG benchmark and design a unified framework that jointly addresses pseudo-label reliability and domain-modality invariant representation learning.

Key Insight: The consensus between fusion predictions and unimodal predictions is leveraged to filter reliable pseudo labels, while class prototypes serve as semantic anchors across domains and modalities.

Core Idea: Achieve robust generalization on sparsely labeled multimodal multi-domain data through consensus-driven pseudo-label filtering and cross-modal prototype alignment.

Method

Overall Architecture

The model comprises modality-specific encoders, unimodal classifiers, and a fusion classifier. During training, samples are drawn from a joint pool of labeled and unlabeled data and routed through three complementary components: (1) Consensus-Driven Consistency Regularization (CDCR); (2) Disagreement-Aware Regularization (DAR); (3) Cross-Modal Prototype Alignment (CMPA).
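The encoder/classifier layout can be sketched as follows; the layer sizes and linear stand-ins for the video and audio backbones are illustrative assumptions, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class SSMDGModel(nn.Module):
    """Sketch of the described layout: two modality-specific encoders,
    two unimodal classifiers, and a fusion classifier over the
    concatenated features. Dimensions are placeholders."""
    def __init__(self, dim_v=512, dim_a=128, dim_h=256, num_classes=8):
        super().__init__()
        # Modality-specific encoders (stand-ins for real video/audio backbones)
        self.enc_video = nn.Linear(dim_v, dim_h)
        self.enc_audio = nn.Linear(dim_a, dim_h)
        # Unimodal heads plus a fusion head
        self.cls_video = nn.Linear(dim_h, num_classes)
        self.cls_audio = nn.Linear(dim_h, num_classes)
        self.cls_fuse = nn.Linear(2 * dim_h, num_classes)

    def forward(self, x_video, x_audio):
        zv = self.enc_video(x_video)
        za = self.enc_audio(x_audio)
        return (self.cls_video(zv), self.cls_audio(za),
                self.cls_fuse(torch.cat([zv, za], dim=1)))
```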

Key Designs

  1. Consensus-Driven Consistency Regularization (CDCR):

    • Function: Generates reliable pseudo labels for unlabeled data.
    • Mechanism: A pseudo label \(\hat{y}\) is accepted only when the fusion prediction and at least one unimodal prediction simultaneously exceed a high-confidence threshold \(\tau\) and agree on the predicted class. A FixMatch-style consistency loss is then applied to the qualifying samples: \(\mathcal{L}_{\text{cdcr}} = \frac{1}{|\mathcal{B}_{\text{cdcr}}^u|}\sum_{i\in\mathcal{B}_{\text{cdcr}}^u}\sum_{n\in\{v,a,f\}}\mathcal{H}(\hat{y}_i, \hat{p}_{n,i}^s)\), where \(n\) ranges over the video, audio, and fusion heads and \(\hat{p}_n^s\) denotes the prediction on the strongly augmented view.
    • Design Motivation: Decisions consistent across multiple views are more reliable than single-view predictions; the consensus mechanism naturally filters out low-quality pseudo labels.
  2. Disagreement-Aware Regularization (DAR):

    • Function: Exploits "non-consensus" samples excluded by CDCR that still carry useful information.
    • Mechanism: For samples where the fusion prediction is highly confident but inter-modal labels are inconsistent, the standard cross-entropy loss is replaced by the Generalized Cross-Entropy (GCE) loss \(\mathcal{L}_{\text{GCE}} = (1-p_{\hat{y}}^q)/q\), which is more robust to noisy labels.
    • Design Motivation: Non-consensus samples typically reside near decision boundaries; discarding them entirely wastes valuable information. The parameter \(q\) of GCE controls tolerance to noise.
  3. Cross-Modal Prototype Alignment (CMPA):

    • Function: Constructs a domain- and modality-invariant representation space.
    • Mechanism: Learnable prototype vectors are maintained per class as semantic anchors, and features from each domain and modality are aligned toward the corresponding class prototype. A cross-modal translation network is additionally trained to handle modality-missing scenarios.
    • Design Motivation: Class prototypes provide stable reference points across domains and modalities, offering greater flexibility than direct domain or modality alignment without requiring domain labels.
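The three components above can be sketched in PyTorch. This is a minimal sketch assuming a video-audio setup with (B, C) probability tensors; function names, tensor shapes, and the default values (\(\tau = 0.95\), \(q = 0.7\)) are our illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def consensus_pseudo_labels(p_fuse, p_video, p_audio, tau=0.95):
    """CDCR: accept a pseudo label only when the fusion prediction and at
    least one unimodal prediction both exceed tau AND agree on the class.
    Inputs are (B, C) softmax probabilities from the weakly augmented view;
    returns the pseudo labels and a boolean acceptance mask."""
    conf_f, y_f = p_fuse.max(dim=1)
    conf_v, y_v = p_video.max(dim=1)
    conf_a, y_a = p_audio.max(dim=1)
    video_agree = (conf_v >= tau) & (y_v == y_f)
    audio_agree = (conf_a >= tau) & (y_a == y_f)
    mask = (conf_f >= tau) & (video_agree | audio_agree)
    return y_f, mask

def gce_loss(logits, pseudo_labels, q=0.7):
    """DAR: Generalized Cross-Entropy, (1 - p_y^q)/q, applied to
    high-confidence but non-consensus samples; q interpolates between
    CE-like (q -> 0) and MAE-like (q = 1) behavior."""
    p = F.softmax(logits, dim=1)
    p_y = p.gather(1, pseudo_labels.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_y.clamp_min(1e-8) ** q) / q).mean()

def prototype_alignment_loss(features, labels, prototypes):
    """CMPA: pull each L2-normalized feature toward its class prototype,
    irrespective of domain or modality. `prototypes` is a learnable
    (C, D) tensor of per-class semantic anchors."""
    f = F.normalize(features, dim=1)
    p = F.normalize(prototypes, dim=1)
    return (1.0 - (f * p[labels]).sum(dim=1)).mean()
```

In this sketch, a sample whose fusion head is confident but whose mask is False would be routed to `gce_loss` rather than discarded, mirroring the DAR design.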

Loss & Training

Total loss: \(\mathcal{L} = \mathcal{L}_{\text{sup}} + \lambda_1\mathcal{L}_{\text{cdcr}} + \lambda_2\mathcal{L}_{\text{dar}} + \lambda_3\mathcal{L}_{\text{cmpa}}\). The supervised loss is computed over both fusion and unimodal predictions on labeled data. Weak augmentation uses standard transformations; strong augmentation applies RandAugment (video) and SpecAugment (audio).
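The combination of the loss terms can be written out directly; the lambda values below are placeholders, not the paper's tuned hyperparameters:

```python
import torch
import torch.nn.functional as F

def supervised_loss(logits_f, logits_v, logits_a, y):
    """Cross-entropy over the fusion head and both unimodal heads
    on labeled data, as stated above."""
    return sum(F.cross_entropy(l, y) for l in (logits_f, logits_v, logits_a))

def total_loss(l_sup, l_cdcr, l_dar, l_cmpa, lam1=1.0, lam2=0.5, lam3=0.1):
    """L = L_sup + lambda_1 * L_cdcr + lambda_2 * L_dar + lambda_3 * L_cmpa.
    The weights here are illustrative defaults only."""
    return l_sup + lam1 * l_cdcr + lam2 * l_dar + lam3 * l_cmpa
```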

Key Experimental Results

Main Results (5 labels per class)

Method            Type      HAC Mean   EPIC Mean
Source-only       Baseline  42.39      29.46
SimMMDG           MMDG      44.39      31.11
MDJA              MMDG      44.28      31.51
FixMatch (Video)  SSL       48.74      32.54
CGMatch (Video)   SSL       49.10      33.42
Ours              SSMDG     55.82      38.15

Ablation Study

Configuration            HAC Mean   EPIC Mean
Baseline                 42.39      29.46
+ CDCR                   49.15      33.80
+ CDCR + DAR             52.30      35.90
+ CDCR + DAR + CMPA      55.82      38.15
w/o Consensus Filtering  47.20      31.50

Key Findings

  • The SSMDG method substantially outperforms all MMDG baselines (+11%), as the latter cannot exploit unlabeled data.
  • Unimodal SSL (FixMatch on video) already surpasses MMDG methods, underscoring the value of leveraging unlabeled data.
  • CDCR contributes the largest gain (+7%); DAR adds an additional 3%, and CMPA a further 3%.
  • Directly using all high-confidence pseudo labels without consensus filtering degrades performance by 5%, validating the necessity of the filtering strategy.
  • In modality-missing scenarios (video-only or audio-only), cross-modal translation yields more graceful performance degradation.

Highlights & Insights

  • Forward-Looking Problem Formulation: The paper unifies three independently studied challenges into SSMDG and establishes the first benchmark. The intersection of these three threads represents a practically needed yet previously unexplored setting.
  • Consensus-Driven Pseudo Labeling: Unlike threshold-based filtering that relies solely on the fusion prediction, incorporating inter-modal consistency verification further improves reliability—a natural and effective innovation for multimodal semi-supervised learning.
  • GCE for Non-Consensus Samples: Rather than simply discarding uncertain samples, the method gently exploits them via a noise-robust loss, reflecting a design philosophy of "partial utilization over wasteful exclusion."

Limitations & Future Work

  • Validation is limited to the video-audio bimodal setting; vision-language or trimodal scenarios remain unexplored.
  • The threshold \(\tau\) is applied uniformly across all domains; domain-adaptive thresholds may yield better results.
  • Class prototypes are updated via simple averaging; momentum updates or attention-weighted aggregation may offer improvements.
  • Scalability on large-scale datasets (e.g., large-scale video classification) has not been verified.
Comparison with Related Methods

  • vs. SimMMDG: SimMMDG performs cross-modal alignment with fully labeled data; the proposed method achieves the same objective under limited labels via pseudo labeling and prototype alignment, making it more practical.
  • vs. FixMatch: FixMatch is the standard unimodal SSL baseline; the proposed CDCR leverages multimodal consensus to produce more reliable pseudo labels.

Rating

  • Novelty: ⭐⭐⭐⭐ Novel problem formulation with a well-motivated unified framework
  • Experimental Thoroughness: ⭐⭐⭐⭐ Two benchmarks, diverse baseline comparisons, and modality-missing experiments add value
  • Writing Quality: ⭐⭐⭐⭐ Problem definition and method description are clear
  • Value: ⭐⭐⭐⭐ Fills an unexplored gap at the intersection of three research lines; the benchmark offers community value