Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery¶

Conference: CVPR 2026
arXiv: 2602.19910
Code: To be confirmed
Area: Multimodal VLM
Keywords: Generalized Category Discovery, Multi-Modal Representation Learning, Semi-Supervised Rate Reduction, Intra-Modal Alignment, CLIP

TL;DR¶

This paper proposes SSR²-GCD, a framework that learns structured representations with uniformly compressed intra-modal distributions via a Semi-Supervised Rate Reduction (SSR²) loss, and introduces a Retrieval-based Text Aggregation (RTA) strategy to enhance cross-modal knowledge transfer. The method surpasses existing multi-modal GCD approaches on 8 benchmarks.

Background & Motivation¶

Practical demand for Generalized Category Discovery (GCD): Real-world data contains both known and novel categories. GCD leverages knowledge from known categories to discover novel ones, serving as a natural extension of open-set recognition.
Rise of multi-modal methods: Recent methods such as CLIP-GCD, TextGCD, and GET incorporate textual information into visual GCD tasks, improving performance through cross-modal alignment.
Limitations of inter-modal alignment: Existing multi-modal GCD methods focus primarily on inter-modal alignment while neglecting structural issues in the intra-modal representation distribution.
Imbalanced compression problem: The conventional contrastive loss \(\mathcal{L}_{\text{con}}\) combines an unsupervised term, which pulls together the augmented pair of every sample, with a supervised term, which pulls together only labeled known-category samples. Known categories therefore receive both pulls while novel categories receive only the first, so known categories are over-compressed, novel ones are under-compressed, and cluster boundaries blur.
CLIP's limitations with long text: CLIP encodes prompts exceeding 20 tokens poorly, making the conventional concatenation-based prompt construction suboptimal.
Potential harm of inter-modal alignment: Naively adding inter-modal alignment loss on top of intra-modal losses may in fact degrade intra-modal representation learning.

Method¶

Overall Architecture: SSR²-GCD¶

The framework consists of three modules: (a) Retrieval-based Text Aggregation (RTA) for generating text representations; (b) the Semi-Supervised Rate Reduction (SSR²) module for representation learning; and (c) a dual-branch classifier for learning pseudo-labels from each modality.

Retrieval-based Text Aggregation (RTA)¶

Following TextGCD, the method maintains a label dictionary and an attribute dictionary, retrieving the top-\(c\) most similar label and attribute candidates for each query image.
Key improvement: Rather than concatenating candidates into a long string for CLIP, each candidate is encoded independently and the resulting embeddings are combined by a weighted sum:

\[\boldsymbol{z}^{\text{T}} = \sum_{i=1}^{c} \sigma_i \mathcal{F}^{\text{T}}(\mathcal{T}(a_i)) + \sum_{i=1}^{c} \sigma_i \mathcal{F}^{\text{T}}(\mathcal{T}(b_i))\]

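Weight assignment: \(a_i\) and \(b_i\) are the top-\(c\) candidates retrieved from the two dictionaries and \(\sigma_i\) is the weight of the \(i\)-th candidate. The most similar candidate receives weight \(1-\alpha\) and each of the remaining \(c-1\) candidates receives \(\frac{\alpha}{c-1}\), so the weights within each sum total 1. With the reported setting (\(\alpha=0.5, c=4\)), the top candidate gets 0.5 and the other three get \(1/6\) each, which lets lower-ranked candidates contribute without drowning out the best match.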
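The weighting rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `rta_weights` and `aggregate_text` are hypothetical, random vectors stand in for the CLIP text embeddings \(\mathcal{F}^{\text{T}}(\mathcal{T}(\cdot))\), and the rows are assumed to be sorted with the most similar candidate first. Whether the aggregated vector is re-normalized afterwards is not stated in this summary, so the sketch leaves it unnormalized.

```python
import numpy as np

def rta_weights(c: int, alpha: float) -> np.ndarray:
    """Weights for c candidates sorted by descending similarity.

    The top candidate gets 1 - alpha; the other c - 1 share alpha equally.
    """
    if c < 2:
        raise ValueError("need at least two candidates")
    w = np.full(c, alpha / (c - 1))
    w[0] = 1.0 - alpha
    return w

def aggregate_text(emb_a: np.ndarray, emb_b: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Weighted sum of two (c, d) candidate-embedding matrices.

    emb_a / emb_b are stand-ins for the encoded candidates from the two
    dictionaries (hypothetical names); the same weights are used for both,
    as in the aggregation formula.
    """
    w = rta_weights(emb_a.shape[0], alpha)
    return w @ emb_a + w @ emb_b

# Stand-in embeddings: c = 4 candidates per dictionary, d = 8 dimensions.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(4, 8))
emb_b = rng.normal(size=(4, 8))

weights = rta_weights(4, 0.5)
z_text = aggregate_text(emb_a, emb_b)
```

Because each candidate is encoded on its own, no single prompt grows with \(c\); the cost is \(2c\) short text-encoder passes per image instead of one long one.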

Semi-Supervised Rate Reduction Loss (SSR²)¶

The core loss is grounded in the Maximal Coding Rate Reduction principle:

\[\mathcal{L}_{\text{SSR}^2} = -R(\mathbf{Z}) + R_c^{\text{s}}(\mathbf{Z}_{\text{s}}, \mathbf{Y}^*) + R_c^{\text{u}}(\mathbf{Z}_{\text{u}}, \mathbf{Y})\]

\(R(\mathbf{Z})\): Global coding rate; maximized to spread all representations across the full feature space.
\(R_c^{\text{s}}\): Class-conditional coding rate for labeled samples; compresses each known class using ground-truth labels \(\mathbf{Y}^*\).
\(R_c^{\text{u}}\): Class-conditional coding rate for unlabeled samples; compresses each predicted class, novel categories included, using classifier-predicted pseudo-labels \(\mathbf{Y}\).
Applied separately to the image and text encoders: \(\mathcal{L}_{\text{SSR}^2}^{\text{I}}\) and \(\mathcal{L}_{\text{SSR}^2}^{\text{T}}\).
Effect: Global expansion combined with uniform within-class compression yields balanced low-dimensional subspace representations for both known and novel categories.
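To make the three terms concrete, the sketch below evaluates them with the coding-rate definitions from the original MCR² formulation, where for \(n\) samples of dimension \(d\) stored as rows of \(\mathbf{Z}\):

\[R(\mathbf{Z}) = \tfrac{1}{2}\log\det\!\left(\mathbf{I} + \tfrac{d}{n\epsilon^2}\mathbf{Z}^\top\mathbf{Z}\right), \qquad R_c(\mathbf{Z}, \mathbf{Y}) = \sum_{k} \tfrac{n_k}{n}\, R(\mathbf{Z}_k)\]

Several details are assumptions rather than facts from the paper: the value of \(\epsilon\), the use of hard (one-hot) pseudo-labels, and normalizing each class-conditional term by the size of its own subset. NumPy is used only to show the arithmetic; actual training would need an autodiff framework.

```python
import numpy as np

def coding_rate(Z: np.ndarray, eps: float = 0.5) -> float:
    """R(Z) = 1/2 logdet(I + d/(n eps^2) Z^T Z) for Z of shape (n, d)."""
    n, d = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * (Z.T @ Z))
    return 0.5 * logdet

def class_coding_rate(Z: np.ndarray, labels: np.ndarray, eps: float = 0.5) -> float:
    """Sum over classes of (n_k / n) * R(Z_k), using hard labels."""
    n = Z.shape[0]
    return sum((np.sum(labels == k) / n) * coding_rate(Z[labels == k], eps)
               for k in np.unique(labels))

def ssr2_loss(Z_lab, y_true, Z_unl, y_pseudo, eps: float = 0.5) -> float:
    """-R(Z) + R_c^s(Z_s, Y*) + R_c^u(Z_u, Y), for one modality."""
    Z = np.concatenate([Z_lab, Z_unl], axis=0)
    return (-coding_rate(Z, eps)
            + class_coding_rate(Z_lab, y_true, eps)
            + class_coding_rate(Z_unl, y_pseudo, eps))

# Toy data: each class lies near one coordinate axis of an 8-d unit sphere.
rng = np.random.default_rng(0)
d, n = 8, 20

def make_class(axis: int) -> np.ndarray:
    X = np.eye(d)[axis] + 0.05 * rng.normal(size=(n, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

Z_lab = np.concatenate([make_class(0), make_class(1)])                 # known classes
y_true = np.repeat([0, 1], n)
Z_unl = np.concatenate([make_class(0), make_class(1), make_class(2)])  # class 2 is novel
y_good = np.repeat([0, 1, 2], n)   # pseudo-labels that match the clusters
y_bad = np.arange(3 * n) % 3       # pseudo-labels that mix the clusters

loss_good = ssr2_loss(Z_lab, y_true, Z_unl, y_good)
loss_bad = ssr2_loss(Z_lab, y_true, Z_unl, y_bad)
print(f"loss with clean pseudo-labels: {loss_good:.3f}")
print(f"loss with mixed pseudo-labels: {loss_bad:.3f}")
```

In this toy setup the global term is identical for both labelings, so the gap comes entirely from the unlabeled class-conditional term: pseudo-classes that mix several clusters span more directions and cost more to encode.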

Dual-Branch Clustering and Training Strategy¶

Warm-up phase: \(\mathcal{L}_{\text{warm}} = \mathcal{L}_{\text{SSR}^2}^{\text{I}} + \mathcal{L}_{\text{SSR}^2}^{\text{T}} + \mathcal{L}_{\text{cls}}^{\text{I}} + \mathcal{L}_{\text{cls}}^{\text{T}}\)
Alignment phase: A co-teaching loss \(\mathcal{L}_{\text{co-teach}}\) is introduced, enabling mutual supervision using high-confidence samples.
Final prediction: \(\arg\max(\boldsymbol{y}_i^{\text{I}} + \boldsymbol{y}_i^{\text{T}})\)
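The final prediction rule, and the idea of selecting high-confidence samples for mutual supervision, can be sketched as follows. This is an illustration under stated assumptions: \(\boldsymbol{y}_i^{\text{I}}\) and \(\boldsymbol{y}_i^{\text{T}}\) are taken to be softmax outputs of the two branch classifiers, and the confidence threshold `tau` is a hypothetical parameter. The exact form of \(\mathcal{L}_{\text{co-teach}}\) is not given in this summary, so only the selection step is shown.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def fuse_predictions(logits_img: np.ndarray, logits_txt: np.ndarray) -> np.ndarray:
    """argmax over the sum of the image-branch and text-branch probabilities."""
    return np.argmax(softmax(logits_img) + softmax(logits_txt), axis=1)

def confident_mask(probs: np.ndarray, tau: float = 0.9) -> np.ndarray:
    """True where a branch's top probability reaches tau (hypothetical threshold)."""
    return probs.max(axis=1) >= tau

# Two samples, three classes. The branches agree on the first sample and
# disagree on the second, where the text branch is far more confident.
logits_img = np.array([[2.0, 0.1, 0.0],
                       [0.2, 0.3, 0.1]])
logits_txt = np.array([[1.5, 0.2, 0.1],
                       [0.0, 0.1, 3.0]])

pred = fuse_predictions(logits_img, logits_txt)
txt_teaches = confident_mask(softmax(logits_txt))
print(pred, txt_teaches)
```

Summing probabilities rather than taking either branch alone lets a confident branch override an uncertain one, as happens for the second sample here.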

Key Experimental Results¶

Main Results (8 Datasets, All ACC %)¶

Dataset	TextGCD	GET	SSR²-GCD	Gain
ImageNet-100	88.0	91.7	92.1	+0.4
ImageNet-1k	64.8	62.4	66.7	+1.9
CIFAR-10	98.2	97.2	98.5	+0.3
CIFAR-100	85.7	82.1	86.4	+0.7
CUB-200	76.6	77.0	78.3	+1.3
Stanford Cars	86.1	78.5	89.2	+3.1
Oxford Pets	93.7	91.1	95.7	+2.0
Flowers102	87.2	85.5	93.5	+6.3

Gain is measured against the stronger of the two baselines on each dataset. Improvements are especially pronounced on Stanford Cars and Flowers102 (+3.1 and +6.3 percentage points, respectively).

Comparison of Representation Learning Objectives (All ACC %)¶

Loss Configuration	CIFAR-10	Stanford Cars	Flowers102
\(\mathcal{L}_{\text{CLIP}}\) (inter-modal only)	98.3	87.0	89.7
\(\mathcal{L}_{\text{con}}\) (intra-modal only)	98.4	87.9	91.8
\(\mathcal{L}_{\text{SSR}^2}\) (intra-modal only)	98.5	89.2	93.5
\(\mathcal{L}_{\text{CLIP}} + \mathcal{L}_{\text{SSR}^2}\)	98.3	88.1	92.9

Key finding: adding an inter-modal alignment loss on top of SSR² lowers accuracy on all three datasets (98.5 → 98.3 on CIFAR-10, 89.2 → 88.1 on Stanford Cars, 93.5 → 92.9 on Flowers102).

Ablation Study (Stanford Cars / Flowers102, All ACC %)¶

Dual-branch	RTA	SSR²	Stanford Cars	Flowers102
✗	✗	✗	75.2	78.3
✓	✗	✗	81.7	83.9
✓	✓	✗	86.0	87.4
✓	✗	✓	85.5	89.1
✓	✓	✓	89.2	93.5

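Each component improves accuracy on its own: over the dual-branch baseline on Stanford Cars, RTA alone gives 81.7 → 86.0 and SSR² alone gives 81.7 → 85.5. Combining all three yields the best results (89.2 / 93.5).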

Highlights & Insights¶

Novel theoretical perspective: This is the first work to apply the Maximal Coding Rate Reduction principle to multi-modal GCD, replacing conventional contrastive learning with an information-theoretic framework that provides balanced compression guarantees.
Counterintuitive yet compelling finding: Inter-modal alignment can be harmful in multi-modal GCD; intra-modal alignment alone appears sufficient to implicitly achieve cross-modal alignment.
Thorough empirical analysis: The core claims are validated through multiple lenses, including similarity distribution plots, effective rank curves, \(R_e\) consistency metrics, and t-SNE visualizations.
Elegant RTA design: By performing weighted aggregation in the embedding space rather than concatenating long prompts, the approach circumvents CLIP's long-text limitations while incorporating richer candidate information.

Limitations & Future Work¶

Computational and memory costs scale linearly with the number of candidates \(c\), as each requires a separate pass through the CLIP text encoder.
Image and text modalities are treated symmetrically, with no adaptive mechanism for modality importance weighting.
The number of categories \(K\) must be known or estimated in advance; robustness to incorrect estimates of the number of novel categories is not discussed.
Experiments are conducted solely on the CLIP-B/16 backbone; performance on larger models (ViT-L/H) remains unexplored.
The unlabeled term of the SSR² loss relies on pseudo-label quality, and noisy pseudo-labels in early training may impair convergence.

Comparison with Prior Multi-Modal GCD Methods¶

Method	Text Generation	Representation Learning	Clustering Strategy	Characteristics
TextGCD	Concatenate top-3 labels + top-2 attributes	\(\mathcal{L}_{\text{CLIP}}\) (inter-modal)	Dual-branch + co-teaching	First multi-modal GCD; neglects intra-modal alignment
GET	Text inversion network generates prompts	\(\mathcal{L}_{\text{CLIP}}+\mathcal{L}_{\text{con}}\)	Single-branch MLP	Uses both inter- and intra-modal losses but combines them naively
CLIP-GCD	Knowledge base retrieves similar texts	\(\mathcal{L}_{\text{CLIP}}\)	SimGCD clustering	Relies solely on inter-modal alignment
SSR²-GCD	RTA: weighted aggregation of multiple candidates	\(\mathcal{L}_{\text{SSR}^2}\) (intra-modal only)	Dual-branch + co-teaching	First to address imbalanced compression; eliminates inter-modal alignment

Rating¶

Novelty: ⭐⭐⭐⭐ — Introducing coding rate reduction into multi-modal GCD offers a distinctive perspective; the finding that inter-modal alignment may be harmful is thought-provoking.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across 8 datasets, comparison of 6 representation learning configurations, and multi-dimensional analysis (rank, consistency, distribution, visualization).
Writing Quality: ⭐⭐⭐⭐ — Well-structured with rigorous mathematical derivations, though notation is occasionally dense.
Value: ⭐⭐⭐⭐ — Provides a new direction for representation learning in multi-modal GCD, with substantial improvements on fine-grained datasets.