Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery¶
Conference: CVPR 2026 | arXiv: 2602.19910 | Code: To be confirmed | Area: Multimodal VLM | Keywords: Generalized Category Discovery, Multi-Modal Representation Learning, Semi-Supervised Rate Reduction, Intra-Modal Alignment, CLIP
TL;DR¶
This paper proposes SSR²-GCD, a framework that learns structured representations with uniformly compressed intra-modal distributions via a Semi-Supervised Rate Reduction (SSR²) loss, and introduces a Retrieval-based Text Aggregation (RTA) strategy to enhance cross-modal knowledge transfer. The method surpasses existing multi-modal GCD approaches on 8 benchmarks.
Background & Motivation¶
- Practical demand for Generalized Category Discovery (GCD): Real-world data contains both known and novel categories. GCD leverages knowledge from known categories to discover novel ones, serving as a natural extension of open-set recognition.
- Rise of multi-modal methods: Recent methods such as CLIP-GCD, TextGCD, and GET incorporate textual information into visual GCD tasks, improving performance through cross-modal alignment.
- Limitations of inter-modal alignment: Existing multi-modal GCD methods focus primarily on inter-modal alignment while neglecting structural issues in the intra-modal representation distribution.
- Imbalanced compression problem: The conventional contrastive loss \(\mathcal{L}_{\text{con}}\) comprises an unsupervised term (pulling all augmented pairs together) and a supervised term (pulling together only labeled known-category samples). Because only known categories receive the extra supervised pull, known classes are over-compressed while novel classes remain under-compressed, blurring cluster boundaries.
- CLIP's limitations with long text: CLIP encodes prompts exceeding 20 tokens poorly, making the conventional concatenation-based prompt construction suboptimal.
- Potential harm of inter-modal alignment: Naively adding inter-modal alignment loss on top of intra-modal losses may in fact degrade intra-modal representation learning.
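For concreteness, the imbalance can be seen in the standard GCD contrastive objective (a sketch following the common formulation of Vaze et al.; the paper's exact notation may differ):

\[
\mathcal{L}_{\text{con}}
  = (1-\lambda)\sum_{i \in B}\mathcal{L}^{u}_{i}
  + \lambda\sum_{i \in B_{L}}\mathcal{L}^{s}_{i},
\qquad
\mathcal{L}^{u}_{i}
  = -\log\frac{\exp\!\big(\mathbf{z}_{i}^{\top}\hat{\mathbf{z}}_{i}/\tau\big)}
              {\sum_{j \neq i}\exp\!\big(\mathbf{z}_{i}^{\top}\mathbf{z}_{j}/\tau\big)}
\]

Here \(\hat{\mathbf{z}}_{i}\) is the augmented view of sample \(i\) and \(B_L \subset B\) is the labeled part of the batch. The supervised term \(\mathcal{L}^{s}_{i}\) additionally pulls together all labeled samples sharing a class, so only known categories receive this extra compression.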
Method¶
Overall Architecture: SSR²-GCD¶
The framework consists of three modules: (a) Retrieval-based Text Aggregation (RTA) for generating text representations; (b) the Semi-Supervised Rate Reduction (SSR²) module for representation learning; and (c) a dual-branch classifier for learning pseudo-labels from each modality.
Retrieval-based Text Aggregation (RTA)¶
- Following TextGCD, the method maintains a label dictionary and an attribute dictionary, retrieving the top-\(c\) most similar label and attribute candidates for each query image.
- Key improvement: Rather than concatenating candidates into a long string for CLIP, each candidate is encoded independently and the resulting embeddings are aggregated with similarity-rank-based weights:
- Weight assignment: the most similar candidate receives weight \(1-\alpha\); the remaining candidates each receive \(\frac{\alpha}{c-1}\) (\(\alpha=0.5, c=4\)), effectively integrating richer candidate information.
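The aggregation step can be sketched in a few lines of NumPy (a minimal illustration of the weighting scheme described above; `aggregate_candidates` is a hypothetical name, and the re-normalization step is an assumption consistent with CLIP's unit-norm embeddings):

```python
import numpy as np

def aggregate_candidates(cand_embs: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Weighted aggregation of top-c candidate text embeddings.

    cand_embs: (c, d) array, rows sorted by descending similarity to the
    query image, each row a unit-normalized CLIP text embedding.
    """
    c = cand_embs.shape[0]
    weights = np.full(c, alpha / (c - 1))  # remaining c-1 candidates share alpha
    weights[0] = 1.0 - alpha               # most similar candidate gets 1 - alpha
    agg = weights @ cand_embs              # convex combination, shape (d,)
    return agg / np.linalg.norm(agg)       # re-normalize to the unit sphere

# Example: c = 4 candidates in a 512-dim embedding space (CLIP-B/16 text width)
rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 512))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
text_rep = aggregate_candidates(embs)
```

With \(\alpha=0.5\) and \(c=4\), the top candidate receives weight 0.5 and each of the other three receives \(0.5/3\), so the weights form a convex combination.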
Semi-Supervised Rate Reduction Loss (SSR²)¶
The core loss is grounded in the Maximal Coding Rate Reduction principle:
- \(R(\mathbf{Z})\): Global coding rate; maximized to spread all representations across the full feature space.
- \(R_c^{\text{s}}\): Class-conditional coding rate for labeled samples; compresses each known class using ground-truth labels \(\mathbf{Y}^*\).
- \(R_c^{\text{u}}\): Class-conditional coding rate for unlabeled samples; compresses novel categories using classifier-predicted pseudo-labels \(\mathbf{Y}\).
- Applied separately to the image and text encoders: \(\mathcal{L}_{\text{SSR}^2}^{\text{I}}\) and \(\mathcal{L}_{\text{SSR}^2}^{\text{T}}\).
- Effect: Global expansion combined with uniform within-class compression yields balanced low-dimensional subspace representations for both known and novel categories.
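A minimal NumPy sketch of the two coding-rate quantities, following the original Maximal Coding Rate Reduction formulation of Yu et al. (the paper's exact weighting, \(\epsilon\), and labeled/unlabeled split may differ; here `labels` stands in for ground-truth labels \(\mathbf{Y}^*\) on the labeled subset and pseudo-labels \(\mathbf{Y}\) elsewhere):

```python
import numpy as np

def coding_rate(Z: np.ndarray, eps: float = 0.5) -> float:
    """Global coding rate R(Z): volume spanned by all representations.
    Z: (d, n) matrix of n unit-normalized d-dimensional features."""
    d, n = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z @ Z.T)[1]

def class_coding_rate(Z: np.ndarray, labels: np.ndarray, eps: float = 0.5) -> float:
    """Class-conditional rate R_c: weighted sum of per-class volumes."""
    d, n = Z.shape
    total = 0.0
    for k in np.unique(labels):
        Zk = Z[:, labels == k]
        nk = Zk.shape[1]
        total += (nk / (2 * n)) * np.linalg.slogdet(
            np.eye(d) + (d / (nk * eps**2)) * Zk @ Zk.T)[1]
    return total

# Rate reduction (to be maximized): expand globally, compress within each class.
rng = np.random.default_rng(0)
Z = rng.normal(size=(16, 64))
Z /= np.linalg.norm(Z, axis=0, keepdims=True)
y = rng.integers(0, 4, size=64)
delta_R = coding_rate(Z) - class_coding_rate(Z, y)
```

By concavity of \(\log\det\), the rate reduction is non-negative for any partition, and maximizing it simultaneously expands the global volume and compresses every class, known or novel, by the same mechanism.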
Dual-Branch Clustering and Training Strategy¶
- Warm-up phase: \(\mathcal{L}_{\text{warm}} = \mathcal{L}_{\text{SSR}^2}^{\text{I}} + \mathcal{L}_{\text{SSR}^2}^{\text{T}} + \mathcal{L}_{\text{cls}}^{\text{I}} + \mathcal{L}_{\text{cls}}^{\text{T}}\)
- Alignment phase: A co-teaching loss \(\mathcal{L}_{\text{co-teach}}\) is introduced, enabling mutual supervision using high-confidence samples.
- Final prediction: \(\arg\max(\boldsymbol{y}_i^{\text{I}} + \boldsymbol{y}_i^{\text{T}})\), i.e., the class maximizing the sum of the image and text branches' probability vectors.
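The fusion of the two branches can be sketched as follows (a minimal illustration; `fuse_predictions` is a hypothetical name, and the assumption that each branch outputs softmax probabilities follows the \(\arg\max\) rule above):

```python
import numpy as np

def fuse_predictions(logits_img: np.ndarray, logits_txt: np.ndarray) -> np.ndarray:
    """Final label: argmax over the sum of per-branch class probabilities."""
    def softmax(x):
        e = np.exp(x - x.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    return np.argmax(softmax(logits_img) + softmax(logits_txt), axis=1)

# Two samples, three classes. Both branches agree on class 2 for the first
# sample; for the second they disagree, and the more confident branch wins.
li = np.array([[0.1, 0.2, 2.0], [1.5, 0.0, 0.0]])
lt = np.array([[0.0, 0.3, 1.8], [0.2, 1.0, 0.1]])
pred = fuse_predictions(li, lt)
```

Summing probabilities (rather than taking either branch alone) lets a confident branch override an uncertain one, which matches the co-teaching design of mutual supervision between modalities.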
Key Experimental Results¶
Main Results (8 Datasets, All ACC %)¶
| Dataset | TextGCD | GET | SSR²-GCD | Gain (vs. best baseline) |
|---|---|---|---|---|
| ImageNet-100 | 88.0 | 91.7 | 92.1 | +0.4 |
| ImageNet-1k | 64.8 | 62.4 | 66.7 | +1.9 |
| CIFAR-10 | 98.2 | 97.2 | 98.5 | +0.3 |
| CIFAR-100 | 85.7 | 82.1 | 86.4 | +0.7 |
| CUB-200 | 76.6 | 77.0 | 78.3 | +1.3 |
| Stanford Cars | 86.1 | 78.5 | 89.2 | +3.1 |
| Oxford Pets | 93.7 | 91.1 | 95.7 | +2.0 |
| Flowers102 | 87.2 | 85.5 | 93.5 | +6.3 |
Improvements are especially pronounced on Stanford Cars and Flowers102 (+3.1% and +6.3%, respectively).
Comparison of Representation Learning Objectives (All ACC %)¶
| Loss Configuration | CIFAR-10 | Stanford Cars | Flowers102 |
|---|---|---|---|
| \(\mathcal{L}_{\text{CLIP}}\) (inter-modal only) | 98.3 | 87.0 | 89.7 |
| \(\mathcal{L}_{\text{con}}\) (intra-modal only) | 98.4 | 87.9 | 91.8 |
| \(\mathcal{L}_{\text{SSR}^2}\) (intra-modal only) | 98.5 | 89.2 | 93.5 |
| \(\mathcal{L}_{\text{CLIP}} + \mathcal{L}_{\text{SSR}^2}\) | 98.3 | 88.1 | 92.9 |
Key finding: adding inter-modal alignment loss on top of SSR² consistently degrades performance.
Ablation Study (Stanford Cars / Flowers102, All ACC %)¶
| Dual | RTA | SSR² | Stanford Cars | Flowers102 |
|---|---|---|---|---|
| ✗ | ✗ | ✗ | 75.2 | 78.3 |
| ✓ | ✗ | ✗ | 81.7 | 83.9 |
| ✓ | ✓ | ✗ | 86.0 | 87.4 |
| ✓ | ✗ | ✓ | 85.5 | 89.1 |
| ✓ | ✓ | ✓ | 89.2 | 93.5 |
Each component contributes independently, and their combination yields optimal performance.
Highlights & Insights¶
- Novel theoretical perspective: This is the first work to apply the Maximal Coding Rate Reduction principle to multi-modal GCD, replacing conventional contrastive learning with an information-theoretic framework that provides balanced compression guarantees.
- Counterintuitive yet compelling finding: Inter-modal alignment can be harmful in multi-modal GCD; intra-modal alignment alone appears sufficient to implicitly achieve cross-modal alignment.
- Thorough empirical analysis: The core claims are validated through multiple lenses, including similarity distribution plots, effective rank curves, \(R_e\) consistency metrics, and t-SNE visualizations.
- Elegant RTA design: By performing weighted aggregation in the embedding space rather than concatenating long prompts, the approach circumvents CLIP's long-text limitations while incorporating richer candidate information.
Limitations & Future Work¶
- Computational and memory costs scale linearly with the number of candidates \(c\), as each requires a separate pass through the CLIP text encoder.
- Image and text modalities are treated symmetrically, with no adaptive mechanism for modality importance weighting.
- The number of categories \(K\) must be known or estimated in advance; robustness to incorrect estimates of the number of novel categories is not discussed.
- Experiments are conducted solely on the CLIP-B/16 backbone; performance on larger models (ViT-L/H) remains unexplored.
- The unlabeled term of the SSR² loss relies on pseudo-label quality, and noisy pseudo-labels in early training may impair convergence.
Related Work & Insights¶
| Method | Text Generation | Representation Learning | Clustering Strategy | Characteristics |
|---|---|---|---|---|
| TextGCD | Concatenate top-3 labels + top-2 attributes | \(\mathcal{L}_{\text{CLIP}}\) (inter-modal) | Dual-branch + co-teaching | First multi-modal GCD; neglects intra-modal alignment |
| GET | Text inversion network generates prompts | \(\mathcal{L}_{\text{CLIP}}+\mathcal{L}_{\text{con}}\) | Single-branch MLP | Uses both inter- and intra-modal losses but combines them naively |
| CLIP-GCD | Knowledge base retrieves similar texts | \(\mathcal{L}_{\text{CLIP}}\) | SimGCD clustering | Relies solely on inter-modal alignment |
| SSR²-GCD | RTA: weighted aggregation of multiple candidates | \(\mathcal{L}_{\text{SSR}^2}\) (intra-modal only) | Dual-branch + co-teaching | First to address imbalanced compression; eliminates inter-modal alignment |
Rating¶
- Novelty: ⭐⭐⭐⭐ — Introducing coding rate reduction into multi-modal GCD offers a distinctive perspective; the finding that inter-modal alignment may be harmful is thought-provoking.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across 8 datasets, comparison of 6 representation learning configurations, and multi-dimensional analysis (rank, consistency, distribution, visualization).
- Writing Quality: ⭐⭐⭐⭐ — Well-structured with rigorous mathematical derivations, though notation is occasionally dense.
- Value: ⭐⭐⭐⭐ — Provides a new direction for representation learning in multi-modal GCD, with substantial improvements on fine-grained datasets.