SSR2-GCD: Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery¶
Conference: CVPR 2026 arXiv: 2602.19910 Code: N/A Area: Self-Supervised Learning / Multi-Modal VLM / Representation Learning Keywords: Generalized Category Discovery, Maximal Coding Rate Reduction, Intra-modal Alignment, CLIP, Multi-Modal Representation Learning
TL;DR¶
This paper proposes SSR2-GCD, a framework that replaces conventional contrastive losses with a Semi-Supervised Rate Reduction (SSR2) loss to learn uniformly compressed, structured representations. The work further reveals that inter-modal alignment is not only unnecessary but harmful in multi-modal GCD, achieving +3.1% and +6.3% over the prior state of the art on Stanford Cars and Flowers102, respectively.
Background & Motivation¶
Generalized Category Discovery (GCD) requires a model to leverage partially labeled known categories to discover unknown ones. Recent multi-modal methods (CLIP-GCD, TextGCD, GET) have begun exploiting textual modalities to assist visual category discovery, yet their representation learning suffers from two fundamental issues: (1) they over-emphasize inter-modal alignment (CLIP-style) while neglecting intra-modal representation structure; (2) conventional contrastive losses induce imbalanced compression—labeled known categories are excessively compressed (effective rank drops sharply), while unlabeled unknown categories are under-compressed, causing blurry cluster boundaries.
Core Problem¶
How can one learn uniformly compressed, structured representations for both known and unknown categories in multi-modal GCD? And what roles do inter-modal alignment and intra-modal alignment each play in GCD?
Method¶
Overall Architecture¶
A three-module pipeline: (1) Retrieval-based Text Aggregation (RTA) generates semantically rich pseudo-text embeddings for each image; (2) the SSR2 module applies semi-supervised rate reduction losses separately within the image and text modalities to learn structured representations; (3) a dual-branch classifier processes image/text embeddings independently and aligns pseudo-labels via co-teaching.
Key Designs¶
- Semi-Supervised Rate Reduction Loss (SSR2): Designed on the principle of Maximal Coding Rate Reduction (MCR2). \(\mathcal{L}_{SSR^2} = -R(\mathbf{Z}) + R_c^s(\mathbf{Z}, \mathbf{Y}^*) + R_c^u(\mathbf{Z}, \mathbf{Y})\). The first term maximizes the global coding rate of representations (encouraging a more spread-out overall distribution); the second and third terms minimize the within-class coding rates of known categories (using ground-truth labels) and unknown categories (using pseudo-labels), respectively. The key advantage is that MCR2 theory guarantees each class is compressed into an equal-rank low-dimensional subspace, avoiding the over-compression of known categories that occurs with contrastive losses.
- Retrieval-based Text Aggregation (RTA): Addresses CLIP's token-length limitation on the long text prompts used in TextGCD. Rather than concatenating multiple tags into a long string, each tag and attribute is encoded separately, and a weighted aggregation produces the final text embedding: weight \(\sigma_1 = 1-\alpha\) is assigned to the most similar candidate and \(\sigma_i = \alpha/(c-1)\) to each of the remaining candidates (\(\alpha=0.5, c=4\)). This integrates richer candidate information without hitting any token-length constraint.
- Finding that Inter-Modal Alignment Is Unnecessary: Experiments demonstrate that applying \(\mathcal{L}_{SSR^2}\) alone (intra-modal alignment only) outperforms the combination with \(\mathcal{L}_{CLIP}\) (inter-modal alignment) on 5 out of 6 datasets. The rationale is that pre-trained CLIP already implicitly establishes cross-modal associations through similar-text retrieval; explicit inter-modal alignment introduces noise due to imprecise correspondence between pseudo-text and images, ultimately disrupting the structured intra-modal representations.
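The RTA weighting scheme is simple to sketch. Below is a hypothetical NumPy version (the function name and the cosine-similarity retrieval step are assumptions on my part; the paper retrieves candidates from a tag/attribute lexicon, and its exact normalization may differ):

```python
import numpy as np

def rta_aggregate(cand_embs, img_emb, alpha=0.5):
    """Aggregate c separately encoded candidate text embeddings:
    weight 1 - alpha on the candidate most similar to the image,
    alpha / (c - 1) on each remaining candidate."""
    c = cand_embs.shape[0]
    # cosine similarity of the image embedding to each candidate
    sims = cand_embs @ img_emb / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(img_emb))
    weights = np.full(c, alpha / (c - 1))
    weights[np.argmax(sims)] = 1.0 - alpha   # sigma_1 for the best match
    agg = weights @ cand_embs
    return agg / np.linalg.norm(agg)         # unit norm for CLIP-style scoring
```

With \(\alpha=0.5\) and \(c=4\), the best-matching candidate receives weight 0.5 and each of the other three receives 1/6, so the weights sum to 1.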
Loss & Training¶
Two-stage training: a warm-up phase (10 epochs) using \(\mathcal{L}_{SSR^2}^I + \mathcal{L}_{SSR^2}^T + \mathcal{L}_{cls}^I + \mathcal{L}_{cls}^T\); an alignment phase (190 epochs) that adds a co-teaching loss to align dual-branch predictions. No inter-modal alignment loss is used at any stage. Optimization is performed with SGD, learning rate 0.001, batch size 128, on a single RTX 3090.
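The \(\mathcal{L}_{SSR^2}\) terms used in both stages build on the MCR2 coding rate, \(R(\mathbf{Z}) = \frac{1}{2}\log\det\big(\mathbf{I} + \frac{d}{n\epsilon^2}\mathbf{Z}\mathbf{Z}^\top\big)\) for \(d \times n\) features \(\mathbf{Z}\). A minimal NumPy sketch of one such loss term (hypothetical: the paper's exact \(\epsilon\), class weighting, and pseudo-label handling may differ):

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z) = 1/2 logdet(I + d/(n*eps^2) Z Z^T) for d x n features Z."""
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * Z @ Z.T)
    return 0.5 * logdet

def within_class_rate(Z, labels, eps=0.5):
    """Class-conditional rate R_c: per-class rates weighted by class size."""
    n = Z.shape[1]
    return sum((np.sum(labels == c) / n) * coding_rate(Z[:, labels == c], eps)
               for c in np.unique(labels))

def ssr2_loss(Z, mask_known, y_known, y_pseudo, eps=0.5):
    """Sketch of L_SSR2 = -R(Z) + R_c^s(known, GT) + R_c^u(unknown, pseudo)."""
    loss = -coding_rate(Z, eps)
    if mask_known.any():
        loss += within_class_rate(Z[:, mask_known], y_known, eps)
    if (~mask_known).any():
        loss += within_class_rate(Z[:, ~mask_known], y_pseudo, eps)
    return loss
```

Minimizing this loss spreads the global distribution (first term) while compressing each class, known or pseudo-labeled, into its own low-rank subspace (second and third terms).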
Key Experimental Results¶
| Dataset | Metric | SSR2-GCD | TextGCD | GET | Gain vs. Best |
|---|---|---|---|---|---|
| Stanford Cars | All ACC | 89.2 | 86.1 | 78.5 | +3.1 |
| Flowers102 | All ACC | 93.5 | 87.2 | 85.5 | +6.3 |
| CIFAR-100 | All ACC | 86.4 | 85.7 | 82.1 | +0.7 |
| ImageNet-100 | All ACC | 92.1 | 88.0 | 91.7 | +0.4 |
| Oxford Pets | All ACC | 95.7 | 93.7 | 91.1 | +2.0 |
| ImageNet-1K | All ACC | 66.7 | 64.8 | 62.4 | +1.9 |
The ACC gap between Old and New categories is substantially narrowed—e.g., on Stanford Cars, Old 93.1% vs. New 87.3% yields a gap of only 5.8% (TextGCD's gap is 7.9%).
Ablation Study¶
- SSR2 vs. contrastive loss: SSR2 outperforms conventional supervised + unsupervised contrastive losses on 5/6 datasets, with a margin of 1.7% on Flowers102.
- Inter-modal alignment is harmful: \(\mathcal{L}_{CLIP} + \mathcal{L}_{SSR^2}\) underperforms \(\mathcal{L}_{SSR^2}\) alone on all 6 datasets; similarly, \(\mathcal{L}_{CLIP} + \mathcal{L}_{con}\) underperforms \(\mathcal{L}_{con}\) alone on 4/6 datasets.
- Quantifying imbalanced compression: Effective rank visualizations clearly show that contrastive losses cause a sharp rank drop in Old categories (over-compression), whereas SSR2 maintains uniform rank across Old and New categories.
- RTA is effective: Using 4 candidate tags + attributes outperforms TextGCD's top-3 tags + top-2 attributes, with \(\alpha=0.5\) being optimal.
- SSR2 generalizes to unimodal settings: Replacing contrastive losses with SSR2 in GCD and SimGCD also yields substantial improvements on fine-grained datasets, indicating that imbalanced compression is a general problem.
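The effective-rank diagnostic used in these ablations is easy to reproduce. A minimal sketch using the standard definition, the exponential of the entropy of the normalized singular-value distribution (the paper's exact normalization may differ):

```python
import numpy as np

def effective_rank(Z):
    """Effective rank: exp of the Shannon entropy of the normalized
    singular-value distribution. Near-uniform spectra give high values;
    a sharp drop indicates a class collapsed into a low-dim subspace."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # discard zero singular values before taking logs
    return float(np.exp(-np.sum(p * np.log(p))))
```

Tracking this quantity separately for Old and New category features is how the imbalanced-compression claim above can be checked: contrastive training drives the Old-category value down sharply while leaving the New-category value high.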
Highlights & Insights¶
- The finding that "inter-modal alignment is unnecessary and even harmful in multi-modal GCD" is highly counter-intuitive yet rigorously validated—it challenges the default assumption of CLIP-style contrastive learning.
- SSR2 addresses the imbalanced compression problem from the perspective of coding rate theory, yielding an approach that is both theoretically principled and practically effective.
- Two quantitative metrics—edge ratio \(R_e\) and effective rank—clearly characterize the impact of different loss functions on representation structure.
- The RTA strategy is concise and effective: encoding multiple retrieved candidates separately and aggregating them with similarity-based weights elegantly circumvents CLIP's token-length limitation.
Limitations & Future Work¶
- Increasing the number of candidates incurs additional computational and memory overhead (Table B.11 reports an extra 11% memory usage).
- Image and text modalities are treated symmetrically; adaptive weighting of modality importance is not explored.
- The number of categories \(K\) is assumed to be known or accurately estimable, whereas category number estimation is itself a challenging problem in practice.
- Validation is limited to CLIP-B/16; whether larger CLIP models (L/14, H/14) could yield further gains remains unknown.
Related Work & Insights¶
- vs. TextGCD: TextGCD employs CLIP-style inter-modal alignment with co-teaching but imposes no intra-modal structural constraints. SSR2-GCD outperforms it comprehensively across all datasets.
- vs. GET: GET uses a textual inversion network to generate prompts combined with contrastive loss and CICO inter-modal alignment. SSR2-GCD demonstrates that both forms of inter-modal alignment (CLIP loss and CICO) degrade intra-modal learning.
- vs. SimGCD/SelEx (unimodal): SSR2 loss also improves these unimodal methods, confirming that imbalanced compression is a widespread issue.
The finding that "intra-modal alignment matters more than inter-modal alignment" is broadly relevant to any work leveraging CLIP for downstream tasks—blindly adding a CLIP contrastive loss is not always beneficial. The MCR2/rate reduction principle is generalizable to a wider class of semi-supervised and open-world learning problems. Effective rank, as a measure of representation quality, provides a useful tool for monitoring and diagnosing issues in representation learning.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Extending MCR2 to GCD with a semi-supervised formulation is novel, and the finding that inter-modal alignment is harmful is striking.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Eight datasets, extensive loss function comparisons, visualization analyses, unimodal validation, and highly detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ — Overall clear with thorough theoretical and empirical analysis, though the dense notation requires careful reading.
- Value: ⭐⭐⭐⭐ — Offers important insights for the multi-modal GCD field; core findings are transferable to broader VLM applications.