Region-based Cluster Discrimination for Visual Representation Learning¶
Conference: ICCV 2025 arXiv: 2507.20025 Code: GitHub Area: Image Segmentation Keywords: Region representation learning, cluster discrimination, OCR-awareness, visual encoder, multimodal large language models
TL;DR¶
This paper proposes RICE (Region-based Cluster Discrimination), which constructs a billion-scale region dataset, designs a Region Transformer layer, and introduces a unified region cluster discrimination loss to jointly optimize object-aware and OCR capabilities, significantly improving visual encoder performance across segmentation, detection, and MLLM multi-task benchmarks.
Background & Motivation¶
Vision-language contrastive models such as CLIP and SigLIP have learned powerful global visual representations through large-scale image-text alignment. However, these models exhibit limited performance on dense prediction tasks (segmentation, localization, OCR) due to the following reasons:
- Semantic insufficiency in instance discrimination: The partitioning of positive and negative sample pairs ignores semantic similarity, treating semantically similar samples from different instances as negative pairs.
- OCR interference with high-level semantics: When OCR pairs are included in vision-language contrastive training, the visual encoder tends to focus on text recognition at the expense of object semantics.
- Limitations of global representations for local modeling: Existing cluster discrimination methods (e.g., UNICOM, MLCD) assign one or more pseudo-labels per image, precluding the learning of local region-level representations.
Existing region-level methods such as RegionCLIP and CLIM rely on region-text descriptions, limiting their scalability. The core motivation of RICE is to replace text descriptions with cluster centers as region supervision signals, enabling region-level representation learning at scale.
Method¶
Overall Architecture¶
RICE adopts a ViT backbone, appending Region Transformer layers after the standard Transformer layers to extract both global and region-level semantics in a single forward pass. Training supervision comes from two branches:
- Object Region Loss: Single-label classification based on cluster centers
- OCR Region Loss: Multi-label classification based on token embeddings
Key Design 1: Region Data Construction¶
Object region data: Sampled from LAION-2B, COYO-700M, and SAM-1B. SAM is applied to LAION and COYO to generate fine-grained masked regions; candidate boxes with shortest side ≥128px are retained, yielding 400 million images and 2 billion candidate regions. Region features are extracted via CLIP and grouped into 1 million semantic centers using hierarchical k-means with Faiss GPU, which completes in approximately 10 hours on 64 GPUs.
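As a rough illustration of this clustering step, the sketch below runs a single-level spherical k-means over pre-extracted region features with Faiss; the array names and hyperparameters are placeholders, and the paper's hierarchical variant and sharded processing are not reproduced.

```python
# Minimal sketch: cluster pre-extracted CLIP region features into semantic
# centers with Faiss k-means (single-level, for illustration only).
import numpy as np
import faiss

def cluster_region_features(features: np.ndarray, num_centers: int = 1_000_000,
                            niter: int = 20, use_gpu: bool = True) -> np.ndarray:
    """features: (num_regions, dim) float32; returns (num_centers, dim) centroids."""
    d = features.shape[1]
    # L2-normalize so spherical k-means approximates cosine similarity.
    faiss.normalize_L2(features)
    kmeans = faiss.Kmeans(d, num_centers, niter=niter, gpu=use_gpu,
                          spherical=True, verbose=True)
    kmeans.train(features)
    return kmeans.centroids

# Example (toy-sized; the 2B-region corpus would be processed in shards):
# feats = np.load("region_clip_features.npy").astype("float32")  # placeholder path
# centers = cluster_region_features(feats, num_centers=100_000)
```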
OCR region data: PaddleOCR is applied to LAION-2B and COYO-700M to extract text (confidence >0.7), yielding 50 million images and 400 million candidate regions. Extracted text is tokenized to produce OCR labels.
Key Design 2: Region Transformer Layer¶
Region sampling: The number of regions per image is normalized to \(N\). If the actual count exceeds \(N\), regions are randomly subsampled; otherwise, they are sampled with replacement.
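A minimal sketch of this normalization step (the function name and tensor layout are assumptions, not from the released code):

```python
# Normalize the per-image region count to exactly N: subsample without
# replacement when there are too many regions, otherwise sample with replacement.
import torch

def sample_regions(region_boxes: torch.Tensor, num_regions: int) -> torch.Tensor:
    """region_boxes: (R, 4) boxes for one image; returns exactly (num_regions, 4)."""
    r = region_boxes.size(0)
    if r >= num_regions:
        idx = torch.randperm(r)[:num_regions]          # random subsample
    else:
        idx = torch.randint(0, r, (num_regions,))      # sample with replacement
    return region_boxes[idx]
```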
Region attention:
- The number of tokens per region varies with spatial size, making direct batch processing difficult.
- A region visibility mask \(\mathcal{M}\) is therefore introduced: entries for tokens inside the region are set to 0 and entries for tokens outside are set to \(-\infty\).
- The mask is added to the attention logits before the softmax, i.e. \(\mathrm{softmax}\!\left(QK^\top/\sqrt{d} + \mathcal{M}\right)V\), so each region query attends only to the tokens belonging to its region.
- Fixed-length embeddings for all regions are extracted in a single forward pass.
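A minimal sketch of this masked region attention, assuming one query vector per sampled region and plain scaled dot-product attention; the actual Region Transformer layer may differ in projections and head structure.

```python
# Masked region attention: each region query attends only to patch tokens
# that fall inside its region, yielding one fixed-length embedding per region.
import torch
import torch.nn.functional as F

def region_attention(queries, patch_tokens, region_masks):
    """
    queries:      (B, N, D) one query per sampled region
    patch_tokens: (B, T, D) ViT patch tokens used as keys/values
    region_masks: (B, N, T) bool, True where a patch token lies inside the region
    returns:      (B, N, D) fixed-length region embeddings
    """
    d = queries.size(-1)
    logits = torch.matmul(queries, patch_tokens.transpose(-2, -1)) / d ** 0.5  # (B, N, T)
    # Visibility mask M: 0 for in-region tokens, -inf for everything else.
    additive_mask = torch.where(region_masks, 0.0, float("-inf"))
    attn = F.softmax(logits + additive_mask, dim=-1)
    return torch.matmul(attn, patch_tokens)
```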
Key Design 3: Region Cluster Discrimination Loss¶
Object region loss (single-label classification): each region embedding is classified against the 1 million cluster centers, with its assigned k-means center serving as the single positive class.
OCR region loss (multi-label classification): each OCR region has multiple token embeddings as positives, and the region embedding is trained to match all of its text tokens simultaneously.
Negative sampling strategy: Negative cluster centers are uniformly sampled from the full category set at rate \(\rho=0.1\), reducing semantically conflicting gradients and improving training stability.
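The sketch below illustrates how the two losses and the negative sampling could be wired together; the cosine-similarity logits with temperature, the BCE form of the multi-label branch, and all hyperparameters are assumptions for illustration rather than the paper's exact formulation.

```python
# Hedged sketch of the unified region cluster discrimination objective.
import torch
import torch.nn.functional as F

def object_region_loss(region_emb, centers, labels, rho=0.1, tau=0.07):
    """Single-label branch: classify region embeddings against cluster centers.
    region_emb: (M, D); centers: (C, D); labels: (M,) cluster index per region."""
    region_emb = F.normalize(region_emb, dim=-1)
    centers = F.normalize(centers, dim=-1)
    num_centers = centers.size(0)
    # Uniformly sample a fraction rho of all centers as shared negatives,
    # then ensure every region's positive center stays in the candidate set.
    neg_idx = torch.randperm(num_centers)[: int(rho * num_centers)]
    candidates = torch.unique(torch.cat([neg_idx, labels]))
    logits = region_emb @ centers[candidates].t() / tau            # (M, |candidates|)
    # Remap original cluster ids to their positions inside the candidate set.
    remap = torch.full((num_centers,), -1, dtype=torch.long)
    remap[candidates] = torch.arange(candidates.numel())
    return F.cross_entropy(logits, remap[labels])

def ocr_region_loss(region_emb, token_emb, targets, tau=0.07):
    """Multi-label branch: each OCR region has several positive token embeddings.
    region_emb: (M, D); token_emb: (V, D); targets: (M, V) multi-hot labels."""
    logits = F.normalize(region_emb, dim=-1) @ F.normalize(token_emb, dim=-1).t() / tau
    return F.binary_cross_entropy_with_logits(logits, targets.float())
```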
Experiments¶
MLLM Multimodal Understanding (LLaVA-NeXT Framework)¶
| Vision Tower | LLM | DocVQA | OCRBench | InfoVQA | MM-Bench | Other Avg |
|---|---|---|---|---|---|---|
| CLIP ViT-L-336px | Qwen2.5-7B | 75.21 | 525 | 38.88 | 74.57 | 69.83 |
| SigLIP SO400M-384px | Qwen2.5-7B | 76.71 | 554 | 41.38 | 76.98 | 70.62 |
| AIMv2 ViT-L-336px | Qwen2.5-7B | 77.19 | 572 | 35.44 | 78.61 | 70.58 |
| RICE ViT-L-336px | Qwen2.5-7B | 79.19 | 575 | 45.23 | 76.55 | 73.03 |
RICE-336px consistently leads on OCR-heavy tasks: +50 points on OCRBench and +3.98 points on DocVQA over CLIP.
Referring Image Segmentation (LLaVA-NeXT + LISA)¶
| Vision Tower | LLM | RefCOCO val | RefCOCO+ val | RefCOCOg val |
|---|---|---|---|---|
| CLIP | Qwen2.5-7B | 81.8 | 76.6 | 77.3 |
| MLCD | Qwen2.5-7B | 82.8 | 77.4 | 78.5 |
| RICE | Qwen2.5-7B | 83.5 | 79.4 | 79.8 |
RICE surpasses both CLIP and MLCD on all RefCOCO benchmarks, with mean IoU improvements of +2.45 (vs. CLIP) and +1.30 (vs. MLCD).
Key Findings¶
- Region cluster discrimination yields a pronounced advantage on dense OCR tasks (InfoVQA +9.79 vs. AIMv2), as the joint training objective avoids conflicts between object semantics and OCR.
- t-SNE visualizations demonstrate that RICE produces substantially better object feature clustering than DINOv2, MLCD, and SigLIP.
- A random negative sampling rate of \(\rho=0.1\) is optimal; higher rates introduce semantically conflicting gradients.
- Placing the Region Transformer layers near the end of the ViT is optimal, balancing global context and region-level precision.
Highlights & Insights¶
- Data engineering at scale: The construction of 2 billion regions with 1 million cluster centers is ambitious; replacing text descriptions with cluster centers enables text-free region supervision.
- Unified framework: Object recognition and OCR tasks are jointly trained within a single classification framework, avoiding multi-task conflicts.
- Plug-and-play compatibility: The RICE visual encoder can directly replace CLIP in frameworks such as LLaVA without architectural modifications.
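For example, if the released checkpoint exposes a CLIP-compatible vision tower (an assumption here, not confirmed by the paper), swapping it into a LLaVA-style pipeline could be as simple as pointing the vision-tower path at the RICE weights:

```python
# Hypothetical drop-in replacement: load the RICE vision tower through the
# standard CLIP vision interface (checkpoint path is a placeholder, and
# CLIP-format compatibility of the released weights is an assumption).
from transformers import CLIPVisionModel, CLIPImageProcessor

vision_tower_path = "<path-or-hub-id-of-RICE-ViT-L-336px>"  # placeholder
vision_tower = CLIPVisionModel.from_pretrained(vision_tower_path)
image_processor = CLIPImageProcessor.from_pretrained(vision_tower_path)
# The rest of the LLaVA-style pipeline (projector + LLM) stays unchanged.
```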
Limitations & Future Work¶
- Data construction depends on the quality of SAM and PaddleOCR; annotation noise may propagate into cluster centers.
- Storage and update costs for 1 million cluster centers are non-trivial.
- Evaluation is limited to ViT-L and ViT-B scales; performance at larger scales (e.g., ViT-G) remains unexplored.
Related Work & Insights¶
- Instance discrimination: Visual representation learning methods such as CLIP, SigLIP, and DINOv2.
- Cluster discrimination: Clustering-based self-supervised methods such as DeepCluster, SwAV, UNICOM, and MLCD.
- Region representation: Region-language alignment methods such as RegionCLIP, CLIM, and GLIP.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The region cluster discrimination approach is original; the unified Object+OCR design is meaningful.
- Technical Depth: ⭐⭐⭐⭐ — The Region Transformer and loss function designs are complete; data engineering is rigorous.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers MLLM, segmentation, detection, and OCR with thorough ablation studies.
- Writing Quality: ⭐⭐⭐⭐ — Motivation is clear and the architecture diagram is intuitive.