# KEC: Hierarchical Textual Knowledge for Enhanced Image Clustering
Conference: CVPR 2026 · arXiv: 2604.11144 · Code: None · Area: Multimodal VLM · Keywords: Image Clustering, Textual Knowledge, Large Language Models, CLIP, Discriminative Attributes
## TL;DR
KEC uses LLMs to construct hierarchical concept-attribute textual knowledge that guides image clustering; without any training, it outperforms zero-shot CLIP on 14 of 20 datasets, demonstrating that discriminative attributes are more effective than simple class names.
## Background & Motivation
Background: Image clustering has evolved from geometric priors → deep representation learning → vision-language model-assisted paradigms. VLMs such as CLIP have made it possible to inject textual knowledge into clustering.
Limitations of Prior Work: Existing methods either employ VLMs to generate per-image captions (computationally expensive) or select shallow nouns from WordNet (semantically redundant and inconsistent in granularity). Naively introducing textual knowledge can even degrade clustering performance.
Key Challenge: Visually similar but semantically distinct categories (e.g., Akita Inu vs. Shiba Inu) cannot be distinguished by class names alone; discriminative attributes (leg length, tail curvature, ear posture) are required. However, acquiring such attributes demands domain expertise and is difficult to automate.
Core Idea: Use LLMs to distill abstract concepts from redundant nouns, then automatically extract intra-concept and inter-concept discriminative attributes, constructing hierarchical knowledge for feature enhancement.
## Method
### Overall Architecture
Images → CLIP visual features → alignment with WordNet nouns → LLM-based distillation of representative concepts → LLM extraction of single-concept and concept-pair discriminative attributes → instantiation as knowledge-enhanced features per image → combination with visual features → input to downstream clustering algorithms.
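No official code is released (Code: None above), so the sketch below only mirrors this flow structurally: `clip` and `llm` are hypothetical wrapper objects assumed to expose the listed methods, and the weighted concatenation at the end is one plausible reading of the fusion step, not the paper's exact formulation.

```python
import numpy as np

def kec_features(images, wordnet_nouns, clip, llm, alpha=0.5):
    """Structural sketch of the KEC pipeline; all helper names are illustrative."""
    # 1. CLIP visual features, assumed L2-normalized: (N, d)
    v = clip.encode_images(images)

    # 2. Align each image with its nearest WordNet noun
    noun_emb = clip.encode_texts(wordnet_nouns)          # (M, d)
    nearest = (v @ noun_emb.T).argmax(axis=1)
    active_nouns = sorted({wordnet_nouns[i] for i in nearest})

    # 3. LLM distills the redundant noun pool into representative concepts
    concepts = llm.abstract_concepts(active_nouns)

    # 4. LLM emits single-concept and concept-pair discriminative attributes
    attrs = llm.single_concept_attributes(concepts) + llm.pair_attributes(concepts)

    # 5. Instantiate knowledge per image: cosine similarity = attribute score
    k = v @ clip.encode_texts(attrs).T                   # (N, K)

    # 6. Fuse knowledge scores with visual features for downstream clustering
    return np.concatenate([alpha * v, (1 - alpha) * k], axis=1)
```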
### Key Designs
- Concept Abstraction:
    - Function: Distills representative concepts from redundant WordNet nouns.
    - Mechanism: Images are first mapped to their nearest nouns via CLIP; an LLM then merges semantically overlapping nouns into more abstract concept categories.
    - Design Motivation: WordNet contains excessive synonyms and near-synonyms (e.g., car/automobile/vehicle), and using them directly dilutes inter-category discriminability.
- Discriminative Attribute Extraction:
    - Function: Automatically generates distinguishing attributes for pairs of similar concepts.
    - Mechanism: Single-concept attributes (the LLM describes each concept's typical features) plus concept-pair attributes (the LLM contrasts the differential features of two similar concepts). For example, "Akita Inu vs. Shiba Inu" → "body size, coat length, ear shape" (see the prompt sketch after this list).
    - Design Motivation: Humans distinguish similar objects precisely through discriminative attributes; CLIP attention maps confirm that attribute descriptions direct the model's focus toward relevant regions.
- Knowledge Instantiation and Feature Fusion:
    - Function: Converts structured knowledge into per-image enhanced features.
    - Mechanism: Attribute descriptions are encoded by the CLIP text encoder; cosine similarities with the image are computed as attribute scores, concatenated into a knowledge-enhanced feature vector, and combined with the original visual features via weighted aggregation (see the code sketch after this list).
    - Design Motivation: Grounding global knowledge to individual image instances ensures that different images receive distinct knowledge enhancements.
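The two LLM mechanisms above reduce to plain text prompts with no image input. The templates below are illustrative guesses rather than the paper's actual prompts, and `chat` stands in for any string-in/string-out chat-completion client; a faithful implementation would contrast only visually similar concept pairs rather than all pairs.

```python
from itertools import combinations

# Hypothetical prompt templates (not the paper's actual wording).
ABSTRACT_TMPL = (
    "These nouns were matched to images from a single dataset:\n{nouns}\n"
    "Merge synonyms and near-synonyms (e.g., car/automobile/vehicle) and "
    "return one representative concept per line."
)
CONTRAST_TMPL = (
    "'{a}' and '{b}' are visually similar. List the visual attributes that "
    "best distinguish them, e.g., body size, coat length, ear shape."
)

def build_knowledge(nouns, chat):
    """chat: any callable str -> str backed by an LLM (an assumption here)."""
    # Concept abstraction: one call over the whole noun pool.
    concepts = chat(ABSTRACT_TMPL.format(nouns=", ".join(nouns))).splitlines()
    # Concept-pair discriminative attributes (all pairs, for simplicity).
    pair_attrs = {
        (a, b): chat(CONTRAST_TMPL.format(a=a, b=b))
        for a, b in combinations(concepts, 2)
    }
    return concepts, pair_attrs
```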
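Knowledge instantiation and fusion are then ordinary CLIP scoring. Below is a minimal runnable sketch for a single image using Hugging Face's CLIP; the example attributes echo the Akita-vs-Shiba discussion above, and the fusion weight `alpha` and the concatenation scheme are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative attribute descriptions, as an LLM might emit them.
attributes = [
    "a large dog with a thick double coat",
    "a small dog with a tightly curled tail",
    "a dog with short, erect, triangular ears",
]

@torch.no_grad()
def knowledge_enhanced_feature(image: Image.Image, alpha: float = 0.5):
    # Visual feature from the CLIP image encoder, L2-normalized.
    img_in = processor(images=image, return_tensors="pt")
    v = F.normalize(model.get_image_features(**img_in), dim=-1)   # (1, d)

    # Attribute embeddings from the CLIP text encoder.
    txt_in = processor(text=attributes, return_tensors="pt", padding=True)
    t = F.normalize(model.get_text_features(**txt_in), dim=-1)   # (K, d)

    # Cosine similarities serve as this image's attribute scores.
    k = F.normalize(v @ t.T, dim=-1)                              # (1, K)

    # Weighted concatenation of visual and knowledge features (assumed form).
    return torch.cat([alpha * v, (1 - alpha) * k], dim=-1)
```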
### Loss & Training
KEC requires no training; it directly generates enhanced features that are fed into existing clustering algorithms (e.g., K-means, spectral clustering).
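For instance, the enhanced features can be handed to scikit-learn unchanged and scored with NMI (a usage sketch; the file paths are hypothetical, and labels are used only for evaluation):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

features = np.load("kec_features.npy")   # (N, d') knowledge-enhanced features
labels = np.load("labels.npy")           # ground-truth labels, evaluation only

pred = KMeans(n_clusters=np.unique(labels).size, n_init=10,
              random_state=0).fit_predict(features)
print("NMI:", normalized_mutual_info_score(labels, pred))
```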
## Key Experimental Results
### Main Results
| Comparison | Metric | KEC (Training-Free) | Trained Methods | Notes |
|---|---|---|---|---|
| Average over 20 datasets | NMI | Higher | ≈3% lower than KEC | KEC surpasses trained methods without training |
| vs. CLIP zero-shot | Acc | Wins on 14/20 datasets | — | — |
### Ablation Study
| Configuration | NMI | Notes |
|---|---|---|
| KEC (full) | Best | Concepts + attributes + fusion |
| Naive textual knowledge | Degraded on some datasets | Demonstrates necessity of structured knowledge |
| Concepts only, no attributes | Moderate | Attributes contribute significantly |
| Single-concept attributes only | Sub-optimal | Concept-pair attributes provide further gains |
### Key Findings
- Naively introducing textual knowledge (e.g., directly using nouns) degrades performance on certain datasets, confirming the necessity of structured knowledge.
- Concept-pair discriminative attributes contribute more than single-concept attributes, indicating that contrastive information is critical for distinguishing similar categories.
- KEC is insensitive to the choice of downstream clustering algorithm, exhibiting broad compatibility.
## Highlights & Insights
- LLM as a Knowledge Source: No image input to the LLM is required; sufficient discriminative knowledge is obtained purely through textual interaction at minimal cost.
- Structured > Naive: Demonstrates that knowledge quality matters more than knowledge quantity.
## Limitations & Future Work
- Performance depends on the quality of CLIP's text-image alignment.
- LLM-generated attributes may be biased.
- No comparison against specialized fine-grained methods on fine-grained datasets.
## Related Work & Insights
- vs. SIC/TAC: These methods annotate directly with shallow nouns or WordNet, leading to severe semantic redundancy.
- vs. VLM captioning: Per-image caption generation is computationally intensive and does not scale well.
## Rating
- Novelty: ⭐⭐⭐⭐ Clear and well-motivated hierarchical knowledge construction pipeline.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluation across 20 datasets is highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Motivation and methodology are described clearly.
- Value: ⭐⭐⭐⭐ Surpassing trained methods without training demonstrates strong practical utility.