# CAE: Hierarchical Semantic Alignment for Image Clustering
Conference: AAAI 2026 · arXiv: 2512.00904 · Code: None · Area: Other · Keywords: image clustering, semantic alignment, optimal transport, caption-level semantics, training-free
## TL;DR
By combining two complementary semantic sources — noun-level (WordNet) and caption-level (Flickr image captions) — and constructing a semantic space via optimal transport alignment followed by adaptive fusion, this work achieves training-free image clustering with a 4.2% accuracy improvement on ImageNet-1K.
## Background & Motivation
Background: Image clustering has evolved from contrastive learning and self-supervised methods toward leveraging external textual semantics. SIC employs WordNet nouns, while TAC combines noun embeddings with image embeddings.
Limitations of Prior Work: Relying solely on nouns introduces two problems: polysemy (e.g., "crane" can refer to a bird or a machine) and insufficient granularity (e.g., the single noun "spaniel" cannot distinguish among the different spaniel breeds).
Key Challenge: Nouns provide high-level category information but lack attribute-level detail; a single semantic type is insufficient for disambiguation.
Goal: Construct a more precise external semantic space to guide image clustering.
Key Insight: Nouns (categories) and captions (attributes) are complementary — nouns encode "what it is," while captions encode "what it looks like."
Core Idea: Combine WordNet nouns and Flickr captions, construct a dual semantic space via OT alignment, and adaptively fuse the two sources.
## Method
### Overall Architecture
The inputs are CLIP-encoded image embeddings, WordNet noun embeddings, and Flickr caption embeddings. The pipeline consists of two stages: (1) Semantic space construction — filtering relevant subsets of nouns/captions and computing semantic correspondences for each image via OT; (2) Adaptive fusion — a prototype-guided weighting mechanism fuses tri-modal features, followed by k-means clustering.
### Key Designs
- Semantic Space Construction
  - Function: Filter a semantically relevant subset from the large noun/caption corpus.
  - Mechanism: Apply k-means to the \(N\) image embeddings to obtain \(n = N/300\) cluster centers; for each center, select the top-\(K\) most similar nouns/captions and take the union (a sketch follows this item).
  - Design Motivation: The space must be neither too large (introducing noise) nor too small (losing information).
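A minimal sketch of this filtering step, assuming L2-normalized CLIP embeddings stored as NumPy arrays; the function name `build_semantic_subset`, the `imgs_per_center` parameter, and the default `top_k=5` are illustrative placeholders, not the paper's settings:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_semantic_subset(img_emb, text_emb, top_k=5, imgs_per_center=300):
    """Filter a relevant subset of noun/caption embeddings (illustrative).

    img_emb:  (N, d) L2-normalized CLIP image embeddings
    text_emb: (M, d) L2-normalized CLIP noun or caption embeddings
    Returns the union of the top-K texts over n = N/300 image cluster centers.
    """
    n_centers = max(1, img_emb.shape[0] // imgs_per_center)
    centers = KMeans(n_clusters=n_centers, n_init=10).fit(img_emb).cluster_centers_
    centers /= np.linalg.norm(centers, axis=1, keepdims=True)

    sim = centers @ text_emb.T                    # cosine similarity, (n_centers, M)
    topk = np.argsort(-sim, axis=1)[:, :top_k]    # top-K text indices per center
    subset_idx = np.unique(topk)                  # union over all centers
    return text_emb[subset_idx], subset_idx
```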
- Optimal Transport for Semantic Correspondence
  - Function: Compute noun correspondences \(\mathbf{x}_i^u\) and caption correspondences \(\mathbf{x}_i^v\) for each image.
  - Mechanism: Image-noun alignment is formulated as an OT problem; the Sinkhorn-Knopp algorithm solves for the transport plan \(\mathbf{T}^u\), and the correspondence is computed as \(\mathbf{x}_i^u = \sum_j t_{i,j}^u s_{i,j}^u \mathbf{u}_j\) (see the sketch after this item).
  - Design Motivation: Theorem 1 proves that the column constraints of OT ensure balanced noun utilization, and that the resulting semantic error is provably no greater than that of softmax.
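A minimal Sinkhorn-Knopp sketch for the correspondence step, assuming standard entropic OT with uniform marginals and cosine similarity used as negative cost; the regularization `eps` and iteration count are placeholder choices, not the paper's settings:

```python
import numpy as np

def sinkhorn(sim, eps=0.05, n_iters=100):
    """Entropic OT via Sinkhorn-Knopp (uniform marginals assumed).

    sim: (N, M) image-text cosine similarities, treated as negative cost.
    The column constraint (each column sums to 1/M) is what enforces the
    balanced noun/caption utilization referenced in Theorem 1.
    """
    N, M = sim.shape
    K = np.exp(sim / eps)                    # Gibbs kernel exp(-cost / eps)
    r, c = np.full(N, 1.0 / N), np.full(M, 1.0 / M)
    u = np.ones(N)
    for _ in range(n_iters):
        v = c / (K.T @ u)                    # rescale to meet column marginals
        u = r / (K @ v)                      # rescale to meet row marginals
    return u[:, None] * K * v[None, :]       # T = diag(u) @ K @ diag(v)

def correspondence(img_emb, text_emb):
    """x_i = sum_j t_ij * s_ij * u_j, mirroring the formula above."""
    sim = img_emb @ text_emb.T               # s_ij
    T = sinkhorn(sim)                        # t_ij
    return (T * sim) @ text_emb              # transport- and similarity-weighted sum
```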
- Adaptive Semantic Fusion
  - Function: Dynamically adjust tri-modal fusion weights at the sample level.
  - Mechanism: A semantic prototype is computed as \(\mathbf{x}_i^p = \frac{1}{3}(\mathbf{x}_i + \mathbf{x}_i^u + \mathbf{x}_i^v)\); the cosine similarity between each modality and the prototype is passed through a temperature-scaled softmax to derive the fusion weights (sketch below).
  - Design Motivation: Different images rely on nouns and captions to varying degrees, necessitating sample-level adaptation.
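A sketch of the prototype-guided weighting as described, assuming L2-normalized inputs; `adaptive_fusion` is an illustrative name:

```python
import numpy as np

def adaptive_fusion(x, x_u, x_v, gamma=0.01):
    """Per-sample tri-modal fusion guided by a semantic prototype.

    x, x_u, x_v: (N, d) image features and noun/caption correspondences,
    assumed L2-normalized so dot products equal cosine similarities.
    """
    proto = (x + x_u + x_v) / 3.0                           # prototype x_i^p
    proto /= np.linalg.norm(proto, axis=1, keepdims=True) + 1e-12

    sims = np.stack([(m * proto).sum(axis=1)                # cos(modality, prototype)
                     for m in (x, x_u, x_v)], axis=1)       # (N, 3)
    z = sims / gamma                                        # temperature scaling
    z -= z.max(axis=1, keepdims=True)                       # numerical stability
    w = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)    # softmax weights

    return w[:, 0:1] * x + w[:, 1:2] * x_u + w[:, 2:3] * x_v
```

With \(\gamma = 0.01\) the softmax is sharp, so each sample leans heavily on whichever modality agrees most with its prototype.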
### Loss & Training
The method is entirely training-free: the OT problems are solved by Sinkhorn iteration, and k-means is applied to the fused features. The softmax temperature is set to \(\gamma = 0.01\).
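Putting the pieces together, a hedged end-to-end sketch that reuses the three functions above; the embedding names and `cae_cluster` signature are assumptions, and CLIP encoding is taken as given:

```python
import numpy as np
from sklearn.cluster import KMeans

def cae_cluster(img_emb, noun_emb, cap_emb, n_clusters, gamma=0.01):
    """Training-free pipeline sketch: filter -> OT correspondence -> fuse -> k-means."""
    nouns, _ = build_semantic_subset(img_emb, noun_emb)   # noun subspace
    caps, _ = build_semantic_subset(img_emb, cap_emb)     # caption subspace
    x_u = correspondence(img_emb, nouns)                  # noun correspondences
    x_v = correspondence(img_emb, caps)                   # caption correspondences
    # Normalize before fusion so prototype similarities are true cosines.
    l2 = lambda a: a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    fused = adaptive_fusion(l2(img_emb), l2(x_u), l2(x_v), gamma)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(fused)
```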
## Key Experimental Results
### Main Results
| Dataset | Metric | CAE | Prev. SOTA | Gain |
|---|---|---|---|---|
| ImageNet-1K | ACC | 76.5% | 72.3% (SIC) | +4.2% |
| ImageNet-1K | ARI | 56.8% | 53.9% | +2.9% |
| DTD | ACC | 52.3% | 47.7% | +4.6% |
| UCF-101 | ACC | 71.6% | 69.3% | +2.3% |
### Ablation Study
| Configuration | ImageNet-1K ACC | Note |
|---|---|---|
| Full CAE | 76.5% | Complete model |
| w/o Captions | 72.8% | −3.7% without captions |
| w/o Nouns | 71.2% | −5.3% without nouns |
| Softmax instead of OT | 74.1% | OT outperforms softmax by 2.4% |
| Simple concatenation | 74.8% | Adaptive fusion outperforms concatenation by 1.7% |
### Key Findings
- Nouns and captions are complementary: removing either leads to a substantial performance drop (−5.3% without nouns vs. −3.7% without captions), with nouns contributing somewhat more.
- OT outperforms softmax by 2.4%: column constraints ensure balanced utilization of nouns.
- Gains are more pronounced on challenging datasets (DTD texture: +4.6%; UCF-101 action: +2.3%).
## Highlights & Insights
- Dual semantic complementarity: Combining noun-level categories and caption-level attributes reduces the average pairwise cosine similarity among ImageNet-1K representations from 0.73 to 0.35.
- Theoretical advantage of OT: Theorem 1 rigorously proves OT superiority over softmax; column constraints prevent winner-takes-all collapse.
- Fully training-free: Relies only on CLIP and external databases; runs on a single RTX 3090.
## Limitations & Future Work
- Performance depends on the quality of CLIP embeddings and may degrade in domains where CLIP underperforms.
- The caption source is fixed (Flickr), providing insufficient coverage for specialized domains.
- The final clustering step uses k-means, which may be inadequate for non-convex cluster structures.
- The number of clusters must be specified in advance for the final k-means step.
## Related Work & Insights
- vs. SIC: SIC relies solely on WordNet nouns and is susceptible to polysemy; CAE introduces captions to resolve ambiguity, yielding +4.2%.
- vs. TAC: TAC combines nouns via cross-modal distillation but operates at a single semantic level; CAE introduces a caption-level hierarchy.
- vs. VIC: VIC generates captions using MLLMs but requires known class names; CAE is fully unsupervised.
## Rating
- Novelty: ⭐⭐⭐⭐ The combination of dual semantics and OT alignment is novel, though individual components are not entirely new.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on 8 datasets with comprehensive ablations and theoretical proofs.
- Writing Quality: ⭐⭐⭐⭐ The polysemy motivation figure is intuitive and effective.
- Value: ⭐⭐⭐⭐ A significant +4.2% on ImageNet-1K; strong practical utility as a training-free method.