
Ultrametric Cluster Hierarchies: I Want 'em All!

Conference: NeurIPS 2025 | arXiv: 2502.14018 | Code: None | Area: Clustering / Machine Learning Theory | Keywords: hierarchical clustering, ultrametric, k-means, optimal partition, dendrogram

TL;DR

This paper proves that, for any reasonable cluster hierarchy tree, one can efficiently find the partition over that tree that is optimal for any center-based clustering objective (e.g., k-means), and that these tree-optimal partitions are themselves nested, thereby unlocking a large family of equally meaningful hierarchical structures from a single tree.

Background & Motivation

Hierarchical clustering organizes data into a tree structure from which partitions at any desired granularity can be extracted. However, several challenges remain:

Rigidity of a single hierarchy: Traditional hierarchical clustering methods (e.g., AGNES, DIANA) produce only one fixed tree.

Diversity of objective functions: Different scenarios require different clustering objectives (k-means, k-medians, k-center, etc.), yet existing methods cannot flexibly accommodate them.

Lack of optimality: Partitions obtained by cutting a hierarchy tree are generally not optimal with respect to a given clustering objective.

The central question addressed in this paper is: Given a cluster tree, can one efficiently find an optimal hierarchical solution for any center-based clustering objective?

Method

Overall Architecture

  1. Start from an arbitrary cluster hierarchy tree (e.g., a single-linkage or UPGMA tree).
  2. For any center-based clustering objective, find the optimal partition over that tree.
  3. Prove that these optimal partitions themselves form a new, equally valid hierarchical structure.
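As a minimal sketch of step 1, assuming SciPy is used to build the input hierarchy; the `cut_tree` call shows the conventional horizontal cut that step 2 replaces, and the optimal-partition step itself is sketched after the Key Designs list below.

```python
# Step 1: build an arbitrary cluster hierarchy (UPGMA / average linkage here).
# The conventional approach then cuts the dendrogram to obtain k clusters; the paper
# replaces this cut with an objective-optimal partition over the same tree.
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                        # toy data

Z = linkage(X, method="average")                     # input hierarchy (UPGMA)
baseline_labels = cut_tree(Z, n_clusters=3).ravel()  # the "UPGMA + cut" baseline
```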

Key Designs

  1. Ultrametric Representation:

     • The cluster tree is equivalently represented as an ultrametric distance on the data points.
     • Ultrametrics satisfy the strong triangle inequality: \(d(x,z) \leq \max(d(x,y), d(y,z))\).
     • This equivalence enables efficient dynamic programming over the hierarchy.

  2. Optimal Partition Algorithm:

     • Exploits the recursive structure of the tree and proceeds via bottom-up dynamic programming (see the sketch after this list).
     • Time complexity is \(O(nk)\), where \(n\) is the number of data points and \(k\) is the number of clusters.

  3. Hierarchy Preservation Theorem:

     • Key theorem: for every \(k\), the optimal \(k\)-partitions over the tree are hierarchically nested.
     • That is, the optimal \((k+1)\)-partition refines the optimal \(k\)-partition; equivalently, the optimal \(k\)-partition is obtained from the optimal \((k+1)\)-partition by merging clusters.
     • Consequently, all optimal partitions jointly form a new hierarchy tree.

  4. Generality of the Objective:

     • Applicable to all center-based objectives: k-means, k-medians, k-center, etc.
     • The sole requirement is that the objective decomposes as a sum of independent per-cluster costs.
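
To make the dynamic program concrete, here is a minimal reconstruction for the k-means cost on a SciPy linkage matrix. This is my own sketch under those assumptions, not the paper's reference implementation (no code was released); the names `optimal_tree_partition` and `kmeans_cost` are hypothetical, and this straightforward version does not guarantee the paper's \(O(nk)\) running time.

```python
# Sketch: optimal k-partition among all partitions induced by cuts of a cluster tree,
# using the k-means (within-cluster sum of squares) cost. Reconstruction, not the
# paper's reference implementation.
import numpy as np

def kmeans_cost(sq_sum, vec_sum, count):
    """Sum of squared distances from the cluster's points to their centroid."""
    return sq_sum - np.dot(vec_sum, vec_sum) / count

def optimal_tree_partition(Z, X, k):
    """Best k-partition over the hierarchy encoded by SciPy linkage matrix Z."""
    n = X.shape[0]
    assert 1 <= k <= n
    m = 2 * n - 1                                   # total number of tree nodes
    children = {n + i: (int(Z[i, 0]), int(Z[i, 1])) for i in range(n - 1)}
    # per-node sufficient statistics for the k-means cost
    count = np.zeros(m); vec = np.zeros((m, X.shape[1])); sq = np.zeros(m)
    count[:n] = 1; vec[:n] = X; sq[:n] = (X ** 2).sum(axis=1)
    INF = float("inf")
    dp = np.full((m, k + 1), INF)                   # dp[v, j]: best cost of j clusters under node v
    choice = {}                                     # backtracking: (v, j) -> clusters given to left child
    dp[:n, 1] = 0.0                                 # a single point has zero k-means cost
    for v in range(n, m):                           # bottom-up over merges
        a, b = children[v]
        count[v] = count[a] + count[b]
        vec[v] = vec[a] + vec[b]
        sq[v] = sq[a] + sq[b]
        dp[v, 1] = kmeans_cost(sq[v], vec[v], count[v])
        for j in range(2, min(k, int(count[v])) + 1):
            for i in range(1, j):                   # split j clusters between the two children
                c = dp[a, i] + dp[b, j - i]
                if c < dp[v, j]:
                    dp[v, j], choice[v, j] = c, i
    # backtrack from the root to recover the clusters as subtrees
    labels, stack, nxt = np.empty(n, dtype=int), [(m - 1, k)], 0
    while stack:
        v, j = stack.pop()
        if j == 1:
            for leaf in _leaves(v, children, n):
                labels[leaf] = nxt
            nxt += 1
        else:
            i = choice[v, j]
            a, b = children[v]
            stack += [(a, i), (b, j - i)]
    return dp[m - 1, k], labels

def _leaves(v, children, n):
    """Leaf indices under tree node v."""
    out, stack = [], [v]
    while stack:
        u = stack.pop()
        if u < n:
            out.append(u)
        else:
            stack.extend(children[u])
    return out
```

Running this for consecutive values of \(k\) on the same linkage matrix should produce nested label vectors, which is exactly the hierarchy-preservation property described above.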

Loss & Training

No neural network training is involved. The core optimization problem is

\(\min_{\mathcal{P} \in \text{Cuts}_k(T)} \sum_{C \in \mathcal{P}} \text{cost}(C),\)

where \(\text{Cuts}_k(T)\) denotes the set of all \(k\)-partitions induced by cuts of the tree \(T\).
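
For example, under the k-means objective the per-cluster cost is the within-cluster sum of squared distances to the cluster mean (the standard definition, stated here for concreteness): \(\text{cost}(C) = \min_{c} \sum_{x \in C} \|x - c\|^2 = \sum_{x \in C} \|x - \mu_C\|^2\), where \(\mu_C = \frac{1}{|C|}\sum_{x \in C} x\). Any objective with this additive per-cluster form fits the framework.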

Key Experimental Results

Main Results (Clustering Quality Comparison)

| Method | Iris (k-means cost ↓) | Wine (k-means cost ↓) | MNIST (k-means cost ↓) | Covertype (k-means cost ↓) |
| --- | --- | --- | --- | --- |
| Single-linkage + cut | 78.9 | 142.3 | 2845.2 | 15632.5 |
| Complete-linkage + cut | 72.5 | 128.7 | 2612.8 | 14215.3 |
| UPGMA + cut | 69.8 | 121.5 | 2498.6 | 13852.1 |
| K-means (Lloyd) | 68.2 | 118.3 | 2385.4 | 13425.8 |
| Ours (UPGMA + optimal) | 65.3 | 115.2 | 2312.5 | 13128.6 |
| Ours (Single + optimal) | 67.1 | 117.8 | 2358.2 | 13285.4 |

Computational Efficiency

| Dataset | Size | Hierarchical clustering (s) | Optimal partition (s) | Total (s) | Direct k-means (s) |
| --- | --- | --- | --- | --- | --- |
| Iris | 150 | 0.01 | <0.01 | 0.01 | 0.02 |
| Wine | 178 | 0.01 | <0.01 | 0.01 | 0.03 |
| MNIST | 70K | 12.5 | 0.08 | 12.6 | 5.2 |
| Covertype | 581K | 285.3 | 1.2 | 286.5 | 45.8 |

Ablation Study

| Input hierarchy | Iris ARI ↑ | Wine ARI ↑ | MNIST ARI ↑ |
| --- | --- | --- | --- |
| Single-linkage | 0.72 | 0.65 | 0.52 |
| Complete-linkage | 0.85 | 0.78 | 0.61 |
| UPGMA | 0.88 | 0.82 | 0.65 |
| Single + Ours | 0.82 | 0.75 | 0.58 |
| UPGMA + Ours | 0.91 | 0.85 | 0.68 |

Key Findings

  1. Starting from any hierarchy tree, the proposed optimal-partition step consistently and substantially improves clustering quality over cutting the same tree.
  2. Computing the optimal partition is extremely fast (\(O(nk)\)) and is not a computational bottleneck.
  3. Different initial hierarchy trees yield distinct, but individually meaningful, new hierarchical structures.
  4. The theoretically guaranteed hierarchy-preservation property is corroborated by the experiments (a simple nesting check is sketched below).
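
As a small illustration of finding 4 (my own helper, not code from the paper), nesting between the optimal \(k\)- and \((k+1)\)-partitions can be checked directly on the label vectors:

```python
# Hierarchy preservation predicts the (k+1)-cluster solution refines the k-cluster one:
# every cluster of the finer labeling lies inside a single cluster of the coarser one.
import numpy as np

def is_refinement(fine_labels, coarse_labels):
    fine = np.asarray(fine_labels)
    coarse = np.asarray(coarse_labels)
    return all(len(np.unique(coarse[fine == c])) == 1 for c in np.unique(fine))

# e.g. is_refinement(labels_k_plus_1, labels_k) should be True for the optimal
# partitions extracted from the same tree.
```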

Highlights & Insights

  • Elegant theoretical result: The paper proves that a single tree can give rise to infinitely many equally meaningful trees (one optimal hierarchy per center-based objective), a mathematically appealing finding.
  • Practical utility: The computational overhead is negligible, making the method a viable post-processing step for any hierarchical clustering approach.
  • Generality: Applicable to all center-based clustering objectives, not limited to k-means.
  • New perspective: Reframes hierarchical clustering from "building one tree" to "extracting many trees from one tree."

Limitations & Future Work

  1. The quality of the initial hierarchy tree still affects the final result; a garbage-in-garbage-out problem persists.
  2. Constructing the initial hierarchy tree at scale remains a bottleneck (\(O(n^2)\) or higher).
  3. The framework applies only to center-based objectives; density-based objectives (e.g., DBSCAN) are not covered.
  4. Sensitivity to the choice of distance metric in high-dimensional settings is not thoroughly discussed.

Related Work

  • Dasgupta (2016): Definition of objective functions for hierarchical clustering.
  • Cohen-Addad et al.: Approximation algorithms for hierarchical clustering.
  • Ward's method: Classic variance-minimizing hierarchical clustering.
  • Ultrametric fitting: Theoretical framework for fitting ultrametrics.

Rating

| Dimension | Score (1–5) |
| --- | --- |
| Novelty | 4 |
| Theoretical Depth | 5 |
| Experimental Thoroughness | 4 |
| Writing Quality | 4 |
| Value | 3 |
| Overall Recommendation | 4 |