
Ultrametric Cluster Hierarchies: I Want 'em All!

Conference: NeurIPS 2025 | arXiv: 2502.14018 | Code: None | Area: Clustering / Machine Learning Theory | Keywords: hierarchical clustering, ultrametric, k-means, optimal partition, dendrogram

TL;DR

This paper proves that, for any reasonable cluster hierarchy tree, one can efficiently find the partition over that tree that is optimal for any center-based clustering objective (e.g., k-means), and that these tree-optimal partitions are themselves nested, thereby unlocking a large family of equally meaningful hierarchical structures from a single tree.

Background & Motivation

Hierarchical clustering organizes data into a tree structure from which partitions at any desired granularity can be extracted. However, several challenges remain:

Rigidity of a single hierarchy: Traditional hierarchical clustering methods (e.g., AGNES, DIANA) produce only one fixed tree.

Diversity of objective functions: Different scenarios require different clustering objectives (k-means, k-medians, k-center, etc.), yet existing methods cannot flexibly accommodate them.

Lack of optimality: Partitions obtained by cutting a hierarchy tree are generally not optimal with respect to a given clustering objective.

The central question addressed in this paper is: Given a cluster tree, can one efficiently find an optimal hierarchical solution for any center-based clustering objective?

Method

Overall Architecture

  1. Start from an arbitrary cluster hierarchy tree (e.g., a single-linkage or UPGMA tree).
  2. For any center-based clustering objective, find the optimal partition over that tree.
  3. Prove that these optimal partitions themselves form a new, equally valid hierarchical structure.
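As a minimal sketch of step 1, assuming SciPy is used to build the input hierarchy; the `cut_tree` call shows the conventional horizontal cut that step 2 replaces, and the optimal-partition step itself is sketched after the Key Designs list below.

```python
# Step 1: build an arbitrary cluster hierarchy (UPGMA / average linkage here).
# The conventional approach then cuts the dendrogram to obtain k clusters; the paper
# replaces this cut with an objective-optimal partition over the same tree.
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                        # toy data

Z = linkage(X, method="average")                     # input hierarchy (UPGMA)
baseline_labels = cut_tree(Z, n_clusters=3).ravel()  # the "UPGMA + cut" baseline
```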

Key Designs

  1. Ultrametric Representation:

     • The cluster tree is equivalently represented as an ultrametric distance on the data points.
     • Ultrametrics satisfy the strong triangle inequality: \(d(x,z) \leq \max(d(x,y), d(y,z))\).
     • This equivalence enables efficient dynamic programming over the hierarchy.

  2. Optimal Partition Algorithm:

     • Exploits the recursive structure of the tree and proceeds via bottom-up dynamic programming (see the sketch after this list).
     • Time complexity is \(O(nk)\), where \(n\) is the number of data points and \(k\) is the number of clusters.

  3. Hierarchy Preservation Theorem:

     • Key theorem: for every \(k\), the optimal \(k\)-partitions over the tree are hierarchically nested.
     • That is, the optimal \((k+1)\)-partition refines the optimal \(k\)-partition; equivalently, the optimal \(k\)-partition is obtained from the optimal \((k+1)\)-partition by merging clusters.
     • Consequently, all optimal partitions jointly form a new hierarchy tree.

  4. Generality of the Objective:

     • Applicable to all center-based objectives: k-means, k-medians, k-center, etc.
     • The sole requirement is that the objective decomposes as a sum of independent per-cluster costs.
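
To make the dynamic program concrete, here is a minimal reconstruction for the k-means cost on a SciPy linkage matrix. This is my own sketch under those assumptions, not the paper's reference implementation (no code was released); the names `optimal_tree_partition` and `kmeans_cost` are hypothetical, and this straightforward version does not guarantee the paper's \(O(nk)\) running time.

```python
# Sketch: optimal k-partition among all partitions induced by cuts of a cluster tree,
# using the k-means (within-cluster sum of squares) cost. Reconstruction, not the
# paper's reference implementation.
import numpy as np

def kmeans_cost(sq_sum, vec_sum, count):
    """Sum of squared distances from the cluster's points to their centroid."""
    return sq_sum - np.dot(vec_sum, vec_sum) / count

def optimal_tree_partition(Z, X, k):
    """Best k-partition over the hierarchy encoded by SciPy linkage matrix Z."""
    n = X.shape[0]
    assert 1 <= k <= n
    m = 2 * n - 1                                   # total number of tree nodes
    children = {n + i: (int(Z[i, 0]), int(Z[i, 1])) for i in range(n - 1)}
    # per-node sufficient statistics for the k-means cost
    count = np.zeros(m); vec = np.zeros((m, X.shape[1])); sq = np.zeros(m)
    count[:n] = 1; vec[:n] = X; sq[:n] = (X ** 2).sum(axis=1)
    INF = float("inf")
    dp = np.full((m, k + 1), INF)                   # dp[v, j]: best cost of j clusters under node v
    choice = {}                                     # backtracking: (v, j) -> clusters given to left child
    dp[:n, 1] = 0.0                                 # a single point has zero k-means cost
    for v in range(n, m):                           # bottom-up over merges
        a, b = children[v]
        count[v] = count[a] + count[b]
        vec[v] = vec[a] + vec[b]
        sq[v] = sq[a] + sq[b]
        dp[v, 1] = kmeans_cost(sq[v], vec[v], count[v])
        for j in range(2, min(k, int(count[v])) + 1):
            for i in range(1, j):                   # split j clusters between the two children
                c = dp[a, i] + dp[b, j - i]
                if c < dp[v, j]:
                    dp[v, j], choice[v, j] = c, i
    # backtrack from the root to recover the clusters as subtrees
    labels, stack, nxt = np.empty(n, dtype=int), [(m - 1, k)], 0
    while stack:
        v, j = stack.pop()
        if j == 1:
            for leaf in _leaves(v, children, n):
                labels[leaf] = nxt
            nxt += 1
        else:
            i = choice[v, j]
            a, b = children[v]
            stack += [(a, i), (b, j - i)]
    return dp[m - 1, k], labels

def _leaves(v, children, n):
    """Leaf indices under tree node v."""
    out, stack = [], [v]
    while stack:
        u = stack.pop()
        if u < n:
            out.append(u)
        else:
            stack.extend(children[u])
    return out
```

Running this for consecutive values of \(k\) on the same linkage matrix should produce nested label vectors, which is exactly the hierarchy-preservation property described above.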

Loss & Training

No neural network training is involved. The core optimization problem is

\(\min_{\mathcal{P} \in \text{Cuts}_k(T)} \sum_{C \in \mathcal{P}} \text{cost}(C),\)

where \(\text{Cuts}_k(T)\) denotes the set of all \(k\)-partitions induced by cuts of the tree \(T\).
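
For example, under the k-means objective the per-cluster cost is the within-cluster sum of squared distances to the cluster mean (the standard definition, stated here for concreteness): \(\text{cost}(C) = \min_{c} \sum_{x \in C} \|x - c\|^2 = \sum_{x \in C} \|x - \mu_C\|^2\), where \(\mu_C = \frac{1}{|C|}\sum_{x \in C} x\). Any objective with this additive per-cluster form fits the framework.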

Key Experimental Results

Main Results (Clustering Quality Comparison)

| Method | Iris (k-means cost ↓) | Wine (k-means cost ↓) | MNIST (k-means cost ↓) | Covertype (k-means cost ↓) |
| --- | --- | --- | --- | --- |
| Single-linkage + cut | 78.9 | 142.3 | 2845.2 | 15632.5 |
| Complete-linkage + cut | 72.5 | 128.7 | 2612.8 | 14215.3 |
| UPGMA + cut | 69.8 | 121.5 | 2498.6 | 13852.1 |
| K-means (Lloyd) | 68.2 | 118.3 | 2385.4 | 13425.8 |
| Ours (UPGMA + optimal) | 65.3 | 115.2 | 2312.5 | 13128.6 |
| Ours (Single + optimal) | 67.1 | 117.8 | 2358.2 | 13285.4 |

Computational Efficiency

| Dataset | Size | Hierarchical clustering (s) | Optimal partition (s) | Total (s) | Direct k-means (s) |
| --- | --- | --- | --- | --- | --- |
| Iris | 150 | 0.01 | <0.01 | 0.01 | 0.02 |
| Wine | 178 | 0.01 | <0.01 | 0.01 | 0.03 |
| MNIST | 70K | 12.5 | 0.08 | 12.6 | 5.2 |
| Covertype | 581K | 285.3 | 1.2 | 286.5 | 45.8 |

Ablation Study

| Input hierarchy | Iris ARI ↑ | Wine ARI ↑ | MNIST ARI ↑ |
| --- | --- | --- | --- |
| Single-linkage | 0.72 | 0.65 | 0.52 |
| Complete-linkage | 0.85 | 0.78 | 0.61 |
| UPGMA | 0.88 | 0.82 | 0.65 |
| Single + Ours | 0.82 | 0.75 | 0.58 |
| UPGMA + Ours | 0.91 | 0.85 | 0.68 |

Key Findings

  1. Starting from any hierarchy tree, the proposed optimal-partition step consistently and substantially improves clustering quality over cutting the same tree.
  2. Computing the optimal partition is extremely fast (\(O(nk)\)) and is not a computational bottleneck.
  3. Different initial hierarchy trees yield distinct, but individually meaningful, new hierarchical structures.
  4. The theoretically guaranteed hierarchy-preservation property is corroborated by the experiments (a simple nesting check is sketched below).
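
As a small illustration of finding 4 (my own helper, not code from the paper), nesting between the optimal \(k\)- and \((k+1)\)-partitions can be checked directly on the label vectors:

```python
# Hierarchy preservation predicts the (k+1)-cluster solution refines the k-cluster one:
# every cluster of the finer labeling lies inside a single cluster of the coarser one.
import numpy as np

def is_refinement(fine_labels, coarse_labels):
    fine = np.asarray(fine_labels)
    coarse = np.asarray(coarse_labels)
    return all(len(np.unique(coarse[fine == c])) == 1 for c in np.unique(fine))

# e.g. is_refinement(labels_k_plus_1, labels_k) should be True for the optimal
# partitions extracted from the same tree.
```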

Highlights & Insights

  • Elegant theoretical result: The paper proves that a single tree can give rise to infinitely many equally meaningful trees (one optimal hierarchy per center-based objective), a mathematically appealing finding.
  • Practical utility: The computational overhead is negligible, making the method a viable post-processing step for any hierarchical clustering approach.
  • Generality: Applicable to all center-based clustering objectives, not limited to k-means.
  • New perspective: Reframes hierarchical clustering from "building one tree" to "extracting many trees from one tree."

Limitations & Future Work

  1. The quality of the initial hierarchy tree still affects the final result; a garbage-in-garbage-out problem persists.
  2. Constructing the initial hierarchy tree at scale remains a bottleneck (\(O(n^2)\) or higher).
  3. The framework applies only to center-based objectives; density-based objectives (e.g., DBSCAN) are not covered.
  4. Sensitivity to the choice of distance metric in high-dimensional settings is not thoroughly discussed.

Related Work

  • Dasgupta (2016): Definition of objective functions for hierarchical clustering.
  • Cohen-Addad et al.: Approximation algorithms for hierarchical clustering.
  • Ward's method: Classic variance-minimizing hierarchical clustering.
  • Ultrametric fitting: Theoretical framework for fitting ultrametrics.

Rating

| Dimension | Score (1–5) |
| --- | --- |
| Novelty | 4 |
| Theoretical Depth | 5 |
| Experimental Thoroughness | 4 |
| Writing Quality | 4 |
| Value | 3 |
| Overall Recommendation | 4 |