Learning-Augmented Streaming Algorithms for Correlation Clustering¶
Conference: NeurIPS 2025
arXiv: 2510.10705
Code: None
Area: Streaming Algorithms / Clustering
Keywords: Correlation Clustering, Streaming Algorithms, Learning-Augmented, Space Efficiency, Approximation Algorithms
TL;DR¶
This paper proposes the first learning-augmented streaming algorithms for Correlation Clustering. By leveraging pairwise distance predictions, the proposed methods achieve a better-than-3 approximation ratio on complete graphs (\(\tilde{O}(n)\) space) and an \(O(\log|E^-|)\) approximation ratio on general graphs (\(\tilde{O}(n)\) space), significantly improving the space–approximation tradeoff over existing prediction-free algorithms.
Background & Motivation¶
Correlation Clustering is a classical graph clustering problem:
- Input: A graph \(G=(V,E)\) where each edge is labeled as "positive" (endpoints should be in the same cluster) or "negative" (endpoints should be in different clusters)
- Objective: Partition vertices into disjoint clusters to minimize the number of disagreements: positive edges crossing clusters plus negative edges within clusters (see the counting sketch after this list)
- Streaming model: Edges arrive in arbitrary order, and the algorithm is restricted to limited memory
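To make the objective concrete, here is a minimal counting sketch (the toy instance is made up for illustration, not taken from the paper):

```python
def count_disagreements(cluster_of, positive_edges, negative_edges):
    """Correlation clustering cost: positive edges cut by the clustering
    plus negative edges kept inside a cluster."""
    cost = 0
    for u, v in positive_edges:                 # "should be together" edges
        if cluster_of[u] != cluster_of[v]:
            cost += 1
    for u, v in negative_edges:                 # "should be apart" edges
        if cluster_of[u] == cluster_of[v]:
            cost += 1
    return cost

# Toy instance: clusters {0, 1} and {2, 3}
cluster_of = {0: "A", 1: "A", 2: "B", 3: "B"}
positive = [(0, 1), (1, 2)]   # (1, 2) crosses clusters  -> 1 disagreement
negative = [(0, 2), (2, 3)]   # (2, 3) stays in a cluster -> 1 disagreement
print(count_disagreements(cluster_of, positive, negative))  # 2
```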
Traditional streaming algorithms face the challenge of the space–approximation tradeoff:
| Algorithm | Graph Type | Approximation Ratio | Space |
|---|---|---|---|
| Cambus et al. (SODA'24) | Complete graph | 3 | \(\tilde{O}(n)\) |
| Ahn et al. (ICML'15) | General graph | \(O(\log n)\) | \(\tilde{O}(n \cdot \log n)\) |
Core Problem: Can machine learning predictions be leveraged to simultaneously improve both the approximation ratio and space efficiency?
The learning-augmented algorithms paradigm augments classical algorithms with (possibly inaccurate) predictions, achieving superior performance when predictions are accurate and gracefully degrading to classical guarantees when predictions are poor.
Method¶
Overall Architecture¶
The algorithms use pairwise distance predictions provided by a predictor as auxiliary information:
- The predictor estimates, for each pair \((u,v)\), their "distance" in the optimal clustering (i.e., whether they belong to the same cluster)
- An initial clustering structure is constructed based on these predictions (a minimal sketch follows this list)
- During stream processing, actual edge information is used to correct prediction errors
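A minimal sketch of the prediction-to-initial-clustering step, assuming the predictor is exposed as a boolean pairwise oracle `predict_same_cluster(u, v)`; the oracle name, the union-find construction, and the all-pairs loop are illustrative assumptions, not the paper's construction:

```python
class UnionFind:
    """Simple union-find used to merge pairs the predictor believes are co-clustered."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def initial_clustering(n, predict_same_cluster):
    """Build a clustering from predictions alone, before any edge of the stream is read."""
    uf = UnionFind(n)
    for u in range(n):
        for v in range(u + 1, n):
            if predict_same_cluster(u, v):  # e.g., predicted distance below a threshold
                uf.union(u, v)
    return {v: uf.find(v) for v in range(n)}

# Toy usage with a made-up predictor that groups vertices by parity:
labels = initial_clustering(6, lambda u, v: (u % 2) == (v % 2))
```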
Key Designs¶
Complete Graph Algorithm¶
Design Motivation:
- Preprocessing phase: Pre-group vertices according to predicted distances
- Stream processing phase: Update groupings upon reading each edge, using its label (positive/negative); a hypothetical illustration of such a correction pass follows this list
- Key technique: Builds on the framework of Cambus et al. (SODA'24), incorporating predictions to relax sampling requirements
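One hypothetical way the stream could be used to correct prediction errors; this is only an illustration of the predict-then-correct idea using \(O(n)\) counters, not the paper's actual update rule, which builds on the Cambus et al. framework:

```python
from collections import defaultdict

def correct_with_stream(cluster_of, edge_stream):
    """Hypothetical single-pass correction: for each vertex, count how many
    labeled edges agree vs. disagree with its predicted placement, and move
    vertices with mostly-contradicting evidence into singleton clusters."""
    support = defaultdict(int)   # edges consistent with the predicted clustering
    against = defaultdict(int)   # edges inconsistent with the predicted clustering
    for u, v, label in edge_stream:                  # label is '+' or '-'
        same = cluster_of[u] == cluster_of[v]
        consistent = (label == '+') == same
        for x in (u, v):
            if consistent:
                support[x] += 1
            else:
                against[x] += 1
    corrected = dict(cluster_of)
    for x in cluster_of:
        if against[x] > support[x]:
            corrected[x] = ('singleton', x)          # fresh, unique cluster id
    return corrected
```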
Theoretical Guarantees:
- When prediction quality is high: approximation ratio better than 3 (breaking the prediction-free lower bound)
- Space complexity: \(\tilde{O}(n)\)
- When predictions are entirely uninformative: degrades to a 3-approximation
Prediction Quality Measure:
The prediction error \(\eta\) is defined as the degree of inconsistency between predicted distances and the true optimal distances; the approximation ratio varies smoothly with \(\eta\).
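The paper's precise definition of \(\eta\) is not reproduced here; one natural formalization, stated purely as an assumption, counts the vertex pairs on which the (thresholded) predicted distance disagrees with an optimal clustering \(\mathcal{C}^*\):

\[
\eta \;=\; \Bigl|\bigl\{\{u,v\} \subseteq V \;:\; \mathbb{1}\bigl[\hat{d}(u,v) \le \tau\bigr] \ne \mathbb{1}\bigl[u \sim_{\mathcal{C}^*} v\bigr]\bigr\}\Bigr|,
\]

where \(\hat{d}\) is the predicted distance, \(\tau\) is a threshold, and \(u \sim_{\mathcal{C}^*} v\) means \(u\) and \(v\) share a cluster in \(\mathcal{C}^*\).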
General Graph Algorithm¶
Design Motivation:
- Builds on the linear sketching method of Ahn et al. (ICML'15)
- Uses predicted distances to reduce the amount of edge information that must be stored (see the sketch after this list)
- Achieves a logarithmic approximation with respect to the negative edge set \(E^-\)
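A sketch of the space-saving intuition; buffering only the edges whose label contradicts the prediction is an illustrative guess at why predictions reduce stored edge information, not the paper's actual linear-sketch construction:

```python
def stream_filter(edge_stream, predict_same_cluster):
    """Keep only the 'surprising' edges, i.e., those whose observed label
    disagrees with the prediction; with mostly-correct predictions this
    buffer stays far smaller than the full edge set."""
    surprises = []
    for u, v, label in edge_stream:            # label is '+' or '-'
        predicted_same = predict_same_cluster(u, v)
        observed_same = (label == '+')
        if predicted_same != observed_same:    # label contradicts the prediction
            surprises.append((u, v, label))
    return surprises
```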
Theoretical Guarantees:
- Approximation ratio: \(O(\log \lvert E^-\rvert)\) (when prediction quality is high)
- Space complexity: \(\tilde{O}(n)\)
- Improves the space efficiency of known prediction-free algorithms
Comparison with Prior Results:
| Algorithm | Approximation Ratio | Space | Notes |
|---|---|---|---|
| Ahn et al. (no learning) | \(O(\log n)\) | \(\tilde{O}(n \log n)\) | Baseline |
| Ours (learning-augmented) | \(O(\log \lvert E^-\rvert)\) | \(\tilde{O}(n)\) | This paper |
Note that \(|E^-| \leq \binom{n}{2}\), so \(\log|E^-| \leq 2\log n\); in practice, \(|E^-|\) is typically much smaller than \(n^2\).
Loss & Training¶
This paper does not address predictor training; instead, it analyzes algorithm performance given a fixed predictor. The predictor can be:
- A classifier trained on historical data
- A heuristic based on graph features (a toy example is sketched below)
- Any method capable of providing pairwise distance estimates
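As one entirely hypothetical instantiation (nothing here is prescribed by the paper), a heuristic predictor could estimate pairwise distance from side information such as a historical snapshot of the graph, for example via Jaccard similarity of neighborhoods:

```python
def make_neighborhood_predictor(adjacency, threshold=0.3):
    """Toy heuristic predictor: pairs with highly overlapping neighborhoods in
    a historical snapshot are predicted to be close (same cluster).
    `adjacency` maps each vertex to its set of neighbors in the side data."""
    def predicted_distance(u, v):
        a, b = adjacency.get(u, set()), adjacency.get(v, set())
        union = a | b
        jaccard = len(a & b) / len(union) if union else 0.0
        return 1.0 - jaccard                   # small distance = likely same cluster

    def predict_same_cluster(u, v):
        return predicted_distance(u, v) <= 1.0 - threshold

    return predicted_distance, predict_same_cluster
```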
Key Experimental Results¶
Main Results¶
Experiments are conducted on both synthetic and real-world datasets.
Synthetic Data (Complete Graph):
| Prediction Error \(\eta\) | Ours | Cambus et al. (no learning) | Gain |
|---|---|---|---|
| 0% | Optimal | 3× | Significant |
| 10% | <3× | 3× | Notable |
| 30% | ≈3× | 3× | On par |
| 50%+ | 3× | 3× | None (degrades to baseline) |
Per-Dataset Comparison:
| Dataset | Ours | Non-learning Baseline | Space Usage |
|---|---|---|---|
| Synthetic random graph | Better than baseline | Baseline | \(\tilde{O}(n)\) |
| Real social network | Significantly better | Baseline | \(\tilde{O}(n)\) |
Ablation Study¶
Effect of Prediction Quality on Performance:
| Component | Good Prediction | Moderate Prediction | Poor Prediction |
|---|---|---|---|
| Complete graph approx. ratio | <3 | ≈3 | =3 |
| General graph approx. ratio | \(O(\log \lvert E^-\rvert)\) | — | — |
| Space usage | \(\tilde{O}(n)\) | \(\tilde{O}(n)\) | \(\tilde{O}(n)\) |
Key observation: The improvement in space efficiency is unconditional—the \(\tilde{O}(n)\) space bound is maintained regardless of prediction quality.
Effect of Edge Arrival Order:
| Edge Arrival Order | Performance Impact |
|---|---|
| Random order | Optimal |
| Adversarial order | Slight degradation, but still better than non-learning methods |
| Natural order | Close to random |
Key Findings¶
- The learning-augmented approach is the first to break the 3-approximation barrier for streaming correlation clustering on complete graphs, demonstrating that prediction information can provide substantial benefit in the streaming setting.
- The space improvement from \(\tilde{O}(n \log n)\) to \(\tilde{O}(n)\) is unconditional and does not depend on prediction quality.
- On real-world datasets, the learning-augmented algorithm consistently outperforms its prediction-free counterpart even when predictions are imperfect.
- The key algorithmic design insight is to use predictions to reduce the edge information that must be stored, rather than directly substituting predictions for algorithmic decisions.
Highlights & Insights¶
- First application of learning-augmentation to streaming correlation clustering: Opens a new research direction.
- Elegant degradation guarantee: Naturally degrades to the best known prediction-free guarantee when predictions are poor, without requiring knowledge of prediction quality.
- Unconditional space improvement: The space efficiency gain is independent of prediction quality, which is uncommon among learning-augmented algorithms.
- Integration of theory and practice: Combines rigorous theoretical analysis with empirical validation on real-world data.
Limitations & Future Work¶
- Predictor design left unspecified: The paper does not discuss how to construct effective distance predictors.
- Complete graph assumption: Assuming a complete graph may be unrealistic for large-scale data.
- No closed-form expression for approximation ratio: The concrete better-than-3 approximation ratio on complete graphs depends on prediction quality without a closed-form characterization.
- Single-pass restriction: The current algorithms operate in a single pass; achieving stronger guarantees may require multi-pass extensions.
- Potential direction: Investigating learning-augmented clustering in the semi-streaming setting (\(O(n \text{ polylog } n)\) space).
Related Work & Insights¶
- Cambus et al. (SODA'24): Foundational framework for the 3-approximation streaming algorithm on complete graphs.
- Ahn et al. (ICML'15): Linear sketch-based correlation clustering on general graphs.
- Learning-augmented online algorithms: Pioneering work by Lykouris & Vassilvitskii; this paper extends the paradigm to the streaming computation model.
- Other correlation clustering algorithms: The offline setting admits substantially stronger approximation guarantees; the constraints of the streaming setting make learning augmentation particularly valuable.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First learning-augmented streaming algorithm for correlation clustering
- Technical Depth: ⭐⭐⭐⭐ — Rigorous theoretical analysis with meaningful improvements in the space–approximation tradeoff
- Experimental Thoroughness: ⭐⭐⭐⭐ — Synthetic and real-world data with multi-dimensional ablation
- Practicality: ⭐⭐⭐ — Potential applications in large-scale clustering scenarios
- Overall: ⭐⭐⭐⭐