KWIKBUCKS: CORRELATION CLUSTERING WITH CHEAP-WEAK AND EXPENSIVE-STRONG SIGNALS

Abstract

The unprecedented rate at which machine learning (ML) models are growing in size necessitates novel approaches to enable efficient and scalable solutions. We contribute to this line of work by studying a novel version of the Budgeted Correlation Clustering (BCC) problem where, along with a limited number of queries to an expensive oracle for node similarities (e.g., a large ML model), we have unlimited access to a cheaper but less accurate second oracle. Our formulation is inspired by many practical scenarios where coarse approximations of the expensive similarity metric can be efficiently obtained via weaker models. We develop a theoretically motivated algorithm that leverages the cheap oracle to judiciously query the strong oracle while maintaining high clustering quality. We empirically demonstrate gains in query minimization and clustering metrics on a variety of datasets with diverse strong and cheap oracles. Most notably, we demonstrate a practical application in text clustering based on expensive cross-attention language models by showing that cheaper (but weaker) embedding-based models can be leveraged to substantially reduce the number of inference calls to the former.

1. INTRODUCTION

Modern ML techniques have made incredible advances at the cost of requiring resource-intensive models (Sharir et al., 2020). Many recent approaches are so resource-intensive that, despite their impressive accuracy, they serve only as proofs of concept and are infeasible to scale as-is in practical usage. The aggregate effect of all such deployments on energy usage is also a major sustainability concern (Wu et al., 2022). While the high performance of these models motivates incorporating their signal, their high inference cost limits the interactions that any practical algorithm can have with them. In particular, as the cost of querying ML models has increased, so has the cost of obtaining similarities between objects of different types (texts, images, etc.). In this paper, we aim to answer a challenging question that arises when working with such costly similarity models: how can we group similar objects together when object similarities are obtained via expensive queries? This problem can be naturally cast in a popular and versatile clustering framework, Correlation Clustering (CC), which has been extensively studied over the past 15+ years (Bonchi et al., 2022): given similarities between arbitrary objects represented as a graph, CC minimizes a natural objective that attempts to cluster together similar vertices while simultaneously separating dissimilar ones. The high cost of querying large ML models motivates the Budgeted CC (BCC) setting studied in (Bressan et al., 2019; García-Soriano et al., 2020b), where relationships between nodes are determined by making a limited number of queries to an oracle, e.g., a large ML model. We posit that in many practical settings, coarse but efficient approximations of an expensive model can be obtained through substantially cheaper but weaker models. These weaker models can be used as a guide to spend the query budget for the expensive model more carefully.
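For concreteness, the standard disagreement-minimization form of the CC objective on a complete signed graph G = (V, E⁺ ∪ E⁻) can be written as follows (this is the formulation common in the literature; the notation here is ours, not taken from a specific definition in this paper):

```latex
\min_{\mathcal{C}} \;\; \sum_{(u,v)\in E^{+}} \mathbf{1}\!\left[\mathcal{C}(u)\neq\mathcal{C}(v)\right]
\;+\; \sum_{(u,v)\in E^{-}} \mathbf{1}\!\left[\mathcal{C}(u)=\mathcal{C}(v)\right]
```

where C(v) denotes the cluster containing v. The first sum counts similar pairs that are separated, and the second counts dissimilar pairs that are placed in the same cluster; a budgeted algorithm must approximately minimize this objective while observing only a limited number of edge signs.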
A motivating example, which heavily inspires our work, is in text clustering where one wishes to obtain similarity signals from the latest highly-accurate cross-attention (CA) language models (e.g., (Brown et al., 2020; Thoppilan et al., 2022) ), but may be hindered by the computational burden as obtaining each pair-wise similarity between data points requires an inference call to the model, giving rise to a worse case O(n 2 ) inference calls, where n is the number of data points. Embedding based models (e.g., (Mikolov et al., 2013; Devlin et al., 2018) can come to the rescue as they require only O(n) inference calls to obtain embedding vectors for each data point that can then be used for fast similarity computation. While embedding models typically produce substantially lower quality similarity signals than CA models (see, e.g., Menon et al. ( 2022)), they can still provide a good approximation to guide where the budget for the CA model should be spent. Inspired by the above, we introduce a variant of BCC where, along with a limited number of queries to an expensive oracle, we also have unlimited access to a cheaper but less accurate second oracle. This variant of BCC bridges algorithm design and practical considerations. Indeed, a recent book (Bonchi et al., 2022) on CC states "A further intriguing question is whether one can devise other graph-querying models that allow for improved theoretical results while being reasonable from a practical viewpoint." This is exactly the gap our work fills through the introduction of a queryefficient setting with access to two oracles with differing quality and cost. We develop an algorithm dubbed KwikBucks that extends the well-known KwikCluster algorithm to budgeted CC with cheap-weak and expensive-strong signals. KwikBucks uses the weak signal as a guide to minimize the number of calls to the strong signal. 
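As background, the classic KwikCluster algorithm (the pivot algorithm of Ailon et al., 2008, which KwikBucks extends) can be sketched in a few lines. This is a minimal illustrative implementation, assuming full access to all positive edges; function and variable names are ours:

```python
import random

def kwikcluster(nodes, positive_neighbors):
    """Pivot-based correlation clustering (KwikCluster).

    Repeatedly pick a uniformly random pivot from the unclustered nodes
    and form a cluster from the pivot plus all of its still-unclustered
    positive neighbors. On complete signed graphs this yields an expected
    3-approximation to the CC objective.

    positive_neighbors: dict mapping each node to the set of nodes it
    shares a positive ("similar") edge with.
    """
    remaining = set(nodes)
    order = list(nodes)
    random.shuffle(order)  # random pivot order drives the 3-approximation
    clusters = []
    for pivot in order:
        if pivot not in remaining:
            continue  # already absorbed into an earlier pivot's cluster
        cluster = {pivot} | (positive_neighbors[pivot] & remaining)
        remaining -= cluster
        clusters.append(cluster)
    return clusters
```

Note that a naive run of this algorithm needs the sign of every pair incident to each pivot, which is exactly the O(n^2)-query cost the budgeted setting seeks to avoid.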
Under the assumption that the weak signal returns a strict superset of the strong-signal edges, our algorithm approximately matches the performance of KwikCluster, i.e., a 3-approximation, using only an exceedingly small fraction of all possible queries to the expensive model (Theorem 2.1). In our experiments, we strengthen our theoretical modeling with several well-motivated optimizations and demonstrate that KwikBucks produces high-quality clusterings with only a small number of queries to the expensive oracle, even when there is only a weak correlation between the weak and strong signals. We conduct extensive experiments on multiple well-studied datasets to evaluate the performance of KwikBucks against natural extensions of previous algorithms for closely related problems. In all settings, KwikBucks recovers the best clustering solution with a much smaller strong-signal budget than the alternatives, and in many cases it finds asymptotically better solutions as well. Our algorithm is also robust to the choice of weak-signal oracle across different dataset settings and obtains significant improvements over five baselines: a 64% relative improvement in clustering quality (measured in terms of F1 score) when averaging over 9 datasets, and a more than 3.5x reduction in query complexity compared to the best baseline.
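The core two-oracle idea described above can be sketched as follows: at each pivot, the cheap weak signal proposes candidate neighbors, and the expensive strong oracle is consulted only on those candidates, up to the budget. This is an illustrative simplification under the superset assumption, not the paper's exact KwikBucks algorithm (which adds further optimizations); all names here are hypothetical:

```python
import random

def two_oracle_pivot_sketch(nodes, weak_candidates, strong_oracle, budget):
    """Pivot clustering that spends the strong-oracle budget only on pairs
    the weak signal flags as candidates.

    weak_candidates: dict mapping each node to the weak signal's candidate
        neighbor set (assumed to contain all true strong-signal neighbors).
    strong_oracle:  callable (u, v) -> bool, the expensive similarity query.
    budget:         maximum number of strong-oracle calls allowed.
    """
    remaining = set(nodes)
    order = list(nodes)
    random.shuffle(order)
    clusters, queries = [], 0
    for pivot in order:
        if pivot not in remaining:
            continue
        remaining.discard(pivot)
        cluster = {pivot}
        # Only weak-signal candidates are ever verified with the strong oracle,
        # so non-candidate pairs cost nothing from the budget.
        for v in weak_candidates[pivot] & remaining:
            if queries >= budget:
                break
            queries += 1
            if strong_oracle(pivot, v):
                cluster.add(v)
                remaining.discard(v)
        clusters.append(cluster)
    return clusters
```

When the weak candidate sets are sparse, the number of strong-oracle calls is roughly the number of weak edges rather than all O(n^2) pairs, which is the source of the query savings.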
Our contributions can be summarized as follows:
• Introducing a novel formulation of the BCC problem that strengthens the connection between theory and practice through the interplay between algorithm design and modern ML, where a cheap-weak signal guides the queries made to the expensive-strong signal.
• Developing an algorithm for this setting with strong theoretical motivations and guarantees.
• Identifying a highly impactful application domain (text clustering) where the introduced formulation and the developed algorithm are effectively applicable.
• Presenting a comprehensive empirical analysis showing large gains over extensions of existing algorithms for closely related problems.

1.1. RELATED WORK

Our paper spans correlation clustering, clustering with budget constraints, and learning from multiple annotators. For brevity, we focus on the most closely related key works in these areas. Correlation clustering is one of the most well-studied graph clustering problems and has been actively researched over the past 15+ years (see the book by Bonchi et al., 2022). It has numerous applications in ML and beyond, including spam detection (Ramachandran et al., 2007; Bonchi et al., 2014), social network analysis (Bonchi et al., 2015; Tang et al., 2016), entity resolution (Getoor & Machanavajjhala, 2012), and many others (Gionis et al., 2005; Hassanzadeh et al., 2009; Cohen & Richman, 2002; Kim et al., 2011). Bansal et al. (2004) introduced the problem and gave the first constant-factor approximation for complete graphs (see Def. 1.1). Variants include incomplete signed graphs (Bansal et al., 2004; Ailon et al., 2008), where the problem is APX-hard (Demaine et al., 2006), and weighted graphs (Charikar et al., 2005), where it is Unique-Games hard (Chawla et al., 2006).

