ATTENTION-BASED CLUSTERING: LEARNING A KERNEL FROM CONTEXT

Anonymous authors
Paper under double-blind review

Abstract

In machine learning, no data point stands alone. We believe that context is an underappreciated concept in many machine learning methods. We propose Attention-Based Clustering (ABC), a neural architecture based on the attention mechanism, designed to learn latent representations that adapt to context within an input set, and inherently agnostic to input size and to the number of clusters. By learning a similarity kernel, our method combines directly with any out-of-the-box kernel-based clustering approach. We present competitive results for clustering Omniglot characters and include analytical evidence of the effectiveness of an attention-based approach for clustering.

1. INTRODUCTION

Many problems in machine learning involve modelling the relations between elements of a set. A notable example, and the focus of this paper, is clustering, in which the elements are grouped according to some shared properties. A common approach uses kernel methods: a class of algorithms that operate on pairwise similarities, obtained by evaluating a specific kernel function (Filippone et al., 2008). However, for data points that are not trivially comparable, specifying the kernel function is not straightforward. With the advent of deep learning, this gave rise to metric learning frameworks in which a parameterized binary operator, either explicit or implicit, is trained from examples to measure the distance between two data points (Bromley et al., 1993; Koch et al., 2015; Zagoruyko & Komodakis, 2015; Hsu et al., 2018; Wojke & Bewley, 2018; Hsu et al., 2019). These approaches operate on the assumption that there exists a global metric, that is, that the distance between two points depends solely on the two operands. This assumption disregards situations where the underlying metric is contextual, by which we mean that the distance between two data points may depend on some structure of the entire dataset. We hypothesize that the context provided by a set of data points can be helpful in measuring the distance between any two data points in the set.

As an example of where context might help, consider the task of clustering characters by the language they belong to. Some languages, like Latin and Greek, share certain characters, for example the Latin T and the Greek upper case τ.[1] However, given two sentences, one from the Aeneid and one from the Odyssey, we should have little trouble clustering such a shared character correctly in both languages due to the context, even when ignoring any structure or meaning derived from the sentences themselves.
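The "global metric" assumption can be made concrete with a small sketch: a standard RBF kernel produces a similarity matrix in which each entry depends only on its two operands, regardless of the rest of the set. The function name and parameters below are illustrative, not part of the paper.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Pairwise RBF similarities over a set of points X (n x d).

    Each entry K[i, j] = exp(-gamma * ||x_i - x_j||^2) is a function of
    x_i and x_j alone -- the "global metric" assumption: the rest of
    the set has no influence on the similarity of a pair.
    """
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
K = rbf_kernel_matrix(X)
# K is symmetric with unit diagonal; nearby points score higher.
```

Such a matrix can be handed directly to any kernel-based clustering routine; the point of the discussion above is that no choice of this fixed pairwise function can capture structure that only emerges from the whole set.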
Indeed, a human performing this task need not rely on prior knowledge of the stories of Aeneas or Odysseus, nor on literacy in Latin or Ancient Greek. As a larger principle, it is well recognized that humans perceive emergent properties in configurations of objects, as documented in the Gestalt Laws of Perceptual Organization (Palmer, 1999, Chapter 2).

We introduce Attention-Based Clustering (ABC), which uses context to output pairwise similarities between the data points in the input set. Like other approaches in the literature (Hsu et al., 2018; 2019; Han et al., 2019; Lee et al., 2019b), our model is trained with ground-truth labels in the form of pairwise constraints, but in contrast to other methods, ours can be combined with an unsupervised clustering method to obtain cluster labels.

To demonstrate the benefit of using ABC over pairwise metric learning methods, we propose a clustering problem that requires the use of properties emerging from the entire input set in order to be solved. The task is to cluster a set of points that lie on a number of intersecting circles, a generalization of the Olympic circles problem (Anand et al., 2014). Pairwise kernel methods for clustering perform poorly on the circles problem, whereas ABC handles it with ease, as displayed in Figure 1. We use the circles dataset for an ablation study in Section 6.1.

In recent years, numerous deep neural network architectures have been proposed for clustering (Xie et al., 2016; Min et al., 2018). The idea of using more than pairwise interactions between elements of an input set in order to improve clustering has been pursued recently in Lee et al. (2019a; b), and is motivated by the problem of amortized clustering (Gershman & Goodman, 2014; Stuhlmüller et al., 2013). Our architecture is inspired by the Transformer (Vaswani et al., 2017), which was adapted by Lee et al. (2019a) as the Set Transformer to improve clustering (Lee et al., 2019b).
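As a rough illustration of the kind of data involved, the following hypothetical generator samples noisy points from a set of intersecting circles. The paper does not specify the dataset's exact parameters (number of circles, radii, noise level), so the values below are assumptions chosen only to make the sketch runnable.

```python
import numpy as np

def sample_circles(n_per_circle=100, centers=((0.0, 0.0), (1.5, 0.0)),
                   radius=1.0, noise=0.02, seed=0):
    """Sample noisy 2-D points lying on intersecting circles.

    Returns an array of points and the index of the circle each point
    was drawn from (the ground-truth cluster label). With the default
    centers, the two unit circles overlap, so no pairwise distance
    threshold can separate the clusters -- membership is a property
    of the whole configuration.
    """
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for k, (cx, cy) in enumerate(centers):
        theta = rng.uniform(0.0, 2.0 * np.pi, n_per_circle)
        r = radius + rng.normal(0.0, noise, n_per_circle)
        points.append(np.stack([cx + r * np.cos(theta),
                                cy + r * np.sin(theta)], axis=1))
        labels.append(np.full(n_per_circle, k))
    return np.concatenate(points), np.concatenate(labels)
```

Near the intersection regions, points from different circles are arbitrarily close, which is exactly why a purely pairwise similarity struggles on this task.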
We inherit its benefits, such as being equivariant under permutations and agnostic to input size. However, our approach is motivated by the use of context to improve metric learning, giving us a model that is moreover agnostic to the number of clusters, in the sense that neither a prediction of nor a bound on the number of clusters needs to be specified in the architecture definition. We also provide theoretical evidence that the Transformer architecture is effective for metric learning and clustering, and to our knowledge, we are the first to do so.

The idea of using deep metric learning to improve clustering has been pursued in Koch et al. (2015); Zagoruyko & Komodakis (2015); Hsu et al. (2018; 2019); Han et al. (2019), but without considering the use of context. We use ground-truth labels, only in the form of pairwise constraints, to train a similarity kernel, making our approach an example of constrained clustering. Such algorithms are often categorized by whether they use the constraints only to learn a metric or also to generate cluster labels (Hsu et al., 2018). Our architecture belongs to the former category: we use the constraints only to learn a metric and rely on an unconstrained clustering process to obtain cluster labels. Despite this, we achieve nearly state-of-the-art clustering results on the Omniglot, embedded ImageNet, and CIFAR-100 datasets, comparable to sophisticated methods that synthesize clusters, either using the constraints (Hsu et al., 2018; 2019; Han et al., 2019) or otherwise (Lee et al., 2019a; b).

Our main contributions are:

• ABC incorporates context in a general and flexible manner to improve metric learning for clustering. Our competitive results on Omniglot, embedded ImageNet, and CIFAR-100, as well as our ablation study on our circles dataset, provide support for the use of context in metric learning algorithms.
• We provide theoretical evidence of why the self-attention module in the Transformer architecture is well suited for clustering, justifying its effectiveness for this task.
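To give a flavour of how attention can induce a contextual similarity kernel, the following sketch derives pairwise similarities from scaled dot-product (multiplicative) attention over the whole set. This is an illustration of the general idea only, not the paper's exact ABC architecture; the projection matrices W_q and W_k stand in for learned parameters.

```python
import numpy as np

def self_attention_similarity(X, W_q, W_k):
    """Derive a symmetric similarity kernel from scaled dot-product
    attention over a set X (n x d).

    Because the softmax normalizes each row over ALL points in the
    input, the resulting similarity S[i, j] depends on the entire set,
    not only on x_i and x_j -- making the kernel contextual rather
    than a fixed pairwise function.
    """
    Q, K = X @ W_q, X @ W_k
    logits = (Q @ K.T) / np.sqrt(K.shape[1])
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)             # row-stochastic attention
    return 0.5 * (A + A.T)                        # symmetrize into a kernel
```

A matrix of this form can be passed to any out-of-the-box kernel-based clustering method, e.g. spectral clustering with a precomputed affinity, which is the sense in which the abstract speaks of learning a kernel.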



[1] To the extent that there is not even a LaTeX command \Tau.



Figure 1: Illustration of the output of different clustering methods for points sampled from four overlapping circles. (A) ABC with additive attention. (B) ABC with multiplicative attention. (C) Pairwise similarity with additive attention; pairwise similarity with multiplicative attention performed similarly. (D) Out-of-the-box spectral clustering. Only D was given the true number of clusters. (Best viewed in colour.)

