ATTENTION-BASED CLUSTERING: LEARNING A KERNEL FROM CONTEXT

Anonymous authors
Paper under double-blind review

Abstract

In machine learning, no data point stands alone. We believe that context is an underappreciated concept in many machine learning methods. We propose Attention-Based Clustering (ABC), a neural architecture based on the attention mechanism, designed to learn latent representations that adapt to context within an input set, and inherently agnostic to both input size and the number of clusters. By learning a similarity kernel, our method combines directly with any out-of-the-box kernel-based clustering approach. We present competitive results on clustering Omniglot characters and provide analytical evidence of the effectiveness of an attention-based approach to clustering.

1. INTRODUCTION

Many problems in machine learning involve modelling the relations between elements of a set. A notable example, and the focus of this paper, is clustering, in which the elements are grouped according to shared properties. A common approach uses kernel methods: a class of algorithms that operate on pairwise similarities, obtained by evaluating a specific kernel function (Filippone et al., 2008). However, for data points that are not trivially comparable, specifying the kernel function is not straightforward. With the advent of deep learning, this gave rise to metric learning frameworks, in which a parameterized binary operator is taught from examples, either explicitly or implicitly, how to measure the distance between two data points (Bromley et al., 1993; Koch et al., 2015; Zagoruyko & Komodakis, 2015; Hsu et al., 2018; Wojke & Bewley, 2018; Hsu et al., 2019). These frameworks operate on the assumption that there exists a global metric, that is, that the distance between two points depends solely on the two operands. This assumption disregards situations where the underlying metric is contextual, by which we mean that the distance between two data points may depend on some structure of the entire dataset. We hypothesize that the context provided by a set of data points can be helpful in measuring the distance between any two data points in the set.

As an example of where context might help, consider the task of clustering characters by the language they belong to. Some languages, like Latin and Greek, share certain characters, for example the Latin T and the Greek upper-case Tau (Τ).¹ However, given two sentences, one from the Aeneid and one from the Odyssey, we should have less trouble clustering the same character in both languages correctly due to the context, even when ignoring any structure or meaning derived from the sentences themselves.
Indeed, a human performing this task need not rely on prior knowledge of the stories of Aeneas or Odysseus, nor on literacy in Latin or Ancient Greek. As a larger principle, it is well recognized that humans perceive emergent properties in configurations of objects, as documented in the Gestalt Laws of Perceptual Organization (Palmer, 1999, Chapter 2).

We introduce Attention-Based Clustering (ABC), which uses context to output pairwise similarities between the data points in the input set. Like other approaches in the literature (Hsu et al., 2018; 2019; Han et al., 2019; Lee et al., 2019b), our model is trained with ground-truth labels in the form of pairwise constraints, but in contrast to other methods, ours can be combined with an unsupervised clustering method to obtain cluster labels. To demonstrate the benefit of using ABC over pairwise
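The pipeline described above, a learned pairwise similarity matrix handed to an out-of-the-box kernel-based clustering method, can be sketched as follows. This is an illustrative sketch, not the paper's implementation: in place of ABC's learned output, the similarity matrix `S` is produced by a simple RBF kernel over synthetic two-blob data, and scikit-learn's `SpectralClustering` with a precomputed affinity recovers the cluster labels.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)

# Stand-in for ABC's output: a symmetric pairwise similarity matrix
# over the input set. Here we fabricate it with an RBF kernel on two
# well-separated Gaussian blobs of 10 points each.
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)),
               rng.normal(3.0, 0.3, (10, 2))])
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
S = np.exp(-d2)  # entries in (0, 1], S[i, j] = similarity of points i, j

# Any kernel-based clustering method can consume S directly;
# spectral clustering accepts it as a precomputed affinity matrix.
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(S)
```

Because the clustering step only sees `S`, swapping the hand-specified RBF kernel for a learned, context-dependent similarity leaves the rest of the pipeline unchanged, which is the sense in which ABC "directly combines" with existing kernel-based methods.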



¹ To the extent that there is not even a LaTeX command \Tau.

