DHOG: DEEP HIERARCHICAL OBJECT GROUPING

Abstract

Unsupervised learning of categorical representations using data augmentations appears to be a promising approach and has proven useful for finding suitable representations for downstream tasks. However, current state-of-the-art methods require preprocessing (e.g. Sobel edge detection) to work. We introduce a mutual information minimisation strategy for unsupervised learning from augmentations that prevents learning from locking on to easy-to-find, yet unimportant, representations at the expense of more informative ones requiring more complex processing. We show that this process learns representations which capture higher mutual information between augmentations, and that these representations are better suited to the downstream exemplar task of clustering. We obtain substantial accuracy improvements on CIFAR-10, CIFAR-100-20, and SVHN.

1. INTRODUCTION

It is very expensive to label a dataset with respect to a particular task. Consider the alternative where a user, instead of labelling a dataset, specifies a simple set of class-preserving transformations, or 'augmentations'. For example, lighting changes will not change a dog into a cat. Is it possible to learn a model that produces a useful representation by leveraging a set of such augmentations? This representation would need to capture salient information about the data and enable downstream tasks to be carried out efficiently. If the representation is a discrete labelling which groups the dataset into clusters, an obvious choice of downstream task is unsupervised clustering. Ideally the clusters should match those of a direct labelling, without ever having been learnt from explicitly labelled data. Using data augmentations to drive unsupervised representation learning for images has been explored by a number of authors (Dosovitskiy et al., 2014; 2015; Bachman et al., 2019; Chang et al., 2017; Wu et al., 2019; Ji et al., 2019; Cubuk et al., 2019). These approaches typically involve learning neural networks that map augmentations of the same image to similar representations, which is reasonable since invariance to many common augmentations often aligns with the invariances we would require. A number of earlier works target maximising mutual information (MI) between augmentations (van den Oord et al., 2018; Hjelm et al., 2019; Wu et al., 2019; Ji et al., 2019; Bachman et al., 2019). Targeting high MI between representations computed from distinct augmentations enables learning representations that capture the invariances induced by the augmentations. We are interested in a particularly parsimonious representation: a discrete labelling of the data. This labelling can be seen as a clustering procedure (Ji et al., 2019), where MI can be computed and assessed directly from the learned labelling, as opposed to via an auxiliary network trained post hoc.
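As a minimal sketch of how MI can be computed directly from a discrete labelling (in the spirit of Ji et al., 2019, though not the exact objective of any cited method), the MI between the cluster assignments of two augmented views can be estimated from the soft assignments produced by a shared network. The function name and arguments below are illustrative, not from the paper.

```python
import numpy as np

def mutual_information(p_a, p_b):
    """Estimate MI between cluster assignments of two augmented views.

    p_a, p_b: (n_samples, n_clusters) arrays of soft cluster probabilities
    (rows sum to 1), e.g. softmax outputs of the same network applied to
    two augmentations of each image.
    """
    # Joint distribution over cluster pairs, averaged over the batch.
    joint = p_a.T @ p_b / p_a.shape[0]
    joint = (joint + joint.T) / 2           # symmetrise: views are exchangeable
    joint = np.clip(joint, 1e-12, None)     # avoid log(0)
    marg_a = joint.sum(axis=1, keepdims=True)
    marg_b = joint.sum(axis=0, keepdims=True)
    # MI = sum_ij p(i,j) * log( p(i,j) / (p(i) p(j)) )
    return float((joint * (np.log(joint) - np.log(marg_a) - np.log(marg_b))).sum())
```

For perfectly consistent one-hot assignments spread uniformly over k clusters this estimate attains its maximum, log k; for assignments that are independent across the two views it is zero.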

1.1. SUBOPTIMAL MUTUAL INFORMATION MAXIMISATION

We argue and show that the MI objective is not maximised effectively in existing work, owing to the combination of:

1. greedy optimisation algorithms used to train neural networks, such as stochastic gradient descent (SGD), which can converge to local optima; and
2. a limited set of data augmentations, which can result in multiple local optima of the MI maximisation objective.
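The second point can be illustrated with a toy example (our own construction, not from the paper). Suppose each image carries a hypothetical 'easy' attribute (e.g. overall shade) and a 'hard' attribute (e.g. object class), and the available augmentations preserve both. Then a labelling by either attribute yields the same MI between augmented views, so the MI objective alone cannot distinguish them, and a greedy optimiser may settle on the easy one.

```python
import numpy as np

def discrete_mi(labels_a, labels_b, k):
    """MI (in nats) between two discrete labellings with k clusters each."""
    joint = np.zeros((k, k))
    for a, b in zip(labels_a, labels_b):
        joint[a, b] += 1
    joint /= joint.sum()
    joint = np.clip(joint, 1e-12, None)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    return float((joint * np.log(joint / (pa * pb))).sum())

rng = np.random.default_rng(0)
n = 1000
shade = rng.integers(0, 2, n)   # hypothetical 'easy' attribute
shape = rng.integers(0, 2, n)   # hypothetical 'hard' attribute
# Augmentations preserve both attributes, so both views see the same values:
mi_easy = discrete_mi(shade, shade, 2)
mi_hard = discrete_mi(shape, shape, 2)
```

Both labellings attain (near-)maximal MI of about log 2 between views, so they are equally good optima of the MI objective despite only one capturing the semantics we care about.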

