DHOG: DEEP HIERARCHICAL OBJECT GROUPING

Abstract

Unsupervised learning of categorical representations using data augmentations is a promising approach and has proven useful for finding suitable representations for downstream tasks. However, current state-of-the-art methods require preprocessing (e.g. Sobel edge detection) to work well. We introduce a mutual information minimisation strategy for unsupervised learning from augmentations that prevents learning from locking on to easy-to-find, yet unimportant, representations at the expense of more informative ones that require more complex processing. We demonstrate that this process learns representations which capture higher mutual information between augmentations, and that these representations are better suited to the downstream exemplar task of clustering. We obtain substantial accuracy improvements on CIFAR-10, CIFAR-100-20, and SVHN.

1. INTRODUCTION

It is very expensive to label a dataset with respect to a particular task. Consider the alternative where a user, instead of labelling a dataset, specifies a simple set of class-preserving transformations or 'augmentations'. For example, lighting changes will not change a dog into a cat. Is it possible to learn a model that produces a useful representation by leveraging a set of such augmentations? This representation would need to capture salient information about the data and enable downstream tasks to be done efficiently. If the representation were a discrete labelling which groups the dataset into clusters, an obvious choice of downstream task is unsupervised clustering. Ideally the clusters should match direct labelling, without ever having been learnt on explicitly labelled data. Using data augmentations to drive unsupervised representation learning for images has been explored by a number of authors (Dosovitskiy et al., 2014; 2015; Bachman et al., 2019; Chang et al., 2017; Wu et al., 2019; Ji et al., 2019; Cubuk et al., 2019). These approaches typically involve learning neural networks that map augmentations of the same image to similar representations, which is reasonable since the invariances induced by many common augmentations often align with those we require. A number of earlier works target maximising mutual information (MI) between augmentations (van den Oord et al., 2018; Hjelm et al., 2019; Wu et al., 2019; Ji et al., 2019; Bachman et al., 2019). Targeting high MI between representations computed from distinct augmentations enables learning representations that capture the invariances induced by the augmentations. We are interested in a particularly parsimonious representation: a discrete labelling of the data. This labelling can be seen as a clustering procedure (Ji et al., 2019), where MI can be computed and assessment can be done directly using the learned labelling, as opposed to via an auxiliary network trained post hoc.
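Concretely, the MI between discrete labellings of paired augmentations can be estimated from the batch-level joint distribution of cluster assignments, in the style of Ji et al. (2019). The following is a minimal sketch of that computation; the function names and the toy one-hot batch are illustrative, not the implementation used in this work:

```python
import numpy as np

def joint_distribution(p_a, p_b):
    """Estimate the joint label distribution P(z_a, z_b) from per-sample
    soft assignments over C clusters for two augmentations of a batch."""
    p = p_a.T @ p_b / p_a.shape[0]  # (C, C) co-occurrence, batch-averaged
    p = (p + p.T) / 2               # symmetrise: augmentation order is arbitrary
    return p / p.sum()

def mutual_information(p_joint, eps=1e-12):
    """I(Z_a; Z_b) = sum_ij P(i,j) log[ P(i,j) / (P(i) P(j)) ]."""
    pi = p_joint.sum(axis=1, keepdims=True)  # marginal over rows
    pj = p_joint.sum(axis=0, keepdims=True)  # marginal over columns
    return float(np.sum(p_joint * (np.log(p_joint + eps)
                                   - np.log(pi + eps) - np.log(pj + eps))))

# Toy batch: if the two augmentations are always assigned the same cluster,
# MI is high (it equals the entropy of the labelling); an unrelated
# labelling yields MI near zero.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=64)
p_a = np.eye(3)[labels]                        # one-hot assignments, augmentation A
p_b = np.eye(3)[labels]                        # augmentation B agrees exactly
mi_agree = mutual_information(joint_distribution(p_a, p_b))

p_c = np.eye(3)[rng.integers(0, 3, size=64)]   # unrelated labelling
mi_random = mutual_information(joint_distribution(p_a, p_c))
```

Maximising this quantity over the network parameters pushes the two augmentations of each image towards the same (confident) cluster assignment while keeping the marginal over clusters spread out.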

1.1. SUBOPTIMAL MUTUAL INFORMATION MAXIMISATION

We argue and show that the MI objective is not maximised effectively in existing work due to the combination of: (1) greedy optimisation algorithms used to train neural networks, such as stochastic gradient descent (SGD), that potentially target local optima; and (2) a limited set of data augmentations. SGD is greedy in the sense that early-found high-gradient features can dominate, so networks will tend to learn easier-to-compute, locally optimal representations (for example, ones that can be computed using fewer neural network layers) over those that depend on complex features. By way of example, in natural images, average colour is an easy-to-compute characteristic, whereas object type is not. If the augmentation strategy preserves average colour, then a reasonable mapping need only compute colour information, and high MI between learned image representations will be obtained. This result is suboptimal in the sense that a hypothetical higher-MI optimum exists that also captures semantic information, assuming the model has sufficient capacity to learn and represent it. The conceivable existence of many such local optima, coupled with greedy optimisation, presents a challenge: how can we leverage powerful image augmentation-driven MI objectives while avoiding greedily-found local optima?

Dealing with greedy solutions. Heuristic solutions, such as Sobel edge detection as a pre-processing step (Caron et al., 2018; Ji et al., 2019), have been suggested to remove or alter the features in images that may cause trivial representations to be learned. This is a symptomatic treatment and not a solution. In the work presented herein, we acknowledge that greedy SGD can get stuck in local optima of the MI maximisation objective because of limited data augmentations. Instead of trying to prevent a greedy solution, our technique lets a model learn this representation, but also requires it to learn an additional, distinct representation.
Specifically, we minimise the MI between these two representations so that the latter cannot rely on the same features. We extend this idea by adding further representations, each time requiring the latest to be distinct from all previous representations.

Downstream task: clustering. For this work, our focus is on finding higher-MI representations; we then assess the downstream capability on the ground-truth task of image classification. This means we can either (1) learn a representation that must be 'decoded' via an additional learning step, or (2) produce a discrete labelling that requires no additional learning. Clustering methods offer a direct comparison and require no labels for learning a mapping from the learned representation to class labels. Instead, labels are only required to assign groups to appropriate classes, and no learning is done using them. Our comparisons are therefore with respect to clustering methods.
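The scheme above — maximise MI across augmentations within each representation, while minimising MI between each representation and all of its predecessors — can be sketched as a single score over a set of labelling 'heads'. This is an illustrative reconstruction, not the paper's code: the weighting `lam`, the function names, and the toy groupings are assumptions:

```python
import numpy as np

def mi_from_soft(p_a, p_b, eps=1e-12):
    """MI between two soft C-way labellings of the same batch."""
    p = p_a.T @ p_b / p_a.shape[0]
    p = (p + p.T) / 2
    p /= p.sum()
    pi, pj = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    return float(np.sum(p * (np.log(p + eps) - np.log(pi + eps)
                             - np.log(pj + eps))))

def hierarchy_objective(heads_a, heads_b, lam=1.0):
    """Score a hierarchy of K labelling heads (hypothetical loss form).

    heads_a[k], heads_b[k]: (N, C) soft assignments for head k under two
    augmentations of the same batch. Each head should (1) agree with itself
    across augmentations and (2) disagree with every earlier head."""
    total = 0.0
    for k in range(len(heads_a)):
        total += mi_from_soft(heads_a[k], heads_b[k])   # maximise across augmentations
        for j in range(k):                              # minimise against earlier heads
            total -= lam * mi_from_soft(heads_a[k], heads_a[j])
    return total

# Toy check: two heads that find *different* groupings score higher than
# a model whose second head rediscovers the first (easy) grouping.
rng = np.random.default_rng(1)
g1 = np.eye(4)[rng.integers(0, 4, 128)]   # e.g. a grouping by colour
g2 = np.eye(4)[rng.integers(0, 4, 128)]   # an independent grouping
same = hierarchy_objective([g1, g1], [g1, g1])
diverse = hierarchy_objective([g1, g2], [g1, g2])
```

Under such an objective, a head that merely copies an earlier, easy-to-compute labelling pays a penalty equal to its shared information with that labelling, so later heads are pushed towards genuinely new features.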

1.2. CONTRIBUTIONS

Learning a set of representations by encouraging them to have low MI with one another, while still maximising the original augmentation-driven MI objective for each representation, is the core idea behind Deep Hierarchical Object Grouping (DHOG). We define a mechanism to produce a set of hierarchically-ordered solutions (in the sense of easy-to-hard orderings, not tree structures). DHOG is able to better maximise the original MI objective between augmentations since each representation must correspond to a unique local optimum. Our contributions are:

1. We demonstrate that current methods do not effectively maximise the MI objective^1 because greedy stochastic gradient descent (SGD) typically results in suboptimal local optima. To mitigate this problem, we introduce DHOG: a robust neural network image grouping method that learns diverse and hierarchically arranged sets of discrete image labellings (Section 3) by explicitly modelling, accounting for, and avoiding spurious local optima, requiring only simple data augmentations and no Sobel edge detection.

2. We show a marked improvement over the current state-of-the-art for standard benchmarks in end-to-end image clustering on CIFAR-10, CIFAR-100-20 (a 20-way class grouping of CIFAR-100), and SVHN; we set a new accuracy benchmark on CINIC-10; and we show the utility of our method on STL-10 (Section 4).

To be clear, DHOG still learns to map data augmentations to similar representations, as this is imperative to the learning process. The difference is that DHOG enables a number of intentionally distinct data labellings to be learned, arranged hierarchically in terms of source feature complexity.
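For reference, the clustering-accuracy evaluation used for such benchmarks — labels used only to assign learned groups to classes, never during learning — is conventionally computed with a one-to-one Hungarian matching between cluster ids and class ids. A minimal sketch, assuming SciPy is available; the helper name is ours:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred, n_classes):
    """Best-case accuracy after one-to-one matching of cluster ids to classes."""
    # Contingency table: counts[i, j] = how often cluster i coincides with class j.
    counts = np.zeros((n_classes, n_classes), dtype=int)
    for p, t in zip(y_pred, y_true):
        counts[p, t] += 1
    # Hungarian algorithm maximises total agreement (minimise negated counts).
    rows, cols = linear_sum_assignment(-counts)
    return counts[rows, cols].sum() / len(y_true)

# A labelling that is a pure permutation of the classes scores 1.0.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])   # cluster ids permuted, grouping identical
acc = clustering_accuracy(y_true, y_pred, 3)  # acc == 1.0
```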



^1 A limited set of data augmentations can result in the existence of multiple local optima of the MI maximisation objective. We show this by finding higher mutual information solutions using DHOG, rather than by any analysis of the solutions themselves.

