UNSUPERVISED VISUALIZATION OF IMAGE DATASETS USING CONTRASTIVE LEARNING

Abstract

Visualization methods based on the nearest neighbor graph, such as t-SNE or UMAP, are widely used for visualizing high-dimensional data. However, these approaches only produce meaningful results if the nearest neighbors themselves are meaningful. For images represented in pixel space this is not the case, as pixel-space distances often do not capture our sense of similarity, so pixel-space neighbors are not semantically close. This problem can be circumvented by self-supervised approaches based on contrastive learning, such as SimCLR, which rely on data augmentation to generate implicit neighbors, but these methods do not produce two-dimensional embeddings suitable for visualization. Here, we present a new method, called t-SimCNE, for unsupervised visualization of image data. t-SimCNE combines ideas from contrastive learning and neighbor embeddings, and trains a parametric mapping from the high-dimensional pixel space into two dimensions. We show that the resulting 2D embeddings achieve classification accuracy comparable to the state-of-the-art high-dimensional SimCLR representations, thus faithfully capturing semantic relationships. Using t-SimCNE, we obtain informative visualizations of the CIFAR-10 and CIFAR-100 datasets, showing rich cluster structure and highlighting artifacts and outliers.

1. INTRODUCTION

As many research fields are producing ever larger and more complex datasets, data visualization methods have become important in many scientific and practical applications (Becht et al., 2019; Diaz-Papkovich et al., 2019; Kobak & Berens, 2019; Schmidt, 2018). Such methods allow a concise summary of the entire dataset, displaying a high-dimensional dataset as a 2D embedding. This low-dimensional representation is often convenient for data exploration, highlighting clusters and relationships between them. In practice, the most useful are neighbor embedding methods, such as t-SNE (van der Maaten & Hinton, 2008) and UMAP (McInnes et al., 2018), which aim to preserve nearest neighbors from the high-dimensional space when optimizing the layout in 2D.



Figure 1: Left: t-SimCNE. Two augmentations of the same image are fed through the same ResNet and fully-connected projection head to get representations z_i and z_j. The loss function pushes z_i and z_j together to maximize their Cauchy similarity. Middle: Embedding of CIFAR-10. The dashed arrows point to the locations of z_i and z_j from the left. Right: Training loss. The optimization consists of three stages: (1) pre-training with a 128D output for 1000 epochs; (2) fine-tuning only the 2D readout layer for 50 epochs; and (3) fine-tuning the entire network for 450 epochs.
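To make the loss in Figure 1 concrete, the following is a minimal NumPy sketch of a SimCLR-style contrastive loss with a Cauchy similarity kernel, 1 / (1 + ||z_i - z_j||^2), in place of the usual cosine similarity. The batch layout (rows i and i+n holding the two augmented views of image i), the function names, and the exact normalization over negatives are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def t_simcne_loss(z):
    """Contrastive loss with Cauchy similarity (illustrative sketch).

    z: array of shape (2n, d); rows i and i+n are assumed to be the
    two augmented views of the same image.
    """
    m = z.shape[0]
    n = m // 2
    # Pairwise squared Euclidean distances via broadcasting.
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    # Cauchy kernel: similarity is high when embeddings are close.
    sim = 1.0 / (1.0 + d2)
    np.fill_diagonal(sim, 0.0)  # exclude self-similarity
    # Index of the positive partner (the other view) for each row.
    pos = np.concatenate([np.arange(n, m), np.arange(0, n)])
    # Cross-entropy: pull positives together, push all others apart.
    losses = -np.log(sim[np.arange(m), pos] / sim.sum(axis=1))
    return losses.mean()
```

Minimizing this quantity pulls the two views of each image together in the low-dimensional output while repelling all other images in the batch; the heavy-tailed Cauchy kernel (as in t-SNE) is what makes the resulting 2D layout suitable for visualization.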

