UNSUPERVISED VISUALIZATION OF IMAGE DATASETS USING CONTRASTIVE LEARNING

Abstract

Visualization methods based on the nearest neighbor graph, such as t-SNE or UMAP, are widely used for visualizing high-dimensional data. Yet these approaches only produce meaningful results if the nearest neighbors themselves are meaningful. For images represented in pixel space this is not the case, as distances in pixel space often do not capture our sense of similarity, and therefore neighbors are not semantically close. This problem can be circumvented by self-supervised approaches based on contrastive learning, such as SimCLR, which rely on data augmentation to generate implicit neighbors, but these methods do not produce two-dimensional embeddings suitable for visualization. Here we present a new method, called t-SimCNE, for unsupervised visualization of image data. t-SimCNE combines ideas from contrastive learning and neighbor embeddings, and trains a parametric mapping from the high-dimensional pixel space into two dimensions. We show that the resulting 2D embeddings achieve classification accuracy comparable to state-of-the-art high-dimensional SimCLR representations, thus faithfully capturing semantic relationships. Using t-SimCNE, we obtain informative visualizations of the CIFAR-10 and CIFAR-100 datasets, showing rich cluster structure and highlighting artifacts and outliers.

1. INTRODUCTION

As many research fields are producing ever larger and more complex datasets, data visualization methods have become important in many scientific and practical applications (Becht et al., 2019; Diaz-Papkovich et al., 2019; Kobak & Berens, 2019; Schmidt, 2018). Such methods provide a concise summary of an entire dataset, displaying a high-dimensional dataset as a 2D embedding. This low-dimensional representation is convenient for data exploration, highlighting clusters and the relationships between them. In practice, the most useful methods are neighbor embeddings, such as t-SNE (van der Maaten & Hinton, 2008) and UMAP (McInnes et al., 2018), which aim to preserve the nearest neighbors from the high-dimensional space when optimizing the layout in 2D. Unfortunately, for image datasets, nearest neighbors computed with the Euclidean metric in pixel space are typically not worth preserving. Although t-SNE works well on very simple image datasets such as MNIST (van der Maaten & Hinton, 2008, Figure 2a), it fails on more natural image datasets such as CIFAR-10/100 (Supp. Fig. A.1). To create 2D embeddings of images, new visualization approaches are required that use different notions of similarity.

Here, we provide such a method based on the contrastive learning framework. Contrastive learning is currently the state-of-the-art approach to unsupervised learning in computer vision (Hadsell et al., 2006). The contrastive learning method SimCLR (Chen et al., 2020) uses image transformations to create two views of each image and then optimizes a convolutional neural network so that the two views stay close together in the resulting representation. While this method performs very well in benchmarks such as linear or kNN classification accuracy, the computed representation is typically high-dimensional (e.g. 128-dimensional) and hence not suitable for visualization. We extend the SimCLR framework to directly optimize a 2D embedding.
Taking inspiration from t-SNE, we use the Euclidean distance and the Cauchy (t-distribution) kernel to measure similarity in 2D. While using a 2D instead of a 128D output may not seem like a big step, we show that optimizing the resulting architecture is challenging. We develop an efficient training strategy to overcome these challenges, and only then are we able to achieve satisfactory visualizations. We call the resulting method t-SimCNE (Fig. 1) and show that it yields meaningful and useful embeddings of the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009). Our code is available at github.com/berenslab/t-simcne (see iclr2023 branch).

NE algorithms have been used to visualize latent representations of neural networks trained in a supervised setting (e.g. Karpathy, 2014; Mnih et al., 2015). This approach is, however, unsuitable for data exploration as it is supervised. NE algorithms can also be applied to an unsupervised (also known as self-supervised) representation of a dataset obtained with SimCLR, or to a representation obtained with a neural network pre-trained on a generic image classification task such as ImageNet (e.g. Narayan et al., 2015). The downside is that these approaches do not yield a parametric mapping to 2D. In this work, we are interested in an unsupervised but parametric mapping that allows embedding out-of-sample points. See Discussion for further considerations.
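To make the 2D similarity concrete, the following NumPy sketch computes the Cauchy kernel on Euclidean distances and a simplified InfoNCE-style batch loss over pairs of augmented views. The function names and the batch layout (the two views of image k at rows 2k and 2k+1) are our own illustration under stated assumptions, not the authors' implementation.

```python
import numpy as np

def cauchy_similarity(z_i, z_j):
    """Cauchy (t-distribution) kernel on the squared Euclidean distance,
    q = 1 / (1 + ||z_i - z_j||^2), as in t-SNE's low-dimensional kernel."""
    d2 = np.sum((z_i - z_j) ** 2, axis=-1)
    return 1.0 / (1.0 + d2)

def infonce_cauchy_loss(z):
    """InfoNCE-style loss over a batch of 2n 2D embeddings, where rows
    2k and 2k+1 hold the two augmented views of image k (assumed layout)."""
    n2 = z.shape[0]
    # pairwise Cauchy similarities; exclude self-similarity from the sum
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    q = 1.0 / (1.0 + d2)
    np.fill_diagonal(q, 0.0)
    # each row's positive partner is the other view: 0<->1, 2<->3, ...
    pos = np.arange(n2) ^ 1
    # negative log of the similarity to the positive, normalized over the batch
    loss = -np.log(q[np.arange(n2), pos] / q.sum(axis=1))
    return loss.mean()
```

Because the Cauchy kernel decays polynomially rather than exponentially, distant negatives still exert a gradient, which is one way to read the heavy-tailed kernel's role in spreading out clusters in 2D.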

2. RELATED WORK

Conceptual similarity between SimCLR and t-SNE has recently been pointed out by Damrich et al. (2023). The authors suggest interpreting kNN graph edges as data augmentations, and show that t-SNE can also be optimized using the InfoNCE loss (van den Oord et al., 2018) used by SimCLR, and/or using a parametric mapping. Equivalently, one can think of SimCLR as a parametric SNE that samples edges from an unobservable neighbor graph. We were motivated by this connection when developing t-SimCNE.

Further motivation comes from a recently described phenomenon called dimensional collapse (Jing et al., 2022; Tian, 2022), which suggests that there is redundant information in the output of SimCLR. Hence we reasoned that it should be possible to achieve a good representation even with drastically reduced output dimensionality.

Two closely related works appeared during the preparation of this manuscript: Zang et al. (2022) suggest an architecture similar to SimCLR for 2D embeddings, but use a more complicated setup to impose 'local flatness' and, judging from their figures, obtain qualitatively worse embeddings of the CIFAR datasets than we do (we were unable to quantitatively benchmark their method). Hu et al. (2023) suggest using the Cauchy kernel in the SimCLR framework (calling it t-SimCLR), but in terms of 2D visualization obtain worse results than we do (Hu et al., 2023, Fig. B.11 shows CIFAR-10, reported kNN accuracy 57% vs. our 89%).
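The interpretation of kNN edges as data augmentations can be sketched in a few lines: instead of transforming an image twice, a "positive pair" is a point together with one of its sampled neighbors. This is a minimal illustration of the idea, with a hypothetical array layout for the kNN graph; it is not code from Damrich et al. (2023).

```python
import numpy as np

def sample_knn_edge(knn_indices, rng):
    """Treat a kNN edge as a 'data augmentation': return a positive pair
    (i, j) where j is a uniformly sampled neighbor of a random point i.
    `knn_indices[i]` is assumed to list the k neighbor indices of point i,
    excluding i itself."""
    i = int(rng.integers(knn_indices.shape[0]))
    j = int(rng.choice(knn_indices[i]))
    return i, j
```

Under this view, t-SNE samples positive pairs from an explicitly computed kNN graph, whereas SimCLR samples them from an implicit graph defined by the augmentation distribution.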



Figure 1: Left: t-SimCNE. Two augmentations of the same image are fed through the same ResNet and fully-connected projection head to get representations z_i and z_j. The loss function pushes z_i and z_j together to maximize their Cauchy similarity. Middle: Embedding of CIFAR-10. The dashed arrows point to the locations of z_i and z_j from the left. Right: Training loss. The optimization consists of three stages: (1) pre-training with a 128D output for 1000 epochs; (2) fine-tuning only the 2D readout layer for 50 epochs; and (3) fine-tuning the entire network for 450 epochs.
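The three-stage schedule from the caption can be written down as a small configuration sketch. The dictionary layout, stage names, and the `total_epochs` helper are illustrative conveniences, not the authors' API; only the dimensionalities and epoch counts come from the caption.

```python
# Hypothetical encoding of the three-stage t-SimCNE optimization schedule.
STAGES = [
    {"name": "pre-train",         "out_dim": 128, "epochs": 1000, "trainable": "all layers"},
    {"name": "fine-tune readout", "out_dim": 2,   "epochs": 50,   "trainable": "2D readout only"},
    {"name": "fine-tune network", "out_dim": 2,   "epochs": 450,  "trainable": "all layers"},
]

def total_epochs(stages):
    """Total optimization budget across all stages."""
    return sum(s["epochs"] for s in stages)
```

Freezing everything except the freshly attached 2D readout in stage 2 lets the new layer adapt to the pre-trained 128D representation before the whole network is fine-tuned in stage 3.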

Neighbor embeddings (NE) have a rich history dating back to locally linear embedding (Roweis & Saul, 2000) and stochastic neighbor embedding (SNE; Hinton & Roweis, 2003). They became widely used after the introduction of the Cauchy kernel into the SNE framework (van der Maaten & Hinton, 2008) and after efficient approximations became available (van der Maaten, 2014; Linderman et al., 2019). A number of algorithms based on that framework, such as LargeVis (Tang et al., 2016), UMAP (McInnes et al., 2018), and TriMap (Amid & Warmuth, 2019), have been developed and gained widespread adoption in recent years across a variety of application fields. All of them are closely related to SNE (Böhm et al., 2022; Damrich et al., 2023) and rely on the kNN graph of the data.

