SHUFFLE TO LEARN: SELF-SUPERVISED LEARNING FROM PERMUTATIONS VIA DIFFERENTIABLE RANKING

Abstract

Self-supervised pre-training using so-called "pretext" tasks has recently shown impressive performance across a wide range of tasks. In this work, we advance self-supervised learning from permutations, which consists in shuffling parts of an input and training a model to reorder them, improving downstream classification performance. To do so, we overcome the main challenge of integrating permutation inversions (a discontinuous operation) into an end-to-end training scheme, heretofore sidestepped by casting the reordering task as classification, which fundamentally restricts the space of permutations that can be exploited. We make two main contributions. First, we use recent advances in differentiable ranking to integrate the permutation inversion seamlessly into a neural network, enabling us to use the full set of permutations at no additional computing cost. Our experiments validate that learning from all possible permutations improves the quality of the pre-trained representations over using a limited, fixed set. Second, we demonstrate that inverting permutations is a meaningful pretext task across a diverse range of modalities, beyond images, and that it requires no modality-specific design. In particular, we improve music understanding by reordering spectrogram patches in the time-frequency plane, as well as video classification by reordering frames along the time axis. We furthermore analyze the influence of the patch shapes we use (vertical, horizontal, 2-dimensional), as well as the benefit of our approach in different data regimes.

1. INTRODUCTION

Supervised learning has achieved important successes on large annotated datasets (Deng et al., 2009; Amodei et al., 2016). However, most available data, whether images, audio, or videos, are unlabelled. For this reason, pre-training representations in an unsupervised way, with subsequent fine-tuning on labelled data, has become the standard to extend the performance of deep architectures to applications where annotations are scarce, such as understanding medical images (Rajpurkar et al., 2017), recognizing speech from under-resourced languages (Rivière et al., 2020; Conneau et al., 2020), or solving specific language inference tasks (Devlin et al., 2018). Among unsupervised training schemes, self-supervised learning focuses on designing a proxy training objective that requires no annotation, such that the representations incidentally learned will generalize well to the task of interest, limiting the amount of labelled data needed for fine-tuning. Such "pretext" tasks, a term coined by Doersch et al. (2015), include learning to colorize an artificially gray-scaled image (Larsson et al., 2017), inpainting removed patches (Pathak et al., 2016), or recognizing the angle by which an original image was rotated (Gidaris et al., 2018). Other approaches to self-supervision include contrasting augmented views of the same image (Chen et al., 2020) and clustering (Caron et al., 2018). In this work, we consider the pretext task of reordering patches of an input, first proposed for images by Noroozi & Favaro (2016), the analogue of solving a jigsaw puzzle. In this setting, we first split an input into patches and shuffle them by applying a random permutation. We train a neural network to predict which permutation was applied, taking the shuffled patches as inputs. We then use the inner representations learned by the neural network as input features to a low-capacity supervised classifier (see Figures 1 and 2 for illustration).
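The mechanics of the pretext task can be sketched in a few lines: shuffle the patches with a random permutation, and take the inverse permutation (the argsort of the applied one) as the reordering target. This is a minimal illustration of the setup described above, not the paper's training code; the array shapes are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input: an "image" split into 6 patches, each a flat feature vector.
n_patches, patch_dim = 6, 4
patches = rng.normal(size=(n_patches, patch_dim))

# Pretext task: apply a random permutation to the patches...
perm = rng.permutation(n_patches)
shuffled = patches[perm]

# ...and train a network, given `shuffled`, to predict each patch's
# original position. The inverse permutation is the reordering target.
inverse = np.argsort(perm)

# Reordering the shuffled patches with the inverse permutation
# recovers the original input exactly.
restored = shuffled[inverse]
assert np.allclose(restored, patches)
```

The downstream classifier never sees shuffled inputs; only the encoder's intermediate representations, learned while solving this reordering task, are reused.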
We believe that permutations provide a promising avenue for self-supervised learning, as they are conceptually general enough to be applied across a large range of modalities, unlike colorization (Larsson et al., 2017) or rotations (Gidaris et al., 2018), which are specific to images. The idea of using permutations was also explored by Santa Cruz et al. (2018), who use a bilevel optimization scheme that leverages Sinkhorn iterations to learn visual representations. Their method resorts to approximating the permutation matrix with such continuous methods; ours relies on no such approximation and can efficiently represent all possible permutations. Moreover, the encouraging results of Noroozi & Favaro (2016) when transferring learned image features to object detection and image retrieval inspire us to advance this method a step forward. However, including permutations in an end-to-end differentiable pipeline is challenging, as permutations are a discontinuous operation. Noroozi & Favaro (2016) circumvent this issue by using a fixed set of permutations and casting the permutation prediction problem as a classification one. Given that the number of possible permutations of n patches is n!, this approach cannot scale to the full set of permutations, even when n is moderately small. In this work, we leverage recent advances in differentiable ranking (Berthet et al., 2020; Blondel et al., 2020b) to integrate permutations into end-to-end neural training. This allows us to solve the permutation inversion task for the entire set of permutations, removing a bottleneck that was heretofore sidestepped in manners that could deteriorate downstream performance.
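The differentiable ranking operators referenced above are those of Berthet et al. (2020) and Blondel et al. (2020b). As a much simpler illustration of the underlying idea, smoothing the discontinuous ranking operation into one that admits gradients, a soft rank can be sketched with pairwise sigmoids. This is a hypothetical toy construction, not the operator used in the work:

```python
import numpy as np

def soft_rank(scores, tau=0.1):
    """Differentiable surrogate for the (0-indexed) ranks of `scores`.

    Each rank is approximated by a sum of pairwise sigmoids; as the
    temperature `tau` -> 0 this converges to the exact ranks, while for
    tau > 0 the operation is smooth and provides useful gradients.
    """
    diff = scores[:, None] - scores[None, :]        # pairwise differences
    pairwise = 1.0 / (1.0 + np.exp(-diff / tau))    # sigmoid of each pair
    # Exclude the i == j terms (each contributes sigmoid(0) = 0.5).
    return pairwise.sum(axis=1) - 0.5

scores = np.array([0.3, -1.2, 2.0, 0.7])
approx = soft_rank(scores, tau=0.01)
exact = np.argsort(np.argsort(scores))  # hard 0-indexed ranks: [1, 0, 3, 2]
# With a small temperature the soft ranks are close to the hard ranks.
assert np.allclose(approx, exact, atol=1e-3)
```

Because a network can output one score per shuffled patch and the (soft) ranks of those scores define a permutation, such an operator lets a single head represent any of the n! permutations, rather than one class among a fixed subset.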
Moreover, we demonstrate for the first time the effectiveness of permutations as a pretext task on multiple modalities, with minimal modality-specific adjustments. In particular, we improve music understanding by learning to reorder spectrogram frames over the time and frequency axes. We also improve video understanding by reordering video frames along time. To summarize, we make the following two contributions.
- We integrate differentiable ranking into end-to-end neural network training for representation learning. This provides an efficient way to learn reordering tasks over all permutations, for larger numbers of patches. We show that this drastic increase in the number of permutations improves the quality of the learned representations for downstream tasks.
- We demonstrate for the first time the effectiveness of permutations as a general-purpose self-supervision method, effective on multiple modalities with extremely minimal modifications to the network. Additionally, the pre-trained representations perform well across diverse



Figure 1: Permutations as a self-supervised technique can handle a variety of modalities with minimal changes to the network architecture. Dotted lines indicate weight sharing across input patches. The learned embeddings can be used for several downstream tasks. The inputs for the upstream task are permuted, while the inputs for the downstream task are not.

