SELF-SUPERVISED LEARNING WITH ROTATION-INVARIANT KERNELS

Abstract

We introduce a regularization loss based on kernel mean embeddings with rotation-invariant kernels on the hypersphere (also known as dot-product kernels) for self-supervised learning of image representations. Besides being fully competitive with the state of the art, our method significantly reduces the time and memory complexity of self-supervised training, making it implementable for very large embedding dimensions on existing devices and more easily adjustable than previous methods to settings with limited resources. Our work follows the major paradigm where the model learns to be invariant to some predefined image transformations (cropping, blurring, color jittering, etc.), while avoiding a degenerate solution by regularizing the embedding distribution. Our particular contribution is a loss family promoting an embedding distribution close to the uniform distribution on the hypersphere, with respect to the maximum mean discrepancy pseudometric. We demonstrate that this family encompasses several regularizers of former methods, including uniformity-based and information-maximization methods, which turn out to be variants of our flexible regularization loss with different kernels. Beyond its practical consequences for state-of-the-art self-supervised learning with limited resources, the proposed generic regularization approach opens perspectives to leverage the literature on kernel methods more widely in order to improve self-supervised learning methods.

1. INTRODUCTION

Self-supervised learning is a promising approach for learning visual representations: recent methods (He et al., 2020; Grill et al., 2020; Caron et al., 2020; Gidaris et al., 2021) reach the performance of supervised pretraining in terms of transfer-learning quality on many downstream tasks, such as classification, object detection, and semantic segmentation. These methods rely on some prior knowledge about images: the semantics of an image are invariant (Misra & Maaten, 2020) to small transformations of the image, such as cropping, blurring, or color jittering. One way to design an objective function that encodes such an invariance property is to enforce two different augmentations of the same image to have a similar representation (or embedding) when they are encoded by the neural network. However, the main issue with this kind of objective is an undesirable loss of information (Jing et al., 2022) where, e.g., the network learns to represent all images by the same constant representation. Hence, one of the main challenges in self-supervised learning is to find an efficient way to regularize the embedding distribution in order to avoid such a collapse. Our contribution is a generic regularization loss promoting an embedding distribution close to the uniform distribution on the hypersphere, with respect to the maximum mean discrepancy (MMD), a distance on the space of probability measures based on the notion of embedding probabilities in a reproducing kernel Hilbert space (RKHS) via the so-called kernel mean embedding mapping. Inspired by high-dimensional statistical tests for uniformity that are rotation-invariant (García-Portugués & Verdebout, 2018), we choose to embed probability distributions using rotation-invariant kernels on the hypersphere (dot-product kernels), i.e., kernels whose evaluation at two vectors depends only on their inner product (Smola et al., 2000).
This paper shows that such an approach leads to important theoretical and practical consequences for self-supervised learning. Code: https://github.com/valeoai/sfrik

Figure 1: Self-supervised learning with rotation-invariant kernels. The invariance criterion minimizes the ℓ2-distance between the two normalized embeddings z_i^(v), v = 1, 2, of two views of the same image x_i encoded by the backbone f_θ and the projection head g_w. To avoid collapse, the embedding distribution is regularized to be close to the uniform distribution on the hypersphere, in the sense of the MMD associated with a rotation-invariant kernel K(u, v) = ϕ(u⊤v) defined on the hypersphere.

Table 1: Correspondence between kernel choices K(·, ·) in our generic regularization loss and regularizers of former methods.

    K(u, v)                              | Method
    (u⊤v)²                               | Contrastive
    e^(−t‖u−v‖₂²)                        | AUH
    C − ‖u−v‖₂^(2s−q+1)                  | PointContrast
    b₁ u⊤v + b₂ (q(u⊤v)² − 1)/(q − 1)    | Analog to VICReg (cf. Section 3.3)

We demonstrate that our regularization loss family, parameterized by such rotation-invariant kernels, encompasses several regularizers of former methods. As illustrated in Table 1, specific kernel choices recover the regularizers of contrastive methods and of AUH (Wang & Isola, 2020), and a linear combination of the linear kernel and the quadratic kernel yields a regularizer that promotes a covariance matrix of the embedding distribution proportional to the identity matrix, similarly to information-maximization methods like VICReg (Bardes et al., 2022). In other words, these former methods turn out to be particular ways of minimizing the MMD between the embedding distribution and the uniform distribution on the hypersphere during training, with various specific kernel choices. The proposed generic regularization approach opens perspectives to leverage the literature on kernel methods more widely in order to improve self-supervised learning.
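To make the regularization concrete, here is a minimal NumPy sketch of the empirical regularizer induced by a rotation-invariant kernel. It uses the fact that for K(u, v) = ϕ(u⊤v), the cross term of the squared MMD against the uniform distribution on the sphere does not depend on the embeddings, so minimizing the MMD amounts, up to an additive constant, to minimizing the mean of ϕ(z_i⊤z_j) over pairs of embeddings. The function name and the quadratic kernel below are illustrative choices for exposition, not SFRIK's exact loss.

```python
import numpy as np

def dot_product_mmd_reg(Z, phi):
    """Empirical regularizer: mean of phi(<z_i, z_j>) over all pairs.

    For a rotation-invariant kernel K(u, v) = phi(u^T v), the term
    E_u[K(z, u)] w.r.t. the uniform distribution on the sphere is the same
    for every z, so minimizing the squared MMD to the uniform distribution
    reduces (up to an additive constant) to minimizing this pairwise mean.
    The Gram matrix costs O(n^2 q) for n embeddings of dimension q.
    """
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # project onto the sphere
    G = Z @ Z.T                                       # pairwise inner products
    return phi(G).mean()

# Illustration with the quadratic kernel phi(t) = t^2: embeddings spread over
# the sphere give a lower regularizer value than collapsed (identical) ones.
rng = np.random.default_rng(0)
Z_spread = rng.standard_normal((256, 32))
Z_collapsed = np.ones((256, 32))
loss_spread = dot_product_mmd_reg(Z_spread, lambda t: t ** 2)
loss_collapsed = dot_product_mmd_reg(Z_collapsed, lambda t: t ** 2)
assert loss_spread < loss_collapsed
```

In a training loop, this term would be added to the invariance criterion of Figure 1 with a trade-off weight, which is tuned as a hyperparameter.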
Numerically, we show in a rigorous experimental setting, with a separate validation set for hyperparameter tuning, that our method yields fully competitive results compared to the state of the art when choosing truncated kernels of the form K(u, v) = Σ_{ℓ=0}^{L} b_ℓ P_ℓ(q; u⊤v), with L ∈ {2, 3} and b_ℓ ≥ 0 for ℓ ∈ {0, . . . , L}, where P_ℓ(q; ·) denotes the Legendre polynomial of order ℓ in dimension q. To our knowledge, this kernel choice has not been considered in previous self-supervision methods. Therefore, we introduce SFRIK (SelF-supervised learning with Rotation-Invariant Kernels, pronounced like "spheric"), which regularizes the embedding distribution to be close to the uniform distribution with respect to the MMD associated with such a truncated kernel, as summarized in Figure 1. Importantly, our method significantly reduces the time and memory complexity of self-supervised training compared to information-maximization methods. Thanks to the kernel trick, the complexity of SFRIK's loss is quadratic in the batch size and linear in the embedding dimension, instead of quadratic in the embedding dimension as in VICReg. In practice, SFRIK's pretraining is up to 19% faster than VICReg at an embedding dimension of 16384, and it scales to dimension 32768, whereas VICReg's memory requirement at this dimension is too large for a machine with 8 GPUs and 32 GB of memory per GPU. Hence, our work opens perspectives for self-supervised learning on embedded devices with limited memory, as in (Xiao et al., 2022). We summarize our contributions as follows:

• We introduce a generic regularization loss based on kernel mean embeddings with rotation-invariant kernels on the hypersphere for self-supervised learning of image representations.

• We show that our loss family encompasses several previous self-supervised learning methods, such as uniformity-based and information-maximization methods.
• We numerically show that SFRIK significantly reduces the time and memory complexity of self-supervised training, while remaining fully competitive with the state of the art.
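The truncated kernel K(u, v) = Σ_{ℓ=0}^{L} b_ℓ P_ℓ(q; u⊤v) can be evaluated with standard special-function routines. The sketch below assumes the common normalization P_ℓ(q; 1) = 1, obtained by rescaling Gegenbauer polynomials C_ℓ^((q−2)/2); the function names are illustrative and this is not taken verbatim from the paper.

```python
import numpy as np
from scipy.special import eval_gegenbauer

def legendre_q(ell, q, t):
    """Legendre polynomial of order ell in dimension q, normalized so that
    P_ell(q; 1) = 1 (assumed normalization convention). It is obtained from
    the Gegenbauer polynomial C_ell^alpha with alpha = (q - 2) / 2.
    """
    alpha = (q - 2) / 2.0
    return eval_gegenbauer(ell, alpha, t) / eval_gegenbauer(ell, alpha, 1.0)

def truncated_kernel(u, v, b, q):
    """K(u, v) = sum_{ell=0}^{L} b[ell] * P_ell(q; <u, v>) for unit vectors."""
    t = float(u @ v)
    return sum(b_ell * legendre_q(ell, q, t) for ell, b_ell in enumerate(b))

# Sanity check in dimension q = 3, where P_ell(3; t) is the classical
# Legendre polynomial: P_2(3; 0) = -1/2.
assert abs(legendre_q(2, 3, 0.0) + 0.5) < 1e-9

# For any dimension, P_1(q; t) = t, and K(u, u) = sum of the coefficients b.
u = np.zeros(8); u[0] = 1.0  # a unit vector in dimension q = 8
assert abs(truncated_kernel(u, u, [1.0, 2.0, 3.0], 8) - 6.0) < 1e-9
```

With L ∈ {2, 3} as in the paper, evaluating this kernel on a batch Gram matrix only involves low-degree polynomials of the inner products, which is consistent with the stated O(n²q)-type cost of the loss.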





