SELF-SUPERVISED LEARNING WITH ROTATION-INVARIANT KERNELS

Abstract

We introduce a regularization loss based on kernel mean embeddings with rotation-invariant kernels on the hypersphere (also known as dot-product kernels) for self-supervised learning of image representations. Besides being fully competitive with the state of the art, our method significantly reduces time and memory complexity for self-supervised training, making it implementable for very large embedding dimensions on existing devices and more easily adjustable than previous methods to settings with limited resources. Our work follows the major paradigm where the model learns to be invariant to some predefined image transformations (cropping, blurring, color jittering, etc.), while avoiding a degenerate solution by regularizing the embedding distribution. Our particular contribution is to propose a loss family promoting the embedding distribution to be close to the uniform distribution on the hypersphere, with respect to the maximum mean discrepancy pseudometric. We demonstrate that this family encompasses several regularizers of former methods, including uniformity-based and information-maximization methods, which are variants of our flexible regularization loss with different kernels. Beyond its practical consequences for state-of-the-art self-supervised learning with limited resources, the proposed generic regularization approach opens perspectives to leverage more widely the literature on kernel methods in order to improve self-supervised learning methods.

1. INTRODUCTION

Self-supervised learning is a promising approach for learning visual representations: recent methods (He et al., 2020; Grill et al., 2020; Caron et al., 2020; Gidaris et al., 2021) reach the performance of supervised pretraining in terms of quality for transfer learning in many downstream tasks, like classification, object detection, semantic segmentation, etc. These methods rely on some prior knowledge about images: the semantics of an image are invariant (Misra & Maaten, 2020) to some small transformations of the image, such as cropping, blurring, color jittering, etc. One way to design an objective function that encodes such an invariance property is to enforce two different augmentations of the same image to have a similar representation (or embedding) when they are encoded by the neural network. However, the main issue with this kind of objective function is to avoid an undesirable loss of information (Jing et al., 2022) where, e.g., the network learns to represent all images by the same constant representation. Hence, one of the main challenges in self-supervised learning is to propose an efficient way to regularize the embedding distribution in order to avoid such a collapse. Our contribution is to propose a generic regularization loss promoting the embedding distribution to be close to the uniform distribution on the hypersphere, with respect to the maximum mean discrepancy (MMD), a distance on the space of probability measures based on the notion of embedding probabilities in a reproducing kernel Hilbert space (RKHS), using the so-called kernel mean embedding mapping. Inspired by high-dimensional statistical tests for uniformity that are rotation-invariant (García-Portugués & Verdebout, 2018), we choose to embed probability distributions using rotation-invariant kernels on the hypersphere (dot-product kernels), i.e., kernels for which the evaluation for two vectors depends only on their inner product (Smola et al., 2000).
This paper shows that such an approach leads to important theoretical and practical consequences for self-supervised learning.

Figure 1: Self-supervised learning with rotation-invariant kernels. The invariance criterion minimizes the $\ell_2$-distance between two normalized embeddings $\{z_i^{(v)}\}_{v=1,2}$ of two views of the same image $x_i$ encoded by the backbone $f_\theta$ and the projection head $g_w$. To avoid collapse, the embedding distribution is regularized to be close to the uniform distribution on the hypersphere, in the sense of the MMD associated to a rotation-invariant kernel $K(u, v) = \varphi(u^\top v)$ defined on the hypersphere.

Table 1: Correspondence between kernel choices $K(\cdot, \cdot)$ in our generic regularization loss and regularizers of former methods.

2. RELATED WORK

Instance discrimination methods typically rely on a contrastive loss that behaves asymptotically like an alignment and uniformity loss on the hypersphere in the limit of infinite samples. Our contribution is to formalize and generalize existing uniformity-based methods by using kernel mean embeddings. To the best of our knowledge, the proposed kernel framework establishes the first connection between uniformity-based methods and information-maximization methods like VICReg.

Instance discrimination

One way of learning image representations that are invariant to predefined image transformations (Misra & Maaten, 2020) is to rely on an instance classification approach (Wu et al., 2018). Contrastive learning (e.g., Henaff, 2020) discriminates instances within a batch of sampled images using the noise contrastive estimator (Gutmann & Hyvärinen, 2010), by attracting embeddings of transformed images coming from the same image instance, and repulsing embeddings coming from different image instances. In practice, this estimator needs a large number of image representations in order to achieve good results, which requires a large batch size like SimCLR (Chen et al., 2020a) or a memory bank (Wu et al., 2018; He et al., 2020). In the limit of infinite samples, the contrastive loss is shown to behave asymptotically like the alignment and uniformity loss of AUH.

Uniformity on the hypersphere

Existing uniformity-based methods avoid collapse by regularizing the embedding distribution to be close to the uniform distribution on the hypersphere, which has a high entropy. Bojanowski & Joulin (2017) perform this kind of regularization by aligning the learned representations on a fixed number of vectors sampled uniformly at random on the hypersphere.
AUH maximizes the average pairwise distance between embeddings using an RBF kernel, in the spirit of energy minimization methods that address the problem of scattering points evenly on the hypersphere (Hardin & Saff, 2005; Liu et al., 2018; Borodachov et al., 2019). Although alternative high-entropy prior distributions (e.g., the uniform distribution on the hypercube) can be used for regularization (Chen et al., 2021), encoding images into $\ell_2$-normalized representations helps to stabilize training (Schroff et al., 2015; Parkhi et al., 2015; Liu et al., 2017).

Kernel mean embedding

As a contribution, our generic loss formalizes and generalizes these previous uniformity losses, by relying on kernel mean embeddings (cf. Appendix A.1) to measure the distance between probability distributions on high-dimensional spaces, using the MMD pseudometric (Gretton et al., 2012; Li et al., 2015; Dziugaite et al., 2015; Briol et al., 2019) with rotation-invariant kernels on the hypersphere (Smola et al., 2000; Pennington et al., 2015; Lyu, 2017; Dutordoir et al., 2020). These tools are adapted for high-dimensional problems on the hypersphere whose geometry is different from the one in small dimension, as illustrated by García-Portugués & Verdebout (2018): many statistical tests for uniformity on the hypersphere, i.e., tests for rejecting the null hypothesis where a batch of normalized vectors is sampled from the uniform distribution on the hypersphere, are in fact precisely estimators of the MMD between the embedding distribution and the uniform distribution, for different kernels. Our kernel method for self-supervision is complementary to (Li et al., 2021), in which the dependency between image instances and their embedding is maximized with respect to the Hilbert-Schmidt independence criterion (cf. Appendix A.3).

Information maximization

Our generic regularization approach has the benefit of connecting uniformity-based and information-maximization methods (Zbontar et al., 2021; Ermolov et al., 2021; Bardes et al., 2022). The latter are alternatives to distillation methods (Grill et al., 2020; Gidaris et al., 2020; 2021; Chen & He, 2021; Caron et al., 2021) where a student network learns to predict the representations of a teacher network.
In such methods, using various architecture tricks (like prediction head, stop-gradient, momentum encoder, batch normalization or centering) is shown empirically to be sufficient to avoid collapse without instance discrimination, even though it is not fully understood how these multiple factors induce a regularization during training (Richemond et al., 2020; Tian et al., 2021). Instead of using these tricks, information-maximization methods use a Siamese architecture and avoid collapse by maximizing the statistical information of a batch of embeddings, using a whitening operation (Ermolov et al., 2021), or an explicit regularization term making the covariance (Bardes et al., 2022) or the cross-correlation (Zbontar et al., 2021) matrix close to a scaled identity matrix. This paper shows that our generic regularization loss with an appropriate kernel also promotes the covariance matrix of the embedding distribution to be proportional to the identity matrix. But in contrast to VICReg, which explicitly computes the covariance matrix, our method uses the kernel trick to significantly reduce complexity at large embedding dimensions.

3. METHOD DESCRIPTION

Given an unlabeled dataset of images $x_i \sim \mathbb{P}$, $i \in [N] := \{1, \ldots, N\}$, sampled independently from a data distribution $\mathbb{P}$, the goal is to learn a backbone network $f_\theta$ parameterized by $\theta$ (e.g., a convolutional neural network) such that any new image $x \sim \mathbb{P}$ is encoded by a good representation $f_\theta(x)$ whose quality is evaluated in several downstream tasks (see Section 4).

3.1. INVARIANCE AND UNIFORMITY FOR SELF-SUPERVISION

Our self-supervised learning method (see Figure 1) follows the principle of recent methods like SimCLR or VICReg. During self-supervised training, each image $x_i$ is augmented using two different random transformations $t^{(1)}$ and $t^{(2)}$ sampled from a distribution $\mathcal{T}$, which yields two views $x_i^{(1)} := t^{(1)}(x_i)$ and $x_i^{(2)} := t^{(2)}(x_i)$ of the image $x_i$. Two representations $z_i^{(v)}$ ($v = 1, 2$) are obtained by encoding each $x_i^{(v)}$ with the backbone $f_\theta$ and $\ell_2$-normalizing the resulting feature vector. For a given subset of indices $I \subseteq [N]$, we write $Z_I^{(v)} := \{z_i^{(v)}\}_{i \in I}$. The backbone $f_\theta$ is trained by minimizing the total objective function:

$$\mathcal{L} = \mathbb{E}_{t^{(1)}, t^{(2)} \sim \mathcal{T}} \, \mathbb{E}_{I \subseteq [N]} \, \ell\big(Z_I^{(1)}, Z_I^{(2)}\big), \tag{1}$$

where batches $I$ are drawn at random with a prescribed batch size, and the loss $\ell$ is a weighted sum involving an alignment term $\ell_a$ and a uniformity term $\ell_u$, in the spirit of AUH:

$$\ell\big(Z_I^{(1)}, Z_I^{(2)}\big) := \lambda \, \ell_a\big(Z_I^{(1)}, Z_I^{(2)}\big) + 0.5 \, \Big( \ell_u\big(Z_I^{(1)}\big) + \ell_u\big(Z_I^{(2)}\big) \Big); \tag{2}$$

$\lambda > 0$ is a hyperparameter that tunes the balance between the two terms. The loss $\ell_a$ enforces the invariance property of the model, and is defined for a batch $I \subseteq [N]$ of cardinality $|I|$ as:

$$\ell_a\big(Z_I^{(1)}, Z_I^{(2)}\big) := \frac{1}{|I|} \sum_{i \in I} \big\| z_i^{(1)} - z_i^{(2)} \big\|_2^2. \tag{3}$$

Our main contribution is in the choice of the uniformity term $\ell_u$, detailed in the rest of the section.

Note that instead of applying the loss (2) to the output of $f_\theta$ (called image representation), we add a projection head $g_w$ (a multi-layer perceptron) parameterized by $w$ to the output of $f_\theta$ and apply (2) at the output of $g_w$ (called image embedding). This common practice (Caron et al., 2020; Grill et al., 2020) improves the performance in the downstream tasks. Therefore, denoting $S^{q-1}$ the unit hypersphere in $\mathbb{R}^q$, the image embedding actually reads

$$z_i^{(v)} := \frac{(g_w \circ f_\theta)\big(x_i^{(v)}\big)}{\big\| (g_w \circ f_\theta)\big(x_i^{(v)}\big) \big\|_2} \in S^{q-1}.$$

Both $g_w$ and $f_\theta$ are jointly trained without supervision by minimizing (1) using a stochastic mini-batch algorithm. After training, $g_w$ is discarded and only $f_\theta$ is kept for the downstream tasks.
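As an illustration of how the batch loss of this section is assembled, the following NumPy sketch computes the $\ell_2$-normalization, the alignment term and the weighted sum of alignment and uniformity terms. It is a minimal sketch under assumptions, not the authors' implementation: the function names are hypothetical, the alignment term is taken to be the mean squared $\ell_2$-distance, and `order1_energy` is only a placeholder uniformity term (the average pairwise inner product) standing in for the kernel-based term defined in Section 3.2.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project each row of x (shape (n, q)) onto the unit hypersphere."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def alignment_loss(z1, z2):
    """Mean squared l2-distance between paired embeddings (assumed form of the alignment term)."""
    return float(np.mean(np.sum((z1 - z2) ** 2, axis=1)))

def order1_energy(z):
    """Placeholder uniformity term: average pairwise inner product within the batch."""
    return float(np.mean(z @ z.T))

def total_loss(z1, z2, uniformity_fn, lam=1.0):
    """Weighted sum: lam * alignment + 0.5 * (uniformity(z1) + uniformity(z2))."""
    return lam * alignment_loss(z1, z2) + 0.5 * (uniformity_fn(z1) + uniformity_fn(z2))
```

In a training loop, `z1` and `z2` would be the normalized outputs of the projection head for the two views of the same batch, and `uniformity_fn` would be replaced by the chosen kernel-based regularizer.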

3.2. UNIFORMITY LOSS VIA MMD MINIMIZATION

We continue by explaining our generic kernel formulation of u using the MMD pseudometric and rotation-invariant kernels. Then we provide examples of such kernels and describe our kernel choice.

3.2.1. MMD PSEUDOMETRIC AND ROTATION-INVARIANT KERNELS

Our uniformity loss relies on a divergence in the space of probability distributions based on a positive definite kernel $K$ defined on some space $\mathcal{X}$. Denoting $\mathcal{H}$ the corresponding RKHS with norm $\|\cdot\|_{\mathcal{H}}$, the MMD between two probability distributions $Q_1, Q_2$ on $\mathcal{X}$ can be expressed as the distance in $\|\cdot\|_{\mathcal{H}}$ between their kernel mean embeddings (Borgwardt et al., 2006; Muandet et al., 2017):

$$\mathrm{MMD}(Q_1, Q_2) = \left\| \int_{\mathcal{X}} K(u, \cdot) \, dQ_1(u) - \int_{\mathcal{X}} K(u, \cdot) \, dQ_2(u) \right\|_{\mathcal{H}}. \tag{4}$$

We propose to use this pseudometric to measure the distance between the probability distribution of the embeddings $z_i^{(v)}$ ($v = 1, 2$) and the uniform probability distribution on the hypersphere $S^{q-1}$ defined by $U := \sigma_{q-1} / |S^{q-1}|$, where $\sigma_{q-1}$ denotes the normalized Hausdorff surface measure on $S^{q-1}$, and $|S^{q-1}| := \int_{S^{q-1}} d\sigma_{q-1} = 2\pi^{q/2} / \Gamma(q/2)$ is the surface area of $S^{q-1}$, with $\Gamma$ denoting the Gamma function. Intuitively, a good choice of kernel for measuring the distance (4) should distinguish any probability distribution from the uniform distribution. Inspired by statistical tests for uniformity that are rotation-invariant (García-Portugués & Verdebout, 2018), we propose to use rotation-invariant kernels on $\mathcal{X} := S^{q-1}$ of the form $K(u, v) := \varphi(u^\top v)$ with $\varphi$ a continuous function defined on $[-1, 1]$ (Smola et al., 2000). The following theorem characterizes the form of function $\varphi$ that ensures positive definiteness of $K$, and thus that (4) is a valid pseudometric.

Theorem 1 (Schoenberg (1942, Theorem 1)) The kernel $K(u, v) := \varphi(u^\top v)$ on $\mathcal{X} := S^{q-1}$ with $\varphi$ continuous is positive definite if, and only if, the function $\varphi$ admits an expansion:

$$\varphi(t) = \sum_{\ell=0}^{+\infty} b_\ell \, P_\ell(q; t), \quad \text{with } b_\ell \geq 0, \tag{5}$$

where

$$P_\ell(q; t) := \ell! \, \Gamma\!\left(\tfrac{q-1}{2}\right) \sum_{k=0}^{\lfloor \ell/2 \rfloor} \left(-\tfrac{1}{4}\right)^{k} \frac{(1 - t^2)^k \, t^{\ell - 2k}}{k! \, (\ell - 2k)! \, \Gamma\!\left(k + \tfrac{q-1}{2}\right)}$$

is the Legendre (or Gegenbauer) polynomial of degree $\ell$ in dimension $q$ (Müller, 2012, (2.32)).
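To make the definition of the Legendre polynomials in Theorem 1 concrete, the following pure-Python sketch evaluates $P_\ell(q; t)$ directly from the finite sum. The function name `legendre` is hypothetical, and the use of log-Gamma values is an implementation choice made here so that the ratio of Gamma functions does not overflow when $q$ is large; it is a sketch, not the authors' code.

```python
import math

def legendre(ell, q, t):
    """Evaluate the Legendre polynomial P_ell(q; t) of degree ell in dimension q >= 2
    from its finite-sum definition."""
    a = (q - 1) / 2.0
    total = 0.0
    for k in range(ell // 2 + 1):
        # log of ell! * Gamma(a) / (k! * (ell - 2k)! * Gamma(k + a)),
        # computed with lgamma to stay finite for large q
        log_coef = (math.lgamma(ell + 1) + math.lgamma(a)
                    - math.lgamma(k + 1) - math.lgamma(ell - 2 * k + 1)
                    - math.lgamma(k + a))
        total += (math.exp(log_coef) * (-0.25) ** k
                  * (1.0 - t * t) ** k * t ** (ell - 2 * k))
    return total
```

For instance, `legendre(2, q, t)` can be compared with the closed form $(q t^2 - 1)/(q - 1)$, and `legendre(ell, q, 1.0)` should return a value close to 1 for any degree, since only the $k = 0$ term survives at $t = 1$.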
As we are interested in measuring the distance between the embedding distribution and the uniform distribution $U$ on the hypersphere, we compute the kernel mean embedding of $U$ for a kernel satisfying the condition of Theorem 1 using the following known result, used implicitly in, e.g., (Brauchart et al., 2014). As we could not locate a formal proof, we provide one in Appendix B.1.

Lemma 2 Let $K(u, v) := \varphi(u^\top v)$ be a rotation-invariant kernel on $\mathcal{X} := S^{q-1}$ where $\varphi$ admits the expansion (5). The kernel mean embedding of the uniform distribution $U$ on $S^{q-1}$ is constant:

$$\int_{S^{q-1}} K(u, v) \, dU(u) = b_0 \in \mathbb{R} \quad \text{for all } v \in S^{q-1}.$$

The kernel mean embedding of any probability distribution $Q$ defined on the hypersphere satisfies:

$$\int_{S^{q-1}} K(u, \cdot) \, dQ(u) = b_0 + \int_{S^{q-1}} \tilde{K}(u, \cdot) \, dQ(u),$$

where $\tilde{K}(u, v) := \tilde{\varphi}(u^\top v)$ for any $u, v \in S^{q-1}$ with $\tilde{\varphi} := \sum_{\ell=1}^{+\infty} b_\ell \, P_\ell(q; \cdot)$.

Using Lemma 2 in (4) yields $\mathrm{MMD}(Q, U) = \left\| \int_{S^{q-1}} \tilde{K}(u, \cdot) \, dQ(u) \right\|_{\mathcal{H}}$ for any probability distribution $Q$ on $S^{q-1}$. Then, by the reproducing property in the RKHS $\mathcal{H}$, the squared MMD satisfies, for any rotation-invariant kernel $K$ verifying the condition of Theorem 1:

$$\mathrm{MMD}^2(Q, U) = \mathbb{E}_{z, z' \sim Q} \big[ \tilde{K}(z, z') \big], \quad \text{with } z, z' \text{ i.i.d.} \tag{6}$$
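The first statement of Lemma 2 can be checked numerically with a small Monte Carlo experiment. The sketch below is only a sanity check under assumptions, not a proof: it uses a kernel truncated at order 2 with the closed forms $P_1(q; t) = t$ and $P_2(q; t) = (q t^2 - 1)/(q - 1)$, samples uniform points on the sphere by normalizing Gaussian vectors, and uses hypothetical function names and arbitrary weights.

```python
import numpy as np

def sample_uniform_sphere(n, q, rng):
    """Draw n points uniformly on S^{q-1} by normalizing standard Gaussian vectors."""
    g = rng.standard_normal((n, q))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def phi(t, q, b0, b1, b2):
    """phi(t) = b0 * P_0 + b1 * P_1(q; t) + b2 * P_2(q; t), with P_0 = 1."""
    return b0 + b1 * t + b2 * (q * t**2 - 1.0) / (q - 1.0)

def mean_embedding_at(v, samples, q, b0, b1, b2):
    """Monte Carlo estimate of the integral of K(u, v) dU(u) at a fixed point v."""
    return float(np.mean(phi(samples @ v, q, b0, b1, b2)))
```

With enough samples, `mean_embedding_at` should return a value close to `b0` whatever the point `v` on the sphere, because the terms of order 1 and 2 average to zero under the uniform distribution.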

3.2.2. ESTIMATOR OF THE SQUARED MMD AND KERNEL CHOICES

The proposed uniformity loss $\ell_u$ for self-supervision is a biased estimator (Gretton et al., 2012) of $\mathrm{MMD}^2(Q, U)$ in (6). Given a batch $Z_I := \{z_i\}_{i \in I}$ sampled from $Q$, our uniformity loss is:

$$\ell_u(Z_I) = \widehat{\mathrm{MMD}}^2(Q, U; \{Z_I\}) := \frac{1}{|I|^2} \sum_{i \in I} \sum_{i' \in I} \tilde{K}(z_i, z_{i'}) = \frac{1}{|I|^2} \sum_{i \in I} \sum_{i' \in I} \tilde{\varphi}(z_i^\top z_{i'}). \tag{7}$$

In our framework, any rotation-invariant kernel satisfying the condition of Theorem 1 can be used to compute (7) and train a self-supervised model by minimizing (1). The uniformity term (7) can be interpreted as an energy functional (Brauchart et al., 2014): minimizing the average pairwise energy quantified by $\tilde{K}$ tends to scatter the embeddings evenly on the hypersphere. We now give examples of kernels that can be used for this uniformity term. This illustrates that our framework offers a unification of several strategies for self-supervision.

Example 1: RBF kernel. Using $K(u, v) = e^{-t \|u - v\|_2^2}$ (with $t > 0$) in the uniformity term $\ell_u$ (7) yields the regularization term from AUH, with the only difference that AUH uses the logarithm of the energy functional as its uniformity loss.

Example 2: Generalized distance kernel. It is defined as $K(u, v) := C - \|u - v\|_2^{2s - q + 1}$ with $\frac{q-1}{2} < s < \frac{q+1}{2}$ and $C > 0$ sufficiently large (Brauchart et al., 2014). A variation of this kernel choice is used, e.g., in the hard-contrastive loss of PointContrast for self-supervision on point clouds.

Example 3: Truncations of the Laplace-Fourier series. A truncated kernel up to order $L$ (Brauchart et al., 2014) is a kernel $K(u, v) = \sum_{\ell=0}^{L} b_\ell \, P_\ell(q; u^\top v)$, with $b_\ell \geq 0$ for $\ell = 0, \ldots, L$. It admits a closed-form expression given by the definition of Legendre polynomials $P_\ell(q; \cdot)$ in Theorem 1, e.g., $P_1(q; t) = t$, $P_2(q; t) = \frac{q t^2 - 1}{q - 1}$, $P_3(q; t) = \frac{(q+2) t^3 - 3t}{q - 1}$. We explore this kernel choice numerically in Section 4, since it has never been considered in previous self-supervision methods.
The expansion of $\varphi$ in Legendre polynomials (5) for the RBF (Example 1) and the generalized distance kernel (Example 2) verifies $b_\ell > 0$ for each integer $\ell$ (see Appendix B.2). By (Micchelli et al., 2006, Theorem 10), this is a necessary and sufficient condition for a rotation-invariant kernel to be universal, and universality is a sufficient condition for injectivity of the corresponding kernel mean embedding mapping, i.e., the kernel is characteristic (Fukumizu et al., 2004). The benefit of this property is to guarantee that the uniform distribution $U$ is the unique solution to the minimization problem:

$$\min \big\{ \mathrm{MMD}(Q, U) \mid Q \text{ is a probability distribution on } S^{q-1} \big\}.$$

In contrast, the truncated rotation-invariant kernel up to an order $L$ (Example 3) is not universal. Yet, our experiments in Section 4 show that truncated kernels up to order $L = 2, 3$ provide better results than, e.g., AUH, whose uniformity loss is based on the RBF kernel. In summary, the uniformity loss in our method, called SFRIK, corresponds to (7) with a truncated kernel up to order $L = 3$ and satisfies:

$$\ell_u(\{z_i\}_{i \in I}) = \frac{1}{|I|^2} \sum_{i \in I} \sum_{i' \in I} \left( b_1 \, z_i^\top z_{i'} + b_2 \, \frac{q (z_i^\top z_{i'})^2 - 1}{q - 1} + b_3 \, \frac{(q + 2)(z_i^\top z_{i'})^3 - 3 z_i^\top z_{i'}}{q - 1} \right), \tag{8}$$

where $b_\ell \geq 0$, $\ell = 1, 2, 3$, are hyperparameters, and $q$ is the dimension of the image embedding $z_i$.
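The uniformity loss with a kernel truncated at order $L = 3$ only needs the matrix of pairwise inner products of the batch. The NumPy sketch below illustrates this computation; it is a minimal sketch rather than the authors' implementation, the function name is hypothetical, and the default weights $(1, 40, 40)$ are the values reported in Section 4.2 for $q = 8192$, used here purely for illustration.

```python
import numpy as np

def sfrik_uniformity(z, b1=1.0, b2=40.0, b3=40.0):
    """Biased squared-MMD estimate to the uniform distribution for a rotation-invariant
    kernel truncated at order 3. z has shape (n, q) with unit-norm rows."""
    n, q = z.shape
    t = z @ z.T                                        # pairwise inner products, shape (n, n)
    p1 = t                                             # P_1(q; t)
    p2 = (q * t**2 - 1.0) / (q - 1.0)                  # P_2(q; t)
    p3 = ((q + 2.0) * t**3 - 3.0 * t) / (q - 1.0)      # P_3(q; t)
    return float(np.mean(b1 * p1 + b2 * p2 + b3 * p3))
```

Two limiting cases help to read the value: a fully collapsed batch (all embeddings identical) gives $b_1 + b_2 + b_3$, since $P_\ell(q; 1) = 1$, while a batch made of the $2q$ signed canonical basis vectors gives zero, because its first three moments match those of the uniform distribution.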

3.3. CONNECTION WITH INFORMATION-MAXIMIZATION METHODS

We now show that choosing an appropriate kernel in the proposed uniformity term (7) leads to a regularizer that maximizes a statistical measure of information analogous to the one used in VICReg. To the best of our knowledge, this is the first connection made between uniformity-based and information-maximization methods. The regularization loss of VICReg is a weighted sum between two terms:

$$v(Z_I) := \frac{1}{q} \sum_{j=1}^{q} \max\!\left(0, \, \gamma - \sqrt{\mathrm{Var}(z_I^j) + \varepsilon}\right), \qquad c(Z_I) := \frac{1}{q} \sum_{1 \leq j \neq j' \leq q} [C(Z_I)]_{j,j'}^2, \tag{9}$$

for a batch of image embeddings $Z_I := \{z_i\}_{i \in I}$, where $z^j$ denotes the $j$-th coordinate of a (random) vector $z$ and $\varepsilon$ is a fixed small scalar. The variance term $v(Z_I)$ enforces the empirical variance $\mathrm{Var}(z_I^j) := \frac{1}{|I| - 1} \sum_{i \in I} (z_i^j - \bar{z}^j)^2$ in each coordinate $j = 1, \ldots, q$ to be above a certain threshold $\gamma^2 > 0$ (here $\bar{z}$ is the empirical mean of $Z_I$). The covariance term $c(Z_I)$ enforces the non-diagonal entries of the empirical covariance matrix $C(Z_I) := \frac{1}{|I| - 1} \sum_{i \in I} (z_i - \bar{z})(z_i - \bar{z})^\top$ to be zero.

In order to connect VICReg and SFRIK, let us consider for simplicity a truncated kernel $\tilde{K}(u, v) = \sum_{\ell=1}^{L} b_\ell \, P_\ell(q; u^\top v)$ of order $L = 2$ (the reasoning would be the same if the kernel was not truncated), and assume $b_1, b_2 > 0$. By the addition theorem (Müller, 2012, Theorem 2, §1), there exists a feature map $\Phi : S^{q-1} \to \mathbb{R}^m$ involving an orthonormal basis of spherical harmonics (homogeneous harmonic polynomials restricted to the hypersphere) of order 1 and 2 such that $\Phi(u)^\top \Phi(v) = \tilde{K}(u, v)$. Hence, the kernel mean embedding of a distribution in the associated RKHS contains its first and second-order moments (see Appendix B.3).
Therefore, denoting $N(q, \ell)$ the dimension of the space of spherical harmonics of order $\ell$ in dimension $q$, and defining $\Phi_\ell : S^{q-1} \to \mathbb{R}^{N(q, \ell)}$, $z \mapsto (Y_{\ell,k}(z))_{k=1}^{N(q, \ell)}$ for $\ell \in \{1, 2\}$ with

$$\{Y_{1,k}\}_{k=1}^{N(q,1)} := \big\{ u \mapsto u^j \mid 1 \leq j \leq q \big\}, \qquad \{Y_{2,k}\}_{k=1}^{N(q,2)} := \big\{ u \mapsto u^j u^{j'} \mid 1 \leq j < j' \leq q \big\} \cup \big\{ u \mapsto (u^j)^2 - \tfrac{1}{q} \mid 2 \leq j \leq q \big\}, \tag{10}$$

it is possible to show (see Appendix B.3) that the squared MMD (6) can be written as

$$\mathrm{MMD}^2(Q, U) = a_1 \big\| M_1 \, \mathbb{E}_{z \sim Q}[\Phi_1(z)] \big\|_2^2 + a_2 \big\| M_2 \, \mathbb{E}_{z \sim Q}[\Phi_2(z)] \big\|_2^2, \tag{11}$$

where $a_\ell := b_\ell \, |S^{q-1}| / N(q, \ell)$ for $\ell \in \{1, 2\}$, and $M_1, M_2$ are two lower triangular matrices with nonzero diagonal entries. Hence, when $Q$ plays the role of the embedding distribution during self-supervised training, minimizing $\mathrm{MMD}^2(Q, U)$ in (11), as we propose for regularizing the embedding distribution, promotes its expectation $\mathbb{E}_{z \sim Q}[z]$ and its autocorrelation matrix $\mathbb{E}_{z \sim Q}[z z^\top]$ to be close to $0$ and $q^{-1} I_q$ respectively, where $I_q$ is the identity matrix. When $\mathrm{MMD}^2(Q, U) = 0$, the covariance matrix is equal to $\mathbb{E}[(z - \mathbb{E}[z])(z - \mathbb{E}[z])^\top] = \mathbb{E}[z z^\top] - \mathbb{E}[z] \mathbb{E}[z]^\top = \frac{1}{q} I_q$ because $b_1, b_2 > 0$ and the two terms on the right-hand side of (11) are null.

In conclusion, the regularization both in VICReg and SFRIK induces the embedding distribution to have a covariance matrix with zero non-diagonal entries. The diagonal entries of the covariance matrix are encouraged to be equal to $1/q$ in SFRIK, and greater than $\gamma^2$ in VICReg (we recall that the image embeddings $\{z_i\}_{i \in I}$ are not $\ell_2$-normalized in VICReg). However, one difference in terms of regularization behavior is that SFRIK encourages the expectation of the embedding distribution to be zero, as shown in the first term of (11). This is not the case for VICReg, as we can see in (9). Finally, the memory and computational complexities for computing the uniformity term (8) in SFRIK are $O(|I|^2)$ and $O(q |I|^2)$, as opposed to $O(q^2)$ and $O(q^2 |I|)$ for the variance and covariance terms (9) in VICReg.
In the setting where SFRIK and VICReg work best, i.e., larger dimension q and smaller batch size |I|, SFRIK has the lowest memory and computational complexities. This computational advantage is due to the kernel trick and it is illustrated in Section 4.
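The complexity argument can be illustrated on the second-order part of the energy. The average of squared pairwise inner products can be computed either from the $|I| \times |I|$ Gram matrix (kernel trick) or from the $q \times q$ second-moment matrix, and both routes give the same number. The sketch below is an illustration with hypothetical function names; it shows the algebraic identity only and is not an implementation of the VICReg regularizer, which uses the centered covariance matrix.

```python
import numpy as np

def second_order_energy_gram(z):
    """Mean of squared pairwise inner products via the (n, n) Gram matrix:
    O(n^2) memory and O(q n^2) time."""
    gram = z @ z.T
    return float(np.mean(gram**2))

def second_order_energy_moment(z):
    """Same quantity via the (q, q) second-moment matrix:
    O(q^2) memory and O(q^2 n) time."""
    n = z.shape[0]
    moment = (z.T @ z) / n
    return float(np.sum(moment**2))
```

When the embedding dimension $q$ is much larger than the batch size, the Gram-matrix route manipulates a much smaller matrix, which is the regime in which the kernel formulation is cheaper.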

4. EXPERIMENTS

We first demonstrate numerically that the regularization loss (8) of SFRIK outperforms existing alternatives, in a rigorous experimental setting with a subset of ImageNet-1000 (Deng et al., 2009) for pretraining and a separate validation set for hyperparameter tuning. Then, we pretrain a ResNet-50 backbone (He et al., 2016) with SFRIK on the full ImageNet dataset and show competitive results compared to the state of the art, with significant computational benefits during pretraining.

4.1. EXPERIMENTAL SETTING

The backbone $f_\theta$ is either ResNet-18 or ResNet-50, depending on the experiment. Following Zbontar et al. (2021), the projection head $g_w$ is a three-layer MLP made of two hidden layers with ReLU activation and batch normalization (Ioffe & Szegedy, 2015), and a linear output layer. Unless otherwise specified, the size (number of neurons) of the two hidden layers is the same as the one, denoted $q$, of the output layer and the default value is $q = 8192$. The augmentations used for transforming images into views are the same as the ones used in VICReg. The backbone and the projection head are trained with a LARS optimizer (You et al., 2017). The weight decay is fixed at $10^{-6}$. The learning rate scheduling starts with 10 warm-up epochs (Goyal et al., 2017) with a linear increase from 0 to initial_lr = base_lr × bs/256, where base_lr is called the base learning rate (Goyal et al., 2017) and bs is the batch size, followed by a cosine decay (Loshchilov & Hutter, 2017) with a final learning rate 1000 times smaller than initial_lr. For pretraining, we consider a 20% subset of ImageNet-1000 (denoted by IN20%), like in (Gidaris et al., 2021), and 100% of ImageNet-1000 (denoted by IN100%). In IN20%, we keep all the 1000 classes but only 260 images per class.
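The learning rate schedule described above (linear warm-up to base_lr × bs/256, then cosine decay to a value 1000 times smaller) can be sketched as a function of the training step. This is a hedged illustration: the function name and the per-step granularity are assumptions, since the text specifies the warm-up in epochs and does not say whether the rate is updated per step or per epoch.

```python
import math

def learning_rate(step, total_steps, warmup_steps, base_lr, batch_size):
    """Linear warm-up from 0 to initial_lr, then cosine decay to initial_lr / 1000."""
    initial_lr = base_lr * batch_size / 256
    final_lr = initial_lr / 1000
    if step < warmup_steps:
        # linear increase from 0 to initial_lr
        return initial_lr * step / warmup_steps
    # cosine decay from initial_lr down to final_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (initial_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, with a base learning rate of 0.2 and a batch size of 2048, the peak value reached at the end of warm-up is 0.2 × 2048 / 256 = 1.6.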

4.2. SFRIK'S REGULARIZER OUTPERFORMS EXISTING ALTERNATIVES ON IN20%

Many existing self-supervision methods are based on the Siamese architecture and have the same form of training loss $\lambda \, \ell_a(Z_I^{(1)}, Z_I^{(2)}) + \mu \, \ell_r(Z_I^{(1)}, Z_I^{(2)})$. This is the case of SimCLR, AUH and VICReg, for which Appendix B.4 gives the expression of the regularization loss $\ell_r$. For SFRIK, following (2), we have $\mu = 0.5$ and $\ell_r(Z_I^{(1)}, Z_I^{(2)}) = \ell_u(Z_I^{(1)}) + \ell_u(Z_I^{(2)})$ with $\ell_u$ given by (8).

Protocol. To isolate the impact of $\ell_r$ on the quality of the learned representations, we (re)implement all four of these methods in the setting of Section 4.1, to get rid of the influence of other design choices, like image augmentations or projection head architecture. We fix the batch size at 2048, and tune the base learning rate and hyperparameters specific to each method's loss. We also compare different embedding dimensions $q \in \{1024, 2048, 4096, 8192\}$. In order to perform an extensive hyperparameter tuning by grid search of each method for fair comparisons, we choose a smaller backbone and a reduced dataset for pretraining, i.e., we pretrain a ResNet-18 on IN20% for 100 epochs with all methods. Pretrained backbones are then evaluated by linear probing trained on IN20% with labels.

Number of hyperparameters.

Note that in total SFRIK with $L = 2$ has as many hyperparameters to tune as AUH or VICReg, and SFRIK with $L = 3$ has a single additional hyperparameter.

Rigorous hyperparameter tuning. In contrast to the common practice in the literature where hyperparameters are directly selected on the evaluation dataset, we choose to tune hyperparameters on a separate validation set.

Results. Table 2 shows that SFRIK at optimal truncation order $L = 3$ outperforms SimCLR, AUH and VICReg by at least 0.7 points at $q = 8192$. The gain in top-1 accuracy by linear probing between SFRIK at $L = 1$ and $L = 2$ is important, but is smaller between $L = 2$ and $L = 3$. This suggests that $L > 3$ is likely to improve performance only marginally, while requiring more hyperparameter tuning, which is why we did not explore $L > 3$. We also remark that all methods benefit from an increase in embedding dimension $q$, including SimCLR, which was originally introduced with a smaller dimension. See Appendix D.

Ablation. Table 3 confirms empirically that a truncated kernel is better than the RBF or the generalized distance kernel for the uniformity term (7). During tuning we observed that the truncated kernel performs well when the weights $b_2, b_3$ in (8) are larger than $b_1$, e.g., $(b_1, b_2, b_3) = (1, 40, 40)$ for $q = 8192$. This contrasts with the RBF and the generalized distance kernel, for which the weights $b_\ell$ decay polynomially with respect to $\ell$ (see Appendix B.2). This suggests that it is important to focus more on orders 2 and 3 than on order 1 in the Legendre expansion (5).

4.3. RESULTS FOR RESNET-50 PRETRAINED ON IN100%

Protocol. We pretrain a ResNet-50 with SFRIK on IN100% under the setting of Section 4.1, with a batch size of 2048. We study the impact of a larger embedding dimension in SFRIK by considering a projection head with two hidden layers of size 8192, and an output layer of size $q \in \{8192, 16384, 32768\}$. Truncation order is either $L = 2$ or $L = 3$. For comparison, we also pretrain a ResNet-50 with VICReg under the same setting with $q = 8192$. Similarly to the original paper (Bardes et al., 2022), the alignment, variance and covariance weights are respectively 25, 25, 1, and the base learning rate is 0.2 for VICReg. All pretrained backbones are evaluated by: linear probing on IN100%; linear classification on Places205 and VOC2007 in order to measure how the learned representations generalize to an unseen dataset; and semi-supervised learning with few labels of IN100% (backbones are fine-tuned for classification using 1% or 10% of labeled images).

Computational complexity. We show under this protocol that SFRIK's time and memory complexity during pretraining is significantly smaller than the one of VICReg for large dimensions. This allows us to scale SFRIK to dimension 16384 and even to 32768 for better results on downstream tasks. We measure the peak memory per GPU during pretraining on IN100% with a batch size of 2048 and the pretraining wall time of both methods on 8× AMD Radeon Instinct MI50 32GB GPUs:
• at $q = 8192$, SFRIK is 8% faster than VICReg and needs 3% less memory per GPU;
• at $q = 16384$, SFRIK is 19% faster than VICReg and needs 8% less memory per GPU;
• at $q = 32768$, SFRIK is still 2% faster than VICReg run in the lower dimension 16384. It only requires 30.9GB per GPU while VICReg at $q = 32768$ needs more than the available memory.
Table 5 emphasizes this memory advantage at reduced batch sizes.
Ideally we could combine these ingredients with our generic kernel framework to design an improved version of SFRIK that can still benefit from its computational advantages over VICReg.

5. CONCLUSION

We proposed a regularization loss family based on the MMD and rotation-invariant kernels. We demonstrated that several regularizers of former methods are indeed variants of our flexible loss with different kernels. This generic regularization approach allowed us to leverage degrees of freedom in rotation-invariant kernel design to improve self-supervision methods. In practice, using a truncated kernel, we derived from the proposed framework a fully competitive self-supervised pretraining method, SFRIK, which significantly reduces time and memory complexity during pretraining compared to information-maximization methods. Combining the approach with kernel approximation techniques such as quadrature rules or random feature expansions offers promising perspectives to further enhance the ability to perform self-supervised training with limited computational resources.

A EXTENDED RELATED WORK

We further discuss some related works referenced in the main paper.

A.1 REMINDERS ON KERNEL MEAN EMBEDDINGS

In this appendix we provide a high-level introduction to the notion of kernel mean embedding. We refer the reader to (Muandet et al., 2017) for a complete survey. The idea of kernel mean embedding is to encode a probability distribution in an RKHS $\mathcal{H}$. Denoting $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ the reproducing kernel of $\mathcal{H}$ defined on some space $\mathcal{X}$, the kernel mean embedding of a probability distribution $Q$ defined on $\mathcal{X}$ is

$$\mu_Q := \int_{\mathcal{X}} K(u, \cdot) \, dQ(u) \in \mathcal{H}. \tag{12}$$

In other words, the kernel mean embedding mapping $Q \mapsto \mu_Q$ transforms a probability distribution into an element in $\mathcal{H}$. As an application, this allows one to quantify the divergence between probabilities using the norm $\|\cdot\|_{\mathcal{H}}$ associated to $\mathcal{H}$. Given two probability distributions $Q_1, Q_2$ defined on $\mathcal{X}$, one can indeed quantify their divergence by

$$\| \mu_{Q_1} - \mu_{Q_2} \|_{\mathcal{H}} = \left\| \int_{\mathcal{X}} K(u, \cdot) \, dQ_1(u) - \int_{\mathcal{X}} K(u, \cdot) \, dQ_2(u) \right\|_{\mathcal{H}}, \tag{13}$$

which is precisely the MMD between $Q_1$ and $Q_2$ defined in (4).
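For two finite samples, the squared norm of the difference between empirical kernel mean embeddings expands into three averages of kernel evaluations, $\|\mu_{Q_1} - \mu_{Q_2}\|_{\mathcal{H}}^2 = \mathbb{E}[K(x, x')] + \mathbb{E}[K(y, y')] - 2\,\mathbb{E}[K(x, y)]$. The NumPy sketch below illustrates this standard biased estimator; the function names and the RBF bandwidth `t=2.0` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def rbf_kernel(a, b, t=2.0):
    """RBF kernel matrix exp(-t * ||a_i - b_j||^2) between rows of a and rows of b."""
    sq_dist = (np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-t * np.maximum(sq_dist, 0.0))

def mmd_squared(x, y, kernel=rbf_kernel):
    """Biased estimate of MMD^2 between the empirical distributions of x and y."""
    k_xx = kernel(x, x).mean()
    k_yy = kernel(y, y).mean()
    k_xy = kernel(x, y).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)
```

The estimate is zero when both samples coincide and grows as the two samples move apart.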

A.2 SAMPLE-CONTRASTIVE CRITERION

Given a batch of embeddings $Z_I := \{z_i\}_{i \in I}$ (that are not necessarily normalized), the general sample-contrastive criterion of Garrido et al. (2023) is defined by:

$$\ell_c(Z_I) = \sum_{i, i' \in I, \, i \neq i'} (z_i^\top z_{i'})^2. \tag{14}$$

Garrido et al. (2023) show that this criterion is minimized in many contrastive learning methods like (HaoChen et al., 2021). In the case where the embeddings are normalized, this sample-contrastive criterion can be derived from the proposed generic uniformity loss $\ell_u$ defined by (7) with the quadratic kernel $K(u, v) = (u^\top v)^2$ where $u, v \in S^{q-1}$, as claimed in Section 1. Indeed, since $\|z_i\|_2 = 1$ for all $i \in I$:

$$\ell_c(Z_I) = \sum_{i, i' \in I} (z_i^\top z_{i'})^2 - \sum_{i \in I} (z_i^\top z_i)^2 = |I|^2 \, \ell_u(Z_I) - |I|, \tag{15}$$

where $|I|$ is the batch size. Therefore, the sample-contrastive criterion in the normalized case is an estimator of the MMD associated to the quadratic kernel between the embedding distribution and the uniform distribution on the hypersphere.
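The relation between the sample-contrastive criterion and the average quadratic-kernel energy of a normalized batch can be verified numerically. The following NumPy sketch uses hypothetical function names and simply evaluates both sides of the identity on a batch of unit-norm embeddings.

```python
import numpy as np

def sample_contrastive(z):
    """Sum over pairs i != i' of the squared inner products (z_i^T z_i')^2."""
    gram_sq = (z @ z.T) ** 2
    return float(gram_sq.sum() - np.trace(gram_sq))

def quadratic_kernel_energy(z):
    """Average over all pairs (i, i') of the quadratic kernel (z_i^T z_i')^2."""
    return float(np.mean((z @ z.T) ** 2))
```

For a batch of $n$ unit-norm rows, `sample_contrastive(z)` should equal `n**2 * quadratic_kernel_energy(z) - n` up to floating-point error, since each diagonal term of the squared Gram matrix is 1.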

A.3 KERNEL DEPENDENCE MAXIMIZATION

We further explain the positioning of our paper with respect to (Li et al., 2021), which proposes a self-supervised learning method based on kernel dependence maximization, using the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005). The HSIC measures the dependence between two random variables $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ using two RKHS $\mathcal{F}$ on $\mathcal{X}$ with kernel $k$ and $\mathcal{G}$ on $\mathcal{Y}$ with kernel $l$, in order to capture nonlinear correlations. It is defined as the squared MMD associated to the reproducing kernel of the tensor product space $\mathcal{F} \otimes \mathcal{G}$ between the joint probability distribution $P_{X,Y}$ and the product $P_X P_Y$ of marginal probability distributions:

$$\mathrm{HSIC}(X, Y) := \big\| \mu_{P_{X,Y}} - \mu_{P_X P_Y} \big\|_{\mathcal{F} \otimes \mathcal{G}}^2, \tag{16}$$

where $Q \mapsto \mu_Q$ is the kernel mean embedding mapping defined by (12). Then, the self-supervised learning loss in (Li et al., 2021) is defined as:

$$\mathcal{L}_{\text{SSL-HSIC}} := -\mathrm{HSIC}(Z, Y) + \gamma \, \mathrm{HSIC}(Z, Z), \tag{17}$$

where $Z$ encodes embeddings of transformed images, and $Y$ encodes image identity as the index of the original image (before transformation) in the training dataset. By maximizing $\mathrm{HSIC}(Z, Y)$, the backbone learns image representations that are invariant to image transformations. To avoid collapse, high-variance representations are penalized by minimizing $\mathrm{HSIC}(Z, Z)$. This is similar to previous information-maximization methods (Bardes et al., 2022; Zbontar et al., 2021), with the difference that they take into account nonlinear correlations using kernels.

Although both our approach and the one in (Li et al., 2021) view self-supervised learning as a kernel method, we highlight here a main distinction between the two works. Both approaches use the MMD, but they do not use it to measure the same quantity. As explained above, Li et al. (2021) use the MMD to measure dependency between random variables (like $Z$ and $Y$), while the regularization loss we propose uses the MMD to measure the divergence between the embedding distribution and the uniform distribution on the hypersphere.
As explained in Sections 1 and 2, this kernel approach for self-supervised learning is new in the literature and allows for the unification of several previous self-supervised learning methods, as illustrated in Table 1. Note that when the image identity $Y$ is a one-hot encoding, Li et al. (2021) show that

$$-\mathrm{HSIC}(Z, Y) = C \left( -\mathbb{E}_{(Z_1, Z_2) \sim \mathrm{pos}}\left[k(Z_1, Z_2)\right] + \mathbb{E}_{Z_3} \mathbb{E}_{Z_4}\left[k(Z_3, Z_4)\right] \right), \tag{18}$$

where $C > 0$ is a constant, $(Z_1, Z_2)$ is a positive pair of embeddings, i.e., they are embeddings of two transformations of the same original image, and $(Z_3, Z_4)$ is a pair of independent embeddings. In other words, $-\mathrm{HSIC}(Z, Y)$ is proportional to the sum of an alignment term $-\mathbb{E}_{(Z_1, Z_2) \sim \mathrm{pos}}[k(Z_1, Z_2)]$ and an energy term $\mathbb{E}_{Z_3} \mathbb{E}_{Z_4}[k(Z_3, Z_4)]$, similarly to (3) combined with (7), which yields the proposed loss (2). Our paper shows that, if $k(\cdot, \cdot)$ is a rotation-invariant kernel on the hypersphere, then the energy term $\mathbb{E}_{Z_3} \mathbb{E}_{Z_4}[k(Z_3, Z_4)]$ is precisely the MMD between the embedding distribution and the uniform distribution on the hypersphere, cf. (6).

However, there are two differences between the maximization of $\mathrm{HSIC}(Z, Y)$ and the minimization of the proposed loss (2). First, the alignment term and the energy term in (18) are quantified with the same kernel $k(\cdot, \cdot)$, which is not the case in (2), where the alignment term is quantified by the $\ell_2$-distance between embeddings (equivalent to the linear kernel when the embeddings are normalized), and the uniformity term (7) is quantified by another rotation-invariant kernel. Second, the loss (2) is a weighted sum of the alignment loss (3) and the uniformity loss (7), controlled by the hyperparameter $\lambda$ that tunes the balance between the two terms, which is not the case in (18).

B THEORETICAL RESULTS

We provide proofs and more details about the theoretical results in the main text.

B.1 PROOF OF LEMMA 2

Consider a rotation-invariant kernel $K(u, v)$ defined on the hypersphere $S^{q-1}$ such that:

$$K(u, v) = \sum_{\ell=0}^{+\infty} b_\ell P_\ell(q; u^\top v), \quad \forall u, v \in S^{q-1},$$

with weights $b_\ell \geq 0$ and $P_\ell(q; \cdot)$ the Legendre polynomial of order $\ell$ in dimension $q$. The proof of Lemma 2 relies on an orthonormal system of spherical harmonics. Let $\langle f, g \rangle_{(q)} := \int_{S^{q-1}} f g \, d\sigma_{q-1}$ be the inner product in the space of continuous functions defined on $S^{q-1}$ and, for each $\ell \in \mathbb{N}$, consider $\{Y_{\ell,k} \mid k = 1, \ldots, N(q, \ell)\}$ an orthonormal basis of spherical harmonics of order $\ell$ in dimension $q$ (homogeneous harmonic polynomials in $q$ variables restricted to $S^{q-1}$; see, e.g., Müller (2012) for more details), where $N(q, \ell)$ denotes the dimension of this space, which is by (Müller, 2012, Exercise 6, §3):

$$N(q, \ell) = \begin{cases} 1 & \text{for } \ell = 0, \\ q & \text{for } \ell = 1, \\ \dfrac{(2\ell + q - 2)\,(\ell + q - 3)!}{\ell!\,(q-2)!} & \text{for } \ell \geq 2. \end{cases} \tag{20}$$

By the addition theorem (Müller, 2012, Theorem 2, §1):

$$\sum_{k=1}^{N(q,\ell)} Y_{\ell,k}(u)\, Y_{\ell,k}(v) = \frac{N(q,\ell)}{|S^{q-1}|}\, P_\ell(q; u^\top v), \quad u, v \in S^{q-1}. \tag{21}$$

Hence, the kernel $K(u, v)$ can be rewritten as:

$$K(u, v) = \sum_{\ell=0}^{+\infty} \sum_{k=1}^{N(q,\ell)} b_\ell\, \frac{|S^{q-1}|}{N(q,\ell)}\, Y_{\ell,k}(u)\, Y_{\ell,k}(v).$$

Since $\{Y_{\ell,k} \mid \ell = 0, \ldots, +\infty,\ k = 1, \ldots, N(q,\ell)\}$ is an orthonormal system for the inner product $\langle \cdot, \cdot \rangle_{(q)}$, and since $Y_{0,1}$ is constant on $S^{q-1}$, we have for any integer $\ell$ and $k \in \{1, \ldots, N(q,\ell)\}$ that:

$$\int_{S^{q-1}} Y_{\ell,k} \, d\sigma_{q-1} = \frac{1}{Y_{0,1}} \langle Y_{\ell,k}, Y_{0,1} \rangle_{(q)} = \begin{cases} \dfrac{1}{Y_{0,1}} & \text{if } \ell = 0,\ k = 1, \\ 0 & \text{otherwise.} \end{cases}$$

Moreover, $Y_{0,1} = 1/\sqrt{|S^{q-1}|}$, because $1 = \langle Y_{0,1}, Y_{0,1} \rangle_{(q)} = \int_{S^{q-1}} Y_{0,1}^2 \, d\sigma_{q-1} = Y_{0,1}^2\, |S^{q-1}|$.
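Therefore, the kernel mean embedding of the uniform distribution on the hypersphere $U := \sigma_{q-1} / |S^{q-1}|$ associated to the kernel $K$ is, for all $v \in S^{q-1}$:

$$\begin{aligned}
\int_{S^{q-1}} K(u, v)\, dU(u) &= \int_{S^{q-1}} \sum_{\ell=0}^{+\infty} b_\ell P_\ell(q; u^\top v)\, \frac{d\sigma_{q-1}(u)}{|S^{q-1}|} \\
&= \sum_{\ell=0}^{+\infty} \int_{S^{q-1}} \sum_{k=1}^{N(q,\ell)} \frac{b_\ell}{N(q,\ell)}\, Y_{\ell,k}(u)\, Y_{\ell,k}(v)\, d\sigma_{q-1}(u) \\
&= \sum_{\ell=0}^{+\infty} \frac{b_\ell}{N(q,\ell)} \sum_{k=1}^{N(q,\ell)} \left( \int_{S^{q-1}} Y_{\ell,k}(u)\, d\sigma_{q-1}(u) \right) Y_{\ell,k}(v) \\
&= b_0\, \frac{1}{Y_{0,1}}\, Y_{0,1} = b_0,
\end{aligned}$$

where we inverted series and integral in the second equation using the dominated convergence theorem: the series $\sum_{\ell=0}^{+\infty} b_\ell P_\ell(q; u^\top v)$ converges for every $u, v$, and for any $L$, $\left| \sum_{\ell=0}^{L} b_\ell P_\ell(q; u^\top v) \right| \leq \sum_{\ell=0}^{L} b_\ell \left| P_\ell(q; u^\top v) \right| \leq \sum_{\ell=0}^{+\infty} b_\ell = \sum_{\ell=0}^{+\infty} b_\ell P_\ell(q; 1)$, because $|P_\ell(q; \cdot)| \leq 1$ for all $\ell$ by (Müller, 2012, Lemma 2, §8), $P_\ell(q; 1) = 1$ for all $\ell$ by (Müller, 2012, §9), and the constant $\sum_{\ell=0}^{+\infty} b_\ell P_\ell(q; 1) < +\infty$ is integrable on $S^{q-1}$. This yields the first claim of Lemma 2.

Consider now any probability distribution $Q$ defined on the hypersphere. The kernel mean embedding of $Q$ is simply rewritten as, for all $v \in S^{q-1}$:

$$\begin{aligned}
\int_{S^{q-1}} K(u, v)\, dQ(u) &= \int_{S^{q-1}} \sum_{\ell=0}^{+\infty} b_\ell P_\ell(q; u^\top v)\, dQ(u) \\
&= b_0 \int_{S^{q-1}} P_0(q; u^\top v)\, dQ(u) + \int_{S^{q-1}} \sum_{\ell=1}^{+\infty} b_\ell P_\ell(q; u^\top v)\, dQ(u) \\
&= b_0 + \int_{S^{q-1}} \tilde{K}(u, v)\, dQ(u),
\end{aligned}$$

where $\tilde{K}(u, v) := \sum_{\ell=1}^{+\infty} b_\ell P_\ell(q; u^\top v)$ denotes the kernel $K$ without its constant term, because the Legendre polynomial of order 0 is the constant function equal to 1 (see the closed-form expression of $P_0(q; \cdot)$ in Theorem 1 in the main text) and $\int_{S^{q-1}} dQ = 1$. This ends the proof of Lemma 2.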
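The first claim of Lemma 2 can be checked numerically in small dimension. The sketch below is only an illustration, not part of the proof: it assumes the standard three-term recurrence for the Legendre polynomials in dimension $q$, and it uses the fact that, when $u$ is uniform on $S^{q-1}$, the inner product $t = u^\top v$ has a density proportional to $(1 - t^2)^{(q-3)/2}$ on $[-1, 1]$. The function names and the example weights are hypothetical.

```python
import math


def legendre(q, ell, t):
    """Legendre polynomial P_ell(q; t) in dimension q, normalised so P_ell(q; 1) = 1."""
    if ell == 0:
        return 1.0
    p_prev, p_curr = 1.0, t  # P_0 and P_1
    for n in range(2, ell + 1):
        # Three-term recurrence: (n+q-3) P_n = (2n+q-4) t P_{n-1} - (n-1) P_{n-2}.
        p_prev, p_curr = p_curr, ((2 * n + q - 4) * t * p_curr - (n - 1) * p_prev) / (n + q - 3)
    return p_curr


def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def uniform_mean_embedding(weights, q):
    """Integral of K(u, v) against the uniform distribution, for K = sum_l b_l P_l."""
    density = lambda t: (1.0 - t * t) ** ((q - 3) / 2)  # unnormalised density of t = u^T v
    kernel = lambda t: sum(b * legendre(q, ell, t) for ell, b in enumerate(weights))
    return simpson(lambda t: kernel(t) * density(t), -1.0, 1.0) / simpson(density, -1.0, 1.0)


if __name__ == "__main__":
    weights = [0.7, 1.0, 0.4, 0.2]  # hypothetical b_0, ..., b_3
    for q in (3, 5, 7):
        print(q, uniform_mean_embedding(weights, q))  # each value should be close to b_0
```

Simpson's rule is used here only because the weight function is smooth for odd $q \geq 3$; for the large embedding dimensions used in the experiments, this quadrature is not meant to be practical.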

B.2 LEGENDRE EXPANSION OF ROTATION-INVARIANT KERNELS

We show that the kernel weights $b_\ell$ in the Legendre expansion (5) of the RBF kernel and the generalized distance kernel decay with a rate at least polynomial with respect to $\ell$.

RBF kernel

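The RBF kernel is defined as:

$$K_{\mathrm{RBF}}(u, v) = e^{-\sigma \|u - v\|_2^2} = e^{-2\sigma(1 - u^\top v)} \quad \text{for } u, v \in S^{q-1},$$

where $\sigma > 0$ is the scale of the RBF kernel. Denote $\varphi(t) := e^{-2\sigma(1-t)}$ for $t \in [-1, 1]$. Since the RBF kernel is positive definite and rotation-invariant, by Theorem 1, there exist weights $b_\ell \geq 0$, $\ell = 0, \ldots, +\infty$, such that:

$$\varphi(t) = e^{-2\sigma(1-t)} = \sum_{\ell=0}^{+\infty} b_\ell P_\ell(q; t) \quad \text{for } t \in [-1, 1]. \tag{27}$$

The Legendre polynomials $P_\ell(q; \cdot)$ are orthogonal on the interval $[-1, 1]$ with respect to the weight function $(1 - t^2)^{\frac{q-3}{2}}$, see, e.g., Müller (2012):

$$\int_{-1}^{1} P_n(q; t)\, P_m(q; t)\, (1 - t^2)^{\frac{q-3}{2}}\, dt = 0 \quad \text{for } m \neq n.$$

Moreover, by (Müller, 2012, Exercise 3, §2):

$$\int_{-1}^{1} \left( P_n(q; t) \right)^2 (1 - t^2)^{\frac{q-3}{2}}\, dt = \frac{|S^{q-1}|}{|S^{q-2}|}\, \frac{1}{N(q, n)} \quad \text{for any } n.$$

We multiply (27) by $P_\ell(q; t)(1 - t^2)^{\frac{q-3}{2}}$ and integrate the equation on $[-1, 1]$:

$$\begin{aligned}
\int_{-1}^{1} \varphi(t)\, P_\ell(q; t)\, (1 - t^2)^{\frac{q-3}{2}}\, dt &= \int_{-1}^{1} \sum_{n=0}^{+\infty} b_n P_n(q; t)\, P_\ell(q; t)\, (1 - t^2)^{\frac{q-3}{2}}\, dt \\
&= \sum_{n=0}^{+\infty} b_n \int_{-1}^{1} P_n(q; t)\, P_\ell(q; t)\, (1 - t^2)^{\frac{q-3}{2}}\, dt \\
&= b_\ell\, \frac{|S^{q-1}|}{|S^{q-2}|}\, \frac{1}{N(q, \ell)},
\end{aligned}$$

where the inversion between series and integral is justified by the dominated convergence theorem: the series $\sum_{n=0}^{+\infty} b_n P_n(q; t) P_\ell(q; t)(1 - t^2)^{\frac{q-3}{2}}$ converges for every $t$, and for any $N$, $\left| \sum_{n=0}^{N} b_n P_n(q; t) P_\ell(q; t)(1 - t^2)^{\frac{q-3}{2}} \right| \leq \sum_{n=0}^{+\infty} b_n (1 - t^2)^{\frac{q-3}{2}} =: g(t)$ since $|P_n(q; \cdot)| \leq 1$ for any $n$ by (Müller, 2012, Lemma 2, §8), and $g$ is integrable on $[-1, 1]$ because $\sum_{n=0}^{+\infty} b_n < +\infty$. Hence:

$$b_\ell = N(q, \ell)\, \frac{|S^{q-2}|}{|S^{q-1}|} \int_{-1}^{1} \varphi(t)\, P_\ell(q; t)\, (1 - t^2)^{\frac{q-3}{2}}\, dt.$$

By the Rodrigues rule (Müller, 2012, Exercise 1, §2), since $\varphi$ has continuous derivatives of all orders on $[-1, 1]$, we have:

$$b_\ell = N(q, \ell)\, \frac{|S^{q-2}|}{|S^{q-1}|}\, \frac{\Gamma\!\left(\frac{q-1}{2}\right)}{2^\ell\, \Gamma\!\left(\ell + \frac{q-1}{2}\right)} \int_{-1}^{1} \varphi^{(\ell)}(t)\, (1 - t^2)^{\ell + \frac{q-3}{2}}\, dt, \quad \ell \in \mathbb{N},$$

where $\varphi^{(\ell)}$ is the $\ell$-th derivative of $\varphi$, which is $\varphi^{(\ell)}(t) = e^{-2\sigma} (2\sigma)^\ell e^{2\sigma t}$. We now show that the weights $b_\ell$ decay very fast with respect to $\ell$. We bound the integral:

$$\int_{-1}^{1} \varphi^{(\ell)}(t)\, (1 - t^2)^{\ell + \frac{q-3}{2}}\, dt = \int_{-1}^{1} e^{-2\sigma} (2\sigma)^\ell e^{2\sigma t}\, (1 - t^2)^{\ell + \frac{q-3}{2}}\, dt \leq \int_{-1}^{1} (2\sigma)^\ell\, dt = 2 (2\sigma)^\ell. \tag{33}$$

Hence:

$$b_\ell \leq 2 N(q, \ell)\, \frac{|S^{q-2}|}{|S^{q-1}|}\, \frac{\Gamma\!\left(\frac{q-1}{2}\right)}{2^\ell\, \Gamma\!\left(\ell + \frac{q-1}{2}\right)}\, (2\sigma)^\ell.$$

Denote $(a)_n := \Gamma(n + a)/\Gamma(a)$ the Pochhammer symbol defined for any integer $n$ and any scalar $a$. By the Stirling approximation of the Gamma function (Spiegel et al., 2013, (25.15)), the asymptotic behavior of $(a)_n$ when $n$ goes to infinity is:

$$(a)_n \sim \frac{\sqrt{2\pi}}{\Gamma(a)}\, e^{-n}\, n^{a + n - 1/2} \quad \text{as } n \to \infty.$$

Moreover, for a fixed dimension $q$, the asymptotic behavior of $N(q, \ell)$ defined by (20) when $\ell$ goes to infinity is:

$$N(q, \ell) \sim \frac{2}{(q-2)!}\, \ell^{q-2} \quad \text{as } \ell \to \infty.$$

Therefore, the asymptotic behavior of $b_\ell$ as $\ell$ goes to infinity is:

$$b_\ell = O\!\left( (\sigma e)^\ell\, \ell^{\,q/2 - 1 - \ell} \right) \quad \text{as } \ell \to +\infty.$$

Generalized distance kernel

For $\frac{q-1}{2} < s < \frac{q+1}{2}$, the generalized distance kernel on the hypersphere $S^{q-1}$ is defined in (Brauchart et al., 2014, Section 5) as:

$$K^{(s)}_{\mathrm{gd}}(u, v) := 2 V_{q-1-2s}(S^{q-1}) - \|u - v\|_2^{2s - q + 1} \quad \text{for } u, v \in S^{q-1},$$

where

$$V_{q-1-2s}(S^{q-1}) := \int_{S^{q-1}} \int_{S^{q-1}} \|u - v\|_2^{2s - q + 1}\, d\sigma_{q-1}(u)\, d\sigma_{q-1}(v) = \frac{2^{2s-1}\, \Gamma(q/2)\, \Gamma(s)}{\sqrt{\pi}\, \Gamma((q-1)/2 + s)}.$$

Following Brauchart et al. (2014, Section 5), the Legendre expansion of the generalized distance kernel $K^{(s)}_{\mathrm{gd}}$ is:

$$K^{(s)}_{\mathrm{gd}}(u, v) = V_{q-1-2s}(S^{q-1}) + \sum_{\ell=1}^{+\infty} \alpha^{(s)}_\ell\, N(q, \ell)\, P_\ell(q; u^\top v),$$

with

$$\alpha^{(s)}_\ell := -V_{q-1-2s}(S^{q-1})\, \frac{((q-1)/2 - s)_\ell}{((q-1)/2 + s)_\ell}, \quad \ell \geq 1.$$

The kernel weights indeed decay polynomially with respect to $\ell$, because according to Brauchart et al. (2014, Section 5), the asymptotic behavior of the $\alpha^{(s)}_\ell$ is:

$$\alpha^{(s)}_\ell \sim \frac{2^{2s-1}\, \Gamma(q/2)\, \Gamma(s)}{\sqrt{\pi}\, \left[ -\Gamma((q-1)/2 - s) \right]}\, \ell^{-2s} \quad \text{as } \ell \to +\infty.$$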
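For the RBF kernel, the integral formula for the weights $b_\ell$ and the upper bound derived from the Rodrigues rule can be compared numerically in small dimension. The following is a minimal sketch under stated assumptions: it uses the standard three-term recurrence for $P_\ell(q; \cdot)$, the identity $|S^{q-2}| / |S^{q-1}| = \Gamma(q/2) / (\sqrt{\pi}\, \Gamma((q-1)/2))$, and a simple Simpson quadrature. All function names are hypothetical, and the choice $q = 3$, $\sigma = 1$ is for illustration only.

```python
import math


def legendre(q, ell, t):
    """Legendre polynomial P_ell(q; t) in dimension q, normalised so P_ell(q; 1) = 1."""
    if ell == 0:
        return 1.0
    p_prev, p_curr = 1.0, t
    for n in range(2, ell + 1):
        # Three-term recurrence: (n+q-3) P_n = (2n+q-4) t P_{n-1} - (n-1) P_{n-2}.
        p_prev, p_curr = p_curr, ((2 * n + q - 4) * t * p_curr - (n - 1) * p_prev) / (n + q - 3)
    return p_curr


def num_harmonics(q, ell):
    """N(q, ell): dimension of the space of spherical harmonics of order ell."""
    if ell == 0:
        return 1
    return (2 * ell + q - 2) * math.factorial(ell + q - 3) // (math.factorial(ell) * math.factorial(q - 2))


def sphere_ratio(q):
    """|S^{q-2}| / |S^{q-1}| expressed with Gamma functions."""
    return math.gamma(q / 2) / (math.sqrt(math.pi) * math.gamma((q - 1) / 2))


def simpson(f, a, b, n=4000):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def rbf_weight(q, ell, sigma):
    """b_ell from the integral of phi(t) P_ell(q; t) (1 - t^2)^((q-3)/2) over [-1, 1]."""
    integrand = lambda t: (math.exp(-2 * sigma * (1 - t)) * legendre(q, ell, t)
                           * (1 - t * t) ** ((q - 3) / 2))
    return num_harmonics(q, ell) * sphere_ratio(q) * simpson(integrand, -1.0, 1.0)


def rbf_weight_bound(q, ell, sigma):
    """Upper bound on b_ell obtained after bounding the Rodrigues integral by 2 (2 sigma)^ell."""
    return (2 * num_harmonics(q, ell) * sphere_ratio(q) * math.gamma((q - 1) / 2)
            / (2 ** ell * math.gamma(ell + (q - 1) / 2)) * (2 * sigma) ** ell)


if __name__ == "__main__":
    for ell in range(9):
        print(ell, rbf_weight(3, ell, 1.0), rbf_weight_bound(3, ell, 1.0))
```

Since $\varphi(1) = 1$ and $P_\ell(q; 1) = 1$, the computed weights should sum to approximately 1, which gives a simple sanity check of the quadrature.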

B.3 CONNECTION BETWEEN SFRIK AND VICREG

Consider a rotation-invariant kernel $K(u, v) := \sum_{\ell=1}^{+\infty} b_\ell P_\ell(q; u^\top v)$ defined on $S^{q-1}$ such that $b_\ell \geq 0$ for $\ell \in \{1, \ldots, +\infty\}$, with $b_1, b_2 > 0$. To show the connection between SFRIK and VICReg, we construct a high-dimensional feature map $\Phi : S^{q-1} \to \ell^2(\mathbb{N})$, where $\ell^2(\mathbb{N})$ denotes the space of square-summable sequences with its canonical inner product $\langle \cdot, \cdot \rangle_{\ell^2}$, such that $\langle \Phi(u), \Phi(v) \rangle_{\ell^2} = K(u, v)$ for any $u, v \in S^{q-1}$. One way to construct such a feature map is to consider an orthonormal system of spherical harmonics. For any integer $\ell$, denote $\{Y_{\ell,k}\}_{k=1}^{N(q,\ell)}$ an orthonormal basis of spherical harmonics of order $\ell$ in dimension $q$. By the addition theorem (Müller, 2012, Theorem 2, §1) recalled in (21), the kernel $K(u, v)$ admits the decomposition:

$$K(u, v) = \sum_{\ell=1}^{+\infty} b_\ell P_\ell(q; u^\top v) = \sum_{\ell=1}^{+\infty} \sum_{k=1}^{N(q,\ell)} b_\ell\, \frac{|S^{q-1}|}{N(q,\ell)}\, Y_{\ell,k}(u)\, Y_{\ell,k}(v) = \langle \Phi(u), \Phi(v) \rangle_{\ell^2},$$

where

$$\Phi := \left( \sqrt{\frac{b_\ell\, |S^{q-1}|}{N(q,\ell)}}\; \Phi_\ell \right)_{\ell=1}^{+\infty} \quad \text{with} \quad \Phi_\ell : S^{q-1} \to \mathbb{R}^{N(q,\ell)},\ u \mapsto \left( Y_{\ell,k}(u) \right)_{k=1}^{N(q,\ell)} \quad \text{for } \ell \in \{1, \ldots, +\infty\}. \tag{44}$$

Then, the MMD in (6) between any probability distribution $Q$ defined on the hypersphere and the uniform distribution $U$ on the hypersphere can be written as the norm in $\ell^2(\mathbb{N})$ of the generalized 

C.4 HYPERPARAMETERS FOR EXPERIMENTS ON IN20%

We describe in detail our hyperparameter tuning protocol for the experiments in Section 4.2 on IN20%. For rigorous tuning, it is important that the dataset used for the final evaluation remains unseen during pretraining and hyperparameter tuning. For each pretraining method, we pretrain on the IN20% training set (blue subset in Figure 2) a backbone for each choice of hyperparameters. These backbones are then evaluated by weighted kNN classification on a separate validation set, which is another 20% subset of the ImageNet train set (purple subset in Figure 2), and we select the hyperparameters yielding the highest top-1 accuracy on this kNN evaluation. Then, we tune the learning rate for the linear probing evaluation, again on our separate validation set (purple subset in Figure 2). Finally, we evaluate the model trained with the best learning rate found for linear probing on the usual ImageNet validation set (red subset in Figure 2), which has never been seen during hyperparameter tuning.

We report the values of the optimal hyperparameters found after our hyperparameter tuning on a separate validation set, for each pretraining experiment on IN20% with a ResNet-18 presented in Section 4.2. These hyperparameters yield the evaluation results reported in Section 4.2 for linear probing on the usual ImageNet validation set. We recall that the hyperparameters specific to each self-supervision method were introduced in Appendix B.4.
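The three stages of this protocol can be summarized in a short sketch. It is not taken from the released code: `pretrain`, `knn_accuracy` and `linear_probe` are hypothetical callables standing for the pretraining, the weighted kNN evaluation and the linear probing described above, and the split names `"val"` and `"test"` are assumptions used to make explicit that the held-out ImageNet validation set is touched only once.

```python
def tune_then_evaluate(configs, head_lrs, pretrain, knn_accuracy, linear_probe):
    """Select hyperparameters on a separate validation split, then evaluate once on the test split.

    configs:      dict mapping a configuration name to its pretraining hyperparameters
    head_lrs:     candidate initial learning rates for the linear probe
    pretrain:     callable(config) -> backbone (hypothetical)
    knn_accuracy: callable(backbone, split) -> top-1 accuracy (hypothetical)
    linear_probe: callable(backbone, lr, split) -> top-1 accuracy (hypothetical)
    """
    # Stage 1: one backbone per configuration, ranked by kNN accuracy on the separate validation split.
    backbones = {name: pretrain(config) for name, config in configs.items()}
    best = max(backbones, key=lambda name: knn_accuracy(backbones[name], "val"))

    # Stage 2: tune the linear probing learning rate, still on the separate validation split.
    best_lr = max(head_lrs, key=lambda lr: linear_probe(backbones[best], lr, "val"))

    # Stage 3: a single final evaluation on the held-out split, never used for tuning.
    test_accuracy = linear_probe(backbones[best], best_lr, "test")
    return best, best_lr, test_accuracy
```

The point of the design is that no quantity computed on the held-out split influences any selection step.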

SimCLR.

For each embedding dimension, we fix the batch size at 2048, and tune the temperature τ and the base learning rate base lr for pretraining with SimCLR. Then, we tune the initial learning rate lr head for linear probing on IN20%. The optimal hyperparameters are shown in Table 6.

AUH.

For each embedding dimension, we fix the batch size at 2048, and tune the alignment weight λ, the scale of the RBF kernel t, and the base learning rate base lr for pretraining with AUH. Then, we tune the initial learning rate lr head for linear probing on IN20%. The optimal hyperparameters are shown in Table 7.



The performance gap between AUH and the RBF kernel is only due to the presence of the logarithm in AUH (cf. Example 1). Future work could clarify the role of this logarithm for regularization in self-supervision. We recall that the time and memory complexity is identical for all methods on downstream tasks.



Typically, contrastive learning (Oord et al., 2018; Hjelm et al., 2019; Chen et al., 2020a;b; He et al., 2020;

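$\sum_{\ell=0}^{+\infty} b_\ell P_\ell(q; u^\top v)$ converges for every $u, v$, and for any $L$, $\left| \sum_{\ell=0}^{L} b_\ell P_\ell(q; u^\top v) \right| \leq \sum_{\ell=0}^{L} b_\ell \left| P_\ell(q; u^\top v) \right| \leq \sum_{\ell=0}^{+\infty} b_\ell = \sum_{\ell=0}^{+\infty} b_\ell P_\ell(q; 1)$, because $|P_\ell(q; \cdot)| \leq 1$ for all $\ell$ by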

Following Brauchart et al. (2014, Section 5), the Legendre expansion of the generalized distance kernel $K^{(s)}_{\mathrm{gd}}$ is:

4 SFRIK (L = 3, q = 8192)

use of the following public resources, during the course of the experimental work of this paper:
• VICReg official code (Bardes et al., 2022): MIT License
• DINO official code (Caron et al., 2021): Apache License 2.0
• OBoW official code (Gidaris et al., 2021): Apache License 2.0
• SwAV official code (Caron et al., 2020): CC BY-NC 4.0
• SimCLR official code (Chen et al., 2020a): Apache License 2.0
• VISSL code (Goyal et al., 2021): MIT License
• ImageNet dataset (Deng et al., 2009)
• Places 205 dataset (Zhou et al., 2014): Attribution CC BY
• VOC2007 dataset (Everingham et al., 2010)

Linear probing on IN20% (top-1 accuracy) at different embedding dimensions q. All methods were pretrained on IN20% with a ResNet-18 for 100 epochs. We tuned all hyperparameters specific to each method and the learning rate. Symbol † indicates models that we retrained ourselves.

a separate validation set that consists of another 20% subset of the ImageNet train set. We select the hyperparameters that yield the highest top-1 accuracy obtained by weighted kNN classification (k = 20) (Wu et al., 2018) on this validation set, and we finally report the evaluation results by linear probing on the usual ImageNet validation set, which is never seen during hyperparameter tuning.

2 provides extra results for linear classification on Places205 (Zhou et al., 2014) and VOC2007 (Everingham et al., 2010) that further support our findings: SFRIK outperforms AUH while having the same pretraining complexity, and is fully competitive compared to VICReg with a reduced pretraining complexity. Impact

Linear classification on IN100%, Places205, VOC2007, and semi-supervised learning with few labels of IN100% (accuracy or mean average precision). Methods are pretrained on IN100% with ResNet-50. We only include methods relying on a Siamese architecture with image augmentations limited to two views. The scores of methods marked with * are from Chen & He (2021). The score of VICReg † was obtained by retraining the model ourselves. For each downstream task, we highlight in bold the best score among all backbones pretrained on 200 epochs.

Peak

demonstrates the competitiveness of SFRIK: it has the best accuracy for linear probing on IN100% among SimCLR, SwAV with no multi-crop, SimSiam and VICReg, and it performs better than VICReg for linear classification on Places205, VOC2007, and semi-supervised-learning with 10% of labels. We observe that SFRIK and VICReg offer a different trade-off between performance on linear probing on IN100% and performance on semi-supervised learning with 1% of labels. But as shown in Appendix D.3, other methods like BYOL and SwAV with multi-crop similarly have a performance drop compared to VICReg on semi-supervised learning with 1% of labels, even though they perform better on linear probing. Future work will therefore involve understanding what specific ingredients of VICReg make it more robust for semi-supervised learning with few labels.

Weighted kNN classification

We follow the usual protocol of Wu et al. (2018); Caron et al. (2021). We compute the normalized representations $f_\theta(x_i)$ of the images $x_i$, $i \in [N]$, in the training set. The label of an image $x_{\mathrm{test}}$ in the test set is predicted by a weighted vote of its $k$ nearest neighbors $\mathcal{N}_k$ in the representation space: the class $c$ gets a score of

$$w_c := \sum_{i \in \mathcal{N}_k} \exp\!\left( f_\theta(x_i)^\top f_\theta(x_{\mathrm{test}}) / 0.07 \right) \mathbb{1}_{[c_i = c]},$$

where $f_\theta(x_{\mathrm{test}})$ is normalized, $c_i$ is the class of $x_i$, and $\mathbb{1}_{[c_i = c]}$ is equal to 1 if $c_i = c$, and 0 otherwise. We report the kNN classification top-1 accuracy for $k = 20$. The image augmentation pipeline for both training and testing is the following one: images are resized to 256 × 256, and cropped at the center with a size 224 × 224. The code that we use for kNN classification is the one of Caron et al. (2021), available at https://github.com/facebookresearch/dino.
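The weighted vote above can be sketched in a few lines of plain Python. This is an illustrative re-implementation on lists of floats, not the DINO code used in the experiments; the function names are hypothetical, while the temperature 0.07 and the default k = 20 come from the text.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def weighted_knn_predict(train_feats, train_labels, test_feat, k=20, temperature=0.07):
    """Predict a label by the weighted vote of the k nearest neighbours (cosine similarity)."""
    test = normalize(test_feat)
    # Cosine similarity between the test representation and every training representation.
    sims = [
        (sum(a * b for a, b in zip(normalize(feat), test)), label)
        for feat, label in zip(train_feats, train_labels)
    ]
    neighbours = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]
    # Class c accumulates exp(similarity / temperature) over its neighbours.
    scores = defaultdict(float)
    for sim, label in neighbours:
        scores[label] += math.exp(sim / temperature)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]]
    labels = [0, 0, 1, 1, 2]
    print(weighted_knn_predict(feats, labels, [1.0, 0.2], k=3))
```

Because of the exponential weighting with a small temperature, a single very close neighbour can outweigh several distant ones, which is the intended difference from a plain majority vote.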

Hyperparameter choice for SimCLR pretrained on IN20% with a ResNet-18 during 100 epochs, evaluated by linear probing on IN20%.

Hyperparameter tuning for linear probing on IN100% and Places205 for SFRIK and VICReg pretrained on IN100% with a ResNet-50.

Hyperparameter tuning for semi-supervised learning for SFRIK and VICReg pretrained on IN100% with a ResNet-50.

ACKNOWLEDGMENTS

This project was funded by the CIFRE fellowship N°2020/1643 and supported in part by the AllegroAssai ANR project ANR-19-CHIA-0009. This work was granted access to the HPC resources of IDRIS under the allocation 2021-AD011012940 made by GENCI. Experiments presented in this paper were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr).

availability

https://github.com/valeoai/

ETHICS STATEMENT

The authors are concerned by the carbon footprint of deep learning research. To raise awareness of this issue in the community, we report the energy used for the computational resources of this project, which is approximately 12500 kWh.

REPRODUCIBILITY STATEMENT

In the interest of reproducible research, we provide our code and our pretrained ResNet-50 backbones with SFRIK on IN100% at https://github.com/valeoai/sfrik (Zheng et al., 2022). All details about our experimental setting can be found in either Section 4 or Appendix C. For the experiments on IN20%, as detailed in Appendix C.4, hyperparameters are tuned on a separate validation set different from the one used for evaluation, in contrast to the common practice in the literature where hyperparameters are directly selected on the evaluation dataset. All hyperparameters that yield the evaluation results reported in the paper are given in Appendix C.

moment of $Q$ with the mapping $\Phi$: (45)

We now explain how to construct explicitly an orthonormal basis of spherical harmonics $\{Y_{\ell,k} \mid \ell = 1, \ldots, +\infty;\ k = 1, \ldots, N(q,\ell)\}$, based on the following theorem.

Theorem 3 (Axler et al. (2013, Theorem 5.25)) For any order $\ell \in \mathbb{N}$ and any dimension $q \geq 3$, the family is a (non-orthonormal) basis of the space of spherical harmonics of order $\ell$ in dimension $q$, where $\alpha_j$ ($j = 1, \ldots, q$) are nonnegative integers, and $\partial_j^{\alpha_j}$ denotes the $\alpha_j$-th partial derivative with respect to the $j$-th coordinate.

Typically, we construct the orthonormal basis $\{Y_{\ell,k}\}_{k=1}^{N(q,\ell)}$ by orthonormalizing the (non-orthonormal) basis of Theorem 3 using, e.g., the Gram-Schmidt procedure. For $\ell = 1, \ldots, +\infty$, denote: Then, for each $\ell = 1, \ldots, +\infty$, there exists a lower triangular matrix $M_\ell$ such that: Remark that it is possible to compute explicitly the entries of the matrices $M_\ell$, $\ell = 1, \ldots, +\infty$, because there exists a closed-form expression for the inner product $\langle Y_{\ell,k}, Y_{\ell,k'} \rangle_{(q)}$ for any $\ell, k, k'$: indeed, the function $Y_{\ell,k}$ for any $\ell, k$ is a polynomial defined on the hypersphere, and the integral of any monomial with respect to the measure $\sigma_{q-1}$ on the hypersphere $S^{q-1}$ admits a closed-form expression given by Weyl (1939, Section 3).

By injecting (48) in (45), we obtain: This yields the claim of Section 3.3 by remarking with Theorem 3 that the families are bases of the space of spherical harmonics of order 1 and 2 in dimension $q$.

B.4 REGULARIZATION LOSS OF SIMCLR, AUH AND VICREG

In SimCLR, AUH, VICReg and SFRIK, each image $x_i$ in a batch $\{x_i\}_{i \in I}$ is augmented into two different views $x_i^{(1)}$ and $x_i^{(2)}$, which are encoded into two embeddings $z_i^{(1)}$ and $z_i^{(2)}$. These embeddings are normalized in SimCLR, AUH and SFRIK, but not in VICReg. This yields two batches of embeddings $Z_I^{(1)}$ and $Z_I^{(2)}$. The four methods share the same form of loss function:

$$\lambda\, \ell_a(Z_I^{(1)}, Z_I^{(2)}) + \mu\, \ell_r(Z_I^{(1)}, Z_I^{(2)}), \tag{51}$$

for some scalars $\lambda, \mu > 0$, where $\ell_a$ is the alignment loss defined by (3) (which is the same for all four methods), and $\ell_r$ is the regularization loss specific to each method.

SimCLR

The regularization loss in SimCLR is: where $\tau > 0$ is a hyperparameter of the method called the temperature, and ), and 0 otherwise. The scalars $\lambda, \mu$ are fixed at $\lambda = \frac{1}{\tau}$ and $\mu = 1$.

Alignment & Uniformity

The regularization loss in AUH is: where $t > 0$ is a hyperparameter called the scale of the RBF kernel. The scalar $\lambda$ is tuned as a hyperparameter and $\mu$ is fixed at $\mu = 1$.

VICReg

As introduced in Section 3.3, the regularization loss in VICReg is: where $\mu$ is the scalar from (51). Here, $v(\cdot)$ and $c(\cdot)$ are respectively the variance and covariance terms defined by (9). Both $\lambda$ and $\mu$ are tuned as hyperparameters.

C EXPERIMENTAL SETTING

In the interest of reproducible research, we give more details about the setting of our experiments presented in Section 4.

C.1 IN20% DATASET DESCRIPTION

The datasets used in our experiments include a subset of 20% of ImageNet-1000 as in (Gidaris et al., 2021) . This reduced dataset, denoted IN20%, contains all the 1000 classes of ImageNet, but we keep only 260 images per class. The 260 images extracted are the same as those extracted in the official implementation of OBoW (https://github.com/valeoai/obow). In Section 4.2, we also use another 20% subset of the ImageNet train set as a separate validation set for hyperparameter tuning (see Appendix C.4 below). The construction of this validation set is based on the code of OBoW, and will be exactly detailed in our code that will be released at publication.

C.2 IMAGE AUGMENTATIONS

We follow the same image augmentation pipeline as in Bardes et al. (2022). We also use image augmentations implemented by PIL, as in VICReg's code available at https://github.com/facebookresearch/vicreg:
• GaussianBlur(): blur an image using a Gaussian kernel with standard deviation uniformly sampled in [0.1, 2.0];
• Solarization(): randomly invert all pixel values above a threshold, which is 130.

In our experiments, the first image view is obtained by composing the following random augmentations: random cropping resized to 224 × 224, random horizontal flip applied with probability 0.5, random color jittering applied with probability 0.8, random grayscale conversion applied with probability 0.2, random Gaussian blur applied with probability 0.1, and random solarization applied with probability 0.2. The second view is obtained by composing the same random augmentations as the first view, except that Gaussian blur is applied every time (probability 1), and solarization is never applied (probability 0).
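The asymmetry between the two views can be made explicit with a small sketch that only samples which augmentations are applied, without touching any image. The dictionary keys and the function `sample_view` are hypothetical names chosen for illustration; the probabilities, the blur range and the solarization threshold are the ones stated above. The actual image operations (cropping, jittering, blurring, and so on) are deliberately left out.

```python
import random

# Probability of applying each random augmentation, per view (values from the text).
VIEW_PROBS = {
    1: {"horizontal_flip": 0.5, "color_jitter": 0.8, "grayscale": 0.2,
        "gaussian_blur": 0.1, "solarization": 0.2},
    2: {"horizontal_flip": 0.5, "color_jitter": 0.8, "grayscale": 0.2,
        "gaussian_blur": 1.0, "solarization": 0.0},
}


def sample_view(view, rng):
    """Sample the augmentation plan of one view: which operations fire, and their parameters."""
    plan = {"random_resized_crop": 224}  # every view is cropped and resized to 224 x 224
    for name, prob in VIEW_PROBS[view].items():
        plan[name] = rng.random() < prob
    if plan["gaussian_blur"]:
        # Standard deviation of the Gaussian kernel, uniformly sampled in [0.1, 2.0].
        plan["blur_sigma"] = rng.uniform(0.1, 2.0)
    if plan["solarization"]:
        plan["solarize_threshold"] = 130
    return plan


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_view(1, rng))
    print(sample_view(2, rng))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is convenient when comparing pretraining runs.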

C.3 EVALUATION PROTOCOL

We describe the downstream tasks on which self-supervision methods are evaluated in our experiments of Section 4.

Linear probing on IN20% and IN100%

Following, e.g., Bardes et al. (2022), the weights of the backbone (ResNet-18 or ResNet-50) are frozen and a linear layer followed by a softmax on top of the backbone is trained in a supervised setting on a training set. Then the model is evaluated on a test set. The training set is either IN20% or IN100%, but with labels. The test set is the validation set of ImageNet. The linear layer is trained using an SGD optimizer with momentum parameter equal to 0.9 during 100 epochs. We apply a weight decay of $10^{-6}$. The learning rate follows a cosine decay scheduling. The batch size is fixed at 256. Training images are augmented by composing a random cropping of an area between 8% and 100% of the total area resized to 224 × 224, and a random horizontal flip of probability 0. 

Linear probing on Places205

We use the code of Gidaris et al. (2021), available at https://github.com/valeoai/obow, for the evaluation by linear probing on Places205. The weights of the backbone (ResNet-50) pretrained on IN100% are frozen and a linear prediction layer is trained for the classification task on Places205. We note that a batch normalization layer with non-learnable scale and bias parameters is added at the output of the backbone in Gidaris et al. (2021). The linear prediction layer is trained with an SGD optimizer with a 0.9 momentum parameter during 28 epochs. The weight decay is 10

Linear classification on VOC2007

After pretraining a ResNet backbone, we use the VISSL library (Goyal et al., 2021) to extract features of VOC2007 images resized to 224 × 224 by taking the output of the last average pooling layer of the pretrained ResNet backbone. We then learn a linear SVM with LIBLINEAR (Fan et al., 2008) on top of these features to predict the presence or the absence of a given class in the test images. An average precision score is then computed for each class after a 3-fold cross-validation, and we report the mean score over all classes as the mean average precision (mAP).

Semi-supervised learning

After pretraining a ResNet backbone by self-supervision, we fine-tune this backbone and the linear classifier on the ImageNet classification task with only 1% or 10% of the labeled data. The labeled images that are considered in these subsets are the ones used in the official code of SimCLR available at https://github.com/google-research/simclr. We use an SGD optimizer with momentum parameter equal to 0.9 during 20 epochs, without weight decay. The batch size is fixed at 256. The learning rates of the backbone and the linear classifier follow a cosine decay scheduling with different initial learning rates. These initial learning rates are tuned as hyperparameters. We report the top-1 accuracy on the validation set of ImageNet obtained after the last training epoch, along with the corresponding top-5 accuracy. We use the same image augmentation pipeline for training and testing as in linear probing on IN100%. The code that we use for semi-supervised learning with few labels of IN100% is the one of Bardes et al. (2022), available at https://github.com/facebookresearch/vicreg.

VICReg.

For each embedding dimension, we fix the batch size at 2048, and we tune the alignment weight λ, the variance weight µ, and the base learning rate base lr for pretraining with VICReg. Then, we tune the initial learning rate lr head for linear probing on IN20%. The optimal hyperparameters are shown in Table 8.

SFRIK, batch size 2048.

For each embedding dimension, we fix the batch size at 2048, and tune the alignment weight λ in (2), the kernel weights $b_\ell$ ($\ell \in \{2, 3\}$) in (8), and the base learning rate base lr for pretraining with SFRIK. Without loss of generality, the first kernel weight $b_1$ in (8) is fixed at $b_1 = 1$. Then, we tune the initial learning rate lr head for linear probing on IN20%. The optimal hyperparameters are shown in Table 9.

SFRIK, batch size 4096.

We fix the dimension at q = 8192, and the batch size at 4096, and tune the alignment weight λ in (2 This yields a top-1 accuracy of 46.3 for linear probing on IN20%, meaning that it is not necessary to use a batch size larger than 2048 in SFRIK to obtain a better performance.

C.5 HYPERPARAMETERS FOR THE ABLATION ON THE KERNEL CHOICE

We detail the experimental setting of our ablation study on the kernel choice for the generic uniformity term (7) in Table 3 of Section 4.2. We follow the protocol of Section 4.2, with an embedding dimension of q = 8192. The training loss is $\lambda\, \ell_a(Z_I^{(1)}, Z_I^{(2)}) + 0.5\,(\ell_u(Z_I^{(1)}) + \ell_u(Z_I^{(2)}))$, where $\ell_u$ is the loss (7) with an RBF or a generalized distance kernel.

RBF kernel.

The kernel is $K(u, v) = e^{-t \|u - v\|_2^2}$. We fix the batch size at 2048, and tune the alignment weight λ, the scale of the RBF kernel t, and the base learning rate base lr for pretraining. Then, we tune the initial learning rate lr head for linear probing on IN20%. The optimal hyperparameters are: λ = 100, t = 2, base lr = 1.0, lr head = 10.

Generalized distance kernel.

The kernel is $K(u, v) = C - \|u - v\|_2^p$, where we fixed C = 0 because the value of this constant does not change the gradients of the optimization problem, and p = 2 because we verified empirically that choosing p < 2 yields poor results. We fix the batch size at 2048, and tune the alignment weight λ, and the base learning rate base lr for pretraining. Then, we tune the initial learning rate lr head for linear probing on IN20%. The optimal hyperparameters are: λ = 10000, base lr = 0.6, lr head = 10.

As shown in Table 3, a truncated kernel is empirically a better choice than the RBF or the generalized distance kernel for the uniformity term (7).
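To make the structure of this training loss concrete, here is a minimal sketch for the RBF case on small lists of normalized embeddings. Since the definitions (3) and (7) are given in the main text and not in this appendix, two assumptions are made and should be checked against them: the alignment loss is taken as the mean squared $\ell_2$-distance between paired embeddings, and the uniformity loss as the average of the kernel over all pairs of distinct embeddings in a batch. All function names are hypothetical.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def alignment(batch_a, batch_b):
    """Assumed form of the alignment loss: mean squared distance between paired embeddings."""
    return sum(sq_dist(u, v) for u, v in zip(batch_a, batch_b)) / len(batch_a)


def rbf_uniformity(batch, t):
    """Assumed form of the uniformity loss: mean RBF kernel value over pairs i != j."""
    n = len(batch)
    total = sum(
        math.exp(-t * sq_dist(batch[i], batch[j]))
        for i in range(n) for j in range(n) if i != j
    )
    return total / (n * (n - 1))


def training_loss(batch_a, batch_b, lam, t):
    """lam * alignment + 0.5 * (uniformity of view 1 + uniformity of view 2)."""
    return lam * alignment(batch_a, batch_b) + 0.5 * (rbf_uniformity(batch_a, t) + rbf_uniformity(batch_b, t))


if __name__ == "__main__":
    spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    collapsed = [[1.0, 0.0]] * 4
    print(training_loss(spread, spread, lam=100, t=2))
    print(training_loss(collapsed, collapsed, lam=100, t=2))
```

In this toy example both batches are perfectly aligned, so the alignment term vanishes and the comparison isolates the uniformity term: collapsed embeddings give the largest possible kernel energy, while embeddings spread over the circle give a much smaller value.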

C.6 HYPERPARAMETERS FOR EXPERIMENTS ON IN100%

Table 10 reports the selected hyperparameters for pretraining ResNet-50 on IN100% with SFRIK in Section 4.3. The tuning protocol is as follows. Since hyperparameter tuning is costly on IN100%, we pretrain several ResNet-50 for different values of kernel weights $b_\ell$, alignment weight λ and base learning rate base lr, and pause the pretraining after 50 epochs. We evaluate the obtained backbones on kNN classification (top-1 accuracy, k = 20), and select the best performing backbones. Then we continue pretraining these selected backbones until reaching epoch 200 or 400. Finally, we select the hyperparameters that yield the highest top-1 accuracy for linear probing on IN100% after 200 or 400 epochs of pretraining. Because of the conclusions of Appendix D.1, our hyperparameter tuning follows the common practice in the self-supervised learning literature where hyperparameters are selected by measuring the performance on the validation set of ImageNet. Note that we verified experimentally a posteriori that the hyperparameters obtained by our tuning protocol are similar to the ones obtained from a tuning on a smaller dataset like STL-10 (Coates et al., 2011). This means that an alternative is to tune the hyperparameters on STL-10, and generalize these hyperparameters to pretrain SFRIK on IN100%.

Tables 11 and 12 give the optimal hyperparameters found for linear probing on IN100%, linear probing on Places205, and semi-supervised learning with limited labels of IN100% when evaluating pretrained ResNet-50 backbones with SFRIK and VICReg. The hyperparameters that are tuned for evaluation are: the initial learning rate lr head of the linear layer in linear probing; and the initial learning rates lr backbone and lr head for the backbone and the linear layer, respectively, in semi-supervised learning. The reported hyperparameters in the two tables yield the evaluation results reported in Section 4. 

D ADDITIONAL EXPERIMENTAL RESULTS

We provide in this appendix other experimental results to complement the main paper.

D.1 HYPERPARAMETER TUNING WITHOUT A SEPARATE VALIDATION SET

A common practice in the self-supervised learning literature, e.g., (Bardes et al., 2022; Chen et al., 2020a), is to select the hyperparameters by measuring the performance on the validation set of ImageNet (red dataset in Figure 2) instead of a separate validation dataset (purple dataset in Figure 2). In this paragraph, we verify whether this less rigorous practice changes the conclusion of the experiments in Section 4.2. In Table 13, we report the evaluation of the different backbones pretrained on IN20% after tuning each method directly on the validation set of ImageNet, which is the same dataset used for evaluation in linear probing. By comparison with Table 2, which follows the more rigorous hyperparameter tuning protocol described in Appendix C.4, we observe that although the absolute figures of merit slightly vary if we use the less rigorous protocol instead of the more rigorous one, the conclusion of the experiments in Section 4.2 does not change. This gives an empirical justification to this common practice.

Table 13: Linear probing on IN20% (top-1 accuracy) at different embedding dimensions q. All methods were pretrained on IN20% with a ResNet-18 for 100 epochs. Hyperparameters specific to each method and the learning rate are tuned on the same dataset as the one used for evaluation in linear probing, which is less rigorous than tuning the hyperparameters on a separate validation set as described in Appendix C.4. Symbol † indicates models that we retrained ourselves.

We conclude that, under the rigorous protocol of Section 4.2 and Appendix C.4, SFRIK performs better than AUH on various downstream tasks, while having the same computational saving offered by the kernel trick. Compared to VICReg, SFRIK performs better on these tasks with a reduced complexity.

D.3 RESULTS OF OTHER PRETRAINING METHODS ON IN100% WITH RESNET-50

In complement to 

