SIMPLICIAL EMBEDDINGS IN SELF-SUPERVISED LEARNING AND DOWNSTREAM CLASSIFICATION

Abstract

Simplicial Embeddings (SEM) are representations learned through self-supervised learning (SSL), wherein a representation is projected into L simplices of V dimensions each using a softmax operation. This procedure conditions the representation onto a constrained space during pre-training and imparts an inductive bias for discrete representations. For downstream classification, we provide an upper bound and argue that using SEM leads to a better expected error than the unnormalized representation. Furthermore, we empirically demonstrate that SSL methods trained with SEMs have improved generalization on natural image datasets such as CIFAR-100 and ImageNet. Finally, when used in a downstream classification task, we show that SEM features exhibit emergent semantic coherence where small groups of learned features are distinctly predictive of semantically-relevant classes.

1. INTRODUCTION

Self-supervised learning (SSL) is an emerging family of methods that aim to learn representations of data without manual supervision, such as class labels. Recent works (Hjelm et al., 2019; Grill et al., 2020; Saeed et al., 2020; You et al., 2020) learn dense representations that can solve complex tasks by simply fitting a linear model on top of the learned representation. While SSL is already highly effective, we show that changing the type of representation learned can improve both the performance and interpretability of these methods. For this we draw inspiration from overcomplete representations: representations of an input that are non-unique combinations of a number of basis vectors greater than the input's dimensionality (Lewicki & Sejnowski, 2000). Mostly studied in the context of the sparse coding literature (Gregor & LeCun, 2010; Goodfellow et al., 2012; Olshausen, 2013), sparse overcomplete representations have been shown to increase stability in the presence of noise (Donoho et al., 2006), have applications in neuroscience (Olshausen & Field, 1996; Lee et al., 2007), and lead to more interpretable representations (Murphy et al., 2012; Fyshe et al., 2015; Faruqui et al., 2015). However, the basis vectors are learned using linear models (Lewicki & Sejnowski, 2000; Teh et al., 2003). In this work, we show that SSL may be used to learn discrete, sparse and overcomplete representations. Prior work has considered sparse representations, but not sparse and overcomplete representation learning with SSL; for example, Dessì et al. (2021) propose to discretize the output of the encoder in an SSL model using Gumbel-Softmax (Jang et al., 2017). However, we show that discretization during pre-training is not necessary to achieve a sparse representation.
Instead, we propose to project the encoder's output into L vectors of V dimensions, onto which we apply a softmax function to impart an inductive bias toward sparse one-hot vectors (Correia et al., 2019; Goyal et al., 2022). This also alleviates the need for high-variance gradient estimators to train the encoder. We refer to this embedding as Simplicial Embeddings (SEM), as the softmax functions map the unnormalized representations onto L simplices. The procedure to induce SEM is simple, efficient, and generally applicable. The SSL pre-training phase, used with SEM, learns a set of L approximately one-hot vectors. Key to controlling the inductive bias of SEM during pre-training is the softmax temperature parameter: the lower the temperature, the stronger the bias toward sparsity. Consistent with earlier attempts at sparse representation learning (Coates & Ng, 2011), we find that the optimal sparsity for pre-training need not match the optimal level for downstream learning. For downstream classification, we may discretize the learned representation by, for example, taking the argmax for each simplex. Alternatively, we can use SEM to control the representation's expressivity via the softmax temperature. We provide a theoretical bound showing that the expected error follows a trade-off between the training error and the representation's expressivity, which can be controlled by the softmax temperature used to normalize the representation for downstream classification. Our bound also shows improved expected error as we increase L and V for SEM. SEM is generally applicable to recent SSL methods. Applying it to seven different SSL methods (Chen et al., 2020b; He et al., 2020; Grill et al., 2020; Caron et al., 2020; 2021; Zbontar et al., 2021; Bardes et al., 2022), we find accuracy increases of 2% to 4% on CIFAR-100.
We observe monotonic improvement as we increase the number of vectors L, showing the benefit of the overcomplete representations learned by SEM, while this improvement is absent when we do not use softmax normalization. When training an SSL method with SEM on ImageNet, we also observe improved in-distribution accuracy compared to the baseline (Figure 1). We also observe improvements on out-of-distribution test sets, a semi-supervised learning benchmark and transfer learning datasets, demonstrating the potential of SEM for large-scale applications. Finally, we find that SEM learns features that are closely aligned to the semantic categories in the data. This demonstrates that SEM learns disentangled and interpretable representations, as previously observed in overcomplete representations (Faruqui et al., 2015).

2. RELATED WORK

The softmax operation has been used in other contexts, notably as an architectural component for models to attend to context-dependent queries via, for example, an attention mechanism (Bahdanau et al., 2016; Vaswani et al., 2017; Correia et al., 2019; Goyal et al., 2022), a mixture of experts (Jordan & Jacobs, 1993) or memory-augmented networks (Graves et al., 2014). This operation is also used for the computation of several SSL objectives such as InfoNCE (van den Oord et al., 2018; Hjelm et al., 2019), and as a normalization of the output to compute the objective in DINO and SwAV (Caron et al., 2020; 2021). Different from these, our method places the softmax at the output of an encoder to constrain the representation into a set of L sparse vectors. Similar to our approach, other architectural constraints such as Dropout (Srivastava et al., 2014), BatchNorm (Ioffe & Szegedy, 2015) and LayerNorm (Ba et al., 2016) also improve the training of large neural networks. However, contrary to SEM, they are not used to induce sparsity in the representation or to control its expressivity for downstream tasks. Closer to our work, Liu et al. (2021) propose to constrain the expressivity of the representation of a neural network with a set of discrete-valued symbols obtained using a set of Vector Quantized (Oord et al., 2018) bottlenecks. Similarly, Dessì et al. (2021) propose a communication game with a discrete bottleneck. The idea of discretizing the encoder's output is similar to using SEM vectors that are one-hot (e.g. temperature = 0) with only one symbol (e.g. L = 1, V = 2048). In our work, we find success in removing the hard discretization and having L > 1, which can be interpreted as combining several symbols.

3. SIMPLICIAL EMBEDDINGS

Simplicial Embeddings (SEM) are representations that can be integrated easily into a contrastive learning model (Hjelm et al., 2019; Chen et al., 2020b), the BYOL method (Grill et al., 2020), and other SSL methods (Caron et al., 2020; 2021; Zbontar et al., 2021). For example, in BYOL, we insert the SEM after the encoder and before the projector, and the rest is unchanged, as shown in Figure 2c. In this figure, $t$ and $t'$ are augmentations defined by the practitioner, and $\xi$ are the parameters of the target network, which are updated as a moving average of the parameters $\theta$ of the online network trained with SGD: $\xi \leftarrow \alpha \xi + (1 - \alpha)\theta$, with $\alpha \in [0, 1]$.

To produce the SEM representation, the encoder's output $e$ is embedded into a matrix $z \in \mathbb{R}^{L \times V}$, that is, into $L$ vectors $z_i \in \mathbb{R}^{V}$. A temperature parameter $\tau$ scales $z_i$, and then a softmax re-normalizes each vector $z_i$ to produce $\bar{z}_i$. Finally, the normalized vectors $\bar{z}_i$ are concatenated to produce the vector $\hat{z}$ of length $L \cdot V$. We illustrate SEM in Figure 2a. Formally, the re-normalization is as follows:

$$\bar{z}_i := \sigma_\tau(z_i), \qquad \sigma_\tau(z_i)_j = \frac{e^{z_{ij}/\tau}}{\sum_{k=1}^{V} e^{z_{ik}/\tau}}, \qquad \hat{z} := \mathrm{Concat}(\bar{z}_1, \dots, \bar{z}_L), \qquad \forall i \in [L],\ \forall j \in [V]. \tag{1}$$
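The re-normalization in Equation (1) can be sketched in a few lines of plain Python. This is a minimal illustration, not the released implementation: the function names `softmax` and `simplicial_embedding` and the toy values of `z` are assumptions made for the example, and the embedder that produces `z` from the encoder output is omitted.

```python
import math

def softmax(logits, tau):
    """Temperature-scaled softmax over one vector of V logits."""
    scaled = [x / tau for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def simplicial_embedding(z, tau):
    """Map an L x V matrix of logits to the concatenation of L simplices."""
    z_hat = []
    for z_i in z:  # one row per simplex
        z_hat.extend(softmax(z_i, tau))
    return z_hat

# Hypothetical embedder output with L = 2 simplices of V = 3 dimensions.
z = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
z_hat = simplicial_embedding(z, tau=0.5)
```

Each block of `V` consecutive entries of `z_hat` is non-negative and sums to one, so `z_hat` has length `L * V` and lies on a product of `L` simplices.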

3.1. INDUCTIVE BIAS TOWARDS SPARSITY DURING PRE-TRAINING

In SEM, $L$ controls the number of simplices and $V$ controls the dimensionality of each simplex. As such, the higher $V$ is, the sparser the representation can be. During pre-training, the constraint induced by embedding the representation into a simplex biases each vector towards sparsity by creating a zero-sum competition between the components of the vector: for one component to increase by $\alpha$, the other components must jointly decrease by $\alpha$, and every component is bounded below by 0. For networks to learn useful features and minimize their objective, they must prioritize some components at the expense of others. The strength of this bias is controlled via the pre-training temperature $\tau_p$ of the softmax and the size $V$ of the vectors, as was noted in the context of attention (Vaswani et al., 2017; Wang et al., 2021b). For SSL methods with a target network, the temperature for the target network can differ from the online network's, as no gradient is back-propagated through it.

To visualize the effect of the temperature on SEM after pre-training, we interpret each simplex as a probability mass function $p(\bar{z}_{ij})$ where, for all $i \in [L]$, $\sum_{j=1}^{V} p(\bar{z}_{ij}) = 1$ and $p(\bar{z}_{ij}) \geq 0\ \forall j$. The entropy of a simplex $\bar{z}_i$, defined as $H(\bar{z}_i) := -\sum_{j=1}^{V} p(\bar{z}_{ij}) \log p(\bar{z}_{ij})$, indicates whether the simplex is a sparse or a dense vector. That is, if $H(\bar{z}^{(x)}_i) = 0$ then the vector is one-hot. On the other hand, if $H(\bar{z}^{(x)}_i) = \ln(V)$ then the vector is dense and uniform. While the temperature $\tau_p$ is merely a scaling of the logits, it exerts important control over the entropy of the learned representation and hence over the sparsity of the resulting SEM. We demonstrate this by learning a representation on CIFAR-100 using BYOL and analyzing the entropies of the resulting simplices. In Figure 2b, we plot the histogram of the entropies $H(\bar{z}_i)$, for a given $\tau_p$, of each simplex for each sample in the training set of CIFAR-100.
We observe that even after pre-training, small temperatures ($\tau_p = 0.01$) yield representations that are close to one-hot vectors, while high temperatures yield vectors that are close to uniform. By pre-training using a softmax, SEM creates representations that are conditioned to fit onto simplices. In pre-training, we select $\tau_p$ for optimal inductive bias: a $\tau_p$ that is too small yields vanishing gradients (Wang et al., 2021b), and a $\tau_p$ that is too large yields a bias that is too weak. We may select a different optimal $\tau_d$ for downstream performance, as discussed formally in the next subsection.
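The relationship between temperature and simplex entropy can be illustrated with a small sketch. It only shows how the temperature rescales a fixed vector of logits; it does not reproduce the pre-training experiment behind Figure 2b. The logits and the temperature grid below are arbitrary values chosen for illustration.

```python
import math

def softmax(logits, tau):
    """Temperature-scaled softmax over one vector of logits."""
    scaled = [x / tau for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy (in nats) of a probability vector; 0 * log 0 is taken as 0."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

# Hypothetical logits for a single simplex with V = 4.
logits = [1.0, 0.5, 0.0, -0.5]
entropies = {tau: entropy(softmax(logits, tau)) for tau in (0.01, 0.1, 1.0, 10.0)}
uniform_entropy = entropy(softmax([0.0, 0.0, 0.0, 0.0], 1.0))  # equals ln(V)
```

For these logits the entropy is close to 0 at the smallest temperature and approaches $\ln(V)$ at the largest, matching the one-hot and uniform extremes described above.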

3.2. SEM IMPROVEMENT ON THE GENERALIZATION OF THE DOWNSTREAM CLASSIFIER

In this subsection, we theoretically demonstrate the benefit of training a downstream classifier with SEM-normalized input compared to a baseline classifier with unnormalized input. We show that: (1) there is a trade-off between the training loss and the generalization gap, which is controlled by the value of $\tau_d$ (denoted $\tau := \tau_d$ in this subsection), (2) SEM can improve the base model performance when we attain a good balance in this trade-off, and (3) the improvement due to SEM is expected to increase or stay constant as $L$ and $V$ increase. In the remainder of this subsection, we introduce the notation and assumptions needed to understand and derive the result, then present our theoretical claim and discuss its implications.

Notation. We use a training dataset $S = ((z^{(i)}, y^{(i)}))_{i=1}^{n}$ of $n$ samples for supervised training of a classifier, using the representation $z$ extracted from the pre-trained model and the corresponding label $y \in \mathcal{Y}$, where $\mathcal{Y}$ is the space of possible labels. Assume that $z \in \mathcal{Z} = [-1, +1]^{L \times V}$, which means that $z$ is a matrix with $L$ rows and $V$ columns. We denote the element of $z$ at row $i$ and column $j$ as $z_{ij}$. Let $g$ represent the downstream classifier. We refer to the baseline downstream model with unnormalized input as $f_{\mathrm{base}}$, with $f_{\mathrm{base}}(z) = g(z)$. The corresponding downstream model trained with the SEM normalization is $f_{\mathrm{SEM}(\tau)}(z) = (g \circ \sigma_\tau)(z)$, where $\sigma_\tau$ is applied along each row of $z$ such that

$$\sigma_\tau(z_{ij}) = \frac{e^{z_{ij}/\tau}}{\sum_{t=1}^{V} e^{z_{it}/\tau}} \quad \text{for } j = 1, \dots, V.$$

Moreover, we define $f^S_{\mathrm{base}}$ and $f^S_{\mathrm{SEM}(\tau)}$ as the base and the SEM-normalized models obtained by fitting the dataset $S$. Finally, let $\mathcal{H}$ be the union of the hypothesis spaces of $f_{\mathrm{SEM}(\tau)}$ and $f_{\mathrm{base}}$. To compare the quality of the base model and the model with SEM normalization, we analyze the generalization gap

$$\mathbb{E}_{z,y}[l(f^S(z), y)] - \frac{1}{n} \sum_{i=1}^{n} l(f^S(z^{(i)}), y^{(i)})$$

for each $f^S \in \{f^S_{\mathrm{SEM}(\tau)}, f^S_{\mathrm{base}}\}$, where $l : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}$ is the per-sample loss.
The key insight that we exploit for the theorem is that the softmax operation $\sigma_\tau$ controls the expressivity of the input's representation to $g$ via the temperature $\tau$. We denote by $\phi_{f_{\mathrm{base}}}$ an upper bound on the expressivity of $z_i$ for the baseline model $f_{\mathrm{base}}$, and by $\phi_{f_{\mathrm{SEM}(\tau)}}$ the upper bound on the expressivity of $\sigma_\tau(z_i)$ for the model with SEM normalization $f_{\mathrm{SEM}(\tau)}$. The formal definition of $\phi_{f_{\mathrm{base}}}$ and $\phi_{f_{\mathrm{SEM}(\tau)}}$ requires proof devices that would hinder the readability of this section, so we refer the reader to Appendix A for a detailed definition. Let $\phi_f \in \{\phi_{f_{\mathrm{base}}}, \phi_{f_{\mathrm{SEM}(\tau)}}\}$. Intuitively, $\phi_{f^S}$ measures the largest possible distance that two embeddings can have such that the largest component remains the same for both embeddings. We note that this measure depends only on $V$ for $f_{\mathrm{base}}$, and on both $V$ and $\tau$ for $f_{\mathrm{SEM}(\tau)}$. We use $\phi_{f^S}(V, \tau)$ to denote the measure given by either model and note that $\tau$ has no effect for $f_{\mathrm{base}}$.

Assumptions. We assume that the per-sample loss is bounded such that $l(f(z), y) \leq B$ for all $f \in \mathcal{H}$ and for all $(z, y) \in \mathcal{Z} \times \mathcal{Y}$. For example, $B = 1$ for the 0-1 loss. Next, let $l_y$ be the per-sample loss given $y$. We assume that $l_y \circ g$ are uniformly Lipschitz functions for all $y \in \mathcal{Y}$ and $g \in \mathcal{G}_S$, where $\mathcal{G}_S$ is the set of classifiers $g$ returned by the training algorithm using the dataset $S$. Let $R$ be such a uniform Lipschitz constant. This means that

$$|(l_y \circ g)(\sigma_f(z)) - (l_y \circ g)(\sigma_f(z'))| \leq R \, \|\sigma_f(z) - \sigma_f(z')\|_F,$$

where $l_y(g \circ \sigma_f(z)) = l(g \circ \sigma_f(z), y)$, $\sigma_f = \sigma_\tau$ when $f = f_{\mathrm{SEM}(\tau)}$, and $\sigma_f$ is the identity when $f = f_{\mathrm{base}}$. Finally, we assume that there exists $\Delta > 0$ such that, for all representations $z$ of the underlying distribution and for any $i \in [L]$, if $k = \arg\max_{j \in [V]} z_{ij}$, then $z_{ik} \geq z_{ij} + \Delta$ for any $j \neq k$. Since $\Delta$ can be arbitrarily small (e.g. as small as machine precision), this assumption typically holds in practice. We are now ready to state our theoretical claim.
Theorem 1 illuminates the advantage of SEM and the effect of the hyper-parameter $\tau$ on the performance of the downstream classifier. We present the proof in Appendix A.

Theorem 1. Let $V \geq 2$. For any $1 \geq \delta > 0$, with probability at least $1 - \delta$, the following holds for any $f^S \in \{f^S_{\mathrm{SEM}(\tau)}, f^S_{\mathrm{base}}\}$:

$$\mathbb{E}_{z,y}[l(f^S(z), y)] \leq \frac{1}{n} \sum_{i=1}^{n} l(f^S(z^{(i)}), y^{(i)}) + R \sqrt{L \, \phi_{f^S}(V, \tau)} + c \sqrt{\frac{\ln(2/\delta)}{n}},$$

where $c > 0$ is a constant in $(n, f, \mathcal{H}, \delta, \tau, S)$. Moreover, $\phi_{f^S_{\mathrm{SEM}(\tau)}} \to 0$ as $\tau \to 0$ and $\phi_{f^S_{\mathrm{SEM}(\tau)}} - \phi_{f^S_{\mathrm{base}}} \leq \frac{3}{4}(1 - V) < 0\ \ \forall \tau > 0$.

The first statement of Theorem 1 shows that the expected loss is bounded by three terms: the training loss $\frac{1}{n} \sum_{i=1}^{n} l(f^S(z^{(i)}), y^{(i)})$, the second term $R \sqrt{L \phi_{f^S}}$, and the third term $c \sqrt{\ln(2/\delta)/n}$. Since $c$ is a constant in $(n, f, \mathcal{H}, \delta, \tau, S)$, the third term goes to zero as $n \to \infty$ and is the same with and without SEM. Thus, for the purpose of assessing the impact of SEM, we can focus on the second term, where a difference arises. Theorem 1 shows that $R \sqrt{L \phi_{f^S}}$ goes to zero with SEM; i.e., $\phi(f^S_{\mathrm{SEM}(\tau)}) \to 0$ as $\tau \to 0$. Also, for any $\tau > 0$, the second term with SEM is strictly smaller than that without SEM, since $\phi_{f^S_{\mathrm{SEM}(\tau)}} - \phi_{f^S_{\mathrm{base}}} \leq \frac{3}{4}(1 - V) < 0$; this bound also indicates that the improvement due to SEM is expected to increase asymptotically as $V$ increases. Moreover, $L$ is a multiplicative constant of $\phi$, which shows that, as $L$ increases, the improvement due to SEM is also expected to be higher. Overall, Theorem 1 shows the benefit of SEM as well as the trade-off with $\tau$. When $\tau \to 0$, the second term goes to zero, but the training loss (the first term) can increase due to underfitting resulting from the reduction in representation expressivity. Thus, $\tau$ should be chosen to optimally balance this trade-off.
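The limit $\tau \to 0$ in Theorem 1 can be illustrated numerically. The sketch below evaluates the squared distance between the softmax outputs of one pair of vectors that share the same largest component; it is only an illustration for a single hand-picked pair, not a computation of the supremum that defines $\phi$ in Appendix A. The vectors `q` and `q_prime` and the temperature grid are assumptions chosen for the example.

```python
import math

def softmax(logits, tau):
    """Temperature-scaled softmax over one vector of logits."""
    scaled = [x / tau for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def squared_distance_after_softmax(q, q_prime, tau):
    """Squared Euclidean distance between softmax(q / tau) and softmax(q_prime / tau)."""
    a, b = softmax(q, tau), softmax(q_prime, tau)
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two hypothetical vectors in [-1, 1]^3 whose largest component is at index 0,
# each with a strictly positive margin over the other components.
q = [1.0, 0.0, -1.0]
q_prime = [0.2, 0.1, 0.1]

raw_gap = sum((x - y) ** 2 for x, y in zip(q, q_prime))  # distance without softmax
gaps = {tau: squared_distance_after_softmax(q, q_prime, tau)
        for tau in (1.0, 0.01, 0.001)}
```

As the temperature shrinks, both outputs approach the same one-hot vector, so the gap for this pair vanishes. The gap need not decrease monotonically at intermediate temperatures, which is why only widely separated temperatures are compared here.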

4. EMPIRICAL ANALYSIS

We empirically study the effect of SEM on the representation of SSL methods and demonstrate that SEM improves the test set accuracy on CIFAR-100 (Krizhevsky, 2009). We compare SEM with other methods for inducing sparse representations during pre-training and demonstrate that SEM leads to better downstream accuracy. On IMAGENET (Deng et al., 2009), we study the effect of SEM on robustness, semi-supervised learning and transfer learning datasets, demonstrating consistent improvement attributed to SEM. Finally, we present evidence that features produced by SEM are more naturally aligned with the semantic categories of the data. The code for reproducing the results is available at: https://github.com/lavoiems/simplicial-embeddings/.

Training setup. For all experiments, we build off the implementation of the baseline models from the Solo-Learn library (da Costa et al., 2021). We probe the encoder's output for the baseline methods, as typically done in the literature. For models with SEM, we probe the SEM-normalized representation (i.e. ẑ). In our experiments, the embedder is a linear layer followed by BatchNorm (Ioffe & Szegedy, 2015). Unless mentioned otherwise, we use L = 5000 and V = 13 for the SEM representation. We do not perform any search for the non-SEM hyper-parameters. The SEM hyper-parameters are selected using a validation set of 10% of the training set for CIFAR-100 and 10 samples per class of the in-distribution dataset for IMAGENET. The test accuracy is obtained by retraining the model with all of the training data using the parameters found with the validation set. We pre-train the SSL models for 200 epochs on IMAGENET and 1000 epochs on CIFAR-100.

4.1. SEM IMPROVES ON DOWNSTREAM CLASSIFICATION

Baseline comparison. We evaluate the effect of adding SEM to seven modern SSL approaches. We take standard SimCLR (Chen et al., 2020b), MoCo-v2 (He et al., 2020), BYOL (Grill et al., 2020), Barlow-Twins (Zbontar et al., 2021), SwAV (Caron et al., 2020), DINO (Caron et al., 2021) and VicReg (Bardes et al., 2022) models and implement SEM after the encoder. We compare our approach on CIFAR-100 with a ResNet-18 and a ResNet-50 in Table 1. We found SwAV and DINO to be unstable with ResNet-50 and therefore do not compare them with SEM in that setting. For every SSL method, using SEM improves on the baseline by 2% to 4%, demonstrating that SEM is a general approach that improves in-distribution generalization for SSL methods.

[Table fragment recovered from extraction: an unlabeled row reports 73.2, and BYOL+SEM ($\tau_d = 0.1$) reports 73.9.]

Increasing the representation's size of SEM increases the performance. We find that increasing $L$ (the number of simplices of SEM) beyond the over-complete regime increases the downstream accuracy. This increased performance is not observed when we abstain from using the softmax normalization of SEM. In Figure 3, using a ResNet-50 encoder, we compare BYOL + SEM with two baselines: an identical model without the softmax normalization, which we call BYOL + Embed, and BYOL in which we increase the representation's size before the mean-pooling using the method proposed by Dubois et al. (2022) and described in their Appendix F. To be clear, the extracted representation of BYOL + Embed is the embedder's output $z_\theta$, and the extracted representation for BYOL is the encoder's output $e_\theta$. We fix $V = 13$ and vary $L \in [10, 10000]$ to get a range of representation sizes.

Comparison of SEM with hard discretization approaches.

Several other methods can be used to induce a sparse and overcomplete representation during pre-training and downstream classification. For example, we may sample $L$ discrete one-hot codes of $V$ dimensions using Gumbel-Softmax (Jang et al., 2017), as done in Dessì et al. (2021). We can also use Vector Quantization (VQ) (Oord et al., 2018) and consider $L$ latent embedding spaces with $V$ embedding vectors each, wherein the vectors are in $\mathbb{R}^d$. In contrast to SEM, the gradient cannot be propagated trivially through the bottleneck, so VQ uses straight-through estimation in the embedding space to backpropagate the gradient to the encoder. Here, we observe that these alternative approaches exhibit a considerable decrease in performance in comparison to the baseline, as demonstrated in Table 2. In this table, we reproduce the same setup as SEM but replace the softmax with hard-discretization baseline methods. For discretization with Gumbel straight-through estimation, we use the same setup as SEM with $L = 5000$ and $V = 13$, that is, 5000 one-hot vectors of 13 dimensions, and $\tau = 2$. For VQ, we found that $L = 512$ and $V = 128$ led to the best performance. That is, we have 512 latent embedding spaces, each with 128 possible embedding vectors that are in $\mathbb{R}^{32}$. We note that while we have not found hard discretization to be successful during pre-training, we may hard-discretize a SEM representation for the downstream task. In Table 2, we also present SEM with $\tau_d = 0$, which corresponds to using the discretized representation for downstream classification. We obtain the discrete representation by taking the argmax for each simplex. This result demonstrates that pre-training with SEM can be used to learn meaningful discrete codes for downstream applications and yields better performance than the baselines, implying that pre-training with SEM could be used in applications that require discretization.

Memory and computational efficiency of SEM.
SEM's performance improvements come at the cost of increased memory allocation (VRAM), due to the additional parameters needed to perform the matrix multiplication, and slightly more computation (FLOPs/sample). For very large over-complete representations, the increased memory requirement can impede practical application. We propose a more efficient version of SEM that sparsifies the matrix multiplication of the embedder and of the projector, and detail this procedure in Appendix D.1. As shown in Table 17, SEM with sparse matrix multiplication uses only slightly more memory and compute than the BYOL baseline and outperforms it on CIFAR-100, though it underperforms the regular SEM. We also note that SEM's memory cost becomes relatively minor as we scale up the encoder. In addition, the computational cost of SEM is small compared to the total cost of pre-training, and SEM achieves higher accuracy using fewer FLOPs than scaling the encoder, as shown in Figure 1.

4.2. ANALYZING THE PARAMETERS OF SEM

We present two figures in this section to better understand the effect of the parameters of SEM on the downstream accuracy. In Figure 4, we evaluate the effect of changing $\tau_p$ and $\tau_d$ on the downstream accuracy. In Figure 5, we evaluate the effect of $L$ and $V$ on the downstream accuracy and also contrast $f_{\mathrm{base}}$ and $f_{\mathrm{SEM}}$ ($\tau = 1$) by using the same encoder pre-trained with SEM. This allows us to relate some observations to the theory presented in Section 3.2. We now discuss the effect of each of SEM's parameters on the resulting downstream classification.

Increasing $V$ yields a steep performance increase for small $V$ but quickly plateaus. In Figure 5b, we observe a steep increase of the accuracy for $V < 13$ followed by a plateau for $V > 13$. In Figure 4a, we observe that the optimal accuracy obtained for $V = 1024$ and $L = 64$ is similar to the one obtained for $L = 50$ (embedding size = 650) in Figure 3.

Increasing $L$ yields monotonic improvement for downstream classification. In the regime that we can test, increasing $L$ leads to consistent improvement in the downstream accuracy, as observed in Figure 3 and Figure 5a. Using SEM in pre-training only is not enough: using it in the downstream classifier is necessary for the improved performance, as demonstrated in Figure 5a.

The optimal $\tau_p$ depends on $V$. As previously noted in the context of attention (Vaswani et al., 2017; Wang et al., 2021a), the optimal attention temperature is proportional to the size of the attention vector. We also observe this in SEM. As presented in Figure 4a, the optimal $\tau_p$ is higher for larger $V$.

Models with larger $L$ are more robust to smaller $\tau_d$. In Figure 4, we observe that SSL models are more robust to smaller $\tau_d$ as $L$ increases. We speculate that the information can be scattered across the simplices for large $L$, which allows the expressivity of each vector to be reduced with minimal impact on the downstream accuracy.

4.3. SEM IMPROVEMENT ON LARGE-SCALE DATASETS WITH IMAGENET

Figure 1 in the introduction demonstrates that using SEM leads to better in-distribution generalization for IMAGENET and is a more efficient method of scaling up the model than scaling up the width of the ResNet-50 encoder. Here, we demonstrate that SEM generally improves the accuracy on several robustness test sets, a semi-supervised learning benchmark and transfer learning datasets. We use BYOL+SEM with an embedding size of 105 000 features ($L = 5000$ and $V = 21$) for these experiments. The embedding is pre-trained for 200 epochs using the BYOL SSL procedure.

Robustness to out-of-distribution test sets. We perform a comparative study on several out-of-distribution test sets. We observe that BYOL + SEM outperforms BYOL on every robustness dataset probed, demonstrating that SEM also improves generalization to out-of-distribution test sets.

Transfer learning. We probe the effectiveness of SEM in BYOL and MoCo when transferring representations trained on IMAGENET to other classification tasks. We follow the linear evaluation and fine-tuning methodologies described in previous works (Grill et al., 2020; Lee et al., 2021), which entail, respectively, training a linear classifier with logistic regression using sklearn (Pedregosa et al., 2011) on the embeddings of the samples, and fine-tuning the encoder. To avoid out-of-memory issues that may occur in the linear probe experiment with the sklearn solver when the number of features is large, we discretize our features and use a sparse matrix to fit the logistic regression. This is equivalent to forcing $\tau_d = 0$ for all the experiments. For the fine-tuning experiments, we fix $\tau_d = 1$ since the evaluation method allows for mini-batch gradient descent.
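The discretization used for the linear probe ($\tau_d = 0$) can be sketched as follows. This is an illustrative sketch rather than the authors' pipeline: the helper names `discretize` and `to_dense` are hypothetical, the sparse representation is simply a list of active indices rather than a library sparse-matrix type, and a unique maximum per simplex is assumed.

```python
def discretize(z):
    """Return, for an L x V matrix of logits, the flat index in [0, L * V)
    of the largest component of each simplex (one active feature per simplex)."""
    V = len(z[0])
    active = []
    for i, z_i in enumerate(z):
        j = max(range(V), key=lambda col: z_i[col])  # argmax within simplex i
        active.append(i * V + j)
    return active

def to_dense(active, size):
    """Expand the list of active indices into a dense 0/1 feature vector."""
    dense = [0] * size
    for k in active:
        dense[k] = 1
    return dense

# Hypothetical embedder output with L = 2 simplices of V = 3 dimensions.
z = [[0.1, 0.9, 0.0], [2.0, -1.0, 0.5]]
active = discretize(z)         # L indices instead of L * V floats
dense = to_dense(active, 6)    # concatenation of L one-hot vectors
```

Storing only the `L` active indices per sample, instead of `L * V` floating-point values, is what makes a sparse-matrix solver practical for very large embeddings.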
We perform our transfer learning experiments on the following datasets: FOOD (Bossard et al., 2014), CIFAR-10 (C-10) (Krizhevsky, 2009), CIFAR-100 (C-100) (Krizhevsky, 2009), SUN (Xiao et al., 2010), DTD (Cimpoi et al., 2014) and FLOWER (Nilsback & Zisserman, 2008). This task evaluates the generality of the encoder, as it has to encode samples from various out-of-distribution domains with categories that it may not have seen during training. We present our results in Table 4 and observe that SEM improves the transfer accuracy over the baseline for every dataset except DTD in the linear probe experiment. For DTD, we hypothesize that the drop in performance is due to a temperature that is too small. Since this is a texture dataset with higher frequency content, it might be the case that more expressivity is needed to correctly fit the data. The fine-tuning experiment, where BYOL + SEM outperforms the baseline, supports this conjecture.

Table 5: Semi-supervised learning accuracy by fine-tuning on IMAGENET.

| Method   | Top-1 (1%) | Top-1 (10%) | Top-5 (1%) | Top-5 (10%) |
|----------|------------|-------------|------------|-------------|
| BYOL     | 51.6       | 67.5        | 78.0       | 88.9        |
| BYOL+SEM | 56.7       | 69.9        | 81.0       | 90.0        |

Semi-supervised learning. We evaluate the effect of using SEM when fine-tuning on a classification task with a small subset of IMAGENET's training set. We follow the semi-supervised learning procedure of Chen et al. (2020b) and Grill et al. (2020) and use the same fixed splits of 1% and 10% of the ImageNet labelled training set. In Table 5, we demonstrate that using SEM leads to a substantial increase in performance, especially in the low-supervision regime.

4.4. SEMANTIC COHERENCE OF SEM FEATURES

Here we demonstrate that SEM features are coherently aligned with the semantics present in the training data. Qualitatively, we visualize the most predictive features of a downstream linear classifier trained on CIFAR-100 and see that the classes with similar predictive features are semantically related. Quantitatively, we propose a metric that measures, for each class, the fraction of the classes sharing a predictive feature with it that belong to the same superclass. For both analyses, we use a linear classifier trained on the features extracted from BYOL with and without SEM.

Consider the trained linear classifier with a weight matrix $W \in \mathbb{R}^{N \times C}$, with $N$ features and $C$ classes. By preserving the top $K$ parameters of the weight matrix $W$ for each class and pruning the features predictive for only one class, we create a bipartite graph between two sets of nodes: the CIFAR-100 classes and the features of the representation. We denote this graph $W_K$. The qualitative analysis is given by plotting the subset $W_5$, obtained by taking the top 5 features for each class. We present a subset of the graph for BYOL+SEM in Figure 6a and for BYOL in Figure 6b. The full graphs are presented in the Appendix. In the SEM plot, a set of connected components emerges, and the connected components of the graph are semantically related. For example, the first set of connected components are flowers, and the last set of connected components are aquatic mammals. The same class coherence is not observed with either the BYOL baseline or with BYOL augmented with a large representation. In particular, we do not see a small number of semantically related connected components. Instead, we see a large fully connected graph.

Next, we describe how we quantitatively measure the semantic coherence of the features. Notice that two classes share a common predictive feature on $W_K$ if they are 2-neighbours.
Let $N(c_i)$ return all pairs $(c_i, c_j)$ for every $c_j$ that is a 2-neighbour of $c_i$. Moreover, define the operation $\mathrm{is\_super}(c_i, c_j)$, which returns 1 if $c_i$ and $c_j$ are from the same CIFAR-100 superclass and 0 otherwise. We reproduce the superclasses of CIFAR-100 in Table 22 in the Appendix. We measure semantic coherence as follows:

$$\mathrm{Coherence}(W_K) := \frac{1}{C} \sum_{i=1}^{C} \sum_{(c_i, c_j) \in N(c_i)} \frac{\mathrm{is\_super}(c_i, c_j)}{|N(c_i)|},$$

where $C = 100$ for CIFAR-100 and $|\cdot|$ is the cardinality of a set. We compare the semantic coherence of BYOL+SEM with the control experiments on BYOL: regular BYOL, BYOL with an embedding of the same size as BYOL+SEM but without the normalization, and BYOL to which we applied linear ICA (Hyvärinen & Oja, 2000) in an attempt to disentangle the features. In Figure 10, we plot the full graph $W_5$ for BYOL+SEM and the baselines. We observe that using SEM yields semantically coherent features for all the classes of CIFAR-100. This observation is consistent with the qualitative and quantitative experiments presented earlier and demonstrates that SEM's inductive bias during pre-training leads to features that are semantically coherent with the semantic categories extant in the data. This arguably has important implications for improving the interpretability of SSL representations.
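The coherence metric above can be sketched in plain Python. This is a hedged illustration under stated assumptions rather than the authors' evaluation code: the function names, the toy weight matrix and the toy superclass assignment are hypothetical; "top $K$" is taken to mean the $K$ largest weights per class (not largest in absolute value); and a class with no 2-neighbours is assumed to contribute 0 to the sum, a case the definition leaves unspecified.

```python
def top_k_graph(W, K):
    """Build the bipartite graph W_K from an N x C weight matrix.

    Returns a dict mapping each kept feature to the set of classes it is
    predictive of; features predictive of only one class are pruned.
    """
    n_features, n_classes = len(W), len(W[0])
    feature_to_classes = {}
    for c in range(n_classes):
        # Assumption: rank features by raw weight for class c.
        ranked = sorted(range(n_features), key=lambda f: W[f][c], reverse=True)
        for f in ranked[:K]:
            feature_to_classes.setdefault(f, set()).add(c)
    return {f: cs for f, cs in feature_to_classes.items() if len(cs) > 1}

def coherence(graph, n_classes, superclass):
    """Average fraction of a class's 2-neighbours that share its superclass."""
    neighbours = [set() for _ in range(n_classes)]
    for classes in graph.values():
        for c in classes:
            neighbours[c] |= classes - {c}
    total = 0.0
    for c in range(n_classes):
        if neighbours[c]:  # assumption: classes without neighbours add 0
            same = sum(1 for d in neighbours[c] if superclass[d] == superclass[c])
            total += same / len(neighbours[c])
    return total / n_classes

# Toy example: N = 4 features, C = 4 classes, two superclasses.
W = [
    [0.9, 0.8, 0.0, 0.0],
    [0.0, 0.1, 0.7, 0.6],
    [0.5, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.2],
]
superclass = [0, 0, 1, 1]
score_k1 = coherence(top_k_graph(W, 1), 4, superclass)
score_k2 = coherence(top_k_graph(W, 2), 4, superclass)
```

In the toy example, keeping only the single most predictive feature per class links classes 0 and 1 and classes 2 and 3, so every 2-neighbour shares a superclass; keeping two features per class adds cross-superclass links and lowers the score.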

5. CONCLUSION

SEM is a simple, drop-in module that induces discrete sparse overcomplete representations for standard SSL methods using a softmax operation. This simple modification leads to improved generalization on downstream classification across several state-of-the-art SSL methods. Furthermore, SEM improves performance on out-of-distribution, semi-supervised, and transfer learning tasks across the board and also scales with encoder size. By analyzing semantic coherence, we find that SEMs naturally disentangle data into semantic categories without any explicit training objectives.

A PROOF OF THEOREM 1

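Let us introduce additional notation used in the proofs. Define $r = (z, y) \in \mathcal{R}$, $\ell(f, r) = l(f(z), y)$,
\[
\tilde{C}_{y,k_1,\dots,k_L} = \{(z, \hat{y}) \in \mathcal{Z} \times \mathcal{Y} : \hat{y} = y,\ k_j = \arg\max_{t \in [V]} z_{j,t}\ \forall j \in [L]\},
\]
and
\[
\tilde{Z}_{k_1,\dots,k_L} = \{z \in \mathcal{Z} : k_j = \arg\max_{t \in [V]} z_{j,t}\ \forall j \in [L]\}.
\]
We then define $C_k$ to be the flattened version of $\tilde{C}_{y,k_1,\dots,k_L}$; i.e., $\{C_k\}_{k=1}^{K} = \{\tilde{C}_{y,k_1,\dots,k_L}\}_{y \in \mathcal{Y},\, k_1,\dots,k_L \in [V]}$ with $C_1 = \tilde{C}_{1,1,\dots,1}$, $C_2 = \tilde{C}_{2,1,\dots,1}$, $C_{|\mathcal{Y}|} = \tilde{C}_{|\mathcal{Y}|,1,\dots,1}$, $C_{|\mathcal{Y}|+1} = \tilde{C}_{1,2,1,\dots,1}$, $C_{2|\mathcal{Y}|} = \tilde{C}_{|\mathcal{Y}|,2,1,\dots,1}$, and so on. Similarly, define $Z_k$ to be the flattened version of $\tilde{Z}_{k_1,\dots,k_L}$. We also use $Q_i = \{q \in [-1, +1]^V : i = \arg\max_{j \in [V]} q_j\}$, $I_k := I_k^S := \{i \in [n] : r_i \in C_k\}$, and $\alpha_k(h) := \mathbb{E}_r[\ell(h, r) \mid r \in C_k]$. Moreover, we define
\[
\phi(f^S_{\mathrm{base}}) = \sup_{i \in [V]} \sup_{q, q' \in Q_i} \|q - q'\|_2^2
\quad\text{and}\quad
\phi(f^S_{\mathrm{SEM}(\tau)}) = \sup_{i \in [V]} \sup_{q, q' \in Q_i} \|\sigma_\tau(q) - \sigma_\tau(q')\|_2^2,
\]
where $\sigma_\tau(q)_j = \frac{e^{q_j/\tau}}{\sum_{t=1}^{V} e^{q_t/\tau}}$ for $j = 1, \dots, V$.

We first decompose the generalization gap into two terms using the following lemma:

Lemma 1. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathcal{H}$:
\[
\mathbb{E}_r[\ell(h, r)] - \frac{1}{n} \sum_{i=1}^{n} \ell(h, r_i) \le \frac{1}{n} \sum_{k=1}^{K} |I_k| \left( \alpha_k(h) - \frac{1}{|I_k|} \sum_{i \in I_k} \ell(h, r_i) \right) + c \sqrt{\frac{\ln(2/\delta)}{n}}.
\]

Proof. We first write the expected error as the sum of the conditional expected errors:
\[
\mathbb{E}_r[\ell(h, r)] = \sum_{k=1}^{K} \mathbb{E}_r[\ell(h, r) \mid r \in C_k] \Pr(r \in C_k) = \sum_{k=1}^{K} \mathbb{E}_{r_k}[\ell(h, r_k)] \Pr(r \in C_k),
\]
where $r_k$ is the random variable for the conditional with $r \in C_k$. Using this, we decompose the generalization error into two terms:
\[
\mathbb{E}_r[\ell(h, r)] - \frac{1}{n} \sum_{i=1}^{n} \ell(h, r_i)
= \sum_{k=1}^{K} \mathbb{E}_{r_k}[\ell(h, r_k)] \left( \Pr(r \in C_k) - \frac{|I_k|}{n} \right)
+ \left( \sum_{k=1}^{K} \mathbb{E}_{r_k}[\ell(h, r_k)] \frac{|I_k|}{n} - \frac{1}{n} \sum_{i=1}^{n} \ell(h, r_i) \right). \tag{3}
\]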
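The second term on the right-hand side of (3) is further simplified by using $\frac{1}{n} \sum_{i=1}^{n} \ell(h, r_i) = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in I_k} \ell(h, r_i)$, as
\[
\sum_{k=1}^{K} \mathbb{E}_{r_k}[\ell(h, r_k)] \frac{|I_k|}{n} - \frac{1}{n} \sum_{i=1}^{n} \ell(h, r_i)
= \frac{1}{n} \sum_{k=1}^{K} |I_k| \left( \mathbb{E}_{r_k}[\ell(h, r_k)] - \frac{1}{|I_k|} \sum_{i \in I_k} \ell(h, r_i) \right).
\]
Substituting this into equation (3) yields
\[
\begin{aligned}
\mathbb{E}_r[\ell(h, r)] - \frac{1}{n} \sum_{i=1}^{n} \ell(h, r_i)
&= \sum_{k=1}^{K} \mathbb{E}_{r_k}[\ell(h, r_k)] \left( \Pr(r \in C_k) - \frac{|I_k|}{n} \right) + \frac{1}{n} \sum_{k=1}^{K} |I_k| \left( \mathbb{E}_{r_k}[\ell(h, r_k)] - \frac{1}{|I_k|} \sum_{i \in I_k} \ell(h, r_i) \right) \\
&\le B \sum_{k=1}^{K} \left| \Pr(r \in C_k) - \frac{|I_k|}{n} \right| + \frac{1}{n} \sum_{k=1}^{K} |I_k| \left( \mathbb{E}_{r_k}[\ell(h, r_k)] - \frac{1}{|I_k|} \sum_{i \in I_k} \ell(h, r_i) \right).
\end{aligned} \tag{4}
\]
By using the Bretagnolle-Huber-Carol inequality (van der Vaart & Wellner, 1996, A6.6 Proposition), we have that for any $\delta > 0$, with probability at least $1 - \delta$,
\[
\sum_{k=1}^{K} \left| \Pr(r \in C_k) - \frac{|I_k|}{n} \right| \le \sqrt{\frac{2K \ln(2/\delta)}{n}}. \tag{5}
\]
Here, notice that the term $\sum_{k=1}^{K} \left| \Pr(r \in C_k) - \frac{|I_k|}{n} \right|$ does not depend on $h \in \mathcal{H}$. Moreover, note that for any $(f, h, M)$ such that $M > 0$ and $B \ge 0$ for all $X$, we have that $\mathbb{P}(f(X) \ge M) \ge \mathbb{P}(f(X) > M) \ge \mathbb{P}(B f(X) + h(X) > BM + h(X))$, where the probability is with respect to the randomness of $X$. Thus, by combining (4) and (5), we have that for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathcal{H}$:
\[
\mathbb{E}_r[\ell(h, r)] - \frac{1}{n} \sum_{i=1}^{n} \ell(h, r_i) \le \frac{1}{n} \sum_{k=1}^{K} |I_k| \left( \alpha_k(h) - \frac{1}{|I_k|} \sum_{i \in I_k} \ell(h, r_i) \right) + c \sqrt{\frac{\ln(2/\delta)}{n}}.
\]

In particular, the first term from the previous lemma will be bounded with the following lemma:

Lemma 2. For any $f \in \{f^S_{\mathrm{SEM}(\tau)}, f^S_{\mathrm{base}}\}$,
\[
\frac{1}{n} \sum_{k=1}^{K} |I_k| \left( \alpha_k(f) - \frac{1}{|I_k|} \sum_{i \in I_k} \ell(f, r_i) \right) \le R \sqrt{L \phi(f)}.
\]

Proof. By using the triangle inequality,
\[
\frac{1}{n} \sum_{k=1}^{K} |I_k| \left( \mathbb{E}_r[\ell(f, r) \mid r \in C_k] - \frac{1}{|I_k|} \sum_{i \in I_k} \ell(f, r_i) \right)
\le \frac{1}{n} \sum_{k=1}^{K} |I_k| \left| \mathbb{E}_r[\ell(f, r) \mid r \in C_k] - \frac{1}{|I_k|} \sum_{i \in I_k} \ell(f, r_i) \right|.
\]
Furthermore, by using the triangle inequality,
\[
\begin{aligned}
\left| \mathbb{E}_r[\ell(f, r) \mid r \in C_k] - \frac{1}{|I_k|} \sum_{i \in I_k} \ell(f, r_i) \right|
&= \left| \frac{1}{|I_k|} \sum_{i \in I_k} \mathbb{E}_r[\ell(f, r) \mid r \in C_k] - \frac{1}{|I_k|} \sum_{i \in I_k} \ell(f, r_i) \right| \\
&\le \frac{1}{|I_k|} \sum_{i \in I_k} \left| \mathbb{E}_r[\ell(f, r) \mid r \in C_k] - \ell(f, r_i) \right| \\
&\le \sup_{r, r' \in C_k} \left| \ell(f, r) - \ell(f, r') \right|.
\end{aligned}
\]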
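If $f = f^S_{\mathrm{SEM}(\tau)} = g^S_{\mathrm{SEM}(\tau)} \circ \sigma_\tau$, since $g^S_{\mathrm{SEM}(\tau)} \in \mathcal{G}^S$, by using the Lipschitz continuity, boundedness, and non-negativity,
\[
\begin{aligned}
\sup_{r, r' \in C_k} |\ell(f, r) - \ell(f, r')|
&= \sup_{y \in \mathcal{Y}} \sup_{z, z' \in Z_k} \left| (l_y \circ g^S_{\mathrm{SEM}(\tau)})(\sigma_\tau(z)) - (l_y \circ g^S_{\mathrm{SEM}(\tau)})(\sigma_\tau(z')) \right| \\
&\le R \sup_{z, z' \in Z_k} \|\sigma_\tau(z) - \sigma_\tau(z')\|_F \\
&= R \sup_{z, z' \in Z_k} \sqrt{\sum_{t=1}^{L} \sum_{j=1}^{V} \left( \sigma_\tau(z)_{t,j} - \sigma_\tau(z')_{t,j} \right)^2} \\
&\le R \sqrt{\sum_{t=1}^{L} \sup_{i \in [V]} \sup_{q, q' \in Q_i} \|\sigma_\tau(q) - \sigma_\tau(q')\|_2^2}
= R \sqrt{L \phi(f^S_{\mathrm{SEM}(\tau)})}.
\end{aligned}
\]
Similarly, if $f = f^S_{\mathrm{base}} = g^S_{\mathrm{base}}$, since $g^S_{\mathrm{base}} \in \mathcal{G}^S$, by using the Lipschitz continuity, boundedness, and non-negativity,
\[
\sup_{r, r' \in C_k} |\ell(f, r) - \ell(f, r')|
= \sup_{y \in \mathcal{Y}} \sup_{z, z' \in Z_k} \left| (l_y \circ g^S_{\mathrm{base}})(z) - (l_y \circ g^S_{\mathrm{base}})(z') \right|
\le R \sup_{z, z' \in Z_k} \|z - z'\|_F \le R \sqrt{L \phi(f^S_{\mathrm{base}})}.
\]
Therefore, for any $f \in \{f^S_{\mathrm{SEM}(\tau)}, f^S_{\mathrm{base}}\}$,
\[
\frac{1}{n} \sum_{k=1}^{K} |I_k| \left( \alpha_k(f) - \frac{1}{|I_k|} \sum_{i \in I_k} \ell(f, r_i) \right) \le \frac{1}{n} \sum_{k=1}^{K} |I_k| R \sqrt{L \phi(f)} = R \sqrt{L \phi(f)}.
\]

Combining Lemma 1 and Lemma 2, we obtain the following upper bound on the gap:

Lemma 3. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for any $f \in \{f^S_{\mathrm{SEM}(\tau)}, f^S_{\mathrm{base}}\}$:
\[
\mathbb{E}_r[\ell(f, r)] - \frac{1}{n} \sum_{i=1}^{n} \ell(f, r_i) \le R \sqrt{L \phi(f)} + c \sqrt{\frac{\ln(2/\delta)}{n}}.
\]

Proof. This follows directly from combining Lemma 1 and Lemma 2.

We now provide an upper bound on $\phi(f^S_{\mathrm{SEM}(\tau)})$ in the following lemma:

Lemma 4. For any $\tau > 0$,
\[
\phi(f^S_{\mathrm{SEM}(\tau)}) \le \left( \frac{1}{1 + (V-1) e^{-2/\tau}} - \frac{1}{1 + (V-1) e^{-\Delta/\tau}} \right)^2
+ (V-1) \left( \frac{1}{1 + e^{\Delta/\tau} (1 + (V-2) e^{-2/\tau})} - \frac{1}{1 + e^{2/\tau} (1 + (V-2) e^{-\Delta/\tau})} \right)^2.
\]

Proof. Recall the definition
\[
\phi(f^S_{\mathrm{SEM}(\tau)}) = \sup_{i \in [V]} \sup_{q, q' \in Q_i} \|\sigma_\tau(q) - \sigma_\tau(q')\|_2^2,
\]
where $\sigma_\tau(q)_j = \frac{e^{q_j/\tau}}{\sum_{t=1}^{V} e^{q_t/\tau}}$ for $j = 1, \dots, V$. By the symmetry and independence over $i \in [V]$ inside the first supremum, we have
\[
\phi(f^S_{\mathrm{SEM}(\tau)}) = \sup_{q, q' \in Q_1} \|\sigma_\tau(q) - \sigma_\tau(q')\|_2^2.
\]
For any $q, q' \in Q_1$ and $i \in \{2, \dots, V\}$ (with $q = (q_1, \dots, q_V)$ and $q' = (q'_1, \dots, q'_V)$), there exist $\delta_i, \delta'_i > 0$ such that $q_i = q_1 - \delta_i$ and $q'_i = q'_1 - \delta'_i$. Here, since $z_{ik} - \Delta \ge z_{ij}$ from the assumption, we have that for all $i \in \{2, \dots, V\}$, $\delta_i, \delta'_i \ge \Delta > 0$.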
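Thus, we can rewrite
\[
\sum_{t=1}^{V} e^{q_t/\tau} = e^{q_1/\tau} + \sum_{i=2}^{V} e^{(q_1 - \delta_i)/\tau} = e^{q_1/\tau} + e^{q_1/\tau} \sum_{i=2}^{V} e^{-\delta_i/\tau} = e^{q_1/\tau} \left( 1 + \sum_{i=2}^{V} e^{-\delta_i/\tau} \right).
\]
Similarly,
\[
\sum_{t=1}^{V} e^{q'_t/\tau} = e^{q'_1/\tau} \left( 1 + \sum_{i=2}^{V} e^{-\delta'_i/\tau} \right).
\]
Using these,
\[
\sigma_\tau(q)_1 = \frac{e^{q_1/\tau}}{\sum_{t=1}^{V} e^{q_t/\tau}} = \frac{e^{q_1/\tau}}{e^{q_1/\tau} \left( 1 + \sum_{i=2}^{V} e^{-\delta_i/\tau} \right)} = \frac{1}{1 + \sum_{i=2}^{V} e^{-\delta_i/\tau}},
\]
and for all $j \in \{2, \dots, V\}$,
\[
\sigma_\tau(q)_j = \frac{e^{q_j/\tau}}{\sum_{t=1}^{V} e^{q_t/\tau}} = \frac{e^{(q_1 - \delta_j)/\tau}}{e^{q_1/\tau} \left( 1 + \sum_{i=2}^{V} e^{-\delta_i/\tau} \right)} = \frac{e^{-\delta_j/\tau}}{1 + \sum_{i=2}^{V} e^{-\delta_i/\tau}} = \frac{1}{1 + e^{\delta_j/\tau} + \sum_{i \in I_j} e^{(\delta_j - \delta_i)/\tau}},
\]
where $I_j := \{2, \dots, V\} \setminus \{j\}$. Similarly,
\[
\sigma_\tau(q')_1 = \frac{1}{1 + \sum_{i=2}^{V} e^{-\delta'_i/\tau}},
\]
and for all $j \in \{2, \dots, V\}$,
\[
\sigma_\tau(q')_j = \frac{1}{1 + e^{\delta'_j/\tau} + \sum_{i \in I_j} e^{(\delta'_j - \delta'_i)/\tau}}.
\]
Using these, for any $q, q' \in Q_1$,
\[
\begin{aligned}
|\sigma_\tau(q)_1 - \sigma_\tau(q')_1|
&= \left| \frac{1}{1 + \sum_{i=2}^{V} e^{-\delta_i/\tau}} - \frac{1}{1 + \sum_{i=2}^{V} e^{-\delta'_i/\tau}} \right| \\
&\le \frac{1}{1 + \sum_{i=2}^{V} e^{-2/\tau}} - \frac{1}{1 + \sum_{i=2}^{V} e^{-\Delta/\tau}}
= \frac{1}{1 + (V-1) e^{-2/\tau}} - \frac{1}{1 + (V-1) e^{-\Delta/\tau}},
\end{aligned}
\]
and for all $j \in \{2, \dots, V\}$,
\[
\begin{aligned}
|\sigma_\tau(q)_j - \sigma_\tau(q')_j|
&= \left| \frac{1}{1 + e^{\delta_j/\tau} + \sum_{i \in I_j} e^{(\delta_j - \delta_i)/\tau}} - \frac{1}{1 + e^{\delta'_j/\tau} + \sum_{i \in I_j} e^{(\delta'_j - \delta'_i)/\tau}} \right| \\
&\le \frac{1}{1 + e^{\Delta/\tau} + \sum_{i \in I_j} e^{(\Delta - 2)/\tau}} - \frac{1}{1 + e^{2/\tau} + \sum_{i \in I_j} e^{(2 - \Delta)/\tau}} \\
&= \frac{1}{1 + e^{\Delta/\tau} + (V-2) e^{(\Delta - 2)/\tau}} - \frac{1}{1 + e^{2/\tau} + (V-2) e^{(2 - \Delta)/\tau}} \\
&= \frac{1}{1 + e^{\Delta/\tau} (1 + (V-2) e^{-2/\tau})} - \frac{1}{1 + e^{2/\tau} (1 + (V-2) e^{-\Delta/\tau})}.
\end{aligned}
\]
By combining these,
\[
\begin{aligned}
\sup_{q, q' \in Q_1} \|\sigma_\tau(q) - \sigma_\tau(q')\|_2^2
&= \sup_{q, q' \in Q_1} \sum_{j=1}^{V} |\sigma_\tau(q)_j - \sigma_\tau(q')_j|^2 \\
&\le \left( \frac{1}{1 + (V-1) e^{-2/\tau}} - \frac{1}{1 + (V-1) e^{-\Delta/\tau}} \right)^2
+ (V-1) \left( \frac{1}{1 + e^{\Delta/\tau} (1 + (V-2) e^{-2/\tau})} - \frac{1}{1 + e^{2/\tau} (1 + (V-2) e^{-\Delta/\tau})} \right)^2.
\end{aligned}
\]

Using the previous lemma, we establish the asymptotic behavior of $\phi(f^S_{\mathrm{SEM}(\tau)})$ in the following lemma:

Lemma 5. It holds that $\phi(f^S_{\mathrm{SEM}(\tau)}) \to 0$ as $\tau \to 0$.

Proof. Using Lemma 4,
\[
\lim_{\tau \to 0} \phi(f^S_{\mathrm{SEM}(\tau)}) \le \lim_{\tau \to 0} \left( \frac{1}{1 + (V-1) e^{-2/\tau}} - \frac{1}{1 + (V-1) e^{-\Delta/\tau}} \right)^2
+ (V-1) \lim_{\tau \to 0} \left( \frac{1}{1 + e^{\Delta/\tau} (1 + (V-2) e^{-2/\tau})} - \frac{1}{1 + e^{2/\tau} (1 + (V-2) e^{-\Delta/\tau})} \right)^2.
\]
Moreover,
\[
\lim_{\tau \to 0} \left( \frac{1}{1 + (V-1) e^{-2/\tau}} - \frac{1}{1 + (V-1) e^{-\Delta/\tau}} \right)^2 = \left( \frac{1}{1} - \frac{1}{1} \right)^2 = 0,
\]
\[
\lim_{\tau \to 0} \left( \frac{1}{1 + e^{\Delta/\tau} (1 + (V-2) e^{-2/\tau})} - \frac{1}{1 + e^{2/\tau} (1 + (V-2) e^{-\Delta/\tau})} \right)^2 = |0 - 0|^2 = 0.
\]
Therefore, $\lim_{\tau \to 0} \phi(f^S_{\mathrm{SEM}(\tau)}) \le 0$.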
Since $\phi(f^S_{\mathrm{SEM}(\tau)}) \ge 0$, this implies the statement of this lemma.

We combine the lemmas above to prove Theorem 1, which is restated below with its proof:

Theorem 1. Let $V \ge 2$. For any $1 \ge \delta > 0$, with probability at least $1 - \delta$, the following holds for any $f^S \in \{f^S_{\mathrm{SEM}(\tau)}, f^S_{\mathrm{base}}\}$:
\[
\mathbb{E}_{z,y}[l(f^S(z), y)] \le \frac{1}{n} \sum_{i=1}^{n} l(f^S(z^{(i)}), y^{(i)}) + R \sqrt{L \phi(f^S)} + c \sqrt{\frac{\ln(2/\delta)}{n}},
\]
where $c > 0$ is a constant in $(n, f, \mathcal{H}, \delta, \tau, S)$. Moreover, $\phi(f^S_{\mathrm{SEM}(\tau)}) \to 0$ as $\tau \to 0$ and $\phi(f^S_{\mathrm{SEM}(\tau)}) - \phi(f^S_{\mathrm{base}}) \le \frac{3}{4}(1 - V) < 0$ for all $\tau > 0$.

Proof. The first statement directly follows from Lemma 3. The second statement is proven by Lemma 5 and Lemma 6.
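As a sanity check on Lemma 4 and Lemma 5, the right-hand side of Lemma 4 can be evaluated numerically for decreasing temperatures. The sketch below is illustrative only: the function name `lemma4_bound` and the values $V = 13$ and $\Delta = 0.5$ are assumptions chosen for the example, not values taken from the experiments.

```python
import numpy as np

def lemma4_bound(V, tau, delta):
    """Right-hand side of Lemma 4 for simplex size V, temperature tau, margin delta."""
    a = 1.0 / (1.0 + (V - 1) * np.exp(-2.0 / tau))
    b = 1.0 / (1.0 + (V - 1) * np.exp(-delta / tau))
    c = 1.0 / (1.0 + np.exp(delta / tau) * (1.0 + (V - 2) * np.exp(-2.0 / tau)))
    d = 1.0 / (1.0 + np.exp(2.0 / tau) * (1.0 + (V - 2) * np.exp(-delta / tau)))
    return (a - b) ** 2 + (V - 1) * (c - d) ** 2

# Hypothetical setting: V = 13 and margin 0.5; tau is kept >= 0.01 so that
# exp(2 / tau) stays within double-precision range.
bounds = [lemma4_bound(V=13, tau=t, delta=0.5) for t in (1.0, 0.1, 0.01)]
```

For this setting the bound shrinks as the temperature decreases, in line with the limit stated in Lemma 5.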

B EXPERIMENT DETAILS FOR IMAGENET

B.1 IMAGE AUGMENTATION

The augmentations applied, in order, during training are:
• Random resized crop to a 224 × 224 image. A random patch of the image is selected and resized to 224 × 224.
• Random color jitter. Modifies the brightness, the contrast, the saturation and the hue.
• Random grayscale. Randomly applies a grayscale filter to the image.
• Random Gaussian blur. Randomly applies a Gaussian blur filter.
• Random solarization. Randomly applies a solarization filter.

The parameters of the augmentations are presented in Table 16. At validation and test time, we resize the images to 256 × 256 and then center crop a patch of 224 × 224. For both training and evaluation, we re-normalize the image using the statistics of the training set.
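The evaluation-time preprocessing (center crop followed by normalization) can be sketched with NumPy alone. This is a minimal illustration under assumptions: the helper names are hypothetical, the image is assumed to be already resized to 256 × 256, and the mean and standard deviation are placeholders rather than the actual training-set statistics.

```python
import numpy as np

def center_crop(img, size=224):
    """Crop the central size x size patch from an (H, W, C) image."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def normalize(img, mean, std):
    """Per-channel normalization with (training-set) statistics."""
    return (img - mean) / std

# Placeholder statistics; the real values are those of the training set.
mean = np.array([0.5, 0.5, 0.5])
std = np.array([0.25, 0.25, 0.25])

img = np.random.default_rng(0).random((256, 256, 3))  # stands in for a resized image
out = normalize(center_crop(img), mean, std)
```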

B.2 LINEAR EVALUATION

We follow the evaluation protocol of Chen et al. (2020b). The linear evaluation is done by training a linear classifier on the frozen representation of the ImageNet training samples. We train the linear classifier with a cross-entropy objective for 100 epochs using SGD with Nesterov momentum of 0.9 and a batch size of 256. We divide the learning rate by a factor of 10 at epoch 60 and again at epoch 80. During training, we apply a random resized crop to 224 × 224 pixels and a random horizontal flip. We sweep over 4 learning rates, {0.5, 0.1, 0.05, 0.01}, 3 l1 weight decays, {0, 1e-6, 1e-5}, and 3 values of $\tau_d$ for SEM, {0.01, 0.1, 1}, using a validation set of 10 images per class, and then re-train using the full training set. We report the results on the test set.
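The step schedule and the sweep described above can be written down explicitly. The following is a sketch, not the authors' training code: `step_lr` and `grid` are hypothetical names, and the $\tau_d$ axis of the grid applies only to SEM models.

```python
from itertools import product

def step_lr(base_lr, epoch, milestones=(60, 80), factor=10.0):
    """Learning rate at `epoch`: divided by `factor` at each milestone already reached."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr / factor ** drops

# Sweep values quoted in the text.
learning_rates = [0.5, 0.1, 0.05, 0.01]
weight_decays = [0.0, 1e-6, 1e-5]
tau_d_values = [0.01, 0.1, 1.0]

# Every (learning rate, weight decay, tau_d) configuration evaluated on the validation set.
grid = list(product(learning_rates, weight_decays, tau_d_values))
```

With a base learning rate of 0.1, the schedule uses 0.1 for epochs 0 to 59, 0.01 for epochs 60 to 79, and 0.001 afterwards.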

B.3 ROBUSTNESS EXPERIMENTS

We follow the evaluation procedure of Lee et al. (2021). We treated the robustness datasets as additional "test sets" in that we simply evaluated on them using the evaluation procedure described above. The images were resized to 256 × 256 before being center cropped to 224 × 224. The evaluation was performed using the public robustness benchmark evaluation code of Djolonga et al. (2020).§

B.4 TRANSFER LEARNING LINEAR PROBE

We follow the linear evaluation protocol of Kolesnikov et al. (2019) and Chen et al. (2020b). We train a linear classifier using a regularized multinomial logistic regression from the scikit-learn package (Pedregosa et al., 2011). The representation is frozen, so that we train neither the encoder backbone nor the batch-normalization statistics. We do not perform any augmentations; the images are resized to 224 pixels using bicubic resampling and then normalized using the statistics of ImageNet's training set. We tune the regularization term over a range of 45 logarithmically spaced values between $10^{-6}$ and $10^{5}$ using a small validation set and re-train using the full training set. For SEM, we set $\tau_d = 0$ for all experiments.
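The regularization grid described above is easy to reproduce with NumPy. The snippet below only builds the grid; how each value maps onto the solver's regularization argument is not specified in the text and is left out on purpose. The variable name `reg_grid` is hypothetical.

```python
import numpy as np

# 45 logarithmically spaced values between 1e-6 and 1e5 (both endpoints included).
reg_grid = np.logspace(-6, 5, num=45)

# Consecutive values differ by a constant factor of 10 ** (11 / 44) = 10 ** 0.25.
ratios = reg_grid[1:] / reg_grid[:-1]
```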

B.5 TRANSFER LEARNING FINE-TUNING

We follow the fine-tuning protocol of Chen et al. (2020b) and Grill et al. (2020). We initialize the encoder with the pre-trained model and a classifier head with random initialization. We train for 20,000 steps with a batch size of 256 using SGD with a Nesterov momentum of 0.9. We set the momentum parameter for the batch normalization to $\max(1 - 10/s, 0.9)$, where $s$ is the number of steps per epoch. During training, we use random resizing to 224 × 224 pixels and random horizontal flipping. At test time, we resize the images along the shorter side to 256 pixels using bicubic resampling, followed by a center crop to 224 × 224 pixels. Due to computational constraints, we only tune the learning rate, using a search over 7 values logarithmically spaced between 0.0001 and 0.1. For SEM, we set $\tau_d = 1$ for all experiments. After choosing the best learning rate on a validation set, we re-run the models using the full training set and evaluate them on the test set, which we use to report the numbers.
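The two numerical choices in this protocol can be made concrete as follows. This is a sketch with hypothetical names (`bn_momentum`, `lr_grid`). Note that frameworks disagree on what "momentum" means for batch normalization: in PyTorch, the `momentum` argument weights the new batch statistic, so a value computed with the formula below would correspond to one minus that value there. Treat that mapping as an assumption to verify against the framework in use.

```python
import numpy as np

def bn_momentum(steps_per_epoch):
    """Batch-norm momentum max(1 - 10 / s, 0.9), with s the number of steps per epoch."""
    return max(1.0 - 10.0 / steps_per_epoch, 0.9)

# 7 learning rates logarithmically spaced between 1e-4 and 1e-1.
lr_grid = np.logspace(-4, -1, num=7)
```

For a small dataset with 50 steps per epoch the momentum is clipped to 0.9, whereas 1000 steps per epoch give 0.99.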

B.6 SEMI-SUPERVISED LEARNING

We follow the semi-supervised learning protocol of Chen et al. (2020b) and Grill et al. (2020). We initialize the network using the pre-trained representation and initialize a classification head randomly. We fine-tune the encoder while training the classification head using a small subset of ImageNet. We choose the same subset used in prior works, which is defined in the TensorFlow Datasets software. During training, we randomly resize the images to 224 × 224 pixels along the shorter side using bicubic resampling, followed by a center crop and random horizontal flipping. At test time, we resize the image to 224 × 224. We optimize the cross-entropy loss with Nesterov momentum of 0.9 using batch sizes of 224. We train models for {30, 50} epochs and take the best performing on the validation set. The learning rate is chosen among a set of 5 learning rates: {0.005, 0.01, 0.02, 0.05, 0.1}. For SEM, we also search $\tau_d \in \{0.01, 0.1, 1\}$. We select the best-performing configuration on the validation set; the reported numbers are obtained on the test set after re-training using the full training set.

C HYPERPARAMETERS

The implementations of the SSL methods used in this work are taken from Solo-Learn (da Costa et al., 2021), to which we added the SEM module. The pre-training hyper-parameters of every SSL method trained on CIFAR-100 with ResNet-18 used in this work are the defaults provided in the companion repository of Solo-Learn. The hyper-parameters are also provided in the launch scripts accompanying this work. Due to the large number of SSL methods probed in this work and the amount of space it would require to exhaustively detail all of the hyper-parameters, we refer the reader to the code. For the CIFAR-100 results obtained with BYOL and a ResNet-50, we have slightly modified the default parameters; otherwise, the baseline BYOL model would not obtain competitive results. The hyper-parameters were tuned using the BYOL baseline, and the SEM module was not considered in the selection of the SSL hyper-parameters. The BYOL hyper-parameters are presented in the launch script accompanying this work and presented below for completeness. For the ImageNet experiments, we took the hyper-parameters proposed in the launch scripts of Solo-Learn and only modified the number of epochs (from 100 epochs to 200 epochs). Here, we present all of the SEM hyper-parameters used in every experiment. These hyper-parameters can also be found in the launch scripts accompanying this work. We present the hyper-parameters used to train BYOL+SEM and MoCo+SEM on CIFAR-100. Unless mentioned otherwise, these are the parameters used.

D ADDITIONAL STUDIES OF SEM

In Section 4.2, we discussed the effect of scaling L and V, of changing the softmax temperature of the online network during pre-training, and of changing the softmax temperature for the downstream task. Here, we present additional studies of SEM to provide a better understanding of the method. We provide a method for reducing the memory overhead of SEM, along with experiments demonstrating that this variant still largely outperforms the baseline. We additionally present the effect of modifying the embedder, which gives insight into how to get the most out of SEM. Next, we study the spectrum of the covariance matrix of the SEM representation and of the BYOL representation, which gives insight into how SEM can improve the training signal during pre-training. We provide a scaling analysis of BYOL and BYOL + SEM on CIFAR-100. We end with an experiment showing that pre-training with SEM is necessary to get the best performance.

D.1 AN EFFICIENT VARIANT OF SEM

A large overcomplete representation may induce a significant memory footprint due to the additional parameters of the fully connected linear layers used to map to and from the representation. For SEM we require two such mappings, as depicted in Figure 2c for BYOL. To reduce the number of parameters, we propose to sparsify the weight matrix of the fully connected linear layer by keeping only its block diagonal and setting the parameters outside the block diagonal to 0. Formally, let $v \in \mathbb{R}^{b \times m}$, $w \in \mathbb{R}^{m \times o}$ and $y = v \cdot w$ be the fully connected matrix multiplication. Instead, we partition $v$ into $n$ blocks $v_i \in \mathbb{R}^{b \times \frac{m}{n}}$ and define $n$ smaller matrices $w_i \in \mathbb{R}^{\frac{m}{n} \times \frac{o}{n}}$, where $i \in [n]$ indexes the $i$-th block. Then, we perform a batch matrix multiplication of $v_i$ and $w_i$ and concatenate the results: $y_i = v_i \cdot w_i$ and $\bar{y} = \mathrm{Concat}([y_1, \dots, y_n])$. Thus, the number of parameters of this matrix multiplication scales as $O(\frac{m \cdot o}{n})$, allowing us to reduce the memory consumption by increasing $n$, the number of blocks.

We perform an experiment where we partition the embedder and the first linear layer of the projector into 8 blocks. We present the results in Table 17, in which we compare the number of parameters, the number of activations, the vRAM allocated by PyTorch, the FLOPs/sample and the accuracy of BYOL, BYOL+SEM and BYOL+SEM/8, the latter being the model with 8 blocks obtained with the method described above. We observe that partitioning the matrix multiplications of SEM vastly reduces the parameter count and computational cost while still yielding an important improvement over the baseline. This result demonstrates that SEM can be beneficial while inducing minimal computational overhead. Attentive readers may notice that this performance is better than that of the ablation presented in Figure 3. The difference in performance is due to probing the embedder's output (i.e. $z_\theta$) in Figure 3 and probing the encoder's output (i.e. $e_\theta$) in Table 17. Probing each model with the representation used in the other experiment recovers the performance observed there.
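The block-diagonal multiplication can be illustrated with a small NumPy sketch. The function name `block_diag_matmul` and the toy sizes are assumptions made for the example; the batched product is written with `einsum`, and the result is checked against an explicit dense block-diagonal matrix.

```python
import numpy as np

def block_diag_matmul(v, w_blocks):
    """Multiply v of shape (b, m) by a block-diagonal matrix stored as n blocks.

    w_blocks has shape (n, m // n, o // n); the output has shape (b, o).
    """
    n, m_blk, o_blk = w_blocks.shape
    b, m = v.shape
    assert m == n * m_blk, "input width must be divisible into n blocks"
    v_blocks = v.reshape(b, n, m_blk)                         # split features into n contiguous blocks
    y_blocks = np.einsum("bnm,nmo->bno", v_blocks, w_blocks)  # one small matmul per block
    return y_blocks.reshape(b, n * o_blk)                     # concatenate the block outputs

# Toy sizes (hypothetical): batch 4, m = 16 inputs, o = 24 outputs, n = 8 blocks.
rng = np.random.default_rng(0)
b, m, o, n = 4, 16, 24, 8
v = rng.normal(size=(b, m))
w_blocks = rng.normal(size=(n, m // n, o // n))
y = block_diag_matmul(v, w_blocks)

# Reference: the equivalent dense matrix with zeros outside the block diagonal.
dense = np.zeros((m, o))
for i in range(n):
    dense[i * (m // n):(i + 1) * (m // n), i * (o // n):(i + 1) * (o // n)] = w_blocks[i]
```

The blocked layer stores m·o/n weights (48 here) instead of the m·o (384) of the dense layer.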

D.2 ADDITIONAL ABLATION OF THE SEM PARAMETERS

Ablating the embedder. In the main text, we mentioned that we use batch normalization at the output of the embedder. We use batch normalization mostly because we wanted to avoid tuning any hyper-parameters that were not related to SEM, so as to emphasize its contribution. Using BatchNorm gave the best performance without tuning the hyper-parameters of the baseline models. Here, we want to emphasize that SEM can be used without batch norm, but more hyper-parameters might need to be tuned for it to perform as well as the model with batch norm in the encoder. For example, we found that using no weight decay was important to get better performance when we did not have batch normalization, as illustrated in Table 18. We leave the full study of the interaction of SEM with the SSL-related parameters for future work.

Table 18: Understanding the relationship between the use of BatchNorm in the embedder and the weight decay hyper-parameter.

BatchNorm    Weight decay    Accuracy
No           0               67.2
No           1e-5            57.9
Yes          0               68.3
Yes          1e-5            73.9

Another decision is to use a linear layer as the embedder. Other alternatives may include using the identity function (i.e. the output of the encoder is used for SEM). However, if we want to systematically use the same encoder as the SSL model, then we are constrained to the representation size of the ResNet encoder (i.e. 512 for a ResNet-18). Finally, we show that using a more expressive embedder leads to worse performance, and we recommend that practitioners limit the expressivity of their embedder.

A very large embedding. Using a ResNet-18 encoder and the method proposed in Section D.1, we further scale the embedding size of SEM to see where the performance saturates for classification. In Figure 7 we observe that the performance saturates at L = 10000 for the classification task. We conjecture that the optimal L might be different for other tasks, but we leave that study for future work.

D.3 ANALYSIS OF THE SPECTRUM OF THE COVARIANCE MATRIX OF THE REPRESENTATION

To gain better insight into why the SEM representation leads to better downstream performance, we analyze the spectrum of the covariance matrix of the representation using the methodology presented in Jing et al. (2022). That is, we collect the embedding vectors of the CIFAR-100 test set using a pre-trained ResNet-50 model. For BYOL, we have an additional embedder without softmax normalization (as done in Figure 3). For BYOL and BYOL+SEM we use the embedder's output ($z_\theta$) to perform the evaluation. To compute the covariance matrix $C \in \mathbb{R}^{L \cdot V \times L \cdot V}$ of the embedding layer $z$, we define $\bar{z} := \sum_{i=1}^{N} z_i / N$, the average representation over the $N$ samples, and compute the covariance as follows:
\[
C := \frac{1}{N} \sum_{i=1}^{N} (z_i - \bar{z})(z_i - \bar{z})^\top.
\]
To plot the spectrum of the covariance matrix, we take the singular value decomposition of the matrix ($C = U S V^\top$), with $S$ the diagonal matrix of singular values, which we plot in sorted order and on a logarithmic scale in Figure 8. This experiment demonstrates that the softmax normalization counters the dimensional collapse that was discussed in Jing et al. (2022). Interestingly, the drop observed with SEM with $L \ge 500$ occurs at index 2048, which is the output dimensionality of the ResNet-50 encoder.

Figure 7: Study of very large L using a ResNet-18 backbone and SEM/8 (8 blocks), following the method described in Section D.1. The vertical axis reports top-1 accuracy.

We perform a scaling experiment on CIFAR-100 where we compare the scaling behaviour of BYOL and BYOL + SEM. We evaluate the computational cost of the methods and the resulting downstream accuracy for a range of four ResNets: ResNet-18, ResNet-50, ResNet-50 x2 and ResNet-50 x4. In Figure 9, we observe that SEM has a better scaling behaviour than the baseline, especially as we increase the width of the ResNet-50. For BYOL, we observe that the performance decays for ResNet-50 with width x2 and x4. This is not unprecedented, as prior works have demonstrated other methods where scaling up the capacity of a model led to a decrease in performance. We attribute the discrepancy with Figure 1 to the fact that CIFAR-100 is a small dataset. In fact, we observe that the training accuracy stays constant at about 79% for all the ResNet-50 scales, indicating overfitting for the baseline BYOL. Nevertheless, SEM prevents the decrease in performance and even leads to further improved performance as we increase the scale of the ResNet-50.
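The covariance-spectrum computation of Section D.3 can be sketched as follows. This is a minimal NumPy illustration, not the authors' analysis code: the function name `covariance_spectrum` is hypothetical, the small constant added before the logarithm is an assumption to avoid taking the log of zero, and the toy embeddings are synthetic low-rank data that mimic dimensional collapse.

```python
import numpy as np

def covariance_spectrum(z, eps=1e-12):
    """Log singular values (descending) of the covariance of embeddings z of shape (N, D)."""
    z_bar = z.mean(axis=0, keepdims=True)          # average representation
    centered = z - z_bar
    cov = centered.T @ centered / z.shape[0]       # (D, D) covariance matrix
    s = np.linalg.svd(cov, compute_uv=False)       # singular values, sorted in descending order
    return np.log(s + eps)                         # eps is an assumption to keep the log finite

# Synthetic example: 100 samples of a 32-dimensional embedding that only spans 8 directions.
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 8)) @ rng.normal(size=(8, 32))
spectrum = covariance_spectrum(z)
```

In the synthetic example the log-spectrum drops sharply after the eighth value, which is the kind of signature read off Figure 8.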

D.5 THE ROLE OF PRE-TRAINING WITH SEM

We probe the downstream accuracy of a model pre-trained without SEM to which we add the SEM normalization only for the downstream classification. For this experiment, we take a pre-trained model with an embedder (i.e. BYOL + embed) with L = 5000 and V = 13 and add the softmax normalization only for the downstream classification. We do not use SEM during pre-training. We observe that using SEM for downstream classification leads to an improvement even when the model is not pre-trained with SEM, demonstrating the utility of SEM for downstream classification. However, we note that the performance of the model pre-trained without SEM is much weaker, which demonstrates the importance of also pre-training using SEM.
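The SEM normalization applied at downstream time amounts to a softmax with temperature over each of the L groups of V entries. Below is a minimal NumPy sketch; the function name `sem` and the random input are assumptions for illustration, while L = 5000 and V = 13 follow the setting quoted above.

```python
import numpy as np

def sem(z, L, V, tau=1.0):
    """Apply a temperature softmax to each of the L groups of V entries of z.

    z has shape (batch, L * V); the output has the same shape, and every
    group of V consecutive entries lies on the probability simplex.
    """
    logits = z.reshape(-1, L, V) / tau
    logits = logits - logits.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(logits)
    simplices = e / e.sum(axis=-1, keepdims=True)
    return simplices.reshape(-1, L * V)

# Random stand-in for the embedder's output of a small batch.
rng = np.random.default_rng(0)
L, V = 5000, 13
z = rng.normal(size=(4, L * V))
z_hat = sem(z, L, V, tau=0.1)
```

Lower temperatures push each simplex toward a one-hot vector, which is the sparsity effect discussed in the main text.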

E CIFAR-10 RESULTS

We confirm that our method also yields improvements on simpler datasets such as CIFAR-10. Here, we compare BYOL and BYOL + SEM on a ResNet-50 and observe an improvement of 1.6%.



* In this subsection, we refer to the extracted representation as z, the embedder's output.
† A hyper-parameter search was performed to select the best performing hyper-parameter.
‡ Although "flatfish" may seem out of place in the third set, manually checking CIFAR images showed that many images labelled "flatfish" are often humans holding flatfish.
§ https://github.com/google-research/robustness_metrics



Figure 1: Linear probe accuracy of BYOL and BYOL + SEM on ImageNet trained for 200 epochs with a ResNet-50 architecture.

Figure 2: (a) Procedure to obtain Simplicial Embeddings (SEM). A matrix z ∈ R L×V contains L vectors z i ∈ R V . The vectors z i are normalized with σ τ , the softmax operation with temperature τ . The normalized vectors are concatenated into the vector ẑ. (b) Normalized histogram of the entropies H(z i ) of each simplex zi for the sample in CIFAR's training dataset at the end of pre-training with various τ . The peak at ln(2) for τ = 0.01 and τ = 0.1 are a large number of simplices with two elements close to 0.5. (c) Integration of SEM with BYOL(Grill et al., 2020). The encoder outputs a latent vector which is embedded into the matrix z ∈ R L×V and then transformed into SEM.

Figure 3: Effect of the Softmax when scaling up L on the linear probe accuracy. Using a RN-50.

Figure 4: Effect of τ p and τ d on a RN-50.

(IN) the in-distribution test set provided in IMAGENET; (IN-C) IMAGENET-C, which exhibits a set of common image corruptions (Hendrycks & Dietterich, 2019); (IN-R) IMAGENET-R (Hendrycks et al., 2021), which consists of different renderings for several IMAGENET classes; (IN-V2) IMAGENET-V2 (Recht et al., 2019), a distinct test set for IMAGENET collected using the same process; and (IN-A) IMAGENET-A (Chen et al., 2020a), which contains a set of samples that are misclassified by an IMAGENET ResNet-50 classifier. We use the methodology and software proposed in Djolonga et al. (2020; 2021)

Figure 6: Semantic coherence of the features. (a) and (b) Subset of $W_5$, the bipartite graph of the 5 highest-magnitude features, for BYOL + SEM features (a) and for the BYOL encoded features (b). (c) Coherence of the top K features with the semantics of the superclasses of the CIFAR-100 categories. It is taken as the number of pairs of categories in the same superclass for which a feature is among their top K most predictive features over the total number of pairs of categories.

Figure 8: Spectrum of the covariance matrix of the representation for BYOL and BYOL + SEM obtained with a ResNet-50 encoder.

Figure 9: Scaling the ResNet encoder for CIFAR-100.

Figure 10: Comparison of the full semantic coherence graph W 5 between BYOL and BYOL + SEM.

Linear probe top-1 accuracy on CIFAR-100 trained for 1000 epochs with a ResNet-18/50 encoder. We compare the test accuracy of several SSL models with and without SEM. Boldface indicates highest accuracy. Green rows indicate a SSL method + SEM.





Top-1 transfer learning accuracy from IMAGENET pre-trained representation.

For all our CIFAR-100 training, we used 1 RTX-8000 per experiment. For our ImageNet experiments, we used parallel training with 2 40GB A100 for the training with ResNet50 and ResNet50-x2 and 4 40GB A100 for the training with ResNet50-x4. With this setup, the training takes about a week for the ResNet50 experiments and about 10 days for the ResNet50-x2 and ResNet50-x4 experiments.

BYOL with all ResNet-50 architectures for ImageNet.

# of parameters, # of activations, allocated memory, computation efficiency (FLOPs/sample) and CIFAR-100 accuracy of BYOL, BYOL with SEM and its memory-efficient variant with 8 blocks (denoted BYOL + SEM/8).

Comparing alternative embedders.

Downstream accuracy of training a classifier with SEM normalization of the representation while using the unnormalized representation during pre-training. Experiments performed with a ResNet-50 encoder.

ACKNOWLEDGEMENTS

The authors are grateful for the insightful discussions with Xavier Bouthillier, Hattie Zhou, Sébastien Lachapelle, Tristan Deleu, Yuchen Lu, Eeshan Dhekane, Maude Lizaire, Julien Roy and David Dobre. We acknowledge funding support from Samsung and Hitachi, as well as support from Aaron Courville's CIFAR CCAI chair. We also wish to acknowledge Mila and Compute Canada for providing the computing infrastructure that enabled this project. Finally, this project would not have been possible without the contribution of the following open source projects: Pytorch (Paszke et al., 2019), Orion (Bouthillier et al., 2022 ), Solo-Learn (da Costa et al., 2021), Scikit-Learn (Pedregosa et al., 2011) , and Numpy (Harris et al., 2020) .

annex

As we have analyzed $\phi(f^S_{\mathrm{SEM}(\tau)})$ in the previous two lemmas, we are now ready to compare $\phi(f^S_{\mathrm{SEM}(\tau)})$ and $\phi(f^S_{\mathrm{base}})$, which is done in the following lemma: Lemma 6. For any $\tau > 0$, Proof. From Lemma 4, for any $\tau > 0$, By choosing an element in the set over which the supremum is taken, for any $\delta \ge \Delta > 0$, where $q_1 = 1$, $q_j = 1 - \delta$ for $j \in \{2, \dots, V\}$, $q'_1 = \delta - 1$, and $q'_j = -1$ for $j \in \{2, \dots, V\}$. By combining those, for any $\tau > 0$ and $\delta \ge \Delta > 0$,

The 100 classes of CIFAR-100 (Krizhevsky, 2009) are grouped into 20 superclasses. The list of superclass for each class in 

