SYNCHRONIZED PRUNING FOR EFFICIENT CONTRASTIVE SELF-SUPERVISED LEARNING Anonymous

Abstract

Various self-supervised learning (SSL) methods have demonstrated strong capability in learning visual representations from unlabeled data. However, the current state-of-the-art (SoTA) SSL methods largely rely on heavy encoders to achieve performance comparable to their supervised learning counterparts. Despite the well-learned visual representations, the large encoders impede energy-efficient computation, especially for resource-constrained edge computing. Prior works have utilized sparsity-induced asymmetry to enhance the contrastive learning of dense models, but the generality of asymmetric sparsity across self-supervised learning methods has not been fully explored. Furthermore, transferring supervised sparse training to SSL is also largely under-explored. To address these gaps, this paper investigates the correlation between in-training sparsity and SSL. In particular, we propose a novel sparse SSL algorithm that embraces the benefits of contrastiveness while exploiting high sparsity during SSL training. The proposed framework is evaluated comprehensively with various granularities of sparsity, including element-wise sparsity, GPU-friendly N:M structured fine-grained sparsity, and hardware-specific structured sparsity. Extensive experiments across multiple datasets show that the proposed method outperforms the SoTA sparse learning algorithms on various SSL frameworks. Furthermore, the training speedup obtained with the proposed method is evaluated using an actual DNN training accelerator model.

1. INTRODUCTION

The early empirical success of deep learning was primarily driven by supervised learning with massive labeled data, e.g., ImageNet (Krizhevsky et al., 2012). To overcome this labeling bottleneck, learning visual representations without label-intensive datasets has been widely investigated (Chen et al., 2020a; He et al., 2020; Grill et al., 2020; Zbontar et al., 2021). Recent self-supervised learning (SSL) methods have shown great success and achieved performance comparable to their supervised counterparts. A common property of various SSL designs is utilizing different augmentations of the original images to generate contrastiveness, which requires duplicated encoding with wide and deep models (Meng et al., 2022). The magnified training effort and extensive resource consumption make the SSL-trained encoder infeasible for on-device computing (e.g., mobile devices). The contradiction between label-free learning and extraordinary computation cost limits further applications of SSL and necessitates efficient sparse training techniques for self-supervised learning.

For supervised learning, sparsification (a.k.a. pruning) has been widely explored, aiming to reduce computation and memory costs by removing unimportant parameters during training or fine-tuning. Conventional supervised pruning exploits weight sparsity based on a pre-trained model, followed by additional fine-tuning to recover the accuracy (Han et al., 2016). For self-supervised learning, recent work (Chen et al., 2021) also sparsified a pre-trained dense SSL model for downstream tasks with element-wise pruning. In addition to fine-grained sparsity, MCP (Pan et al., 2022) exploited filter-wise sparsity on the MoCo-based SSL (He et al., 2020) model. Both of these sparse SSL works (Chen et al., 2021; Pan et al., 2022) exploit sparsity based on a pre-trained dense model.
However, compared to supervised learning, obtaining the pre-trained model via SSL requires a significant amount of additional training effort (∼200 epochs for supervised training vs. ∼1,000 epochs for SSL). Therefore, exploring post-training sparsity via fine-tuning is not an ideal solution for efficient SSL. On the other hand, sparsifying the model during supervised training (Dettmers & Zettlemoyer, 2019; Evci et al., 2020) has emerged as a promising technique to improve training efficiency while obtaining a sparse model. To accurately locate the unimportant parameters, prior works investigated various importance metrics, including gradient-based pruning (Dettmers & Zettlemoyer, 2019) and the "prune-regrow" scheme (Liu et al., 2021). Compared to post-training sparsification methods, in-training sparsification for supervised training achieves memory/computation reduction as well as a speedup of the training process. However, exploiting in-training sparsity for SSL models that are trained from scratch is still largely under-explored. More recently, SDCLR (Jiang et al., 2021) proposed the sparsified "self-damaging" encoder, which generates the "sparse-dense" SSL architecture by exploiting fixed high sparsity on one contrastive path (e.g., the offline encoder) while keeping the counterpart dense (e.g., the online encoder), as shown in Figure 1(a). Such a "sparse-dense" SSL architecture helps to enhance contrastive learning, leading to improved performance on non-salient and imbalanced samples. Nevertheless, it mainly focuses on the performance enhancement of the SSL-trained dense model (i.e., SimCLR (Chen et al., 2020a)), and whether such a "sparse-dense" asymmetric learning scheme works in other SSL methods remains unclear. On the other hand, the compatibility of the existing SoTA sparse training techniques (Dettmers & Zettlemoyer, 2019; Evci et al., 2020; Liu et al., 2021) with SSL also remains ambiguous.
Our main observations are summarized below (see Figure 1 for an overview):

• We first identify the limitations of the sparsity-induced asymmetric SSL in SDCLR (Jiang et al., 2021) and show that the sparsity-induced "sparse-dense" asymmetric architecture is not universally applicable across SSL schemes.

• We demonstrate the incompatibility of the SoTA "prune-and-regrow" sparse training method with SSL. Specifically, we formalize the iterative architectural changes caused by applying "prune-and-regrow" to SSL, termed architecture oscillation, and observe that frequently updating the pruning candidates leads to larger architecture oscillation, which further hinders the performance of self-supervised learning.

2. RELATED WORKS

2.1. CONTRASTIVE SELF-SUPERVISED LEARNING

Self-supervised learning has recently gained popularity due to its ability to learn visual representations without labor-intensive labeling. Specifically, pioneering research works (He et al., 2020; Chen et al., 2020a) utilize the contrastive learning scheme (Hadsell et al., 2006) that aims to group the correlated positive samples while repelling the mismatched negative samples (Oord et al., 2018). The performance of contrastive learning-based approaches largely depends on the contrastiveness between positive and negative samples, which requires large batches to support. As indicated by SimCLR (Chen et al., 2020a), the performance of SSL is sensitive to the training batch size, and the inflated batch size elevates the training cost. MoCo (He et al., 2020; Chen et al., 2020b) alleviates this issue with queue-based learning and a momentum encoder, where the extensive queue-held negative samples provide proficient contrastiveness, and the slow-moving average momentum encoder derives consistent negative pairs. BYOL (Grill et al., 2020) simplifies and outperforms the prior works by learning from positive samples only, while the online latent features are projected by an additional predictor q_θ:

online prediction = q_θ(g_θ(f_θ(X)))
offline target = g_ξ(f_ξ(X′))

where f and g represent the encoder and projector of the online (θ) and offline (ξ) paths with augmented inputs X and X′, respectively. The predictor q_θ generates an alternative view of the projected positive samples, and the offline momentum encoder provides consistent encoding for contrastive learning. Overall, salient and consistent contrastiveness is essential to contrastive self-supervised learning.
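The asymmetric online/offline paths above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the encoder f, projector g, and predictor q are stand-in linear maps, and the loss is the simplified negative cosine similarity used by BYOL-style methods.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in linear maps for encoder f, projector g, and predictor q.
# Real BYOL uses a ResNet encoder and MLP heads; these are illustrative only.
W_f, W_g, W_q = rng.standard_normal((3, 4, 4))

# Offline (momentum) parameters start as a copy of the online ones.
W_f_m, W_g_m = W_f.copy(), W_g.copy()

def online_prediction(x):
    # online prediction = q_theta(g_theta(f_theta(X)))
    return W_q @ (W_g @ (W_f @ x))

def offline_target(x_aug):
    # offline target = g_xi(f_xi(X')), no predictor on the momentum path
    return W_g_m @ (W_f_m @ x_aug)

def ema_update(online, offline, tau=0.99):
    # Momentum (EMA) update: xi <- tau * xi + (1 - tau) * theta
    return tau * offline + (1 - tau) * online

x, x_aug = rng.standard_normal((2, 4))   # two augmented views
p, z = online_prediction(x), offline_target(x_aug)
# Simplified BYOL objective: negative cosine similarity between views.
loss = -np.dot(p, z) / (np.linalg.norm(p) * np.linalg.norm(z))
W_f_m = ema_update(W_f, W_f_m)
```

Note that only the online parameters receive gradients; the offline path is updated purely through the EMA, which is what provides the consistent targets discussed above.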

2.2. SPARSE TRAINING

DNN sparsification has been widely investigated in the supervised learning domain and can generally be categorized by the starting point of sparsification. Early works mainly focus on post-training sparsification (Han et al., 2016; Evci et al., 2020; Jayakumar et al., 2020), which removes weights from a pre-trained model and then recovers the accuracy with subsequent fine-tuning. Other works exploit weight sparsity prior to the training process (Wang et al., 2019; Lee et al., 2018), and the resulting model is trained with the sparsified architecture. In contrast to post-training or pre-training sparsification, exploiting sparsity during training produces the compressed model with one-time training from scratch, eliminating the requirement of a pre-trained model or an extensive search process. With full observability of the training process, the gradient magnitude can be used to evaluate the model's response to the exploited sparsity. Motivated by this, GraNet (Liu et al., 2021) utilizes the "prune-and-regrow" technique to periodically remove unimportant non-zero weights from the sparse model and then regrow certain pruning candidates. Given the target sparsity s_t and total prune ratio r_t at iteration t, unimportant weights w are removed based on the Top-K magnitude scores:

w′ = TopK(|w|, r_t)

Subsequently, sensitive weights are regrown based on the gradient g_t over the pruned positions:

w = w′ + TopK(g_t^{i ∉ w′}, r_t − s_t)

Since the gradient g_t indicates the instant model sensitivity at iteration t, the optimal sparse model architecture can vary between two adjacent pruning steps.
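The "prune-and-regrow" step can be sketched as a mask update. The following NumPy function is an illustrative sketch in the spirit of GraNet, not its actual implementation: it drops the smallest-magnitude active weights and re-activates the inactive positions with the largest gradient magnitude, keeping the number of active weights constant.

```python
import numpy as np

def prune_and_regrow_mask(w, grad, mask, prune_frac):
    """One sketched "prune-and-regrow" step over a boolean sparsity mask.

    w:          weight values (flat array)
    grad:       gradient values at the same positions
    mask:       current boolean mask (True = active weight)
    prune_frac: fraction of active weights to swap this step
    """
    mask = mask.copy()
    active = np.flatnonzero(mask)
    k = int(len(active) * prune_frac)
    # Prune: remove the k active weights with the smallest magnitude.
    drop = active[np.argsort(np.abs(w[active]))[:k]]
    mask[drop] = False
    # Regrow: re-activate the k inactive positions with the largest |grad|.
    inactive = np.flatnonzero(~mask)
    grow = inactive[np.argsort(-np.abs(grad[inactive]))[:k]]
    mask[grow] = True
    return mask
```

Because the regrow step may re-activate positions that were just pruned (or entirely different ones), the set of active weights can change substantially between adjacent pruning steps, which is exactly the architecture oscillation analyzed in Section 3.2.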

2.3. CONTRASTIVE LEARNING WITH SPARSITY-INDUCED ASYMMETRY

As introduced in Section 2.1, salient and consistent contrastiveness is essential for contrastive SSL, where the contrastiveness can be constructed via negative samples or auxiliary predictors (Grill et al., 2020). Inspired by (Hooker et al., 2019), SDCLR (Jiang et al., 2021) amplifies the contrastiveness by pruning one encoder of SimCLR (Chen et al., 2020a) while keeping its identical twin dense. Such "sparsity-induced asymmetry" elevates the performance of SSL, improving the performance of the dense model on long-tailed data samples. However, SDCLR (Jiang et al., 2021) is not designed for model compression or efficiency improvements. Furthermore, the generality of such sparsity-induced asymmetry remains under-explored in other SSL frameworks.

3. CHALLENGES OF SPARSE SELF-SUPERVISED LEARNING

3.1. LIMITATIONS OF SPARSITY-INDUCED ASYMMETRY

It has been shown in SDCLR (Jiang et al., 2021) that the sparsity-induced "sparse-dense" asymmetry is beneficial to contrastive SSL. SDCLR is specifically built upon the SimCLR (Chen et al., 2020a) framework with shared encoders, where the pruned architecture has a dense twin in the mirrored contrastive encoder. However, the generality of sparsity-induced asymmetry remains unproven for other SSL methods, which motivates us to investigate the following question:

Question 1: For contrastive self-supervised learning with non-identical encoders, will sparsity-induced asymmetric encoders still elevate the performance of contrastive learning?

To answer this question, we use MoCo-V2 (Chen et al., 2020b) and follow the procedure of SDCLR (Jiang et al., 2021) to generate a highly sparse online encoder prior to the training process. Given the online and offline (momentum) encoders θ and ξ with weights W_θ and W_ξ, we have:

online output = g_θ(f_θ(X; M_θ ⊙ W_θ)) (5)
offline output = g_ξ(f_ξ(X′)) (6)

The online encoder mask M_θ produces a sparse online encoder with element-wise sparsity (Han et al., 2016) initialized at 90%, while the offline encoder is updated by exponential moving average (EMA). The gradient-based update of the online encoder keeps recovering the performance drop caused by the high-sparsity mask. Following the setup of SDCLR (Jiang et al., 2021), the sparsity is periodically updated at the beginning of each epoch. Table 1 summarizes the linear evaluation accuracy on the CIFAR-10 dataset with different static online sparsity values. In contrast to the performance of SimCLR in (Jiang et al., 2021), directly applying the high sparsity-based perturbation to MoCo-V2 (Chen et al., 2020b) is challenging and leads to considerable performance degradation. Reversing the sparsity between the online and offline encoders shows similar results, as presented in Appendix D.
Summarizing these empirical results, our main observation is:

Observation 1: Compared to the online encoder, the EMA-updated momentum encoder has a delayed architecture, which makes it an unqualified "competitor" in the sense of SDCLR (Jiang et al., 2021). Directly applied high sparsity overshoots the asymmetric learning, leading to degraded self-supervised learning.

3.2. FREQUENT ARCHITECTURE CHANGING HINDERS SELF-SUPERVISED LEARNING

As depicted in Eq. 4, "prune-and-regrow" schemes such as GraNet (Liu et al., 2021) use the instant gradient magnitude to indicate the model sensitivity after magnitude pruning, removing unimportant and insensitive weights while gradually achieving high sparsity. Observation 1 demonstrates the incompatibility of directly applied high sparsity with SSL, which raises the following question:

Table 2: Sparse training with the "prune-and-regrow" scheme on BYOL (Grill et al., 2020), CIFAR-10 Acc (%).

Question 2: If we apply gradually increased sparsity to both encoders, will the "prune-and-regrow" scheme be feasible for self-supervised learning?

To address Question 2, we use the SoTA GraNet (Liu et al., 2021) as the example algorithm to exploit in-training sparsity on both encoders of BYOL (Grill et al., 2020), where the regrowing process is only applied to the online encoder. Starting with dense models, we gradually prune the online and offline encoders to 90% and 60% sparsity in an element-wise fashion with periodically updated sparsity. As reported in Table 2, such a gently increased sparsity scheme outperforms the results of (Jiang et al., 2021) (Table 1) by a significant margin. However, the state-of-the-art supervised pruning algorithm still incurs a 2.3% linear evaluation accuracy degradation with SSL on the CIFAR-10 dataset. Compared to the self-damaging SSL with fixed sparsity (Jiang et al., 2021), the "prune-and-regrow" method keeps swapping the pruning candidates to minimize the model sensitivity, oscillating the encoder architecture during training.
We quantify such architecture oscillation by XORing the masks generated from magnitude pruning M_MP and gradient-based regrowth M_g:

G_cor = M_MP ⊕ M_g ∈ {0, 1} (7)

Under the same sparsity ratio, the number of "1"s in G_cor reflects how many pruning candidates are swapped at each step.

Observation 2: Sparsifying the model with a frequently changing architecture hinders the contrastiveness and consistency of self-supervised learning and leads to degraded encoder performance.

As shown in Observation 1 and Observation 2, high sparsity-induced asymmetry is not directly applicable to sparse SSL, while the consistency requirement of SSL negates the plausibility of gradual sparsity increments. This dilemma between self-supervised learning and sparse training raises the following challenge: How can we efficiently sparsify the model during self-supervised training while maximizing the benefits of sparsity-induced asymmetry?
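The oscillation metric above is simply the element-wise XOR of two binary masks. A minimal sketch (the function name is ours):

```python
import numpy as np

def architecture_oscillation(mask_mp, mask_regrow):
    # G_cor = M_MP XOR M_g; the mean over all positions is the fraction
    # of pruning candidates swapped by the gradient-based regrowth.
    g_cor = np.logical_xor(mask_mp, mask_regrow)
    return float(g_cor.mean())
```

A value near 1.0 early in training means nearly all magnitude-pruning candidates were replaced by the regrow step, matching the high oscillation reported in Figure 2(a).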

4. METHOD

To address the above challenge, we propose Synchronized Contrastive Pruning (SyncCP), which alleviates the contradiction between the need for high sparsity and the requirement of consistency in self-supervised learning.

4.1. SYNCHRONIZED SPARSIFICATION (SYNCS)

The rationale behind sparsity-induced asymmetric SSL is that the perturbation generated by the pruned encoder elevates the difference between contrastive features. As indicated by Observation 1 and Table 1, high sparsity-induced asymmetry is not universally applicable, but SSL can still be rewarded by the asymmetry incurred by lower sparsity (e.g., 50%), where the SSL-trained sparse and dense encoders exhibit negligible accuracy degradation compared to the baseline. Motivated by this, we propose the Synchronized Sparsification (SyncS) technique to exploit sparsity in both contrastive encoders. Given the online and offline (momentum) encoders θ and ξ with weights W_θ and W_ξ, the in-training sparsification can be expressed as:

online output = g_θ(f_θ(X; M_θ ⊙ W_θ)) (8)
offline output = g_ξ(f_ξ(X′; M_ξ ⊙ W_ξ)) (9)

where M_θ and M_ξ represent the online and offline (momentum) sparse masks with sparsity s_θ and s_ξ. The proposed SyncS scheme gradually exploits sparsity in both encoders while maintaining a consistent sparsity gap Δ_s between them during SSL training. At each pruning step t, we have:

s_θ^t = s_θ^f + (s_θ^i − s_θ^f)(1 − (t − t_0)/(nΔt))^3 (10)
s_ξ^t = s_ξ^f + (s_ξ^i − s_ξ^f)(1 − (t − t_0)/(nΔt))^3 (11)
s.t. |s_θ^t − s_ξ^t| = Δ_s, for t ∈ {t_0, t_0 + Δt, ..., t_0 + nΔt} (12)

The exponent controls the speed of the sparsity increment; we adopt the sparsity schedule of Eqs. 10-12 from (Liu et al., 2021) to minimize the impact of parameter tuning. The synchronized sparsity increment under the constraint of Δ_s prevents excessive asymmetry between the contrastive encoders while minimizing the distortion caused by the changing sparsity. In practice, Δ_s is treated as a tunable parameter that impacts the final sparsity of both the online and offline encoders.
To guarantee the consistency of the contrastive sparsity, both s_θ and s_ξ are initialized by the Erdős–Rényi Kernel (ERK) (Evci et al., 2020); the impact of Δ_s is evaluated in the Appendix.
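Under the assumption that both the initial and final sparsity gaps equal Δ_s, the cubic schedule of Eqs. 10-12 keeps the gap constant at every pruning step. A small sketch with an illustrative function name:

```python
def sparsity_at_step(t, s_init, s_final, t0, n, dt):
    # Cubic schedule (Eqs. 10-11):
    #   s_t = s_f + (s_i - s_f) * (1 - (t - t0) / (n * dt)) ** 3
    # clamped to the pruning window [t0, t0 + n * dt].
    frac = min(max((t - t0) / (n * dt), 0.0), 1.0)
    return s_final + (s_init - s_final) * (1.0 - frac) ** 3
```

For example, with online sparsity ramping 30% → 90% and offline sparsity ramping 0% → 60%, the online-offline gap stays at Δ_s = 30% throughout the ramp, since the constant gap in Eq. 12 holds whenever the initial and final gaps both equal Δ_s.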

4.2. CONTRASTIVE SPARSIFICATION INDICATOR (CSI)

Achieving high sparsity requires a gentle sparsity increment, but as presented in Observation 2, the inconsistent architecture difference deteriorates the contrastiveness of SSL. On the other hand, the popular EMA-based update (He et al., 2020) allows the momentum encoder to generate consistent latent representations, but the lagged architecture makes the momentum encoder an unqualified "competitor" to the online encoder, which violates the findings of (Jiang et al., 2021). To resolve this conflict, we propose the Contrastive Sparsification Indicator (CSI), a simple-yet-effective method that automatically selects the starting point of the sparsity increment based on the learning progress of SSL. During SSL training, CSI first generates the pseudo pruning decisions of both encoders based on element-wise magnitude pruning with target sparsity s_θ^f and s_ξ^f:

M*_θ = 1{|w_θ| ∈ TopK(|w_θ|, s_θ^f)} (13)
M*_ξ = 1{|w_ξ| ∈ TopK(|w_ξ|, s_ξ^f)} (14)

where 1 represents the indicator function, and the resultant pseudo masks M*_θ and M*_ξ are not applied to the weights. Subsequently, CSI XORs the pseudo pruning masks to generate G (Eq. 15), and the percentage of "1"s in G is equivalent to the architecture inconsistency I (Eq. 16), where |G| represents the total number of elements in G. Instead of using cosine similarity, the bit-wise XOR can be easily implemented in hardware to quantify the architecture difference during training. Given the sparsity gap Δ_s defined by SyncS, CSI activates the sparsity increment when I equals Δ_s, and this moment is defined as the CSI checkpoint. In other words, when the architecture difference between the online and offline encoders is mainly caused by the sparsity difference, it is the optimal moment to start exploiting in-training sparsity with the gradually increased sparsity schedule.
G = M*_θ ⊕ M*_ξ (15)
I = 1 − Σ 1{G = 0} / |G| (16)

With the ability to automatically select the starting point of the sparsity increment, the proposed CSI method sparsifies the model with full awareness of the SSL process. For SSL frameworks with a shared encoder (Zbontar et al., 2021), the architecture inconsistency I is computed based on the sparse architectures of two consecutive epochs, and the sparsity increment is activated when I is less than a pre-defined threshold τ (e.g., τ = 0.1). Figure 3 shows the sparsification scheme with and without SyncS. As summarized in Table 3, holding the sparse momentum architecture after the CSI checkpoint interrupts the consistency between the online and momentum encoders. Although the momentum encoder retains low sparsity at 20%, the absence of consistent asymmetry from synchronized contrastive pruning (SCP) degrades the model performance. On top of the proposed SyncS and CSI schemes, we adopt the prune-and-regrow scheme (Liu et al., 2021) with modifications to exploit sparsity during SSL training. To further alleviate the contrastiveness oscillation caused by the changing sparsity, we slowly average the gradient magnitude by exponential moving average (EMA) with gentle momentum, instead of using the instant score. The detailed pseudo-code of the proposed algorithm is summarized in Appendix C.
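CSI itself reduces to two Top-K pseudo masks and an XOR. The sketch below (helper names are ours) follows Eqs. 13-16; note the pseudo masks are only inspected, never applied to the weights:

```python
import numpy as np

def pseudo_mask(w, target_sparsity):
    # Pseudo pruning decision (Eqs. 13-14): mark the top-(1 - s) weights
    # by magnitude. The mask is used only for measurement.
    k = int(round((1.0 - target_sparsity) * w.size))
    keep = np.argsort(-np.abs(w.ravel()))[:k]
    m = np.zeros(w.size, dtype=bool)
    m[keep] = True
    return m

def architecture_inconsistency(w_online, w_offline, s_online, s_offline):
    # G = M*_theta XOR M*_xi (Eq. 15); I = fraction of "1"s in G (Eq. 16).
    g = np.logical_xor(pseudo_mask(w_online, s_online),
                       pseudo_mask(w_offline, s_offline))
    return float(g.mean())
```

CSI would then trigger the sparsity ramp once the measured inconsistency I reaches the sparsity gap Δ_s, i.e., once the mask disagreement is explained by the sparsity difference alone.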

5. EXPERIMENTAL RESULTS

In this section, we validate the proposed sparse training scheme and compare it with the current SoTA sparse training schemes. Unlike the work by (Jiang et al., 2021), the proposed scheme exploits in-training sparsity in both contrastive paths (encoders) and aims to achieve energy-efficient self-supervised learning. The proposed method is applied to multiple mainstream SSL frameworks, including EMA-based methods (Chen et al., 2020b; Grill et al., 2020) and SSL with a shared encoder (Zbontar et al., 2021). The linear evaluation accuracy and training cost reduction are reported for multiple datasets, including CIFAR-10, CIFAR-100, and ImageNet-2012. Furthermore, this work exploits in-training sparsity with various sparsity granularities, including element-wise sparsity, N:M sparsity (Zhou et al., 2020), and structured sparsity for a custom hardware accelerator.

CIFAR-10 and CIFAR-100. Table 4 summarizes the linear evaluation accuracy of the proposed method on the CIFAR-10 and CIFAR-100 datasets with element-wise sparsity. We use ResNet-18 (1×) as the backbone and train the model from scratch for 1,000 epochs. Following the typical high sparsity settings reported in supervised learning, we report the model performance with 80% and 90% target sparsity. To sparsify both encoders during SSL training, we initialize the sparsity of the online and offline (momentum) encoders as 30% and 0%, where Δ_s is set to 30%. The initialized sparse encoders reduce the overall memory footprint throughout the entire training process. We rigorously transfer the SoTA GraNet (Liu et al., 2021) to SSL based on its open-sourced implementation; the proposed method outperforms GraNet-SSL by 1.26% and 1.86% accuracy on the CIFAR-10 and CIFAR-100 datasets, respectively. In all experiments, we report the average accuracy and its variation over 3 runs.
In addition to element-wise sparsity, the recent Nvidia Ampere architecture is equipped with Sparse Tensor Cores to accelerate inference computation on GPU with N:M structured fine-grained sparsity (Zhou et al., 2020), where N dense elements remain within each M-sized group. Powered by the open-sourced Nvidia ASP library, SyncCP sparsifies BYOL training (Grill et al., 2020) by targeting 100% N:M sparse groups in the online encoder. Starting from scratch, the percentage of N:M sparse groups is initialized as 30% and 0% in the online and momentum encoders with Δ_s = 30%. After the CSI checkpoint, the percentage of N:M groups gradually increases. Appendix A describes the detailed pruning algorithm of N:M sparsification. Table 6 summarizes the linear evaluation accuracy and inference time reduction on the CIFAR-10 and CIFAR-100 datasets. The resultant model achieves up to 2.08× inference acceleration with minimal accuracy degradation. The inference time is measured on an Nvidia 3090 GPU with FP32 data precision.

ImageNet-2012. Since the BYOL (Grill et al., 2020) learning scheme achieves the best performance on the CIFAR datasets, we further evaluate the proposed method with ResNet-50 on ImageNet based on the BYOL framework (Grill et al., 2020). Following the typical high sparsity settings reported in Table 4, we report the model performance with 80% and 90% element-wise sparsity. The data augmentation setup is adopted from the open-sourced library (Costa et al., 2022). Starting from scratch, the model is trained for 300 epochs, where both online and momentum encoders are initialized by ERK with Δ_s = 30%. While we believe more fine-grained hyperparameter tuning and extended training could lead to better accuracy, we choose the above scheme for simplicity and reproducibility. Table 5 shows the comparison of linear evaluation accuracy on the ImageNet-2012 dataset.
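For reference, a 2:4 (N = 2, M = 4) magnitude-based mask can be sketched as follows; this mirrors what the Nvidia ASP library does at a high level, but is our own illustrative NumPy code rather than the library's implementation:

```python
import numpy as np

def two_four_mask(w):
    # 2:4 structured sparsity sketch: in every group of 4 consecutive
    # weights, keep the 2 with the largest magnitude (N dense elements
    # per M-sized group). Assumes w.size is a multiple of 4.
    groups = w.reshape(-1, 4)
    keep = np.argsort(-np.abs(groups), axis=1)[:, :2]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask.reshape(w.shape)
```

The fixed per-group budget is what makes N:M sparsity regular enough for the Sparse Tensor Cores to exploit, unlike unconstrained element-wise sparsity.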
Compared to the self-damaging scheme (Jiang et al., 2021) and GraNet (Liu et al., 2021), the proposed algorithm achieves the same highly sparse network with 4.72% and 1.21% Top-1 inference accuracy improvements, respectively. GraNet exploits in-training sparsity throughout the entire training process, but the inconsistent contrastiveness hampers the model performance. On the other hand, the dense encoder limits the efficiency of the self-damaging scheme (Jiang et al., 2021), and the static high sparsity degrades the model performance. It has been shown that SSL-trained encoders are strong visual learners (Grill et al., 2020; Ericsson et al., 2021). Appendix A summarizes the performance of the proposed algorithm obtained by fine-tuning the ImageNet-trained sparse encoders on the CIFAR-10 and CIFAR-100 datasets. With only 300 epochs of sparse SSL training, the resultant sparse encoder outperforms the SoTA supervised sparse training algorithms.

A DOWNSTREAM TASKS PERFORMANCE WITH THE PRE-TRAINED SPARSE ENCODER

The pre-trained sparse encoder can be used for downstream tasks. We verify the performance of the sparse ImageNet-trained model from Table 5 on downstream tasks. The following table summarizes the transfer learning performance obtained by fine-tuning the ImageNet-trained sparse BYOL encoder on the CIFAR-10 and CIFAR-100 datasets. Given the fixed target sparsity of 90%, a higher Δ_s leads to a denser momentum encoder and less computation reduction. Meanwhile, sparsifying the momentum encoder at the beginning of training is sub-optimal. Furthermore, a lower initial sparsity and smaller Δ_s value (4th row) achieve the best tradeoff between computation reduction and model performance.

D SPARSIFYING ONLINE OR OFFLINE ENCODERS

We mirror the experiment of Table 1 by exploiting high sparsity in the momentum encoder while keeping the online encoder dense. Overall, sparsifying the online encoder leads to better performance.

E DETAILED EXPERIMENTAL SETUP OF SYNCCP

E.1 LINEAR EVALUATION PROTOCOL

As in (Kolesnikov et al., 2019; Kornblith et al., 2019; Chen et al., 2020a), we adopt the standard linear evaluation protocol.

MoCo-V2: The ResNet-18 (1×) encoder is trained by MoCo-V2 (Chen et al., 2020b) from scratch for 1,000 epochs with the SGD optimizer and a batch size of 256. The learning rate is set to 0.3 with cosine learning rate decay and 10 warmup epochs. The detailed data augmentation is summarized in Table 13.

BYOL: The ResNet-18 (1×) encoder is trained by BYOL (Grill et al., 2020) from scratch for 1,000 epochs with the LARS-SGD optimizer (You et al., 2017). The predictor is constructed with 4,096 hidden features and a 256-dimensional output. We use a batch size of 256 along with a learning rate of 1.0. A cosine learning rate scheduler is used with 10 warmup epochs. The detailed data augmentation is summarized in Table 14.



Figure 1: (a) Applying the self-damaging scheme (Jiang et al., 2021) to SSL. (b) Applying the prune-and-regrow scheme (Liu et al., 2021) to SSL. (c) Proposed contrastiveness-aware sparse training.

As shown in Figure 1(b), they require frequent pruning and regrowing of the model architecture during SSL training, while SDCLR maintains a fixed "sparse-dense" architecture during the entire training process. The under-explored sparse contrastiveness and the expensive nature of self-supervised learning inspire us to investigate the following question: How can we efficiently sparsify the model during self-supervised training with awareness of contrastiveness? To address this question, we present Synchronized Contrastive Pruning (SyncCP), a novel sparse training algorithm designed for self-supervised learning. To maximize the energy efficiency of SSL training, SyncCP exploits in-training sparsity in both encoders with high compatibility with SSL. The main contributions of this work are:

Figure 2: (a) Layer-wise oscillation at different steps of pruning. "SC" stands for the shortcut connection of ResNet-18 model. (b) Gradually-increased sparsity of GraNet (Liu et al., 2021) leads to inconsistent asymmetry.

G_cor indicates the amount of architecture oscillation caused by the gradient-based regrowth. During the early stage of training, almost all of the magnitude pruning candidates are replaced by the regrowing process, as shown in Figure 2(a). The high degree of architecture oscillation implies drastic changes in the sparse model architecture. Meanwhile, gradually sparsifying the two encoders toward different target sparsity further destroys the consistency of self-supervised learning, as shown in Figure 2(b). As a result, we have the following observation for Question 2:

Figure 3: Sparse BYOL training process (a) without SyncS and (b) with SyncS.

We use the standard linear evaluation protocol on the CIFAR-10/100 and ImageNet-2012 datasets, which trains a linear classifier on top of the frozen SSL-trained encoder. During linear evaluation, we apply spatial augmentation and random flips. The linear classifier is optimized by SGD with the cross-entropy loss.

E.2 CIFAR-10/100 EXPERIMENTS

The training hyper-parameters of the compared sparse training works are the same for CIFAR-10 and CIFAR-100. We provide the detailed training setup of the different self-supervised learning frameworks as follows:

Image augmentation settings for MoCo-V2 (Chen et al., 2020b) on CIFAR-10/100.

• We present SyncCP, a new sparse training algorithm designed for self-supervised learning. SyncCP gradually exploits high in-training sparsity in both encoders with contrastive synchronization and optimally triggered sparsification, maximizing the training efficiency without hurting the contrastiveness of SSL.

• SyncCP is a general sparse training method for SSL that is compatible with various granularities of sparsity, including element-wise pruning, N:M sparsity, and structured sparsity designed for a custom hardware accelerator.

• We validate the proposed method against previous SoTA sparsification algorithms on the CIFAR-10, CIFAR-100, and ImageNet datasets. Across various SSL frameworks, SyncCP consistently achieves SoTA accuracy in all experiments.

Largely degraded performance of MoCo-V2 (Chen et al., 2020b) with self-damaging SSL (Jiang et al., 2021) on the CIFAR-10 dataset.

Performance comparison of BYOL on CIFAR-10 dataset with/without SyncS.

Linear evaluation comparison on CIFAR-10/100 datasets with element-wise sparsity.

ImageNet-2012 accuracy and training cost comparison with SoTA works on ResNet-50 with BYOL(Grill et al., 2020).

Linear evaluation accuracy comparison on CIFAR-10/100 datasets with N :M structured fine-grained sparsity.

Hardware training acceleration of the proposed structured SyncCP on the CIFAR-10 dataset.

Hardware-based Structured Pruning: The hardware practicality of element-wise sparsification is often limited by the irregularity of fine-grained sparsity and its indexing requirement. To that end, we employ structured sparsity based on group-wise EMA scores to achieve actual hardware training acceleration. The encoders are initialized by ERK with 30% and 0% sparse groups while keeping Δ_s = 30%. The structured sparsity starts to gradually increase after the CSI checkpoint. We adopt the training accelerator specifications from (Venkataramanaiah et al., 2022) and choose K_l (# of output channels) × C_l (# of input channels) = 8×8 as the sparse group size. Table 7 evaluates the training speedup of BYOL (Grill et al., 2020) aided by the structured sparse training. The proposed algorithm achieves up to 1.91× training acceleration with minimal accuracy degradation.

Transfer learning performance of the ImageNet-trained BYOL encoder on the CIFAR-10 and CIFAR-100 datasets.

We evaluate the impact of Δ_s with different sparsification schemes on the CIFAR-10 dataset with the BYOL SSL framework. With the target sparsity of 90%, the following table summarizes the accuracy and training cost reduction for different Δ_s values associated with different initial and final densities.

The linear evaluation accuracy and the training FLOPs reduction with sparsity gap ∆ s .

With the proposed CSI + SyncS sparsification method, we empirically observe that exploiting high sparsity in the online model leads to better performance with gradient-based model updates. Table 12 summarizes the comparison results of BYOL on the CIFAR-10 dataset with the proposed algorithm.

Performance comparison of exploiting higher sparsity in online or momentum encoders.


As aforementioned, the findings of Observation 2 imply the incompatibility of the instant gradient and magnitude scores. Together with the proposed SyncS and CSI methods, weight importance is measured by the magnitude score, while the model sensitivity is quantified by the gently averaged gradient magnitude with EMA. Table 10 summarizes the linear evaluation accuracy of ResNet-18 trained by BYOL (Grill et al., 2020). We initialize s_θ^0 and s_ξ^0 as 40% and 10%, where Δ_s is set to 30% and the EMA momentum is set to 0.1 for gentle gradient score averaging.
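A sketch of the EMA-averaged gradient score is shown below; the exact placement of the momentum coefficient is our assumption, chosen so that a momentum of 0.1 gives a small weight to the instant gradient magnitude:

```python
def ema_score(prev_score, grad_mag, momentum=0.1):
    # score_t = (1 - m) * score_{t-1} + m * |g_t|
    # A gentle momentum (m = 0.1, our assumed convention) keeps the
    # regrow score slowly varying instead of following the instant
    # gradient magnitude, reducing architecture oscillation.
    return (1.0 - momentum) * prev_score + momentum * grad_mag
```

Using this slowly varying score in place of the instant gradient in the regrow step damps the mask swapping quantified by G_cor.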

