SYNCHRONIZED PRUNING FOR EFFICIENT CONTRASTIVE SELF-SUPERVISED LEARNING

Anonymous

Abstract

Various self-supervised learning (SSL) methods have demonstrated strong capability in learning visual representations from unlabeled data. However, the current state-of-the-art (SoTA) SSL methods largely rely on heavy encoders to achieve performance comparable to their supervised learning counterparts. Despite the well-learned visual representations, the large encoders impede energy-efficient computation, especially for resource-constrained edge devices. Prior works have utilized sparsity-induced asymmetry to enhance the contrastive learning of dense models, but the generality of asymmetric sparsity across self-supervised learning methods has not been fully explored. Furthermore, transferring supervised sparse learning to SSL is also largely under-explored. To address these research gaps, this paper investigates the correlation between in-training sparsity and SSL. In particular, we propose a novel sparse SSL algorithm that embraces the benefits of contrastiveness while exploiting high sparsity during SSL training. The proposed framework is evaluated comprehensively with various granularities of sparsity, including element-wise sparsity, GPU-friendly N:M structured fine-grained sparsity, and hardware-specific structured sparsity. Extensive experiments across multiple datasets show that the proposed method outperforms SoTA sparse learning algorithms under various SSL frameworks. Furthermore, the training speedup enabled by the proposed method is evaluated with an actual DNN training accelerator model.

1. INTRODUCTION

The early empirical success of deep learning was primarily driven by supervised learning with massive labeled data, e.g., ImageNet (Krizhevsky et al., 2012). To overcome the labeling bottleneck, learning visual representations without label-intensive datasets has been widely investigated (Chen et al., 2020a; He et al., 2020; Grill et al., 2020; Zbontar et al., 2021). Recent self-supervised learning (SSL) methods have shown great success and achieved performance comparable to their supervised counterparts. A common property of various SSL designs is the use of different augmentations of the original images to generate contrastiveness, which requires duplicated encoding with wide and deep models (Meng et al., 2022). The magnified training effort and extensive resource consumption make SSL-trained encoders infeasible for on-device computing (e.g., mobile devices). This contradiction between label-free learning and extraordinary computation cost limits further applications of SSL and necessitates efficient sparse training techniques for self-supervised learning.

For supervised learning, sparsification (a.k.a. pruning) has been widely explored, aiming to reduce computation and memory costs by removing unimportant parameters during training or fine-tuning. Conventional supervised pruning explores weight sparsity in a pre-trained model, followed by additional fine-tuning to recover accuracy (Han et al., 2016). For self-supervised learning, recent work (Chen et al., 2021) likewise sparsified a pre-trained dense SSL model for downstream tasks with element-wise pruning. In addition to such fine-grained sparsity, MCP (Pan et al., 2022) exploited filter-wise sparsity on the MoCo-SSL (He et al., 2020) model. Both of these sparse SSL works (Chen et al., 2021; Pan et al., 2022) exploit sparsity on top of a pre-trained dense model.
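As a concrete reference point for the element-wise magnitude pruning discussed above, the following is a minimal NumPy sketch (an illustration of the general technique, not the cited works' actual implementations): keep the largest-magnitude weights and zero out the rest.

```python
import numpy as np

def magnitude_prune_mask(weights, sparsity):
    """Return a binary mask that keeps the largest-magnitude weights.

    weights: array of any shape; sparsity: fraction of weights to zero out.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of weights to remove
    if k == 0:
        return np.ones_like(weights)
    # threshold = k-th smallest magnitude; weights at or below it are pruned
    threshold = np.partition(flat, k - 1)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

w = np.array([[0.9, -0.1], [0.05, -0.7]])
mask = magnitude_prune_mask(w, 0.5)
# keeps 0.9 and -0.7; the two smallest-magnitude entries are masked out
```

Filter-wise pruning (as in MCP) follows the same idea but scores and removes whole filters, e.g., by the L1 norm of each filter's weights, rather than individual elements.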
However, compared to supervised learning, obtaining the pre-trained model via SSL requires significantly more training effort (e.g., ∼1,000 epochs for SSL pre-training vs. ∼200 epochs for supervised pre-training). Therefore, exploring post-training sparsity via fine-tuning is not an ideal solution for efficient SSL.

• We first identify the limitations of the sparsity-induced asymmetric SSL in SDCLR (Jiang et al., 2021) and show that the sparsity-induced "sparse-dense" asymmetric architecture is not universally applicable to various SSL schemes.

• We demonstrate the incompatibility of the SoTA "prune-and-regrow" sparse training method with SSL. Specifically, we formalize the iterative architectural changes caused by applying "prune-and-regrow" to SSL, which we term architecture oscillation, and observe that frequently updating the pruning candidates leads to larger architecture oscillation, which further hinders the performance of self-supervised learning.
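One simple way to quantify architecture oscillation between consecutive mask updates is the fraction of pruning decisions that flip from one round to the next. The sketch below is an illustrative metric of this kind, not necessarily the paper's exact formalization:

```python
import numpy as np

def mask_oscillation(mask_prev, mask_curr):
    """Fraction of connections whose pruned/active status flipped
    between two consecutive pruning rounds.

    Both masks are binary arrays of the same shape; 0.0 means the
    sparse architecture is stable, larger values mean stronger oscillation.
    """
    flipped = np.sum(mask_prev != mask_curr)
    return flipped / mask_prev.size

m1 = np.array([1, 1, 0, 0, 1, 0])
m2 = np.array([1, 0, 1, 0, 1, 0])
mask_oscillation(m1, m2)  # 2 of 6 positions changed -> 1/3
```

Under this view, frequent prune-and-regrow updates that repeatedly swap pruning candidates yield persistently high flip fractions, i.e., an encoder whose effective architecture keeps changing during contrastive training.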



Figure 1: (a) Applying the self-damaging scheme (Jiang et al., 2021) to SSL. (b) Applying the prune-and-regrow scheme (Liu et al., 2021) to SSL. (c) Proposed contrastiveness-aware sparse training.

On the other hand, sparsifying the model during supervised training (Dettmers & Zettlemoyer, 2019; Evci et al., 2020) has emerged as a promising technique to improve training efficiency while obtaining a sparse model. To accurately localize the unimportant parameters, prior works investigated various importance metrics, including gradient-based pruning (Dettmers & Zettlemoyer, 2019) and the "prune-and-regrow" scheme (Liu et al., 2021). Compared to post-training sparsification methods, in-training sparsification for supervised training achieves memory/computation reduction as well as speedup of the training process. However, exploiting in-training sparsity for SSL models trained from scratch is still largely under-explored. More recently, SDCLR (Jiang et al., 2021) proposed a sparsified "self-damaging" encoder, which creates a "sparse-dense" SSL architecture by imposing fixed high sparsity on one contrastive path (e.g., the offline encoder) while keeping the counterpart dense (e.g., the online encoder), as shown in Figure 1(a). Such a "sparse-dense" SSL architecture helps to enhance contrastive learning, leading to improved performance on non-salient and imbalanced samples. Nevertheless, it mainly focuses on enhancing the performance of the SSL-trained dense model (i.e., SimCLR (Chen et al., 2020a)), and whether such a "sparse-dense" asymmetric learning scheme works in other SSL methods remains unclear. On the other hand, the compatibility of existing SoTA sparse training techniques (Dettmers & Zettlemoyer, 2019; Evci et al., 2020; Liu et al., 2021) with SSL is also ambiguous.
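To make the "prune-and-regrow" scheme concrete, the following is a minimal sketch of one update step in the spirit of the cited in-training methods (e.g., gradient-based regrowth as in RigL (Evci et al., 2020)), shown on flat weight vectors; the update fraction and the exact criteria are illustrative assumptions, not the cited works' precise algorithms.

```python
import numpy as np

def prune_and_regrow(weights, mask, grads, update_frac=0.3):
    """One prune-and-regrow step: drop the smallest-magnitude active
    weights, then reactivate the same number of inactive connections
    with the largest gradient magnitude (keeps overall sparsity fixed)."""
    active = np.flatnonzero(mask)
    inactive = np.flatnonzero(mask == 0)
    n = int(update_frac * active.size)
    if n == 0 or inactive.size == 0:
        return mask.copy()
    # prune: smallest |w| among currently active connections
    drop = active[np.argsort(np.abs(weights[active]))[:n]]
    # regrow: largest |g| among currently inactive connections
    grow = inactive[np.argsort(-np.abs(grads[inactive]))[:n]]
    new_mask = mask.copy()
    new_mask[drop] = 0
    new_mask[grow] = 1
    return new_mask

w = np.array([0.9, 0.01, 0.5, 0.0, 0.0, 0.0])
m = np.array([1, 1, 1, 0, 0, 0])
g = np.array([0.0, 0.0, 0.0, 0.8, 0.1, 0.05])
new_m = prune_and_regrow(w, m, g, update_frac=0.34)
# drops the weakest active weight (index 1) and regrows index 3
```

Each such step changes which connections are active; applied frequently during SSL training, these per-step changes accumulate into the architecture oscillation described above.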
As shown in Figure 1(b), these methods frequently prune and regrow the model architecture during SSL training, whereas SDCLR maintains a fixed "sparse-dense" architecture during the entire training process. The under-explored sparse contrastiveness and the high cost of self-supervised learning inspire us to investigate the following question: How can we efficiently sparsify the model during self-supervised training with awareness of contrastiveness? To address this question, we present Synchronized Contrastive Pruning (SyncCP), a novel sparse training algorithm designed for self-supervised learning. To maximize the energy efficiency of SSL training, SyncCP exploits in-training sparsity in both encoders while remaining compatible with various SSL frameworks. The main contributions of this work are:

• We present SyncCP, a new sparse training algorithm designed for self-supervised learning. SyncCP gradually exploits high in-training sparsity in both encoders with contrastive
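A gradual in-training sparsity ramp of the kind mentioned in the contribution above is often implemented with a cubic schedule in the gradual-pruning literature; the sketch below shows that standard form as an illustration, since the exact schedule used by SyncCP is not specified here.

```python
def cubic_sparsity(step, total_steps, final_sparsity, initial_sparsity=0.0):
    """Cubic sparsity ramp: starts at initial_sparsity and smoothly
    saturates at final_sparsity as training progresses."""
    t = min(step / total_steps, 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - t) ** 3

# sparsity targets over a 100-step ramp toward 90% sparsity
schedule = [cubic_sparsity(s, 100, 0.9) for s in range(0, 101, 25)]
```

The ramp prunes aggressively early (when weights are easy to recover) and slows down near the target sparsity, which is why it is a common choice for sparsifying models trained from scratch.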

