SMARTFRZ: AN EFFICIENT TRAINING FRAMEWORK USING ATTENTION-BASED LAYER FREEZING

Abstract

There has been a proliferation of artificial intelligence applications, where model training is key to delivering high-quality services. However, the model training process is both time-intensive and energy-intensive, inevitably limiting application efficiency for users. Layer freezing, an efficient model training technique, has been proposed to improve training efficiency. Although existing layer freezing methods demonstrate great potential to reduce model training costs, they still suffer from shortcomings such as limited generalizability and compromised accuracy. For instance, existing layer freezing methods either require the freeze configurations to be manually defined before training, which does not transfer across networks, or use heuristic freezing criteria that make it hard to guarantee decent accuracy in different scenarios. Therefore, there is a lack of a generic and smart layer freezing method that can automatically perform in-situ layer freezing for different networks during the training process. To this end, we propose a generic and efficient training framework (SmartFRZ). The core technique in SmartFRZ is attention-guided layer freezing, which can automatically select the appropriate layers to freeze without compromising accuracy. Experimental results show that SmartFRZ effectively reduces the amount of computation in training, achieves significant training acceleration, and outperforms state-of-the-art layer freezing approaches.

1. INTRODUCTION

Deep neural networks (DNNs) have become the core enabler of a wide spectrum of AI applications, such as natural language processing (Vaswani et al., 2017; Kenton & Toutanova, 2019), visual recognition (Li et al., 2022; Faraki et al., 2021), automatic machine translation (Sun et al., 2020; Zheng et al., 2021), and emerging application domains such as robot-assisted eldercare (Do et al., 2018; Bemelmans et al., 2012), mobile diagnosis (Panindre et al., 2021; Abdel-Basset et al., 2020), and wild surveillance (Akbari et al., 2021; Ke et al., 2020). To satisfy the growing demand for adaptability and training efficiency of deployed DNN models, a surge of research efforts has been devoted to designing efficient training paradigms (He et al., 2021; Yuan et al., 2021; Wu et al., 2020; 2021). For example, sparse training (Evci et al., 2020; Yuan et al., 2021) and low-precision training (Yang et al., 2020; Zhao et al., 2021) are two active research areas for efficient training that can effectively reduce training costs, such as compute FLOPs and memory. However, there are still fundamental limitations in the generality of these methods when used in practice. Concretely, sparse training methods generally require the support of dedicated libraries or compiler optimizations (Chen et al., 2018; Niu et al., 2020) to leverage model sparsity and save computation costs. Similarly, low-precision training (e.g., INT-8 or fewer bits) is difficult to support on commodity GPUs and requires specialized designs for edge devices such as field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). Therefore, it is desirable to have a generic efficient training method that can be easily and effectively adapted to various application scenarios.
Recent studies (Brock et al., 2017; Liu et al., 2021) revealed that freezing some DNN layers at a certain stage (i.e., training iteration) during the training process will not degrade the accuracy of the final trained model and can effectively reduce training costs. Most importantly, layer freezing can be achieved using native training frameworks such as PyTorch/TensorFlow without additional support, making it more accessible to a wide range of applications and users for training cost reduction and acceleration. Previous work has mainly used heuristic freezing strategies, such as empirically selecting which DNN layers to freeze and when to freeze them (Lee et al., 2019; Yuan et al., 2022). As such, these heuristic freezing strategies require a trial-and-error process to find appropriate freezing configurations for individual tasks/networks, resulting in inconvenience and inefficiency when deployed to various application scenarios. Some recent works attempt to freeze DNN layers in an adaptive manner by using the gradient norm (Liu et al., 2021) or SVCCA score (He et al., 2021). However, it is known that DNN models generally do not converge monotonically to their optimal position. As a result, these adaptive methods, which decide whether to freeze layers using heuristic criteria, are less robust to the fluctuations of model training, leading to compromised accuracy. Therefore, we raise a fundamental question that has seldom been asked: Is there a layer freezing method that can overcome the above-mentioned shortcomings while remaining efficient?
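To make the mechanism concrete, layer freezing indeed needs no specialized tooling: in stock PyTorch it amounts to excluding a layer's parameters from gradient computation and from the optimizer. The small model and layer choice below are illustrative assumptions, not part of any particular freezing method.

```python
import torch
import torch.nn as nn

# Minimal sketch: a small CNN whose first block is frozen, so its
# parameters receive no gradients and no weight updates.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),  # block 0
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),                     # block 1
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)

def freeze(module: nn.Module) -> None:
    for p in module.parameters():
        p.requires_grad = False
    module.eval()  # also fix batch-norm running statistics

for layer in list(model)[:3]:  # freeze block 0 (conv + bn + relu)
    freeze(layer)

# The optimizer only receives the still-trainable parameters; autograd
# stops backpropagating once no earlier parameter needs a gradient,
# which is where the training-cost savings come from.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9)
```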
Inspired by the outstanding performance of the attention mechanism in sequence-based problems such as classification (Long et al., 2018), dialog detection (Shen & Lee, 2016), and affect recognition (Gorrostieta et al., 2018), it is possible that the attention mechanism could also be a promising solution for layer freezing in efficient training, but it has never been explored in prior literature. In this paper, we introduce the attention mechanism for layer freezing. Specifically, we design a lightweight attention-based predictor to collect and rank DNN context information from multiple timestamps during the training process. Based on its predictions, we adaptively freeze DNN layers to save training computation costs and accelerate training while maintaining high model accuracy. To train the attention-based predictor, we propose a layer representational similarity-based method to generate a special training dataset using a publicly available dataset (e.g., ImageNet). The predictor is then trained offline once and learns the generic convergence pattern along the training history, which generalizes across different models and datasets. We summarize our contributions as follows:
• We design a novel and lightweight predictor using the attention mechanism for layer freezing in efficient training. The predictor automatically captures DNN context information from multiple timestamps and adaptively freezes layers during the training process.
• We propose to leverage layer representational similarity to generate a special dataset for training the attention-based predictor. The trained predictor can be used across different datasets and networks.
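The predictor described above can be sketched as a small self-attention network over a window of per-layer training statistics. The feature choice, dimensions, and thresholding below are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class FreezePredictor(nn.Module):
    """Hypothetical sketch of an attention-based freezing predictor.

    It consumes a sequence of per-layer context vectors collected at T
    recent timestamps (e.g. weight-change or gradient statistics) and
    outputs the probability that the layer has converged and can be
    frozen. All sizes here are illustrative.
    """

    def __init__(self, feat_dim: int = 8, embed_dim: int = 32, num_heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(feat_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.head = nn.Linear(embed_dim, 1)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, T, feat_dim) -- one row per recorded timestamp
        x = self.embed(context)
        x, _ = self.attn(x, x, x)  # let attention weigh informative timestamps
        return torch.sigmoid(self.head(x.mean(dim=1))).squeeze(-1)

predictor = FreezePredictor()
history = torch.randn(1, 10, 8)   # 10 timestamps of one layer's statistics
freeze_prob = predictor(history)  # freeze the layer if above a threshold
```

Because the predictor looks at a window of timestamps rather than a single snapshot, attention can discount transient fluctuations in the training signal, which is the robustness advantage over single-criterion heuristics.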

2. RELATED WORK

Recent studies (Brock et al., 2017; Kornblith et al., 2019) have found that not all layers in deep neural networks need to be trained equally. For example, the early layers in DNNs are responsible for low-level feature extraction and usually have fewer parameters than the later layers, making the early layers converge faster during training. Therefore, layer freezing techniques have been proposed, which stop updating certain layers during the training process to save training costs (Lee et al., 2019; Zhang & He, 2020). Previous work mainly uses heuristic strategies to determine which layers to freeze and when to freeze them. For example, Brock et al. (2017) use a linear/cubic schedule to freeze the layers sequentially (one by one from the first layers to the later layers). The recent work of Yuan et al. (2022)
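A linear schedule of the kind used above can be sketched as follows; the parameterization (first layer freezing at the halfway point, last layer training to the end) is a simplifying assumption for illustration, not the exact FreezeOut formula.

```python
def linear_freeze_iteration(layer_idx: int, num_layers: int,
                            total_iters: int, t0: float = 0.5) -> int:
    """Illustrative linear freezing schedule (in the spirit of FreezeOut).

    The first layer stops training at fraction t0 of the run, later layers
    freeze progressively later, and the last layer trains to the end.
    """
    frac = t0 + (1.0 - t0) * layer_idx / max(num_layers - 1, 1)
    return int(frac * total_iters)

# Per-layer freeze points for a 5-layer model trained for 1000 iterations.
schedule = [linear_freeze_iteration(i, 5, 1000) for i in range(5)]
# A training loop would set requires_grad=False for layer i once the
# current iteration reaches schedule[i].
```

Such a schedule is fixed before training begins, which is precisely the limitation adaptive methods (and this paper's attention-guided approach) aim to remove.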

