SMARTFRZ: AN EFFICIENT TRAINING FRAMEWORK USING ATTENTION-BASED LAYER FREEZING

Abstract

Artificial intelligence applications have proliferated, and model training is key to delivering high-quality services for them. However, model training is both time-intensive and energy-intensive, which inevitably works against users' demand for application efficiency. Layer freezing, an efficient model training technique, has been proposed to improve training efficiency. Although existing layer freezing methods show great potential to reduce training costs, they still have shortcomings such as limited generalizability and compromised accuracy. For instance, existing methods either require the freezing configuration to be defined manually before training, which does not transfer across different networks, or rely on heuristic freezing criteria that can hardly guarantee decent accuracy across different scenarios. What is lacking, therefore, is a generic and smart layer freezing method that can automatically perform "in-situation" layer freezing for different networks during training. To this end, we propose SmartFRZ, a generic and efficient training framework. The core technique in SmartFRZ is attention-guided layer freezing, which automatically selects the appropriate layers to freeze without compromising accuracy. Experimental results show that SmartFRZ effectively reduces the amount of computation in training, achieves significant training acceleration, and outperforms state-of-the-art layer freezing approaches.

1. INTRODUCTION

Deep neural networks (DNNs) have become the core enabler of a wide spectrum of AI applications, such as natural language processing (Vaswani et al., 2017; Kenton & Toutanova, 2019), visual recognition (Li et al., 2022; Faraki et al., 2021), and automatic machine translation (Sun et al., 2020; Zheng et al., 2021), as well as emerging application domains such as robot-assisted eldercare (Do et al., 2018; Bemelmans et al., 2012), mobile diagnosis (Panindre et al., 2021; Abdel-Basset et al., 2020), and wild surveillance (Akbari et al., 2021; Ke et al., 2020). To satisfy the growing demand for adaptability and training efficiency of DNN models in deployment, a surge of research effort has been devoted to designing efficient training paradigms (He et al., 2021; Yuan et al., 2021; Wu et al., 2020; 2021). For example, sparse training (Evci et al., 2020; Yuan et al., 2021) and low-precision training (Yang et al., 2020; Zhao et al., 2021) are two active research areas that can effectively reduce training costs such as computation (FLOPs) and memory. However, these methods still face fundamental limitations in generality when used in practice. Concretely, sparse training methods generally require the support of dedicated libraries or compiler optimizations (Chen et al., 2018; Niu et al., 2020) to exploit model sparsity and save computation. Similarly, low-precision training (e.g., INT-8 or fewer bits) is difficult to support on GPUs and requires specialized designs for edge devices such as field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). Therefore, it is desirable to have a generic, efficient training method that can be easily and effectively adapted to various application scenarios.
Recent studies (Brock et al., 2017; Liu et al., 2021) revealed that freezing some DNN layers at a certain stage (i.e., training iteration) during the training process will not degrade the accuracy of

* These authors contributed equally.

