BUDGETED TRAINING FOR VISION TRANSFORMER

Abstract

The superior performance of Vision Transformers often comes with higher training costs. Compared to their CNN counterparts, Transformer models are hungry for large-scale data and their training schedules are usually prolonged. This places great restrictions on training Transformers with limited resources, where a proper trade-off between training cost and model performance is desired. In this paper, we address the problem by proposing a framework that enables training under any given budget from the perspective of model structure, while achieving competitive model performance. Specifically, based on the observation that Transformers exhibit different levels of model redundancy at different training stages, we propose to dynamically control the activation rate of the model structure along the training process, and to meet the training budget by adjusting the duration spent at each level of model complexity. Extensive experiments demonstrate that our framework is applicable to various Vision Transformers and achieves competitive performance under a wide range of training budgets.

1. INTRODUCTION

Benefiting from their large model capacity, Vision Transformers (ViTs) (Dosovitskiy et al., 2021) have demonstrated predominant performance on various vision tasks, including object detection (Wang et al., 2021a; Liu et al., 2021; Li et al., 2022b), semantic segmentation (Zheng et al., 2021; Strudel et al., 2021), video understanding (Fan et al., 2021; Arnab et al., 2021), etc. However, these improvements come at huge training costs, as the datasets, the model parameters, and the computation complexity have grown enormously in size. For example, ViT-G/14 with Greedy Soup (Wortsman et al., 2022) achieves 90.9% accuracy on the ImageNet (Deng et al., 2009) benchmark while having 1843M training parameters and being pretrained on a dataset of 3 billion images. Under this circumstance, computation resources have become an inevitable overhead that prevents common users from training desired vision models. The methodology of designing modern Transformers centers on finding the best trade-off between computation cost and model performance (Han et al., 2022). Besides widely used factors such as the number of learnable parameters, the floating point operations (FLOPs), and the inference latency, training cost is also an essential resource, involving the training schedule (Wu et al., 2020; Yin et al., 2022; Wang et al., 2022b), memory usage (Pan et al., 2021; Wang et al., 2021c; Ni et al., 2022), and training-stage complexity (Zhang & He, 2020; Gong et al., 2019; Dong et al., 2020). Therefore, the topic of training Transformers efficiently has received broad research interest, especially considering the large-scale data and prolonged schedules involved. Considering that many research labs and companies cannot afford the full training schedule of the best model, one common solution is to train the best model possible given a desirable and acceptable total training cost.
Previous works that address the training efficiency problem mainly learn model-specific schedules based on handcrafted designs (Gong et al., 2019; Gu et al., 2020; McDanel & Huynh, 2022) or Automated Machine Learning (Li et al., 2022a). However, these approaches either adjust the costs only within the training process, or only provide training schedules for a sparse set of training costs. This inflexibility hinders generalization to an arbitrary pre-defined budget.



Figure 1: (a) Our method consistently outperforms Linear-LR (Li et al., 2020) on DeiT-S (Touvron et al., 2021) under three different training budgets of 25%, 50%, and 75%; it even improves 1.1% over the original model under the full budget. (b) Our method dynamically adjusts the activation rate of model computation by gradually increasing the number of attention heads, the number of tokens, and the MLP hidden dimension, controlling the model redundancy during training to meet the given budget while achieving good performance.

In this paper, we take a step forward and focus on the problem of budgeted training (Li et al., 2020), i.e., achieving the highest model performance under any given training budget, measured by total training time or computation cost. Different from previous work, including using smaller model variants, coreset selection (Mirzasoleiman et al., 2020; Killamsetty et al., 2021), and efficient training schedules (Li et al., 2020; Chen et al., 2022a), we target this problem from the perspective of the inherent properties of Vision Transformers. Specifically, we focus on leveraging the redundancies of the model structure during ViT training. There exist several types of redundancy, including the feature diversity across different attention heads, the hidden dimensions in the MLP blocks, and the number of attended visual tokens. These redundancies are correlated with the training process; in particular, they tend to be higher at early stages. This motivates us to dynamically control the activation rate of the model along the training process, where fewer parameters participate in the early training stages and the full model capacity is activated at late stages. As depicted in Fig. 1(b), we activate 2 attention heads, 51% of the tokens, and 384 MLP hidden dimensions in the first stage, which condenses the model redundancy and keeps the computation cost low; the activation rate then gradually increases as training goes on. In this way, the training process becomes more compact, information loss is largely avoided, and the influence on the final model performance is limited. Based on this technique, we can adjust the duration of each training stage to accommodate different training budgets. Fig. 1(a) shows that our method consistently outperforms the baseline at three different budgets. Extensive experiments demonstrate that our method significantly outperforms other budgeted training baselines and achieves a competitive training cost-performance trade-off on various Vision Transformer models.
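To make the idea concrete, the staged schedule described above can be sketched as follows. This is a minimal illustration, not the paper's actual scheduling rule: the `Stage` configurations, the FLOPs proxy in `relative_cost`, and the exponential weighting used to pick stage durations are all our own assumptions, chosen only to show how per-stage durations can be fit to a given budget fraction.

```python
import math
from dataclasses import dataclass

@dataclass
class Stage:
    heads: int         # number of active attention heads
    token_rate: float  # fraction of visual tokens kept
    mlp_dim: int       # active MLP hidden dimension

def relative_cost(stage: Stage, full: Stage) -> float:
    """Per-iteration cost of a stage relative to the full model.
    Crude FLOPs proxy (an assumption): attention scales with
    heads * token_rate^2, the MLP with mlp_dim * token_rate,
    and the two contribute equally to the full model's FLOPs."""
    attn = (stage.heads / full.heads) * stage.token_rate ** 2
    mlp = (stage.mlp_dim / full.mlp_dim) * stage.token_rate
    return 0.5 * attn + 0.5 * mlp

def stage_durations(stages, full, total_epochs, budget_frac, iters=60):
    """Split total_epochs across the partial stages plus a final
    full-model stage so that total compute is about budget_frac of a
    full-model run of total_epochs. Durations are weighted by
    exp(-alpha * cost); alpha is found by binary search (a simple
    heuristic, not the paper's allocation rule)."""
    costs = [relative_cost(s, full) for s in stages] + [1.0]
    target = budget_frac * total_epochs  # budget in full-model epoch units

    def durations(alpha):
        w = [math.exp(-alpha * c) for c in costs]
        z = sum(w)
        return [total_epochs * wi / z for wi in w]

    lo, hi = -50.0, 50.0  # large alpha -> cheap early stages dominate
    for _ in range(iters):
        mid = (lo + hi) / 2
        cost = sum(d * c for d, c in zip(durations(mid), costs))
        if cost > target:
            lo = mid  # too expensive: shift epochs toward cheap stages
        else:
            hi = mid
    return durations((lo + hi) / 2)

full = Stage(heads=6, token_rate=1.0, mlp_dim=1536)   # DeiT-S-like full model
stages = [Stage(2, 0.51, 384), Stage(4, 0.75, 768)]   # Fig. 1(b)-like stages
epochs = stage_durations(stages, full, total_epochs=300, budget_frac=0.5)
```

Under these assumed numbers, the returned durations sum to 300 epochs while their cost-weighted total matches 50% of a full 300-epoch run, with the cheap early stages absorbing most of the epochs; the actual method may distribute stages differently.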

