NASOA: TOWARDS FASTER TASK-ORIENTED ONLINE FINE-TUNING

Abstract

Fine-tuning from pre-trained ImageNet models has been a simple, effective, and popular approach for various computer vision tasks. The common practice of fine-tuning is to adopt a default hyperparameter setting with a fixed pre-trained model, neither of which is optimized for specific tasks and time constraints. Moreover, in cloud computing or GPU clusters where tasks arrive sequentially in a stream, faster online fine-tuning is a more desirable and realistic strategy for saving money, energy consumption, and CO2 emission. In this paper, we propose a joint Neural Architecture Search and Online Adaption framework named NASOA towards faster task-oriented fine-tuning upon the request of users. Specifically, NASOA first adopts an offline NAS to identify a group of training-efficient networks to form a pretrained model zoo. We propose a novel joint block- and macro-level search space to enable a flexible and efficient search. Then, by estimating fine-tuning performance via an adaptive model that accumulates experience from past tasks, an online schedule generator picks the most suitable model and generates a personalized training regime for each desired task in a one-shot fashion. The resulting model zoo is more training-efficient than SOTA NAS models, e.g., 6x faster than RegNetY-16GF and 1.7x faster than EfficientNetB3. Experiments on multiple datasets also show that NASOA achieves much better fine-tuning results, i.e., improving accuracy by around 2.1% over the best performance of the RegNet series under various time constraints and tasks, while being 40x faster than the BOHB method.

1. INTRODUCTION

Fine-tuning using pre-trained models has become the de-facto standard in computer vision because of its impressive results on various downstream tasks such as fine-grained image classification (Nilsback & Zisserman, 2008; Welinder et al., 2010), object detection (He et al., 2019; Jiang et al., 2018; Xu et al., 2019), and segmentation (Chen et al., 2017; Liu et al., 2019). Kornblith et al. (2019) and He et al. (2019) verified that fine-tuning pre-trained networks outperforms training from scratch. It can further help to avoid over-fitting (Cui et al., 2018) as well as reduce training time significantly (He et al., 2019). Due to those merits, many cloud computing and AutoML pipelines provide fine-tuning services for an online stream of incoming users with new data, different tasks, and time limits. In order to save the user's time, money, energy consumption, or even CO2 emission, an efficient online automated fine-tuning framework is practically useful and in great demand. Thus, in this work, we propose to explore the problem of faster online fine-tuning. The conventional practice of fine-tuning is to adopt a set of predefined hyperparameters for training a predefined model (Li et al., 2020). This has three drawbacks in the online setting: 1) the design of the backbone model is not optimized for the upcoming fine-tuning task, and the selection of the backbone model is not data-specific; 2) a default hyperparameter setting may not be optimal across tasks, and the training settings may not meet the time constraints provided by users; 3) with incoming tasks, the conventional paradigm is not suitable for the online setting since it cannot memorize and accumulate experience from past fine-tuning tasks. Thus, we propose to decouple our faster fine-tuning problem into two parts: finding efficient fine-tuning networks, and generating optimal fine-tuning schedules pertinent to specific time constraints in an online learning fashion.
Recently, Neural Architecture Search (NAS) algorithms have demonstrated promising results in discovering top-accuracy architectures, surpassing the performance of hand-crafted networks while saving human effort (Zoph et al., 2018; Yao et al., 2020). However, those NAS works usually focus on inference time/FLOPS optimization, and their search spaces are not flexible enough to guarantee optimality for fast fine-tuning. In contrast, we develop a NAS scheme with a novel flexible search space for fast fine-tuning. On the other hand, hyperparameter optimization (HPO) methods such as grid search (Bergstra & Bengio, 2012), Bayesian optimization (BO) (Strubell et al., 2019a; Mendoza et al., 2016), and BOHB (Falkner et al., 2018) are used in deep learning and achieve good performance. However, those search-based methods are computationally expensive and require iterative "trial and error", which conflicts with our goal of faster adaptation. In this work, we propose a novel Neural Architecture Search and Online Adaption framework named NASOA. First, we conduct an offline NAS to generate an efficient fine-tuning model zoo. We design a novel block-level and macro-structure search space to allow a flexible choice of networks. Once the efficient-training model zoo is created offline by NAS with Pareto optimal models, online users can enjoy the benefit of those efficient-training networks without any marginal cost. We then propose an online learning algorithm with an adaptive predictor to model the relation between different hyperparameters, models, dataset meta-information, and the final fine-tuning performance. The final training schedule is generated directly by selecting the fine-tuning regime with the best predicted performance. Benefiting from experience accumulation via online learning, the diversity of the data and the accumulating results continuously improve our regime generator.
Our method behaves in a one-shot fashion and does not involve additional search cost as HPO does, endowing it with the capability of providing various training regimes under different time constraints. Extensive experiments are conducted on multiple widely used fine-tuning datasets. The searched model zoo ET-NAS is more training-efficient than SOTA ImageNet models, e.g., 6x faster in training than RegNetY-16GF and 1.7x faster than EfficientNetB3. Moreover, using the whole NASOA, our online algorithm achieves superior fine-tuning results in terms of both accuracy and fine-tuning speed, i.e., improving accuracy by around 2.1% over the best performance in the RegNet series under various tasks, and saving 40x computational cost compared to the BOHB method. Our contributions are summarized as follows: • To the best of our knowledge, we make the first effort to propose a faster fine-tuning pipeline that seamlessly combines training-efficient NAS and an online adaption algorithm. Our NASOA can effectively generate a personalized fine-tuning schedule for each desired task via an adaptive model that accumulates experience from past tasks. • The proposed novel joint block/macro-level search space enables a flexible and efficient search. The resulting model zoo ET-NAS is more training-efficient than very strong ImageNet SOTA models, e.g., EfficientNet and RegNet. All the ET-NAS models have been released to help the community skip the computation-heavy NAS stage and directly enjoy the benefit of NASOA. • The whole NASOA pipeline achieves much better fine-tuning results in terms of both accuracy and fine-tuning efficiency than the current fine-tuning best practice and HPO methods, e.g., BOHB.

3. THE PROPOSED APPROACH

The goal of this paper is to develop an online fine-tuning pipeline that facilitates fast, continuous cross-task model adaption. The preliminary experiments in Section 4.1 confirm that the model architecture and hyperparameters such as the learning rate and frozen stages greatly influence the accuracy and speed of fine-tuning. Thus, our NASOA includes two parts, as shown in Figure 1. In Eq. 1, A is the architecture, acc(.) is the Top-1 accuracy on ImageNet, T_s(.) is the average step time of one iteration, and T_m is the maximum step time allowed. The step time is defined as the total time of one iteration, including forward/backward propagation and parameter update.

Search Space Design. The design of the search space is extremely important (Radosavovic et al., 2020). As shown in Figure 2, our search space contains two levels.

Block-level Search Space. We consider a search space based on 1-3 successive nodes of 5 different operations. Up to three skip connections, in addition to one fixed residual connection, are searched. Element-wise add or channel-wise concat is chosen to combine the features of the skip connections. For each selected operation, we also search for the ratio of changing the channel size (e.g., ×0.25).

Macro-level Search Space. We design a flexible search space to find the optimal channel size (width), depth (total number of blocks), when to down-sample, and when to raise the channels. Our macro-level structure consists of 4 flexible stages, with the spatial size gradually down-sampled by a factor of 2. In each stage, we stack a number of block architectures. The positions of the channel-doubling blocks are also flexible. This search space contains 1.5 × 10^7 unique architectures. Details of the search space and its encodings can be found in Appendix B.1.

Multi-objective Searching Algorithm. For the MOOP in Eq. 1, we define that architecture A_1 dominates A_2 if (i) A_1 is no worse than A_2 in all objectives, and (ii) A_1 is strictly better than A_2 in at least one objective.
A* is Pareto optimal if no other A dominates A*. The set of all Pareto optimal architectures constitutes the Pareto front. To solve this MOOP, we modify the well-known Elitist Non-Dominated Sorting Genetic Algorithm (NSGA-II) (Deb et al., 2000) to optimize the Pareto front P_f. The main idea of NSGA-II is to rank the sampled architectures by non-dominated sorting and preserve a group of elite architectures. A group of new architectures is then sampled and trained by mutating the current elite architectures on P_f. The algorithm can be parallelized over multiple computation nodes to lift P_f simultaneously. We modify NSGA-II into a NAS algorithm as follows: a) to enable parallel searching on N computational nodes, we modify the non-dominated-sort method to generate exactly N mutated models for each generation, instead of a variable number as the original NSGA-II does; b) we define a group of mutation operations over our block/macro search space for NSGA-II to change the network structure dynamically; c) we add a parent computation node to measure each selected architecture's training speed and generate the Pareto optimal models. Details of our NSGA-II can be found in Appendix B.2.2.

Efficient Training Model Zoo Z_oo (ET-NAS). With the proposed NAS method, we create an efficient-training model zoo Z_oo, named ET-NAS, which consists of the K Pareto optimal models A*_i on P_f. The A*_i are then pretrained on ImageNet. Details of our NAS and the A*_i can be found in Appendix B.
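As a minimal sketch (function names and the two-objective encoding are ours, not the paper's), the dominance rule in (i)-(ii) and the resulting Pareto front can be written as follows, scoring each architecture by the pair (accuracy, -step time) so that both objectives are maximized:

```python
from typing import List, Tuple

def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """a dominates b iff a is no worse in every objective (i)
    and strictly better in at least one objective (ii)."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(archs: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Keep only the architectures not dominated by any other one."""
    return [a for a in archs if not any(dominates(b, a) for b in archs if b != a)]
```

For example, an architecture that is both less accurate and slower than another is dropped from the front, while two architectures trading accuracy against step time both survive.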

3.2. ONLINE TASK-ORIENTED FINE-TUNING SCHEDULE GENERATION

With the help of the efficient-training Z_oo, the marginal computational cost to each user is minimized while they enjoy the benefit of NAS. We then need to decide a suitable fine-tuning schedule for the user's upcoming task. Given the user's dataset D and fine-tuning time constraint T_l, an online regime generator G(., .) is desired: [Regime_FT, A*_i] = G(D, T_l), such that Acc(A_i^FineTune, D_val) is maximized, where Regime_FT includes all the required hyperparameters, i.e., the lr schedule, total training steps, and frozen stages. G(., .) also needs to pick the most suitable pretrained model A*_i from Z_oo. Note that existing search-based HPO methods require huge computational resources and cannot fit our online one-shot training scenario. Instead, we first propose an online learning predictor Acc_P to model the validation accuracy Acc(A_i^FT, D_val) from meta-data information. We then use the predictor to construct G(., .) to generate an optimal hyperparameter setting and model.
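The one-shot generator above amounts to ranking candidate (model, regime) pairs by predicted accuracy rather than training them. A hedged sketch (the predictor signature, grid keys, and model names are ours; the paper's step-time feasibility filter from its lookup table is omitted):

```python
import itertools

def generate_schedule(predictor, model_zoo, dataset_stats, grids):
    """One-shot regime generation: score every (model, hyperparameter)
    combination with the learned predictor and return the argmax,
    with no trial-and-error fine-tuning runs."""
    best, best_score = None, float("-inf")
    for model, lr, frozen in itertools.product(
            model_zoo, grids["lr"], grids["frozen_stages"]):
        regime = {"lr": lr, "frozen_stages": frozen}
        score = predictor(model, regime, dataset_stats)
        if score > best_score:
            best, best_score = (model, regime), score
    return best
```

Because the predictor is a cheap forward pass, enumerating the full grid is feasible even for a large model zoo.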

3.2.1. ONLINE LEARNING FOR MODELING Acc(A_i^FT, D_val)

Recently, Li et al. (2020) suggested that the optimal hyperparameters for fine-tuning are highly related to data statistics such as domain similarity to ImageNet. Thus, we hypothesize that the final accuracy can be modeled from a group of predictor variables: model information, meta-data description, data statistics stat(D), domain similarity, and hyperparameters. These variables can easily be calculated ahead of fine-tuning. One could prepare offline training data by fine-tuning on different kinds of datasets, collecting the corresponding accuracies, and fitting a Multi-Layer Perceptron (MLP) regression offline. However, online learning is a more realistic setting for our problem: in a cloud computing service or a GPU cluster, a sequence of fine-tuning requests with different data arrives over time, and the predictive model can be further improved by the increasing diversity of data and requests. Using a fixed-depth MLP in the online setting may be problematic: shallow networks may be preferred for a small number of instances, while deeper models can achieve better performance when the sample size becomes larger. Inspired by Sahoo et al. (2017), we use an adaptive MLP regression that automatically adapts its model capacity from simple to complex over time. Given the input variables, the predicted accuracy is

Acc_P(A*_i, Regime_FT, stat(D)) = Σ_{l=1}^{L} α_l f_l, where f_l = h_l W_l, h_l = ReLU(Φ_l h_{l-1}), h_0 = [A*_i, Regime_FT, stat(D)].

The predicted accuracy is a weighted sum of the outputs f_l of the intermediate fully-connected layers h_l. W_l and Φ_l are the learnable weights of each fully-connected layer, and α_l is a weight assigning importance to each layer, with Σ_l α_l = 1.
Thus the predictor Acc_P can automatically adapt its model capacity from simple to complex along with the incoming tasks. The learnable weight α_l controls the importance of each intermediate layer, and the final predicted accuracy is the weighted sum of the f_l. The network is updated by Hedge Backpropagation (Freund & Schapire, 1999), in which α_l is updated based on the loss suffered by layer l as follows:

α_l ← α_l · β^{L(f_l, Acc_gt)},  W_l ← W_l − η α_l ∇_{W_l} L(f_l, Acc_gt),  Φ_l ← Φ_l − η Σ_{j=l}^{L} α_j ∇_{Φ_l} L(f_j, Acc_gt),  α_l ← α_l / Σ_j α_j,

where β ∈ (0, 1) is the discount rate, the weights α_l are re-normalized such that Σ_l α_l = 1, and η is the learning rate. Thus, during the online update, the model can choose an appropriate depth via α_l based on the performance of the output at each depth. By accumulating the online results, our generator gains experience that helps future prediction.

Generating Task-oriented Fine-tuning Schedule. Our schedule generator G then makes use of the performance predictor to find the best training regime: G(D, T_l) = argmax_{A ∈ Z_oo, Regime_FT ∈ S_FT} Acc_P(A, Regime_FT, stat(D)). Once the time constraint T_l is provided, the maximum number of iterations for each A*_i can be calculated from an offline step-time lookup table for Z_oo. The corresponding meta-data variables are then calculated for the incoming task. The optimal selection of model and hyperparameters is obtained by ranking the predicted accuracy of all possible grid combinations. The detailed algorithm can be found in Appendix B.6.

Faster Fine-tuning Model Zoo (ET-NAS). After identifying the A*_i from our search, we fully train those models on ImageNet following common practice. Note that all the models, including ET-NAS-L, can easily be pretrained on a regular 8-GPU node since our models are training-efficient. We release our models so that the public can reproduce our results from scratch and save energy/CO2/cost.
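The adaptive predictor with Hedge-style layer weighting described above can be sketched as follows. This is a minimal NumPy illustration under our own simplifications (the class name, initialization, and hyperparameter defaults are ours, and the gradient step for Φ_l is omitted for brevity); it is not the paper's released implementation:

```python
import numpy as np

class HedgeMLP:
    """Adaptive-depth MLP regressor (sketch): the prediction is the weighted
    sum of per-layer outputs f_l, and the layer weights alpha_l are updated
    multiplicatively by the Hedge rule, so the effective depth adapts online."""

    def __init__(self, in_dim, hidden=64, depth=4, lr=0.01, beta=0.99, seed=0):
        rng = np.random.default_rng(seed)
        dims = [in_dim] + [hidden] * depth
        self.Phi = [rng.normal(0, 0.1, (dims[i], dims[i + 1])) for i in range(depth)]
        self.W = [rng.normal(0, 0.1, (hidden,)) for _ in range(depth)]
        self.alpha = np.full(depth, 1.0 / depth)  # importance of each depth
        self.lr, self.beta = lr, beta

    def _forward(self, x):
        hs, h = [], np.asarray(x, dtype=float)
        for Phi in self.Phi:
            h = np.maximum(0.0, h @ Phi)          # h_l = ReLU(Phi_l h_{l-1})
            hs.append(h)
        f = np.array([h @ W for h, W in zip(hs, self.W)])  # f_l = h_l W_l
        return hs, f

    def predict(self, x):
        _, f = self._forward(x)
        return float(self.alpha @ f)               # sum_l alpha_l f_l

    def update(self, x, y):
        hs, f = self._forward(x)
        losses = (f - y) ** 2                      # per-layer squared loss
        for l in range(len(self.W)):               # W_l step, scaled by alpha_l
            self.W[l] -= self.lr * self.alpha[l] * 2.0 * (f[l] - y) * hs[l]
        self.alpha *= self.beta ** losses          # Hedge: discount by own loss
        self.alpha /= self.alpha.sum()             # renormalize to sum to 1
```

Layers whose outputs predict well keep large α_l, so early in the stream shallow outputs dominate and deeper ones take over as more tasks arrive.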
Due to the length of the paper, we put the detailed encodings and architectures of the final searched models in Appendix B.4. Interestingly, we found that smaller models should use simpler block structures while bigger models prefer complex blocks. Comparing our searched backbones to the conventional ResNet/ResNeXt, we find that the early stages in our models are very short, which is more efficient since feature maps in early stages are large and their computational cost is comparably high. This also verifies our findings in Appendix B.

4.3. RESULTS FOR ONLINE ADAPTIVE PREDICTOR Acc P

Experimental Settings. We evaluate our online algorithm on ten widely used image classification datasets that cover various fine-tuning tasks, as shown in Table 1. Five of them (in bold) are chosen as the online learning training set (meta-training datasets). 30K samples are collected by continually sampling a subset of each dataset and fine-tuning with randomized hyperparameters on it. The subsets vary in the number of classes and images. The variables in Section 3.2.1 are calculated accordingly, and the fine-tuning accuracy is evaluated on the test set. The 30K samples are then split into 24K meta-training and 6K meta-validation samples. The adaptive MLP regression in Eq. 3 is used to fit the data and predict Acc(A_i^FT, D_val). We use L = 10 with 64 units in each hidden layer, a learning rate of 0.01, and β = 0.99. As baselines, we also report the results of fixed MLPs with plain backpropagation and different depths (L = 3, 6, 10, 14). MAE (mean absolute error) and MSE (mean squared error) are used to measure the cumulative error over different segments of the task stream.

Comparison of online learning methods. The cumulative errors obtained by all baselines and the proposed method for predicting the fine-tuning accuracy are shown in Table 4. Our adaptive MLP with hedge backpropagation outperforms the fixed MLPs in terms of the cumulative error of the predicted accuracy: the adaptive depth allows faster convergence in the initial stage and strong predictive power in the later stage.

Ablative interpretation of performance superiority. Table 5 reports the average fine-tuning accuracy over tasks. Our NAS model zoo greatly increases the average fine-tuning accuracy from 77.17% to 87.45%, which is the main source of the performance superiority. Using our online adaptive scheduler instead of BOHB significantly reduces the computational cost (40x).
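The segment-wise cumulative error used above can be computed as in the sketch below (the function name and dictionary layout are ours, for illustration only):

```python
def cumulative_errors(preds, targets, segment=1000):
    """Cumulative MAE/MSE over a task stream, reported at the end of each
    segment, so that early-stage vs late-stage predictive quality of an
    online regressor can be compared."""
    report = []
    abs_sum = sq_sum = 0.0
    for i, (p, t) in enumerate(zip(preds, targets), 1):
        e = p - t
        abs_sum += abs(e)
        sq_sum += e * e
        if i % segment == 0:
            report.append({"n": i, "mae": abs_sum / i, "mse": sq_sum / i})
    return report
```

Reporting cumulative (rather than per-segment) error rewards predictors that are accurate from the very first tasks, which is exactly where the adaptive depth is claimed to help.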

5. CONCLUSION

We propose the first efficient task-oriented fine-tuning framework aimed at saving resources for GPU clusters and cloud computing. The joint NAS and online adaption strategy achieves much better fine-tuning results in terms of both accuracy and speed. The searched architectures are more training-efficient than very strong baselines such as RegNet and EfficientNet. Our experiments on multiple datasets show that NASOA achieves a 40x speed-up compared to BOHB. The proposed NASOA can be adapted to more tasks such as detection and segmentation in the future.

A PRELIMINARY EXPERIMENTS A.1 EXPERIMENTS SETTINGS

The preliminary experiments aim at figuring out which factors impact the speed and accuracy of fine-tuning. We fine-tune several ImageNet-pretrained backbones on various datasets as shown in Table 6 (right) and examine different hyperparameter settings by grid search: learning rate (0.0001, 0.001, 0.01, 0.1), frozen stages (-1, 0, 1, 2, 3), and frozen BN (-1, 0, 1, 2, 3). The training curves on CUB-Birds and Caltech101 are shown in the main text of this paper. We also compare the fine-tuning results over time with various networks on these datasets, as shown in Figure 7. On Caltech101, ResNet50 dominates the training curve from the very beginning. However, on other datasets, ResNet18 and ResNet34 can perform better than ResNet50 when the training time is short.
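The grid above can be enumerated as follows (a sketch; the key names are ours, and the combinations would each be paired with a backbone and dataset in the actual experiments):

```python
import itertools

# Hypothetical grid mirroring the settings described above.
grid = {
    "lr": [0.0001, 0.001, 0.01, 0.1],
    "frozen_stages": [-1, 0, 1, 2, 3],
    "frozen_bn": [-1, 0, 1, 2, 3],
}

# Cartesian product: 4 * 5 * 5 = 100 configurations per (backbone, dataset) pair.
configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
```

The size of this product (100 runs per backbone per dataset) is what makes exhaustive grid search impractical online and motivates the one-shot predictor.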

A.2 FINDINGS OF THE PRELIMINARY EXPERIMENTS

With those preliminary experiments, we summarize our findings as follows; some of them are also verified by existing works.
• Fine-tuning always performs better than training from scratch. As shown in Table 6, fine-tuning shows superior results to training from scratch in terms of both accuracy and training time on all datasets. This finding is also verified by Kornblith et al. (2019). Thus, fine-tuning is the most common way to train on a new dataset, and our framework generalizes to such applications.
• The learning rate and frozen stages should be optimized for each dataset. Table 7 shows that the optimal learning rate and frozen stages found by grid search differ across datasets. Figure 6 also shows that the number of frozen stages affects both training time and final accuracy. Guo et al. (2019) also showed that freezing different stages is crucial for fine-tuning. These two hyperparameters should therefore be optimized per dataset.
• The model matters most: a suitable model should be selected according to the task and time constraint. Figure 6 (right) suggests that always choosing the biggest model to fine-tune may not be optimal; a smaller model can be better than a bigger one if the training time is limited. It is also important to consider the training efficiency of the model, since a more efficient model converges faster under a limited GPU budget. For example, Figure 7 shows that under a short time constraint we should choose a smaller network, i.e., ResNet18 here. Thus, it is important to construct a training-efficient model zoo.
• BN running statistics should not be frozen during fine-tuning. We found that freezing BN has a very limited effect on training time (less than ±5%), while not freezing BN leads to better results on almost all datasets. Thus, BN is not frozen in any experiments for our NASOA.

B DETAILS OF THE OFFLINE NAS B.1 SEARCH SPACE ENCODINGS

The search space of our architectures is composed of block and macro levels: the former decides what a block is composed of (operators, numbers of channels, and skip connections), while the latter concerns how to combine blocks into a whole network, e.g., when to down-sample and where to change the number of channels. Block-level design. A block consists of at most three operators, each of which is chosen from 5 types and has 5 possible numbers of output channels. Each operator type is denoted by an op number, and the operator's output channels are decided by the ratio between it and the current block's channels. Details are shown in Table B.1.1. By default, there is a skip connection between the input and output of the block, which sums their values. In addition, at most 3 other skip connections are contained in a block, each of which either adds or concatenates the values it connects. Each operation is followed by a batch normalization layer, and after all skip connections are calculated, a ReLU layer is applied.

B.1.1 BLOCK-LEVEL ARCHITECTURE

Block-level encoding. The encoding of each block-level architecture is composed of two parts separated by '-', describing the operators and the skip connections respectively. In the first (operators) part, each operator is represented by two numbers: the op number and the ratio number (shown in Table B.1.1). As the output channels of the last operator always equal those of the current block, the ratio number of that operator is omitted. Therefore, the first part of the encoding of a block with n operators always has length 2n - 1. In the second (skip connections) part, every skip connection consists of one letter, 'a' for addition or 'c' for concatenation, and two numbers for place indices: n operators separate the block into n + 1 positions, indexed 0, 1, 2, . . . , n, so 'a01' means summing the values before and after the first operator. Since the skip connection between the beginning and end of the block always exists, it is not shown in the encoding. This part thus has length 3k - 3 (possibly 0) when there are k skip connections. Some encoding examples are shown in Figure 8.

B.2.1 NON-DOMINATED SORTING

Non-dominated sorting ranks the solutions in a population according to the Pareto dominance principle and plays a very important role in the selection operation of many multi-objective evolutionary algorithms. In non-dominated sorting, an individual A is said to dominate another individual B if and only if no objective of A is worse than the corresponding objective of B and at least one objective of A is better. Without loss of generality, assume the solutions of a population S can be assigned to K Pareto fronts F_i, i = 1, 2, . . . , K. Non-dominated sorting first selects all the non-dominated solutions from S and assigns them to F_1 (the rank-1 front); it then selects all the non-dominated solutions from the remaining solutions and assigns them to F_2 (the rank-2 front); it repeats this process until all individuals have been assigned to a Pareto front.
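The front-peeling procedure just described can be sketched directly (a simple O(n^2)-per-front illustration with objectives maximized; the fast bookkeeping of the full NSGA-II sort is omitted):

```python
def non_dominated_sort(points):
    """Peel Pareto fronts from a population: F1 is the non-dominated set of S,
    F2 the non-dominated set of the remainder, and so on until the
    population is exhausted. Objectives are maximized."""
    def dominates(a, b):
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    remaining = list(points)
    fronts = []
    while remaining:
        front = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining if q != p)]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts
```

Each solution's front index then serves as its rank in the selection step of the evolutionary algorithm.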

B.2.2 NSGA-II: ELITIST NON-DOMINATED SORTING GENETIC ALGORITHM

To solve the problem in Eq. 1, the Elitist Non-Dominated Sorting Genetic Algorithm (NSGA-II) (Deb et al., 2000) is adopted to optimize the Pareto front P_f, as shown in Algorithm 1. We choose this kind of sample-based NAS algorithm instead of the popular parameter-sharing NAS methods because we want to further analyze the sampled architectures and draw insights and conclusions about efficient training. The main idea of NSGA-II is to rank the sampled architectures by non-dominated sorting and preserve a group of elite architectures. A group of new architectures is then sampled and trained by mutating the current elite architectures on P_f. The algorithm can be parallelized over multiple computation nodes to lift P_f simultaneously. The mutations in the block-level search space include adding a new skip connection and modifying the current operations and ratios. The mutations in the macro-level search space include randomly adding or deleting one block in a stage, exchanging the position of a channel-doubling block with its neighbor, and modifying the base channels. NSGA-II is easy to implement, and we can easily monitor the improvement of each iteration. The stopping criterion depends on the time limit or the computation cost constraints.

In the block-level search phase, a proxy task of ImageNet is created by sampling a subset from its training set. This subset contains 100 classes, each with 500 training images and 100 validation images. We call this dataset ImageNet-100 in the rest of this paper. To avoid interference with the macro architecture, the macro-level architecture is fixed to be the same as that of ResNet50. Each model is trained on ImageNet-100 with a batch size of 32 for 90 epochs and a learning rate of 0.1, which takes 3-10 hours on a single NVIDIA Tesla-V100 GPU. We first run a random search, which uniformly samples all the valid blocks in the search space.
An Evolutionary Algorithm (EA) is then performed with three kinds of mutations: 1) replace one operator with another; 2) change the output channels of one layer; 3) add/remove/modify a skip connection. We keep updating the Pareto front between step time and accuracy during the whole process. As a result, 10 blocks are selected as candidates for the following rounds. In practice, during our search, the performance of early-stopped models aligns well with the fully-trained accuracy: the Spearman rank correlation over 103 architectures is ρ = 96.6%. Thus, early stopping reduces the search cost by around 90% while keeping our NAS effective.

We then search the macro-level architectures with the 10 blocks obtained from the block-level search. Random search is adopted first, where the number of blocks is chosen randomly between 10 and 50, and the first and last channels are drawn from {32, 64, 128} and {512, 1024, 2048}, respectively. EA search is then applied, with the allowed mutations: 1) add a '1'; 2) remove a '1'; 3) swap two different adjacent numbers. As in the block-level search, the Pareto front between step time and accuracy is kept updated. We further analyze the searched designs by regression: positive coefficients indicate a positive relationship, and the "P-Value" shows the significance of the variables. We summarize and highlight several noteworthy conclusions uncovered by our analysis:

B.3.2 MACRO-LEVEL SEARCH

• By observing the optimal A*_i, smaller models should use simpler blocks while bigger models prefer complex blocks. Simply increasing depth/width to expand the model as in Tan & Le (2019) may not be optimal.
• Adding additional skip connections decreases the training efficiency of the model (the coefficient is significantly negative). Using "add" to combine features is more efficient than "concat".
• "conv3x3, w group=4" is the best among the searched operations (coefficient 0.295). Separable conv3x3 is not efficient for training (coefficient -0.2).
• The first channel-doubling position should be closer to the beginning of the network, while the final channel-doubling position should be delayed toward the end of the network.
• Fewer blocks should be assigned to the first two stages, and more to the 3rd stage.

B.5 CO2 CONSUMPTION ANALYSIS

Fine-tuning from a pretrained ImageNet/language model is a de-facto practice in deep learning (CV/NLP). Our NASOA improves the efficiency of fine-tuning, which has the potential to greatly reduce the computational cost of GPU clusters and cloud computing. According to a recent study (Strubell et al., 2019b), developing and tuning one typical R&D project (Strubell et al., 2018) on Google Cloud costs about $250k, 82k kWh of electricity, and 123k lbs of CO2 emission, which equals the CO2 consumption of 62 flights between NY and SF. Among these jobs, 123 hyperparameter grid searches were performed for new datasets, resulting in 4789 jobs in total. We believe the proposed faster fine-tuning pipeline can save up to 40x of this computational cost. Furthermore, we have released all the searched efficient models to help the public skip the computation-heavy NAS stage and directly enjoy the benefit of our methods. In conclusion, our NASOA is meaningful for environmental protection and energy saving.
The shortest/longest time constraint (budget) is defined as the time of fine-tuning ResNet18 for 10 epochs / ResNet101 for 50 epochs, and the rest are equally divided in log-space: t_x = t_0 * (t_3/t_0)^(x/3), where x = 0, 1, 2, 3, t_0 is the time to train ResNet18 for 10 epochs, and t_3 is the time to train ResNet101 for 50 epochs. We compare the HPO settings only under the same maximum computational budget, equal to t_1 in Table 5 (left). For random search, we randomly sample candidates from the predefined search space until reaching the maximum computational budget. For BOHB, we use the open-source implementation at https://github.com/automl/HpBandSter, with random fraction = 0.3, percent of good observations = 15%, min budget = 25%, and max budget = 100% with respect to our maximum computational budget. We also run BOHB with 40x the computational cost of our proposed method with different model zoos. The results are presented in Section 4.4.
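Assuming the budgets divide the interval [t_0, t_3] equally in log-space (i.e., t_x = t_0 * (t_3/t_0)^(x/3) for four points), they can be computed as follows (function name ours):

```python
def time_budgets(t0, t3, n=4):
    """n budgets equally spaced in log-space between t0 and t3:
    t_x = t0 * (t3 / t0) ** (x / (n - 1)), x = 0, ..., n - 1."""
    return [t0 * (t3 / t0) ** (x / (n - 1)) for x in range(n)]
```

Equal spacing in log-space means each budget is a constant multiple of the previous one, so short and long constraints are covered at the same relative granularity.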



The efficient training model zoo (ET-NAS) has been released at: https://github.com/NAS-OA/NASOA



Figure 1: Overview of our NASOA. Our faster task-oriented online fine-tuning system has two parts: a) an offline NAS to generate an efficient training model zoo with good accuracy and training speed; b) an online fine-tuning regime generator to perform task-specific fine-tuning with a suitable model under the user's time constraint.

objective of efficient fine-tuning. On the other hand, few works discuss model selection and HPO for fine-tuning. Kornblith et al. (2019) find that the ImageNet accuracy and fine-tuning accuracy of different models are highly correlated. Li et al. (2020); Achille et al. (2019) suggest that the optimal hyperparameters and model for fine-tuning should be both dataset dependent and domain-similarity dependent (Cui et al., 2018). HyperStar (Mittal et al., 2020) is a concurrent HPO work demonstrating that a performance predictor can effectively generate good hyperparameters for a single model. However, these works do not give an explicit solution for how to perform fine-tuning in a more practical online scenario. In this work, we take advantage of online learning (Hoi et al., 2018; Sahoo et al., 2017) to build a schedule generator, which allows us to memorize the past training history and provide promising training regimes for newly arriving tasks on the fly. Besides, we introduce the NAS model zoo to further improve speed and performance.

max_A ( Acc(A), -T_s(A) )   subject to   T_s(A) ≤ T_m    (1)

Figure 3: Comparison of the training and inference efficiency of our searched models (ET-NAS) with SOTA models on ImageNet. Our searched models are considerably faster, e.g., ET-NAS-G trains 6x faster than RegNetY-16GF, and ET-NAS-I trains 1.5x faster than EfficientNetB3. Although our models are optimized for fast training, their inference speed is comparable to EfficientNet and better than the RegNet series.

Figure 5: Comparison of the final fine-tuning results under four time constraints for the testing datasets. Red square lines are the results of our NASOA in one-shot. The dots on each other solid line are the best performance achievable by any model in that series. The model and training regime generated by our NASOA can outperform the upper bound of other methods in most cases. Our methods improve accuracy by around 2.1%~7.4% over the upper bound of the RegNet/EfficientNet series on average.

Frozen stages/frozen BN = k means the 1st to k-th stage's parameters/BN statistics are not updated during training. The training settings mostly follow Li et al. (2020) and we report the Top-1 validation accuracy and training time. The detailed experimental settings and hyperparameters are listed as follows:

Comparing fine-tuning and training from scratch. We use the ResNet series (R-18 to R-50) to compare fine-tuning with training from scratch. Following Li et al. (2020), we train networks on the Flowers102, CUB-Birds, MIT67, and Caltech101 datasets for 600 epochs when training from scratch and 350 epochs when fine-tuning, to ensure all models converge on all datasets. We use the SGD optimizer with an initial learning rate of 0.01, weight decay 1e-4, and momentum 0.9. The learning rate is decreased by a factor of 10 at epochs 400 and 550 for training from scratch, and at epochs 150 and 250 for fine-tuning.

Optimal learning rate and frozen stage. We perform a simple grid search with ResNet50 on the Flowers102, Stanford-Car, CUB-Birds, MIT67, Stanford-Dog, and Caltech101 datasets to find the optimal learning rate and frozen stage for each dataset, using the default fine-tuning setting in Li et al. (2020). The hyperparameter ranges are: learning rate (0.1, 0.01, 0.001, 0.0001), frozen stage (-1, 0, 1, 2, 3).

Comparing different frozen stages and networks along time. We freeze different stages of ResNet50 to analyze the influence of the frozen stages on accuracy over training time on the Flowers102, Stanford-Car, CUB-Birds, MIT67, Stanford-Dog, and Caltech101 datasets. We pick

Figure 6: (Left) Fine-tuning ResNet101 with different weight-frozen stages. "Freeze: k" means stage 0 to stage k's parameters are not updated during training. The number of frozen stages affects both training time and accuracy, and the optimal frozen setting varies with the dataset. (Right) Comparison of accuracy/time for different fine-tuning models. Different models should be selected depending on the dataset and training constraints.
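The learning-rate/frozen-stage grid search described above can be sketched as follows. The `train_and_eval` stub is a hypothetical placeholder for actually fine-tuning ResNet50 on the target dataset and returning validation accuracy; here it is a toy scoring function so the loop is runnable:

```python
import itertools

# Hypothetical stub: in practice this would fine-tune ResNet50 with the given
# learning rate and frozen stage, then return validation accuracy. This toy
# version peaks at lr=0.01, frozen_stage=1 purely for illustration.
def train_and_eval(lr: float, frozen_stage: int) -> float:
    return -abs(lr - 0.01) * 100 - abs(frozen_stage - 1) * 0.5

# The hyperparameter ranges from the grid search described above.
LEARNING_RATES = [0.1, 0.01, 0.001, 0.0001]
FROZEN_STAGES = [-1, 0, 1, 2, 3]  # -1 means no stage is frozen

best = max(itertools.product(LEARNING_RATES, FROZEN_STAGES),
           key=lambda cfg: train_and_eval(*cfg))
print(best)  # (0.01, 1) for this toy objective
```

The full grid is only 4 x 5 = 20 configurations per dataset, which is why a plain exhaustive search is feasible in this preliminary experiment.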

Figure 7: Fine-tuning results over time with various networks on these datasets. It can be seen that if the time constraint is short, a smaller network should be chosen.

Figure 8: Block structure and two block samples. (a) shows a three-node graph. (b) is an example with encoding "031-", and (c) is "02031-a02".

Figure 9: Results of the block-level search on ImageNet-100. The y-axis denotes the accuracy and the x-axis denotes the latency. Blue dots are models searched in this step, while the red ones are the Basic Block with first channel 64, 128, 192; the Inverted Bottleneck Block (expansion rate 4) with first channel 64, 128; and the Bottleneck Block (expansion rate 4) with first channel 256, 320. It can be seen that our algorithm finds more efficient blocks in the block-level search.

Different from the previous phase, the whole ImageNet dataset is utilized for training. Each model is trained with a batch size of 1024 and a learning rate of 0.2 for 40 epochs.

B.4 ET-NAS: MODEL ZOO INFORMATION AND THEIR ENCODINGS

After the NAS process in Section B.3 is done, 12 models are selected as our fine-tuning model zoo ET-NAS. Details of these models are shown in Table B.4. The inference time and step time are measured in ms on a single Nvidia V100 with a batch size of 64. The resolution follows the standard ImageNet setting: 224x224. Observing the optimal models in the table, smaller models should use simpler blocks while bigger models prefer complex blocks.

DETAILED ALGORITHMS OF NASOA

Detailed algorithms of our Model Zoo (ET-NAS) search can be found in Algorithm 2. The pseudo code of online fine-tuning schedule generator training, prediction, and update can be found in Algorithm 3.

Algorithm 2 Efficient Training Model Zoo (ET-NAS) Creation
Input: Block/Macro Search Space S_i, S_a, Stop Criterion Γ, #Computation Nodes K, Sensitive Factor ε, #Block Architectures M, #Models in Model Zoo N.
Output: Final Model Zoo Z_oo
1: procedure BLOCKSEARCH(S_i, Γ, K, ε, M)
2:   P_f ← NSGA-II(Γ, S_i, K, ε)          ▷ Our modified NSGA-II, see Algorithm 1
3:   Cells ← MOSTCOMMON(P_f, M)           ▷ Most common M cells from P_f
4: end procedure
5: procedure MACROSEARCH(S_a, Γ, K, ε, N)
6:   P_f ← NSGA-II(Γ, S_a(Cells), K, ε)
7:   Z_oo ← NSGASORT(P_f, N, ε)           ▷ Choose models based on crowding-distance
8: end procedure

C IMPLEMENTATION DETAILS OF HPO METHODS

We use BOHB and random search in our experiments as the HPO baselines. As stated in Section 4.4,
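Both procedures in Algorithm 2 rely on NSGA-II's non-dominated sorting, which groups candidates into successive Pareto fronts over the two objectives (maximize accuracy, minimize training step time). A minimal sketch, with purely illustrative accuracy/time numbers rather than any measured ET-NAS results:

```python
def dominates(a, b):
    """a, b are (accuracy, step_time) pairs; maximize accuracy, minimize time.
    a dominates b if it is no worse in both objectives and better in at least one."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def pareto_fronts(points):
    """Sort points into successive Pareto fronts (front 0 = non-dominated),
    as NSGA-II does before crowding-distance selection. Returns index lists."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i]) for j in remaining)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

# (accuracy %, training step time ms) -- illustrative numbers only.
models = [(76.0, 120), (75.0, 90), (74.0, 200), (73.0, 95)]
print(pareto_fronts(models))  # [[0, 1], [2, 3]]
```

Models 0 and 1 form the first front because neither beats the other in both objectives, while models 2 and 3 are each dominated by some first-front model.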

Algorithm 3 Online Fine-Tuning Schedule Generator Training, Prediction, and Update
Input: Model Zoo Z_oo, Time Evaluator T_s, Acc Evaluator T_r, Hyper-parameter Space S_HP, Known Datasets D_old, New Dataset D_new, #Meta-data H_M, Time Constraint T_l, #Configurations H.
Output: Optimal Model A*, Hyper-parameters Regime*_FT, Predictor Acc_P
 1: procedure OFFLINETRAINING(Z_oo, T_r, S_HP, D_old, H_M)
 2:   MetaData ← ∅, Acc_P ← ADAPTIVEMLP(.)        ▷ Initialize default predictor
 3:   for D ∈ D_old, i ← 1 to H_M do
 4:     A, Regime_FT ← RANDOM(Z_oo, S_HP)          ▷ Randomly select from search space
 5:     Acc ← T_r(D, A, Regime_FT)                 ▷ Train with selected configuration
 6:     MetaData ← MetaData ∪ {(A, Regime_FT, Acc)} ▷ Add this result to meta-data
 7:   end for
 8:   Acc_P ← TRAIN(Acc_P, MetaData)               ▷ Train predictor with all meta-data
 9: end procedure
10: procedure ONLINEPREDICTION(Z_oo, D_new, Acc_P, T_l, T_s, H)

11:   MetaData ← ∅
12:   for i ← 1 to H do
13:     A ← RANDOM(Z_oo)
14:     Epoch ← T_l ÷ T_s(A, D_new)                ▷ Always choose the largest epoch count within T_l
15:     Regime_FT ← RANDOM(S_HP | Epoch)           ▷ Randomly select, conditioned on Epoch
16:     MetaData ← MetaData ∪ {(A, Regime_FT)}
17:   end for
18:   A*, Regime*_FT ← PREDICT(Acc_P, MetaData)    ▷ Choose the optimal from H configs
19:   Acc ← T_r(D_new, A*, Regime*_FT)
20:   Acc_P ← TRAIN(Acc_P, {A*, Regime*_FT, Acc})  ▷ Improve predictor with this meta-data
21: end procedure
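The online phase above can be sketched in a few lines. All names here (`step_time`, the model names, the toy predictor) are hypothetical stand-ins; a real predictor would be the trained adaptive MLP, and `step_time` would come from the time evaluator T_s:

```python
import random

def generate_regime(model_zoo, step_time, predict_acc, time_limit,
                    lr_space, n_configs=8, seed=0):
    """One-shot schedule generation, mirroring Algorithm 3's online phase:
    sample n_configs candidates, give each sampled model the largest epoch
    budget that fits the time limit, then keep the candidate with the
    highest predicted accuracy."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_configs):
        model = rng.choice(model_zoo)
        epochs = int(time_limit // step_time[model])  # largest epoch count within T_l
        lr = rng.choice(lr_space)
        candidates.append((model, epochs, lr))
    return max(candidates, key=predict_acc)

# Toy stand-ins: per-epoch times (seconds) and a predictor that favors epochs.
zoo = ["ET-NAS-A", "ET-NAS-D", "ET-NAS-G"]
step_time = {"ET-NAS-A": 30.0, "ET-NAS-D": 60.0, "ET-NAS-G": 120.0}
predictor = lambda cfg: cfg[1]  # hypothetical accuracy predictor
model, epochs, lr = generate_regime(zoo, step_time, predictor,
                                    time_limit=1800.0,
                                    lr_space=[0.1, 0.01, 0.001])
print(model, epochs, lr)
```

Note the key design choice from line 14: epochs are never searched over directly; each model always gets the maximum number of epochs its speed allows under T_l, so the search only has to rank (model, hyperparameter) pairs.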

al., 2018; Liu et al., 2018a;b; Radosavovic et al., 2019; Tan et al., 2019b; Real et al., 2019a; Tan & Le, 2019;



Block-level: types and channels of operations; number, place, and types of additional skip connections.
Macro-level: size, depth, # blocks in each stage, when to double the channels.
Our joint block/macro-level search space to find efficient training networks. The block-level search space covers many popular designs such as the ResNet, ResNeXt, and MobileNet blocks. The macro-level search space allows small adjustments of the network in each stage, so the resulting models are more flexible and efficient.

5, ×1, ×2, ×4. Note that it can cover many popular block designs such as the Bottleneck (He et al., 2016), ResNeXt (Xie et al., 2017), and MB block (Sandler et al., 2018). It consists of 5.4 × 10^6 unique blocks.

A * i name (one-hot dummy variable) ImageNet Acc. of the A *

Datasets and their statistics used in this paper. Datasets in bold are used to construct the online learning training set; the rest are used to test our NASOA. It is commonly believed that Aircrafts, Flowers102, and Blood-cell deviate from the ImageNet domain.

We conduct a complete preliminary experiment to justify our motivation and model settings; details can be found in Appendix A. According to our experiments, we find that for efficient fine-tuning, the model matters most. The suitable model should be selected according to the task and time constraints. Thus, constructing a model zoo with training-efficient models of various sizes and picking suitable models from it is a good solution for faster fine-tuning. We also verify some existing conclusions: fine-tuning performs better than training from scratch (Kornblith et al., 2019), which makes our topic important for efficient GPU training; the learning rate and frozen stage are crucial for fine-tuning (Guo et al., 2019) and need careful adjustment.

Online error rate of our method and a fixed MLP. Our adaptive MLP with hedge backpropagation is better in the online setting of predicting the fine-tuning accuracy.

Comparison of the final NASOA results with other methods. The model and training regime generated by our NASOA can outperform the upper bound of other methods in most cases. On average, our methods improve accuracy by around 2.1%/7.4% over the best model of the RegNet/EfficientNet series under various time constraints and tasks. It is noteworthy that our NASOA performs especially well under short time constraints, which demonstrates that our schedule generator is capable of providing both efficient and effective regimes for fast fine-tuning.

This ablation study reports the average fine-tuning accuracy over 5 tasks.

Comparison of Top-1 accuracy and training time (min) on different datasets. Compared to training from scratch, fine-tuning shows superior results in terms of both accuracy and training time.

When fine-tuning R50, the optimal learning rate and optimal frozen stage found by grid search differ across datasets and should be optimized individually.

B.1.2 MACRO-LEVEL ARCHITECTURE

Macro-level design. We only consider networks with exactly 4 stages. The first block of each stage (except Stage 1) reduces the resolution of both width and height by half, where the stride 2 is added to the first operator that is not conv1x1. Other blocks do not change the resolution. One block's output channel is either the same as, or an integer multiple of, its input channel.

Macro-level encoding. The 4 stages are separated by 3 '-' signs. Within each stage, every block is represented by an integer giving the ratio between the output and input channels of that block.

B.1.3 ENCODING AS A WHOLE

The whole backbone can thus be encoded by simply concatenating the block and macro encodings. The encoding of the whole network is formatted as: {Block ENCODING} {First CHANNEL} {Macro ENCODING}. Some common architectures, including ResNet and Wide ResNet, can be accurately represented by our encoding scheme, as shown in Table B.1.3.

The operations and channel changing ratios considered in our paper. Encoding for operators and ratios. c stands for the channels of the current block.

ResNets and Wide ResNets represented by our encoding scheme. The Basic Block is represented as '020-': its two operators are both conv3x3 (denoted '0'), the output channel of the first operator equals that of the block output (represented as '2'), and there is no skip connection other than the one connecting input and output. The macro-arch of ResNet 18 is encoded as '11-21-21-21': each stage contains two blocks, and the first block in Stages 2, 3, and 4 doubles the number of channels.
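The macro encoding just described is mechanical enough to decode in a few lines. A minimal sketch, assuming only the scheme above (stages split by '-', each digit the output/input channel ratio of one block):

```python
def decode_macro(encoding: str, first_channel: int) -> list:
    """Decode a macro-level string like '11-21-21-21': stages are separated
    by '-', and each digit is the output/input channel ratio of one block.
    Returns the output channel count after every block, in order."""
    channels, out = first_channel, []
    for stage in encoding.split("-"):
        for ratio in stage:
            channels *= int(ratio)
            out.append(channels)
    return out

# ResNet-18 macro encoding with first channel 64, per the example above:
print(decode_macro("11-21-21-21", 64))
# [64, 64, 128, 128, 256, 256, 512, 512]
```

The decoded channel progression (64 → 128 → 256 → 512, doubling at the start of Stages 2-4) matches the standard ResNet-18 stage widths, which is the consistency check the example in the text is making.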

Algorithm 1 Our modified NSGA-II Searching Algorithm
Input: Stop criterion, Search Space, number of computation nodes N.
Train Q_t on the N computation nodes and evaluate the accuracy of Q_t.

The searched optimal efficient training models "ET-NAS" found by our NAS search. 'Acc' means the accuracy evaluated on ImageNet; inference time and step time are measured in ms on a single Nvidia V100 with a batch size of 64. Observing the optimal models, smaller models should use simpler blocks while bigger models prefer complex blocks.


Figure 10 shows the comparison of our ET-NAS models with other SOTA ImageNet models. Inference time and training step time are measured in ms on a single Nvidia V100 with a batch size of 64. Our ET-NAS series shows superior performance compared to the RegNet and EfficientNet series. Compared to some EA-based NAS methods such as OFANet and AmoebaNet, our method is also efficient in terms of training. We find that there exists a performance ranking gap between inference time and training step time in Figure 10. This is mainly due to the depth and the dominant operation type of the models: deeper networks with separable convolutions, such as EfficientNet/MobileNet, have a larger training-step-time to inference-time ratio than our models, which are shallower and use more common convolutions.

B.4.1 WHAT MAKES A NETWORK EFFICIENT-TRAINING?

To answer this question, we first need to define a score for the efficiency of the searched models A. In MOOP, the goodness of a solution is determined by dominance. Thus, we can use the non-dominated sorting algorithm to sort the models A according to the Pareto dominance principle. Each architecture is assigned to one Pareto front, and the rank R_P of that Pareto front can be regarded as the goodness of a solution, in our case its efficiency. We then define the efficiency score of A as: s_E(A) = -(R_P(A) - mean(R_P)) / std(R_P). Since the Pareto-optimal front is the rank-1 Pareto front, a larger efficiency score s_E(A) means better efficiency.

We then perform a multivariate linear regression analysis on the searched models. According to our search space, the ordinal/nominal variables that describe a model are used as predictors to fit s_E(A). Table 11 shows the coefficients from the regression analysis on both block-level and macro-level
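The efficiency score is just a sign-flipped z-score of the Pareto rank. A minimal sketch, assuming the per-architecture ranks R_P have already been produced by non-dominated sorting (the ranks below are illustrative, not from the paper):

```python
import statistics

def efficiency_scores(pareto_ranks):
    """s_E(A) = -(R_P(A) - mean(R_P)) / std(R_P): architectures on earlier
    (lower-rank) Pareto fronts receive higher efficiency scores."""
    mean = statistics.mean(pareto_ranks)
    std = statistics.pstdev(pareto_ranks)
    return [-(r - mean) / std for r in pareto_ranks]

# Illustrative ranks from non-dominated sorting (rank 1 = Pareto-optimal front).
ranks = [1, 1, 2, 4]
scores = efficiency_scores(ranks)
print([round(s, 2) for s in scores])  # rank-1 models score highest
```

Standardizing the rank this way puts s_E(A) on a unitless scale, which is what lets the subsequent linear regression compare coefficients of block-level and macro-level predictors directly.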

