DIFFAUTOML: DIFFERENTIABLE JOINT OPTIMIZATION FOR EFFICIENT END-TO-END AUTOMATED MACHINE LEARNING

Abstract

The automated machine learning (AutoML) pipeline comprises several crucial components, such as automated data augmentation (DA), neural architecture search (NAS) and hyper-parameter optimization (HPO). Although many strategies have been developed for automating each component in isolation, joint optimization of these components remains challenging due to the greatly enlarged search dimension and the different input types required by each component. While conducting these components in sequence is often adopted as a workaround, it usually requires careful coordination by human experts and may lead to sub-optimal results. In parallel, the common NAS practice of searching for the optimal architecture first and then retraining it before deployment often suffers from a performance discrepancy of architectures between the search and retraining stages. An end-to-end solution that integrates the two stages and returns a ready-to-use model at the end of the search is therefore desirable. In view of these challenges, we propose a differentiable joint optimization solution for efficient end-to-end AutoML (DiffAutoML). Our method co-optimizes the neural architecture, training hyper-parameters and data augmentation policies in an end-to-end fashion without the need for model retraining. Experiments show that DiffAutoML achieves state-of-the-art results on ImageNet compared with end-to-end AutoML algorithms, and achieves superior performance with higher computational efficiency compared with multi-stage AutoML algorithms. To the best of our knowledge, we are the first to jointly optimize automated DA, NAS and HPO in an end-to-end manner without retraining.

1. INTRODUCTION

While deep learning has achieved remarkable progress in various tasks such as computer vision and natural language processing, designing and training a satisfactory deep model for a given task usually requires tremendous human involvement (He et al., 2016; Sandler et al., 2018). To alleviate this burden, numerous AutoML algorithms have been proposed in recent years to enable training a model from data automatically without human expertise, including automated data augmentation (DA), neural architecture search (NAS), and hyper-parameter optimization (HPO) (e.g., Chen et al., 2019; Cubuk et al., 2018; Mittal et al., 2020). These AutoML components are usually developed independently. However, implementing them for a specific task in separate stages not only suffers from low efficiency but also leads to sub-optimal results (Dai et al., 2020; Dong et al., 2020). How to achieve full-pipeline "from data to model" automation efficiently and effectively remains a challenging problem.

One main difficulty in achieving automated "from data to model" is how to combine different AutoML components (e.g., NAS and HPO) appropriately for a specific task. Optimizing these components jointly is an intuitive solution but usually suffers from an enormous, impractical search space. Dai et al. (2020) and Wang et al. (2020) introduced pre-trained predictors to achieve the joint optimization of NAS with HPO and with automated model compression, respectively. For a new task, however, pre-training such a predictor is usually burdensome. On the other hand, Dong et al. (2020) investigated the joint optimization of NAS and HPO via differentiable architecture and hyper-parameter search spaces. Automated DA is seldom considered in the joint optimization of AutoML components.
Nevertheless, our experimental results show that different data augmentation protocols may result in different optimal architectures (see Sec. 4.3 for more details). Based on these considerations, it is worthwhile to investigate the joint optimization of automated DA, NAS and HPO.

Another main challenge in achieving automated "from data to model" is end-to-end searching and training of models without parameter retraining. Even when considering only one AutoML component, many NAS algorithms require two stages, searching and retraining (e.g., Liu et al., 2018; Xie et al., 2019). Automated DA methods such as Lim et al. (2019) also need to retrain the model parameters after the DA policies are searched. In these cases, whether the searched architectures or DA policies perform well after retraining is questionable, due to the inevitable difference in training setup between the searching and retraining stages. Recently, Hu et al. (2020) developed a differentiable NAS method that performs NAS directly without parameter retraining. Other AutoML components, including HPO and automated DA, are seldom considered in task-specific end-to-end AutoML algorithms.

Considering the above challenges, we propose DiffAutoML, a differentiable joint optimization solution for efficient end-to-end AutoML. In DiffAutoML, end-to-end NAS is realized in a differentiable one-level manner with the help of stochastic architecture search and an approximation of the gradient with respect to the architecture parameters. Meanwhile, DA and HPO are treated as dynamic schedulers that adapt themselves to the updates of the network parameters and network architecture. Specifically, differentiable relaxation is applied to DA optimization in a one-level manner, while the hyper-gradient is used for HPO in a two-level manner. With this differentiable formulation, DiffAutoML can effectively handle the huge search space and low optimization efficiency that this joint optimization problem would otherwise incur.
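To make the stochastic-relaxation idea concrete, the following toy sketch (our own illustration, not the paper's implementation; `gumbel_softmax` and all variable names are hypothetical) shows how a Gumbel-softmax sample turns the discrete choice among candidate operations into a differentiable, near-one-hot mixture, so that architecture parameters can be optimized by gradient descent in a one-level manner:

```python
import math
import random

random.seed(0)  # fixed seed for a reproducible sample

def gumbel_softmax(logits, tau=1.0):
    """Draw a differentiable, near-one-hot sample over candidate ops.

    Adds standard Gumbel noise -log(-log(u)) to each logit and applies a
    temperature-scaled softmax; as tau -> 0 the sample approaches a hard
    (discrete) selection while remaining differentiable w.r.t. the logits.
    """
    noisy = [(l - math.log(-math.log(random.random()))) / tau for l in logits]
    m = max(noisy)                          # max-shift for numerical stability
    exps = [math.exp(y - m) for y in noisy]
    total = sum(exps)
    return [e / total for e in exps]

# Architecture parameters (logits) for 3 candidate operations on one edge.
alpha = [0.5, 1.5, -0.2]
weights = gumbel_softmax(alpha, tau=0.5)

# The edge output is the weighted mixture of the candidate ops' outputs
# (toy scalar outputs here); the gradient w.r.t. alpha stays well-defined.
op_outputs = [1.0, 2.0, 3.0]
mixed = sum(w * o for w, o in zip(weights, op_outputs))
```

In a full one-shot NAS setting each entry of `op_outputs` would be the feature map produced by one candidate operation, and one set of logits would exist per edge of the supernetwork.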
To summarize, our main contributions are as follows:
• We propose a well-defined AutoML problem, i.e., a task-specific end-to-end AutoML framework, which aims to achieve automated "from data to model" without human involvement.
• We are the first to jointly optimize three different AutoML components, namely automated DA, NAS and HPO, in a differentiable search space with high efficiency.
• Experimental results show that, compared with optimizing modules in sequence, one-stage DiffAutoML can effectively realize the co-optimization of different modules.
• Extensive experiments also reveal mutual influence among different AutoML components, i.e., changing the settings of one module may greatly influence the optimal results of another, justifying the necessity of end-to-end joint optimization.
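The two-level hyper-gradient update mentioned above can be illustrated with a deliberately tiny example (our own construction, not DiffAutoML's actual objective): a weight-decay coefficient `lam` is updated with the gradient of a validation loss propagated through one inner SGD step on the training loss, so the hyper-parameter adapts continuously alongside the model parameter rather than being tuned in a separate stage:

```python
# Inner objective: L_train(w) = 0.5*(w - 2)^2 + 0.5*lam*w^2
# Outer objective: L_val(w)   = 0.5*(w - 1)^2
# The inner minimizer is w = 2/(1 + lam), so the validation-optimal
# hyper-parameter is lam = 1 (giving w = 1).

def train_grad(w, lam):
    """Gradient of the training loss w.r.t. the model parameter w."""
    return (w - 2.0) + lam * w

w, lam = 0.0, 0.0          # model parameter and hyper-parameter
eta, eta_h = 0.1, 0.5      # inner and outer learning rates

for _ in range(500):
    # Inner (lower-level) step: ordinary SGD on the training loss.
    w_new = w - eta * train_grad(w, lam)

    # Outer (upper-level) step: hyper-gradient of the validation loss
    # through the inner update, dL_val/dlam = dL_val/dw_new * dw_new/dlam.
    dval_dw = w_new - 1.0
    dwnew_dlam = -eta * w                    # from the inner update rule
    hyper_grad = dval_dw * dwnew_dlam
    lam = max(0.0, lam - eta_h * hyper_grad)  # keep weight decay non-negative

    w = w_new

# Both levels converge jointly: w -> 1 and lam -> 1.
```

The same one-step hyper-gradient pattern extends to vector-valued parameters and hyper-parameters, which is the regime the two-level HPO scheduler in the introduction operates in.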

2. RELATED WORKS

In this section, we briefly introduce related AutoML algorithms, covering automated DA, NAS and HPO. A more detailed introduction is provided in Appendix D.

Automated Data Augmentation. Data augmentation is commonly used to increase data diversity and thus improve the generalization performance of deep learning models (DeVries & Taylor, 2017; Zhang et al., 2017; Yun et al., 2019). Several prior works automate the search for data augmentation policies, using methods ranging from reinforcement learning (Cubuk et al., 2018; Zhang et al., 2020), evolutionary algorithms (Ho et al., 2019), Bayesian optimization (Lim et al., 2019) and gradient-based approaches (Lin et al., 2019) to simple grid search (Cubuk et al., 2020). Cubuk et al. (2020) point out that the optimal data augmentation policy depends on the model size. This indicates that fixing an augmentation policy while searching for neural architectures may lead to sub-optimal solutions, thus motivating joint optimization.

Neural Architecture Search. Recent advances in NAS have demonstrated state-of-the-art performance, outperforming human experts' designs on a variety of tasks (Zoph & Le, 2017; Cai et al., 2018; Liu et al., 2018; Kandasamy et al., 2018; Pham et al., 2018; Real et al., 2018; Zoph et al., 2018; Ru et al., 2020b). In particular, the development of gradient-based one-shot methods (Liu et al., 2018; Chen et al., 2019; Guo et al., 2019; Xie et al., 2019; Hu et al., 2020) has significantly reduced the computational cost of NAS. To further improve search efficiency, most NAS methods search for architectures in a low-fidelity setup (e.g., fewer training epochs, smaller architectures) and retrain the optimal architecture using the full setup before deployment. This separation of search and evaluation, however, usually suffers from sub-optimal results (Hu et al., 2020).

End-to-end NAS

