LEARNING TO PREDICT PARAMETER FOR UNSEEN DATA

Anonymous authors
Paper under double-blind review

Abstract

Typical deep learning models depend heavily on large amounts of training data and resort to an iterative optimization algorithm (e.g., SGD or Adam) to learn network parameters, which makes the training process very time- and resource-intensive. In this paper, we propose a new training paradigm that formulates network parameter training as a prediction task: given a network architecture, we observe that there exist correlations between datasets and their corresponding optimal network parameters, and explore whether we can learn a hyper-mapping between them to capture these relations, such that we can directly predict the parameters of the network for a new dataset never seen during the training phase. To do this, we put forward a new hypernetwork that builds a mapping between datasets and their corresponding network parameters, and then predicts parameters for unseen data with only a single forward propagation of the hypernetwork. At its heart, our model benefits from a series of GRUs sharing weights to capture the dependencies among parameters of different layers in the network. Extensive experimental studies are performed, and the results validate that our proposed method achieves surprisingly good efficacy. For instance, it takes 119 GPU seconds to train ResNet-18 from scratch using Adam, and the network obtains a top-1 accuracy of 74.56%, while our method costs only 0.5 GPU seconds to predict the parameters of ResNet-18 and achieves comparable performance (73.33%), more than 200 times faster than the traditional training paradigm.

1. INTRODUCTION

Deep learning has yielded superior performance in a variety of fields in the past decade, such as computer vision (Kendall & Gal, 2017), natural language processing (DBL), and reinforcement learning (Zheng et al., 2018; Fujimoto et al., 2018). One of the keys to the success of deep learning stems from the huge amounts of training data used to learn a deep network. To optimize the network, the traditional training paradigm takes advantage of an iterative optimization algorithm (e.g., SGD) to train the model in a mini-batch manner, leading to huge time and resource consumption. For example, when training ResNet-101 (He et al., 2016) on the ImageNet (Deng et al., 2009) dataset, it often takes several days or weeks for the model to be well optimized, even with GPUs involved. Thus, how to accelerate the training process of the network is an emerging topic in deep learning.

Many methods for accelerating the training of deep neural networks have been proposed (Kingma & Ba, 2015; Ioffe & Szegedy, 2015; Chen et al., 2018). Representative works include optimization-based techniques that improve stochastic gradient descent (Kingma & Ba, 2015; Yong et al., 2020; Anil et al., 2020), normalization-based techniques (Ioffe & Szegedy, 2015; Salimans & Kingma, 2016; Ba et al., 2016), and parallel training techniques (Chen et al., 2018; Kim et al., 2019). Although these methods have shown promising potential to speed up the training of the network, they still follow the traditional iterative training paradigm. In this paper, we investigate a new training paradigm for deep neural networks.
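The hypernetwork described above is specified only at a high level in this section. As a minimal NumPy sketch of the shared-GRU idea, a single GRU cell can be unrolled once per target layer, mapping a dataset embedding to each layer's flattened parameter vector. All sizes, the random initialization, the mean-pooled dataset embedding, and the per-layer output heads below are illustrative assumptions, not the authors' actual design.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

HID = 32                      # GRU hidden size (assumption)
layer_sizes = [150, 60, 10]   # flattened parameter counts of a toy 3-layer target net

# Shared GRU weights: one cell reused for every target layer.
W = rng.normal(0, 0.1, (3, HID, HID))   # input-to-gate weights (update/reset/candidate)
U = rng.normal(0, 0.1, (3, HID, HID))   # hidden-to-gate weights
b = np.zeros((3, HID))
heads = [rng.normal(0, 0.1, (HID, s)) for s in layer_sizes]  # per-layer output heads

def gru_step(x, h):
    z = sigmoid(x @ W[0] + h @ U[0] + b[0])        # update gate
    r = sigmoid(x @ W[1] + h @ U[1] + b[1])        # reset gate
    n = np.tanh(x @ W[2] + (r * h) @ U[2] + b[2])  # candidate state
    return (1 - z) * h + z * n

def predict_parameters(dataset):
    """One forward pass: dataset -> embedding -> shared GRU unrolled over layers."""
    emb = dataset.mean(axis=0)            # crude set embedding of the whole dataset
    h = np.zeros(HID)
    params = []
    for head in heads:                    # one GRU step per target-network layer
        h = gru_step(emb, h)              # hidden state carries cross-layer dependency
        params.append(h @ head)           # map hidden state to that layer's weights
    return params

unseen = rng.normal(size=(64, HID))       # an "unseen dataset" of 64 feature vectors
theta = predict_parameters(unseen)
print([p.shape for p in theta])           # -> [(150,), (60,), (10,)]
```

Because the same GRU cell is stepped for every layer, the hidden state threads information from earlier layers' predicted parameters into later ones, which is one plausible reading of how weight sharing captures inter-layer dependencies.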
In contrast to previous works on accelerating the training of the network, we formulate the parameter training problem as a prediction task: given a network architecture, we attempt to learn a hyper-mapping between datasets and their corresponding optimal network parameters, and then leverage the hyper-mapping to directly predict the network parameters for a new dataset unseen during training. A basic

