RESPONSE MODELING OF HYPER-PARAMETERS FOR DEEP CONVOLUTIONAL NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

Hyper-Parameter Optimization (HPO) remains an open challenge in training Deep Neural Networks (DNN). Current methodologies fail to define an analytical response surface (Bergstra & Bengio, 2012) and remain a training bottleneck due to their use of additional internal hyper-parameters and lengthy manual evaluation cycles. We demonstrate that the low-rank factorization of the convolution weights of intermediate layers of a CNN can define an analytical response surface. We quantify how this surface acts as an auxiliary to optimizing training metrics. We introduce a fully autonomous dynamic tracking algorithm, autoHyper, that performs HPO on the order of hours for various datasets, including ImageNet, and requires no manual intervention or a priori knowledge. Using a single RTX2080Ti, our method selects a learning rate within 59 hours for AdaM (Kingma & Ba, 2014) on ResNet34 applied to ImageNet and improves top-1 test accuracy by 4.93% over the default learning rate. In contrast to previous methods, we empirically show that our algorithm and response surface generalize well across model, optimizer, and dataset selection, removing the need for extensive domain knowledge to achieve high levels of performance.

1. INTRODUCTION

The choice of Hyper-Parameters (HP) - such as initial learning rate, batch size, and weight decay - has been shown to greatly impact the generalization performance of Deep Neural Network (DNN) training (Keskar et al., 2017; Wilson et al., 2017; Li et al., 2019; Yu & Zhu, 2020). As the complexity of network architectures (from highly to lightly parameterized models) and training datasets (number of classes and samples) increases, manually tuning these parameters becomes a practically expensive and highly challenging task. The problem of Hyper-Parameter Optimization (HPO) is therefore central to developing highly efficient training workflows. Recent studies have shifted toward developing meaningful metric measures to explain effective HP tuning for DNN training. This is done in several behavioural studies, including changes in loss surfaces (Keskar et al., 2017), input perturbation analysis (Novak et al., 2018), and the energy norm of the covariance of gradients (Jastrzebski et al., 2020), to name a few. In fact, the abstract formulation of the HPO problem, as highlighted by Bergstra & Bengio (2012), can be modelled by

λ* ← arg min_{λ∈Λ} { E_{x∼M} [ L(x; A_λ(X^(train))) ] },   (1)

where X^(train) and x are random variables, modelled by some natural distribution M, that represent the train and validation data, respectively, L(•) is some expected loss, and A_λ(X^(train)) is a learning algorithm that maps X^(train) to some learned function, conditioned on the hyper-parameter set λ. Note that this learned function, denoted f(θ; λ; X^(train)), involves its own inner optimization problem. The HPO in (1) thus nests two optimization problems, where optimization over λ cannot occur until optimization over f(θ; λ; X^(train)) is complete. This imposes a heavy computational burden on HPO.
Bergstra & Bengio (2012) reduce this burden by instead solving

λ* ← arg min_{λ∈Λ} τ(λ),

where τ is called the hyper-parameter response function, or response surface, and Λ is some set of choices for λ (i.e. the search space). The goal of the response surface is to introduce an auxiliary function, parameterized by λ, whose minimization directly correlates with minimization of the objective function f(θ). The lack of an analytical model of the response surface has led to estimating it by (a) running multiple trials of different HP configurations (e.g. grid searching), using evaluation against validation sets as an estimate of τ; or (b) characterizing the distribution model of a configuration's performance metric (e.g. cross-validation performance) to numerically define a relationship between τ and λ. An important shift occurred when Bergstra & Bengio (2012) showed that random searching is more efficient than grid searching, particularly when optimizing high-dimensional HP sets. To mitigate the time complexity and increase overall performance, subsequent methods attempted to characterize the distribution model of such random configurations (Snoek et al., 2012; Eggensperger et al., 2013; Feurer et al., 2015a;b; Klein et al., 2017; Falkner et al., 2018; Sivaprasad et al., 2020). Importantly, the search ranges these methods operate over are generally chosen based on intuition, expert domain knowledge, or some form of a priori knowledge. In this paper, we employ the notion of knowledge gain (Hosseini & Plataniotis, 2020) to model a response surface - solvable with low computational overhead - and use it to perform automatic HPO that requires no a priori knowledge while still achieving competitive performance against baselines and existing state-of-the-art (SOTA) methods. Our goal is therefore a fully autonomous, domain-independent algorithm that achieves competitive (though not necessarily superior) performance.
We restrict our response surface to a single HP, namely the initial learning rate η, and support this choice by noting that the initial learning rate is the most sensitive and important HP with respect to final model performance (Goodfellow et al., 2016; Bergstra & Bengio, 2012; Yu & Zhu, 2020) (see also Figure 10 in Appendix C). We demonstrate how our method's optimization directly correlates with optimizing model performance. Finally, we provide empirical measures of the computational requirements of our algorithm and present thorough experiments on a diverse set of Convolutional Neural Network (CNN) architectures and Computer Vision datasets that demonstrate the generalization of our response surface. The main contributions of this work are as follows:

1. Inspired by knowledge gain, we introduce a well-defined, analytical response surface based on the low-rank factorization of convolution weights (Equation 5).
2. We propose autoHyper, a dynamic tracking algorithm with low computational overhead - on the order of minutes and hours - that optimizes our response surface to conduct HPO.
3. autoHyper requires no domain knowledge, human intuition, or manual intervention, and is not bound by a manually set search space, allowing for completely automatic setting of the initial learning rate; a novelty for deep learning practitioners.
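The precise response surface is defined later (Equation 5); as a hedged illustration of the core operation it builds on - low-rank factorization of intermediate convolution weights - the sketch below unfolds a 4-D kernel tensor into a matrix and measures its effective rank via SVD. The `low_rank_energy` metric here is illustrative only and is not the paper's knowledge-gain measure:

```python
import numpy as np

def unfold_conv_weights(W):
    """Unfold a (out_ch, in_ch, kH, kW) convolution tensor into a 2-D
    matrix so a low-rank factorization (here, SVD) can be applied."""
    out_ch = W.shape[0]
    return W.reshape(out_ch, -1)

def low_rank_energy(W, keep=0.9):
    """Illustrative proxy metric (not the paper's Equation 5): the
    fraction of singular values needed to retain `keep` of the spectral
    energy - a crude measure of how much structure a layer carries."""
    s = np.linalg.svd(unfold_conv_weights(W), compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    rank = int(np.searchsorted(energy, keep)) + 1
    return rank / len(s)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32, 3, 3))  # a typical intermediate conv layer
r = low_rank_energy(W)
```

A randomly initialized layer (as above) has a nearly flat spectrum and hence a high effective rank, while a highly redundant layer concentrates its energy in few singular values; tracking how this spectrum evolves during training is the intuition behind using it as an auxiliary signal for HPO.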

1.1. RELATED WORKS

We leave extensive analysis of related works to established surveys (Luo, 2016; He et al., 2019; Yu & Zhu, 2020) but present a general overview here. Grid searching and manual tuning techniques, which require extensive domain knowledge, trial various configurations and retain the best. Random search (Bergstra & Bengio, 2012) was proven to be more efficient, particularly in high-dimensional cases, but these methods suffer from redundancy and high computational overhead. Bayesian optimization techniques (Snoek et al., 2012; Eggensperger et al., 2013; Feurer et al., 2015a;b; Klein et al., 2017) attempt to characterize the distribution model of the random HP configurations. Unable to properly define the response surface τ, they resort to estimating it by fitting a Gaussian process over sampled points. Replacing the Gaussian process with a neural network to model generalization performance was shown to improve computational performance (Snoek et al., 2015; Springenberg et al., 2016). Furthermore, early stopping methods (Karnin et al., 2013; Li et al., 2017; 2018) spawn various configurations with equal resource distributions, successively stopping poor-performing configurations and reassigning resources dynamically. Population-based training (PBT) methods employ population control (Young et al., 2015; Jaderberg et al., 2017) or early stopping (Karnin et al., 2013; Li et al., 2017; 2018). However, these methods suffer from (a) additional internal HPs that require manual tuning facilitated by extensive domain knowledge; and (b) heavy computational overhead, whereby the optimization process takes days to weeks in most cases (Li et al., 2017; Falkner et al.

