RESPONSE MODELING OF HYPER-PARAMETERS FOR DEEP CONVOLUTIONAL NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

Hyper-Parameter Optimization (HPO) remains a necessary yet costly step in training Deep Neural Networks (DNN). Current methodologies fail to define an analytical response surface (Bergstra & Bengio, 2012) and remain a training bottleneck due to their use of additional internal hyper-parameters and lengthy manual evaluation cycles. We demonstrate that the low-rank factorization of the convolution weights of intermediate layers of a CNN can define an analytical response surface. We quantify how this surface acts as an auxiliary to optimizing training metrics. We introduce a fully autonomous dynamic tracking algorithm, autoHyper, that performs HPO on the order of hours for various datasets, including ImageNet, and requires no manual intervention or a priori knowledge. Our method, using a single RTX2080Ti, selects a learning rate for AdaM (Kingma & Ba, 2014) on ResNet34 applied to ImageNet within 59 hours and improves top-1 test accuracy by 4.93% over the default learning rate. In contrast to previous methods, we empirically show that our algorithm and response surface generalize well across model, optimizer, and dataset selection, removing the need for extensive domain knowledge to achieve high levels of performance.

1. INTRODUCTION

The choice of Hyper-Parameters (HP), such as the initial learning rate, batch size, and weight decay, has been shown to greatly impact the generalization performance of Deep Neural Network (DNN) training (Keskar et al., 2017; Wilson et al., 2017; Li et al., 2019; Yu & Zhu, 2020). As the complexity of network architectures (from highly to lightly parameterized models) and training datasets (number of classes and samples) increases, manually tuning these parameters becomes a practically expensive and highly challenging task. The problem of Hyper-Parameter Optimization (HPO) is therefore central to developing efficient training workflows. Recent studies have shifted toward developing meaningful metrics that explain effective HP tuning for DNN training. This is done in several behavioural studies, including analyses of changes in loss surfaces (Keskar et al., 2017), input perturbation analysis (Novak et al., 2018), and the energy norm of the covariance of gradients (Jastrzebski et al., 2020), to name a few. In fact, the abstract formulation of the HPO problem, as highlighted by Bergstra & Bengio (2012), can be modelled by

$$\lambda^{*} \leftarrow \underset{\lambda \in \Lambda}{\arg\min}\; \mathbb{E}_{x \sim \mathcal{M}}\left[\mathcal{L}\left(x; \mathcal{A}_{\lambda}(X^{(\text{train})})\right)\right], \tag{1}$$

where $X^{(\text{train})}$ and $x$ are random variables, modelled by some natural distribution $\mathcal{M}$, that represent the training and validation data, respectively, $\mathcal{L}(\cdot)$ is some expected loss, and $\mathcal{A}_{\lambda}(X^{(\text{train})})$ is a learning algorithm that maps $X^{(\text{train})}$ to some learned function, conditioned on the hyper-parameter set $\lambda$. Note that this learned function, denoted $f(\theta; \lambda; X^{(\text{train})})$, involves its own inner optimization problem. The HPO in (1) therefore comprises two nested optimization problems, where optimization over $\lambda$ cannot occur until the optimization over $f(\theta; \lambda; X^{(\text{train})})$ is complete. This imposes a heavy computational burden on HPO.
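The nested structure described above can be illustrated with a toy sketch (not the paper's method): the inner problem trains a model conditioned on a candidate hyper-parameter, and the outer problem scores each candidate by validation loss. The linear model, synthetic data, and candidate learning rates below are all illustrative assumptions chosen to keep the example self-contained.

```python
import numpy as np

# Toy illustration of the nested HPO problem in (1).
# Inner problem: learn theta for f(theta; lambda) on X_train via gradient descent.
# Outer problem: choose lambda (here, a learning rate) minimizing validation loss.

rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X_train = rng.normal(size=(100, 5))
y_train = X_train @ true_theta + 0.1 * rng.normal(size=100)
X_val = rng.normal(size=(50, 5))
y_val = X_val @ true_theta + 0.1 * rng.normal(size=50)

def train(lr, steps=200):
    """Inner optimization: fit a linear model, conditioned on the learning rate."""
    theta = np.zeros(5)
    for _ in range(steps):
        grad = 2.0 * X_train.T @ (X_train @ theta - y_train) / len(y_train)
        theta -= lr * grad
    return theta

def val_loss(theta):
    """Mean squared error on held-out validation data."""
    return float(np.mean((X_val @ theta - y_val) ** 2))

# Outer optimization: every candidate lambda requires a complete inner training
# run before it can be scored -- this is the computational burden the text notes.
candidates = [1e-4, 1e-3, 1e-2, 1e-1]
best = min(candidates, key=lambda lr: val_loss(train(lr)))
```

Even in this tiny setting, scoring four candidates costs four full training runs; for a DNN on ImageNet, each inner run takes hours to days, which motivates the response-surface shortcut discussed next.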
Bergstra & Bengio (2012) reduce this burden by instead solving

$$\lambda^{*} \leftarrow \underset{\lambda \in \Lambda}{\arg\min}\; \tau(\lambda), \tag{2}$$

where $\tau$ is called the hyper-parameter response function, or response surface, and $\Lambda$ is some set of choices for $\lambda$ (i.e. the search space). The goal of the response surface is to introduce an auxiliary objective for optimizing training metrics.

