RESPONSE MODELING OF HYPER-PARAMETERS FOR DEEP CONVOLUTIONAL NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

Hyper-Parameter Optimization (HPO) remains a training bottleneck for Deep Neural Networks (DNN). Current methodologies fail to define an analytical response surface (Bergstra & Bengio, 2012) and rely on additional internal hyper-parameters and lengthy manual evaluation cycles. We demonstrate that the low-rank factorization of the convolution weights of intermediate layers of a CNN defines an analytical response surface, and we quantify how this surface acts as an auxiliary to optimizing training metrics. We introduce a fully autonomous dynamic tracking algorithm, autoHyper, that performs HPO on the order of hours for various datasets, including ImageNet, and requires no manual intervention or a priori knowledge. Using a single RTX2080Ti, our method selects a learning rate within 59 hours for AdaM (Kingma & Ba, 2014) on ResNet34 applied to ImageNet and improves top-1 test accuracy by 4.93% over the default learning rate. In contrast to previous methods, we empirically show that our algorithm and response surface generalize well across model, optimizer, and dataset selection, removing the need for extensive domain knowledge to achieve high levels of performance.

1. INTRODUCTION

The choice of Hyper-Parameters (HP), such as initial learning rate, batch size, and weight decay, has been shown to greatly impact the generalization performance of Deep Neural Network (DNN) training (Keskar et al., 2017; Wilson et al., 2017; Li et al., 2019; Yu & Zhu, 2020). As the complexity of network architectures (from high to low parameterized models) and training datasets (class number and samples) increases, manually tuning these parameters becomes practically expensive and highly challenging. The problem of Hyper-Parameter Optimization (HPO) is therefore central to developing highly efficient training workflows. Recent studies shift focus toward developing a meaningful metric to explain effective HP tuning for DNN training. This is done in several behavioural studies, including changes in loss surfaces (Keskar et al., 2017), input perturbation analysis (Novak et al., 2018), and the energy norm of the covariance of gradients (Jastrzebski et al., 2020), to name a few. In fact, the abstract formulation of the HPO problem, as highlighted by Bergstra & Bengio (2012), can be modelled by

λ* ← argmin_{λ∈Λ} E_{x∼M}[L(x; A_λ(X^(train)))],   (1)

where X^(train) and x are random variables, modelled by some natural distribution M, that represent the train and validation data, respectively, L(·) is some expected loss, and A_λ(X^(train)) is a learning algorithm that maps X^(train) to some learned function, conditioned on the hyper-parameter set λ. Note that this learned function, denoted f(θ; λ; X^(train)), involves its own inner optimization problem. The HPO in (1) thus contains two nested optimization problems: optimization over λ cannot occur until optimization over f(θ; λ; X^(train)) is complete. This imposes a heavy computational burden on HPO.
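To make the nested structure of (1) concrete, the following toy Python sketch separates the outer search over λ from the inner training problem. The model, loss, and search space here are illustrative stand-ins, not the paper's implementation.

```python
# Schematic of the nested HPO problem: the outer loop searches over
# hyper-parameter configurations lambda, but every evaluation of the
# response requires first solving the (expensive) inner training
# problem A_lambda(X_train). All names are illustrative stand-ins.

def inner_training(lam, train_data):
    """Stand-in for A_lambda(X_train): 'trains' a model under lambda."""
    lr = lam["lr"]
    w = 1.0
    for _ in range(100):
        w -= lr * 2 * w          # gradient step on the toy loss f(w) = w^2
    return w

def validation_loss(w, val_data):
    """Stand-in for E_x[L(x; .)] evaluated on held-out data."""
    return w ** 2

def hpo(search_space, train_data, val_data):
    best_lam, best_loss = None, float("inf")
    for lam in search_space:                 # outer problem over lambda
        w = inner_training(lam, train_data)  # inner problem must finish first
        loss = validation_loss(w, val_data)
        if loss < best_loss:
            best_lam, best_loss = lam, loss
    return best_lam

space = [{"lr": lr} for lr in (1e-3, 1e-2, 1e-1)]
best = hpo(space, None, None)
```

Even in this toy setting, every candidate λ costs a full inner optimization, which is exactly the burden the response-surface approach tries to reduce.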
Bergstra & Bengio (2012) reduce this burden by instead solving

λ* ← argmin_{λ∈Λ} τ(λ),   (2)

where τ is called the hyper-parameter response function or response surface, and Λ is some set of choices for λ (i.e. the search space). The goal of the response surface is to introduce an auxiliary function, parameterized by λ, whose minimization is directly correlated with minimization of the objective function f(θ). Little advancement in analytical models of the response surface has led to estimating it by (a) running multiple trials of different HP configurations (e.g. grid searching), using evaluation against validation sets as an estimate of τ; or (b) characterizing the distribution model of a configuration's performance metric (e.g. cross-validation performances) to numerically define a relationship between τ and λ. An important shift occurred when Bergstra & Bengio (2012) showed that random searching is more efficient than grid searching, particularly when optimizing high-dimensional HP sets. To mitigate the time complexity and increase overall performance, subsequent methods attempted to characterize the distribution model for such random configurations (Snoek et al., 2012; Eggensperger et al., 2013; Feurer et al., 2015a;b; Klein et al., 2017; Falkner et al., 2018) or employed population control (Young et al., 2015; Jaderberg et al., 2017) or early stopping (Karnin et al., 2013; Li et al., 2017; 2018). However, these methods suffer from (a) additional internal HPs that require manual tuning facilitated by extensive domain knowledge; (b) heavy computational overhead, whereby the optimization process takes days to weeks in most cases (Li et al., 2017; Falkner et al., 2018; Yu & Zhu, 2020); (c) poor generalization across model selection, datasets, and general experimental configurations (e.g. optimizers); and (d) strong dependence on manually defined search ranges that heavily influence results (Choi et al., 2020; Sivaprasad et al., 2020).
Importantly, these ranges are generally chosen based on intuition, expert domain knowledge, or some form of a priori knowledge. In this paper, we employ the notion of knowledge gain (Hosseini & Plataniotis, 2020) to model a response surface, solvable with low computational overhead, and use it to perform automatic HPO that requires no a priori knowledge while still achieving competitive performance against baselines and existing state-of-the-art (SOTA) methods. Our goal is therefore an algorithm that is fully autonomous and domain independent and that achieves competitive (not necessarily superior) performance. We restrict our response surface to a single HP, namely the initial learning rate η, and support this choice by noting that the initial learning rate is the most sensitive and important HP for final model performance (Goodfellow et al., 2016; Bergstra & Bengio, 2012; Yu & Zhu, 2020) (see also Figure 10 in Appendix C). We demonstrate how our method's optimization directly correlates with optimizing model performance. Finally, we provide empirical measures of the computational requirements of our algorithm and present thorough experiments on a diverse set of Convolutional Neural Network (CNN) architectures and Computer Vision datasets that demonstrate the generalization of our response surface. The main contributions of this work are as follows:
1. Inspired by knowledge gain, we introduce a well-defined, analytical response surface using the low-rank factorization of convolution weights (Equation 5).
2. We propose a dynamic tracking algorithm of low computational overhead, on the order of minutes to hours, dubbed autoHyper, to optimize our response surface and conduct HPO.
3. This algorithm requires no domain knowledge, human intuition, or manual intervention, and is not bound by a manually set search space, allowing for completely automatic setting of the initial learning rate; a novelty for deep learning practitioners.

1.1. RELATED WORKS

We leave extensive analysis of related works to established surveys (Luo, 2016; He et al., 2019; Yu & Zhu, 2020) but present a general overview here. Grid searching and manual tuning techniques, which require extensive domain knowledge, trial various configurations and retain the best. Random search (Bergstra & Bengio, 2012) was proven more efficient, particularly in high-dimensional cases, but these methods suffer from redundancy and high computational overhead. Bayesian optimization techniques (Snoek et al., 2012; Eggensperger et al., 2013; Feurer et al., 2015a;b; Klein et al., 2017) attempt to characterize the distribution model of random HP configurations. They fail to properly define the response surface τ and resort to estimating it by rationalizing a Gaussian process over sampling points. Using neural networks instead of Gaussian processes to model generalization performance was shown to have better computational performance (Snoek et al., 2015; Springenberg et al., 2016). Furthermore, early-stopping methods (Karnin et al., 2013; Li et al., 2017; 2018) spawn various configurations with equal resource distributions, successively stopping poor-performing configurations and reassigning resources dynamically. Population-based training (PBT) methods (Young et al., 2015; Jaderberg et al., 2017) follow an evolutionary approach, spawning various experimental configurations and adapting poor-performing trials to warm restart with inherited learnable parameters and HPs. Other methods, such as orthogonal array tuning (Zhang et al., 2019), box-constrained derivative-free optimization (Diaz et al., 2017), a reverse dynamics algorithm for SGD optimization (Maclaurin et al., 2015), and hybrid methods (Swersky et al., 2013; 2014; Domhan et al., 2015; Falkner et al., 2018; Kandasamy et al., 2016), exist but demonstrate no significant benefits over the previous techniques.
Generally, each of these methods suffers from high computational overhead (on the order of days to weeks to converge) as well as additional internal HPs that heavily influence performance and generalization. In recent years, many Python libraries have been developed that implement these optimization methods (Bergstra et al., 2013; Kotthoff et al., 2017; Akiba et al., 2019).

2. A NEW RESPONSE SURFACE MODEL

In this section, we motivate and develop a new response surface model τ(λ) based on the low-rank factorization of convolutional weights in a CNN. Unlike the common approach of cross-validation performance measures, we define a new measure of the well-posedness of the intermediate layers of a CNN and relate this measure to the general performance of the network.

2.1. LOW-RANK MEASURE OF CONVOLUTION WEIGHTS

We first adopt the low-rank measure of convolution weights. The 4-D convolution weight tensor of each layer is unfolded along either its input or output channels (i.e. d ∈ {3, 4}) into a matrix W_d, which is then decomposed by a low-rank factorization as

W_d = Ŵ_d + E_d,   Ŵ_d = Û_d Σ̂_d V̂_d^T,

where we use the Variational Bayesian Matrix Factorization (VBMF) (Nakajima et al., 2013). Without this factorization, the presence of noise inhibits proper analysis: the noise E_d "captures" the randomness of initialization, and discarding it lets us better analyze the unfolded matrices and makes our response surface robust to the initialization method. Following the definition of Knowledge Gain (KG) from Hosseini & Plataniotis (2020), one can now define a metric for each network layer using the norm energy of the low-rank factorization as

G_d(Ŵ_d) = 1/(N_d · σ_1(Ŵ_d)) Σ_{i=1}^{N_d} σ_i(Ŵ_d),   (3)

where σ_1 ≥ σ_2 ≥ ... ≥ σ_{N_d} are the associated low-rank singular values in descending order and N_d = rank{Ŵ_d}. For more information on KG as well as its efficient algorithmic computation, we refer the reader to Hosseini & Plataniotis (2020). The metric defined in (3) is normalized such that G_d ∈ [0, 1] and can be used to probe CNN layers to monitor their efficiency in the carriage of information from input to output feature maps. We further parameterize the KG by the HP set λ, epoch t, and network layer ℓ as Ḡ_{d,t,ℓ}(λ). A perfect network and set of HPs would yield Ḡ_{d,T,ℓ}(λ) = 1 ∀ ℓ ∈ [L], where L is the number of layers in the network and T is the last epoch.
In this case, each network layer functions as a better autoencoder through iterative training, and the carriage of information throughout the network is maximized. Conversely, Ḡ_{d,T,ℓ}(λ) = 0 indicates that the information flow is very weak, such that the mapping is effectively random (E_d is maximized).
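The KG computation of Equation 3 can be sketched as follows. This is a simplified illustration only: a hard singular-value threshold stands in for the VBMF factorization the paper actually uses, and the function names, shapes, and tolerance are hypothetical.

```python
import numpy as np

def knowledge_gain(W4d, d=3, tol=1e-2):
    """Sketch of the KG metric G_d for one convolutional layer.

    W4d: conv weight tensor of shape (out, in, k, k).
    d=3 unfolds along input channels, d=4 along output channels.
    A hard singular-value threshold `tol` stands in for the VBMF
    low-rank factorization used in the paper.
    """
    out_c, in_c, kh, kw = W4d.shape
    if d == 3:
        W = W4d.transpose(1, 0, 2, 3).reshape(in_c, -1)
    else:
        W = W4d.reshape(out_c, -1)
    s = np.linalg.svd(W, compute_uv=False)   # descending singular values
    s_hat = s[s > tol]                       # crude "low-rank" spectrum
    if s_hat.size == 0:
        return 0.0                           # mapping is effectively noise
    # Normalized norm energy: G in [0, 1], with G = 1 iff every retained
    # singular value equals the largest one.
    return float(s_hat.sum() / (s_hat.size * s_hat[0]))

rng = np.random.default_rng(0)
# Rank-1 weights: only one nonzero singular value survives, so G = 1.
W_rank1 = np.einsum("i,j->ij", rng.standard_normal(8),
                    rng.standard_normal(8 * 9)).reshape(8, 8, 3, 3)
g = knowledge_gain(W_rank1, d=4)
```

The rank-1 example is a sanity check on the normalization: its retained spectrum has a single value, so the metric reaches its upper bound of 1.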

2.2. DEFINITION OF NEW RESPONSE FUNCTION

Interestingly, if Ḡ_{d,t,ℓ}(λ) = 0 in the early stages of training, it is evidence that no learning has occurred, indicative of an initial learning rate that is too small (no progress has been made to reduce the randomization). It then becomes useful to track the zero-valued KGs within a network's intermediate layers' input and output channels, which effectively becomes a measure of channel rank. We denote this rank per epoch as

Z_t(λ) ← 1/(2L) Σ_{ℓ∈[L]} Σ_{d∈{3,4}} 1[Ḡ_{d,t,ℓ}(λ) = 0],   (4)

where 1[·] is the indicator function and Z_t(λ) ∈ [0, 1). Finally, we define the average rank across T epochs as Z̄(λ) ← 1/T Σ_{t∈[T]} Z_t(λ); note that Z̄(λ) ∈ [0, 1). The average rank measure in (4) is therefore a normalized summation of the zero-valued singular values of the low-rank factorization across all layers' input and output unfolded tensor arrays. Relating to the notion of HPO and the response surface, we return to (1) and (2). Where previously the nature of these two optimization problems was poorly understood or practically unsolvable, we propose a new problem that is well understood and practically solvable (on the computational order of hours). To solve for the optimal HP set λ, we look at the following optimization problem:

λ* ← argmin_λ [1 - Z̄(λ)], subject to ||∇_λ Z̄(λ)||²₂ ≤ ε,   (5)

where ε ∈ [0, 1) is some small conditioning error. Returning to Equation 2, our response surface is therefore defined as τ = 1 - Z̄(λ) subject to ||∇_λ Z̄(λ)||²₂ ≤ ε. Note that we now simplify our problem to consider only λ = η. Also, we do not explicitly calculate the gradient ∇_λ Z̄(λ), but rather use this constraint to guide our dynamic tracking algorithm (see Section 3). To explain this somewhat counterintuitive formulation, we analyze Figures 1(a) & 2, which demonstrate that as learning rates increase, Z̄(η) plateaus to zero. Specifically, we notice that optimal learning rates lie toward the inception of the plateau of Z̄(η), before Z̄(η) = 0. This can also be seen in Figures 7 & 8 in Appendix A.
Therefore, we design our response surface such that the solution lies at the inception of the plateauing region (see the red dot in Figure 1(a)). The constraint ||∇_λ Z̄(λ)||²₂ ≤ ε promotes learning rates that lie along this plateauing region, while argmin_λ [1 - Z̄(λ)] promotes, among those learning rates, one that lies toward the inception of the region. More generally, a high Z̄(η) indicates a learning rate that is too small, as intermediate layers do not make sufficient progress in the early stages of learning and their KGs therefore remain very low. This observation follows Li et al. (2019), who show that larger initial learning rates result in better generalization performance; promoting these larger learning rates is achieved by the gradient constraint in Equation 5. Conversely, too large a learning rate can over-regularize a network; we therefore wish not to minimize Z̄(η) completely but to tune it to be sufficiently small, arriving at the inception of its plateau and providing a kick-start to training. This is achieved by the balance between the minimization and the constraint in Equation 5. Finally, we choose T = 5 in our experiments. The early phase of training has been shown to be an important criterion for optimal model performance (Jastrzebski et al., 2020; Li et al., 2019), and we therefore consider only 5 epochs to ensure optimization within this phase. Additionally, Figure 2 and Figures 7 & 8 in Appendix A show that Z̄(η) stabilizes after 5 epochs.
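A minimal sketch of the rank measures in Equation 4, assuming the per-layer KG values Ḡ_{d,t,ℓ} have already been computed (all names and the toy numbers are illustrative):

```python
def epoch_rank_measure(kg):
    """Z_t(lambda): fraction of zero-valued knowledge gains.

    kg: mapping {(layer, d): G_value} over all L layers and
    d in {3, 4}, i.e. 2L entries, each in [0, 1].
    """
    vals = list(kg.values())
    return sum(1.0 for g in vals if g == 0.0) / len(vals)

def average_rank_measure(per_epoch_kg):
    """Z-bar(lambda): mean of Z_t over the first T epochs (T = 5 in the paper)."""
    zs = [epoch_rank_measure(kg) for kg in per_epoch_kg]
    return sum(zs) / len(zs)

# Toy example: 2 layers, both unfold directions, over 2 epochs.
epoch1 = {(0, 3): 0.0, (0, 4): 0.0, (1, 3): 0.0, (1, 4): 0.0}  # no learning yet
epoch2 = {(0, 3): 0.4, (0, 4): 0.0, (1, 3): 0.3, (1, 4): 0.2}  # KG emerging
z_bar = average_rank_measure([epoch1, epoch2])                 # (1.0 + 0.25) / 2
```

A learning rate that is too small keeps many KGs at zero, pushing Z̄ toward 1; a sufficiently large one drives Z̄ toward the plateau near 0.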

2.3. EMPIRICAL EVALUATION OF NEW RESPONSE MODEL

Figure 1(a) visualizes results of our method for ResNet34 on CIFAR10 optimized with AdaM. The learning rate selected by our method results in the lowest training loss and highest top-1 training accuracy over the 5-epoch range we consider. We note the importance of stopping at the inception of the plateau region: even though higher learning rates, highlighted by the gray-scale lines/markers, result in potentially lower Z̄(η), they do not guarantee lower training losses or higher training accuracies. We conclude that our response surface is a strong auxiliary to training loss, and optimizing HPs relative to our response surface will in fact optimize toward training loss and accuracy. Figure 1(b) displays the histogram of Z̄(η) values over various experimental configurations (see subsection 4.1). Note the presence of a multimodal distribution that peaks at low, but importantly non-zero, Z̄(η). This visualizes our method's tendency to converge to a consistent range of Z̄(η) values irrespective of experimental configuration, showing the generalization of our response surface.

3. AUTOHYPER: AUTOMATIC TUNING OF INITIAL LEARNING RATE

The pseudo-code for autoHyper is presented in Algorithm 1. Per Equation 5, the optimal solution lies at the inception of the plateauing region of Z̄(η). To find this region, autoHyper first initializes a logarithmic grid space Ω of S = 20 steps, from η_min = 1 × 10⁻⁴ to η_max = 0.1. It iterates through each η_i ∈ Ω, i ∈ {0, ..., 19}, computing Z̄(η_i) until a plateau is reached. Once a plateau is reached, Ω is reset such that η_min and η_max "zoom" toward the learning rates at the plateau. This process repeats recursively until no significant difference between η_min and η_max remains. On average, this recursion occurs 3 to 4 times and, as shown in Figure 3, the number of trialled learning rates remains very low (10 to 30 on average). Importantly, our algorithm is not constrained by its initial grid space: as it tracks Z̄(η) over learning rates, it may grow and shrink its search bounds dynamically. This permits our method to be fully autonomous, requiring no human intuition in setting the initial grid space.
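The grid initialization and the "zoom" step can be sketched with numpy's geomspace. The exact zoom indices follow Algorithm 1; this standalone version is illustrative.

```python
import numpy as np

def make_grid(eta_min=1e-4, eta_max=0.1, steps=20):
    """Logarithmically spaced learning-rate grid, as autoHyper initializes."""
    return np.geomspace(eta_min, eta_max, steps)

def zoom(grid, i):
    """'Zoom' the search bounds toward the plateau found at grid index i."""
    return make_grid(grid[max(i - 1, 0)], grid[i], steps=len(grid))

omega = make_grid()          # initial Omega over [1e-4, 0.1]
omega2 = zoom(omega, 10)     # repeat until eta_max - eta_min < alpha
```

Each zoom narrows the bounds geometrically, so the recursion terminates once the bounds differ by less than the significance threshold α.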

Algorithm 1 autoHyper

Require: grid space function Ψ, learning rate significant-difference delta α = 5 × 10⁻⁵, and rate-of-change function ζ
 1: procedure RESTART( )
 2:   learning rate index i = 0
 3:   Ω = Ψ(η_min, η_max, S)
 4: end procedure
 5: RESTART( )
 6: while True do
 7:   if i = |Ω| then                // increase search space since no plateau has been found yet
 8:     set η_min = η_max, increase η_max and RESTART( )
 9:   end if
10:   if η_max - η_min < α then      // limits of search space are not significantly different
11:     return η_max
12:   end if
13:   with η_i ← Ω_i, train for 5 epochs
14:   compute rank Z̄(η_i) per Equation 4
15:   if Z̄(η_i) = 1.0 then           // all KG is zero-valued, η_min is too small
16:     increase η_min and RESTART( )
17:   end if
18:   if i = 0 and Z̄(η_i) < 0.5 and this is the first run then   // initial η_min is too large
19:     reduce η_min and RESTART( )
20:   else
21:     if Z̄(η_i) = 0.0 then         // all KG is non-zero; don't search further, perform "zoom"
22:       set η_min = Ω_{i-2}, η_max = Ω_i and RESTART( )
23:     end if
24:     compute rate of change of Z̄(η_i): δ ← ζ({Z̄(η_0), ..., Z̄(η_i)})
25:     if rate of change plateaus then   // perform "zoom"
26:       set η_min = Ω_{i-1}, η_max = Ω_i and RESTART( )
27:     end if
28:   end if
29:   i += 1
30: end while

Figure 3: Computational analysis of autoHyper over various setups (number of learning rates autoHyper trialled before converging). ResNet34 trials take 3 minutes, 3 minutes, 18 minutes, and 220 minutes for CIFAR10, CIFAR100, TinyImageNet, and ImageNet, respectively. ResNet18, ResNeXt50, and DenseNet121 trials take 2 minutes, 3 minutes, and 3 minutes, respectively, for both CIFAR10 and CIFAR100.

We note that the choices of Ψ and ζ in Algorithm 1 (the grid space and rate-of-change functions, respectively) have a significant effect on the final generated learning rate. We use numpy's geomspace function for the logarithmic grid spacing, and calculate the rate of change of Z̄(η) by taking the cumulative product of the sequence of Z̄(η_i), raised to the power of 0.8.
A logarithmic grid space is used because our response surface is more sensitive to smaller learning rates (see Figure 2). Note that the initial grid bounds are not important, as our algorithm can shift them dynamically; however, the successive increments between learning rates in the grid must be sufficiently small (on the order of 1 × 10⁻⁴, as in our initialization). Since our response surface itself is not guaranteed to monotonically decrease, as shown in Figure 4(b), we employ the cumulative product of the Z̄(η_i) as our rate-of-change function, which is monotonically decreasing (since Z̄(η_i) ∈ [0, 1)) and therefore always guaranteed to converge. The cumulative product, raised to the power of 0.8, is a good choice because (a) it is always guaranteed to plateau (since 0 ≤ Z̄(η_i) < 1), which removes the need for a manually tuned threshold, and (b) it dampens noise well. Because the cumulative product on its own degrades to zero rather quickly in many scenarios, raising it to the power of 0.8 regulates this effect. This power is technically tunable; however, we show empirically in Figures 4(a) and 4(b) that 0.8 behaves well for both stable and unstable architectures. Refer to Figure 9 in Appendix C for the performance results of EfficientNetB0.
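The rate-of-change function ζ described above can be sketched as follows. The plateau test here is illustrative, since the exact stopping tolerance is not spelled out in the text.

```python
def rate_of_change(z_history, power=0.8):
    """Cumulative product of the observed Z(eta_i), raised to 0.8.

    Because each Z is in [0, 1), the cumulative product is
    monotonically non-increasing and guaranteed to plateau; the 0.8
    exponent slows its otherwise rapid decay toward zero.
    """
    out, prod = [], 1.0
    for z in z_history:
        prod *= z
        out.append(prod ** power)
    return out

def has_plateaued(deltas, window=3, tol=1e-3):
    """Illustrative plateau test: the last `window` values nearly equal.
    (A stand-in; the paper's exact plateau criterion is not given here.)"""
    if len(deltas) < window:
        return False
    recent = deltas[-window:]
    return max(recent) - min(recent) < tol

zs = [0.9, 0.8, 0.5, 0.1, 0.05, 0.04]   # Z values observed along the grid
deltas = rate_of_change(zs)
```

Because the sequence is non-increasing by construction, the search never needs a hand-set decay threshold: it simply waits for successive values to stop changing.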

4. EXPERIMENTS

In this section, we conduct an ablative study of our algorithm autoHyper and response surface on various network architectures, trained using various optimizers and applied to image classification datasets. We also compare autoHyper against an existing SOTA method: Random Search.

4.1. EXPERIMENTAL SETUPS

Ablative study. All experiments are run using an RTX2080Ti, 3 cores of an Intel Xeon Gold 6246 processor, and 64 gigabytes of RAM. In our ablative study, we run experiments on CIFAR10 (Krizhevsky et al., 2009), CIFAR100 (Krizhevsky et al., 2009), TinyImageNet (Li et al.), and ImageNet (Russakovsky et al., 2015). On CIFAR10 and CIFAR100, we apply ResNet18 (He et al., 2015), ResNet34 (He et al., 2015), ResNeXt50 (Xie et al., 2016), and DenseNet121 (Huang et al., 2017). On TinyImageNet and ImageNet, we apply ResNet34. For architectures applied to CIFAR10 and CIFAR100, we train using AdaM (Kingma & Ba, 2014), AdaBound (Luo et al., 2019), AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012), AdaS (β = {0.8, 0.9, 0.95, 0.975}) (Hosseini & Plataniotis, 2020) (with early stop), and SLS (Vaswani et al., 2019). For ResNet34 applied to TinyImageNet, we train using AdaM, AdaBound, AdaGrad, and AdaS (β = {0.8, 0.9, 0.95, 0.975}). For ResNet34 applied to ImageNet, we train using AdaM and AdaS (β = {0.9, 0.95, 0.975}). Note that β in the AdaS variants is known as the gain factor and trades between performance and convergence rate: a low β converges faster but at the cost of performance, and vice versa. For each experimental setup, we ran one training sequence using the suggested learning rates (baseline) and one using the learning rates generated by autoHyper (see Tables 1-4 in Appendix B). Refer to Appendix B for additional details on the ablative study. Comparison to Random Search. Because Random Search generally requires iterative manual refinement and is highly sensitive to the manually set search space (Choi et al., 2020; Sivaprasad et al., 2020), we attempt a fair comparison by providing the same initial search space that autoHyper starts with and allowing the same number of trials that autoHyper takes (see Figure 3).
We note, however, that this provides Random Search with a slight advantage, since a priori knowledge of how many trials to consider is not provided to autoHyper. See Appendix D for additional details.

4.2. RESULTS

Consistent performance across architectures, datasets, and optimizers. We visualize the primary results of each experiment in Figure 5(a) (additional results are shown in Figure 11 in Appendix C). From these figures we see how well our method generalizes across experimental configurations by noting the consistency in top-1 test accuracies when training with the autoHyper-generated initial learning rate vs. the baseline. Further, where there is a loss of performance with an autoHyper-generated initial learning rate, this loss is < 1% in all experiments except three: on CIFAR100, the baselines of ResNeXt50 trained using AdaM, ResNeXt50 trained using RMSProp, and DenseNet121 trained using AdaBound achieve 1.2%, 2.28%, and 1.9% better top-1 test accuracy, respectively. Importantly, however, when accounting for the standard deviation of each of these results, only the DenseNet121 experiment maintains its > 1% improvement. Refer to Appendix C (Tables 5-8) for the tabulated results of each experiment. We also highlight how autoHyper is able to generalize across experimental setups whereas Random Search cannot (see Figure 5(b)). Because Random Search (and other SOTA methods) depend heavily on manually defined parameters such as epoch budget or initial search space, generalization across experimental setups is not feasible, as demonstrated here. In contrast, we have shown that autoHyper performs well regardless of the experimental setting, without manual intervention or refinement of any kind; a novelty. Fully autonomous discovery of optimal learning rates. Importantly, we highlight how our method is able to fully autonomously tune the initial learning rate and achieve very competitive performance.
Whereas traditional HPO methods (like Random Search) are extremely sensitive to the initialization of the search space, which normally requires extensive domain or a priori knowledge to set, our method is not: given a new dataset, model, and/or other hyper-parameter configuration, a practitioner could simply call our algorithm to automatically set a very competitive initial learning rate. If truly superior performance is required, one could perform more extensive HPO around the autoHyper-suggested learning rate, removing the need for iterative manual refinement. Superior performance over existing SOTA. As visualized in Figure 13, although Random Search proves competitive for AdaBound and AdaM applied on CIFAR10 and CIFAR100, it cannot find a competitive learning rate for AdaS (β = 0.9) or AdaGrad and performs worse for AdaM applied on TinyImageNet; AdaGrad applied on TinyImageNet loses as much as 4% top-1 test accuracy. This highlights how autoHyper automatically finds more competitive learning rates than Random Search given the same computational budget, and with significantly less manual intervention. These results additionally highlight why validation loss (or accuracy) cannot be used as a substitute for our metric (see Figure 14 in subsection D.2 for additional discussion). Drastic improvements for AdaM applied to TinyImageNet and ImageNet. ResNet34 trained using AdaM and applied to TinyImageNet and ImageNet achieves final improvements of 3.14% and 4.93% in top-1 test accuracy, respectively (see Table 5 in Appendix C). Such improvements come at minimal cost, requiring 13 trials (4 hours) and 16 trials (59 hours) for TinyImageNet and ImageNet, respectively (see Figure 3). Extremely fast and consistent convergence rates. We visualize the convergence rates of our method in Figure 3.
Importantly, we identify the consistency of required trials per optimizer across architecture and dataset selection, as well as the low convergence times. The longest convergence time for our method is for ResNet34 trained using AdaS (β = 0.95) applied to ImageNet, which took 31 trials and a total of 114 hours. Our method exhibits less consistent results when optimizing with SLS, as SLS tends to produce high Z̄(η) over multiple epochs and different learning rates; despite this, our method still converges and results in competitive performance. Performance improvement over increased epoch budgets. In reference to Table 6 in Appendix C, we highlight that, early in training, only 12 of the 29 experimental configurations trained with the autoHyper-suggested initial learning rate outperform the baseline. However, by the end of the fixed epoch budget, 18 of the 29 experiments outperform the baselines. Further, in many of the cases where baselines perform better, they remain within the standard deviation of trials and are therefore not significantly better. These results are surprising: our goal was to achieve competitive results in tuning the initial learning rate, yet in more than half the cases our method increases performance at a significantly smaller computational cost.

5. CONCLUSION

In this work we proposed an analytical response surface that acts as an auxiliary to training metrics and generalizes well. We proposed an algorithm, autoHyper, that solves this surface and quickly generates learning rates that are competitive with author-suggested and Random Search-suggested values, and in some cases drastically superior. We have therefore introduced an algorithm that performs HPO fully autonomously and extremely efficiently, resolving many of the drawbacks of current SOTA.
Figure 6 visualizes our response surface over a multi-dimensional HP set and highlights how our response surface remains solvable. We identify that autoHyper could be adapted to simultaneously optimize multiple HPs by tracking tangents across this surface toward the minimum, but leave this to future work.

B ADDITIONAL EXPERIMENTAL DETAILS FOR SUBSECTION 4.1

We note the additional configurations for our experimental setups. Datasets: For CIFAR10 and CIFAR100, we perform random cropping to 32 × 32 and random horizontal flipping on the training images and make no alterations to the test set. For TinyImageNet, we perform random resized cropping to 64 × 64 and random horizontal flipping on the training images, and center-crop resizing to 64 × 64 on the test set. For ImageNet, we follow He et al. (2015) and perform random resized cropping to 224 × 224 and random horizontal flipping on the training images, and 256 × 256 resizing with 224 × 224 center cropping on the test set. Additional configurations: Experiments on CIFAR10, CIFAR100, and TinyImageNet used mini-batch sizes of 128, and ImageNet experiments used mini-batch sizes of 256. For weight decay, 5 × 10⁻⁴ was used for AdaS variants on CIFAR10 and CIFAR100 experiments and 1 × 10⁻⁴ for all optimizers on TinyImageNet and ImageNet experiments, with the exception of AdaM, which used a weight decay of 7.8125 × 10⁻⁶. For AdaS variants, the momentum rate for momentum-SGD was set to 0.9. All other hyper-parameters for each respective optimizer remained at the defaults reported in their original papers. For CIFAR10 and CIFAR100, we use the manually tuned suggested learning rates reported in Wilson et al. (2017) for AdaM, RMSProp, and AdaGrad. For TinyImageNet and ImageNet, we use the suggested learning rates reported in each optimizer's respective paper. Refer to Tables 1-4 for exactly which learning rates were used, as well as the learning rates generated by autoHyper. CIFAR10, CIFAR100, and TinyImageNet experiments were trained for 5 trials with a maximum of 250 epochs, and ImageNet experiments were trained for 3 trials with a maximum of 150 epochs. Due to AdaS' stable test accuracy behaviour, as demonstrated by Hosseini & Plataniotis (2020), an early-stop criterion monitoring test accuracy was used for CIFAR10, CIFAR100, and ImageNet experiments.
For CIFAR10 and CIFAR100, we use a threshold of 1 × 10⁻³ for AdaS (β = 0.8) and 1 × 10⁻⁴ for AdaS (β = {0.9, 0.95}), with a patience window of 10 epochs. For ImageNet, we use a threshold of 1 × 10⁻⁴ for AdaS (β = {0.8, 0.9, 0.95}) and a patience window of 20 epochs. No early stop is used for AdaS (β = 0.975). Learning rates: We report every learning rate in Tables 1-4. Large deviation from the suggested initial learning rates. Referring to Tables 1-4 & 9, we notice variation in the autoHyper-suggested learning rates as compared to the author-suggested and Random Search-selected ones. The learning rates generated by our method reveal the "blind spots" that the authors originally overlooked in their HPO. Interestingly, however, we note the similarity in initial learning rate for ResNet34 trained using AdaM on CIFAR10, and can confirm this as an optimal learning rate. Importantly, our method is significantly quicker than the grid searching technique employed by Wilson et al. (2017). Observations on the generalization characteristics of optimizers. We additionally contribute that AdaS generalizes well. We also highlight SLS' multi-order-of-magnitude tolerance to the initial learning rate, as well as the stability of the AdaS optimizer, particularly when applied on TinyImageNet.

Figure 9: Test accuracy and training loss for EfficientNetB0 applied to CIFAR100. Importantly, EfficientNetB0 is an unstable network architecture in relation to our response surface, yet autoHyper is still able to converge and achieve competitive performance. As before, lines indicated by '*' (solid lines) are results using the initial learning rate suggested by autoHyper. These results visualize the inconsistency of tracking test loss as a metric to optimize final test accuracy.
This can be seen, for example, when looking at the test loss and test accuracy plots for AdaM, where the test loss for the baseline is lower than that of the autoHyper-suggested results but autoHyper achieves better test accuracy. These results also highlight the instability of tracking test accuracy or loss instead of the metric defined in Equation 5.

Under review as a conference paper at ICLR 2021

Table 5: List of ResNet34 top-1 test accuracies using various optimizers and applied to various datasets. Note that each cell reports the average accuracy over each experimental trial, sub-scripted by the standard deviation over the trials. Values on the left are from trials trained using the initial learning rate as generated by autoHyper. Values on the right are from trials trained using the suggested initial learning rate. Note a '*' indicates that early-stop has been activated.

Table 6: List of ResNet18 top-1 test accuracies using various optimizers and applied to various datasets. Note that each cell reports the average accuracy over each experimental trial, sub-scripted by the standard deviation over the trials. Values on the left are from trials trained using the initial learning rate as generated by autoHyper. Values on the right are from trials trained using the suggested initial learning rate. Note a '*' indicates that early-stop has been activated.

Table 7: List of ResNeXt50 top-1 test accuracies using various optimizers and applied to various datasets. Note that each cell reports the average accuracy over each experimental trial, sub-scripted by the standard deviation over the trials. Values on the left are from trials trained using the initial learning rate as generated by autoHyper. Values on the right are from trials trained using the suggested initial learning rate. Note a '*' indicates that early-stop has been activated.
Table 8: List of DenseNet121 top-1 test accuracies using various optimizers and applied to various datasets. Note that each cell reports the average accuracy over each experimental trial, sub-scripted by the standard deviation over the trials. Values on the left are from trials trained using the initial learning rate as generated by autoHyper. Values on the right are from trials trained using the suggested initial learning rate. Note a '*' indicates that early-stop has been activated. The search space is set to [1 × 10^-4, 0.1] and a log-uniform (see SciPy) distribution is used for sampling. This is motivated by the fact that autoHyper also uses a logarithmically-spaced grid. We note that initial tests against a uniform sampling distribution showed slightly worse results, as favouring smaller learning rates benefits the optimizers we considered. In keeping with autoHyper's design, the learning rate that resulted in the lowest training loss after 5 epochs was chosen. One could also track validation accuracy; however, as visualized in Figures 5(a) & 13, training loss is more stable for the datasets we are considering. This selection could be altered if the dataset being used exhibits different behaviour; however, this would be a manual alteration left to the practitioner, and one that does not need to be made if using autoHyper.
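The log-uniform sampling above can be reproduced with the standard library alone (a sketch standing in for scipy.stats.loguniform; the function name is illustrative, and the bounds match the search space quoted above):

```python
import math
import random

def sample_loguniform(low=1e-4, high=0.1, rng=random):
    """Draw eta such that log10(eta) is uniform on [log10(low), log10(high)]."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))
```

Because mass is uniform in log-space, roughly a third of the draws land in each decade of [10^-4, 10^-1], which is the favouring of smaller learning rates noted above; a plain uniform distribution on [10^-4, 0.1] would instead place about 90% of its draws above 10^-2.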

D.2 ADDITIONAL DISCUSSION AND RESULTS

Can you replace Z(η) with validation loss? Replacing Z(η) with validation loss does not work because greedily tracking validation loss (or accuracy) is neither stable nor domain-independent. Analyzing Figures 5(a) & 9, validation loss/accuracy is unstable since either the network (EfficientNetB0 in Figure 9) or the dataset (TinyImageNet in Figure 5(a)) results in unstable top-1 test accuracy/test loss scores that are unreliable to track. See also Figure 14, which demonstrates the inability to track validation loss/accuracy for various learning rates. Further, validation accuracy/loss can vary greatly based on initialization, whereas our method does not vary due to its low-rank factorization. Finally, our metric, Z(η), is guaranteed to be zero for a sufficiently small learning rate and maximized for large learning rates; therefore, we can always dynamically adapt our search range to the proper range. The same cannot be said for tracking validation accuracy/loss. Additionally, low validation loss does not correlate to high validation accuracy (an additional figure, Figure 12, in Appendix C shows this). One might then suggest taking the k best-performing learning rates based on validation accuracy/loss and focusing on those, but this requires manually defining k and then attempting manually defined Grid/Random Search refinements around those areas, with manual heuristics to indicate when to stop searching, whereas our method is fully automatic and self-converging. It would also take more time. In summation, existing SOTA methods like Random Search cannot compete with autoHyper when given similar budgets while minimizing manual intervention/refinement. This displays autoHyper's prominent feature of being a low-cost, fully automatic algorithm for searching for optimal hyper-parameter bounds (namely, in this work, the initial learning rate).
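The range-adaptation property claimed above can be illustrated with a toy example: a response that is zero for small η and saturates for large η can be bracketed by bisection in log-space. This is not autoHyper's algorithm, and synthetic_Z below is a synthetic stand-in for the actual low-rank response surface; it only demonstrates why monotonicity of Z(η) makes the search range self-adjusting.

```python
import math

def synthetic_Z(eta, knee=1e-3):
    # Monotone surrogate: approximately 0 well below the knee,
    # saturating toward 1 well above it.
    return eta / (eta + knee)

def bracket_knee(Z, low=1e-7, high=1.0, target=0.5, iters=40):
    """Bisect in log10-space for the eta where the monotone response crosses `target`."""
    lo, hi = math.log10(low), math.log10(high)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if Z(10 ** mid) < target:
            lo = mid  # response still near zero: learning rate too small
        else:
            hi = mid  # response saturating: learning rate too large
    return 10 ** ((lo + hi) / 2)
```

No such bisection is available when tracking validation accuracy/loss, since those curves are not monotone in η and vary with initialization.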
Future work could include using autoHyper to quickly discover this optimal hyper-parameter range, and then refining further with more extensive HPO methods and greater budgets if truly superior performance is required; this could alleviate much of the manual refinement that currently plagues existing SOTA methods.

Optimizer     CIFAR10                CIFAR100               TinyImageNet
AdaM          0.000330 / 0.000333    0.000125 / 0.000241    0.000175 / 0.0001965
AdaBound      0.000392 / 0.000347    0.000353 / 0.000347    0.000124 / 0.0000944
AdaGrad       0.001598 / 0.002861    0.001828 / 0.002236    0.000715 / 0.0022359
AdaS (0.9)    0.006779 / 0.012374    0.009252 / 0.010190    0.039480 / 0.0085857



2.1 KNOWLEDGE GAIN VIA LOW-RANK FACTORIZATION

Consider a four-way array (4-D tensor) W ∈ R^{N1×N2×N3×N4} as the convolution weights of an intermediate layer of a CNN (N1 and N2 being the height and width of the kernel, and N3 and N4 the input and output channel sizes, respectively). Under the convolution operation, the input feature maps F_I ∈ R^{W×H×N3} are mapped to an arbitrary output feature map F_O ∈ R^{W×H×N4} by

F_O(·, ·, l) = Σ_{k=1}^{N3} F_I(·, ·, k) * W(·, ·, k, l),  for l = 1, …, N4,

where * denotes 2-D convolution of a single input channel with the corresponding kernel slice.
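The low-rank view of these weights can be sketched by matricizing the 4-D tensor and inspecting its singular value spectrum. The unfolding choice (along the output-channel mode) and the rank tolerance below are illustrative assumptions, not necessarily the paper's exact factorization:

```python
import numpy as np

def effective_rank(weights, tol=1e-6):
    """weights: array of shape (N1, N2, N3, N4).
    Returns the numerical rank of the output-channel (mode-4) unfolding."""
    n1, n2, n3, n4 = weights.shape
    unfolded = weights.reshape(n1 * n2 * n3, n4)   # (N1*N2*N3) x N4 matrix
    s = np.linalg.svd(unfolded, compute_uv=False)  # singular values, descending
    return int(np.sum(s > tol * s[0]))             # count relative to the largest
```

A weight tensor built as an outer product of a 3-D filter bank and an output-channel vector has effective rank 1 under this unfolding, while generic random weights are full rank; the gap between the two is what a rank-based response can measure during training.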

Figure 1: (a) Z(η) as an auxiliary to training loss and training accuracy. The red dot indicates the learning rate chosen by our method, with corresponding metrics drawn in red lines. Lines and scatter points in grayscale show various trialled learning rates. (b) Distribution of Z(η) taken from the searching phase of autoHyper over numerous experimental configurations applied on CIFAR10, CIFAR100, TinyImageNet, and ImageNet (see subsection 4.1).

Figure 2: Z(η) for various learning rates using AdaBound, AdaGrad, and AdaM on ResNet34 applied to CIFAR10. The author-suggested initial learning rate is indicated by the red markers, and the autoHyper-suggested learning rate is indicated by the green markers.

Figure 4: Z(η) (blue) vs. cumprod(Z(η))^0.8 (orange) for (a) a stable and (b) an unstable architecture on CIFAR10.

Figure 5: Results of the (a) ablative study and (b) Random Search comparison experiments. Titles below plots indicate which experiment the plots above refer to. Legend labels marked by '*' (solid lines) show results for autoHyper-generated learning rates, and dotted lines are the (a) baselines and (b) Random Search results.

Figure 6: Z(η, γ) for learning rate (η) and mini-batch size (γ).

Figure 7: Rank(Z(η)) for various learning rates on VGG16 trained using AdaM, AdaGrad, AdaS (β = 0.8), and RMSProp, applied to CIFAR10. A fixed epoch budget of 20 was used. We highlight how, across these 20 epochs, very little progress is made beyond the first few epochs. It is from this analysis that we choose our epoch range of T = 5.
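The budget analysis in Figures 7 & 8 amounts to finding the epoch after which the tracked signal stops changing appreciably. A toy sketch of that saturation test follows; the tolerance and the test sequence are assumptions used only to mirror the "little progress after the first few epochs" observation, not the paper's procedure:

```python
def saturation_epoch(values, tol=0.02):
    """Given a per-epoch sequence of measurements, return the first 1-indexed
    epoch after which every relative step stays below `tol`."""
    for t in range(1, len(values)):
        steps = [abs(values[i] - values[i - 1]) / max(abs(values[i - 1]), 1e-12)
                 for i in range(t, len(values))]
        if all(step < tol for step in steps):
            return t
    return len(values)
```

Applied to a rank trace that climbs steeply and then flattens, this returns a small epoch index, consistent with truncating the search to T = 5 epochs.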

Figure 8: Z(η) for various learning rates on ResNet34 trained using AdaM, AdaGrad, AdaS (β = 0.8), and RMSProp, applied to CIFAR10. A fixed epoch budget of 20 was used. We highlight how, across these 20 epochs, very little progress is made beyond the first few epochs. It is from this analysis that we choose our epoch range of T = 5.

Figure 5(a) identifies the poor generalization characteristics of AdaM, AdaBound, AdaGrad, and SLS, where they consistently achieve low training losses but do not exhibit equivalently high top-1 test accuracies. We note that these observations are similar to those made by Wilson et al. (2017); Li et al. (2019); Jastrzebski et al. (2020).

Figure 10: Demonstration of the importance of the initial learning rate in the scheduled-learning-rate case, for ResNet18 applied to CIFAR10, using the Step-Decay method with step-size = 25 epochs and decay rate = 0.5. As before, the dotted line represents the baseline results, with initial learning rate = 0.1, and the solid line represents the results using autoHyper's suggested learning rate of 0.008585. These results highlight the importance of the initial learning rate, even when using a scheduled learning rate heuristic, and demonstrate the importance of the additional step-size and decay-rate hyper-parameters. Despite better initial performance from the autoHyper-suggested learning rate, the step-size and decay-rate choices cause the performance to plateau too early.
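The Step-Decay schedule of Figure 10 halves the learning rate every 25 epochs (mirroring, e.g., PyTorch's torch.optim.lr_scheduler.StepLR). Written out in plain Python as a sketch:

```python
def step_decay_lr(initial_lr, epoch, step_size=25, decay_rate=0.5):
    """Learning rate at a given (0-indexed) epoch under step decay:
    the rate is multiplied by `decay_rate` every `step_size` epochs."""
    return initial_lr * decay_rate ** (epoch // step_size)
```

With the autoHyper-suggested initial rate of 0.008585, epochs 0-24 train at 0.008585, epochs 25-49 at 0.0042925, and so on; the initial rate shifts the whole staircase, which is why it matters even under a schedule.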

Figure 11: Full results of CIFAR100, TinyImageNet, and ImageNet experiments. Top-1 test accuracy and training losses are reported for CIFAR100 experiments, and top-1 and top-5 test and training accuracies are reported for TinyImageNet and ImageNet. Titles below the figures indicate which experiments the figures above belong to. As before, lines indicated by the '*' (solid lines) are results using the initial learning rate as suggested by autoHyper.

Figure 12: Top-1 test accuracy and test loss for ResNet34 experiments applied to TinyImageNet. As before, lines indicated by the '*' (solid lines) are results using the initial learning rate as suggested by autoHyper. These results visualize the inconsistency of tracking test loss as a metric to optimize final test accuracy. This can be seen, for example, when looking at the test loss and test accuracy plots for AdaM, where the test loss for the baseline is lower than that of the autoHyper-suggested results but autoHyper achieves better test accuracy. These results also highlight the instability of tracking test accuracy or loss instead of the metric defined in Equation 5.

Figure 13: Top-1 test accuracy and train loss for ResNet34 applied to TinyImageNet, CIFAR10, and CIFAR100, using learning rates as suggested by either a Random Search (as described above) or autoHyper. Titles below plots indicate which experiment the plots above refer to. Legend labels marked by '*' (solid lines) show results for autoHyper-generated learning rates, and dotted lines are the Random Search results.

Figure 14: Visualization of the (a) validation loss and (b) validation accuracy for various learning rates for ResNet34 on various datasets. These figures demonstrate the inability to properly track these metrics as we do ours (i.e., Z(η)).

Learning rates for ResNet34 experiments. Left inner columns show suggested, right inner columns show autoHyper generated. Note that the superscript for AdaS-variants indicates their β gain factor.

Learning rates for ResNet18 experiments. Left inner columns show suggested, right inner columns show autoHyper generated.

Learning rates for ResNeXt50 experiments. Left inner columns show suggested, right inner columns show autoHyper generated.

Learning rates for DenseNet121 experiments. Left inner columns show suggested, right inner columns show autoHyper generated.

(Table fragment; the optimizer label of the first row and the column headers were lost in extraction. Each cell reports mean±std, with left/right values as defined in the captions of Tables 5-8.)

[label truncated]  …±1.30                 92.92±0.32/93.09±0.52  94.75±0.12/94.99±0.23  95.01±0.13/95.14±0.34  95.13±0.11/95.24±0.15
RMSProp            90.69±0.47/90.86±0.58  91.41±0.55/91.59±0.77  92.22±0.68/92.69±0.33  92.94±0.33/92.88±0.30  93.03±0.23/92.90±0.29
SLS                93.30±0.16/93.28±0.10  93.41±0.09/93.48±0.09  93.39±0.10/93.49±0.09  93.34±0.13/93.41±0.08  93.33±0.06/93.45±0.16
CIFAR100:
AdaBound           69.21±0.59/68.02±0.75  71.38±0.44/70.57±0.40  72.39±0.27/71.67±0.49  72.83±0.16/72.08±0.27  73.15±0.24/71.94±0.66
AdaGrad            65.35±0.46/65.15±0.27  66.72±0.34/66.58±0.38  67.03±0.50/66.91±0.31  67.16±0.50/66.97±0.25  67.43±0.59/67.02±0.23
AdaM               68.31±0.48/68.66±0.46  69.71±0.63/69.78±0.27  70.43±0.29/70.45±0.42  70.98±0.43/70.61±0.33  71.43±0.28/71.11±0.37

Learning rates for ResNet34 Random Search comparison. Left inner columns show Random Search generated, right inner columns show autoHyper generated.

