CONSTRUCTING MULTIPLE HIGH-QUALITY DEEP NEURAL NETWORKS: A TRUST-TECH-BASED APPROACH

Abstract

The success of deep neural networks relies heavily on efficient stochastic gradient descent-like training methods. However, these methods are sensitive to initialization and hyper-parameters. In this paper, a systematic method for finding multiple high-quality local optimal deep neural networks from a single training session, using the TRUST-TECH (TRansformation Under Stability-reTaining Equilibria Characterization) method, is introduced. To realize effective TRUST-TECH searches when training deep neural networks on large datasets, a dynamic search paths (DSP) method is proposed to provide improved search guidance within TRUST-TECH. The proposed DSP-TT method is implemented such that the computation graph remains constant during the search process, incurs only minor GPU memory overhead, and requires just one training session to obtain multiple local optimal solutions (LOSs). To take advantage of these LOSs, we also propose an improved ensemble method. Experiments on image classification datasets show that our method improves the testing performance by a substantial margin. Specifically, our fully-trained DSP-TT ResNet ensemble improves the SGD baseline by 15% (CIFAR10) and 13% (CIFAR100). Furthermore, our method shows several advantages over other ensembling methods.

1. INTRODUCTION

Due to the high redundancy of parameters in deep neural networks (DNNs), the number of local optima is huge and can grow exponentially with the dimensionality of the parameter space (Auer et al. (1996); Choromanska et al. (2015); Dauphin et al. (2014b)). It remains a challenging task to locate high-quality optimal solutions in the parameter space, i.e. solutions at which the model performs satisfactorily on both training and testing data. A popular metric for the quality of a local solution is its generalization capability, commonly defined as the gap between the training and testing performances (LeCun et al. (2015)). For deep neural networks with high expressivity, the training error is near zero, so it suffices to use the test error to represent the generalization gap. Generally, local solvers do not have a global view of the parameter space, so there is no guarantee that starting from a random initialization will locate a high-quality local optimal solution. On the other hand, one can apply a non-local solver in the parameter space to find multiple optimal solutions and select the high-quality ones. Furthermore, one can improve DNN performance by ensembling these high-quality solutions with high diversity. TRUST-TECH plays an important role in achieving this goal. In general, it computes high-quality optimal solutions for general nonlinear optimization problems; its theoretical foundations can be found in (Chiang & Chu (1996); Lee & Chiang (2004)). It helps local solvers escape from one local optimal solution (LOS) and search for other LOSs. It has been successfully applied to guiding the Expectation Maximization method to achieve higher performance (Reddy et al. (2008)), training ANNs (Chiang & Reddy (2007); Wang & Chiang (2011)), estimating finite mixture models (Reddy et al. (2008)), and solving optimal power flow problems (Chiang et al. (2009); Zhang & Chiang (2020)).
Additionally, it does not interfere with existing local or global solvers, but cooperates with them. TRUST-TECH efficiently searches the neighboring subspace of promising candidates for new LOSs in a tier-by-tier manner; eventually, a set of high-quality LOSs is found. The idea of the TRUST-TECH method is the following: on the loss surface of an optimization problem, each LOS has its own stability region. If one starts from a local optimum and tracks the loss values along a given direction, one will find an exit point where the loss starts to decrease steadily, meaning that another stability region, corresponding to a nearby LOS, has been reached. By following a trajectory in that stability region, another LOS is computed. We propose an optima-exploring algorithm designed for DNNs that finds high-quality local optima in a systematic way, and thereby forms optimal and robust ensembles. For a deep neural network, exit points can hardly be found by the original TRUST-TECH due to the huge dimensionality, so in this work we introduce the Dynamic Searching Paths (DSP) method in place of fixed directions. We set the search directions to be trainable parameters: after an exploration step forward along the current direction, we calibrate the direction using the current gradient. By doing so, the method benefits not only from the mature Stochastic Gradient Descent (SGD) training paradigm with powerful GPU acceleration, but also from the fact that exit points can be easily found. The overall DSP-TT method consists of four stages. First, we train the network using local solvers to get a tier-0 local optimal solution. Second, our proposed Dynamic Search Path TRUST-TECH (DSP-TT) method is called to find nearby solutions in a tier-by-tier manner. Third, a selection process is performed so that candidates with high quality are chosen. Finally, ensembles are built, with fine-tuning of selected member networks where necessary.
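As a toy illustration of the four stages (our sketch, not the paper's implementation), consider a one-dimensional double-well loss c(w) = (w² − 1)² with two LOSs at w = ±1. The helper names below are hypothetical, and the exit search uses a fixed direction for simplicity:

```python
import numpy as np

def c(w):                                   # toy double-well loss, minima at w = ±1
    return (w**2 - 1.0)**2

def grad(w):
    return 4.0 * w * (w**2 - 1.0)

def descend(w, lr=0.05, steps=300):         # local solver (plain gradient descent)
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Stage 1: local training gives a tier-0 solution.
tier0 = descend(0.5)                        # converges to +1

# Stage 2: search along a direction until the loss, after climbing,
# starts to decrease: that point is the exit point.
w, direction, prev = tier0, -1.0, c(tier0)
while True:
    w += 0.05 * direction
    if c(w) < prev:                         # loss turned downward: we have left
        break                               # the tier-0 stability region
    prev = c(w)

# Stage 3: descend from the exit point to a tier-1 solution.
tier1 = descend(w)                          # converges to -1

# Stage 4: keep distinct, high-quality solutions and ensemble them.
members = [tier0, tier1]
```

The loss climbs from w = +1 up to the decision boundary at w = 0 and falls beyond it, so the exit point lands just inside the stability region of the tier-1 solution at w = −1.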
To the best of our knowledge, this paper is the first to search for multiple solutions of deep neural networks in a systematic way. Our major contributions and highlights are summarized as follows: • We propose the Dynamic Searching Path (DSP) method that enables efficient exploration of high-dimensional parameter spaces. • We show that combining the TRUST-TECH method with DSP (DSP-TT) is effective in systematically finding multiple optimal solutions of deep neural networks. • We design and implement the algorithm efficiently, so that it obtains multiple local solutions within one training session with minor GPU memory overhead. • We develop DSP-TT Ensembles of the high-quality, high-diversity solutions found by DSP-TT to further improve DNN performance.

2. RELATED WORK

The synergy between massive numbers of parameters and nonlinear activations in deep neural networks leads to the existence of multiple LOSs for a model trained on a specific dataset. Experiments show that different initializations lead to different solutions of varying quality (Dauphin et al. (2014a)). Even with the same initialization, the network can converge to different solutions depending on the loss function and the solver (Im et al. (2016)). Many regularization techniques have therefore been proposed to push the network toward a better solution, some of which have proven useful and popular (Kingma & Ba (2015); Srivastava et al. (2014); Ioffe & Szegedy (2015)). However, it remains unclear how these regularized solutions compare to the global optimum. Some researchers focus on characterizing different local optima and investigating the internal relations among them. It is claimed in (Hochreiter & Schmidhuber (1997); Keskar et al. (2016)) that sharp minima prevent deep neural networks from generalizing well on the testing dataset. Later, Dinh et al. (2017) argued that the definition of flatness in (Keskar et al. (2016)) is problematic and presented an example where solutions with different geometries have similar test-time performance. Li et al. (2018) designed a new visualization method that re-established the correspondence between the sharpness of the minimizer and the generalization capability. On the other hand, some researchers apply meta-heuristic algorithms to obtain a better local minimizer (Gudise & Venayagamoorthy (2003); Zhang et al. (2007); Juang (2004); Leung et al. (2003)). However, these methods were either designed for obsolete toy models or evaluated on explicit benchmark objective functions with analytical forms for the global optimum, and therefore their effectiveness on deep architectures and large datasets seems unconvincing.
Moreover, the advantage of global searching ability seems to be crippled when it comes to deep neural networks, and the minimizers found are still local. Wang & Chiang (2011) used the TRUST-TECH (Chiang & Chu (1996); Lee & Chiang (2004); Chiang & Alberto (2015)) method to perform a systematic search for diversified minimizers and to obtain their ensembles. They implemented TRUST-TECH for training and constructing high-quality ensembles of artificial neural networks and showed that their method consistently outperforms other training methods. We generalize this method, tailoring it for deep architectures and for efficient cooperation with popular local solvers in deep learning.

3. TRUST-TECH METHOD FOR MULTIPLE OPTIMAL SOLUTIONS

3.1. OVERVIEW OF THE TRUST-TECH METHODOLOGY

Another category of methods has been developed in recent years for systematically computing a set of local optimal solutions in a deterministic manner. This family of methods is termed the TRUST-TECH methodology, standing for Transformation Under Stability-reTaining Equilibria Characterization. It is based on the following transformations: (i) the transformation of a local optimal solution (LOS) of a nonlinear optimization problem into a stable equilibrium point (SEP, Chiang & Chu (1996)) of a continuous nonlinear dynamical system; (ii) the transformation of the search space of the nonlinear optimization problem into the union of the closures of the stability regions of the SEPs. Hence, the optimization problem (i.e. the problem of finding LOSs) is transformed into the problem of finding SEPs, and we therefore use the terms LOS and SEP interchangeably in the following discussion. It will become clear that the stability regions of SEPs play an important role in finding these local optimal solutions. We note that, given a LOS, its first-tier LOSs are defined as those optimal solutions whose stability boundaries have a non-empty intersection with the stability boundary of the given LOS (Chiang & Chu (1996); Lee & Chiang (2004)). The definition of the stability boundary and its characterization can be found in Chiang & Fekih-Ahmed (1996). Similarly, its second-tier LOSs are defined as those optimal solutions whose stability boundaries have a non-empty intersection with the stability boundaries of its first-tier LOSs (Chiang & Chu (1996); Lee & Chiang (2004)); note that the stability boundaries of the second-tier LOSs thus intersect those of the first-tier LOSs, but not necessarily that of the original LOS. See fig. 1 for an illustration. We consider a general nonlinear unconstrained optimization problem defined as follows:

min_x c(x),     (1)

where c : D ⊂ R^n → R is assumed to be continuously differentiable and D is the set of feasible points (the search space).
A point x* ∈ D is called a local minimum if c(x*) ≤ c(x) for all x ∈ D with ‖x − x*‖ < σ for some σ > 0. To systematically search for multiple LOSs, a generalized negative-gradient system based on the objective of eq. (1) is constructed, described by

dx/dt = −grad_R c(x) = −R(x)^{−1} · ∇c(x) = f(x(t)),     (2)

where the state vector x(t) of this dynamic system belongs to the Euclidean space R^n and the function f : R^n → R^n satisfies the sufficient condition for the existence and uniqueness of solutions. R(x) is a positive-definite symmetric matrix (also known as the Riemannian metric) that generalizes various training algorithms. For example, if R(x) = I (identity), eq. (2) describes naive gradient descent. If R(x) = Jᵀ(x)J(x) (J is the Jacobian matrix), it is the Gauss-Newton method. If R(x) = Jᵀ(x)J(x) + µI, it becomes the Levenberg-Marquardt (LM) algorithm. The Theorem of Equilibrium Points and Local Optima (Lee & Chiang (2004)) establishes a key property of the gradient system (2): x is an (asymptotically) stable equilibrium point of (2) if and only if x is an isolated local minimum of (1). Hence, the task of finding the LOSs of (1) can be achieved by finding the corresponding SEPs of (2). In short, TRUST-TECH is a dynamical method designed to systematically compute multiple LOSs with the following features: (i) it is a systematic and deterministic method to escape from one LOS toward another; (ii) it finds multiple LOSs in a tier-by-tier manner (see fig. 1); and (iii) it has a solid theoretical foundation (Chiang & Chu (1996); Lee & Chiang (2004); Chiang & Alberto (2015); Zhang & Chiang (2020)). Another distinguishing feature of TRUST-TECH is its ability to guide a local method and/or a metaheuristic method for effective computation of a set of LOSs or even the global optimal solution.
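The metric choices above can be made concrete on a small least-squares problem. The sketch below (our illustration, not from the paper) discretizes eq. (2) for a linear residual r(x) = Ax − b, whose Jacobian is simply J = A, so the three named metrics reduce to gradient descent, Gauss-Newton, and Levenberg-Marquardt steps:

```python
import numpy as np

# Toy least-squares objective c(x) = 0.5 * ||A x - b||^2, so ∇c(x) = Jᵀ r(x)
# with J = A. The matrices and values here are arbitrary illustrations.
A = np.array([[2.0, 0.0], [1.0, 3.0], [0.0, 1.0]])
b = np.array([1.0, 2.0, 0.5])

def grad(x):
    return A.T @ (A @ x - b)

def step(x, metric, mu=1e-2, lr=0.1):
    J = A                                   # Jacobian of r(x) = A x - b
    if metric == "gd":                      # R = I: plain gradient descent
        R = np.eye(2) / lr                  # (step size folded into R)
    elif metric == "gauss-newton":          # R = JᵀJ
        R = J.T @ J
    elif metric == "lm":                    # R = JᵀJ + µI: Levenberg-Marquardt
        R = J.T @ J + mu * np.eye(2)
    return x - np.linalg.solve(R, grad(x))  # Euler step of dx/dt = -R⁻¹∇c

x = np.zeros(2)
for _ in range(100):
    x = step(x, "lm")

x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # reference solution
```

All three metrics converge to the same SEP of the dynamics (the least-squares solution); only the trajectory and convergence speed differ.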

3.2. SYSTEMATIC SEARCH ON DEEP NEURAL NETS

Our method follows the TRUST-TECH paradigm; the central idea is to find multiple LOSs in a tier-by-tier manner. On small-scale problems, applying fixed search directions has proven effective in practice (Chiang et al. (2009); Wang & Chiang (2011)). In those applications, either random directions or eigenvectors of the objective Hessian evaluated at each SEP were used. But in training deep neural networks, finding a proper direction is challenging: when searching along a random, fixed direction, the loss value can grow indefinitely. Another issue is the computational cost of the original TRUST-TECH. Specifically, it assumes cheap evaluation of the objective function at each search step. However, in supervised learning on a large dataset, only the empirical loss is accessible instead of the ground-truth objective function, which would mean evaluating the loss over the entire training set at every step, which is practically impossible under realistic computational restrictions. To tackle both challenges, we propose the Dynamic Search Path (DSP) method, which enables exploration of the parameter space of deep neural networks, and we apply DSP to serve as the search paths for TRUST-TECH (DSP-TT). Details are discussed in section 3.2.1. A one-tier DSP-TT method is shown in Algorithm 1.

3.2.1. OBTAINING DYNAMIC SEARCHING PATHS FOR TRUST-TECH

In this section, we detail how dynamic searching paths are constructed during the TRUST-TECH computation and how they help converge to nearby LOSs. The construction of a searching path is inspired by the mode connectivity proposed in Garipov et al. (2018), in which the authors found that low-loss "tunnels" exist between different LOSs. However, mode connectivity is used to find a high-accuracy pathway between two known local solutions by optimizing the expectation over a uniform distribution on the path, whereas our focus is finding proper searching directions toward nearby optima when starting from one LOS. They also claimed that a path φ_θ cannot be explicitly learned when given only one endpoint. We find that such a construction is possible. Specifically, by redesigning the objective and combining the exploration capability of TRUST-TECH with the exploitation capability of SGD, another local optimum can be found starting from one LOS. More generally, with such an optimization-based path-finding technique, one can find multiple tier-one SEPs (i.e. nearby LOSs) simultaneously. To do this, we first train the neural network to obtain a LOS ω0 ∈ R^{|net|}. Then we define a trainable search-direction vector d_i ∈ R^{|net|} (randomly initialized at d_0), so that during a TRUST-TECH search from ω0, the instantaneous parameter vector at step i is (ω0 + d_i). At each step i, DSP updates the direction as:

d_i = ρ1(i) · d_{i−1} + ρ2(i) · f(ω0 + d_{i−1}),     (3)

The first term, ρ1(i) · d_{i−1}, describes the original TRUST-TECH search with no direction calibration, where ρ1(i) ∈ (0, ρ_max] is the step-size schedule for the exploration phase, whose value increases from 0 to ρ_max with respect to step i. The second term is the DSP calibration term, where ρ2(i) is the step-size schedule for the calibration, and the descent direction comes from a general local descent solver, such as gradient descent or Newton's method.
Here f(·) is the dynamics defined in eq. (2), in which various local solvers can be applied. The stopping criterion for ρ1(i) is determined dynamically: the schedule ends when either an exit point is found or ρ_max is reached. The above steps repeat until (ω0 + d_i) converges to another LOS, which we call a tier-1 solution associated with ω0. An intuitive demonstration of this process is shown in Figure 2b. Our proposed scheme scales to multiple search directions starting from one LOS: we initialize multiple directions and, at each step, update each search direction via eq. (3). It is also worth noting that during training, the computation graph size is the same as that of the original network, since the algorithm only includes one direction in the computation graph at a time; thus, only minor memory overhead is introduced in practice. As for computational efficiency, our proposed method evaluates objectives on mini-batches instead of the entire dataset, and determines the stopping criterion by an exponential moving average of past batch evaluations. To further stabilize the stochastic behavior caused by mini-batch evaluations, buffer variables are used to determine the state transition between up (loss values are climbing inside the current stability region) and down (a nearby stability region has been reached and the loss decreases steadily). These measures resolve the efficiency issues of the original TRUST-TECH on large-scale supervised learning problems.
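A minimal numerical sketch of this search step is given below on an anisotropic two-dimensional double-well surface. It is our reading of eq. (3), not the paper's code: we evaluate the calibration term at the previous direction, assume a particular ramp for ρ1 and a constant ρ2, and use an exponential moving average (EMA) of the loss to stand in for the mini-batch smoothing and up/down state transition:

```python
import numpy as np

def c(w):    # anisotropic double-well: SEPs near (+1, 0) and (-1, 0)
    return (w[0]**2 - 1.0)**2 + 20.0 * w[1]**2

def f(w):    # eq. (2) dynamics with R = I, i.e. f(w) = -∇c(w)
    return -np.array([4.0 * w[0] * (w[0]**2 - 1.0), 40.0 * w[1]])

w0 = np.array([1.0, 0.0])              # tier-0 solution
d = np.array([-0.2, 0.05])             # trainable search direction (random init)

ema, state = c(w0 + d), "up"           # EMA stands in for mini-batch smoothing
for i in range(1, 5000):
    rho1 = min(1.0 + 5e-5 * i, 1.02)   # exploration schedule (assumed form)
    rho2 = 0.001                       # calibration step size (assumed constant)
    d = rho1 * d + rho2 * f(w0 + d)    # eq. (3): DSP direction update
    prev_ema = ema
    ema = 0.9 * ema + 0.1 * c(w0 + d)
    if state == "up" and ema > 0.5 and ema < prev_ema:
        state = "down"                 # loss peaked: a nearby stability region reached
        break

w = w0 + d                             # descend from the exit point to a tier-1 LOS
for _ in range(500):
    w = w + 0.02 * f(w)
```

The calibration term pulls the path back toward low-loss regions early on, while the growing ρ1 eventually pushes it across the stability boundary; once the smoothed loss turns downward, a plain descent run converges to the tier-1 solution near (−1, 0).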

4. DSP-TT ENSEMBLES OF DEEP NEURAL NETWORKS

When the training budget is less constrained, high quality for each tier-1 solution means having better test accuracy than the tier-0 network. On the other hand, when building ensembles with a limited budget, high quality is measured more by the diversity among the collection of local optimal neural networks found, which better serves the ensemble, than by any single network. With the proposed DSP-TT, a set of optimal network parameters with high accuracy can be found systematically given enough training budget; with a limited budget, the high diversity among tier-0 and tier-1 solutions still remedies the weaker performance of tier-1 networks when serving the ensemble. Individual quality is ensured because the starting point of any search is already a high-quality LOS obtained by mature SGD-based solvers, which is also shown in the experiments, especially in Table 4. As for diversity, SEPs (i.e. optimal parameter values, or LOSs) are separated by stability boundaries, since each SEP has its own stability region; to find multiple optimal solutions, parameters must be initialized in different stability regions. The proposed TRUST-TECH-based method is systematic in characterizing stability regions while other heuristic-based algorithms are not, and therefore the diversity among the SEPs found by our method is high due to the mutual exclusiveness of stability regions. These high-quality, high-diversity LOSs further motivate us to build ensembles to obtain a model more robust and accurate than any single member. First, a list of candidates with high quality and diversity is selected. After that, a fine-tuning process is executed if necessary to help any underfitted candidates toward better convergence. Since the searching process already integrates gradient information, the fine-tuning in our algorithm requires little effort.
In fact, as shown in the experiments, fine-tuning does not benefit the ensembling performance, so this procedure is skipped by default. Finally, we build the final ensembles by either averaging (regression) or voting on (classification) the outputs. More sophisticated ensembling methods could be applied here, but they are beyond the scope of this paper.
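The ensembling step can be sketched as follows (our minimal illustration; `member_probs` stands in for the softmax outputs of the selected tier-0/tier-1 networks on a batch of three examples):

```python
import numpy as np

# Hypothetical per-member class probabilities: 3 members, 3 examples, 2 classes.
member_probs = np.array([
    [[0.7, 0.3], [0.2, 0.8], [0.6, 0.4]],   # member 1
    [[0.6, 0.4], [0.1, 0.9], [0.4, 0.6]],   # member 2
    [[0.8, 0.2], [0.3, 0.7], [0.7, 0.3]],   # member 3
])

# Soft averaging: average the members' probabilities, then pick the argmax
# (for regression, the raw outputs would be averaged instead).
avg = member_probs.mean(axis=0)
soft_pred = avg.argmax(axis=1)

# Hard voting: each member casts its own predicted label; majority wins.
votes = member_probs.argmax(axis=2)                               # (member, example)
hard_pred = np.array([np.bincount(v).argmax() for v in votes.T])  # per example
```

On this toy batch, both combination rules agree; with more diverse members they can differ, which is one reason the choice of combination rule is usually validated empirically.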

5. EXPERIMENTS

Exit-point verification is run using MLPs on the UCI-wine and MNIST datasets. Further experiments are run using VGG-16 (Simonyan & Zisserman (2014)), DenseNet-100-BC (Huang et al. (2017b)) and ResNet-164 (He et al. (2016)) on the CIFAR datasets. The program is developed on the PyTorch framework. Each configuration is run multiple times and the average performance is reported. For DSP-TT ensembles, exit points are usually found in negligible time (e.g. around 1 min on CIFAR, compared to a full training that takes hours), so 50 epochs are given to one tier of DSP-TT search with all exit points, while the rest of the budget is given to tier-0 training.

5.1. EXIT POINT VERIFICATION

Exit points play an important role in the TRUST-TECH method for finding multiple local optimal solutions. Figures 3a and 3b show the loss change with respect to the DSP-TT search iterations along one search path, for the full-gradient and batched versions respectively. The loss value first goes up, escaping from the tier-0 solution. At a certain point, the loss reaches a local maximum and then goes down, suggesting that the search path hits the stability boundary and enters a nearby stability region. To further verify that an exit point lies on the stability boundary, we perform the following visualization: several points along the search path near the exit point are sampled; then a forward integration (gradient descent with a small step size) is executed starting from each sample, and the trajectories are plotted by projecting the parameter space onto two random orthogonal directions. Due to the high computational cost, this process is only simulated using a 1-layer MLP with 5 neurons (61 parameters) trained on the UCI-wine dataset. Each integration is executed for 50,000 steps with a step size of 0.01. As shown in fig. 3c, the points before (red) and after (blue) the exit converge to two different points in the 2D projection space. We also observe that the cosine between the initial and updated search directions remains close to 1.0 throughout the search, suggesting that the gradients only calibrate a few extreme dimensions of the initial direction and do not interfere with the remaining majority of dimensions.

5.2. TIER ENSEMBLE COMPARISON

The proposed DSP-TT computes 5 tier-one LOSs (from the tier-zero LOS) and 5 tier-two LOSs (from the best tier-one LOS). From these, we form the following ensembles: Tier-1 (5 tier-one LOSs); Tier-1-tune (5 tier-one LOSs, each with fine-tuning); Tier-0-1 (1 tier-zero and 5 tier-one LOSs); Tier-0-1-2 (1 tier-zero, 5 tier-one and 5 tier-two LOSs). We use SGD as the local solver and DenseNet as the architecture. As shown in table 1, all DSP-TT-enhanced ensembles outperform the baseline model.
Although Tier-0-1-2 mostly performs best, Tier-0-1 is sufficient in practice for efficiency, and therefore we use Tier-0-1 in all the following experiments. From table 1, we also find that although fine-tuning an individual improves its own performance, it does not help much on the ensemble performance. This shows that the diversity introduced by our algorithm dominates the fine-tuning improvements of individuals, so in later experiments all fine-tunings are omitted.

5.3. COMPARISON WITH OTHER ENSEMBLE ALGORITHMS

In this section, we compare our method with other popular ensemble methods (Huang et al. (2017a); Garipov et al. (2018)) in deep learning. Results are shown in tables 2 and 3. Besides accuracy, member diversity is another major quality for ensembles. Ideally, we want all members to perform relatively well, while each member learns some knowledge that differs from that of the others. We measure the output correlation (Huang et al. (2017a)) and the parameter distance (Garipov et al. (2018)). On the hardware side, the DSP-TT search process introduces minor overhead in GPU memory usage. Specifically, baseline training of ResNet-164 takes 3819 MB of GPU memory, which increases to 3921 MB during DSP-TT search. This supports our earlier claim that DSP-TT does not increase the size of the computation graph and adds only a little overhead.
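The two diversity measures can be computed as follows (our formulation of the cited metrics; `outputs` is a hypothetical array of per-member scores on held-out samples):

```python
import numpy as np

def mean_output_correlation(outputs):
    """Mean pairwise Pearson correlation among members.

    outputs: (n_members, n_samples) array of per-sample scores;
    lower mean correlation indicates more diverse members.
    """
    corrs = []
    n = len(outputs)
    for i in range(n):
        for j in range(i + 1, n):
            corrs.append(np.corrcoef(outputs[i], outputs[j])[0, 1])
    return float(np.mean(corrs))

def parameter_distance(w_a, w_b):
    """Euclidean distance between two flattened parameter vectors."""
    return float(np.linalg.norm(w_a - w_b))
```

For real networks, `w_a` and `w_b` would be the concatenation of all layer weights, and `outputs` could hold, e.g., per-sample losses or logits of each member.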

5.4. ABLATION TEST ON DSP-TT HYPERPARAMETERS

The key hyperparameters of DSP-TT are ρ1 (pace of the search step) and ρ2 (step size of the calibration step) defined in Section 3.2.1. In this part we test the sensitivity to these two. We perform tests on a grid of (dρ1/dt, ρ2) pairs and record (1) the number of iterations needed to finish a DSP-TT search for exit points, (2) the average ρ1 of each search path when an exit point is reached, and (3) the average distance between the search origin (the tier-0 solution) and each exit point. As shown in Figure 4, DSP-TT is insensitive to ρ2. Figures 4b and 4c show that (1) ρ1 and the distance between tier-0 and the exit points are highly correlated, and (2) the surface becomes flat after the increment speed of ρ1 passes 5e-4, suggesting that other stability regions are reached.
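The ablation bookkeeping can be sketched on the same kind of toy double-well surface (the grid values, loss, and exit test below are ours, not the paper's experimental setup): for each (dρ1/dt, ρ2) pair, record the iteration count, the value of ρ1 at exit, and the distance from the tier-0 solution to the exit point.

```python
import numpy as np

def search(slope, rho2, w0=1.0, d0=-0.1, max_iter=20000):
    """Run one DSP-style search on c(w) = (w^2 - 1)^2 from the tier-0 LOS w0.

    slope: increment rate of rho1 per iteration (the ablated dρ1/dt);
    rho2:  calibration step size.
    Returns (iterations, rho1 at exit, |exit point - tier-0|).
    """
    d, prev = d0, (w0**2 - 1.0)**2
    for i in range(1, max_iter):
        rho1 = 1.0 + slope * i                       # linearly ramped schedule
        grad = 4.0 * (w0 + d) * ((w0 + d)**2 - 1.0)  # ∇c at the current point
        d = rho1 * d - rho2 * grad                   # eq. (3) with f = -∇c
        loss = ((w0 + d)**2 - 1.0)**2
        if loss > 0.5 and loss < prev:               # past the loss peak: exit point
            return i, rho1, abs(d)
        prev = loss
    return max_iter, rho1, abs(d)

# Record the three quantities over a small (dρ1/dt, ρ2) grid.
records = {(s, r): search(s, r) for s in (1e-5, 5e-5) for r in (1e-3, 5e-3)}
```

On this toy surface the same qualitative trends appear: a faster ρ1 ramp reaches the exit point in fewer iterations, while ρ2 mostly affects how strongly the path is pulled back early in the search.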

6. CONCLUSION AND FUTURE WORK

In this paper, we propose a novel Dynamic Search Path TRUST-TECH training method for deep neural nets. Unlike other global solvers, our proposed method efficiently explores the parameter space in a systematic way. To make the original TRUST-TECH applicable to deep neural networks, we first develop the Dynamic Searching Path (DSP) method; second, we adopt batch evaluation to increase the algorithm's efficiency. Additionally, to further improve model performance, we build the DSP-TT Ensembles. Test cases show that our proposed training method helps individual models obtain better performance, even when only a tier-1 search is applied. Our method is general-purpose, so it can be applied to various architectures with various local solvers. Moreover, it is observed from Table 1 that the percentage improvement in error rate is not as significant as that in loss. This suggests that the cross-entropy loss may be the bottleneck for further performance improvements in classification tasks; designing a loss function more sensitive to classification accuracy would therefore be a valuable topic for future work.



Figure 1: Given a LOS ω0 (a tier-zero LOS), the corresponding tier-1 LOSs are ω1,1, ω1,2, ω1,3. Similarly, its tier-2 LOSs are ω2,1, ω2,2, ω2,3.

Figure 2: (a) Phase portrait of a two-dimensional objective surface; ω0 and ω1 are two SEPs. (b) A demonstration of the DSP method on the same objective. Black arrows represent the DSP path, blue vectors represent forward search steps, and red vectors represent calibration steps.

Training budget: DenseNet has 300 epochs of training budget; ResNet and VGG have 200 epochs. Batch size: 128 for VGG and ResNet, 64 for DenseNet. DSP-TT parameters: ρ1 increases by 0.001 per iteration; ρ2 is 0.1× the initial tier-0 learning rate. The fine-tuning phase uses 10 epochs per solution. All other settings: DenseNet follows Huang et al. (2017b); VGG and ResNet follow Garipov et al. (2018).

Figure 3: (a) Loss progress on the training and testing sets of MNIST during a DSP-TT (full-gradient) search. (b) Error rate progress on the training and testing sets of CIFAR100 during a DSP-TT (batched) search. (c) Exit point verification: points along the search path near the exit point (top left, "X" marker) are sampled and then integrated until convergence. The points before (red) and after (blue) the exit converge to different LOSs.

Figure 4: Sensitivity test on ρ1 and ρ2. X-axis: increment rate of ρ1; Y-axis: ρ2; Z-axis: (a) Running iterations when all exit points are reached. (b) Average ρ1 when each exit point is reached. (c) Average distance from the search origin when each exit point is reached.

Garipov et al. (2018) reveal the relation among local optima by building pathways, called mode connectivities, as simple as polygonal chains or Bezier curves, that connect any two local optima. Draxler et al. (2018) found similar results at the same time, using the Nudged Elastic Band method (Jonsson et al. (1998)) from quantum chemistry.

Performance comparison among ensemble methods with various architectures on the CIFAR datasets. (Output Correlation: mean Pearson correlation coefficient among all members. *: numbers from Garipov et al. (2018); **: numbers from Huang et al. (2017b). Source code for FGE (https://github.com/timgaripov/dnn-mode-connectivity) and SSE (https://github.com/gaohuang/SnapshotEnsemble) is from the respective authors, and we present the best results we could achieve.)

Detailed comparison with ResNet-164 trained on the CIFAR datasets. (Parameter Distance: Euclidean distance of parameters.)

In table 2, the correlation achieved by DSP-TT outperforms that of the other ensemble methods.

annex

Yong-Feng Zhang and Hsiao-Dong Chiang. Enhanced ELITE-LOAD: A novel CMPSOATT methodology for constructing short-term load forecasting models for industrial applications. IEEE Transactions on Industrial Informatics, 16:2325-2334, 2020.

