CONSTRUCTING MULTIPLE HIGH-QUALITY DEEP NEURAL NETWORKS: A TRUST-TECH-BASED APPROACH

Abstract

The success of deep neural networks relies heavily on efficient stochastic gradient descent (SGD)-like training methods. However, these methods are sensitive to initialization and hyper-parameters. In this paper, a systematic method for finding multiple high-quality local optimal deep neural networks from a single training session, using the TRUST-TECH (TRansformation Under Stability-reTaining Equilibria Characterization) method, is introduced. To realize effective TRUST-TECH searches when training deep neural networks on large datasets, a dynamic search path (DSP) method is proposed to provide improved search guidance within TRUST-TECH. The proposed DSP-TT method is implemented such that the computation graph remains constant during the search process; it incurs only minor GPU memory overhead and requires just one training session to obtain multiple local optimal solutions (LOSs). To take advantage of these LOSs, we also propose an improved ensemble method. Experiments on image classification datasets show that our method improves the testing performance by a substantial margin. Specifically, our fully-trained DSP-TT ResNet ensemble improves the SGD baseline by 15% (CIFAR10) and 13% (CIFAR100). Furthermore, our method shows several advantages over other ensembling methods.

1. INTRODUCTION

Due to the high redundancy in the parameters of deep neural networks (DNNs), the number of local optima is huge and can grow exponentially with the dimensionality of the parameter space (Auer et al. (1996); Choromanska et al. (2015); Dauphin et al. (2014b)). It remains a challenging task to locate high-quality optimal solutions in the parameter space, at which the model performs satisfactorily on both training and testing data. A popular metric for the quality of a local solution is its generalization capability, commonly defined as the gap between the training and testing performances (LeCun et al. (2015)). For deep neural networks with high expressivity, the training error is near zero, so it suffices to use the test error to represent the generalization gap. Generally, local solvers do not have a global vision of the parameter space, so there is no guarantee that starting from a random initialization will locate a high-quality local optimal solution. On the other hand, one can apply a non-local solver in the parameter space to find multiple optimal solutions and select the high-quality ones. Furthermore, one can improve the DNN performance by ensembling those high-quality solutions that exhibit high diversity. TRUST-TECH plays an important role in achieving this goal. In general, it computes high-quality optimal solutions for general nonlinear optimization problems; its theoretical foundations can be found in (Chiang & Chu (1996); Lee & Chiang (2004)). It helps local solvers escape from one local optimal solution (LOS) and search for other LOSs. It has been successfully applied in guiding the Expectation Maximization method to achieve higher performance (Reddy et al. (2020)). Additionally, it does not interfere with existing local or global solvers, but cooperates with them. TRUST-TECH efficiently searches the neighboring subspace of promising candidates for new LOSs in a tier-by-tier manner.
Eventually, a set of high-quality LOSs can be found. The idea of the TRUST-TECH method is the following: for a given loss surface of an optimization problem, each LOS has its own stability region. If one starts from one local optimum and tracks the loss values along a given direction, one will find an exit point where the loss starts to decrease steadily, which means another stability region, corresponding to a nearby LOS, has been found. By following a trajectory in that stability region, another LOS is computed. We propose an optima-exploring algorithm designed for DNNs that is able to find high-quality local optima in a systematic way, and thereby form optimal and robust ensembles. Normally, for a deep neural network, exit points can hardly be found by the original TRUST-TECH method due to the huge dimensionality. So, in this work, we introduce the Dynamic Search Path (DSP) method instead of fixed search directions: we set the search directions to be trainable parameters. After an exploration step forward along the current direction, we calibrate the direction using the current gradient. By doing so, the method not only benefits from the mature Stochastic Gradient Descent (SGD) training paradigm with powerful GPU acceleration, but also finds exit points easily. The overall DSP-TT method consists of four stages. First, we train the network using local solvers to get a tier-0 local optimal solution. Second, our proposed Dynamic Search Path TRUST-TECH (DSP-TT) method is called to find nearby solutions in a tier-by-tier manner. Third, a selection process is performed so that candidates with high quality are chosen. Finally, ensembles are built, with necessary fine-tuning of the selected member networks. To the best of our knowledge, this paper is the first to search for multiple solutions of deep neural networks in a systematic way.
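The exit-point idea above can be illustrated on a toy one-dimensional loss surface. The sketch below is illustrative only: `toy_loss`, the fixed step size, and the fixed direction are simplified stand-ins (the actual DSP-TT method uses trainable, dynamically calibrated search directions over the full network parameters and SGD as the local solver).

```python
import numpy as np

def toy_loss(w):
    # Toy multi-modal 1-D loss with several basins (stand-in for a DNN loss surface).
    return np.sin(3 * w) + 0.1 * w**2

def find_exit_point(w0, direction, step=0.05, max_steps=200):
    """Walk from a local optimum w0 along `direction`, tracking the loss.
    The exit point is where the loss, after rising out of the current basin,
    starts to decrease steadily: we have entered a neighboring stability region."""
    w, prev = w0, toy_loss(w0)
    rising = False
    for _ in range(max_steps):
        w = w + step * direction
        cur = toy_loss(w)
        if cur > prev:
            rising = True               # still climbing out of the current basin
        elif rising and cur < prev:     # loss starts to decrease: exit point found
            return w
        prev = cur
    return None                         # no exit point along this direction

def descend(w, lr=0.01, iters=500, eps=1e-6):
    # Plain gradient descent (finite-difference gradient) to reach the nearby LOS.
    for _ in range(iters):
        g = (toy_loss(w + eps) - toy_loss(w - eps)) / (2 * eps)
        w -= lr * g
    return w
```

Starting from the local minimum near w = -0.51 and searching in the +1 direction, the walk crests the loss barrier, detects an exit point, and descending from it lands in the adjacent basin near w = 1.54.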
Our major contributions and highlights are summarized as follows:
• We propose the Dynamic Search Path (DSP) method, which enables efficient exploration of high-dimensional parameter spaces.
• We show that combining the TRUST-TECH method with DSP (DSP-TT) is effective in systematically finding multiple optimal solutions of deep neural networks.
• We design and implement the algorithm efficiently, so that it obtains multiple local solutions within one training session with minor GPU memory overhead.
• We develop DSP-TT Ensembles from the high-quality, diverse solutions found by DSP-TT, further improving DNN performance.
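On the last point, one common way to combine member networks is to average their softmax outputs; the minimal sketch below uses an unweighted average purely for illustration (the paper's DSP-TT Ensemble may weight or select members by quality and diversity, which is not shown here).

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_predict(member_logits):
    """Average the members' softmax outputs and take the argmax class.
    member_logits: list of (num_samples, num_classes) logit arrays,
    one per member network found by the search."""
    probs = np.stack([softmax(l) for l in member_logits])  # (M, N, C)
    return probs.mean(axis=0).argmax(axis=-1)              # (N,) predicted classes
```

For example, with three members voting on two samples, two members strongly favoring class 0 on the first sample outweigh the one favoring class 1.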

2. RELATED WORK

The synergy between massive numbers of parameters and nonlinear activations in deep neural networks leads to the existence of multiple LOSs trained on a specific dataset. Experiments show that different initializations lead to different solutions of various qualities (Dauphin et al. (2014a)). Even with the same initialization, the network can converge to different solutions depending on the loss function and the solver (Im et al. (2016)). Many regularization techniques have therefore been proposed to force the network to converge to a better solution, some of which have proven useful and popular (Kingma & Ba (2015); Srivastava et al. (2014); Ioffe & Szegedy (2015)). However, it is still mysterious how these regularized solutions compare to the global optimum.

TRUST-TECH has previously been applied to training ANNs (Chiang & Reddy (2007); Wang & Chiang (2011)), estimating finite mixture models (Reddy et al. (2008)), and solving optimal power flow problems (Chiang et al. (2009); Zhang & Chiang). Global optimization algorithms, on the other hand, have mostly been evaluated on obsolete toy models or on explicit benchmark objective functions with analytical forms for the global optimum, and therefore the effectiveness of these algorithms on deep architectures and large datasets seems unconvincing. Moreover, the advantage of their global searching ability seems to be crippled when it comes to deep neural networks, and the minimizers they find are still local. Recently, Garipov et al. (2018) revealed relations among local optima by building pathways, called Mode Connectivities, as simple as polygonal chains or Bezier curves, along which the training loss remains nearly constant.
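A mode-connectivity-style analysis amounts to evaluating the loss along a parameterized path between two solutions. The sketch below traces a two-segment polygonal chain on a toy 2-D surface; `loss_2d`, the endpoints, and the bend point are illustrative stand-ins, not the trained-bend construction of Garipov et al. (2018).

```python
import numpy as np

def loss_2d(w):
    # Toy 2-D multi-modal loss (stand-in for a network's training loss).
    return float(np.sum(np.sin(3 * w) + 0.1 * w**2))

def path_losses(w_a, w_b, bend, n=11):
    """Evaluate the loss at n points along the polygonal chain
    w_a -> bend -> w_b, parameterized by t in [0, 1]."""
    losses = []
    for t in np.linspace(0.0, 1.0, n):
        if t <= 0.5:                          # first segment: w_a to bend
            w = w_a + 2 * t * (bend - w_a)
        else:                                 # second segment: bend to w_b
            w = bend + (2 * t - 1) * (w_b - bend)
        losses.append(loss_2d(w))
    return losses
```

A flat loss profile along such a chain would indicate that the two endpoint solutions are mode-connected.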

