QUASICONVEX SHALLOW NEURAL NETWORK

Abstract

Deep neural networks generally have highly non-convex loss landscapes, resulting in multiple local optima of the network weights. A non-convex network is likely to fail, i.e., to become trapped in bad local optima with large errors, especially when the task involves convexity (e.g., linearly separable classification). While convexity is essential in training neural networks, designing a convex network structure without strong assumptions (e.g., linearity) on the activation or loss function is challenging. To extract and utilize convexity, this paper presents the QuasiConvex shallow Neural Network (QCNN) architecture under mild assumptions. We first decompose the network into building blocks whose quasiconvexity is thoroughly studied. We then design additional layers that preserve quasiconvexity when these building blocks are integrated into general networks. The proposed QCNN, interpreted as a quasiconvex optimization problem, allows for efficient training with theoretical guarantees. Specifically, we construct equivalent convex feasibility problems to solve the quasiconvex optimization problem. Our theoretical results are verified via extensive experiments on common machine learning tasks. In some tasks, the quasiconvex structure of QCNN demonstrates even better learning ability than non-convex deep networks.
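The abstract's reduction of quasiconvex optimization to a sequence of convex feasibility problems can be illustrated by the standard sublevel-set bisection scheme: bisect on a target value t, and at each step ask whether the (convex) sublevel set {x : f(x) ≤ t} is non-empty within the constraint set. The one-dimensional sketch below is our own illustration of that generic scheme, not the paper's construction; the objective, interval, and tolerances are all assumptions for the example.

```python
import math

def f(x):
    # sqrt(|x - 3|) is quasiconvex: a monotone transform of the convex |x - 3|.
    return math.sqrt(abs(x - 3))

def feasible(t, lo=0.0, hi=1.0):
    # Convex feasibility subproblem: is the sublevel set {x : f(x) <= t},
    # here the interval [3 - t^2, 3 + t^2], non-empty within [lo, hi]?
    return max(lo, 3 - t * t) <= min(hi, 3 + t * t)

def quasiconvex_min(l=0.0, u=None, iters=60):
    # Bisect on the target value t; each iteration solves one feasibility problem.
    if u is None:
        u = f(0.0)  # any attainable value upper-bounds the optimum
    for _ in range(iters):
        t = 0.5 * (l + u)
        if feasible(t):
            u = t  # the optimal value is at most t
        else:
            l = t  # the optimal value exceeds t
    return u

# Over [0, 1] the minimum of f is attained at x = 1, with value sqrt(2).
print(quasiconvex_min())
```

Each bisection step halves the interval bracketing the optimal value, so the number of feasibility problems solved grows only logarithmically in the desired accuracy.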

1. INTRODUCTION

Neural networks have been at the heart of machine learning algorithms, covering a variety of applications. In neural networks, the optimal network weights are generally found by minimizing a supervised loss function using some form of stochastic gradient descent (SGD) (Saad (1998)), in which the gradient is evaluated using the backpropagation procedure (LeCun et al. (1998)). However, the loss function is generally highly non-convex, especially in deep neural networks, since the multiplication of weights between hidden layers and the non-linear activation functions tend to break the convexity of the loss function. Therefore, there are many locally optimal solutions for the network weights (Choromanska et al. (2015)). While some experiments show that certain local optima are equivalent and yield similar learning performance, the network is likely to be trapped in bad local optima with a large loss.

Are non-convex deep neural networks always better? Deep neural networks have shown success in many machine learning applications, such as image classification, speech recognition, and natural language processing (Hinton & Salakhutdinov (2006); Ciregan et al. (2012); Hinton et al. (2012); Kingma et al. (2014)). Many people believe that the multiple layers in deep neural networks allow models to learn more complex features and perform more computationally intensive tasks. However, deep neural networks are generally highly non-convex in the loss function, which makes training burdensome. Since the loss function has many critical points, including spurious local optima and saddle points (Choromanska et al. (2015)), the network is hindered from finding the global optimum, and training is sensitive to the initial guess. In fact, Sun et al. (2016) pointed out that increasing the depth of neural networks is not always beneficial, since there is a trade-off between non-convex structure and representation power. In some engineering tasks requiring additional physical modeling, simply applying deep neural networks is likely to fail. Even worse, we usually do not know how to improve a deep neural network after a failure, since it is a black-box procedure with few theoretical guarantees.
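The sensitivity to the initial guess described above can be seen even in one dimension. The toy sketch below (our own illustration, not from the paper; the objective and step size are assumptions) runs plain gradient descent on a non-convex function from two starting points and lands in minima of very different quality:

```python
def f(x):
    # A simple non-convex objective: (x^2 - 1)^2 + 0.3x has a poor local
    # minimum near x = 0.96 and the global minimum near x = -1.03.
    return (x * x - 1.0) ** 2 + 0.3 * x

def grad(x):
    # Exact derivative of f.
    return 4.0 * x * (x * x - 1.0) + 0.3

def gradient_descent(x0, lr=0.05, steps=500):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_bad = gradient_descent(0.5)    # init on the right: trapped in the poor minimum
x_good = gradient_descent(-0.5)  # init on the left: reaches the global minimum
print(x_bad, f(x_bad))
print(x_good, f(x_good))
```

Both runs converge to critical points with zero gradient, yet their losses differ substantially; a quasiconvex objective, by contrast, has no such spurious sublevel-set structure to get trapped in.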

