QUASICONVEX SHALLOW NEURAL NETWORK

Abstract

Deep neural networks generally have highly non-convex structures, resulting in multiple local optima of the network weights. A non-convex network is likely to fail, i.e., to be trapped in bad local optima with large errors, especially when the task involves convexity (e.g., linearly separable classification). While convexity is essential in training neural networks, designing a convex network structure without strong assumptions (e.g., linearity) on the activation or loss function is challenging. To extract and utilize convexity, this paper presents the QuasiConvex shallow Neural Network (QCNN) architecture under mild assumptions. We first decompose the network into building blocks whose quasiconvexity is thoroughly studied. Then, we design additional layers that preserve quasiconvexity when such building blocks are integrated into general networks. The proposed QCNN, interpreted as a quasiconvex optimization problem, allows for efficient training with theoretical guarantees. Specifically, we construct equivalent convex feasibility problems to solve the quasiconvex optimization problem. Our theoretical results are verified via extensive experiments on common machine learning tasks. In some tasks, the quasiconvex structure of QCNN even demonstrates better learning ability than non-convex deep networks.

1. INTRODUCTION

Neural networks have been at the heart of machine learning algorithms, covering a wide variety of applications. In neural networks, the optimal network weights are generally found by minimizing a supervised loss function with some form of stochastic gradient descent (SGD) (Saad (1998)), in which the gradient is evaluated using the backpropagation procedure (LeCun et al. (1998)). However, the loss function is generally highly non-convex, especially in deep neural networks, since the multiplication of weights between hidden layers and the non-linear activation functions tend to break the convexity of the loss function. Consequently, there are many local optima of the network weights (Choromanska et al. (2015)). While some experiments show that certain local optima are equivalent and yield similar learning performance, the network is likely to be trapped in bad local optima with a large loss.

Issue 1: Are non-convex deep neural networks always better? Deep neural networks have shown success in many machine learning applications, such as image classification, speech recognition, and natural language processing (Hinton & Salakhutdinov (2006); Ciregan et al. (2012); Hinton et al. (2012); Kingma et al. (2014)). Many people believe that the multiple layers in deep neural networks allow models to learn more complex features and perform more intensive computational tasks. However, deep neural networks are generally highly non-convex in the loss function, which makes training burdensome. Since the loss function has many critical points, including spurious local optima and saddle points (Choromanska et al. (2015)), the network is hindered from finding the global optima and training is sensitive to the initial guess. In fact, Sun et al. (2016) pointed out that increasing depth in neural networks is not always good, since there is a trade-off between non-convex structure and representation power. In some engineering tasks requiring additional physical modeling, simply applying deep neural networks is likely to fail. Even worse, we usually do not know how to improve a deep neural network after a failure, since training is a black-box procedure without many theoretical guarantees.

Issue 2: Solutions to non-convexity are not practical. To overcome non-convexity in neural networks, new designs of network structure have been proposed. The first line of research focused on specific activation functions (e.g., linear or quadratic) and specific target functions (e.g., polynomials) (Andoni et al. (2014)), for which the network structure can be convex. However, such methods are limited in practical applications (Janzamin et al. (2015)). Another line of research aimed at deriving the dual problem of the optimization problem formulated by neural network training. Unlike the non-convex primal problem, the dual problem is usually convex. Conditions ensuring strong duality (zero duality gap and a solvable dual problem) were then discussed to find the optimal solution to the neural network. For example, Ergen & Pilanci (2020) derived the dual problem for neural networks with ReLU activation, and Wang et al. (2021) showed that parallel deep neural networks have zero duality gap. However, the derivations of strong duality in the literature require the planted model assumption, which is impractical for many real-world datasets. Aside from studying convexity in the network weights, some work explored convexity in the data input and label. For instance, the input convex structure of Amos et al. (2017) alters the neural network output to be a convex function of (some of) the inputs. Nevertheless, such a function is only an inference procedure with given network weights.

In this work, we introduce QCNN, the first QuasiConvex shallow Neural Network structure that learns the optimal weights via a quasiconvex optimization problem. We first decompose a general neural network (shown in the middle of Figure 1) into building blocks (denoted by distinct colors). In each building block, the multiplication of two weights, together with the non-linear activation function in the forward propagation, makes the building block non-convex. Nevertheless, inspired by Boyd et al. (2004), we notice that the multiplication itself is quasiconcave if the activation function is ReLU. Quasiconvexity (quasiconcavity) is a generalization of convexity (concavity) that shares similar properties and is hence a desirable property in a neural network. To preserve quasiconcavity in the network structure when the building blocks are integrated, we design special layers (e.g., a minimization pooling layer), as shown in the middle of Figure 1. In doing so, we arrive at a quasiconvex optimization problem for training the network, which can be equivalently solved by tackling convex feasibility problems. Unlike non-convex deep neural networks, the quasiconvexity in QCNN enables us to learn the optimal network weights efficiently with guaranteed performance.
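The quasiconcavity claims above can be checked numerically: the product of two weights is quasiconcave on the positive orthant, i.e., along any segment, f(tx + (1-t)y) >= min(f(x), f(y)); a nondecreasing transform (such as ReLU) preserves this, as does a pointwise minimum, while summation does not in general. Below is a minimal NumPy sketch of these sanity checks; the specific test functions are our illustrative choices, not the QCNN architecture itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def quasiconcave_along_segments(f, dim, n_trials=10_000):
    """Check f(t*x + (1-t)*y) >= min(f(x), f(y)) on random segments
    in the positive orthant -- the defining segment condition for
    quasiconcavity, tested at sampled points."""
    for _ in range(n_trials):
        x, y = rng.uniform(0.01, 10.0, size=(2, dim))
        t = rng.uniform(0.0, 1.0)
        z = t * x + (1 - t) * y
        if f(z) < min(f(x), f(y)) - 1e-9:
            return False
    return True

# Product of two weights: quasiconcave on the positive orthant.
prod = lambda w: w[0] * w[1]

# ReLU of the product: a nondecreasing transform preserves quasiconcavity.
relu_prod = lambda w: max(w[0] * w[1], 0.0)

# Pointwise minimum of quasiconcave functions stays quasiconcave,
# which is the rationale for a minimization pooling layer.
min_pool = lambda w: min(w[0] * w[1], 3.0 * w[0] * w[1] - 1.0)

# Summation pooling can break quasiconcavity: a concrete counterexample.
sum_pool = lambda w: w[0] * w[1] + w[2] * w[3]
x = np.array([9.0, 9.0, 1.0, 1.0])   # sum_pool(x) = 82
y = np.array([1.0, 1.0, 9.0, 9.0])   # sum_pool(y) = 82
mid = 0.5 * (x + y)                  # sum_pool(mid) = 50 < 82
```

The sampled segment test is only a necessary condition, but a single violating segment (as for `sum_pool`) is a full counterexample.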

Figure 1: Proposed Method. (Left) The motivation and challenge of this study. (Middle) We design a quasiconvex neural network structure to efficiently train for the optimal network weights via a quasiconvex optimization problem. The quasiconvexity is studied and preserved via special pooling layers. (Right) Unlike a non-convex loss function, the quasiconvex loss function of our design allows for finding the global optimum.

These issues echo the long-recognized difficulty of training non-convex neural networks. In training a non-convex neural network, commonly used methods, such as gradient descent in the backpropagation procedure, can get stuck in bad local optima and experience arbitrarily slow convergence (Janzamin et al. (2015)). Explicit examples of the failure of network training and the presence of bad local optima have been discussed in Brady et al. (1989) and Frasconi et al. (1993). For instance, Brady et al. (1989) constructed simple cases of linearly separable classes on which backpropagation fails. Under a non-linearly separable setting, Gori & Tesi (1992) also showed failures of backpropagation. These studies indicate that deep neural networks with non-convex structures offer no guarantee of good performance, even in simple settings.
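In contrast to gradient descent, the convex-feasibility strategy mentioned above admits a simple sketch: for a quasiconvex objective, the sublevel set {x : f(x) <= t} is convex for every t, so bisecting on the target value t and solving one convex feasibility problem per step converges to the global minimum. The one-dimensional objective below, f(x) = sqrt(|x - 3|), is our toy illustration (quasiconvex but non-convex), not the QCNN loss; in the QCNN setting the feasibility oracle would be a convex program over the network weights, whereas here it reduces to an interval test:

```python
import math

def f(x):
    """Quasiconvex but non-convex toy objective (illustrative only)."""
    return math.sqrt(abs(x - 3.0))

def sublevel_feasible(t, lo=-10.0, hi=10.0):
    """Convex feasibility oracle: return a point of
    {x in [lo, hi] : f(x) <= t} or None if it is empty.
    Here {f <= t} = [3 - t**2, 3 + t**2], a convex interval."""
    if t < 0:
        return None
    left = max(3.0 - t**2, lo)
    right = min(3.0 + t**2, hi)
    return 0.5 * (left + right) if left <= right else None

def quasiconvex_bisection(t_lo=0.0, t_hi=10.0, tol=1e-6):
    """Minimize a quasiconvex f by bisection on the target value t;
    each iteration solves a single convex feasibility problem."""
    x_best = None
    while t_hi - t_lo > tol:
        t = 0.5 * (t_lo + t_hi)
        x = sublevel_feasible(t)
        if x is not None:           # value t is achievable
            t_hi, x_best = t, x
        else:                       # t is below the optimal value
            t_lo = t
    return x_best, t_hi

x_star, f_star = quasiconvex_bisection()
# x_star approaches the global minimizer x = 3, f_star approaches 0.
```

After k bisection steps the optimal value is bracketed within (t_hi - t_lo) / 2**k, so the number of feasibility problems grows only logarithmically in the desired accuracy.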

