QUASICONVEX SHALLOW NEURAL NETWORK

Abstract

Deep neural networks generally have highly non-convex loss landscapes, resulting in multiple local optima of the network weights. A non-convex network is likely to fail, i.e., to become trapped in bad local optima with large errors, especially when the task involves convexity (e.g., linearly separable classification). While convexity is valuable in training neural networks, designing a convex network structure without strong assumptions (e.g., linearity) on the activation or loss function is challenging. To extract and utilize convexity, this paper presents the QuasiConvex shallow Neural Network (QCNN) architecture under mild assumptions. We first decompose the network into building blocks in which quasiconvexity is thoroughly studied. Then, we design additional layers that preserve quasiconvexity when such building blocks are integrated into general networks. The proposed QCNN, interpreted as a quasiconvex optimization problem, allows for efficient training with theoretical guarantees. Specifically, we construct equivalent convex feasibility problems to solve the quasiconvex optimization problem. Our theoretical results are verified via extensive experiments on common machine learning tasks. The quasiconvex structure in QCNN demonstrates even better learning ability than non-convex deep networks in some tasks.

1. INTRODUCTION

Neural networks have been at the heart of machine learning algorithms, covering a variety of applications. In neural networks, the optimal network weights are generally found by minimizing a supervised loss function using some form of stochastic gradient descent (SGD) (Saad (1998)), in which the gradient is evaluated using the backpropagation procedure (LeCun et al. (1998)). However, the loss function is generally highly non-convex, especially in deep neural networks, since the multiplication of weights between hidden layers and the non-linear activation functions tend to break the convexity of the loss function. Therefore, there are many locally optimal solutions for the network weights (Choromanska et al. (2015)). While some experiments show that certain local optima are equivalent and yield similar learning performance, the network is likely to be trapped in bad local optima with a large loss. Issue 1: Are non-convex deep neural networks always better? Deep neural networks have shown success in many machine learning applications, such as image classification, speech recognition, and natural language processing (Hinton & Salakhutdinov (2006); Ciregan et al. (2012); Hinton et al. (2012); Kingma et al. (2014)). It is widely believed that the multiple layers in deep neural networks allow models to learn more complex features and perform more computationally intensive tasks. However, deep neural networks generally have highly non-convex loss functions, which makes training burdensome. Since the loss function has many critical points, including spurious local optima and saddle points (Choromanska et al. (2015)), the network is hindered from finding the global optimum, and training is sensitive to the initial guess. In fact, Sun et al. (2016) pointed out that increasing depth in neural networks is not always beneficial since there is a trade-off between non-convex structure and representation power.
In some engineering tasks requiring additional physical modeling, simply applying deep neural networks is likely to fail. Even worse, we usually do not know how to improve a deep neural network after a failure since it is a black-box procedure without many theoretical guarantees. Issue 2: Existing solutions to non-convexity are not practical. To overcome non-convexity in neural networks, new designs of network structure have been proposed. The first line of research focused on specific activation functions (e.g., linear or quadratic) and specific target functions (e.g., polynomials) (Andoni et al. (2014)), for which the network structure can be made convex. However, such methods are limited in practical applications (Janzamin et al. (2015)). Another line of research aimed at deriving the dual problem of the optimization problem formulated by neural network training. Unlike the non-convex neural network, its dual problem is usually convex. Then, conditions ensuring strong duality (zero duality gap and a solvable dual problem) were derived to find the optimal solution of the neural network. For example, Ergen & Pilanci (2020) derived the dual problem for neural networks with ReLU activation, and Wang et al. (2021) showed that parallel deep neural networks have zero duality gap. However, the derivations of strong duality in the literature require the planted model assumption, which is impractical for many real-world datasets. Aside from studying convexity in the network weights, some work has explored convexity in the data input and label. For instance, the input convex structure of Amos et al. (2017) makes the neural network output a convex function of (some of) the inputs for given weights. Nevertheless, such a structure only describes the inference procedure with fixed network weights. In this work, we introduce QCNN, the first QuasiConvex shallow Neural Network structure, which learns the optimal weights by solving a quasiconvex optimization problem.
We first decompose a general neural network (shown in the middle of Figure 1) into building blocks (denoted by distinct colors). In each building block, the multiplication of two weights, as well as the non-linear activation function in the forward propagation, makes the building block non-convex. Nevertheless, inspired by Boyd et al. (2004), we notice that the multiplication itself is quasiconcave if the activation function is ReLU. Quasiconvexity (quasiconcavity) is a generalization of convexity (concavity) that shares similar properties and, hence, is a desirable property for a neural network. To preserve quasiconcavity in the network structure when the building blocks are integrated, we design special layers (e.g., a minimization pooling layer), as shown in the middle of Figure 1. In doing so, we arrive at a quasiconvex optimization problem for training the network, which can be equivalently solved by tackling convex feasibility problems. Unlike non-convex deep neural networks, the quasiconvexity in QCNN enables us to learn the optimal network weights efficiently with guaranteed performance.

2. RELATED WORK

Failure of training non-convex neural networks. In training a non-convex neural network, commonly used methods, such as gradient descent in the backpropagation procedure, can get stuck in bad local optima and experience arbitrarily slow convergence (Janzamin et al. (2015)). Explicit examples of the failure of network training and the presence of bad local optima have been discussed in (Brady et al. (1989); Frasconi et al. (1993)). For instance, Brady et al. (1989) constructed simple cases of linearly separable classes for which backpropagation fails. Under a non-linearly separable setting, Gori & Tesi (1992) also showed failures of backpropagation. These studies indicate that deep neural networks are not suitable for all tasks. This motivates us to ask: can simple networks with a convex structure beat deep networks with a non-convex structure in some tasks? Convexity in neural networks. The lack of convexity has been seen as one of the major issues of deep neural networks (Bengio et al. (2005)), drawing much research in the machine learning community. Many works have studied convex structures and convex problems in neural networks. For instance, (Bengio et al. (2005); Andoni et al. (2014); Choromanska et al. (2015); Milne (2019); Rister & Rubin (2017)) showed that training a neural network under some strong conditions can be viewed as a convex optimization problem. (Farnia & Tse (2018); Ergen & Pilanci (2020); Wang et al. (2021); Pilanci & Ergen (2020)) studied convex dual problems of neural network optimization and derived strong duality under specific assumptions. Aside from directly studying convexity in neural networks, researchers have also discussed conditions under which local optima become global. For example, Haeffele & Vidal (2015) showed that if the network is over-parameterized (i.e., has sufficiently many neurons) such that there exist local optima where some of the neurons have zero contribution, then such local optima are global optima (Janzamin et al. (2015)). Similarly, Haeffele & Vidal (2017) showed that all critical points are either global minimizers or saddle points if the network size is large enough. However, such studies only provide theoretical possibilities, while efficient algorithms to solve (infinitely) large networks are missing. Based on the existing literature, we restrict our research to finding a simple but practical neural network with some convexity that provides performance guarantees. Quasiconvex optimization problems. The study of quasiconvex functions, as well as quasiconvex optimization problems, started with (Fenchel & Blackett (1953); Luenberger (1968)) and has become popular since real-world functions are not always convex. For instance, quasiconvex functions have been of particular interest in economics (Agrawal & Boyd (2020)), modeling the utility functions in equilibrium studies (Arrow & Debreu (1954); Guerraggio & Molho (2004)). Quasiconvex optimization has recently been applied in many areas, including engineering (Bullo & Liberzon (2006)), model order reduction (Sou), computer vision (Ke & Kanade (2007)), and machine learning (Hazan et al. (2015)). Among the many solutions to quasiconvex optimization problems, a simple algorithm is bisection (Boyd et al. (2004)), which solves equivalent convex feasibility problems iteratively until convergence.

3. PRELIMINARY

To model the general case, we consider an $L$-layer network with layer weights $W_l \in \mathbb{R}^{m_{l-1} \times m_l}, \forall l \in [L]$, where $m_0 = d$ and $m_L = 1$ are the input and output dimensions, respectively. As the dimensions suggest, the input data is a vector and the output is a scalar. Given a labeled dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ with $n$ samples, we consider a neural network with the following architecture:

$$f_\theta(X) = h_L, \quad h_l = g(h_{l-1} W_l), \ \forall l \in [L], \tag{1}$$

where $h_l$ denotes the layer activation and $h_0 = X \in \mathbb{R}^{n \times d}$ is the data matrix. Here, $\theta = \{W_l\}_{l=1}^L$ are the network weights to be optimized via training, and $g(\cdot)$ is the non-linear activation function. The network is trained with the $L_2$ loss:

$$\theta^\star = \arg\min_\theta \tfrac{1}{2}\,\|f_\theta(X) - y\|_2^2, \tag{2}$$

where $y \in \mathbb{R}^n$ is the label vector. The loss function in Equation 2 is generally non-convex because of the multiplication of weights as well as the non-linear activation functions. As discussed previously, non-convexity will likely cause the network to be trapped in bad local optima with large errors. Therefore, we still want to extract some convexity from this loss function to help the training process. In this paper, we show that quasiconvexity and quasiconcavity are hidden in the network, and that these properties can be utilized to construct a convex optimization problem to train the optimal network weights. Here, we introduce the definitions of quasiconvexity and quasiconcavity. Definition 1. A function $f : \mathbb{R}^d \to \mathbb{R}$ is quasiconvex if its domain and all its sublevel sets $\{x \in \operatorname{dom} f \mid f(x) \le \alpha\}, \forall \alpha$, are convex. Similarly, a function is quasiconcave if $-f$ is quasiconvex, i.e., every superlevel set $\{x \mid f(x) \ge \alpha\}$ is convex. We also note that, since convex functions always have convex sublevel sets, they are naturally quasiconvex, while the converse is not true. Therefore, quasiconvexity can be regarded as a generalization of convexity, which is exactly what we seek in non-convex deep neural networks.
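To make Definition 1 concrete, the following numerical sketch (our illustration, not part of the paper) checks that $f(x) = \sqrt{|x|}$ is quasiconvex, using the equivalent characterization $f(\lambda x + (1-\lambda)y) \le \max\{f(x), f(y)\}$, while showing that it is not convex because Jensen's inequality fails:

```python
import numpy as np

def f(x):
    # sqrt(|x|): quasiconvex on R (every sublevel set is an interval) but not convex
    return np.sqrt(np.abs(x))

rng = np.random.default_rng(0)

# Quasiconvexity check: f(lam*x + (1-lam)*y) <= max(f(x), f(y)) on sampled points
ok = True
for _ in range(10000):
    x, y = rng.uniform(-5, 5, size=2)
    lam = rng.uniform()
    if f(lam * x + (1 - lam) * y) > max(f(x), f(y)) + 1e-12:
        ok = False
print("quasiconvex (sampled):", ok)

# Convexity fails: Jensen's inequality is violated at x=0, y=1, lam=0.5
lhs = f(0.5 * 0 + 0.5 * 1)      # f(0.5) ~ 0.707
rhs = 0.5 * f(0) + 0.5 * f(1)   # 0.5
print("convex at this triple:", bool(lhs <= rhs))
```

The sampled check only gives evidence, not a proof, but it matches the sublevel-set picture: every set $\{x : \sqrt{|x|} \le \alpha\} = [-\alpha^2, \alpha^2]$ is an interval, hence convex.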

4. QUASI-CONCAVE STRUCTURE

To design a quasiconvex structure for neural networks, we start by considering the simplest and most representative building block in a network and analyzing its characteristics. Specifically, we consider the network

$$f(w_1; w_2) = g(g(x^\top w_1)\, w_2), \tag{3}$$

where $x \in \mathbb{R}^d$ is the input data, and $w_1 \in \mathbb{R}^d$ and $w_2 \in \mathbb{R}$ are the weights of the two hidden layers. We analyze this two-layer structure because it is the simplest case in neural networks yet can be generalized to deep neural networks. In Equation 3, the network is not convex in the weights $(w_1; w_2)$ because (1) the network function contains the multiplication of the weights, and (2) the activation function $g(\cdot)$ is usually non-linear. Nevertheless, we still want to explore the possibility of the network becoming convex or being related to convexity. Inspired by Boyd et al. (2004), we notice that although the multiplication of two weights in the forward propagation makes the network non-convex, the multiplication itself is quasiconcave under specific circumstances. For example, the product of two variables forms the shape of a saddle, which is not convex in those variables. However, if we restrict the two variables to be positive, the saddle shape reduces to a quasiconcave surface, as shown in Figure 2. To obtain such positivity in Equation 3, a straightforward approach is to assume the network weights $(w_1; w_2)$ to be non-negative, as in (Amos et al. (2017)). However, this assumption significantly reduces the neural network's representation power: with $m$ weights, constraining all of them to be non-negative leaves only a $1/2^m$ fraction of the weight space. To bypass this impractical assumption, we notice that some activation functions naturally restrict the output to be non-negative. For example, the ReLU activation function $g(x) = \max\{0, x\}$ forces negative inputs to zero.
Therefore, we can show that the network in Equation 3 with the ReLU activation function is quasiconcave in the network weights, as stated in Theorem 1.

Lemma 1. The function $f(w_1, w_2) = w_1 w_2$ with $\operatorname{dom} f = \mathbb{R}^2_+$ is quasiconcave.

Theorem 1. The neural network in Equation 3 with the ReLU activation function $g(\cdot)$ is quasiconcave in the network weights $(w_1; w_2)$.

Proof. To prove quasiconcavity, we need to show that all superlevel sets $S_\alpha = \{(w_1; w_2) \mid f(w_1; w_2) \ge \alpha\}, \forall \alpha \in \mathbb{R}$, are convex sets. When $\alpha \le 0$, the superlevel set is the complete set, i.e., $S_\alpha = \operatorname{dom} f$, due to the ReLU activation function; hence $S_\alpha$ is evidently convex. When $\alpha > 0$, the superlevel set is neither the empty set nor the complete set. For any two elements $(\hat w_1; \hat w_2), (\tilde w_1; \tilde w_2) \in S_\alpha$, we aim to show that $(\lambda \hat w_1 + (1-\lambda)\tilde w_1;\ \lambda \hat w_2 + (1-\lambda)\tilde w_2) \in S_\alpha$ for $\lambda \in (0, 1)$. From the condition $\alpha > 0$, we know that $x^\top \hat w_1 > 0$ and $\hat w_2 > 0$, as well as $x^\top \tilde w_1 > 0$ and $\tilde w_2 > 0$. Therefore,

$$
\begin{aligned}
f(\lambda \hat w_1 + (1-\lambda)\tilde w_1;\ \lambda \hat w_2 + (1-\lambda)\tilde w_2)
&= \left[\lambda x^\top \hat w_1 + (1-\lambda) x^\top \tilde w_1\right]\left[\lambda \hat w_2 + (1-\lambda)\tilde w_2\right] \\
&= \lambda^2 x^\top \hat w_1 \hat w_2 + (1-\lambda)^2 x^\top \tilde w_1 \tilde w_2 + \lambda(1-\lambda)\left[x^\top \hat w_1 \tilde w_2 + x^\top \tilde w_1 \hat w_2\right] \\
&\ge \lambda^2 \alpha + (1-\lambda)^2 \alpha + \lambda(1-\lambda)\left[x^\top \hat w_1 \tilde w_2 + x^\top \tilde w_1 \hat w_2\right] \\
&\ge \lambda^2 \alpha + (1-\lambda)^2 \alpha + \lambda(1-\lambda)\left[\alpha \frac{\tilde w_2}{\hat w_2} + \alpha \frac{\hat w_2}{\tilde w_2}\right] \\
&\ge \lambda^2 \alpha + (1-\lambda)^2 \alpha + \lambda(1-\lambda) \times 2\alpha \\
&= \alpha\left[\lambda^2 + (1-\lambda)^2 + 2\lambda(1-\lambda)\right] = \alpha.
\end{aligned}
$$

In Theorem 1, we show that the simple two-layer network in Equation 3 is quasiconcave in the network weights, given that the activation function is ReLU. A natural question then arises: does quasiconcavity remain in deeper networks? Unfortunately, it does not hold in more complex neural networks due to one fact: the summation of quasiconvex (quasiconcave) functions is not necessarily quasiconvex (quasiconcave). Deeper networks can be regarded as weighted summations of many networks of the form in Equation 3 and, hence, are not quasiconcave anymore. Therefore, we aim to design new network structures that preserve quasiconcavity in more general neural networks.
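Theorem 1 can be sanity-checked numerically. The sketch below (ours; the random input $x$, dimension, and level $\alpha$ are arbitrary choices, not from the paper) samples weight pairs in a superlevel set of the ReLU building block and verifies that convex combinations of them stay in the set:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(1)
d = 3
x = rng.normal(size=d)  # arbitrary fixed input

def f(w1, w2):
    # the two-layer building block of Equation 3 with ReLU activation
    return relu(relu(x @ w1) * w2)

alpha = 0.5  # level of the superlevel set S_alpha (alpha > 0 case of the proof)

# Collect random weight pairs lying in S_alpha = {(w1; w2) : f(w1; w2) >= alpha}
pts = []
while len(pts) < 400:
    w1, w2 = rng.normal(size=d), rng.normal()
    if f(w1, w2) >= alpha:
        pts.append((w1, w2))

# Convexity of S_alpha: random convex combinations of members must stay in S_alpha
violations = 0
for (a1, a2), (b1, b2) in zip(pts[::2], pts[1::2]):
    lam = rng.uniform()
    if f(lam * a1 + (1 - lam) * b1, lam * a2 + (1 - lam) * b2) < alpha - 1e-9:
        violations += 1
print("superlevel-set violations:", violations)
```

Zero violations is exactly what the proof predicts: for $\alpha > 0$, membership in $S_\alpha$ forces $x^\top w_1 > 0$ and $w_2 > 0$, and the product of the interpolated factors stays above $\alpha$.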
To achieve this goal, we focus on the operations that preserve quasiconcavity, including (1) composition with a non-decreasing convex function, (2) the non-negative weighted minimum, and (3) the supremum over some variables. Among these operations, we choose the minimization procedure because it is easy to apply and has a simple gradient. Specifically, we apply a minimization pooling layer to integrate the simple networks of Equation 3, as shown in Figure 3. In doing so, we extend the network from the simplest building block to more general structures, where quasiconcavity is ensured by Lemma 2 and visually explained in Figure 4. Meanwhile, we note that the proposed network is still a shallow network: although stacking many layers with appropriate minimization pooling layers can also keep the entire network quasiconcave, too many minimization pooling layers damage the representation power of the neural network.

Lemma 2. Provided that $f_1, \dots, f_n$ are quasiconcave functions defined on the same domain, the non-negative weighted minimum $f := \min\{a_1 f_1, a_2 f_2, \dots, a_n f_n\}$ is quasiconcave for $a_1, \dots, a_n \in \mathbb{R}_+$.

Proof. The superlevel set $S_\alpha = \{x \in \operatorname{dom} f \mid f(x) \ge \alpha\}$ of $f$ can be written as

$$S_\alpha = \{x \in \operatorname{dom} f \mid \min\{a_1 f_1(x), \dots, a_n f_n(x)\} \ge \alpha\} = \bigcap_{i=1}^n \left\{x \in \operatorname{dom} f \,\middle|\, f_i(x) \ge \frac{\alpha}{a_i}\right\},$$

which is the intersection of the (convex) superlevel sets of $f_i$ $(i = 1, \dots, n)$ and is therefore convex.
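Lemma 2 can likewise be illustrated numerically. In this sketch (ours; the bell-shaped functions and weights are arbitrary choices, not from the paper), the non-negative weighted minimum of two quasiconcave functions keeps a convex (interval) superlevel set, while their plain sum does not:

```python
import numpy as np

# Two quasiconcave (bell-shaped) functions on the same 1-D domain
f1 = lambda x: np.exp(-(x - 1.0) ** 2)
f2 = lambda x: np.exp(-(x + 1.0) ** 2)

xs = np.linspace(-4, 4, 4001)

def superlevel_is_interval(vals, alpha):
    # a 1-D superlevel set is convex iff the grid points above alpha are contiguous
    idx = np.flatnonzero(vals >= alpha)
    return len(idx) == 0 or bool(np.all(np.diff(idx) == 1))

f_min = np.minimum(2.0 * f1(xs), 3.0 * f2(xs))  # non-negative weighted minimum (Lemma 2)
f_sum = f1(xs) + f2(xs)                          # summation does NOT preserve quasiconcavity

print("min superlevel set convex:", superlevel_is_interval(f_min, 0.5))
print("sum superlevel set convex:", superlevel_is_interval(f_sum, 1.0))
```

The minimum's superlevel set is the intersection of two intervals, matching the proof of Lemma 2, whereas the sum is bimodal and its superlevel set at level 1.0 splits into two disconnected pieces.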

5. QUASICONVEX OPTIMIZATION OF NEURAL NETWORK

In Section 4, we designed a neural network structure whose output $f(\theta)$ is a quasiconcave function of the network weights $\theta$. To further utilize quasiconcavity, in this section we propose to train the neural network by solving a quasiconvex optimization problem. Even though the function $f(\theta)$ is quasiconcave, the optimization problem in Equation 2 is not quasiconvex, since the $L_2$ loss is not monotonic. However, if we restrict the network output to be smaller than the labels, i.e., $f(\theta) \le y$, the $L_2$ loss is non-increasing in this range. Therefore, the resulting loss function in Equation 2, as a composition of a convex non-increasing function with a quasiconcave function, is quasiconvex. That is, the training of QCNN is an unconstrained quasiconvex optimization problem

$$P^\star = \min_\theta l(\theta) = \min_\theta \tfrac{1}{2}\,\|f(\theta) - y\|_2^2. \tag{4}$$

To solve the quasiconvex optimization problem in Equation 4, we can transform it into an equivalent convex feasibility problem. Let $\varphi_t(\theta) := y - t - f(\theta)$, $t \in \mathbb{R}$, be a family of convex functions satisfying $l(\theta) \le t \iff \varphi_t(\theta) \le 0$. Then, the quasiconvex optimization problem in Equation 4 can be equivalently considered as

$$\min_\theta l(\theta) \ \iff\ \min_{\theta, t}\ t \ \text{ s.t. } l(\theta) \le t \ \implies\ \text{find } \theta \ \text{ s.t. } \varphi_t(\theta) \le 0. \tag{5}$$

The problem in Equation 5 is a convex feasibility problem since the inequality constraint function is convex. For every given value $t$, we can solve the convex feasibility problem. If it is feasible, i.e., $\exists \theta, \varphi_t(\theta) \le 0$, this point $\theta$ is also feasible for the quasiconvex problem by satisfying $l(\theta) \le t$, which indicates that the optimal value satisfies $P^\star \le t$. In this case, we can reduce the value of $t$ and repeat the above procedure to approach the optimal value $P^\star$. On the other hand, if the convex feasibility problem is infeasible, we know that $P^\star \ge t$, and we should increase the value of $t$.
Through this procedure, the quasiconvex optimization problem in Equation 4 can be solved by bisection, i.e., by solving a convex feasibility problem at each step (Boyd et al. (2004)). The procedure is summarized as Algorithm 1.

Algorithm 1 QCNN Training Process
1: given $l \le P^\star$, $u \ge P^\star$, tolerance $\epsilon > 0$ ▷ lower/upper bounds of the optimal value
2: while $u - l > \epsilon$ do ▷ convergence criterion
3:   $t := (l + u)/2$
4:   Solve the convex feasibility problem in Equation 5
5:   if the problem in Equation 5 is feasible then
6:     $u := t$ ▷ $P^\star \le t$: tighten the upper bound
7:   else
8:     $l := t$ ▷ $P^\star \ge t$: tighten the lower bound
9:   end if
10: end while ▷ return the current feasible point $\theta$

Remark 1. The quasiconvex optimization problem in Equation 4 has zero duality gap. The proof follows by verifying that our unconstrained quasiconvex optimization problem in Equation 4 satisfies the condition in Fang et al. (2014). Therefore, the quasiconvex optimization problem could also be solved by exploring its dual problem.
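The bisection loop of Algorithm 1 can be sketched in a few lines. The feasibility oracle and the 1-D toy loss below are hypothetical stand-ins of ours (in actual QCNN training, `feasible` would solve the convex feasibility problem of Equation 5 over the network weights):

```python
def bisection_qcnn(feasible, lower, upper, eps=1e-6):
    """Algorithm 1: bisection on the optimal value P*.

    feasible(t) returns (ok, theta): whether {theta : phi_t(theta) <= 0}
    is non-empty, and a feasible point if so.
    """
    theta = None
    while upper - lower > eps:
        t = 0.5 * (lower + upper)
        ok, cand = feasible(t)
        if ok:
            upper, theta = t, cand   # P* <= t: tighten the upper bound
        else:
            lower = t                # P* >= t: tighten the lower bound
    return theta, upper

# Hypothetical toy problem (not the paper's network): the quasiconvex loss
# l(theta) = sqrt(|theta - 2|) + 1 has l(theta) <= t iff t >= 1 and
# |theta - 2| <= (t - 1)^2, an interval (convex) feasibility problem.
def feasible(t):
    if t < 1.0:
        return False, None           # the interval is empty
    return True, 2.0                 # theta = 2 always lies in the interval

theta, val = bisection_qcnn(feasible, lower=0.0, upper=10.0)
print(theta, val)  # theta -> 2.0, val -> P* = 1 (within eps)
```

Each iteration halves the bracket $[l, u]$ around $P^\star$, so reaching tolerance $\epsilon$ takes exactly $\lceil \log_2((u - l)/\epsilon) \rceil$ feasibility solves.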

6. EXPERIMENTS

We use the proposed framework in Section 5 to conduct several machine learning tasks, comparing QCNN to deep neural networks. Our experiments aim to validate the core benefits of QCNN: (1) the convexity, even in shallow networks, makes learning more accurate than non-convex deep networks in some tasks, and (2) the convexity enables the network to be more robust and converge faster.

6.1. FUNCTION APPROXIMATION

Since the purpose of neural networks can generally be seen as learning a mapping from input $x$ to label $y$, in this section we evaluate the performance of QCNN in approximating functions. Synthetic dataset. For the synthetic scenario, the dataset is generated by randomly sampling $x$ from a uniform distribution $\mathrm{Unif}(-1, 1)$ and calculating the corresponding label $y = f(x)$ for a given function $f$. We generate 1,000 samples for training and 200 samples for testing, where the mean square error (MSE) on the testing set is used to evaluate model performance. The results of approximating various functions are summarized in Figure 5. As we see, the performance of deep neural networks depends on the choice of the initial guess of the network weights. In the first two experiments (first two rows in Figure 5), the deep network appears to be trapped in a bad local optimum, which corresponds to a relatively large MSE. In the third experiment, the deep network arrives at a good local optimum; however, it still exhibits flaws at the non-differentiable points (turning points) of the function $f$. This matches the finding of Brady et al. (1989) that deep neural networks fail in simple cases of linearly separable classification tasks. On the contrary, although QCNN uses a shallow structure, its quasiconvex nature enables it to learn piecewise linear functions that approximate $f$. In many replications of the experiments, we find that the learning procedure of QCNN is more robust to the initial guess of network weights since it is quasiconvex. Moreover, QCNN demonstrates quicker convergence when learning the function $f$. Contour detection dataset. The ground-truth contour labels are also used to calculate two metrics, optimal dataset scale (ODS) and optimal image scale (OIS), to evaluate model performance. The comparison was performed against DeepNet (Kivinen et al. (2014)). In this experiment, we find that the performance of QCNN and DeepNet depends on the objects in the image.
For some objects with clear, angular contours, e.g., the phone in Figure 6(a), detecting the contour can be seen as learning a closed polygon whose edges are defined by piecewise linear functions. For such a class of images, the ODS of QCNN achieves 0.824 compared to 0.784 for DeepNet, while the OIS of QCNN achieves 0.831 compared to 0.798 for DeepNet. On the contrary, DeepNet has better accuracy in recognizing complex (e.g., highly non-linear) contours. In this class of images, the ODS of QCNN is 0.717 compared to 0.743 for DeepNet, while the OIS of QCNN is 0.729 compared to 0.760 for DeepNet. To conclude, QCNN with its quasiconvex structure still outperforms deep networks when the task involves characteristics related to convexity. Mass-damper system dataset. Aside from synthetic and irregular functions, we also learn functions that have physical meaning. Specifically, we consider the mass-damper system, which can be described as $\dot q = -D R D^\top M^{-1} q$. In the system, $q$ is a vector of momenta, $D$ is the incidence matrix of the system, $R$ is the diagonal matrix of damping coefficients for each line of the system, and $M$ is the diagonal matrix of node masses. Thus, we can set $y = \dot q$ and $x = q$ with the goal of learning the parameter matrix $-D R D^\top M^{-1}$. We simulate a 10-node system and obtain 6,000 samples from 1-min simulations with a step size of 0.01 s (Li et al. (2022)). Figure 6(b) shows the prediction error (MSE) over training epochs, where QCNN converges faster to a smaller error than deep neural networks. This is perhaps because the target parameter matrix of the system constructs a linear bridge between input $x$ and label $y$. Using the neural network to detect the change point $\lambda$ (introduced in Section 6.2) can be seen as classifying pre-change and post-change data using five samples in a shifted window. The classification threshold is chosen as $\alpha$, i.e., the maximum false alarm rate (Liao et al. (2016)).
The results over 1,000 Monte Carlo experiments are shown in Figure 7. As we see, QCNN shows a smaller average detection delay than deep neural networks. Meanwhile, QCNN appears less likely to report a spurious change, since its empirical false alarm rate is below that of deep networks and is mostly below the theoretical upper bound $\alpha$ (especially when $\alpha \to 0$). QCNN outperforms the deep neural network in this task because the transition of the distribution is abrupt, as shown in Figure 7 (Left). The abrupt change results in a non-differentiable/non-smooth point in the mapping to be learned, which is represented more efficiently by QCNN via piecewise linear functions.

7. CONCLUSION

In this work, we analyze the problem of convexity in neural networks. First, we observe that deep neural networks are not suitable for all tasks since the network is highly non-convex. A non-convex network can fail, i.e., become trapped in bad local optima with large errors, especially when the task involves convexity (e.g., linearly separable classification). This motivates us to design a convex structure for neural networks to ensure efficient training with performance guarantees. While convexity is broken by the multiplication of weights as well as non-linear activation functions, we manage to decompose the neural network into building blocks in which quasiconvexity is thoroughly studied. In each building block, we find that the multiplication involving ReLU outputs is quasiconcave in the network weights. To preserve quasiconcavity when such building blocks are integrated into a general network, we design minimization pooling layers. The proposed QuasiConvex shallow Neural Network (QCNN) can be equivalently trained by solving convex feasibility problems iteratively. With the quasiconvex structure, QCNN allows for efficient training with theoretical guarantees. We verify the proposed QCNN on several common machine learning tasks. The quasiconvex structure in QCNN demonstrates even better learning ability than non-convex deep networks in some tasks.



Figure 1: Proposed Method. (Left) The motivation and challenge of this study. (Middle) We design a quasiconvex neural network structure to efficiently train for optimal network weights in a quasiconvex optimization problem. The quasiconvexity is studied and preserved via special pooling layers. (Right) Unlike non-convex loss function, the quasiconvex loss function of our design allows for finding the global optima.

Figure 2: (Left) The function $f(w_1, w_2) = w_1 w_2$ has a saddle shape. (Middle) Constrained to the positive domain, i.e., $\operatorname{dom} f = \mathbb{R}^2_+$, the function becomes quasiconcave. (Right) The quasiconcave function always has convex superlevel sets.


Figure 3: The structure of quasiconvex shallow neural network (QCNN).

Figure 4: The minimization of quasiconcave functions on the same domain is still quasiconcave.

Figure 5: The performance of approximating functions using deep neural networks and QCNN: QCNN tends to behave better because the quasiconvex structure enables it to learn piecewise linear mappings more efficiently.

Figure 6: (a) QCNN works better in detecting angular contours, while deep networks are better at detecting complex contours. (b) MSE against training epochs in learning the mass-damper function.

6.2. CLASSIFICATION TASK

The experiments in Section 6.1 represent regression tasks. In this section, we further consider classification tasks, covering two major categories of machine learning applications. Changepoint detection of distributions. Finding the transition of the underlying distribution of a sequence has various applications in engineering fields, such as video surveillance (Sultani et al. (2018)), sensor networks (Xie & Siegmund (2013)), and infrastructure health monitoring (Liao et al. (2019)). Aside from engineering tasks, it is also important in many machine learning tasks, including speech recognition (Chowdhury et al. (2012)), sequence classification (Ahad & Davenport (2020)), and dataset shift diagnosis (Lu et al. (2016)). To simulate a sequence of measurements, we randomly generate the pre-change sequence from the normal distribution $\mathcal{N}(0, 0.2)$ and the post-change sequence from $\mathcal{N}(1, 0.1)$, where the change time is $\lambda = 50$. The time-series sequence is shown in the left part of Figure 7.

Figure 7: (Left) The change of the underlying distribution of the sequence. (Middle) The average detection delay in detecting the distribution change using QCNN and deep neural networks. (Right) The false alarm rate of detecting the distribution change using QCNN and deep neural networks.

Solar meters classification. The UMass Smart dataset (Laboratory for Advanced System Software (Accessed Sep. 2022)) contains approximately 600,000 meters from a U.S. city with a one-hour interval between meter readings (Cook et al. (2021)). Among these meters, around 1,973 have solar panels installed and are labeled as solar in the classification task; the remaining meters are labeled as non-solar. The average smart meter readings, including household electricity consumption and PV generation, as well as the household's address, are used as input data to classify whether a meter has solar panels. We randomly select 20,000 samples from this dataset for training and 1,000 samples for testing. Figure 8 shows the locations of all meters and solar meters. As we see, the meters with solar installed are concentrated in a roughly convex area instead of being spread over the entire area. Therefore, learning the classifier for solar meters from the address feature is equivalent to learning a convex domain. This could explain why the classification accuracy of QCNN (94.2%) outperforms that of deep networks (92.7%).

Figure 8: The locations of (solar) meters and the classification accuracy of using QCNN and deep neural network to classify the solar meters.

