ON THE LANDSCAPE OF SPARSE LINEAR NETWORKS

Anonymous

Abstract

Network pruning, which produces sparse networks, has a long history and practical significance in modern applications. Although the loss functions of neural networks can exhibit bad landscapes due to non-convexity, we focus on linear activations, for which dense networks are already known to have a benign landscape. Without unrealistic assumptions, we establish the following statements for the squared-loss objective of general sparse linear neural networks: 1) every local minimum is a global minimum for scalar output with any sparse structure, and also for a non-intersecting sparse first layer with dense remaining layers and orthogonal training data; 2) sparse linear networks can have sub-optimal local minima, arising either from the low-rank constraint induced by a sparse first layer, or, for output dimension at least three, from the global minimum of a sub-network. Overall, sparsity breaks the dense structure, cutting off decreasing paths that exist in the original fully-connected network.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved remarkable empirical success in computer vision, speech recognition, and natural language processing, sparking great interest in the theory behind their architectures and training. However, DNNs are often highly overparameterized, making them expensive in memory and computation; training on large datasets such as ImageNet (Deng et al., 2009) may take up to weeks even on a modern multi-GPU server. Hence, DNNs are often unsuitable for smaller devices such as embedded electronics, and there is a pressing demand for techniques that reduce model size, speed up inference, and lower power consumption. Sparse networks, that is, neural networks in which a large subset of the model parameters are zero, have emerged as one of the leading approaches for reducing parameter count. It has been shown empirically that deep neural networks can achieve state-of-the-art results under high levels of sparsity (Han et al., 2015b; Gale et al., 2019; Louizos et al., 2017a). Modern sparse networks are mainly obtained by network pruning (Zhu & Gupta, 2017; Lee et al., 2018; Liu et al., 2018; Frankle & Carbin, 2018), which has been the subject of a great deal of work in recent years. However, training a sparse network with a fixed sparsity pattern is difficult (Evci et al., 2019), and little theoretical understanding of general sparse networks is available. Previous work has analyzed deep neural networks and shown that the non-convexity of the associated loss functions may produce complicated optimization landscapes, but the properties of general sparse networks remain poorly understood. Saxe et al. (2013) empirically showed that the optimization of deep linear models exhibits properties similar to those of deep nonlinear models, and for theoretical development it is natural to begin with linear models before studying nonlinear ones (Baldi & Lu, 2012). In addition, several works (Sun et al., 2020) have shown that bad minima exist with nonlinear activations. Hence, it is natural to begin with linear activations to understand the impact of sparsity. In this article, we go further and consider the global landscape of general sparse linear neural networks. We emphasize that dense deep linear networks already satisfy the property that every local minimum is a global minimum under mild conditions (Kawaguchi, 2016; Lu & Kawaguchi, 2017), but the findings are different and more complicated for sparse linear networks. The goal of this paper is to study the relation between sparsity and local minima, with the following contributions:

• First, we show that every local minimum is a global minimum in the scalar-output case for any depth, any width, and any sparse structure. We also briefly show that a similar result holds for non-overlapping filters with orthogonal data features when sparsity occurs only in the first layer.

• Second, we show through analytic and numerical examples, built on a convergence analysis, that sparse connections can create sub-optimal local minima in the general non-scalar case. Such local minima arise in two situations: from a sub-sparse network whose global minimum is a local minimum of the original sparse network, and from a rank-deficient solution coupling different data features through sparse connections. Both cases verify that sparsity cuts off decreasing paths that exist in the original fully-connected network.

Overall, we hope our work contributes to a better understanding of the landscape of sparse networks in a simple setting and provides insights for future research. The remainder of the paper is organized as follows. In Section 2, we derive positive results for shallow sparse linear networks, which exhibit the same benign landscape as dense linear networks. In Section 3, we give several examples showing the existence of bad local minima in the non-scalar case. In Section 4, we briefly generalize the results from shallow to deep sparse linear networks. Some proofs are deferred to the Appendix.

1.1. RELATED WORK

There is a rapidly growing literature on the loss surfaces of neural network objectives; surveying all of it is well outside our scope, so we only discuss the works most related to ours.

Local minima are global. The study of linear network landscapes dates back to Baldi & Hornik (1989), who proved that shallow linear neural networks do not suffer from bad local minima. Kawaguchi (2016) generalized this result to deep linear neural networks, and several subsequent works (Arora et al., 2018; Du & Hu, 2019; Eftekhari, 2020) give direct algorithmic convergence results based on this benign property, though algorithm analysis is beyond the scope of this paper. The situation is more complicated with nonlinear activations. Multiple works (Ge et al., 2017; Safran & Shamir, 2018; Yun et al., 2018) show that spurious local minima can appear even in two-layer networks with population or empirical loss; some of these arguments are specific to two layers and difficult to generalize to multilayer settings. Another line of work (Arora et al., 2018; Allen-Zhu et al., 2018; Du & Hu, 2019; Du et al., 2018; Li et al., 2018; Mei et al., 2018) studies the landscape in overparameterized settings, finding a benign landscape with or without gradient methods. Since modern sparse networks retain few parameters compared to overparameterized ones, we instead seek a fundamental view of sparsity. Our standpoint is that spurious local minima can appear under specific sparsity patterns even in linear networks.

Sparse networks. Sparse networks (Han et al., 2015b;a; Zhu & Gupta, 2017; Frankle & Carbin, 2018; Liu et al., 2018) have a long history, but most results are empirical and mainly related to network pruning, which has practical importance for reducing parameter count and deploying models on diverse devices. However, training sparse networks from scratch is notoriously difficult. Frankle & Carbin (2018) recommend reusing the sparsity pattern found through pruning and training the sparse network from the same initialization as the original training (the 'lottery ticket') to obtain comparable performance and avoid bad solutions. For fixed sparsity patterns, Evci et al. (2019) attempted, and failed, to find a decreasing objective path from 'bad' solutions to 'good' ones within the sparse subspace, suggesting that bad local minima can be produced by pruning; we give a more direct view through simple examples. Moreover, several recent works propose methods (Molchanov et al., 2017; Louizos et al., 2017b; Lee et al., 2018; Carreira-Perpinán & Idelbayev, 2018) for choosing weights or sparse structures while retaining performance. On the theoretical side, Malach et al. (2020) prove that a sufficiently over-parameterized neural network with random weights contains a subnetwork with roughly the same accuracy as a target network, providing a guarantee that 'good' sparse networks exist. Some works analyze convolutional networks (Shalev-Shwartz et al., 2017; Du et al., 2018) as a specific sparse structure. Brutzkus & Globerson (2017) analyze non-overlapping and overlapping structures as we do, but with weight sharing to simulate CNN-type structure, under a teacher-student setting with population risk. We do not restrict to CNN-type networks but work with general sparse, though still linear, networks to obtain direct results.

2.1. PRELIMINARIES AND NOTATION

We use bold-faced letters (e.g., $\mathbf{w}$ and $\mathbf{a}$) to denote vectors and capital letters (e.g., $W = [w_{ij}]$ and $A = [a_{ij}]$) for matrices. Let $P_X$ be the orthogonal projection onto the column space of the matrix $X$, and let $\lambda_i(H)$ denote the $i$-th smallest eigenvalue of a real symmetric matrix $H$. The training samples and their outputs are $\{(x_i, y_i)\}_{i=1}^n \subset \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$, possibly drawn from an unknown distribution $\mathcal{D}$. We form the data matrices $X = [x_1, \dots, x_n]^T \in \mathbb{R}^{n \times d_x}$ and $Y = [y_1, \dots, y_n]^T \in \mathbb{R}^{n \times d_y}$, respectively. In Sections 2 and 3, we consider a two-layer (sparse) linear neural network with squared loss:
$$\min_{W, A}\ L(W, A) := \tfrac{1}{2}\|Y - XWA\|_F^2,$$
where the first-layer weight matrix is $W = [w_1, \dots, w_d] \in \mathbb{R}^{d_x \times d}$ and the second-layer weight matrix is $A = [a_1, \dots, a_d]^T \in \mathbb{R}^{d \times d_y}$. After weight pruning or under a sparsity constraint, many weight parameters are zero and are not updated during retraining. We write $S_j := \{k : w_{kj} = 0\}$ for the pruned dimensions in the $j$-th column of $W$, and $-S_j := S_j^c = [d_x] \setminus S_j$, where $[d] := \{1, \dots, d\}$. In addition, $w_{j,S}$ denotes the sub-vector of $w_j$ restricted to the positions in $S$, and $X_S$ the sub-matrix of $X$ with column indices in $S$. We let $p_j = d_x - |S_j|$, where $|S|$ is the cardinality of the set $S$; then $w_{j,-S_j} \in \mathbb{R}^{p_j}$ is the remaining $j$-th first-layer column after removing the pruned dimension set $S_j$. Similarly, $X_{-S_j} \in \mathbb{R}^{n \times p_j}$ is the remaining data matrix connected to the $j$-th node of the first layer. Finally, for simplicity, we write $X_{-j} = X_{-S_j}$, $w_{-j} = w_{j,-S_j}$, and $\widetilde{(\cdot)}$ for a pruned layer weight whose zero elements are never updated, when no ambiguity arises. Before we begin, a small note on the sparse structures we consider: there may be unnecessary connections and nodes, such as a node with zero out-degree, which can be traced back and excluded from the final layer to the first layer; other cases are shown in Appendix C. Thus we do not consider such cases in the subsequent proofs and assume that each data dimension has a valid output connection, i.e., $\cap_{j=1}^d S_j = \emptyset$.
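The masked objective above can be written down directly. Below is a minimal sketch (function and variable names are ours, not the paper's): a fixed binary mask encodes the sparsity pattern, so the pruned entries of $W$ stay zero and are never updated.

```python
import numpy as np

def sparse_loss(X, Y, W, A, mask):
    """Squared loss 1/2 * ||Y - X (W*mask) A||_F^2 of a two-layer linear net
    with a fixed first-layer sparsity pattern applied via `mask`."""
    W_masked = W * mask                  # zero out the pruned sets S_j
    residual = Y - X @ W_masked @ A
    return 0.5 * np.sum(residual ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))          # n = 5 samples, d_x = 3
Y = rng.standard_normal((5, 2))          # d_y = 2
W = rng.standard_normal((3, 2))          # d = 2 hidden nodes
A = rng.standard_normal((2, 2))
mask = np.array([[1, 0], [1, 1], [0, 1]])  # e.g. S_1 = {3}, S_2 = {1}

print(sparse_loss(X, Y, W, A, mask))
```

Only the unmasked entries of $W$ would be treated as free variables during training; the mask itself never changes.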

2.2. SCALAR CASE

In the scalar case, $d_y = 1$, and we simplify $A = (a_1, \dots, a_d)^T$. When a weight $a_i$ in the second layer is pruned, the output of the $i$-th node in the first layer contributes zero to the final output, so $w_i$ can also be pruned. Without loss of generality, we assume the second-layer parameters are not pruned. After pruning several parameters, the original problem becomes
$$\min_{w_{-i}, a_i}\ L(\widetilde{W}, A) := \frac{1}{2}\left\|Y - (X_{-1}, \dots, X_{-d})\begin{pmatrix} a_1 w_{-1} \\ \vdots \\ a_d w_{-d}\end{pmatrix}\right\|_F^2. \qquad (2)$$

Theorem 1 For a two-layer linear neural network with scalar output and any sparse structure, every local minimum is a global minimum.

Proof: From Eq. (2), if a local minimizer satisfies $a_i = 0$ for some $1 \le i \le d$, then the second-order condition for a local minimum gives
$$\begin{pmatrix} \frac{\partial^2 L}{\partial a_i^2} & \frac{\partial^2 L}{\partial a_i \partial w_{-i}^T} \\[2pt] \frac{\partial^2 L}{\partial w_{-i}\partial a_i} & \frac{\partial^2 L}{\partial w_{-i}\partial w_{-i}^T}\end{pmatrix} \succeq 0,$$
which at $a_i = 0$ reads
$$\begin{pmatrix} w_{-i}^T X_{-i}^T X_{-i} w_{-i} & -\big(Y - \sum_{j=1}^d X_{-j} w_{-j} a_j\big)^T X_{-i} \\[2pt] -X_{-i}^T\big(Y - \sum_{j=1}^d X_{-j} w_{-j} a_j\big) & 0 \end{pmatrix} \succeq 0. \qquad (4)$$
Since the lower-right block is zero, positive semi-definiteness forces the off-diagonal block to vanish: $X_{-i}^T\big(Y - \sum_{j=1}^d X_{-j} w_{-j} a_j\big) = 0$, which is exactly the global minimizer condition for $w_{-i} a_i$. Otherwise $a_i \ne 0$, and the first-order condition
$$\frac{\partial L}{\partial w_{-i}} = -a_i X_{-i}^T\Big(Y - \sum_{j=1}^d X_{-j} w_{-j} a_j\Big) = 0$$
again yields $X_{-i}^T\big(Y - \sum_{j=1}^d X_{-j} w_{-j} a_j\big) = 0$, the global minimizer condition for $w_{-i} a_i$. Hence every local minimum is a global minimum.
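Theorem 1 can be sanity-checked numerically. The sketch below (our construction; the mask, sizes, and step size are arbitrary choices) runs plain gradient descent on a randomly masked scalar-output network; since every local minimum is global, the loss should not get stuck above the unconstrained least-squares optimum over the column space reachable through the mask.

```python
import numpy as np

rng = np.random.default_rng(1)
n, dx, d = 8, 4, 3
X = rng.standard_normal((n, dx))
Y = rng.standard_normal((n, 1))
mask = (rng.random((dx, d)) < 0.5).astype(float)
for i in range(dx):
    mask[i, i % d] = 1.0        # every data dimension keeps a valid connection

W = 0.5 * rng.standard_normal((dx, d))
a = 0.5 * rng.standard_normal((d, 1))

def loss(W, a):
    return 0.5 * np.sum((Y - X @ (W * mask) @ a) ** 2)

init_loss = loss(W, a)
lr = 5e-3
for _ in range(30000):
    R = Y - X @ (W * mask) @ a            # residual
    gW = (X.T @ R @ a.T) * mask           # only unpruned entries move
    ga = (W * mask).T @ X.T @ R
    W, a = W + lr * gW, a + lr * ga

# Global minimum: least squares over col(X), since every column of X
# reaches the output through some unpruned path.
P = X @ np.linalg.pinv(X)
global_min = 0.5 * np.sum(((np.eye(n) - P) @ Y) ** 2)
print(loss(W, a), global_min)   # the first value should approach the second
```

This is only an empirical illustration of the theorem, not a proof; convergence speed depends on the step size and initialization.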

2.3. NON-SCALAR CASE WITH DENSE SECOND LAYER

Now we discuss the case of non-scalar outputs. Since general sparse structures are intractable, we first consider pruning only the first layer while keeping the second layer dense. The remaining problem is formulated as
$$\min_{w_{-i}, a_i}\ L(\widetilde{W}, A) := \frac{1}{2}\Big\|Y - \sum_{i=1}^d X_{-i} w_{-i} a_i^T\Big\|_F^2. \qquad (5)$$
Intuitively, if we can separate the weight parameters into $d$ parts, then results for dense linear networks still guarantee no bad local minima. We show that a non-overlapping first layer (disjoint feature extractors, as the left graph of Figure 1 depicts) together with orthogonal data features meets this requirement.

Theorem 2 For a two-layer sparse linear neural network with dense second layer, assume that $X$ has full column rank and $X_{-i}^T X_{-j} = 0$ for all $i \ne j$. Then every local minimum is global.

Proof: Since $X_{-i}^T X_{-j} = 0$ for $i \ne j$ and $X$ has full column rank, $X_{-i}$ and $X_{-j}$ share no columns. Additionally, from our assumption $\cap_{j=1}^d S_j = \emptyset$, we have $\cup_{j=1}^d (-S_j) = [d_x]$, meaning that $(X_{-1}, \dots, X_{-d})$ is $X$ up to a rearrangement of columns. Hence,
$$Y = P_X Y + (I - P_X)Y = X(X^TX)^{-1}X^TY + (I - P_X)Y = \sum_{i=1}^d X_{-i} Z_i + (I - P_X)Y, \qquad (6)$$
where $Z_i = \big[(X^TX)^{-1}X^TY\big]_{-S_i}$ is the sub-matrix with row indices in $-S_i$. Then we only need to consider the objective
$$\min_{w_{-i}, a_i}\ L(\widetilde{W}, A) = \frac{1}{2}\Big\|\sum_{i=1}^d X_{-i}\big(Z_i - w_{-i}a_i^T\big)\Big\|_F^2 = \frac{1}{2}\sum_{i=1}^d \big\|X_{-i}\big(Z_i - w_{-i}a_i^T\big)\big\|_F^2. \qquad (7)$$

[Figure 1: Sparse two-layer structures with inputs $x_1, x_2, x_3$, two hidden nodes $h^{(1)}_1, h^{(1)}_2$, and outputs $y_1, y_2, y_3$; the left graph shows a non-overlapping sparse first layer with $w_4 = 0$ and $w_8 = 0$.]


[Figure 2: The right graph shows a deep scalar-output network whose later layers distribute each node's output uniformly (e.g., weights 1/3, 1/2, 1).]

We see that the objective is already separated into $d$ parts, each a two-layer dense linear network with a full-column-rank data matrix. Based on Theorem 2.2 in Lu & Kawaguchi (2017) or the Eckart-Young-Mirsky theorem (Eckart & Young, 1936; Mirsky, 1960), we obtain that every local minimum is a global minimum.
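The separation in the proof of Theorem 2 can be illustrated concretely. In the sketch below (our toy construction), $X = I$ gives orthonormal, non-overlapping features, so the loss decouples into $d$ independent rank-1 approximation problems whose optimal values follow from the Eckart-Young-Mirsky theorem.

```python
import numpy as np

n = dx = 4
X = np.eye(n)
Y = np.arange(1.0, 13.0).reshape(n, 3)           # d_y = 3 target
blocks = [np.array([0, 1]), np.array([2, 3])]    # -S_1 and -S_2: disjoint features

# Optimal loss of each block: squared singular values beyond the first
opt = 0.0
for S in blocks:
    Z = Y[S, :]                                  # Z_i restricted to the block rows
    s = np.linalg.svd(Z, compute_uv=False)
    opt += 0.5 * np.sum(s[1:] ** 2)

# Plug the per-block best rank-1 factors back into the full network
W = np.zeros((dx, 2))
A = np.zeros((2, 3))
for i, S in enumerate(blocks):
    U, s, Vt = np.linalg.svd(Y[S, :])
    W[S, i] = s[0] * U[:, 0]
    A[i, :] = Vt[0, :]

net_loss = 0.5 * np.sum((Y - X @ W @ A) ** 2)
print(net_loss, opt)                             # the two values coincide
```

Because the hidden nodes see disjoint rows of $Y$ here, solving each block separately is exactly as good as solving the coupled network.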


Additionally, we note that the assumption of non-overlapping filters in the first layer relates to convolutional networks (Brutzkus & Globerson, 2017) when weight sharing is used. Otherwise, we will show in Section 3 that bad local minima exist when the first layer overlaps or the training data is not orthogonal.

2.4. GENERAL CASE WITH d y = 2

The previous findings give positive results for one-dimensional outputs, or for specific sparse structures under data assumptions. In this subsection, we discuss arbitrary sparse structures with output dimension $d_y = 2$. We first prove that certain connections preserve the benign landscape and can be removed.

Theorem 3 For a sparse two-layer network, a hidden node with full out-degree and in-degree one can be removed, provided the remaining structure is optimized with suitably projected data; this has no influence on spurious local minima.

This result simplifies any sparse structure containing a hidden node with one connection to the input and full connections to the output. Next, we provide another reduction, for hidden nodes with only one connection to the output, when sparsity is applied to both layers with $d_y = 2$. Denote the final-layer outputs as Node 1 and Node 2, the sets of hidden nodes with only one connection to the output layer as $R_1$ and $R_2$ (according to which output they connect to), and the remaining fully connected hidden nodes as $R$. Set $T_1 = \cap_{j \in R_1} S_j$ and $T_2 = \cap_{j \in R_2} S_j$. We define $U(T) = \{j : w_{ij} \ne 0 \text{ for some } i \in T\}$ as the set of hidden nodes connected to data features in $T$. When the hidden node sets connected to $T_1$ and $T_2$ satisfy the condition below, we can simplify the sparse structure to the dense-second-layer case.

Theorem 4 For a sparse two-layer linear network with $d_y = 2$, if $U(T_1 \cap T_2) \cap U(T_1 \setminus T_2) = \emptyset$ and $U(T_1 \cap T_2) \cap U(T_2 \setminus T_1) = \emptyset$, then there is a sub-network with dense second layer, optimized with projected training data, that shares the same local minima for the remaining parameters.

The formal proof can be found in Appendix E. Additionally, from the proof of Theorem 4, even when the assumption does not hold, the objective can be converted into two objectives with weight sharing in the first layer. Weight-sharing structures have been studied in related work (Shalev-Shwartz et al., 2017; Brutzkus & Globerson, 2017), so we omit a detailed treatment and leave it as future work.

Now, for a sparse two-layer linear network with $d_y = 2$, we focus on the case with a dense second layer. If a hidden node has in-degree one, then by Theorem 3 we can remove it and consider the objective under projected data. Therefore, each hidden node should have in-degree at least two. Since a single hidden node obviously yields no bad local minima, the least sparse structure of interest has two hidden nodes with eight connections in total (e.g., the two constructions in Figure 1). We show the existence of bad local minima for this setting in Section 3.

2.5. GENERAL CASE WITH d y ≥ 3

We finish this section by showing that the global minimum of a sub-network may yield a spurious local minimum of the original sparse network when $d_y \ge 3$.

Theorem 5 For two-layer sparse networks with output dimension $d_y \ge 3$, there exists a spurious local minimum that is a global minimum of a sub-network.

Proof: We consider the sparse structure in Figure 2 with only eight connections. The objective is
$$\min_{w_i}\ L(w_1, \dots, w_8) := \frac{1}{2}\left\|Y - X\begin{pmatrix} w_1 & 0 \\ w_2 & w_3 \\ 0 & w_4\end{pmatrix}\begin{pmatrix} w_5 & w_6 & 0 \\ 0 & w_7 & w_8\end{pmatrix}\right\|_F^2 = \frac{1}{2}\left\|Y - X\begin{pmatrix} w_1w_5 & w_1w_6 & 0 \\ w_2w_5 & w_2w_6 + w_3w_7 & w_3w_8 \\ 0 & w_4w_7 & w_4w_8\end{pmatrix}\right\|_F^2.$$
Let $X = I_3$ and $Y = \begin{pmatrix} 1 & 2 & 0 \\ 2 & 10 & 0 \\ 0 & 0 & 4\end{pmatrix}$. Clearly $X$ and $Y$ have full column rank, the common assumption in previous work. Then
$$\begin{pmatrix} w_1 & 0 \\ w_2 & w_3 \\ 0 & w_4\end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 2 & 2 \\ 0 & 0\end{pmatrix}, \qquad \begin{pmatrix} w_5 & w_6 & 0 \\ 0 & w_7 & w_8\end{pmatrix} = \begin{pmatrix} 1 & 2 & 0 \\ 0 & 3 & 0\end{pmatrix}$$
satisfies $\nabla L = 0$ with $L(w_1, \dots, w_8) = 8$. In addition, for any small perturbations $\delta_i$, $i = 1, \dots, 8$,
$$\begin{aligned} 2L(w_1 + \delta_1, \dots, w_8 + \delta_8) &= (\delta_1 + \delta_5 + \delta_1\delta_5)^2 + (2\delta_1 + \delta_6 + \delta_1\delta_6)^2 + (\delta_2 + 2\delta_5 + \delta_2\delta_5)^2 \\ &\quad + (2\delta_2 + 2\delta_6 + \delta_2\delta_6 + 3\delta_3 + 2\delta_7 + \delta_3\delta_7)^2 \\ &\quad + (2+\delta_3)^2\delta_8^2 + (3+\delta_7)^2\delta_4^2 + (\delta_4\delta_8 - 4)^2 \\ &\ge (2+\delta_3)^2\delta_8^2 + (3+\delta_7)^2\delta_4^2 + (\delta_4\delta_8 - 4)^2 \\ &\ge 2\big[(2+\delta_3)(3+\delta_7) - 4\big]|\delta_4\delta_8| + 16. \end{aligned}$$
Since the perturbations $\delta_i$ are small, $(2+\delta_3)(3+\delta_7) - 4 > 0$. Hence $L(w_1+\delta_1, \dots, w_8+\delta_8) \ge 8$, verifying that the point is a local minimizer. However, when
$$\begin{pmatrix} \tilde w_1 & 0 \\ \tilde w_2 & \tilde w_3 \\ 0 & \tilde w_4\end{pmatrix} = \begin{pmatrix} \sqrt{10}/5 & 0 \\ \sqrt{10} & 0 \\ 0 & 2\end{pmatrix}, \qquad \begin{pmatrix} \tilde w_5 & \tilde w_6 & 0 \\ 0 & \tilde w_7 & \tilde w_8\end{pmatrix} = \begin{pmatrix} \sqrt{10}/5 & \sqrt{10} & 0 \\ 0 & 0 & 2\end{pmatrix},$$
we obtain $L(\tilde w_1, \dots, \tilde w_8) = 0.18 < 8$, so the local minimizer above is not a global minimum. Note that the local minimizer (with $w_4 = w_8 = 0$) is exactly the global minimum of the sub-network obtained by removing those two connections.

3. EXISTENCE OF BAD LOCAL MINIMA

Notice that we have no rank constraint for the $Z_i$ in Eq. (5).

Algorithm 1 Sparse-2-Opt: obtain a solution of the two-layer sparse linear network with two hidden neurons.
1: Input: matrices $Z_1, Z_2$ and coupling matrix $D$.
2: Initialize $w_1, a_1, d_1, w_2, a_2, d_2$.
3: while not converged do
4:    $w_1, d_1, a_1 = \mathrm{SVD}\big(Z_1 + D(Z_2 - d_2 w_2 a_2^T)\big)$;
5:    $w_2, d_2, a_2 = \mathrm{SVD}\big(Z_2 + D^T(Z_1 - d_1 w_1 a_1^T)\big)$;
6: end while
7: $w_1 = d_1 w_1$, $w_2 = d_2 w_2$.
8: if $\lambda_1(\nabla^2 L), \lambda_2(\nabla^2 L) \approx 0$ and $\lambda_3(\nabla^2 L) > 0$ then
9:    Return the solution $w_1, a_1, w_2, a_2$.
10: else
11:    Try again from another initialization.
12: end if
13: Output: $w_1, a_1, w_2, a_2$.
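The eight-connection counterexample from Theorem 5 can be verified directly. The sketch below plugs in the values from the proof: the stationary point attains loss 8 (the global minimum of the sub-network with $w_4 = w_8 = 0$), while another feasible point of the full sparse pattern achieves a strictly smaller loss.

```python
import numpy as np

X = np.eye(3)
Y = np.array([[1.0, 2.0, 0.0],
              [2.0, 10.0, 0.0],
              [0.0, 0.0, 4.0]])

def loss(W, A):
    return 0.5 * np.sum((Y - X @ W @ A) ** 2)

# Claimed local minimizer (global minimum of the sub-network with w4 = w8 = 0)
W1 = np.array([[1.0, 0.0], [2.0, 2.0], [0.0, 0.0]])
A1 = np.array([[1.0, 2.0, 0.0], [0.0, 3.0, 0.0]])

# A feasible point of the full sparse pattern with smaller loss
r = np.sqrt(10.0)
W2 = np.array([[r / 5, 0.0], [r, 0.0], [0.0, 2.0]])
A2 = np.array([[r / 5, r, 0.0], [0.0, 0.0, 2.0]])

print(loss(W1, A1))   # 8.0
print(loss(W2, A2))   # ~0.18
```

Both points respect the sparsity pattern (zeros at $W_{12}, W_{31}, A_{13}, A_{21}$), so the gap between them is entirely due to the landscape of the sparse objective.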
Suppose the singular value decomposition of $X_{-i}$ is $X_{-i} = U_i D_i V_i^T$ with $U_i \in \mathbb{R}^{n \times p_i}$, $D_i \in \mathbb{R}^{p_i \times p_i}$, $V_i \in \mathbb{R}^{p_i \times p_i}$. Since $D_i$ has full rank, we take $D_i V_i^T Z_i$ and $D_i V_i^T w_{-i}$ as new targets and variables. With a slight abuse of notation, the problem becomes
$$\min_{\widetilde{W}, A}\ \frac{1}{2}\Big\|\sum_{i=1}^d U_i\big(Z_i - w_i a_i^T\big)\Big\|_F^2.$$
In the following, we show that $d = 2$ already suffices for counterexamples. Similarly, using the singular value decomposition $U_1^T U_2 = \bar U \bar D \bar V^T$ with a rectangular diagonal matrix $\bar D \in \mathbb{R}^{p_1 \times p_2}$, and noting that $U_1, U_2$ have orthonormal columns, we get $\bar D_{ii} \le 1$, and $|\{i : \bar D_{ii} = 1\}|$ equals the number of overlapping columns between $X_{-1}$ and $X_{-2}$. Finally, absorbing $\bar U$ and $\bar V$ into the variables and writing $D$ for $\bar D$, the objective becomes
$$L(w_1, w_2, a_1, a_2) = \frac{1}{2}\|Z_1 - w_1a_1^T\|_F^2 + \frac{1}{2}\|Z_2 - w_2a_2^T\|_F^2 + \mathrm{tr}\big[(Z_1 - w_1a_1^T)^T D (Z_2 - w_2a_2^T)\big].$$
If we fix $w_2$ and $a_2$, then $w_1$ and $a_1$ form the best rank-1 approximation of $Z_1 + D(Z_2 - w_2a_2^T)$, since they solve $\arg\min_{w_1,a_1} \|Z_1 + D(Z_2 - w_2a_2^T) - w_1a_1^T\|_F^2$. Similarly, $w_2$ and $a_2$ form the best rank-1 approximation of $Z_2 + D^T(Z_1 - w_1a_1^T)$. Empirically, we use this alternating update over the two blocks to find solutions (Algorithm 1), where $\mathrm{SVD}(\cdot)$ is a classical method returning the largest singular value and the corresponding singular vectors. Since each update does not increase the loss, the sequence $(w_1, a_1, w_2, a_2)$ converges. Once the algorithm converges, the first-order condition is satisfied and two eigenvectors of the Hessian with zero eigenvalue are determined. Moreover, we can prove that the convergent solution is indeed a local minimum (see Appendix B for details); otherwise, we verify a local minimum using gradient descent or another optimization method initialized with noise, if necessary. Based on Algorithm 1, we found several cases with bad local minima, including the overlapping case ($\exists i,\ D_{ii} = 1$). The results are shown in Table 1.
We observe distinct gaps between the local minima because the entries we chose for the $Z_i$ are small. In the non-overlapping setting, the algorithm reaches a local minimum quickly and yields several distinct examples. In the overlapping setting, a simple construction leaves the repeated feature out by placing zero entries in the $Z_i$, though Row 3 of Table 1 also exhibits a bad local minimum with genuinely overlapping data features. It is interesting to note that for $d = 2$ at most two local minima were found; the alternating update method extends readily to general $d$ (Appendix D), where we observe a similar phenomenon, at most $d$ local minima for a sparse-first-layer network with $d$ hidden nodes, whose verification we leave as future work. Overall, sparsity breaks the original matrix structure, inducing an additional low-rank constraint in this case, and again cuts off the decreasing paths of the original fully-connected network.

Table 1: Bad local minima found by Algorithm 1, where $\lambda_i := \lambda_i(\nabla^2 L)$. Each data setting $(Z_1, Z_2, D)$ admits two distinct local minima.

$Z_1 = [1\ 1;\ 1\ 0]$, $Z_2 = [1\ 1;\ 1\ {-1}]$, $D = \mathrm{diag}(0.5, 0.9)$:
  $\lambda_3 = 2.1\cdot10^{-1}$, $\lambda_1, \lambda_2 \sim 10^{-14}$, $\|\nabla L\|^2 < 10^{-14}$, $L = 0.5143043518476$
  $\lambda_3 = 1.4\cdot10^{-1}$, $\lambda_1, \lambda_2 \sim 10^{-14}$, $\|\nabla L\|^2 < 10^{-14}$, $L = 0.6781647585271$
$Z_1 = [{-2}\ 0;\ 0\ {-1}]$, $Z_2 = [0\ 1;\ {-2}\ 2]$, $D = \mathrm{diag}(0.8, 0.1)$:
  $\lambda_3 = 3.4\cdot10^{-1}$, $\lambda_1, \lambda_2 \sim 10^{-14}$, $\|\nabla L\|^2 < 10^{-14}$, $L = 0.5373672988360$
  $\lambda_3 = 1.7\cdot10^{-1}$, $\lambda_1, \lambda_2 \sim 10^{-14}$, $\|\nabla L\|^2 < 10^{-14}$, $L = 0.6805528480352$
$Z_1 = [{-1}\ 0;\ 1\ 1;\ {-1}\ 0]$, $Z_2 = [0\ 0;\ 1\ 1;\ {-2}\ 0]$, $D = \mathrm{diag}(1, 0.6, 0.8)$ (overlapping):
  $\lambda_3 = 1.7\cdot10^{-1}$, $\lambda_1, \lambda_2 \sim 10^{-14}$, $\|\nabla L\|^2 < 10^{-14}$, $L = 0.8980944246693$
  $\lambda_3 = 2.5\cdot10^{-1}$, $\lambda_1, \lambda_2 \sim 10^{-14}$, $\|\nabla L\|^2 < 10^{-14}$, $L = 0.4712847600704$

Additionally, a descent algorithm may still diverge to infinity: the example in Appendix A gives a sequence diverging to infinity along which the function values decrease and converge.
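The alternating rank-1 update of Algorithm 1 can be sketched as follows (our implementation; the data below is Row 1 of Table 1 as we parse it, and the converged value depends on the initialization). Each step replaces $(w_i, a_i)$ by the best rank-1 approximation of $Z_i$ plus the coupled residual of the other block, so the loss never increases.

```python
import numpy as np

def loss(Z1, Z2, D, w1, a1, w2, a2):
    R1, R2 = Z1 - w1 @ a1.T, Z2 - w2 @ a2.T
    return 0.5 * np.sum(R1 ** 2) + 0.5 * np.sum(R2 ** 2) + np.trace(R1.T @ D @ R2)

def best_rank1(M):
    """Return (w, a) with w a^T the best rank-1 approximation of M."""
    U, s, Vt = np.linalg.svd(M)
    return s[0] * U[:, :1], Vt[:1, :].T

def sparse2opt(Z1, Z2, D, seed=0, iters=500):
    rng = np.random.default_rng(seed)
    w1, a1, w2, a2 = (rng.standard_normal((2, 1)) for _ in range(4))
    history = []
    for _ in range(iters):
        w1, a1 = best_rank1(Z1 + D @ (Z2 - w2 @ a2.T))
        w2, a2 = best_rank1(Z2 + D.T @ (Z1 - w1 @ a1.T))
        history.append(loss(Z1, Z2, D, w1, a1, w2, a2))
    return history

Z1 = np.array([[1.0, 1.0], [1.0, 0.0]])
Z2 = np.array([[1.0, 1.0], [1.0, -1.0]])
D = np.diag([0.5, 0.9])
hist = sparse2opt(Z1, Z2, D)
print(hist[-1])   # converged objective value for this initialization
```

Running from several random seeds and comparing the converged values is how one would reproduce the multiple-local-minima phenomenon reported in Table 1.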

4. LANDSCAPE OF DEEP SPARSE LINEAR NETWORKS

In this section, we briefly extend Theorems 1 and 2 to deep sparse linear networks; the proofs are in Appendix F. The intuition is that deep linear networks have landscape properties similar to the shallow case (Lu & Kawaguchi, 2017; Eftekhari, 2020). However, understanding the landscape of an arbitrary deep sparse linear network remains complicated.

Theorem 6 For a deep sparse linear neural network with scalar output ($d_y = 1$) and any sparse structure, every local minimum is a global minimum.

The proof proceeds by induction based on shallow linear networks. The theorem shows that sparsity introduces no bad local minima for scalar targets. In addition, we give a simple construction of a global minimizer.

How to obtain a global minimizer in the scalar case: Set the first-layer weights to the global minimizer of the two-layer case with $a_i = 1$, then let the remaining layers uniformly distribute the output of each node to the next layer: if a node has $k$ outgoing connections, each connection is assigned weight $1/k$. The sum at each layer then remains the best approximation of the target $Y$ (see the right graph of Figure 2 for an example).

Theorem 7 For a deep sparse linear neural network with sparse first layer and dense remaining layers, assume that $X$ and $Y$ have full column rank and $X_{-i}^T X_{-j} = 0$ for all $i \ne j$. If $d_i \ge \min\{d_1, d_y\}$ for all $i \ge 1$, where $d_i$ is the hidden width of the $i$-th layer, then every local minimum is a global minimum.

Note that under our assumption $d_i \ge \min\{d_1, d_y\}$ for all $i \ge 1$, the deep linear network we study attains the same optimal value as the shallow linear network once the first-layer weights are fixed. Hence, the optimal value of our objective equals that of the shallow network problem.
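The uniform-distribution construction above can be checked with a toy example (sizes and weights below are illustrative choices of ours): taking a two-layer least-squares solution and letting every later layer split each node's output uniformly leaves the end-to-end map, and hence the loss, unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))
Y = rng.standard_normal((6, 1))

# Two-layer global minimizer spread across 3 hidden nodes (a_i = 1 implicit)
w_star = np.linalg.lstsq(X, Y, rcond=None)[0]   # least-squares first layer
W1 = (w_star @ np.ones((1, 3))) / 3.0           # each node carries 1/3 of it

# Deeper layers distribute each node's output uniformly:
W2 = np.full((3, 2), 1.0 / 2.0)   # every node has 2 outgoing connections -> 1/2
w3 = np.ones((2, 1))              # each node has 1 outgoing connection -> 1

deep_loss = 0.5 * np.sum((Y - X @ W1 @ W2 @ w3) ** 2)
shallow_loss = 0.5 * np.sum((Y - X @ w_star) ** 2)
print(deep_loss, shallow_loss)    # identical up to floating point
```

Since $W_1 W_2 w_3 = w^\star$ by construction, the deep network inherits the optimal value of the shallow problem, matching the statement of Theorem 6.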

5. DISCUSSION

We have discussed the landscape of sparse linear networks through several arguments. On the positive side, spurious local minima do not exist for scalar targets, or for a separated first layer with orthogonal training data. On the negative side, we have shown bad local minima when these conditions are violated in a general sparse two-layer linear network: one type is generated by a low-rank constraint, the other by a sub-sparse structure. Both cases show that sparsity cuts off decreasing paths of the original fully-connected network. Since dense linear networks possess a benign landscape, we conclude that sparsity, or network pruning, can destroy favourable solutions. Nevertheless, heuristic algorithms that combine training and pruning still work well in practice, which leaves the success of modern pruning methods and sparse network design somewhat mysterious. Interesting questions for future research include understanding the gap between global minima and spurious local minima, or showing that bad local minima can still perform comparably, particularly in combination with pruning algorithms.

A DECREASING PATH OF SPARSE LINEAR NETWORK WITH SPARSE FIRST LAYER

In addition, there still exists a decreasing path to infinity. Consider
$$\min_{w_i}\ L(w_1, \dots, w_8) := \frac{1}{2}\left\|Y - X\begin{pmatrix} w_1 & 0 \\ w_2 & w_3 \\ 0 & w_4\end{pmatrix}\begin{pmatrix} w_5 & w_6 \\ w_7 & w_8\end{pmatrix}\right\|_F^2 = \frac{1}{2}\left\|Y - X\begin{pmatrix} w_1w_5 & w_1w_6 \\ w_2w_5 + w_3w_7 & w_2w_6 + w_3w_8 \\ w_4w_7 & w_4w_8\end{pmatrix}\right\|_F^2 \qquad (9)$$
with $X = I_3$, $Y = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1\end{pmatrix}$. Choose
$$(w_1, w_2, w_3, w_4, w_5, w_6, w_7, w_8) = \Big(-\tfrac{1}{\sqrt{k}},\ -\sqrt{k},\ 1,\ 1,\ \tfrac{1}{\sqrt{k}},\ 0,\ 1,\ 1\Big), \qquad k \in \mathbb{N}_+.$$
Then $2L(w_1, \dots, w_8) = \big(\tfrac{1}{k} + 1\big)^2 > 1$ decreases as $k$ increases. Since $\min_{w_i} L(w_1, \dots, w_8) = 0$, we obtain a decreasing path along which the parameters diverge to infinity but which does not approach a global minimum.
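The decreasing path can be checked numerically: as $k$ grows the loss decreases toward $1/2$ while the parameters diverge, yet the infimum 0 is attained elsewhere (the attaining point below is one explicit choice of ours).

```python
import numpy as np

Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def loss_at(k):
    """Loss along the path from Appendix A (X = I_3)."""
    w1, w2, w3, w4 = -1 / np.sqrt(k), -np.sqrt(k), 1.0, 1.0
    w5, w6, w7, w8 = 1 / np.sqrt(k), 0.0, 1.0, 1.0
    W = np.array([[w1, 0.0], [w2, w3], [0.0, w4]])
    A = np.array([[w5, w6], [w7, w8]])
    return 0.5 * np.sum((Y - W @ A) ** 2)

losses = [loss_at(k) for k in (1, 10, 100, 1000)]
print(losses)                       # strictly decreasing toward 0.5

# The global minimum 0 is attained, e.g. by this point respecting the pattern:
W0 = np.array([[1.0, 0.0], [-1.0, 1.0], [0.0, 1.0]])
A0 = np.array([[1.0, 0.0], [1.0, 1.0]])
exact = 0.5 * np.sum((Y - W0 @ A0) ** 2)
print(exact)                        # 0.0
```

So the path decreases and converges in function value while never approaching the global minimum, exactly the behavior described above.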

B ALGORITHM ANALYSIS

We establish the algorithm guarantee as follows. First, since the objective does not increase at each update step, the algorithm converges. Second, we verify that the convergent solution $(w_1^*, a_1^*, w_2^*, a_2^*)$ has zero gradient. Recall the first-order conditions:
$$\begin{aligned} -\frac{\partial L}{\partial w_1} &= \big[Z_1 - w_1a_1^T + D(Z_2 - w_2a_2^T)\big]a_1 = \big[Z_1 + D(Z_2 - w_2a_2^T)\big]a_1 - (a_1^Ta_1)w_1, \\ -\frac{\partial L}{\partial a_1} &= \big[Z_1 - w_1a_1^T + D(Z_2 - w_2a_2^T)\big]^Tw_1 = \big[Z_1 + D(Z_2 - w_2a_2^T)\big]^Tw_1 - (w_1^Tw_1)a_1, \\ -\frac{\partial L}{\partial w_2} &= \big[Z_2 - w_2a_2^T + D^T(Z_1 - w_1a_1^T)\big]a_2 = \big[Z_2 + D^T(Z_1 - w_1a_1^T)\big]a_2 - (a_2^Ta_2)w_2, \\ -\frac{\partial L}{\partial a_2} &= \big[Z_2 - w_2a_2^T + D^T(Z_1 - w_1a_1^T)\big]^Tw_2 = \big[Z_2 + D^T(Z_1 - w_1a_1^T)\big]^Tw_2 - (w_2^Tw_2)a_2. \end{aligned} \qquad (10)$$
Notice that $(w_1^*, a_1^*)$ is the best rank-1 approximation of $Z_1 + D(Z_2 - w_2^*a_2^{*T})$, and $(w_2^*, a_2^*)$ is the best rank-1 approximation of $Z_2 + D^T(Z_1 - w_1^*a_1^{*T})$; hence the conditions above hold and $(w_1^*, a_1^*, w_2^*, a_2^*)$ has zero gradient. Third, we verify that the convergent solution is a local minimizer through the conditions we checked. Set $r_1 = Z_1 + D(Z_2 - w_2a_2^T)$ and $r_2 = Z_2 + D^T(Z_1 - w_1a_1^T)$. Then
$$\nabla^2L(w_1, a_1, w_2, a_2) = \begin{pmatrix} (a_1^Ta_1)I_{p_1} & -r_1 + 2w_1a_1^T & (a_1^Ta_2)D & Dw_2a_1^T \\ (-r_1 + 2w_1a_1^T)^T & (w_1^Tw_1)I_{d_y} & a_2w_1^TD & (w_1^TDw_2)I_{d_y} \\ (a_1^Ta_2)D^T & D^Tw_1a_2^T & (a_2^Ta_2)I_{p_2} & -r_2 + 2w_2a_2^T \\ a_1w_2^TD^T & (w_1^TDw_2)I_{d_y} & (-r_2 + 2w_2a_2^T)^T & (w_2^Tw_2)I_{d_y} \end{pmatrix}. \qquad (11)$$
Set $H^* := \nabla^2L(w_1^*, a_1^*, w_2^*, a_2^*)$. Observe that
$$\big(w_1^{*T}, -a_1^{*T}, 0^T, 0^T\big)H^* = 0, \qquad \big(0^T, 0^T, w_2^{*T}, -a_2^{*T}\big)H^* = 0,$$
showing that $H^*$ has a zero eigenvalue with at least the two eigenvectors $v_1 = (w_1^{*T}, -a_1^{*T}, 0^T, 0^T)^T$ and $v_2 = (0^T, 0^T, w_2^{*T}, -a_2^{*T})^T$. Suppose the third smallest eigenvalue satisfies $\lambda_3 \ge \epsilon > 0$. Then any direction $v$ with $\|v\|_2 = 1$ decomposes as $v = \alpha_1\hat v_1 + \alpha_2\hat v_2 + \alpha_3\hat v_3$ with $v_3 \perp v_1$, $v_3 \perp v_2$, $\sum_{i=1}^3\alpha_i^2 = 1$, where $\hat w := w/\|w\|_2$. If $\alpha_3 \ne 0$, then $v^THv = \alpha_3^2\,\hat v_3^TH\hat v_3 \ge \alpha_3^2\lambda_3 \ge \alpha_3^2\epsilon > 0$.

Otherwise, we set $v = \delta_1v_1 + \delta_2v_2$ with small perturbations $\delta_1, \delta_2$; the perturbed parameters are
$$(\tilde w_1, \tilde a_1, \tilde w_2, \tilde a_2) = \big((1-\delta_1)w_1^*,\ (1+\delta_1)a_1^*,\ (1-\delta_2)w_2^*,\ (1+\delta_2)a_2^*\big),$$
which yields
$$\begin{aligned} L(\tilde w_1, \tilde a_1, \tilde w_2, \tilde a_2) &= \tfrac{1}{2}\big\|X_1\big(Z_1 - (1-\delta_1^2)w_1a_1^T\big) + X_2\big(Z_2 - (1-\delta_2^2)w_2a_2^T\big)\big\|_F^2 \\ &= \tfrac{1}{2}\big\|X_1(Z_1 - w_1a_1^T) + X_2(Z_2 - w_2a_2^T)\big\|_F^2 + \tfrac{1}{2}\big\|\delta_1^2X_1w_1a_1^T + \delta_2^2X_2w_2a_2^T\big\|_F^2 \\ &\quad + \delta_1^2\,\mathrm{tr}\big[(w_1a_1^T)^T(r_1 - w_1a_1^T)\big] + \delta_2^2\,\mathrm{tr}\big[(w_2a_2^T)^T(r_2 - w_2a_2^T)\big] \\ &= L(w_1, a_1, w_2, a_2) + \tfrac{1}{2}\big\|\delta_1^2X_1w_1a_1^T + \delta_2^2X_2w_2a_2^T\big\|_F^2 \ \ge\ L(w_1, a_1, w_2, a_2), \end{aligned}$$
where the third equality holds because the trace terms vanish by the rank-1 optimality of the solution. Hence, the convergent solution is a local minimizer. Fourth, due to numerical error, we cannot obtain the exact convergent solution, but we can obtain an approximate solution $(w_1^t, a_1^t, w_2^t, a_2^t)$ after $t$ iterations with $L(w_1^t, a_1^t, w_2^t, a_2^t) - L(w_1^*, a_1^*, w_2^*, a_2^*) \le \epsilon^2$, and then use Weyl's inequality (Safran & Shamir, 2018, Theorem 2):
$$\big|\lambda_i\big(\nabla^2L(w_1^t, a_1^t, w_2^t, a_2^t)\big) - \lambda_i\big(\nabla^2L(w_1^*, a_1^*, w_2^*, a_2^*)\big)\big| < O(\epsilon),$$
where $\lambda_i(H)$ is the $i$-th smallest eigenvalue of the real symmetric matrix $H$. Therefore, if the approximate solution is approximately positive semi-definite with a sufficiently large third smallest eigenvalue, we conclude that the convergent solution is a local minimizer.
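The null-direction claim can be probed numerically. The sketch below (our check; the data $Z_1, Z_2, D$ are illustrative choices) runs the alternating updates to convergence, then confirms via finite-difference Hessian-vector products that $v_1 = (w_1, -a_1, 0, 0)$ and $v_2 = (0, 0, w_2, -a_2)$ are (near-)null directions of the Hessian at the convergent point.

```python
import numpy as np

Z1 = np.array([[1.0, 1.0], [1.0, 0.0]])
Z2 = np.array([[1.0, 1.0], [1.0, -1.0]])
D = np.diag([0.5, 0.9])

def unpack(t):
    return t[0:2, None], t[2:4, None], t[4:6, None], t[6:8, None]

def grad(t):
    """Analytic gradient of the reduced objective, matching Eq. (10)."""
    w1, a1, w2, a2 = unpack(t)
    r1 = Z1 - w1 @ a1.T + D @ (Z2 - w2 @ a2.T)
    r2 = Z2 - w2 @ a2.T + D.T @ (Z1 - w1 @ a1.T)
    return -np.concatenate([(r1 @ a1).ravel(), (r1.T @ w1).ravel(),
                            (r2 @ a2).ravel(), (r2.T @ w2).ravel()])

def best_rank1(M):
    U, s, Vt = np.linalg.svd(M)
    return s[0] * U[:, :1], Vt[:1, :].T

# Run the alternating updates to (numerical) convergence
rng = np.random.default_rng(3)
w1, a1, w2, a2 = (rng.standard_normal((2, 1)) for _ in range(4))
for _ in range(5000):
    w1, a1 = best_rank1(Z1 + D @ (Z2 - w2 @ a2.T))
    w2, a2 = best_rank1(Z2 + D.T @ (Z1 - w1 @ a1.T))

theta = np.concatenate([w1, a1, w2, a2]).ravel()
v1 = np.concatenate([w1, -a1, 0 * w2, 0 * a2]).ravel()
v2 = np.concatenate([0 * w1, 0 * a1, w2, -a2]).ravel()

def hess_vec(v, h=1e-5):
    """Central finite difference of the gradient gives H @ v."""
    return (grad(theta + h * v) - grad(theta - h * v)) / (2 * h)

print(np.linalg.norm(grad(theta)),
      np.linalg.norm(hess_vec(v1)), np.linalg.norm(hess_vec(v2)))
```

All three printed norms should be tiny at a fixed point of the iteration, consistent with the two exact zero eigenvalues derived above.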

C USELESS CONNECTIONS AND NODES IN SPARSE NETWORK

In this section, we explain several kinds of unnecessary connections that arise from sparsity or network pruning.


[Figure 3: An example of a sparse network with no bias. Lines are connections of the original sparse network; dotted lines are useless connections that can be removed, and solid lines are effective connections.]

1. Zero out-degree I: if a node has zero out-degree, such as $h^{(2)}_1$ in Figure 3, we can eliminate its input connections.

2. Zero out-degree II: if a node has zero out-degree once output connections in later layers are removed, such as $h^{(1)}_1$ in Figure 3. Although it has one output connection, the connected node $h^{(2)}_1$ has zero out-degree, so that connection can be removed, leaving $h^{(1)}_1$ with zero out-degree. We can then eliminate the input connections of $h^{(1)}_1$ as well.

3. Zero in-degree I: if a node has zero in-degree, such as $h^{(2)}_4$ and $h^{(1)}_4$ in Figure 3, we can eliminate its output connections.

4. Zero in-degree II: if a node has zero in-degree once input connections in earlier layers are removed, such as $h^{(2)}_3$ in Figure 3. Although it has one input connection, the connected node $h^{(1)}_4$ has zero in-degree, so that connection can be removed, leaving $h^{(2)}_3$ with zero in-degree. We can then eliminate the output connections of $h^{(2)}_3$ as well.

D GENERAL d BLOCKS ALGORITHM

Algorithm 2 Sparse-$d$-Opt$(X_1, \dots, X_d, Z_1, \dots, Z_d)$: obtain a solution of the two-layer sparse linear network with $d$ hidden neurons.
1: Input: matrices $X_1, \dots, X_d$ and $Z_1, \dots, Z_d$.
2: Initialize $w_i, a_i, d_i$, $i = 1, \dots, d$.
3: while not converged do
4:    for $i = 1, \dots, d$ do
5:       $w_i, d_i, a_i = \mathrm{SVD}\big(Z_i + \sum_{j \ne i} X_i^TX_j(Z_j - d_jw_ja_j^T)\big)$;
6:    end for
7: end while
8: $w_i = d_iw_i$, $i = 1, \dots, d$.
9: if $\lambda_1(\nabla^2L), \dots, \lambda_d(\nabla^2L) \approx 0$ and $\lambda_{d+1}(\nabla^2L) > 0$ then
10:    Return the solution $w_i, a_i$, $i = 1, \dots, d$.
11: else
12:    Try again from another initialization.
13: end if
14: Output: $w_i, a_i$, $i = 1, \dots, d$.

The analysis that the convergent solution is a local minimizer is similar to the $d = 2$ case, so we do not repeat the details. We list some examples found for $d > 2$ below.

$d = 3$: Targets: $Z_1 = \begin{pmatrix}0 & 0\\0 & 1\end{pmatrix}$, $Z_2 = \begin{pmatrix}1 & 0\\1 & 1\end{pmatrix}$, $Z_3 = \begin{pmatrix}1 & 0\\-1 & -1\end{pmatrix}$. Training data: $X^TX = [\,1.0 \ \ 0.0 \ \ {-0.}\dots\,]$. Results ($\lambda_i := \lambda_i(\nabla^2L)$):

  $\lambda_{d+1} = 1.9\cdot10^{-1}$, $\lambda_1, \dots, \lambda_d \sim 10^{-14}$, $\|\nabla L\|^2 < 10^{-14}$, $L = 0.357481957$
  $\lambda_{d+1} = 9.7\cdot10^{-2}$, $\lambda_1, \dots, \lambda_d \sim 10^{-14}$, $\|\nabla L\|^2 < 10^{-14}$, $L = 0.521705675$
  $\lambda_{d+1} = 4.9\cdot10^{-2}$, $\lambda_1, \dots, \lambda_d \sim 10^{-14}$, $\|\nabla L\|^2 < 10^{-14}$, $L = 0.539730382$

E MISSING PROOFS FOR SECTION 2

E.1 PROOF OF THEOREM 3

Proof: Suppose the $j$-th node has full out-degree and in-degree one, so that $w_{-j} \in \mathbb{R}$. We fix the other weights and only optimize over $(w_j, a_j)$:
$$\min_{w_j, a_j}\ \ell(w_j, a_j) = \frac{1}{2}\big\|\bar Y - X_{-j}w_ja_j^T\big\|_F^2, \qquad \bar Y := Y - \sum_{i \ne j}X_{-i}w_{-i}a_i^T.$$
Based on the proof of the scalar case, a local minimizer $(w_j^*, a_j^*)$ of $\ell(w_j, a_j)$ must satisfy $X_{-j}^T\big(\bar Y - X_{-j}w_j^*a_j^{*T}\big) = 0$, showing that $\ell(w_j^*, a_j^*) = \frac{1}{2}\big\|(I - P_{X_{-j}})\bar Y\big\|_F^2$. Therefore, the objective over the remaining weights becomes
$$\min_{w_{-i}, a_i, i \ne j}\ \frac{1}{2}\Big\|\big(I_n - P_{X_{-j}}\big)Y - \sum_{i \ne j}\big(I_n - P_{X_{-j}}\big)X_{-i}w_{-i}a_i^T\Big\|_F^2.$$
We define $(I_n - P_{X_{-j}})X_{S_j}$ and $(I_n - P_{X_{-j}})Y$ as the new training data, which is the projection onto the orthogonal complement of the column space of $X_{-j}$, and remove the elements of each $w_{-i}$ corresponding to the column in $X_{-j}$. Moreover, if $X$ has full column rank, then the projected data $(I_n - P_{X_{-j}})X_{S_j}$ still has full column rank.
Hence, removing the above connections does not affect spurious local minima, since these connections preserve the corresponding solutions.

E.2 PROOF OF THEOREM 4

Proof: The original loss function can be formulated as

2L(W̃, A) = ||Y_1 − Σ_{i∈R_1} X_{−i} w_{−i} a_{i1} − Σ_{j∈R} X_{−j} w_{−j} a_{j1}||_F^2 + ||Y_2 − Σ_{i∈R_2} X_{−i} w_{−i} a_{i2} − Σ_{j∈R} X_{−j} w_{−j} a_{j2}||_F^2.

Under a similar analysis to the scalar case,

∀ i ∈ R_1: X_{−i}^T (Y_1 − Σ_{i∈R_1} X_{−i} w_{−i} a_{i1} − Σ_{j∈R} X_{−j} w_{−j} a_{j1}) = 0,
∀ i ∈ R_2: X_{−i}^T (Y_2 − Σ_{i∈R_2} X_{−i} w_{−i} a_{i2} − Σ_{j∈R} X_{−j} w_{−j} a_{j2}) = 0.

Hence,

[w_{−i_1} a_{i_1 1}; . . . ; w_{−i_j} a_{i_j 1}; . . . ; w_{−i_{|R_1|}} a_{i_{|R_1|} 1}] = [X_{−i_1}, . . . , X_{−i_j}, . . . , X_{−i_{|R_1|}}]^+ (Y_1 − Σ_{j∈R} X_{−j} w_{−j} a_{j1}), i_j ∈ R_1,
[w_{−i_1} a_{i_1 2}; . . . ; w_{−i_j} a_{i_j 2}; . . . ; w_{−i_{|R_2|}} a_{i_{|R_2|} 2}] = [X_{−i_1}, . . . , X_{−i_j}, . . . , X_{−i_{|R_2|}}]^+ (Y_2 − Σ_{j∈R} X_{−j} w_{−j} a_{j2}), i_j ∈ R_2.

Then the objective becomes:

||(I_n − P_{X_{−T_1}})(Y_1 − Σ_{j∈R} X_{−j} w_{−j} a_{j1})||_2^2 + ||(I_n − P_{X_{−T_2}})(Y_2 − Σ_{j∈R} X_{−j} w_{−j} a_{j2})||_2^2.

We can see the objective separates into two parts with shared sparse first-layer weights. Notice that if i ∉ T_1, then X_i is a column of X_{−T_1}; hence (I_n − P_{X_{−T_1}}) X_i = 0. Therefore, we simplify the problem as

min_{W̃, A} L(W̃, A) = (1/2) ||(I_n − P_{X_{−T_1}})(Y_1 − Σ_{i∈T_1} X_i Σ_{j: i∉S_j} w_{ij} a_{j1})||_2^2 + (1/2) ||(I_n − P_{X_{−T_2}})(Y_2 − Σ_{i∈T_2} X_i Σ_{j: i∉S_j} w_{ij} a_{j2})||_2^2.

Using the previous analysis again on T_1 \ T_2 in the first output dimension and on T_2 \ T_1 in the second output dimension (there is no overlap in parameters by the conditions U(T_1 ∩ T_2) ∩ U(T_1 \ T_2) = ∅ and U(T_1 ∩ T_2) ∩ U(T_2 \ T_1) = ∅), we simplify the problem again as

min_{W̃, A} L(W̃, A) = (1/2) ||(I_n − P_{(I_n − P_{X_{−T_1}}) X_{T_1\T_2}})(I_n − P_{X_{−T_1}})(Y_1 − Σ_{i∈T_1∩T_2} X_i Σ_{j: i∉S_j} w_{ij} a_{j1})||_2^2 + (1/2) ||(I_n − P_{(I_n − P_{X_{−T_2}}) X_{T_2\T_1}})(I_n − P_{X_{−T_2}})(Y_2 − Σ_{i∈T_1∩T_2} X_i Σ_{j: i∉S_j} w_{ij} a_{j2})||_2^2,

using the fact that (I_n − P_{W_1})(I_n − P_{W_2}) = I_n − P_{(W_1, W_2)} if W_1^T W_2 = 0.
Hence the remaining problem is the same as

min_{W̃, A} L(W̃, A) = (1/2) Σ_{k=1}^{2} ||(I_n − P_{X_{−(T_1 ∩ T_2)}})(Y_k − Σ_{i∈T_1∩T_2} X_i Σ_{j: i∉S_j} w_{ij} a_{jk})||_2^2.

Therefore, the remaining network structure has a dense second layer.
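The projection identity invoked at the end of the proof of Theorem 4, (I_n − P_{W_1})(I_n − P_{W_2}) = I_n − P_{(W_1, W_2)} when W_1^T W_2 = 0, admits a quick numerical sanity check; `proj` is our own helper, not notation from the paper.

```python
import numpy as np

def proj(W):
    # orthogonal projector onto the column space of W
    return W @ np.linalg.pinv(W)

rng = np.random.default_rng(0)
n = 6
W1 = rng.standard_normal((n, 2))
# force W2's columns into the orthogonal complement of W1's column space
W2 = (np.eye(n) - proj(W1)) @ rng.standard_normal((n, 2))
assert np.allclose(W1.T @ W2, 0)   # the hypothesis W1^T W2 = 0

lhs = (np.eye(n) - proj(W1)) @ (np.eye(n) - proj(W2))
rhs = np.eye(n) - proj(np.hstack([W1, W2]))
assert np.allclose(lhs, rhs)
```

The identity fails for non-orthogonal W_1, W_2, which is why the proof first projects X_{T_2\T_1} onto the complement of X_{−T_2}.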

F MISSING PROOFS FOR SECTION 4

The objective of a deep linear network with squared loss is

min_{W^{(1)}, . . . , W^{(L)}} (1/2) ||Y − X W^{(1)} ⋯ W^{(L)}||_F^2,

where the i-th layer weight matrix W^{(i)} ∈ R^{d_{i−1} × d_i}, d_0 = d_x, d_L = d_y, and the data matrices are X ∈ R^{n × d_x}, Y ∈ R^{n × d_y}. We adopt S_j^{(i)} = {k : W_{kj}^{(i)} = 0} as the pruned dimensions in the j-th column of W^{(i)}. Besides, W_{j,−S}^{(i)} denotes the remaining j-th column of the i-th layer weights after leaving out the pruned dimension set S. For simplicity, we denote w_{−j}^{(i)} = W_{j,−S_j^{(i)}}^{(i)} ∈ R^{d_{i−1} − |S_j^{(i)}|}, w_{jk}^{(i)} = W_{jk}^{(i)}, and W̃^{(i)} as the pruned weight matrix with several zero elements, as before.
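To make the masked objective concrete, the following sketch evaluates the squared loss for fixed sparsity patterns; the helper name and the 0/1-mask encoding of the pruned sets S_j^{(i)} are our own, not from the paper.

```python
import numpy as np

def deep_sparse_loss(X, Y, weights, masks):
    """Squared loss 0.5 * ||Y - X W~(1) ... W~(L)||_F^2 for a deep linear
    network with fixed sparsity: entry (k, j) of layer i is forced to zero
    whenever k lies in the pruned set S_j^(i), i.e. masks[i][k, j] == 0."""
    prod = np.eye(X.shape[1])
    for W, M in zip(weights, masks):
        prod = prod @ (W * M)   # apply the fixed sparsity pattern layer by layer
    return 0.5 * np.linalg.norm(Y - X @ prod) ** 2
```

With all-ones masks this reduces to the usual dense deep linear objective.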

F.1 PROOF OF THEOREM 6

Proof: We use induction. Based on Theorem 1, the two-layer case is already proved. Assuming the result holds for (L−1)-layer sparse linear networks, we consider the L-layer case. Denote X_new := X W̃^{(1)} as a new training set, and ε := Y − X W̃^{(1)} ⋯ W̃^{(L)}. Then, by the inductive assumption, ε^T X_new = 0, showing that ε^T X_{−i} w_{−i}^{(1)} = 0 for all 1 ≤ i ≤ d_1. Combined with the first-order condition:

∂L/∂w_{−i}^{(1)} = −X_{−i}^T ε (W̃^{(2)} ⋯ W̃^{(L)})_i = 0.  (17)

If (W̃^{(2)} ⋯ W̃^{(L)})_i ≠ 0, then ε^T X_{−i} = 0, which is the global minimizer condition. Otherwise, any value of w_{−i}^{(1)} does not change the loss, since this forward path already contributes zero to the final output; hence an arbitrary choice of w_{−i}^{(1)} attains the same objective value. Thus, from Eq. (17), we still obtain ε^T X_{−i} = 0. Therefore any local minimum is a global minimum of the pruned sparse model.
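The first-order condition used in this induction can be sanity-checked numerically in the dense special case. Below is a sketch with our own variable names, verifying ∂L/∂W^{(1)} = −X^T ε (W^{(2)} W^{(3)})^T by central finite differences for L = 3 and scalar output, as in Theorem 6.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, d1, d2, dy = 8, 5, 4, 3, 1   # scalar output
X = rng.standard_normal((n, dx))
Y = rng.standard_normal((n, dy))
W1 = rng.standard_normal((dx, d1))
W2 = rng.standard_normal((d1, d2))
W3 = rng.standard_normal((d2, dy))

def loss(W1):
    return 0.5 * np.linalg.norm(Y - X @ W1 @ W2 @ W3) ** 2

eps_res = Y - X @ W1 @ W2 @ W3            # the residual called epsilon above
analytic = -X.T @ eps_res @ (W2 @ W3).T   # dL/dW1; row i involves (W2 W3)_i

numeric = np.zeros_like(W1)
h = 1e-6
for i in range(dx):
    for j in range(d1):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += h
        Wm[i, j] -= h
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * h)

assert np.allclose(analytic, numeric, atol=1e-4)
```

In the sparse case the same gradient restricts to the unpruned rows, which is exactly Eq. (17).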

F.2 PROOF OF THEOREM 7

Proof: Since for all i ≠ j, X_{−i} and X_{−j} share no common columns and X^T X = I_d, we have X_{−i}^T X_{−j} = 0 for all i ≠ j. Besides, from our assumption ∩_{j=1}^m S_j = ∅, we get ∪_{j=1}^m S_j^c = {1, . . . , d}, meaning that (X_{−1}, . . . , X_{−d_1}) is X with a different arrangement of columns. Hence

Y = P_X Y + (I − P_X) Y = X (X^T X)^{−1} X^T Y + (I − P_X) Y = Σ_{i=1}^{d_1} X_{−i} Z_i + (I − P_X) Y, where Z_i := X_{−i}^T Y.  (18)

Set W̃^{(2)} = [a_1, . . . , a_{d_1}]^T; then the objective becomes (1/2) Σ_{i=1}^{d_1} ||Z_i − w_{−i} a_i^T W^{(3)} ⋯ W^{(L)}||_F^2, up to the additive constant (1/2)||(I − P_X) Y||_F^2. We set X̃ = X W̃^{(1)} = (X_{−1} w_{−1}, . . . , X_{−d_1} w_{−d_1}). Now we show the following problems have the same local minimizer condition for w_{−1}:

(P) L(W̃^{(1)}, W^{(2)}, . . . , W^{(L)}) = (1/2) ||Y − X̃ W^{(2)} ⋯ W^{(L)}||_F^2,
(P1) L_2(W̃, A) = (1/2) ||Y − Σ_{i=1}^{d_1} X_{−i} w_{−i} a_i^T||_F^2.

If there is a local minimizer with w_{−1}, . . . , w_{−d_1} ≠ 0 for problem (P): since d_i ≥ min{d_1, d_y} for all i ≥ 1 and X, Y have full column rank, then based on Theorem 2.3 in Lu & Kawaguchi (2017), a local minimizer of L(W̃^{(1)}, W^{(2)}, . . . , W^{(L)}) is attained when W^{(2)} ⋯ W^{(L)} = (X̃^T X̃)^{−1} X̃^T Y. Notice that X̃^T X̃ = diag(w_{−1}^T w_{−1}, . . . , w_{−d_1}^T w_{−d_1}). Then the objective simplifies (dropping the constant term ||(I_n − P_X) Y||_F^2) to

2L(W̃^{(1)}) = ||Y − X̃ (X̃^T X̃)^{−1} X̃^T Y||_F^2 = ||Y − Σ_{i=1}^{d_1} (X_{−i} w_{−i})(X_{−i} w_{−i})^T Y / (w_{−i}^T w_{−i})||_F^2 = Σ_{i=1}^{d_1} ||X_{−i} Z_i − (X_{−i} w_{−i})(X_{−i} w_{−i})^T X_{−i} Z_i / (w_{−i}^T w_{−i})||_F^2 = Σ_{i=1}^{d_1} ||X_{−i} Z_i − X_{−i} w_{−i} w_{−i}^T Z_i / (w_{−i}^T w_{−i})||_F^2.

For problem (P1), similarly, a local minimizer of L_2(W̃, A) is attained when (X_{−j} w_{−j})^T (Y − Σ_{i=1}^{d_1} X_{−i} w_{−i} a_i^T) = 0. Then a_j^T = (X_{−j} w_{−j})^T Y / (w_{−j}^T w_{−j}), giving the same loss objective:

2L_2(W̃) = ||Y − Σ_{i=1}^{d_1} (X_{−i} w_{−i})(X_{−i} w_{−i})^T Y / (w_{−i}^T w_{−i})||_F^2 = Σ_{i=1}^{d_1} ||X_{−i} Z_i − (X_{−i} w_{−i})(X_{−i} w_{−i})^T X_{−i} Z_i / (w_{−i}^T w_{−i})||_F^2 = 2L(W̃^{(1)}).

Finally, based on Theorem 2, every local minimum of (P1) is a global minimum; hence every local minimum of (P) is a global minimum. If there exists i_0 such that w_{−i_0} = 0, we show that Z_{i_0} = 0 below.
Without loss of generality, we assume i_0 = 1. Then the value of a_1 does not affect the objective, so we take a_1 = 0 as well. To show the result, we perturb only w_{−1}, a_1, W^{(3)}, . . . , W^{(L)} into w_{−1} + Δw, a_1 + Δa, W^{(3)} + Δ_3, . . . , W^{(L)} + Δ_L and analyze the resulting change in loss ΔL. We set

ΔW := Π_{i=3}^{L} (W^{(i)} + Δ_i) − Π_{i=3}^{L} W^{(i)},  W_o := Π_{i=3}^{L} W^{(i)}.

Then the perturbation leads to

2ΔL(Δw, Δa, ΔW) = ||Z_1 − Δw Δa^T (W_o + ΔW)||_F^2 − ||Z_1||_F^2 + Σ_{i≠1} (||Z_i − w_{−i} a_i^T (W_o + ΔW)||_F^2 − ||Z_i − w_{−i} a_i^T W_o||_F^2)
= −2 tr[Z_1^T Δw Δa^T (W_o + ΔW)] + ||Δw Δa^T (W_o + ΔW)||_F^2 − 2 Σ_{i≠1} tr[ΔW^T a_i w_{−i}^T (Z_i − w_{−i} a_i^T W_o)] + Σ_{i≠1} ||w_{−i} a_i^T ΔW||_F^2.  (23)

Applying the first case to the remaining parameters excluding w_{−1} and a_1 (if several w_{−i} are zero, we can leave them all out), we have a_i^T W_o = (X_{−i} w_{−i})^T Y / (w_{−i}^T w_{−i}) = w_{−i}^T Z_i / (w_{−i}^T w_{−i}), which gives w_{−i}^T (Z_i − w_{−i} a_i^T W_o) = 0 for i ≠ 1. Hence the cross term Σ_{i≠1} tr[ΔW^T a_i w_{−i}^T (Z_i − w_{−i} a_i^T W_o)] in Eq. (23) vanishes. Next, consider the first-order term in Δw: we claim tr[Z_1^T Δw Δa^T (W_o + ΔW)] = 0. Otherwise, taking ||Δw|| = Θ(t^{−1}), ||Δa|| = Θ(t^{−1}), ||ΔW|| = Θ(t^{−3}) as t → ∞, the sign of the expansion of Eq. (23) is dominated by the first term, which is indefinite (flipping the sign of Δw flips its sign), contradicting local minimality. Therefore Δa^T (W_o + ΔW) Z_1^T = 0 for all admissible Δa, so (W_o + ΔW) Z_1^T = 0, leading to W_o Z_1^T = 0 and ΔW Z_1^T = 0. In view of the expression for ΔW, it holds that

ΔW Z_1^T = Σ_{i=3}^{L} W^{(3)} ⋯ W^{(i−1)} Δ_i W^{(i+1)} ⋯ W^{(L)} Z_1^T + . . . = Σ_{t=1}^{L−2} f_t(Δ_3, . . . , Δ_L) Z_1^T,

where f_t(Δ_3, . . . , Δ_L) is the sum of the products of W^{(3)}, . . . , W^{(L)}, Δ_3, . . . , Δ_L that contain exactly t different Δ_i's. Then, matching terms from low order to high order, we obtain f_t(Δ_3, . . . , Δ_L) Z_1^T = 0. Since f_{L−2} = Δ_3 ⋯ Δ_L, d_i ≥ min{d_1, d_L}, and Δ_3 ⋯ Δ_L is arbitrary, we get Z_1 = 0.
Finally, when Z_1 = 0, it is evident that w_{−1} = 0 already satisfies the global minimizer condition, since the objective separates as Σ_{i=1}^{d_1} ||Z_i − w_{−i} a_i^T W^{(3)} ⋯ W^{(L)}||_F^2. This completes the proof.
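Two linear-algebra facts used in this proof, X_{−i}^T X_{−j} = 0 for disjoint column blocks of an orthonormal X and X̃^T X̃ = diag(w_{−1}^T w_{−1}, . . . , w_{−d_1}^T w_{−d_1}), can be checked numerically. The block construction below is our own illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((10, 6)))  # X with orthonormal columns
blocks = [Q[:, 0:2], Q[:, 2:4], Q[:, 4:6]]         # disjoint column blocks X_{-i}
w = [rng.standard_normal(2) for _ in blocks]

# X-tilde = (X_{-1} w_{-1}, ..., X_{-d1} w_{-d1})
Xt = np.column_stack([B @ wi for B, wi in zip(blocks, w)])
G = Xt.T @ Xt
assert np.allclose(G, np.diag([wi @ wi for wi in w]))  # diagonal Gram matrix
```

The diagonal Gram matrix is what makes (X̃^T X̃)^{−1} X̃^T Y decompose into the per-neuron rank-one terms above.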



BAD LOCAL MINIMUM WITH SPARSE FIRST LAYER

Now we turn back to the dense second layer case in Section 2.3 with d_y = 2, and assume X has full column rank. We give an algorithm to check the existence of spurious local minima when ∃ i ≠ j such that X_{−i}^T X_{−j} ≠ 0. In Figure 3, we can eliminate the output connections; but notice that when a node has a bias term, we cannot remove its output connections, since the bias constant will still propagate to subsequent layers.



Figure 1: Sparse network without (left) / with (right) overlapping filters in the first layer.

Figure 2: Left: a spurious local minimum exists in the sparse linear network with d_y = 3 shown above; it is the global minimum of a sub-network. Right: a simple weight assignment for obtaining the global minimum in a deep sparse linear network with scalar output.

L(w_1, . . . , w_8) = 0.18. Hence, a bad local minimum exists. We underline that the bad local minimum is produced from the sub-network when w_4 = w_8 = 0. Since we encounter no bad local minimum in a dense linear network, sparse connections indeed destroy the benign landscape, because sparsity obstructs the decreasing path, as Evci et al. (2019) observed in experiments.

Algorithm 1 Sparse-2-Opt (Z_1, Z_2, D): Obtain the solution of a two-layer sparse linear neural network with two hidden neurons.
1: Input: Target matrices Z_1, Z_2 and covariance diagonal matrix D.
2: Initialize w_2, d_2, a_2;
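Algorithm 1 is the two-block case of Sparse-d-Opt (Algorithm 2). The alternating rank-one updates can be sketched as follows; this is a minimal sketch assuming each X_i has orthonormal columns, so that the SVD step is the exact block-coordinate minimizer, with function and variable names of our own.

```python
import numpy as np

def sparse_d_opt(X_blocks, Z_blocks, iters=100, seed=0):
    """Sketch of steps 3-8 of Algorithm 2 (alternating rank-one SVD updates).

    Assumes X_i^T X_i = I for every block, so the update for neuron i is the
    exact minimizer over (w_i, a_i) of
    L = 0.5 * || sum_j X_j (Z_j - w_j a_j^T) ||_F^2.
    """
    rng = np.random.default_rng(seed)
    d = len(X_blocks)
    w = [rng.standard_normal(Z.shape[0]) for Z in Z_blocks]
    a = [rng.standard_normal(Z.shape[1]) for Z in Z_blocks]
    Y = sum(X @ Z for X, Z in zip(X_blocks, Z_blocks))  # implicit target
    losses = []
    for _ in range(iters):
        for i in range(d):
            # residual target Z_i + sum_{j != i} X_i^T X_j (Z_j - w_j a_j^T)
            M = Z_blocks[i].astype(float)
            for j in range(d):
                if j != i:
                    M += X_blocks[i].T @ X_blocks[j] @ (Z_blocks[j] - np.outer(w[j], a[j]))
            U, s, Vt = np.linalg.svd(M)
            w[i], a[i] = s[0] * U[:, 0], Vt[0]  # fold the singular value into w_i (step 8)
        fit = sum(X @ np.outer(wi, ai) for X, wi, ai in zip(X_blocks, w, a))
        losses.append(0.5 * np.linalg.norm(Y - fit) ** 2)
    return w, a, losses
```

Since each inner step is an exact block minimizer, the recorded losses are non-increasing; a converged point can then be screened with the Hessian test in steps 9 to 13.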

Examples found by the algorithm with a spurious local minimum. All experiments run 600 iterations, except the last one with 1000 iterations.


Examples found by the algorithm with a spurious local minimum when d_y = 3. All experiments run 2000 iterations.

