ON THE LANDSCAPE OF SPARSE LINEAR NETWORKS
Anonymous

Abstract

Network pruning, which produces sparse networks, has a long history and practical significance in modern applications. Although the loss functions of neural networks can exhibit bad landscapes due to non-convexity, we focus on linear activations, for which dense networks are already known to have benign landscapes. Without unrealistic assumptions, we establish the following statements for the squared loss of general sparse linear neural networks: 1) every local minimum is a global minimum when the output is a scalar, for any sparse structure, and also when the first layer is sparse with non-overlapping supports, the remaining layers are dense, and the training data are orthogonal; 2) sparse linear networks can have sub-optimal local minima, arising either from a low-rank constraint induced by a sparse first layer, or, for outputs of more than three dimensions, from the global minimum of a sub-network. Overall, sparsity breaks the dense structure, cutting out decreasing paths that exist in the original fully-connected network.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved remarkable empirical successes in computer vision, speech recognition, and natural language processing, sparking great interest in the theory behind their architectures and training. However, DNNs are often highly overparameterized, making them computationally expensive and demanding large amounts of memory and compute. For example, training on large datasets such as ImageNet (Deng et al., 2009) may take weeks even on a modern multi-GPU server. Hence, DNNs are often unsuitable for smaller devices such as embedded electronics, and there is a pressing demand for techniques that produce models with reduced size, faster inference, and lower power consumption. Sparse networks, that is, neural networks in which a large subset of the model parameters are zero, have emerged as one of the leading approaches for reducing model parameter count. It has been shown empirically that deep neural networks can achieve state-of-the-art results under high levels of sparsity (Han et al., 2015b; Gale et al., 2019; Louizos et al., 2017a). Modern sparse networks are mainly obtained by network pruning (Zhu & Gupta, 2017; Lee et al., 2018; Liu et al., 2018; Frankle & Carbin, 2018), which has been the subject of a great deal of work in recent years. However, training a sparse network with a fixed sparsity pattern is difficult (Evci et al., 2019), and little theoretical understanding of general sparse networks is available. Previous work has analyzed deep neural networks, showing that the non-convexity of the associated loss functions can produce complicated optimization landscapes, yet the landscape of general sparse networks remains poorly understood. Saxe et al.
(2013) empirically showed that the optimization of deep linear models exhibits properties similar to that of deep nonlinear models, and for theoretical development it is natural to begin with linear models before studying nonlinear ones (Baldi & Lu, 2012). In addition, several works (Sun et al., 2020) have shown that bad minima exist under nonlinear activations. Hence, it is natural to begin with linear activations to understand the impact of sparsity. In this article, we go further and consider the global landscape of general sparse linear neural networks. We emphasize that dense deep linear networks already satisfy the property that every local minimum is a global minimum under mild conditions (Kawaguchi, 2016; Lu & Kawaguchi, 2017), but the findings for sparse linear networks are different and more complicated. The goal of this paper is to study the relation between sparsity and local minima, with the following contributions:
• First, we show that every local minimum is a global minimum in the scalar-output case, for any depth, any width, and any sparse structure. We also briefly show that similar results hold for non-overlapping filters and orthogonal data features when sparsity occurs only in the first layer.
• Second, we show that sparse connections can create sub-optimal local minima in the general non-scalar case, through analytic and numerical examples built on a convergence analysis. Such local minima arise in two situations: a sub-network whose global minimum becomes a local minimum of the original sparse network, and a rank-deficient solution forced by sparse connections between different data features. Both cases illustrate that sparsity cuts out decreasing paths that exist in the original fully-connected network.
Overall, we hope our work contributes to a better understanding of the landscape of sparse networks in a simple setting and provides insights for future research. The remainder of our paper is organized as follows.
In Section 2, we derive positive results for shallow sparse linear networks, showing that they have landscapes similar to those of dense linear networks. In Section 3, we give several examples showing the existence of bad local minima in the non-scalar case. In Section 4, we briefly generalize the results from shallow to deep sparse linear networks. Some proofs are deferred to the Appendix.
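To make the notion of a fixed sparsity pattern concrete, the following minimal numerical sketch trains a two-layer linear network whose first layer carries a binary mask, using plain gradient descent restricted to the sparse subspace. The mask, dimensions, and random data here are illustrative assumptions, not the constructions used in our proofs.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_hidden, d_out, n = 3, 3, 2, 8
X = rng.standard_normal((d_in, n))   # inputs, one column per sample
Y = rng.standard_normal((d_out, n))  # targets

# Fixed binary sparsity pattern on the first layer (illustrative choice).
M = np.array([[1., 1., 0.],
              [0., 1., 1.],
              [1., 0., 1.]])

def loss(W1, W2):
    # Squared loss of the sparse linear network x -> W2 (M * W1) x.
    R = W2 @ ((M * W1) @ X) - Y
    return 0.5 * np.sum(R ** 2) / n

def train(seed, steps=8000, lr=0.02):
    # Gradient descent in the sparse subspace: the mask multiplies the
    # gradient of W1, so the zeroed entries never become active.
    r = np.random.default_rng(seed)
    W1 = r.standard_normal((d_hidden, d_in))
    W2 = r.standard_normal((d_out, d_hidden))
    init = loss(W1, W2)
    for _ in range(steps):
        H = (M * W1) @ X
        E = (W2 @ H - Y) / n            # scaled residual
        W1 -= lr * (M * (W2.T @ E @ X.T))
        W2 -= lr * (E @ H.T)
    return init, loss(W1, W2)

# Final losses from a few random initializations; depending on the mask
# and data, different initializations may or may not reach the same value.
results = [train(s) for s in range(3)]
```

Comparing the reached loss values across initializations is one simple way to probe whether a given mask creates distinct basins in the sparse subspace.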

1.1. RELATED WORK

There is a rapidly growing literature on analyzing the loss surfaces of neural network objectives; surveying all of it is beyond our scope, so we briefly review the works most related to ours.

Local minima are global. The landscape analysis of linear networks dates back to Baldi & Hornik (1989), who proved that shallow linear neural networks do not suffer from bad local minima. Kawaguchi (2016) generalized the same result to deep linear neural networks, and several subsequent works (Arora et al., 2018; Du & Hu, 2019; Eftekhari, 2020) give direct algorithmic convergence results based on this benign property, though algorithm analysis is beyond the scope of this paper. The situation is considerably more complicated with nonlinear activations. Multiple works (Ge et al., 2017; Safran & Shamir, 2018; Yun et al., 2018) show that spurious local minima can occur even in two-layer networks under population or empirical loss; some of these constructions are specific to the two-layer case and difficult to generalize to multilayer settings. Another line of work (Arora et al., 2018; Allen-Zhu et al., 2018; Du & Hu, 2019; Du et al., 2018; Li et al., 2018; Mei et al., 2018) studies the landscape of neural networks in the overparameterized setting, finding benign landscapes with or without gradient methods. Since modern sparse networks retain few parameters compared to overparameterized models, we instead seek a fundamental view of sparsity. Our standpoint is that spurious local minima can arise from specific sparsity patterns even in linear networks.

Sparse networks. Sparse networks (Han et al., 2015b;a; Zhu & Gupta, 2017; Frankle & Carbin, 2018; Liu et al., 2018) have a long history, but most work is experimental and mainly related to network pruning, which has practical importance for reducing model parameter count and deploying models on diverse devices. However, training sparse networks from scratch is difficult. Frankle & Carbin (2018) recommend reusing the sparsity pattern found through pruning and training the sparse network from the same initialization as the original training (the 'lottery ticket') to obtain comparable performance and avoid bad solutions. For fixed sparsity patterns, Evci et al. (2019) attempted to find a decreasing objective path from 'bad' solutions to 'good' ones within the sparse subspace but failed, suggesting that bad local minima can be produced by pruning; we give a more direct view through simple examples. Moreover, several recent works propose methods (Molchanov et al., 2017; Louizos et al., 2017b; Lee et al., 2018; Carreira-Perpinán & Idelbayev, 2018) for choosing weights or sparse network structures while maintaining performance. On the theoretical side, Malach et al. (2020) prove that a sufficiently over-parameterized neural network with random weights contains a subnetwork with roughly the same accuracy as a target network, providing a guarantee for the existence of 'good' sparse networks. Some works analyze convolutional networks (Shalev-Shwartz et al., 2017; Du et al., 2018) as a specific sparse structure. Brutzkus & Globerson (2017) analyze non-overlapping and overlapping structures as we do, but with weight sharing to simulate a CNN-type architecture, and in a teacher-student setting with population risk. We do not restrict to CNN-type networks but consider general sparse, though still linear, networks, obtaining straightforward results.
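The path-search idea discussed above can be sketched in its simplest form: given two parameter settings that share the same sparsity pattern, evaluate the loss along the straight line between them and check for monotone decrease. This is only a hedged illustration with an assumed diagonal mask and random data; Evci et al. search over richer path families than a single segment.

```python
import numpy as np

rng = np.random.default_rng(1)

d_in, d_hidden, d_out, n = 3, 3, 2, 8
X = rng.standard_normal((d_in, n))
Y = rng.standard_normal((d_out, n))
M = np.eye(3)  # illustrative diagonal mask on the first layer

def loss(W1, W2):
    # Squared loss of the sparse linear network x -> W2 (M * W1) x.
    R = W2 @ ((M * W1) @ X) - Y
    return 0.5 * np.sum(R ** 2) / n

# Two candidate solutions sharing the mask M.
A1, A2 = rng.standard_normal((d_hidden, d_in)), rng.standard_normal((d_out, d_hidden))
B1, B2 = rng.standard_normal((d_hidden, d_in)), rng.standard_normal((d_out, d_hidden))

# Loss along the segment from (A1, A2) to (B1, B2); the masked entries
# stay zero at every point of the path, so the path lies in the sparse
# subspace.
ts = np.linspace(0.0, 1.0, 21)
path = [loss((1 - t) * A1 + t * B1, (1 - t) * A2 + t * B2) for t in ts]

# If the loss rises anywhere along the segment, this particular straight
# line is not a monotonically decreasing route between the two points.
monotone = all(b <= a + 1e-12 for a, b in zip(path, path[1:]))
```

A barrier on every such path between a candidate solution and a better one is exactly the kind of evidence used to argue that the sparse subspace contains isolated basins.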

