ON THE LANDSCAPE OF SPARSE LINEAR NETWORKS

Anonymous

Abstract

Network pruning, or sparse networks, has a long history and practical significance in modern applications. Although the loss functions of neural networks may exhibit bad landscapes due to non-convexity, we focus on linear activations, which already enjoy a benign landscape. Without unrealistic assumptions, we establish the following statements for the squared-loss objective of general sparse linear neural networks: 1) every local minimum is a global minimum for scalar output with any sparse structure, or for a non-intersected sparse first layer with dense remaining layers and orthogonal training data; 2) sparse linear networks have sub-optimal local minima when only the first layer is sparse, due to a low-rank constraint, or when the output has more than three dimensions, due to the global minimum of a sub-network. Overall, sparsity breaks the usual structure, cutting off the decreasing paths that exist in the original fully-connected network.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved remarkable empirical successes in the domains of computer vision, speech recognition, and natural language processing, sparking great interest in the theory behind their architectures and training. However, DNNs are often highly overparameterized, making them computationally expensive and demanding large amounts of memory and power. For example, training on large datasets such as ImageNet (Deng et al., 2009) may take up to weeks on a modern multi-GPU server. Hence, DNNs are often unsuitable for smaller devices like embedded electronics, and there is a pressing demand for techniques that produce models with reduced size, faster inference, and lower power consumption. Sparse networks, that is, neural networks in which a large subset of the model parameters are zero, have emerged as one of the leading approaches for reducing model parameter count. It has been shown empirically that deep neural networks can achieve state-of-the-art results under high levels of sparsity (Han et al., 2015b; Gale et al., 2019; Louizos et al., 2017a). Modern sparse networks are mainly obtained from network pruning (Zhu & Gupta, 2017; Lee et al., 2018; Liu et al., 2018; Frankle & Carbin, 2018), which has been the subject of a great deal of work in recent years. However, training a sparse network with a fixed sparsity pattern is difficult (Evci et al., 2019), and little theoretical understanding of general sparse networks has been provided.
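To make the setting concrete, the following is a minimal sketch, not taken from the paper, of what training a sparse linear network with a fixed sparsity pattern means: a two-layer linear model whose effective weights are (W2 ⊙ M2)(W1 ⊙ M1), trained on the squared loss by gradient descent, with gradients re-masked so that pruned entries never change. All dimensions, the mask M1, and the variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, n = 4, 3, 1, 32

X = rng.standard_normal((d_in, n))   # training inputs, one column per sample
Y = rng.standard_normal((d_out, n))  # training targets

W1 = 0.5 * rng.standard_normal((d_hid, d_in))
W2 = 0.5 * rng.standard_normal((d_out, d_hid))
# Fixed binary sparsity pattern: first layer sparse, second layer dense.
M1 = np.array([[1., 0., 1., 0.],
               [0., 1., 1., 0.],
               [1., 0., 0., 1.]])
M2 = np.ones_like(W2)

def loss(W1, W2):
    """Squared loss of the sparse linear network (averaged over samples)."""
    R = (W2 * M2) @ (W1 * M1) @ X - Y
    return 0.5 * np.mean(np.sum(R ** 2, axis=0))

W1_init = W1.copy()
loss0 = loss(W1, W2)

lr = 0.05
for _ in range(500):
    A1, A2 = W1 * M1, W2 * M2        # effective (masked) weights
    R = A2 @ A1 @ X - Y              # residual
    G2 = (R @ (A1 @ X).T) / n * M2   # gradients, re-masked so that
    G1 = (A2.T @ R @ X.T) / n * M1   # pruned weights stay fixed
    W1 -= lr * G1
    W2 -= lr * G2

loss1 = loss(W1, W2)
```

Because the gradients are multiplied by the masks, the pruned coordinates of W1 are never updated; optimization happens only over the support of the sparsity pattern, which is exactly the constrained landscape studied in this paper.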

Previous work has already analyzed deep neural networks, showing that the non-convexity of the associated loss functions may cause complicated and strange optimization landscapes. However, the properties of general sparse networks are poorly understood. Saxe et al. (2013) empirically showed that the optimization of deep linear models exhibits similar properties to that of deep nonlinear models, and for theoretical development it is natural to begin with linear models before studying nonlinear ones (Baldi & Lu, 2012). In addition, several works (Sun et al., 2020) have shown that bad local minima exist with nonlinear activations. Hence, it is natural to begin with linear activations to understand the impact of sparsity. In this article, we go further and consider the global landscape of general sparse linear neural networks. We emphasize that dense deep linear networks already satisfy the property that every local minimum is a global minimum under mild conditions (Kawaguchi, 2016; Lu & Kawaguchi, 2017), but the findings are different and more complicated for sparse linear networks. The goal of this paper is to study the relation between sparsity and local minima, with the following contributions: • First, we point out that every local minimum is a global minimum in the scalar-target case with any depth, any width, and any sparse structure. Besides, we also briefly show that

