NO SPURIOUS LOCAL MINIMA: ON THE OPTIMIZATION LANDSCAPES OF WIDE AND DEEP NEURAL NETWORKS

Abstract

Empirical studies suggest that wide neural networks are comparatively easy to optimize, but mathematical support for this observation is scarce. In this paper, we analyze the optimization landscapes of deep learning with wide networks. In particular, we prove that constrained and unconstrained empirical-risk minimization over such networks has no spurious local minima. Hence, our theory substantiates the common belief that increasing network widths not only improves the expressiveness of deep-learning pipelines but also facilitates their optimization.

1. INTRODUCTION

Deep learning depends on optimization problems that seem impossible to solve, and yet, deep-learning pipelines outperform their competitors in many applications. A common suspicion is that the optimizations are often easier than they appear to be. In particular, while most objective functions are nonconvex and, therefore, might have spurious local minima, recent findings suggest that optimization is not hampered by spurious local minima as long as the neural networks are sufficiently wide. For example, Dauphin et al. (2014) suggest that saddle points, rather than local minima, are the main challenge for optimization over wide networks; Goodfellow et al. (2014) give empirical evidence that stochastic-gradient descent converges to a global minimum of the objective function of wide networks; Livni et al. (2014) show that optimization over some classes of wide networks can be reduced to a convex problem; and Soudry & Carmon (2016) argue that, in over-parametrized networks, differentiable local minima of the training loss are global.

Broadly speaking, we call a local minimum spurious if there is no nonincreasing path to a global minimum (see Section 2.2 for a formal definition). While the absence of spurious local minima does not preclude saddle points or suboptimal local minima in general, it means that one can move from every local minimum to a global minimum without increasing the objective function at any point; see Figure 1 for an illustration.

Venturi et al. (2019) prove that there are no spurious local minima if the networks are sufficiently wide. Their theory has two main features that had not been established before. First, it holds for the entire landscape rather than for subsets of it. This feature is crucial: even randomized algorithms typically converge to sets of Lebesgue measure zero with probability one, that is, statements about "almost all" local minima are not necessarily meaningful. Second, their theory allows for arbitrary convex loss functions.
This second feature is important, for example, in view of the trend toward robust alternatives to the least-squares loss (Belagiannis et al., 2015; Jiang et al., 2018; Wang et al., 2016). On the other hand, their theory has three major limitations: it is restricted to polynomial activation, which is convenient mathematically but much less popular than ReLU activation; it disregards regularizers and constraints, which have become standard in deep learning and in machine learning at large (Hastie et al., 2015); and it is restricted to one-hidden-layer networks.

Lacotte & Pilanci (2020) made progress on two of these limitations: first, their theory caters to ReLU activation rather than polynomial activation; second, their theory allows for weight decay, which is a standard way to regularize estimators. However, their work is still restricted to one-hidden-layer networks. The interesting question is, therefore, whether such results can also be established for deep networks. More generally, it would be highly desirable to have a theory for the absence of spurious local minima in a broad deep-learning framework.

In this paper, we establish such a theory. We prove that the optimization landscapes of empirical-risk minimization over wide feedforward networks have no spurious local minima. Our theory combines the features of the two works mentioned above: it applies to the entire optimization landscapes, and it allows for a wide spectrum of loss functions and activation functions and for constrained as well as unconstrained estimation. Moreover, it generalizes these works, as it allows for multiple outputs and arbitrary depths. Additionally, our proof techniques are considerably different from the ones used before and, therefore, might be of independent interest.

Guide to the paper. Sections 2 and 5 are the basic parts of the paper: they contain our main result and a short discussion of its implications.
Readers who are interested in the underpinning principles should also study Section 3, and readers who want additional insights on the proof techniques are referred to Section 4. The actual proofs are stated in the Appendix.
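As a toy illustration of the path-based notion of spuriousness from the introduction, the following sketch examines a hypothetical one-dimensional landscape (our own example, not from the paper). In one dimension, the only path between two minimizers is the interval between them, so a non-global local minimum is spurious exactly when the function rises above the local-minimum value somewhere along that interval:

```python
import numpy as np

# Hypothetical 1-D objective, chosen so that it has a global minimum on the
# left branch and a non-global local minimum on the right branch.
f = lambda x: x**4 - x**2 + 0.3 * x

xs = np.linspace(-1.5, 1.5, 20001)
ys = f(xs)

x_global = xs[np.argmin(ys)]                  # global minimizer (left branch)
right = xs > 0.2
x_local = xs[right][np.argmin(ys[right])]     # local minimizer (right branch)

# The local minimum is spurious iff every path to the global minimum must
# go uphill; in 1-D that path is the interval between the two minimizers.
path = xs[(xs >= min(x_global, x_local)) & (xs <= max(x_global, x_local))]
barrier = f(path).max() - f(x_local)
print(f"barrier height along the path: {barrier:.3f}")
```

The barrier is strictly positive here, so this local minimum is spurious in the sense above; the main result of the paper rules out such configurations for the landscapes of sufficiently wide networks.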

2. DEEP-LEARNING FRAMEWORK AND MAIN RESULT

In this section, we specify the deep-learning framework and state our main result. The framework includes a wide range of feedforward neural networks; in particular, it allows for arbitrarily many outputs and layers, a range of activation and loss functions, and constrained as well as unconstrained estimation. Our main result guarantees that if the networks are sufficiently wide, the objective function of the empirical-risk minimizer does not have any spurious local minima.

2.1. FEEDFORWARD NEURAL NETWORKS

We consider input data from a domain D_x ⊂ R^d and output data from a domain D_y ⊂ R^m. Typical examples are regression data with D_y = R^m and classification data with D_y = {±1}^m. We model the data with layered, feedforward neural networks, that is, we study sets of functions

G := {g_Θ : D_x → R^m : Θ ∈ M} ⊂ Ḡ := {g_Θ : D_x → R^m : Θ ∈ M̄}

with

g_Θ[x] := Θ^l f^l[Θ^{l-1} · · · f^1[Θ^0 x]]  for x ∈ D_x   (1)

and M ⊂ M̄ := {Θ = (Θ^l, . . . , Θ^0) : Θ^j ∈ R^{p_{j+1} × p_j}}. The quantities p_0 = d and p_{l+1} = m are the input and output dimensions, respectively, l is the depth of the networks, and w := min{p_1, . . . , p_l} is the minimal width of the networks. The functions f^j : R^{p_j} → R^{p_j} are called the activation functions. We assume that the activation functions act elementwise in the sense that f^j[b] = (f_j[b_1], . . . , f_j[b_{p_j}]) for all b ∈ R^{p_j}, where f_j : R → R is an arbitrary function. This allows for an unlimited variety in the type of activation, including ReLU f_j : b ↦ max{0, b} and leaky ReLU f_j : b ↦ max{0, b} + min{0, cb} for a fixed c ∈ (0, 1).
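The network class in equation (1) can be sketched as follows; the widths, weights, and function names below are illustrative choices of ours, not prescribed by the paper:

```python
import numpy as np

# Elementwise activations of the kind allowed in the framework.
def relu(b):
    return np.maximum(0.0, b)

def leaky_relu(b, c=0.01):                 # fixed c in (0, 1)
    return np.maximum(0.0, b) + np.minimum(0.0, c * b)

def network(thetas, activations, x):
    """Evaluate g_Θ[x] = Θ^l f^l[Θ^{l-1} · · · f^1[Θ^0 x]] for
    thetas = (Θ^0, ..., Θ^l) and activations = (f^1, ..., f^l)."""
    z = x
    for theta, f in zip(thetas[:-1], activations):
        z = f(theta @ z)                   # elementwise f^{j+1} applied to Θ^j z
    return thetas[-1] @ z                  # final linear layer Θ^l

# Example: input dimension d = 3, one hidden layer of width p_1 = 5,
# m = 2 outputs, i.e., depth l = 1 and minimal width w = 5.
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(5, 3)), rng.normal(size=(2, 5))]
y = network(thetas, [relu], rng.normal(size=3))
print(y.shape)
```

Note that, as in equation (1), the weight matrices Θ^j must only satisfy the dimension constraints Θ^j ∈ R^{p_{j+1} × p_j}; any further restriction M ⊂ M̄ corresponds to constrained estimation.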



Figure 1: Spurious local minimum of a hypothetical objective function.

