NO SPURIOUS LOCAL MINIMA: ON THE OPTIMIZATION LANDSCAPES OF WIDE AND DEEP NEURAL NETWORKS

Abstract

Empirical studies suggest that wide neural networks are comparatively easy to optimize, but mathematical support for this observation is scarce. In this paper, we analyze the optimization landscapes of deep learning with wide networks. In particular, we prove that constrained and unconstrained empirical-risk minimization over such networks has no spurious local minima. Hence, our theories substantiate the common belief that increasing network widths not only improves the expressiveness of deep-learning pipelines but also facilitates their optimization.

1. INTRODUCTION

Deep learning depends on optimization problems that seem impossible to solve, and yet, deep-learning pipelines outperform their competitors in many applications. A common suspicion is that the optimizations are often easier than they appear to be. In particular, while most objective functions are nonconvex and, therefore, might have spurious local minima, recent findings suggest that optimization is not hampered by spurious local minima as long as the neural networks are sufficiently wide. For example, Dauphin et al. (2014) suggest that saddle points, rather than local minima, are the main challenge for optimization over wide networks; Goodfellow et al. (2014) give empirical evidence that stochastic-gradient descent converges to a global minimum of the objective function of wide networks; Livni et al. (2014) show that optimization over some classes of wide networks can be reduced to a convex problem; Soudry & Carmon (2016) suggest that differentiable local minima of objective functions over wide networks are typically global minima; Nguyen & Hein (2018) indicate that critical points in wide networks are often global minima; and Allen-Zhu et al. (2019) and Du et al. (2019) suggest that stochastic-gradient descent typically converges to a global minimum for large networks.

These findings raise the question of whether common optimization landscapes over wide (but finite) neural networks have no spurious local minima altogether. Progress in this direction has recently been made in Venturi et al. (2019) and then Lacotte & Pilanci (2020). Broadly speaking, we call a local minimum spurious if there is no nonincreasing path from it to a global minimum (see Section 2.2 for a formal definition). While the absence of spurious local minima does not preclude saddle points or suboptimal local minima in general, it means that one can move from every local minimum to a global minimum without increasing the objective function at any point; see Figure 1 for an illustration. Venturi et al. (2019) prove that there are no spurious local minima if the networks are sufficiently wide. Their theory has two main features that had not been established before: First, it holds for the entire landscapes rather than for subsets of them. This feature is crucial: even randomized algorithms typically converge to sets of Lebesgue measure zero with probability one, that is, statements about "almost all" local minima are not necessarily meaningful. Second, their theory allows for arbitrary convex loss functions. This feature is important, for example, in view of the trend toward robust alternatives to the least-squares loss (Belagiannis et al., 2015; Jiang et al., 2018; Wang et al., 2016). On the other hand, their theory has three major limitations: it is restricted to polynomial activation, which is convenient mathematically but much less popular than ReLU activation; it disregards regularizers and constraints, which have become standard in deep learning and in machine learning at large (Hastie et al., 2015); and it restricts to
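The notion of a spurious local minimum used above can be illustrated numerically. The following sketch is not from the paper; the tilted double-well function and the one-dimensional setting are illustrative assumptions. In one dimension, a local minimum is spurious exactly when every path to the global minimum must climb above the local minimum's own value, so it suffices to check the maximum of f along the segment connecting the two points.

```python
import numpy as np

# Illustrative example (not from the paper): a tilted double well whose
# right-hand local minimum is spurious, because reaching the global
# minimum requires climbing over the barrier near x = 0.
def f(x):
    return (x**2 - 1.0)**2 + 0.2 * x

xs = np.linspace(-2.0, 2.0, 4001)
ys = f(xs)

i_glob = ys.argmin()  # global minimum (near x = -1)
# crude grid-based detection of strict local minima
loc = [i for i in range(1, len(xs) - 1) if ys[i] < ys[i - 1] and ys[i] < ys[i + 1]]

for i in loc:
    lo, hi = sorted((i, i_glob))
    barrier = ys[lo:hi + 1].max()          # highest point on the path to the global minimum
    spurious = barrier > ys[i] + 1e-9      # must climb, so no nonincreasing path exists
    print(f"x = {xs[i]:+.3f}, f(x) = {ys[i]:+.3f}, spurious: {spurious}")
```

Running the sketch reports two local minima: the one near x = -1 is the global minimum (trivially not spurious), while the one near x = +1 is spurious, since any path to the global minimum must pass the barrier between the wells.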

