NO SPURIOUS LOCAL MINIMA: ON THE OPTIMIZATION LANDSCAPES OF WIDE AND DEEP NEURAL NETWORKS

Abstract

Empirical studies suggest that wide neural networks are comparatively easy to optimize, but mathematical support for this observation is scarce. In this paper, we analyze the optimization landscapes of deep learning with wide networks. We prove, in particular, that constrained and unconstrained empirical-risk minimization over such networks has no spurious local minima. Hence, our theories substantiate the common belief that increasing network widths not only improves the expressiveness of deep-learning pipelines but also facilitates their optimization.

1. INTRODUCTION

Deep learning depends on optimization problems that seem impossible to solve, and yet, deep-learning pipelines outperform their competitors in many applications. A common suspicion is that the optimizations are often easier than they appear to be. In particular, while most objective functions are nonconvex and, therefore, might have spurious local minima, recent findings suggest that optimizations are not hampered by spurious local minima as long as the neural networks are sufficiently wide. For example, Dauphin et al. (2014) suggest that saddle points, rather than local minima, are the main challenges for optimizations over wide networks; Goodfellow et al. (2014) give empirical evidence for stochastic-gradient descent to converge to a global minimum of the objective function of wide networks; Livni et al. (2014) show that the optimizations over some classes of wide networks can be reduced to a convex problem; Soudry & Carmon (2016) suggest that differentiable local minima of objective functions over wide networks are typically global minima; Nguyen & Hein (2018) indicate that critical points in wide networks are often global minima; and Allen-Zhu et al. (2019) and Du et al. (2019) suggest that stochastic-gradient descent typically converges to a global minimum for large networks. These findings raise the question of whether common optimization landscapes over wide (but finite) neural networks have no spurious local minima altogether. Progress in this direction has recently been made in Venturi et al. (2019) and then Lacotte & Pilanci (2020). Broadly speaking, we call a local minimum spurious if there is no nonincreasing path to a global minimum (see Section 2.2 for a formal definition). While the absence of spurious local minima does not preclude saddle points or suboptimal local minima in general, it means that one can move from every local minimum to a global minimum without increasing the objective function at any point; see Figure 1 for an illustration.
Venturi et al. (2019) prove that there are no spurious local minima if the networks are sufficiently wide. Their theory has two main features that had not been established before: First, it holds for the entire landscapes, rather than for subsets of them. This feature is crucial: even randomized algorithms typically converge to sets of Lebesgue measure zero with probability one, that is, statements about "almost all" local minima are not necessarily meaningful. Second, their theory allows for arbitrary convex loss functions. This feature is important, for example, in view of the trends toward robust alternatives of the least-squares loss (Belagiannis et al., 2015; Jiang et al., 2018; Wang et al., 2016). On the other hand, their theory has three major limitations: it is restricted to polynomial activation, which is convenient mathematically but much less popular than ReLU activation; it disregards regularizers and constraints, which have become standard in deep learning and in machine learning at large (Hastie et al., 2015); and it is restricted to shallow networks, that is, networks with only one hidden layer, which contrasts with the deep architectures that are used in practice (LeCun et al., 2015). Lacotte & Pilanci (2020) made progress on two of these limitations: first, their theory caters to ReLU activation rather than polynomial activation; second, their theory allows for weight decay, which is a standard way to regularize estimators. However, their work is still restricted to one-hidden-layer networks. The interesting question is, therefore, whether such results can also be established for deep networks. And more generally, it would be highly desirable to have a theory for the absence of spurious local minima in a broad deep-learning framework. In this paper, we establish such a theory. We prove that the optimization landscapes of empirical-risk minimization over wide feedforward networks have no spurious local minima.
Our theory combines the features of the two mentioned works, as it applies to the entire optimization landscapes, allows for a wide spectrum of loss functions and activation functions, and covers constrained as well as unconstrained estimation. Moreover, it generalizes these works, as it allows for multiple outputs and arbitrary depths. Additionally, our proof techniques are considerably different from the ones used before and, therefore, might be of independent interest.

Guide to the paper. Sections 2 and 5 are the basic parts of the paper: they contain our main result and a short discussion of its implications. Readers who are interested in the underpinning principles should also study Section 3, and readers who want additional insights into the proof techniques are referred to Section 4. The actual proofs are stated in the Appendix.

2. DEEP-LEARNING FRAMEWORK AND MAIN RESULT

In this section, we specify the deep-learning framework and state our main result. The framework includes a wide range of feedforward neural networks; in particular, it allows for arbitrarily many outputs and layers, a range of activation and loss functions, and constrained as well as unconstrained estimation. Our main result guarantees that if the networks are sufficiently wide, the objective function of the empirical-risk minimizer does not have any spurious local minima.

2.1. FEEDFORWARD NEURAL NETWORKS

We consider input data from a domain $\mathcal{D}_x \subset \mathbb{R}^d$ and output data from a domain $\mathcal{D}_y \subset \mathbb{R}^m$. Typical examples are regression data with $\mathcal{D}_y = \mathbb{R}^m$ and classification data with $\mathcal{D}_y = \{\pm 1\}^m$. We model the data with layered, feedforward neural networks, that is, we study sets of functions $\mathcal{G} := \{g_\Theta : \mathcal{D}_x \to \mathbb{R}^m : \Theta \in \mathcal{M}\} \subset \bar{\mathcal{G}} := \{g_\Theta : \mathcal{D}_x \to \mathbb{R}^m : \Theta \in \bar{\mathcal{M}}\}$ with

$$g_\Theta[x] := \Theta_l f_l\bigl[\Theta_{l-1} \cdots f_1[\Theta_0 x]\bigr] \quad \text{for } x \in \mathcal{D}_x \tag{1}$$

and $\mathcal{M} \subset \bar{\mathcal{M}} := \{\Theta = (\Theta_l, \dots, \Theta_0) : \Theta_j \in \mathbb{R}^{p_{j+1} \times p_j}\}$. The quantities $p_0 = d$ and $p_{l+1} = m$ are the input and output dimensions, respectively, $l$ is the depth of the networks, and $w := \min\{p_1, \dots, p_l\}$ is the minimal width of the networks. The functions $f_j : \mathbb{R}^{p_j} \to \mathbb{R}^{p_j}$ are called the activation functions. We assume that the activation functions are elementwise functions in the sense that $f_j[b] = (f_j[b_1], \dots, f_j[b_{p_j}])$ for all $b \in \mathbb{R}^{p_j}$, where $f_j : \mathbb{R} \to \mathbb{R}$ is an arbitrary function. This allows for an unlimited variety in the type of activation, including ReLU $f_j : b \mapsto \max\{0, b\}$, leaky ReLU $f_j : b \mapsto \max\{0, b\} + \min\{0, cb\}$ for a fixed $c \in (0,1)$, polynomial $f_j : b \mapsto cb^k$ for fixed $c \in (0,\infty)$ and $k \in [1,\infty)$, and sigmoid activation $f_j : b \mapsto 1/(1 + e^{-b})$ as popular examples, and it allows for different activation functions in each layer.

We study the most common approaches to parameter estimation in this setting: constrained and unconstrained empirical-risk minimization. The loss function $\mathfrak{l} : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ is assumed convex in its first argument; this includes all standard loss functions, such as the least-squares loss $\mathfrak{l} : (a, b) \mapsto \|a - b\|_2^2$ and the absolute-deviation loss $\mathfrak{l} : (a, b) \mapsto \|a - b\|_1$ (both typically used for regression), the logistic loss $\mathfrak{l} : (a, b) \mapsto -(1 + b)\log[1 + a] - (1 - b)\log[1 - a]$ and the hinge loss $\mathfrak{l} : (a, b) \mapsto \max\{0, 1 - ab\}$ (both typically used for binary classification $\mathcal{D}_y = \{\pm 1\}$), and so forth. The optimization domain is the set $\mathcal{M} := \{\Theta \in \bar{\mathcal{M}} : r[\Theta] \le 1\}$ for a constraint $r : \bar{\mathcal{M}} \to \mathbb{R}$. Given data $(x_1, y_1), \dots, (x_n, y_n) \in \mathcal{D}_x \times \mathcal{D}_y$, the empirical-risk minimizers are then the networks $g_{\Theta_{\mathrm{erm}}}$ with

$$\Theta_{\mathrm{erm}} \in \operatorname*{arg\,min}_{\Theta \in \mathcal{M}} \sum_{i=1}^n \mathfrak{l}\bigl[g_\Theta[x_i], y_i\bigr]. \tag{2}$$

It has been shown that constraints can facilitate the optimization as well as improve generalization; see Krizhevsky et al. (2012) and Livni et al. (2014), among others. For ease of presentation, we limit ourselves to the following class of constraints:

$$r[\Theta] := \max\Bigl\{a_r \max_{j \in \{1,\dots,l\}} |||\Theta_j|||_1,\; b_r |||\Theta_0|||_q\Bigr\} \quad \text{for all } \Theta \in \bar{\mathcal{M}} \tag{3}$$

for fixed tuning parameters $a_r, b_r \in [0, \infty)$, a parameter $q \in (0, \infty]$, and $|||\cdot|||_q$ the usual row-wise $\ell_q$-"norm," that is, $|||\Theta_j|||_q := \max_k (\sum_i |(\Theta_j)_{ki}|^q)^{1/q}$ for $q \in (0, \infty)$ and $|||\Theta_j|||_\infty := \max_{ki} |(\Theta_j)_{ki}|$. This class of constraints includes the following four important cases:

• Unconstrained estimation: $a_r = b_r = 0$. In other words, $\mathcal{M} = \bar{\mathcal{M}}$. Unconstrained estimation had been the predominant approach in the earlier days of deep learning and is still used today (Anthony & Bartlett, 1999).

• Connection sparsity: $q = 1$. This constraint yields connection-sparse networks, which have received considerable attention recently (Barron & Klusowski, 2018; 2019; Kim et al., 2016; Taheri et al., 2020).

• Strong sparsity: $q < 1$. Nonconvex constraints have been popular in statistics for many years (Fan & Li, 2001; Zhang, 2010), but our paper is probably the first one that includes such constraints in a theoretical analysis in deep learning.

• Input constraints: $a_r = 0$. Some researchers have argued for applying certain constraints, such as node-sparsity, only to the input level (Feng & Simon, 2017).

In general, while our proof techniques also apply to many other types of constraints, there are two main reasons for using the mentioned sparsity-inducing constraints to illustrate our results: First, sparsity has become very popular in deep learning, because it can lower the burden on memory and optimization as well as increase interpretability (Hebiri & Lederer, 2020).
And second, the above examples allow us to demonstrate that the discussed features of wide networks do not depend on smooth and convex constraints such as weight decay. Our theory can also be adjusted to the regularized versions of the empirical-risk minimizers, that is, to the networks indexed by any parameter in the set $\operatorname*{arg\,min}_{\Theta \in \bar{\mathcal{M}}} \{\sum_{i=1}^n \mathfrak{l}[g_\Theta[x_i], y_i] + r[\Theta]\}$. The proofs are virtually the same as for the constrained versions; we omit the details for the sake of brevity. One line of research develops statistical theories for constrained and unconstrained empirical-risk minimizers; see Bartlett & Mendelson (2002) and Lederer (2020), among others. As detailed above, empirical-risk minimizers are the networks whose parameters are global minima of the objective function

$$\Theta \mapsto \mathfrak{l}[g_\Theta] := \sum_{i=1}^n \mathfrak{l}\bigl[g_\Theta[x_i], y_i\bigr] \tag{4}$$

over $\mathcal{M}$ for fixed data $(x_1, y_1), \dots, (x_n, y_n)$. While the function $g_\Theta \mapsto \mathfrak{l}[g_\Theta]$ is convex by assumption, the objective function $\Theta \mapsto \mathfrak{l}[g_\Theta]$ is usually nonconvex. It is thus unclear, per se, whether deep-learning pipelines can be expected to yield global minima of the objective function and, therefore, whether the statistical theories are valid in practice. Our goal is, broadly speaking, to establish conditions under which global minimization of (4) can indeed be expected.
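For concreteness, the network evaluation in (1) and the constraint (3) can be sketched in a few lines of numpy. This is a minimal illustration only: the helper names (`g`, `r`) and the encoding of a parameter $\Theta$ as a list of matrices $(\Theta_0, \dots, \Theta_l)$ are our own choices, not notation used elsewhere in the paper.

```python
import numpy as np

def g(theta, x, activations):
    """Evaluate a feedforward network as in (1): Theta_l f_l[... f_1[Theta_0 x]].
    `theta` lists the weight matrices (Theta_0, ..., Theta_l)."""
    z = x
    for Theta_j, f_j in zip(theta[:-1], activations):
        z = f_j(Theta_j @ z)  # inner layers: linear map followed by activation
    return theta[-1] @ z      # outermost matrix Theta_l, no activation

def r(theta, a_r, b_r, q):
    """Constraint (3): max{a_r * max_j |||Theta_j|||_1, b_r * |||Theta_0|||_q},
    with |||.|||_q the row-wise q-'norm'."""
    def rowwise(M, q):
        if np.isinf(q):
            return np.max(np.abs(M))
        return np.max(np.sum(np.abs(M) ** q, axis=1) ** (1.0 / q))
    inner = max((rowwise(Theta_j, 1.0) for Theta_j in theta[1:]), default=0.0)
    return max(a_r * inner, b_r * rowwise(theta[0], q))

relu = lambda z: np.maximum(z, 0)
theta = [np.ones((3, 2)), np.ones((1, 3))]  # one hidden layer: d=2, p1=3, m=1
x = np.array([1.0, 2.0])
print(g(theta, x, [relu]))       # network output for the sample x
print(r(theta, 1.0, 1.0, 1.0))   # constraint value for a_r = b_r = q = 1
```

Setting `a_r = b_r = 0` recovers unconstrained estimation, where `r` is identically zero.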

2.2. ABSENCE OF SPURIOUS LOCAL MINIMA

We now show that the objective function (4) has no spurious local minima if the networks are sufficiently wide. Recall that a parameter $\Theta \in \mathcal{M}$ that satisfies $\mathfrak{l}[g_\Theta] \le \mathfrak{l}[g_\Gamma]$ for all $\Gamma \in \mathcal{M}$ with $|||\Theta - \Gamma||| \le c$ for a constant $c \in (0, \infty)$ and a norm $|||\cdot|||$ on $\bar{\mathcal{M}}$ is called a local minimum of the objective function (4). If the statement holds for every $c \in (0, \infty)$, the parameter $\Theta$ is called a global minimum. Objective functions in deep learning typically have many local and global minima; an important question is whether there are "bad" local minima, that is, suboptimal local minima that are difficult to escape from. We formalize this notion as follows:

Definition 1 (Spurious local minima). Let $\Theta \in \mathcal{M}$ be a local minimum of the objective function (4).

If there is no continuous function $h : [0, 1] \to \mathcal{M}$ that satisfies (i) $h[0] = \Theta$ and $h[1] = \Gamma$ for a global minimum $\Gamma \in \mathcal{M}$ of the objective function (4) and (ii) $t \mapsto \mathfrak{l}[g_{h[t]}]$ is nonincreasing, we call the parameter $\Theta$ a spurious local minimum.

See again Figure 1 for an illustration. The following theorem is our main result:

Theorem 1 (Absence of spurious local minima). Consider the setup of Section 2.1. If $w \ge 2m(n+1)^l$, the objective function (4) has no spurious local minima.

In other words, empirical-risk minimization over sufficiently wide networks does not involve spurious local minima. Hence, as long as there are means to circumvent saddle points (Dauphin et al., 2014), it is reasonable to expect that algorithms can find a global minimum and, therefore, that the known statistical theories for empirical-risk minimizers apply in practice. The theorem applies very broadly. First, it includes all local minima rather than "many" or "almost all" local minima. This feature is important, because even randomized algorithms usually converge to a few, fixed points with high probability. Second, the framework allows for arbitrary convex loss functions. This feature caters, for example, to a current trend toward robust alternatives of the least-squares loss function (Barron, 2019; Lederer, 2020). Third, the framework includes ReLU activation. ReLU activation is nondifferentiable and, therefore, mathematically more challenging than, for example, linear and polynomial activation, but it has become the predominant type of activation in practice. Fourth, the framework includes constrained as well as unconstrained estimation. Constrained estimation is particularly suitable for wide networks, and for overparameterized networks more generally, because it can avoid overfitting and facilitate optimization. Fifth, our statement holds for arbitrary output dimensions and depths. The latter is particularly important in view of the current trend toward deep architectures.
In sum, our result is a sweeping proof of the fact that wide networks have no spurious local minima, and it sheds light on the optimization landscapes of deep learning more generally. The bound on the network widths becomes $2m(n+1)^l = 2(n+1)$ in the case of a single output ($m = 1$) and a single hidden layer ($l = 1$), which coincides with the bounds that have been established for shallow networks with one output and specific activation functions and estimators; see Lacotte & Pilanci (2020) and references therein. Thus, our theory applies extremely broadly and still gives the expected results in the simple cases. In fact, our proofs only require one layer to have a width of at least $2m(n+1)^l$, but instead of losing ourselves in technical details about the condition, we focus on the main message of Theorem 1: optimizations become easier with increasing widths.
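The width bound of Theorem 1 is easy to evaluate for concrete dimensions; the helper below is purely illustrative (the name `min_width` is ours):

```python
def min_width(m, n, l):
    """Width bound of Theorem 1: no spurious local minima once w >= 2*m*(n+1)**l."""
    return 2 * m * (n + 1) ** l

# single output (m = 1) and a single hidden layer (l = 1) recover the
# shallow-network bound 2(n + 1) discussed above
assert min_width(m=1, n=50, l=1) == 2 * (50 + 1)
```

Note how the bound grows exponentially in the depth $l$ but only linearly in the number of outputs $m$.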

3. UNDERLYING CONCEPTS

In this section, we introduce concepts that we use in our proofs and that might also be of interest more generally. We first formulate the notion of path equivalence, which yields a practical characterization of spurious local minima. We then formulate specific parameters that can act as mediators between path-equivalent parameters. The main reason why our proof techniques are quite different from what can be found in the literature is that we cater to deep networks and a range of activation functions.

3.1. PATH RELATIONS

The objective functions for optimizing neural networks are typically continuous but not convex or differentiable. In the following, we characterize the absence of spurious local minima in a way that suits these characteristics of neural networks. The key concept is formulated in the following definition.

Definition 2 (Path relations). Consider two parameters $\Theta, \Gamma \in \mathcal{M}$. If there is a continuous function $h_{\Theta,\Gamma} : [0, 1] \to \mathcal{M}$ that satisfies $h_{\Theta,\Gamma}[0] = \Theta$, $h_{\Theta,\Gamma}[1] = \Gamma$, and $t \mapsto \mathfrak{l}[g_{h_{\Theta,\Gamma}[t]}]$ is constant, we say that $\Theta$ and $\Gamma$ are path constant and write $\Theta \leftrightarrow \Gamma$. If there is a continuous function $h_{\Theta,\Gamma} : [0, 1] \to \mathcal{M}$ that satisfies $h_{\Theta,\Gamma}[0] = \Theta$, $h_{\Theta,\Gamma}[1] = \Gamma$, and $t \mapsto \mathfrak{l}[g_{h_{\Theta,\Gamma}[t]}]$ is convex, we say that $\Theta$ and $\Gamma$ are path convex and write $\Theta \smile \Gamma$. If there are parameters $\Theta', \Gamma' \in \mathcal{M}$ such that (i) $\Theta \leftrightarrow \Theta'$ and $\Gamma \leftrightarrow \Gamma'$ and (ii) $\Theta' \smile \Gamma'$, we say that $\Theta$ and $\Gamma$ are path equivalent and write $\Theta \sim \Gamma$.

Path constantness means that two parameters are connected by a continuous path of parameters that is constant with respect to the loss; path convexity relaxes "constant" to "convex;" path equivalence allows for additional mediators. The three relations are ordered in the sense that $\Theta \leftrightarrow \Gamma \Rightarrow \Theta \smile \Gamma \Rightarrow \Theta \sim \Gamma$, and they satisfy a number of other basic properties.

Lemma 1 (Basic properties). It holds for all $\Theta, \Gamma, \Psi \in \mathcal{M}$ that

1. $\Theta \leftrightarrow \Theta$; $\Theta \smile \Theta$; and $\Theta \sim \Theta$ (reflexivity);

2. $\Theta \leftrightarrow \Gamma \Rightarrow \Gamma \leftrightarrow \Theta$; $\Theta \smile \Gamma \Rightarrow \Gamma \smile \Theta$; and $\Theta \sim \Gamma \Rightarrow \Gamma \sim \Theta$ (symmetry);

3. $\Theta \leftrightarrow \Gamma$ and $\Gamma \leftrightarrow \Psi$ $\Rightarrow$ $\Theta \leftrightarrow \Psi$ (transitivity).

The proof is straightforward and, therefore, omitted. The lemma illustrates that the path relations equip the parameter space with solid mathematical structure. We can finally use the above-stated concepts to characterize spurious local minima.

Proposition 1 (Characterization of spurious local minima). Assume that for all $\Theta \in \mathcal{M}$, there is a global minimum of the objective function (4), denoted by $\Gamma$, such that $\Theta \sim \Gamma$. Then, the objective function (4) has no spurious local minima.

Hence, path equivalence of all parameters to a global minimum is a sufficient condition for the absence of spurious local minima. This statement is the main result of Section 3.1.
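A concrete instance of path constantness arises from the positive homogeneity of ReLU: rescaling adjacent layers along a continuous path leaves the network, and hence the loss, unchanged. The numpy sketch below verifies this for a one-hidden-layer network; the specific dimensions and the rescaling constant are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)

Theta0 = rng.standard_normal((5, 3))  # inner layer
Theta1 = rng.standard_normal((2, 5))  # outer layer
x = rng.standard_normal(3)
g = lambda T1, T0: T1 @ relu(T0 @ x)

c = 4.0  # Gamma = (Theta1 / c, c * Theta0) parameterizes the same network
for t in np.linspace(0.0, 1.0, 11):
    s = (1 - t) + t * c  # s stays positive, interpolating 1 -> c
    # h[t] = (Theta1 / s, s * Theta0): by positive homogeneity of ReLU,
    # relu(s * z) = s * relu(z), so the network is constant along the path
    assert np.allclose(g(Theta1 / s, s * Theta0), g(Theta1, Theta0))
```

Since the network function is constant along the path, so is any loss evaluated on it, which is exactly the relation $\Theta \leftrightarrow \Gamma$.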

3.2. BLOCK PARAMETERS

The parameterization of neural networks is typically ambiguous: many different parameters yield the same network. We leverage this ambiguity to make the networks more tractable. The key concept is formulated in the following definition.

Definition 3 (Block parameters). Consider a parameter $\Theta \in \bar{\mathcal{M}}$ and an $s \in \{0, 1, \dots\}$. If

1. $(\Theta_0)_{ji} = 0$ for all $j > s$;

2. $(\Theta_v)_{ij} = 0$ for all $v \in \{1, \dots, l-1\}$ and $i > s$ and for all $v \in \{1, \dots, l-1\}$ and $j > s$;

3. $(\Theta_l)_{ij} = 0$ for all $j > s$,

we call $\Theta$ an $s$-upper-block parameter of depth $l$. Similarly, if

1. $(\Theta_0)_{ji} = 0$ for all $j \le p_1 - s$;

2. $(\Theta_v)_{ij} = 0$ for all $v \in \{1, \dots, l-1\}$ and $i \le p_{v+1} - s$ and for all $v \in \{1, \dots, l-1\}$ and $j \le p_v - s$;

3. $(\Theta_l)_{ij} = 0$ for all $j \le p_l - s$,

we call $\Theta$ an $s$-lower-block parameter of depth $l$. We denote the sets of the $s$-upper-block and $s$-lower-block parameters of depth $l$ by $\mathcal{U}_{s,l}$ and $\mathcal{L}_{s,l}$, respectively.

Trivial examples are the $0$-block parameters $\mathcal{U}_{0,l} = \mathcal{L}_{0,l} = \{0 = (0_{p_{l+1} \times p_l}, \dots, 0_{p_1 \times p_0})\}$ and the $s$-block parameters $\mathcal{U}_{s,l} = \mathcal{L}_{s,l} = \bar{\mathcal{M}}$ for $s \ge \max\{p_1, \dots, p_l\}$. More generally, the block parameters consist of block matrices: see Figure 2. We show in the following that block parameters can be mediators in the sense of path equivalence. We first show that every parameter is path constant to a block parameter.

Proposition 2 (Path connections to block parameters). For every $\Theta \in \mathcal{M}$ and $s := m(n+1)^l$, there are $\bar{\Theta}, \underline{\Theta} \in \mathcal{M}$ with $\bar{\Theta} \in \mathcal{U}_{s,l}$ and $\underline{\Theta} \in \mathcal{L}_{s,l}$ such that $\Theta \leftrightarrow \bar{\Theta}$ and $\Theta \leftrightarrow \underline{\Theta}$.

In particular, every parameter is path connected to both an upper-block parameter and a lower-block parameter. The interesting cases are wide networks: for fixed $s$, the wider the network, the more pronounced the block structure. We then show that there is a connection between upper-block and lower-block parameters.

Proposition 3 (Path connections among block parameters). Consider two block parameters $\Theta \in \mathcal{U}_{s,l}$ and $\Gamma \in \mathcal{L}_{s,l}$. If $w \ge 2s$, it holds that $\Theta \smile \Gamma$.
Hence, every upper-block parameter is path connected to every lower-block parameter, as long as the minimal width of the networks is sufficiently large. We finally combine Propositions 2 and 3.

Corollary 1 (All parameters are path equivalent). Consider two arbitrary parameters $\Theta, \Gamma \in \mathcal{M}$. If $w \ge 2m(n+1)^l$, it holds that $\Theta \sim \Gamma$.

See Figure 3 for an illustration. The corollary ensures that as long as the minimal width is sufficiently large, all networks are path equivalent. This result, therefore, connects directly to the characterization of spurious local minima in Proposition 1 of the previous section.
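The mechanism behind Proposition 3 is that an $s$-upper-block and an $s$-lower-block parameter never interact once the width is at least $2s$: summing their inner layers and interpolating the outer layers makes the network exactly linear in the interpolation. The numpy sketch below checks this for a toy depth-2 network; all dimensions are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0)
d, m, s = 3, 2, 2
w = 2 * s  # minimal width required by Proposition 3

# s-upper-block parameter: only the first s units of each hidden layer are wired
Th0 = np.zeros((w, d)); Th0[:s] = rng.standard_normal((s, d))
Th1 = np.zeros((w, w)); Th1[:s, :s] = rng.standard_normal((s, s))
Th2 = np.zeros((m, w)); Th2[:, :s] = rng.standard_normal((m, s))
# s-lower-block parameter: only the last s units are wired
Ga0 = np.zeros((w, d)); Ga0[-s:] = rng.standard_normal((s, d))
Ga1 = np.zeros((w, w)); Ga1[-s:, -s:] = rng.standard_normal((s, s))
Ga2 = np.zeros((m, w)); Ga2[:, -s:] = rng.standard_normal((m, s))

g = lambda A2, A1, A0, x: A2 @ relu(A1 @ relu(A0 @ x))
x = rng.standard_normal(d)
for t in np.linspace(0.0, 1.0, 5):
    # the two blocks occupy disjoint hidden units, so interpolating the outer
    # layer while summing the inner ones yields (1 - t) g_Theta + t g_Gamma
    path = ((1 - t) * Th2 + t * Ga2) @ relu((Th1 + Ga1) @ relu((Th0 + Ga0) @ x))
    assert np.allclose(path, (1 - t) * g(Th2, Th1, Th0, x)
                             + t * g(Ga2, Ga1, Ga0, x))
```

Because the loss is convex in the network output, a loss evaluated along this path is convex in $t$, which is the path convexity $\Theta \smile \Gamma$ claimed by the proposition.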

4. AUXILIARY RESULTS

In this section, we state four auxiliary results.

4.1. TWO-LAYER NETWORKS

Here, we show that two-layer networks can be reparametrized such that they are indexed by block parameters. We first introduce the notation $r[M, q] := |||M|||_q = \max_{a \in \{1,\dots,b\}} (\sum_{j=1}^{c} |M_{aj}|^q)^{1/q}$ for all $M \in \mathbb{R}^{b \times c}$ and $q \in (0, \infty)$, and $r[M, \infty] := |||M|||_\infty = \max_{aj} |M_{aj}|$ for all $M \in \mathbb{R}^{b \times c}$. These functions are the building blocks of the constraint in (3). Next, given a permutation $p : \{1, \dots, c\} \to \{1, \dots, c\}$ and a matrix $M \in \mathbb{R}^{b \times c}$, we define the column-permuted matrix $M_p \in \mathbb{R}^{b \times c}$ through $(M_p)_{ij} := M_{i p[j]}$. Similarly, given a permutation $p : \{1, \dots, b\} \to \{1, \dots, b\}$ and a matrix $M \in \mathbb{R}^{b \times c}$, we define the row-permuted matrix $M^p \in \mathbb{R}^{b \times c}$ through $(M^p)_{ji} := M_{p[j] i}$. The result is then the following:

Lemma 2 (Two-layer networks). Consider three matrices $A \in \mathbb{R}^{u \times v}$, $B \in \mathbb{R}^{v \times o}$, and $C \in \mathbb{R}^{o \times r}$, two constants $q_A \in (0, 1]$ and $q_B \in (0, \infty]$, and a function $h : \mathbb{R} \to \mathbb{R}$. With some abuse of notation, define $h : \mathbb{R}^{v \times r} \to \mathbb{R}^{v \times r}$ through $(h[M])_{ji} := h[M_{ji}]$ for all $M \in \mathbb{R}^{v \times r}$. Then, there are matrices $\bar{A} \in \mathbb{R}^{u \times v}$ and $\bar{B} \in \mathbb{R}^{v \times o}$ and a permutation $p : \{1, \dots, v\} \to \{1, \dots, v\}$ such that

• $\bar{A} h[\bar{B} C] = A_p h[B^p C]$;

• $r[\bar{A}, q_A] \le r[A_p, q_A]$ and $r[\bar{B}, q_B] \le r[B^p, q_B]$;

• $\bar{A}_{ij} = 0$ for $j > u(r+1)$; $\bar{B}_{ji} = 0$ for $j > u(r+1)$ and $\bar{B}_{ji} = (B^p)_{ji}$ otherwise.

Similarly, there are matrices $\underline{A} \in \mathbb{R}^{u \times v}$ and $\underline{B} \in \mathbb{R}^{v \times o}$ and a permutation $p : \{1, \dots, v\} \to \{1, \dots, v\}$ such that

• $\underline{A} h[\underline{B} C] = A_p h[B^p C]$;

• $r[\underline{A}, q_A] \le r[A_p, q_A]$ and $r[\underline{B}, q_B] \le r[B^p, q_B]$;

• $\underline{A}_{ij} = 0$ for $j \le v - u(r+1)$; $\underline{B}_{ji} = 0$ for $j \le v - u(r+1)$ and $\underline{B}_{ji} = (B^p)_{ji}$ otherwise.

Hence, the parameter matrices of two-layer networks can be brought into the shapes illustrated in Figure 2. We apply this result repeatedly in the proof of Proposition 2.
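The permutation identity underlying Lemma 2, namely that permuting the columns of $A$ and the rows of $B$ with the same permutation leaves $A\,h[BC]$ unchanged, is easy to verify numerically. The numpy sketch below does so with `np.tanh` standing in for an arbitrary elementwise function; the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
u, v, o, r_dim = 2, 5, 3, 4
A = rng.standard_normal((u, v))
B = rng.standard_normal((v, o))
C = rng.standard_normal((o, r_dim))
h = np.tanh  # stands in for an arbitrary elementwise function

p = rng.permutation(v)
A_p = A[:, p]  # column permutation: (A_p)_{ij} = A_{i, p[j]}
B_p = B[p, :]  # row permutation:    (B^p)_{ji} = B_{p[j], i}

# permuting the hidden coordinates consistently leaves the product unchanged,
# which is why Lemma 2 may reorder the hidden units of A and B freely
assert np.allclose(A_p @ h(B_p @ C), A @ h(B @ C))
```

Note also that the row-wise norms $r[\cdot, q]$ are invariant under both operations, which is what makes the permutations harmless for the constraint.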

4.2. SYMMETRY PROPERTY OF NEURAL NETWORKS

Next, we point out a symmetry in our setup for the neural networks.

Lemma 3 (Symmetry property). Consider permutations $p_j : \{1, \dots, p_j\} \to \{1, \dots, p_j\}$ for $j \in \{0, \dots, l+1\}$. Assume that $p_0$ and $p_{l+1}$ are the identity functions: $p_0[j] = p_{l+1}[j] = j$ for all $j$. The parameter $\Theta \in \mathcal{M}$ is a spurious local minimum of the objective function (4) if and only if $\Gamma \in \mathcal{M}$ defined through $(\Gamma_j)_{uv} := (\Theta_j)_{p_{j+1}[u] p_j[v]}$ for all $j \in \{0, \dots, l\}$, $u \in \{1, \dots, p_{j+1}\}$, and $v \in \{1, \dots, p_j\}$ is a spurious local minimum of the objective function (4).

The proof follows readily from our setup in Section 2.1 and, therefore, is omitted. The lemma illustrates that the parameterizations of neural networks are highly ambiguous. But in this case, the ambiguity is convenient, because it allows us to permute the rows and columns of the parameters to bring the parameters into shapes that are easy to manage.
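The symmetry of Lemma 3 amounts to relabeling hidden units: permuting the rows of one parameter matrix and the columns of the next with the same permutation defines the very same network function. A minimal numpy check for a network with two hidden layers (dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
relu = lambda z: np.maximum(z, 0)
d, p1, p2, m = 3, 4, 5, 2
Th0 = rng.standard_normal((p1, d))
Th1 = rng.standard_normal((p2, p1))
Th2 = rng.standard_normal((m, p2))
g = lambda A2, A1, A0, x: A2 @ relu(A1 @ relu(A0 @ x))

perm = rng.permutation(p1)  # permute the units of the first hidden layer only
Ga0 = Th0[perm, :]          # (Gamma_0)_{uv} = (Theta_0)_{perm[u], v}
Ga1 = Th1[:, perm]          # (Gamma_1)_{uv} = (Theta_1)_{u, perm[v]}

# the relabeled parameter defines the very same network function, so the whole
# objective landscape, including its (spurious) local minima, is symmetric
x = rng.standard_normal(d)
assert np.allclose(g(Th2, Ga1, Ga0, x), g(Th2, Th1, Th0, x))
```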

4.3. PROPERTY OF CONVEX FUNCTIONS

We now establish a simple property of convex functions.

Lemma 4 (Property of convex functions). Consider a convex function $h : [0, 1] \to \mathbb{R}$. If $h[0] > h[\tilde{t}]$ for a $\tilde{t} \in (0, 1]$, there is a $c \in \arg\min_{t \in (0,1]} \{h[t]\}$ such that the function $\tilde{h} : [0, 1] \to \mathbb{R}$ defined through $\tilde{h}[t] := h[ct]$ for all $t \in [0, 1]$ is nonincreasing and $\tilde{h}[0] > \tilde{h}[1]$.

This lemma connects the convexity from Definition 2 with the spurious local minima from Definition 1. We use this result in the proof of Proposition 1.
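Lemma 4 can be checked numerically on a toy convex function; the quadratic below and its minimizer are our own illustrative choices. Rescaling the argument by the minimizer $c$ cuts off the increasing part of $h$, leaving a nonincreasing function.

```python
import numpy as np

h = lambda t: (t - 0.6) ** 2  # convex on [0, 1] with h(0) = 0.36 > h(0.5) = 0.01
c = 0.6                       # the minimizer of h over (0, 1]
ts = np.linspace(0.0, 1.0, 101)
h_tilde = h(c * ts)           # rescaled function t -> h[c t]

assert np.all(np.diff(h_tilde) <= 1e-12)  # nonincreasing on all of [0, 1]
assert h_tilde[0] > h_tilde[-1]           # and strictly lower at the endpoint
```

In the proof of Proposition 1, $h$ is the loss along a path-convex connection, and the rescaled path delivers exactly the nonincreasing function required by Definition 1.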

4.4. CARATHÉODORY-TYPE RESULT

Carathéodory's theorem goes back to Carathéodory (1911); see Boltyanski & Martini (2001); Fenchel (1929); Hanner & Rådström (1951) for related results. The following statement combines the classical theorem and the much more recent results in Bastero et al. (1995, Theorem 1 and Lemma 1).

Lemma 5 (Carathéodory-type result). Consider a number $q \in (0, 1]$, vectors $z_1, \dots, z_h \in \mathbb{R}^r$, and the vectors' $q$-convex hull $\mathrm{conv}_q[z_1, \dots, z_h] := \{\sum_{j=1}^h \tilde{t}_j z_j : \tilde{t} \in [0, 1]^h, \|\tilde{t}\|_q = 1\}$. Then, every vector $v \in \mathrm{conv}_q[z_1, \dots, z_h]$ can be written as $v = \sum_{j=1}^h t_j z_j$ with $t \in [0, 1]^h$, $\|t\|_q \le 1$, and $\#\{j \in \{1, \dots, h\} : t_j \ne 0\} \le r + 1$.

The cardinality of a set $A$ is denoted by $\#\{A\}$ and the $\ell_q$-norm of a vector $v \in \mathbb{R}^a$ by $\|v\|_q := (\sum_{j=1}^a |v_j|^q)^{1/q}$. The lemma follows readily from the mentioned results; therefore, its proof is omitted. The lemma states that every vector in the $q$-convex hull is a $q$-convex combination of at most $r + 1$ vectors from the set of vectors that generate the $q$-convex hull. (Note that for $q < 1$, our definition of the $q$-convex hull is more restrictive than the standard definition; cf. Bastero et al. (1995, Remark on Page 142). But it leads to a concise statement and is sufficient for our purposes.)
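For the classical case $q = 1$, the reduction behind Lemma 5 can be carried out constructively: whenever more than $r + 1$ weights are active, the points are affinely dependent, and moving the weights along the dependence direction removes one of them. The following numpy sketch implements this reduction; it is our own numerical illustration, not the argument used in the proofs (which cite Bastero et al. (1995) instead).

```python
import numpy as np

def caratheodory(z, t, tol=1e-10):
    """Rewrite v = sum_j t_j z_j (t_j >= 0, sum_j t_j = 1, z_j in R^r) with at
    most r + 1 nonzero weights -- the classical case q = 1 of Lemma 5."""
    z = np.asarray(z, dtype=float)
    t = np.asarray(t, dtype=float).copy()
    r = z.shape[1]
    while np.count_nonzero(t) > r + 1:
        S = np.flatnonzero(t)
        # more than r + 1 points in R^r are affinely dependent: the system
        # sum_j lam_j z_j = 0, sum_j lam_j = 0 has a nontrivial solution
        A = np.vstack([z[S].T, np.ones(len(S))])
        lam = np.linalg.svd(A)[2][-1]  # a nullspace direction of A
        if not np.any(lam > tol):
            lam = -lam                 # ensure some positive entries
        pos = lam > tol
        alpha = np.min(t[S][pos] / lam[pos])  # largest step keeping t >= 0
        t[S] = t[S] - alpha * lam             # same v, same total weight
        t[np.abs(t) < tol] = 0.0              # at least one weight is removed
    return t
```

Each iteration preserves the represented vector and the weight sum while zeroing at least one weight, so the loop terminates with a support of size at most $r + 1$.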

5. DISCUSSION

Empirical evidence has long suggested that wide networks are comparatively easy to optimize. In this paper, we underpin these observations with rigorous theory: we prove that the optimization landscapes of empirical-risk minimization over wide networks have no spurious local minima. Standard competitors of deep learning are "classical" high-dimensional estimators, such as the ridge and the lasso (Hoerl & Kennard, 1970; Tibshirani, 1996), which have been studied extensively in statistics (Zhuang & Lederer, 2018). A common argument for these estimators is that their objective functions are often convex or equipped with efficient algorithms for optimization (Bien et al., 2018; Friedman et al., 2010), but our results indicate that global optimization of the objective functions in deep learning can be feasible as well. Our framework allows for arbitrary depths, for constrained as well as unconstrained estimation, for essentially arbitrary activation, and for a very wide spectrum of loss functions and input and output data. This generality demonstrates that the absence of spurious local minima is not a feature of specific estimators, network functions, or data but, instead, a universal property of wide networks. Our theory, therefore, supports the use of wide networks in general, possibly together with regularization or constraints to avoid overfitting. The main idea of most approaches in the field is to construct basis functions for the networks. In contrast, we formulate parametrizations that make the networks easy to work with. We could thus envision testing our new concepts beyond the presented application.

A PROOFS

A.1 PROOF OF THEOREM 1 AND AN EXTENSION

Proof of Theorem 1. The proof combines the main results of Sections 3.1 and 3.2. Let $\Theta \in \mathcal{M}$ be an arbitrary parameter and $\Gamma \in \mathcal{M}$ a global minimum of the objective function (4). In view of Proposition 1 in Section 3.1, we need to show that $\Theta \sim \Gamma$, and this follows directly from Corollary 1 in Section 3.2.

Figure 1 illustrates that a nonconvex objective function can have spurious local minima, but Theorem 1 proves that this is not the case for deep learning with wide networks. Figure 1 also illustrates that a nonconvex objective function can have "disconnected" global minima, but, in view of Corollary 1 and our results more generally, we can rule out this case as well. In other words, the set of global minima is "connected" in the sense that it is a set of path-constant parameters. The connectedness of the global minima is of minor importance in practice but an interesting topological property nevertheless.

A.2 PROOF OF PROPOSITION 1

Proof of Proposition 1. The key idea is to exploit the properties of monotone functions and convex functions. Let $\Theta \in \mathcal{M}$ be an arbitrary parameter and $\Gamma \in \mathcal{M}$ a global minimum of the objective function (4) such that $\Theta \sim \Gamma$. Let $\Theta', \Gamma' \in \mathcal{M}$ and $h_{\Theta',\Gamma'} : [0, 1] \to \mathcal{M}$ be as described in Definition 2. Note first that, since $\Gamma \leftrightarrow \Gamma'$ (see Definition 2), the parameter $\Gamma'$ is also a global minimum of the objective function. We then use 1. simple algebra, 2. the assumed convexity of the function $t \mapsto \mathfrak{l}[g_{h_{\Theta',\Gamma'}[t]}]$ (see Definition 2 again), 3. the assumed endpoints of the function $h_{\Theta',\Gamma'}$ (see Definition 2 once more), 4. the fact that $\Gamma'$ is a global minimum of (4), and 5. the fact that $(1-t) + t = 1$ to derive for all $t \in [0, 1]$ that

$$\mathfrak{l}\bigl[g_{h_{\Theta',\Gamma'}[t]}\bigr] = \mathfrak{l}\bigl[g_{h_{\Theta',\Gamma'}[(1-t)\cdot 0 + t \cdot 1]}\bigr] \le (1-t)\,\mathfrak{l}\bigl[g_{h_{\Theta',\Gamma'}[0]}\bigr] + t\,\mathfrak{l}\bigl[g_{h_{\Theta',\Gamma'}[1]}\bigr] = (1-t)\,\mathfrak{l}\bigl[g_{h_{\Theta',\Gamma'}[0]}\bigr] + t\,\mathfrak{l}[g_{\Gamma'}] \le (1-t)\,\mathfrak{l}\bigl[g_{h_{\Theta',\Gamma'}[0]}\bigr] + t\,\mathfrak{l}\bigl[g_{h_{\Theta',\Gamma'}[0]}\bigr] = \mathfrak{l}\bigl[g_{h_{\Theta',\Gamma'}[0]}\bigr]\,.$$

Assume first that the inequality is strict: $\mathfrak{l}[g_{h_{\Theta',\Gamma'}[\tilde{t}]}] < \mathfrak{l}[g_{h_{\Theta',\Gamma'}[0]}]$ for a $\tilde{t} \in [0, 1]$. Then, by the assumed convexity of $t \mapsto \mathfrak{l}[g_{h_{\Theta',\Gamma'}[t]}]$ (see Definition 2) and Lemma 4, there is a $c \in \arg\min_{t \in (0,1]} \{\mathfrak{l}[g_{h_{\Theta',\Gamma'}[t]}]\}$ such that $\tilde{h} : [0, 1] \to \mathbb{R}$ defined through $\tilde{h}[t] := \mathfrak{l}[g_{h_{\Theta',\Gamma'}[ct]}]$ for all $t \in [0, 1]$ is nonincreasing, $\tilde{h}[1]$ is a global minimum of the objective function (4), and $\tilde{h}[0] = \mathfrak{l}[g_{h_{\Theta',\Gamma'}[0]}] = \mathfrak{l}[g_{\Theta'}] > \tilde{h}[1] = \mathfrak{l}[g_{h_{\Theta',\Gamma'}[c]}]$. Hence, the function $\bar{h} : [0, 1] \to \mathcal{M}$ defined through $\bar{h}[t] := h_{\Theta',\Gamma'}[ct]$ for all $t \in [0, 1]$ is a function that satisfies the conditions of Definition 1 for the parameter $\Theta'$ and the global minimum $\bar{h}[1]$. Combining this result with the assumed relationship $\Theta \leftrightarrow \Theta'$ (see Definition 2) yields, in view of Definition 1, the fact that $\Theta$ is not a spurious local minimum of the objective function (4). We can thus assume that $t \mapsto \mathfrak{l}[g_{h_{\Theta',\Gamma'}[t]}]$ is constant, which implies $\Theta' \leftrightarrow \Gamma'$ by Definition 2. The fact that $\Theta \leftrightarrow \Theta'$ (see Definition 2 again) and the transitivity of the path constantness (see Property 3 in Lemma 1) then yield the fact that $\Theta \leftrightarrow \Gamma'$.
Hence, $\Theta$ is a global minimum and, therefore, not a spurious local minimum; see Definition 1 again.

A.3 PROOF OF PROPOSITION 2

Proof of Proposition 2. Our proof strategy is to apply Lemma 2, which is designed for one individual layer, layer by layer. We first introduce some convenient notation. We define, with some abuse of notation, $f_j : \mathbb{R}^{p_j \times n} \to \mathbb{R}^{p_j \times n}$ through $(f_j[M])_{uv} := f_j[M_{uv}]$ for all $j \in \{1, \dots, l\}$, $u \in \{1, \dots, p_j\}$, $v \in \{1, \dots, n\}$, and $M \in \mathbb{R}^{p_j \times n}$. We also define the data matrix $X \in \mathbb{R}^{d \times n}$ through $X_{ji} := (x_i)_j$ for all $j \in \{1, \dots, d\}$ and $i \in \{1, \dots, n\}$, that is, each column of $X$ consists of one sample. We finally write $g_\Theta[X] := (g_\Theta[x_1], \dots, g_\Theta[x_n]) = \Theta_l f_l[\Theta_{l-1} \cdots f_1[\Theta_0 X]] \in \mathbb{R}^{m \times n}$ for all $\Theta \in \bar{\mathcal{M}}$. Hence, $g_\Theta[X]$ summarizes the network's outputs for the given data.

Given a parameter $\Theta \in \mathcal{M}$, we establish a corresponding upper-block parameter layer by layer, starting from the outermost layer. We write

$$g_\Theta[X] = \underbrace{\Theta_l}_{=: A \in \mathbb{R}^{p_{l+1} \times p_l}} \underbrace{f_l}_{=: h}\Bigl[\underbrace{\Theta_{l-1}}_{=: B \in \mathbb{R}^{p_l \times p_{l-1}}} \underbrace{f_{l-1}\bigl[\Theta_{l-2} \cdots f_1[\Theta_0 X]\bigr]}_{=: C \in \mathbb{R}^{p_{l-1} \times n}}\Bigr]\,.$$

Lemma 2 for two-layer networks then gives (by Lemma 3, we can assume without loss of generality that $p$ is the identity function, that is, $A_p = A$ and $B^p = B$)

$$g_\Theta[X] = \bar{\Theta}_l f_l\Bigl[\begin{pmatrix} \tilde{\Theta}_{l-1} \\ 0 \end{pmatrix} f_{l-1}\bigl[\Theta_{l-2} \cdots f_1[\Theta_0 X]\bigr]\Bigr]$$

for a matrix $\bar{\Theta}_l \in \mathbb{R}^{p_{l+1} \times p_l}$ that satisfies $r[\bar{\Theta}_l, 1] \le r[\Theta_l, 1]$ (recall the definition of $r$ in Section 4.1) and meets Condition 3 in the first part of Definition 3 on block parameters as long as $s \ge p_{l+1}(n+1) = m(n+1)$, and for a matrix $\tilde{\Theta}_{l-1} \in \mathbb{R}^{m(n+1) \times p_{l-1}}$ that satisfies $r[\tilde{\Theta}_{l-1}, 1] \le r[\Theta_{l-1}, 1]$ (or $r[\tilde{\Theta}_{l-1}, q] \le r[\Theta_{l-1}, q]$ if $l = 1$) and consists of the first $m(n+1)$ rows of the matrix $\Theta_{l-1}$. (We implicitly assume here and in the following that $p_j \ge m(n+1)^{l-j+1}$ for all $j \in \{1, \dots, l\}$, which is the generic case in view of Corollary 1, to keep the notation manageable; extending the proof to the general case is straightforward.) Now, define a parameter $\Gamma^l \in \mathcal{M}$ through $\Gamma^l := (\bar{\Theta}_l, \Theta_{l-1}, \dots, \Theta_0)$ and a function $h_{\Theta,\Gamma^l} : [0, 1] \to \mathcal{M}$ through $h_{\Theta,\Gamma^l}[t] := (1-t)\Theta + t\Gamma^l$ for all $t \in [0, 1]$. The function $h_{\Theta,\Gamma^l}$ is continuous and satisfies $h_{\Theta,\Gamma^l}[0] = \Theta$ and $h_{\Theta,\Gamma^l}[1] = \Gamma^l$. Moreover, we can 1. use the definitions of the function $h_{\Theta,\Gamma^l}$ and the networks, 2. split the network along the outermost layer, 3. invoke the block shape of $\bar{\Theta}_l$ and the definition of $\tilde{\Theta}_{l-1}$ as the $m(n+1)$ first rows of the matrix $\Theta_{l-1}$, 4. use the above-stated equalities for the network $g_\Theta[X]$, and 5. consolidate the terms to show for all $t \in [0, 1]$ that

$$g_{h_{\Theta,\Gamma^l}[t]}[X] = \bigl((1-t)\Theta_l + t\bar{\Theta}_l\bigr) f_l\bigl[\Theta_{l-1} \cdots f_1[\Theta_0 X]\bigr] = (1-t)\Theta_l f_l\bigl[\Theta_{l-1} \cdots f_1[\Theta_0 X]\bigr] + t\bar{\Theta}_l f_l\bigl[\Theta_{l-1} \cdots f_1[\Theta_0 X]\bigr] = (1-t)\Theta_l f_l\bigl[\Theta_{l-1} \cdots f_1[\Theta_0 X]\bigr] + t\bar{\Theta}_l f_l\Bigl[\begin{pmatrix} \tilde{\Theta}_{l-1} \\ 0 \end{pmatrix} \cdots f_1[\Theta_0 X]\Bigr] = (1-t)g_\Theta[X] + t g_\Theta[X] = g_\Theta[X]\,.$$

Hence, the function $t \mapsto \mathfrak{l}[g_{h_{\Theta,\Gamma^l}[t]}]$ is constant. Finally, we use 1. the definition of the constraint in (3), 2. the definition of the function $h_{\Theta,\Gamma^l}$, 3. the convexity of the $\ell_1$-norm, 4. the above-stated fact that $r[\bar{\Theta}_l, 1] \le r[\Theta_l, 1]$, 5. a consolidation, 6. again the definition of the constraint, and 7. the fact that $\Theta \in \mathcal{M}$ to show for all $t \in [0, 1]$ that

$$r\bigl[h_{\Theta,\Gamma^l}[t]\bigr] = \max\Bigl\{a_r \max_{j \in \{1,\dots,l\}} |||(h_{\Theta,\Gamma^l}[t])_j|||_1,\; b_r |||(h_{\Theta,\Gamma^l}[t])_0|||_q\Bigr\} = \max\Bigl\{a_r \max_{j \in \{1,\dots,l-1\}} |||\Theta_j|||_1,\; a_r |||(1-t)\Theta_l + t\bar{\Theta}_l|||_1,\; b_r |||\Theta_0|||_q\Bigr\} \le \max\Bigl\{a_r \max_{j \in \{1,\dots,l-1\}} |||\Theta_j|||_1,\; (1-t)a_r|||\Theta_l|||_1 + t a_r|||\bar{\Theta}_l|||_1,\; b_r |||\Theta_0|||_q\Bigr\} \le \max\Bigl\{a_r \max_{j \in \{1,\dots,l-1\}} |||\Theta_j|||_1,\; (1-t)a_r|||\Theta_l|||_1 + t a_r|||\Theta_l|||_1,\; b_r |||\Theta_0|||_q\Bigr\} = \max\Bigl\{a_r \max_{j \in \{1,\dots,l-1\}} |||\Theta_j|||_1,\; a_r|||\Theta_l|||_1,\; b_r |||\Theta_0|||_q\Bigr\} = r[\Theta] \le 1\,.$$

Hence, $h_{\Theta,\Gamma^l}[t] \in \mathcal{M}$ for all $t \in [0, 1]$. In conclusion, we have shown (see Definition 2) that $\Theta$ and $\Gamma^l$ are path constant: $\Theta \leftrightarrow \Gamma^l$. We then move one layer inward.
Lemma 2 ensures that (recall again Lemma 3) Θ l-1 A∈R m(n+1)×p l-1 f l-1 h Θ l-2 B∈R p l-1 ×p l-2 • • • f 1 [Θ 0 X] C∈R p l-2 ×n = Θl-1 f l-1 Θ l-2 0 • • • f 1 [Θ 0 X] for a matrix Θl-1 ∈ R m(n+1)×p l-1 that satisfies r[ Θl-1 , 1] ≤ r[ Θ l-1 , 1] ≤ r[Θ l-1 , 1] and meets Condition 3 in the first part of Definition 3 on block parameters as long as s ≥ m(n + 1) 2 , and for a matrix Θ l-2 ∈ R m(n+1) 2 ×p l-2 that satisfies r[ Θ l-2 , 1] ≤ r[Θ l-2 , 1] (or r[ Θ l-2 , q] ≤ r[Θ l-2 , q] if l = 2 ) and consists of the first m(n + 1) 2 rows of the matrix Θ l-2 . Next, we define Θ l-1 ∈ R p l ×p l-1 through (Θ l-1 ) uv • • = ( Θl-1 ) uv for u ≤ m(n + 1) and (Θ l-1 ) uv • • = 0 otherwise. Combining this definition with the above-derived results yields g Θ [X] = Θ l f l Θ l-1 f l-1 Θ l-2 0 • • • f 1 [Θ 0 X] , and the matrix Θ l-1 satisfies r[Θ l-1 , 1] = r[ Θl-1 , 1] ≤ r[Θ l-1 , 1] and meets Condition 2 in the first part of Definition 3 on block parameters as long as s ≥ m(n + 1) 2 . Similarly as above, define a parameter Γ l-1 ∈ M through Γ l-1 • • = (Θ l , Θ l-1 , Θ l-2 , . . . , Θ 0 ) and a function h Γ l ,Γ l-1 : [0, 1] → M through h Γ l ,Γ l-1 [t] • • = (1 -t)Γ l + tΓ l-1 for all t ∈ [0, 1] to show that Γ l ↔Γ l-1 . In view of Property 3 in Lemma 1, we can conclude that Θ↔Γ l-1 . Finish the proof by induction over the layers, and note that the lower-block parameters can be established in the same way.
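The mechanism behind this proof, namely that a wide network's outputs on finitely many data points can be reproduced by many different output layers, and that interpolating between two such layers leaves the empirical risk constant, can be illustrated numerically. The sketch below is a minimal stand-in with hypothetical sizes and a ReLU activation; unlike the matrix Θ̌_l provided by Lemma 2, the replacement constructed here controls neither sparsity nor the regularizer, it only demonstrates why output-preserving changes of the outer layer yield risk-constant line segments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n samples, d inputs, p hidden units (wide: p > n), m outputs.
n, d, p, m = 5, 3, 12, 2
X = rng.standard_normal((d, n))

relu = lambda M: np.maximum(M, 0.0)

# A two-layer network g_Theta[X] = Theta1 relu(Theta0 X).
Theta0 = rng.standard_normal((p, d))
Theta1 = rng.standard_normal((m, p))
H = relu(Theta0 @ X)  # hidden activations on the data, shape (p, n)

# Any perturbation D with D @ H = 0 leaves the outputs on the data unchanged.
# Since p > n, the left null space of H is nontrivial; project a random matrix
# onto it via the Moore-Penrose pseudoinverse.
W = rng.standard_normal((m, p))
D = W @ (np.eye(p) - H @ np.linalg.pinv(H))  # D @ H = 0 up to rounding
Theta1_check = Theta1 + D  # different parameter, same outputs on the data

assert np.allclose(Theta1_check @ H, Theta1 @ H)

# Along the straight line (1 - t) Theta1 + t Theta1_check, the network's
# outputs, and hence the empirical risk, stay constant for every t.
for t in np.linspace(0.0, 1.0, 7):
    interp = (1 - t) * Theta1 + t * Theta1_check
    assert np.allclose(interp @ H, Theta1 @ H)
```

Since the hidden dimension p exceeds the number of samples n, the left null space of the activation matrix H is nontrivial, which is exactly where the width assumption enters.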

A.4 PROOF OF PROPOSITION 3

Proof of Proposition 3. The key ingredient of the proof is the assumed block structure of the parameters. Consider two block parameters Θ ∈ U_{s,l} and Γ ∈ L_{s,l} and define the parameters

Θ′ := (Θ_l, Θ_{l−1} + Γ_{l−1}, …, Θ_0 + Γ_0) ∈ M ;
Γ′ := (Γ_l, Θ_{l−1} + Γ_{l−1}, …, Θ_0 + Γ_0) ∈ M

and the function

h_{Θ′,Γ′} : [0, 1] → M ;  t ↦ (1 − t)Θ′ + tΓ′ = ((1 − t)Θ_l + tΓ_l, Θ_{l−1} + Γ_{l−1}, …, Θ_0 + Γ_0) .

By the row-wise structure of the constraint (see Page 3 again), the convexity of the ℓ_1-norm, and the block shapes of the parameters (see Figure 2 again), we find that r[h_{Θ′,Γ′}[t]] ≤ max{r[Θ], r[Γ]} ≤ 1 for all t ∈ [0, 1]; that is, h_{Θ′,Γ′}[t] ∈ M for all t ∈ [0, 1]. One can also verify readily that the function h_{Θ′,Γ′} is continuous, h_{Θ′,Γ′}[0] = Θ′, and h_{Θ′,Γ′}[1] = Γ′. Next, we define

[Θ′, Γ′]_{c_1,c_2} := (c_1Θ_l + c_2Γ_l, Θ_{l−1} + Γ_{l−1}, …, Θ_0 + Γ_0) ∈ M  for all c_1, c_2 ∈ R ,

which generalizes h_{Θ′,Γ′} in the sense that h_{Θ′,Γ′}[t] = [Θ′, Γ′]_{1−t,t} for all t ∈ [0, 1]. We then 1. invoke the definition of [Θ′, Γ′]_{c_1,c_2} and the definition of the networks in (1), 2. split the network along the outer layer, 3. use the block structures of the parameters and the assumption that f_j[b] = (f_j[b_1], …, f_j[b_{p_j}]) for all j ∈ {1, …, l} and b ∈ R^{p_j}, 4. continue in this fashion, and 5. invoke again the definition of the networks in (1) to find for all c_1, c_2 ∈ R and x ∈ D_x that

g_{[Θ′,Γ′]_{c_1,c_2}}[x] = (c_1Θ_l + c_2Γ_l) f_l[(Θ_{l−1} + Γ_{l−1}) f_{l−1}[⋯ f_1[(Θ_0 + Γ_0)x]]]
= c_1Θ_l f_l[(Θ_{l−1} + Γ_{l−1}) f_{l−1}[⋯ f_1[(Θ_0 + Γ_0)x]]] + c_2Γ_l f_l[(Θ_{l−1} + Γ_{l−1}) f_{l−1}[⋯ f_1[(Θ_0 + Γ_0)x]]]
= c_1Θ_l f_l[Θ_{l−1} f_{l−1}[⋯ f_1[(Θ_0 + Γ_0)x]]] + c_2Γ_l f_l[Γ_{l−1} f_{l−1}[⋯ f_1[(Θ_0 + Γ_0)x]]]
= ⋯
= c_1Θ_l f_l[Θ_{l−1} f_{l−1}[⋯ f_1[Θ_0 x]]] + c_2Γ_l f_l[Γ_{l−1} f_{l−1}[⋯ f_1[Γ_0 x]]]
= c_1 g_Θ[x] + c_2 g_Γ[x] .

Finally, we use 1. the definition of the function h_{Θ′,Γ′} and of the parameter [Θ′, Γ′]_{c_1,c_2}, 2. the above display with c_1 = 1 − (1 − a)t_1 − at_2 and c_2 = (1 − a)t_1 + at_2, 3. a rearrangement of the terms, 4. the assumed convexity of the loss function l, 5. again the above display, and 6. again the definition of h_{Θ′,Γ′} to find for all a, t_1, t_2 ∈ [0, 1] that

l[g_{h_{Θ′,Γ′}[(1−a)t_1+at_2]}] = l[g_{[Θ′,Γ′]_{1−(1−a)t_1−at_2, (1−a)t_1+at_2}}]
= l[(1 − (1 − a)t_1 − at_2)g_Θ + ((1 − a)t_1 + at_2)g_Γ]
= l[((1 − a) − (1 − a)t_1)g_Θ + (a − at_2)g_Θ + (1 − a)t_1 g_Γ + at_2 g_Γ]
= l[(1 − a)((1 − t_1)g_Θ + t_1 g_Γ) + a((1 − t_2)g_Θ + t_2 g_Γ)]
≤ (1 − a) l[(1 − t_1)g_Θ + t_1 g_Γ] + a l[(1 − t_2)g_Θ + t_2 g_Γ]
= (1 − a) l[g_{[Θ′,Γ′]_{1−t_1,t_1}}] + a l[g_{[Θ′,Γ′]_{1−t_2,t_2}}]
= (1 − a) l[g_{h_{Θ′,Γ′}[t_1]}] + a l[g_{h_{Θ′,Γ′}[t_2]}] ,

which means that t ↦ l[g_{h_{Θ′,Γ′}[t]}] is convex. We conclude (see Definition 2) that Θ′ and Γ′ are path-convex. In view of Definition 2, it is left to show that Θ ↔ Θ′ and Γ ↔ Γ′. Consider now the function

h_{Θ,Θ′} : [0, 1] → M ;  t ↦ (Θ_l, Θ_{l−1} + tΓ_{l−1}, …, Θ_0 + tΓ_0) .

By the row-wise structure of the constraint (see Page 3 once more) and the block shapes of the parameters (see Figure 2 once more), we find that r[h_{Θ,Θ′}[t]] ≤ max{r[Θ], r[tΓ]} ≤ 1 for all t ∈ [0, 1]; that is, h_{Θ,Θ′}[t] ∈ M for all t ∈ [0, 1]. One can also verify readily that the function h_{Θ,Θ′} is continuous, h_{Θ,Θ′}[0] = Θ, and h_{Θ,Θ′}[1] = Θ′. Moreover, we can 1. invoke the definition of h_{Θ,Θ′} and the definition of the networks in (1), 2. use the block structures of the parameters and the elementwise structure of the activation functions, 3. continue in this fashion, and 4. invoke the definitions of the function h_{Θ,Θ′} and the networks in (1) again to find for all t ∈ [0, 1] and x ∈ D_x that

g_{h_{Θ,Θ′}[t]}[x] = Θ_l f_l[(Θ_{l−1} + tΓ_{l−1}) f_{l−1}[⋯ f_1[(Θ_0 + tΓ_0)x]]]
= Θ_l f_l[Θ_{l−1} f_{l−1}[⋯ f_1[(Θ_0 + tΓ_0)x]]]
= ⋯
= Θ_l f_l[Θ_{l−1} f_{l−1}[⋯ f_1[Θ_0 x]]]
= g_{h_{Θ,Θ′}[0]}[x] ,

which implies that the function t ↦ l[g_{h_{Θ,Θ′}[t]}] is constant. We conclude (see Definition 2) that Θ ↔ Θ′. We can show in a similar way that Γ ↔ Γ′.
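The linearity identity at the heart of this argument can be checked numerically. The sketch below builds a one-hidden-layer ReLU network whose parameters carry the disjoint block supports of Figure 2 (all sizes are hypothetical choices for illustration) and verifies that mixing only the outer layers is linear in (c_1, c_2), because the cross terms vanish under the elementwise activation.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda M: np.maximum(M, 0.0)

# Hypothetical sizes: the first s hidden units carry Theta, the last s carry Gamma.
d, s, m = 3, 4, 2
p = 2 * s  # hidden units in total
x = rng.standard_normal(d)

# Upper-block parameter Theta: inner weights in the top rows, outer in the left columns.
Theta0 = np.zeros((p, d)); Theta0[:s] = rng.standard_normal((s, d))
Theta1 = np.zeros((m, p)); Theta1[:, :s] = rng.standard_normal((m, s))

# Lower-block parameter Gamma: supported on the complementary block.
Gamma0 = np.zeros((p, d)); Gamma0[s:] = rng.standard_normal((s, d))
Gamma1 = np.zeros((m, p)); Gamma1[:, s:] = rng.standard_normal((m, s))

def g(theta1, theta0, v):
    """One-hidden-layer network g_Theta[x] = Theta1 relu(Theta0 x)."""
    return theta1 @ relu(theta0 @ v)

# Because the blocks are disjoint and relu acts elementwise, mixing only the
# outer layers is linear in (c1, c2): the cross terms vanish.
for c1, c2 in [(0.3, 0.7), (1.0, 1.0), (-2.0, 0.5)]:
    mixed = g(c1 * Theta1 + c2 * Gamma1, Theta0 + Gamma0, x)
    assert np.allclose(mixed, c1 * g(Theta1, Theta0, x) + c2 * g(Gamma1, Gamma0, x))
```

The same cancellation is what makes the map (c_1, c_2) ↦ g_{[Θ′,Γ′]_{c_1,c_2}}[x] linear in the proof, and hence the objective along the connecting path convex.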
Hence, given Definition 2, we find that Θ and Γ are path-convex, as desired.

A.5 PROOF OF LEMMA 2

Proof of Lemma 2. The proof is essentially a careful reparametrization. We proceed in three steps; see Figure 4 for an overview.

Step 1: Fix a k ∈ {1, …, u}. We first show that there is a matrix Ȧ ∈ R^{u×v} such that

1. Ȧh[BC] = Ah[BC];
2. r[Ȧ, q_A] ≤ r[A, q_A];
3. #{j ∈ {1, …, v} : Ȧ_{kj} ≠ 0} ≤ r + 1;
4. Ȧ_{aj} = A_{aj} for all a ≠ k.

Hence, we replace the matrix A, which contains the "output parameters," by a matrix whose kth row has at most r + 1 nonzero entries; see the illustration in Figure 4. The proof of this step is based on our version of Carathéodory's theorem in Lemma 5.

For every k ∈ {1, …, u} and i ∈ {1, …, r}, elementary matrix algebra yields that

(Ah[BC])_{ki} = Σ_{j=1}^{v} A_{kj} (h[BC])_{ji} .

Denoting the row vectors of a matrix M ∈ R^{a×b} by M_{1•}, …, M_{a•} ∈ R^{b}, we then get

(Ah[BC])_{k•} = Σ_{j=1}^{v} A_{kj} (h[BC])_{j•} .

The case ‖A_{k•}‖_{q_A} = 0 is straightforward to deal with: all elements of A_{k•} are then equal to zero, and we can just set Ȧ := A. We can thus assume that ‖A_{k•}‖_{q_A} ≠ 0. We then get from the preceding equality that

(Ah[BC])_{k•} = Σ_{j=1}^{v} t̃_j z_j  with  t̃_j := A_{kj}/‖A_{k•}‖_{q_A}  and  z_j := ‖A_{k•}‖_{q_A} (h[BC])_{j•} .

Since

‖t̃‖_{q_A} = (Σ_{j=1}^{v} |t̃_j|^{q_A})^{1/q_A} = (Σ_{j=1}^{v} |A_{kj}/‖A_{k•}‖_{q_A}|^{q_A})^{1/q_A} = (1/‖A_{k•}‖_{q_A}) (Σ_{j=1}^{v} |A_{kj}|^{q_A})^{1/q_A} = ‖A_{k•}‖_{q_A}/‖A_{k•}‖_{q_A} = 1 ,

the previous equality means that the vector (Ah[BC])_{k•} ∈ R^{r} is a q_A-convex combination of the 2v vectors z_1, …, z_v, −z_1, …, −z_v ∈ R^{r}; that is, (Ah[BC])_{k•} ∈ conv_{q_A}[z_1, …, z_v, −z_1, …, −z_v]. Hence, by Lemma 5, there is a t ∈ [−1, 1]^{v} such that ‖t‖_{q_A} ≤ 1, #{j ∈ {1, …, v} : t_j ≠ 0} ≤ r + 1, and

(Ah[BC])_{k•} = Σ_{j=1}^{v} t_j ‖A_{k•}‖_{q_A} (h[BC])_{j•} .

Hence, by the definition of the row vectors, (Ah[BC])_{ki} = Σ_{j=1}^{v} t_j ‖A_{k•}‖_{q_A} (h[BC])_{ji} for all i ∈ {1, …, r}, and, more generally, we find for all a ∈ {1, …, u} and i ∈ {1, …, r} that

(Ah[BC])_{ai} = Σ_{j=1}^{v} t_j ‖A_{k•}‖_{q_A} (h[BC])_{ji}  for a = k ;
(Ah[BC])_{ai} = Σ_{j=1}^{v} A_{aj} (h[BC])_{ji}  otherwise.

This motivates us to define Ȧ ∈ R^{u×v} through

Ȧ_{aj} := t_j ‖A_{k•}‖_{q_A}  for a = k ;  Ȧ_{aj} := A_{aj}  otherwise.

Properties 1, 3, and 4 then follow immediately. Property 2 can be derived by using 1. the definition of the basic regularizer r on Page 7, 2. the definition of Ȧ, 3. the linearity of finite sums, 4. the definition of the ℓ_q-norms on Page 8, 5. the above-derived property ‖t‖_{q_A} ≤ 1, 6. a consolidation, and 7. again the definition of the basic regularizer r on Page 7:

r[Ȧ, q_A]^{q_A} = max_{a ∈ {1,…,u}} { Σ_{j=1}^{v} |t_j|^{q_A} ‖A_{k•}‖_{q_A}^{q_A}  for a = k ;  Σ_{j=1}^{v} |A_{aj}|^{q_A}  otherwise }
= max_{a ∈ {1,…,u}} { ‖t‖_{q_A}^{q_A} Σ_{j=1}^{v} |A_{kj}|^{q_A}  for a = k ;  Σ_{j=1}^{v} |A_{aj}|^{q_A}  otherwise }
≤ max_{a ∈ {1,…,u}} { Σ_{j=1}^{v} |A_{kj}|^{q_A}  for a = k ;  Σ_{j=1}^{v} |A_{aj}|^{q_A}  otherwise }
= max_{a ∈ {1,…,u}} Σ_{j=1}^{v} |A_{aj}|^{q_A}
= r[A, q_A]^{q_A} ,

as desired. This concludes the proof of the first step.

Step 2: We now show that there is a matrix Ā ∈ R^{u×v} such that

1. Āh[BC] = Ah[BC];
2. r[Ā, q_A] ≤ r[A, q_A];
3. #{j ∈ {1, …, v} : Ā_{aj} ≠ 0} ≤ r + 1 for all a ∈ {1, …, u}.

Hence, we replace the matrix A by a matrix whose every row has at most r + 1 nonzero entries; see again the illustration in Figure 4. Since Step 1 changes only the kth row of A (see Property 4 derived in Step 1), we can apply it to one row after another.

Step 3: We finally prove the first part of the lemma; see again the illustration in Figure 4. By Property 3 of the previous step, the matrix Ā has at most u(r + 1) nonzero columns. Verify that replacing Ā by Ā_p and B by B_p for a suitable permutation p leads to a version of Ā whose entries outside the first u(r + 1) columns are equal to zero, while all other properties remain intact; we denote this version of Ā by Â. We then derive a corresponding identity for all j ∈ {1, …, v} and i ∈ {1, …, r}, where we define the relevant quantities with some abuse of notation. Combining this identity with the results of Step 2 (with A and B replaced by A_p and B_p, respectively) yields an analogous identity for all a ∈ {1, …, u} and i ∈ {1, …, r}. We then define a matrix B̄ ∈ R^{v×o} accordingly and use 1. the above-stated equality, 2. the definition of B̄, 3. a similar derivation as above, 4. the block structure of Â, and 5. a similar derivation as above to establish the claimed equality for all a ∈ {1, …, u} and i ∈ {1, …, r}. The other properties stated in the lemma follow readily. The second part of the lemma can be derived in the same way.

A.6 PROOF OF LEMMA 4

Proof of Lemma 4. The proof is a simple exercise in calculus. An illustration of the quantities involved in the proof is given in Figure 5. The function h is convex by assumption; hence, it is continuous. Then, according to the extreme value theorem, there is a number

c ∈ argmin_{t ∈ [0,1]} h[t] .

Since h[t̃] < h[0] for a t̃ ∈ (0, 1], it holds that c ∈ (0, 1]. Define the rescaled function h̄ : [0, 1] → R through h̄[t] := h[ct]. Now consider t_2 ∈ [0, 1] and t_1 ∈ [0, t_2), that is, t_2 > t_1. Basic calculus ensures that

h̄[t_2] = h[ct_2]
= h[ct_1 + (ct_2 − ct_1)]
= h[ct_1 + ((ct_2 − ct_1)/(c − ct_1)) (c − ct_1)]
= h[ct_1 + ((t_2 − t_1)/(1 − t_1)) (c − ct_1)]
= h[(1 − (t_2 − t_1)/(1 − t_1)) ct_1 + ((t_2 − t_1)/(1 − t_1)) c]
≤ (1 − (t_2 − t_1)/(1 − t_1)) h[ct_1] + ((t_2 − t_1)/(1 − t_1)) h[c]
≤ (1 − (t_2 − t_1)/(1 − t_1)) h[ct_1] + ((t_2 − t_1)/(1 − t_1)) h[ct_1]
= h[ct_1]
= h̄[t_1] ,

where the first inequality uses the convexity of h and the second inequality uses the fact that c minimizes h over [0, 1]. Hence, h̄ is nonincreasing. Moreover,

h̄[0] = h[c · 0] = h[0] > h[t̃] ≥ h[c] = h[c · 1] = h̄[1] .

Hence, h̄[0] > h̄[1]. This concludes the proof.

Figure 1: spurious local minimum of a hypothetical objective function

Figure 2: Illustration of the parameters of an s-upper block parameter (left) and an s-lower block parameter (right) for l = 2. The dark areas of the matrices can consist of arbitrary values; the light areas consist of zeros.

Figure 3: path-equivalence between two parameters Θ and Γ; see Corollary 1 and Definition 2

Figure 4: overview of the proof of Lemma 2

Figure 5: quantities in the proof of Lemma 4
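The rescaling argument in the proof of Lemma 4 can also be checked numerically. The sketch below uses a hypothetical convex function h with h[t̃] < h[0] (any such function would do), approximates the minimizer c on a grid, and confirms that the rescaled function h̄[t] = h[ct] is nonincreasing with h̄[0] > h̄[1].

```python
import numpy as np

# A hypothetical convex function on [0, 1] with h(t~) < h(0) for some t~ in (0, 1],
# as assumed in Lemma 4.
h = lambda t: (t - 0.4) ** 2

ts = np.linspace(0.0, 1.0, 1001)
c = ts[np.argmin(h(ts))]  # a (grid) minimizer; c > 0 since h(0.4) < h(0)

# The rescaled function h_bar(t) = h(c * t) is nonincreasing on [0, 1] ...
h_bar = h(c * ts)
assert np.all(np.diff(h_bar) <= 1e-12)

# ... and strictly decreases between its endpoints, matching the lemma's conclusion.
assert h_bar[0] > h_bar[-1]
```

The check mirrors the proof exactly: convexity pushes h[ct_2] below a convex combination of h[ct_1] and h[c], and the minimality of c closes the chain.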

