SPURIOUS LOCAL MINIMA PROVABLY EXIST FOR DEEP CONVOLUTIONAL NEURAL NETWORKS

Abstract

In this paper, we prove that a general family of infinitely many spurious local minima exists in the loss landscape of deep convolutional neural networks with squared loss or cross-entropy loss. Our construction of spurious local minima is general and applies to practical datasets and CNNs containing two consecutive convolutional layers. We develop new techniques to solve the challenges in the construction caused by convolutional layers. We solve a combinatorial problem to show that a differentiation of data samples is always possible somewhere in the feature maps. The empirical risk is then decreased by a perturbation of the network parameters that affects different samples in different ways. Although the filters and biases are tied within each feature map, in our construction this perturbation only affects the output of a single ReLU neuron. We also give an example of a nontrivial spurious local minimum in which different activation patterns of samples are explicitly constructed. Experimental results verify our theoretical findings.

1. INTRODUCTION

Convolutional neural networks (CNNs) (e.g., Lecun et al. (1998); Krizhevsky et al. (2012); Simonyan & Zisserman (2015); Szegedy et al. (2015); He et al. (2016); Huang et al. (2017)), one of the most important models in deep learning, have been successfully applied to many domains. Spurious local minima, whose losses are greater than that of the global minimum, play an important role in the training of deep CNNs and in the understanding of deep learning models. It is widely believed that spurious local minima exist in the loss landscape of CNNs, which is thought to be highly non-convex, as evidenced by experimental studies (e.g., Dauphin et al. (2014); Goodfellow et al. (2015); Liao & Poggio (2017); Freeman & Bruna (2017); Draxler et al. (2018); Garipov et al. (2018); Li et al. (2018); Mehmeti-Gopel et al. (2021)). However, the existence of spurious local minima for deep CNNs caused by convolutions has never been proved mathematically before. In this paper, we prove that infinitely many spurious local minima exist in the loss landscape of deep CNNs with squared loss or cross-entropy loss. This is in contrast to the "no spurious local minima" property of deep linear networks. The construction of spurious local minima in this paper is general and applies to practical datasets and to CNNs containing two consecutive convolutional layers, a condition satisfied by popular CNN architectures. The idea is to first construct a local minimum θ, and then construct another point θ′ in parameter space which has the same empirical risk as θ but around which there exist regions with lower empirical risk. However, the construction of spurious local minima for CNNs faces some technical challenges, and the construction for fully connected deep networks cannot be directly extended to CNNs. Our main contribution in this paper is to tackle these technical challenges.

In the construction of spurious local minima for fully connected deep ReLU networks (He et al. (2020); Ding et al. (2019); Goldblum et al. (2020); Liu et al. (2021)), in order to construct θ′ and perturb around it, data samples are split into groups according to the inputs of a specific ReLU neuron such that each group behaves differently under a perturbation of the network parameters, producing a lower risk. This technique relies on a data split and parameter perturbation, and cannot be directly applied to CNNs for the following reasons. Every neuron in a CNN feature map has a limited receptive field that covers only part of the pixels in an input image (taking images as an example), and hence the inputs to a ReLU neuron can be identical even for distinct samples, making them hard to distinguish. This data-split issue is further complicated by the nonlinear ReLU activations, which truncate negative inputs, and by the activation status, which can vary from place to place and from sample to sample. Moreover, the filters and biases of a CNN are shared by all neurons in the same feature map, so adjusting the output of a ReLU neuron by perturbing these tied parameters will also affect other neurons in the same feature map. We solve these challenges by developing new techniques in this paper. Taking limited receptive fields and the possibly distinct activation status of different locations and samples into account, we solve a combinatorial problem to show that a split of the data samples is always possible somewhere in the feature maps. We then present a construction of CNN parameters (θ′) that can be perturbed to achieve lower losses for general local minima θ. Although the parameters are tied, our construction can perturb the outputs of samples at a single neuron in the feature maps without affecting other locations. We also give a concrete example of spurious local minima for CNNs. To the best of our knowledge, this is the first work showing the existence of spurious local minima in deep CNNs introduced by convolutional layers. This paper is organized as follows. Section 1.1 reviews related work.
Section 2 describes convolutional neural networks and gives the notation used in this paper. In Section 3, our general results on spurious local minima are given, with some discussion. In Section 4, we present an example of nontrivial spurious local minima for CNNs. Section 5 presents experimental results that verify our theoretical findings. Finally, conclusions are provided. Additional lemmas, experimental details and all proofs are given in the appendices.

1.1. RELATED WORK

For some neural networks and learning models, it has been shown that no spurious local minima exist. These models include deep linear networks (Baldi & Hornik (1989); Kawaguchi (2016); Lu & Kawaguchi (2017); Laurent & von Brecht (2018); Yun et al. (2018); Nouiehed & Razaviyayn (2018); Zhang (2019)), matrix completion and tensor decomposition (e.g., Ge et al. (2016)), one-hidden-layer networks with quadratic activation (Soltanolkotabi et al. (2019); Du & Lee (2018)), deep linear residual networks (Hardt & Ma (2017)) and deep quadratic networks (Kazemipour et al. (2020)). The existence of spurious local minima for one-hidden-layer ReLU networks has been demonstrated, by constructing examples of networks and data samples, in Safran & Shamir (2018); Swirszcz et al. (2016); Zhou & Liang (2018); Yun et al. (2019); Ding et al. (2019); Sharifnassab et al. (2020); He et al. (2020); Goldblum et al. (2020), etc. For deep ReLU networks, He et al. (2020); Ding et al. (2019); Goldblum et al. (2020); Liu et al. (2021) showed that spurious local minima exist for fully connected deep neural networks with some general loss functions. For these spurious local minima, all ReLU neurons are active and the deep neural networks reduce to linear predictors. Spurious local minima for CNNs are not treated in these works. In comparison, we deal with spurious local minima for CNNs in this work, and the constructed spurious local minima can be nontrivial, generating nonlinear predictors with some inactive ReLU neurons. Du et al. (2018); Zhou et al. (2019); Brutzkus & Globerson (2017) showed the existence of spurious local minima for one-hidden-layer CNNs with a single non-overlapping filter, Gaussian input and squared loss. In contrast, we discuss practical deep CNNs with multiple filters with overlapping receptive fields and arbitrary input, for both squared and cross-entropy loss.
Given a non-overlapping filter and Gaussian input, the population risk with the squared loss can be formulated analytically with respect to the single filter w, which facilitates the analysis of the loss landscape. Thus, the techniques used in Du et al. (2018); Zhou et al. (2019); Brutzkus & Globerson (2017) cannot be extended to the general case of empirical risk with arbitrary input samples discussed in this paper. Nguyen & Hein (2018) showed that a sufficiently wide CNN (one including a wide layer with more neurons than the number of training samples, followed by a fully connected layer) has a well-behaved loss surface with almost no bad local minima. Liu (2022) explored spurious local minima for CNNs introduced by fully connected layers. Du et al. (2019); Allen-Zhu et al. (2019) explored the local convergence of gradient descent for sufficiently over-parameterized deep networks, including CNNs.

2.1. NOTATIONS

We use M_{i,•} and M(i,•) to denote the ith row of a matrix M, and M_{•,i} and M(•,i) to denote the ith column of M. M_{i,j} and M_{ij} both denote the (i,j) entry of M. The ith component of a vector v is denoted by v_i, v^i, or v(i). [N] is equivalent to {1, 2, ⋯, N}. i:j means i, i+1, ⋯, j. v(i:end) stands for the components of a vector v from the ith one to the last one. 1_n denotes a vector of size n whose components are all 1s.

2.2. CONVOLUTIONAL NEURAL NETWORKS

A typical CNN includes convolutional layers, pooling layers and fully connected layers. Because spurious local minima caused by fully connected layers in CNNs can be treated in the same way as in fully connected networks, we focus on spurious local minima caused by convolutional layers. Convolutional layers take advantage of the translational invariance inherent in data, and are defined as follows. Suppose the lth layer is a convolutional layer; neighboring neurons at layer (l−1) are grouped into patches for the convolution operation. Let P_l and s_l be, respectively, the number of patches and the size of each patch in the lth layer, and denote by T_l the number of convolutional filters (or the number of feature maps) in the lth layer. Denote by o^l_1, ⋯, o^l_{P_l} ∈ R^{s_l} the set of patches in the lth layer. If there are multiple feature maps in layer (l−1), a patch includes the corresponding neighboring neurons from every feature map. The number of neurons in each feature map of the lth layer is P_{l−1}, so the number of neurons in the lth layer is n_l = T_l P_{l−1}. Given the pth (p ∈ [P_{l−1}]) patch o^{l−1}_p and the tth (t ∈ [T_l]) filter w^l_t ∈ R^{s_{l−1}}, the output of the corresponding neuron in layer l is o^l(h) = σ(w^l_t · o^{l−1}_p + b^l_t), where h = (t−1)P_{l−1} + p and σ(x) = max(0, x) is the ReLU activation function. There is a single bias b^l_t for each feature map. For ease of presentation, we assume that the strides of the convolutions are equal to one and the input feature maps are zero padded such that, for each convolutional layer, the sizes of the input and output feature maps are equal. The convolution operation can then be reformulated as a matrix-vector product. We explain this with the following one-dimensional example. Given a one-dimensional input o^{l−1} = (a, b, c, d, e)^T and a filter w_1 = (w^1_1, w^2_1, w^3_1)^T, the input becomes (0, a, b, c, d, e, 0)^T after padding with zeros.
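The one-dimensional example can be checked numerically. The following NumPy sketch (the input and filter values are arbitrary stand-ins) builds the zero-padded, stride-one convolution matrix for a length-3 filter, compares it against direct computation on the padded input, and verifies that the filter (0, 1, 0) yields the identity matrix:

```python
import numpy as np

def conv_matrix(w, n):
    """n x n matrix implementing 1-D "same" convolution (stride 1,
    zero padding) with a length-3 filter w = (w1, w2, w3)."""
    W = np.zeros((n, n))
    for p in range(n):           # output position p uses inputs p-1, p, p+1
        for k in range(3):
            j = p + k - 1
            if 0 <= j < n:
                W[p, j] = w[k]
    return W

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # plays the role of (a, b, c, d, e)
w = np.array([2.0, -1.0, 3.0])            # arbitrary illustrative filter
W = conv_matrix(w, len(x))

# Direct computation on the zero-padded input (0, a, ..., e, 0).
xp = np.concatenate(([0.0], x, [0.0]))
direct = np.array([w @ xp[p:p + 3] for p in range(len(x))])
assert np.allclose(W @ x, direct)

# The filter (0, 1, 0) forwards the input unchanged: W is the identity.
assert np.allclose(conv_matrix(np.array([0.0, 1.0, 0.0]), 5), np.eye(5))
```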
Here, T_l = 1, s_{l−1} = 3, P_{l−1} = 5 after zero-padding, and n_l = T_l P_{l−1} = 5. The convolution operation is equivalent to o^l = σ(W o^{l−1} + b^l_1 1_5), where

W = [ w^2_1  w^3_1  0      0      0
      w^1_1  w^2_1  w^3_1  0      0
      0      w^1_1  w^2_1  w^3_1  0
      0      0      w^1_1  w^2_1  w^3_1
      0      0      0      w^1_1  w^2_1 ],   o^{l−1} = (a, b, c, d, e)^T.

One can set (w^1_1 = 0, w^2_1 = 1, w^3_1 = 0) in the equation above to forward o^{l−1} unchanged (before adding the bias and applying the ReLU activation), which is equivalent to setting the weight matrix to the identity. For a two-dimensional input, the input can be converted into a vector and the convolution operation can be reformulated as a matrix-vector product with the same idea. We will use matrix-vector products to represent convolution operations. Denote the weight matrix and bias vector of the lth layer as W^l ∈ R^{n_l × n_{l−1}} and b^l ∈ R^{n_l}; the output of the lth layer is then o^l = σ(W^l o^{l−1} + b^l). This matrix-vector formulation can also be used to represent fully connected layers. We only consider average pooling layers. If layer l is an average pooling layer, each patch used in the pooling operation includes only the neighboring neurons within a single feature map of layer (l−1). Let P_{l−1} be the number of patches in a single feature map of the (l−1)th layer; the size of each feature map in layer l is then P_{l−1}. For average pooling, the output of the pth neuron o^l(p) in each feature map of layer l is computed by o^l(p) = mean(o^{l−1}_p(1), ⋯, o^{l−1}_p(s_{l−1})), p ∈ [P_{l−1}], where o^{l−1}_p(i) is the ith element of the pth patch of layer (l−1) within a feature map. Average pooling is a linear operation and can thus be represented by a matrix-vector product. The exact forms of the parameter matrices for average pooling and fully connected operations are given in Appendix A.

Consider a training set {(x_1, y_1), (x_2, y_2), ⋯, (x_N, y_N)}, where x_i ∈ R^{d_x} and y_i ∈ R^{d_y} (i ∈ [N]) are, respectively, the input and the target output. Let L be the number of layers in the CNN, and let o^{l,i} be the output o^l for the ith data sample (denote o^{0,i} := x_i). The output vector of the CNN is o^i := o^{L,i} = W^L o^{L−1,i} + b^L, i ∈ [N]. We introduce a diagonal matrix I^{l,i} ∈ R^{n_l × n_l} for each sample and each ReLU layer to represent the activation status of the ReLU neurons, whose diagonal entries are I^{l,i}_{k,k} = 1 if W^l(k,•) o^{l−1,i} + b^l_k > 0 and I^{l,i}_{k,k} = 0 otherwise. Consequently, the output of a convolutional layer can be written as o^{l,i} = I^{l,i}(W^l o^{l−1,i} + b^l), and the CNN output can be written as

o^i = W^L I^{L−1,i}(W^{L−1} I^{L−2,i}(⋯ I^{1,i}(W^1 x_i + b^1) ⋯) + b^{L−1}) + b^L.

The empirical risk (training loss) is defined as R(W^1, b^1, ⋯, W^L, b^L) = (1/N) Σ_{i=1}^N l(o^i, y_i), where l is the loss function, such as the widely used cross-entropy loss for classification problems. Our main goal in this paper is to construct spurious local minima for this empirical risk function of deep CNNs.
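The role of the sample-dependent activation-pattern matrices I^{l,i} can be illustrated on a tiny network; the weights below are random stand-ins for a toy three-layer model, not the construction developed later in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy L = 3 network: two ReLU layers followed by a linear output layer.
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((4, 4)), rng.standard_normal(4)
W3, b3 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)

# Ordinary forward pass with ReLU.
o1 = np.maximum(W1 @ x + b1, 0.0)
o2 = np.maximum(W2 @ o1 + b2, 0.0)
out = W3 @ o2 + b3

# Same pass written with diagonal activation-pattern matrices,
# o^l = I^l (W^l o^{l-1} + b^l), as in the formula above.
I1 = np.diag((W1 @ x + b1 > 0).astype(float))
I2 = np.diag((W2 @ o1 + b2 > 0).astype(float))
out_lin = W3 @ (I2 @ (W2 @ (I1 @ (W1 @ x + b1)) + b2)) + b3
assert np.allclose(out, out_lin)
```

Once the activation pattern is fixed, the network acts linearly on the sample, which is the property the constructions below exploit.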

3. SPURIOUS LOCAL MINIMA

In this section, we first construct a general local minimum with parameters θ := {W^l, b^l}_{l=1}^L. Then, we present another point θ′ in parameter space which has the same empirical risk as θ. We further perturb θ′ and show that the empirical risk can be decreased. Therefore, θ is a spurious local minimum in parameter space. Our construction is general and applies to CNNs with two consecutive convolutional layers, a condition easily satisfied by most practical CNN models, such as AlexNet (Krizhevsky et al. (2012)), VGG (Simonyan & Zisserman (2015)) and GoogLeNet (Szegedy et al. (2015)).

3.1. CONSTRUCTION OF LOCAL MINIMA

We first give a general construction of nontrivial local minima. Given a local minimum of a subnetwork of a CNN, we show that it induces a local minimum of the whole CNN if the subnetwork is embedded into the CNN appropriately. Local minima for subnetworks of a CNN can be constructed with any method; our construction of spurious local minima is general and does not rely on the concrete form of the local minima of the subnetworks. In Section 4, we give an explicit construction of an exemplar nontrivial local minimum for subnetworks. Our construction of spurious local minima utilizes two consecutive convolutional layers, wherever they are in the CNN. For ease of presentation, we first assume that the last two hidden layers before the output layer (the Lth layer, which is fully connected) are two consecutive convolutional layers. The general case of two consecutive convolutional layers elsewhere is discussed in Section 3.3. When layers L−1 and L−2 are two consecutive convolutional layers, we fix the parameters of these last two convolutional layers such that the output o^{L−3,i} is passed through layers L−2 and L−1 unchanged. Two feature maps in layers L−2 and L−1, respectively, are reserved for the perturbation described later. Without loss of generality, let them be the first and second feature maps in these two layers. Fig. 1(a) shows the top layers of the CNN architecture with some associated parameters. Note that the size of a single output feature map in the lth layer is P_{l−1}, and P_{L−2} = P_{L−3} = P_{L−4} after padding. The following lemma informally describes the construction of local minima θ; its formal version is given in Lemma 6 in Appendix C. Lemma 1. (informal).
Given a CNN whose numbers of filters in the last two convolutional layers satisfy T_{L−1} ≥ 3 and T_{L−2} ≥ 3, the parameters W^L(•, 1:2P_{L−2}), W^{L−1}, b^{L−1} and W^{L−2}, b^{L−2} of the last three layers are set to propagate o^{L−3,i} unchanged to the output neurons, with the first and second feature maps in both layer L−2 and layer L−1 making no contribution to the final output. The remaining parameters, including W^L(•, 2P_{L−2}+1:end), b^L and {W^l, b^l}_{l=1}^{L−3}, constitute a subnetwork, and if they locally minimize the training loss when the other parameters are fixed, then the point θ := {W^l, b^l}_{l=1}^L is a local minimum in parameter space. In this lemma, W^L(•, 1:2P_{L−2}) denotes the weights of the connections from the output neurons to the first two feature maps in layer L−1. The identity forward propagations in layers L−1 and L−2 (see Fig. 1(a)) can be implemented by setting the corresponding submatrices in W^{L−1} and W^{L−2} to identity matrices, which is equivalent to setting the corresponding filters to the form (0, ⋯, 1, ⋯, 0)^T (with a single nonzero entry in the middle) as explained in Section 2.2, and setting the biases b^{L−1}, b^{L−2} to zero. The embedding scheme to construct local minima or saddle points has also been used for fully connected networks by Fukumizu et al. (2019); Fukumizu & Amari (2000); He et al. (2020); Liu et al. (2021); Zhang et al. (2022). However, embedding for CNNs needs to tackle the problems of limited receptive fields, arbitrary activation status at different locations, and tied weights and biases.

3.2. CONSTRUCTION OF SPURIOUS LOCAL MINIMA

Our construction of spurious local minima relies on the following assumption. Assumption 1. At the local minimum θ, the inputs to the final fully connected layer are distinct for different samples, i.e., ∀i, j ∈ [N] with i ≠ j, o^{L−1,i} ≠ o^{L−1,j}. For practical CNNs and datasets with rich contents, the number of neurons n_{L−1} input to the final fully connected layer is large, and the chance of o^{L−1,i} (note that o^{L−1,i} ≠ 0 since I^{L−1,i} ≠ 0 by the nondegeneracy requirement in Lemma 6, the formal version of Lemma 1) being identical for different samples is very small; thus Assumption 1 is reasonable and enforces a very mild restriction on the local minimum θ. This assumption prevents distinct samples from becoming indistinguishable after being truncated by ReLU layers. For fully connected deep neural networks, He et al. (2020); Liu et al. (2021) assumed that all data points are distinct, i.e., ∀i, j ∈ [N] with i ≠ j, x_i ≠ x_j. Under this condition, we show in Lemma 7 in Appendix C that for practical datasets and CNNs there always exists a local minimum θ for which Assumption 1 holds. An exception is the neural collapse phenomenon: during the terminal phase of training, the features of the final hidden layer tend to collapse to the class feature means. As a result, our construction of spurious local minima does not cover minima that may arise in this terminal phase, when the loss is driven towards zero.

3.2.1. DATA SPLIT

In the following, we show that, given Assumption 1, there must exist at least one location in the feature maps where the outputs of all samples can be split into two parts by a threshold such that the two parts behave differently under perturbations. This is not an easy task, due to the following difficulties introduced by convolutional layers. Firstly, the receptive field of each neuron in a convolutional layer is smaller than the input image, so at each location in the feature maps the effective inputs (pixels in the receptive field) from distinct samples can be identical and indistinguishable. Secondly, the activation status of hidden ReLU neurons can vary from place to place and from sample to sample; Assumption 1 helps us to distinguish different samples. Thirdly, the filters and biases are shared by all locations in a feature map, and one cannot perturb these parameters at one location without affecting other places. We assume that the training loss at θ satisfies R(θ) > 0. Then u_i := ∂l(o(x_i), y_i)/∂o(1) ≠ 0 for some i ∈ [N], where o(1) is the first component of the output vector o. By the optimality of θ := {W^l, b^l}_{l=1}^L, we have ∂R/∂b^L_1 = (1/N) Σ_{i=1}^N ∂l(o(x_i), y_i)/∂o(1) = (1/N) Σ_{i=1}^N u_i = 0. We split the outputs of all samples at layer L−3. Given the CNN parameters θ := {W^l, b^l}_{l=1}^L and data samples {x_1, x_2, ⋯, x_N}, we consider the outputs o^{L−3,i}(j) (i ∈ [N], j ∈ [M]) in the feature maps of layer L−3, where M is the number of neurons (locations) in all feature maps of layer L−3 used in Lemma 1. For all j ∈ [M], i ∈ [N], denote v^j_i := W^{L−3}(j,•) o^{L−4,i} and w_j := W^L(1, 2P_{L−2} + j). v^j_i is the output at layer L−3 before adding the bias and applying the ReLU activation. Thus, at each location j ∈ [M], there is a list (v^j_1, v^j_2, ⋯, v^j_N). Let x̄^j_i be the vector composed of the pixels of x_i in the receptive field of the jth location. x̄^j_i can be identical for distinct samples, resulting in identical v^j_i's.
Identical v^j_i's can also result from ReLU activations that remove the differences between input samples. In general, at each location j in the feature maps, (v^j_1, v^j_2, ⋯, v^j_N) are organized in groups according to their magnitudes. Assume there are g_j groups in the list (v^j_1, v^j_2, ⋯, v^j_N). Without loss of generality, we can re-index the samples in descending order of v^j_i and write the ordered list as v^j_1 = ⋯ = v^j_{m_1} > v^j_{m_1+1} = ⋯ = v^j_{m_2} > ⋯ > v^j_{m_{g_j−1}+1} = ⋯ = v^j_{m_{g_j}} = v^j_N. Let u = (u_1, u_2, ⋯, u_N)^T and note that u ≠ 0 and Σ_{i=1}^N u_i = 0. In the following lemma, we show that there is at least one location where a split of the data samples exists. This is a combinatorial problem which needs to deal with multiple locations and multiple ordered groups at each location, and the groupings of samples can be arbitrary at each location. Lemma 2. Assume that the training loss R({W^l, b^l}_{l=1}^L) > 0 for squared or cross-entropy loss. Under Assumption 1, there exists some location j ∈ [M] where the ordered list above can be split into two parts v^j_1, v^j_2, ⋯, v^j_n and v^j_{n+1}, ⋯, v^j_N by a threshold η such that v^j_1, v^j_2, ⋯, v^j_n > η, v^j_{n+1}, v^j_{n+2}, ⋯, v^j_N < η, and Σ_{i=1}^n u_i ≠ 0. (7) There may be more than one location where such a data split exists, and we choose the one with the biggest threshold η for later use. Let h be the location with the biggest split threshold, and without loss of generality assume it is located in the first feature map of layer L−3 (that is why we put a special emphasis on the first and second feature maps in layers L−2 and L−1 in Lemma 1 and Lemma 4).
Since η for the hth location is the biggest threshold among all locations j ∈ [M], by (7), the condition Σ u_i ≠ 0 cannot hold for the samples above η at any other location, thus we have ∀j ∈ [M], j ≠ h: Σ_{i ∈ {1,2,⋯,N : v^j_i > η}} u_i = 0. (8) An Auxiliary Lemma. At each location j ∈ [M], the samples are organized into groups according to v^j_i (i ∈ [N]); the v^j_i's within each group are equal. Let I^j_q be the set of indices of the samples in the qth group of the jth location. The following lemma shows that Σ_{i∈I^j_q} u_i cannot be zero for all groups and all locations. Lemma 3. Under Assumption 1, for CNNs with squared or cross-entropy loss, if R({W^l, b^l}_{l=1}^L) > 0, then there must exist some j ∈ [M] and q ∈ [g_j] such that Σ_{i∈I^j_q} u_i ≠ 0. Lemma 3 is used by Lemma 2 to find a data split that satisfies (7). We prove Lemma 3 by induction. The number of groups g_j at each location and the value of v^j_i for each group can be arbitrary. We prove that if Lemma 3 holds for m samples, then, for all possible configurations of the current groups and all ways of adding a new sample, it also holds for m + 1 samples.
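The split whose existence Lemmas 2 and 3 guarantee can be found by brute force on toy data. The sketch below (function name and numbers are illustrative) scans all locations for the largest threshold η whose upper part has a nonzero partial sum of the u_i:

```python
import numpy as np

def find_split(V, u, tol=1e-12):
    """V[j, i] = v_i^j (pre-bias output of sample i at location j);
    u[i] = u_i with sum(u) == 0.  Returns (j, eta) with the largest
    threshold eta such that sum of u_i over {i : V[j, i] > eta} != 0,
    or None if no location admits a split."""
    best = None
    for j in range(V.shape[0]):
        order = np.argsort(-V[j])          # re-index samples: descending v_i^j
        v, uu = V[j][order], u[order]
        csum = np.cumsum(uu)
        for n in range(len(v) - 1):
            # a threshold can only separate distinct values (group boundary)
            if v[n] > v[n + 1] and abs(csum[n]) > tol:
                eta = 0.5 * (v[n] + v[n + 1])
                if best is None or eta > best[1]:
                    best = (j, eta)
                break                      # highest valid split at location j
    return best

# Tiny example: 2 locations, 4 samples, sum(u) = 0.
V = np.array([[3.0, 3.0, 1.0, 1.0],        # groups {3, 3} > {1, 1}
              [5.0, 2.0, 2.0, 0.0]])       # groups {5} > {2, 2} > {0}
u = np.array([1.0, -1.0, 2.0, -2.0])
j, eta = find_split(V, u)                  # location 1 splits at eta = 3.5
assert (j, eta) == (1, 3.5)
```

At location 0 the only group boundary has partial sum u_1 + u_2 = 0, so no split exists there; location 1 isolates sample 1 with partial sum u_1 = 1 ≠ 0.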

3.2.2. CONSTRUCTION OF θ′

Given the CNN parameters θ := {W^l, b^l}_{l=1}^L specified in Lemma 6 and the threshold η in Lemma 2, we give a point θ′ := {W′^l, b′^l}_{l=1}^L in parameter space and show that the training loss at θ′ is equal to that at θ. To obtain θ′, we only change the parameters related to the first and second feature maps in layers L−2 and L−1, respectively; the remaining parameters are kept fixed. The purpose of this parameter setting is to create paths along which σ(v^j_i + b^{L−3}_1) can propagate, i.e., the paths that connect the first, second, and third feature maps, respectively, in layers L−2 and L−1 (see Fig. 1). Lemma 4. (informal). If η ≥ −b^{L−3}_1, a positive value of σ(v^j_i + b^{L−3}_1) passes through either the first or the second path. If η < −b^{L−3}_1, a positive value of σ(v^j_i + b^{L−3}_1) passes through all three paths and the outputs from the first and second paths counteract. The remaining parameters in θ′ are kept fixed. Then, R(θ′) = R(θ). This differentiation of paths makes it possible for different parts of the samples to behave differently under a perturbation of the parameters.

3.2.3. PERTURBATION OF θ′

Next, we show that by perturbing θ′, we can decrease the training loss. We demand that only the final output contributed by location h in the first feature map of layer L−3, where v^h_1, v^h_2, ⋯, v^h_n > η and Σ_{i=1}^n u_i ≠ 0 hold, is affected by the perturbation, so as to decrease the training loss. The perturbation is then designed such that a single bias in the path that each v^j_i (with v^j_i > η) passes through is perturbed. The final output is not affected by locations other than h, even those with v^j_i > η, since Σ u_i = 0 there by (8). The following lemma informally describes our parameter perturbation scheme; its formal version is given in Lemma 10 in Appendix C. Lemma 5. (informal). If η ≥ −b^{L−3}_1, perturb b^{L−2}_2. If η < −b^{L−3}_1, perturb b^{L−2}_1. The remaining parameters in θ′ are kept fixed. Then, under Assumption 1, a training loss lower than R(θ′) is obtained under this perturbation.

3.3. MAIN RESULTS AND DISCUSSION

Combining Lemma 1, Lemma 4 and Lemma 5, we have the following theorem. Theorem 1. Under Assumption 1, the local minima θ := {W^l, b^l}_{l=1}^L given in Lemma 1 are spurious if R(θ) > 0 for squared or cross-entropy loss. Due to the nonnegative homogeneity of the ReLU activation, infinitely many spurious local minima exist, obtained by scaling the parameters of different layers in θ. In the general case, when the two consecutive convolutional layers we utilize are not located at the top and there are some convolutional, average pooling, or fully connected layers between them and the output layer, we can still construct spurious local minima using a similar idea. The only difference is the setting of the parameters for the layers above the two consecutive convolutional layers: we set them such that the output of the two consecutive convolutional layers is propagated unchanged (except for pooling operations in possible subsequent pooling layers) to the first fully connected layer, which plays the role of layer L in Lemmas 1 and 4, and then the output of the first fully connected layer is forwarded unchanged to the output neurons. For such constructed θ and θ′, we can show that θ is still a local minimum, that we still have R(θ′) = R(θ), and that the empirical risk can be decreased by perturbing θ′. The details are given in Appendix D. The requirement of having two consecutive convolutional layers may be further relaxed; we leave this to future work. Remark 1: Our construction of spurious local minima shows that for practical datasets and CNNs, each local minimum of the subnetwork is associated with a spurious local minimum. Since the output of a CNN with ReLU activations is a piecewise linear function, and from the perspective of fitting data samples with piecewise linear outputs (Liu (2022)), the local minima of subnetworks of a CNN (and consequently the spurious local minima of the CNN) may be common due to the abundance of different fitting patterns. Furthermore, as suggested by Xiong et al.
(2020) , CNNs have more expressivity than fully connected NNs per parameter in terms of the number of linear regions produced. The ability of producing more linear regions implies more fitting patterns, thus we conjecture that CNNs are more likely to produce spurious local minima than fully connected NNs of the same size. We leave the exploration of these ideas to our future work.

4. AN EXAMPLE OF NONTRIVIAL SPURIOUS LOCAL MINIMA

In this section, we construct an example of a nontrivial local minimum in which some neurons are inactive, which can serve as the subnetwork in Lemma 1. By Theorem 1, the associated local minimum of the CNN that contains this subnetwork is spurious. In comparison, for the spurious local minima constructed for fully connected networks in He et al. (2020), all ReLU neurons are active and the networks reduce to linear predictors. The idea is to split the data samples into two groups using a hyperplane that is perpendicular to an axis, and then fit each group of samples using a different predictor. We design a CNN architecture with appropriate parameters to generate these two predictors, and show that perturbing the parameters will increase or at least maintain the loss.

Given data samples {x_1, x_2, ⋯, x_N}, we use a hyperplane with normal p to split the data samples. Choose any index k ∈ [d_x]; we set p = (0, 0, ⋯, 1, 0, ⋯, 0)^T, where only the kth component of p is 1. Without loss of generality, assume the kth component of the input x is located in its first channel. Let I = {i | x^k_i = max_{j∈[N]} x^k_j}, i.e., I is the set of indices of the samples having the largest kth component. Choose one element i* in the set I, and denote by j* the index of any sample with the largest x^k among the x^k_n (n ∈ [N], n ∉ I); then we set c_p = −(x^k_{i*} + x^k_{j*})/2. For any point x on the positive side of the hyperplane, we have σ(p·x + c_p) = σ(x^k − (x^k_{i*} + x^k_{j*})/2) = x^k − (x^k_{i*} + x^k_{j*})/2 > 0. We fit the two groups of samples in different ways using two subnetworks. Concatenating the two subnetworks, we obtain a full CNN. The following theorem gives an informal description of its architecture and parameters; the formal description is given in Appendix B. For this local minimum, the ReLU units in the first subnetwork that correspond to the kth component of the input are activated only by the samples x_j (j ∈ I) and are inactive for the remaining samples. Thus, the local minimum {W^l, b^l}_{l=1}^L is nontrivial: different activation patterns exist and the resulting predictor is nonlinear. When using the local minimum θ_1 constructed in Theorem 2 as the subnetwork of θ in Lemma 1, in order to further construct θ′ as in Lemma 4, we need to make sure that Assumption 1 and R(θ) > 0 hold. With the same ideas as discussed in Section 3.2.1, we can construct the second subnetwork such that there always exists a point θ_1 (and consequently θ) for which both Assumption 1 and R(θ) > 0 hold for popular datasets and CNN architectures.
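The choice of c_p can be illustrated numerically. The helper below uses a hypothetical name and arbitrary data, and assumes, as in the construction, that not all samples share the same kth component:

```python
import numpy as np

def split_bias(X, k):
    """c_p = -(x_{i*}^k + x_{j*}^k) / 2, where x_{i*}^k is the largest
    kth component over all samples and x_{j*}^k is the largest kth
    component among the samples outside the set I."""
    vals = X[:, k]
    top = vals.max()
    second = vals[vals < top].max()
    return -0.5 * (top + second)

X = np.array([[0.0, 3.0],
              [1.0, 3.0],
              [2.0, 1.0]])                 # samples x_1, x_2, x_3
c_p = split_bias(X, k=1)                   # I = {1, 2}: both attain x^k = 3
assert c_p == -2.0

# sigma(p . x + c_p) is positive exactly for the samples in I.
active = np.maximum(X[:, 1] + c_p, 0.0) > 0
assert active.tolist() == [True, True, False]
```

The ReLU thus fires only for the samples with the maximal kth component, which is exactly the activation pattern the nontrivial local minimum relies on.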

5. EXPERIMENTAL RESULTS

We conduct experiments on the CIFAR-10 dataset to verify the correctness of Theorem 1 and Theorem 2. We use two approaches to show the existence of spurious local minima. The first is to visualize the loss landscape around local minima θ using the technique given in Li et al. (2018). Given two random directions δ and η, we compute the empirical losses R(θ + αδ + βη) at grid points in the two-dimensional plane spanned by δ and η, where (α, β) are the coordinates of each grid point. The level sets of the empirical loss are then depicted using the losses at these grid points. The second approach is to compute the empirical losses around θ along many random directions and check whether losses lower than R(θ) exist. Experimental details are given in Appendix F. Fig. 2 shows the results of the level-set visualization approach. Fig. 2(a) and Fig. 2(b) demonstrate that the local minimum θ constructed in Lemma 6 is spurious, since R(θ′) = R(θ) and losses lower than R(θ′) exist nearby. Fig. 2(c) shows that the point θ_1 constructed in Theorem 2 is a local minimum. Fig. 3 shows the results of the random-direction approach and is given in Appendix F.
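The first visualization approach can be sketched as follows. The quadratic risk is a toy stand-in for the CNN loss, and the filter-wise normalization of the random directions used by Li et al. (2018) is omitted:

```python
import numpy as np

def landscape_slice(loss, theta, alphas, betas, seed=0):
    """Evaluate loss(theta + alpha*delta + beta*eta) on a grid, for two
    random Gaussian directions delta and eta in parameter space."""
    rng = np.random.default_rng(seed)
    delta = rng.standard_normal(theta.shape)
    eta = rng.standard_normal(theta.shape)
    return np.array([[loss(theta + a * delta + b * eta) for b in betas]
                     for a in alphas])

# Toy stand-in for the empirical risk: a quadratic with minimum at 0.
risk = lambda th: float(np.sum(th ** 2))
theta = np.zeros(3)
alphas = betas = np.linspace(-1.0, 1.0, 5)
grid = landscape_slice(risk, theta, alphas, betas)

# The grid center is the unperturbed point; here it attains the minimum.
assert grid[2, 2] == risk(theta)
assert grid.min() == grid[2, 2]
```

Level sets of `grid` (e.g. via a contour plot) then show whether lower losses exist around the center point, which is how Fig. 2 is produced in spirit.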

6. CONCLUSION

We have proved that convolutional layers can introduce infinitely many spurious local minima into the loss landscape of deep convolutional neural networks with squared loss or cross-entropy loss. To show this, we developed new techniques to address the challenges introduced by convolutional layers. We solved a combinatorial problem to demonstrate that a split of the outputs of data samples is always possible somewhere in the feature maps. In this combinatorial problem, we overcame the difficulty of arbitrary groupings of the outputs of data samples caused by limited receptive fields and arbitrary activation status of hidden neurons. We also solved the tied-parameters problem, giving perturbations of filters and biases that decrease the training loss while affecting only the output of a single neuron in the feature map.

OPERATIONS

We give a one-dimensional example to show how to formulate the average pooling operation as a matrix-vector product. Given a one-dimensional input o^{l-1} = (a, b, c, d)^T, suppose the patches for pooling are non-overlapping, which is usually the case, and the patch size is 2. The output is then o^l = ((a + b)/2, (c + d)/2)^T, which is equivalent to the matrix-vector product o^l = (1/2) [1 1 0 0; 0 0 1 1] (a, b, c, d)^T. Thus, W^l = (1/2) [1 1 0 0; 0 0 1 1]. The parameter matrix for the average pooling operation in the two-dimensional case can be obtained using the same idea by converting the feature maps in the input layer into a vector; in that case, the nonzero entries in each row of W^l are not all adjacent. For fully connected layers, there are no constraints on the entries of the parameter matrix W^l ∈ R^{n_l × n_{l-1}}.
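This construction extends to any input length and patch size; a minimal NumPy sketch (the helper name avg_pool_matrix is ours):

```python
import numpy as np

def avg_pool_matrix(n, patch):
    """Matrix W with W @ o = non-overlapping average pooling of patch size `patch`."""
    assert n % patch == 0
    W = np.zeros((n // patch, n))
    for r in range(n // patch):
        W[r, r * patch:(r + 1) * patch] = 1.0 / patch
    return W

o = np.array([1.0, 3.0, 5.0, 7.0])   # plays the role of (a, b, c, d)
W = avg_pool_matrix(4, 2)            # (1/2) [[1, 1, 0, 0], [0, 0, 1, 1]]
print(W @ o)                         # [2. 6.] = ((a+b)/2, (c+d)/2)
```

Each row of W averages one patch, so the pooling layer is exactly a fixed linear map, which is what the proofs in the appendix rely on.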

B AN EXAMPLE OF NONTRIVIAL LOCAL MINIMA FOR CNNS

We fit the two groups of samples separated by the hyperplane with normal p in different ways. For samples {x_j | j ∈ [N], j ∉ I}, we use a subnetwork with parameter matrix W_2 to fit them, i.e., the output of this subnetwork is o_j = W_2 x̂_j, ∀j ∈ [N], j ∉ I, where x̂_j is the input augmented with the scalar 1 (at the same time, the weight matrices are augmented to include the biases, denoted Ŵ^l, to keep the notation simple). We use a subnetwork with parameter matrix W_1 to fit the data samples {x_j | j ∈ I}, and W_1 is given by W_1(i, ·)x_j = W_2(i, ·)x_j + λ_i(p^T x_j + c_p), j ∈ I, i ∈ [d_y]. Then, the output is o_j = W_1 x̂_j = W_2 x̂_j + (p · x_j + c_p)(λ_1, λ_2, ⋯, λ_{d_y})^T, j ∈ I. The constants {λ_i, i ∈ [d_y]} are determined as follows. If there is only one sample x_{i*} in set I with target output y_{i*} and the loss function is the squared loss, there must exist ỹ_{i*} such that l(ỹ_{i*}, y_{i*}) = 0. Letting o_{i*} = ỹ_{i*}, we have ỹ_{i*}(j) = W_2(j, ·)x_{i*} + λ_j(p · x_{i*} + c_p), ∀j ∈ [d_y], which leads to λ_j = (ỹ_{i*}(j) - W_2(j, ·)x_{i*}) / (x_{i*}^k + c_p), ∀j ∈ [d_y]. If there are multiple elements in set I or the loss function is the cross-entropy loss, the parameters {λ_j, j ∈ [d_y]} are determined by {λ_j, j ∈ [d_y]} = argmin_{{μ_j, j∈[d_y]}} Σ_{i∈I} l(W_2 x̂_i + (x_i^k + c_p)(μ_1, μ_2, ⋯, μ_{d_y})^T, y_i). Concatenating the two subnetworks, we obtain a full CNN. The following theorem gives its architecture and parameters {W^l, b^l}_{l=1}^L and shows that θ_1 := {W^l, b^l}_{l=1}^L is a local minimum. For ease of presentation, we ignore pooling layers at first. Theorem 3 (formal version of Theorem 2). Given a CNN, assume the number of filters in each convolutional layer satisfies T_l ≥ 2 (l ∈ [L-1]) and the size of each convolutional layer satisfies n_l - P_{l-1} ≥ d_y (l ∈ [L-1]). Its parameters are given as follows.
The weight matrix of the first filter in the first convolutional layer is W^1(1 : P_0, 1 : P_0) = Id_{P_0×P_0}, where Id_{P_0×P_0} is the identity matrix of size P_0 × P_0 and P_0 is the number of patches in the input used for convolution. The corresponding bias is b_1^1 = c_p. The filters and biases of higher convolutional layers are set as W^l(1 : P_{l-1}, 1 : P_{l-1}) = Id_{P_{l-1}×P_{l-1}}, b_1^l = 0, l ∈ {2, 3, ⋯, L-1}, to forward propagate the first feature map of o^1(x_i) unchanged. The subnetwork with parameter matrix W_2 is formed by the part of the CNN that remains after removing the first filter and all activation units in each layer, and its output is given as follows. For all j ∈ [N], j ∉ I, o_j = W_2 x̂_j := W^L(·, P_{L-2}+1 : n_{L-1})[W^{L-1}(P_{L-2}+1 : n_{L-1}, P_{L-3}+1 : n_{L-2}) ⋯ (W^1(P_0+1 : n_1, ·)x_j + b^1(P_0+1 : n_1)) + ⋯ + b^{L-1}(P_{L-2}+1 : n_{L-1})] + b^L. (20) W_2 is optimized to minimize Σ_{i∈[N], i∉I} l(W_2 x̂_i, y_i). The final fully connected layer is set as W^L(i, 1 : P_{L-2}) = (0, ⋯, λ_i, ⋯, 0), i ∈ [d_y], where only the kth component connecting to the first subnetwork is nonzero and equals λ_i. Sufficiently large positive constants c_l (l ∈ [L-1]) are added to the biases of each convolutional layer in the second subnetwork such that the involved ReLU units are all activated. The biases of the top fully connected layer b_i^L (i ∈ [d_y]) are then substituted by b_i^L - Δo^L(i), where Δo^L(i) is computed recursively as Δo^1(i) = c_1, i ∈ [P_0+1 : n_1]; Δo^l(i) = Σ_{j∈[P_{l-2}+1 : n_{l-1}]} W^l(i, j)Δo^{l-1}(j) + c_l, i ∈ [P_{l-1}+1 : n_l], l ∈ {2, 3, ⋯, L-1}; Δo^L(i) = Σ_{j∈[P_{L-2}+1 : n_{L-1}]} W^L(i, j)Δo^{L-1}(j), i ∈ [d_y]. All remaining parameters are set to zero. Then, the point θ_1 = {W^l, b^l}_{l=1}^L specified above is a local minimum in parameter space. If pooling layers are used in CNNs, we can adjust the parameters in Theorem 3 to reflect the effects of the parameter matrices of the pooling layers.
However, the adjusted {W^l, b^l}_{l=1}^L is still a local minimum. The condition on the number of filters is used to accommodate the two subnetworks. The condition on the size of each convolutional layer is used to make sure that the point θ_1 presented in Theorem 3 meets the full-rank requirement: because the size of each convolutional layer satisfies n_l - P_{l-1} ≥ d_y (l ∈ [L-1]), the matrix W_2 ∈ R^{d_y × d_x} for the second predictor is full rank around θ_1 (see the proof of Theorem 3 in Appendix G.9). These conditions are easily satisfied by most CNNs used in practice.

C LEMMAS FOR CONSTRUCTION OF SPURIOUS LOCAL MINIMA

The following Lemma 6 describes the construction of the local minimum θ; it is the formal version of Lemma 1 in Section 3.1. Lemma 6. Given a CNN whose numbers of filters in the last two convolutional layers satisfy T_{L-1} ≥ 3 and T_{L-2} ≥ 3, the parameters of its last three layers are set as follows to propagate o^{L-3}(x) to the output layer. The first and second feature maps in both layer L-2 and layer L-1 have no contribution to the final output: W^L(·, 1 : 2P_{L-2}) = 0; W^{L-1}(i, i) = 1, i ∈ [min(n_{L-1}, n_{L-2})]; W^{L-1}(i, j) = 0, i ≠ j; W^{L-2}(i, ·) = 0, i ∈ [2P_{L-3}]; W^{L-2}(2P_{L-3}+i, i) = 1, i ∈ [min(n_{L-2} - 2P_{L-3}, n_{L-3})]; W^{L-2}(2P_{L-3}+i, j) = 0, i ≠ j, i ∈ [min(n_{L-2} - 2P_{L-3}, n_{L-3})]; b^{L-1} = 0, b^{L-2} = 0. (23) The remaining parameters, including W^L(·, 2P_{L-2}+1 : end), b^L, and {W^l, b^l}_{l=1}^{L-3}, constitute a subnetwork, and they locally minimize the following training loss when the parameters in (23) are fixed: R({W^l, b^l}_{l=1}^{L-3}, W^L(·, 2P_{L-2}+1 : end), b^L) := (1/N) Σ_{i=1}^N l(W^L I^{L-1,i}(W^{L-1} ⋯ (W^1 x_i + b^1) + ⋯ + b^{L-1}) + b^L, y_i). (24) Assume that Î^{l,i} ≠ 0 for all l ∈ [L-1] and i ∈ [N]. Then, the point θ := {W^l, b^l}_{l=1}^L obtained above is a local minimum in parameter space. The requirement in Lemma 6 that Î^{l,i} ≠ 0 for all l ∈ [L-1] and i ∈ [N] is used to prevent degenerate cases in which all ReLU neurons in a layer are deactivated for some samples. The nondegenerate cases always exist, since we can fix sufficiently large biases such that all ReLU neurons are turned on and then train the subnetwork. The following Lemma 7 shows that for practical datasets and popular CNN architectures, there always exists a point θ in parameter space at which Assumption 1 holds. Lemma 7.
For CNNs in which the number of feature maps in each convolutional and average pooling layer is greater than or equal to the number of input channels n_0 and each fully connected layer is at least as wide as the output layer, and for datasets in which all data points are distinct, i.e., x_i ≠ x_j for all i, j ∈ [N] with i ≠ j, if the distinctness of samples is preserved after each pooling operation of the CNN when directly applied to the input, then there always exists a point θ for which Assumption 1 holds. For popular CNN architectures, such as AlexNet and VGG used in the CIFAR10 and ImageNet image classification tasks, the number of feature maps in each layer is greater than the number of input channels, and the output layer is indeed narrower than the hidden fully connected layers. For practical datasets, the distinctness of samples is usually preserved after several pooling operations. Thus, the conditions in Lemma 7 are reasonable in practice. Proof. We consider the general case in which the two consecutive convolutional layers we utilize are not located at the top, and there are some convolutional, pooling, or fully connected layers between them and the final output layer. For CNNs in which the number of feature maps in each convolutional and average pooling layer is greater than or equal to the number of input channels, we can set the parameters {W^l, b^l}_{l=1}^L such that each convolutional layer forward propagates its input unchanged using the first n_0 feature maps, and the remaining feature maps are set to zero using zero filters and biases. Also, since each fully connected layer is at least as wide as the output layer, we use the first d_y neurons in the first fully connected layer to perform the fully connected operation, and subsequent fully connected layers forward propagate their inputs unchanged to the final output neurons.
Sufficiently large biases are used in the first convolutional layer and the first fully connected layer such that all ReLU neurons in the first n_0 feature maps of the convolutional and pooling layers and the first d_y ReLU neurons in the hidden fully connected layers are turned on; their effects can be cancelled at the final output layer (as done for the second subnetwork in (22) in Theorem 3). By doing so, the training loss of the CNN can be expressed as R(θ) = (1/N) Σ_{i=1}^N l(W_fc W_pool x_i, y_i), where W_pool is a constant matrix that characterizes the average pooling operations from the input to the first fully connected layer, and W_fc characterizes the computation in the first fully connected layer. The CNN has thus been reduced to a linear classifier, and the corresponding point θ in parameter space becomes a local minimum when minimizing (1/N) Σ_{i=1}^N l(W_fc W_pool x_i, y_i) with respect to W_fc. The only operation that can affect the distinctness of the o^{L-1,i} (i ∈ [N]) at layer L-1 (or, more generally, at the two consecutive convolutional layers used in the construction of spurious local minima) is then the pooling operation. If the distinctness of the samples x_i (i ∈ [N]) is preserved after each pooling operation of the CNN when directly applied to x_i (i.e., W_pool x_i), then o^{L-1,i} ≠ o^{L-1,j} for all i, j ∈ [N] with i ≠ j; hence Assumption 1 holds for θ. The following Lemma 8 shows that for training data that cannot be fit by linear models and popular CNN architectures, a point θ in parameter space with training loss R(θ) > 0 always exists. Lemma 8. For training data that cannot be fit by linear models, and for CNNs in which the number of feature maps in each convolutional and average pooling layer is greater than or equal to the number of input channels n_0 and each fully connected layer is at least as wide as the output layer, there always exists a point θ = {W^l, b^l}_{l=1}^L in parameter space such that the training loss satisfies R(θ) > 0.
The point θ = {W^l, b^l}_{l=1}^L in Lemma 8 can be the same as that in Lemma 7. Therefore, combining the two lemmas, there exists a point θ where both Assumption 1 and R(θ) > 0 hold. Most practical datasets, such as CIFAR10 and ImageNet, are complex and cannot be fit by linear models. Thus, the condition in Lemma 8 is reasonable in practice. Proof. Let W ∈ R^{d_y × d_x} be a local minimizer of (1/N) Σ_{i=1}^N l(W x_i, y_i), where l is the loss function. For training data that cannot be fit by linear models, we have (1/N) Σ_{i=1}^N l(W x_i, y_i) > 0. We use the same parameter settings as in the proof of Lemma 7, and the training loss of the CNN is then expressed as R(θ) = (1/N) Σ_{i=1}^N l(W_fc W_pool x_i, y_i). Since (1/N) Σ_{i=1}^N l(W x_i, y_i) = min_{W∈R^{d_y×d_x}} (1/N) Σ_{i=1}^N l(W x_i, y_i) > 0, we have R(θ) > 0. The following Lemma 9 describes the construction of the point θ'; it is the formal version of Lemma 4 in Section 3.2.2. Lemma 9. Given the CNN parameters θ := {(W^l, b^l)}_{l=1}^L specified in Lemma 6, let θ' := {W'^l, b'^l}_{l=1}^L. If η ≥ -b_1^{L-3}, set W'^L(1, P_{L-2}+1 : 2P_{L-2}) = W^L(1, 2P_{L-2}+1 : 3P_{L-2}), W'^L(1, 1 : P_{L-2}) = W^L(1, 2P_{L-2}+1 : 3P_{L-2}), W'^L(1, 2P_{L-2}+1 : 3P_{L-2}) = 0^T, W'^L(2 : d_y, 1 : 2P_{L-2}) = 0; W'^{L-1}(i, i) = -1, i ∈ [P_{L-3}]; W'^{L-1}(i, i) = +1, i ∈ {P_{L-3}+1 : 2P_{L-3}}; b'^{L-1}_1 = η + b_1^{L-3}, b'^{L-1}_2 = η + b_1^{L-3}; W'^{L-2}(i, i) = -1, i ∈ [P_{L-4}]; W'^{L-2}(P_{L-3}+i, i) = +1, i ∈ [P_{L-4}]; b_1^{L-3} = -min_{i∈[N], j∈[P_{L-4}]} v_i^j, b'^{L-2}_1 = η + b_1^{L-3}; b'^{L-2}_2 = -η - b_1^{L-3}; b'^L_1 = b_1^L - Σ_{j=1}^{P_{L-2}} w_j (η + b_1^{L-3}). (26)

D CONSTRUCTION OF SPURIOUS LOCAL MINIMA: THE GENERAL CASE

When the two consecutive convolutional layers we utilize are not located at the top, and there are some convolutional, pooling, or fully connected layers between the two consecutive convolutional layers used in our construction and the final output layer, we can still construct spurious local minima using settings similar to Lemmas 6, 9, and 10.
We let the first fully connected layer play the role of layer L in Lemmas 6 and 9, and set the parameters of the layers above the two consecutive convolutional layers such that the output of the two consecutive convolutional layers is propagated unchanged (except for the pooling operations in subsequent pooling layers) to the first fully connected layer, whose output is then forwarded invariantly to the final output neurons. Similar to the proof of Lemma 7, this is always possible by utilizing the minimal widths among the subsequent convolutional layers and fully connected layers, respectively. Then, by reserving the first two feature maps in the layers starting from the two consecutive convolutional layers and minimizing the loss of the remaining subnetwork, we can obtain a local minimum θ. By setting the parameters connected to the two consecutive convolutional layers as in Lemma 9, we can obtain θ'. The positive value of σ(v_i^j + b_1^{L-3}) may pass through different paths than before for different locations j. If there are average pooling layers above the two consecutive convolutional layers, then since the average pooling operation is linear, the average poolings in the feature maps of the different paths are finally aggregated in the first fully connected layer, producing an output equal to that of θ. Therefore, we still have R(θ') = R(θ). The perturbation scheme is the same as in Lemma 10, and the perturbation persists after passing through the linear average pooling operations; thus the empirical risk can be decreased by perturbing θ' as before.

E MORE AUXILIARY LEMMAS

Denote u_i := ∂l(o(x_i), y_i)/∂o(1) and o_i := o^{L,i}(1) = Σ_{j=1}^M w_j σ(v_i^j + b_{t_j}^{L-3}) + b_1^L, where b_{t_j}^{L-3} is the bias associated with the feature map in which location j resides. We have the following lemma for the squared and cross-entropy losses. Lemma 11. For convolutional neural networks with squared loss or cross-entropy loss, the following can be achieved through perturbing W^L(1, ·) under Assumption 1, where o'_i and u'_i denote, respectively, o_i and u_i after the perturbation. 1. o'_i ≠ o'_j if o_i = o_j (i, j ∈ [N], i ≠ j). 2. u'_i ≠ 0 if u_i = 0 (i ∈ [N]). 3. u'_i ≠ u'_j or u'_i + u'_j ≠ 0 if u_i = u_j or u_i + u_j = 0, respectively (i, j ∈ [N], i ≠ j). 4. u' = (u'_1, u'_2, ⋯, u'_N)^T ≠ 0 if R = (1/N) Σ_{i=1}^N l(o_i, y_i) > 0. 5. Nonzero gaps between the o_i's or the u_i's can still exist after subsequent perturbations. The maintenance of nonzero gaps between the o_i's or u_i's guarantees that o_i ≠ o_j or u_i ≠ u_j, etc., still hold after subsequent perturbations. In the following, we absorb the biases into the weights and write the CNN output as o_i = W^L I^{L-1,i} W^{L-1} ⋯ I^{1,i} W^1 x_i. Here, we omit the hat on augmented variables for notational simplicity, and the exact meaning of the involved symbols is clear from context. The following two lemmas will be used to prove u'_i ≠ u'_j or u'_i + u'_j ≠ 0 (i, j ∈ [N], i ≠ j) in Lemma 11 for the cross-entropy loss. Lemma 12. For the cross-entropy loss, if u_i = u_j for two samples x_i, x_j (i ≠ j) and I^{L-1,i} W^{L-1} ⋯ W^1 x_i = α I^{L-1,j} W^{L-1} ⋯ I^{1,j} W^1 x_j (α ≠ 0, α ≠ 1), then under Assumption 1, u'_i ≠ u'_j can be achieved through perturbing W^L(1, ·). Lemma 13. For the cross-entropy loss, if u_i + u_j = 0 for two samples x_i, x_j (i ≠ j) and I^{L-1,i} W^{L-1} ⋯ W^1 x_i = α I^{L-1,j} W^{L-1} ⋯ I^{1,j} W^1 x_j (α ≠ 0, α ≠ 1), then under Assumption 1, u'_i + u'_j ≠ 0 can be achieved through perturbing W^L(1, ·).

F EXPERIMENTAL DETAILS

We use the CIFAR-10 image set to train CNNs; it consists of 10 classes and 50000 training images of size 32 × 32. The CNN used in our experiments has 7 convolutional layers, with the numbers of channels being 64, 64, 128, 128, 256, 256, 256, respectively, and no pooling layers are used. The convolution filters are all 3 × 3. Each convolutional layer is followed by a ReLU layer. The subnetwork in Lemma 6 is trained by the Adam optimizer for 150 epochs, with a learning rate of 0.001 and weight decay of 0.0005. The subnetwork W_2 in Theorem 3 is trained by the Adam optimizer for 500 epochs, with a learning rate of 0.0001 and weight decay of 0.0005. Both subnetworks use Kaiming initialization (He et al. (2015)). The biggest threshold η in Lemma 2 is found to be 6.8915. Fig. 3 shows the results of the random direction approach.
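The random-direction check itself is simple to sketch: probe R(θ + t d) along many random unit directions d and several small radii t, and test whether any probed loss falls below R(θ). Below, a toy quadratic risk replaces the trained CNN (all names and values are illustrative assumptions, not the experiment code).

```python
import numpy as np

rng = np.random.default_rng(1)

theta = np.zeros(20)                    # candidate local minimum
R = lambda p: float(np.sum(p ** 2))     # placeholder empirical risk

base = R(theta)
found_lower = False
for _ in range(200):                    # many random directions
    d = rng.normal(size=theta.shape)
    d /= np.linalg.norm(d)
    for t in (1e-3, 1e-2, 1e-1):        # several small radii
        if R(theta + t * d) < base - 1e-12:
            found_lower = True

print(found_lower)  # False: no nearby point with lower loss was found
```

At a genuine local minimum no probed direction yields a lower loss, while around a spurious construction such as θ' a descending direction shows up.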

G MISSING PROOFS

For the proofs in the following sections, we assume that the perturbation of the network parameters around a differentiable local minimum θ = {W^l, b^l}_{l=1}^L is small enough that the activation patterns I^{l,i} (l ∈ [L-1], i ∈ [N]) remain constant.

G.1 PROOF OF LEMMA 6

Proof. When the parameters are perturbed as {W^l, b^l}_{l=1}^L → {W^l + δW^l, b^l + δb^l}_{l=1}^L, the output after the perturbation is as follows: o'(x_i) := o'^{L,i} = (Ŵ^L + δŴ^L) Î^{L-1,i} (Ŵ^{L-1} + δŴ^{L-1}) Î^{L-2,i} ⋯ (Ŵ^1 + δŴ^1) x̂_i = (Ŵ^L Î^{L-1,i} Ŵ^{L-1} Î^{L-2,i} ⋯ Ŵ^1 + δF_i) x̂_i, ∀i ∈ [N], where δF_i = δŴ^L Î^{L-1,i} Ŵ^{L-1} Î^{L-2,i} ⋯ Ŵ^1 + Ŵ^L Î^{L-1,i} δŴ^{L-1} Î^{L-2,i} ⋯ Ŵ^1 + ⋯ + δŴ^L Î^{L-1,i} δŴ^{L-1} Î^{L-2,i} ⋯ δŴ^1. The training loss after the perturbation is R({W^l + δW^l, b^l + δb^l}_{l=1}^L) = (1/N) Σ_{i=1}^N l(o'(x_i), y_i) = (1/N) Σ_{i=1}^N l(Ŵ^L Î^{L-1,i} Ŵ^{L-1} Î^{L-2,i} ⋯ Ŵ^1 x̂_i + δF_i x̂_i, y_i). Since for every sample the output of each layer is nondegenerate, when minimizing R({W^l, b^l}_{l=1}^{L-3}, W^L(·, 2P_{L-2}+1 : end), b^L) := (1/N) Σ_{i=1}^N l(Ŵ^L Î^{L-1,i} Ŵ^{L-1} Î^{L-2,i} ⋯ Ŵ^1 x̂_i, y_i) in (24), the space of Ŵ^L Î^{L-1,i} Ŵ^{L-1} ⋯ Ŵ^1 in the neighborhood has been fully explored during the minimization. Therefore, R({W^l + δW^l, b^l + δb^l}_{l=1}^L) cannot be lower than R({W^l, b^l}_{l=1}^L) despite the perturbation in δF_i, and {W^l, b^l}_{l=1}^L is thus a local minimum in parameter space.

G.2 PROOF OF LEMMA 2

Proof. At a location j ∈ [M] (more specifically, M := min(n_{L-1} - 2P_{L-2}, n_{L-2} - 2P_{L-3}, n_{L-3})), if there exist some groups in the ordered list of the v_i^j (i ∈ [N]) such that Σ_{i∈I_q^j} u_i ≠ 0, and G_1 = {v_{m_q+1}^j, v_{m_q+2}^j, ⋯, v_{m_{q+1}}^j} is the first such group in the ordered list, we can set η = (1/2)(v_{m_{q+1}+1}^j + v_{m_{q+1}}^j), which is the midpoint of the gap between group G_1 and the next group, and set n = m_{q+1}. We then have Σ_{i=1}^n u_i = Σ_{i=1}^{m_q} u_i + Σ_{i∈G_1} u_i = Σ_{i∈G_1} u_i ≠ 0. The requirements in (7) are satisfied and we have found a split of the data samples.
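The split just described can be checked numerically: group equal v-values at a location, scan the groups from the largest value down, take the first group with nonzero u-sum, and place η in the gap below it. The sketch below uses made-up v and u values (illustrative assumptions, not data from the paper).

```python
import numpy as np

v = np.array([3.0, 3.0, 2.0, 1.0, 1.0])   # values v^j_i at one location
u = np.array([0.5, -0.5, 0.7, 0.1, 0.2])  # corresponding derivatives u_i

# Tie-groups of v, ordered from the largest value downward.
groups = [(val, u[v == val].sum()) for val in sorted(set(v), reverse=True)]

eta = None
for g, (val, s) in enumerate(groups):
    if abs(s) > 1e-12:                    # first group G_1 with nonzero u-sum
        # eta = midpoint of the gap between G_1 and the next group
        # (the sketch assumes such a next group exists, as in the proof).
        eta = 0.5 * (val + groups[g + 1][0])
        break

print(eta)                # 1.5, splitting {3.0, 2.0} from {1.0}
print(u[v > eta].sum())   # 0.7 != 0: the requirement for a split is met
```

Here the top group {3.0, 3.0} has u-sum zero, so the scan moves down to the group at value 2.0, whose nonzero sum makes the total above η nonzero.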
If instead for every group q ∈ [g_j] at location j we have Σ_{i∈I_q^j} u_i = 0, we have to explore other locations in the feature maps to see whether such splits exist there. According to Lemma 3, a group with Σ_{i∈I_q^j} u_i ≠ 0 always exists at some location.

G.3 PROOF OF LEMMA 9

There are three possible cases for the value of v_i^j. Note that we already have η + b_1^{L-3} ≥ 0 if η ≥ -b_1^{L-3}. 1) v_i^j ≥ η. In this case, we have v_i^j - η ≥ 0 and v_i^j + b_1^{L-3} ≥ v_i^j - η ≥ 0; then z'_j = σ(η + b_1^{L-3}) + σ(v_i^j - η + η + b_1^{L-3}) - (η + b_1^{L-3}) = v_i^j + b_1^{L-3} = z_j > 0, ∀j ∈ [P_{L-2}]. (37) 2) -b_1^{L-3} ≤ v_i^j < η. We have -v_i^j + η > 0. By (34) and v_i^j + b_1^{L-3} ≥ 0, we then have z'_j = σ(-(-v_i^j + η) + η + b_1^{L-3}) + σ(η + b_1^{L-3}) - (η + b_1^{L-3}) = σ(v_i^j + b_1^{L-3}) = z_j ≥ 0, ∀j ∈ [P_{L-2}]. 3) v_i^j < -b_1^{L-3}. In this case, we have v_i^j < η and v_i^j + b_1^{L-3} < 0; then z'_j = σ(v_i^j - η + η + b_1^{L-3}) + σ(η + b_1^{L-3}) - (η + b_1^{L-3}) = σ(v_i^j + b_1^{L-3}) = 0 = z_j, ∀j ∈ [P_{L-2}]. In all three possible cases, we have obtained z'_j = z_j, ∀j ∈ [P_{L-2}]. Therefore, õ'(x_i) = Σ_{j=1}^{P_{L-2}} w_j z'_j + b_1^L = õ(x_i), ∀i ∈ [N]. We obtain o'_k(x_i) = o_k(x_i) (∀i ∈ [N], ∀k ∈ [d_y], k ≠ 1) for the remaining output components since W^L(2 : d_y, 1 : 2P_{L-2}) = 0. The outputs are thus not changed, and neither is the training loss. We now discuss the case of η < -b_1^{L-3}. Denote w_i := W^L(1, 2P_{L-2} + i), ∀i ∈ [P_{L-2}]. The output õ'(x_i) that is affected by the parameter setting in (26) is as follows: õ'(x_i) = W^L(1, 1 : P_{L-2}) σ(W^{L-1}(1 : P_{L-2}, 1 : P_{L-3}) σ(W^{L-2}(1 : P_{L-3}, 1 : P_{L-4}) o^{L-3,i} + b_1^{L-2} 1_{P_{L-3}}) + b_1^{L-1} 1_{P_{L-2}}) + W^L(1, P_{L-2}+1 : 2P_{L-2}) σ(W^{L-1}(P_{L-2}+1 : 2P_{L-2}, P_{L-3}+1 : 2P_{L-3}) σ(W^{L-2}(P_{L-3}+1 : 2P_{L-3}, 1 : P_{L-4}) o^{L-3,i} + b_2^{L-2} 1_{P_{L-3}}) + b_2^{L-1} 1_{P_{L-2}}) + W^L(1, 2P_{L-2}+1 : 3P_{L-2}) σ(W^{L-1}(2P_{L-2}+1 : 3P_{L-2}, 2P_{L-3}+1 : 3P_{L-3}) σ(W^{L-2}(2P_{L-3}+1 : 3P_{L-3}, 1 : P_{L-4}) o^{L-3,i} + b_3^{L-2} 1_{P_{L-3}}) + b_3^{L-1} 1_{P_{L-2}}) + b_1^L.
(40) Using (26), we have õ'(x_i) = Σ_{j=1}^{P_{L-2}} w_j [σ(σ(v_i^j + b_1^{L-3}) - η - b_1^{L-3}) - σ(σ(v_i^j + b_1^{L-3}) - η - b_1^{L-3}) + σ(σ(v_i^j + b_1^{L-3}) - η - b_1^{L-3} + η + b_1^{L-3})] + b_1^L = Σ_{j=1}^{P_{L-2}} w_j z̃_j + b_1^L, where z̃_j := σ(σ(v_i^j + b_1^{L-3}) - η - b_1^{L-3}) - σ(σ(v_i^j + b_1^{L-3}) - η - b_1^{L-3}) + σ(σ(v_i^j + b_1^{L-3}) - η - b_1^{L-3} + η + b_1^{L-3}). (42) There are again three possible cases for the value of v_i^j. Note that we already have η + b_1^{L-3} < 0 when η < -b_1^{L-3}. 1) v_i^j ≥ -b_1^{L-3}. In this case, using v_i^j - η > 0, v_i^j + b_1^{L-3} ≥ 0, and (42), we have z̃_j = (v_i^j - η) - (v_i^j - η) + σ(v_i^j + b_1^{L-3}) = v_i^j + b_1^{L-3} = z_j ≥ 0, ∀j ∈ [P_{L-2}]. (43) 2) η ≤ v_i^j < -b_1^{L-3}. Using v_i^j - η ≥ 0 and v_i^j + b_1^{L-3} < 0, we have z̃_j = (v_i^j - η) - (v_i^j - η) + σ(v_i^j + b_1^{L-3}) = 0 = z_j, ∀j ∈ [P_{L-2}]. 3) v_i^j < η. Using v_i^j - η < 0 and v_i^j + b_1^{L-3} < 0, we have z̃_j = σ(η + b_1^{L-3}) = 0 = z_j = σ(v_i^j + b_1^{L-3}), ∀j ∈ [P_{L-2}]. Therefore, in all possible cases of η and v_i^j, the output satisfies o'_1(x_i) = o_1(x_i), ∀i ∈ [N]. We also have o'_k(x_i) = o_k(x_i) (∀i ∈ [N], ∀k ∈ [d_y], k ≠ 1) for the remaining output components. As a result, the training loss satisfies R(θ) = R(θ').

G.4 PROOF OF LEMMA 10

Proof. We will show that by perturbing θ', the training loss can be decreased. We use different perturbation schemes for the different cases of η. If η ≥ -b_1^{L-3}, we only perturb b_2^{L-2} = -η - b_1^{L-3} as b_2^{L-2} → b_2^{L-2} + δb = -η - b_1^{L-3} + δb. Under this perturbation, the output of each sample is perturbed as follows, according to the different cases of v_i^j. 1) v_i^j ≥ η. In this case, after the perturbation, by modifying (34) we have z'_j = σ(σ(v_i^j - η + δb) + η + b_1^{L-3}). Since η is at the midpoint of a gap between adjacent v_i^j's (as indicated in the proof of Lemma 2), we have v_i^j > η; hence, for a sufficiently small perturbation δb, v_i^j - η + δb > 0 and v_i^j + b_1^{L-3} + δb > 0 hold. Therefore, z'_j = v_i^j + b_1^{L-3} + δb = z_j + δb, ∀j ∈ [P_{L-2}]. 2) -b_1^{L-3} ≤ v_i^j < η. Since v_i^j - η + δb < 0, the perturbation δb does not pass through the ReLU activation: z'_j = v_i^j + b_1^{L-3} = z_j, ∀j ∈ [P_{L-2}].

Under review as a conference paper at ICLR 2023

For the case η < -b_1^{L-3}, the remaining cases of v_i^j are as follows. 2) η ≤ v_i^j < -b_1^{L-3}. In this case, since a gap exists between η and v_i^j when splitting the samples with η in Lemma 2, we have η < v_i^j and consequently v_i^j - η + δb > 0 for sufficiently small δb. Therefore, z'_j = (v_i^j - η + δb) - (v_i^j - η) = δb = z_j + δb, ∀j ∈ [P_{L-2}]. 3) v_i^j < η. Using v_i^j - η + δb < 0, we have z'_j = σ(η + b_1^{L-3}) = 0 = z_j, ∀j ∈ [P_{L-2}]. (59) The perturbation δb has been blocked by the ReLU units. Only in the first and second cases is the output o_1(x_i) affected by the perturbation δb. Thus, we have z'_j = z_j + δb if v_i^j > η, ∀i ∈ [N], j ∈ [P_{L-2}]. The output is perturbed as δo_1(x_i) = Σ_{j=1}^{P_{L-2}} w_j δz_j = Σ_{j=1}^{P_{L-2}} w_j δb I(v_i^j > η). The remaining output components are unchanged. The training loss is perturbed as follows: δR = (1/N) Σ_{i=1}^N u_i δo_1(x_i) = (1/N) δb Σ_{j=1}^{P_{L-2}} w_j Σ_{i∈{1,2,⋯,N | v_i^j > η}} u_i. Because η is the biggest split threshold among all locations, only for the hth location do we have Σ_{i∈{1,2,⋯,N | v_i^h > η}} u_i ≠ 0.
We then have δR = (1/N) δb w_h Σ_{i=1}^n u_i, which is the same as (52). By setting δb with the appropriate sign, we can obtain δR < 0 and conclude that θ' is a spurious local minimum. If a perturbation of W^L(1, ·) is required to split the data samples as shown in Lemma 11, the training loss at θ is perturbed as R(θ + δθ) = R(θ) + (∂R(θ)/∂W^L(1, ·)) δW^L(1, ·) = R(θ), due to the optimality condition ∂R(θ)/∂W^L(1, ·) = 0^T, since θ is a local minimum. The perturbation δθ only modifies W^L(1, ·) and does not change the other parameters. Therefore, by starting from θ + δθ and setting the parameters of the last three layers as in Lemma 9, we can obtain a point θ' with R(θ + δθ) = R(θ'). Thus we have R(θ) = R(θ'). We then perturb the biases of θ' as in Lemma 10 to obtain a loss lower than R(θ), which shows that θ is a spurious local minimum.
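The sign choice at the end of this argument is mechanical: since δR = (1/N) δb w_h Σ u_i with a nonzero sum, taking δb with the opposite sign of w_h Σ u_i makes δR negative. A one-line numerical sketch (all values illustrative):

```python
import numpy as np

N = 5
w_h = 2.0        # weight on the h-th location
u_sum = 0.7      # nonzero u-sum obtained from the split

delta_b = -1e-3 * np.sign(w_h * u_sum)       # sign chosen to descend
delta_R = (1.0 / N) * delta_b * w_h * u_sum  # first-order loss change

print(delta_R < 0)  # True: the empirical risk strictly decreases
```

Because the sum is nonzero, a descending sign for δb always exists, which is exactly why θ' cannot be a local minimum.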

G.5 PROOF OF LEMMA 3

Proof. We prove by induction. We will often perturb the network parameters to tune the o_i or u_i (i ∈ [N]), which does not change the empirical loss, as shown in (64). Let us first consider some special cases. CASE 1. Suppose the groups at two locations are location 1: (u_1, u_2); location 2: (u_1, u_2). (65) If u_1 + u_2 ≠ 0, then the sequence (u_1, u_2) - (u_1, u_2) (the first element in the sequence is (u_1, u_2) from location 1, and the second one is (u_1, u_2) from location 2) is a CI sequence. However, this CI sequence cannot be annihilated by adding a new sample, since we can use a perturbation of the network parameters to prevent that from happening, as described later in the induction step. If u_1 + u_2 = 0, by Lemma 11, we can first perturb the network parameters to obtain u_1 + u_2 ≠ 0. CASE 2. Suppose the groups at two locations are location 1: (u_1, u_2); location 2: (u_1), (u_2), where groups on the left have bigger v_i^j's than those on the right at each location. By Lemma 11, we can perturb the network parameters to obtain u_1 ≠ 0 and u_2 ≠ 0 for the squared loss and the cross-entropy loss; then the sequences (u_1, u_2) - (u_1) and (u_1, u_2) - (u_2) are both non-CI sequences, and thus the induction hypothesis holds. If (u_1, u_2) in location 1 is separated as (u_1), (u_2) after the perturbation, then it can be treated by the following case 3. CASE 3. Suppose the groups at two locations are location 1: (u_1), (u_2); location 2: (u_1), (u_2), (67) or location 1: (u_1), (u_2); location 2: (u_2), (u_1). For these cases, by Lemma 11, if u_1 = u_2, we can perturb the parameters to obtain u_1 ≠ u_2 for the squared loss and the cross-entropy loss; then the sequences (u_1) - (u_2) and (u_2) - (u_1) are both non-CI sequences. The sequences (u_1) - (u_1) and (u_2) - (u_2) are CI sequences. Again, by appropriate perturbations when adding new samples in the induction step, such CI sequences will not be removed, and thus Lemma 3 holds. M ≥ 3.
If there are more than two locations, we can select two locations as a subset, which must belong to one of the above three possible cases. The full sequences over all locations contain the sequences of this two-location subset as subsequences, and consequently groups with Σ_{i∈I_q^j} u_i ≠ 0 still exist in the full sequences. Therefore, the induction hypothesis holds for M ≥ 3.

G.5.4 INDUCTION STEP

At last, we prove that if the induction hypothesis holds for N = m samples, then it holds for N = m + 1 samples. Suppose sample x_{m+1} is to be added given the current arrangement of groups. Note that we can have u_{m+1} ≠ 0 according to Lemma 11. There are the following three possible situations. 1) x_{m+1} is added as a new group at every location. This is equivalent to adding a new CI sequence. CI and non-CI sequences formed by the previous groups still exist, and thus the induction hypothesis holds. 2) x_{m+1} is inserted into an existing group at every location. More specifically, this can be divided into the following four cases. A) x_{m+1} is inserted into an existing group satisfying Σ_{i∈I_q^j} u_i = 0 at every location. This is equivalent to adding a new CI sequence into existing groups. Since CI or non-CI sequences still exist, the induction hypothesis holds. B) x_{m+1} is inserted into existing CI or non-CI sequences. If x_{m+1} is inserted into an existing CI sequence, the result will either be a new CI sequence or be annihilated by this addition. By the induction hypothesis, non-CI sequences may still exist; hence the induction hypothesis holds again after inserting x_{m+1}. Another possible situation is that CI sequences exist, such as (u_i) - (u_i) - ⋯ - (u_i) (i ∈ [N]), and when adding new samples there is a risk of annihilating such CI sequences; if no non-CI sequences remain, then no groups with Σ_{i∈I_q^j} u_i ≠ 0 exist. We can perturb the outputs of the samples to avoid this situation using a perturbation of the network parameters. For example, if u_1 + u_3 = 0, then adding u_3 into the sequence (u_1) - (u_1) - ⋯ - (u_1) will cause its disappearance. This would require o_1 = o_3, since sample 1 and sample 3 must then appear in the same group at every location after the addition. After a perturbation that tunes o_1 and o_3 differently to get o_1 ≠ o_3, as shown in Lemma 11, sample 1 and sample 3 cannot appear in the same group at every location.
The annihilation of the sequence (u_1) - (u_1) - ⋯ - (u_1) is thus avoided and the induction hypothesis holds. If x_{m+1} is inserted into an existing non-CI sequence, the result will still be a non-CI sequence by definition. Therefore, the induction hypothesis holds. C) x_{m+1} is inserted into an existing CI sequence at some locations and into an existing non-CI sequence at the remaining locations. The involved CI sequence becomes a non-CI one, and the involved non-CI sequence may become a non-CI sequence, a CI sequence, or a zero sequence. In any case, non-CI sequences still exist and cannot be removed by adding a sample, and thus the induction hypothesis holds. D) x_{m+1} is inserted into an existing CI or non-CI sequence at some locations and into groups with Σ_{i∈I_q^j} u_i = 0 at the remaining locations. In this case, non-CI sequences will exist after inserting x_{m+1}, and thus the induction hypothesis holds. 3) x_{m+1} is inserted into existing groups at some locations j ∈ J_1 and added as a new group at the remaining locations j ∈ J_2. Equivalently, we can regard the new groups in J_2 (before adding x_{m+1}) as existing groups with Σ_{i∈I_q^j} u_i = 0, so the above case D) can be applied directly, and the induction hypothesis is maintained.

G.5.5 INDUCTION CONCLUSION

We have proved that if the induction hypothesis holds for N = m samples, then it holds for N = m + 1 samples. By the induction hypothesis, there exist j ∈ [M] and q ∈ [g_j] with Σ_{i∈I_q^j} u_i ≠ 0, and consequently a split of the data samples somewhere in the feature maps is always possible.

G.6 PROOF OF LEMMA 12

Proof. For the cross-entropy loss, the training loss for a sample (x_i, y_i) is l(o(x_i), y_i) = -Σ_{j=1}^{d_y} y_{i,j} log(e^{-o_{i,j}} / Σ_{k=1}^{d_y} e^{-o_{i,k}}), where o_{i,j} is the jth component of o_i. Its derivative is u_i = ∂l(o(x_i), y_i)/∂o_{i,1} = e^{-o_{i,1}} / Σ_{k=1}^{d_y} e^{-o_{i,k}} - y_{i,1}. Letting p_i = e^{-o_{i,1}} / Σ_{k=1}^{d_y} e^{-o_{i,k}}, we have u_i = p_i - y_{i,1}. Denote w^T := W^L(1, ·) and o := o_{i,1}. Under the perturbation δw, by Taylor expansion we have u'_i = u_i + (∂u_i/∂o)(I^{L-1,i} W^{L-1} ⋯ I^{1,i} W^1 x_i) · δw + (1/2)(∂²u_i/∂o²)((I^{L-1,i} W^{L-1} ⋯ I^{1,i} W^1 x_i) · δw)² + O(||δw||³). (72) Note that for o^{L-1,i} = I^{L-1,i} W^{L-1} ⋯ I^{1,i} W^1 x_i, we have o^{L-1,i}(1 : 2P_{L-2}) = 0 (∀i ∈ [N]) due to the parameter setting in Lemma 6, so the perturbation δw(1 : 2P_{L-2}) has no effect in (72). For the cross-entropy loss, the first- and second-order derivatives of u_i are ∂u_i/∂o = p_i - p_i² and ∂²u_i/∂o² = (∂²u_i/∂o∂p_i)(∂p_i/∂o) = (1 - 2p_i) ∂u_i/∂o. Let Q_i := I^{L-1,i} W^{L-1} ⋯ I^{1,i} W^1. We now prove that if Q_i x_i = αQ_j x_j (α ≠ 0, α ≠ 1), then by setting δw appropriately one can always make u'_i ≠ u'_j. Note that ∂u_i/∂o = p_i - p_i² ≠ 0, since p_i ≠ 0 and p_i ≠ 1 for finite inputs. Also, by the requirement Î^{L-1,i} ≠ 0 in Lemma 6, we have Q_i x_i ≠ 0, i ∈ [N]. We can set δw = Q_i x_i / ||Q_i x_i|| such that (Q_i x_i) · δw ≠ 0 and (∂u_i/∂o)(Q_i x_i) · δw ≠ (∂u_j/∂o)(Q_j x_j) · δw; hence we have u'_i ≠ u'_j up to the first-order Taylor expansion. If Q_i x_i = αQ_j x_j and ∂u_i/∂o = (1/α) ∂u_j/∂o, we need to consider the second-order term in (72), and if (∂²u_i/∂o²)α² ≠ ∂²u_j/∂o², we have u'_i ≠ u'_j up to the second-order Taylor approximation. Otherwise, we have (p_i - p_i²)α = p_j - p_j² (75) and (1 - 2p_i)α = 1 - 2p_j. (76) We next show that (75) and (76) cannot both hold for each case of y_{i,1} ∈ {0, 1}, and thus we have u'_i ≠ u'_j up to either the first-order or the second-order Taylor approximation. If u_i = p_i, u_j = p_j or u_i = p_i - 1, u_j = p_j - 1, then u_i = u_j implies p_i = p_j.
From (75) and α ≠ 1, a contradiction results, and hence u_i' ≠ u_j'. If the samples x_i and x_j have distinct labels, say, without loss of generality, u_i = p_i - 1 and u_j = p_j, then u_i = u_j leads to p_j = p_i - 1. Substituting this into (75) and (76), we obtain

(p_i - p_i²) α = (p_i - p_i²) + 2(p_i - 1) (77)

and

(1 - 2p_i) α = (1 - 2p_i) + 2, (78)

respectively. (77) and (78) can be transformed into

p_i (1 - p_i) α = (1 - p_i)(p_i - 2) (79)

and

(1 - p_i)(2α - 2) = α + 1, (80)

respectively. Since 1 - p_i ≠ 0, (79) yields

p_i α = p_i - 2. (81)

Solving (80) and (81), we obtain α = -1 and p_i = 1. However, p_i = 1 is impossible for finite inputs; therefore, u_i' ≠ u_j' after the perturbation.
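The distinct-label contradiction above can be checked numerically. In the following sketch (function names are ours, not the paper's), α is forced by (81) and then tested against (80); no p_i ∈ (0, 1) satisfies both, matching the conclusion α = -1, p_i = 1:

```python
# Check that (80) and (81) cannot hold simultaneously for p in (0, 1).
def residual_80(p, a):
    # equation (80) moved to one side: (1 - p)(2a - 2) - (a + 1)
    return (1 - p) * (2 * a - 2) - (a + 1)

def a_from_81(p):
    # equation (81): p * a = p - 2  =>  a = (p - 2) / p
    return (p - 2) / p

def joint_solution_exists(grid=1000):
    # scan p over (0, 1); a is determined by (81), then (80) is tested
    for k in range(1, grid):
        p = k / grid
        if abs(residual_80(p, a_from_81(p))) < 1e-9:
            return True
    return False

print(joint_solution_exists())
```

Algebraically, substituting a = (p - 2)/p into (80) leaves the residual -2(1 - p)/p, which vanishes only at the excluded point p = 1 (where a = -1).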

G.7 PROOF OF LEMMA 13

Proof. We now prove that if u_i + u_j = 0 and Q_i x_i = αQ_j x_j (α ≠ 0, α ≠ 1), then by setting δw appropriately one can always obtain u_i' + u_j' ≠ 0. We show that (82) and (83) cannot both hold in every case of y_{i,1} ∈ {0, 1}. If u_i = p_i and u_j = p_j, then u_i + u_j = 0 implies p_j = -p_i. Substituting this into (82) and (83), simple calculations give

-p_i α - p_i = 1 - α (84)

and

2p_i α + 2p_i = α - 1. (85)

The only solution is α = 1 and p_i = 0, which contradicts α ≠ 1 and p_i ≠ 0. If u_i = p_i - 1 and u_j = p_j - 1, then u_i + u_j = 0 implies p_j = 2 - p_i. In this case, the solution to (82) and (83) is α = 1 and p_i = 1, again a contradiction. If the samples x_i and x_j have distinct labels, say, without loss of generality, u_i = p_i - 1 and u_j = p_j, then u_i + u_j = 0 leads to p_j = 1 - p_i. The solution to (82) and (83) is then α = -1 with p_i arbitrary. However, α = -1 implies Q_i x_i = -Q_j x_j. Since I_{L-1,i} ≠ 0 by the requirement in Lemma 6, for all i ∈ [N] every component of Q_i x_i is nonnegative and some component is positive, so Q_i x_i = -Q_j x_j leads to a contradiction. Therefore, α = -1 is impossible. In summary, in all cases of y_{i,1} ∈ {0, 1} we obtain u_i' + u_j' ≠ 0.
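The same-label case above admits a quick numerical check. In this sketch (helper names are ours), (84) and (85) are each solved for p as a function of α; the two branches agree only at the excluded point α = 1, p_i = 0:

```python
# (84): -p*a - p = 1 - a   =>  p = (a - 1) / (a + 1)
# (85): 2*p*a + 2*p = a - 1 => p = (a - 1) / (2 * (a + 1))
def p_from_84(a):
    return (a - 1) / (a + 1)

def p_from_85(a):
    return (a - 1) / (2 * (a + 1))

def common_solution(a):
    # do (84) and (85) force the same p at this alpha?
    return abs(p_from_84(a) - p_from_85(a)) < 1e-12

# scan alphas, excluding a = -1 (division by zero) and a = 1 (excluded case)
alphas = [a / 10 for a in range(-50, 51) if a not in (-10, 10)]
print(any(common_solution(a) for a in alphas))
```

The difference of the two branches is (α - 1)/(2(α + 1)), which is zero only at α = 1, exactly the value ruled out by Assumption 1.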

G.8 PROOF OF LEMMA 11

Proof. 1. Proof of o_i' ≠ o_j' (i, j ∈ [N], i ≠ j) for squared loss and cross-entropy loss. We perturb W_L(1, ·) to make o_i' ≠ o_j'. Under this perturbation, we have

o_i' = o_i + (I_{L-1,i} W_{L-1} ⋯ I_{1,i} W_1 x_i) · δw = o_i + (Q_i x_i) · δw.

Since o_{L-1,i} ≠ o_{L-1,j} by Assumption 1, we have Q_i x_i ≠ Q_j x_j. Then, if o_i = o_j, we have o_i' - o_j' = (Q_i x_i - Q_j x_j) · δw. Letting δw = ε(Q_i x_i - Q_j x_j), where ε is a sufficiently small positive number, we get o_i' - o_j' = ε‖Q_i x_i - Q_j x_j‖² ≠ 0. Thus, o_i' ≠ o_j'.

Subsequent perturbations δw' must be set small enough, as follows, to make sure the nonzero gaps in o_i' - o_{i+1}' (i ∈ [s-1]) do not disappear. For the gap o_i' - o_{i+1}', we have o_i' - o_{i+1}' ≥ o_i - o_{i+1} - ‖δw'‖ ‖Q_i x_i - Q_{i+1} x_{i+1}‖ (i ∈ [s-1]). We already have o_i - o_{i+1} > 0 for the nonzero gaps, and if δw' is small enough that o_i - o_{i+1} - ‖δw'‖ ‖Q_i x_i - Q_{i+1} x_{i+1}‖ > 0 for all i ∈ [s-1], or equivalently

‖δw'‖ < min_{i∈[s-1], o_i - o_{i+1} > 0} (o_i - o_{i+1}) / ‖Q_i x_i - Q_{i+1} x_{i+1}‖,

then o_i' - o_{i+1}' > 0 for all i ∈ [s-1]. Note that Q_i x_i - Q_{i+1} x_{i+1} ≠ 0. The nonzero gaps between neighboring samples thus still exist. Additional perturbations can be treated in the same spirit. For a dataset with N samples, at most N perturbations are needed, and the total perturbation is obtained by adding the perturbations of all steps, i.e., δw + δw' + ⋯. The gaps between the u_i's can be preserved similarly using small enough perturbations.

G.9 PROOF OF THEOREM 3

Proof. The parameter matrix W_1(1 : P_0, 1 : P_0) of the first convolutional layer can be implemented by a filter such as w_1^1 = (0, 0, 1, 0, 0) (for s_0 = 5). With the parameter setting in Theorem 3, the output of the kth neuron in the first convolutional layer is o_k^1(x_i) = σ(x_i^k + c_p), ∀i ∈ [N]. Then we have

o_k^1(x_i) = σ(x_i^k + c_p) = x_i^k + c_p > 0, ∀i ∈ I (93)

due to (10).
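The first separating perturbation in the proof of Lemma 11 above can be illustrated numerically. In the sketch below, q_i and q_j stand in for Q_i x_i and Q_j x_j (the dimension and values are our choice, not the paper's); δw = ε(q_i - q_j) opens a strictly positive output gap ε‖q_i - q_j‖²:

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(0)
q_i = [random.uniform(0.0, 1.0) for _ in range(8)]  # ReLU features: nonnegative
q_j = [random.uniform(0.0, 1.0) for _ in range(8)]
diff = [a - b for a, b in zip(q_i, q_j)]

eps = 1e-3
delta_w = [eps * d for d in diff]  # the separating perturbation of W_L(1, .)

# when o_i = o_j before the perturbation, the new gap is o_i' - o_j'
gap = dot(q_i, delta_w) - dot(q_j, delta_w)
print(gap > 0.0)
```

The gap equals ε · ‖q_i - q_j‖², so it is strictly positive whenever q_i ≠ q_j, which is guaranteed here by Assumption 1 (o_{L-1,i} ≠ o_{L-1,j}).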
The outputs of the remaining neurons in the first feature map of the first convolutional layer do not matter, since they will later be annihilated by W_L. Given x_i^k + c_p < 0 (∀i ∉ I), we have o_k^1(x_i) = 0 for i ∈ [N], i ∉ I. The filters and biases of the higher convolutional layers (l ∈ {2, 3, ⋯, L-1}) for the first predictor propagate o^l(1 : P_{l-1}) (the outputs in the first feature map of each layer) forward without changing them. Sufficiently large positive constants c_l (l ∈ [L-1]) are added to the biases of each layer so that the ReLU units in the second subnetwork are all activated and their outputs are not truncated. Each output neuron of the first convolutional layer is thereby increased by c_1, i.e., ∆o^1(i) = c_1 for i ∈ [P_0 + 1 : n_1], and the resulting changes of the outputs of the higher convolutional layers can be computed by the recursion in (95). The effect of introducing c_l (l ∈ [L-1]) is cancelled at the output layer by subtracting ∆o^L(i) from o^L(i) (i ∈ [d_y]), which is equivalent to subtracting ∆o^L(i) from b_i^L. By the parameter setting, the output for each sample is

o'(x_i) = W_2 x_i + σ(x_i^k + c_p) · (λ_1, λ_2, ⋯, λ_{d_y})^T, ∀i ∈ [N].
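The bias-shift bookkeeping around (95) can be illustrated on a toy all-active ReLU chain. In this sketch (sizes and weights are ours, chosen only so that every ReLU stays active), adding c_1 to the hidden biases and subtracting the propagated change ∆o^L from the output bias leaves the network output unchanged:

```python
def relu(v):
    return [max(0.0, x) for x in v]

def affine(W, x, b):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

# tiny 2-layer chain; biases large enough that every ReLU is active
W1, b1 = [[0.5, 0.2], [0.1, 0.7]], [1.0, 1.0]
W2, b2 = [[0.3, 0.4]], [0.5]
x = [0.6, 0.9]

base = affine(W2, relu(affine(W1, x, b1)), b2)

c1 = 2.0                      # constant added to every hidden bias
delta_o1 = [c1, c1]           # change at layer 1 (all units active)
# propagate the change through the linear map, as in recursion (95)
delta_oL = [sum(w * d for w, d in zip(row, delta_o1)) for row in W2]

b1_new = [b + c1 for b in b1]
b2_new = [b - d for b, d in zip(b2, delta_oL)]  # cancel at the output
shifted = affine(W2, relu(affine(W1, x, b1_new)), b2_new)

print(all(abs(a - b) < 1e-12 for a, b in zip(base, shifted)))
```

Because the units never switch activation status, the injected constant propagates linearly and is cancelled exactly at the output layer, which is the purpose of subtracting ∆o^L(i) from b_i^L.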



Figure 1: The top layers and associated parameters for constructing local minima and spurious local minima. In general, our construction can be applied to CNNs with two consecutive convolutional layers, like layers L -1 and L -2 in this figure.

and enable different behaviors under parameter perturbations for different subsets of samples. Different settings are designed depending on whether η ≥ -b_1^{L-3} or not. The setting for the case η ≥ -b_1^{L-3} is illustrated in Fig. 1(b), and the setting for η < -b_1^{L-3} is shown in Fig. 1(c). The following lemma informally describes the parameter settings; its formal version is given as Lemma 9 in Appendix C.

Lemma 4 (informal). Given the CNN parameters θ := (W_l, b_l)_{l=1}^L specified in Lemma 1 (formally Lemma 6 in Appendix C), let θ' := (W_l', b_l')_{l=1}^L, and set W_L', b_L', W_{L-1}', b_{L-1}' and W_{L-2}', b_{L-2}' in such a way that the value w_j · σ(v_i^j + b_1^{L-3}) + b_1^L (the original output of the first output neuron, contributed by location j in the first feature map of layer L-3) is unchanged for every location j and each sample i ∈ [N], no matter how large v_i^j is relative to η and b_1^{L-3}. There are three possible paths through which a value σ(v_i^j + b_1^{L-3}

); Ding et al. (2019); Goldblum et al. (2020); Liu et al. (2021)) for deep ReLU networks, spurious local minima are trivial, since all ReLU neurons are active and the deep networks reduce to linear predictors.

Figure 2: Visualization of level sets of the empirical loss around specific locations in parameter space. (a) and (b) demonstrate that θ is a spurious local minimum, since R(θ') = R(θ). (c) shows that θ_1, constructed in Theorem 2, is a local minimum.

Figure 3: The variation of the empirical loss along 200 random directions around specific locations in parameter space. In (b), some directions have losses lower than R(θ') = R(θ). (a) and (b) together demonstrate that θ is a spurious local minimum. (c) shows that θ_1, constructed in Theorem 2, is a local minimum.

(∂u_i/∂o)(Q_i x_i) · δw ≠ -(∂u_j/∂o)(Q_j x_j) · δw, hence u_i' + u_j' ≠ 0 up to the first-order Taylor expansion. If Q_i x_i = αQ_j x_j and ∂u_i/∂o = -(1/α) ∂u_j/∂o, we need to consider the second-order term in (72), and if (∂²u_i/∂o²) α² ≠ -∂²u_j/∂o², we have u_i' + u_j' ≠ 0 up to the second-order Taylor approximation. Otherwise, we have ∂u_i/∂o = -(1/α) ∂u_j/∂o and (∂²u_i/∂o²) α² = -∂²u_j/∂o², which result in

-(p_i - p_i²) α = p_j - p_j² (82)

and

-(1 - 2p_i) α + (1 - 2p_j) = 0. (83)
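The derivative identities feeding into (82) and (83) can be verified by finite differences. The sketch below uses the standard softmax convention p(o) = e^o / (e^o + Z), with the remaining logits held fixed in the constant Z (the paper writes e^{-o}; that convention flips the sign of the first-order identity but leaves the second-order relation unchanged):

```python
import math

Z = 2.5  # stand-in for the sum over the remaining logits, held fixed

def p(o):
    # softmax probability of the first class as a function of its logit
    return math.exp(o) / (math.exp(o) + Z)

def fd1(f, o, h=1e-5):
    # central first difference
    return (f(o + h) - f(o - h)) / (2.0 * h)

def fd2(f, o, h=1e-4):
    # central second difference
    return (f(o + h) - 2.0 * f(o) + f(o - h)) / (h * h)

o0 = 0.3
p0 = p(o0)
# check dp/do = p - p^2 and d^2p/do^2 = (1 - 2p)(p - p^2)
print(abs(fd1(p, o0) - (p0 - p0 ** 2)) < 1e-8,
      abs(fd2(p, o0) - (1.0 - 2.0 * p0) * (p0 - p0 ** 2)) < 1e-5)
```

This is the standard logistic/softmax derivative chain p' = p(1 - p) and p'' = (1 - 2p) p(1 - p), which is what makes the factor (1 - 2p_i) appear in (83).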

∆o^l(i) = Σ_{j∈[P_{l-2}+1 : n_{l-1}]} W_l(i, j) ∆o^{l-1}(j) + c_l, i ∈ [P_{l-1}+1 : n_l], l ∈ {2, 3, ⋯, L-1},
∆o^L(i) = Σ_{j∈[P_{L-2}+1 : n_{L-1}]} W_L(i, j) ∆o^{L-1}(j), i ∈ [d_y]. (95)

By Lemma 8 in Appendix C, for training data that cannot be fit by linear models (as also assumed by He et al. (2020) and Liu et al. (2021)) and for popular CNN architectures, there always exists a point θ such that both Assumption 1 and R(θ) > 0 hold. Moreover, by Lemma 11 in Appendix E, for the squared and cross-entropy loss functions l,

where ε is a sufficiently small positive number. By (72), if ∂u_i/∂o ≠ (1/α) ∂u_j/∂o,


(26) The remaining parameters in θ' are kept fixed. Then R(θ') = R(θ). The purpose of introducing b_1^{L-3}

Next, we show that by perturbing θ', we can decrease the training loss. We use different perturbation schemes for the different cases of η. The following Lemma 10 describes the perturbation scheme; it is the formal version of Lemma 5 in Section 3.2.3. The remaining parameters in θ' are kept fixed. Then, under Assumption 1, a training loss lower than R(θ') is obtained under this perturbation with a proper sign of δb.

This cannot happen for every group q and every location j, and there must exist some locations where Σ_{i∈I^j_q} u_i ≠ 0 for some groups. As a result, we can split the data samples as before at these locations. There may be more than one location where such a data split exists; we choose the one with the biggest threshold η and use it in the perturbation of the training loss. Letting h be the location with the biggest split threshold, we then have

G.3 PROOF OF LEMMA 9

Proof. There are two possible cases for η: η ≥ -b_1^{L-3} and η < -b_1^{L-3}, which will be treated differently in the following. In order to compare the training losses at θ and θ', we need to compute the output o'(x_i) := o_{L,i}' obtained by θ'. We give the computation of its first component o_1'(x_i) in detail. We only need to compute those terms in o_1'(x_i) that are affected by the parameter settings in (25) and (26), and we denote by õ'(x_i) the sum of such terms. Only the first and the second feature maps in layers L-1 and L-2 are involved. Let o_{L-3,i}' := σ(W_{L-3}(1 : …, where 1_{P_{L-4}} is the vector of size P_{L-4} whose components are all 1s; o_{L-3,i}' is the output of the first feature map at layer L-3 using the new bias b_1^{L-3}'. For notational simplicity, denote W_L(1, i) (i ∈ [P_{L-2}]) by w_i, and note that v_i^j := W_{L-3}(j, ·) o_{L-4,i} (j ∈ [P_{L-4}]). Using (25), we have … where … We also denote … and thus the partial output obtained from θ' is … Combining the above three cases, it is found that the output o_1'(x_i) is perturbed only if v_i^j > η. Correspondingly, the perturbation of the output is … where I(·) is the indicator function. The other components of the output do not change under this perturbation, since W_L(2 : … The training loss is perturbed as … Therefore, we have … For all samples x_i with v_i^j > η (∀j ∈ [P_{L-2}]), by Lemma 2, only at the hth neuron in the first feature map of layer L-3 do both conditions hold.
At the other locations in this feature map, because η is the biggest split threshold among all locations, even if …, where Sgn(·) is the sign function, we obtain δR < 0. (53) Therefore, the training loss is decreased by the perturbation, resulting in … Thus, θ' is a spurious local minimum. If w_h = 0, we perturb it as well by w_h → δw; then … Setting Sgn(δb) = -Sgn(δw Σ_{i=1}^N u_i) can still decrease the training loss, and thus the conclusion that θ' is a spurious local minimum still holds. With this perturbation, the output of each sample is perturbed in three different cases according to the value of … In this case, by modifying (42) we have …

G.5.1 SPECIAL CASES

If there is only one data sample, i.e., N = 1, there is only one group at each location. Then, by Lemma 11, we have u = (u_1) ≠ 0, and thus Lemma 3 holds. We next consider the case of only one location, M = 1. If there are some groups with Σ_{i∈I^1_q} u_i ≠ 0, Lemma 3 already holds. If there exists a group consisting of a single sample with u_i = 0, then according to Lemma 11 we can perturb the network parameters to get u_i ≠ 0, and thus Lemma 3 holds. Otherwise, all groups contain multiple samples and Σ_{i∈I^1_q} u_i = 0 for all groups; by R(θ) > 0, and consequently u ≠ 0, there must exist some group k containing a sample i ∈ I^1_k with u_i ≠ 0. All samples in group k have identical outputs, and by Lemma 11 we can separate this sample with u_i ≠ 0 from the other samples in its group by a perturbation, i.e., o_i ≠ o_j (∀j ∈ I^1_k, j ≠ i). By doing so, we obtain an isolated sample with u_i ≠ 0, and thus Lemma 3 is satisfied.

G.5.2 INDUCTION HYPOTHESIS

We now discuss the cases of N ≥ 2 and M > 1, and prove them by induction. In order to state our induction hypothesis, we first give some definitions. For multiple locations (M > 1), there are usually a number of groups at each location. Recall that I^j_q denotes the set of indices of the samples in the qth group at the jth location. We can select one group, with sum Σ_{i∈I^j_{q_j}} u_i, from each location j and form a sequence (Σ_{i∈I^1_{q_1}} u_i, …, Σ_{i∈I^M_{q_M}} u_i). If all elements in this sequence are equal and nonzero, we call it a complete and identical sequence of groups (CI sequence). If all elements in a sequence are nonzero but not all equal, or some of its elements equal zero, we call it a non-CI sequence.

Given a CI sequence, after adding a sample u_k to the group q_j at each location j, the sums Σ_{i∈I^j_{q_j}} u_i (where each I^j_{q_j} now contains the new sample index k) may all become zero simultaneously, namely when u_k equals the negative of the common value. Therefore, if there exists only one CI sequence among all groups and all locations, and Σ_{i∈I^j_{q_j}} u_i = 0 for all groups not in this CI sequence, then it is possible that Σ_{i∈I^j_q} u_i = 0 for all j ∈ [M] and q ∈ [g_j] after adding a new sample that annihilates this CI sequence, and Lemma 3 would be violated. In order to prove Lemma 3, it therefore suffices to prove that for every possible group configuration, groups with Σ_{i∈I^j_q} u_i ≠ 0 still exist after removing CI sequences. Sometimes only CI sequences exist, and when adding new samples there is a risk of annihilating them. We can avoid such situations using perturbations of the network parameters (this will be explained in Sections G.5.3 and G.5.4).
Consequently, we give the following induction hypothesis. For multiple locations, either non-CI sequences exist after removing each CI sequence (setting each of its elements Σ_{i∈I^j_{q_j}} u_i to zero), or the removal of CI sequences can be prevented. By this induction hypothesis, after adding a new sample x_{m+1} to one existing group at each location, groups with Σ_{i∈I^j_{q_j}} u_i ≠ 0 still exist, and Lemma 3 holds accordingly. As base cases, we will prove that this induction hypothesis holds for N = 2. We then prove, as the induction step, that if it holds for N = m (m ≥ 2), it also holds after adding a new sample x_{m+1}.

G.5.3 BASE CASES: N = 2

M = 2. First, we discuss the case of two locations. When N = 2, all possible group configurations are discussed in the following, and we prove that the induction hypothesis holds in each case.
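The CI-sequence bookkeeping can be made concrete with a small enumeration. In the sketch below (the data layout and function names are ours), group_sums[j][q] stores Σ_{i∈I^j_q} u_i, and a CI sequence is any choice of one group per location whose sums are all equal and nonzero:

```python
from itertools import product

def ci_sequences(group_sums, tol=1e-12):
    # enumerate one-group-per-location choices; keep the equal-and-nonzero ones
    seqs = []
    for choice in product(*[range(len(g)) for g in group_sums]):
        vals = [group_sums[j][q] for j, q in enumerate(choice)]
        if abs(vals[0]) > tol and all(abs(v - vals[0]) <= tol for v in vals):
            seqs.append(choice)
    return seqs

def has_nonzero_group(group_sums, tol=1e-12):
    # Lemma 3 needs some location j and group q with a nonzero sum
    return any(abs(s) > tol for g in group_sums for s in g)

# two locations; one CI sequence (value 0.4 at both) plus non-CI leftovers
gs = [[0.4, 0.0], [0.4, -0.3]]
print(ci_sequences(gs), has_nonzero_group(gs))
```

In this configuration the single CI sequence picks group 0 at both locations, but the leftover sum -0.3 guarantees a nonzero group survives even if a new sample annihilates the CI sequence, which is the situation the induction hypothesis is designed to ensure.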

2. Proof of u_i' ≠ 0 for squared loss and cross-entropy loss.

For squared loss, we perturb W_L(1, ·) to make u_i' ≠ 0 if u_i = 0. With this perturbation, we have … and by the parameter setting in Lemma 6, we have u_i' ≠ 0. For the cross-entropy loss function, the derivative u_i is given in (70). For finite inputs and y_{i,1} ∈ {0, 1}, we have …

For squared loss, we perturb W_L(1, ·) to make u_i' ≠ u_j' or u_i' + u_j' ≠ 0. Under this perturbation, if u_i = u_j, by (87) we have … For the cross-entropy loss function, recall that w^T := W_L(1, ·) and … Under the perturbation δw, by the first-order Taylor expansion in (72), using (73) we have … Note that Q_i x_i ≠ 0, Q_j x_j ≠ 0, and p_i - p_i² ≠ 0, since p_i ≠ 0 and p_i ≠ 1. If Q_i x_i is parallel to Q_j x_j, i.e., Q_i x_i = αQ_j x_j (α ≠ 0), we can obtain u_i' ≠ u_j' or u_i' + u_j' ≠ 0 (i, j ∈ [N], i ≠ j) by Lemma 12 or Lemma 13, respectively; α = 1 is excluded by Assumption 1. Otherwise, there must be some nonzero components in the vector … For cross-entropy loss, according to (70), …

5. Nonzero gaps between the o_i's or u_i's can persist through subsequent perturbations, for squared loss and cross-entropy loss. We discuss the gaps between the o_i's in detail. Given some samples with identical outputs, say without loss of generality …, we want to perturb the outputs of the samples again to avoid o_n' = o_{n+1}', using δw'. Under the second perturbation δw', the output of each sample is … where the expression of W_2 x_i is given in (20). This implies … and o'(x_i) = ỹ_i, l(o'(x_i), y_i) = 0, i ∈ I, (98) if there is a single index in the set I, where ỹ_i is defined in Section 4. If there are multiple elements in the set I, … is minimized. For the samples {x_i, i ∉ I}, we have …

We now show that for the CNN given in Theorem 3, θ_1 = (W_l', b_l')_{l=1}^L is a local minimum. This can be seen by perturbing the parameters and showing that the loss does not decrease.
The training loss is … Under the perturbation (δW_l, δb_l)_{l=1}^L, since there is a gap between x_i^k (i ∈ [N]) and the threshold (1/2)(x_{i*}^k + x_{j*}^k), the ReLU unit in (92) keeps its activation status for every sample when the perturbation of b_1^1 = c_p is sufficiently small. Therefore, the output of each sample is still of the form (96). Accordingly, if there are multiple elements in the set I, … where δo(x_i) (i ∈ I) and δo(x_i) (i ∉ I) are caused by the perturbations of the parameters in both W_1 and W_2. (104) For the case of a single element in the set I, R' ≥ R can be shown easily due to (98). Therefore, θ_1 = (W_l', b_l')_{l=1}^L is a local minimum.
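The random-direction probe used to verify local minimality empirically (cf. Figures 2 and 3) can be sketched as follows, on a toy objective of our choosing rather than the CNN loss itself:

```python
import random

def loss(w):
    # toy convex surrogate with minimum at (1, -2); stands in for R(theta)
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def is_empirical_local_min(w, radius=1e-3, directions=200, seed=0):
    # sample unit directions around w and check the loss never decreases
    rng = random.Random(seed)
    base = loss(w)
    for _ in range(directions):
        d = [rng.gauss(0, 1) for _ in w]
        n = sum(x * x for x in d) ** 0.5
        probe = [wi + radius * di / n for wi, di in zip(w, d)]
        if loss(probe) < base - 1e-15:
            return False
    return True

print(is_empirical_local_min([1.0, -2.0]),   # at the minimum
      is_empirical_local_min([0.5, -2.0]))   # off the minimum
```

Such a probe gives evidence, not proof: it can only certify that no decrease was found among the sampled directions, which is exactly the role the 200-direction plots in Figure 3 play alongside the analytical argument.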

