SPURIOUS LOCAL MINIMA PROVABLY EXIST FOR DEEP CONVOLUTIONAL NEURAL NETWORKS

Abstract

In this paper, we prove that a general family of infinitely many spurious local minima exists in the loss landscape of deep convolutional neural networks with squared loss or cross-entropy loss. Our construction of spurious local minima is general and applies to practical datasets and CNNs containing two consecutive convolutional layers. We develop new techniques to overcome the challenges that convolutional layers pose for such constructions. We solve a combinatorial problem to show that data samples can always be differentiated somewhere in the feature maps. The empirical risk is then decreased by a perturbation of the network parameters that affects different samples in different ways. Although filters and biases are tied across each feature map, in our construction this perturbation only affects the output of a single ReLU neuron. We also give an example of a nontrivial spurious local minimum in which the different activation patterns of the samples are explicitly constructed. Experimental results verify our theoretical findings.



)). However, the existence of spurious local minima for deep CNNs caused by convolutions has never been proved mathematically before. In this paper, we prove that infinitely many spurious local minima exist in the loss landscape of deep CNNs with squared loss or cross-entropy loss. This is in contrast to the "no spurious local minima" property of deep linear networks. The construction of spurious local minima in this paper is general and applies to practical datasets and CNNs containing two consecutive convolutional layers, a condition satisfied by popular CNN architectures. The idea is to first construct a local minimum θ, and then construct another point θ′ in parameter space that has the same empirical risk as θ, while there exist regions around θ′ with lower empirical risk. However, the construction of spurious local minima for CNNs faces some technical challenges, and the construction for fully connected deep networks cannot be directly extended to CNNs. Our main contribution in this paper is to tackle these technical challenges. In the construction of spurious local minima for fully connected deep ReLU networks (He et al. (2020; 2021)), in order to construct θ and perturb around it, the data samples are split into groups according to the inputs of a specific ReLU neuron, so that each group behaves differently under a perturbation of the network parameters and a lower risk is produced. This technique relies on data splitting and parameter perturbation, and cannot be directly applied to CNNs due to the following difficulties. Every neuron in a CNN feature map has a limited receptive field that covers only part of the pixels of an input image (taking images as an example), and hence the inputs to a ReLU neuron can be identical even for distinct samples, making them hard to distinguish. This data-split issue is further complicated by the nonlinear ReLU activations, which truncate negative inputs, and the
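The receptive-field difficulty above can be made concrete with a minimal NumPy sketch (illustrative only, not the paper's construction; the helper `conv_neuron` and all array names are hypothetical): two globally distinct images that agree on the patch seen by one convolutional ReLU neuron produce identical pre-activations at that neuron, so this neuron alone cannot split the two samples.

```python
import numpy as np

def conv_neuron(image, kernel, bias, r, c):
    """Pre-activation of a single conv neuron whose receptive field is the
    k-by-k patch of `image` with top-left corner at (r, c)."""
    k = kernel.shape[0]
    patch = image[r:r + k, c:c + k]
    return float(np.sum(patch * kernel) + bias)

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3))  # one 3x3 filter, shared across the map
bias = 0.1                            # bias tied to the same feature map

# Two distinct 6x6 "images" that share only the top-left 3x3 patch.
shared_patch = rng.standard_normal((3, 3))
img_a = rng.standard_normal((6, 6))
img_b = rng.standard_normal((6, 6))
img_a[:3, :3] = shared_patch
img_b[:3, :3] = shared_patch

z_a = conv_neuron(img_a, kernel, bias, 0, 0)
z_b = conv_neuron(img_b, kernel, bias, 0, 0)

assert not np.allclose(img_a, img_b)  # the samples differ globally...
assert np.isclose(z_a, z_b)           # ...yet this neuron cannot tell them apart
relu_a, relu_b = max(z_a, 0.0), max(z_b, 0.0)  # ReLU outputs are identical too
```

Because the filter and bias are shared across the feature map, any perturbation of them moves this neuron's output identically for both samples, which is why the paper's construction must locate a position in some feature map where the samples can actually be differentiated.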



Convolutional neural networks (CNNs) (e.g. Lecun et al. (1998); Krizhevsky et al. (2012); Simonyan & Zisserman (2015); Szegedy et al. (2015); He et al. (2016); Huang et al. (2017)), one of the most important models in deep learning, have been successfully applied to many domains. Spurious local minima, whose losses are greater than that of the global minimum, play an important role in the training of deep CNNs and in the understanding of deep learning models. It is widely believed that spurious local minima exist in the loss landscape of CNNs, which is thought to be highly non-convex, as evidenced by experimental studies (e.g. Dauphin et al. (2014); Goodfellow et al. (2015); Liao & Poggio (2017); Freeman & Bruna (2017); Draxler et al. (2018); Garipov et al. (2018); Li et al. (2018); Mehmeti-Gopel et al. ( ); Ding et al. (2019); Goldblum et al. (2020); Liu et al. (