IMPLICIT CONVEX REGULARIZERS OF CNN ARCHITECTURES: CONVEX OPTIMIZATION OF TWO- AND THREE-LAYER NETWORKS IN POLYNOMIAL TIME

Abstract

We study the training of Convolutional Neural Networks (CNNs) with ReLU activations and introduce exact convex optimization formulations with polynomial complexity with respect to the number of data samples, the number of neurons, and the data dimension. More specifically, we develop a convex analytic framework utilizing semi-infinite duality to obtain equivalent convex optimization problems for several two- and three-layer CNN architectures. We first prove that two-layer CNNs can be globally optimized via an ℓ2 norm regularized convex program. We then show that multi-layer circular CNN training problems with a single ReLU layer are equivalent to an ℓ1 regularized convex program that encourages sparsity in the spectral domain. We also extend these results to three-layer CNNs with two ReLU layers. Furthermore, we present extensions of our approach to different pooling methods, which elucidates the implicit architectural bias as convex regularizers.

1. INTRODUCTION

Convolutional Neural Networks (CNNs) have shown remarkable success across various machine learning problems (LeCun et al., 2015). However, our theoretical understanding of CNNs remains limited; the main challenge arises from the highly non-convex and nonlinear structure of CNNs with nonlinear activations such as ReLU. Hence, we study the training problem for various CNN architectures with ReLU activations and introduce equivalent finite-dimensional convex formulations that can be used to globally optimize these architectures. Our results characterize the role of the network architecture in terms of equivalent convex regularizers. Remarkably, we prove that the proposed methods run in polynomial time with respect to all problem parameters.

Convex neural network training was previously considered in Bengio et al. (2006); Bach (2017). However, these studies are restricted to two-layer fully connected networks with infinite width; thus, the optimization problem involves infinite-dimensional variables. Moreover, it has been shown that even adding a single neuron to a neural network leads to a non-convex optimization problem which cannot be solved efficiently (Bach, 2017). Furthermore, Parhi & Nowak (2019); Ergen & Pilanci (2020b;c;d); and Savarese et al. (2019) proved that the minimum ℓ2 norm two-layer network that perfectly fits a one-dimensional dataset outputs the linear spline interpolation. Moreover, Gunasekar et al. (2018) studied certain linear convolutional networks and revealed an implicit non-convex quasi-norm regularization. However, as the number of layers increases, the regularization approaches the ℓ0 quasi-norm, which is not computationally tractable. Recently, Pilanci & Ergen (2020) showed that two-layer CNNs with linear activations can be equivalently optimized as nuclear and ℓ1 norm regularized convex problems. Although the norm characterizations provided by these studies are insightful for future research, existing results remain quite restricted due to linear activations, simple settings, or intractable problems.
Table 1 summarizes the CNN architectures considered, e.g., Σ_{j,k} (X_k u_j)+ w_{1jk} + w_{2j}, Σ_j X U_j w_j, and Σ_j (Σ_l X_l U_{lj} w_{1j})+ w_{2j}, together with their implicit regularization, which ranges over the ℓ2² and ℓ2 norms, the nuclear norm ‖·‖*, the ℓ1 norm, and the Frobenius norm ‖·‖F.

Shallow CNNs and their representational power: Despite their relatively simple and shallow architecture, CNNs with two or three layers are very powerful and efficient models. Belilovsky et al. (2019) show that greedy training of two/three-layer CNNs can achieve performance comparable to deeper models, e.g., VGG-11 (Simonyan & Zisserman, 2014). However, a full theoretical understanding and an interpretable description of CNNs, even with a single hidden layer, is still lacking in the literature.

Our contributions: Our contributions can be summarized as follows:
• We develop convex programs to globally train CNNs that are polynomial time with respect to all input parameters: the number of samples, the data dimension, and the number of neurons. To the best of our knowledge, this is the first work characterizing polynomial-time trainability of non-convex CNN models. More importantly, we achieve this complexity with explicit and interpretable convex optimization problems. Consequently, training CNNs, especially in practice, can be further accelerated by leveraging the extensive tools available from convex optimization theory.
• Our work reveals a hidden regularization mechanism behind CNNs and characterizes how the architecture and pooling strategies, e.g., max pooling, average pooling, and flattening, dramatically alter the regularizer. As we show, ranging from the ℓ1 and ℓ2 norms to the nuclear norm (see Table 1 for details), ReLU CNNs exhibit an extremely rich and elegant regularization structure which is implicitly enforced by architectural choices. In convex optimization and signal processing, ℓ1, ℓ2, and nuclear norm regularization are well studied, and these structures have been applied in compressed sensing, inverse problems, and matrix completion. Our results shed light on unexplored and promising connections between ReLU CNNs and these established disciplines.
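To make the regularizers in Table 1 concrete, the following minimal NumPy sketch evaluates the three norm families on a generic weight matrix (the matrix `W` and its shape are purely illustrative, not taken from the paper):

```python
import numpy as np

# Hypothetical weight matrix standing in for a layer's parameters; the
# convex programs in the paper penalize such variables with different
# norms depending on the architecture and pooling strategy.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))

l1 = np.sum(np.abs(W))          # l1 norm: promotes entrywise sparsity
l2 = np.sqrt(np.sum(W ** 2))    # l2 / Frobenius norm: classical weight decay
# Nuclear norm (sum of singular values): promotes low-rank structure.
nuc = np.sum(np.linalg.svd(W, compute_uv=False))
```

Note that the nuclear norm always dominates the Frobenius norm of the same matrix, which is one reason it acts as a tighter convex surrogate for rank.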
Notation and preliminaries: We denote matrices/vectors as uppercase/lowercase bold letters, for which a subscript indicates a certain element/column. We use I_k for the identity matrix of size k. We denote the set of integers from 1 to n as [n]. To keep the presentation simple, we use a regression framework with scalar outputs and squared loss. However, all of our results extend to vector outputs and arbitrary convex regression and classification loss functions; we present these extensions in the Appendix. In our regression framework, we denote the input data matrix and the corresponding label vector as X ∈ R^{n×d} and y ∈ R^n, respectively. Moreover, we represent the patch matrices, i.e., subsets of columns extracted from X, as X_k ∈ R^{n×h}, k ∈ [K], where h denotes the filter size. With this notation, {X_k u}_{k=1}^K describes a convolution operation between the filter u ∈ R^h and the data matrix X. Throughout the paper, we use the ReLU activation function defined as (x)+ = max{0, x}. However, since CNN training problems with ReLUs are not convex in their conventional form, below we introduce an alternative formulation for this activation, which is crucial for our derivations.

Prior work (Pilanci & Ergen, 2020): Recently, Pilanci & Ergen (2020) introduced an exact convex formulation for training two-layer fully connected ReLU networks in polynomial time for training data X ∈ R^{n×d} of constant rank, where the model is a standard two-layer scalar-output network f_θ(X) := Σ_{j=1}^m (X u_j)+ α_j. However, this model has three main limitations. First, as noted by the authors, even though the algorithm is polynomial time, i.e., O(n^r) with r := rank(X), the complexity is exponential in r = d, i.e., O(n^d), if X is full rank. Second, as a direct consequence of their model, the analysis is limited to fully connected architectures.
Although they briefly analyzed some CNN architectures in their Section 4, as the authors emphasize, these are either fully linear (without ReLU) or separable over the patch index k as fully connected models, and thus do not correspond to the weight sharing used in classical CNN architectures in practice. Finally, their analysis does not extend to three-layer architectures with two ReLU layers, since the analysis of two ReLU layers is significantly more challenging. In contrast, we prove that classical CNN architectures can be globally optimized by standard convex solvers in polynomial time, independent of the rank of the data matrix.
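The patch-matrix notation above can be checked with a short NumPy sketch (all dimensions and variable names are illustrative): each X_k selects h consecutive columns of X, so the collection {X_k u} reproduces a stride-1 convolution (cross-correlation) of each row of X with the filter u, and the ReLU can be written via the indicator 1[x ≥ 0]:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 5, 8, 3                 # samples, data dimension, filter size
X = rng.standard_normal((n, d))
u = rng.standard_normal(h)

# Patch matrices X_k in R^{n x h}: consecutive column blocks of X.
K = d - h + 1                     # number of patches (stride 1, no padding)
patches = [X[:, k:k + h] for k in range(K)]

# {X_k u}_{k=1}^K stacks into the convolution of each row of X with u.
conv = np.stack([Xk @ u for Xk in patches], axis=1)   # shape (n, K)

# ReLU written with the 0-1 indicator 1[x >= 0]: diag-style masking
# that reproduces (x)+ = max{0, x} entrywise.
relu = (conv >= 0).astype(float) * conv
```

This masked form of the ReLU, where the 0-1 pattern is treated as a separate object, is the kind of reformulation the paper's convex derivations rely on.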



(Table 1 footnotes) The results on two-layer CNNs are presented in Appendix A.4. The multi-layer entry refers to an L-layer network with only one ReLU layer and circular convolutions.
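Since the multi-layer results involve circular convolutions, it may help to recall why the spectral domain appears in the corresponding ℓ1 regularized program: a circulant matrix built from a filter is diagonalized by the DFT, so circular convolution acts pointwise on Fourier coefficients. A minimal NumPy check (sizes and names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
u = rng.standard_normal(d)   # filter (zero-padded to full length in general)
x = rng.standard_normal(d)   # one input row

# Circulant matrix of u: column j is u circularly shifted by j, so that
# C[k, j] = u[(k - j) mod d] and C @ x is the circular convolution u * x.
C = np.stack([np.roll(u, j) for j in range(d)], axis=1)
y_time = C @ x

# The DFT diagonalizes C: circular convolution equals pointwise
# multiplication of Fourier coefficients.
y_freq = np.fft.ifft(np.fft.fft(u) * np.fft.fft(x)).real
```

The eigenvalues of C are exactly the DFT coefficients of u, which is why a penalty on the filter in the Fourier domain translates into sparsity "in the spectral domain" as stated in the abstract.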



Another line of research (Parhi & Nowak, 2019; Ergen & Pilanci, 2019; 2020a;b;c;d; Pilanci & Ergen, 2020; Savarese et al., 2019; Gunasekar et al., 2018; Maennel et al., 2018; Blanc et al., 2019; Zhang et al., 2016) focuses on the effect of implicit and explicit regularization in neural network training and aims to explain why the resulting network generalizes well.

Here, [n] denotes the set of integers from 1 to n. Moreover, ‖·‖F and ‖·‖* are the Frobenius and nuclear norms, and B_p := {u ∈ C^d : ‖u‖p ≤ 1} is the unit ℓp ball. We also use 1[x ≥ 0] as the 0-1 valued indicator function.

Table 1: CNN architectures and the corresponding norm regularization in our convex programs.

