IMPLICIT CONVEX REGULARIZERS OF CNN ARCHITECTURES: CONVEX OPTIMIZATION OF TWO- AND THREE-LAYER NETWORKS IN POLYNOMIAL TIME

Abstract

We study training of Convolutional Neural Networks (CNNs) with ReLU activations and introduce exact convex optimization formulations with polynomial complexity with respect to the number of data samples, the number of neurons, and the data dimension. More specifically, we develop a convex analytic framework utilizing semi-infinite duality to obtain equivalent convex optimization problems for several two- and three-layer CNN architectures. We first prove that two-layer CNNs can be globally optimized via an ℓ2 norm regularized convex program. We then show that multi-layer circular CNN training problems with a single ReLU layer are equivalent to an ℓ1 regularized convex program that encourages sparsity in the spectral domain. We also extend these results to three-layer CNNs with two ReLU layers. Furthermore, we present extensions of our approach to different pooling methods, which elucidates the implicit architectural bias as convex regularizers.

1. INTRODUCTION

Convolutional Neural Networks (CNNs) have shown remarkable success across various machine learning problems (LeCun et al., 2015). However, our theoretical understanding of CNNs remains restricted; the main challenge arises from the highly non-convex and nonlinear structure of CNNs with nonlinear activations such as ReLU. Hence, we study the training problem for various CNN architectures with ReLU activations and introduce equivalent finite dimensional convex formulations that can be used to globally optimize these architectures. Our results characterize the role of the network architecture in terms of equivalent convex regularizers. Remarkably, we prove that the proposed methods are polynomial time with respect to all problem parameters. Convex neural network training was previously considered in Bengio et al. (2006); Bach (2017). However, these studies are restricted to two-layer fully connected networks with infinite width; thus, the optimization problem involves infinite dimensional variables.
Moreover, it has been shown that even adding a single neuron to a neural network leads to a non-convex optimization problem which cannot be solved efficiently (Bach, 2017) .



Shallow CNNs and their representational power: Despite their relatively simple and shallow architecture, CNNs with two/three layers are very powerful and efficient models. Belilovsky et al. (2019) show that greedy training of two/three layer CNNs can achieve performance comparable to deeper models, e.g., VGG-11 (Simonyan & Zisserman, 2014). However, a full theoretical understanding and an interpretable description of CNNs, even with a single hidden layer, is lacking in the literature.

Our contributions: Our contributions can be summarized as follows:

• We develop convex programs that are polynomial time with respect to all input parameters (the number of samples, the data dimension, and the number of neurons) to globally train CNNs. To the best of our knowledge, this is the first work characterizing polynomial time trainability of non-convex CNN models. More importantly, we achieve this complexity with explicit and interpretable convex optimization problems. Consequently, training CNNs, especially in practice, can be further accelerated by leveraging the extensive tools available from convex optimization theory.

• Our work reveals a hidden regularization mechanism behind CNNs and characterizes how the architecture and pooling strategies, e.g., max pooling, average pooling, and flattening, dramatically alter the regularizer. As we show, ranging from the ℓ1 and ℓ2 norms to the nuclear norm (see Table 1 for details), ReLU CNNs exhibit an extremely rich and elegant regularization structure that is implicitly enforced by architectural choices. In convex optimization and signal processing, ℓ1, ℓ2, and nuclear norm regularization are well studied, and these structures have been applied in compressed sensing, inverse problems, and matrix completion. Our results bring to light unexplored and promising connections between ReLU CNNs and these established disciplines.
Notation and preliminaries: We denote matrices/vectors by uppercase/lowercase bold letters, for which a subscript indicates a certain element/column. We use I_k for the identity matrix of size k and denote the set of integers from 1 to n by [n]. Moreover, ‖·‖_F and ‖·‖_* are the Frobenius and nuclear norms, and B_p := {u ∈ C^d : ‖u‖_p ≤ 1} is the unit ℓp ball. We also use 1[x ≥ 0] as an indicator. To keep the presentation simple, we use a regression framework with scalar outputs and squared loss. However, all of our results extend to vector outputs and to arbitrary convex regression and classification loss functions; we present these extensions in the Appendix. In our regression framework, we denote the input data matrix and the corresponding label vector by X ∈ R^{n×d} and y ∈ R^n, respectively. Moreover, we represent the patch matrices, i.e., subsets of columns extracted from X, as X_k ∈ R^{n×h}, k ∈ [K], where h denotes the filter size. With this notation, {X_k u}_{k=1}^K describes a convolution operation between the filter u ∈ R^h and the data matrix X. Throughout the paper, we use the ReLU activation function defined as (x)_+ = max{0, x}. However, since CNN training problems with ReLUs are not convex in their conventional form, below we introduce an alternative formulation for this activation, which will be crucial for our derivations.

Prior work (Pilanci & Ergen, 2020): Recently, Pilanci & Ergen (2020) introduced an exact convex formulation for training two-layer fully connected ReLU networks in polynomial time for training data X ∈ R^{n×d} of constant rank, where the model is a standard two-layer scalar output network f_θ(X) := Σ_{j=1}^m (X u_j)_+ α_j. However, this model has three main limitations. First, as noted by the authors, even though the algorithm is polynomial time, i.e., O(n^r) with r := rank(X), the complexity is exponential in r = d, i.e., O(n^d), if X is full rank.
Additionally, as a direct consequence of their model, the analysis is limited to fully connected architectures. Although they briefly analyzed some CNN architectures in Section 4, as emphasized by the authors, these are either fully linear (without ReLU) or separable over the patch index k as fully connected models, which do not correspond to the weight sharing in classical CNN architectures used in practice. Finally, their analysis does not extend to three-layer architectures with two ReLU layers, since the analysis of two ReLU layers is significantly more challenging. On the contrary, we prove that classical CNN architectures can be globally optimized by standard convex solvers in polynomial time independent of the rank of X (see Table 2). More importantly, we extend this analysis to three-layer CNNs with two ReLU layers to achieve polynomial time convex training, as proven in Theorem 4.1.

Table 2 (summary), for two-layer CNNs with average pooling, two-layer CNNs with max pooling, circular CNNs, and three-layer CNNs, respectively:
- # of constraints: 2nP_conv K, 2nP_conv K^2, 2nP_cconv, 2n(P_1 K + 1)P_2
- Complexity: O(h^3 r_c^3 (nK/r_c)^{3r_c}), O(h^3 r_c^3 (nK/r_c)^{3r_c}), O(d^3 r_cc^3 (n/r_cc)^{3r_cc}), O(d^3 m^3 r_c^3 (n/(m r_c))^{3m r_c})

1.1. HYPERPLANE ARRANGEMENTS

Let H be the set of all hyperplane arrangement patterns of X, defined as H := {sign(Xw) : w ∈ R^d}, which has finitely many elements, i.e., |H| ≤ N_H < ∞, N_H ∈ N. We now define a collection of sets that correspond to the positive signs of each element in H as S := { ∪_{i: h_i = 1} {i} : h ∈ H }. We first note that ReLU is an elementwise function that masks the negative entries of a vector or matrix. Hence, given a set S ∈ S, we define a diagonal mask matrix D(S) ∈ R^{n×n} as D(S)_{ii} := 1[i ∈ S]. Then, we have an alternative representation for ReLU: (Xw)_+ = D(S)Xw provided that D(S)Xw ≥ 0 and (I_n − D(S))Xw ≤ 0. Note that these two constraints can be compactly written as (2D(S) − I_n)Xw ≥ 0.
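The masked-linear representation of ReLU above can be checked numerically. Below is a minimal NumPy sketch (with a hypothetical random instance) verifying that (Xw)_+ = D(S)Xw and that the compact constraint (2D(S) − I_n)Xw ≥ 0 holds when S is the positive set of Xw.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.standard_normal((n, d))
w = rng.standard_normal(d)

# The positive set S and its diagonal mask D(S), with D(S)_ii = 1[(Xw)_i >= 0].
s = (X @ w >= 0)
D = np.diag(s.astype(float))

# ReLU as a masked linear map: (Xw)_+ = D(S) X w ...
relu = np.maximum(X @ w, 0.0)
masked = D @ X @ w
assert np.allclose(relu, masked)

# ... and the compact feasibility condition (2 D(S) - I_n) X w >= 0.
assert np.all((2 * D - np.eye(n)) @ X @ w >= -1e-12)
```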
If we denote the cardinality of S by P, i.e., the number of regions in a partition of R^d by hyperplanes that pass through the origin and are perpendicular to the rows of the data matrix X, then, with r := rank(X) ≤ min(n, d), P can be upper-bounded as follows (Ojha, 2000; Stanley et al., 2004; Winder, 1966; Cover, 1965) (see Appendix A.2 for details):

P ≤ 2 Σ_{k=0}^{r−1} (n−1 choose k) ≤ 2r (e(n−1)/r)^r.
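The bound above can be probed empirically: sampling many random directions w and collecting the distinct patterns 1[Xw ≥ 0] never produces more patterns than the bound allows. A small sketch on a hypothetical random X:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.standard_normal((n, d))  # generic, so r = rank(X) = 3

# Sample many directions w and collect the distinct patterns 1[Xw >= 0].
W = rng.standard_normal((200000, d))
signs = (W @ X.T) >= 0                       # each row is one sign pattern
patterns = set(map(tuple, signs))

# The number of observed patterns is at most 2 * sum_{k<r} C(n-1, k).
r = np.linalg.matrix_rank(X)
bound = 2 * sum(comb(n - 1, k) for k in range(r))
assert len(patterns) <= bound
```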

1.2. CONVOLUTIONAL HYPERPLANE ARRANGEMENTS

We now define a notion of hyperplane arrangements for CNNs, where we introduce the patch matrices {X_k}_{k=1}^K instead of directly operating on X. We first construct a new data matrix M = [X_1; X_2; . . . ; X_K] ∈ R^{nK×h}. We then define the convolutional hyperplane arrangements as the hyperplane arrangements of M and denote the cardinality of this set by P_conv. Then, we have

P_conv ≤ 2 Σ_{k=0}^{r_c−1} (nK−1 choose k) ≤ 2r_c (e(nK−1)/r_c)^{r_c},

where r_c := rank(M) ≤ h and K = (d − h)/stride + 1. Note that when the filter size h is fixed, P_conv is polynomial in n and d. Similarly, we consider hyperplane arrangements for circular CNNs followed by a linear pooling layer, i.e., XUw, where U ∈ R^{d×d} is a circulant matrix generated by the elements of u ∈ R^h. We then define the circular convolutional hyperplane arrangements and denote the cardinality of this set by P_cconv, which is exponential in the rank of the circular patch matrices, i.e., r_cc.

Remark 1.1. There exist P hyperplane arrangements of X, where P is exponential in r. Thus, if X is full rank, r = d, then P can be exponentially large in the dimension d. As we will show, this makes the training problem for fully connected networks challenging. On the other hand, for CNNs, the number of relevant hyperplane arrangements P_conv is exponential only in r_c. If M is full rank, then r_c = h ≪ d and accordingly P_conv ≪ P. This shows that the parameter sharing structure in CNNs enables a significant reduction in the number of possible hyperplane arrangements. Consequently, as shown in the sequel and in Table 2, our results imply that the complexity of the training problem is significantly lower than for fully connected networks.
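The construction of M and the fact that {X_k u}_k is an ordinary (valid) cross-correlation of each row of X with the filter u can be sketched in a few lines of NumPy (toy sizes, stride 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, stride = 5, 10, 3, 1
K = (d - h) // stride + 1
X = rng.standard_normal((n, d))
u = rng.standard_normal(h)

# Patch matrices X_k (columns k..k+h-1 of X) and the stacked matrix M.
patches = [X[:, k * stride : k * stride + h] for k in range(K)]
M = np.vstack(patches)                      # shape (nK, h)
assert M.shape == (n * K, h)
assert np.linalg.matrix_rank(M) <= h        # r_c <= h regardless of n and d

# {X_k u}_k implements a valid, stride-1 cross-correlation of each row with u.
conv = np.stack([X[:, k : k + h] @ u for k in range(K)], axis=1)
ref = np.stack([np.correlate(row, u, mode="valid") for row in X])
assert np.allclose(conv, ref)
```

Since M has only h columns, r_c ≤ h holds no matter how large n and d are, which is what drives the polynomial bound on P_conv.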

2.1. TWO-LAYER CNNS WITH AVERAGE POOLING

We first consider an architecture with m filters and average pooling, i.e., f_θ(X) := Σ_j Σ_k (X_k u_j)_+ w_j with parameters θ := {u_j, w_j}, and standard weight decay regularization, which can be trained via the following problem

p*_1 = min_{{u_j, w_j}_{j=1}^m} (1/2)‖ Σ_{j=1}^m Σ_{k=1}^K (X_k u_j)_+ w_j − y ‖_2^2 + (β/2) Σ_{j=1}^m ( ‖u_j‖_2^2 + w_j^2 ),  (1)

where u_j ∈ R^h and w ∈ R^m are the filter and output weights, respectively, and β > 0 is a regularization parameter. After a rescaling (see Appendix A.3), we obtain the following problem

p*_1 = min_{{u_j, w_j}_{j=1}^m : u_j ∈ B_2, ∀j} (1/2)‖ Σ_{j=1}^m Σ_{k=1}^K (X_k u_j)_+ w_j − y ‖_2^2 + β ‖w‖_1.  (2)

Then, taking the dual with respect to w and changing the order of min-max yields the weak dual

p*_1 ≥ d*_1 = max_v −(1/2)‖v − y‖_2^2 + (1/2)‖y‖_2^2  s.t.  max_{u ∈ B_2} | Σ_{k=1}^K v^T (X_k u)_+ | ≤ β,  (3)

which is a semi-infinite optimization problem whose dual can be obtained as a finite dimensional convex program using semi-infinite optimization theory (Goberna & López-Cerdá, 1998). The same dual also corresponds to the bidual of equation 1. Surprisingly, strong duality holds when m exceeds a threshold. Then, using strong duality, we characterize a set of optimal filter weights as the extreme points of the constraint in equation 3. Below, we use this characterization to derive an exact convex formulation for equation 1.

Theorem 2.1. Let m be such that m ≥ m* for some m* ∈ N, m* ≤ n + 1. Then strong duality holds for equation 3, i.e., p*_1 = d*_1, and the equivalent convex program for equation 1 is

min_{{c_i, c'_i}_{i=1}^{P_conv} : c_i, c'_i ∈ R^h, ∀i} (1/2)‖ Σ_{i=1}^{P_conv} Σ_{k=1}^K D(S_i^k) X_k (c_i − c'_i) − y ‖_2^2 + β Σ_{i=1}^{P_conv} ( ‖c_i‖_2 + ‖c'_i‖_2 )  (4)
s.t. (2D(S_i^k) − I_n) X_k c_i ≥ 0, (2D(S_i^k) − I_n) X_k c'_i ≥ 0, ∀i, k.
Moreover, an optimal solution to equation 1 with m* filters can be constructed as

(u*_{j1i}, w*_{j1i}) = ( c*_i / √‖c*_i‖_2 , √‖c*_i‖_2 )  if ‖c*_i‖_2 > 0,
(u*_{j2i}, w*_{j2i}) = ( c'*_i / √‖c'*_i‖_2 , −√‖c'*_i‖_2 )  if ‖c'*_i‖_2 > 0,

where {c*_i, c'*_i}_{i=1}^{P_conv} are optimal, m* := Σ_{i=1}^{P_conv} 1[‖c*_i‖_2 ≠ 0] + Σ_{i=1}^{P_conv} 1[‖c'*_i‖_2 ≠ 0], and j_{si} ∈ [|J_s|] given the definitions J_1 := {i_1 : ‖c_{i_1}‖_2 > 0} and J_2 := {i_2 : ‖c'_{i_2}‖_2 > 0}.

Therefore, we obtain a finite dimensional convex formulation with 2hP_conv variables and 2nP_conv K constraints for the non-convex problem in equation 1. Since P_conv is polynomial in n and d for fixed r_c ≤ h, equation 4 can be solved by a standard convex optimization solver in polynomial time.

Remark 2.1. Table 2 shows that for fixed rank r_c, or fixed filter size h, the complexity is polynomial in all problem parameters: n (number of samples), m (number of filters, i.e., neurons), and d (dimension). The filter size h is typically a small constant, e.g., h = 9 for 3 × 3 filters. We also note that for fixed n and rank(X) = d, the complexity for fully connected networks is exponential in d, which cannot be improved unless P = NP, even for m = 2 (Boob et al., 2018; Pilanci & Ergen, 2020). In contrast, this result shows that CNNs can be trained to global optimality with polynomial complexity as a convex program.

Interpreting non-convex CNNs as convex variable selection models: Interestingly, the non-convex problem in equation 1 is regularized by the sum of the squared ℓ2 norms of the weights (i.e., weight decay regularization), whereas the equivalent convex program in equation 4 is regularized by the sum of the (unsquared) ℓ2 norms of the weights. The latter regularizer is known as the group ℓ1 norm and is well studied in the context of sparse recovery and variable selection (Yuan & Lin, 2006; Meier et al., 2008). Hence, our convex program reveals an implicit variable selection mechanism in the original non-convex problem.
More specifically, the original features in X are mapped to a higher dimensional space via the convolutional hyperplane arrangements as {D(S_i^k) X_k}_{i=1}^{P_conv}, followed by a convex variable selection strategy using the group ℓ1 norm. Below, we show that this implicit regularization changes significantly with the CNN architecture and pooling strategy, and can range from the ℓ1 and ℓ2 norms to the nuclear norm.
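The group-level selection induced by the group ℓ1 norm is visible in its proximal operator, which either shrinks a whole block or zeroes it out entirely. A minimal sketch (the block values here are hypothetical):

```python
import numpy as np

def block_soft_threshold(c, tau):
    """Prox of tau * ||c||_2: shrinks the whole block toward zero and
    eliminates it entirely when ||c||_2 <= tau (group-level selection)."""
    norm = np.linalg.norm(c)
    if norm <= tau:
        return np.zeros_like(c)
    return (1.0 - tau / norm) * c

c = np.array([3.0, 4.0])                          # ||c||_2 = 5
kept = block_soft_threshold(c, 1.0)
assert np.allclose(kept, np.array([2.4, 3.2]))    # shrunk by (1 - 1/5)
dropped = block_soft_threshold(c, 6.0)
assert np.allclose(dropped, 0.0)                  # whole group eliminated
```

This all-or-nothing behavior over blocks is what discards entire arrangement features D(S_i^k)X_k in equation 4, mirroring neuron selection in the original non-convex model.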

2.2. TWO-LAYER CNNS WITH MAX POOLING

Here, we consider the architecture with max pooling, i.e., f_θ(X) = Σ_j maxpool({(X_k u_j)_+}_k) w_j, which is trained via

p*_1 = min_{{u_j, w_j}_{j=1}^m : u_j ∈ B_2, ∀j} (1/2)‖ Σ_{j=1}^m maxpool({(X_k u_j)_+}_{k=1}^K) w_j − y ‖_2^2 + β ‖w‖_1,  (5)

where maxpool(·) is the elementwise maximum over the patch index k. Then, taking the dual with respect to w and changing the order of min-max yields

p*_1 ≥ d*_1 = max_v −(1/2)‖v − y‖_2^2 + (1/2)‖y‖_2^2  s.t.  max_{u ∈ B_2} | v^T maxpool({(X_k u)_+}_{k=1}^K) | ≤ β.  (6)

Theorem 2.2. Let m be such that m ≥ m* for some m* ∈ N, m* ≤ n + 1. Then strong duality holds for equation 6, i.e., p*_1 = d*_1, and the equivalent convex program for equation 5 is

min_{{c_i, c'_i}_{i=1}^{P_conv} : c_i, c'_i ∈ R^h, ∀i} (1/2)‖ Σ_{i=1}^{P_conv} Σ_{k=1}^K D(S_i^k) X_k (c_i − c'_i) − y ‖_2^2 + β Σ_{i=1}^{P_conv} ( ‖c_i‖_2 + ‖c'_i‖_2 )  (7)
s.t. (2D(S_i^k) − I_n) X_k c_i ≥ 0, (2D(S_i^k) − I_n) X_k c'_i ≥ 0, ∀i, k,
     D(S_i^k) X_k c_i ≥ D(S_i^k) X_j c_i, D(S_i^k) X_k c'_i ≥ D(S_i^k) X_j c'_i, ∀i, j, k.

Moreover, an optimal solution to equation 5 can be constructed from equation 7 as in Theorem 2.1. We note that max pooling corresponds to the last two sets of linear constraints in the program above. Hence, max pooling can be interpreted as additional regularization, which constrains the parameters further.
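The role of the extra linear constraints can be seen concretely: for each sample there is a winning patch, and the pooled output equals the masked linear response of that patch, dominating all other patch responses. A toy NumPy sketch (random hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 4, 6, 3
K = d - h + 1
X = rng.standard_normal((n, d))
u = rng.standard_normal(h)

# Stack the K patch responses (X_k u)_+ and max-pool over the patch index k.
responses = np.stack([np.maximum(X[:, k : k + h] @ u, 0.0) for k in range(K)],
                     axis=1)                   # shape (n, K)
pooled = responses.max(axis=1)                 # shape (n,)

# At each sample i, the winning patch k*(i) attains the pooled value and
# dominates every other patch -- the domination inequalities in equation 7.
kstar = responses.argmax(axis=1)
winners = responses[np.arange(n), kstar]
assert np.allclose(pooled, winners)
assert np.all(winners[:, None] >= responses - 1e-12)
```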

3. MULTI-LAYER CIRCULAR CNNS

In this section, we first consider L-layer circular CNNs with L − 2 linear pooling layers before the ReLU, i.e., f_θ(X) = Σ_j ( X Π_{l=1}^{L−2} U_{lj} w_{1j} )_+ w_{2j}, which are trained via the following non-convex problem

p*_2 = min_{{ {u_{lj}}_{l=1}^{L−2}, w_{1j}, w_{2j} }_{j=1}^m : u_{lj} ∈ U_L, ∀l, j} (1/2)‖ Σ_{j=1}^m ( X Π_{l=1}^{L−2} U_{lj} w_{1j} )_+ w_{2j} − y ‖_2^2 + (β/2) Σ_{j=1}^m ( ‖w_{1j}‖_2^2 + w_{2j}^2 ),  (8)

where U_{lj} ∈ R^{d×d} is a circulant matrix generated from u_{lj} ∈ R^{h_l}, U_L := {(u_1, . . . , u_{L−2}) : u_l ∈ R^{h_l}, ∀l ∈ [L − 2]; Π_{l=1}^{L−2} ‖U_l‖_F^2 ≤ 1}, and we include the unit norm constraints w.l.o.g.

Theorem 3.1. Let m be such that m ≥ m* for some m* ∈ N, m* ≤ n + 1. Then strong duality holds for equation 8, i.e., p*_2 = d*_2, and the equivalent convex problem is

min_{{c_i, c'_i}_{i=1}^{P_cconv} : c_i, c'_i ∈ C^d, ∀i} (1/2)‖ Σ_{i=1}^{P_cconv} D(S_i) X̂ (c_i − c'_i) − y ‖_2^2 + β d^{(L−2)/2} Σ_{i=1}^{P_cconv} ( ‖c_i‖_1 + ‖c'_i‖_1 )  (9)
s.t. (2D(S_i) − I_n) X̂ c_i ≥ 0, (2D(S_i) − I_n) X̂ c'_i ≥ 0, ∀i,

where X̂ = XF and F ∈ C^{d×d} is the DFT matrix. Additionally, as in Theorem 2.1, we can construct an optimal solution to equation 8 from equation 9.

Remarkably, although the sum of the squared ℓ2 norms in the non-convex problem in equation 8 is the standard weight decay regularizer, the equivalent convex program in equation 9 is regularized by a sum of ℓ1 norms, which encourages sparsity in the spectral domain X̂. Thus, even with the simple choice of weight decay in the non-convex problem, the architectural choice of a CNN implicitly employs a more sophisticated regularizer that is revealed by our convex optimization approach. We further note that in the problem above, D(S_i) X̂ are the spectral features of a subset of data points which are separated by a hyperplane from all the other spectral features.
While such spectral features can be very predictive for images in many applications, we believe that our convex program also sheds light on the undesirable biases of CNNs, e.g., towards certain textures and low frequencies (Geirhos et al., 2018; Rahaman et al., 2019).
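The reason circular convolution acts coordinatewise in the spectral domain is the standard circulant diagonalization; a minimal NumPy check follows. Note one convention caveat: with NumPy's forward-DFT sign convention the identity reads U = F^H D F, which is the conjugate of the F D F^H convention used above; the diagonal D = diag(√d F u) is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 3
u = np.zeros(d)
u[:h] = rng.standard_normal(h)        # filter u, zero-padded to length d

# Circulant matrix U: column k is the circular shift of u by k (modulo d).
U = np.stack([np.roll(u, k) for k in range(d)], axis=1)

# Convolution theorem: U x = IDFT( DFT(u) * DFT(x) ), i.e. a circular CNN
# layer is coordinatewise multiplication in the spectral domain.
x = rng.standard_normal(d)
spectral = np.fft.ifft(np.fft.fft(u) * np.fft.fft(x)).real
assert np.allclose(U @ x, spectral)

# Equivalently, the unitary DFT matrix diagonalizes U with D = diag(sqrt(d) F u).
F = np.fft.fft(np.eye(d)) / np.sqrt(d)        # unitary DFT matrix
D = np.diag(np.sqrt(d) * (F @ u))
assert np.allclose(U, F.conj().T @ D @ F)     # NumPy convention: U = F^H D F
```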

4. THREE-LAYER CNNS WITH TWO RELU LAYERS

Here, we consider three-layer CNNs with two ReLU layers, with the following primal problem

p*_3 = min_{{u_j, w_{1j}, w_{2j}}_{j=1}^m : u_j ∈ B_2} (1/2)‖ Σ_{j=1}^m ( Σ_{k=1}^K (X_k u_j)_+ w_{1jk} )_+ w_{2j} − y ‖_2^2 + (β/2) Σ_{j=1}^m ( ‖w_{1j}‖_2^2 + w_{2j}^2 )  (10)

with f_θ(X) = Σ_j ( Σ_k (X_k u_j)_+ w_{1jk} )_+ w_{2j}, and the following convex equivalent.

Theorem 4.1. Let m be such that m ≥ m* for some m* ∈ N, m* ≤ n + 1. Then strong duality holds for equation 10, i.e., p*_3 = d*_3, and the equivalent convex problem is

min_{{c_{ijk}, c'_{ijk}}_{ijk} : c_{ijk}, c'_{ijk} ∈ R^h} (1/2)‖ Σ_{j=1}^{P_2} D(S_{2j}) Σ_{i=1}^{P_1} Σ_{k=1}^K I_{ijk} D(S_{1i}^k) X_k ( c_{ijk} − c'_{ijk} ) − y ‖_2^2 + β Σ_{i=1}^{P_1} Σ_{j=1}^{P_2} ( ‖C_{ij}‖_F + ‖C'_{ij}‖_F )  (11)
s.t. (2D(S_{2j}) − I_n) Σ_{i=1}^{P_1} Σ_{k=1}^K I_{ijk} D(S_{1i}^k) X_k c_{ijk} ≥ 0, (2D(S_{1i}^k) − I_n) X_k c_{ijk} ≥ 0, ∀i, j, k,
     (2D(S_{2j}) − I_n) Σ_{i=1}^{P_1} Σ_{k=1}^K I_{ijk} D(S_{1i}^k) X_k c'_{ijk} ≥ 0, (2D(S_{1i}^k) − I_n) X_k c'_{ijk} ≥ 0, ∀i, j, k,

where P_1 and P_2 are the numbers of hyperplane arrangements for the first and second layers, I_{ijk} ∈ {±1} are fixed signs that enumerate all possible sign patterns of the second layer weights, and C_{ij} = [c_{ij1} . . . c_{ijK}] (see Appendix A.10 for further details). It is interesting to note that, although the sum of the squared ℓ2 norms in the non-convex problem equation 10 is the standard weight decay regularizer, the equivalent convex program equation 11 is regularized by a sum of Frobenius norms that promotes matrix group sparsity, where the groups are over the patch indices. Note that this is similar to equation 4 except for an extra summation due to having one more ReLU layer. Therefore, adding more convolutional layers with ReLU implicitly regularizes for group sparsity over a richer hierarchical representation of the data via two consecutive hyperplane arrangements.

5. PROOF OF THE MAIN RESULT (THEOREM 2.1)

Here, we outline our proof technique for Theorem 2.1. We first focus on the single-sided constraint

max_{u ∈ B_2} Σ_{k=1}^K v^T (X_k u)_+ ≤ β,  (12)

where the maximization problem can be written as

max_{S^k ⊆ [n], S^k ∈ S} max_{u ∈ B_2} Σ_{k=1}^K v^T D(S^k) X_k u  s.t. (2D(S^k) − I_n) X_k u ≥ 0, ∀k.  (13)

Since the maximization is convex and strictly feasible for fixed D(S^k), equation 13 can be written as

max_{S^k ⊆ [n], S^k ∈ S} min_{α_k ≥ 0} max_{u ∈ B_2} Σ_{k=1}^K ( v^T D(S^k) X_k + α_k^T (2D(S^k) − I_n) X_k ) u
= max_{S^k ⊆ [n], S^k ∈ S} min_{α_k ≥ 0} ‖ Σ_{k=1}^K ( v^T D(S^k) X_k + α_k^T (2D(S^k) − I_n) X_k ) ‖_2.

We now enumerate all hyperplane arrangements and index them in an arbitrary order as S_i^1, . . . , S_i^K, where i ∈ [P_conv], P_conv = |S^K|, and S^K := {(S_i^1, . . . , S_i^K) : S_i^k ∈ S, ∀k, i}. Then,

equation 12 ⇐⇒ ∀i ∈ [P_conv], min_{α_k ≥ 0} ‖ Σ_{k=1}^K ( v^T D(S_i^k) X_k + α_k^T (2D(S_i^k) − I_n) X_k ) ‖_2 ≤ β
⇐⇒ ∀i ∈ [P_conv], ∃ α_{ik} ≥ 0 s.t. ‖ Σ_{k=1}^K ( v^T D(S_i^k) X_k + α_{ik}^T (2D(S_i^k) − I_n) X_k ) ‖_2 ≤ β.

We now use the same approach for the two-sided constraint in equation 3 to obtain

max_{v, α_{ik}, α'_{ik} ≥ 0} −(1/2)‖v − y‖_2^2 + (1/2)‖y‖_2^2  (14)
s.t. ‖ Σ_{k=1}^K ( v^T D(S_i^k) X_k + α_{ik}^T (2D(S_i^k) − I_n) X_k ) ‖_2 ≤ β,
     ‖ Σ_{k=1}^K ( −v^T D(S_i^k) X_k + α'^T_{ik} (2D(S_i^k) − I_n) X_k ) ‖_2 ≤ β, ∀i.

Note that this problem is convex and strictly feasible for v = α_{ik} = α'_{ik} = 0. Therefore, Slater's condition and consequently strong duality holds, and equation 14 can be written as

min_{λ_i, λ'_i ≥ 0} max_{v, α_{ik}, α'_{ik} ≥ 0} −(1/2)‖v − y‖_2^2 + (1/2)‖y‖_2^2
+ Σ_{i=1}^{P_conv} λ_i ( β − ‖ Σ_{k=1}^K ( v^T D(S_i^k) X_k + α_{ik}^T (2D(S_i^k) − I_n) X_k ) ‖_2 )
+ Σ_{i=1}^{P_conv} λ'_i ( β − ‖ Σ_{k=1}^K ( −v^T D(S_i^k) X_k + α'^T_{ik} (2D(S_i^k) − I_n) X_k ) ‖_2 ).  (15)

Next, we introduce new variables z_i, z'_i ∈ R^h.
Then, by Sion's minimax theorem (Sion, 1958), we change the order of the inner max-min as follows

min_{λ_i, λ'_i ≥ 0} min_{z_i, z'_i ∈ B_2} max_{v, α_{ik}, α'_{ik} ≥ 0} −(1/2)‖v − y‖_2^2 + (1/2)‖y‖_2^2
+ Σ_{i=1}^{P_conv} λ_i ( β + Σ_{k=1}^K ( v^T D(S_i^k) X_k + α_{ik}^T (2D(S_i^k) − I_n) X_k ) z_i )
+ Σ_{i=1}^{P_conv} λ'_i ( β + Σ_{k=1}^K ( −v^T D(S_i^k) X_k + α'^T_{ik} (2D(S_i^k) − I_n) X_k ) z'_i ).  (16)

We now compute the maximum with respect to v, α_{ik}, α'_{ik} analytically, which yields

min_{λ_i, λ'_i ≥ 0} min_{z_i, z'_i ∈ B_2} (1/2)‖ Σ_{i=1}^{P_conv} Σ_{k=1}^K D(S_i^k) X_k (λ_i z_i − λ'_i z'_i) − y ‖_2^2 + β Σ_{i=1}^{P_conv} (λ_i + λ'_i)  (17)
s.t. (2D(S_i^k) − I_n) X_k z_i ≥ 0, (2D(S_i^k) − I_n) X_k z'_i ≥ 0, ∀i, k.

Then, we apply a change of variables and define c_i = λ_i z_i and c'_i = λ'_i z'_i to obtain

min_{c_i, c'_i ∈ R^h} (1/2)‖ Σ_{i=1}^{P_conv} Σ_{k=1}^K D(S_i^k) X_k (c_i − c'_i) − y ‖_2^2 + β Σ_{i=1}^{P_conv} ( ‖c_i‖_2 + ‖c'_i‖_2 )  (18)
s.t. (2D(S_i^k) − I_n) X_k c_i ≥ 0, (2D(S_i^k) − I_n) X_k c'_i ≥ 0, ∀i, k,

since λ_i = ‖c_i‖_2 and λ'_i = ‖c'_i‖_2 are feasible and optimal. Then, using the prescribed {u*_j, w*_j}_{j=1}^{m*}, we evaluate the non-convex objective in equation 1 as

p*_1 ≤ (1/2)‖ Σ_{j=1}^{m*} Σ_{k=1}^K (X_k u*_j)_+ w*_j − y ‖_2^2 + (β/2) Σ_{i: c*_i ≠ 0} ( ‖ c*_i / √‖c*_i‖_2 ‖_2^2 + ( √‖c*_i‖_2 )^2 ) + (β/2) Σ_{i: c'*_i ≠ 0} ( ‖ c'*_i / √‖c'*_i‖_2 ‖_2^2 + ( √‖c'*_i‖_2 )^2 ),

which has the same objective value as equation 18. Since strong duality holds for the convex program, p*_1 = d*_1, which equals the value of equation 18 achieved by the prescribed parameters.

6. NUMERICAL EXPERIMENTS

In this section, we present numerical experiments to verify our claims. We first consider a synthetic dataset, where (n, d) = (6, 20), X ∈ R^{6×20} is generated using a multivariate normal distribution with zero mean and identity covariance, and y = [1, −1, 1, −1, −1, 1]^T. We then train the three-layer circular CNN model in equation 8 using SGD and the convex program in equation 9. In Figure 1, we plot the regularized objective value against computation time for 5 independent realizations of SGD. We also plot both the non-convex objective in equation 8 and the convex objective in equation 9 for our convex program, where the prescribed optimal parameters are used to convert the solution of the convex program back to the original non-convex CNN architecture (see Appendix A.9). In Figure 1a, we use 5 filters with h = 3 and stride 1, where only one trial converges to the optimal objective value achieved by both our convex program and the feasible network. As m increases, all trials converge to the optimal objective value in Figure 1b. We also evaluate the same model on subsets of MNIST (LeCun) and CIFAR-10 (Krizhevsky et al., 2014) for binary classification. For MNIST, we randomly subsample the dataset and select (n, d, m, h, stride) = (99, 50, 20, 3, 1) with a batch size of 10 for SGD. Similarly, for CIFAR-10, we select (n, d, m, h, stride) = (99, 50, 40, 3, 1) and a batch size of 10 for SGD. In Figure 2, we plot the regularized objective values in equation 8 and equation 9, and the corresponding test accuracies, against computation time. Since the number of filters is large enough, all SGD trials converge to the optimal value provided by our convex program.

7. CONCLUDING REMARKS

We studied various non-convex CNN training problems and introduced exact finite dimensional convex programs. In particular, we provided equivalent convex characterizations of ReLU CNN architectures in a higher dimensional space. Unlike previous studies, we prove that these equivalent characterizations have polynomial complexity in all input parameters and can be globally optimized via convex optimization solvers. Furthermore, we show that depending on the type of CNN architecture, the equivalent convex programs exhibit different norm regularization structures, e.g., ℓ1, ℓ2, and nuclear norm. Thus, we claim that the implicit regularization phenomenon in modern neural network architectures can be precisely characterized in terms of convex regularizers. Extending our results to deeper networks is therefore a promising direction. We also conjecture that the proposed convex approach can be used to analyze popular heuristic techniques for training modern deep learning architectures. For example, after our work, Ergen et al. (2021) studied batch normalization through our convex framework and revealed an implicit patchwise whitening effect. Similarly, Sahiner et al. (2021) extended our model to vector outputs. More importantly, in light of our results, efficient optimization algorithms can be developed to exactly (or approximately) optimize deep CNN architectures in large scale experiments in practice, which is left for future research.

A APPENDIX

In this section, we present additional materials and proofs of the main results that are not included in the main paper due to the page limit.

A.1 ADDITIONAL NUMERICAL RESULTS

Here, we present additional numerical experiments to further verify our theory. We first perform an experiment with another synthetic dataset, where X ∈ R^{6×15} is generated using a multivariate normal distribution with zero mean and identity covariance, and y = [1, −1, 1, 1, 1, −1]^T. In this case, we use the two-layer CNN model in equation 1 and the corresponding convex program in equation 4. In Figure 3, we run the experiment with m = 5, 8, 15 filters of size h = 10 and stride 5, where we observe that as the number of filters increases, the fraction of trials converging to the optimal objective value increases as well. In order to apply our convex approach in Theorem 2.1 to larger scale experiments, we now introduce an unconstrained version of the convex program in equation 4:

min_{{c_i, c'_i}_{i=1}^{P_conv} : c_i, c'_i ∈ R^h, ∀i} (1/2)‖ Σ_{i=1}^{P_conv} Σ_{k=1}^K D(S_i^k) X_k (c_i − c'_i) − y ‖_2^2 + β Σ_{i=1}^{P_conv} ( ‖c_i‖_2 + ‖c'_i‖_2 )  (19)
+ ρ 1^T Σ_{i=1}^{P_conv} Σ_{k=1}^K ( ( −(2D(S_i^k) − I_n) X_k c_i )_+ + ( −(2D(S_i^k) − I_n) X_k c'_i )_+ ),

where ρ > 0 is a trade-off parameter. Since the problem in equation 19 is in unconstrained form, we can directly optimize its parameters using conventional algorithms such as SGD. Hence, we use PyTorch to optimize the parameters of a two-layer CNN architecture using both the non-convex objective in equation 1 and the convex objective in equation 19, where we use the full CIFAR-10 dataset for binary classification, i.e., (n, d) = (10000, 3072). In Figure 4, we provide the training objective and the test accuracy of each approach as a function of the number of epochs. Here, we observe that optimizing the convex formulation achieves a lower training objective and a higher test accuracy than classical optimization of the non-convex problem.
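The unconstrained objective in equation 19 is straightforward to evaluate with automatic differentiation frameworks; a minimal NumPy sketch is below. The masks and patch matrices here are random placeholders for a hypothetical toy instance, not actual convolutional hyperplane arrangements.

```python
import numpy as np

def penalized_objective(X_patches, masks, c, c_prime, y, beta, rho):
    """Sketch of the unconstrained convex objective (equation 19):
    least-squares fit + group-l1 penalty + ReLU penalty on violations of
    the cone constraints (2 D(S_i^k) - I_n) X_k c_i >= 0."""
    fit = np.zeros(y.shape[0])
    penalty = 0.0
    for i, Di in enumerate(masks):              # Di[k] = diagonal of D(S_i^k)
        for k, Xk in enumerate(X_patches):
            A = Di[k][:, None] * Xk             # D(S_i^k) X_k
            B = (2 * Di[k] - 1)[:, None] * Xk   # (2 D(S_i^k) - I_n) X_k
            fit += A @ (c[i] - c_prime[i])
            penalty += rho * (np.maximum(-B @ c[i], 0).sum()
                              + np.maximum(-B @ c_prime[i], 0).sum())
        penalty += beta * (np.linalg.norm(c[i]) + np.linalg.norm(c_prime[i]))
    return 0.5 * np.sum((fit - y) ** 2) + penalty

# Sanity check on a hypothetical instance: at zero weights the objective
# reduces to 0.5 * ||y||^2.
rng = np.random.default_rng(0)
n, h, K, P = 4, 2, 3, 2
X_patches = [rng.standard_normal((n, h)) for _ in range(K)]
masks = [rng.integers(0, 2, size=(K, n)).astype(float) for _ in range(P)]
c, c_prime = np.zeros((P, h)), np.zeros((P, h))
y = rng.standard_normal(n)
val = penalized_objective(X_patches, masks, c, c_prime, y, beta=0.1, rho=1.0)
assert np.isclose(val, 0.5 * np.sum(y ** 2))
```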

A.2 CONSTRUCTING HYPERPLANE ARRANGEMENTS IN POLYNOMIAL TIME

In this section, we discuss the number of distinct hyperplane arrangements, i.e., P, and present an algorithm that enumerates all distinct arrangements in polynomial time. We first consider the number of all distinct sign patterns sign(Xw) over w ∈ R^d. This number corresponds to the number of regions in a partition of R^d by hyperplanes passing through the origin. Consider an arrangement of n hyperplanes in R^r, where n ≥ r, and denote the number of regions in this arrangement by P_{n,r}. In Ojha (2000); Cover (1965), it is shown that this number satisfies

P_{n,r} ≤ 2 Σ_{k=0}^{r−1} (n−1 choose k).

For hyperplanes in general position, the above inequality is in fact an equality. In Edelsbrunner et al. (1986), the authors present an algorithm that enumerates all possible hyperplane arrangements in O(n^r) time, which can be used to construct the data for the convex programs we present throughout the paper.

Published as a conference paper at ICLR 2021

A.3 EQUIVALENCE OF THE 1 PENALIZED OBJECTIVES

In this section, we prove the equivalence between the original problems with ℓ2 regularization and their ℓ1 penalized versions. We also note that similar equivalence results were presented in Savarese et al. (2019); Neyshabur et al. (2014); Ergen & Pilanci (2019; 2020c;d). We start with the equivalence between equation 1 and equation 2.

Lemma A.1. The following two problems are equivalent:

min_{{u_j, w_j}_{j=1}^m} (1/2)‖ Σ_{j=1}^m Σ_{k=1}^K (X_k u_j)_+ w_j − y ‖_2^2 + (β/2) Σ_{j=1}^m ( ‖u_j‖_2^2 + w_j^2 )
= min_{{u_j, w_j}_{j=1}^m : u_j ∈ B_2, ∀j} (1/2)‖ Σ_{j=1}^m Σ_{k=1}^K (X_k u_j)_+ w_j − y ‖_2^2 + β ‖w‖_1.

Proof of Lemma A.1. We rescale the parameters as ū_j = γ_j u_j and w̄_j = w_j / γ_j for any γ_j > 0. Then, the output becomes

Σ_{j=1}^m Σ_{k=1}^K (X_k ū_j)_+ w̄_j = Σ_{j=1}^m Σ_{k=1}^K (X_k u_j γ_j)_+ (w_j / γ_j) = Σ_{j=1}^m Σ_{k=1}^K (X_k u_j)_+ w_j,

which proves that the rescaling does not change the network output. In addition, we have the basic inequality

(1/2) Σ_{j=1}^m ( ‖u_j‖_2^2 + w_j^2 ) ≥ Σ_{j=1}^m |w_j| ‖u_j‖_2,

where equality is achieved with the scaling choice γ_j = ( |w_j| / ‖u_j‖_2 )^{1/2}. Since the rescaling does not change the right-hand side of the inequality, we can set ‖u_j‖_2 = 1, ∀j, so that the right-hand side becomes ‖w‖_1. Now, consider a modified version of the problem where the unit norm equality constraint is relaxed to ‖u_j‖_2 ≤ 1, and suppose that for a certain index j an optimal solution satisfies ‖u_j‖_2 < 1 with w_j ≠ 0. Then the unit norm inequality constraint is not active for u_j, and hence removing this constraint does not change the optimal solution. However, once the constraint is removed, letting ‖u_j‖_2 → ∞ (while rescaling w_j to keep the output fixed) strictly reduces the objective, since it drives w_j → 0. This is a contradiction, which proves that the constraint corresponding to any nonzero w_j must be active at an optimal solution. This also shows that replacing ‖u_j‖_2 = 1 with ‖u_j‖_2 ≤ 1 does not change the solution of the problem.
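The scaling argument in Lemma A.1 is an AM-GM step that can be verified numerically on a hypothetical (u, w) pair: the inequality holds for any scale, equality is attained at γ* = (|w|/‖u‖_2)^{1/2}, and the rescaling leaves the product (hence the network output) unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(5)
w = 2.7

# AM-GM: (1/2)(||u||^2 + w^2) >= |w| * ||u||_2 ...
lhs = 0.5 * (np.linalg.norm(u) ** 2 + w ** 2)
rhs = abs(w) * np.linalg.norm(u)
assert lhs >= rhs

# ... with equality after rescaling at gamma* = (|w| / ||u||_2)^{1/2}.
gamma = (abs(w) / np.linalg.norm(u)) ** 0.5
u_bar, w_bar = gamma * u, w / gamma
assert np.isclose(0.5 * (np.linalg.norm(u_bar) ** 2 + w_bar ** 2),
                  abs(w_bar) * np.linalg.norm(u_bar))

# The product u * w (hence the network output) is unchanged by the rescaling.
assert np.allclose(u_bar * w_bar, u * w)
```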
Next, we prove the equivalence between equation 8 for L = 3 and equation 30.

Lemma A.2. The following two problems are equivalent:

min_{{u_j, w_{1j}, w_{2j}}_{j=1}^m : u_j ∈ B_2, ∀j} (1/2)‖ Σ_{j=1}^m (X U_j w_{1j})_+ w_{2j} − y ‖_2^2 + (β/2) Σ_{j=1}^m ( ‖w_{1j}‖_2^2 + w_{2j}^2 )
= min_{{u_j, w_{1j}, w_{2j}}_{j=1}^m : u_j, w_{1j} ∈ B_2, ∀j} (1/2)‖ Σ_{j=1}^m (X U_j w_{1j})_+ w_{2j} − y ‖_2^2 + β ‖w_2‖_1.

Proof of Lemma A.2. We rescale the parameters as w̄_{1j} = γ_j w_{1j} and w̄_{2j} = w_{2j} / γ_j for any γ_j > 0. Then, the output becomes

Σ_{j=1}^m (X U_j w̄_{1j})_+ w̄_{2j} = Σ_{j=1}^m (X U_j w_{1j} γ_j)_+ (w_{2j} / γ_j) = Σ_{j=1}^m (X U_j w_{1j})_+ w_{2j},

which proves that the rescaling does not change the network output. In addition, we have the basic inequality

(1/2) Σ_{j=1}^m ( ‖w_{1j}‖_2^2 + w_{2j}^2 ) ≥ Σ_{j=1}^m ‖w_{1j}‖_2 |w_{2j}|,

where equality is achieved with the scaling choice γ_j = ( |w_{2j}| / ‖w_{1j}‖_2 )^{1/2}. Since the rescaling does not change the right-hand side of the inequality, we can set ‖w_{1j}‖_2 = 1, ∀j, so that the right-hand side becomes ‖w_2‖_1. The rest of the proof directly follows the proof of Lemma A.1.

A.4 TWO-LAYER LINEAR CNNS

We now consider two-layer linear CNNs, for which the training problem is

min_{{u_j, w_j}_{j=1}^m} (1/2)‖ Σ_{k=1}^K Σ_{j=1}^m X_k u_j w_{jk} − y ‖_2^2 + (β/2) Σ_{j=1}^m ( ‖u_j‖_2^2 + ‖w_j‖_2^2 ).  (20)

Theorem A.1 (Pilanci & Ergen, 2020). The equivalent convex program for equation 20 is

min_{{z_k}_{k=1}^K : z_k ∈ R^h} (1/2)‖ Σ_{k=1}^K X_k z_k − y ‖_2^2 + β ‖ [z_1, . . . , z_K] ‖_*.  (21)

Proof of Theorem A.1. We first apply a rescaling (as in Lemma A.1) to the primal problem in equation 20:

min_{{u_j, w_j}_{j=1}^m : u_j ∈ B_2} (1/2)‖ Σ_{k=1}^K Σ_{j=1}^m X_k u_j w_{jk} − y ‖_2^2 + β Σ_{j=1}^m ‖w_j‖_2.

Then, taking the dual with respect to the output layer weights w_j yields

max_v −(1/2)‖v − y‖_2^2 + (1/2)‖y‖_2^2  s.t.  max_{u ∈ B_2} ( Σ_{k=1}^K (v^T X_k u)^2 )^{1/2} ≤ β.  (22)

We then reparameterize this problem as

max_{M, v} −(1/2)‖v − y‖_2^2 + (1/2)‖y‖_2^2  s.t.  σ_max(M) ≤ β, M = [X_1^T v . . . X_K^T v],

where σ_max(M) denotes the maximum singular value of M. The Lagrangian is

L(λ, Z, M, v) = −(1/2)‖v − y‖_2^2 + (1/2)‖y‖_2^2 + λ(β − σ_max(M)) + trace(Z^T M) − trace(Z^T [X_1^T v . . . X_K^T v])
= −(1/2)‖v − y‖_2^2 + (1/2)‖y‖_2^2 + λ(β − σ_max(M)) + trace(Z^T M) − v^T Σ_{k=1}^K X_k z_k,

where λ ≥ 0. Then, maximizing over M and v yields the dual form

min_{{z_k}_{k=1}^K : z_k ∈ R^h} (1/2)‖ Σ_{k=1}^K X_k z_k − y ‖_2^2 + β ‖ [z_1 . . . z_K] ‖_*,

where ‖[z_1 . . . z_K]‖_* = ‖Z‖_* = Σ_i σ_i(Z) is the ℓ1 norm of the singular values, i.e., the nuclear norm (Recht et al., 2010).

The regularized training problem for two-layer circular CNNs is

min_{{u_j, w_j}_{j=1}^m} (1/2)‖ Σ_{j=1}^m X U_j w_j − y ‖_2^2 + (β/2) Σ_{j=1}^m ( ‖u_j‖_2^2 + ‖w_j‖_2^2 ),  (23)

where U_j ∈ R^{d×d} is a circulant matrix generated by circular shifts modulo d of u_j ∈ R^h.

Theorem A.2 (Pilanci & Ergen, 2020). The equivalent convex program for equation 23 is

min_{z ∈ C^d} (1/2)‖ X̂ z − y ‖_2^2 + (β/√d) ‖z‖_1,  (24)

where X̂ = XF and F ∈ C^{d×d} is the DFT matrix.

Proof of Theorem A.2.
We first apply a rescaling (as in Lemma A.1) to the primal problem in equation 23:
$$\min_{\substack{\{u_j,w_j\}_{j=1}^m\\u_j\in\mathcal{B}_2}}\frac{1}{2}\Big\|\sum_{j=1}^mXU_jw_j-y\Big\|_2^2+\beta\sum_{j=1}^m\|w_j\|_2,$$
and then taking the dual with respect to the output layer weights $w_j$ yields
$$\max_{v}-\frac{1}{2}\|v-y\|_2^2+\frac{1}{2}\|y\|_2^2\quad\text{s.t.}\quad\max_{D\in\mathcal{D}}\big\|v^TXFDF^H\big\|_2\le\beta,$$
where $\mathcal{D}:=\{D\text{ diagonal}:\|D\|_F^2\le d\}$. In the problem above, we use the eigenvalue decomposition $U=FDF^H$, where $F\in\mathbb{C}^{d\times d}$ is the DFT matrix and $D\in\mathbb{C}^{d\times d}$ is the diagonal matrix $D:=\mathrm{diag}(\sqrt{d}Fu)$. We also note that the unit norm constraint in the primal problem, i.e., $u_j\in\mathcal{B}_2$, is equivalent to $D_j\in\mathcal{D}$, since $D_j=\mathrm{diag}(\sqrt{d}Fu_j)$ and $\|D_j\|_F^2=d\|u_j\|_2^2$ due to the properties of circulant matrices. Now, defining the variable change $\tilde{X}=XF$, the problem above can be equivalently written as
$$\max_{v}-\frac{1}{2}\|v-y\|_2^2+\frac{1}{2}\|y\|_2^2\quad\text{s.t.}\quad\max_{D\in\mathcal{D}}\big\|v^T\tilde{X}D\big\|_2\le\beta.$$
Since $D$ is a diagonal matrix with a norm constraint on its diagonal entries, for an arbitrary vector $s\in\mathbb{C}^d$ we have
$$\big\|s^TD\big\|_2=\Big(\sum_{i=1}^d|s_i|^2|D_{ii}|^2\Big)^{1/2}\le s_{\max}\Big(\sum_{i=1}^d|D_{ii}|^2\Big)^{1/2}=s_{\max}\sqrt{d},$$
where $s_{\max}:=\max_i|s_i|$. If we denote the maximizing index as $i_{\max}:=\arg\max_i|s_i|$, then the upper bound is achieved by
$$D_{ii}=\begin{cases}\sqrt{d}&\text{if }i=i_{\max},\\0&\text{otherwise.}\end{cases}$$
Using this observation, the problem above can be further simplified to
$$\max_{v}-\frac{1}{2}\|v-y\|_2^2+\frac{1}{2}\|y\|_2^2\quad\text{s.t.}\quad\big\|v^T\tilde{X}\big\|_\infty\le\frac{\beta}{\sqrt{d}}.$$
Then, taking the dual of this problem gives
$$\min_{z\in\mathbb{C}^d}\frac{1}{2}\big\|\tilde{X}z-y\big\|_2^2+\beta\sqrt{d}\,\|z\|_1.$$
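The diagonalization of circulant matrices by the DFT used above can be verified numerically. The sketch below uses numpy's DFT sign convention, under which the factorization reads $U=F^HDF$ with $D=\mathrm{diag}(\sqrt{d}Fu)$; the paper's $U=FDF^H$ matches up to conjugating $F$. The Parseval identity $\|D\|_F^2=d\|u\|_2^2$ also appears in the constraint equivalence.

```python
import numpy as np

rng = np.random.default_rng(1)
d, h = 8, 3

# Filter u of size h, zero-padded to length d; U is the circulant matrix
# whose columns are the circular shifts of the padded filter.
u = np.zeros(d)
u[:h] = rng.standard_normal(h)
U = np.column_stack([np.roll(u, k) for k in range(d)])

# Unitary DFT matrix (numpy's sign convention).
F = np.fft.fft(np.eye(d)) / np.sqrt(d)
D = np.diag(np.sqrt(d) * F @ u)

# Circulant matrices are diagonalized by the DFT: U = F^H D F here.
assert np.allclose(U, F.conj().T @ D @ F)

# Parseval: ||D||_F^2 = d ||u||_2^2, so u in the unit l2 ball <=> ||D||_F^2 <= d.
assert np.isclose(np.linalg.norm(D, 'fro')**2, d * np.linalg.norm(u)**2)
```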

A.5 EXTENSIONS TO VECTOR OUTPUTS

Here, we present the extension of our approach to vector outputs. To keep the notation and presentation simple, we consider the vector-output version of the two-layer linear CNN model in Section A.4. The training problem is
$$\min_{\{u_j,\{w_{jk}\}_{k=1}^K\}_{j=1}^m}\frac{1}{2}\Big\|\sum_{k=1}^K\sum_{j=1}^mX_ku_jw_{jk}^T-Y\Big\|_F^2+\frac{\beta}{2}\sum_{j=1}^m\Big(\|u_j\|_2^2+\sum_{k=1}^K\|w_{jk}\|_2^2\Big).$$
The corresponding dual problem is given by
$$\max_{V}-\frac{1}{2}\|V-Y\|_F^2+\frac{1}{2}\|Y\|_F^2\quad\text{s.t.}\quad\max_{u\in\mathcal{B}_2}\sum_{k=1}^K\big\|V^TX_ku\big\|_2^2\le\beta.$$
The maximizers of the dual constraint are the maximal eigenvectors of $\sum_{k=1}^KX_k^TVV^TX_k$, which are the optimal filters. We now focus on the dual constraint as in the proof of Theorem 2.1:
$$\max_{u\in\mathcal{B}_2}\sum_{k=1}^K\big\|V^TX_ku\big\|_2^2=\max_{u,s,g_k\in\mathcal{B}_2}\sum_{k=1}^Ks_kg_k^TV^TX_ku=\max_{u,s,g_k\in\mathcal{B}_2}\sum_{k=1}^Ks_k\big\langle V,X_kug_k^T\big\rangle=\max_{\substack{\|G_k\|_*\le1\\s\in\mathcal{B}_2}}\sum_{k=1}^Ks_k\big\langle V,X_kG_k\big\rangle=\max_{\substack{\|G_k\|_*\le s_k\\s\in\mathcal{B}_2}}\sum_{k=1}^K\big\langle V,X_kG_k\big\rangle.$$
Then, the rest of the derivation directly follows Section A.4.

A.6 EXTENSIONS TO ARBITRARY CONVEX LOSS FUNCTIONS

In this section, we first show how to construct an optimal standard CNN architecture from the optimal weights provided by the convex program. Then, we extend our derivations to arbitrary convex loss functions. To keep our derivations simple and clear, we use the regularized two-layer architecture in equation 1. For a given convex loss function $\mathcal{L}(\cdot,y)$, the regularized training problem can be stated as
$$p_1^*=\min_{\{u_j,w_j\}_{j=1}^m}\mathcal{L}\Big(\sum_{j=1}^m\sum_{k=1}^K(X_ku_j)_+w_j,\,y\Big)+\frac{\beta}{2}\sum_{j=1}^m\big(\|u_j\|_2^2+w_j^2\big). \tag{25}$$
Then, the corresponding finite-dimensional convex equivalent is
$$\min_{\substack{\{c_i,c_i'\}_{i=1}^{P_{\mathrm{conv}}}\\c_i,c_i'\in\mathbb{R}^h,\forall i}}\mathcal{L}\Big(\sum_{i=1}^{P_{\mathrm{conv}}}\sum_{k=1}^KD(S_i^k)X_k(c_i-c_i'),\,y\Big)+\beta\sum_{i=1}^{P_{\mathrm{conv}}}\big(\|c_i\|_2+\|c_i'\|_2\big) \tag{26}$$
$$\text{s.t.}\quad(2D(S_i^k)-I_n)X_kc_i\ge0,\;(2D(S_i^k)-I_n)X_kc_i'\ge0,\;\forall i,k.$$
We now define $m^*:=\sum_{i=1}^{P_{\mathrm{conv}}}\mathbb{1}[\|c_i^*\|_2\neq0]+\sum_{i=1}^{P_{\mathrm{conv}}}\mathbb{1}[\|c_i'^*\|_2\neq0]$, where $\{c_i^*,c_i'^*\}_{i=1}^{P_{\mathrm{conv}}}$ are the optimal weights in equation 26.

Theorem A.3. The convex program in equation 26 and the non-convex problem in equation 25 with $m\ge m^*$ have identical optimal values. Moreover, an optimal solution to equation 25 can be constructed from an optimal solution to equation 26 as follows:
$$(u_{j_{1i}}^*,w_{j_{1i}}^*)=\Big(\frac{c_i^*}{\sqrt{\|c_i^*\|_2}},\,\sqrt{\|c_i^*\|_2}\Big)\ \text{if}\ \|c_i^*\|_2>0,\qquad(u_{j_{2i}}^*,w_{j_{2i}}^*)=\Big(\frac{c_i'^*}{\sqrt{\|c_i'^*\|_2}},\,-\sqrt{\|c_i'^*\|_2}\Big)\ \text{if}\ \|c_i'^*\|_2>0,$$
where $\{c_i^*,c_i'^*\}_{i=1}^{P_{\mathrm{conv}}}$ are the optimal solutions to equation 26.

Proof of Theorem A.3. We first note that there are $m^*$ nonzero vectors among $\{c_i^*,c_i'^*\}$. Constructing $\{u_j^*,w_j^*\}_{j=1}^{m^*}$ as stated in the theorem and plugging them into the non-convex objective in equation 25, we obtain
$$p_1^*\le\mathcal{L}\Big(\sum_{j=1}^{m^*}\sum_{k=1}^K(X_ku_j^*)_+w_j^*,\,y\Big)+\frac{\beta}{2}\sum_{\substack{i=1\\c_i^*\neq0}}^{P_{\mathrm{conv}}}\Big(\Big\|\frac{c_i^*}{\sqrt{\|c_i^*\|_2}}\Big\|_2^2+\|c_i^*\|_2\Big)+\frac{\beta}{2}\sum_{\substack{i=1\\c_i'^*\neq0}}^{P_{\mathrm{conv}}}\Big(\Big\|\frac{c_i'^*}{\sqrt{\|c_i'^*\|_2}}\Big\|_2^2+\|c_i'^*\|_2\Big),$$
which is identical to the objective value of the convex program in equation 26.
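The weight construction in Theorem A.3 can be sanity-checked numerically; the sketch below uses a random, hypothetical convex-program solution $c_i^*$ and verifies that the constructed neuron reproduces $(X_kc_i^*)_+$ exactly and that its weight-decay cost equals the group-$\ell_1$ cost $\beta\|c_i^*\|_2$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, h = 5, 4
Xk = rng.standard_normal((n, h))
c = rng.standard_normal(h)  # a hypothetical nonzero convex-program solution c_i*

# Construction from Theorem A.3: u = c / sqrt(||c||_2), w = sqrt(||c||_2).
norm_c = np.linalg.norm(c)
u = c / np.sqrt(norm_c)
w = np.sqrt(norm_c)

# Positive homogeneity of the ReLU: the constructed neuron outputs (X_k c)_+.
assert np.allclose(np.maximum(Xk @ u, 0) * w, np.maximum(Xk @ c, 0))

# Weight-decay cost of the constructed pair equals the group-l1 cost beta*||c||_2.
beta = 0.1
assert np.isclose(beta / 2 * (np.linalg.norm(u)**2 + w**2), beta * norm_c)
```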
Since the value of the convex program equals the value $d_1^*$ of its dual, we conclude that $p_1^*=d_1^*$, which is equal to the value of the convex program in equation 26 achieved by the prescribed parameters.

We also show that our dual characterization holds for arbitrary convex loss functions:
$$\min_{\substack{\{u_j,w_j\}_{j=1}^m\\u_j\in\mathcal{B}_2,\forall j}}\mathcal{L}\Big(\sum_{j=1}^m\sum_{k=1}^K(X_ku_j)_+w_j,\,y\Big)+\beta\|w\|_1, \tag{27}$$
where $\mathcal{L}(\cdot,y)$ is a convex loss function.

Theorem A.4. The dual of equation 27 is given by
$$\max_{v}-\mathcal{L}^*(v)\quad\text{s.t.}\quad\sum_{k=1}^Kv^T(X_ku)_+\le\beta,\;\forall u\in\mathcal{B}_2,$$
where $\mathcal{L}^*$ is the Fenchel conjugate function, defined as $\mathcal{L}^*(v)=\max_z z^Tv-\mathcal{L}(z,y)$.

Proof of Theorem A.4. The proof follows from classical Fenchel duality (Boyd & Vandenberghe, 2004). We first describe equation 27 in the equivalent form
$$\min_{\substack{\{u_j,w_j\}_{j=1}^m,\,z\\u_j\in\mathcal{B}_2,\forall j}}\mathcal{L}(z,y)+\beta\|w\|_1\quad\text{s.t.}\quad z=\sum_{j=1}^m\sum_{k=1}^K(X_ku_j)_+w_j.$$
Then the dual function is
$$g(v)=\min_{\substack{\{u_j,w_j\}_{j=1}^m,\,z\\u_j\in\mathcal{B}_2,\forall j}}\mathcal{L}(z,y)-v^Tz+v^T\sum_{j=1}^m\sum_{k=1}^K(X_ku_j)_+w_j+\beta\|w\|_1.$$
Therefore, using classical Fenchel duality (Boyd & Vandenberghe, 2004) yields the claimed dual form.

A.7 STRONG DUALITY RESULTS

Proposition A.1. Given $m\ge m^*$, strong duality holds for equation 3, i.e., $p_1^*=d_1^*$.

We first review the basic properties of infinite-size neural networks and introduce the technical details needed to derive the dual of equation 3; we refer the reader to Rosset et al. (2007); Bach (2017) for further details. Let us first consider a measurable input space $\mathcal{X}$ with a set of continuous basis functions (i.e., neurons or filters in our context) $\psi_u:\mathcal{X}\to\mathbb{R}$, which are parameterized by $u\in\mathcal{B}_2$. Next, we use real-valued Radon measures equipped with the uniform norm (Rudin, 1964). Given a signed Radon measure $\mu$, we can formulate an infinite-size neural network as
$$f(x)=\int_{u\in\mathcal{B}_2}\psi_u(x)\,d\mu(u),$$
where $x\in\mathcal{X}$ is the input. The norm of $\mu$ is usually defined as its total variation norm, which is the supremum of $\int_{u\in\mathcal{B}_2}g(u)\,d\mu(u)$ over all continuous functions $g(u)$ that satisfy $|g(u)|\le1$. Now, we consider the case where the basis functions are ReLUs, i.e., $\psi_u(x)=(x^Tu)_+$. Then, the output of a network with finitely many neurons, say $m$ neurons, can be written as $f(x)=\sum_{j=1}^m\psi_{u_j}(x)w_j$, which can be obtained by selecting $\mu$ as a weighted sum of Dirac delta functions, i.e., $\mu=\sum_{j=1}^mw_j\delta(u-u_j)$. In this case, the total variation norm, denoted $\|\mu\|_{TV}$, corresponds to the $\ell_1$ norm $\|w\|_1$. Now, we are ready to derive the dual of equation 3, which can be stated as follows (see Section 8.6 of Goberna & López-Cerdá (1998) and Section 2 of Shapiro (2009) for further details):
$$d_1^*\le p_{1,\infty}=\min_{\mu}\frac{1}{2}\Big\|\int_{u\in\mathcal{B}_2}\sum_{k=1}^K(X_ku)_+\,d\mu(u)-y\Big\|_2^2+\beta\|\mu\|_{TV}. \tag{28}$$
Although equation 28 involves an infinite-dimensional integral form, by Carathéodory's theorem the integral can be represented as a finite summation, more precisely, a summation of at most $n+1$ Dirac delta functions (Rosset et al., 2007).
If we denote the number of Dirac delta functions as $m^*$, where $m^*\le n+1$, then we have
$$p_{1,\infty}=\min_{\substack{\{u_j,w_j\}_{j=1}^{m^*}\\u_j\in\mathcal{B}_2,\forall j}}\frac{1}{2}\Big\|\sum_{j=1}^{m^*}\sum_{k=1}^K(X_ku_j)_+w_j-y\Big\|_2^2+\beta\|w\|_1=p_1^*$$
provided that $m\ge m^*$. It remains to show that strong duality holds, i.e., $p_1^*=d_1^*$. We first note that the semi-infinite problem in equation 3 is convex. Next, we verify that the optimal value is finite. Since $\beta>0$, we know that $v=0$ is strictly feasible and achieves objective value 0; moreover, since $-\frac{1}{2}\|v-y\|_2^2+\frac{1}{2}\|y\|_2^2\le\frac{1}{2}\|y\|_2^2$, the optimal objective value is finite. Therefore, by Theorem 2.2 of Shapiro (2009), strong duality holds, i.e., $p_{1,\infty}=d_1^*$, provided that the solution set of equation 3 is nonempty and bounded. We also note that the solution set of equation 3 is the Euclidean projection of $y$ onto a convex, closed and bounded set, since $(X_ku)_+$ can be expressed as the union of finitely many convex, closed and bounded sets.

A.8 PROOF OF THEOREM 2.2

The proof follows the proof of Proposition A.1. The dual of equation 6 is
$$d_1^*\le p_{1,\infty}=\min_{\mu}\frac{1}{2}\Big\|\int_{u\in\mathcal{B}_2}\mathrm{maxpool}\big(\{(X_ku)_+\}_{k=1}^K\big)\,d\mu(u)-y\Big\|_2^2+\beta\|\mu\|_{TV},$$
which has the following finite equivalent
$$p_{1,\infty}=\min_{\substack{\{u_j,w_j\}_{j=1}^{m^*}\\u_j\in\mathcal{B}_2,\forall j}}\frac{1}{2}\Big\|\sum_{j=1}^{m^*}\mathrm{maxpool}\big(\{(X_ku_j)_+\}_{k=1}^K\big)w_j-y\Big\|_2^2+\beta\|w\|_1=p_1^*$$
provided that $m\ge m^*$. It remains to show that strong duality holds, i.e., $p_1^*=d_1^*$. Since $\mathrm{maxpool}(\cdot)$ can be expressed as the union of finitely many convex, closed and bounded sets, the rest of the strong duality argument directly follows the proof of Proposition A.1. We now focus on the single-sided dual constraint
$$\max_{u\in\mathcal{B}_2}v^T\mathrm{maxpool}\big(\{(X_ku)_+\}_{k=1}^K\big)\le\beta,$$
which can be written as
$$\max_{\substack{S^k\subseteq[n]\\S^k\in\mathcal{S}}}\max_{u\in\mathcal{B}_2}\sum_{k=1}^Kv^TD(S^k)X_ku\quad\text{s.t.}\quad(2D(S^k)-I_n)X_ku\ge0,\;\forall k,\quad D(S^k)X_ku\ge D(S^k)X_ju,\;\forall j,k\in[K],\quad\sum_{k=1}^KD(S^k)=I_n.$$
We again enumerate all hyperplane arrangements and index them in an arbitrary order, where we define the overall set as
$\mathcal{S}_K:=\{(S_i^1,\ldots,S_i^K):S_i^k\in\mathcal{S},\forall k,i;\;\sum_{k=1}^KD(S_i^k)=I_n,\forall i\}$ and $P_{\mathrm{conv}}=|\mathcal{S}_K|$. Then, following the same steps as in (13)-(17) gives the following convex problem:
$$\min_{w_i,w_i'\in\mathbb{R}^h}\frac{1}{2}\Big\|\sum_{i=1}^{P_{\mathrm{conv}}}\sum_{k=1}^KD(S_i^k)X_k(w_i-w_i')-y\Big\|_2^2+\beta\sum_{i=1}^{P_{\mathrm{conv}}}\big(\|w_i\|_2+\|w_i'\|_2\big) \tag{29}$$
$$\text{s.t.}\quad(2D(S_i^k)-I_n)X_kw_i\ge0,\;(2D(S_i^k)-I_n)X_kw_i'\ge0,\;\forall i,k,\quad D(S_i^k)X_kw_i\ge D(S_i^k)X_jw_i,\;D(S_i^k)X_kw_i'\ge D(S_i^k)X_jw_i',\;\forall i,j,k.$$
We now note that there are $m^*$ nonzero pairs $\{w_i^*,w_i'^*\}$. Then, we can construct a set of weights $\{u_j^*,w_j^*\}_{j=1}^{m^*}$ as defined in the theorem and evaluate the non-convex objective in equation 5 using these weights as follows:
$$p_1^*\le\frac{1}{2}\Big\|\sum_{j=1}^{m^*}\mathrm{maxpool}\big(\{(X_ku_j^*)_+\}_{k=1}^K\big)w_j^*-y\Big\|_2^2+\frac{\beta}{2}\sum_{\substack{i=1\\w_i^*\neq0}}^{P_{\mathrm{conv}}}\Big(\Big\|\frac{w_i^*}{\sqrt{\|w_i^*\|_2}}\Big\|_2^2+\|w_i^*\|_2\Big)+\frac{\beta}{2}\sum_{\substack{i=1\\w_i'^*\neq0}}^{P_{\mathrm{conv}}}\Big(\Big\|\frac{w_i'^*}{\sqrt{\|w_i'^*\|_2}}\Big\|_2^2+\|w_i'^*\|_2\Big).$$

A.9 PROOF OF THEOREM 3.1

By using a rescaling for each $w_{1j}$ and $w_{2j}$, equation 8 can be equivalently stated as
$$p_2^*=\min_{\substack{\{\{u_{lj}\}_{l=1}^{L-2},w_{1j},w_{2j}\}_{j=1}^m\\w_{1j}\in\mathcal{B}_2,\,u_{lj}\in\mathcal{U}_L,\forall l,j}}\frac{1}{2}\Big\|\sum_{j=1}^m\Big(X\prod_{l=1}^{L-2}U_{lj}\,w_{1j}\Big)_+w_{2j}-y\Big\|_2^2+\beta\|w_2\|_1. \tag{30}$$
Let us denote the eigenvalue decomposition of $U_{lj}$ as $U_{lj}=FD_{lj}F^H$, where $F\in\mathbb{C}^{d\times d}$ is the DFT matrix and $D_{lj}\in\mathbb{C}^{d\times d}$ is a diagonal matrix. Then, we again take the dual with respect to $w_2$ and change the order of min-max as follows:
$$p_2^*\ge d_2^*=\max_{v}-\frac{1}{2}\|v-y\|_2^2+\frac{1}{2}\|y\|_2^2\quad\text{s.t.}\quad\max_{\substack{D_{lj}\in\mathcal{D}_L\\w_{1j}\in\mathcal{B}_2}}v^T\Big(XF\prod_{l=1}^{L-2}D_{lj}F^Hw_{1j}\Big)_+\le\beta,\;\forall j, \tag{31}$$
where $\mathcal{D}_L:=\{(D_1,\ldots,D_{L-2}):D_l\in\mathbb{C}^{d\times d}\text{ diagonal},\forall l\in[L-2];\;\prod_{l=1}^{L-2}\|D_l\|_F^2\le d^{L-2}\}$. Below we prove that strong duality holds for equation 31. In order to obtain the dual of the semi-infinite problem in equation 31, we again take the dual with respect to $v$ (see Appendix A.7 and Goberna & López-Cerdá (1998); Shapiro (2009) for further details), which yields
$$d_2^*\le p_{2,\infty}=\min_{\mu}\frac{1}{2}\Big\|\int_{\theta_L\in\Theta_L}\Big(XF\prod_{l=1}^{L-2}D_lF^Hw_1\Big)_+\,d\mu(\theta_L)-y\Big\|_2^2+\beta\|\mu\|_{TV},$$
where $\Theta_L:=\{(D_1,\ldots,D_{L-2},w_1):(D_1,\ldots,D_{L-2})\in\mathcal{D}_L;\;w_1\in\mathcal{B}_2\}$.
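The hyperplane-arrangement patterns $D(S)$ that index these convex programs can be enumerated in polynomial time (Appendix A.2); a simple sampling-based sketch below collects distinct ReLU sign patterns of $Xu$ and checks the classical polynomial bound on their number for fixed filter size $h$.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(3)
n, h = 6, 2
X = rng.standard_normal((n, h))

# Collect (a subset of) the ReLU sign patterns D(S) = diag(1[X u >= 0])
# by randomly sampling filter directions u.
patterns = set()
for _ in range(20000):
    u = rng.standard_normal(h)
    patterns.add(tuple((X @ u >= 0).astype(int)))

# Classical bound on the number of regions of n hyperplanes through the
# origin in R^h: at most 2 * sum_{i=0}^{h-1} C(n-1, i), i.e., O(n^h).
bound = 2 * sum(comb(n - 1, i) for i in range(h))
assert len(patterns) <= bound
```

For fixed $h$ this count is polynomial in $n$, which is what makes the convex programs finite and polynomial-sized.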
Then, selecting $\mu=\sum_{j=1}^{m^*}w_{2j}\delta(\theta_L-\theta_{Lj})$, where $m^*\le n+1$, gives
$$p_{2,\infty}=\min_{\substack{\{\{D_{lj}\}_{l=1}^{L-2},w_{1j},w_{2j}\}_{j=1}^{m^*}\\D_{lj}\in\mathcal{D}_L,\,w_{1j}\in\mathcal{B}_2,\forall j,l}}\frac{1}{2}\Big\|\sum_{j=1}^{m^*}\Big(XF\prod_{l=1}^{L-2}D_{lj}F^Hw_{1j}\Big)_+w_{2j}-y\Big\|_2^2+\beta\|w_2\|_1=p_2^*$$
provided that $m\ge m^*$ holds. Then, the rest of the strong duality proof directly follows the proof of Proposition A.1. We now focus on the single-sided dual constraint
$$\max_{\substack{D_l\in\mathcal{D}_L\\\tilde{w}_1\in\mathcal{B}_2}}v^T\Big(\tilde{X}\prod_{l=1}^{L-2}D_l\tilde{w}_1\Big)_+\le\beta,$$
where $\tilde{X}=XF$ and $\tilde{w}_1=F^Hw_1$ (note that $\|\tilde{w}_1\|_2=\|w_1\|_2$ since $F$ is unitary). This can be written as
$$\max_{\substack{S\subseteq[n]\\S\in\mathcal{S}}}\max_{\substack{D_l\in\mathcal{D}_L\\\tilde{w}_1\in\mathcal{B}_2}}v^TD(S)\tilde{X}\prod_{l=1}^{L-2}D_l\tilde{w}_1\quad\text{s.t.}\quad(2D(S)-I_n)\tilde{X}\prod_{l=1}^{L-2}D_l\tilde{w}_1\ge0. \tag{32}$$
Since the inner maximization is convex (after the variable change $q=\prod_{l=1}^{L-2}D_l\tilde{w}_1$) and there exists a strictly feasible solution for a fixed $D(S)$ matrix, equation 32 can also be written as
$$\max_{\substack{S\subseteq[n]\\S\in\mathcal{S}}}\min_{\alpha\ge0}\max_{\substack{D_l\in\mathcal{D}_L\\\tilde{w}_1\in\mathcal{B}_2}}\Big(v^TD(S)+\alpha^T(2D(S)-I_n)\Big)\tilde{X}\prod_{l=1}^{L-2}D_l\tilde{w}_1=\max_{\substack{S\subseteq[n]\\S\in\mathcal{S}}}\min_{\alpha\ge0}\Big\|\big(v^TD(S)+\alpha^T(2D(S)-I_n)\big)\tilde{X}\Big\|_\infty d^{\frac{L-2}{2}}.$$
We now enumerate all hyperplane arrangements and index them in an arbitrary order, denoted as $D(S_i)$, where $i\in[P_{\mathrm{cconv}}]$. Then, we have
$$\text{equation 32}\iff\forall i\in[P_{\mathrm{cconv}}],\;\min_{\alpha\ge0}\Big\|\big(v^TD(S_i)+\alpha^T(2D(S_i)-I_n)\big)\tilde{X}\Big\|_\infty d^{\frac{L-2}{2}}\le\beta\iff\forall i\in[P_{\mathrm{cconv}}],\;\exists\alpha_i\ge0\;\text{s.t.}\;\Big\|\big(v^TD(S_i)+\alpha_i^T(2D(S_i)-I_n)\big)\tilde{X}\Big\|_\infty d^{\frac{L-2}{2}}\le\beta.$$
We now use the same approach for the two-sided constraint in equation 31 to represent equation 31 as the finite-dimensional convex problem
$$\max_{\substack{v\\\alpha_i,\alpha_i'\ge0}}-\frac{1}{2}\|v-y\|_2^2+\frac{1}{2}\|y\|_2^2 \tag{33}$$
$$\text{s.t.}\quad\Big\|\big(v^TD(S_i)+\alpha_i^T(2D(S_i)-I_n)\big)\tilde{X}\Big\|_\infty d^{\frac{L-2}{2}}\le\beta,\quad\Big\|\big(-v^TD(S_i)+\alpha_i'^T(2D(S_i)-I_n)\big)\tilde{X}\Big\|_\infty d^{\frac{L-2}{2}}\le\beta,\;\forall i.$$
We note that the above problem is convex and strictly feasible for $v=\alpha_i=\alpha_i'=0$. Therefore, equation 33 can be written as
$$\min_{\lambda_i,\lambda_i'\ge0}\max_{\substack{v\\\alpha_i,\alpha_i'\ge0}}-\frac{1}{2}\|v-y\|_2^2+\frac{1}{2}\|y\|_2^2+\sum_{i=1}^{P_{\mathrm{cconv}}}\lambda_i\Big(\beta-\Big\|\big(v^TD(S_i)+\alpha_i^T(2D(S_i)-I_n)\big)\tilde{X}\Big\|_\infty d^{\frac{L-2}{2}}\Big)+\sum_{i=1}^{P_{\mathrm{cconv}}}\lambda_i'\Big(\beta-\Big\|\big(-v^TD(S_i)+\alpha_i'^T(2D(S_i)-I_n)\big)\tilde{X}\Big\|_\infty d^{\frac{L-2}{2}}\Big). \tag{34}$$
Next, we introduce new variables $z_i,z_i'\in\mathbb{C}^d$ to represent equation 34 as
$$\min_{\lambda_i,\lambda_i'\ge0}\max_{\substack{v\\\alpha_i,\alpha_i'\ge0}}\min_{z_i,z_i'\in\mathcal{B}_1}-\frac{1}{2}\|v-y\|_2^2+\frac{1}{2}\|y\|_2^2+\sum_{i=1}^{P_{\mathrm{cconv}}}\lambda_i\Big(\beta+d^{\frac{L-2}{2}}\big(v^TD(S_i)+\alpha_i^T(2D(S_i)-I_n)\big)\tilde{X}z_i\Big)+\sum_{i=1}^{P_{\mathrm{cconv}}}\lambda_i'\Big(\beta+d^{\frac{L-2}{2}}\big(-v^TD(S_i)+\alpha_i'^T(2D(S_i)-I_n)\big)\tilde{X}z_i'\Big). \tag{35}$$
We note that the objective is concave in $v,\alpha_i,\alpha_i'$ and convex in $z_i,z_i'$; moreover, the set $\mathcal{B}_1$ is convex and compact. We recall Sion's minimax theorem (Sion, 1958) for the inner max-min problem and express the strong dual of equation 35 as
$$\min_{\lambda_i,\lambda_i'\ge0}\min_{z_i,z_i'\in\mathcal{B}_1}\max_{\substack{v\\\alpha_i,\alpha_i'\ge0}}-\frac{1}{2}\|v-y\|_2^2+\frac{1}{2}\|y\|_2^2+\sum_{i=1}^{P_{\mathrm{cconv}}}\lambda_i\Big(\beta+d^{\frac{L-2}{2}}\big(v^TD(S_i)+\alpha_i^T(2D(S_i)-I_n)\big)\tilde{X}z_i\Big)+\sum_{i=1}^{P_{\mathrm{cconv}}}\lambda_i'\Big(\beta+d^{\frac{L-2}{2}}\big(-v^TD(S_i)+\alpha_i'^T(2D(S_i)-I_n)\big)\tilde{X}z_i'\Big). \tag{36}$$
Now, we can compute the maximum with respect to $v,\alpha_i,\alpha_i'$ analytically; after a change of variables, this yields the convex program in equation 38.

Optimal weight construction for equation 8: given the optimal weights of the convex program in equation 38, denoted $\{c_i^*,c_i'^*\}$, let $\phi_i$ be defined such that $c_i^*=\mathrm{diag}(|c_i^*|)e^{j\phi_i}$. Thus, we can directly set the parameters as
$$D_{li}^*=\mathrm{diag}\Big(d^{\frac{1}{2}}\frac{|c_i^*|}{\|c_i^*\|_1}\Big),\qquad w_{1i}^*=\mathrm{diag}\Big(\frac{|c_i^*|}{d^{\frac{L-2}{2}}}\Big)e^{j\phi_i},\qquad w_{2i}^*=\frac{\|c_i^*\|_1}{d^{\frac{L-2}{2}}},$$
which can be equivalently written as
$$U_{li}^*=F\mathrm{diag}\Big(d^{\frac{1}{2}}\frac{|c_i^*|}{\|c_i^*\|_1}\Big)F^H,\qquad w_{1i}^*=\frac{|c_i^*|}{d^{\frac{L-2}{2}}},\qquad w_{2i}^*=\frac{\|c_i^*\|_1}{d^{\frac{L-2}{2}}}$$
to exactly match the problem formulation in equation 8. The same steps can also be applied to $c_i'^*$. The rest of the proof directly follows from Theorem 2.1. Therefore, we prove that a set of optimal layer weights for equation 8, denoted $\{\{U_{lj}^*\}_{l=1}^{L-2},w_{1j}^*,w_{2j}^*\}_{j=1}^{m^*}$, can be obtained from the optimal solution to equation 38, denoted $\{c_i^*,c_i'^*\}$.

A.10 PROOF OF THEOREM 4.1

Taking the dual of the rescaled training problem with respect to the output layer yields
$$p_3^*\ge d_3^*=\max_{v}-\frac{1}{2}\|v-y\|_2^2+\frac{1}{2}\|y\|_2^2\quad\text{s.t.}\quad\max_{u,w_1\in\mathcal{B}_2}v^T\Big(\sum_{k=1}^K(X_ku)_+w_{1k}\Big)_+\le\beta. \tag{39}$$
Taking the dual of equation 39 yields $d_3^*\le p_{3,\infty}$, defined analogously to equation 28. The single-sided dual constraint can then be written as
$$\max_{I_k\in\{\pm1\}}\max_{\substack{S_2,S_1^k\subseteq[n]\\S_2,S_1^k\in\mathcal{S}}}\max_{w_1\in\mathcal{B}_2}\max_{\|q_k\|_2\le|w_{1k}|}v^TD(S_2)\sum_{k=1}^KI_kD(S_1^k)X_kq_k \tag{42}$$
s.t.
$$(2D(S_2)-I_n)\sum_{k=1}^KI_kD(S_1^k)X_kq_k\ge0,\qquad(2D(S_1^k)-I_n)X_kq_k\ge0,$$
where we introduce the notation $I_k\in\{\pm1\}$ to enumerate all possible sign patterns of $w_{1k}$. Since the inner maximization is convex and there exists a strictly feasible solution for fixed $D(S_1^k)$, $D(S_2)$, and $I_k$, equation 42 can also be written as
$$\max_{I_k\in\{\pm1\}}\max_{\substack{S_2,S_1^k\subseteq[n]\\S_2,S_1^k\in\mathcal{S}}}\min_{\alpha_k,\gamma\ge0}\max_{w_1\in\mathcal{B}_2}\max_{\|q_k\|_2\le|w_{1k}|}\sum_{k=1}^K\Big(v^TD(S_1^k)X_kq_k+\alpha_k^T(2D(S_1^k)-I_n)X_kq_k+I_k\gamma^T(2D(S_2)-I_n)D(S_1^k)X_kq_k\Big)$$
$$=\max_{I_k\in\{\pm1\}}\max_{\substack{S_2,S_1^k\subseteq[n]\\S_2,S_1^k\in\mathcal{S}}}\min_{\alpha_k,\gamma\ge0}\max_{w_1\in\mathcal{B}_2}\sum_{k=1}^K\Big\|v^TD(S_1^k)X_k+\alpha_k^T(2D(S_1^k)-I_n)X_k+I_k\gamma^T(2D(S_2)-I_n)D(S_1^k)X_k\Big\|_2|w_{1k}|$$
$$=\max_{I_k\in\{\pm1\}}\max_{\substack{S_2,S_1^k\subseteq[n]\\S_2,S_1^k\in\mathcal{S}}}\min_{\alpha_k,\gamma\ge0}\Big(\sum_{k=1}^K\Big\|v^TD(S_1^k)X_k+\alpha_k^T(2D(S_1^k)-I_n)X_k+I_k\gamma^T(2D(S_2)-I_n)D(S_1^k)X_k\Big\|_2^2\Big)^{\frac{1}{2}}.$$



The results on two-layer linear CNNs are presented in Appendix A.4. (Here, the multi-layer circular CNN refers to an L-layer network with only one ReLU layer and circular convolutions.)

TWO-LAYER CNNS

In this section, we present exact convex formulations for two-layer CNN architectures. We define the average pooling operation as $\sum_{k=1}^K(X_ku_j)_+$, which is also known as global average pooling. Since our proof technique is similar for the different CNNs, we present only the proof of Theorem 2.1 in this section; the rest of the proofs can be found in the Appendix (including the strong duality results in A.7), and the details for the multi-layer case are presented in Appendix A.9. Additional numerical results can be found in Appendix A.1. We use CVX (Grant & Boyd, 2014) and CVXPY (Diamond & Boyd, 2016; Agrawal et al., 2018) with the SDPT3 solver (Tütüncü et al., 2001) to solve the convex optimization problems.
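For concreteness, the patch-matrix notation and the global average pooling operation above can be sketched as follows (a minimal illustration, not the paper's code; the sizes are borrowed from the Figure 3 caption, and the 1/K normalization is omitted to match the definition above).

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, h, m = 6, 15, 10, 4
stride = 5
X = rng.standard_normal((n, d))

# Patch matrices X_k: each holds the width-h patch at offset k*stride for
# every sample, so X_k u computes one convolution output position.
K = (d - h) // stride + 1
patches = [X[:, k * stride:k * stride + h] for k in range(K)]

u = [rng.standard_normal(h) for _ in range(m)]  # filters
w = rng.standard_normal(m)                      # output-layer weights

# Two-layer ReLU CNN with global average pooling (up to the 1/K factor):
# f(X) = sum_j [ sum_k (X_k u_j)_+ ] w_j
f = sum(sum(np.maximum(Xk @ u[j], 0) for Xk in patches) * w[j] for j in range(m))
assert f.shape == (n,)
```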



Figure 1: Training cost of the three-layer circular CNN trained with SGD (5 initialization trials) on a synthetic dataset (n = 6, d = 20, h = 3, stride = 1), where the green and red lines with markers represent the objective value obtained by the proposed convex program in equation 9 and the non-convex objective value in equation 8 of a feasible network with the weights found by the convex program, respectively. We use markers to denote the total computation time of the convex solver.

A.1 Additional numerical results
A.2 Constructing hyperplane arrangements in polynomial time
A.3 Equivalence of the l1 penalized objectives
A.4 Two-layer linear CNNs
A.5 Extensions to vector outputs
A.6 Extensions to arbitrary convex loss functions
A.7 Strong duality results
A.8 Proof of Theorem 2.2
A.9 Proof of Theorem 3.1
A.10 Proof of Theorem 4.1

Figure 3: Training cost of a two-layer CNN (with average pooling) trained with SGD (5 initialization trials) on a synthetic dataset (n = 6, d = 15, h = 10, stride = 5), where the green line with a marker represents the objective value obtained by the proposed convex program in equation 4 and the red line with a marker represents the non-convex objective value in equation 1 of a feasible network with the weights found by the convex program. Here, we use markers to denote the total computation time of the convex optimization solver.

This achieves the same objective value as equation 29. Since strong duality holds for the convex program, we have $p_1^*=d_1^*$, which is equal to the value of the convex program in equation 29 achieved by the prescribed parameters above.

$$\min_{\substack{\lambda_i,\lambda_i'\ge0\\z_i,z_i'\in\mathcal{B}_1}}\frac{1}{2}\Big\|d^{\frac{L-2}{2}}\sum_{i=1}^{P_{\mathrm{cconv}}}D(S_i)\tilde{X}(\lambda_iz_i-\lambda_i'z_i')-y\Big\|_2^2+\beta\sum_{i=1}^{P_{\mathrm{cconv}}}(\lambda_i+\lambda_i') \tag{37}$$
$$\text{s.t.}\quad(2D(S_i)-I_n)\tilde{X}z_i\ge0,\;(2D(S_i)-I_n)\tilde{X}z_i'\ge0,\;\forall i.$$
Now we apply a change of variables and define $c_i=d^{\frac{L-2}{2}}\lambda_iz_i$ and $c_i'=d^{\frac{L-2}{2}}\lambda_i'z_i'$. Thus, we obtain
$$\min_{c_i,c_i'}\frac{1}{2}\Big\|\sum_{i=1}^{P_{\mathrm{cconv}}}D(S_i)\tilde{X}(c_i-c_i')-y\Big\|_2^2+\frac{\beta}{d^{\frac{L-2}{2}}}\sum_{i=1}^{P_{\mathrm{cconv}}}\big(\|c_i\|_1+\|c_i'\|_1\big) \tag{38}$$
$$\text{s.t.}\quad(2D(S_i)-I_n)\tilde{X}c_i\ge0,\;(2D(S_i)-I_n)\tilde{X}c_i'\ge0,\;\forall i,$$
where we eliminate the variables $\lambda_i,\lambda_i'$, since $\lambda_i=\|c_i\|_1/d^{\frac{L-2}{2}}$ at the optimum. Optimal weight construction for equation 8: given the optimal weights of the convex program in equation 38, denoted $\{c_i^*,c_i'^*\}$,

$\prod_{l=1}^{L-2}\|D_{li}^*\|_F^2=d^{L-2},\;\forall i,l$; therefore, this set of parameters is feasible for equation 8. Now, we prove optimality by showing that these parameters have the same regularization cost as the convex program in equation 38, as follows:

We now present the proof for the three-layer CNN architecture with two ReLU layers, which has the following training problem:

holds, i.e., $d_3^*=p_{3,\infty}$, by the proof of Proposition A.1. Then, the finite equivalent is as follows:
$$p_{3,\infty}=p_3^*=\min_{u_j,w_{1k}}\ \ldots\quad\text{s.t.}\quad(2D(S_2)-I_n)\sum_{k=1}^KD(S_1^k)X_kuw_{1k}\ge0,\;(2D(S_1^k)-I_n)X_kuw_{1k}\ge0.$$

Table: CNN architectures and the corresponding norm regularization in our convex programs.

Table: Computational complexity of training CNNs to global optimality using a standard interior-point solver (n: number of data samples, d: data dimension, K: number of patches, r_c: maximal rank of the patch matrices (r_c <= h), r_cc: rank for the circular convolution, h: filter size, m: number of filters).

ACKNOWLEDGEMENTS

This work was partially supported by the National Science Foundation under grants IIS-1838179 and ECCS-2037304, Facebook Research, Adobe Research and Stanford SystemX Alliance.


Published as a conference paper at ICLR

Then, we have equation 41 if and only if, for all $i,j$, the corresponding minimization over the dual variables is bounded by $\beta$. We now use the same approach for the two-sided constraint. We note that the resulting problem is convex and strictly feasible; therefore, Slater's condition and consequently strong duality hold (Boyd & Vandenberghe, 2004), and equation 43 can be written in minimax form. Next, we introduce new variables $z_{ijk},z_{ijk}'\in\mathbb{R}^h$ to represent equation 44, and the strong dual of equation 45 follows as before. Now, we compute the maximum with respect to $v,\alpha_{ijk},\alpha_{ijk}',\gamma_{ij},\gamma_{ij}'$ analytically to obtain the resulting problem. Now we apply a change of variables and define $c_{ijk}=\lambda_{ij}z_{ijk}$ and $c_{ijk}'=\lambda_{ij}z_{ijk}'$. Thus, we obtain

