PARALLEL DEEP NEURAL NETWORKS HAVE ZERO DUALITY GAP

Abstract

Training deep neural networks is a challenging non-convex optimization problem. Recent work has proven that strong duality holds (i.e., the duality gap is zero) for regularized finite-width two-layer ReLU networks and consequently provided an equivalent convex training problem. However, extending this result to deeper networks remains an open problem. In this paper, we prove that the duality gap for deeper linear networks with vector outputs is non-zero. In contrast, we show that a zero duality gap can be obtained by stacking standard deep networks in parallel, which we call a parallel architecture, and modifying the regularization. We thereby prove strong duality and the existence of equivalent convex problems that enable globally optimal training of deep networks. As a by-product of our analysis, we demonstrate via closed-form expressions that weight decay regularization on the network parameters explicitly encourages low-rank solutions. In addition, we show that strong duality holds for three-layer standard ReLU networks given rank-1 data matrices.

1. INTRODUCTION

Deep neural networks demonstrate outstanding representation and generalization abilities in popular learning problems ranging from computer vision and natural language processing to recommendation systems. Although the training problem of deep neural networks is highly non-convex, simple first-order gradient-based algorithms, such as stochastic gradient descent, can find solutions with good generalization properties. However, due to the non-convex and non-linear nature of the training problem, the underlying theoretical reasons for this remain an open problem. The Lagrangian dual problem (Boyd et al., 2004) plays an important role in the theory of convex and non-convex optimization. For convex optimization problems, convex duality is an important tool for determining their optimal values and characterizing their optimal solutions. Even for a non-convex primal problem, the dual problem is a convex optimization problem that can be solved efficiently. By weak duality, the optimal value of the dual problem serves as a non-trivial lower bound on the optimal primal objective value. Although the duality gap can be non-zero for non-convex problems, the dual problem provides a convex relaxation of the non-convex primal problem. For example, the semi-definite programming relaxation of the two-way partitioning problem can be derived from its dual problem (Boyd et al., 2004). Convex duality also has important applications in machine learning. In Paternain et al. (2019), the design problem of an all-encompassing reward is formulated as a constrained reinforcement learning problem, which is shown to have zero duality gap. This property gives a theoretical convergence guarantee for the primal-dual algorithm solving this problem. Meanwhile, the minimax generative adversarial network (GAN) training problem can be tackled using duality (Farnia & Tse, 2018).
In a recent line of work, convex duality has also been applied to analyze the optimal layer weights of two-layer neural networks with linear or ReLU activations (Ergen & Pilanci, 2019; Pilanci & Ergen, 2020; Ergen & Pilanci, 2020a;b; Lacotte & Pilanci, 2020; Sahiner et al., 2020). Based on the convex duality framework, the training problem of two-layer neural networks with ReLU activation is represented as a single convex program in Pilanci & Ergen (2020). Such convex optimization formulations are extended to two-layer and three-layer convolutional neural network training problems in Ergen & Pilanci (2021b). Strong duality also holds for deep linear neural networks with scalar output (Ergen & Pilanci, 2021a). The convex optimization formulation essentially gives a detailed characterization of the global optimum of the training problem. This enables us to examine in numerical experiments whether popular optimizers for neural networks, such as gradient descent or stochastic gradient descent, converge to the global optimum of the training loss. Admittedly, a zero duality gap is hard to achieve for deep neural networks, especially those with vector outputs. This makes it more difficult to understand deep neural networks through the lens of convex optimization. Fortunately, neural networks with parallel structures (also known as multi-branch architectures) appear to be easier to train. In practice, the usage of parallel neural networks dates back to AlexNet (Krizhevsky et al., 2012). Modern neural network architectures including Inception (Szegedy et al., 2017), Xception (Chollet, 2017) and SqueezeNet (Iandola et al., 2016) utilize the parallel structure. As "parallel" versions of ResNet (He et al., 2016a;b), ResNeXt (Xie et al., 2017) and Wide ResNet (Zagoruyko & Komodakis, 2016) exhibit improved performance on many applications.
Recently, it was shown that neural networks with parallel architectures have smaller duality gaps than standard neural networks (Zhang et al., 2019). Furthermore, Ergen & Pilanci (2021c;e) proved that there is no duality gap for parallel architectures with three layers. On the other hand, it is known that overparameterized parallel neural networks have benign training landscapes (Haeffele & Vidal, 2017; Ergen & Pilanci, 2019). Parallel models with over-parameterization are essentially neural networks in the mean-field regime (Nitanda & Suzuki, 2017; Mei et al., 2018; Chizat & Bach, 2018; Mei et al., 2019; Rotskoff et al., 2019; Sirignano & Spiliopoulos, 2020; Akiyama & Suzuki, 2021; Chizat, 2021; Nitanda et al., 2020). The deep linear model is also of great interest in the machine learning community. For training the $\ell_2$ loss with deep linear networks using Schatten norm regularization, Zhang et al. (2019) show that there is no duality gap. The implicit regularization in training deep linear networks has been studied in Ji & Telgarsky (2018); Arora et al. (2019); Moroshko et al. (2020). From another perspective, the standard two-layer network is equivalent to the parallel two-layer network. This may also explain why there is no duality gap for two-layer neural networks.

1.1. CONTRIBUTIONS

Following the convex duality framework introduced in Ergen & Pilanci (2021a; 2020a), which showed that the duality gap is zero for two-layer networks, we go beyond two layers and study the convex duality of vector-output deep neural networks with linear and ReLU activations. Surprisingly, we prove that three-layer networks may have duality gaps depending on their architecture, unlike two-layer neural networks, which always have zero duality gap. We summarize our contributions as follows.
• For training standard vector-output deep linear networks with $\ell_2$ regularization, we precisely calculate the optimal values of the primal and dual problems and show that the duality gap is non-zero, i.e., the Lagrangian relaxation is inexact. We also demonstrate that the $\ell_2$ regularization on the parameters explicitly forces a tendency toward a low-rank solution, which is amplified with depth. Nevertheless, we show that the optimal solution is available in closed form.
• For parallel deep linear networks, with a certain convex regularization, we show that the duality gap is zero, i.e., the Lagrangian relaxation is exact.
• For parallel deep ReLU networks of arbitrary depth, with a certain convex regularization and a sufficient number of branches, we prove strong duality, i.e., we show that the duality gap is zero. Remarkably, this guarantees that there is a convex program equivalent to the original deep ReLU neural network training problem.
We summarize the duality gaps for parallel/standard neural networks in Table 1.

1.2. NOTATIONS

We use bold capital letters to represent matrices and bold lowercase letters to represent vectors. Denote $[n] = \{1, \ldots, n\}$. For a matrix $W_l \in \mathbb{R}^{m_{l-1} \times m_l}$, we denote $w^{\mathrm{col}}_{l,j}$ as its $j$-th column for $j \in [m_l]$ and $w^{\mathrm{row}}_{l,i}$ as its $i$-th row for $i \in [m_{l-1}]$. Throughout the paper, $X \in \mathbb{R}^{N \times d}$ is the data matrix consisting of $N$ samples of dimension $d$, and $Y \in \mathbb{R}^{N \times K}$ is the label matrix for a regression/classification task with $K$ outputs. We use the letter $P$ ($D$) for the optimal value of the primal (dual) problem.

Table 1: Duality gaps for standard and parallel networks with linear and ReLU activations at depths $L = 2$, $L = 3$ and $L > 3$. Previous work established a zero gap ($0$) for standard two-layer networks and for parallel architectures with up to three layers; this paper shows a non-zero gap ($\neq 0$) for standard networks with $L \geq 3$ and a zero gap for parallel networks of arbitrary depth.

1.3. MOTIVATIONS AND BACKGROUND

Recently, a series of papers (Pilanci & Ergen, 2020; Ergen & Pilanci, 2021a; 2020a) studied two-layer neural networks via convex duality and proved that strong duality holds for these architectures. In particular, these prior works consider the following weight-decay regularized training framework for classification/regression tasks. Given a data matrix $X \in \mathbb{R}^{N \times d}$ consisting of $N$ samples of dimension $d$ and the corresponding label vector $y \in \mathbb{R}^N$, the weight-decay regularized training problem for a scalar-output neural network with $m$ hidden neurons can be written as

$$P := \min_{W_1, w_2} \frac{1}{2}\|\phi(XW_1)w_2 - y\|_2^2 + \frac{\beta}{2}\left(\|W_1\|_F^2 + \|w_2\|_2^2\right), \quad (1)$$

where $W_1 \in \mathbb{R}^{d \times m}$ and $w_2 \in \mathbb{R}^m$ are the layer weights, $\beta > 0$ is a regularization parameter, and $\phi$ is the activation function, which can be linear, $\phi(z) = z$, or ReLU, $\phi(z) = \max\{z, 0\}$. Then, one can take the dual of (1) with respect to $W_1$ and $w_2$ to obtain the following dual optimization problem

$$D := \max_{\lambda} -\frac{1}{2}\|\lambda - y\|_2^2 + \frac{1}{2}\|y\|_2^2, \quad \text{s.t.} \ \max_{w_1 : \|w_1\|_2 \leq 1} |\lambda^T \phi(Xw_1)| \leq \beta. \quad (2)$$

We first note that since the training problem (1) is non-convex, strong duality may not hold, i.e., $P \geq D$. Surprisingly, as shown in Pilanci & Ergen (2020); Ergen & Pilanci (2021a; 2020a), strong duality in fact holds, i.e., $P = D$, for two-layer networks, and therefore one can derive exact convex representations for the non-convex training problem in (1). However, extensions of this approach to deeper and state-of-the-art architectures are not available in the literature. Based on this observation, the central question we address in this paper is: Does strong duality hold for deep neural networks? Depending on the answer to this question, the immediate next questions we address are: Can we characterize the duality gap ($P - D$)? Is there an architecture for which strong duality holds regardless of the depth? Consequently, throughout the paper, we provide a full characterization of convex duality for deeper neural networks.
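To make the weak duality inequality $P \geq D$ concrete, the following numerical sketch (our own illustration on random data with the linear activation $\phi(z) = z$; all variable names are ours) evaluates the primal objective (1) at arbitrary weights and the dual objective (2) at dual-feasible points. It uses the fact that for linear activation the dual constraint reduces to $\|X^T\lambda\|_2 \leq \beta$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m, beta = 20, 5, 8, 0.1
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)

def primal_obj(W1, w2):
    # Objective of (1) with linear activation phi(z) = z.
    return (0.5 * np.linalg.norm(X @ W1 @ w2 - y) ** 2
            + 0.5 * beta * (np.linalg.norm(W1) ** 2 + np.linalg.norm(w2) ** 2))

def dual_obj(lam):
    # Objective of (2).
    return -0.5 * np.linalg.norm(lam - y) ** 2 + 0.5 * np.linalg.norm(y) ** 2

def dual_feasible(lam):
    # For phi(z) = z: max_{||w1||_2 <= 1} |lam^T X w1| = ||X^T lam||_2.
    return np.linalg.norm(X.T @ lam) <= beta + 1e-12

# Weak duality: every primal value upper-bounds every feasible dual value.
for _ in range(100):
    W1 = rng.standard_normal((d, m))
    w2 = rng.standard_normal(m)
    lam = rng.standard_normal(N)
    lam *= beta / np.linalg.norm(X.T @ lam)  # rescale onto the dual feasible set
    assert dual_feasible(lam)
    assert primal_obj(W1, w2) >= dual_obj(lam) - 1e-9
```

This only certifies the easy direction $P \geq D$; the content of the strong duality results discussed above is that the inequality is tight for two-layer networks.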
We observe that the dual of the convex dual problem of the non-convex minimum norm problem of deep networks corresponds to a minimum norm problem for deep networks with parallel branches. Based on this characterization, we propose a modified architecture for which strong duality holds regardless of depth.

1.4. ORGANIZATION

This paper is organized as follows. In Section 2, we review standard neural networks and introduce parallel architectures. For deep linear networks, we derive primal and dual problems for both standard and parallel architectures and calculate the optimal values of these problems in Section 3. We derive primal and dual problems for three-layer ReLU networks with the standard architecture and precisely calculate their optimal values for whitened data in Section 4. We also show that deep ReLU networks with parallel structures have no duality gap.

2. STANDARD NEURAL NETWORKS VS PARALLEL ARCHITECTURES

We briefly review the convex duality theory for two-layer neural networks in Appendix A. To extend the theory to deep neural networks, we first consider the $L$-layer neural network with the standard architecture:

$$f_{\theta}(X) = A_{L-1} W_L, \quad A_l = \phi(A_{l-1} W_l), \ \forall l \in [L-1], \quad A_0 = X, \quad (3)$$

where $\phi$ is the activation function, $W_l \in \mathbb{R}^{m_{l-1} \times m_l}$ is the weight matrix in the $l$-th layer, and $\theta = (W_1, \ldots, W_L)$ represents the parameters of the neural network. We then introduce the neural network with the parallel architecture:

$$f^{\mathrm{prl}}_{\theta}(X) = A_{L-1} W_L, \quad A_{l,j} = \phi(A_{l-1,j} W_{l,j}), \ \forall l \in [L-1], \quad A_{0,j} = X, \ \forall j \in [m]. \quad (4)$$

Here, for $l \in [L-1]$, the $l$-th layer has $m$ weight matrices $W_{l,j} \in \mathbb{R}^{m_{l-1} \times m_l}$, where $j \in [m]$. Specifically, we let $m_{L-1} = 1$ so that each parallel branch is a scalar-output neural network. In short, we can view the output $A_{L-1}$ of a parallel neural network as a concatenation of $m$ scalar-output standard neural networks. In Figures 1 and 2, we provide examples of neural networks with standard and parallel architectures. We emphasize that for $L = 2$, the standard neural network is identical to the parallel neural network. We next present a summary of our main result.

Theorem 1 (main result) For $L \geq 3$, there exists an activation function $\phi$ and an $L$-layer standard neural network defined in (3) such that strong duality does not hold, i.e., $P > D$. In contrast, for any $L$-layer parallel neural network defined in (4) with linear or ReLU activations and a sufficiently large number of branches, strong duality holds, i.e., $P = D$.

We elaborate on the primal problem with optimal value $P$ and the dual problem with optimal value $D$ in Sections 3 and 4. We first consider the neural network with the standard architecture, i.e., $f_{\theta}(X) = XW_1 \cdots W_L$, and study the corresponding minimum norm optimization problem.
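The two architectures can be sketched in a few lines of numpy (a minimal illustration; the shapes and function names are ours, not the paper's). The final assertion checks the claim above that for $L = 2$ the standard and parallel networks coincide, since each scalar-output branch is a single column of $W_1$.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def standard_forward(X, weights, phi=relu):
    # f(X) = phi(...phi(X W1)...) W_L: activation on all but the last layer.
    A = X
    for W in weights[:-1]:
        A = phi(A @ W)
    return A @ weights[-1]

def parallel_forward(X, branches, W_L, phi=relu):
    # Each branch is a list of L-1 matrices ending in a column vector
    # (m_{L-1} = 1), so A_{L-1} stacks m scalar-output branches; the last
    # layer then mixes the branch outputs.
    cols = []
    for branch in branches:
        A = X
        for W in branch:
            A = phi(A @ W)
        cols.append(A)              # shape (N, 1)
    return np.hstack(cols) @ W_L    # (N, m) @ (m, K)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))

# For L = 2 the architectures coincide: branch j is the j-th column of W1.
W1 = rng.standard_normal((4, 6))
W2 = rng.standard_normal((6, 3))
std = standard_forward(X, [W1, W2])
prl = parallel_forward(X, [[W1[:, j:j + 1]] for j in range(6)], W2)
assert np.allclose(std, prl)
```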

3. DEEP LINEAR NETWORKS

3.1. STANDARD DEEP LINEAR NETWORKS

Consider the following minimum norm optimization problem:

$$P_{\mathrm{lin}} = \min_{\{W_l\}_{l=1}^L} \frac{1}{2}\sum_{l=1}^L \|W_l\|_F^2, \quad \text{s.t.} \ XW_1 \cdots W_L = Y, \quad (5)$$

where the variables are $W_1, \ldots, W_L$. As shown in Proposition 3.1 of Ergen & Pilanci (2021a), by introducing a scale parameter $t$, problem (5) can be reformulated as $P_{\mathrm{lin}} = \min_{t>0} \frac{L-2}{2}t^2 + P_{\mathrm{lin}}(t)$, where the subproblem $P_{\mathrm{lin}}(t)$ is defined as

$$P_{\mathrm{lin}}(t) = \min_{\{W_l\}_{l=1}^L} \sum_{j} \|w^{\mathrm{row}}_{L,j}\|_2, \quad \text{s.t.} \ XW_1 \cdots W_L = Y, \ \|W_i\|_F \leq t, \ i \in [L-2], \ \|w^{\mathrm{col}}_{L-1,j}\|_2 \leq 1, \ j \in [m_{L-1}].$$

To be specific, these two formulations have the same optimal value, and the optimal solutions of one problem can be rescaled into the optimal solutions of the other. Based on the rescaling of parameters in $P_{\mathrm{lin}}(t)$, we characterize the dual problem of $P_{\mathrm{lin}}(t)$ and its bi-dual, i.e., the dual of the dual problem.

Proposition 1 The dual problem of $P_{\mathrm{lin}}(t)$ is a convex optimization problem given by

$$D_{\mathrm{lin}}(t) = \max_{\Lambda} \operatorname{tr}(\Lambda^T Y), \quad \text{s.t.} \ \max_{\|W_i\|_F \leq t, i \in [L-2], \ \|w_{L-1}\|_2 \leq 1} \|\Lambda^T X W_1 \cdots W_{L-2} w_{L-1}\|_2 \leq 1.$$

There exists a threshold on the number of branches $m^* \leq KN + 1$ such that $D_{\mathrm{lin}}(t) = BD_{\mathrm{lin}}(t)$, where $BD_{\mathrm{lin}}(t)$ is the optimal value of the bi-dual problem

$$BD_{\mathrm{lin}}(t) = \min_{\{W_{l,j}\}_{l \in [L], j \in [m^*]}} \sum_{j=1}^{m^*} \|w^{\mathrm{row}}_{L,j}\|_2, \quad \text{s.t.} \ \sum_{j=1}^{m^*} X W_{1,j} \cdots W_{L-2,j} w^{\mathrm{col}}_{L-1,j} w^{\mathrm{row}}_{L,j} = Y, \ \|W_{i,j}\|_F \leq t, \ i \in [L-2], \ j \in [m^*], \ \|w^{\mathrm{col}}_{L-1,j}\|_2 \leq 1, \ j \in [m^*].$$

Detailed derivations of the dual and bi-dual problems are provided in Appendix C.1. As $\Lambda = 0$ is a strictly feasible point for the dual problem, optimal dual solutions exist due to classical results on strong duality for convex problems. The reason why we do not directly take the dual of $P_{\mathrm{lin}}$ is that the objective function in $P_{\mathrm{lin}}$ involves the weights of the first $L-1$ layers, which prevents obtaining a non-trivial dual problem. An interesting observation is that the bi-dual problem is related to the minimum norm problem of a parallel neural network with balanced weights.
Namely, the Frobenius norms of the weight matrices $\{W_{l,j}\}_{l=1}^{L-2}$ in each branch $j \in [m]$ share the same upper bound $t$. To calculate the value $P_{\mathrm{lin}}(t)$ for fixed $t$, we introduce the definition of the Schatten-$p$ norm.

Definition 1 For a matrix $A \in \mathbb{R}^{m \times n}$ and $p > 0$, the Schatten-$p$ quasi-norm of $A$ is defined as

$$\|A\|_{S_p} = \left(\sum_{i=1}^{\min\{m,n\}} \sigma_i^p(A)\right)^{1/p},$$

where $\sigma_i(A)$ is the $i$-th largest singular value of $A$.

The following proposition provides a closed-form solution for the subproblem $P_{\mathrm{lin}}(t)$ and determines its optimal value.

Proposition 2 Suppose that $W \in \mathbb{R}^{d \times K}$ with rank $r$ is given. Assume that $m_l \geq r$ for $l = 1, \ldots, L-1$. Consider the following optimization problem:

$$\min_{\{W_l\}_{l=1}^L} \frac{1}{2}\left(\|W_1\|_F^2 + \cdots + \|W_L\|_F^2\right), \quad \text{s.t.} \ W_1 W_2 \cdots W_L = W. \quad (7)$$

Then, the optimal value of problem (7) is given by $\frac{L}{2}\|W\|_{S_{2/L}}^{2/L}$. Suppose that $W = U\Sigma V^T$ is the singular value decomposition. The optimal value is achieved when $W_l = U_{l-1}\Sigma^{1/L}U_l^T$ for $l = 1, \ldots, L$. Here $U_0 = U$, $U_L = V$, and for $l = 1, \ldots, L-1$, $U_l \in \mathbb{R}^{m_l \times r}$ satisfies $U_l^T U_l = I_r$.

To the best of our knowledge, this result was not known previously. Proposition 2 implies that $P_{\mathrm{lin}}$ can be equivalently written as $\min_W \frac{L}{2}\|W\|_{S_{2/L}}^{2/L}$ s.t. $XW = Y$. Denote by $X^{\dagger}$ the pseudo-inverse of $X$. Although the objective is non-convex for $L \geq 3$, this problem has a closed-form solution, as we show next.

Theorem 2 Suppose that $X^{\dagger}Y = U\Sigma V^T$ is the singular value decomposition and let $r := \operatorname{rank}(X^{\dagger}Y)$. Assume that $m_l \geq r$ for $l = 1, \ldots, L-1$. The optimal solution to $P_{\mathrm{lin}}$ is given in closed form as

$$W_l = U_{l-1}\Sigma^{1/L}U_l^T, \quad l \in [L],$$

where $U_0 = U$, $U_L = V$, and for $l = 1, \ldots, L-1$, $U_l \in \mathbb{R}^{m_l \times r}$ satisfies $U_l^T U_l = I_r$.

Based on Theorem 2, the optimal values of $P_{\mathrm{lin}}(t)$ and $D_{\mathrm{lin}}(t)$ can be precisely calculated as follows.

Theorem 3 Assume that $m_l \geq \operatorname{rank}(X^{\dagger}Y)$ for $l = 1, \ldots, L-1$.
For fixed $t > 0$, the optimal values of $P_{\mathrm{lin}}(t)$ and $D_{\mathrm{lin}}(t)$ are given by

$$P_{\mathrm{lin}}(t) = t^{-(L-2)}\|X^{\dagger}Y\|_{S_{2/L}}, \quad (10)$$

and

$$D_{\mathrm{lin}}(t) = t^{-(L-2)}\|X^{\dagger}Y\|_*. \quad (11)$$

As a result, if the singular values of $X^{\dagger}Y$ are not all equal, a duality gap exists, i.e., $P > D$, for standard deep linear networks with $L \geq 3$. We note that the optimal scale parameter $t$ for the primal problem $P_{\mathrm{lin}}$ is given by $t^* = \|W^*\|_{S_{2/L}}^{1/L}$. This proves the first part of Theorem 1. We conclude that the deep linear network training problem has a duality gap whenever the depth is three or more. In contrast, there is no duality gap for depth two. Nevertheless, the optimal solution can be obtained in closed form, as we have shown. In the following section, we introduce a parallel multi-branch architecture that always has zero duality gap regardless of the depth.
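The closed-form solution of Theorem 2 and the primal/dual values of Theorem 3 can be checked numerically. The following sketch (our own illustration on random data; variable names are ours) builds the factors $W_l = U_{l-1}\Sigma^{1/L}U_l^T$ with widths $m_l = r$ (so that each $U_l$ can be taken as the identity), verifies the objective value from Proposition 2, and confirms that the Schatten-$2/L$ quasi-norm dominates the nuclear norm, which is the source of the duality gap.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, K, L, t = 10, 6, 3, 3, 1.5
X = rng.standard_normal((N, d))
Y = rng.standard_normal((N, K))

W = np.linalg.pinv(X) @ Y                       # X^+ Y, full rank K here
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Theorem 2: closed-form factors with widths m_l = r, so U_l = I_r.
Sl = np.diag(s ** (1.0 / L))
factors = [U @ Sl] + [Sl] * (L - 2) + [Sl @ Vt]
prod = factors[0]
for F in factors[1:]:
    prod = prod @ F
assert np.allclose(prod, W)                     # factors multiply back to X^+ Y

# Proposition 2: the objective value equals (L/2) * sum_i sigma_i^{2/L}.
obj = 0.5 * sum(np.linalg.norm(F) ** 2 for F in factors)
assert np.isclose(obj, 0.5 * L * np.sum(s ** (2.0 / L)))

# Theorem 3: P_lin(t) uses the Schatten-2/L quasi-norm, D_lin(t) the nuclear
# norm; for p = 2/L < 1 the former dominates, hence a non-zero duality gap.
P_t = t ** (-(L - 2)) * np.sum(s ** (2.0 / L)) ** (L / 2.0)
D_t = t ** (-(L - 2)) * np.sum(s)
assert P_t > D_t
```

This sketch only verifies that the closed-form construction attains the stated value and that the gap is strict for generic data; the optimality arguments are in Appendix C.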

3.2. PARALLEL DEEP LINEAR NEURAL NETWORKS

Now we consider the parallel multi-branch network structure defined in Section 2 and the corresponding minimum norm optimization problem:

$$\min_{\{W_{l,j}\}_{l \in [L], j \in [m]}} \frac{1}{2}\left(\sum_{l=1}^{L-1}\sum_{j=1}^m \|W_{l,j}\|_F^2 + \|W_L\|_F^2\right), \quad \text{s.t.} \ \sum_{j=1}^m X W_{1,j} \cdots W_{L-2,j} w^{\mathrm{col}}_{L-1,j} w^{\mathrm{row}}_{L,j} = Y. \quad (12)$$

Through a rescaling that achieves the lower bound of the inequality of arithmetic and geometric means, we can reformulate problem (12) as follows. In other words, formulations (12) and (13) have the same optimal value, and the optimal solutions of one problem can be mapped to the optimal solutions of the other.

Proposition 3 Problem (12) can be formulated as

$$\min_{\{W_{l,j}\}_{l \in [L], j \in [m]}} \frac{L}{2}\sum_{j=1}^m \|w^{\mathrm{row}}_{L,j}\|_2^{2/L}, \quad \text{s.t.} \ \sum_{j=1}^m X W_{1,j} \cdots W_{L-2,j} w^{\mathrm{col}}_{L-1,j} w^{\mathrm{row}}_{L,j} = Y, \ \|W_{l,j}\|_F \leq 1, \ l \in [L-2], \ j \in [m], \ \|w^{\mathrm{col}}_{L-1,j}\|_2 \leq 1, \ j \in [m]. \quad (13)$$

We note that $z^{2/L}$ is a non-convex function of $z$, so we cannot hope to obtain a non-trivial dual. To resolve this issue, we consider the $\|\cdot\|_F^L$-regularized objective given by

$$P^{\mathrm{prl}}_{\mathrm{lin}} = \min_{\{W_{l,j}\}_{l \in [L], j \in [m]}} \frac{1}{2}\left(\sum_{l=1}^{L-1}\sum_{j=1}^m \|W_{l,j}\|_F^L + \sum_{j=1}^m \|w^{\mathrm{row}}_{L,j}\|_2^L\right), \quad \text{s.t.} \ \sum_{j=1}^m X W_{1,j} \cdots W_{L-2,j} w^{\mathrm{col}}_{L-1,j} w^{\mathrm{row}}_{L,j} = Y. \quad (14)$$

Utilizing the arithmetic-geometric mean (AM-GM) inequality, we can rescale the parameters and reformulate (14). To be specific, formulations (14) and (15) have the same optimal value, and the optimal solutions of one problem can be rescaled into the optimal solutions of the other, and vice versa.

Proposition 4 Problem (14) can be formulated as

$$P^{\mathrm{prl}}_{\mathrm{lin}} = \min_{\{W_{l,j}\}_{l \in [L], j \in [m]}} \frac{L}{2}\sum_{j=1}^m \|w^{\mathrm{row}}_{L,j}\|_2, \quad \text{s.t.} \ \sum_{j=1}^m X W_{1,j} \cdots W_{L-2,j} w^{\mathrm{col}}_{L-1,j} w^{\mathrm{row}}_{L,j} = Y, \ \|W_{l,j}\|_F \leq 1, \ l \in [L-2], \ j \in [m], \ \|w^{\mathrm{col}}_{L-1,j}\|_2 \leq 1, \ j \in [m]. \quad (15)$$

The dual problem of $P^{\mathrm{prl}}_{\mathrm{lin}}$ is the convex problem

$$D^{\mathrm{prl}}_{\mathrm{lin}} = \max_{\Lambda} \operatorname{tr}(\Lambda^T Y), \quad \text{s.t.} \ \max_{\|W_i\|_F \leq 1, i \in [L-2], \ \|w_{L-1}\|_2 \leq 1} \|\Lambda^T X W_1 \cdots W_{L-2} w_{L-1}\|_2 \leq L/2. \quad (16)$$

In contrast to the standard linear network model, strong duality holds for the parallel linear network training problem (14).
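The AM-GM rescaling behind Propositions 3 and 4 can be checked numerically on a single linear branch (a minimal sketch with our own shapes and names): scaling layer $l$ by $c_l > 0$ with $\prod_l c_l = 1$ leaves the network function unchanged, and the $\|\cdot\|_F^L$ objective is minimized when all layer norms are equal, where it evaluates to $\frac{L}{2}\prod_l \|W_l\|_F$.

```python
import numpy as np

rng = np.random.default_rng(3)
L = 4
# One linear branch with arbitrary (unbalanced) layer weights.
dims = [5, 4, 3, 2, 1]
Ws = [rng.standard_normal((dims[l], dims[l + 1])) for l in range(L)]
X = rng.standard_normal((7, dims[0]))

def objective(weights):
    # (1/2) * sum_l ||W_l||_F^L: the L-th-power regularizer of (14), one branch.
    return 0.5 * sum(np.linalg.norm(W) ** L for W in weights)

# Rescale W_l -> c_l W_l with prod(c_l) = 1: the output X W_1...W_L is
# unchanged, and AM-GM says the objective is minimized when all norms agree.
norms = np.array([np.linalg.norm(W) for W in Ws])
target = norms.prod() ** (1.0 / L)                  # common balanced norm
balanced = [W * (target / n) for W, n in zip(Ws, norms)]

out, out_b = X.copy(), X.copy()
for W, Wb in zip(Ws, balanced):
    out = out @ W
    out_b = out_b @ Wb

assert np.allclose(out, out_b)                      # same network function
assert objective(balanced) <= objective(Ws) + 1e-9  # never worse after balancing
assert np.isclose(objective(balanced), 0.5 * L * norms.prod())
```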
Theorem 4 There exists a critical width $m^* \leq KN + 1$ such that as long as the number of branches satisfies $m \geq m^*$, strong duality holds for problem (14), namely $P^{\mathrm{prl}}_{\mathrm{lin}} = D^{\mathrm{prl}}_{\mathrm{lin}}$. The optimal values are both $\frac{L}{2}\|X^{\dagger}Y\|_*$.

This implies that there exist equivalent convex problems which achieve the global optimum of the deep parallel linear network. Comparatively, optimizing deep parallel linear neural networks can be much easier than optimizing deep standard linear networks.

4. DEEP RELU NETWORKS

4.1. STANDARD THREE-LAYER RELU NETWORKS

We first focus on the three-layer ReLU network with the standard architecture. Specifically, we set $\phi(z) = \max\{z, 0\}$ and denote $(z)_+ = \max\{z, 0\}$. Consider the minimum norm problem

$$P_{\mathrm{ReLU}} = \min_{\{W_i\}_{i=1}^3} \frac{1}{2}\sum_{i=1}^3 \|W_i\|_F^2, \quad \text{s.t.} \ ((XW_1)_+ W_2)_+ W_3 = Y. \quad (17)$$

Similarly, by introducing a scale parameter $t$, this problem can be formulated as $P_{\mathrm{ReLU}} = \min_{t>0} \frac{1}{2}t^2 + P_{\mathrm{ReLU}}(t)$, where $P_{\mathrm{ReLU}}(t)$ is defined as

$$P_{\mathrm{ReLU}}(t) = \min_{\{W_i\}_{i=1}^3} \sum_j \|w^{\mathrm{row}}_{3,j}\|_2, \quad \text{s.t.} \ \|W_1\|_F \leq t, \ \|w^{\mathrm{col}}_{2,j}\|_2 \leq 1, \ j \in [m_2], \ ((XW_1)_+ W_2)_+ W_3 = Y. \quad (18)$$

The proof is analogous to the proof of Proposition 3.1 in Ergen & Pilanci (2021a). To be specific, these two formulations have the same optimal value, and their optimal solutions can be mutually transformed into each other. For $W_1 \in \mathbb{R}^{d \times m}$, we define the set

$$\mathcal{A}(W_1) = \{((XW_1)_+ w_2)_+ \mid \|w_2\|_2 \leq 1\}. \quad (19)$$

We derive the convex dual problem of $P_{\mathrm{ReLU}}(t)$ in the following proposition.

Proposition 5 The dual problem of $P_{\mathrm{ReLU}}(t)$ defined in (18) is the convex problem

$$D_{\mathrm{ReLU}}(t) = \max_{\Lambda} \operatorname{tr}(\Lambda^T Y), \quad \text{s.t.} \ \max_{W_1 : \|W_1\|_F \leq t} \ \max_{v \in \mathcal{A}(W_1)} \|\Lambda^T v\|_2 \leq 1. \quad (20)$$

There exists a threshold on the number of branches $m^* \leq KN + 1$ such that $D_{\mathrm{ReLU}}(t) = BD_{\mathrm{ReLU}}(t)$, where $BD_{\mathrm{ReLU}}(t)$ is the optimal value of the bi-dual problem

$$BD_{\mathrm{ReLU}}(t) = \min_{\{W_{1,j}\}_{j=1}^{m^*}, W_2 \in \mathbb{R}^{m_1 \times m^*}, W_3 \in \mathbb{R}^{m^* \times K}} \sum_j \|w^{\mathrm{row}}_{3,j}\|_2, \quad \text{s.t.} \ \sum_{j=1}^{m^*} ((XW_{1,j})_+ w^{\mathrm{col}}_{2,j})_+ w^{\mathrm{row}}_{3,j} = Y, \ \|W_{1,j}\|_F \leq t, \ \|w^{\mathrm{col}}_{2,j}\|_2 \leq 1, \ j \in [m^*]. \quad (21)$$

We note that the bi-dual problem defined in (21) indeed optimizes over a parallel neural network satisfying $\|W_{1,j}\|_F \leq t$, $\|w^{\mathrm{col}}_{2,j}\|_2 \leq 1$, $j \in [m^*]$. For the case where the data matrix has rank one and the neural network has a scalar output, we show that there is no duality gap. We extend the result in Ergen & Pilanci (2021d) from two-layer ReLU networks to three-layer ReLU networks.

Theorem 5 For a three-layer scalar-output ReLU network, let $X = c a_0^T$ be a rank-one data matrix. Then, strong duality holds, i.e., $P_{\mathrm{ReLU}}(t) = D_{\mathrm{ReLU}}(t)$.
Suppose that $\lambda^*$ is the optimal solution to the dual problem $D_{\mathrm{ReLU}}(t)$. Then the optimal weights of the first two layers can be written as

$$W_1 = t \operatorname{sign}\left(|(\lambda^*)^T (c)_+| - |(\lambda^*)^T (-c)_+|\right) \rho_0 \rho_1^T, \quad w_2 = \rho_1,$$

where $\rho_0 = a_0 / \|a_0\|_2$ and $\rho_1 \in \mathbb{R}^{m_1}_+$ satisfies $\|\rho_1\|_2 = 1$. For general standard three-layer neural networks, although we have $BD_{\mathrm{ReLU}}(t) = D_{\mathrm{ReLU}}(t)$, it may not hold that $P_{\mathrm{ReLU}}(t) = D_{\mathrm{ReLU}}(t)$, as the bi-dual problem corresponds to optimizing a parallel neural network rather than a standard neural network to fit the labels. To show theoretically that the duality gap can be zero, we consider a parallel multi-branch architecture for ReLU networks in the next section.
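The rank-one construction in Theorem 5 can be sanity-checked numerically. The sketch below (our own illustration; data, sizes, and the sign factor are chosen by us rather than derived from a dual solution $\lambda^*$) verifies the key structural fact: with $X = c a_0^T$ and $W_1 = t s \rho_0 \rho_1^T$, $w_2 = \rho_1$, the hidden representation $((XW_1)_+ w_2)_+$ collapses to a rectified, rescaled copy of $c$.

```python
import numpy as np

rng = np.random.default_rng(7)
N, d, m1, t = 6, 4, 5, 2.0

# Rank-one data X = c a0^T as in Theorem 5.
c = rng.standard_normal(N)
a0 = rng.standard_normal(d)
X = np.outer(c, a0)

rho0 = a0 / np.linalg.norm(a0)
rho1 = rng.random(m1)                  # entrywise nonnegative direction
rho1 /= np.linalg.norm(rho1)
s = 1.0                                # sign factor (from lambda* in the theorem)

W1 = t * s * np.outer(rho0, rho1)
w2 = rho1

# With rank-one data the first two layers collapse: ((X W1)_+ w2)_+ is a
# rectified, rescaled copy of c.
hidden = np.maximum(np.maximum(X @ W1, 0) @ w2, 0)
expected = t * np.linalg.norm(a0) * np.maximum(s * c, 0)
assert np.allclose(hidden, expected)
```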

4.2. PARALLEL DEEP RELU NETWORKS

For the corresponding parallel architecture, we show that there is no duality gap for ReLU networks of arbitrary depth, as long as the number of branches is large enough. Consider the following minimum norm problem:

$$P^{\mathrm{prl}}_{\mathrm{ReLU}} = \min \frac{1}{2}\left(\sum_{l=1}^{L-1}\sum_{j=1}^m \|W_{l,j}\|_F^L + \sum_{j=1}^m \|w^{\mathrm{row}}_{L,j}\|_2^L\right), \quad \text{s.t.} \ \sum_{j=1}^m (((XW_{1,j})_+ \cdots W_{L-2,j})_+ w^{\mathrm{col}}_{L-1,j})_+ w^{\mathrm{row}}_{L,j} = Y. \quad (22)$$

As the ReLU activation is homogeneous, we can rescale the parameters to reformulate (22) and derive the dual problem. We note that formulations (22) and (23) have the same optimal value, and the optimal solutions of one problem can be rescaled into the optimal solutions of the other, and vice versa.

Proposition 6 Problem (22) can be reformulated as

$$\min \frac{L}{2}\sum_{j=1}^m \|w^{\mathrm{row}}_{L,j}\|_2, \quad \text{s.t.} \ \sum_{j=1}^m (((XW_{1,j})_+ \cdots W_{L-2,j})_+ w^{\mathrm{col}}_{L-1,j})_+ w^{\mathrm{row}}_{L,j} = Y, \ \|W_{l,j}\|_F \leq 1, \ l \in [L-2], \ \|w^{\mathrm{col}}_{L-1,j}\|_2 \leq 1, \ j \in [m]. \quad (23)$$

The dual problem of (23) is the convex problem

$$D^{\mathrm{prl}}_{\mathrm{ReLU}} = \max_{\Lambda} \operatorname{tr}(\Lambda^T Y), \quad \text{s.t.} \ \max_{v = (((XW_1)_+ \cdots W_{L-2})_+ w_{L-1})_+, \ \|W_l\|_F \leq 1, l \in [L-2], \ \|w_{L-1}\|_2 \leq 1} \|\Lambda^T v\|_2 \leq L/2. \quad (24)$$

For deep parallel ReLU networks, we show that with a sufficient number of parallel branches, strong duality holds, i.e., $P = D$.

Theorem 6 Let $m^*$ be the threshold on the number of branches, which is upper bounded by $KN + 1$. Then, as long as the number of branches satisfies $m \geq m^*$, strong duality holds for (23) in the sense that $P^{\mathrm{prl}}_{\mathrm{ReLU}} = D^{\mathrm{prl}}_{\mathrm{ReLU}}$.

Similar to the case of parallel deep linear networks, the parallel deep ReLU network also achieves a zero duality gap. Therefore, finding the global optimum of a parallel deep ReLU network is equivalent to solving a convex program. This proves the second part of Theorem 1. Based on the strong duality results, assuming that we can obtain an optimal solution to the convex dual problem (24), we can construct an optimal solution to the primal problem (23) as follows.

Theorem 7 Let $\Lambda^*$ be the optimal solution to (24).
Denote the set of maximizers

$$\arg\max_{v = (((XW_1)_+ \cdots W_{L-2})_+ w_{L-1})_+, \ \|W_l\|_F \leq 1, l \in [L-2], \ \|w_{L-1}\|_2 \leq 1} \|(\Lambda^*)^T v\|_2 \quad (25)$$

as $\{v_1, \ldots, v_{m^*}\}$, where $v_i = (((XW_{1,i})_+ \cdots W_{L-2,i})_+ w_{L-1,i})_+$ with $\|W_{l,i}\|_F \leq 1$, $l \in [L-2]$, $\|w_{L-1,i}\|_2 \leq 1$, and $m^* \leq KN + 1$ is the critical threshold on the number of branches. Let $w^{\mathrm{row}}_{L,1}, \ldots, w^{\mathrm{row}}_{L,m^*}$ be an optimal solution to the convex problem

$$P^{\mathrm{prl,sub}}_{\mathrm{ReLU}} = \min_{W_L} \frac{L}{2}\sum_{j=1}^{m^*} \|w^{\mathrm{row}}_{L,j}\|_2, \quad \text{s.t.} \ \sum_{j=1}^{m^*} v_j w^{\mathrm{row}}_{L,j} = Y. \quad (26)$$

Then, $(W_1, \ldots, W_L)$ is an optimal solution to (23). We note that finding the set of maximizers in (25) can be challenging in practice due to the high dimensionality of the constraint set.
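The reformulations in this section rest on the positive homogeneity of the ReLU: for $\alpha > 0$, $(\alpha z)_+ = \alpha (z)_+$, so scaling the hidden layers of a branch by arbitrary positive factors and dividing the outer weight by their product leaves the branch function unchanged. A minimal numpy sketch (our own shapes and names):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def branch_output(X, Ws, w_out):
    # One parallel branch: ReLU through L-1 layers, the last matrix being a
    # column vector (scalar-output branch), then the outer weight row w_out.
    A = X
    for W in Ws:
        A = relu(A @ W)
    return A @ w_out                       # (N, 1) @ (1, K)

rng = np.random.default_rng(4)
X = rng.standard_normal((6, 5))
Ws = [rng.standard_normal((5, 4)), rng.standard_normal((4, 3)),
      rng.standard_normal((3, 1))]
w_out = rng.standard_normal((1, 2))

# Positive homogeneity: scaling layer l by alpha_l > 0 and the outer weight
# by 1/prod(alpha_l) leaves the branch function unchanged.
alphas = np.array([0.5, 3.0, 1.7])
Ws_scaled = [a * W for a, W in zip(alphas, Ws)]
w_scaled = w_out / alphas.prod()

assert np.allclose(branch_output(X, Ws, w_out),
                   branch_output(X, Ws_scaled, w_scaled))
```

This invariance is exactly what lets the constraints in (23) normalize every hidden-layer norm to at most one while pushing all of the scale into the outer weights.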

5. CONCLUSION

We present a convex duality framework for standard neural networks, considering both multi-layer linear networks and three-layer ReLU networks with rank-1 data. In stark contrast to the two-layer case, the duality gap can be non-zero for neural networks with depth three or more. Meanwhile, for neural networks with a parallel architecture, regularized by the $L$-th power of the Frobenius norm of the parameters, we show that strong duality holds and the duality gap reduces to zero. A limitation of our work is that we primarily focus on minimum norm interpolation problems. We believe that our results can be readily generalized to regularized training problems with general loss functions, including the squared loss, logistic loss, and hinge loss. Another interesting research direction is investigating the complexity of solving our convex dual problems. Although the number of variables can be high for deep networks, the convex duality framework offers a rigorous theoretical perspective on the structure of optimal solutions. These problems can also shed light on the optimization landscape of their equivalent non-convex formulations. We note that it is not yet clear whether convex formulations of deep networks present practical gains in training. However, Mishkin et al. (2022); Pilanci & Ergen (2020) showed that convex formulations provide significant computational speed-ups in training two-layer neural networks. Furthermore, similar convex analyses have been applied to various architectures including batch normalization (Ergen et al., 2022b), vector-output networks (Sahiner et al., 2021), threshold and polynomial activation networks (Ergen et al., 2023; Bartan & Pilanci, 2021), GANs (Sahiner et al., 2022a), autoregressive models (Gupta et al., 2021), and Transformers (Ergen et al., 2022a; Sahiner et al., 2022b).

A CONVEX DUALITY FOR TWO-LAYER NEURAL NETWORKS

We briefly review the convex duality theory for two-layer neural networks introduced in Ergen & Pilanci (2021a; 2020a). Consider the following weight-decay regularized training problem for a vector-output neural network with $m$ hidden neurons:

$$\min_{W_1, W_2} \frac{1}{2}\|\phi(XW_1)W_2 - Y\|_F^2 + \frac{\beta}{2}\left(\|W_1\|_F^2 + \|W_2\|_F^2\right), \quad (27)$$

where $W_1 \in \mathbb{R}^{d \times m}$ and $W_2 \in \mathbb{R}^{m \times K}$ are the variables and $\beta > 0$ is a regularization parameter. Here $\phi$ is the activation function, which can be linear, $\phi(z) = z$, or ReLU, $\phi(z) = \max\{z, 0\}$. As long as the network is sufficiently overparameterized, there exists a feasible point such that $\phi(XW_1)W_2 = Y$. Then, a minimum norm variant of the training problem in (27) is given by

$$\min_{W_1, W_2} \frac{1}{2}\left(\|W_1\|_F^2 + \|W_2\|_F^2\right), \quad \text{s.t.} \ \phi(XW_1)W_2 = Y. \quad (28)$$

As shown in Pilanci & Ergen (2020), after a suitable rescaling, this problem can be reformulated as

$$\min_{W_1, W_2} \sum_{j=1}^m \|w^{\mathrm{row}}_{2,j}\|_2, \quad \text{s.t.} \ \phi(XW_1)W_2 = Y, \ \|w^{\mathrm{col}}_{1,j}\|_2 \leq 1, \ j \in [m], \quad (29)$$

where $[m] = \{1, \ldots, m\}$, $w^{\mathrm{row}}_{2,j}$ represents the $j$-th row of $W_2$, and $w^{\mathrm{col}}_{1,j}$ denotes the $j$-th column of $W_1$. The rescaling does not change the optimal value of (28). By taking the dual with respect to $W_1$ and $W_2$, the dual problem of (29) is the convex optimization problem

$$\max_{\Lambda} \operatorname{tr}(\Lambda^T Y), \quad \text{s.t.} \ \max_{u : \|u\|_2 \leq 1} \|\Lambda^T \phi(Xu)\|_2 \leq 1, \quad (30)$$

where $\Lambda \in \mathbb{R}^{N \times K}$ is the dual variable. Provided that $m \geq m^*$, where $m^*$ is a critical width threshold upper bounded by $m^* \leq N + 1$, strong duality holds, i.e., the optimal value of the primal problem (29) equals the optimal value of the dual problem (30).
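For linear activation $\phi(z) = z$, the dual constraint in (30) has a closed form: $\max_{\|u\|_2 \leq 1} \|\Lambda^T X u\|_2$ is the spectral norm of $\Lambda^T X$. The following sketch (our own illustration with random data) compares the closed form against a Monte-Carlo search over unit vectors, which lower-bounds and, with enough samples, approaches it.

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, K = 15, 6, 3
X = rng.standard_normal((N, d))
Lam = rng.standard_normal((N, K))

# For phi(z) = z the dual constraint max_{||u||_2 <= 1} ||Lam^T X u||_2 <= 1
# is exactly a spectral-norm bound: ||Lam^T X||_2 <= 1.
spectral = np.linalg.norm(Lam.T @ X, 2)    # largest singular value

# Monte-Carlo lower bound from random unit directions u.
u = rng.standard_normal((d, 20000))
u /= np.linalg.norm(u, axis=0)
sampled = np.linalg.norm(Lam.T @ X @ u, axis=0).max()

assert sampled <= spectral + 1e-9          # every unit u is feasible
assert sampled >= 0.9 * spectral           # the sampled max approaches it
```

For ReLU activation the inner maximization is non-concave and has no such closed form, which is why the convex dual (30) is stated with the maximization left implicit.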

B DEEP LINEAR NETWORKS WITH GENERAL LOSS FUNCTIONS

We consider deep linear networks with general loss functions, i.e.,

$$\min_{\{W_l\}_{l=1}^L} \ell(XW_1 \cdots W_L, Y) + \frac{\beta}{2}\sum_{i=1}^L \|W_i\|_F^2, \quad (31)$$

where $\ell(Z, Y)$ is a general loss function and $\beta > 0$ is a regularization parameter. According to Proposition 2, the above problem is equivalent to

$$\min_W \ell(XW, Y) + \frac{\beta L}{2}\|W\|_{S_{2/L}}^{2/L}.$$

The $\ell_2$ regularization term becomes the Schatten-$2/L$ quasi-norm of $W$ raised to the power $2/L$. Suppose that there exists $W$ such that $\ell(XW, Y) = 0$. As $\beta \to 0$, the optimal solution to problem (31) asymptotically converges to the optimal solution of

$$\min_W \|W\|_{S_{2/L}}^{2/L}, \quad \text{s.t.} \ \ell(XW, Y) = 0.$$

In other words, the $\ell_2$ regularization explicitly regularizes the training problem toward a low-rank solution $W$.
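The low-rank bias grows with depth, and the limit is easy to see numerically: since $\|W\|_{S_{2/L}}^{2/L} = \sum_i \sigma_i^{2/L}$, each nonzero singular value contributes $\sigma^{2/L} \to 1$ as $L \to \infty$, so the penalty tends to $\operatorname{rank}(W)$. A short sketch (our own illustration on a random low-rank matrix):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((8, 3))
B = rng.standard_normal((3, 5))
W = A @ B                                   # rank-3 matrix (almost surely)
s = np.linalg.svd(W, compute_uv=False)
s = s[s > 1e-10 * s.max()]                  # keep the numerically nonzero spectrum

def penalty(L):
    # ||W||_{S_{2/L}}^{2/L} = sum_i sigma_i^{2/L}
    return np.sum(s ** (2.0 / L))

# L = 2 recovers the nuclear norm; as L grows, sigma^{2/L} -> 1 for every
# nonzero singular value, so the penalty tends to rank(W).
assert np.isclose(penalty(2), s.sum())
assert abs(penalty(1000) - len(s)) < 0.1
```

In this sense, increasing depth interpolates between the nuclear norm ($L = 2$), a well-known convex surrogate for rank, and the rank function itself.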

C PROOFS OF MAIN RESULTS FOR LINEAR NETWORKS C.1 PROOF OF PROPOSITION 1

Consider the Lagrangian function L(W 1 , . . . , W L , Λ) = K j=1 w L,j 2 + tr(Λ T (Y -XW 1 . . . W L )). Here Λ ∈ R N ×K is the dual variable. We note that P (t) = min W1,...,W L max Λ L(W 1 , . . . , W L , Λ), s.t. W i F ≤ t, i ∈ [L -2], w col L-1,j 2 ≤ 1, j ∈ [m L-1 ], = min W1,...,W L-1 max Λ tr(Λ T Y) - m L-1 j=1 I Λ T XW 1 . . . W L-2 w L-1,j 2 ≤ 1 , s.t. W i F ≤ t, i ∈ [L -2], w col L-1,j 2 ≤ 1, j ∈ [m L-1 ], = min W1,...,W L-2 ,W L-1 max Λ tr(Λ T Y) -I Λ T XW 1 . . . W L-2 w L-1 2 ≤ 1 , s.t. W i F ≤ t, i ∈ [L -2], w L-1 2 ≤ 1. ( ) Here I(A) is 0 if the statement A is true. Otherwise it is +∞. For fixed W 1 , . . . , W L-1 , the constraint on W L is linear so we can exchange the order of max Λ and min W L in the second line of (34). By exchanging the order of min and max, we obtain the dual problem D(t) = max Λ min W1,...,W L-2 tr(Λ T Y) -I Λ T XW 1 . . . W L-2 w L-1 2 ≤ 1 , s.t. W i F ≤ t, i ∈ [L -2], w L-1 2 ≤ 1, = max Λ tr(Λ T Y) s.t. Λ T XW 1 . . . W L-2 w L-1 2 ≤ 1 ∀ W i F ≤ t, i ∈ [L -2], w L-1 2 ≤ 1. ( ) Now we derive the bi-dual problem. The dual problem can be reformulated as max Λ tr(Λ T Y), s.t. Λ T XW 1 . . . W L-2 w L-1 2 ≤ 1, ∀(W 1 , . . . , W L-2 , w L-1 ) ∈ Θ. ( ) Here the set Θ is defined as Θ = {(W 1 , . . . , W L-2 , w L-1 )| W i F ≤ t, i ∈ [L -2], w L-1 2 ≤ 1}. ( ) By writing θ = (W 1 , . . . , W L-2 , w L-1 ), the dual of the problem ( 36) is given by min µ TV , s.t. θ∈Θ XW 1 . . . W L-2 w L-1 dµ (θ) = Y. ( ) Here µ : Σ → R K is a signed vector measure and Σ is a σ-field of subsets of Θ. The norm µ TV is the total variation of µ, which can be calculated by µ T V = sup u: u(θ) 2≤1 Θ u T (θ)dµ(θ) =: K i=1 Θ u i (θ)dµ i (θ) , where we write µ =    µ 1 . . . µ K   . The formulation in (38) has infinite width in each layer. According to Theorem 10 in Appendix G, the measure µ in the integral can be represented by finitely many Dirac delta functions. Therefore, we can rewrite the problem (38) as min m * j=1 w row L,j 2 , s.t. 
$$\sum_{j=1}^{m^*}XW_{1,j}\cdots W_{L-2,j}\,w_{L-1,j}^{\mathrm{col}}\,w_{L,j}^{\mathrm{row}}=Y,\quad\|W_{i,j}\|_F\le t,\ i\in[L-2],\ \|w_{L-1,j}^{\mathrm{col}}\|_2\le 1,\ j\in[m^*].\tag{40}$$
Here the variables are $W_{i,j}$ for $i\in[L-2]$ and $j\in[m^*]$, together with $W_{L-1}$ and $W_L$. As strong duality holds between the problems (40) and (36), we can reformulate $D_{\mathrm{lin}}(t)$ as the bi-dual problem (40).

C.2 PROOF OF PROPOSITION 2

We restate Proposition 2 with details.

Proposition 7 Suppose that $W\in\mathbb{R}^{d\times K}$ with rank $r$ is given. Consider the following optimization problem:
$$\min\;\frac12\big(\|W_1\|_F^2+\cdots+\|W_L\|_F^2\big),\quad\text{s.t. }W_1W_2\cdots W_L=W,\tag{41}$$
in variables $W_i\in\mathbb{R}^{m_{i-1}\times m_i}$. Here $m_0=d$, $m_L=K$ and $m_i\ge r$ for $i=1,\dots,L-1$. Then, the optimal value of the problem (41) is given by
$$\frac{L}{2}\|W\|_{S_{2/L}}^{2/L}.\tag{42}$$
Suppose that $W=U\Sigma V^T$. The optimal value can be achieved when
$$W_i=U_{i-1}\Sigma^{1/L}U_i^T,\quad i=1,\dots,L,\qquad U_0=U,\ U_L=V.\tag{43}$$
Here $U_i\in\mathbb{R}^{m_i\times r}$ satisfies $U_i^TU_i=I$.

We start with two lemmas.

Lemma 1 Suppose that $A\in\mathbb{S}^{n\times n}$ is a positive semi-definite matrix. Then, for any $0<p<1$, we have
$$\sum_{i=1}^nA_{ii}^p\ge\sum_{i=1}^n\lambda_i(A)^p.\tag{44}$$
Here $\lambda_i(A)$ is the $i$-th largest eigenvalue of $A$.

Lemma 2 Suppose that $P\in\mathbb{R}^{d\times d}$ is an orthogonal projection matrix, i.e., $P^2=P=P^T$. Then, for arbitrary $W\in\mathbb{R}^{d\times K}$, we have $\sigma_i(PW)\le\sigma_i(W)$, where $\sigma_i(W)$ represents the $i$-th largest singular value of $W$.

Now, we present the proof of Proposition 7. For $L=1$, the statement apparently holds. Suppose that the statement holds for $L=l$. For $L=l+1$, by writing $A=W_2\cdots W_{l+1}$ and applying the induction hypothesis to the inner factorization, we have
$$
\begin{aligned}
&\min\;\|W_1\|_F^2+\cdots+\|W_{l+1}\|_F^2,\quad\text{s.t. }W_1W_2\cdots W_{l+1}=W\\
=&\min\;\|W_1\|_F^2+l\|A\|_{S_{2/l}}^{2/l},\quad\text{s.t. }W_1A=W\\
=&\min\;t^2+l\|A\|_{S_{2/l}}^{2/l},\quad\text{s.t. }W_1A=W,\ \|W_1\|_F\le t.
\end{aligned}\tag{45}
$$
Suppose that $t$ is fixed. It is sufficient to consider the following problem:
$$\min\;\|A\|_{S_{2/l}}^{2/l},\quad\text{s.t. }W_1A=W,\ \|W_1\|_F\le t.\tag{46}$$
Suppose that there exist $W_1$ and $A$ such that $W=W_1A$. Then, we have $WA^\dagger A=W_1AA^\dagger A=W$. As $WA^\dagger=W_1AA^\dagger$ and $AA^\dagger$ is an orthogonal projection, Lemma 2 implies $\|WA^\dagger\|_F\le\|W_1\|_F\le t$. Therefore, $(WA^\dagger,A)$ is also feasible for the problem (46). Hence, the problem (46) is equivalent to
$$\min\;\|A\|_{S_{2/l}}^{2/l},\quad\text{s.t. }WA^\dagger A=W,\ \|WA^\dagger\|_F\le t.\tag{47}$$
Assume that $W$ has rank $r$. Suppose that $A=U\Sigma V^T$, where $\Sigma\in\mathbb{R}^{r_0\times r_0}$ and $r_0\ge r$. Then, $A^\dagger=V\Sigma^{-1}U^T$, and we note that
$$\|WA^\dagger\|_F^2=\operatorname{tr}(WV\Sigma^{-2}V^TW^T)=\operatorname{tr}(V^TW^TWV\Sigma^{-2}).\tag{48}$$
Denote $G(V)=V^TW^TWV$. The constraint $\|WA^\dagger\|_F\le t$ then reads
$$\sum_{i=1}^{r_0}\sigma_i(A)^{-2}(G(V))_{ii}\le t^2.$$
Therefore, by Hölder's inequality, we have
$$\Big(\sum_{i=1}^{r_0}\sigma_i(A)^{-2}(G(V))_{ii}\Big)\Big(\sum_{i=1}^{r_0}\sigma_i(A)^{2/l}\Big)^l\ge\Big(\sum_{i=1}^{r_0}(G(V))_{ii}^{1/(l+1)}\Big)^{l+1}.$$
As $WVV^T=W$ (this follows from $WA^\dagger A=W$ and $A^\dagger A=VV^T$), the non-zero eigenvalues of $G(V)$ are exactly the non-zero eigenvalues of $WVV^TW^T=WW^T$, i.e., the squares of the non-zero singular values of $W$. From Lemma 1, we have
$$\sum_{i=1}^{r_0}(G(V))_{ii}^{1/(l+1)}\ge\sum_{i=1}^{r_0}\lambda_i(G(V))^{1/(l+1)}\ge\sum_{i=1}^{r}\sigma_i(W)^{2/(l+1)}.$$
Therefore, we have
$$\|A\|_{S_{2/l}}^{2/l}=\sum_{i=1}^{r_0}\sigma_i(A)^{2/l}\ge t^{-2/l}\Big(\sum_{i=1}^{r}\sigma_i(W)^{2/(l+1)}\Big)^{(l+1)/l}.$$
This also implies that
$$\min\;\|A\|_{S_{2/l}}^{2/l},\ \text{s.t. }W_1A=W,\ \|W_1\|_F\le t\;\ge\;t^{-2/l}\Big(\sum_{i=1}^{r}\sigma_i(W)^{2/(l+1)}\Big)^{(l+1)/l}.\tag{49}$$
Suppose that $W=\sum_{i=1}^ru_i\sigma_iv_i^T$ is the SVD of $W$. We can let
$$W_1=\frac{t}{\big(\sum_{i=1}^r\sigma_i^{2/(l+1)}\big)^{1/2}}\sum_{i=1}^ru_i\sigma_i^{1/(l+1)}\rho_i^T,\qquad A=\frac{\big(\sum_{i=1}^r\sigma_i^{2/(l+1)}\big)^{1/2}}{t}\sum_{i=1}^r\rho_i\sigma_i^{l/(l+1)}v_i^T.$$
Here $\|\rho_i\|_2=1$ and $\rho_i^T\rho_j=0$ for $i\ne j$. Then, $W_1A=W$ and $\|W_1\|_F\le t$. We also have
$$\|A\|_{S_{2/l}}^{2/l}=t^{-2/l}\Big(\sum_{i=1}^r\sigma_i^{2/(l+1)}\Big)^{1/l}\sum_{i=1}^r\sigma_i^{2/(l+1)}=t^{-2/l}\Big(\sum_{i=1}^r\sigma_i(W)^{2/(l+1)}\Big)^{(l+1)/l}.$$
In summary, we have
$$
\begin{aligned}
&\min\;t^2+l\|A\|_{S_{2/l}}^{2/l},\quad\text{s.t. }W_1A=W,\ \|W_1\|_F\le t\\
=&\min_{t>0}\;t^2+lt^{-2/l}\Big(\sum_{i=1}^r\sigma_i(W)^{2/(l+1)}\Big)^{(l+1)/l}\\
=&\,(l+1)\sum_{i=1}^r\sigma_i(W)^{2/(l+1)}=(l+1)\|W\|_{S_{2/(l+1)}}^{2/(l+1)}.
\end{aligned}
$$
This completes the proof.
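The closed-form value in Proposition 7 can be checked numerically. Below is a minimal sketch (our own verification code, not from the paper; all variable names are ours): it builds the balanced factorization $W_i=U_{i-1}\Sigma^{1/L}U_i^T$ from the SVD of a random $W$ and confirms that the product of the factors recovers $W$ while $\frac12\sum_i\|W_i\|_F^2$ equals $\frac{L}{2}\|W\|_{S_{2/L}}^{2/L}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, L = 4, 3, 3
W = rng.standard_normal((d, K))
U, s, Vt = np.linalg.svd(W, full_matrices=False)  # thin SVD, r = min(d, K)
r = s.size

def rand_orth(n):
    # random orthogonal matrix via QR, used as an interior frame U_i
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

# frames U_0 = U, U_1, ..., U_{L-1}, U_L = V, each with orthonormal columns
frames = [U] + [rand_orth(r) for _ in range(L - 1)] + [Vt.T]
S = np.diag(s ** (1.0 / L))                       # Sigma^{1/L}
Ws = [frames[i] @ S @ frames[i + 1].T for i in range(L)]

prod = Ws[0]
for M in Ws[1:]:
    prod = prod @ M
assert np.allclose(prod, W)                       # factorization is exact

obj = 0.5 * sum(np.linalg.norm(M, 'fro') ** 2 for M in Ws)
closed_form = (L / 2) * np.sum(s ** (2.0 / L))    # (L/2) * ||W||_{S_{2/L}}^{2/L}
assert np.isclose(obj, closed_form)
```

Any choice of orthonormal interior frames $U_i$ yields the same objective value, reflecting the rotational invariance of the Frobenius norm.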

C.3 PROOF OF THEOREM 2

From Proposition 2, the minimum norm problem (5) is equivalent to
$$\min\;\frac{L}{2}\|W\|_{S_{2/L}}^{2/L},\quad\text{s.t. }XW=Y,\tag{54}$$
in variable $W\in\mathbb{R}^{d\times K}$. According to Lemma 2, for any feasible $W$ satisfying $XW=Y$, because $X^\dagger XW=X^\dagger Y$ and $X^\dagger X$ is an orthogonal projection matrix, we have
$$\frac{L}{2}\|W\|_{S_{2/L}}^{2/L}\ge\frac{L}{2}\|X^\dagger XW\|_{S_{2/L}}^{2/L}=\frac{L}{2}\|X^\dagger Y\|_{S_{2/L}}^{2/L}.$$
We also note that $XX^\dagger Y=XX^\dagger XW=XW=Y$. Therefore, $X^\dagger Y$ is also feasible for the problem (54). This indicates that $P_{\mathrm{lin}}=\frac{L}{2}\|X^\dagger Y\|_{S_{2/L}}^{2/L}$.

C.4 PROOF OF THEOREM 3

For a feasible point $(W_1,\dots,W_L)$ for $P_{\mathrm{lin}}(t)$, we note that $(W_1/t,\dots,W_{L-2}/t,W_{L-1},t^{L-2}W_L)$ is feasible for $P_{\mathrm{lin}}(1)$. This implies that $t^{L-2}P_{\mathrm{lin}}(t)=P_{\mathrm{lin}}(1)$, or equivalently, $P_{\mathrm{lin}}(t)=t^{-(L-2)}P_{\mathrm{lin}}(1)$. Recall that
$$P_{\mathrm{lin}}=\min_{t>0}\;\frac{L-2}{2}t^2+t^{-(L-2)}P_{\mathrm{lin}}(1)=\frac{L}{2}\big(P_{\mathrm{lin}}(1)\big)^{2/L}.$$
From Theorem 2, we have $P_{\mathrm{lin}}=\frac{L}{2}\|X^\dagger Y\|_{S_{2/L}}^{2/L}$. This implies that $P_{\mathrm{lin}}(1)=\|X^\dagger Y\|_{S_{2/L}}$ and $P_{\mathrm{lin}}(t)=t^{-(L-2)}\|X^\dagger Y\|_{S_{2/L}}$.

For the dual problem $D_{\mathrm{lin}}(t)$ defined in (35), we note that
$$\|\Lambda^TXW_1\cdots W_{L-2}w_{L-1}\|_2\le\|\Lambda^TXW_1\cdots W_{L-2}\|_2\,\|w_{L-1}\|_2\le\|\Lambda^TX\|_2\prod_{l=1}^{L-2}\|W_l\|_2\,\|w_{L-1}\|_2\le\|\Lambda^TX\|_2\prod_{l=1}^{L-2}\|W_l\|_F\,\|w_{L-1}\|_2\le t^{L-2}\|\Lambda^TX\|_2.$$
The equality can be achieved when $W_l=tu_lu_{l+1}^T$ for $l\in[L-2]$, where $\|u_l\|_2=1$ for $l=1,\dots,L-1$. Specifically, we set $u_{L-1}=w_{L-1}$ and let $u_1$ be the right singular vector corresponding to the largest singular value of $\Lambda^TX$. Therefore, the constraint on $\Lambda$ is equivalent to $\|\Lambda^TX\|_2\le t^{-(L-2)}$. Thus, according to Von Neumann's trace inequality, it follows that
$$\operatorname{tr}(\Lambda^TY)=\operatorname{tr}(\Lambda^TXX^\dagger Y)\le\|\Lambda^TX\|_2\,\|X^\dagger Y\|_*\le t^{-(L-2)}\|X^\dagger Y\|_*.$$
Suppose that $X^\dagger Y=U\Sigma V^T$ is the singular value decomposition. Let $\Sigma=\operatorname{diag}(\sigma_1,\dots,\sigma_r)$, where $\sigma_1\ge\sigma_2\ge\cdots\ge\sigma_r>0$ and $r=\operatorname{rank}(X^\dagger Y)$. We note that
$$\|X^\dagger Y\|_{S_{2/L}}=\Big(\sum_{i=1}^r\sigma_i^{2/L}\Big)^{L/2}=\sigma_1\Big(1+\sum_{i=2}^r(\sigma_i/\sigma_1)^{2/L}\Big)^{L/2}\ge\sigma_1\Big(1+\sum_{i=2}^r(\sigma_i/\sigma_1)\Big)=\sum_{i=1}^r\sigma_i.$$
The equality holds if and only if $\sigma_1=\cdots=\sigma_r$. This is because, for given $x\in(0,1)$ and $a\ge1$, $(a+x^p)^{1/p}$ is strictly decreasing w.r.t. $p\in(0,1]$. As a result, we have
$$D_{\mathrm{lin}}(t)=t^{-(L-2)}\|X^\dagger Y\|_*\le t^{-(L-2)}\|X^\dagger Y\|_{S_{2/L}}=P_{\mathrm{lin}}(t).$$
The equality is achieved if and only if the singular values of $X^\dagger Y$ are the same. In other words, the inequality is strict when $X^\dagger Y$ has different singular values. Then, a duality gap exists for the standard deep linear network.
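The duality gap derived above can be illustrated numerically. A small sketch (our own, with a random matrix standing in for $X^\dagger Y$ and $t=1$): it evaluates $P_{\mathrm{lin}}(t)=t^{-(L-2)}\|X^\dagger Y\|_{S_{2/L}}$ and $D_{\mathrm{lin}}(t)=t^{-(L-2)}\|X^\dagger Y\|_*$ and confirms weak duality together with a strict gap for a generic matrix with several distinct singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
L, t = 4, 1.0
M = rng.standard_normal((5, 3))        # stand-in for X^dagger Y (assumed generic)
s = np.linalg.svd(M, compute_uv=False)

# primal: Schatten-2/L quasi-norm; dual: nuclear norm
P_lin = t ** -(L - 2) * np.sum(s ** (2.0 / L)) ** (L / 2)
D_lin = t ** -(L - 2) * np.sum(s)

assert D_lin <= P_lin + 1e-12          # weak duality: D_lin(t) <= P_lin(t)
assert P_lin - D_lin > 1e-6            # strict gap for this generic matrix
```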

C.5 PROOF OF PROPOSITION 3

For simplicity, we write $W_{L-1,j}=w_{L-1,j}^{\mathrm{col}}$ and $W_{L,j}=w_{L,j}^{\mathrm{row}}$ for $j\in[m]$. For the $j$-th branch of the parallel network, let $\hat W_{l,j}=\alpha_{l,j}W_{l,j}$ for $l\in[L]$. Here $\alpha_{l,j}>0$ for $l\in[L]$, and they satisfy $\prod_{l=1}^L\alpha_{l,j}=1$ for $j\in[m]$. Therefore, we have
$$XW_{1,j}\cdots W_{L-2,j}w_{L-1,j}^{\mathrm{col}}w_{L,j}^{\mathrm{row}}=X\hat W_{1,j}\cdots\hat W_{L-2,j}\hat w_{L-1,j}^{\mathrm{col}}\hat w_{L,j}^{\mathrm{row}}.$$
This implies that $\{\hat W_{l,j}\}_{l\in[L],j\in[m]}$ is also feasible for the problem (12). According to the inequality of arithmetic and geometric means, the objective function in (12) is lower bounded by
$$\frac12\sum_{j=1}^m\sum_{l=1}^L\alpha_{l,j}^2\|W_{l,j}\|_F^2\ge\sum_{j=1}^m\frac{L}{2}\prod_{l=1}^L\alpha_{l,j}^{2/L}\|W_{l,j}\|_F^{2/L}=\frac{L}{2}\sum_{j=1}^m\prod_{l=1}^L\|W_{l,j}\|_F^{2/L}.\tag{64}$$
The equality is achieved when $\alpha_{l,j}=\|W_{l,j}\|_F^{-1}\big(\prod_{k=1}^L\|W_{k,j}\|_F\big)^{1/L}$ for $l\in[L]$ and $j\in[m]$. As the scaling operation does not change $\prod_{l=1}^L\|W_{l,j}\|_F$, the bound (64) is attained by a rescaled network, and the problem (12) is equivalent to minimizing $\frac{L}{2}\sum_{j=1}^m\prod_{l=1}^L\|W_{l,j}\|_F^{2/L}$ subject to the same data-fitting constraint. This completes the proof.

C.6 PROOF OF PROPOSITION 4

We first show that the problem (14) is equivalent to (15). The proof is analogous to the proof of Proposition 3. For simplicity, we write $W_{L-1,j}=w_{L-1,j}^{\mathrm{col}}$ and $W_{L,j}=w_{L,j}^{\mathrm{row}}$ for $j\in[m]$. Let $\alpha_{l,j}>0$ for $l\in[L]$ satisfy $\prod_{l=1}^L\alpha_{l,j}=1$ for $j\in[m]$. Consider another parallel network $\{\hat W_{l,j}\}_{l\in[L],j\in[m]}$ whose $j$-th branch is defined by $\hat W_{l,j}=\alpha_{l,j}W_{l,j}$ for $l\in[L]$. As $\prod_{l=1}^L\alpha_{l,j}=1$, we have
$$XW_{1,j}\cdots W_{L-2,j}w_{L-1,j}^{\mathrm{col}}w_{L,j}^{\mathrm{row}}=X\hat W_{1,j}\cdots\hat W_{L-2,j}\hat w_{L-1,j}^{\mathrm{col}}\hat w_{L,j}^{\mathrm{row}}.$$
This implies that $\{\hat W_{l,j}\}_{l\in[L],j\in[m]}$ is also feasible for the problem (14). According to the inequality of arithmetic and geometric means, the objective function in (14) is lower bounded by
$$\frac12\sum_{j=1}^m\sum_{l=1}^L\alpha_{l,j}^L\|W_{l,j}\|_F^L\ge\sum_{j=1}^m\frac{L}{2}\prod_{l=1}^L\big(\alpha_{l,j}\|W_{l,j}\|_F\big)=\frac{L}{2}\sum_{j=1}^m\prod_{l=1}^L\|W_{l,j}\|_F.\tag{65}$$
The equality is achieved when $\alpha_{l,j}=\|W_{l,j}\|_F^{-1}\big(\prod_{k=1}^L\|W_{k,j}\|_F\big)^{1/L}$ for $l\in[L]$ and $j\in[m]$. As the scaling operation does not change $\prod_{l=1}^L\|W_{l,j}\|_F$, we can simply let $\|W_{l,j}\|_F=1$ for $l\in[L-1]$, and the lower bound becomes $\frac{L}{2}\sum_{j=1}^m\|w_{L,j}^{\mathrm{row}}\|_2$. Hence, the problem (14) is equivalent to (15).

For the problem (15), we consider the Lagrangian function
$$\mathcal{L}(W_1,\dots,W_L,\Lambda)=\frac{L}{2}\sum_{j=1}^m\|w_{L,j}^{\mathrm{row}}\|_2+\operatorname{tr}\Big(\Lambda^T\Big(Y-\sum_{j=1}^mXW_{1,j}\cdots W_{L-2,j}w_{L-1,j}^{\mathrm{col}}w_{L,j}^{\mathrm{row}}\Big)\Big).\tag{66}$$
The primal problem is equivalent to
$$
\begin{aligned}
P^{\mathrm{prl}}_{\mathrm{lin}}=&\min_{W_1,\dots,W_L}\max_{\Lambda}\;\mathcal{L}(W_1,\dots,W_L,\Lambda),\quad\text{s.t. }\|W_{l,j}\|_F\le 1,\ l\in[L-2],\ \|w_{L-1,j}^{\mathrm{col}}\|_2\le 1,\ j\in[m],\\
=&\min_{W_1,\dots,W_{L-1}}\max_{\Lambda}\min_{W_L}\;\mathcal{L}(W_1,\dots,W_L,\Lambda),\quad\text{s.t. }\|W_{l,j}\|_F\le 1,\ l\in[L-2],\ \|w_{L-1,j}^{\mathrm{col}}\|_2\le 1,\ j\in[m],\\
=&\min_{W_1,\dots,W_{L-1}}\max_{\Lambda}\;\operatorname{tr}(\Lambda^TY)-\sum_{j=1}^mI\big(\|\Lambda^TXW_{1,j}\cdots W_{L-2,j}w_{L-1,j}^{\mathrm{col}}\|_2\le L/2\big),\\
&\qquad\text{s.t. }\|W_{l,j}\|_F\le 1,\ l\in[L-2],\ \|w_{L-1,j}^{\mathrm{col}}\|_2\le 1,\ j\in[m].
\end{aligned}
$$
The dual problem follows
$$
\begin{aligned}
D^{\mathrm{prl}}_{\mathrm{lin}}=&\max_{\Lambda}\;\operatorname{tr}(\Lambda^TY),\quad\text{s.t. }\|\Lambda^TXW_{1,j}\cdots W_{L-2,j}w_{L-1,j}^{\mathrm{col}}\|_2\le L/2,\ \forall\,\|W_{l,j}\|_F\le 1,\ l\in[L-2],\ \|w_{L-1,j}^{\mathrm{col}}\|_2\le 1,\ j\in[m],\\
=&\max_{\Lambda}\;\operatorname{tr}(\Lambda^TY),\quad\text{s.t. }\|\Lambda^TXW_1\cdots W_{L-2}w_{L-1}\|_2\le L/2,\ \forall\,\|W_i\|_F\le 1,\ i\in[L-2],\ \|w_{L-1}\|_2\le 1.
\end{aligned}
$$

C.7 PROOF OF THEOREM 4

We can rewrite the dual problem as
$$D^{\mathrm{prl}}_{\mathrm{lin}}=\max_{\Lambda}\;\operatorname{tr}(\Lambda^TY),\quad\text{s.t. }\|\Lambda^TXW_1\cdots W_{L-2}w_{L-1}\|_2\le L/2,\ \forall(W_1,\dots,W_{L-2},w_{L-1})\in\Theta,\tag{69}$$
where the set $\Theta$ is defined as
$$\Theta=\big\{(W_1,\dots,W_{L-2},w_{L-1})\,\big|\,\|W_l\|_F\le 1,\ l\in[L-2],\ \|w_{L-1}\|_2\le 1\big\}.$$
By writing $\theta=(W_1,\dots,W_{L-2},w_{L-1})$, the bi-dual problem, i.e., the dual problem of (69), is given by
$$\min\;\|\mu\|_{\mathrm{TV}},\quad\text{s.t. }\int_{\theta\in\Theta}XW_1\cdots W_{L-2}w_{L-1}\,d\mu(\theta)=Y.\tag{71}$$
Here $\mu:\Sigma\to\mathbb{R}^K$ is a signed vector measure, where $\Sigma$ is a $\sigma$-field of subsets of $\Theta$ and $\|\mu\|_{\mathrm{TV}}$ is its total variation. The formulation in (71) has infinite width in each layer. According to Theorem 10 in Appendix G, the measure $\mu$ in the integral can be represented by finitely many Dirac delta functions. Therefore, there exists a critical threshold of the number of branches $m^*\le KN+1$ such that we can rewrite the problem (71) as
$$\min\;\sum_{j=1}^{m^*}\|w_{L,j}^{\mathrm{row}}\|_2,\quad\text{s.t. }\sum_{j=1}^{m^*}XW_{1,j}\cdots W_{L-2,j}w_{L-1,j}^{\mathrm{col}}w_{L,j}^{\mathrm{row}}=Y,\ \|W_{l,j}\|_F\le 1,\ l\in[L-2],\ \|w_{L-1,j}^{\mathrm{col}}\|_2\le 1,\ j\in[m^*].$$
Here the variables are $W_{l,j}$ for $l\in[L-2]$ and $j\in[m^*]$, together with $W_{L-1}$ and $W_L$. This is equivalent to (15). As strong duality holds between the problems (69) and (71), the primal problem (15) is equivalent to the dual problem (69) as long as $m\ge m^*$.

Now, we compute the optimal value of $D^{\mathrm{prl}}_{\mathrm{lin}}$. Similar to the proof of Theorem 3, we can show that the constraint in the dual problem (69) is equivalent to
$$\|\Lambda^TX\|_2\le L/2.\tag{73}$$
Therefore, we have
$$\operatorname{tr}(\Lambda^TY)\le\|\Lambda^TX\|_2\,\|X^\dagger Y\|_*\le\frac{L}{2}\|X^\dagger Y\|_*.$$
This implies that $P^{\mathrm{prl}}_{\mathrm{lin}}=D^{\mathrm{prl}}_{\mathrm{lin}}=\frac{L}{2}\|X^\dagger Y\|_*$.
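The rescaling step shared by the proofs of Propositions 3 and 4 can be sanity-checked numerically. The sketch below (our own illustration; the variable names are not from the paper) rescales the layers of one branch by factors $\alpha_l$ with $\prod_l\alpha_l=1$, which leaves the branch's end-to-end map unchanged, and verifies that the balanced choice attains the AM-GM lower bound $L\prod_l\|W_l\|_F^{2/L}$ of $\sum_l\alpha_l^2\|W_l\|_F^2$, while another admissible rescaling does no better.

```python
import numpy as np

rng = np.random.default_rng(6)
L = 3
Ws = [rng.standard_normal((4, 4)) for _ in range(L)]
norms = np.array([np.linalg.norm(W, 'fro') for W in Ws])

# balanced scaling: alpha_l = (prod_k ||W_k||_F)^(1/L) / ||W_l||_F, so prod(alpha) = 1
g = norms.prod() ** (1.0 / L)
alphas = g / norms
assert np.isclose(alphas.prod(), 1.0)

obj = sum((a * n) ** 2 for a, n in zip(alphas, norms))
amgm = L * norms.prod() ** (2.0 / L)   # AM-GM lower bound L * prod ||W_l||_F^(2/L)
assert np.isclose(obj, amgm)           # balanced scaling attains the bound

# any other scaling with product one is no better
betas = np.array([2.0, 0.5, 1.0])
assert np.isclose(betas.prod(), 1.0)
assert sum((b * n) ** 2 for b, n in zip(betas, norms)) >= amgm - 1e-9
```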

D STAIRS OF DUALITY GAP FOR STANDARD DEEP LINEAR NETWORKS

We consider partially dualizing the non-convex optimization problem by exchanging a subset of the minimization problems with respect to the hidden layers. Consider the Lagrangian formulation of the primal problem of the standard deep linear network:
$$P_{\mathrm{lin}}(t)=\min_{\{W_l\}_{l=1}^{L-1}}\max_{\Lambda}\;\operatorname{tr}(\Lambda^TY)-I\big(\|\Lambda^TXW_1\cdots W_{L-2}w_{L-1}\|_2\le 1\big),\quad\text{s.t. }\|W_i\|_F\le t,\ i\in[L-2],\ \|w_{L-1}\|_2\le 1.\tag{75}$$
By changing the order of the $L-2$ mins and the max in (75), for $l=0,1,\dots,L-2$, we can define the $l$-th partial "dual" problem
$$D^{(l)}_{\mathrm{lin}}(t)=\min_{W_1,\dots,W_l}\max_{\Lambda}\min_{W_{l+1},\dots,W_{L-2}}\;\operatorname{tr}(\Lambda^TY)-I\big(\|\Lambda^TXW_1\cdots W_{L-2}w_{L-1}\|_2\le 1\big),\quad\text{s.t. }\|W_i\|_F\le t,\ i\in[L-2],\ \|w_{L-1}\|_2\le 1.$$
For $l=L-2$, $D^{(l)}_{\mathrm{lin}}(t)$ corresponds to the primal problem $P_{\mathrm{lin}}(t)$, while for $l=0$, $D^{(l)}_{\mathrm{lin}}(t)$ is the dual problem $D_{\mathrm{lin}}(t)$. From the following proposition, we illustrate that the dual problem of $D^{(l)}_{\mathrm{lin}}(t)$ corresponds to a minimum norm problem of a neural network with parallel structure.

Proposition 8 There exists a threshold of the number of branches $m^*\le KN+1$ such that the problem $D^{(l)}_{\mathrm{lin}}(t)$ is equivalent to the "bi-dual" problem
$$
\begin{aligned}
\min\;&\sum_{j=1}^{m^*}\|w_{L,j}^{\mathrm{row}}\|_2,\quad\text{s.t. }\sum_{j=1}^{m^*}XW_1\cdots W_lW_{l+1,j}\cdots W_{L-2,j}w_{L-1,j}^{\mathrm{col}}w_{L,j}^{\mathrm{row}}=Y,\\
&\|W_i\|_F\le t,\ i\in[l],\quad\|W_{i,j}\|_F\le t,\ i=l+1,\dots,L-2,\ j\in[m^*],\quad\|w_{L-1,j}^{\mathrm{col}}\|_2\le 1,\ j\in[m^*],
\end{aligned}\tag{77}
$$
where the variables are $W_i\in\mathbb{R}^{m_{i-1}\times m_i}$ for $i\in[l]$, $W_{i,j}\in\mathbb{R}^{m_{i-1}\times m_i}$ for $i=l+1,\dots,L-2$ and $j\in[m^*]$, $W_{L-1}\in\mathbb{R}^{m_{L-2}\times m^*}$ and $W_L\in\mathbb{R}^{m^*\times m_L}$.

We can interpret the problem (77) as the minimum norm problem of a linear network with parallel structure in the $(l+1)$-th to $(L-2)$-th layers. This indicates that for $l=0,1,\dots,L-2$, the bi-dual formulation of $D^{(l)}_{\mathrm{lin}}(t)$ can be viewed as an interpolation from a network with standard structure to a network with parallel structure. Now, we calculate the exact value of $D^{(l)}_{\mathrm{lin}}(t)$.

Proposition 9 The optimal value of $D^{(l)}_{\mathrm{lin}}(t)$ follows
$$D^{(l)}_{\mathrm{lin}}(t)=t^{-(L-2)}\|X^\dagger Y\|_{S_{2/(l+2)}}.$$
Suppose that the singular values of $X^\dagger Y$ are not identical to each other. Then, we have
$$P_{\mathrm{lin}}(t)=D^{(L-2)}_{\mathrm{lin}}(t)>D^{(L-3)}_{\mathrm{lin}}(t)>\cdots>D^{(0)}_{\mathrm{lin}}(t)=D(t).$$
In Figure 3, we plot $D^{(l)}_{\mathrm{lin}}(t)$ for an example.

D.1 PROOF OF PROPOSITION 9

We note that
$$
\begin{aligned}
&\max_{\Lambda}\;\operatorname{tr}(\Lambda^TY),\quad\text{s.t. }\|\Lambda^TXW_1\cdots W_{L-2}\|_2\le 1,\ \forall\,\|W_i\|_F\le t,\ i=l+1,\dots,L-2,\\
=&\max_{\Lambda}\min_{W_{l+1},\dots,W_{L-2}}\;\operatorname{tr}(\Lambda^TY),\quad\text{s.t. }\|\Lambda^TXW_1\cdots W_l\|_2\le t^{-(L-2-l)}.
\end{aligned}
$$
Therefore, we can rewrite $D^{(l)}_{\mathrm{lin}}(t)$ as
$$
\begin{aligned}
D^{(l)}_{\mathrm{lin}}(t)=&\min_{W_1,\dots,W_l}\max_{\Lambda}\;\operatorname{tr}(\Lambda^TY),\quad\text{s.t. }\|\Lambda^TXW_1\cdots W_l\|_2\le t^{-(L-2-l)},\ \|W_i\|_F\le t,\ i\in[l],\\
=&\min_{W_1,\dots,W_l}\max_{\Lambda}\;t^{-(L-2-l)}\operatorname{tr}(\Lambda^TY),\quad\text{s.t. }\|\Lambda^TXW_1\cdots W_l\|_2\le 1,\ \|W_i\|_F\le t,\ i\in[l].
\end{aligned}
$$
From the equation (10), we note that
$$
\begin{aligned}
&\min_{W_1,\dots,W_l}\max_{\Lambda}\;\operatorname{tr}(\Lambda^TY),\quad\text{s.t. }\|\Lambda^TXW_1\cdots W_l\|_2\le 1,\ \|W_i\|_F\le t,\ i\in[l],\\
=&\min\;\sum_{j=1}^{K}\|w_{l+2,j}\|_2,\quad\text{s.t. }XW_1\cdots W_{l+2}=Y,\ \|W_i\|_F\le t,\ i\in[l],\ \|w_{l+1,j}\|_2\le 1,\\
=&\,t^{-l}\|X^\dagger Y\|_{S_{2/(l+2)}},
\end{aligned}
$$
which is the minimum norm problem of an $(l+2)$-layer standard linear network. Combining the two displays gives $D^{(l)}_{\mathrm{lin}}(t)=t^{-(L-2-l)}\cdot t^{-l}\|X^\dagger Y\|_{S_{2/(l+2)}}=t^{-(L-2)}\|X^\dagger Y\|_{S_{2/(l+2)}}$. This completes the proof.
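Proposition 9 predicts a monotone "staircase" of partial dual values governed by Schatten quasi-norms. A quick numerical illustration (our own sketch; a random matrix stands in for $X^\dagger Y$, which generically has non-identical singular values):

```python
import numpy as np

rng = np.random.default_rng(2)
L, t = 6, 1.0
M = rng.standard_normal((4, 4))         # stand-in for X^dagger Y
s = np.linalg.svd(M, compute_uv=False)

def schatten(s, p):
    # Schatten-p quasi-norm: (sum sigma_i^p)^(1/p)
    return np.sum(s ** p) ** (1.0 / p)

# D^(l)(t) = t^{-(L-2)} ||X^dagger Y||_{S_{2/(l+2)}} for l = 0, ..., L-2
vals = [t ** -(L - 2) * schatten(s, 2.0 / (l + 2)) for l in range(L - 1)]

# vals[0] is the dual D(t) (nuclear norm), vals[-1] the primal P_lin(t);
# the staircase is strictly increasing in l for non-identical singular values
assert all(vals[i] < vals[i + 1] for i in range(len(vals) - 1))
```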

E PROOFS OF MAIN RESULTS FOR RELU NETWORKS

E.1 PROOF OF PROPOSITION 5

For the problem $P(t)$, introduce the Lagrangian function
$$\mathcal{L}(W_1,W_2,W_3,\Lambda)=\sum_{j=1}^K\|w_{3,j}^{\mathrm{row}}\|_2-\operatorname{tr}\big(\Lambda^T\big(((XW_1)_+W_2)_+W_3-Y\big)\big).$$
According to the convex duality of the two-layer ReLU network, we have
$$
\begin{aligned}
P_{\mathrm{ReLU}}(t)=&\min_{\|W_1\|_F\le t,\ \|w_2\|_2\le 1}\max_{\Lambda}\;\operatorname{tr}(\Lambda^TY)-I\big(\|\Lambda^T((XW_1)_+w_2)_+\|_2\le 1\big)\\
=&\min_{\|W_1\|_F\le t}\max_{\Lambda}\min_{\|w_2\|_2\le 1}\;\operatorname{tr}(\Lambda^TY)-I\big(\|\Lambda^T((XW_1)_+w_2)_+\|_2\le 1\big)\\
=&\min_{\|W_1\|_F\le t}\max_{\Lambda}\;\operatorname{tr}(\Lambda^TY),\quad\text{s.t. }\|\Lambda^Tv\|_2\le 1,\ \forall v\in\mathcal{A}(W_1).
\end{aligned}
$$
By exchanging the min and max, we obtain the dual problem
$$D_{\mathrm{ReLU}}(t)=\max_{\Lambda}\;\operatorname{tr}(\Lambda^TY),\quad\text{s.t. }\|\Lambda^Tv\|_2\le 1,\ \forall v\in\mathcal{A}(W_1),\ \forall\,\|W_1\|_F\le t.$$
The dual of the dual problem writes
$$\min\;\|\mu\|_{\mathrm{TV}},\quad\text{s.t. }\int_{\|W_1\|_F\le t,\ \|w_2\|_2\le 1}((XW_1)_+w_2)_+\,d\mu(W_1,w_2)=Y.\tag{86}$$
Here $\mu$ is a signed vector measure and $\|\mu\|_{\mathrm{TV}}$ is its total variation. Similar to the proof of Proposition 1, we can find a finite representation for the optimal measure and transform this problem into
$$\min_{\{W_{1,j}\}_{j=1}^{m^*},\,W_2\in\mathbb{R}^{m_1\times m^*},\,W_3\in\mathbb{R}^{m^*\times K}}\;\sum_{j=1}^{m^*}\|w_{3,j}\|_2,\quad\text{s.t. }\sum_{j=1}^{m^*}((XW_{1,j})_+w_{2,j})_+w_{3,j}^T=Y,\ \|W_{1,j}\|_F\le t,\ \|w_{2,j}\|_2\le 1.\tag{87}$$
Here $m^*\le KN+1$. This completes the proof.

E.2 PROOF OF THEOREM 5

For a rank-1 data matrix $X=ca_0^T$, let $A_1=(XW_1)_+$. It is easy to observe that
$$A_1=(c)_+a_{1,+}^T+(-c)_+a_{1,-}^T,$$
where we let $a_{1,+}=(W_1^Ta_0)_+$ and $a_{1,-}=(-W_1^Ta_0)_+$. For a three-layer network, suppose that $\lambda^*$ is the optimal solution to the dual problem $D_{\mathrm{ReLU}}(t)$. We consider the extreme points defined by
$$\arg\max_{\|W_1\|_F\le t,\ \|w_2\|_2\le 1}\big|(\lambda^*)^T((XW_1)_+w_2)_+\big|.\tag{88}$$
For fixed $W_1$, because $a_{1,+}^Ta_{1,-}=0$, we can suppose that $w_2=u_1a_{1,+}+u_2a_{1,-}+u_3r$, where $r^Ta_{1,+}=r^Ta_{1,-}=0$ and $\|r\|_2=1$. The maximization problem over $w_2$ then reduces to
$$\arg\max_{u_1,u_2,u_3}\;(\lambda^*)^T(c)_+\|a_{1,+}\|_2^2(u_1)_++(\lambda^*)^T(-c)_+\|a_{1,-}\|_2^2(u_2)_+.$$
If $(\lambda^*)^T(c)_+$ and $(\lambda^*)^T(-c)_+$ have different signs, then the optimal value is
$$\max\big\{|(\lambda^*)^T(c)_+|\,\|a_{1,+}\|_2,\ |(\lambda^*)^T(-c)_+|\,\|a_{1,-}\|_2\big\},$$
and the corresponding optimal $w_2$ is $w_2=a_{1,+}/\|a_{1,+}\|_2$ or $w_2=a_{1,-}/\|a_{1,-}\|_2$. Then, the problem becomes
$$\arg\max_{\|W_1\|_F\le t}\;\max\big\{|(\lambda^*)^T(c)_+|\,\|a_{1,+}\|_2,\ |(\lambda^*)^T(-c)_+|\,\|a_{1,-}\|_2\big\}.$$
We note that $\max\{\|a_{1,+}\|_2,\|a_{1,-}\|_2\}\le\|W_1^Ta_0\|_2\le\|W_1\|_2\|a_0\|_2\le t\|a_0\|_2$. Thus, the optimal $W_1$ is given by
$$W_1=t\,\mathrm{sign}\big(|(\lambda^*)^T(c)_+|-|(\lambda^*)^T(-c)_+|\big)\rho_0\rho_1^T.$$
Here $\rho_0=a_0/\|a_0\|_2$ and $\rho_1\in\mathbb{R}^{m_1}_+$ satisfies $\|\rho_1\|_2=1$. This implies that the optimal $w_2$ is given by $w_2=\rho_1$. On the other hand, if $(\lambda^*)^T(c)_+$ and $(\lambda^*)^T(-c)_+$ have the same sign, then the optimal $w_2$ follows
$$w_2=\frac{|(\lambda^*)^T(c)_+|\,a_{1,+}+|(\lambda^*)^T(-c)_+|\,a_{1,-}}{\sqrt{((\lambda^*)^T(c)_+)^2\|a_{1,+}\|_2^2+((\lambda^*)^T(-c)_+)^2\|a_{1,-}\|_2^2}},$$
and the maximization problem over $W_1$ is equivalent to
$$\arg\max_{\|W_1\|_F\le t}\;((\lambda^*)^T(c)_+)^2\|a_{1,+}\|_2^2+((\lambda^*)^T(-c)_+)^2\|a_{1,-}\|_2^2,$$
whose optimal $W_1$ is again of the form $W_1=t\rho_0\rho_1^T$, where $\rho_0=a_0/\|a_0\|_2$ and $\rho_1\in\mathbb{R}^{m_1}_+$ satisfies $\|\rho_1\|_2=1$.

Consider the subsampled dual problem $D^{\mathrm{prl,sub}}_{\mathrm{ReLU}}$, which imposes the constraint only on the finitely many extreme points $v_i$:
$$\max_{i\in[m^*]}\|\Lambda^Tv_i\|_2\le L/2.\tag{96}$$
Apparently, we have $D^{\mathrm{prl,sub}}_{\mathrm{ReLU}}\le D^{\mathrm{prl}}_{\mathrm{ReLU}}$. As $\Lambda^*$ is the optimal solution to $D^{\mathrm{prl}}_{\mathrm{ReLU}}$ and $\Lambda^*$ is feasible for $D^{\mathrm{prl,sub}}_{\mathrm{ReLU}}$, we have $D^{\mathrm{prl,sub}}_{\mathrm{ReLU}}\ge D^{\mathrm{prl}}_{\mathrm{ReLU}}$. This implies that $D^{\mathrm{prl,sub}}_{\mathrm{ReLU}}=D^{\mathrm{prl}}_{\mathrm{ReLU}}$. We note that (26) is the dual problem of (96). Therefore, as a corollary of Theorem 6, we have $P^{\mathrm{prl,sub}}_{\mathrm{ReLU}}=D^{\mathrm{prl,sub}}_{\mathrm{ReLU}}=D^{\mathrm{prl}}_{\mathrm{ReLU}}=P^{\mathrm{prl}}_{\mathrm{ReLU}}$. Therefore, $(W_1,\dots,W_L)$ is the optimal solution to (23).
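The rank-1 decomposition $A_1=(c)_+a_{1,+}^T+(-c)_+a_{1,-}^T$ used at the start of this proof is easy to verify numerically. A minimal sketch (our own, with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, m = 6, 4, 3
c = rng.standard_normal(N)
a0 = rng.standard_normal(d)
X = np.outer(c, a0)                    # rank-1 data matrix X = c a0^T
W1 = rng.standard_normal((d, m))

relu = lambda z: np.maximum(z, 0)
a_plus = relu(W1.T @ a0)               # a_{1,+} = (W1^T a0)_+
a_minus = relu(-W1.T @ a0)             # a_{1,-} = (-W1^T a0)_+

A1 = relu(X @ W1)
# (c_i t_j)_+ = (c_i)_+ (t_j)_+ + (-c_i)_+ (-t_j)_+ entrywise
assert np.allclose(A1, np.outer(relu(c), a_plus) + np.outer(relu(-c), a_minus))
```

The identity holds because the supports of $(c)_+$ and $(-c)_+$ are disjoint, so each entry of $A_1$ picks up exactly one of the two rank-1 terms.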

F PROOFS OF AUXILIARY RESULTS

F.1 PROOF OF LEMMA 1

Denote $a\in\mathbb{R}^n$ such that $a_i=A_{ii}$ and denote $b\in\mathbb{R}^n$ such that $b_i=\lambda_i(A)$. By the Schur-Horn theorem, $a$ is majorized by $b$, i.e., for $k\in[n-1]$,
$$\sum_{i=1}^ka_{(i)}\le\sum_{i=1}^kb_{(i)},\qquad\text{and}\qquad\sum_{i=1}^na_{(i)}=\sum_{i=1}^nb_{(i)}.$$
Here $a_{(i)}$ is the $i$-th largest entry of $a$. As $f(x)=-x^p$ is a convex function for $0<p<1$, Karamata's inequality gives $\sum_{i=1}^nf(b_i)\ge\sum_{i=1}^nf(a_i)$, i.e., $\sum_{i=1}^nA_{ii}^p\ge\sum_{i=1}^n\lambda_i(A)^p$. This completes the proof.
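Lemma 1 can be spot-checked numerically. The sketch below (our own) draws a random positive semi-definite matrix and verifies $\sum_iA_{ii}^p\ge\sum_i\lambda_i(A)^p$ for $p=1/2$:

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
A = B @ B.T                              # random PSD matrix
p = 0.5

diag_term = np.sum(np.diag(A) ** p)      # sum_i A_ii^p
eig_term = np.sum(np.linalg.eigvalsh(A) ** p)  # sum_i lambda_i(A)^p
assert diag_term >= eig_term - 1e-10
```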

F.2 PROOF OF LEMMA 2

According to the min-max principle for singular values, for $W\in\mathbb{R}^{d\times K}$ we have
$$\sigma_i(W)=\min_{\dim(S)=K-i+1}\;\max_{x\in S,\,\|x\|_2=1}\|Wx\|_2,$$
where $S$ ranges over subspaces of $\mathbb{R}^K$. As $P$ is an orthogonal projection matrix, for arbitrary $x\in\mathbb{R}^K$ we have $\|PWx\|_2\le\|Wx\|_2$. Therefore, we have
$$\max_{x\in S,\,\|x\|_2=1}\|PWx\|_2\le\max_{x\in S,\,\|x\|_2=1}\|Wx\|_2$$
for every subspace $S$, which yields $\sigma_i(PW)\le\sigma_i(W)$. This completes the proof.
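Lemma 2 can also be spot-checked. The sketch below (our own) uses the orthogonal projection $P=AA^\dagger$ onto a random 3-dimensional subspace and verifies $\sigma_i(PW)\le\sigma_i(W)$ for all $i$:

```python
import numpy as np

rng = np.random.default_rng(3)
d, K = 6, 4
A = rng.standard_normal((d, 3))
P = A @ np.linalg.pinv(A)            # orthogonal projection onto range(A)
W = rng.standard_normal((d, K))

sv_PW = np.linalg.svd(P @ W, compute_uv=False)
sv_W = np.linalg.svd(W, compute_uv=False)
assert np.all(sv_PW <= sv_W + 1e-10)  # projection cannot increase singular values
```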



This corresponds to weak regularization, i.e., $\beta\to0$ in (27), as considered in Wei et al. (2018).



Figure 1: Standard architecture.




Figure 3: Example of $D^{(l)}_{\mathrm{lin}}(t)$ for $l=0,1,\dots,5$.

E.3 PROOF OF PROPOSITION 6

Analogous to the proof of Proposition 4, we can reformulate (22) into (23). The rest of the proof is analogous to the proof of Proposition 4. For the problem (23), we consider the Lagrangian function
$$\mathcal{L}(W_1,\dots,W_L,\Lambda)=\frac{L}{2}\sum_{j=1}^m\|w_{L,j}^{\mathrm{row}}\|_2+\operatorname{tr}\Big(\Lambda^T\Big(Y-\sum_{j=1}^m\big((\cdots(XW_{1,j})_+\cdots W_{L-2,j})_+w_{L-1,j}^{\mathrm{col}}\big)_+w_{L,j}^{\mathrm{row}}\Big)\Big),$$
and exchanging the order of min and max yields the dual problem
$$D^{\mathrm{prl,sub}}_{\mathrm{ReLU}}=\max_{\Lambda}\;\operatorname{tr}(\Lambda^TY),\quad\text{s.t. }\max_{i\in[m^*]}\|\Lambda^Tv_i\|_2\le L/2,$$
which is the constraint stated in (96).


Existing and new results for duality gaps in $L$-layer standard and parallel architectures: we compare our duality gap characterization with previous literature. Each check mark indicates whether a characterization of the duality gap exists for the corresponding architecture, and the number next to it indicates whether the gap is zero.

Here $\|\cdot\|_*$ represents the nuclear norm. $P_{\mathrm{lin}}(t)=D_{\mathrm{lin}}(t)$ if and only if the singular values of $X^\dagger Y$ are equal.

ACKNOWLEDGEMENTS

This work was partially supported by the National Science Foundation (NSF) under grants ECCS-2037304, DMS-2134248, NSF CAREER award CCF-2236829, the U.S. Army Research Office Early Career Award W911NF-21-1-0242, and the ACCESS -AI Chip Center for Emerging Smart Systems, sponsored by InnoHK funding, Hong Kong SAR.

Published as a conference paper at ICLR 2023

The primal problem is equivalent to (90). By exchanging the order of min and max, the dual problem follows.

E.4 PROOF OF THEOREM 6

The proof is analogous to the proof of Theorem 4. We can rewrite the dual problem as
$$D^{\mathrm{prl}}_{\mathrm{ReLU}}=\max_{\Lambda}\;\operatorname{tr}(\Lambda^TY),\quad\text{s.t. }\big\|\Lambda^T\big((\cdots(XW_1)_+\cdots W_{L-2})_+w_{L-1}\big)_+\big\|_2\le L/2,\ \forall(W_1,\dots,W_{L-2},w_{L-1})\in\Theta,\tag{92}$$
where the set $\Theta$ is defined as
$$\Theta=\big\{(W_1,\dots,W_{L-2},w_{L-1})\,\big|\,\|W_l\|_F\le 1,\ l\in[L-2],\ \|w_{L-1}\|_2\le 1\big\}.$$
By writing $\theta=(W_1,\dots,W_{L-2},w_{L-1})$, the bi-dual problem, i.e., the dual problem of (92), is given by
$$\min\;\|\mu\|_{\mathrm{TV}},\quad\text{s.t. }\int_{\theta\in\Theta}\big((\cdots(XW_1)_+\cdots W_{L-2})_+w_{L-1}\big)_+\,d\mu(\theta)=Y.\tag{94}$$
Here $\mu:\Sigma\to\mathbb{R}^K$ is a signed vector measure, where $\Sigma$ is a $\sigma$-field of subsets of $\Theta$ and $\|\mu\|_{\mathrm{TV}}$ is its total variation. The formulation in (94) has infinite width in each layer. According to Theorem 10 in Appendix G, the measure $\mu$ in the integral can be represented by finitely many Dirac delta functions. Therefore, there exists $m^*\le KN+1$ such that we can rewrite the problem (94) as
$$\min\;\sum_{j=1}^{m^*}\|w_{L,j}^{\mathrm{row}}\|_2,\quad\text{s.t. }\sum_{j=1}^{m^*}\big((\cdots(XW_{1,j})_+\cdots W_{L-2,j})_+w_{L-1,j}^{\mathrm{col}}\big)_+w_{L,j}^{\mathrm{row}}=Y,\ \|W_{l,j}\|_F\le 1,\ l\in[L-2],\ \|w_{L-1,j}^{\mathrm{col}}\|_2\le 1,\ j\in[m^*].$$
Here the variables are $W_{l,j}$ for $l\in[L-2]$ and $j\in[m^*]$, together with $W_{L-1}$ and $W_L$. This is equivalent to (23). As strong duality holds between the problems (92) and (94), the primal problem (23) is equivalent to the dual problem (92) as long as $m\ge m^*$.

G CARATHEODORY'S THEOREM AND FINITE REPRESENTATION

We first review a generalized version of Caratheodory's theorem introduced in (Rosset et al., 2007).

Theorem 8 Let $\mu$ be a positive measure supported on a bounded subset $D\subseteq\mathbb{R}^N$. Then, there exists a measure $\nu$ whose support is a finite subset of $D$, $\{z_1,\dots,z_k\}$, with $k\le N+1$, such that
$$\int_Dz\,d\mu(z)=\int_Dz\,d\nu(z),$$
and $\|\mu\|_{\mathrm{TV}}=\|\nu\|_{\mathrm{TV}}$.

We can generalize this theorem to signed vector measures.

Theorem 9 Let $\mu:\Sigma\to\mathbb{R}^K$ be a signed vector measure supported on a bounded subset $D\subseteq\mathbb{R}^N$. Here $\Sigma$ is a $\sigma$-field of subsets of $D$. Then, there exists a signed vector measure $\nu$ whose support is a finite subset of $D$, $\{z_1,\dots,z_k\}$, with $k\le KN+1$, such that
$$\int_Dz\,d\mu(z)=\int_Dz\,d\nu(z),$$
and $\|\nu\|_{\mathrm{TV}}=\|\mu\|_{\mathrm{TV}}$.

PROOF Let $\mu$ be a signed vector measure supported on a bounded subset $D\subseteq\mathbb{R}^N$. Consider the extended set
$$\tilde D=\{zu^T\mid z\in D,\ u\in\mathbb{R}^K,\ \|u\|_2=1\}\subseteq\mathbb{R}^{N\times K}.$$
Then, $\mu$ corresponds to a scalar-valued measure $\tilde\mu$ on the set $\tilde D$ with $\|\mu\|_{\mathrm{TV}}=\|\tilde\mu\|_{\mathrm{TV}}$. We note that $\tilde D$ is also bounded. Therefore, by applying Theorem 8 to the set $\tilde D$, viewed as a subset of $\mathbb{R}^{NK}$, and the measure $\tilde\mu$, there exists a measure $\tilde\nu$ whose support is a finite subset of $\tilde D$, $\{z_1u_1^T,\dots,z_ku_k^T\}$, with $k\le NK+1$, such that the integrals agree and $\|\tilde\mu\|_{\mathrm{TV}}=\|\tilde\nu\|_{\mathrm{TV}}$. We can define $\nu$ as the signed vector measure whose support is the finite subset $\{z_1,\dots,z_k\}$ and $d\nu(z_i)=u_i\,d\tilde\nu(z_iu_i^T)$. Then, $\|\nu\|_{\mathrm{TV}}=\|\tilde\nu\|_{\mathrm{TV}}=\|\tilde\mu\|_{\mathrm{TV}}=\|\mu\|_{\mathrm{TV}}$. This completes the proof.

Now we are ready to present the theorem about the finite representation of a signed vector measure.

Theorem 10 Suppose that $\theta$ is a parameter with a bounded domain $\Theta\subseteq\mathbb{R}^p$ and $\phi(X,\theta):\mathbb{R}^{N\times d}\times\Theta\to\mathbb{R}^N$ is an embedding of the parameter into the feature space. Consider the following optimization problem:
$$\min\;\|\mu\|_{\mathrm{TV}},\quad\text{s.t. }\int_\Theta\phi(X,\theta)\,d\mu(\theta)=Y.\tag{102}$$
Assume that an optimal solution to (102) exists. Then, there exists an optimal solution $\hat\mu$ supported on at most $KN+1$ features in $\Theta$.

PROOF Let $\hat\mu$ be an optimal solution to (102). We can define a measure $P$ on $\mathbb{R}^N$ as the push-forward of $\hat\mu$ by $P(B)=\hat\mu(\{\theta\mid\phi(X,\theta)\in B\})$. Denote $D=\{\phi(X,\theta)\mid\theta\in\Theta\}$. We note that $P$ is supported on $D$ and $D$ is bounded. By applying Theorem 9 to the set $D$ and the measure $P$, we can find a measure $Q$ whose support is a finite subset of $D$, $\{z_1,\dots,z_k\}$, with $k\le KN+1$. For each $z_i\in D$, we can find $\theta_i$ such that $\phi(X,\theta_i)=z_i$. Then, $\tilde\mu=\sum_{i=1}^k\delta(\theta-\theta_i)\,Q(\{z_i\})$ is an optimal solution to (102) with at most $KN+1$ features and $\|\tilde\mu\|_{\mathrm{TV}}=\|\hat\mu\|_{\mathrm{TV}}$. Here $\delta(\cdot)$ is the Dirac delta measure.
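The finite-representation argument rests on Caratheodory-type support reduction. The sketch below (our own self-contained illustration of the scalar case in Theorem 8, not the paper's construction) reduces a convex combination of ten points in $\mathbb{R}^2$ to at most $N+1=3$ atoms while preserving the barycenter and the total mass, by repeatedly stepping along an affine dependence among the active points:

```python
import numpy as np

def caratheodory(points, weights):
    """Reduce a convex combination in R^N to at most N+1 atoms,
    preserving the barycenter and total mass."""
    pts = np.asarray(points, float)
    w = np.array(weights, dtype=float)
    N = pts.shape[1]
    while np.count_nonzero(w > 1e-12) > N + 1:
        idx = np.flatnonzero(w > 1e-12)
        # affine dependence among active points: M v = 0 with rows (coords; ones)
        M = np.vstack([pts[idx].T, np.ones(len(idx))])
        v = np.linalg.svd(M)[2][-1]          # null-space direction (sum(v) = 0)
        if not np.any(v > 1e-12):
            v = -v
        pos = v > 1e-12
        step = np.min(w[idx][pos] / v[pos])  # largest step keeping weights >= 0
        w[idx] = w[idx] - step * v           # at least one weight hits zero
        w[w < 1e-12] = 0.0
    return w

rng = np.random.default_rng(7)
pts = rng.standard_normal((10, 2))
w0 = rng.random(10)
w0 /= w0.sum()
target = w0 @ pts

w = caratheodory(pts, w0)
assert np.count_nonzero(w > 1e-12) <= 3   # at most N + 1 atoms in R^2
assert np.isclose(w.sum(), 1.0)           # total mass preserved
assert np.allclose(w @ pts, target)       # barycenter preserved
```

Theorem 9 applies the same reduction in the lifted space $\mathbb{R}^{NK}$, which is where the $KN+1$ bound comes from.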

