PARALLEL DEEP NEURAL NETWORKS HAVE ZERO DUALITY GAP

Abstract

Training deep neural networks is a challenging non-convex optimization problem. Recent work has proven that strong duality holds (i.e., the duality gap is zero) for regularized finite-width two-layer ReLU networks and consequently provided an equivalent convex training problem. However, extending this result to deeper networks remains an open problem. In this paper, we prove that the duality gap for deeper linear networks with vector outputs is non-zero. In contrast, we show that a zero duality gap can be obtained by stacking standard deep networks in parallel, which we call a parallel architecture, and modifying the regularization. We thereby prove strong duality and the existence of equivalent convex problems that enable globally optimal training of deep networks. As a by-product of our analysis, we demonstrate that weight decay regularization on the network parameters explicitly encourages low-rank solutions via closed-form expressions. In addition, we show that strong duality holds for three-layer standard ReLU networks given rank-1 data matrices.

1. INTRODUCTION

Deep neural networks demonstrate outstanding representation and generalization abilities in popular learning problems ranging from computer vision and natural language processing to recommendation systems. Although the training problem of deep neural networks is a highly non-convex optimization problem, simple first-order gradient-based algorithms, such as stochastic gradient descent, can find a solution with good generalization properties. However, due to the non-convex and non-linear nature of the training problem, the underlying theoretical reasons for this remain an open problem. The Lagrangian dual problem (Boyd et al., 2004) plays an important role in the theory of convex and non-convex optimization. For convex optimization problems, convex duality is an important tool to determine their optimal values and to characterize the optimal solutions. Even for a non-convex primal problem, the dual problem is a convex optimization problem that can be solved efficiently. By weak duality, the optimal value of the dual problem serves as a non-trivial lower bound on the optimal primal objective value. Although the duality gap is generally non-zero for non-convex problems, the dual problem provides a convex relaxation of the non-convex primal problem. For example, the semi-definite programming relaxation of the two-way partitioning problem can be derived from its dual problem (Boyd et al., 2004). Convex duality also has important applications in machine learning. In Paternain et al. (2019), the design problem of an all-encompassing reward is formulated as a constrained reinforcement learning problem, which is shown to have zero duality gap. This property gives a theoretical convergence guarantee for the primal-dual algorithm solving this problem. Meanwhile, the minimax generative adversarial network (GAN) training problem can be tackled using duality (Farnia & Tse, 2018).
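The two-way partitioning example can be made concrete with a short numerical sketch (variable names here are purely illustrative, not from the paper): weak duality guarantees that any dual feasible point lower-bounds the primal optimum, and for min_{x ∈ {±1}^n} x^T W x the dual feasible choice ν = −λ_min(W)·1 yields the bound n·λ_min(W) ≤ p*.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.standard_normal((n, n))
W = (W + W.T) / 2  # symmetric cost matrix

# Primal: minimize x^T W x over x in {-1, +1}^n (brute force for small n).
p_star = min(np.array(x) @ W @ np.array(x)
             for x in itertools.product((-1.0, 1.0), repeat=n))

# Dual feasible point: nu = -lambda_min(W) * 1 makes W + diag(nu) PSD,
# so the dual function value g(nu) = n * lambda_min(W) lower-bounds p*.
lam_min = np.linalg.eigvalsh(W)[0]  # eigvalsh returns ascending eigenvalues
dual_bound = n * lam_min

print(p_star, dual_bound)
assert dual_bound <= p_star + 1e-9  # weak duality holds
```

The gap between p_star and dual_bound is the duality gap; for this non-convex combinatorial problem it is typically strictly positive, which is exactly the situation the dual-based convex relaxation addresses.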
In a line of recent works, convex duality has also been applied to analyze the optimal layer weights of two-layer neural networks with linear or ReLU activations (Ergen & Pilanci, 2019; Pilanci & Ergen, 2020; Ergen & Pilanci, 2020a; b; Lacotte & Pilanci, 2020; Sahiner et al., 2020). Based on the convex duality framework, the training problem of two-layer neural networks with ReLU activation was represented as a single convex program in Pilanci & Ergen (2020). Such convex optimization formulations were extended to two-layer and three-layer convolutional neural network training problems in Ergen & Pilanci (2021b). Strong duality also holds for deep linear neural networks with scalar output (Ergen & Pilanci, 2021a). The convex optimization formulation essentially gives a detailed characterization of the global optimum of the training problem. This enables us to examine in numerical experiments whether popular optimizers for neural networks, such as gradient descent or stochastic gradient descent, converge to the global optimum of the training loss. Admittedly, a zero duality gap is hard to achieve for deep neural networks, especially for those with vector outputs. This makes it more difficult to understand deep neural networks through the lens of convex optimization. Fortunately, neural networks with parallel structures (also known as multi-branch architectures) appear to be easier to train. Practically, the usage of parallel neural networks dates back to AlexNet (Krizhevsky et al., 2012). On the other hand, it is known that overparameterized parallel neural networks have benign training landscapes (Haeffele & Vidal, 2017; Ergen & Pilanci, 2019). Parallel models with over-parameterization are essentially neural networks in the mean-field regime (Nitanda & Suzuki, 2017; Mei et al., 2018; Chizat & Bach, 2018; Mei et al., 2019; Rotskoff et al., 2019; Sirignano & Spiliopoulos, 2020; Akiyama & Suzuki, 2021; Chizat, 2021; Nitanda et al., 2020).
The deep linear model is also of great interest in the machine learning community. From another perspective, the standard two-layer network is equivalent to the parallel two-layer network. This may also explain why there is no duality gap for two-layer neural networks.
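This equivalence is easy to verify directly: a two-layer ReLU network with m hidden units computes the same function as the sum of m parallel single-neuron branches. A minimal numpy sketch (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 5, 3, 4  # samples, input dimension, hidden units / branches
X = rng.standard_normal((n, d))
W1 = rng.standard_normal((d, m))  # first-layer weights
w2 = rng.standard_normal(m)       # second-layer weights (scalar output)

relu = lambda z: np.maximum(z, 0)

# Standard two-layer ReLU network: f(X) = relu(X W1) w2.
y_standard = relu(X @ W1) @ w2

# The same map viewed as m parallel one-neuron branches, summed.
y_parallel = sum(relu(X @ W1[:, j]) * w2[j] for j in range(m))

assert np.allclose(y_standard, y_parallel)
```

Because the hidden-layer sum decomposes neuron by neuron, the "standard" and "parallel" views coincide at depth two; for deeper networks the product of weight matrices couples the units, which is why the distinction between the two architectures only matters beyond two layers.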

1.1. CONTRIBUTIONS

Following the convex duality framework introduced in Ergen & Pilanci (2021a; 2020a), which showed that the duality gap is zero for two-layer networks, we go beyond two layers and study convex duality for vector-output deep neural networks with linear and ReLU activations. Surprisingly, we prove that three-layer networks may have duality gaps depending on their architecture, unlike two-layer neural networks, which always have zero duality gap. We summarize our contributions as follows.
• For training standard vector-output deep linear networks with ℓ2 regularization, we precisely calculate the optimal values of the primal and dual problems and show that the duality gap is non-zero, i.e., the Lagrangian relaxation is inexact. We also demonstrate that ℓ2 regularization on the parameters explicitly forces a tendency toward low-rank solutions, which is amplified with depth. Nevertheless, we show that the optimal solution is available in closed form.
• For parallel deep linear networks with certain convex regularization, we show that the duality gap is zero, i.e., the Lagrangian relaxation is exact.
• For parallel deep ReLU networks of arbitrary depth, with certain convex regularization and a sufficient number of branches, we prove strong duality, i.e., we show that the duality gap is zero. Remarkably, this guarantees that there is a convex program equivalent to the original deep ReLU neural network training problem.
We summarize the duality gaps for parallel/standard neural networks in Table 1.
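The low-rank effect of weight decay mentioned above can be illustrated in the simplest two-factor (L = 2) linear case, where the classical identity min_{UV=M} ½(‖U‖_F² + ‖V‖_F²) = ‖M‖_* shows that squared ℓ2 regularization on the factors acts as nuclear-norm regularization on the product, and hence favors low rank. A numpy sketch of this identity (illustrative only, not the paper's general closed-form expressions):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 4))
U_s, s, Vt = np.linalg.svd(M, full_matrices=False)
nuc = s.sum()  # nuclear norm of M

# Balanced factorization M = U V attains (||U||_F^2 + ||V||_F^2) / 2 = ||M||_*.
U = U_s * np.sqrt(s)           # scale each column of U_s by sqrt(sigma_i)
V = np.sqrt(s)[:, None] * Vt   # scale each row of Vt by sqrt(sigma_i)
assert np.allclose(U @ V, M)
balanced_cost = 0.5 * (np.sum(U**2) + np.sum(V**2))
assert np.isclose(balanced_cost, nuc)

# Any other factorization M = (U A)(A^{-1} V) incurs at least ||M||_*.
for _ in range(100):
    A = rng.standard_normal((4, 4)) + 4 * np.eye(4)  # well-conditioned
    cost = 0.5 * (np.sum((U @ A)**2) + np.sum((np.linalg.inv(A) @ V)**2))
    assert cost >= nuc - 1e-9
```

For deeper products the same mechanism strengthens: ℓ2 penalties on L factors induce Schatten quasi-norm-type regularization on the end-to-end matrix, which is consistent with the paper's observation that the low-rank tendency is amplified with depth.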

1.2. NOTATIONS

We use bold capital letters to represent matrices and bold lowercase letters to represent vectors. Denote [n] = {1, . . . , n}. For a matrix W_l ∈ R^{m_{l−1} × m_l}, for i ∈ [m_{l−1}] and j ∈ [m_l], we denote w^{col}_{l,i} as its



For training the ℓ2 loss with deep linear networks using Schatten norm regularization, Zhang et al. (2019) show that there is no duality gap. The implicit regularization in training deep linear networks has been studied in Ji & Telgarsky (2018); Arora et al. (2019); Moroshko et al. (2020).

Modern neural network architectures, including Inception (Szegedy et al., 2017), Xception (Chollet, 2017) and SqueezeNet (Iandola et al., 2016), utilize the parallel structure. ResNeXt (Xie et al., 2017) and Wide ResNet (Zagoruyko & Komodakis, 2016), the "parallel" versions of ResNet (He et al., 2016a;b), exhibit improved performance in many applications. Recently, it was shown that neural networks with parallel architectures have smaller duality gaps (Zhang et al., 2019) compared to standard neural networks. Furthermore, Ergen & Pilanci (2021c;e) proved that there is no duality gap for three-layer parallel architectures.
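As a concrete illustration of the multi-branch structure discussed above, the following numpy sketch forms a parallel network by summing m independent depth-L ReLU branches; all sizes and helper names here are assumptions made for illustration, not the paper's notation:

```python
import numpy as np

rng = np.random.default_rng(3)
relu = lambda z: np.maximum(z, 0)

def branch_forward(X, weights):
    """Forward pass of one standard deep ReLU branch (linear last layer)."""
    h = X
    for W in weights[:-1]:
        h = relu(h @ W)
    return h @ weights[-1]

# A parallel (multi-branch) network sums m independent depth-L branches.
n, d, width, out, depth, m = 8, 4, 3, 2, 3, 5
X = rng.standard_normal((n, d))
branches = [
    [rng.standard_normal(shape) for shape in
     [(d, width)] + [(width, width)] * (depth - 2) + [(width, out)]]
    for _ in range(m)
]
y = sum(branch_forward(X, b) for b in branches)
assert y.shape == (n, out)
```

Each branch is itself a standard deep network; the parallel architecture only changes how their outputs are combined, which is the structural modification under which the paper establishes strong duality.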

