REVEALING THE STRUCTURE OF DEEP NEURAL NETWORKS VIA CONVEX DUALITY

Abstract

We study regularized deep neural networks (DNNs) and introduce a convex analytic framework to characterize the structure of the hidden layers. We show that a set of optimal hidden layer weights for a norm regularized DNN training problem can be explicitly found as the extreme points of a convex set. For the special case of deep linear networks with K outputs, we prove that each optimal weight matrix is rank-K and aligns with the previous layers via duality. More importantly, we apply the same characterization to deep ReLU networks with whitened data and prove that the same weight alignment holds. As a corollary, we prove that norm regularized deep ReLU networks yield spline interpolation for one-dimensional datasets, a result previously known only for two-layer networks. Furthermore, we provide closed-form solutions for the optimal layer weights when the data is rank-one or whitened. We then verify our theory via numerical experiments.

1. INTRODUCTION

Deep neural networks (DNNs) have become extremely popular due to their success in machine learning applications. Even though DNNs are highly over-parameterized and non-convex, simple first-order algorithms, e.g., Stochastic Gradient Descent (SGD), can be used to train them successfully. Moreover, recent work has shown that highly over-parameterized networks trained with SGD obtain simple solutions that generalize well (Savarese et al., 2019; Parhi & Nowak, 2019; Ergen & Pilanci, 2020a;b): for 1D regression, two-layer ReLU networks achieving zero training error with the minimum Euclidean norm solution are proven to fit a linear spline model. Therefore, regularizing the solution towards smaller norm weights might be the key to understanding the generalization properties of DNNs. However, analyzing DNNs remains theoretically elusive even in the absence of nonlinear activations. We therefore study norm regularized DNNs and develop a framework based on convex duality through which a set of optimal solutions to the training problem can be analytically characterized.

Deep linear networks have been the subject of extensive theoretical analysis due to their tractability. A line of research (Saxe et al., 2013; Arora et al., 2018a; Laurent & Brecht, 2018; Du & Hu, 2019; Shamir, 2018) focused on GD training dynamics; however, these works lack an analysis of the generalization properties of deep networks. Another line of research (Gunasekar et al., 2017; Arora et al., 2019; Bhojanapalli et al., 2016) studied generalization via matrix factorization and showed that linear networks trained with GD converge to minimum nuclear norm solutions. Later on, Arora et al. (2018b); Du et al. (2018) showed that gradient flow enforces the layer weights to align. Ji & Telgarsky (2019) further proved that each layer weight matrix is asymptotically rank-one.
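The low-rank phenomenon described above can be observed in a minimal numerical sketch. This is our own illustration in the regularized setting studied in this paper, not code or an experiment from any of the cited works; the network sizes, random seed, step size, and regularization strength are arbitrary choices. Plain gradient descent on a 3-layer linear network with squared loss and Frobenius-norm regularization drives the hidden weight matrix toward rank one (here K = 1):

```python
import numpy as np

# Illustrative sketch: GD on a regularized 3-layer linear network.
# All sizes, the seed, the step size, and beta are arbitrary choices.
np.random.seed(0)
n, d, m, K = 30, 6, 8, 1            # samples, input dim, width, outputs
X = np.random.randn(n, d)
Y = X @ np.random.randn(d, K)       # noiseless linear targets

beta, lr = 0.1, 0.001
W1 = 0.2 * np.random.randn(d, m)
W2 = 0.2 * np.random.randn(m, m)
W3 = 0.2 * np.random.randn(m, K)

for _ in range(30000):
    H1 = X @ W1
    H2 = H1 @ W2
    R = H2 @ W3 - Y                 # residual of the fit
    # Gradients of ||X W1 W2 W3 - Y||_F^2 + beta * sum_l ||W_l||_F^2
    G3 = 2 * H2.T @ R + 2 * beta * W3
    G2 = 2 * H1.T @ R @ W3.T + 2 * beta * W2
    G1 = 2 * X.T @ R @ W3.T @ W2.T + 2 * beta * W1
    W1, W2, W3 = W1 - lr * G1, W2 - lr * G2, W3 - lr * G3

s = np.linalg.svd(W2, compute_uv=False)
print(s[0], s[1])  # second singular value should be comparatively tiny
```

Note that any stationary point with beta > 0 satisfies W2 = -(1/beta) H1^T R W3^T, which has rank at most K, so the observed spectral decay is expected.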
These results provide insights into the structure of the optimal layer weights; however, they require multiple strong assumptions, e.g., linearly separable training data and a strictly decreasing loss function, which make the results impractical. Furthermore, Zhang et al. (2019) provided some characterizations for nonstandard networks, which are valid only for hinge loss and specific regularizers that involve the data matrix. Unlike these studies, we introduce a complete characterization of the regularized deep network training problem without requiring such assumptions.

Our contributions: 1) We introduce a convex analytic framework that characterizes a set of optimal solutions to regularized training problems as the extreme points of a convex set, which is valid for vector outputs and popular loss functions including squared, cross entropy, and hinge loss (extensions to other loss functions are presented in Appendix A.1); 2) For deep linear networks with K outputs, we prove via convex duality that each optimal layer weight matrix aligns with the previous layers and is rank-K; 3) For deep ReLU networks, we obtain the same weight alignment result for whitened or rank-one data matrices. As a corollary, we obtain closed-form solutions for the optimal hidden layer weights when the data is whitened or rank-one (see Theorems 4.1 and 4.3).

[Figure 1: One-dimensional interpolation using L-layer ReLU networks with 20 neurons in each hidden layer. As predicted by Corollary 4.2, the optimal solution is given by piecewise linear splines for any L ≥ 2.]

Additionally, we provide a comparison of this characterization with previous studies:

                              Width (m)   Depth (L)   Vector outputs (K)
  Savarese et al. (2019)      ∞           2           K = 1
  Parhi & Nowak (2019)        ∞           2           K = 1
  Ergen & Pilanci (2020a;b)   finite      2           K = 1
  Our work                    finite      L ≥ 2       K ≥ 1
As another corollary, we prove that the optimal networks are linear spline interpolators for one-dimensional, i.e., rank-one, data, which generalizes the two-layer results for one-dimensional data in Savarese et al. (2019); Parhi & Nowak (2019); Ergen & Pilanci (2020a;b) to arbitrary depth. We note that the analysis of ReLU networks for the one-dimensional data considered in these works, which is a special case of our rank-one/whitened data assumption, is non-trivial.

Notation: We denote matrices/vectors as uppercase/lowercase bold letters. We use $\mathbf{0}_k$ (or $\mathbf{1}_k$) and $\mathbf{I}_k$ to denote a vector of zeros (or ones) and the identity matrix of size $k$, respectively. We denote the set of integers from 1 to $n$ as $[n]$. To denote the Frobenius, operator, and nuclear norms, we use $\|\cdot\|_F$, $\|\cdot\|_2$, and $\|\cdot\|_*$, respectively. Furthermore, $\sigma_{\max}(\cdot)$ and $\sigma_{\min}(\cdot)$ represent the maximum and minimum singular values, respectively, and $\mathcal{B}_2$ is defined as $\mathcal{B}_2 := \{\mathbf{u} \in \mathbb{R}^d \mid \|\mathbf{u}\|_2 \le 1\}$.
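The spline corollary can be made concrete with a small construction. This is our own illustration rather than the training procedure analyzed in the paper: a two-layer ReLU network with one neuron per data point whose output is exactly the linear spline interpolating a 1D dataset (the toy data values are arbitrary):

```python
import numpy as np

# Constructive sketch: a two-layer ReLU network realizing the linear
# spline through sorted 1D data points (illustration only).
x = np.array([0.0, 1.0, 2.0, 3.5])   # sorted 1D inputs (toy data)
y = np.array([0.0, 1.0, 0.0, 2.0])   # labels

slopes = np.diff(y) / np.diff(x)      # slope on each segment
# One ReLU kink at every data point except the last; the second-layer
# weights are the slope changes: f(t) = y[0] + sum_i c_i * relu(t - x_i).
c = np.diff(slopes, prepend=0.0)

def f(t):
    # t: 1D array of query points
    return y[0] + np.sum(c * np.maximum(t[:, None] - x[:-1], 0.0), axis=1)

t = np.array([0.5, 1.75, 3.0])        # probe points inside the data range
print(np.allclose(f(x), y))                    # interpolates the data: True
print(np.allclose(f(t), np.interp(t, x, y)))   # matches the linear spline: True
```

Here `np.interp` serves as the reference piecewise linear interpolator within the range of the data.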

1.1. OVERVIEW OF OUR RESULTS

We consider an $L$-layer network with layer weights $\mathbf{W}_l \in \mathbb{R}^{m_{l-1} \times m_l}$, $\forall l \in [L]$, where $m_0 = d$ and $m_L = 1$. Then, given a data matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, the output of the network is
$$f_{\theta,L}(\mathbf{X}) = \mathbf{A}_{L-1}\mathbf{w}_L, \qquad \mathbf{A}_l = g(\mathbf{A}_{l-1}\mathbf{W}_l)\;\; \forall l \in [L-1],$$
where $\mathbf{A}_0 = \mathbf{X}$ and $g(\cdot)$ is the activation function. Given a label vector $\mathbf{y} \in \mathbb{R}^n$, the training problem can be formulated as
$$\min_{\{\theta_l\}_{l=1}^{L}} \mathcal{L}(f_{\theta,L}(\mathbf{X}), \mathbf{y}) + \beta \mathcal{R}(\theta), \tag{1}$$
where $\mathcal{L}(\cdot,\cdot)$ is an arbitrary loss function, $\mathcal{R}(\theta)$ is a regularizer on the layer weights, $\beta > 0$ is a regularization parameter, $\theta_l = \{\mathbf{W}_l, m_l\}$, and $\theta := \{\theta_l\}_{l=1}^{L}$. For simplicity of presentation, we illustrate the conventional training setup with squared loss and $\ell_2^2$-norm regularization, i.e., $\mathcal{L}(f_{\theta,L}(\mathbf{X}), \mathbf{y}) = \|f_{\theta,L}(\mathbf{X}) - \mathbf{y}\|_2^2$ and $\mathcal{R}(\theta) = \sum_{l=1}^{L} \|\mathbf{W}_l\|_F^2$. However, our analysis is valid for arbitrary loss functions and different regularization terms, as proven in the Appendix (extensions to other loss functions, e.g., cross entropy and hinge loss, are presented in Appendix A.1). Thus, we consider the following optimization problem
$$P^* = \min_{\{\theta_l\}_{l=1}^{L}} \mathcal{L}(f_{\theta,L}(\mathbf{X}), \mathbf{y}) + \beta \sum_{l=1}^{L} \|\mathbf{W}_l\|_F^2. \tag{2}$$
Next, we show that the minimum $\ell_2^2$-norm problem is equivalent to a minimum $\ell_1$-norm problem after a rescaling.

Lemma 1.1. The following problems are equivalent (the proof is presented in Appendix A.3):
$$\min_{\{\theta_l\}_{l=1}^{L}} \mathcal{L}(f_{\theta,L}(\mathbf{X}), \mathbf{y}) + \beta \sum_{l=1}^{L} \|\mathbf{W}_l\|_F^2 = \min_{\{\theta_l\}_{l=1}^{L},\, t} \mathcal{L}(f_{\theta,L}(\mathbf{X}), \mathbf{y}) + 2\beta \|\mathbf{w}_L\|_1 + \beta (L-2) t^2 \;\; \text{s.t.} \;\; \mathbf{w}_{L-1,j} \in \mathcal{B}_2\; \forall j,\;\; \|\mathbf{W}_l\|_F \le t\; \forall l \in [L-2],$$
where $\mathbf{w}_{L-1,j}$ denotes the $j$th column of $\mathbf{W}_{L-1}$. Using Lemma 1.1, we first take the dual with respect to the output layer weights $\mathbf{w}_L$ and then change the order of min-max to obtain the following dual of the deep network training problem, which provides a lower bound (see Appendix A.1 for the definitions and details):
$$P^* \ge D^* = \min_{\substack{\mathbf{w}_{L-1,j} \in \mathcal{B}_2,\, \forall j \\ \|\mathbf{W}_l\|_F \le t,\, \forall l \in [L-2]}} \;\; \max_{\boldsymbol{\lambda}} \; -\mathcal{L}^*(\boldsymbol{\lambda}) + \beta (L-2) t^2 \;\; \text{s.t.} \;\; \|\mathbf{A}_{L-1}^T \boldsymbol{\lambda}\|_\infty \le 2\beta, \tag{3}$$
where $\mathcal{L}^*$ denotes the conjugate of the loss function.
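The rescaling behind Lemma 1.1 can be sanity-checked numerically. The following is a sketch with arbitrary random data (our own illustration, not code from the paper): by positive homogeneity of the ReLU, per-neuron rescalings of the last two layers preserve the network function, and optimal balancing converts the squared-norm penalty on those layers into the l1 penalty on the output weights together with the unit-ball constraint on the hidden columns.

```python
import numpy as np

# Sanity check of the Lemma 1.1 rescaling on random data (illustration).
np.random.seed(1)
relu = lambda z: np.maximum(z, 0.0)

A = np.random.randn(8, 5)   # activations A_{L-2} feeding the last two layers
W = np.random.randn(5, 4)   # hidden weights W_{L-1}
w = np.random.randn(4)      # output weights w_L

out = relu(A @ W) @ w

# Step 1: normalize the columns of W_{L-1} into the unit ball B_2 and
# absorb the scales into w_L; the network output is unchanged.
scales = np.linalg.norm(W, axis=0)
W_bal, w_bal = W / scales, w * scales
assert np.allclose(out, relu(A @ W_bal) @ w_bal)

# Step 2: rebalance neuron j by alpha_j = sqrt(|w_j|); by AM-GM this
# minimizes alpha^2 + w_j^2 / alpha^2 per neuron, so the total squared
# norm of the last two layers collapses to 2 * ||w_bal||_1.
alpha = np.sqrt(np.abs(w_bal))
W_opt, w_opt = W_bal * alpha, w_bal / alpha
assert np.allclose(out, relu(A @ W_opt) @ w_opt)
print(np.isclose(np.sum(W_opt**2) + np.sum(w_opt**2),
                 2 * np.linalg.norm(w_bal, 1)))   # True
```

This is exactly why the $\ell_2^2$-regularized problem can be rewritten with the $2\beta\|\mathbf{w}_L\|_1$ term and the constraint $\mathbf{w}_{L-1,j} \in \mathcal{B}_2$ in Lemma 1.1.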

