REVEALING THE STRUCTURE OF DEEP NEURAL NETWORKS VIA CONVEX DUALITY

Abstract

We study regularized deep neural networks (DNNs) and introduce a convex analytic framework to characterize the structure of the hidden layers. We show that a set of optimal hidden layer weights for a norm regularized DNN training problem can be explicitly found as the extreme points of a convex set. For the special case of deep linear networks with K outputs, we prove that each optimal weight matrix is rank-K and aligns with the previous layers via duality. More importantly, we apply the same characterization to deep ReLU networks with whitened data and prove that the same weight alignment holds. As a corollary, we prove that norm regularized deep ReLU networks yield spline interpolation for one-dimensional datasets, which was previously known only for two-layer networks. Furthermore, we provide closed-form solutions for the optimal layer weights when the data is rank-one or whitened. We then verify our theory via numerical experiments.
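To make the 1D spline corollary concrete, the following numpy sketch (our illustration, not the paper's construction or proof) builds a two-layer ReLU network whose output is exactly the connect-the-dots linear spline through a 1D dataset — the form the minimum-norm interpolant is proven to take. Each hidden unit contributes a kink at one data point, and the output weights are the slope changes between consecutive segments.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def spline_relu_net(x_knots, y_knots):
    """Two-layer ReLU network realizing the linear spline through
    (x_knots, y_knots) on the data range [x_0, x_{n-1}].
    Hidden unit i computes relu(x - x_i); output weights are the
    slope changes at the knots (illustrative construction)."""
    s = np.diff(y_knots) / np.diff(x_knots)        # segment slopes
    alpha = np.concatenate(([s[0]], np.diff(s)))   # slope changes at knots
    b = -x_knots[:-1]                              # hidden-layer biases
    def f(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        h = relu(x[:, None] + b[None, :])          # hidden activations
        return y_knots[0] + h @ alpha              # linear output layer
    return f

# Small 1D dataset (hypothetical values for illustration).
x = np.array([0.0, 1.0, 2.5, 4.0])
y = np.array([1.0, 3.0, 0.5, 2.0])
f = spline_relu_net(x, y)

# On the data range the network matches the piecewise-linear interpolant.
xs = np.linspace(0.0, 4.0, 81)
assert np.allclose(f(xs), np.interp(xs, x, y))
```

The construction only pins down the function the network represents; which weight configuration realizing it has minimum norm is exactly what the regularized training problem selects.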

1. INTRODUCTION

Deep neural networks (DNNs) have become extremely popular due to their success in machine learning applications. Even though DNNs are highly over-parameterized and non-convex, simple first-order algorithms, e.g., Stochastic Gradient Descent (SGD), can successfully train them. Moreover, recent work has shown that highly over-parameterized networks trained with SGD reach simple solutions that generalize well (Savarese et al., 2019; Parhi & Nowak, 2019; Ergen & Pilanci, 2020a;b): two-layer ReLU networks that achieve zero training error with minimum Euclidean norm weights are proven to fit a linear spline model in 1D regression. Therefore, regularizing the solution toward smaller norm weights may be key to understanding the generalization properties of DNNs. However, analyzing DNNs remains theoretically elusive even in the absence of nonlinear activations. We therefore study norm regularized DNNs and develop a framework based on convex duality through which a set of optimal solutions to the training problem can be analytically characterized.

Deep linear networks have been the subject of extensive theoretical analysis due to their tractability. One line of research (Saxe et al., 2013; Arora et al., 2018a; Laurent & Brecht, 2018; Du & Hu, 2019; Shamir, 2018) focused on GD training dynamics but does not analyze the generalization properties of deep networks. Another line of research (Gunasekar et al., 2017; Arora et al., 2019; Bhojanapalli et al., 2016) studied generalization properties via matrix factorization and showed that linear networks trained with GD converge to minimum nuclear norm solutions. Later, Arora et al. (2018b); Du et al. (2018) showed that gradient flow enforces the layer weights to align, and Ji & Telgarsky (2019) further proved that each layer weight matrix is asymptotically rank-one.
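The minimum nuclear norm and layer alignment phenomena above have a simple numerical counterpart via a classical identity: for any matrix M, the balanced two-layer factorization M = W2 W1 built from the SVD attains (1/2)(||W1||_F^2 + ||W2||_F^2) = ||M||_*, the minimum over all factorizations, and its factors are aligned in the sense that W2^T W2 = W1 W1^T. A small numpy check (our illustration, not a result from this paper):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 4))

# Balanced two-layer factorization M = W2 @ W1 built from the SVD.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
W2 = U * np.sqrt(s)                # scale columns of U by sqrt(singular values)
W1 = np.sqrt(s)[:, None] * Vt      # scale rows of Vt by sqrt(singular values)
assert np.allclose(W2 @ W1, M)

# Squared Frobenius penalty of the balanced factors equals the nuclear norm.
frob_penalty = 0.5 * (np.linalg.norm(W1) ** 2 + np.linalg.norm(W2) ** 2)
nuclear_norm = s.sum()
assert np.isclose(frob_penalty, nuclear_norm)

# Layer alignment: adjacent layers share the same singular structure.
assert np.allclose(W2.T @ W2, W1 @ W1.T)
```

This two-layer identity is the simplest instance of the weight alignment that the paper characterizes for deeper networks through convex duality.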
These results provide insights into the structure of the optimal layer weights; however, they rely on multiple strong assumptions, e.g., linearly separable training data and a strictly decreasing loss function, which limit their practical relevance. Furthermore, Zhang et al. (2019) provided some characterizations for nonstandard networks, which are valid only for hinge loss and specific regularizations that involve the data matrix. Unlike these studies, we introduce a complete characterization for the regularized deep network training problem without requiring such assumptions.

Our contributions: 1) We introduce a convex analytic framework that characterizes a set of optimal solutions to regularized training problems as the extreme points of a convex set, which is valid for vector outputs and popular loss functions including squared, cross entropy, and hinge loss¹; 2) For deep linear networks with K outputs, we prove that each optimal layer weight matrix aligns

¹ Extensions to other loss functions, e.g., cross entropy and hinge loss, are presented in Appendix A.1

