RETHINK DEPTH SEPARATION WITH INTRA-LAYER LINKS

Anonymous

Abstract

The depth separation theory is now widely accepted as an effective explanation for the power of depth. It consists of two parts: i) there exists a function representable by a deep network; ii) such a function cannot be represented by a shallow network whose width is below a threshold. Here, we report that adding intra-layer links can greatly improve a network's representation capability, which we demonstrate through bound estimation, explicit construction, and functional space analysis. We then modify the depth separation theory by showing that a shallow network with intra-layer links does not need to be as wide as before to express certain hard functions constructed by a deep network, including the renowned "sawtooth" functions. Our results supplement the existing depth separation theory by examining its limits in a broader setting. They also suggest that, once configured with an appropriate structure, a shallow and wide network may have expressive power on a par with a deep network.

1. INTRODUCTION

Due to the widespread applications of deep networks in many important fields (LeCun et al., 2015), mathematically understanding the power of deep networks has been a central problem in deep learning theory (Poggio et al., 2020). The key issue is figuring out how expressive a deep network is, or why increasing depth promotes the expressivity of a neural network better than increasing width. In this regard, there has been a plethora of studies on the expressivity of deep networks, collectively referred to as the depth separation theory. A popular idea for demonstrating the expressivity of depth is complexity characterization, which introduces appropriate complexity measures for functions represented by neural networks (Pascanu et al., 2013; Montufar et al., 2014; Telgarsky, 2015; Montúfar, 2017; Serra et al., 2018; Hu & Zhang, 2018; Xiong et al., 2020; Bianchini & Scarselli, 2014; Raghu et al., 2017), and then shows that increasing depth can greatly boost such a complexity measure. In contrast, a more concrete way to show the power of depth is to construct functions that can be expressed by a small network of a given depth but cannot be approximated by shallower networks unless their width is sufficiently large (Telgarsky, 2015; 2016; Arora et al., 2016; Eldan & Shamir, 2016; Safran & Shamir, 2017; Venturi et al., 2021). For example, Eldan & Shamir (2016) constructed a radial function and used Fourier spectrum analysis to show that a two-hidden-layer network can represent it with a polynomial number of neurons, whereas a one-hidden-layer network needs an exponential number of neurons to achieve the same level of error. Telgarsky (2015) employed a ReLU network to build a one-dimensional "sawtooth" function whose number of pieces scales exponentially with depth. As such, a deep network can construct a sawtooth function with many pieces, while a shallow network cannot unless it is very wide. Arora et al.
(2016) derived an upper bound on the maximal number of pieces produced by a univariate ReLU network, and used this bound to establish the separation between deep and shallow networks. In a broad sense, we summarize the elements of establishing a depth separation theorem as follows: i) there exists a function representable by a deep network; ii) such a function cannot be represented by a shallow network whose width is below a threshold. The depth separation theory is nowadays widely accepted as an effective explanation for the power of depth. However, we argue that depth separation does not hold when we slightly adjust the structure of the shallow networks. Our investigation is on ReLU networks. As shown in Figure 1(c), inspired by ResNet, which adds residual connections across layers, we add residual connections within a layer, so that a neuron takes the output of its neighboring neuron as an extra input. We then find that inserting intra-layer links can greatly increase the maximum number of pieces represented by a shallow network. As such, we can modify the statement of depth separation: without going as wide as before, a shallow network can express as complicated a function as a deep network can. Our result is valuable in two aspects. On the one hand, it non-trivially supplements the depth separation theory. In reality, a neural network is often not feedforward but uses shortcuts to link distant layers so as to facilitate feature reuse and easy training. Exploring depth separation in the shortcut paradigm reveals the limits of the existing theory and motivates us to rethink the genuine power of depth. On the other hand, the superiority of depth over width nowadays seems to have become a doctrine for most deep learning practitioners. However, studies such as width-depth equivalence (Fan et al., 2020) and the representation ability of wide networks (Lu et al., 2017; Levine et al., 2020) show the essential role of width.
In the same vein, our work suggests the potential of wide networks in expressivity. Once configured with an appropriate structure, a shallow and wide network may have expressive power on a par with a deep network. Note that adding intra-layer links is not equivalent to increasing depth: increasing depth means adding layers, whereas intra-layer links merely connect neurons within the same layer, a slight change to the original network. Specifically, our roadmap to the modification of depth separation theorems includes two milestones. 1) Through bound analysis (Theorems 4, 6, and 8), explicit construction (Propositions 1, 2, and 3), and functional space analysis (Theorem 10), we substantiate that a network with intra-layer links can produce many more pieces than a feedforward network, and that the gain is at most exponential, i.e., (3/2)^k, where k is the number of hidden layers. 2) Since intra-layer links can yield more pieces, they can be used to modify depth separation theorems by empowering a shallow network to represent a function constructed by a deep network, even if the width of this shallow network is below the prescribed threshold. The modification is done in the cases of depth 2 vs 3 (Theorem 12) and depth 2 vs k (Theorem 13, the famous sawtooth function (Telgarsky, 2015)). We also point out that depth separation cannot be fully established from bound analysis alone, unless the bound is proved to be tight. Thus, Arora et al. (2016)'s depth separation theorem might need to be re-examined. To summarize, our contributions are threefold. 1) We point out the limitation of depth separation and propose inserting intra-layer links in shallow networks. 2) We show via bound estimation, explicit construction, and functional space analysis that intra-layer links enable a ReLU network to produce more pieces.
3) We modify depth separation results, including Telgarsky (2015)'s famous theorem, by demonstrating that a shallow network with intra-layer links does not need to be as wide as before to represent a function constructed by a deep network.
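The exponential piece growth behind the sawtooth argument is easy to see numerically. The sketch below is our own illustration (not code from this paper): a width-2 ReLU layer computes the "tent" map on [0, 1], and composing it k times yields a sawtooth with 2^k linear pieces, so O(k) neurons in depth buy exponentially many pieces, whereas a one-hidden-layer network needs roughly one neuron per breakpoint.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def triangle(x):
    # A width-2 ReLU layer computing the tent map on [0, 1]:
    # t(x) = 2x for x <= 1/2 and 2 - 2x for x > 1/2.
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def sawtooth(x, k):
    # Composing the tent map k times gives 2^k linear pieces on [0, 1],
    # i.e., exponentially many pieces from linearly many neurons.
    for _ in range(k):
        x = triangle(x)
    return x

def count_pieces(f, n=200001):
    # Count linear pieces on [0, 1] by detecting slope changes on a fine
    # grid; n - 1 = 200000 is divisible by 2^k for small k, so every
    # breakpoint lands on a grid point and the count is exact.
    xs = np.linspace(0.0, 1.0, n)
    slopes = np.round(np.diff(f(xs)) / np.diff(xs), 6)
    return 1 + int(np.sum(slopes[1:] != slopes[:-1]))
```

For instance, `count_pieces(lambda x: sawtooth(x, 3))` reports 8 pieces, doubling with each additional composed layer.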

2. RELATED WORK

Recently, a plethora of depth separation studies has shown the superiority of deep networks over shallow ones from the viewpoints of complexity analysis and constructive analysis. The complexity analysis characterizes the complexity of the function represented by a neural network, thereby demonstrating that increasing depth can greatly increase such a complexity measure. Currently, one of the most popular complexity measures is the number of linear regions, because it conforms to the functional structure of the widely-used ReLU networks. For example, Pascanu et al. (2013); Montufar et al. (2014); Montúfar (2017); Serra et al. (2018); Hu & Zhang (2018); Hanin & Rolnick (2019) estimated bounds on the number of linear regions generated by a fully-connected ReLU network by applying Zaslavsky's Theorem (Zaslavsky, 1997). Xiong et al. (2020) offered the first upper and lower bounds on the number of linear regions of convolutional networks. Other complexity measures include classification capabilities (Malach & Shalev-Shwartz, 2019), Betti numbers (Bianchini & Scarselli, 2014), trajectory lengths (Raghu et al., 2017), global curvature (Poole et al., 2016), and topological entropy (Bu et al., 2020). Please note that using complexity measures to justify the power of depth demands a tight bound estimation; otherwise, a loose bound cannot certify the claimed separation.

Figure 1: (a) feedforward, (b) residual, and (c) intra-linked.


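To make concrete how an intra-layer link buys extra pieces, consider the following minimal sketch in our own parameterization (the paper's exact formulation may differ): in a one-hidden-layer ReLU network of width 2, letting the second neuron additionally receive the output of the first before applying its own ReLU gives the second neuron a piecewise-linear pre-activation, and hence up to two breakpoints instead of one. A plain width-2 network produces at most 3 linear pieces; the intra-linked variant below produces 4.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def plain_width2(x):
    # Ordinary one-hidden-layer ReLU net of width 2: at most 2
    # breakpoints (one per neuron), hence at most 3 linear pieces.
    return relu(x) - 2.0 * relu(x - 0.5)

def intra_linked_width2(x, v=-2.0):
    # Hypothetical intra-linked layer: neuron 2 additionally receives the
    # output of its neighbor, neuron 1, before its own activation.
    h1 = relu(x)                 # breakpoint at x = 0
    h2 = relu(x + 0.5 + v * h1)  # pre-activation is already piecewise
                                 # linear, so h2 gets two breakpoints
    return h1 + h2               # 4 pieces in total, from only 2 neurons

def count_pieces(f, lo=-2.0, hi=2.0, n=4001):
    # Count linear pieces by detecting slope changes on a fine grid whose
    # points include the breakpoints at -0.5, 0, and 0.5.
    xs = np.linspace(lo, hi, n)
    slopes = np.round(np.diff(f(xs)) / np.diff(xs), 6)
    return 1 + int(np.sum(slopes[1:] != slopes[:-1]))
```

Here `count_pieces(plain_width2)` gives 3 while `count_pieces(intra_linked_width2)` gives 4: the single intra-layer link pushes the width-2 network past the feedforward bound of (number of neurons + 1) pieces.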