RETHINK DEPTH SEPARATION WITH INTRA-LAYER LINKS

Anonymous

Abstract

The depth separation theory is nowadays widely accepted as an effective explanation for the power of depth. It consists of two parts: i) there exists a function representable by a deep network; ii) such a function cannot be represented by a shallow network whose width is below a certain threshold. Here, we report that adding intra-layer links can greatly improve a network's representation capability, which we demonstrate via bound estimation, explicit construction, and functional space analysis. We then modify the depth separation theory by showing that a shallow network with intra-layer links does not need to go as wide as before to express certain hard functions constructed by a deep network, including the renowned "sawtooth" functions. Our results supplement the existing depth separation theory by examining its limits in a broader domain. They also suggest that, once configured with an appropriate structure, a shallow and wide network may have expressive power on a par with that of a deep network.

1. INTRODUCTION

Due to the widespread applications of deep networks in many important fields (LeCun et al., 2015), mathematically understanding the power of deep networks has been a central problem in deep learning theory (Poggio et al., 2020). The key issue is figuring out how expressive a deep network is, or why increasing depth promotes the expressivity of a neural network more than increasing width. In this regard, there have been a plethora of studies on the expressivity of deep networks, collectively referred to as the depth separation theory. A popular approach to demonstrating the expressivity of depth is complexity characterization: introduce an appropriate complexity measure for the functions represented by neural networks (Pascanu et al., 2013; Montufar et al., 2014; Telgarsky, 2015; Montúfar, 2017; Serra et al., 2018; Hu & Zhang, 2018; Xiong et al., 2020; Bianchini & Scarselli, 2014; Raghu et al., 2017), and then show that increasing depth can greatly boost this measure. In contrast, a more concrete way to show the power of depth is to construct functions that can be expressed by a small network of a given depth but cannot be approximated by shallower networks unless their width is sufficiently large (Telgarsky, 2015; 2016; Arora et al., 2016; Eldan & Shamir, 2016; Safran & Shamir, 2017; Venturi et al., 2021). For example, Eldan & Shamir (2016) constructed a radial function and used Fourier spectrum analysis to show that a two-hidden-layer network can represent it with a polynomial number of neurons, whereas a one-hidden-layer network needs an exponential number of neurons to achieve the same level of error. Telgarsky (2015) employed a ReLU network to build a one-dimensional "sawtooth" function whose number of pieces scales exponentially in the depth. As such, a deep network can construct a sawtooth function with many pieces, while a shallow network cannot unless it is very wide. Arora et al. (2016) derived an upper bound on the maximal number of pieces of a univariate ReLU network and used this bound to elaborate the separation between a deep and a shallow network.

In a broad sense, we summarize the elements of establishing a depth separation theorem as follows: i) there exists a function representable by a deep network; ii) such a function cannot be represented by a shallow network whose width is below a certain threshold. The depth separation theory is nowadays widely accepted as an effective explanation for the power of depth. However, we argue that depth separation does not hold when we slightly adjust the structure of the shallow network. Our investigation is on ReLU networks. As shown in Figure 1(c), inspired by ResNet, which adds residual connections across layers, we add residual connections within a layer.
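To make the sawtooth construction concrete, the following minimal Python sketch (our own illustration; the function names are ours, not from the cited works) realizes Telgarsky's "hat" map as a single width-2 ReLU layer and counts the linear pieces of its k-fold composition on a fine grid:

```python
def relu(x):
    return max(x, 0.0)

def hat(x):
    # Telgarsky's hat map on [0, 1] as a width-2 ReLU layer:
    # g(x) = 2*ReLU(x) - 4*ReLU(x - 1/2), i.e. 2x on [0, 1/2] and 2 - 2x on (1/2, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def sawtooth(x, k):
    # Composing the hat map k times gives a depth-k, constant-width ReLU network.
    for _ in range(k):
        x = hat(x)
    return x

def count_pieces(k, n=100001):
    # Count linear pieces by counting sign changes of the slope on a uniform grid.
    xs = [i / (n - 1) for i in range(n)]
    ys = [sawtooth(x, k) for x in xs]
    signs = [1 if ys[i + 1] > ys[i] else -1 for i in range(n - 1)]
    changes = sum(1 for i in range(len(signs) - 1) if signs[i] != signs[i + 1])
    return changes + 1
```

Each composition doubles the number of pieces, so a depth-k network of constant width realizes 2^k pieces, whereas a one-hidden-layer ReLU network with w neurons produces at most w + 1 pieces; this gap is the essence of the separation.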

