RETHINK DEPTH SEPARATION WITH INTRA-LAYER LINKS

Anonymous

Abstract

The depth separation theory is nowadays widely accepted as an effective explanation for the power of depth. It consists of two parts: i) there exists a function representable by a deep network; ii) such a function cannot be represented by a shallow network whose width is lower than a threshold. Here, we report that adding intra-layer links can greatly improve a network's representation capability, as shown through bound estimation, explicit construction, and functional space analysis. Then, we modify the depth separation theory by showing that a shallow network with intra-layer links does not need to go as wide as before to express some hard functions constructed by a deep network. Such functions include the renowned "sawtooth" functions. Our results supplement the existing depth separation theory by examining its limit in a broader domain. Also, our results suggest that once configured with an appropriate structure, a shallow and wide network may have expressive power on a par with a deep network.

1. INTRODUCTION

Due to the widespread applications of deep networks in many important fields (LeCun et al., 2015), mathematically understanding the power of deep networks has been a central problem in deep learning theory (Poggio et al., 2020). The key issue is figuring out how expressive a deep network is, or how increasing depth promotes the expressivity of a neural network better than increasing width. In this regard, there have been a plethora of studies on the expressivity of deep networks, which are collectively referred to as the depth separation theory.

A popular idea to demonstrate the expressivity of depth is complexity characterization, which introduces appropriate complexity measures for functions represented by neural networks (Pascanu et al., 2013; Montufar et al., 2014; Telgarsky, 2015; Montúfar, 2017; Serra et al., 2018; Hu & Zhang, 2018; Xiong et al., 2020; Bianchini & Scarselli, 2014; Raghu et al., 2017), and then reports that increasing depth can greatly boost such a complexity measure. In contrast, a more concrete way to show the power of depth is to construct functions that can be expressed by a small network of a given depth, but cannot be approximated by shallower networks unless their width is sufficiently large (Telgarsky, 2015; 2016; Arora et al., 2016; Eldan & Shamir, 2016; Safran & Shamir, 2017; Venturi et al., 2021). For example, Eldan & Shamir (2016) constructed a radial function and used Fourier spectrum analysis to show that a two-hidden-layer network can represent it with a polynomial number of neurons, but a one-hidden-layer network needs an exponential number of neurons to achieve the same level of error. Telgarsky (2015) employed a ReLU network to build a one-dimensional "sawtooth" function whose number of pieces scales exponentially with the depth. As such, a deep network can construct a sawtooth function with many pieces, while a shallow network cannot unless it is very wide. Arora et al. (2016) derived an upper bound on the maximal number of pieces for a univariate ReLU network, and used this bound to elaborate the separation between a deep and a shallow network.

In a broad sense, we summarize the elements of establishing a depth separation theorem as follows: i) there exists a function representable by a deep network; ii) such a function cannot be represented by a shallow network whose width is lower than a threshold. The depth separation theory is nowadays widely accepted as an effective explanation for the power of depth. However, we argue that depth separation does not hold when we slightly adjust the structure of the shallow networks. Our investigation is on ReLU networks. As shown in Figure 1(c), inspired by ResNet, which adds residual connections across layers, we add residual connections within a layer, which forces a neuron to take the output of its neighboring neuron. Then, we find that inserting intra-layer links can greatly increase the maximum number of pieces represented by a shallow network. As such, we can modify the statement of depth separation: without the need of going as wide as before, a shallow network can express as complicated a function as a deep network can.

Our result is valuable in two aspects. On the one hand, it non-trivially supplements the depth separation theory. In reality, a neural network often is not feedforward but uses shortcuts to link distant layers to facilitate feature reuse and easy training. Exploring depth separation in the shortcut paradigm reveals the limit of the existing theory and motivates us to rethink the genuine power of depth. On the other hand, the superiority of depth over width nowadays seems to have become a doctrine for most deep learning practitioners. However, studies such as width-depth equivalence (Fan et al., 2020) and the representation ability of wide networks (Lu et al., 2017; Levine et al., 2020) show the essential role of width. In the same vein, our work suggests the potential of wide networks in expressivity. Once configured with an appropriate structure, a shallow and wide network may have expressive power on a par with a deep network. Note that adding intra-layer links is not equivalent to increasing depth. The common understanding of increasing depth is to increase the number of layers, while intra-layer links only connect neurons in the same layer, which is a slight change to the original network.

Specifically, our roadmap to the modification of depth separation theorems includes two milestones. 1) Through bound analysis (Theorems 4, 6, and 8), explicit construction (Propositions 1, 2, and 3), and functional space analysis (Theorem 10), we substantiate that a network with intra-layer links can produce many more pieces than a feedforward network, and the gain is at most exponential, i.e., $(\frac{3}{2})^k$, where $k$ is the number of hidden layers. 2) Since intra-layer links can yield more pieces, they can be used to modify depth separation theorems by empowering a shallow network to represent a function constructed by a deep network, even if the width of this shallow network is lower than the prescribed threshold. The modification is done in the cases of $k^2$ vs. 3 (Theorem 12) and $k^2$ vs. $k$ (Theorem 13, the famous sawtooth function (Telgarsky, 2015)). Also, we point out that depth separation cannot be fully accomplished based on bound analysis unless the bound is proved to be tight. Thus, Arora et al. (2016)'s depth separation theorem might need to be re-examined.

To summarize, our contributions are threefold. 1) We point out the limitation of depth separation and propose to consider inserting intra-layer links in shallow networks. 2) We show via bound estimation, explicit construction, and functional space analysis that intra-layer links can make a ReLU network produce more pieces. 3) We modify depth separation results, including the famous theorem of Telgarsky (2015), by demonstrating that a shallow network with intra-layer links does not need to go as wide as before to represent a function constructed by a deep network.

2. RELATED WORK

Recently, a plethora of depth separation studies have shown the superiority of deep networks over shallow ones from the points of view of complexity analysis and constructive analysis.

The complexity analysis characterizes the complexity of the function represented by a neural network, thereby demonstrating that increasing depth can greatly increase such a complexity measure. Currently, one of the most popular complexity measures is the number of linear regions because it conforms to the functional structure of the widely used ReLU networks. For example, Pascanu et al. (2013), Montufar et al. (2014), Montúfar (2017), Serra et al. (2018), Hu & Zhang (2018), and Hanin & Rolnick (2019) estimated the bound of the number of linear regions generated by a fully-connected ReLU network by applying Zaslavsky's Theorem (Zaslavsky, 1997). Xiong et al. (2020) offered the first upper and lower bounds of the number of linear regions for convolutional networks. Other complexity measures include classification capabilities (Malach & Shalev-Shwartz, 2019), Betti numbers (Bianchini & Scarselli, 2014), trajectory lengths (Raghu et al., 2017), global curvature (Poole et al., 2016), and topological entropy (Bu et al., 2020). Please note that using complexity measures to justify the power of depth demands a tight bound estimation. Otherwise, it is insufficient to say that shallow networks cannot be as powerful as deep networks, since deep networks may not reach the upper bound.

The construction analysis finds a family of functions that are hard to approximate by a shallow network, but can be efficiently approximated by a deep network. Eldan & Shamir (2016) built a special radial function that is expressible by a 3-layer neural network with a polynomial number of neurons, whereas a 2-layer network can reach the same level of approximation only with an exponential number of neurons.
Later, Safran & Shamir (2017) extended this result to a ball function, which is a more natural separation result. Venturi et al. (2021) generalized the construction of this type to a non-radial function. Telgarsky (2015; 2016) used an $O(k^2)$-layer network to construct a sawtooth function. Given that such a function has an exponential number of pieces, it cannot be expressed by an $O(k)$-layer network unless the width is $O(\exp(k))$. Arora et al. (2016) estimated the maximal number of pieces a network can produce, and established the size-piece relation to advance the depth separation results from $(k^2, k)$ to $(k, k')$, where $k' < k$. Other smart constructions include polynomials (Rolnick & Tegmark, 2017), functions of a compositional structure (Poggio et al., 2017), Gaussian mixture models (Jalali et al., 2019), and so on. Our work also highlights the construction, and we use an intra-linked network to build a sawtooth function more efficiently.
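To make the sawtooth construction concrete, the following minimal sketch composes a two-neuron ReLU "tent" layer with itself and counts the resulting linear pieces numerically. The tent map $2\sigma(x) - 4\sigma(x - \frac{1}{2})$ is the common textbook form of this construction and is an assumption here, as are the helper names `tent`, `deep_sawtooth`, and `count_pieces`; none of them are taken from the cited papers.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent(x):
    # One hidden layer with two ReLU neurons: rises on [0, 1/2], falls on [1/2, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    # Compose the same two-neuron layer `depth` times.
    for _ in range(depth):
        x = tent(x)
    return x

def count_pieces(y):
    # For a sawtooth sampled on a grid aligned with its breakpoints,
    # the number of pieces is one plus the number of slope sign changes.
    signs = np.sign(np.diff(y))
    return 1 + int(np.count_nonzero(signs[1:] != signs[:-1]))

for depth in (1, 2, 3, 4):
    grid = np.linspace(0.0, 1.0, 8 * 2**depth + 1)
    print(depth, count_pieces(deep_sawtooth(grid, depth)))
```

In this sketch each additional layer doubles the number of pieces while the width stays at 2, which is the exponential-in-depth behavior that the depth separation results above rely on.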

3. NOTATION AND DEFINITION

Notation 1 (Feedforward networks). For an $\mathbb{R}^{w_0} \to \mathbb{R}$ ReLU DNN with widths $w_1, \ldots, w_k$ of $k$ hidden layers, we use $f_0 = [f_0^{(1)}, \ldots, f_0^{(w_0)}] = x \in \mathbb{R}^{w_0}$ to denote the input of the network. Let $f_i = [f_i^{(1)}, \ldots, f_i^{(w_i)}] \in \mathbb{R}^{w_i}$, $i = 1, \ldots, k$, be the vector composed of the outputs of all neurons in the $i$-th layer. For $i = 1, \ldots, k$, $j = 1, \ldots, w_i$, we use $g_i^{(j)} = \langle a_i^{(j)}, f_{i-1} \rangle + b_i^{(j)}$ to denote the pre-activation of the $j$-th neuron of the $i$-th layer, where $a_i^{(j)} \in \mathbb{R}^{w_{i-1}}$ and $b_i^{(j)} \in \mathbb{R}$ are parameters. Then $f_i^{(j)} = \sigma(g_i^{(j)})$, where $\sigma(\cdot)$ is the ReLU activation. The output of this network is $g_{k+1} = \langle a_k, f_k \rangle + b_k$, where $a_k \in \mathbb{R}^{w_k}$ and $b_k \in \mathbb{R}$ are parameters.

Notation 2 (Intra-linked networks). For an $\mathbb{R}^{w_0} \to \mathbb{R}$ ReLU DNN with every 2 paired neurons linked in each hidden layer and widths $w_1, \ldots, w_k$ of $k$ hidden layers, we use $\tilde f_0 = [\tilde f_0^{(1)}, \ldots, \tilde f_0^{(w_0)}] = x \in \mathbb{R}^{w_0}$ to denote the input of the network. Let $\tilde f_i = [\tilde f_i^{(1)}, \ldots, \tilde f_i^{(w_i)}] \in \mathbb{R}^{w_i}$ be the vector composed of the outputs of the neurons in the $i$-th layer. For $i = 1, \ldots, k$, $j = 1, \ldots, w_i$, we use $\tilde g_i^{(j)} = \langle \tilde a_i^{(j)}, \tilde f_{i-1} \rangle + \tilde b_i^{(j)}$ to denote the pre-activation of the $j$-th neuron in the $i$-th layer, where $\tilde a_i^{(j)} \in \mathbb{R}^{w_{i-1}}$ and $\tilde b_i^{(j)} \in \mathbb{R}$ are parameters. In an intra-linked network, the $j$-th and $(j+1)$-th neurons are linked, and the $(j+2)$-th and $(j+3)$-th neurons are linked. We prescribe

$\tilde f_i^{(j)} = \sigma(\tilde g_i^{(j)})$, $\tilde f_i^{(j+1)} = \sigma(\tilde g_i^{(j+1)} - \tilde f_i^{(j)})$, $\tilde f_i^{(j+2)} = \sigma(\tilde g_i^{(j+2)})$, $\tilde f_i^{(j+3)} = \sigma(\tilde g_i^{(j+3)} - \tilde f_i^{(j+2)})$.

As in the classical network, the output of the network is $\tilde g_{k+1} = \langle \tilde a_k, \tilde f_k \rangle + \tilde b_k$, where $\tilde a_k \in \mathbb{R}^{w_k}$ and $\tilde b_k \in \mathbb{R}$ are parameters.
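The two notations differ only in how the second neuron of each pair is activated. A minimal NumPy sketch of one hidden layer under each notation is given below. The function names and the example weights are hypothetical, and pairing neurons as (1, 2), (3, 4), and so on is an assumption about how the links are laid out.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def feedforward_layer(f_prev, A, b):
    # Notation 1: every neuron applies ReLU to its own pre-activation.
    return relu(A @ f_prev + b)

def intra_linked_layer(f_prev, A, b):
    # Notation 2 with neurons paired as (1, 2), (3, 4), ...:
    # the first neuron of each pair is a plain ReLU neuron, and the second
    # subtracts the first neuron's output before its own activation.
    g = A @ f_prev + b
    if g.shape[0] % 2 != 0:
        raise ValueError("width must be even so that every neuron has a partner")
    f = np.empty_like(g)
    f[0::2] = relu(g[0::2])
    f[1::2] = relu(g[1::2] - f[0::2])
    return f

# Hypothetical weights for a width-2 layer with a scalar input.
A = np.array([[3.0], [1.0]])
b = np.array([-3.0, 1.0])
x = np.array([2.0])
print(feedforward_layer(x, A, b), intra_linked_layer(x, A, b))
```

Both layers take the same weight matrix and bias, so in this sketch the intra-layer link changes the computation without adding parameters.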

Notation 3 (Sawtooth function). A piecewise linear (PWL) function $g : [a, b] \to \mathbb{R}$ is of "$N$-sawtooth" shape if $g(x) = (-1)^{n-1}\left(x - (n-1) \cdot \frac{b-a}{N}\right)$ for $x \in \left[(n-1) \cdot \frac{b-a}{N},\ n \cdot \frac{b-a}{N}\right]$, $n \in [N]$.

Definition 1 (Width and depth of a feedforward network (Arora et al., 2016)). For any number of hidden layers $k \in \mathbb{N}$ and input and output dimensions $w_0, w_{k+1} \in \mathbb{N}$, an $\mathbb{R}^{w_0} \to \mathbb{R}^{w_{k+1}}$ feedforward network is given by specifying a sequence of $k$ natural numbers $w_1, w_2, \ldots, w_k$ representing the widths of the hidden layers. The depth of the network is defined as $k + 1$. The width of the network is $\max\{w_1, \ldots, w_k\}$.

Definition 2 (Width and depth of a shortcut network (Fan et al., 2020)). Given a shortcut network $\Pi$, we delete the minimum number of links to make the resultant network $\Pi'$ a feedforward network without isolated neurons. Then, we define the width and depth of $\Pi$ as the width and depth of $\Pi'$.

Admittedly, defining the width and depth of a network embedded with shortcuts is tricky. A reasonably good definition should conform to the customary understanding of depth and width. For example, ResNet (He et al., 2016) shall not be taken as a wide network in the light of the proposed definition; otherwise, it conflicts with practitioners' common sense. Our definition of width and depth fits common sense for networks that are formed by slightly modifying a feedforward network, e.g., ResNet, DenseNet (Huang et al., 2017), and S3Net (Fan et al., 2018). Per our definition, the width and depth of intra-linked networks are also $\max\{w_1, \ldots, w_k\}$ and $k + 1$, respectively, the same as the width and depth of the corresponding feedforward network.

Despite being one-dimensional, the above results convincingly reveal that increasing depth can make a ReLU network express a much more complicated function, which is the heart of depth separation. Here, we shed new light on the depth separation problem with intra-layer links. Our primary argument is that if the intra-layer links shown in Figure 1(c) are inserted, some shallow networks that previously could not express certain hard functions constructed by deep networks become able to do so. Our investigation consists of two parts. First, we substantiate that adding intra-layer links can greatly increase the number of pieces via bound estimation, explicit construction, and functional space analysis. Then, we show that adding intra-layer links can empower shallow networks to represent complicated functions such as sawtooth functions, without the need of going as wide as before.

4.1.1. UPPER BOUND ESTIMATION

Lemma 3. Let $g : \mathbb{R} \to \mathbb{R}$ be a PWL function with $w + 1$ pieces. Then the breakpoints of $f := \sigma(g)$ consist of two parts: some old breakpoints of $g$ and at most $w + 1$ newly produced breakpoints. Furthermore, $f$ has $w + 1$ new breakpoints if and only if $g$ has $w + 1$ distinct zero points.

Proof. By direct calculation.

Theorem 4 (Upper bound of feedforward networks). Let $f : \mathbb{R} \to \mathbb{R}$ be a PWL function represented by an $\mathbb{R} \to \mathbb{R}$ ReLU DNN with depth $k + 1$ and widths $w_1, \ldots, w_k$ of $k$ hidden layers. Then $f$ has at most $\prod_{i=1}^{k} (w_i + 1)$ pieces.

This bound is the univariate case of the bound $\prod_{i=1}^{k} \sum_{j=0}^{n} \binom{w_i}{j}$, derived in Montúfar (2017) for $n$-dimensional inputs. In Appendix B, we offer constructions to show that this bound is achievable in a depth-bounded but width-unbounded network (depth=3) (Proposition 4) and a width-bounded (width=3) but depth-unbounded network (Proposition 5) in one-dimensional space. Previously, many bounds on linear regions were derived (Pascanu et al., 2013; Montufar et al., 2014; Montúfar, 2017; Xiong et al., 2020); however, it is unknown whether these bounds are vacuous or tight, particularly for networks with more than one hidden layer. What makes Propositions 4 and 5 special is that they substantiate for the first time that Montúfar (2017)'s bound is tight over three-layer and deeper networks, although these results are for the one-dimensional case.

Remark 1 (Sharpening the bound in (Arora et al., 2016)). Previously, Arora et al. (2016) computed the number of pieces produced by a network of depth $k + 1$ and widths $w_1, \ldots, w_k$ as $2^{k+1} \cdot (w_1 + 1) w_2 \cdots w_k$. The reason why their bound has an exponential term is that when considering how ReLU activation increases the number of pieces, they repeatedly counted the old breakpoints generated in the previous layer. Our Lemma 3 implies that the ReLU activation in fact cannot double the number of pieces. Since Arora et al.
(2016)'s bound is loose, their depth separation theorem needs to be re-examined.

Lemma 5. Let $g_1, g_2 : \mathbb{R} \to \mathbb{R}$ be two PWL functions with $w$ breakpoints in total. Set $f_1 := \sigma(g_1)$ and $f_2 := \sigma(g_2 - f_1)$. Then the breakpoints of $f_2$ consist of three parts: some breakpoints of $g_2$, some breakpoints of $f_1$, and at most $2w + 2$ newly produced breakpoints. Furthermore, $f_2$ has $2w + 2$ newly produced breakpoints if and only if $g_2 - f_1$ has $2w + 2$ distinct zero points.

Proof. A direct corollary of Lemma 3.

Let us illustrate why the intra-linked architecture can produce more pieces. Given two PWL functions $g_1$ and $g_2$ which have $w$ breakpoints in total, in the feedforward architecture, $\sigma(g_1)$ and $\sigma(g_2)$ have at most $3w + 2$ breakpoints in total, which contain at most $w$ old breakpoints of $g_1, g_2$ and at most $2w + 2$ newly produced breakpoints. However, in the intra-linked architecture, $\sigma(g_2 - \sigma(g_1))$ can produce more breakpoints because $\sigma(g_1)$ has two states: activated or deactivated. Then, $\sigma(g_1)$ and $\sigma(g_2 - \sigma(g_1))$ have at most $w$ old breakpoints of $g_1, g_2$ and $(w + 1) + (2w + 2) = 3w + 3$ newly produced breakpoints.

Theorem 6 (Upper bound of intra-linked networks). Let $f : \mathbb{R} \to \mathbb{R}$ be a PWL function represented by a ReLU DNN with depth $k + 1$, widths $w_1, \ldots, w_k$, and every two neurons linked in each hidden layer as in Figure 1(c). Assuming that $w_1, \ldots, w_k$ are even, $f$ has at most $\prod_{i=1}^{k} \left(\frac{3}{2} w_i + 1\right)$ pieces.

Proof. We prove by induction on $k$. For the base case $k = 1$, we assume that for every odd $j$, the neurons $\tilde f_1^{(j)}$ and $\tilde f_1^{(j+1)}$ are linked. The number of breakpoints of $\tilde f_1^{(j)}$, $j = 1, \ldots, w_1$, is at most $2 + (-1)^j$. Hence, the first layer yields at most $\frac{3}{2} w_1 + 1$ pieces.

For the induction step, we assume that for some $k \geq 1$, any $\mathbb{R} \to \mathbb{R}$ ReLU DNN with every two neurons linked in each hidden layer, depth $k + 1$ and widths $w_1, \ldots, w_k$ of $k$ hidden layers produces at most $\prod_{i=1}^{k} \left(\frac{3}{2} w_i + 1\right)$ pieces. Now we consider any $\mathbb{R} \to \mathbb{R}$ ReLU DNN with every two neurons linked in each hidden layer, depth $k + 2$ and widths $w_1, \ldots, w_{k+1}$ of $k + 1$ hidden layers. By the induction hypothesis, each $\tilde g_{k+1}^{(j)}$ has at most $\prod_{i=1}^{k} \left(\frac{3}{2} w_i + 1\right) - 1$ breakpoints. Then the breakpoints of each neuron in the $(k+1)$-th layer consist of some of these old breakpoints and newly produced breakpoints. By Lemmas 3 and 5, each linked pair contributes at most $\prod_{i=1}^{k} \left(\frac{3}{2} w_i + 1\right) + 2 \cdot \prod_{i=1}^{k} \left(\frac{3}{2} w_i + 1\right)$ new breakpoints. In all, the number of pieces we can get is at most

$1 + \frac{w_{k+1}}{2} \cdot \left[\prod_{i=1}^{k} \left(\frac{3}{2} w_i + 1\right) + 2 \cdot \prod_{i=1}^{k} \left(\frac{3}{2} w_i + 1\right)\right] + \prod_{i=1}^{k} \left(\frac{3}{2} w_i + 1\right) - 1 = \prod_{i=1}^{k+1} \left(\frac{3}{2} w_i + 1\right)$.

In the following theorems, we offer the bound estimation for high-dimensional cases. The detailed proof of Theorem 8 is given in Appendix A.

Theorem 7 (Upper bound of feedforward networks (Montúfar, 2017)). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a PWL function represented by an $\mathbb{R}^n \to \mathbb{R}$ ReLU DNN with depth $k + 1$ and widths $w_1, \ldots, w_k$ of $k$ hidden layers. Then $f$ has at most $\prod_{i=1}^{k} \sum_{j=0}^{n} \binom{w_i}{j}$ linear regions.

Theorem 8 (Upper bound of intra-linked networks). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a PWL function represented by an $\mathbb{R}^n \to \mathbb{R}$ ReLU DNN with every two neurons linked in each hidden layer, depth $k + 1$ and widths $w_1, \ldots, w_k$ of $k$ hidden layers. We assume each $w_i$ is even. Then $f$ has at most $\prod_{i=1}^{k} \sum_{j=0}^{n} \binom{\frac{3 w_i}{2} + 1}{j}$ linear regions.

Remark 2. Although both adding a new layer (going deep) and adding intra-layer links involve composition, their mechanisms of producing pieces are fundamentally different. The mechanism of going deep is the repetition effect (multiplication): the value of the function being composed oscillates, and each oscillation generates corresponding pieces. The mechanism of intra-layer links is the gating effect (addition): the neuron being embedded has two activation states, and each state is leveraged by the neuron it is linked to in order to produce a breakpoint. Such a mechanism essentially conforms to parallelism, which belongs to the width paradigm.
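The univariate bounds of Theorems 4 and 6 are easy to compare numerically. The sketch below is only an illustration: the function names are hypothetical, and the width lists are arbitrary examples rather than configurations studied in the paper.

```python
from math import prod

def feedforward_bound(widths):
    # Theorem 4: at most prod_i (w_i + 1) pieces.
    return prod(w + 1 for w in widths)

def intra_linked_bound(widths):
    # Theorem 6: at most prod_i (3 * w_i / 2 + 1) pieces, for even widths.
    if any(w % 2 for w in widths):
        raise ValueError("Theorem 6 assumes even widths")
    return prod(3 * w // 2 + 1 for w in widths)

for widths in ([4], [4, 4], [6, 4], [8, 8, 8]):
    ff = feedforward_bound(widths)
    il = intra_linked_bound(widths)
    # The last column is (3/2)^k, which the ratio il / ff approaches from below.
    print(widths, ff, il, il / ff, 1.5 ** len(widths))
```

Since $\frac{3}{2} w + 1 < \frac{3}{2}(w + 1)$, the ratio of the two bounds stays below $(\frac{3}{2})^k$ for $k$ hidden layers, in line with the "at most exponential" gain stated in the introduction.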

4.1.2. EXPLICIT CONSTRUCTION

Although the bound estimation offers some light, to convincingly illustrate that intra-layer links can increase the number of pieces, we need to supply explicit constructions for intra-linked networks. The number of pieces in the construction should be bigger than either the upper bound of feedforward networks or the maximal number a feedforward network can achieve. Specifically, the constructions for intra-linked networks in Propositions 1 and 2 have a number of pieces larger than the upper bounds of feedforward networks. In Proposition 3, by enumerating all possible cases, we present a construction for an intra-linked network of width 2 and arbitrary depth whose number of pieces is larger than what a feedforward network of width 2 and arbitrary depth can possibly achieve.

Proposition 1 (The bound $\prod_{i=1}^{k} \left(\frac{3 w_i}{2} + 1\right)$ is tight for a two-hidden-layer intra-linked network). Given an $\mathbb{R} \to \mathbb{R}$ two-hidden-layer ReLU network with every two neurons linked in each hidden layer, for any even $w_1 \geq 6$, $w_2 \geq 4$, there exists a PWL function represented by such a network whose number of pieces is $\left(\frac{3 w_1}{2} + 1\right)\left(\frac{3 w_2}{2} + 1\right)$.

Proof. To guarantee that the bound $\prod_{i=1}^{k} \left(\frac{3 w_i}{2} + 1\right)$ is tight, the following two conditions should be satisfied: (i) $\tilde g_i^{(j)}$ and $\tilde g_i^{(j+1)} - \tilde f_i^{(j)}$ have as many zero points as possible, so that $\sigma(\tilde g_i^{(j)})$ and $\sigma(\tilde g_i^{(j+1)} - \tilde f_i^{(j)})$ can produce the maximal number of breakpoints; (ii) all old breakpoints of $\tilde g_i^{(1)}, \ldots, \tilde g_i^{(w_i)}$ are preserved by $\tilde g_{i+1}^{(j)}$, an affine transform of $\tilde f_i^{(1)}, \ldots, \tilde f_i^{(w_i)}$.

We first consider the first hidden layer. Let

$\tilde f_1^{(1)}(x) = \sigma\left(\frac{9}{2} x - 27\right)$, $\tilde f_1^{(2)}(x) = \sigma\left(\frac{3}{2} x - \tilde f_1^{(1)}(x)\right)$,

$\tilde f_1^{(3)}(x) = \sigma(-2x + 2)$, $\tilde f_1^{(4)}(x) = \sigma\left(-x + 2 - \tilde f_1^{(3)}(x)\right)$,

$\tilde f_1^{(5)}(x) = \sigma\left(-\frac{7}{2} x - \frac{7}{4}\right)$, $\tilde f_1^{(6)}(x) = \sigma\left(-2x + 8 - \tilde f_1^{(5)}(x)\right)$.

When $w_1 = 6$, we set $g = -\frac{2}{9} \tilde f_1^{(1)} - \tilde f_1^{(2)} + \frac{1}{2} \tilde f_1^{(3)} + \tilde f_1^{(4)} - \frac{4}{7} \tilde f_1^{(5)} - \tilde f_1^{(6)}$.
When $w_1 > 6$, for each odd $j > 6$, let $\tilde f_1^{(j)} = \sigma\left(-5 (x - a_j + 3)\right)$, $\tilde f_1^{(j+1)} = \sigma\left(-2 (x - a_j) - \tilde f_1^{(j)}\right)$, where $a_j = -\frac{19}{2} - 9\left(\frac{j-1}{2} - 3\right)$. Then the output of the first layer is expressed as

$g = -\frac{2}{9} \tilde f_1^{(1)} - \tilde f_1^{(2)} + \frac{1}{2} \tilde f_1^{(3)} + \tilde f_1^{(4)} - \frac{4}{7} \tilde f_1^{(5)} - \tilde f_1^{(6)} + \sum_{j=7,\ j \text{ odd}}^{w_1} (-1)^{\frac{j+1}{2}} \left(\frac{2}{5} \tilde f_1^{(j)} + \tilde f_1^{(j+1)}\right)$,

which has $\frac{3}{2} w_1 + 1$ pieces and whose adjacent pieces have slopes of opposite signs. Note that any line $y = b$, where $b \in (-13/2, -6)$, can cross all pieces of $g + b$. Thus, $g$ fulfills the conditions of Lemma 5. We divide the breakpoints of $g$ into two parts: $B_{\text{upper}} = \{x : x \text{ is a breakpoint of } g \text{ and } g(x) > b\}$ and $B_{\text{lower}} = \{x : x \text{ is a breakpoint of } g \text{ and } g(x) \leq b\}$. We refer to their counts as $\#B_{\text{upper}}$ and $\#B_{\text{lower}}$, respectively.

contains all breakpoints of $B_{\text{upper}}$, and has $\#B_{\text{upper}} + \left(\frac{3}{2} w_1 + 1\right) + (3 w_1 + 2)$ breakpoints. To preserve all the breakpoints of $g$, we do the same for $-g$ to obtain $\tilde f_2^{(3)}$ and $\tilde f_2^{(4)}$, whose affine combination has $\#B_{\text{lower}} + \left(\frac{3}{2} w_1 + 1\right) + (3 w_1 + 2)$ breakpoints, contains all breakpoints in $B_{\text{lower}}$, and shares no breakpoints with the affine combination of $\tilde f_2^{(1)}, \tilde f_2^{(2)}$. Hence, the affine combination of $\tilde f_2^{(1)}, \tilde f_2^{(2)}, \tilde f_2^{(3)}, \tilde f_2^{(4)}$ has

$\#B_{\text{upper}} + \#B_{\text{lower}} + 2 \cdot \left(\frac{3 w_1}{2} + 1\right) + 2 \cdot (3 w_1 + 2) = \frac{3 w_1}{2} + 6 \cdot \left(\frac{3 w_1}{2} + 1\right)$

breakpoints, which contain all the breakpoints of $g$.

$\tilde f_2^{(1)}, \tilde f_2^{(2)}, \tilde f_2^{(3)}, \tilde f_2^{(4)}$ $\{\tilde f_2^{(i)}\}_{i=5}^{w_2}$ such that the affine transformation of $\{\tilde f_2^{(i)}\}_{i=1}^{w_2}$ has $\frac{3}{2} w_1 + \frac{3 w_2}{2} \cdot \left(\frac{3}{2} w_1 + 1\right) + 1 = \left(\frac{3 w_1}{2} + 1\right)\left(\frac{3 w_2}{2} + 1\right)$ pieces.

Proposition 2 (Use intra-linked networks to achieve a sawtooth function with $\prod_{i=1}^{k} \frac{3 w_i}{2}$ pieces). There exists a $[0, 1] \to \mathbb{R}$ function represented by an intra-linked ReLU DNN with depth $k + 1$ and widths $w_1, \ldots, w_k$ of $k$ hidden layers, whose number of pieces is at least $\frac{3 w_1}{2} \cdot \ldots \cdot \frac{3 w_k}{2}$.

Proof. Let $\phi(x) = x$ be defined over $[0, \Delta]$. The core of the proof is to use a one-hidden-layer network of $w \geq 2$ neurons to create $\frac{3w}{2}$ pieces from $\phi(x)$.
[Figure 3: panels plotting $\tilde f^{(1)}$, $\tilde f^{(2)}$, $\frac{1}{3} \tilde f^{(1)} + \tilde f^{(2)}$, $\tilde f^{(3)}$, $\tilde f^{(4)}$, and $\frac{1}{3} \tilde f^{(1)} + \tilde f^{(2)} - \frac{1}{2} \tilde f^{(3)} - \tilde f^{(4)}$ over $[0, \Delta]$, with axis ticks at $\delta, 2\delta, \ldots, 6\delta$.]

Let $\delta = \frac{2\Delta}{3w}$. Set

$\tilde g^{(1)} = 3\phi - 3\delta$, $\tilde f^{(1)} = \sigma(\tilde g^{(1)})$, $\tilde g^{(2)} = \phi$, $\tilde f^{(2)} = \sigma(\tilde g^{(2)} - \tilde f^{(1)} + \delta)$,

and

$\tilde g^{(2j+1)} = 4\phi - 4(3j + 1)\delta$, $\tilde f^{(2j+1)} = \sigma(\tilde g^{(2j+1)})$, $\tilde g^{(2j+2)} = 2\phi - 6j\delta$, $\tilde f^{(2j+2)} = \sigma(\tilde g^{(2j+2)} - \tilde f^{(2j+1)})$,

for all $j = 1, \ldots, w/2 - 1$. The output of this one-hidden-layer network is

$\xi_{\Delta,w}(x) = \frac{1}{3} \tilde f^{(1)} + \tilde f^{(2)} - \delta + \sum_{j=1}^{\frac{w}{2} - 1} (-1)^j \left(\frac{1}{2} \tilde f^{(2j+1)} + \tilde f^{(2j+2)}\right)$,

which has $\frac{3w}{2}$ pieces on $[0, \Delta]$. $\xi_{\Delta,w}(x)$ is of slope $(-1)^j$ on $[j\delta, (j+1)\delta]$, $j = 0, \ldots, 3w/2 - 1$, and ranges from 0 to $\delta$ on each piece. Figure 3 shows how the affine transformation of $\{\tilde f^{(1)}, \tilde f^{(2)}, \tilde f^{(3)}, \tilde f^{(4)}\}$ constructs a sawtooth function of 6 pieces. Please note that flipping $\phi(x)$ or translating $\phi(x)$ will not prevent $\xi_{\Delta,w}(\phi(x))$ from generating $\frac{3w}{2}$ pieces. The targeted intra-linked ReLU network with depth $k + 1$ and widths $w_1, \ldots, w_k$ of $k$ hidden layers is designed as $\xi_{\Delta_k, w_k} \circ \xi_{\Delta_{k-1}, w_{k-1}} \circ \cdots \circ \xi_{\Delta_1, w_1}(x)$, where $\Delta_i = 1 / \prod_{j=1}^{i-1} \frac{3 w_j}{2}$.

Proposition 3 (Intra-layer links can greatly increase the number of pieces in an $\mathbb{R} \to \mathbb{R}$ ReLU network with width 2 and arbitrary depth). Let $f : \mathbb{R} \to \mathbb{R}$ be a PWL function represented by an $\mathbb{R} \to \mathbb{R}$ $(k + 1)$-layer ReLU DNN with width 2 in all $k$ hidden layers. Then the number of pieces of $f$ is at most $\sqrt{7}^{\,k}$ if $k$ is even, and at most $3 \cdot \sqrt{7}^{\,k-1}$ if $k$ is odd. There exists an $\mathbb{R} \to \mathbb{R}$ $(k + 1)$-layer 2-wide ReLU DNN, with neurons linked in each hidden layer, which can produce at least $7 \cdot 3^{k-2} + 2$ pieces.

Proof. The proof is given in Appendix C.
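The one-hidden-layer construction in the proof of Proposition 2 can be checked numerically. The sketch below implements $\xi_{\Delta,w}$ with the coefficients given in the proof and counts pieces on a grid aligned with the breakpoints. The helper names `xi` and `count_pieces` and the chosen widths are illustrative assumptions, and counting slope sign changes is only a valid piece count because the target is a sawtooth.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def xi(phi, delta_big, w):
    # One intra-linked hidden layer of even width w applied to phi in [0, delta_big].
    d = 2.0 * delta_big / (3.0 * w)
    f1 = relu(3.0 * phi - 3.0 * d)
    f2 = relu(phi - f1 + d)                    # linked to f1
    out = f1 / 3.0 + f2 - d
    for j in range(1, w // 2):
        f_odd = relu(4.0 * phi - 4.0 * (3 * j + 1) * d)
        f_even = relu(2.0 * phi - 6.0 * j * d - f_odd)   # linked to f_odd
        out = out + (-1) ** j * (0.5 * f_odd + f_even)
    return out

def count_pieces(y):
    # One plus the number of slope sign changes along the sampled curve.
    signs = np.sign(np.diff(y))
    return 1 + int(np.count_nonzero(signs[1:] != signs[:-1]))

# Single hidden layer of width 4 on [0, 1]: grid aligned with multiples of 1/6.
grid = np.linspace(0.0, 1.0, 6 * 7 + 1)
print(count_pieces(xi(grid, 1.0, 4)))

# Two hidden layers of width 4: the second layer acts on [0, Delta_2] with Delta_2 = 1/6.
grid2 = np.linspace(0.0, 1.0, 36 * 7 + 1)
print(count_pieces(xi(xi(grid2, 1.0, 4), 1.0 / 6.0, 4)))
```

Under these assumptions the counts should match $\frac{3w}{2}$ for one layer and $\frac{3 w_1}{2} \cdot \frac{3 w_2}{2}$ for two layers, because the first layer's output ranges over $[0, \delta_1] = [0, \Delta_2]$ on every piece.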

4.1.3. FUNCTIONAL SPACE ANALYSIS

The above constructive analyses demonstrate that, in the maximal sense, intra-layer links can empower a feedforward network to represent a function with more pieces. Now, we move one step forward by showing that intra-layer links can expand the functional space of a feedforward network. This result is surprising because one tends to think that an intra-linked network produces an exclusively different function from a feedforward network. However, here we report that an intra-linked one-hidden-layer network of two neurons can express a feedforward one-hidden-layer network of two neurons (Lemma 9), while the converse does not hold. Furthermore, given an arbitrary feedforward ReLU network, adding intra-layer links in the first layer can expand its functional space (Theorem 10).

Lemma 9. Let $f^{(1)} = \sigma(a^{(1)} x + b^{(1)})$, $f^{(2)} = \sigma(a^{(2)} x + b^{(2)})$, and $f = c^{(1)} f^{(1)} + c^{(2)} f^{(2)} + d$, where $a^{(1)} a^{(2)} > 0$ and $b^{(1)}, b^{(2)}, c^{(1)}, c^{(2)}, d \in \mathbb{R}$. Then there exist some $\tilde a^{(1)}, \tilde a^{(2)}, \tilde b^{(1)}, \tilde b^{(2)}, \tilde c^{(1)}, \tilde c^{(2)}$, and $\tilde d \in \mathbb{R}$ such that $f = \tilde c^{(1)} \tilde f^{(1)} + \tilde c^{(2)} \tilde f^{(2)} + \tilde d$, where $\tilde f^{(1)} = \sigma(\tilde a^{(1)} x + \tilde b^{(1)})$ and $\tilde f^{(2)} = \sigma(\tilde a^{(2)} x + \tilde b^{(2)} - \tilde f^{(1)})$.

Proof. Without loss of generality, we assume $a^{(1)}, a^{(2)} > 0$ and $-\frac{b^{(2)}}{a^{(2)}} < -\frac{b^{(1)}}{a^{(1)}}$. Then $f$ is of slope $0$, $c^{(2)} a^{(2)}$, and $c^{(1)} a^{(1)} + c^{(2)} a^{(2)}$ on $\left(-\infty, -\frac{b^{(2)}}{a^{(2)}}\right)$, $\left(-\frac{b^{(2)}}{a^{(2)}}, -\frac{b^{(1)}}{a^{(1)}}\right)$, and $\left(-\frac{b^{(1)}}{a^{(1)}}, \infty\right)$, respectively. Now we choose $\tilde a^{(1)}, \tilde a^{(2)}$ satisfying $0 < \tilde a^{(1)} < \tilde a^{(2)}$, and set $\tilde b^{(i)} = \frac{b^{(i)}}{a^{(i)}} \cdot \tilde a^{(i)}$, $i = 1, 2$. Then $\tilde f^{(1)}$ is of slope $0$ and $\tilde a^{(1)}$ on $\left(-\infty, -\frac{b^{(1)}}{a^{(1)}}\right)$ and $\left(-\frac{b^{(1)}}{a^{(1)}}, \infty\right)$, respectively, while $\tilde f^{(2)}$ is of slope $0$, $\tilde a^{(2)}$, and $\tilde a^{(2)} - \tilde a^{(1)}$ on $\left(-\infty, -\frac{b^{(2)}}{a^{(2)}}\right)$, $\left(-\frac{b^{(2)}}{a^{(2)}}, -\frac{b^{(1)}}{a^{(1)}}\right)$, and $\left(-\frac{b^{(1)}}{a^{(1)}}, \infty\right)$, respectively. Hence, letting $\tilde c^{(2)} = c^{(2)} a^{(2)} / \tilde a^{(2)}$, $\tilde c^{(1)} = \left(c^{(1)} a^{(1)} + c^{(2)} a^{(2)} - \tilde c^{(2)} (\tilde a^{(2)} - \tilde a^{(1)})\right) / \tilde a^{(1)}$, and $\tilde d = d$, we have $f = \tilde c^{(1)} \tilde f^{(1)} + \tilde c^{(2)} \tilde f^{(2)} + \tilde d$.

Theorem 10. Let $f$ be any $\mathbb{R} \to \mathbb{R}$ PWL function representable by a classical $(k + 1)$-layer ReLU DNN with widths $w_1 > 2, \ldots, w_k$ of $k$ hidden layers. Then, $f$ can also be represented by a $(k + 1)$-layer ReLU DNN with widths $w_1, \ldots, w_k$ of $k$ hidden layers, with neurons in the first layer linked.

Proof. Let the output of the $j$-th neuron of the first layer in the feedforward network be $f_1^{(j)}(x) = \sigma(a_1^{(j)} x + b_1^{(j)})$, $j = 1, \ldots, w_1$. Since the feedforward network is invariant to permuting neurons, we can link the arbitrary $j$-th and $j'$-th neurons if $a_1^{(j)} a_1^{(j')} > 0$, which directly concludes the proof according to Lemma 9.
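The parameter mapping in the proof of Lemma 9 can be verified numerically. The following sketch is restricted to the proof's without-loss-of-generality case ($a^{(1)}, a^{(2)} > 0$ and $-b^{(2)}/a^{(2)} < -b^{(1)}/a^{(1)}$). The function names, the free choices `at1` and `at2` for $\tilde a^{(1)}, \tilde a^{(2)}$, and the example parameter values are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def feedforward_pair(x, a1, b1, c1, a2, b2, c2, d):
    # f = c1 * relu(a1 x + b1) + c2 * relu(a2 x + b2) + d
    return c1 * relu(a1 * x + b1) + c2 * relu(a2 * x + b2) + d

def intra_linked_equivalent(x, a1, b1, c1, a2, b2, c2, d, at1=1.0, at2=2.0):
    # Proof of Lemma 9: any 0 < at1 < at2 works; the remaining
    # intra-linked parameters are then determined by the feedforward ones.
    assert a1 > 0 and a2 > 0 and -b2 / a2 < -b1 / a1 and 0 < at1 < at2
    bt1 = b1 / a1 * at1
    bt2 = b2 / a2 * at2
    ct2 = c2 * a2 / at2
    ct1 = (c1 * a1 + c2 * a2 - ct2 * (at2 - at1)) / at1
    f1 = relu(at1 * x + bt1)
    f2 = relu(at2 * x + bt2 - f1)   # second neuron is linked to the first
    return ct1 * f1 + ct2 * f2 + d

# Hypothetical parameters satisfying the assumptions above.
params = dict(a1=1.0, b1=-1.0, c1=-3.0, a2=2.0, b2=1.0, c2=0.5, d=0.2)
xs = np.linspace(-5.0, 5.0, 1001)
print(np.allclose(feedforward_pair(xs, **params), intra_linked_equivalent(xs, **params)))
```

The remaining sign and ordering cases follow by mirroring $x$ or swapping the two neurons, which the sketch does not implement.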

4.2. MODIFY THE DEPTH SEPARATION THEOREM WITH INTRA-LAYER LINKS

In a broad sense, the depth separation theorem consists of two elements: i) there exists a function representable by a deep network; ii) such a function cannot be represented by a shallow network whose width is lower than a threshold. Since adding intra-layer links can generally improve the capability of a network, if one adds intra-layer links to a shallow network, the function constructed by a deep network can be represented by a shallow network, even if the width of this shallow network is still lower than the threshold. Theorems 12 and 13 modify the depth separation $k^2$ vs. 3 and $k^2$ vs. $k$, respectively, by showing that a shallow network with intra-layer links only needs to go $\frac{2}{3}$ times as wide as before to express the same function.

Lemma 11 (A network with width=2 can approximate any univariate PWL function (Fan et al., 2018)). Given a univariate PWL function $p(x)$ with $n$ pieces, there exists an $(n + 1)$-layer network $D(x)$ with two neurons in each layer such that $p(x) = D(x)$.

Theorem 12 (Modify the depth separation $k^2$ vs. 3). For every $k \geq 2$, there exists a function $p(x)$ that can be represented by a $(k^2 + 1)$-layer ReLU DNN with 2 nodes in each layer, such that it cannot be represented by a classical 3-layer ReLU DNN $W_3(x)$ with width less than $k - 1$, but can be represented by a 3-layer, $\frac{2(k-1)}{3}$-wide intra-linked ReLU DNN $\tilde W_3(x)$.

Proof. Combining Theorem 4, Proposition 1, and Lemma 11 directly concludes the proof.

Theorem 13 (Modify the depth separation $k^2$ vs. $k$). For every $k \geq 1$, there is a $[0, 1] \to \mathbb{R}$ PWL function $p(x)$ represented by a feedforward $(2k^2 + 1)$-layer ReLU DNN with at most 6 nodes in each layer, such that it cannot be represented by a classical $(k + 1)$-layer ReLU DNN $W_k(x)$ with width less than $6^k$, but can be represented by a $(k + 1)$-layer intra-linked ReLU DNN $\tilde W_k(x)$ with width no more than $4 \cdot 6^{k-1}$.

Proof. Per Telgarsky (2016)'s construction, a feedforward $(2k^2 + 1)$-layer ReLU DNN with at most 2 nodes in each layer can produce a sawtooth function of $2^{k^2}$ pieces. Similarly, a feedforward $(2k^2 + 1)$-layer ReLU DNN with at most 6 nodes in each layer can have $6^{k^2}$ pieces. Thus, it follows from Theorem 4 that any classical $(k + 1)$-layer ReLU DNN $W_k(x)$ with width less than $6^k - 1$ cannot generate $6^{k^2}$ pieces. However, according to the construction in Proposition 2, letting $w_1 = w_2 = \cdots = w_k = 4 \cdot 6^{k-1}$, an intra-linked network can exactly express a sawtooth function with $6^{k^2}$ pieces.

Remark 3. Theorems 12 and 13 imply that intra-layer links can reduce the bar of the width by 1/3. Although it is not an exponential reduction, our highlight is the existence of shallow networks that can be transformed by intra-layer links to have representation power on a par with a deep network. Such shallow networks go against the predictions of depth separation theory. Furthermore, suppose every $n_i$ neurons are intra-linked in the $i$-th layer; then the upper bound of the number of pieces of a network of $k$ hidden layers with widths $w_1, \ldots, w_k$ is $\prod_{i=1}^{k} \left(\frac{(n_i + 1) w_i}{2} + 1\right)$. Therefore, by intra-linking more neurons, the bar of the width can be substantially reduced, and more shallow networks will become counterexamples of the depth separation theory.
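The width arithmetic behind Theorem 13 and Remark 3 can be checked with exact integers. The sketch below only evaluates the piece counts stated in Theorem 4 and Proposition 2 for equal widths; the function names are hypothetical and the range of $k$ is an arbitrary choice.

```python
def feedforward_bound(width, hidden_layers):
    # Theorem 4 with equal widths: at most (w + 1)^k pieces.
    return (width + 1) ** hidden_layers

def intra_linked_pieces(width, hidden_layers):
    # Proposition 2 with equal even widths: (3w / 2)^k pieces are attainable.
    return (3 * width // 2) ** hidden_layers

for k in range(1, 5):
    target = 6 ** (k * k)               # pieces of the deep sawtooth
    w_intra = 4 * 6 ** (k - 1)          # intra-linked width used in the proof
    reaches = intra_linked_pieces(w_intra, k) == target
    # A feedforward network of the same width stays below the target.
    falls_short = feedforward_bound(w_intra, k) < target
    # w_intra is exactly two thirds of 6^k.
    two_thirds = 3 * w_intra == 2 * 6 ** k
    print(k, w_intra, reaches, falls_short, two_thirds)
```

With width $4 \cdot 6^{k-1}$, the intra-linked count is $(6 \cdot 6^{k-1})^k = 6^{k^2}$, whereas the feedforward bound at the same width is strictly smaller.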

5. DISCUSSION AND CONCLUSION

Please note that intra-layer links are an extension of residual connections. The former link the neurons inside a layer, while the latter connect neurons across layers. We take the intra-layer links as vertical residual connections, while the shortcuts of ResNet (He et al., 2016) are horizontal residual connections. It is widely recognized that horizontal residual connections can facilitate networks in the dimension of depth, e.g., the residual connection solves the training issues of deep networks and allows the network to go very deep. In contrast, our theoretical results show that vertical residual connections can promote the representation capability of networks in the dimension of width, e.g., a network with intra-layer links does not need to go as wide as before to represent the same function. By leveraging the capabilities of a network in both the width and depth domains, we believe that the synergy of horizontal and vertical links will contribute to more powerful networks. More favorably, neither horizontal nor vertical links incorporate new parameters; therefore, their synergy is likely to enhance model efficiency.

In this draft, via bound estimation, dedicated construction, and functional space analysis, we have shown that a network with intra-layer links is much more expressive than a feedforward one. Then, we have modified the depth separation results by showing that a shallow network that previously could not express some functions constructed by deep networks can do so with intra-layer links. Our results supplement the existing depth separation theory, and suggest that the potential of wide networks can be released by an appropriate structure. Future endeavors can be put into training wide networks using intra-layer links to achieve performance comparable with deep networks.

A PROOF OF THEOREM 8

Lemma 14 (Zaslavsky's Theorem; Zaslavsky (1975); Stanley (2004)). Let $A = \{H_i \subset V : 1 \le i \le m\}$ be an arrangement in $\mathbb{R}^n$. Then, the number of regions of the arrangement $A$ satisfies $r(A) \le \sum_{i=0}^{n} \binom{m}{i}$.

Proof. We prove by induction on $k$. For the base case $k = 1$, $\tilde f_1^{(2i-1)} = \sigma\left(\tilde g_1^{(2i-1)}\right)$ produces one hyperplane in the input space $\mathbb{R}^n$. Furthermore, $\tilde f_1^{(2i)} = \sigma\left(\tilde g_1^{(2i)} - \tilde f_1^{(2i-1)}\right) = \sigma\left(\tilde g_1^{(2i)} - \sigma\left(\tilde g_1^{(2i-1)}\right)\right)$ produces at most two hyperplanes in the input space $\mathbb{R}^n$. Therefore, in total, the $w_1$ neurons in the first layer produce $(1 + 2) \cdot \frac{w_1}{2} = \frac{3w_1}{2}$ hyperplanes in the input space $\mathbb{R}^n$. Then by Zaslavsky's Theorem, they produce at most $\sum_{j=0}^{n} \binom{\frac{3w_1}{2}+1}{j}$ linear regions in the input space $\mathbb{R}^n$.

For the induction step, we assume that for some $k \ge 1$, any $\mathbb{R}^n \to \mathbb{R}$ ReLU DNN with every two neurons linked in each hidden layer, depth $k + 1$, and widths $w_1, \ldots, w_k$ of the $k$ hidden layers produces at most $\prod_{i=1}^{k} \sum_{j=0}^{n} \binom{\frac{3w_i}{2}+1}{j}$ linear regions. Now we consider any $\mathbb{R}^n \to \mathbb{R}$ ReLU DNN with every two neurons linked in each hidden layer, depth $k + 2$, and widths $w_1, \ldots, w_{k+1}$ of the $k + 1$ hidden layers. Then for each linear region $S$ produced by the first $k + 1$ layers, again, $\tilde f_{k+1}^{(2i-1)} = \sigma\left(\tilde g_{k+1}^{(2i-1)}\right)$ produces one hyperplane in $S$. Furthermore, $\tilde f_{k+1}^{(2i)} = \sigma\left(\tilde g_{k+1}^{(2i)} - \tilde f_{k+1}^{(2i-1)}\right) = \sigma\left(\tilde g_{k+1}^{(2i)} - \sigma\left(\tilde g_{k+1}^{(2i-1)}\right)\right)$ produces at most two hyperplanes in $S$. Therefore, in total, the $w_{k+1}$ neurons in the $(k+1)$-th layer produce $(1 + 2) \cdot \frac{w_{k+1}}{2} = \frac{3w_{k+1}}{2}$
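The counting step supplied by Lemma 14 is easy to evaluate directly. The sketch below computes the bound $\sum_{i=0}^{n}\binom{m}{i}$ for $m$ hyperplanes in $\mathbb{R}^n$; the function name `zaslavsky_bound` is a hypothetical choice, and the snippet illustrates the lemma only, not the full induction.

```python
from math import comb

def zaslavsky_bound(m, n):
    """Upper bound sum_{i=0}^{n} C(m, i) on the regions cut out by m hyperplanes in R^n."""
    return sum(comb(m, i) for i in range(n + 1))

# Three lines in the plane split it into at most 1 + 3 + 3 regions.
print(zaslavsky_bound(3, 2))  # 7
# Four points on the real line split it into at most 1 + 4 intervals.
print(zaslavsky_bound(4, 1))  # 5
```

In the univariate case ($n = 1$) the bound is simply $m + 1$, i.e., one more piece than the number of breakpoints.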

B SUPPLEMENTARY RESULTS FOR THE TIGHTNESS OF THEOREM 4

Proposition 4 (The bound $\prod_{i=1}^{k} (w_i + 1)$ is tight for a depth-bounded but width-unbounded network). Given an $\mathbb{R} \to \mathbb{R}$ two-hidden-layer ReLU network, for any widths $w_1 \ge 3$, $w_2 \ge 2$ of the first and second hidden layers, there exists a PWL function represented by such a network whose number of pieces is $(w_1 + 1)(w_2 + 1)$.

Proof. To guarantee that the bound $\prod_{i=1}^{k} (w_i + 1)$ is tight, the following two requirements should be met: (i) each $g_i^{(j)}$, $i = 0, 1, 2$, $j = 1, \ldots, w_i$, has as many distinct zero points as its number of pieces, so that the activation step can produce the most new breakpoints; (ii) the breakpoints of each $g_{i+1}^{(j)}$, $i = 0, 1, 2$, $j = 1, \ldots, w_{i+1}$, as the affine combination of f , so that all the old breakpoints are preserved. Now we give the proof in detail. Let $f_1^{(1)}(x) = \sigma(3x)$, $f_1^{(2)}(x) = \sigma(-x + 3)$, $f_1^{(3)}(x) = \sigma\left(\frac{3}{2}x - \frac{3}{2}\right)$. When $w_1 = 3$, we set

The female and male are mapped to 0 and 1, respectively. All samples with missing attributes are deleted. Finally, the processed data have 7,081 data points. Then, the data are randomly split into training and testing sets with a ratio of 0.8:0.2. We build networks with intra-layer links and compare them with the corresponding feedforward networks without intra-layer links. The optimizer is Adam (Kingma & Ba, 2014) with a learning rate of 0.1. The loss function is the binary cross-entropy function. The evaluation metric is the widely used micro-F1 score.

The prediction results of the feedforward and intra-linked networks are summarized in Table 1, from which we draw two highlights. First, when the same structure is used, the intra-linked network consistently outperforms the feedforward network. More favorably, the intra-linked network takes the lead by a large margin: the minimum gain is 1.68% for the network structure 2-32-2, while the maximum gain is 2.66% for the network structure 2-16-16-2.
Such superiority corroborates our theoretical analysis that adding intra-layer links can boost a network's representation power. Second, the one-hidden-layer intra-linked network can sometimes outperform the two-hidden-layer feedforward network (22-16-2 vs 22-8-8-2, 22-32-2 vs 22-16-16-2, 22-64-2 vs 22-32-32-2) when their numbers of parameters are comparable. This phenomenon suggests that the representation power of an intra-linked network is different from that of a feedforward network with more layers.

• As Figure 6 shows, their mechanisms of producing pieces are fundamentally different. The mechanism of adding a new layer is the repetition effect (multiplication), i.e., the function being composed oscillates, and each oscillation can generate more pieces. In contrast, the mechanism of intra-layer links is the gating effect (addition): the neuron being embedded has two activation states, and each state is leveraged by the neuron being linked to produce a breakpoint. The two states are integrated to generate more pieces.

• Although both stacking a new layer and adding intra-layer links involve composition, they involve different numbers of (affine transformation, activation) pairs. In a feedforward network, adding a new layer means that the depth increases by 1, and the characteristic of stacking a new layer is an affine transformation followed by an activation. As Figure 6 illustrates, a feedforward network with two layers involves two rounds of (affine transformation, activation). In contrast, adding intra-layer links in a fully-connected layer actually exerts a gating effect. When $\sigma(W_2 x + b_2) > 0$, the output is $\sigma((W_1 + W_2)x + b_1 + b_2)$; when $\sigma(W_2 x + b_2) = 0$, the output is $\sigma(W_1 x + b_1)$. The number of (affine transformation, activation) pairs is still one in both cases.
• The function classes (the sets of functions represented by a given neural network architecture) of our intra-linked network and the deeper feedforward network are not the same, and this makes a big difference. In some sense, the deeper feedforward network has a larger function class, and the function class of our intra-linked network is just a subset of it. However, our intra-linked network has more expressive power (i.e., number of pieces, VC dimension) per parameter. Also, the experiments show that intra-linked networks can achieve better accuracy on real-world problems. This phenomenon is analogous to the comparison between CNNs and fully-connected NNs: the function classes of CNNs are just subsets of the function classes of fully-connected NNs with further restrictions on the weights, yet CNNs usually have more expressive power per parameter and achieve better results in practice.

Topologically, one can define the depth of a network as the length of the longest path from the input to the output, by regarding a neural network as a directed acyclic graph. With this definition, adding intra-layer links certainly increases the depth by roughly a factor of two. However, if we examine the operations along the longest path, based on the second observation above, the number of (affine transformation, activation) pairs remains the same as the number of layers. Thus, if the depth is defined as the number of (affine transformation, activation) pairs that are actually executed, the depth of the intra-linked network is the same as that of the feedforward network. Since topologically one can make any shallow network deep by inserting a series of reducible identity layers, we argue that the intrinsic computational operation is more important than the extrinsic topology for the depth separation theory.
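The gating effect described in the second observation can be verified on a scalar example. The sketch below assumes the sign convention of the two cases stated there (the activated output of the second neuron is added to the pre-activation of the first); the function names and the particular weights are hypothetical, and exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction

def relu(z):
    return z if z > 0 else Fraction(0)

def linked_neuron(x, w1, b1, w2, b2):
    """Neuron 1 receives the activated output of neuron 2 from the same layer."""
    return relu(w1 * x + b1 + relu(w2 * x + b2))

def gated_form(x, w1, b1, w2, b2):
    """The same value, written as a gate on the activation state of neuron 2."""
    if w2 * x + b2 > 0:
        # Neuron 2 is active: a single affine map with merged parameters.
        return relu((w1 + w2) * x + (b1 + b2))
    # Neuron 2 is inactive: the plain affine map of neuron 1.
    return relu(w1 * x + b1)

# Hypothetical parameters, chosen only for illustration.
w1, b1, w2, b2 = Fraction(-1), Fraction(1), Fraction(2), Fraction(1)
xs = [Fraction(i, 4) for i in range(-12, 13)]
print(all(linked_neuron(x, w1, b1, w2, b2) == gated_form(x, w1, b1, w2, b2) for x in xs))
```

In both branches exactly one affine transformation followed by one activation is evaluated, which is the point made about counting (affine transformation, activation) pairs.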

F EXTENSION TO MORE INTRA-LAYER LINKS

For an $\mathbb{R}^{w_0} \to \mathbb{R}^{w_{k+1}}$ ReLU DNN with depth $k + 1$ and widths $w_1, \ldots, w_k$ of the $k$ hidden layers, we now assume that every $n_i$ neurons are intra-layer linked, where $n_i$ divides $w_i$ without remainder. We use $\tilde{\mathbf{f}}_0 = \left(\tilde f_0^{(1)}, \ldots, \tilde f_0^{(w_0)}\right) = x \in \mathbb{R}^{w_0}$ to denote the input of the network. Let $\tilde{\mathbf{f}}_i = \left(\tilde f_i^{(1)}, \ldots, \tilde f_i^{(w_i)}\right) \in \mathbb{R}^{w_i}$; then for $i = 1, \ldots, k$, $j = 1, \ldots, w_i$, we use $\tilde g_i^{(j)} = \left\langle \tilde a_i^{(j)}, \tilde{\mathbf{f}}_{i-1} \right\rangle + \tilde b_i^{(j)}$ to denote the $j$-th pre-activation in the $i$-th layer, where $\tilde a_i^{(j)} \in \mathbb{R}^{w_{i-1}}$, $\tilde b_i^{(j)} \in \mathbb{R}$ are parameters. In an intra-linked network, the $j$-th, $\ldots$, $(j + n_i - 1)$-th neurons in the $i$-th layer are linked, and the $(j + n_i)$-th, $\ldots$, $(j + 2n_i - 1)$-th neurons in the $i$-th layer are linked. We prescribe $\tilde f_i^{(j)} = \sigma\left(\tilde g_i^{(j)}\right)$ and $\tilde f_i^{(j+l)} = \sigma\left(\tilde g_i^{(j+l)} - \tilde f_i^{(j+l-1)}\right)$ for $l = 1, \ldots, n_i - 1$. The output of the network is $\tilde g_{k+1}^{(j)} = \left\langle \tilde a_{k+1}^{(j)}, \tilde{\mathbf{f}}_k \right\rangle + \tilde b_{k+1}^{(j)}$, $j = 1, \ldots, w_{k+1}$, where $\tilde a_{k+1}^{(j)} \in \mathbb{R}^{w_k}$, $\tilde b_{k+1}^{(j)} \in \mathbb{R}$ are parameters.

Theorem 15. Let $f : \mathbb{R} \to \mathbb{R}$ be a PWL function represented by an $\mathbb{R} \to \mathbb{R}$ ReLU DNN with depth $k + 1$, widths $w_1, \ldots, w_k$ of the $k$ hidden layers, and every $n_i$ neurons linked in the $i$-th hidden layer for some positive integer $n_i$ that divides $w_i$ without remainder. Then $f$ has at most $\prod_{i=1}^{k}\left(\frac{n_i+1}{2} w_i + 1\right)$ pieces.

Proof. For convenience, we assume that in the $i$-th layer, the $j$-th, $\ldots$, $(j + n_i - 1)$-th neurons are linked, for $i = 1, \ldots, k$, $j = 1, \ldots, w_i - 1$. For the first layer, $\tilde f_1^{(1)}$ has one breakpoint, and each $\tilde f_1^{(j)}$ has at most $j$ newly produced breakpoints and some old breakpoints of $\tilde g_1^{(j)}$ and $\tilde f_1^{(j-1)}$, for $j = 2, \ldots, n_1$. Hence, the first layer gives at most $\frac{n_1+1}{2} w_1 + 1$ pieces. The rest of the proof is similar to that of Theorem 6.

Theorem 16 (The bound $\prod_{i=1}^{k}\left(\frac{(w_i+1)w_i}{2} + 1\right)$ is tight for a one-hidden-layer intra-linked network). Given an $\mathbb{R} \to \mathbb{R}$ one-hidden-layer ReLU network with all neurons linked in the hidden layer, there exists a PWL function represented by such a network whose number of pieces is $\frac{(w_1+1)w_1}{2} + 1$.
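The linking rule prescribed at the start of this section can be sketched as a forward pass through a single layer. This is a minimal illustration under stated assumptions: the function `intra_linked_layer` and its argument names are hypothetical, neurons are indexed from 0 in the code (from 1 in the text), and linked groups are taken to be consecutive blocks of `n` neurons.

```python
def relu(z):
    return z if z > 0 else 0

def intra_linked_layer(inputs, weights, biases, n):
    """Forward pass of one layer whose neurons are linked in consecutive groups of size n.

    weights[j] is the weight vector of neuron j and biases[j] is its bias.
    """
    assert len(weights) == len(biases) and len(weights) % n == 0
    outputs = []
    for j, (a, b) in enumerate(zip(weights, biases)):
        g = sum(ai * xi for ai, xi in zip(a, inputs)) + b  # pre-activation of neuron j
        if j % n != 0:
            # Not the first neuron of its group: subtract the previous neuron's output.
            g -= outputs[-1]
        outputs.append(relu(g))
    return outputs

# Two linked neurons: sigma(x + 1) and sigma(0.5 * x + 1 - sigma(x + 1)).
print(intra_linked_layer([-1], [[1], [0.5]], [1, 1], n=2))
print(intra_linked_layer([0], [[1], [0.5]], [1, 1], n=2))
```

Setting `n=1` removes every link, so the same function then computes an ordinary fully-connected ReLU layer.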
Figure 7: The construction demonstrating that the bound $\prod_{i=1}^{k}\left(\frac{(w_i+1)w_i}{2} + 1\right)$ is tight for a one-hidden-layer intra-linked network. The panels show $\tilde f^{(1)} = \sigma(w_1 x + b^{(1)})$, $\tilde f^{(2)} = \sigma(w_2 x + b^{(2)} - \tilde f^{(1)})$, $\tilde f^{(3)} = \sigma(w_3 x + b^{(3)} - \tilde f^{(2)})$, and $\tilde f^{(4)} = \sigma(w_4 x + b^{(4)} - \tilde f^{(3)})$, together with the lines $w_2 x + b^{(2)}$, $w_3 x + b^{(3)}$, and $w_4 x + b^{(4)}$.

Proof. Without loss of generality, a one-hidden-layer network with all neurons intra-linked is mathematically formulated as follows: $\tilde f^{(1)} = \sigma(w^{(1)} x + b^{(1)})$, $\tilde f^{(j+1)} = \sigma(w^{(j+1)} x + b^{(j+1)} - \tilde f^{(j)})$. To prove that the bound $\prod_{i=1}^{k}\left(\frac{(w_i+1)w_i}{2} + 1\right)$ is tight for a one-hidden-layer network, the key is to make each $\tilde f^{(j)}$ produce $j$ new breakpoints and have $j$ non-zero pieces that share a point with $y = 0$. We use mathematical induction to derive our construction. Figure 7 schematically illustrates our construction idea.

First, let $\tilde f^{(1)} = \sigma(x + 1)$ and $\tilde f^{(2)} = \sigma(0.5 \times (x + 2) - \tilde f^{(1)})$. Note that $\tilde f^{(1)}$ has 1 non-zero piece that shares a point with $y = 0$, and $\tilde f^{(2)}$ has 2 non-zero pieces that share a point with $y = 0$. Then, given $\tilde f^{(j)}$, $j \ge 2$, we suppose that $\tilde f^{(j)}$ has $j$ non-zero pieces that share a point with $y = 0$. Since $\tilde f^{(j)}$ is continuous, we select its peaks $\{(x_{p_i}, \tilde f^{(j)}(x_{p_i}))\}$ by the following conditions: i) $\tilde f^{(j)}$ is not differentiable at $x_{p_i}$; ii) $\tilde f^{(j)}(x_{p_i}) \neq 0$. Next, let $(x^*, \tilde f^{(j)}(x^*))$ be the lowest peak of $\tilde f^{(j)}$. As long as the slope $w^{(j+1)}$ and the bias $b^{(j+1)}$ satisfy

$$w^{(j+1)} < \frac{\tilde f^{(j)}(x^*)}{x^* + j + 1}, \qquad b^{(j+1)} = w^{(j+1)} \times (j + 1),$$

Theorem 18 (An arbitrarily deep network of width 3 with all neurons in each layer intra-linked can achieve at least $5^k$ pieces). There exists an $\mathbb{R} \to \mathbb{R}$ function represented by an intra-linked ReLU DNN with depth $k$, width 3 in each layer, and all neurons intra-linked in each layer, whose number of pieces is at least $5^k$.

Proof.
Following the same spirit as the proof of Theorem 17, we construct three neurons as follows:

$$\begin{cases} \tilde f^{(1)} = \sigma(2x) \\ \tilde f^{(2)} = \sigma\left(x + 1 - \sigma(\tilde f^{(1)})\right) \\ \tilde f^{(3)} = \sigma\left(\frac{1}{3}(x + 2) - \tilde f^{(2)}\right) \end{cases}$$

The target function that gives us 5 pieces is $\xi(x) = \frac{1}{100} \times \tilde f^{(3)} - \frac{1}{3} \times \tilde f^{(2)} + 0 \times \tilde f^{(1)}$. Next, we just need to let each layer of the intra-linked network represent a stretched and down-pulled variant of $\xi(x)$, e.g., the $k$-th layer $\xi_k(x) = T_k \cdot \xi(x) - C_k$, where $T_k$ is a sufficiently large number and $C_k > \frac{1}{200} T_k + 2$ to ensure that $[-2, 0]$ is within the range of $\xi_k(x)$. Finally, the constructed network is $\xi_k \circ \xi_{k-1} \circ \cdots \circ \xi_1(x)$.

At this time, the improvement of representation power by intra-links is $O(w)$ instead of a constant-level improvement. Thus, the separation is still valid if one allows increasing the width of feedforward networks by a constant factor.

Proposition 6 (Modify the depth separation $k^2$ vs 2). For every $k \ge 2$, there exists a function $p(x)$ that can be represented by a $(k^2 + 1)$-layer ReLU DNN with 2 nodes in each layer, such that it cannot be represented by a classical 2-layer ReLU DNN $W_2(x)$ with width less than $k^2 - 1$, but can be represented by a 2-layer, $(2k)$-wide intra-linked ReLU DNN $\tilde W_2(x)$.

Proof. Combining Theorem 4, Theorem 16, and Lemma 11 directly concludes the proof.
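The one-hidden-layer tightness result invoked here can be sanity-checked on the base case of its construction, $f^{(1)} = \sigma(x + 1)$ and $f^{(2)} = \sigma(0.5 \times (x + 2) - f^{(1)})$, for which the bound $\frac{(w_1+1)w_1}{2} + 1$ with $w_1 = 2$ predicts 4 pieces. The sketch below locates breakpoints as grid points with a nonzero second difference; this is an illustrative shortcut that is exact only because all breakpoints happen to lie on the chosen grid and rational arithmetic is used. The helper name `breakpoints_on_grid` is hypothetical.

```python
from fractions import Fraction

def relu(z):
    return z if z > 0 else Fraction(0)

def f1(x):
    return relu(x + 1)

def f2(x):
    # Second neuron, linked to the first: sigma(0.5 * (x + 2) - f1(x)).
    return relu(Fraction(1, 2) * (x + 2) - f1(x))

def breakpoints_on_grid(f, xs):
    """Grid points where the slope of f changes (nonzero second difference).

    Exact only when every breakpoint of f lies on the uniform grid xs.
    """
    ys = [f(x) for x in xs]
    return [xs[i] for i in range(1, len(xs) - 1)
            if ys[i - 1] - 2 * ys[i] + ys[i + 1] != 0]

xs = [Fraction(i, 4) for i in range(-16, 9)]  # uniform grid on [-4, 2]
bps = breakpoints_on_grid(f2, xs)
print([str(b) for b in bps], "pieces:", len(bps) + 1)
```

The breakpoint at $-1$ is inherited from $f^{(1)}$, while those at $-2$ and $0$ are the two new ones produced by the second neuron, matching the induction hypothesis for $j = 2$.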

G ANALYSIS EXTENDED TO ONE-NEURON-WIDE RESNET

Our analysis can be extended to ResNet to show the power of residual connections. Let us use a one-neuron-wide ResNet to demonstrate this point. It is straightforward to see that a one-neuron-wide ReLU DNN can represent PWL functions with at most three pieces, no matter how deep the network is. However, if we add residual connections to the network, which gives a ResNet, it can represent PWL functions with many more pieces.

Theorem 19. Let $f : \mathbb{R} \to \mathbb{R}$ be a PWL function represented by a one-neuron-wide ResNet. Mathematically, $f = c_{k+1} f_k + g_k$, where $g_1(x) = x$, $f_i = \sigma(a_i g_i + b_i)$, $g_{i+1} = c_i f_i + g_i$, and $c_{k+1}, a_i, b_i, c_i$ are parameters, for $i = 1, \ldots, k$. Then $f$ has at most 2k + 2 pieces.

Proof. The first claim follows from Lemma 3 and a simple induction step. Following the idea of the construction in Propositions 1 and 2, we set $c_i = -2$ and $a_i = 1$ for all $i$, and set $b_1 = 0$, $b_i = 2 - 2^{-i+2}$ for $i = 2, \ldots, k$.

Theorem 19 confirms that adding simple links can greatly improve the representation ability of a network. Neither ResNet nor intra-layer linked networks increase the number of parameters, yet they can represent more complicated functions than a feedforward network with the same number of neurons in each layer. Hence, the linked structure can improve parameter efficiency. Besides, we can see from the proof of Theorem 19 that the idea and construction used in analyzing intra-linked networks can indeed be utilized to analyze other network architectures.



https://www.kaggle.com/datasets/whenamancodes/credit-card-customers-prediction





Figure 1: (a) feedforward, (b) residual, and (c) intra-linked.

i + 1 -1 breakpoints, based on Lemma 5. The breakpoints of f

Figure 2: The PWL functions that reach the bound of Proposition 1 when w1 = 6, w2 = 4. Next, we construct the second hidden layer. f (1) 2 := σ (g + b 1 ), where b 1 ∈ (-13/2, -6), has

Figure 2. Repeating this procedure by selecting different b 1 , a, b 2 , we can construct the remaining {

Figure 3: A schematic illustration of how to use an intra-linked network to generate a sawtooth function.

hyperplanes in S. Then by Zaslavsky's Theorem, it will pro-

Figure 4: Construction of PWL functions to reach the bound of Proposition 4 when w1 = 3, w2 = 2.

Figure 6: Adding intra-layer links is not equivalent to increasing depth in terms of the mechanism of generating more pieces, the number of (affine transformation, activation) pairs, and function classes.

Corollary 15.1. Let f : R → R be a PWL function represented by a R → R ReLU DNN with all the neurons linked in each hidden layer, depth k + 1, and widths w 1 , . . . , w k of k hidden layers. Then f

Remark 4. Suppose each layer has $w$ neurons, i.e., $w_1 = w_2 = \cdots = w_k = w$, and $n = w$. Then the upper bound for the intra-layer linked network is $\left(\frac{(w+1)w}{2} + 1\right)^k$, which approximately equals that of a feedforward network with width $w_1 = w_2 = \cdots = w_k = \frac{(w+1)w}{2}$.

Figure 9: The improvement of representation power by intra-links is O(w) when all neurons in a layer are intra-linked.

Table 1: Credit card customers prediction by feedforward and intra-linked networks.




w2+1 , j = 2, . . . , w 2 . When $w_1 > 3$, we let $f_1^{(j)} = \sigma(-2x - 2(j - 3))$ and g(1) 2 = -f(1)1 -3 -j w2+1 , j = 2, . . . , w 2 . Then $g_2^{(j)}$ has $w_1 + 1$ distinct zero points. Hence for $j = 1, \ldots, w_2$, the breakpoints of f keeps all breakpoints of $g_2^{(j)}$ and yields $w_1 + 1$ new breakpoints. Note that f . Therefore, the total number of pieces via an affine combination of f is $(w_1 + 1)(w_2 + 1)$.

Proposition 5 (The bound $\prod_{i=1}^{k} (w_i + 1)$ is tight for a width-bounded but depth-unbounded network). Given an $\mathbb{R} \to \mathbb{R}$ ReLU network with width $w$ for the first layer and 3 for the other layers, for any depth $k \ge 2$, there exists a PWL function represented by such a network, whose number of pieces is (w + 1)2 .

Now we continue our proof by induction. Assume we have constructed f

Through a direct calculation, we know that $g_{i+1}$ has $(w + 1) \cdot 4^{i-1}$ pieces with opposite slopes in every two adjacent pieces and ranges from 0 to 3/6 i in each piece except the leftmost and rightmost pieces, which implies that we can obtain $(w + 1) \cdot 4^{k-1}$ pieces in total.

C PROOF OF PROPOSITION 3

Proof. For the first assertion, we claim that each pre-activation $g_i^{(j)}$, $2 \le i \le k$, $j = 1, 2$, cannot have every two adjacent pieces with slopes of different signs, which implies that the activation cannot produce the most breakpoints as in Lemma 3. In fact, $g_2^{(j)}$, $j = 1, 2$, has at most 3 pieces. If some $g_2^{(j)}$ has 3 pieces, then by exhaustion, we know that either it has a 0-slope piece, or it has two adjacent pieces with slopes of the same sign (see Figure 5). Hence, $f_2^{(j)}$, $j = 1, 2$, has at most 2 newly produced breakpoints. Then the output of the 2nd layer has at most $2 + 2 \times 2 = 6$ breakpoints, i.e., 7 pieces. Applying a similar method to each piece, we obtain our result via a simple induction step.

2 ,j = 1, 2 in Proposition 3.

Now we come to the second assertion. For convenience, we say an

Using this fact, we can construct a PWL function represented by a $(k + 1)$-layer, 2-wide, intra-linked ReLU DNN, which has $7 \cdot 3^{k-2} + 2$ pieces. Actually, set8 is of "triangle-trapezoid-triangle" shape on $[-1, 1]$. Using the fact above repeatedly, we can construct a PWL function represented by an $\mathbb{R} \to \mathbb{R}$ $(k + 1)$-layer, 2-wide, intra-linked ReLU DNN, which is constant on $(-\infty, -1] \cup [1, \infty)$ and of "triangle-trapezoid-triangle" shape on -1

D VALIDATING THE REPRESENTATION POWER OF INTRA-LAYER LINKS

Inspired by the encouraging theoretical analyses, we examine whether intra-layer links can help a network deliver superior performance in real-world tasks. The task is to predict whether a credit card holder will churn, so that the bank can provide better service to change holders' decisions. This prediction task has 10,000 raw samples, each with 18 customer portfolio features including age, salary, marital status, credit card limit, credit card category, etc. The labels are 'get churned' or 'stay'. A detailed description of the data and the task can be found on Kaggle 1.

The data are preprocessed as follows. Education levels are assigned discrete values based on the mapping { 'Uneducated': 0, 'High School': 1, 'College': 2, 'Graduate': 3, 'Post-Graduate': 4, 'Doctorate': 5 }. Income categories are assigned values based on the mapping { 'Less than $40K': 0, '$40K - $60K': 1, '$60K - $80K': 2, '$80K - $120K': 3, '$120K +': 4 }.

$w^{(j+1)} x + b^{(j+1)}$ crosses and only crosses $j$ pieces of $\tilde f^{(j)}$. These pieces are exactly the non-zero pieces that share a point with $y = 0$. Thus, together with the breakpoint $-\frac{b^{(j+1)}}{w^{(j+1)}}$, $\tilde f^{(j+1)}$ generates a total of $j + 1$ new breakpoints. At the same time, $\tilde f^{(j+1)}$ has $j + 1$ non-zero pieces that share a point with $y = 0$. Figure 7 illustrates the induction process.

Finally, the total number of breakpoints is $\sum_{j=1}^{w_1} j = \frac{(w_1+1)w_1}{2}$, which concludes our proof.

Theorem 17 (An arbitrarily deep network of width 4 with all neurons in each layer intra-linked can achieve at least $9^k$ pieces). There exists an $\mathbb{R} \to \mathbb{R}$ function represented by an intra-linked ReLU DNN with depth $k$, width 4 in each layer, and all neurons in each layer intra-linked, whose number of pieces is at least $9^k$.

Proof. The core of the proof is to use a one-hidden-layer all-neuron-intra-linked network of width 4 to create a quasi-sawtooth function with as many pieces as possible.
We construct four neurons as follows: (5)

The profiles of $\tilde f^{(1)}, \tilde f^{(2)}, \tilde f^{(3)}, \tilde f^{(4)}$ are shown in Figure 8(a).

Figure 8: A schematic illustration of how to use an intra-linked network to generate a sawtooth function.

By combining $\tilde f^{(1)}, \tilde f^{(2)}, \tilde f^{(3)}, \tilde f^{(4)}$ with carefully calibrated coefficients, we have the following quasi-sawtooth function $\eta(x)$ that has 9 pieces. As shown in Figure 8(b), we have marked all breakpoints of $\eta(x)$ to validate its correctness.

Next, we just need to let each layer of the intra-linked network represent a stretched and down-pulled variant of $\eta(x)$, e.g., the $k$-th layer $\eta_k(x) = M_k \cdot \eta(x) - B_k$, where $M_k$ is a sufficiently large number and $B_k > \frac{5}{504} M_k + 3$ to ensure that $[-3, 0]$ is within the range of $\eta_k(x)$. Finally, the constructed network is (7)

