EXPLAINING REPRESENTATION BOTTLENECKS OF CONVOLUTIONAL DECODER NETWORKS

Abstract

In this paper, we prove representation bottlenecks of a cascaded convolutional decoder network, considering the capacity of representing different frequency components of an input sample. We conduct the discrete Fourier transform on each channel of the feature map in an intermediate layer of the decoder network. Then, we introduce the rule of the forward propagation of such intermediate-layer spectrum maps, which is equivalent to the forward propagation of feature maps through a convolutional layer. Based on this, we find that each frequency component in the spectrum map is forward propagated independently of other frequency components. Furthermore, we prove two bottlenecks in representing feature spectrums. First, we prove that the convolution operation, the zero-padding operation, and a set of other settings all make a convolutional decoder network more likely to weaken high-frequency components. Second, we prove that the upsampling operation generates a feature spectrum in which strong signals repetitively appear at certain frequencies. We will release all code when this paper is accepted.

1. INTRODUCTION

Deep neural networks (DNNs) have exhibited superior performance in many tasks. However, in recent years, many studies discovered theoretical defects of DNNs, e.g., the vulnerability to adversarial attacks (Goodfellow et al., 2014) and the difficulty of learning middle-complexity interactions (Deng et al., 2022). Besides, other studies explained typical phenomena during the training of DNNs, e.g., the double-descent phenomenon (Nakkiran et al., 2019), the information bottleneck hypothesis (Tishby & Zaslavsky, 2015), and the lottery ticket hypothesis (Frankle & Carbin, 2018). In comparison, in this study, we propose a new perspective to investigate how a cascaded convolutional decoder network represents features at different frequencies. That is, when we apply the discrete Fourier transform (DFT) to each channel of the feature map or the input sample, we try to prove which frequency components of each input channel are usually strengthened/weakened by the network. In this direction, previous studies (Xu et al., 2019a; Rahaman et al., 2019) claimed that DNNs were less likely to encode high-frequency components. However, these studies focused on a specific notion of frequency that took the landscape of the loss function over all input samples as the time domain. In comparison, we focus on a fully different type of frequency, i.e., the frequency w.r.t. the DFT on an input image or a feature map.

• Reformulating forward propagation in the frequency domain. As the basis for subsequent theoretical proofs, we discover that we can reformulate the traditional forward propagation of feature maps as a new forward propagation on the feature spectrum. We derive the rule that forward propagates spectrums of different channels through a cascaded convolutional network, which is mathematically equivalent to the forward propagation of feature maps through a cascaded convolutional network.

• Based on the reformulation of the forward propagation, we prove the following conclusions.
(1) The layerwise forward propagation of each frequency component of the spectrum map is independent of other frequency components. In the forward propagation process, each frequency component of the feature spectrum is forward propagated independently of other frequency components, if the convolution operation does not change the size of the feature map in each channel. In this way, we analyze three classic operations, including the convolution, the zero-padding, and the upsampling operations, and prove two representation bottlenecks, as follows.

Figure 1: (a) The convolution operation and the zero-padding operation make the decoder usually learn low-frequency components first and then gradually learn higher frequencies. (b) For cascaded upconvolutional layers, the upsampling operation in the decoder repeats strong frequency components of the input to generate spectrums of upper layers. We visualize the magnitude map of the feature spectrum, which is averaged over all channels. For clarity, we move low frequencies to the center of the spectrum map, and move high frequencies to corners of the spectrum map. High-frequency components in the magnitude maps in (b) are also weakened by the convolution operation after upsampling.

(2) Representation bottleneck 1. We prove that both the convolution operation and the zero-padding operation make a cascaded convolutional decoder network more likely to weaken the high-frequency components of the input sample, as shown in Figure 1(a), if the convolution operation with a padding operation does not change the size of the feature map in a channel. Besides, we also prove that the following three conditions further strengthen the above representation bottleneck: (1) a deep network architecture; (2) a small convolutional kernel size; and (3) a large absolute mean value of convolutional weights.

(3) Representation bottleneck 2.
We prove that the upsampling operation makes a cascaded convolutional decoder network generate a feature spectrum in which strong signals repetitively appear at certain frequencies, as shown in Figure 1(b). Note that all the above findings explain general trends of neural networks with convolution, zero-padding, and upsampling operations, instead of deriving the deterministic property of a specific network. Besides, we have not derived the property of max-pooling operations, so in this paper, it is difficult to extend such findings to neural networks for image classification.
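To make the notion of frequency used throughout this paper concrete, the following minimal sketch (assuming NumPy; the variable names are ours) applies the 2D DFT to each channel of an input and reads out the vector of frequency components at one frequency:

```python
import numpy as np

# Minimal sketch (assuming NumPy): the paper's frequency view applies a 2D
# discrete Fourier transform to each channel of an input x in R^{C x M x N}.
rng = np.random.default_rng(0)
C, M, N = 3, 8, 8
x = rng.normal(size=(C, M, N))

G = np.fft.fft2(x)   # per-channel spectra, shape (C, M, N), complex-valued
g_uv = G[:, 2, 5]    # the C-dimensional frequency component at [u, v] = [2, 5]

# The fundamental frequency [0, 0] is simply the per-channel pixel sum;
# frequencies near the four corners are "low", those near [M/2, N/2] "high".
print(np.allclose(G[:, 0, 0], x.sum(axis=(1, 2))))  # True
```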

2. RULES OF PROPAGATING FEATURE SPECTRUMS

In this section, we aim to reformulate the forward propagation of a cascaded convolutional decoder network in the frequency domain. To this end, we first introduce the rule of a convolutional layer propagating the feature spectrum from a lower layer to an upper layer.

• Convolution operation. Given a convolutional layer, let $W^{[\mathrm{ker}=1]}, W^{[\mathrm{ker}=2]}, \dots, W^{[\mathrm{ker}=D]}$ denote the $D$ convolutional kernels of this layer, and let $b^{[\mathrm{ker}=1]}, b^{[\mathrm{ker}=2]}, \dots, b^{[\mathrm{ker}=D]} \in \mathbb{R}$ denote the $D$ bias terms. Each $d$-th kernel $W^{[\mathrm{ker}=d]} \in \mathbb{R}^{C \times K \times K}$ is of kernel size $K \times K$, and $C$ denotes the channel number. Accordingly, we apply the kernels to a feature $F \in \mathbb{R}^{C \times M \times N}$ with $C$ channels, and obtain the output feature $F' \in \mathbb{R}^{D \times M \times N}$, as follows.

$$F' = \mathrm{Conv}(F), \quad \text{s.t.}\;\; F'^{(d)} = W^{[\mathrm{ker}=d]} \otimes F + b^{[\mathrm{ker}=d]}\,\mathbf{1}, \quad d = 1, 2, \dots, D \tag{1}$$

where $F'^{(d)} \in \mathbb{R}^{M \times N}$ denotes the feature map of the $d$-th channel, $\otimes$ denotes the convolution operation, and $\mathbf{1}$ is an all-ones matrix.

• Discrete Fourier transform. Given the $c$-th channel of the feature $F \in \mathbb{R}^{C \times M \times N}$, i.e., $F^{(c)} \in \mathbb{R}^{M \times N}$, we use the discrete Fourier transform (DFT) (Sundararajan, 2001) to compute the frequency spectrum of this channel, termed $G^{(c)} \in \mathbb{C}^{M \times N}$, where $\mathbb{C}$ denotes the set of complex numbers. Each frequency component at the frequency $[u, v]$ is represented as a complex number $G^{(c)}_{uv} \in \mathbb{C}$. Let $G = [G^{(1)}, G^{(2)}, \dots, G^{(C)}] \in \mathbb{C}^{C \times M \times N}$ denote the tensor of frequency spectrums of the $C$ channels of $F$. We take the $C$-dimensional vector at the frequency $[u, v]$ of the tensor $G$, i.e., $g^{(uv)} = [G^{(1)}_{uv}, G^{(2)}_{uv}, \dots, G^{(C)}_{uv}] \in \mathbb{C}^C$, to represent the frequency component $[u, v]$. Frequency components close to $[0, 0]$, $[0, N-1]$, $[M-1, 0]$, or $[M-1, N-1]$ represent low-frequency signals, whereas frequency components close to $[M/2, N/2]$ represent high-frequency signals.

• Reformulating the layerwise forward propagation process.
For a specific convolutional layer (with stride 1), the rule of propagating spectrums of input features into spectrums of output features is given as follows, which exactly represents the traditional forward propagation of features in Equation (1).

Theorem 1 (proven in Appendix A.1). Let $H = [H^{(1)}, H^{(2)}, \dots, H^{(D)}] \in \mathbb{C}^{D \times M \times N}$ denote the spectrums of the output feature $F' \in \mathbb{R}^{D \times M \times N}$. Then, $H$ can be computed as follows.

$$h^{(u'v')} = \delta_{u'v'} M N\, b + \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} \alpha^{u'v'}_{uv} \cdot T^{(uv)} g^{(uv)}, \quad \text{s.t.}\;\; \delta_{u'v'} = \begin{cases} 1, & u' = 0;\; v' = 0 \\ 0, & \text{otherwise} \end{cases}; \tag{3}$$

$$\alpha^{u'v'}_{uv} = \frac{1}{MN}\, \frac{\sin((M-K)\lambda_{uu'}\pi)}{\sin(\lambda_{uu'}\pi)}\, \frac{\sin((N-K)\gamma_{vv'}\pi)}{\sin(\gamma_{vv'}\pi)}\, e^{i((M-K)\lambda_{uu'} + (N-K)\gamma_{vv'})\pi}; \tag{4}$$

where $h^{(u'v')} = [H^{(1)}_{u'v'}, H^{(2)}_{u'v'}, \dots, H^{(D)}_{u'v'}] \in \mathbb{C}^D$; $b = [b^{(1)}, b^{(2)}, \dots, b^{(D)}] \in \mathbb{R}^D$ denotes the vector of bias terms; $\alpha^{u'v'}_{uv} \in \mathbb{C}$ is a coefficient; $\lambda_{uu'} = \frac{(u-u')M - u(K-1)}{M(M-K+1)}$, and $\gamma_{vv'} = \frac{(v-v')N - v(K-1)}{N(N-K+1)}$. $T^{(uv)} \in \mathbb{C}^{D \times C}$ is a matrix of complex numbers, which is exclusively determined by the convolutional kernels $W^{[\mathrm{ker}=1]}, W^{[\mathrm{ker}=2]}, \dots, W^{[\mathrm{ker}=D]}$:

$$T^{(uv)}_{dc} = \sum_{t=0}^{K-1} \sum_{s=0}^{K-1} W^{[\mathrm{ker}=d]}_{cts}\, e^{i\left(\frac{ut}{M} + \frac{vs}{N}\right)2\pi}, \quad d = 1, 2, \dots, D;\;\; c = 1, 2, \dots, C. \tag{5}$$

In Equation (3), the $T^{(uv)}$ term corresponds to the interference process (Beaver, 2018) in physics, and the $\alpha^{u'v'}_{uv}$ term corresponds to the diffraction process. According to Equation (4), for most pairs $[u', v']$, $[u, v]$, $|\alpha^{u'v'}_{uv}|$ is close to 0; $|\alpha^{u'v'}_{uv}|$ is relatively large only when $[u', v']$ is close to $[u, v]$. We notice that in most real implementations, the convolution operation does not change the size of the feature map in a channel, owing to the padding operation; decoder networks usually use the upsampling operation to increase the dimensions of features. Therefore, we limit our research to the scope of convolution operations that do not change the size of the feature map in a channel. Thus, we propose the following assumption, which is used in all subsequent proofs.

Assumption 1.
To simplify subsequent proofs, we assume that before each convolution operation, there exists a circular padding operation (Londono, 1982), and we set the stride of the convolution operation to 1. The circular padding operation extends the last rows and columns of the feature map in each channel, so as to avoid the convolution changing the size of the feature map. Keeping the feature map size unchanged removes the diffraction term $\alpha^{u'v'}_{uv}$ in theory, and derives Theorem 2. In fact, because $|\alpha^{u'v'}_{uv}|$ is small in most cases, the diffraction process is practically ignorable, even when the convolution operation changes the size of the feature map.

Theorem 2 (proven in Appendix A.2). Based on Assumption 1, the layerwise dynamics of feature spectrums in the frequency domain can be simplified as follows.

$$h^{(uv)} = T^{(uv)} g^{(uv)} + \delta_{uv} M N\, b \tag{6}$$

Understanding the convolution operation in the frequency domain. As Figure 2 shows, Theorem 2 means that conducting the convolution operation on an input feature $F$ is essentially equivalent to conducting matrix multiplication on the spectrums of $F$. For example, for all frequencies except the fundamental frequency, we have the output spectrum $h^{(uv)} = T^{(uv)} g^{(uv)}$.

(Conclusion 1) Each frequency component of the feature spectrum is propagated independently of other frequencies, $h^{(uv)} = T^{(uv)} g^{(uv)}$, where $T^{(uv)}$ is exclusively determined by the convolutional weights. Therefore, $g^{(uv)}$ is propagated independently of other frequency components $g^{(u'v')}$.

• Reformulating the entire propagation process of a cascaded convolutional network. To simplify the further proof, we temporarily investigate the spectrum propagation of a network with $L$ cascaded convolutional layers, but without activation functions.
Nevertheless, we have conducted various experiments, and the experimental results in Figure 3 show that all our theorems can well reflect the properties of an ordinary cascaded convolutional network with ReLU layers. Let a convolutional network contain $L$ cascaded convolutional layers. Each $l$-th layer contains $C_l$ convolutional kernels $W^{(l)[\mathrm{ker}=1]}, W^{(l)[\mathrm{ker}=2]}, \dots, W^{(l)[\mathrm{ker}=C_l]} \in \mathbb{R}^{C_{l-1} \times K \times K}$, with $C_l$ bias terms $b^{(l,1)}, b^{(l,2)}, \dots, b^{(l,C_l)} \in \mathbb{R}$. Let $x \in \mathbb{R}^{C_0 \times M \times N}$ denote the input sample. The network generates the output sample $x' = \mathrm{net}(x) \in \mathbb{R}^{C_L \times M \times N}$. Then, we derive the forward propagation from the spectrums of $x$ to the spectrums of $x'$ in the frequency domain as follows.

Corollary 1 (proven in Appendix A.3). Let $G = [G^{(1)}, G^{(2)}, \dots, G^{(C_0)}] \in \mathbb{C}^{C_0 \times M \times N}$ and $H = [H^{(1)}, H^{(2)}, \dots, H^{(C_L)}] \in \mathbb{C}^{C_L \times M \times N}$ denote the spectrums of $x$ and $x'$, respectively. Then,

$$h^{(uv)} = T^{(uv)(L:1)} g^{(uv)} + \delta_{uv}\beta \tag{7}$$

where $T^{(uv)(L:1)} = T^{(L,uv)} \cdots T^{(2,uv)} T^{(1,uv)} \in \mathbb{C}^{C_L \times C_0}$; $g^{(uv)} = [G^{(1)}_{uv}, G^{(2)}_{uv}, \dots, G^{(C_0)}_{uv}] \in \mathbb{C}^{C_0}$ and $h^{(uv)} = [H^{(1)}_{uv}, H^{(2)}_{uv}, \dots, H^{(C_L)}_{uv}] \in \mathbb{C}^{C_L}$ denote the vectors at the frequency $[u, v]$ in the tensors $G$ and $H$, respectively; $\beta = M N\, b^{(L)} + \sum_{j=2}^{L} T^{(00)(L:j)} b^{(j-1)} \in \mathbb{C}^{C_L}$; and $b^{(l)} = [b^{(l,1)}, b^{(l,2)}, \dots, b^{(l,C_l)}] \in \mathbb{R}^{C_l}$ denotes the vector of bias terms of the $C_l$ convolutional kernels in the $l$-th layer.

Besides, the learning of the parameters $W^{(l)}$ affects the matrix $T^{(l,uv)}$. Therefore, we further reformulate the change of $T^{(l,uv)}$ during the learning process, as follows.

Corollary 2 (proven in Appendix A.4). Based on Assumption 1, the change of each matrix $T^{(l,uv)}$ during the learning process is reformulated as follows.

$$\Delta T^{(l,uv)} = -\eta M N \sum_{u'=0}^{M-1} \sum_{v'=0}^{N-1} \chi^{u'v'}_{uv}\, \overline{\left(T^{(u'v')(l-1:1)} g^{(u'v')} + \delta_{u'v'}\beta\right)}\, \frac{\partial Loss}{\partial (h^{(u'v')})}\, T^{(u'v')(L:l+1)}; \tag{8}$$

s.t.
$$\chi^{u'v'}_{uv} = \frac{1}{MN}\, \frac{\sin(K(u-u')\pi/M)}{\sin((u-u')\pi/M)}\, \frac{\sin(K(v-v')\pi/N)}{\sin((v-v')\pi/N)}\, e^{i\left(\frac{(K-1)(u-u')}{M} + \frac{(K-1)(v-v')}{N}\right)\pi} \tag{9}$$

where $\eta$ is the learning rate; $\chi^{u'v'}_{uv} \in \mathbb{C}$ is a coefficient; $T^{(u'v')(l-1:1)} = T^{(l-1,u'v')} \cdots T^{(2,u'v')} T^{(1,u'v')} \in \mathbb{C}^{C_{l-1} \times C_0}$; $T^{(u'v')(L:l+1)} = T^{(L,u'v')} \cdots T^{(l+1,u'v')} \in \mathbb{C}^{C_L \times C_l}$; $\beta = M N\, b^{(l-1)} + \sum_{j=2}^{l-1} T^{(00)(l-1:j)} b^{(j-1)} \in \mathbb{C}^{C_{l-1}}$; and $\overline{T^{(uv)(l-1:1)}}$ denotes the conjugate of $T^{(uv)(l-1:1)}$.

Verifying the forward propagation in Corollary 1 and the change of $T^{(l,uv)}$ in Corollary 2. We computed the similarity between the real spectrums $H^* = [H^{*(1)}, H^{*(2)}, \dots]$ measured in a real network and the spectrums $H$ derived by Corollary 1, i.e., $\mathrm{similarity}(H^*, H) = \mathbb{E}_c[\cos(\mathrm{vec}(\mathrm{norm}(H^*)), \mathrm{vec}(\mathrm{norm}(H)))]$, where $\mathrm{vec}(\cdot)$ represents the vectorization of a matrix, and $\mathrm{norm}(\cdot)$ represents computing the norm of each complex number in a matrix. To this end, we constructed the following three baseline networks to verify whether Corollary 1, derived under specific assumptions, could also objectively reflect the forward propagation in real neural networks. Specifically, the first baseline network contained 10 convolutional layers. Each convolutional layer applied zero-padding and was followed by a ReLU layer. Each convolutional layer contained 16 convolutional kernels (the kernel size was 3×3) with 16 bias terms. We set the stride of the convolution operation to 1. The second baseline network was constructed by removing all ReLU layers from the first baseline network, which was closer to the assumption in Corollary 1. The third baseline network was constructed by replacing all zero-paddings in the second baseline network with circular paddings, which exactly matched the assumption in Corollary 1.

Figure 3: The similarity between the change $\Delta T^{(l,uv)}$ derived in Corollary 2 and the real change of $T^{(l,uv)}$ measured in a real DNN. The shaded area represents the standard deviation.

Figure 3(a) reports $\mathrm{similarity}(H^*, H)$, measured on spectrums in different layers and averaged over all samples.
The similarity between the real spectrums and the derived spectrums was large for all three baseline networks, which verified Corollary 1. Note that the cosine similarity was computed on high-dimensional vectors with as many as $32^2$, $64^2$, or $224^2$ dimensions (determined by the dataset), in which case tiny noises accumulated significantly. Therefore, a similarity greater than 0.8 was already significant enough to verify the practicality of our theory. Besides, we also measured the similarity between the real change of $T^{(l,uv)}$ computed from the real network parameters, termed $\Delta^* T^{(l,uv)}$, and the change of $T^{(l,uv)}$ derived under certain assumptions in Corollary 2, i.e., $\Delta T^{(l,uv)}$, in order to verify Corollary 2. The similarity was computed as $\mathrm{similarity}(\Delta^* T^{(l,uv)}, \Delta T^{(l,uv)}) = \mathbb{E}_c[\cos(\mathrm{vec}(\mathrm{norm}(\Delta^* T^{(l,uv)})), \mathrm{vec}(\mathrm{norm}(\Delta T^{(l,uv)})))]$. The verification was also conducted on the above three baseline networks. Figure 3(b) reports $\forall l,\ \mathrm{similarity}(\Delta^* T^{(l,uv)}, \Delta T^{(l,uv)})$, averaged over all samples. The similarity was greater than 0.88 for all three baseline networks, which verified Corollary 2.
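Theorem 2 can also be checked numerically on a single layer. The sketch below (assuming NumPy; the implementation is ours, not the authors' released code) convolves a random feature with circular padding and stride 1, then reproduces the output spectrum by the per-frequency matrix multiplication $h^{(uv)} = T^{(uv)} g^{(uv)} + \delta_{uv} M N\, b$:

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, K, M, N = 3, 4, 3, 8, 8
W = rng.normal(size=(D, C, K, K))   # D kernels with C channels each
b = rng.normal(size=D)              # bias terms
F = rng.normal(size=(C, M, N))      # input feature map

# Convolution with circular padding and stride 1 (Assumption 1); as in most
# deep learning frameworks, "convolution" is implemented as cross-correlation.
Fp = np.empty((D, M, N))
for d in range(D):
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for t in range(K):
                for s in range(K):
                    acc += (W[d, :, t, s] * F[:, (m + t) % M, (n + s) % N]).sum()
            Fp[d, m, n] = acc + b[d]

# Frequency-domain propagation: h(uv) = T(uv) g(uv) + delta_uv * M * N * b.
G = np.fft.fft2(F)                  # per-channel spectra, shape (C, M, N)
H_pred = np.empty((D, M, N), dtype=complex)
t_idx, s_idx = np.meshgrid(np.arange(K), np.arange(K), indexing="ij")
for u in range(M):
    for v in range(N):
        phase = np.exp(2j * np.pi * (u * t_idx / M + v * s_idx / N))
        T_uv = np.einsum("dcts,ts->dc", W, phase)   # T(uv) as in Eq. (5)
        H_pred[:, u, v] = T_uv @ G[:, u, v]
H_pred[:, 0, 0] += M * N * b        # the bias only affects frequency [0, 0]

H_real = np.fft.fft2(Fp)            # spectra of the convolution output
print(np.allclose(H_pred, H_real))  # True
```

The agreement follows from the convolution theorem for circular convolution; with zero-padding instead of circular padding, the match is only approximate, which is what the similarity curves above quantify.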

3. REPRESENTATION BOTTLENECKS

We further analyze the effects of three classic operations on representing different frequency components of an input sample, including the convolution operation, the zero-padding operation, and the upsampling operation, and discover two representation bottlenecks.

• Effects of the convolution operation. Given an initialized, cascaded, convolutional decoder network with $L$ convolutional layers, let us focus on the behavior of the decoder network in the early epochs of training. We notice that each element $T^{(l,uv)}_{dc}$ of the matrix $T^{(l,uv)}$ is exclusively determined by the $c$-th channel of the $d$-th kernel $W^{(l)[\mathrm{ker}=d]}_{c,0:K-1,0:K-1} \in \mathbb{R}^{K \times K}$, according to Equation (5). Because the parameters $W^{(l)}$ of the initialized decoder network are set to random noises, we can consider all elements in $T^{(l,uv)}$ to be irrelevant to each other, i.e., $\forall d \neq d', c \neq c'$, $T^{(l,uv)}_{dc}$ is irrelevant to $T^{(l,uv)}_{d'c'}$. Similarly, since different layers' parameters $W^{(l)}$ are irrelevant to each other in the initialized decoder network, we can consider elements in different layers' matrices to be irrelevant to each other, i.e., $\forall l \neq l'$, elements in $T^{(l,uv)}$ and elements in $T^{(l',uv)}$ are irrelevant to each other. Moreover, since the early training of a DNN mainly modifies a few parameters according to the lottery ticket hypothesis (Frankle & Carbin, 2018), we can still assume such irrelevance in early epochs, as follows.

Assumption 2 (proven in Appendix A.5). We assume that all elements in $T^{(l,uv)}$ are irrelevant to each other, and $\forall l \neq l'$, elements in $T^{(l,uv)}$ and $T^{(l',uv)}$ are irrelevant to each other in early epochs.
$$\forall d \neq d',\ \forall c \neq c',\quad \mathbb{E}_{W^{(l)}}\!\left[T^{(l,uv)}_{dc} T^{(l,uv)}_{d'c'}\right] = \mathbb{E}_{W^{(l)}}\!\left[T^{(l,uv)}_{dc}\right] \mathbb{E}_{W^{(l)}}\!\left[T^{(l,uv)}_{d'c'}\right] \tag{10}$$

$$\forall l, d, c, d', c',\quad \mathbb{E}_{W^{(l)},\dots,W^{(1)}}\!\left[T^{(l,uv)}_{dc} T^{(uv)(l-1:1)}_{d'c'}\right] = \mathbb{E}_{W^{(l)}}\!\left[T^{(l,uv)}_{dc}\right] \mathbb{E}_{W^{(l-1)},\dots,W^{(1)}}\!\left[T^{(uv)(l-1:1)}_{d'c'}\right] \tag{11}$$

Besides, according to experimental experience, the mean value of all parameters in $W^{(l)}$ usually has a small bias during the training process, instead of being exactly zero. Therefore, let us assume that in early epochs, each parameter in $W^{(l)}$ is sampled from a Gaussian distribution $\mathcal{N}(\mu_l, \sigma_l^2)$. According to $h^{(uv)} = T^{(uv)(L:1)} g^{(uv)} + \delta_{uv} M N\, b$ in Corollary 1, each frequency component $h^{(uv)}$ of the output spectrum is exclusively determined by the component $g^{(uv)}$ of the input sample and the matrix $T^{(uv)(L:1)} = T^{(L,uv)} \cdots T^{(2,uv)} T^{(1,uv)}$, since $\delta_{uv} = 0$ at all frequencies other than the fundamental frequency. Therefore, the magnitude of $T^{(uv)(L:1)}$ reflects the strength of the network in encoding the specific frequency component $g^{(uv)}$.

Theorem 3 (proven in Appendix A.5). Based on Assumption 1 and Assumption 2, we can prove that $T^{(l,uv)}_{dc}$ follows a Gaussian distribution of complex numbers, as follows.

$$\forall d, c,\quad T^{(l,uv)}_{dc} \sim \mathrm{ComplexN}\!\left(\tilde{\mu} = \mu_l R_{uv},\ \tilde{\sigma}^2 = K^2 \sigma_l^2,\ r = \sigma_l^2 R_{2u,2v}\right) \tag{12}$$

$$\text{s.t.}\quad R_{uv} = \frac{\sin(uK\pi/M)}{\sin(u\pi/M)}\, \frac{\sin(vK\pi/N)}{\sin(v\pi/N)}\, e^{i\left(\frac{(K-1)u}{M} + \frac{(K-1)v}{N}\right)\pi}$$

Different from the Gaussian distribution of real numbers, the Gaussian distribution of complex numbers has three parameters $\tilde{\mu} \in \mathbb{C}$, $\tilde{\sigma}^2 \in \mathbb{R}$, and $r \in \mathbb{C}$, which control the mean value, the variance, and the diversity of the phase of the sampled complex number, respectively. Specifically, a large value of $|r|$ indicates that the sampled complex number $T^{(l,uv)}_{dc}$ is less likely to have diverse phases. $R_{uv} \in \mathbb{C}$ is a complex coefficient, $0 \leq |R_{uv}| \leq K^2$. For a low-frequency component $[u_{low}, v_{low}]$, $|R_{u_{low} v_{low}}|$ is relatively large.
Therefore, the second-order moment of $T^{(l,u_{low}v_{low})}_{dc}$, i.e., $|\mu_l R_{u_{low}v_{low}}|^2 + K^2\sigma_l^2$, is large, which indicates that the sampled $T^{(l,u_{low}v_{low})}_{dc}$ is more likely to have a large norm. Besides, the parameter $|r| = |\sigma_l^2 R_{2u_{low},2v_{low}}|$ is large for low frequencies, which means that the sampled $T^{(l,u_{low}v_{low})}_{dc}$ is less likely to have diverse phases. In contrast, for a high-frequency component $[u_{high}, v_{high}]$, the sampled $T^{(l,u_{high}v_{high})}_{dc}$ is less likely to have a large norm and is more likely to have diverse phases.

Theorem 4 (proven in Appendix A.6). Consider the simplest case in which each convolutional layer only contains a feature map with a single channel, i.e., $\forall l, C_l = 1$. Then, based on Theorem 3 and Assumption 2, $T^{(uv)(L:1)} = T^{(L,uv)} \cdots T^{(2,uv)} T^{(1,uv)} \in \mathbb{C}$ follows a distribution given by the product of $L$ complex numbers, where each complex number follows a Gaussian distribution. The mean value of $T^{(uv)(L:1)}$ is $\prod_{l=1}^{L} \mu_l R_{uv} \in \mathbb{C}$. The logarithm of the second-order moment is given as $\log \mathrm{SOM}(T^{(uv)(L:1)}) = \sum_{l=1}^{L} \log(|\mu_l R_{uv}|^2 + K^2\sigma_l^2) \in \mathbb{R}$. For the more general case in which each convolutional kernel contains more than one channel, i.e., $\forall l, C_l > 1$, $\mathrm{SOM}(T^{(uv)(L:1)})$ also approximately increases exponentially along with the depth of the network, with a quite complicated analytic solution. Please see Appendix A.6 for the proof.

(Conclusion 2) Therefore, according to the above proof, the convolution operation makes a cascaded convolutional decoder network more likely to weaken the high-frequency components of the input sample, if the convolution operation does not change the feature map size. Specifically, we obtain the following five remarks to specify detailed mechanisms of weakening high-frequency components.

Remark 1. According to Theorem 4, for each frequency component $[u, v]$, the second-order moment $\mathrm{SOM}(T^{(uv)(L:1)})$ will increase exponentially along with the depth $L$ of the network.
We can consider that each layer's $T^{(l,uv)}$ has an independent effect $\log(|\mu_l R_{uv}|^2 + K^2\sigma_l^2)$ on $\log \mathrm{SOM}(T^{(uv)(L:1)}) = \sum_{l=1}^{L} \log(|\mu_l R_{uv}|^2 + K^2\sigma_l^2)$. We admit that the conclusion in Remark 1 is derived from the second-order moment of $T^{(uv)(L:1)}$, instead of being a deterministic claim for a specific neural network. Nevertheless, according to the Law of Large Numbers, $\mathrm{SOM}(T^{(uv)(L:1)})$ is still a convincing metric to reflect the significance of $T^{(uv)(L:1)}$.

Remark 2. If the decoder network is deep, then the decoder network is less likely to learn high-frequency components. This is because $|R_{uv}|$ is relatively large for low-frequency components. In this way, the large effect of a single layer's $T^{(l,uv)}$ at low frequencies on $\log \mathrm{SOM}(T^{(uv)(L:1)})$, i.e., $\log(|\mu_l R_{uv}|^2 + K^2\sigma_l^2)$, can be accumulated through different layers, according to the Law of Large Numbers and the independence between different layers in Remark 1. Therefore, the large value of $|R_{u_{low}v_{low}}|$ for a low-frequency component $[u_{low}, v_{low}]$ makes $T^{(u_{low}v_{low})(L:1)}$ more likely to have a large norm, whereas the small value of $|R_{u_{high}v_{high}}|$ for a high-frequency component $[u_{high}, v_{high}]$ makes $T^{(u_{high}v_{high})(L:1)}$ less likely to have a large norm. This indicates that a deep decoder network will almost certainly strengthen the encoding of low-frequency components of the input sample, while weakening the encoding of high-frequency components.

Remark 3. If the expectation $\mu_l$ of convolutional weights in each $l$-th layer has a large absolute value $|\mu_l|$, then the decoder network is less likely to learn high-frequency components. This is because, according to Theorem 4, a large absolute value $|\mu_l|$ boosts the imbalanced effects $|\mu_l R_{uv}|^2$ among different frequency components, thereby strengthening the trend of encoding low-frequency components of the input sample.

Remark 4.
If the convolutional kernel size $K$ is small, then the decoder network is less likely to learn high-frequency components. This is because, according to Theorem 4, a large $K$ value alleviates the imbalance of the second-order moment $\mathrm{SOM}(T^{(uv)(L:1)})$ between low frequencies and high frequencies caused by the imbalance of $|R_{uv}|$. Thus, a small $K$ value strengthens the trend of encoding low-frequency components of the input sample.

Remark 5. If the cascaded convolutional decoder network is trained on natural images, then the decoder network is less likely to learn high-frequency components. Previous studies (Ruderman, 1994) have empirically found that natural images are dominated by low-frequency components. Specifically, frequency spectrums of natural images follow a power-law distribution, i.e., low-frequency components (e.g., frequency components $[u, v]$ close to $[0, 0]$, $[0, N-1]$, $[M-1, 0]$, or $[M-1, N-1]$) have a much larger length $\|g^{(uv)}\|^2 = \sum_c |G^{(c)}_{uv}|^2$ than other frequency components. Besides, according to the rule of the forward propagation in Equation (7) and the change of $T^{(l,uv)}$ in Equation (8), if the frequency component $g^{(uv)}$ of the input image has a large magnitude, then $h^{(uv)}$ of the output image also has a large magnitude. This means that using natural images as input strengthens the trend of encoding low-frequency components.

These five remarks tell us different ways to strengthen or weaken the capacity of a decoder for modeling specific frequency components. Experiments in Section 4 have verified Remarks 1 to 4 in the general case that each convolutional layer contains more than one feature map.

• Effects of the zero-padding operation. To simplify the proof, let us consider the following one-size zero-padding. Given each $c$-th channel $F^{(c)} \in \mathbb{R}^{M \times N}$ of the feature map, the zero-padding puts zero values at the edge of $F^{(c)}$, so as to obtain a new feature $F'^{(c)} \in \mathbb{R}^{M' \times N'}$, as follows.
$$\forall m, n,\quad F'^{(c)}_{mn} = \begin{cases} F^{(c)}_{mn}, & 0 \leq m < M,\ 0 \leq n < N \\ 0, & M \leq m < M'\ \text{or}\ N \leq n < N' \end{cases} \tag{13}$$

We have proven that the zero-padding operation boosts the magnitudes of low-frequency components of feature spectrums, as shown in Theorem 5.

Theorem 5 (proven in Appendix A.7). Let each element in each $c$-th channel $F^{(c)}$ of the feature map follow the Gaussian distribution $\mathcal{N}(a, \sigma^2)$. $G^{(c)} \in \mathbb{C}^{M \times N}$ denotes the frequency spectrum of $F^{(c)}$, and $H^{(c)} \in \mathbb{C}^{M' \times N'}$ denotes the frequency spectrum of the output feature $F'^{(c)}$ after applying zero-padding to $F^{(c)}$. Then, the zero-padding on $F^{(c)}$ brings in additional signals at each frequency $[u, v]$, whose strength is measured by averaging over different sampled features:

$$\forall 0 \leq u < M,\ 0 \leq v < N,\quad \mathbb{E}_{F^{(c)}}\!\left[\left|H^{(c)}_{uv} - G^{(c)}_{uv}\right|\right] = |a| \left| \frac{\sin(Mu\pi/M')}{\sin(u\pi/M')}\, \frac{\sin(Nv\pi/N')}{\sin(v\pi/N')} - MN\delta_{uv} \right|; \tag{14}$$

$$\forall M \leq u < M',\ N \leq v < N',\quad \mathbb{E}_{F^{(c)}}\!\left[H^{(c)}_{uv}\right] = a\, \frac{\sin(Mu\pi/M')}{\sin(u\pi/M')}\, \frac{\sin(Nv\pi/N')}{\sin(v\pi/N')}\, e^{-i\left(\frac{(M-1)u}{M'} + \frac{(N-1)v}{N'}\right)\pi} \tag{15}$$

(Conclusion 3) According to the rule of the forward propagation in Equation (7) and the change of $T^{(l,uv)}$ in Equation (8), the zero-padding operation strengthens the trend of encoding low-frequency components of the input sample, because $\mathbb{E}_{F^{(c)}}[|H^{(c)}_{uv} - G^{(c)}_{uv}|]$ is large for low frequencies $[u, v]$.

• Effects of the upsampling operation. Let the $l$-th intermediate-layer feature map $F \in \mathbb{R}^{C_l \times M_0 \times N_0}$ pass through an upsampling layer to extend its width and height to $M \times N$, subject to $M = M_0 \cdot ratio$, $N = N_0 \cdot ratio$, as follows.

$$\forall c, m^*, n^*,\quad F'^{(c)}_{m^*n^*} = \begin{cases} F^{(c)}_{mn}, & \mathrm{mod}(m^*, ratio) = 0\ \text{and}\ \mathrm{mod}(n^*, ratio) = 0 \\ 0, & \text{otherwise} \end{cases} \quad \text{s.t.}\;\; m = m^*/ratio,\ n = n^*/ratio \tag{16}$$

Theorem 6 (proven in Appendix A.8). Let $G = [G^{(1)}, G^{(2)}, \dots, G^{(C_l)}] \in \mathbb{C}^{C_l \times M_0 \times N_0}$ denote the spectrums of the $C_l$ channels of the feature $F$. Then, the spectrums $H = [H^{(1)}, H^{(2)}, \dots, H^{(C_l)}] \in \mathbb{C}^{C_l \times M \times N}$ of the output feature $F'$ can be computed as follows.

$$\forall c, u, v,\quad H^{(c)}_{u+(s-1)M_0,\ v+(t-1)N_0} = G^{(c)}_{uv}, \quad \text{s.t.}\;\; s = 1, \dots, M/M_0;\ t = 1, \dots, N/N_0$$

Theorem 6 shows that the upsampling operation repeats the strong magnitude of the fundamental frequency $G^{(c)}_{00}$ of the lower layer at different frequency components $H^{(c)}_{u^*v^*}$ of the higher layer, where $u^* = 0, M_0, 2M_0, \dots$ and $v^* = 0, N_0, 2N_0, \dots$. Such a phenomenon is shown in Appendix C.2.

Figure 4: (a) The linear increase of $\log \mathrm{SOM}(h^{(uv)})$ along with the layer number. (b) A small kernel size $K$ usually made the network learn a higher proportion $p_{low}$ of low-frequency components.

(Conclusion 4) The upsampling operation makes the upconvolution operation generate a feature spectrum in which strong signals of the input repetitively appear at certain frequencies. Such unexpected strong signals hurt the representation capacity of the network. More crucially, according to the spectrum propagation in Corollary 1, such unexpected frequency components can be further propagated to upper layers. Thus, Corollary 1 may provide some clues to differentiate real samples from generated samples.
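Theorem 6 is easy to verify numerically: zero-insertion upsampling exactly tiles the input spectrum. A minimal sketch (assuming NumPy; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
C, M0, N0, ratio = 2, 4, 4, 2
F = rng.normal(size=(C, M0, N0))

# Zero-insertion upsampling of Eq. (16): keep F at stride `ratio`, zeros elsewhere.
Fp = np.zeros((C, M0 * ratio, N0 * ratio))
Fp[:, ::ratio, ::ratio] = F

G = np.fft.fft2(F)    # input spectra,  shape (C, M0, N0)
H = np.fft.fft2(Fp)   # output spectra, shape (C, M0*ratio, N0*ratio)

# Theorem 6: the output spectrum is the input spectrum tiled ratio x ratio times,
# so the strong fundamental frequency G[:, 0, 0] reappears at every [s*M0, t*N0].
print(np.allclose(H, np.tile(G, (1, ratio, ratio))))  # True
```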

4. EXPERIMENTS

• Verifying that a neural network usually learns low-frequency components first. Our theorems prove that a cascaded convolutional decoder network weakens the encoding of high-frequency components. In this experiment, we visualized the spectrums of images generated by a decoder network, which showed that the decoder usually learned low-frequency components in early epochs and then shifted its attention to high-frequency components. To this end, we constructed a cascaded convolutional auto-encoder by using VGG-16 (Simonyan & Zisserman, 2015) as the encoder network. The decoder network contained four upconvolutional layers. Each convolutional/upconvolutional layer in the auto-encoder applied zero-padding and was followed by a batch normalization layer and a ReLU layer. The auto-encoder was trained on the Tiny-ImageNet dataset (Le & Yang, 2015) using the mean squared error (MSE) loss for image reconstruction. Our theorem was verified by the well-known phenomenon in Figure 1(a), i.e., an auto-encoder usually first generated images with low-frequency components, and then gradually generated more high-frequency components. In addition, Appendix C.1 shows results on more datasets, which yielded similar conclusions.

• Verifying that the upsampling operation made a decoder network repeat strong signals at certain frequencies of the generated image, as stated in Theorem 6. To this end, we compared the input spectrum and the output spectrum of the upsampling layer. We conducted experiments on the auto-encoder introduced above. Figure 1(b) shows that the decoder network repeated strong signals at certain frequencies of the generated image. In addition, Appendix C.2 shows results on more datasets, which yielded similar conclusions.

• Verifying that the zero-padding operation strengthened the encoding of low-frequency components.
To this end, we compared feature spectrums between a network with zero-padding operations and a network without zero-padding operations. Therefore, we constructed the following two baseline networks. The first baseline network contained 5 convolutional layers, and each layer applied zero-padding. Each convolutional layer contained 16 convolutional kernels (the kernel size was 7×7), except for the last layer, which contained 3 convolutional kernels. The second baseline network was constructed by replacing all zero-padding operations with circular padding operations. Results on the Broden dataset in Figure 5(c) show that the network with zero-padding operations encoded more significant low-frequency components than the network with circular padding operations. In addition, Appendix C.3 shows results on more datasets, which yielded similar conclusions.

• Verifying factors that strengthened low-frequency components. (1) Verifying that a deep network strengthened low-frequency components, as stated in Remark 1 and Remark 2. To this end, we constructed a network with 50 convolutional layers. Each convolutional layer applied zero-padding to avoid changing the size of feature maps, and was followed by a ReLU layer. We conducted this experiment on three datasets, namely the CIFAR-10 (Krizhevsky et al., 2009), Tiny-ImageNet, and Broden (Bau et al., 2017) datasets. The exponential increase of $T^{(uv)(L:1)}$ along with the network depth $L$ in Remark 1 indicates that the frequency component $h^{(uv)}$ of the network output also increases exponentially along with $L$. Therefore, for the frequency component $h^{(uv)}$ generated by each $l$-th layer in a real decoder network, we measured its second-order moment $\mathrm{SOM}(h^{(uv)})$. Figure 4(a) shows that $\mathrm{SOM}(h^{(uv)})$ increased along with the layer number in an exponential manner.
Besides, we visualized feature spectrums of different convolutional layers, which verified the claim in Remark 2 that a deep decoder network strengthens the encoding of low-frequency components of the input sample. Results on the Broden dataset in Figure 5(a) show that magnitudes of low-frequency components increased along with the network layer number. In addition, Appendix C.4 shows results on more datasets, which yielded similar conclusions. (2) Verifying that a larger absolute mean value µ_l of each l-th layer's parameters strengthened low-frequency components in Remark 3. To this end, we compared feature spectrums of the same network architecture with different mean values µ_l of parameters. Therefore, we applied the network architecture used in the verification of the effects of the zero-padding, but changed the kernel size to 9×9. Based on this architecture, we constructed three networks, whose parameters were sampled from Gaussian distributions N(µ = 0, σ² = 0.01²), N(µ = 0.001, σ² = 0.01²), and N(µ = 0.01, σ² = 0.01²), respectively. Results on the Broden dataset in Figure 5(b) show that magnitudes of low-frequency components increased along with the absolute mean value of parameters. In addition, Appendix C.5 shows results on more datasets, which yielded similar conclusions. (3) Verifying that a small kernel size K strengthened low-frequency components in Remark 4. To this end, we compared feature spectrums of networks with different kernel sizes. Therefore, we constructed three networks with kernel sizes of 1×1, 3×3, and 5×5. Each network contained 5 convolutional layers, and each layer contained 16 convolutional kernels, except for the last layer containing 3 kernels.
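The role of the parameter mean µ_l can be illustrated directly from Theorem 3: E[T^(l,uv)_dc] = µ_l R_uv, and |R_uv| is largest at the fundamental frequency. The following sketch (illustrative sizes M = N = 32, K = 5, not the paper's setup) computes R_uv and confirms its low-frequency concentration, which is why a nonzero µ_l amplifies low frequencies, and why cascading many such layers compounds the amplification:

```python
import numpy as np

# R_uv = sum_{t,s < K} exp(i 2 pi (u t / M + v s / N)), computed as the
# outer product of two one-dimensional geometric sums (Lemma 1).
M = N = 32
K = 5
taps = np.arange(K)
row = np.exp(2j * np.pi * np.outer(np.arange(M), taps) / M).sum(axis=1)
col = np.exp(2j * np.pi * np.outer(np.arange(N), taps) / N).sum(axis=1)
R = np.outer(row, col)
mag = np.abs(R)          # |R_uv| peaks at u = v = 0 with value K^2
```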
We used the metric $p_{low} = \sum_{[u,v]\in\Omega_{low}} \mathbb{E}_c\big[|H^{(c)}_{uv}|^2\big] \big/ \sum_{u,v} \mathbb{E}_c\big[|H^{(c)}_{uv}|^2\big]$ to measure the ratio of low-frequency components to all frequency components, where
$$\Omega_{low} = \{[u,v] \mid 0\le u < M/8,\ 0\le v < N/8\} \cup \{0\le u < M/8,\ 7N/8\le v < N\} \cup \{7M/8\le u < M,\ 0\le v < N/8\} \cup \{7M/8\le u < M,\ 7N/8\le v < N\}.$$
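A minimal NumPy implementation of this metric might look as follows (`p_low` is a hypothetical helper; the four corner blocks of the unshifted spectrum correspond to the region Ω_low above):

```python
import numpy as np

def p_low(H):
    """Ratio of low-frequency energy to total energy for spectra H of
    shape (C, M, N): Omega_low is the union of the four M/8 x N/8
    corner blocks of the unshifted (fft2) spectrum."""
    C, M, N = H.shape
    power = np.mean(np.abs(H) ** 2, axis=0)     # E_c[|H_uv|^2]
    a, b = M // 8, N // 8
    low = (power[:a, :b].sum() + power[:a, N - b:].sum()
           + power[M - a:, :b].sum() + power[M - a:, N - b:].sum())
    return low / power.sum()

# Example usage on the spectra of a random 3-channel image
spec = np.fft.fft2(np.random.randn(3, 32, 32), axes=(-2, -1))
ratio = p_low(spec)
```

A constant image yields p_low = 1 (all energy at the fundamental frequency), while a pure checkerboard yields p_low = 0.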

5. CONCLUSION

In this paper, we have reformulated the rule for the forward propagation of a cascaded convolutional decoder network in the frequency domain. Based on such propagation rules, we have discovered and theoretically proven that both the convolution operation and the zero-padding operation strengthen low-frequency components in the decoder. Besides, the upsampling operation repeats the strong magnitude of the fundamental frequency of the input feature at different frequencies of the spectrum of the output feature map. Such properties may hurt the representation capacity of a convolutional decoder network. Experiments have verified our theoretical proofs. Note that our findings explain general trends of networks with the above three operations; they cannot derive a deterministic property of a specific network, and cannot yet be extended to networks for image classification, because we have not derived the property of the max-pooling operation, which we leave for future work.

ETHICS STATEMENT

As fundamental research in machine learning, this paper does not introduce any new ethical or societal concerns. The results in this paper do not include misleading claims; their correctness is theoretically verified. Related work is accurately represented. Although in theory any technique can be misused, such misuse is unlikely at the current stage.

REPRODUCIBILITY STATEMENT

This research discovered and theoretically explained two bottlenecks of a cascaded convolutional decoder network in representing feature spectrums. For our theoretical results, formal statements and complete proofs of all Theorems, Corollaries, and the Assumption in Section 2 are provided in the appendix below.

A PROOFS OF OUR THEORETICAL FINDINGS

We first introduce an important equation, which is widely used in the following proofs.

Lemma 1. Given $N$ complex numbers $e^{in\theta}$, $n = 0, 1, \ldots, N-1$, the sum of these $N$ complex numbers is given as follows.
$$\forall \theta\in\mathbb{R},\quad \sum_{n=0}^{N-1} e^{in\theta} = \frac{\sin(N\theta/2)}{\sin(\theta/2)}\,e^{i(N-1)\theta/2} \tag{1}$$
Specifically, when $N\theta = 2k\pi$, $k\in\mathbb{Z}$, $-N < k < N$, we have
$$\sum_{n=0}^{N-1} e^{in\theta} = \frac{\sin(N\theta/2)}{\sin(\theta/2)}\,e^{i(N-1)\theta/2} = N\delta_\theta,\quad \text{where } \delta_\theta = \begin{cases}1, & \theta = 0\\ 0, & \text{otherwise}\end{cases} \tag{2}$$

Proof. Let $S = \sum_{n=0}^{N-1} e^{in\theta} \in \mathbb{C}$. Then $e^{i\theta}S = \sum_{n=1}^{N} e^{in\theta}$, so
$$S = \frac{e^{i\theta}S - S}{e^{i\theta}-1} = \frac{e^{iN\theta}-1}{e^{i\theta}-1} = \frac{(e^{iN\theta/2}-e^{-iN\theta/2})/2i}{(e^{i\theta/2}-e^{-i\theta/2})/2i}\,e^{i(N-1)\theta/2} = \frac{\sin(N\theta/2)}{\sin(\theta/2)}\,e^{i(N-1)\theta/2}.$$
This proves Equation (1). For the special case $N\theta = 2k\pi$, $k\in\mathbb{Z}$, $-N<k<N$: when $\theta = 0$, $\lim_{\theta\to 0}\frac{\sin(N\theta/2)}{\sin(\theta/2)} = N$; when $\theta\neq 0$, the sum equals $\frac{\sin(k\pi)}{\sin(k\pi/N)}\,e^{i(N-1)k\pi/N} = 0$. Hence $\sum_{n=0}^{N-1} e^{in\theta} = N\delta_\theta$. □

In the following proofs, two equations derived from Lemma 1 are widely used. First,
$$\sum_{m=0}^{M-1}\sum_{n=0}^{N-1} e^{-i(\frac{um}{M}+\frac{vn}{N})2\pi} = \Big(\sum_{m=0}^{M-1} e^{im(-\frac{u2\pi}{M})}\Big)\Big(\sum_{n=0}^{N-1} e^{in(-\frac{v2\pi}{N})}\Big) = \big(M\delta_{-\frac{u2\pi}{M}}\big)\big(N\delta_{-\frac{v2\pi}{N}}\big) = MN\,\delta_{uv} = \begin{cases}MN, & u = v = 0\\ 0, & \text{otherwise}\end{cases} \tag{3}$$
according to Equation (2), where, to simplify the representation, $\delta_{uv}$ abbreviates $\delta_{-\frac{u2\pi}{M}}\,\delta_{-\frac{v2\pi}{N}}$ in the following proofs.
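Lemma 1 is easy to sanity-check numerically; the sketch below compares the raw geometric sum with the closed form (hypothetical helper names):

```python
import numpy as np

def geom_sum(N, theta):
    """Direct evaluation of sum_{n=0}^{N-1} exp(i n theta)."""
    return np.exp(1j * theta * np.arange(N)).sum()

def lemma1(N, theta):
    """Closed form sin(N theta/2)/sin(theta/2) * exp(i (N-1) theta/2)."""
    return (np.sin(N * theta / 2) / np.sin(theta / 2)
            * np.exp(1j * (N - 1) * theta / 2))

# Example: the two expressions agree for generic theta
val = geom_sum(8, 0.5)
```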
Similarly, we derive the second equation as follows.
$$\sum_{m=0}^{M-1}\sum_{n=0}^{N-1} e^{i(\frac{(u-u')m}{M}+\frac{(v-v')n}{N})2\pi} = MN\,\delta_{\frac{(u-u')2\pi}{M}}\,\delta_{\frac{(v-v')2\pi}{N}} = MN\,\delta_{u-u'}\,\delta_{v-v'} = \begin{cases}MN, & u' = u;\ v' = v\\ 0, & \text{otherwise}\end{cases} \tag{4}$$

A.1 PROOF OF THEOREM 1

In this section, we prove Theorem 1 in Section 2 of the main paper.

Theorem 1. Let $H = [H^{(1)}, H^{(2)}, \ldots, H^{(D)}] \in \mathbb{C}^{D\times M'\times N'}$ denote spectrums of the output feature $F' \in \mathbb{R}^{D\times M'\times N'}$. Then, $H$ can be computed as follows.
$$h^{(u'v')} = \delta_{u'v'}MN\,b + \sum_{u=0}^{M-1}\sum_{v=0}^{N-1} \alpha^{u'v'}_{uv}\,T^{(uv)}g^{(uv)},\quad \text{s.t. } \delta_{u'v'} = \begin{cases}1, & u' = 0;\ v' = 0\\ 0, & \text{otherwise}\end{cases};$$
$$\alpha^{u'v'}_{uv} = \frac{1}{MN}\,\frac{\sin((M-K)\lambda_{uu'}\pi)}{\sin(\lambda_{uu'}\pi)}\,\frac{\sin((N-K)\gamma_{vv'}\pi)}{\sin(\gamma_{vv'}\pi)}\,e^{i((M-K)\lambda_{uu'}+(N-K)\gamma_{vv'})\pi};$$
where $b = [b^{(1)}, b^{(2)}, \ldots, b^{(D)}] \in \mathbb{R}^D$ denotes the vector of bias terms; $h^{(u'v')} = [H^{(1)}_{u'v'}, H^{(2)}_{u'v'}, \ldots, H^{(D)}_{u'v'}] \in \mathbb{C}^D$; $\alpha^{u'v'}_{uv} \in \mathbb{C}$ is a coefficient; $\lambda_{uu'} = \frac{(u-u')M - u(K-1)}{M(M-K+1)}$, $\gamma_{vv'} = \frac{(v-v')N - v(K-1)}{N(N-K+1)}$. $T^{(uv)} \in \mathbb{C}^{D\times C}$ is a matrix of complex numbers, which is exclusively determined by the convolutional kernels $W^{[ker=1]}, W^{[ker=2]}, \ldots, W^{[ker=D]}$:
$$T^{(uv)}_{dc} = \sum_{t=0}^{K-1}\sum_{s=0}^{K-1} W^{[ker=d]}_{cts}\,e^{i(\frac{ut}{M}+\frac{vs}{N})2\pi},\quad d = 1,\ldots,D;\ c = 1,\ldots,C.$$

Proof. Given each $c$-th channel of the feature spectrum $G^{(c)}$, the corresponding feature $F^{(c)}$ in the time domain can be computed as $F^{(c)}_{mn} = \frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} G^{(c)}_{uv}\,e^{i(\frac{um}{M}+\frac{vn}{N})2\pi}$. Then, let us conduct the convolution operation (Equation (1) in the main paper) on the feature $F = [F^{(1)}, F^{(2)}, \ldots, F^{(C)}]$, in order to obtain the output feature $F' \in \mathbb{R}^{D\times M'\times N'}$. $\forall d = 1,\ldots,D;\ 0\le m < M';\ 0\le n < N'$:
$$F'^{(d)}_{mn} = b^{(d)} + \sum_{c=1}^{C}\sum_{t=0}^{K-1}\sum_{s=0}^{K-1} W^{[ker=d]}_{cts}F^{(c)}_{m+t,n+s} = b^{(d)} + \sum_{c=1}^{C}\frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} T^{(uv)}_{dc}\,G^{(c)}_{uv}\,e^{i(\frac{um}{M}+\frac{vn}{N})2\pi}.$$
Then, let us conduct the DFT on each channel of $F'$, in order to obtain the feature spectrums $H^{(d)}_{u'v'}$ of $F'$. $\forall d = 1,\ldots,D;\ 0\le u' < M';\ 0\le v' < N'$:
$$H^{(d)}_{u'v'} = \sum_{m=0}^{M'-1}\sum_{n=0}^{N'-1} F'^{(d)}_{mn}\,e^{-i(\frac{u'm}{M'}+\frac{v'n}{N'})2\pi} = MN\,b^{(d)}\delta_{u'v'} + \sum_{u=0}^{M-1}\sum_{v=0}^{N-1}\alpha^{u'v'}_{uv}\sum_{c=1}^{C} T^{(uv)}_{dc}\,G^{(c)}_{uv},$$
where the bias term follows from Equation (3), and we let $\alpha^{u'v'}_{uv} = \frac{1}{MN}\sum_{m=0}^{M'-1}\sum_{n=0}^{N'-1} e^{i((\frac{u}{M}-\frac{u'}{M'})m+(\frac{v}{N}-\frac{v'}{N'})n)2\pi}$. When the convolution operation does not apply paddings and its stride size is 1, $M' = M-K+1$ and $N' = N-K+1$. In this case,
$$\alpha^{u'v'}_{uv} = \frac{1}{MN}\sum_{m=0}^{M-K} e^{i(\frac{u}{M}-\frac{u'}{M-K+1})2\pi m}\sum_{n=0}^{N-K} e^{i(\frac{v}{N}-\frac{v'}{N-K+1})2\pi n} = \frac{1}{MN}\,\frac{\sin((M-K)\lambda_{uu'}\pi)}{\sin(\lambda_{uu'}\pi)}\,\frac{\sin((N-K)\gamma_{vv'}\pi)}{\sin(\gamma_{vv'}\pi)}\,e^{i((M-K)\lambda_{uu'}+(N-K)\gamma_{vv'})\pi}, \tag{5}$$
according to Equation (1), where $\lambda_{uu'} = \frac{(u-u')M - u(K-1)}{M(M-K+1)}$ and $\gamma_{vv'} = \frac{(v-v')N - v(K-1)}{N(N-K+1)}$. Therefore, we prove that the vector $h^{(u'v')} = [H^{(1)}_{u'v'}, \ldots, H^{(D)}_{u'v'}] \in \mathbb{C}^D$ can be computed as $h^{(u'v')} = \delta_{u'v'}MN\,b + \sum_{u=0}^{M-1}\sum_{v=0}^{N-1}\alpha^{u'v'}_{uv}\,T^{(uv)}g^{(uv)}$. □

A.2 PROOF OF THEOREM 2

In this section, we prove Theorem 2 in Section 2 of the main paper.

Theorem 2.
Based on Assumption 1, the layerwise dynamics of feature spectrums in the frequency domain can be simplified as follows.
$$h^{(uv)} = T^{(uv)}g^{(uv)} + \delta_{uv}MN\,b \tag{6}$$

Proof. Based on Assumption 1, the convolution operation does not change the size of the feature map, i.e., $M' = M$, $N' = N$. In this case, $\alpha^{u'v'}_{uv}$ can be computed as follows.
$$\alpha^{u'v'}_{uv} = \frac{1}{MN}\sum_{m=0}^{M-1}\sum_{n=0}^{N-1} e^{i(\frac{(u-u')m}{M}+\frac{(v-v')n}{N})2\pi} = \frac{1}{MN}\sum_{m=0}^{M-1} e^{i\frac{(u-u')2\pi}{M}m}\sum_{n=0}^{N-1} e^{i\frac{(v-v')2\pi}{N}n} = \delta_{u-u'}\,\delta_{v-v'}\quad \text{//According to Equation (4)}$$
where $\delta_{u-u'} = 1$ if $u' = u$ and $0$ otherwise, and $\delta_{v-v'}$ is defined similarly. Therefore, $h^{(u'v')}$ can be computed as
$$h^{(u'v')} = \sum_{u=0}^{M-1}\sum_{v=0}^{N-1}\alpha^{u'v'}_{uv}\,T^{(uv)}g^{(uv)} + \delta_{u'v'}MN\,b = T^{(u'v')}g^{(u'v')} + \delta_{u'v'}MN\,b.$$
Then, we prove that $h^{(uv)} = T^{(uv)}g^{(uv)} + \delta_{uv}MN\,b$. □
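Theorem 2 can be verified numerically by realizing Assumption 1 with a circular convolution, which keeps the M × N feature size. The sketch below (illustrative sizes, not the paper's code) builds T^(uv) explicitly from random weights and checks h^(uv) = T^(uv) g^(uv) + δ_uv MN b at every frequency:

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, K, M, N = 2, 3, 3, 8, 8
F = rng.normal(size=(C, M, N))
W = rng.normal(size=(D, C, K, K))
b = rng.normal(size=D)

# Circular cross-correlation, matching F'_{mn} = b + sum W_cts F_{m+t,n+s}.
Fout = np.zeros((D, M, N))
for d in range(D):
    Fout[d] = b[d]
    for c in range(C):
        for t in range(K):
            for s in range(K):
                Fout[d] += W[d, c, t, s] * np.roll(F[c], (-t, -s), axis=(0, 1))

G = np.fft.fft2(F, axes=(-2, -1))       # g^(uv), per channel
H = np.fft.fft2(Fout, axes=(-2, -1))    # h^(uv), per channel

# T^(uv)_dc = sum_{t,s} W[d,c,t,s] exp(i 2 pi (ut/M + vs/N))
u = np.arange(M); v = np.arange(N); taps = np.arange(K)
Eu = np.exp(2j * np.pi * np.outer(u, taps) / M)     # (M, K)
Ev = np.exp(2j * np.pi * np.outer(v, taps) / N)     # (N, K)
T = np.einsum('ut,vs,dcts->uvdc', Eu, Ev, W)

H_pred = np.einsum('uvdc,cuv->duv', T, G)
H_pred[:, 0, 0] += M * N * b            # bias only at the fundamental frequency
```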

A.3 PROOF OF COROLLARY 1

In this section, we prove Corollary 1 in Section 2 of the main paper.

Corollary 1. Let $G = [G^{(1)}, G^{(2)}, \ldots, G^{(C_0)}]\in\mathbb{C}^{C_0\times M\times N}$ denote the frequency spectrums of the $C_0$ channels of the input $x$. Then, based on Assumption 1, spectrums of the generated image $\hat{x}$, i.e., $H = [H^{(1)}, H^{(2)}, \ldots, H^{(C_L)}]\in\mathbb{C}^{C_L\times M\times N}$, can be computed as follows.
$$h^{(uv)} = T^{(uv)(L:1)}g^{(uv)} + \delta_{uv}\beta \tag{8}$$
where $T^{(uv)(L:1)} = T^{(L,uv)}\cdots T^{(2,uv)}T^{(1,uv)}\in\mathbb{C}^{C_L\times C_0}$; $g^{(uv)} = [G^{(1)}_{uv},\ldots,G^{(C_0)}_{uv}]\in\mathbb{C}^{C_0}$ and $h^{(uv)} = [H^{(1)}_{uv},\ldots,H^{(C_L)}_{uv}]\in\mathbb{C}^{C_L}$ denote vectors at the frequency $[u,v]$ in the tensors $G$ and $H$, respectively; $\beta = MN\big(b^{(L)} + \sum_{j=2}^{L}T^{(00)(L:j)}b^{(j-1)}\big)\in\mathbb{C}^{C_L}$; and $b^{(l)} = [b^{(l,1)},\ldots,b^{(l,C_l)}]\in\mathbb{R}^{C_l}$ denotes the vector of bias terms of the $C_l$ convolutional kernels in the $l$-th layer.

Proof. Let $G^{(l)} = [G^{(l,1)},\ldots,G^{(l,C_l)}]\in\mathbb{C}^{C_l\times M\times N}$ denote feature spectrums of the $l$-th layer, and let $g^{(l,uv)} = [G^{(l,1)}_{uv},\ldots,G^{(l,C_l)}_{uv}]\in\mathbb{C}^{C_l}$ denote the frequency component at the frequency $[u,v]$. When $l = 0$, $g^{(0,uv)}$ denotes the frequency component of the input sample; when $l = L$, $g^{(L,uv)}$ denotes that of the network output. Based on Theorem 2, $\forall l = 1,\ldots,L$, $g^{(l,uv)} = T^{(l,uv)}g^{(l-1,uv)} + \delta_{uv}MN\,b^{(l)}$. Unrolling this recursion,
$$g^{(L,uv)} = T^{(L,uv)}T^{(L-1,uv)}\cdots T^{(1,uv)}g^{(0,uv)} + \delta_{uv}MN\big(b^{(L)} + T^{(L,uv)}b^{(L-1)} + \cdots + T^{(L,uv)}\cdots T^{(2,uv)}b^{(1)}\big).$$
Let $T^{(uv)(L:1)} = T^{(L,uv)}\cdots T^{(2,uv)}T^{(1,uv)}$ and $\beta = MN\big(b^{(L)} + \sum_{j=2}^{L}T^{(00)(L:j)}b^{(j-1)}\big)$; note that the bias terms survive only at $u = v = 0$ because of the factor $\delta_{uv}$, so the products reduce to $T^{(00)(L:j)}$. Letting $h^{(uv)} = g^{(L,uv)}$ denote the frequency component of the network output and $g^{(uv)} = g^{(0,uv)}$ that of the input sample, we prove that $h^{(uv)} = T^{(uv)(L:1)}g^{(uv)} + \delta_{uv}\beta$. □
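Corollary 1's cascade rule can likewise be checked numerically with two bias-free circular layers, where the per-frequency matrices simply multiply (`circ_layer` and `freq_matrix` are hypothetical helpers; no nonlinearity is applied, matching the linear cascade analyzed here):

```python
import numpy as np

def circ_layer(F, W):
    """Circular cross-correlation layer: out_d = sum_{c,t,s} W[d,c,t,s] * shifted F_c."""
    D, C, K, _ = W.shape
    _, M, N = F.shape
    out = np.zeros((D, M, N))
    for d in range(D):
        for c in range(C):
            for t in range(K):
                for s in range(K):
                    out[d] += W[d, c, t, s] * np.roll(F[c], (-t, -s), axis=(0, 1))
    return out

def freq_matrix(W, M, N):
    """Per-frequency matrices T[u, v, d, c] built from the kernel weights."""
    taps = np.arange(W.shape[-1])
    Eu = np.exp(2j * np.pi * np.outer(np.arange(M), taps) / M)
    Ev = np.exp(2j * np.pi * np.outer(np.arange(N), taps) / N)
    return np.einsum('ut,vs,dcts->uvdc', Eu, Ev, W)

rng = np.random.default_rng(1)
M = N = 8
F = rng.normal(size=(2, M, N))
W1 = rng.normal(size=(3, 2, 3, 3))
W2 = rng.normal(size=(2, 3, 3, 3))
out = circ_layer(circ_layer(F, W1), W2)

G = np.fft.fft2(F, axes=(-2, -1))
T1 = freq_matrix(W1, M, N)
T2 = freq_matrix(W2, M, N)
T21 = np.einsum('uvdc,uvce->uvde', T2, T1)      # T^(uv)(2:1) = T2 T1
H_pred = np.einsum('uvdc,cuv->duv', T21, G)
```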

A.4 PROOF OF COROLLARY 2

In this section, we prove Corollary 2 in Section 2 of the main paper.

Corollary 2. Based on Assumption 1, the change of each frequency component $T^{(l,uv)}$ during the learning process is reformulated as follows.
$$\Delta T^{(l,uv)} = -\eta MN\sum_{u'=0}^{M-1}\sum_{v'=0}^{N-1}\chi^{u'v'}_{uv}\,\overline{\big(T^{(u'v')(l-1:1)}g^{(u'v')} + \delta_{u'v'}\beta'\big)}\,\frac{\partial Loss}{\partial(h^{(u'v')})}\,T^{(u'v')(L:l+1)} \tag{9}$$
$$\text{s.t.}\quad \chi^{u'v'}_{uv} = \frac{1}{MN}\,\frac{\sin(K(u-u')\pi/M)}{\sin((u-u')\pi/M)}\,\frac{\sin(K(v-v')\pi/N)}{\sin((v-v')\pi/N)}\,e^{i(\frac{(K-1)(u-u')}{M}+\frac{(K-1)(v-v')}{N})\pi} \tag{10}$$
where $\eta$ is the learning rate; $\chi^{u'v'}_{uv}\in\mathbb{C}$ is a coefficient; $T^{(u'v')(l-1:1)} = T^{(l-1,u'v')}\cdots T^{(2,u'v')}T^{(1,u'v')}\in\mathbb{C}^{C_{l-1}\times C_0}$; $T^{(u'v')(L:l+1)} = T^{(L,u'v')}\cdots T^{(l+1,u'v')}\in\mathbb{C}^{C_L\times C_l}$; $\beta' = MN\big(b^{(l-1)} + \sum_{j=2}^{l-1}T^{(00)(l-1:j)}b^{(j-1)}\big)\in\mathbb{C}^{C_{l-1}}$; and $\overline{T^{(uv)(l-1:1)}}$ denotes the conjugate of $T^{(uv)(l-1:1)}$.

Proof. First, we focus on a single convolutional layer. According to the DFT and the inverse DFT, the spectrum maps and the frequency-domain weights satisfy
$$G^{(l,c)}_{uv} = \sum_{m=0}^{M-1}\sum_{n=0}^{N-1}F^{(l,c)}_{mn}e^{-i(\frac{um}{M}+\frac{vn}{N})2\pi},\qquad F^{(l,c)}_{mn} = \frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1}G^{(l,c)}_{uv}e^{i(\frac{um}{M}+\frac{vn}{N})2\pi}; \tag{11}$$
$$T^{(l,uv)}_{dc} = \sum_{t=0}^{K-1}\sum_{s=0}^{K-1}W^{(l)[ker=d]}_{cts}e^{i(\frac{ut}{M}+\frac{vs}{N})2\pi},\qquad W^{(l)[ker=d]}_{cts} = \frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1}T^{(l,uv)}_{dc}e^{-i(\frac{ut}{M}+\frac{vs}{N})2\pi}. \tag{12}$$
Accordingly, the gradients satisfy
$$\frac{\partial Loss}{\partial G^{(l,c)}_{uv}} = \frac{1}{MN}\sum_{m=0}^{M-1}\sum_{n=0}^{N-1}\frac{\partial Loss}{\partial F^{(l,c)}_{mn}}e^{-i(\frac{um}{M}+\frac{vn}{N})2\pi},\qquad \frac{\partial Loss}{\partial T^{(l,uv)}_{dc}} = \frac{1}{MN}\sum_{t=0}^{K-1}\sum_{s=0}^{K-1}\frac{\partial Loss}{\partial W^{(l)[ker=d]}_{cts}}e^{i(\frac{ut}{M}+\frac{vs}{N})2\pi}.$$
Let us conduct the convolution operation (based on Assumption 1) on the feature map $F^{(l-1)}\in\mathbb{R}^{C\times M\times N}$, and obtain the output feature map $F^{(l)}\in\mathbb{R}^{D\times M\times N}$ of the $l$-th layer:
$$F^{(l,d)}_{mn} = b^{(d)} + \sum_{c=1}^{C}\sum_{t=0}^{K-1}\sum_{s=0}^{K-1}W^{(l)[ker=d]}_{cts}F^{(l-1,c)}_{m+t,n+s}. \tag{13}$$
Based on Equations (11)-(13) and the derivation rule for complex numbers (Kreutz-Delgado, 2009), substituting the inverse DFT of $F^{(l-1,c)}$ into $\partial Loss/\partial W^{(l)[ker=d]}_{cts} = \sum_{m,n}\frac{\partial Loss}{\partial F^{(l,d)}_{mn}}F^{(l-1,c)}_{m+t,n+s}$ and regrouping the exponentials yields
$$\frac{\partial Loss}{\partial T^{(l,uv)}_{dc}} = \sum_{u'=0}^{M-1}\sum_{v'=0}^{N-1}\chi^{u'v'}_{uv}\,\overline{G^{(l-1,c)}_{u'v'}}\,\frac{\partial Loss}{\partial G^{(l,d)}_{u'v'}},\qquad \chi^{u'v'}_{uv} = \frac{1}{MN}\sum_{t=0}^{K-1}\sum_{s=0}^{K-1}e^{i(\frac{(u-u')t}{M}+\frac{(v-v')s}{N})2\pi},$$
and, according to Equation (1), $\chi^{u'v'}_{uv}$ can be rewritten as the closed form in Equation (10). In vector/matrix form,
$$\frac{\partial Loss}{\partial(T^{(l,uv)})} = \sum_{u'=0}^{M-1}\sum_{v'=0}^{N-1}\chi^{u'v'}_{uv}\,\overline{g^{(l-1,u'v')}}\,\frac{\partial Loss}{\partial(g^{(l,u'v')})}. \tag{14}$$
Similarly, computing the gradient of the loss function w.r.t. the spectrum map $G^{(l-1,c)}$ and applying Equation (4) gives
$$\frac{\partial Loss}{\partial(g^{(l-1,u'v')})} = \frac{\partial Loss}{\partial(g^{(l,u'v')})}\,T^{(l,u'v')}. \tag{15}$$
Furthermore, we extend the above proof of a single convolutional layer to a network with $L$ cascaded convolutional layers. Applying Equation (15) recursively,
$$\frac{\partial Loss}{\partial(g^{(l,u'v')})} = \frac{\partial Loss}{\partial(g^{(L,u'v')})}\,T^{(L,u'v')}\cdots T^{(l+1,u'v')} = \frac{\partial Loss}{\partial(h^{(u'v')})}\,T^{(u'v')(L:l+1)}. \tag{16}$$
By Corollary 1, $g^{(l-1,u'v')} = T^{(u'v')(l-1:1)}g^{(u'v')} + \delta_{u'v'}\beta'$, where $g^{(uv)} = g^{(0,uv)}$ and $h^{(uv)} = g^{(L,uv)}$. Plugging this and Equation (16) into Equation (14),
$$\frac{\partial Loss}{\partial(T^{(l,uv)})} = \sum_{u'=0}^{M-1}\sum_{v'=0}^{N-1}\chi^{u'v'}_{uv}\,\overline{\big(T^{(u'v')(l-1:1)}g^{(u'v')} + \delta_{u'v'}\beta'\big)}\,\frac{\partial Loss}{\partial(h^{(u'v')})}\,T^{(u'v')(L:l+1)}. \tag{17}$$
Finally, the gradient descent update $W^{(l)[ker=d]}_{cts}\big|_{n+1} = W^{(l)[ker=d]}_{cts}\big|_{n} - \eta\,\frac{\partial Loss}{\partial W^{(l)[ker=d]}_{cts}}$ of the $n$-th epoch induces, by the linearity of Equation (12),
$$\Delta T^{(l,uv)}_{dc} = T^{(l,uv)}_{dc}\big|_{n+1} - T^{(l,uv)}_{dc}\big|_{n} = -\eta\sum_{t=0}^{K-1}\sum_{s=0}^{K-1}\frac{\partial Loss}{\partial W^{(l)[ker=d]}_{cts}}e^{i(\frac{ut}{M}+\frac{vs}{N})2\pi} = -\eta MN\,\frac{\partial Loss}{\partial T^{(l,uv)}_{dc}}.$$
Therefore, any gradient step on $W^{(l)[ker=d]}_{cts}$ equals an $MN$-scaled step on $T^{(l,uv)}_{dc}$. Plugging Equation (17) into this relation yields Equation (9). □

A.5 PROOFS OF ASSUMPTION 2 AND THEOREM 3

In this section, we prove Assumption 2 and Theorem 3 in the main paper.

Assumption 2. We assume that all elements in $T^{(l,uv)}$ are irrelevant to each other, and that $\forall l\neq l'$, elements in $T^{(l,uv)}$ and $T^{(l',uv)}$ are irrelevant to each other in early epochs.
$$\forall d\neq d',\ \forall c\neq c':\quad \mathbb{E}_{W^{(l)}}\big[T^{(l,uv)}_{dc}\,T^{(l,uv)}_{d'c'}\big] = \mathbb{E}_{W^{(l)}}\big[T^{(l,uv)}_{dc}\big]\,\mathbb{E}_{W^{(l)}}\big[T^{(l,uv)}_{d'c'}\big]$$
$$\forall l,d,c,d',c':\quad \mathbb{E}_{W^{(l)},\ldots,W^{(1)}}\big[T^{(l,uv)}_{dc}\,T^{(uv)(l-1:1)}_{d'c'}\big] = \mathbb{E}_{W^{(l)}}\big[T^{(l,uv)}_{dc}\big]\,\mathbb{E}_{W^{(l-1)},\ldots,W^{(1)}}\big[T^{(uv)(l-1:1)}_{d'c'}\big]$$
Besides, according to experimental experience, the mean value of all parameters in $W^{(l)}$ usually has a small bias during the training process, instead of being exactly zero. Therefore, let us assume that in early epochs, each parameter in $W^{(l)}$ is sampled from a Gaussian distribution $\mathcal{N}(\mu_l, \sigma_l^2)$.

Proof. Given an initialized, cascaded convolutional decoder network with $L$ convolutional layers, let us focus on the behavior of the decoder network in early epochs of training. We notice that each element $T^{(l,uv)}_{dc}$ is exclusively determined by the $c$-th channel of the $d$-th kernel, $W^{(l)[ker=d]}_{c,1:K,1:K}\in\mathbb{R}^{K\times K}$, by the definition of $T^{(l,uv)}$. Because parameters in $W^{(l)}$ of the initialized decoder network are set to random noises, we can consider all elements in $T^{(l,uv)}$ irrelevant to each other, i.e., $\forall d\neq d',\ c\neq c'$, $T^{(l,uv)}_{dc}$ is irrelevant to $T^{(l,uv)}_{d'c'}$. Similarly, since different layers' parameters $W^{(l)}$ are irrelevant to each other in the initialized decoder network, elements in different layers' $T^{(l,uv)}$ are irrelevant to each other, i.e., $\forall l\neq l'$, elements in $T^{(l,uv)}$ and elements in $T^{(l',uv)}$ are irrelevant to each other. Moreover, since the early training of a DNN mainly modifies a few parameters according to the lottery ticket hypothesis (Frankle & Carbin, 2018), we can still assume such irrelevant relationships in early epochs. □

Then, we prove Theorem 3.

Theorem 3. Based on Assumption 1 and Assumption 2, we can prove that $T^{(l,uv)}_{dc}$ follows a Gaussian distribution of complex numbers:
$$\forall d,c,\quad T^{(l,uv)}_{dc} \sim \mathcal{CN}\big(\tilde\mu = \mu_l R_{uv},\ \tilde\sigma^2 = K^2\sigma_l^2,\ \tilde r = \sigma_l^2 R_{2u,2v}\big),\quad \text{s.t. } R_{uv} = \frac{\sin(uK\pi/M)}{\sin(u\pi/M)}\,\frac{\sin(vK\pi/N)}{\sin(v\pi/N)}\,e^{i(\frac{(K-1)u}{M}+\frac{(K-1)v}{N})\pi}$$

Proof. According to Assumption 2, each convolutional weight follows a Gaussian distribution, $W^{(l)[ker=d]}_{cts}\sim\mathcal{N}(\mu_l,\sigma_l^2)$. For convenience, we extend $W^{(l)[ker=d]}_{cts}$ to a complex number; it then follows a Gaussian distribution of complex numbers, $W^{(l)[ker=d]}_{cts}\sim\mathcal{CN}(\mu_l,\sigma_l^2,0)$. Previous studies (Tse & Viswanath, 2005) proved that a linear combination of complex Gaussian variables also follows a Gaussian distribution of complex numbers. Since $T^{(l,uv)}_{dc} = \sum_{t=0}^{K-1}\sum_{s=0}^{K-1}W^{(l)[ker=d]}_{cts}e^{i(\frac{ut}{M}+\frac{vs}{N})2\pi}$ is such a linear combination, it suffices to compute its three parameters. Let $R_{uv} = \sum_{t=0}^{K-1}\sum_{s=0}^{K-1}e^{i(\frac{ut}{M}+\frac{vs}{N})2\pi}$. Then,
$$\tilde\mu = \mathbb{E}\big[T^{(l,uv)}_{dc}\big] = \sum_{t,s}\mathbb{E}\big[W^{(l)[ker=d]}_{cts}\big]e^{i(\frac{ut}{M}+\frac{vs}{N})2\pi} = \mu_l R_{uv};$$
$$\tilde\sigma^2 = \mathrm{Var}\big[T^{(l,uv)}_{dc}\big] = \sum_{t,s}\mathrm{Var}\big[W^{(l)[ker=d]}_{cts}\big]\,\big|e^{i(\frac{ut}{M}+\frac{vs}{N})2\pi}\big|^2 = K^2\sigma_l^2;\qquad \text{//}\ \mathrm{Var}[aX] = |a|^2\mathrm{Var}[X]$$
$$\tilde r = C\big[T^{(l,uv)}_{dc}\big] = \sum_{t,s}C\big[W^{(l)[ker=d]}_{cts}\big]\,e^{i(\frac{2ut}{M}+\frac{2vs}{N})2\pi} = \sigma_l^2 R_{2u,2v},\qquad \text{//}\ C[aX] = a^2C[X]$$
where $C[X] = \mathbb{E}\big[(X-\mathbb{E}[X])(X-\mathbb{E}[X])\big]$ denotes the pseudo-variance. Finally, according to Equation (1), $R_{uv} = \frac{\sin(\frac{Ku}{M}\pi)}{\sin(\frac{u}{M}\pi)}\cdot\frac{\sin(\frac{Kv}{N}\pi)}{\sin(\frac{v}{N}\pi)}\cdot e^{i(\frac{(K-1)u}{M}+\frac{(K-1)v}{N})\pi}$, which proves the theorem. □

A.6 PROOF OF THEOREM 4

$$\mathbb{E}\big[T^{(uv)(L:1)}\big] = \mathbb{E}\big[T^{(L,uv)}T^{(uv)(L-1:1)}\big] = \Big(C_{L-1}\,\mathbb{E}\big[T^{(L,uv)}_{dc}\big]\,\mathbb{E}\big[T^{(uv)(L-1:1)}_{dc}\big]\Big)\mathbf{1}_{C_L\times C_0} = \Big(\frac{1}{C_L}\prod_{l=1}^{L}C_l\,\mu_l R_{uv}\Big)\mathbf{1}_{C_L\times C_0},$$
according to Assumption 2 and Theorem 3 applied recursively. Then, we have
$$\mathrm{SOM}\big(T^{(uv)(L:1)}\big) = \mathbb{E}\big[|T^{(uv)(L:1)}|^2\big] = \Big(C_{L-1}\,\mathrm{SOM}\big(T^{(L,uv)}_{dc}\big)\,\mathrm{SOM}\big(T^{(uv)(L-1:1)}_{dc}\big) + C_{L-1}(C_{L-1}-1)\,\big|\mathbb{E}\big[T^{(L,uv)}_{dc}\big]\,\mathbb{E}\big[T^{(uv)(L-1:1)}_{dc}\big]\big|^2\Big)\mathbf{1}_{C_L\times C_0}$$
$$= \Bigg[\frac{1}{C_L}\prod_{l=1}^{L}C_l\big(|\mu_l R_{uv}|^2 + (K\sigma_l)^2\big) + \sum_{l=2}^{L}\frac{C_{l-1}-1}{C_{l-1}}\Big|\frac{1}{C_l}\prod_{k=1}^{l}C_k\,\mu_k R_{uv}\Big|^2\prod_{j=l+1}^{L}C_{j-1}\big(|\mu_j R_{uv}|^2 + (K\sigma_j)^2\big)\Bigg]\mathbf{1}_{C_L\times C_0}.$$
Therefore, we prove that for the more general case $\forall l, C_l > 1$, the second-order moment $\mathrm{SOM}(T^{(uv)(L:1)})$ also increases approximately exponentially along with the depth of the network.

A.7 PROOF OF THEOREM 5

In this section, we prove Theorem 5 in the main paper.

Theorem 5. Let each element in the $c$-th channel $F^{(c)}$ of the feature map follow the Gaussian distribution $\mathcal{N}(a,\sigma^2)$. $G^{(c)}\in\mathbb{C}^{M\times N}$ denotes the frequency spectrum of $F^{(c)}$, and $H^{(c)}\in\mathbb{C}^{M'\times N'}$ denotes the frequency spectrum of the output feature $F'^{(c)}$ after applying zero-padding on $F^{(c)}$.
Then, the zero-padding on $F^{(c)}$ brings in additional signals at each frequency $[u,v]$, whose strength is measured by averaging over different sampled features, as follows.
$$\forall 0\le u< M,\ 0\le v< N,\quad \mathbb{E}_{F^{(c)}}\big[|H^{(c)}_{uv}-G^{(c)}_{uv}|\big] = |a|\,\bigg|\frac{\sin(Mu\pi/M')}{\sin(u\pi/M')}\,\frac{\sin(Nv\pi/N')}{\sin(v\pi/N')} - MN\,\delta_{uv}\bigg|;$$
$$\forall M\le u< M',\ N\le v< N',\quad \mathbb{E}_{F^{(c)}}\big[H^{(c)}_{uv}\big] = a\,\frac{\sin(Mu\pi/M')}{\sin(u\pi/M')}\,\frac{\sin(Nv\pi/N')}{\sin(v\pi/N')}\,e^{-i(\frac{(M-1)u}{M'}+\frac{(N-1)v}{N'})\pi}$$

Proof. For the spectrum of the unpadded feature, since $F^{(c)}_{mn}\sim\mathcal{N}(a,\sigma^2)$,
$$\mathbb{E}_{F^{(c)}}\big[G^{(c)}_{uv}\big] = \sum_{m=0}^{M-1}\sum_{n=0}^{N-1}\mathbb{E}\big[F^{(c)}_{mn}\big]e^{-i(\frac{um}{M}+\frac{vn}{N})2\pi} = a\sum_{m=0}^{M-1}\sum_{n=0}^{N-1}e^{-i(\frac{um}{M}+\frac{vn}{N})2\pi} = a\,MN\,\delta_{uv},\quad 0\le u<M,\ 0\le v<N, \tag{24}$$
according to Equation (3). For the spectrum of the zero-padded feature, the padded entries contribute nothing, so
$$\mathbb{E}_{F^{(c)}}\big[H^{(c)}_{uv}\big] = \sum_{m=0}^{M-1}\sum_{n=0}^{N-1}\mathbb{E}\big[F^{(c)}_{mn}\big]e^{-i(\frac{um}{M'}+\frac{vn}{N'})2\pi} = a\sum_{m=0}^{M-1}e^{-i\frac{um}{M'}2\pi}\sum_{n=0}^{N-1}e^{-i\frac{vn}{N'}2\pi} = a\,\frac{\sin(Mu\pi/M')}{\sin(u\pi/M')}\,\frac{\sin(Nv\pi/N')}{\sin(v\pi/N')}\,e^{-i(\frac{(M-1)u}{M'}+\frac{(N-1)v}{N'})\pi},$$
according to Equation (1). Subtracting Equation (24) from this expression yields the first claim. □

A.8 PROOF OF THEOREM 6

In this section, we prove Theorem 6 in the main paper.

Theorem 6. Let $G = [G^{(1)}, G^{(2)}, \ldots, G^{(C_l)}]\in\mathbb{C}^{C_l\times M_0\times N_0}$ denote spectrums of the $C_l$ channels of the feature $F$. Then, spectrums $H = [H^{(1)}, H^{(2)}, \ldots, H^{(C_l)}]\in\mathbb{C}^{C_l\times M\times N}$ of the upsampled output feature $F'$ can be computed as follows.
$$\forall c,u,v,\quad H^{(c)}_{u+(s-1)M_0,\,v+(t-1)N_0} = G^{(c)}_{uv},\quad \text{s.t. } s = 1,\ldots,M/M_0;\ t = 1,\ldots,N/N_0$$

Proof. By the DFT (Equation (11)),
$$G^{(c)}_{uv} = \sum_{m=0}^{M_0-1}\sum_{n=0}^{N_0-1}F^{(c)}_{mn}e^{-i(\frac{um}{M_0}+\frac{vn}{N_0})2\pi}. \tag{27}$$
The upsampled feature $F'$ is nonzero only at positions that are multiples of the upsampling ratio, where $M = M_0\cdot ratio$ and $N = N_0\cdot ratio$. Therefore,
$$H^{(c)}_{u+(s-1)M_0,\,v+(t-1)N_0} = \sum_{m=0}^{M_0-1}\sum_{n=0}^{N_0-1}F^{(c)}_{mn}e^{-i\big(\frac{(u+(s-1)M_0)(m\cdot ratio)}{M}+\frac{(v+(t-1)N_0)(n\cdot ratio)}{N}\big)2\pi} = \sum_{m=0}^{M_0-1}\sum_{n=0}^{N_0-1}F^{(c)}_{mn}e^{-i(\frac{um}{M_0}+\frac{vn}{N_0})2\pi}\,e^{-i((s-1)m+(t-1)n)2\pi} = G^{(c)}_{uv},$$
since $s,t\in\mathbb{Z}$ makes the second exponential equal to $1$. Therefore, we prove that $\forall c,u,v$, $H^{(c)}_{u+(s-1)M_0,\,v+(t-1)N_0} = G^{(c)}_{uv}$, s.t. $s = 1,\ldots,M/M_0$; $t = 1,\ldots,N/N_0$. □
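Theorem 6 can be confirmed with a zero-insertion upsampling in NumPy: the output spectrum tiles the input spectrum exactly (illustrative sizes, single channel):

```python
import numpy as np

# Zero-insertion upsampling by an integer ratio: F' is nonzero only at
# positions that are multiples of the ratio. Per Theorem 6, the spectrum
# of F' then satisfies H_{u + (s-1)M0, v + (t-1)N0} = G_{uv}.
rng = np.random.default_rng(2)
M0 = N0 = 4
ratio = 3
F = rng.normal(size=(M0, N0))

up = np.zeros((M0 * ratio, N0 * ratio))
up[::ratio, ::ratio] = F          # insert zeros between samples

G = np.fft.fft2(F)                # spectrum of the low-resolution feature
H = np.fft.fft2(up)               # spectrum of the upsampled feature
```

Every `M0 x N0` block of `H` is an exact copy of `G`, so a strong fundamental frequency in `G` reappears at every frequency that is a multiple of `M0`/`N0`, which is the repetition pattern seen in Figure 1(b).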

B RELATED WORK

Although few previous studies directly proved a DNN's bottleneck from the perspective of representing specific feature components, we survey research on the representation capacity of DNNs. Some studies focused on a specific notion of frequency that took the landscape of the loss function over all input samples as the time domain (Xu et al., 2019b; Rahaman et al., 2019; Xu et al., 2019a; Luo et al., 2019). Based on this notion of frequency, they observed and proved a phenomenon named the Frequency Principle (F-Principle): a DNN first quickly learns low-frequency components, and then relatively slowly learns high-frequency ones, which may shed new light on understanding the representation capacity of a DNN. For example, Lin et al. (2019) empirically proposed to smooth out high-frequency components to improve adversarial robustness. Besides, Ma et al. (2020) explored the boundary of the F-Principle, beyond which the F-Principle no longer held. In comparison, we focus on a fully different type of frequency, i.e., the frequency w.r.t. the DFT on an input image or a feature map. In this direction, previous studies mainly analyzed, experimentally, the relationship between the learning of different frequencies and the robustness of a DNN. Yin et al. (2019) conducted extensive experiments to analyze the robustness of a DNN w.r.t. different frequencies of the image, and discovered that both adversarial training and Gaussian data augmentation improved the DNN's robustness to higher frequencies. Wang et al. (2020) empirically proposed to remove high-frequency components of convolutional weights to improve adversarial robustness. In comparison, we theoretically prove representation bottlenecks of DNNs in the frequency domain. Besides, many studies explained the representation capacity of a DNN in the time domain.
The information bottleneck hypothesis shows that the learning process of DNNs retains task-relevant input information and discards task-irrelevant input information (Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017; Wolchover & Reading, 2017; Amjad & Geiger, 2019). The lottery ticket hypothesis shows that some initial parameters of DNNs inherently contribute more to the network output (Frankle & Carbin, 2018). The double-descent phenomenon describes a training process in which the loss first declines, then rises, and then declines again (Nakkiran et al., 2019; Reinhard & Fatih, 2020). Batch normalization may sometimes conflict with weight decay (Van Laarhoven, 2017; Li et al., 2020). DNNs are vulnerable to adversarial examples (Szegedy et al., 2013; Goodfellow et al., 2014). DNNs typically encode simple interactions between very few input variables and complex interactions among almost all input variables, but have difficulty encoding interactions among an intermediate number of input variables (Deng et al., 2022).

C MORE EXPERIMENTAL RESULTS

C.1 VERIFYING THAT A NEURAL NETWORK USUALLY LEARNED LOW-FREQUENCY COMPONENTS FIRST. In this section, we provide more experimental results to verify that a neural network usually learned low-frequency components first, which has already been shown in Figure 1(a) in the main paper. Here, we also constructed a cascaded convolutional auto-encoder by using the VGG-16 as the encoder network. The decoder network contained three upconvolutional layers for the CIFAR-10 dataset, and three upconvolutional layers for the Broden dataset. Each convolutional/upconvolutional layer in the auto-encoder applied zero-paddings and was followed by a batch normalization layer and a ReLU layer. The auto-encoder was trained using the mean squared error (MSE) loss for image reconstruction. Results in Figure 6 verified that the auto-encoder usually learned low-frequency components first and gradually learned higher frequencies. We also attached the generated image below its spectrum map in Figure 7, in order to help readers understand the learning process of the auto-encoder.

C.2 VERIFYING THAT THE UPSAMPLING OPERATION MADE A DECODER NETWORK REPEAT STRONG SIGNALS AT CERTAIN FREQUENCIES OF THE GENERATED IMAGE.

In this section, we provide more experimental results to verify that the upsampling operation in the decoder repeats strong frequency components of the input to generate spectrums of upper layers. First, we conducted experiments to verify Theorem 6 in the main paper, which claims that the upsampling operation repeats the strong magnitude of the fundamental frequency G^(c)_{00} of the lower layer to different frequency components ∀c, H^(c)_{u*v*} of the higher layer, where u* = 0, M_0, 2M_0, 3M_0, ...; v* = 0, N_0, 2N_0, 3N_0, .... To verify this, given an image, we let the image pass through four cascaded upsampling layers. We visualized the feature spectrum generated by each upsampling layer, in order to verify whether the upsampling operation repeated the strong magnitude of the fundamental frequency of the input image to different frequency components of the feature spectrum generated by the upsampling layers. Results on the CIFAR-10 dataset and the Tiny-ImageNet dataset in Figure 8 verified Theorem 6. Second, we provide more results on real neural networks, which have already been shown in Figure 1(b) in the main paper. We also constructed a cascaded convolutional auto-encoder by using the VGG-16 as the encoder network. The decoder network contained four upconvolutional layers. Each convolutional/upconvolutional layer in the auto-encoder applied zero-paddings and was followed by a batch normalization layer and a ReLU layer.

C.3 VERIFYING THAT THE ZERO-PADDING OPERATION STRENGTHENED THE ENCODING OF LOW-FREQUENCY COMPONENTS.

In this section, we provide more experimental results to verify that the zero-padding operation strengthened the encoding of low-frequency components, which had already been shown in Figure 5(c) in the main paper. Here, we also constructed the following two baseline networks. The first baseline network contained 5 convolutional layers, and each layer applied zero-paddings. Each convolutional layer contained 16 convolutional kernels (kernel size was 7×7), except for the last layer containing 3 convolutional kernels.
The second baseline network was constructed by replacing all zero-padding operations with circular padding operations. Results in Figure 10 verified that the zero-padding operation strengthened the encoding of low-frequency components.

C.4 VERIFYING THAT A DEEP NETWORK STRENGTHENED LOW-FREQUENCY COMPONENTS.

In this section, we provide more experimental results to verify that a deep network strengthened low-frequency components, which had already been shown in Figure 5(a) in the main paper. Here, we also constructed a network with 50 convolutional layers. Each convolutional layer applied zero-paddings to avoid changing the size of feature maps, and was followed by a ReLU layer. We visualized feature spectrums of different convolutional layers. Results on the CIFAR-10 dataset and the Tiny-ImageNet dataset in Figure 11 show that magnitudes of low-frequency components increased along with the network layer number.

C.5 VERIFYING THAT A LARGER ABSOLUTE MEAN VALUE µ_l OF EACH l-TH LAYER'S PARAMETERS STRENGTHENED LOW-FREQUENCY COMPONENTS.

In this section, we provide more experimental results to verify that a larger absolute mean value µ_l of each l-th layer's parameters strengthened low-frequency components, which had already been shown in Figure 5(b) in the main paper. Here, we also applied a network architecture with 5 convolutional layers. Each layer contained 16 convolutional kernels (kernel size was 9×9), except for the last layer containing 3 convolutional kernels. Based on this architecture, we constructed three networks, whose parameters were sampled from Gaussian distributions N(µ = 0, σ² = 0.01²), N(µ = 0.001, σ² = 0.01²), and N(µ = 0.01, σ² = 0.01²), respectively. Results on the CIFAR-10 dataset and the Tiny-ImageNet dataset in Figure 12 show that magnitudes of low-frequency components increased along with the absolute mean value of parameters.

Figure 12: A network whose convolutional weights had a mean value significantly biased from 0 usually strengthened low-frequency components, but weakened high-frequency components. Here, each magnitude map of the feature spectrum was averaged over all channels. For clarity, we moved low frequencies to the center of the spectrum map, moved high frequencies to corners of the spectrum map, and set the magnitude of the fundamental frequency to be the same as that of the frequency with the second-largest magnitude. For results in (b), we only visualized components in the center of the spectrum map with the range of relatively low frequencies u ∈ {u | 0 ≤ u < M/6} ∪ {u | 5M/6 ≤ u < M} and v ∈ {v | 0 ≤ v < N/6} ∪ {v | 5N/6 ≤ v < N}.



Here, the decoder represents a typical network whose feature map size is non-decreasing during the forward propagation. All proofs are provided in Appendix A. We have described additional experimental details in Appendix C, including various model architectures and benchmark datasets, which ensures the reproducibility. We will release all the codes and datasets when this paper is accepted.



Figure 1: Two representation bottlenecks of a cascaded convolutional decoder network. (a) The convolution operation and the zero-padding operation make the decoder usually learn low-frequency components first and then gradually learn higher frequencies. (b) For cascaded upconvolutional layers, the upsampling operation in the decoder repeats strong frequency components of the input to generate spectrums of upper layers. We visualize the magnitude map of the feature spectrum, which is averaged over all channels. For clarity, we move low frequencies to the center of the spectrum map, and move high frequencies to corners of the spectrum map. High-frequency components in the magnitude maps in (b) are also weakened by the convolution operation after upsampling.

Figure 2: (a) Forward propagation in the frequency domain and (b) forward propagation in the time domain. The convolution operation on an input feature F is essentially equivalent to matrix multiplication on spectrums G of the feature.

Figure 3: (a) Fitness between the derived feature spectrums H in Corollary 1 and the real feature spectrums H* measured in a real DNN. (b) Fitness between the derived change of T^(l,uv) in Corollary 2 and the real T^(l,uv) measured in a real DNN. The shaded area represents the standard deviation.

Figure 4: (a) The exponential increase of the second-order moment of feature spectrums, SOM(h^(uv)) (i.e., log SOM(h^(uv)) increases linearly along with the layer number). (b) A small kernel size K usually made the network learn a higher proportion p_low of low-frequency components.


Figure 5: (a) A higher layer of a network usually generated features with more low-frequency components, but with fewer high-frequency components. (b) A network whose convolutional weights had a mean value significantly biased from 0 usually strengthened low-frequency components, but weakened high-frequency components. (c) A network with zero-padding operations usually strengthened more low-frequency components than a network with circular padding operations. Here, each magnitude map of the feature spectrum was averaged over all channels. For clarity, we moved low frequencies to the center of the spectrum map, and moved high frequencies to corners of the spectrum map. Besides, we only visualized components in the center of the spectrum map with the range of relatively low frequencies u ∈ {u | 0 ≤ u < M/8} ∪ {u | 7M/8 ≤ u < M} and v ∈ {v | 0 ≤ v < N/8} ∪ {v | 7N/8 ≤ v < N}.

Results show that the network with a small kernel size encoded more low-frequency components.

Based on Equation (11) and the derivation rule for complex numbers (Kreutz-Delgado, 2009), we can obtain the mathematical relationship for ∂Loss/∂G^[ker=d]_{cts}, as follows. Note that when we use gradient descent to optimize a real-valued loss function Loss with complex variables, people usually treat the real and imaginary parts, a ∈ ℝ and b ∈ ℝ, of a complex variable z = a + bi as two separate real-valued variables, and separately update these two real-valued variables. In this way, the exact optimization step of z computed based on such a technique is equivalent to ∂Loss/∂z. Since F


T^(l,uv)_{dc} also follows a Gaussian distribution of complex numbers, as follows: ∀d, c, T^(l,uv)_{dc} ∼ ComplexN(μ, σ², r).

According to Assumption 2 and Equation (21), we further assume that ∀d ≠ d′; c ≠ c′, E[T^(uv)(l:1)_{dc}

Figure 6: Magnitude maps of feature spectrums of the network output at different epochs. Each magnitude map was averaged over all channels. For clarity, we moved low frequencies to the center of the spectrum map, and moved high frequencies to corners of the spectrum map. Note that we set the magnitude of the fundamental frequency to be the same as that of the frequency with the second-largest magnitude. For results in (b), we only visualized components in the center of the spectrum map with the range of relatively low frequencies u ∈ {u | 0 ≤ u < M/8} ∪ {u | 7M/8 ≤ u < M} and v ∈ {v | 0 ≤ v < N/8} ∪ {v | 7N/8 ≤ v < N}.


Figure 9: Magnitude maps of feature spectrums after one/two/three/four/five/six upsampling layers. Each magnitude map was averaged over all channels. For clarity, we moved low frequencies to the center of the spectrum map, and moved high frequencies to corners of the spectrum map.
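The replication pattern in Figure 9 can be verified exactly for the simplest upsampling variant. The sketch below (a stand-alone illustrative check, not the paper's experiment code) uses zero-insertion upsampling of an M×N input, for which the DFT of the output is precisely the periodic replication of the input spectrum, H(u, v) = G(u mod M, v mod N); in particular, the fundamental frequency G_{00} reappears at u* = 0, M, 2M, ...; v* = 0, N, 2N, .... Nearest-neighbour upsampling additionally convolves with an all-ones kernel, which attenuates but does not remove these replicas.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 8
x = rng.standard_normal((M, N))

# Zero-insertion upsampling by a factor of 2 in each dimension:
# y[2i, 2j] = x[i, j], and all other entries are 0.
y = np.zeros((2 * M, 2 * N))
y[::2, ::2] = x

G = np.fft.fft2(x)   # spectrum of the lower layer's feature
H = np.fft.fft2(y)   # spectrum after upsampling

# The output spectrum is an exact periodic replication of the input
# spectrum: H[u, v] = G[u % M, v % N].
u = np.arange(2 * M)[:, None]
v = np.arange(2 * N)[None, :]
assert np.allclose(H, G[u % M, v % N])
# The fundamental frequency G[0, 0] is repeated at (0, 0), (M, 0), (0, N), (M, N).
assert np.isclose(H[M, N], G[0, 0])
```

This is why the spectrum maps generated by cascaded upsampling layers exhibit the grid pattern of strong signals described in the main paper.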

Figure 10: A network with zero-padding operations usually strengthened more low-frequency components than a network with circular padding operations. Here, each magnitude map of the feature spectrum was averaged over all channels. For clarity, we moved low frequencies to the center of the spectrum map, moved high frequencies to corners of the spectrum map, and set the magnitude of the fundamental frequency to be the same as that of the frequency with the second-largest magnitude.
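The comparison in Figure 10 has a clean frequency-domain reading: with circular padding, a convolutional layer performs a circular convolution, so by the convolution theorem every frequency component of the input spectrum is scaled independently, exactly as in the forward-propagation rule derived in the main paper; zero padding breaks this diagonal structure and mixes frequencies, which is where the extra strengthening of low-frequency components arises. A minimal numpy check of the circular case (illustrative only; `circular_conv` is a hypothetical helper):

```python
import numpy as np

rng = np.random.default_rng(1)
M = N = 8
x = rng.standard_normal((M, N))
w = rng.standard_normal((3, 3))

def circular_conv(x, w):
    """Circular ("wrap"-padded) convolution of x with a small kernel w."""
    out = np.zeros_like(x)
    for p in range(w.shape[0]):
        for q in range(w.shape[1]):
            # np.roll implements the wrap-around indexing x[(m-p) % M, (n-q) % N].
            out += w[p, q] * np.roll(np.roll(x, p, axis=0), q, axis=1)
    return out

# Zero-pad the kernel to the image size; its DFT gives the per-frequency gain.
w_pad = np.zeros((M, N))
w_pad[:3, :3] = w

# Convolution theorem: each frequency (u, v) is scaled independently by
# the kernel's spectrum, with no mixing across frequencies.
lhs = np.fft.fft2(circular_conv(x, w))
rhs = np.fft.fft2(w_pad) * np.fft.fft2(x)
assert np.allclose(lhs, rhs)
```

No analogous exact per-frequency identity holds once the circular padding is replaced by zero padding, which is consistent with the difference observed in Figure 10.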

Figure 11: Comparing feature spectrums of different layers. Results show that higher layers of a network usually generated features with more low-frequency components. For clarity, we moved low frequencies to the center of the spectrum map, moved high frequencies to corners of the spectrum map, and set the magnitude of the fundamental frequency to be the same as that of the frequency with the second-largest magnitude. For results in (b), we only visualized components in the center of the spectrum map with the range of relatively low frequencies u ∈ {u | 0 ≤ u < M/6} ∪ {u | 5M/6 ≤ u < M} and v ∈ {v | 0 ≤ v < N/6} ∪ {v | 5N/6 ≤ v < N}.


A.6 PROOF OF THEOREM 4

Theorem 4. (proven in Appendix A.5) Based on Assumption 1 and Assumption 2, we can prove that T^(l,uv)_{dc} follows a Gaussian distribution of complex numbers, as follows.

Proof. According to Theorem 4, ∀d, c, l: ... Then, we have ...

For the more general case that each convolutional kernel contains more than one channel, i.e., ∀l, C_l > 1, SOM(T^(uv)(L:1)) also increases approximately exponentially along with the depth of the network, with a rather complicated analytic solution, as proved below. Note that the following proof is based on Assumption 2. Besides, we further assume that all elements in T^(uv)(l:1) are independent of each other, i.e., ∀d ...

Proof. According to Theorem 4, all elements in T^(l,uv) follow the same Gaussian distribution. Therefore, we have ... and we have ... Let us first consider the expectation of T^(uv)(L:1), as follows. ... Therefore, we prove that ...
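For the single-channel case (∀l, C_l = 1), the exponential growth of the second-order moment can be sanity-checked numerically: T^(uv)(L:1) is a product of independent complex Gaussian factors, so E[|T^(uv)(L:1)|²] = Π_l E[|T^(l,uv)|²] grows (or decays) exponentially with the depth L. A Monte Carlo sketch under this independence assumption (illustrative only; the zero-mean circularly-symmetric parameterization, σ = 0.8, and L = 3 are arbitrary choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
L, n, s = 3, 1_000_000, 0.8     # depth, Monte Carlo samples, std of each part

# Each layer contributes an independent complex Gaussian factor T^(l,uv):
# real and imaginary parts are i.i.d. N(0, s^2), so E|T|^2 = 2 s^2.
T = s * (rng.standard_normal((L, n)) + 1j * rng.standard_normal((L, n)))
prod = T.prod(axis=0)            # T^{(uv)(L:1)} for the C_l = 1 case

som_empirical = np.mean(np.abs(prod) ** 2)
som_theory = (2 * s ** 2) ** L   # SOM multiplies over depth: exponential in L

assert abs(som_empirical / som_theory - 1) < 0.05
```

Since SOM multiplies across layers, log SOM grows linearly with L, which matches the linear increase of log SOM(h^(uv)) along with the layer number reported in Figure 4(a).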


Figure 7: Magnitude maps of feature spectrums and the corresponding generated images at different epochs. Results show that in the first few epochs of training, the network removed, to some extent, the noisy signals caused by the upsampling, which appeared as a grid pattern in the spectrum. After that, the network learned low-frequency components first, and then gradually learned higher frequencies. Each magnitude map in this figure was averaged over all channels. For clarity, we moved low frequencies to the center of the spectrum map, and moved high frequencies to corners of the spectrum map. Note that we set the magnitude of the fundamental frequency to be the same as that of the frequency with the second-largest magnitude.

