EXPLAINING REPRESENTATION BOTTLENECKS OF CONVOLUTIONAL DECODER NETWORKS

Abstract

In this paper, we prove representation bottlenecks of a cascaded convolutional decoder network¹, considering its capacity to represent different frequency components of an input sample. We conduct the discrete Fourier transform on each channel of the feature map in an intermediate layer of the decoder network. Then, we introduce the rule for the forward propagation of such intermediate-layer spectrum maps, which is equivalent to the forward propagation of feature maps through a convolutional layer. Based on this, we find that each frequency component in the spectrum map is forward propagated independently of the other frequency components. Furthermore, we prove two bottlenecks in representing feature spectrums. First, we prove that the convolution operation, the zero-padding operation, and a set of other settings all make a convolutional decoder network more likely to weaken high-frequency components. Second, we prove that the upsampling operation generates a feature spectrum in which strong signals repeatedly appear at certain frequencies. We will release all code when this paper is accepted.

1. INTRODUCTION

Deep neural networks (DNNs) have exhibited superior performance in many tasks. However, in recent years, many studies have discovered theoretical defects of DNNs, e.g., the vulnerability to adversarial attacks (Goodfellow et al., 2014) and the difficulty of learning interactions of intermediate complexity (Deng et al., 2022). Besides, other studies have explained typical phenomena during the training of DNNs, e.g., the double-descent phenomenon (Nakkiran et al., 2019), the information bottleneck hypothesis (Tishby & Zaslavsky, 2015), and the lottery ticket hypothesis (Frankle & Carbin, 2018). In comparison, in this study, we propose a new perspective to investigate how a cascaded convolutional decoder¹ network represents features at different frequencies. I.e., when we apply the discrete Fourier transform (DFT) to each channel of the feature map or the input sample, we try to prove which frequency components of each input channel are usually strengthened or weakened by the network. Previous studies (Xu et al., 2019a; Rahaman et al., 2019) claimed that DNNs were less likely to encode high-frequency components. However, these studies focused on a specific notion of frequency that took the landscape of the loss function over all input samples as the time domain. In comparison, we focus on a completely different type of frequency, i.e., the frequency w.r.t. the DFT on an input image or a feature map.

• Reformulating forward propagation in the frequency domain. As the basis for the subsequent theoretical proofs, we discover that the traditional forward propagation of feature maps can be reformulated as a new forward propagation on the feature spectrum. We derive the rule that forward propagates spectrums of different channels through a cascaded convolutional network, which is mathematically equivalent to the forward propagation of feature maps through the network.

• Based on this reformulation of the forward propagation, we prove the following conclusions.
(1) The layerwise forward propagation of each frequency component of the spectrum map is independent of the other frequency components. In the forward propagation process, each frequency component of the feature spectrum is propagated independently of the other frequency components, as long as the convolution operation does not change the size of the feature map in each channel. In this way, we analyze three classic operations, namely the convolution, the zero-padding, and the upsampling operations, and prove two representation bottlenecks, as follows.

(2) Representation bottleneck 1. We prove that both the convolution operation and the zero-padding operation make a cascaded convolutional decoder network more likely to weaken the high-frequency components of the input sample, as shown in Figure 1(a), provided that the convolution operation with a padding operation does not change the size of the feature map in a channel. Besides, we also prove that the following three conditions further strengthen the above representation bottleneck: (1) a deep network architecture; (2) a small convolutional kernel size; and (3) a large absolute mean value of the convolutional weights.

(3) Representation bottleneck 2. We prove that the upsampling operation makes a cascaded convolutional decoder network generate a feature spectrum in which strong signals repeatedly appear at certain frequencies, as shown in Figure 1(b).

Note that all the above findings explain general tendencies of neural networks with convolution, zero-padding, and upsampling operations, instead of deriving deterministic properties of a specific network. Besides, we have not derived the property of max-pooling operations, so it is difficult to extend these findings to neural networks for image classification.
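The spectrum-replication effect in representation bottleneck 2 can be illustrated with a small numerical sketch (not the paper's released code; NumPy is used, and zero-insertion upsampling by a factor of 2 is assumed): the spectrum of the upsampled map is exactly the input spectrum tiled 2 × 2 times, so strong frequency components of the input reappear at several frequencies of the output.

```python
import numpy as np

# Illustrative sketch (assumption: zero-insertion upsampling by a factor of 2).
M, N = 4, 4
rng = np.random.default_rng(2)
F = rng.standard_normal((M, N))  # one input channel

# Zero-insertion upsampling: place F on the even grid positions, zeros elsewhere.
U = np.zeros((2 * M, 2 * N))
U[::2, ::2] = F

G_up = np.fft.fft2(U)            # spectrum of the upsampled feature map
G_tiled = np.tile(np.fft.fft2(F), (2, 2))  # input spectrum repeated 2 x 2 times
assert np.allclose(G_up, G_tiled)
```

Because DFT(U)[u, v] = DFT(F)[u mod M, v mod N], every strong component of the input spectrum is replicated at four frequencies of the output spectrum, matching Figure 1(b).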

2. RULES OF PROPAGATING FEATURE SPECTRUMS

In this section, we aim to reformulate the forward propagation of a cascaded convolutional decoder¹ network in the frequency domain. To this end, we first introduce the rule by which a convolutional layer propagates the feature spectrum from a lower layer to an upper layer.

• Convolution operation. Given a convolutional layer, let W^[ker=1], W^[ker=2], ..., W^[ker=D] denote the D convolutional kernels of this layer, and let b^[ker=1], b^[ker=2], ..., b^[ker=D] ∈ ℝ denote the D bias terms. Each d-th kernel W^[ker=d] ∈ ℝ^(C×K×K) has kernel size K × K, where C denotes the channel number. Accordingly, we apply the kernels to a feature F ∈ ℝ^(C×M×N) with C channels, and obtain the output feature F′ ∈ ℝ^(D×M×N), as follows:

F′ = Conv(F), s.t. F′^(d) = W^[ker=d] ⊗ F + b^[ker=d] · 1, d = 1, 2, ..., D, (1)

where F′^(d) ∈ ℝ^(M×N) denotes the feature map of the d-th output channel, ⊗ denotes the convolution operation, and 1 ∈ ℝ^(M×N) is an all-ones matrix.

• Discrete Fourier transform. Given the c-th channel of the feature F ∈ ℝ^(C×M×N), i.e., F^(c) ∈ ℝ^(M×N), we use the discrete Fourier transform (DFT) (Sundararajan, 2001) to compute the frequency spectrum of this channel, termed G^(c) ∈ ℂ^(M×N), as follows, where ℂ denotes the field of complex numbers.
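The equivalence between spatial convolution and a pointwise operation in the frequency domain can be checked numerically. Below is a minimal NumPy sketch (not the paper's code), assuming a single channel and circular (wrap-around) padding so that the convolution preserves the M × N size; under this assumption, frequency (u, v) of the output depends only on frequency (u, v) of the input, i.e., frequency components propagate independently.

```python
import numpy as np

# Sketch under the stated assumptions: one channel, circular padding, size-preserving.
M, N, K = 8, 8, 3
rng = np.random.default_rng(0)
F = rng.standard_normal((M, N))   # one input channel F^(c)
W = rng.standard_normal((K, K))   # one K x K kernel slice

# Embed the kernel in an M x N map so its DFT has the same size as the feature's.
W_pad = np.zeros((M, N))
W_pad[:K, :K] = W

# Circular convolution computed directly in the spatial domain.
out = np.zeros((M, N))
for m in range(M):
    for n in range(N):
        for p in range(M):
            for q in range(N):
                out[m, n] += W_pad[p, q] * F[(m - p) % M, (n - q) % N]

# The same layer applied in the frequency domain: a pointwise product of spectrums,
# so no frequency component mixes with any other.
G_out = np.fft.fft2(W_pad) * np.fft.fft2(F)
assert np.allclose(np.fft.ifft2(G_out).real, out)
```

The pointwise product is exactly why each frequency component is propagated independently of the others when the layer is size-preserving.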



¹ Here, the decoder represents a typical network whose feature map size is non-decreasing during the forward propagation.



Figure 1: Two representation bottlenecks of a cascaded convolutional decoder network. (a) The convolution operation and the zero-padding operation make the decoder usually learn low-frequency components first and then gradually learn higher frequencies. (b) For cascaded upconvolutional layers, the upsampling operation in the decoder repeats strong frequency components of the input to generate spectrums of upper layers. We visualize the magnitude map of the feature spectrum, averaged over all channels. For clarity, we move low frequencies to the center of the spectrum map and high frequencies to the corners. High-frequency components in the magnitude maps in (b) are also weakened by the convolution operation after upsampling.

G^(c)_uv = Σ_{m=0}^{M-1} Σ_{n=0}^{N-1} F^(c)_mn · e^(-i2π(um/M + vn/N)), u = 0, ..., M-1; v = 0, ..., N-1 (2)
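Eq. (2) can be evaluated directly and compared against a library FFT; the sketch below (illustrative only, using NumPy) computes G^(c) for one channel by the summation in Eq. (2) and verifies that it matches `np.fft.fft2`, whose convention agrees with Eq. (2).

```python
import numpy as np

# Direct evaluation of Eq. (2) for one channel F^(c) of size M x N.
M, N = 4, 6
rng = np.random.default_rng(1)
Fc = rng.standard_normal((M, N))

G = np.zeros((M, N), dtype=complex)
for u in range(M):
    for v in range(N):
        for m in range(M):
            for n in range(N):
                G[u, v] += Fc[m, n] * np.exp(-1j * 2 * np.pi * (u * m / M + v * n / N))

# The DFT of Eq. (2) matches NumPy's FFT convention.
assert np.allclose(G, np.fft.fft2(Fc))
# np.fft.fftshift(G) moves the zero-frequency term to the center of the
# spectrum map, as in the visualization of Figure 1.
```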

