WALKING THE TIGHTROPE: AN INVESTIGATION OF THE CONVOLUTIONAL AUTOENCODER BOTTLENECK

Abstract

In this paper, we present an in-depth investigation of the convolutional autoencoder (CAE) bottleneck. Autoencoders (AE), and especially their convolutional variants, play a vital role in the current deep learning toolbox. Researchers and practitioners employ CAEs for various tasks, ranging from outlier detection and compression to transfer and representation learning. Despite their widespread adoption, we have limited insight into how the bottleneck shape impacts the CAE's emergent properties. We demonstrate that an increased bottleneck area (i.e., height × width) drastically improves generalization in terms of reconstruction error while also speeding up training. The number of channels in the bottleneck, on the other hand, is of secondary importance. Furthermore, we show empirically that CAEs do not learn an identity mapping, even when all layers have the same number of neurons as there are pixels in the input. Besides raising important questions for further research, our findings are directly applicable to two of the most common use cases for CAEs. In image compression, it is advantageous to increase the feature map size in the bottleneck, as this greatly improves reconstruction quality. For reconstruction-based outlier detection, we recommend decreasing the feature map size so that out-of-distribution samples yield a higher reconstruction error.

1. INTRODUCTION

Autoencoders (AE) are an integral part of the neural network toolkit. They are a class of neural networks that consist of an encoder and a decoder part and are trained by reconstructing data points after encoding them. Due to their conceptual simplicity, autoencoders often appear in teaching materials as introductory models to the field of unsupervised deep learning. Nevertheless, autoencoders have enabled major contributions in the application and research of the field. The main areas of application include outlier detection Xia et al. (2015); Chen et al. (2017); Zhou & Paffenroth (2017); Baur et al. (2019), data compression Yildirim et al. (2018); Cheng et al. (2018); Dumas et al. (2018), and image enhancement Mao et al. (2016); Lore et al. (2017). Additionally, autoencoders can be used as catalysts in the training of deep neural networks: the layers of the target network can be greedily pre-trained by treating them as autoencoders with one hidden layer Bengio et al. (2007). Subsequently, Erhan et al. (2009) demonstrated that autoencoder pre-training also benefits generalization. Currently, researchers in the field of representation learning frequently rely on autoencoders for learning nuanced and high-level representations of data Kingma & Welling (2013); Tretschk et al. (2019); Shu et al. (2018); Makhzani et al. (2015); Berthelot et al. (2018). However, despite its widespread use, we propose that the (deep) autoencoder model is not well understood. Many papers have aimed to deepen our understanding of the autoencoder through theoretical analysis Nguyen et al. (2018); Arora et al. (2013); Baldi (2012); Alain & Bengio (2012). While such analyses provide valuable theoretical insight, there is a significant discrepancy between the theoretical frameworks and the actual behavior of autoencoders in practice, mainly due to the assumptions made (e.g., weight tying, infinite depth) or the simplicity of the models under study. Others have approached this issue from a more experimental angle Arpit et al. (2015); Bengio et al. (2013a); Le (2013); Vincent et al. (2008); Berthelot et al. (2019); Radhakrishnan et al. (2018). Such investigations are part of an ongoing effort to understand the behavior of autoencoders in a variety of settings.

The focus of most such investigations so far has been the traditional autoencoder setting with fully connected layers. When working with image data, however, the default choice is to use convolutions, as they provide a prior that is well suited to this type of data Ulyanov et al. (2018). For this reason, Masci et al. (2011) introduced the convolutional autoencoder (CAE) by replacing the fully connected layers in the classical AE with convolutions. In an autoencoder, the layer with the fewest neurons is referred to as the bottleneck. In the regular AE, this bottleneck is simply a vector (rank-1 tensor). In CAEs, however, the bottleneck assumes the shape of a multichannel image (rank-3 tensor, height × width × channels). This bottleneck shape prompts the question: what is the relative importance of bottleneck depth (i.e., the number of channels) versus bottleneck area (i.e., feature map size) in determining the tightness of the CAE bottleneck? Intuitively, we might expect that only the total number of neurons should matter, since convolutions with one-hot filters can distribute values across channels. In this paper, we share new insights into the properties of convolutional autoencoders, which we gained through extensive experimentation. We address the following questions:

• How do bottleneck area and depth impact reconstruction quality, generalization ability, and knowledge transfer to downstream tasks?
• How and when do CAEs overfit?
• Are CAEs capable of learning a "copy function" if the CAE is complete (i.e., when the number of pixels in the input equals the number of neurons in the bottleneck)?

By copy function, we refer to a type of identity function in which the input pixel values are transported through the bottleneck and copied to the output. The hypothesis that AEs learn an identity mapping is common for fully connected AEs and can sometimes be encountered for CAEs (see Sections 4 and 5 in Masci et al. (2011)). We begin the following section by formally introducing convolutional autoencoders and explaining the convolutional autoencoder model we used in our experiments. Additionally, we introduce our three datasets and the motivation for choosing them. In Section 3, we outline the experiments and their respective aims. Afterward, we present and discuss our findings in Section 4. All of our code, results, trained models, and datasets are published on GitHub. We invite interested readers to take a look and experiment with our models.

2.1. AUTOENCODERS AND CONVOLUTIONAL AUTOENCODERS

The regular autoencoder, as introduced by Rumelhart et al. (1985), is a neural network that learns a mapping from data points in the input space x ∈ R^d to a code vector in latent space h ∈ R^m and back. Typically, unless some other constraint is introduced, m is set to be smaller than d to force the autoencoder to learn higher-level abstractions by having to compress the data. In this context, the encoder is the mapping f(x): R^d → R^m and the decoder is the mapping g(h): R^m → R^d. The layers in both the encoder and decoder are fully connected: l_{i+1} = σ(W_i l_i + b_i). Here, l_i is the activation vector in the i-th layer, W_i and b_i are the trainable weights and biases, and σ is an element-wise non-linear activation function. If necessary, we can tie the weights in the decoder to the ones in the encoder such that W_i = (W_{n-i})^T, where n is the total number of layers. The literature refers to autoencoders with this type of encoder-decoder relation as weight-tied. The convolutional autoencoder keeps the overall structure of the traditional autoencoder but replaces the fully connected layers with convolutions: L_{i+1} = σ(W_i ∗ L_i + b_i), where ∗ denotes the convolution operation and the bias b_i is broadcast to match the shape of W_i ∗ L_i such that the j-th entry of b_i is added to the j-th channel. Whereas before the hidden code was an m-dimensional vector, it is now a tensor with rank equal to the input tensor's rank. In the case
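As a concrete illustration of the two layer types, the updates l_{i+1} = σ(W_i l_i + b_i) and L_{i+1} = σ(W_i ∗ L_i + b_i) can be sketched in NumPy. This is a minimal sketch of the general definitions, not the paper's implementation; the function names, the ReLU choice, and the stride-1 "same" padding are our own illustrative assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dense_layer(l, W, b):
    """Fully connected update: l_{i+1} = sigma(W_i l_i + b_i)."""
    return relu(W @ l + b)

def conv_layer(L, W, b):
    """Convolutional update: L_{i+1} = sigma(W_i * L_i + b_i).

    L: (H, W, C_in) input feature map
    W: (k, k, C_in, C_out) filter bank (cross-correlation, as is
       conventional in deep learning frameworks)
    b: (C_out,) bias, broadcast over the spatial dimensions
    Stride 1 with zero ("same") padding, so height and width are kept.
    """
    k = W.shape[0]
    pad = k // 2
    Lp = np.pad(L, ((pad, pad), (pad, pad), (0, 0)))
    H, Wd, _ = L.shape
    out = np.empty((H, Wd, W.shape[3]))
    for y in range(H):
        for x in range(Wd):
            patch = Lp[y:y + k, x:x + k, :]        # (k, k, C_in) window
            out[y, x] = np.tensordot(patch, W, axes=3) + b
    return relu(out)
```

Note that with k = 1 the convolution reduces to the same fully connected map applied at every pixel: a 1×1 convolution mixes channels but never moves information spatially, which is one way to see that bottleneck depth and bottleneck area are not trivially interchangeable.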


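The introduction's intuition that one-hot convolution filters can distribute values across channels can be checked directly. The rearrangement below (commonly called space-to-depth; the function name is ours) computes exactly what a stride-2 convolution with one-hot filters would, trading bottleneck area for depth while preserving the total neuron count h × w × c:

```python
import numpy as np

def space_to_depth(L, block=2):
    """Move each non-overlapping block x block patch into the channel axis.

    Equivalent to a stride-`block` convolution with one-hot filters:
    (H, W, C) -> (H/block, W/block, block*block*C); the number of
    neurons is unchanged, only the area/depth split differs.
    """
    H, W, C = L.shape
    out = L.reshape(H // block, block, W // block, block, C)
    out = out.transpose(0, 2, 1, 3, 4)   # group patch offsets together
    return out.reshape(H // block, W // block, block * block * C)
```

For a 4 × 4 × 1 input this yields a 2 × 2 × 4 code: since such a lossless rearrangement exists, one might expect only the total neuron count to matter, which is what makes the empirical primacy of bottleneck area over depth noteworthy.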