REVIVING AUTOENCODER PRETRAINING

Abstract

The pressing need for pretraining algorithms has been diminished by numerous advances in regularization, architectures, and optimizers. Despite this trend, we revisit the classic idea of unsupervised autoencoder pretraining and propose a modified variant that relies on a full reverse pass trained in conjunction with a given training task. We establish links between SVD and pretraining and show how these links can be leveraged to gain insights about the learned structures. Most importantly, we demonstrate that our approach yields improved performance for a wide variety of relevant learning and transfer tasks, ranging from fully connected networks and ResNets to GANs. Our results demonstrate that unsupervised pretraining has not lost its practical relevance in today's deep learning environment.

1. INTRODUCTION

While approaches such as greedy layer-wise autoencoder pretraining (Bengio et al., 2007; Vincent et al., 2010; Erhan et al., 2010) arguably paved the way for many fundamental concepts of today's methodologies in deep learning, the pressing need for pretraining neural networks has diminished in recent years. This was primarily caused by numerous advances in regularization (Srivastava et al., 2014; Hanson & Pratt, 1989; Weigend et al., 1991), network architectures (Ronneberger et al., 2015; He et al., 2016; Vaswani et al., 2017), and improved optimization algorithms (Kingma & Ba, 2014; Loshchilov & Hutter, 2017; Reddi et al., 2019). Despite these advances, training deep neural networks that generalize well to a wide range of previously unseen tasks remains a fundamental challenge (Neyshabur et al., 2017; Kawaguchi et al., 2017; Frankle & Carbin, 2018). Inspired by techniques for orthogonalization (Ozay & Okatani, 2016; Jia et al., 2017; Bansal et al., 2018), we revisit the classic idea of unsupervised autoencoder pretraining in the context of reversible network architectures. Hence, we propose a modified variant that relies on a full reverse pass trained in conjunction with a given training task.
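The core idea of training a full reverse pass jointly with the original task can be illustrated with a minimal sketch. The snippet below is an assumption-laden toy version, not the paper's implementation: it uses a single dense layer whose transposed weights serve as the reverse pass, and adds an autoencoder-style reconstruction term to a stand-in supervised loss. All names (`W`, `lam`, `joint_loss`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def forward(W, x):
    # Regular forward pass of one dense layer.
    return relu(W @ x)

def reverse(W, h):
    # Reverse pass reusing the transposed layer weights
    # (no separately parameterized decoder).
    return W.T @ h

def joint_loss(W, x, y, lam=0.1):
    # Task loss plus a weighted reconstruction loss from the
    # reverse pass, trained in conjunction with the task.
    h = forward(W, x)
    task = np.mean((h - y) ** 2)                # stand-in supervised loss
    recon = np.mean((reverse(W, h) - x) ** 2)   # autoencoder-style term
    return task + lam * recon

W = rng.standard_normal((4, 8)) * 0.1
x = rng.standard_normal(8)
y = rng.standard_normal(4)
loss = joint_loss(W, x, y)
```

Because the reverse pass shares weights with the forward pass, minimizing the reconstruction term encourages the layer to preserve input information, which is one intuition behind linking this scheme to orthogonalization and the SVD of the weight matrices.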



Figure 1: Our pretraining (denoted as RR) yields improvements for numerous applications. a): For difficult shape classification tasks, it outperforms existing approaches (StdTS, OrtTS, PreTS): the RRTS model classifies the airplane shape with significantly higher confidence. b): Our approach establishes mutual information between input and output distributions. c): For CIFAR-10 classification with a ResNet-110, RRC10 yields substantial practical improvements over the state of the art. d): Learned weather forecasting has strictly limited real-world data: our pretraining yields improvements for pressure (Z500, zoomed-in regions shown above), atmospheric temperature (T850), as well as ground temperature (T2M).

